Datatypes
CptS 355 - Programming Language Design
Washington State University

Basic Data Types

A data type defines a collection of data objects and a set of predefined operations on those objects.

History
  • 1950s - numerical computing, Fortran has integers, floats, vectors (arrays)
  • 1960s - rise of business computing, COBOL - introduces records and BCD, SNOBOL - string processing (later ICON - 1980s, Perl - 1990s), LISP - dynamic memory management, PL/I - adds precision to most types
  • 1970s - Pascal - enumerated data types, strong typing
  • 1980s - rise of abstraction and OO, Ada - adds abstract data types, C++ - objects and classes
  • 1990s - increasing emphasis on reliability, Java - "safe" pointers
A descriptor is a collection of attributes about a variable/value, which includes all the relevant information. In a language with static type bindings, the descriptors are created at compile time, then disposed of. In a language with dynamic type binding, the descriptors are managed by the run-time memory system and may be dynamically created. (Scheme example)

These are basic types, often supported at the hardware level.
  • byte - 8 bits (most computers are byte-addressable)
  • words - 32 bits (or 16 bits)
  • long words - 64 bits
Some logical operations on these entities are the following.
  • hardware operations on `uninterpreted' bits
  • example, logical-or
         10101101   a byte
         00101011   another byte
       ----------
         10101111   result of logical-or
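The logical-or above can be reproduced in C with the `|` operator. This is only a minimal sketch; the helper names `byte_or` and `print_bits` are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: logical-or of two uninterpreted bytes. */
uint8_t byte_or(uint8_t a, uint8_t b) {
    return (uint8_t)(a | b);
}

/* Print a byte as eight binary digits, most significant bit first. */
void print_bits(uint8_t v) {
    for (int i = 7; i >= 0; i--) {
        putchar(((v >> i) & 1) ? '1' : '0');
    }
    putchar('\n');
}
```

With the two bytes from the example, 10101101 is 0xAD and 00101011 is 0x2B, so `print_bits(byte_or(0xAD, 0x2B))` prints 10101111.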
Some groups of bits, however, have `interpretations'.
  • binary integers
    • unsigned - 32 bits long, can represent a number in the range 0 to (2**32) - 1.
    • two's complement signed - 32 bits long but 1 bit is taken up by the sign bit. If the sign bit is 'on' then the number is negative; if it is 'off', the number is zero or positive. Can represent a number in the range -(2**31) to (2**31) - 1.
Other encodings include floating point, packed decimal, 4 bit digits (BCD), and character codes such as
  • EBCDIC - 8 bit characters, IBM generated and dying a slow death
  • ASCII - 7 bit characters (usually stored one per 8 bit byte), the long-standing standard for English text
  • Unicode - originally 16 bit characters, supports non-English characters
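The point that the same group of bits can carry different interpretations can be tried out with the fixed-width types in C's `<stdint.h>`. This is a sketch only; the function names `as_signed` and `sign_bit_on` are illustrative, not part of any library.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the same 32 bits as a two's complement signed integer. */
int32_t as_signed(uint32_t bits) {
    int32_t value;
    memcpy(&value, &bits, sizeof value);  /* copy the bits unchanged */
    return value;
}

/* True if the sign bit (the most significant of the 32 bits) is 'on'. */
bool sign_bit_on(uint32_t bits) {
    return (bits >> 31) != 0;
}
```

The constants `UINT32_MAX`, `INT32_MIN`, and `INT32_MAX` hold (2**32) - 1, -(2**31), and (2**31) - 1. With all 32 bits on, the unsigned interpretation is `UINT32_MAX`, while `as_signed` gives -1.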

Strings

No consensus on how strings should be handled.
  • Char array or primitive type?
  • Static or dynamic length?
  • C - array of chars, terminated by a specific character ('\0'), limited (in theory!) dynamic length
    /* C example */
    #include <string.h>

    char string[20];          /* Room for 19 characters plus the '\0' */
    strcpy(string, "hello");  /* Fill in array with characters,
                                 string is null terminated */
    
  • PL/C - array of chars, (unlimited) dynamic length string
    /* PL/C */
    var jim of character[varying]; /* Any length array */
    jim = "hello";       /* String is any length */
    
  • - as a list of characters
  • Java - a primitive (built-in) type, static length: a String cannot change once created, but (generally) a String can be created at any size.
    // Java example
    String s = "hello there";
    String t = new String("hello there");
    

Operations on Strings

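There is a wide range of possible operations on strings. When strings are supported as arrays, operations are usually provided by libraries, e.g., the string library in C (strcpy, strcat, etc.). When strings are a primitive type, operations are built into the language, as in the following Perl examples.
# the string 'hello there' is assigned to the scalar variable s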
$s = "hello there";
# Concatenate two strings
$s = "hello " .  "there";
# Use string interpolation
$h = "hello ";
$s = "$h there";
# Pattern match with a regular expression
if ($s =~ /^h.*e$/) ...     # does string start with an 
                            # 'h' and end with an 'e'?

Implementing Strings

Strings can be stored as contiguous cells in memory. Concatenation (e.g., strcat) must then copy at least part of a string. Alternatively, strings can be implemented with lists (e.g., each element in the list points to a substring, the string as a whole is a collection of the substrings). Usually, dynamic memory management is needed to use the list implementation strategy.
  • null terminated - Should check to see if string fits in space allocated. Must use copying strategy.
  • static length - Descriptor has length field and pointer to string
  • limited dynamic length - Descriptor has max length, current length, and pointer to string
  • dynamic length - Descriptor has current length, and pointer to string
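The descriptor for a limited dynamic length string can be sketched in C as a struct with the three fields listed above. The type and function names (`LimitedString`, `ls_init`, `ls_append`, `ls_free`) are assumptions made for this example, not a standard API.

```c
#include <stdlib.h>
#include <string.h>

/* Descriptor for a limited dynamic length string (illustrative names). */
typedef struct {
    size_t max_length;      /* capacity, fixed when the string is created */
    size_t current_length;  /* number of characters currently stored */
    char  *chars;           /* pointer to the characters (no '\0' needed) */
} LimitedString;

/* Create an empty string that can hold up to max_length characters.
   Returns 0 on success, -1 if memory could not be allocated. */
int ls_init(LimitedString *s, size_t max_length) {
    s->chars = malloc(max_length > 0 ? max_length : 1);
    if (s->chars == NULL) return -1;
    s->max_length = max_length;
    s->current_length = 0;
    return 0;
}

/* Append a C string, first checking that it fits in the space allocated.
   Returns 0 on success, -1 (leaving s unchanged) if it does not fit. */
int ls_append(LimitedString *s, const char *text) {
    size_t n = strlen(text);
    if (n > s->max_length - s->current_length) return -1;
    memcpy(s->chars + s->current_length, text, n);  /* the copying strategy */
    s->current_length += n;
    return 0;
}

/* Release the storage and reset the descriptor. */
void ls_free(LimitedString *s) {
    free(s->chars);
    s->chars = NULL;
    s->max_length = 0;
    s->current_length = 0;
}
```

Because the length is kept in the descriptor, the fit check costs one comparison, whereas a null terminated string has to be scanned to find its length.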

Enumerated Data Types

An ordinal type over a range of user-defined values.
  enum days {Mon, Tue, Wed, Thu, Fri, Sat, Sun};
  days day = Mon;
  day++;  /* Day is now Tue */
  day = 23;  /* error! */
Sometimes the ordering doesn't make sense.
  enum flavors {Chocolate, Vanilla, Strawberry};
  flavors f = Chocolate;
  f++;  /* nonsensical, but f is now Vanilla*/
If the language lacks an enumerated data type, then we can model it with numbers, but this can lead to type errors. For instance in Scheme
  (define Monday 1)
   ...
  (define Sunday 7)
  (+ Sunday 10)     ; no error is reported, but the result, 17, is not a day
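C's own enum is close to the "model it with numbers" approach: the enumeration constants are integers, so a C compiler accepts arithmetic on them. One way to keep the ordinal behavior sensible is to route it through small helper functions. The names `is_day` and `next_day` below are hypothetical, and the wrap-around from Sun to Mon is a design choice of this sketch.

```c
#include <stdbool.h>

enum days { Mon, Tue, Wed, Thu, Fri, Sat, Sun };

/* True if n is the numeric value of one of the seven days. */
bool is_day(int n) {
    return n >= Mon && n <= Sun;
}

/* Successor in the ordinal ordering, wrapping from Sun back to Mon. */
enum days next_day(enum days d) {
    return (d == Sun) ? Mon : (enum days)(d + 1);
}
```

A value such as `Sun + 10` can then be rejected with `is_day` before it is ever stored in a variable of type `enum days`.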

Subrange Types

A subrange type limits the "range" of another type. This sometimes adds the cost of "range" checking at run-time. Ada example.
  type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
  subtype Weekend is Days range Sat..Sun;
  subtype SmallInt is Integer range 0..255;

  x : Integer;
  y : SmallInt;
  x := 1000;
  y := x;  -- Cannot check at compile-time, must check at run-time
The implementation/modeling strategy is to add range checks, and enforce those checks during assignment.
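In a language without subrange types, the same strategy can be modeled by hand. The following C sketch assumes a hypothetical `SmallInt` with range 0..255 and performs the range check in an assignment function, since C will not do it automatically.

```c
#include <stdbool.h>

#define SMALLINT_FIRST 0
#define SMALLINT_LAST  255

typedef int SmallInt;   /* plain int underneath; the check is ours to enforce */

/* Checked assignment: stores value in *target only if it is in range.
   Returns true on success, false (leaving *target unchanged) otherwise. */
bool assign_small_int(SmallInt *target, int value) {
    if (value < SMALLINT_FIRST || value > SMALLINT_LAST) {
        return false;   /* Ada would raise an exception at this point */
    }
    *target = value;
    return true;
}
```

The comparison on every assignment is the run-time cost mentioned above.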

Arrays

An array is like a function that maps subscripts to a value. Consider the following array in C.
  int foo[20][30];
  foo[3][4] = 9;
The array foo implements the following mapping.
   foo: 0..19 X 0..29 → integers ∪ {undefined}
so for instance, after the assignment, foo is
   {((0, 0), undefined), ..., ((3, 4), 9), ...}
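The mapping from a pair of subscripts to a value is usually implemented by computing an offset into contiguous storage. The sketch below assumes row-major order, which is what C uses, and models the 20 by 30 array with a flat array. The names `row_major_offset`, `foo_set`, and `foo_get` are illustrative only.

```c
#include <stddef.h>

#define ROWS 20
#define COLS 30

/* Flat storage standing in for int foo[20][30]. Being static, it starts
   out as all zeros rather than "undefined". */
static int storage[ROWS * COLS];

/* Row-major offset (in elements) of element (i, j). */
size_t row_major_offset(size_t i, size_t j) {
    return i * COLS + j;
}

/* The equivalent of foo[i][j] = value (no range check, as in C). */
void foo_set(size_t i, size_t j, int value) {
    storage[row_major_offset(i, j)] = value;
}

/* The equivalent of reading foo[i][j]. */
int foo_get(size_t i, size_t j) {
    return storage[row_major_offset(i, j)];
}
```

After `foo_set(3, 4, 9)`, the value 9 sits at offset 3 * 30 + 4 = 94, and `foo_get(3, 4)` returns 9.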
Issues.
  • Permissible types for subscript - C is very limiting, subscripts must be non-negative integers starting at 0. Other languages permit more reasonable options. In the following example (Ada syntax; Pascal is similar), the first array uses a subscript from -100 to 300, while the second array has a character subscript. In general, any "reasonable" ordinal type (integer subrange, character, enumeration) can appear.
      CelsiusToFarenheit : array (-100..300) of Float;
      ASCIItoInteger : array ('a'..'z') of Integer;
      ...
      CelsiusToFarenheit(0) := 32.0;
      ...
      ASCIItoInteger('a') := 97;
      ASCIItoInteger('b') := 98;
      ...
      ...
    
    Could we do the following?
      CosineDegrees : array (0.0 .. 1.0) of Integer;
      NamesToPhoneNumbers : array ('Aaron Aalborg'..'Zwie Zylophone') 
                            of Integer;
      ...
    
  • Range checks? - To ensure reliability, subscripts must be checked at run-time.
      ASCIItoInteger('A') := x * y;   /* possible error, 
                                         subscript out of bounds */
      if (ASCIItoInteger('A') = x) {...   /* possible error, 
                                             subscript out of bounds */
    
    Run-time range checking is potentially expensive. C and C++ do not do range checking. Java and C# do. PL/I and Pascal have compiler options to turn off range checking.
  • When are ranges bound and when and where are arrays allocated?
    • static array - Subscript ranges are statically bound and the array is allocated in static storage. Efficient, since no dynamic allocation is needed.
    • fixed stack dynamic array - Subscript ranges are statically bound, but storage is allocated at run-time, on the stack. Space efficient, since the array is allocated only when needed.
    • stack-dynamic array - Subscript ranges are dynamically bound (hence the array may be a different size in different lifetimes), but the size is fixed within a single lifetime. Allocation is on the stack.
    • fixed heap dynamic array - like stack-dynamic, but allocation is in the heap, possibly done separately from elaboration.
    • heap dynamic array - Subscript binding is dynamic, range is dynamic, arrays can grow and shrink in size during lifetime.
    Let's look at an example, in C.
    #include <stdlib.h>

    static int x[5];    /* static array */
    
    void foo (int v) {
      int y[7];         /* fixed stack dynamic array */
      int w[v];         /* stack dynamic array (a C99 variable-length array) */
    
      int *f = malloc(30 * sizeof(int));  /* fixed heap dynamic array */
    
      f[100] = 20;     /* Yikes, overwrote memory in the heap by 
                          mistakenly thinking of f as a 
                          heap dynamic array */
      y[8] = 20;      /* Yikes, overwrote memory in the stack (perhaps 
                         part of w) by mistakenly thinking of y 
                         as a dynamic array */
      free(f);
      }
    
    In object-oriented languages, objects are allocated in the heap, and OO languages usually include an Array class to manage arrays. For instance, in Java:
      char[] buffer = new char[size]; /* allocate a fixed heap 
                                         dynamic array */
    
    C# has a class to manage heap dynamic arrays.
      ArrayList intList = new ArrayList();
      intList.Add(nextOne);
    
    In "typeless" languages, there is sometimes real support for heap dynamic arrays. For instance in Perl we can do the following.
      @days = ('Mon', 'Tue');
      print $days[0];   # prints 'Mon'
      $days[5] = 'Thu';
      print $days[5];   # prints 'Thu'
      print $days[-1];  # -1 means "last element", prints 'Thu'
      print $days[3];   # warning, use of uninitialized value
      print $days[3] if defined $days[3]; 
    
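  • Can the array be initialized? -
     void foo () {
      int lista[3];  /* Allocate a fixed stack dynamic array, contents uninitialized */
      int listb[] = {4, 5, 7};   /* Allocate a fixed stack dynamic array; its size, 3, is taken from the initializer */
      int listc[] = {4, 5, , 7};   /* Not legal C, an element cannot be skipped in an initializer */
    
      printf("%d", lista[0]);   /* lista was never initialized, so the value printed is unpredictable */
      ...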
    
  • What are array operations?
    • array allocation and initialization are common, e.g.,
        # Perl
        @foo = (2, 3, 54);
        /* C */
        int foo[] = {2, 3, 54};
      
    • array concatenation - Ada has an operation to concatenate two one dimensional arrays, or a one dimensional array and a scalar.
    • array arithmetic - In Fortran, could treat arrays as matrices.
       Integer a(3)
       Integer b(3)
       Integer c(3,3)
       Integer d(3)
       ...
       b = a             ! array assignment
       b = b + a         ! pairwise addition of a and b values
       d = matmul(c, a)  ! matrix multiplication
      
      APL had an extensive list of matrix manipulation operations (transpose, inversion, matrix multiplication). Unfortunately, few keyboards at the time supported the special symbols (including Greek letters) that APL uses as operators. Slicing of arrays is another option.
       Integer a(5, 7), b(2, 4), c(7)
      
       c(1:7) = a(3, 1:7)          ! Slice the third row of a and stick it
                                   ! into c
       a(2:3, 3:5) = b(1:2, 2:4)   ! Assign a 2 x 3 block from b into a
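The heap dynamic array category and the run-time range check discussed in this section can be combined in a few lines of C. This is a sketch under assumptions: the `IntArray` type and the `ia_` functions are invented for the example, and doubling the capacity on growth is one common policy, not the only one.

```c
#include <stdbool.h>
#include <stdlib.h>

/* A heap dynamic array of ints (illustrative names). */
typedef struct {
    int   *items;
    size_t length;    /* number of elements in use */
    size_t capacity;  /* number of elements allocated */
} IntArray;

void ia_init(IntArray *a) {
    a->items = NULL;
    a->length = 0;
    a->capacity = 0;
}

/* Append value, growing the storage when it is full.
   Returns false if memory could not be allocated. */
bool ia_add(IntArray *a, int value) {
    if (a->length == a->capacity) {
        size_t new_capacity = a->capacity ? a->capacity * 2 : 4;
        int *grown = realloc(a->items, new_capacity * sizeof *grown);
        if (grown == NULL) return false;
        a->items = grown;
        a->capacity = new_capacity;
    }
    a->items[a->length++] = value;
    return true;
}

/* Range-checked read: returns false instead of touching memory
   outside the array when index is out of bounds. */
bool ia_get(const IntArray *a, size_t index, int *out) {
    if (index >= a->length) return false;
    *out = a->items[index];
    return true;
}

void ia_free(IntArray *a) {
    free(a->items);
    ia_init(a);
}
```

The check in `ia_get` is the kind of test that Java and C# insert automatically on every subscript.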
      
      

Associative Arrays

  • index/subscript can be anything - not limited to enumerated types
  • can be thought of as a hash table
  • Perl has associative arrays
       # The % sign means an associative array variable
       %NameToPhone = ("Joe" => 2345678, 
                       "Susan" => 2345442 );
       # Array is dynamic
       $NameToPhone{'Fred'} = 2579999;
      
       # Does a name to phone exist?
       if (defined $NameToPhone{"Jill"}) ...
    
       # Iterate over subscripts
       foreach my $key (keys %NameToPhone) {
         print "${key}'s phone number is $NameToPhone{$key}.\n";
         }
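In a language without built-in associative arrays, the same interface has to be written by hand. The C sketch below is deliberately simplified: it uses linear search rather than hashing, has a fixed capacity, and does not copy its keys. The `PhoneBook` type and `pb_` functions are hypothetical names for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 16

typedef struct {
    const char *keys[MAX_ENTRIES];   /* keys are not copied; they must outlive the table */
    long        values[MAX_ENTRIES];
    size_t      count;
} PhoneBook;

/* Index of key in the table, or -1 if it is absent. */
static int pb_find(const PhoneBook *pb, const char *key) {
    for (size_t i = 0; i < pb->count; i++) {
        if (strcmp(pb->keys[i], key) == 0) return (int)i;
    }
    return -1;
}

/* Insert or update an entry. Returns false if the table is full. */
bool pb_put(PhoneBook *pb, const char *key, long value) {
    int i = pb_find(pb, key);
    if (i >= 0) {
        pb->values[i] = value;
        return true;
    }
    if (pb->count == MAX_ENTRIES) return false;
    pb->keys[pb->count] = key;
    pb->values[pb->count] = value;
    pb->count++;
    return true;
}

/* Look up key. Returns false if no entry exists for it. */
bool pb_get(const PhoneBook *pb, const char *key, long *out) {
    int i = pb_find(pb, key);
    if (i < 0) return false;
    *out = pb->values[i];
    return true;
}
```

A real hash table replaces the loop in `pb_find` with a hash of the key, which is what makes lookups fast for large tables.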
      

Union Types

Think of the union type in C.
  • Same storage cell can hold different types
  • Run-time type binding, needs a run-time type check or compile-time type inference for strong typing
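    • Weak typing - no check that a value is read at the type it was stored as (free unions), C, C++
    • Strong typing - the type stored is recorded and checked (discriminated unions), Ada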
    • Do not have union types - Java, C#
  • Ada uses a discriminant, the discriminant is set to indicate the type of an instance of a union type
      -- Set up an enumerated data type, Shape
      type Shape is (Square, Circle);
      -- Set up a union type, discriminant is Form
      type Figure (Form : Shape := Square) is
        record
          Filled : Boolean;
          case Form is
           when Circle =>
             Diameter : Float;
           when Square =>
             Side : Integer;
          end case;
        end record;

        Figure_1 : Figure;
        Figure_2 : Figure(Form => Circle);
    
        -- Initialize Figure_1
        Figure_1 := (Filled => True,
                   Form => Square,
                   Side => 3);
    
        -- An error (Constraint_Error) is raised by the following
        -- at run time, since the discriminants don't match
        Figure_2 := Figure_1;
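C's union has no discriminant, but one can be added by hand by wrapping the union in a struct together with a tag. The sketch below mirrors the Shape/Figure example. The accessor functions are assumptions of this example, and nothing in C forces a programmer to go through them.

```c
#include <stdbool.h>

typedef enum { Square, Circle } Shape;

typedef struct {
    Shape form;           /* the discriminant */
    bool  filled;
    union {
        float diameter;   /* meaningful only when form == Circle */
        int   side;       /* meaningful only when form == Square */
    } as;
} Figure;

Figure make_square(bool filled, int side) {
    Figure f;
    f.form = Square;
    f.filled = filled;
    f.as.side = side;
    return f;
}

Figure make_circle(bool filled, float diameter) {
    Figure f;
    f.form = Circle;
    f.filled = filled;
    f.as.diameter = diameter;
    return f;
}

/* Checked access: succeeds only if the discriminant says Square. */
bool get_side(const Figure *f, int *out) {
    if (f->form != Square) return false;
    *out = f->as.side;
    return true;
}

/* Checked access: succeeds only if the discriminant says Circle. */
bool get_diameter(const Figure *f, float *out) {
    if (f->form != Circle) return false;
    *out = f->as.diameter;
    return true;
}
```

The difference from Ada is that the check here is a convention: code can still read `f.as.diameter` from a square directly, and the compiler will not object.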
      

Pointers

  • PL/I, developed in the 1960s, introduced the first pointer type.
  • Pointers are needed for "dynamic" data types. For example, consider a linked list data structure. In a static approach we could declare an array of a fixed size to hold the entire linked list, but there are two problems with this approach. First, we don't know at compile-time how large the linked list will grow. Second, we may allocate more space than needed, and so waste space. Having dynamic data structures allows the linked list to grow at run-time and also to occupy only as much space as necessary.
  • Problems with pointers
    1. Dangling pointers - Pointer to a deallocated cell
    2. Lost heap-dynamic memory (a memory leak) - Pointer is deallocated or reassigned without deallocating what it points to
    3. Wild pointer - Pointer points somewhere in memory, but not to where it is supposed to (e.g., unrestricted pointer arithmetic can lead to a pointer pointing anywhere in memory, even into code, rather than data).
  • Issues
    • Aliasing reduces readability - pointers can create aliases for names
          int x;
          int *p = &x;  /* *p is now an alias for x */
          
    • Pointer arithmetic - useful, but reduces reliability
          int x;
          int *p = &x;  /* *p is now an alias for x */
          p += 400;     /* yikes, where does p point? */
          
    • Storage area - where does pointer point to
      • Heap
              int *p = malloc(sizeof(int));  /* p points into heap */
              
      • Stack/static
              void foo () {
                int x;
                int *p = &x;  /* p points into stack */
                }
              
    • Allocation/deallocation - can be explicit (e.g., in C malloc/free) or implicit. If it is implicit there is usually a run-time memory manager that handles garbage collection. Explicit deallocation relies on the programmer to do the memory management, but programmers can make mistakes leading to the pointer problems cited above.
    • Dereferencing - a pointer is dereferenced to get to the value that it points to. Dereferencing can be explicit or implicit (and often both exist in languages that have pointers). Explicit dereferencing can also lead to errors when a programmer forgets to dereference or adds an additional dereferencing operator, since the pointer is then used where the value was meant, or the value is used as an address.
          int x;
          int *p = &x;
          x = 3;
          x = *p;  /* explicit dereferencing of p */
          
    • Types - is the type pointed to checked? In C void * pointers are "generic" in the sense that they can be used to point to a cell of any type.
    • Reference types - "Safe" pointers, e.g., as in Java
      • Java allocates new objects in the heap
      • Class handle is a reference to an object (a reference type pointer)
      • Limited casting of the reference type is allowed (e.g., can cast only to superclass)
      • No pointer arithmetic
      • Allocation/deallocation are implicit
      For example
          String str1;
          str1 = new String("hi");  /* str1 is a reference type */
          str1++;  /* Not allowed, error! */
          Integer i = (Integer)str1;  /* Not allowed, error! */
          
    • Avoiding dangling pointers
      • Tombstones - split cells into tombstone and value. When deallocated set the tombstone. On reference check to see if tombstone is set. Cannot reallocate.
      • Lock and key - split cells into lock and value. Split pointers into key and pointer. When allocated, set the lock to be a particular key, store the key with the pointer. When deallocated change the lock to 0. If cell is reallocated, must be reallocated with a new lock value.
      • Don't allow programmer to deallocate - C#, Java, Scheme
    • Garbage collection - Garbage is the set of allocated memory cells that the program can no longer reach. Garbage collection is the process of reclaiming this memory, making it available for allocation.
      • Free-list - A list of unallocated memory locations. Memory is allocated by searching this list and using either a "first-fit" strategy (i.e., use the first available block that is large enough) or a "best-fit" strategy (i.e., find the block that most closely fits the requested size).
      • Reference counters - A garbage collection strategy that keeps track of how many pointers point to a particular block and puts memory back on the free-list when the count reaches zero.
        • each memory cell is split into a counter and a data value. The counter records the number of pointers that point to the cell. Initially the count is zero. When a pointer to the cell is allocated the count is increased, when a pointer to the cell is deallocated the count is decreased.
          
                  /* Reference count for allocated 
                     block is 0 */
                  int *p = malloc(sizeof(int)); 
                  /* The assignment increases the reference 
                     count to 1 */
                  
                  void foo (int *x) {
                    }
              
                  /* When function is called, reference count 
                     increases to 2 since now both x and p 
                     point to the allocated memory */
                  foo(p);  /* call foo */
                  /* When foo exits, x is deallocated and the 
                     count decreases to 1 */
                  p = malloc(sizeof(int)); 
                  /* Reference count of the first block is now 0, 
                     but 1 for the newly allocated block */
                  
        • eager strategy - it happens right away
        • incremental - reference counts are adjusted every time a pointer is allocated/deallocated
        • has additional space and time cost
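C does not count references automatically, so the bookkeeping described for reference counters has to be spelled out by hand. In this sketch the names `Cell`, `cell_new`, `cell_retain`, and `cell_release` are hypothetical, and `cell_new` folds "count starts at zero, then the first assignment raises it to 1" into a single step. The `cells_freed` counter exists only so the effect can be observed.

```c
#include <stdlib.h>

/* Each cell is split into a counter and a data value. */
typedef struct {
    int count;   /* number of pointers that point to this cell */
    int value;   /* the data */
} Cell;

int cells_freed = 0;   /* demonstration bookkeeping only */

/* Allocate a cell; the returned pointer is its first reference. */
Cell *cell_new(int value) {
    Cell *c = malloc(sizeof *c);
    if (c != NULL) {
        c->count = 1;
        c->value = value;
    }
    return c;
}

/* Call when another pointer to the cell is created. */
Cell *cell_retain(Cell *c) {
    c->count++;
    return c;
}

/* Call when a pointer to the cell goes away. The cell is reclaimed
   right away (the eager strategy) when the count reaches zero. */
void cell_release(Cell *c) {
    c->count--;
    if (c->count == 0) {
        free(c);
        cells_freed++;
    }
}
```

The extra field per cell and the increment or decrement on every pointer operation are the space and time costs listed above.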
      • Mark and sweep
        • This strategy first finds each pointer in stack, static, and heap memory. For each pointer that it finds, it marks the heap memory that it points to. Finally, it sweeps through the heap and re-creates the free-list from the unmarked memory.
        • lazy strategy - only called when needed
        • usually invoked when out of memory, but most programs don't run out of memory
        • one-time, high cost - it is expensive to mark and sweep
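The mark and sweep steps above can be sketched on a toy heap. Everything here is a simplifying assumption made for illustration: the heap is a fixed pool of eight cells, each cell holds at most one pointer, and the roots (standing in for the pointers found in stack and static memory) are passed in explicitly.

```c
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 8

typedef struct Cell {
    bool in_use;
    bool marked;
    struct Cell *next;   /* the one pointer a cell may hold; also links the free-list */
} Cell;

static Cell heap[HEAP_SIZE];
static Cell *free_list = NULL;

/* Put every cell on the free-list. */
void heap_init(void) {
    free_list = NULL;
    for (size_t i = 0; i < HEAP_SIZE; i++) {
        heap[i].in_use = false;
        heap[i].marked = false;
        heap[i].next = free_list;
        free_list = &heap[i];
    }
}

/* Take a cell from the free-list, or return NULL if none is left. */
Cell *cell_alloc(void) {
    Cell *c = free_list;
    if (c == NULL) return NULL;
    free_list = c->next;
    c->in_use = true;
    c->marked = false;
    c->next = NULL;
    return c;
}

/* Mark phase for one root: follow pointers, stopping at a cell that
   is already marked (so cycles terminate). */
static void mark(Cell *c) {
    while (c != NULL && !c->marked) {
        c->marked = true;
        c = c->next;
    }
}

/* Mark from every root, then sweep the whole heap and re-create the
   free-list from the unmarked cells. Returns the number reclaimed. */
size_t collect(Cell **roots, size_t root_count) {
    size_t reclaimed = 0;
    for (size_t i = 0; i < root_count; i++) {
        mark(roots[i]);
    }
    free_list = NULL;
    for (size_t i = 0; i < HEAP_SIZE; i++) {
        if (heap[i].marked) {
            heap[i].marked = false;       /* reset for the next collection */
        } else {
            if (heap[i].in_use) reclaimed++;
            heap[i].in_use = false;
            heap[i].next = free_list;
            free_list = &heap[i];
        }
    }
    return reclaimed;
}
```

The sweep visits every cell in the heap no matter how little garbage there is, which is where the one-time, high cost comes from.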

Source of Information

These lecture notes are based on Chapter 6 in "Programming Languages, 6ed" by Robert Sebesta and Chapter 2 in "Programming Language Concepts and Paradigms" by David Watt.
                                                                                                                                                                                                                                                                                                                                             
  (c) 2003 Curtis Dyreson, (c) 2004 Carl H. Hauser           E-mail questions or comments to Prof. Carl Hauser