Basic Data Types
A data type defines a collection of data objects and a set of
predefined operations on those objects.
History
- 1950s - numerical computing; Fortran has integers, floats, and vectors (arrays)
- 1960s - rise of business computing; COBOL introduces records and BCD,
SNOBOL brings string processing (ICON - 1980s, Perl - 1990s),
LISP brings dynamic memory management, and PL/I adds precision to most types
- 1970s - Pascal adds enumerated data types and strong typing
- 1980s - rise of abstraction and OO; Ada adds abstract data types, C++ adds objects and
classes
- 1990s - increasing emphasis on reliability; Java has "safe" pointers
A descriptor is a collection of attributes about a variable/value,
which includes all the relevant information. In a language with static
type bindings, the descriptors are created at compile time, then disposed of.
In a language with dynamic type binding, the descriptors are managed by
the run-time memory system and may be dynamically created.
(Scheme example)
These are basic types, often supported at the hardware level.
- byte - 8 bits (most computers are byte-addressable)
- words - 32 bits (or 16 bits)
- long words - 64 bits
Some logical operations on these entities are the following.
- hardware operations on "uninterpreted" bits
- example, logical-or
10101101 a byte
00101011 another byte
----------
10101111 result of logical-or
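The logical-or above can be reproduced directly in C. This is a minimal sketch; the function names or_bytes and or_example are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise OR of two uninterpreted bytes: each result bit is 1 if the
   corresponding bit is 1 in either operand. */
uint8_t or_bytes(uint8_t a, uint8_t b) {
    return (uint8_t)(a | b);
}

/* The bytes from the notes: 10101101 is 0xAD and 00101011 is 0x2B. */
uint8_t or_example(void) {
    return or_bytes(0xAD, 0x2B);
}
```

The hardware attaches no meaning to the bits here; the same operation works whether the bytes hold characters, small integers, or flags.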
Some groups of bits, however, have "interpretations".
- binary integers
- unsigned - 32 bits long, can represent a number in the
range 0 to (2**32) - 1.
- two's complement signed - 32 bits long, but 1 bit serves as the sign bit.
If the sign bit
is 'on' then the number is negative; if it is 'off', the number is
zero or positive. Can represent a number in the
range -(2**31) to (2**31) - 1.
other encodings include
floating point,
packed decimal,
4 bit digits - BCD, and
character codes such as
- EBCDIC - 8 bit characters, IBM generated and dying a slow death
- ASCII - 7 bit characters, long the standard for English text
- Unicode - originally 16 bit characters, supports non-English characters
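To see that an "interpretation" is something applied to the same bits, the sketch below reads one 32-bit pattern as unsigned and as two's complement signed. The helper name as_signed is an assumption for illustration; memcpy is used so the bits are copied unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a two's complement signed integer.
   memcpy copies the bits as they are, so no value conversion happens. */
int32_t as_signed(uint32_t bits) {
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With all 32 bits on, the unsigned reading is (2**32) - 1 while the signed reading is -1. With only the sign bit on, the signed reading is -(2**31), the smallest representable value.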
Strings
No consensus on how strings should be handled.
-
Char array or primitive type?
-
Static or dynamic length?
- array of chars, terminated by a specific character ('\0'),
limited (in theory!) dynamic length
/* C example */
char string[20]; /* Room for 19 characters plus the terminating '\0' */
strcpy(string, "hello"); /* Fill in array with characters (needs <string.h>),
string is null terminated */
- array of chars, (unlimited) dynamic length string
/* PL/C */
var jim of character[varying]; /* Any length array */
jim = "hello"; /* String is any length */
- as a list of characters
- a primitive type, static length (a String can be of any size,
but its length is fixed once it is created).
// Java example
String s = "hello there";
String t = new String("hello there");
Operations on Strings
There are a wide range of possible operations on strings.
When supported as arrays, usually done with libraries, e.g., the
string library in C (strcpy, strcat, etc.).
# Perl example: the string 'hello there' is assigned to the scalar variable $s
$s = "hello there";
# Concatenate two strings
$s = "hello " . "there";
# Use string interpolation
$h = "hello ";
$s = "$h there";
# Pattern match with a regular expression
if ($s =~ /^h.*e$/) ... # does string start with an
# 'h' and end with an 'e'?
Implementing Strings
Strings can be stored as contiguous cells in memory.
Concatenation (e.g., strcat) must then copy at least part of a string.
Alternatively, strings can be implemented with lists (e.g., each element in
the list points to a substring, the string as a whole is a collection of
the substrings). Usually, dynamic memory management is needed to
use the list implementation strategy.
-
null terminated - Should check to see if string fits in space allocated.
Must use copying strategy.
-
static length - Descriptor has length field and pointer to string
-
limited dynamic length - Descriptor has max length, current length, and pointer to string
-
dynamic length - Descriptor has current length, and pointer to string
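As a rough illustration of the limited dynamic length case, the sketch below keeps the three descriptor fields in a struct and checks the fit before copying. The type LimitedString and the function ls_append are hypothetical names, not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a limited dynamic length string. */
typedef struct {
    size_t max_len;  /* capacity of the buffer, fixed at creation */
    size_t cur_len;  /* number of characters currently stored */
    char  *data;     /* pointer to the characters (no terminator needed) */
} LimitedString;

/* Append src to s; refuse, leaving s unchanged, if it would not fit. */
bool ls_append(LimitedString *s, const char *src) {
    size_t n = strlen(src);
    if (n > s->max_len - s->cur_len) {
        return false;            /* would overflow the allocated space */
    }
    memcpy(s->data + s->cur_len, src, n);
    s->cur_len += n;
    return true;
}
```

Because the descriptor carries the current length, no scan for a terminating character is needed, and the maximum length makes the overflow check cheap.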
Enumerated Data Types
An ordinal type over a range of user-defined values.
enum days {Mon, Tue, Wed, Thu, Fri, Sat, Sun};
days day = Mon;
day++; /* Day is now Tue */
day = 23; /* error! */
Sometimes the ordering doesn't make sense.
enum flavors {Chocolate, Vanilla, Strawberry};
flavors f = Chocolate;
f++; /* nonsensical, but f is now Vanilla*/
If the language lacks an enumerated data type, then
we can model it with numbers, but this can lead to type errors.
For instance in Scheme
(define Monday 1)
...
(define Sunday 7)
(+ Sunday 10)
Subrange Types
Can limit the "range" of another type.
This sometimes adds "range" checking to run-time cost.
Ada example.
type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
subtype Weekend is Days range Sat..Sun;
subtype SmallInt is Integer range 0..255;
x : Integer;
y : SmallInt;
x := 1000;
y := x; -- Cannot check at compile-time, must check at run-time
The implementation/modeling strategy is to add range checks, and
enforce those checks during assignment.
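In a language without subranges, the same strategy can be modeled by hand. The sketch below imitates the SmallInt subtype in C; the names SmallInt and small_int_assign are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of "subtype SmallInt is Integer range 0..255". */
typedef struct { int value; } SmallInt;

/* Checked assignment: store v only if it lies in 0..255. */
bool small_int_assign(SmallInt *dst, int v) {
    if (v < 0 || v > 255) {
        return false;   /* out of range: this is where Ada would raise an error */
    }
    dst->value = v;
    return true;
}
```

The comparison in small_int_assign is exactly the run-time cost mentioned above; it is paid on every assignment whose value is not known at compile time.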
Arrays
An array is like a function that maps subscripts to a value.
Consider the following array in C.
int foo[20][30];
foo[3][4] = 9;
The array foo implements the following mapping.
foo: 0..19 X 0..29 -> integers U {undefined}
so for instance, after the assignment, foo is
{((0, 0), undefined), ..., ((3, 4), 9), ...}
Issues.
-
Permissible types for subscript - C is very limiting, subscripts must be
integers and always start at zero. Other languages permit more flexible options.
In the following Pascal example, the first array uses a subscript from
-100 to 300, while the second array has a character subscript.
In general, any "reasonable" ordinal type (integer subrange, character, enumeration) can appear.
CelsiusToFarenheit : array [-100..300] of Real;
ASCIItoInteger : array ['a'..'z'] of Integer;
...
CelsiusToFarenheit[0] := 32;
...
ASCIItoInteger['a'] := 97;
ASCIItoInteger['b'] := 98;
...
Could we do the following?
CosineDegrees : array [0.0 .. 1.0] of Integer;
NamesToPhoneNumbers : array ['Aaron Aalborg'..'Zwie Zylophone']
of Integer;
...
-
Range checks? - To ensure reliability, subscripts must be checked at
run-time.
ASCIItoInteger['A'] := x * y; (* error, subscript 'A' is
outside 'a'..'z' *)
if ASCIItoInteger['A'] = x then ... (* error, subscript 'A' is
outside 'a'..'z' *)
Run-time range checking is potentially expensive.
C and C++ do not do range checking. Java and C# do.
PL/I and Pascal have compiler options to turn off range checking.
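Since C does no range checking, a check has to be written by hand. The sketch below wraps a table with subscripts -100..300 (as in the Celsius example) behind checked accessors; the names LOW, HIGH, checked_store, and checked_load are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define LOW  (-100)   /* lowest legal subscript  */
#define HIGH 300      /* highest legal subscript */

static double table[HIGH - LOW + 1];

/* Store v at subscript i; refuse if i is outside LOW..HIGH. */
bool checked_store(int i, double v) {
    if (i < LOW || i > HIGH) return false;   /* the run-time range check */
    table[i - LOW] = v;                      /* shift so LOW maps to slot 0 */
    return true;
}

/* Load subscript i into *out, with the same check. */
bool checked_load(int i, double *out) {
    if (i < LOW || i > HIGH) return false;
    *out = table[i - LOW];
    return true;
}
```

Every access now costs two comparisons and a subtraction, which is the expense that compiler options for disabling range checks are meant to avoid.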
-
When are ranges bound and when and where are arrays allocated?
-
static array - Subscript ranges are statically bound, and storage is
allocated in static storage before run-time. Efficient, since no dynamic
allocation or deallocation is needed.
-
fixed stack dynamic array - Subscript ranges are statically bound, but
storage allocation is done during run-time, on the stack.
Space efficient, since the array is allocated only when needed.
-
stack-dynamic array - Subscript ranges are dynamically bound (hence
array may be a different size in different lifetimes), but size is
static within a single lifetime.
-
fixed heap dynamic array - like stack-dynamic, but allocation is
in the heap, possibly done separately from elaboration.
-
heap dynamic array - Subscript binding is dynamic, range is dynamic,
arrays can grow and shrink in size during lifetime.
Let's look at an example, in C.
#include <stdlib.h>
static int x[5]; /* static array */
void foo (int v) {
int y[7]; /* fixed stack dynamic array */
int w[v]; /* stack dynamic array (a C99 variable-length array) */
int *f = malloc(30 * sizeof(int)); /* fixed heap dynamic array */
f[100] = 20; /* Yikes, overwrote memory in the heap,
unfortunately, thinking of f as a
heap dynamic array; undefined behavior */
y[8] = 20; /* Yikes, overwrote memory in the stack
(possibly w[1]), unfortunately, thinking of y
as a dynamic array; undefined behavior */
free(f);
}
In object-oriented languages, objects are allocated in the heap,
and OO languages usually include an array class to manage arrays.
For instance in Java.
char[] buffer = new char[size]; /* allocate a fixed heap
dynamic array */
C# has a class to manage heap dynamic arrays.
ArrayList intList = new ArrayList();
intList.Add(nextOne);
In "typeless" languages, there is sometimes real support for
heap dynamic arrays. For instance in Perl we can do the following.
@days = ('Mon', 'Tue');
print $days[0]; # prints 'Mon'
$days[5] = 'Thu';
print $days[5]; # prints 'Thu'
print $days[-1]; # -1 means "last element", prints 'Thu'
print $days[3]; # warning, use of uninitialized value
print $days[3] if defined $days[3];
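C has no built-in heap dynamic array, but one can be approximated with realloc. The following is a minimal sketch under that assumption; IntVec, vec_push, and vec_free are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array of ints. */
typedef struct {
    int   *items;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
} IntVec;

/* Append x, doubling the storage when it is full. */
bool vec_push(IntVec *v, int x) {
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->items, new_cap * sizeof *p);
        if (p == NULL) return false;   /* old storage is still valid */
        v->items = p;
        v->cap = new_cap;
    }
    v->items[v->len++] = x;
    return true;
}

/* Release the storage and reset the descriptor. */
void vec_free(IntVec *v) {
    free(v->items);
    v->items = NULL;
    v->len = v->cap = 0;
}
```

A vector starts out as { NULL, 0, 0 }. The subscript range grows during the array's lifetime, which is what separates this from a fixed heap dynamic array.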
-
Can the array be initialized? -
void foo () {
int lista[3]; /* Allocate a fixed stack dynamic array, not initialized */
int listb[] = {4, 5, 7}; /* Size 3 is taken from the initializer list */
int listc[] = {4, 5, , 7}; /* Error! C does not allow an empty initializer */
printf("%d", lista[0]); /* Prints an indeterminate value */
...
-
What are array operations?
-
array allocation and initialization are common, e.g.,
# Perl
@foo = (2, 3, 54);
/* C */
int foo[] = {2, 3, 54};
-
array concatenation - Ada has an operation to concatenate two
one dimensional arrays, or a one dimensional array and a scalar.
-
array arithmetic - In Fortran 90, could treat arrays as matrices.
Integer a(3)
Integer b(3)
Integer c(3,3)
Integer d(3)
...
b = a ! array assignment
b = b + a ! pairwise addition of a and b values
d = matmul(c, a) ! matrix-vector multiplication
APL had an extensive list of matrix manipulation operations, such as
transpose, inversion, and matrix multiplication. Unfortunately, few
keyboards at the time supported the operators in APL (special symbols, including Greek letters).
Slicing of arrays is another option.
Integer a(5, 7), b(2, 4), c(7)
c(1:7) = a(3, 1:7) ! Slice the third row of a and stick it
! into c
a(2:3, 3:5) = b(1:2, 2:4) ! Assign a 2 x 3 block from b into a
Associative Arrays
-
index/subscript can be anything - not limited to an enumerated type
-
can be thought of as a hash table
-
Perl has associative arrays
# The % sign means an associative array variable
%NameToPhone = ("Joe" => 2345678,
"Susan" => 2345442 );
# Array is dynamic
$NameToPhone{'Fred'} = 2579999;
# Does a name to phone exist?
if (defined $NameToPhone{"Jill"}) ...
# Iterate over subscripts
foreach my $key (keys %NameToPhone) {
print "$key's phone number is $NameToPhone{$key}.\n"
}
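To make the hash table view concrete, here is a small sketch of an associative array in C, using a fixed number of slots and linear probing. Everything in it (PhoneBook, pb_put, pb_get, the capacity of 16, the hash function) is an assumption for illustration; deletion and resizing are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 16   /* fixed capacity, chosen only for this sketch */

typedef struct {
    const char *key;   /* NULL marks an empty slot */
    long        value;
} Slot;

typedef struct { Slot slots[SLOTS]; } PhoneBook;

/* Simple multiplicative string hash; any reasonable hash would do. */
static size_t pb_hash(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % SLOTS;
}

/* Return the slot holding key, or the first empty slot on its probe
   path; NULL if the table is full and the key is absent. */
static Slot *pb_find(PhoneBook *t, const char *key) {
    size_t start = pb_hash(key);
    for (size_t i = 0; i < SLOTS; i++) {
        Slot *s = &t->slots[(start + i) % SLOTS];
        if (s->key == NULL || strcmp(s->key, key) == 0) return s;
    }
    return NULL;
}

/* Insert or update; the caller keeps the key string alive. */
bool pb_put(PhoneBook *t, const char *key, long value) {
    Slot *s = pb_find(t, key);
    if (s == NULL) return false;
    s->key = key;
    s->value = value;
    return true;
}

/* Look up key; false plays the role of Perl's "not defined". */
bool pb_get(PhoneBook *t, const char *key, long *out) {
    Slot *s = pb_find(t, key);
    if (s == NULL || s->key == NULL) return false;
    *out = s->value;
    return true;
}
```

The subscript here is a string rather than an ordinal value, so the mapping from key to storage location has to be computed by hashing instead of by simple offset arithmetic.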
Union Types
Think of the union type in C.
-
Same storage cell can hold different types
-
Run-time type binding, needs a run-time type check or compile-time
type inference for strong typing
-
Weak typing - no check on which member of the union is in use, C, C++
-
Strong typing - Ada
-
Do not have union types - Java, C#
-
Ada uses a discriminant, the discriminant is set to indicate
the type of an instance of a union type
-- Set up an enumerated data type, Shape
type Shape is (Square, Circle);
-- Set up a union type (variant record), discriminant is Form;
-- the default value lets Figure_1 be declared without a constraint
type Figure (Form : Shape := Square) is
record
Filled : Boolean;
case Form is
when Circle =>
Diameter : Float;
when Square =>
Side : Integer;
end case;
end record;
Figure_1 : Figure;
Figure_2 : Figure (Form => Circle);
-- Initialize Figure_1
Figure_1 := (Filled => True,
Form => Square,
Side => 3);
-- An error (Constraint_Error) is raised at run-time by the following,
-- since the discriminants don't match
Figure_2 := Figure_1;
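For contrast, a C union carries no discriminant of its own, so the check that Ada performs has to be coded by hand. The sketch below pairs a union with a tag field and guards each member behind an accessor; the names Figure, figure_side, and figure_diameter are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SQUARE, CIRCLE } Shape;

/* A C union with an explicit discriminant. C itself never checks the
   tag; the accessor functions below do it by hand. */
typedef struct {
    Shape form;              /* discriminant */
    bool  filled;
    union {
        float diameter;      /* meaningful when form == CIRCLE */
        int   side;          /* meaningful when form == SQUARE */
    } u;
} Figure;

/* Checked accessors: succeed only when the discriminant matches. */
bool figure_side(const Figure *f, int *out) {
    if (f->form != SQUARE) return false;
    *out = f->u.side;
    return true;
}

bool figure_diameter(const Figure *f, float *out) {
    if (f->form != CIRCLE) return false;
    *out = f->u.diameter;
    return true;
}
```

Nothing stops a programmer from bypassing the accessors and reading f->u.diameter from a square, which is why the free union is considered weakly typed.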
Pointers
-
PL/I, developed in the 1960s, introduced the first pointer type.
-
Pointers are needed for "dynamic" data types. For example, consider a
linked list data structure. In a static approach we could declare an
array of a fixed size to hold the entire linked-list, but there are two
problems with this approach. First, we don't know at compile-time
how large the linked-list will grow. Second, we may allocate more
space than needed, and so waste space. Having dynamic data structures
will allow the linked list to grow at run-time and also to occupy
only as much space as necessary.
-
Problems with pointers
-
Dangling pointers - Pointer to a deallocated cell
-
Lost heap-dynamic memory (memory leak) - Pointer is destroyed or reassigned
without deallocating what it points to, so the cell can no longer be reached or freed
-
Wild pointers - Pointer points somewhere in memory, but not to where it
is supposed to (e.g., unrestricted pointer arithmetic can lead to
a pointer pointing anywhere in memory, even into code, rather than data).
-
Issues
-
Aliasing reduces readability - pointers can create aliases for names
int x;
int *p = &x; /* *p is now an alias for x */
-
Pointer arithmetic - useful, but reduces reliability
int x;
int *p = &x; /* *p is now an alias for x */
p += 400; /* yikes, where does p point? */
-
Storage area - where does pointer point to
-
Allocation/deallocation - can be explicit (e.g., in C malloc/free)
or implicit. If it is implicit there is a run-time memory manager
usually that handles garbage collection. Explicit deallocation relies
on the programmer to do the memory management, but programmers can
make mistakes leading to the pointer problems cited above.
-
Dereferencing - a pointer is dereferenced to get to the value that it
points to. Dereferencing can be explicit or implicit (and often both
exist in languages that have pointers). Explicit dereferencing can
also lead to errors when a programmer forgets to dereference or
adds an additional dereferencing operator.
int x;
int *p = &x;
x = 3;
x = *p; /* explicit dereferencing of p */
-
Types - is the type pointed to checked? In C, void * pointers are
"generic" in the sense that they can be used to point
to a cell of any type.
-
Reference types - "Safe" pointers, e.g., as in Java
-
Java allocates new objects in the heap
-
Class handle is a reference to an object (a reference type pointer)
-
Limited casting of the reference type is allowed (e.g., can cast only to a
superclass or, with a run-time check, to a subclass)
-
No pointer arithmetic
-
Allocation/deallocation are implicit
For example
String str1;
str1 = new String("hi"); /* str1 is a reference type */
str1++; /* Not allowed, error! */
Integer i = (Integer)str1; /* Not allowed, error! */
-
Avoiding dangling pointers
-
Tombstones - split cells into tombstone and value. When the value is
deallocated, set the tombstone. On every reference, check to see if the
tombstone is set. A tombstone can never be reallocated.
-
Lock and key - split cells into lock and value. Split pointers into
key and pointer. When allocated, set the lock to be a particular key,
store the key with the pointer. When deallocated change the lock to
0. If cell is reallocated, must be reallocated with a new lock value.
-
Don't allow programmer to deallocate - C#, Java, Scheme
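The lock and key scheme can be sketched in C as follows. This is only an illustration under simplifying assumptions: cells come from a small fixed pool rather than a real heap, and the names Cell, CheckedPtr, cp_alloc, cp_free, and cp_read are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Heap cell: a lock stored next to the value. */
typedef struct {
    unsigned lock;   /* 0 means "deallocated" */
    int      value;
} Cell;

/* Checked pointer: a key stored next to the address. */
typedef struct {
    unsigned key;
    Cell    *cell;
} CheckedPtr;

#define POOL_SIZE 4
static Cell pool[POOL_SIZE];      /* all locks start at 0 (free) */
static unsigned next_lock = 1;    /* source of fresh, nonzero lock values */

/* Allocate the first free cell and give it a new lock value. */
CheckedPtr cp_alloc(int value) {
    CheckedPtr p = { 0, NULL };
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (pool[i].lock == 0) {
            pool[i].lock = next_lock++;
            pool[i].value = value;
            p.key = pool[i].lock;
            p.cell = &pool[i];
            break;
        }
    }
    return p;
}

/* Deallocate: reset the lock so no outstanding key matches any more. */
void cp_free(CheckedPtr p) {
    if (p.cell != NULL && p.key == p.cell->lock) p.cell->lock = 0;
}

/* Dereference only if the key still matches the lock. */
bool cp_read(CheckedPtr p, int *out) {
    if (p.cell == NULL || p.key == 0 || p.key != p.cell->lock) return false;
    *out = p.cell->value;
    return true;
}
```

A copy of a pointer made before cp_free keeps its old key, so it fails the check even after the cell has been handed out again with a new lock.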
-
Garbage collection - Garbage is the set of allocated memory cells that
the program can no longer reach. Garbage collection is the process of
reclaiming this memory, making it available for allocation.
-
Free-list - A list of unallocated memory locations. Memory is allocated
by searching this list and using either a "first-fit" strategy (i.e.,
use the first available block that is large enough) or a "best-fit" strategy
(i.e., find the block that most closely fits the requested size).
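The two search strategies differ only in when the search stops. A minimal sketch, assuming a singly linked free list with a size field per block (the FreeBlock layout and function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* One free block; a hypothetical layout for illustration only. */
typedef struct FreeBlock {
    size_t size;
    struct FreeBlock *next;
} FreeBlock;

/* First fit: stop at the first block that is large enough. */
FreeBlock *first_fit(FreeBlock *list, size_t want) {
    for (FreeBlock *b = list; b != NULL; b = b->next) {
        if (b->size >= want) return b;
    }
    return NULL;
}

/* Best fit: scan the whole list for the smallest block that is
   still large enough. */
FreeBlock *best_fit(FreeBlock *list, size_t want) {
    FreeBlock *best = NULL;
    for (FreeBlock *b = list; b != NULL; b = b->next) {
        if (b->size >= want && (best == NULL || b->size < best->size)) {
            best = b;
        }
    }
    return best;
}
```

First fit can return early, while best fit always walks the full list but tends to leave less unused space inside the chosen block.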
-
Reference counters - A garbage collection strategy that keeps track of
how many pointers point to a particular block and puts memory back on
the free-list when the count reaches zero.
-
each memory cell is split into a counter and a data value. The
counter records the number of pointers that point to the cell.
Initially the count is zero. When a pointer to the cell is
allocated the count is increased, when a pointer to the
cell is deallocated the count is decreased.
int *p = malloc(4);
/* The new block starts with a reference
count of 0; the assignment to p
increases the count to 1 */
void foo (int *x) {
}
/* When the function is called, the reference
count increases to 2, since now both x and p
point to the allocated memory */
foo(p); /* call foo */
/* When foo exits, x is deallocated and the
count decreases to 1 */
p = malloc(4);
/* Reference count of the first block is now 0,
so it can be reclaimed; the count is 1 for the
newly allocated block */
-
eager strategy - it happens right away
-
incremental - reference counts are adjusted every time a pointer
is allocated/deallocated
-
has additional space and time cost
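C does not count references itself, so the bookkeeping has to be explicit. The sketch below shows the idea with manual retain and release calls; RcCell, rc_new, rc_retain, rc_release, and the live_cells counter are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A cell split into a counter and a data value. */
typedef struct {
    int count;   /* number of pointers to this cell */
    int value;
} RcCell;

static int live_cells = 0;   /* bookkeeping for the demo only */

/* Allocate a cell; the count starts at zero. */
RcCell *rc_new(int value) {
    RcCell *c = malloc(sizeof *c);
    if (c != NULL) {
        c->count = 0;
        c->value = value;
        live_cells++;
    }
    return c;
}

/* A new pointer now refers to c: increase the count. */
RcCell *rc_retain(RcCell *c) {
    if (c != NULL) c->count++;
    return c;
}

/* A pointer to c went away: decrease the count and reclaim the
   cell as soon as no pointers remain. */
void rc_release(RcCell *c) {
    if (c != NULL && --c->count == 0) {
        free(c);
        live_cells--;
    }
}
```

The reclamation in rc_release happens at the moment the last pointer disappears, which is the eager, incremental behavior described above; the counter field and the extra increments and decrements are the space and time cost.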
-
Mark and sweep
-
This strategy first finds each pointer in
stack, static, and heap memory. For each pointer that it finds, it
marks the heap memory that it points to. Finally, it
sweeps through the heap and re-creates the free-list
from the unmarked memory.
-
lazy strategy - only called when needed
-
usually invoked when out of memory, but most programs don't
run out of memory
-
one-time, high cost - it is expensive to mark and sweep
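The two phases can be sketched over a toy heap. This is an illustration under simplifying assumptions: the heap is a small static array, each cell holds at most two pointers, and the roots are passed in explicitly instead of being discovered in stack and static memory. All names (Obj, heap, mark, sweep, collect) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 6

/* A heap cell that may point to up to two other cells. */
typedef struct Obj {
    bool in_use;
    bool marked;
    struct Obj *left, *right;
} Obj;

static Obj heap[HEAP_SIZE];

/* Mark phase: mark everything reachable from one pointer. */
void mark(Obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = true;
    mark(o->left);
    mark(o->right);
}

/* Sweep phase: release every unmarked cell and clear all marks.
   Returns the number of cells reclaimed. */
size_t sweep(void) {
    size_t reclaimed = 0;
    for (size_t i = 0; i < HEAP_SIZE; i++) {
        if (heap[i].in_use && !heap[i].marked) {
            heap[i].in_use = false;
            heap[i].left = heap[i].right = NULL;
            reclaimed++;
        }
        heap[i].marked = false;
    }
    return reclaimed;
}

/* One collection, given the root pointers. */
size_t collect(Obj **roots, size_t n_roots) {
    for (size_t i = 0; i < n_roots; i++) mark(roots[i]);
    return sweep();
}
```

The sweep touches every heap cell regardless of how much garbage exists, which is where the one-time, high cost comes from.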
Source of Information
These lecture notes are based on Chapter 6 in "Programming Languages, 6ed"
by Robert Sebesta and
Chapter 2 in "Programming Language Concepts and Paradigms" by David Watt.