PROGRAM #5

Fall 2011

Due: December 5, 2011, 11:59pm on Angel


General Guidelines:


PROBLEM    Hashing and Hash tables:

For this assignment you will be comparing the performance of different hash functions and different collision resolution implementations. More specifically, you will be testing hash table implementations to store and retrieve a set of strings. The input is a set of n strings, and for all practical purposes, you can assume that the length of each string can be bound by a small constant. You need to build and maintain a searchable hash table for this input. Obviously there are many ways you can design the hash table, depending on what hash function you pick, and what collision resolution strategy you choose.

Aim 1.    Comparing different collision resolution strategies keeping the hash function fixed

For this aim, you will be implementing and experimentally comparing three collision resolution mechanisms: chaining, linear probing, and quadratic probing. For doing this, you don't have to write the code from scratch. The source code for chaining and quadratic probing are already downloadable from the Weiss book's webpage: http://users.cis.fiu.edu/~weiss/dsaa_c++/Code/.  tables. For your convenience, I have created zipped archive of the main relevant files in this zip file. Starting with this code as a template, you need to do the following:

  1. You need to implement the code to do linear probing. This should be a very simple modification of the quadratic probing code (more specifically, to the function that finds the next position of probe).  Call the resulting source files for linear probing as LinearProbing.h and LinearProbing.cpp.
  2. You will notice that the SeparateChaining.h and QuadraticProbing.h files include custom written header files for vector, string and linked list instead of STL. The author has done this assuming your system doesn't support STL. And their API is largely similar to, but also slightly different from the STL's own API for vector, string, and list. So you have two options. You can either stick to this version of string, vector, and linked list, which means you need to also include those source files (downloaded and made available in this zip file for your convenience) as part of your project.
    Disclaimer: Note that I have not had the chance to compile or test this code. So if you use this version then it is your responsibility to get it working.

    Alternatively, if you want to use STL, then that's okay too but for that you will have to include the corresponding STL header files instead of these custom files, and also go through the entire SeparateChaining.cpp and QuadraticProbing.cpp source codes and make them compliant with STL's API. I leave the choice to you.

    Postscript: After posting the assignment I modified the Weiss source code to be STL compliant. Here is the ZIP file. If you use this version, you don't need any of the above zip files.
    Disclaimer: I spent only about 30 minutes working on this modification and so haven't had a chance to test it thoroughly. All I can guarantee is that the code compiles and runs the test driver code (TestHashTable.cpp) successfully on my Linux box. So what I suggest you do is eye-ball every line of this code (its not very long!) and make sure it looks correct before you modify it for your assignment. You might even want to think of adding your own lines of code to the test driver code for some additional testing. In the process if you notice any issues with the code, let me know so that I can fix it.

     

You will notice that the hash function for converting a string key into a hash table index used by all these three implementations is the same, and is defined "int hash( const string & key, int tableSize )" inside the three .cpp files. For this aim, we will keep this hash function as-is, and just experiment by varying the collision resolution implementations, viz. chaining, linear probing and quadratic probing.

You will also notice that the quadratic probing code initializes the hash tablesize to a default size of 101, and then does a rehash anytime the number of elements in the hash table reaches half of the hash tablesize. And rehashing doubles the tablesize. Leave these things as they are and keep the same setup for linear probing as well.

You will also notice that the chaining implementation also initializes the hash tablesize to a default size of 101. However it does NOT do rehashing. Again, leave this as is.

Experimental plan:

Input files: There are two input files provided for your testing and performance measurements.

For the sake of simplicity of your code, we will assume that the names of these two input files won't change and that these two input files are always made available in the current working directory of execution. Of course you need to ensure this when you run and test the code.

Write a test driver function, call it TestAim1.cpp, whose main() function performs the following sequence of operations in order:

  1. Open the input data file (OHenry.txt) and load the contents into a vector of strings. Let us call this vector object "DataArray".
  2. Open the input query file (queries.txt) and load the contents into another vector of strings. Let us call this vector object "QueryArray".
  3. Instantiate all of the above three hash table (HT) implementations. Let us denote the corresponding objects as "ChainingHT",  "LinearProbingHT" and "QuadraticProbingHT".
  4. (Analysis of Chaining)
    Call a function "InsertIntoChainingHT(DataArray)". This function should insert each word in DataArray, one by one, into ChainingHT. Of course, recall that no duplicates are allowed in the hash table.
    Initialize a separate timer called InsertionTimerChainingHT and record the time taken to do all the insertions into this variable. Basically, this timer should contain the sum of the time taken to do all n inserts. From this you can calculate the average time per insertion. Make sure you include only the time to do the insert, which means only the time the code spends inside the insert() function call. The way to do this would be set a timer t1 at entrance of the insert() function and set a timer t2 just before the exit of the insert() function (doesn't matter whether insertion was successful or not), and then add [ t2-t1 ] to the global timer variable InsertionTimerChainingHT.
    Also, initialize and keep track of a variable called CollisionsChainingHT to count the total number of collisions across all inserts. Note that this is same as counting the total number of times you are inside the while loop within function findPos() across all insertions.
    Next, we will do searches and time the searches. To do this, call a function "SearchChainingHT(QueryArray)" which does a find for every query in the list of queries. For analyzing this, just keep track a global timer variable called SearchTimerChainingHT which should contain the total time to do searches for all the m queries. From this you can calculate the average time per search.
  5. (Analysis of Linear Probing and Quadratic Probing)
    Apply the same procedure as described above to analyze the linear probing and quadratic probing implementations.

At the end of this experiment, you should generate the following table:

Collision strategy  Insert Search
Total time Avg. time Total number of collisions Total time Avg. time
Chaining          
Linear Probing          
Quadratic Probing          

Make sure you specify the unit for the run-times - i.e., hours or minutes or seconds or milliseconds or microseconds, as appropriate.

 

Aim 2.    Comparing different hash functions keeping the collision strategy fixed

    For this aim, you will compare three different string hashing functions (given in the book) keeping the collision strategy fixed at Quadratic Probing. The three hash functions to use are all given in the book, Figure 5.2, Figure 5.3 and Figure 5.4. Note that Figure 5.4 is same as the default hash function used in the source code for Aim 1. For the purpose of analysis, let us refer to these three hash functions as follows:
            1) Figure 5.2:  "Simple hash function"
            2) Figure 5.3:    "Prefix hash function"
            3) Figure 5.4:    "Full-length hash function"

      For comparing these three hash functions, first add the other two hash functions into the source code for Quadratic Probing (obviously using different names to refer to those functions). Then, repeat the same procedure that you used to analyze Quadratic Probing method in Aim 1, with the only exception that now you should keep track of separate timer variables and collision variables for the three different hash functions. As a result of this experiment, you should generate the following table:

 

Hash function  Insert Search
Total time Avg. time Total number of collisions Total time Avg. time
Simple          
Prefix          
Full-length          

Note that you don't have to repeat the experiment for the row corresponding to the Full-length. The results should already be available from the corresponding entry in Aim 1.

REPORT:

In a separate document (Word or PDF), not more than 2 pages, compile the following sections:

·   A: Problem statement. In 1-2 sentences state the goal(s) of this exercise.

·     C: Experimental Results:  In this section, include the following:

o       The two tables as described above.

o       Are the observations made in the above plots as per your theoretical expectations? Justify.


FINAL CHECKLIST FOR SUBMISSION:

    ___  Cover sheet

    ___  A separate folder containing all your source code (including the main function you used for testing)

    ___  Report

    ___  Both the above zipped into another folder archive called Program5<YourLastName(s)>.zip