Cpt S 471/571, Spring 2017

Due: April 3, 2017 (Monday) 11:59pm @ OSBLE+ Assignment Dropbox

General Guidelines:


The goal of this programming assignment is to implement the McCreight's suffix tree construction algorithm.

You are expected to follow the detailed algorithmic pseudocode provided in the lecture notes (along with the scribes):  

Please also try to follow, to the extent possible, the same conventions and notation used in those notes. 

In addition to that algorithm, however, there are some additional instructions for this project which are detailed below.

Input: An input string s (in the FASTA file format) and a file containing the input alphabet (format details below).

Goal: To build a suffix for the input string adhering to following API specifications.


                $ <test executable> <input file containing sequence s> <input alphabet file>

Input File Formats:

- The FASTA file format is the same format used in PA#1. For this assignment you can assume that there is only one string in the file.  But that string can be spread over multiple (contiguous) lines in the input FASTA file.

- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely, if you find a lowercase letter somewhere in the input sequence then  you could exit with an error.

Test inputs:

Input #1:    string s1     English alphabet                        (BWT index)

Input #2:    string s2     English alphabet                        (BWT index)

Input #3:    Human BRCA2 gene (11,382 bp)     DNA alphabet        (BWT index)

Input #4:    Tomato's chloroplast genome (155,461 bp)   DNA alphabet   

Input #5:    Yeast's Chromosome 12 (1,078,175 bp)     DNA alphabet



In a separte Word or PDF document, report the following:

  1.     System configuration:      CPU used, Clock rate, RAM, Cache size (if you know it).
  2.     Construction time:        For all the inputs successfully tested, report the running times for ST construction (do not include other times such as I/O read time or print times). Time can be reported in seconds or milliseconds or microseconds.
  3.     Justification:    Do the performance observations made above, meet your expectations? Please explain.
  4.     Implementation constant: Report your implementation's space constant for your code - i.e., how many bytes does your code consume for every input byte?  This can obviously be a roughly estimate (need not be exact) intended to give the end user an idea of your code's peak memory usage behavior.
  5.     BWT index: . For this component, please output the BWT index for the Tomato Chloropast genome and Yeast Chromosome 12 inputs, following the example shown here for the string banana$ . Name the files: Tomato_BWT.txt and Yeast_BWT.txt and attach it with your submission.  (For your reference I have generated the Human_BRCA2_BWT.txt here which you can compare and check your answer with.)
  6.     Exact matching repeat:    For each of the input sequence, what is the length and the <start> coordinates of the longest exact matching repeat? The repeat occurrences may or may not be overlapping as long as they don't share the exact same start and end coordinates. For example, for sequence "banana" it should be "ana" starting at positions 2 and 4. In your report, also comment on how you found this repeat - i.e., your algorithm to find this repeat.

            ___  Cover sheet

            ___  Source code

            ___  A helpful readme file saying how to compile and run the code

            ___  Output files for all the four test inputs (or at least up to the largest input you could run on your machine)

            ___  Report (Word or PDF)

            ___  All the above zipped into an archive. The Zip file name should have your name in it.