PROGRAMMING PROJECT 2

Cpt S 471/571, Spring 2018

Due: March 22, 2018 (Thursday), 11:59pm PST @ Blackboard Project #2 dropbox


General Guidelines:

Submission: The assignment should be zipped into an archive folder (named after your name - e.g., Program2-XYZ.zip), and uploaded into the WSU Blackboard (learn.wsu.edu) dropbox for  Proeject #2. Submissions are due by 11:59pm PST on the due date. A 24-hour grace period is allowed although it will incur a late penalty of 10%. Submissions that are more than 24 hours late will be returned without grading.



Assignment: SUFFIX TREE CONSTRUCTION

The goal of this programming assignment is to implement the McCreight's suffix tree construction algorithm.

You are expected to follow the detailed algorithmic pseudocode provided in the lecture notes (along with the scribes):

http://www.eecs.wsu.edu/~ananth/CptS571/Lectures.index.htm  

Please also try to follow, to the extent possible, the same conventions and notation used in those notes (for variable and function naming).    Deviation from this convention could make debugging and code comprehension hader. Here are the notation I expect you to use the same way  (not limited to):

node u : suffix leaf i-1's parent)

node v : suffix link SL(u)

node u': parent of u (if exists)

node v': SL(u')  - may or may not be the parent of v

Case labels:

IA) SL(u) is known and u is not the root

IB) SL(u) is known and u is the root

IIA) SL(u) is unknown and u' is not the root

IIB) SL(u) is unknown and u' is the root.

Function naming:
FindPath( args ):   finds the path starting at the specified node argument that spells out the longest possible prefix of the specified string argument, and then insert the next suffix.

NodeHops (args):  does node hopping child to child until string Beta (or Beta') is exhausted, depending on the case.


In addition to that algorithm, however, there are some additional instructions for this project which are detailed below.

Input: An input string s (in the FASTA file format) and a file containing the input alphabet (format details below).

Goal: To build a suffix for the input string adhering to following API specifications.

       

                $ <test executable> <input file containing sequence s> <input alphabet file>

Input File Formats:

- The FASTA file format is the same format used in PA#1. For this assignment you can assume that there is only one string in the file.  But that string can be spread over multiple (contiguous) lines in the input FASTA file.

- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely, if you find a lowercase letter somewhere in the input sequence then  you could exit with an error.

Test inputs:

Input #1:    string s1     English alphabet                        (BWT index)

Input #2:    string s2     English alphabet                        (BWT index)

Input #3:    Opsin gene in human (3,302 bp)     DNA alphabet

Input #4:    Opsin gene in mouse (3,843 bp)     DNA alphabet

Input #5:    Human BRCA2 gene (11,382 bp)     DNA alphabet        (BWT index)

Input #6:    Tomato's chloroplast genome (155,461 bp)   DNA alphabet   

Input #7:    Yeast's Chromosome 12 (1,078,175 bp)     DNA alphabet

Testing:

Report:

In a separte Word or PDF document, report the following:

  1.     System configuration:      CPU used, Clock rate, RAM, Cache size (if you know it).
  2.     Construction time:        For all the inputs successfully tested, report the running times for ST construction (do not include other times such as I/O read time or print times). Time can be reported in seconds or milliseconds or microseconds.
  3.     Justification:    Do the performance observations made above, meet your expectations? Please explain.
  4.     Implementation constant: Report your implementation's space constant for your code - i.e., how many bytes does your code consume for every input byte?  This can obviously be a roughly estimate (need not be exact) intended to give the end user an idea of your code's peak memory usage behavior.
  5.     BWT index: . For this component, please output the BWT index for the  Opsin mouse, Tomato Chloropast genome and Yeast Chromosome 12 inputs, following the example shown here for the string banana$ . Name the files: Tomato_BWT.txt and Yeast_BWT.txt and attach it with your submission.  (For your reference I have generated the Human_BRCA2_BWT.txt here which you can compare and check your answer with.)
  6.     Exact matching repeat:    For each of the input sequence, what is the length and the <start> coordinates of the longest exact matching repeat? The repeat occurrences may or may not be overlapping as long as they don't share the exact same start and end coordinates. For example, for sequence "banana" it should be "ana" starting at positions 2 and 4. In your report, also comment on how you found this repeat - i.e., your algorithm to find this repeat.
  • CHECKLIST FOR SUBMISSION:

            ___  Cover sheet

            ___  Source code

            ___  A helpful readme file saying how to compile and run the code

            ___  Output file showing ONLY the statistics asked (not the entire suffix tree) for all the test inputs (or at least up to the largest input you could run on your machine)

            ___  Report (Word or PDF)

            ___  All the above zipped into an archive. The Zip file name should have your name in it.