PROGRAMMING PROJECT 2

Cpt S 471/571, Spring 2021

Due: March 29, 2021 (Monday), 11:59pm PST @ Blackboard Project #2 dropbox


General Guidelines:

Submission: The assignment should be zipped folder and both the folder and the zip file name should have your name on it. The folder naming convention is as follows: Program2-X.zip (if you are working alone and if your name is X) or Program2-X_Y.zip (if you are working as part of a team and the member names are X and Y - in any order). The zip file is the one that should be uploaded onto the WSU Blackboard dropbox (learn.wsu.edu) for  Project #2. Submissions not following this naming convention risk being not graded! So please pay close attention.
Submissions are due by 11:59pm PST on the due date. A 24-hour grace period is allowed although it will incur a late penalty of 10%. Submissions that are more than 24 hours late will be returned without grading.

Assignment: SUFFIX TREE CONSTRUCTION

The goal of this programming assignment is to implement the McCreight's suffix tree construction algorithm.

You are expected to follow the detailed algorithmic pseudocode provided in the lecture notes (along with the scribes):

http://www.eecs.wsu.edu/~ananth/CptS571/Lectures.index.htm  (please refer to the notes on McCreight's algorithm; follow the same conventions to help code readability).

Please also try to follow, to the extent possible, the same conventions and notation used in those notes (for variable and function naming).    Deviation from this convention could make debugging and code comprehension hader. Here are the notation I expect you to use the same way  (not limited to):

node u : suffix leaf i-1's parent)

node v : suffix link SL(u)

node u': parent of u (if exists)

node v': SL(u')  - may or may not be the parent of v

Case labels:

IA) SL(u) is known and u is not the root

IB) SL(u) is known and u is the root

IIA) SL(u) is unknown and u' is not the root

IIB) SL(u) is unknown and u' is the root.

Function naming:
FindPath( args ):   finds the path starting at the specified node argument that spells out the longest possible prefix of the specified string argument, and then insert the next suffix.

NodeHops (args):  does node hopping child to child until string Beta (or Beta') is exhausted, depending on the case.

In addition to that algorithm, however, there are some additional instructions for this project which are detailed below.

Input: An input string s (in the FASTA file format) and a file containing the input alphabet (format details below).

Goal: To build a suffix for the input string adhering to following API specifications.

                $ <test executable> <input file containing sequence s> <input alphabet file>

Input File Formats:

- The FASTA file format is the same format used in PA#1. For this assignment you can assume that there is only one string in the file.  But that string can be spread over multiple (contiguous) lines in the input FASTA file.

- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely, if you find a lowercase letter somewhere in the input sequence then  you could exit with an error.

Test inputs:

Input #1:    string s1     English alphabet                        (BWT index)

Input #2:    string s2     English alphabet                        (BWT index)

Input #3:    Opsin gene in human (3,302 bp)     DNA alphabet

Input #4:    Opsin gene in mouse (3,843 bp)     DNA alphabet

Input #5:    Human BRCA2 gene (11,382 bp)     DNA alphabet        (BWT index)

Input #6:    Tomato's chloroplast genome (155,461 bp)   DNA alphabet   

Input #7:    Yeast's Chromosome 12 (1,078,175 bp)     DNA alphabet

Covid-19 Inputs:  (source: NCBI Virus database)

Covid-19 Wuhan China:    FASTA  BWT 

Covid-19 USA:                   FASTA  BWT 

Covid-19 Australia:        FASTA  BWT 

Covid-19 India:            FASTA  BWT 

Covid-19 Brazil:            FASTA  BWT 


Testing:

Tips on Testing:

SEE DETAILED NOTES ON HOW TO TEST YOUR CODE - I SUGGEST HERE 4 WAYS TO INCREMENTALLY TEST YOUR CODE.

Report:

In a separate Word or PDF document, report the following:

  1.     System configuration:      CPU used, Clock rate, RAM, Cache size (if you know it).
  2.     Construction performance:        For all the inputs successfully tested, report the running times for ST construction (do not include other times such as I/O read time or print times). Time can be reported in seconds or milliseconds or microseconds. If you want to report any other performance statistics you are welcome to do so in your report.
  3.     Justification:    Do the performance observations made above, meet your expectations? Please explain.
  4.     Implementation constant: Report your implementation's space constant for your code - i.e., how many bytes does your code consume for every input byte?  This can obviously be a roughly estimate (need not be exact) intended to give the end user an idea of your code's peak memory usage behavior.
  5.     BWT index: . For this component, please output the BWT index for all the inputs, following the example shown here for the string banana$ . Name the BWT file should denote which input - e.g., yeast_BWT.txt for yeast. )
  6.     Exact matching repeat:    For each of the input sequence, what is the length and the <start> coordinates of the longest exact matching repeat? The repeat occurrences may or may not be overlapping as long as they don't share the exact same start and end coordinates. For example, for sequence "banana" it should be "ana" starting at positions 2 and 4. In your report, also comment on how you found this repeat - i.e., your algorithm to find this repeat.
  • CHECKLIST FOR SUBMISSION:

            ___  Cover sheet

            ___  Source code

            ___  A helpful readme file saying how to compile and run the code

            ___  Output file showing the statistics asked (don't display or dump the entire suffix tree please!) for all the test inputs (or at least up to the largest input you could run on your machine)

            ___  Report (Word or PDF)

            ___  All the above zipped into an archive folder that follows the naming convention mentioned above in the instructions. 

  •