PROGRAMMING PROJECT 2
Cpt S 471/571, Spring 2021
Due: March 29, 2021 (Monday), 11:59pm PST @
Blackboard Project #2 dropbox
General Guidelines:
-
For this programming project, you
are allowed to work in teams of size (up to) 2 each. Teaming up with someone else is
not mandatory but highly encouraged. Note that regardless of whether
you decide to work in teams or not, the assignment will be graded
strictly on its merit and the same grade will be given to all team
members.
- Reproduction/reuse of source codes from online resources or other's
people's assignments in any shape or form, is strictly
prohibited (and will be considered cheating). All the source
code should be solely yours. You are allowed to use any standard library
functions supported by the underlying programming language, but if that
directly implements any of the functions posed in the question of the
assignment then you should *not* use those libraries and instead
write your own code.
- It is recommended that you use the same language that you used for the
Project 1, because you will be using both the codes in Project
3. To this effect, it is also recommended that you design a
structured interface (API) and modularized code structure, suitable for
reuse later.
- Also, you are allowed to reuse any code that you have written from
Project 1 (this will be particularly useful for Input reading).
- Submissions without the source code will NOT be graded and will be
returned with 0 points.
- Grading will be based on a combination of factors such as correctness,
coding style, implementation efficiency, exception handling, source code
documentation, and code reusability.
Submission: The assignment should be zipped folder and both
the folder and the zip file name should have your name on it. The
folder naming convention is
as follows: Program2-X.zip (if you are working alone and if your name is X)
or Program2-X_Y.zip (if you are working as part of a team and the member
names are X and Y - in any order). The zip file is the one that should be
uploaded onto the WSU Blackboard dropbox (learn.wsu.edu)
for Project #2. Submissions not
following this naming convention risk being not graded!
So please pay close attention.
Submissions are due by 11:59pm PST on the due date. A 24-hour grace period
is allowed although it will incur a late penalty of 10%. Submissions that
are more than 24 hours late will be returned without grading.
- Note: If you are submitting as a team (of
2 people) then only one of you should submit the entire submission
(with all required documents including the source
code and the cover sheet), while the other person simply submits just
the Cover Sheet. Just make sure you add the information of your team
participants in the COVER
SHEET.
Assignment: SUFFIX TREE CONSTRUCTION
The goal of this programming assignment is to implement the McCreight's
suffix tree construction algorithm.
You are expected to follow the detailed algorithmic pseudocode provided
in the lecture notes (along with the scribes):
http://www.eecs.wsu.edu/~ananth/CptS571/Lectures.index.htm
(please refer to the notes on McCreight's algorithm; follow the same conventions
to help code readability).
Please also try to follow, to the extent possible, the same conventions
and notation used in those notes (for variable and function naming).
Deviation from this convention could make debugging and code
comprehension hader. Here are the notation I expect you to use the same
way (not limited to):
node u : suffix leaf i-1's
parent)
node v : suffix link SL(u)
node u': parent of u (if
exists)
node v': SL(u') - may or
may not be the parent of v
Case labels:
IA) SL(u) is known and u is not the root
IB) SL(u) is known and u is the root
IIA) SL(u) is unknown and u' is not the root
IIB) SL(u) is unknown and u' is the root.
Function naming:
FindPath( args ): finds
the path starting at the specified node argument that spells out the longest
possible prefix of the specified string argument, and then insert the next
suffix.
NodeHops (args): does node
hopping child to child until string Beta (or Beta') is exhausted, depending
on the case.
In addition to that algorithm, however, there are some additional
instructions for this project which are detailed below.
Input: An input string s (in the FASTA file format) and a file
containing the input alphabet (format details below).
Goal: To build a suffix for the input string adhering to following
API specifications.
- The API should have a header file in which the suffix tree data
structure (or class if you are using Java/C++) is defined. The
struct/class should support the following public methods:
i) ST construction method that takes an input string
and alphabet as function parameters;
ii) given a pointer to a specific node u in
the tree, display u's children from left to right;
iii) enumerate nodes using Depth-First
Search (DFS) traversal (i.e., visit all children from left
to right AFTER visiting the parent). Note: this will enumerate
nodes of the tree in a top-down fashion. As a result of this
enumeration you should display the STRING DEPTH information from each
node. No need to display any other information. As this display can get
too long (== number of nodes in the tree) you can simply break the
output into lines with say 10 entries in each.
iv) BWT index: Modify the DFS procedure to
print the BWT index for the input string s. This can be done by
enumerating ONLY the leaf node id's from left to right
(lexicographically smallest to largest). The BWT index is an array B of
size n, given by B[i] = s[leaf(i)-1], where leaf(i) is the suffix id of
the ith leaf in the lexicographical order.
If i=0, then B[0]=$ (i.e., cycling around from the end of the string).
- Each node in the ST should be given a unique ID (an integer
identifier) between 0 and N-1, where N is the total number of nodes
(internal nodes + leaves).
- Each internal node should be a structure with at least the following
information: { node ID, suffix-link pointer, parent pointer, parent edge
label (i.e., label of the incoming edge from the parent), children
pointers, string-depth}. You can have other fields as needed but I don't
expect you will need anything more.
Note: If the node is a leaf, then you can use the node ID to store the
suffix number of the leaf. For internal nodes, it does not matter what
ID you store as long as its some unique integer in the allowed interval.
- In your construction code, remember to concatenate your original input
string with the $ sign before building the tree.
- During construction, make sure to order your children, left to right,
in the lexicographic order of their edge-labels. As a convention, assume
that the $ symbol is lexicographically SMALLER than all the other
symbols in your alphabet.
- Write a test "user" code that includes this header file and calls the
main(...) function to test all the above functions. Basically it
should read in an input FASTA and alphabet files, build the ST, then
print some basic statistics about the tree:
(#internal nodes, #leaves, #total
number of nodes, size of the tree (in bytes), average string-depth of
an internal node, string-depth of the deepest internal node).
After this, for further testing, printout a couple of randomly
handpicked node's children list, and then call the function to enumerate
the nodes in DFS and post-order traversals and output the results
into two corresponding output files.
$ <test executable> <input file containing
sequence s> <input alphabet file>
Input File Formats:
- The FASTA file format is the same format used in PA#1. For this
assignment you can assume that there is only one string in the file.
But that string can be spread over multiple (contiguous) lines in the
input FASTA file.
- For the input alphabet, specify a file with all the symbols in a single
line delimited by space. For coding convenience, you can assume the
contents of this file to be case sensitive. For example, if you specify
{A,C,G,T}in the alphabet file, you can safely assume that the input file
has its sequence written in uppercase. Conversely, if you find a lowercase
letter somewhere in the input sequence then you could exit with an
error.
Test inputs:
Input #1: string s1
English alphabet
(BWT index)
Input #2: string s2
English alphabet
(BWT index)
Input #3: Opsin
gene in human (3,302 bp) DNA
alphabet
Input #4: Opsin
gene in mouse (3,843 bp) DNA
alphabet
Input #5: Human
BRCA2 gene (11,382 bp) DNA
alphabet (BWT
index)
Input #6: Tomato's chloroplast
genome (155,461 bp) DNA
alphabet
Input #7: Yeast's Chromosome 12
(1,078,175 bp) DNA
alphabet
Covid-19 Inputs: (source: NCBI Virus database)
Covid-19 Wuhan China: FASTA
BWT
Covid-19 USA:
FASTA
BWT
Covid-19 Australia: FASTA
BWT
Covid-19 India: FASTA BWT
Covid-19 Brazil:
FASTA BWT
Testing:
- Inputs are ordered by their size. For testing/demo purposes, try to
find out how much and how well your implementation scales across these
inputs (until memory is not sufficient). This means measuring the total
runtime for tree construction alone (i.e., excluding the times to load
the input or print output) for these input sizes. If you can also
measure the peak memory usage of your code while its running (in Unix
you can do this by running and monitoring the top command).
- Since inputs #5 and above are large strings, calling the DFS and
post-order functions is NOT needed as the outputs will be very large.
But make sure you test the DFS & post-order display parts for the
smaller inputs.
Tips on Testing:
- Testing incrementally and writing functions and testing each function
as you go (modular testing) are going to be important for this project.
- I suggest you first implement the Naive version of the algorithm
first. This naive version is the version that does not use suffix links
(i.e., simply calls FindPath from the root for each suffix).
- Once that naive version works, that means your FindPath code works.
- Next, test the full-blown suffix-link version of the code taking into
the account all four sub-cases.
SEE DETAILED NOTES ON HOW TO
TEST YOUR CODE - I SUGGEST HERE 4 WAYS TO INCREMENTALLY TEST YOUR CODE.
Report:
In a separate Word or PDF document, report the following:
- System configuration:
CPU used, Clock rate, RAM, Cache size (if you know it).
- Construction performance:
For all the inputs successfully tested, report the running times for
ST construction (do not include other times such as I/O read time
or print times). Time can be reported in seconds or milliseconds or
microseconds. If you want to report any other performance statistics you
are welcome to do so in your report.
- Justification: Do the
performance observations made above, meet your expectations? Please
explain.
- Implementation constant: Report your
implementation's space constant for your code - i.e., how many bytes
does your code consume for every input byte? This can
obviously be a roughly estimate (need not be exact) intended to give the
end user an idea of your code's peak memory usage behavior.
- BWT index:
. For this component, please output the BWT index for all the
inputs, following the example
shown here for the string banana$ . Name the BWT file should
denote which input - e.g., yeast_BWT.txt for yeast. )
- Exact matching repeat:
For each of the input sequence, what is the length and the <start>
coordinates of the longest exact matching repeat? The repeat
occurrences may or may not be overlapping as long as they don't share
the exact same start and end coordinates. For example, for sequence
"banana" it should be "ana" starting at positions 2 and 4. In your
report, also comment on how you found this repeat - i.e., your algorithm
to find this repeat.
CHECKLIST FOR SUBMISSION:
___ Cover sheet
___ Source code
___ A helpful readme
file saying how to compile and run the code
___ Output file
showing the statistics asked (don't display or dump the entire suffix
tree please!) for all the test inputs (or at least up to the largest
input you could run on your machine)
___ Report (Word or
PDF)
___ All the above
zipped into an archive folder that follows the naming convention
mentioned above in the instructions.