PROGRAM PROJECT 3

PROGRAMMING PROJECT 3

Cpt S 471/571, Spring 2021

Due: April 26, 2020 (Monday), 11:59pm PST @ Blackboard Project #3 dropbox

General Guidelines:

For this programming project, you are allowed to work in teams of size (up to) 2 each. Teaming up with someone else is not mandatory but highly encouraged. Note that regardless of whether you decide to work in teams or not, the assignment will be graded strictly on its merit and the same grade will be given to all team members.
Reproduction/reuse of source codes from online resources or other's people's assignments in any shape or form, is strictly not allowed. All the source code should be solely yours. You are allowed to use any standard library functions supported by the underlying programming language, but if that directly implements any of the functions posed in the question of the assignment then you should *not* use those libraries and instead write your own code.
It is recommended that you use the same language that you used for the Projects 1 and 2. It is also recommended that you design a structured interface (API) and modularized code structure.
Also, you are allowed (in fact, expected) to reuse any code that you have written from Projects 1 and 2.
Submissions without the source code will NOT be graded and will be returned with 0 points.
Grading will be based on a combination of factors such as correctness, coding style, implementation efficiency, exception handling, source code documentation, and code reusability.

Submission: The assignment should be zipped folder and both the folder and the zip file name should have your name on it. The folder naming convention is as follows: Program3-X.zip (if you are working alone and if your name is X) or Program3-X_Y.zip (if you are working as part of a team and the member names are X and Y - in any order). The zip file is the one that should be uploaded onto the WSU Blackboard dropbox (learn.wsu.edu) for Project #3. Submissions not following this naming convention risk being not graded! So please pay close attention.
Submissions are due by 11:59pm PST on the due date. A 24-hour grace period is allowed although it will incur a late penalty of 10%. Submissions that are more than 24 hours late will be returned without grading.

Note: If you are submitting as a team (of 2 people) then only one of you should submit the entire submission (with all required documents including the source code and the cover sheet), while the other person simply submits just the Cover Sheet. Just make sure you add the information of your team participants in the COVER SHEET.

Assignment: Genomic Comparisons of Coronavirus Strains

The goal of this programming assignment is to perform a couple of different types of analyses on coronavirus genomic strains.

Quick background and some relevant external resources:
Coronavirus represents a family of viruses which is the main pathogen in SARS (2003-2004), MERS (2012-2014), and in the novel coronavirus COVID-19 of the present. Thousands of viral strains from this family have been sequenced till date.

- The NCBI Virus database maintains more than 3,000 genome accessions for this family of viruses, including sequenced genomes well for over 1,000 strains.
- The Nextstrain web portal tracks the real-time evolution of the novel coronavirus (COVID-19) along with resources to other viral pathogens. You can view a dendrogram in different modes on this website, that shows the evolutionary trajectory for the dozens of strains already sequenced.
- Testing for COVID-19 relies on our ability to extract and amplify the part of the DNA that is unique to a strain. Since the viral genetic material is RNA, a process called Reverse Transcription PCR is used to first convert the extracted RNA into a DNA, and then the DNA copies are target amplified in order to ascertain if the strain is COVID-19 (positive) or not (negative). There are many tutorials online that explain this process now, but for a simple (high-level) primer, you can refer to this YouTube video by PBS Nova.

Project objectives:

Input: A set of k genomes. We will refer to this set as S = {s_1, s_2, ...., s_k}.
For this programming project, you will implement two core functions to analyze genomes of coronavirus strains. Given a set of k genomes (isolates from different strains), :

Task 1) Identify "DNA fingerprints" for each of the k strains. Here, a "DNA fingerprint" for a given strain s_i is defined to be a shortest unique substring of s_i - i.e., a substring that is present only in strain s_i, and is the shortest possible such candidate. If there are multiple candidates for the DNA fingerprint of a strain, output any one of them and its length. On the other hand, if there exists no such fingerprint for a string s_i, then you need to state so (or) output the entire string s_i.

Task 2) Compute and output an all pairs "similarity matrix" D, which contains k rows and k columns. Cell (i,j) contains simply a similarity value (defined below) between strain s_i and strain s_j. Note that the similarity value as defined below, is a symmetric function and therefore the matrix D will also be symmetric along the main diagonal.

Definition: The similarity value between two strings (s_1, s_2) is defined over a given alignment of those two strings. Given the two strings and an alignment (local or global), the similarity value is simply equal to the number of matches in the alignment. For example, if the two strings are tacgt and gaccgg, and the alignment is a local alignment:
a c _ g
a c c g
then, the similarity value for this <input, alignment> combination is 3 (as there are 3 matches).
Note that, as defined above, there are numerous similarity values that you can compute for the same pair of sequences (depending on the alignment). The idea is to report the "best possible" such similarity value for the string pair.
Note: I am not deliberately using the word "optimal" here (as what I propose below in the algorithm is a heuristic). Read on.

Detailed Specification:

Relevant lecture notes/scribe: PDF
(also see lecture videos for March 31 and April 2)

Task 1) Detecting DNA fingerprints

For this task, your approach should follow roughly the following steps:

1) Build the GST for set S.

2) Using the GST, locate nodes that correspond to short shared matches, which upon a single character extension guarantees uniqueness in a given string s_i. This step can be implemented by doing a post-order traversal of the tree and using a coloring scheme to color nodes by their string origins.

The algorithms were described in the class lectures. Please refer to those notes along with the lecture recording for any queries.

Task 2) Computing similarity matrix

For this task, you will need to compare every pair of sequences in set S and report a similarity score.
To perform the comparison, you can do an alignment between each pair of sequences. However, you should assume that performing a brute-force alignment using dynamic programming will NOT be feasible because of the long string lengths (30Kbp vs. 30Kbp). Therefore, I would like you to implement an approach of identifying a longest common substring between the two strings and then using that as a "seed" to extend on either side to identify a local alignment anchored on the seed. More specifically, the steps are as follows:

For each pair of sequences (s_i, s_j):

1) Build the GST for {s_i, s_j}.

    2) Use the GST to find a longest common substring ("lcs") and its coordinates on s_i and s_j.
        Let us say, the lcs identified starts at index x_1 in s_i and index x_2 in s_j; and ends at index y_1 in s_i and index y_2 in s_j.
        Also, let the length of the lcs be denoted by b.

    3)
      a) Extract the prefixes s_i[1...x_1] and s_j[1...x_2];
    b) Reverse the prefixes - let us call the reversed versions of the strings s_i_rev and s_j_rev;
    c) Compute a variant of the global alignment between s_i_rev and s_j_rev (using affine gap penalty function), where the recurrence is same as the Needleman-Wunsch's algorithm in the forward phase. However, for finding the optimal answer, pick the cell that contains the maximum score from the overall matrix (i.e., not necessarily the last cell).
        Let a be the number of matches for the alignment ending at this max-cell.

       For all alignments, use the following alignment parameters: match score = +1, mismatch penalty = -2, opening gap (h) = -5, gap extension (g) = -1.

    4)
      a) Extract the suffixes s_i[y_1...m] and s_j[y_2...n]; let us call these suffixes s_i_fwd and s_j_fwd respectively.
        b) Compute a variant of the global alignment between s_i_fwd and s_j_fwd (using affine gap penalty function), where the recurrence is same as the Needleman-Wunsch's algorithm in the forward phase. However, for finding the optimal answer, pick the cell that                 contains the maximum score from the overall matrix (i.e., not necessarily the last cell).
        Let c be the number of matches for the alignment ending at this max-cell.

5) Then, the similarity value for the string pair (s_i, s_j) is equal to a+b+c.
Set D[i,j] = a+b+c;

Output the similarity matrix D.

Input File Formats:

- The FASTA file format is the same format used in PA#1. For this assignment you can assume that there is only one string in the file. But that string can be spread over multiple (contiguous) lines in the input FASTA file.

- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely, if you find a lowercase letter somewhere in the input sequence then you could exit with an error.

Test inputs:

Covid-19 Inputs: (source: NCBI Virus database)

String ID	Strain name	Input sequence file	Sequence length (in bp)
1	COVID-19 China	FASTA	29,903
2	COVID-19 USA	FASTA	29,882
3	COVID-19 Australia	FASTA	29,893
4	COVID-19 India	FASTA	29,854
5	COVID-19 Brazil	FASTA	29,876
6	SARS_2013	FASTA	29,644
7	SARS_2017	FASTA	29,727
8	MERS_2012_Saudi	FASTA	30,056
9	MERS_2014_Saudi	FASTA	30,123
10	MERS_2014_USA	FASTA	30,123

Report:

In a separate Word or PDF document, report the following:

System configuration: CPU used, Clock rate, RAM, Cache size (if you know it).
    Quality:
          a. Create a table that shows for each input string, the length of its corresponding fingerprint (task 1 output). As for the fingerprint themselves, please output them to a file.
          b. Show the similarity matrix D as a table of values (for all pairs of sequences).
          c. Discuss your results and observations. More specifically,
                         - what does the fingerprint lengths tell you about each strain and its uniqueness?
                         - what does the similarity matrix D tell you about the relationship between these strains? Do you see any logical groupings visible through this matrix?
    Performance:
        a. Create tables and/or charts to show the runtime results of the code. Please break down the runtime into the different components:
                         Task 1 Fingerprinting performance: suffix tree construction time; time to identify fingerprints; total time for the entire fingerprinting task;
                         Task 2 Similarity matrix performance: time spent building suffix trees; time spent performing the alignments; total time for the computation of the matrix D.
                                           In addition, for task 2 performance, report the length of the longest common substrings reported for every pair of strings. You can show that too in the form of a table.
Other comments: If you have any other observations or comments, you can add them here in this section.

CHECKLIST FOR SUBMISSION:

___ Cover sheet

___ Source code

___ A helpful readme file saying how to compile and run the code

___ Output file for fingerprint information.

___ Report (Word or PDF)

___ All the above zipped into an archive folder that follows the naming convention mentioned above in the instructions.