PROGRAMMING PROJECT 3
Cpt S 471/571, Spring 2021
Due: April 26, 2020 (Monday), 11:59pm PST @ Blackboard Project #3 dropbox
General Guidelines:
Assignment: Genomic Comparisons of Coronavirus Strains
The goal of this programming assignment is to perform a couple of different types of analyses on coronavirus genomic strains.
Quick background and some relevant external resources:
Coronavirus represents a family of viruses which is the main pathogen in
SARS (2003-2004), MERS (2012-2014), and in the novel coronavirus COVID-19
of the present. Thousands of viral strains from this family have been
sequenced till date.
- The NCBI
Virus database maintains more than 3,000 genome accessions for this
family of viruses, including sequenced genomes well for over 1,000
strains.
- The Nextstrain web portal tracks
the real-time evolution of the novel coronavirus (COVID-19) along with
resources to other viral pathogens. You can view a dendrogram in different
modes on this website, that shows the evolutionary trajectory for the
dozens of strains already sequenced.
- Testing for COVID-19 relies on our ability to extract and amplify the
part of the DNA that is unique to a strain. Since the viral genetic
material is RNA, a process called Reverse Transcription PCR is used to
first convert the extracted RNA into a DNA, and then the DNA copies are
target amplified in order to ascertain if the strain is COVID-19
(positive) or not (negative). There are many tutorials online that explain
this process now, but for a simple (high-level) primer, you can refer to
this YouTube video by PBS
Nova.
Project objectives:
Input: A set of k genomes. We will refer to this set as
S = {s_1, s_2, ...., s_k}.
For this programming project, you will implement two core functions to
analyze genomes of coronavirus strains. Given a set of k genomes
(isolates from different strains), :
Task 1) Identify "DNA fingerprints" for each of the k strains. Here, a "DNA fingerprint" for a given strain s_i is defined to be a shortest unique substring of s_i - i.e., a substring that is present only in strain s_i, and is the shortest possible such candidate. If there are multiple candidates for the DNA fingerprint of a strain, output any one of them and its length. On the other hand, if there exists no such fingerprint for a string s_i, then you need to state so (or) output the entire string s_i.
Task 2) Compute and output an all pairs
"similarity matrix" D, which contains k rows and k
columns. Cell (i,j) contains simply a similarity value (defined
below) between strain s_i and strain s_j. Note that
the similarity value as defined below, is a symmetric function and
therefore the matrix D will also be symmetric along the main diagonal.
Definition: The similarity value between two strings (s_1,
s_2) is defined over a given alignment of those two strings. Given
the two strings and an alignment (local or global), the similarity
value is simply equal to the number of matches in the alignment.
For example, if the two strings are tacgt and gaccgg,
and the alignment is a local alignment:
a c _ g
a c c g
then, the similarity value for this <input, alignment> combination
is 3 (as there are 3 matches).
Note that, as defined above, there are numerous similarity values that you
can compute for the same pair of sequences (depending on the alignment).
The idea is to report the "best possible" such similarity value for the
string pair.
Note: I am not deliberately using the word "optimal" here (as what I
propose below in the algorithm is a heuristic). Read on.
Detailed Specification:
Relevant lecture notes/scribe: PDF
(also see lecture videos for March 31 and April 2)
Task 1) Detecting DNA fingerprints
For this task, your approach should follow roughly the following steps:
1) Build the GST for set S.
2) Using the GST, locate nodes that correspond to short shared matches, which upon a single character extension guarantees uniqueness in a given string s_i. This step can be implemented by doing a post-order traversal of the tree and using a coloring scheme to color nodes by their string origins.
The algorithms were described in the class lectures. Please refer to
those notes along with the lecture recording for any queries.
Task 2) Computing similarity matrix
For this task, you will need to compare every pair of sequences in set S
and report a similarity score.
To perform the comparison, you can do an alignment between each pair of
sequences. However, you should assume that performing a brute-force
alignment using dynamic programming will NOT be feasible
because of the long string lengths (30Kbp vs. 30Kbp). Therefore, I would
like you to implement an approach of identifying a longest common
substring between the two strings and then using that as a "seed" to
extend on either side to identify a local alignment anchored on the seed.
More specifically, the steps are as follows:
For each pair of sequences (s_i, s_j):
1) Build the GST for {s_i, s_j}.
2) Use the GST to find a longest common substring
("lcs") and its coordinates on s_i and s_j.
Let us say, the lcs identified
starts at index x_1 in s_i and index x_2
in s_j; and ends at index y_1 in s_i and
index y_2 in s_j.
Also, let the length of the lcs
be denoted by b.
3)
a) Extract the prefixes s_i[1...x_1]
and s_j[1...x_2];
b) Reverse the prefixes - let us call the
reversed versions of the strings s_i_rev and s_j_rev;
c) Compute a variant of the global
alignment between s_i_rev and s_j_rev (using affine
gap penalty function), where the recurrence is same as the
Needleman-Wunsch's algorithm in the forward phase. However, for finding
the optimal answer, pick the cell that contains the maximum score from the
overall matrix (i.e., not necessarily the last cell).
Let a be
the number of matches for the alignment ending at this max-cell.
For all alignments, use the following
alignment parameters: match score = +1, mismatch penalty = -2, opening gap
(h) = -5, gap extension (g) = -1.
4)
a) Extract the suffixes s_i[y_1...m]
and s_j[y_2...n]; let us call these suffixes s_i_fwd
and s_j_fwd respectively.
b) Compute a variant of the global
alignment between s_i_fwd and s_j_fwd (using affine
gap penalty function), where the recurrence is same as the
Needleman-Wunsch's algorithm in the forward phase. However, for finding
the optimal answer, pick the cell that
contains the
maximum score from the overall matrix (i.e., not necessarily the last
cell).
Let c be
the number of matches for the alignment ending at this max-cell.
5) Then, the similarity value for the string
pair (s_i, s_j) is equal to a+b+c.
Set D[i,j] = a+b+c;
Output the similarity matrix D.
Input File Formats:- The FASTA file format is the same format used in PA#1. For this assignment you can assume that there is only one string in the file. But that string can be spread over multiple (contiguous) lines in the input FASTA file.
- For the input alphabet, specify a file with all the symbols in a single line delimited by space. For coding convenience, you can assume the contents of this file to be case sensitive. For example, if you specify {A,C,G,T}in the alphabet file, you can safely assume that the input file has its sequence written in uppercase. Conversely, if you find a lowercase letter somewhere in the input sequence then you could exit with an error.
Test inputs:
Covid-19 Inputs: (source: NCBI Virus database)
String ID | Strain name | Input sequence file | Sequence length (in bp) |
1 | COVID-19 China | FASTA | 29,903 |
2 | COVID-19 USA | FASTA | 29,882 |
3 | COVID-19 Australia | FASTA | 29,893 |
4 | COVID-19 India | FASTA | 29,854 |
5 | COVID-19 Brazil | FASTA | 29,876 |
6 | SARS_2013 | FASTA | 29,644 |
7 | SARS_2017 | FASTA | 29,727 |
8 | MERS_2012_Saudi | FASTA | 30,056 |
9 | MERS_2014_Saudi | FASTA | 30,123 |
10 | MERS_2014_USA | FASTA | 30,123 |
Report:
In a separate Word or PDF document, report the following:
CHECKLIST FOR SUBMISSION:
___ Cover sheet
___ Source code
___ A helpful readme file saying how to compile and run the code
___ Output file for fingerprint information.
___ Report (Word or PDF)
___ All the above zipped into an archive folder that follows the naming convention mentioned above in the instructions.