PROGRAMMING ASSIGNMENT #1

Cpt S 471/571, Spring 2017

Due:  February 10, 2017, by 11:59pm @ OSBLE+ Assignments dropbox


General Guidelines:


Assignment:

Implement two algorithms:
i) Needleman-Wunsch algorithm for computing OPTIMAL GLOBAL ALIGNMENT, and
ii) Smith-Waterman algorithm for computing OPTIMAL LOCAL ALIGNMENT,

both using affine gap penalty function between two input DNA sequences, s1 and s2, of lengths m and n respectively. 

Each cell of your Dynamic Programming table ("DP table") should have the following structure:

    struct DP_cell {
        int score;
        ... // add any other field(s) that you may need for the implementation
    }

At the start of the program, you should read the alignment score parameters from a user-specified input file (optional). The default name of the file, if the user does not specify one, should be "parameters.config" in the present working directory. The parameters.config file should allow the user to specify one scoring parameter in each line (space or tab delimited). For example:

    match    1
    mismatch    -2
    h  -5 
    g -1

The command prompt usage for your program should look as follows:

    $ <executable name> <input file containing both s1 and s2> <0: global, 1: local> <optional: path to parameters config file>

Input File Formats:

The two input sequences should be given as input in one text file. The text file should be in what is called the "multi-sequence FASTA format", which is as follows:

For example:

An input file called "input.fasta" could look like:

    >s1 sequence
    acatgctacacgtactccgataccccgtaaccgataacgatacacagacct
    cgtacgcttgctacaacgtactctataaccgagaacgattgaca
    tgcctcgtacacatgctacacgtactccgatgaccccgt

    >s2 sequence
    acattctacgaacctctcgataaccccataaccgataacgattgacacctcgt
    acgctttctacaacttactctctcgataaccccataaccgataacgattgacacctc
    gtacacatggtacatacgtactctcgataccccgt

At completion, the program should output/print the following information:

All output should be to the standard output.


  • Example Input/Output
  • As shown above, the alignment is wrapped after every 60 alignment positions, and on both sides the starting and ending indices of the aligning positions in the respective strings should be displayed. Then, a pipe symbol "|" is shown in columns wherever there is a match. (Other columns contain a whitespace character).

    After implementing and testing, run your program on the following input file, redirect the standard output into a global alignment and a local alignment output text files and attach them as part of your submission:

    Human-Mouse-BRCA2-cds.fasta

    This file contains the coding sequences in human and mouse of a gene called BRCA2 which is responsible for breast cancer. The first sequence is the copy of the gene in human and the second is the copy of the gene in mouse


    You are welcome to create your own additional test cases using small genes from the NCBI GenBank data (e.g., genes related to color blindness from human vs. mouse).
  • CHECKLIST FOR SUBMISSION:

            ___  source code

            ___  output file from the above run on the BRCA2 gene

            ___  All the above zipped into an archive