Research

- Research interest: High Performance Computing, Computational Biology and Graph Algorithms.

- Protein families identification in large scale metagenomics data

** Main stages **

  • - [Graph Construction]

    To solve the problem in graph theoretic domain, each input sequence is represented as a vertex, and an edge is drawn between two vertices if their corresponding sequences have a high sequence similarity. However it is challenging to scale this approach to millions of input sequences, which in most of the time are totally irregular.

  • - [Suffix Tree Construction]

    One intuitive way to construct the graph is to perform an all-against-all pairwise sequence comparison. However it will require to compare quadratical number of pairs, which is impractical given the millions of input sequences. So we use an "Exact Match" based filtering technique to filter out significant amount of unnecessary pairs. In this approach, suffix tree serves as our index purpose.

  • - [Dense Subgraph Detection] - Shingling algorithm

    After sequence graph is constructed, "connected component detection" approach is applied to break down the large problem instance into subproblems of much smaller size. Subsequently, a heuristical method called "Shingling" is applied to report the final dense subgraphs.

  • ** Implementations **

  • - Sequential version: download - coming soon
  • - CrayXMT version: download - coming soon
  • - Distributed version: download - coming soon
  •