
Research @ HPCBio lab

Our research interests lie at the intersection of three broad areas: high performance computing, bioinformatics and computational biology, and combinatorial algorithms. Specifically, we are drawn to problems that are motivated by applications in the data-driven sciences (particularly the modern-day life sciences); that have a combinatorial flavor (e.g., graphs, strings, searching); and that need to tackle scale and complexity.

Active Research Topics:

        Scalable Graph Analytics: Algorithms and Architectures

o   Parallel Graph Community Detection

o   Bipartite Graph Community Detection

o   Homology Graph Construction

o   Graph Coloring

o   Parallel Architectures for Graph Analytics and Biocomputing

 

         Bioinformatics

o   Genome Assembly

o   Topological Data Analytics

Funding Sources:  We gratefully acknowledge our research sponsors, which include NSF, DOE, USDA, and CDC.



Scalable Graph Analytics for Big Data Applications and HPC Architectures

A key characteristic of big data, and one that is integral to the discovery pipelines that use it, is the inherent interconnectivity of the entities the data capture: for example, a set of interacting molecules that form biological networks, a set of neurons signaling each other to dictate the functioning of a brain, or people communicating via social media to form friendship networks. Consequently, graph and network representations have taken center stage in modeling the behavior of systems at scale. However, graph algorithms are notoriously difficult to parallelize because they generate irregular memory and data access patterns, creating a unique set of design challenges.

One of our primary research interests is in designing and developing novel scalable algorithms and software to support large-scale graph analytics for real-world applications.

Parallel Graph Community Detection

      Graph clustering (or community detection) is a fundamental operation in graph theory, used as a structure discovery tool for analyzing large graphs. The goal is to identify tightly knit groups of vertices in a given input graph. Community detection is used in a broad range of application areas, including biological networks, citation networks, and social networks. Since 2015, we have been developing the Grappolo-Vite graph community detection toolkit.
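To make "tightly knit groups" concrete, here is a minimal Python sketch that scores a candidate partition with modularity, one widely used quality measure for community detection. It is only an illustration: the function name `modularity`, the toy edge list, and the two partitions are assumptions made for this example, and the sketch does not reflect how the Grappolo-Vite toolkit is implemented.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every vertex to a community label
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints inside community c
    degree = defaultdict(int)  # sum of vertex degrees inside community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of (fraction of intra-community edges)
    #     minus (expected fraction under a degree-preserving random model).
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy input: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {v: 0 for v in range(6)}

print("two triangles as communities:", modularity(edges, two_groups))
print("everything in one community: ", modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping all vertices together, which is the sense in which a partition captures tightly knit groups.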

 Representative Papers:


Software:       Grappolo, Vite

Key Collaborators:     Mahantesh Halappanavar


Bipartite Graph Community Detection

      Heterogeneous graph-theoretic modeling has become an important part of biological network science, owing to the variety of data sources. Whether the analysis concerns genes vs. diseases, proteins vs. drugs, transcriptome vs. metabolites, predators vs. prey, or hosts vs. pathogens, all such relationships can be modeled in the form of a bipartite graph.


Representative Papers:

Software:    biLouvain


Homology Graph Construction

      In a number of large-scale graph applications, particularly in the life sciences, an input graph is not always readily available; instead, it needs to be constructed using pairwise homology information derived from raw data. Our original work in homology graph construction was motivated by its application in identifying protein families from newly sequenced environmental microbial communities (i.e., from metagenomics data). We pose this problem as one of first constructing a protein sequence homology graph, and subsequently identifying dense subgraphs within that graph. This work led to the pGraph-pClust software pipeline for homology graph construction (pGraph) and graph clustering (pClust).
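The construction step can be sketched in Python as follows. This is a toy illustration, not the pGraph method: shared k-mer (Jaccard) similarity is swapped in here as a cheap placeholder for a real pairwise homology test, and the function names, the sequences, `k`, and `threshold` are all assumptions made for this example.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_homology_graph(seqs, k=3, threshold=0.5):
    """Return an adjacency map with an edge between every pair of
    sequences whose placeholder similarity reaches the threshold."""
    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    adj = {name: set() for name in seqs}
    # All-pairs comparison: quadratic in the number of sequences, which is
    # why this step becomes a scalability bottleneck on large inputs.
    for u, v in combinations(seqs, 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


# Hypothetical protein fragments: p1 and p2 differ in one residue.
seqs = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQK", "p3": "GGSWLLPVNH"}
graph = build_homology_graph(seqs)
for name in sorted(graph):
    print(name, sorted(graph[name]))
```

The resulting adjacency map is the kind of input a subsequent dense-subgraph or clustering step would operate on.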

 Representative Papers: 

Software:    pGraph/pGraph-Tascel, pClust

Key Collaborators: Sriram Krishnamoorthy


Graph Coloring

Coloring is a fundamental graph operation that is widely used by numerous applications that attempt to identify maximally independent subsets of vertices (i.e., those that do not depend on one another). Many parallel computing applications use coloring to identify such subsets and thereby determine which vertices can be processed concurrently. However, traditional formulations of graph coloring focus solely on minimizing the number of colors used (i.e., to reduce the number of parallel steps); in the process, they end up generating skewed distributions of color class sizes, where a majority of the color classes receive very few vertices (thereby negatively impacting thread utilization).
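The skew described above can be reproduced with a few lines of Python. The sketch below uses a plain first-fit greedy coloring, chosen only because it is the simplest standard heuristic; it is not presented as the method implemented in Grappolo, and the function name and toy graph are assumptions made for this example.

```python
from collections import Counter, defaultdict


def greedy_coloring(adj, order):
    """First-fit greedy coloring: visit vertices in the given order and
    assign each one the smallest color not used by a colored neighbor."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical toy graph: a triangle (0, 1, 2) plus five extra vertices
# that are attached only to vertex 0.
edges = [(0, 1), (0, 2), (1, 2)] + [(0, v) for v in range(3, 8)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

coloring = greedy_coloring(adj, order=range(8))
class_sizes = Counter(coloring.values())

# Vertices that share a color form an independent set, so each color
# class can be processed in one parallel step.
for c in sorted(class_sizes):
    print(f"color {c}: {class_sizes[c]} vertices")
```

On this toy graph, three colors are used (the minimum possible, given the triangle), but two of the three color classes hold a single vertex each, so two of the three parallel steps would leave most threads idle.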

 Representative Papers:

Software:    Grappolo

Key Collaborators:    Mahantesh Halappanavar, Daniel Chavarria-Miranda, Assefaw Gebremedhin


Parallel Architectures for Graph Analytics and Biocomputing

      Mapping irregular application codes from bioinformatics and graph computations onto the next generation of high performance computing architectures is an important challenge in high performance computing.

This line of work represents some of the first studies on mapping large-scale, combinatorial, irregular applications onto NoC-based manycore architectures.

Representative Papers:

Key Collaborators: Partha Pande

 


Bioinformatics Research

Genome Assembly:

De novo genome assembly is a classical problem in bioinformatics that aims to assemble an unknown genome from the short DNA reads obtained from it through sequencing. Due to significant advancements in sequencing technology, de novo genome assembly continues to be an active research topic. Over the years, we have contributed to the development of genome assemblers and their application to multiple genome projects (apple, maize, Brachypodium). Yet tackling very large inputs (billions of DNA reads) remains both time- and memory-intensive.

 Representative Papers:

Software: FastEtch, PaCE

Key Collaborators:     Sriram Krishnamoorthy


Topological Data Analytics with Applications to the Life Sciences

Life science applications are rapidly adopting a wide range of sensing and high-throughput molecular and imaging technologies to generate complex data sets. These data sets are generated with or without preconceived hypotheses, which makes gleaning actionable information from them difficult. Computational techniques and advanced data mining tools are needed to analyze these complex, high-dimensional data sets.

     We are currently applying our Hyppo-X framework on different application use-cases:

Representative Papers:

Software:   Hyppo-X

Key Collaborators: Bala Krishnamoorthy, Pat Schnable, Bei Wang Phillips, Zhiwu Zhang, Eric Lofgren, Rebekah Moehring