The National Institute for Computational Sciences

Managing and Interpreting Genomic Information Better

Research Aims to Enable Timely Updates to a Proteins Database and a More Powerful Interpretation of Sequenced Genomes

[Image credit: © Dmitry Sunagatov]

One of the many perpetual desires of the human race is to understand not only our origins but also the biological essence of who we are. Throughout history, scientists, researchers and clinicians have worked together to fully understand the biological, chemical, and anatomical mechanisms of the human body and its functions. A fundamental component in this search for understanding is the ability to describe the biological processes on a molecular level.

In the past few decades, we have begun to unravel the mechanisms of biomolecular interactions by determining the order of DNA nucleotides within an organism’s entire set of chromosomes (the genome), a process known as sequencing, and then interpreting (annotating) that genome. The work started with simple organisms and eventually made its way up to humans.

This process began in the late 1970s with the first DNA sequencing of the Phi X 174 (ΦX174) bacteriophage, a type of virus that infects bacteria. That first success triggered a worldwide race to decode and store the DNA sequences of other organisms, and thousands have since been recorded with the help of more modern sequencing technologies.

Of all the genomic sequencing efforts in history, however, the Human Genome Project, a momentous undertaking that was completed in 2003, has by far been the most celebrated.

Protein Functional Annotation

Once a genome has been fully sequenced, it must also be annotated, which entails assigning biological functions to the proteins that the genome encodes.

Because a genome can contain anywhere from tens to tens of thousands of genes, automated methods are understandably the most efficient way to perform functional annotation.

Currently, the most popular resource for bacterial annotation is the Clusters of Orthologous Groups (COG) of proteins database.

The database provides an evolutionary-development (phylogenetic) classification of proteins based on their orthologous relationship to other proteins. In other words, proteins that are considered direct evolutionary counterparts (orthologs) of other proteins are classified into unique functional groups. These functional groups provide essential information required to determine the nature and mechanisms of biomolecular interactions.

But as Bhanu Rekepalli, a researcher at the Joint Institute for Computational Sciences (JICS), notes, “the National Center for Biotechnology Information (NCBI) stopped updating COGs in 2006 primarily due to an increasing degree of computational complexity.” He adds, “It is this problem which we were trying to address with our latest paper.”

Recently, Rekepalli and his JICS-based research team collaborated with the Seattle Children’s Research Institute to develop and optimize a high-throughput workflow that would enable timely updates of COG as well as robust annotation of newly sequenced genomes.

Functional Annotation Workflow

The results of this joint research project were published recently in two manuscripts. The first manuscript, titled “High performance computing workflow for protein functional annotation” and published in XSEDE ’13 Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery in 2013, described the new automated workflow built to enable large-scale protein annotation.

The second paper, “Optimizing high performance computing workflow for protein functional annotation,” was published in a special issue of Concurrency and Computation in April 2014. This most recent publication dealt with the optimization of the automated workflow. As stated in the paper, “The optimized workflow relies on highly scalable parallel implementations of the analysis tools to enable rapid analysis on the supercomputers.”

The team utilized the Kraken supercomputer (decommissioned on April 30, 2014) at the National Institute for Computational Sciences (NICS) as well as the Newton HPC cluster at the University of Tennessee for their research.

An excerpt from their 2014 paper explains, “The workflow uses a Position-Specific Iteration (PSI)-BLAST approach and a low-complexity classification method to assign newly sequenced genomes to existing COGs.”

To reduce error propagation, however, the workflow had to be iterative, and that comes at a cost. As the paper explains, “With each iteration, the expansion of COG clusters leads to an increase in compute time.”
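The cost growth the paper describes can be illustrated with a toy loop. The sketch below is not the team's workflow: it swaps PSI-BLAST for a simple k-mer overlap score, and every name in it (`iterative_assign`, `similarity`, the `COG_A`/`COG_B` labels, the 0.7 threshold, the short sequences) is a hypothetical chosen for illustration. It only shows the general pattern of assigning sequences to existing clusters over repeated passes, with each pass comparing against clusters that have grown.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard overlap of k-mer sets: a toy stand-in for an alignment score."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def iterative_assign(clusters, queries, threshold=0.7, max_iters=10):
    """Assign query sequences to existing clusters over repeated passes.

    Returns (clusters, unassigned, cost_per_query), where cost_per_query
    lists, for each pass, how many pairwise comparisons one query needed.
    """
    clusters = {name: list(members) for name, members in clusters.items()}
    pending = list(queries)
    cost_per_query = []
    for _ in range(max_iters):
        if not pending:
            break
        # Freeze membership so assignments made during a pass only take
        # effect in the next pass.
        snapshot = {name: list(members) for name, members in clusters.items()}
        cost_per_query.append(sum(len(m) for m in snapshot.values()))
        still_pending = []
        for q in pending:
            best_name, best_score = None, 0.0
            for name, members in snapshot.items():
                score = max((similarity(q, m) for m in members), default=0.0)
                if score > best_score:
                    best_name, best_score = name, score
            if best_name is not None and best_score >= threshold:
                clusters[best_name].append(q)
            else:
                still_pending.append(q)
        if len(still_pending) == len(pending):
            break  # no progress in this pass, so stop iterating
        pending = still_pending
    return clusters, pending, cost_per_query


# Hypothetical seed clusters and query sequences (made up for illustration).
seeds = {"COG_A": ["MKTAYIAKQR"], "COG_B": ["GGSLLPWWNE"]}
queries = ["MKTAYIAKQL", "KTAYIAKQLV", "WWWWWWWWWW"]
result, unassigned, cost = iterative_assign(seeds, queries)
print(result["COG_A"], unassigned, cost)
```

In this toy run, the second query is too distant from the seed to be placed in the first pass but matches the first query once that one has joined the cluster, so it is picked up in the second pass. The per-query comparison count rises with every pass because the clusters keep expanding, which is the effect the quoted passage refers to.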

To efficiently manage this computationally demanding process, the team designed a method to reduce the compute time and memory demands while retaining the accuracy offered by their annotation workflow.

With the newly optimized functional annotation workflow, the team has processed more than 1 million newly sequenced proteins. These annotations were completed with a reduced compute time while maintaining the accuracy of classifications offered by the original COG database.

Future Endeavors

Initially, the optimized workflow was developed for Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers such as Kraken. But the team is now creating software that “wraps around” other software so that the encompassed elements can also run on non-XSEDE supercomputers.

Jacob Pieper, science writer, NICS, JICS

Article posting date: 7 September 2014

About JICS and NICS: The Joint Institute for Computational Sciences (JICS) was established by the University of Tennessee and Oak Ridge National Laboratory (ORNL) to advance scientific discovery and state-of-the-art engineering, and to further knowledge of computational modeling and simulation. JICS realizes its vision by taking full advantage of petascale-and-beyond computers housed at ORNL and by educating a new generation of scientists and engineers well versed in the application of computational modeling and simulation for solving the most challenging scientific and engineering problems. JICS runs the National Institute for Computational Sciences (NICS), which had the distinction of deploying and managing the Kraken supercomputer. NICS is a leading academic supercomputing center and a major partner in the National Science Foundation's eXtreme Science and Engineering Discovery Environment, known as XSEDE. In November 2012, JICS sited the Beacon system, which set a record for power efficiency and captured the number one position on the Green500 list of the most energy-efficient computers.