The National Institute for Computational Sciences

Researchers Beef Up Protein Search Tool

HMMER package now up and running on world’s fastest academic supercomputer

by Scott Jones

MPI-HMMER vs. HSP-HMMER

This graph compares the performance of the new parallel HSP-HMMER code to the previously existing MPI-HMMER program. The new HSP-HMMER code developed by the researchers improves performance a hundredfold.

Ever since Fred Sanger sequenced the first genome in history in 1977, that of a bacterial virus, scientists around the world have added thousands of complete genomes to the collective scientific repository. And thanks to more efficient and inexpensive sequencing techniques, the genetic blueprints of all types of organisms are being added every day.

A species’ genome reveals its DNA structure, which in turn yields its protein makeup. Proteins determine much about an organism, and it turns out there are plenty of them. To date, sequencing has yielded more than 13 million registered proteins in various scientific databases, and that number doubles every 6 to 18 months or so. Needless to say, keeping track is a bit of a numerical nightmare.

“Every living organism on Earth is made of proteins, and all over the world people are finding new proteins,” said Bhanu Rekepalli, a computational biologist at the Joint Institute for Computational Sciences (JICS), adding, “You can define an organism if you can determine its protein makeup.”

Proteins are made up of different segments, or domains, and just as proteins tell cells what to do, the domains determine the function and structure of proteins and give clues to their evolution. Given that thousands of proteins relate to human diseases, deciphering proteins in terms of their domains has enormous medical implications. This domain modeling is precisely what Rekepalli and his colleagues Christian Halloy, of the University of Tennessee's National Institute for Computational Sciences (NICS), and Igor Jouline, leader of the computational biology and bioinformatics group at JICS, have improved.

One of the primary protein databases used by biologists is the nonredundant (NR) database, which currently contains more than 8 million individual registered proteins. Researchers have determined the sequences of just over 5,000 genomes of an estimated 2 to 200 million living species, and that sequencing is only getting easier thanks to recent improvements in DNA sequencing technology. As more genomes are sequenced, the NR database will continue to grow exponentially.

Unfortunately for the biology community, which uses computers to catalog and search the various databases, the number of newly discovered proteins is far outpacing Moore’s Law, which holds that computer memory and processing speed double roughly every 1.5 to 2 years. Simply put, technology is falling far behind in the race to catalog the domains, and thus the structure and function, of recently discovered proteins.

The complete proteins in the NR database are compared with accepted protein models in a model database known as Pfam for “protein family.” Because the functions of the domains in Pfam are largely known, the proteins in the NR database are examined for any matches in the Pfam dataset. When there is a match, researchers can begin to determine the functions of the newly discovered protein in the NR database. Slowly but surely, scientists are piecing together the protein puzzle.
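The matching idea can be illustrated with a toy sketch. Real Pfam entries are profile hidden Markov models scored statistically by HMMER, not literal sequence motifs, so everything below (the model names, the patterns, the example sequence) is a hypothetical simplification of how a query protein is checked against a library of domain models:

```python
import re

# Toy stand-ins for Pfam domain models. Real Pfam entries are profile
# hidden Markov models scored by HMMER; these hypothetical regex-style
# motifs only illustrate the matching idea.
PFAM_TOY_MODELS = {
    "P-loop_NTPase": r"G....GK[ST]",    # Walker A-like motif
    "Zinc_finger":   r"C..C.{12}H..H",  # classic C2H2-like spacing
}

def annotate(protein_seq):
    """Return the names of toy models that match anywhere in the sequence."""
    return [name for name, motif in PFAM_TOY_MODELS.items()
            if re.search(motif, protein_seq)]

# A made-up sequence containing a Walker A-like motif:
hits = annotate("MAGTTTTGKTLVERL")  # → ["P-loop_NTPase"]
```

A real search replaces the regular expressions with probabilistic profile HMMs, which tolerate insertions, deletions, and substitutions and return statistical significance scores rather than yes/no hits.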

The team’s tool of choice for this comparison is a software package known as HMMER (pronounced “hammer”), arguably the best tool for protein domain identification, an essential step in determining a protein’s biological function. Unfortunately, with the traditional MPI-HMMER package, identifying the individual domains for all of the proteins in the NR database can take from months to a year on a computer cluster, depending on the cluster’s size. And MPI-HMMER’s various versions and enhancements historically haven’t scaled well to large numbers of processors. With the NR and Pfam databases growing so rapidly, a new strategy is necessary if biology is ever to have a chance of modeling all of the protein domains found through genetic sequencing.

Supercomputers such as the Department of Energy’s (DOE’s) Jaguar and the National Science Foundation’s (NSF’s) Kraken, both located at Oak Ridge National Laboratory (ORNL), are beginning to gain the attention of bioinformatics specialists. Both are among the world’s fastest supercomputers, and if HMMER could be made to scale to their thousands of processors, the modeling of proteins in the various databases could be significantly expedited, paving the way for medical investigation into those most integral of life’s building blocks.

Unfortunately, MPI-HMMER scales linearly up to only 256 processing cores. Beyond that, the number of amino acids (which make up the protein domains) it can process per hour begins a slow, steady decline as the core count grows past 512. Halloy and Rekepalli examined the MPI-HMMER package to see where performance could be improved and decided instead to write a new parallel program based on the sequential version of HMMER. With this new code they improved performance a hundredfold on 4,096 cores and beyond.

By eliminating communications between the participating compute nodes and minimizing simultaneous output to file systems, the team made their version of HMMER, known as Highly Scalable Parallel (HSP)–HMMER, ideally parallel. According to the researchers, this approach not only simplifies programming issues but also eliminates the overhead accrued in sending and receiving myriad messages between nodes.
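The communication-free design can be sketched as a static decomposition: each core derives the slice of the database it owns purely from its own rank and the global totals, so no messages ever pass between nodes. This is an illustrative sketch with hypothetical names, not the team's actual code:

```python
def my_slice(n_sequences, n_cores, rank):
    """Return the (start, end) index range of sequences owned by one core.

    Each core computes its share from its rank and the totals alone, with
    no inter-node communication -- the essence of an "ideally parallel"
    decomposition. Illustrative sketch only.
    """
    base, extra = divmod(n_sequences, n_cores)
    # The first `extra` cores each take one additional sequence.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

# Every sequence ends up owned by exactly one core, e.g. on 4,096 cores:
ranges = [my_slice(10_000, 4096, r) for r in range(4096)]
```

Because each core also writes to its own output file in this scheme, results never need to be funneled back through a master node.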

The biggest obstacle to scaling HMMER to Jaguar’s and Kraken’s thousands of processors, however, was the input/output (I/O) involved in searching the immense number of proteins and models in the respective databases, said Halloy. By distributing protein sequences of different lengths among the cores, the team ensured that jobs on different cores finish, and begin new sequences, at different times. This randomized the I/O events and largely did away with the bottlenecks that plagued MPI-HMMER as well as earlier versions of HSP-HMMER.
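One plausible way to realize this length-aware distribution (a sketch with hypothetical names, not the team's published implementation) is to sort sequences by length and deal them out round-robin, so every core receives a different mix of long and short jobs and reaches its output-writing phases at different moments:

```python
def stagger_by_length(seq_lengths, n_cores):
    """Deal sequence indices round-robin in descending order of length.

    Each core receives a different mixture of long and short jobs, so the
    cores hit their output-writing phases at different times instead of in
    synchronized bursts -- desynchronizing I/O. Illustrative sketch only.
    """
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    buckets = [[] for _ in range(n_cores)]
    for pos, idx in enumerate(order):
        buckets[pos % n_cores].append(idx)
    return buckets

# Six sequences of varying length dealt to two cores:
buckets = stagger_by_length([900, 120, 450, 60, 300, 700], 2)
```

The descending sort also roughly balances each core's total workload, while the differing per-sequence runtimes keep individual jobs from completing in lockstep.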

“Think of trying to build a house with ten workers,” said Halloy. “Ten is better than one. And perhaps 100 workers could be used and finish the house even faster. But if you tried using 10,000 or more handymen simultaneously, you would clearly encounter difficulties. They would all be bumping into each other, and despite the surplus of labor, they would be counterproductive.” By carefully rotating (or synchronizing) the workers in and out, however, the building of the “house” on Jaguar or Kraken progresses rapidly, and the databases are scanned and the proteins modeled faster than ever before.

“This minimizes simultaneous reads/writes and avoids major time delays due to traffic jams,” wrote the team in a paper presented in the Bioinformatics Track of the 2009 Association for Computing Machinery (ACM) Symposium on Applied Computing. This, they added, allowed for the identification of all of the functional domains in the NR database in less than 24 hours of compute time, demonstrating “an advantage of using supercomputers for computational sequence analysis, especially in the nearest future, when both the database size and the number of available domain models will increase dramatically.” With the traditional MPI-HMMER package on a cluster, the same task could have taken as long as two months.

So far the team has scaled up to a little more than 8,000 cores, and at that number performance still increases linearly, but the I/O problems could resurface, warned Rekepalli. And as the database continues to grow, the search times involved will likewise increase, making further improvements necessary if researchers are to keep up with the ever-increasing number of genomes being sequenced.

Although they began on the DOE’s Jaguar Cray XT4 component, said the team, they are currently using Kraken, the NSF’s Cray XT5 managed by the University of Tennessee, to develop HSP-HMMER further. In fact, one of their most recent runs with the latest Pfam database completed in less than 14 hours on Kraken thanks to its improved parallel I/O, a roughly ten-hour improvement over their previous best using the same number of cores.

Oak Ridge National Laboratory’s supercomputing complex is home to Kraken and both XT4 and XT5 components of Jaguar. These world-class computing systems allow the team to continue improving HSP-HMMER’s scalability and I/O and give the biology community a valuable means with which to determine the nature and functions of life’s most fundamental actors.

Images used for the story were taken from the National Library of Medicine (NLM) conserved domain Web pages, citing the Conserved Domain Database (CDD):

Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N, Bryant SH. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009;37(Database Issue):D205–10.