NICS Staff Share Top Prize at Cray User Group

by Caitlin Elizabeth Rockett


Computational scientists from the National Institute for Computational Sciences (NICS) shared the “Best Paper” award at this year’s Cray User Group (CUG) as co-authors of “Software Usage on Cray Systems across Three Centers (NICS, ORNL and CSCS).” CUG is an independent, international corporation of member organizations that own Cray Inc. computer systems. Five of those systems rank among the fifteen fastest computers in the world, including NICS’ Cray XT5, known as Kraken. The annual CUG workshop, founded in 1978, facilitates collaboration and information exchange in the high-performance computing (HPC) community.

“Cray has numerous machines in the top 100 fastest machines in the world, so it’s important that all the Cray users—staff from computing centers as well as Cray users—have a chance to get together and discuss issues and best practices at each site,” explained Mark Fahey, NICS deputy director and joint faculty with the Industrial and Information Engineering Department at the University of Tennessee, Knoxville.

Fahey and computational scientist Bilel Hadri of NICS joined Tim Robinson (Swiss National Supercomputing Centre, CSCS) and William Renaud (Oak Ridge National Laboratory, ORNL) to produce the award-winning paper, which describes an infrastructure called the Automatic Library Tracking Database (ALTD). ALTD automatically and transparently stores information about all libraries and third-party software used on large-scale machines. These machines often support hundreds of different software packages, each with multiple versions, and each version may be built with multiple compilers. Given the cost of maintaining leadership computing systems, it is important to identify both the most-used and the least-used software in order to provide efficient, targeted support. ALTD was put into production on Cray XT and XE systems at NICS, ORNL and CSCS.

“ALTD lets us track what libraries are linked into each user’s codes, which allows system support staff to identify users that may be using non-optimal libraries or deprecated software,” explained Hadri. ALTD adds almost no overhead and is transparent to users. When ALTD was deployed, no alternative could provide such information without tracing the code at runtime.

Since the paper was published, other Cray owners have contacted the authors about implementing ALTD on their own systems. The next step is to implement ALTD on other architectures, a project Hadri is already tackling on one of IBM’s Blue Gene systems.

