The National Institute for Computational Sciences

Realizing the Relevancies

Study Links Items in Library Collection Based on Subject Matter and User Behaviors

By Scott Gibson

In this age of Big Data, the “recommender system” — à la Amazon.com — has emerged as a way of prioritizing descriptive information based on social behavior. Amazon shoppers are familiar with the words, “Customers who bought this item also bought,” followed by images of suggested books. But how could university libraries provide tools that help patrons access the full depth of their comprehensive content?

Taking on that investigation is a team composed of Principal Investigator Harriett Green and Kirk Hess of the University of Illinois at Urbana-Champaign (UIUC) Library, along with Richard Hislop of the UIUC Department of Economics. The Nautilus supercomputer, housed at Oak Ridge National Laboratory and managed by the National Institute for Computational Sciences (NICS), provided high-performance computing support to the project, which entailed data mining involving the 14 million items in the UIUC Library.

“Current search mechanisms for online library catalogs and digital collections are narrowed to searching by indexed subject terms, authors, titles, and selected key words,” Green says. “With such limited parameters, many materials — especially in collections as vast as the holdings at the University of Illinois Library — are rarely exposed in search results. We wanted to find a way to reveal these ‘underserved items’ and help users see the broadest selection of relevant resources in their searches. We sought to develop new methods of calculating relevancy and exposing the results, which ultimately we aim to incorporate into a recommender system for scholarly users.”

Hislop explains that the approach the researchers plan to use in developing the recommender system is similar to Amazon.com’s, except that instead of making recommendations at the book level, it will work by connecting topics.

“We look for descriptions that appear together in circulation records. For instance, if someone is checking out a book about Beowulf, they’re especially likely to get a second book about that same topic,” he says. “We’ll use this information to highlight especially useful subject headings as people browse the catalog.” Subject headings are topical terms that are assigned to every book and item in a library’s collections.

The first step in the project was to create algorithms that reconstituted checkout transactions. Hislop explains what the task entailed: “When a patron returns a book to the library, the database removes their name from the transaction record. This is to preserve people's privacy while allowing librarians to keep circulation statistics. It also means that, unlike Amazon, there aren't good records showing which items were picked out together. We were able to use statistics about checkout times to find items that were probably checked out together. This helped us identify the pieces of information about the books that seemed most linked to the patron's research topics.”
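The article does not spell out the reconstruction algorithm, but the core idea can be sketched in a few lines of Python: sort the anonymized records by checkout time and treat checkouts that fall within a short window of one another as a single probable patron visit. The record layout, item names, and 60-second window below are illustrative assumptions, not the project’s actual parameters.

    from datetime import datetime, timedelta

    # Hypothetical anonymized circulation records: (item_id, checkout_time).
    # The field layout and the 60-second window are assumptions made for
    # illustration, not the project's actual parameters.
    records = [
        ("beowulf_heaney_translation", datetime(2012, 3, 5, 14, 2, 10)),
        ("old_english_grammar",        datetime(2012, 3, 5, 14, 2, 41)),
        ("brazil_economic_history",    datetime(2012, 3, 5, 16, 30, 0)),
    ]

    def group_transactions(records, window=timedelta(seconds=60)):
        """Group checkouts whose timestamps fall within `window` of the
        previous checkout, treating each run as one probable patron visit."""
        groups, current = [], []
        for item_id, ts in sorted(records, key=lambda r: r[1]):
            if current and ts - current[-1][1] > window:
                groups.append([item for item, _ in current])
                current = []
            current.append((item_id, ts))
        if current:
            groups.append([item for item, _ in current])
        return groups

    print(group_transactions(records))
    # [['beowulf_heaney_translation', 'old_english_grammar'],
    #  ['brazil_economic_history']]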

Using the transaction groupings, the researchers applied a Wilson score interval test to find the subjects most likely to connect the items that appeared together. The statistic provides a way to fairly compare topics that appear with very different frequencies in the catalog.

The analysis of topical headings the researchers are employing is similar to ranking on the news aggregator Reddit. “The assigned designation of ‘Heading appears on all the books’ means that the heading looks like a good match for the person’s research topic,” Green explains. “We treat it like a Reddit ‘upvote.’ The designation of ‘Heading appears on one book, but not the others’ means that the heading didn’t seem important to the user. So we treat it like a Reddit ‘downvote.’ This statistic is a way of solving the frustrating problem that some topics — ‘Shakespeare,’ for example — get viewed, and thus ‘rated,’ a lot more than others.”
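The ranking statistic itself is compact. The sketch below computes the lower bound of the Wilson score interval from ‘upvote’ and ‘downvote’ counts; that bound rewards headings whose observed approval is high relative to how often they have been rated at all, which is how a rarely rated niche heading can be compared fairly against a heavily rated one such as ‘Shakespeare.’ The vote counts are illustrative, not drawn from the project’s data.

    import math

    def wilson_lower_bound(upvotes, downvotes, z=1.96):
        """Lower bound of the Wilson score interval at roughly 95% confidence.
        Scores a heading by how confidently its observed 'upvote' fraction
        exceeds chance, given how many times it has been 'voted' on at all."""
        n = upvotes + downvotes
        if n == 0:
            return 0.0
        phat = upvotes / n
        return (phat + z * z / (2 * n)
                - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

    # A niche heading voted up 9 times out of 10 keeps a competitive score
    # even though a heavily used heading has far more ratings overall.
    # (All counts are illustrative.)
    print(round(wilson_lower_bound(9, 1), 2))      # roughly 0.60
    print(round(wilson_lower_bound(600, 400), 2))  # roughly 0.57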

Next, the team extended the previous result by searching the resulting network graphs for the subject headings most likely to lie between items that appeared together in a transaction. “This represents a new method of conducting broad evaluative surveys of the topical distribution of a library collection,” Hislop says.
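The article does not name the specific graph measure, but “headings most likely to lie between items” reads like a betweenness-style centrality computed over a graph that links items to their assigned subject headings. A minimal sketch of that idea using the networkx library, with made-up items and headings:

    import networkx as nx

    # Toy bipartite-style graph: items connect to the subject headings
    # assigned to them. Every node and edge here is invented for the sketch.
    G = nx.Graph()
    G.add_edges_from([
        ("Book A", "Beowulf"),
        ("Book B", "Beowulf"),
        ("Book B", "Old English poetry"),
        ("Book C", "Old English poetry"),
        ("Book C", "Epic literature"),
    ])

    # Betweenness centrality scores a node by how often it sits on shortest
    # paths between other nodes; ranking only the heading nodes surfaces the
    # headings most likely to lie 'between' items that circulate together.
    headings = {"Beowulf", "Old English poetry", "Epic literature"}
    scores = nx.betweenness_centrality(G)
    for heading in sorted(headings, key=scores.get, reverse=True):
        print(heading, round(scores[heading], 3))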

The team's analyses generated semantic connections between the headings that previously did not exist in the catalog’s hierarchical index of subject terms. "For example, in the existing library catalog, a book with the subject heading of ‘Brazil—Economics’ would not necessarily be connected to another book that solely has the subject heading of ‘Economics—Brazil.’ The two books are cataloged under two different areas of library collections — Brazilian studies vs. economics. But our analyses recognize that ‘Brazil—Economics’ and ‘Economics—Brazil’ are closely related semantically and create that connection," Green says.
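A simple way to detect such a relationship, assuming the catalog stores heading subdivisions with the conventional “--” separator, is to compare headings as unordered sets of facets; the helper below is purely illustrative:

    def heading_facets(heading):
        """Split a subject heading into its facets, ignoring their order, so
        that 'Brazil--Economics' and 'Economics--Brazil' compare as related.
        The '--' subdivision separator and lower-casing are simplifying
        assumptions about how the headings are stored."""
        return frozenset(facet.strip().lower() for facet in heading.split("--"))

    a, b = "Brazil--Economics", "Economics--Brazil"
    print(heading_facets(a) == heading_facets(b))  # True: the two are linked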

Finally, the team built network graphs that cluster collection items based on user behavior and semantic relevancy. They were able to generate visualizations of how library materials are connected in relevancy via shared semantic headings, clusters of items from transaction data, and use frequency.

Green describes the Illinois collection as “incredibly thorough,” with “some highly specialized topics.” “Studying the way topics are linked together in larger networks like the Illinois library is difficult and requires the kind of computational power we could only get from a system like Nautilus,” she says.

The team employed Gephi, an open-source application for visualizing and analyzing large network graphs. Hess explains: “By using Gephi for visualization, we were able to display the connections between headings as well as group them into communities by modularity score (a measurement of how well a social network breaks down into subgroupings), which made our results more accessible to librarians who are unfamiliar with network theory. We’ve received positive feedback from our peers, and we look forward to continuing our work on larger datasets.”
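Gephi’s community view can be approximated programmatically for smaller experiments. The sketch below runs greedy modularity maximization from the networkx library over a toy heading co-occurrence graph and writes a GEXF file that Gephi can open; the headings and edge weights are invented for illustration.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy heading co-occurrence graph; an edge weight counts how many
    # reconstructed transactions linked the two headings. All values are
    # illustrative, not drawn from the Illinois catalog.
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("Beowulf", "Old English poetry", 12),
        ("Old English poetry", "Epic literature", 7),
        ("Brazil--Economics", "Economics--Brazil", 9),
        ("Economics--Brazil", "Economic development", 4),
    ])

    # Greedy modularity maximization groups headings into communities,
    # roughly the view Gephi offers through its modularity score.
    for community in greedy_modularity_communities(G, weight="weight"):
        print(sorted(community))

    # Write a GEXF file that Gephi can open for interactive exploration.
    nx.write_gexf(G, "headings.gexf")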

The knowledge discovered about tightly linked clusters of topics, combined with information from circulation records, will give librarians insight they can use to help patrons navigate through the less-trafficked parts of their collection.

“We feel that our research has strong potential to help users find materials in libraries in far more robust and efficient ways,” Green says. “The network analyses we have developed on Nautilus will serve as the basis for our work to develop a user-based recommender system in library catalogs.”

Related Publication

Green, Harriett E., Kirk Hess, and Richard D. Hislop. 2012. “Incorporating Circulation Data in Relevancy Rankings for Search Algorithms in Library Collections.” In Proceedings of the 8th IEEE International Conference on eScience, Chicago, IL, USA, October 8–10, 2012. IEEE, 6 pp. DOI: 10.1109/eScience.2012.6404447.

Related Presentations

  • Green, Harriett and Kirk Hess. 2013. “Clusters of Books: Subject Analysis Tools for the library catalog.” Presentation at the Association of College and Research Libraries 2013 conference, Indianapolis, IN, April 10–13, 2013.
  • Green, Harriett and Kirk Hess. 2012. “Network Analyses of Library Catalog Data.” Poster presented at DLF Forum 2012 conference, Denver, CO, November 4–5, 2012.

Article posting date: 28 September 2013

About NICS: The National Institute for Computational Sciences (NICS) operates the University of Tennessee supercomputing center, funded in part by the National Science Foundation. NICS is a major partner in NSF’s Extreme Science and Engineering Discovery Environment, known as XSEDE.