LigerCat
Holly Miller, MBLWHOI Library, Marine Biological Laboratory
Catherine N. Norton, MBLWHOI Library, Marine Biological Laboratory
Indra Neil Sarkar, Biomedical Informatics, University of Vermont
LigerCat (http://ligercat.org) is a web application that allows a user to search for
articles in PubMed with key words or even a DNA/protein sequence and then
explore the resulting set via the MeSH descriptors and a publication time line.
LigerCat is freely available on the Internet and the source code is available under an
MIT license (http://github.com/mbl-cli/LigerCat).
LigerCat, Literature and Genomic Electronic Resource Catalogue, aggregates
multiple articles in PubMed, combining the associated MeSH descriptors into a tag
cloud, weighted by frequency -- more frequent terms are in a larger font size.
Through LigerCat a user can search the PubMed database using plain language text
or molecular sequences. The MeSH tag cloud then allows the user to explore the
collection and create subsets of articles simply by clicking on terms in the cloud.
The three ways to search in LigerCat: Journals, Articles and Genes.
Journal Search
Queries entered into the LigerCat “Journals” tab are used to search the NLM Journals
database. Searches are enhanced using words in titles of journals. This approach has
shown promise to categorize biomedical literature. In our implementation, stop
words (e.g., “the,” “journal,” “association”) are removed from the list of journal titles
returned from the initial search, and then the top 15% of remaining title words are
determined. The initial search is then enhanced with these additional title words.
Across all the journals indexed in MEDLINE, we observed that on average this
enhanced approach increased results by approximately 25% over just using subject
terms associated with journals. In a specific example, searching the NLM Journals
database for journals using the term “Aging” returns 41 journals; the enhanced
approach implemented in LigerCat returns 130.
LigerCat presents users with the resulting list of journal titles, and the user selects
those that they wish to include in the MeSH cloud generation process. Using a local
MEDLINE database, all PubMed identifiers (PMIDs) are then retrieved for the
selected list of journals.
The MeSH cloud is interactive. The user can click on terms in the cloud to create
subsets of articles in real time. The resulting subset can be displayed in PubMed by
clicking on “Go to Pubmed.”
Article Search
Traditional PubMed searches are fundamental to the identification of relevant
biomedical knowledge. Through the “Articles” tab in LigerCat, users enter a query as
they would enter in the PubMed interface. A series of eUtils-mediated searches are
then used to retrieve the list of PMIDs. In addition to the MeSH cloud, a Publication
History graph is displayed. Both features are interactive. Moussing over the
histogram will highlight years and display the number of articles from that year that
your search retrieved. Clicking on the bar for that year performs the search in
PubMed and takes you to that result page.
Gene Search
The third modality of search aims to identify relevant literature that is associated
with a molecular sequence. Within the context of GenBank, 71% of sequences with
citation information are associated with at least one MEDLINE citation. Each PMID
reference associated with GenBank records are extracted and stored within a local
GenBank database. Through the LigerCat “Genes” tab, the user can enter a FASTA-
formatted sequence of a single molecule (e.g., a gene). Related sequences are then
identified using netBLAST. Based on the returned results, sequences are examined
above a minimally stringent E-value (1.0). A list of PMIDs is then derived from the
local GenBank database. The MeSH descriptors from those articles are displayed as a
MeSH cloud similar the one generated from the Article Search. In addition to the
MeSH cloud, a Publication History graph is displayed, however, at this point it is not
interactive.
Under the hood
LigerCat makes use of a hybrid approach of local databases and Web accessible
resources at the National Library of Medicine (NLM Journals, PubMed, and
GenBank). This synergy between live and archived data is leveraged both for speed
(many of the queries and MeSH profile data are precompiled) and to maintain
currency. Additionally, the use of the eUtils for mediating MEDLINE queries enables
the leveraging of the synonym mapping features that are part of the Entrez
interface. As new items are discovered through the system, they are dynamically
added to the database and the MeSH ranks are appropriately adjusted.
LigerCat uses a fast and scalable architecture. The queries to NLM are mediated
through a load-balancing mechanism that off-loads tasks to a background queue.
This is necessary in part because eUtils requires that there is a minimum 1-second
interval between queries to the live data and the volume of data that may need to be
processed to create the resultant MeSH cloud. LigerCat utilizes a number of open
source tools to process data in parallel. By designing a modular, queue-based
architecture, we were able to eliminate single points of failure in the system while
also gaining the benefit of horizontal scalability. The modular nature of the
architecture allowed us to replace a traditional relational database system with the
Redis in-memory key-value store affording us a five fold decrease in processing
times, though at the cost of an increased memory requirement. By modularizing and
virtualizing components of the architecture, we were able to scale processing across
many processor cores of a single or many hosts or out onto public clouds where
required. Scaling can occur in parallel and rapidly due to the small footprint of the
processing components. LigerCat was developed using the Ruby on Rails
framework, which allowed for rapid prototyping, development, testing and
integration of the system’s various components. The use of messaging queues, in-
memory key-value stores and related tools offers informatics developers new
solutions to existing data management and processing challenges. The modular
nature enables components of this existing system to be reused in other research
projects that are faced with similar challenges.
Acknowledgements
LigerCat development was made possible through funds from The Ellison Medical
Foundation, NIH, and the Marine Biological Laboratory.

Liger cat challenge

  • 1.
    LigerCat Holly Miller, MBLWHOILibrary, Marine Biological Laboratory Catherine N. Norton, MBLWHOI Library, Marine Biological Laboratory Indra Neil Sarkar, Biomedical Informatics, University of Vermont LigerCat (http://ligercat.org) is a web application that allows a user to search for articles in PubMed with key words or even a DNA/protein sequence and then explore the resulting set via the MeSH descriptors and a publication time line. LigerCat is freely available on the Internet and the source code is available under an MIT license (http://github.com/mbl-cli/LigerCat). LigerCat, Literature and Genomic Electronic Resource Catalogue, aggregates multiple articles in PubMed, combining the associated MeSH descriptors into a tag cloud, weighted by frequency -- more frequent terms are in a larger font size. Through LigerCat a user can search the PubMed database using plain language text or molecular sequences. The MeSH tag cloud then allows the user to explore the collection and create subsets of articles simply by clicking on terms in the cloud. The three ways to search in LigerCat: Journals, Articles and Genes. Journal Search Queries entered into the LigerCat “Journals” tab are used to search the NLM Journals database. Searches are enhanced using words in titles of journals. This approach has shown promise to categorize biomedical literature. In our implementation, stop words (e.g., “the,” “journal,” “association”) are removed from the list of journal titles returned from the initial search, and then the top 15% of remaining title words are determined. The initial search is then enhanced with these additional title words. Across all the journals indexed in MEDLINE, we observed that on average this enhanced approach increased results by approximately 25% over just using subject terms associated with journals. In a specific example, searching the NLM Journals database for journals using the term “Aging” returns 41 journals; the enhanced approach implemented in LigerCat returns 130. LigerCat presents users with the resulting list of journal titles, and the user selects those that they wish to include in the MeSH cloud generation process. Using a local MEDLINE database, all PubMed identifiers (PMIDs) are then retrieved for the selected list of journals. The MeSH cloud is interactive. The user can click on terms in the cloud to create subsets of articles in real time. The resulting subset can be displayed in PubMed by clicking on “Go to Pubmed.” Article Search Traditional PubMed searches are fundamental to the identification of relevant biomedical knowledge. Through the “Articles” tab in LigerCat, users enter a query as
  • 2.
    they would enterin the PubMed interface. A series of eUtils-mediated searches are then used to retrieve the list of PMIDs. In addition to the MeSH cloud, a Publication History graph is displayed. Both features are interactive. Moussing over the histogram will highlight years and display the number of articles from that year that your search retrieved. Clicking on the bar for that year performs the search in PubMed and takes you to that result page. Gene Search The third modality of search aims to identify relevant literature that is associated with a molecular sequence. Within the context of GenBank, 71% of sequences with citation information are associated with at least one MEDLINE citation. Each PMID reference associated with GenBank records are extracted and stored within a local GenBank database. Through the LigerCat “Genes” tab, the user can enter a FASTA- formatted sequence of a single molecule (e.g., a gene). Related sequences are then identified using netBLAST. Based on the returned results, sequences are examined above a minimally stringent E-value (1.0). A list of PMIDs is then derived from the local GenBank database. The MeSH descriptors from those articles are displayed as a MeSH cloud similar the one generated from the Article Search. In addition to the MeSH cloud, a Publication History graph is displayed, however, at this point it is not interactive. Under the hood LigerCat makes use of a hybrid approach of local databases and Web accessible resources at the National Library of Medicine (NLM Journals, PubMed, and GenBank). This synergy between live and archived data is leveraged both for speed (many of the queries and MeSH profile data are precompiled) and to maintain currency. Additionally, the use of the eUtils for mediating MEDLINE queries enables the leveraging of the synonym mapping features that are part of the Entrez interface. As new items are discovered through the system, they are dynamically added to the database and the MeSH ranks are appropriately adjusted. LigerCat uses a fast and scalable architecture. The queries to NLM are mediated through a load-balancing mechanism that off-loads tasks to a background queue. This is necessary in part because eUtils requires that there is a minimum 1-second interval between queries to the live data and the volume of data that may need to be processed to create the resultant MeSH cloud. LigerCat utilizes a number of open source tools to process data in parallel. By designing a modular, queue-based architecture, we were able to eliminate single points of failure in the system while also gaining the benefit of horizontal scalability. The modular nature of the architecture allowed us to replace a traditional relational database system with the Redis in-memory key-value store affording us a five fold decrease in processing times, though at the cost of an increased memory requirement. By modularizing and virtualizing components of the architecture, we were able to scale processing across many processor cores of a single or many hosts or out onto public clouds where required. Scaling can occur in parallel and rapidly due to the small footprint of the processing components. LigerCat was developed using the Ruby on Rails
  • 3.
    framework, which allowedfor rapid prototyping, development, testing and integration of the system’s various components. The use of messaging queues, in- memory key-value stores and related tools offers informatics developers new solutions to existing data management and processing challenges. The modular nature enables components of this existing system to be reused in other research projects that are faced with similar challenges. Acknowledgements LigerCat development was made possible through funds from The Ellison Medical Foundation, NIH, and the Marine Biological Laboratory.