BOSC 2008 Lightning Talk: The Enteropathogen Resource Integration Center (ERIC), A NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Disease
http://www.ericbrc.org
D. Pot¹, J. Whitmore¹, M. Shaker¹, J. Fedorko¹, K. Joshi¹, S. Nanan¹, P. Shetty¹, J. Thangiah¹, S. Zaremba¹, G. Plunkett, III², J. Glasner², B. Anderson², D. Baumler², B. Biehl², V. Burland², E. Cabot², E. Neeno-Eckwall², B. Mau², P. Liss², M. Rusch², F. R. Blattner², N. T. Perna², J. M. Greene¹
¹SRA International, Inc., Rockville MD and ²University of Wisconsin, Madison WI
ERIC is a NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Disease, one of eight such centers funded in July 2004 for five years.
ERIC primarily focuses on the integration of data from five enteropathogens as well as related reference organisms:
Diarrheagenic E. coli
Partnership between personnel at the Genome Center of Wisconsin (Nicole Perna, Fred Blattner, Guy Plunkett) and SRA International’s Global Health Sector, Rockville MD.
Everything done under contract funding is required to be made freely available to the scientific community.
ERIC Overview: Genomes, Annotations (ASAP), Genome Views and Comparisons (Mauve, GBrowse), Microarray Analysis (mAdb)
ERIC is a portal-based system using the JBoss portal.
ASAP (A Systematic Annotation Package for community annotation) from UW-Madison is being used to allow the scientific community to annotate genes for the five enteropathogens and the related reference organisms useful for comparative genomics.
ERIC contains tools for comparative genomics, such as Mauve, which has the distinct advantage of allowing comparison of more than two genomes, as well as being able to handle chromosomal rearrangements. (We provide access to some other pathogenic and non-pathogenic reference genomes, particularly for E. coli.)
Mauve – whole genome comparison
Mauve identifies and aligns regions of local collinearity called locally collinear blocks (LCBs). Each locally collinear block is a homologous region of sequence shared by two or more of the genomes under study, and does not contain any rearrangements of homologous sequence.
The Mauve genome alignment procedure results in a global alignment of each locally collinear block that has sequence elements conserved among all the genomes under study. Nucleotides in any given genome are aligned only once to other genomes, suggesting orthology among aligned residues. Mauve makes no attempt to align paralogous regions.
The remaining unaligned regions may be lineage-specific sequence or rearranged or paralogous repetitive regions and can be identified as such during subsequent processing with other tools.
Available at: http://gel.ahabs.wisc.edu/mauve/download.php
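The idea of a locally collinear block can be illustrated with a toy two-genome sketch. This is not Mauve's algorithm (Mauve handles more than two genomes and finds and weights its own anchors); the `Anchor` record, the `collinear_blocks` function, and all coordinates below are hypothetical and exist only for illustration. Given matches shared by genomes A and B, the sketch groups them into maximal runs that stay adjacent and keep one orientation in both genomes, so that no run contains an internal rearrangement.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """A hypothetical match shared by two genomes (assumed unique)."""
    pos_a: int   # start coordinate in genome A
    pos_b: int   # start coordinate in genome B
    strand: int  # +1 if same orientation in B, -1 if inverted in B


def collinear_blocks(anchors: List[Anchor]) -> List[List[Anchor]]:
    """Group anchors into maximal runs with no internal rearrangement.

    Toy illustration of the LCB concept for two genomes only;
    it is not the procedure Mauve itself uses.
    """
    if not anchors:
        return []
    by_a = sorted(anchors, key=lambda m: m.pos_a)
    # Rank of every anchor when the same anchors are ordered along genome B.
    rank_b = {m: i for i, m in enumerate(sorted(anchors, key=lambda m: m.pos_b))}

    blocks = [[by_a[0]]]
    for prev, cur in zip(by_a, by_a[1:]):
        step = rank_b[cur] - rank_b[prev]
        # Neighbours in A stay in one block only if they are also neighbours
        # in B: rank +1 on the forward strand, rank -1 inside an inversion.
        if cur.strand == prev.strand and step == cur.strand:
            blocks[-1].append(cur)
        else:
            blocks.append([cur])
    return blocks


# Made-up coordinates: two forward anchors, an inverted pair, one more forward.
example = [
    Anchor(100, 5000, +1),
    Anchor(200, 5100, +1),
    Anchor(300, 900, -1),
    Anchor(400, 800, -1),
    Anchor(500, 6000, +1),
]

for block in collinear_blocks(example):
    print([(m.pos_a, m.pos_b, m.strand) for m in block])
```

In this made-up input the inverted pair forms its own block, and the block boundaries fall where genome B's order or orientation departs from genome A's.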
SRA is an industry leader in natural language processing (NLP)-based text mining
Dedicated group of linguists and software engineers
Routinely win Government text mining competitions (e.g., the Message Understanding Conferences (MUC))
Extensive experience in multilingual information extraction, text clustering, and text summarization – this is not just keyword searching.
Numerous commercial and government clients/applications
Health care organizations (fraud detection); Financial services (anti-money laundering, e-mail surveillance); Government (homeland security, e-Government, business intelligence)
See Poster S-04, Extraction of Facts and Relationships Relevant to Molecular Mechanisms of Bacterial Pathogenesis through Natural Language Processing, for details on how this is used in ERIC!
Latest Articles tab – we present the mined extracts on enteropathogens for the preceding week.
Text Mining – Current Awareness
Search Tab – allows users to search across extracted data
Unlike Latest Articles, this is not limited to our contract enteropathogens, and should be useful across all bacteria.
Text Mining – Search
Currently, in addition to processing all new PubMed abstracts weekly, we are extracting about 4,000-5,000 abstracts per night, and have extracted all PubMed abstracts going back about four years.
We intend to go back at least 10 years….
There is no reason this cannot be applied to Open Access full-length text; we will also provide Web Services access to the extracted data in the near future…
Text Mining – Search Results
Screenshot callouts:
Extracted Terms and Relationships (requirements from Biologists)
Frequency of Terms and Coloration Control (requirements from IT Types)
Extracted Text from PubMed
What is different?
Without text mining:
Hundreds of abstracts to read (days)
Limited keyword searching
Data handling complex (stacks of paper)
Slower ability to reach conclusions
With text mining:
Quick summary provided (seconds)
Enhanced role searching
Knowledge base with links to details
Faster conclusions through mining of extracted data
Other ERIC Notes
Again, everything we do under the contract must be made freely available to the Scientific Community – all SRA’s work is available under the MIT License, and components from UW are under the GNU GPL.
Posters - Monday evening session:
I-04 on ERIC System (John Greene)
S-04 on Text Mining (David Pot)
For more information, contact [email_address].
ERIC is supported via NIAID contract HSN266200400040C.
[Figure: NetOwl Extractor, Optimizing Manual Literature Annotation. Diagram labels: Ontologies and patterns developed to mine text; Pattern Writers. *Software licensed for use on ERIC is bounded in red.]