Utilizing literature for biological discovery

E-BioSci/ORIEL Annual Workshop, Villa Monastero, Varenna, Italy, September 2-5, 2003

Utilizing literature for biological discovery Presentation Transcript

  • 1. Using Literature for Biological Discovery Lars Juhl Jensen EMBL Heidelberg
  • 2. Introduction
    • Why literature mining should not be used on its own
      • Biological discoveries are not made by reading papers
      • To make biological discoveries, existing scientific literature generally has to be used in combination with other data sources
      • An example of how this can be done is the Genes2Diseases server
    • Using NLP for interpreting high-throughput experiments
      • Many types of genomics-scale data sets are available today, including data on gene expression and protein-protein interactions
      • To make discoveries, these data must be analyzed in the context of what is already known
      • NLP can be used for obtaining this information from literature
      • EMBL and EML are currently finalizing a method for extracting gene regulatory interactions from Medline abstracts
  • 3. Genes2Diseases: utilizing Medline for finding disease-related genes in the human genome
    • Each disease is associated with a phenotypic MeSH term and mapped to a chromosomal region using LocusLink
    • Within the region, gene functions are assigned by sequence similarity
    • Gene functions are linked to chemical substances via RefSeq entries
    • Chemical substances are linked to phenotypes by Medline abstracts
    • A score of each gene’s relevance for the disease is calculated
  • 4. “Biologists would rather share their toothbrush than share a gene name”
    • Lists of synonymous identifiers and names were compiled from
      • SGD, WormBase, and FlyBase
      • BLAST search against UniGene
    • Several types of identifiers
      • Various database identifiers and accession numbers
      • Gene symbols and gene names
    • Lack of standardization
      • 8+ identifiers per yeast gene
      • Many names refer to unrelated genes in different species
    • The synonym and ortholog lists can be downloaded from http://www.bork.embl.de/synonyms
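
The synonym problem on this slide can be illustrated with a small lookup table. This is a minimal sketch, not the format of the downloadable lists: the identifiers and names in `SYNONYMS` are invented for illustration, and `build_index` and `resolve` are hypothetical helpers. Lower-casing names is a simplifying assumption, since case can be meaningful in some nomenclatures.

```python
from collections import defaultdict

# Hypothetical synonym table: (species, canonical identifier, other names).
# All entries are invented for illustration.
SYNONYMS = [
    ("yeast", "YG0001", ["ABC1", "XYZ2"]),
    ("worm", "WG0007", ["abc1"]),
    ("fly", "FG0042", ["Foo3"]),
]


def build_index(rows):
    """Map each lower-cased name to the set of (species, canonical_id) it may denote."""
    index = defaultdict(set)
    for species, canonical_id, names in rows:
        for name in [canonical_id, *names]:
            index[name.lower()].add((species, canonical_id))
    return index


def resolve(index, name, species=None):
    """Return the sorted candidate genes for a name, optionally limited to one species."""
    hits = index.get(name.lower(), set())
    if species is not None:
        hits = {hit for hit in hits if hit[0] == species}
    return sorted(hits)


index = build_index(SYNONYMS)
# The same name points at unrelated genes in two species ...
ambiguous = resolve(index, "ABC1")
# ... unless the species is known from context.
unambiguous = resolve(index, "ABC1", species="yeast")
```

Returning every candidate rather than a single guess keeps the ambiguity visible, so a later step can decide using the species mentioned in the abstract.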
  • 5. Retraining TreeTagger for Medline abstracts
    • The English parameter file distributed with TreeTagger was trained on the UPenn Treebank
    • We retrained TreeTagger on the manually annotated GENIA 3.0 corpus (466,179 tokens) adding gene names to the dictionary
    • Performance of the two taggers was evaluated on 55,166 tokens not used during training
    • Retraining eliminated more than half of all tagging errors
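
A claim such as "eliminated more than half of all tagging errors" corresponds to a relative error reduction above 0.5 on the held-out tokens. The sketch below shows how that figure could be computed; the tag sequences are toy data invented for illustration, not results from the evaluation, and the function names are hypothetical.

```python
def error_rate(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold-standard tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute an error rate on zero tokens")
    errors = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return errors / len(gold_tags)


def relative_error_reduction(baseline_rate, retrained_rate):
    """Share of the baseline tagger's errors that the retrained tagger removes."""
    if baseline_rate == 0:
        raise ValueError("baseline makes no errors, reduction is undefined")
    return (baseline_rate - retrained_rate) / baseline_rate


# Toy held-out data (invented): eight tokens with gold part-of-speech tags.
gold = ["NN", "VBZ", "DT", "NN", "IN", "NN", "NN", "."]
baseline = ["NN", "VBZ", "DT", "JJ", "IN", "VB", "JJ", "NN"]
retrained = ["NN", "VBZ", "DT", "NN", "IN", "VB", "NN", "."]

reduction = relative_error_reduction(
    error_rate(gold, baseline), error_rate(gold, retrained)
)
```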
  • 6. Tagging is really easy ... compared to extracting the information you are after
    • Many ways to write the same thing
      • A activates the transcription of B
      • B transcription is induced by A
      • A is a transcriptional activator of B
      • Overexpression of A increases B mRNA levels
      • Transcription is enhanced when A binds to the B promoter
      • The B promoter contains an A UAS
    • Multiple pieces of information and negations in a sentence
      • A is a transcriptional activator of B, C, D, E, and F
      • B was not suppressed by A
      • The A transcription factor affects B but not C
      • C phosphorylation of A leads to increased expression of B
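
To make the point concrete, here is a minimal sketch that handles a few of the phrasings above with surface regular expressions. It is a deliberately naive stand-in, not the chunk-based method described in the talk: the pattern list, the `GENE` expression, and the `extract` helper are all assumptions made for illustration.

```python
import re

# Naive assumption: a gene/protein name is a capitalised alphanumeric token.
GENE = r"[A-Z][A-Za-z0-9]*"

# Each entry pairs a surface pattern with the relation it expresses.
PATTERNS = [
    (re.compile(rf"^(?P<reg>{GENE}) activates the transcription of (?P<tgt>{GENE})$"),
     "activates"),
    (re.compile(rf"^(?P<tgt>{GENE}) transcription is induced by (?P<reg>{GENE})$"),
     "activates"),
    (re.compile(rf"^(?P<reg>{GENE}) is a transcriptional activator of (?P<tgt>{GENE})$"),
     "activates"),
    (re.compile(rf"^(?P<tgt>{GENE}) was (?P<neg>not )?suppressed by (?P<reg>{GENE})$"),
     "represses"),
]


def extract(sentence):
    """Return (regulator, relation, target, negated), or None if no pattern matches."""
    text = sentence.strip().rstrip(".")
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match:
            negated = bool(match.groupdict().get("neg"))
            return (match.group("reg"), relation, match.group("tgt"), negated)
    return None
```

Three differently worded sentences map to the same `("A", "activates", "B", False)` tuple, while a phrasing that was not anticipated, such as "Overexpression of A increases B mRNA levels", returns `None`. Every new phrasing needs its own pattern, which is why operating on chunks and part-of-speech tags scales better than raw strings.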
  • 7. “Biologists tend to ask simple questions: Here’s a frog ... is he happy?”
    • It is not always clear what a sentence means
      • Many biological terms/concepts are poorly defined
      • Words are often coined before a subject is understood
      • Ambiguous use of terms makes text mining more difficult
    • The complexity of biological systems makes it hard to design simple experiments that lead to clear answers
      • “Protein A regulates the expression of gene B”
        • Does this mean that protein A is a transcription factor?
        • Or are more indirect regulatory mechanisms allowed?
      • “Protein A is a transcriptional activator of B”
        • Can A activate transcription alone?
        • Or only together with certain other proteins?
  • 8. A mini-ontology of transcription regulation
    • Entities (boxes)
      • generic (gray)
      • regulator (yellow)
      • activator (red)
      • repressor (green)
      • target (blue)
    • Relations (arrows)
      • is-a (black)
      • part-of (blue)
    • Events (arrows)
      • creates (green)
      • binds (red)
  • 9. Parsing abstracts to identify relationships between genes/proteins
    • Sentence and word boundaries are identified using Tokenizer
    • Our retrained TreeTagger is used for tagging part-of-speech
    • Abstracts are chunked with a custom CASS grammar to identify noun and verb chunks
    • Noun chunks are categorized according to a mini-ontology
    • Lexico-syntactic patterns are used to identify event chunks
    • Example parse (figure), shown as token/tag pairs: SRN1/NNPG can/MD suppress/SUPV rna2/NNPG rna3/NNPG rna4/NNPG rna5/NNPG rna6/NNPG and/CC rna8/NNPG singly/RB or/CC in/IN pairs/NNS
      • Chunk labels appearing in the figure: NXPGSG, EVSUPVA, NXPGPL
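
The last step, turning a tagged sentence into pairwise events, can be sketched on the SRN1 example from this slide. This is an illustration under stated assumptions, not the CASS grammar itself: NNPG is read as a gene/protein name tag and SUPV as a suppression verb tag, and `suppression_events` is a hypothetical helper that only handles active-voice sentences.

```python
# Token/tag pairs taken from the slide's example sentence.
TAGGED = [
    ("SRN1", "NNPG"), ("can", "MD"), ("suppress", "SUPV"),
    ("rna2", "NNPG"), ("rna3", "NNPG"), ("rna4", "NNPG"),
    ("rna5", "NNPG"), ("rna6", "NNPG"), ("and", "CC"),
    ("rna8", "NNPG"), ("singly", "RB"), ("or", "CC"),
    ("in", "IN"), ("pairs", "NNS"),
]


def suppression_events(tagged):
    """Expand 'X suppress Y1, Y2, ...' into one (X, 'suppresses', Yi) event per target.

    Assumes active voice: names before the verb are regulators,
    names after it are targets.
    """
    verb_positions = [i for i, (_, tag) in enumerate(tagged) if tag == "SUPV"]
    if not verb_positions:
        return []
    verb = verb_positions[0]
    regulators = [token for token, tag in tagged[:verb] if tag == "NNPG"]
    targets = [token for token, tag in tagged[verb + 1:] if tag == "NNPG"]
    return [(r, "suppresses", t) for r in regulators for t in targets]


events = suppression_events(TAGGED)
```

A single sentence yields six separate events here, one per coordinated target, which is the "multiple pieces of information in a sentence" case raised earlier in the talk.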
  • 10. Using text mining of Medline abstracts to support predicted regulatory interactions
    • By applying the scheme just described to all Medline abstracts, a set of regulatory interactions in multiple species is obtained
    • We will use it to classify protein associations derived from
      • Microarray gene expression
      • Chromatin IP data
      • Physical protein interaction screens (e.g. Y2H and TAP)
      • Cross-species analysis of genomic context (STRING)
    • To integrate all of these data sources, the list of synonymous gene names and identifiers is needed again, because different data sets use different identifiers
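
One way to picture this integration step: map every name to a canonical identifier first, then check whether an undirected association from an experiment matches a directed, text-mined regulatory interaction. The sketch below is an assumption about how such a check could look, not the classification scheme used in the project; the mapping, the pairs, and the label strings are all invented.

```python
def normalize_pair(pair, to_canonical):
    """Map both names of a pair to canonical identifiers; None if either is unknown."""
    first, second = (to_canonical.get(name.lower()) for name in pair)
    if first is None or second is None:
        return None
    return (first, second)


def classify_associations(associations, text_mined, to_canonical):
    """Label each association by whether text mining supports a regulatory direction."""
    supported = {normalize_pair(pair, to_canonical) for pair in text_mined}
    supported.discard(None)
    labels = {}
    for pair in associations:
        norm = normalize_pair(pair, to_canonical)
        if norm is None:
            labels[pair] = "unmapped"
        elif norm in supported:
            labels[pair] = "regulator->target"
        elif norm[::-1] in supported:
            labels[pair] = "target->regulator"
        else:
            labels[pair] = "no literature support"
    return labels


# Hypothetical name-to-identifier mapping and data (all invented).
TO_CANONICAL = {"abc1": "G1", "g1": "G1", "xyz2": "G2", "g2": "G2", "foo3": "G3"}
TEXT_MINED = [("ABC1", "XYZ2")]  # (regulator, target) pairs from abstracts
ASSOCIATIONS = [("G2", "G1"), ("abc1", "foo3"), ("G1", "nope")]

labels = classify_associations(ASSOCIATIONS, TEXT_MINED, TO_CANONICAL)
```

Without the shared identifier mapping, the pair `("G2", "G1")` from one data set would never be matched to `("ABC1", "XYZ2")` from the literature.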
  • 11. The next step: mining full text scientific papers
    • Full text versions of papers from several journals are available in formats suitable for text mining
    • It matters which part of a paper a sentence is from
      • The abstract has the highest density of descriptive words
      • It is followed by the introduction and then the discussion
      • The methods section is qualitatively different
      • Interestingly, the results section has the lowest density ...
    • Our NLP scheme should work on full text papers too
  • 12. Summary
    • I believe literature mining is a powerful tool for studying biology, but it should never be used alone
    • Literature mining is much needed to help interpret the massive amounts of data from genomics-scale studies
    • We have developed a method for extracting information on gene regulation from Medline abstracts using NLP
    • The same methods should be applicable to full text papers, particularly for the introduction and discussion parts
  • 13. Acknowledgments
    • European Media Laboratory GmbH (EML)
      • Jasmin Saric
      • Isabel Rojas
    • European Molecular Biology Laboratory (EMBL)
      • Miguel Andrade
      • Carolina Perez-Iratxeta
      • Parantu Shah
      • Peer Bork
    • Publications
      • C. Perez-Iratxeta, P. Bork and M. A. Andrade, Nature Genetics, 31:316-319, 2002
      • P. Shah, C. Perez-Iratxeta, P. Bork and M. A. Andrade, BMC Bioinformatics, 4:20, 2003
    • Web resources
      • www.bork.embl.de/g2d
      • www.bork.embl.de/synonyms