Text mining for protein and small molecule relations
Upcoming SlideShare
Loading in...5
×
 

Text mining for protein and small molecule relations

on

  • 550 views

BCB-Seminar, Charite, Berlin, Germany, Januar 19, 2006

BCB-Seminar, Charite, Berlin, Germany, Januar 19, 2006

Statistics

Views

Total Views
550
Views on SlideShare
550
Embed Views
0

Actions

Likes
0
Downloads
4
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Text mining for protein and small molecule relations Text mining for protein and small molecule relations Presentation Transcript

  • Text mining for protein and small molecule relations Lars Juhl Jensen EMBL
  • Why?
  • Overview
    • Entity recognition and identification
      • Recognition: find the words that are names of entities
      • Identification: figure out which entities they refer to
    • Information extraction
      • Simple statistical co-occurrence methods
      • Natural Language Processing (NLP)
    • Text mining
      • Mining text for overlooked relations
      • Discovery of global trends from text alone
  • Entity recognition
    • Features
      • Morphological: mixes letters and digits or ends on -ase
      • Context: followed by “protein” or “gene”
      • Grammar: should occur as a noun
    • Methodologies
      • Manually crafted rule-based systems
      • Machine learning (SVMs)
    • But what can it be used for?
  • Entity identification
    • A good synonyms list is the key
      • Combine many sources
      • Curate to eliminate stop words
    • Flexible matching to handle orthographic variation
      • Case variation: CDC28 , Cdc28 , and cdc28
      • Prefixes: myc and c-myc
      • Postfixes: Cdc28 and Cdc28p
      • Spaces and hyphens: cdc28 and cdc-28
      • Latin vs. Greek letters: TNF-alpha and TNFA
  • Example
    • Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation
    • Entities identified
      • S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
  • Identification of small molecules
    • We have compiled a list of 14 million synonyms for 4 million chemicals
      • This list was compiled based on many resources: PubChem, KEGG, ChEBI, and SuperDrug
      • A stop word list was manually curated for based on synonyms that occur 2000+ times in Medline
    • Searching Medline with this list gives 12.5 million hits in 4.6 million abstracts
      • The precision and recall has not been evaluated yet
      • However, stop word curation has eliminated the most critical errors so fairly high precision is likely
  • Co-occurrence
    • Relations are extracted for co-occurring entities
      • Relations are always symmetric
      • The type of relation is not given
    • Scoring the relations
      • More co-occurrences  more significant
      • Ubiquitous entities  less significant
      • Same sentence vs. same paragraph
    • Simple, good recall, poor precision
  • Example
    • Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation
    • Relations
      • Correct: Clb2–Cdc28 , Clb2–Swe1 , Cdc28–Swe1 , and Cdc5–Swe1
      • Wrong: Clb2–Cdc5 and Cdc28–Cdc5
  • NLP
    • Information is extracted based on parsing and interpreting phrases or full sentences
      • Good at extracting specific types of relations
      • Handles directed relations
    • Complex, good precision, poor recall
  • Example
    • Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation
    • Relations:
      • Complex: Clb2–Cdc28
      • Phosphorylation: Clb2  Swe1 , Cdc28  Swe1 , and Cdc5  Swe1
    • Syntacto-semantic tagging
      • Part-of-speech
      • Gene and protein names
      • Cue words for entity recognition
      • Cue words for relation extraction
    • Named entity chunking
      • A CASS grammar recognizes noun chunks related to gene expression: [ nxgene The GAL4 gene ]
    • Relation chunking
      • Our CASS grammar also extracts relations between entities: [ nxexpr T he expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ]
  •  
  • Extraction of relations between protein and small molecules
    • Over 650,000 protein–chemical relations were identified using simple co-occurrence on Medline
    • Benchmarking on DrugBank
      • Of the 959 protein–drug interactions in DrugBank, literature mining evidence was found for 299
      • 30% recall is thus our current upper limit
      • Precision has not yet been evaluated
    • We are also working on adapting our NLP system to extract protein–chemical relations
  • Text mining
    • New relations can be inferred from published ones
      • This can lead to actual discoveries if no person knows all the facts required for making the inference
      • Combine facts from disconnected literatures
    • Global trends can be discovered from literature
      • Although all the detailed data is in the text, people may have missed the big picture
      • Identify significant correlations
      • Find temporal trends
  •  
  • Correlations
    • “ Customers who bought this item also bought …”
    • Correlation protein roles in networks
      • Transcription factors are themselves transcriptionally regulated
      • Kinases are themselves phosphorylated
      • Many proteins are both regulated transcriptionally and post-translationally
  • Temporal trends
  • Buzzwords
  • Acknowledgments
    • Charit é
      • Mathias Dunkel
      • Robert Preißner
    • EML Research
      • Jasmin Saric
      • Isabel Rojas
    • EMBL Heidelberg
      • Rossitza Ouzounova
      • Peer Bork
      • Rob Russell
      • Reinhard Schneider
  • Thank you!