Presentation material

Presentation Transcript

  • 1. A Bio Text Mining Workbench combined with Active Machine Learning
    • Gary Geunbae Lee, Postech
    • 11/25, LBM2005
  • 2. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status (demo)
  • 3. Introduction
    • Exponentially growing biological publications
  • 4. Introduction
    • Two key issues in dealing with biological texts:
      • Biological named entity recognition.
      • Extraction of the biological interactions (events) between biological entities.
        • Important for biological pathways.
    [Figure: Biological Papers]
  • 5. Introduction
    • Development workbench (common in NLP)
      • Grammar development workbench
      • POS/Tree Tagging workbench
    • Use of large corpora
      • Machine learning methods are used in the NER task and the event extraction task.
      • An annotated corpus, adequate in both quantity and quality, is essential to achieve good results with machine-learning-based methods.
      • Annotated corpora are scarce (notoriously so in the bio/medical fields).
    • Needs
      • Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources.
      • Tools that interact with the automatic system to grow the high-quality annotated corpus.
    • Bio-text mining workbench
  • 6. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status
  • 7. POSBIOTM/W: A Development Workbench
    • Overall Design
  • 8. POSBIOTM/W Workbench
    • Goal
      • help users to search, collect and manage publications.
    • Quick Search Bar
      • provides quick access to PubMed.
    • Pubmed Search Assistant
      • Users can select specific abstracts to do the named-entity tagging and event extraction
    • Managing Tool
  • 9. POSBIOTM/W Workbench
    • Managing Tool
    • Pubmed search Assistant
  • 10. POSBIOTM/W Workbench
    • Named-entity recognition (NER) task
      • Identification of the names of the materials concerned.
    • Goal: automatically and effectively annotate biomedical-related entities.
    • NER Tool is a Client Tool of POSBIOTM/NER System
      • Currently, three NER models are provided:
      • the GENIA-NER model, the GENE-NER model and the GPCR-NER model
    • Named-entity recognition with Active learning
      • To minimize the human labeling effort
    • NER Tool
  • 11. POSBIOTM/W Workbench
    • NER Tool
    • Named-entity recognition with Active learning
  • 12. POSBIOTM/W Workbench
    • Goal: To extract the events which consist of “interaction”, “effecter”, and “reactant”
    • Named-entity types: protein (P), gene (G), small molecule (SM), and cellular process (CP).
    • Interaction: biological interaction (BI) and a chemical interaction (CI).
    • Event Extraction Tool is a Client Tool of POSBIOTM/Event System
    • Event Extraction Tool
  • 13. POSBIOTM/W Workbench
    • Extraction Result in XML format
    • Event Extraction Tool
    <Result>
      <NER>
        ....
        <Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> (<protein>GPCR</protein>) for <small_molecule>sphingosine-1-phosphate</small_molecule> (<small_molecule>SPP</small_molecule>) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.</Sentence>
        .....
      </NER>
      <Event_Extraction>
        <Event SNum="4">
          <Interaction>stimulate</Interaction>
          <Effecter>sphingosine-1-phosphate</Effecter>
          <Reactant>angiogenesis</Reactant>
        </Event>
        .....
      </Event_Extraction>
    </Result>
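A client of the Event Extraction Tool could read this result with a standard XML parser. The sketch below is only illustrative: it assumes the result is well-formed XML, uses an abbreviated sample in the format shown on the slide, and the function names `read_events` and `read_entities` are hypothetical, not part of POSBIOTM.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the format shown on the slide (well-formed XML assumed).
SAMPLE = """
<Result>
  <NER>
    <Sentence SNum="4"><protein>EDG-1</protein> is a receptor for
      <small_molecule>sphingosine-1-phosphate</small_molecule> that stimulates
      <cellular_process>angiogenesis</cellular_process>.</Sentence>
  </NER>
  <Event_Extraction>
    <Event SNum="4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
  </Event_Extraction>
</Result>
"""


def read_events(xml_text):
    """Return (sentence number, interaction, effecter, reactant) tuples."""
    root = ET.fromstring(xml_text)
    events = []
    for ev in root.iter("Event"):
        events.append((
            int(ev.get("SNum")),
            ev.findtext("Interaction"),
            ev.findtext("Effecter"),
            ev.findtext("Reactant"),
        ))
    return events


def read_entities(xml_text):
    """Return (entity type, surface text) pairs for every tagged entity."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text) for s in root.iter("Sentence") for el in s]
```

Calling `read_events(SAMPLE)` yields one tuple per `<Event>` element, and `read_entities(SAMPLE)` lists the tagged entities in sentence order.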
  • 14. POSBIOTM/W Workbench
    • Extraction Result
    • Event Extraction Tool
  • 15. POSBIOTM/W Workbench
    • Goal
      • The GUI-based Annotation tool is designed to manipulate the manual annotations.
    • Named-entity editing
      • Named entities are displayed in different colors, which can be changed
      • Users can add, remove or correct named-entity tags, change the boundaries of named entities, etc.
    • Annotation Tool
  • 16. POSBIOTM/W Workbench
    • Event editing
      • Extracted events are displayed in a table
      • Double-clicking an event looks up the original sentence from which it was extracted
    • Upload function
      • Users can upload the well-annotated data to the POSBIOTM system
      • Incremental build-up of a large corpus of named-entity and event annotations.
    • Annotation Tool
  • 17. POSBIOTM/W Workbench
    • Annotation Tool
  • 18. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status
  • 19. POSBIOTM/NER System
    • Approach
      • The named entity recognition problem is regarded as a classification problem: each input token is marked up with a named entity category label.
    • CRF
      • Conditional random fields (CRFs) [Lafferty et al. 2001] are a probabilistic framework for labeling and segmenting sequential data. (s: state (tag); o: input)
      • For example:
    • Named Entity Recognition (NER)
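For reference, the standard linear-chain CRF of [Lafferty et al. 2001] defines the conditional probability of a state (tag) sequence s given an input sequence o as below. The feature functions f_k and weights \lambda_k are generic symbols here, not the specific features used in POSBIOTM/NER.

```latex
P(s \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \Big),
\qquad
Z(o) = \sum_{s'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s'_{t-1}, s'_t, o, t) \Big)
```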
  • 20. POSBIOTM/NER System
    • Feature Set (feature: description)
      • Lexical word: only in the case that the previous/current/next words are in the surface word dictionary.
      • Word feature: orthographical feature of the previous/current/next words, e.g. upper case letters, numbers, non-alphabet letters, Greek words (alpha cells, beta hemolysis, tau interferon).
      • Prefix/suffix: prefixes/suffixes which are contained in the prefix/suffix dictionary, i.e. biological prefix/suffix concepts (ase, blast, cyt, phore, plast).
      • Part-of-speech tag: POS tag of the previous/current/next words. The part of speech describes how a particular word is used, e.g. noun, verb.
      • Base noun phrase tag: base noun phrase tag of the previous/current/next words.
    • Named Entity Recognition (NER)
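The word and prefix/suffix features over a previous/current/next window can be sketched as follows. This is a minimal illustration, not the POSBIOTM/NER feature extractor: the affix tuple and Greek word set are tiny hypothetical stand-ins for the dictionaries mentioned on the slide, and the feature names are invented for the example.

```python
import re

# Hypothetical, tiny stand-ins for the dictionaries mentioned on the slide.
BIO_AFFIXES = ("ase", "blast", "cyt", "phore", "plast")
GREEK = {"alpha", "beta", "gamma", "delta", "kappa", "tau"}


def word_features(token):
    """Orthographic and affix features for one token."""
    lower = token.lower()
    return {
        "has_upper": any(c.isupper() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "has_non_alpha": bool(re.search(r"[^A-Za-z0-9]", token)),
        "is_greek": lower in GREEK,
        # First dictionary affix found at the start or end of the token, if any.
        "bio_affix": next((a for a in BIO_AFFIXES
                           if lower.startswith(a) or lower.endswith(a)), None),
    }


def window_features(tokens, i):
    """Features of the previous, current and next tokens around position i."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            for key, value in word_features(tokens[j]).items():
                feats[f"{name}:{key}"] = value
        else:
            feats[f"{name}:pad"] = True  # position falls outside the sentence
    return feats
```

Part-of-speech and base noun phrase tags would come from external taggers and could be added to the same dictionary under `prev:`, `cur:` and `next:` keys.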
  • 21. POSBIOTM/NER System
    • Three NER models
      • GENIA model / GENE-NER model / GPCR-NER model
    • GENIA model
      • The named entity classes used in the evaluation:
      • DNA, RNA, protein, cell_line and cell_type
      • The training data consists of 2000 MEDLINE abstracts of the GENIA version 3 corpus. These abstracts were collected using the search terms “human”, “blood cell” and “transcription factor”.
      • The testing data comes from a super-domain of the training data (“blood cell”, “transcription factor”).
    • NER Models
  • 22. POSBIOTM/NER System
    • GENE-NER model
      • The GENE-NER module uses the BioCreative corpus.
      • The aim of the GENE-NER module is to identify which terms in biomedical research articles are gene and/or protein names.
      • The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names.
    • GPCR-NER module (Postech)
      • Aims at recognizing four target named entity categories:
      • protein, gene, small molecule and cellular process.
      • The training corpus consists of 50 full articles related to the GPCR (G-protein coupled receptor) signal transduction pathway.
    • NER Models
  • 23. POSBIOTM/NER System
    • Evaluation for Three NER models
    • NER Models
    Corpus       Precision   Recall   F-Measure
    GENIA-NER    0.6960      0.6929   0.6945
    GPCR-NER     0.6736      0.8135   0.7370
    GENE-NER     0.7550      0.8404   0.7982
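The F-measure reported here is assumed to be the usual balanced F-measure, the harmonic mean of precision and recall. A minimal sketch under that assumption (the function name `f_measure` is illustrative):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for two of the models.
genia = f_measure(0.6960, 0.6929)
gpcr = f_measure(0.6736, 0.8135)
```

With the reported precision and recall, this reproduces the GENIA-NER and GPCR-NER F-measures to within rounding.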
  • 24. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status
  • 25. POSBIOTM/NER with Active Learning
    • NER with Machine Learning
      • To enhance the NER performance through the idea of re-using the annotated data and re-training the NER module
    • NER with Active Machine Learning
      • To minimize the human labeling effort without degrading the performance
      • To select the most informative samples for training
    • Active Learning in NER
  • 26. POSBIOTM/NER with Active Learning
    • Active Learning in NER Framework
  • 27. POSBIOTM/NER with Active Learning
    • Uncertainty-based Sample Selection
      • Using an entropy-based measure to quantify the uncertainty that the current classifier holds (entropy or normalized entropy of the CRF conditional probability)
      • The most uncertain samples are selected for human annotation
    • Active Learning Scoring Strategy
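Uncertainty-based selection can be sketched as follows. The sketch assumes each pool sentence comes with a probability distribution over candidate labelings (for example, N-best CRF outputs renormalized to sum to 1) and that normalized entropy means entropy divided by its maximum, log N. Both are assumptions, since the slide does not give the exact definition.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log(N); 0 for fewer than two outcomes."""
    n = len(probs)
    if n < 2:
        return 0.0
    return entropy(probs) / math.log(n)


def most_uncertain(pool, k):
    """pool maps sentence id -> probabilities; return the k most uncertain ids."""
    ranked = sorted(pool, key=lambda sid: entropy(pool[sid]), reverse=True)
    return ranked[:k]
```

Sentences returned by `most_uncertain` would be the ones handed to the human annotator in each round.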
  • 28. POSBIOTM/NER with Active Learning
    • Diversity-based Sample Selection
      • To capture the most representative sentences in each sampling.
      • The divergence between two sentences is represented by the minimum similarity among the examples
      • The similarity score of two words
      • The similarity score of two sentences
    • Active Learning Scoring Strategy
    (for syntactic path)
  • 29. POSBIOTM/NER with Active Learning
    • MMR(Maximal Marginal Relevance) method
      • The two measures for uncertainty and diversity will be combined using the MMR method to give the sampling scores in our active learning strategy
    • Active Learning Scoring Strategy
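A greedy MMR-style selection that trades uncertainty against redundancy can be sketched as below. The weight `lam` and the bag-of-words cosine similarity are hypothetical stand-ins: the talk's own word and sentence similarity scores are not given in the transcript, so this shows the general MMR scheme rather than the exact scoring used.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists (bag-of-words stand-in)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def mmr_select(candidates, uncertainty, k, lam=0.8):
    """Greedily pick k ids maximizing lam*uncertainty - (1-lam)*max similarity.

    candidates: dict id -> token list; uncertainty: dict id -> score in [0, 1].
    """
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        def score(sid):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max(
                (cosine_similarity(candidates[sid], candidates[t]) for t in selected),
                default=0.0,
            )
            return lam * uncertainty[sid] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to pure uncertainty sampling; lower values penalize sentences that resemble ones already chosen.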
  • 30. POSBIOTM/NER with Active Learning
    • Training Data
      • 2,000 MEDLINE abstracts from the GENIA corpus
      • 5 named entity classes
        • DNA, RNA, protein, cell line, cell type
    • Test Data
      • 404 abstracts
      • Half of them are from the same domain as the training data and the other half are from the super-domain of ‘blood cell’ and ‘transcription factor’
    • Experiment and Discussion
  • 31. POSBIOTM/NER with Active Learning
    • Pool-based sample selection
      • 100 abstracts were used to train the initial NER module
      • Each time, we chose k examples (sentences) from the given pool to train a new NER module
      • The number k varied from 1,000 to 17,000 with step size 1,000
    • Active learning methods for test
      • Random selection
      • Entropy based uncertainty selection
      • Entropy combined with Diversity
      • Normalized Entropy combined with Diversity
    • Experiment and Discussion
  • 32. POSBIOTM/NER with Active Learning
    • Experiment and Discussion
  • 33. POSBIOTM/NER with Active Learning
    • All three active learning strategies outperform random selection
      • The combined strategy reduces the number of training examples by 24.64% compared with random selection
      • The normalized combined strategy reduces the number of training examples by 35.43% compared with random selection
    • Diversity increases the classifier’s performance when a large number of samples is selected
      • Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly
      • After the 11,000-sentence point, the combined strategy surpasses the entropy strategy
    • Experiment and Discussion
  • 34. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status
  • 35. POSBIOTM/Event System
    • System Architecture
  • 36. POSBIOTM/Event System
      • Template Element
        • Entities - participants of an event
          • protein (P), gene (G), small molecule (SM), cellular process (CP)
        • Interaction - relationship between entities
          • biological interaction (BI) – Functional interaction
            • About how/whether one component affects the other's status biologically
          • chemical interaction (CI) – Molecular interaction
            • About the interaction among entities at the molecular structural level
      • Event
        • One Interaction (I)
          • Connecting the effecter and reactant
          • Interaction keywords (BI, CI)
        • One Effecter (E)
          • Provoking an event
          • Template element (P, G, SM, CP) or nested event
        • One Reactant (R)
          • Responding to an effecter
          • Template element (P, G, SM, CP) or nested event
    • Target Slot Definition
  • 37. POSBIOTM/Event System
    • Target Slot Definition
    • Example
    • Template Element
      • Entities : PDGF (P), SPP (SM), Cell movement (CP)
      • Interaction keywords : cross-talk (BI), require (BI)
    • Event
      • cross-talk (I) : PDGF (E) : SPP (R)
      • require (I) : cross-talk (E) : cell movement (R)
    The cross-talk between PDGF and SPP is required for these embryonic cell movements .
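The slot definition, including nested events, maps naturally onto a small recursive data structure. The sketch below is one possible representation, not the POSBIOTM/Event data model; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Element:
    text: str
    kind: str  # "P", "G", "SM", "CP" for entities; "BI", "CI" for interaction keywords


@dataclass(frozen=True)
class Event:
    interaction: Element
    effecter: Union[Element, "Event"]  # a template element or a nested event
    reactant: Union[Element, "Event"]

    def __post_init__(self):
        if self.interaction.kind not in ("BI", "CI"):
            raise ValueError("interaction must be a BI or CI keyword")

    def depth(self):
        """Nesting depth: 1 for a flat event, 2 if it contains an event, ..."""
        inner = [x.depth() for x in (self.effecter, self.reactant)
                 if isinstance(x, Event)]
        return 1 + max(inner, default=0)


# The example sentence: the cross-talk event is the effecter of "require".
inner = Event(Element("cross-talk", "BI"), Element("PDGF", "P"), Element("SPP", "SM"))
outer = Event(Element("require", "BI"), inner, Element("cell movement", "CP"))
```

Here `outer.effecter` is itself an `Event`, which is how the nested structure "require (I) : cross-talk (E) : cell movement (R)" is expressed.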
  • 38. POSBIOTM/Event System
    • Sentence boundary detection
    • Annotating Named Entity (NER)
      • Protein
      • Small molecule
      • Gene
      • Cellular process
    • Compound/Complex Sentence Splitter
      • To simplify the complicated full texts
    • Pre-Processor
  • 39. POSBIOTM/Event System
    • Compound/Complex Sentence Splitter
      • Simple splitting rules
        • [S] NP1 VP1 NP2 [SBAR] that|which VP2 [/SBAR] [/S]
          • → NP1 VP1 NP2 + NP2 VP2
      • Example
        • “The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.”
          • ==> 1. “The best studied of these is EDG-1.”
          • 2. “EDG-1 is implicated in cell migration and angiogenesis.”
    • Pre-Processor
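The splitting rule can be sketched on pre-chunked input. This assumes a chunker or parser has already produced labelled chunks; the `REL` label for the relative pronoun and the function name are hypothetical, and only the single pattern from the slide is handled.

```python
def split_relative_clause(chunks):
    """Apply NP1 VP1 NP2 that|which VP2 -> 'NP1 VP1 NP2.' + 'NP2 VP2.'

    chunks is a list of (label, text) pairs; sentences that do not match
    the pattern are returned unsplit.
    """
    labels = [label for label, _ in chunks]
    if labels != ["NP", "VP", "NP", "REL", "VP"]:
        return [" ".join(text for _, text in chunks) + "."]
    np1, vp1, np2, _rel, vp2 = (text for _, text in chunks)
    return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]


EXAMPLE = [
    ("NP", "The best studied of these"),
    ("VP", "is"),
    ("NP", "EDG-1"),
    ("REL", "which"),
    ("VP", "is implicated in cell migration and angiogenesis"),
]
simple_sentences = split_relative_clause(EXAMPLE)
```

The second output sentence reuses NP2 as its subject, which is what lets a later rule see "EDG-1" next to "is implicated in".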
  • 40. POSBIOTM/Event System
    • Two-level Event Rule Learner
    • Biological Event Extraction
  • 41. POSBIOTM/Event System
    • Event Rule Learner
      • Adapts a supervised machine learning algorithm: WHISK
        • learns rules in the form of context-based regular expressions
        • induces the rules in a top-down manner
          • Ex) “{NP} .*? (<CP>)[E] {/NP} {VP} (<BI>)[I] {/VP} {NP} both (<P>)[R] and .*? {/NP}”
      • Limitations of WHISK
        • The longer the distance between event components, the more difficult it is to extract the correct event
          • WHISK considers all lexical words between event components
        • Cannot handle nested biological events
      • Proposal: a two-level rule learning method to overcome the limitations of the flat rule learning method
    • Biological Event Extraction
  • 42. POSBIOTM/Event System
    • Two-level Event Rule Learner
    • Biological Event Extraction
    Annotated training sentence:
      {NP} <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}
      <TAGS> B {interaction cross-talk} {effecter PDGF} {reactant SPP}
      <TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}
    1. Mark the long NP boundary.
    2. Learn the short-span rule corresponding to the NP:
      “<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM>” → “{NP} (<BI>)[I] between (<P>)[E] and (<SM>)[R] {/NP}”
    3. Re-annotate the short-span interaction as one noun in regular expression format:
      {NP} <E>cross-talk_between_PDGF_and_SPP</E> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}
      <TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}
    4. Learn the long-span rule with the re-annotated sentence.
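The two-level idea can be illustrated by applying a short-span rule, collapsing its match into a single noun, and then applying a long-span rule to the rewritten sentence. In this sketch both rules are hand-written Python regular expressions standing in for rules that WHISK would learn, and all names are hypothetical.

```python
import re

# Hand-written stand-in for the learned short-span rule on the slide.
SHORT_SPAN_RULE = re.compile(
    r"\{NP\}\s*<BI>(?P<interaction>[^<]+)</BI>\s+between\s+"
    r"<P>(?P<effecter>[^<]+)</P>\s+and\s+"
    r"<SM>(?P<reactant>[^<]+)</SM>\s*\{/NP\}"
)

# Hand-written stand-in for a long-span rule over the re-annotated sentence.
LONG_SPAN_RULE = re.compile(
    r"\{NP\}\s*<E>(?P<effecter>[^<]+)</E>\s*\{/NP\}\s*"
    r"\{VP\}.*?<BI>(?P<interaction>[^<]+)</BI>\s*\{/VP\}.*?"
    r"<CP>(?P<reactant>[^<]+)</CP>"
)

SENTENCE = (
    "{NP} <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> {/NP} "
    "{VP} is <BI>required</BI> {/VP} for "
    "{NP} these embryonic <CP>cell_movements</CP> {/NP}"
)


def apply_short_span(sentence):
    """Extract the inner event and collapse its NP into one <E> noun."""
    m = SHORT_SPAN_RULE.search(sentence)
    if m is None:
        return None, sentence
    event = m.groupdict()
    words = re.sub(r"</?\w+>", "", m.group(0))  # strip entity tags
    words = words.replace("{NP}", "").replace("{/NP}", "").split()
    collapsed = "{NP} <E>" + "_".join(words) + "</E> {/NP}"
    return event, sentence[:m.start()] + collapsed + sentence[m.end():]


inner_event, rewritten = apply_short_span(SENTENCE)
outer_match = LONG_SPAN_RULE.search(rewritten)
```

After the rewrite, the long-span rule no longer has to skip over the words inside the first noun phrase, which is the motivation given for the two-level learner.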
  • 43. POSBIOTM/Event System
    • Event Extractor
      • To extract the events with the automatically generated rules
        • by using regular expression pattern matching
      • To handle aliases and noun conjunctions
        • aliases and noun conjunctions have general patterns like ‘sphingosine-1-phosphate(SPP)’ or ‘FP, IP, and TP receptors’
          • handle them with simple rules like ‘A(B)’ or ‘A, B, C, and D’
      • To remove sentences that include negative words
        • ‘not’, ‘never’, ‘fail’, etc.
    • Biological Event Extraction
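The alias, conjunction and negation rules can be sketched with a few regular expressions. This is an assumption-laden illustration: the negative word list only extends the slide's examples with inflected forms of 'fail', and distributing the shared head noun ('receptors') over the conjuncts is one reading of the 'A, B, C, and D' rule.

```python
import re

# Slide examples 'not', 'never', 'fail'; inflected forms added as an assumption.
NEGATIVE_WORDS = {"not", "never", "fail", "fails", "failed"}


def has_negation(sentence):
    """True if the sentence contains one of the negative words."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return any(w in NEGATIVE_WORDS for w in words)


def split_alias(term):
    """'A(B)' -> ['A', 'B']; anything else -> [term]."""
    m = re.fullmatch(r"\s*(.+?)\s*\(\s*([^()]+?)\s*\)\s*", term)
    return [m.group(1), m.group(2)] if m else [term.strip()]


def expand_conjunction(phrase):
    """'FP, IP, and TP receptors' -> one phrase per conjunct plus the head."""
    m = re.fullmatch(r"(.+?),?\s+and\s+(\S+)\s+(.+)", phrase.strip())
    if not m:
        return [phrase.strip()]
    heads = [h.strip() for h in m.group(1).split(",") if h.strip()] + [m.group(2)]
    return [f"{h} {m.group(3)}" for h in heads]
```

Sentences for which `has_negation` is true would be dropped before rule matching, and the alias and conjunction helpers would normalize entity mentions before events are assembled.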
  • 44. POSBIOTM/Event System
    • Event Component Verifier
  • 45. POSBIOTM/Event System
    • To remove the incorrectly extracted events
    • Classify template elements (P, G, SM, CP, BI, CI) into 4 classes
      • I (interaction), E (effecter), R (reactant), N (none)
        • I, E, R : event’s components
        • N : a template element, but not an event component
    • Use a Maximum Entropy Classifier
      • Features
        • POS tag, phrase chunks, the type of template element of neighboring words and semantic information
    • Event Component Verifier
  • 46. POSBIOTM/Event System
    • Event Component Verifier
  • 47. POSBIOTM/Event System
    • Example
    • Event Component Verifier
    Extracted Biological Events
      Ev1: Requires (I) sphingosine_kinase (E) cell_migration (R)
      Ev2: Requires (I) EDG-1 (E) cell_migration (R)
      Ev3: Requires (I) EDG-1 (E) PDGF (R)
    Event Component Verifier Results
      I : Requires
      E : EDG-1, sphingosine_kinase, PDGF
      R : cell_migration
    Verified Biological Extracted Events
      Ev1: Requires (I) sphingosine_kinase (E) cell_migration (R)
      Ev2: Requires (I) EDG-1 (E) cell_migration (R)
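The filtering step in this example can be sketched as a role check. The sketch takes the verifier's role assignments as given (the Maximum Entropy classifier itself is not modelled), and the function and variable names are hypothetical.

```python
def verify_events(events, roles):
    """Keep events whose components were all classified into the matching role.

    events: list of (interaction, effecter, reactant) tuples.
    roles: dict with keys 'I', 'E', 'R' mapping to sets of verified elements.
    """
    kept = []
    for interaction, effecter, reactant in events:
        if (interaction in roles["I"]
                and effecter in roles["E"]
                and reactant in roles["R"]):
            kept.append((interaction, effecter, reactant))
    return kept


EXTRACTED = [
    ("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
    ("Requires", "EDG-1", "cell_migration"),               # Ev2
    ("Requires", "EDG-1", "PDGF"),                         # Ev3
]
ROLES = {
    "I": {"Requires"},
    "E": {"EDG-1", "sphingosine_kinase", "PDGF"},
    "R": {"cell_migration"},
}
verified = verify_events(EXTRACTED, ROLES)
```

Ev3 is dropped because PDGF was classified as an effecter, not a reactant, which matches the verified events listed in the example.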
  • 48. POSBIOTM/Event System
      • 500 Medline abstracts including 2,314 biological events & 10-fold cross validation
        • Flat rule learner vs. two-level rule learner
        • Before verification vs. after verification
      • Performance comparison
          • Learning Information Extractors for Proteins and their Interactions (2004) - Razvan Bunescu et al.
          • 1000 abstracts & 10-fold cross validation
    • Experiment and Discussion
    System                                         Precision(%)   Recall(%)   F-measure
    Flat rule learner, before verification         38.3           58.0        46.1
    Flat rule learner, after verification          54.7           49.2        51.8
    Two-level rule learner, before verification    38.2           68.0        48.9
    Two-level rule learner, after verification     53.1           56.1        54.6
    Comparison system                              39             63          48.2
  • 49. POSBIOTM/Event System
      • Trade-off between precision and recall
        • Before verification: large gap between precision and recall
        • After verification: small gap between precision and recall
          • Threshold: rules are cut according to a measure of how many of the events extracted by a rule are correct
    • Experiment and Discussion
  • 50. POSBIOTM/Event System
      • Consistently good performance regardless of the threshold of the rule learner
    • Experiment and Discussion
  • 51. Other Corpora for Bio-Relation Extraction
    • BC-PPI
      • From BioCreative Corpus for NER
      • Protein/Gene interactions
      • 255 interactions in 1000 sentences
    • IEPA
      • Protein/Protein interactions
      • 410 interactions in 498 sentences
    • LLL05
      • Protein/Gene interactions
      • 271 interactions in 80 sentences
    • BioText
      • Disease/Treatment relations
  • 52. Contents
    • Introduction
    • POSBIOTM/W Workbench
    • POSBIOTM/NER System
    • POSBIOTM/NER with Active Machine Learning
    • POSBIOTM/Event System
    • Current status
  • 53. Current Status & future works
    • Re-implemented in Java (platform independent)
    • To be integrated with J-Designer in the SBW consortium
    • Integrated with the active learning method to automatically suggest sentences for human annotation
    • Used for national large-scale BIT fusion projects: search for useful peptides (usable as ligands for drugs)
    • Getting more feedback from biologists
    • The system gets smarter with more usage: workbench + active learning
    Workbench Demo