Presentation material
    • A Bio Text Mining Workbench Combined with Active Machine Learning (Gary Geunbae Lee, Postech; 11/25, LBM2005)
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status (demo)
    • Introduction
      • Exponentially growing biological publications
    • Introduction
      • Two key issues in dealing with biological texts:
        • Biological named entity recognition.
        • Extraction of the biological interactions (events) between biological entities.
          • Important for biological pathways.
      Biological Papers
    • Introduction
      • Development workbench (common in NLP)
        • Grammar development workbench
        • POS/Tree Tagging workbench
      • Use of large corpora
        • Machine learning methods are used in the NER task and the event extraction task.
        • An annotated corpus is essential for good results with machine-learning-based methods (in both quantity and quality).
        • Annotated corpora are lacking (notoriously so in the bio/medical fields).
      • Need
        • Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources.
        • Tools that interact with the automatic system to grow the high-quality annotated corpus.
      • Bio-text mining workbench
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status
    • POSBIOTM/W: A Development Workbench
      • Overall Design
    • POSBIOTM/W Workbench
      • Goal
        • help users to search, collect and manage publications.
      • Quick Search Bar
        • provides quick access to PubMed.
      • Pubmed Search Assistant
        • Users can select specific abstracts to do the named-entity tagging and event extraction
      • Managing Tool
    • POSBIOTM/W Workbench
      • Managing Tool
      • Pubmed search Assistant
    • POSBIOTM/W Workbench
      • Named-entity recognition (NER) task
        • Identification of the relevant material names.
      • Goal: automatically and effectively annotate biomedical-related entities.
      • NER Tool is a Client Tool of POSBIOTM/NER System
        • Currently, three NER models are provided:
        • the GENIA-NER model, the GENE-NER model and the GPCR-NER model
      • Named-entity recognition with Active learning
        • To minimize the human labeling effort
      • NER Tool
    • POSBIOTM/W Workbench
      • NER Tool
      • Named-entity recognition with Active learning
    • POSBIOTM/W Workbench
      • Goal: To extract the events which consist of “interaction”, “effecter”, and “reactant”
      • Named-entity types: protein (P), gene (G), small molecule (SM), and cellular process (CP).
      • Interaction types: biological interaction (BI) and chemical interaction (CI).
      • Event Extraction Tool is a Client Tool of POSBIOTM/Event System
      • Event Extraction Tool
    • POSBIOTM/W Workbench
      • Extraction Result in XML format
      • Event Extraction Tool
      <Result>
        <NER>
          ....
          <Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> (<protein>GPCR</protein>) for <small_molecule>sphingosine-1-phosphate</small_molecule> (<small_molecule>SPP</small_molecule>) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.</Sentence>
          .....
        </NER>
        <Event_Extraction>
          <Event SNum="4">
            <Interaction>stimulate</Interaction>
            <Effecter>sphingosine-1-phosphate</Effecter>
            <Reactant>angiogenesis</Reactant>
          </Event>
          .....
        </Event_Extraction>
      </Result>
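      A result file in this format can be read with a standard XML parser. The following is a minimal sketch, assuming the element and attribute names shown on the slide; the sample document is abridged for illustration and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged sample modeled on the slide's example (an assumption: the real
# output contains more sentences and events).
RESULT_XML = """
<Result>
  <NER>
    <Sentence SNum="4"><small_molecule>sphingosine-1-phosphate</small_molecule> has been shown to stimulate <cellular_process>angiogenesis</cellular_process>.</Sentence>
  </NER>
  <Event_Extraction>
    <Event SNum="4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
  </Event_Extraction>
</Result>
"""


def read_events(xml_text):
    """Return (sentence number, interaction, effecter, reactant) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (
            int(ev.get("SNum")),
            ev.findtext("Interaction"),
            ev.findtext("Effecter"),
            ev.findtext("Reactant"),
        )
        for ev in root.iter("Event")
    ]


def read_entities(xml_text):
    """Return (entity type, entity text) pairs from the tagged sentences."""
    root = ET.fromstring(xml_text.strip())
    return [(child.tag, child.text)
            for sentence in root.iter("Sentence")
            for child in sentence]
```

      Calling read_events(RESULT_XML) yields one tuple per Event element, which is convenient for filling the event table of the annotation tool.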
    • POSBIOTM/W Workbench
      • Extraction Result
      • Event Extraction Tool
    • POSBIOTM/W Workbench
      • Goal
        • The GUI-based annotation tool is designed for manipulating annotations manually.
      • Named-entity editing
        • Named entities are displayed in different colors, which can be changed.
        • Users can add, remove or correct named-entity tags, change the boundaries of named entities, etc.
      • Annotation Tool
    • POSBIOTM/W Workbench
      • Event editing
        • extracted events are displayed in a table
        • double-clicking an event looks up the original sentence from which it was extracted
      • Upload function
        • Users can upload the well-annotated data to the POSBIOTM system
        • This allows the incremental build-up of a large named-entity and event annotation corpus.
      • Annotation Tool
    • POSBIOTM/W Workbench
      • Annotation Tool
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status
    • POSBIOTM/NER System
      • Approach
        • The named entity recognition problem is regarded as a classification problem: each input token is marked up with a named entity category label.
      • CRF
        • Conditional random fields (CRFs) [Lafferty et al. 2001] are a probabilistic framework for labeling and segmenting sequential data. (s: state (tag); o: input)
        • For example:
      • Named Entity Recognition (NER)
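      The slide's own example did not survive in this transcript. For reference, the standard linear-chain CRF from Lafferty et al. can be written with the slide's symbols (s for the state sequence, o for the input sequence); the feature functions f_k, weights λ_k and sequence length T are notation assumed here, not taken from the slide.

```latex
P(\mathbf{s} \mid \mathbf{o})
  = \frac{1}{Z(\mathbf{o})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o})
  = \sum_{\mathbf{s}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)
```

      The tag sequence assigned to a sentence is the s that maximizes this probability.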
    • POSBIOTM/NER System
      • Feature Set
      • Named Entity Recognition (NER)
      Feature: Description
        • Lexical word: only in the case that the previous/current/next words are in the surface word dictionary.
        • Word feature: orthographical feature of the previous/current/next words. Upper case letters, numbers, non-alphabet letters. Greek words – alpha cells, beta hemolysis, tau interferon.
        • Prefix/suffix: prefixes/suffixes which are contained in the prefix/suffix dictionary. Biological prefix, suffix concept – ase, blast, cyt, phore, plast.
        • Part-of-speech tag: POS tag of the previous/current/next words. The part of speech is the term used to describe how a particular word is used, e.g. noun, verb, etc.
        • Base noun phrase tag: base noun phrase tag of the previous/current/next words.
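      As an illustration of the orthographic and prefix/suffix features over a previous/current/next window, here is a small sketch. The affix list and Greek-word set only contain the examples named on the slide, and the feature names are hypothetical; the POS and base-noun-phrase features would come from external taggers and are left out.

```python
# Tiny stand-ins for the dictionaries (assumption: the real dictionaries
# are much larger than the examples listed on the slide).
BIO_AFFIXES = ("ase", "blast", "cyt", "phore", "plast")
GREEK_WORDS = {"alpha", "beta", "tau"}


def word_features(token):
    """Orthographic and affix features for a single token."""
    low = token.lower()
    affix = next(
        (a for a in BIO_AFFIXES if low.startswith(a) or low.endswith(a)),
        None,
    )
    return {
        "has_upper": any(c.isupper() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "has_non_alnum": any(not c.isalnum() for c in token),
        "is_greek": low in GREEK_WORDS,
        "bio_affix": affix,
    }


def token_features(tokens, i):
    """Collect features of tokens[i] and its previous/next neighbours."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            for key, value in word_features(tokens[j]).items():
                feats[f"{name}:{key}"] = value
    return feats
```

      Each token's feature dictionary would then be passed to the sequence labeler together with the POS and chunk tags.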
    • POSBIOTM/NER System
      • Three NER models
        • GENIA model / GENE-NER model / GPCR-NER model
      • GENIA model
        • The named entity classes used in the evaluation :
        • DNA, RNA, protein and cell_line, cell_type
        • The training data consists of 2000 MEDLINE abstracts of the GENIA version 3 corpus. These abstracts were collected using the search terms “human”, ”blood cell”, “transcription factor”.
        • The testing data will come from a super-domain of the training data (“blood cell”, ”transcription factor”).
      • NER Models
    • POSBIOTM/NER System
      • GENE-NER model
        • GENE-NER module uses BioCreative corpus.
        • The aim of the GENE-NER module is to identify which terms in biomedical research articles are gene and/or protein names.
        • The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names.
      • GPCR-NER module (Postech)
        • aims at recognizing four target named entity categories:
        • protein, gene, small molecule and cellular process.
        • The training corpus consists of 50 full articles related to the GPCR (G-protein coupled receptor) signal transduction pathway.
      • NER Models
    • POSBIOTM/NER System
      • Evaluation for Three NER models
      • NER Models
      Corpus       Precision   Recall   F-Measure
      GENIA-NER    0.6960      0.6929   0.6945
      GPCR-NER     0.6736      0.8135   0.7370
      GENE-NER     0.7550      0.8404   0.7982
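      The F-measure is conventionally the harmonic mean of precision and recall. A short sketch, assuming the balanced (F1) form, for checking the reported rows:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported (precision, recall) pairs for two of the models.
genia = f_measure(0.6960, 0.6929)
gpcr = f_measure(0.6736, 0.8135)
```

      The GENIA-NER and GPCR-NER values agree with the reported F-measures to within rounding. The GENE-NER row gives about 0.795 this way rather than the reported 0.7982, so that figure may have been computed differently.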
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status
    • POSBIOTM/NER with Active Learning
      • NER with Machine Learning
        • To enhance the NER performance through the idea of re-using the annotated data and re-training the NER module
      • NER with Active Machine Learning
        • To minimize the human labeling effort without degrading the performance
        • To select the most informative samples for training
      • Active Learning in NER
    • POSBIOTM/NER with Active Learning
      • Active Learning in NER Framework
    • POSBIOTM/NER with Active Learning
      • Uncertainty-based Sample Selection
        • Using an entropy-based measure to quantify the uncertainty that the current classifier holds (entropy or normalized entropy of the CRF conditional probability)
        • The most uncertain samples are selected for human annotation
      • Active Learning Scoring Strategy
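      A minimal sketch of entropy-based selection follows. It assumes each sentence comes with the probabilities of a few candidate labelings (for example an N-best list from the CRF) and that normalization divides by the maximum entropy log N; both are assumptions, since the slide's formulas are not in the transcript.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log(N), so scores lie in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    return entropy(probs) / math.log(n)


def most_uncertain(pool, k):
    """Return the ids of the k sentences with the highest uncertainty.

    pool maps a sentence id to the probabilities of its candidate labelings.
    """
    ranked = sorted(pool, key=lambda sid: normalized_entropy(pool[sid]),
                    reverse=True)
    return ranked[:k]
```

      The selected sentences are the ones handed to the human annotator.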
    • POSBIOTM/NER with Active Learning
      • Diversity-based Sample Selection
        • To catch the most representative sentences in each sampling.
        • The divergence between two sentences is measured by the minimum similarity among the examples
        • The similarity score of two words
        • The similarity score of two sentences
      • Active Learning Scoring Strategy
      (for syntactic path)
    • POSBIOTM/NER with Active Learning
      • MMR(Maximal Marginal Relevance) method
        • The two measures for uncertainty and diversity will be combined using the MMR method to give the sampling scores in our active learning strategy
      • Active Learning Scoring Strategy
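      The MMR combination can be sketched as a greedy selection that rewards uncertainty and penalizes similarity to sentences already chosen. The weight lam, the Jaccard token similarity and all names below are illustrative assumptions; the slides define their own word and sentence similarity scores, which are not reproduced in the transcript.

```python
def jaccard(tokens_a, tokens_b):
    """Token-set overlap, used here as a stand-in sentence similarity."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def mmr_select(candidates, uncertainty, k, lam=0.8, sim=jaccard):
    """Greedy MMR: lam * uncertainty - (1 - lam) * max similarity to picks.

    candidates maps a sentence id to its tokens; uncertainty maps a
    sentence id to its uncertainty score.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(cid):
            redundancy = max(
                (sim(candidates[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * uncertainty[cid] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

      With lam=1.0 the procedure reduces to pure uncertainty sampling; lower values trade uncertainty for diversity.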
    • POSBIOTM/NER with Active Learning
      • Training Data
        • 2,000 MEDLINE abstracts from the GENIA corpus
        • 5 named entity classes
          • DNA, RNA, protein, cell line, cell type
      • Test Data
        • 404 abstracts
        • Half of them are from the same domain as the training data and the other half are from the super-domain of ‘blood cell’ and ‘transcription factor’
      • Experiment and Discussion
    • POSBIOTM/NER with Active Learning
      • Pool-based sample selection
        • 100 abstracts were used to train initial NER module
        • Each time, we chose k examples (sentences) from the given pool to train the new NER module
        • The number k varied from 1,000 to 17,000 with step size 1,000
      • Active learning methods for test
        • Random selection
        • Entropy based uncertainty selection
        • Entropy combined with Diversity
        • Normalized Entropy combined with Diversity
      • Experiment and Discussion
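      The experimental protocol can be outlined as a loop over the selection sizes. This is only a sketch under one reading of the slide: an initial model is trained on the seed abstracts, and for each k a strategy picks k sentences from the pool for retraining. The select, train and evaluate callables are hypothetical placeholders for the real components.

```python
def selection_experiment(seed, pool, select, train, evaluate, ks):
    """Score one selection strategy at several training-set sizes.

    select(model, pool, k) returns k items chosen from the pool;
    train(data) returns a model; evaluate(model) returns its score.
    """
    base_model = train(seed)  # initial NER module from the seed data
    scores = {}
    for k in ks:
        chosen = select(base_model, pool, k)
        model = train(list(seed) + list(chosen))
        scores[k] = evaluate(model)
    return scores


# Selection sizes used on the slide: 1,000 to 17,000 in steps of 1,000.
SLIDE_KS = range(1000, 17001, 1000)
```

      Running the same loop with random, entropy-based and combined selectors gives one learning curve per strategy for comparison.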
    • POSBIOTM/NER with Active Learning
      • Experiment and Discussion
    • POSBIOTM/NER with Active Learning
      • All three active learning strategies outperform random selection
        • The combined strategy reduces the number of training examples by 24.64% compared with random selection
        • The normalized combined strategy reduces the number of training examples by 35.43% compared with random selection
      • Diversity increases the classifier’s performance when a large number of samples are selected
        • Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly
        • After the 11,000-sentence point, the combined strategy surpasses the entropy strategy
      • Experiment and Discussion
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status
    • POSBIOTM/Event System
      • System Architecture
    • POSBIOTM/Event System
        • Template Element
          • Entities - participants of an event
            • protein (P), gene (G), small molecule (SM), cellular process (CP)
          • Interaction - relationship between entities
            • biological interaction (BI) – Functional interaction
              • About how/whether one component affects the other's status biologically
            • chemical interaction (CI) – Molecular interaction
              • About the interaction among entities at the molecular structural level
        • Event
          • One Interaction (I)
            • Connecting the effecter and reactant
            • Interaction keywords (BI, CI)
          • One Effecter (E)
            • Provoking an event
            • Template element (P, G, SM, CP) or nested event
          • One Reactant (R)
            • Responding to an effecter
            • Template element (P, G, SM, CP) or nested event
      • Target Slot Definition
    • POSBIOTM/Event System
      • Target Slot Definition
      • Example
      • Template Element
        • Entities : PDGF (P), SPP (SM), Cell movement (CP)
        • Interaction keywords : cross-talk (BI), require (BI)
      • Event
        • cross-talk (I) : PDGF (E) : SPP (R)
        • require (I) : cross-talk (E) : cell movement (R)
      The cross-talk between PDGF and SPP is required for these embryonic cell movements.
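      The slot definition, including nested events, maps naturally onto a small recursive data structure. The class and field names below are illustrative assumptions, not the system's actual types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # "P", "G", "SM" or "CP"


@dataclass(frozen=True)
class Event:
    interaction: str
    interaction_kind: str  # "BI" or "CI"
    effecter: Union[Entity, "Event"]  # template element or nested event
    reactant: Union[Entity, "Event"]


def nesting_depth(node):
    """0 for an entity, 1 for a flat event, 2 for an event of events, ..."""
    if isinstance(node, Entity):
        return 0
    return 1 + max(nesting_depth(node.effecter), nesting_depth(node.reactant))


# The slide's example sentence as two events, the second nesting the first.
cross_talk = Event("cross-talk", "BI", Entity("PDGF", "P"), Entity("SPP", "SM"))
require = Event("require", "BI", cross_talk, Entity("cell movement", "CP"))
```

      Here require has the cross-talk event as its effecter, matching the "require (I) : cross-talk (E) : cell movement (R)" line of the example.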
    • POSBIOTM/Event System
      • Sentence boundary detection
      • Annotating Named Entity (NER)
        • Protein
        • Small molecule
        • Gene
        • Cellular process
      • Compound/Complex Sentence Splitter
        • To simplify the complicated full texts
      • Pre-Processor
    • POSBIOTM/Event System
      • Compound/Complex Sentence Splitter
        • Simple splitting rules
          • [S] NP1 VP1 NP2 [SBAR] that|which VP2 [/SBAR] [/S]
            • → NP1 VP1 NP2 + NP2 VP2
        • Example
          • “The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.”
            • ==> 1. “The best studied of these is EDG-1.”
            • 2. “EDG-1 is implicated in cell migration and angiogenesis.”
      • Pre-Processor
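      The splitting rule operates on parsed phrases. As a rough, string-level approximation, the sketch below handles only the case where NP2 is a single capitalized token directly before ", which" or ", that"; this simplification and the function name are assumptions made for illustration.

```python
import re

# main = "NP1 VP1 NP2", np2 = last token before the relative clause,
# vp2 = the relative clause's verb phrase.
RELATIVE_CLAUSE = re.compile(
    r"^(?P<main>.*?\b(?P<np2>[A-Z][\w-]*)), (?:which|that) (?P<vp2>.+?)\.$"
)


def split_relative_clause(sentence):
    """Split 'NP1 VP1 NP2, which VP2.' into 'NP1 VP1 NP2.' and 'NP2 VP2.'"""
    sentence = sentence.strip()
    match = RELATIVE_CLAUSE.match(sentence)
    if not match:
        return [sentence]
    return [
        f"{match.group('main')}.",
        f"{match.group('np2')} {match.group('vp2')}.",
    ]
```

      Applied to the slide's example sentence, this returns the two simple sentences listed there; sentences without a matching relative clause are returned unchanged.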
    • POSBIOTM/Event System
      • Two-level Event Rule Learner
      • Biological Event Extraction
    • POSBIOTM/Event System
      • Event Rule Learner
        • Adapt a supervised machine learning algorithm: WHISK
          • learns rules in the form of context-based regular expressions
          • induces the rules in a top-down manner
            • Ex) “{NP} .*? (<CP>)[E] {/NP} {VP} (<BI>)[I] {/VP} {NP} both (<P>)[R] and .*? {/NP}”
        • Limitations of WHISK
          • The longer the distance between event components, the more difficult it is to extract the correct event
            • WHISK considers all lexical words between event components
          • Cannot handle nested biological events
        • We propose a two-level rule learning method to overcome the limitations of the flat rule learning method
      • Biological Event Extraction
    • POSBIOTM/Event System
      • Two-level Event Rule Learner
      • Biological Event Extraction
      Annotated sentence:
        {NP} <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}
        <TAGS> B {interaction cross-talk} {effecter PDGF} {reactant SPP}
        <TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}
      1. Mark the long NP boundary.
      2. Learn the short-span rule corresponding to the NP:
        “<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM>” → “{NP} (<BI>)[I] between (<P>)[E] and (<SM>)[R] {/NP}”
      3. Re-annotate the short-span interaction as one noun with regular expression format.
      4. Learn the long-span rule with the re-annotated sentence:
        {NP} <E>cross-talk_between_PDGF_and_SPP</E> {/NP} {VP} is <BI>required</BI> {/VP} for {NP} these embryonic <CP>cell_movements</CP> {/NP}
        <TAGS> B {interaction require} {effecter cross-talk} {reactant cell movement}
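      Applying a learned pair of rules at extraction time can be sketched with ordinary regular expressions. The two patterns below are hand-written translations of the slide's short-span and long-span rules for this one sentence, not output of the rule learner, and the function name is hypothetical.

```python
import re

# Short-span rule: "{NP} (<BI>)[I] between (<P>)[E] and (<SM>)[R] {/NP}"
SHORT_RULE = re.compile(
    r"\{NP\} <BI>(?P<I>[^<]+)</BI> between <P>(?P<E>[^<]+)</P> "
    r"and <SM>(?P<R>[^<]+)</SM> \{/NP\}"
)
# Long-span rule over the re-annotated sentence (illustrative translation).
LONG_RULE = re.compile(
    r"\{NP\} <E>(?P<E>[^<]+)</E> \{/NP\} \{VP\} is <BI>(?P<I>[^<]+)</BI> \{/VP\} "
    r"for \{NP\} .*?<CP>(?P<R>[^<]+)</CP> \{/NP\}"
)


def extract_two_level(sentence):
    """Extract the inner event, collapse it into one noun, then the outer."""
    events = []
    short = SHORT_RULE.search(sentence)
    if short:
        events.append(short.groupdict())
        merged = "_".join(
            [short["I"], "between", short["E"], "and", short["R"]]
        )
        # Re-annotate the short-span interaction as a single effecter noun.
        sentence = (sentence[:short.start()]
                    + "{NP} <E>" + merged + "</E> {/NP}"
                    + sentence[short.end():])
    outer = LONG_RULE.search(sentence)
    if outer:
        events.append(outer.groupdict())
    return events
```

      Because the inner noun phrase is collapsed before the long-span rule runs, the outer rule no longer has to match every word between its components.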
    • POSBIOTM/Event System
      • Event Extractor
        • To extract the events with the automatically generated rules
          • by using regular expression pattern matching
        • To handle aliases and noun conjunctions
          • aliases and noun conjunctions have general patterns like ‘sphingosine-1-phosphate(SPP)’ or ‘FP, IP, and TP receptors’
            • handle them with simple rules like ‘A(B)’ or ‘A, B, C, and D’
        • To remove sentences that include negative words
          • ‘not’, ‘never’, ‘fail’, etc.
      • Biological Event Extraction
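      The alias, conjunction and negation handling can be approximated with a few regular expressions. This is a sketch of the ‘A(B)’ and ‘A, B, C, and D’ patterns and a word-list filter; the exact rules of the system are not given, and the negative-word set contains only the three words named on the slide (inflected forms such as "fails" are not covered).

```python
import re

ALIAS = re.compile(r"([\w-]+)\s*\(\s*([\w-]+)\s*\)")  # the 'A(B)' pattern
CONJUNCTION = re.compile(r",\s*(?:and\s+)?|\s+and\s+")  # 'A, B, C, and D'
NEGATIVE_WORDS = {"not", "never", "fail"}


def find_aliases(text):
    """Map each short alias to its full name, e.g. SPP -> sphingosine-..."""
    return {short: full for full, short in ALIAS.findall(text)}


def split_conjunction(phrase):
    """Split a coordinated noun phrase into its members."""
    return [part for part in CONJUNCTION.split(phrase) if part]


def has_negation(sentence):
    """True if the sentence contains one of the listed negative words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(word in NEGATIVE_WORDS for word in words)
```

      Sentences for which has_negation returns True would be dropped before the extraction rules are applied.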
    • POSBIOTM/Event System
      • Event Component Verifier
    • POSBIOTM/Event System
      • To remove the incorrectly extracted events
      • Classify template elements (P, G, SM, CP, BI, CI) into 4 classes
        • I (interaction), E (effecter), R (reactant), N (none)
          • I, E, R : event’s components
          • N : a template element, but not an event component
      • Use a Maximum Entropy Classifier
        • Features
          • POS tag, phrase chunks, the type of template element of neighboring words and semantic information
      • Event Component Verifier
    • POSBIOTM/Event System
      • Event Component Verifier
    • POSBIOTM/Event System
      • Example
      • Event Component Verifier
      Extracted Biological Events
        Ev1: Requires (I) sphingosine_kinase (E) cell_migration (R)
        Ev2: Requires (I) EDG-1 (E) cell_migration (R)
        Ev3: Requires (I) EDG-1 (E) PDGF (R)
      Event Component Verifier Results
        I : Requires
        E : EDG-1, sphingosine_kinase, PDGF
        R : cell_migration
      Verified Biological Extracted Events
        Ev1: Requires (I) sphingosine_kinase (E) cell_migration (R)
        Ev2: Requires (I) EDG-1 (E) cell_migration (R)
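      The filtering step of the example can be sketched as a consistency check between each extracted event and the classes assigned by the verifier. The classifier itself (maximum entropy) is not shown; the role sets are taken from the slide's example, and the type and function names are assumptions.

```python
from collections import namedtuple

Event = namedtuple("Event", "interaction effecter reactant")


def verify_events(events, roles):
    """Keep events whose three slots all agree with the verifier's classes.

    roles maps "I", "E" and "R" to the set of template elements that the
    verifier assigned to that class.
    """
    return [
        ev for ev in events
        if ev.interaction in roles["I"]
        and ev.effecter in roles["E"]
        and ev.reactant in roles["R"]
    ]


extracted = [
    Event("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
    Event("Requires", "EDG-1", "cell_migration"),               # Ev2
    Event("Requires", "EDG-1", "PDGF"),                         # Ev3
]
verifier_roles = {
    "I": {"Requires"},
    "E": {"EDG-1", "sphingosine_kinase", "PDGF"},
    "R": {"cell_migration"},
}
verified = verify_events(extracted, verifier_roles)
```

      Ev3 is dropped because PDGF was classified as an effecter, not a reactant, which leaves Ev1 and Ev2 as in the slide.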
    • POSBIOTM/Event System
        • 500 Medline abstracts including 2,314 biological events & 10-fold cross validation
          • Flat rule learner vs. two-level rule learner
          • Before verification vs. after verification
        • Performance comparison
            • Learning Information Extractors for Proteins and their Interactions (2004) - Razvan Bunescu et al.
            • 1000 abstracts & 10-fold cross validation
      • Experiment and Discussion
      System                                      Precision(%)   Recall(%)   F-measure
      Flat rule learner, before verification      38.3           58.0        46.1
      Flat rule learner, after verification       54.7           49.2        51.8
      Two-level rule learner, before verification 38.2           68.0        48.9
      Two-level rule learner, after verification  53.1           56.1        54.6
      Comparison system                           39             63          48.2
    • POSBIOTM/Event System
        • Trade-off between precision and recall
          • Before verification: large gap between precision and recall
          • After verification: small gap between precision and recall
            • threshold: rules are cut according to a measure of how many of the events extracted by a rule are correct
      • Experiment and Discussion
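      The rule threshold can be read as a minimum per-rule precision. A minimal sketch under that assumption, with hypothetical rule names and counts:

```python
def prune_rules(rule_stats, threshold):
    """Keep rules whose share of correct extractions meets the threshold.

    rule_stats maps a rule to (correct extractions, total extractions).
    """
    return [
        rule
        for rule, (correct, total) in rule_stats.items()
        if total > 0 and correct / total >= threshold
    ]


# Hypothetical counts, for illustration only.
stats = {"rule_a": (9, 10), "rule_b": (3, 10), "rule_c": (0, 0)}
kept = prune_rules(stats, threshold=0.5)
```

      Raising the threshold keeps fewer, more reliable rules, which favors precision over recall.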
    • POSBIOTM/Event System
        • Constant good performance regardless of the threshold of rule learner
      • Experiment and Discussion
    • Other Corpora for Bio-Relation Extraction
      • BC-PPI
        • From BioCreative Corpus for NER
        • Protein/Gene interactions
        • 255 interactions in 1000 sentences
      • IEPA
        • Protein/Protein interactions
        • 410 interactions in 498 sentences
      • LLL05
        • Protein/Gene interactions
        • 271 interactions in 80 sentences
      • BioText
        • Disease/Treatment relations
    • Contents
      • Introduction
      • POSBIOTM/W Workbench
      • POSBIOTM/NER System
      • POSBIOTM/NER with Active Machine Learning
      • POSBIOTM/Event System
      • Current status
    • Current Status & future works
      • Re-implemented in Java (platform independent)
      • To be integrated with J-Designer in the SBW consortium
      • Integrated with the active learning method to automatically suggest samples for human annotation
      • Used for national large-scale BIT fusion projects: search for useful peptides (usable as ligands for drugs)
      • Getting more feedback from biologists
      • System gets smarter with more usage: workbench + active learning
      Workbench Demo