
Learning to Extract Relations for Protein Annotation

Explains how to learn information extraction rules from unannotated corpora using inductive logic programming.


  1. Learning to Extract Relations for Protein Annotation. Jee-Hyub Kim (a), Alex Mitchell (b,c), Teresa K. Attwood (b,c), Melanie Hilario (a). (a) University of Geneva, (b) University of Manchester, (c) European Bioinformatics Institute. ISMB/ECCB 2007, 23 July 2007.
  2. Contents: Introduction, Related Work, Problem and Approach, Methods, Experimental Results, Conclusion.
  3. Introduction: Protein Annotation. Definition: given a protein sequence or name, describe the protein with all relevant information. Two main methods: sequence analysis (e.g. BLAST, CLUSTALW) and literature analysis. Traditionally, annotation has been done manually by human annotators, and automation is needed; text-mining has been used for this. Two main tasks in text-mining: information retrieval (IR), to retrieve relevant documents, and information extraction (IE), to extract certain pieces of information from text for pre-defined entities and their relations.
  4. Introduction: Adaptive Information Extraction. IE is domain-specific: generally, a system developed for one domain cannot be reused in a new one, and developing IE systems requires a significant amount of domain knowledge. Developing IE rules and defining relations are the two main bottlenecks, yet many new domains (e.g., cell cycle, tissue specificity) still need to develop their own IE systems.
  5. Related Work. IE System Development: Developing IE Rules. Knowledge engineering (KE) based approach: knowledge engineers write hand-crafted rules with the help of domain experts (e.g., biologists); not scalable. Machine learning (ML) based approach: to increase the robustness and coverage of IE rules, ML has been used to learn them, from annotated corpora, pre-labelled corpora, or raw corpora. Labor required to develop IE systems: KE > ML (annotated corpora > pre-labelled corpora > raw corpora).
  6. Related Work. IE System Development: Defining Relations. All previous IE systems assume the relations to be extracted are already defined, but it is hard to specify precisely all possible relations to extract, especially in complex and dynamically evolving domains (e.g., biology). Positioning of ML-based approaches:
     Corpora      | Relations pre-defined                                      | Relations not defined
     Annotated    | Soderland (1999), Freitag (2000), Califf and Mooney (2003) |
     Pre-labelled | Riloff (1996)                                              | Our work
     Raw          | Hasegawa et al. (2004)                                     | Collier (1996)
  7. Problem and Approach: Our Problem. Goal: to alleviate the burden of developing IE systems for biologists. Problem definition: given relevant sentences that describe protein X in terms of any topic Y, and irrelevant sentences, learn to extract relations for protein annotation. Two sub-problems: what to extract, and how to extract it.
  8. Problem and Approach: Bottom-up Approach. Figure: from sentences to relations.
  9. Methods: System Architecture.
  10. Methods. Analyzing Sentences: MBSP. The Memory-Based Shallow Parser (MBSP), developed by Walter Daelemans, was extended with named entity taggers and an SVO relation finder. It provides various types of information (POS, SVO, NE, etc.) and was adapted to the biological domain on the basis of the GENIA corpus: 97.6% accuracy on POS tagging, 71.0% accuracy on protein named entity recognition.
  11. Methods. Analyzing Sentences: Example. INPUT: Examples of this are the RNA-binding protein containing the RNA-binding domain (RBD) ... OUTPUT:
     Chunk                        | Syntactic   | Semantic | SVO relation
     Examples                     | noun phrase |          | subject of 'are'
     of                           | preposition |          |
     this                         | noun phrase |          |
     are                          | verb phrase |          |
     the RNA-binding protein      | noun phrase | protein  | subject of 'contain'
     containing                   | verb phrase |          |
     the RNA-binding domain (RBD) | noun phrase | domain   | object of 'contain'
  12. Methods. Learning IE Rules: Inductive Logic Programming (ILP). We applied ILP to learn IE rules. ILP is an ML algorithm that induces rules from examples; its outputs are readable and interpretable by domain experts, and it can deal with relational information (e.g., parse trees). Problem setting: B ∧ H ⊨ E. Given B (background knowledge) and E (examples), find H (hypothesis).
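The ILP setting above can be illustrated with a toy generate-and-test sketch (hypothetical Python, not the actual learner used in the talk): single-slot candidate patterns play the role of H, parsed SVO triples stand in for B, and relevant/irrelevant sentences are the positive/negative examples E.

```python
# Toy illustration of B ∧ H ⊨ E: keep candidate rules that cover some
# positive example and no negative example. All data here is invented.
positives = [("protein", "contain", "domain")]    # relevant sentences as SVO triples
negatives = [("protein", "lack", "function")]     # irrelevant sentences

# Candidate single-slot patterns: (trigger verb, object head noun).
candidates = [("contain", "domain"), ("lack", "function")]

def covers(rule, example):
    """A rule covers an SVO triple if verb and object head match."""
    verb, obj = rule
    return example[1] == verb and example[2] == obj

hypothesis = [r for r in candidates
              if any(covers(r, e) for e in positives)
              and not any(covers(r, e) for e in negatives)]
print(hypothesis)  # [('contain', 'domain')]
```

A real ILP learner searches a far larger space of clauses and tolerates noise, but the consistency check against positives and negatives is the same idea.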
  13. Methods. Learning IE Rules: Representation. B (background knowledge): linguistic heuristics for single-slot IE patterns, e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun prep <np>, etc., plus sentence descriptions (i.e., analyzed sentences). E (examples): positive and negative examples (i.e., relevant and irrelevant sentences). H (hypothesis): a set of IE rules.
  14. Methods. Learning IE Rules: Generalization. Figure: from a sentence to an IE rule.
  15. Methods. Learning IE Rules: Step 1. Figure: from a sentence to an IE rule.
  16. Methods. Learning IE Rules: Step 2. Figure: from a sentence to an IE rule.
  17. Methods. Learning IE Rules: Step 3. Figure: from a sentence to an IE rule.
  18. Methods. Learning IE Rules: Step 4. Figure: from a sentence to an IE rule. Rules are scored by weighted relative accuracy: WRAcc(Rule) = coverage(Rule) × (accuracy(Rule) − accuracy(Head ← true)).
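The WRAcc score above can be computed directly from a rule's coverage counts; a minimal sketch (the counts in the example call are invented for illustration):

```python
def wracc(tp, fp, n_pos, n_total):
    """Weighted relative accuracy of a rule.

    tp/fp: positive/negative examples covered by the rule;
    n_pos: total positive examples; n_total: total examples.
    """
    covered = tp + fp
    coverage = covered / n_total   # coverage(Rule)
    accuracy = tp / covered        # accuracy(Rule)
    default = n_pos / n_total      # accuracy(Head <- true), the class prior
    return coverage * (accuracy - default)

# A rule covering 9 positives and 1 negative, in a set of 100 examples
# of which 40 are positive: 0.1 * (0.9 - 0.4) = 0.05
print(wracc(9, 1, 40, 100))
```

WRAcc trades off how much of the data a rule covers against how much better than the class prior it is on that cover, which penalises both trivial and over-specific rules.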
  19. Methods: Selecting IE Rules. Why is this step necessary? Rules are learned to classify sentences, not to extract information, so spurious IE rules emerge from the previous step and need to be filtered out by domain experts. Rules are presented to users together with supporting information. Example RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]. S: Myocilin is a secreted glycoprotein that forms multimers and contains a leucine zipper and an olfactomedin domain.
  20. Methods: Transforming Rules. Once IE rules are selected, they are transformed into relations. Mapping between IE rules and relations: the trigger becomes the relation name, the syntactic tag gives the argument position, and the extracted argument fills the argument slot. Example: <subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y).
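The rule-to-relation transformation can be sketched as a small string rewrite (a hypothetical simplification; the real rules carry more structure than this regex assumes):

```python
import re

def rule_to_relation(rule):
    """Sketch of the slide-20 mapping: the vp trigger becomes the relation
    name and the bracketed variables become the argument list.

    '<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]' -> 'contain(X,Y)'
    """
    trigger = re.search(r"vp:(\w+)", rule).group(1)   # relation name
    args = re.findall(r"\[(\w)\]", rule)              # argument variables, in order
    return "%s(%s)" % (trigger, ",".join(args))

print(rule_to_relation("<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]"))
# contain(X,Y)
```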
  21. Methods: Grouping Rules. Post-processing rules:
     Pattern                                            | Example
     A verb (active form) B                             | A activate B
     B be ed-participle by A                            | B be activated by A
     nominal form (with suffix -tion) of verb of B by A | activation of B by A
     A be nominal form (with suffix -or) of verb of B   | A is an activator of B
     A be ... that verb (active form) B                 | A is ... that activates B
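The grouping step can be sketched as argument normalisation: surface forms whose first argument is the patient (the passive and the "-tion" nominalisation) get their arguments swapped so that every variant reduces to verb(agent, patient). The form labels below are invented for illustration:

```python
def to_canonical(form, first_arg, second_arg, verb):
    """Reduce a surface-pattern match to the canonical relation verb(agent, patient).

    `form` is a hypothetical label for which slide-21 pattern matched;
    in patient-first forms the arguments appear in reversed order.
    """
    patient_first = {"passive", "nominal_tion"}   # "B be activated by A", "activation of B by A"
    if form in patient_first:
        first_arg, second_arg = second_arg, first_arg
    return f"{verb}({first_arg},{second_arg})"

# "B be activated by A" and "A activate B" yield the same relation:
print(to_canonical("passive", "B", "A", "activate"))  # activate(A,B)
print(to_canonical("active", "A", "B", "activate"))   # activate(A,B)
```

Grouping the five surface patterns under one canonical trigger is what lets rules learned from different sentence forms be pooled into a single relation.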
  22. Methods: Applying Rules to Extract Relations. Now we have a set of IE rules for each relation. IE rule and relation: RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]; TRIGGER: promote; RELATION: promote(X,Y). Example INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at least in part, through induction of Cox-2, suppression of TGF-beta signaling, and establishment of a TGF-beta-resistant, hyperproliferative state in the colonic epithelium. OUTPUT: promote('PKC beta II','colon cancer').
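Rule application can be sketched against a pre-parsed SVO triple (a hypothetical simplification of the MBSP output; the real system matches full chunk patterns with semantic tags):

```python
def apply_rule(svo, trigger, relation):
    """If the parsed triple's verb matches the rule trigger, instantiate
    the relation with subject and object as arguments; else no match."""
    subj, verb, obj = svo
    if verb == trigger:
        return "%s(%r,%r)" % (relation, subj, obj)
    return None

# Lemmatised SVO triple from the slide's example sentence (parsing assumed done):
svo = ("PKC beta II", "promote", "colon cancer")
print(apply_rule(svo, "promote", "promote"))
# promote('PKC beta II','colon cancer')
```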
  23. Experimental Results: Experiments. Our IE system was applied to annotation for PRINTS (a protein family database). Development corpora (80% for training, 20% for test):
     Topic     | Positives | Negatives | Class distribution
     Disease   | 777       | 1403      | 36-64%
     Function  | 1268      | 2625      | 33-67%
     Structure | 1159      | 1750      | 40-60%
  24. Experimental Results: Evaluation. All extracted relations were manually evaluated by domain experts. Results:
     Topic     | Learned rules | Selected rules | Relations | Precision | Recall | F1-measure
     Disease   | 55            | 32             | 21        | 75        | 18.3   | 29.4
     Function  | 125           | 64             | 23        | 66.3      | 15.1   | 24.6
     Structure | 146           | 76             | 20        | 85.3      | 61     | 71.1
     cf. a recent work on extracting regulatory gene/protein networks reports an F1-measure of 44% (Saric et al., 2004).
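The F1-measure column of the results table follows from the precision and recall columns by the usual harmonic mean; checking the Disease row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both here in percent)."""
    return 2 * precision * recall / (precision + recall)

# Disease row: P = 75, R = 18.3  ->  F1 ≈ 29.4, matching the table.
print(round(f1(75, 18.3), 1))    # 29.4
# Structure row: P = 85.3, R = 61  ->  F1 ≈ 71.1, also matching.
print(round(f1(85.3, 61), 1))    # 71.1
```

The spread between the Disease/Function rows and the Structure row shows the cost of low recall: even at 75% precision, 18.3% recall caps F1 below 30%.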
  25. Experimental Results: Protein Annotation Examples. Discovered relations. Disease: be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc. Function: induce, block, mediate, is_a, belong, act, etc. Structure: contain, form, share, lack, bind, encode, be_conserved, etc. Annotation example for protein NF-kappaB: be_implicated(,'NF-kappaB', in:'the pathogenesis'); regulate('IkappaBalpha', 'NF-kappaB'); activate('BCMA', 'NF-kappaB'); be_composed('NF-kappaB', of:'heterodimeric complexes').
  26. Experimental Results: Limitations. Failure to find inverse relations. S: Expression of uPAR in tumor extracts also inversely correlates with prognosis in many forms of cancer. RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]. RELATION: correlate(expression('expression', of:'uPAR'), with:'prognosis'). The anaphora problem. S: Whereas the overall structure resembles that of the NF-kappaB p50-DNA complex, pronounced differences are observed within the 'insert region'. RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]. RELATION: resemble('the overall structure','that').
  27. Conclusion. Proposed a methodology for developing IE systems with resources that can be provided by biologists. Learned relations as well as IE rules without annotated corpora. Annotated proteins with structured information (i.e., predicate-argument structure) in terms of any topic. Validated the methodology over different topics (function, structure, disease, cancer) in the biomedical domain. Advantage: it will alleviate the burden of developing IE systems for users who have little or no formal IE training.
  28. Thank you for your attention!
  29. Appendix: For Further Reading I. Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177-210, 2003. R. Collier. Automatic template creation for information extraction: an overview. 1996. Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169-202, 2000.
  30. Appendix: For Further Reading II. Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In ACL, pages 415-422, 2004. Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044-1049, 1996.
  31. Appendix: For Further Reading III. Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza Ouzounova, and Isabel Rojas. Extracting regulatory gene expression networks from PubMed. In ACL, pages 191-198, 2004. Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233-272, 1999.
