University of Texas at Austin
  • Outline of the talk: show a Medline abstract, highlight protein names, and then highlight protein interactions.
  • Extra slide 1
  • Show PR curves.
  • Transcript

    • 1. Learning to Extract Proteins and their Interactions from Medline Abstracts. Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong, Raymond J. Mooney (Department of Computer Sciences); Edward M. Marcotte, Arun Ramani (Institute for Cellular and Molecular Biology). University of Texas at Austin
    • 2. Biological Motivation <ul><li>Human Genome Project has produced huge amounts of genetic data. </li></ul><ul><li>Next step is analyzing and interpreting this data. </li></ul>
    • 3. (no text on this slide)
    • 4. 1 taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta accctaaccc 61 taaccctaac cctaacccta accctaaccc taaccctaac cctaacccaa ccctaaccct 121 aaccctaacc ctaaccctaa ccctaacccc taaccctaac cctaacccta accctaacct 181 aaccctaacc ctaaccctaa ccctaaccct aaccctaacc ctaaccctaa cccctaaccc 241 taaccctaaa ccctaaaccc taaccctaac cctaacccta accctaaccc caaccccaac 301 cccaacccca accccaaccc caaccctaac ccctaaccct aaccctaacc ctaccctaac 361 cctaacccta accctaaccc taaccctaac ccctaacccc taaccctaac cctaacccta 421 accctaaccc taaccctaac ccctaaccct aaccctaacc ctaaccctcg cggtaccctc 481 agccggcccg cccgcccggg tctgacctga ggagaactgt gctccgcctt cagagtacca 541 ccgaaatctg tgcagaggac aacgcagctc cgccctcgcg gtgctctccg ggtctgtgct 601 gaggagaacg caactccgcc ggcgcaggcg cagagaggcg cgccgcgccg gcgcaggcgc 661 agacacatgc tagcgcgtcg gggtggaggc gtggcgcagg cgcagagagg cgcgccgcgc 721 cggcgcaggc gcagagacac atgctaccgc gtccaggggt ggaggcgtgg cgcaggcgca 781 gagaggcgca ccgcgccggc gcaggcgcag agacacatgc tagcgcgtcc aggggtggag 841 gcgtggcgca ggcgcagaga cgcaagccta cgggcggggg ttgggggggc gtgtgttgca 901 ggagcaaagt cgcacggcgc cgggctgggg cggggggagg gtggcgccgt gcacgcgcag 961 aaactcacgt cacggtggcg cggcgcagag acgggtagaa cctcagtaat ccgaaaagcc 1021 gggatcgacc gccccttgct tgcagccggg cactacagga cccgcttgct cacggtgctg 1081 tgccagggcg ccccctgctg gcgactaggg caactgcagg gctctcttgc ttagagtggt ... 
5641 gctccagggc ccgctcacct tgctcctgct ccttctgctg ctgcttctcc agctttcgct 5701 ccttcatgct gcgcagcttg gccttgccga tgcccccagc ttggcggatg gactctagca 5761 gagtggccag ccaccggagg ggtcaaccac ttccctggga gctccctgga ctggagccgg 5821 gaggtgggga acagggcaag gaggaaaggc tgctcaggca gggctgggga agcttactgt 5881 gtccaagagc ctgctgggag ggaagtcacc tcccctcaaa cgaggagccc tgcgctgggg 5941 aggccggacc tttggagact gtgtgtgggg gcctgggcac tgacttctgc aaccacctga 6001 gcgcgggcat cctgtgtgca gatactccct gcttcctctc tagcccccac cctgcagagc 6061 tggacccctg agctagccat gctctgacag tctcagttgc acacacgagc cagcagaggg 6121 gttttgtgcc acttctggat gctagggtta cactgggaga cacagcagtg aagctgaaat 6181 gaaaaatgtg ttgctgtagt ttgttattag accccttctt tccattggtt taattaggaa 6241 tggggaaccc agagcctcac ttgttcaggc tccctctgcc ctagaagtga gaagtccaga 6301 gctctacagt ttgaaaacca ctattttatg aaccaagtag aacaagatat ttgaaatgga 6361 aactattcaa aaaattgaga atttctgacc acttaacaaa cccacagaaa atccacccga 6421 gtgcactgag cacgccagaa atcaggtggc ctcaaagagc tgctcccacc tgaaggagac 6481 gcgctgctgc tgctgtcgtc ctgcctggcg ccttggccta caggggccgc ggttgagggt 6541 gggagtgggg gtgcactggc cagcacctca ggagctgggg gtggtggtgg gggcggtggg 6601 ggtggtgtta gtaccccatc ttgtaggtct gaaacacaaa gtgtggggtg tctagggaag ... and 3x10^9 more... Starting at the tip of chromosome 1...
    • 5. Proteomics 101 <ul><li>Genes code for proteins. </li></ul><ul><li>Proteins are the basic components of biological machinery. </li></ul><ul><li>Proteins accomplish their functions by interacting with other proteins. </li></ul><ul><li>Knowledge of protein interactions is fundamental to understanding gene function. </li></ul><ul><li>Chains of interactions compose large, complex gene networks. </li></ul>
    • 6. Sample Gene Network
    • 7. Yeast Gene Network: ~5,800 genes; ~5,800 proteins x 2-10 interactions/protein = ~12,000-60,000 interactions. Yeast: ~10,000-20,000 known ==> ~1/3 of the way to a complete map!
    • 8. Human Gene Network: ~40,000 genes; >>40,000 proteins x 2-10 interactions/protein = >>80,000-400,000 interactions. <5,000 known ==> approx. 1% of the complete map! ==> We’re a long way from the complete map.
    • 9. Relevant Sources of Data: Biological literature (~14 million documents); DNA sequence data (~10^10 nucleotides); Gene expression data (~10^8 measurements, but...); DNA polymorphisms (~10^7 known); Gene inactivation (knockout) studies (~10^5); Protein structure data (~10^4 structures); Protein interaction data (~10^4 interactions, but…); Protein expression data (~10^4 measurements, but...); Protein location data (~10^4 measurements)
    • 10. Knowledge in Biomedical Literature <ul><li>An ever-increasing wealth of biological information is present in millions of published articles, but retrieving it in structured form is difficult. </li></ul><ul><li>Much of this literature is available through the NIH-NLM’s Medline (PubMed) repository. </li></ul><ul><li>11 million abstracts in electronic form are available through Medline. </li></ul><ul><li>Excellent source of information on protein interactions. </li></ul>
    • 11. Obtaining Protein Interactions from Medline <ul><li>Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline. </li></ul><ul><li>Many interactions from Medline are not covered by current databases. </li></ul><ul><li>Need automated information extraction to easily locate and structure this information. </li></ul>We integrated these databases, removing duplicates
    • 12. TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene. Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene. However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved. In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein. The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells. Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro. In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity. Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein. 
This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the regulation of gene expression. Sample Medline Abstract
    • 13. Sample Medline Abstract (same abstract as slide 12, with protein names highlighted)
    • 14. Sample Medline Abstract (same abstract as slide 12, with protein interactions highlighted)
    • 15. Manually Developed IE Systems for Medline <ul><li>A number of projects have focused on the manual development of information extraction (IE) systems for biomedical literature. </li></ul><ul><li>KeX for extracting protein names (Fukuda et al., 1998): Extract words with special symbols excluding those with more than half of the characters being special symbols, hence eliminating strings such as “+/−”. </li></ul><ul><li>Suiseki for extracting protein interactions (Blaschke et al., 2001): PROT (0-2) PROT (0-2) complex NOUN between (0-3) PROT (0-3) and (0-3) PROT </li></ul>
    • 16. Learning Information Extractors <ul><li>Manually developing IE systems is tedious and time-consuming, and such systems do not capture all possible formats and contexts for the desired information. </li></ul><ul><li>Machine learning from supervised corpora is becoming the standard approach to building information extractors. </li></ul><ul><li>Recently, several learning approaches have been applied to Medline extraction (Craven &amp; Kumlien, 1999; Tanabe &amp; Wilbur, 2002; Raychaudhuri et al., 2002). </li></ul><ul><li>We have explored the use of a variety of machine learning techniques to develop IE systems for extracting human protein names and interactions, presenting uniform results on a single, reasonably large, human-annotated corpus. </li></ul>
    • 17. Framework for Interaction Extraction <ul><li>Extensive comparative experiments in [Bunescu et al. 2005] </li></ul><ul><ul><li>Protein Extraction: Maximum Entropy tagger. </li></ul></ul><ul><ul><li>Interaction Extraction: ELCS (Extraction using Longest Common Subsequences). </li></ul></ul><ul><li>Traditionally, the task has two separate steps: Protein Extraction and Interaction Extraction. </li></ul>Pipeline: Medline abstract → Protein Extraction → Medline abstract (proteins tagged) → Interaction Extraction → Interactions Database
    • 18. Non-Learning Protein Extractors <ul><li>Dictionary-based extraction </li></ul><ul><li>KeX (Fukuda et al., 1998) </li></ul>
    • 19. Learning Methods for Protein Extraction <ul><li>Rule-based pattern induction </li></ul><ul><ul><li>Rapier (Califf &amp; Mooney, 1999) </li></ul></ul><ul><ul><li>BWI (Freitag &amp; Kushmerick, 2000) </li></ul></ul><ul><li>Token classification (chunking approach): </li></ul><ul><ul><li>K-nearest neighbor </li></ul></ul><ul><ul><li>Transformation-Based Learning: ABGene (Tanabe &amp; Wilbur, 2002) </li></ul></ul><ul><ul><li>Support Vector Machine </li></ul></ul><ul><ul><li>Maximum entropy </li></ul></ul><ul><li>Hidden Markov Models </li></ul><ul><li>Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001) </li></ul><ul><li>Relational Markov Networks (Taskar, Abbeel, and Koller, 2002) </li></ul>
    • 20. Name Extraction by Token Classification (“Chunking” or “Sequence Labeling” Approach) <ul><li>Since in our data no protein names directly abut each other, we can reduce the extraction problem to classification of individual words as being part of a protein name or not. </li></ul><ul><li>Protein names are extracted by identifying the longest sequences of words classified as being part of a protein name . </li></ul>Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share common properties of subunit configuration , tyrosine phosphorylation and physical association with the Rb protein
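The reduction described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `extract_names` and the per-token boolean labels are assumptions, and the labels in the example are filled in by hand rather than produced by a classifier.

```python
from typing import List, Tuple


def extract_names(tokens: List[str], is_protein: List[bool]) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for every maximal run of tokens labeled as protein.

    `end` is exclusive, so tokens[start:end] is the extracted name.
    """
    names = []
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the last token.
    for i, flag in enumerate(is_protein + [False]):
        if flag and start is None:
            start = i  # a new run begins
        elif not flag and start is not None:
            names.append((start, i, " ".join(tokens[start:i])))
            start = None  # the run has ended
    return names


# Hand-labeled example (the labels are hypothetical classifier output).
tokens = "Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share".split()
labels = [False, False, False, False, False, True, True, False, True, True, False, False]
print(extract_names(tokens, labels))
```

Because no two protein names abut each other in the data, each maximal run of positive tokens corresponds to exactly one name, which is why a per-token decision is enough.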
    • 21. Constructing Feature Vectors for Classification <ul><li>For each token, we take the following as features: </li></ul><ul><ul><li>Current token </li></ul></ul><ul><ul><li>Last 2 tokens and next 2 tokens </li></ul></ul><ul><ul><li>Output of dictionary-based tagger for these 5 tokens </li></ul></ul><ul><ul><li>Suffix for each of the 5 tokens (last 1, 2, and 3 characters) </li></ul></ul><ul><ul><li>Class labels for last 2 tokens </li></ul></ul>Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share common properties of subunit configuration , tyrosine phosphorylation and physical association with the Rb protein
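A minimal sketch of how such a feature vector might be assembled for one token is shown below. The feature names, the `<PAD>` placeholder for positions outside the sentence, and the use of a plain set lookup as a stand-in for the dictionary-based tagger are all assumptions made for illustration; the slide does not specify these details.

```python
from typing import Dict, List, Set

PAD = "<PAD>"  # hypothetical placeholder for positions before/after the sentence


def token_features(tokens: List[str], i: int, prev_labels: List[str],
                   dictionary: Set[str]) -> Dict[str, str]:
    """Build the features for token i from a 5-token window and the 2 previous labels.

    prev_labels holds the class labels already assigned to tokens[0:i].
    """
    feats: Dict[str, str] = {}
    for offset in range(-2, 3):  # last 2 tokens, current token, next 2 tokens
        j = i + offset
        tok = tokens[j] if 0 <= j < len(tokens) else PAD
        feats[f"tok[{offset}]"] = tok
        # Stand-in for the dictionary-based tagger: a single-token lookup.
        feats[f"dict[{offset}]"] = str(tok.lower() in dictionary)
        for n in (1, 2, 3):  # suffixes of length 1, 2, and 3
            feats[f"suf{n}[{offset}]"] = tok[-n:] if tok != PAD else PAD
    for back in (1, 2):  # class labels of the last 2 tokens
        feats[f"label[-{back}]"] = prev_labels[i - back] if i - back >= 0 else PAD
    return feats


feats = token_features(["cyclins", ",", "cyclin", "A", "and"], 2, ["O", "O"], {"cyclin"})
print(feats["tok[0]"], feats["suf3[0]"], feats["dict[0]"], feats["label[-1]"])
```

The resulting dictionary can be fed to any classifier that accepts categorical features; which classifier is used is independent of this construction.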
    • 22. Our Biomedical Corpora (AIMed) <ul><li>750 abstracts that contain the word “human” were randomly chosen from Medline for testing protein name extraction. They contain a total of 5,206 protein references. </li></ul><ul><li>200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins. They contain 1,101 interactions and 4,141 protein names. </li></ul><ul><li>As negative examples for interaction extraction are rare, an extra set of 30 abstracts containing sentences with non-interacting proteins is included. </li></ul><ul><li>The resulting 230 abstracts are used for testing protein interaction extraction. </li></ul>
    • 23. The Yapex Corpus <ul><li>200 abstracts from Medline, manually tagged for protein names. </li></ul><ul><li>147 randomly chosen such that they contain the MeSH terms “protein binding”, “interaction”, “molecular”. </li></ul><ul><li>53 randomly chosen from the GENIA corpus </li></ul>http://www.sics.se/humle/projects/prothalt/
    • 24. Evaluation Metrics for Information Extraction <ul><li>Precision is the percentage of extracted items that are correct. </li></ul><ul><li>Recall is the percentage of correct items that are extracted. </li></ul><ul><li>Extracted protein names are considered correct if the same character sequences have been human-tagged as protein names in the exact positions. </li></ul><ul><li>Extracted protein interactions from an abstract are considered correct if both proteins have been human-tagged as interacting in that abstract. Positions are not taken into account. </li></ul>
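These two definitions translate directly into set operations. The sketch below assumes that each protein name is represented as a `(start, end, text)` tuple, so that a match requires the exact position, and each interaction as an unordered pair, so that positions are ignored; both representations and all example values are illustrative assumptions.

```python
def precision_recall(extracted: set, gold: set):
    """Precision = correct / extracted; recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Protein names: position-sensitive, so (start, end, text) must match exactly.
gold_names = {(5, 7, "cyclin A"), (8, 10, "cyclin D1"), (30, 31, "Rb"), (40, 41, "PRAD1")}
found_names = {(5, 7, "cyclin A"), (9, 10, "D1")}  # second item has the wrong boundaries
print(precision_recall(found_names, gold_names))

# Interactions: position-insensitive, so an unordered pair of names is enough.
gold_pairs = {frozenset(("cyclin D1", "p34cdc2")), frozenset(("cyclin D1", "p33cdk2"))}
found_pairs = {frozenset(("p34cdc2", "cyclin D1"))}
print(precision_recall(found_pairs, gold_pairs))
```

Using `frozenset` for interactions makes the pair order irrelevant, which mirrors the statement that only the two proteins, not their positions, are compared.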
    • 25. Experimental Method <ul><li>10-fold cross-validation: Average results over 10 trials with different training and (independent) test data. </li></ul><ul><li>For methods which produce confidence in extractions, vary threshold for extraction in order to explore recall-precision trade-off. </li></ul><ul><li>Use standard methods from information retrieval to generate a complete precision-recall curve. </li></ul><ul><li>Maximizing F-measure assumes a particular cost-benefit trade-off between incorrect and missed extractions. </li></ul>
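One simple way to trace the recall-precision trade-off is to rank the extractions by confidence and record precision and recall after each one, which is equivalent to lowering the threshold one extraction at a time. The sketch below makes that assumption; it omits cross-validation, tie handling, and any curve interpolation, and the scores are invented for the example.

```python
def pr_curve(scored, num_gold):
    """scored: list of (confidence, is_correct) pairs for the extracted items.

    Returns one (threshold, precision, recall) point per extraction, obtained by
    accepting every extraction whose confidence is at least the threshold.
    """
    points = []
    correct = 0
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    for rank, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((confidence, correct / rank, correct / num_gold))
    return points


def f_measure(precision, recall, beta=1.0):
    """F-measure; beta > 1 weights recall more heavily, beta < 1 weights precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


scored = [(0.9, True), (0.8, False), (0.7, True), (0.4, True)]  # hypothetical output
curve = pr_curve(scored, num_gold=4)
for threshold, p, r in curve:
    print(threshold, round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

The `beta` parameter makes explicit the cost-benefit assumption mentioned on the slide: choosing a single F-measure fixes how an incorrect extraction is traded against a missed one.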
    • 26. Protein Name Extraction Results (Bunescu et al., 2004)
    • 27. Graphical Models: probabilistic models that represent dependencies using a graph <ul><li>Directed Models =&gt; well suited to represent temporal and causal relationships (Bayesian Networks, HMMs) </li></ul><ul><li>Undirected Models =&gt; appropriate for representing statistical correlation between variables (Markov Networks) </li></ul><ul><li>Generative Models =&gt; define a joint probability over observations and labels (HMMs) </li></ul><ul><li>Discriminative Models =&gt; specify a probability over labels given a set of observations (Conditional Random Fields) </li></ul>
    • 28. Conditional Random Fields (Lafferty, McCallum &amp; Pereira 2001) <ul><li>CRFs are a type of discriminative Markov network used for labeling sequences. </li></ul><ul><li>CRFs have shown superior or competitive performance in various tasks such as: </li></ul><ul><ul><li>Shallow Parsing [Sha &amp; Pereira 2003] </li></ul></ul><ul><ul><li>Entity Recognition [McCallum &amp; Li 2003] </li></ul></ul><ul><ul><li>Table Extraction [Pinto et al 2003] </li></ul></ul>
    • 29. Conditional Random Fields (CRFs) <ul><li>Tj.tag – the tag at position j </li></ul><ul><li>Tj.w – true if word w occurs at position j </li></ul><ul><li>Tj.cap – true if word at position j begins with capital letter, … </li></ul>[Figure: linear-chain CRF with tag nodes Start, T1.tag, T2.tag, T3.tag, …, Tn.tag, End; each tag node Tj.tag is linked to its observed features Tj.w and Tj.cap, with potentials over adjacent tags, tag-word pairs, and tag-capitalization pairs.]
    • 30. Protein Name Extraction Results (Yapex)
    • 31. Collective Classification of Web Pages [ Taskar, Abbeel &amp; Koller 2002 ]
    • 32. Collective Information Extraction. Example text: “The control of human ribosomal protein L22 ( rpL22 ) to enter into the nucleolus and its ability to be assembled into the ribosome is regulated by its sequence . The nuclear import of rpL22 depends on a classical nuclear localization signal of four lysines at positions 13 – 16 … Once it reaches the nucleolus , the question of whether rpL22 is assembled into the ribosome depends upon the presence of the N - domain .” [Figure: candidate entities e1 to e5, taken from the contexts “ribosomal protein L22 ( rpL22 )”, “of rpL22 depends”, “whether rpL22 is”, and “L22”, connected by acronym, repetition, and overlap links.]
    • 33. Relational Markov Networks [Taskar, Abbeel &amp; Koller 2002] Discriminative Markov Networks, augmented with clique templates: <ul><li>Acronym Template (AT) </li></ul><ul><li>Repeat Template (RT) </li></ul><ul><li>Overlap Template (OT) </li></ul>[Figure: candidate entities e1 to e5 from the contexts “ribosomal protein L22 ( rpL22 )”, “of rpL22 depends”, “whether rpL22 is”, and “L22”.]
    • 34. Experimental Results <ul><li>Compared three approaches: </li></ul><ul><li>LT–RMN: RMN extraction using local templates + Overlap Template </li></ul><ul><li>GLT–RMN: RMN extraction using both local and global templates. </li></ul><ul><li>CRF: extraction as token classification using Conditional Random Fields </li></ul>
    • 35. Experimental Results – Yapex
    • 36. Experimental Results – AIMed
    • 37. Protein Interaction Extraction <ul><li>Most IE methods focus on extracting individual entities. </li></ul><ul><li>Protein interaction extraction requires extracting relations between entities. </li></ul><ul><li>Our current results on relation extraction have focused on rule-based and kernel-based learning approaches. </li></ul>
    • 38. ELCS (Extraction using Longest Common Subsequences) <ul><li>A new method for inducing rules that extract interactions between previously tagged proteins. </li></ul><ul><li>Each rule consists of a sequence of words with allowable word gaps between them (similar to Blaschke &amp; Valencia, 2001, 2002). Example rule: - (7) interactions (0) between (5) PROT (9) PROT (17) . </li></ul><ul><li>Any pair of proteins in a sentence forms a positive example if it is tagged as interacting; otherwise it forms a negative example. </li></ul><ul><li>Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples. </li></ul>
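To make the rule format concrete, here is a small matcher sketch. It rests on two assumptions that the slide does not spell out: the number in parentheses is read as the maximum number of words allowed between two consecutive rule words, and every tagged protein name in the sentence has already been replaced by the token `PROT`. The list encoding of a rule (words at even positions, gaps at odd positions) is likewise only a convenience for this example.

```python
def rule_matches(rule, tokens):
    """rule alternates words and maximum gaps, e.g. ["interactions", 0, "between", 5, "PROT"].

    Returns True if the rule words occur in order in tokens with each gap respected.
    """
    words = rule[0::2]
    gaps = rule[1::2]

    def search(word_idx, pos):
        # tokens[pos] already matches words[word_idx]; try to place the remaining words.
        if word_idx == len(words) - 1:
            return True
        last_allowed = pos + 1 + gaps[word_idx]
        for nxt in range(pos + 1, min(len(tokens), last_allowed + 1)):
            if tokens[nxt] == words[word_idx + 1] and search(word_idx + 1, nxt):
                return True
        return False

    return any(tokens[i] == words[0] and search(0, i) for i in range(len(tokens)))


rule = ["-", 7, "interactions", 0, "between", 5, "PROT", 9, "PROT", 17, "."]
sentence = ("Title - Physical and functional interactions between "
            "the transcriptional inhibitors PROT and PROT .").split()
print(rule_matches(rule, sentence))
print(rule_matches(rule, "PROT binds PROT .".split()))
```

The backtracking search is needed because a rule word may occur several times within the allowed gap, and only some of those occurrences let the rest of the rule match.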
    • 39. Generalizing Rules using Longest Common Subsequence. Example 1: “The self - association site appears to be formed by interactions between helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of alpha spectrin repeat 1 of the other dimer to form two combined alpha - beta triple - helical segments .” Example 2: “Title - Physical and functional interactions between the transcriptional inhibitors Id3 and ITF-2b .” Generalized rule: - (7) interactions (0) between (5) PROT (9) PROT (17) .
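A generalization step of this kind can be sketched with the textbook longest-common-subsequence dynamic program. The sketch assumes that protein names have been replaced by `PROT` and that each gap in the resulting rule is the larger of the two gaps observed in the input sentences, which appears consistent with the gap counts in the slide's example but is not stated there. The two short sentences used below are made up; when several common subsequences have the same length the result depends on tie-breaking, and how ELCS chooses among them (for example by anchoring on the two proteins) is not described on the slide.

```python
def lcs_alignment(a, b):
    """Return the index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def generalize(a, b):
    """Build a rule: LCS words separated by the larger of the two observed gaps."""
    pairs = lcs_alignment(a, b)
    rule = []
    for k, (i, j) in enumerate(pairs):
        if k > 0:
            prev_i, prev_j = pairs[k - 1]
            rule.append(max(i - prev_i - 1, j - prev_j - 1))
        rule.append(a[i])
    return rule


s1 = "interactions between PROT and PROT .".split()
s2 = "the interactions of PROT with the PROT .".split()
print(generalize(s1, s2))
```

Taking the larger gap guarantees that the generalized rule still matches both of the sentences it was built from.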
    • 40. Protein Interaction Extraction Results (gold-standard protein tags)
    • 41. Protein Interaction Extraction Results (automated protein tags)
    • 42. Large-Scale Text Mining <ul><li>Apply trained extractors to 753,459 Medline abstracts that reference “human”. </li></ul><ul><li>Automatically mine large numbers of protein interactions from this scientific text. </li></ul><ul><li>Integrate extracted data with existing databases to construct the world’s largest database of human protein interactions. </li></ul><ul><li>How do we judge the accuracy of the extracted interactions? </li></ul>
    • 43. Accuracy Benchmark: Shared Functional Annotations <ul><li>Accuracy of a protein interaction dataset correlates well with the percentage of interaction partners sharing functional annotations. </li></ul><ul><li>Functional annotation = a pathway between the two proteins in a particular ontology: </li></ul><ul><ul><li>KEGG: 55 pathways at lowest level. </li></ul></ul><ul><ul><li>GO: 1356 pathways at level 8 of biological process annotation. </li></ul></ul>
    • 44. Accuracy Benchmarks: LLR Scoring Scheme <ul><li>Use the log-likelihood ratio (LLR) of protein pairs: LLR = log [ P(D|I) / P(D|¬I) ] </li></ul>P(D|I) and P(D|¬I) are the probabilities of observing the interaction data D conditioned on the proteins sharing (I) or not sharing (¬I) functional annotations. <ul><li>Higher values for LLR indicate higher accuracy. </li></ul>
    • 45. Interaction Extraction using Co-citation Analysis <ul><li>Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins. </li></ul><ul><li>Compute the probability of co-citation under a random model (hypergeometric distribution). </li></ul>N – total number of abstracts (750K); n – abstracts citing the first protein; m – abstracts citing the second protein; k – abstracts citing both proteins
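The formula itself did not survive in the transcript, so the sketch below uses the standard hypergeometric probability mass function, P(k | N, n, m) = C(n, k) · C(N − n, m − k) / C(N, m), on the assumption that this is the random model the slide refers to. The function names are illustrative, and log-gamma is used so that N in the hundreds of thousands does not produce enormous intermediate integers.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def cocitation_prob(N, n, m, k):
    """Probability that exactly k abstracts cite both proteins by chance.

    N: total abstracts; n: abstracts citing the first protein;
    m: abstracts citing the second protein; k: abstracts citing both.
    """
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0  # impossible overlap
    return math.exp(log_comb(n, k) + log_comb(N - n, m - k) - log_comb(N, m))


# Small check by hand: C(4,2) * C(6,1) / C(10,3) = 6 * 6 / 120 = 0.3
print(cocitation_prob(10, 4, 3, 2))

# Hypothetical counts at the scale of the slide (N = 750K abstracts).
print(cocitation_prob(750_000, 100, 80, 10))
```

A pair that co-occurs far more often than chance predicts receives a very small probability here, which is the signal the co-citation method ranks on.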
    • 46. Interaction Extraction using Co-citation Analysis <ul><li>Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model. </li></ul><ul><li>Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark. </li></ul><ul><li>RESULT: Close to 15K interactions extracted that score comparably to or better than HPRD on the functional annotation benchmark. </li></ul>
    • 47. Co-citation Analysis with Bayesian Reranking <ul><li>Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions. </li></ul><ul><li>For a given pair of proteins, compute the average score of co-citing abstracts. </li></ul><ul><li>Use the average score to re-rank the 15k already extracted pairs. </li></ul>Pipeline: Medline abstract → CRF tagger → Medline abstract (proteins tagged) → Co-citation Analysis → Ranked Interactions → Naïve Bayes scores → Re-ranked Interactions
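The re-ranking step amounts to averaging a per-abstract score over the abstracts that cite both proteins and sorting by that average. The sketch below assumes the Naïve Bayes scores are already available as a mapping from abstract id to a number; the identifiers, scores, and function name are all hypothetical.

```python
def rerank(pairs, cociting, abstract_score):
    """Sort protein pairs by the mean score of the abstracts that cite both proteins.

    pairs: list of protein pairs; cociting: pair -> list of abstract ids;
    abstract_score: abstract id -> score from the (already trained) classifier.
    """
    average = {}
    for pair in pairs:
        ids = cociting[pair]
        average[pair] = sum(abstract_score[a] for a in ids) / len(ids)
    return sorted(pairs, key=lambda p: average[p], reverse=True)


# Hypothetical data: two pairs already extracted by co-citation analysis.
abstract_score = {"a1": 0.9, "a2": 0.1, "a3": 0.6}
cociting = {("P1", "P2"): ["a1", "a2"], ("P3", "P4"): ["a3"]}
print(rerank([("P1", "P2"), ("P3", "P4")], cociting, abstract_score))
```

Averaging rather than summing keeps frequently co-cited pairs from dominating simply because they appear in more abstracts.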
    • 48. Integrating Extracted Data with Existing Databases. Extracted: 6,580 interactions between 3,737 human proteins. Total: 31,609 interactions between 7,748 human proteins.
    • 49. Filtered Co-citation Analysis: Evaluation
    • 50. ERK (Extraction using a Relation Kernel) <ul><li>Use SVM with a string kernel. </li></ul><ul><li>The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names. </li></ul><ul><li>The feature space can be further pruned down – in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns: </li></ul><ul><ul><li>[FI] Fore-Inter: ‘interaction of P1 with P2’, ‘activation of P1 by P2’ </li></ul></ul><ul><ul><li>[I] Inter: ‘P1 interacts with P2’, ‘P1 is activated by P2’ </li></ul></ul><ul><ul><li>[IA] Inter-After: ‘P1 – P2 complex’, ‘P1 and P2 interact’ </li></ul></ul><ul><li>Restrict the three types of patterns to use at most 4 words (besides the two protein anchors). </li></ul>
    • 51. ERK (Extraction using a Relation Kernel) <ul><li>The kernel K(S1, S2) = the number of common patterns between S1 and S2, weighted by their span in the two sentences. </li></ul><ul><li>K(S1, S2) can be computed using the dynamic programming procedure from [Lodhi et al., 2002]. </li></ul><ul><li>Train an SVM model to find a max-margin linear discriminator between positive and negative examples. </li></ul>S1 = In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1 , a Cdk-binding subunit. S2 = Experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2 , and <ul><li>[FI] patterns: “human cells P1 associated with P2”, … </li></ul><ul><li>[I] patterns: “P1 associated with P2”, … </li></ul><ul><li>[IA] patterns: “P1 associated with P2 ,”, … </li></ul>
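The idea of counting common sparse word subsequences, down-weighted by their span, can be illustrated with a brute-force version of a gap-weighted subsequence kernel. This is a simplified sketch and not the ERK kernel itself: it does not anchor patterns on the two protein names, it does not split patterns into the [FI], [I], and [IA] types, and it enumerates subsequences explicitly instead of using the dynamic programming procedure, so it is only practical for short sentences. The decay parameter `lam`, the length limit, and the example sentences are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def subseq_weights(tokens, max_len, lam):
    """Map each word subsequence of up to max_len words to the sum of lam**span
    over all of its occurrences, where span = last index - first index + 1."""
    weights = defaultdict(float)
    for length in range(1, max_len + 1):
        for idx in combinations(range(len(tokens)), length):
            span = idx[-1] - idx[0] + 1
            weights[tuple(tokens[i] for i in idx)] += lam ** span
    return weights


def subseq_kernel(s, t, max_len=3, lam=0.75):
    """Sum, over subsequences common to s and t, of the product of their weights."""
    ws = subseq_weights(s, max_len, lam)
    wt = subseq_weights(t, max_len, lam)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)


s = ["P1", "interacts", "with", "P2"]
t = ["P1", "binds", "with", "P2"]
# Common patterns: P1, with, P2, (P1, with), (P1, P2), (with, P2)
print(subseq_kernel(s, t, max_len=2, lam=0.5))
```

Because the value is an inner product of explicit feature vectors, it is a valid kernel and could be passed to an SVM implementation that accepts a precomputed kernel matrix.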
    • 52. Evaluation: ERK vs ELCS vs Manual
    • 53. Future Work &amp; Conclusions <ul><li>Future Work: </li></ul><ul><li>Analyze the complete set of 750K abstracts using the relational kernel and integrate results into an improved composite dataset. </li></ul><ul><li>Conclusions: </li></ul><ul><li>Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases. </li></ul><ul><li>Final database: 31,609 interactions between 7,748 human proteins. </li></ul>
    • 54. For Further Information <ul><li>Consolidated database available online: </li></ul><ul><ul><li>http://bioinformatics.icmb.utexas.edu/idserve/ </li></ul></ul><ul><li>Papers available online: </li></ul><ul><ul><li>http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html </li></ul></ul><ul><ul><ul><li>“Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome,” Arun K. Ramani, Razvan C. Bunescu, Raymond J. Mooney, and Edward M. Marcotte, Genome Biology, 6, 5, R40 (2005). </li></ul></ul></ul><ul><ul><ul><li>“Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions,” Arun Ramani, Edward Marcotte, Razvan Bunescu, Raymond Mooney, to appear in the Proceedings of ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005. </li></ul></ul></ul><ul><ul><ul><li>“Collective Information Extraction with Relational Markov Networks,” Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004. </li></ul></ul></ul><ul><ul><ul><li>“Comparative Experiments on Learning Information Extractors for Proteins and their Interactions,” Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33, 2 (2005), pp. 139-155. </li></ul></ul></ul>
