  1. 1. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Vasileios Hatzivassiloglou, Pablo A. Duboué, Andrey Rzhetsky. Bioinformatics 17 (suppl. 1), S97-S106, 2001. Summarized by Jeong-Ho Chang
  2. 2. Introduction <ul><li>Present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. </li></ul><ul><li>Three machine learning algorithms </li></ul><ul><ul><li>Naïve Bayes classifier, decision tree (C4.5), inductive rule learning (RIPPER) </li></ul></ul><ul><li>Use of contextual features for disambiguation </li></ul><ul><ul><li>Positional information, POS tagging, stopwords, stemming, etc. </li></ul></ul>
  3. 3. GeneWays http://genome6.cpmc.columbia.edu/~krautham/geneways/
  4. 4. Disambiguation <ul><li>The goal </li></ul><ul><ul><li>Disambiguate words or phrases known to be biological terms by assigning class labels to them: protein, gene, mRNA </li></ul></ul><ul><li>Instances </li></ul><ul><ul><li>By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo. </li></ul></ul><ul><ul><li>The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3’ UTR truncated at the polyadenylation site). </li></ul></ul>
  5. 5. Overall Description of Methods <ul><li>Cast the problem as a case of word sense disambiguation. </li></ul><ul><li>Utilize approaches used in statistical language processing. </li></ul><ul><ul><li>Use the context of known occurrences of genes, proteins, and mRNA to learn weights for elements in that context. </li></ul></ul><ul><ul><li>Apply these weights to classify new occurrences. </li></ul></ul>
  6. 6. Data Preparation (1/4) <ul><li>Collection </li></ul><ul><ul><li>Download articles that appear in HTML format on the Internet, from prespecified journal publishers’ web sites or via keyword searches through PubMed. </li></ul></ul><ul><ul><li>Convert the HTML format to XML format. </li></ul></ul><ul><ul><li>The added XML tags mark word, sentence, paragraph, and section boundaries, as well as non-textual material. </li></ul></ul>
  7. 7. Data Preparation (2/4) <ul><li>Sample portion of annotated document </li></ul>
  8. 8. Data Preparation (3/4) <ul><li>Term identification </li></ul><ul><ul><li>Lookup method </li></ul></ul><ul><ul><ul><li>Over the GenBank database </li></ul></ul></ul><ul><ul><ul><li>204,177 gene/protein/RNA names (as of February 2001) </li></ul></ul></ul><ul><ul><li>Preprocessing </li></ul></ul><ul><ul><ul><li>Only consider as terms those entries that either consist of multiple words or, if single words, do not appear in the lexicon of common English words in Brill’s POS tagger. </li></ul></ul></ul><ul><ul><ul><li>Break any word in the text at hyphens and allow for matches between those divisions and GenBank entries. </li></ul></ul></ul>
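The two preprocessing rules on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: `candidate_terms` and `find_terms` are hypothetical names, the name list and common-word lexicon are passed in as plain Python collections rather than read from GenBank or Brill's tagger, and for brevity only single-token matches in the text are shown.

```python
import re

def candidate_terms(names, common_words):
    """Keep multi-word names, and single-word names absent from the common-word lexicon."""
    kept = set()
    for name in names:
        words = name.split()
        if not words:
            continue  # ignore empty entries
        if len(words) > 1 or name.lower() not in common_words:
            kept.add(name.lower())
    return kept

def find_terms(text, terms):
    """Return (token_index, matched_text) pairs for single-token matches.

    A hyphenated token is tried whole first, then piece by piece,
    so that "SBP2-dependent" can match the entry "SBP2".
    """
    hits = []
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)
    for i, tok in enumerate(tokens):
        pieces = [tok] + (tok.split("-") if "-" in tok else [])
        for piece in pieces:
            if piece.lower() in terms:
                hits.append((i, piece))
                break  # at most one match per token in this sketch
    return hits

# Toy stand-ins for the name list and the common-word lexicon (assumptions).
terms = candidate_terms(["SBP2", "was", "p53", "tumor necrosis factor"], {"was", "the"})
hits = find_terms("The SBP2-dependent pathway was p53 independent", terms)
```

Here "was" is dropped because it is a single common English word, while the multi-word name is kept regardless of the lexicon.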
  9. 9. Data Preparation (4/4) <ul><li>Document set </li></ul><ul><ul><li>1,374 articles from the European Molecular Biology Organization (EMBO) journal (1997-2000). </li></ul></ul><ul><ul><li>9,003,923 words of text, 314 MB in total. </li></ul></ul><ul><ul><li>346,519 terms were identified. </li></ul></ul><ul><ul><li>9,187 (2.65%) are non-ambiguous occurrences. </li></ul></ul><ul><ul><ul><li>Those immediately followed by a disambiguating word (“gene”, “protein”, “mRNA”). </li></ul></ul></ul><ul><ul><ul><li>Used in training and the automated evaluation. </li></ul></ul></ul><ul><ul><ul><li>Unsupervised training set acquisition. </li></ul></ul></ul>
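A minimal sketch of how such non-ambiguous occurrences might be harvested automatically, assuming the text is already tokenized and term positions are known. The function name `label_unambiguous`, the punctuation stripping, and the exact-match rule (plurals such as "mRNAs" are not counted) are assumptions made for illustration, not details taken from the paper.

```python
# Disambiguating words named on the slide, mapped to class labels.
CLASS_WORDS = {"gene": "gene", "protein": "protein", "mrna": "mRNA"}

def label_unambiguous(tokens, term_positions):
    """Return (position, label) for terms immediately followed by a disambiguating word."""
    labeled = []
    for pos in term_positions:
        if pos + 1 < len(tokens):
            # Strip trailing punctuation so that "gene." still counts (an assumption).
            following = tokens[pos + 1].lower().rstrip(".,;:)")
            if following in CLASS_WORDS:
                labeled.append((pos, CLASS_WORDS[following]))
    return labeled

tokens = "the SBP2 protein binds selenoprotein mRNAs and the p53 gene .".split()
labels = label_unambiguous(tokens, [1, 4, 8])
```

In this toy sentence, "SBP2" and "p53" receive labels for free, while "selenoprotein" stays unlabeled because "mRNAs" is not an exact match.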
  10. 10. Data Representations <ul><li>A term is represented as a vector of contextual features </li></ul><ul><ul><li>Features are for the words in a window extending N words to the left and N words to the right of the term. </li></ul></ul><ul><li>Feature Definition </li></ul><ul><ul><li>Positional information </li></ul></ul><ul><ul><ul><li>Word-bag approach </li></ul></ul></ul><ul><ul><ul><li>Words before the term, and words after the term. </li></ul></ul></ul><ul><ul><ul><li>Distance from the term, e.g. X/+2. </li></ul></ul></ul><ul><ul><li>Capitalization </li></ul></ul><ul><ul><li>Part-of-speech </li></ul></ul><ul><ul><li>Stopwords and similarly distributed words </li></ul></ul><ul><ul><li>Stemming </li></ul></ul>
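The three positional variants listed on this slide can be illustrated with a small feature extractor. This is a sketch only: `context_features` and its `mode` values ("bag", "side", "distance") are hypothetical names, and the `word/+2` encoding simply mirrors the slide's "X/+2" example.

```python
def context_features(tokens, pos, n=2, mode="bag"):
    """Features for the words within n positions of the term at index pos.

    mode="bag":      the word only (no positional information)
    mode="side":     the word plus whether it precedes (-) or follows (+) the term
    mode="distance": the word plus its signed offset from the term, e.g. "binds/+1"
    """
    feats = []
    lo = max(0, pos - n)
    hi = min(len(tokens), pos + n + 1)
    for i in range(lo, hi):
        if i == pos:
            continue  # the term itself is not a context feature
        word = tokens[i]
        offset = i - pos
        if mode == "bag":
            feats.append(word)
        elif mode == "side":
            feats.append(f"{word}/{'+' if offset > 0 else '-'}")
        elif mode == "distance":
            feats.append(f"{word}/{offset:+d}")
        else:
            raise ValueError(f"unknown mode: {mode}")
    return feats

tokens = ["we", "show", "that", "SBP2", "binds", "selenoprotein", "mRNAs"]
bag = context_features(tokens, 3, n=2, mode="bag")
side = context_features(tokens, 3, n=2, mode="side")
dist = context_features(tokens, 3, n=2, mode="distance")
```

The richer the positional encoding, the more distinct features the same window produces, which is one way to see why sparse data can hurt the full-distance variant.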
  11. 11. Learning Methods <ul><li>Three learning methods are tested. </li></ul><ul><ul><li>Naïve Bayes classifier </li></ul></ul><ul><ul><li>Decision trees (C4.5) </li></ul></ul><ul><ul><li>Inductive rule learning (RIPPER) </li></ul></ul><ul><li>Instances of rules from RIPPER </li></ul>
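For readers unfamiliar with the first of these methods, a bare-bones multinomial naïve Bayes classifier over context features might look like the sketch below. The slides do not say how probabilities were estimated, so the add-one (Laplace) smoothing, the handling of unseen features, and the class name `NaiveBayes` are all assumptions; this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with additive smoothing (alpha is an assumption)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        """examples: iterable of (feature_list, label) pairs."""
        for feats, label in examples:
            self.class_counts[label] += 1
            for f in feats:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for f in feats:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training data, invented for illustration only.
train = [
    (["expression", "of"], "gene"),
    (["binds", "to"], "protein"),
    (["transcribed", "promoter"], "gene"),
    (["binds", "phosphorylated"], "protein"),
]
model = NaiveBayes().fit(train)
```

With this toy data, `model.predict(["binds"])` favors "protein" and `model.predict(["promoter"])` favors "gene", since each word was seen only with that class.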
  12. 12. Experiments <ul><li>Design </li></ul><ul><ul><li>Experimental data set </li></ul></ul><ul><ul><ul><li>9,187 non-ambiguous occurrences </li></ul></ul></ul><ul><ul><ul><li>550 ambiguous occurrences, which are manually labeled. </li></ul></ul></ul><ul><ul><li>10-fold cross-validation. </li></ul></ul><ul><ul><li>To estimate the best window size, another 10-fold cross-validation is performed on one tenth of the data. </li></ul></ul><ul><li>Task </li></ul><ul><ul><li>Two-way classification and three-way classification </li></ul></ul><ul><ul><li>Performance measure </li></ul></ul><ul><ul><ul><li>Classification accuracy. </li></ul></ul></ul>
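The evaluation protocol (k-fold cross-validation scored by classification accuracy) can be sketched generically as below. The helper names `k_fold_accuracy` and `majority_baseline`, the shuffling seed, and the round-robin fold assignment are assumptions for illustration; any classifier that maps training examples to a prediction function could be plugged in.

```python
import random
from collections import Counter

def k_fold_accuracy(examples, train_fn, k=10, seed=0):
    """Average accuracy over k folds.

    examples: list of (features, label) pairs
    train_fn: takes training examples, returns a function features -> label
    """
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # round-robin split into k folds
    correct = 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train_fn(train)
        correct += sum(1 for feats, label in test if predict(feats) == label)
    return correct / len(data)

def majority_baseline(train):
    """Hypothetical baseline: always predict the most frequent training label."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda feats: top

# Toy data: 30 "gene" and 10 "protein" examples with empty feature lists.
toy = [([], "gene")] * 30 + [([], "protein")] * 10
baseline_acc = k_fold_accuracy(toy, majority_baseline, k=10)
```

A majority-class baseline like this gives a floor against which the reported accuracies can be read; on the toy data it scores 0.75 because three quarters of the examples are "gene".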
  13. 13. Results and Evaluation (1/4) <ul><li>Effects of the learning algorithm </li></ul><ul><ul><li>Tested on about one third of the collection of articles. </li></ul></ul><ul><ul><li>The naïve Bayes classifier is significantly faster in both training and prediction. </li></ul></ul><ul><ul><li>The naïve Bayes classifier is chosen as the default classifier. </li></ul></ul>
  14. 14. Results and Evaluation (2/4) <ul><li>Effects of feature definitions </li></ul><ul><ul><li>Positional information </li></ul></ul><ul><ul><ul><li>Full positional information </li></ul></ul></ul><ul><ul><ul><ul><li>Lower accuracy by as much as 6%. </li></ul></ul></ul></ul><ul><ul><ul><li>Just sign of the positional difference </li></ul></ul></ul><ul><ul><ul><ul><li>Lower accuracy by 1~1.5%. </li></ul></ul></ul></ul><ul><ul><ul><li>Likely due to the sparseness of data. </li></ul></ul></ul><ul><ul><li>Capitalization </li></ul></ul><ul><ul><ul><li>Mapping all words to lower case did not alter performance. </li></ul></ul></ul>
  15. 15. Results and Evaluation (3/4) <ul><ul><li>Part-of-speech </li></ul></ul><ul><ul><ul><li>Helped the overall accuracy, but only moderately (less than 1% on average). </li></ul></ul></ul><ul><ul><li>Similarly distributed words </li></ul></ul><ul><ul><ul><li>Have a small negative effect on performance (0.2~0.5%). </li></ul></ul></ul><ul><ul><ul><li>Can significantly reduce the total number of features. </li></ul></ul></ul><ul><ul><li>Stopwords and Stemming </li></ul></ul><ul><ul><ul><li>Both increase performance (by 0.5~1.5% and 0.4%, respectively). </li></ul></ul></ul>
  16. 16. Results and Evaluation (4/4) <ul><li>Evaluation of the final feature combination </li></ul><ul><ul><li>Test on full data collection using 10-fold cross-validation. </li></ul></ul>
  17. 17. Conclusion <ul><li>Explores three learning techniques and several ways of defining contextual features for the problem of automatically disambiguating biological terms. </li></ul><ul><ul><li>Utilizes textual information rather than relying on extensive human markup. </li></ul></ul><ul><li>Demonstrates accuracy within the range of statistical sense disambiguation applications. </li></ul><ul><li>Plans to refine several aspects of the system </li></ul><ul><ul><li>The positional information model </li></ul></ul><ul><ul><li>Prediction of relationships between classes of biological terms, based on the results. </li></ul></ul>