
Identifying genes and proteins in text: a short review of available tools and resources

Nathan Harmston of Imperial College London gave a presentation at London Biogeeks on Thursday 24 February 2011, 6 - 6.30pm, at King’s College London, Rm 1.20, Franklin Wilkins Building, Waterloo Campus, Stamford Street, London, SE1 9NH. See: biogeeks.wordpress.com/2011/02/16/february-tech-meet-24th-kcl/

The presentation was titled "Identifying genes and proteins in text: a short review of available tools and resources".

Abstract below:
The ever-increasing publication rate means that manually extracting information from biological papers is now intractable. This situation has led to a sustained interest in the application of text mining (TM) methods to the biological literature. The first stage in any text-mining pipeline is to recognise named entities in text, a process called Named Entity Recognition (NER). I will discuss the basic concepts behind these methods and provide a basic evaluation of some of the freely available software (standalone and web services).

Identifying genes and proteins in text: a short review of available tools and resources

  1. Identifying genes and proteins in text: a short review of available tools and resources. Nathan Harmston, Theoretical Systems Biology, Centre for Bioinformatics, Centre for Integrative Systems Biology at Imperial College London. 24/02/2011. (Short title: Review of Gene NER.)
  2. Deluge/Flood/Tsunami of publications
     The literature contains important knowledge generated by researchers, and ideally it is not just something to promote their careers.
  7. Named Entity Recognition
     Example abstract: "Selection of sup1 and sup2 mutants in the yeast Saccharomyces cerevisiae on cycloheximide containing media revealed classes of mutants that either are completely unable to grow on YAPD without cycloheximide or need this drug under high temperature incubation (30 or 36 degrees C). Some of these mutants also exhibit the growth dependence on another antibiotic - trichodermin, and, at the same time, the osmotic dependence. A hypothesis claiming that sup1 and sup2 mutations cause conformational lability of yeast cytoplasmic ribosomes has been put forward. It is also proposed that binding of cycloheximide and trichodermin to the mutant ribosomes cause their conformational shift, which compensates the functional defects."
       • Genes have many different names, e.g. { P53, TP53, Hs.1845, TRP53 }.
       • Gene names are subject to morphological (transcription factor, transcriptional factor), orthographic (NF kappa B, NF kappaB), combinatorial (homolog of actin, actin homolog) and inflectional (antibody, antibodies) variation.
       • Some names overlap with normal English: breathless, Not, That.
       • Deciding whether a term refers to a gene, an RNA or a protein is difficult: pspA, PspA.
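The orthographic variation mentioned on this slide can be illustrated with a small sketch. It is not taken from the talk or from any of the reviewed tools; the function name `normalise_name` and the choice to strip whitespace and hyphens are assumptions made for illustration only.

```python
import re

def normalise_name(name: str) -> str:
    """Collapse case, whitespace and hyphen differences in a gene name.

    Hypothetical helper: real taggers use far richer normalisation rules.
    """
    return re.sub(r"[\s\-]+", "", name).lower()

# The orthographic variants from the slide, plus a hyphenated form.
variants = ["NF kappa B", "NF kappaB", "NF-kappa B"]
keys = {normalise_name(v) for v in variants}
print(keys)  # all three variants collapse to a single key
```

The same trick shows why the gene/protein distinction is hard: `normalise_name("pspA")` and `normalise_name("PspA")` become identical, so the capitalisation cue that separates the two is lost.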
  11. Problems
      "HUNK is associated with expression of Frizzled 2". Here HUNK could mean:
        • HUman Natural Killer
        • Large piece of something without definite shape
        • A well built sexually attractive man
        • Hormonally Upregulated Neu-associated Kinase
  12. Methods
        • Dictionary
            - BioThesaurus
            - fuzzy matching techniques (Levenshtein, Jaro, Jaro-Winkler)
            - BLAST
            - Whatizit, Reflect.WS
        • Rule/pattern-based matching
            - good for things like yeast genes, but rubbish for fruit fly
            - ABGENE
        • Machine learning
            - Classification: Support Vector Machines (NLProt), Logistic Regression (no tool listed)
            - Sequence labelling: Conditional Random Fields (ABNER, BANNER, JNET), Hidden Markov Models (GENIA)
        • Hybrid methods
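As a rough illustration of the dictionary approach with fuzzy matching, here is a minimal sketch of Levenshtein edit distance used for lookup. The dictionary contents, the function `fuzzy_lookup` and the `max_distance` threshold are assumptions for the example and do not reflect how BioThesaurus, Whatizit or Reflect work internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_lookup(mention, dictionary, max_distance=1):
    """Return dictionary names within max_distance edits of the mention."""
    mention = mention.lower()
    return sorted(name for name in dictionary
                  if levenshtein(mention, name.lower()) <= max_distance)

# Toy dictionary built from names that appear in the slides.
GENE_NAMES = ["TP53", "TRP53", "P53", "Frizzled 2"]

print(fuzzy_lookup("TP-53", GENE_NAMES))
print(fuzzy_lookup("TP-53", GENE_NAMES, max_distance=2))
```

Loosening the threshold pulls in more synonyms but also more false positives, which is one reason short gene symbols are hard to match fuzzily.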
  13. Corpus
      A corpus is a collection of documents in which named entities (NEs) have been manually marked up by a human expert.
        • Corpora serve as a benchmark to compare methods.
        • Corpora serve as development/training sets for methods.
        • They differ in size, Inter-Annotator Agreement (IAA), scope and evaluation scheme.
        • Examples: BioCreative I GM, BioCreative II GM, NLPBA, GENIA.
      Example sentence and its annotation:
        . . .
        P07642544A0868 Conversely, treatment of human protein-tyrosine phosphatase alpha-overexpressing cells with phenylarsine oxide led to a loss of the constitutive NF-kappa B activity.
        . . .
        P07642544A0868|127 135| NF-kappa B
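The annotation line on this slide can be read back against its sentence. The sketch below assumes that the two numbers are zero-based, inclusive character offsets that ignore whitespace, which is consistent with the example shown (the mention starts at non-space character 127). The function names are hypothetical.

```python
def parse_annotation(line):
    """Split 'sentence_id|start end|mention' into its parts."""
    sentence_id, offsets, mention = line.split("|", 2)
    start, end = (int(x) for x in offsets.split())
    return sentence_id, start, end, mention.strip()

def extract_span(sentence, start, end):
    """Return the text covering non-space characters start..end (inclusive)."""
    count = 0  # index over non-space characters only
    first = last = None
    for pos, ch in enumerate(sentence):
        if ch.isspace():
            continue
        if count == start:
            first = pos
        if count == end:
            last = pos
            break
        count += 1
    if first is None or last is None:
        raise ValueError("offsets fall outside the sentence")
    return sentence[first:last + 1]

SENTENCE = ("Conversely, treatment of human protein-tyrosine phosphatase "
            "alpha-overexpressing cells with phenylarsine oxide led to a loss "
            "of the constitutive NF-kappa B activity.")
ANNOTATION = "P07642544A0868|127 135| NF-kappa B"

sentence_id, start, end, mention = parse_annotation(ANNOTATION)
print(sentence_id, repr(extract_span(SENTENCE, start, end)))
```

Checking that the extracted span equals the annotated mention is a cheap sanity test when loading a corpus, since an off-by-one in the offset convention shows up immediately.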
  14. Classification-based approaches
      Example sentence: "Conversely, treatment of human protein-tyrosine phosphatase alpha-overexpressing cells with phenylarsine oxide led to a loss of the constitutive NF-kappa B activity."
      Each training example is a feature vector $x_i$ with a class label $y_i$:

      $$x_i = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} \begin{matrix} \text{gene after} \\ \text{kappa} \\ \text{constitutive} \\ \text{noun phrase} \end{matrix} \qquad y_i = \begin{cases} 1, & \text{if } x_i \text{ belongs to class 1} \\ -1, & \text{if } x_i \text{ belongs to class 2} \end{cases}$$

        • Features: surface clues, syntactic properties of NEs, part of speech, surrounding words, matches against a dictionary.
        • Typically a binary decision (SVMs only work well for binary problems).
        • Classifiers: Maximum Entropy, SVM, Naive Bayes.
        • The input is an order-independent vector.
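To make the feature vector on this slide concrete, here is a minimal sketch that turns one token and its neighbours into a fixed-order 0/1 vector. The feature names, the toy dictionary and the helper functions are assumptions for illustration; part-of-speech and noun-phrase features are left out because they would need an external tagger.

```python
def token_features(tokens, i, dictionary):
    """Binary features for tokens[i]: surface clues, context, dictionary match."""
    token = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
        "has_digit": any(c.isdigit() for c in token),
        "prev=" + prev_tok: True,   # surrounding words
        "next=" + next_tok: True,
        "in_dictionary": token.lower() in dictionary,
    }

def to_vector(features, names):
    """Project a feature dict onto a fixed, ordered list of feature names."""
    return [1 if features.get(name) else 0 for name in names]

# Hypothetical feature inventory and dictionary for the example sentence.
FEATURE_NAMES = ["has_upper", "has_hyphen", "has_digit",
                 "prev=constitutive", "next=activity", "in_dictionary"]
TOY_DICTIONARY = {"nf-kappa"}

tokens = ["constitutive", "NF-kappa", "B", "activity"]
x = to_vector(token_features(tokens, 1, TOY_DICTIONARY), FEATURE_NAMES)
print(x)  # feature vector for the token "NF-kappa"
```

A vector like `x` is what a Maximum Entropy, SVM or Naive Bayes classifier would consume; each token is classified on its own, which is the sense in which the representation is order-independent.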
  15. Sequence labelling approaches
      Example sentence: "Conversely, treatment of human protein-tyrosine phosphatase alpha-overexpressing cells with phenylarsine oxide led to a loss of the constitutive NF-kappa B activity."
      Each token x has a label y, and neighbouring labels are linked:

        labels:  y1            y2        y3  y4
        tokens:  x1            x2        x3  x4
                 constitutive  NF-kappa  B   activity

        • Consider the complete ordered sequence of tokens in a sentence.
        • Predict the most probable sequence of tags for a given sequence of words in a sentence, using semantic and lexical features.
        • Takes order into account.
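Finding the most probable tag sequence is usually done with Viterbi decoding. The sketch below is a toy version with hand-set, hypothetical scores and a two-tag scheme (`O`, `GENE`); real taggers such as the CRF-based ones listed earlier learn their weights from a corpus, and nothing here reproduces their actual models. It assumes a non-empty token list.

```python
TAGS = ["O", "GENE"]

def emission_score(token, tag):
    """Hypothetical score: capitals or hyphens make a token look gene-like."""
    looks_like_gene = any(c.isupper() for c in token) or "-" in token
    if tag == "GENE":
        return 1.0 if looks_like_gene else -1.0
    return -1.0 if looks_like_gene else 1.0

# Hypothetical transition scores: staying in the same tag is mildly preferred.
TRANSITION = {("O", "O"): 0.5, ("O", "GENE"): 0.0,
              ("GENE", "GENE"): 0.5, ("GENE", "O"): 0.0}

def viterbi(tokens):
    """Return the highest-scoring tag sequence for a non-empty token list."""
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emission_score(tokens[0], tag), None) for tag in TAGS}]
    for token in tokens[1:]:
        column = {}
        for tag in TAGS:
            prev_tag = max(TAGS, key=lambda p: best[-1][p][0] + TRANSITION[(p, tag)])
            score = (best[-1][prev_tag][0] + TRANSITION[(prev_tag, tag)]
                     + emission_score(token, tag))
            column[tag] = (score, prev_tag)
        best.append(column)
    # Backtrack from the best final tag through the stored previous tags.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["constitutive", "NF-kappa", "B", "activity"]))
# ['O', 'GENE', 'GENE', 'O']
```

The transition scores are what distinguish this from the classification approach: the label chosen for "B" depends on the label chosen for "NF-kappa", so multi-token names tend to be tagged as a unit.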
  17. Performance - strict matching

      $$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

      | Tagger       | Notes             | Precision | Recall | F1     |
      |--------------|-------------------|-----------|--------|--------|
      | ABNER        | NLPBA corpus      | 0.4867    | 0.5584 | 0.5201 |
      | ABNER        | BCI corpus        | 0.6749    | 0.5830 | 0.6256 |
      | BANNER       | Hepple POS + BCII | 0.7605    | 0.7068 | 0.7327 |
      | BANNER       | MedPOS + BCII     | 0.7593    | 0.7195 | 0.7388 |
      | GENIA Tagger |                   | 0.4665    | 0.5789 | 0.5166 |
      | JNET         |                   | 0.5074    | 0.3802 | 0.4347 |
      | Whatizit     | whatizitSwissprot | 0.4980    | 0.3465 | 0.4087 |
      | Reflect.ws   |                   | 0.4678    | 0.3734 | 0.4153 |
  18. Performance - sloppy matching

      $$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

      | Tagger       | Notes             | Precision | Recall | F1     |
      |--------------|-------------------|-----------|--------|--------|
      | ABNER        | NLPBA corpus      | 0.6229    | 0.7146 | 0.6656 |
      | ABNER        | BCI corpus        | 0.8641    | 0.7465 | 0.8010 |
      | BANNER       | Hepple POS + BCII | 0.8654    | 0.8043 | 0.8337 |
      | BANNER       | MedPOS + BCII     | 0.8596    | 0.8146 | 0.8365 |
      | GENIA Tagger |                   | 0.5909    | 0.7334 | 0.6545 |
      | JNET         |                   | 0.5616    | 0.4208 | 0.4811 |
      | Whatizit     | whatizitSwissprot | 0.5061    | 0.3522 | 0.4154 |
      | Reflect.ws   |                   | 0.4829    | 0.3854 | 0.4287 |
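The difference between the two performance tables can be sketched in code. The slides do not define the matching schemes, so this example assumes that strict matching requires identical span boundaries and that sloppy matching accepts any overlap between a predicted and a gold span. The spans are hypothetical inclusive offsets, not corpus data.

```python
def evaluate(gold, predicted, sloppy=False):
    """Precision, recall and F1 for lists of (start, end) inclusive spans."""
    def matches(p, g):
        if sloppy:
            return p[0] <= g[1] and g[0] <= p[1]  # the two spans overlap
        return p == g                             # boundaries must be identical

    # Predictions that hit some gold span, and gold spans that were found.
    matched_pred = sum(1 for p in predicted if any(matches(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(matches(p, g) for p in predicted))

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical spans: one prediction is a character short, one is spurious.
gold = [(127, 135), (27, 54)]
predicted = [(127, 134), (27, 54), (0, 9)]

print("strict:", evaluate(gold, predicted))
print("sloppy:", evaluate(gold, predicted, sloppy=True))
```

With these toy spans the near-miss at (127, 134) counts as an error under strict matching and as a hit under sloppy matching, which mirrors the gap between the two tables for taggers that tend to get boundaries slightly wrong.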
  19. Availability
        • Most are easily available and released under open source licenses.
        • Variety of languages (primarily Java and C++).
        • Most require hacking to get them working:
            - OSCAR3 is a beast
            - GENIA: very easy to write a SWIG wrapper so you can call it from Python
            - JNET: a few hacks
        • Web services: ReflectWS (REST/SOAP), Whatizit (SOAP)
      Links:
        • http://pages.cs.wisc.edu/~bsettles/abner/
        • http://banner.sourceforge.net/
        • http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
        • http://linnaeus.sourceforge.net/
        • http://cubic.bioc.columbia.edu/services/nlprot/
        • http://www.ebi.ac.uk/webservices/whatizit/
        • http://sourceforge.net/projects/oscar3-chem/
        • http://julielab.de/
  22. Literature based discovery - CRPS
      Key entity: NF-κB
      Outcome:
        • NF-κB is involved in CRPS
        • allows generation of new mechanistic hypotheses
        • new drug target
      Hettne et al. (2007) Applied information retrieval and multidisciplinary research: new mechanistic hypotheses in Complex Regional Pain Syndrome.
  26. Finally...
        • For standalone use: BANNER
        • Web services: who knows?
        • Chemical NER: OSCAR (make sure you use the PubMed models)
        • Species NER: Linnaeus
      So now you have the named entities, you need to map them to canonical identifiers. This is called gene normalisation (GN), but that's for another talk.
      What are they doing? PPI extraction asks whether there is a physical interaction between two genes in an abstract, e.g. "Binding between Akt2 and APPL".
      Text mining is noisy and imperfect, but then so is manual curation (IAA).
      Text mining is a noisy (and biased) way of extracting information from noisy (and biased) text which represents the results of noisy (and biased) experiments carried out by researchers (who are probably noisy and biased).
  28. Shameless self-promotion
      Harmston, N., Filsell, W., and Stumpf, M. P. H. (2010) What the papers say: text mining for genomics and systems biology. Hum Genomics, 5(1), 17-29.
      nathan.harmston07@imperial.ac.uk
      Questions?
