Biomedical literature mining

Published on

BioSys course, Technical University of Denmark, Lyngby, Denmark, October 24, 2005

Biomedical literature mining

  1. 1. Biological Literature Mining Lars Juhl Jensen EMBL
  2. 2. Why?
  3. 3. Overview <ul><li>Information retrieval and text categorization </li></ul><ul><ul><li>Methodologies for finding and classifying texts </li></ul></ul><ul><li>Entity recognition and information extraction </li></ul><ul><ul><li>Identification of gene/protein/drug names in text </li></ul></ul><ul><ul><li>Statistical and NLP methods for relation extraction </li></ul></ul><ul><li>Text- and data-mining </li></ul><ul><ul><li>Making discoveries from text alone </li></ul></ul><ul><ul><li>Integration of text and other data types </li></ul></ul>
  4. 4. Status <ul><li>IR, ER, and simple IE methods are fairly well established </li></ul><ul><li>Advanced NLP-based IE systems are rapidly being improved </li></ul><ul><li>Methods for text mining and text/data integration are still in their infancy </li></ul>
  5. 5. Evaluation <ul><li>Computational linguist lingo </li></ul><ul><ul><li>Recall = sensitivity </li></ul></ul><ul><ul><li>Precision = positive predictive value (often loosely called specificity in bioinformatics) </li></ul></ul><ul><ul><li>F-score = 2 × recall × precision / (recall + precision) </li></ul></ul><ul><ul><li>Best F-score  best method </li></ul></ul><ul><li>CASP-like assessments </li></ul><ul><ul><li>IR: TREC </li></ul></ul><ul><ul><li>ER: BioCreAtIvE task 1 </li></ul></ul><ul><ul><li>(IE: BioCreAtIvE task 2) </li></ul></ul>
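The evaluation measures on this slide can be computed from raw counts. The following is a minimal sketch, assuming the counts of true positives, false positives, and false negatives are already known; the function name and the zero-division handling are illustrative choices, not part of the original slides.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from raw counts.

    tp: correct items that were retrieved
    fp: incorrect items that were retrieved
    fn: correct items that were missed
    """
    # Precision: fraction of retrieved items that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall (sensitivity): fraction of correct items that were retrieved.
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of recall and precision.
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score


if __name__ == "__main__":
    print(precision_recall_f(tp=8, fp=2, fn=8))
```

Because the F-score is a harmonic mean, it stays low when either precision or recall is low, so a method cannot score well by maximizing only one of the two.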
  6. 6. Corpora <ul><li>Plain text </li></ul><ul><ul><li>Publication abstracts: MEDLINE </li></ul></ul><ul><ul><li>Full text papers: PubMed Central / Highwire Press </li></ul></ul><ul><ul><li>Gene summaries: SGD, The Interactive Fly, OMIM, … </li></ul></ul><ul><ul><li>Patent descriptions: various patent databases </li></ul></ul><ul><li>Tagged text </li></ul><ul><ul><li>Categorization: MEDLINE MeSH terms </li></ul></ul><ul><ul><li>Syntactic tagging: GENIA </li></ul></ul><ul><ul><li>Semantic tagging: GENETAG </li></ul></ul>
  7. 7. Example <ul><li>Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation </li></ul>
  8. 8. Information Retrieval and Text Categorization Lars Juhl Jensen EMBL
  9. 9. Overview <ul><li>Ad hoc information retrieval </li></ul><ul><ul><li>The user enters a query/a set of keywords </li></ul></ul><ul><ul><li>The system attempts to retrieve the relevant texts from a large text corpus (typically MEDLINE) </li></ul></ul><ul><li>Text categorization </li></ul><ul><ul><li>A training set of texts is created in which texts are manually assigned to classes (often only yes/no) </li></ul></ul><ul><ul><li>A machine learning method is trained to classify texts </li></ul></ul><ul><ul><li>This method can subsequently be used to classify a much larger text corpus </li></ul></ul>
  10. 10. Ad hoc IR <ul><li>These systems are very useful since the user can provide any query </li></ul><ul><ul><li>The query is typically Boolean ( yeast AND cell cycle ) </li></ul></ul><ul><ul><li>A few systems instead allow the relative weight of each search term to be specified by the user </li></ul></ul><ul><li>The art is to find the relevant papers even if they do not actually match the query </li></ul><ul><ul><li>Ideally our example sentence should be extracted by the query yeast cell cycle although none of these words are mentioned </li></ul></ul>
  11. 17. Automatic query expansion <ul><li>In a typical query, the user will not have provided all relevant words and variants thereof </li></ul><ul><li>By automatically expanding queries with additional search terms, recall can be improved </li></ul><ul><ul><li>Stemming removes common endings ( yeast / yeasts ) </li></ul></ul><ul><ul><li>Thesauri can be used to expand queries with synonyms and/or abbreviations ( yeast / S. cerevisiae ) </li></ul></ul><ul><ul><li>The next logical step is to use ontologies to make complex inferences ( yeast cell cycle / Cdc28 ) </li></ul></ul>
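The first two expansion steps (stemming and thesaurus lookup) can be sketched as follows. This is a toy illustration only: the `THESAURUS` dictionary and the plural-stripping `stem` function are hypothetical stand-ins for a curated synonyms resource and a proper stemming algorithm.

```python
# Hypothetical toy thesaurus; a real system would load a curated resource.
THESAURUS = {
    "yeast": ["s. cerevisiae", "saccharomyces cerevisiae"],
}


def stem(word: str) -> str:
    """Very crude stemmer: strips a trailing plural 's' (yeasts -> yeast)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def expand_query(query: str) -> list[str]:
    """Return the stemmed query terms plus any thesaurus synonyms."""
    terms: list[str] = []
    for token in query.lower().split():
        base = stem(token)
        # Add the stemmed term first, then its synonyms, skipping duplicates.
        for term in [base] + THESAURUS.get(base, []):
            if term not in terms:
                terms.append(term)
    return terms


if __name__ == "__main__":
    print(expand_query("yeasts"))
```

The expanded terms would typically be combined with OR in the final query, which is why expansion raises recall and can lower precision.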
  12. 19. Document similarity <ul><li>The similarity of two documents can be defined based on their word content </li></ul><ul><ul><li>Each document can be represented by a word vector </li></ul></ul><ul><ul><li>Words should be weighted based on their frequency and background frequency </li></ul></ul><ul><ul><li>The most commonly used scheme is tf*idf weighting </li></ul></ul><ul><li>Document similarity can be used in ad hoc IR </li></ul><ul><ul><li>Rather than matching the query against each document only, the N most similar documents are also considered </li></ul></ul>
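The word-vector representation and tf*idf weighting described above can be sketched as below. This is one common variant (raw term frequency times the logarithm of inverse document frequency, compared by cosine similarity); whitespace tokenization and the function names are simplifying assumptions, and real systems differ in the exact weighting formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse tf*idf word vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Words that are frequent in the background get a low idf weight.
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "cdc28 phosphorylates swe1",
        "swe1 is phosphorylated by cdc28",
        "fish oil and migraine",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Note that a word occurring in every document receives weight zero under this scheme, so it contributes nothing to the similarity.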
  13. 20. Document clustering <ul><li>Unsupervised clustering algorithms can be applied to a document similarity matrix </li></ul><ul><ul><li>All pairwise document similarities are calculated </li></ul></ul><ul><ul><li>Clusters of “similar documents” can be constructed using one of numerous standard clustering methods </li></ul></ul><ul><li>Practical uses of document clustering </li></ul><ul><ul><li>The “related documents” function in PubMed </li></ul></ul><ul><ul><li>Logical organization of the documents found by IR </li></ul></ul>
  14. 21. Text categorization <ul><li>These systems are a lot less flexible than ad hoc systems but can attain better accuracy </li></ul><ul><ul><li>Works on a pre-defined set of document classes </li></ul></ul><ul><ul><li>Each class is defined by manually assigning a number of documents to it </li></ul></ul><ul><li>Method </li></ul><ul><ul><li>Rules may be manually crafted based on a very small set of manually classified documents </li></ul></ul><ul><ul><li>Statistical machine learning methods can be trained on a large number of classified documents </li></ul></ul>
  15. 22. Example <ul><li>Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation </li></ul><ul><li>Hints in the text </li></ul><ul><ul><li>Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”) </li></ul></ul><ul><ul><li>Weaker: mitotic cyclin , Clb2 , and Cdk1 (“cell cycle”) </li></ul></ul>
  16. 23. Machine learning <ul><li>Input features </li></ul><ul><ul><li>Word content or bi-/tri-grams </li></ul></ul><ul><ul><li>Part-of-speech tags </li></ul></ul><ul><ul><li>Filtering (stop words, part-of-speech) </li></ul></ul><ul><ul><li>Singular value decomposition </li></ul></ul><ul><li>Training </li></ul><ul><ul><li>Support vector machines are best suited </li></ul></ul><ul><ul><li>Choice of kernel function </li></ul></ul><ul><ul><li>Separate training and evaluation sets, cross validation </li></ul></ul>
  17. 25. Summary <ul><li>Pros and cons of ad hoc IR systems </li></ul><ul><ul><li>Highly flexible as it is not limited by a training data set </li></ul></ul><ul><ul><li>Can be very fast if the corpus is properly indexed </li></ul></ul><ul><ul><li>The accuracy and recall depend strongly on the ability of the user to select the right keywords </li></ul></ul><ul><ul><li>Some topics are not easily described by a query </li></ul></ul><ul><li>Pros and cons of text categorization methods </li></ul><ul><ul><li>Very high accuracy and recall can be attained </li></ul></ul><ul><ul><li>Requires a separate training set for each category </li></ul></ul>
  18. 26. Entity Recognition and Information Extraction Lars Juhl Jensen EMBL
  19. 27. Overview <ul><li>Entity recognition (ER) </li></ul><ul><ul><li>Finding the genes/proteins/drugs mentioned in a text </li></ul></ul><ul><ul><li>Word sense disambiguation </li></ul></ul><ul><li>Information extraction (IE) </li></ul><ul><ul><li>Simple statistical co-occurrence methods </li></ul></ul><ul><ul><li>Combining co-occurrence and text categorization </li></ul></ul><ul><ul><li>Natural Language Processing (NLP) </li></ul></ul>
  20. 28. Entity recognition <ul><li>An important but boring problem </li></ul><ul><ul><li>The genes/proteins/drugs mentioned within a given text must be identified </li></ul></ul><ul><li>Recognition vs. identification </li></ul><ul><ul><li>Recognition: find the words that are names of entities </li></ul></ul><ul><ul><li>Identification: figure out which entities they refer to </li></ul></ul><ul><ul><li>Recognition without identification is of limited use </li></ul></ul>
  21. 29. Example <ul><li>Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation </li></ul><ul><li>Entities identified </li></ul><ul><ul><li>S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C) </li></ul></ul>
  22. 30. Recognition <ul><li>Features </li></ul><ul><ul><li>Morphological: mixes letters and digits or ends in -ase </li></ul></ul><ul><ul><li>Context: followed by “protein” or “gene” </li></ul></ul><ul><ul><li>Grammar: should occur as a noun </li></ul></ul><ul><li>Methodologies </li></ul><ul><ul><li>Manually crafted rule-based systems </li></ul></ul><ul><ul><li>Machine learning (SVMs) </li></ul></ul><ul><li>But what can it be used for? </li></ul>
  23. 31. Identification <ul><li>A good synonyms list is the key </li></ul><ul><ul><li>Combine many sources </li></ul></ul><ul><ul><li>Curate to eliminate stop words </li></ul></ul><ul><li>Flexible matching to handle orthographic variation </li></ul><ul><ul><li>Case variation: CDC28 , Cdc28 , and cdc28 </li></ul></ul><ul><ul><li>Prefixes: myc and c-myc </li></ul></ul><ul><ul><li>Postfixes: Cdc28 and Cdc28p </li></ul></ul><ul><ul><li>Spaces and hyphens: cdc28 and cdc-28 </li></ul></ul><ul><ul><li>Latin vs. Greek letters: TNF-alpha and TNFA </li></ul></ul>
  24. 32. Disambiguation <ul><li>The same word may mean many different things </li></ul><ul><ul><li>Entity names may also be common English words ( hairy ) or technical terms ( SDS ) </li></ul></ul><ul><ul><li>Protein names may refer to related or unrelated proteins in other species ( cdc2 ) </li></ul></ul><ul><li>The meaning can be resolved from the context </li></ul><ul><ul><li>ER can distinguish between names and common words </li></ul></ul><ul><ul><li>Disambiguating non-unique names is a hard problem </li></ul></ul><ul><ul><li>Ambiguity between orthologs can safely be ignored </li></ul></ul>
  25. 38. Co-occurrence <ul><li>Relations are extracted for co-occurring entities </li></ul><ul><ul><li>Relations are always symmetric </li></ul></ul><ul><ul><li>The type of relation is not given </li></ul></ul><ul><li>Scoring the relations </li></ul><ul><ul><li>More co-occurrences → more significant </li></ul></ul><ul><ul><li>Ubiquitous entities → less significant </li></ul></ul><ul><ul><li>Same sentence vs. same paragraph </li></ul></ul><ul><li>Simple, good recall, poor precision </li></ul>
  26. 39. Example <ul><li>Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation </li></ul><ul><li>Relations </li></ul><ul><ul><li>Correct: Clb2–Cdc28 , Clb2–Swe1 , Cdc28–Swe1 , and Cdc5–Swe1 </li></ul></ul><ul><ul><li>Wrong: Clb2–Cdc5 and Cdc28–Cdc5 </li></ul></ul>
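The example can be reproduced with a minimal co-occurrence sketch. It assumes entity recognition has already produced a list of entity names per sentence. The score used here (pair count divided by the geometric mean of the two entity counts) is only one illustrative way to reward frequent pairs and penalize ubiquitous entities; it is not the scoring scheme of any particular system.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(sentences: list[list[str]]) -> dict[tuple[str, str], float]:
    """Score every unordered entity pair that shares a sentence."""
    entity_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for entities in sentences:
        # Count each entity once per sentence; sorting makes pairs symmetric.
        unique = sorted(set(entities))
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): n / (entity_counts[a] * entity_counts[b]) ** 0.5
        for (a, b), n in pair_counts.items()
    }


if __name__ == "__main__":
    # Entities identified in the example sentence (Swe1 is mentioned twice).
    example = [["Clb2", "Cdc28", "Swe1", "Cdc5", "Swe1"]]
    for pair, score in sorted(cooccurrence_scores(example).items()):
        print(pair, score)
```

All six pairs are extracted from the single sentence, including the two wrong ones (Clb2–Cdc5 and Cdc28–Cdc5), which illustrates why co-occurrence gives good recall but poor precision.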
  27. 41. Categorization <ul><li>Extracting specific types of relations </li></ul><ul><ul><li>Text categorization methods can be used to identify sentences that mention a certain type of relation </li></ul></ul><ul><ul><li>Filtering can be done before or after relation extraction </li></ul></ul><ul><li>Well suited for database curation </li></ul><ul><ul><li>Text categorization can be reused </li></ul></ul><ul><ul><li>High recall is most important </li></ul></ul><ul><ul><li>Curators can compensate for the lack of precision </li></ul></ul>
  28. 43. NLP <ul><li>Information is extracted based on parsing and interpreting phrases or full sentences </li></ul><ul><ul><li>Good at extracting specific types of relations </li></ul></ul><ul><ul><li>Handles directed relations </li></ul></ul><ul><li>Complex, good precision, poor recall </li></ul>
  29. 44. Example <ul><li>Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation </li></ul><ul><li>Relations: </li></ul><ul><ul><li>Complex: Clb2–Cdc28 </li></ul></ul><ul><ul><li>Phosphorylation: Clb2 → Swe1 , Cdc28 → Swe1 , and Cdc5 → Swe1 </li></ul></ul>
  30. 45. Architecture <ul><li>Tokenization </li></ul><ul><ul><li>Entity recognition with synonyms list </li></ul></ul><ul><ul><li>Word boundaries (multi words) </li></ul></ul><ul><ul><li>Sentence boundaries (abbreviations) </li></ul></ul><ul><li>Part-of-speech tagging </li></ul><ul><ul><li>TreeTagger trained on GENIA </li></ul></ul><ul><li>Semantic labeling </li></ul><ul><ul><li>Dictionary of regular expressions </li></ul></ul><ul><li>Entity and relation chunking </li></ul><ul><ul><li>Rule-based system implemented in CASS </li></ul></ul>
  31. 46. <ul><li>Semantic labeling </li></ul><ul><ul><li>Gene and protein names </li></ul></ul><ul><ul><li>Cue words for entity recognition </li></ul></ul><ul><ul><li>Cue words for relation extraction </li></ul></ul><ul><li>Named entity chunking </li></ul><ul><ul><li>A CASS grammar recognizes noun chunks related to gene expression: [ nxgene The GAL4 gene ] </li></ul></ul><ul><li>Relation chunking </li></ul><ul><ul><li>Our CASS grammar also extracts relations between entities: [ nxexpr The expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ] </li></ul></ul>
  32. 47. [ expression_repression_active Btk regulates the IL-2 gene ] [ dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1 ] [ phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk ] [ phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn ] [ phosphorylation_active Lyn also participates in [ phosphorylation the tyrosine phosphorylation and activation of syk ]] [ phosphorylation_active Lyn , [ negation but not Jak2 ] phosphorylated CrkL ] [ expression_repression_active IL-10 also decreased [ expression mRNA expression of IL-2 and IL18 cytokine receptors ]] [ expression_activation_passive [ expression IL-13 expression ] induced by IL-2 + IL-18 ]
  33. 49. MedScan
  34. 50. Summary <ul><li>Entity recognition </li></ul><ul><ul><li>The best methods rely on curated synonyms lists </li></ul></ul><ul><li>Co-occurrence methods </li></ul><ul><ul><li>High recall but typically poor precision </li></ul></ul><ul><ul><li>Cannot deal with directed interactions </li></ul></ul><ul><li>Natural language processing </li></ul><ul><ul><li>High precision but typically poor recall </li></ul></ul><ul><ul><li>Rule development is time consuming </li></ul></ul>
  35. 51. Text- and Data-mining Lars Juhl Jensen EMBL
  36. 52. Overview <ul><li>Pure text-mining </li></ul><ul><ul><li>Discovery of global trends </li></ul></ul><ul><ul><li>Inference of overlooked relations </li></ul></ul><ul><li>Integration of text and other data sources </li></ul><ul><ul><li>Augmented text-mining methods </li></ul></ul><ul><li>Automated annotation of high-throughput data </li></ul>
  37. 53. Trends <ul><li>Most similar to existing data mining approaches </li></ul><ul><ul><li>Although all the detailed data is in the text, people may have missed the big picture </li></ul></ul><ul><li>Temporal trends </li></ul><ul><ul><li>Historical summaries </li></ul></ul><ul><ul><li>Forecasting </li></ul></ul><ul><li>Correlations </li></ul><ul><ul><li>“ Customers who bought this item also bought …” </li></ul></ul>
  38. 54. Time
  39. 55. Successful genes
  40. 56. Buzzwords
  41. 57. Correlations <ul><li>“ Customers who bought this item also bought …” </li></ul><ul><li>Protein networks </li></ul><ul><ul><li>“ Proteins that regulate expression …” </li></ul></ul><ul><ul><li>“ Proteins that control phosphorylation …” </li></ul></ul><ul><ul><li>“ Proteins that are phosphorylated …” </li></ul></ul><ul><li>Co-author networks </li></ul>
  42. 58. Transcriptional networks 32 79 83 3592 Regulates Regulated P < 9 × 10⁻⁹
  43. 59. Signaling pathways 11 27 44 3704 Phosphorylates Phosphorylated P < 2 × 10⁻⁷
  44. 60. Multiple regulation 8 107 47 3625 Expression Phosphorylation P < 5 × 10⁻⁴
  45. 62. Nuggets <ul><li>New relations can be inferred from published ones </li></ul><ul><ul><li>This can lead to actual discoveries if no person knows all the facts required for making the inference </li></ul></ul><ul><ul><li>Combining facts from disconnected literatures </li></ul></ul><ul><li>Swanson’s pioneering work </li></ul><ul><ul><li>Fish oil and Raynaud's disease </li></ul></ul><ul><ul><li>Magnesium and migraine </li></ul></ul>
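The idea of inferring a new relation by combining facts from disconnected literatures can be sketched as a search for indirect links: if A is related to B and B to C, but A and C are never related directly, then A–C is a candidate. The function name and the intermediate term in the usage example are illustrative assumptions; real systems also need to rank the many candidates this produces.

```python
def infer_hidden_links(
    relations: list[tuple[str, str]],
) -> dict[tuple[str, str], set[str]]:
    """Return candidate A-C links with the intermediate terms B supporting them."""
    # Build a symmetric neighbor map from the published relations.
    neighbors: dict[str, set[str]] = {}
    for a, b in relations:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    inferred: dict[tuple[str, str], set[str]] = {}
    for a in neighbors:
        for b in neighbors[a]:
            for c in neighbors[b]:
                # Keep only pairs that are not already directly related.
                if c != a and c not in neighbors[a]:
                    pair = tuple(sorted((a, c)))
                    inferred.setdefault(pair, set()).add(b)
    return inferred


if __name__ == "__main__":
    # "blood viscosity" is an illustrative intermediate term.
    published = [
        ("fish oil", "blood viscosity"),
        ("blood viscosity", "Raynaud's disease"),
    ]
    print(infer_hidden_links(published))
```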
  46. 65. Integration <ul><li>Automatic annotation of high-throughput data </li></ul><ul><ul><li>Loads of fairly trivial methods </li></ul></ul><ul><li>Protein interaction networks </li></ul><ul><ul><li>Can unify many types of interactions </li></ul></ul><ul><ul><li>Powerful as exploratory visualization tools </li></ul></ul><ul><li>More creative strategies </li></ul><ul><ul><li>Identification of candidate genes for genetic diseases </li></ul></ul><ul><ul><li>Linking genes to traits based on species distributions </li></ul></ul>
  47. 71. RCCs
  48. 72. Disease candidate genes <ul><li>Rank the genes within a chromosomal region to which a disease has been mapped </li></ul><ul><li>Methods </li></ul><ul><ul><li>G2D </li></ul></ul><ul><ul><ul><li>Gene → Function → Chemical → Phenotype → Disease </li></ul></ul></ul><ul><ul><ul><li>Uses MEDLINE but not the text </li></ul></ul></ul><ul><ul><li>BITOLA </li></ul></ul><ul><ul><ul><li>Gene → Words → Disease (similar to ARROWSMITH ) </li></ul></ul></ul><ul><ul><li>Hide and co-workers </li></ul></ul><ul><ul><ul><li>Gene → Tissue → Disease </li></ul></ul></ul>
  49. 73. G2D
  50. 77. Genotype–phenotype <ul><li>Genes can be linked to traits by comparing the species distributions of both </li></ul><ul><ul><li>Mainly works for prokaryotes </li></ul></ul><ul><ul><li>Traits are represented by keywords </li></ul></ul><ul><li>Finding the species profiles </li></ul><ul><ul><li>Gene profiles are found by sequence similarity </li></ul></ul><ul><ul><li>Keyword profiles are based on co-occurrence with the species name in MEDLINE </li></ul></ul>
  51. 80. Annotation <ul><li>High-throughput experiments result in groups of related genes </li></ul><ul><ul><li>ER is used to find the associated abstracts </li></ul></ul><ul><ul><li>The frequency of each word is counted in the abstracts </li></ul></ul><ul><ul><li>Background frequencies of all words are pre-calculated </li></ul></ul><ul><ul><li>A statistical test is used to rank the words (typically Fisher’s exact test) </li></ul></ul><ul><li>The same strategy can be applied to find MeSH terms associated with a gene cluster </li></ul>
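The ranking step can be sketched with a one-sided Fisher's exact test, computed here directly from the hypergeometric distribution using only the standard library. The sketch assumes that frequencies are counted as the number of abstracts containing a word; the function names and the toy counts in the usage example are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    k: abstracts in the gene cluster that contain the word
    n: abstracts associated with the gene cluster
    K: abstracts in the whole corpus that contain the word
    N: abstracts in the whole corpus
    Returns P(X >= k) under the hypergeometric null model.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def rank_words(
    cluster_df: dict[str, int],
    background_df: dict[str, int],
    n_cluster: int,
    n_total: int,
) -> list[str]:
    """Rank words by increasing p-value (most over-represented first)."""
    scored = [
        (fisher_enrichment_p(k, n_cluster, background_df[word], n_total), word)
        for word, k in cluster_df.items()
    ]
    return [word for _, word in sorted(scored)]


if __name__ == "__main__":
    # Toy counts: both words occur in all 5 cluster abstracts,
    # but "the" also occurs in every abstract of the corpus.
    cluster = {"kinase": 5, "the": 5}
    background = {"kinase": 10, "the": 1000}
    print(rank_words(cluster, background, n_cluster=5, n_total=1000))
```

The background counts are what separate informative words from ubiquitous ones: a word found in every abstract of the corpus gets a p-value of 1 no matter how often it occurs in the cluster.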
  52. 81. Summary <ul><li>Mining for overlooked relations </li></ul><ul><ul><li>Few overlooked relations can be found from text alone </li></ul></ul><ul><ul><li>Methods that combine text and other data types have much better discovery potential </li></ul></ul><ul><li>Annotation/interpretation of high-throughput data </li></ul><ul><ul><li>Molecular networks can be useful for gaining an overview of large expression data sets </li></ul></ul><ul><ul><li>Literature can be used to find keywords for a group of genes, but this has few advantages over using GO terms </li></ul></ul>
  53. 82. Outlook Lars Juhl Jensen EMBL
  54. 83. Death? <ul><li>Literature mining will not be made obsolete by <insert your favorite new technology here> </li></ul><ul><ul><li>Repositories are always made too late </li></ul></ul><ul><ul><li>There will always be new types of relations </li></ul></ul><ul><ul><li>Semantically tagged XML may replace ER (hopefully!) </li></ul></ul><ul><ul><li>Semantically tagged XML will never tag everything </li></ul></ul><ul><li>Specific IE problems will become obsolete </li></ul><ul><ul><li>Protein function </li></ul></ul><ul><ul><li>Physical protein interactions </li></ul></ul>
  55. 84. Permission denied <ul><li>Open access </li></ul><ul><ul><li>Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible </li></ul></ul><ul><ul><li>Restricted access is already the primary problem </li></ul></ul><ul><li>Standard formats </li></ul><ul><ul><li>Getting the text out of a PDF file is not trivial </li></ul></ul><ul><ul><li>Many journals now store papers in XML format </li></ul></ul><ul><li>Where do I get all the patent text?! </li></ul>
  56. 85. Innovation <ul><li>The basic tools are now in place for IR, ER, and IE </li></ul><ul><ul><li>Development was driven by computational linguists </li></ul></ul><ul><li>Text- and data-mining </li></ul><ul><ul><li>Biologists are needed </li></ul></ul><ul><ul><li>Collaboration with linguists </li></ul></ul><ul><li>Lack of innovation </li></ul><ul><ul><li>Very few new ideas </li></ul></ul><ul><ul><li>Text should be combined with other data </li></ul></ul>
  57. 86. Acknowledgments <ul><li>EML Research </li></ul><ul><ul><li>Jasmin Saric </li></ul></ul><ul><ul><li>Isabel Rojas </li></ul></ul><ul><li>EMBL Heidelberg </li></ul><ul><ul><li>Peer Bork </li></ul></ul><ul><ul><li>Miguel Andrade </li></ul></ul><ul><ul><li>Rossitza Ouzounova </li></ul></ul><ul><ul><li>Jan Korbel </li></ul></ul><ul><ul><li>Tobias Doerks </li></ul></ul>
  58. 87. Exercises Lars Juhl Jensen EMBL
  59. 88. Information retrieval <ul><li>PubFinder </li></ul><ul><ul><li>http://www.glycosciences.de/tools/PubFinder/ </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Do a very specific search on PubMed that retrieves only around 10–20 relevant papers </li></ul></ul><ul><ul><li>See if PubFinder is able to retrieve more </li></ul></ul><ul><ul><li>Compare this with using the “Related Articles” function in PubMed </li></ul></ul>
  60. 89. Entity recognition <ul><li>iHOP </li></ul><ul><ul><li>http://www.pdg.cnb.uam.es/UniPub/iHOP/ </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Compare iHOP vs. PubMed for finding papers related to a particular gene </li></ul></ul><ul><ul><li>Use iHOP to construct a small literature-based network </li></ul></ul>
  61. 90. Information extraction <ul><li>Relation extraction </li></ul><ul><ul><li>iProLINK ( http://pir.georgetown.edu/iprolink/ ) </li></ul></ul><ul><ul><li>PreBIND ( http://prebind.bind.ca ) </li></ul></ul><ul><ul><li>PubGene ( http://www.pubgene.org ) </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Check how complex a sentence iProLINK can handle </li></ul></ul><ul><ul><li>Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING) </li></ul></ul>
  62. 91. Text mining <ul><li>ARROWSMITH </li></ul><ul><ul><li>http://arrowsmith.psych.uic.edu </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Fish oil and Raynaud's disease </li></ul></ul><ul><ul><li>Magnesium and migraine </li></ul></ul><ul><ul><li>Arginine and somatomedin C </li></ul></ul><ul><ul><li>Estrogen and Alzheimer's disease </li></ul></ul>
  63. 92. Integration 1 <ul><li>Protein networks </li></ul><ul><ul><li>STRING beta version ( http://string.embl.de:8080 ) </li></ul></ul><ul><ul><li>ProLinks ( http://dip.doe-mbi.ucla.edu/pronav/ ) </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Use both tools to find functions for proteins of known and unknown function </li></ul></ul><ul><ul><li>Use STRING to construct a network for a set of proteins </li></ul></ul><ul><ul><li>Try to reproduce the Ssn3–Msn2–Hsp104 link </li></ul></ul>
  64. 93. Integration 2 <ul><li>Finding candidate disease genes </li></ul><ul><ul><li>G2D ( http://www.ogic.ca/projects/g2d_2/ ) </li></ul></ul><ul><ul><li>BITOLA ( http://www.mf.uni-lj.si/bitola/ ) </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest </li></ul></ul><ul><ul><li>Compare the results with BITOLA (if you have the patience to figure out their interface!) </li></ul></ul>
  65. 94. Integration 3 <ul><li>Annotation of expression data </li></ul><ul><ul><li>MedMiner ( http://discover.nci.nih.gov/textmining/ ) </li></ul></ul><ul><li>Ideas </li></ul><ul><ul><li>Stating the obvious … do the one thing that MedMiner can do … </li></ul></ul>
