



Published in: Health & Medicine, Technology


  1. 1. Mining Medical Mountains: How Bioinformatics Can Help Medical Science David Wishart University of Alberta
  2. 2. The Library of Congress • 120 million items in storage • 54 million manuscripts • 18 million books • 12 million photographs • 4.5 million maps • 4.4 million technical reports • 1.1 million PhD dissertations • ~20 Terabytes of data
  3. 3. Some Numbers… • 3 scientific journals in 1750 • 120,000 scientific journals today • 500,000 medical articles/year • 4,000,000 scientific articles/year • 14,000,000 abstracts in PubMed derived from 4600 journals • 3,307,998,701 web pages on Google • 500,000,000,000,000 bytes on the Web
  4. 4. Some Numbers… • A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer. • Baasiri, R.A., Glasser, S.R., Steffen, D.L. & Wheeler, D.A. Oncogene 18, 7958-7965 (1999)
  5. 5. Some Graphs:
  6. 6. Multiplexed CE with Fluorescent detection ABI 3700 96x700 bases
  7. 7. Genomes • 5 vertebrates (human, mouse, rat, fugu) • 2 plants (Arabidopsis, rice) • 2 insects (fruit fly, mosquito) • 2 nematodes (C. elegans, C. briggsae) • 1 sea squirt • 4 parasites (Plasmodium, Guillardia) • 4 fungi (S. cerevisiae, S. pombe) • 140 bacteria and archaebacteria • 1000+ viruses
  8. 8. The Human Genome • 3.2 billion bases on 24 chromosomes • 3,201,762,515 bases sequenced (99%) • 23,531 - 31,609 genes (predicted) • 50,000+ named genes (synonyms) • 4000+ human diseases • 850-1039 disease-causing genes (identified)
  9. 9. A Tidal Wave of Data Made worse by….
  10. 10. The Language of Biology • The EGF receptor binds epidermal growth factor which triggers the phosphorylation of PLC-gamma followed by the binding and subsequent phosphorylation of Grb2 and SOS which leads to the formation of a Raf1-MEK complex which, in turn, leads to a p21ras auto-phosphorylation cascade. The complex then phosphorylates a MAP kinase which is transported to the nucleus via a nuclear transport signal which triggers the transcription of c-Fos, c-Myc and c-Jun which upon release in the rough ER are transported to…
  11. 11. How To Make Sense of This? • How to acquire biological or medical knowledge from English text? • How to build facts and relationships from scientific/medical articles? • How to put 100+ years of useful data into readily accessible electronic repositories (the back fill problem)?
  12. 12. Some Solutions • Text Mining… • Create electronic repositories of abstracts and articles (PubMed/Entrez) • Create glossaries and thesauri of terms • Employ machine learning methods to parse electronic text to extract or interpret key pieces of “atomic” information (SVM, Naïve Bayes, Reference Point Logistics, etc.)
  13. 13. PubMed
  14. 14. PubMed • Allows users to search by journal, keywords, titles, etc. • Uses MeSH (Medical Subject Headings) to allow automated search of synonyms (renal transplant = kidney transplantation) • API available to query PubMed automatically and remotely • Few users know how to use PubMed properly or to its full extent
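The remote query interface mentioned on this slide can be sketched with NCBI's E-utilities ESearch endpoint. This is a minimal sketch, not something shown in the deck: the function name `build_pubmed_search_url` and the choice of parameters are assumptions made for illustration, and the code only builds the request URL rather than contacting the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Return an ESearch URL asking PubMed for IDs of records matching `term`.

    The helper name and defaults are illustrative, not part of the deck.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # same query syntax as the PubMed search box
        "retmax": max_results,  # cap on the number of record IDs returned
        "retmode": "json",      # ask for JSON instead of the default XML
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_search_url('"ouellette bf"[au] AND yeast', max_results=5)
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) should return a JSON document whose `esearchresult` object carries the matching PubMed IDs; that step is left out here so the sketch runs offline.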
  15. 15. "ouellette bf"[au] AND yeast (Details)
  16. 16. MeSH: Medical Subject Heading ("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR "saccharomyces cerevisiae"[MeSH Terms]) OR yeast[Text Word]))
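The translation from the short query to the expanded MeSH query can be imitated with a small lookup table. This is a sketch under assumptions: `MESH_LOOKUP`, `expand_term` and `build_query` are hypothetical names, and the single table entry simply mirrors the expansion shown on the slide rather than PubMed's real term-mapping logic.

```python
# Hypothetical lookup from a free-text word to MeSH headings. The single entry
# mirrors the translation shown on the slide; PubMed's real mapping is larger.
MESH_LOOKUP = {"yeast": ["yeasts", "saccharomyces cerevisiae"]}


def expand_term(word, lookup=MESH_LOOKUP):
    """Expand one word into MeSH-heading clauses OR-ed with a text-word clause."""
    headings = lookup.get(word.lower(), [])
    text_clause = f"{word}[Text Word]"
    if not headings:
        # No known heading: fall back to a plain text-word search.
        return text_clause
    mesh_clause = " OR ".join(f'"{h}"[MeSH Terms]' for h in headings)
    return f"(({mesh_clause}) OR {text_clause})"


def build_query(author, word):
    """Combine an author clause with the expanded subject clause."""
    return f'("{author}"[au] AND {expand_term(word)})'


query = build_query("ouellette bf", "yeast")
print(query)
```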
  17. 17. Integrated Text/Sequence Searching with Entrez
  18. 18. PubCrawler
  19. 19. PubCrawler • Free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases • Lists new database entries that match search parameters (keywords, author names, etc.) specified by the user • Results are presented as an HTML Web page (Entrez-like format) • Can be downloaded or run as a service
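The alerting idea behind PubCrawler (report only new entries that match the user's search parameters, formatted as HTML) can be sketched as follows. This is not PubCrawler's code: the record layout, the function names `new_matches` and `to_html`, and the sample records are all made up for illustration.

```python
import html


def new_matches(records, seen_ids, keywords):
    """Return records not seen before whose title contains any keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # already reported in an earlier run
        title = rec["title"].lower()
        if any(k in title for k in wanted):
            hits.append(rec)
    return hits


def to_html(hits):
    """Render the hits as a simple HTML list, escaping the text."""
    items = "".join(
        f"<li>{html.escape(r['id'])}: {html.escape(r['title'])}</li>" for r in hits
    )
    return f"<ul>{items}</ul>"


# Made-up daily update; the IDs and titles are placeholders, not real records.
todays_records = [
    {"id": "rec1", "title": "Yeast two-hybrid screen of kinase partners"},
    {"id": "rec2", "title": "Survey of soil bacteria"},
    {"id": "rec3", "title": "A yeast knockout collection"},
]
hits = new_matches(todays_records, seen_ids={"rec1"}, keywords=["yeast"])
page = to_html(hits)
print(page)
```

Running the same function once a day with an updated `seen_ids` set gives the "alerting" behaviour described on the slide.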
  20. 20. MedMiner
  21. 21. MedMiner • A text miner that filters, extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query • Combines GeneCards and PubMed searches with an integrated text filter • L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein, (1999) BioTechniques 27:1210-1217.
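The sentence-level filtering described for MedMiner can be illustrated with a simple co-mention filter. This is a rough sketch, not MedMiner's algorithm: the function name `relevant_sentences`, the naive sentence splitter, and the placeholder names GENE1 and DRUGX are assumptions.

```python
import re


def relevant_sentences(text, term_a, term_b):
    """Return the sentences of `text` that mention both query terms."""
    # Naive split after ., ! or ? followed by whitespace; real abstracts
    # (abbreviations, decimals) need a more careful splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    a, b = term_a.lower(), term_b.lower()
    return [s for s in sentences if a in s.lower() and b in s.lower()]


# Toy abstract with placeholder names, written only to exercise the filter.
abstract = (
    "GENE1 was measured in all samples. "
    "Response to DRUGX was higher when GENE1 was expressed. "
    "DRUGX was given daily."
)
print(relevant_sentences(abstract, "GENE1", "DRUGX"))
```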
  22. 22. MedGene
  23. 23. MedGene • A list of human genes associated with a particular human disease in ranking order • A list of human genes associated with multiple human diseases in ranking order • A list of human diseases associated with a particular human gene in ranking order • A list of human genes associated with a particular human gene in ranking order • The sorted gene list from other disease related high-throughput experiments, (i.e. micro-array
  24. 24. MedGene Performance • Was able to identify >2400 genes associated with breast cancer in the literature • Existing databases only list 260 genes (of which MedGene found 240) • Could save hundreds of hours of literature searching & combing
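A ranked gene list of the kind MedGene produces can be approximated by counting how often each gene is mentioned together with the disease. The slides do not say how MedGene scores associations, so the raw co-occurrence count used here is a stand-in, and the function name and toy abstracts are hypothetical.

```python
from collections import Counter


def rank_genes(abstracts, disease, genes):
    """Rank genes by the number of abstracts that mention gene and disease together."""
    counts = Counter()
    needle = disease.lower()
    for text in abstracts:
        low = text.lower()
        if needle not in low:
            continue  # abstract is not about this disease
        for gene in genes:
            # Plain substring match; real gene names need synonym handling.
            if gene.lower() in low:
                counts[gene] += 1
    return counts.most_common()


# Toy abstracts with placeholder gene names.
abstracts = [
    "GENEA expression in breast cancer.",
    "GENEA and GENEB in breast cancer cell lines.",
    "GENEB in yeast.",
    "GENEA mutations and breast cancer risk.",
]
ranking = rank_genes(abstracts, "breast cancer", ["GENEA", "GENEB", "GENEC"])
print(ranking)
```

Genes that never co-occur with the disease (GENEC here) simply do not appear in the ranking.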
  25. 25. PolySearch
  26. 26. PolySearch • Searches over 14 million PubMed Records • Searches against 1622 diseases (and synonyms) • Searches using 9300 genes with 42,500 synonyms • Assesses quality using SCI list of impact factors for 8600+ journals
  27. 27. PolySearch • Supports PubMed text searching for gene & disease associations (user provides disease name) • Automatically scores and identifies genes and searches for known SNPs or mutations against standard databases • Grabs gene sequences and generates primers around SNPs • Archives (MySQL database) or sends results as HTML page to user
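The step "generates primers around SNPs" can be pictured as taking one stretch of sequence upstream of the SNP and the reverse complement of one stretch downstream. This is a simplified sketch, not PolySearch's method: real primer design also weighs melting temperature, GC content and specificity, and the function names and parameters below are assumptions.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_primers(seq, snp_index, primer_len=20, offset=10):
    """Return (forward, reverse) primers flanking the base at `snp_index`.

    The forward primer ends `offset` bases before the SNP; the reverse primer
    is the reverse complement of the window starting `offset` bases after it.
    """
    fwd_end = snp_index - offset
    fwd_start = fwd_end - primer_len
    rev_start = snp_index + 1 + offset
    rev_end = rev_start + primer_len
    if fwd_start < 0 or rev_end > len(seq):
        raise ValueError("sequence too short for the requested primers")
    forward = seq[fwd_start:fwd_end]
    reverse = reverse_complement(seq[rev_start:rev_end])
    return forward, reverse


# Toy sequence; the SNP is taken to be the base at index 10.
toy_seq = "AACCGGTTAACCGGTTAACC"
primers = flanking_primers(toy_seq, snp_index=10, primer_len=4, offset=1)
print(primers)
```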
  28. 28. Other Examples of Text or Web Mining
  29. 29.
  30. 30. Pre-BIND • Donaldson et al. BMC Bioinformatics 2003 4:11 • Used Support Vector Machine (SVM) to scan literature for protein interactions • Precision, accuracy and recall of 92% for correctly classifying PI abstracts • Estimated to capture 60% of all abstracted protein interactions for a given organism
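Precision, accuracy and recall are three different ratios over the same confusion counts, so it helps to see how each is computed. The sketch below uses the standard definitions; the counts are invented purely to show a case where all three come out at 92%, and they are not the Pre-BIND evaluation data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # flagged abstracts that are correct
    recall = tp / (tp + fn)                     # true interaction abstracts that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all abstracts classified correctly
    return precision, recall, accuracy


# Invented counts: 100 interaction abstracts and 100 other abstracts.
precision, recall, accuracy = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```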
  31. 31. Proteome Analyst • Uses Naïve Bayes methods in combination with sequence homology to identify “tokens” or nuggets of important information from text (titles, keywords, InterPro numbers and other data) • Produces quantitative estimates (queryable reliability scores) of protein function, location, etc.
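The idea of scoring a protein from text "tokens" with Naïve Bayes can be shown with a tiny multinomial classifier. This is a sketch of the general technique with add-one smoothing, not Proteome Analyst's implementation; the class name, the training tokens and the two location labels are toy assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes over lists of string tokens."""

    def __init__(self):
        self.class_counts = Counter()             # training examples per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()                        # all tokens seen in training

    def train(self, tokens, label):
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def scores(self, tokens):
        """Return a probability per label, usable as a reliability score."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            # Add-one (Laplace) smoothing so unseen tokens do not zero a class.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            log_p = math.log(n / total)
            for tok in tokens:
                log_p += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = log_p
        # Convert log scores to normalised probabilities in a stable way.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}


# Toy training data: annotation tokens paired with a location label.
model = NaiveBayes()
model.train(["transmembrane", "receptor"], "membrane")
model.train(["transmembrane", "channel"], "membrane")
model.train(["histone", "dna-binding"], "nucleus")
model.train(["transcription", "dna-binding"], "nucleus")

result = model.scores(["receptor", "transmembrane"])
print(result)
```

The returned probabilities play the role of the quantitative, queryable scores mentioned on the slide: a caller can keep only predictions above a chosen threshold.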
  32. 32. GenePublisher • Processes raw genechip data and produces a publishable report in 1-2 hours of processor time • Mines existing databases to build up or extract relationships • Learns from previous analyses and remembers previous associations
  33. 33. GenePublisher Output
  34. 34. Continuing Problems in Text Mining Biomedical Literature are…
  35. 35. A Serious Naming Problem • Sonic Hedgehog • Draculin • Profilactin • Knobhead • Lunatic Fringe • Fidgetin • Mortalin • Antiquitin • Accelerin • Cockeye • Clootie Dumpling • SnaFu • Gleeful • Bang Senseless • Bride of Sevenless • Crack • Christmas Factor • Orphanin
  36. 36. And Exotic Terminology… • J. Med. Genetics 10, 1962-6 (1973) "Mobius Syndrome with Poland’s Anomaly." • Heavy use of eponyms (Werner’s syndrome, Down’s syndrome, Angelman’s syndrome, Creutzfeldt-Jakob disease, etc.)
  37. 37. Some Challenges • How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently? • How to ascribe and name a function, process or location consistently? • How to describe interactions, partners, reactions and complexes? • How to classify genes & proteins (a universal taxonomy of sequences and structures)?
  38. 38. Some Solutions • Develop controlled or restricted vocabularies (IUPAC-like naming conventions) • Create thesauri, central repositories or synonym lists (MeSH terms in PubMed) • Work towards synoptic reporting and structured abstracting
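A synonym list of the kind proposed here boils down to mapping every variant onto one preferred term before searching or counting. The sketch below shows that normalisation step with a hypothetical table; the one entry reuses the renal transplant example from the PubMed slide, and the names `SYNONYMS` and `normalize` are assumptions.

```python
# Hypothetical thesaurus: preferred term -> known synonyms.
SYNONYMS = {
    "kidney transplantation": ["renal transplant"],
}


def build_index(synonyms):
    """Invert the thesaurus so any variant looks up its preferred term."""
    index = {}
    for preferred, variants in synonyms.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


INDEX = build_index(SYNONYMS)


def normalize(term, index=INDEX):
    """Return the preferred term, or the input unchanged if it is unknown."""
    return index.get(term.strip().lower(), term)


print(normalize("Renal Transplant"))
```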
  39. 39. Synoptic or Structured Abstract J Am Acad Dermatol. 2004 Mar;50(3):431-4. Demand outstrips supply of US pediatric dermatologists: Results from a national survey. Hester EJ, McNealy KM, Kelloff JN, Diaz PH, Weston WL, Morelli JG, Dellavalle RP. BACKGROUND: The US pediatric dermatology workforce was last examined in 1986 when limited employment opportunity was found. OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues. METHODS: US dermatology chairpersons and residency program directors were surveyed for: (1) agreement with pediatric dermatology workforce statements; and (2) pediatric dermatology faculty and fellow numbers. RESULTS: Respondents agreed that having a pediatric dermatologist or dermatologists on faculty is important, and that a shortage of pediatric dermatologists exists, but did not agree that increasing pediatric dermatology training requirements will increase this shortage. Almost half of the programs (45/94) employed a full-time pediatric dermatologist, and 24 programs had currently been recruiting a pediatric dermatologist for more than 1 year. Only 6 pediatric dermatology fellows were in training. CONCLUSION: Given that open pediatric dermatology faculty positions greatly exceed the number of fellows in training and that formal training requirements will be increasing, the shortage of pediatric dermatologists will likely continue.
  40. 40. GO-Gene Ontology • To produce a controlled vocabulary that changes as biological knowledge changes • Categorizes according to 1) molecular function; 2) biological process; and 3) cellular component • Represents contributions and consensus opinions from multiple experts in various fields • Aim is to have every known protein and gene annotated consistently
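The three-way categorisation used by GO maps naturally onto a small data structure in which every annotation carries exactly one of the three aspects. This is an illustrative sketch only: the class names, the placeholder gene name and the sample annotations are assumptions, and real GO annotations also carry term identifiers and evidence codes.

```python
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    """The three categories named on the slide."""
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"


@dataclass(frozen=True)
class Annotation:
    gene: str       # gene or protein being annotated
    term: str       # controlled-vocabulary term
    aspect: Aspect  # which of the three categories the term belongs to


def group_by_aspect(annotations):
    """Collect the terms of a set of annotations under each aspect."""
    grouped = {aspect: [] for aspect in Aspect}
    for ann in annotations:
        grouped[ann.aspect].append(ann.term)
    return grouped


# Illustrative annotations for a placeholder gene.
annotations = [
    Annotation("example_gene", "kinase activity", Aspect.MOLECULAR_FUNCTION),
    Annotation("example_gene", "signal transduction", Aspect.BIOLOGICAL_PROCESS),
    Annotation("example_gene", "nucleus", Aspect.CELLULAR_COMPONENT),
]
grouped = group_by_aspect(annotations)
for aspect, terms in grouped.items():
    print(aspect.value, "->", terms)
```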
  41. 41. NIH’s Medical Ontology Research Program
  42. 42. MeSH
  43. 43. OMIM
  44. 44. DrugBank
  45. 45. Bioinformatics Medinformatics
  46. 46. Conquering the Mountain