Mining text and data on chemicals

Published in: Technology, Business

  1. 1. Mining text and data on chemicals (Lars Juhl Jensen)
  2. 2. three parts
  3. 3. text mining
  4. 4. data integration
  5. 5. medical records
  6. 6. Part 1: text mining
  7. 7. exponential growth
  8. 8. some things are constant
  9. 9. ~45 seconds per paper
  10. 10. information retrieval
  11. 11. find the relevant papers
  12. 12. still too much to read
  13. 13. computer
  14. 14. as smart as a dog
  15. 15. teach it specific tricks
  16. 16. named entity recognition
  17. 17. identify the concepts
  18. 18. small molecules
  19. 19. proteins
  20. 20. diseases
  21. 21. comprehensive lexicon
  22. 22. synonyms
  23. 23. orthographic variation
  24. 24. “black list”
  25. 25. unfortunate names
  26. 26. Reflect
  27. 27. augmented browsing
  28. 28. browser add-on
  29. 29. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009; O’Donoghue et al., Journal of Web Semantics, 2010
  30. 30. Firefox
  31. 31. Internet Explorer
  32. 32. Google Chrome
  33. 33. Safari
  34. 34. Utopia Documents
  35. 35. web services
  36. 36. collaboration
  37. 37. SciVerse
  38. 38. information extraction
  39. 39. formalize the facts
  40. 40. co-mentioning
  41. 41. NLP: Natural Language Processing
  42. 42. Gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
  43. 43. Part 2: data integration
  44. 44. STITCH
  45. 45. Kuhn et al., Nucleic Acids Research, 2012
  46. 46. ~300,000 small molecules
  47. 47. ~2.6 million proteins
  48. 48. 1100+ genomes
  49. 49. experimental data
  50. 50. physical binding
  51. 51. chemical–protein
  52. 52. protein–protein
  53. 53. curated knowledge
  54. 54. drug targets
  55. 55. complexes
  56. 56. pathways
  57. 57. Letunic & Bork, Trends in Biochemical Sciences, 2008
  58. 58. text mining
  59. 59. co-mentioning
  60. 60. NLP: Natural Language Processing
  61. 61. many data types
  62. 62. many databases
  63. 63. different formats
  64. 64. different identifiers
  65. 65. variable quality
  66. 66. not comparable
  67. 67. spread over many genomes
  68. 68. quality scores
  69. 69. von Mering et al., Nucleic Acids Research, 2005
  70. 70. calibrate vs. gold standard
  71. 71. von Mering et al., Nucleic Acids Research, 2005
  72. 72. probabilistic scores
  73. 73. orthology transfer
  74. 74. combine the evidence
  75. 75. Part 3: patient records
  76. 76. a hard problem
  77. 77. in Danish
  78. 78. by busy doctors
  79. 79. about psychiatric patients
  80. 80. no lexicon
  81. 81. acronyms
  82. 82. typos
  83. 83. delusions
  84. 84. domain specific system
  85. 85. patient record excerpt
  86. 86. Negation, F20, F200, Family
  87. 87. medication
  88. 88. adverse drug events
  89. 89. diagnoses
  90. 90. pharmacovigilance
  91. 91. patient stratification
  92. 92. Roque et al., PLoS Computational Biology, 2011
  93. 93. disease comorbidity
  94. 94. Roque et al., PLoS Computational Biology, 2011
  95. 95. DNA sequencing
  96. 96. genotype
  97. 97. phenotype
  98. 98. Acknowledgments. Reflect: Sune Frankild, Heiko Horn, Evangelos Pafilis, Juan-Carlos Silla-Castro, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue. STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Pablo Minguez, Tobias Doerks, Manuel Stark, Christian von Mering, Peer Bork. EPJ-mining: Francisco S Roque, Peter B Jensen, Robert Eriksson, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, Søren Brunak
  99. 99. larsjuhljensen
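
The text-mining slides in Part 1 (comprehensive lexicon, synonyms, orthographic variation, "black list", co-mentioning) can be illustrated with a small sketch. This is not the code behind Reflect or STITCH. The lexicon entries, the black list, and the function names `name_pattern`, `tag`, and `co_mentions` are all hypothetical and chosen only for illustration. It assumes orthographic variation is limited to case and to an optional space or hyphen between name parts.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon: surface name -> (entity type, canonical identifier).
# Synonyms are simply several names that map to the same identifier.
LEXICON = {
    "aspirin": ("chemical", "aspirin"),
    "acetylsalicylic acid": ("chemical", "aspirin"),
    "cox-2": ("protein", "PTGS2"),
    "ptgs2": ("protein", "PTGS2"),
    "hap1": ("protein", "HAP1"),
    "was": ("protein", "WAS"),  # an "unfortunate name": also a common English word
}

# Hypothetical black list: names that are too ambiguous to tag.
BLACK_LIST = {"was"}


def name_pattern(name):
    """Build a case-insensitive regex that tolerates simple orthographic
    variation: an optional space or hyphen between the parts of a name."""
    parts = [re.escape(p) for p in re.split(r"[\s\-]+", name)]
    body = r"[\s\-]?".join(parts)
    # Require that the match is not embedded in a longer word.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


PATTERNS = [
    (name_pattern(name), etype, ident)
    for name, (etype, ident) in LEXICON.items()
    if name not in BLACK_LIST
]


def tag(text):
    """Named entity recognition by dictionary lookup.
    Returns a sorted list of (start, end, matched text, type, identifier)."""
    hits = []
    for pattern, etype, ident in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(0), etype, ident))
    return sorted(hits)


def co_mentions(documents):
    """Count, for each pair of entities, the number of documents
    in which both are mentioned at least once."""
    counts = Counter()
    for doc in documents:
        entities = sorted({(h[3], h[4]) for h in tag(doc)})
        counts.update(combinations(entities, 2))
    return counts


if __name__ == "__main__":
    docs = [
        "Aspirin inhibits COX-2.",
        "Acetylsalicylic acid was shown to block Cox2 activity.",
        "HAP1 was studied in yeast.",
    ]
    for doc in docs:
        print(tag(doc))
    print(co_mentions(docs))
```

Raw co-mention counts like these are only a starting point. Turning them into comparable confidence values is what the quality-score slides in Part 2 address, and the sketch does not attempt that step.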
