Large-scale integration of data and text

Published in: Science

  1. 1. Lars Juhl Jensen Large-scale integration of data and text
  2. 2. Lars Juhl Jensen Large-scale integration of data and text
  3. 3. Ph.D.
  4. 4. sequence analysis
  5. 5. postdoc
  6. 6. staff scientist
  7. 7. protein networks
  8. 8. cellular signalling
  9. 9. group leader
  10. 10. cofounder
  11. 11. data integration
  12. 12. omics data
  13. 13. association networks
  14. 14. text mining
  15. 15. biomedical literature
  16. 16. electronic health records
  17. 17. association networks
  18. 18. guilt by association
  19. 19. STRING
  20. 20. Franceschini et al., Nucleic Acids Research, 2013
  21. 21. 1100+ genomes
  22. 22. genomic context
  23. 23. gene fusion
  24. 24. Korbel et al., Nature Biotechnology, 2004
  25. 25. operons
  26. 26. Korbel et al., Nature Biotechnology, 2004
  27. 27. bidirectional promoters
  28. 28. Korbel et al., Nature Biotechnology, 2004
  29. 29. phylogenetic profiles
  30. 30. Korbel et al., Nature Biotechnology, 2004
  31. 31. a real example
  32. 32. Cell Cellulosomes Cellulose
  33. 33. experimental data
  34. 34. gene coexpression
  35. 35. physical interactions
  36. 36. Jensen & Bork, Science, 2008
  37. 37. genetic interactions
  38. 38. Beyer et al., Nature Reviews Genetics, 2007
  39. 39. curated knowledge
  40. 40. pathways
  41. 41. Letunic & Bork, Trends in Biochemical Sciences, 2008
  42. 42. many databases
  43. 43. different formats
  44. 44. different identifiers
  45. 45. variable quality
  46. 46. not comparable
  47. 47. not same species
  48. 48. hard work
  49. 49. (Ph.D. students)
  50. 50. quality scores
  51. 51. von Mering et al., Nucleic Acids Research, 2005
  52. 52. calibrate vs. gold standard
  53. 53. von Mering et al., Nucleic Acids Research, 2005
  54. 54. homology-based transfer
  55. 55. Franceschini et al., Nucleic Acids Research, 2013
  56. 56. missing most of the data
  57. 57. text mining
  58. 58. >10 km
  59. 59. too much to read
  60. 60. computer
  61. 61. as smart as a dog
  62. 62. teach it specific tricks
  63. 63. named entity recognition
  64. 64. comprehensive lexicon
  65. 65. cyclin dependent kinase 1
  66. 66. CDC2
  67. 67. flexible matching
  68. 68. cyclin dependent kinase 1
  69. 69. cyclin-dependent kinase 1
  70. 70. orthographic variation
  71. 71. CDC2
  72. 72. hCdc2
  73. 73. “black list”
  74. 74. SDS
  75. 75. augmented browsing
  76. 76. Reflect
  77. 77. browser add-on
  78. 78. real-time text mining
  79. 79. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009; O’Donoghue et al., Journal of Web Semantics, 2010
  80. 80. information extraction
  81. 81. co-mentioning
  82. 82. within documents
  83. 83. within paragraphs
  84. 84. within sentences
  85. 85. NLP (Natural Language Processing)
  86. 86. grammatical analysis
  87. 87. Legend: gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
  88. 88. more precise
  89. 89. worse recall
  90. 90. related web resources
  91. 91. STITCH
  92. 92. STRING + 300k chemicals
  93. 93. stitch-db.org
  94. 94. COMPARTMENTS
  95. 95. compartments.jensenlab.org
  96. 96. TISSUES
  97. 97. tissues.jensenlab.org
  98. 98. DISEASES
  99. 99. diseases.jensenlab.org
  100. 100. general framework
  101. 101. curated knowledge
  102. 102. experimental data
  103. 103. text mining
  104. 104. computational predictions
  105. 105. common identifiers
  106. 106. quality scores
  107. 107. visualization
  108. 108. web resources
  109. 109. download files
  110. 110. why so many?
  111. 111. Swiss army knife syndrome
  112. 112. targeted resources
  113. 113. common infrastructure
  114. 114. medical data mining
  115. 115. Jensen et al., Nature Reviews Genetics, 2012
  116. 116. opt-out
  117. 117. opt-in
  118. 118. centralized registries
  119. 119. structured data
  120. 120. Jensen et al., Nature Reviews Genetics, 2012
  121. 121. 14 years
  122. 122. 6.2 million patients
  123. 123. 119 million diagnoses
  124. 124. distributions
  125. 125. Jensen et al., submitted, 2014
  126. 126. diagnosis trajectories
  127. 127. Jensen et al., submitted, 2014
  128. 128. Jensen et al., submitted, 2014
  129. 129. complex trajectories
  130. 130. Jensen et al., submitted, 2014
  131. 131. confounding factors
  132. 132. correlation ≠ causation
  133. 133. electronic health records
  134. 134. unstructured data
  135. 135. Danish
  136. 136. busy doctors
  137. 137. pharmacovigilance
  138. 138. custom dictionaries
  139. 139. drugs
  140. 140. adverse drug events
  141. 141. typo rules
  142. 142. complex filters
  143. 143. Eriksson et al., Drug Safety, 2014
  144. 144. new adverse drug reactions
  145. 145. Eriksson et al., Drug Safety, 2014
       Drug substance      | ADE                 | p-value
       Chlordiazepoxide    | Nystagmus           | 4.0e-8
       Simvastatin         | Personality changes | 8.4e-8
       Dipyridamole        | Visual impairment   | 4.4e-4
       Citalopram          | Psychosis           | 8.8e-4
       Bendroflumethiazide | Apoplexy            | 8.5e-3
  146. 146. direct medical implications
  147. 147. Acknowledgments
       STRING/STITCH: Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
       Text mining: Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue
       EHR mining: Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak
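
Slides 63 to 74 outline dictionary-based named entity recognition: a comprehensive lexicon, flexible matching across spaces and hyphens, orthographic variants such as hCdc2, and a black list for ambiguous names such as SDS. The following is a minimal sketch of that idea, not the tagger used in the talk. The lexicon entries, the `h` species-prefix rule, and all function names are illustrative assumptions.

```python
import re

# Illustrative mini-lexicon (assumption): canonical symbol -> synonyms.
# The CDK1 names come from slides 65-72; the SDS entry is only there to
# demonstrate the black list from slides 73-74.
LEXICON = {
    "CDK1": ["cyclin dependent kinase 1", "CDC2"],
    "SDS": ["serine dehydratase"],
}
# Names considered too ambiguous to tag, stored in normalized form.
BLACKLIST = {"sds"}


def normalize(name):
    """Lowercase and drop whitespace and hyphens (flexible matching)."""
    return re.sub(r"[\s\-]+", "", name.lower())


def build_index(lexicon, blacklist):
    """Map every normalized, non-blacklisted name to its canonical symbol."""
    index = {}
    for canonical, synonyms in lexicon.items():
        for name in [canonical] + synonyms:
            key = normalize(name)
            if key not in blacklist:
                index[key] = canonical
    return index


def tag(text, index, max_tokens=4):
    """Return (matched text, canonical symbol) pairs, longest match first."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    hits = []
    i = 0
    while i < len(tokens):
        step = 1
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i].start(), tokens[i + n - 1].end()
            key = normalize(text[start:end])
            candidates = [key]
            # Assumed orthographic rule: a leading "h" on a single token
            # may be a species prefix, as in hCdc2.
            if n == 1 and key.startswith("h"):
                candidates.append(key[1:])
            found = next((index[c] for c in candidates if c in index), None)
            if found is not None:
                hits.append((text[start:end], found))
                step = n
                break
        i += step
    return hits


index = build_index(LEXICON, BLACKLIST)
print(tag("hCdc2, also called cyclin-dependent kinase 1, was run on an SDS gel.", index))
```

In this sketch the black list is applied when the index is built, so "SDS" is never tagged even though it is in the lexicon, while "serine dehydratase" would still match.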

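
Slides 80 to 84 describe information extraction by co-mentioning within documents, paragraphs, and sentences. As a rough sketch of what counting at those three levels could look like, the code below tallies name pairs per unit of text. The name set is borrowed from the example on slide 87; the naive sentence splitter and the function names are assumptions, and no scoring scheme from the talk is implied.

```python
import re
from collections import Counter
from itertools import combinations

# Names to look for (taken from the example sentence on slide 87).
NAMES = {"CYC1", "CYC7", "HAP1"}


def pairs(text):
    """Return the set of sorted name pairs mentioned together in text."""
    found = {tok for tok in re.findall(r"[A-Za-z0-9]+", text) if tok in NAMES}
    return set(combinations(sorted(found), 2))


def count_comentions(documents):
    """Count, per pair, the number of documents, paragraphs, and sentences
    in which both names occur."""
    counts = {"document": Counter(), "paragraph": Counter(), "sentence": Counter()}
    for doc in documents:
        counts["document"].update(pairs(doc))
        # Paragraphs are assumed to be separated by blank lines.
        for para in re.split(r"\n\s*\n", doc):
            counts["paragraph"].update(pairs(para))
            # Naive sentence split on ., !, or ? followed by whitespace.
            for sent in re.split(r"(?<=[.!?])\s+", para):
                counts["sentence"].update(pairs(sent))
    return counts


doc = (
    "The expression of CYC1 and CYC7 is controlled by HAP1.\n\n"
    "CYC1 was studied further. HAP1 was not."
)
counts = count_comentions([doc])
for level, counter in counts.items():
    print(level, dict(counter))
```

Sentence-level co-mentions are the strictest evidence and document-level the loosest, which is why keeping the three counts separate is useful before combining them into any score.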