Mining literature and medical records
>10 km
Lars Juhl Jensen
exponential growth
~45 seconds per paper
outline
information retrieval
named entity recognition
augmented browsing
information extraction
text corpora
web resources
electronic health records
medical text mining
questions
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
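The indexing, stemming and query steps above can be sketched in a few lines. This is a minimal illustration, not how PubMed is implemented: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, the toy documents are invented, and all query terms are simply combined with AND.

```python
import re
from collections import defaultdict

def stem(word):
    """Crude suffix stripping (assumption: a real system would use a proper stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, split on non-alphanumerics, and stem each token."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(docs):
    """Inverted index: stemmed term -> set of document ids, for fast lookup."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Return ids of documents that contain every stemmed query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented toy corpus for illustration only.
docs = {
    1: "Cell cycles in budding yeast",
    2: "Yeast fermentation",
    3: "The cell cycle of human cells",
}
index = build_index(docs)
hits = search_and(index, "yeast cell cycle")
```

Because of stemming, the query term "cycle" also matches "cycles" in document 1. Nothing here covers query expansion with MeSH terms, which would need the MeSH vocabulary itself.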
named entity recognition
identify the concepts
computer
as smart as a dog
teach it specific tricks
comprehensive lexicon
small molecules
proteins
cellular components
tissues
organisms
environments
diseases
phenotypes
behaviors
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
orthographic variation
prefixes and postfixes
CDC28 vs. Cdc28p
Myc vs. c-Myc
singular and plural forms
noun and adjective forms
flexible matching
upper- and lower-case
spaces and hyphens
disambiguation
homonyms
“black list”
unfortunate names
SDS
a
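The orthographic variation, flexible matching and black-list ideas above can be sketched as a small dictionary tagger. This is an assumption-laden illustration, not the tagger used in the talk: the variant rules (a "p" postfix, a "c" prefix), the identifiers `HYPOTHETICAL_SDS` and `HYPOTHETICAL_A`, and the example sentence are all invented for the example.

```python
import re

def normalize(name):
    """Ignore case, spaces and hyphens, so 'c-Myc' and 'C MYC' compare equal."""
    return re.sub(r"[\s\-]+", "", name.lower())

def variants(name):
    """Illustrative variant rules: postfix as in Cdc28p, prefix as in c-Myc."""
    key = normalize(name)
    return {key, key + "p", "c" + key}

def build_lexicon(entries, blacklist):
    """Map every variant to its identifier, skipping black-listed names."""
    lexicon = {}
    for identifier, names in entries.items():
        for name in names:
            if normalize(name) in blacklist:
                continue  # unfortunate names such as 'SDS' or 'a'
            for form in variants(name):
                lexicon[form] = identifier
    return lexicon

def tag(text, lexicon):
    """Return (start, end, identifier) for every token found in the lexicon."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        key = normalize(match.group())
        if key in lexicon:
            hits.append((match.start(), match.end(), lexicon[key]))
    return hits

# Hypothetical lexicon entries; the last two stand for 'unfortunate' names.
entries = {
    "CDC28": ["CDC28", "Cdk1"],
    "MYC": ["Myc"],
    "HYPOTHETICAL_SDS": ["SDS"],
    "HYPOTHETICAL_A": ["a"],
}
blacklist = {"sds", "a"}
lexicon = build_lexicon(entries, blacklist)
hits = tag("Cdc28p and c-Myc in a gel with SDS", lexicon)
```

Here `hits` contains the matches for "Cdc28p" and "c-Myc", while "a" and "SDS" are suppressed by the black list. Token-by-token lookup in a hash table is also one simple way to keep such a tagger fast on a large corpus.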
scalable implementation
>10 km in <10 hours
augmented browsing
show relevant information
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
browser add-on
Firefox
Google Chrome
Safari
Internet Explorer
PDF viewer
Utopia Documents
web services
still too much to read
information extraction
formalize the facts
the starting point
named entity recognition
two approaches
co-mentioning
within documents
within paragraphs
within sentences
weighted counts
co-mentioning score
absolute co-mentionings
relative overrepresentation
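One way to combine weighted counts, absolute co-mentionings and relative overrepresentation into a single score is sketched below. The weights, the exponent `alpha` and the exact formula are assumptions chosen for illustration; the slides do not state them. Input is assumed to be already tagged: each document is a list of paragraphs, each paragraph a list of sentences, each sentence a set of entity identifiers.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights for co-mentions at each level of the text.
W_DOCUMENT, W_PARAGRAPH, W_SENTENCE = 1.0, 2.0, 3.0

def weighted_counts(corpus):
    """Sum, over documents, a weight for each level at which a pair co-occurs."""
    counts = Counter()
    for document in corpus:
        doc_entities, par_pairs, sent_pairs = set(), set(), set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs.update(combinations(sorted(sentence), 2))
                par_entities |= set(sentence)
            par_pairs.update(combinations(sorted(par_entities), 2))
            doc_entities |= par_entities
        for pair in combinations(sorted(doc_entities), 2):
            counts[pair] += (
                W_DOCUMENT
                + W_PARAGRAPH * (pair in par_pairs)
                + W_SENTENCE * (pair in sent_pairs)
            )
    return counts

def co_mention_scores(counts, alpha=0.6):
    """Blend the absolute count with the observed/expected ratio (alpha is assumed)."""
    total = sum(counts.values())
    marginal = Counter()
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    scores = {}
    for (a, b), c in counts.items():
        overrepresentation = c * total / (marginal[a] * marginal[b])
        scores[(a, b)] = c ** alpha * overrepresentation ** (1 - alpha)
    return scores

# Toy corpus: same sentence, same paragraph only, same document only.
corpus = [
    [[{"CDC28", "SWE1"}]],
    [[{"CDC28"}, {"CLB2"}]],
    [[{"CDC28"}], [{"CDC5"}]],
]
counts = weighted_counts(corpus)
scores = co_mention_scores(counts)
```

In this toy corpus the pair mentioned in the same sentence gets the highest weighted count, and the pair that only shares a document gets the lowest.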
NLP: Natural Language Processing
grammatical analysis
part-of-speech tagging
noun, verb, etc.
multiword detection
semantic tagging
binding, regulation, etc.
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
extract stated facts
handle negations
high precision
poor recall
highly domain specific
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
text corpora
a body of text
most use abstracts
few use full-text articles
no access
~22 million abstracts
~1.8 million free articles
~1.4 million Elsevier articles
~7.5 million patents
web resources
information on proteins
iHOP
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
text mining channel
what is known
not in databases
human proteins
co-mentioning dominates
NLP provides actions
homology transfer
STITCH
small molecules
COMPARTMENTS
subcellular localization
DISEASES
human diseases
search for a protein
search for a disease
STRING payload
evidence viewers
electronic health records
what happens at a hospital
Jensen et al., Nature Reviews Genetics, 2012
two types of data
structured data
Jensen et al., Nature Reviews Genetics, 2012
unstructured data
clinical narrative
getting access
patient consent
opt-out
opt-in
ethical approval
medical question
no exploratory studies
data security
not anonymized
not transferable
hospital IT systems
not standardized
clinical narrative
not normal language
trouble for NLP
in native language
not English
few tools
no dictionaries
by busy doctors and nurses
typos
medical text mining
what is possible?
a psychiatric corpus
clinical narrative
in Danish
dictionaries
diseases
drugs
adverse drug reactions
disease comorbidity
Jensen et al., Nature Reviews Genetics, 2012
multiple testing
comorbidity matrix
Roque et al., PLoS Computational Biology, 2011
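The multiple-testing issue behind a comorbidity matrix can be illustrated with a small sketch. The choice of a one-sided Fisher exact test with Bonferroni correction is an assumption made for this example and is not taken from the cited paper; the patient data and disease codes are invented.

```python
from itertools import combinations
from math import comb

def fisher_upper_tail(both, n_a, n_b, n_total):
    """P(X >= both) under the hypergeometric null of independent diagnoses."""
    numerator = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(both, min(n_a, n_b) + 1)
    )
    return numerator / comb(n_total, n_b)

def comorbidity_matrix(patients, alpha=0.05):
    """patients: dict of patient id -> set of disease codes found in the record.

    Returns {(disease_a, disease_b): (observed/expected ratio, p-value, significant)}.
    """
    n_total = len(patients)
    diseases = sorted(set().union(*patients.values()))
    pairs = list(combinations(diseases, 2))
    if not pairs:
        return {}
    threshold = alpha / len(pairs)  # Bonferroni: one test per disease pair
    results = {}
    for a, b in pairs:
        n_a = sum(a in codes for codes in patients.values())
        n_b = sum(b in codes for codes in patients.values())
        both = sum(a in codes and b in codes for codes in patients.values())
        p_value = fisher_upper_tail(both, n_a, n_b, n_total)
        ratio = both * n_total / (n_a * n_b)
        results[(a, b)] = (ratio, p_value, p_value <= threshold)
    return results

# Invented toy cohort: 8 patients with two codes together, 12 with a third code.
patients = {i: {"F10", "F20"} for i in range(8)}
patients.update({i: {"F32"} for i in range(8, 20)})
matrix = comorbidity_matrix(patients)
```

With many diseases the number of pairs grows quadratically, which is why the significance threshold is divided by the number of tests. Less conservative corrections exist and may be preferable in practice.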
patient clustering
Jensen et al., Nature Reviews Genetics, 2012
clustering algorithm
Roque et al., PLoS Computational Biology, 2011
patient stratification
temporal correlation
drug treatment
adverse drug events
Eriksson et al., in preparation, 2012
pharmacovigilance
thank you!
questions?