Mining literature and medical records

Transcript

  • 1. Mining literature and medical records >10 km Lars Juhl Jensen
  • 2. exponential growth
  • 3. ~45 seconds per paper
  • 4. outline
  • 5. information retrieval
  • 6. named entity recognition
  • 7. augmented browsing
  • 8. information extraction
  • 9. text corpora
  • 10. web resources
  • 11. electronic health records
  • 12. medical text mining
  • 13. questions
  • 14. information retrieval
  • 15. find the relevant papers
  • 16. ad hoc retrieval
  • 17. user-specified query
  • 18. “yeast AND cell cycle”
  • 19. PubMed
  • 20. indexing
  • 21. fast lookup
  • 22. stemming
  • 23. word endings
  • 24. dynamic query expansion
  • 25. MeSH terms
  • 26. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  • 27. no tool will find that
  • 28. named entity recognition
  • 29. identify the concepts
  • 30. computer
  • 31. as smart as a dog
  • 32. teach it specific tricks
  • 33. comprehensive lexicon
  • 34. small molecules
  • 35. proteins
  • 36. cellular components
  • 37. tissues
  • 38. organisms
  • 39. environments
  • 40. diseases
  • 41. phenotypes
  • 42. behaviors
  • 43. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  • 44. orthographic variation
  • 45. prefixes and postfixes
  • 46. CDC28 vs. Cdc28p
  • 47. Myc vs. c-Myc
  • 48. singular and plural forms
  • 49. noun and adjective forms
  • 50. flexible matching
  • 51. upper- and lower-case
  • 52. spaces and hyphens
  • 53. disambiguation
  • 54. homonyms
  • 55. “black list”
  • 56. unfortunate names
  • 57. SDS
  • 58. a
  • 59. scalable implementation
  • 60. >10 km, <10 hours
  • 61. augmented browsing
  • 62. show relevant information
  • 63. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  • 64. Reflect
  • 65. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009; O’Donoghue et al., Journal of Web Semantics, 2010
  • 66. browser add-on
  • 67. Firefox
  • 68. Google Chrome
  • 69. Safari
  • 70. Internet Explorer
  • 71. PDF viewer
  • 72. Utopia Documents
  • 73. web services
  • 74. still too much to read
  • 75. information extraction
  • 76. formalize the facts
  • 77. the starting point
  • 78. named entity recognition
  • 79. two approaches
  • 80. co-mentioning
  • 81. within documents
  • 82. within paragraphs
  • 83. within sentences
  • 84. weighted counts
  • 85. co-mentioning score
  • 86. absolute co-mentionings
  • 87. relative overrepresentation
  • 88. NLP: Natural Language Processing
  • 89. grammatical analysis
  • 90. part-of-speech tagging
  • 91. noun, verb, etc.
  • 92. multiword detection
  • 93. semantic tagging
  • 94. binding, regulation, etc.
  • 95. sentence parsing
  • 96. Gene and protein names; cue words for entity recognition; verbs for relation extraction: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
  • 97. extract stated facts
  • 98. handle negations
  • 99. high precision
  • 100. poor recall
  • 101. highly domain specific
  • 102. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  • 103. text corpora
  • 104. a body of text
  • 105. most use abstracts
  • 106. few use full-text articles
  • 107. no access
  • 108. ~22 million abstracts
  • 109. ~1.8 million free articles
  • 110. ~1.4 million Elsevier articles
  • 111. ~7.5 million patents
  • 112. web resources
  • 113. information on proteins
  • 114. iHOP
  • 115. STRING
  • 116. Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
  • 117. text mining channel
  • 118. what is known
  • 119. not in databases
  • 120. human proteins
  • 121. co-mentioning dominates
  • 122. NLP provides actions
  • 123. homology transfer
  • 124. STITCH
  • 125. small molecules
  • 126. COMPARTMENTS
  • 127. subcellular localization
  • 128. DISEASES
  • 129. human diseases
  • 130. search for a protein
  • 131. search for a disease
  • 132. STRING payload
  • 133. evidence viewers
  • 134. electronic health records
  • 135. what happens at a hospital
  • 136. Jensen et al., Nature Reviews Genetics, 2012
  • 137. two types of data
  • 138. structured data
  • 139. Jensen et al., Nature Reviews Genetics, 2012
  • 140. unstructured data
  • 141. clinical narrative
  • 142. getting access
  • 143. patient consent
  • 144. opt-out
  • 145. opt-in
  • 146. ethical approval
  • 147. medical question
  • 148. no explorative studies
  • 149. data security
  • 150. not anonymized
  • 151. not transferable
  • 152. hospital IT systems
  • 153. not standardized
  • 154. clinical narrative
  • 155. not normal language
  • 156. trouble for NLP
  • 157. in native language
  • 158. not English
  • 159. few tools
  • 160. no dictionaries
  • 161. by busy doctors and nurses
  • 162. typos
  • 163. medical text mining
  • 164. what is possible?
  • 165. a psychiatric corpus
  • 166. clinical narrative
  • 167. in Danish
  • 168. dictionaries
  • 169. diseases
  • 170. drugs
  • 171. adverse drug reactions
  • 172. disease comorbidity
  • 173. Jensen et al., Nature Reviews Genetics, 2012
  • 174. multiple testing
  • 175. comorbidity matrix
  • 176. Roque et al., PLoS Computational Biology, 2011
  • 177. patient clustering
  • 178. Jensen et al., Nature Reviews Genetics, 2012
  • 179. clustering algorithm
  • 180. Roque et al., PLoS Computational Biology, 2011
  • 181. patient stratification
  • 182. temporal correlation
  • 183. drug treatment
  • 184. adverse drug events
  • 185. Eriksson et al., in preparation, 2012
  • 186. pharmacovigilance
  • 187. thank you!
  • 188. questions?
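
Slides 44-56 (flexible matching, black list) and 77-87 (co-mentioning built on named entity recognition) outline a pipeline that can be illustrated with a small sketch. The Python below is only a toy under stated assumptions, not the implementation behind the talk: the lexicon, the black list, the "p" postfix rule, and the function names are all hypothetical, co-mentions are counted within sentences only, and the observed-over-expected ratio is one simple way to express "relative overrepresentation" rather than the scoring scheme actually used.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon: canonical name -> synonyms (illustration only).
LEXICON = {
    "CDC28": ["Cdc28", "Cdk1"],
    "CLB2": ["Clb2"],
    "SWE1": ["Swe1"],
    "CDC5": ["Cdc5"],
    "SDS": ["SDS"],
}
# Hypothetical black list of names that are too ambiguous to tag.
BLACKLIST = {"sds", "a"}

# A token is a run of letters/digits, optionally joined by hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def normalize(name):
    """Lower-case a name and drop spaces and hyphens."""
    return re.sub(r"[\s\-]", "", name.lower())


def build_index(lexicon, blacklist):
    """Map normalized synonyms to canonical names, skipping black-listed ones."""
    index = {}
    for canonical, synonyms in lexicon.items():
        for synonym in synonyms:
            key = normalize(synonym)
            if key in blacklist:
                continue
            index[key] = canonical
            # Assumed postfix rule: also accept a trailing "p" (Cdc28 vs. Cdc28p).
            index[key + "p"] = canonical
    return index


def tag_entities(text, index):
    """Return the set of canonical names mentioned in the text.

    Each token is tried whole and as its hyphen-separated parts, so
    "CDC-28" and "Cdc5-dependent" both match. Names split by spaces in
    the text would need multi-token windows, which this sketch omits.
    """
    found = set()
    for token in TOKEN.findall(text):
        for candidate in [token] + token.split("-"):
            canonical = index.get(normalize(candidate))
            if canonical:
                found.add(canonical)
    return found


def comention_scores(sentences, index):
    """Return {(a, b): (count, observed / expected)} for sentence co-mentions."""
    pair_counts = Counter()
    entity_counts = Counter()
    for sentence in sentences:
        entities = sorted(tag_entities(sentence, index))
        for a, b in combinations(entities, 2):
            pair_counts[(a, b)] += 1
            entity_counts[a] += 1
            entity_counts[b] += 1
    total = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        expected = entity_counts[a] * entity_counts[b] / total
        scores[(a, b)] = (count, count / expected)
    return scores


SENTENCES = [
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1",
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation",
    "Cdc28p phosphorylates Swe1p",
]

INDEX = build_index(LEXICON, BLACKLIST)
for pair, (count, ratio) in sorted(comention_scores(SENTENCES, INDEX).items()):
    print(pair, count, round(ratio, 3))
```

Matching on normalized keys keeps lookup to a single dictionary access per candidate, which is one plausible route to the scalability mentioned on slide 59. A fuller version would also weight co-mentions at the document and paragraph level (slides 81-84).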