Applied text mining

Published in: Science

  1. 1. >10 km
  2. 2. too much to read
  3. 3. exponential growth
  4. 4. ~40 seconds per paper
  5. 5. computer
  6. 6. as smart as a dog
  7. 7. teach it specific tricks
  8. 8. information retrieval
  9. 9. named entity recognition
  10. 10. information extraction
  11. 11. text/data integration
  12. 12. medical text mining
  13. 13. information retrieval
  14. 14. find the relevant papers
  15. 15. ad hoc retrieval
  16. 16. user-specified query
  17. 17. “yeast AND cell cycle”
  18. 18. PubMed
  19. 19. indexing
  20. 20. fast lookup
  21. 21. stemming
  22. 22. word endings
  23. 23. dynamic query expansion
  24. 24. MeSH terms
  25. 25. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  26. 26. no tool will find that
  27. 27. named entity recognition
  28. 28. identify the concepts
  29. 29. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  30. 30. comprehensive lexicon
  31. 31. CDC2
  32. 32. cyclin dependent kinase 1
  33. 33. orthographic variation
  34. 34. flexible matching
  35. 35. upper- and lower-case
  36. 36. CDC2
  37. 37. Cdc2
  38. 38. spaces and hyphens
  39. 39. cyclin dependent kinase 1
  40. 40. cyclin-dependent kinase 1
  41. 41. name expansions
  42. 42. prefixes and postfixes
  43. 43. CDC2
  44. 44. hCDC2
  45. 45. “black list”
  46. 46. SDS
  47. 47. efficient tagger
  48. 48. Pafilis et al., PLOS ONE, 2013
  49. 49. benchmarking
  50. 50. the formal way
  51. 51. manually annotated corpus
  52. 52. precision
  53. 53. recall
  54. 54. much work
  55. 55. the pragmatic way
  56. 56. random sampling
  57. 57. precision
  58. 58. no recall
  59. 59. much less work
  60. 60. augmented browsing
  61. 61. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  62. 62. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  63. 63. Reflect
  64. 64. reflect.ws
  65. 65. information extraction
  66. 66. formalize the facts
  67. 67. Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  68. 68. two approaches
  69. 69. the formal way
  70. 70. NLP (Natural Language Processing)
  71. 71. grammatical analysis
  72. 72. part-of-speech tagging
  73. 73. multiword detection
  74. 74. semantic tagging
  75. 75. sentence parsing
  76. 76. Legend: gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
  77. 77. extract stated facts
  78. 78. high precision
  79. 79. poor recall
  80. 80. the pragmatic way
  81. 81. guilt by association
  82. 82. co-mentioning
  83. 83. counting
  84. 84. within documents
  85. 85. within paragraphs
  86. 86. within sentences
  87. 87. quality score
  88. 88. high recall
  89. 89. high precision
  90. 90. undirected associations
  91. 91. unknown type
  92. 92. text/data integration
  93. 93. STRING
  94. 94. protein associations
  95. 95. string-db.org
  96. 96. STITCH
  97. 97. STRING + 300k chemicals
  98. 98. stitch-db.org
  99. 99. COMPARTMENTS
  100. 100. subcellular localization
  101. 101. compartments.jensenlab.org
  102. 102. TISSUES
  103. 103. tissue expression
  104. 104. tissues.jensenlab.org
  105. 105. DISEASES
  106. 106. disease–gene associations
  107. 107. diseases.jensenlab.org
  108. 108. curated knowledge
  109. 109. pathways
  110. 110. Letunic & Bork, Trends in Biochemical Sciences, 2008
  111. 111. experimental data
  112. 112. gene expression
  113. 113. computational predictions
  114. 114. gene neighborhood
  115. 115. Korbel et al., Nature Biotechnology, 2004
  116. 116. many databases
  117. 117. different formats
  118. 118. different identifiers
  119. 119. variable quality
  120. 120. not comparable
  121. 121. hard work
  122. 122. common identifiers
  123. 123. quality scores
  124. 124. score calibration
  125. 125. visualization
  126. 126. web interfaces
  127. 127. bulk download
  128. 128. why so many resources?
  129. 129. Swiss army knife syndrome
  130. 130. medical text mining
  131. 131. electronic health records
  132. 132. opt-out
  133. 133. opt-in
  134. 134. structured data
  135. 135. Jensen et al., Nature Reviews Genetics, 2012
  136. 136. unstructured data
  137. 137. clinical narrative
  138. 138. Danish
  139. 139. busy doctors
  140. 140. psychiatric patients
  141. 141. named entity recognition
  142. 142. custom dictionaries
  143. 143. diseases
  144. 144. drugs
  145. 145. adverse events
  146. 146. expansion rules
  147. 147. phonetic spelling
  148. 148. typos
  149. 149. sentence filters
  150. 150. negations
  151. 151. family members
  152. 152. delusions
  153. 153. detailed disease profiles
  154. 154. Roque et al., PLOS Computational Biology, 2011 (figure: assigned codes vs. text-mined codes)
  155. 155. comorbidity
  156. 156. Roque et al., PLOS Computational Biology, 2011
  157. 157. patient stratification
  158. 158. Roque et al., PLOS Computational Biology, 2011
  159. 159. pharmacovigilance
  160. 160. structured medication data
  161. 161. text-mined adverse events
  162. 162. Eriksson et al., submitted, 2013
  163. 163. EMBO Practical Course, Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014. Thank you!
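
Slides 30 to 46 outline dictionary-based named entity recognition: a comprehensive lexicon, flexible matching across upper and lower case and across spaces and hyphens, prefix expansion (CDC2 to hCDC2), and a black list for ambiguous names such as SDS. The following is a minimal sketch of that idea, not the tagger from Pafilis et al. The lexicon entries, the `CDK1` identifier, the prefix list, and all function names are illustrative assumptions.

```python
import re

# Hypothetical toy lexicon: name -> identifier (illustrative entries only).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
# Normalized names to suppress; "SDS" is the black-list example on the slides.
BLACKLIST = {"sds"}
# Assumed prefix expansion: "h" for human, as in hCDC2.
PREFIXES = ("", "h")


def normalize(name):
    """Lower-case and drop spaces and hyphens so variants share one key."""
    return re.sub(r"[\s\-]+", "", name.lower())


def build_index(lexicon, blacklist, prefixes):
    """Expand every lexicon name with each prefix, skipping black-listed names."""
    index = {}
    for name, identifier in lexicon.items():
        if normalize(name) in blacklist:
            continue
        for prefix in prefixes:
            index[normalize(prefix + name)] = identifier
    return index


def tag(text, index, max_tokens=4):
    """Return (matched text, identifier) pairs, preferring the longest match."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i].start(), tokens[i + n - 1].end()
            # Normalizing the raw span means only spaces and hyphens
            # may separate the tokens of a multiword name.
            if normalize(text[start:end]) in index:
                matches.append((text[start:end], index[normalize(text[start:end])]))
                i += n
                break
        else:
            i += 1
    return matches


index = build_index(LEXICON, BLACKLIST, PREFIXES)
hits = tag("Human hCDC2, Cdc2 and cyclin-dependent kinase 1 were run on SDS gels.", index)
```

Here `hits` contains the three CDK1 variants and no entry for SDS. Normalizing both the lexicon and the text keeps lookup to a single dictionary access per candidate span, at the cost of ignoring context that a real tagger would likely need.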

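
Slides 81 to 91 describe the pragmatic approach to information extraction: count co-mentions of entities within documents, paragraphs, and sentences, and turn the counts into a quality score for undirected associations. The sketch below shows one simple way this could be done, using only the document and sentence levels. The weights and the function name are assumptions for illustration and are not the scoring scheme used by STRING or the other resources named in the slides.

```python
from collections import Counter
from itertools import combinations

# Assumed weights: a sentence-level co-mention counts more than a
# document-level one. These values are illustrative, not calibrated.
WEIGHTS = {"document": 1.0, "sentence": 2.0}


def comention_scores(documents, weights=WEIGHTS):
    """Score entity pairs by weighted co-mention counts.

    documents: list of documents; each document is a list of sentences;
    each sentence is a set of entity identifiers that were already tagged.
    """
    scores = Counter()
    for sentences in documents:
        doc_entities = set().union(*sentences) if sentences else set()
        # Sorting makes each pair an undirected key: (A, B) equals (B, A).
        for pair in combinations(sorted(doc_entities), 2):
            scores[pair] += weights["document"]
        for entities in sentences:
            for pair in combinations(sorted(entities), 2):
                scores[pair] += weights["sentence"]
    return scores


docs = [
    [{"CDC28", "SWE1"}, {"CDC5"}],  # pair shares a sentence
    [{"CDC28"}, {"SWE1"}],          # pair shares only the document
]
scores = comention_scores(docs)
```

The resulting score says that two entities are associated but, as the slides note, nothing about the direction or type of the association.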