Lars Juhl Jensen
Text mining and data integration
exponential growth
~45 seconds per paper
information retrieval
named entity recognition
information extraction
association networks
data integration
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-dependent
Swe1 hyperphosphorylation and degradation
no tool will find that
named entity recognition
computer
as smart as a dog
teach it specific tricks
identify the concepts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-dependent
Swe1 hyperphosphorylation and degradation
comprehensive lexicon
CDC2
cyclin dependent kinase 1
orthographic variation
upper- and lower-case
CDC2
Cdc2
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
prefixes and postfixes
CDC2
hCDC2
“black list”
SDS
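The lexicon, orthographic variation, and black list ideas above can be sketched roughly as follows. This is an illustrative toy, not the tagger used in the talk: the lexicon contents, the `h` prefix rule, and the four-word window are assumptions, and "SDS" is listed as a name only to show how a black list suppresses a match on an ambiguous name.

```python
import re

# Toy lexicon (assumption): common identifier -> known names.
LEXICON = {
    "CDK1": ["CDC2", "cyclin dependent kinase 1"],
    "SDS": ["SDS"],
}
# Normalized names that should never be tagged.
BLACKLIST = {"sds"}

def normalize(name):
    """Ignore case, spaces and hyphens."""
    return re.sub(r"[\s\-]+", "", name.lower())

def build_dictionary(lexicon):
    """Map every normalized name to its identifier."""
    return {normalize(n): ident for ident, names in lexicon.items() for n in names}

def lookup(key, dictionary, blacklist, prefixes):
    """Try the key itself, then the key with a known prefix removed."""
    candidates = [key] + [key[len(p):] for p in prefixes if key.startswith(p)]
    for candidate in candidates:
        if candidate in dictionary and candidate not in blacklist:
            return dictionary[candidate]
    return None

def tag(text, dictionary, blacklist, max_words=4, prefixes=("h",)):
    """Greedy longest-match tagging over windows of up to max_words words."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            surface = text[tokens[i][0]:tokens[i + n - 1][1]]
            ident = lookup(normalize(surface), dictionary, blacklist, prefixes)
            if ident is not None:
                matches.append((surface, ident))
                step = n
                break
        i += step
    return matches

dictionary = build_dictionary(LEXICON)
text = "hCDC2 and cyclin-dependent kinase 1 were resolved by SDS-PAGE; Cdc2 was detected."
matches = tag(text, dictionary, BLACKLIST)
print(matches)
# [('hCDC2', 'CDK1'), ('cyclin-dependent kinase 1', 'CDK1'), ('Cdc2', 'CDK1')]
```

Normalizing both the lexicon and the text the same way means the variants never have to be enumerated explicitly, which keeps the dictionary small.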
scalable implementation
>10 km
<10 hours
augmented browsing
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-dependent
Swe1 hyperphosphorylation and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
information extraction
formalize the facts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-dependent
Swe1 hyperphosphorylation and degradation
two approaches
co-mentioning
counting
within documents
within paragraphs
within sentences
co-mentioning score
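One way to picture counting at the three levels is a weighted sum per entity pair. The sketch below assumes entities have already been recognized and uses made-up weights; it is not the scoring function used in STRING or any other resource named in this talk. Counts are cumulative here: a pair seen in one sentence also counts for the enclosing paragraph and document.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def pairs(entities):
    """All unordered pairs, in a canonical (sorted) order."""
    return combinations(sorted(entities), 2)

def comention_scores(corpus, weights=WEIGHTS):
    """corpus: list of documents; document: list of paragraphs;
    paragraph: list of sentences; sentence: set of entity identifiers."""
    scores = Counter()
    for document in corpus:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += weights["sentence"]
                par_entities |= set(sentence)
            for pair in pairs(par_entities):
                scores[pair] += weights["paragraph"]
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += weights["document"]
    return scores

# Toy corpus, invented for illustration.
corpus = [
    [[{"CDC28", "SWE1"}, {"CDC5"}], [{"CLB2"}]],
    [[{"CDC28"}, {"SWE1"}]],
]
scores = comention_scores(corpus)
print(scores[("CDC28", "SWE1")])  # 10.0
print(scores[("CDC28", "CDC5")])  # 3.0
print(scores[("CDC28", "CLB2")])  # 1.0
```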
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
multiword detection
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
extract stated facts
high precision
poor recall
text corpus
most use abstracts
few use full-text articles
no access
PDF files
layout-aware extraction
my corpus
~22 million abstracts
~4 million articles
association networks
guilt by association
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
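Calibration against a gold standard can be sketched as binning pairs by raw score and measuring, per bin, the fraction that the gold standard confirms. This is a simplified illustration under stated assumptions: every pair absent from the gold standard is treated as wrong, bins have a fixed size, and the example scores are invented. The actual benchmark set and curve fitting used by the cited work are not described in the slides.

```python
def calibrate(scored_pairs, gold_standard, bin_size):
    """Return (mean raw score, fraction confirmed) per bin, best bin first.

    scored_pairs: list of (pair, raw_score)
    gold_standard: set of pairs assumed to be true
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_raw = sum(score for _, score in chunk) / len(chunk)
        confirmed = sum(1 for pair, _ in chunk if pair in gold_standard)
        curve.append((mean_raw, confirmed / len(chunk)))
    return curve

# Invented raw scores and gold standard.
scored = [(("A", "B"), 9.0), (("A", "C"), 8.0), (("B", "C"), 3.0), (("C", "D"), 1.0)]
gold = {("A", "B"), ("A", "C"), ("C", "D")}
curve = calibrate(scored, gold, bin_size=2)
print(curve)  # [(8.5, 1.0), (2.0, 0.5)]
```

The resulting curve maps a raw score from any one source onto a common scale, which is what makes scores from different sources comparable.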
data integration
general approach
suite of web resources
STITCH
STRING + 300k chemicals
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
tissue expression
tissues.jensenlab.org
DISEASES
disease genes
unification
curated knowledge
text mining
experimental data
computational predictions
common identifiers
quality scores
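The unification step can be illustrated by mapping source-specific names onto common identifiers and keeping one score per pair and evidence type. The alias table, the `protein:N` identifiers, the scores, and the choice of keeping the maximum are all assumptions made for this sketch, not a description of the actual pipeline.

```python
# Hypothetical alias table: source-specific name -> common identifier.
ALIASES = {
    "CDC2": "protein:1",
    "CDK1": "protein:1",
    "CCNB1": "protein:2",
    "cyclin B1": "protein:2",
}

def unify(records, aliases):
    """records: iterable of (name_a, name_b, evidence_type, score).

    Returns (merged, unmapped): merged maps ((id_a, id_b), evidence_type)
    to the best score seen; unmapped lists records with unknown names.
    """
    merged = {}
    unmapped = []
    for name_a, name_b, evidence, score in records:
        if name_a not in aliases or name_b not in aliases:
            unmapped.append((name_a, name_b, evidence, score))
            continue
        pair = tuple(sorted((aliases[name_a], aliases[name_b])))
        key = (pair, evidence)
        merged[key] = max(score, merged.get(key, 0.0))
    return merged, unmapped

# Invented records from three imaginary sources.
records = [
    ("CDC2", "CCNB1", "experimental data", 0.9),
    ("CDK1", "cyclin B1", "text mining", 0.6),
    ("CCNB1", "CDK1", "experimental data", 0.7),
    ("CDK1", "XYZ", "text mining", 0.4),
]
merged, unmapped = unify(records, ALIASES)
print(merged)
print(unmapped)
```

Keeping unmapped records separate, rather than silently dropping them, makes gaps in the alias table visible.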
visualization
dissemination
web interfaces
evidence viewers
web services
diseases.jensenlab.org
bulk download
thank you!