Large-scale integration of data and text

  1. Large-scale integration of data and text Lars Juhl Jensen
  2. Large-scale integration of data and text Lars Juhl Jensen
  3. association networks
  4. text mining
  5. localization and diseases
  6. me
  7. promoter analysis
  8. Jensen & Knudsen, Bioinformatics, 2000
  9. function prediction
  10. Jensen, Gupta et al., Journal of Molecular Biology, 2002
  11. protein networks
  12. de Lichtenberg, Jensen et al., Science, 2005
  13. chemoinformatics
  14. Campillos, Kuhn et al., Science, 2008
  15. data mining
  16. text mining
  17. electronic health records
  18. association networks
  19. guilt by association
  20. STRING
  21. ~2.6 million proteins
  22. Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
  23. STITCH
  24. ~300,000 small molecules
  25. Kuhn et al., Nucleic Acids Research, 2012
  26. genomic context
  27. gene fusion
  28. Korbel et al., Nature Biotechnology, 2004
  29. operons
  30. Korbel et al., Nature Biotechnology, 2004
  31. bidirectional promoters
  32. Korbel et al., Nature Biotechnology, 2004
  33. metagenome neighborhood
  34. Harrington et al., PNAS, 2007
  35. phylogenetic profiles
  36. Korbel et al., Nature Biotechnology, 2004
  37. a real example
  38. Cell Cellulosomes Cellulose
  39. experimental data
  40. gene coexpression
  41. protein interactions
  42. Jensen & Bork, Science, 2008
  43. curated knowledge
  44. drug targets
  45. complexes
  46. pathways
  47. Letunic & Bork, Trends in Biochemical Sciences, 2008
  48. many databases
  49. different formats
  50. different identifiers
  51. variable quality
  52. not comparable
  53. hard work
  54. quality scores
  55. von Mering et al., Nucleic Acids Research, 2005
  56. calibrate vs. gold standard
  57. missing most of the data
  58. text mining
  59. >10 km
  60. too much to read
  61. computer
  62. as smart as a dog
  63. teach it specific tricks
  64. named entity recognition
  65. comprehensive lexicon
  66. cyclin dependent kinase 1
  67. CDK1
  68. CDC2
  69. flexible matching
  70. spaces and hyphens
  71. cyclin dependent kinase 1
  72. cyclin-dependent kinase 1
  73. orthographic variation
  74. CDC2
  75. hCdc2
  76. “black list”
  77. SDS
  78. information extraction
  79. count co-mentioning
  80. within documents
  81. within paragraphs
  82. within sentences
  83. scoring scheme
  84. corpora
  85. ~22 million abstracts
  86. no access
  87. ~4 million full-text articles
  88. augmented browsing
  89. Reflect
  90. browser add-on
  91. real-time text mining
  92. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009; O’Donoghue et al., Journal of Web Semantics, 2010
  93. localization and disease
  94. small molecules
  95. proteins
  96. compartments
  97. tissues
  98. diseases
  99. organisms
  100. environments
  101. suite of web resources
  102. common backend database
  103. jensenlab.org
  104. text mining
  105. curated knowledge
  106. experimental data
  107. computational predictions
  108. quality scores
  109. web-centric databases
  110. DISEASES
  111. visualization
  112. COMPARTMENTS
  113. compartments.jensenlab.org
  114. TISSUES
  115. tissues.jensenlab.org
  116. project onto networks
  117. Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
  118. compartments.jensenlab.org
  119. tissues.jensenlab.org
  120. diseases.jensenlab.org
  121. summary
  122. bioinformatics
  123. more than alignment
  124. data/text mining
  125. save you much time
  126. Acknowledgments. STRING/STITCH and literature mining: Christian von Mering, Sune Frankild, Damian Szklarczyk, Evangelos Pafilis, Michael Kuhn, Janos Binder, Manuel Stark, Kalliopi Tsafou, Samuel Chaffron, Alberto Santos, Chris Creevey, Heiko Horn, Jean Muller, Tobias Doerks, Nigel Brown, Philippe Julien, Reinhardt Schneider, Alexander Roth, Sean O’Donoghue, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
  127. Questions?
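
The text-mining slides (named entity recognition with a lexicon, flexible matching of spaces and hyphens, orthographic variation such as hCdc2, a black list, and co-mention counting within documents, paragraphs and sentences) can be illustrated with a minimal Python sketch. It is not the pipeline behind the talk: the tiny LEXICON, the BLACK_LIST handling, the WEIGHTS values and the function names are all assumptions made up for illustration, and the single-letter prefix rule is only one possible way to tolerate variants like hCdc2.

```python
import re
from collections import Counter
from itertools import combinations

# Toy lexicon (assumption): name or synonym -> canonical entity.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDK1": "CDK1",
    "CDC2": "CDK1",
    "cyclin B1": "CCNB1",
    "CCNB1": "CCNB1",
    "SDS": "SDS",  # in the lexicon, but too ambiguous to tag (see BLACK_LIST)
}

# Names excluded from tagging; SDS is the example given on the slides.
BLACK_LIST = {"sds"}

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def name_pattern(name):
    """Build a case-insensitive regex for one lexicon name.

    Tokens may be joined by a space, a hyphen, or nothing, and one
    optional leading letter is allowed (e.g. "hCdc2" for "CDC2").
    """
    tokens = re.split(r"[\s-]+", name)
    body = r"[\s-]?".join(re.escape(token) for token in tokens)
    return re.compile(
        r"(?<![A-Za-z0-9])[a-z]?" + body + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


PATTERNS = [
    (name_pattern(name), entity)
    for name, entity in LEXICON.items()
    if name.lower() not in BLACK_LIST
]


def tag(text):
    """Return the set of canonical entities mentioned in the text."""
    return {entity for pattern, entity in PATTERNS if pattern.search(text)}


def comention_scores(documents):
    """Sum weighted co-mentions per entity pair over all documents.

    A pair found in the same sentence also shares a paragraph and a
    document, so it collects all three weights.
    """
    scores = Counter()
    for doc in documents:
        paragraphs = [p for p in doc.split("\n\n") if p.strip()]
        sentences = [
            s
            for p in paragraphs
            for s in re.split(r"(?<=[.!?])\s+", p)
            if s.strip()
        ]
        levels = (
            ("document", [doc]),
            ("paragraph", paragraphs),
            ("sentence", sentences),
        )
        for level, units in levels:
            for unit in units:
                for pair in combinations(sorted(tag(unit)), 2):
                    scores[pair] += WEIGHTS[level]
    return scores


if __name__ == "__main__":
    example = (
        "hCdc2 binds cyclin B1. SDS was used.\n\n"
        "Cyclin-dependent kinase 1 is a kinase."
    )
    print(tag(example))
    print(comention_scores([example]))
```

In this sketch the black list is applied when the patterns are compiled, so an ambiguous name never produces a tag. A real system would also have to calibrate the resulting raw scores against a gold standard, as the quality-score slides note, which this example does not attempt.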
