Integration of heterogeneous data

10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.

  1. 1. Integration of heterogeneous data (Lars Juhl Jensen)
  2. 6. data mining
  3. 7. text mining
  4. 8. interaction networks
  5. 10. Kuhn et al., Nucleic Acids Research, 2010
  6. 11. parts lists
  7. 12. 630 genomes
  8. 13. 2.5 million proteins
  9. 14. ~74,000 small molecules
  10. 15. many databases
  11. 16. different formats
  12. 17. model organism databases
  13. 18. Ensembl
  14. 19. RefSeq
  15. 20. PubChem
  16. 21. genomic context
  17. 22. gene fusion
  18. 23. Korbel et al., Nature Biotechnology, 2004
  19. 24. conserved neighborhood
  20. 25. operons
  21. 26. Korbel et al., Nature Biotechnology, 2004
  22. 27. bidirectional promoters
  23. 28. Korbel et al., Nature Biotechnology, 2004
  24. 29. phylogenetic profiles
  25. 30. Korbel et al., Nature Biotechnology, 2004
  26. 31. experimental data
  27. 32. gene coexpression
  28. 34. protein interactions
  29. 35. Jensen & Bork, Science, 2008
  30. 36. genetic interactions
  31. 37. Beyer et al., Nature Reviews Genetics, 2007
  32. 38. small molecule interactions
  33. 39. in vitro binding assays
  34. 40. cellular activity assays
  35. 41. many databases
  36. 42. GEO Gene Expression Omnibus
  37. 43. BIND Biomolecular Interaction Network Database
  38. 44. BioGRID General Repository for Interaction Datasets
  39. 45. DIP Database of Interacting Proteins
  40. 46. IntAct
  41. 47. MINT Molecular Interactions Database
  42. 48. HPRD Human Protein Reference Database
  43. 49. PDB Protein Data Bank
  44. 50. BindingDB
  45. 51. CTD Comparative Toxicogenomics Database
  46. 52. DrugBank
  47. 53. GLIDA GPCR-Ligand Database
  48. 54. MATADOR
  49. 55. PDSP Ki Psychoactive Drug Screening Program
  50. 56. PharmGKB Pharmacogenomics Knowledge Base
  51. 57. different formats
  52. 58. different identifiers
  53. 59. partially redundant
  54. 60. Campillos & Kuhn et al., Science, 2008
  55. 61. curated knowledge
  56. 62. complexes
  57. 63. pathways
  58. 64. Letunic & Bork, Trends in Biochemical Sciences, 2008
  59. 65. many databases
  60. 66. Gene Ontology
  61. 67. MIPS Munich Information center for Protein Sequences
  62. 68. KEGG Kyoto Encyclopedia of Genes and Genomes
  63. 69. MetaCyc
  64. 70. Reactome
  65. 71. PID NCI-Nature Pathway Interaction Database
  66. 72. high confidence
  67. 73. different formats
  68. 74. different identifiers
  69. 75. partially redundant
  70. 76. literature mining
  71. 77. >10 km
  72. 78. human readable
  73. 79. not computer readable
  74. 80. different names
  75. 81. text corpus
  76. 82. MEDLINE
  77. 83. SGD Saccharomyces Genome Database
  78. 84. The Interactive Fly
  79. 85. OMIM Online Mendelian Inheritance in Man
  80. 86. thesaurus
  81. 87. co-mentioning
  82. 88. statistical methods
  83. 89. NLP Natural Language Processing
  84. 90. Gene and protein names; Cue words for entity recognition; Verbs for relation extraction; [nxgene The GAL4 gene]; [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
  85. 92. restricted access
  86. 93. Reflect
  87. 94. augmented browsing
  88. 95. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
  89. 96. integration
  90. 97. the easy problems
  91. 98. many databases
  92. 99. different formats
  93. 100. different identifiers
  94. 101. partially redundant
  95. 102. parsers
  96. 103. thesaurus
  97. 104. bookkeeping
  98. 105. the hard problems
  99. 106. many data types
  100. 107. not comparable
  101. 108. variable quality
  102. 109. raw quality scores
  103. 110. intergenic distances
  104. 111. Korbel et al., Nature Biotechnology, 2004
  105. 112. correlations
  106. 114. reproducibility
  107. 115. von Mering et al., Nucleic Acids Research, 2005
  108. 116. score calibration
  109. 117. gold standard
  110. 118. von Mering et al., Nucleic Acids Research, 2005
  111. 119. spread over 630 genomes
  112. 120. transfer by orthology
  113. 121. von Mering et al., Nucleic Acids Research, 2005
  114. 122. two modes
  115. 123. COG mode
  116. 124. von Mering et al., Nucleic Acids Research, 2005
  117. 125. protein mode
  118. 126. von Mering et al., Nucleic Acids Research, 2005
  119. 127. combine all evidence
  120. 128. $P = 1 - (1 - P_1)(1 - P_2)(1 - P_3) \cdots$
  121. 129. visualize
  122. 130. Kuhn et al., Nucleic Acids Research, 2010
  123. 131. access
  124. 132. access for humans
  125. 133. web interfaces
  126. 137. access for computers
  127. 138. web services
  128. 139. REST Representational State Transfer
  129. 140. SOAP Simple Object Access Protocol
  130. 141. Acknowledgments. STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Monica Campillos, Christian von Mering, Lars Juhl Jensen, Andreas Beyer, Peer Bork. Reflect: Sean O’Donoghue, Heiko Horn, Sune Frankild, Evangelos Pafilis, Michael Kuhn, Nigel Brown, Reinhardt Schneider. STRING: Christian von Mering, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
  131. 142. larsjuhljensen
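
The "combine all evidence" rule on slide 128 multiplies together the chances that each evidence type fails to support an interaction and subtracts the product from one. The sketch below is one possible reading of that formula. It assumes each score is already a calibrated probability and that the evidence types are independent. The function name, channel names, and score values are illustrative and do not come from the slides.

```python
def combine_evidence(probabilities):
    """Combine per-channel probabilities as P = 1 - (1 - P_1)(1 - P_2)...

    Assumes every input is a probability in [0, 1] and that the
    evidence channels are independent (an assumption of this sketch).
    """
    remaining = 1.0  # probability that no channel supports the link
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p!r}")
        remaining *= 1.0 - p
    return 1.0 - remaining


# Hypothetical scores for one protein pair (made-up values).
scores = {"conserved neighborhood": 0.4, "coexpression": 0.5, "text mining": 0.2}
combined = combine_evidence(scores.values())
print(round(combined, 3))  # 1 - 0.6 * 0.5 * 0.8 = 0.76
```

Under this rule the combined score is never lower than the strongest single channel, and several weak channels can add up to moderate confidence.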
