Integration of heterogeneous data

10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.

Transcript

  • 1. Integration of heterogeneous data Lars Juhl Jensen
  • 2.  
  • 3.  
  • 4.  
  • 5.  
  • 6. data mining
  • 7. text mining
  • 8. interaction networks
  • 9.  
  • 10. Kuhn et al., Nucleic Acids Research , 2010
  • 11. parts lists
  • 12. 630 genomes
  • 13. 2.5 million proteins
  • 14. ~74,000 small molecules
  • 15. many databases
  • 16. different formats
  • 17. model organism databases
  • 18. Ensembl
  • 19. RefSeq
  • 20. PubChem
  • 21. genomic context
  • 22. gene fusion
  • 23. Korbel et al., Nature Biotechnology , 2004
  • 24. conserved neighborhood
  • 25. operons
  • 26. Korbel et al., Nature Biotechnology , 2004
  • 27. bidirectional promoters
  • 28. Korbel et al., Nature Biotechnology , 2004
  • 29. phylogenetic profiles
  • 30. Korbel et al., Nature Biotechnology , 2004
  • 31. experimental data
  • 32. gene coexpression
  • 33.  
  • 34. protein interactions
  • 35. Jensen & Bork, Science , 2008
  • 36. genetic interactions
  • 37. Beyer et al., Nature Reviews Genetics , 2007
  • 38. small molecule interactions
  • 39. in vitro binding assays
  • 40. cellular activity assays
  • 41. many databases
  • 42. GEO Gene Expression Omnibus
  • 43. BIND Biomolecular Interaction Network Database
  • 44. BioGRID General Repository for Interaction Datasets
  • 45. DIP Database of Interacting Proteins
  • 46. IntAct
  • 47. MINT Molecular Interactions Database
  • 48. HPRD Human Protein Reference Database
  • 49. PDB Protein Data Bank
  • 50. BindingDB
  • 51. CTD Comparative Toxicogenomics Database
  • 52. DrugBank
  • 53. GLIDA GPCR-Ligand Database
  • 54. MATADOR
  • 55. PDSP Ki Psychoactive Drug Screening Program
  • 56. PharmGKB Pharmacogenomics Knowledge Base
  • 57. different formats
  • 58. different identifiers
  • 59. partially redundant
  • 60. Campillos & Kuhn et al., Science , 2008
  • 61. curated knowledge
  • 62. complexes
  • 63. pathways
  • 64. Letunic & Bork, Trends in Biochemical Sciences , 2008
  • 65. many databases
  • 66. Gene Ontology
  • 67. MIPS Munich Information center for Protein Sequences
  • 68. KEGG Kyoto Encyclopedia of Genes and Genomes
  • 69. MetaCyc
  • 70. Reactome
  • 71. PID NCI-Nature Pathway Interaction Database
  • 72. high confidence
  • 73. different formats
  • 74. different identifiers
  • 75. partially redundant
  • 76. literature mining
  • 77. >10 km
  • 78. human readable
  • 79. not computer readable
  • 80. different names
  • 81. text corpus
  • 82. MEDLINE
  • 83. SGD Saccharomyces Genome Database
  • 84. The Interactive Fly
  • 85. OMIM Online Mendelian Inheritance in Man
  • 86. thesaurus
  • 87. co-mentioning
  • 88. statistical methods
  • 89. NLP Natural Language Processing
  • 90.
    • Gene and protein names
    • Cue words for entity recognition
    • Verbs for relation extraction
    • [ nxgene The GAL4 gene ]
    • [ nxexpr The expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ]
  • 91.  
  • 92. restricted access
  • 93. Reflect
  • 94. augmented browsing
  • 95. Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology , 2009
  • 96. integration
  • 97. the easy problems
  • 98. many databases
  • 99. different formats
  • 100. different identifiers
  • 101. partially redundant
  • 102. parsers
  • 103. thesaurus
  • 104. bookkeeping
  • 105. the hard problems
  • 106. many data types
  • 107. not comparable
  • 108. variable quality
  • 109. raw quality scores
  • 110. intergenic distances
  • 111. Korbel et al., Nature Biotechnology , 2004
  • 112. correlations
  • 113.  
  • 114. reproducibility
  • 115. von Mering et al., Nucleic Acids Research , 2005
  • 116. score calibration
  • 117. gold standard
  • 118. von Mering et al., Nucleic Acids Research , 2005
  • 119. spread over 630 genomes
  • 120. transfer by orthology
  • 121. von Mering et al., Nucleic Acids Research , 2005
  • 122. two modes
  • 123. COG mode
  • 124. von Mering et al., Nucleic Acids Research , 2005
  • 125. protein mode
  • 126. von Mering et al., Nucleic Acids Research , 2005
  • 127. combine all evidence
  • 128. P = 1 - (1 - P_1)(1 - P_2)(1 - P_3) …
  • 129. visualize
  • 130. Kuhn et al., Nucleic Acids Research , 2010
  • 131. access
  • 132. access for humans
  • 133. web interfaces
  • 134.  
  • 135.  
  • 136.  
  • 137. access for computers
  • 138. web services
  • 139. REST Representational State Transfer
  • 140. SOAP Simple Object Access Protocol
  • 141. Acknowledgments
      • STITCH
      • Michael Kuhn
      • Damian Szklarczyk
      • Andrea Franceschini
      • Monica Campillos
      • Christian von Mering
      • Lars Juhl Jensen
      • Andreas Beyer
      • Peer Bork
      • Reflect
      • Sean O’Donoghue
      • Heiko Horn
      • Sune Frankild
      • Evangelos Pafilis
      • Michael Kuhn
      • Nigel Brown
      • Reinhardt Schneider
      • STRING
      • Christian von Mering
      • Michael Kuhn
      • Manuel Stark
      • Samuel Chaffron
      • Chris Creevey
      • Jean Muller
      • Tobias Doerks
      • Philippe Julien
      • Alexander Roth
      • Milan Simonovic
      • Jan Korbel
      • Berend Snel
      • Martijn Huynen
      • Peer Bork
  • 142. larsjuhljensen
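
Worked example for slide 128 ("combine all evidence"): the formula there combines per-evidence probabilities P_1, P_2, P_3, … into a single score, on the assumption that the evidence types are independent. The sketch below is only an illustration of that formula. The function name, the evidence channel names, and the numeric values are hypothetical and do not come from the slides or from any database's actual scoring code.

```python
def combine_scores(scores):
    """Combine per-evidence probabilities as P = 1 - prod(1 - P_i).

    Assumes each score is a probability in [0, 1] and that the
    evidence sources are independent (an assumption of this sketch).
    """
    remaining = 1.0  # probability that no evidence source supports the link
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {p}")
        remaining *= 1.0 - p
    return 1.0 - remaining


# Hypothetical calibrated scores for one protein pair, one per evidence type.
evidence = {
    "conserved_neighborhood": 0.4,
    "coexpression": 0.5,
    "text_mining": 0.2,
}

combined = combine_scores(evidence.values())
print(f"combined score: {combined:.2f}")
```

With these made-up inputs the combined score is higher than any single input, which is the point of the formula: several weak but independent lines of evidence add up to a stronger one. The scores have to be calibrated onto a common probability scale first (slides 116 to 118), otherwise the product is not meaningful.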