Data integration
The STITCH database of protein–small molecule interactions
Lars Juhl Jensen
guilt by association
functional associations
Kuhn et al., Nucleic Acids Research, 2010
parts lists
>2.5 million proteins
630 genomes
many databases
different formats
model organism databases
Ensembl
RefSeq
PubChem compounds
>74,000 small molecules
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
conserved neighborhood
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
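To illustrate the phylogenetic-profile idea, the sketch below scores two genes by how often their presence/absence patterns agree across genomes. The profile strings, the gene names, and the agreement measure are assumptions made for this example; the slides do not say which measure is used.

```python
def profile_agreement(p, q):
    """Fraction of genomes in which two presence/absence profiles agree.

    Each profile is a string of '1' (gene present) and '0' (gene absent),
    one character per genome, with genomes in the same order in both strings.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    if not p:
        return 0.0
    return sum(a == b for a, b in zip(p, q)) / len(p)


# Hypothetical toy profiles over six genomes.
profiles = {"geneA": "110100", "geneB": "110100", "geneC": "001011"}

print(profile_agreement(profiles["geneA"], profiles["geneB"]))  # identical profiles
print(profile_agreement(profiles["geneA"], profiles["geneC"]))  # complementary profiles
```

Genes whose profiles agree in most genomes are candidates for a functional association; a real analysis would also correct for the phylogenetic relatedness of the genomes.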
interaction data
protein–small molecule
in vitro binding assays
protein–protein
yeast two-hybrid
affinity purification
fragment complementation
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
gene coexpression
many databases
BindingDB
CTD
Comparative Toxicogenomics Database
DrugBank
GLIDA
GPCR-Ligand Database
PDSP Ki
Psychoactive Drug Screening Program
PharmGKB
Pharmacogenomics Knowledge Base
BIND
Biomolecular Interaction Network Database
BioGRID
Biological General Repository for Interaction Datasets
DIP
Database of Interacting Proteins
IntAct
MINT
Molecular Interactions Database
HPRD
Human Protein Reference Database
PDB
Protein Data Bank
GEO
Gene Expression Omnibus
different formats
different identifiers
partially redundant
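The three problems above (formats, identifiers, redundancy) can be sketched as one merge step: map every source-specific identifier to a common namespace, then collapse records that describe the same pair. The source names, identifiers, and the `merge_sources` helper are hypothetical and only illustrate the idea.

```python
def merge_sources(records, id_map):
    """Map source-specific ids to common ids and collapse redundant pairs.

    records: iterable of (source, id_a, id_b) tuples.
    id_map:  dict from source-specific id to common id.
    Returns a dict from an order-independent pair to the set of supporting sources.
    """
    merged = {}
    for source, a, b in records:
        a, b = id_map.get(a), id_map.get(b)
        if a is None or b is None:
            continue  # skip records whose ids cannot be mapped
        pair = tuple(sorted((a, b)))
        merged.setdefault(pair, set()).add(source)
    return merged


# Hypothetical identifiers from two made-up sources.
id_map = {"dbA:x1": "P1", "dbA:x2": "P2", "dbB:y1": "P1", "dbB:y2": "P2", "dbB:y3": "P3"}
records = [
    ("dbA", "dbA:x1", "dbA:x2"),
    ("dbB", "dbB:y2", "dbB:y1"),  # same pair, other source, reversed order
    ("dbB", "dbB:y3", "dbB:y9"),  # dbB:y9 has no mapping and is dropped
]
merged = merge_sources(records, id_map)
print(merged)
```

Keeping the set of supporting sources, rather than discarding duplicates outright, preserves the information that several databases report the same interaction.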
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
high confidence
many databases
MIPS
Munich Information Center for Protein Sequences
Gene Ontology
KEGG
Kyoto Encyclopedia of Genes and Genomes
MetaCyc
PID
NCI-Nature Pathway Interaction Database
Reactome
different formats
different identifiers
partially redundant
text mining
>10 km
human readable
not computer readable
different names
Reflect
dictionary
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
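A dictionary that maps the different names of an entity to one identifier is the core of dictionary-based tagging. The sketch below is a minimal illustration, not a description of how Reflect is implemented; the names, identifiers, and the `tag_names` function are assumptions for this example.

```python
import re


def tag_names(text, dictionary):
    """Return (start, end, matched text, identifier) for each dictionary hit.

    dictionary maps a name (synonym) to an identifier. Matching is
    case-insensitive and restricted to whole words; longer names are
    tried first so that they win over shorter names they contain.
    """
    if not dictionary:
        return []
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    lookup = {name.lower(): ident for name, ident in dictionary.items()}
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Toy dictionary with made-up identifiers: two names point to the same protein.
dictionary = {"CDC28": "PROT:1", "Cdk1": "PROT:1", "aspirin": "CHEM:1"}
hits = tag_names("Aspirin does not bind Cdk1.", dictionary)
print(hits)
```

Because both synonyms resolve to the same identifier, text that uses different names for one protein still yields comparable annotations.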
text corpus
MEDLINE
SGD
Saccharomyces Genome Database
The Interactive Fly
OMIM
Online Mendelian Inheritance in Man
co-mentioning
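Co-mentioning can be sketched as counting how many documents mention two tagged entities together. The input format (one list of entity identifiers per document) and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def comention_counts(documents):
    """Count, for every pair of entities, the documents that mention both.

    documents: iterable of iterables of entity identifiers (one per document).
    Repeated mentions within one document are counted once.
    """
    counts = Counter()
    for entities in documents:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical tagged abstracts.
docs = [["A", "B"], ["B", "A", "C", "A"], ["C"]]
counts = comention_counts(docs)
print(counts)
```

Raw counts favor frequently mentioned entities, so in practice they would be normalized against how often each entity is mentioned on its own.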
NLP
Natural Language Processing
integration
many data types
not comparable
variable quality
spread over 630 genomes
quality scores
reproducibility
von Mering et al., Nucleic Acids Research, 2005
intergenic distances
Korbel et al., Nature Biotechnology, 2004
benchmarking
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
raw quality scores
probabilistic scores
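One simple way to turn raw quality scores into probabilistic scores is to bin the raw scores and measure, in each bin, the fraction of pairs that the gold standard confirms. The sketch below assumes that approach; the bin edges, the pair labels, and the `calibrate` function are illustrative and not taken from the cited paper.

```python
from bisect import bisect_right


def calibrate(raw_scores, labels, edges):
    """Fraction of gold-standard positives in each raw-score bin.

    raw_scores: dict mapping a pair to its raw score.
    labels:     dict mapping a pair to True/False for pairs the gold
                standard can judge; other pairs are ignored.
    edges:      ascending inner bin boundaries.
    Returns one value per bin, or None for bins without judged pairs.
    """
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for pair, score in raw_scores.items():
        if pair not in labels:
            continue
        b = bisect_right(edges, score)
        totals[b] += 1
        hits[b] += labels[pair]
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical raw scores and gold-standard labels.
raw = {"p1": 0.1, "p2": 0.2, "p3": 0.8, "p4": 0.9, "p5": 0.95}
gold = {"p1": False, "p2": True, "p3": True, "p4": True}
print(calibrate(raw, gold, edges=[0.5]))
```

The per-bin fractions can then be smoothed into a calibration curve, so that any raw score, from any evidence type, maps onto the same probability scale.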
orthology transfer
von Mering et al., Nucleic Acids Research, 2005
combine all evidence
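Once every evidence type is expressed as a probability, the scores can be combined. The sketch below assumes the evidence types are independent and combines them as the probability that at least one is correct; the slides do not give the formula, and any correction for prior probabilities is left out here.

```python
import math


def combine(scores):
    """Combine independent probabilistic scores: 1 - prod(1 - s)."""
    return 1.0 - math.prod(1.0 - s for s in scores)


# Hypothetical scores for one protein-chemical pair from three evidence types.
print(combine([0.5, 0.5]))
print(combine([0.4, 0.3, 0.2]))
```

Under this scheme several weak pieces of evidence can add up to a confident association, and the combined score is never lower than the best single score.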
Acknowledgments
Damian Szklarczyk
Andrea Franceschini
Michael Kuhn
Sune Frankild
Heiko Horn
Evangelos Pafilis
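Milan Simonovic
Alexander Roth
Pablo Minguez
Tobias Doerks
Jean Muller
Manuel Stark
Samuel Chaffron
Chris Creevey
Philippe Julien
Jan Korbel
Berend Snel
Martijn Huynen
Reinhardt Schneider
Sean O’Donoghue
Christian von Mering
Peer Bork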
Predicting novel targets for existing drugs using side effect information
Lars Juhl Jensen
the problem
new uses for old drugs
drug–drug network
shared target(s)
chemical similarity
Campillos & Kuhn et al., Science, 2008
Campillos & Kuhn et al., Science, 2008
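Chemical similarity between two drugs is commonly measured with the Tanimoto coefficient on structural fingerprints. The sketch below assumes each fingerprint is given as the set of its "on" bit positions; the fingerprints are made up, and the slides do not state which fingerprint type is used.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints for two drugs.
drug_x = {1, 2, 3, 4}
drug_y = {3, 4, 5, 6}
print(tanimoto(drug_x, drug_y))
```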
similar drugs share targets
only trivial predictions
the idea
chemical perturbations
phenotypic readouts
drug treatment
side effects
the hard work
information on side effects
no database
package inserts
Campillos & Kuhn et al., Science, 2008
text mining
side-effect ontology
backtracking
Campillos & Kuhn et al., Science, 2008
manual validation
SIDER
Kuhn et al., Molecular Systems Biology, 2010
side-effect correlations
Campillos & Kuhn et al., Science, 2008
GSC weighting
side-effect frequencies
Campillos & Kuhn et al., Science, 2008
raw similarity score
Campillos & Kuhn et al., Science, 2008
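The role of side-effect frequencies in a raw similarity score can be sketched as follows: a side effect shared by two drugs counts more when few drugs have it. The weighting used here (negative log of the fraction of drugs with the side effect) is an assumption for illustration, not the formula from the cited paper, and it omits the weighting for correlated side effects.

```python
import math


def rarity_weights(drug_side_effects):
    """Weight each side effect by -log(fraction of drugs that list it)."""
    n = len(drug_side_effects)
    counts = {}
    for effects in drug_side_effects.values():
        for effect in effects:
            counts[effect] = counts.get(effect, 0) + 1
    return {effect: -math.log(c / n) for effect, c in counts.items()}


def raw_similarity(effects_a, effects_b, weights):
    """Sum of the weights of the side effects that both drugs share."""
    return sum(weights[e] for e in effects_a & effects_b)


# Hypothetical drugs and side effects.
drugs = {
    "d1": {"nausea", "rash"},
    "d2": {"nausea", "rash"},
    "d3": {"nausea"},
    "d4": {"nausea", "tremor"},
}
weights = rarity_weights(drugs)
print(raw_similarity(drugs["d1"], drugs["d2"], weights))
print(raw_similarity(drugs["d1"], drugs["d3"], weights))
```

In this toy data every drug lists nausea, so sharing it carries no weight, while sharing the rarer rash does.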
p-values
Campillos & Kuhn et al., Science, 2008
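A raw similarity score can be converted into a p-value by comparing it with the scores of randomly chosen drug pairs. The sketch below assumes an empirical estimate from such a background sample; the background values and the function name are hypothetical.

```python
def empirical_p_value(observed, background):
    """Estimate how often a random pair scores at least as high as `observed`.

    One is added to numerator and denominator so that the estimate is
    never exactly zero for a finite background sample.
    """
    at_least = sum(score >= observed for score in background)
    return (at_least + 1) / (len(background) + 1)


# Hypothetical background scores from random drug pairs.
background = [1.0, 2.0, 3.0, 4.0]
print(empirical_p_value(3.5, background))
```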
side-effect similarity
chemical similarity
Campillos & Kuhn et al., Science, 2008
confidence scores
reference set
incomplete databases
text mining
manual validation
MATADOR
Günther et al., Nucleic Acids Research, 2008
Campillos & Kuhn et al., Science, 2008
the results
drug–drug network
Campillos & Kuhn et al., Science, 2008
categorization
Campillos & Kuhn et al., Science, 2008
20 drug–drug pairs
in vitro binding assays
Ki < 10 µM for 11 of 20
cell assays
9 of 9 showed activity
the future
link side-effects to targets
direct target prediction
Acknowledgments
Monica Campillos
Michael Kuhn
Anne-Claude Gavin
Peer Bork
larsjuhljensen
  • This is a conservative estimate based only on what is in PubMed
    Too much to read!
    Text mining used to extract relations
    Similar methods used to mine medical records and link diseases