Integration of diverse large-scale datasets
Lars Juhl Jensen
 
 
 
promoter analysis
Jensen et al., Bioinformatics, 2000
DNA structure
genome visualization
Pedersen et al., Journal of Molecular Biology, 2000
microarray normalization
Workman et al., Genome Biology, 2002
protein function prediction
 
 
 
 
STRING
 
integrate diverse evidence
functional interactions
Bork et al., Current Opinion in Structural Biology, 2005
179 proteomes
evolution
 
 
statistics
(the original sin)
prokaryotes
genomic context methods
gene fusion
 
gene neighborhood
 
phylogenetic profiles
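A minimal sketch of the phylogenetic-profile idea: genes that are present and absent in the same genomes are candidates for a functional link. The gene names, the toy profiles, and the choice of Jaccard similarity as the score are illustrative assumptions, not the scoring scheme used in STRING.

```python
from itertools import combinations

# Hypothetical presence/absence profiles: one entry per genome (1 = gene present).
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (0, 1, 1, 0, 1, 0),
}

def jaccard(p, q):
    """Genomes containing both genes divided by genomes containing at least one."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def profile_scores(profiles):
    """Score every gene pair by the similarity of their profiles."""
    return {
        (g1, g2): jaccard(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }

scores = profile_scores(profiles)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Identical profiles (geneA and geneB here) get the highest score; a real analysis would also have to correct for the phylogenetic relatedness of the genomes.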
 
 
 
 
[Figure labels: cell, cellulosomes, cellulose]
eukaryotes
integrate diverse datasets
Jensen et al., Drug Discovery Today: Targets, 2004
curated knowledge
MIPS Munich Information center for Protein Sequences
KEGG Kyoto Encyclopedia of Genes and Genomes
STKE Signal Transduction Knowledge Environment
Reactome
literature mining
MEDLINE
SGD Saccharomyces Genome Database
The Interactive Fly
OMIM Online Mendelian Inheritance in Man
co-mentioning
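Co-mentioning can be sketched as counting how often two gene names appear in the same abstract. The abstracts and the name list below are made-up examples, and the whitespace tokenization is a deliberate simplification; real name matching needs the synonym lists discussed later.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus and dictionary of gene names.
abstracts = [
    "CYC1 and CYC7 expression is controlled by HAP1",
    "HAP1 binds upstream of CYC1",
    "GAL4 activates transcription",
]
gene_names = {"CYC1", "CYC7", "HAP1", "GAL4"}

def comention_counts(abstracts, gene_names):
    """Count, for each pair of names, the abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        mentioned = sorted(gene_names & set(text.split()))
        counts.update(combinations(mentioned, 2))
    return counts

counts = comention_counts(abstracts, gene_names)
print(counts.most_common())
```

Raw counts favor frequently mentioned genes, so in practice they would be normalized against how often each name occurs on its own.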
NLP Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
 
primary experimental data
microarray expression data
GEO Gene Expression Omnibus
physical protein interactions
BIND Biomolecular Interaction Network Database
MINT Molecular Interactions Database
GRID General Repository for Interaction Datasets
DIP Database of Interacting Proteins
HPRD Human Protein Reference Database
problems
many sources
(different gene identifiers)
many types of evidence
questionable quality
not directly comparable
spread over many species
huge synonym lists
calculate raw quality scores
calibrate vs. gold standard
KEGG Kyoto Encyclopedia of Genes and Genomes
von Mering et al., Nucleic Acids Research, 2005
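One way to picture calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record the fraction of pairs in each bin that the gold standard confirms. The pair names, scores, and equal-size binning are assumptions for illustration; the sketch also assumes every scored pair involves proteins covered by the gold standard.

```python
def calibrate(scored_pairs, gold_positive, n_bins=2):
    """Return [(lowest raw score in bin, fraction of gold-standard pairs)],
    with bins taken from the highest raw scores downwards."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    size = max(1, len(ranked) // n_bins)
    table = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table

# Hypothetical raw scores and gold-standard pairs (e.g. same pathway).
scored_pairs = [
    (("a", "b"), 0.9),
    (("a", "c"), 0.8),
    (("b", "c"), 0.3),
    (("c", "d"), 0.1),
]
gold_positive = {("a", "b"), ("a", "c"), ("c", "d")}

table = calibrate(scored_pairs, gold_positive)
print(table)
```

The resulting table maps a raw score from any evidence type onto the same scale, which is what makes different sources comparable.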
transfer based on orthology
combine all evidence
Bork et al., Current Opinion in Structural Biology, 2005
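Once every evidence type has a calibrated probability, a simple way to combine them is to assume the sources are independent and ask how likely it is that at least one of them is right. This is a sketch under that independence assumption; any correction for prior probabilities is left out.

```python
from math import prod

def combine(scores):
    """Combine calibrated probabilities, assuming independent evidence:
    1 minus the probability that every source is wrong."""
    return 1.0 - prod(1.0 - s for s in scores)

# Hypothetical calibrated scores for one protein pair from three evidence types.
evidence = {"neighborhood": 0.5, "coexpression": 0.5, "textmining": 0.2}
combined = combine(evidence.values())
print(round(combined, 3))
```

The combined score is never lower than the best individual score, so weak but independent lines of evidence reinforce each other.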
cell cycle
qualitative modeling
 
Chen et al., Molecular Biology of the Cell, 2004
synchronized cell culture
 
microarray time series
 
periodically expressed genes
 
S. cerevisiae
Cho et al.
Spellman et al.
numerous analysis methods
Cho et al.
Spellman et al.
Zhao et al.
Johansson et al.
Luan and Li
Lu et al.
Ahdesmäki et al.
Willbrand et al.
no benchmarking
de Lichtenberg et al., Bioinformatics, 2005
reproducibility
de Lichtenberg et al., Bioinformatics, 2005
regulation vs. periodicity
de Lichtenberg et al., Bioinformatics, 2005
list of 600 periodic genes
S. pombe
several expression studies
reproducibility
Marguerat et al., Yeast, 2006
name inconsistencies
Marguerat et al., Yeast, 2006
different analysis methods
no benchmarking
Marguerat et al., Yeast, 2006
too many genes suggested
Marguerat et al., Yeast, 2006
averaging better than voting
Marguerat et al., Yeast, 2006
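A toy illustration of why averaging can beat voting when merging gene lists from several studies. The gene names and scores are invented, and both functions are simplified stand-ins rather than the procedure used in the cited paper.

```python
from collections import Counter

# Hypothetical periodicity scores from three studies (higher = more periodic).
studies = [
    {"g1": 9.0, "g2": 5.0, "g3": 4.0, "g4": 1.0},
    {"g1": 8.0, "g2": 2.0, "g3": 6.0, "g4": 1.5},
    {"g1": 2.5, "g2": 3.0, "g3": 7.0, "g4": 1.0},
]

def shared_genes(studies):
    """Genes measured in every study, in a fixed order."""
    return sorted(set.intersection(*(set(s) for s in studies)))

def average_rank(studies):
    """Order genes by their mean rank across studies (best first)."""
    genes = shared_genes(studies)
    totals = dict.fromkeys(genes, 0)
    for study in studies:
        ordered = sorted(genes, key=lambda g: study[g], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            totals[gene] += rank
    return sorted(genes, key=lambda g: (totals[g] / len(studies), g))

def vote(studies, top_n):
    """Count in how many studies each gene makes the top_n list."""
    genes = shared_genes(studies)
    votes = Counter()
    for study in studies:
        ordered = sorted(genes, key=lambda g: study[g], reverse=True)
        votes.update(ordered[:top_n])
    return votes

print(average_rank(studies))
print(vote(studies, top_n=2))
```

With these numbers, voting on each study's top two gives g1, g2 and g3 the same two votes, whereas averaging still ranks them, because it uses the full ordering instead of a hard cutoff per study.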
S. cerevisiae
list of 600 periodic genes
protein interaction data
 
von Mering et al., Nucleic Acids Research, 2005
de Lichtenberg et al., Science, 2005
dynamic proteins
static proteins
de Lichtenberg et al., Science, 2005
reproduces what is known
de Lichtenberg et al., Science, 2005
many detailed predictions
de Lichtenberg et al., Science, 2005
global trends
dynamic proteins
de Lichtenberg et al., Science, 2005
static proteins
de Lichtenberg et al., Science, 2005
just-in-time assembly
de Lichtenberg et al., Science, 2005
coordinated regulation
periodically expressed genes
Cdc28p substrates
PEST degradation signals
the human interactome
yeast two-hybrid
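[Venn diagram: Stelzl et al., Rual et al., small-scale studies; region counts as extracted: 1936, 13, 4, 4, 1385, 65, 18465]
[Venn diagram: Stelzl et al., Rual et al., small-scale studies; region counts as extracted: 32, 0, 3, 4, 18, 4, 23]
[Venn diagrams: small-scale studies, Stelzl et al., Rual et al.; counts as extracted: 62, 8, 39; 852, 17, 473; 432, 69, 260]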
3.5% and 21% sensitivity
in a couple of years
the human interactome
100% = 1/5?
the yeast interactome
five years ago
yeast two-hybrid
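[Venn diagram: Uetz et al., Ito et al., small-scale studies; region counts as extracted: 1150, 117, 117, 72, 4053, 118, 4469]
[Venn diagram: Uetz et al., Ito et al., small-scale studies; region counts as extracted: 162, 53, 34, 72, 180, 29, 338]
[Venn diagrams: small-scale studies, Uetz et al., Ito et al.; counts as extracted: 511, 189, 616; 439, 178, 759; 897, 190, 1347]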
19% and 12% sensitivity
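Sensitivity figures like those quoted for the two-hybrid screens can be read as the fraction of a trusted reference set that a screen recovers. The sketch below uses invented protein pairs and treats interactions as unordered; which reference set and filtering the slides used is not stated here, so this is only the generic definition.

```python
def sensitivity(screen, reference):
    """Fraction of reference interactions that the screen also reports.
    Pairs are unordered, so (A, B) and (B, A) count as the same interaction."""
    reference = {frozenset(pair) for pair in reference}
    screen = {frozenset(pair) for pair in screen}
    if not reference:
        raise ValueError("reference set is empty")
    return len(screen & reference) / len(reference)

# Hypothetical reference interactions and one hypothetical screen.
reference = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
screen = [("B", "A"), ("X", "Y")]

result = sensitivity(screen, reference)
print(f"{result:.0%}")
```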
the challenge
how to get from here …
[Venn diagram: Stelzl et al., Rual et al., small-scale studies; region counts as extracted: 1936, 13, 4, 4, 1385, 65, 18465]
…  to there …
de Lichtenberg et al., Science, 2005
Acknowledgments
The STRING team (EMBL): Christian von Mering, Berend Snel, Martijn Huynen, Sean Hooper, Mathilde Foglierini, Julien Lagarde, Peer Bork
Literature mining project (EML Research): Jasmin Saric, Rossitza Ouzounova, Isabel Rojas
Cell cycle studies (CBS): Ulrik de Lichtenberg, Thomas Skøt Jensen, Søren Brunak
S. pombe cell cycle (Sanger): Samuel Marguerat, Jürg Bähler
Inspiration for presentation: Lawrence Lessig, Dick Clarence Hardt, Anders Gorm Pedersen
Thank you!
