Integration of heterogeneous data
Lars Juhl Jensen
 
 
 
 
data mining
text mining
interaction networks
 
Kuhn et al., Nucleic Acids Research, 2010
parts lists
630 genomes
2.5 million proteins
~74,000 small molecules
many databases
different formats
model organism databases
Ensembl
RefSeq
PubChem
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
conserved neighborhood
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
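
A phylogenetic profile records the genomes in which a gene is present; genes with similar profiles are candidates for functional association. The following is a minimal sketch, assuming toy presence/absence data and Jaccard similarity as the comparison measure. The slides do not state which measure is used, and all gene and genome names are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sets of genome names)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)

# Hypothetical toy data: gene -> set of genomes in which the gene is present.
profiles = {
    "geneA": {"g1", "g2", "g3", "g5"},
    "geneB": {"g1", "g2", "g3"},
    "geneC": {"g4"},
}

# geneA and geneB share three of four genomes; geneA and geneC share none.
similarity_ab = jaccard(profiles["geneA"], profiles["geneB"])
similarity_ac = jaccard(profiles["geneA"], profiles["geneC"])
```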
experimental data
gene coexpression
 
protein interactions
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
small molecule interactions
in vitro binding assays
cellular activity assays
many databases
GEO Gene Expression Omnibus
BIND Biomolecular Interaction Network Database
BioGRID Biological General Repository for Interaction Datasets
DIP Database of Interacting Proteins
IntAct
MINT Molecular Interactions Database
HPRD Human Protein Reference Database
PDB Protein Data Bank
BindingDB
CTD Comparative Toxicogenomics Database
DrugBank
GLIDA GPCR-Ligand Database
MATADOR
PDSP Ki Psychoactive Drug Screening Program
PharmGKB Pharmacogenomics Knowledge Base
different formats
different identifiers
partially redundant
Campillos & Kuhn et al., Science, 2008
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
Gene Ontology
MIPS Munich Information center for Protein Sequences
KEGG Kyoto Encyclopedia of Genes and Genomes
MetaCyc
Reactome
PID NCI-Nature Pathway Interaction Database
high confidence
different formats
different identifiers
partially redundant
literature mining
>10 km
human readable
not computer readable
different names
text corpus
MEDLINE
SGD Saccharomyces Genome Database
The Interactive Fly
OMIM Online Mendelian Inheritance in Man
thesaurus
co-mentioning
statistical methods
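
Co-mentioning can be illustrated with a small sketch: a thesaurus maps the different names found in text to one identifier, and pairs of identifiers are counted whenever they occur in the same document. The thesaurus entries, the example sentences, and the whitespace tokenizer below are all assumptions made for illustration, not the pipeline behind the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical thesaurus: lower-cased name -> canonical identifier.
thesaurus = {"cyc1": "CYC1", "cyc7": "CYC7", "hap1": "HAP1", "cyp1": "HAP1"}

# Hypothetical toy corpus standing in for abstracts.
abstracts = [
    "Expression of CYC1 and CYC7 is controlled by HAP1",
    "CYP1 regulates CYC7",
    "GAL4 activates transcription",
]

def entities_in(text):
    """Return the set of canonical identifiers mentioned in a text."""
    tokens = (tok.strip(".,;()").lower() for tok in text.split())
    return {thesaurus[tok] for tok in tokens if tok in thesaurus}

def count_comentions(docs):
    """Count, per pair of identifiers, the documents mentioning both."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(entities_in(doc)), 2):
            counts[pair] += 1
    return counts

comentions = count_comentions(abstracts)
```

Raw counts like these favor frequently mentioned names, which is where a statistical score that compares observed with expected co-occurrence would come in.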
NLP Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
 
restricted access
Reflect
augmented browsing
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
integration
the easy problems
many databases
different formats
different identifiers
partially redundant
parsers
thesaurus
bookkeeping
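
The three ingredients above (parsers, a thesaurus, bookkeeping) can be sketched as follows: records from different sources are mapped to common identifiers, and redundant records are merged while tracking which sources support each pair. The source names, identifiers, and record layout are hypothetical; real databases each need their own parser.

```python
# Hypothetical thesaurus: (source, source-specific identifier) -> common identifier.
id_map = {
    ("dbA", "A:001"): "P1",
    ("dbA", "A:002"): "P2",
    ("dbB", "b-17"): "P1",
    ("dbB", "b-42"): "P2",
    ("dbB", "b-99"): "P3",
}

# Hypothetical parsed records: (source, identifier 1, identifier 2).
records = [
    ("dbA", "A:001", "A:002"),
    ("dbB", "b-42", "b-17"),   # same interaction, other source, reversed order
    ("dbB", "b-17", "b-99"),
    ("dbB", "b-17", "b-00"),   # identifier missing from the thesaurus
]

def merge_records(records, id_map):
    """Map records to common identifiers and merge duplicates.

    Returns {(id1, id2): set of supporting sources}, with id1 <= id2.
    Records with unmapped identifiers are skipped.
    """
    merged = {}
    for source, first, second in records:
        try:
            pair = tuple(sorted((id_map[(source, first)], id_map[(source, second)])))
        except KeyError:
            continue
        merged.setdefault(pair, set()).add(source)
    return merged

merged = merge_records(records, id_map)
```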
the hard problems
many data types
not comparable
variable quality
raw quality scores
intergenic distances
Korbel et al., Nature Biotechnology, 2004
correlations
 
reproducibility
von Mering et al., Nucleic Acids Research, 2005
score calibration
gold standard
von Mering et al., Nucleic Acids Research, 2005
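
One way to picture score calibration: bin the raw scores of an evidence type and, for each bin, measure the fraction of pairs that the gold standard confirms. That fraction can then stand in for a probability that is comparable across evidence types. The sketch below assumes simple fixed bins and toy data; the binning scheme, the pairs, and the gold standard are hypothetical.

```python
def calibrate(scored_pairs, gold_positive, bin_edges):
    """Per-bin fraction of pairs found in the gold standard.

    scored_pairs: iterable of (pair, raw_score)
    gold_positive: set of pairs considered true
    bin_edges: ascending edges; bin i covers [edges[i], edges[i + 1])
    Returns one fraction per bin, or None for an empty bin.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for pair, raw in scored_pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                totals[i] += 1
                hits[i] += pair in gold_positive
                break
    return [h / t if t else None for h, t in zip(hits, totals)]

# Hypothetical raw scores and gold standard.
scored_pairs = [
    (("a", "b"), 0.90), (("a", "c"), 0.80), (("a", "d"), 0.85),
    (("b", "c"), 0.20), (("c", "d"), 0.10), (("b", "d"), 0.30),
]
gold_positive = {("a", "b"), ("a", "d"), ("b", "c")}

# Two bins: low scores [0.0, 0.5) and high scores [0.5, 1.01).
precision_per_bin = calibrate(scored_pairs, gold_positive, [0.0, 0.5, 1.01])
```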
spread over 630 genomes
transfer by orthology
von Mering et al., Nucleic Acids Research, 2005
two modes
COG mode
von Mering et al., Nucleic Acids Research, 2005
protein mode
von Mering et al., Nucleic Acids Research, 2005
combine all evidence
P = 1 - (1 - P_1)(1 - P_2)(1 - P_3) …
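
The combination formula treats the evidence types as independent: the combined probability is one minus the probability that every individual line of evidence is wrong. A minimal sketch, with a hypothetical function name and made-up input probabilities:

```python
def combine_evidence(probabilities):
    """Combine independent probabilities as P = 1 - (1 - P_1)(1 - P_2)..."""
    all_wrong = 1.0
    for p in probabilities:
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong

# Hypothetical calibrated probabilities from three evidence types.
combined = combine_evidence([0.5, 0.5, 0.2])
```

The combined value is never lower than the strongest single probability, so several weak but independent lines of evidence can add up to a confident association.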
visualize
Kuhn et al., Nucleic Acids Research, 2010
access
access for humans
web interfaces
 
 
 
access for computers
web services
REST Representational State Transfer
SOAP Simple Object Access Protocol
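
With a REST interface, a query is just a URL that encodes the output format, the method, and the parameters. The sketch below only builds such a URL and makes no network call. The base URL, the path layout, the method name, and the parameter names are all hypothetical placeholders, not the documented API of any of the resources mentioned here.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address documented by the actual service.
BASE_URL = "https://example.org/api"

def build_request_url(output_format, method, **params):
    """Build a REST-style URL of the form BASE/format/method?key=value&..."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{output_format}/{method}?{query}"

# Hypothetical query: interactions for one gene in one species.
url = build_request_url("tsv", "interactions", identifier="CYC1", species=4932)
```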
Acknowledgments
STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Monica Campillos, Christian von Mering, Lars Juhl Jensen, Andreas Beyer, Peer Bork
Reflect: Sean O’Donoghue, Heiko Horn, Sune Frankild, Evangelos Pafilis, Michael Kuhn, Nigel Brown, Reinhardt Schneider
STRING: Christian von Mering, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
larsjuhljensen
