STRING: Cross-species integration of known and predicted protein-protein interactions. Lars Juhl Jensen, EMBL Heidelberg
STRING provides a protein network based on integration of diverse types of evidence: genomic neighborhood, species co-occurrence, gene fusions, database imports, experimental interaction data, microarray expression data, and literature co-mentioning.
Inferring functional modules from gene presence/absence patterns. [Figure: the “cellulosome”, showing the cell, cell wall, anchoring proteins, cellulosomes, cellulose, and resting and protracted protuberances. © Trends Microbiol, 1999]
Genomic context methods © Nature Biotechnology, 2004
Formalizing the phylogenetic profile method: align all proteins against all; calculate best-hit profiles; join similar species by PCA; calculate PC profile distances; calibrate against KEGG maps.
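The profile steps above can be sketched in a few lines. This is a simplified illustration only, not the STRING implementation: it builds best-hit profiles from hypothetical normalized alignment scores and compares them with a plain Euclidean distance, and it leaves out both the PCA step that joins similar species and the KEGG calibration. The names `best_hit_profile`, `profile_distance`, and all scores are assumptions made for the example.

```python
from math import sqrt

def best_hit_profile(hits, species):
    """Return one value per species: the best alignment score found there (0.0 if none)."""
    return [max(hits.get(sp, []), default=0.0) for sp in species]

def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

species = ["sp1", "sp2", "sp3", "sp4"]

# Hypothetical normalized alignment scores: protein -> species -> scores of all hits.
hits_by_protein = {
    "protA": {"sp1": [0.9, 0.4], "sp2": [0.8]},
    "protB": {"sp1": [0.9], "sp2": [0.8, 0.1]},
    "protC": {"sp3": [0.7], "sp4": [0.95]},
}

profiles = {name: best_hit_profile(h, species) for name, h in hits_by_protein.items()}
d_ab = profile_distance(profiles["protA"], profiles["protB"])  # same presence pattern
d_ac = profile_distance(profiles["protA"], profiles["protC"])  # complementary pattern
```

Proteins with the same presence/absence pattern across species (here `protA` and `protB`) end up at distance zero, which is the signal the method reads as a possible functional link.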
Predicting functional and physical interactions from gene fusion/fission events: calculate all-against-all pairwise alignments; find genes in A that match the same gene in B; exclude overlapping alignments; calibrate against KEGG maps.
Inferring functional associations from evolutionarily conserved operons: identify runs of adjacent genes with the same direction; score each gene pair based on intergenic distances; calibrate against KEGG maps; infer associations in other species.
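As a rough sketch of the first two operon steps, the code below groups genes into runs of adjacent genes on the same strand and reports the intergenic distance for each adjacent pair. The gene records, coordinates, and function names are hypothetical, and the step that turns distances into scores is left out because the slides do not specify it.

```python
def same_strand_runs(genes):
    """Group genes, ordered by start position, into runs of adjacent same-strand genes."""
    runs = []
    for gene in sorted(genes, key=lambda g: g["start"]):
        if runs and runs[-1][-1]["strand"] == gene["strand"]:
            runs[-1].append(gene)
        else:
            runs.append([gene])
    # A single gene is not a run of adjacent genes, so drop singletons.
    return [run for run in runs if len(run) > 1]

def intergenic_distances(run):
    """Return (gene_a, gene_b, gap) per adjacent pair; a negative gap means overlap."""
    return [(a["name"], b["name"], b["start"] - a["end"] - 1)
            for a, b in zip(run, run[1:])]

# Hypothetical gene coordinates on one chromosome.
genes = [
    {"name": "g1", "start": 100, "end": 400, "strand": "+"},
    {"name": "g2", "start": 420, "end": 900, "strand": "+"},
    {"name": "g3", "start": 950, "end": 1500, "strand": "-"},
    {"name": "g4", "start": 1510, "end": 2000, "strand": "-"},
    {"name": "g5", "start": 2300, "end": 2800, "strand": "+"},
]

runs = same_strand_runs(genes)
pairs = [pair for run in runs for pair in intergenic_distances(run)]
```

Short gaps within a run, such as those between `g1`/`g2` and `g3`/`g4` here, are what one would expect for genes in the same operon.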
Score calibration against a common reference. Many diverse types of evidence: the quality of each is judged by very different raw scores, and quality differences exist among data sets of the same type. This is solved by calibrating all scores against a common reference: scores become directly comparable, and probabilistic scores allow evidence to be combined. Requirements for the reference: it must represent a compromise among all types of evidence and have broad species coverage.
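One way to picture calibration and combination is sketched below. It assumes a simple binning scheme (the fraction of pairs in each raw-score bin that share a KEGG map) and a naive independence assumption when combining probabilities; neither is confirmed by the slides as the exact STRING procedure, and all data and function names are made up for illustration.

```python
from math import prod

def calibrate_bins(scored_pairs, bin_size):
    """Turn raw scores into probabilities using a reference set.

    scored_pairs: list of (raw_score, shares_kegg_map) tuples.
    Returns [(lowest raw score in bin, fraction of pairs sharing a map)],
    with bins ordered from the highest raw scores to the lowest.
    """
    ordered = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    bins = []
    for i in range(0, len(ordered), bin_size):
        chunk = ordered[i:i + bin_size]
        fraction = sum(1 for _, on_same_map in chunk if on_same_map) / len(chunk)
        bins.append((chunk[-1][0], fraction))
    return bins

def combine(probabilities):
    """Combine calibrated scores, assuming the evidence types are independent."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical raw scores for one evidence type, labeled against the reference.
scored_pairs = [
    (9.0, True), (8.0, True), (7.0, True), (6.0, False),
    (3.0, False), (2.0, True), (1.0, False), (0.5, False),
]
bins = calibrate_bins(scored_pairs, bin_size=4)
combined = combine([0.75, 0.25])
```

Because every evidence type is mapped onto the same probability scale, scores from unrelated sources can be compared and merged, which is the point the slide makes.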
Integrating physical interaction screens: complex pull-down experiments; yeast two-hybrid data sets are inherently binary; calculate score from number of (co-)occurrences; calculate score from non-shared partners; calibrate against KEGG maps; infer associations in other species; combine evidence from experiments.
Mining microarray expression databases: re-normalize arrays with a modern method to remove biases; build expression matrix; combine similar arrays by PCA; calculate pairwise linear correlation coefficients; calibrate against KEGG maps; infer associations in other species.
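The correlation step can be illustrated with a plain Pearson coefficient over rows of an expression matrix. This sketch assumes the matrix is already normalized, skips the PCA step that combines similar arrays, and uses invented expression values; a gene with constant expression would need special handling because its variance is zero.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression matrix: gene -> expression level per array.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

correlations = {
    (g1, g2): pearson(expression[g1], expression[g2])
    for g1, g2 in combinations(sorted(expression), 2)
}
```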
Evidence transfer based on “fuzzy orthology”. Orthology transfer is tricky: correct assignment of orthology is difficult for distant species, and functional equivalence cannot be guaranteed for in-paralogs. These problems are addressed by our “fuzzy orthology” scheme: confidence scores for functional equivalence are calculated from all-against-all alignment, and, in the case of many-to-many relationships, evidence is distributed across possible pairs according to confidence scores. [Figure: source species and target species]
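A minimal sketch of distributing evidence across many-to-many relationships follows. It assumes that the transferred score is the source score multiplied by the two equivalence confidences; the slides do not give the actual formula, so this weighting, the function name, and the numbers are all hypothetical.

```python
def transfer_evidence(source_score, conf_a, conf_b):
    """Distribute one source-species association over candidate target pairs.

    conf_a / conf_b map each candidate target protein to the confidence that it
    is functionally equivalent to source protein A / B (assumed to lie in [0, 1]).
    """
    return {
        (target_a, target_b): source_score * ca * cb
        for target_a, ca in conf_a.items()
        for target_b, cb in conf_b.items()
    }

# Source protein A has two in-paralogs in the target species, B has one ortholog.
conf_a = {"a1": 0.5, "a2": 0.5}
conf_b = {"b1": 1.0}
transferred = transfer_evidence(0.8, conf_a, conf_b)
```

With two equally likely counterparts for A, each target pair receives half of the original evidence instead of one pair being chosen arbitrarily.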
Multiple evidence types from several species
Getting more specific – generally speaking. Benchmarking against one common reference allows integration of heterogeneous data. The different types of data do not all tell us about the same kind of functional associations. It should be possible to assign likely interaction types from supporting evidence types. The aim: to construct accurate, qualitative models of biological systems or processes. The models should be accurate even at the level of individual interactions. This allows specific, testable hypotheses to be made based on high-throughput experimental data.
Getting the parts list. [Figure: yeast culture, microarrays, gene expression, expression profile; Cho & Spellman et al.; new analysis.] The parts list: 600 periodically expressed genes (with associated peak times) that encode “dynamic proteins”.
Constructing a reliable protein network. The stickiness of an interaction was scored based on its local network topology. We benchmarked these scores for each individual data set against a common reference. Impossible interactions were eliminated based on subcellular localization data. By restricting the network to a particular system, the error rate is further reduced.
Extracting a cell cycle interaction network. Raw data: cell cycle microarray data; physical protein-protein interactions with confidence scores. Node selection: list of periodically expressed proteins with peak times; expand the set of proteins to include non-periodic proteins that are strongly connected to periodic proteins. Interactions: require compatible compartments and high confidence. Extract the cell cycle network.
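The extraction procedure can be sketched as a filter followed by a node expansion. The thresholds `min_conf` and `min_links`, the reading of "strongly connected" as a minimum number of reliable links to periodic proteins, and all protein names and data are assumptions made for this illustration, not values from the analysis.

```python
def extract_network(interactions, periodic, compartments, min_conf=0.9, min_links=2):
    """Return (nodes, edges) of a filtered, expanded interaction network.

    interactions: list of (protein_a, protein_b, confidence).
    periodic: dict of periodically expressed protein -> peak time.
    compartments: dict of protein -> set of subcellular compartments.
    """
    # Keep interactions that are high confidence and share at least one compartment.
    reliable = [
        (a, b, conf) for a, b, conf in interactions
        if conf >= min_conf and compartments.get(a, set()) & compartments.get(b, set())
    ]
    # Count reliable links from each non-periodic protein to periodic proteins.
    links = {}
    for a, b, _ in reliable:
        for protein, partner in ((a, b), (b, a)):
            if protein not in periodic and partner in periodic:
                links[protein] = links.get(protein, 0) + 1
    # Expand the node set with well-connected non-periodic ("static") proteins.
    nodes = set(periodic) | {p for p, count in links.items() if count >= min_links}
    edges = [(a, b, conf) for a, b, conf in reliable if a in nodes and b in nodes]
    return nodes, edges

# Hypothetical input data.
periodic = {"P1": 10, "P2": 35}
compartments = {
    "P1": {"nucleus"}, "P2": {"nucleus"},
    "S1": {"nucleus", "cytoplasm"}, "S2": {"cytoplasm"}, "S3": {"nucleus"},
}
interactions = [
    ("P1", "P2", 0.95), ("P1", "S1", 0.92), ("P2", "S1", 0.97),
    ("P1", "S2", 0.99), ("P2", "S3", 0.50), ("P1", "S3", 0.93),
]
nodes, edges = extract_network(interactions, periodic, compartments)
```

In this toy input, `S2` is dropped because it shares no compartment with `P1`, and `S3` is dropped because only one of its links to periodic proteins passes the confidence cutoff.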
The temporal interaction network. Interacting proteins are expressed close in time. Two thirds of the dynamic proteins lack interactions but likely participate in transient interactions.
Static proteins comprise a third of the interactions at all times of the cell cycle. Their time of action can be predicted from interactions with dynamic proteins. Static proteins play a major role.
Cdc28p and its interaction partners
Just-in-time synthesis vs. just-in-time assembly. Most dynamic proteins are expressed just before they are needed to carry out their function. Most complexes also contain static proteins. Just-in-time assembly of complexes appears to be the general principle. The time of assembly is controlled by synthesizing the last subunits just-in-time.
Assembly of the pre-replication complex
The network as a discovery tool. The network enables us to place 30+ uncharacterized proteins in a temporal interaction context. Quite detailed hypotheses can be made concerning their function. The network also contains entire novel modules and complexes.
Transcription is linked to phosphorylation. A genome-wide screen identified 332 Cdc28p targets, which include 6% of all yeast proteins, 8% of the static proteins, and 27% of the dynamic ones. A similar correlation was observed with predicted PEST regions. This suggests a hitherto undescribed link between transcriptional and post-translational control.
Conclusions. Genomic context methods are able to infer the function of many prokaryotic proteins from genome sequences alone. Integration of large-scale experimental data allows similar predictions to be made for eukaryotic proteins. Benchmarking is a prerequisite for data integration. It is possible to construct highly reliable models through careful integration of high-throughput experimental data. Try STRING at http://string.embl.de
Acknowledgments. The STRING team: Christian von Mering, Berend Snel, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Sean Hooper, Julien Lagarde, Mathilde Foglierini, Peer Bork. New context methods: Jan Korbel, Christian von Mering, Peer Bork. Cell cycle analysis: Ulrik de Lichtenberg, Thomas Skøt Jensen, Anders Fausbøll, Søren Brunak.
Thank you!
