STRING - Modeling of pathways through cross-species integration of large-scale data



Sienabiotech, Siena, April 21, 2005




Presentation Transcript

  • STRING: Modeling of pathways through cross-species integration of large-scale data
    • Lars Juhl Jensen, EMBL Heidelberg
  • Qualitative versus quantitative modeling
  • STRING provides a modular protein network by integrating diverse types of evidence
    • Genomic neighborhood
    • Species co-occurrence
    • Gene fusions
    • Database imports
    • Experimental interaction data
    • Microarray expression data
    • Literature co-mentioning
  • Inferring functional modules from gene presence/absence patterns
    • [Figure: the “cellulosome”, showing the cell, cell wall, anchoring proteins, cellulosomes, and cellulose, with resting and protracted protuberances. © Trends Microbiol, 1999]
  • Genomic context methods
    • [Figure © Nature Biotechnology, 2004]
  • Formalizing the phylogenetic profile method
    • Align all proteins against all
    • Calculate best-hit profile
    • Join similar species by PCA
    • Calculate PC profile distances
    • Calibrate against KEGG maps
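
The profile comparison step can be illustrated with a minimal sketch. It is a simplification, not the STRING implementation: it skips the PCA step and compares raw profiles by Euclidean distance, and the gene names and profile values are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical best-hit profiles: one normalized score per species
# (0.0 = no hit in that species, 1.0 = a hit as good as the self-hit).
profiles = {
    "geneA": [1.0, 0.9, 0.0, 0.0, 0.8],
    "geneB": [1.0, 0.8, 0.0, 0.1, 0.9],
    "geneC": [0.0, 0.1, 1.0, 0.9, 0.0],
}


def profile_distance(p, q):
    """Euclidean distance between two equal-length profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_pairs(profiles):
    """Return (distance, gene, gene) tuples, most similar pair first."""
    pairs = [
        (profile_distance(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs)


ranked = rank_pairs(profiles)
for distance, g, h in ranked:
    print(f"{g} - {h}: {distance:.3f}")
```

Genes with similar presence/absence patterns across species end up at the top of the ranking; in the method on the slide these distances would then be calibrated against KEGG maps.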
  • Predicting functional and physical interactions from gene fusion/fission events
    • Calculate all-against-all pairwise alignments
    • Find genes in A that match the same gene in B
    • Exclude overlapping alignments
    • Calibrate against KEGG maps
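
As a rough sketch of the fusion search, assuming alignment hits are already available as tuples: two genes in A are reported as a candidate pair when they align to non-overlapping regions of the same gene in B. The hit format, gene names, and coordinates are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical hits: (gene in A, gene in B, start on B gene, end on B gene)
hits = [
    ("a1", "b1", 1, 200),
    ("a2", "b1", 210, 450),
    ("a3", "b1", 150, 300),
    ("a4", "b2", 1, 100),
]


def overlaps(h1, h2):
    """True if two hits cover overlapping regions of the target gene."""
    return h1[2] <= h2[3] and h2[2] <= h1[3]


def fusion_candidates(hits):
    """Pairs of A genes matching the same B gene without overlapping."""
    by_target = defaultdict(list)
    for hit in hits:
        by_target[hit[1]].append(hit)
    candidates = set()
    for target, group in by_target.items():
        for h1, h2 in combinations(group, 2):
            if h1[0] != h2[0] and not overlaps(h1, h2):
                first, second = sorted((h1[0], h2[0]))
                candidates.add((first, second, target))
    return sorted(candidates)


candidates = fusion_candidates(hits)
print(candidates)
```

Here only a1 and a2 qualify, because a3 overlaps both of them on b1.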
  • Inferring functional associations from evolutionarily conserved operons
    • Identify runs of adjacent genes with the same direction
    • Score each gene pair based on intergenic distances
    • Calibrate against KEGG maps
    • Infer associations in other species
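
The first two operon steps might look like the following sketch. The gene coordinates are hypothetical, and the raw intergenic distance stands in for the score, since the slide does not give the scoring function.

```python
# Hypothetical gene order on one chromosome: (name, start, end, strand)
genes = [
    ("g1", 100, 900, "+"),
    ("g2", 950, 1800, "+"),
    ("g3", 1820, 2500, "+"),
    ("g4", 2700, 3300, "-"),
    ("g5", 3350, 4000, "-"),
    ("g6", 4500, 5000, "+"),
]


def same_strand_runs(genes):
    """Split position-sorted genes into runs that share a strand."""
    runs, current = [], []
    for gene in sorted(genes, key=lambda g: g[1]):
        if current and gene[3] != current[-1][3]:
            runs.append(current)
            current = []
        current.append(gene)
    if current:
        runs.append(current)
    return runs


def adjacent_pair_distances(run):
    """Intergenic distance for each pair of neighbouring genes in a run."""
    return [
        (left[0], right[0], right[1] - left[2])
        for left, right in zip(run, run[1:])
    ]


runs = same_strand_runs(genes)
for run in runs:
    print([g[0] for g in run], adjacent_pair_distances(run))
```

Short intergenic distances within a run suggest that the two genes belong to the same operon.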
  • Score calibration against a common reference
    • Many diverse types of evidence
      • The quality of each is judged by very different raw scores
      • Quality differences exist among data sets of the same type
    • Solved by calibrating all scores against a common reference
      • Scores are directly comparable
      • Probabilistic scores allow evidence to be combined
    • Requirements for the reference
      • Must represent a compromise among all the types of evidence
      • Broad species coverage
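
A minimal sketch of the calibration idea, under stated assumptions: raw scores are binned, each bin is mapped to the fraction of pairs that share a reference pathway, and calibrated probabilities are combined assuming independent evidence. The binning scheme, the toy data, and the combination formula are illustrative assumptions, not the procedure used in STRING.

```python
from bisect import bisect_right
from math import prod


def calibrate(scored_pairs, bin_edges):
    """Map raw-score bins to the fraction of pairs sharing a reference pathway.

    scored_pairs: iterable of (raw_score, shares_pathway) tuples.
    bin_edges: ascending lower bounds of the bins; the last bin is open-ended.
    Returns one probability per bin, or None for an empty bin.
    """
    counts = [[0, 0] for _ in bin_edges]  # [positives, total] per bin
    for raw, positive in scored_pairs:
        idx = bisect_right(bin_edges, raw) - 1
        if idx < 0:
            continue  # below the lowest bin edge
        counts[idx][0] += int(positive)
        counts[idx][1] += 1
    return [pos / total if total else None for pos, total in counts]


def combine(probabilities):
    """Combine calibrated probabilities, assuming independent evidence."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical benchmark: (raw score, do both proteins share a KEGG map?)
toy = [(0.1, False), (0.2, False), (0.3, True), (0.6, True),
       (0.7, False), (0.9, True), (0.95, True)]
table = calibrate(toy, [0.0, 0.5, 0.8])
print(table)
print(combine([0.5, 0.5]))
```

Once every evidence type is expressed on this common probability scale, two moderately supported associations can reinforce each other, as in the final line.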
  • Integrating physical interaction screens
    • Complex pull-down experiments
    • Yeast two-hybrid data sets are inherently binary
    • Calculate score from number of (co-)occurrences
    • Calculate score from non-shared partners
    • Calibrate against KEGG maps
    • Infer associations in other species
    • Combine evidence from experiments
  • Mining microarray expression databases
    • Re-normalize arrays by modern method to remove biases
    • Build expression matrix
    • Combine similar arrays by PCA
    • Construct predictor by Gaussian kernel density estimation
    • Calibrate against KEGG maps
    • Infer associations in other species
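
The slide does not say how the kernel density predictor is built, so the following is only one plausible reading: densities of an expression-similarity value are estimated separately for pairs that do and do not share a pathway, and Bayes' rule turns them into a probability. The similarity values, the bandwidth, and the use of a single similarity feature are all assumptions.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def make_predictor(pos, neg, bandwidth=0.1):
    """Estimate P(shared pathway | similarity) from two labelled samples."""
    f_pos = gaussian_kde(pos, bandwidth)
    f_neg = gaussian_kde(neg, bandwidth)
    prior = len(pos) / (len(pos) + len(neg))

    def predict(x):
        a = prior * f_pos(x)
        b = (1.0 - prior) * f_neg(x)
        return a / (a + b) if a + b > 0 else prior

    return predict


# Hypothetical expression correlations for benchmark gene pairs.
pos = [0.7, 0.8, 0.85, 0.9]            # pairs sharing a pathway
neg = [-0.2, 0.0, 0.1, 0.2, 0.3, 0.1]  # pairs not sharing a pathway
predict = make_predictor(pos, neg)
print(predict(0.85), predict(0.5), predict(0.0))
```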
  • Evidence transfer based on “fuzzy orthology”
    • Orthology transfer is tricky
      • Correct assignment of orthology is difficult for distant species
      • Functional equivalence cannot be guaranteed for in-paralogs
    • These problems are addressed by our “fuzzy orthology” scheme
      • Confidence scores for functional equivalence are calculated from all-against-all alignment
      • Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships
    • [Figure: evidence transfer from source species to target species]
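
The distribution of evidence over many-to-many relationships can be sketched as follows. This assumes that each candidate pair is weighted by the product of its two equivalence confidences and that the weights are normalized so the transferred evidence sums to the original score; the slide does not specify the exact weighting, and the protein names are hypothetical.

```python
def transfer_evidence(score, orthologs_a, orthologs_b):
    """Distribute one source-species association over target-species pairs.

    score: association score between proteins A and B in the source species.
    orthologs_a, orthologs_b: dicts mapping each candidate target protein to
        its confidence of being functionally equivalent to A or B.
    Returns a dict {(target_a, target_b): transferred score}.
    """
    weights = {
        (ta, tb): ca * cb
        for ta, ca in orthologs_a.items()
        for tb, cb in orthologs_b.items()
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    return {pair: score * w / total for pair, w in weights.items()}


# One clear ortholog for A, two in-paralogs competing for B.
transferred = transfer_evidence(
    0.8,
    {"x1": 1.0},
    {"y1": 0.75, "y2": 0.25},
)
print(transferred)
```

The more confidently equivalent in-paralog receives most of the evidence, rather than both pairs inheriting the full score.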
  • Multiple evidence types from several species
  • Predicting and defining metabolic pathways and other functional modules
    • [Figure: metabolism overview, with purine biosynthesis and histidine biosynthesis highlighted. Image: Molecular Biology of the Cell, 3rd edition]
    • Defined manually: cutting metabolic maps into pathways
    • Defined objectively: standard clustering of genome-scale data
  • Getting more specific – generally speaking
    • Benchmarking against one common reference allows integration of heterogeneous data
    • The different types of data do not all tell us about the same kind of functional associations
    • It should be possible to assign likely interaction types from supporting evidence types
    • The aim: to construct an accurate, qualitative model of the yeast mitotic cell cycle
    • The model should be accurate even at the level of individual interactions
    • It should provide insight on the dynamic process of complex assembly
  • Model generation through data integration
    • A parts list
      • Literature
      • Microarray data
    • Dynamic data
      • Microarray data
      • Proteomics data
      • PPI data
      • TF-target data
    • Connections
    • [Figure: example gene lists: YER001W YBR088C YOL007C YPL127C YNR009W YDR224C YDL003W YBL003C YDR225W YBR010W YKR013W … and YDR097C YBR089W YBR054W YMR215W YBR071W YBL002W YGR189C YNL031C YNL030W YNL283C YGR152C …]
  • Getting the parts list
    • Data: yeast culture, microarrays, gene expression, expression profiles (Cho et al. & Spellman et al.)
    • New analysis: 600 periodically expressed genes (with associated peak times) that encode “dynamic proteins”
    • Result: the parts list
  • The temporal interaction network
    • Observation: For two thirds of the dynamic proteins, no interactions were found
    • Why?
    • Some may be missed components of the complexes and modules already in the network
    • Some may not participate in protein-protein interactions
    • But, the majority probably participate in transient interactions that are not so well captured by current interaction assays
  • Interactions are close in time
    • Observation: Interacting dynamic proteins are typically expressed close in time
  • Static proteins play a major role
    • Observation: Static (scaffold) proteins comprise about a third of the network and participate in interactions throughout the entire cycle
  • Just-in-time synthesis? Yes and no!
    • Observation: The dynamic proteins are generally expressed just before they are needed to carry out their function, which is generally referred to as just-in-time synthesis
    • But the general design principle seems to be that only some key components of each module/complex are dynamic
    • This suggests a mechanism of just-in-time assembly, or partial just-in-time synthesis
  • The network as a discovery tool
    • Observation: The network places 30+ uncharacterized proteins in a temporal interaction context. The network thus generates detailed hypotheses about their function.
    • Observation: The network contains entire novel modules and complexes.
  • Network hubs: “party” versus “date”
    • “Date” hub: the hub protein interacts with different proteins at different times
    • “Party” hub: the hub protein and its interactors are expressed close in time
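
The party/date distinction can be made concrete with a small sketch. It assumes peak times are expressed as a percentage of the cell cycle, so that time wraps around at 100, and it uses an arbitrary closeness window of 10; both the time scale and the threshold are assumptions, not values from the talk.

```python
def circular_distance(t1, t2, period=100.0):
    """Shortest distance between two time points on a cyclic time axis."""
    d = abs(t1 - t2) % period
    return min(d, period - d)


def classify_hub(hub_peak, partner_peaks, period=100.0, window=10.0):
    """Label a hub 'party' if all partners peak close to it, else 'date'."""
    close = all(
        circular_distance(hub_peak, t, period) <= window
        for t in partner_peaks
    )
    return "party" if close else "date"


# Hypothetical peak times in percent of the cell cycle.
print(classify_hub(95, [2, 98, 5]))   # partners cluster around the hub
print(classify_hub(50, [45, 90]))     # one partner peaks much later
```

The circular distance matters because a protein peaking at 95% of the cycle is close in time to one peaking at 2% of the next cycle.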
  • Transcription is linked to phosphorylation
    • Observation: 332 putative targets of the cyclin-dependent kinase Cdc28 have been determined experimentally (Ubersax et al.). We find that:
    • 6% of all yeast proteins are putative Cdk targets
    • 8% of the static proteins (white) are putative Cdk targets
    • 27% of the dynamic proteins (colored) are putative Cdk targets
    • Conclusion: This reveals a hitherto undescribed link between the levels of transcriptional and post-translational control of the cell cycle
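
The size of this effect follows directly from the percentages on the slide. The short calculation below only restates them as fold enrichments; it uses no data beyond the three quoted figures.

```python
# Percentage of proteins that are putative Cdk targets, as quoted on the slide.
percent_targets = {"all": 6, "static": 8, "dynamic": 27}

fold_vs_all = percent_targets["dynamic"] / percent_targets["all"]
fold_vs_static = percent_targets["dynamic"] / percent_targets["static"]

print(f"dynamic vs. all yeast proteins: {fold_vs_all:.2f}-fold")
print(f"dynamic vs. static proteins:    {fold_vs_static:.2f}-fold")
```

Dynamic proteins are thus several times more likely to be putative Cdk targets than yeast proteins in general.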
  • Conclusions
    • Genomic context methods are able to infer the function of many prokaryotic proteins from genome sequences alone
    • Integration of large-scale experimental data allows similar predictions to be made for eukaryotic proteins
    • Benchmarking is a prerequisite for data integration
    • It is possible to make highly accurate models of biological systems based on high-throughput data
    • Try STRING at
  • Acknowledgments
    • The STRING team
      • Christian von Mering
      • Berend Snel
      • Martijn Huynen
      • Daniel Jaeggi
      • Steffen Schmidt
      • Mathilde Foglierini
      • Peer Bork
    • New context methods
      • Jan Korbel
      • Christian von Mering
      • Peer Bork
    • ArrayProspector
      • Julien Lagarde
      • Chris Workman
    • NetView visualization tool
      • Sean Hooper
    • Analysis of yeast cell cycle
      • Ulrik de Lichtenberg
      • Thomas Skøt
      • Anders Fausbøll
      • Søren Brunak
  • Thank you!