STRING: Prediction of protein networks through integration of diverse large-scale data sets. Lars Juhl Jensen, EMBL Heidelberg
EMBL white seminar, European Molecular Biology Laboratory, Heidelberg, Germany, April 2, 2004

    1. STRING: Prediction of protein networks through integration of diverse large-scale data sets. Lars Juhl Jensen, EMBL Heidelberg
    2. The problem ...
    3. Prediction of protein function
        • Homology based methods
            • Simple sequence similarity searches (BLAST)
            • Profile searches (PSI-BLAST)
            • Databases of conserved domains (Pfam, SMART)
        • Non-homology based methods working on sequence
            • Prediction from sequence derived features (ProtFun)
            • Prediction from genomic context (STRING)
        • Prediction from high-throughput experimental data
            • Microarray gene expression data
            • Protein-protein interaction screens
            • ...
    4. Prediction of functional associations
        • “Protein mode”: separate network for each species
        • “COG mode”: one network covering all species
    5. STRING provides a protein network based on integration of diverse types of evidence
        • Genomic Neighborhood
        • Species Co-occurrence
        • Gene Fusions
        • Database Imports
        • Exp. Interaction Data
        • Co-expression
        • Literature Co-mentioning
    6. Score calibration against a common reference
        • Many diverse types of evidence
            • The quality of each is judged by very different raw scores
            • These are all calibrated against the same reference set
            • This is the key to obtaining a consistent scoring scheme
        • Requirements for a reference
            • Must represent a compromise of all the types of evidence
            • Broad species coverage
        • Our reference is KEGG maps
            • Two proteins are “related” if on a common KEGG map
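The calibration idea on this slide can be sketched as follows. This is a minimal illustration, assuming that raw scores are sorted into equally sized bins and that each bin is assigned the fraction of its protein pairs that lie on a common KEGG map. The function name `calibrate`, the binning strategy, and the data layout are assumptions made for this example, not the procedure used by STRING.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = FrozenSet[str]  # an unordered pair of two distinct protein names


def calibrate(raw_scores: Dict[Pair, float],
              kegg_maps: Dict[str, Set[str]],
              n_bins: int = 2) -> List[Tuple[float, float]]:
    """Return (lowest raw score in bin, fraction of pairs sharing a KEGG map).

    Hypothetical helper: bins are filled from the highest raw score down.
    """
    def share_map(pair: Pair) -> bool:
        a, b = tuple(pair)
        return any(a in members and b in members
                   for members in kegg_maps.values())

    # Only pairs in which both proteins appear in the reference can be judged.
    annotated = set().union(*kegg_maps.values())
    usable = [(score, pair) for pair, score in raw_scores.items()
              if pair <= annotated]
    usable.sort(key=lambda item: item[0], reverse=True)

    size = max(1, len(usable) // n_bins)  # the last bin may be smaller
    table = []
    for start in range(0, len(usable), size):
        chunk = usable[start:start + size]
        hits = sum(share_map(pair) for _, pair in chunk)
        table.append((chunk[-1][0], hits / len(chunk)))
    return table
```

Because every evidence type would be passed through the same kind of table, a raw score of any type could then be read as "fraction of pairs at this score level that share a KEGG map", which is what makes the different types comparable.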
    7. Integrating physical interaction screens
        • Make binary representation of complexes
        • Yeast two-hybrid data sets are inherently binary
        • Calculate score from number of (co-)occurrences
        • Calculate score from non-shared partners
        • Calibrate against KEGG maps
        • Infer associations in other species
        • Combine evidence from experiments
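One way to picture the "binary representation of complexes" step is sketched below. It assumes that every pair of proteins purified together is linked (a "matrix" style expansion) and uses a Jaccard-like ratio of co-occurrences to occurrences as the raw score. Both choices, and the name `complex_pair_scores`, are illustrative assumptions; the slide does not give the actual scoring function.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def complex_pair_scores(complexes: Iterable[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Expand purified complexes into protein pairs with an illustrative raw score."""
    occurrences: Counter = Counter()
    co_occurrences: Counter = Counter()
    for members in complexes:
        occurrences.update(members)
        # Assumed expansion: every pair of proteins seen together is linked.
        co_occurrences.update(combinations(sorted(members), 2))

    # Assumed raw score: co-occurrences over purifications containing either protein.
    return {
        (a, b): n_ab / (occurrences[a] + occurrences[b] - n_ab)
        for (a, b), n_ab in co_occurrences.items()
    }
```

A raw score like this would still need to be calibrated against the KEGG reference before it could be combined with other evidence.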
    8. Gene fusion: predicting physical interactions
        • Detect multiple proteins matching to one protein
        • Exclude overlapping alignments
        • Infer associations in other species
        • Calibrate against KEGG maps
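The first two steps on this slide can be sketched as a small filter over alignment hits. The sketch assumes each hit is given as a tuple of query protein, matched (composite) protein, and inclusive start and end coordinates on the composite. The tuple layout, the `max_overlap` parameter, and the name `fusion_pairs` are assumptions for illustration only.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# (query protein, composite protein, start, end), coordinates on the composite
Hit = Tuple[str, str, int, int]


def fusion_pairs(hits: List[Hit], max_overlap: int = 0) -> Set[Tuple[str, str]]:
    """Pairs of proteins that match the same protein in non-overlapping regions."""
    by_composite: Dict[str, List[Hit]] = {}
    for hit in hits:
        by_composite.setdefault(hit[1], []).append(hit)

    pairs: Set[Tuple[str, str]] = set()
    for group in by_composite.values():
        for (q1, _, s1, e1), (q2, _, s2, e2) in combinations(group, 2):
            if q1 == q2:
                continue
            # Number of shared residues; zero or negative means no overlap.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs.add(tuple(sorted((q1, q2))))
    return pairs
```

For example, two proteins matching residues 1-200 and 210-400 of the same third protein would be reported as a candidate fusion pair, while a protein matching residues 150-300 overlaps both and would be excluded.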
    9. Mining microarray expression databases
        • Re-normalize arrays by modern method to remove biases
        • Build expression matrix
        • Combine similar arrays by PCA
        • Construct predictor by Gaussian kernel density estimation
        • Calibrate against KEGG maps
        • Infer associations in other species
    10. Gene neighborhood: predicting co-expression
        • Identify runs of adjacent genes with the same direction
        • Score each gene pair based on intergenic distances
        • Calibrate against KEGG maps
        • Infer associations in other species
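The run detection and pair scoring for gene neighborhood might look roughly like the sketch below. It assumes a linear replicon, a hypothetical gap cutoff of 300 bp for breaking a run, and a raw score equal to the summed intergenic distance between the two genes. None of these specifics come from the slide; they only make the idea concrete.

```python
from typing import Dict, List, Tuple

# (name, start, end, strand) for genes on one linear replicon
Gene = Tuple[str, int, int, str]


def neighborhood_pairs(genes: List[Gene],
                       max_gap: int = 300) -> Dict[Tuple[str, str], int]:
    """Raw neighborhood score (summed intergenic distance) for genes in one run."""
    ordered = sorted(genes, key=lambda g: g[1])

    # A run is a stretch of adjacent genes on the same strand with small gaps.
    runs: List[List[Gene]] = []
    for gene in ordered:
        prev = runs[-1][-1] if runs else None
        if prev and prev[3] == gene[3] and gene[1] - prev[2] <= max_gap:
            runs[-1].append(gene)
        else:
            runs.append([gene])

    scores: Dict[Tuple[str, str], int] = {}
    for run in runs:
        for i in range(len(run)):
            total = 0
            for j in range(i + 1, len(run)):
                # Overlapping genes contribute a distance of zero.
                total += max(0, run[j][1] - run[j - 1][2])
                scores[(run[i][0], run[j][0])] = total
    return scores
```

Smaller summed distances would indicate tighter clustering, so a calibration step would be expected to map small raw values to high confidence.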
    11. Co-mentioning in the scientific literature
        • Associate abstracts with species
        • Identify gene names in title/abstract
        • Count (co-)occurrences of genes
        • Test significance of associations
        • Calibrate against KEGG maps
        • Infer associations in other species
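Counting co-occurrences and testing their significance can be illustrated with a hypergeometric tail probability, as in the sketch below. The slide does not say which significance test is used, so the choice of test, the input format (one set of recognized gene names per abstract), and the name `comention_pvalues` are assumptions.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Set, Tuple


def comention_pvalues(abstracts: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """P-value for each co-mentioned gene pair under a hypergeometric null."""
    n = len(abstracts)
    single: Dict[str, int] = {}
    joint: Dict[Tuple[str, str], int] = {}
    for genes in abstracts:
        for gene in genes:
            single[gene] = single.get(gene, 0) + 1
        for pair in combinations(sorted(genes), 2):
            joint[pair] = joint.get(pair, 0) + 1

    pvalues: Dict[Tuple[str, str], float] = {}
    for (a, b), k in joint.items():
        na, nb = single[a], single[b]
        # Probability of at least k shared abstracts if mentions were independent.
        tail = sum(comb(na, x) * comb(n - na, nb - x)
                   for x in range(k, min(na, nb) + 1))
        pvalues[(a, b)] = tail / comb(n, nb)
    return pvalues
```

With four abstracts of which two mention both GAL4 and GAL80 and nothing else mentions either, the pair gets a p-value of 1/6, which shows how little a handful of abstracts can support on its own.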
    12. Phylogenetic profile: co-mentioning in genomes
        • Align all proteins against all
        • Calculate best-hit profile
        • Join similar species by PCA
        • Calculate PC profile distances
        • Calibrate against KEGG maps
    13. COG based vs. similarity based transfer
        • Resolution of the mapping
            • COGs result in many-to-many
            • Sequence similarity should resolve with better detail
        • Our scoring scheme
            • Pairwise alignment scores are normalized by self-hit
            • These scores are transformed using $\exp(-k_1/x)$, where $k_1 = 0.7$
            • Missing values are estimated
            • Divide each score by the column and row sum
        • This gives a quantitative score for protein correspondence
        [Figure labels: Target species, Source species]
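The scoring scheme listed on this slide can be sketched in a few lines. Several details are assumptions, because the slide leaves them open: the self-hit used for normalization is taken to be that of the source protein, missing values are simply left out rather than estimated, and "divide by the column and row sum" is read as dividing each entry by the sum of its row and its column with the entry itself counted once. Only $k_1 = 0.7$ and the transform $\exp(-k_1/x)$ come from the slide.

```python
from math import exp
from typing import Dict, Tuple

K1 = 0.7  # value given on the slide


def correspondence(bitscores: Dict[Tuple[str, str], float],
                   self_hits: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Correspondence score for each (source protein, target protein) pair."""
    # 1. Normalize by the source self-hit (assumption) and apply exp(-k1/x).
    transformed: Dict[Tuple[str, str], float] = {}
    for (src, tgt), score in bitscores.items():
        x = score / self_hits[src]
        transformed[(src, tgt)] = exp(-K1 / x) if x > 0 else 0.0

    # 2. Row sums (per source protein) and column sums (per target protein).
    row: Dict[str, float] = {}
    col: Dict[str, float] = {}
    for (src, tgt), value in transformed.items():
        row[src] = row.get(src, 0.0) + value
        col[tgt] = col.get(tgt, 0.0) + value

    # 3. Assumed reading of "divide each score by the column and row sum".
    return {
        (src, tgt): value / (row[src] + col[tgt] - value)
        for (src, tgt), value in transformed.items() if value > 0
    }
```

Under this reading, an unambiguous one-to-one match scores 1.0, and a source protein with two equally good targets scores 0.5 for each, which is the kind of finer resolution the slide contrasts with many-to-many COG mappings.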
    14. Transfer and combination of evidence
        • Evidence scores are multiplied by “correspondence scores”
        • From each set of closely related species (a clade) only the best scoring evidence of each type is transferred
        • The best evidence from each clade is “added” and scaled: $\mathrm{score}_{\mathrm{transfer}} = k_3 \left( 1 - (1 - k_2 \cdot \mathrm{clade}_1)(1 - k_2 \cdot \mathrm{clade}_2) \cdots \right)$
        • In-species and transferred evidence is “added” and a total combined score calculated
        [Figure labels: ?, Source species, Target species]
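The transfer formula above translates directly into code for a single evidence type. The slide does not give values for $k_2$ and $k_3$, so they are left as required parameters here. The function names, the input layout, and the assumption that in-species and transferred evidence are "added" with the same $1 - \prod (1 - s)$ rule are illustrative, not confirmed by the slide.

```python
from typing import Dict, Iterable, Tuple


def transferred_score(evidence: Iterable[Tuple[str, float, float]],
                      k2: float, k3: float) -> float:
    """Combine transferred evidence of one type.

    evidence: (clade, evidence score, correspondence score) per source species.
    """
    best_per_clade: Dict[str, float] = {}
    for clade, score, corr in evidence:
        weighted = score * corr  # evidence multiplied by correspondence
        best_per_clade[clade] = max(best_per_clade.get(clade, 0.0), weighted)

    remaining = 1.0
    for best in best_per_clade.values():
        remaining *= 1.0 - k2 * best
    return k3 * (1.0 - remaining)


def combined_score(in_species: float, transferred: float) -> float:
    """Assumed rule for "adding" in-species and transferred evidence."""
    return 1.0 - (1.0 - in_species) * (1.0 - transferred)
```

For example, with placeholder values `k2 = 1.0` and `k3 = 1.0`, best clade scores of 0.6 and 0.5 give a transferred score of 0.8, and combining that with an in-species score of 0.5 gives 0.9.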
    15. Combining multiple types of evidence from several species
    16. The next step in data integration: predicting the type of interaction
    17. Information extraction from PubMed: extracting specific types of associations
        • Tokenization and multi word detection
        • Part-of-speech tagging
        • Semantic labeling
            • Gene names
            • Cue words for entity recognition
            • Verbs for relation extraction
        • Named entity chunking
            • A CASS grammar recognizes noun chunks relevant for gene transcription
            • [ nxgene The GAL4 gene ]
        • Relation chunking
            • Our CASS grammar also recognizes relations between entities:
            • [ nxexpr The expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ]
        • Output and visualization
            • TIGERSearch for inspection
            • Script for extracting a binary representation of the relations
            • Should later go into STRING
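The last step, extracting a binary representation from a relation chunk, can be illustrated with a toy parser for the bracket notation shown on the slide. The parser itself only relies on the notation; the extraction rule (an `nxpg` chunk at top level is the regulator, nested `nxpg` chunks are targets, and "and"/"or" are skipped) is an assumption that fits the example sentence and is not the grammar used in the actual pipeline.

```python
from typing import List, Tuple

Chunk = Tuple[str, int, List[str]]  # (label, nesting depth, words)


def parse_chunks(text: str) -> List[Chunk]:
    """Return every bracketed chunk with its label, depth, and words."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    chunks: List[Chunk] = []
    stack: List[int] = []  # indices of the chunks that are currently open
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "[":
            i += 1  # the token after "[" is the chunk label
            chunks.append((tokens[i], len(stack), []))
            stack.append(len(chunks) - 1)
        elif token == "]":
            stack.pop()
        else:
            for index in stack:  # a word belongs to every open chunk
                chunks[index][2].append(token)
        i += 1
    return chunks


def binary_relations(text: str) -> List[Tuple[str, str]]:
    """Toy rule (assumption): top-level nxpg regulates each nested nxpg gene."""
    conjunctions = {"and", "or"}
    regulators: List[str] = []
    targets: List[str] = []
    for label, depth, words in parse_chunks(text):
        if label != "nxpg":
            continue
        genes = [w for w in words if w not in conjunctions]
        (targets if depth > 0 else regulators).extend(genes)
    return [(r, t) for r in regulators for t in targets]
```

Applied to the example sentence, this yields two binary relations, (HAP1, CYC1) and (HAP1, CYC7), from a single relation chunk.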
    18. We extract from active, passive, and nominalized sentence constructs
        • [ nx_prom the ATR1 promoter region ] [ contain contains ] [ nx_uas_pt [ dt-a a] [ bs binding site ] [ for for] [ nx_activator the GCN4 activator protein ]]
        • [ nx_expr RNR1 expression ] [ bez is] [ repv reduced ] [ by by] [ nx_oprd CLN1 or CLN2 overexpression ]
        • [ dt-the the] [ binding binding ] [ of of] [ nx_prot GCN4 protein ] [ to to] [ nx_prom the SER1 promoter in vitro]
    19. A high confidence regulatory network
        • We manage to extract a satisfactory number of relations
            • 422 relation chunks
            • 597 binary relations
            • 441 unique binary relations
        • Activation/repression assigned for ~50% of relations
        • High accuracy: 83-90% on event extraction
        • “Arrows” generally point from known transcription factors to other genes
    20. More STRING to come
        • Adding more large scale data sets and more species
        • New types of genomic context evidence
            • White seminar by Jan Korbel in May
        • Assign specific interaction types to functional associations
            • Expand text mining to cover more interaction types
            • Predict interaction types from evidence types
        • Interpreting the network
            • Discover functional modules/pathways
            • Network topology and network motifs
            • White seminar by Christian von Mering in June
    21. Acknowledgments
        • The STRING team
            • Christian von Mering
            • Berend Snel
            • Martijn Huynen
            • Daniel Jaeggi
            • Steffen Schmidt
            • Mathilde Foglierini
            • Peer Bork
        • ArrayProspector web service
            • Julien Lagarde
            • Chris Workman
        • NetView visualization tool
            • Sean Hooper
        • Text mining together with EML
            • Jasmin Saric
            • Rossitza Ouzounova
            • Isabel Rojas
        • All my other “partners in crime” on various projects
            • The Steinmetz Group
            • The Furlong Group
        • Web resources
            • string.embl.de
            • www.bork.embl.de/ArrayProspector
            • www.bork.embl.de/synonyms
    22. Thank you!