Prediction of protein function

BioSys course, Technical University of Denmark, Lyngby, Denmark, September 11-12, 2006

  • Transcript

    • 1. There and Back Again: Constructing and interpreting networks of functional associations (Lars Juhl Jensen, EMBL Heidelberg)
    • 2. The STRING web service
      • Relies on genomic context analysis of 110 species with a total of 440,000 genes
      • Most of these are prokaryotes (8 eukaryotic genomes)
      • Contains orthologous group assignment for 80% of the genes (50% for eukaryotes)
      • Of those, 70% have links with >75% accuracy
      STRING is accessible at: http:// STRING
    • 3. Several types of genomic context evidence
      • Phylogenetic profile
      • Conserved neighborhood
      • Gene fusion
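A phylogenetic profile records in which genomes a gene (or orthologous group) is present; genes with similar profiles are candidates for a functional link. The following is a minimal sketch of that idea only. The Jaccard measure, the function name `profile_similarity`, and the example profiles are illustrative assumptions, not the scoring scheme used by STRING.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles.

    Each profile is a list of 0/1 flags, one per genome, in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over six genomes (1 = ortholog present).
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [0, 1, 1, 0, 1, 0],
}

same = profile_similarity(profiles["geneA"], profiles["geneB"])
different = profile_similarity(profiles["geneA"], profiles["geneC"])
```

Here `geneA` and `geneB` co-occur in exactly the same genomes, while `geneA` and `geneC` share only one genome, so the first pair scores much higher.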
    • 4. Combining the methods improves performance
      [Figure: accuracy (0.5 to 1.0) plotted against coverage (number of predicted links between orthologous groups, 10 to 100000) for Fusion (norm.), Fusion (abs.), Gene Order (norm.), Gene Order (abs.), Cooccurrence, and Integrated]
    • 5. Why include high-throughput data in STRING?
      • What high-throughput data can do for STRING
        • Genomic context methods tend to work less well for eukaryotes than for prokaryotes (in part due to fewer sequenced genomes)
        • More evidence is always better
        • The type of “functional association” can sometimes be predicted
      • What STRING can do for high-throughput data
        • Facilitates easy cross-species analysis based on orthologous groups
        • Genomic context can provide support for particular interactions in data sets with high error rates
        • Web interface for navigating networks of associations with multiple types of evidence
    • 6. High-throughput experiments (1/2)
      • Identification of protein-protein interactions
        • Several yeast two-hybrid screens have been conducted, two of which systematically cover most yeast genes (Uetz et al. and Ito et al.)
        • Protein complexes identified via two different MS-based approaches (Gavin et al. and Ho et al.)
        • Interactions reported in databases such as Database of Interacting Proteins (DIP) and BIND
      • Mapping of transcription factor (TF) binding sites
        • The most comprehensive set maps TF binding sites in the yeast genome for more than 200 different transcription factors (Lee et al. )
        • Other smaller data sets also exist
    • 7. High-throughput experiments (2/2)
      • Spotted microarray expression data
        • Compares the expression levels of all genes between two samples
        • Large amounts of microarray data on several eukaryotes are available from the Stanford Microarray Database (SMD) repository
        • We also have access to in-house spotted array data through collaborations with several groups both within and outside EMBL
      • GeneChip expression data
        • Measures the expression level of all genes in one sample
        • More standardized than spotted array data
        • No central repositories with large quantities of data
    • 8. Microarrays 101
      • The level of expression in two samples can be compared for all genes simultaneously
      • Each spot contains either cDNA or short probes specific to one gene
      • The amount of labeled mRNA from a sample that hybridizes to each spot is measured as a fluorescence intensity
      • Spotted microarrays are quite cheap compared to GeneChips
    • 9. The need for normalization of microarray data
      • Unfortunately microarrays are prone to many different types of error:
        • Random noise
        • Non-linear dye-specific biases
        • Spatial biases and pin effects
        • Overall intensity differences between arrays
      • Different approaches are needed for dealing with this
        • Averaging over replicates will lower the random noise
        • Dye-specific biases can to some extent be cancelled by averaging over dye swaps
        • Dye effects, spatial biases, and overall differences between arrays can all be compensated for by computational means
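Why averaging over dye swaps helps can be seen in a small worked sketch. It assumes the simplest possible case, a constant multiplicative bias in one channel; real dye biases are non-linear, which is why the cancellation is only partial. The function names and the intensities are hypothetical.

```python
import math


def log_ratio(red, green):
    """log2 ratio of red to green intensity for one spot."""
    return math.log2(red / green)


def dye_swap_average(forward, swapped):
    """Combine a dye-swap pair of (red, green) intensities for one gene.

    forward: sample labeled red, reference labeled green.
    swapped: sample labeled green, reference labeled red.
    A constant dye bias has the same sign in both log ratios, while the
    biological signal flips sign, so half the difference keeps the signal.
    """
    m_forward = log_ratio(*forward)  # signal + dye bias
    m_swapped = log_ratio(*swapped)  # -signal + dye bias
    return (m_forward - m_swapped) / 2


# Assumed example: true sample/reference ratio of 4 (log2 = 2), and a red
# channel that reads twice as bright as it should (bias of +1 in log2).
forward = (8.0, 1.0)  # red = 4 * 2, green = 1
swapped = (2.0, 4.0)  # red = 1 * 2, green = 4
estimate = dye_swap_average(forward, swapped)
```

Taken alone, the forward array would report a log2 ratio of 3; the dye-swap average recovers the assumed true value of 2.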
    • 10. Non-linear normalization of intensities and correction for spatial effects
      [Figure panels: Downloaded SMD data; After intensity normalization; Spatial bias estimate; After spatial normalization]
    • 11. Combining arrays from multiple experiments into one gene expression matrix
      • For each species, all arrays in SMD are merged
        • To ensure comparable data, all arrays were re-normalized
        • A matrix is constructed with each row being a gene and each column an array
      • This integration is complicated by the lack of consistency in the choice of gene identifiers even within SMD
        • To deal with this, a very large list of synonymous gene names and identifiers was compiled based on SGD, WormBase, FlyBase, SWISS-PROT, and UniGene
        • Such a list is also very useful for integrating the expression data with protein-protein interaction data, STRING, and text-mining of Medline abstracts
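The merging step can be sketched as follows: every identifier is first mapped to one canonical name through the synonym list, and only then are the arrays assembled into a gene-by-array matrix. This is a minimal illustration under assumed data structures (dictionaries of log ratios, a flat alias table); the function name `build_expression_matrix` and the example values are hypothetical.

```python
def build_expression_matrix(arrays, synonyms):
    """Merge arrays into a matrix with one row per gene, one column per array.

    arrays:   list of dicts mapping an identifier to a log ratio.
    synonyms: dict mapping every known alias to a canonical identifier.
    Identifiers missing from the synonym list are skipped; genes absent
    from an array get None in that column.
    """
    genes = sorted({synonyms[ident]
                    for array in arrays
                    for ident in array
                    if ident in synonyms})
    matrix = []
    for gene in genes:
        row = []
        for array in arrays:
            values = [value for ident, value in array.items()
                      if synonyms.get(ident) == gene]
            # Average when several aliases of the same gene appear on one array.
            row.append(sum(values) / len(values) if values else None)
        matrix.append(row)
    return genes, matrix


# Illustrative alias table and arrays (values are made up).
synonyms = {"YAL001C": "YAL001C", "TFC3": "YAL001C",
            "YBR160W": "YBR160W", "CDC28": "YBR160W"}
arrays = [
    {"TFC3": 0.5, "CDC28": -1.0},
    {"YAL001C": 0.7, "UNKNOWN1": 9.9},
]
genes, matrix = build_expression_matrix(arrays, synonyms)
```

The two arrays name the same gene differently (`TFC3` and `YAL001C`), yet both values end up in the same row.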
    • 12. “And now we cluster correlated expression profiles ... no, wait a second!”
      • Traditional clustering of genes with correlated expression profiles is not well suited for inferring functional links
      • No appropriate distance measure
        • Not all arrays are of the same quality
        • Not all experiments will be equally useful for inferring function
        • The arrays are not all from mutually independent experiments
      • Multi-functional proteins
        • If (A,B) are in the same cluster and (A,C) also cluster together, (B,C) will by definition be in the same cluster
        • Functional relations for the pairs (A,B) and (A,C) do not necessarily imply a functional relation for (B,C) if A has two or more functions
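The multi-functional protein problem can be made concrete with a toy example. The sketch below groups genes by connected components, used here only as a stand-in for any clustering that assigns each gene to exactly one cluster; the function name and the gene labels are hypothetical.

```python
def connected_components(nodes, links):
    """Group nodes into clusters so that linked nodes share a cluster."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the cluster representative.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(cluster) for cluster in clusters.values())


# A has two functions: one shared with B, the other shared with C.
links = [("A", "B"), ("A", "C")]
clusters = connected_components(["A", "B", "C"], links)
```

B and C end up in the same cluster although no (B,C) link was ever observed, whereas a list of pairwise links keeps the two relations of A separate.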
    • 13. Singular value decomposition – letting the data speak for themselves
      • Singular value decomposition is run on the gene expression matrix
        • Defines an ordered set of non-correlated basis vectors
        • Each singular vector is a linear combination of arrays
      • The first singular vectors effectively average over related arrays
        • Finds replicate arrays including dye-swaps
        • Adjacent arrays in time series and related experiments are combined
      • The last vectors mainly contain noise, e.g. replicate differences
      [Figure: singular vectors 1 to 8; labels: RNA stability, Starvation, Heat-shock, Salt treatment, Polysomes, Sporulation]
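To illustrate how the first singular vector averages over related arrays, here is a minimal standard-library sketch that finds only the leading singular vector by power iteration. In practice a full decomposition from a numerical library (for example `numpy.linalg.svd`) would be used; the function name and the toy matrix below are assumptions for illustration.

```python
import math


def leading_singular_vector(matrix, iterations=200):
    """Leading singular value and vector of a gene-by-array matrix.

    Runs power iteration on M^T M. Returns (sigma, v, projections), where
    v holds one weight per array (a linear combination of arrays) and
    projections holds one value per gene.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = M v (per gene), then w = M^T u (per array).
        u = [sum(x * y for x, y in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    projections = [sum(x * y for x, y in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(p * p for p in projections))
    return sigma, v, projections


# Toy matrix: 3 genes x 3 arrays. Arrays 0 and 1 are replicates,
# array 2 contains almost only noise.
matrix = [
    [2.0, 2.0, 0.1],
    [-1.0, -1.0, 0.0],
    [1.0, 1.0, -0.1],
]
sigma, v, projections = leading_singular_vector(matrix)
```

The two replicate arrays receive equal, large weights in `v` and the noisy array a weight near zero, so projecting a gene onto `v` effectively averages its replicate measurements.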
    • 14. Inferring functional links from projections of genes onto singular vectors
      • Analyze each singular vector
        • Do 1D density estimation of expression ratio projections for genes of known function
        • 2D density estimation for pairs of functionally related genes
        • Use Bayes’ law for estimating log-odds of functional link given a pair of projections
      • Different types of regulation
        • Up- vs. down-regulation
        • Anti-correlated expression
      • The log-odds from the first N singular vectors are summed
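The scoring scheme above can be sketched with histograms standing in for the density estimates. For each singular vector, a 2D histogram over functionally linked pairs is compared with the product of 1D histograms over single genes, and the resulting log ratios are summed over vectors. The binning, the pseudocounts, the omission of a prior term, and all names and numbers are assumptions made for this illustration; the slides do not specify the estimator.

```python
import math
from collections import Counter


def bin_index(value, edges):
    """Index of the histogram bin that contains value."""
    return sum(1 for edge in edges if value >= edge)


def fit_vector_model(projections, linked_pairs, edges, pseudocount=1.0):
    """Return a log-odds function for one singular vector.

    projections:  dict gene -> projection onto this singular vector.
    linked_pairs: list of (gene, gene) pairs known to be functionally related.
    """
    n_bins = len(edges) + 1
    single = Counter(bin_index(p, edges) for p in projections.values())
    pair = Counter()
    for a, b in linked_pairs:
        i = bin_index(projections[a], edges)
        j = bin_index(projections[b], edges)
        pair[(i, j)] += 1
        pair[(j, i)] += 1  # pairs are unordered
    n_single = sum(single.values()) + pseudocount * n_bins
    n_pair = sum(pair.values()) + pseudocount * n_bins * n_bins

    def log_odds(x, y):
        i, j = bin_index(x, edges), bin_index(y, edges)
        p_linked = (pair[(i, j)] + pseudocount) / n_pair
        p_background = ((single[i] + pseudocount) / n_single
                        * (single[j] + pseudocount) / n_single)
        return math.log(p_linked / p_background)

    return log_odds


def summed_log_odds(models, projections_a, projections_b):
    """Sum the per-vector log-odds over the first N singular vectors."""
    return sum(model(x, y)
               for model, x, y in zip(models, projections_a, projections_b))


# Hypothetical training data for a single singular vector.
projections = {"g1": 1.2, "g2": 1.0, "g3": -0.9,
               "g4": -1.1, "g5": 0.1, "g6": -0.1}
linked_pairs = [("g1", "g2"), ("g3", "g4")]
model = fit_vector_model(projections, linked_pairs, edges=[-0.5, 0.5])
co_regulated = model(1.1, 0.9)
mismatched = model(1.1, -1.0)
```

Because the 2D histogram is estimated from the data, pairs that are both up-regulated, both down-regulated, or anti-correlated can each score positively if linked training pairs behave that way.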
    • 15. Proteins linked to the human mitotic checkpoint protein BUB1
      Identifier | Description | Comments
      Q8WVP0 | Kinesin-like 5 | Mitotic kinesin-like protein 1
      CDN3_HUMAN | Cyclin-dependent kinase inhibitor 3 |
      HMG2_HUMAN | High mobility group protein 2 |
      FXM1_HUMAN | Forkhead box protein M1 | Phosphorylated in M-phase
      BASP_HUMAN | Brain acid soluble protein 1 | Associated with "growth cones"
      MYBB_HUMAN | Myb-related protein B | Phosphorylated by Cdk2 during S-phase
      O14731 | Membrane-associated kinase | Cell cycle regulated kinase, inhibits Cdc2
      TP2A_HUMAN | DNA topoisomerase II |
      MPI1_HUMAN | M-phase inducer phosphatase 1 |
      PMC1_HUMAN | Polymyositis/scleroderma autoantigen 1 |
      Q8N324 | Hypothetical protein | Contains a PRY and a SPRY domain
      CGA2_HUMAN | Cyclin A2 |
      Q9NZJ0 | L2DTL protein | Contains six WD40 repeats
      Q15003 | HCAP-H protein |
      KF14_HUMAN | Kinesin-like protein KIF14 |
      Q96SE4 | Kinesin-like protein 2 |
      KNS2_HUMAN | Kinesin-like protein 2 |
      WEE1_HUMAN | Wee1-like protein kinase | May act as a negative regulator of entry into mitosis
      CHK1_HUMAN | Serine/threonine-protein kinase Chk1 | Involved in cell cycle arrest
      NEK2_HUMAN | Serine/threonine-protein kinase NEK2 | Involved in mitotic regulation
      CKS1_HUMAN | Cyclin-dependent kinases regulatory subunit |
      O14980 | CRM1 protein | Cell cycle-dependent expression
    • 16. Scoring conserved expression links between orthologous groups of proteins
      • The highest scoring protein pair that links the two orthologous groups is found for each species
        • When genes are duplicated, the expression of some copies may change
      • These scores are summed over all species having such links
        • This assumes that all species in the analysis are sufficiently distant that gene expression patterns are not conserved by chance
      [Figure: two orthologous groups (Orthologous group 1, Orthologous group 2) with members in S. cerevisiae and C. elegans]
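The scoring rule for a pair of orthologous groups (best protein pair per species, summed over species) can be sketched directly. The data layout, the function name `conserved_link_score`, and the gene names and scores are hypothetical.

```python
def conserved_link_score(group1, group2, pair_scores):
    """Score the link between two orthologous groups.

    group1, group2: dict species -> list of member genes in that species.
    pair_scores:    dict species -> {frozenset({gene_a, gene_b}): score}.
    For each species the highest scoring pair linking the two groups is
    taken, which tolerates duplicated genes whose expression has diverged;
    these per-species maxima are then summed.
    """
    total = 0.0
    for species, scores in pair_scores.items():
        candidates = [scores[frozenset((a, b))]
                      for a in group1.get(species, [])
                      for b in group2.get(species, [])
                      if frozenset((a, b)) in scores]
        if candidates:
            total += max(candidates)
    return total


# Hypothetical groups: group 1 is duplicated in yeast, group 2 in worm.
group1 = {"S. cerevisiae": ["y1a", "y1b"], "C. elegans": ["w1"]}
group2 = {"S. cerevisiae": ["y2"], "C. elegans": ["w2a", "w2b"]}
pair_scores = {
    "S. cerevisiae": {frozenset(("y1a", "y2")): 1.5,
                      frozenset(("y1b", "y2")): 0.2},
    "C. elegans": {frozenset(("w1", "w2b")): 2.0},
}
score = conserved_link_score(group1, group2, pair_scores)
```

In yeast only the better of the two paralog pairs (1.5) counts, and the worm link (2.0) is added on top, which is only justified under the stated assumption that the species are distant enough for the evidence to be independent.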
    • 17. A network of meiosis related genes conserved between S. cerevisiae and C. elegans
      • Many relations are well known
        • MYO3 encodes myosin I, which is important for chromosome segregation during cell division
        • SPO14 and GLC7 are both phosphatases involved in meiosis and sporulation
        • CDC39 plays important regulatory roles in the mitotic and the meiotic cell cycle
        • Physical interaction between BCK1 and MYO3 is suggested in the BIND database
      • The role of nuclear pores in meiosis is not quite clear, but they appear to be involved
      [Figure: network of MYO3, GLC7, SPO14, BCK1, STE11, NUP49, NUP100, NUP116, NUP145, and CDC39]
    • 18. Now we have a network – time to take it apart
    • 19. Unsupervised discovery of functional modules
      • Cellular metabolism (source: Molecular Biology of the Cell, 3rd edition)
      • Defined manually: metabolic pathways (purine biosynthesis, histidine biosynthesis)
      • Defined objectively: clustering of genome-context interactions (standard algorithms)
    • 20. “Biologists would rather share their toothbrush than share a gene name”
      • Lists of synonymous identifiers and names were compiled from
        • SGD, WormBase, and FlyBase
        • BLAST search against UniGene
      • Several types of identifiers
        • Various database identifiers and accession numbers
        • Gene symbols and gene names
      • Lack of standardization
        • 8+ identifiers per yeast gene
        • Many names refer to unrelated genes in different species
      The synonyms and orthologs lists can be downloaded from:
    • 21. Retraining TreeTagger for Medline abstracts
      • The English parameter file distributed with TreeTagger was trained on the UPenn Treebank
      • We retrained TreeTagger on the manually annotated GENIA 3.0 corpus (466,179 tokens) adding gene names to the dictionary
      • Performance of the two taggers was evaluated on 55,166 tokens not used during training
      • Retraining eliminated more than half of all tagging errors
    • 22. Tagging is really easy ... compared to extracting the information you are after
      • Many ways to write the same thing
        • A activates the transcription of B
        • B transcription is induced by A
        • A is a transcriptional activator of B
        • Overexpression of A increases B mRNA levels
        • Transcription is enhanced when A binds to the B promoter
        • The B promoter contains an A UAS
      • Multiple pieces of information and negations in a sentence
        • A is a transcriptional activator of B , C , D , E , and F
        • B was not suppressed by A
        • The A transcription factor affects B but not C
        • C phosphorylation of A leads to increased expression of B
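The variation problem can be illustrated with a few surface patterns. This is a deliberately naive regular-expression sketch, not the chunking and grammar-based approach described in the talk; the patterns, the crude gene-name expression, and the example sentences are assumptions. It covers three of the phrasings listed above and shows how quickly such rules would multiply.

```python
import re

# Crude stand-in for a gene name: a capitalized alphanumeric token.
GENE = r"(?P<{}>[A-Z][A-Za-z0-9]*)"

PATTERNS = [
    # "A activates the transcription of B"
    re.compile(GENE.format("regulator")
               + r" activates the transcription of "
               + GENE.format("target")),
    # "B transcription is induced by A"
    re.compile(GENE.format("target")
               + r" transcription is induced by "
               + GENE.format("regulator")),
    # "A is a transcriptional activator of B"
    re.compile(GENE.format("regulator")
               + r" is a transcriptional activator of "
               + GENE.format("target")),
]


def extract_activation(sentence):
    """Return (regulator, target) if a known pattern matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("regulator"), match.group("target")
    return None


sentences = [
    "GAL4 activates the transcription of GAL1",
    "GAL1 transcription is induced by GAL4",
    "GAL4 is a transcriptional activator of GAL1",
    "GAL1 was not suppressed by GAL4",
]
results = [extract_activation(sentence) for sentence in sentences]
```

The first three sentences yield the same (regulator, target) pair; the negated sentence yields nothing, and handling it, coordination ("B, C, D, E, and F"), or nested events would need syntactic analysis rather than more patterns.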
    • 23. “Biologists tend to ask simple questions: Here’s a frog ... is he happy?”
      • It is not always clear what a sentence means
        • Many biological terms/concepts are poorly defined
        • Words are often coined before a subject is understood
        • Ambiguous use of terms makes text mining more difficult
      • The complexity of biological systems makes it hard to design simple experiments that lead to clear answers
        • “Protein A regulates the expression of gene B”
          • Does this mean that protein A is a transcription factor?
          • Or are more indirect regulatory mechanisms allowed?
        • “Protein A is a transcriptional activator of B”
          • Can A activate transcription alone?
          • Or only together with certain other proteins?
    • 24. A mini-ontology of transcription regulation
      • Entities (boxes)
        • generic (gray)
        • regulator (yellow)
        • activator (red)
        • repressor (green)
        • target (blue)
      • Relations (arrows)
        • is-a (black)
        • part-of (blue)
      • Events (arrows)
        • creates (green)
        • binds (red)
    • 25. Parsing abstracts to identify relationships between genes/proteins
      • Sentence and word boundaries are identified using Tokenizer
      • Our retrained TreeTagger is used for tagging part-of-speech
      • Abstracts are chunked with a custom CASS grammar to identify noun and verb chunks
      • Noun chunks are categorized according to a mini-ontology
      • Lexico-syntactic patterns are used to identify event chunks
      • SRN1 NNPG NXPGSG EVSUPVA can MD | suppress SUPV | rna2 NNPG NXPGPL | rna3 NNPG | | rna4 NNPG | | rna5 NNPG | | rna6 NNPG | | and CC | | rna8 NNPG | | singly RB or CC in IN pairs NNS
    • 26. Using text mining of Medline abstracts to support predicted regulatory interactions
      • By applying the scheme just described to all Medline abstracts, a set of regulatory interactions in multiple species is obtained
      • We will use it to classify protein associations derived from
        • Microarray gene expression
        • Chromatin IP data
        • Physical protein interaction screens (e.g. Y2H and TAP)
        • Cross-species analysis of genomic context (STRING)
      • To integrate all of these different data sources, the list of synonymous gene names and identifiers is again needed, as different data sets use different identifiers
    • 27. Acknowledgments
      • The STRING team
        • Christian von Mering
        • Berend Snel
        • Martijn Huynen
        • Daniel Jaeggi
        • Steffen Schmidt
        • Peer Bork
      • Microarray normalization
        • Chris Workman
      • PROPHECIES web service
        • Julien Lagarde
      • The text mining people at EML
        • Jasmin Saric
        • Isabel Rojas
      • Web resources
        • STRING
    • 28. Questions?