Network Biology Lent 2010 - lecture 1


Lent 2010 MPhil course 'Network biology' - lecture 2


  1. 1. Exploratory analysis of phenotyping screens: enrichment, clustering, ranking (Network Biology - lecture 2)
  2. 2. High-dimensional phenotypes by microscopy or molecular profiling; low-dimensional phenotypes [Figure: axes Time and Size]
  3. 3. A challenge for computation and statistics
  4. 4. Today’s lecture
       • Enrichment analysis
           • Hyper-geometric test and GO over-representation
           • Gene Set Enrichment Analysis
       • Mapping phenotypes to network
           • Finding rich subnetworks
           • Finding phenotypically correlated subnetworks
       • Clustering
           • Distances
           • Hierarchical clustering
       • Ranking
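  5. 5. High-throughput phenotyping: phenotypes of 100s to 10,000s of perturbed genes [Figure: genes ordered by phenotype from weak to strong, with the hits marked]
  6. 6. Gene Ontology (GO): a GO term with a gene set annotated to it
  7. 7. GO over-representation: hyper-geometric test [Figure: genes ordered by phenotype from weak to strong, with hits and GO term members marked; sets shown are all genes, all hits, genes in GO term, and hits in GO term]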
  8. 8. Hyper-geometric distribution
       N genes altogether, n hits, m genes in GO term, k hits in GO term.
       Probability to observe k hits in GO term:
         P(X = k) = choose(m, k) * choose(N-m, n-k) / choose(N, n)
       • choose(N, n): number of possibilities to choose n hits out of N genes
       • choose(m, k): k hits fall into the GO term of size m
       • choose(N-m, n-k): the other n-k hits are all genes outside the GO term
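  9. 9. Hyper-geometric test: example
       pvalue = phyper(k - 1, m, N - m, n, lower.tail = FALSE)
       The p-value is the probability of k or more hits in the GO term. In R, phyper(q, ..., lower.tail = FALSE) returns P(X > q), so the first argument is k - 1.
       Hold these fixed: N = 10 000, m = 200
       See what happens if we vary: k = 1, 2, …, 30 and n = 50, 100, 200, 300, 400
       [Figure: p-value (log10) against the number k of hits in GO term, one curve for each n = 50, 100, 200, 300, 400]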
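The example on this slide can be reproduced without R. The following is a minimal sketch in Python that sums the hyper-geometric probabilities for k or more hits; the function name `hypergeom_pvalue` is hypothetical, and the values of k printed are only a small sample of the range used on the slide.

```python
from math import comb

def hypergeom_pvalue(k, m, N, n):
    """P(X >= k): probability of k or more hits in a GO term of size m
    when n hits are drawn without replacement from N genes."""
    total = comb(N, n)  # ways to choose n hits out of N genes
    # comb(m, x): x hits inside the GO term; comb(N - m, n - x): the rest outside
    tail = sum(comb(m, x) * comb(N - m, n - x) for x in range(k, min(m, n) + 1))
    return tail / total

# Fixed values from the slide; k is sampled at a few points only.
N, m = 10_000, 200
for n in (50, 100, 200, 300, 400):
    pvals = [hypergeom_pvalue(k, m, N, n) for k in (1, 5, 10)]
    print(n, ["%.3g" % p for p in pvals])
```

For a fixed k, the p-value grows with n: the more hits there are overall, the less surprising it is to find k of them in the GO term.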
  10. 10. Gene Set Enrichment Analysis
        Not restricted to GO, could be any collection of gene sets, e.g. MSigDB.
        Subramanian et al. (2005)
        [Figure: two rankings of genes by phenotype from weak to strong with GO term members marked, one with no significant trend and one with a significant trend]
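  11. 12. GSEA: construction
        Genes are ordered by phenotype from weak to strong. At each position i in this ordering, two fractions are compared:
          (Nr. hits before i) / (Nr. all hits), if p = 0
          (Nr. non-hits before i) / (Nr. all non-hits)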
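The two fractions above can be turned into a running-sum statistic. The sketch below assumes the unweighted case (p = 0) and follows the construction in Subramanian et al. (2005), where the enrichment score is the largest deviation of the hit fraction from the non-hit fraction; that combination step is not spelled out on the slide. The function name `enrichment_score` and the toy gene names are hypothetical, and positions are counted up to and including i.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted (p = 0) running-sum enrichment score.

    ranked_genes: genes ordered by phenotype.
    gene_set: the set whose members count as hits.
    Returns the deviation (hit fraction - non-hit fraction) with the
    largest absolute value along the ranking.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one non-hit")

    seen_hits = seen_miss = 0
    best = 0.0
    for is_hit in hits:
        if is_hit:
            seen_hits += 1
        else:
            seen_miss += 1
        deviation = seen_hits / n_hits - seen_miss / n_miss
        if abs(deviation) > abs(best):
            best = deviation
    return best

ranking = ["g%d" % i for i in range(1, 11)]
print(enrichment_score(ranking, {"g1", "g2", "g3"}))   # set concentrated at one end
print(enrichment_score(ranking, {"g1", "g5", "g10"}))  # set scattered along the ranking
```

A set concentrated at one end of the ranking gives a score of large magnitude; a scattered set stays closer to zero. Assessing significance needs a null distribution, which GSEA obtains by permutation and which this sketch does not cover.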
  12. 13. GSEA: examples
  13. 14. PROs and CONs
        Result: p-values (hyper-geometric or GSEA), one per gene set
          Gene set 1, e.g. Wnt pathway
          Gene set 2, e.g. DNA repair
          Gene set 3, e.g. Chromosome 1
          Gene set 4, e.g. Expressed in liver
          …
        Advantages:
          • standard analysis
          • comprehensive first overview
          • “unbiased” and “hypothesis-free”
        Disadvantages:
          • “unbiased” and “hypothesis-free”
          • relies on known gene sets
          • cannot uncover new pathways
          • pathway = “unconnected” gene set
          • soon: more gene sets than genes!
  14. 15. Map phenotypes to network: where do the hits fall in the network and what are they connected to?
  15. 16. Sub-networks rich in hits
  16. 17. Sub-networks with highly correlated phenotypes
  17. 18. Predicting phenotypes
          • Guilt by association
          • Use known phenotypes in the network
          • Use edge weights if possible
          • Success depends on quality and coverage of linkage in the network
  18. 19. Which networks?
          • Networks from large-scale experiments
          • Networks from analyzing the experimental literature
          • Networks from probabilistic data integration
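One simple way to read "guilt by association" with edge weights is a weighted vote over network neighbours with known phenotypes. The sketch below is an illustration under that assumption, not the method used in the lecture; the function name `predict_phenotype`, the edge dictionary layout and the toy gene names are all hypothetical.

```python
def predict_phenotype(gene, edges, known):
    """Predict a phenotype score for `gene` as the edge-weighted mean of
    the known phenotype scores of its direct neighbours.

    edges: dict mapping (gene_a, gene_b) -> edge weight (undirected).
    known: dict mapping gene -> known phenotype score.
    Returns None if no neighbour has a known phenotype.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (a, b), weight in edges.items():
        if gene not in (a, b):
            continue
        neighbour = b if a == gene else a
        if neighbour in known:
            weighted_sum += weight * known[neighbour]
            weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None

# Toy network: X is strongly linked to a hit (A) and weakly to a non-hit (B).
edges = {("A", "X"): 0.9, ("B", "X"): 0.1, ("C", "D"): 0.5}
known = {"A": 1.0, "B": 0.0, "C": 1.0}
print(predict_phenotype("X", edges, known))
print(predict_phenotype("Z", edges, known))  # no neighbours: no prediction
```

The second call shows the coverage limitation named on the slide: a gene without links to genes of known phenotype cannot be predicted at all.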
  19. 20. Cluster first, think later
        Data: phenotypic profiles of perturbed genes
        Features = expression profiles, parameters of cell shape, protein concentration or localization
        Genes with similar phenotypic profiles often have similar molecular functions or act in the same pathways.
        Principle for function prediction: guilt by association
  20. 21. From data to distances
        M: data matrix of phenotypic profiles, one row per perturbed gene
        D: distance or dissimilarity matrix, with
          D[i,j] = dist( M[i,] , M[j,] )
          D[j,i] = D[i,j]
          D[i,i] = 0 for all i
        What distance measure dist(. , .) should we use?
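The step from M to D can be sketched as follows. The code assumes M is a list of rows, one phenotypic profile per perturbed gene, and uses Euclidean distance only as a placeholder for dist(. , .); the names `distance_matrix` and `euclidean` are hypothetical, and the profiles are toy values.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(M, dist=euclidean):
    """Build D with D[i][j] = dist(M[i], M[j]).

    Only the upper triangle is computed; symmetry fills in the rest,
    and the diagonal stays at zero.
    """
    n = len(M)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(M[i], M[j])
    return D

M = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # three genes, two phenotypes each
for row in distance_matrix(M):
    print(row)
```

Any other dissimilarity can be passed as `dist`, as long as it is symmetric and zero for identical profiles.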
  21. 22. Examples of distances
        Euclidean distance: dist(a,b) = sqrt( sum_i (a_i - b_i)^2 )
        Manhattan distance: dist(a,b) = sum_i |a_i - b_i|
        Cosine distance:    dist(a,b) = 1 - cos(φ), where φ is the angle between a and b (how is this related to correlation?)
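The three distances can be written down directly, and doing so also answers the question about correlation: the Pearson correlation of two profiles is the cosine of the angle between them after each has been centred on its own mean. The sketch below uses the standard definitions, which is an assumption about what the slide showed; function names and data are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation = cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return 1.0 - cosine_distance(centred_a, centred_b)

a, b, c = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
print(pearson(a, b), pearson(a, c))
```

Here b is a scaled copy of a, so their Euclidean and Manhattan distances are large while the cosine distance is essentially zero: cosine and correlation-based distances compare the shape of a profile, not its magnitude.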
  22. 23. Linkage: distances to clusters
        D holds the distances between individual genes, e.g. D[3,4]. But what is the distance to a cluster, D[ (2,3) ,4] = ???
        dist(C1, C2) = max  { dist( i , j ) : i in C1, j in C2 }   complete linkage
                     = min  { dist( i , j ) : i in C1, j in C2 }   single linkage
                     = mean { dist( i , j ) : i in C1, j in C2 }   average linkage
        [Figure: genes 1 to 6 plotted by Phenotype 1 and Phenotype 2]
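The three linkage rules differ only in how they summarise the pairwise distances between the members of two clusters. A minimal sketch, assuming D can be indexed as D[i][j]; the function name `linkage_distance` and the two toy distances are hypothetical values chosen to illustrate D[(2,3),4].

```python
def linkage_distance(C1, C2, D, method="complete"):
    """Distance between clusters C1 and C2, given gene-to-gene distances D."""
    pairwise = [D[i][j] for i in C1 for j in C2]
    if method == "complete":
        return max(pairwise)
    if method == "single":
        return min(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: %r" % method)

# Toy distances from genes 2 and 3 to gene 4 (made-up numbers).
D = {2: {4: 5.0}, 3: {4: 3.0}}
for method in ("complete", "single", "average"):
    print(method, linkage_distance([2, 3], [4], D, method))
```

With these toy values the cluster (2,3) is at distance 5.0, 3.0 or 4.0 from gene 4, depending on the linkage.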
  23. 24. Hierarchical clustering
        Ingredients: data matrix, distance function, linkage function
        [Figure: genes 1 to 6 plotted by Phenotype 1 and Phenotype 2, and the resulting dendrogram]
  24. 25. Phenotypic data: 332 knock-downs in 5 conditions [Figure: correlation matrix and dendrogram]. Data by Klaas Mulder, CRI
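The three ingredients combine into a short agglomerative procedure: start with every gene in its own cluster and repeatedly merge the two closest clusters. The sketch below is a naive version meant for reading, not for large data; the function names are hypothetical, and the linkage is passed as a function over the list of pairwise distances (`max` for complete, `min` for single linkage).

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clustering(M, dist=euclidean, linkage=max):
    """Naive agglomerative clustering of the rows of M.

    Returns the list of merges as (cluster_a, cluster_b, height), where
    clusters are tuples of row indices. The merge heights are what a
    dendrogram draws on its vertical axis.
    """
    def cluster_dist(pair):
        c1, c2 = pair
        return linkage([dist(M[i], M[j]) for i in c1 for j in c2])

    clusters = [(i,) for i in range(len(M))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        c1, c2 = min(combinations(clusters, 2), key=cluster_dist)
        merges.append((c1, c2, cluster_dist((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

M = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]  # toy profiles
for merge in hierarchical_clustering(M):
    print(merge)
```

Passing `linkage=min` or `linkage=lambda d: sum(d) / len(d)` switches to single or average linkage without changing the rest of the procedure.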
  25. 26. Clustering: PROs and CONs
        PRO
          • Standard analysis, almost always applicable
          • Global first overview
          • Can identify strong trends and patterns in the data
        CON
          • Often applied in situations where other methods would be more appropriate (e.g. supervised analysis)
        Brown et al (2005), Ivanova et al (2006), Bakal et al (2007)
  26. 27. Query-based gene ranking
        Task: find genes similar to query phenotypic profile
        Often addressed by clustering: which cluster does the query gene fall into?
        But:
          • That’s a pretty indirect way of answering the question
          • Cluster boundaries can be arbitrary
          • Genes in other clusters can also be similar to query
          • Better use a ranking method!
        [Figure: query gene among Cluster 1, Cluster 2 and Cluster 3]
  27. 28. Ranking by PhenoBLAST
          • Rank genes by similarity to query phenotypic profile
          • Challenge: identify informative phenotypes, correct for ubiquity
        PhenoBLAST: Gunsalus et al (2004)
        [Figure: perturbed genes, ordered by similarity to query gene, against phenotypes]
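The ranking idea itself fits in a few lines. The sketch below ranks genes by Pearson correlation with the query profile; it is a plain similarity ranking, not PhenoBLAST, and it does not select informative phenotypes or correct for ubiquity. The function names, gene names and profiles are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two profiles (assumes neither is constant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(ca, cb))
    return dot / (sqrt(sum(x * x for x in ca)) * sqrt(sum(y * y for y in cb)))

def rank_by_similarity(query, profiles):
    """Return (gene, correlation) pairs for all genes except the query,
    most similar first."""
    q = profiles[query]
    scored = [(g, pearson(q, p)) for g, p in profiles.items() if g != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)

profiles = {
    "query": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.5],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
for gene, r in rank_by_similarity("query", profiles):
    print(gene, round(r, 3))
```

Unlike a cluster assignment, the ranked list has no boundary: every gene gets a position, and the cut-off can be chosen after looking at the scores.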
  28. 29. Summary
        • Enrichment analysis
            • Hyper-geometric test and GO over-representation
            • Gene Set Enrichment Analysis
        • Mapping phenotypes to network
            • Finding rich subnetworks
            • Finding phenotypically correlated subnetworks
        • Clustering
            • Distances
            • Hierarchical clustering
        • Ranking
  29. 30. Three take-home messages
        • Enrichment of hits in a gene list can be assessed by hypergeometric test or GSEA.
        • Phenotypes can be mapped to existing networks to find sub-networks rich in hits or significantly correlated.
        • Hierarchical clustering uses a distance metric and a linkage function to build a hierarchy of clusters that can be visualized in a dendrogram.
  30. 31. From clusters to pathways
        • Enrichment analysis, clustering and ranking are multi-purpose methods that give a broad first overview of general patterns in the data.
        Next lectures: graph-based and probabilistic models to infer (causal) pathway structure from phenotypic data.
  31. 32. Network Biology - lecture 2: ≥ 3 questions!