Bayesian Clustering of Microarray Data
Discussion of work done in Rasmussen et al. 2007. IEEE Transact Comp Bio & Bioinfo.

  • Zoo-bin Gah-ha-ra-mani Tol-sto-roo-kov
  • Transcript

    • 1.
      • Modeling and Visualizing Uncertainty in Gene Expression Clusters using Dirichlet Process Mixtures
      • CE Rasmussen, BJ de la Cruz, Z Ghahramani, and DL Wild
      • IEEE Transact Comp Bio Bioinform
      • E-print: 30 Nov 2007
    • 2. Overview
      • Mathematical: Dirichlet Process Mixtures (DPM)
      • Biological: Gene ontology
    • 3. Take home message
      • What do we want? A clustering method that...
        • Is elegant
        • Requires no arbitrary assumptions
        • Allows for uncertainty
      • Answer: Dirichlet process mixture model (aka infinite Gaussian mixture model)
    • 4. Mathematical
      • Typical clustering of gene expression data
        • Measure the expression level
        • Create a distance measure
          • Between genes
          • Between experimental profiles
      • Use distance measure to create a model of data
        • Split-up: K-means
        • Build-up: Agglomerative hierarchical clustering
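The "create a distance measure" step on slide 4 can be sketched as follows. This is only an illustration: the slides do not say which distance is used, so one minus the Pearson correlation is assumed here, and the toy matrix and the name `correlation_distance` are hypothetical.

```python
import numpy as np

def correlation_distance(matrix):
    """Return 1 - Pearson correlation between every pair of rows."""
    corr = np.corrcoef(matrix)          # rows are treated as variables
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)         # remove floating-point noise on the diagonal
    return dist

# Hypothetical toy data: 4 genes (rows) measured in 5 experiments (columns).
expression = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with gene 0
    [1.0, 3.0, 2.0, 5.0, 4.0],
])

gene_dist = correlation_distance(expression)       # distances between genes
profile_dist = correlation_distance(expression.T)  # distances between experimental profiles
```

Either matrix can then be handed to a split-up method such as K-means or a build-up method such as agglomerative hierarchical clustering.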
    • 5. Finding clusters in data: Agglomerative hierarchical clustering
      • Similar to phylogenetic tree construction
      • Hierarchical clustering depends on arbitrary threshold
      • See one tree at a time = no uncertainty
      • [Figure: example tree from Mathworks, http://www.mathworks.com (ch_phytree_primateexample02.gif). Two clusters? Five clusters?]
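The "Two clusters? Five clusters?" question can be reproduced with a small sketch. It assumes SciPy's hierarchical clustering routines and average linkage; the one-dimensional toy points are hypothetical and chosen so that the same tree gives two or five clusters depending only on where it is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: five tight pairs, arranged in two well-separated groups.
points = np.array(
    [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 10.0, 10.1, 11.0, 11.1]
).reshape(-1, 1)

# Build one tree with average linkage (an assumed choice).
tree = linkage(points, method="average")

def count_clusters(tree, threshold):
    """Cut the tree at a distance threshold and count the flat clusters."""
    labels = fcluster(tree, t=threshold, criterion="distance")
    return len(set(labels))

n_coarse = count_clusters(tree, 5.0)   # high cut: the two large groups
n_fine = count_clusters(tree, 0.5)     # low cut: the five tight pairs
```

Nothing in the data says which cut is right, which is the arbitrariness the slide points at.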
    • 6. Problems with agglomerative hierarchical clustering
      • How do you know the tree is accurate?
      • How can we consider alternative models?
        • Use statistical methods
          • Bootstrapping: resample data
          • Maximum likelihood: build and compare models
      • Is not elegant
      • Makes arbitrary assumptions
      • Does not allow for uncertainty
    • 7. Dirichlet process mixture model
      • Let data shape clusters → no arbitrary assumptions
      • AKA Infinite Gaussian mixture model
      • Non-parametric Bayesian method
      • Is elegant
      • Captures and displays uncertainty in the data
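One way to see why the number of clusters is not fixed in advance is to sample partitions from the Chinese restaurant process, the prior over partitions implied by a Dirichlet process. The sketch below is a minimal illustration using only the standard library; the concentration parameter `alpha` and the function name are assumptions, and the Gaussian likelihood of the full mixture model is not included.

```python
import random

def sample_crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from a Chinese restaurant process."""
    assignments = []   # cluster label of each item
    counts = []        # current size of each cluster
    for _ in range(n_items):
        # Existing clusters are chosen in proportion to their size,
        # a brand-new cluster in proportion to alpha.
        weights = counts + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):
            counts.append(1)        # open a new cluster
        else:
            counts[choice] += 1     # join an existing cluster
        assignments.append(choice)
    return assignments, counts

assignments, counts = sample_crp_partition(50, alpha=1.0, rng=random.Random(0))
n_clusters = len(counts)
```

Repeated draws give different numbers of clusters, and larger values of `alpha` tend to give more of them.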
    • 8. Why use Bayesian statistics for clustering?
      • To be honest about ignorance
      • To allow us to represent uncertainty numerically
      • To converge on the “truth” by testing different Bayesian priors
        • Iterative process
    • 9. Compare output: agglomerative hierarchical clustering vs. Dirichlet process mixture model
    • 10. Agglomerative hierarchical clustering (Hughes et al. 2000. Cell.)
    • 11. DPM output: Heat map
    • 12. DPM output: Tree (Dissimilarity metric)
    • 13. A very brief methodology
      • Preprocess data
      • Reduce data (PCA)
      • Perform DPM
      • Generate visual output
      • Use gene ontology (biology)
    • 14. Gene ontology
      • Over-representation of GO terms
      • Use SGD GO Term tool
      • Determine for each of the 3 base GO categories: process, function, and component
        • Combination of GO terms may suggest related function
        • Since p depends on size of cluster and size of reference list, use permissive cut-off: p < 0.2
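The dependence of p on cluster size and reference-list size can be made concrete with a hypergeometric tail probability, a common choice for GO over-representation. The slides do not state which test the SGD tool applies, so the test, the function name, and the gene counts below are assumptions for illustration.

```python
from math import comb

def go_enrichment_pvalue(n_reference, n_annotated, n_cluster, n_hits):
    """P(X >= n_hits) when n_cluster genes are drawn without replacement
    from n_reference genes, of which n_annotated carry the GO term."""
    total = comb(n_reference, n_cluster)
    upper = min(n_annotated, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_reference - n_annotated, n_cluster - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 60 of 6000 reference genes carry the term (1%).
# Both clusters are 40% annotated, but they differ in size.
p_small = go_enrichment_pvalue(6000, 60, 5, 2)     # 2 of 5 genes
p_large = go_enrichment_pvalue(6000, 60, 50, 20)   # 20 of 50 genes
```

With the same proportion of annotated genes, the small cluster gets a much larger p-value than the large one, which is consistent with the permissive cut-off on the slide.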
    • 15. Problems with GO
      • Need a human eye
      • Small cluster (<5) often yields poor or no representative GO term
      • Hierarchical nature of GO
        • The “best” GO term may be too general or too specific to be informative
        • Look at all significant GO terms
    • 16. Problems with GO
      • Example: EC15
        • Process = “physiological process” → high-level but uninformative.
        • Function = “hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides” → low-level, highly specific, not immediately informative.
      • Look at all significant GO terms
    • 17. Data set
      • “Expression Compendium”
        • Hughes et al. 2000. Cell.
        • Rosetta Inpharmatics
        • Expression profiles of 300 conditions
          • Growth conditions
          • Chemical treatments
          • Gene mutants or knockouts
          • Gene overexpression mutants
      • Transcript (gene) clustering
      • Experimental (profile) clustering
    • 18. Biological interpretation
      • Heat map → Identify clusters → Determine members → SGD GO Term Finder → Identify “best” (most informative) GO terms
    • 19. Transcript clusters (TC)
    • 20. Agglomerative hierarchical clustering (Hughes et al. 2000. Cell.)
    • 21. Transcript clusters by AHC
      • PAU
      • RNR 2,3,4
      • Stress/carbohydrates
      • Ergosterol
      • Amino acid
      • Calcineurin/PKC
      • Mito. Function
      • Mating
    • 22. Transcript clustering by DPM (636 ORFs)
    • 23. Transcript clustering by DPM (636 ORFs)
    • 24. Transcript clusters by DPM
      • TC 1: isocitrate dehydrogenase
      • TC 2: mating bud (cell wall)
      • TC 3: retrotransposons
      • TC 4: monosaccharide transport
      • TC 5: [unknown]
      • TC 6: mating tip (cell wall)
      • TC 7: amino acid biosynthesis
      • TC 8: carbohydrate transport
      • TC 9: ion transport (cell wall/PM?)
      • TC 10: cell cycle/ER
      • TC 11: cytokinesis (cell wall)
      • TC 12: endopeptidase (cell wall, PM)
      • TC 13: steroid biosynthesis (ergosterol)
      • TC 14: sulfur metabolism
      • TC 15: vitamin metabolism
      • TC 16: beta-alanine metabolism
      • TC 17: polyamine transport
      • 17 Transcript Clusters
      • 636 ORFs
      • 515 ORFs placed in clusters
    • 25. Comparison of transcript clusters: DPM vs. AHC
      • TC 1 ≈ Mito fx.
      • TC 3 ≈ RNR
      • TC 5 ≈ PAU
      • TC 6 ≈ Mating
      • TC 13 ≈ Ergosterol
      • OVERLAP
      • Stress/carbohydrate & AA: SPLIT
        • TC 4: monosaccharide transport
        • TC 7: general AA biosynthesis
        • TC 14: sulfur metabolism
        • TC 15: vitamin metabolism
        • TC 16: beta-alanine metabolism
        • TC 17: polyamine transport
    • 26. Experimental clusters (EC)
    • 27. Agglomerative hierarchical clustering (Hughes et al. 2000. Cell.)
    • 28. Experimental clusters by AHC
      • Histone deacetylase
      • Mating
      • Ribosome/translation
      • MAPK signalling
      • isw1, isw2
      • tup1, ssn6
      • HU,MMS, rnr1
      • Cell wall (2)
      • Ergosterol
      • sir2, sir3
      • cup5, vma8
      • Mitochondria
    • 29. Experimental clustering by DPM (196 conditions)
    • 30. Experimental clustering by DPM (196 conditions)
    • 31. Selected environmental clusters by DPM
    • 32.  
    • 33.  
    • 34. Conclusions
      • Dirichlet process mixture model is a Bayesian clustering method that improves upon agglomerative hierarchical clustering methods by taking into account uncertainty in the data.
      • Heat map visualization shows uncertainty in the clusters, pointing to possible shared regulation or overlapping gene clusters.
      • DPM clustering confirms and extends the results of Hughes et al. by revealing a finer level of granularity in the clusters, leading to new biological insights.
    • 35. Acknowledgements
      • Co-authors
      • David Wild (KGI, U. Warwick)
      • Carl Rasmussen (Max Planck Institute for Biological Cybernetics, Cambridge)
      • Zoubin Ghahramani (Cambridge)
      • Keck Graduate Institute (Claremont, Calif)
      • Jim Cregg
        • Jay Sunga, Ilya Tolstorukov
        • Geoff and Joan Lin-Cereghino (U Pacific)
      • Alpan Raval (KGI, CGU)
      • Ashish Bhan (KGI, UC Irvine)
    • 36. Directions
      • Other applications
        • Classifying protein structures (Dubey et al. 2004. Pacific Symposium on Biocomputing.)
      • Simplified data
        • Reducing expression level to 3-states (-1, 0, +1)
        • Reducing computation time
      • Coding
        • Developing self-contained program instead of MATLAB
        • Scripting a pipeline
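The 3-state reduction listed under "Simplified data" could look like the sketch below. The slides give no cut-off, so the `threshold` argument, the assumption that inputs are log ratios, and the function name are all hypothetical.

```python
import numpy as np

def to_three_states(log_ratios, threshold):
    """Map each value to +1 (up), -1 (down), or 0 (unchanged)."""
    values = np.asarray(log_ratios, dtype=float)
    return np.where(values > threshold, 1, np.where(values < -threshold, -1, 0))

states = to_three_states([-2.0, -0.1, 0.0, 0.3, 1.5], threshold=1.0)
```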
    • 37. Selected references
      • Rasmussen CE et al. Modeling and Visualizing Uncertainty in Gene Expression Clusters using Dirichlet Process Mixtures. IEEE Transact Comp Bio Bioinform. 2007 Nov 30.
      • Rasmussen CE. The Infinite Gaussian Mixture Model. In: Adv Neur Info Process Sys 12, MIT Press, 2000.
      • Wild DL. Graphical Models and Bayesian Methods in Bioinformatics. Presentation, KGI, Jan 2004.
      • El-Arini K. Dirichlet Process Mixtures. http://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
    • 38. Rasmussen et al. 2007. IEEE TCBB
    • 39. Rasmussen et al. 2007. IEEE TCBB
    • 40. Rasmussen et al. 2007. IEEE TCBB
    • 41. Estimating number of clusters (Rasmussen et al. 2007. IEEE TCBB)
      • Transcript clusters
      • Experimental clusters
    • 42. Dirichlet process mixture model
    • 43. Long methodology [1]
      • Preprocessing data
        • Transcript: 3-fold change in 2+ experiments
        • Experiments: 3-fold change in 2+ transcripts
      • Reduce data
        • Principal component analysis
        • 10 leading eigen-directions
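The two preprocessing steps on this slide can be sketched with NumPy. Several details are assumptions not given in the slides: that the data are log10 expression ratios, that "3-fold" means an absolute log10 ratio of at least log10(3), and that PCA is done by SVD on column-centered data. The function names and toy matrix are hypothetical.

```python
import numpy as np

def filter_by_fold_change(log10_ratios, fold=3.0, min_count=2):
    """Keep rows that reach the fold change in at least min_count columns."""
    threshold = np.log10(fold)
    n_strong = (np.abs(log10_ratios) >= threshold).sum(axis=1)
    return log10_ratios[n_strong >= min_count]

def pca_scores(data, n_components=10):
    """Project rows onto the leading principal directions of the data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical toy matrix: 50 transcripts (rows) by 20 experiments (columns).
rng = np.random.default_rng(0)
toy = rng.uniform(-0.1, 0.1, size=(50, 20))   # background well below 3-fold
toy[:30, 0] = 1.0      # rows 0-29: strong change in two experiments
toy[:30, 1] = -1.0
toy[30:40, 0] = 1.0    # rows 30-39: strong change in only one experiment

kept = filter_by_fold_change(toy)
scores = pca_scores(kept, n_components=10)
```

To filter and reduce experiments rather than transcripts, the same functions can be applied to the transposed matrix.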
    • 44. Long methodology [2]
      • Markov chain Monte Carlo (MCMC)
        • Model data y as a Dirichlet process mixture
        • Set parameters of prior distribution
          • Initiate mixture
          • Indicator variable c_i for each transcript or experiment, indicating which component (cluster) data point i belongs to.
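For reference, the standard form of the Gibbs update for an indicator variable in a Dirichlet process mixture of Gaussians is sketched below. The slides do not show it, so the symbols are assumptions: $\alpha$ is the concentration parameter, $G_0$ the base distribution over component parameters, $n$ the number of data points, and $n_{-i,j}$ the number of points other than $i$ currently assigned to component $j$.

```latex
% Existing component j (with n_{-i,j} > 0):
p(c_i = j \mid \mathbf{c}_{-i}, y_i, \mu_j, \Sigma_j, \alpha)
  \;\propto\; \frac{n_{-i,j}}{n - 1 + \alpha}\,
  \mathcal{N}(y_i \mid \mu_j, \Sigma_j)

% New, currently unrepresented component:
p(c_i \neq c_{i'} \ \text{for all } i' \neq i \mid \mathbf{c}_{-i}, y_i, \alpha)
  \;\propto\; \frac{\alpha}{n - 1 + \alpha}
  \int \mathcal{N}(y_i \mid \mu, \Sigma)\, dG_0(\mu, \Sigma)
```

The first factor in each line is the prior from the Dirichlet process; the second is how well the component explains data point $i$.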
    • 45. Long methodology [3]
      • MCMC (cont)
        • Burn-in: depends on mixing time
          • TCs: mixing time 200 → burn-in 10,000
          • ECs: mixing time 60,000 → burn-in 100,000
        • Gibbs sampling sweeps to construct probability matrix
          • TCs: 100,000 cycles, keeping every 1,000th sample
          • ECs: 11,000,000 cycles, keeping every 100,000th sample
        • Update variables and parameters with each sampling
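Turning the sampled indicator variables into a probability matrix can be sketched as below. This covers only the post-processing (discard burn-in, thin, and average co-assignments), not the Gibbs sampler itself; the function name and the tiny hand-written chain are hypothetical.

```python
import numpy as np

def coclustering_matrix(indicator_samples, burn_in, thin):
    """Estimate p_ij, the fraction of kept samples in which items i and j
    share a component. Rows of indicator_samples are MCMC sweeps."""
    samples = np.asarray(indicator_samples)[burn_in::thin]
    same = samples[:, :, None] == samples[:, None, :]   # (sweeps, items, items)
    return same.mean(axis=0)

# Hypothetical chain: 4 sweeps over 3 items.
chain = [
    [0, 1, 2],   # discarded as burn-in
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
]
p = coclustering_matrix(chain, burn_in=1, thin=1)
```

Because only co-assignment is counted, the result does not change if component labels are permuted between sweeps.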
    • 46. Long methodology [4]
      • Visual output of probability matrix
        • Heat map
          • Colors represent probability p_ij that data points i and j are part of the same component (c_i = c_j)
        • Tree
          • Dissimilarity measure (1 − p_ij)
          • Standard linkage algorithm to build a tree
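The tree step can be sketched with SciPy. The slides say only "standard linkage algorithm", so average linkage is an assumption here, and the 4 by 4 probability matrix is a hypothetical example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def tree_from_coclustering(p, method="average"):
    """Build a linkage tree from the dissimilarity 1 - p_ij."""
    dissimilarity = 1.0 - np.asarray(p, dtype=float)
    np.fill_diagonal(dissimilarity, 0.0)              # exact zeros on the diagonal
    condensed = squareform(dissimilarity, checks=False)
    return linkage(condensed, method=method)

# Hypothetical co-clustering probabilities for four items.
p_example = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
tree = tree_from_coclustering(p_example)
labels = fcluster(tree, t=0.5, criterion="distance")  # cut where p_ij drops below 0.5
```

Items that are usually sampled into the same component end up on neighboring branches.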
    • 47. Long methodology [5]
      • Interpret heat map and tree
        • Identify clusters and their members
        • Use Gene Ontology to annotate clusters
          • SGD GO Term tool
          • Non-exhaustive analysis: not all possible clusters examined
    • 48.  
