Bayesian Multi-topic Microarray Analysis with Hyperparameter Reestimation


  1. Tomonari MASADA (正田備也), Nagasaki University (長崎大学), [email_address]
  2. Overview
     - Problem
     - Latent Process Decomposition (LPD)
     - Hyperparameter reestimation (MVB+)
     - Experiments
     - Results
     - Conclusions
  3. Problem
     Explain differences among cells of different natures (e.g. cancer vs.
     normal cells) by analyzing differences in gene expression obtained from
     DNA microarray experiments.
  4. Gene expression
     (figure from http://bix.ucsd.edu/bioalgorithms/slides.php)
  5. DNA microarray experiment
     We can find out which genes are used (expressed) by different types of cells.
  6. (figure slide)
  7. (figure slide)
  8. Latent Process Decomposition (LPD)

     LDA [Blei et al. 01]     LPD [Rogers et al. 05]
     ----------------------   ----------------------
     text mining              microarray analysis
     document                 sample
     word                     gene
     word frequency           gene expression level
     latent topic             latent process
  9. LPD as a multi-topic model
     (figure: row = gene, column = sample, color = process)
  10. LPD as a generative model
      - For each sample d, draw a multinomial θ_d from a Dirichlet prior Dir(α)
        - θ_d: mixing proportions of processes for sample d
      - For each gene g in each sample d,
        - draw a process k from Mult(θ_d)
        - draw a real number from the Gaussian N(μ_gk, λ_gk)
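      As a concrete illustration, here is a minimal sketch of this generative
      process in Python/NumPy. Treating λ_gk as a precision (inverse variance)
      is an assumption about the slide's notation; use it as a variance
      directly if that is the intended convention.

          import numpy as np

          def generate_lpd(alpha, mu, lam, n_samples, seed=None):
              """Sample an expression matrix from the LPD generative model.

              alpha : (K,) Dirichlet hyperparameters over processes
              mu    : (G, K) per-gene, per-process Gaussian means
              lam   : (G, K) per-gene, per-process Gaussian precisions (assumed)
              """
              rng = np.random.default_rng(seed)
              G, K = mu.shape
              X = np.empty((n_samples, G))         # expression levels
              Z = np.empty((n_samples, G), int)    # latent process assignments
              for d in range(n_samples):
                  theta = rng.dirichlet(alpha)     # mixing proportions for sample d
                  for g in range(G):
                      k = rng.choice(K, p=theta)   # draw a process for gene g
                      X[d, g] = rng.normal(mu[g, k], 1.0 / np.sqrt(lam[g, k]))
                      Z[d, g] = k
              return X, Z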
  11. Inference by VB [Rogers et al. 05]
      Variational Bayesian (VB) inference:
      - VB is used when EM cannot be used.
      - Instead of the log likelihood, a variational lower bound is maximized.
  12. Variational lower bound (formula shown as an image)
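      The bound on this slide was an image; in the standard mean-field form,
      with latent variables z and a variational distribution q, the lower
      bound that VB maximizes in place of the intractable log likelihood is

          \log p(x \mid \alpha, \mu, \lambda)
            \;\ge\; \mathbb{E}_q\!\left[\log p(x, z \mid \alpha, \mu, \lambda)\right]
                    - \mathbb{E}_q\!\left[\log q(z)\right].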
  13. Inference by MVB [Ying et al. 08]
      Marginalized variational Bayesian (MVB) inference:
      - Marginalizes out the multinomial parameters
      - Involves less approximation than VB
      - cf. collapsed variational Bayesian inference for LDA [Teh et al. 06]
  14. Marginalization in MVB (formula shown as an image)
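      The formula on this slide was an image; the key step is the standard
      Dirichlet-multinomial identity, which integrates each sample's
      multinomial θ_d out against its Dirichlet prior. Writing n_{dk} for the
      number of genes in sample d assigned to process k, and n_d = \sum_k n_{dk}:

          p(z_d \mid \alpha)
            = \int p(z_d \mid \theta_d)\, p(\theta_d \mid \alpha)\, d\theta_d
            = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(\sum_k \alpha_k + n_d)}
              \prod_k \frac{\Gamma(\alpha_k + n_{dk})}{\Gamma(\alpha_k)}.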
  15. (figure slide)
  16. Our proposal: MVB+
      MVB with hyperparameter reestimation:
      - Empirical Bayes method: estimate hyperparameters by maximizing the
        variational lower bound.
      - Hand-tuned hyperparameter values often result in poor quality of
        inference.
  17. Update formulas in MVB+ (formulas shown as an image)
      Inversion of the digamma function is required.
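      The update formulas themselves were an image, but the digamma inversion
      the slide mentions (solving ψ(x) = y for x, which has no closed form) is
      typically done with a few Newton steps. The sketch below uses Minka's
      initialization; whether the authors used exactly this scheme is an
      assumption.

          import numpy as np
          from scipy.special import digamma, polygamma

          def inverse_digamma(y, n_iter=5):
              """Solve digamma(x) = y for x > 0 by Newton's method."""
              y = np.asarray(y, dtype=float)
              # Initialization from Minka, "Estimating a Dirichlet distribution"
              x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
              for _ in range(n_iter):
                  # Newton step: polygamma(1, x) is the trigamma function psi'(x)
                  x = x - (digamma(x) - y) / polygamma(1, x)
              return x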
  18. Hyperparameter reestimation
      An outstanding trend in Bayesian modeling?
      - [Asuncion et al. UAI'09]
        - Reestimates the hyperparameters of LDA
        - Overturns the conventional wisdom:
          - before: "VB < CVB < CGS"
          - after: "VB = CVB = CGS" (in perplexity)
      - [Masada et al. CIKM'09 (poster, to appear)]
  19. Experiments
      Datasets available on the Web:
      - LK: Leukemia
        http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=63
      - D1: "Five types of breast cancer"
      - D2: "Three types of bladder cancer"
      - D3: "Healthy tissues"
        http://www.ihes.fr/~zinovyev/princmanif2006/
  20. Data specifications

      Dataset name                    Abbr.   # of samples   # of genes
      -----------------------------   -----   ------------   ----------
      Leukemia                        LK      72             12582
      Five types of breast cancer     D1      286            17816
      Three types of bladder cancer   D2      40             3036
      Healthy tissues                 D3      103            10383
  21. Results
      - Can we achieve inference of better quality?
      - Can we achieve better sample clustering?
      - Are there any qualitative differences between MVB and MVB+?
  22. (figure: LK, lower bound vs. # of iterations)
  23. (figure: D1, lower bound vs. # of iterations)
  24. (figure: D2, lower bound vs. # of iterations)
  25. (figure: D3, lower bound vs. # of iterations)
  26. (figure: LK, lower bound after convergence vs. # of processes)
  27. (figure: D1, lower bound after convergence vs. # of processes)
  28. (figure: D2, lower bound after convergence vs. # of processes)
  29. (figure: D3, lower bound after convergence vs. # of processes)
  30. Sample clustering evaluation (averaged over 100 trials)

      dataset   method   precision       recall          F-score
      -------   ------   -------------   -------------   -------------
      LK        MVB+     0.934 ± 0.007   0.931 ± 0.010   0.932 ± 0.009
      LK        MVB      0.930 ± 0.000   0.924 ± 0.000   0.927 ± 0.000
      D2        MVB+     0.837 ± 0.038   0.822 ± 0.032   0.829 ± 0.033
      D2        MVB      0.779 ± 0.084   0.751 ± 0.069   0.763 ± 0.071
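      The slides do not specify which clustering precision/recall variant was
      used; one common choice is pair counting over all sample pairs, sketched
      here for illustration only.

          from itertools import combinations

          def pairwise_prf(pred, truth):
              """Pair-counting precision/recall/F-score for a clustering."""
              tp = fp = fn = 0
              for i, j in combinations(range(len(pred)), 2):
                  same_pred = pred[i] == pred[j]    # clustered together?
                  same_true = truth[i] == truth[j]  # same true class?
                  tp += same_pred and same_true
                  fp += same_pred and not same_true
                  fn += same_true and not same_pred
              p = tp / (tp + fp) if tp + fp else 0.0
              r = tp / (tp + fn) if tp + fn else 0.0
              f = 2 * p * r / (p + r) if p + r else 0.0
              return p, r, f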
  31. Qualitative difference (LK)
      (figure: row = gene, column = sample; panels: MVB+ vs. MVB)
      MVB+ can preserve the diversity of genes.
  32. Qualitative difference (D2)
      (figure: row = gene, column = sample; panels: MVB+ vs. MVB)
      MVB+ can preserve the diversity of genes.
  33. Conclusions
      - Formulas for hyperparameter reestimation
      - Improvement in inference quality:
        - larger variational lower bounds
        - better sample clustering
        - preservation of gene diversity
  34. Future work
      - Use more data to further validate effectiveness
      - Devise collapsed Gibbs sampling for LPD
      - Accelerate computations (OpenMP, NVIDIA CUDA)
      - Provide a method for gene clustering
