# Bayesian Multi-topic Microarray Analysis with Hyperparameter Reestimation

## on Jan 28, 2010

## Presentation Transcript

• Tomonari MASADA (正田備也), Nagasaki University (長崎大学), [email_address]
• Overview
• Problem
• Latent Process Decomposition ( LPD )
• Hyperparameter reestimation ( MVB+ )
• Experiment
• Results
• Conclusions
• Problem
• Explain differences among cells of different nature (e.g. cancer vs. normal cells) by analyzing differences in gene expression obtained from DNA microarray experiments.
• Gene expression http://bix.ucsd.edu/bioalgorithms/slides.php
• DNA microarray experiment
• We can find out which genes are used (expressed) by different types of cells.
• Latent Process Decomposition
• latent Dirichlet allocation (LDA) [Blei et al. 01] → latent process decomposition (LPD) [Rogers et al. 05]

| text mining (LDA) | microarray analysis (LPD) |
|---|---|
| document | sample |
| word | gene |
| word frequency | gene expression level |
| latent topic | latent process |
• LPD as a multi-topic model
• row = gene, column = sample, color = process
• LPD as a generative model
• For each sample d , draw a multinomial θ d from a Dirichlet prior Dir ( α )
• θ d : mixing proportions of processes for sample d
• For each gene g in each sample d ,
• Draw a process k from Mult ( θ d )
• Draw a real number from Gaussian N ( μ gk , λ gk )
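The generative story above can be sketched as a minimal pure-Python simulation. The function names and dimensions here are illustrative, and λ is treated as a Gaussian precision (an assumption; it could equally be a variance):

```python
import random

def sample_dirichlet(alpha):
    """Draw from Dir(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_sample(alpha, mu, lam):
    """Generate expression levels for one sample under LPD.

    alpha      : Dirichlet hyperparameters, one per process
    mu[g][k]   : Gaussian mean for gene g under process k
    lam[g][k]  : Gaussian precision for gene g under process k
    """
    theta = sample_dirichlet(alpha)          # mixing proportions of processes
    expression = []
    for g in range(len(mu)):
        # draw a process k from Mult(theta), then an expression level
        k = random.choices(range(len(alpha)), weights=theta)[0]
        expression.append(random.gauss(mu[g][k], lam[g][k] ** -0.5))
    return theta, expression
```

Each sample gets its own θ, so a sample can mix several processes, which is what makes LPD a multi-topic model.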
• Inference by VB [Rogers et al. 05]
• Variational Bayesian inference
• VB is used when exact EM is intractable.
• The variational lower bound is maximized.
• Variational lower bound
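The slide's formula did not survive extraction. Written generically (not the LPD-specific bound, which additionally involves θ, μ, and λ), the lower bound that VB maximizes follows from Jensen's inequality: for observed data $x$, latent variables $z$, hyperparameters $\alpha$, and variational distribution $q$,

```latex
\log p(x \mid \alpha)
  \;\ge\; \mathbb{E}_{q(z)}\!\left[ \log p(x, z \mid \alpha) \right]
        - \mathbb{E}_{q(z)}\!\left[ \log q(z) \right]
  \;=\; \mathcal{L}(q, \alpha)
```

VB alternates updates of $q$ to tighten $\mathcal{L}$; MVB+ (below) additionally maximizes $\mathcal{L}$ with respect to $\alpha$.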
• Inference by MVB [Ying et al. 08]
• Marginalized variational Bayesian inference
• Marginalizes out the multinomial parameters
• Uses a tighter approximation than VB
• cf. Collapsed variational Bayesian inference for LDA [Teh et al. 06]
• Marginalization in MVB
• Our proposal: MVB+
• MVB with hyperparameter reestimation
• Empirical Bayes method
• Estimate hyperparameters by maximizing variational lower bound
• Hand-tuned hyperparameter values often result in poor quality of inference.
• Update formulas in MVB+
• Inversion of the digamma function is required.
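The digamma function ψ has no closed-form inverse, so the MVB+ updates need it numerically. One common approach (assumed here; the talk does not show its method) is Minka's initialization followed by Newton iterations, sketched in pure Python:

```python
import math

def digamma(x):
    """Digamma psi(x): recurrence psi(x) = psi(x+1) - 1/x,
    then the asymptotic expansion once x is large enough."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def trigamma(x):
    """Trigamma psi'(x), needed for the Newton step."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1/6.0 - f * (1/30.0 - f / 42.0))

def inv_digamma(y, iters=6):
    """Solve psi(x) = y for x > 0: Minka's initial guess + Newton."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1.0))
    for _ in range(iters):
        x -= (digamma(x) - y) / trigamma(x)
    return x
```

Newton converges quadratically from Minka's initial guess, so a handful of iterations suffices for machine precision.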
• Hyperparameter reestimation
• A notable trend in Bayesian modeling?
• [Asuncion et al. UAI’09]
• Reestimated the hyperparameters of LDA
• Overturned conventional wisdom:
• before: “VB < CVB < CGS”
• after: “VB = CVB = CGS” (in perplexity)
• [Masada et al. CIKM’09 (poster, to appear)]
• Experiments
• Datasets available from Web
• LK : Leukemia
• D1 : “Five types of breast cancer”
• D2 : “Three types of bladder cancer”
• D3 : “Healthy tissues”
• http://www.ihes.fr/~zinovyev/princmanif2006/
• Data specifications

| Dataset name | Abbreviation | # of samples | # of genes |
|---|---|---|---|
| Leukemia | LK | 72 | 12582 |
| Five types of breast cancer | D1 | 286 | 17816 |
| Three types of bladder cancer | D2 | 40 | 3036 |
| Healthy tissues | D3 | 103 | 10383 |
• Results
• Can we achieve inference of better quality?
• Can we achieve better sample clustering?
• Are there any qualitative differences between MVB and MVB+ ?
[Figure: variational lower bound vs. number of iterations, for LK, D1, D2, and D3]
[Figure: variational lower bound after convergence vs. number of processes, for LK, D1, D2, and D3]
• Sample clustering evaluation (averaged over 100 trials)

| Dataset | Method | Precision | Recall | F-score |
|---|---|---|---|---|
| LK | MVB+ | 0.934 ± 0.007 | 0.931 ± 0.010 | 0.932 ± 0.009 |
| LK | MVB | 0.930 ± 0.000 | 0.924 ± 0.000 | 0.927 ± 0.000 |
| D2 | MVB+ | 0.837 ± 0.038 | 0.822 ± 0.032 | 0.829 ± 0.033 |
| D2 | MVB | 0.779 ± 0.084 | 0.751 ± 0.069 | 0.763 ± 0.071 |
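One common way to score a hard clustering of samples against known labels is pairwise precision/recall/F-score, where every pair of samples counts as a true positive if both the clustering and the ground truth put them together. This is an assumed definition; the talk does not state which one it uses:

```python
from itertools import combinations

def pairwise_prf(pred, truth):
    """Pairwise precision/recall/F-score between a predicted
    clustering and ground-truth labels (equal-length label lists)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1            # clustered together, correctly
        elif same_pred:
            fp += 1            # clustered together, wrongly
        elif same_true:
            fn += 1            # split apart, wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Because it compares pairs rather than labels, this measure is invariant to how cluster indices are numbered, which matters when the number of inferred processes differs from the number of true classes.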
• Qualitative difference ( LK )
• row = gene, column = sample
• MVB+ can preserve diversity of genes
[Figure: gene-by-sample heatmaps, MVB+ vs. MVB]
• Qualitative difference ( D2 )
• row = gene, column = sample
• MVB+ can preserve diversity of genes
[Figure: gene-by-sample heatmaps, MVB+ vs. MVB]
• Conclusions
• Formulas for hyperparameter reestimation
• Improvement in inference quality
• Larger variational lower bounds
• Better sample clustering
• Gene diversity preservation
• Future work
• Evaluate on more datasets to confirm effectiveness
• Devise collapsed Gibbs sampling for LPD
• Accelerate computations
• OpenMP, Nvidia CUDA
• Provide a method for gene clustering