We can find out which genes are used (expressed) by different types of cells.
Latent Process Decomposition
latent Dirichlet allocation (LDA)    latent process decomposition (LPD)
[Blei et al. 01]                     [Rogers et al. 05]
text mining                          microarray analysis
document                             sample
word                                 gene
word frequency                       gene expression level
latent topic                         latent process
LPD as a multi-topic model
row = gene, column = sample, color = process
LPD as a generative model
For each sample d, draw a multinomial parameter θ_d from a Dirichlet prior Dir(α)
θ_d: mixing proportions of processes for sample d
For each gene g in each sample d,
Draw a process k from Mult(θ_d)
Draw a real number (the expression level) from the Gaussian N(μ_gk, λ_gk)
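The generative steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `sample_dirichlet` and `generate_lpd` are hypothetical, and the second Gaussian parameter λ_gk is assumed here to be a precision (inverse variance). If the slides intend a variance or standard deviation, the `sigma` line should be adjusted accordingly.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dir(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_lpd(n_samples, n_genes, alpha, mu, lam, seed=0):
    """Sample synthetic expression data following the LPD generative steps.

    mu[g][k] and lam[g][k] are the Gaussian mean and (assumed) precision
    of gene g under process k. Returns (thetas, expression), where
    expression[d][g] is the value drawn for gene g in sample d.
    """
    rng = random.Random(seed)
    n_processes = len(alpha)
    thetas, expression = [], []
    for d in range(n_samples):
        # Mixing proportions of processes for sample d.
        theta = sample_dirichlet(alpha, rng)
        row = []
        for g in range(n_genes):
            # Pick a process k for this gene in this sample.
            k = rng.choices(range(n_processes), weights=theta)[0]
            # Assumption: lam is a precision, so sigma = 1 / sqrt(lam).
            sigma = 1.0 / math.sqrt(lam[g][k])
            row.append(rng.gauss(mu[g][k], sigma))
        thetas.append(theta)
        expression.append(row)
    return thetas, expression
```

Note that the result is indexed sample-first, whereas the figure above uses rows for genes and columns for samples; transpose if that layout is needed.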
Inference by VB [Rogers et al. 05]
Variational Bayesian inference
VB is used when EM cannot be applied, e.g. when the exact posterior over the latent variables needed in the E-step is intractable.
Instead of the log likelihood,
a variational lower bound is maximized.
Variational lower bound
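For reference, the generic form of a variational lower bound is given here, not the LPD-specific bound. The symbols are assumptions for illustration: x denotes the observed data, z all latent variables and parameters, and q an arbitrary approximating distribution. The bound follows from Jensen's inequality:

```latex
\log p(x)
  = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz
  \;\ge\; \int q(z) \log \frac{p(x, z)}{q(z)}\, dz
  = \mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)]
  \;=:\; \mathcal{L}(q),
\qquad
\log p(x) - \mathcal{L}(q) = \mathrm{KL}\bigl(q(z) \,\|\, p(z \mid x)\bigr) \ge 0 .
```

Maximizing the bound over q is therefore equivalent to minimizing the KL divergence from q to the true posterior.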
Inference by MVB [Ying et al. 08]
Marginalized variational Bayesian inference
Marginalizes multinomial parameters
Requires less approximation than VB, because the multinomial parameters are integrated out instead of being approximated
cf. Collapsed variational Bayesian inference for LDA [Teh et al. 06]
Marginalization in MVB
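The identity that makes this kind of marginalization tractable is the standard Dirichlet-multinomial integral, written here for the mixing proportions θ_d. This is a generic sketch rather than the derivation from the original slide. The count n_dk (number of genes in sample d assigned to process k) and the total n_d are notation assumed for illustration:

```latex
\int \mathrm{Dir}(\theta_d \mid \alpha) \prod_{k} \theta_{dk}^{\,n_{dk}} \, d\theta_d
  = \frac{\Gamma\!\left(\sum_{k} \alpha_k\right)}
         {\Gamma\!\left(\sum_{k} \alpha_k + n_d\right)}
    \prod_{k} \frac{\Gamma(\alpha_k + n_{dk})}{\Gamma(\alpha_k)},
\qquad n_d = \sum_{k} n_{dk} .
```

After this step θ_d no longer appears, so no variational factor has to be introduced for it.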
Our proposal: MVB+
MVB with hyperparameter reestimation
Empirical Bayes method
Estimate hyperparameters by maximizing the variational lower bound
Hand-tuned hyperparameter values often result in poor quality of inference.
Update formulas in MVB+
Inversion of the digamma function is required.
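The digamma function has no closed-form inverse, so it is usually inverted numerically. The sketch below uses Newton's method with a commonly used starting point, and relies only on the standard library. It does not reproduce the MVB+ update formulas themselves; the function names `digamma`, `trigamma` and `inverse_digamma` are hypothetical helpers written for this illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # digamma(1) = -EULER_GAMMA


def digamma(x):
    """psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 8.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """psi'(x) for x > 0, by the same shift-then-series approach."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 8.0:
        result += 1.0 / (x * x)  # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))
    return result + inv + 0.5 * inv2 + series


def inverse_digamma(y, tol=1e-12, max_iter=100):
    """Solve digamma(x) = y for x > 0 with Newton's method."""
    # Starting point: exp(y) + 0.5 for moderate y, -1/(y + gamma) for very
    # negative y, where digamma(x) behaves like -1/x - gamma.
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + EULER_GAMMA)
    for _ in range(max_iter):
        new_x = x - (digamma(x) - y) / trigamma(x)
        if new_x <= 0.0:
            new_x = 0.5 * x  # safeguard: stay inside the domain
        if abs(new_x - x) <= tol * max(1.0, x):
            return new_x
        x = new_x
    return x
```

A typical use is a fixed-point update in which the new hyperparameter is obtained by applying `inverse_digamma` to a quantity computed from the current variational statistics.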
Data specifications
Dataset name (abbreviation)           # of samples   # of genes
Leukemia (LK)                         72             12582
Five types of breast cancer (D1)      286            17816
Three types of bladder cancer (D2)    40             3036
Healthy tissues (D3)                  103            10383
Can we achieve inference of better quality?
Can we achieve better sample clustering?
Are there any qualitative differences between MVB and MVB+ ?