Bayesian Multi-topic Microarray Analysis with Hyperparameter Reestimation
Presentation Transcript

  • Tomonari MASADA ( 正田備也 ), NAGASAKI University ( 長崎大学 ), [email_address]
  • Overview
    • Problem
    • Latent Process Decomposition ( LPD )
    • Hyperparameter reestimation ( MVB+ )
    • Experiment
    • Results
    • Conclusions
  • Problem
    • Explain differences among cells of different types (e.g., cancer vs. normal cells)
    • by analyzing differences in gene expression
    • obtained from DNA microarray experiments.
  • Gene expression http://bix.ucsd.edu/bioalgorithms/slides.php
  • DNA microarray experiment
    • We can find out which genes are used (expressed) by different types of cells.
  • Latent Process Decomposition
    • Correspondence between latent Dirichlet allocation ( LDA ) [Blei et al. 01], used in text mining, and latent process decomposition ( LPD ) [Rogers et al. 05], used in microarray analysis:
      • document ↔ sample
      • word ↔ gene
      • word frequency ↔ gene expression level
      • latent topic ↔ latent process
  • LPD as a multi-topic model
    • row = gene, column = sample, color = process
  • LPD as a generative model (a simulation sketch follows this list)
    • For each sample d , draw a multinomial θ_d from a Dirichlet prior Dir ( α )
      • θ_d : mixing proportions of processes for sample d
    • For each gene g in each sample d ,
      • Draw a process k from Mult ( θ_d )
      • Draw a real-valued expression level from the Gaussian N ( μ_gk , λ_gk )
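A minimal simulation sketch of this generative process, for illustration only; the function name, the NumPy-based implementation, and the reading of λ_gk as a precision are assumptions, not taken from the slides:

    # Hypothetical sketch of the LPD generative process described above.
    import numpy as np

    def generate_lpd_data(n_samples, n_genes, n_processes, alpha, mu, lam, seed=0):
        """Simulate expression levels under the LPD generative model.

        alpha : (n_processes,) Dirichlet hyperparameters
        mu    : (n_genes, n_processes) per-gene, per-process Gaussian means
        lam   : (n_genes, n_processes) per-gene, per-process Gaussian precisions (assumed)
        """
        rng = np.random.default_rng(seed)
        data = np.empty((n_samples, n_genes))
        for d in range(n_samples):
            theta_d = rng.dirichlet(alpha)               # mixing proportions for sample d
            for g in range(n_genes):
                k = rng.choice(n_processes, p=theta_d)   # draw a process for gene g
                # draw the expression level from N(mu_gk, precision lam_gk)
                data[d, g] = rng.normal(mu[g, k], 1.0 / np.sqrt(lam[g, k]))
        return data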
  • Inference by VB [Rogers et al. 05]
    • Variational Bayesian inference
      • VB is used when exact EM is intractable.
      • Instead of the log likelihood, a variational lower bound is maximized.
  • Variational lower bound
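The slide displays the bound itself as an image. As a hedged reconstruction, the generic mean-field form for LPD (with E the matrix of expression levels, z the per-gene process assignments, and q the variational posterior; notation assumed from the generative model above) is

    \log p(E \mid \alpha, \mu, \lambda)
      \;\ge\;
      \mathbb{E}_{q(\theta, z)}\bigl[\log p(E, z, \theta \mid \alpha, \mu, \lambda)\bigr]
      \;-\; \mathbb{E}_{q(\theta, z)}\bigl[\log q(\theta, z)\bigr],

which is the quantity maximized in place of the log likelihood.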
  • Inference by MVB [Ying et al. 08]
    • Marginalized variational Bayesian inference
      • Marginalizes the multinomial parameters θ_d
      • Achieves a tighter approximation than VB
      • cf. Collapsed variational Bayesian inference for LDA [Teh et al. 06]
  • Marginalization in MVB
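The slide's equation is not reproduced in this transcript. The marginalization of the per-sample multinomial parameters θ_d under the Dirichlet prior is the textbook Dirichlet-multinomial identity (written here in the notation assumed above, not necessarily in the slide's exact form):

    p(z_d \mid \alpha)
      = \int p(z_d \mid \theta_d)\, p(\theta_d \mid \alpha)\, d\theta_d
      = \frac{\Gamma\!\bigl(\sum_k \alpha_k\bigr)}{\Gamma\!\bigl(\sum_k \alpha_k + n_{d\cdot}\bigr)}
        \prod_k \frac{\Gamma(\alpha_k + n_{dk})}{\Gamma(\alpha_k)},

where n_{dk} is the number of genes in sample d assigned to process k and n_{d·} = Σ_k n_{dk}; MVB works with this marginalized form instead of keeping θ_d explicit.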
  • Our proposal: MVB+
    • MVB with hyperparameter reestimation
      • Empirical Bayes method
        • Estimate hyperparameters by maximizing the variational lower bound
      • Hand-tuned hyperparameter values often result in poor inference quality.
  • Update formulas in MVB+
    • Inversion of the digamma function is required (see the sketch below).
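Inverting the digamma function has no closed form, but a few Newton steps usually suffice. A minimal sketch of the standard initialization-plus-Newton recipe (not the authors' code; SciPy's digamma and polygamma are assumed available):

    # Sketch: invert the digamma function Psi by Newton's method.
    import numpy as np
    from scipy.special import digamma, polygamma

    def inverse_digamma(y, n_iter=5):
        """Return x such that digamma(x) is approximately y."""
        y = np.asarray(y, dtype=float)
        # Standard asymptotic initialization.
        x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
        for _ in range(n_iter):
            # Newton step: x <- x - (Psi(x) - y) / Psi'(x)
            x = x - (digamma(x) - y) / polygamma(1, x)
        return x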
  • Hyperparameter reestimation
    • A notable recent trend in Bayesian modeling?
      • [Asuncion et al. UAI’09]
        • Reestimates the hyperparameters of LDA
        • Overturns the conventional wisdom:
          • before: “VB < CVB < CGS”
          • after: “VB = CVB = CGS” (in perplexity)
      • [Masada et al. CIKM’09 (poster, to appear)]
  • Experiments
    • Datasets available on the Web
      • LK : Leukemia
        • http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=63
      • D1 : "Five types of breast cancer"
      • D2 : "Three types of bladder cancer"
      • D3 : "Healthy tissues"
        • http://www.ihes.fr/~zinovyev/princmanif2006/
  • Data specifications
    Dataset name (abbreviation)            # of samples   # of genes
    Leukemia ( LK )                        72             12582
    Five types of breast cancer ( D1 )     286            17816
    Three types of bladder cancer ( D2 )   40             3036
    Healthy tissues ( D3 )                 103            10383
  • Results
    • Can we achieve inference of better quality?
    • Can we achieve better sample clustering?
    • Are there any qualitative differences between MVB and MVB+ ?
  • Lower bound vs. number of iterations, for LK, D1, D2, and D3 (figures)
  • Lower bound after convergence vs. number of processes, for LK, D1, D2, and D3 (figures)
  • Sample clustering evaluation (averaged over 100 trials)
    dataset   method   precision       recall          F-score
    LK        MVB+     0.934 ± 0.007   0.931 ± 0.010   0.932 ± 0.009
    LK        MVB      0.930 ± 0.000   0.924 ± 0.000   0.927 ± 0.000
    D2        MVB+     0.837 ± 0.038   0.822 ± 0.032   0.829 ± 0.033
    D2        MVB      0.779 ± 0.084   0.751 ± 0.069   0.763 ± 0.071
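The slides do not state which clustering precision/recall definition is used; one common choice is the pairwise definition, sketched here purely as an assumption for readers who want to compute comparable numbers:

    # Pairwise precision/recall/F-score for a clustering (assumed metric, not from the slides).
    from itertools import combinations

    def pairwise_prf(pred_labels, true_labels):
        tp = fp = fn = 0
        for i, j in combinations(range(len(true_labels)), 2):
            same_pred = pred_labels[i] == pred_labels[j]
            same_true = true_labels[i] == true_labels[j]
            if same_pred and same_true:
                tp += 1          # pair correctly placed in the same cluster
            elif same_pred:
                fp += 1          # pair wrongly merged
            elif same_true:
                fn += 1          # pair wrongly split
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f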
  • Qualitative difference ( LK )
      • row = gene, column = sample
    • MVB+ preserves the diversity of genes
    (figures: MVB+ vs. MVB)
  • Qualitative difference ( D2 )
      • row = gene, column = sample
    • MVB+ preserves the diversity of genes
    (figures: MVB+ vs. MVB)
  • Conclusions
    • Formulas for hyperparameter reestimation
    • Improvement in inference quality
      • Larger variational lower bounds
      • Better sample clustering
      • Gene diversity preservation
  • Future work
    • Evaluate on more datasets to demonstrate effectiveness
    • Devise collapsed Gibbs sampling for LPD
    • Accelerate computations
      • OpenMP, Nvidia CUDA
    • Provide a method for gene clustering