Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation

J. Suzuki, "Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation," AMBN 2015, Yokohama, Japan

  1. Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation. Nov. 16-18, 2015. Joe Suzuki (Osaka Univ.), @joe_suzuki, Prof-joe
  2. Road Map • MI estimation for discrete variables (warming up) • MI estimation for discrete/continuous variables (proposed) • Experiment 1 (gene differential analysis) • Experiment 2 (combination of SNPs and gene expression) • Concluding remarks
  3. How do you estimate MI given data? For discrete data, a naïve way is the plug-in estimate computed from empirical frequencies (a minimal sketch follows the slide list).
  4. MI estimation based on MDL (Suzuki, UAI93). P. Liang and N. Srebro (2004), K. Panayidou (2010), and Edwards et al. (2010) revisited the same idea (a sketch of the penalized estimate follows the slide list).
  5. Overestimation occurs for the plug-in estimate I_n
  6. Chow-Liu (apply Kruskal): distribution known, approximate by the tree minimizing K-L divergence; distribution unknown, ML spanning trees (Chow-Liu) or MDL forests (Suzuki 93). A forest-learning sketch follows the slide list.
  7. R package bnlearn, data set "Asia": forest vs. spanning tree
  8. Correlation ≠ independence: uncorrelated variables can still be dependent (a worked example follows the slide list)
  9. X: Gaussian, Y: discrete (Edwards et al. 2010); close to Naïve Bayes (ANOVA); causal direction
  10. A Gaussian variable should not sit between discrete variables (Edwards et al. 2010): discrete (SNP) vs. Gaussian (gene expression)
  11. Proposed MI estimation: for each mesh (percentile grid), estimate the MI from the quantized data, then choose the maximum MI estimate (a sketch of this follows the slide list).
  12. n = 1000, 8×8 mesh: occurrences of X and of Y (marginals), and of (X, Y) (joint)
  13. The estimator does not distinguish discrete from continuous variables. For each u, v = 1, 2, …, s, continue to halve the intervals (avoiding splitting intervals that carry a point mass).
  14. What can be proved?
  15. Convex
  16. Experiment 1: genome expression profiling in breast cancer patients • 58 samples with a p53 mutation and 192 without • 1000 genes. Why only Bonferroni and FDR rather than causality and regression?
  17. Normality test: p-values are extremely low
  18. MI distributions for all 1000 genes and for the 50 genes with the smallest p-values
  19. • 20 seconds for the MI values and 30 seconds for a forest (1000 nodes) • The class variable has only one connection to the gene variables • We conclude that regression may be more appropriate than a graphical model.
  20. Experiment 2: 300 gene expressions (continuous) and 300 SNPs (3 values) • SNPs of 90 Utah residents (HapMap) with Northern and Western European ancestry • R library (BioConductor) GGData ftp://ftp.sanger.ac.uk/pub/genevar/CEU_parents_norm_march2007.zip
  21. 200 genes and 200 SNPs
  22. Causality among genes and SNPs can be explored! Insights from the experiment: • In the real causal structure, SNPs and genes are not separated as Edwards assumed • Both SNPs and gene expressions are hubs of the mixed network. (Variable cardinality: SNP, 3 values; gene expression, continuous.)
  23. Summary • MI estimation • Application to Chow-Liu • Gene differential analysis • Causality among SNPs and gene expressions. Future work, beyond forests: • BNs with bounded treewidth • MNs that are not necessarily forests
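
The naïve plug-in estimate from slide 3 is shown only as a formula image on the original page; here is a minimal sketch, assuming paired discrete samples in two equal-length sequences (the function name and the use of NumPy are illustrative, not from the talk):

```python
import numpy as np

def plugin_mi(x, y):
    """Naive plug-in estimate of I(X; Y) from paired discrete samples (in nats)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in np.unique(x):
        p_x = np.mean(x == a)
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))
            if p_xy > 0:
                p_y = np.mean(y == b)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```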
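
Slide 4's MDL-based estimator (Suzuki, UAI93) is likewise not written out here. A common form of the correction, which this sketch assumes, subtracts a description-length penalty of (a-1)(b-1) log n / (2n) from the plug-in value, where a and b are the cardinalities of X and Y; it reuses plugin_mi above:

```python
import numpy as np

def mdl_mi(x, y):
    """Plug-in MI minus an MDL penalty; values near or below zero suggest independence."""
    n = len(x)
    a = len(np.unique(x))                      # cardinality of X
    b = len(np.unique(y))                      # cardinality of Y
    penalty = (a - 1) * (b - 1) * np.log(n) / (2 * n)
    return plugin_mi(x, y) - penalty
```

The penalty grows with the size of the contingency table, which is what counters the overestimation of I_n noted on slide 5.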
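
For the forest learning of slide 6, a sketch in the Chow-Liu style under my reading of the slide: weight every pair of variables by the MDL-corrected MI, drop non-positive weights, and run Kruskal's maximum-spanning-tree algorithm. The talk itself works in R (bnlearn, slide 7); networkx here is only a stand-in, and mdl_mi is the function sketched above:

```python
import networkx as nx

def learn_forest(data):
    """data: dict mapping variable name -> 1-D array of discrete samples."""
    g = nx.Graph()
    g.add_nodes_from(data)
    names = list(data)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            w = mdl_mi(data[u], data[v])
            if w > 0:                          # dropping non-positive edges yields a forest
                g.add_edge(u, v, weight=w)
    # Kruskal on maximum weights; on a disconnected graph this returns a spanning forest
    return nx.maximum_spanning_tree(g, algorithm="kruskal")
```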
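
Slide 8's point in one concrete case: with symmetric X, the pair (X, Y) = (X, X^2) has correlation close to zero yet is clearly dependent, and a quantized MI estimate picks this up (the 4-level quantile binning below is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x ** 2                                         # uncorrelated with x, but dependent
print(np.corrcoef(x, y)[0, 1])                     # near 0
xq = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
yq = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
print(mdl_mi(xq, yq))                              # clearly positive
```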
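
Slides 11-13 describe the proposed estimator only at a high level; the sketch below is one possible reading, not the talk's code: quantize both variables on percentile meshes whose intervals are halved at each level (repeated values, i.e. point masses, are kept in a single bin by deduplicating the edges), score each mesh with the MDL-corrected MI, and return the maximum:

```python
import numpy as np

def quantize(z, bins):
    """Quantile-based quantization; repeated values (point masses) are not split."""
    probs = np.linspace(0, 1, bins + 1)[1:-1]
    edges = np.unique(np.quantile(z, probs))
    return np.digitize(z, edges)

def proposed_mi(x, y, max_level=5):
    """Maximize the MDL-corrected MI over meshes of 2, 4, 8, ... bins per axis."""
    best = 0.0
    for level in range(1, max_level + 1):
        s = 2 ** level                             # halving the intervals at each level
        best = max(best, mdl_mi(quantize(x, s), quantize(y, s)))
    return best
```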
