
Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation

J. Suzuki, "Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation," AMBN 2015, Yokohama, Japan

Published in: Science
  1. Forest Learning based on the Chow-Liu Algorithm and its Application to Genome Differential Analysis: A Novel Mutual Information Estimation. Nov. 16-18, 2015. Joe Suzuki (Osaka Univ.) @joe_suzuki Prof-joe
  2. Road Map • MI Estimation for Discrete (warming-up) • MI Estimation for Discrete/Continuous (proposed) • Experiment 1 (gene differential analysis) • Experiment 2 (combination of SNP and gene) • Concluding Remarks
  3. How do you estimate MI given data? For discrete data, a naïve way is
  4. MI Estimation based on MDL (Suzuki, UAI93). P. Liang and N. Srebro (2004), K. Panayidou (2010), and Edwards et al. (2010) revisited the same approach.
  5. Overestimation occurs for I_n
  6. Chow-Liu (apply Kruskal), by whether the distribution is known. Known: approximation by the tree with minimum K-L divergence. Unknown: ML spanning trees (Chow-Liu), MDL forests (Suzuki 93).
  7. R package bnlearn, data set “Asia”: forest and spanning tree
  8. Correlation ≠ Independence
  9. X: Gaussian, Y: discrete (Edwards et al. 2010); close to Naïve Bayes (ANOVA). Causal direction.
  10. A Gaussian variable should not be between discrete variables (Edwards et al. 2010). Discrete: SNP. Gaussian: gene expression.
  11. Proposed MI estimation: for each mesh (percentile), estimate the MI based on the quantized data; choose the maximum MI estimate.
  12. n = 1000, 8×8: occurrences of X, Y; occurrences of (X, Y)
  13. Does not distinguish discrete and continuous. For each u, v = 1, 2, …, s, continue to divide each interval in half (avoid dividing the mass intervals).
  14. What can be proved?
  15. Convex
  16. Experiment 1: Genome expression profiling in breast cancer patients • 58 samples with p53 mutation and 192 without it • 1000 genes • Why only Bonferroni and FDR rather than causality and regression?
  17. Normality test: p-values are extremely low
  18. MI distributions for all (1000) genes and for the 50 genes with the smallest p-values
  19. • 20 seconds for MI values and 30 seconds for a forest (1000 nodes) • The class variable has only one connection with gene variables • We conclude that regression may be more appropriate than a graphical model.
  20. Experiment 2: 300 gene expressions (continuous) and 300 SNPs (3 values) • SNPs of 90 Utah residents (HapMap) with northern and western European ancestry • R library (BioConductor) GGData • ftp://ftp.sanger.ac.uk/pub/genevar/CEU_parents_norm_march2007.zip
  21. 200 genes and 200 SNPs
  22. Causality among genes and SNPs can be explored! Insights we obtain from the experiment: • In the real causality, SNPs and genes are not separated as Edwards assumed. • Both SNPs and gene expressions are hubs of the mixed network. Variable and cardinality: SNP, 3 values; gene expression, continuous.
  23. Summary • MI estimation • Application to Chow-Liu • Gene Differential Analysis • Causality among SNPs and Gene Expressions. Future Works, Beyond Forests: • BNs with bounded TW • MNs not necessarily forests
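
The pipeline outlined on slides 3 to 6 and 11 to 13 (MI estimation, then Kruskal on the MI weights) can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes the naïve estimate is the plug-in estimate from empirical frequencies, that the MDL correction subtracts (a-1)(b-1) log(n) / (2n) with a and b taken as the numbers of distinct observed values, and that each mesh is built by equal-frequency (percentile) binning into 2^u by 2^v cells. The function names `plugin_mi`, `mdl_mi`, `quantize`, `mixed_mi`, `chow_liu_forest` and the parameter `max_depth` are hypothetical.

```python
import math
from bisect import bisect_left
from collections import Counter


def plugin_mi(xs, ys):
    """Naive (plug-in) MI estimate from empirical frequencies, in nats."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())


def mdl_mi(xs, ys):
    """Plug-in MI minus an MDL penalty (assumed form; may be negative)."""
    n = len(xs)
    a, b = len(set(xs)), len(set(ys))  # assumption: observed cardinalities
    return plugin_mi(xs, ys) - (a - 1) * (b - 1) * math.log(n) / (2 * n)


def quantize(values, bins):
    """Equal-frequency binning; tied values always fall in the same bin."""
    s = sorted(values)
    n = len(s)
    return [bisect_left(s, v) * bins // n for v in values]


def mixed_mi(xs, ys, max_depth=3):
    """Maximum penalized MI over 2^u x 2^v meshes, floored at zero."""
    best = 0.0
    for u in range(1, max_depth + 1):
        for v in range(1, max_depth + 1):
            best = max(best, mdl_mi(quantize(xs, 2 ** u), quantize(ys, 2 ** v)))
    return best


def chow_liu_forest(columns, mi=mixed_mi):
    """Kruskal on MI weights; stopping at non-positive weights gives a forest."""
    p = len(columns)
    edges = sorted(((mi(columns[i], columns[j]), i, j)
                    for i in range(p) for j in range(i + 1, p)), reverse=True)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    forest = []
    for w, i, j in edges:
        if w <= 0:
            break  # remaining pairs are treated as independent
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest


# Toy data: y copies x, z is empirically independent of both.
x = [i % 2 for i in range(100)]
y = list(x)
z = [(i // 2) % 2 for i in range(100)]
print(chow_liu_forest([x, y, z]))
```

Because `quantize` never separates tied values, a three-valued variable such as a SNP keeps at most three cells however fine the mesh, so the same estimator can be applied to discrete and continuous columns alike. Stopping Kruskal at the first non-positive weight is what lets the result be a forest rather than a spanning tree.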
