Project3.ppt

  1. Bayesian Networks in Bioinformatics
     Kyu-Baek Hwang, Biointelligence Lab, School of Computer Science and Engineering,
     Seoul National University (kbhwang@bi.snu.ac.kr)
     Copyright (c) 2002 by SNU CSE Biointelligence Lab
  2. Contents
     - Bayesian networks: preliminaries
     - Bayesian networks vs. causal networks
     - Partially directed acyclic graph (PDAG) representation of Bayesian networks
     - Structural learning of Bayesian networks
     - Classification using Bayesian networks
     - Microarray data analysis with Bayesian networks
     - Experimental results on the NCI60 data set
     - Term Project #3: diagnosis using Bayesian networks
  3. Bayesian Networks
     - A Bayesian network encodes the joint probability distribution over all of its variables:
       P(X_1, X_2, ..., X_n) = \prod_{i=1}^{n} P(X_i | Pa_i),
       where Pa_i is the set of parents of X_i.
     - Example (nodes A, B, C, D, E with edges A -> B, A -> C, B -> C, B -> D, C -> E):
       P(A, B, C, D, E) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D)
                        = P(A) P(B|A) P(C|A,B) P(D|B) P(E|C)
     - Local probability distribution for X_i: parameters \theta_i of P(X_i | Pa_i), where q_i is
       the number of configurations of Pa_i and r_i is the number of states of X_i.
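
As an illustration of the factorization above, here is a minimal Python sketch for the five-node example. The CPT entries are made-up (hypothetical) values for binary variables; only the structure of the product mirrors the slide.

```python
# Hypothetical CPTs for binary variables A, B, C, D, E with
# P(A,B,C,D,E) = P(A) P(B|A) P(C|A,B) P(D|B) P(E|C).
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}   # key: (b, a)
P_C_given_AB = {(c, a, b): 0.5 for c in (0, 1) for a in (0, 1) for b in (0, 1)}
P_D_given_B = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.25, (0, 0): 0.75}  # key: (d, b)
P_E_given_C = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.3, (0, 0): 0.7}    # key: (e, c)

def joint(a, b, c, d, e):
    """Joint probability as the product of the local (conditional) distributions."""
    return (P_A[a] * P_B_given_A[(b, a)] * P_C_given_AB[(c, a, b)]
            * P_D_given_B[(d, b)] * P_E_given_C[(e, c)])

print(joint(1, 0, 1, 1, 0))
```
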
  4. Knowing the Joint Probability Distribution
     - In principle, any conditional probability can be calculated from the joint probability distribution.
     - [Figure: a Bayesian network over Gene A-H and a Class node]
     - This Bayesian network can classify examples by calculating the appropriate conditional
       probability P(Class | other variables).
  5. Classification by Bayesian Networks I
     - Calculate the conditional probability of the 'Class' variable given the values of the other variables.
     - Infer the conditional probability from the joint probability distribution. For example,
       P(Class | Gene A, ..., Gene H)
         = P(Class, Gene A, ..., Gene H) / \sum_{Class} P(Class, Gene A, ..., Gene H),
       where the summation is taken over all possible class values.
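
A direct (brute-force) way to evaluate this is to compute the joint for every class value and normalize. A minimal sketch, assuming a hypothetical `joint(class_value, genes)` callable and a binary class:

```python
# Brute-force P(Class | genes): evaluate the joint for each class value and normalize.
# `joint` is a hypothetical callable returning P(Class = c, Gene A, ..., Gene H).
CLASS_VALUES = (0, 1)

def posterior_over_class(joint, genes):
    scores = {c: joint(c, genes) for c in CLASS_VALUES}
    z = sum(scores.values())                      # the summation over class values
    return {c: s / z for c, s in scores.items()}  # P(Class = c | Gene A, ..., Gene H)
```
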
  6. Knowing the Causal Structure
     - [Figure: the same network over Gene A-H and Class, read causally]
     - Gene C regulates Gene E and Gene F; Gene D regulates Gene G and Gene H; Class has an
       effect on Gene F and Gene G.
  7. Bayesian Networks vs. Causal Networks
     - In a Bayesian network, the network structure encodes conditional independencies; in a
       causal network, it encodes causal relationships.
     - By the d-separation property, the Bayesian network structure asserts that every node is
       conditionally independent of all of its non-descendants given the values of its immediate parents.
  8. Two Equivalent DAGs
     - The DAGs X -> Y and X <- Y assert the same thing: X and Y are dependent on each other.
     - Same conditional independencies -> same equivalence class.
     - Causal relationships are therefore hard to learn from observational data.
  9. Verma and Pearl's Theorem
     - Theorem: two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.
     - A v-structure (X, Z, Y) is X -> Z <- Y, where X and Y are parents of Z and not adjacent to each other.
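
The theorem translates directly into a small equivalence test. A sketch, assuming each DAG is given as a set of directed (parent, child) edge tuples over string node names:

```python
# Verma-Pearl criterion: two DAGs are equivalent iff they share the same
# skeleton and the same v-structures.
def skeleton(edges):
    return {frozenset(e) for e in edges}

def v_structures(edges):
    parents = {}
    for x, z in edges:
        parents.setdefault(z, set()).add(x)
    skel = skeleton(edges)
    vs = set()
    for z, ps in parents.items():
        for x in ps:
            for y in ps:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, z, y))   # X -> Z <- Y with X and Y non-adjacent
    return vs

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

print(equivalent({("X", "Y")}, {("Y", "X")}))                          # True
print(equivalent({("X", "Z"), ("Y", "Z")}, {("X", "Z"), ("Z", "Y")}))  # False
```
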
  10. PDAG Representations
     - Minimal PDAG representation of an equivalence class: the only directed edges are those
       that participate in v-structures.
     - Completed PDAG representation: every directed edge corresponds to a compelled edge, and
       every undirected edge corresponds to a reversible edge.
  11. Example: PDAG Representations
     - [Figure: an equivalence class of DAGs over X, Y, Z, W, V, its minimal PDAG, and its completed PDAG]
  12. Learning Bayesian Networks
     - Metric approach: use a scoring metric to measure how well a particular structure fits an
       observed set of cases; a search algorithm is used; find a canonical form of an equivalence class.
     - Independence approach: an independence oracle (approximated by some statistical test) is
       queried to identify the equivalence class that captures the independencies in the
       distribution from which the data was generated; search for a PDAG.
  13. Scoring Metrics for Bayesian Networks
     - Likelihood: L(G, \Theta_G, C) = P(C | G^h, \Theta_G), where G^h is the hypothesis that the
       data C was generated by a distribution that can be factored according to G.
     - The maximum-likelihood metric of G: M_ML(G, C) = \max_{\Theta_G} L(G, \Theta_G, C)
     - This metric prefers the complete graph structure.
  14. Information Criterion Scoring Metrics
     - The Akaike information criterion (AIC) metric:
       M_AIC(G, C) = \log M_ML(G, C) - Dim(G)
     - The Bayesian information criterion (BIC) metric (N is the sample size):
       M_BIC(G, C) = \log M_ML(G, C) - (1/2) Dim(G) \log N
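
A sketch of the BIC metric for a discrete network, assuming the data is a list of per-case dicts (variable -> state) and the structure is given as a parent map; the variable and state names are placeholders, not from the slides:

```python
import math
from collections import Counter

def bic_score(data, parents, states):
    """BIC = log maximum likelihood - (1/2) * Dim(G) * log N for a discrete BN.

    data    : list of dicts mapping each variable to its observed state
    parents : dict mapping each variable to its list of parent variables
    states  : dict mapping each variable to its list of possible states
    """
    n = len(data)
    score, dim = 0.0, 0
    for x, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[x]) for row in data)  # N_ijk
        marg = Counter(tuple(row[p] for p in pa) for row in data)             # N_ij
        # log maximum likelihood: sum over cells of N_ijk * log(N_ijk / N_ij)
        score += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        # free parameters of this family: q_i * (r_i - 1)
        q = max(1, math.prod(len(states[p]) for p in pa))
        dim += q * (len(states[x]) - 1)
    return score - 0.5 * dim * math.log(n)
```
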
  15. MDL Scoring Metrics
     - The minimum description length (MDL) metric 1:
       M_MDL1(G, C) = \log P(G) + M_BIC(G, C)
     - The minimum description length (MDL) metric 2:
       M_MDL2(G, C) = \log M_ML(G, C) - |E_G| \log N - c \, Dim(G),
       where |E_G| is the number of edges in G.
  16. Bayesian Scoring Metrics
     - A Bayesian metric: M(G, C) = \log P(G^h | \xi) + \log P(C | G^h, \xi) + c
     - The BDe (Bayesian Dirichlet & likelihood equivalence) metric:
       P(C | G^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i}
                         [ \Gamma(N'_{ij}) / \Gamma(N'_{ij} + N_{ij}) ]
                         \prod_{k=1}^{r_i} [ \Gamma(N'_{ijk} + N_{ijk}) / \Gamma(N'_{ijk}) ]
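
In practice the product of Gamma-function ratios is computed in log space. A sketch of the contribution of one family (one variable X and its parent set), assuming uniform hypothetical prior counts N'_ijk = n_prime, so that N'_ij = n_prime * r_i:

```python
import math
from collections import Counter

def log_bd_family(data, x, pa, states, n_prime=1.0):
    """Log of one family's factor in the BD metric, with uniform prior counts."""
    counts = Counter((tuple(row[p] for p in pa), row[x]) for row in data)  # N_ijk
    pa_configs = {cfg for cfg, _ in counts}
    r = len(states[x])
    total = 0.0
    for cfg in pa_configs:
        n_ij = sum(counts[(cfg, k)] for k in states[x])
        total += math.lgamma(n_prime * r) - math.lgamma(n_prime * r + n_ij)
        for k in states[x]:
            n_ijk = counts[(cfg, k)]
            total += math.lgamma(n_prime + n_ijk) - math.lgamma(n_prime)
    return total
```
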
  17. Greedy Search Algorithm for Bayesian Network Learning
     - Generate the initial Bayesian network structure G_0.
     - For m = 1, 2, 3, ..., until convergence: among all possible local changes to G_{m-1}
       (insertion of an edge, reversal of an edge, and deletion of an edge), perform the one that
       leads to the largest improvement in the score. The resulting graph is G_m.
     - Stopping criterion: Score(G_{m-1}) == Score(G_m).
     - At each iteration (for a network of n variables), O(n^2) local changes must be evaluated
       to select the best one.
     - Random restarts are usually adopted to escape local maxima.
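
A compact sketch of this greedy loop, where `score` is any structure-scoring function (e.g. one of the metrics above) and `is_dag` is an assumed acyclicity check; the structure is a dict mapping each variable to its parent set:

```python
def greedy_search(variables, score, is_dag):
    parents = {x: set() for x in variables}             # G_0: here, the empty graph
    current = score(parents)
    while True:
        best, best_move = current, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                cand = {v: set(ps) for v, ps in parents.items()}
                if y in cand[x]:
                    cand[x].discard(y)                   # delete the edge y -> x
                elif x in cand[y]:
                    cand[y].discard(x)                   # reverse x -> y ...
                    cand[x].add(y)                       # ... into y -> x
                else:
                    cand[x].add(y)                       # insert the edge y -> x
                if not is_dag(cand):
                    continue
                s = score(cand)
                if s > best:
                    best, best_move = s, cand
        if best_move is None:                            # no local change improves the score
            return parents
        parents, current = best_move, best
```
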
  18. Probabilistic Inference
     - Calculate the conditional probability given the values of the observed variables.
     - Methods: the junction tree algorithm, sampling methods.
     - General probabilistic inference is intractable; however, calculating the conditional
       probability needed for classification is rather straightforward because of the properties
       of the Bayesian network structure.
  19. The Markov Blanket
     - All the variables of interest: X = {X_1, X_2, ..., X_n}
     - For a variable X_i, its Markov blanket MB(X_i) is a subset of X - {X_i} which satisfies
       P(X_i | X - {X_i}) = P(X_i | MB(X_i)).
     - Markov boundary: the minimal Markov blanket.
  20. Markov Blanket in Bayesian Networks
     - Given the Bayesian network structure, determining the Markov blanket of a variable is
       straightforward, by the conditional independence assertions.
     - The Markov blanket of a node in the Bayesian network consists of all of its parents,
       children, and spouses (the other parents of its children).
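
Reading the Markov blanket off the structure takes only a few lines of code. A sketch using the gene/class network of these slides (the edge set is taken from the factorization shown on the next slide):

```python
def markov_blanket(node, parents):
    """Markov blanket = parents + children + the children's other parents (spouses)."""
    mb = set(parents[node])                              # parents
    for child, ps in parents.items():
        if node in ps:
            mb.add(child)                                # children
            mb |= ps - {node}                            # spouses
    return mb

parents = {"Class": {"Gene A", "Gene B"}, "Gene A": set(), "Gene B": set(),
           "Gene C": set(), "Gene D": set(), "Gene E": {"Gene C"},
           "Gene F": {"Gene C", "Class"}, "Gene G": {"Class", "Gene D"},
           "Gene H": {"Gene D"}}
print(sorted(markov_blanket("Class", parents)))
# ['Gene A', 'Gene B', 'Gene C', 'Gene D', 'Gene F', 'Gene G']
```
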
  21. Classification by Bayesian Networks II
     Writing A for Gene A, B for Gene B, and so on:
       P(Class | A, B, C, D, E, F, G, H)
         = P(Class, A, ..., H) / \sum_{Class} P(Class, A, ..., H)
         = P(A) P(B) P(C) P(Class|A,B) P(D) P(E|C) P(F|C,Class) P(G|Class,D) P(H|D)
           / \sum_{Class} P(A) P(B) P(C) P(Class|A,B) P(D) P(E|C) P(F|C,Class) P(G|Class,D) P(H|D)
         = P(Class|A,B) P(F|C,Class) P(G|Class,D)
           / \sum_{Class} P(Class|A,B) P(F|C,Class) P(G|Class,D)
     (the factors that do not involve Class cancel between the numerator and the denominator)
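
In code, the simplification means classification only needs the factors in which Class appears (its own family and its children's families). A sketch with hypothetical CPT callables, e.g. `p_class_given_ab(cls, a, b) -> float`:

```python
def classify(p_class_given_ab, p_f_given_c_class, p_g_given_class_d,
             a, b, c, d, f, g, class_values=(0, 1)):
    """P(Class | evidence) using only the Markov-blanket factors of Class."""
    unnorm = {cls: p_class_given_ab(cls, a, b)
                   * p_f_given_c_class(f, c, cls)
                   * p_g_given_class_d(g, cls, d)
              for cls in class_values}
    z = sum(unnorm.values())                    # normalization over class values
    return {cls: v / z for cls, v in unnorm.items()}
```
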
  22. DNA Microarrays
     - Monitor thousands of gene expression levels simultaneously (vs. traditional one-gene-at-a-time experiments).
     - Fabricated by high-speed robotics, using known probes.
  23. A Comparative Hybridization Experiment
     - [Figure: comparative hybridization workflow, followed by image analysis]
  24. Mining on Gene Expression and Drug Activity Data
     - Relationships among human cancer, gene expression, and drug activity.
     - Revealing these relationships can suggest causes and mechanisms of cancer development and
       new molecular targets for anti-cancer drugs.
  25. NCI (National Cancer Institute) Drug Discovery Program
     - The NCI60 cell lines data set.
  26. NCI60 Cell Lines Data Set
     - From 60 human cancer cell lines: colorectal, renal, ovarian, breast, prostate, lung, and
       central nervous system origin cancers, as well as leukemias and melanomas.
     - Gene expression patterns: cDNA microarray.
     - Drug activity patterns: sulphorhodamine B assay (changes in total cellular protein after
       48 hours of drug treatment).
  27. Schematic View of the Modeling Approach
     - Gene expression data and drug activity data are preprocessed (thresholding, clustering,
       discretization) into selected gene, drug, and cancer-type nodes.
     - A Bayesian network is learned over these nodes and used for dependency analysis and
       probabilistic inference.
  28. Data Preparation
     - cDNA microarray data: gene expression profiles on 60 cell lines (1376 x 60 matrix).
     - Drug activity data: drug activity patterns on 60 cell lines (118 x 60 matrix).
     - Combined: a (1376 + 118) x 60 data matrix (1376 genes and 118 drugs over 60 samples).
  29. Preprocessing
     - Thresholding: eliminate unknown ESTs (805 genes remain) and drugs with more than 4 missing
       values (84 drugs remain), giving an (805 + 84) x 60 matrix.
     - Discretization into three levels (-1, 0, 1) with a symmetric cutoff c: values below -c map
       to -1, values above +c map to +1, and the rest to 0.
     - Local probability model for the Bayesian networks: multinomial distribution.
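
A sketch of the three-level discretization step; the cutoff c and the matrix layout are as described above, while numpy is an implementation choice rather than part of the original:

```python
import numpy as np

def discretize(matrix, c):
    """Map values below -c to -1, above +c to +1, and the rest to 0."""
    x = np.asarray(matrix, dtype=float)
    return np.where(x < -c, -1, np.where(x > c, 1, 0))

# e.g. discretize(expression_matrix, c=0.5) on the 805-gene / 84-drug matrix
```
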
  30. Bayesian Network Learning for Gene-Drug Analysis
     - Large-scale Bayesian network: several hundred nodes (up to 890).
     - General greedy search is inapplicable because of time and space complexity.
     - Search heuristic: local-to-global search, which exploits the locality of Bayesian networks
       to reduce the entire search space.
     - The local structure is the Markov blanket: find a candidate Markov blanket (of
       pre-determined size k) for each node, thereby reducing the global search space.
  31. Local-to-Global Search Heuristics
     Input: a data set D, an initial Bayesian network structure B_0, and a decomposable scoring
     metric Score(B, D) = \sum_i Score(X_i | Pa_B(X_i), D).
     Output: a Bayesian network structure B.
     Loop for n = 1, 2, ..., until convergence:
     - Local search step:
       * Based on D and B_{n-1}, select for each X_i a set CB_i^n (|CB_i^n| <= k) of candidate
         Markov blanket members of X_i.
       * For each set {X_i, CB_i^n}, learn the local structure and determine the Markov blanket
         of X_i, BL^n(X_i), from this local structure.
       * Merge all Markov blanket structures G({X_i, BL^n(X_i)}, E_i) into a global network
         structure H^n (which could be cyclic).
     - Global search step:
       * Find the Bayesian network structure B_n contained in H^n which maximizes Score(B_n, D)
         and retains all non-cyclic edges of H^n.
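
A high-level sketch of this loop in Python, with the three inner steps left as hypothetical helper functions (`select_candidates`, `learn_local`, `merge_and_prune`), since the slide specifies them only at the pseudocode level:

```python
def local_to_global(variables, data, k, select_candidates, learn_local,
                    merge_and_prune, max_iter=10):
    network = {x: set() for x in variables}                  # B_0: here, the empty structure
    for _ in range(max_iter):                                # loop until convergence
        local_blankets = {}
        for x in variables:
            cand = select_candidates(x, network, data, k)    # candidate blanket, |CB_x| <= k
            local_blankets[x] = learn_local(x, cand, data)   # Markov blanket from local structure
        new_network = merge_and_prune(local_blankets, data)  # merge, then best acyclic structure
        if new_network == network:
            return network
        network = new_network
    return network
```
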
  32. Dimensionality Problem
     - The number of attributes (nodes) >> the sample size, so the structure of the learned
       Bayesian networks is unreliable and probabilistic inference is nearly impossible.
     - Remedy (in the preprocessing step): downsize the number of attributes by clustering; each
       prototype is the mean of all members in a cluster.
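
A sketch of the prototype construction; each prototype is the mean of its cluster's members, as stated above, while k-means and scikit-learn are assumptions of this sketch (the slides do not name a clustering method):

```python
import numpy as np
from sklearn.cluster import KMeans

def prototypes(matrix, k, random_state=0):
    """Cluster the rows (variables) into k groups and return one mean profile per cluster."""
    x = np.asarray(matrix, dtype=float)
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(x)
    return np.vstack([x[labels == i].mean(axis=0) for i in range(k)])
```
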
  33. Bayesian Network with 45 Prototypes
     - Node types (46 nodes in all): 40 gene prototypes, 5 drug prototypes, and the cancer label.
     - Discretization boundaries: -c, +c.
     - Bayesian network learning: vary the candidate Markov blanket size (k = 5 to 15) and select
       the best result.
     - Three data sets (c = 0.43, 0.50, 0.60) give three Bayesian networks, used for
       probabilistic inference.
     - Distribution ratio of the discretized values:
         c      -1       0        1
         0.43   33.3 %   33.3 %   33.3 %
         0.50   30.8 %   38.3 %   30.8 %
         0.60   27.4 %   45.1 %   27.4 %
  34. Correlations between ASNS and L-Asparaginase
     - Part of the Bayesian network (c = 0.60): the prototype for ASNS and SID W 484773,
       PYRROLINE-5-CARBOXYLATE REDUCTASE [5':AA037688, 3':AA037689] (G4) and the prototype for
       L-Asparaginase (D2).
     - Conditional probability table:
         P(D2|G4)   D2 = -1   D2 = 0    D2 = 1
         G4 = -1    0.32096   0.27086   0.40818
         G4 = 0     0.31387   0.41247   0.27366
         G4 = 1     0.32167   0.34920   0.32913
  35. Bayesian Networks on a Subset of Genes and Drugs
     - Node types (17 nodes in all): 12 genes, 4 drugs, and the cancer label, selected by
       clustering genes and drugs together and taking members from neighboring clusters.
     - Discretization boundaries: -c, +c.
     - Bayesian network learning: general greedy search with restarts (100 times); select the best result.
     - Three data sets (c = 0.43, 0.50, 0.60) give three Bayesian networks, used for
       probabilistic inference.
     - Distribution ratio of the discretized values:
         c      -1       0        1
         0.43   33.3 %   33.3 %   33.3 %
         0.50   30.8 %   38.3 %   30.8 %
         0.60   27.4 %   45.1 %   27.4 %
  36. Around the L-Asparaginase
     - [Figure: part of the Bayesian network (c = 0.6) around L-Asparaginase]
  37. Probabilistic Relationships Around the L-Asparaginase
     - D1: L-Asparaginase; G1: ASNS gene; G2: PYRROLINE-5-CARBOXYLATE REDUCTASE; L: leukemia.
     - Cancer type unobserved:
         P(D1|G1)   D1 = -1   D1 = 0    D1 = 1
         G1 = -1    0.19857   0.27471   0.52672
         G1 = 0     0.31110   0.49795   0.19095
         G1 = 1     0.42159   0.36279   0.21561

         P(D1|G2)   D1 = -1   D1 = 0    D1 = 1
         G2 = -1    0.27510   0.35226   0.37263
         G2 = 0     0.31621   0.41072   0.27307
         G2 = 1     0.33837   0.39664   0.26499
     - Cancer type observed (= leukemia):
         P(D1|G1,L)  D1 = -1   D1 = 0    D1 = 1
         G1 = -1     0.17536   0.22838   0.59626
         G1 = 0      0.27128   0.53790   0.19081
         G1 = 1      0.38500   0.42437   0.19063

         P(D1|G2,L)  D1 = -1   D1 = 0    D1 = 1
         G2 = -1     0.23812   0.33853   0.42335
         G2 = 0      0.27978   0.42666   0.29356
         G2 = 1      0.30371   0.42108   0.27520
  38. Term Project #3: Diagnosis Using Bayesian Networks
  39. Outline
     - Task 1: structural learning of the Bayesian network.
       * Generate data from the ALARM network.
       * Learn Bayesian network structures using more than two kinds of algorithms and scores.
       * Compare the learned results with respect to the edge errors for various sample sizes and
         learning algorithms (a sketch of such a comparison follows this slide).
     - Task 2: classification using Bayesian networks.
       * Arbitrarily divide the Leukemia data set into a training set and a test set.
       * Learn a Bayesian network from the training set using one of the metric-based approaches.
       * Evaluate the performance of the Bayesian network as a classifier (classification accuracy).
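
For the Task 1 comparison, one simple way to count edge errors is to compare each learned edge set against the true ALARM edge set. A sketch, assuming both structures are available as sets of directed (parent, child) tuples:

```python
def edge_errors(true_edges, learned_edges):
    """Count missing, extra, and reversed edges relative to the true structure."""
    true_skel = {frozenset(e) for e in true_edges}
    learned_skel = {frozenset(e) for e in learned_edges}
    return {
        "missing": len(true_skel - learned_skel),        # true edges not recovered
        "extra": len(learned_skel - true_skel),          # spurious learned edges
        "reversed": sum(1 for (u, v) in learned_edges
                        if (v, u) in true_edges and (u, v) not in true_edges),
    }
```
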
  40. Data Generation
     - Use the Netica software (http://www.norsys.com).
     - The ALARM network: 37 nodes, 46 edges.
  41. Structural Learning
     - Independence method: BN Power constructor (http://www.cs.ualberta.ca/~jcheng/bnsoft.htm).
     - Metric-based method: LearnBayes (http://www.cs.huji.ac.il/labs/compbio/LibB/); the MDL,
       BIC, BD, and likelihood scores can be used.
  42. The Leukemia Data Set
     - Class type: ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia).
     - Data set: 50 attributes (gene expression levels, 0 or 1), 72 samples.
  43. Submission
     - Deadline: 2002. 11. 27
     - Location: 301-419
