
- 1. Bayesian Networks in Bioinformatics Kyu-Baek Hwang Biointelligence Lab School of Computer Science and Engineering Seoul National University kbhwang@bi.snu.ac.kr
- 2. Copyright (c) 2002 by SNU CSE Biointelligence Lab 2 Contents Bayesian networks – preliminaries Bayesian networks vs. causal networks Partially DAG representation of the Bayesian network Structural learning of the Bayesian network Classification using Bayesian networks Microarray data analysis with Bayesian networks Experimental results on the NCI60 data set Term Project #3 Diagnosis using Bayesian networks
- 3. Copyright (c) 2002 by SNU CSE Biointelligence Lab 3 Bayesian Networks The joint probability distribution over all the variables in the Bayesian network: P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa_i). Example (A and B are root nodes, C has parents A and B, D has parent B, and E has parent C): P(A, B, C, D, E) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D) = P(A) P(B) P(C|A,B) P(D|B) P(E|C). Local probability distribution for Xi: Pa_i is the set of parents of Xi, θ denotes the parameters of P(Xi | Pa_i), q_i is the # of configurations of Pa_i, and r_i is the # of states of Xi.
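To make the factorization concrete, here is a minimal Python sketch of the slide's example network, where A and B are roots, C has parents A and B, D has parent B, and E has parent C. All CPT numbers are illustrative, not taken from the lecture.

```python
from itertools import product

# Toy CPTs for P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C).
# Each entry maps a parent-value tuple to P(variable = 1 | parents);
# the numbers are made up for illustration only.
cpts = {
    "A": ((), {(): 0.4}),
    "B": ((), {(): 0.3}),
    "C": (("A", "B"), {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}),
    "D": (("B",), {(0,): 0.2, (1,): 0.7}),
    "E": (("C",), {(0,): 0.25, (1,): 0.75}),
}

def joint(assign):
    """P(x1,...,xn) = prod_i P(xi | Pa(xi)) for a full binary assignment."""
    p = 1.0
    for var, (parents, table) in cpts.items():
        p1 = table[tuple(assign[q] for q in parents)]
        p *= p1 if assign[var] == 1 else 1.0 - p1
    return p

# Sanity check: the joint distribution sums to 1 over all 2^5 assignments.
total = sum(joint(dict(zip("ABCDE", v))) for v in product((0, 1), repeat=5))
```

Because each local table is a proper conditional distribution, the product always defines a valid joint distribution, whatever the (acyclic) structure.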
- 4. Copyright (c) 2002 by SNU CSE Biointelligence Lab 4 Knowing the Joint Probability Distribution We can calculate any conditional probability from the joint probability distribution in principle. Gene B Class Gene F Gene G Gene A Gene C Gene D Gene E Gene H This Bayesian network can classify the examples by calculating the appropriate conditional probabilities. P(Class| other variables)
- 5. Copyright (c) 2002 by SNU CSE Biointelligence Lab 5 Classification by Bayesian Networks I Calculate the conditional probability of the ‘Class’ variable given the values of the other variables. Infer the conditional probability from the joint probability distribution. For example, P(Class | Gene A, Gene B, ..., Gene H) = P(Class, Gene A, Gene B, ..., Gene H) / Σ_Class P(Class, Gene A, Gene B, ..., Gene H), where the summation in the denominator is taken over all the possible class values.
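A minimal sketch of this normalization, using a hypothetical two-gene network Class → GeneF, Class → GeneG (the probability values are illustrative, not from the slides):

```python
# P(Class), P(GeneF = 1 | Class), P(GeneG = 1 | Class); toy numbers.
p_class = {0: 0.5, 1: 0.5}
p_f = {0: 0.2, 1: 0.9}
p_g = {0: 0.3, 1: 0.7}

def joint(c, f, g):
    """Joint probability of one full assignment (Class = c, GeneF = f, GeneG = g)."""
    pf = p_f[c] if f == 1 else 1.0 - p_f[c]
    pg = p_g[c] if g == 1 else 1.0 - p_g[c]
    return p_class[c] * pf * pg

def posterior(f, g):
    """P(Class | GeneF = f, GeneG = g): joint divided by the sum over class values."""
    z = joint(0, f, g) + joint(1, f, g)
    return {c: joint(c, f, g) / z for c in (0, 1)}

post = posterior(1, 1)   # both genes expressed
```

The denominator z is exactly the slide's summation over all possible class values.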
- 6. Copyright (c) 2002 by SNU CSE Biointelligence Lab 6 Knowing the Causal Structure Gene B Class Gene F Gene G Gene A Gene C Gene D Gene E Gene H Gene C regulates Gene E and F. Gene D regulates Gene G and H. Class has an effect on Gene F and G.
- 7. Copyright (c) 2002 by SNU CSE Biointelligence Lab 7 Bayesian Networks vs. Causal Networks The network structure of a Bayesian network encodes conditional independencies; that of a causal network encodes causal relationships. By the d-separation property of the Bayesian network structure, the network structure asserts that every node is conditionally independent of all of its non-descendants given the values of its immediate parents.
- 8. Copyright (c) 2002 by SNU CSE Biointelligence Lab 8 Two Equivalent DAGs X → Y and X ← Y: both DAGs assert only that X and Y are dependent on each other. DAGs encoding the same conditional independencies form an equivalence class. Causal relationships are hard to learn from observational data.
- 9. Copyright (c) 2002 by SNU CSE Biointelligence Lab 9 Verma and Pearl’s Theorem Theorem: Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. X Y Z v-structure (X, Z, Y) : X and Y are parents of Z and not adjacent to each other.
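The Verma–Pearl criterion is easy to check mechanically. A hedged sketch, with DAGs given as child → list-of-parents dicts (node names hypothetical):

```python
def skeleton(parents):
    """The undirected edge set of the DAG."""
    return {frozenset((child, p)) for child, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples (X, Z, Y) where X and Y are both parents of Z and not adjacent."""
    skel, vs = skeleton(parents), set()
    for child, ps in parents.items():
        for x in ps:
            for y in ps:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, child, y))
    return vs

def equivalent(g1, g2):
    """Verma-Pearl: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, X → Y and Y → X share a skeleton and have no v-structures, so they are equivalent; the chain X → Z → Y and the collider X → Z ← Y share a skeleton but differ in the v-structure (X, Z, Y), so they are not.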
- 10. Copyright (c) 2002 by SNU CSE Biointelligence Lab 10 PDAG Representations Minimal PDAG representations of the equivalence class The only directed edges are those that participate in v-structures. Completed PDAG representation Every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.
- 11. Copyright (c) 2002 by SNU CSE Biointelligence Lab 11 Example: PDAG Representations X Y Z W V X Y Z W V X Y Z W V X Y Z W V An equivalence class Minimal PDAG Completed PDAG
- 12. Copyright (c) 2002 by SNU CSE Biointelligence Lab 12 Learning Bayesian Networks Metric approach Use a scoring metric to measure how well a particular structure fits an observed set of cases. A search algorithm is used. Find a canonical form of an equivalence class. Independence approach An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated. Search for a PDAG
- 13. Copyright (c) 2002 by SNU CSE Biointelligence Lab 13 Scoring Metrics for Bayesian Networks Likelihood L(G, Θ_G, C) = P(C | G^h, Θ_G), where G^h is the hypothesis that the data C was generated by a distribution that can be factored according to G. The maximum likelihood metric of G: M_ML(G, C) = max_{Θ_G} L(G, Θ_G, C) → prefers the complete graph structure.
- 14. Copyright (c) 2002 by SNU CSE Biointelligence Lab 14 Information Criterion Scoring Metrics The Akaike information criterion (AIC) metric: M_AIC(G, C) = log M_ML(G, C) − Dim(G). The Bayesian information criterion (BIC) metric: M_BIC(G, C) = log M_ML(G, C) − (1/2) Dim(G) log N.
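A small sketch of these penalized-likelihood scores, together with the usual parameter count for multinomial local distributions, Dim(G) = Σ_i (r_i − 1) q_i (the function and variable names here are illustrative):

```python
import math

def aic_score(log_ml, dim_g):
    """M_AIC = log M_ML(G, C) - Dim(G)."""
    return log_ml - dim_g

def bic_score(log_ml, dim_g, n):
    """M_BIC = log M_ML(G, C) - (1/2) Dim(G) log N, with N samples."""
    return log_ml - 0.5 * dim_g * math.log(n)

def dim_multinomial(cards, parents):
    """Dim(G) = sum_i (r_i - 1) * q_i, where r_i is the number of states of
    X_i (cards[var]) and q_i the number of parent configurations."""
    total = 0
    for var, ps in parents.items():
        q = 1
        for p in ps:
            q *= cards[p]
        total += (cards[var] - 1) * q
    return total
```

For two binary variables with an edge A → B, Dim(G) = (2 − 1)·1 + (2 − 1)·2 = 3, so BIC penalizes this structure by 1.5·log N relative to the raw log-likelihood.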
- 15. Copyright (c) 2002 by SNU CSE Biointelligence Lab 15 MDL Scoring Metrics The minimum description length (MDL) metrics: M_MDL1(G, C) = log P(G) + M_BIC(G, C), and M_MDL2(G, C) = log M_ML(G, C) − c |E_G| − (1/2) Dim(G) log N, where |E_G| is the number of edges in G and c is the encoding cost per edge.
- 16. Copyright (c) 2002 by SNU CSE Biointelligence Lab 16 Bayesian Scoring Metrics A Bayesian metric: M(G, C) = log P(G^h) + log P(C | G^h) + c, for a constant c. The BDe (Bayesian Dirichlet & likelihood equivalence) metric: P(C | G^h) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ Γ(N′_ij) / Γ(N′_ij + N_ij) ] ∏_{k=1}^{r_i} [ Γ(N′_ijk + N_ijk) / Γ(N′_ijk) ], where N_ijk is the number of cases with Xi in state k and Pa_i in configuration j, N_ij = Σ_k N_ijk, and the N′ terms are the corresponding Dirichlet prior counts.
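In log space the BDe product becomes a sum of log-gamma terms, which is how it is computed in practice to avoid overflow. A hedged sketch (the nested-list layout of the counts is an assumption of this example, not a standard API):

```python
from math import lgamma

# counts[i][j][k] = N_ijk and priors[i][j][k] = N'_ijk, indexed by
# variable i, parent configuration j, and state k.
def log_bde(counts, priors):
    """log P(C | G^h) = sum_ij [lgamma(N'_ij) - lgamma(N'_ij + N_ij)]
                      + sum_ijk [lgamma(N'_ijk + N_ijk) - lgamma(N'_ijk)]."""
    score = 0.0
    for n_i, a_i in zip(counts, priors):
        for n_ij, a_ij in zip(n_i, a_i):
            score += lgamma(sum(a_ij)) - lgamma(sum(a_ij) + sum(n_ij))
            for n, a in zip(n_ij, a_ij):
                score += lgamma(a + n) - lgamma(a)
    return score
```

Sanity check: one binary variable with a uniform Dirichlet(1, 1) prior and a single observation has marginal likelihood 1/2, i.e. log-score −log 2.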
- 17. Copyright (c) 2002 by SNU CSE Biointelligence Lab 17 Greedy Search Algorithm for Bayesian Network Learning Generate the initial Bayesian network structure G0. For m = 1, 2, 3, …, until convergence: among all the possible local changes (insertion of an edge, reversal of an edge, and deletion of an edge) in Gm–1, the one that leads to the largest improvement in the score is performed. The resulting graph is Gm. Stopping criterion: Score(Gm–1) == Score(Gm). At each iteration (learning a Bayesian network consisting of n variables), O(n2) local changes should be evaluated to select the best one. Random restarts are usually adopted to escape local maxima.
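The loop above can be sketched in a few lines. This is a hedged, simplified illustration: `score` is any scoring metric passed in as a callable, graphs are child → parent-set dicts, and the acyclicity check a real implementation needs is omitted for brevity.

```python
def edge_changes(parents):
    """Yield every graph one edge insertion, deletion, or reversal away.
    (No acyclicity check here; a real learner must reject cyclic results.)"""
    nodes = list(parents)
    for x in nodes:
        for y in nodes:
            if x == y:
                continue
            if x in parents[y]:
                g = {v: set(ps) for v, ps in parents.items()}
                g[y].discard(x)          # deletion of x -> y
                yield g
                g = {v: set(ps) for v, ps in parents.items()}
                g[y].discard(x)
                g[x].add(y)              # reversal of x -> y
                yield g
            else:
                g = {v: set(ps) for v, ps in parents.items()}
                g[y].add(x)              # insertion of x -> y
                yield g

def greedy_search(initial, score):
    """Hill climbing: apply the best local change until the score stops improving."""
    current, best = initial, score(initial)
    while True:
        nxt = max(edge_changes(current), key=score)
        if score(nxt) <= best:           # stopping criterion
            return current
        current, best = nxt, score(nxt)
```

With n nodes there are O(n^2) ordered pairs, hence the O(n^2) candidate changes per iteration mentioned on the slide; random restarts would wrap `greedy_search` in a loop over different initial structures.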
- 18. Copyright (c) 2002 by SNU CSE Biointelligence Lab 18 Probabilistic Inference Calculate the conditional probability given the values of the observed variables. Junction tree algorithm Sampling method General probabilistic inference is intractable. However, calculation of the conditional probability for the classification is rather straightforward because of the property of the Bayesian network structure.
- 19. Copyright (c) 2002 by SNU CSE Biointelligence Lab 19 The Markov Blanket All the variables of interest: X = {X1, X2, …, Xn}. For a variable Xi, its Markov blanket MB(Xi) is a subset of X − {Xi} which satisfies P(Xi | X − {Xi}) = P(Xi | MB(Xi)). Markov boundary: a minimal Markov blanket.
- 20. Copyright (c) 2002 by SNU CSE Biointelligence Lab 20 Markov Blanket in Bayesian Networks Given the Bayesian network structure, the determination of the Markov blanket of a variable is straightforward. By the conditional independence assertions. Gene B Class Gene F Gene G Gene A Gene C Gene D Gene E Gene H The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses, and children.
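The parents-children-spouses rule is a one-liner over the graph. A sketch, using the slide's gene network as reconstructed from the factorization on the neighboring slides (Class has parents Gene A and Gene B, children Gene F and Gene G):

```python
def markov_blanket(parents, x):
    """MB(x) = parents(x) | children(x) | spouses(x), where spouses are the
    other parents of x's children. `parents` is a child -> parents dict."""
    children = {c for c, ps in parents.items() if x in ps}
    spouses = {p for c in children for p in parents[c]}
    return (set(parents[x]) | children | spouses) - {x}

network = {
    "GeneA": [], "GeneB": [], "GeneC": [], "GeneD": [],
    "Class": ["GeneA", "GeneB"],
    "GeneE": ["GeneC"],
    "GeneF": ["Class", "GeneC"],
    "GeneG": ["Class", "GeneD"],
    "GeneH": ["GeneD"],
}
mb = markov_blanket(network, "Class")
```

Genes E and H fall outside MB(Class): given the blanket, Class is independent of them.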
- 21. Copyright (c) 2002 by SNU CSE Biointelligence Lab 21 Classification by Bayesian Networks II P(Class | Gene A, ..., Gene H) = P(Class, Gene A, ..., Gene H) / Σ_Class P(Class, Gene A, ..., Gene H) = P(A) P(B) P(C) P(D) P(Class|A,B) P(E|C) P(F|Class,C) P(G|Class,D) P(H|D) / Σ_Class [P(A) P(B) P(C) P(D) P(Class|A,B) P(E|C) P(F|Class,C) P(G|Class,D) P(H|D)]. Every factor that does not contain Class cancels between the numerator and the denominator, leaving P(Class | Gene A, ..., Gene H) = P(Class|A,B) P(F|Class,C) P(G|Class,D) / Σ_Class P(Class|A,B) P(F|Class,C) P(G|Class,D): only the Markov blanket of Class matters.
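This cancellation can be checked numerically. A hedged Python sketch with made-up CPT entries: the posterior over Class is computed from just the three Markov-blanket factors P(Class|A,B), P(F|Class,C), and P(G|Class,D); genes E and H never appear.

```python
# Toy CPTs (all numbers illustrative). Each table stores P(variable = 1 | key).
p_class1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}  # key (a, b)
p_f1 = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.8}      # key (class, c)
p_g1 = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.7}      # key (class, d)

def bern(p, v):
    """P(V = v) for a binary variable with P(V = 1) = p."""
    return p if v == 1 else 1.0 - p

def posterior(a, b, c, d, f, g):
    """P(Class | evidence) via the Markov-blanket factors only."""
    unnorm = {cls: bern(p_class1[(a, b)], cls)
                   * bern(p_f1[(cls, c)], f)
                   * bern(p_g1[(cls, d)], g)
              for cls in (0, 1)}
    z = sum(unnorm.values())
    return {cls: v / z for cls, v in unnorm.items()}

post = posterior(1, 1, 1, 1, 1, 1)
```

The factors P(A), P(B), ..., P(E|C), P(H|D) were never needed: they are identical in the numerator and the denominator, exactly as in the derivation above.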
- 22. Copyright (c) 2002 by SNU CSE Biointelligence Lab 22 DNA Microarrays Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene-at-a-time experiments. Fabricated by high-speed robotics. Known probes.
- 23. Copyright (c) 2002 by SNU CSE Biointelligence Lab 23 A Comparative Hybridization Experiment Image analysis
- 24. Copyright (c) 2002 by SNU CSE Biointelligence Lab 24 Mining on Gene Expression and Drug Activity Data Relationships among human cancer, gene expression, and drug activity. Revealing these relationships → causes and mechanisms of cancer development; new molecular targets for anti-cancer drugs.
- 25. Copyright (c) 2002 by SNU CSE Biointelligence Lab 25 NCI (National Cancer Institute) Drug Discovery Program NCI 60 cell lines data set
- 26. Copyright (c) 2002 by SNU CSE Biointelligence Lab 26 NCI60 Cell Lines Data Set From 60 human cancer cell lines: cancers of colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin, as well as leukemias and melanomas. Gene expression patterns: cDNA microarray. Drug activity patterns: sulphorhodamine B assay, measuring changes in total cellular protein after 48 hours of drug treatment.
- 27. Copyright (c) 2002 by SNU CSE Biointelligence Lab 27 Schematic View of the Modeling Approach Gene B Cancer Drug B Drug A Gene A - Selected genes, drugs and cancer type node Drug A Cancer Drug B Gene B Gene A < Learned Bayesian network > - Dependency analysis - Probabilistic inference Drug activity Data Gene Expression Data Preprocessing - Thresholding - Clustering - Discretization
- 28. Copyright (c) 2002 by SNU CSE Biointelligence Lab 28 Data Preparation cDNA microarray data: gene expression profiles on 60 cell lines, a 1376 × 60 matrix. Drug activity data: drug activity patterns on 60 cell lines, a 118 × 60 matrix. Together: a (1376 + 118) × 60 data matrix (1376 genes and 118 drugs over the same 60 samples).
- 29. Copyright (c) 2002 by SNU CSE Biointelligence Lab 29 Preprocessing Thresholding: elimination of unknown ESTs → 805 genes; elimination of drugs which have more than 4 missing values → 84 drugs. The (1376 genes + 118 drugs) × 60 samples matrix is thus reduced to (805 genes + 84 drugs) × 60 samples. Discretization (the local probability model for Bayesian networks is the multinomial distribution): values below −c map to −1, values between −c and +c to 0, and values above +c to 1.
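A minimal sketch of the ±c threshold discretization described above (the cutoff value 0.5 and the sample row are illustrative):

```python
def discretize(value, c):
    """Map a continuous expression/activity value to {-1, 0, 1} using +/- c."""
    if value < -c:
        return -1
    if value > c:
        return 1
    return 0

row = [-0.8, 0.1, 0.55, -0.3, 0.7]
print([discretize(v, 0.5) for v in row])   # -> [-1, 0, 1, 0, 1]
```

Raising c sends more values to the middle state 0, which is exactly the effect visible in the distribution-ratio tables on the later slides.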
- 30. Copyright (c) 2002 by SNU CSE Biointelligence Lab 30 Bayesian Network Learning for Gene-Drug Analysis Large-scale Bayesian network: several hundred nodes (up to 890). General greedy search is inapplicable because of time and space complexity. Search heuristics: local-to-global search heuristics exploit the locality of Bayesian networks to reduce the entire search space. The local structure: the Markov blanket. Finding the candidate Markov blanket (of pre-determined size k) of each node reduces the global search space.
- 31. Copyright (c) 2002 by SNU CSE Biointelligence Lab 31 Local to Global Search Heuristics Input: a data set D; an initial Bayesian network structure B0; a decomposable scoring metric Score(B, D) = Σ_i Score(Xi | Pa_B(Xi), D). Output: a Bayesian network structure B. Loop for n = 1, 2, …, until convergence. Local search step: based on D and Bn–1, select for each Xi a set CBi^n (|CBi^n| ≤ k) of candidate Markov blanket members of Xi; for each set {Xi, CBi^n}, learn the local structure and determine the Markov blanket of Xi, BLn(Xi), from this local structure; merge all Markov blanket structures G({Xi, BLn(Xi)}, Ei) into a global network structure Hn (could be cyclic). Global search step: find the Bayesian network structure Bn ⊆ Hn which maximizes Score(Bn, D) and retains all non-cyclic edges in Hn.
- 32. Copyright (c) 2002 by SNU CSE Biointelligence Lab 32 Dimensionality Problem The number of attributes (nodes) >> the sample size → unreliable structure of the learned Bayesian networks; probabilistic inference is nearly impossible. Downsize the number of attributes by clustering (in the preprocessing step); the prototype of a cluster is the mean of all its members.
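The prototype step is just a column-wise mean over a cluster's member profiles. A small sketch (pure Python, no NumPy assumed; the two toy gene rows are illustrative):

```python
def prototype(cluster_rows):
    """Mean profile of a cluster: one row per gene, one column per sample."""
    n = len(cluster_rows)
    return [sum(col) / n for col in zip(*cluster_rows)]

# Two hypothetical gene profiles measured over three samples.
genes = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0]]
print(prototype(genes))   # -> [2.0, 2.0, 2.0]
```

Replacing each cluster by its prototype shrinks the attribute count (e.g. 805 genes down to 40 gene prototypes on the next slide) so that structure learning and inference become feasible with only 60 samples.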
- 33. Copyright (c) 2002 by SNU CSE Biointelligence Lab 33 Bayesian Network with 45 Prototypes Node types (46 nodes in all): 40 gene prototypes, 5 drug prototypes, and the cancer label. Discretization boundaries: −c, +c. Bayesian network learning: varying the candidate Markov blanket size (k = 5 ~ 15) and selecting the best one. Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks → probabilistic inference. Distribution ratio (−1 / 0 / 1): c = 0.43 → 33.3% / 33.3% / 33.3%; c = 0.50 → 30.8% / 38.3% / 30.8%; c = 0.60 → 27.4% / 45.1% / 27.4%.
- 34. Copyright (c) 2002 by SNU CSE Biointelligence Lab 34 Correlations between ASNS and L-Asparaginase Part of the Bayesian network (c = 0.60) Prototype for ASNS and SID W 484773, PYRROLINE-5- CARBOXYLATE REDUCTASE [5':AA037688, 3':AA037689] Prototype for L-Asparaginase P(D2|G4) D2 = -1 D2 = 0 D2 = 1 G4 = -1 0.32096 0.27086 0.40818 G4 = 0 0.31387 0.41247 0.27366 G4 = 1 0.32167 0.34920 0.32913 < Conditional probability table >
- 35. Copyright (c) 2002 by SNU CSE Biointelligence Lab 35 Bayesian Networks on Subset of Genes and Drugs Node types (17 nodes in all): 12 genes, 4 drugs, and the cancer label, taken from neighboring clusters (clustering of genes and drugs together). Discretization boundaries: −c, +c. Bayesian network learning: general greedy search with restart (100 times), selecting the best one. Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks → probabilistic inference. Distribution ratio (−1 / 0 / 1): c = 0.43 → 33.3% / 33.3% / 33.3%; c = 0.50 → 30.8% / 38.3% / 30.8%; c = 0.60 → 27.4% / 45.1% / 27.4%.
- 36. Copyright (c) 2002 by SNU CSE Biointelligence Lab 36 Around the L-Asparaginase < Part of the Bayesian network (c = 0.6) >
- 37. Copyright (c) 2002 by SNU CSE Biointelligence Lab 37 Probabilistic Relationships Around the L-Asparaginase Cancer type unobserved D1: L-Asparaginase G1: ASNS gene G2: PYRROLINE-5-CARBOXYLATE REDUCTASE P(D1|G1) D1 = -1 D1 = 0 D1 = 1 G1 = -1 0.19857 0.27471 0.52672 G1 = 0 0.31110 0.49795 0.19095 G1 = 1 0.42159 0.36279 0.21561 Cancer type observed (= leukemia) D1: L-Asparaginase G1: ASNS gene G2: PYRROLINE-5-CARBOXYLATE REDUCTASE P(D1|G2) D1 = -1 D1 = 0 D1 = 1 G2 = -1 0.27510 0.35226 0.37263 G2 = 0 0.31621 0.41072 0.27307 G2 = 1 0.33837 0.39664 0.26499 P(D1|G1,L) D1 = -1 D1 = 0 D1 = 1 G1 = -1 0.17536 0.22838 0.59626 G1 = 0 0.27128 0.53790 0.19081 G1 = 1 0.38500 0.42437 0.19063 P(D1|G2,L) D1 = -1 D1 = 0 D1 = 1 G2 = -1 0.23812 0.33853 0.42335 G2 = 0 0.27978 0.42666 0.29356 G2 = 1 0.30371 0.42108 0.27520
- 38. Term Project #3: Diagnosis Using Bayesian Networks
- 39. Copyright (c) 2002 by SNU CSE Biointelligence Lab 39 Outline Task 1: structural learning of the Bayesian network. Data generation from the ALARM network. Structural learning of Bayesian networks using two or more kinds of algorithms and scores. Compare the learned results w.r.t. the edge errors across various sample sizes and learning algorithms. Task 2: classification using Bayesian networks. Arbitrarily divide the Leukemia data set into a training set and a test set. Learn a Bayesian network from the training set using one of the metric-based approaches. Evaluate the performance of the Bayesian network as a classifier (classification accuracy).
- 40. Copyright (c) 2002 by SNU CSE Biointelligence Lab 40 Data Generation Using the Netica Software (http://www.norsys.com) The ALARM network # of nodes: 37 # of edges: 46
- 41. Copyright (c) 2002 by SNU CSE Biointelligence Lab 41 Structural Learning Independence method: BN Power constructor (http://www.cs.ualberta.ca/~jcheng/bnsoft.htm). Metric-based method: LearnBayes (http://www.cs.huji.ac.il/labs/compbio/LibB/). The MDL, BIC, BD, and likelihood scores can be used.
- 42. Copyright (c) 2002 by SNU CSE Biointelligence Lab 42 The Leukemia Data Set Class type ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia) Data set # of attributes: 50 gene expression levels (0 or 1) # of samples: 72
- 43. Copyright (c) 2002 by SNU CSE Biointelligence Lab 43 Submission Deadline: 2002. 11. 27 Location: 301-419