
A Tutorial on Inference and Learning




  1. A Tutorial on Inference and Learning in Bayesian Networks. Irina Rish and Moninder Singh, IBM T.J. Watson Research Center.
  2. “Road map”
  - Introduction: Bayesian networks
    - What are BNs: representation, types, etc.
    - Why use BNs: applications (classes) of BNs
    - Information sources, software, etc.
  - Probabilistic inference
    - Exact inference
    - Approximate inference
  - Learning Bayesian networks
    - Learning parameters
    - Learning graph structure
  - Summary
  3. Bayesian Networks [Lauritzen & Spiegelhalter, 95]
  Network over Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), and Dyspnoea (D), with CPDs P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B). Conditional independencies give an efficient representation:
  P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
  Example CPD, P(D|T,L,B):
  T L B | D=0  D=1
  0 0 0 | 0.1  0.9
  0 0 1 | 0.7  0.3
  0 1 0 | 0.8  0.2
  0 1 1 | 0.9  0.1
  ...
  4. Bayesian Networks
  - Structured, graphical representation of probabilistic relationships between several random variables
  - Explicit representation of conditional independencies
  - Missing arcs encode conditional independence
  - Efficient representation of the joint pdf
  - Allows arbitrary queries to be answered, e.g., P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
  5. Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]. A network whose nodes include Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, and Local Disk Space Adequate.
  6. Example: Microsoft Pregnancy and Child Care [Heckerman, 95]
  7. Example: Microsoft Pregnancy and Child Care (cont'd) [Heckerman, 95]
  8. Independence Assumptions
  - Head-to-tail: Visit to Asia -> Tuberculosis -> Chest X-ray
  - Tail-to-tail: Lung Cancer <- Smoking -> Bronchitis
  - Head-to-head: Lung Cancer -> Dyspnoea <- Bronchitis
  9. Independence Assumptions
  - Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if
    - every head-to-head node along the trail is in Z or has a descendant in Z, and
    - every other node along the trail is not in Z
  - Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y
  - If X and Y are d-separated by Z, then X and Y are conditionally independent given Z
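The d-separation criterion above can be checked mechanically. Below is a minimal Python sketch (not from the slides; all names are ours) of the equivalent ancestral-graph test: restrict to the ancestors of X, Y, and Z, moralize that subgraph, delete Z, and test ordinary graph separation.

```python
from itertools import combinations

def d_separated(parents, x, y, z):
    """True iff x and y are d-separated given the set z, via the
    ancestral-graph criterion: keep only ancestors of {x, y} + z,
    moralize ("marry" co-parents, drop arc directions), delete z,
    and test whether x can still reach y."""
    # 1. Collect ancestors of x, y, and z (including themselves).
    anc, stack = set(), [x, y, *z]
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents.get(n, []))
    # 2. Moralize the induced subgraph.
    adj = {n: set() for n in anc}
    for n in anc:
        ps = parents.get(n, [])
        for p in ps:                      # undirected parent-child edges
            adj[n].add(p); adj[p].add(n)
        for p, q in combinations(ps, 2):  # marry co-parents
            adj[p].add(q); adj[q].add(p)
    # 3. Remove z and test reachability of y from x.
    blocked, seen, stack = set(z), set(), [x]
    while stack:
        n = stack.pop()
        if n not in seen and n not in blocked:
            seen.add(n)
            stack.extend(adj[n])
    return y not in seen
```

On the Asia network of slide 3 this reproduces the expected answers: A and S are d-separated given nothing, but become d-connected given D (the head-to-head node D activates the trail A -> T -> D <- B <- S).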
  10. Independence Assumptions: a variable (node) is conditionally independent of its non-descendants given its parents.
  11. Independence Assumptions: Cancer is independent of Diet given Exposure to Toxins and Smoking [Breese & Koller, 97]. (Network nodes: Age, Gender, Diet, Exposure to Toxins, Smoking, Cancer, Serum Calcium, Lung Tumor.)
  12. Independence Assumptions: the joint pdf can therefore be represented as a product of local distributions:
  P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                   = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
  13. Independence Assumptions: thus, the general product rule for Bayesian networks is
  P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i)),
  where Pa(X_i) is the set of parents of X_i.
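The product rule is straightforward to evaluate for a full assignment. A minimal Python sketch (our own data layout and names; the two-node network and its numbers are illustrative, not from the slides):

```python
def joint_prob(parents, cpds, assignment):
    """P(x_1,...,x_n) = prod_i P(x_i | pa(x_i)) for one full assignment.
    cpds[v] maps a tuple of parent values to {value: probability}."""
    p = 1.0
    for var, cpd in cpds.items():
        pa_vals = tuple(assignment[q] for q in parents[var])
        p *= cpd[pa_vals][assignment[var]]
    return p

# Tiny network A -> B (illustrative numbers):
parents = {'A': (), 'B': ('A',)}
cpds = {
    'A': {(): {0: 0.6, 1: 0.4}},
    'B': {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
}
p11 = joint_prob(parents, cpds, {'A': 1, 'B': 1})   # = 0.4 * 0.7
```

Summing `joint_prob` over all full assignments recovers 1, as it must for any consistent factorization.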
  14. The Knowledge Acquisition Task
  - Variables:
    - collectively exhaustive, mutually exclusive values
    - clarity test: a value should be knowable in principle
  - Structure:
    - if data are available, it can be learned
    - constructed by hand (using "expert" knowledge)
    - variable ordering matters: causal knowledge usually simplifies the structure
  - Probabilities:
    - can be learned from data
    - the second decimal place usually does not matter; relative probabilities are what count
    - sensitivity analysis
  15. The Knowledge Acquisition Task: variable order is important, and causal knowledge simplifies construction (compare the networks over Battery, TurnOver, Start, Fuel, and Gauge obtained under different variable orderings).
  16. The Knowledge Acquisition Task: Naive Bayesian Classifiers [Duda & Hart; Langley 92], Selective Naive Bayesian Classifiers [Langley & Sage 94], Conditional Trees [Geiger 92; Friedman et al 97].
  17. The Knowledge Acquisition Task: Selective Bayesian Networks [Singh & Provan, 95; 96].
  18. What are BNs useful for?
  - Diagnosis: P(cause | symptom) = ?
  - Prediction: P(symptom | cause) = ?
  - Classification: P(class | data)
  - Decision-making (given a cost function)
  - Data mining: induce the best model from data
  Application areas: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.
  19. What are BNs useful for? Predictive inference (from cause to effect), diagnostic reasoning (from effect to cause), and decision making by maximizing expected utility, combining known predisposing factors and imperfect observations to reason about unknown but important variables.
  20. What are BNs useful for? A troubleshooting loop: from salient observations, compute an assignment of belief (probability of fault i) over candidate faults; then either act now, choosing the action with maximum expected utility (including "do nothing"), or gather the next best observation according to its value of information, and repeat.
  21. Why use BNs?
  - Explicit management of uncertainty
  - Modularity implies maintainability
  - Better, flexible, and robust decision making (MEU, VOI)
  - Can be used to answer arbitrary queries, including multiple-fault problems
  - Easy to incorporate prior knowledge
  - Easy to understand
  22. Application Examples
  - Intellipath
    - commercial version of Pathfinder
    - 60 lymph-node diseases, 100 findings
  - APRI system developed at AT&T Bell Labs
    - learns and uses Bayesian networks from data to identify customers liable to default on bill payments
  - NASA Vista system
    - predicts failures in propulsion systems
    - considers time criticality and suggests the highest-utility action
    - dynamically decides what information to show
  23. Application Examples
  - Answer Wizard in MS Office 95 / MS Project
    - Bayesian-network-based free-text help facility
    - uses naive Bayesian classifiers
  - Office Assistant in MS Office 97
    - extension of the Answer Wizard
    - uses naive Bayesian networks
    - help based on past experience (keyboard/mouse use) and the task the user is currently doing
    - this is the "smiley face" you get in your MS Office applications
  24. Application Examples
  - Microsoft Pregnancy and Child-Care
    - available on MSN in the Health section
    - frequently occurring children's symptoms are linked to expert modules that ask parents relevant questions
    - asks the next best question based on the information provided
    - presents articles deemed relevant based on the information provided
  25. Application Examples
  - Printer troubleshooting
    - HP bought a 40% stake in HUGIN and is developing printer troubleshooters for HP printers
    - Microsoft has 70+ online troubleshooters on their web site
      - use Bayesian networks with multiple-fault models and incorporate utilities
  - Fax machine troubleshooting
    - Ricoh uses Bayesian-network-based troubleshooters at call centers
    - enabled Ricoh to answer twice the number of calls in half the time
  26. Application Examples
  27. Application Examples
  28. Application Examples
  29. Online/print resources on BNs
  - Conferences & Journals
    - UAI, ICML, AAAI, AISTATS, KDD
    - MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
  - Books and Papers
    - "Bayesian Networks without Tears" by Eugene Charniak. AI Magazine, Winter 1991.
    - "Probabilistic Reasoning in Intelligent Systems" by Judea Pearl. Morgan Kaufmann, 1988.
    - "Probabilistic Reasoning in Expert Systems" by Richard Neapolitan. Wiley, 1990.
    - CACM special issue on real-world applications of BNs, March 1995
  30. Online/Print Resources on BNs
  - A wealth of online information, with links to:
    - electronic proceedings of the UAI conferences
    - other sites with information on BNs and reasoning under uncertainty
    - several tutorials and important articles
    - research groups & companies working in this area
    - other societies, mailing lists, and conferences
  31. Publicly available s/w for BNs
  - A list of BN software is maintained by Russell Almond
    - several free packages: generally for research only
    - commercial packages: the most powerful (& expensive) is HUGIN; others include Netica and Dxpress
    - we are working on a Java-based BN toolkit here at Watson; it will also work within ABLE
  32. “Road map”
  - Introduction: Bayesian networks
    - What are BNs: representation, types, etc.
    - Why use BNs: applications (classes) of BNs
    - Information sources, software, etc.
  - Probabilistic inference
    - Exact inference
    - Approximate inference
  - Learning Bayesian networks
    - Learning parameters
    - Learning graph structure
  - Summary
  33. Probabilistic Inference Tasks
  - Belief updating
  - Finding the most probable explanation (MPE)
  - Finding the maximum a posteriori (MAP) hypothesis
  - Finding the maximum-expected-utility (MEU) decision
  34. Belief Updating: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
  35. Belief updating: P(X | evidence) = ?
  Example network (with its "moral" graph) over A, B, C, D, E:
  P(a, e=0) = Σ_{b,c,d} P(a) P(b|a) P(c|a) P(d|b,a) P(e=0|b,c)
            = P(a) Σ_c P(c|a) Σ_b P(b|a) P(e=0|b,c) Σ_d P(d|b,a)
  Variable elimination computes the innermost sums first; finally, P(a | e=0) ∝ P(a, e=0).
  36. Bucket elimination: algorithm elim-bel (Dechter 1996)
  bucket B: P(b|a), P(d|b,a), P(e|b,c)
  bucket C: P(c|a)
  bucket D:
  bucket E: e=0
  bucket A: P(a)
  Process the buckets in order, applying the elimination operator (summation over the bucket's variable) and placing each resulting function in the bucket of its highest-ordered remaining variable; the last bucket yields P(a | e=0). Complexity is exponential in w* = 4, the "induced width" (max clique size).
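To make the elimination operator concrete, here is a small Python sketch (our own factor representation, not the tutorial's code) of the two primitives that bucket elimination is built from, factor product and summing out a variable, demonstrated on a tiny chain A -> B with illustrative numbers:

```python
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors over binary variables.
    A factor is (vars, table), where table maps assignment tuples
    (ordered as in vars) to values."""
    v1, t1 = f1; v2, t2 = f2
    vs = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assign in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, assign))
        table[assign] = (t1[tuple(env[v] for v in v1)]
                         * t2[tuple(env[v] for v in v2)])
    return vs, table

def sum_out(f, var):
    """Eliminate var from a factor by summation (the elim-bel operator)."""
    vs, t = f
    i = vs.index(var)
    new_vs = vs[:i] + vs[i + 1:]
    out = {}
    for assign, val in t.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return new_vs, out

# P(B) on the chain A -> B: multiply P(A) by P(B|A), then sum out A.
f_a = (('A',), {(0,): 0.6, (1,): 0.4})
f_ba = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7})
p_b = sum_out(multiply(f_a, f_ba), 'A')   # P(B=1) = 0.6*0.1 + 0.4*0.7 = 0.34
```

A full elim-bel run simply applies these two operations bucket by bucket; replacing the summation in `sum_out` with a maximization yields the elim-mpe operator of the next slide.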
  37. Finding the MPE: algorithm elim-mpe (Dechter 1996). Same buckets as elim-bel (bucket B: P(b|a), P(d|b,a), P(e|b,c); bucket C: P(c|a); bucket E: e=0; bucket A: P(a)), but the elimination operator is maximization instead of summation; complexity is again exponential in w* = 4, the "induced width" (max clique size).
  38. Generating the MPE-tuple: after the elimination pass, assign the variables in reverse order (A, E, D, C, B), choosing for each variable the value that maximizes the product of the functions in its bucket given the previously assigned values.
  39. Complexity of inference depends on the ordering: different elimination orderings of the same "moral" graph (e.g., A, B, C, D, E versus E, D, C, B, A) can yield very different induced widths and hence very different complexity.
  40. Other tasks and algorithms
  - MAP and MEU tasks:
    - similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
    - elimination operation: either summation or maximization
    - restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
  - Other inference algorithms:
    - join-tree clustering
    - Pearl's poly-tree propagation
    - conditioning, etc.
  41. Relationship with join-tree clustering: a cluster (e.g., ABC, BCE, ADB) is a set of buckets (a "super-bucket").
  42. Relationship with Pearl's belief propagation in poly-trees: Pearl's belief propagation for a single-root query is elim-bel using a topological ordering and super-buckets for families; the messages correspond to "diagnostic support" and "causal support". Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
  43. “Road map”
  - Introduction: Bayesian networks
  - Probabilistic inference
    - Exact inference
    - Approximate inference
  - Learning Bayesian networks
    - Learning parameters
    - Learning graph structure
  - Summary
  44. Inference is NP-hard => approximations
  - Local inference
  - Stochastic simulations
  - Variational approximations
  - etc.
  45. Local Inference Idea
  46. Bucket-elimination approximation: "mini-buckets"
  - Local inference idea: bound the size of the recorded dependencies
  - Computation in a bucket is time and space exponential in the number of variables involved
  - Therefore, partition the functions in a bucket into "mini-buckets" over smaller numbers of variables
  47. Mini-bucket approximation for the MPE task: split a bucket into mini-buckets => bound the complexity. Maximizing each mini-bucket separately upper-bounds the exact maximum over the whole bucket.
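The upper-bound property is easy to see on a toy example. The following sketch (illustrative numbers and names, not from the slides) compares the exact bucket computation, max over x of f(x,y)·g(x,z), with the mini-bucket relaxation that maximizes each factor separately, as if each mini-bucket had its own copy of x:

```python
from itertools import product as cart

# Two factors sharing the eliminated variable x (illustrative numbers):
f = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.4}   # f(x, y)
g = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.6}   # g(x, z)

def exact_max_out_x(y, z):
    """Exact bucket computation: max over x of the product."""
    return max(f[(x, y)] * g[(x, z)] for x in (0, 1))

def minibucket_bound(y, z):
    """Mini-bucket relaxation: maximize each factor separately."""
    return max(f[(x, y)] for x in (0, 1)) * max(g[(x, z)] for x in (0, 1))

# The relaxation never undershoots the exact value.
for y, z in cart((0, 1), repeat=2):
    assert minibucket_bound(y, z) >= exact_max_out_x(y, z)
```

For instance, at y = z = 1 the exact maximum is max(0.9·0.2, 0.4·0.6) = 0.24, while the mini-bucket bound is 0.9·0.6 = 0.54: cheaper to compute, but only an upper bound, which is exactly the trade-off approx-mpe(i) exposes.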
  48. Approx-mpe(i). Input: i, the maximum number of variables allowed in a mini-bucket. Output: [lower bound (the probability of a possibly sub-optimal solution), upper bound]. Example: approx-mpe(3) versus elim-mpe.
  49. Properties of approx-mpe(i)
  - Complexity: O(exp(2i)) time and O(exp(i)) space
  - Accuracy: determined by the upper/lower (U/L) bound ratio
  - As i increases, both accuracy and complexity increase
  - Possible uses of mini-bucket approximations:
    - as anytime algorithms (Dechter and Rish, 1997)
    - as heuristics in best-first search (Kask and Dechter, 1999)
  - Other tasks: similar mini-bucket approximations for belief updating, MAP, and MEU (Dechter and Rish, 1997)
  50. Anytime Approximation
  51. Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
  - Randomly generated networks
    - uniform random probabilities
    - random noisy-OR
  - CPCS networks
  - Probabilistic decoding
  Comparing approx-mpe and anytime-mpe versus elim-mpe.
  52. Random networks
  - Uniform random: 60 nodes, 90 edges (200 instances)
    - in 80% of cases, a 10-100 times speed-up while U/L < 2
  - Noisy-OR: even better results
    - exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
  53. CPCS networks – medical diagnosis (noisy-OR model). Test case: no evidence.
  Time (sec):
  Algorithm        cpcs360   cpcs422
  elim-mpe          115.8    1697.6
  anytime-mpe( )     70.3     505.2
  anytime-mpe( )     70.3     110.5
  54. Effect of evidence: likely evidence versus random (unlikely) evidence. More likely evidence => higher MPE => higher accuracy (why?).
  55. Probabilistic decoding of error-correcting linear block codes. State of the art: an approximate algorithm, iterative belief propagation (IBP), which is Pearl's poly-tree algorithm applied to loopy networks.
  56. approx-mpe vs. IBP: bit error rate (BER) as a function of noise (sigma).
  57. Mini-buckets: summary
  - Mini-buckets: a local inference approximation
  - Idea: bound the size of the recorded functions
  - Approx-mpe(i): a mini-bucket algorithm for MPE
    - better results for noisy-OR than for random problems
    - accuracy increases with decreasing noise
    - accuracy increases for likely evidence
    - sparser graphs -> higher accuracy
    - coding networks: approx-mpe outperforms IBP on low-induced-width codes
  58. “Road map”
  - Introduction: Bayesian networks
  - Probabilistic inference
    - Exact inference
    - Approximate inference
      - Local inference
      - Stochastic simulations
      - Variational approximations
  - Learning Bayesian networks
  - Summary
  59. Approximation via Sampling
  60. Forward Sampling (logic sampling; Henrion, 1988)
  61. Forward sampling (example). Drawback: a high rejection rate!
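A minimal Python sketch of forward sampling with rejection (the network, its numbers, and all names are our own illustration, not the slides' example): sample ancestrally from the priors, discard samples that contradict the evidence, and average the rest.

```python
import random

# Toy network A -> B (illustrative numbers):
# P(A=1) = 0.3, P(B=1|A=1) = 0.8, P(B=1|A=0) = 0.1.
def sample_network(rng):
    a = 1 if rng.random() < 0.3 else 0
    b = 1 if rng.random() < (0.8 if a == 1 else 0.1) else 0
    return a, b

def rejection_estimate(n, rng):
    """Estimate P(A=1 | B=1) by forward sampling, rejecting every
    sample that contradicts the evidence B=1."""
    kept = []
    for _ in range(n):
        a, b = sample_network(rng)
        if b == 1:
            kept.append(a)
    return sum(kept) / len(kept), len(kept) / n   # estimate, acceptance rate

est, acc = rejection_estimate(20000, random.Random(0))
# est should be near the exact value 0.24/0.31 ~ 0.774, while
# acc ~ P(B=1) = 0.31: roughly 69% of the samples are thrown away,
# which is the "high rejection rate" drawback the slide refers to.
```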
  62. Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990): "clamp" the evidence, forward-sample the remaining variables, and weight each sample by the likelihood of the evidence. Works well for likely evidence!
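On the same toy A -> B network used above (illustrative numbers and names, not from the slides), likelihood weighting avoids rejection entirely: B is clamped to the evidence, only A is sampled, and each sample carries the weight P(B=1 | a).

```python
import random

# Toy network A -> B with evidence B=1 (illustrative numbers):
# P(A=1) = 0.3, P(B=1|A=1) = 0.8, P(B=1|A=0) = 0.1.
def likelihood_weighting(n, rng):
    """Estimate P(A=1 | B=1): sample only A (B is clamped to the
    evidence) and weight each sample by P(B=1 | a)."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if rng.random() < 0.3 else 0
        w = 0.8 if a == 1 else 0.1   # likelihood of the clamped evidence
        num += w * a
        den += w
    return num / den

est = likelihood_weighting(20000, random.Random(0))
# est should be near 0.24/0.31 ~ 0.774, with every sample used;
# when the evidence is unlikely, though, the weights become highly
# uneven and the estimate degrades, hence "works well for likely evidence".
```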
  63. Gibbs Sampling (Geman and Geman, 1984): Markov Chain Monte Carlo (MCMC), i.e., create a Markov chain of samples. Advantage: guaranteed to converge to P(X). Disadvantage: convergence may be slow.
  64. Gibbs Sampling (cont'd) (Pearl, 1988): each variable is resampled conditioned on its Markov blanket, i.e., its parents, its children, and its children's other parents.
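A minimal Gibbs sampler on a toy chain A -> B -> C with evidence C=1 (the network, numbers, and names are our own illustration): each sweep resamples A and B from their distributions given their Markov blankets, and the visited states estimate the posterior.

```python
import random

# Toy chain A -> B -> C with evidence C=1 (illustrative numbers):
# P(A=1) = 0.5, P(B=1|A=a) below, P(C=1|B=b) below.
PB = {1: 0.9, 0: 0.2}   # P(B=1 | A=a)
PC = {1: 0.7, 0: 0.1}   # P(C=1 | B=b)

def gibbs_estimate(n_steps, rng, burn=1000):
    """Estimate P(A=1 | C=1) by resampling each free variable from
    its conditional given its Markov blanket."""
    a, b = 1, 1                      # arbitrary initial state; C is clamped
    count = 0
    for t in range(n_steps):
        # A's Markov blanket is {B}: P(a | b) ~ P(a) * P(b | a)
        w1 = 0.5 * (PB[1] if b == 1 else 1 - PB[1])
        w0 = 0.5 * (PB[0] if b == 1 else 1 - PB[0])
        a = 1 if rng.random() < w1 / (w1 + w0) else 0
        # B's Markov blanket is {A, C}: P(b | a, C=1) ~ P(b | a) * P(C=1 | b)
        w1 = PB[a] * PC[1]
        w0 = (1 - PB[a]) * PC[0]
        b = 1 if rng.random() < w1 / (w1 + w0) else 0
        if t >= burn:                # discard burn-in samples
            count += a
    return count / (n_steps - burn)

est = gibbs_estimate(100000, random.Random(0))
# est should be near the exact value 0.32/0.43 ~ 0.744.
```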
  65. “Road map”
  - Introduction: Bayesian networks
  - Probabilistic inference
    - Exact inference
    - Approximate inference
      - Local inference
      - Stochastic simulations
      - Variational approximations
  - Learning Bayesian networks
  - Summary
  66. Variational Approximations
  - Idea: a variational transformation of the CPDs simplifies inference
  - Advantages:
    - compute upper and lower bounds on P(Y)
    - usually faster than sampling techniques
  - Disadvantages:
    - more complex and less general: must be derived for each particular form of CPD function
  67. Variational bounds: example, log(x). The approach generalizes to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997).
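As a worked instance of the idea (the standard textbook bound, stated here for completeness rather than taken verbatim from the slides): since log is concave, convex duality yields a family of linear upper bounds,

```latex
\log(x) \;=\; \min_{\lambda > 0}\bigl\{\lambda x - \log\lambda - 1\bigr\}
\qquad\Longrightarrow\qquad
\log(x) \;\le\; \lambda x - \log\lambda - 1 \quad \text{for all } \lambda > 0,
```

with equality at λ = 1/x. For a fixed variational parameter λ the bound is linear in x, so substituting it for a log-CPD simplifies inference; optimizing over λ afterwards tightens the bound.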
  68. Convex duality (Jaakkola and Jordan, 1997)
  69. Example: the QMR-DT network (Quick Medical Reference – Decision-Theoretic; Shwe et al., 1991). Noisy-OR model with about 600 diseases and 4000 findings.
  70. Inference in QMR-DT. Complexity: O(exp(min{p, k})), where p = the number of positive findings and k = the maximum family size (Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998). The likelihood of negative findings stays factorized, but positive evidence "couples" the disease nodes.
  71. Variational approach to QMR-DT (Jaakkola and Jordan, 1997): after the variational transformation, the effect of positive evidence is factorized (the diseases are "decoupled").
  72. Variational approximations
  - Bounds on the local CPDs yield a bound on the posterior
  - Two approaches: sequential and block
  - Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference using a heuristic node ordering, then optimizes across the variational parameters
  - Block: selects in advance the nodes to be transformed, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
  73. Block approach
  74. Inference in BN: summary
  - Exact inference is often intractable => need approximations
  - Approximation principles:
    - approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem's graph)
    - mini-buckets, IBP
    - other approximations: stochastic simulations, variational techniques, etc.
  - Further research:
    - combining "orthogonal" approximation approaches
    - better understanding of "what works well where": which approximation suits which problem structure
    - other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
  75. “Road map”
  - Introduction: Bayesian networks
  - Probabilistic inference
    - Exact inference
    - Approximate inference
  - Learning Bayesian networks
    - Learning parameters
    - Learning graph structure
  - Summary
  76. Why learn Bayesian networks?
  - Combining domain expert knowledge with data
  - Efficient representation and inference
  - Handling missing data, e.g., <1.3 2.8 ?? 0 1>
  - Learning causal relationships
  - Incremental learning: updating P(H) as new data arrive
  77. Learning Bayesian Networks
  - Known graph (learn parameters, e.g., P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)):
    - complete data: parameter estimation (ML, MAP)
    - incomplete data: non-linear parametric optimization (gradient descent, EM)
  - Unknown graph (learn graph and parameters):
    - complete data: optimization (search in the space of graphs)
    - incomplete data: EM plus multiple imputation, structural EM, mixture models
  78. Learning Parameters: complete data
  - ML estimate: maximize the likelihood of the multinomial counts; the score is decomposable!
  - MAP estimate (Bayesian statistics): conjugate priors (Dirichlet); the equivalent sample size encodes prior knowledge
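With complete data this reduces to counting. A minimal sketch for one binary CPD (function names, the data, and the numbers are our own illustration; `alpha = 0` gives the plain ML relative frequencies, while `alpha > 0` gives the smoothed posterior-mean estimate under a symmetric Dirichlet prior, with `alpha` playing the role of an equivalent sample size):

```python
def estimate_cpd(samples, child, parent, alpha=1.0):
    """Estimate P(child=1 | parent) for binary variables from complete data.
    alpha = 0: ML estimate (relative frequencies; undefined on unseen
    parent contexts). alpha > 0: posterior-mean estimate under a
    symmetric Dirichlet(alpha) prior (Laplace-style smoothing)."""
    counts = {0: [0, 0], 1: [0, 0]}   # parent value -> [n(child=0), n(child=1)]
    for s in samples:
        counts[s[parent]][s[child]] += 1
    return {pv: (c[1] + alpha) / (c[0] + c[1] + 2 * alpha)
            for pv, c in counts.items()}

samples = ([{'A': 1, 'B': 1}] * 3 + [{'A': 1, 'B': 0}] * 1
           + [{'A': 0, 'B': 0}] * 4)
ml = estimate_cpd(samples, 'B', 'A', alpha=0.0)    # {1: 0.75, 0: 0.0}
smoothed = estimate_cpd(samples, 'B', 'A', alpha=1.0)  # pulled toward 0.5
```

Decomposability is visible here: each node's CPD is estimated from its own family counts alone, independently of the rest of the network.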
  79. Learning graph structure: NP-hard optimization
  - Heuristic search over graphs: find the best operation (e.g., add S->B, delete S->B, reverse S->B)
    - complete data: local computations
    - incomplete data (score non-decomposable): stochastic methods
  - Constraint-based methods: the data impose independence relations (constraints)
  80. Learning BNs: incomplete data
  - Learning parameters
    - EM algorithm [Lauritzen, 95]
    - Gibbs sampling [Heckerman, 96]
    - gradient descent [Russell et al., 96]
  - Learning both structure and parameters
    - sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
    - Monte-Carlo approaches [Heckerman, 96]
    - Gaussian approximation [Heckerman, 96]
    - structural EM [Friedman, 98]
    - EM and multiple imputation [Singh 97, 98, 00]
  81. Learning Parameters: incomplete data. The marginal likelihood is non-decomposable (hidden nodes). EM algorithm: start from initial parameters and iterate until convergence:
  - Expectation: using the current model, run inference (e.g., P(S | X=0, D=1, C=0, B=1)) to fill in the missing entries and compute expected counts
  - Maximization: update the parameters from the expected counts (ML, MAP)
  82. Learning Parameters: incomplete data (Lauritzen, 95)
  - The complete-data log-likelihood is Σ_{ijk} N_ijk log θ_ijk
  - E step: compute E(N_ijk | Y_obs, θ)
  - M step: set θ_ijk = E(N_ijk | Y_obs, θ) / E(N_ij | Y_obs, θ)
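The E and M steps above can be sketched end to end on a toy network X -> Y where some X values are missing (the network, data, and all names are our own illustration, not the slides' example). The E step distributes each incomplete case over the values of X in proportion to P(X | y) under the current parameters; the M step is the ordinary ML update on the resulting expected counts.

```python
# EM for the toy network X -> Y with some X values missing.
# Parameters: px = P(X=1), py[x] = P(Y=1 | X=x).
def em(data, iters=50):
    px, py = 0.5, {0: 0.5, 1: 0.5}
    for _ in range(iters):
        # E step: expected counts, filling in missing X via P(X | y).
        n = cx1 = 0.0
        cx = {0: 0.0, 1: 0.0}     # expected N(X=x)
        cy1 = {0: 0.0, 1: 0.0}    # expected N(X=x, Y=1)
        for x, y in data:
            if x is None:
                w1 = px * (py[1] if y == 1 else 1 - py[1])
                w0 = (1 - px) * (py[0] if y == 1 else 1 - py[0])
                r = w1 / (w1 + w0)          # E[X=1 | y, current params]
            else:
                r = float(x)
            n += 1; cx1 += r
            cx[1] += r; cx[0] += 1 - r
            if y == 1:
                cy1[1] += r; cy1[0] += 1 - r
        # M step: ML update from the expected counts.
        px = cx1 / n
        py = {v: cy1[v] / cx[v] for v in (0, 1)}
    return px, py

data = [(1, 1), (1, 1), (0, 0), (0, 0), (None, 1), (None, 0)]
px, py = em(data)
# The missing X's are pulled toward the value their Y suggests:
# px stays ~0.5, while py[1] -> ~1 and py[0] -> ~0.
```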
  83. Learning structure: incomplete data
  - Depends on the type of missing data: missing independent of anything else (MCAR), or missing based on the values of other variables (MAR)
  - While MCAR can be handled by decomposable scores, MAR cannot
  - For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
  - Very few attempts at MAR: stochastic methods
  84. Learning structure: incomplete data
  - Approximate EM by using multiple imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
    - trade-off between performance & quality
    - the learned network is almost optimal
    - approximates the complete-data log-likelihood function using multiple imputation
    - yields a decomposable score, dependent only on each node & its parents
    - converges to a local maximum of the observed-data likelihood
  85. Learning structure: incomplete data
  86. Scoring functions: Minimum Description Length (MDL)
  - Learning as data compression: minimize DL(Model) + DL(Data | Model)
  - MDL = -BIC (Bayesian Information Criterion)
  - The Bayesian score (BDe) is asymptotically equivalent to MDL
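The MDL/BIC trade-off can be made concrete with a small sketch (our own function and data, illustrative rather than the tutorial's code): the log-likelihood term rewards fit, the penalty term charges for parameters, so an arc between independent variables lowers the score.

```python
import math

def bic_score(data, parents):
    """BIC = max log-likelihood - (k/2) log N, for binary variables;
    MDL is its negation. `parents` maps each variable to its parent tuple."""
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        counts = {}
        for row in data:
            key = tuple(row[p] for p in pa)
            counts.setdefault(key, [0, 0])[row[var]] += 1
        for c0, c1 in counts.values():
            tot = c0 + c1
            for c in (c0, c1):
                if c > 0:
                    score += c * math.log(c / tot)   # ML log-likelihood term
        score -= 0.5 * (2 ** len(pa)) * math.log(n)  # parameter description length
    return score

# A and B statistically independent: the empty graph should score higher
# than A -> B, since both fit equally well but the arc costs parameters.
data = [{'A': a, 'B': b} for a in (0, 1) for b in (0, 1)] * 10
s_empty = bic_score(data, {'A': (), 'B': ()})
s_edge = bic_score(data, {'A': (), 'B': ('A',)})
```

Here both structures achieve the same likelihood, so the scores differ exactly by the extra parameter's penalty, (1/2) log N.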
  87. Learning Structure plus Parameters. The number of models is super-exponential. Alternatives: model selection or model averaging.
  88. Model Selection. Generally, choose a single model M*; this is equivalent to saying P(M* | D) = 1. The task is then to: 1) define a metric to decide which model is best, and 2) search for that model through the space of all models.
  89. One reasonable score: the posterior probability of a structure, P(S^h | D) ∝ P(S^h) ∫ P(D | θ, S^h) P(θ | S^h) dθ (structure prior × parameter prior × likelihood).
  90. Global and Local Predictive Scores [Spiegelhalter et al 93]. The global predictive score decomposes case by case:
  log p(D | S^h) = log p(x^1 | S^h) + log p(x^2 | x^1, S^h) + log p(x^3 | x^1, x^2, S^h) + ... + log p(x^m | x^1, ..., x^{m-1}, S^h)
  (each term is related to a Bayes' factor). The local score is useful for diagnostic problems.
  91. Local Predictive Score (Spiegelhalter et al., 1993): a diagnostic structure with disease Y and symptoms X_1, X_2, ..., X_n.
  92. Exact computation of p(D | S^h) [Cooper & Herskovits, 92]. Assumptions:
  - no missing data
  - cases are independent, given the model
  - uniform priors on the parameters
  - discrete variables
  93. Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
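Under the assumptions of the previous slide, the score has the well-known Cooper-Herskovits closed form, stated here for completeness (the slide's own formula did not survive extraction):

```latex
p(D, S^h) \;=\; p(S^h)\,\prod_{i=1}^{n}\,\prod_{j=1}^{q_i}
\frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!}\,\prod_{k=1}^{r_i} N_{ijk}!
```

where r_i is the number of values of X_i, q_i the number of configurations of X_i's parents, N_{ijk} the number of cases with X_i = k and the j-th parent configuration, and N_{ij} = Σ_k N_{ijk}.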
  94. Learning BNs without specifying an ordering [Singh & Valtorta, 95]: there are n! orderings, and the ordering greatly affects the quality of the learned network; use conditional independence tests and d-separation to obtain an ordering.
  95. Learning BNs via the MDL principle [Lam & Bacchus, 93]. Idea: the best model is the one that gives the most compact representation of the data. So, encode the data using the model, plus encode the model itself, and minimize the total.
  96. Learning BNs: summary
  - Bayesian networks: graphical probabilistic models
  - Efficient representation and inference
  - Expert knowledge + learning from data
  - Learning:
    - parameters (parameter estimation, EM)
    - structure (optimization with score functions, e.g., MDL)
  - Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
  - Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.