Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)


Compilation of some ideas over the last 15 years...

Published in: Technology, Health & Medicine

  1. Exact Inference in Bayesian Networks using MapReduce. Alex Kozlov, Cloudera, Inc.
  2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions
  3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.) and published the fastest implementation at the time; worked in the DM/BI field since then; recently joined Cloudera, Inc. and started looking at how to solve the world’s hardest problems
  4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
  5. Bayesian Networks: 1. Nodes 2. Edges 3. Probabilities. Reference: Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances (published posthumously by his friend). Philosophical Transactions of the Royal Society of London, 53:370-418.
  6. Applications: 1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis) 2. Medicine 3. Document classification, information retrieval 4. Image processing 5. Data fusion 6. Gaming 7. Law 8. On-line advertising!
  7. A Simple BN Network. Nodes: Rain, Sprinkler, Wet Driveway. Pr(Rain): T 0.2, F 0.8. Pr(Sprinkler | Rain), columns T/F: Rain=F: 0.4, 0.6; Rain=T: 0.1, 0.9. Pr(Wet Driveway | Sprinkler, Rain), columns T/F: F,F: 0.01, 0.99; F,T: 0.8, 0.2; T,F: 0.9, 0.1; T,T: 0.99, 0.01. Example queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain)
  8. Asia Network. Node probabilities: Pr(Visit to Asia); Pr(Smoking); Pr(Tuberculosis | Visit to Asia); Pr(Lung Cancer | Smoking); Pr(Bronchitis | Smoking); Pr(C | BE); Pr(X-Ray | Lung Cancer or Tuberculosis); Pr(Dyspnea | CG). Example query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
  9. BN Inference 101 (in Hive). JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H). PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B; PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B; Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule). CPCS has 422 nodes, so the JPD table would have at least 2^422 rows!
  10. Junction Tree. Figure labels: Pr(E | F); Pr(Tuberculosis | Visit to Asia); Pr(G | F); Pr(Visit to Asia); Pr(F); Pr(C | BE); Pr(H | CG); Pr(D | C). Example query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)
  11. CPCS Networks: 422 nodes, of which 14 describe diseases, 33 are risk factors, and 375 are various findings related to diseases
  12. CPCS Networks
  13. Why Bayesian Network Inference? Choose the right tool for the right job! BN is an abstraction for reasoning and decision making; easy to incorporate human insight and intuitions; very general, no specific ‘label’ node; easy to do ‘what-if’, strength-of-influence, and value-of-information analysis; immune to Gaussian assumptions. It’s all just a joint probability distribution.
  14. Map & Reduces. [Figure: map keys A1B1, A2B1, A1B2, A2B2 reduce to B1 and B2, and map keys C1D1, C2D1, C1D2, C2D2 reduce to C1 and C2, in Aggregation 1 (+); Aggregation 2 (x) combines these with the BCE clique entries B1C1E1 through B2C2E2, for example Pr(C | BE) x ∑ Pr(B1 | A) x ∑ Pr(D | C1)]
  15. MapReduce Implementation. For each clique in depth-first order: MAP: sum over the variables to get the ‘clique message’ (requires state, a custom partitioner and an input format); emit factors for the next clique. REDUCE: multiply the factors from all children; include probabilities assigned to the clique; form the new clique values. The MAP is done over all child cliques.
  16. Cliques, Trees, and Parallelism (figure: tree of cliques C1 to C6). Topological parallelism: compute branches C2 and C4 in parallel; clique parallelism: divide computation of each clique into maps/reducers; fall back to optimal factoring if a corresponding subtree is small; combine multiple phases together; reduce replication level. Cliques may be larger than they appear!
  17. CPCS Inference. The 360-node subnet has a largest ‘clique’ of 11,739,896 floats (fits into 2 GB). The full 422-node version (with states absent, mild, moderate, severe) needs 3,377,699,720,527,872 floats (about 12 PB of storage, though not all queries need it). In most cases there is no need to do inference on the full network.
  18. Results
      Network    | Memory  | Time (1997) [1] | Macbook Pro (2010) [2] | Hadoop (& future) [3]
      Random (B) | 10 MB   | 33 sec          | < 1 sec                |
      Random (A) | 254 MB  | 260 sec         | 10 sec                 |
      cpcs360    | 2 GB    | 640 sec         | 15 sec                 | 1 min
      cpcs422    | > 12 PB | N/A             | N/A                    | Minutes to hours for most of the queries on most of the clusters
      [1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
      [2] Macbook Pro, 4 GB DDR3, 2.53 GHz
      [3] 10-node Linux Xeon cluster, 24 GB, quad 2-core
  19. Conclusions: Exact probabilistic inference is finally in sight for the full 422-node CPCS network; Hadoop helps to solve the world’s hardest problems. What you should know after this talk: a BN is a DAG and represents a joint probability distribution (JPD); conditional probabilities can be computed by multiplying and summing the JPD; for large networks this may mean petabytes of intermediate data, but it’s MR.
  20. Questions? alexvk@{cloudera,gmail}.com
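
Slides 7 and 9 describe inference as building the joint probability table, summing it with GROUP BY, and dividing. A minimal Python sketch of that idea follows, answering Pr(Rain | Wet Driveway) for the three-node network. The CPT values are read from the garbled slide-7 transcript and should be treated as an assumption; the helper names (`bernoulli`, `marginal`) are illustrative and not from the talk.

```python
from itertools import product

# CPT values as read from the slide-7 transcript (an assumption, see lead-in).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_ON = {True: 0.1, False: 0.4}             # Pr(Sprinkler=T | Rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,   # Pr(Wet=T | Sprinkler, Rain)
         (True, False): 0.9, (True, True): 0.99}


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


# JPD: one row per (rain, sprinkler, wet) assignment, product of all CPT entries.
jpd = {}
for rain, sprinkler, wet in product([True, False], repeat=3):
    jpd[(rain, sprinkler, wet)] = (
        P_RAIN[rain]
        * bernoulli(P_SPRINKLER_ON[rain], sprinkler)
        * bernoulli(P_WET[(sprinkler, rain)], wet)
    )


def marginal(table, keep):
    """Equivalent of SELECT <keep>, SUM(PROB) ... GROUP BY <keep>."""
    out = {}
    for row, prob in table.items():
        key = tuple(row[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


p_rain_wet = marginal(jpd, keep=(0, 2))   # Pr(Rain, Wet)
p_wet = marginal(jpd, keep=(2,))          # Pr(Wet)
p_rain_given_wet = p_rain_wet[(True, True)] / p_wet[(True,)]
print(round(p_rain_given_wet, 4))
```

The table here has only 8 rows; the same approach on a 422-node network is what slide 9 rules out, which motivates the clique-based method on the later slides.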
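
Slides 14 and 15 outline two aggregation steps: child cliques are summed down to messages (+), and the parent clique multiplies those messages with its own probabilities (x). The sketch below imitates that flow in memory for the structure suggested by slide 14 (A -> B, (B, E) -> C, C -> D, with parent clique BCE). It is not the talk's Hadoop implementation: all CPT numbers, the evidence D=1, and the function names are hypothetical, and the custom partitioner and input format mentioned on slide 15 are not modeled.

```python
from collections import defaultdict
from itertools import product

# Hypothetical binary CPTs (illustrative numbers only).
P_A = {0: 0.6, 1: 0.4}
P_E = {0: 0.5, 1: 0.5}
P_B_GIVEN_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}    # key (b, a)
P_D_GIVEN_C = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # key (d, c)
P_C1_GIVEN_BE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # Pr(C=1 | b, e)
EVIDENCE_D = 1  # hypothetical observation D=1


def p_c_given_be(c, b, e):
    p1 = P_C1_GIVEN_BE[(b, e)]
    return p1 if c == 1 else 1.0 - p1


def map_reduce_sum(records, mapper):
    """Toy in-memory MapReduce: mapper emits (key, value), reducer sums per key."""
    totals = defaultdict(float)
    for record in records:
        for key, value in mapper(record):
            totals[key] += value
    return dict(totals)


def map_ab(record):
    # Child clique AB: summing over A leaves a message keyed by B.
    a, b = record
    yield b, P_A[a] * P_B_GIVEN_A[(b, a)]


def map_cd(record):
    # Child clique CD: keep rows matching the evidence, message keyed by C.
    c, d = record
    if d == EVIDENCE_D:
        yield c, P_D_GIVEN_C[(d, c)]


pairs = list(product([0, 1], repeat=2))
msg_b = map_reduce_sum(pairs, map_ab)   # Aggregation 1 (+)
msg_c = map_reduce_sum(pairs, map_cd)   # Aggregation 1 (+)

# Aggregation 2 (x): multiply child messages with the probabilities
# assigned to clique BCE, here Pr(C | B, E) and Pr(E).
clique_bce = {
    (b, c, e): p_c_given_be(c, b, e) * P_E[e] * msg_b[b] * msg_c[c]
    for b, c, e in product([0, 1], repeat=3)
}


def posterior_b(clique):
    """Normalize the clique table down to Pr(B | evidence)."""
    by_b = defaultdict(float)
    for (b, _c, _e), value in clique.items():
        by_b[b] += value
    total = sum(by_b.values())
    return {b: v / total for b, v in by_b.items()}


print(posterior_b(clique_bce))
```

The point of the design is that no step ever materializes the full five-variable joint table; each map or reduce only touches one clique's worth of entries.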
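
The storage figures on slides 9 and 17 can be checked with a few lines of arithmetic. The sketch below assumes 4-byte (single-precision) floats and binary prefixes (1 PB taken as 2^50 bytes); neither assumption is stated on the slides.

```python
BYTES_PER_FLOAT = 4   # assumption: single-precision floats
MIB = 2 ** 20
PIB = 2 ** 50         # assumption: binary prefix for "PB"

full_clique_floats = 3_377_699_720_527_872   # slide 17, full 422-node network
subnet_clique_floats = 11_739_896            # slide 17, 360-node subnet

full_pib = full_clique_floats * BYTES_PER_FLOAT / PIB
subnet_mib = subnet_clique_floats * BYTES_PER_FLOAT / MIB

# Slide 9: a naive JPD over 422 binary nodes has 2^422 rows.
jpd_rows_digits = len(str(2 ** 422))

print(full_pib, round(subnet_mib, 1), jpd_rows_digits)
```

Under these assumptions the full-network figure comes out to the 12 PB quoted on the slide, and the subnet's largest clique sits comfortably within the 2 GB mentioned there.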