
Exact Inference in Bayesian Networks using MapReduce__HadoopSummit2010



Hadoop Summit 2010 - Research Track
Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera

Published in: Technology


  1. 1. Exact Inference in Bayesian Networks using MapReduce<br />Alex Kozlov<br />Cloudera, Inc.<br />
  2. 2. About Me<br />About Cloudera<br />Bayesian (Probabilistic) Networks<br />BN Inference 101<br />CPCS Network<br />Why BN Inference<br />Inference with MR<br />Results<br />Conclusions<br />2<br />Session Agenda<br />
  3. 3. Worked on BN Inference in 1995-1998 (for Ph.D.)<br />Published the fastest implementation at the time<br />Worked in the DM/BI field since then<br />Recently joined Cloudera, Inc.<br />Started looking at how to solve the world’s hardest problems<br />3<br />About Me<br />
  4. 4. Founded in the summer of 2008<br />Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.<br />Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.<br />4<br />About Cloudera<br />
  5. 5. Nodes<br />Edges<br />Probabilities<br />5<br />Bayesian Networks<br />Bayes, Thomas (1763)<br />An essay towards solving a problem in the doctrine of chances, published posthumously by his friend<br />Philosophical Transactions of the Royal Society of London, 53:370-418<br />
  6. 6. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)<br />Medicine<br />Document classification, information retrieval<br />Image processing<br />Data fusion<br />Gaming<br />Law<br />On-line advertising!<br />6<br />Applications<br />
  7. 7. 7<br />A Simple BN Network<br />Rain: T = 0.2, F = 0.8<br />Sprinkler given Rain = F: T = 0.4, F = 0.6<br />Sprinkler given Rain = T: T = 0.1, F = 0.9<br />Wet Driveway given Sprinkler, Rain = F, F: T = 0.01, F = 0.99<br />Wet Driveway given Sprinkler, Rain = F, T: T = 0.8, F = 0.2<br />Wet Driveway given Sprinkler, Rain = T, F: T = 0.9, F = 0.1<br />Wet Driveway given Sprinkler, Rain = T, T: T = 0.99, F = 0.01<br />Queries: Pr(Rain | Wet Driveway)<br />Pr(Sprinkler Broken | !Wet Driveway & !Rain)<br />
  8. 8. 8<br />Asia Network<br />Pr(Visit to Asia)<br />Pr(Smoking)<br />Pr(Lung Cancer | Smoking)<br />Pr(Tuberculosis | Visit to Asia)<br />Pr(Bronchitis | Smoking)<br />Pr(C | BE )<br />Pr(X-Ray | Lung Cancer or Tuberculosis)<br />Pr(Dyspnea | CG )<br />Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)<br />
  9. 9. JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)<br />PAB = SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;<br />PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B;<br />Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule<br />CPCS is 422 nodes, a table of at least 2^422 rows!<br />9<br />BN Inference 101 (in Hive)<br />
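
The two GROUP BY queries on this slide can be mirrored in a few lines of Python. This is only an illustrative sketch, not the talk's implementation: it builds the full joint table for the three-node Rain / Sprinkler / Wet Driveway network of slide 7 and then groups and sums it. The CPT values follow one reading of the flattened slide 7 tables (an assumption), and the names `joint`, `marginal` and `conditional` are made up for this example.

```python
from itertools import product

# CPTs for the Rain / Sprinkler / Wet Driveway network.
# Values are an assumed reading of the slide 7 tables.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_T = {False: 0.4, True: 0.1}              # Pr(Sprinkler=T | Rain)
P_WET_T = {(False, False): 0.01, (False, True): 0.8,
           (True, False): 0.9, (True, True): 0.99}   # Pr(Wet=T | Sprinkler, Rain)


def joint():
    """JPD: one row per (rain, sprinkler, wet), the product of all CPT entries."""
    table = {}
    for rain, sprinkler, wet in product([True, False], repeat=3):
        p_s = P_SPRINKLER_T[rain] if sprinkler else 1 - P_SPRINKLER_T[rain]
        p_w_true = P_WET_T[(sprinkler, rain)]
        p_w = p_w_true if wet else 1 - p_w_true
        table[(rain, sprinkler, wet)] = P_RAIN[rain] * p_s * p_w
    return table


def marginal(table, keep):
    """Like SELECT ..., SUM(PROB) ... GROUP BY; keep lists column positions."""
    out = {}
    for row, prob in table.items():
        key = tuple(row[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


def conditional(table, target, given):
    """Pr(target=T | given=T) = Pr(target=T, given=T) / Pr(given=T)."""
    both = marginal(table, [target, given])[(True, True)]
    evidence = marginal(table, [given])[(True,)]
    return both / evidence


jpd = joint()
p_rain_given_wet = conditional(jpd, target=0, given=2)  # columns: 0=rain, 2=wet
print(f"Pr(Rain | Wet Driveway) = {p_rain_given_wet:.4f}")
```

The joint table has one row per combination of node values, so it doubles with every binary node added. That growth is why this brute-force approach cannot be applied directly to a 422-node network.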
  10. 10. 10<br />Junction Tree<br />Cliques: AB, BCE, CD, EFG, CEGH<br />Separators: B (AB, BCE), C (BCE, CD), CE (BCE, CEGH), EG (EFG, CEGH)<br />Probabilities assigned to cliques: Pr(Visit to Asia), Pr(Tuberculosis | Visit to Asia), Pr(C | BE ), Pr(D| C), Pr(F), Pr(E | F ), Pr(G | F ), Pr(H | CG )<br />Pr(Lung Cancer | Dyspnea) = Pr(E|H)<br />
  11. 11. 11<br />CPCS Networks<br />422 nodes<br />14 nodes describe diseases<br />33 risk factors<br />375 various findings related to diseases<br />
  12. 12. 12<br />CPCS Networks<br />
  13. 13. Choose the right tool for the right job!<br /><ul><li>BN is an abstraction for reasoning and decision making
  14. 14. Easy to incorporate human insight and intuitions
  15. 15. Very general, no specific ‘label’ node
  16. 16. Easy to do ‘what-if’, strength-of-influence, and value-of-information analysis
  17. 17. Immune to Gaussian assumptions</li></ul>It’s all just a joint probability distribution<br />13<br />Why Bayesian Network Inference?<br />
  18. 18. Map & Reduces<br />14<br />Map, Aggregation 1 (+): entries of child clique AB (A1B1, A2B1, A1B2, A2B2) are summed into keys B1, B2; entries of child clique CD (C1D1, C2D1, C1D2, C2D2) are summed into keys C1, C2<br />Reduce, Aggregation 2 (x): each entry of clique BCE (B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2) is multiplied by the matching sums<br />Example for keys B1, C1: ∑ Pr(B1| A) x ∑ Pr(D| C1), then Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)<br />
  19. 19. for each clique in depth-first order:<br />MAP:<br />Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)<br />Emit factors for the next clique<br />REDUCE:<br />Multiply the factors from all children<br />Include probabilities assigned to the clique<br />Form the new clique values<br />the MAP is done over all child cliques<br />15<br />MapReduce Implementation<br />
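
The MAP and REDUCE steps listed on this slide can be imitated in plain in-memory Python. The sketch below is not the talk's Hadoop code: it leaves out the state handling, custom partitioner and input format the slide mentions, and every number and name in it (`map_phase`, `reduce_phase`, the two-clique chain AB, BC) is a hypothetical chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(child_table, child_vars, separator):
    """MAP: sum a child clique over its non-separator variables (clique message)."""
    idx = [child_vars.index(v) for v in separator]
    message = defaultdict(float)
    for assignment, value in child_table.items():
        message[tuple(assignment[i] for i in idx)] += value
    return dict(message)


def reduce_phase(clique_table, clique_vars, messages):
    """REDUCE: multiply the clique's own probabilities by factors from all children."""
    new_table = {}
    for assignment, value in clique_table.items():
        for separator, message in messages:
            key = tuple(assignment[clique_vars.index(v)] for v in separator)
            value *= message[key]
        new_table[assignment] = value
    return new_table


# Hypothetical child clique AB holding Pr(A) * Pr(B | A) for binary A, B.
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # key: (b, a)
ab = {(a, b): p_a[a] * p_b_given_a[(b, a)] for a, b in product([0, 1], repeat=2)}

# Hypothetical parent clique BC holding the assigned probability Pr(C | B).
p_c_given_b = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75}  # key: (c, b)
bc = {(b, c): p_c_given_b[(c, b)] for b, c in product([0, 1], repeat=2)}

msg_b = map_phase(ab, ["A", "B"], ["B"])                  # sums A out: Pr(B)
bc_new = reduce_phase(bc, ["B", "C"], [(["B"], msg_b)])   # new clique values: Pr(B, C)
p_c = map_phase(bc_new, ["B", "C"], ["C"])                # message for the next clique
print(p_c)
```

In a real job the dictionary keys would be the MapReduce keys and the grouping done by `defaultdict` would be done by the shuffle; here both happen in one process only to make the data flow visible.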
  20. 20. <ul><li>Topological parallelism: compute branches C2 and C4 in parallel
  21. 21. Clique parallelism: divide computation of each clique into maps/reducers
  22. 22. Fall back to optimal factoring if a corresponding subtree is small
  23. 23. Combine multiple phases together
  24. 24. Reduce replication level</li></ul>16<br />Cliques, Trees, and Parallelism<br />C6<br />C5<br />C4<br />C3<br />C2<br />C1<br />Cliques may be larger than they appear!<br />
  25. 25. CPCS:<br />The 360-node subnet has the largest ‘clique’ of <br />11,739,896 floats (fits into 2GB)<br />The full 422-node version (absent, mild, moderate, severe)<br />3,377,699,720,527,872 floats (or 12 PB of storage, but do not need it for all queries)<br />In most cases do not need to do inference on the full network<br />17<br />CPCS Inference<br />
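
The storage figures on this slide can be checked with simple arithmetic. The sketch below assumes 4-byte single-precision floats (the slide does not state the width) and binary units; the helper name `clique_bytes` is made up for this example.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats


def clique_bytes(num_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Raw size in bytes of a clique table with num_floats entries."""
    return num_floats * bytes_per_float


subnet_floats = 11_739_896              # largest clique, 360-node subnet
full_floats = 3_377_699_720_527_872     # largest clique, full 422-node network

subnet_mib = clique_bytes(subnet_floats) / 2**20
full_pib = clique_bytes(full_floats) / 2**50

print(f"360-node subnet: {subnet_mib:.1f} MiB")
print(f"full network:    {full_pib:.0f} PiB")
```

Under these assumptions the full-network clique comes to 12 PiB, matching the 12 PB quoted on the slide, and the subnet clique is consistent with the remark that it fits into 2 GB.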
  26. 26. 1: ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997<br />2: MacBook Pro, 4 GB DDR3, 2.53 GHz<br />3: 10-node Linux Xeon cluster, 24 GB, quad 2-core<br />18<br />Results<br />
  27. 27. Exact probabilistic inference is finally in sight for the full 422 node CPCS network<br />Hadoop helps to solve the world’s hardest problems<br />What you should know after this talk<br />BN is a DAG and represents a joint probability distribution (JPD)<br />Can compute conditional probabilities by multiplying and summing JPD<br />For large networks, this may be PBytes of intermediate data, but it’s MR<br />19<br />Conclusions<br />
  28. 28. Questions?<br />alexvk@{cloudera,gmail}.com<br />
  29. 29. BACKUP<br />21<br />
  30. 30. Conditioning nodes (evidence) – do not need to be summed<br />Bare child nodes’ values sum to one (barren node) – can be dropped from the network<br />22<br />Optimizing BN Inference 101<br />Noisy-OR (conditional independence of parents)<br />Context specific independence (based on the specific value of one of the parents)<br />T<br />F<br />0.01<br />0.99<br />FF<br />0.8<br />0.2<br />FT<br />Wet grass<br />0.9 <br />0.1<br />TF<br />0.99<br />0.01<br />TT<br />
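
Noisy-OR, mentioned on this slide, replaces a full conditional table with one parameter per parent. The sketch below uses the standard noisy-OR formula with an optional leak term; the link probabilities and the leak value are hypothetical and are not claimed to reproduce the table on the slide, and `noisy_or_cpt` is a name invented for this example.

```python
from itertools import product


def noisy_or_cpt(link_probs, leak=0.0):
    """Return {parent_states: Pr(effect=T | parents)} for binary parents.

    Each active parent independently fails to cause the effect with
    probability 1 - p, so Pr(effect=F) is the product of those failures
    (times 1 - leak for causes not modeled explicitly).
    """
    cpt = {}
    for states in product([False, True], repeat=len(link_probs)):
        p_false = 1.0 - leak
        for active, p in zip(states, link_probs):
            if active:
                p_false *= 1.0 - p
        cpt[states] = 1.0 - p_false
    return cpt


# Hypothetical parameters: two parents with link strengths 0.9 and 0.8.
cpt = noisy_or_cpt([0.9, 0.8], leak=0.01)
for states, p_true in cpt.items():
    print(states, round(p_true, 4))
```

The table still has 2^n rows for n parents, but only n + 1 numbers have to be specified, which is what an inference engine can exploit.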
  31. 31. 23<br />GeNIe package<br />
  32. 32. No updates – have to compute clique potentials from all children and assigned probabilities<br />Tree structure <br />The key encodes full set of variable values (LongWritable or composite)<br />The value encodes partial sums (proportional to probabilities)<br />No need for TotalOrderPartitioning (we know the key distribution)<br />Need custom Partitioner and WritableComparator (next slide)<br />Need to do the aggregation in the Mapper (sum, next slide)<br />24<br />MapReduce Implementation<br />
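
One way to read "the key encodes full set of variable values" is a mixed-radix packing of the variable states into a single integer, with a range partitioner that exploits the known, dense key distribution. The sketch below is plain Python standing in for a LongWritable key and a custom Partitioner; it is an assumption about how such an encoding could look, not the talk's code, and the function names are hypothetical.

```python
def encode_key(values, cardinalities):
    """Pack one state per variable into a single integer (mixed radix)."""
    key = 0
    for value, size in zip(values, cardinalities):
        if not 0 <= value < size:
            raise ValueError("state out of range for its variable")
        key = key * size + value
    return key


def decode_key(key, cardinalities):
    """Inverse of encode_key: recover the tuple of variable states."""
    values = []
    for size in reversed(cardinalities):
        key, value = divmod(key, size)
        values.append(value)
    return tuple(reversed(values))


def partition(key, total_keys, num_partitions):
    """Range partitioner: keys are dense in [0, total_keys), so equal
    ranges give balanced partitions without sampling the input."""
    return key * num_partitions // total_keys


# Example clique with three variables of 2, 4 and 3 states (24 entries).
cards = [2, 4, 3]
total = 2 * 4 * 3
key = encode_key((1, 3, 2), cards)
print(key, decode_key(key, cards), partition(key, total, 4))
```

Because every key in the range occurs exactly once per clique, splitting the range evenly balances the reducers, which fits the slide's point that no sampling-based total order partitioning is needed.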
  33. 33. Built on top of an old 1997 C program with a few modifications<br />A command line program for interactive analysis<br />Estimates running time from the optimal factoring plan and<br />Either executes it locally<br />Or ships a jar to a Hadoop cluster to execute<br />25<br />Current implementation<br />