1.
Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov
Cloudera, Inc.
2.
Session Agenda
About Me
About Cloudera
Bayesian (Probabilistic) Networks
BN Inference 101
CPCS Network
Why BN Inference
Inference with MR
Results
Conclusions
3.
About Me
Worked on BN inference in 1995-1998 (for Ph.D.)
› Published the fastest implementation at the time
Have worked in the DM/BI field since then
Recently joined Cloudera, Inc.
› Started looking at how to solve the world’s hardest problems
4.
About Cloudera
Founded in the summer of 2008
Cloudera helps organizations profit from all of their data. We deliver the
industry-standard platform which consolidates, stores and processes
any kind of data, from any source, at scale. We make it possible to do
more powerful analysis of more kinds of data, at scale, than ever
before. With Cloudera, you get better insight into your customers,
partners, vendors and businesses.
Cloudera’s platform is built on the popular open source Apache Hadoop
project. We deliver the innovative work of a global community of
contributors in a package that makes it easy for anyone to put the
power of Google, Facebook and Yahoo! to work on their own problems.
5.
Bayesian Networks
1. Nodes
2. Edges
3. Probabilities
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances (published posthumously by his friend). Philosophical Transactions of the Royal Society of London, 53:370-418
6.
Applications
1. Computational biology and bioinformatics (gene regulatory networks,
protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!
7.
A Simple Bayesian Network
Rain:
T     F
0.2   0.8

Sprinkler, given Rain:
Rain   T     F
F      0.4   0.6
T      0.1   0.9

Wet Driveway, given Sprinkler and Rain:
Sprinkler, Rain   T      F
F, F              0.01   0.99
F, T              0.8    0.2
T, F              0.9    0.1
T, T              0.99   0.01
Pr(Rain | Wet Driveway)
Pr(Sprinkler Broken | !Wet Driveway & !Rain)
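Queries like these can be answered by brute-force enumeration of the joint distribution. The following is a minimal sketch, assuming the three tables on this slide are read as Pr(Rain), Pr(Sprinkler | Rain) and Pr(Wet Driveway | Sprinkler, Rain); the names `joint` and `query` are illustrative, and the second query is read here as Pr(Sprinkler = T | no wet driveway, no rain), which is an assumption.

```python
from itertools import product

# CPTs transcribed from the slide (True = T, False = F).
P_RAIN = {True: 0.2, False: 0.8}
# P_SPRINKLER[rain][sprinkler] = Pr(Sprinkler = sprinkler | Rain = rain)
P_SPRINKLER = {False: {True: 0.4, False: 0.6},
               True: {True: 0.1, False: 0.9}}
# P_WET[(sprinkler, rain)] = Pr(Wet Driveway = T | sprinkler, rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}


def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment (product of the CPT entries)."""
    p_wet = P_WET[(sprinkler, rain)]
    return (P_RAIN[rain]
            * P_SPRINKLER[rain][sprinkler]
            * (p_wet if wet else 1.0 - p_wet))


def query(target, evidence):
    """Pr(target = T | evidence), summing the joint over all assignments."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # assignment contradicts the evidence
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


p_rain_given_wet = query("rain", {"wet": True})
p_sprinkler_given_dry_no_rain = query("sprinkler", {"wet": False, "rain": False})
print(p_rain_given_wet, p_sprinkler_given_dry_no_rain)
```

Enumeration touches every row of the joint table, which is fine for three binary nodes and hopeless for hundreds of nodes.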
8.
Asia Network
Probabilities in the network:
Pr(Visit to Asia)
Pr(Smoking)
Pr(Tuberculosis | Visit to Asia)
Pr(Lung Cancer | Smoking)
Pr(Bronchitis | Smoking)
Pr(C | B, E)
Pr(X-Ray | Lung Cancer or Tuberculosis)
Pr(Dyspnea | C, G)
Query:
Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
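Written out, the joint distribution of this network is the product of the probabilities listed on the slide. The lettering below is an assumption inferred from the labels used elsewhere in the deck (A = Visit to Asia, B = Tuberculosis, C = Lung Cancer or Tuberculosis, D = X-Ray, E = Lung Cancer, F = Smoking, G = Bronchitis, H = Dyspnea):

```latex
\Pr(A, B, C, D, E, F, G, H) =
  \Pr(A)\,\Pr(F)\,\Pr(B \mid A)\,\Pr(E \mid F)\,\Pr(G \mid F)\,
  \Pr(C \mid B, E)\,\Pr(D \mid C)\,\Pr(H \mid C, G)
```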
9.
BN Inference 101 (in Hive)
JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule)
CPCS has 422 nodes, so the JPD is a table of at least 2^422 rows!
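The same group-and-sum steps can be sketched in plain Python. The tiny two-variable JPD below is made-up illustration data, not taken from the talk, and `group_by_sum` is a hypothetical helper standing in for the Hive `GROUP BY` with `SUM(PROB)`.

```python
from collections import defaultdict

# Hypothetical toy JPD over two binary variables: rows of (A, B, PROB).
JPD = [
    (0, 0, 0.30),
    (0, 1, 0.10),
    (1, 0, 0.20),
    (1, 1, 0.40),
]


def group_by_sum(rows, key_columns):
    """Mimic SELECT <keys>, SUM(PROB) ... GROUP BY <keys>; PROB is the last column."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        totals[key] += row[-1]
    return [key + (prob,) for key, prob in sorted(totals.items())]


PAB = group_by_sum(JPD, key_columns=(0, 1))  # Pr(A, B)
PB = group_by_sum(PAB, key_columns=(1,))     # Pr(B): group PAB by its B column

# Pr(A | B) = Pr(A, B) / Pr(B)
p_b = {b: prob for b, prob in PB}
p_a_given_b = {(a, b): prob / p_b[b] for a, b, prob in PAB}
print(p_a_given_b)
```

With 422 nodes the JPD list would need at least 2^422 rows, which is why the full table is never materialized in practice.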
10.
Junction Tree
[Figure: junction tree for the Asia network, annotated with the following probabilities]
Pr(Visit to Asia)
Pr(Tuberculosis | Visit to Asia)
Pr(F)
Pr(E | F)
Pr(G | F)
Pr(C | B, E)
Pr(D | C)
Pr(H | C, G)
Query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)
11.
CPCS Networks
422 nodes
14 nodes describe diseases
33 risk factors
375 various findings related to diseases
13.
Why Bayesian Network Inference?
Choose the right tool for the right job!
BN is an abstraction for reasoning and decision making
Easy to incorporate human insight and intuitions
Very general, no specific ‘label’ node
Easy to do ‘what-if’, strength-of-influence, and value-of-information analysis
Immune to Gaussian assumptions
It’s all just a joint probability distribution
15.
MapReduce Implementation
for each clique in depth-first order:
  MAP (done over all child cliques):
    Sum over the variables to get the ‘clique message’ (requires state, a custom partitioner and input format)
    Emit factors for the next clique
  REDUCE:
    Multiply the factors from all children
    Include probabilities assigned to the clique
    Form the new clique values
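One step of this scheme can be sketched in memory, without Hadoop. This is a simplified illustration, not the implementation from the talk: it assumes binary variables, represents a factor as a `(variables, table)` pair, and the names `sum_out`, `multiply`, `map_phase` and `reduce_phase` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# A factor is (variables, table): table maps a tuple of 0/1 values,
# ordered like `variables`, to a number.


def sum_out(factor, keep):
    """MAP side: sum a clique table down to the variables in `keep`."""
    variables, table = factor
    idx = [variables.index(v) for v in keep]
    message = defaultdict(float)
    for values, p in table.items():
        message[tuple(values[i] for i in idx)] += p
    return (tuple(keep), dict(message))


def multiply(f, g):
    """REDUCE side: pointwise product over the union of both variable sets."""
    f_vars, f_table = f
    g_vars, g_table = g
    variables = f_vars + tuple(v for v in g_vars if v not in f_vars)
    table = {}
    for values in product((0, 1), repeat=len(variables)):
        world = dict(zip(variables, values))
        table[values] = (f_table[tuple(world[v] for v in f_vars)]
                         * g_table[tuple(world[v] for v in g_vars)])
    return (variables, table)


def map_phase(child_cliques):
    """Emit (parent_id, message) for every (parent_id, separator, factor)."""
    for parent_id, separator, factor in child_cliques:
        yield parent_id, sum_out(factor, separator)


def reduce_phase(own_potential, messages):
    """Multiply the clique's own probabilities with all child messages."""
    result = own_potential
    for message in messages:
        result = multiply(result, message)
    return result


# Toy example: child clique holds Pr(A, B); parent clique holds Pr(C | B).
child = (("A", "B"), {(0, 0): 0.56, (0, 1): 0.14, (1, 0): 0.03, (1, 1): 0.27})
parent = (("B", "C"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75})

messages = [msg for _, msg in map_phase([("parent", ("B",), child)])]
new_vars, new_table = reduce_phase(parent, messages)  # Pr(B, C)
print(new_vars, new_table)
```

In a real job the emitted key would route each message to the reducer that owns the parent clique, which is presumably where the custom partitioner mentioned above comes in.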
16.
Cliques, Trees, and Parallelism
[Figure: junction tree with cliques C1 through C6]
o Topological parallelism: compute branches C2 and C4 in parallel
o Clique parallelism: divide computation of each clique into maps/reducers
o Fall back to optimal factoring if a corresponding subtree is small
o Combine multiple phases together
o Reduce replication level
Cliques may be larger than they appear!
17.
CPCS Inference
CPCS:
The 360-node subnet has a largest ‘clique’ of 11,739,896 floats (fits into 2 GB)
The full 422-node version (absent, mild, moderate, severe) has a largest ‘clique’ of 3,377,699,720,527,872 floats (12 PB of storage, though not all queries need it)
In most cases there is no need to do inference on the full network
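The storage figures can be checked with a few lines of arithmetic. The sketch below assumes 4-byte (single-precision) floats and binary units (1 PB taken as 2^50 bytes); both are assumptions, chosen because they reproduce the 12 PB figure quoted above.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats


def clique_bytes(num_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Raw storage for a clique table holding `num_floats` entries."""
    return num_floats * bytes_per_float


cpcs360_bytes = clique_bytes(11_739_896)
cpcs422_bytes = clique_bytes(3_377_699_720_527_872)

print(cpcs360_bytes / 2**20)  # size of the 360-node clique in MiB
print(cpcs422_bytes / 2**50)  # size of the 422-node clique in PiB
```

Note that 3,377,699,720,527,872 is exactly 3 × 2^50, so the table is sized by a product of small state counts rather than by any measured data volume.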
18.
Results
Network      Memory    Time (1997) [1]   Macbook Pro (2010) [2]   Hadoop (& future) [3]
Random (B)   10 MB     33 sec            < 1 sec
Random (A)   254 MB    260 sec           10 sec
cpcs360      2 GB      640 sec           15 sec                   1 min
cpcs422      > 12 PB   N/A               N/A                      Minutes to hours for most of the queries on most of the clusters

[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
[2] Macbook Pro, 4 GB DDR3, 2.53 GHz
[3] 10-node Linux Xeon cluster, 24 GB, quad 2-core
19.
Conclusions
Exact probabilistic inference is finally in sight for the full 422-node CPCS network
Hadoop helps to solve the world’s hardest problems
What you should know after this talk:
A BN is a DAG and represents a joint probability distribution (JPD)
Conditional probabilities can be computed by multiplying and summing over the JPD
For large networks, this may mean petabytes of intermediate data, but MapReduce can handle it