Exact Inference in Bayesian
Networks using MapReduce
Alex Kozlov
Cloudera, Inc.
Session Agenda


 About Me
 About Cloudera
 Bayesian (Probabilistic) Networks
 BN Inference 101
 CPCS Network
 Why BN Inference
 Inference with MR
 Results
 Conclusions
About Me



 Worked on BN Inference in 1995-1998 (for Ph.D.)
 ›   Published the fastest implementation at the time
 Worked in the DM/BI field since then
 Recently joined Cloudera, Inc.
 ›   Started looking at how to solve the world’s hardest problems




About Cloudera


Founded in the summer of 2008
Cloudera helps organizations profit from all of their data. We deliver the
  industry-standard platform which consolidates, stores and processes
  any kind of data, from any source, at scale. We make it possible to do
  more powerful analysis of more kinds of data, at scale, than ever
  before. With Cloudera, you get better insight into your customers,
  partners, vendors and businesses.


Cloudera’s platform is built on the popular open source Apache Hadoop
  project. We deliver the innovative work of a global community of
  contributors in a package that makes it easy for anyone to put the
  power of Google, Facebook and Yahoo! to work on their own problems.


Bayesian Networks


1. Nodes
2. Edges
3. Probabilities


 Bayes, Thomas (1763). An essay towards solving a problem in the doctrine
 of chances (published posthumously by his friend). Philosophical
 Transactions of the Royal Society of London, 53:370-418



Applications


1. Computational biology and bioinformatics (gene regulatory networks,
   protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!



A Simple BN Network


Pr(Rain)
  T      F
  0.2    0.8

Pr(Sprinkler | Rain)
  Rain   T      F
  F      0.4    0.6
  T      0.1    0.9

Pr(Wet Driveway | Sprinkler, Rain)
  Sprinkler, Rain   T      F
  F, F              0.01   0.99
  F, T              0.8    0.2
  T, F              0.9    0.1
  T, T              0.99   0.01

       Pr(Rain | Wet Driveway)
 Pr(Sprinkler Broken | !Wet Driveway & !Rain)
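
The two queries on this slide can be checked by brute-force enumeration. The sketch below is not from the talk: it assumes the first table is Pr(Sprinkler | Rain), the second is Pr(Rain), and that 'Sprinkler Broken' means Sprinkler = F. The names `joint` and `conditional` are made up for illustration.

```python
from itertools import product

# CPT values as read from the slide (the table layout is an assumed reading).
p_rain = {True: 0.2, False: 0.8}
# Pr(Sprinkler = T | Rain), keyed by the state of Rain
p_sprinkler_on = {False: 0.4, True: 0.1}
# Pr(Wet Driveway = T | Sprinkler, Rain), keyed by (sprinkler, rain)
p_wet = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}


def joint(rain, sprinkler, wet):
    """Pr(Rain, Sprinkler, Wet) as the product of the three CPT entries."""
    p_s = p_sprinkler_on[rain] if sprinkler else 1.0 - p_sprinkler_on[rain]
    p_w = p_wet[(sprinkler, rain)] if wet else 1.0 - p_wet[(sprinkler, rain)]
    return p_rain[rain] * p_s * p_w


def conditional(query, evidence):
    """Pr(query | evidence) by summing the joint over all eight worlds."""
    num = den = 0.0
    for rain, sprinkler, wet in product((True, False), repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(rain, sprinkler, wet)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_rain_given_wet = conditional({"rain": True}, {"wet": True})
# Assumption: "Sprinkler Broken" is modeled as Sprinkler = F.
p_off_given_dry = conditional({"sprinkler": False}, {"wet": False, "rain": False})
print(round(p_rain_given_wet, 4))  # 0.3587
print(round(p_off_given_dry, 4))
```

Enumeration is fine for three binary nodes; it is exactly the approach that stops scaling for the larger networks later in the deck.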
Asia Network

Factors shown in the network diagram:
 Pr(Visit to Asia)
 Pr(Smoking)
 Pr(Tuberculosis | Visit to Asia)
 Pr(Lung Cancer | Smoking)
 Pr(Bronchitis | Smoking)
 Pr(C | BE)
 Pr(X-Ray | Lung Cancer or Tuberculosis)
 Pr(Dyspnea | CG)

Query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
BN Inference 101 (in Hive)


JPD = <product of all probabilities and conditional
  probabilities in the network> = Pr(A, B, …, H)
PAB =
   SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) (definition of conditional probability, from which Bayes’ rule follows)


CPCS is 422 nodes, a table of at least 2^422 rows!
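
The same two aggregations can be tried on a toy table. This is a sketch only: it uses Python's built-in sqlite3 in place of Hive, and the three-variable JPD with its weights is invented for illustration.

```python
import itertools
import sqlite3

# Hypothetical joint distribution over binary A, B, C; the weights are
# made up and normalized so that the table sums to 1.
weights = {(a, b, c): 1 + a + 2 * b + 3 * c
           for a, b, c in itertools.product((0, 1), repeat=3)}
total = sum(weights.values())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JPD (A INTEGER, B INTEGER, C INTEGER, PROB REAL)")
conn.executemany("INSERT INTO JPD VALUES (?, ?, ?, ?)",
                 [(a, b, c, w / total) for (a, b, c), w in weights.items()])

# Sum out C to get Pr(A, B), then sum out A to get Pr(B).
conn.execute("CREATE TABLE PAB AS "
             "SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B")
conn.execute("CREATE TABLE PB AS "
             "SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B")

# Pr(A | B) = Pr(A, B) / Pr(B)
rows = conn.execute(
    "SELECT PAB.A, PAB.B, PAB.PROB / PB.PROB "
    "FROM PAB JOIN PB ON PAB.B = PB.B ORDER BY PAB.B, PAB.A").fetchall()
p_a_given_b = {(a, b): p for a, b, p in rows}
for (a, b), p in p_a_given_b.items():
    print(f"Pr(A={a} | B={b}) = {p:.4f}")
```

With n binary nodes the JPD table has 2^n rows, which is why this direct approach only works for toy networks.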


Junction Tree
Factors attached to the cliques in the diagram:
 Pr(Visit to Asia)
 Pr(Tuberculosis | Visit to Asia)
 Pr(F)
 Pr(E | F)
 Pr(G | F)
 Pr(C | BE)
 Pr(D | C)
 Pr(H | CG)

Query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)
CPCS Networks


 422 nodes
 14 nodes describe diseases
 33 risk factors
 375 various findings related to diseases




Why Bayesian Network Inference?


                Choose the right tool for the right job!


   BN is an abstraction for reasoning and decision making
   Easy to incorporate human insight and intuitions
   Very general, no specific ‘label’ node
   Easy to do ‘what-if’, strength-of-influence, and value-of-information
    analysis
   Does not rely on Gaussian assumptions


              It’s all just a joint probability distribution

Map & Reduces
Map, Aggregation 1 (+):
  A1B1, A2B1, A1B2, A2B2 are summed over A and keyed by B1, B2
  C1D1, C2D1, C1D2, C2D2 are summed over D and keyed by C1, C2

Keys (entries of the target clique BCE):
  B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2

Reduce, Aggregation 2 (x):
  ∑ Pr(B1 | A) x ∑ Pr(D | C1)
  Pr(C | BE) x ∑ Pr(B1 | A) x ∑ Pr(D | C1)

Result: new values for clique BCE
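
The two aggregations in the diagram can be imitated in plain Python. This is a sketch, not the Hadoop job from the talk: all table values are invented, and `map_phase`, `sum_by_key` and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical child-clique tables (made-up numbers).
clique_ab = {("A1", "B1"): 0.3, ("A2", "B1"): 0.6,   # Pr(B | A)
             ("A1", "B2"): 0.7, ("A2", "B2"): 0.4}
clique_cd = {("C1", "D1"): 0.2, ("C1", "D2"): 0.8,   # Pr(D | C)
             ("C2", "D1"): 0.5, ("C2", "D2"): 0.5}
# Pr(C | B, E), keyed by (b, c, e); also made up.
local_bce = {("B1", "C1", "E1"): 0.9, ("B1", "C2", "E1"): 0.1,
             ("B1", "C1", "E2"): 0.6, ("B1", "C2", "E2"): 0.4,
             ("B2", "C1", "E1"): 0.5, ("B2", "C2", "E1"): 0.5,
             ("B2", "C1", "E2"): 0.2, ("B2", "C2", "E2"): 0.8}


def map_phase(clique, keep_index, tag):
    """Emit ((tag, shared state), value) for every entry of a child clique."""
    for states, value in clique.items():
        yield (tag, states[keep_index]), value


def sum_by_key(pairs):
    """Aggregation 1 (+): add up all values that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def reduce_phase(messages, local):
    """Aggregation 2 (x): local probability times the child messages."""
    return {(b, c, e): p * messages[("B", b)] * messages[("C", c)]
            for (b, c, e), p in local.items()}


messages = sum_by_key(chain(map_phase(clique_ab, 1, "B"),
                            map_phase(clique_cd, 0, "C")))
new_bce = reduce_phase(messages, local_bce)
for key in sorted(new_bce):
    print(key, round(new_bce[key], 4))
```

In a real job the summation would be spread across mappers or combiners and the multiplication across reducers; here both run in one process so the data flow is easy to follow.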
MapReduce Implementation


for each clique in depth-first order:
   MAP:
       Sum over the variables to get ‘clique message’ (requires state, custom
         partitioner and input format)
       Emit factors for the next clique

   REDUCE:
       Multiply the factors from all children
       Include probabilities assigned to the clique
       Form the new clique values

the MAP is done over all child cliques
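
A single-process sketch of the loop above, assuming each clique's local table covers all of that clique's variables. The factor representation, the helper names (`multiply`, `sum_out`, `collect`) and the chain A -> B -> C with its numbers are illustrative assumptions, not the talk's implementation.

```python
from itertools import product


def multiply(f, g, domains):
    """Pointwise product of two factors; a factor is (variables, table)."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(dict.fromkeys(f_vars + g_vars))  # ordered union
    out_tab = {}
    for states in product(*(domains[v] for v in out_vars)):
        row = dict(zip(out_vars, states))
        out_tab[states] = (f_tab[tuple(row[v] for v in f_vars)]
                           * g_tab[tuple(row[v] for v in g_vars)])
    return out_vars, out_tab


def sum_out(f, keep):
    """Sum a factor down to the variables in `keep` (the clique message)."""
    f_vars, f_tab = f
    out_vars = tuple(v for v in f_vars if v in keep)
    out_tab = {}
    for states, value in f_tab.items():
        key = tuple(s for v, s in zip(f_vars, states) if v in keep)
        out_tab[key] = out_tab.get(key, 0.0) + value
    return out_vars, out_tab


def collect(clique, tree, potentials, domains, parent=None):
    """Depth-first pass; returns the message from `clique` to `parent`."""
    belief = potentials[clique]
    for child in tree.get(clique, []):
        # REDUCE: multiply in the factors arriving from each child.
        message = collect(child, tree, potentials, domains, clique)
        belief = multiply(belief, message, domains)
    if parent is None:
        return belief
    # MAP: sum over everything not shared with the parent clique.
    separator = set(belief[0]) & set(potentials[parent][0])
    return sum_out(belief, separator)


# Hypothetical chain A -> B -> C with cliques AB and BC (made-up numbers).
domains = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}
potentials = {
    "AB": (("A", "B"), {(0, 0): 0.56, (0, 1): 0.14,     # Pr(A) * Pr(B | A)
                        (1, 0): 0.03, (1, 1): 0.27}),
    "BC": (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1,       # Pr(C | B)
                        (1, 0): 0.5, (1, 1): 0.5}),
}
tree = {"BC": ["AB"]}  # BC is the root, AB its only child

root = collect("BC", tree, potentials, domains)
p_c = sum_out(root, {"C"})
print(p_c)
```

The recursion visits children before their parent, which matches the depth-first order on the slide; on a cluster each recursive step would be one map/reduce round over the clique tables.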

Cliques, Trees, and Parallelism


 [Diagram: tree of cliques C1 through C6]

 o Topological parallelism: compute branches C2 and C4 in parallel
 o Clique parallelism: divide computation of each clique into
   maps/reducers
 o Fall back into optimal factoring if a corresponding subtree is small
 o Combine multiple phases together
 o Reduce replication level

 Cliques may be larger than they appear!
CPCS Inference


CPCS:
The 360-node subnet has the largest ‘clique’ of
 11,739,896 floats (fits into 2GB)
The full 422-node version (absent, mild, moderate, severe) needs
 3,377,699,720,527,872 floats (or 12 PB of storage, but not
    all queries need it)


In most cases there is no need to do inference on the full network
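
A quick check of the storage figures, assuming 4-byte (single-precision) floats; the slide does not state the float width, but 4 bytes reproduces the 12 PB number.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats

cpcs360_floats = 11_739_896
cpcs422_floats = 3_377_699_720_527_872  # equals 3 * 2**50

cpcs360_mib = cpcs360_floats * BYTES_PER_FLOAT / 2**20
cpcs422_pib = cpcs422_floats * BYTES_PER_FLOAT / 2**50

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422: {cpcs422_pib:.0f} PiB")
```

Under this assumption the 360-node clique table alone is well under the 2 GB quoted on the slide, and the 422-node figure is exactly 12 PiB.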



Results

Network      Memory     Time (1997) [1]   Macbook Pro (2010) [2]   Hadoop (& future) [3]
Random (B)   10 MB      33 sec            < 1 sec
Random (A)   254 MB     260 sec           10 sec
cpcs360      2 GB       640 sec           15 sec                   1 min
cpcs422      > 12 PB    N/A               N/A                      Minutes to hours for most of the
                                                                   queries on most of the clusters

[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195
    MHz clock speed)’ in 1997
[2] Macbook Pro 4 GB DDR3 2.53 GHz
[3] 10 node Linux Xeon cluster 24 GB quad 2-core

Conclusions


   Exact probabilistic inference is finally in sight for the full 422 node
    CPCS network
   Hadoop helps to solve the world’s hardest problems


         What you should know after this talk

A BN is a DAG and represents a joint probability distribution (JPD)
Conditional probabilities can be computed by multiplying and summing over the JPD
For large networks, this may mean PBytes of intermediate data, but MR is built for that




Questions?


   alexvk@{cloudera,gmail}.com

Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

  • 1. Exact Inference in Bayesian Networks using MapReduce Alex Kozlov Cloudera, Inc.
  • 2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions
  • 3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.); published the fastest implementation at the time. Worked in the DM/BI field since then. Recently joined Cloudera, Inc.; started looking at how to solve the world’s hardest problems.
  • 4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
  • 5. Bayesian Networks: 1. Nodes; 2. Edges; 3. Probabilities. Reference: Bayes, Thomas (1763), An essay towards solving a problem in the doctrine of chances, published posthumously by his friend, Philosophical Transactions of the Royal Society of London, 53:370-418.
  • 6. Applications: 1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis); 2. Medicine; 3. Document classification, information retrieval; 4. Image processing; 5. Data fusion; 6. Gaming; 7. Law; 8. On-line advertising!
  • 7. A Simple BN Network
      Pr(Rain):
        T      F
        0.2    0.8
      Pr(Sprinkler | Rain):
        Rain   T      F
        F      0.4    0.6
        T      0.1    0.9
      Pr(Wet Driveway | Sprinkler, Rain):
        Sprinkler, Rain   T      F
        F, F              0.01   0.99
        F, T              0.8    0.2
        T, F              0.9    0.1
        T, T              0.99   0.01
      Example queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain)
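The first query on slide 7, Pr(Rain | Wet Driveway), can be checked by brute-force enumeration of the joint distribution. The sketch below is not from the talk: it assumes the flattened slide tables decode as Pr(Rain=T) = 0.2, Pr(Sprinkler=T | Rain=F) = 0.4, Pr(Sprinkler=T | Rain=T) = 0.1, plus the four-row Wet Driveway table, and the function names are illustrative.

```python
from itertools import product

# CPTs as read from slide 7. The table layout is a reconstruction of the
# flattened slide text, so treat these numbers as an assumption.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_GIVEN_RAIN = {True: 0.1, False: 0.4}  # Pr(Sprinkler=T | Rain)
P_WET_GIVEN = {  # Pr(Wet=T | Sprinkler, Rain), keyed as (sprinkler, rain)
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def joint(rain, sprinkler, wet):
    """Pr(Rain, Sprinkler, Wet) as the product of the three CPT entries."""
    p_s = P_SPRINKLER_GIVEN_RAIN[rain]
    p_w = P_WET_GIVEN[(sprinkler, rain)]
    return (P_RAIN[rain]
            * (p_s if sprinkler else 1.0 - p_s)
            * (p_w if wet else 1.0 - p_w))


def prob_rain_given_wet():
    """Pr(Rain=T | Wet=T) = Pr(Rain=T, Wet=T) / Pr(Wet=T)."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True)
                      for r, s in product((True, False), repeat=2))
    return numerator / denominator


print(round(prob_rain_given_wet(), 4))
```

Under those assumed tables the numerator is 0.1638 and the denominator 0.4566, so the posterior comes out a little under 0.36. Enumeration like this only works for toy networks, since the joint table doubles with every added binary node.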
  • 8. Asia Network: Pr(Visit to Asia); Pr(Smoking); Pr(Tuberculosis | Visit to Asia); Pr(Lung Cancer | Smoking); Pr(Bronchitis | Smoking); Pr(C | BE); Pr(X-Ray | Lung Cancer or Tuberculosis); Pr(Dyspnea | CG). Example query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
  • 9. BN Inference 101 (in Hive)
      JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
      PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B;
      PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B;
      Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule
      CPCS is 422 nodes, a table of at least 2^422 rows!
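The two GROUP BY queries on slide 9 can be tried end to end on a toy table. The sketch below is an illustration rather than the talk's code: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the three-variable JPD (with A = B made three times as likely as A != B) is invented for the example.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JPD (A INTEGER, B INTEGER, C INTEGER, PROB REAL)")

# Hypothetical joint distribution over three binary variables:
# rows with A = B get weight 3, the rest weight 1 (weights sum to 16).
rows = []
for a, b, c in product((0, 1), repeat=3):
    weight = 3.0 if a == b else 1.0
    rows.append((a, b, c, weight / 16.0))
conn.executemany("INSERT INTO JPD VALUES (?, ?, ?, ?)", rows)

# Marginalize step by step, mirroring the PAB and PB queries on the slide.
conn.execute(
    "CREATE TABLE PAB AS SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B")
conn.execute(
    "CREATE TABLE PB AS SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B")

# Pr(A | B) = Pr(A, B) / Pr(B)
cond = {
    (a, b): p
    for a, b, p in conn.execute(
        "SELECT PAB.A, PAB.B, PAB.PROB / PB.PROB "
        "FROM PAB JOIN PB ON PAB.B = PB.B")
}
print(cond[(1, 1)])
```

For this toy table Pr(A=1, B=1) is 6/16 and Pr(B=1) is 8/16, so Pr(A=1 | B=1) is 0.75. The same pattern needs a JPD table with one row per joint state, which is what makes it impractical for a 422-node network.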
  • 10. Junction Tree: Pr(E | F); Pr(Tuberculosis | Visit to Asia); Pr(G | F); Pr(Visit to Asia); Pr(F); Pr(C | BE); Pr(H | CG); Pr(D | C). Pr(Lung Cancer | Dyspnea) = Pr(E | H)
  • 11. CPCS Networks: 422 nodes: 14 nodes describe diseases, 33 risk factors, 375 various findings related to diseases
  • 13. Why Bayesian Network Inference? Choose the right tool for the right job! BN is an abstraction for reasoning and decision making; easy to incorporate human insight and intuitions; very general, no specific ‘label’ node; easy to do ‘what-if’, strength-of-influence, and value-of-information analysis; immune to Gaussian assumptions. It’s all just a joint probability distribution.
  • 14. Map & Reduces [Figure: map keys over the child cliques (A1B1, A2B1, A1B2, A2B2 and C1D1, C2D1, C1D2, C2D2) are reduced on keys B1, B2 and C1, C2, then combined into clique BCE (B1C1E1 through B2C2E2). Aggregation 1 (+) sums within each child clique, e.g. ∑ Pr(B1| A) and ∑ Pr(D| C1); Aggregation 2 (x) multiplies the results into BCE: Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1).]
  • 15. MapReduce Implementation
      for each clique in depth-first order:
        MAP:
          Sum over the variables to get the ‘clique message’ (requires state, custom partitioner and input format)
          Emit factors for the next clique
        REDUCE:
          Multiply the factors from all children
          Include probabilities assigned to the clique
          Form the new clique values
      The MAP is done over all child cliques.
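The MAP and REDUCE steps on slide 15 can be sketched in a single process, without Hadoop. The code below is an illustration under stated assumptions, not the talk's implementation: factors are plain dicts over binary variables, the helper names `sum_out` and `multiply` are made up, and the clique potentials are invented numbers arranged like the AB, CD and BCE cliques of slide 14.

```python
from itertools import product


def sum_out(factor, keep):
    """MAP side: sum a factor down to the variables in `keep` (clique message)."""
    variables, table = factor
    idx = [variables.index(v) for v in keep]
    message = {}
    for assignment, p in table.items():
        key = tuple(assignment[i] for i in idx)
        message[key] = message.get(key, 0.0) + p
    return (tuple(keep), message)


def multiply(variables, factors):
    """REDUCE side: multiply factors into one table over binary `variables`."""
    table = {}
    for assignment in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, assignment))
        p = 1.0
        for f_vars, f_table in factors:
            p *= f_table[tuple(env[v] for v in f_vars)]
        table[assignment] = p
    return (tuple(variables), table)


# Hypothetical child-clique potentials (not from the talk).
clique_ab = (("A", "B"), {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4})
clique_cd = (("C", "D"), {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1})

# Probability assigned to the parent clique: Pr(C | B, E), keyed as (B, C, E).
pr_c_given_be = (("B", "C", "E"), {
    (b, c, e): (0.7 if c == b else 0.3)
    for b, c, e in product((0, 1), repeat=3)
})

# MAP over all child cliques, then REDUCE into the parent clique BCE.
messages = [sum_out(clique_ab, ("B",)), sum_out(clique_cd, ("C",))]
clique_bce = multiply(("B", "C", "E"), messages + [pr_c_given_be])
print(clique_bce[1][(1, 1, 0)])
```

In a real job the message key (here the values of B or C) would be the shuffle key, so that all entries contributing to the same parent-clique cell meet in one reducer. The sketch keeps everything in memory, which sidesteps the state, partitioner and input-format issues the slide mentions.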
  • 16. Cliques, Trees, and Parallelism [Figure: clique tree with cliques C1 to C6]
      o Topological parallelism: compute branches C2 and C4 in parallel
      o Clique parallelism: divide computation of each clique into maps/reducers
      o Fall back into optimal factoring if a corresponding subtree is small
      o Combine multiple phases together
      o Reduce replication level
      Cliques may be larger than they appear!
  • 17. CPCS Inference: The 360-node CPCS subnet has a largest ‘clique’ of 11,739,896 floats (fits into 2 GB). The full 422-node version (absent, mild, moderate, severe): 3,377,699,720,527,872 floats (12 PB of storage, though not all queries need it). In most cases there is no need to do inference on the full network.
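The storage figures on slide 17 can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte single-precision floats and binary units (MiB, PiB); the slide does not state either, but that combination gives the quoted 12 PB.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats

cpcs360_floats = 11_739_896                # largest clique, 360-node subnet
cpcs422_floats = 3_377_699_720_527_872     # full 422-node network

cpcs360_mib = cpcs360_floats * BYTES_PER_FLOAT / 2**20
cpcs422_pib = cpcs422_floats * BYTES_PER_FLOAT / 2**50

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422: {cpcs422_pib:.1f} PiB")
```

The 422-node count is exactly 3 x 2^50 floats, so at 4 bytes each it comes to 12 PiB. The largest cpcs360 clique alone is about 45 MiB, well inside the 2 GB quoted for the whole computation.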
  • 18. Results
      Network      Memory    Time (1997¹)   Macbook Pro (2010²)   Hadoop (& future³)
      Random (B)   10 MB     33 sec         < 1 sec
      Random (A)   254 MB    260 sec        10 sec
      cpcs360      2 GB      640 sec        15 sec                1 min
      cpcs422      > 12 PB   N/A            N/A
      Minutes to hours for most of the queries on most of the clusters
      ¹ ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
      ² Macbook Pro, 4 GB DDR3, 2.53 GHz
      ³ 10-node Linux Xeon cluster, 24 GB, quad 2-core
  • 19. Conclusions: Exact probabilistic inference is finally in sight for the full 422-node CPCS network. Hadoop helps to solve the world’s hardest problems. What you should know after this talk: a BN is a DAG and represents a joint probability distribution (JPD); conditional probabilities can be computed by multiplying and summing over the JPD; for large networks this may be petabytes of intermediate data, but it’s MR.
  • 20. Questions? alexvk@{cloudera,gmail}.com