Exact Inference in Bayesian Networks using MapReduce__HadoopSummit2010




Hadoop Summit 2010 - Research Track

Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera



  • Abstract
    Probabilistic inference is a way of obtaining values of unobservable variables from incomplete data. It is used in robotics, medical diagnostics, image recognition, finance and other fields. One tool for inference, and a way to represent knowledge, is the Bayesian network (BN), where nodes represent variables and edges represent probabilistic dependencies between variables. The advantage of exact probabilistic inference using a BN is that it does not involve the traditional Gaussian distribution assumptions, and the results are immune to Taleb's distributions, that is, distributions with a high probability of outliers.
    A typical application of probabilistic inference is to infer the probability of one or several dependent variables, such as the probability that a person has a certain disease, given other observations, such as the presence of abdominal pain. In exact probabilistic inference, variables are clustered in groups called cliques, and inference is carried out by manipulating more or less complex data structures on top of the cliques. This leads to high computational and space complexity: the data structures can become very complex and large. The advantage is that one can encode arbitrarily complex distributions and dependencies.
    While a lot of research has been devoted to devising schemes that approximate the solution, Hadoop allows performing exact inference on the whole network. We present an approach for performing large-scale probabilistic inference in probabilistic networks in a Hadoop cluster. Probabilistic inference is reduced to a number of MR jobs over the data structures representing clique potentials. One of the applications is the CPCS BN, one of the biggest models created at the Stanford Medical Informatics Center (now the Stanford Center for Biomedical Informatics Research) in 1994, and never solved exactly.
    This specific network contains 422 nodes representing states of different variables: 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.
  • Here is what I am going to talk about:
    1. I will not be able to delve into every detail, and the implementation is not complete.
    2. BN Inference is not a Cloudera product today, therefore it’s not a product announcement!
    3. This is not a research paper either!
    Promise: no formulas or complicated math.
    I promise there will be at least one photo and an SQL statement.
    CPCS (Computer-based Patient Case Study) model [Pradhan et al. 1994]
    [Pradhan et al. 1994] Malcolm Pradhan, Gregory Provan, Blackford Middleton, and Max Henrion. Knowledge engineering for large belief networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 484-490, San Francisco, CA, 1994. Morgan Kaufmann Publishers.
  • I have been doing probabilistic inference since 1994!
    There is a resurgence of interest in parallel computation; see the 2008-2010 papers by Yinglong Xia and Viktor K. Prasanna.
  • Interest in Hadoop is surging…
    Hadoop is: ‘A scalable fault-tolerant distributed system for data storage and processing’
    Hadoop History
    2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
    2003-2004: Google publishes GFS and MapReduce papers
    2004: Cutting adds DFS & MapReduce support to Nutch
    2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
    2007: NY Times converts 4 TB of archives over 100 EC2s
    2008: Web-scale deployments at Y!, Facebook, Last.fm
    April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
    May 2009: Yahoo does fastest sort of a TB, 62 secs over 1460 nodes; Yahoo sorts a PB in 16.25 hours over 3658 nodes
    June 2009, Oct 2009: Hadoop Summit, Hadoop World
    September 2009: Doug Cutting joins Cloudera
  • A gentle introduction to BNs
    A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG).
    Formally, Bayesian networks are directed acyclic graphs whose nodes represent random variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes which are not connected represent variables which are conditionally independent of each other. Each node is associated with a probability function that takes as input a particular set of values for the node's parent variables and gives the probability of the variable represented by the node. For example, if the parents are m Boolean variables, then the probability function could be represented by a table of 2^m entries, one entry for each of the 2^m possible combinations of its parents being true or false.
    Efficient algorithms exist that perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.
    Bayes did not invent BNs; he did not even have a publication on probabilities during his lifetime.
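The table-size remark above can be checked directly. The following is a minimal sketch, not code from the talk: the function name `make_cpt` and the placeholder probability 0.5 are assumptions made for illustration.

```python
from itertools import product


def make_cpt(num_parents, prob_true=0.5):
    """Build a conditional probability table for a Boolean node.

    Keys are tuples of parent values; each value stands for
    Pr(node=True | parents). prob_true is a placeholder (an assumption),
    not a number taken from the slides.
    """
    return {
        assignment: prob_true
        for assignment in product((False, True), repeat=num_parents)
    }


if __name__ == "__main__":
    for m in range(5):
        # The table has one row per combination of parent values: 2**m rows.
        print(m, len(make_cpt(m)))
```

The exponential growth of this table with the number of parents is the same effect that makes large cliques expensive later in the talk.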
  • As you may notice, BNs are used anywhere where data are a bit more complex (‘unstructured data’ in RDBMS terms)
    Like Hadoop!
    Naïve Bayes is the most famous incarnation of a BN (conditional independence of attribute variables given the class label)
    Let’s look at some examples of BNs
  • A reasoning tool: People think they are good with probabilities
    One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than a complete joint distribution.
    Wind blows – trees move?
    May be extended to ‘causal networks’
  • A more complex network
    Visit to Asia: predisposing factor
    Tuberculosis, Lung Cancer, Bronchitis: diseases
    X-Ray, Dyspnea: findings
    LungCancer or Tuberculosis: hidden node
  • CS 221 at Stanford
    BN Inference in Hive
    For example, Pr(Lung Cancer|Dyspnea)
    Can sum intelligently: see the optimal factoring approach in my Ph.D. thesis
    The largest clique size is the max width in CSP terms (did I mention it’s NP-hard?)
    Approximate and sampling algorithms exist
    Formally, it can be represented as ‘variable elimination’ or ‘belief propagation’ up and down a join tree
    Let’s have a look
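A query such as Pr(Lung Cancer|Dyspnea) can be answered on a network this small by brute-force enumeration of the joint distribution. The sketch below does exactly that; it is not the Hive query from the slides and not the optimal factoring approach. The network structure follows the notes above with a Smoking node added as the parent of Lung Cancer and Bronchitis, and all probability values are illustrative assumptions.

```python
from itertools import product

# Illustrative CPT values (assumptions, not taken from the slides).
P_ASIA = 0.01                            # Pr(visit to Asia)
P_SMOKE = 0.5                            # Pr(smoking)
P_TUB = {True: 0.05, False: 0.01}        # Pr(tuberculosis | asia)
P_LUNG = {True: 0.10, False: 0.01}       # Pr(lung cancer | smoking)
P_BRONC = {True: 0.60, False: 0.30}      # Pr(bronchitis | smoking)
P_XRAY = {True: 0.98, False: 0.05}       # Pr(positive x-ray | either)
P_DYSP = {(True, True): 0.90, (True, False): 0.70,
          (False, True): 0.80, (False, False): 0.10}  # Pr(dyspnea | either, bronc)


def bern(p, value):
    """Probability that a Boolean variable with Pr(True)=p takes `value`."""
    return p if value else 1.0 - p


def joint(asia, smoke, tub, lung, bronc, xray, dysp):
    """Joint probability of one full assignment, as a product of local tables."""
    either = tub or lung  # deterministic hidden node: LungCancer or Tuberculosis
    return (bern(P_ASIA, asia) * bern(P_SMOKE, smoke)
            * bern(P_TUB[asia], tub) * bern(P_LUNG[smoke], lung)
            * bern(P_BRONC[smoke], bronc) * bern(P_XRAY[either], xray)
            * bern(P_DYSP[(either, bronc)], dysp))


def query_lung_given_dysp():
    """Pr(lung=True | dysp=True) by summing the joint over all other variables."""
    numerator = denominator = 0.0
    for asia, smoke, tub, lung, bronc, xray in product((False, True), repeat=6):
        p = joint(asia, smoke, tub, lung, bronc, xray, True)
        denominator += p
        if lung:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(query_lung_given_dysp())
```

Enumeration touches every one of the 2^7 assignments, which is why it does not scale; variable elimination and join-tree propagation reorder the same sums and products so that only clique-sized tables are ever materialized.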
  • Junction tree: Each probability is assigned to one of the cliques in the junction tree
    When we sum, the result is a message (M)
    When we multiply, the result is a (R)
    Already looks like MapReduce! MapReduce existed long before it was invented.
    But before we delve into the MR implementation, let’s talk about the CPCS (Computer-based Patient Case Study) network
    Did I mention BN inference is NP-hard? It can be mapped to a CSP problem
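The two operations named above, summing to produce a message and multiplying to absorb one, can be sketched on small Boolean factors. This is a simplified illustration, not the data structure used in the talk's implementation; the `Factor` class, the function names, and the wind/trees numbers are all assumptions.

```python
from itertools import product


class Factor:
    """A table over named Boolean variables: maps value tuples to numbers."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    table = {}
    for values in product((False, True), repeat=len(variables)):
        row = dict(zip(variables, values))
        f_value = f.table[tuple(row[v] for v in f.variables)]
        g_value = g.table[tuple(row[v] for v in g.variables)]
        table[values] = f_value * g_value
    return Factor(variables, table)


def sum_out(f, var):
    """Sum a variable out of a factor; the result plays the role of a message."""
    idx = f.variables.index(var)
    variables = f.variables[:idx] + f.variables[idx + 1:]
    table = {}
    for values, p in f.table.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(variables, table)


# Hypothetical numbers for the wind -> trees example.
p_wind = Factor(["wind"], {(True,): 0.3, (False,): 0.7})
p_trees = Factor(["wind", "trees"], {(True, True): 0.9, (True, False): 0.1,
                                     (False, True): 0.1, (False, False): 0.9})

if __name__ == "__main__":
    message = sum_out(multiply(p_wind, p_trees), "wind")
    print(message.variables, message.table)
```

In a junction tree the same pair of operations is applied clique by clique: a clique sums out the variables its neighbour does not share and the neighbour multiplies the resulting message into its own table.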
  • A typical query is Pr(diseases|findings, risk factors)
    One big mess!
  • A typical query is Pr(diseases|risk factors, findings)
    Interactive analysis
    What-if analysis
    Strength of Influence
    Sensitivity Analysis
    Value of information
    Value of additional evidence (tests)
    Cost of not taking a specific decision
    By now you are wondering: why inference?
  • Let’s have a break and discuss why you should use BN inference
    If the current tools work for you, continue using them
    If you run a company that underestimates risk and loses $1T as a result, you probably need to innovate: there should be some technology that can handle it
    Now, let’s delve into the MapReduce implementation and results
  • This slide is a bit more complicated, but bear with me
    Map: summation; generates multiple keys/records per original record
    Reduce: multiplication
    The key encodes the full set of variable values (LongWritable or composite)
    The value encodes partial sums (proportional to probabilities)
    No need for TotalOrderPartitioning (we know the key distribution)
    Need a custom Partitioner and WritableComparator (next slide)
    Need to do an aggregation in the Mapper (sum, next slide)
    By arranging the node order in the cliques we can optimize data locality
    Sorting helps!
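The map and reduce roles listed above can be imitated in a few lines of plain Python. This is a single-process sketch under stated assumptions, not Hadoop code and not the talk's implementation: it uses no Hadoop API, assumes all records of the sending clique reach one mapper, and the function names, clique layouts and potentials are invented for illustration.

```python
from collections import defaultdict


def map_sum(records, keep):
    """Mapper with in-mapper aggregation.

    Each record is (values, potential) for the sending clique. The key is
    projected onto the positions in `keep` (the variables shared with the
    receiving clique) and potentials with equal keys are summed locally.
    """
    partial = defaultdict(float)
    for values, potential in records:
        partial[tuple(values[i] for i in keep)] += potential
    return list(partial.items())


def reduce_multiply(messages, clique, shared):
    """Reducer: multiply every entry of the receiving clique by its message.

    `shared` lists the positions of the shared variables inside the
    receiving clique's value tuples.
    """
    lookup = dict(messages)
    return [
        (values, potential * lookup[tuple(values[i] for i in shared)])
        for values, potential in clique
    ]


# Hypothetical sending clique over (A, B) and receiving clique over (B, C).
sender = [((0, 0), 0.1), ((0, 1), 0.2), ((1, 0), 0.3), ((1, 1), 0.4)]
receiver = [((0, 0), 1.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]

if __name__ == "__main__":
    message = map_sum(sender, keep=[1])        # sum A out, keep B
    updated = reduce_multiply(message, receiver, shared=[0])
    for values, potential in updated:
        print(values, potential)
```

Summing inside the mapper only gives complete sums if all records with the same projected key land in the same split, which is presumably why the notes ask for a custom Partitioner and a controlled node order.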
  • Preserves data locality by specifying node order in a certain way (need for a custom WritableComparator and Partitioner)
  • The computation is C6 -> C5 -> C4 -> C3 -> C2 -> C1
    Topological parallelism is usually limited
    Most of the work is done in reducers (index remapping, summation)
    Let’s look at the actual clique sizes in CPCS!
  • Doing inference on the ‘full’ network is only of academic interest
    However, to understand how simplifications in the network affect the results, we need to be able to perform exact inference
    What is the scoop?
  • The first three are for the ‘full’ propagation up and down the tree
    Random A, B are randomly generated BNs used for the 1995 paper
    Cpcs360 is a subset of cpcs422 used for interactive analysis
    Cpcs422 on a 5-node subquery
  • git@github.com:alexvk/BN-Inference.git
    Goals:
    Inference is an interesting application
    We have an interactive program to perform inference
    All questions to Cloudera, Inc.
    Need:
    Implementors (to help)
    Large cluster (to have 10s of PB of storage)
  • Summation: pure MR+ (M-R-M-R-...-M-R) job
    Normalization: requires an update (or a copy) operation
    Each key can encode the set of values (odometer)
    No need for PartialOrder (we know the key distribution)
    Can optimize data locality
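The ‘odometer’ remark above suggests packing a full set of variable values into one integer key. One plausible reading is a mixed-radix encoding, sketched below; the function names and the choice of making the last variable the fastest-changing digit are assumptions, not details confirmed by the slides.

```python
def encode(values, cardinalities):
    """Pack one value per variable into a single integer (mixed radix).

    The last variable is the fastest-changing digit, like an odometer, so
    sorting the integer keys sorts the assignments lexicographically.
    """
    key = 0
    for value, card in zip(values, cardinalities):
        if not 0 <= value < card:
            raise ValueError("value out of range for its variable")
        key = key * card + value
    return key


def decode(key, cardinalities):
    """Inverse of encode: recover the tuple of values from the integer key."""
    values = []
    for card in reversed(cardinalities):
        key, value = divmod(key, card)
        values.append(value)
    return tuple(reversed(values))


if __name__ == "__main__":
    cards = (2, 2, 3)  # hypothetical clique: two Boolean nodes and one 3-state node
    for key in range(2 * 2 * 3):
        print(key, decode(key, cards))
```

Because the number of states of every variable is known in advance, the range of keys is known too, which fits the note that the key distribution does not have to be sampled.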
  • Many tools for interactive analysis:
    - Sensitivity analysis
    - Strength of Influence
    - Value of Information
    - Hybrid networks (with some continuous parents)
  • As opposed to traditional MR, aggregation is made in the map phase (summation)
  • Very few modifications:
    * In file included from utils.c:12:
      /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include/varargs.h:4:2: error: #error "GCC no longer implements <varargs.h>."
      /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include/varargs.h:5:2: error: #error "Revise your code to use <stdarg.h>."
    * Convert ints to longs
    Implement MR logic and code generation
