Introduction to Bayesian Networks - Practical and Technical Perspectives


Bayesian networks are somewhat of a disruptive technology, as they challenge a number of common practices in the world of business and science. So, beyond the world of academia, promoting Bayesian networks as a new tool for practical knowledge management and reasoning still requires significant persuasion efforts. With this short paper, we attempt to provide a concise justification, from both a practitioner’s and a technical perspective, of why Bayesian networks are so important.

Published in: Technology, Education

Stefan Conrady, stefan.conrady@conradyscience.com
Dr. Lionel Jouffe, jouffe@bayesia.com
February 15, 2011
Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting

Table of Contents

Introduction
Bayesian Networks from a Practitioner’s Perspective
    Knowledge Unification
    Knowledge Representation & Communication
    Reasoning
    Summary
Technical Introduction
    Introduction
    Probabilistic Semantics
    Evidential Reasoning
    Learning Bayesian Networks
    Causal Networks
    Causal Discovery
References
Contact Information
    Conrady Applied Science, LLC
    Bayesia SAS

Introduction

A simple analogy may help to jump-start our introduction to Bayesian networks: in the same way that one can use a phone book without having to memorize all the names and numbers, one can deliberately (and correctly) reason with the domain knowledge contained in a Bayesian network without having to become a domain expert.

Over the last 25 years, Bayesian networks have emerged as a practically feasible form of knowledge representation, primarily through the seminal works of UCLA Professor Judea Pearl. With ever-increasing computing power, Bayesian networks are now a powerful tool for deep understanding of very complex, high-dimensional problem domains. Their computational efficiency and inherently visual structure make Bayesian networks attractive for exploring and explaining complex problems.

However, Bayesian networks are somewhat of a disruptive technology, as they challenge a number of common practices in the world of business and science. So, beyond the world of academia, promoting Bayesian networks as a new tool for practical knowledge management and reasoning still requires significant persuasion efforts. With this short paper, we attempt to provide a concise justification, from both a practitioner’s and a technical perspective [1], of why Bayesian networks are so important.

[1] Author notes: portions of the technical chapter of this paper are adapted, with permission, from Pearl and Russell (2000).

Bayesian Networks from a Practitioner’s Perspective

In our quest to “evangelize” about Bayesian networks (and the BayesiaLab software package [2]), we are often limited to presenting our case in just a few PowerPoint slides, using only a few catchy bullet points. In this context, and without any claim to being comprehensive, we selected the following headings to highlight the key benefits of Bayesian networks to research practitioners and business executives:

1. Knowledge Unification
2. Knowledge Representation & Communication
3. Reasoning

Under these headings, the following paragraphs are meant to provide a glimpse of the powerful properties and wide-ranging practical advantages of Bayesian networks.

Knowledge Unification

Many fields are characterized by the proverbial conflict between “art” and “science.” This manifests itself in debates such as the one about evidence-based medicine versus the prevailing practice of physicians with years of experience. Even more common is the discrepancy between scientifically derived market research insights and the expertise-based marketing decisions of business executives. Traditional frameworks typically don’t facilitate leveraging the knowledge available on both sides.

Bayesian networks can capture both qualitative knowledge (through their network structure) and quantitative knowledge (through their parameters). While expert knowledge from practitioners is mostly qualitative, it can be used directly for building the structure of a Bayesian network. In addition, data mining algorithms can extract both qualitative and quantitative knowledge from data and encode both forms simultaneously in a Bayesian network. As a result, Bayesian networks can bridge the gap between different types of knowledge and serve to unify all available knowledge into a single form of representation.

[2] Developed by Bayesia SAS, BayesiaLab is a comprehensive software package designed for learning, editing and analyzing Bayesian networks. It is available in North America from Conrady Applied Science, LLC.

[Figure 1: Knowledge unification with Bayesian networks]

Knowledge Representation & Communication

Relaying knowledge typically includes an array of factual and causal statements. In natural language communication, such statements will often contain generalizations, approximations, and implicit assumptions regarding their probability. Such simplifications are widely accepted in casual conversation or in media headlines.

However, the more precise communication required in science or business makes it necessary to spell out the exceptions, uncertainty and conditions that apply to statements about knowledge. With natural language expressions, this can become very cumbersome, especially when it concerns a complex domain (hence the substantial girth of many textbooks).

Also, the need for precision in describing complex domains is often at odds with modern business culture, which, as already mentioned, dictates communication via PowerPoint in a few concise bullet points. Needless to say, the complex dynamics of a domain can thus often not be relayed correctly to policy makers and other stakeholders.

Bayesian networks are very well suited for capturing probabilistic and incomplete causal knowledge regarding a domain. They can easily accommodate exceptions to a rule, e.g. “all swans are white, except for a certain species,” as well as partial causal information, for instance “alcohol caused the accident,” even though more factors may actually be involved, such as poor road conditions.

Through its structure and its parameters, a Bayesian network comprehensively describes what is known about a particular domain and especially the interactions of all the variables contained within that domain. As such, a Bayesian network is a “Portable Knowledge Format” that can succinctly and compactly communicate the state of the domain as well as its …

Reasoning

By representing the interactions, a (correctly formulated) Bayesian network can yield a deep understanding of a domain. Deep understanding means knowing not merely how things behaved yesterday, but also how things will behave under new hypothetical circumstances tomorrow. More specifically, a Bayesian network allows explicit reasoning, and deliberate reasoning allows us to anticipate the consequences of actions we have not yet taken. Bayesian networks thus become an instrument for formal reasoning that is entirely transparent to stakeholders, as opposed to a more opaque, internalized process in the decision maker’s mind (or gut).

[Figure 2: Using Bayesian networks for formal reasoning about consequences of hypothetical actions]

Summary

In summary, Bayesian networks are a highly universal knowledge framework, and they provide a common reasoning language between stakeholders from different backgrounds, such as business executives and market research scientists. With all available knowledge unified, properly communicated and quite literally put into a “reasonable” format, Bayesian networks are a powerful tool for making decisions and shaping …

Technical Introduction

For the technical portion of this introduction, we defer to the words of Judea Pearl, who originally coined the term “Bayesian network”. We are grateful to him for allowing us to use and adapt large sections from one of his technical reports for our purposes (Pearl and Russell, 2000).

Introduction

Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence, such models are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach.

Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B does not equal zero:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In Bayes’ theorem, each probability has a conventional name:

• P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes’ rule was called the “antecedent” probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called “a priori” by Ronald A. Fisher.
• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
• P(B|A) is the conditional probability of B given A. It is also called the likelihood.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.

The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based …
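
As a numerical illustration of Bayes’ theorem as stated above, the following minimal Python sketch computes a posterior P(A|B) for a binary event A. The function name `posterior` and all the numbers are hypothetical and chosen only for illustration; the normalizing constant P(B) is obtained from the law of total probability.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) for a binary event A using Bayes' theorem."""
    # Normalizing constant P(B) via the law of total probability.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if p_b == 0.0:
        raise ValueError("Bayes' theorem requires P(B) > 0")
    # Posterior = likelihood * prior / normalizing constant.
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: prior P(A) = 0.01, likelihood P(B|A) = 0.9,
# and P(B|not A) = 0.05.
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(f"P(A|B) = {p:.4f}")
```

With these assumed inputs the posterior is 0.009 / 0.0585, or roughly 0.15: even a fairly reliable piece of evidence leaves the posterior modest when the prior is small.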

The nodes in a Bayesian network represent propositional variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event) and the links represent statistical (informational) [3] or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the posterior probabilities of any subset of variables given evidence about any other subset.

Figure 1 shows a very simple Bayesian network consisting of only two nodes and one link, representing the joint probability distribution of the variables Eye Color and Hair Color in a given population. In this case, the conditional probabilities of Hair Color given the values of its parent, Eye Color, are provided in a table. It is important to point out that this Bayesian network does not contain any causal assumptions, i.e. we have no knowledge of the causal order between the variables, so the interpretation here should be merely statistical (informational).

[Figure 1: A Bayesian network representing the statistical relationship between two variables]

Figure 2 illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in Figure 1, the diagram in Figure 2 describes the causal relationships among the season of the year (X1), whether it’s raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness; the influence is mediated by the wetness of the pavement (if freezing is a possibility, then a direct link could be added).

[3] “Informational” and “statistical” are treated here as equivalent concepts and can be used interchangeably.
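
The structure of the five-variable example can be written down as a simple data structure. The sketch below is one possible encoding, not the representation used by any particular software package: the node names and the helper functions are hypothetical, and `graphlib` is part of the Python standard library from version 3.9 onward.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical node names for X1..X5; each node maps to the list of its parents.
PARENTS = {
    "season": [],
    "rain": ["season"],
    "sprinkler": ["season"],
    "wet": ["rain", "sprinkler"],
    "slippery": ["wet"],
}


def topological_order(parents):
    """Return the nodes so that every parent precedes its children.

    Raises graphlib.CycleError if the graph contains a directed cycle.
    """
    return list(TopologicalSorter(parents).static_order())


def ancestors(node, parents):
    """Collect all direct and indirect predecessors of `node`."""
    found = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


print(topological_order(PARENTS))
# Season is not a parent of slipperiness, but it is an ancestor of it.
print("season" in PARENTS["slippery"])
print("season" in ancestors("slippery", PARENTS))
```

Storing parents rather than children mirrors the way the network is quantified: each node carries a conditional distribution given exactly the nodes listed for it.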

[Figure 2: A Bayesian network representing causal influences among five variables]

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (prediction, simulation); if someone slips on the pavement, that also provides evidence that it is wet (abduction, reasoning to a probable cause or diagnosis). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in any natural way, because it seems to require the propagation of information in two directions.

Probabilistic Semantics

Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint probability distribution, i.e. the probability of every possible event as defined by the combination of the values of all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint distribution into local, conditional distributions for each variable given its parents. If x_i denotes some value of the variable X_i and pa_i denotes some set of values for the parents of X_i, then P(x_i | pa_i) denotes this conditional distribution. For example, P(x_4 | x_2, x_3) is the probability of wetness given the values of sprinkler and rain.
The global semantics of Bayesian networks specifies that the full joint distribution is given by the product

$$P(x_1, \ldots, x_n) = \prod_{i} P(x_i \mid pa_i) \tag{1}$$

In our example network, we have

$$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_4) \tag{2}$$

As long as the number of parents per node is bounded, the number of parameters grows only linearly with the size of the network, i.e. the number of variables, whereas the size of each conditional probability distribution grows exponentially with the number of its parents. Further savings can be achieved using compact parametric representations, such as noisy-OR models, decision trees, or neural networks, for the conditional distributions.

There is also an entirely equivalent local semantics, which asserts that each variable is independent of its nondescendants in the network given its parents. For example, the parents of X4 in Figure 2 are X2 and X3, and they render X4 independent of the remaining nondescendant, X1. That is,

$$P(x_4 \mid x_1, x_2, x_3) = P(x_4 \mid x_2, x_3)$$

[Figure 3: Variable X4 is independent of its nondescendants, in this case X1, given its parents, X3 and X2]

The collection of independence assertions formed in this way suffices to derive the global assertion in Equation 1, and vice versa. The local semantics is most useful in constructing Bayesian networks, because selecting as parents all the direct causes (or direct relationships) of a given variable invariably satisfies the local conditional independence conditions. The global semantics leads directly to a variety of algorithms for reasoning.

Evidential Reasoning

From the product specification in Equation 1, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the sprinkler is on given that the pavement is slippery is

$$
\begin{aligned}
P(X_3 = \text{on} \mid X_5 = \text{true})
&= \frac{P(X_3 = \text{on}, X_5 = \text{true})}{P(X_5 = \text{true})} \\
&= \frac{\sum_{x_1, x_2, x_4} P(x_1, x_2, X_3 = \text{on}, x_4, X_5 = \text{true})}{\sum_{x_1, x_2, x_3, x_4} P(x_1, x_2, x_3, x_4, X_5 = \text{true})} \\
&= \frac{\sum_{x_1, x_2, x_4} P(x_1)\,P(x_2 \mid x_1)\,P(X_3 = \text{on} \mid x_1)\,P(x_4 \mid x_2, X_3 = \text{on})\,P(X_5 = \text{true} \mid x_4)}{\sum_{x_1, x_2, x_3, x_4} P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(X_5 = \text{true} \mid x_4)}
\end{aligned}
$$

These expressions can often be simplified in ways that reflect the structure of the network itself. The first algorithms proposed for probabilistic calculations in Bayesian networks used a local distributed message-passing architecture, typical of many cognitive activities. Initially this approach was limited to tree-structured networks, but it was later extended to general networks in Lauritzen and Spiegelhalter’s (1988) method of junction tree propagation. A number of other exact methods have been developed and can be found in recent textbooks.

It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and hence is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually improving estimates as sampling proceeds. Unlike junction tree methods, these methods use local message propagation on the original network structure. Alternatively, variational methods provide bounds on the true probability.

Learning Bayesian Networks

The conditional probabilities P(x_i | pa_i) of a given structure can be estimated from data by using the maximum likelihood approach (observed frequencies). They can also be updated continuously from observational data using gradient-based or EM methods that use just local information derived from inference, in much the same way as weights are adjusted in neural networks.

It is also possible to machine-learn the structure of a Bayesian network, and two families of methods are available for that purpose.
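
The query from the Evidential Reasoning section, P(X3 = on | X5 = true), can be evaluated by brute-force enumeration over the factored joint distribution of Equation 2. The sketch below does exactly that. All conditional probability values are invented for illustration (the paper’s figures do not supply numbers), the season is simplified to two hypothetical states, and enumeration is used only because the network is tiny; it is not one of the message-passing or junction tree methods mentioned above.

```python
from itertools import product

# Hypothetical conditional probability tables for the five-node example.
P_SEASON = {"dry": 0.5, "rainy": 0.5}          # P(x1)
P_RAIN = {"dry": 0.1, "rainy": 0.7}            # P(X2 = true | x1)
P_SPRINKLER = {"dry": 0.6, "rainy": 0.1}       # P(X3 = on | x1)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # P(X4 = true | x2, x3)
P_SLIPPERY = {True: 0.7, False: 0.0}           # P(X5 = true | x4)

NAMES = ("x1", "x2", "x3", "x4", "x5")
DOMAINS = (tuple(P_SEASON), (True, False), (True, False),
           (True, False), (True, False))


def bern(p_true, value):
    """Probability of a binary value when P(value is True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3, x4, x5):
    """Equation 2: product of the local conditional distributions."""
    return (P_SEASON[x1] * bern(P_RAIN[x1], x2) * bern(P_SPRINKLER[x1], x3)
            * bern(P_WET[(x2, x3)], x4) * bern(P_SLIPPERY[x4], x5))


def query(target, evidence):
    """Return P(target | evidence); both arguments map names to values."""
    numerator = denominator = 0.0
    for values in product(*DOMAINS):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(*values)
        denominator += p
        if all(world[name] == value for name, value in target.items()):
            numerator += p
    return numerator / denominator


print(query({"x3": True}, {"x5": True}))              # sprinkler given slippery
print(query({"x2": True}, {"x4": True}))              # rain given wet
print(query({"x2": True}, {"x4": True, "x3": True}))  # rain given wet and sprinkler
```

Under these assumed numbers, the last two queries reproduce the explaining-away pattern: learning that the sprinkler is on lowers the probability of rain once the pavement is known to be wet.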
The first family, the constraint-based algorithms, is based on the probabilistic semantics of Bayesian networks. Links are added or deleted according to the results of statistical tests, which identify marginal and conditional independencies. The second family, the score-based algorithms, is based on a metric measuring the quality of candidate networks with respect to the observed data. This metric trades off network complexity against degree of fit to the data, typically expressed as the likelihood of the data given the network.

As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior knowledge in network form, either by fixing portions of the structure or by using prior distributions over the network parameters. Such prior knowledge can allow a system to learn accurate models from much less data than is required for tabula rasa approaches.

Uncertainty Over Time

Entities that live in a changing environment must keep track of variables whose values change over time. Dynamic Bayesian networks capture this process by representing multiple copies of the state variables, one for each time step. A set of variables X_t denotes the world state at time t and a set of sensor variables E_t denotes the observations available at time t. The sensor model P(E_t | X_t) is encoded in the conditional probability distributions for the observable variables, given the state variables. The transition model P(X_{t+1} | X_t) relates the state at time t to the state at time t+1. Keeping track of the world means computing the current probability distribution over world states given all past observations, i.e., P(X_t | E_1, …, E_t). Dynamic Bayesian networks are strictly more expressive than other temporal probability models such as hidden Markov models and Kalman filters.

Causal Networks

Most probabilistic models, including general Bayesian networks, describe a distribution over possible observed events, as in Equation 1, but say nothing about what will happen if a certain intervention occurs. For example, what if I turn the sprinkler on? What effect does that have on the season, or on the connection between wetness and slipperiness? A causal network, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its direct causes, as in Figure 2. In such a network, the result of an intervention is obvious: the sprinkler node is set to X3 = on and the causal link between the season X1 and the sprinkler X3 is removed (see Figure 4). All other causal links and conditional probabilities remain intact, so the new model is

$$P(x_1, x_2, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_4 \mid x_2, X_3 = \text{on})\,P(x_5 \mid x_4)$$

Notice that this differs from observing that X3 = on, which would result in a new model that included the term P(X3 = on | x_1). This mirrors the difference between seeing and doing: after observing that the sprinkler is on, we wish to infer that the season is dry, that it probably did not rain, and so on; an arbitrary decision to turn the sprinkler on should not result in any such beliefs.

[Figure 4: A causal network reflecting the intervention X3 = on]

Causal networks are more properly defined, then, as Bayesian networks in which the correct probability model after intervening to fix any node’s value is given simply by deleting links from the node’s parents. For example, Fire → Smoke is a causal network whereas Smoke → Fire is not, even though both networks are equally capable of representing any joint distribution on the two variables. Causal networks model the environment as a collection of stable component mechanisms. These mechanisms may be reconfigured locally by interventions, with correspondingly local changes in the model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering various courses of action.
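
The difference between seeing and doing described in this section can be made concrete with a small numerical sketch. It compares the belief about the season after observing X3 = on (conditioning on the full joint distribution of Equation 2) with the belief after the intervention X3 = on (the truncated product given above, in which the term P(X3 = on | x_1) is dropped). All conditional probability values and the two season states are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for the five-node example.
P_SEASON = {"dry": 0.5, "rainy": 0.5}          # P(x1)
P_RAIN = {"dry": 0.1, "rainy": 0.7}            # P(X2 = true | x1)
P_SPRINKLER = {"dry": 0.6, "rainy": 0.1}       # P(X3 = on | x1)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # P(X4 = true | x2, x3)
P_SLIPPERY = {True: 0.7, False: 0.0}           # P(X5 = true | x4)
BOOL = (True, False)


def bern(p_true, value):
    """Probability of a binary value when P(value is True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3, x4, x5):
    """Observational model: the full factorization of Equation 2."""
    return (P_SEASON[x1] * bern(P_RAIN[x1], x2) * bern(P_SPRINKLER[x1], x3)
            * bern(P_WET[(x2, x3)], x4) * bern(P_SLIPPERY[x4], x5))


def joint_do_sprinkler_on(x1, x2, x4, x5):
    """Interventional model: X3 is clamped to on and P(x3 | x1) is removed."""
    return (P_SEASON[x1] * bern(P_RAIN[x1], x2)
            * bern(P_WET[(x2, True)], x4) * bern(P_SLIPPERY[x4], x5))


def p_dry_after_seeing():
    """P(X1 = dry | X3 = on): condition the observational model."""
    numerator = denominator = 0.0
    for x1, x2, x4, x5 in product(P_SEASON, BOOL, BOOL, BOOL):
        p = joint(x1, x2, True, x4, x5)
        denominator += p
        if x1 == "dry":
            numerator += p
    return numerator / denominator


def p_dry_after_doing():
    """P(X1 = dry) in the model obtained by intervening to set X3 = on."""
    return sum(joint_do_sprinkler_on("dry", x2, x4, x5)
               for x2, x4, x5 in product(BOOL, BOOL, BOOL))


print(round(p_dry_after_seeing(), 3))  # belief shifts toward a dry season
print(round(p_dry_after_doing(), 3))   # belief stays at the prior P(X1 = dry)
```

With these assumed tables, observing the sprinkler raises the probability of a dry season from 0.5 to about 0.857, while turning the sprinkler on by decision leaves it at 0.5, as the text argues it should.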

Causal Discovery

One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover causal structures in raw statistical data, a task previously considered impossible without controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, B and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the example would invariably portray A and C as two independent causes and B as their common effect, namely, A → B ← C. (For instance, A and C could be the outcomes of two fair coins and B could represent a bell that rings whenever either coin comes up heads.)

[Figure 5: Causal model for variables A, C and B, representing two fair coins and a bell respectively]

Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically feasible but very unnatural (see Figure 5), because it must entail fine tuning of the probabilities involved; the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change.

Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal information, are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible with the data must be less stable than the one(s) inferred; namely, slight fluctuations in parameters will render that structure incompatible with the data.
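
The coins-and-bell example can be checked exactly by enumerating its four equally likely outcomes. The sketch below is illustrative only; the helper names are hypothetical, and exact fractions are used so that the independence tests are not affected by rounding. Because all three variables are binary, testing the events “heads” and “bell rings” is sufficient to decide independence of the variables.

```python
from fractions import Fraction
from itertools import product

# Each world is (a, c, b): coin A heads, coin C heads, bell B rings.
# The bell rings whenever either coin comes up heads.
WORLDS = {(a, c, a or c): Fraction(1, 4)
          for a, c in product((True, False), repeat=2)}


def always(a, c, b):
    return True


def a_heads(a, c, b):
    return a


def c_heads(a, c, b):
    return c


def bell_rings(a, c, b):
    return b


def prob(event, given=always):
    """Exact conditional probability P(event | given)."""
    normalizer = sum(p for w, p in WORLDS.items() if given(*w))
    mass = sum(p for w, p in WORLDS.items() if given(*w) and event(*w))
    return mass / normalizer


def independent(e1, e2, given=always):
    """True if P(e1, e2 | given) equals P(e1 | given) * P(e2 | given)."""
    both = prob(lambda a, c, b: e1(a, c, b) and e2(a, c, b), given)
    return both == prob(e1, given) * prob(e2, given)


print(independent(a_heads, bell_rings))                # A and B are dependent
print(independent(c_heads, bell_rings))                # B and C are dependent
print(independent(a_heads, c_heads))                   # A and C are independent
print(independent(a_heads, c_heads, given=bell_rings)) # dependent once B is known
```

The last line shows a further signature of the pattern A → B ← C: the two coins, although marginally independent, become dependent once the bell is known to have rung.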

References

Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011.
Barnard, G. A., and T. Bayes. “Studies in the History of Probability and Statistics: IX. Thomas Bayes’s Essay Towards Solving a Problem in the Doctrine of Chances.” Biometrika 45, no. 3 (1958): 293–315.
Darwiche, Adnan. “Bayesian Networks.” Communications of the ACM 53, no. 12 (December 2010): 80.
Hilbert, M., and P. Lopez. “The World’s Technological Capacity to Store, Communicate, and Compute Information.” Science (February 2011).
Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
Neapolitan, Richard E., and Xia Jiang. Probabilistic Methods for Financial and Marketing Informatics. 1st ed. Morgan Kaufmann, 2007.
Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000.
Pearl, Judea. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. 2nd ed. The MIT Press, 2001.

Contact Information

Conrady Applied Science, LLC
312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 …