Bayesian statistics uses probability to represent uncertainty about unknown parameters in statistical models. It differs from classical statistics in that parameters are treated as random variables rather than fixed unknown constants. Bayesian probability represents a degree of belief in an event rather than the physical probability of an event. Bayes' formula provides a way to update beliefs based on new evidence or data using conditional probability. Bayesian networks are graphical models that compactly represent joint probability distributions over many variables and allow for efficient inference.
Probability
A random variable is the basic element of probability. It refers to an event for which there is some degree of uncertainty about the outcome. For example, the random variable A could be the event of getting heads on a coin flip.
Classical vs. Bayesian
Classical: experiments are infinitely repeatable under the same conditions (hence 'frequentist'); the parameter of interest (θ) is fixed and unknown.
Bayesian: each experiment is unique (i.e., not repeatable); the parameter of interest has an unknown distribution.
Classical Probability
A property of the environment: 'physical' probability. Imagine all data sets of size N that could be generated by sampling from the distribution determined by the parameters; each data set occurs with some probability and produces an estimate. "The probability of getting heads on this particular coin is 50%."
Bayesian Probability
A 'personal' probability: a degree of belief, a property of the person who assigns it. The observations are fixed; imagine all possible values of the parameters from which they could have come. "I think the coin will land on heads 50% of the time."
Conditional Probability
P(A = true | B = true): out of all the outcomes in which B is true, the fraction in which A is also true. Read this as "the probability of A conditioned on B" or "the probability of A given B".
Example: H = "have a headache", F = "coming down with flu".
P(H = true) = 1/10, P(F = true) = 1/40, P(H = true | F = true) = 1/2.
"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
The Joint Probability Distribution
We will write P(A = true, B = true) to mean "the probability of A = true and B = true". Notice that P(H = true | F = true) = P(H = true, F = true) / P(F = true). In general, P(X | Y) = P(X, Y) / P(Y).
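The relation P(X | Y) = P(X, Y) / P(Y) can be checked against the headache/flu numbers above; the short Python sketch below is purely illustrative.

```python
# Check P(X | Y) = P(X, Y) / P(Y) using the headache/flu numbers from the slides.
p_f = 1 / 40          # P(F = true)
p_h_given_f = 1 / 2   # P(H = true | F = true)

p_h_and_f = p_h_given_f * p_f   # P(H = true, F = true) = 1/80
print(p_h_and_f / p_f)          # recovers 0.5, i.e. P(H = true | F = true)
```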
The Joint Probability Distribution
Joint probabilities can involve any number of variables, e.g. P(A = true, B = true, C = true). For each combination of values of the variables, we need to say how probable that combination is, and the probabilities of all these combinations must sum to 1.
The Joint Probability Distribution
Once you have the joint probability distribution, you can calculate any probability involving A, B, and C. (You may need marginalization and Bayes' rule, neither of which is discussed in detail in these slides.) Examples of things you can compute:
P(A = true) = the sum of P(A, B, C) over the rows with A = true
P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
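The slides show the joint table for A, B and C as a figure; the numbers in the sketch below are hypothetical placeholders (they simply sum to 1), but the marginalisation and conditioning steps follow the two formulas above.

```python
# Hypothetical joint distribution over three Boolean variables A, B, C.
# (The actual table on the slide is not reproduced here; these values sum to 1.)
joint = {
    (True,  True,  True):  0.05, (True,  True,  False): 0.10,
    (True,  False, True):  0.05, (True,  False, False): 0.05,
    (False, True,  True):  0.10, (False, True,  False): 0.15,
    (False, False, True):  0.20, (False, False, False): 0.30,
}

# P(A = true): sum the rows where A is true (marginalisation).
p_a = sum(p for (a, b, c), p in joint.items() if a)

# P(A = true, B = true | C = true) = P(A, B, C) / P(C).
p_abc = joint[(True, True, True)]
p_c = sum(p for (a, b, c), p in joint.items() if c)
print(p_a, p_abc / p_c)
```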
The Problem with the Joint Distribution
There are lots of entries in the table to fill in! For k Boolean random variables, you need a table of size 2^k. How do we use fewer numbers? We need the concept of independence.
Independence
Variables A and B are independent if any of the following hold:
P(A, B) = P(A)P(B)
P(A | B) = P(A)
P(B | A) = P(B)
This says that knowing the outcome of A does not tell me anything new about the outcome of B.
Independence
How is independence useful? Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn). If the coin flips are not independent, you need 2^n values in the table. If the coin flips are independent, then P(C1, …, Cn) = P(C1)···P(Cn): each P(Ci) table has 2 entries and there are n of them, for a total of 2n values. (For n = 10 flips, that is 1,024 entries versus 20.)
Conditional Independence
Variables A and B are conditionally independent given C if any of the following hold:
P(A, B | C) = P(A | C)P(B | C)
P(A | B, C) = P(A | C)
P(B | A, C) = P(B | C)
Knowing C tells me everything about B; I don't gain anything by also knowing A (either because A doesn't influence B, or because knowing C provides all the information that knowing A would give).
Example: A Clinical Test
Let's move to a simple example, a clinical test. Consider a rare disease such that, at any given point in time, on average 1 in 10,000 people in the population has it. There is also a clinical test for the disease (a blood test, for example), but it is not perfect. The sensitivity of the test, i.e. the probability of a correct positive, is .99; the specificity, i.e. the probability of a correct negative, is .98. To visualise this, let's draw a tree.
Example: A Clinical Test
Now consider a person who gets worried, takes the test, and gets a positive result. Remembering that the test is not perfect, what can we say about the actual chances that he is ill?
Example: A Clinical Test
[Tree: Disease with probability .0001, then + with .99 and - with .01; No Disease with probability .9999, then + with .02 and - with .98.]
Applying Bayes' formula we get the result: the probability that our patient is ill, given what we know about the prevalence of the disease in the population and about the test's performance, is .0049. Naturally this is higher than in the background population, but still very small. Suppose it is customary to repeat the test, and the result is positive again.
Example: A Clinical Test
[Tree before the first test: Disease .0001 / No Disease .9999; tree before the second test: Disease .0049 / No Disease .9951; in both, + follows Disease with .99 and No Disease with .02.]
We can still use the same formula, of course, and the same tree for visualisation, but now the probabilities of having the disease and being disease-free before the test are not .0001 and .9999; instead, they are .0049 and .9951, based on the result of the test so far. Applying the formula this time, we get a result of .20. Still not very high, which implies that the test is probably not very efficient, because evidently you have to repeat it many times before you become reasonably sure.
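Both updates can be reproduced with a few lines of Python; this is a minimal sketch of Bayes' formula applied to the test's numbers (prior .0001, sensitivity .99, false-positive rate .02).

```python
def posterior(prior, sensitivity=0.99, false_positive_rate=0.02):
    """P(disease | positive test) via Bayes' formula."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

first = posterior(0.0001)    # ~0.0049 after one positive test
second = posterior(first)    # ~0.20 after a second positive test
print(round(first, 4), round(second, 2))
```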
Example
Thumbtack problem: will it land on the point (heads) or the flat bit (tails)? Flip it N times. What will it do on the (N+1)th time? How do we compute p(x_{N+1} | D, ξ) from p(θ | ξ)?
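One standard way to answer the last question (not worked out on the slide) is to first update the prior with Bayes' formula, p(θ | D, ξ) ∝ p(D | θ) p(θ | ξ), and then average the next flip over that posterior: p(x_{N+1} = heads | D, ξ) = ∫ θ p(θ | D, ξ) dθ, i.e. the posterior mean of θ.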
Inference as Learning
Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcomes. From the observations, you can infer the bias of the coin. This is learning. This is inference.
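The slides do not commit to a particular prior; the sketch below assumes a Beta(a, b) prior on θ, a common conjugate choice, under which the posterior after observing h heads and t tails is simply Beta(a + h, b + t).

```python
# Minimal Beta-Bernoulli update, assuming a Beta(a, b) prior over theta = P(head).
def update(a, b, heads, tails):
    """Posterior Beta parameters after observing the given flips."""
    return a + heads, b + tails

a, b = update(1, 1, heads=7, tails=3)   # uniform Beta(1, 1) prior, 10 flips observed
posterior_mean = a / (a + b)            # also p(x_{N+1} = head | D) under this model
print(a, b, posterior_mean)             # 8 4 0.666...
```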
Posterior Distribution
When the data are 'strong enough', the posterior will hardly depend on the prior. Unless some extraneous information is available, a non-informative or vague prior can be used, in which case the inference will mostly be based on the data. If, however, some prior knowledge about the parameters of interest is available, then it should be included in the reasoning. Furthermore, it is common practice to conduct a sensitivity analysis, in other words to examine the effect the prior might have on the posterior distribution. In any case, the prior assumptions should always be made explicit, and if the result is found to depend on them, that sensitivity should be investigated in detail.
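One way to see the 'strong enough data' point concretely is to compare posterior means under two different priors as the sample size grows; both priors below are again assumed Beta distributions and are purely hypothetical.

```python
# Posterior means for the same data under two different Beta priors.
def posterior_mean(a, b, heads, tails):
    return (a + heads) / (a + b + heads + tails)

for n in (10, 1000):
    heads = int(0.7 * n)
    vague = posterior_mean(1, 1, heads, n - heads)       # near-uniform prior
    strong = posterior_mean(20, 20, heads, n - heads)    # prior pulling towards 0.5
    print(n, round(vague, 3), round(strong, 3))          # the two means converge as n grows
```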
A Bayesian Network
A Bayesian network is made up of:
1. a directed acyclic graph (in the running example, the nodes A, B, C and D), and
2. a set of tables, one for each node in the graph.
A Directed Acyclic Graph
Each node in the graph is a random variable. A node X is a parent of another node Y if there is an arrow from node X to node Y; e.g. A is a parent of B. Informally, an arrow from node X to node Y means that X has a direct influence on Y.
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node. The parameters of the network are the probabilities in these conditional probability tables (CPTs).
A Set of Tables for Each Node
Consider the conditional probability distribution for C given B. For a given combination of values of the parents (B in this example), the entries P(C = true | B) and P(C = false | B) must add up to 1, e.g. P(C = true | B = false) + P(C = false | B = false) = 1. If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
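A CPT can be stored as a map from parent values to P(child = true), since the complementary entry is determined by normalisation. In the sketch below, the value 0.1 for P(C = true | B = true) is taken from the worked example later in the slides; the value for B = false is hypothetical.

```python
# CPT for P(C | B): for each value of the parent B, P(C=true | B) and
# P(C=false | B) must sum to 1, so only P(C=true | B) is stored.
p_c_true_given_b = {True: 0.1, False: 0.3}   # 0.3 is a hypothetical placeholder

def p_c_given_b(c, b):
    p_true = p_c_true_given_b[b]
    return p_true if c else 1.0 - p_true

# Normalisation check: the two entries for B = false add up to 1.
assert abs(p_c_given_b(True, False) + p_c_given_b(False, False) - 1.0) < 1e-12
```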
Bayesian Networks
Two important properties:
1. A Bayesian network encodes the conditional independence relationships between the variables in its graph structure.
2. It is a compact representation of the joint probability distribution over the variables.
Conditional Independence
The Markov condition: given its parents (P1, P2), a node (X) is conditionally independent of its non-descendants (ND1, ND2). (In the illustration, X also has children C1 and C2, which are its descendants.)
The Joint Probability Distribution
Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula:
P(X1 = x1, …, Xn = xn) = ∏_i P(Xi = xi | Parents(Xi))
where Parents(Xi) means the values of the parents of the node Xi with respect to the graph.
Using a Bayesian Network: Example
Using the network in the example, suppose you want to calculate:
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
= (0.4) * (0.3) * (0.1) * (0.95) = 0.0114
The factorisation P(A) P(B | A) P(C | B) P(D | B) comes from the graph structure; the numbers 0.4, 0.3, 0.1 and 0.95 come from the conditional probability tables.
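Plugging the four quoted numbers into the factorisation gives the joint entry directly; a minimal check in Python:

```python
# Joint entry P(A=t, B=t, C=t, D=t) for the example network A -> B -> {C, D},
# computed as a product of the relevant CPT entries quoted on the slide.
p_a   = 0.4    # P(A = true)
p_b_a = 0.3    # P(B = true | A = true)
p_c_b = 0.1    # P(C = true | B = true)
p_d_b = 0.95   # P(D = true | B = true)

print(round(p_a * p_b_a * p_c_b * p_d_b, 4))   # 0.0114
```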
Inference
Using a Bayesian network to compute probabilities is called inference. In general, inference involves queries of the form P(X | E), where X is the query variable(s) and E is the evidence variable(s).
Inference
Consider a network with the variables HasAnthrax, HasCough, HasFever, HasDifficultyBreathing and HasWideMediastinum. An example of a query would be: P(HasAnthrax = true | HasFever = true, HasCough = true). Note: even though HasDifficultyBreathing and HasWideMediastinum are in the Bayesian network, they are not given values in the query (i.e. they appear neither as query variables nor as evidence variables); they are treated as unobserved variables.
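For a network this small, a query P(X | E) can be answered by brute-force enumeration: sum the joint probability over every assignment consistent with the evidence, splitting by the value of the query variable, and renormalise. The anthrax network's tables are not given in the slides, so the sketch below uses the A, B, C, D example instead; CPT entries marked "(slide)" appear in the slides, the rest are hypothetical.

```python
from itertools import product

# CPTs for the example network A -> B, B -> C, B -> D.
p_a_true = 0.4                          # (slide)
p_b_true = {True: 0.3, False: 0.2}      # P(B=true | A); 0.3 is (slide)
p_c_true = {True: 0.1, False: 0.4}      # P(C=true | B); 0.1 is (slide)
p_d_true = {True: 0.95, False: 0.05}    # P(D=true | B); 0.95 is (slide)

def bernoulli(p, value):
    return p if value else 1.0 - p

def joint(a, b, c, d):
    """P(A, B, C, D) from the chain-rule factorisation."""
    return (bernoulli(p_a_true, a) * bernoulli(p_b_true[a], b) *
            bernoulli(p_c_true[b], c) * bernoulli(p_d_true[b], d))

def query(x_var, evidence):
    """P(x_var = true | evidence) by enumerating the unobserved variables."""
    order = ["A", "B", "C", "D"]
    num = den = 0.0
    for values in product([True, False], repeat=4):
        assignment = dict(zip(order, values))
        if any(assignment[v] != val for v, val in evidence.items()):
            continue                     # inconsistent with the evidence
        p = joint(*values)
        den += p                         # P(evidence)
        if assignment[x_var]:
            num += p                     # P(x_var = true, evidence)
    return num / den

print(query("C", {"D": True}))           # e.g. P(C = true | D = true)
```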