Bayesian statistics
Probability
A random variable is the basic element of probability. It refers to an event, and there is some degree of uncertainty as to the outcome of that event. For example, the random variable A could be the event of getting heads on a coin flip.
Classical vs. Bayesian
Classical: experiments are infinitely repeatable under the same conditions (hence 'frequentist'); the parameter of interest (θ) is fixed and unknown.
Bayesian: each experiment is unique (i.e., not repeatable); the parameter of interest is itself uncertain and is described by a probability distribution.
Classical Probability
A property of the environment: 'physical' probability. Imagine all data sets of size N that could be generated by sampling from the distribution determined by the parameters; each data set occurs with some probability and produces an estimate. "The probability of getting heads on this particular coin is 50%."
Bayesian Probability
'Personal' probability: a degree of belief, a property of the person who assigns it. The observations are fixed; imagine all possible values of the parameters from which they could have come. "I think the coin will land on heads 50% of the time."
Bayes' Formula
P(A | B) = P(B | A) P(A) / P(B)
Conditional Probability
P(A = true | B = true): out of all the outcomes in which B is true, how many also have A equal to true? Read this as "the probability of A conditioned on B" or "the probability of A given B".
Example: H = "have a headache", F = "coming down with flu". P(H = true) = 1/10, P(F = true) = 1/40, P(H = true | F = true) = 1/2. "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
The Joint Probability Distribution
We will write P(A = true, B = true) to mean "the probability of A = true and B = true". Notice that P(H = true, F = true) = P(H = true | F = true) P(F = true). In general, P(X | Y) = P(X, Y) / P(Y).
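The product rule and the definition of conditional probability can be checked numerically. Below is a minimal Python sketch using the headache/flu numbers quoted above; the variable names are chosen only for this illustration.

```python
# Headache/flu example: P(H)=1/10, P(F)=1/40, P(H|F)=1/2 (from the slides).
p_H = 1 / 10          # P(H = true)
p_F = 1 / 40          # P(F = true)
p_H_given_F = 1 / 2   # P(H = true | F = true)

# Product rule: P(H, F) = P(H | F) * P(F)
p_H_and_F = p_H_given_F * p_F
print(p_H_and_F)           # 0.0125, i.e. 1/80

# Definition of conditional probability: P(X | Y) = P(X, Y) / P(Y)
print(p_H_and_F / p_F)     # 0.5, recovering P(H | F)
```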
The Joint Probability Distribution
Joint probabilities can be between any number of variables, e.g. P(A = true, B = true, C = true). For each combination of values of the variables, we need to say how probable that combination is, and the probabilities of all these combinations must sum to 1.
The Joint Probability Distribution
Once you have the joint probability distribution, you can calculate any probability involving A, B, and C. Note: this may require marginalization and Bayes' rule (neither of which is discussed in detail in these slides). Examples of things you can compute:
P(A = true) = the sum of P(A, B, C) over all rows with A = true
P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
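To make the "compute anything from the joint" idea concrete, here is a short Python sketch. The joint table values below are invented for illustration (the slides' own table is not reproduced); only the marginalization and conditioning steps mirror the text.

```python
from itertools import product

# A made-up joint distribution over three Boolean variables A, B, C
# (eight entries that sum to 1); purely illustrative values.
values = [0.05, 0.10, 0.15, 0.10, 0.20, 0.05, 0.25, 0.10]
joint = {abc: p for abc, p in zip(product([True, False], repeat=3), values)}

# Marginalization: P(A = true) = sum of P(A, B, C) over rows with A = true
p_A = sum(p for (a, b, c), p in joint.items() if a)

# Conditioning: P(A = true, B = true | C = true)
#             = P(A = true, B = true, C = true) / P(C = true)
p_C = sum(p for (a, b, c), p in joint.items() if c)
p_AB_given_C = joint[(True, True, True)] / p_C

print(p_A, p_C, p_AB_given_C)
```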
The Problem with the Joint Distribution
There are lots of entries in the table to fill in! For k Boolean random variables, you need a table of size 2^k. How can we use fewer numbers? We need the concept of independence.
Independence
Variables A and B are independent if any of the following hold:
P(A, B) = P(A) P(B)
P(A | B) = P(A)
P(B | A) = P(B)
This says that knowing the outcome of A does not tell me anything new about the outcome of B.
Independence
How is independence useful? Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn). If the coin flips are not independent, you need 2^n values in the table. If the coin flips are independent, then each P(Ci) table has 2 entries and there are n of them, for a total of 2n values.
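The saving from independence can be demonstrated in a few lines; this sketch uses n = 4 flips with invented per-flip head probabilities and contrasts the 2^n joint entries with the 2n numbers needed under independence.

```python
from itertools import product

p_heads = [0.5, 0.6, 0.3, 0.9]   # n tables with 2 entries each: 2n numbers (values assumed)
n = len(p_heads)

# Without independence, the joint table would need 2**n entries.
full_table_size = 2 ** n

# With independence the joint factorizes: P(C1, ..., Cn) = product of P(Ci).
def joint(outcome):              # outcome is a tuple of booleans, True = heads
    prob = 1.0
    for flip, p in zip(outcome, p_heads):
        prob *= p if flip else 1 - p
    return prob

# Sanity check: the factorized joint still sums to 1 over all 2**n outcomes.
total = sum(joint(o) for o in product([True, False], repeat=n))
print(full_table_size, 2 * n, round(total, 10))   # 16 8 1.0
```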
Conditional Independence
Variables A and B are conditionally independent given C if any of the following hold:
P(A, B | C) = P(A | C) P(B | C)
P(A | B, C) = P(A | C)
P(B | A, C) = P(B | C)
Knowing C tells me everything about B; I don't gain anything by knowing A (either because A doesn't influence B, or because knowing C provides all the information that knowing A would give).
Example: A Clinical Test
Let's move to a simple example, a clinical test. Consider a rare disease such that, at any given point in time, on average 1 in 10,000 people in the population has the disease. There is also a clinical test for the disease (a blood test, for example), but it is not perfect. The sensitivity of the test, i.e. the probability of a correct positive, is .99. The specificity of the test, i.e. the probability of a correct negative, is .98. To visualise this, let's draw a tree.
Example: A Clinical Test
P(Disease) = p(θ=1) = .0001
P(Test Positive | Disease) = p(x=1 | θ=1) = .99
P(Test Negative | No Disease) = p(x=0 | θ=0) = .98
[Tree: Disease .0001 (test + with .99, test − with .01); No Disease .9999 (test + with .02, test − with .98).]
Example: A Clinical Test
Now let us consider a person who gets worried, takes the test, and gets a positive result. Remembering that the test is not perfect, what can we say about the actual chance that they are ill?
Example: A Clinical Test
[Tree as before: Disease .0001 (+ .99, − .01); No Disease .9999 (+ .02, − .98).]
Applying Bayes' formula, P(Disease | positive) = (.0001 × .99) / (.0001 × .99 + .9999 × .02) ≈ .0049. The probability of our patient being ill, given what we know about the prevalence of the disease in the population and about the test's performance, is therefore .0049: naturally higher than in the background population, but still very small. Suppose it is customary to repeat the test, and the result is positive again.
Example: A Clinical Test
[Updated tree: Disease .0049 (+ .99, − .01); No Disease .9951 (+ .02, − .98).]
We can still use the same formula, of course, and the same tree for visualisation, but now the probabilities of having the disease and being disease-free before the test are not .0001 and .9999; instead they are .0049 and .9951, based on the result of the test so far. Applying the formula again, P(Disease | two positives) = (.0049 × .99) / (.0049 × .99 + .9951 × .02) ≈ .20. Still not very high, which suggests the test is not very efficient: evidently you have to repeat it many times before you become reasonably sure.
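The two updates above can be reproduced in a few lines of Python; this is a small sketch of the calculation just described, with the function name chosen only for this illustration.

```python
# Clinical test: prior .0001, sensitivity .99, specificity .98 (from the slides).
def posterior_given_positive(prior, sensitivity=0.99, specificity=0.98):
    """P(disease | positive test) via Bayes' formula."""
    false_positive_rate = 1 - specificity
    evidence = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / evidence

p = 0.0001
p = posterior_given_positive(p)   # after the first positive result
print(round(p, 4))                # ~0.0049
p = posterior_given_positive(p)   # the first posterior becomes the new prior
print(round(p, 2))                # ~0.20
```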
Bayesian Inference
Combine the prior with the model to produce the posterior probability distribution.
Step-wise modeling: the posterior from one step serves as the prior for the next step (as in the repeated clinical test above).
Example
Thumbtack problem: will it land on the point ("heads") or on the flat bit ("tails")? Flip it N times. What will it do on the (N+1)th time? How do we compute p(xN+1 | D, ξ) from p(θ | ξ)?
Inference as Learning
Suppose you have a coin with an unknown bias, θ ≡ P(heads). You flip the coin multiple times and observe the outcomes. From the observations, you can infer the bias of the coin. This is learning. This is inference.
Independent events → sufficient statistics: if the flips are independent, the whole sequence can be summarised by the counts (the number of heads x and the number of flips N).
Given that I have flipped a coin 100 times and it has landed heads-up 100 times, what is the likelihood that the coin is fair?
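One way to read this question is as a likelihood calculation: how probable are the observed data if θ = 0.5? The sketch below computes that number (and, for contrast, the likelihood under a heavily biased coin); note that the likelihood of the data under a fair coin is not the posterior probability that the coin is fair, which additionally requires a prior.

```python
from scipy.stats import binom

# Likelihood of 100 heads in 100 flips under a fair coin, P(data | θ = 0.5).
print(binom.pmf(100, n=100, p=0.5))    # 0.5**100, roughly 8e-31

# For contrast, the likelihood under a heavily biased coin, θ = 0.99 (assumed value).
print(binom.pmf(100, n=100, p=0.99))   # roughly 0.37
```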
Bayesian Solution
N = 10 throws, 2 'heads' recorded.
Model: x | θ ~ Bin(N = 10, θ)
Prior? Some candidates:
Beta(θ | α=1, β=1)
Beta(θ | α=5, β=5)
Beta(θ | α=5, β=20)
Posterior Distribution
θ | x, N, α, β ~ Beta(α + x, β + N - x)
With x = 2 heads in N = 10 throws and the Beta(θ | 1, 1) prior: Beta(θ | 1+2, 1+10-2), i.e. Beta(θ | 3, 9).
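A few lines of Python (scipy.stats.beta) reproduce this conjugate update; the last line also answers the earlier thumbtack question, since for this model p(xN+1 = heads | D) equals the posterior mean of θ.

```python
from scipy.stats import beta

# Conjugate Beta-Binomial update: Beta(a, b) prior plus x heads in N flips
# gives a Beta(a + x, b + N - x) posterior.
a_prior, b_prior = 1, 1        # the flat Beta(1, 1) prior from the slide
x, N = 2, 10                   # 2 heads recorded in 10 throws

posterior = beta(a_prior + x, b_prior + N - x)   # Beta(3, 9)

# Posterior-predictive probability that the next flip is heads.
print(posterior.mean())        # 3 / 12 = 0.25
```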
Posterior Distribution
When the data are 'strong enough', the posterior will not depend much on the prior. Unless some extraneous information is available, a non-informative or vague prior can be used, in which case the inference will mostly be based on the data. If, however, some prior knowledge is available regarding the parameters of interest, then it should be included in the reasoning. Furthermore, it is common practice to conduct a sensitivity analysis, in other words to examine the effect the prior might have on the posterior distribution. In any case, the prior assumptions should always be made explicit, and if the result is found to depend on them, that sensitivity should be investigated in detail.
Posterior Inference
Posterior distribution: Beta(θ | 3, 9)
Posterior mean: .25
Posterior variance: .014
Posterior 95% HDR: (.041, .484)
P(θ < .7) = .9994
Frequentist:
Mean: .2
Variance: .016
Exact 95% CI: (.03, .56)
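These summaries can be checked with scipy. The sketch prints the posterior mean, variance, tail probability, and an equal-tailed 95% credible interval; the slide quotes the 95% highest-density region instead, which is computed differently and so differs slightly from the equal-tailed interval.

```python
from scipy.stats import beta

posterior = beta(3, 9)                 # the Beta(θ | 3, 9) posterior

print(posterior.mean())                # 0.25
print(posterior.var())                 # ~0.014
print(posterior.cdf(0.7))              # P(θ < .7) ~ 0.9994
print(posterior.ppf([0.025, 0.975]))   # equal-tailed 95% interval (not the HDR)
```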
Bayesian networks
A Bayesian Network
A Bayesian network is made up of:
1. A directed acyclic graph (in the running example, nodes A, B, C, D)
2. A set of tables, one for each node in the graph
A Directed Acyclic Graph
Each node in the graph is a random variable. A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B. Informally, an arrow from node X to node Y means X has a direct influence on Y.
[Example graph: A → B, B → C, B → D.]
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node. The parameters of the network are the probabilities in these conditional probability tables (CPTs).
A Set of Tables for Each Node
Conditional probability distribution for C given B: for a given combination of values of the parents (B in this example), the entries P(C=true | B) and P(C=false | B) must add up to 1, e.g. P(C=true | B=false) + P(C=false | B=false) = 1. If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
Bayesian Networks
Two important properties:
1. The graph structure encodes the conditional independence relationships between the variables.
2. The network is a compact representation of the joint probability distribution over the variables.
Conditional Independence
The Markov condition: given its parents (P1, P2), a node X is conditionally independent of its non-descendants (ND1, ND2).
[Diagram: parents P1, P2 point to X; X points to children C1, C2; ND1, ND2 are non-descendants.]
The Joint Probability Distribution
Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula
P(X1, …, Xn) = ∏i P(Xi | Parents(Xi)),
where Parents(Xi) means the values of the parents of the node Xi with respect to the graph.
Using a Bayesian Network: Example
Using the network in the example, suppose you want to calculate
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)   [this factorization comes from the graph structure]
= (0.4) * (0.3) * (0.1) * (0.95)   [these numbers come from the conditional probability tables]
= 0.0114
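The same product can be written as a tiny Python sketch; the structure A → B, B → C, B → D is the one implied by the factorization on the slide, and only the four CPT entries quoted above are used.

```python
# CPT entries quoted on the slide for the example network A -> B, B -> C, B -> D.
p_A_true = 0.4                  # P(A = true)
p_B_true_given_A_true = 0.3     # P(B = true | A = true)
p_C_true_given_B_true = 0.1     # P(C = true | B = true)
p_D_true_given_B_true = 0.95    # P(D = true | B = true)

# Markov-condition factorization: P(A, B, C, D) = P(A) P(B|A) P(C|B) P(D|B)
joint = (p_A_true * p_B_true_given_A_true
         * p_C_true_given_B_true * p_D_true_given_B_true)
print(round(joint, 4))          # 0.0114
```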
Inference
Using a Bayesian network to compute probabilities is called inference. In general, inference involves queries of the form P(X | E), where X is the query variable(s) and E is the evidence variable(s).

Inference
[Example network with nodes HasAnthrax, HasCough, HasFever, HasDifficultyBreathing, HasWideMediastinum.]
An example of a query would be: P(HasAnthrax = true | HasFever = true, HasCough = true).
Note: even though HasDifficultyBreathing and HasWideMediastinum are in the Bayesian network, they are not given values in the query (i.e. they appear neither as query variables nor as evidence variables). They are treated as unobserved variables.
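One simple way to answer such a query is inference by enumeration: sum the joint distribution over the unobserved variables and normalize by the probability of the evidence. The sketch below does this on the earlier A/B/C/D example network; the CPT entries that do not appear on the slides (the A = false and B = false rows) are invented placeholders for illustration.

```python
from itertools import product

# Example network A -> B, B -> C, B -> D.
# Values marked "assumed" are NOT from the slides; they are placeholders.
P_A_true = 0.4
P_B_true_given_A = {True: 0.3, False: 0.1}    # A = false entry assumed
P_C_true_given_B = {True: 0.1, False: 0.5}    # B = false entry assumed
P_D_true_given_B = {True: 0.95, False: 0.2}   # B = false entry assumed

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) from the chain-rule factorization."""
    p = P_A_true if a else 1 - P_A_true
    p *= P_B_true_given_A[a] if b else 1 - P_B_true_given_A[a]
    p *= P_C_true_given_B[b] if c else 1 - P_C_true_given_B[b]
    p *= P_D_true_given_B[b] if d else 1 - P_D_true_given_B[b]
    return p

# Query P(B = true | D = true): marginalize out the unobserved A and C,
# then divide by the probability of the evidence D = true.
numerator = sum(joint(a, True, c, True) for a, c in product([True, False], repeat=2))
evidence = sum(joint(a, b, c, True) for a, b, c in product([True, False], repeat=3))
print(numerator / evidence)
```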