Bayesian Learning
Steven L. Scott
In the last section, on conditional probability, we saw that Bayes’ rule can be written
p(θ|y) ∝ p(y|θ)p(θ).
The distribution p(θ) is called the prior distribution, or just “the prior,” p(y|θ) is the likelihood function,
and p(θ|y) is the posterior distribution. The prior distribution describes one’s belief about the value of θ
before seeing y. The posterior distribution describes the same person’s belief about θ after seeing y. Bayes’
theorem describes the process of learning about θ when y is observed.
1 An example
Let’s look at Bayes’ rule through an example. Suppose a biased coin with success probability θ is indepen-
dently flipped 10 times, and 3 successes are observed. The data y = 3 arise from a binomial distribution
with n = 10 and p = θ, so the likelihood is
p(y = 3 | θ) = (10 choose 3) θ^3 (1 − θ)^7.    (1)
What should the prior distribution be? In an abstract problem like this, most people are comfortable
assuming that there is no reason to prefer any one legal value of θ to another, which would imply the uniform
prior: p(θ) = 1 for θ ∈ (0, 1), with p(θ) = 0 otherwise. This is a common strategy in practice. In the absence
of any “real” prior information about a parameter’s value (which is a typical situation), one strives to choose
a prior that is “nearly noninformative.” We will see below that this is not always possible, but it is a useful
guiding principle. The prior and likelihood for this example are shown in the first two panels of Figure 1.
Figure 1: Bayesian learning in the binomial example. Panels (a)-(c) plot the prior density, the likelihood, and the posterior density, respectively, as functions of θ.
To find the posterior distribution we simply multiply the prior by the likelihood (which in this case just gives the likelihood), and normalize so that the result integrates to 1. In this case the normalization
constant is proportional to a mathematical special function known as the “beta function”, and the resulting
distribution is a known distribution called the “beta distribution.” The density of the beta distribution with
parameters a and b is
p(θ) = [Γ(a + b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1).    (2)
If θ is a random variable with the density function in equation (2) then we say θ ∼ Be(a, b). If we ignore
factors other than θ and 1−θ we see that in our example a−1 = 3 and b−1 = 7, so our posterior distribution
must be Be(4, 8). This distribution is plotted in Figure 1(c). Because it is simply a renormalization of the
function in Figure 1(b), the two panels differ only in the axis labels.
2 Conjugate priors
The uniform prior used in the previous section would be inappropriate if we actually had prior information
that θ was small. For example, if y counted conversions on a website, we might have historical information
about the distribution of conversion rates on similar sites. If we can describe our prior belief in the form of
a Be(a, b) distribution (i.e. if we can represent our prior beliefs by choosing specific numerical values of a
and b), then the posterior distribution after observing y successes out of n binomial trials is
p(θ|y) ∝ [ (n choose y) θ^y (1 − θ)^(n−y) ]  ×  [ Γ(a + b)/(Γ(a)Γ(b)) θ^(a−1) (1 − θ)^(b−1) ]
                  (likelihood)                                  (prior)

       ∝ θ^(y+a−1) (1 − θ)^(n−y+b−1).    (3)
We move from the first line of equation (3) to the second by combining the exponents of θ and 1 − θ, and
ignoring factors that don’t depend on θ. We recognize the outcome as proportional to the Be(y+a, n−y+b)
distribution. Thus “Bayesian learning” in this example amounts to adding y to a and n − y to b. That’s a
helpful way of understanding the prior parameters: a and b represent “prior successes” and “prior failures.”
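In code, this conjugate update is a single line. A minimal sketch (the function name is mine), using scipy only for the posterior summaries:

```python
from scipy.stats import beta

def update_beta(a, b, y, n):
    """Return the Be(a + y, b + n - y) posterior parameters."""
    return a + y, b + n - y

# Uniform Be(1, 1) prior combined with 3 successes in 10 trials.
a_post, b_post = update_beta(1, 1, y=3, n=10)
print(a_post, b_post)                       # 4 8, matching Section 1
print(beta.mean(a_post, b_post))            # posterior mean 4/12 = 0.333...
print(beta.interval(0.95, a_post, b_post))  # central 95% credible interval
```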
When Bayes’ rule combines a likelihood and a prior in such a way that the posterior is from the same
model family as the prior, the prior is said to be conjugate to the likelihood. Most models don’t have
conjugate priors, but many models in the exponential family do. A distribution is in the exponential family
if its log density is a linear function of some function of the data. That is, if its density can be written
p(y|θ) = a(θ) b(y) exp(c(θ) d(y)).    (4)
Many of the famous “named” distributions are in the exponential family, including binomial, Poisson, ex-
ponential, and Gaussian. The Student t distribution is an example of a “famous” distribution that is not in
the exponential family.
If a model is in the exponential family then it has sufficient statistics: Σ_i d(y_i). You can find the
conjugate prior for an exponential family model by imagining equation (4) as a function of θ rather than y,
and renormalizing (assuming the integral with respect to θ is finite). This formulation makes it clear that
the parameters of the prior can be interpreted as sufficient statistics for the model, just as a and b can be thought of as prior successes and failures in the binomial example.
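As a concrete instance of this recipe (a sketch of the standard derivation, restated here rather than taken from the text): read the binomial likelihood as a function of θ and renormalize.

```latex
% Binomial likelihood, viewed as a function of theta rather than y:
p(y \mid \theta) \;\propto\; \theta^{y}\,(1 - \theta)^{n - y}.
% Renormalizing over theta in (0, 1) gives a beta density:
p(\theta) \;=\; \frac{\theta^{y}\,(1 - \theta)^{n - y}}{B(y + 1,\; n - y + 1)}
          \;=\; \mathrm{Be}(y + 1,\; n - y + 1),
% so the conjugate family for the binomial is the beta family, with the
% prior parameters playing the role of (imaginary) sufficient statistics.
```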
A second example is the variance of a Gaussian model with known mean. Error terms in many models are
often assumed to be zero-mean Gaussian random variables, so this problem comes up frequently. Suppose
y_i ∼ N(0, σ^2), independently, and let y = (y_1, . . . , y_n). The likelihood function is

p(y | σ^2) = (2π)^(−n/2) (1/σ^2)^(n/2) exp( −(1/(2σ^2)) Σ_i y_i^2 ).    (5)
Distribution                     Conjugate Prior
binomial                         beta
Poisson / exponential            gamma
normal mean (known variance)     normal
normal precision (known mean)    gamma

Table 1: Some models with conjugate priors.
The expression containing 1/σ^2 in equation (5) looks like the kernel of the gamma distribution. We write θ ∼ Ga(a, b) if

p(θ | a, b) = [ b^a / Γ(a) ] θ^(a−1) exp(−bθ).    (6)
If one assumes the prior 1/σ^2 ∼ Ga(df/2, ss/2), then Bayes’ rule gives

p(1/σ^2 | y) ∝ [ (1/σ^2)^(n/2) exp( −(1/(2σ^2)) Σ_i y_i^2 ) ]  ×  [ (1/σ^2)^(df/2 − 1) exp( −(ss/2)(1/σ^2) ) ]
                         (likelihood)                                          (prior)

             ∝ (1/σ^2)^((n + df)/2 − 1) exp( −(1/σ^2) (ss + Σ_i y_i^2)/2 )

             ∝ Ga( (n + df)/2, (ss + Σ_i y_i^2)/2 ).    (7)
Notice how the parameters of the prior df and ss interact with the sufficient statistics of the model. One
can interpret df as a “prior sample size” and ss as a “prior sum of squares.”
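A sketch of this update in code (the function and variable names are mine; the parameterization follows equation (6)):

```python
import numpy as np

def update_precision(df, ss, y):
    """Posterior Ga((n + df)/2, (ss + sum(y_i^2))/2) for 1/sigma^2,
    given a Ga(df/2, ss/2) prior and zero-mean Gaussian data y."""
    n = len(y)
    return (n + df) / 2.0, (ss + np.sum(np.square(y))) / 2.0

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=50)   # simulated data with true sigma^2 = 4

# Weak prior: prior "sample size" df = 1 and prior "sum of squares" ss = 1.
a_post, b_post = update_precision(df=1.0, ss=1.0, y=y)
print(a_post / b_post)              # posterior mean of the precision 1/sigma^2
print(b_post / (a_post - 1.0))      # posterior mean of sigma^2 (inverse gamma)
```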
It is important to stress that not all models have conjugate priors, and even when they do, conjugate priors may not appropriately express certain types of prior knowledge. Yet when they exist, thinking about
prior distributions through the lens of conjugate priors can help you understand the information content of
the assumed prior.
3 Posteriors compromise between prior and likelihood
Conjugate priors allow us to mathematically study the relationship between prior and likelihood. In the
binomial example with a beta prior, the Be(a, b) distribution has mean π = a/(a + b) and variance π(1 − π)/(ν + 1), where ν = a + b. It is clear from equation (3) that a acts like a prior number of successes and b a prior number of failures. The mean of the posterior distribution Be(a + y, b + n − y) is thus

π̃ = (a + y)/(ν + n) = [ν/(ν + n)] (a/ν) + [n/(ν + n)] (y/n).    (8)
Equation (8) shows the posterior mean π̃ is a weighted average of the prior mean a/ν and the mean of the
data y/n. The weights in the average are proportional to ν and n, which are the total information content
in the prior and the data, respectively.
The posterior variance is
π̃(1 − π̃) / (n + ν + 1).    (9)
The total amount of information in the posterior distribution is often measured by its precision, which is the
inverse (reciprocal) of its variance. The precision of Be(a + y, b + n − y) is
n / (π̃(1 − π̃))  +  (ν + 1) / (π̃(1 − π̃)),
which is the sum of the precision from the prior and from the data.
The results shown above are not specific to the binomial distribution. In the general setting, the posterior
mean is a precision weighted average of the mean from the data and the mean from the prior, while the
inverse of the posterior variance is the sum of the prior precision and data precision. This fact helps us get
a sense of the relative importance of the prior versus the data in forming the posterior distribution.
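Both facts are easy to verify numerically in the beta-binomial case. A small sketch (the specific numbers are mine, chosen only for illustration):

```python
a, b, y, n = 2.0, 5.0, 30.0, 100.0
nu = a + b

# Equation (8): the posterior mean is a weighted average.
post_mean = (a + y) / (nu + n)
weighted = (nu / (nu + n)) * (a / nu) + (n / (nu + n)) * (y / n)
print(post_mean, weighted)                     # identical

# Precisions add: posterior precision = data precision + prior precision.
pi = post_mean
posterior_precision = (n + nu + 1.0) / (pi * (1.0 - pi))
data_precision = n / (pi * (1.0 - pi))
prior_precision = (nu + 1.0) / (pi * (1.0 - pi))
print(posterior_precision, data_precision + prior_precision)  # identical
```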
4 How much should you worry about the prior?
People new to Bayesian reasoning are often concerned about “assuming the answer,” in the sense that their
choice of a prior distribution will unduly influence the posterior distribution. There is good news and bad
on this front.
4.1 Likelihood dominates prior
First the good news. In regular models with moderate to large amounts of data, the data asymptotically
overwhelm the prior. Consider Figure 2, which applies a few different beta prior distributions to the same data to show the impact on the posterior. In panel (a) the data contain only 10 observations, so varying the a and b parameters in the prior distribution by one or two units each represents an appreciable change in the total available information. Panel (b) shows the same analysis when there are 100 observations in the data, so moving a prior parameter by one or two units doesn’t have a particularly big impact.
Figure 2: How the posterior distribution varies with the choice of prior. Each panel shows the posterior densities obtained from the priors Be(1, 1), Be(.5, .5), Be(2, .5), and Be(.5, 2). (a) 3 successes from 10 trials, (b) 30 successes from 100 trials.
Whatever prior you choose contains a fixed amount of information. If you imagine applying that prior
to larger and larger data sets, its influence will eventually vanish.
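To see this dominance numerically (a sketch; the 30% success rate mirrors Figure 2, and the sample sizes are mine):

```python
from scipy.stats import beta

priors = [(1, 1), (0.5, 0.5), (2, 0.5), (0.5, 2)]
for n in [10, 100, 1000, 10000]:
    y = round(0.3 * n)
    means = [beta.mean(a + y, b + n - y) for a, b in priors]
    print(n, [f"{m:.4f}" for m in means])  # posterior means converge as n grows
```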
4.2 Sometimes priors do strange things
Now for the bad news. Even though many models are insensitive to a poorly chosen prior, not all of them
are. If your model is based on means, standard deviations, and regression coefficients, then there is a good
chance that any “weak” prior that you choose will have minimal impact. If the model has lots of latent
variables and other weakly identified unknowns, then the prior is probably more influential. Because priors
can sometimes carry more influence than intended, researchers have spent a considerable amount of time
thinking about how best to represent “prior ignorance” using a default prior. Kass and Wasserman (1996) ably summarize these efforts.
One issue that can come up is that the amount of information in a prior distribution can depend on the scale on which one views a parameter. For example, suppose you place a uniform prior on θ, but then the analysis calls for the distribution of z = logit(θ) = log(θ/(1 − θ)). The Jacobian of this transformation implies f(z) = θ(1 − θ), where θ = e^z/(1 + e^z); this density is plotted (as a function of z) in Figure 3. The uniform prior on θ is clearly informative for logit(θ).
Figure 3: The solid line shows the density of a uniform random variable on the logit scale, derived mathematically.
The histogram is the logit transform of 10,000 uniform random deviates.
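The figure is easy to reproduce. A sketch (the matplotlib usage is mine; the 10,000 deviates match the caption):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta = rng.uniform(size=10_000)      # uniform deviates on (0, 1)
z = np.log(theta / (1.0 - theta))     # their logit transform

# Theoretical density f(z) = theta(1 - theta) with theta = e^z / (1 + e^z).
grid = np.linspace(-10.0, 10.0, 400)
t = np.exp(grid) / (1.0 + np.exp(grid))

plt.hist(z, bins=60, density=True)    # the histogram in Figure 3
plt.plot(grid, t * (1.0 - t))         # the solid line in Figure 3
plt.xlabel("z")
plt.ylabel("density")
plt.show()
```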
4.3 Should you worry about priors?
Sometimes you need to, and sometimes you don’t. Until you get enough experience to trust your intuition
about whether a prior is worth worrying about, it is prudent to try an analysis under a few different choices
of prior. You can vary the prior parameters among a few reasonable values, or you can experiment to see
just how extreme the prior would need to be to derail the analysis.
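One way to run such a sensitivity check in the beta-binomial setting (a sketch; the priors are the same four used in Figure 2):

```python
from scipy.stats import beta

y, n = 3, 10
for a, b in [(1, 1), (0.5, 0.5), (2, 0.5), (0.5, 2)]:
    a_post, b_post = a + y, b + n - y
    lo, hi = beta.interval(0.95, a_post, b_post)
    print(f"Be({a}, {b}) prior: posterior mean {beta.mean(a_post, b_post):.3f}, "
          f"95% interval ({lo:.3f}, {hi:.3f})")
```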
In their paper, Kass and Wasserman made the point that problems where weak priors can make a big
difference tend to be “hard” problems where there is not much information in the data, in which case a
non-Bayesian analysis wouldn’t be particularly compelling (or in some cases, wouldn’t be possible). If you
find that modest variations in the prior lead to different conclusions, then you’re facing a hard problem. In that
case a practical strategy is to think about the scale on which you want to analyze your model, and choose
a prior that represents reasonable assumptions on that scale. State your assumptions up front, and present
the results under 2-3 other prior choices to show their impact. Then proceed with your chosen prior for the
rest of the analysis.