BML lecture #1: Bayesics
http://github.com/rbardenet/bml-course
Rémi Bardenet
remi.bardenet@gmail.com
CNRS & CRIStAL, Univ. Lille, France
1 / 38
What comes to your mind when you hear “Bayesian ML”?
2 / 38
Course outline
3 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
4 / 38
Quotes from Gelman et al., 2013 on Bayesian methods
▶ [...] practical methods for making inferences from data, using
probability models for quantities we observe and for quantities about
which we wish to learn.
▶ The essential characteristic of Bayesian methods is their explicit use
of probability for quantifying uncertainty in inferences based on
statistical data analysis.
▶ Three steps:
1 Setting up a full probability model,
2 Conditioning on observed data, calculating and interpreting the
appropriate “posterior distribution”,
3 Evaluating the fit of the model and the implications of the resulting
posterior distribution. In response, one can alter or expand the
model and repeat the three steps.
5 / 38
Notation that I will try to stick to
▶ y_{1:n} = (y_1, . . . , y_n) ∈ 𝒴^n denote observable data/labels.
▶ x_{1:n} ∈ 𝒳^n denote covariates/features/hidden states.
▶ z_{1:n} ∈ 𝒵^n denote hidden variables.
▶ θ ∈ Θ denote parameters.
▶ X denotes an 𝒳-valued random variable. Lowercase x denotes either a point in 𝒳 or an 𝒳-valued random variable.
6 / 38
More notation
▶ Whenever it can easily be made formal, we write densities for our random variables and let the context indicate what is meant. So if X ∼ N(0, σ²), we write

    E h(X) = ∫ h(x) e^{−x²/(2σ²)} / (σ√(2π)) dx = ∫ h(x) p(x) dx.

  Similarly, for X ∼ P(λ), we write

    E h(X) = Σ_{k=0}^{∞} h(k) e^{−λ} λ^k / k! = ∫ h(x) p(x) dx.

▶ All pdfs are denoted by p, so that, e.g.,

    E h(Y, θ) = ∫ h(y, θ) p(y, θ) dy dθ
              = ∫ h(y, θ) p(y, x, θ) dx dy dθ
              = ∫ h(y, θ) p(y, θ | x) p(x) dx dy dθ.
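A quick numerical sanity check of this notation (an added illustration, not from the original slides): approximate E h(X) for the Gaussian and Poisson cases above with NumPy/SciPy.

```python
# Added numerical check (not from the slides): E h(X) computed against p
# for X ~ N(0, sigma^2) (integral, here via Monte Carlo) and for X ~ P(lam)
# (sum over the Poisson pmf, truncated).
import numpy as np
from scipy import stats

h = lambda x: x**2                     # any test function
sigma, lam = 1.5, 3.0

# X ~ N(0, sigma^2): E h(X) = \int h(x) p(x) dx  ~  sigma^2 = 2.25
x = stats.norm(0, sigma).rvs(size=100_000, random_state=0)
print(np.mean(h(x)))

# X ~ P(lam): E h(X) = sum_k h(k) e^{-lam} lam^k / k!  =  lam + lam^2 = 12
k = np.arange(0, 60)
print(np.sum(h(k) * stats.poisson(lam).pmf(k)))
```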
7 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
8 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
9 / 38
Inference in regression models
10 / 38
Inference in regression models
11 / 38
Inference in regression models
12 / 38
Inference in regression models
13 / 38
Inference in regression models
14 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
15 / 38
Describing a decision problem under uncertainty
▶ A state space S,
Every quantity you need to consider to make your decision.
▶ Actions A ⊂ F(S, Z),
Making a decision means picking one of the available actions.
▶ A reward space Z,
Encodes how you feel about having picked a particular action.
▶ A loss function L : A × S → R+.
How much you would suffer from picking action a in state s.
16 / 38
Classification as a decision problem
▶ S = 𝒳^n × 𝒴^n × 𝒳 × 𝒴, i.e. s = (x_{1:n}, y_{1:n}, x, y).
▶ Z = {0, 1}.
▶ A = {a_g : s ↦ 1_{y ≠ g(x; x_{1:n}, y_{1:n})}, g ∈ G}.
▶ L(a_g, s) = 1_{y ≠ g(x; x_{1:n}, y_{1:n})}.
PAC bounds; see e.g. (Shalev-Shwartz and Ben-David, 2014)
Let (x_{1:n}, y_{1:n}) ∼ P^{⊗n} and, independently, (x, y) ∼ P. We want an algorithm g(·; x_{1:n}, y_{1:n}) ∈ G such that, if n ⩾ n(δ, ε),

    P^{⊗n}( E_{(x,y)∼P} L(a_g, s) ⩽ ε ) ⩾ 1 − δ.
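Below is a minimal sketch of this decision problem on an assumed toy distribution P (two unit-variance Gaussian classes in 1D), with a 1-nearest-neighbour rule standing in for one particular g ∈ G; the expected 0-1 loss in the PAC statement is estimated by Monte Carlo on fresh draws from P.

```python
# Added sketch (assumed toy P, not from the slides): two unit-variance Gaussian
# classes in 1D, g = 1-nearest-neighbour rule; estimate E_{(x,y)~P} L(a_g, s),
# the expected 0-1 loss, on fresh draws from P.
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Draw (x, y): y ~ Bernoulli(1/2), x | y ~ N(2y - 1, 1)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(2 * y - 1, 1.0)
    return x, y

def g(x_new, x_train, y_train):
    """One particular g in G: predict the label of the nearest training point."""
    return y_train[np.argmin(np.abs(x_train - x_new))]

x_train, y_train = sample_P(200)        # (x_{1:n}, y_{1:n}) ~ P^{⊗n}
x_test, y_test = sample_P(10_000)       # fresh (x, y) ~ P, for the Monte Carlo estimate
errors = [g(x, x_train, y_train) != y for x, y in zip(x_test, y_test)]
print("estimated expected 0-1 loss:", np.mean(errors))
```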
17 / 38
Regression as a decision problem
▶ S =
▶ Z =
▶ A =
▶
18 / 38
Estimation as a decision problem
▶ S =
▶ Z =
▶ A =
▶
19 / 38
Clustering as a decision problem
▶ S =
▶ Z =
▶ A =
▶
20 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
21 / 38
SEU is what defines the Bayesian approach
The subjective expected utility principle
1 Choose S, Z, A and a loss function L(a, s),
2 Choose a distribution p over S,
3 Take the corresponding Bayes action

    a⋆ ∈ arg min_{a∈A} E_{s∼p} L(a, s).    (1)

Corollary: minimize the posterior expected loss
Now partition s = (s_obs, s_u); then

    a⋆ ∈ arg min_{a∈A} E_{s_obs} E_{s_u|s_obs} L(a, s).

In ML, A = {a_g}, with g = g(s_obs), so that (1) is equivalent to a⋆ = a_{g⋆}, with

    g⋆(s_obs) ≜ arg min_g E_{s_u|s_obs} L(a_g, s).
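As an added illustration of the SEU recipe (a toy conjugate example, not from the slides): take s_u = θ with θ ∼ N(0, 1), y_i | θ ∼ N(θ, 1), actions a ∈ R as point estimates, and minimise a Monte Carlo estimate of the posterior expected loss over a grid of candidate actions. Squared loss should recover the posterior mean, absolute loss the posterior median.

```python
# Added toy example of the SEU recipe (not from the slides): theta ~ N(0, 1),
# y_i | theta ~ N(theta, 1); s_obs = y_{1:n}, s_u = theta; actions are point
# estimates a, and the Bayes action minimises the posterior expected loss,
# estimated here by Monte Carlo over a grid of candidate actions.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=20)                      # observed data s_obs

# Conjugate posterior: theta | y ~ N(m, v) with precision 1 + n.
v = 1.0 / (1.0 + len(y))
m = v * y.sum()
theta = rng.normal(m, np.sqrt(v), size=20_000)         # draws from p(s_u | s_obs)

actions = np.linspace(-2.0, 3.0, 501)
for name, loss in [("squared", lambda a, t: (a - t) ** 2),
                   ("absolute", lambda a, t: np.abs(a - t))]:
    exp_loss = np.array([loss(a, theta).mean() for a in actions])
    print(name, "loss -> Bayes action ~", actions[exp_loss.argmin()])
# Squared loss recovers the posterior mean m, absolute loss the posterior median.
```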
22 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
23 / 38
A recap on probabilistic graphical models 1/2
▶ PGMs (aka “Bayesian” networks) represent the dependencies in a joint distribution p(s) by a directed graph G = (E, V).
▶ Two important properties:

    p(s) = ∏_{v∈V} p(s_v | s_{pa(v)})    and    s_v ⊥ s_{nd(v)} | s_{pa(v)},

  where pa(v) are the parents of v and nd(v) its non-descendants: the joint factorizes over the graph, and each node is independent of its non-descendants given its parents.
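A small numerical illustration of both properties, on an assumed chain graph x1 → x2 → x3 with binary variables: build the joint from its factorization and check that x3 is independent of the non-descendant x1 given its parent x2.

```python
# Added example (assumed chain graph x1 -> x2 -> x3, binary variables): build
# the joint from its factorization p(x1) p(x2 | x1) p(x3 | x2) and check the
# local Markov property x3 ⊥ x1 | x2 numerically.
import numpy as np

p1 = np.array([0.6, 0.4])                      # p(x1)
p2_1 = np.array([[0.7, 0.3], [0.2, 0.8]])      # p(x2 | x1), rows indexed by x1
p3_2 = np.array([[0.9, 0.1], [0.5, 0.5]])      # p(x3 | x2), rows indexed by x2

# joint[x1, x2, x3] = p(x1) p(x2 | x1) p(x3 | x2)
joint = p1[:, None, None] * p2_1[:, :, None] * p3_2[None, :, :]
assert np.isclose(joint.sum(), 1.0)

# p(x3 | x1, x2) equals p(x3 | x2) for every (x1, x2): the non-descendant x1
# brings no extra information once the parent x2 is known.
p_x3_given_x1_x2 = joint / joint.sum(axis=2, keepdims=True)
print(np.allclose(p_x3_given_x1_x2, p3_2[None, :, :]))   # True
```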
24 / 38
A recap on probabilistic graphical models 2/2
Also good to know how to determine whether A ⊥ B|C; see (Murphy,
2012, Section 10.5).
d-blocking
An undirected path P in G is d-blocked by E ⊂ V if at least one of the following conditions holds.
▶ P contains a “chain” a → b → c and b ∈ E.
▶ P contains a “tent” a ← b → c and b ∈ E.
▶ P contains a “v-structure” a → b ← c and neither b nor any of its descendants are in E.
Theorem
If every undirected path between A and B is d-blocked by C, then A ⊥ B | C in any distribution that factorizes over G.
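A numeric illustration of the v-structure rule, on an assumed collider graph x1 → x3 ← x2: the single path between x1 and x2 is d-blocked by E = ∅ but not by E = {x3}, so conditioning on the collider can create dependence.

```python
# Added numeric check of the v-structure rule (assumed collider x1 -> x3 <- x2):
# x1, x2 ~ Bernoulli(1/2) independent, x3 = x1 XOR x2. The path x1 -> x3 <- x2
# is d-blocked by E = {} (x1 ⊥ x2), but not by E = {x3}.
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.integers(0, 2, size=200_000)
x2 = rng.integers(0, 2, size=200_000)
x3 = x1 ^ x2

# Marginally independent: P(x1 = 1 | x2 = 1) ~ P(x1 = 1) ~ 0.5.
print(x1[x2 == 1].mean(), x1.mean())

# Conditioning on the collider creates dependence: P(x1 = 1 | x2 = 1, x3 = 1) ~ 0.
print(x1[(x2 == 1) & (x3 == 1)].mean())
```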
25 / 38
Exercise
▶ Does x2 ⊥ x6|x5, x1?
▶ Does x2 ⊥ x6|x1?
▶ Write the joint distribution as factorized over the graph.
26 / 38
Estimation as a decision problem: point estimates
27 / 38
Estimation as a decision problem: credible intervals
28 / 38
Choosing priors (see Exercises)
29 / 38
Classification as a decision problem
30 / 38
Regression as a decision problem 1/2
31 / 38
Regression as a decision problem 2/2
32 / 38
Dimensionality reduction as a decision problem
33 / 38
Clustering as a decision problem
34 / 38
Topic modelling as a decision problem
35 / 38
Outline
1 A warmup: Estimation in regression models
2 ML as data-driven decision-making
3 Subjective expected utility
4 Specifying joint models
5 50 shades of Bayes
36 / 38
50 shades of Bayes
An issue (or is it?)
Depending on how they interpret and how they implement SEU, you will
meet many types of Bayesians (46656, according to Good).
A few divisive questions
▶ Using data or the likelihood to choose your prior; see Lecture #5.
▶ Using MAP estimators for their computational tractability, as in inverse problems:

    x̂_λ ∈ arg min_x ∥y − Ax∥ + λ Ω(x);

  a sketch follows after this list.
▶ When and how should you revise your model (likelihood or prior)?
▶ MCMC vs variational Bayes (more in Lectures #2 and #3)
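A minimal sketch of the MAP bullet above, under the common special case (my assumption) of a squared residual and Ω(x) = ∥x∥²: with a Gaussian likelihood and a Gaussian prior on x, the MAP estimate is ridge regression and has a closed form.

```python
# Added sketch of the MAP bullet (assumption: squared residual and
# Omega(x) = ||x||^2, i.e. Gaussian likelihood y | x ~ N(Ax, sigma^2 I) and a
# Gaussian prior on x). The MAP estimate is then ridge regression.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 10))
x_true = rng.normal(size=10)
y = A @ x_true + 0.1 * rng.normal(size=50)

lam = 1.0
# Closed form: x_hat = (A^T A + lam I)^{-1} A^T y
x_map = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)
print("||x_map - x_true|| =", np.linalg.norm(x_map - x_true))
```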
37 / 38
References I
[1] A. Gelman et al. Bayesian Data Analysis. 3rd ed. CRC Press, 2013.
[2] K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[3] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
38 / 38
