Artificial intelligence has undergone a paradigm shift since the second AI winter in the 1990s: the core of intelligence, once identified with reasoning, was gradually replaced by learning. Since then, the field has grown remarkably through machine learning, and deep learning has become the dominant approach, underpinning a wide range of algorithms and models. Since around 2016, however, AI researchers have been confronting the inherent limitations and flaws of these conventional approaches and searching for new methods.
In this presentation, we review the history of AI and discuss the problems of deep learning. We then examine Bayesian inference, an alternative to the dominant approach of statistical learning, demonstrate new ideas in AI based on Bayesian learning and reasoning, and finally point out troubles with Bayesianism.
1. Bayesian Learning and Reasoning
Mohammad Reza Samsami
Sharif University of Technology
Fall 2019
2. Outline
• A brief history: from Symbolic to Connectionist AI
• Motivations
• Introduction to Bayesian Inference
• Bayesian vs Frequentist
• Bayesian method
• Point estimation
• Meaning of probability
• Bayesian linear regression
• Bayesian model comparison and averaging
• New approaches
• Troubles with Bayesianism
3. A brief history: from Symbolic to Connectionist AI
Back to 1959: the General Problem Solver, designed to separate its problem-solving technique from the content of particular problems.
4. A brief history: from Symbolic to Connectionist AI
{𝑋 ∨ 𝑌, ¬𝑌, 𝑋 → 𝑍} ⊢ 𝑍
Logic
5. A brief history: from Symbolic to Connectionist AI
Logic
[Images: a tree, a number, a pizza]
6. A brief history: from Symbolic to Connectionist AI
Logic
[Images: a tree, a number, a pizza]
Symbols
7. A brief history: from Symbolic to Connectionist AI
[Images: a tree, a number, a pizza]
Symbols
Symbolic AI
9. A brief history: from Symbolic to Connectionist AI
Can machines think?
“Thinking is manipulation of symbols, and reasoning is computation.”
Thomas Hobbes
10. A brief history: from Symbolic to Connectionist AI
Logic is our problem-solving procedure, and symbols are how we represent the world. Relations are verbs explaining how symbols interact with each other, or adjectives describing symbols.
Show(MohammadReza, Slides)
11. A brief history: from Symbolic to Connectionist AI
The set of all true things about our universe is called a knowledge base, and we can use logic to examine our knowledge base to answer questions and discover new things.
The process of coming up with new propositions and checking whether they follow from the knowledge base is called inference.
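As a minimal sketch (not from the slides), the entailment {𝑋 ∨ 𝑌, ¬𝑌, 𝑋 → 𝑍} ⊢ 𝑍 from the earlier slide can be checked mechanically by enumerating truth assignments; the function names here are illustrative:

```python
from itertools import product

# Knowledge base from the earlier slide: {X ∨ Y, ¬Y, X → Z}.
def kb(x, y, z):
    return (x or y) and (not y) and ((not x) or z)

# Entailment check: Z is entailed iff Z holds in every model of the KB.
def entails(knowledge, query):
    return all(query(x, y, z)
               for x, y, z in product([False, True], repeat=3)
               if knowledge(x, y, z))

print(entails(kb, lambda x, y, z: z))  # True: the KB entails Z
```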
12. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Perception
The computer itself doesn’t know what the symbols mean: they are not necessarily linked to any other representation of the world in a non-symbolic way.
13. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Monotonicity
Reasoning based on classical deductive logic is monotonic: new knowledge cannot undo old conclusions.
If Γ ⊢ 𝑋, then Γ ∪ {𝐴} ⊢ 𝑋
14. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Uncertainty
15. A brief history: from Symbolic to Connectionist AI
Intelligence as: Reasoning → Learning
16. A brief history: from Symbolic to Connectionist AI
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell
17. A brief history: from Symbolic to Connectionist AI
Statistics
Optimization
26. Introduction
Concept Learning
f(x) = 1 if x is an example of the concept C, and otherwise f(x) = 0
The goal is to learn the indicator function f, which just defines
which elements are in the set C.
Number Game
An arithmetical concept 𝐶 over {1, 2, … , 100}: given examples 𝒟 = {𝑥₁, … , 𝑥ₙ} drawn from 𝐶, does a new 𝑥 belong to 𝐶?
29. Introduction
How can we explain this behavior and emulate it in a machine?
The classic approach to induction is to suppose we have a hypothesis
space of concepts, ℋ, such as: odd numbers, even numbers, all numbers
between 1 and 100, powers of two, all numbers ending in 8.
The subset of ℋ that is consistent with the data 𝒟 is called the version
space. As we see more examples, the version space shrinks and we
become increasingly certain about the concept.
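A minimal sketch of this shrinking version space (the hypothesis names and extensions are illustrative, chosen to match the examples above):

```python
# Candidate concepts from the slide; extensions over {1, ..., 100}.
hypotheses = {
    "odd numbers":   {n for n in range(1, 101) if n % 2 == 1},
    "even numbers":  {n for n in range(1, 101) if n % 2 == 0},
    "1 to 100":      set(range(1, 101)),
    "powers of two": {2, 4, 8, 16, 32, 64},
    "ends in 8":     {n for n in range(1, 101) if n % 10 == 8},
}

def version_space(data):
    # Keep only the hypotheses consistent with every observed example.
    return {name for name, ext in hypotheses.items() if set(data) <= ext}

print(version_space([8]))             # four hypotheses survive
print(version_space([8, 2, 16, 64]))  # shrinks: 'ends in 8' is ruled out
```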
30. Introduction
Why powers of 2 and not even numbers?
The key intuition is that we want to avoid suspicious coincidences: if the true concept were even numbers, why would we see only numbers that happen to be powers of two?
𝑃(𝒟 ∣ 𝐶_even) = (1 / size(𝐶_even))^𝑁 = (1/50)⁴ = 1.6 × 10⁻⁷
𝑃(𝒟 ∣ 𝐶_two) = (1 / size(𝐶_two))^𝑁 = (1/6)⁴ = 7.7 × 10⁻⁴
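These numbers are easy to reproduce; a short sketch under the same strong-sampling assumption (each example drawn uniformly from the concept’s extension):

```python
# Size principle: under strong sampling, each example is drawn uniformly
# from the concept's extension, so P(D | C) = (1 / size(C))^N.
def likelihood(data, extension):
    if not set(data) <= extension:  # hypothesis inconsistent with the data
        return 0.0
    return (1.0 / len(extension)) ** len(data)

data  = [16, 8, 2, 64]                            # N = 4 examples
evens = {n for n in range(1, 101) if n % 2 == 0}  # size 50
pow2  = {2, 4, 8, 16, 32, 64}                     # size 6

print(likelihood(data, evens))  # (1/50)^4 = 1.6e-07
print(likelihood(data, pow2))   # (1/6)^4  ≈ 7.7e-04
```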
39. Interpretations of Probability
Frequency Interpretation:
The dominant statistical practice for many years (known as the classical or frequentist theory) defines probability as the limiting relative frequency of an outcome over infinitely many repetitions of a random experiment.
It is therefore impossible to assign a probability to a statement such as “at least 50% of Iranians enjoy drinking Doogh.” The statement is either true or false, so its frequentist probability is either zero or one (but we might not know which).
40. Interpretations of Probability
Subjective (or Bayesian) Interpretation:
“By degree of probability, we really mean, or ought to mean, degree of belief.”
De Morgan
According to the subjective interpretation, probabilities are degrees of confidence, or credence, or partial beliefs of suitable agents. Thus we really have many interpretations of probability here: as many as there are suitable agents.
The Bayesian interpretation instead allows probabilities to describe degrees of belief in such a proposition. In this way, we can treat everything as a random variable and use the tools of probability to carry out all inference; in subjective probability, parameters, data, and hypotheses are all treated the same.
43. Point Estimation
Goal: choose a good value of 𝜃 given the data 𝒟.
Typically the posterior mean or median is the most appropriate choice for a real-valued quantity, and the vector of posterior marginals is the best choice for a discrete quantity.
However, the posterior mode is the most popular choice because it reduces inference to an optimization problem, for which efficient algorithms often exist.
44. Point Estimation
Maximum a posteriori (MAP) estimation
𝜽_MAP = argmax_𝜽 𝑃(𝜽 ∣ 𝒟) = argmax_𝜽 𝑃(𝒟 ∣ 𝜽) 𝑃(𝜽)
We have avoided computing the normalization constant 𝑃(𝒟), which does not depend on 𝜽.
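As an illustration (not from the slides), MAP estimation for a Bernoulli parameter 𝜃 with a conjugate Beta(a, b) prior has a closed form, so no normalization constant is ever computed:

```python
# Beta(a, b) prior on a Bernoulli parameter θ; the posterior is
# Beta(heads + a, tails + b), whose mode (the MAP estimate, for a, b > 1) is
#   θ_MAP = (heads + a - 1) / (n + a + b - 2).
def map_bernoulli(heads, n, a=2.0, b=2.0):
    return (heads + a - 1) / (n + a + b - 2)

print(map_bernoulli(heads=9, n=10))            # ≈ 0.833: prior shrinks the MLE (0.9)
print(map_bernoulli(heads=9, n=10, a=1, b=1))  # 0.9: a flat prior recovers the MLE
```

The shrinkage toward 0.5 in the first call is exactly the regularization effect listed among the pros on the next slide.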
45. Point Estimation
Maximum a posteriori (MAP) estimation
Pros:
• Easy to compute
• Interpretable
• Avoids overfitting (regularization, shrinkage)
Cons:
• No representation of uncertainty
• Not invariant to reparameterization: for 𝜏 = 𝑓(𝜃), in general 𝜏_MAP ≠ 𝑓(𝜃_MAP), whereas 𝜏_MLE = 𝑓(𝜃_MLE)
• The mode can be an atypical point
60. Bayesian Model Averaging
A fully Bayesian method would avoid model selection.
When making predictions, we should theoretically use the sum rule to marginalize out the unknown model 𝑚:
𝑃(𝑥 ∣ 𝒟) = Σ_𝑚 𝑃(𝑥 ∣ 𝑚, 𝒟) 𝑃(𝑚 ∣ 𝒟)
But model selection is still widely used in practice.
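A minimal numerical sketch of this sum rule; the model posteriors and per-model predictions below are made-up illustrative values:

```python
import numpy as np

# Sum rule: P(y | x, D) = Σ_m P(y | x, m, D) · P(m | D).
posterior = np.array([0.6, 0.3, 0.1])     # P(m | D) for three candidate models
preds     = np.array([0.80, 0.55, 0.20])  # P(y = 1 | x, m, D) for each model

print(preds @ posterior)            # model averaging: 0.665
print(preds[np.argmax(posterior)])  # model selection keeps only 0.80
```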
61. Pros and Cons of Bayesian Inference
Pros:
• Directly answers the questions we ask
• Avoids overfitting
• Model selection (Occam’s razor)
Cons:
• Must assume a prior
• Intractable integrals
• Often limited to specific families of approximating distributions