Artificial intelligence has undergone a paradigm shift since the second AI winter in the 1990s: the core of intelligence, once identified with reasoning, was gradually replaced by learning. Since then, the field has grown remarkably through machine learning, and deep learning has become the dominant approach, underpinning a wide range of algorithms and models. Since around 2016, however, AI researchers have been confronting the inherent limitations and flaws of these conventional approaches and searching for new methods.
In this presentation, we review the history of AI and discuss the problems of deep learning. We then examine Bayesian inference, an alternative to the dominant approach of statistical learning, demonstrate new ideas in AI based on Bayesian learning and reasoning, and finally point out troubles with Bayesianism.
1. Bayesian Learning and Reasoning
Mohammad Reza Samsami
Sharif University of Technology
Fall 2019
2. Outline
• A brief history: from Symbolic to Connectionist AI
• Motivations
• Introduction to Bayesian Inference
• Bayesian vs Frequentist
• Bayesian method
• Point estimation
• Meaning of probability
• Bayesian linear regression
• Bayesian model comparison and averaging
• New approaches
• Troubles with Bayesianism
3. A brief history: from Symbolic to Connectionist AI
Back to 1959: the General Problem Solver, designed to separate its problem-solving technique from the content of particular problems.
4. A brief history: from Symbolic to Connectionist AI
{𝑋 ∨ 𝑌, ¬𝑌, 𝑋 → 𝑍} ⊢ 𝑍
Logic
5. A brief history: from Symbolic to Connectionist AI
Logic
[Images: a tree, a number, a pizza]
6. A brief history: from Symbolic to Connectionist AI
Logic
[Images: a tree, a number, a pizza]
Symbols
7. A brief history: from Symbolic to Connectionist AI
[Images: a tree, a number, a pizza]
Symbols
Symbolic AI
9. A brief history: from Symbolic to Connectionist AI
Can machines think?
“Thinking is manipulation of symbols, and reasoning is computation.”
Thomas Hobbes
10. A brief history: from Symbolic to Connectionist AI
Logic is our problem-solving procedure, and symbols are how we represent the world. Relations are verbs explaining how symbols interact with each other, or adjectives describing symbols.
Show(MohammadReza, Slides)
11. A brief history: from Symbolic to Connectionist AI
The set of all true things about our universe is called a knowledge base, and we can use logic to examine our knowledge base to answer questions and discover new things.
The process of coming up with new propositions and checking whether they follow from the knowledge base is called inference.
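As a minimal sketch (not from the slides), the entailment {𝑋 ∨ 𝑌, ¬𝑌, 𝑋 → 𝑍} ⊢ 𝑍 from the earlier slide can be checked mechanically by enumerating truth assignments; the function names here are illustrative:

```python
from itertools import product

# Knowledge base from the earlier slide: {X ∨ Y, ¬Y, X → Z}.
def kb(x, y, z):
    return (x or y) and (not y) and ((not x) or z)

# Entailment check: Z is entailed iff Z holds in every model of the KB.
def entails(knowledge, query):
    return all(query(x, y, z)
               for x, y, z in product([False, True], repeat=3)
               if knowledge(x, y, z))

print(entails(kb, lambda x, y, z: z))  # True: the KB entails Z
```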
12. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Perception
The computer itself doesn’t know what the symbols mean: they are not necessarily linked to any other representation of the world in a non-symbolic way.
13. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Monotonicity
Reasoning based on classical deductive logic is monotonic: new knowledge cannot undo old conclusions.
If Γ ⊢ 𝑋, then Γ ∪ {𝐴} ⊢ 𝑋
14. A brief history: from Symbolic to Connectionist AI
Problems with Symbolic AI
Uncertainty
15. A brief history: from Symbolic to Connectionist AI
Intelligence as: Reasoning → Learning
16. A brief history: from Symbolic to Connectionist AI
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell
17. A brief history: from Symbolic to Connectionist AI
Statistics
Optimization
26. Introduction
Concept Learning
f(x) = 1 if x is an example of the concept C, and otherwise f(x) = 0
The goal is to learn the indicator function f, which just defines
which elements are in the set C.
Number Game
An arithmetical concept 𝐶 over {1, 2, … , 100}: given examples 𝒟 = {𝑥₁, … , 𝑥ₙ} drawn from 𝐶, does a new 𝑥 belong to 𝐶?
29. Introduction
How can we explain this behavior and emulate it in a machine?
The classic approach to induction is to suppose we have a hypothesis
space of concepts, ℋ, such as: odd numbers, even numbers, all numbers
between 1 and 100, powers of two, all numbers ending in 8.
The subset of ℋ that is consistent with the data 𝒟 is called the version
space. As we see more examples, the version space shrinks and we
become increasingly certain about the concept.
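A minimal sketch of this shrinking version space (the hypothesis names and extensions are illustrative, chosen to match the examples above):

```python
# Candidate concepts from the slide; extensions over {1, ..., 100}.
hypotheses = {
    "odd numbers":   {n for n in range(1, 101) if n % 2 == 1},
    "even numbers":  {n for n in range(1, 101) if n % 2 == 0},
    "1 to 100":      set(range(1, 101)),
    "powers of two": {2, 4, 8, 16, 32, 64},
    "ends in 8":     {n for n in range(1, 101) if n % 10 == 8},
}

def version_space(data):
    # Keep only the hypotheses consistent with every observed example.
    return {name for name, ext in hypotheses.items() if set(data) <= ext}

print(version_space([8]))             # four hypotheses survive
print(version_space([8, 2, 16, 64]))  # shrinks: 'ends in 8' is ruled out
```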
30. Introduction
Why powers of 2 and not even numbers?
The key intuition is that we want to avoid suspicious coincidences: if the true concept were even numbers, why would we see only numbers that happen to be powers of two?
𝑃(𝒟 ∣ 𝐶_even) = (1 / size(𝐶_even))^𝑁 = (1/50)⁴ = 1.6 × 10⁻⁷
𝑃(𝒟 ∣ 𝐶_two) = (1 / size(𝐶_two))^𝑁 = (1/6)⁴ = 7.7 × 10⁻⁴
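These numbers are easy to reproduce; a short sketch under the same strong-sampling assumption (each example drawn uniformly from the concept’s extension):

```python
# Size principle: under strong sampling, each example is drawn uniformly
# from the concept's extension, so P(D | C) = (1 / size(C))^N.
def likelihood(data, extension):
    if not set(data) <= extension:  # hypothesis inconsistent with the data
        return 0.0
    return (1.0 / len(extension)) ** len(data)

data  = [16, 8, 2, 64]                            # N = 4 examples
evens = {n for n in range(1, 101) if n % 2 == 0}  # size 50
pow2  = {2, 4, 8, 16, 32, 64}                     # size 6

print(likelihood(data, evens))  # (1/50)^4 = 1.6e-07
print(likelihood(data, pow2))   # (1/6)^4  ≈ 7.7e-04
```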
39. Interpretations of Probability
Frequency Interpretation:
The dominant statistical practice for many years (known as the classical or frequentist theory) defines probability as the limiting relative frequency of an outcome over infinitely many repetitions of a random experiment.
It is therefore impossible to assign a probability to a statement such as “at least 50% of Iranians enjoy drinking Doogh.” The statement is either true or false, so its frequentist probability is either zero or one (but we might not know which).
40. Interpretations of Probability
Subjective (or Bayesian) Interpretation:
“By degree of probability, we really mean, or ought to mean, degree of belief.”
De Morgan
According to the subjective interpretation, probabilities are degrees of confidence, or credence, or partial beliefs of suitable agents. Thus we really have many interpretations of probability here: as many as there are suitable agents.
The Bayesian interpretation instead allows probabilities to describe degrees of belief in such a proposition. In this way, we can treat everything as a random variable and use the tools of probability to carry out all inference; in subjective probability, parameters, data, and hypotheses are all treated the same.
43. Point Estimation
Goal: choose a good value of 𝜃 given the data 𝒟.
Typically the posterior mean or median is the most appropriate choice for a real-valued quantity, and the vector of posterior marginals is the best choice for a discrete quantity.
However, the posterior mode is the most popular choice because it reduces inference to an optimization problem, for which efficient algorithms often exist.
44. Point Estimation
Maximum a posteriori (MAP) estimation
𝜽_MAP = argmax_𝜽 𝑃(𝜽 ∣ 𝒟) = argmax_𝜽 𝑃(𝒟 ∣ 𝜽) 𝑃(𝜽)
We have avoided computing the normalization constant 𝑃(𝒟), which does not depend on 𝜽.
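As an illustration (not from the slides), MAP estimation for a Bernoulli parameter 𝜃 with a conjugate Beta(a, b) prior has a closed form, so no normalization constant is ever computed:

```python
# Beta(a, b) prior on a Bernoulli parameter θ; the posterior is
# Beta(heads + a, tails + b), whose mode (the MAP estimate, for a, b > 1) is
#   θ_MAP = (heads + a - 1) / (n + a + b - 2).
def map_bernoulli(heads, n, a=2.0, b=2.0):
    return (heads + a - 1) / (n + a + b - 2)

print(map_bernoulli(heads=9, n=10))            # ≈ 0.833: prior shrinks the MLE (0.9)
print(map_bernoulli(heads=9, n=10, a=1, b=1))  # 0.9: a flat prior recovers the MLE
```

The shrinkage toward 0.5 in the first call is exactly the regularization effect listed among the pros on the next slide.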
45. Point Estimation
Maximum a posteriori (MAP) estimation
Pros:
• Easy to compute
• Interpretable
• Avoids overfitting (regularization, shrinkage)
Cons:
• No representation of uncertainty
• Not invariant to reparameterization: for 𝜏 = 𝑓(𝜃), in general 𝜏_MAP ≠ 𝑓(𝜃_MAP), whereas 𝜏_MLE = 𝑓(𝜃_MLE)
• The mode can be an atypical point
60. Bayesian Model Averaging
A fully Bayesian method would avoid model selection.
When making predictions, we should theoretically use the sum rule to marginalize out the unknown model 𝑚:
𝑃(𝑥 ∣ 𝒟) = Σ_𝑚 𝑃(𝑥 ∣ 𝑚, 𝒟) 𝑃(𝑚 ∣ 𝒟)
But model selection is still widely used in practice.
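A minimal numerical sketch of this sum rule; the model posteriors and per-model predictions below are made-up illustrative values:

```python
import numpy as np

# Sum rule: P(y | x, D) = Σ_m P(y | x, m, D) · P(m | D).
posterior = np.array([0.6, 0.3, 0.1])     # P(m | D) for three candidate models
preds     = np.array([0.80, 0.55, 0.20])  # P(y = 1 | x, m, D) for each model

print(preds @ posterior)            # model averaging: 0.665
print(preds[np.argmax(posterior)])  # model selection keeps only 0.80
```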
61. Pros and Cons of Bayesian Inference
Pros:
• Directly answers the questions we ask
• Avoids overfitting
• Model selection (Occam’s razor)
Cons:
• Must assume a prior
• Intractable integrals
• Often limited to specific families of approximating distributions