Inductive Reasoning
and (one of) the
Foundations of
Machine Learning
“beware of mathematicians, and all those
who make empty prophecies”
— St. Augustine
All men are mortal
Socrates is a man
Deductive reasoning:
Socrates is mortal

Idea: Thinking is deductive reasoning!
Articles
Trenchard More, John McCarthy,
Marvin Minsky, Oliver Selfridge,
Ray Solomonoff
50 years later
“To understand the real world, we must have
a different set of primitives from the relatively
simple line trackers suitable and sufficient
for the blocks world”
— Patrick Winston (1975)
Director of MIT’s AI lab from 1972-1997
A bump in the road
The AI winter
http://en.wikipedia.org/wiki/AI_winter
Reductio ad absurdum
“Intelligence is
10 million rules”
— Doug Lenat
The story so far…
• Boy meets girl
• Boy spends 100s of millions of dollars wooing girl with deductive reasoning
• Girl says: “drop dead”; boy becomes very sad

Next: Boy ponders the errors of his ways
“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”
— Karl Popper, Conjectures and Refutations
We’re going to look at
4 learning algorithms.
Sequential prediction
Scenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts.
One of them is always correct.
Goal: Predict as accurately as possible.
Algorithm #1
Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert? BAD!!!
Question: How many errors?
When the algorithm makes a mistake, it removes ≥ half of the experts,
so it makes ≤ log₂ N errors.
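The halving loop above can be sketched in a few lines of Python; the encoding of experts as functions of t is illustrative, not from the lecture.

```python
# Sketch of Algorithm #1: majority vote, then discard every wrong expert.
def halving(experts, rounds):
    """experts: dict name -> predictor f(t) in {0, 1}; one is always correct.
    rounds: list of (t, truth) pairs revealed by Nature."""
    alive = dict(experts)
    mistakes = 0
    for t, truth in rounds:
        votes = sum(f(t) for f in alive.values())
        guess = 1 if 2 * votes >= len(alive) else 0   # Step 1: majority vote
        if guess != truth:
            mistakes += 1
        # Step 2: remove experts that are wrong this round.
        alive = {name: f for name, f in alive.items() if f(t) == truth}
    return mistakes, alive
```

Since every mistake removes at least half of the surviving experts, `mistakes` can never exceed log₂ N.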
Deep thought #1
Track errors, not runtime
What’s going on?
Didn’t we just use deductive reasoning!?!
Yes… but no!
Algorithm: makes educated guesses about Nature (inductive)
Analysis: proves a theorem about the number of errors (deductive)
The algorithm learns, but it does not deduce!
Adversarial prediction
Scenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts.
One of them is always correct.
Nature is adversarial.
Goal: Predict as accurately as possible. Seriously?!?!
Regret
Let m* be the best expert in hindsight.
regret := errors(Forecaster) - errors(m*)
Goal: Minimize regret.
Algorithm #2
Pick β in (0,1). Assign weight 1 to each expert.
Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret? [ choose β carefully ]
regret ≤ √( (T · log N) / 2 )
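Algorithm #2 in the same sketch style; the array encoding of expert predictions is illustrative.

```python
# Sketch of Algorithm #2: weighted majority vote with multiplicative penalties.
def weighted_majority(predictions, truths, beta=0.5):
    """predictions[i][t]: expert i's 0/1 guess at time t; beta in (0, 1)."""
    n = len(predictions)
    w = [1.0] * n                              # assign weight 1 to each expert
    mistakes = 0
    for t, truth in enumerate(truths):
        weight_for_1 = sum(w[i] for i in range(n) if predictions[i][t] == 1)
        guess = 1 if 2 * weight_for_1 >= sum(w) else 0   # Step 1: weighted vote
        if guess != truth:
            mistakes += 1
        for i in range(n):                     # Step 2: penalise wrong experts
            if predictions[i][t] != truth:
                w[i] *= beta
    return mistakes, w
```

Unlike Algorithm #1, no expert is ever eliminated, so the forecaster can still track the best expert even when none is perfect; tuning β as a function of T and N gives the √(T·log N / 2) regret bound.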
Deep thought #2
Model yourself, not Nature
Online Convex Opt.
Scenario: Convex set K; convex loss L(a,b)
[ convex in both arguments, separately ]
At time t,
  Forecaster picks a_t in K
  Nature responds with b_t in K [ Nature is adversarial ]
  Forecaster’s loss is L(a_t, b_t)
Goal: Minimize regret.
Follow the Leader
Idea: Predict with the a_t that would have worked best on { b_1, …, b_{t-1} }

Pick a_1 at random.
Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} Σ_{i=1}^{t-1} L(a, b_i)
  Step 2. t ← t+1

BAD! Problem: Nature pulls Forecaster back and forth. No memory!
Algorithm #3
Pick a_1 at random.
Set t = 1.
While t ≤ T:
  Step 1. [ regularize ]
    a_t := argmin_{a ∈ K} [ Σ_{i=1}^{t-1} L(a, b_i) + (β/2) · ‖a‖₂² ]
  Step 2. t ← t+1

In gradient-descent form:
  a_t ← a_{t-1} − β · ∂L(a_{t-1}, b_{t-1}) / ∂a

Intuition: β controls memory.
Question: What is the regret? [ choose β carefully ]
regret ≤ diam(K) · Lipschitz(L) · √T
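In gradient-descent form, Algorithm #3 is only a few lines. Here is a sketch on K = [-1, 1] with the quadratic loss L(a, b) = (a − b)²; both choices are illustrative, not from the lecture.

```python
# Sketch of Algorithm #3: online gradient descent with projection onto K.
def ogd(bs, beta=0.1):
    a = 0.0                            # a_1, picked arbitrarily
    total_loss = 0.0
    for b in bs:                       # Nature reveals b_t
        total_loss += (a - b) ** 2     # L(a_t, b_t) = (a_t - b_t)^2
        grad = 2 * (a - b)             # dL/da at (a_t, b_t)
        a -= beta * grad               # gradient step; beta controls memory
        a = max(-1.0, min(1.0, a))     # project back onto K = [-1, 1]
    return total_loss, a

# Usage: if Nature plays b_t = 0.5 every round, a_t drifts toward 0.5 and
# the total loss stays bounded, so regret against the best fixed a is small.
loss, a_final = ogd([0.5] * 50, beta=0.1)
```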
Deep thought #3
Those who cannot
remember [their]
past are condemned
to repeat it
George Santayana
Minimax theorem
inf_{a ∈ K} sup_{b ∈ K} L(a, b) = sup_{b ∈ K} inf_{a ∈ K} L(a, b)

Left side: Forecaster picks a, Nature responds b.
Right side: Nature picks b, Forecaster responds a.

Going first hurts Forecaster, so
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≥ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Minimax theorem, proof idea: it remains to show
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≤ sup_{b ∈ K} inf_{a ∈ K} L(a, b)

Let m* be the best move in hindsight.
regret := loss(Forecaster) - loss(m*)

No-regret algorithm
  → Forecaster can asymptotically match hindsight
  → Order of players doesn’t matter asymptotically
  → Convert the series of moves into an average via online-to-batch:
     ā = (1/T) Σ_{t=1}^{T} a_t
Boosting
Scenario: Algorithm W is better than guessing
on any data distribution: loss ≤ 0.5 - ε
Goal: Combine to perform well.

The Boosting Game
Value of game: V(w, d) = # mistakes w makes on d
sup_{d} inf_{w} V(w, d) ≤ 1/2 − ε
inf_{w} sup_{d} V(w, d) ≤ 1/2 − ε    MINIMAX!
∃ distribution w* on learners that averages correctly on any data!
Meta-Algorithm #4
Play Algorithm #2 against Algorithm W
[ #2 maximizes W’s mistakes ]
inf_{w} sup_{d} V(w, d) ≤ 1/2 − ε

• Freund and Schapire 1995
• Best learning algorithm in the late 1990s and early 2000s
• Authors won the Gödel prize
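A toy instance of the meta-algorithm in the AdaBoost style: multiplicative weights reweight the data distribution, while a weak learner W (here an exhaustive 1D decision stump, a purely illustrative stand-in) answers each distribution.

```python
import math

# Toy sketch of Meta-Algorithm #4, AdaBoost-style.
def weak_learner(xs, ys, dist):
    """Weak learner W: best threshold stump under weighted error `dist`."""
    best = None
    for thresh in xs:
        for sign in (1, -1):
            err = sum(d for d, x, y in zip(dist, xs, ys)
                      if (sign if x >= thresh else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

def boost(xs, ys, rounds=20):
    n = len(xs)
    dist = [1.0 / n] * n                   # start from the uniform distribution
    ensemble = []                          # (alpha, thresh, sign) triples
    for _ in range(rounds):
        err, thresh, sign = weak_learner(xs, ys, dist)
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thresh, sign))
        # Multiplicative-weights step: upweight examples the stump got wrong.
        dist = [d * math.exp(-alpha * y * (sign if x >= thresh else -sign))
                for d, x, y in zip(dist, xs, ys)]
        z = sum(dist)
        dist = [d / z for d in dist]
    def predict(x):
        vote = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if vote >= 0 else -1
    return predict
```

On labels that no single stump can fit (e.g. an interval concept), the weighted vote of stumps drives the training error to zero, exactly the "averages correctly on any data" guarantee above.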
Deep thought #4
Your teachers
are not your
friends
The story so far…
• Boy met girl
• Boy spent 100s of millions of dollars wooing girl with deductive reasoning
• Girl showed no interest; boy became very sad
• Boy learnt to learn from mistakes

Next: Boy invites girl for coffee. Girl accepts!
Online Convex Opt. (deep learning)
Apply Algorithm #3 to nonconvex optimization.
Theorems don’t work (not convex) → tons of engineering on top of #3.
Amazing performance.
New mathematics needs to be invented!!
Online Convex Opt. (deep learning)
In the last 2 years deep learning has:
• Beaten human performance at object recognition (ImageNet).
• Outperformed humans at recognising street signs (Google Street View).
• Achieved superhuman performance on Atari games (DeepMind).
• Delivered real-time translation: English voice to Chinese text and voice.
Thank you!
#1. Halving
#2. Multiplicative Weights
    Exponential Weights Algorithm (EWA)
#3. Online Gradient Descent (OGD)
    Stochastic Gradient Descent (SGD)
    Mirror Descent
    Backpropagation
#4. AdaBoost

Details? Lecture notes on my webpage:
https://dl.dropboxusercontent.com/u/5874168/math482.pdf
Vladimir Vapnik
Alexey Chervonenkis
1938 – 2014
“[A] theory of induction is superfluous.
It has no function in a logic of science.
The best we can say of a hypothesis
is that up to now it has been able to
show its worth, and that it has been
more successful than other
hypotheses although, in principle, it
can never be justified, verified, or
even shown to be probable. This
appraisal of the hypothesis relies
solely upon deductive consequences
(predictions) which may be drawn
from the hypothesis: There is no need
to even mention induction.”
“the learning process
may be regarded as
a search for a form of
behaviour which will
satisfy the teacher (or
some other criterion)”
— Alan Turing