Inductive Reasoning
and (one of) the
Foundations of
Machine Learning
“beware of mathematicians, and all those
who make empty prophecies”
— St. Augustine
All men are mortal
Socrates is a man
Deductive reasoning:
Socrates is mortal

Idea: Thinking is deductive reasoning!
Articles
Trenchard More, John McCarthy,
Marvin Minsky, Oliver Selfridge,
Ray Solomonoff
50 years later
“To understand the real world, we must have
a different set of primitives from the relatively
simple line trackers suitable and sufficient
for the blocks world”
— Patrick Winston (1975)
Director of MIT’s AI lab from 1972-1997
A bump in the road
The AI winter
http://en.wikipedia.org/wiki/AI_winter
Reductio ad absurdum
“Intelligence is
10 million rules”
— Doug Lenat
The story so far…
• Boy meets girl
• Boy spends 100s of millions of dollars wooing girl with deductive reasoning
• Girl says: “drop dead”; boy becomes very sad

Next: Boy ponders the errors of his ways
“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”
— Karl Popper, Conjectures and Refutations
We’re going to look at
4 learning algorithms.
Sequential prediction
Scenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts.
One of them is always correct.
Goal: Predict as accurately as possible.
Algorithm #1
Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert? BAD!!!
Question: How many errors?
When the algorithm makes a mistake, it removes ≥ half of the experts,
so it makes ≤ log₂ N errors.
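The halving loop above can be sketched in a few lines of Python; the encoding of experts as functions of t is illustrative, not from the lecture.

```python
# Sketch of Algorithm #1: majority vote, then discard every wrong expert.
def halving(experts, rounds):
    """experts: dict name -> predictor f(t) in {0, 1}; one is always correct.
    rounds: list of (t, truth) pairs revealed by Nature."""
    alive = dict(experts)
    mistakes = 0
    for t, truth in rounds:
        votes = sum(f(t) for f in alive.values())
        guess = 1 if 2 * votes >= len(alive) else 0   # Step 1: majority vote
        if guess != truth:
            mistakes += 1
        # Step 2: remove experts that are wrong this round.
        alive = {name: f for name, f in alive.items() if f(t) == truth}
    return mistakes, alive
```

Since every mistake removes at least half of the surviving experts, `mistakes` can never exceed log₂ N.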
Deep thought #1
Track errors, not runtime
What’s going on?
Didn’t we just use deductive reasoning!?!
Yes… but no!
Algorithm: makes educated guesses about Nature (inductive)
Analysis: proves a theorem about the number of errors (deductive)
The algorithm learns, but it does not deduce!
Adversarial prediction
Scenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts.
One of them is always correct.
Nature is adversarial.
Goal: Predict as accurately as possible. Seriously?!?!
Regret
Let m* be the best expert in hindsight.
regret := errors(Forecaster) - errors(m*)
Goal: Minimize regret.
Algorithm #2
Pick β in (0,1). Assign weight 1 to each expert.
Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret? [ choose β carefully ]
regret ≤ √( (T · log N) / 2 )
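Algorithm #2 in the same sketch style; the array encoding of expert predictions is illustrative.

```python
# Sketch of Algorithm #2: weighted majority vote with multiplicative penalties.
def weighted_majority(predictions, truths, beta=0.5):
    """predictions[i][t]: expert i's 0/1 guess at time t; beta in (0, 1)."""
    n = len(predictions)
    w = [1.0] * n                              # assign weight 1 to each expert
    mistakes = 0
    for t, truth in enumerate(truths):
        weight_for_1 = sum(w[i] for i in range(n) if predictions[i][t] == 1)
        guess = 1 if 2 * weight_for_1 >= sum(w) else 0   # Step 1: weighted vote
        if guess != truth:
            mistakes += 1
        for i in range(n):                     # Step 2: penalise wrong experts
            if predictions[i][t] != truth:
                w[i] *= beta
    return mistakes, w
```

Unlike Algorithm #1, no expert is ever eliminated, so the forecaster can still track the best expert even when none is perfect; tuning β as a function of T and N gives the √(T·log N / 2) regret bound.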
Deep thought #2
Model yourself, not Nature
Online Convex Opt.
Scenario: Convex set K; convex loss L(a,b)
[ convex in both arguments, separately ]
At time t,
  Forecaster picks a_t in K
  Nature responds with b_t in K [ Nature is adversarial ]
  Forecaster’s loss is L(a_t, b_t)
Goal: Minimize regret.
Follow the Leader
Idea: Predict with the a_t that would have worked best on { b_1, …, b_{t-1} }

Pick a_1 at random.
Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} Σ_{i=1}^{t-1} L(a, b_i)
  Step 2. t ← t+1

BAD! Problem: Nature pulls Forecaster back and forth. No memory!
Algorithm #3
Pick a_1 at random.
Set t = 1.
While t ≤ T:
  Step 1. [ regularize ]
    a_t := argmin_{a ∈ K} [ Σ_{i=1}^{t-1} L(a, b_i) + (β/2) · ‖a‖₂² ]
  Step 2. t ← t+1

In gradient-descent form:
  a_t ← a_{t-1} − β · ∂L(a_{t-1}, b_{t-1}) / ∂a

Intuition: β controls memory.
Question: What is the regret? [ choose β carefully ]
regret ≤ diam(K) · Lipschitz(L) · √T
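In gradient-descent form, Algorithm #3 is only a few lines. Here is a sketch on K = [-1, 1] with the quadratic loss L(a, b) = (a − b)²; both choices are illustrative, not from the lecture.

```python
# Sketch of Algorithm #3: online gradient descent with projection onto K.
def ogd(bs, beta=0.1):
    a = 0.0                            # a_1, picked arbitrarily
    total_loss = 0.0
    for b in bs:                       # Nature reveals b_t
        total_loss += (a - b) ** 2     # L(a_t, b_t) = (a_t - b_t)^2
        grad = 2 * (a - b)             # dL/da at (a_t, b_t)
        a -= beta * grad               # gradient step; beta controls memory
        a = max(-1.0, min(1.0, a))     # project back onto K = [-1, 1]
    return total_loss, a

# Usage: if Nature plays b_t = 0.5 every round, a_t drifts toward 0.5 and
# the total loss stays bounded, so regret against the best fixed a is small.
loss, a_final = ogd([0.5] * 50, beta=0.1)
```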
Deep thought #3
Those who cannot
remember [their]
past are condemned
to repeat it
George Santayana
Minimax theorem
inf_{a ∈ K} sup_{b ∈ K} L(a, b) = sup_{b ∈ K} inf_{a ∈ K} L(a, b)

Left side: Forecaster picks a, Nature responds b.
Right side: Nature picks b, Forecaster responds a.

Going first hurts Forecaster, so
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≥ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Minimax theorem, proof idea: it remains to show
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≤ sup_{b ∈ K} inf_{a ∈ K} L(a, b)

Let m* be the best move in hindsight.
regret := loss(Forecaster) - loss(m*)

No-regret algorithm
  → Forecaster can asymptotically match hindsight
  → Order of players doesn’t matter asymptotically
  → Convert the series of moves into an average via online-to-batch:
     ā = (1/T) Σ_{t=1}^{T} a_t
Boosting
Scenario: Algorithm W is better than guessing
on any data distribution: loss ≤ 0.5 - ε
Goal: Combine to perform well.

The Boosting Game
Value of game: V(w, d) = # mistakes w makes on d
sup_{d} inf_{w} V(w, d) ≤ 1/2 − ε
inf_{w} sup_{d} V(w, d) ≤ 1/2 − ε    MINIMAX!
∃ distribution w* on learners that averages correctly on any data!
Meta-Algorithm #4
Play Algorithm #2 against Algorithm W
[ #2 maximizes W’s mistakes ]
inf_{w} sup_{d} V(w, d) ≤ 1/2 − ε

• Freund and Schapire 1995
• Best learning algorithm in the late 1990s and early 2000s
• Authors won the Gödel prize
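A toy instance of the meta-algorithm in the AdaBoost style: multiplicative weights reweight the data distribution, while a weak learner W (here an exhaustive 1D decision stump, a purely illustrative stand-in) answers each distribution.

```python
import math

# Toy sketch of Meta-Algorithm #4, AdaBoost-style.
def weak_learner(xs, ys, dist):
    """Weak learner W: best threshold stump under weighted error `dist`."""
    best = None
    for thresh in xs:
        for sign in (1, -1):
            err = sum(d for d, x, y in zip(dist, xs, ys)
                      if (sign if x >= thresh else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

def boost(xs, ys, rounds=20):
    n = len(xs)
    dist = [1.0 / n] * n                   # start from the uniform distribution
    ensemble = []                          # (alpha, thresh, sign) triples
    for _ in range(rounds):
        err, thresh, sign = weak_learner(xs, ys, dist)
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thresh, sign))
        # Multiplicative-weights step: upweight examples the stump got wrong.
        dist = [d * math.exp(-alpha * y * (sign if x >= thresh else -sign))
                for d, x, y in zip(dist, xs, ys)]
        z = sum(dist)
        dist = [d / z for d in dist]
    def predict(x):
        vote = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if vote >= 0 else -1
    return predict
```

On labels that no single stump can fit (e.g. an interval concept), the weighted vote of stumps drives the training error to zero, exactly the "averages correctly on any data" guarantee above.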
Deep thought #4
Your teachers
are not your
friends
The story so far…
• Boy met girl
• Boy spent 100s of millions of dollars wooing girl with deductive reasoning
• Girl showed no interest; boy became very sad
• Boy learnt to learn from mistakes

Next: Boy invites girl for coffee. Girl accepts!
Online Convex Opt. (deep learning)
Apply Algorithm #3 to nonconvex optimization.
Theorems don’t work (not convex) → tons of engineering on top of #3.
Amazing performance.
New mathematics needs to be invented!!
Online Convex Opt. (deep learning)
In the last 2 years deep learning has:
• Beaten human performance at object recognition (ImageNet).
• Outperformed humans at recognising street signs (Google Street View).
• Achieved superhuman performance on Atari games (DeepMind).
• Delivered real-time translation: English voice to Chinese text and voice.
Thank you!
#1. Halving
#2. Multiplicative Weights
    Exponential Weights Algorithm (EWA)
#3. Online Gradient Descent (OGD)
    Stochastic Gradient Descent (SGD)
    Mirror Descent
    Backpropagation
#4. AdaBoost

Details? Lecture notes on my webpage:
https://dl.dropboxusercontent.com/u/5874168/math482.pdf
Vladimir Vapnik
Alexey Chervonenkis
1938 – 2014
“[A] theory of induction is superfluous.
It has no function in a logic of science.
The best we can say of a hypothesis
is that up to now it has been able to
show its worth, and that it has been
more successful than other
hypotheses although, in principle, it
can never be justified, verified, or
even shown to be probable. This
appraisal of the hypothesis relies
solely upon deductive consequences
(predictions) which may be drawn
from the hypothesis: There is no need
to even mention induction.”
“the learning process
may be regarded as
a search for a form of
behaviour which will
satisfy the teacher (or
some other criterion)”
— Alan Turing