1
The Perceptron Algorithm
Machine Learning
Fall 2017
Machine Learning
Spring 2018
The slides are mainly from Vivek Srikumar
Outline
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
2
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
3
Recall: Linear Classifiers
• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}
• Linear Threshold Units (LTUs) classify an example x using the
following classification rule
– Output = sgn(wTx + b) = sgn(b + ∑ wi xi)
– wTx + b ≥ 0 → Predict y = 1
– wTx + b < 0 → Predict y = -1
4
b is called the bias term
Recall: Linear Classifiers
• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}
• Linear Threshold Units (LTUs) classify an example x using the
following classification rule
– Output = sgn(wTx + b) = sgn(b + ∑ wi xi)
– wTx + b ≥ 0 → Predict y = 1
– wTx + b < 0 → Predict y = -1
5
b is called the bias term
[Figure: a linear threshold unit — inputs x1 … xn and a constant input 1 (the bias), each multiplied by a weight, feed a summation (∑) followed by a sgn unit]
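As a small concrete illustration, the LTU rule above is a one-line computation; a minimal sketch using NumPy and made-up weights:

```python
import numpy as np

def ltu_predict(w, b, x):
    """Linear threshold unit: predict +1 if w^T x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Made-up 3-dimensional example
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 0.2])
print(ltu_predict(w, b, x))   # 0.5 + 0.4 + 0.1 = 1.0 >= 0, so prints 1
```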
The geometry of a linear classifier
6
sgn(b + w1x1 + w2x2)
In n dimensions,
a linear classifier
represents a hyperplane
that separates the space
into two half-spaces
[Figure: positive (+) and negative (-) points in the (x1, x2) plane, separated by the line b + w1x1 + w2x2 = 0]
We only care about the
sign, not the magnitude
[w1 w2]
The Perceptron
7
The hype
8
The New Yorker, December 6, 1958 P. 44
The New York Times, July 8, 1958; the IBM 704 computer
The Perceptron algorithm
• Rosenblatt 1958
• The goal is to find a separating hyperplane
– For separable data, guaranteed to find one
• An online algorithm
– Processes one example at a time
• Several variants exist (will discuss briefly towards the end)
9
The Perceptron algorithm
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
10
The Perceptron algorithm
11
Remember:
Prediction = sgn(wTx)
There is typically a bias term
also (wTx + b), but the bias
may be treated as a
constant feature and folded
into w
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
The Perceptron algorithm
12
Footnote: For some algorithms it is mathematically easier to represent False as -1,
and at other times, as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.
Remember:
Prediction = sgn(wTx)
There is typically a bias term
also (wTx + b), but the bias
may be treated as a
constant feature and folded
into w
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
The Perceptron algorithm
13
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
The Perceptron algorithm
14
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
r is the learning rate, a small positive
number less than 1
The Perceptron algorithm
15
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
r is the learning rate, a small positive
number less than 1
Update only on error. A mistake-
driven algorithm
The Perceptron algorithm
16
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
r is the learning rate, a small positive
number less than 1
Update only on error. A mistake-
driven algorithm
This is the simplest version. We will
see more robust versions at the end
The Perceptron algorithm
17
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Input: A sequence of training examples (x1, y1), (x2, y2), …
where all xi ∈ ℜn, yi ∈ {-1,1}
• Initialize w0 = 0 ∈ ℜn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ←wt + r (yi xi)
• Return final weight vector
r is the learning rate, a small positive
number less than 1
Update only on error. A mistake-
driven algorithm
This is the simplest version. We will
see more robust versions at the end
Mistake can be written as yiwtTxi ≤ 0
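A minimal sketch of this online loop in NumPy, assuming the stream of (xi, yi) pairs and the dimension n are given as inputs:

```python
import numpy as np

def perceptron_online(examples, n, r=1.0):
    """Online Perceptron: one pass over a stream of (x, y) pairs, y in {-1, +1}."""
    w = np.zeros(n)                   # w0 = 0
    for x, y in examples:
        if y * np.dot(w, x) <= 0:     # mistake, i.e. yi w_t^T xi <= 0
            w = w + r * y * x         # w_{t+1} <- w_t + r yi xi
    return w                          # final weight vector
```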
Intuition behind the update
Suppose we have made a mistake on a positive example
That is, y = +1 and wtTx < 0
Call the new weight vector wt+1 = wt + x (say r = 1)
The new dot product will be wt+1Tx = (wt + x)Tx = wtTx + xTx > wtTx
For a positive example, the Perceptron update will increase the score
assigned to the same input
Similar reasoning for negative examples
18
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
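A tiny worked example with made-up numbers: take wt = (-2, 1), a positive example x = (2, 1) with y = +1, and r = 1. Then

$$w_t^\top x = (-2)(2) + (1)(1) = -3 < 0 \ \text{(a mistake)}, \qquad w_{t+1} = w_t + x = (0, 2), \qquad w_{t+1}^\top x = 2 = w_t^\top x + x^\top x$$

so the score assigned to this x increases by exactly xTx = 5.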
Geometry of the perceptron update
19
wold
Predict
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
20
wold
(x, +1)
Predict
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
21
wold
(x, +1)
For a mistake on a positive
example
Predict
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
22
wold
(x, +1)
w ← w + x
For a mistake on a positive
example
(x, +1)
Predict Update
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
23
wold
(x, +1)
w ← w + x
For a mistake on a positive
example
(x, +1)
Predict Update
x
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
24
wold
(x, +1)
w ← w + x
For a mistake on a positive
example
(x, +1)
Predict Update
x
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
25
wold
(x, +1) (x, +1)
wnew
w ← w + x
For a mistake on a positive
example
(x, +1)
Predict Update After
x
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
Geometry of the perceptron update
26
wold
Predict
Geometry of the perceptron update
27
wold
(x, -1)
Predict
For a mistake on a negative
example
Geometry of the perceptron update
28
wold
(x, -1)
w ← w - x
(x, -1)
Predict Update
For a mistake on a negative
example
- x
Geometry of the perceptron update
29
wold
(x, -1)
w ← w - x
(x, -1)
Predict Update
For a mistake on a negative
example
- x
Geometry of the perceptron update
30
wold
(x, -1)
w ← w - x
(x, -1)
Predict Update
For a mistake on a negative
example
- x
Geometry of the perceptron update
31
wold
(x, -1) (x, -1)
wnew
w ← w - x
(x, -1)
Predict Update After
For a mistake on a negative
example
- x
Perceptron Learnability
• Obviously Perceptron cannot learn what it cannot represent
– Only linearly separable functions
• Minsky and Papert (1969) wrote an influential book
demonstrating Perceptron’s representational limitations
– Parity functions can’t be learned (XOR)
• But we already know that XOR is not linearly separable
– Feature transformation trick
32
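A small illustration of the feature transformation trick for XOR; the particular transform and weights below are illustrative choices, not from the slides:

```python
import numpy as np

# XOR with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# No linear rule separates these four points in (x1, x2).
# Add the product feature: (x1, x2) -> (x1, x2, x1*x2).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# One separating rule in the transformed space:
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Phi @ w + b))   # [-1.  1.  1. -1.] -- matches y
```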
What you need to know
• The Perceptron algorithm
• The geometry of the update
• What can it represent
33
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
34
Convergence
Convergence theorem
– If there exists a set of weights consistent with the
data (i.e. the data is linearly separable), the perceptron
algorithm will converge.
35
Convergence
Convergence theorem
– If there exists a set of weights consistent with the
data (i.e. the data is linearly separable), the perceptron
algorithm will converge.
Cycling theorem
– If the training data is not linearly separable, then the
learning algorithm will eventually repeat the same set of
weights and enter an infinite loop
36
Margin
The margin of a hyperplane for a dataset is the distance between
the hyperplane and the data point nearest to it.
37
[Figure: positive (+) and negative (-) points with a separating hyperplane; the margin with respect to this hyperplane is the distance to the nearest point]
Margin
• The margin of a hyperplane for a dataset is the distance
between the hyperplane and the data point nearest to it.
• The margin of a data set (γ) is the maximum margin possible
for that dataset using any weight vector.
38
[Figure: the same points with the maximum-margin separating hyperplane; this is the margin of the data]
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that for
some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2 mistakes
on the training sequence.
39
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2 mistakes
on the training sequence.
40
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
41
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
42
We can always find such an R. Just look for
the farthest data point from the origin.
Mistake Bound Theorem [Novikoff 1962, Block 1962]
43
The data and u have a margin γ.
Importantly, the data is separable.
γ is the complexity parameter that defines
the separability of data.
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
Mistake Bound Theorem [Novikoff 1962, Block 1962]
44
If u hadn’t been a unit vector, then
we could scale γ in the mistake
bound. This will change the final
mistake bound to (||u||R/γ)2
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
45
Suppose we have a binary classification dataset with n dimensional inputs.
If the data is separable,…
…then the Perceptron algorithm will find a separating
hyperplane after making a finite number of mistakes
Proof (preliminaries)
The setting
• Initial weight vector w is all zeros
• Learning rate = 1
– Effectively scales inputs, but does not change the behavior
• All training examples are contained in a ball of size R
– ||xi|| ≤ R
• The training data is separable by margin γ using a unit
vector u
– yi (uT xi) ≥ γ
46
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
Proof (1/3)
1. Claim: After t mistakes, uTwt ≥ tγ
47
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
Proof (1/3)
1. Claim: After t mistakes, uTwt ≥ tγ
48
Because the data
is separable by a
margin γ
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
Proof (1/3)
1. Claim: After t mistakes, uTwt ≥ tγ
Because w0 = 0 (i.e. uTw0 = 0), straightforward
induction gives us uTwt ≥ tγ
49
Because the data
is separable by a
margin γ
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
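The single step behind this induction, spelled out from the update rule and the margin assumption (a reconstruction of the omitted algebra):

$$u^\top w_{t+1} = u^\top (w_t + y_i x_i) = u^\top w_t + y_i\,(u^\top x_i) \;\ge\; u^\top w_t + \gamma$$

Starting from uTw0 = 0, each of the t mistakes adds at least γ, which gives uTwt ≥ tγ.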
2. Claim: After t mistakes, ||wt||2 ≤ tR2
Proof (2/3)
50
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
Proof (2/3)
51
The weight is updated only
when there is a mistake. That is
when yi wtTxi < 0.
||xi|| ≤ R, by definition of R
2. Claim: After t mistakes, ||wt||2 ≤ tR2
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
Proof (2/3)
2. Claim: After t mistakes, ||wt||2 ≤ tR2
Because w0 = 0 (i.e. ||w0||2 = 0), straightforward induction
gives us ||wt||2 ≤ tR2
52
• Receive an input (xi, yi)
• if sgn(wtTxi) ≠ yi:
Update wt+1 ← wt + yi xi
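The corresponding step behind Claim 2, using the facts that an update happens only when yi wtTxi ≤ 0 and that ||xi|| ≤ R (again a reconstruction of the omitted algebra):

$$\|w_{t+1}\|^2 = \|w_t + y_i x_i\|^2 = \|w_t\|^2 + 2\,y_i\,(w_t^\top x_i) + \|x_i\|^2 \;\le\; \|w_t\|^2 + R^2$$

Starting from ||w0||2 = 0, each of the t mistakes adds at most R2, which gives ||wt||2 ≤ tR2.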
Proof (3/3)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
53
Proof (3/3)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
54
From (2)
Proof (3/3)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
55
From (2)
uT wt = ||u|| ||wt|| cos(<angle between them>)
But ||u|| = 1 and the cosine is at most 1
So uT wt ≤ ||wt||
Proof (3/3)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
56
From (2)
uT wt = ||u|| ||wt|| cos(<angle between them>)
But ||u|| = 1 and the cosine is at most 1
So uT wt ≤ ||wt|| (Cauchy-Schwarz inequality)
Proof (3/3)
57
From (2)
uT wt = ||u|| ||wt|| cos(<angle between them>)
But ||u|| = 1 and the cosine is at most 1
So uT wt ≤ ||wt||
From (1)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
Proof (3/3)
58
Number of mistakes
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
Proof (3/3)
59
Bounds the total number of mistakes!
Number of mistakes
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
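Chaining the two claims gives the bound on the number of mistakes (a short reconstruction of the algebra):

$$t\gamma \;\le\; u^\top w_t \;\le\; \|w_t\| \;\le\; \sqrt{t}\,R \quad\Rightarrow\quad \sqrt{t} \le \frac{R}{\gamma} \quad\Rightarrow\quad t \le \left(\frac{R}{\gamma}\right)^2$$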
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e. ||u|| = 1) such that
for some γ ∈ ℜ and γ > 0 we have yi (uT xi) ≥ γ.
Then, the perceptron algorithm will make at most (R/γ)2
mistakes on the training sequence.
60
Mistake Bound Theorem [Novikoff 1962, Block 1962]
61
Mistake Bound Theorem [Novikoff 1962, Block 1962]
62
Mistake Bound Theorem [Novikoff 1962, Block 1962]
63
Mistake Bound Theorem [Novikoff 1962, Block 1962]
64
Number of mistakes
The Perceptron Mistake bound
• Exercises:
– How many mistakes will the Perceptron algorithm make for
disjunctions with n attributes?
• What are R and γ?
– How many mistakes will the Perceptron algorithm make for k-
disjunctions with n attributes?
65
Number of mistakes
What you need to know
• What is the perceptron mistake bound?
• How to prove it
66
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
67
Practical use of the Perceptron algorithm
1. Using the Perceptron algorithm with a finite dataset
2. Margin Perceptron
3. Voting and Averaging
68
1. The “standard” algorithm
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn
2. For epoch = 1 … T:
1. Shuffle the data
2. For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
3. Return w
Prediction: sgn(wTx)
69
1. The “standard” algorithm
70
T is a hyper-parameter to the algorithm
Another way of writing that
there is an error
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn
2. For epoch = 1 … T:
1. Shuffle the data
2. For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
3. Return w
Prediction: sgn(wTx)
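A minimal sketch of this “standard” version in NumPy; D is a list of (x, y) pairs, and T and r are the hyper-parameters named above:

```python
import random
import numpy as np

def perceptron_train(D, n, T=10, r=1.0, seed=0):
    """'Standard' Perceptron: T epochs over a finite dataset, shuffling each epoch."""
    rng = random.Random(seed)
    w = np.zeros(n)
    D = list(D)
    for _ in range(T):
        rng.shuffle(D)                    # 1. shuffle the data
        for x, y in D:                    # 2. one pass over the training set
            if y * np.dot(w, x) <= 0:     # error: yi w^T xi <= 0
                w = w + r * y * x
    return w

def predict(w, x):
    return 1 if np.dot(w, x) >= 0 else -1   # sgn(w^T x), ties broken toward +1
```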
2. Margin Perceptron
71
Which hyperplane is better?
h1
h2
2. Margin Perceptron
72
Which hyperplane is better?
h1
h2
2. Margin Perceptron
73
Which hyperplane is better?
h1
h2
The farther the hyperplane is from the data points, the lower the chance of a wrong prediction
2. Margin Perceptron
• Perceptron makes updates only when the prediction is
incorrect
yi wTxi < 0
• What if the prediction is close to being incorrect?
That is, pick a positive η and update when yi wTxi / ||w|| < η
• Can generalize better, but need to choose η
– Why is this a good idea?
74
2. Margin Perceptron
• Perceptron makes updates only when the prediction is
incorrect
yi wTxi < 0
• What if the prediction is close to being incorrect?
That is, pick a positive η and update when yi wTxi / ||w|| < η
• Can generalize better, but need to choose η
– Why is this a good idea? Intentionally set a large margin
75
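A sketch of the margin-Perceptron update condition described above; η, the learning rate, and the guard against a zero weight vector are implementation choices here, not from the slides:

```python
import numpy as np

def margin_perceptron_step(w, x, y, r=1.0, eta=0.1):
    """Update w whenever the normalized score yi w^T xi / ||w|| falls below eta.

    Outright mistakes are covered too, since they have yi w^T xi <= 0 < eta.
    """
    norm = np.linalg.norm(w)
    score = y * np.dot(w, x) / norm if norm > 0 else 0.0   # treat w = 0 as "inside the margin"
    if score < eta:
        w = w + r * y * x
    return w
```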
3. Voting and Averaging
76
What if the data is not linearly separable?
Finding a hyperplane with the minimum number of mistakes is NP-hard
Voted Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0
– update wm+1 ← wm + r yi xi
– m=m+1
– Cm = 1
• Else
– Cm = Cm + 1
3. Return (w1, C1), (w2, C2), …, (wk, Ck)
Prediction: sgn(∑i Ci · sgn(wiTx))
77
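A sketch of the voted Perceptron as described above; each weight vector is stored with its survival count, and prediction is the weighted vote sgn(∑i Ci · sgn(wiTx)):

```python
import numpy as np

def voted_perceptron_train(D, n, T=10, r=1.0):
    """Voted Perceptron: keep every intermediate weight vector and how long it survived."""
    w, c = np.zeros(n), 0
    voters = []                          # list of (weight vector, count)
    for _ in range(T):
        for x, y in D:
            if y * np.dot(w, x) <= 0:    # error: retire the current w, start a new one
                voters.append((w, c))
                w = w + r * y * x
                c = 1
            else:
                c += 1
    voters.append((w, c))                # keep the last survivor as well
    return voters

def voted_predict(voters, x):
    vote = sum(c * np.sign(np.dot(w, x)) for w, c in voters)
    return 1 if vote >= 0 else -1
```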
Averaged Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0
– update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
78
Averaged Perceptron
79
This is the simplest version of
the averaged perceptron
There are some easy
programming tricks to make
sure that a is also updated
only when there is an error
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0
– update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
Averaged Perceptron
80
This is the simplest version of
the averaged perceptron
There are some easy
programming tricks to make
sure that a is also updated
only when there is an error
Extremely popular
If you want to use the
Perceptron algorithm, use
averaging
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0
– update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
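A sketch of the averaged Perceptron, including one common form of the programming trick mentioned on the slide: rather than adding w into a after every example, keep a counter-weighted correction u that is touched only on errors, and recover a scaled version of the average at the end. Prediction only uses the sign of aTx, so the scaling does not matter; the variable names are illustrative.

```python
import numpy as np

def averaged_perceptron_train(D, n, T=10, r=1.0):
    """Averaged Perceptron with a lazily-maintained average (extra work only on errors)."""
    w = np.zeros(n)          # current weights
    u = np.zeros(n)          # counter-weighted corrections, updated only on errors
    c = 1                    # counts examples seen so far
    for _ in range(T):
        for x, y in D:
            if y * np.dot(w, x) <= 0:        # error
                w = w + r * y * x
                u = u + r * y * c * x
            c += 1
    N = c - 1                                # total examples processed
    a = (N + 1) * w - u                      # proportional to the sum of all intermediate w's
    return a

def averaged_predict(a, x):
    return 1 if np.dot(a, x) >= 0 else -1    # sgn(a^T x)
```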
Question: What is the difference?
81
Voted: sgn(∑i Ci · sgn(wiTx))
Averaged: sgn(∑i Ci wiTx)
Example: let k = 3 with C1 = C2 = C3 = 1, and write w1Tx = s1, w2Tx = s2, w3Tx = s3
Voted: predicts +1 whenever any two of s1, s2, s3 are positive
Averaged: predicts +1 whenever s1 + s2 + s3 ≥ 0
Summary: Perceptron
• Online learning algorithm, very widely used, easy to
implement
• Additive updates to weights
• Geometric interpretation
• Mistake bound
• Practical variants abound
• You should be able to implement the Perceptron algorithm
82
More Related Content

Similar to lec-10-perceptron-upload.pdf

linear SVM.ppt
linear SVM.pptlinear SVM.ppt
linear SVM.ppt
MahimMajee
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
nikola_tesla1
 
Grovers Algorithm
Grovers Algorithm Grovers Algorithm
Grovers Algorithm
CaseyHaaland
 
lecture9-support vector machines algorithms_ML-1.ppt
lecture9-support vector machines algorithms_ML-1.pptlecture9-support vector machines algorithms_ML-1.ppt
lecture9-support vector machines algorithms_ML-1.ppt
NaglaaAbdelhady
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer visionzukun
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance Constraints
Leo Asselborn
 
Perceptron noural network algorithme .pptx
Perceptron noural network algorithme .pptxPerceptron noural network algorithme .pptx
Perceptron noural network algorithme .pptx
YounesAitihya
 
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
Marc Lelarge
 
Phase Retrieval: Motivation and Techniques
Phase Retrieval: Motivation and TechniquesPhase Retrieval: Motivation and Techniques
Phase Retrieval: Motivation and Techniques
Vaibhav Dixit
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Yandex
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdf
surefooted
 
Optimization tutorial
Optimization tutorialOptimization tutorial
Optimization tutorial
Northwestern University
 
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsReinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Sean Meyn
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
Rediet Moges
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
Elvis DOHMATOB
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
Arthur Charpentier
 
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Leo Asselborn
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Beniamino Murgante
 

Similar to lec-10-perceptron-upload.pdf (20)

linear SVM.ppt
linear SVM.pptlinear SVM.ppt
linear SVM.ppt
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Grovers Algorithm
Grovers Algorithm Grovers Algorithm
Grovers Algorithm
 
lecture9-support vector machines algorithms_ML-1.ppt
lecture9-support vector machines algorithms_ML-1.pptlecture9-support vector machines algorithms_ML-1.ppt
lecture9-support vector machines algorithms_ML-1.ppt
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
 
5.n nmodels i
5.n nmodels i5.n nmodels i
5.n nmodels i
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance Constraints
 
Perceptron noural network algorithme .pptx
Perceptron noural network algorithme .pptxPerceptron noural network algorithme .pptx
Perceptron noural network algorithme .pptx
 
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
Tutorial APS 2023: Phase transition for statistical estimation: algorithms an...
 
Phase Retrieval: Motivation and Techniques
Phase Retrieval: Motivation and TechniquesPhase Retrieval: Motivation and Techniques
Phase Retrieval: Motivation and Techniques
 
talk MCMC & SMC 2004
talk MCMC & SMC 2004talk MCMC & SMC 2004
talk MCMC & SMC 2004
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
 
ppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdfppt - Deep Learning From Scratch.pdf
ppt - Deep Learning From Scratch.pdf
 
Optimization tutorial
Optimization tutorialOptimization tutorial
Optimization tutorial
 
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsReinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
 

More from Antonio Espinosa

1.- Introducción a la estructura de datos.pptx
1.- Introducción a la estructura de datos.pptx1.- Introducción a la estructura de datos.pptx
1.- Introducción a la estructura de datos.pptx
Antonio Espinosa
 
Modelos de Seguridad DB.pdf
Modelos de Seguridad DB.pdfModelos de Seguridad DB.pdf
Modelos de Seguridad DB.pdf
Antonio Espinosa
 
TensorFlow Tutorial.pdf
TensorFlow Tutorial.pdfTensorFlow Tutorial.pdf
TensorFlow Tutorial.pdf
Antonio Espinosa
 
Una reflexión sobre un tema evangélico 1
Una reflexión sobre un tema evangélico 1Una reflexión sobre un tema evangélico 1
Una reflexión sobre un tema evangélico 1
Antonio Espinosa
 
The professional-product-owner-leveraging-scrum-as-a-competitive-advantage
The professional-product-owner-leveraging-scrum-as-a-competitive-advantageThe professional-product-owner-leveraging-scrum-as-a-competitive-advantage
The professional-product-owner-leveraging-scrum-as-a-competitive-advantage
Antonio Espinosa
 
Soft recono matriculas
Soft recono matriculasSoft recono matriculas
Soft recono matriculas
Antonio Espinosa
 
Reconocimiento automático de matriculas
Reconocimiento automático de matriculasReconocimiento automático de matriculas
Reconocimiento automático de matriculas
Antonio Espinosa
 
Que es daily scrum
Que es daily scrumQue es daily scrum
Que es daily scrum
Antonio Espinosa
 
Prontuario del viajero
Prontuario del viajeroProntuario del viajero
Prontuario del viajero
Antonio Espinosa
 
Orientaciones titulacion planes_2018
Orientaciones titulacion planes_2018Orientaciones titulacion planes_2018
Orientaciones titulacion planes_2018
Antonio Espinosa
 
Orientaciones academicas titulacion 2012
Orientaciones academicas titulacion 2012Orientaciones academicas titulacion 2012
Orientaciones academicas titulacion 2012
Antonio Espinosa
 
Orientaciones para la elaboración de titulacion
Orientaciones para la elaboración de titulacionOrientaciones para la elaboración de titulacion
Orientaciones para la elaboración de titulacion
Antonio Espinosa
 

More from Antonio Espinosa (12)

1.- Introducción a la estructura de datos.pptx
1.- Introducción a la estructura de datos.pptx1.- Introducción a la estructura de datos.pptx
1.- Introducción a la estructura de datos.pptx
 
Modelos de Seguridad DB.pdf
Modelos de Seguridad DB.pdfModelos de Seguridad DB.pdf
Modelos de Seguridad DB.pdf
 
TensorFlow Tutorial.pdf
TensorFlow Tutorial.pdfTensorFlow Tutorial.pdf
TensorFlow Tutorial.pdf
 
Una reflexión sobre un tema evangélico 1
Una reflexión sobre un tema evangélico 1Una reflexión sobre un tema evangélico 1
Una reflexión sobre un tema evangélico 1
 
The professional-product-owner-leveraging-scrum-as-a-competitive-advantage
The professional-product-owner-leveraging-scrum-as-a-competitive-advantageThe professional-product-owner-leveraging-scrum-as-a-competitive-advantage
The professional-product-owner-leveraging-scrum-as-a-competitive-advantage
 
Soft recono matriculas
Soft recono matriculasSoft recono matriculas
Soft recono matriculas
 
Reconocimiento automático de matriculas
Reconocimiento automático de matriculasReconocimiento automático de matriculas
Reconocimiento automático de matriculas
 
Que es daily scrum
Que es daily scrumQue es daily scrum
Que es daily scrum
 
Prontuario del viajero
Prontuario del viajeroProntuario del viajero
Prontuario del viajero
 
Orientaciones titulacion planes_2018
Orientaciones titulacion planes_2018Orientaciones titulacion planes_2018
Orientaciones titulacion planes_2018
 
Orientaciones academicas titulacion 2012
Orientaciones academicas titulacion 2012Orientaciones academicas titulacion 2012
Orientaciones academicas titulacion 2012
 
Orientaciones para la elaboración de titulacion
Orientaciones para la elaboración de titulacionOrientaciones para la elaboración de titulacion
Orientaciones para la elaboración de titulacion
 

Recently uploaded

急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
eutxy
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
nhiyenphan2005
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
harveenkaur52
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
GTProductions1
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
keoku
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
CIOWomenMagazine
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Florence Consulting
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 

Recently uploaded (20)

急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 

lec-10-perceptron-upload.pdf

  • 1. 1 The Perceptron Algorithm Machine Learning Fall 2017 Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar
  • 2. Outline • The Perceptron Algorithm • Perceptron Mistake Bound • Variants of Perceptron 2
  • 3. Where are we? • The Perceptron Algorithm • Perceptron Mistake Bound • Variants of Perceptron 3
  • 4. Recall: Linear Classifiers • Input is a n dimensional vector x • Output is a label y ∈ {-1, 1} • Linear Threshold Units (LTUs) classify an example x using the following classification rule – Output = sgn(wTx + b) = sgn(b +åwi xi) – wTx + b ≥ 0 → Predict y = 1 – wTx + b < 0 → Predict y = -1 4 b is called the bias term
  • 5. Recall: Linear Classifiers • Input is a n dimensional vector x • Output is a label y ∈ {-1, 1} • Linear Threshold Units (LTUs) classify an example x using the following classification rule – Output = sgn(wTx + b) = sgn(b +åwi xi) – wTx + b ¸ 0 ) Predict y = 1 – wTx + b < 0 ) Predict y = -1 5 b is called the bias term ∑ sgn &' &( &) &* &+ &, &- &. /' /( /) /* /+ /, /- /. 1 1
  • 6. The geometry of a linear classifier 6 sgn(b +w1 x1 + w2x2) In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces x1 x2 + + + + + ++ + - - - - - - - - - - - - - - - - - - b +w1 x1 + w2x2=0 We only care about the sign, not the magnitude [w1 w2]
  • 8. The hype 8 The New Yorker, December 6, 1958 P. 44 The New York Times, July 8 1958 The IBM 704 computer
  • 9. The Perceptron algorithm • Rosenblatt 1958 • The goal is to find a separating hyperplane – For separable data, guaranteed to find one • An online algorithm – Processes one example at a time • Several variants exist (will discuss briefly at towards the end) 9
  • 10. The Perceptron algorithm Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector 10
  • 11. The Perceptron algorithm 11 Remember: Prediction = sgn(wTx) There is typically a bias term also (wTx + b), but the bias may be treated as a constant feature and folded into w Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector
  • 12. The Perceptron algorithm 12 Footnote: For some algorithms it is mathematically easier to represent False as -1, and at other times, as 0. For the Perceptron algorithm, treat -1 as false and +1 as true. Remember: Prediction = sgn(wTx) There is typically a bias term also (wTx + b), but the bias may be treated as a constant feature and folded into w Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector
  • 13. The Perceptron algorithm 13 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector
  • 14. The Perceptron algorithm 14 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector r is the learning rate, a small positive number less than 1
  • 15. The Perceptron algorithm 15 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector r is the learning rate, a small positive number less than 1 Update only on error. A mistake- driven algorithm
  • 16. The Perceptron algorithm 16 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector r is the learning rate, a small positive number less than 1 Update only on error. A mistake- driven algorithm This is the simplest version. We will see more robust versions at the end
  • 17. The Perceptron algorithm 17 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi Input: A sequence of training examples (x1, y1), (x2, y2),! where all xi ∈ ℜn, yi ∈ {-1,1} • Initialize w0 = 0 ∈ ℜn • For each training example (xi, yi): – Predict y’ = sgn(wt Txi) – If yi ≠ y’: • Update wt+1 ←wt + r (yi xi) • Return final weight vector r is the learning rate, a small positive number less than 1 Update only on error. A mistake- driven algorithm This is the simplest version. We will see more robust versions at the end Mistake can be written as yiwt Txi ≤ 0
  • 18. Intuition behind the update Suppose we have made a mistake on a positive example That is, y = +1 and wt Tx <0 Call the new weight vector wt+1 = wt + x (say r = 1) The new dot product will be wt+1 Tx = (wt + x)Tx = wt Tx + xTx > wt Tx For a positive example, the Perceptron update will increase the score assigned to the same input Similar reasoning for negative examples 18 Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 19. Geometry of the perceptron update 19 wold Predict Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 20. Geometry of the perceptron update 20 wold (x, +1) Predict Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 21. Geometry of the perceptron update 21 wold (x, +1) For a mistake on a positive example Predict Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 22. Geometry of the perceptron update 22 wold (x, +1) w ← w + x For a mistake on a positive example (x, +1) Predict Update Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 23. Geometry of the perceptron update 23 wold (x, +1) w ← w + x For a mistake on a positive example (x, +1) Predict Update x Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 24. Geometry of the perceptron update 24 wold (x, +1) w ← w + x For a mistake on a positive example (x, +1) Predict Update x Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 25. Geometry of the perceptron update 25 wold (x, +1) (x, +1) wnew w ← w + x For a mistake on a positive example (x, +1) Predict Update After x Mistake on positive: wt+1 ← wt + r xi Mistake on negative: wt+1 ← wt - r xi
  • 26. Geometry of the perceptron update 26 wold Predict
  • 27. Geometry of the perceptron update 27 wold (x, -1) Predict For a mistake on a negative example
  • 28. Geometry of the perceptron update 28 wold (x, -1) w ← w - x (x, -1) Predict Update For a mistake on a negative example - x
  • 29. Geometry of the perceptron update 29 wold (x, -1) w ← w - x (x, -1) Predict Update For a mistake on a negative example - x
  • 30. Geometry of the perceptron update 30 wold (x, -1) w ← w - x (x, -1) Predict Update For a mistake on a negative example - x
  • 31. Geometry of the perceptron update 31 wold (x, -1) (x, -1) wnew w ← w - x (x, -1) Predict Update After For a mistake on a negative example - x
  • 32. Perceptron Learnability • Obviously Perceptron cannot learn what it cannot represent – Only linearly separable functions • Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations – Parity functions can’t be learned (XOR) • But we already know that XOR is not linearly separable – Feature transformation trick 32
  • 33. What you need to know • The Perceptron algorithm • The geometry of the update • What can it represent 33
  • 34. Where are we? • The Perceptron Algorithm • Perceptron Mistake Bound • Variants of Perceptron 34
  • 35. Convergence Convergence theorem – If there exist a set of weights that are consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge. 35
  • 36. Convergence Convergence theorem – If there exist a set of weights that are consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge. Cycling theorem – If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop 36
  • 37. Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. 37 + + + + + ++ + - - - - - - - - - - - - - - - - - - Margin with respect to this hyperplane
  • 38. Margin • The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. • The margin of a data set (!) is the maximum margin possible for that dataset using any weight vector. 38 + + + + + ++ + - - - - - - - - - - - - - - - - - - Margin of the data
  • 39. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u 2 <n (i.e ||u|| = 1) such that for some ° 2 < and ° > 0 we have yi (uT xi) ¸ °. Then, the perceptron algorithm will make at most (R/°)2 mistakes on the training sequence. 39
  • 40. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $. Then, the perceptron algorithm will make at most (R/°)2 mistakes on the training sequence. 40
  • 41. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $. Then, the perceptron algorithm will make at most (R/ $)2 mistakes on the training sequence. 41
  • 42. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $. Then, the perceptron algorithm will make at most (R/ $)2 mistakes on the training sequence. 42 We can always find such an R. Just look for the farthest data point from the origin.
  • 43. Mistake Bound Theorem [Novikoff 1962, Block 1962] 43 The data and u have a margin !. Importantly, the data is separable. ! is the complexity parameter that defines the separability of data. Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some ! ∈ ℜ and ! > 0 we have yi (uT xi) ≥ !. Then, the perceptron algorithm will make at most (R/ !)2 mistakes on the training sequence.
  • 44. Mistake Bound Theorem [Novikoff 1962, Block 1962] 44 If u hadn’t been a unit vector, then we could scale ! in the mistake bound. This will change the final mistake bound to (||u||R/ !)2 Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some ! ∈ ℜ and ! > 0 we have yi (uT xi) ≥ !. Then, the perceptron algorithm will make at most (R/ !)2 mistakes on the training sequence.
  • 45. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $. Then, the perceptron algorithm will make at most (R/ $)2 mistakes on the training sequence. 45 Suppose we have a binary classification dataset with n dimensional inputs. If the data is separable,… …then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes
  • 46. Proof (preliminaries) The setting • Initial weight vector w is all zeros • Learning rate = 1 – Effectively scales inputs, but does not change the behavior • All training examples are contained in a ball of size R – ||xi|| ≤ R • The training data is separable by margin " using a unit vector u – yi (uT xi) ≥ " 46 • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 47. Proof (1/3) 1. Claim: After t mistakes, uTwt ≥ t " 47 • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 48. Proof (1/3) 1. Claim: After t mistakes, uTwt ≥ t " 48 Because the data is separable by a margin " • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 49. Proof (1/3) 1. Claim: After t mistakes, uTwt ≥ t " Because w0 = 0 (i.e uTw0 = 0), straightforward induction gives us uTwt ≥ t " 49 Because the data is separable by a margin " • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 50. 2. Claim: After t mistakes, ||wt||2 ≤ tR2 Proof (2/3) 50 • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 51. Proof (2/3) 51 The weight is updated only when there is a mistake. That is when yi wt T xi < 0. ||xi|| ≤ R, by definition of R 2. Claim: After t mistakes, ||wt||2 ≤ tR2 • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 52. Proof (2/3) 2. Claim: After t mistakes, ||wt||2 ≤ tR2 Because w0 = 0 (i.e uTw0 = 0), straightforward induction gives us ||wt||2 ≤ tR2 52 • Receive an input (xi, yi) • if sgn(wt Txi) ≠ yi: Update wt+1 ← wt + yi xi
  • 53. Proof (3/3) What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2 53
  • 54. Proof (3/3) What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2 54 From (2)
  • 55. Proof (3/3) What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2 55 From (2) uTwt = ||u|| ||wt|| cos(<angle between them>) But ||u|| = 1 and the cosine is at most 1 So uTwt ≤ ||wt||
  • 56. Proof (3/3) What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2 56 From (2) uTwt = ||u|| ||wt|| cos(<angle between them>) But ||u|| = 1 and the cosine is at most 1 So uTwt ≤ ||wt|| (Cauchy-Schwarz inequality)
  • 57. Proof (3/3) 57 From (2) uTwt = ||u|| ||wt|| cos(<angle between them>) But ||u|| = 1 and the cosine is at most 1 So uTwt ≤ ||wt|| From (1) What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2
  • 58. Proof (3/3) 58 Number of mistakes What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2
  • 59. Proof (3/3) 59 Bounds the total number of mistakes! Number of mistakes What we know: 1. After t mistakes, uTwt ≥ tγ 2. After t mistakes, ||wt||2 ≤ tR2
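  Putting the two claims together gives the chain of inequalities these slides show pictorially; reconstructed here:

```latex
t\gamma \;\le\; u^\top w_t \;\le\; \|w_t\| \;\le\; \sqrt{t}\,R
\quad\Longrightarrow\quad
\sqrt{t}\,\gamma \;\le\; R
\quad\Longrightarrow\quad
t \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
```

  The number of mistakes t is therefore bounded by (R/γ)2, which proves the theorem.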
  • 60. Mistake Bound Theorem [Novikoff 1962, Block 1962] Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R and the label yi ∈ {-1, +1}. Suppose there exists a unit vector u ∈ ℜn (i.e., ||u|| = 1) such that for some γ ∈ ℜ with γ > 0 we have yi (uTxi) ≥ γ. Then, the Perceptron algorithm will make at most (R/γ)2 mistakes on the training sequence. 60
  • 61. Mistake Bound Theorem [Novikoff 1962, Block 1962] 61
  • 62. Mistake Bound Theorem [Novikoff 1962, Block 1962] 62
  • 63. Mistake Bound Theorem [Novikoff 1962, Block 1962] 63
  • 64. Mistake Bound Theorem [Novikoff 1962, Block 1962] 64 Number of mistakes
  • 65. The Perceptron Mistake bound • Exercises: – How many mistakes will the Perceptron algorithm make for disjunctions with n attributes? • What are R and γ? – How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes? 65 Number of mistakes
  • 66. What you need to know • What is the perceptron mistake bound? • How to prove it 66
  • 67. Where are we? • The Perceptron Algorithm • Perceptron Mistake Bound • Variants of Perceptron 67
  • 68. Practical use of the Perceptron algorithm 1. Using the Perceptron algorithm with a finite dataset 2. Margin Perceptron 3. Voting and Averaging 68
  • 69. 1. The “standard” algorithm Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w = 0 ∈ ℜn 2. For epoch = 1 … T: 1. Shuffle the data 2. For each training example (xi, yi) ∈ D: • If yi wTxi ≤ 0, update w ← w + r yi xi 3. Return w Prediction: sgn(wTx) 69
  • 70. 1. The “standard” algorithm 70 T is a hyper-parameter to the algorithm Another way of writing that there is an error Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w = 0 ∈ ℜn 2. For epoch = 1 … T: 1. Shuffle the data 2. For each training example (xi, yi) ∈ D: • If yi wTxi ≤ 0, update w ← w + r yi xi 3. Return w Prediction: sgn(wTx)
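  A minimal NumPy sketch of this standard variant (not from the slides; it assumes a dataset given as arrays X and y, and the function names are illustrative):

```python
import numpy as np

def perceptron_train(X, y, epochs=10, rate=1.0, seed=0):
    """Standard Perceptron: T epochs over a shuffled finite dataset.
    X: (m, n) array of feature vectors, y: (m,) array of labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)                      # no separate bias; fold it into X
                                         # as a constant feature if needed
    for _ in range(epochs):
        for i in rng.permutation(m):     # shuffle the data each epoch
            if y[i] * (w @ X[i]) <= 0:   # error (or exactly on the boundary)
                w += rate * y[i] * X[i]  # additive update
    return w

def perceptron_predict(w, X):
    return np.where(X @ w >= 0, 1, -1)   # sgn(w^T x), breaking ties toward +1
```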
  • 71. 2. Margin Perceptron 71 Which hyperplane is better? h1 h2
  • 72. 2. Margin Perceptron 72 Which hyperplane is better? h1 h2
  • 73. 2. Margin Perceptron 73 Which hyperplane is better? h1 h2 The farther the hyperplane is from the data points, the smaller the chance of making a wrong prediction
  • 74. 2. Margin Perceptron • Perceptron makes updates only when the prediction is incorrect: yi wTxi < 0 • What if the prediction is close to being incorrect? That is, pick a positive η and update when yi wTxi / ||w|| < η • Can generalize better, but need to choose η – Why is this a good idea? 74
  • 75. 2. Margin Perceptron • Perceptron makes updates only when the prediction is incorrect: yi wTxi < 0 • What if the prediction is close to being incorrect? That is, pick a positive η and update when yi wTxi / ||w|| < η • Can generalize better, but need to choose η – Why is this a good idea? Intentionally set a large margin 75
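  A sketch of this variant (assuming the update condition reconstructed above, yi wTxi / ||w|| < η, and illustrative names); the only change from the standard algorithm is the update trigger:

```python
import numpy as np

def margin_perceptron_train(X, y, eta, epochs=10, rate=1.0, seed=0):
    """Margin Perceptron sketch: update whenever the normalized score is
    within eta of being wrong, not only on outright mistakes.
    eta > 0 is the margin hyper-parameter that must be chosen."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            norm = np.linalg.norm(w)
            score = y[i] * (w @ X[i])
            # Update if wrong, or correct but too close to the boundary.
            if norm == 0.0 or score / norm < eta:
                w += rate * y[i] * X[i]
    return w
```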
  • 76. 3. Voting and Averaging 76 What if the data is not linearly separable? Finding a hyperplane with the minimum number of mistakes is NP-hard
  • 77. Voted Perceptron Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w1 = 0 ∈ ℜn, m = 1, C1 = 0 2. For epoch = 1 … T: – For each training example (xi, yi) ∈ D: • If yi wmTxi ≤ 0 – update wm+1 ← wm + r yi xi – m = m + 1 – Cm = 1 • Else – Cm = Cm + 1 3. Return (w1, C1), (w2, C2), …, (wk, Ck) Prediction: sgn(Σi=1..k Ci · sgn(wiTx)) 77
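  A sketch of the voted variant (illustrative names; epochs fixed, learning rate r as above): every intermediate weight vector is kept together with a count of how many examples it survived, and at prediction time each one casts that many ±1 votes.

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=10, rate=1.0):
    """Voted Perceptron sketch: store (w_m, C_m) pairs instead of a single w."""
    n = X.shape[1]
    weights = [np.zeros(n)]   # w_1, then one new vector per mistake
    counts = [0]              # C_m = number of examples w_m survived
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (weights[-1] @ xi) <= 0:
                weights.append(weights[-1] + rate * yi * xi)
                counts.append(1)
            else:
                counts[-1] += 1
    return weights, counts

def voted_predict(weights, counts, x):
    votes = sum(c * np.sign(w @ x) for w, c in zip(weights, counts))
    return 1 if votes >= 0 else -1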
  • 78. Averaged Perceptron Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn 2. For epoch = 1 … T: – For each training example (xi, yi) ∈ D: • If yi wTxi ≤ 0 – update w ← w + r yi xi • a ← a + w 3. Return a Prediction: sgn(aTx) 78
  • 79. Averaged Perceptron 79 This is the simplest version of the averaged perceptron There are some easy programming tricks to make sure that a is also updated only when there is an error Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn 2. For epoch = 1 … T: – For each training example (xi, yi) ∈ D: • If yi wTxi ≤ 0 – update w ← w + r yi xi • a ← a + w 3. Return a Prediction: sgn(aTx)
  • 80. Averaged Perceptron 80 This is the simplest version of the averaged perceptron There are some easy programming tricks to make sure that a is also updated only when there is an error Extremely popular If you want to use the Perceptron algorithm, use averaging Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1} 1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn 2. For epoch = 1 … T: – For each training example (xi, yi) ∈ D: • If yi wTxi ≤ 0 – update w ← w + r yi xi • a ← a + w 3. Return a Prediction: sgn(aTx)
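  A sketch of this simplest averaged version in NumPy (names illustrative); the "programming trick" the slide mentions would defer the bookkeeping for a to mistake steps only, which this naive version does not attempt:

```python
import numpy as np

def averaged_perceptron_train(X, y, epochs=10, rate=1.0):
    """Averaged Perceptron, written exactly as on the slide: a accumulates w
    after every example, so a is (up to a positive scaling that does not
    affect sgn) the average of all intermediate weight vectors."""
    n = X.shape[1]
    w = np.zeros(n)
    a = np.zeros(n)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += rate * yi * xi   # update only on an error
            a += w                    # accumulate after every example
    return a

def averaged_predict(a, X):
    return np.where(X @ a >= 0, 1, -1)   # sgn(a^T x)
```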
  • 81. Question: What is the difference? 81 Voted prediction: sgn(Σi=1..k Ci · sgn(wiTx)) Averaged prediction: sgn(Σi=1..k Ci wiTx) Example: suppose w1Tx = s1, w2Tx = s2, w3Tx = s3 and C1 = C2 = C3 = 1. Voted predicts +1 when any two of s1, s2, s3 are positive; Averaged predicts +1 when s1 + s2 + s3 ≥ 0
  • 82. Summary: Perceptron • Online learning algorithm, very widely used, easy to implement • Additive updates to weights • Geometric interpretation • Mistake bound • Practical variants abound • You should be able to implement the Perceptron algorithm 82