This document presents the Perceptron algorithm for linear classification. It begins with an overview of linear classifiers and the Perceptron, then describes the Perceptron algorithm, which updates its weights only on misclassified examples. The algorithm is guaranteed to converge if the data is linearly separable. The document also covers the Perceptron mistake bound theorem, which upper-bounds the number of mistakes the algorithm can make in terms of the margin of separation between the classes.
3. Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
4. Recall: Linear Classifiers
• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}
• Linear Threshold Units (LTUs) classify an example x using the following classification rule
– Output = sgn(wᵀx + b) = sgn(b + Σi wi xi)
– wᵀx + b ≥ 0 → Predict y = 1
– wᵀx + b < 0 → Predict y = -1
b is called the bias term
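The classification rule above can be sketched in a few lines (a minimal sketch; the weights and inputs are illustrative, not from the slides):

```python
import numpy as np

def ltu_predict(w, b, x):
    """Linear Threshold Unit: predict +1 or -1 via sgn(w^T x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: a 2D classifier with weights [1, -1] and bias 0
w = np.array([1.0, -1.0])
print(ltu_predict(w, 0.0, np.array([2.0, 1.0])))   # w^T x + b = 1  -> 1
print(ltu_predict(w, 0.0, np.array([0.0, 3.0])))   # w^T x + b = -3 -> -1
```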
5. Recall: Linear Classifiers
(Same content as the previous slide, shown with a diagram.)
[Figure: an LTU drawn as a network. Inputs x1 … xn are multiplied by weights w1 … wn, summed together with the bias term, and passed through sgn to produce the output.]
6. The geometry of a linear classifier
sgn(b + w1x1 + w2x2)
In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.
[Figure: a 2D plot with axes x1 and x2; positive (+) points lie on one side and negative (-) points on the other side of the line b + w1x1 + w2x2 = 0, and the weight vector [w1 w2] is normal to that line.]
We only care about the sign, not the magnitude.
8. The hype
The New Yorker, December 6, 1958, p. 44
The New York Times, July 8, 1958
The IBM 704 computer
9. The Perceptron algorithm
• Rosenblatt 1958
• The goal is to find a separating hyperplane
– For separable data, guaranteed to find one
• An online algorithm
– Processes one example at a time
• Several variants exist (discussed briefly towards the end)
10. The Perceptron algorithm
Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℝⁿ, yi ∈ {-1, 1}
• Initialize w0 = 0 ∈ ℝⁿ
• For each training example (xi, yi):
– Predict y' = sgn(wtᵀxi)
– If yi ≠ y':
• Update wt+1 ← wt + r (yi xi)
• Return final weight vector
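The loop above can be implemented directly (a minimal sketch; the toy dataset, the learning rate, and the fixed number of passes are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron(examples, r=1.0, epochs=10):
    """Mistake-driven Perceptron: update w only when sgn(w^T x) != y."""
    n = len(examples[0][0])
    w = np.zeros(n)                      # w0 = 0
    for _ in range(epochs):              # several passes over the sequence
        for x, y in examples:
            y_pred = 1 if np.dot(w, x) >= 0 else -1
            if y_pred != y:              # mistake: move w toward y * x
                w = w + r * y * np.array(x)
    return w

# A separable toy set (bias folded in as a constant feature of 1)
data = [([1, 2, 1], 1), ([2, 1, 1], 1), ([-1, -2, 1], -1), ([-2, -1, 1], -1)]
w = perceptron(data)
assert all((1 if np.dot(w, x) >= 0 else -1) == y for x, y in data)
```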
11.-17. The Perceptron algorithm (annotated)
Slides 11-17 repeat the algorithm of slide 10, adding one note at a time:
• Remember: Prediction = sgn(wᵀx). There is typically a bias term also (wᵀx + b), but the bias may be treated as a constant feature and folded into w.
• Footnote: For some algorithms it is mathematically easier to represent False as -1, and at other times, as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.
• Mistake on positive: wt+1 ← wt + r xi. Mistake on negative: wt+1 ← wt - r xi.
• r is the learning rate, a small positive number less than 1.
• Update only on error: a mistake-driven algorithm.
• This is the simplest version. We will see more robust versions at the end.
• A mistake can be written as yi wtᵀxi ≤ 0.
18. Intuition behind the update
Suppose we have made a mistake on a positive example: that is, y = +1 and wtᵀx < 0.
Call the new weight vector wt+1 = wt + x (say r = 1).
The new dot product will be wt+1ᵀx = (wt + x)ᵀx = wtᵀx + xᵀx > wtᵀx.
For a positive example, the Perceptron update will increase the score assigned to the same input.
Similar reasoning holds for negative examples.
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
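This score increase is easy to check numerically (a minimal sketch with an arbitrary misclassified example, not one from the slides):

```python
import numpy as np

# A positive example that the current weights misclassify
w = np.array([1.0, -2.0])
x = np.array([1.0, 1.0])          # y = +1, but w^T x = -1 < 0: a mistake

w_new = w + x                     # Perceptron update with r = 1
# The score on the same input strictly increases by x^T x = ||x||^2
print(np.dot(w, x), np.dot(w_new, x))   # -1.0 then 1.0
```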
19.-25. Geometry of the perceptron update
For a mistake on a positive example (x, +1):
• Predict: the current weight vector wold misclassifies (x, +1).
• Update: w ← w + x.
• After: the new weight vector wnew has rotated toward x.
[Figure, built up across slides 19-25: wold, the point (x, +1), the added vector x, and the resulting wnew.]
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
27.-31. Geometry of the perceptron update
For a mistake on a negative example (x, -1):
• Predict: the current weight vector wold misclassifies (x, -1).
• Update: w ← w - x.
• After: the new weight vector wnew has rotated away from x.
[Figure, built up across slides 27-31: wold, the point (x, -1), the subtracted vector -x, and the resulting wnew.]
32. Perceptron Learnability
• Obviously, the Perceptron cannot learn what it cannot represent
– Only linearly separable functions
• Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations
– Parity functions (e.g. XOR) can't be learned
• But we already know that XOR is not linearly separable
– The feature transformation trick can make it separable
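The feature transformation trick can be illustrated on XOR itself (a minimal sketch; the product feature x1*x2 and the ±1 encoding are one common choice, not the only one):

```python
import numpy as np

# XOR in the {-1, +1} encoding: label +1 iff the two inputs differ
points = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]  # (x1, x2, y)

# No linear classifier separates these four points in 2D, but adding
# the product feature x1*x2 makes the data linearly separable:
w = np.array([0.0, 0.0, -1.0])   # weights on (x1, x2, x1*x2)
for x1, x2, y in points:
    phi = np.array([x1, x2, x1 * x2])
    assert (1 if np.dot(w, phi) >= 0 else -1) == y
print("XOR is separable after the feature transformation")
```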
33. What you need to know
• The Perceptron algorithm
• The geometry of the update
• What it can represent
34. Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
35. Convergence
Convergence theorem
– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.
36. Convergence
Cycling theorem
– If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
37. Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
[Figure: positive (+) and negative (-) points on either side of a hyperplane; the margin with respect to this hyperplane is the distance to the closest point.]
38. Margin
• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
• The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector.
[Figure: the same dataset with the maximum-margin hyperplane; its margin is the margin of the data.]
39. Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u 2 <n (i.e ||u|| = 1) such that for
some ° 2 < and ° > 0 we have yi (uT xi) ¸ °.
Then, the perceptron algorithm will make at most (R/°)2 mistakes
on the training sequence.
39
40. Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $.
Then, the perceptron algorithm will make at most (R/°)2 mistakes
on the training sequence.
40
41. Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $.
Then, the perceptron algorithm will make at most (R/ $)2
mistakes on the training sequence.
41
42. Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $.
Then, the perceptron algorithm will make at most (R/ $)2
mistakes on the training sequence.
42
We can always find such an R. Just look for
the farthest data point from the origin.
43. Mistake Bound Theorem [Novikoff 1962, Block 1962]
43
The data and u have a margin !.
Importantly, the data is separable.
! is the complexity parameter that defines
the separability of data.
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some ! ∈ ℜ and ! > 0 we have yi (uT xi) ≥ !.
Then, the perceptron algorithm will make at most (R/ !)2
mistakes on the training sequence.
44. Mistake Bound Theorem [Novikoff 1962, Block 1962]
44
If u hadn’t been a unit vector, then
we could scale ! in the mistake
bound. This will change the final
mistake bound to (||u||R/ !)2
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some ! ∈ ℜ and ! > 0 we have yi (uT xi) ≥ !.
Then, the perceptron algorithm will make at most (R/ !)2
mistakes on the training sequence.
45. Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2),!, (xm, ym)} be a sequence of training
examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R
and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e ||u|| = 1) such that
for some $ ∈ ℜ and $ > 0 we have yi (uT xi) ≥ $.
Then, the perceptron algorithm will make at most (R/ $)2
mistakes on the training sequence.
45
Suppose we have a binary classification dataset with n-dimensional inputs.
If the data is separable, then the Perceptron algorithm will find a
separating hyperplane after making a finite number of mistakes.
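The theorem can be checked empirically. A minimal sketch (the synthetic dataset, the chosen separator u, the margin 0.1, and all names are illustrative, not part of the original slides): run the Perceptron to convergence on separable data and compare the mistake count against (R/γ)2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic dataset: a known unit vector u separates the data,
# and points closer than 0.1 to the plane u.x = 0 are rejected.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
X, y = [], []
while len(X) < 200:
    x = rng.uniform(-1.0, 1.0, size=2)
    if abs(u @ x) >= 0.1:               # enforce a margin of at least 0.1
        X.append(x)
        y.append(1 if u @ x > 0 else -1)
X, y = np.array(X), np.array(y)

R = np.linalg.norm(X, axis=1).max()     # radius of the data: ||xi|| <= R
gamma = np.abs(X @ u).min()             # achieved margin: yi (u.xi) >= gamma

# Run the Perceptron to convergence, counting every update (mistake).
w = np.zeros(2)
mistakes = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:          # mistake (or on the boundary)
            w = w + yi * xi
            mistakes += 1
            converged = False

print(mistakes, "<=", (R / gamma) ** 2)
```

Because the data is separable by construction, the loop terminates, and the mistake count never exceeds (R/γ)2.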
46. Proof (preliminaries)
The setting
• Initial weight vector w is all zeros
• Learning rate = 1
– Effectively scales inputs, but does not change the behavior
• All training examples are contained in a ball of size R
– ||xi|| ≤ R
• The training data is separable by margin γ using a unit vector u
– yi (uT xi) ≥ γ
• Receive an input (xi, yi)
• If sgn(wtTxi) ≠ yi:
update wt+1 ← wt + yi xi
47. Proof (1/3)
1. Claim: After t mistakes, uTwt ≥ tγ
Each update gives uTwt+1 = uT(wt + yi xi) = uTwt + yi (uTxi) ≥ uTwt + γ,
because the data is separable by a margin γ.
Because w0 = 0 (i.e., uTw0 = 0), straightforward induction gives us uTwt ≥ tγ.
50. Proof (2/3)
2. Claim: After t mistakes, ||wt||2 ≤ tR2
Each update gives ||wt+1||2 = ||wt + yi xi||2 = ||wt||2 + 2 yi (wtTxi) + ||xi||2 ≤ ||wt||2 + R2.
The weight is updated only when there is a mistake, that is when yi wtTxi ≤ 0,
and ||xi|| ≤ R by definition of R.
Because w0 = 0 (i.e., ||w0||2 = 0), straightforward induction gives us ||wt||2 ≤ tR2.
53. Proof (3/3)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
From (2), ||wt|| ≤ √t R.
Also, uTwt = ||u|| ||wt|| cos(angle between them). But ||u|| = 1 and cosine is
at most 1, so uTwt ≤ ||wt|| (Cauchy-Schwarz inequality).
Combining with (1): tγ ≤ uTwt ≤ ||wt|| ≤ √t R.
Hence √t ≤ R/γ, that is, t ≤ (R/γ)2. This bounds the total number of mistakes!
62. Mistake Bound Theorem [Novikoff 1962, Block 1962]
[Slides 62–64: figure stepping through the final inequality, with the resulting
bound on the number of mistakes, t ≤ (R/γ)2, highlighted.]
65. The Perceptron Mistake bound
• Exercises:
– How many mistakes will the Perceptron algorithm make for
disjunctions with n attributes?
• What are R and γ?
– How many mistakes will the Perceptron algorithm make for
k-disjunctions with n attributes?
66. What you need to know
• What is the perceptron mistake bound?
• How to prove it
67. Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
68. Practical use of the Perceptron algorithm
1. Using the Perceptron algorithm with a finite dataset
2. Margin Perceptron
3. Voting and Averaging
69. 1. The “standard” algorithm
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn
2. For epoch = 1 … T:
1. Shuffle the data
2. For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
3. Return w
Prediction: sgn(wTx)
70. 1. The “standard” algorithm
T is a hyper-parameter to the algorithm.
The condition yi wTxi ≤ 0 is another way of writing that there is an error.
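The standard algorithm translates almost line for line into code. A minimal sketch (function names, the default epoch count, and the learning rate r are illustrative choices, not from the slides):

```python
import random

def perceptron_train(data, n, T=10, r=1.0):
    """Standard Perceptron; data is a list of (x, y) with x a length-n list, y in {-1, +1}."""
    w = [0.0] * n                                    # 1. Initialize w = 0
    for _ in range(T):                               # 2. For epoch = 1 ... T
        random.shuffle(data)                         #    shuffle the data
        for x, y in data:
            s = sum(wj * xj for wj, xj in zip(w, x))
            if y * s <= 0:                           # error: yi wT xi <= 0
                w = [wj + r * y * xj for wj, xj in zip(w, x)]   # w <- w + r yi xi
    return w                                         # 3. Return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1   # sgn(wT x)
```

For example, on a small separable set such as `[([2.0, 0.0], 1), ([-1.0, 0.0], -1), ([1.0, 1.0], 1), ([-2.0, 1.0], -1)]`, a few epochs suffice for `predict` to label every training point correctly.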
73. 2. Margin Perceptron
Which hyperplane is better, h1 or h2?
The farther the hyperplane is from the data points, the smaller the chance
of making a wrong prediction.
74. 2. Margin Perceptron
• Perceptron makes updates only when the prediction is incorrect:
yi wTxi < 0
• What if the prediction is close to being incorrect? Pick a positive η
and update when yi wTxi / ||w|| < η
• Can generalize better, but need to choose η
– Why is this a good idea?
75. 2. Margin Perceptron
– Why is this a good idea? It deliberately enforces a large margin:
updates keep firing until every training point is at least η away
from the hyperplane.
76. 3. Voting and Averaging
What if the data is not linearly separable?
Finding a hyperplane with the minimum number of mistakes is NP-hard.
77. Voted Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize m = 1, w1 = 0 ∈ ℜn, C1 = 0
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wmTxi ≤ 0
– update wm+1 ← wm + r yi xi
– m ← m + 1
– Cm = 1
• Else
– Cm = Cm + 1
3. Return (w1, C1), (w2, C2), …, (wk, Ck)
Prediction: sgn(Σi=1..k Ci sgn(wiTx))
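The voted prediction rule can be sketched directly (the function name is illustrative): each surviving weight vector wi casts Ci votes for the sign it assigns to x.

```python
def voted_predict(weight_count_pairs, x):
    """Voted Perceptron prediction: sgn( sum_i Ci * sgn(wi.x) )."""
    def sgn(s):
        return 1 if s > 0 else -1
    total = sum(C * sgn(sum(wj * xj for wj, xj in zip(w, x)))
                for w, C in weight_count_pairs)
    return sgn(total)
```

Note the inner sgn: each weight vector contributes only its vote, not the magnitude of its score. This is exactly what distinguishes voting from averaging.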
78. Averaged Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0
– update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
79. Averaged Perceptron
This is the simplest version of the averaged perceptron. There are some
easy programming tricks to ensure that a is updated only when there is
an error.
80. Averaged Perceptron
Extremely popular: if you want to use the Perceptron algorithm, use
averaging.
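The averaged variant needs only two extra lines over the standard algorithm. A minimal sketch (function name, default epoch count, and learning rate r are illustrative):

```python
def averaged_perceptron_train(data, n, T=10, r=1.0):
    """Averaged Perceptron: accumulate a <- a + w after every example, return a."""
    w = [0.0] * n
    a = [0.0] * n
    for _ in range(T):
        for x, y in data:
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:   # error
                w = [wj + r * y * xj for wj, xj in zip(w, x)]   # w <- w + r y x
            a = [aj + wj for aj, wj in zip(a, w)]               # a <- a + w, every step
    return a
```

Prediction is then sgn(aTx). Because a sums every intermediate w, weight vectors that survive many examples without a mistake dominate the average.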
81. Question: What is the difference?
Voted: sgn(Σi=1..k ci · sgn(wiTx))
Averaged: sgn(Σi=1..k ci wiTx)
Suppose w1Tx = s1, w2Tx = s2, w3Tx = s3 and c1 = c2 = c3 = 1.
Voted: the prediction is positive when any two of s1, s2, s3 are positive.
Averaged: the prediction is positive when s1 + s2 + s3 ≥ 0.
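A concrete instance of the voted-versus-averaged comparison (the specific scores are made up for illustration): with three weight vectors, each with count 1, the two rules can disagree on the same input.

```python
def sgn(s):
    return 1 if s > 0 else -1

# Illustrative scores si = wi.x for three weight vectors, each with ci = 1.
scores = [1.0, 1.0, -3.0]

voted = sgn(sum(sgn(s) for s in scores))   # majority vote over signs: two positives win
averaged = sgn(sum(scores))                # sign of the summed scores: -1.0, negative

print("voted:", voted, "averaged:", averaged)
```

Voting discards magnitudes, so two narrow positives outvote one large negative; averaging keeps magnitudes, so the large negative score flips the prediction.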
82. Summary: Perceptron
• Online learning algorithm, very widely used, easy to
implement
• Additive updates to weights
• Geometric interpretation
• Mistake bound
• Practical variants abound
• You should be able to implement the Perceptron algorithm