Introduction to Machine Learning
Combining Models, Bayesian Average and Boosting
Andres Mendez-Vazquez
July 31, 2018
Outline
1 Combining Models
Introduction
Average for Committee
Beyond Simple Averaging
Example
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
Now Model Averaging
The Differences
3 Committees
Introduction
Bootstrap Data Sets
Relation with Monte-Carlo Estimation
4 Boosting
AdaBoost Development
Cost Function
Selection Process
How do we select classifiers?
Selecting New Classifiers
Deriving with respect to the weight αm
AdaBoost Algorithm
Some Remarks
Explanation about AdaBoost’s behavior
Statistical Analysis of the Exponential Loss
Moving from Regression to Classification
Minimization of the Exponential Criterion
Finally, The Additive Logistic Regression
Example using an Infinitude of Perceptrons
Introduction
Observation
It is often found that improved performance can be obtained by combining multiple classifiers together in some way.
Example, Committees
We might train L different classifiers and then make predictions by using the average of the predictions made by each classifier.
Example, Boosting
It involves training multiple models in sequence: the error function used to train a particular model depends on the performance of the previous models.
We could use simple averaging
Given a series of observed samples $\{\hat{x}_1, \hat{x}_2, ..., \hat{x}_N\}$ with noise $\epsilon \sim \mathcal{N}(0, 1)$
We could use our knowledge of the noise, for example additive:
$$\hat{x}_i = x_i + \epsilon$$
We can use our knowledge of probability to remove such noise
$$E[\hat{x}_i] = E[x_i + \epsilon] = E[x_i] + E[\epsilon]$$
Then, because $E[\epsilon] = 0$
$$E[x_i] = E[\hat{x}_i] \approx \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i$$
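A minimal Python sketch of this idea (not from the original slides), assuming repeated noisy measurements of a single underlying value:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0                                  # the clean quantity x we want to recover
N = 1000                                          # number of noisy observations
noise = rng.normal(loc=0.0, scale=1.0, size=N)    # epsilon ~ N(0, 1)
observations = true_value + noise                 # hat{x}_i = x + epsilon_i

# Since E[epsilon] = 0, the sample mean approximates the clean value.
estimate = observations.mean()
print(f"true = {true_value:.3f}, estimate = {estimate:.3f}")
```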
For Example
We have a nice result (figure omitted).
Beyond Simple Averaging
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of
the models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different Models become responsible for making decisions in different
regions of the input space.
Something like this
Models in charge of different sets of inputs (figure omitted).
Example, Decision Trees
We can have a decision tree on top of the models
Given a set of models, one model is chosen to make the decision in a certain region of the input space.
Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value of the input.
Thus it is better to soften the combination by using
M classifiers, each modeling a conditional distribution p (t|x, k), where:
x is the input variable.
t is the target variable.
k = 1, 2, ..., M indexes the classifiers.
This is used in the mixture of distributions
Thus (Mixture of Experts)
$$p(t|x) = \sum_{k=1}^{M} \pi_k(x)\, p(t|x, k) \quad (1)$$
where $\pi_k(x) = p(k|x)$ represents the input-dependent mixing coefficients.
This type of model
can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; it is known as a mixture of experts.
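A small sketch (assumed setup, not from the slides) of how Equation (1) combines expert predictive densities with input-dependent gates; the linear-softmax gate and Gaussian experts here are illustrative choices only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts_density(t, x, gate_params, expert_params):
    """p(t|x) = sum_k pi_k(x) * p(t|x, k) for scalar x and t."""
    # Input-dependent mixing coefficients pi_k(x) = p(k|x) via a linear-softmax gate.
    logits = np.array([w * x + b for (w, b) in gate_params])
    pi = softmax(logits)
    # Each expert k is a Gaussian density over t with its own mean function (assumed form).
    density = 0.0
    for pi_k, (a, c, sigma) in zip(pi, expert_params):
        mean_k = a * x + c
        p_t_given_xk = np.exp(-0.5 * ((t - mean_k) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        density += pi_k * p_t_given_xk
    return density

# Two hypothetical experts and gates (illustrative numbers only).
gates = [(2.0, 0.0), (-2.0, 0.0)]
experts = [(1.0, 0.0, 0.3), (-1.0, 1.0, 0.3)]
print(mixture_of_experts_density(t=0.8, x=0.5, gate_params=gates, expert_params=experts))
```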
It is important to differentiate between them
Although
Model Combination and Bayesian Model Averaging look similar,
they are actually different.
For this
We have the following example.
Example of the Differences
For this consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs.
Thus the model is specified in terms of a joint distribution
$$p(x, z)$$
The corresponding density over the observed variable x is obtained by marginalization
$$p(x) = \sum_{z} p(x, z)$$
Example
In the case of a Mixture of Gaussians
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$$
This is an example of model combination.
What about other models?
More Models
Now, for independent, identically distributed data $X = \{x_1, x_2, ..., x_N\}$
$$p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n)$$
Therefore
Something Notable
Each observed data point xn has a corresponding latent variable zn.
Here, we are doing a Combination of Models
Each Gaussian indexed by zn is in charge of generating one section of
the sample space
Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h)
One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
The Marginal Distribution is
$$p(X) = \sum_{h=1}^{H} p(X, h) = \sum_{h=1}^{H} \underbrace{p(X|h)\, p(h)}_{\propto\, p(h|X)}$$
This is an example of Bayesian model averaging.
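As a rough illustration (assumed toy models and priors, not from the slides), a minimal numpy sketch of weighting candidate models by their posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=50)   # the whole observed data set

# Candidate models h (illustrative: standard Gaussian vs. standard Cauchy) with priors p(h).
def log_lik_gaussian(x):
    return np.sum(-0.5 * x**2 - 0.5 * np.log(2 * np.pi))

def log_lik_cauchy(x):
    return np.sum(-np.log(np.pi * (1.0 + x**2)))

log_lik = {"gaussian": log_lik_gaussian(X), "cauchy": log_lik_cauchy(X)}
log_prior = {"gaussian": np.log(0.5), "cauchy": np.log(0.5)}

# p(X) = sum_h p(X|h) p(h); normalizing gives the posterior p(h|X),
# which concentrates on one model as the data set grows.
log_joint = {h: log_lik[h] + log_prior[h] for h in log_lik}
m = max(log_joint.values())
evidence = sum(np.exp(v - m) for v in log_joint.values())
posterior = {h: float(np.exp(log_joint[h] - m) / evidence) for h in log_joint}
print(posterior)
```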
Bayesian Model Averaging
Remark
The summation over h means that just one model is responsible for
generating the whole data set.
Observation
The probability over h simply reflects our uncertainty as to which is the correct model to use.
Thus, as the size of the data set increases
This uncertainty decreases, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
Example
(Figure omitted: increasing the number of samples.)
The Differences
Bayesian model averaging
The whole data set is generated by a single model h.
Model combination
Different data points within the data set can potentially be generated by different components.
Committees
Idea, the simplest way to construct a committee
It is to average the predictions of a set of individual models.
Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance.
Where the error of the model decomposes into
The bias component that arises from differences between the model and the true function to be predicted.
The variance component that represents the sensitivity of the model to the individual data points.
For example
When we averaged a set of low-bias models
We obtained accurate predictions of the underlying sinusoidal function
from which the data were generated.
However
Big Problem
Normally, we have only a single data set.
Thus
We need to introduce some variability between the different committee members.
One approach
You can use bootstrap data sets.
The Idea of Bootstrap
We denote the training set by Z = {z1, z2, ..., zN}
Where zi = (xi, yi)
The basic idea is to randomly draw datasets with replacement from the training data,
each sample having the same size as the original training set.
This is done B times
Producing B bootstrap datasets.
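A minimal sketch (assuming numpy arrays X, y for the training pairs) of drawing the B bootstrap data sets:

```python
import numpy as np

def bootstrap_datasets(X, y, B, seed=0):
    """Yield B datasets of size N drawn with replacement from Z = {(x_i, y_i)}."""
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # sample N indices with replacement
        yield X[idx], y[idx]

# Tiny demonstration with made-up data.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
for Xb, yb in bootstrap_datasets(X, y, B=3):
    print(yb)
```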
Then
Then a quantity is computed
S (Z) is any quantity computed from the data Z
From the bootstrap sampling
We can estimate any aspect of the distribution of S (Z).
Then
We refit the model to each of the bootstrap datasets
and compute $S(Z^{*b})$ from the model refitted to dataset $Z^{*b}$.
Then
You examine the behavior of the fits over the B replications.
For Example
Its variance
$$Var\left[S(Z)\right] = \frac{1}{B-1}\sum_{b=1}^{B}\left(S\left(Z^{*b}\right) - \bar{S}^{*}\right)^2$$
Where
$$\bar{S}^{*} = \frac{1}{B}\sum_{b=1}^{B} S\left(Z^{*b}\right)$$
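A small sketch of this estimator, taking S(Z) to be the sample mean purely as an illustrative choice:

```python
import numpy as np

def bootstrap_variance(Z, statistic, B=1000, seed=0):
    """Estimate Var[S(Z)] from B bootstrap replications S(Z*b)."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    replications = np.array([
        statistic(Z[rng.integers(0, N, size=N)]) for _ in range(B)
    ])
    S_bar = replications.mean()                       # the average replication, S-bar*
    return ((replications - S_bar) ** 2).sum() / (B - 1)

Z = np.random.default_rng(2).normal(size=100)
print(bootstrap_variance(Z, statistic=np.mean))
```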
Relation with Monte-Carlo Estimation
Note that Var[S(Z)]
can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling
from the empirical distribution function F of the data Z = {z1, z2, ..., zN}.
For Example
Schematic of the bootstrap process (figure omitted): training samples → bootstrap samples → bootstrap replications.
Thus
Use each of them to train a copy $y_b(x)$ of a predictive regression model to predict a single continuous variable
Then,
$$y_{com}(x) = \frac{1}{B}\sum_{b=1}^{B} y_b(x) \quad (2)$$
This is also known as Bootstrap Aggregation or Bagging.
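A hedged sketch of Equation (2): train one regressor per bootstrap sample and average their predictions. The sinusoidal data and the polynomial base model are illustrative assumptions, not the slides' choice:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=X.shape)   # noisy targets

B, degree = 25, 5
models = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap sample Z*b
    models.append(np.polyfit(X[idx], y[idx], degree))  # y_b(x): a polynomial fit (illustrative)

def y_com(x):
    """Bagged prediction: average of the B bootstrap models."""
    return np.mean([np.polyval(coef, x) for coef in models])

print(y_com(0.25))
```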
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate $y_b(x)$
$$y_b(x) = h(x) + \epsilon_b(x) \quad (3)$$
The average sum-of-squares error over the data takes the form
$$E_x\left[\left(y_b(x) - h(x)\right)^2\right] = E_x\left[\epsilon_b^2(x)\right] \quad (4)$$
What is $E_x$?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
$$E_{AV} = \frac{1}{B}\sum_{b=1}^{B} E_x\left[\epsilon_b^2(x)\right] \quad (5)$$
Similarly, the expected error of the committee is
$$E_{COM} = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B} y_b(x) - h(x)\right)^2\right] = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)\right)^2\right] \quad (6)$$
Assume that the errors have zero mean and are uncorrelated
Something reasonable to assume given the way we produce the Bootstrap samples
$$E_x\left[\epsilon_b(x)\right] = 0$$
$$E_x\left[\epsilon_b(x)\,\epsilon_l(x)\right] = 0, \quad \text{for } b \neq l$$
Then
We have that
$$\begin{aligned}
E_{COM} &= \frac{1}{B^2}\, E_x\left[\left(\sum_{b=1}^{B}\epsilon_b(x)\right)^2\right]\\
&= \frac{1}{B^2}\, E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x) + \sum_{h=1}^{B}\sum_{\substack{k=1 \\ k\neq h}}^{B}\epsilon_h(x)\,\epsilon_k(x)\right]\\
&= \frac{1}{B^2}\left(E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x)\right] + \sum_{h=1}^{B}\sum_{\substack{k=1 \\ k\neq h}}^{B} E_x\left[\epsilon_h(x)\,\epsilon_k(x)\right]\right)\\
&= \frac{1}{B^2}\, E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x)\right] = \frac{1}{B}\cdot\frac{1}{B}\sum_{b=1}^{B} E_x\left[\epsilon_b^2(x)\right]
\end{aligned}$$
where the cross terms vanish because the errors are assumed uncorrelated with zero mean.
We finally obtain
We obtain
$$E_{COM} = \frac{1}{B}\, E_{AV} \quad (7)$$
Looks great, BUT!!!
Unfortunately, it depends on the key assumption that the errors of the individual Bootstrap models are uncorrelated.
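A quick numerical check (synthetic errors, assumed purely for illustration): with independent zero-mean errors the committee error is roughly $E_{AV}/B$, while strongly correlated errors erase most of the gain:

```python
import numpy as np

rng = np.random.default_rng(4)
B, n_points = 10, 100_000

# Independent zero-mean errors epsilon_b(x) for each of the B models.
eps_indep = rng.normal(size=(B, n_points))
# Strongly correlated errors: a shared component plus a small individual one.
shared = rng.normal(size=n_points)
eps_corr = 0.9 * shared + 0.1 * rng.normal(size=(B, n_points))

for name, eps in [("independent", eps_indep), ("correlated", eps_corr)]:
    E_AV = np.mean(eps ** 2)                    # average error of the individual models
    E_COM = np.mean(eps.mean(axis=0) ** 2)      # error of the committee (averaged models)
    print(f"{name}: E_AV={E_AV:.3f}, E_COM={E_COM:.3f}, ratio={E_COM / E_AV:.3f}")
```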
Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
$$E_{COM} \leq E_{AV} \quad (8)$$
However, we need something better
A more sophisticated technique known as boosting.
Boosting
What does Boosting do?
It combines several classifiers to produce a form of committee.
We will describe AdaBoost
"Adaptive Boosting", developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1 Samples x1, x2, ..., xN
2 Binary labels t1, t2, ..., tN ∈ {−1, +1}
Cost Function
Now
You want to put together a set of M experts able to recognize the most
difficult inputs in an accurate way!!!
Thus
For each pattern xi each expert classifier outputs a classification
yj (xi) ∈ {−1, 1}
The final decision of the committee of M experts is sign (C (xi))
C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (9)
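A tiny sketch of this committee decision (Equation 9), with hypothetical weak classifiers and weights:

```python
import numpy as np

def committee_decision(x, classifiers, alphas):
    """sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), each y_m(x) in {-1, +1}."""
    C = sum(a * clf(x) for a, clf in zip(alphas, classifiers))
    return int(np.sign(C))

# Three hypothetical weak classifiers (decision stumps on a scalar input).
classifiers = [
    lambda x: 1 if x > 0.2 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: -1 if x > 0.9 else 1,
]
alphas = [0.4, 0.7, 0.2]   # illustrative committee weights
print(committee_decision(0.6, classifiers, alphas))
```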
Now
Adaptive Boosting
It works even with a continuum of classifiers.
However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following
We want to review possible committee members,
select them, if they have certain properties,
and assign a weight to their contribution to the set of experts.
Now
Selection is done the following way
Testing the classifiers in the pool using a training set T of N
multidimensional data points xi:
For each point xi we have a label ti = 1 or ti = −1.
Assigning a cost for actions
We test and rank all classifiers in the expert pool by
Charging a cost exp {β} any time a classifier fails (a miss).
Charging a cost exp {−β} any time a classifier provides the right
label (a hit).
Remarks about β
We require β > 0
Thus misses are penalized more heavily than hits.
Although
It looks strange to penalize hits,
it works as long as the penalty for a success is smaller than the penalty for a miss:
$$\exp\{-\beta\} < \exp\{\beta\}$$
Why?
If we assign cost a to misses and cost b to hits, where a > b > 0,
we can rewrite such costs as $a = c^{d}$ and $b = c^{-d}$ for constants c and d.
It does not compromise generality.
Exponential Loss Function
This kind of error function is different from the squared Euclidean distance:
this classification cost is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
(Figure omitted: plot of the exponential loss.)
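A minimal sketch comparing the exponential loss exp(−z), with z the margin t·y(x), against the 0/1 misclassification loss at a few assumed margin values:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # z = t * y(x)
exp_loss = np.exp(-margins)                        # exponential loss used by AdaBoost
zero_one = (margins <= 0).astype(float)            # 0/1 loss (counting z = 0 as an error here)

for z, e, m01 in zip(margins, exp_loss, zero_one):
    print(f"margin={z:+.1f}  exp={e:6.3f}  0/1={m01:.0f}")
```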
Selection of the Classifier
We need a way to select the best classifier in the pool
When we test the M classifiers in the pool, we build a matrix S.
Then
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.
The Matrix S
Row i in the matrix is reserved for the data point xi
Column m is reserved for the mth classifier in the pool.

            Classifiers
         1    2   · · ·   M
  x1     0    1   · · ·   1
  x2     0    0   · · ·   1
  x3     1    1   · · ·   0
  ...   ...  ...         ...
  xN     0    0   · · ·   0
Something interesting about S
Summing each column over the data points gives the empirical risk of the corresponding classifier
$$ER(y_j) = \frac{1}{N}\sum_{i=1}^{N} S_{ij}, \quad j = 1, ..., M$$
Therefore, the candidate to be used at a certain iteration
is the classifier $y_j$ with the smallest empirical risk!!!
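A short sketch (hypothetical S matrix) of ranking the pool by this column-wise empirical risk:

```python
import numpy as np

# S[i, j] = 1 if classifier j misses data point x_i, 0 if it hits (hypothetical values).
S = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])

empirical_risk = S.mean(axis=0)        # ER(y_j) = (1/N) * sum_i S_ij
best_j = int(np.argmin(empirical_risk))
print(empirical_risk, "-> pick classifier", best_j)
```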
Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
Thus, at the beginning of the iterations
All data samples are assigned the same weight:
Just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
The selection process concentrates on selecting new classifiers
for the committee by focusing on those which can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want
In each iteration, we rank all classifiers, so that we can select the current
best out of the pool.
At the mth iteration
We have already included m − 1 classifiers in the committee and we want
to select the next one.
Thus, we have the following cost function which is actually the
output of the committee
C(m−1) (xi) = α1y1 (xi) + α2y2 (xi) + ... + αm−1ym−1 (xi) (10)
Thus, we have that
Extending the cost function by the new classifier ym
$$C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \quad (11)$$
At the first iteration m = 1
$C^{(0)}$ is the zero function.
Thus, the total cost or total error is defined as the exponential error
$$E = \sum_{i=1}^{N} \exp\left\{-t_i\left(C^{(m-1)}(x_i) + \alpha_m y_m(x_i)\right)\right\} \quad (12)$$
Thus
We want to determine
αm and ym in an optimal way.
Thus, rewriting
$$E = \sum_{i=1}^{N} w_i^{(m)} \exp\left\{-t_i \alpha_m y_m(x_i)\right\} \quad (13)$$
Where, for i = 1, 2, ..., N
$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\} \quad (14)$$
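A sketch of computing the weights of Equation (14) from the current committee output (the committee values below are illustrative assumptions):

```python
import numpy as np

def adaboost_weights(t, C_prev):
    """w_i^(m) = exp(-t_i * C^(m-1)(x_i)); at m = 1, C^(0) = 0 gives w_i = 1."""
    return np.exp(-t * C_prev)

t = np.array([1, -1, 1, 1, -1])                   # labels t_i in {-1, +1}
C_prev = np.array([0.8, -0.5, -0.3, 1.2, 0.4])    # C^(m-1)(x_i), hypothetical values
print(adaboost_weights(t, C_prev))                # points with t_i * C^(m-1)(x_i) < 0 get weights > 1
```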
Remark
We have that the weight
$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\}$$
needs to be used in some way for the training of the new classifier.
This is of the utmost importance!!!
Therefore
You could use such a weight
As an output scaling in the estimator when applied to the loss function
$$\sum_{i=1}^{N}\left(y_i - w_i^{(m)} f(x_i)\right)^2$$
You could use such a weight
To sub-sample with replacement by using the distribution $D_m \propto w_i^{(m)}$ over the $x_i$,
then train using that sub-sample.
You could apply the weights to the loss function itself used for training
$$\sum_{i=1}^{N} w_i^{(m)}\left(y_i - f(x_i)\right)^2$$
Thus
In the first iteration, $w_i^{(1)} = 1$ for i = 1, ..., N
Meaning all the points have the same importance.
During later iterations, the vector $w^{(m)}$
represents the weight assigned to each data point in the training set at iteration m.
Rewriting the Cost Equation
We can split (Eq. 13)
$$E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\} \quad (15)$$
Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
Therefore
Writing the first summand as W_c \exp\{-\alpha_m\} and the second as W_e \exp\{\alpha_m\}:

E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (16)

71 / 132
Now, for the selection of y_m
The exact value of α_m > 0 is irrelevant:
for a fixed α_m > 0, minimizing E
is equivalent to minimizing exp{α_m} E,
or in other words

\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}   (17)

72 / 132
Now, we have
Given that α_m > 0, also 2α_m > 0.
We have

\exp\{2\alpha_m\} > \exp\{0\} = 1

73 / 132
Then
We can rewrite (Eq. 17) as

\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}   (18)

Thus

\exp\{\alpha_m\} E = (W_c + W_e) + W_e \left(\exp\{2\alpha_m\} - 1\right)   (19)

Now, W_c + W_e is the total sum W of the weights
of all data points, which is constant in the current iteration.
74 / 132
Thus
The right-hand side of the equation is minimized
when, at the m-th iteration, we pick the classifier with the lowest total cost W_e,
that is, the lowest weighted error.
Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
75 / 132
Do you remember?
The Matrix S
We pick the classifier with the lowest total cost W_e.
Now, we need to do some updates,
specifically the value α_m.
76 / 132
Deriving against the weight α_m
Going back to the original E, we can use the derivative trick:

\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (20)

Setting this equal to zero and multiplying by \exp\{\alpha_m\}:

-W_c + W_e \exp\{2\alpha_m\} = 0   (21)

The optimal value is thus

\alpha_m = \frac{1}{2} \ln\left(\frac{W_c}{W_e}\right)   (22)

78 / 132
Now
Writing the total sum of all weights as

W = W_c + W_e   (23)

we can rewrite the previous equation as

\alpha_m = \frac{1}{2} \ln\left(\frac{W - W_e}{W_e}\right) = \frac{1}{2} \ln\left(\frac{1 - e_m}{e_m}\right)   (24)

with the weighted error rate over the data points

e_m = \frac{W_e}{W}   (25)

79 / 132
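For instance, with a weighted error rate e_m = 0.25 the classifier receives α_m = ½ ln(0.75/0.25) = ½ ln 3 ≈ 0.549, while a classifier with e_m = 0.5 (no better than weighted random guessing) receives α_m = 0; any e_m < 0.5 yields α_m > 0.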
What about the weights?
Using the equation

w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}   (26)

and because we now have α_m and y_m(x_i):

w_i^{(m+1)} = \exp\{-t_i C^{(m)}(x_i)\}
            = \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}
            = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}

80 / 132
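In vectorized form this multiplicative update is a one-liner; the sketch below assumes labels t and weak-learner outputs y_pred are arrays with values in {-1, +1} (all names are illustrative).

import numpy as np

def update_weights(w, alpha_m, t, y_pred):
    # Hits (t == y_pred) shrink by exp(-alpha_m); misses grow by exp(+alpha_m)
    return w * np.exp(-alpha_m * t * y_pred)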
Sequential Training
Thus
AdaBoost trains a new classifier using a data set
in which the weighting coefficients are adjusted according to the performance of the previously trained classifier,
so as to give greater weight to the misclassified data points.
81 / 132
Illustration
Schematic illustration
[Figure not recovered in the text extraction]
82 / 132
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} to 1/N for i = 1, ..., N.
Step 2
For m = 1, 2, ..., M:
Select a weak classifier y_m(x) for the training data by minimizing the weighted error function, i.e.

\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I\left(y_m(x_i) \neq t_i\right) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e   (27)

where I is an indicator function.
84 / 132
AdaBoost Algorithm
Step 2 (continued)
Evaluate

e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I\left(y_m(x_n) \neq t_n\right)}{\sum_{n=1}^{N} w_n^{(m)}}   (28)

where I is an indicator function.
85 / 132
AdaBoost Algorithm
Step 3
Set the weight α_m to

\alpha_m = \frac{1}{2} \ln\left(\frac{1 - e_m}{e_m}\right)   (29)

Now update the weights of the data for the next iteration.
If t_i \neq y_m(x_i), i.e. a miss:

w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}   (30)

If t_i = y_m(x_i), i.e. a hit:

w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}   (31)

86 / 132
Finally, make predictions
For this use

Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)   (32)

87 / 132
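Putting Steps 1-3 and the final vote (Eq. 32) together, here is a minimal sketch of the algorithm. The choice of decision stumps (single-feature threshold classifiers) as weak learners and every name in the code are illustrative assumptions, not part of the slides.

import numpy as np

def fit_stump(X, t, w):
    # Brute-force search over (feature, threshold, polarity) minimizing the weighted error We;
    # fine for small toy data sets.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                we = np.sum(w[pred != t])
                if best is None or we < best[0]:
                    best = (we, j, thr, polarity)
    return best  # (We, feature, threshold, polarity)

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_fit(X, t, M=50):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                        # Step 1: w_i^(1) = 1/N
    ensemble = []
    for m in range(M):
        we, j, thr, pol = fit_stump(X, t, w)       # Step 2: minimize weighted error, Eq. (27)
        e_m = we / np.sum(w)                       # Eq. (28)
        e_m = np.clip(e_m, 1e-10, 1 - 1e-10)       # numerical guard for the logarithm
        alpha = 0.5 * np.log((1 - e_m) / e_m)      # Eq. (29)
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * t * pred)          # Step 3: Eqs. (30)-(31) in one line
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    # Eq. (32): Y_M(x) = sign(sum_m alpha_m y_m(x))
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)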
Observations
First
The first base classifier is obtained by the usual procedure of training a single classifier.
Second
From (Eq. 30) and (Eq. 31), we can see that the weighting coefficients are increased for data points that are misclassified.
Third
The quantity e_m represents a weighted measure of the error rate;
thus α_m gives more weight to the more accurate classifiers.
89 / 132
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers
whose members are trained to minimize the error function given the current weights.
Now
If indeed a finite set of classifiers is given, we only need to test the classifiers once on each data point.
The scouting matrix S
can then be reused at each iteration by multiplying the transposed vector of weights w^{(m)} with S to obtain W_e for each machine.
90 / 132
We have then
The following:

\left(W_e^{(1)}, W_e^{(2)}, \ldots, W_e^{(M)}\right) = \left(w^{(m)}\right)^T S   (33)

This allows us to reformulate the weight update step so that
only misses lead to weight modifications.
Note
The weight vector w^{(m)} is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simple to implement.
91 / 132
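A sketch of this reuse, assuming the scouting matrix is stored as an N x K array with S[i, k] = 1 if classifier k misclassifies data point x_i and 0 otherwise; this encoding and all names are illustrative assumptions consistent with Eq. (33).

import numpy as np

def weighted_errors(S, w_m):
    # Eq. (33): (We^(1), ..., We^(K)) = w_m^T S, one weighted error per candidate classifier
    return w_m @ S

def pick_classifier(S, w_m):
    We = weighted_errors(S, w_m)
    return int(np.argmin(We)), We   # index of the classifier with the lowest We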
Explanation
[Figure: graph of (1 − e_m)/e_m in the range 0 ≤ e_m ≤ 1]
93 / 132
So we have
[Figure: graph of α_m as a function of e_m]
94 / 132
We have the following cases
We have the following
If e_m → 1, then all the samples were incorrectly classified!!!
Thus
In this all-misclassified case, \lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0, and therefore α_m → −∞.
95 / 132
Now
We get that for every misclassified sample

w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0

Therefore
We only need to reverse the answers of this classifier to get a perfect classifier, and we can select it as the only committee member.
96 / 132
Now, the Last Case
If e_m → 1/2,
we have α_m → 0.
Thus, whether the sample is well or badly classified,

\exp\{-\alpha_m t_i y_m(x_i)\} \to 1   (34)

Therefore
The weight does not change at all.
97 / 132
Thus, we have
What about e_m → 0?
We have that α_m → +∞.
Thus, we have
For samples that are always correctly classified,

w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0

Thus, we only need m committee members; we do not need an (m + 1)-th member.
98 / 132
This comes from
The paper
“Additive Logistic Regression: A Statistical View of Boosting” by Friedman, Hastie and Tibshirani.
Something Notable
In this paper, a proof shows that boosting algorithms are procedures to fit an additive logistic regression model

E[y|x] = F(x) with F(x) = \sum_{m=1}^{M} f_m(x)

100 / 132
Consider the Additive Regression Model
We are interested in modeling the mean E[y|x] = F(x)
with the additive model

F(x) = \sum_{i=1}^{d} f_i(x_i)

where each f_i(x_i) is a function of a single input feature x_i.
A convenient algorithm for updating these models is the backfitting algorithm, with update (a sketch of it follows this slide):

f_i(x_i) \leftarrow E\left[y - \sum_{k \neq i} f_k(x_k) \,\middle|\, x_i\right]

101 / 132
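A rough sketch of backfitting for an additive model with one component per feature. The conditional expectation is approximated here by a simple bin-averaging smoother; that smoother, the iteration counts, and all names are illustrative assumptions, not the exact procedure from the paper.

import numpy as np

def bin_smoother(x, r, n_bins=10):
    # Approximate E[r | x] by averaging the residual r within quantile bins of x
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return means[idx]

def backfit(X, y, n_iter=20):
    n, d = X.shape
    f = np.zeros((d, n))                                 # f_i(x_i) evaluated at the data
    for _ in range(n_iter):
        for i in range(d):
            partial_residual = y - f.sum(axis=0) + f[i]  # y - sum_{k != i} f_k(x_k)
            f[i] = bin_smoother(X[:, i], partial_residual)
            f[i] -= f[i].mean()                          # center each component for identifiability
    return f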
Remarks
An example of these additive models is the matching pursuit expansion

f(t) = \sum_{n=-\infty}^{+\infty} a_n g_{\gamma_n}(t)

Backfitting ensures that, under fairly general conditions,
it converges to the minimizer of E\left(y - f(x)\right)^2.
102 / 132
In the case of AdaBoost
We have an additive model
which considers functions \{f_m(x)\}_{m=1}^{M} that take into account all the features (Perceptrons, Decision Trees, etc.).
Each of these functions is characterized by a set of parameters γ_m and a multiplier α_m:

f_m(x) = \alpha_m y_m(x|\gamma_m)

with additive model

F_M(x) = \alpha_1 y_1(x|\gamma_1) + \cdots + \alpha_M y_M(x|\gamma_M)

103 / 132
Remark - Moving from Regression to Classification
Given that regression has a wide range of outputs,
logistic regression is widely used to move from regression to classification:

\log\frac{P(Y = 1|x)}{P(Y = -1|x)} = \sum_{m=1}^{M} f_m(x)

A nice property: the probability estimates lie in [0, 1].
Now, solving by assuming P(Y = 1|x) + P(Y = -1|x) = 1:

P(Y = 1|x) = \frac{e^{F(x)}}{1 + e^{F(x)}}

105 / 132
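As a quick sketch, converting an additive score F(x) = \sum_m f_m(x) into a class probability under this link; the function name is illustrative.

import numpy as np

def prob_positive(F_x):
    # P(Y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}) = 1 / (1 + e^{-F(x)}), the logistic sigmoid of F(x)
    return 1.0 / (1.0 + np.exp(-F_x))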
The Exponential Criterion
We have our exponential criterion under an expected value, with y ∈ {1, −1}:

J(F) = E\left[e^{-yF(x)}\right]

Lemma
E\left[e^{-yF(x)}\right] is minimized at

F(x) = \frac{1}{2} \log\frac{P(Y = 1|x)}{P(Y = -1|x)}

Hence:

P(Y = 1|x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}, \qquad P(Y = -1|x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}

107 / 132
Proof
Given the discrete nature of y ∈ {1, −1},

\frac{\partial E\left[e^{-yF(x)}\right]}{\partial F(x)} = -P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)}

Therefore, at the minimum,

-P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)} = 0

108 / 132
Then
We have that

P(Y = 1|x)\, e^{-F(x)} = P(Y = -1|x)\, e^{F(x)} = \left[1 - P(Y = 1|x)\right] e^{F(x)}

Solving,

e^{F(x)} = \left(e^{-F(x)} + e^{F(x)}\right) P(Y = 1|x)

109 / 132
Finally, we have
The first equation:

P(Y = 1|x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}

Similarly,

P(Y = -1|x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}

110 / 132
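A quick consistency check: writing p = P(Y = 1|x) and F(x) = ½ log(p/(1 − p)), we have e^{F(x)} = \sqrt{p/(1 - p)} and e^{-F(x)} + e^{F(x)} = 1/\sqrt{p(1 - p)}, so e^{F(x)}/(e^{-F(x)} + e^{F(x)}) = p, as claimed.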
Basically
When you minimize the cost function E\left[e^{-yF(x)}\right],
then at the optimum you recover the binary classification rule
of logistic regression.
111 / 132
Furthermore
Corollary
If E is replaced by averages over regions of x where F(x) is constant (as in a decision tree),
the same result applies to the sample proportions of y = 1 and y = −1.
112 / 132
Finally, The Additive Logistic Regression
Proposition
The AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of

J(F) = E\left[e^{-yF(x)}\right]

Proof
Imagine you have an estimate F(x); then we seek an improved estimate

F(x) + f(x)

114 / 132
For This
We minimize, at each x,

J(F(x) + f(x))

This can be expanded as

J(F(x) + f(x)) = E\left[e^{-y(F(x)+f(x))} \mid x\right]
               = e^{-f(x)} E\left[e^{-yF(x)} I(y = 1) \mid x\right] + e^{f(x)} E\left[e^{-yF(x)} I(y = -1) \mid x\right]

115 / 132
Deriving w.r.t. f(x)
We get

-e^{-f(x)} E\left[e^{-yF(x)} I(y = 1) \mid x\right] + e^{f(x)} E\left[e^{-yF(x)} I(y = -1) \mid x\right] = 0

116 / 132
We have the following
If we divide by E\left[e^{-yF(x)} \mid x\right], the first term becomes

\frac{E\left[e^{-yF(x)} I(y = 1) \mid x\right]}{E\left[e^{-yF(x)} \mid x\right]} = E_w\left[I(y = 1) \mid x\right]

Also,

\frac{E\left[e^{-yF(x)} I(y = -1) \mid x\right]}{E\left[e^{-yF(x)} \mid x\right]} = E_w\left[I(y = -1) \mid x\right]

117 / 132
Thus, we have
We apply the natural log to both sides:

\log e^{-f(x)} + \log E_w\left[I(y = 1) \mid x\right] = \log e^{f(x)} + \log E_w\left[I(y = -1) \mid x\right]

Then

2f(x) = \log E_w\left[I(y = 1) \mid x\right] - \log E_w\left[I(y = -1) \mid x\right]

118 / 132
Finally
We have that

f(x) = \frac{1}{2} \log\frac{E_w\left[I(y = 1) \mid x\right]}{E_w\left[I(y = -1) \mid x\right]}

In terms of probabilities,

f(x) = \frac{1}{2} \log\frac{P_w(y = 1|x)}{P_w(y = -1|x)}

119 / 132
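A sketch of how this stage-wise update could be estimated on a region of the input space (for instance a tree leaf), using weighted class proportions as plug-in estimates of P_w; the function name and the smoothing constant eps are illustrative assumptions.

import numpy as np

def half_log_odds(w, y, eps=1e-10):
    # f = 0.5 * log(P_w(y = 1) / P_w(y = -1)) from boosting weights w and labels y in {-1, +1}
    p_pos = np.sum(w[y == 1]) / np.sum(w)
    p_neg = 1.0 - p_pos
    return 0.5 * np.log((p_pos + eps) / (p_neg + eps))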
The Weight Update
Finally, we have a way to update the weights, by setting

w_t(x, y) = e^{-yF(x)}, \qquad w_{t+1}(x, y) = w_t(x, y)\, e^{-yf(x)}

120 / 132
Additionally, the weighted conditional mean
Corollary
At the optimal F(x), the weighted conditional mean of y is 0.
Proof
When F(x) is optimal,

\frac{\partial J(F(x))}{\partial F(x)} = \frac{\partial \left[P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)}\right]}{\partial F(x)}

121 / 132
Therefore
We have

\frac{\partial J(F(x))}{\partial F(x)} = -P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)} = -E\left[y\, e^{-yF(x)} \mid x\right]

Therefore, at the optimum,

E\left[e^{-yF(x)}\, y \mid x\right] = 0

122 / 132
Here, we decide to use Perceptrons
As Weak Learners
We could use a finite number of Perceptrons,
but we want to have an infinitude of possible weak learners,
thus avoiding the need for a matrix S.
Remark
We need to use a gradient-based learner for this.
124 / 132
Perceptron
We use the following error formula, where w_j(t) is the boosting weight of sample j and y_j(t) = \varphi\left(w^T(t)\, x_j\right) is the Perceptron output with parameter vector w(t):

E(w) = \frac{1}{2} \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right)^2

Deriving with respect to the Perceptron parameter w_i:

\frac{\partial E(w)}{\partial w_i} = \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right) \varphi'\left(w^T(t)\, x_j\right) w_j(t)\, x_{ij}

Then, using gradient descent, we have the following update:

w_i(n + 1) = w_i(n) - \eta \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right) \varphi'\left(w^T(t)\, x_j\right) w_j(t)\, x_{ij}

125 / 132
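A minimal sketch of such a gradient-based weak learner: a single sigmoid unit trained by gradient descent on a squared error in which each sample carries its boosting weight. This follows the weighted-loss variant mentioned at the end of the lecture (½ Σ_j w_j(t)(y_j(t) − d_j)²); the learning rate, epoch count, and all names are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_unit(X, d, sample_w, eta=0.1, epochs=100):
    # Gradient descent on E = 0.5 * sum_j sample_w[j] * (sigmoid(w @ x_j) - d_j)^2
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = sigmoid(X @ w)
        err = y - d
        grad = X.T @ (sample_w * err * y * (1.0 - y))   # chain rule: sigmoid'(z) = y * (1 - y)
        w -= eta * grad
    return w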
Data Set
Training set with classes ω_1 = N(0, 1) and ω_2 = N(0, σ²) − N(0, 1)
126 / 132
Example
[Figure: classification result for m = 10]
127 / 132
Example
[Figure: classification result for m = 40]
128 / 132
At the end of the process
[Figure: classification result for m = 80]
129 / 132
Final Confusion Matrix
When m = 80:

        C1     C2
C1     1.0    0.0
C2     0.0    1.0

130 / 132
However
There are other readings of the cryptic phrase
in “Boosting: Foundations and Algorithms” by Schapire and Freund,
“Train weak learner using distribution Dt”:
we could re-sample using the distribution given by the weights w_t,
basically using sampling with replacement over the data set {x_1, x_2, ..., x_N} (a sketch follows).
131 / 132
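A sketch of this resampling reading of the phrase: draw a sample of the training set with replacement, with probabilities proportional to the current weights, and train the weak learner on it. All names are illustrative.

import numpy as np

def resample_by_weights(X, t, w, seed=0):
    rng = np.random.default_rng(seed)
    p = w / np.sum(w)                     # normalize the weights into a distribution D_t
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], t[idx]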
Other Interpretations exist
You can also use a weighted version of the cost function:

\frac{1}{2} \sum_{j} w_j(t) \left(y_j(t) - d_j\right)^2

For more, take a look at
“Boosting Neural Networks” by Holger Schwenk and Yoshua Bengio.
132 / 132

18.1 combining models

  • 1.
    Introduction to MachineLearning Combining Models, Bayesian Average and Boosting Andres Mendez-Vazquez July 31, 2018 1 / 132
  • 2.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 2 / 132
  • 3.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 3 / 132
  • 4.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 5.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 6.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 7.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 8.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 9.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 5 / 132
  • 10.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 11.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 12.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 13.
  • 14.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 8 / 132
  • 15.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 16.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 17.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 18.
    Images/cinvestav- Something like this Modelsin charge of different set of inputs 10 / 132
  • 19.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 11 / 132
  • 20.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 21.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 22.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 23.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 24.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 25.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 26.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 27.
    Images/cinvestav- This is usedin the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 13 / 132
  • 28.
    Images/cinvestav- This is usedin the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 13 / 132
  • 29.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 14 / 132
  • 30.
    Images/cinvestav- It is importantto differentiate between them Although Model Combinations and Bayesian Model Averaging look similar. However, they are actually different For this We have the following example. 15 / 132
  • 31.
    Images/cinvestav- It is importantto differentiate between them Although Model Combinations and Bayesian Model Averaging look similar. However, they are actually different For this We have the following example. 15 / 132
  • 32.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 33.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 34.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 35.
    Images/cinvestav- Example In the caseof Mixture of Gaussian’s p (x) = K k=1 πkN (x|µk, Σk) This is an example of model combination. What about other Models 17 / 132
  • 36.
    Images/cinvestav- Example In the caseof Mixture of Gaussian’s p (x) = K k=1 πkN (x|µk, Σk) This is an example of model combination. What about other Models 17 / 132
  • 37.
    Images/cinvestav- More Models Now, forindependent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) 18 / 132
  • 38.
    Images/cinvestav- Therefore Something Notable Each observeddata point xn has a corresponding latent variable zn. Here, we are doing a Combination of Models Each Gaussian indexed by zn is in charge of generating one section of the sample space 19 / 132
  • 39.
    Images/cinvestav- Therefore Something Notable Each observeddata point xn has a corresponding latent variable zn. Here, we are doing a Combination of Models Each Gaussian indexed by zn is in charge of generating one section of the sample space 19 / 132
  • 40.
  • 41.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 21 / 132
  • 42.
    Images/cinvestav- Now, suppose We haveseveral different models indexed by h = 1, ..., H with prior probabilities One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) ≈p(h|X) This is an example of Bayesian model averaging 22 / 132
  • 43.
    Images/cinvestav- Now, suppose We haveseveral different models indexed by h = 1, ..., H with prior probabilities One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) ≈p(h|X) This is an example of Bayesian model averaging 22 / 132
  • 44.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 45.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 46.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 47.
  • 48.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 25 / 132
  • 49.
    Images/cinvestav- The Differences Bayesian modelaveraging The whole data set is generated by a single model h. Model combination Different data points within the data set can potentially be generated from different by different components. 26 / 132
  • 50.
    Images/cinvestav- The Differences Bayesian modelaveraging The whole data set is generated by a single model h. Model combination Different data points within the data set can potentially be generated from different by different components. 26 / 132
  • 51.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 27 / 132
  • 52.
    Images/cinvestav- Committees Idea, the simplestway to construct a committee It is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 28 / 132
For example
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However, a Big Problem
We normally have a single data set.
Thus, we need to introduce some variability between the different committee members.
One approach: use bootstrap data sets.
The Idea of Bootstrap
We denote the training set by Z = {z1, z2, ..., zN }, where zi = (xi, yi).
The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
This is done B times, producing B bootstrap datasets.
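To make the resampling step concrete, here is a minimal Python sketch (not from the slides) of drawing B bootstrap datasets with replacement; the toy arrays and the choice B = 10 are illustrative assumptions.

```python
import numpy as np

def bootstrap_datasets(X, y, B, rng=None):
    """Draw B bootstrap datasets, each of size N, sampled with replacement from (X, y)."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    datasets = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)  # indices drawn with replacement
        datasets.append((X[idx], y[idx]))
    return datasets

# Toy usage: N = 20 points in one dimension
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
boot = bootstrap_datasets(X, y, B=10, rng=0)
print(len(boot), boot[0][0].shape)  # 10 datasets, each with 20 samples
```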
Then, a quantity is computed
S(Z) is any quantity computed from the data Z.
From the bootstrap sampling, we can estimate any aspect of the distribution of S(Z).
Then we refit the model to each of the bootstrap datasets
You compute S(Z∗b) by refitting the model to the b-th bootstrap dataset.
Then you examine the behavior of the fits over the B replications.
For example, its variance:
$$\widehat{\operatorname{Var}}\left[S\left(Z\right)\right]=\frac{1}{B-1}\sum_{b=1}^{B}\left(S\left(Z^{*b}\right)-\bar{S}^{*}\right)^{2},\qquad \bar{S}^{*}=\frac{1}{B}\sum_{b=1}^{B}S\left(Z^{*b}\right)$$
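A minimal sketch of this variance estimate, assuming the statistic S is the sample median (an illustrative choice, not specified by the slides):

```python
import numpy as np

def bootstrap_variance(Z, S, B=200, rng=None):
    """Estimate Var[S(Z)] from B bootstrap replications Z*b, as in the formula above."""
    rng = np.random.default_rng(rng)
    N = len(Z)
    reps = np.array([S(Z[rng.integers(0, N, size=N)]) for _ in range(B)])
    S_bar = reps.mean()                          # the average S-bar* over replications
    return ((reps - S_bar) ** 2).sum() / (B - 1)

Z = np.random.default_rng(1).normal(size=100)    # toy data
print(bootstrap_variance(Z, np.median, B=500, rng=2))
```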
Relation with Monte-Carlo Estimation
Note that Var[S(Z)] can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function of the data Z = {z1, z2, ..., zN }.
For example, a schematic of the bootstrap process: [figure: training samples → bootstrap samples → bootstrap replications]
Thus
Use each of them to train a copy yb(x) of a predictive regression model for a single continuous target; then
$$y_{com}\left(x\right)=\frac{1}{B}\sum_{b=1}^{B}y_{b}\left(x\right) \quad (2)$$
This is also known as Bootstrap Aggregation, or Bagging.
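A minimal bagging sketch following Eq. (2); the base regressor (a low-degree polynomial least-squares fit) and the noisy sinusoidal data are illustrative assumptions, not the slides' experiment.

```python
import numpy as np

def fit_poly(X, y, degree=3):
    """Base regressor: a least-squares polynomial fit (an illustrative choice)."""
    return np.polyfit(X.ravel(), y, degree)

def bagged_predict(models, X):
    """y_com(x) = (1/B) * sum_b y_b(x), as in Eq. (2)."""
    preds = np.stack([np.polyval(c, X.ravel()) for c in models])
    return preds.mean(axis=0)

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

B = 25
models = []
for _ in range(B):
    idx = rng.integers(0, 30, size=30)        # one bootstrap sample
    models.append(fit_poly(X[idx], y[idx]))
print(bagged_predict(models, X)[:5])          # committee predictions on the first points
```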
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate yb(x):
$$y_{b}\left(x\right)=h\left(x\right)+\epsilon_{b}\left(x\right) \quad (3)$$
The average sum-of-squares error over the data takes the form
$$\mathbb{E}_{x}\left[\left(y_{b}\left(x\right)-h\left(x\right)\right)^{2}\right]=\mathbb{E}_{x}\left[\epsilon_{b}^{2}\left(x\right)\right] \quad (4)$$
What is Ex? It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
$$E_{AV}=\frac{1}{B}\sum_{b=1}^{B}\mathbb{E}_{x}\left[\epsilon_{b}^{2}\left(x\right)\right] \quad (5)$$
Similarly, the expected error of the committee is
$$E_{COM}=\mathbb{E}_{x}\left[\left(\frac{1}{B}\sum_{b=1}^{B}\left(y_{b}\left(x\right)-h\left(x\right)\right)\right)^{2}\right]=\mathbb{E}_{x}\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_{b}\left(x\right)\right)^{2}\right] \quad (6)$$
Assume that the errors have zero mean and are uncorrelated
Something reasonable to assume given the way we produce the bootstrap samples:
$$\mathbb{E}_{x}\left[\epsilon_{b}\left(x\right)\right]=0,\qquad \mathbb{E}_{x}\left[\epsilon_{b}\left(x\right)\epsilon_{l}\left(x\right)\right]=0\ \text{ for } b\neq l$$
Then, we have that
$$\begin{aligned}
E_{COM}&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\left(\sum_{b=1}^{B}\epsilon_{b}\left(x\right)\right)^{2}\right]\\
&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)+\sum_{h=1}^{B}\sum_{\substack{k=1\\ k\neq h}}^{B}\epsilon_{h}\left(x\right)\epsilon_{k}\left(x\right)\right]\\
&=\frac{1}{B^{2}}\left[\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]+\sum_{h=1}^{B}\sum_{\substack{k=1\\ k\neq h}}^{B}\mathbb{E}_{x}\left[\epsilon_{h}\left(x\right)\epsilon_{k}\left(x\right)\right]\right]\\
&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]=\frac{1}{B}\left(\frac{1}{B}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]\right)
\end{aligned}$$
We finally obtain
$$E_{COM}=\frac{1}{B}E_{AV} \quad (7)$$
Looks great, BUT!!! Unfortunately, it depends on the key assumption that the errors of the individual bootstrap models are uncorrelated.
Thus, the Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable: it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
$$E_{COM}\leq E_{AV} \quad (8)$$
However, we need something better: a more sophisticated technique known as boosting.
Boosting
What does Boosting do? It combines several classifiers to produce a form of committee.
We will describe AdaBoost ("Adaptive Boosting"), developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods: the base classifiers are trained in sequence.
Explanation: consider a two-class classification problem with
1. Samples x1, x2, ..., xN
2. Binary labels t1, t2, ..., tN with ti ∈ {−1, 1}
Cost Function
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way.
For each pattern xi, each expert classifier outputs a classification yj(xi) ∈ {−1, 1}.
The final decision of the committee of M experts is sign(C(xi)), with
$$C\left(x_{i}\right)=\alpha_{1}y_{1}\left(x_{i}\right)+\alpha_{2}y_{2}\left(x_{i}\right)+\cdots+\alpha_{M}y_{M}\left(x_{i}\right) \quad (9)$$
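A small sketch of the committee decision in Eq. (9); the threshold ("stump") experts and their weights are hypothetical, used only to show the weighted vote and the sign rule.

```python
import numpy as np

def committee_decision(classifiers, alphas, X):
    """sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), as in Eq. (9)."""
    C = np.zeros(X.shape[0])
    for alpha, clf in zip(alphas, classifiers):
        C += alpha * clf(X)                   # each y_m(x) returns labels in {-1, +1}
    return np.sign(C)

# Illustrative experts: thresholds on the first feature
stump = lambda thr: (lambda X: np.where(X[:, 0] > thr, 1, -1))
experts = [stump(0.3), stump(0.5), stump(0.7)]
alphas = [0.4, 1.0, 0.6]
X = np.random.default_rng(4).uniform(size=(5, 2))
print(committee_decision(experts, alphas, X))
```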
Now
Adaptive Boosting works even with a continuum of classifiers.
However, for the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following:
Review possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now, selection is done the following way
Test the classifiers in the pool using a training set T of N multidimensional data points xi, where each point xi has a label ti = 1 or ti = −1.
Assigning a cost for actions: we test and rank all classifiers in the expert pool by
Charging a cost exp{β} any time a classifier fails (a miss).
Charging a cost exp{−β} any time a classifier provides the right label (a hit).
Remarks about β
We require β > 0, so misses are penalized more heavily than hits.
Although it looks strange to penalize hits, all that matters is that the penalty for a hit is smaller than the penalty for a miss: exp{−β} < exp{β}.
Why? If we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. It does not compromise generality.
Exponential Loss Function
This kind of error function is different from the squared Euclidean distance; the resulting classification criterion is called an exponential loss function. AdaBoost uses the exponential loss as its error criterion.
[Figure: plot of the exponential loss as a function of the margin]
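As a small illustration (the grid of margin values is an assumption, chosen only for display), the exponential loss exp{−t C(x)} can be tabulated against the 0-1 misclassification loss:

```python
import numpy as np

# Margin z = t * C(x): positive for a correct vote, negative for a miss.
z = np.linspace(-1.0, 1.0, 9)
exp_loss = np.exp(-z)                  # the exponential loss used by AdaBoost
zero_one = (z <= 0).astype(float)      # misclassification loss, for comparison
for zi, e, m in zip(z, exp_loss, zero_one):
    print(f"margin {zi:+.2f}  exp-loss {e:.3f}  0-1 loss {m:.0f}")
```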
Selection of the Classifier
We need a way to select the best classifier in the pool: when we test the M classifiers in the pool, we build a matrix S in which we record the misses (with a ONE) and hits (with a ZERO) of each classifier.
The Matrix S
Row i in the matrix is reserved for the data point xi, and column m is reserved for the m-th classifier in the pool:

         Classifier 1   Classifier 2   · · ·   Classifier M
   x1         0              1         · · ·        1
   x2         0              0         · · ·        1
   x3         1              1         · · ·        0
   ...
   xN         0              0         · · ·        0
Something interesting about S
Summing S over its rows (that is, over the data points) for each classifier gives its empirical risk:
$$ER\left(y_{j}\right)=\frac{1}{N}\sum_{i=1}^{N}S_{ij},\qquad j=1,\ldots,M$$
Therefore, the candidate to be used at a given iteration is the classifier yj with the smallest empirical risk!!!
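A sketch of building S and selecting by empirical risk; the pool of threshold classifiers and the synthetic labels are hypothetical choices used only for illustration.

```python
import numpy as np

def scouting_matrix(classifiers, X, t):
    """S[i, j] = 1 if classifier j misses point i, else 0."""
    return np.stack([(clf(X) != t).astype(int) for clf in classifiers], axis=1)

rng = np.random.default_rng(5)
X = rng.uniform(size=(50, 1))
t = np.where(X[:, 0] > 0.5, 1, -1)                 # toy ground truth

experts = [lambda X, thr=thr: np.where(X[:, 0] > thr, 1, -1) for thr in (0.2, 0.45, 0.8)]
S = scouting_matrix(experts, X, t)
ER = S.mean(axis=0)                                # ER(y_j) = (1/N) * sum_i S_ij
print("empirical risks:", ER, "-> pick classifier", ER.argmin())
```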
Main Idea
What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
The elements of the data set are weighted according to their current relevance (or urgency) at each iteration.
At the beginning of the iterations, all data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses, the more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
The selection process concentrates on selecting new classifiers for the committee by focusing on those which can help with the still misclassified examples.
Then, the best classifiers are those which can provide new insights to the committee; classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want: in each iteration, we rank all classifiers so that we can select the current best out of the pool.
At the m-th iteration, we have already included m − 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee:
$$C^{(m-1)}\left(x_{i}\right)=\alpha_{1}y_{1}\left(x_{i}\right)+\alpha_{2}y_{2}\left(x_{i}\right)+\cdots+\alpha_{m-1}y_{m-1}\left(x_{i}\right) \quad (10)$$
Thus, we have that
Extending the cost function with the new classifier ym:
$$C^{(m)}\left(x_{i}\right)=C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right) \quad (11)$$
At the first iteration m = 1, C^(0) is the zero function.
Thus, the total cost, or total error, is defined as the exponential error
$$E=\sum_{i=1}^{N}\exp\left\{-t_{i}\left(C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right)\right)\right\} \quad (12)$$
Thus
We want to determine αm and ym in an optimal way. Rewriting,
$$E=\sum_{i=1}^{N}w_{i}^{(m)}\exp\left\{-t_{i}\alpha_{m}y_{m}\left(x_{i}\right)\right\} \quad (13)$$
where, for i = 1, 2, ..., N,
$$w_{i}^{(m)}=\exp\left\{-t_{i}C^{(m-1)}\left(x_{i}\right)\right\} \quad (14)$$
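A short sketch of Eqs. (13)-(14): computing the current weights from the committee scores and evaluating the weighted exponential error for one candidate classifier. The numeric values are illustrative assumptions.

```python
import numpy as np

def exp_weights(C_prev, t):
    """w_i^(m) = exp(-t_i * C^(m-1)(x_i)), Eq. (14)."""
    return np.exp(-t * C_prev)

def weighted_exp_error(w, t, y_m, alpha_m):
    """E = sum_i w_i^(m) * exp(-t_i * alpha_m * y_m(x_i)), Eq. (13)."""
    return np.sum(w * np.exp(-t * alpha_m * y_m))

t = np.array([1, -1, 1, 1, -1])
C_prev = np.array([0.8, -0.5, -0.2, 1.1, 0.3])   # committee scores so far (illustrative)
y_m = np.array([1, -1, -1, 1, -1])               # one candidate classifier's outputs
w = exp_weights(C_prev, t)
print(w, weighted_exp_error(w, t, y_m, alpha_m=0.5))
```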
Remark
The weight w_i^(m) = exp{−ti C^(m−1)(xi)} needs to be used in some way for the training of the new classifier. This is of the utmost importance!!!
Therefore, you could use such weights in several ways:
As an output modification in the estimator when applied to the loss function,
$$\sum_{i=1}^{N}\left(y_{i}-w_{i}^{(m)}f\left(x_{i}\right)\right)^{2}$$
By sub-sampling with replacement using the distribution D_m given by the w_i^(m) over the xi, and then training on that sub-sample.
By applying the weights to the loss function itself used for training,
$$\sum_{i=1}^{N}w_{i}^{(m)}\left(y_{i}-f\left(x_{i}\right)\right)^{2}$$
Thus
In the first iteration, w_i^(1) = 1 for i = 1, ..., N, meaning all the points have the same importance.
During later iterations, the vector w^(m) represents the weight assigned to each data point in the training set at iteration m.
Rewriting the Cost Equation
We can split (Eq. 13) into hits and misses:
$$E=\exp\left\{-\alpha_{m}\right\}\sum_{t_{i}=y_{m}\left(x_{i}\right)}w_{i}^{(m)}+\exp\left\{\alpha_{m}\right\}\sum_{t_{i}\neq y_{m}\left(x_{i}\right)}w_{i}^{(m)} \quad (15)$$
Meaning: the total cost is the weighted cost of all hits plus the weighted cost of all misses.
Therefore
Writing the first summand as Wc exp{−αm} and the second as We exp{αm}:
$$E=W_{c}\exp\left\{-\alpha_{m}\right\}+W_{e}\exp\left\{\alpha_{m}\right\} \quad (16)$$
Now, for the selection of ym
The exact value of αm > 0 is irrelevant, since for a fixed αm, minimizing E is equivalent to minimizing exp{αm}E, or in other words
$$\exp\left\{\alpha_{m}\right\}E=W_{c}+W_{e}\exp\left\{2\alpha_{m}\right\} \quad (17)$$
Now, we have
Given that αm > 0, then 2αm > 0, and we have exp{2αm} > exp{0} = 1.
Then
We can rewrite (Eq. 17) as
$$\exp\left\{\alpha_{m}\right\}E=W_{c}+W_{e}-W_{e}+W_{e}\exp\left\{2\alpha_{m}\right\} \quad (18)$$
Thus
$$\exp\left\{\alpha_{m}\right\}E=\left(W_{c}+W_{e}\right)+W_{e}\left(\exp\left\{2\alpha_{m}\right\}-1\right) \quad (19)$$
Now, Wc + We is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus
The right-hand side of the equation is minimized when, at the m-th iteration, we pick the classifier with the lowest total cost We, that is, the lowest weighted error.
Intuitively, the next selected ym should be the one with the lowest penalty given the current set of weights.
Do you remember the matrix S?
We pick the classifier with the lowest total cost We. Now we need to do some updates, specifically the value αm.
Deriving against the weight αm
Going back to the original E, we can use the derivative trick:
$$\frac{\partial E}{\partial\alpha_{m}}=-W_{c}\exp\left\{-\alpha_{m}\right\}+W_{e}\exp\left\{\alpha_{m}\right\} \quad (20)$$
Setting this equal to zero and multiplying by exp{αm}:
$$-W_{c}+W_{e}\exp\left\{2\alpha_{m}\right\}=0 \quad (21)$$
The optimal value is thus
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{W_{c}}{W_{e}}\right) \quad (22)$$
Now
Writing the total sum of all weights as
$$W=W_{c}+W_{e} \quad (23)$$
we can rewrite the previous equation as
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{W-W_{e}}{W_{e}}\right)=\frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}}\right) \quad (24)$$
with the weighted error rate of the data points
$$e_{m}=\frac{W_{e}}{W} \quad (25)$$
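As a quick numerical check (the values of Wc and We are made-up), the closed-form αm of Eqs. (22) and (24) can be compared against a brute-force minimization of E(α) from Eq. (16):

```python
import numpy as np

Wc, We = 0.8, 0.2                               # illustrative weighted hit/miss masses
alpha_closed = 0.5 * np.log(Wc / We)            # Eq. (22)
e_m = We / (Wc + We)
alpha_rate = 0.5 * np.log((1 - e_m) / e_m)      # Eq. (24), the same value

alphas = np.linspace(0.01, 3.0, 10000)
E = Wc * np.exp(-alphas) + We * np.exp(alphas)  # Eq. (16)
print(alpha_closed, alpha_rate, alphas[E.argmin()])  # all three agree (about 0.693 here)
```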
What about the weights?
Using the equation
$$w_{i}^{(m)}=\exp\left\{-t_{i}C^{(m-1)}\left(x_{i}\right)\right\} \quad (26)$$
and because we now have αm and ym(xi):
$$w_{i}^{(m+1)}=\exp\left\{-t_{i}C^{(m)}\left(x_{i}\right)\right\}=\exp\left\{-t_{i}\left(C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right)\right)\right\}=w_{i}^{(m)}\exp\left\{-t_{i}\alpha_{m}y_{m}\left(x_{i}\right)\right\}$$
Sequential Training
Thus, AdaBoost trains a new classifier using a data set where the weighting coefficients are adjusted according to the performance of the previously trained classifier, to give greater weight to the misclassified data points.
AdaBoost Algorithm
Step 1: Initialize w_i^(1) to 1/N.
Step 2: For m = 1, 2, ..., M, select a weak classifier ym(x) for the training data by minimizing the weighted error function:
$$\arg\min_{y_{m}}\sum_{i=1}^{N}w_{i}^{(m)}I\left(y_{m}\left(x_{i}\right)\neq t_{i}\right)=\arg\min_{y_{m}}\sum_{t_{i}\neq y_{m}\left(x_{i}\right)}w_{i}^{(m)}=\arg\min_{y_{m}}W_{e} \quad (27)$$
where I is an indicator function.
AdaBoost Algorithm
Step 2 (continued): Evaluate
$$e_{m}=\frac{\sum_{n=1}^{N}w_{n}^{(m)}I\left(y_{m}\left(x_{n}\right)\neq t_{n}\right)}{\sum_{n=1}^{N}w_{n}^{(m)}} \quad (28)$$
where I is an indicator function.
AdaBoost Algorithm
Step 3: Set the weight αm to
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}}\right) \quad (29)$$
Now update the weights of the data for the next iteration.
If ti ≠ ym(xi), i.e. a miss:
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{\alpha_{m}\right\}=w_{i}^{(m)}\sqrt{\frac{1-e_{m}}{e_{m}}} \quad (30)$$
If ti = ym(xi), i.e. a hit:
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{-\alpha_{m}\right\}=w_{i}^{(m)}\sqrt{\frac{e_{m}}{1-e_{m}}} \quad (31)$$
Finally, make predictions
For this, use
$$Y_{M}\left(x\right)=\operatorname{sign}\left(\sum_{m=1}^{M}\alpha_{m}y_{m}\left(x\right)\right) \quad (32)$$
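Putting Steps 1-3 and Eq. (32) together, here is a minimal end-to-end sketch. The pool of decision stumps, the toy data, and the stopping rule are assumptions made for illustration; they are not prescribed by the slides.

```python
import numpy as np

def stump_pool(X, n_thresholds=10):
    """A finite pool of weak classifiers: thresholds on each feature, with both signs."""
    pool = []
    for d in range(X.shape[1]):
        for thr in np.linspace(X[:, d].min(), X[:, d].max(), n_thresholds):
            for s in (1, -1):
                pool.append(lambda X, d=d, thr=thr, s=s: s * np.where(X[:, d] > thr, 1, -1))
    return pool

def adaboost(X, t, pool, M=20):
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                     # Step 1: w_i^(1) = 1/N
    alphas, chosen = [], []
    for _ in range(M):
        # Step 2: pick the classifier with the smallest weighted miss cost W_e
        errors = [w @ (clf(X) != t) for clf in pool]
        m = int(np.argmin(errors))
        y_m = pool[m](X)
        e_m = errors[m] / w.sum()               # Eq. (28)
        if e_m <= 0 or e_m >= 0.5:              # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - e_m) / e_m)   # Step 3, Eq. (29)
        w = w * np.exp(-alpha * t * y_m)        # Eqs. (30)-(31) in one expression
        alphas.append(alpha)
        chosen.append(pool[m])
    return alphas, chosen

def predict(alphas, chosen, X):
    """Y_M(x) = sign(sum_m alpha_m y_m(x)), Eq. (32)."""
    return np.sign(sum(a * clf(X) for a, clf in zip(alphas, chosen)))

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # a tilted boundary no single stump matches
alphas, chosen = adaboost(X, t, stump_pool(X), M=40)
print("training accuracy:", (predict(alphas, chosen, X) == t).mean())
```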
Observations
First: the first base classifier is trained by the usual procedure of training a single classifier.
Second: from (Eq. 30) and (Eq. 31), we can see that the weighting coefficients are increased for data points that are misclassified.
Third: the quantity em represents a weighted measure of the error rate; thus αm gives more weight to the more accurate classifiers.
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers whose members are trained to minimize the error function given the current weights.
If indeed a finite set of classifiers is given, we only need to test the classifiers once for each data point.
The scouting matrix S can then be reused at each iteration by multiplying the transposed vector of weights w^(m) with S to obtain We for each machine.
We have then the following:
$$\left(W_{e}^{(1)},W_{e}^{(2)},\ldots,W_{e}^{(M)}\right)=\left(w^{(m)}\right)^{T}S \quad (33)$$
This allows us to reformulate the weight update step so that only misses lead to weight modification.
Note that the weight vector w^(m) is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
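A tiny sketch of Eq. (33), with a made-up scouting matrix and weight vector: one matrix product returns the weighted miss cost of every classifier at once.

```python
import numpy as np

# S[i, j] = 1 if classifier j misses point i (as in the scouting matrix above)
S = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
w = np.array([0.4, 0.1, 0.3, 0.2])   # current weights w^(m), illustrative values
We_all = w @ S                        # Eq. (33): (W_e^(1), ..., W_e^(M)) = (w^(m))^T S
print(We_all, "-> select classifier", We_all.argmin())
```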
We have the following cases
If em → 1, all the samples were incorrectly classified.
Thus, we get
$$\lim_{e_{m}\rightarrow 1}\frac{1-e_{m}}{e_{m}}=0,\qquad\text{so } \alpha_{m}\rightarrow-\infty$$
Now
We get that, for every misclassified sample,
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{\alpha_{m}\right\}\rightarrow 0$$
Therefore, we only need to reverse the answers to get a perfect classifier and select it as the only committee member.
Now, the Last Case
If em → 1/2, we have αm → 0.
Thus, whether the sample is correctly or incorrectly classified,
$$\exp\left\{-\alpha_{m}t_{i}y_{m}\left(x_{i}\right)\right\}\rightarrow 1 \quad (34)$$
Therefore, the weights do not change at all.
Thus, we have
What about em → 0? We have that αm → +∞.
For samples that are always correctly classified,
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{-\alpha_{m}t_{i}y_{m}\left(x_{i}\right)\right\}\rightarrow 0$$
Thus, we only need m committee members; we do not need an (m + 1)-th member.
This comes from
The paper "Additive Logistic Regression: A Statistical View of Boosting" by Friedman, Hastie and Tibshirani.
Something Notable: the paper proves that boosting algorithms are procedures for fitting an additive logistic regression model
$$\mathbb{E}\left[y|x\right]=F\left(x\right)\qquad\text{with}\qquad F\left(x\right)=\sum_{m=1}^{M}f_{m}\left(x\right)$$
Consider the Additive Regression Model
We are interested in modeling the mean E[y|x] = F(x) with the additive model
$$F\left(x\right)=\sum_{i=1}^{d}f_{i}\left(x_{i}\right)$$
where each fi(xi) is a function of a single input feature xi.
A convenient algorithm for updating these models is the backfitting algorithm, with update
$$f_{i}\left(x_{i}\right)\leftarrow\mathbb{E}\left[y-\sum_{k\neq i}f_{k}\left(x_{k}\right)\;\Big|\;x_{i}\right]$$
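A rough backfitting sketch under stated assumptions: the per-coordinate conditional expectation is approximated with a simple bin-average smoother, and the additive toy target is invented for illustration; a real implementation would also center the component functions.

```python
import numpy as np

def bin_smoother(x, r, n_bins=10):
    """Approximate E[r | x] with bin averages; returns a callable f_i(x)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0 for b in range(n_bins)])
    return lambda xq: means[np.clip(np.searchsorted(edges, xq, side="right") - 1, 0, n_bins - 1)]

def backfit(X, y, n_iter=20, n_bins=10):
    """Cycle the update f_i(x_i) <- E[y - sum_{k != i} f_k(x_k) | x_i]."""
    N, d = X.shape
    fits = [lambda xq: np.zeros_like(xq, dtype=float) for _ in range(d)]
    for _ in range(n_iter):
        for i in range(d):
            partial = sum(fits[k](X[:, k]) for k in range(d) if k != i)
            fits[i] = bin_smoother(X[:, i], y - partial, n_bins)
    return fits

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)  # additive toy target
fits = backfit(X, y)
pred = sum(fits[i](X[:, i]) for i in range(2))
print("residual std:", np.std(y - pred))
```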
Remarks
An example of these additive models is matching pursuit,

    f(t) = Σ_{n=−∞}^{+∞} a_n g_{γ_n}(t)

Under fairly general conditions, backfitting converges to the minimizer of E[(y − f(x))^2].
102 / 132
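As an illustration, here is a minimal backfitting sketch. The conditional expectation E[· | x_i] must be estimated somehow; this sketch uses per-feature binned means as a crude stand-in, which is an assumption made only for illustration, not the smoother the backfitting literature would normally use.

    import numpy as np

    def backfit(X, y, n_iter=20, n_bins=10):
        # Fit F(x) = sum_i f_i(x_i) by cycling the update
        # f_i(x_i) <- E[y - sum_{k != i} f_k(x_k) | x_i],
        # estimating E[. | x_i] with binned means (a crude stand-in).
        n, d = X.shape
        edges = [np.quantile(X[:, i], np.linspace(0, 1, n_bins + 1)) for i in range(d)]
        bins = [np.clip(np.searchsorted(edges[i][1:-1], X[:, i]), 0, n_bins - 1)
                for i in range(d)]
        f_vals = [np.zeros(n_bins) for _ in range(d)]   # f_i on each bin
        fitted = [np.zeros(n) for _ in range(d)]        # f_i(x_i) per sample
        for _ in range(n_iter):
            for i in range(d):
                r = y - sum(fitted[k] for k in range(d) if k != i)  # partial residual
                for b in range(n_bins):
                    mask = bins[i] == b
                    if mask.any():
                        f_vals[i][b] = r[mask].mean()
                fitted[i] = f_vals[i][bins[i]]
        return fitted

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
    fitted = backfit(X, y)
    print("training MSE:", np.mean((y - sum(fitted)) ** 2))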
In the case of AdaBoost
We have an additive model which considers functions {f_m(x)}_{m=1}^{M} that take into account all the features (Perceptrons, Decision Trees, etc.). Each of these functions is characterized by a set of parameters γ_m and a multiplier α_m,

    f_m(x) = α_m y_m(x|γ_m)

with additive model

    F_M(x) = α_1 y_1(x|γ_1) + · · · + α_M y_M(x|γ_M)

103 / 132
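A minimal sketch of evaluating such an additive committee, assuming hypothetical decision stumps stand in for the base learners y_m(x|γ_m); any classifier returning ±1 would do.

    import numpy as np

    def F_M(x, base_learners, alphas):
        # F_M(x) = sum_m alpha_m * y_m(x | gamma_m), base learners return +/- 1
        return sum(a * y(x) for a, y in zip(alphas, base_learners))

    # hypothetical decision stumps on the first feature
    stumps = [lambda x, thr=t: np.where(x[:, 0] > thr, 1, -1) for t in (-0.5, 0.0, 0.5)]
    alphas = [0.8, 0.4, 0.2]

    x = np.random.default_rng(1).normal(size=(5, 2))
    print(np.sign(F_M(x, stumps, alphas)))   # committee decision: sign(F_M(x))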
Remark - Moving from Regression to Classification
Given that regression produces a wide range of outputs, logistic regression is widely used to move from regression to classification,

    log [ P(Y = 1|x) / P(Y = −1|x) ] = Σ_{m=1}^{M} f_m(x)

A nice property: the probability estimates lie in [0, 1]. Now, solving under the assumption P(Y = 1|x) + P(Y = −1|x) = 1,

    P(Y = 1|x) = e^{F(x)} / (1 + e^{F(x)})

105 / 132
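The link above is simply the logistic function applied to F(x); a small sketch:

    import numpy as np

    def prob_pos(F_x):
        # P(Y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}), the logistic link
        return np.exp(F_x) / (1.0 + np.exp(F_x))

    F_x = np.array([-3.0, 0.0, 3.0])
    p = prob_pos(F_x)
    print(p)                          # values in [0, 1]
    print(np.log(p / (1 - p)))        # recovers the log-odds F(x)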
The Exponential Criterion
We have our exponential criterion under an expected value, with y ∈ {1, −1},

    J(F) = E[e^{−yF(x)}]

Lemma: E[e^{−yF(x)}] is minimized at

    F(x) = (1/2) log [ P(Y = 1|x) / P(Y = −1|x) ]

Hence

    P(Y = 1|x) = e^{F(x)} / (e^{−F(x)} + e^{F(x)}),   P(Y = −1|x) = e^{−F(x)} / (e^{−F(x)} + e^{F(x)})

107 / 132
Proof
Given the discrete nature of y ∈ {1, −1},

    ∂E[e^{−yF(x)} | x] / ∂F(x) = −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)}

Setting this derivative to zero,

    −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} = 0

108 / 132
Then
We have that

    P(Y = 1|x) e^{−F(x)} = P(Y = −1|x) e^{F(x)} = [1 − P(Y = 1|x)] e^{F(x)}

Solving,

    e^{F(x)} = [ e^{−F(x)} + e^{F(x)} ] P(Y = 1|x)

109 / 132
Finally, we have
The first equation,

    P(Y = 1|x) = e^{F(x)} / (e^{−F(x)} + e^{F(x)})

and similarly

    P(Y = −1|x) = e^{−F(x)} / (e^{−F(x)} + e^{F(x)})

110 / 132
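A quick numerical check of the lemma at a single point x, using a hypothetical value for P(Y = 1|x): minimizing J(F) = p e^{−F} + (1 − p) e^{F} over a grid recovers F* = (1/2) log(p/(1 − p)), and the implied probability recovers p.

    import numpy as np

    p = 0.8                                    # hypothetical P(Y = 1 | x)
    F_grid = np.linspace(-3, 3, 20001)
    J = p * np.exp(-F_grid) + (1 - p) * np.exp(F_grid)

    F_numeric = F_grid[np.argmin(J)]
    F_closed = 0.5 * np.log(p / (1 - p))
    print(F_numeric, F_closed)                 # both approx 0.693
    print(np.exp(F_closed) / (np.exp(-F_closed) + np.exp(F_closed)))   # approx 0.8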
Basically
When you minimize the cost function E[e^{−yF(x)}], the optimum recovers exactly the binary classification probabilities of logistic regression.
111 / 132
Furthermore
Corollary: If E is replaced by averages over regions of x where F(x) is constant (similar to a decision tree), the same result applies to the sample proportions of y = 1 and y = −1.
112 / 132
Finally, The Additive Logistic Regression
Proposition: The AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of J(F) = E[e^{−yF(x)}].
Proof: Imagine you have an estimate F(x); we then seek an improved estimate F(x) + f(x).
114 / 132
For This
We minimize, at each x, J(F(x) + f(x)). This can be expanded as

    J(F(x) + f(x)) = E[e^{−y(F(x)+f(x))} | x]
                   = e^{−f(x)} E[e^{−yF(x)} I(y = 1) | x] + e^{f(x)} E[e^{−yF(x)} I(y = −1) | x]

115 / 132
Deriving w.r.t. f(x)
We get

    −e^{−f(x)} E[e^{−yF(x)} I(y = 1) | x] + e^{f(x)} E[e^{−yF(x)} I(y = −1) | x] = 0

116 / 132
We have the following
If we divide by E[e^{−yF(x)} | x], the first term becomes

    E[e^{−yF(x)} I(y = 1) | x] / E[e^{−yF(x)} | x] = E_w[I(y = 1) | x]

and also

    E[e^{−yF(x)} I(y = −1) | x] / E[e^{−yF(x)} | x] = E_w[I(y = −1) | x]

117 / 132
Thus, we have
We apply the natural log to both sides,

    log e^{−f(x)} + log E_w[I(y = 1) | x] = log e^{f(x)} + log E_w[I(y = −1) | x]

Then

    2f(x) = log E_w[I(y = 1) | x] − log E_w[I(y = −1) | x]

118 / 132
Finally
We have that

    f(x) = (1/2) log [ E_w[I(y = 1) | x] / E_w[I(y = −1) | x] ]

or, in terms of probabilities,

    f(x) = (1/2) log [ P_w(y = 1|x) / P_w(y = −1|x) ]

119 / 132
The Weight Update
Finally, we have a way to update the weights: setting w_t(x, y) = e^{−yF(x)},

    w_{t+1}(x, y) = w_t(x, y) e^{−yf(x)}

120 / 132
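A minimal sketch of one such stage on a fixed stump split, assuming the per-leaf weighted class proportions stand in for E_w[I(y = 1) | x]; this mirrors the Real AdaBoost step in the Friedman et al. paper, shown here only to illustrate f(x) = (1/2) log[P_w(y = 1|x)/P_w(y = −1|x)] followed by w ← w e^{−yf(x)}.

    import numpy as np

    def stagewise_step(X, y, w, feature=0, threshold=0.0, eps=1e-8):
        # f(x) = 0.5 * log(P_w(y = 1 | x) / P_w(y = -1 | x)), estimated per leaf,
        # then w <- w * exp(-y * f(x)), renormalized
        f = np.zeros(len(y))
        for leaf in (X[:, feature] <= threshold, X[:, feature] > threshold):
            p_pos = w[leaf & (y == 1)].sum() / max(w[leaf].sum(), eps)
            f[leaf] = 0.5 * np.log((p_pos + eps) / (1 - p_pos + eps))
        w_new = w * np.exp(-y * f)
        return f, w_new / w_new.sum()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
    w = np.ones(200) / 200
    f, w = stagewise_step(X, y, w)      # stage contribution f(x) and updated weights
    print(f[:3], w.sum())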
Additionally, the weighted conditional mean
Corollary: At the optimal F(x), the weighted conditional mean of y is 0.
Proof: When F(x) is optimal,

    ∂J(F(x))/∂F(x) = ∂[ P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} ] / ∂F(x)

121 / 132
Therefore
We have

    ∂J(F(x))/∂F(x) = −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} = −E[e^{−yF(x)} y | x]

Therefore, at the optimum,

    E[e^{−yF(x)} y | x] = 0

i.e., the weighted conditional mean of y is zero.
122 / 132
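A one-line numerical check of this corollary at a single x, with a hypothetical P(Y = 1|x): plugging the optimal F(x) = (1/2) log(p/(1 − p)) into the weighted mean gives zero.

    import numpy as np

    p = 0.7                                   # hypothetical P(Y = 1 | x)
    F_star = 0.5 * np.log(p / (1 - p))
    weighted_mean_y = p * np.exp(-F_star) * (+1) + (1 - p) * np.exp(F_star) * (-1)
    print(weighted_mean_y)                    # approx 0, up to floating point error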
Here, we decide to use Perceptrons as Weak Learners
We could use a finite number of Perceptrons, but we want an infinitude of possible weak learners, thus avoiding the need for a matrix S.
Remark: We need to use a gradient-based learner for this.
124 / 132
Perceptron
We use the following error over the samples (here w_j(t) denotes the boosting weight of sample j at round t, while w(t) denotes the Perceptron weights),

    E(w) = (1/2) Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j)^2,   with y_j(t) = ϕ(w^T(t) x_j)

Deriving with respect to w_i,

    ∂E(w)/∂w_i = Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j) ϕ'(w^T(t) x_j) w_j(t) x_{ij}

Then, using gradient descent, we have the following update,

    w_i(n + 1) = w_i(n) − η Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j) ϕ'(w^T(t) x_j) w_j(t) x_{ij}

125 / 132
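A minimal sketch of such a gradient-based weak learner, assuming tanh as the differentiable ϕ and, for simplicity, the weighted-cost variant (1/2) Σ_j w_j(t)(y_j(t) − d_j)^2 mentioned on the final slide rather than the exact form above.

    import numpy as np

    def train_weighted_perceptron(X, d, sample_w, eta=0.1, epochs=200):
        # gradient descent on (1/2) * sum_j w_j * (tanh(w^T x_j) - d_j)^2
        n, dim = X.shape
        Xb = np.hstack([X, np.ones((n, 1))])      # append a bias input
        w = np.zeros(dim + 1)
        for _ in range(epochs):
            y = np.tanh(Xb @ w)                   # y_j = phi(w^T x_j)
            phi_prime = 1.0 - y ** 2              # derivative of tanh
            grad = Xb.T @ (sample_w * (y - d) * phi_prime)
            w -= eta * grad                       # w(n+1) = w(n) - eta * gradient
        return w

    def predict(w, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.sign(np.tanh(Xb @ w))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    d = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)
    sample_w = np.ones(300) / 300                 # uniform boosting weights at round 1
    w = train_weighted_perceptron(X, d, sample_w)
    print("weighted error:", np.sum(sample_w * (predict(w, X) != d)))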
Data Set
Training set with classes ω1 = N(0, 1) and ω2 = N(0, σ^2) − N(0, 1).
126 / 132
At the end of the process
For m = 80.
129 / 132
Final Confusion Matrix
When m = 80:

            C1     C2
    C1     1.0    0.0
    C2     0.0    1.0

130 / 132
However
There are other readings of the cryptic phrase in "Boosting: Foundations and Algorithms" by Schapire and Freund, "Train weak learner using distribution D_t": we could re-sample using the distribution w_t, basically using sampling with replacement over the data set {x_1, x_2, ..., x_N}.
131 / 132
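A minimal sketch of that re-sampling interpretation, assuming the boosting distribution w_t sums to one:

    import numpy as np

    def resample_by_weights(X, d, w, rng):
        # draw N indices with replacement, with probabilities given by w_t
        idx = rng.choice(len(X), size=len(X), replace=True, p=w)
        return X[idx], d[idx]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    d = np.where(X[:, 0] > 0, 1, -1)
    w = np.ones(10) / 10                   # the distribution w_t
    X_s, d_s = resample_by_weights(X, d, w, rng)
    # the weak learner is then trained on (X_s, d_s) with an ordinary, unweighted cost
    print(X_s.shape, d_s)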
Other Interpretations Exist
But you can use a weighted version of the cost function,

    (1/2) Σ_j w_j(t) (y_j(t) − d_j)^2

For more, take a look at "Boosting Neural Networks" by Holger Schwenk and Yoshua Bengio.
132 / 132
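And the weighted cost itself, evaluated on hypothetical values:

    import numpy as np

    def weighted_squared_error(w_t, y, d):
        # (1/2) * sum_j w_j(t) * (y_j(t) - d_j)^2
        return 0.5 * np.sum(w_t * (y - d) ** 2)

    w_t = np.array([0.5, 0.3, 0.2])           # hypothetical boosting weights
    y = np.array([0.9, -0.8, 0.2])            # weak learner outputs y_j(t)
    d = np.array([1.0, -1.0, -1.0])           # targets d_j
    print(weighted_squared_error(w_t, y, d))  # 0.1525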