Combining Models - In these slides, we look at a way to combine the answers from various weak classifiers to build a robust classifier. The slides cover the following subjects:
1. Model Combination vs. Bayesian Model Averaging
2. Bootstrap Data Sets
And, the cherry on top, AdaBoost.
Here is a review of the combination of machine learning models, from Bayesian averaging and committees to boosting. In particular, a statistical analysis of boosting is given.
24 Machine Learning Combining Models - AdaBoost
1. Machine Learning for Data Mining
Combining Models
Andres Mendez-Vazquez
July 20, 2015
2. Outline
1 Introduction
2 Bayesian Model Averaging
   Model Combination vs. Bayesian Model Averaging
3 Committees
   Bootstrap Data Sets
4 Boosting
   AdaBoost Development
   Cost Function
   Selection Process
   Selecting New Classifiers
   Using our Deriving Trick
   AdaBoost Algorithm
   Some Remarks and Implementation Issues
   Explanation about AdaBoost's behavior
   Example
3. Introduction
Observation
It is often found that improved performance can be obtained by combining multiple classifiers together in some way.
Example: Committees
We might train L different classifiers and then make predictions using the average of the predictions made by each classifier.
Example: Boosting
It involves training multiple models in sequence, in which the error function used to train a particular model depends on the performance of the previous models.
6. In addition
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different models become responsible for making decisions in different regions of the input space.
10. Example: Decision Trees
We can have the decision trees on top of the models
Given a set of models, one model is chosen to make the decision in a certain area of the input space.
Limitation: This is based on hard splits in which only one model is responsible for making predictions for any given input value.
Thus, it is better to soften the combination by using the following setup
Suppose we have M classifiers, each given by a conditional distribution p(t|x, k), where
x is the input variable,
t is the target variable,
k = 1, 2, ..., M indexes the classifiers.
17. This is used in the mixture of distributions
Thus (Mixture of Experts)

p(t|x) = \sum_{k=1}^{M} π_k(x) p(t|x, k)    (1)

where π_k(x) = p(k|x) are the input-dependent mixing coefficients.
This type of model
It can be viewed as a mixture distribution in which both the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.
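To make Eq. (1) concrete, here is a minimal numerical sketch in Python; the two sigmoid experts and the softmax gating function below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: M = 2 experts over a binary target t in {-1, +1}.
# Each expert k returns p(t = +1 | x, k); here they are fixed toy rules.
experts = [
    lambda x: 1.0 / (1.0 + np.exp(-2.0 * x)),   # expert 1: trusts large x
    lambda x: 1.0 / (1.0 + np.exp(+2.0 * x)),   # expert 2: trusts small x
]

def mixture_of_experts(x, gate_weights):
    # pi_k(x) = p(k | x): input-dependent mixing coefficients, Eq. (1).
    pi = softmax(np.array([w * x for w in gate_weights]))
    p_t1 = np.array([f(x) for f in experts])
    return float(np.dot(pi, p_t1))              # p(t = +1 | x)

print(mixture_of_experts(x=1.5, gate_weights=[1.0, -1.0]))
```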
23. Example
For this, consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs:

p(x, z)    (2)

Thus
The corresponding density over the observed variable x is obtained by marginalization:

p(x) = \sum_{z} p(x, z)    (3)
25. In the case of Gaussian distributions
We have

p(x) = \sum_{k=1}^{M} π_k N(x|µ_k, Σ_k)    (4)

Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}

p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n)    (5)

Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h).
For instance, one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
28. Bayesian Model Averaging
The marginal distribution is

p(X) = \sum_{h=1}^{H} p(X, h) = \sum_{h=1}^{H} p(X|h) p(h)    (6)

Observations
1. This is an example of Bayesian model averaging.
2. The summation over h means that just one model is responsible for generating the whole data set.
3. The probability over h simply reflects our uncertainty about which is the correct model to use.
Thus
As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
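A minimal numeric illustration of Eq. (6) and of how p(h|X) concentrates, using NumPy and SciPy; the two candidate models, their priors, and the sample size are toy choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=50)      # observed data set

# Two candidate models h = 1, 2 with prior p(h) = 1/2 each.
log_lik = np.array([
    stats.norm.logpdf(X, 0.0, 1.0).sum(),        # p(X | h=1): Gaussian
    stats.cauchy.logpdf(X, 0.0, 1.0).sum(),      # p(X | h=2): Cauchy
])
log_prior = np.log([0.5, 0.5])

# Posterior p(h | X) is proportional to p(X | h) p(h).
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(post)   # with N = 50 points, nearly all mass sits on the Gaussian
```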
33. The Differences
Bayesian model averaging
The whole data set is generated by a single model.
Model combination
Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components.
35. Committees
Idea
The simplest way to construct a committee is to average the predictions of a set of individual models.
Thinking as a frequentist
The motivation comes from the trade-off between bias and variance, which decomposes the error of a model into:
The bias component, which arises from differences between the model and the true function to be predicted.
The variance component, which represents the sensitivity of the model to the individual data points.
39. For example
We can do the following
When we average a set of low-bias models, we obtain accurate predictions of the underlying sinusoidal function from which the data were generated.
40. However
Big Problem
Normally, we have only a single data set.
Thus
We need to introduce some variability between the different committee members.
One approach
You can use bootstrap data sets.
43. What to do with the Bootstrap Data Set
Given a regression problem
Here, we are trying to predict the value of a single continuous variable.
What?
Suppose our original data set consists of N data points X = {x_1, ..., x_N}.
Thus
We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B.
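A minimal sketch of this resampling step with NumPy (the seed and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10)                          # original data set, N = 10 points

# Draw N points at random from X with replacement, as described above.
X_B = rng.choice(X, size=len(X), replace=True)
print(X_B)                                 # some points are replicated...
print(set(X) - set(X_B))                   # ...others are absent from X_B
```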
47. Thus
Use each bootstrap data set to train a copy y_m(x) of a predictive regression model for the single continuous variable.
Then,

y_{com}(x) = \frac{1}{M} \sum_{m=1}^{M} y_m(x)    (7)

This is also known as Bootstrap Aggregation, or Bagging.
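A minimal bagging sketch of Eq. (7), reusing the sinusoidal setting mentioned earlier; the degree-9 polynomial base model and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 40, 25
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)   # noisy sinusoid

# Train M copies y_m(x) on bootstrap replicas, then average them, Eq. (7).
models = []
for _ in range(M):
    idx = rng.integers(0, N, size=N)                   # bootstrap data set
    models.append(np.polynomial.Polynomial.fit(x[idx], t[idx], deg=9))

y_com = lambda xq: np.mean([p(xq) for p in models], axis=0)
print(y_com(np.array([0.25, 0.5, 0.75])))              # close to sin(2*pi*x)
```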
48. What do we do with these samples?
Now, assume a true regression function h(x) and an estimate y_m(x)

y_m(x) = h(x) + ε_m(x)    (8)

The average sum-of-squares error over the data takes the form

E_x[{y_m(x) − h(x)}^2] = E_x[ε_m(x)^2]    (9)

What is E_x?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
51. Meaning
Thus, the average error is

E_{AV} = \frac{1}{M} \sum_{m=1}^{M} E_x[ε_m(x)^2]    (10)

Similarly, the expected error of the committee is

E_{COM} = E_x[{\frac{1}{M} \sum_{m=1}^{M} (y_m(x) − h(x))}^2] = E_x[{\frac{1}{M} \sum_{m=1}^{M} ε_m(x)}^2]    (11)
53. Assume that the errors have zero mean and are uncorrelated

E_x[ε_m(x)] = 0
E_x[ε_m(x) ε_l(x)] = 0, for m ≠ l
54. Then
We have that

E_{COM} = \frac{1}{M^2} E_x[(\sum_{m=1}^{M} ε_m(x))^2]
        = \frac{1}{M^2} E_x[\sum_{m=1}^{M} ε_m^2(x) + \sum_{h=1}^{M} \sum_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]
        = \frac{1}{M^2} E_x[\sum_{m=1}^{M} ε_m^2(x)] + \frac{1}{M^2} \sum_{h=1}^{M} \sum_{k=1, k≠h}^{M} E_x[ε_h(x) ε_k(x)]
        = \frac{1}{M^2} E_x[\sum_{m=1}^{M} ε_m^2(x)]   (the cross terms vanish, since the errors are uncorrelated)
        = \frac{1}{M} · \frac{1}{M} \sum_{m=1}^{M} E_x[ε_m^2(x)]
59. We finally obtain
We obtain

E_{COM} = \frac{1}{M} E_{AV}    (12)

Looks great, BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated.
61. Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so

E_{COM} ≤ E_{AV}    (13)

We need something better
A more sophisticated technique known as boosting.
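Before moving on to boosting, a quick Monte Carlo check of Eqs. (12) and (13) under the stated assumptions; the error distributions, the committee size, and the correlation level below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 100_000                       # committee size, Monte Carlo samples

# Uncorrelated zero-mean errors: E_COM is close to E_AV / M, Eq. (12).
eps = rng.normal(size=(M, n))
E_av = (eps**2).mean(axis=1).mean()
E_com = (eps.mean(axis=0)**2).mean()
print(E_com, E_av / M)                   # both close to 0.1

# Highly correlated errors: the reduction largely disappears, but
# E_COM <= E_AV still holds, Eq. (13).
shared = rng.normal(size=n)
eps = 0.9 * shared + 0.1 * rng.normal(size=(M, n))
E_av = (eps**2).mean(axis=1).mean()
E_com = (eps.mean(axis=0)**2).mean()
print(E_com, E_av)                       # close to each other
```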
65. Boosting
What does Boosting do?
It combines several classifiers to produce a form of committee.
We will describe AdaBoost
"Adaptive Boosting", developed by Freund and Schapire (1995).
67. Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1. Samples x_1, x_2, ..., x_N
2. Binary labels t_1, t_2, ..., t_N, with t_i ∈ {−1, 1}
69. Cost Function
Now
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!!
Thus
For each pattern x_i, each expert classifier outputs a classification y_j(x_i) ∈ {−1, 1}.
The final decision of the committee of M experts is sign(C(x_i)), where

C(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_M y_M(x_i)    (14)
74. Getting the correct classifiers
We want the following
To review possible committee members.
To select them, if they have certain properties.
To assign a weight to their contribution to the set of experts.
77. Now
Selection is done in the following way
We test the classifiers in the pool using a training set T of N multidimensional data points x_i:
For each point x_i we have a label t_i = 1 or t_i = −1.
Assigning a cost for actions
We test and rank all classifiers in the expert pool by:
Charging a cost exp{β} any time a classifier fails (a miss).
Charging a cost exp{−β} any time a classifier provides the right label (a hit).
82. Remarks about β
We require β > 0
Thus, misses are penalized more heavily than hits.
Although
It may look strange to penalize hits, this is harmless as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}.
Why?
Exercise: show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. Thus, the exponential form does not compromise generality.
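One way to work that exercise (a sketch; the key observation is that rescaling both costs by a common positive constant does not change the ranking of the classifiers): dividing both costs by \sqrt{ab} > 0 gives a' = a/\sqrt{ab} = \sqrt{a/b} and b' = b/\sqrt{ab} = \sqrt{b/a} = 1/a'. Now pick any c > 1 and set d = \log_c \sqrt{a/b}, which is positive since a > b. Then a' = c^d and b' = c^{−d}; with c = e, this is exactly exp{β} and exp{−β} for β = (1/2) ln(a/b) > 0.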
85. Exponential Loss Function
Something Notable
This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
87. The Matrix S
When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.
Row i in the matrix is reserved for the data point x_i; column m is reserved for the mth classifier in the pool.

              Classifiers
         1    2    · · ·    M
 x_1     0    1    · · ·    1
 x_2     0    0    · · ·    1
 x_3     1    1    · · ·    0
 ...     ...  ...           ...
 x_N     0    0    · · ·    0
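A minimal sketch of building S with NumPy; the pool of random linear classifiers and the data below are toy stand-ins for the slides' classifier pool:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 3
X = rng.normal(size=(N, 2))              # N data points
t = rng.choice([-1, 1], size=N)          # labels in {-1, 1}

# A toy pool of M "classifiers": the sign of a random linear projection each.
pool = [rng.normal(size=2) for _ in range(M)]
predict = lambda w: np.sign(X @ w)

# S[i, m] = 1 if classifier m misses point x_i, 0 if it hits.
S = np.stack([(predict(w) != t).astype(int) for w in pool], axis=1)
print(S)                                 # an N x M miss/hit matrix
```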
90. Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
Thus, at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want a total sum of 1 for all weights.
94. The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
Basically
The selection process concentrates on picking new classifiers for the committee by focusing on those which can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
98. Selecting New Classifiers
What we want
In each iteration, we rank all classifiers, so that we can select the current best out of the pool.
At the mth iteration
We have already included m − 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee:

C^{(m−1)}(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_{m−1} y_{m−1}(x_i)    (15)
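A quick sketch of what Eq. (15) computes (a minimal illustration, with made-up names, not from the slides): given the already-chosen classifiers and their coefficients, the committee output at a point is just the weighted sum of the individual votes.

```python
def committee_output(alphas, classifiers, x):
    """Weighted committee score C^(m-1)(x) = sum_k alpha_k * y_k(x).

    `classifiers` is a list of callables returning +1/-1 and `alphas`
    are their coefficients; both names are illustrative.
    """
    return sum(a * y(x) for a, y in zip(alphas, classifiers))

# Example with two toy stumps on a scalar input
stumps = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 2 else -1]
print(committee_output([0.8, 0.3], stumps, 1.5))  # 0.8*1 + 0.3*(-1) = 0.5
```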
Thus, we have that
Extending the cost function

C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i)   (16)

At the first iteration, m = 1
C^{(m-1)} is the zero function.
Thus, the total cost or total error is defined as the exponential error

E = \sum_{i=1}^{N} \exp\left\{ -t_i \left[ C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \right] \right\}   (17)
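The exponential error in Eq. (17) is straightforward to evaluate numerically; here is a minimal sketch (targets t_i ∈ {−1, +1}; all names are illustrative):

```python
import numpy as np

def exponential_error(t, committee_scores, alpha_m, y_m_preds):
    """E = sum_i exp(-t_i * (C^(m-1)(x_i) + alpha_m * y_m(x_i))).

    `t` holds the +1/-1 targets, `committee_scores` the current
    C^(m-1)(x_i), and `y_m_preds` the +1/-1 predictions of the
    candidate classifier (illustrative names).
    """
    return np.sum(np.exp(-t * (committee_scores + alpha_m * y_m_preds)))

t = np.array([1, -1, 1])
C_prev = np.zeros(3)         # first iteration: C^(0) is the zero function
y_m = np.array([1, -1, -1])  # candidate predictions; one miss
print(exponential_error(t, C_prev, 0.5, y_m))  # 2*e^{-0.5} + e^{0.5}
```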
Thus
We want to determine
αm and ym in an optimal way.
Thus, rewriting

E = \sum_{i=1}^{N} w_i^{(m)} \exp\left\{ -t_i \alpha_m y_m(x_i) \right\}   (18)

where, for i = 1, 2, ..., N,

w_i^{(m)} = \exp\left\{ -t_i C^{(m-1)}(x_i) \right\}   (19)
Thus
In the first iteration, w_i^{(1)} = 1 for i = 1, ..., N.
During later iterations, the vector w^{(m)} represents the weight assigned to
each data point in the training set at iteration m.
Splitting equation (Eq. 18)

E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\}   (20)

Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all
misses.
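The split in Eq. (20) introduces the two quantities used throughout the rest of the derivation; a minimal sketch of computing them (illustrative names):

```python
import numpy as np

def split_weights(w, t, y_pred):
    """Return (Wc, We): total weight of hits and of misses (Eq. 20)."""
    hits = (t == y_pred)
    Wc = w[hits].sum()    # weighted cost of all correctly classified points
    We = w[~hits].sum()   # weighted cost of all misclassified points
    return Wc, We

w = np.array([0.25, 0.25, 0.25, 0.25])
t = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])   # one miss
print(split_weights(w, t, y_pred))   # (0.75, 0.25)
```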
Now
Writing the first summand as Wc exp{−αm} and the second as
We exp{αm}:

E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (21)

Now, for the selection of ym, the exact value of αm > 0 is irrelevant
Since, for a fixed αm, minimizing E is equivalent to minimizing exp{αm} E, and

\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}   (22)
Then
Since exp{2αm} > 1, we can rewrite (Eq. 22) as

\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}   (23)

Thus

\exp\{\alpha_m\} E = (W_c + W_e) + W_e \left( \exp\{2\alpha_m\} - 1 \right)   (24)

Now
Wc + We is the total sum W of the weights of all data points, which
is constant in the current iteration.
Thus
The right-hand side of the equation is minimized
when, at the m-th iteration, we pick the classifier with the lowest total cost
We (that is, the lowest weighted error).
Intuitively
The next selected ym should be the one with the lowest penalty given the
current set of weights.
Differentiating with respect to the weight αm
Going back to the original E, we can use the derivative trick:

\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (25)

Setting this equal to zero and multiplying by exp{αm}:

-W_c + W_e \exp\{2\alpha_m\} = 0   (26)

The optimal value is thus

\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_e}   (27)
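A minimal numeric sketch of Eq. (27), reusing the Wc, We split from the earlier sketch (names illustrative):

```python
import numpy as np

def optimal_alpha(Wc, We):
    """alpha_m = (1/2) * ln(Wc / We), Eq. (27)."""
    return 0.5 * np.log(Wc / We)

print(optimal_alpha(0.75, 0.25))  # 0.5 * ln(3) ~= 0.549
```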
Now
Writing the total sum of all weights as

W = W_c + W_e   (28)

we can rewrite the previous equation as

\alpha_m = \frac{1}{2} \ln \frac{W - W_e}{W_e} = \frac{1}{2} \ln \frac{1 - e_m}{e_m}   (29)

with the weighted error rate of the data points

e_m = \frac{W_e}{W}   (30)
What about the weights?
Using the equation

w_i^{(m)} = \exp\left\{ -t_i C^{(m-1)}(x_i) \right\}   (31)

and because we now have αm and ym(xi):

w_i^{(m+1)} = \exp\left\{ -t_i C^{(m)}(x_i) \right\}
            = \exp\left\{ -t_i \left[ C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \right] \right\}
            = w_i^{(m)} \exp\left\{ -t_i \alpha_m y_m(x_i) \right\}
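This multiplicative update is one line of code; a minimal sketch (illustrative names), reusing the weights, targets, and predictions from the earlier sketches:

```python
import numpy as np

def update_weights(w, t, y_pred, alpha_m):
    """w_i^(m+1) = w_i^(m) * exp(-t_i * alpha_m * y_m(x_i))."""
    return w * np.exp(-t * alpha_m * y_pred)

w = np.array([0.25, 0.25, 0.25, 0.25])
t = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])   # second point is a miss
print(update_weights(w, t, y_pred, 0.549))
# hits shrink (* exp(-0.549)), the miss grows (* exp(+0.549))
```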
Sequential Training
Thus
AdaBoost trains each new classifier on a data set in which the weighting
coefficients are adjusted according to the performance of the previously
trained classifiers, so as to give greater weight to the misclassified data
points.
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} to 1/N.
Step 2
For m = 1, 2, ..., M:
1 Fit a weak classifier ym(x) to the training data by minimizing the weighted
error function, i.e.

\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I\left( y_m(x_i) \neq t_i \right) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e   (32)

where I is an indicator function.
2 Evaluate

e_m = \frac{ \sum_{n=1}^{N} w_n^{(m)} I\left( y_m(x_n) \neq t_n \right) }{ \sum_{n=1}^{N} w_n^{(m)} }   (33)

where, again, I is an indicator function.
AdaBoost Algorithm
Step 3
Set the αm weight to

\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}   (34)

Now update the weights of the data for the next iteration.
If ti ≠ ym(xi), i.e. a miss:

w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}   (35)

If ti = ym(xi), i.e. a hit:

w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}   (36)
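Putting Steps 1–3 together: below is a minimal sketch of the whole algorithm using decision stumps as weak learners (scikit-learn's DecisionTreeClassifier with max_depth=1 accepts per-sample weights). All names are illustrative, and the early stop at e_m ≥ 0.5 is a common implementation choice rather than part of the derivation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M=10):
    """Minimal AdaBoost sketch following Steps 1-3; t must be +1/-1."""
    N = len(t)
    w = np.full(N, 1.0 / N)               # Step 1: uniform weights
    alphas, classifiers = [], []
    for m in range(M):                    # Step 2: for m = 1, ..., M
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, t, sample_weight=w)  # fit minimizing weighted error
        pred = stump.predict(X)
        miss = (pred != t)
        e_m = w[miss].sum() / w.sum()     # Eq. (33)
        if e_m <= 0 or e_m >= 0.5:        # degenerate or no better than chance
            break
        alpha = 0.5 * np.log((1 - e_m) / e_m)  # Eq. (34), Step 3
        w = w * np.exp(-alpha * t * pred)      # Eqs. (35)-(36) in one line
        alphas.append(alpha)
        classifiers.append(stump)
    return alphas, classifiers

def adaboost_predict(alphas, classifiers, X):
    """Sign of the committee output C^(M)(x)."""
    scores = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(scores)
```

Note that the single line `w * np.exp(-alpha * t * pred)` covers both cases of Step 3: the exponent is −αm on hits (t·pred = +1) and +αm on misses (t·pred = −1).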
Observations
First
The first base classifier is trained by the usual procedure of training a
single classifier.
Second
From (Eq. 35) and (Eq. 36), we can see that the weighting coefficients are
increased for data points that are misclassified.
Third
The quantity em represents a weighted measure of the error rate.
Thus αm gives more weight to the more accurate classifiers.
In addition
The pool of classifiers in Step 1 can be substituted by a family of
classifiers,
one whose members are trained to minimize the error function given the
current weights.
Now
If a finite set of classifiers is given, we only need to test each
classifier once on each data point.
The Scouting Matrix S
It can be reused at each iteration: multiplying the transposed weight
vector w(m) with S yields the We of each machine.
We have then
The following

\left( W_e^{(1)}, W_e^{(2)}, \cdots, W_e^{(M)} \right) = \left( w^{(m)} \right)^{T} S   (38)

This allows us to reformulate the weight update step so that
only misses lead to weight modification.
Note
The weight vector w(m) is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative
construction is more efficient and simpler to implement.
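A minimal sketch of the scouting-matrix idea (the name S follows the slides; everything else is illustrative): precompute, once, an N × K indicator matrix whose entry (i, k) is 1 if classifier k misclassifies point i; then each iteration's weighted errors are a single matrix product, as in Eq. (38).

```python
import numpy as np

def scouting_matrix(t, all_preds):
    """S[i, k] = 1 if classifier k misclassifies point i, else 0.

    `all_preds` has shape (N, K): column k holds the +1/-1
    predictions of the k-th classifier in the pool (illustrative).
    """
    return (all_preds != t[:, None]).astype(float)

t = np.array([1, -1, 1])
all_preds = np.array([[ 1, -1],
                      [-1,  1],
                      [ 1,  1]])   # a pool of two classifiers
S = scouting_matrix(t, all_preds)
w = np.full(3, 1.0 / 3)
print(w @ S)  # We for each machine, as in Eq. (38): [0., 0.666...]
```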
We have the following cases
First case
If em → 1, then essentially all the samples were misclassified!!!
Thus
Since lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0, we get αm → −∞.
Finally
For every misclassified sample, w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} → 0.
We only need to reverse the answers to get the perfect classifier and
select it as the only committee member.
Thus, we have
If em → 1/2
We have αm → 0; thus, whether a sample is classified correctly or not,

\exp\left\{ -\alpha_m t_i y_m(x_i) \right\} \to 1   (39)

so the weights do not change at all.
What about em → 0?
We have that αm → +∞.
Thus, we have
For samples that are always correctly classified,

w_i^{(m+1)} = w_i^{(m)} \exp\left\{ -\alpha_m t_i y_m(x_i) \right\} \to 0

Thus, the only committee member that we need is m; we do not need m + 1.
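These limiting cases are easy to check numerically; a small sketch of how αm = ½ ln((1 − em)/em) behaves near em = 0, 1/2, and 1:

```python
import numpy as np

def alpha(e_m):
    """alpha_m = 0.5 * ln((1 - e_m) / e_m), Eq. (34)."""
    return 0.5 * np.log((1 - e_m) / e_m)

for e_m in [0.01, 0.5, 0.99]:
    print(f"e_m = {e_m:4.2f} -> alpha_m = {alpha(e_m):+.3f}")
# e_m = 0.01 -> alpha_m = +2.298  (near-perfect classifier, large weight)
# e_m = 0.50 -> alpha_m = +0.000  (no better than chance, ignored)
# e_m = 0.99 -> alpha_m = -2.298  (reverse its answers: near-perfect)
```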
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example