1. Model Combination vs. Bayesian Model Averaging

2. Bootstrap Data Sets

And the cherry on top: AdaBoost.


- 1. Machine Learning for Data Mining: Combining Models. Andres Mendez-Vazquez, July 20, 2015.
- 2. Outline: 1. Introduction. 2. Bayesian Model Averaging (Model Combination vs. Bayesian Model Averaging). 3. Committees (Bootstrap Data Sets). 4. Boosting (AdaBoost Development: Cost Function, Selection Process, Selecting New Classifiers Using Our Deriving Trick; AdaBoost Algorithm; Some Remarks and Implementation Issues; Explanation of AdaBoost's Behavior; Example).
- 3. Introduction. Observation: improved performance can often be obtained by combining multiple classifiers in some way. Example (committees): we might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example (boosting): train multiple models in sequence, where the error function used to train a particular model depends on the performance of the previous models.
- 6. In addition: instead of averaging the predictions of a set of models, you can use an alternative form of combination that selects one of the models to make the prediction. The choice of model is a function of the input variables, so different models become responsible for making decisions in different regions of the input space.
- 9. We want this: models in charge of different sets of inputs.
- 10. Example, decision trees: we can place a decision tree on top of the models, so that, given a set of models, one model is chosen to make the decision in a certain region of the input space. Limitation: this is based on hard splits, in which only one model is responsible for making predictions for any given input value. It is therefore better to soften the combination: suppose we have M classifiers for a conditional distribution p(t|x, k), where x is the input variable, t is the target variable, and k = 1, 2, ..., M indexes the classifiers.
- 17. This leads to a mixture of distributions (mixture of experts): p(t|x) = Σ_{k=1}^{M} π_k(x) p(t|x, k) (1), where π_k(x) = p(k|x) are the input-dependent mixing coefficients. Such models can be viewed as mixture distributions in which both the component densities and the mixing coefficients are conditioned on the input variables; they are known as mixtures of experts.
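The gating-plus-experts structure of Eq. (1) can be sketched in a few lines of Python. The hard gate and the two fixed Gaussian experts below are illustrative stand-ins, not trained models:

```python
import numpy as np

def gauss(t, mu, s=1.0):
    """Gaussian density N(t | mu, s^2)."""
    return np.exp(-0.5 * ((t - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_of_experts_density(x, t, gates, experts):
    """p(t|x) = sum_k pi_k(x) * p(t|x, k), i.e. Eq. (1)."""
    pi = gates(x)  # input-dependent mixing coefficients pi_k(x) = p(k|x)
    return sum(pi[k] * experts[k](t, x) for k in range(len(experts)))

# Hypothetical hard gate: expert 0 owns x < 0, expert 1 owns x >= 0.
gates = lambda x: np.array([1.0, 0.0]) if x < 0 else np.array([0.0, 1.0])
experts = [lambda t, x: gauss(t, -1.0), lambda t, x: gauss(t, +1.0)]

p = mixture_of_experts_density(-2.0, -1.0, gates, experts)  # expert 0 active
```

A soft gate (e.g. a softmax over x) would blend the experts instead of switching between them, which is exactly the softening the slide argues for.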
- 20. Although model combination and Bayesian model averaging look similar, they are actually different, as the following example shows.
- 23. Example: consider a mixture of Gaussians with a binary latent variable z indicating which component a point belongs to, with joint density p(x, z) (2). The corresponding density over the observed variable x is obtained by marginalization: p(x) = Σ_z p(x, z) (3).
- 25. In the case of Gaussian components we have p(x) = Σ_{k=1}^{M} π_k N(x|µ_k, Σ_k) (4). For independent, identically distributed data X = {x_1, x_2, ..., x_N}: p(X) = Π_{n=1}^{N} p(x_n) = Π_{n=1}^{N} Σ_{z_n} p(x_n, z_n) (5). Now suppose we have several different models indexed by h = 1, ..., H with prior probabilities p(h); for instance, one model might be a mixture of Gaussians and another a mixture of Cauchy distributions.
- 28. Bayesian model averaging: the marginal distribution is p(X) = Σ_{h=1}^{H} p(X, h) = Σ_{h=1}^{H} p(X|h) p(h) (6). Observations: (1) this is an example of Bayesian model averaging; (2) the summation over h means that just one model is responsible for generating the whole data set; (3) the distribution over h simply reflects our uncertainty about which model is the correct one. As the size of the data set increases, this uncertainty is reduced, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
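The claim that p(h|X) concentrates on one model as the data set grows can be checked numerically. The two candidate models below (Gaussians of different widths, equal priors) are an illustrative choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_gauss(x, mu, s):
    """Elementwise log N(x | mu, s^2)."""
    return -0.5 * np.log(2.0 * np.pi * s**2) - 0.5 * ((x - mu) / s) ** 2

# Data truly generated by model h=1: N(0, 1). Model h=2 is N(0, 4).
x = rng.normal(0.0, 1.0, size=500)

posteriors = {}
for N in (5, 50, 500):
    ll1 = log_gauss(x[:N], 0.0, 1.0).sum()   # log p(X|h=1)
    ll2 = log_gauss(x[:N], 0.0, 2.0).sum()   # log p(X|h=2)
    # With equal priors p(h), Eq. (6) gives p(h=1|X) = 1 / (1 + exp(ll2 - ll1)).
    posteriors[N] = 1.0 / (1.0 + np.exp(ll2 - ll1))
```

As N grows, `posteriors[N]` approaches 1: the posterior over h focuses on the single model that generated the whole data set.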
- 33. The differences. Bayesian model averaging: the whole data set is generated by a single model. Model combination: different data points within the data set can potentially be generated from different values of the latent variable z, and hence by different components.
- 35. Committees. Idea: the simplest way to construct a committee is to average the predictions of a set of individual models. From a frequentist perspective, this is motivated by the trade-off between bias and variance, which decomposes the error of a model into: the bias component, arising from differences between the model and the true function to be predicted; and the variance component, representing the sensitivity of the model to the individual data points.
- 39. For example, when we average a set of low-bias models, we obtain accurate predictions of the underlying sinusoidal function from which the data were generated.
- 40. However, there is a big problem: we normally have only a single data set. We therefore need to introduce some variability between the different committee members. One approach is to use bootstrap data sets.
- 43. What to do with the bootstrap data set. Given a regression problem, we are trying to predict the value of a single continuous variable. Suppose our original data set consists of N data points X = {x_1, ..., x_N}. We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B.
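Drawing a bootstrap data set X_B can be sketched as follows; this is a minimal NumPy version, and the toy data and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)                      # original data set, N = 10 points

# Draw N indices with replacement: some points repeat, others are absent.
idx = rng.integers(0, len(X), size=len(X))
X_B = X[idx]

# On average, only about 63.2% of the original points appear in X_B.
unique_fraction = len(np.unique(X_B)) / len(X)
```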
- 46. Furthermore, this is repeated M times, so that we generate M data sets, each of size N and each obtained by sampling from the original data set X.
- 47. We then use each of them to train a copy y_m(x) of a predictive regression model for the single continuous target, and combine the predictions: y_com(x) = (1/M) Σ_{m=1}^{M} y_m(x) (7). This is also known as bootstrap aggregation, or bagging.
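A minimal bagging sketch of Eq. (7). The polynomial base model and the sinusoidal toy data are illustrative assumptions, not specified in the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 20
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)   # noisy targets

# Fit one cubic polynomial per bootstrap sample of the single data set.
models = []
for _ in range(M):
    idx = rng.integers(0, N, size=N)                  # resample with replacement
    models.append(np.polynomial.Polynomial.fit(x[idx], t[idx], deg=3))

# Committee prediction, Eq. (7): the average of the M fitted models.
y_com = lambda x0: sum(m(x0) for m in models) / M
pred = y_com(0.25)
```

Each `models[m]` sees a different bootstrap sample, which supplies the variability between committee members that a single data set otherwise lacks.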
- 48. What do we do with these samples? Assume a true regression function h(x) and estimators y_m(x) with y_m(x) = h(x) + ε_m(x) (8). The average sum-of-squares error over the data takes the form E_x[{y_m(x) − h(x)}^2] = E_x[ε_m(x)^2] (9), where E_x denotes a frequentist expectation with respect to the distribution of the input vector x.
- 51. Thus, the average error of the individual models is E_AV = (1/M) Σ_{m=1}^{M} E_x[ε_m(x)^2] (10). Similarly, the expected error of the committee is E_COM = E_x[{(1/M) Σ_{m=1}^{M} (y_m(x) − h(x))}^2] = E_x[{(1/M) Σ_{m=1}^{M} ε_m(x)}^2] (11).
- 53. Assume that the errors have zero mean and are uncorrelated: E_x[ε_m(x)] = 0, and E_x[ε_m(x) ε_l(x)] = 0 for m ≠ l.
- 54. Then we have:

      E_COM = (1/M^2) E_x[(Σ_{m=1}^{M} ε_m(x))^2]
            = (1/M^2) E_x[Σ_{m=1}^{M} ε_m(x)^2 + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]
            = (1/M^2) E_x[Σ_{m=1}^{M} ε_m(x)^2] + (1/M^2) Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} E_x[ε_h(x) ε_k(x)]
            = (1/M^2) E_x[Σ_{m=1}^{M} ε_m(x)^2]          (the cross terms vanish, since the errors are uncorrelated)
            = (1/M) · (1/M) Σ_{m=1}^{M} E_x[ε_m(x)^2]
- 59. We finally obtain E_COM = (1/M) E_AV (12). Looks great, BUT: unfortunately, this depends on the key assumption that the errors of the individual models are uncorrelated.
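The factor-of-M reduction in Eq. (12) can be verified by simulation when the errors really are zero-mean and uncorrelated; the Gaussian error model below is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 10, 200_000
# Independent zero-mean errors eps_m(x) for M committee members,
# sampled over n input points.
eps = rng.normal(0.0, 1.0, size=(M, n))

E_AV = np.mean(eps ** 2)                 # average member error, Eq. (10)
E_COM = np.mean(eps.mean(axis=0) ** 2)   # committee error, Eq. (11)

ratio = E_COM / E_AV                     # approximately 1/M here
```

If the rows of `eps` were instead copies of one shared error (perfectly correlated), the ratio would be 1: averaging buys nothing, which is the caveat the slide raises.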
- 61. The reality: the errors are typically highly correlated, and the reduction in overall error is generally small. Something notable: however, it can be shown that the expected committee error never exceeds the expected error of the constituent models, so E_COM ≤ E_AV (13). We need something better: a more sophisticated technique known as boosting.
- 65. Boosting. What does boosting do? It combines several classifiers to produce a form of committee. We will describe AdaBoost ("Adaptive Boosting"), developed by Freund and Schapire (1995).
- 67. Sequential training. The main difference between boosting and committee methods is that the base classifiers are trained in sequence. Consider a two-class classification problem with samples x_1, x_2, ..., x_N and binary labels t_1, t_2, ..., t_N in {−1, 1}.
- 69. Cost function. You want to put together a set of M experts able to recognize the most difficult inputs accurately. For each pattern x_i, each expert classifier outputs a classification y_j(x_i) ∈ {−1, 1}. The final decision of the committee of M experts is sign(C(x_i)), where C(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_M y_M(x_i) (14).
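Eq. (14) amounts to a weighted vote. A sketch with hypothetical weights α_m and expert outputs y_m(x_i) for a single input:

```python
import numpy as np

# Hypothetical committee of M = 3 experts evaluated at one pattern x_i.
alphas = np.array([0.9, 0.4, 0.2])      # weights alpha_m of each expert
votes = np.array([+1, -1, -1])          # outputs y_m(x_i) in {-1, +1}

C = float(alphas @ votes)               # C(x_i), Eq. (14): 0.9 - 0.4 - 0.2
label = 1 if C > 0 else -1              # final decision sign(C(x_i))
```

Note that the strongly weighted first expert outvotes the two weaker ones even though they agree with each other.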
- 72. Adaptive boosting works even with a continuum of classifiers; however, for the sake of simplicity, we will assume that the set of experts is finite.
- 74. Getting the correct classifiers. We want to: review possible member classifiers; select them if they have certain properties; and assign a weight to their contribution to the set of experts.
- 77. Selection is done as follows: we test the classifiers in the pool using a training set T of N multidimensional data points x_i, where each point x_i has a label t_i = 1 or t_i = −1. We then test and rank all classifiers in the pool by assigning a cost to each outcome: charging a cost exp{β} any time a classifier fails (a miss), and charging a cost exp{−β} any time a classifier provides the right label (a hit).
- 82. Remarks about β. We require β > 0, so that misses are penalized more heavily than hits. It may look strange to penalize hits, but this is harmless as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}. Why does this not compromise generality? Exercise: show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite these costs as a = c^d and b = c^{−d} for constants c and d.
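One way to carry out this exercise, as a sketch: rescaling all costs by a positive constant does not change the ranking of the classifiers, so we may divide both costs by √(ab), which gives

```latex
\frac{a}{\sqrt{ab}} = \sqrt{\frac{a}{b}} = e^{\beta},
\qquad
\frac{b}{\sqrt{ab}} = \sqrt{\frac{b}{a}} = e^{-\beta},
\qquad
\text{with } \beta = \tfrac{1}{2}\ln\frac{a}{b} > 0 .
```

With c = e and d = β, the rescaled costs have exactly the form c^d and c^{−d}, so the exponential choice exp{±β} loses no generality.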
- 85. Exponential loss function. This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function. AdaBoost uses the exponential loss as its error criterion.
- 87. The Matrix $S$: when we test the $M$ classifiers in the pool, we build a matrix $S$ in which we record the misses (with a one) and hits (with a zero) of each classifier. Row $i$ in the matrix is reserved for the data point $x_i$; column $m$ is reserved for the $m$-th classifier in the pool.

    Classifiers:   1   2   ...   M
    x_1            0   1   ...   1
    x_2            0   0   ...   1
    x_3            1   1   ...   0
    ...
    x_N            0   0   ...   0
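As a sketch of how such a matrix could be computed (the helper name and the toy pool are illustrative assumptions, not from the slides), every classifier is run once over the training set:

```python
import numpy as np

def scouting_matrix(pool, X, t):
    """Build the N x M matrix S: S[i, m] = 1 if classifier m
    misclassifies point x_i (a miss), 0 on a hit."""
    N, M = len(X), len(pool)
    S = np.zeros((N, M), dtype=int)
    for m, clf in enumerate(pool):
        predictions = np.array([clf(x) for x in X])  # each clf returns +1 or -1
        S[:, m] = (predictions != t).astype(int)
    return S

# Toy pool of two threshold classifiers on a 1-D data set
pool = [lambda x: 1 if x > 0.5 else -1,
        lambda x: 1 if x > -0.5 else -1]
X = np.array([-1.0, 0.0, 1.0])
t = np.array([-1, -1, 1])
S = scouting_matrix(pool, X, t)
print(S)  # only the second classifier misses, on x_2
```

Once built, $S$ never changes; only the data-point weights do, which is what makes the reuse described later cheap.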
- 90. Main Idea. What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically, extracting one classifier from the pool in each of $M$ iterations. The elements in the data set are weighted according to their current relevance (or urgency) at each iteration. At the beginning of the iterations, all data samples are assigned the same weight: just 1, or $\frac{1}{N}$ if we want the weights to sum to 1.
- 94. The process of the weights: as the selection progresses, the more difficult samples, those on which the committee still performs badly, are assigned larger and larger weights. The selection process thus concentrates on choosing new classifiers for the committee that can help with the still-misclassified examples. The best classifiers are those which provide new insights to the committee; the classifiers being selected should complement each other in an optimal way.
- 98. Selecting New Classifiers. What we want: in each iteration we rank all classifiers, so that we can select the current best out of the pool. At the $m$-th iteration we have already included $m - 1$ classifiers in the committee and we want to select the next one. Thus, we have the following cost function, which is actually the output of the committee:

$$C^{(m-1)}(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + \cdots + \alpha_{m-1} y_{m-1}(x_i) \quad (15)$$
- 101. Thus, extending the cost function:

$$C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \quad (16)$$

At the first iteration, $m = 1$, $C^{(m-1)}$ is the zero function. The total cost, or total error, is defined as the exponential error

$$E = \sum_{i=1}^{N} \exp\left\{-t_i \left( C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \right)\right\} \quad (17)$$
- 104. We want to determine $\alpha_m$ and $y_m$ in an optimal way. Rewriting,

$$E = \sum_{i=1}^{N} w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\} \quad (18)$$

where, for $i = 1, 2, \ldots, N$,

$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\} \quad (19)$$
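A minimal sketch of Eq. (19) in code (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def point_weights(alphas, classifiers, X, t):
    """w_i = exp(-t_i * C(x_i)), where C is the committee output so far."""
    C = np.zeros(len(X))
    for alpha, y in zip(alphas, classifiers):
        C += alpha * np.array([y(x) for x in X])
    return np.exp(-t * C)

# Empty committee (m = 1): C is the zero function, so every weight is 1
X = np.array([0.2, -0.7, 1.3])
t = np.array([1, -1, 1])
print(point_weights([], [], X, t))  # [1. 1. 1.]
```

Adding a classifier that gets a point right shrinks that point's weight by $\exp\{-\alpha\}$; getting it wrong grows it by $\exp\{\alpha\}$, which is exactly the split derived next.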
- 107. In the first iteration, $w_i^{(1)} = 1$ for $i = 1, \ldots, N$. During later iterations, the vector $w^{(m)}$ represents the weight assigned to each data point in the training set at iteration $m$. Splitting Eq. (18):

$$E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\} \quad (20)$$

Meaning: the total cost is the weighted cost of all hits plus the weighted cost of all misses.
- 110. Writing the first summand as $W_c \exp\{-\alpha_m\}$ and the second as $W_e \exp\{\alpha_m\}$:

$$E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\} \quad (21)$$

Now, for the selection of $y_m$, the exact value of $\alpha_m > 0$ is irrelevant, since for a fixed $\alpha_m$ minimizing $E$ is equivalent to minimizing $\exp\{\alpha_m\} E$, and

$$\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\} \quad (22)$$
- 112. Since $\exp\{2\alpha_m\} > 1$, we can rewrite Eq. (22) as

$$\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\} \quad (23)$$

Thus

$$\exp\{\alpha_m\} E = (W_c + W_e) + W_e \left(\exp\{2\alpha_m\} - 1\right) \quad (24)$$

Now, $W_c + W_e$ is the total sum $W$ of the weights of all data points, which is constant in the current iteration.
- 115. Thus, the right-hand side of the equation is minimized when, at the $m$-th iteration, we pick the classifier with the lowest total cost $W_e$ (that is, the lowest weighted error rate). Intuitively, the next selected $y_m$ should be the one with the lowest penalty given the current set of weights.
- 117. Deriving with respect to the weight $\alpha_m$: going back to the original $E$, we can use the derivative trick

$$\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\} \quad (25)$$

Setting this equal to zero and multiplying by $\exp\{\alpha_m\}$:

$$-W_c + W_e \exp\{2\alpha_m\} = 0 \quad (26)$$

The optimal value is thus

$$\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_e} \quad (27)$$
- 120. Writing the total sum of all weights as

$$W = W_c + W_e \quad (28)$$

we can rewrite the previous equation as

$$\alpha_m = \frac{1}{2} \ln \frac{W - W_e}{W_e} = \frac{1}{2} \ln \frac{1 - e_m}{e_m} \quad (29)$$

with the percentage error rate, given the weights of the data points,

$$e_m = \frac{W_e}{W} \quad (30)$$
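Eq. (29) reduces to a single line of code; the error rates below are arbitrary illustrative values:

```python
import numpy as np

def alpha(e_m):
    """Committee weight for a classifier with weighted error rate e_m."""
    return 0.5 * np.log((1.0 - e_m) / e_m)

# alpha(0.5) = 0; the smaller the error, the larger the committee weight
for e_m in (0.1, 0.25, 0.5):
    print(f"e_m = {e_m:.2f} -> alpha_m = {alpha(e_m):+.4f}")
```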
- 123. What about the weights? Using the equation

$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\} \quad (31)$$

and because we now have $\alpha_m$ and $y_m(x_i)$:

$$w_i^{(m+1)} = \exp\left\{-t_i C^{(m)}(x_i)\right\} = \exp\left\{-t_i \left( C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \right)\right\} = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}$$
- 127. Sequential Training: AdaBoost thus trains a new classifier using a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifier, so as to give greater weight to the misclassified data points.
- 128. Illustration: schematic illustration of the sequential training process. [figure]
- 129. Outline: 1. Introduction. 2. Bayesian Model Averaging (Model Combination vs. Bayesian Model Averaging). 3. Committees (Bootstrap Data Sets). 4. Boosting (AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example).
- 130. AdaBoost Algorithm. Step 1: initialize $w_i^{(1)} = \frac{1}{N}$. Step 2: for $m = 1, 2, \ldots, M$:

1. Select a weak classifier $y_m(x)$ for the training data by minimizing the weighted error function:

$$\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I(y_m(x_i) \neq t_i) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e \quad (32)$$

where $I$ is an indicator function.

2. Evaluate

$$e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}} \quad (33)$$
- 136. Step 3: set the weight $\alpha_m$ to

$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m} \quad (34)$$

Then update the weights of the data for the next iteration. If $t_i \neq y_m(x_i)$, i.e. a miss:

$$w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}} \quad (35)$$

If $t_i = y_m(x_i)$, i.e. a hit:

$$w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}} \quad (36)$$
- 139. Finally, make predictions using

$$Y_M(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m y_m(x) \right) \quad (37)$$
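Steps 1 through 3 and the prediction rule of Eq. (37) can be collected into one compact sketch of the fixed-pool variant (the pool of threshold classifiers and the toy data are assumptions for illustration):

```python
import numpy as np

def adaboost(X, t, pool, M):
    """AdaBoost over a fixed pool of +1/-1 classifiers (Steps 1-3)."""
    N = len(X)
    w = np.full(N, 1.0 / N)                      # Step 1: uniform weights
    preds = np.array([[clf(x) for x in X] for clf in pool])
    committee = []
    for _ in range(M):                           # Step 2
        miss = (preds != t).astype(float)        # I(y_m(x_i) != t_i), one row per classifier
        We = miss @ w                            # weighted error of every pool member
        best = int(np.argmin(We))
        e_m = We[best] / w.sum()                 # Eq. (33)
        if e_m == 0:                             # perfect classifier: committee of one
            return [(1.0, pool[best])]
        if e_m >= 0.5:                           # nothing better than chance left
            break
        a = 0.5 * np.log((1 - e_m) / e_m)        # Eq. (34)
        committee.append((a, pool[best]))
        # Eqs. (35)-(36): misses scaled by exp(a), hits by exp(-a)
        w *= np.exp(a * (2 * miss[best] - 1))
    return committee

def predict(committee, x):
    return int(np.sign(sum(a * clf(x) for a, clf in committee)))  # Eq. (37)

# Toy 2-D OR-like problem: no single pool member is perfect, but the committee is
pool = [lambda p: 1 if p[0] > 0 else -1,
        lambda p: 1 if p[1] > 0 else -1,
        lambda p: 1]
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
t = np.array([-1, 1, 1, 1])
committee = adaboost(X, t, pool, M=3)
print([predict(committee, x) for x in X])  # [-1, 1, 1, 1]
```

Each pool member misclassifies one of the four points, yet after three rounds the weighted vote classifies all of them correctly, which is exactly the complementarity the weight updates are designed to produce.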
- 140. Observations. First: the first base classifier is trained by the usual procedure of training a single classifier. Second: from Eqs. (35) and (36) we can see that the weighting coefficients are increased for data points that are misclassified. Third: the quantity $e_m$ represents a weighted measure of the error rate; thus $\alpha_m$ gives more weight to the more accurate classifiers.
- 143. In addition, the pool of classifiers in Step 1 can be substituted by a family of classifiers whose members are trained to minimize the error function given the current weights. If instead a finite set of classifiers is given, we only need to test the classifiers once for each data point: the scouting matrix $S$ can be reused at each iteration by multiplying the transposed weight vector $w^{(m)}$ with $S$ to obtain $W_e$ for each machine.
- 146. We have then

$$\left( W_e^{(1)}, W_e^{(2)}, \ldots, W_e^{(M)} \right) = \left( w^{(m)} \right)^T S \quad (38)$$

This allows us to reformulate the weight update step so that only misses lead to weight modification. Note that the weight vector $w^{(m)}$ is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
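Eq. (38) is a single matrix-vector product; a sketch with a hand-written scouting matrix (the numbers are arbitrary):

```python
import numpy as np

# Hand-written scouting matrix: 4 data points (rows) x 3 classifiers (columns),
# 1 = miss, 0 = hit
S = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
w = np.full(4, 0.25)   # current data-point weights w^(m)

We = w @ S             # Eq. (38): W_e of every classifier at once
best = int(np.argmin(We))
print(We, best)        # classifier 0 has the lowest weighted error
```

This is why a fixed pool is cheap: the expensive part (running every classifier on every point) happens once, and each iteration only re-weights rows of $S$.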
- 150. Explanation: graph of $\frac{1 - e_m}{e_m}$ in the range $0 \leq e_m \leq 1$. [figure]
- 151. So we have: graph of $\alpha_m$. [figure]
- 152. We have the following cases. If $e_m \to 1$, all the samples were incorrectly classified. Then, for every misclassified sample, $\frac{1 - e_m}{e_m} \to 0$, so $\alpha_m \to -\infty$ and $w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0$. We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
- 155. If $e_m \to 1/2$, we have $\alpha_m \to 0$; thus, whether the sample is well or badly classified,

$$\exp\{-\alpha_m t_i y_m(x_i)\} \to 1 \quad (39)$$

so the weights do not change at all. What about $e_m \to 0$? Then $\alpha_m \to +\infty$, and for samples that are always correctly classified, $w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0$. Thus the only committee member we need is $m$; we do not need $m + 1$.
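These three regimes can be checked numerically (a small sketch reusing Eq. (34)):

```python
import numpy as np

def alpha(e_m):
    return 0.5 * np.log((1 - e_m) / e_m)

# e_m near 1: alpha is large and negative, misses get weight exp(alpha) -> 0
assert alpha(0.999) < -3
# e_m = 1/2: alpha = 0, every weight is multiplied by exp(0) = 1
assert abs(alpha(0.5)) < 1e-12
# e_m near 0: alpha is large and positive, correctly classified
# samples get weight exp(-alpha) -> 0
assert alpha(0.001) > 3
print("all three regimes check out")
```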
- 158. Outline: 1. Introduction. 2. Bayesian Model Averaging (Model Combination vs. Bayesian Model Averaging). 3. Committees (Bootstrap Data Sets). 4. Boosting (AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example).
- 159. Example. Training set with classes $\omega_1 = N(0, 1)$ and $\omega_2 = N(0, \sigma^2) - N(0, 1)$. [figure]
- 160. Example weak classifier: the perceptron. For $m = 1, \ldots, M$, we have the following possible classifiers. [figure]
- 162. At the end of the process, for $M = 40$: [figure]
