Introduction to Machine Learning
Combining Models, Bayesian Average and Boosting
Andres Mendez-Vazquez
July 31, 2018
Outline
1 Combining Models
Introduction
Average for Committee
Beyond Simple Averaging
Example
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
Now Model Averaging
The Differences
3 Committees
Introduction
Bootstrap Data Sets
Relation with Monte-Carlo Estimation
4 Boosting
AdaBoost Development
Cost Function
Selection Process
How do we select classifiers?
Selecting New Classifiers
Deriving with respect to the weight αm
AdaBoost Algorithm
Some Remarks
Explanation about AdaBoost’s behavior
Statistical Analysis of the Exponential Loss
Moving from Regression to Classification
Minimization of the Exponential Criterion
Finally, The Additive Logistic Regression
Example using an Infinitude of Perceptrons
Introduction
Observation
It is often found that improved performance can be obtained by combining multiple classifiers together in some way.
Example, Committees
We might train L different classifiers and then make predictions by using the average of the predictions made by each classifier.
Example, Boosting
It involves training multiple models in sequence: the error function used to train a particular model depends on the performance of the previous models.
We could use simple averaging
Given a series of observed samples $\{\hat{x}_1, \hat{x}_2, ..., \hat{x}_N\}$ with noise $\epsilon \sim \mathcal{N}(0, 1)$
We could use our knowledge of the noise, for example additive:
$$\hat{x}_i = x_i + \epsilon$$
We can use our knowledge of probability to remove such noise
$$E[\hat{x}_i] = E[x_i + \epsilon] = E[x_i] + E[\epsilon]$$
Then, because $E[\epsilon] = 0$
$$E[x_i] = E[\hat{x}_i] \approx \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i$$
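A minimal Python sketch of this idea (not from the original slides), assuming repeated noisy measurements of a single underlying value:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0                                  # the clean quantity x we want to recover
N = 1000                                          # number of noisy observations
noise = rng.normal(loc=0.0, scale=1.0, size=N)    # epsilon ~ N(0, 1)
observations = true_value + noise                 # hat{x}_i = x + epsilon_i

# Since E[epsilon] = 0, the sample mean approximates the clean value.
estimate = observations.mean()
print(f"true = {true_value:.3f}, estimate = {estimate:.3f}")
```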
For Example
We have a nice result (figure omitted).
Beyond Simple Averaging
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of
the models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different Models become responsible for making decisions in different
regions of the input space.
Something like this
Models in charge of different sets of inputs (figure omitted).
Example, Decision Trees
We can have a decision tree on top of the models
Given a set of models, one model is chosen to make the decision in a certain region of the input space.
Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value of the input.
Thus it is better to soften the combination by using
M classifiers, each modeling a conditional distribution p (t|x, k), where:
x is the input variable.
t is the target variable.
k = 1, 2, ..., M indexes the classifiers.
This is used in the mixture of distributions
Thus (Mixture of Experts)
$$p(t|x) = \sum_{k=1}^{M} \pi_k(x)\, p(t|x, k) \quad (1)$$
where $\pi_k(x) = p(k|x)$ represents the input-dependent mixing coefficients.
This type of model
can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; it is known as a mixture of experts.
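A small sketch (assumed setup, not from the slides) of how Equation (1) combines expert predictive densities with input-dependent gates; the linear-softmax gate and Gaussian experts here are illustrative choices only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts_density(t, x, gate_params, expert_params):
    """p(t|x) = sum_k pi_k(x) * p(t|x, k) for scalar x and t."""
    # Input-dependent mixing coefficients pi_k(x) = p(k|x) via a linear-softmax gate.
    logits = np.array([w * x + b for (w, b) in gate_params])
    pi = softmax(logits)
    # Each expert k is a Gaussian density over t with its own mean function (assumed form).
    density = 0.0
    for pi_k, (a, c, sigma) in zip(pi, expert_params):
        mean_k = a * x + c
        p_t_given_xk = np.exp(-0.5 * ((t - mean_k) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        density += pi_k * p_t_given_xk
    return density

# Two hypothetical experts and gates (illustrative numbers only).
gates = [(2.0, 0.0), (-2.0, 0.0)]
experts = [(1.0, 0.0, 0.3), (-1.0, 1.0, 0.3)]
print(mixture_of_experts_density(t=0.8, x=0.5, gate_params=gates, expert_params=experts))
```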
It is important to differentiate between them
Although
Model Combination and Bayesian Model Averaging look similar,
they are actually different.
For this
We have the following example.
Example of the Differences
For this consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs.
Thus the model is specified in terms of a joint distribution
$$p(x, z)$$
The corresponding density over the observed variable x is obtained by marginalization
$$p(x) = \sum_{z} p(x, z)$$
Example
In the case of a Mixture of Gaussians
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$$
This is an example of model combination.
What about other models?
More Models
Now, for independent, identically distributed data $X = \{x_1, x_2, ..., x_N\}$
$$p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n)$$
Therefore
Something Notable
Each observed data point xn has a corresponding latent variable zn.
Here, we are doing a Combination of Models
Each Gaussian indexed by zn is in charge of generating one section of
the sample space
Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h)
One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
The Marginal Distribution is
$$p(X) = \sum_{h=1}^{H} p(X, h) = \sum_{h=1}^{H} \underbrace{p(X|h)\, p(h)}_{\propto\, p(h|X)}$$
This is an example of Bayesian model averaging.
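As a rough illustration (assumed toy models and priors, not from the slides), a minimal numpy sketch of weighting candidate models by their posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=50)   # the whole observed data set

# Candidate models h (illustrative: standard Gaussian vs. standard Cauchy) with priors p(h).
def log_lik_gaussian(x):
    return np.sum(-0.5 * x**2 - 0.5 * np.log(2 * np.pi))

def log_lik_cauchy(x):
    return np.sum(-np.log(np.pi * (1.0 + x**2)))

log_lik = {"gaussian": log_lik_gaussian(X), "cauchy": log_lik_cauchy(X)}
log_prior = {"gaussian": np.log(0.5), "cauchy": np.log(0.5)}

# p(X) = sum_h p(X|h) p(h); normalizing gives the posterior p(h|X),
# which concentrates on one model as the data set grows.
log_joint = {h: log_lik[h] + log_prior[h] for h in log_lik}
m = max(log_joint.values())
evidence = sum(np.exp(v - m) for v in log_joint.values())
posterior = {h: float(np.exp(log_joint[h] - m) / evidence) for h in log_joint}
print(posterior)
```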
Bayesian Model Averaging
Remark
The summation over h means that just one model is responsible for
generating the whole data set.
Observation
The probability over h simply reflects our uncertainty as to which is the correct model to use.
Thus, as the size of the data set increases
This uncertainty decreases, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
Example
(Figure omitted: increasing the number of samples.)
The Differences
Bayesian model averaging
The whole data set is generated by a single model h.
Model combination
Different data points within the data set can potentially be generated by different components.
Committees
Idea, the simplest way to construct a committee
It is to average the predictions of a set of individual models.
Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance.
Where the error of the model decomposes into
The bias component that arises from differences between the model and the true function to be predicted.
The variance component that represents the sensitivity of the model to the individual data points.
For example
When we averaged a set of low-bias models
We obtained accurate predictions of the underlying sinusoidal function
from which the data were generated.
However
Big Problem
Normally, we have only a single data set.
Thus
We need to introduce some variability between the different committee members.
One approach
You can use bootstrap data sets.
The Idea of Bootstrap
We denote the training set by Z = {z1, z2, ..., zN}
Where zi = (xi, yi)
The basic idea is to randomly draw datasets with replacement from the training data,
each sample having the same size as the original training set.
This is done B times
Producing B bootstrap datasets.
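A minimal sketch (assuming numpy arrays X, y for the training pairs) of drawing the B bootstrap data sets:

```python
import numpy as np

def bootstrap_datasets(X, y, B, seed=0):
    """Yield B datasets of size N drawn with replacement from Z = {(x_i, y_i)}."""
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # sample N indices with replacement
        yield X[idx], y[idx]

# Tiny demonstration with made-up data.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
for Xb, yb in bootstrap_datasets(X, y, B=3):
    print(yb)
```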
Then
Then a quantity is computed
S (Z) is any quantity computed from the data Z
From the bootstrap sampling
We can estimate any aspect of the distribution of S (Z).
Then
We refit the model to each of the bootstrap datasets
and compute $S(Z^{*b})$ from the model refitted to dataset $Z^{*b}$.
Then
You examine the behavior of the fits over the B replications.
For Example
Its variance
$$Var\left[S(Z)\right] = \frac{1}{B-1}\sum_{b=1}^{B}\left(S\left(Z^{*b}\right) - \bar{S}^{*}\right)^2$$
Where
$$\bar{S}^{*} = \frac{1}{B}\sum_{b=1}^{B} S\left(Z^{*b}\right)$$
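A small sketch of this estimator, taking S(Z) to be the sample mean purely as an illustrative choice:

```python
import numpy as np

def bootstrap_variance(Z, statistic, B=1000, seed=0):
    """Estimate Var[S(Z)] from B bootstrap replications S(Z*b)."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    replications = np.array([
        statistic(Z[rng.integers(0, N, size=N)]) for _ in range(B)
    ])
    S_bar = replications.mean()                       # the average replication, S-bar*
    return ((replications - S_bar) ** 2).sum() / (B - 1)

Z = np.random.default_rng(2).normal(size=100)
print(bootstrap_variance(Z, statistic=np.mean))
```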
Relation with Monte-Carlo Estimation
Note that Var[S(Z)]
can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling
from the empirical distribution function F of the data Z = {z1, z2, ..., zN}.
For Example
Schematic of the bootstrap process (figure omitted): training samples → bootstrap samples → bootstrap replications.
Thus
Use each of them to train a copy $y_b(x)$ of a predictive regression model to predict a single continuous variable
Then,
$$y_{com}(x) = \frac{1}{B}\sum_{b=1}^{B} y_b(x) \quad (2)$$
This is also known as Bootstrap Aggregation or Bagging.
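A hedged sketch of Equation (2): train one regressor per bootstrap sample and average their predictions. The sinusoidal data and the polynomial base model are illustrative assumptions, not the slides' choice:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=X.shape)   # noisy targets

B, degree = 25, 5
models = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap sample Z*b
    models.append(np.polyfit(X[idx], y[idx], degree))  # y_b(x): a polynomial fit (illustrative)

def y_com(x):
    """Bagged prediction: average of the B bootstrap models."""
    return np.mean([np.polyval(coef, x) for coef in models])

print(y_com(0.25))
```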
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate $y_b(x)$
$$y_b(x) = h(x) + \epsilon_b(x) \quad (3)$$
The average sum-of-squares error over the data takes the form
$$E_x\left[\left(y_b(x) - h(x)\right)^2\right] = E_x\left[\epsilon_b^2(x)\right] \quad (4)$$
What is $E_x$?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
$$E_{AV} = \frac{1}{B}\sum_{b=1}^{B} E_x\left[\epsilon_b^2(x)\right] \quad (5)$$
Similarly, the expected error of the committee is
$$E_{COM} = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B} y_b(x) - h(x)\right)^2\right] = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)\right)^2\right] \quad (6)$$
Assume that the errors have zero mean and are uncorrelated
Something reasonable to assume given the way we produce the Bootstrap samples
$$E_x\left[\epsilon_b(x)\right] = 0$$
$$E_x\left[\epsilon_b(x)\,\epsilon_l(x)\right] = 0, \quad \text{for } b \neq l$$
Then
We have that
$$\begin{aligned}
E_{COM} &= \frac{1}{B^2}\, E_x\left[\left(\sum_{b=1}^{B}\epsilon_b(x)\right)^2\right]\\
&= \frac{1}{B^2}\, E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x) + \sum_{h=1}^{B}\sum_{\substack{k=1 \\ k\neq h}}^{B}\epsilon_h(x)\,\epsilon_k(x)\right]\\
&= \frac{1}{B^2}\left(E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x)\right] + \sum_{h=1}^{B}\sum_{\substack{k=1 \\ k\neq h}}^{B} E_x\left[\epsilon_h(x)\,\epsilon_k(x)\right]\right)\\
&= \frac{1}{B^2}\, E_x\left[\sum_{b=1}^{B}\epsilon_b^2(x)\right] = \frac{1}{B}\cdot\frac{1}{B}\sum_{b=1}^{B} E_x\left[\epsilon_b^2(x)\right]
\end{aligned}$$
where the cross terms vanish because the errors are assumed uncorrelated with zero mean.
We finally obtain
We obtain
$$E_{COM} = \frac{1}{B}\, E_{AV} \quad (7)$$
Looks great, BUT!!!
Unfortunately, it depends on the key assumption that the errors of the individual Bootstrap models are uncorrelated.
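A quick numerical check (synthetic errors, assumed purely for illustration): with independent zero-mean errors the committee error is roughly $E_{AV}/B$, while strongly correlated errors erase most of the gain:

```python
import numpy as np

rng = np.random.default_rng(4)
B, n_points = 10, 100_000

# Independent zero-mean errors epsilon_b(x) for each of the B models.
eps_indep = rng.normal(size=(B, n_points))
# Strongly correlated errors: a shared component plus a small individual one.
shared = rng.normal(size=n_points)
eps_corr = 0.9 * shared + 0.1 * rng.normal(size=(B, n_points))

for name, eps in [("independent", eps_indep), ("correlated", eps_corr)]:
    E_AV = np.mean(eps ** 2)                    # average error of the individual models
    E_COM = np.mean(eps.mean(axis=0) ** 2)      # error of the committee (averaged models)
    print(f"{name}: E_AV={E_AV:.3f}, E_COM={E_COM:.3f}, ratio={E_COM / E_AV:.3f}")
```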
Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
$$E_{COM} \leq E_{AV} \quad (8)$$
However, we need something better
A more sophisticated technique known as boosting.
Boosting
What does Boosting do?
It combines several classifiers to produce a form of committee.
We will describe AdaBoost
"Adaptive Boosting", developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1 Samples x1, x2, ..., xN
2 Binary labels t1, t2, ..., tN ∈ {−1, +1}
Cost Function
Now
You want to put together a set of M experts able to recognize the most
difficult inputs in an accurate way!!!
Thus
For each pattern xi each expert classifier outputs a classification
yj (xi) ∈ {−1, 1}
The final decision of the committee of M experts is sign (C (xi))
C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (9)
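A tiny sketch of this committee decision (Equation 9), with hypothetical weak classifiers and weights:

```python
import numpy as np

def committee_decision(x, classifiers, alphas):
    """sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), each y_m(x) in {-1, +1}."""
    C = sum(a * clf(x) for a, clf in zip(alphas, classifiers))
    return int(np.sign(C))

# Three hypothetical weak classifiers (decision stumps on a scalar input).
classifiers = [
    lambda x: 1 if x > 0.2 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: -1 if x > 0.9 else 1,
]
alphas = [0.4, 0.7, 0.2]   # illustrative committee weights
print(committee_decision(0.6, classifiers, alphas))
```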
Now
Adaptive Boosting
It works even with a continuum of classifiers.
However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following
We want to review possible committee members,
select them, if they have certain properties,
and assign a weight to their contribution to the set of experts.
Now
Selection is done the following way
Testing the classifiers in the pool using a training set T of N
multidimensional data points xi:
For each point xi we have a label ti = 1 or ti = −1.
Assigning a cost for actions
We test and rank all classifiers in the expert pool by
Charging a cost exp {β} any time a classifier fails (a miss).
Charging a cost exp {−β} any time a classifier provides the right
label (a hit).
Remarks about β
We require β > 0
Thus misses are penalized more heavily than hits.
Although
It looks strange to penalize hits,
it works as long as the penalty for a success is smaller than the penalty for a miss:
$$\exp\{-\beta\} < \exp\{\beta\}$$
Why?
If we assign cost a to misses and cost b to hits, where a > b > 0,
we can rewrite such costs as $a = c^{d}$ and $b = c^{-d}$ for constants c and d.
It does not compromise generality.
Exponential Loss Function
This kind of error function is different from the squared Euclidean distance:
this classification cost is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
(Figure omitted: plot of the exponential loss.)
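A minimal sketch comparing the exponential loss exp(−z), with z the margin t·y(x), against the 0/1 misclassification loss at a few assumed margin values:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # z = t * y(x)
exp_loss = np.exp(-margins)                        # exponential loss used by AdaBoost
zero_one = (margins <= 0).astype(float)            # 0/1 loss (counting z = 0 as an error here)

for z, e, m01 in zip(margins, exp_loss, zero_one):
    print(f"margin={z:+.1f}  exp={e:6.3f}  0/1={m01:.0f}")
```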
Selection of the Classifier
We need a way to select the best classifier in the pool
When we test the M classifiers in the pool, we build a matrix S.
Then
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.
The Matrix S
Row i in the matrix is reserved for the data point xi
Column m is reserved for the mth classifier in the pool.

            Classifiers
         1    2   · · ·   M
  x1     0    1   · · ·   1
  x2     0    0   · · ·   1
  x3     1    1   · · ·   0
  ...   ...  ...         ...
  xN     0    0   · · ·   0
Something interesting about S
Summing each column over the data points gives the empirical risk of the corresponding classifier
$$ER(y_j) = \frac{1}{N}\sum_{i=1}^{N} S_{ij}, \quad j = 1, ..., M$$
Therefore, the candidate to be used at a certain iteration
is the classifier $y_j$ with the smallest empirical risk!!!
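A short sketch (hypothetical S matrix) of ranking the pool by this column-wise empirical risk:

```python
import numpy as np

# S[i, j] = 1 if classifier j misses data point x_i, 0 if it hits (hypothetical values).
S = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])

empirical_risk = S.mean(axis=0)        # ER(y_j) = (1/N) * sum_i S_ij
best_j = int(np.argmin(empirical_risk))
print(empirical_risk, "-> pick classifier", best_j)
```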
Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
Thus, at the beginning of the iterations
All data samples are assigned the same weight:
Just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
The selection process concentrates on selecting new classifiers
for the committee by focusing on those which can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want
In each iteration, we rank all classifiers, so that we can select the current
best out of the pool.
At the mth iteration
We have already included m − 1 classifiers in the committee and we want
to select the next one.
Thus, we have the following cost function which is actually the
output of the committee
C(m−1) (xi) = α1y1 (xi) + α2y2 (xi) + ... + αm−1ym−1 (xi) (10)
Thus, we have that
Extending the cost function by the new classifier ym
$$C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i) \quad (11)$$
At the first iteration m = 1
$C^{(0)}$ is the zero function.
Thus, the total cost or total error is defined as the exponential error
$$E = \sum_{i=1}^{N} \exp\left\{-t_i\left(C^{(m-1)}(x_i) + \alpha_m y_m(x_i)\right)\right\} \quad (12)$$
Thus
We want to determine
αm and ym in an optimal way.
Thus, rewriting
$$E = \sum_{i=1}^{N} w_i^{(m)} \exp\left\{-t_i \alpha_m y_m(x_i)\right\} \quad (13)$$
Where, for i = 1, 2, ..., N
$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\} \quad (14)$$
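A sketch of computing the weights of Equation (14) from the current committee output (the committee values below are illustrative assumptions):

```python
import numpy as np

def adaboost_weights(t, C_prev):
    """w_i^(m) = exp(-t_i * C^(m-1)(x_i)); at m = 1, C^(0) = 0 gives w_i = 1."""
    return np.exp(-t * C_prev)

t = np.array([1, -1, 1, 1, -1])                   # labels t_i in {-1, +1}
C_prev = np.array([0.8, -0.5, -0.3, 1.2, 0.4])    # C^(m-1)(x_i), hypothetical values
print(adaboost_weights(t, C_prev))                # points with t_i * C^(m-1)(x_i) < 0 get weights > 1
```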
Remark
We have that the weight
$$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\}$$
needs to be used in some way for the training of the new classifier.
This is of the utmost importance!!!
Therefore
You could use such a weight
As an output scaling in the estimator when applied to the loss function
$$\sum_{i=1}^{N}\left(y_i - w_i^{(m)} f(x_i)\right)^2$$
You could use such a weight
To sub-sample with replacement by using the distribution $D_m \propto w_i^{(m)}$ over the $x_i$,
then train using that sub-sample.
You could apply the weights to the loss function itself used for training
$$\sum_{i=1}^{N} w_i^{(m)}\left(y_i - f(x_i)\right)^2$$
Thus
In the first iteration, $w_i^{(1)} = 1$ for i = 1, ..., N
Meaning all the points have the same importance.
During later iterations, the vector $w^{(m)}$
represents the weight assigned to each data point in the training set at iteration m.
Rewriting the Cost Equation
We can split (Eq. 13)
$$E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\} \quad (15)$$
Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
Therefore
Writing the first summand as W_c \exp\{-\alpha_m\} and the second as W_e \exp\{\alpha_m\}:

E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (16)

71 / 132
Now, for the selection of y_m
The exact value of α_m > 0 is irrelevant:
for a fixed α_m > 0, minimizing E
is equivalent to minimizing exp{α_m} E,
or in other words

\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}   (17)

72 / 132
Now, we have
Given that α_m > 0, also 2α_m > 0.
We have

\exp\{2\alpha_m\} > \exp\{0\} = 1

73 / 132
Then
We can rewrite (Eq. 17) as

\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}   (18)

Thus

\exp\{\alpha_m\} E = (W_c + W_e) + W_e \left(\exp\{2\alpha_m\} - 1\right)   (19)

Now, W_c + W_e is the total sum W of the weights
of all data points, which is constant in the current iteration.
74 / 132
Thus
The right-hand side of the equation is minimized
when, at the m-th iteration, we pick the classifier with the lowest total cost W_e,
that is, the lowest weighted error.
Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
75 / 132
Do you remember?
The Matrix S
We pick the classifier with the lowest total cost W_e.
Now, we need to do some updates,
specifically the value α_m.
76 / 132
Deriving against the weight α_m
Going back to the original E, we can use the derivative trick:

\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}   (20)

Setting this equal to zero and multiplying by \exp\{\alpha_m\}:

-W_c + W_e \exp\{2\alpha_m\} = 0   (21)

The optimal value is thus

\alpha_m = \frac{1}{2} \ln\left(\frac{W_c}{W_e}\right)   (22)

78 / 132
Now
Writing the total sum of all weights as

W = W_c + W_e   (23)

we can rewrite the previous equation as

\alpha_m = \frac{1}{2} \ln\left(\frac{W - W_e}{W_e}\right) = \frac{1}{2} \ln\left(\frac{1 - e_m}{e_m}\right)   (24)

with the weighted error rate over the data points

e_m = \frac{W_e}{W}   (25)

79 / 132
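For instance, with a weighted error rate e_m = 0.25 the classifier receives α_m = ½ ln(0.75/0.25) = ½ ln 3 ≈ 0.549, while a classifier with e_m = 0.5 (no better than weighted random guessing) receives α_m = 0; any e_m < 0.5 yields α_m > 0.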
What about the weights?
Using the equation

w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}   (26)

and because we now have α_m and y_m(x_i):

w_i^{(m+1)} = \exp\{-t_i C^{(m)}(x_i)\}
            = \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}
            = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}

80 / 132
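In vectorized form this multiplicative update is a one-liner; the sketch below assumes labels t and weak-learner outputs y_pred are arrays with values in {-1, +1} (all names are illustrative).

import numpy as np

def update_weights(w, alpha_m, t, y_pred):
    # Hits (t == y_pred) shrink by exp(-alpha_m); misses grow by exp(+alpha_m)
    return w * np.exp(-alpha_m * t * y_pred)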
Sequential Training
Thus
AdaBoost trains a new classifier using a data set
in which the weighting coefficients are adjusted according to the performance of the previously trained classifier,
so as to give greater weight to the misclassified data points.
81 / 132
Illustration
Schematic illustration
[Figure not recovered in the text extraction]
82 / 132
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} to 1/N for i = 1, ..., N.
Step 2
For m = 1, 2, ..., M:
Select a weak classifier y_m(x) for the training data by minimizing the weighted error function, i.e.

\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I\left(y_m(x_i) \neq t_i\right) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e   (27)

where I is an indicator function.
84 / 132
AdaBoost Algorithm
Step 2 (continued)
Evaluate

e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I\left(y_m(x_n) \neq t_n\right)}{\sum_{n=1}^{N} w_n^{(m)}}   (28)

where I is an indicator function.
85 / 132
AdaBoost Algorithm
Step 3
Set the weight α_m to

\alpha_m = \frac{1}{2} \ln\left(\frac{1 - e_m}{e_m}\right)   (29)

Now update the weights of the data for the next iteration.
If t_i \neq y_m(x_i), i.e. a miss:

w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}   (30)

If t_i = y_m(x_i), i.e. a hit:

w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}   (31)

86 / 132
Finally, make predictions
For this use

Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)   (32)

87 / 132
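Putting Steps 1-3 and the final vote (Eq. 32) together, here is a minimal sketch of the algorithm. The choice of decision stumps (single-feature threshold classifiers) as weak learners and every name in the code are illustrative assumptions, not part of the slides.

import numpy as np

def fit_stump(X, t, w):
    # Brute-force search over (feature, threshold, polarity) minimizing the weighted error We;
    # fine for small toy data sets.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                we = np.sum(w[pred != t])
                if best is None or we < best[0]:
                    best = (we, j, thr, polarity)
    return best  # (We, feature, threshold, polarity)

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_fit(X, t, M=50):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                        # Step 1: w_i^(1) = 1/N
    ensemble = []
    for m in range(M):
        we, j, thr, pol = fit_stump(X, t, w)       # Step 2: minimize weighted error, Eq. (27)
        e_m = we / np.sum(w)                       # Eq. (28)
        e_m = np.clip(e_m, 1e-10, 1 - 1e-10)       # numerical guard for the logarithm
        alpha = 0.5 * np.log((1 - e_m) / e_m)      # Eq. (29)
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * t * pred)          # Step 3: Eqs. (30)-(31) in one line
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    # Eq. (32): Y_M(x) = sign(sum_m alpha_m y_m(x))
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)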
Observations
First
The first base classifier is obtained by the usual procedure of training a single classifier.
Second
From (Eq. 30) and (Eq. 31), we can see that the weighting coefficients are increased for data points that are misclassified.
Third
The quantity e_m represents a weighted measure of the error rate;
thus α_m gives more weight to the more accurate classifiers.
89 / 132
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers
whose members are trained to minimize the error function given the current weights.
Now
If indeed a finite set of classifiers is given, we only need to test the classifiers once on each data point.
The scouting matrix S
can then be reused at each iteration by multiplying the transposed vector of weights w^{(m)} with S to obtain W_e for each machine.
90 / 132
We have then
The following:

\left(W_e^{(1)}, W_e^{(2)}, \ldots, W_e^{(M)}\right) = \left(w^{(m)}\right)^T S   (33)

This allows us to reformulate the weight update step so that
only misses lead to weight modifications.
Note
The weight vector w^{(m)} is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simple to implement.
91 / 132
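A sketch of this reuse, assuming the scouting matrix is stored as an N x K array with S[i, k] = 1 if classifier k misclassifies data point x_i and 0 otherwise; this encoding and all names are illustrative assumptions consistent with Eq. (33).

import numpy as np

def weighted_errors(S, w_m):
    # Eq. (33): (We^(1), ..., We^(K)) = w_m^T S, one weighted error per candidate classifier
    return w_m @ S

def pick_classifier(S, w_m):
    We = weighted_errors(S, w_m)
    return int(np.argmin(We)), We   # index of the classifier with the lowest We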
Explanation
[Figure: graph of (1 − e_m)/e_m in the range 0 ≤ e_m ≤ 1]
93 / 132
So we have
[Figure: graph of α_m as a function of e_m]
94 / 132
We have the following cases
We have the following
If e_m → 1, then all the samples were incorrectly classified!!!
Thus
In this all-misclassified case, \lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0, and therefore α_m → −∞.
95 / 132
Now
We get that for every misclassified sample

w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0

Therefore
We only need to reverse the answers of this classifier to get a perfect classifier, and we can select it as the only committee member.
96 / 132
Now, the Last Case
If e_m → 1/2,
we have α_m → 0.
Thus, whether the sample is well or badly classified,

\exp\{-\alpha_m t_i y_m(x_i)\} \to 1   (34)

Therefore
The weight does not change at all.
97 / 132
Thus, we have
What about e_m → 0?
We have that α_m → +∞.
Thus, we have
For samples that are always correctly classified,

w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0

Thus, we only need m committee members; we do not need an (m + 1)-th member.
98 / 132
This comes from
The paper
“Additive Logistic Regression: A Statistical View of Boosting” by Friedman, Hastie and Tibshirani.
Something Notable
In this paper, a proof shows that boosting algorithms are procedures to fit an additive logistic regression model

E[y|x] = F(x) with F(x) = \sum_{m=1}^{M} f_m(x)

100 / 132
Consider the Additive Regression Model
We are interested in modeling the mean E[y|x] = F(x)
with the additive model

F(x) = \sum_{i=1}^{d} f_i(x_i)

where each f_i(x_i) is a function of a single input feature x_i.
A convenient algorithm for updating these models is the backfitting algorithm, with update (a sketch of it follows this slide):

f_i(x_i) \leftarrow E\left[y - \sum_{k \neq i} f_k(x_k) \,\middle|\, x_i\right]

101 / 132
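A rough sketch of backfitting for an additive model with one component per feature. The conditional expectation is approximated here by a simple bin-averaging smoother; that smoother, the iteration counts, and all names are illustrative assumptions, not the exact procedure from the paper.

import numpy as np

def bin_smoother(x, r, n_bins=10):
    # Approximate E[r | x] by averaging the residual r within quantile bins of x
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return means[idx]

def backfit(X, y, n_iter=20):
    n, d = X.shape
    f = np.zeros((d, n))                                 # f_i(x_i) evaluated at the data
    for _ in range(n_iter):
        for i in range(d):
            partial_residual = y - f.sum(axis=0) + f[i]  # y - sum_{k != i} f_k(x_k)
            f[i] = bin_smoother(X[:, i], partial_residual)
            f[i] -= f[i].mean()                          # center each component for identifiability
    return f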
Remarks
An example of these additive models is the matching pursuit expansion

f(t) = \sum_{n=-\infty}^{+\infty} a_n g_{\gamma_n}(t)

Backfitting ensures that, under fairly general conditions,
it converges to the minimizer of E\left(y - f(x)\right)^2.
102 / 132
In the case of AdaBoost
We have an additive model
which considers functions \{f_m(x)\}_{m=1}^{M} that take into account all the features (Perceptrons, Decision Trees, etc.).
Each of these functions is characterized by a set of parameters γ_m and a multiplier α_m:

f_m(x) = \alpha_m y_m(x|\gamma_m)

with additive model

F_M(x) = \alpha_1 y_1(x|\gamma_1) + \cdots + \alpha_M y_M(x|\gamma_M)

103 / 132
Remark - Moving from Regression to Classification
Given that regression has a wide range of outputs,
logistic regression is widely used to move from regression to classification:

\log\frac{P(Y = 1|x)}{P(Y = -1|x)} = \sum_{m=1}^{M} f_m(x)

A nice property: the probability estimates lie in [0, 1].
Now, solving by assuming P(Y = 1|x) + P(Y = -1|x) = 1:

P(Y = 1|x) = \frac{e^{F(x)}}{1 + e^{F(x)}}

105 / 132
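As a quick sketch, converting an additive score F(x) = \sum_m f_m(x) into a class probability under this link; the function name is illustrative.

import numpy as np

def prob_positive(F_x):
    # P(Y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}) = 1 / (1 + e^{-F(x)}), the logistic sigmoid of F(x)
    return 1.0 / (1.0 + np.exp(-F_x))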
The Exponential Criterion
We have our exponential criterion under an expected value, with y ∈ {1, −1}:

J(F) = E\left[e^{-yF(x)}\right]

Lemma
E\left[e^{-yF(x)}\right] is minimized at

F(x) = \frac{1}{2} \log\frac{P(Y = 1|x)}{P(Y = -1|x)}

Hence:

P(Y = 1|x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}, \qquad P(Y = -1|x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}

107 / 132
Proof
Given the discrete nature of y ∈ {1, −1},

\frac{\partial E\left[e^{-yF(x)}\right]}{\partial F(x)} = -P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)}

Therefore, at the minimum,

-P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)} = 0

108 / 132
Then
We have that

P(Y = 1|x)\, e^{-F(x)} = P(Y = -1|x)\, e^{F(x)} = \left[1 - P(Y = 1|x)\right] e^{F(x)}

Solving,

e^{F(x)} = \left(e^{-F(x)} + e^{F(x)}\right) P(Y = 1|x)

109 / 132
Finally, we have
The first equation:

P(Y = 1|x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}

Similarly,

P(Y = -1|x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}

110 / 132
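A quick consistency check: writing p = P(Y = 1|x) and F(x) = ½ log(p/(1 − p)), we have e^{F(x)} = \sqrt{p/(1 - p)} and e^{-F(x)} + e^{F(x)} = 1/\sqrt{p(1 - p)}, so e^{F(x)}/(e^{-F(x)} + e^{F(x)}) = p, as claimed.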
Basically
When you minimize the cost function E\left[e^{-yF(x)}\right],
then at the optimum you recover the binary classification rule
of logistic regression.
111 / 132
Furthermore
Corollary
If E is replaced by averages over regions of x where F(x) is constant (as in a decision tree),
the same result applies to the sample proportions of y = 1 and y = −1.
112 / 132
Finally, The Additive Logistic Regression
Proposition
The AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of

J(F) = E\left[e^{-yF(x)}\right]

Proof
Imagine you have an estimate F(x); then we seek an improved estimate

F(x) + f(x)

114 / 132
For This
We minimize, at each x,

J(F(x) + f(x))

This can be expanded as

J(F(x) + f(x)) = E\left[e^{-y(F(x)+f(x))} \mid x\right]
               = e^{-f(x)} E\left[e^{-yF(x)} I(y = 1) \mid x\right] + e^{f(x)} E\left[e^{-yF(x)} I(y = -1) \mid x\right]

115 / 132
Deriving w.r.t. f(x)
We get

-e^{-f(x)} E\left[e^{-yF(x)} I(y = 1) \mid x\right] + e^{f(x)} E\left[e^{-yF(x)} I(y = -1) \mid x\right] = 0

116 / 132
We have the following
If we divide by E\left[e^{-yF(x)} \mid x\right], the first term becomes

\frac{E\left[e^{-yF(x)} I(y = 1) \mid x\right]}{E\left[e^{-yF(x)} \mid x\right]} = E_w\left[I(y = 1) \mid x\right]

Also,

\frac{E\left[e^{-yF(x)} I(y = -1) \mid x\right]}{E\left[e^{-yF(x)} \mid x\right]} = E_w\left[I(y = -1) \mid x\right]

117 / 132
Thus, we have
We apply the natural log to both sides:

\log e^{-f(x)} + \log E_w\left[I(y = 1) \mid x\right] = \log e^{f(x)} + \log E_w\left[I(y = -1) \mid x\right]

Then

2f(x) = \log E_w\left[I(y = 1) \mid x\right] - \log E_w\left[I(y = -1) \mid x\right]

118 / 132
Finally
We have that

f(x) = \frac{1}{2} \log\frac{E_w\left[I(y = 1) \mid x\right]}{E_w\left[I(y = -1) \mid x\right]}

In terms of probabilities,

f(x) = \frac{1}{2} \log\frac{P_w(y = 1|x)}{P_w(y = -1|x)}

119 / 132
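A sketch of how this stage-wise update could be estimated on a region of the input space (for instance a tree leaf), using weighted class proportions as plug-in estimates of P_w; the function name and the smoothing constant eps are illustrative assumptions.

import numpy as np

def half_log_odds(w, y, eps=1e-10):
    # f = 0.5 * log(P_w(y = 1) / P_w(y = -1)) from boosting weights w and labels y in {-1, +1}
    p_pos = np.sum(w[y == 1]) / np.sum(w)
    p_neg = 1.0 - p_pos
    return 0.5 * np.log((p_pos + eps) / (p_neg + eps))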
The Weight Update
Finally, we have a way to update the weights, by setting

w_t(x, y) = e^{-yF(x)}, \qquad w_{t+1}(x, y) = w_t(x, y)\, e^{-yf(x)}

120 / 132
Additionally, the weighted conditional mean
Corollary
At the optimal F(x), the weighted conditional mean of y is 0.
Proof
When F(x) is optimal,

\frac{\partial J(F(x))}{\partial F(x)} = \frac{\partial \left[P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)}\right]}{\partial F(x)}

121 / 132
Therefore
We have

\frac{\partial J(F(x))}{\partial F(x)} = -P(Y = 1|x)\, e^{-F(x)} + P(Y = -1|x)\, e^{F(x)} = -E\left[y\, e^{-yF(x)} \mid x\right]

Therefore, at the optimum,

E\left[e^{-yF(x)}\, y \mid x\right] = 0

122 / 132
Here, we decide to use Perceptrons
As Weak Learners
We could use a finite number of Perceptrons,
but we want to have an infinitude of possible weak learners,
thus avoiding the need for a matrix S.
Remark
We need to use a gradient-based learner for this.
124 / 132
Perceptron
We use the following error formula, where w_j(t) is the boosting weight of sample j and y_j(t) = \varphi\left(w^T(t)\, x_j\right) is the Perceptron output with parameter vector w(t):

E(w) = \frac{1}{2} \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right)^2

Deriving with respect to the Perceptron parameter w_i:

\frac{\partial E(w)}{\partial w_i} = \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right) \varphi'\left(w^T(t)\, x_j\right) w_j(t)\, x_{ij}

Then, using gradient descent, we have the following update:

w_i(n + 1) = w_i(n) - \eta \sum_{j=1}^{N} \left(w_j(t)\, y_j(t) - d_j\right) \varphi'\left(w^T(t)\, x_j\right) w_j(t)\, x_{ij}

125 / 132
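A minimal sketch of such a gradient-based weak learner: a single sigmoid unit trained by gradient descent on a squared error in which each sample carries its boosting weight. This follows the weighted-loss variant mentioned at the end of the lecture (½ Σ_j w_j(t)(y_j(t) − d_j)²); the learning rate, epoch count, and all names are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_unit(X, d, sample_w, eta=0.1, epochs=100):
    # Gradient descent on E = 0.5 * sum_j sample_w[j] * (sigmoid(w @ x_j) - d_j)^2
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = sigmoid(X @ w)
        err = y - d
        grad = X.T @ (sample_w * err * y * (1.0 - y))   # chain rule: sigmoid'(z) = y * (1 - y)
        w -= eta * grad
    return w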
Data Set
Training set with classes ω_1 = N(0, 1) and ω_2 = N(0, σ²) − N(0, 1)
126 / 132
Example
[Figure: classification result for m = 10]
127 / 132
Example
[Figure: classification result for m = 40]
128 / 132
At the end of the process
[Figure: classification result for m = 80]
129 / 132
Final Confusion Matrix
When m = 80:

        C1     C2
C1     1.0    0.0
C2     0.0    1.0

130 / 132
However
There are other readings of the cryptic phrase
in “Boosting: Foundations and Algorithms” by Schapire and Freund,
“Train weak learner using distribution Dt”:
we could re-sample using the distribution given by the weights w_t,
basically using sampling with replacement over the data set {x_1, x_2, ..., x_N} (a sketch follows).
131 / 132
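A sketch of this resampling reading of the phrase: draw a sample of the training set with replacement, with probabilities proportional to the current weights, and train the weak learner on it. All names are illustrative.

import numpy as np

def resample_by_weights(X, t, w, seed=0):
    rng = np.random.default_rng(seed)
    p = w / np.sum(w)                     # normalize the weights into a distribution D_t
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], t[idx]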
Other Interpretations exist
You can also use a weighted version of the cost function:

\frac{1}{2} \sum_{j} w_j(t) \left(y_j(t) - d_j\right)^2

For more, take a look at
“Boosting Neural Networks” by Holger Schwenk and Yoshua Bengio.
132 / 132

18.1 combining models

  • 1.
    Introduction to MachineLearning Combining Models, Bayesian Average and Boosting Andres Mendez-Vazquez July 31, 2018 1 / 132
  • 2.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 2 / 132
  • 3.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 3 / 132
  • 4.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 5.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 6.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 7.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 8.
    Images/cinvestav- Introduction Observation It is oftenfound that improved performance can be obtained by combining multiple classifiers together in some way. Example, Committees We might train L different classifiers and then make predictions: by using the average of the predictions made by each classifier. Example, Boosting It involves training multiple models in sequence: A error function used to train a particular model depends on the performance of the previous models. 4 / 132
  • 9.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 5 / 132
  • 10.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 11.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 12.
    Images/cinvestav- We could usesimple averaging Given a series of observed samples {x1, x2, ..., xN } with noise ∼ N (0, 1) We could use our knowledge on the noise, for example additive: xi = xi + We can use our knowledge of probability to remove such noise E [xi] = E [xi + ] = E [xi] + E [ ] Then, because E [ ] = 0 E [xi] = E [xi] ≈ 1 N N i=1 xi 6 / 132
  • 13.
  • 14.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 8 / 132
  • 15.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 16.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 17.
    Images/cinvestav- Beyond Simple Averaging Insteadof averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 9 / 132
  • 18.
    Images/cinvestav- Something like this Modelsin charge of different set of inputs 10 / 132
  • 19.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 11 / 132
  • 20.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 21.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 22.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 23.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 24.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 25.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 26.
    Images/cinvestav- Example, Decision Trees Wecan have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 12 / 132
  • 27.
    Images/cinvestav- This is usedin the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 13 / 132
  • 28.
    Images/cinvestav- This is usedin the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 13 / 132
  • 29.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 14 / 132
  • 30.
    Images/cinvestav- It is importantto differentiate between them Although Model Combinations and Bayesian Model Averaging look similar. However, they are actually different For this We have the following example. 15 / 132
  • 31.
    Images/cinvestav- It is importantto differentiate between them Although Model Combinations and Bayesian Model Averaging look similar. However, they are actually different For this We have the following example. 15 / 132
  • 32.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 33.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 34.
    Images/cinvestav- Example of theDifferences For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. Thus the model is specified in terms a joint distribution p (x, z) Corresponding density over the observed variable x using marginalization p (x) = z p (x, z) 16 / 132
  • 35.
    Images/cinvestav- Example In the caseof Mixture of Gaussian’s p (x) = K k=1 πkN (x|µk, Σk) This is an example of model combination. What about other Models 17 / 132
  • 36.
    Images/cinvestav- Example In the caseof Mixture of Gaussian’s p (x) = K k=1 πkN (x|µk, Σk) This is an example of model combination. What about other Models 17 / 132
  • 37.
    Images/cinvestav- More Models Now, forindependent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) 18 / 132
  • 38.
    Images/cinvestav- Therefore Something Notable Each observeddata point xn has a corresponding latent variable zn. Here, we are doing a Combination of Models Each Gaussian indexed by zn is in charge of generating one section of the sample space 19 / 132
  • 39.
    Images/cinvestav- Therefore Something Notable Each observeddata point xn has a corresponding latent variable zn. Here, we are doing a Combination of Models Each Gaussian indexed by zn is in charge of generating one section of the sample space 19 / 132
  • 40.
  • 41.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 21 / 132
  • 42.
    Images/cinvestav- Now, suppose We haveseveral different models indexed by h = 1, ..., H with prior probabilities One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) ≈p(h|X) This is an example of Bayesian model averaging 22 / 132
  • 43.
    Images/cinvestav- Now, suppose We haveseveral different models indexed by h = 1, ..., H with prior probabilities One model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) ≈p(h|X) This is an example of Bayesian model averaging 22 / 132
  • 44.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 45.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 46.
    Images/cinvestav- Bayesian Model Averaging Remark Thesummation over h means that just one model is responsible for generating the whole data set. Observation The probability over h simply reflects our uncertainty of which is the correct model to use. Thus, as the size of the data set increases This uncertainty reduces Posterior probabilities p(h|X) become increasingly focused on just one of the models. 23 / 132
  • 47.
  • 48.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 25 / 132
  • 49.
    Images/cinvestav- The Differences Bayesian modelaveraging The whole data set is generated by a single model h. Model combination Different data points within the data set can potentially be generated from different by different components. 26 / 132
  • 50.
    Images/cinvestav- The Differences Bayesian modelaveraging The whole data set is generated by a single model h. Model combination Different data points within the data set can potentially be generated from different by different components. 26 / 132
  • 51.
    Images/cinvestav- Outline 1 Combining Models Introduction Averagefor Committee Beyond Simple Averaging Example 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging Now Model Averaging The Differences 3 Committees Introduction Bootstrap Data Sets Relation with Monte-Carlo Estimation 4 Boosting AdaBoost Development Cost Function Selection Process How do we select classifiers? Selecting New Classifiers Deriving against the weight αm AdaBoost Algorithm Some Remarks Explanation about AdaBoost’s behavior Statistical Analysis of the Exponential Loss Moving from Regression to Classification Minimization of the Exponential Criterion Finally, The Additive Logistic Regression Example using an Infinitude of Perceptrons 27 / 132
  • 52.
    Images/cinvestav- Committees Idea, the simplestway to construct a committee It is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 28 / 132
For example
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However, a Big Problem
We normally have a single data set.
Thus, we need to introduce some variability between the different committee members.
One approach: use bootstrap data sets.
The Idea of Bootstrap
We denote the training set by Z = {z1, z2, ..., zN }, where zi = (xi, yi).
The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
This is done B times, producing B bootstrap datasets.
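To make the resampling step concrete, here is a minimal Python sketch (not from the slides) of drawing B bootstrap datasets with replacement; the toy arrays and the choice B = 10 are illustrative assumptions.

```python
import numpy as np

def bootstrap_datasets(X, y, B, rng=None):
    """Draw B bootstrap datasets, each of size N, sampled with replacement from (X, y)."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    datasets = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)  # indices drawn with replacement
        datasets.append((X[idx], y[idx]))
    return datasets

# Toy usage: N = 20 points in one dimension
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
boot = bootstrap_datasets(X, y, B=10, rng=0)
print(len(boot), boot[0][0].shape)  # 10 datasets, each with 20 samples
```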
Then, a quantity is computed
S(Z) is any quantity computed from the data Z.
From the bootstrap sampling, we can estimate any aspect of the distribution of S(Z).
Then we refit the model to each of the bootstrap datasets
You compute S(Z∗b) by refitting the model to the b-th bootstrap dataset.
Then you examine the behavior of the fits over the B replications.
For example, its variance:
$$\widehat{\operatorname{Var}}\left[S\left(Z\right)\right]=\frac{1}{B-1}\sum_{b=1}^{B}\left(S\left(Z^{*b}\right)-\bar{S}^{*}\right)^{2},\qquad \bar{S}^{*}=\frac{1}{B}\sum_{b=1}^{B}S\left(Z^{*b}\right)$$
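A minimal sketch of this variance estimate, assuming the statistic S is the sample median (an illustrative choice, not specified by the slides):

```python
import numpy as np

def bootstrap_variance(Z, S, B=200, rng=None):
    """Estimate Var[S(Z)] from B bootstrap replications Z*b, as in the formula above."""
    rng = np.random.default_rng(rng)
    N = len(Z)
    reps = np.array([S(Z[rng.integers(0, N, size=N)]) for _ in range(B)])
    S_bar = reps.mean()                          # the average S-bar* over replications
    return ((reps - S_bar) ** 2).sum() / (B - 1)

Z = np.random.default_rng(1).normal(size=100)    # toy data
print(bootstrap_variance(Z, np.median, B=500, rng=2))
```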
Relation with Monte-Carlo Estimation
Note that Var[S(Z)] can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function of the data Z = {z1, z2, ..., zN }.
For example, a schematic of the bootstrap process: [figure: training samples → bootstrap samples → bootstrap replications]
Thus
Use each of them to train a copy yb(x) of a predictive regression model for a single continuous target; then
$$y_{com}\left(x\right)=\frac{1}{B}\sum_{b=1}^{B}y_{b}\left(x\right) \quad (2)$$
This is also known as Bootstrap Aggregation, or Bagging.
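A minimal bagging sketch following Eq. (2); the base regressor (a low-degree polynomial least-squares fit) and the noisy sinusoidal data are illustrative assumptions, not the slides' experiment.

```python
import numpy as np

def fit_poly(X, y, degree=3):
    """Base regressor: a least-squares polynomial fit (an illustrative choice)."""
    return np.polyfit(X.ravel(), y, degree)

def bagged_predict(models, X):
    """y_com(x) = (1/B) * sum_b y_b(x), as in Eq. (2)."""
    preds = np.stack([np.polyval(c, X.ravel()) for c in models])
    return preds.mean(axis=0)

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

B = 25
models = []
for _ in range(B):
    idx = rng.integers(0, 30, size=30)        # one bootstrap sample
    models.append(fit_poly(X[idx], y[idx]))
print(bagged_predict(models, X)[:5])          # committee predictions on the first points
```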
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate yb(x):
$$y_{b}\left(x\right)=h\left(x\right)+\epsilon_{b}\left(x\right) \quad (3)$$
The average sum-of-squares error over the data takes the form
$$\mathbb{E}_{x}\left[\left(y_{b}\left(x\right)-h\left(x\right)\right)^{2}\right]=\mathbb{E}_{x}\left[\epsilon_{b}^{2}\left(x\right)\right] \quad (4)$$
What is Ex? It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
$$E_{AV}=\frac{1}{B}\sum_{b=1}^{B}\mathbb{E}_{x}\left[\epsilon_{b}^{2}\left(x\right)\right] \quad (5)$$
Similarly, the expected error of the committee is
$$E_{COM}=\mathbb{E}_{x}\left[\left(\frac{1}{B}\sum_{b=1}^{B}\left(y_{b}\left(x\right)-h\left(x\right)\right)\right)^{2}\right]=\mathbb{E}_{x}\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_{b}\left(x\right)\right)^{2}\right] \quad (6)$$
Assume that the errors have zero mean and are uncorrelated
Something reasonable to assume given the way we produce the bootstrap samples:
$$\mathbb{E}_{x}\left[\epsilon_{b}\left(x\right)\right]=0,\qquad \mathbb{E}_{x}\left[\epsilon_{b}\left(x\right)\epsilon_{l}\left(x\right)\right]=0\ \text{ for } b\neq l$$
Then, we have that
$$\begin{aligned}
E_{COM}&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\left(\sum_{b=1}^{B}\epsilon_{b}\left(x\right)\right)^{2}\right]\\
&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)+\sum_{h=1}^{B}\sum_{\substack{k=1\\ k\neq h}}^{B}\epsilon_{h}\left(x\right)\epsilon_{k}\left(x\right)\right]\\
&=\frac{1}{B^{2}}\left[\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]+\sum_{h=1}^{B}\sum_{\substack{k=1\\ k\neq h}}^{B}\mathbb{E}_{x}\left[\epsilon_{h}\left(x\right)\epsilon_{k}\left(x\right)\right]\right]\\
&=\frac{1}{B^{2}}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]=\frac{1}{B}\left(\frac{1}{B}\,\mathbb{E}_{x}\left[\sum_{b=1}^{B}\epsilon_{b}^{2}\left(x\right)\right]\right)
\end{aligned}$$
We finally obtain
$$E_{COM}=\frac{1}{B}E_{AV} \quad (7)$$
Looks great, BUT!!! Unfortunately, it depends on the key assumption that the errors of the individual bootstrap models are uncorrelated.
Thus, the Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable: it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
$$E_{COM}\leq E_{AV} \quad (8)$$
However, we need something better: a more sophisticated technique known as boosting.
Boosting
What does Boosting do? It combines several classifiers to produce a form of committee.
We will describe AdaBoost ("Adaptive Boosting"), developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods: the base classifiers are trained in sequence.
Explanation: consider a two-class classification problem with
1. Samples x1, x2, ..., xN
2. Binary labels t1, t2, ..., tN with ti ∈ {−1, 1}
Cost Function
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way.
For each pattern xi, each expert classifier outputs a classification yj(xi) ∈ {−1, 1}.
The final decision of the committee of M experts is sign(C(xi)), with
$$C\left(x_{i}\right)=\alpha_{1}y_{1}\left(x_{i}\right)+\alpha_{2}y_{2}\left(x_{i}\right)+\cdots+\alpha_{M}y_{M}\left(x_{i}\right) \quad (9)$$
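A small sketch of the committee decision in Eq. (9); the threshold ("stump") experts and their weights are hypothetical, used only to show the weighted vote and the sign rule.

```python
import numpy as np

def committee_decision(classifiers, alphas, X):
    """sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), as in Eq. (9)."""
    C = np.zeros(X.shape[0])
    for alpha, clf in zip(alphas, classifiers):
        C += alpha * clf(X)                   # each y_m(x) returns labels in {-1, +1}
    return np.sign(C)

# Illustrative experts: thresholds on the first feature
stump = lambda thr: (lambda X: np.where(X[:, 0] > thr, 1, -1))
experts = [stump(0.3), stump(0.5), stump(0.7)]
alphas = [0.4, 1.0, 0.6]
X = np.random.default_rng(4).uniform(size=(5, 2))
print(committee_decision(experts, alphas, X))
```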
Now
Adaptive Boosting works even with a continuum of classifiers.
However, for the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following:
Review possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now, selection is done the following way
Test the classifiers in the pool using a training set T of N multidimensional data points xi, where each point xi has a label ti = 1 or ti = −1.
Assigning a cost for actions: we test and rank all classifiers in the expert pool by
Charging a cost exp{β} any time a classifier fails (a miss).
Charging a cost exp{−β} any time a classifier provides the right label (a hit).
Remarks about β
We require β > 0, so misses are penalized more heavily than hits.
Although it looks strange to penalize hits, all that matters is that the penalty for a hit is smaller than the penalty for a miss: exp{−β} < exp{β}.
Why? If we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. It does not compromise generality.
Exponential Loss Function
This kind of error function is different from the squared Euclidean distance; the resulting classification criterion is called an exponential loss function. AdaBoost uses the exponential loss as its error criterion.
[Figure: plot of the exponential loss as a function of the margin]
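As a small illustration (the grid of margin values is an assumption, chosen only for display), the exponential loss exp{−t C(x)} can be tabulated against the 0-1 misclassification loss:

```python
import numpy as np

# Margin z = t * C(x): positive for a correct vote, negative for a miss.
z = np.linspace(-1.0, 1.0, 9)
exp_loss = np.exp(-z)                  # the exponential loss used by AdaBoost
zero_one = (z <= 0).astype(float)      # misclassification loss, for comparison
for zi, e, m in zip(z, exp_loss, zero_one):
    print(f"margin {zi:+.2f}  exp-loss {e:.3f}  0-1 loss {m:.0f}")
```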
Selection of the Classifier
We need a way to select the best classifier in the pool: when we test the M classifiers in the pool, we build a matrix S in which we record the misses (with a ONE) and hits (with a ZERO) of each classifier.
The Matrix S
Row i in the matrix is reserved for the data point xi, and column m is reserved for the m-th classifier in the pool:

         Classifier 1   Classifier 2   · · ·   Classifier M
   x1         0              1         · · ·        1
   x2         0              0         · · ·        1
   x3         1              1         · · ·        0
   ...
   xN         0              0         · · ·        0
Something interesting about S
Summing S over its rows (that is, over the data points) for each classifier gives its empirical risk:
$$ER\left(y_{j}\right)=\frac{1}{N}\sum_{i=1}^{N}S_{ij},\qquad j=1,\ldots,M$$
Therefore, the candidate to be used at a given iteration is the classifier yj with the smallest empirical risk!!!
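A sketch of building S and selecting by empirical risk; the pool of threshold classifiers and the synthetic labels are hypothetical choices used only for illustration.

```python
import numpy as np

def scouting_matrix(classifiers, X, t):
    """S[i, j] = 1 if classifier j misses point i, else 0."""
    return np.stack([(clf(X) != t).astype(int) for clf in classifiers], axis=1)

rng = np.random.default_rng(5)
X = rng.uniform(size=(50, 1))
t = np.where(X[:, 0] > 0.5, 1, -1)                 # toy ground truth

experts = [lambda X, thr=thr: np.where(X[:, 0] > thr, 1, -1) for thr in (0.2, 0.45, 0.8)]
S = scouting_matrix(experts, X, t)
ER = S.mean(axis=0)                                # ER(y_j) = (1/N) * sum_i S_ij
print("empirical risks:", ER, "-> pick classifier", ER.argmin())
```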
Main Idea
What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
The elements of the data set are weighted according to their current relevance (or urgency) at each iteration.
At the beginning of the iterations, all data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses, the more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
The selection process concentrates on selecting new classifiers for the committee by focusing on those which can help with the still misclassified examples.
Then, the best classifiers are those which can provide new insights to the committee; classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want: in each iteration, we rank all classifiers so that we can select the current best out of the pool.
At the m-th iteration, we have already included m − 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee:
$$C^{(m-1)}\left(x_{i}\right)=\alpha_{1}y_{1}\left(x_{i}\right)+\alpha_{2}y_{2}\left(x_{i}\right)+\cdots+\alpha_{m-1}y_{m-1}\left(x_{i}\right) \quad (10)$$
Thus, we have that
Extending the cost function with the new classifier ym:
$$C^{(m)}\left(x_{i}\right)=C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right) \quad (11)$$
At the first iteration m = 1, C^(0) is the zero function.
Thus, the total cost, or total error, is defined as the exponential error
$$E=\sum_{i=1}^{N}\exp\left\{-t_{i}\left(C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right)\right)\right\} \quad (12)$$
Thus
We want to determine αm and ym in an optimal way. Rewriting,
$$E=\sum_{i=1}^{N}w_{i}^{(m)}\exp\left\{-t_{i}\alpha_{m}y_{m}\left(x_{i}\right)\right\} \quad (13)$$
where, for i = 1, 2, ..., N,
$$w_{i}^{(m)}=\exp\left\{-t_{i}C^{(m-1)}\left(x_{i}\right)\right\} \quad (14)$$
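A short sketch of Eqs. (13)-(14): computing the current weights from the committee scores and evaluating the weighted exponential error for one candidate classifier. The numeric values are illustrative assumptions.

```python
import numpy as np

def exp_weights(C_prev, t):
    """w_i^(m) = exp(-t_i * C^(m-1)(x_i)), Eq. (14)."""
    return np.exp(-t * C_prev)

def weighted_exp_error(w, t, y_m, alpha_m):
    """E = sum_i w_i^(m) * exp(-t_i * alpha_m * y_m(x_i)), Eq. (13)."""
    return np.sum(w * np.exp(-t * alpha_m * y_m))

t = np.array([1, -1, 1, 1, -1])
C_prev = np.array([0.8, -0.5, -0.2, 1.1, 0.3])   # committee scores so far (illustrative)
y_m = np.array([1, -1, -1, 1, -1])               # one candidate classifier's outputs
w = exp_weights(C_prev, t)
print(w, weighted_exp_error(w, t, y_m, alpha_m=0.5))
```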
Remark
The weight w_i^(m) = exp{−ti C^(m−1)(xi)} needs to be used in some way for the training of the new classifier. This is of the utmost importance!!!
Therefore, you could use such weights in several ways:
As an output modification in the estimator when applied to the loss function,
$$\sum_{i=1}^{N}\left(y_{i}-w_{i}^{(m)}f\left(x_{i}\right)\right)^{2}$$
By sub-sampling with replacement using the distribution D_m given by the w_i^(m) over the xi, and then training on that sub-sample.
By applying the weights to the loss function itself used for training,
$$\sum_{i=1}^{N}w_{i}^{(m)}\left(y_{i}-f\left(x_{i}\right)\right)^{2}$$
Thus
In the first iteration, w_i^(1) = 1 for i = 1, ..., N, meaning all the points have the same importance.
During later iterations, the vector w^(m) represents the weight assigned to each data point in the training set at iteration m.
Rewriting the Cost Equation
We can split (Eq. 13) into hits and misses:
$$E=\exp\left\{-\alpha_{m}\right\}\sum_{t_{i}=y_{m}\left(x_{i}\right)}w_{i}^{(m)}+\exp\left\{\alpha_{m}\right\}\sum_{t_{i}\neq y_{m}\left(x_{i}\right)}w_{i}^{(m)} \quad (15)$$
Meaning: the total cost is the weighted cost of all hits plus the weighted cost of all misses.
Therefore
Writing the first summand as Wc exp{−αm} and the second as We exp{αm}:
$$E=W_{c}\exp\left\{-\alpha_{m}\right\}+W_{e}\exp\left\{\alpha_{m}\right\} \quad (16)$$
Now, for the selection of ym
The exact value of αm > 0 is irrelevant, since for a fixed αm, minimizing E is equivalent to minimizing exp{αm}E, or in other words
$$\exp\left\{\alpha_{m}\right\}E=W_{c}+W_{e}\exp\left\{2\alpha_{m}\right\} \quad (17)$$
Now, we have
Given that αm > 0, then 2αm > 0, and we have exp{2αm} > exp{0} = 1.
Then
We can rewrite (Eq. 17) as
$$\exp\left\{\alpha_{m}\right\}E=W_{c}+W_{e}-W_{e}+W_{e}\exp\left\{2\alpha_{m}\right\} \quad (18)$$
Thus
$$\exp\left\{\alpha_{m}\right\}E=\left(W_{c}+W_{e}\right)+W_{e}\left(\exp\left\{2\alpha_{m}\right\}-1\right) \quad (19)$$
Now, Wc + We is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus
The right-hand side of the equation is minimized when, at the m-th iteration, we pick the classifier with the lowest total cost We, that is, the lowest weighted error.
Intuitively, the next selected ym should be the one with the lowest penalty given the current set of weights.
Do you remember the matrix S?
We pick the classifier with the lowest total cost We. Now we need to do some updates, specifically the value αm.
Deriving against the weight αm
Going back to the original E, we can use the derivative trick:
$$\frac{\partial E}{\partial\alpha_{m}}=-W_{c}\exp\left\{-\alpha_{m}\right\}+W_{e}\exp\left\{\alpha_{m}\right\} \quad (20)$$
Setting this equal to zero and multiplying by exp{αm}:
$$-W_{c}+W_{e}\exp\left\{2\alpha_{m}\right\}=0 \quad (21)$$
The optimal value is thus
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{W_{c}}{W_{e}}\right) \quad (22)$$
Now
Writing the total sum of all weights as
$$W=W_{c}+W_{e} \quad (23)$$
we can rewrite the previous equation as
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{W-W_{e}}{W_{e}}\right)=\frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}}\right) \quad (24)$$
with the weighted error rate of the data points
$$e_{m}=\frac{W_{e}}{W} \quad (25)$$
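As a quick numerical check (the values of Wc and We are made-up), the closed-form αm of Eqs. (22) and (24) can be compared against a brute-force minimization of E(α) from Eq. (16):

```python
import numpy as np

Wc, We = 0.8, 0.2                               # illustrative weighted hit/miss masses
alpha_closed = 0.5 * np.log(Wc / We)            # Eq. (22)
e_m = We / (Wc + We)
alpha_rate = 0.5 * np.log((1 - e_m) / e_m)      # Eq. (24), the same value

alphas = np.linspace(0.01, 3.0, 10000)
E = Wc * np.exp(-alphas) + We * np.exp(alphas)  # Eq. (16)
print(alpha_closed, alpha_rate, alphas[E.argmin()])  # all three agree (about 0.693 here)
```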
What about the weights?
Using the equation
$$w_{i}^{(m)}=\exp\left\{-t_{i}C^{(m-1)}\left(x_{i}\right)\right\} \quad (26)$$
and because we now have αm and ym(xi):
$$w_{i}^{(m+1)}=\exp\left\{-t_{i}C^{(m)}\left(x_{i}\right)\right\}=\exp\left\{-t_{i}\left(C^{(m-1)}\left(x_{i}\right)+\alpha_{m}y_{m}\left(x_{i}\right)\right)\right\}=w_{i}^{(m)}\exp\left\{-t_{i}\alpha_{m}y_{m}\left(x_{i}\right)\right\}$$
Sequential Training
Thus, AdaBoost trains a new classifier using a data set where the weighting coefficients are adjusted according to the performance of the previously trained classifier, to give greater weight to the misclassified data points.
AdaBoost Algorithm
Step 1: Initialize w_i^(1) to 1/N.
Step 2: For m = 1, 2, ..., M, select a weak classifier ym(x) for the training data by minimizing the weighted error function:
$$\arg\min_{y_{m}}\sum_{i=1}^{N}w_{i}^{(m)}I\left(y_{m}\left(x_{i}\right)\neq t_{i}\right)=\arg\min_{y_{m}}\sum_{t_{i}\neq y_{m}\left(x_{i}\right)}w_{i}^{(m)}=\arg\min_{y_{m}}W_{e} \quad (27)$$
where I is an indicator function.
AdaBoost Algorithm
Step 2 (continued): Evaluate
$$e_{m}=\frac{\sum_{n=1}^{N}w_{n}^{(m)}I\left(y_{m}\left(x_{n}\right)\neq t_{n}\right)}{\sum_{n=1}^{N}w_{n}^{(m)}} \quad (28)$$
where I is an indicator function.
AdaBoost Algorithm
Step 3: Set the weight αm to
$$\alpha_{m}=\frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}}\right) \quad (29)$$
Now update the weights of the data for the next iteration.
If ti ≠ ym(xi), i.e. a miss:
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{\alpha_{m}\right\}=w_{i}^{(m)}\sqrt{\frac{1-e_{m}}{e_{m}}} \quad (30)$$
If ti = ym(xi), i.e. a hit:
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{-\alpha_{m}\right\}=w_{i}^{(m)}\sqrt{\frac{e_{m}}{1-e_{m}}} \quad (31)$$
Finally, make predictions
For this, use
$$Y_{M}\left(x\right)=\operatorname{sign}\left(\sum_{m=1}^{M}\alpha_{m}y_{m}\left(x\right)\right) \quad (32)$$
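Putting Steps 1-3 and Eq. (32) together, here is a minimal end-to-end sketch. The pool of decision stumps, the toy data, and the stopping rule are assumptions made for illustration; they are not prescribed by the slides.

```python
import numpy as np

def stump_pool(X, n_thresholds=10):
    """A finite pool of weak classifiers: thresholds on each feature, with both signs."""
    pool = []
    for d in range(X.shape[1]):
        for thr in np.linspace(X[:, d].min(), X[:, d].max(), n_thresholds):
            for s in (1, -1):
                pool.append(lambda X, d=d, thr=thr, s=s: s * np.where(X[:, d] > thr, 1, -1))
    return pool

def adaboost(X, t, pool, M=20):
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                     # Step 1: w_i^(1) = 1/N
    alphas, chosen = [], []
    for _ in range(M):
        # Step 2: pick the classifier with the smallest weighted miss cost W_e
        errors = [w @ (clf(X) != t) for clf in pool]
        m = int(np.argmin(errors))
        y_m = pool[m](X)
        e_m = errors[m] / w.sum()               # Eq. (28)
        if e_m <= 0 or e_m >= 0.5:              # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - e_m) / e_m)   # Step 3, Eq. (29)
        w = w * np.exp(-alpha * t * y_m)        # Eqs. (30)-(31) in one expression
        alphas.append(alpha)
        chosen.append(pool[m])
    return alphas, chosen

def predict(alphas, chosen, X):
    """Y_M(x) = sign(sum_m alpha_m y_m(x)), Eq. (32)."""
    return np.sign(sum(a * clf(X) for a, clf in zip(alphas, chosen)))

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # a tilted boundary no single stump matches
alphas, chosen = adaboost(X, t, stump_pool(X), M=40)
print("training accuracy:", (predict(alphas, chosen, X) == t).mean())
```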
Observations
First: the first base classifier is trained by the usual procedure of training a single classifier.
Second: from (Eq. 30) and (Eq. 31), we can see that the weighting coefficients are increased for data points that are misclassified.
Third: the quantity em represents a weighted measure of the error rate; thus αm gives more weight to the more accurate classifiers.
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers whose members are trained to minimize the error function given the current weights.
If indeed a finite set of classifiers is given, we only need to test the classifiers once for each data point.
The scouting matrix S can then be reused at each iteration by multiplying the transposed vector of weights w^(m) with S to obtain We for each machine.
We have then the following:
$$\left(W_{e}^{(1)},W_{e}^{(2)},\ldots,W_{e}^{(M)}\right)=\left(w^{(m)}\right)^{T}S \quad (33)$$
This allows us to reformulate the weight update step so that only misses lead to weight modification.
Note that the weight vector w^(m) is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
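A tiny sketch of Eq. (33), with a made-up scouting matrix and weight vector: one matrix product returns the weighted miss cost of every classifier at once.

```python
import numpy as np

# S[i, j] = 1 if classifier j misses point i (as in the scouting matrix above)
S = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
w = np.array([0.4, 0.1, 0.3, 0.2])   # current weights w^(m), illustrative values
We_all = w @ S                        # Eq. (33): (W_e^(1), ..., W_e^(M)) = (w^(m))^T S
print(We_all, "-> select classifier", We_all.argmin())
```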
We have the following cases
If em → 1, all the samples were incorrectly classified.
Thus, we get
$$\lim_{e_{m}\rightarrow 1}\frac{1-e_{m}}{e_{m}}=0,\qquad\text{so } \alpha_{m}\rightarrow-\infty$$
Now
We get that, for every misclassified sample,
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{\alpha_{m}\right\}\rightarrow 0$$
Therefore, we only need to reverse the answers to get a perfect classifier and select it as the only committee member.
Now, the Last Case
If em → 1/2, we have αm → 0.
Thus, whether the sample is correctly or incorrectly classified,
$$\exp\left\{-\alpha_{m}t_{i}y_{m}\left(x_{i}\right)\right\}\rightarrow 1 \quad (34)$$
Therefore, the weights do not change at all.
Thus, we have
What about em → 0? We have that αm → +∞.
For samples that are always correctly classified,
$$w_{i}^{(m+1)}=w_{i}^{(m)}\exp\left\{-\alpha_{m}t_{i}y_{m}\left(x_{i}\right)\right\}\rightarrow 0$$
Thus, we only need m committee members; we do not need an (m + 1)-th member.
This comes from
The paper "Additive Logistic Regression: A Statistical View of Boosting" by Friedman, Hastie and Tibshirani.
Something Notable: the paper proves that boosting algorithms are procedures for fitting an additive logistic regression model
$$\mathbb{E}\left[y|x\right]=F\left(x\right)\qquad\text{with}\qquad F\left(x\right)=\sum_{m=1}^{M}f_{m}\left(x\right)$$
Consider the Additive Regression Model
We are interested in modeling the mean E[y|x] = F(x) with the additive model
$$F\left(x\right)=\sum_{i=1}^{d}f_{i}\left(x_{i}\right)$$
where each fi(xi) is a function of a single input feature xi.
A convenient algorithm for updating these models is the backfitting algorithm, with update
$$f_{i}\left(x_{i}\right)\leftarrow\mathbb{E}\left[y-\sum_{k\neq i}f_{k}\left(x_{k}\right)\;\Big|\;x_{i}\right]$$
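A rough backfitting sketch under stated assumptions: the per-coordinate conditional expectation is approximated with a simple bin-average smoother, and the additive toy target is invented for illustration; a real implementation would also center the component functions.

```python
import numpy as np

def bin_smoother(x, r, n_bins=10):
    """Approximate E[r | x] with bin averages; returns a callable f_i(x)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0 for b in range(n_bins)])
    return lambda xq: means[np.clip(np.searchsorted(edges, xq, side="right") - 1, 0, n_bins - 1)]

def backfit(X, y, n_iter=20, n_bins=10):
    """Cycle the update f_i(x_i) <- E[y - sum_{k != i} f_k(x_k) | x_i]."""
    N, d = X.shape
    fits = [lambda xq: np.zeros_like(xq, dtype=float) for _ in range(d)]
    for _ in range(n_iter):
        for i in range(d):
            partial = sum(fits[k](X[:, k]) for k in range(d) if k != i)
            fits[i] = bin_smoother(X[:, i], y - partial, n_bins)
    return fits

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)  # additive toy target
fits = backfit(X, y)
pred = sum(fits[i](X[:, i]) for i in range(2))
print("residual std:", np.std(y - pred))
```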
Remarks
An example of these additive models is matching pursuit,

    f(t) = Σ_{n=−∞}^{+∞} a_n g_{γ_n}(t)

Under fairly general conditions, backfitting converges to the minimizer of E[(y − f(x))^2].
102 / 132
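As an illustration, here is a minimal backfitting sketch. The conditional expectation E[· | x_i] must be estimated somehow; this sketch uses per-feature binned means as a crude stand-in, which is an assumption made only for illustration, not the smoother the backfitting literature would normally use.

    import numpy as np

    def backfit(X, y, n_iter=20, n_bins=10):
        # Fit F(x) = sum_i f_i(x_i) by cycling the update
        # f_i(x_i) <- E[y - sum_{k != i} f_k(x_k) | x_i],
        # estimating E[. | x_i] with binned means (a crude stand-in).
        n, d = X.shape
        edges = [np.quantile(X[:, i], np.linspace(0, 1, n_bins + 1)) for i in range(d)]
        bins = [np.clip(np.searchsorted(edges[i][1:-1], X[:, i]), 0, n_bins - 1)
                for i in range(d)]
        f_vals = [np.zeros(n_bins) for _ in range(d)]   # f_i on each bin
        fitted = [np.zeros(n) for _ in range(d)]        # f_i(x_i) per sample
        for _ in range(n_iter):
            for i in range(d):
                r = y - sum(fitted[k] for k in range(d) if k != i)  # partial residual
                for b in range(n_bins):
                    mask = bins[i] == b
                    if mask.any():
                        f_vals[i][b] = r[mask].mean()
                fitted[i] = f_vals[i][bins[i]]
        return fitted

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
    fitted = backfit(X, y)
    print("training MSE:", np.mean((y - sum(fitted)) ** 2))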
In the case of AdaBoost
We have an additive model which considers functions {f_m(x)}_{m=1}^{M} that take into account all the features (Perceptrons, Decision Trees, etc.). Each of these functions is characterized by a set of parameters γ_m and a multiplier α_m,

    f_m(x) = α_m y_m(x|γ_m)

with additive model

    F_M(x) = α_1 y_1(x|γ_1) + · · · + α_M y_M(x|γ_M)

103 / 132
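A minimal sketch of evaluating such an additive committee, assuming hypothetical decision stumps stand in for the base learners y_m(x|γ_m); any classifier returning ±1 would do.

    import numpy as np

    def F_M(x, base_learners, alphas):
        # F_M(x) = sum_m alpha_m * y_m(x | gamma_m), base learners return +/- 1
        return sum(a * y(x) for a, y in zip(alphas, base_learners))

    # hypothetical decision stumps on the first feature
    stumps = [lambda x, thr=t: np.where(x[:, 0] > thr, 1, -1) for t in (-0.5, 0.0, 0.5)]
    alphas = [0.8, 0.4, 0.2]

    x = np.random.default_rng(1).normal(size=(5, 2))
    print(np.sign(F_M(x, stumps, alphas)))   # committee decision: sign(F_M(x))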
Remark - Moving from Regression to Classification
Given that regression produces a wide range of outputs, logistic regression is widely used to move from regression to classification,

    log [ P(Y = 1|x) / P(Y = −1|x) ] = Σ_{m=1}^{M} f_m(x)

A nice property: the probability estimates lie in [0, 1]. Now, solving under the assumption P(Y = 1|x) + P(Y = −1|x) = 1,

    P(Y = 1|x) = e^{F(x)} / (1 + e^{F(x)})

105 / 132
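The link above is simply the logistic function applied to F(x); a small sketch:

    import numpy as np

    def prob_pos(F_x):
        # P(Y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}), the logistic link
        return np.exp(F_x) / (1.0 + np.exp(F_x))

    F_x = np.array([-3.0, 0.0, 3.0])
    p = prob_pos(F_x)
    print(p)                          # values in [0, 1]
    print(np.log(p / (1 - p)))        # recovers the log-odds F(x)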
The Exponential Criterion
We have our exponential criterion under an expected value, with y ∈ {1, −1},

    J(F) = E[e^{−yF(x)}]

Lemma: E[e^{−yF(x)}] is minimized at

    F(x) = (1/2) log [ P(Y = 1|x) / P(Y = −1|x) ]

Hence

    P(Y = 1|x) = e^{F(x)} / (e^{−F(x)} + e^{F(x)}),   P(Y = −1|x) = e^{−F(x)} / (e^{−F(x)} + e^{F(x)})

107 / 132
Proof
Given the discrete nature of y ∈ {1, −1},

    ∂E[e^{−yF(x)} | x] / ∂F(x) = −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)}

Setting this derivative to zero,

    −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} = 0

108 / 132
Then
We have that

    P(Y = 1|x) e^{−F(x)} = P(Y = −1|x) e^{F(x)} = [1 − P(Y = 1|x)] e^{F(x)}

Solving,

    e^{F(x)} = [ e^{−F(x)} + e^{F(x)} ] P(Y = 1|x)

109 / 132
Finally, we have
The first equation,

    P(Y = 1|x) = e^{F(x)} / (e^{−F(x)} + e^{F(x)})

and similarly

    P(Y = −1|x) = e^{−F(x)} / (e^{−F(x)} + e^{F(x)})

110 / 132
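A quick numerical check of the lemma at a single point x, using a hypothetical value for P(Y = 1|x): minimizing J(F) = p e^{−F} + (1 − p) e^{F} over a grid recovers F* = (1/2) log(p/(1 − p)), and the implied probability recovers p.

    import numpy as np

    p = 0.8                                    # hypothetical P(Y = 1 | x)
    F_grid = np.linspace(-3, 3, 20001)
    J = p * np.exp(-F_grid) + (1 - p) * np.exp(F_grid)

    F_numeric = F_grid[np.argmin(J)]
    F_closed = 0.5 * np.log(p / (1 - p))
    print(F_numeric, F_closed)                 # both approx 0.693
    print(np.exp(F_closed) / (np.exp(-F_closed) + np.exp(F_closed)))   # approx 0.8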
Basically
When you minimize the cost function E[e^{−yF(x)}], the optimum recovers exactly the binary classification probabilities of logistic regression.
111 / 132
Furthermore
Corollary: If E is replaced by averages over regions of x where F(x) is constant (similar to a decision tree), the same result applies to the sample proportions of y = 1 and y = −1.
112 / 132
Finally, The Additive Logistic Regression
Proposition: The AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of J(F) = E[e^{−yF(x)}].
Proof: Imagine you have an estimate F(x); we then seek an improved estimate F(x) + f(x).
114 / 132
For This
We minimize, at each x, J(F(x) + f(x)). This can be expanded as

    J(F(x) + f(x)) = E[e^{−y(F(x)+f(x))} | x]
                   = e^{−f(x)} E[e^{−yF(x)} I(y = 1) | x] + e^{f(x)} E[e^{−yF(x)} I(y = −1) | x]

115 / 132
Deriving w.r.t. f(x)
We get

    −e^{−f(x)} E[e^{−yF(x)} I(y = 1) | x] + e^{f(x)} E[e^{−yF(x)} I(y = −1) | x] = 0

116 / 132
We have the following
If we divide by E[e^{−yF(x)} | x], the first term becomes

    E[e^{−yF(x)} I(y = 1) | x] / E[e^{−yF(x)} | x] = E_w[I(y = 1) | x]

and also

    E[e^{−yF(x)} I(y = −1) | x] / E[e^{−yF(x)} | x] = E_w[I(y = −1) | x]

117 / 132
Thus, we have
We apply the natural log to both sides,

    log e^{−f(x)} + log E_w[I(y = 1) | x] = log e^{f(x)} + log E_w[I(y = −1) | x]

Then

    2f(x) = log E_w[I(y = 1) | x] − log E_w[I(y = −1) | x]

118 / 132
Finally
We have that

    f(x) = (1/2) log [ E_w[I(y = 1) | x] / E_w[I(y = −1) | x] ]

or, in terms of probabilities,

    f(x) = (1/2) log [ P_w(y = 1|x) / P_w(y = −1|x) ]

119 / 132
The Weight Update
Finally, we have a way to update the weights: setting w_t(x, y) = e^{−yF(x)},

    w_{t+1}(x, y) = w_t(x, y) e^{−yf(x)}

120 / 132
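A minimal sketch of one such stage on a fixed stump split, assuming the per-leaf weighted class proportions stand in for E_w[I(y = 1) | x]; this mirrors the Real AdaBoost step in the Friedman et al. paper, shown here only to illustrate f(x) = (1/2) log[P_w(y = 1|x)/P_w(y = −1|x)] followed by w ← w e^{−yf(x)}.

    import numpy as np

    def stagewise_step(X, y, w, feature=0, threshold=0.0, eps=1e-8):
        # f(x) = 0.5 * log(P_w(y = 1 | x) / P_w(y = -1 | x)), estimated per leaf,
        # then w <- w * exp(-y * f(x)), renormalized
        f = np.zeros(len(y))
        for leaf in (X[:, feature] <= threshold, X[:, feature] > threshold):
            p_pos = w[leaf & (y == 1)].sum() / max(w[leaf].sum(), eps)
            f[leaf] = 0.5 * np.log((p_pos + eps) / (1 - p_pos + eps))
        w_new = w * np.exp(-y * f)
        return f, w_new / w_new.sum()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
    w = np.ones(200) / 200
    f, w = stagewise_step(X, y, w)      # stage contribution f(x) and updated weights
    print(f[:3], w.sum())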
Additionally, the weighted conditional mean
Corollary: At the optimal F(x), the weighted conditional mean of y is 0.
Proof: When F(x) is optimal,

    ∂J(F(x))/∂F(x) = ∂[ P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} ] / ∂F(x)

121 / 132
Therefore
We have

    ∂J(F(x))/∂F(x) = −P(Y = 1|x) e^{−F(x)} + P(Y = −1|x) e^{F(x)} = −E[e^{−yF(x)} y | x]

Therefore, at the optimum,

    E[e^{−yF(x)} y | x] = 0

i.e., the weighted conditional mean of y is zero.
122 / 132
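A one-line numerical check of this corollary at a single x, with a hypothetical P(Y = 1|x): plugging the optimal F(x) = (1/2) log(p/(1 − p)) into the weighted mean gives zero.

    import numpy as np

    p = 0.7                                   # hypothetical P(Y = 1 | x)
    F_star = 0.5 * np.log(p / (1 - p))
    weighted_mean_y = p * np.exp(-F_star) * (+1) + (1 - p) * np.exp(F_star) * (-1)
    print(weighted_mean_y)                    # approx 0, up to floating point error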
Here, we decide to use Perceptrons as Weak Learners
We could use a finite number of Perceptrons, but we want an infinitude of possible weak learners, thus avoiding the need for a matrix S.
Remark: We need to use a gradient-based learner for this.
124 / 132
Perceptron
We use the following error over the samples (here w_j(t) denotes the boosting weight of sample j at round t, while w(t) denotes the Perceptron weights),

    E(w) = (1/2) Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j)^2,   with y_j(t) = ϕ(w^T(t) x_j)

Deriving with respect to w_i,

    ∂E(w)/∂w_i = Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j) ϕ'(w^T(t) x_j) w_j(t) x_{ij}

Then, using gradient descent, we have the following update,

    w_i(n + 1) = w_i(n) − η Σ_{j=1}^{N} (w_j(t) y_j(t) − d_j) ϕ'(w^T(t) x_j) w_j(t) x_{ij}

125 / 132
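A minimal sketch of such a gradient-based weak learner, assuming tanh as the differentiable ϕ and, for simplicity, the weighted-cost variant (1/2) Σ_j w_j(t)(y_j(t) − d_j)^2 mentioned on the final slide rather than the exact form above.

    import numpy as np

    def train_weighted_perceptron(X, d, sample_w, eta=0.1, epochs=200):
        # gradient descent on (1/2) * sum_j w_j * (tanh(w^T x_j) - d_j)^2
        n, dim = X.shape
        Xb = np.hstack([X, np.ones((n, 1))])      # append a bias input
        w = np.zeros(dim + 1)
        for _ in range(epochs):
            y = np.tanh(Xb @ w)                   # y_j = phi(w^T x_j)
            phi_prime = 1.0 - y ** 2              # derivative of tanh
            grad = Xb.T @ (sample_w * (y - d) * phi_prime)
            w -= eta * grad                       # w(n+1) = w(n) - eta * gradient
        return w

    def predict(w, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.sign(np.tanh(Xb @ w))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    d = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)
    sample_w = np.ones(300) / 300                 # uniform boosting weights at round 1
    w = train_weighted_perceptron(X, d, sample_w)
    print("weighted error:", np.sum(sample_w * (predict(w, X) != d)))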
Data Set
Training set with classes ω1 = N(0, 1) and ω2 = N(0, σ^2) − N(0, 1).
126 / 132
At the end of the process
For m = 80.
129 / 132
Final Confusion Matrix
When m = 80:

            C1     C2
    C1     1.0    0.0
    C2     0.0    1.0

130 / 132
However
There are other readings of the cryptic phrase in "Boosting: Foundations and Algorithms" by Schapire and Freund, "Train weak learner using distribution D_t": we could re-sample using the distribution w_t, basically using sampling with replacement over the data set {x_1, x_2, ..., x_N}.
131 / 132
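A minimal sketch of that re-sampling interpretation, assuming the boosting distribution w_t sums to one:

    import numpy as np

    def resample_by_weights(X, d, w, rng):
        # draw N indices with replacement, with probabilities given by w_t
        idx = rng.choice(len(X), size=len(X), replace=True, p=w)
        return X[idx], d[idx]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    d = np.where(X[:, 0] > 0, 1, -1)
    w = np.ones(10) / 10                   # the distribution w_t
    X_s, d_s = resample_by_weights(X, d, w, rng)
    # the weak learner is then trained on (X_s, d_s) with an ordinary, unweighted cost
    print(X_s.shape, d_s)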
Other Interpretations Exist
But you can use a weighted version of the cost function,

    (1/2) Σ_j w_j(t) (y_j(t) − d_j)^2

For more, take a look at "Boosting Neural Networks" by Holger Schwenk and Yoshua Bengio.
132 / 132
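And the weighted cost itself, evaluated on hypothetical values:

    import numpy as np

    def weighted_squared_error(w_t, y, d):
        # (1/2) * sum_j w_j(t) * (y_j(t) - d_j)^2
        return 0.5 * np.sum(w_t * (y - d) ** 2)

    w_t = np.array([0.5, 0.3, 0.2])           # hypothetical boosting weights
    y = np.array([0.9, -0.8, 0.2])            # weak learner outputs y_j(t)
    d = np.array([1.0, -1.0, -1.0])           # targets d_j
    print(weighted_squared_error(w_t, y, d))  # 0.1525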