Introduction to Machine Learning
Vapnik–Chervonenkis Dimension
Andres Mendez-Vazquez
July 22, 2018
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the VC Dimension
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
Until Now
We have been learning
A lot of functions to approximate the model f of a given data set D.
[Figure: a phenomenon generating samples, what we can observe, and the data observation/collection step]
The Question
But we never asked ourselves
Are we really able to learn f from D?
Example
Consider the following data set D
Consider a Boolean target function over a three-dimensional input space X = {0, 1}^3
With a data set D

n   x_n   y_n
1   000   0
2   001   1
3   010   1
4   011   0
5   100   1
We have the following
The input space has 2^3 = 8 possible points
Therefore, there are 2^{2^3} = 256 possible Boolean functions f, of which 2^3 = 8 agree with D on the five observed points
Learning outside the data D: basically, we want a g that generalizes outside D

n   x_n   y_n   g   f1  f2  f3  f4  f5  f6  f7  f8
1   000   0     0   0   0   0   0   0   0   0   0
2   001   1     1   1   1   1   1   1   1   1   1
3   010   1     1   1   1   1   1   1   1   1   1
4   011   0     0   0   0   0   0   0   0   0   0
5   100   1     1   1   1   1   1   1   1   1   1
6   101   ?     ?   0   0   0   0   1   1   1   1
7   110   ?     ?   0   0   1   1   0   0   1   1
8   111   ?     ?   0   1   0   1   0   1   0   1
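To make this concrete, here is a minimal Python sketch (not part of the original slides) that enumerates the 2^3 = 8 candidate targets f1, ..., f8 of the table above: all of them agree with D on the five observed points, yet for each unseen input they split evenly between the two labels, so no choice of g can be guaranteed to match the true f outside D.

from itertools import product

# The data set D: five of the eight points of X = {0,1}^3 with their labels.
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

# The three inputs that are not observed in D.
unseen = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]

# Every assignment of labels to the unseen points gives a target f
# that is fully consistent with D: 2^3 = 8 candidates in total.
candidates = []
for labels in product([0, 1], repeat=len(unseen)):
    f = dict(D)
    f.update(zip(unseen, labels))
    candidates.append(f)

print(len(candidates))          # 8 candidate target functions
# All candidates agree on D ...
print(all(all(f[x] == y for x, y in D.items()) for f in candidates))
# ... but on each unseen point the candidates split evenly between 0 and 1,
# so no single g can match all of them outside D.
for x in unseen:
    votes = [f[x] for f in candidates]
    print(x, votes.count(0), votes.count(1))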
Here is the Dilemma!!!
Each of f1, f2, ..., f8
is a possible real f, the true f.
Any of them is an equally possible and equally good f.
Therefore
The quality of the learning will be determined by how close our prediction is to the true value.
Therefore, we have
In order to select a g, we need a hypothesis set H
so that our training procedure can select such a g.
Further, any of f1, f2, ..., f8 is a good choice for f
Therefore, it does not matter how closely we match the bits in D
Our problem is that we want to generalize to the data outside D
However, whether our hypothesis is correct or incorrect on D makes no difference for that goal.
We want to Generalize
But, if we use only a deterministic approach to H,
our attempts to use H to learn g are a waste of time!!!
Consider a "bin" with red and blue marbles
Going back to our example
[Figure: a bin of red and blue marbles, and samples drawn from it]
Therefore
We have the "real probabilities"

P[\text{Pick a red marble}] = \mu
P[\text{Pick a blue marble}] = 1 - \mu

However, the value of µ is not known
Thus, we draw N samples from the bin in an independent way.
Here, the fraction of red marbles in the sample is equal to ν
Question: Can ν be used to learn about µ?
Two Answers... Possible vs. Probable
No!!! Because we can only see the samples
For example, the sample can be mostly blue while the bin is mostly red.
Yes!!!
The sample frequency ν is likely close to the bin frequency µ.
What does ν say about µ?
We have the following hypothesis
In a big sample (large N), ν is probably close to µ (within ε).
How?
Hoeffding’s Inequality.
We have the following theorem
Theorem (Hoeffding’s inequality)
Let Z_1, ..., Z_N be independent bounded random variables with Z_i \in [a, b] for all i, where -\infty < a \leq b < \infty. Then

P\left(\frac{1}{N}\sum_{i=1}^{N}(Z_i - E[Z_i]) \geq t\right) \leq \exp\left(-\frac{2Nt^2}{(b-a)^2}\right)

and

P\left(\frac{1}{N}\sum_{i=1}^{N}(Z_i - E[Z_i]) \leq -t\right) \leq \exp\left(-\frac{2Nt^2}{(b-a)^2}\right)

for all t \geq 0.
Therefore
Assume that the Z_i are the random variables obtained from the N samples
Then the values Z_i \in \{0, 1\} (so a = 0, b = 1 and E[Z_i] = \mu), therefore we have...
First inequality, for any \epsilon > 0 and N

P\left(\frac{1}{N}\sum_{i=1}^{N} Z_i - \mu \geq \epsilon\right) \leq \exp\left(-2N\epsilon^2\right)

Second inequality, for \epsilon > 0 and N

P\left(\frac{1}{N}\sum_{i=1}^{N} Z_i - \mu \leq -\epsilon\right) \leq \exp\left(-2N\epsilon^2\right)
Here
We can use the fact that

\nu = \frac{1}{N}\sum_{i=1}^{N} Z_i

Putting it all together, we have

P(\nu - \mu \geq \epsilon \text{ or } \nu - \mu \leq -\epsilon) \leq P(\nu - \mu \geq \epsilon) + P(\nu - \mu \leq -\epsilon)

Finally

P(|\nu - \mu| \geq \epsilon) \leq 2\exp\left(-2N\epsilon^2\right)
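As a sanity check of this inequality, the following sketch (not from the slides; the values of µ, N and ε are illustrative) estimates P(|ν − µ| ≥ ε) by repeatedly sampling the bin and compares it with the bound 2·exp(−2Nε²).

import random
import math

def deviation_probability(mu=0.6, N=1000, eps=0.05, trials=20000, seed=0):
    # Estimate P(|nu - mu| >= eps) by repeated sampling from the bin.
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(N)) / N   # sample frequency
        if abs(nu - mu) >= eps:
            bad += 1
    return bad / trials

empirical = deviation_probability()
hoeffding = 2 * math.exp(-2 * 1000 * 0.05 ** 2)   # 2*exp(-2*N*eps^2)
print(empirical, "<=", hoeffding)   # the empirical value stays below the bound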
Therefore
We have the following
If ε is small enough and N is large, then with high probability ν ≈ µ.
Making Possible
Possible to estimate ν ≈ µ
How do we connect this with learning?
Learning
We want to find a function f : X → Y which is unknown!!!
Here we assume that each ball in the bin is a sample x ∈ X.
Thus, it is necessary to select a hypothesis
Basically, we want to have a hypothesis h:
If h(x) = f(x), we color the sample blue.
If h(x) ≠ f(x), we color the sample red.
Here a Small Remark
Here, we are not talking about classes
When talking about blue and red balls, we only ask whether we identify the correct label:

y_h = h(x) = f(x) = y

or

y_h = h(x) \neq f(x) = y

Still, the use of blue and red balls allows us
to see our learning problem as a Bernoulli distribution
Swiss mathematician Jacob Bernoulli
Definition
The Bernoulli distribution is a discrete distribution having two possible outcomes, X = 0 or X = 1,
with the following probabilities

P(X|p) = \begin{cases} 1 - p & \text{if } X = 0 \\ p & \text{if } X = 1 \end{cases}

Also expressed as

P(X = k|p) = p^k (1 - p)^{1-k}
Thus
We define E_in (in-sample error)

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right)

We have made explicit the dependency of E_in on the particular h that we are considering.
Now E_out (out-of-sample error)

E_{out}(h) = P\left(h(x) \neq f(x)\right) = \mu

Where
The probability is based on the distribution P over X which is used to sample the data points x.
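A minimal sketch of these two quantities on a toy one-dimensional problem (the target, the hypothesis and the input distribution below are illustrative assumptions, not from the slides): E_in is computed on the N sampled points, while E_out is approximated with a large fresh sample from the same distribution P.

import random

random.seed(1)

def f(x):            # the (unknown) target: sign of x, encoded as +1/-1
    return 1 if x >= 0.0 else -1

def h(x):            # a candidate hypothesis: a slightly shifted threshold
    return 1 if x >= 0.1 else -1

def error(hyp, points):
    # fraction of points where the hypothesis disagrees with the target
    return sum(hyp(x) != f(x) for x in points) / len(points)

sample = [random.uniform(-1, 1) for _ in range(100)]              # the data set D
E_in = error(h, sample)
E_out = error(h, [random.uniform(-1, 1) for _ in range(100000)])  # fresh points
print(E_in, E_out)   # both should be close to 0.05 = P(0 <= x < 0.1)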
Generalization Error
Definition (Generalization Error / out-of-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = P_{x \sim D}\left(h(x) \neq f(x)\right) = E_{x \sim D}\left[I_{h(x) \neq f(x)}\right]

where I_\omega is the indicator function of the event \omega. (This comes from the fact that 1 \cdot P(A) + 0 \cdot P(A^c) = E[I_A].)
Empirical Error
Definition (Empirical Error / in-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and a sample X = {x_1, x_2, ..., x_N}, the empirical error or empirical risk of h is defined by

\widehat{R}(h) = \frac{1}{N}\sum_{i=1}^{N} I_{h(x_i) \neq f(x_i)}
Basically
We have, for a single fixed h,

P(|E_{in}(h) - E_{out}(h)| \geq \epsilon) \leq 2\exp\left(-2N\epsilon^2\right)

Now, we need to consider an entire set of hypotheses, H

H = \{h_1, h_2, ..., h_M\}
Therefore
Each Hypothesis is a scenario in the bin space
Remark
The Hoeffding Inequality still applies to each bin individually
Now, we need to consider all the bins simultaneously.
Here, we have the following situation
h is fixed before the data set is generated!!!
If you are allowed to change h after you generate the data set
The Hoeffding Inequality no longer holds
Therefore
With multiple hypotheses in H
The learning algorithm chooses the final hypothesis g based on D, i.e., after generating the data.
The statement we would like to make is not
"P(|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon) is small" for a fixed h_m.
We would rather have
"P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) is small" for the final hypothesis g.
Therefore
Something Notable
The hypothesis g is not fixed ahead of time, before generating the data.
Thus we need to bound

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon)

in a way that does not depend on which g the algorithm picks.
We have two rules
First one
If A_1 \Rightarrow A_2, then P(A_1) \leq P(A_2)
Second, the union bound: if you have any set of events A_1, A_2, ..., A_M

P(A_1 \cup A_2 \cup \cdots \cup A_M) \leq \sum_{m=1}^{M} P(A_m)
Therefore
Since the final hypothesis g is one of h_1, ..., h_M, we have the implication

|E_{in}(g) - E_{out}(g)| \geq \epsilon \;\Rightarrow\; |E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon
\text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon
\;\cdots\;
\text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon
Thus
Applying the first rule to this implication, we have

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq P\big[|E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon
\text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon
\;\cdots\;
\text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon\big]
Then
By the union bound, we have

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq \sum_{m=1}^{M} P\left(|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon\right)

Thus, applying Hoeffding to each h_m,

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)
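The role of the factor M can also be checked numerically. The following sketch (illustrative assumptions, not from the slides) fixes M threshold hypotheses before the data is generated, draws many data sets, and estimates the probability that at least one hypothesis has |E_in − E_out| ≥ ε; this stays below 2M·exp(−2Nε²), although the bound is quite loose here because the hypotheses are far from independent.

import random
import math

def worst_deviation_probability(M=20, N=200, eps=0.1, trials=5000, seed=0):
    rng = random.Random(seed)
    thresholds = [m / M for m in range(M)]        # M fixed hypotheses h_m(x) = sign(x - t)
    # For x uniform on [-1, 1] and target f(x) = sign(x), E_out(h_m) = t/2.
    e_out = [t / 2 for t in thresholds]
    bad = 0
    for _ in range(trials):
        xs = [rng.uniform(-1, 1) for _ in range(N)]   # one data set D
        for t, eo in zip(thresholds, e_out):
            e_in = sum((x >= t) != (x >= 0) for x in xs) / N
            if abs(e_in - eo) >= eps:
                bad += 1
                break                                  # at least one hypothesis deviates
    return bad / trials

print(worst_deviation_probability(), "<=", 2 * 20 * math.exp(-2 * 200 * 0.1 ** 2))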
We have
Something Notable
We have introduced two apparently conflicting arguments about the feasibility of learning.
We have two possibilities
One argument says that we cannot learn anything outside of D.
The other says it is possible!!!
Here, we introduce the probabilistic answer
This will solve our conundrum!!!
Then
The Deterministic Answer
Do we have something to say about f outside of D? The answer is
NO.
The Probabilistic Answer
Is D telling us something likely about f outside of D? The answer is
YES
The reason why
We approach our Learning from a Probabilistic point of view!!!
For example
We could have hypotheses based on hyperplanes
Linear regression output:

h(x) = \sum_{i=1}^{d} w_i x_i = w^T x

Therefore

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} \left(h(x_n) - y_n\right)^2
Clearly, we have used loss functions
Mostly to give meaning to h ≈ f
By error measures E(h, f)
Defined by using pointwise errors

e(h(x), f(x))

Examples
Squared error: e(h(x), f(x)) = [h(x) - f(x)]^2
Binary error: e(h(x), f(x)) = I[h(x) \neq f(x)]
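A minimal sketch of the two pointwise error measures and the in-sample error they induce (the hypotheses and sample below are illustrative, not from the slides).

def squared_error(h_x, f_x):
    # e(h(x), f(x)) = [h(x) - f(x)]^2, used for real-valued targets
    return (h_x - f_x) ** 2

def binary_error(h_x, f_x):
    # e(h(x), f(x)) = I[h(x) != f(x)], used for classification
    return int(h_x != f_x)

def in_sample_error(pointwise_e, h, f, xs):
    # E_in(h): the average of the pointwise errors over the sample
    return sum(pointwise_e(h(x), f(x)) for x in xs) / len(xs)

# Example usage with illustrative hypotheses on a small sample.
xs = [-1.0, 0.05, 0.3, 0.8]
print(in_sample_error(binary_error, lambda x: x > 0.1, lambda x: x > 0.0, xs))  # 0.25
print(in_sample_error(squared_error, lambda x: 0.5 * x, lambda x: x, xs))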
Therefore, we have
The Overall Error
E(h, f) = average of the pointwise errors e(h(x), f(x))
In-sample error

E_{in}(h) = \frac{1}{N}\sum_{i=1}^{N} e\left(h(x_i), f(x_i)\right)

Out-of-sample error

E_{out}(h) = E_{x}\left[e(h(x), f(x))\right]
We have the following Process
Assuming P(y|x) instead of y = f(x)
Then a data point (x, y) is now generated by the joint distribution

P(x, y) = P(x) P(y|x)

Therefore
A noisy target is a deterministic target f(x) = E[y|x] plus added noise y - f(x).
Finally, we have the Learning Process
[Figure: the learning diagram — an unknown target distribution P(y|x), an input probability distribution P(x), and the training examples drawn from them]
Therefore
Distinction between P (y|x) and P (x)
Both convey probabilistic aspects of x and y.
Therefore
1 The Target distribution P (y|x) is what we are trying to learn.
2 The Input distribution P (x) quantifies relative importance of x.
Finally
Merging P (x, y) = P (y|x) P (x) mixes the two concepts
Therefore
Learning is feasible because it is likely that

E_{out}(g) \approx E_{in}(g)

However, what we really need is g ≈ f, that is,

E_{out}(g) = P\left(g(x) \neq f(x)\right) \approx 0

How do we achieve this?

E_{out}(g) \approx E_{in}(g) = \frac{1}{N}\sum_{n=1}^{N} I\left(g(x_n) \neq f(x_n)\right)
Then
We make, at the same time,

E_{in}(g) \approx 0

to make the error of our selected hypothesis g with respect to the real function f small.
Learning splits into two questions
1 Can we make E_{out}(g) close enough to E_{in}(g)?
2 Can we make E_{in}(g) small enough?
Therefore, we have
Nice connection with the bias-variance trade-off
[Figure: in-sample and out-of-sample error as a function of model complexity]
We have that
The out-of-sample error

E_{out}(h) = P\left(h(x) \neq f(x)\right)

It measures how well our training on D
has generalized to data that we have not seen before.
Remark
E_out is based on the performance over the entire input space X.
Testing Data Set
Intuitively
We want to estimate the value of E_out using a sample of data points.
Something Notable
These points must be 'fresh' test points that have not been used for training.
Basically
Our testing set.
Thus
It is possible to define
the generalization error as the discrepancy between E_in and E_out.
Therefore
The Hoeffding Inequality is a way to characterize the generalization error with a probabilistic bound

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)

for any \epsilon > 0.
Reinterpreting This
Assume a tolerance level δ, for example δ = 0.0005
It is possible to say that with probability 1 − δ:

E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
Proof
Taking the complement of the Hoeffding probability with the absolute value

P(|E_{out}(g) - E_{in}(g)| < \epsilon) \geq 1 - 2M\exp\left(-2N\epsilon^2\right)

Therefore, with at least that probability we have

-\epsilon < E_{out}(g) - E_{in}(g) < \epsilon

This implies

E_{out}(g) < E_{in}(g) + \epsilon
Therefore
We simply set

\delta = 2M\exp\left(-2N\epsilon^2\right)

Then, taking logarithms,

\ln\frac{2M}{\delta} = 2N\epsilon^2

Therefore

\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
Generalization Bound
This inequality is known as a generalization bound

E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
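In practice, the bound is often read as an "error bar" on E_in. A small helper (a sketch, not from the slides) evaluating ε = sqrt((1/2N) ln(2M/δ)) shows that it shrinks like 1/sqrt(N) but grows only logarithmically with M.

import math

def generalization_error_bar(N, M, delta):
    # epsilon = sqrt((1/(2N)) * ln(2M / delta)) from the Hoeffding-based bound
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# With probability at least 1 - delta, E_out(g) <= E_in(g) + error bar.
for N in (100, 1000, 10000):
    print(N, generalization_error_bar(N, M=1000, delta=0.05))
# The bar shrinks like 1/sqrt(N) but grows only logarithmically with M.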
We have
The following inequality also holds

-\epsilon < E_{out}(g) - E_{in}(g) \;\Rightarrow\; E_{out}(g) > E_{in}(g) - \epsilon

Thus
Not only do we want our hypothesis g to do well on the out-of-sample data,

E_{out}(g) < E_{in}(g) + \epsilon

but we also want to know how well we did with our H.
The bound E_{out}(g) > E_{in}(g) - \epsilon assures that it was not possible to do much better,
given any hypothesis with a higher in-sample error

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right)

than g.
Therefore
But, we want to know how well we did with our H
Thus, E_{out}(g) > E_{in}(g) - \epsilon assures that it was not possible to do much better:
given any hypothesis h with a higher in-sample error than g,

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right),

it will also have a higher lower bound on its out-of-sample error, given

E_{out}(h) > E_{in}(h) - \epsilon
The Infiniteness of H
A problem with the error bound: its dependency on M

\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}

What happens when M becomes infinite?
The number of hypotheses in H becomes infinite,
thus the bound becomes infinite (and therefore useless).
Problem: almost all interesting learning models have an infinite H...
For example, our linear regression, h(x) = w^T x.
Therefore, we need to replace M
We need to find a finite substitute that takes a finite range of values
For this, we go back to the union of events

|E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon \text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon \;\cdots\; \text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon
We have
This union is what guarantees (covers) the event |E_{in}(g) - E_{out}(g)| \geq \epsilon
Thus, we can look at the events B_m for which

|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon

Then

P(B_1 \text{ or } B_2 \cdots \text{ or } B_M) \leq \sum_{m=1}^{M} P[B_m]
Now, we have the following
Example
This union bound is a gross overestimate
Basically, if h_i and h_j are quite similar, the two events

|E_{in}(h_i) - E_{out}(h_i)| \geq \epsilon \text{ and } |E_{in}(h_j) - E_{out}(h_j)| \geq \epsilon

are likely to coincide!!!
We have
Something Notable
In a typical learning model, many hypotheses are indeed very similar.
The mathematical theory of generalization hinges on this observation
We only need to account for the overlaps between different hypotheses in order to substitute M.
Consider
A finite data set

X = \{x_1, x_2, ..., x_N\}

And we consider hypotheses h ∈ H such that

h : X \rightarrow \{-1, +1\}

When applied to X, each h gives an N-tuple (h(x_1), h(x_2), ..., h(x_N)) of ±1's.
Such an N-tuple is called a dichotomy
Given that it splits x_1, x_2, ..., x_N into two groups...
Dichotomy
Definition
Given a hypothesis set H, a dichotomy of a set X is one of the
possible ways of labeling the points of X using a hypothesis in H.
Examples of Dichotomies
Here the first dichotomy can be generated by a perceptron
[Figure: the dichotomy generated by a perceptron, with the points labeled Class +1 on one side of the separating boundary and Class -1 on the other]
Something Important
Each h ∈ H generates a dichotomy on x1, ..., xN
However, two different h’s may generate the same dichotomy if they
generate the same pattern
Remark
Definition
Let x_1, x_2, ..., x_N ∈ X. The dichotomies generated by H on these points are defined by

H(x_1, x_2, ..., x_N) = \{(h(x_1), h(x_2), ..., h(x_N)) \mid h \in H\}

Therefore
We can see H(x_1, x_2, ..., x_N) as the set of hypotheses as seen through the geometry of these points.
Thus
A large H(x_1, x_2, ..., x_N) means H is more diverse.
Growth function, Our Replacement of M
Definition
The growth function is defined for a hypothesis set H by

m_H(N) = \max_{x_1,...,x_N \in X} \# H(x_1, x_2, ..., x_N)

where # denotes the cardinality (number of elements) of a set.
Therefore
m_H(N) is the maximum number of dichotomies that can be generated by H on any N points.
This removes the dependency on the entire input space X.
Therefore
We have that
Like M, m_H(N) is a measure of the number of (effective) hypotheses in H.
However, we avoid considering all of X
Now we only consider N points instead of the entire X.
Upper Bound for mH (N)
First, we know that

H(x_1, x_2, ..., x_N) \subseteq \{-1, +1\}^N

Hence, the value of m_H(N) is at most \#\{-1, +1\}^N:

m_H(N) \leq 2^N
Therefore
If H is capable of generating all possible dichotomies on x_1, x_2, ..., x_N
Then

H(x_1, x_2, ..., x_N) = \{-1, +1\}^N \quad \text{and} \quad \#H(x_1, x_2, ..., x_N) = 2^N

We can say that
H can shatter x_1, x_2, ..., x_N
Meaning
H is as diverse as can be on this particular sample.
Shattering
Definition
A set X of N ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of X, that is, when

m_H(N) = 2^N
Example
Positive Rays
Imagine an input space on R, with H consisting of all hypotheses h : R → {−1, +1} of the form

h(x) = \text{sign}(x - a)

[Figure: N points on the real line; the threshold a assigns -1 to the points on its left and +1 to the points on its right]
Thus, we have that
As we change a, we get N + 1 different dichotomies
mH (N) = N + 1
Now, we have the case of positive intervals
H consists of all hypotheses in one dimension that return +1 within
some interval and −1 otherwise.
Therefore
We have
The line is again split by the points into N + 1 regions.
Furthermore
The dichotomy we get is decided by which two regions contain the end values of the interval.
Therefore, the number of such dichotomies is

\binom{N+1}{2}
Additionally
If both end values of the interval fall in the same region, we get the all-(−1) dichotomy.
Then

m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1
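Both growth functions can be verified by brute force. The following sketch (not from the slides) enumerates the dichotomies that positive rays and positive intervals generate on N points of the real line and checks the counts N + 1 and ½N² + ½N + 1.

from itertools import combinations

def ray_dichotomies(points):
    # h(x) = sign(x - a): sliding the threshold a across the sorted points
    # produces every dichotomy (a point at the threshold is taken as -1 here).
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return {tuple(+1 if x > a else -1 for x in points) for a in cuts}

def interval_dichotomies(points):
    # +1 inside an interval (l, r), -1 outside; it suffices to place the
    # endpoints between consecutive points (plus the empty interval).
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    ds = {tuple([-1] * len(points))}          # empty interval: all -1
    for l, r in combinations(cuts, 2):
        ds.add(tuple(+1 if l < x < r else -1 for x in points))
    return ds

points = [0.3, 1.7, 2.2, 4.0, 5.5]            # N = 5 arbitrary points
N = len(points)
print(len(ray_dichotomies(points)), N + 1)                      # 6 6
print(len(interval_dichotomies(points)), N * (N + 1) // 2 + 1)  # 16 16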
Finally
In the case of convex sets in R^2
H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere.
Therefore
We have the following

m_H(N) = 2^N

This can be seen by placing the N points in convex position (e.g., on a circle), so that any dichotomy is realized by the convex hull of its positively labeled points.
Remember
We have that

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)

What if m_H(N) replaces M?
If m_H(N) is polynomial in N, we have an excellent case!!!
Therefore, we need to prove that
m_H(N) is polynomial.
Break Point
Definition
If no data set of size k can be shattered by H, then k is said to be a break point for H:

m_H(k) < 2^k
Example
For the perceptron in two dimensions, we have a break point at k = 4
[Figure: three points can be shattered by a line, but no set of four points can be]
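The break point can be checked by brute force (a sketch, not from the slides; it assumes numpy and scipy are available). A labeling y of points X is realizable by a 2D perceptron iff the linear feasibility problem "find w, b with y_i(w·x_i + b) ≥ 1" has a solution; counting realizable labelings for 3 points in general position gives 2^3 = 8, while for 4 points fewer than 2^4 = 16 labelings are realizable.

from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def count_dichotomies(points):
    return sum(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]             # general position
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # e.g. the unit square
print(count_dichotomies(three))   # 8  = 2^3: three points can be shattered
print(count_dichotomies(four))    # 14 < 2^4: four points cannot (k = 4 is a break point)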
Important
Something Notable
In general, it is easier to find a break point for H than to compute the
full growth function for that H.
Using this concept
We are ready to define the concept of Vapnik–Chervonenkis (VC)
dimension.
VC-Dimension
Definition
The VC-dimension of a hypothesis set H is the size of the largest set
that can be fully shattered by H (Those points need to be in “General
Position”):
V Cdim (H) = max k|mH (k) = 2k
A set containing k points, for arbitrary k, is in general linear position
if and only if no (k − 1) −dimensional flat contains them all
100 / 135
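The definition can be turned directly into a brute-force computation for very small hypothesis sets. The sketch below (not from the slides, and using a toy class of closed intervals on the real line rather than the perceptron) enumerates the dichotomies realizable on k points and looks for the largest k with mH(k) = 2^k.

```python
# Sketch: estimate the VC dimension of the class of closed intervals [a, b] on
# the real line, h(x) = +1 iff a <= x <= b, directly from the definition
# VCdim = max{ k : m_H(k) = 2^k }.
from itertools import combinations

def interval_dichotomies(xs):
    """All distinct labelings of the points xs realizable by some interval."""
    labelings = {tuple(-1 for _ in xs)}          # interval containing no point
    for a in xs:
        for b in xs:
            if a <= b:                            # endpoints among the points suffice
                labelings.add(tuple(+1 if a <= x <= b else -1 for x in xs))
    return labelings

def growth_function(k, domain):
    """m_H(k) over subsets of a finite domain (for intervals, every set of
    distinct points realizes the maximum, so this equals the true m_H(k))."""
    return max(len(interval_dichotomies(sub)) for sub in combinations(domain, k))

domain = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
for k in range(1, 5):
    print(k, growth_function(k, domain), 2 ** k)
# m_H(1) = 2, m_H(2) = 4 = 2^2, m_H(3) = 7 < 8, so VCdim(intervals) = 2.
```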
Images/cinvestav
Important Remarks
Remark 1
If V Cdim (H) = d, there exists a set of size d that can be fully
shattered.
Remark 2
This does not imply that all sets of size d or less are fully shattered;
in fact, that is typically not the case!!!
101 / 135
Images/cinvestav
Why? General Linear Position
For example, in the Perceptron:
(Figure: a configuration of points that is not in general position.)
102 / 135
Images/cinvestav
Now, we define B(N, k)
Definition
B(N, k) is the maximum number of dichotomies on N points such
that no subset of size k of the N points can be shattered by these
dichotomies.
Something Notable
The definition of B(N, k) assumes a break point k!!!
103 / 135
Images/cinvestav
Further
Since B(N, k) is a maximum
It is an upper bound for mH (N) under a break point k.
mH (N) ≤ B (N, k) if k is a break point for H.
Then
We need to find a bound for B (N, k) to prove that mH (N) is
polynomial.
104 / 135
Images/cinvestav
Therefore
Thus, we start with two boundary conditions k = 1 and N = 1
B (N, 1) = 1
B (1, k) = 2   for k > 1
105 / 135
Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N: if no subset of size 1 can be shattered,
then only one dichotomy can be allowed,
because a second, different dichotomy would have to differ on at least one point,
and that subset of size 1 would then be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
107 / 135
Images/cinvestav
B(N, k) Dichotomies, N ≥ 2 and k ≥ 2
Each row is a dichotomy on x1 x2 · · · xN−1 xN, grouped by its pattern on the first N − 1 points:
S1 (α rows): patterns on x1 · · · xN−1 that appear exactly once, with either xN = +1 or xN = −1, but not both.
S2+ (β rows): patterns on x1 · · · xN−1 that appear with both extensions; these rows have xN = +1.
S2− (β rows): the same β patterns on x1 · · · xN−1, now with xN = −1.
108 / 135
Images/cinvestav
What does this partition mean?
First, consider the dichotomies restricted to x1 x2 · · · xN−1
Some appear only once (with either +1 or −1 at xN, but not both).
We collect them in S1.
The remaining dichotomies appear twice:
once with +1 and once with −1 in the xN column.
109 / 135
Images/cinvestav
Therefore, we collect them in three sets
The ones with only one Dichotomy
We use the set S1
The others in two different sets:
S2+: the ones with xN = +1.
S2−: the ones with xN = −1.
110 / 135
Images/cinvestav
Therefore
We have the following
B (N, k) = α + 2β
The total number of different dichotomies on the first N − 1 points
is α + β.
Additionally, no subset of size k of these first N − 1 points can be
shattered by these dichotomies,
since no k-subset of all N points can be shattered:
α + β ≤ B (N − 1, k)
By definition of B.
111 / 135
Images/cinvestav
Then
Further, no subset of size k − 1 of the first N − 1 points can be
shattered by the dichotomies in S2+:
if there existed such a subset, then taking the corresponding set of
dichotomies in S2− together with the point xN,
you would finish with a subset of size k that can be shattered, a contradiction
with the definition of B (N, k).
Therefore
β ≤ B (N − 1, k − 1)
Then, we have
B (N, k) ≤ B(N − 1, k) + B (N − 1, k − 1)
112 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
113 / 135
Images/cinvestav
Connecting the Growth Function with the V Cdim
Sauer’s Lemma
For all k ∈ N , the following inequality holds:
$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
114 / 135
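A quick way to convince yourself of the statement is to run the recurrence B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1) with the boundary conditions above and compare it with the binomial sum. The following sketch (not from the slides) does exactly that: treating the recurrence as an equality gives an upper bound for B(N, k), and it coincides with the closed-form sum.

```python
# Sketch: compute the recursive upper bound for B(N, k) from the boundary
# conditions B(N, 1) = 1 and B(1, k) = 2 (k > 1) plus the recurrence, and
# compare it with the Sauer bound  sum_{i=0}^{k-1} C(N, i).
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound obtained by running the recurrence with equality."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

def sauer_sum(N, k):
    return sum(comb(N, i) for i in range(k))

for N in range(1, 8):
    for k in range(1, 6):
        assert B_upper(N, k) == sauer_sum(N, k)
print("recursive bound matches sum_{i=0}^{k-1} C(N, i) for all tested N, k")
```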
Images/cinvestav
Proof
Proof
For k = 1
$B(N, 1) \leq B(N-1, 1) + B(N-1, 0) = 1 + 0 = \binom{N}{0}$
Then, by induction
We assume that the statement is true for N ≤ N0 and all k.
115 / 135
Images/cinvestav
Now
We need to prove this for N = N0 + 1 and all k
Observation: This is true for k = 1 given
B (N, 1) = 1
Now, consider k ≥ 2, where
$B(N_0 + 1, k) \leq B(N_0, k) + B(N_0, k - 1)$
Therefore
$B(N_0 + 1, k) \leq \sum_{i=0}^{k-1} \binom{N_0}{i} + \sum_{i=0}^{k-2} \binom{N_0}{i}$
116 / 135
Images/cinvestav
Therefore
We have the following
$B(N_0 + 1, k) \leq 1 + \sum_{i=1}^{k-1} \binom{N_0}{i} + \sum_{i=1}^{k-1} \binom{N_0}{i-1}$
$= 1 + \sum_{i=1}^{k-1} \left[ \binom{N_0}{i} + \binom{N_0}{i-1} \right]$
$= 1 + \sum_{i=1}^{k-1} \binom{N_0 + 1}{i}$
$= \sum_{i=0}^{k-1} \binom{N_0 + 1}{i}$
Because
$\binom{N_0}{i} + \binom{N_0}{i-1} = \binom{N_0 + 1}{i}$
117 / 135
Images/cinvestav
Now
We have in conclusion for all k
$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
Therefore
$m_H(N) \leq B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
118 / 135
Images/cinvestav
Then
Theorem
If $m_H(k) < 2^{k}$ for some value k, then
$m_H(N) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
119 / 135
Images/cinvestav
Finally
Corollary
Let H be a hypothesis set with V Cdim (H) = k. Then, for all
N ≥ k,
$m_H(N) \leq \left(\frac{eN}{k}\right)^{k} = O\left(N^{k}\right)$
120 / 135
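A few numbers (not from the slides) make the corollary tangible: for a fixed VC dimension k, the Sauer sum and the polynomial (eN/k)^k grow slowly, while 2^N explodes. The choice k = 3 below is an arbitrary example.

```python
# Sketch: compare the Sauer sum, the polynomial bound (eN/k)^k, and 2^N.
from math import e, comb

k = 3
for N in (5, 10, 20, 50, 100):
    sauer = sum(comb(N, i) for i in range(k + 1))    # sum_{i=0}^{k} C(N, i)
    poly = (e * N / k) ** k                          # the corollary's bound
    print(f"N={N:3d}  Sauer sum={sauer:10d}  (eN/k)^k={poly:14.1f}  2^N={2**N:.3e}")
```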
Images/cinvestav
We have
Proof
$m_H(N) \leq \sum_{i=0}^{k} \binom{N}{i}$
$\leq \sum_{i=0}^{k} \binom{N}{i} \left(\frac{N}{k}\right)^{k-i}$   (using N ≥ k, so each factor $\left(\frac{N}{k}\right)^{k-i} \geq 1$)
$\leq \sum_{i=0}^{N} \binom{N}{i} \left(\frac{N}{k}\right)^{k-i}$
$= \left(\frac{N}{k}\right)^{k} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{k}{N}\right)^{i}$
121 / 135
Images/cinvestav
Therefore
We have
$m_H(N) \leq \left(\frac{N}{k}\right)^{k} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{k}{N}\right)^{i} = \left(\frac{N}{k}\right)^{k} \left(1 + \frac{k}{N}\right)^{N}$
Given that $1 + x \leq e^{x}$,
$m_H(N) \leq \left(\frac{N}{k}\right)^{k} \left(e^{k/N}\right)^{N} = \left(\frac{N}{k}\right)^{k} e^{k} = \left(\frac{e}{k}\right)^{k} N^{k} = O\left(N^{k}\right)$
122 / 135
Images/cinvestav
Therefore
We have that
$m_H(N)$ is bounded by a polynomial of degree k − 1 in N, i.e. if $m_H(k) < 2^{k}$, we have that
$m_H(N)$ is polynomial in N
We are not depending on the number of hypotheses!!!!
123 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
124 / 135
Images/cinvestav
Remark about mH (N)
We have bounded the number of effective hypotheses
Yes!!! We can have M hypotheses, but the number of
dichotomies generated by them is bounded by mH (N)
125 / 135
Images/cinvestav
VC-Dimension Again
Definition
The VC-dimension of a hypothesis set H is the size of the largest set
that can be fully shattered by H (Those points need to be in “General
Position”):
$VC_{dim}(H) = \max\left\{ k \mid m_H(k) = 2^{k} \right\}$
Something Notable
If $m_H(N) = 2^{N}$ for all N, then $VC_{dim}(H) = \infty$
126 / 135
Images/cinvestav
Remember
We have the following: with probability at least 1 − δ,
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
Instead of using M, we use mH (N)
We can use our growth function as the effective way to bound:
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N} \ln \frac{2 m_H(N)}{\delta}}$
127 / 135
Images/cinvestav
VC Generalized Bound
Theorem (VC Generalized Bound)
For any tolerance δ > 0 and any hypothesis set H with
V Cdim (H) = k,
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{2k}{N} \ln \frac{eN}{k}} + \sqrt{\frac{1}{2N} \ln \frac{1}{\delta}}$
with probability ≥ 1 − δ
Something Notable
This bound only fails when $VC_{dim}(H) = \infty$!!!
128 / 135
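As a rough numerical illustration (not from the slides), the generalization-gap term of this bound can be evaluated directly; the values of k, N, and δ below are arbitrary choices.

```python
# Sketch: evaluate sqrt((2k/N) ln(eN/k)) + sqrt((1/2N) ln(1/delta)) for a few N.
from math import sqrt, log, e

def vc_gap(N, k, delta):
    return sqrt((2 * k / N) * log(e * N / k)) + sqrt(log(1 / delta) / (2 * N))

k, delta = 10, 0.05
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N={N:>9,d}   Eout <= Ein + {vc_gap(N, k, delta):.4f}")
```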
Images/cinvestav
Proof
Although we will not talk about it,
we remark that it is possible to use the Rademacher complexity
to manage the number of overlapping hypotheses (which can be
infinite).
We will stop here, but
I encourage you to look further into the proof...
129 / 135
Images/cinvestav
About the Proof
For more, take a look at:
“A Probabilistic Theory of Pattern Recognition” by Luc Devroye et al.
“Foundations of Machine Learning” by Mehryar Mohri et al.
This is the equivalent of using Measure Theory to understand the
innards of Probability
We are professionals, we must understand!!!
130 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
131 / 135
Images/cinvestav
As you remember from previous classes
We have architectures like
132 / 135
Images/cinvestav
G-composition of H
Let G be a layered directed acyclic graph
Where directed edges go from one layer l to the next layer l + 1.
Now, we have a set of hypotheses H
N input nodes with in-degree 0
Intermediate nodes with in-degree r
A single output node with out-degree 0
H is our hypothesis set over the Euclidean space R^r
Basically, each internal node represents a hypothesis ci : R^r → {−1, 1}, by
means of tanh.
133 / 135
Images/cinvestav
Therefore
We have that
The neural network represents a hypothesis from R^N to {−1, 1}
Therefore, the entire hypothesis is a composition of concepts
This is called a G-composition of H.
134 / 135
Images/cinvestav
We have the following theorem
Theorem (Kearns and Vazirani, 1994)
Let G be a layered directed acyclic graph with N input nodes and
s ≥ 2 internal nodes, each of in-degree at most r.
Let H be a hypothesis set over R^r with V Cdim (H) = d, and let HG be the
G-composition of H. Then
$VC_{dim}(H_G) \leq 2ds \log_2(es)$
135 / 135
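As a rough illustration (not from the slides), the bound can be evaluated for a concrete network. The sketch below assumes threshold units, for which the VC dimension of a single node with r inputs is d = r + 1 (a standard fact for halfspaces, used here as an assumption); the network size is an arbitrary example.

```python
# Sketch: plug an example network into VCdim(H_G) <= 2 d s log2(e s), where s
# is the number of internal nodes and d is the VC dimension of a single node.
from math import log2, e, ceil

def vc_upper_bound(d, s):
    """Kearns-Vazirani style upper bound on the VC dimension of the G-composition."""
    return 2 * d * s * log2(e * s)

# Example: a 10-5-1 network of threshold units.
# Internal nodes s = 5 + 1 = 6; each node sees at most r = 10 inputs, so d = r + 1.
r, s = 10, 6
d = r + 1
print(ceil(vc_upper_bound(d, s)))   # an upper bound on VCdim of the whole network
```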

More Related Content

Similar to 17 vapnik chervonenkis dimension

end behavior.....pptx
end behavior.....pptxend behavior.....pptx
end behavior.....pptxLunaLedezma3
 
BOLD Presentation Zane
BOLD Presentation ZaneBOLD Presentation Zane
BOLD Presentation ZaneZane Ricks
 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10chidabdu
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Hitomi Yanaka
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problemJiri Haviger
 
8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)Greg Stevens
 
Applied logic: A mastery learning approach
Applied logic: A mastery learning approachApplied logic: A mastery learning approach
Applied logic: A mastery learning approachjohn6938
 
Critique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxCritique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxfaithxdunce63732
 
Machine Learning Foundations
Machine Learning FoundationsMachine Learning Foundations
Machine Learning FoundationsAlbert Y. C. Chen
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1Jimbo Lamb
 

Similar to 17 vapnik chervonenkis dimension (20)

end behavior.....pptx
end behavior.....pptxend behavior.....pptx
end behavior.....pptx
 
BOLD Presentation Zane
BOLD Presentation ZaneBOLD Presentation Zane
BOLD Presentation Zane
 
Active Learning - DCS Presentation
Active Learning - DCS PresentationActive Learning - DCS Presentation
Active Learning - DCS Presentation
 
PREDICT 422 - Module 1.pptx
PREDICT 422 - Module 1.pptxPREDICT 422 - Module 1.pptx
PREDICT 422 - Module 1.pptx
 
Lecture5 xing
Lecture5 xingLecture5 xing
Lecture5 xing
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10
 
Cognitive Load Theory for PPT
Cognitive Load Theory for PPT Cognitive Load Theory for PPT
Cognitive Load Theory for PPT
 
151028_abajpai1
151028_abajpai1151028_abajpai1
151028_abajpai1
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?
 
02 math essentials
02 math essentials02 math essentials
02 math essentials
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problem
 
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)
 
Applied logic: A mastery learning approach
Applied logic: A mastery learning approachApplied logic: A mastery learning approach
Applied logic: A mastery learning approach
 
Chi square analysis
Chi square analysisChi square analysis
Chi square analysis
 
Critique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxCritique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docx
 
Machine Learning Foundations
Machine Learning FoundationsMachine Learning Foundations
Machine Learning Foundations
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 

Recently uploaded (20)

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 

17 vapnik chervonenkis dimension

  • 1. Introduction to Machine Learning Vapnik–Chervonenkis Dimension Andres Mendez-Vazquez July 22, 2018 1 / 135
  • 2. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 2 / 135
  • 3. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 3 / 135
  • 4. Images/cinvestav Until Now We have been learning A Lot of functions to approximate the models f of a given data set D. Data Observation/Data Collection What we can observePhenomea Generating Samples 4 / 135
  • 5. Images/cinvestav The Question But Never asked ourselves if Are we able to really learn f from D? 5 / 135
  • 6. Images/cinvestav Example Consider the following data set D Consider a Boolean target function over a three-dimensional input space X = {0, 1}3 With a data set D n xn yn 1 000 0 2 001 1 3 010 1 4 011 0 5 100 1 6 / 135
  • 7. Images/cinvestav Example Consider the following data set D Consider a Boolean target function over a three-dimensional input space X = {0, 1}3 With a data set D n xn yn 1 000 0 2 001 1 3 010 1 4 011 0 5 100 1 6 / 135
  • 8. Images/cinvestav We have the following We have the space of input has 23 possibilities Therefore, we have 223 possible functions for f Learning outside the data D, basically we want a g that generalize outside D n xn yn g f1 f2 f3 f4 f5 f6 f7 f8 1 000 0 0 0 0 0 0 0 0 0 0 2 001 1 1 1 1 1 1 1 1 1 1 3 010 1 1 1 1 1 1 1 1 1 1 4 011 0 0 0 0 0 0 0 0 0 0 5 100 1 1 1 1 1 1 1 1 1 1 6 101 ? 0 0 0 0 1 1 1 1 7 110 ? 0 0 1 1 0 0 1 1 7 110 ? 0 1 0 1 0 1 0 1 7 / 135
  • 9. Images/cinvestav We have the following We have the space of input has 23 possibilities Therefore, we have 223 possible functions for f Learning outside the data D, basically we want a g that generalize outside D n xn yn g f1 f2 f3 f4 f5 f6 f7 f8 1 000 0 0 0 0 0 0 0 0 0 0 2 001 1 1 1 1 1 1 1 1 1 1 3 010 1 1 1 1 1 1 1 1 1 1 4 011 0 0 0 0 0 0 0 0 0 0 5 100 1 1 1 1 1 1 1 1 1 1 6 101 ? 0 0 0 0 1 1 1 1 7 110 ? 0 0 1 1 0 0 1 1 7 110 ? 0 1 0 1 0 1 0 1 7 / 135
  • 10. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 8 / 135
  • 11. Images/cinvestav Here is the Dilemma!!! Each of the f1, f2, ..., f8 It is a possible real f, the true f. Any of them is a possible good f Therefore The quality of the learning will be determined by how close our prediction is to the true value. 9 / 135
  • 12. Images/cinvestav Here is the Dilemma!!! Each of the f1, f2, ..., f8 It is a possible real f, the true f. Any of them is a possible good f Therefore The quality of the learning will be determined by how close our prediction is to the true value. 9 / 135
  • 13. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 14. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 15. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 16. Images/cinvestav We want to Generalize But, If we want to use only a deterministic approach to H Our Attempts to use H to learn g is a waste of time!!! 11 / 135
  • 17. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 12 / 135
  • 18. Images/cinvestav Consider a “bin” with red and green marbles Going back to our example BIN SAMPLES 13 / 135
  • 19. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 20. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 21. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 22. Images/cinvestav Two Answers... Possible vs. Probable No!!! Because we can see only the samples For example, Sample an be mostly blue while bin is mostly red. Yes!!! Sample frequency ν is likely close to bin frequency µ. 15 / 135
  • 23. Images/cinvestav Two Answers... Possible vs. Probable No!!! Because we can see only the samples For example, Sample an be mostly blue while bin is mostly red. Yes!!! Sample frequency ν is likely close to bin frequency µ. 15 / 135
  • 24. Images/cinvestav What does ν say about µ? We have the following hypothesis In a big sample (large N ), ν is probably close to µ (within ). How? Hoeffding’s Inequality . 16 / 135
  • 25. Images/cinvestav What does ν say about µ? We have the following hypothesis In a big sample (large N ), ν is probably close to µ (within ). How? Hoeffding’s Inequality . 16 / 135
  • 26. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 17 / 135
  • 27. Images/cinvestav We have the following theorem Theorem (Hoeffding’s inequality) Let Z1, ..., Zn be independent bounded random variables with Zi ∈ [a, b] for all i, where −∞ < a ≤ b < ∞. Then P 1 N N i=1 (Zi − E [Zi]) ≥ t ≤ exp − 2Nt2 (b−a)2 and P 1 N N i=1 (Zi − E [Zi]) ≤ −t ≤ exp − 2Nt2 (b−a)2 for all t ≥ 0. 18 / 135
  • 28. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 29. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 30. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 31. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 32. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 33. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 34. Images/cinvestav Therefore We have the following If is small enough and as long as N is large 21 / 135
  • 35. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 36. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 37. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 38. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 39. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 40. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 41. Images/cinvestav Here a Small Remark Here, we are not talking about classes When talking about blue and red balls, but if we are able to identify the correct label: yh = h (x) =f (x) = y or yh = h (x) =f (x) = y Still, the use of blue and red balls allows to see our Learning Problem as a Bernoulli distribution 23 / 135
  • 42. Images/cinvestav Here a Small Remark Here, we are not talking about classes When talking about blue and red balls, but if we are able to identify the correct label: yh = h (x) =f (x) = y or yh = h (x) =f (x) = y Still, the use of blue and red balls allows to see our Learning Problem as a Bernoulli distribution 23 / 135
  • 43. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 44. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 45. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 46. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 25 / 135
  • 47. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 48. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 49. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 51. Images/cinvestav Generalization Error Definition (Generalization Error/out-of-sample error) Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and an underlying distribution D, the generalization error or risk of h is defined by R(h) = P_{x∼D}(h(x) ≠ f(x)) = E_{x∼D}[I_{h(x)≠f(x)}] where I_ω is the indicator function of the event ω. (This comes from the fact that 1 · P(A) + 0 · P(Ā) = E[I_A].) 28 / 135
  • 52. Images/cinvestav Empirical Error Definition (Empirical Error/in-sample error) Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and a sample X = {x1, x2, ..., xN}, the empirical error or empirical risk of h is defined by: R̂(h) = (1/N) Σ_{i=1}^{N} I_{h(xi)≠f(xi)} 29 / 135
  • 54. Images/cinvestav Basically We have P(|Ein(h) − Eout(h)| ≥ ε) ≤ 2 e^{−2Nε²} Now, we need to consider an entire set of hypotheses, H H = {h1, h2, ..., hM} 31 / 135
  • 56. Images/cinvestav Therefore Each Hypothesis is a scenario in the bin space 32 / 135
  • 57. Images/cinvestav Remark The Hoeffding Inequality still applies to each bin individually Now, we need to consider all the bins simultaneously. Here, we have the following situation h is fixed before the data set is generated!!! If you are allowed to change h after you generate the data set The Hoeffding Inequality no longer holds 33 / 135
  • 61. Images/cinvestav Therefore With multiple hypotheses in H The Learning Algorithm chooses the final hypothesis g based on D, i.e., after generating the data. The statement we would like to make is not "P(|Ein(hm) − Eout(hm)| ≥ ε) is small" We would rather have "P(|Ein(g) − Eout(g)| ≥ ε) is small" for the final hypothesis g. 35 / 135
  • 64. Images/cinvestav Therefore Something Notable The hypothesis g is not fixed ahead of time, before generating the data Thus we need to bound P(|Ein(g) − Eout(g)| ≥ ε) in a way that does not depend on which g the algorithm picks. 36 / 135
  • 66. Images/cinvestav We have two rules First one: if A1 =⇒ A2, then P(A1) ≤ P(A2) Second, for any set of events A1, A2, ..., AM (the union bound) P(A1 ∪ A2 ∪ · · · ∪ AM) ≤ Σ_{m=1}^{M} P(Am) 37 / 135
  • 68. Images/cinvestav Therefore Since the final hypothesis g is one of h1, ..., hM (no independence assumption is needed, only the first rule above) |Ein(g) − Eout(g)| ≥ ε =⇒ |Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε 38 / 135
  • 69. Images/cinvestav Thus We have P(|Ein(g) − Eout(g)| ≥ ε) ≤ P[|Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε] 39 / 135
  • 70. Images/cinvestav Then We have, by the union bound, P(|Ein(g) − Eout(g)| ≥ ε) ≤ Σ_{m=1}^{M} P(|Ein(hm) − Eout(hm)| ≥ ε) Thus P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} 40 / 135
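The effect of choosing g after seeing the data can be simulated with a hypothetical coin-flipping experiment (illustrative only, not from the slides): every "hypothesis" is a fair coin with Eout = 0.5, Ein(hm) is the fraction of heads in N flips, and g is the hypothesis with the smallest in-sample error. The deviation probability for g is far larger than for a single fixed h, yet it stays under the 2M e^{−2Nε²} bound.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, eps = 50, 100, 0.1          # hypotheses, sample size, tolerance
trials = 2000
E_out = 0.5                       # every "hypothesis" is a fair coin

deviate_single, deviate_chosen = 0, 0
for _ in range(trials):
    flips = rng.integers(0, 2, size=(M, N))   # row m = pointwise errors of hypothesis m
    E_in = flips.mean(axis=1)
    # fixed hypothesis: look at h_1 only
    deviate_single += abs(E_in[0] - E_out) >= eps
    # learned hypothesis: pick g = argmin E_in *after* seeing the data
    g = np.argmin(E_in)
    deviate_chosen += abs(E_in[g] - E_out) >= eps

bound_single = 2 * np.exp(-2 * N * eps**2)
bound_union = min(1.0, 2 * M * np.exp(-2 * N * eps**2))

print(f"P[|E_in(h_1) - E_out| >= eps] ~ {deviate_single / trials:.3f}  (bound {bound_single:.3f})")
print(f"P[|E_in(g)   - E_out| >= eps] ~ {deviate_chosen / trials:.3f}  (bound {bound_union:.3f})")
```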
  • 73. Images/cinvestav We have Something Notable We have introduced two apparently conflicting arguments about the feasibility of learning. We have two possibilities One argument says that we cannot learn anything outside of D. The other says it is possible!!! Here, we introduce the probabilistic answer This will solve our conundrum!!! 42 / 135
  • 77. Images/cinvestav Then The Deterministic Answer Do we have something to say about f outside of D? The answer is NO. The Probabilistic Answer Is D telling us something likely about f outside of D? The answer is YES. The reason why We approach our learning from a probabilistic point of view!!! 43 / 135
  • 81. Images/cinvestav For example We could have hypotheses based on hyperplanes Linear regression output: h(x) = Σ_{i=1}^{d} w_i x_i = w^T x Therefore Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)² 45 / 135
  • 83. Images/cinvestav Clearly, we have used loss functions Mostly to give meaning to h ≈ f By error measures E(h, f) By using pointwise definitions e(h(x), f(x)) Examples Squared error: e(h(x), f(x)) = [h(x) − f(x)]² Binary error: e(h(x), f(x)) = I[h(x) ≠ f(x)] 46 / 135
  • 87. Images/cinvestav Therefore, we have The Overall Error E(h, f) = average of the pointwise errors e(h(x), f(x)) In-sample error Ein(h) = (1/N) Σ_{i=1}^{N} e(h(xi), f(xi)) Out-of-sample error Eout(h) = E_X[e(h(x), f(x))] 48 / 135
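A small sketch of these pointwise errors (the target, hypothesis and distribution below are invented for the example): Ein is the in-sample average of e(h(x), f(x)), and Eout is estimated as an expectation over fresh points.

```python
import numpy as np

rng = np.random.default_rng(2)

def squared_error(hx, fx):
    return (hx - fx) ** 2

def binary_error(hx, fx):
    return (hx != fx).astype(float)

# hypothetical 1-D setup: real model f, proposed model h
f = lambda x: 2.0 * x + 1.0
h = lambda x: 1.8 * x + 1.2

x = rng.uniform(-1.0, 1.0, size=500)              # the sample D
x_fresh = rng.uniform(-1.0, 1.0, size=200_000)    # fresh points, to estimate E_X[.]

E_in_sq = np.mean(squared_error(h(x), f(x)))
E_out_sq = np.mean(squared_error(h(x_fresh), f(x_fresh)))

E_in_bin = np.mean(binary_error(np.sign(h(x)), np.sign(f(x))))
E_out_bin = np.mean(binary_error(np.sign(h(x_fresh)), np.sign(f(x_fresh))))

print(f"squared error: E_in = {E_in_sq:.4f}, E_out = {E_out_sq:.4f}")
print(f"binary  error: E_in = {E_in_bin:.4f}, E_out = {E_out_bin:.4f}")
```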
  • 90. Images/cinvestav We have the following Process Assuming P(y|x) instead of y = f(x) Then a data point (x, y) is now generated by the joint distribution P(x, y) = P(x) P(y|x) Therefore A noisy target is a deterministic target plus added noise: y = f(x) + (y − f(x)), with f(x) = E[y|x] and noise term y − f(x) 49 / 135
  • 92. Images/cinvestav Finally, we have as Learning Process [diagram: unknown target distribution P(y|x), input probability distribution P(x), and the training examples they generate] 50 / 135
  • 93. Images/cinvestav Therefore Distinction between P(y|x) and P(x) Both convey probabilistic aspects of x and y. Therefore 1 The target distribution P(y|x) is what we are trying to learn. 2 The input distribution P(x) quantifies the relative importance of x. Finally Merging them as P(x, y) = P(y|x) P(x) mixes the two concepts 51 / 135
  • 96. Images/cinvestav Therefore Learning is feasible because It is likely that Eout(g) ≈ Ein(g) Therefore, we need g ≈ f Eout(g) = P(g(x) ≠ f(x)) ≈ 0 How do we achieve this? Eout(g) ≈ Ein(g) = (1/N) Σ_{n=1}^{N} I(g(xn) ≠ f(xn)) 52 / 135
  • 99. Images/cinvestav Then We make, at the same time, Ein(g) ≈ 0 To make the error of our selected hypothesis g small with respect to the real function f Learning splits into two questions 1 Can we make Eout(g) close enough to Ein(g)? 2 Can we make Ein(g) small enough? 53 / 135
  • 101. Images/cinvestav Therefore, we have a nice connection with the Bias-Variance trade-off [plot: out-of-sample error and in-sample error as functions of model complexity] 54 / 135
  • 103. Images/cinvestav We have that The out-of-sample error Eout(h) = P(h(x) ≠ f(x)) It measures how well our training on D has generalized to data that we have not seen before. Remark Eout is based on the performance over the entire input space X. 56 / 135
  • 106. Images/cinvestav Testing Data Set Intuitively, we want to estimate the value of Eout using a sample of data points. Something Notable These points must be 'fresh' test points that have not been used for training. Basically, our testing set. 57 / 135
  • 110. Images/cinvestav Thus It is possible to define The generalization error as the discrepancy between Ein and Eout Therefore The Hoeffding Inequality is a way to characterize the generalization error with a probabilistic bound P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} For any ε > 0. 59 / 135
  • 113. Images/cinvestav Reinterpreting This Assume a tolerance level δ, for example δ = 0.0005 It is possible to say that, with probability 1 − δ: Eout(g) < Ein(g) + √((1/(2N)) ln(2M/δ)) 61 / 135
  • 114. Images/cinvestav Proof We have the complement of the Hoeffding probability using the absolute value P(|Eout(g) − Ein(g)| < ε) ≥ 1 − 2M e^{−2Nε²} Therefore, we have P(−ε < Eout(g) − Ein(g) < ε) ≥ 1 − 2M e^{−2Nε²} This implies, with probability at least 1 − 2M e^{−2Nε²}, Eout(g) < Ein(g) + ε 62 / 135
  • 117. Images/cinvestav Therefore We simply set δ = 2M e^{−2Nε²} Then ln(2M/δ) = 2Nε² Therefore ε = √((1/(2N)) ln(2M/δ)) 63 / 135
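This algebra can be wrapped in a tiny helper (a sketch based on the finite-M bound above): given N, M and the tolerance δ, it returns the ε guaranteed with probability at least 1 − δ, and, conversely, the sample size needed for a target ε.

```python
import math

def generalization_eps(N, M, delta):
    """epsilon such that |E_out - E_in| < eps with prob. >= 1 - delta (finite-M bound)."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

def samples_needed(eps, M, delta):
    """smallest N for which the finite-M Hoeffding bound gives tolerance eps."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

print(generalization_eps(N=1000, M=100, delta=0.05))   # ~0.064
print(samples_needed(eps=0.05, M=100, delta=0.05))     # ~1659
```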
  • 120. Images/cinvestav Generalization Bound This inequality is known as a generalization bound Eout(g) < Ein(g) + √((1/(2N)) ln(2M/δ)) 64 / 135
  • 122. Images/cinvestav We have The following inequality also holds −ε < Eout(g) − Ein(g) ⇒ Eout(g) > Ein(g) − ε Thus Not only do we want our hypothesis g to do well on the out-of-sample points, Eout(g) < Ein(g) + ε But we also want to know how well we did with our H Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better given any hypothesis with higher Ein(h) = (1/N) Σ_{n=1}^{N} I(h(xn) ≠ f(xn)) than g. 66 / 135
  • 125. Images/cinvestav Therefore But we want to know how well we did with our H Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better!!! Given any hypothesis h with a higher in-sample error than g, Ein(h) = (1/N) Σ_{n=1}^{N} I(h(xn) ≠ f(xn)), its out-of-sample error is only bounded below by Eout(h) > Ein(h) − ε 67 / 135
  • 129. Images/cinvestav The Infiniteness of H A problem with the error bound, given its dependency on M √((1/(2N)) ln(2M/δ)) What happens when M becomes infinite? The number of hypotheses in H becomes infinite, and the bound becomes vacuous Problem: almost all interesting learning models have an infinite H... For example, in our linear regression h(x) = w^T x, there is one hypothesis per weight vector w 69 / 135
  • 132. Images/cinvestav Therefore, we need to replace M We need to find a finite substitute that takes finite values For this, we notice that the event of interest is |Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε 70 / 135
  • 133. Images/cinvestav We have The event |Ein(g) − Eout(g)| ≥ ε is guaranteed to imply this union of events Thus, we can take a look at the events Bm: the event that |Ein(hm) − Eout(hm)| ≥ ε Then P[B1 or B2 · · · or BM] ≤ Σ_{m=1}^{M} P[Bm] 71 / 135
  • 135. Images/cinvestav Now, we have the following This union bound is a gross overestimate Basically, if hi and hj are quite similar, the two events |Ein(hi) − Eout(hi)| ≥ ε and |Ein(hj) − Eout(hj)| ≥ ε are likely to coincide!!! 72 / 135
  • 137. Images/cinvestav We have Something Notable In a typical learning model, many hypotheses are indeed very similar. The mathematical theory of generalization hinges on this observation We only need to account for the overlaps between different hypotheses in order to substitute M. 73 / 135
  • 140. Images/cinvestav Consider A finite data set X = {x1, x2, ..., xN} And we consider a set of hypotheses h ∈ H such that h : X → {−1, +1} We get an N-tuple, when h is applied to X, (h(x1), h(x2), ..., h(xN)) of ±1's. Such an N-tuple is called a dichotomy Given that it splits x1, x2, ..., xN into two groups... 75 / 135
  • 143. Images/cinvestav Dichotomy Definition Given a hypothesis set H, a dichotomy of a set X is one of the possible ways of labeling the points of X using a hypothesis in H. 76 / 135
  • 144. Images/cinvestav Examples of Dichotomies [figure: a dichotomy generated by a perceptron, with points labeled Class +1 on one side of the separating line and Class −1 on the other] 77 / 135
  • 145. Images/cinvestav Something Important Each h ∈ H generates a dichotomy on x1, ..., xN However, two different h’s may generate the same dichotomy if they generate the same pattern 78 / 135
  • 146. Images/cinvestav Remark Definition Let x1, x2, ..., xN ∈ X. The dichotomies generated by H on these points are defined by H(x1, x2, ..., xN) = {(h(x1), h(x2), ..., h(xN)) | h ∈ H} Therefore We can see H(x1, x2, ..., xN) as a set of hypotheses restricted to the geometry of these points. Thus A large H(x1, x2, ..., xN) means H is more diverse. 79 / 135
  • 149. Images/cinvestav Growth function, Our Replacement of M Definition The growth function is defined for a hypothesis set H by mH(N) = max_{x1,...,xN ∈ X} #H(x1, x2, ..., xN) where # denotes the cardinality (number of elements) of a set. Therefore mH(N) is the maximum number of dichotomies that can be generated by H on any N points. We remove the dependency on the entire X 80 / 135
  • 151. Images/cinvestav Therefore We have that Like M, mH(N) is a measure of the number of hypotheses in H However, we avoid considering all of X Now we only consider N points instead of the entire X. 81 / 135
  • 153. Images/cinvestav Upper Bound for mH(N) First, we know that H(x1, x2, ..., xN) ⊆ {−1, +1}^N Hence, the value of mH(N) is at most #{−1, +1}^N: mH(N) ≤ 2^N 82 / 135
  • 156. Images/cinvestav Therefore If H is capable of generating all possible dichotomies on x1, x2, ..., xN Then, H(x1, x2, ..., xN) = {−1, +1}^N and #H(x1, x2, ..., xN) = 2^N We say that H can shatter x1, x2, ..., xN Meaning H is as diverse as can be on this particular sample. 84 / 135
  • 159. Images/cinvestav Shattering Definition A set X of N ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of X, that is, when mH(N) = 2^N 85 / 135
  • 161. Images/cinvestav Example: Positive Rays Imagine an input space on R, with H consisting of all hypotheses h : R → {−1, +1} of the form h(x) = sign(x − a) [figure: N points on the real line split by the threshold a] 87 / 135
  • 163. Images/cinvestav Thus, we have that As we change a, we get N + 1 different dichotomies mH(N) = N + 1 Now, we have the case of positive intervals H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise. 88 / 135
  • 165. Images/cinvestav Therefore We have The line is again split by the points into N + 1 regions. Furthermore The dichotomy we get is decided by which two regions contain the end points of the interval Therefore, the number of such dichotomies is C(N + 1, 2) 89 / 135
  • 168. Images/cinvestav Additionally If the two end points fall in the same region, the dichotomy is the constant −1 Then mH(N) = C(N + 1, 2) + 1 = (1/2)N² + (1/2)N + 1 90 / 135
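Both counts are easy to check by brute force (a quick sketch, not from the slides): take N distinct points on the line, enumerate one hypothesis per relevant threshold, and count the distinct ±1 patterns they produce.

```python
import numpy as np
from itertools import combinations
from math import comb

def dichotomies_positive_rays(points):
    """Distinct patterns of h(x) = sign(x - a) on the given points."""
    pts = np.sort(np.asarray(points, dtype=float))
    # thresholds: one below all points, one between each consecutive pair, one above all
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return {tuple(np.where(points > a, 1, -1)) for a in cuts}

def dichotomies_positive_intervals(points):
    """Distinct patterns of h(x) = +1 on (l, r) and -1 elsewhere."""
    pts = np.sort(np.asarray(points, dtype=float))
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    nonempty = {tuple(np.where((points > l) & (points < r), 1, -1))
                for l, r in combinations(cuts, 2)}
    return nonempty | {tuple([-1] * len(points))}   # add the all -1 dichotomy

N = 6
points = np.arange(1, N + 1, dtype=float)
print(len(dichotomies_positive_rays(points)), "==", N + 1)
print(len(dichotomies_positive_intervals(points)), "==", comb(N + 1, 2) + 1)
```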
  • 169. Images/cinvestav Finally In the case of convex sets in R² H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere. 91 / 135
  • 170. Images/cinvestav Therefore We have the following mH(N) = 2^N For example, place the N points on a circle: for any dichotomy, the convex hull of the +1 points is a convex set containing none of the −1 points, so every dichotomy is realized 92 / 135
  • 172. Images/cinvestav Remember We have that P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} What if mH(N) replaces M? If mH(N) is polynomial, we have an excellent case!!! Therefore, we need to prove that mH(N) is polynomial 94 / 135
  • 176. Images/cinvestav Break Point Definition If no data set of size k can be shattered by H, then k is said to be a break point for H: mH(k) < 2^k 96 / 135
  • 177. Images/cinvestav Example For the perceptron, we have the break point k = 4 [figure: three points that can be shattered, and four points with a labeling that cannot be realized] 97 / 135
  • 178. Images/cinvestav Important Something Notable In general, it is easier to find a break point for H than to compute the full growth function for that H. Using this concept We are ready to define the concept of Vapnik–Chervonenkis (VC) dimension. 98 / 135
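Shattering can also be checked by brute force for the 2-D perceptron (a sketch that assumes SciPy is available for the feasibility check): for every ±1 labeling of the points, a small linear program asks whether some w, b satisfies y_i(w·x_i + b) ≥ 1 for all i; the set is shattered iff every labeling is realizable. Three points in general position pass, while the four corners of a square fail, consistent with k = 4 being a break point.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: find (w, b) with y_i (w . x_i + b) >= 1 for all i."""
    N, d = X.shape
    # variables z = (w_1, ..., w_d, b); constraint -y_i (x_i . w + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """True if the 2-D perceptron realizes every dichotomy of the rows of X."""
    return all(linearly_separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # square corners

print("3 points shattered:", shattered(three))   # True
print("4 points shattered:", shattered(four))    # False (the XOR labeling fails)
```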
  • 181. Images/cinvestav VC-Dimension Definition The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (those points need to be in "general position"): VCdim(H) = max{k | mH(k) = 2^k} A set containing k points, for arbitrary k, is in general linear position if and only if no (k − 1)-dimensional flat contains them all 100 / 135
  • 182. Images/cinvestav Important Remarks Remark 1 If VCdim(H) = d, there exists a set of size d that can be fully shattered. Remark 2 This does not imply that all sets of size d or less are fully shattered; sets that are not in general position may fail to be shattered. 101 / 135
  • 184. Images/cinvestav Why? General Linear Position [figure: for the perceptron, a set of points not in general position, e.g. collinear points, cannot be shattered] 102 / 135
  • 185. Images/cinvestav Now, we define B(N, k) Definition B(N, k) is the maximum number of dichotomies on N points such that no subset of size k of the N points can be shattered by these dichotomies. Something Notable The definition of B(N, k) assumes that k is a break point!!! 103 / 135
  • 187. Images/cinvestav Further Since B(N, k) is a maximum It is an upper bound for mH(N) whenever k is a break point: mH(N) ≤ B(N, k) if k is a break point for H. Then We need to find a bound for B(N, k) to prove that mH(N) is polynomial. 104 / 135
  • 189. Images/cinvestav Therefore Thus, we start with two boundary conditions, k = 1 and N = 1: B(N, 1) = 1 B(1, k) = 2 for k > 1 105 / 135
  • 190. Images/cinvestav Why? Something Notable B(N, 1) = 1 for all N: if no subset of size 1 can be shattered, then only one dichotomy can be allowed, because a second, different dichotomy would have to differ on at least one point, and then that subset of size 1 would be shattered. Second B(1, k) = 2 for k > 1, since there do not even exist subsets of size k: the constraint is vacuously true and we have the 2 possible dichotomies, +1 and −1. 106 / 135
  • 196. Images/cinvestav B(N, k) Dichotomies, N ≥ 2 and k ≥ 2 [table: the B(N, k) dichotomies listed as rows over the columns x1, x2, ..., xN−1, xN, partitioned into a block S1 with α rows, whose pattern on x1 · · · xN−1 appears exactly once, and a block S2 = S2+ ∪ S2− with β + β rows, whose pattern on x1 · · · xN−1 appears twice, once with xN = +1 (S2+) and once with xN = −1 (S2−)] 108 / 135
  • 197. Images/cinvestav What does this partition mean? First, consider the dichotomies restricted to x1 x2 · · · xN−1 Some patterns appear only ONCE (with either +1 or −1 at xN); we collect them in S1 The remaining patterns appear twice Once with +1 and once with −1 in the xN column. 109 / 135
  • 200. Images/cinvestav Therefore, we collect them in three sets The dichotomies whose pattern appears only once We put in the set S1 The others go in two different sets S2+, the ones with xN = +1 S2−, the ones with xN = −1 110 / 135
  • 202. Images/cinvestav Therefore We have the following B(N, k) = α + 2β The total number of different dichotomies on the first N − 1 points They are α + β. Additionally, no subset of k of these first N − 1 points can be shattered Since no k-subset of all N points can be shattered: α + β ≤ B(N − 1, k) By definition of B. 111 / 135
  • 205. Images/cinvestav Then Further, no subset of size k − 1 of the first N − 1 points can be shattered by the dichotomies in S2+ If there existed such a subset, then taking the corresponding dichotomies in S2− and adding xN You would finish with a subset of size k that can be shattered, a contradiction with the definition of B(N, k). Therefore β ≤ B(N − 1, k − 1) Then, we have B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1) 112 / 135
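The boundary conditions and this recursion give a small dynamic program (a sketch): filling the table with equality yields an upper bound on B(N, k), which can be compared against the binomial sum of Sauer's lemma on the next slide.

```python
from math import comb

def B_upper(N_max, k_max):
    """Upper bounds on B(N, k) from B(N, 1) = 1, B(1, k) = 2 (k > 1)
       and B(N, k) <= B(N-1, k) + B(N-1, k-1), taken with equality."""
    B = [[0] * (k_max + 1) for _ in range(N_max + 1)]
    for N in range(1, N_max + 1):
        B[N][1] = 1
    for k in range(2, k_max + 1):
        B[1][k] = 2
    for N in range(2, N_max + 1):
        for k in range(2, k_max + 1):
            B[N][k] = B[N - 1][k] + B[N - 1][k - 1]
    return B

N_max, k = 10, 3
B = B_upper(N_max, k)
for N in range(1, N_max + 1):
    sauer = sum(comb(N, i) for i in range(k))     # sum_{i=0}^{k-1} C(N, i)
    # the DP value coincides with the binomial sum, and both stay far below 2^N
    print(N, B[N][k], sauer, 2 ** N)
```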
  • 210. Images/cinvestav Connecting the Growth Function with the VCdim Sauer's Lemma For all N ≥ 1 and k ≥ 1, the following inequality holds: B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) 114 / 135
  • 211. Images/cinvestav Proof For k = 1 and any N, B(N, 1) = 1 = C(N, 0), so the statement holds; similarly, it holds for N = 1 and all k. Then, by induction We assume that the statement is true for all N ≤ N0 and all k. 115 / 135
  • 213. Images/cinvestav Now We need to prove the statement for N = N0 + 1 and all k Observation: it is true for k = 1, given B(N, 1) = 1 Now, consider k ≥ 2: B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1) Therefore, by the induction hypothesis, B(N0 + 1, k) ≤ Σ_{i=0}^{k−1} C(N0, i) + Σ_{i=0}^{k−2} C(N0, i) 116 / 135
  • 216. Images/cinvestav Therefore We have the following B(N0 + 1, k) ≤ 1 + Σ_{i=1}^{k−1} C(N0, i) + Σ_{i=1}^{k−1} C(N0, i − 1) = 1 + Σ_{i=1}^{k−1} [C(N0, i) + C(N0, i − 1)] = 1 + Σ_{i=1}^{k−1} C(N0 + 1, i) = Σ_{i=0}^{k−1} C(N0 + 1, i) Because, by Pascal's identity, C(N0, i) + C(N0, i − 1) = C(N0 + 1, i) 117 / 135
  • 220. Images/cinvestav Now We have, in conclusion, for every break point k B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) Therefore mH(N) ≤ B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) 118 / 135
  • 222. Images/cinvestav Then Theorem If mH(k) < 2^k for some value k, then mH(N) ≤ Σ_{i=0}^{k−1} C(N, i) for all N 119 / 135
  • 223. Images/cinvestav Finally Corollary Let H be a hypothesis set with VCdim(H) = k. Then, for all N ≥ k, mH(N) ≤ (eN/k)^k = O(N^k) 120 / 135
  • 224. Images/cinvestav We have Proof (using N ≥ k) mH(N) ≤ Σ_{i=0}^{k} C(N, i) ≤ Σ_{i=0}^{k} C(N, i) (N/k)^{k−i} ≤ Σ_{i=0}^{N} C(N, i) (N/k)^{k−i} = (N/k)^k Σ_{i=0}^{N} C(N, i) (k/N)^i 121 / 135
  • 228. Images/cinvestav Therefore We have mH(N) ≤ (N/k)^k Σ_{i=0}^{N} C(N, i) (k/N)^i = (N/k)^k (1 + k/N)^N Given that 1 + x ≤ e^x, mH(N) ≤ (N/k)^k e^k = (eN/k)^k = O(N^k) 122 / 135
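The polynomial behavior is easy to see numerically (a quick sketch): for a fixed VC dimension k, the bound Σ_{i=0}^{k} C(N, i) and the cruder (eN/k)^k both grow like N^k, while 2^N explodes.

```python
from math import comb, e

def sauer_sum(N, k):
    """sum_{i=0}^{k} C(N, i): the growth-function bound when VCdim(H) = k."""
    return sum(comb(N, i) for i in range(k + 1))

k = 3
for N in (5, 10, 20, 50, 100):
    poly = (e * N / k) ** k
    print(f"N={N:4d}  sum={sauer_sum(N, k):10d}  (eN/k)^k={poly:14.1f}  2^N={2 ** N}")
```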
  • 232. Images/cinvestav Therefore We have that mH(N) is bounded by a polynomial of degree k − 1, i.e., if mH(k) < 2^k for some break point k, then mH(N) = O(N^{k−1}) is polynomial We no longer depend on the number of hypotheses!!!! 123 / 135
  • 236. Images/cinvestav Remark about mH(N) We have bounded the number of effective hypotheses Yes!!! we can have M hypotheses, but the number of dichotomies they generate on N points is bounded by mH(N) 125 / 135
  • 237. Images/cinvestav VC-Dimension Again Definition The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (Those points need to be in “General Position”): V Cdim (H) = max k|mH (k) = 2k Something Notable If mH (N) = 2N for all N, V Cdim (H) = ∞ 126 / 135
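To make the definition concrete, here is a small self-contained example that is not in the slides: for the hypothesis class of “positive rays” $h_a(x) = \operatorname{sign}(x - a)$ on the real line, brute-force enumeration of dichotomies gives $m_H(N) = N + 1$, so $m_H(1) = 2 = 2^1$ while $m_H(2) = 3 < 2^2$, and therefore $VC_{dim}(H) = 1$.

```python
def ray_dichotomies(points):
    """Enumerate the distinct labelings of the given points produced by
    h_a(x) = +1 if x >= a else -1, as the threshold a sweeps the real line."""
    pts = sorted(points)
    # One representative threshold per gap, plus one below and one above all points.
    thresholds = ([pts[0] - 1.0]
                  + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)]
                  + [pts[-1] + 1.0])
    return {tuple(+1 if x >= a else -1 for x in pts) for a in thresholds}

for N in range(1, 6):
    points = list(range(N))            # any N distinct points give the same count
    m = len(ray_dichotomies(points))
    print(f"N = {N}: m_H(N) = {m}, 2^N = {2 ** N}")
    assert m == N + 1
# m_H(1) = 2 = 2^1, but m_H(2) = 3 < 4, so the VC dimension of positive rays is 1.
```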
  • 239. Images/cinvestav Remember We have the following: $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$. Instead of using $M$, we use $m_H(N)$: we can use the growth function as the effective count in the bound, $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2\, m_H(N)}{\delta}}$. 127 / 135
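To see why replacing $M$ with $m_H(N)$ matters, the following sketch (illustrative only) compares the error bar $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ for a large finite $M$ against the growth-function version; the value $m_H(N) = N + 1$ from the positive-ray example above is used purely as a stand-in.

```python
from math import log, sqrt

def error_bar(count, N, delta):
    # sqrt( (1/(2N)) * ln(2 * count / delta) )
    return sqrt(log(2 * count / delta) / (2 * N))

N, delta = 1000, 0.05
M = 10 ** 6            # a large (finite) hypothesis count
m_H = N + 1            # growth function of positive rays, used as a stand-in

print("bound with M      :", round(error_bar(M, N, delta), 4))
print("bound with m_H(N) :", round(error_bar(m_H, N, delta), 4))
# The growth-function version stays informative even when M is huge (or infinite),
# because m_H(N) grows only polynomially in N once there is a break point.
```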
  • 241. Images/cinvestav VC Generalized Bound Theorem (VC Generalization Bound): For any tolerance $\delta > 0$ and any hypothesis set $H$ with $VC_{dim}(H) = k$, $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{2k}{N}\ln\frac{eN}{k}} + \sqrt{\frac{1}{2N}\ln\frac{1}{\delta}}$ with probability $\ge 1 - \delta$. Something Notable: this bound only fails (becomes vacuous) when $VC_{dim}(H) = \infty$!!! 128 / 135
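A minimal sketch of how the two terms of the theorem behave as $N$ grows, for a hypothetical VC dimension $k = 3$ and tolerance $\delta = 0.05$; it simply evaluates the right-hand side of the bound.

```python
from math import e, log, sqrt

def vc_bound(N, k, delta):
    # Generalization gap predicted by the VC bound on this slide.
    complexity = sqrt((2 * k / N) * log(e * N / k))
    confidence = sqrt(log(1 / delta) / (2 * N))
    return complexity + confidence

k, delta = 3, 0.05
for N in [100, 1_000, 10_000, 100_000]:
    print(f"N = {N:>6}: E_out - E_in <= {vc_bound(N, k, delta):.3f}")
# The gap shrinks roughly like sqrt(k * ln(N) / N), so it vanishes as N grows
# whenever VCdim(H) = k is finite.
```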
  • 243. Images/cinvestav Proof Although we will not go through it, we remark that it is possible to use the Rademacher complexity to handle the number of overlapping hypotheses (which can be infinite). We will stop here, but I encourage you to look further into the proof... 129 / 135
  • 245. Images/cinvestav About the Proof For more, take a look at “A Probabilistic Theory of Pattern Recognition” by Luc Devroye et al. and “Foundations of Machine Learning” by Mehryar Mohri et al. This is the equivalent of using Measure Theory to understand the innards of Probability. We are professionals, we must understand!!! 130 / 135
  • 247. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 131 / 135
  • 248. Images/cinvestav As you remember from previous classes, we have architectures like the multi-layer perceptron shown on this slide. 132 / 135
  • 249. Images/cinvestav G-composition of H Let $G$ be a layered directed acyclic graph where directed edges go from one layer $l$ to the next layer $l + 1$. Now, we have a hypothesis set $H$: $N$ input nodes with in-degree 0, intermediate nodes with in-degree $r$, and a single output node with out-degree 0. $H$ is our hypothesis set over the Euclidean space $\mathbb{R}^r$: basically, each node represents a hypothesis $c_i : \mathbb{R}^r \to \{-1, 1\}$, by means of tanh. 133 / 135
  • 254. Images/cinvestav Therefore We have that the network represents a hypothesis from $\mathbb{R}^N$ to $\{-1, 1\}$; therefore, the entire hypothesis is a composition of the node concepts. This is called the G-composition of $H$, denoted $H_G$. 134 / 135
  • 256. Images/cinvestav We have the following theorem Theorem (Kearns and Vazirani, 1994): Let $G$ be a layered directed acyclic graph with $N$ input nodes and $s \ge 2$ internal nodes, each of in-degree $r$. Let $H$ be a hypothesis set over $\mathbb{R}^r$ with $VC_{dim}(H) = d$, and let $H_G$ be the G-composition of $H$. Then $VC_{dim}(H_G) \le 2ds\log_2(es)$. 135 / 135
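As a quick illustration of the theorem (an assumption-laden sketch, not from the slides): if each internal node were a linear threshold unit over $\mathbb{R}^r$, whose VC dimension is $d = r + 1$, the bound $2ds\log_2(es)$ for an architecture with $s$ internal nodes can be evaluated directly.

```python
from math import e, log2

def g_composition_vc_bound(d, s):
    # Kearns & Vazirani (1994): VCdim(H_G) <= 2 * d * s * log2(e * s), for s >= 2.
    assert s >= 2
    return 2 * d * s * log2(e * s)

# Hypothetical small network: internal nodes of in-degree r = 5, each node a
# linear threshold unit, whose VC dimension is d = r + 1 = 6 (an assumption
# chosen for illustration; the slides use tanh-style nodes).
r = 5
d = r + 1
for s in [2, 5, 10, 50]:
    print(f"s = {s:>2} internal nodes: VCdim(H_G) <= {g_composition_vc_bound(d, s):.1f}")
```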