1.
Introduction to Statistical Machine Learning
Christfried Webers, Statistical Machine Learning Group, NICTA and College of Engineering and Computer Science, The Australian National University, Canberra, February – June 2011. © 2011 Christfried Webers.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
2.
Part VII: Linear Classification 1
Outline: Classification; Generalised Linear Model; Inference and Decision; Discriminant Functions; Fisher's Linear Discriminant; The Perceptron Algorithm.
3.
Classification
Goal: Given input data $\mathbf{x}$, assign it to one of $K$ discrete classes $\mathcal{C}_k$, where $k = 1, \dots, K$.
Divide the input space into different regions.
Figure: Length of the petal [in cm] for a given sepal [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).
4.
How to represent binary class labels?
Class labels are no longer real values as in regression, but a discrete set.
Two classes: $t \in \{0, 1\}$ ($t = 1$ represents class $\mathcal{C}_1$ and $t = 0$ represents class $\mathcal{C}_2$).
One can interpret the value of $t$ as the probability of class $\mathcal{C}_1$, with only two possible values for the probability, 0 or 1.
Note: Other conventions for mapping classes to integers are possible; check the setup.
5.
How to represent multi-class labels?
If there are more than two classes ($K > 2$), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which $\mathbf{t}$ is a vector of length $K$ with all values 0 except for $t_j = 1$, where $j$ is the index of the class $\mathcal{C}_j$ being encoded.
Example: Given 5 classes $\{\mathcal{C}_1, \dots, \mathcal{C}_5\}$, membership in class $\mathcal{C}_2$ is encoded as the target vector
$$\mathbf{t} = (0, 1, 0, 0, 0)^T.$$
Note: Other conventions for mapping multiple classes to integers are possible; check the setup.
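The 1-of-K coding scheme can be sketched in a few lines of Python. The helper names `one_of_k` and `decode` are hypothetical and chosen only for this illustration; class indices are taken to be 1-based, as in the example above.

```python
def one_of_k(class_index, num_classes):
    """Return the 1-of-K target vector for a 1-based class index."""
    if not 1 <= class_index <= num_classes:
        raise ValueError("class_index must lie in 1..num_classes")
    return [1 if j == class_index else 0 for j in range(1, num_classes + 1)]


def decode(target):
    """Recover the 1-based class index from a 1-of-K target vector."""
    return target.index(1) + 1


t = one_of_k(2, 5)
print(t)          # [0, 1, 0, 0, 0]
print(decode(t))  # 2
```

Storing targets as plain lists keeps the sketch dependency-free; in practice a matrix of such rows is often built in one step.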
6.
Linear Model
Idea: Use a linear model again, as in regression: $y(\mathbf{x}, \mathbf{w})$ is a linear function of the parameters $\mathbf{w}$,
$$y(\mathbf{x}_n, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n).$$
But generally $y(\mathbf{x}_n, \mathbf{w}) \in \mathbb{R}$.
Example: Which class is $y(\mathbf{x}, \mathbf{w}) = 0.71623$?
7.
Generalised Linear Model
Apply a mapping $f : \mathbb{R} \to \mathbb{Z}$ to the linear model to get the discrete class labels.
Generalised linear model:
$$y(\mathbf{x}_n, \mathbf{w}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))$$
Activation function: $f(\cdot)$
Link function: $f^{-1}(\cdot)$
Figure: Example of an activation function $f(z) = \operatorname{sign}(z)$.
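As a minimal sketch of a generalised linear model with the sign activation, the following Python snippet maps the real-valued output of the linear model to a label in {-1, +1}. The feature map `phi` (a bias feature followed by the raw inputs), the weight values, and the convention sign(0) = +1 are assumptions made for this illustration only.

```python
def sign_activation(z):
    """f(z) = sign(z); the convention sign(0) = +1 is assumed here."""
    return 1 if z >= 0 else -1


def phi(x):
    """Hypothetical feature map: a constant bias feature, then the raw inputs."""
    return [1.0] + list(x)


def predict(x, w):
    """y(x, w) = f(w^T phi(x))."""
    a = sum(wi * pi for wi, pi in zip(w, phi(x)))
    return sign_activation(a)


w = [-1.0, 2.0, 0.5]            # illustrative weights (w0, w1, w2)
print(predict([0.9, 0.2], w))   # linear part is positive, so the label is 1
print(predict([0.1, 0.2], w))   # linear part is negative, so the label is -1
```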
8.
Three Models for Decision Problems
In increasing order of complexity:
Discriminant function: Find a discriminant function $f(\mathbf{x})$ which maps each input directly onto a class label.
Discriminative models:
(1) Solve the inference problem of determining the posterior class probabilities $p(\mathcal{C}_k \mid \mathbf{x})$.
(2) Use decision theory to assign each new $\mathbf{x}$ to one of the classes.
Generative models:
(1) Solve the inference problem of determining the class-conditional probabilities $p(\mathbf{x} \mid \mathcal{C}_k)$.
(2) Also, infer the prior class probabilities $p(\mathcal{C}_k)$.
(3) Use Bayes' theorem to find the posterior $p(\mathcal{C}_k \mid \mathbf{x})$.
(4) Alternatively, model the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ directly.
(5) Use decision theory to assign each new $\mathbf{x}$ to one of the classes.
9.
Two Classes
Definition: A discriminant is a function that maps from an input vector $\mathbf{x}$ to one of $K$ classes, denoted by $\mathcal{C}_k$.
Consider first two classes ($K = 2$).
Construct a linear function of the inputs $\mathbf{x}$,
$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0,$$
such that $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if $y(\mathbf{x}) \geq 0$, and to class $\mathcal{C}_2$ otherwise.
Weight vector: $\mathbf{w}$
Bias: $w_0$ (sometimes $-w_0$ is called the threshold)
10.
Two Classes
The decision boundary $y(\mathbf{x}) = 0$ is a $(D-1)$-dimensional hyperplane in a $D$-dimensional input space (decision surface).
$\mathbf{w}$ is orthogonal to any vector lying in the decision surface.
Proof: Assume $\mathbf{x}_A$ and $\mathbf{x}_B$ are two points lying in the decision surface. Then
$$0 = y(\mathbf{x}_A) - y(\mathbf{x}_B) = \mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B).$$
11.
Two Classes
The normal distance from the origin to the decision surface is
$$\frac{\mathbf{w}^T \mathbf{x}}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$
for any point $\mathbf{x}$ on the decision surface.
Figure: Geometry of the linear discriminant in two dimensions ($x_1$, $x_2$), showing the decision surface $y = 0$, the regions $\mathcal{R}_1$ ($y > 0$) and $\mathcal{R}_2$ ($y < 0$), the weight vector $\mathbf{w}$, the projection $\mathbf{x}_\perp$ of a point $\mathbf{x}$ onto the surface, and the distances $y(\mathbf{x})/\|\mathbf{w}\|$ and $-w_0/\|\mathbf{w}\|$.
12.
Two Classes
The value of $y(\mathbf{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface, $r = y(\mathbf{x}) / \|\mathbf{w}\|$.
To see this, write $\mathbf{x} = \mathbf{x}_\perp + r\,\mathbf{w}/\|\mathbf{w}\|$, where $\mathbf{x}_\perp$ is the orthogonal projection of $\mathbf{x}$ onto the decision surface. Then
$$y(\mathbf{x}) = \mathbf{w}^T\left(\mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0 = r\frac{\mathbf{w}^T\mathbf{w}}{\|\mathbf{w}\|} + \underbrace{\mathbf{w}^T\mathbf{x}_\perp + w_0}_{=\,0} = r\,\|\mathbf{w}\|.$$
Figure: Geometry of the linear discriminant in two dimensions, as on the previous slide.
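The geometric statements above can be checked numerically. The sketch below uses illustrative values for $\mathbf{w}$, $w_0$ and $\mathbf{x}$ (assumptions, not taken from the slides) and verifies that subtracting $r\,\mathbf{w}/\|\mathbf{w}\|$ from $\mathbf{x}$ lands on the decision surface.

```python
import math


def y(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def signed_distance(x, w, w0):
    """Signed perpendicular distance r = y(x) / ||w|| from the decision surface."""
    return y(x, w, w0) / norm(w)


w, w0 = [3.0, 4.0], -5.0          # illustrative values with ||w|| = 5
x = [3.0, 4.0]

r = signed_distance(x, w, w0)
x_perp = [xi - r * wi / norm(w) for xi, wi in zip(x, w)]  # projection onto the surface

print(r)                            # 4.0
print(abs(y(x_perp, w, w0)) < 1e-12)  # True: x_perp lies on y = 0
print(-w0 / norm(w))                # 1.0, the distance of the surface from the origin
```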
13.
Two Classes
More compact notation: Add an extra dimension to the input space and set its value to $x_0 = 1$.
Also define $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$, so that
$$y(\mathbf{x}) = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}.$$
The decision surface is now a $D$-dimensional hyperplane in a $(D+1)$-dimensional expanded input space.
14.
Multi-Class
Number of classes $K > 2$.
Can we combine a number of two-class discriminant functions using $K - 1$ one-versus-the-rest classifiers?
Figure: Two boundaries ($\mathcal{C}_1$ versus not $\mathcal{C}_1$, $\mathcal{C}_2$ versus not $\mathcal{C}_2$) producing regions $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$ and an ambiguous region marked "?".
15.
Multi-Class
Number of classes $K > 2$.
Can we combine a number of two-class discriminant functions using $K(K-1)/2$ one-versus-one classifiers?
Figure: Pairwise boundaries between $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$ producing regions $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$ and an ambiguous region marked "?".
16.
Multi-Class
Number of classes $K > 2$.
Solution: Use $K$ linear functions
$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}.$$
Assign input $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
The decision boundary between class $\mathcal{C}_k$ and $\mathcal{C}_j$ is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$.
Figure: Decision regions $\mathcal{R}_i$, $\mathcal{R}_j$, $\mathcal{R}_k$, with two points $\mathbf{x}_A$ and $\mathbf{x}_B$ in $\mathcal{R}_k$ and a point on the line between them.
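The decision rule with $K$ linear functions amounts to taking the index of the largest score. A minimal sketch, with hypothetical weights and biases for $K = 3$ classes in two dimensions:

```python
def classify(x, weights, biases):
    """Return (1-based class index, scores) using y_k(x) = w_k^T x + w_k0."""
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + w_k0
              for w_k, w_k0 in zip(weights, biases)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


# Illustrative parameters (assumptions, not taken from the slides).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

print(classify([2.0, 1.0], weights, biases)[0])    # 1
print(classify([0.0, 3.0], weights, biases)[0])    # 2
print(classify([-1.0, -2.0], weights, biases)[0])  # 3
```

Unlike the one-versus-the-rest and one-versus-one constructions, every input receives exactly one label here (ties on a boundary are broken by taking the first maximal score, a choice made for this sketch).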
17.
Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data $\mathbf{x}$ belonging to one of $K$ classes $\mathcal{C}_k$.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \qquad k = 1, \dots, K.$$
18.
Least Squares for Classification
With the conventions
$$\tilde{\mathbf{w}}_k = \begin{pmatrix} w_{k0} \\ \mathbf{w}_k \end{pmatrix} \in \mathbb{R}^{D+1}, \qquad \tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix} \in \mathbb{R}^{D+1}, \qquad \tilde{\mathbf{W}} = \begin{pmatrix} \tilde{\mathbf{w}}_1 & \dots & \tilde{\mathbf{w}}_K \end{pmatrix} \in \mathbb{R}^{(D+1) \times K}$$
we get for the (vector-valued) discriminant function
$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} \in \mathbb{R}^K.$$
For a new input $\mathbf{x}$, the class is then defined by the index of the largest value in the vector $\mathbf{y}(\mathbf{x})$.
19.
Determine W
Given a training set $\{\mathbf{x}_n, \mathbf{t}_n\}$, where $n = 1, \dots, N$, and $\mathbf{t}_n$ is the class in the 1-of-K coding scheme.
Define a matrix $\mathbf{T}$ whose row $n$ is $\mathbf{t}_n^T$, and correspondingly a matrix $\tilde{\mathbf{X}}$ whose row $n$ is the augmented input $\tilde{\mathbf{x}}_n^T = (1, \mathbf{x}_n^T)$.
The sum-of-squares error can now be written as
$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \operatorname{tr}\left\{ (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}) \right\}.$$
The minimum of $E_D(\tilde{\mathbf{W}})$ is reached for
$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{T} = \tilde{\mathbf{X}}^\dagger \mathbf{T},$$
where $\tilde{\mathbf{X}}^\dagger$ is the pseudo-inverse of $\tilde{\mathbf{X}}$.
20.
Discriminant Function for Multi-Class
The discriminant function $\mathbf{y}(\mathbf{x})$ is therefore
$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} = \mathbf{T}^T (\tilde{\mathbf{X}}^\dagger)^T \tilde{\mathbf{x}},$$
where $\tilde{\mathbf{X}}$ is given by the training data (augmented with a leading 1 in every row), and $\tilde{\mathbf{x}}$ is the augmented new input.
Interesting property: If for every $\mathbf{t}_n$ the same linear constraint $\mathbf{a}^T \mathbf{t}_n + b = 0$ holds, then the prediction $\mathbf{y}(\mathbf{x})$ will also obey the same constraint
$$\mathbf{a}^T \mathbf{y}(\mathbf{x}) + b = 0.$$
For the 1-of-K coding scheme, the sum of all components in $\mathbf{t}_n$ is one, and therefore the components of $\mathbf{y}(\mathbf{x})$ will sum to one.
BUT: the components are not probabilities, as they are not constrained to the interval $(0, 1)$.
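A small NumPy sketch of the least-squares classifier illustrates both the closed-form solution via the pseudo-inverse and the sum-to-one property. The synthetic data (three Gaussian blobs), the random seed, and the names `X_tilde`, `W_tilde` and `discriminant` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (an assumption for illustration): K = 3 Gaussian
# blobs in D = 2 dimensions, 20 points per class.
centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
K, D = centres.shape
n_per_class = 20
X = np.vstack([c + rng.normal(size=(n_per_class, D)) for c in centres])
labels = np.repeat(np.arange(K), n_per_class)       # 0-based class indices

T = np.eye(K)[labels]                                # N x K, rows are 1-of-K targets
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # N x (D+1), leading 1 for the bias
W_tilde = np.linalg.pinv(X_tilde) @ T                # (D+1) x K least-squares solution


def discriminant(x):
    """Return y(x) = W_tilde^T x_tilde for a single input x of length D."""
    return W_tilde.T @ np.concatenate(([1.0], x))


y_near = discriminant(np.array([3.5, 0.5]))   # close to the second class centre
y_far = discriminant(np.array([8.0, 0.0]))    # far outside the training data

print("predicted class:", int(np.argmax(y_near)) + 1)
print("component sums:", y_near.sum(), y_far.sum())   # both 1 up to rounding
print("largest component far away:", y_far.max())     # exceeds 1 for this data
```

The last line shows why the outputs cannot be read as probabilities: the components sum to one, but individual values can leave the interval (0, 1).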
21.
Deficiencies of the Least Squares Approach
Figure (two panels): Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.
22.
Deficiencies of the Least Squares Approach
Figure (two panels, second example): Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.
23.
Fisher's Linear Discriminant
View linear classification as dimensionality reduction:
$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$
If $y \geq -w_0$ then class $\mathcal{C}_1$, otherwise $\mathcal{C}_2$.
But there are many projections from a $D$-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?
24.
Fisher's Linear Discriminant
Figure (two panels): Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.
25.
Fisher's Linear Discriminant - First Try
Given $N_1$ input data of class $\mathcal{C}_1$ and $N_2$ input data of class $\mathcal{C}_2$, calculate the centres of the two classes
$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}_n.$$
Choose $\mathbf{w}$ so as to maximise the separation of the class means projected onto $\mathbf{w}$,
$$m_1 - m_2 = \mathbf{w}^T (\mathbf{m}_1 - \mathbf{m}_2),$$
where $m_k = \mathbf{w}^T \mathbf{m}_k$ is the projected mean of class $\mathcal{C}_k$ (with $\mathbf{w}$ kept at fixed length, since otherwise the separation can be made arbitrarily large by rescaling $\mathbf{w}$).
Problem: this does not work well with non-uniform covariance.
Figure: Two classes in a two-dimensional input space illustrating the problem.
26.
Fisher's Linear Discriminant
Measure also the within-class variance for each class
$$s_k^2 = \sum_{n \in \mathcal{C}_k} (y_n - m_k)^2,$$
where $y_n = \mathbf{w}^T \mathbf{x}_n$.
Maximise the Fisher criterion
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.$$
Figure: Two classes in a two-dimensional input space projected using the Fisher criterion.
27.
Fisher's Linear Discriminant
The Fisher criterion can be rewritten as
$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}.$$
$\mathbf{S}_B$ is the between-class covariance
$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T.$$
$\mathbf{S}_W$ is the within-class covariance
$$\mathbf{S}_W = \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T.$$
28.
Fisher's Linear Discriminant
The Fisher criterion
$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$$
has a maximum for Fisher's linear discriminant
$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1).$$
Fisher's linear discriminant is NOT a discriminant, but it can be used to construct one by choosing a threshold $y_0$ in the projection space.
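The closed-form direction can be compared against the "first try" direction $\mathbf{m}_2 - \mathbf{m}_1$ on synthetic data. In the NumPy sketch below, the data (two strongly correlated Gaussian classes), the seed, and the function names are assumptions made for illustration; the criterion is evaluated on the sample itself.

```python
import numpy as np


def fisher_direction(X1, X2):
    """Unit vector proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)


def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) for the projections y_n = w^T x_n."""
    y1, y2 = X1 @ w, X2 @ w
    s1_sq = np.sum((y1 - y1.mean()) ** 2)
    s2_sq = np.sum((y2 - y2.mean()) ** 2)
    return (y2.mean() - y1.mean()) ** 2 / (s1_sq + s2_sq)


rng = np.random.default_rng(0)
cov = np.array([[4.0, 3.5], [3.5, 4.0]])           # strongly correlated inputs
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([3.0, 0.0], cov, size=100)

w_fisher = fisher_direction(X1, X2)
diff = X2.mean(axis=0) - X1.mean(axis=0)
w_means = diff / np.linalg.norm(diff)               # "first try" direction

print("J(w_fisher) =", fisher_criterion(w_fisher, X1, X2))
print("J(w_means)  =", fisher_criterion(w_means, X1, X2))
```

Because `w_fisher` maximises $J$ on this sample, its value is never smaller than that of `w_means`; how large the gap is depends on how far the class covariance is from isotropic.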
29.
Fisher's Discriminant For Multi-Class
Assume that the dimensionality of the input space $D$ is greater than the number of classes $K$.
Use $D' > 1$ linear 'features' $y_k = \mathbf{w}_k^T \mathbf{x}$ and write everything in vector form (no bias involved!),
$$\mathbf{y} = \mathbf{W}^T \mathbf{x}.$$
The within-class covariance is then the sum of the covariances for all $K$ classes,
$$\mathbf{S}_W = \sum_{k=1}^{K} \mathbf{S}_k,$$
where
$$\mathbf{S}_k = \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T, \qquad \mathbf{m}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} \mathbf{x}_n.$$
30.
Fisher's Discriminant For Multi-Class
Between-class covariance
$$\mathbf{S}_B = \sum_{k=1}^{K} (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T,$$
where $\mathbf{m}$ is the total mean of the input data
$$\mathbf{m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n.$$
One possible way to define a function of $\mathbf{W}$ which is large when the between-class covariance is large and the within-class covariance is small is given by
$$J(\mathbf{W}) = \operatorname{tr}\left\{ (\mathbf{W}^T \mathbf{S}_W \mathbf{W})^{-1} (\mathbf{W}^T \mathbf{S}_B \mathbf{W}) \right\},$$
where the columns of $\mathbf{W}$ are the weight vectors $\mathbf{w}_k$, so that $\mathbf{W}^T \mathbf{S}_W \mathbf{W}$ and $\mathbf{W}^T \mathbf{S}_B \mathbf{W}$ are the covariances in the projected space.
The maximum of $J(\mathbf{W})$ is determined by the $D'$ eigenvectors of $\mathbf{S}_W^{-1} \mathbf{S}_B$ with the largest eigenvalues.
31.
Fisher's Discriminant For Multi-Class
How many linear 'features' can one find with this method?
$\mathbf{S}_B$ is of rank at most $K - 1$, because it is a sum of $K$ rank-one matrices that are linked by one global constraint via $\mathbf{m}$.
Projection onto the subspace spanned by $\mathbf{S}_B$ cannot yield more than $K - 1$ linear features.
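The rank bound can be checked numerically. The NumPy sketch below builds $\mathbf{S}_W$ and $\mathbf{S}_B$ as defined above for synthetic data with $K = 3$ classes in $D = 5$ dimensions (data, seed and variable names are assumptions for illustration) and counts the non-negligible eigenvalues of $\mathbf{S}_W^{-1} \mathbf{S}_B$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, n_per_class = 3, 5, 30

# Synthetic data (an assumption for illustration): one Gaussian blob per class.
class_data = [rng.normal(loc=rng.normal(scale=5.0, size=D), size=(n_per_class, D))
              for _ in range(K)]
X = np.vstack(class_data)
m = X.mean(axis=0)                                    # total mean
means = [Xk.mean(axis=0) for Xk in class_data]        # class means

S_W = sum((Xk - mk).T @ (Xk - mk) for Xk, mk in zip(class_data, means))
S_B = sum(np.outer(mk - m, mk - m) for mk in means)   # no class-size weights, as above

rank_S_B = int(np.linalg.matrix_rank(S_B))

# Eigenvalues of S_W^{-1} S_B, computed through the symmetric matrix
# L^{-1} S_B L^{-T} (with S_W = L L^T), which has the same eigenvalues.
L = np.linalg.cholesky(S_W)
A = np.linalg.solve(L, np.linalg.solve(L, S_B).T)
eigvals = np.linalg.eigvalsh((A + A.T) / 2)
n_nonzero = int(np.sum(eigvals > 1e-9 * eigvals.max()))

print("rank of S_B:", rank_S_B)
print("non-negligible eigenvalues:", n_nonzero)
```

Both numbers are bounded by $K - 1 = 2$ here, even though the input space has five dimensions.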
32.
The Perceptron Algorithm
Frank Rosenblatt (1928 - 1971)
"Principles of neurodynamics: Perceptrons and the theory of brain mechanisms" (Spartan Books, 1962)
33.
The Perceptron Algorithm
Perceptron ("MARK 1") was the first computer which could learn new skills by trial and error.
34.
The Perceptron Algorithm
Two-class model.
Create a feature vector $\boldsymbol{\phi}(\mathbf{x})$ by a fixed nonlinear transformation of the input $\mathbf{x}$.
Generalised linear model
$$y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))$$
with $\boldsymbol{\phi}(\mathbf{x})$ containing some bias element $\phi_0(\mathbf{x}) = 1$.
Nonlinear activation function
$$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$
Target coding for the perceptron
$$t = \begin{cases} +1, & \text{if } \mathcal{C}_1 \\ -1, & \text{if } \mathcal{C}_2 \end{cases}$$
35.
The Perceptron Algorithm - Error Function
Idea: Minimise the total number of misclassified patterns.
Problem: As a function of $\mathbf{w}$, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: Using the $(-1, +1)$ target coding scheme, we want all patterns to satisfy $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n > 0$.
Perceptron criterion: Add the errors for all patterns belonging to the set of misclassified patterns $\mathcal{M}$,
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n.$$
36.
Perceptron - Stochastic Gradient Descent
Perceptron criterion (with notation $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$)
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}_n t_n$$
One iteration at step $\tau$:
(1) Choose a training pair $(\mathbf{x}_n, t_n)$.
(2) If the pattern is misclassified, update the weight vector $\mathbf{w}$ by
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta \boldsymbol{\phi}_n t_n.$$
As $y(\mathbf{x}, \mathbf{w})$ does not depend on the norm of $\mathbf{w}$, one can set $\eta = 1$:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \boldsymbol{\phi}_n t_n.$$
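The update rule translates directly into a short training loop. The sketch below is one possible reading under stated assumptions: the feature map `phi` is just a bias element plus the raw inputs, patterns are visited cyclically, training stops after a full pass without updates, and a pattern with $\mathbf{w}^T \boldsymbol{\phi}_n t_n = 0$ is treated as misclassified so that learning can start from $\mathbf{w} = \mathbf{0}$. The toy data set (a logical AND) is linearly separable.

```python
def phi(x):
    """Hypothetical feature map: bias element phi_0 = 1, then the raw inputs."""
    return [1.0] + list(x)


def activation(w, x):
    return sum(wi * pi for wi, pi in zip(w, phi(x)))


def predict(w, x):
    """f(a) = +1 for a >= 0, -1 otherwise."""
    return 1 if activation(w, x) >= 0 else -1


def train_perceptron(data, max_epochs=100):
    """Cycle through the data, applying w <- w + phi_n * t_n (eta = 1) to misclassified patterns."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        updated = False
        for x, t in data:
            if activation(w, x) * t <= 0:          # w^T phi_n t_n > 0 is violated
                w = [wi + pi * t for wi, pi in zip(w, phi(x))]
                updated = True
        if not updated:                            # a full pass without errors
            break
    return w


# Linearly separable toy data: class C1 (t = +1) only for the input (1, 1).
data = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([1.0, 0.0], -1), ([1.0, 1.0], 1)]
w = train_perceptron(data)
print(w)
print([predict(w, x) == t for x, t in data])   # all True after convergence
```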
37.
The Perceptron Algorithm - Update 1
Update of the perceptron weights from a misclassified pattern (green):
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \boldsymbol{\phi}_n t_n$$
Figure (two panels): Decision boundary before and after the update.
38.
The Perceptron Algorithm - Update 2
Update of the perceptron weights from a misclassified pattern (green):
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \boldsymbol{\phi}_n t_n$$
Figure (two panels): Decision boundary before and after the update.
39.
The Perceptron Algorithm - Convergence
Does the algorithm converge?
For a single update step
$$-\mathbf{w}^{(\tau+1)T} \boldsymbol{\phi}_n t_n = -\mathbf{w}^{(\tau)T} \boldsymbol{\phi}_n t_n - (\boldsymbol{\phi}_n t_n)^T \boldsymbol{\phi}_n t_n < -\mathbf{w}^{(\tau)T} \boldsymbol{\phi}_n t_n$$
because $(\boldsymbol{\phi}_n t_n)^T \boldsymbol{\phi}_n t_n = \|\boldsymbol{\phi}_n t_n\|^2 > 0$.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: If the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.
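Both halves of this argument can be seen in a tiny numeric example. The weights and the two patterns below are hypothetical values chosen for illustration: the update lowers the error contribution of the pattern it was computed from by exactly $\|\boldsymbol{\phi}_n\|^2$, while a second pattern that was correctly classified becomes misclassified.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def contribution(w, phi, t):
    """-w^T phi t: positive when the pattern is misclassified."""
    return -dot(w, phi) * t


w = [0.5, -1.0, 0.0]
phi_n, t_n = [1.0, 2.0, 0.0], 1     # misclassified: w^T phi_n t_n = -1.5
phi_m, t_m = [1.0, -2.0, 0.0], 1    # correctly classified: w^T phi_m t_m = 2.5

w_new = [wi + pi * t_n for wi, pi in zip(w, phi_n)]   # update with eta = 1

before = contribution(w, phi_n, t_n)
after = contribution(w_new, phi_n, t_n)
print(before, after)                     # 1.5 -3.5
print(before - after, dot(phi_n, phi_n)) # 5.0 5.0

# The other pattern: correct before the update, misclassified afterwards.
print(contribution(w, phi_m, t_m), contribution(w_new, phi_m, t_m))  # -2.5 0.5
```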