2013-1 Machine Learning Lecture 05 - Vasant Honavar - Support Vector Machines
Document Transcript

• 1 MACHINE LEARNING
  Vasant Honavar
  Artificial Intelligence Research Laboratory, Department of Computer Science
  Bioinformatics and Computational Biology Program
  Center for Computational Intelligence, Learning, & Discovery
  Iowa State University
  honavar@cs.iastate.edu, www.cs.iastate.edu/~honavar/, www.cild.iastate.edu/
  Copyright Vasant Honavar, 2006.

  Maximum margin classifiers
  • Discriminative classifiers that maximize the margin of separation
  • Support Vector Machines
    – Feature spaces
    – Kernel machines
    – VC theory and generalization bounds
    – Maximum margin classifiers
  • Comparisons with conditionally trained discriminative classifiers
• 2 Perceptrons Revisited
  Perceptrons
  • Can only compute threshold functions
  • Can only represent linear decision surfaces
  • Can only learn to classify linearly separable training data
  How can we deal with data that are not linearly separable?
  • Map the data into a (typically higher-dimensional) feature space where the classes become separable
  Two problems must be solved:
  • The computational problem of working with a high-dimensional feature space
  • The overfitting problem in high-dimensional feature spaces

  Maximum Margin Model
  We cannot outperform the Bayes optimal classifier when
  • The assumed generative model is correct
  • The data set is large enough to ensure reliable estimation of the model parameters
  But discriminative models may be better than generative models because
  • The correct generative model is seldom known
  • The data set is often simply not large enough
  Maximum margin classifiers are a kind of discriminative classifier designed to circumvent the overfitting problem.
• 3 Extending Linear Classifiers – Feature Spaces
  Map the data into a feature space where they are linearly separable:
  $x \mapsto \varphi(x)$
  [Figure: input space X mapped to feature space Φ]

  Linear Separability in Feature Spaces
  The original input space can always be mapped to some higher-dimensional feature space where the training data become separable:
  $\Phi : x \mapsto \varphi(x)$
• 4 Learning in the Feature Spaces
  High-dimensional feature spaces
  $x = (x_1, \ldots, x_n) \mapsto \varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_d(x))$
  where typically d >> n, solve the problem of expressing complex functions.
  But this introduces
  • a computational problem (working with very large vectors)
  • a generalization problem (curse of dimensionality)
  SVMs offer an elegant solution to both problems.

  The Perceptron Algorithm Revisited
  • The perceptron works by adding misclassified positive examples to, or subtracting misclassified negative examples from, an arbitrary initial weight vector, which (without loss of generality) we assume to be the zero vector.
  • The final weight vector is therefore a linear combination of training points:
    $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
    where, since the sign of the coefficient of $x_i$ is given by the label $y_i$, the $\alpha_i$ are positive values proportional to the number of times misclassification of $x_i$ has caused the weight to be updated. $\alpha_i$ is called the embedding strength of the pattern $x_i$.
• 5 Dual Representation
  The decision function can be rewritten as
  $h(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j, x \rangle + b\right)$
  The update rule becomes: on training example $(x_i, y_i)$,
  if $y_i \left( \sum_{j=1}^{l} \alpha_j y_j \langle x_j, x_i \rangle + b \right) \le 0$ then $\alpha_i \leftarrow \alpha_i + \eta$
  WLOG, we can take $\eta = 1$.

  Implications of Dual Representation
  When linear learning machines are represented in the dual form
  $h(x_i) = \mathrm{sgn}(\langle w, x_i \rangle + b) = \mathrm{sgn}\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j, x_i \rangle + b\right)$
  the data appear only inside dot products (both in the decision function and in the training algorithm).
  The matrix $G = \left(\langle x_i, x_j \rangle\right)_{i,j=1}^{l}$ of pairwise dot products between training samples is called the Gram matrix.
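  As a concrete illustration of the dual representation above, here is a minimal NumPy sketch (Python and NumPy are assumptions of this note, not part of the original slides; the toy data is hypothetical) that computes the Gram matrix of a small training set and evaluates the dual decision function for given coefficients α.

```python
import numpy as np

# Toy training set: l points in R^2 with labels in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Gram matrix G[i, j] = <x_i, x_j>
G = X @ X.T

# Dual decision function h(x) = sgn(sum_j alpha_j y_j <x_j, x> + b)
def h(x, alpha, b):
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

alpha = np.ones(len(X))   # e.g., embedding strengths after perceptron training
b = 0.0
print(G)
print([h(x, alpha, b) for x in X])
```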
• 6 Implicit Mapping to Feature Space
  Kernel machines
  • Solve the computational problem of working with many dimensions
  • Can make it possible to use infinite dimensions efficiently
  • Offer other advantages, both practical and conceptual

  Kernel-Induced Feature Spaces
  $\varphi : X \to \Phi$ is a non-linear map from the input space to the feature space:
  $f(x_i) = \langle w, \varphi(x_i) \rangle + b, \qquad h(x_i) = \mathrm{sgn}(f(x_i))$
  In the dual representation, the data points appear only inside dot products:
  $h(x_i) = \mathrm{sgn}(\langle w, \varphi(x_i) \rangle + b) = \mathrm{sgn}\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle \varphi(x_j), \varphi(x_i) \rangle + b\right)$
• 7 Kernels
  • A kernel function returns the value of the dot product between the images of its two arguments:
    $K(x_1, x_2) = \langle \varphi(x_1), \varphi(x_2) \rangle$
  • When using kernels, the dimensionality of the feature space Φ is not necessarily important because of the special properties of kernel functions
  • Given a function K, it is possible to verify that it is a kernel (we will return to this later)

  Kernel Machines
  We can use the perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels:
  $\langle x_1, x_2 \rangle \leftarrow K(x_1, x_2) = \langle \varphi(x_1), \varphi(x_2) \rangle$
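  The following is a minimal sketch of the kernelized (dual) perceptron described above, with η = 1 and the bias held at 0 as in the dual update rule; the function name, the toy XOR data, and the choice of a degree-2 polynomial kernel are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Dual (kernelized) perceptron: the update rule above with eta = 1, bias held at 0."""
    l = len(y)
    Kmat = np.array([[K(X[i], X[j]) for j in range(l)] for i in range(l)])
    alpha, b = np.zeros(l), 0.0
    for _ in range(epochs):
        for i in range(l):
            # Mistake (or zero margin): increase the embedding strength of x_i
            if y[i] * (np.sum(alpha * y * Kmat[:, i]) + b) <= 0:
                alpha[i] += 1.0
    return alpha, b

# XOR becomes separable under the degree-2 polynomial kernel
poly2 = lambda u, v: (1.0 + np.dot(u, v)) ** 2
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha, b = kernel_perceptron(X, y, poly2)
print(alpha)   # each point is updated once, after which all four are classified correctly
```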
• 8 The Kernel Matrix
  $K = \begin{pmatrix} K(1,1) & K(1,2) & K(1,3) & \cdots & K(1,l) \\ K(2,1) & K(2,2) & K(2,3) & \cdots & K(2,l) \\ \vdots & & & & \vdots \\ K(l,1) & K(l,2) & K(l,3) & \cdots & K(l,l) \end{pmatrix}$
  The kernel matrix is the Gram matrix in the feature space Φ (the matrix of pairwise dot products between the feature vectors corresponding to the training samples).

  Properties of Kernel Matrices
  It is easy to show that the Gram matrix (and hence the kernel matrix)
  $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$
  is
  • a square matrix
  • symmetric ($K^T = K$)
  • positive semi-definite (all eigenvalues of K are non-negative; recall that the eigenvalues of a square matrix A are the values of λ that satisfy $|A - \lambda I| = 0$)
  Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space Φ.
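  A small Python check of the three properties listed above (the helper name and the Gaussian-kernel test data are illustrative assumptions): square, symmetric, and all eigenvalues non-negative.

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check the properties above: square, symmetric, eigenvalues >= 0."""
    K = np.asarray(K, dtype=float)
    square = K.ndim == 2 and K.shape[0] == K.shape[1]
    symmetric = square and np.allclose(K, K.T)
    psd = symmetric and np.all(np.linalg.eigvalsh(K) >= -tol)
    return square and symmetric and psd

# A Gaussian (RBF) kernel matrix on random points is symmetric positive semi-definite
X = np.random.default_rng(0).normal(size=(5, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)
print(is_valid_kernel_matrix(K))   # True
```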
• 9 Mercer's Theorem: Characterization of Kernel Functions
  A function $K : X \times X \to \mathbb{R}$ is said to be (finitely) positive semi-definite if
  • K is a symmetric function: K(x, y) = K(y, x)
  • The matrices formed by restricting K to any finite subset of the space X are positive semi-definite
  Characterization of kernel functions: every (finitely) positive semi-definite, symmetric function is a kernel, i.e., there exists a mapping φ such that it is possible to write
  $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$

  Properties of Kernel Functions – Mercer's Theorem
  The eigenvalues of the Gram matrix define an expansion of the kernel:
  $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle = \sum_{l} \lambda_l \varphi_l(x_i)\, \varphi_l(x_j)$
  That is, the eigenvalues act as features!
  (Recall that the eigenvalues of a square matrix A correspond to solutions of $|A - \lambda I| = 0$, where I is the identity matrix.)
• 10 Examples of Kernels
  Simple examples of kernels:
  $K(x, z) = \langle x, z \rangle^d$
  $K(x, z) = e^{-\|x - z\|^2 / \sigma}$

  Example: Polynomial Kernels
  Let $x = (x_1, x_2)$ and $z = (z_1, z_2)$. Then
  $\langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
   = \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2),\, (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle
   = \langle \varphi(x), \varphi(z) \rangle$
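  A quick numerical confirmation of the identity above, comparing the degree-2 polynomial kernel with an explicit dot product in the feature space (the function names and sample points are illustrative assumptions):

```python
import numpy as np

def poly2_kernel(x, z):
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in R^2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([3.0, -1.0])
z = np.array([0.5, 2.0])
print(poly2_kernel(x, z), np.dot(phi(x), phi(z)))   # identical values (0.25, 0.25)
```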
• 11 Making Kernels – Closure Properties
  The set of kernels is closed under some operations. If $K_1, K_2$ are kernels over $X \times X$, $a > 0$, $f : X \to \mathbb{R}$, $\varphi : X \to \mathbb{R}^N$ with $K_3$ a kernel over $\mathbb{R}^N \times \mathbb{R}^N$, and B is a symmetric positive semi-definite $n \times n$ matrix (with $X \subseteq \mathbb{R}^n$), then the following are all kernels:
  • $K(x, z) = K_1(x, z) + K_2(x, z)$
  • $K(x, z) = a\,K_1(x, z)$
  • $K(x, z) = K_1(x, z)\,K_2(x, z)$
  • $K(x, z) = f(x)\,f(z)$
  • $K(x, z) = K_3(\varphi(x), \varphi(z))$
  • $K(x, z) = x^T B z$
  We can make complex kernels from simple ones: modularity!

  Kernels
  • We can define kernels over arbitrary instance spaces, including
    – finite-dimensional vector spaces
    – Boolean spaces
    – ∑* where ∑ is a finite alphabet
    – documents, graphs, etc.
  • Kernels need not always be expressed by a closed-form formula.
  • Many useful kernels can be computed by complex algorithms (e.g., diffusion kernels over graphs).
• 12 Kernels on Sets and Multi-Sets
  Let V be some fixed domain and let $X = 2^V$ (the set of subsets of V). For $A_1, A_2 \in X$,
  $K(A_1, A_2) = 2^{|A_1 \cap A_2|}$
  is a kernel.
  Exercise: Define a kernel on the space of multi-sets whose elements are drawn from a finite domain V.

  String Kernel (p-spectrum Kernel)
  • The p-spectrum of a string is the histogram – the vector of numbers of occurrences of all possible contiguous substrings of length p.
  • We can define a kernel function K(s, t) over ∑* × ∑* as the inner product of the p-spectra of s and t.
  Example (p = 3): s = "statistics", t = "computation". The common length-3 substrings are "tat" and "ati", so K(s, t) = 2.
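  The p-spectrum kernel above is easy to implement directly; the following short sketch (function names are mine) computes it and reproduces the example value on the two strings.

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def p_spectrum_kernel(s, t, p):
    """Inner product of the p-spectra of s and t."""
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[u] * ht[u] for u in hs)

print(p_spectrum_kernel("statistics", "computation", 3))   # 2 ("tat" and "ati")
```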
• 13 Kernel Machines
  Kernel machines are linear learning machines that
  • use a dual representation, and
  • operate in a kernel-induced feature space (that is, they compute a linear function in the feature space implicitly defined by the Gram matrix corresponding to the data set):
  $h(x_i) = \mathrm{sgn}(\langle w, \varphi(x_i) \rangle + b) = \mathrm{sgn}\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle \varphi(x_j), \varphi(x_i) \rangle + b\right)$

  Kernels – the Good, the Bad, and the Ugly
  Bad kernel – a kernel whose Gram (kernel) matrix is mostly diagonal: all data points are orthogonal to each other, and hence the machine is unable to detect hidden structure in the data.
  [Example: a Gram matrix with 1s on the diagonal and 0s elsewhere]
• 14 Kernels – the Good, the Bad, and the Ugly
  Good kernel – corresponds to a Gram (kernel) matrix in which subsets of data points belonging to the same class are similar to each other, and hence the machine can detect hidden structure in the data.
  [Example: a block-structured Gram matrix with large entries within the Class 1 and Class 2 blocks and zeros between the blocks]

  • If the mapping takes the data into a space with too many irrelevant features, the kernel matrix becomes diagonal.
  • Some prior knowledge of the target is needed to choose a good kernel.
• 15 Learning in the Feature Spaces
  High-dimensional feature spaces
  $x = (x_1, \ldots, x_n) \mapsto \varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_d(x))$
  where typically d >> n, solve the problem of expressing complex functions.
  But this introduces
  • a computational problem (working with very large vectors) – solved by the kernel trick: implicit computation of dot products in kernel-induced feature spaces via dot products in the input space
  • a generalization problem (curse of dimensionality)

  The Generalization Problem
  • The curse of dimensionality: it is easy to overfit in high-dimensional spaces
  • The learning problem is ill-posed: there are infinitely many hyperplanes that separate the training data
  • We need a principled approach to choose an optimal hyperplane
• 16 The Generalization Problem
  "Capacity" of the machine – its ability to learn any training set without error – is related to the VC dimension.
  Excellent memory is not an asset when it comes to learning from limited data.
  "A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree." – C. Burges

  History of Key Developments Leading to SVM
  1958 Perceptron (Rosenblatt)
  1963 Margin (Vapnik)
  1964 Kernel trick (Aizerman)
  1965 Optimization formulation (Mangasarian)
  1971 Kernels (Wahba)
  1992–1994 SVM (Vapnik)
  1996–present Rapid growth, numerous applications, extensions to other problems
• 17 A Little Learning Theory
  Suppose:
  • We are given l training examples $(x_i, y_i)$.
  • Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution D(x, y).
  • The machine learns the mapping $x_i \mapsto y_i$ and outputs a hypothesis $h(x, \alpha, b)$. A particular choice of $(\alpha, b)$ generates a "trained machine".
  • The expectation of the test error, or expected risk, is
    $R(\alpha, b) = \int \tfrac{1}{2}\, |y - h(x, \alpha, b)| \, dD(x, y)$

  A Bound on the Generalization Performance
  The empirical risk is
  $R_{emp}(\alpha, b) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - h(x_i, \alpha, b)|$
  Choose some δ such that 0 < δ < 1. With probability 1 − δ, the following bound – the risk bound of h(x, α) under distribution D – holds (Vapnik, 1995):
  $R(\alpha, b) \le R_{emp}(\alpha, b) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}}$
  where $d \ge 0$ is the VC dimension, a measure of the "capacity" of the machine.
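  To make the bound concrete, here is a small helper (not part of the slides; the function name is mine) that evaluates the second term of the bound – the VC confidence – for a few values of d at a fixed sample size l and δ = 0.05, showing how the bound loosens as d/l grows.

```python
import numpy as np

def vc_confidence(d, l, delta=0.05):
    """VC confidence term of the risk bound: sqrt((d(log(2l/d)+1) - log(delta/4)) / l)."""
    return np.sqrt((d * (np.log(2 * l / d) + 1) - np.log(delta / 4)) / l)

# The term grows with d/l: the same training-set size supports only limited capacity.
for d in (10, 100, 1000):
    print(d, round(vc_confidence(d, l=10000), 3))
```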
• 18 A Bound on the Generalization Performance
  The second term on the right-hand side is called the VC confidence.
  Three key points about the risk bound:
  • It is independent of D(x, y)
  • It is usually not possible to compute the left-hand side
  • If we know d, we can compute the right-hand side
  The risk bound gives us a way to compare learning machines!

  The VC Dimension
  Definition: the VC dimension of a set of functions $H = \{h(x, \alpha, b)\}$ is d if and only if there exists a set of d data points such that each of the $2^d$ possible labelings (dichotomies) of these d data points can be realized using some member of H, but no set of q > d points satisfies this property.
• 19 The VC Dimension
  The VC dimension of H is the size of the largest subset of X shattered by H. The VC dimension measures the capacity of a set H of hypotheses (functions).
  If for any number N it is possible to find N points $x_1, \ldots, x_N$ that can be separated in all $2^N$ possible ways, we say that the VC dimension of the set is infinite.

  The VC Dimension – Example
  • Suppose that the data live in $\mathbb{R}^2$ and the set $\{h(x, \alpha)\}$ consists of oriented straight lines (linear discriminants).
  • It is possible to find three points that can be shattered by this set of functions.
  • It is not possible to find four.
  • So the VC dimension of the set of linear discriminants in $\mathbb{R}^2$ is three.
• 20 The VC Dimension of Hyperplanes
  Theorem 1: Consider some set of m points in $\mathbb{R}^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
  Corollary: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n + 1, since we can always choose n + 1 points, and then choose one of them as origin, such that the position vectors of the remaining points are linearly independent, but we can never do this for n + 2 points.

  VC Dimension
  The VC dimension can be infinite even when the number of parameters of the set $\{h(x, \alpha)\}$ of hypothesis functions is low.
  Example: $h(x, \alpha) \equiv \mathrm{sgn}(\sin(\alpha x)), \; x, \alpha \in \mathbb{R}$.
  For any integer l and any labels $y_1, \ldots, y_l$, $y_i \in \{-1, 1\}$, we can find l points $x_1, \ldots, x_l$ and a parameter α such that those points are shattered by $h(x, \alpha)$. The points are
  $x_i = 10^{-i}, \; i = 1, \ldots, l$
  and the parameter is
  $\alpha = \pi \left(1 + \sum_{i=1}^{l} \frac{(1 - y_i)\,10^{i}}{2}\right)$
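  A quick numerical check (not on the slides; small l is used to stay within floating-point precision) that the construction above reproduces arbitrary labelings of the points $x_i = 10^{-i}$ with the single parameter α:

```python
import numpy as np

def shattering_alpha(y):
    """Alpha from the formula above, for labels y in {-1, +1}."""
    i = np.arange(1, len(y) + 1)
    return np.pi * (1 + np.sum((1 - y) * 10.0 ** i / 2))

l = 4
x = 10.0 ** -np.arange(1, l + 1)
for y in (np.array([1, -1, 1, -1]), np.array([-1, -1, 1, 1])):
    a = shattering_alpha(y)
    print(np.all(np.sign(np.sin(a * x)) == y))   # True for every labeling tried
```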
• 21 Vapnik-Chervonenkis (VC) Dimension
  • Let H be a hypothesis class over an instance space X; both H and X may be infinite.
  • We need a way to describe the behavior of H on a finite set of points $S \subseteq X$.
  • For any concept class H over X and any $S = \{X_1, X_2, \ldots, X_m\} \subseteq X$,
    $\Pi_H(S) = \{ h \cap S : h \in H \}$
  • Equivalently, with a slight abuse of notation, we can write
    $\Pi_H(S) = \{ (h(X_1), \ldots, h(X_m)) : h \in H \}$
  • $\Pi_H(S)$ is the set of all dichotomies or behaviors on S that are induced or realized by H.

  Vapnik-Chervonenkis (VC) Dimension
  • If $\Pi_H(S) = \{0, 1\}^m$ where $|S| = m$, or equivalently $|\Pi_H(S)| = 2^m$, we say that S is shattered by H.
  • A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S there exists a hypothesis in H that is consistent with the dichotomy.
• 22 VC Dimension of a Hypothesis Class
  Definition: The VC dimension V(H) of a hypothesis class H defined over an instance space X is the cardinality d of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, V(H) = ∞.
  How can we show that V(H) is at least d?
  • Find a set of cardinality at least d that is shattered by H.
  How can we show that V(H) = d?
  • Show that V(H) is at least d and that no set of cardinality d + 1 can be shattered by H.

  VC Dimension of a Hypothesis Class – Examples
  • Example: Let the instance space X be the 2-dimensional Euclidean space, and let the hypothesis space H be the set of axis-parallel rectangles in the plane.
  • V(H) = 4: there exists a set of 4 points that can be shattered by axis-parallel rectangles, but no set of 5 points can be shattered by H.
• 23 Some Useful Properties of VC Dimension
  • If $H_1 \subseteq H_2$, then $V(H_1) \le V(H_2)$.
  • If H is a finite concept class, then $V(H) \le \lg |H|$.
  • If $H_1 = \{ X - h : h \in H \}$ (complements of concepts in H), then $V(H_1) = V(H)$.
  • If $H = H_1 \cup H_2$, then $V(H) \le V(H_1) + V(H_2) + 1$.
  • If $H_l$ is formed by unions or intersections of l concepts from H, then $V(H_l) = O\big(V(H)\, l \lg l\big)$.
  • Define $\Pi_H(m) = \max \{ |\Pi_H(S)| : |S| = m \}$ and let $d = V(H)$. Then $\Pi_H(m) \le \Phi_d(m)$, where $\Phi_d(m) = 2^m$ if $m \le d$ and $\Phi_d(m) = O(m^d)$ if $m > d$.
  Proof: left as an exercise.

  Risk Bound
  What are the VC dimension and the empirical risk of the nearest-neighbor classifier?
  Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier, so d = ∞ and the empirical risk is 0.
  So the bound provides no useful information in this case.
• 24 Minimizing the Bound by Minimizing d
  $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}}, \qquad \delta = 0.05$
  • The VC confidence (second term) depends on d/l
  • One should choose the learning machine with minimal d
  • For large values of d/l the bound is not tight

  Bounds on Error of Classification
  The classification error
  • depends on the VC dimension of the hypothesis class
  • is independent of the dimensionality of the feature space
  Vapnik proved that the error ε of a classification function h for separable data sets is
  $\varepsilon = O\!\left(\frac{d}{l}\right)$
  where d is the VC dimension of the hypothesis class and l is the number of training examples.
• 25 Structural Risk Minimization
  Finding a learning machine with the minimum upper bound on the actual risk
  • leads us to a method of choosing an optimal machine for a given task
  • is the essential idea of structural risk minimization (SRM)
  Let $H_1 \subset H_2 \subset H_3 \subset \ldots$ be a sequence of nested subsets of hypotheses whose VC dimensions satisfy $d_1 < d_2 < d_3 < \ldots$
  • SRM consists of finding the subset of functions that minimizes the upper bound on the actual risk
  • SRM involves training a series of classifiers and choosing the classifier from the series for which the sum of empirical risk and VC confidence is minimal

  Margin-Based Bounds on Error of Classification
  Margin-based bound:
  $\varepsilon = O\!\left(\frac{1}{l}\left(\frac{L}{\gamma}\right)^2\right), \qquad
   \gamma = \min_i \frac{y_i f(x_i)}{\|w\|}, \qquad
   f(x) = \langle w, x \rangle + b, \qquad
   L = \max_i \|x_i\|$
  Important insight: the error of a classifier trained on a separable data set is inversely proportional to its margin, and is independent of the dimensionality of the input space!
• 26 Maximal Margin Classifier
  The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
  Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
  SVMs control capacity by increasing the margin, not by reducing the number of features.

  Margin
  [Figure: linear separation of the input space by the hyperplane defined by (w, b)]
  $f(x) = \langle w, x \rangle + b, \qquad h(x) = \mathrm{sign}(f(x))$
• 27 Functional and Geometric Margin
  • The functional margin of a linear discriminant (w, b) with respect to a labeled pattern $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$ is defined as
    $\gamma_i \equiv y_i (\langle w, x_i \rangle + b)$
  • If the functional margin is negative, the pattern is incorrectly classified; if it is positive, the classifier predicts the correct label.
  • The larger $|\gamma_i|$, the further away $x_i$ is from the discriminant.
  • This is made more precise by the notion of the geometric margin
    $\tilde{\gamma}_i \equiv \frac{\gamma_i}{\|w\|}$
    which measures the Euclidean distance of a point from the decision boundary.

  Geometric Margin
  [Figure: the geometric margins $\gamma_i$ and $\gamma_j$ of two points $X_i \in S^+$ and $X_j \in S^-$]
• 28 Geometric Margin – Example
  Let $w = (1, 1)^T$ and $b = -1$, and consider the point $x_i = (1, 1)$ with label $y_i = 1$ (a point of $S^+$; a second point $X_j \in S^-$ is shown in the figure).
  Functional margin: $\gamma_i = y_i(\langle w, x_i \rangle + b) = 1 \cdot (1 + 1 - 1) = 1$
  Since $\|w\| = \sqrt{2}$, the geometric margin is $\tilde{\gamma}_i = \gamma_i / \|w\| = 1/\sqrt{2}$.

  Margin of a Training Set
  The functional margin of a training set: $\gamma = \min_i \gamma_i$
  The geometric margin of a training set: $\tilde{\gamma} = \min_i \tilde{\gamma}_i$
• 29 Maximum Margin Separating Hyperplane
  $\gamma \equiv \min_i \gamma_i$ is called the (functional) margin of (w, b) with respect to the data set $S = \{(x_i, y_i)\}$.
  The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is a maximal margin hyperplane.

  VC Dimension of Δ-Margin Hyperplanes
  • If the data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
    $VC(H) \le \min\!\left(\frac{R^2}{\Delta^2},\, n\right) + 1$
  • WLOG, we can take R = 1 by normalizing the data points of each class so that they lie within a sphere of radius R.
  • For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality n of the feature space.
• 30 Maximal Margin Classifier
  The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
  Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
  SVMs control capacity by increasing the margin, not by reducing the number of features.

  Maximizing Margin = Minimizing ||w||
  The definition of the hyperplane (w, b) does not change if we rescale it to (σw, σb) for σ > 0.
  The functional margin γ depends on the scaling, but the geometric margin does not.
  If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to 1/||w||.
  We can therefore maximize the margin by minimizing the norm ||w||.
• 31 Maximizing Margin = Minimizing ||w||
  Let $x^+$ and $x^-$ be the nearest data points on either side of the hyperplane. Fixing their distance from the hyperplane by setting the functional margin to 1:
  $\langle w, x^+ \rangle + b = +1, \qquad \langle w, x^- \rangle + b = -1$
  Subtracting, $\langle w, x^+ - x^- \rangle = 2$, so the margin is
  $\gamma = \left\langle \frac{w}{\|w\|},\, x^+ - x^- \right\rangle = \frac{2}{\|w\|}$

  Learning as Optimization
  Minimize $\langle w, w \rangle$
  subject to $y_i(\langle w, x_i \rangle + b) \ge 1$ for all i
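  As a minimal sketch of the optimization problem just stated (cvxpy is an assumed dependency, not mentioned in the slides, and the separable toy data is hypothetical), the hard-margin primal can be solved directly:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]           # y_i(<w, x_i> + b) >= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)                # maximal margin hyperplane
print(2 / np.linalg.norm(w.value))     # margin 2 / ||w||
```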
• 32 Constrained Optimization
  Primal optimization problem: given functions f, $g_i$ (i = 1..k), $h_j$ (j = 1..m) defined on a domain $\Omega \subseteq \mathbb{R}^n$:
    minimize $f(w)$, $w \in \Omega$   (objective function)
    subject to $g_i(w) \le 0$, i = 1..k   (inequality constraints)
               $h_j(w) = 0$, j = 1..m   (equality constraints)
  Shorthand: $g(w) \le 0$ denotes $g_i(w) \le 0$, i = 1..k; $h(w) = 0$ denotes $h_j(w) = 0$, j = 1..m.
  The feasible region is $F = \{ w \in \Omega : g(w) \le 0,\ h(w) = 0 \}$.

  Optimization Problems
  • Linear program – the objective function as well as the equality and inequality constraints are linear
  • Quadratic program – the objective function is quadratic, and the equality and inequality constraints are linear
  • An inequality constraint $g_i(w) \le 0$ can be active (i.e., $g_i(w) = 0$) or inactive (i.e., $g_i(w) < 0$)
  • Inequality constraints are often transformed into equality constraints using slack variables: $g_i(w) \le 0 \iff g_i(w) + \xi_i = 0$ with $\xi_i \ge 0$
  • We will be interested primarily in convex optimization problems
• 33 Convex Optimization Problem
  If the function f is convex, any local minimum $w^*$ of an unconstrained optimization problem with objective function f is also a global minimum, since $f(w^*) \le f(u)$ for any $u \ne w^*$.
  A set $\Omega \subset \mathbb{R}^n$ is called convex if, for all $w, u \in \Omega$ and any $\theta \in (0, 1)$, the point $\theta w + (1 - \theta) u \in \Omega$.
  A convex optimization problem is one in which the set Ω, the objective function, and all the constraints are convex.

  Lagrangian Theory
  Given an optimization problem with an objective function f(w) and equality constraints $h_j(w) = 0$, j = 1..m, we define the Lagrangian as
  $L(w, \beta) = f(w) + \sum_{j=1}^{m} \beta_j h_j(w)$
  where the $\beta_j$ are called the Lagrange multipliers. The necessary conditions for $w^*$ to be a minimum of f(w) subject to the constraints $h_j(w) = 0$, j = 1..m, are
  $\frac{\partial L(w^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \beta^*)}{\partial \beta} = 0$
  The conditions are sufficient if $L(w, \beta^*)$ is a convex function of w.
• 34 Lagrangian Theory – Example
  Minimize $f(x, y) = x^2 + y^2$ subject to $x + 2y - 4 = 0$.
  $L(x, y, \lambda) = x^2 + y^2 + \lambda(x + 2y - 4)$
  $\frac{\partial L}{\partial x} = 2x + \lambda = 0, \qquad
   \frac{\partial L}{\partial y} = 2y + 2\lambda = 0, \qquad
   \frac{\partial L}{\partial \lambda} = x + 2y - 4 = 0$
  Solving the above gives $\lambda = -\tfrac{8}{5}$, and f(x, y) is minimized at $x = \tfrac{4}{5}$, $y = \tfrac{8}{5}$, where $f = \tfrac{16}{5}$.

  Lagrangian Theory – Example
  Find the lengths u, v, w of the sides of the box that has the largest volume for a given surface area c:
  minimize $-uvw$ subject to $uv + vw + wu = \tfrac{c}{2}$
  $L = -uvw + \beta\left(uv + vw + wu - \tfrac{c}{2}\right)$
  $\frac{\partial L}{\partial u} = -vw + \beta(v + w) = 0, \qquad
   \frac{\partial L}{\partial v} = -uw + \beta(u + w) = 0, \qquad
   \frac{\partial L}{\partial w} = -uv + \beta(u + v) = 0$
  Solving, $u = v = w = \sqrt{\tfrac{c}{6}}$.
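  A short symbolic check of the first worked example above, as reconstructed here (SymPy is an assumed dependency; the variable names are mine):

```python
import sympy as sp

# Minimize x^2 + y^2 subject to x + 2y - 4 = 0, via the Lagrangian stationarity conditions.
x, y, lam = sp.symbols('x y lam', real=True)
L = x**2 + y**2 + lam * (x + 2*y - 4)

stationarity = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(stationarity, [x, y, lam], dict=True)[0]
print(sol)                      # {x: 4/5, y: 8/5, lam: -8/5}
print((x**2 + y**2).subs(sol))  # minimum value 16/5
```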
• 35 Lagrangian Optimization – Example
  The entropy of a probability distribution $p = (p_1, \ldots, p_n)$ over a finite set {1, 2, ..., n} is defined as
  $H(p) = -\sum_{i=1}^{n} p_i \log_2 p_i$
  The maximum entropy distribution can be found by minimizing $-H(p)$ subject to the constraints
  $\sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0 \;\; \forall i$

  Generalized Lagrangian Theory
  Given an optimization problem with domain $\Omega \subseteq \mathbb{R}^n$:
    minimize $f(w)$, $w \in \Omega$   (objective function)
    subject to $g_i(w) \le 0$, i = 1..k   (inequality constraints)
               $h_j(w) = 0$, j = 1..m   (equality constraints)
  where f is convex and the $g_i$ and $h_j$ are affine, we can define the generalized Lagrangian function as
  $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{m} \beta_j h_j(w)$
  An affine function is a linear function plus a translation: F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant.
• 36 Generalized Lagrangian Theory: KKT Conditions
  Given an optimization problem with domain $\Omega \subseteq \mathbb{R}^n$,
    minimize $f(w)$, $w \in \Omega$, subject to $g_i(w) \le 0$ (i = 1..k) and $h_j(w) = 0$ (j = 1..m),
  where f is convex and the $g_i$ and $h_j$ are affine, the necessary and sufficient conditions for $w^*$ to be an optimum are the existence of $\alpha^*$ and $\beta^*$ such that
  $\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial w} = 0, \qquad
   \frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta} = 0$
  $\alpha_i^*\, g_i(w^*) = 0, \qquad g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0, \qquad i = 1..k$

  KKT Conditions – Worked Example
  A quadratic objective is minimized subject to two linear inequality constraints with multipliers $\lambda_1, \lambda_2$. The stationarity conditions $\partial L/\partial x = 0$ and $\partial L/\partial y = 0$ are combined with the complementarity conditions $\lambda_1 g_1 = 0$ and $\lambda_2 g_2 = 0$, and the four cases ($\lambda_1 = 0$ or $\lambda_1 \ne 0$, $\lambda_2 = 0$ or $\lambda_2 \ne 0$) are examined in turn. Three of the cases either violate a constraint or yield no feasible solution; the remaining case satisfies all of the KKT conditions and gives the solution.
• 37 Dual Optimization Problem
  A primal problem can be transformed into a dual by setting to zero the derivatives of the Lagrangian with respect to the primal variables and substituting the results back into the Lagrangian to remove the dependence on the primal variables, resulting in
  $\theta(\alpha, \beta) = \inf_{w \in \Omega} L(w, \alpha, \beta)$
  which contains only the dual variables and needs to be maximized under simpler constraints.
  (The infimum of a set S of real numbers, inf(S), is the largest real number that is smaller than or equal to every number in S; if no such number exists, inf(S) = −∞.)

  Maximum Margin Hyperplane
  The problem of finding the maximal margin hyperplane is a constrained optimization problem.
  Use Lagrangian theory (extended by Karush, Kuhn, and Tucker – KKT).
  Lagrangian:
  $L_p = \frac{1}{2}\langle w, w \rangle - \sum_i \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right], \qquad \alpha_i \ge 0$
• 38 From Primal to Dual
  Minimize $L_p(w)$ with respect to (w, b), requiring that the derivatives of $L_p(w)$ with respect to all the $\alpha_i$ vanish, subject to the constraints $\alpha_i \ge 0$.
  Differentiating $L_p(w)$:
  $\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad
   \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
  Substituting these equality constraints back into $L_p(w)$, we obtain the dual problem.

  The Dual Problem
  Maximize
  $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  subject to
  $\alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0$
  b does not appear in the dual and has to be solved for using the primal constraints:
  $b = -\frac{\max_{y_i = -1}(\langle w, x_i \rangle) + \min_{y_i = 1}(\langle w, x_i \rangle)}{2}$
• 39 Karush-Kuhn-Tucker Conditions
  Solving for the SVM is equivalent to finding a solution to the KKT conditions:
  $L_p = \frac{1}{2}\langle w, w \rangle - \sum_i \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right]$
  $\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad
   \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
  $y_i(\langle w, x_i \rangle + b) - 1 \ge 0, \qquad \alpha_i \ge 0, \qquad
   \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right] = 0$

  Karush-Kuhn-Tucker Conditions for SVM
  The KKT conditions state that optimal solutions $\alpha_i$, (w, b) must satisfy
  $\alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right] = 0$
  Only the training samples $x_i$ for which the functional margin equals 1 can have nonzero $\alpha_i$; they are called support vectors.
  The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors:
  $f(x, \alpha, b) = \sum_{i=1}^{l} \alpha_i y_i \langle x_i, x \rangle + b = \sum_{i \in \mathrm{SV}} \alpha_i y_i \langle x_i, x \rangle + b$
• 40 Margin of the Maximal Margin Hyperplane
  Because of the KKT conditions ($\sum_j \alpha_j y_j = 0$, and $y_j(\langle w, x_j \rangle + b) = 1$ for $j \in$ SV),
  $\langle w, w \rangle = \sum_{i \in \mathrm{SV}} \sum_{j \in \mathrm{SV}} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
   = \sum_{j \in \mathrm{SV}} \alpha_j y_j \langle w, x_j \rangle
   = \sum_{j \in \mathrm{SV}} \alpha_j y_j \left( \langle w, x_j \rangle + b \right)
   = \sum_{j \in \mathrm{SV}} \alpha_j$
  So we have
  $\gamma = \frac{1}{\|w\|} = \left( \sum_{i \in \mathrm{SV}} \alpha_i \right)^{-1/2}$

  Support Vector Machines Yield Sparse Solutions
  [Figure: maximal margin hyperplane with margin γ; only the support vectors lie on the margin]
• 41 Example: XOR
  Training set:
    x = (−1, −1), y = −1
    x = (−1, +1), y = +1
    x = (+1, −1), y = +1
    x = (+1, +1), y = −1
  Kernel: $K(x_i, x_j) = (1 + x_i^T x_j)^2$, with feature map
  $\varphi(x) = \left[ 1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2 \right]^T$
  Kernel matrix:
  $K = \begin{pmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{pmatrix}$
  Dual objective:
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

  Example: XOR (continued)
  Setting the derivatives of Q with respect to the dual variables to 0 gives four linear equations; solving them yields
  $\alpha_{o,1} = \alpha_{o,2} = \alpha_{o,3} = \alpha_{o,4} = \tfrac{1}{8}, \qquad Q(\alpha_o) = \tfrac{1}{4}$
  Not surprisingly, each of the training samples is a support vector.
  $w_o = \sum_{i=1}^{N} \alpha_{o,i}\, y_i\, \varphi(x_i) = \left[ 0,\; 0,\; -\tfrac{1}{\sqrt{2}},\; 0,\; 0,\; 0 \right]^T$
  so the resulting decision function is $f(x) = -x_1 x_2$.
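  A quick numerical check (not on the slides) that the XOR solution above is consistent: α = 1/8 satisfies the dual stationarity and equality constraints, and the resulting dual decision function (with b = 0 by symmetry) reproduces the XOR labels.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)
K = (1 + X @ X.T) ** 2                           # kernel matrix: 9 on the diagonal, 1 elsewhere

alpha = np.full(4, 1 / 8)
print(np.allclose(y * (K @ (alpha * y)), 1))     # stationarity: y_i * sum_j alpha_j y_j K_ij = 1
print(np.isclose(np.sum(alpha * y), 0))          # equality constraint: sum_i alpha_i y_i = 0

def f(x):
    # Dual decision function with b = 0 for this symmetric problem
    return np.sum(alpha * y * (1 + X @ x) ** 2)

print([np.sign(f(x)) for x in X])                # matches the labels, i.e. f(x) = -x1*x2
```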
• 42 Example – Two Spirals
  [Figure: the two-spirals data set, separable with a suitable kernel]

  Summary of the Maximal Margin Classifier
  • Good generalization when the training data is noise-free and separable in the kernel-induced feature space
  • Excessively large Lagrange multipliers often signal outliers – data points that are most difficult to classify – so the SVM can be used for identifying outliers
  • Focuses on boundary points. If the boundary points are mislabeled (noisy training data):
    – the result is a terrible classifier if the training set is separable
    – noisy training data often make the training set non-separable
    – using highly nonlinear kernels to make the training set separable increases the likelihood of overfitting, despite maximizing the margin
• 43 Non-Separable Case
  • In the case of data that are non-separable in feature space, the objective function grows arbitrarily large.
  • Solution: relax the constraint that all training data be correctly classified, but only when necessary to do so.
  • Cortes and Vapnik (1995) introduced slack variables $\xi_i$, i = 1, ..., l, in the constraints:
    $\xi_i = \max\!\left(0,\; \gamma - y_i(\langle w, x_i \rangle + b)\right)$

  Non-Separable Case
  [Figure: points inside or beyond the margin have nonzero slacks $\xi_i, \xi_j$]
  The generalization error can be shown to be
  $\varepsilon \le \frac{1}{l} \left( \frac{R^2 + \sum_i \xi_i^2}{\gamma^2} \right)$
• 44 Non-Separable Case
  For an error to occur, the corresponding $\xi_i$ must exceed γ (which was chosen to be unity), so $\sum_i \xi_i$ is an upper bound on the number of training errors.
  The objective function is therefore changed from $\frac{\|w\|^2}{2}$ to
  $\frac{\|w\|^2}{2} + C \left( \sum_i \xi_i \right)^{k}$
  where C and k are parameters (typically k = 1 or 2, and C is a large positive constant).

  The Soft-Margin Classifier
  Minimize
  $\frac{1}{2}\langle w, w \rangle + C \sum_i \xi_i$
  subject to
  $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$
  Lagrangian:
  $L_P = \underbrace{\frac{1}{2}\|w\|^2 + C \sum_i \xi_i}_{\text{objective function}}
   \;-\; \underbrace{\sum_i \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right]}_{\text{inequality constraints}}
   \;-\; \underbrace{\sum_i \mu_i \xi_i}_{\text{non-negativity constraints}}$
• 45 The Soft-Margin Classifier: KKT Conditions
  $\alpha_i \ge 0, \qquad \xi_i \ge 0, \qquad \mu_i \ge 0, \qquad
   \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right] = 0, \qquad \mu_i \xi_i = 0$
  $\alpha_i > 0$ only if $x_i$ is a support vector (i.e., $\langle w, x_i \rangle + b = \pm 1$ and hence $\xi_i = 0$) or $x_i$ violates the margin or is misclassified (and hence $\xi_i > 0$).

  From Primal to Dual
  Equating the derivatives of $L_P$ with respect to the primal variables to 0:
  $\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad
   \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0, \qquad
   \frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; C = \alpha_i + \mu_i \;\Rightarrow\; 0 \le \alpha_i \le C$
  Substituting back into $L_P$, we get the dual to be maximized:
  $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
• 46 Soft-Margin Dual Lagrangian
  Maximize
  $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  subject to
  $0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$
  KKT conditions:
  $\alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right] = 0, \qquad (C - \alpha_i)\,\xi_i = 0$
  • Slack variables are nonzero only when $\alpha_i = C$ (margin-violating or misclassified training samples)
  • For true support vectors, $0 < \alpha_i < C$
  • For all other (correctly classified) training samples, $\alpha_i = 0$

  Implementation Techniques
  Maximize a quadratic function subject to linear equality and inequality constraints:
  $W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  $0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$
  $\frac{\partial W(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_j y_j \alpha_j K(x_j, x_i)$
• 47 On-line Algorithm for the 1-Norm Soft Margin (bias b = 0)
  Given training set D:
    α ← 0
    repeat
      for i = 1 to l
        $\alpha_i \leftarrow \alpha_i + \eta_i \left( 1 - y_i \sum_j \alpha_j y_j K(x_j, x_i) \right)$ with $\eta_i = \frac{1}{K(x_i, x_i)}$
        if $\alpha_i < 0$ then $\alpha_i \leftarrow 0$
        else if $\alpha_i > C$ then $\alpha_i \leftarrow C$
      end for
    until stopping criterion satisfied
    return α
  In the general setting, b ≠ 0 and we also need to solve for b.

  Implementation Techniques
  • Use QP packages (MINOS, LOQO, quadprog from the MATLAB optimization toolbox). They are not online and require that the data be held in memory in the form of the kernel matrix.
  • Stochastic gradient ascent: sequentially update one weight at a time, so it is online. It gives an excellent approximation in most cases:
    $\hat{\alpha}_i \leftarrow \alpha_i + \frac{1}{K(x_i, x_i)} \left( 1 - y_i \sum_j \alpha_j y_j K(x_j, x_i) \right)$
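  A minimal runnable sketch of the on-line 1-norm soft-margin update above (bias b = 0, as in the pseudocode; the function name, RBF kernel choice, and toy data are illustrative assumptions):

```python
import numpy as np

def online_soft_margin(X, y, K, C=10.0, epochs=100):
    """Stochastic coordinate-wise ascent on the soft-margin dual, clipping alpha to [0, C]."""
    l = len(y)
    Kmat = np.array([[K(X[i], X[j]) for j in range(l)] for i in range(l)])
    alpha = np.zeros(l)
    for _ in range(epochs):
        for i in range(l):
            eta = 1.0 / Kmat[i, i]
            alpha[i] += eta * (1.0 - y[i] * np.sum(alpha * y * Kmat[:, i]))
            alpha[i] = min(max(alpha[i], 0.0), C)   # clip to [0, C]
    return alpha

rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = online_soft_margin(X, y, rbf, C=10.0)

decision = lambda x: np.sum(alpha * y * np.array([rbf(xi, x) for xi in X]))
print(alpha, [np.sign(decision(x)) for x in X])
```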
• 48 Chunking and Decomposition
  Given training set D:
    α ← 0
    select an arbitrary working set $\hat{D} \subset D$
    repeat
      solve the optimization problem on $\hat{D}$
      select a new working set from the data not satisfying the KKT conditions
    until stopping criterion satisfied
    return α

  Sequential Minimal Optimization (SMO)
  • At each step, SMO optimizes two weights $\alpha_i, \alpha_j$ simultaneously in order not to violate the linear constraint $\sum_i \alpha_i y_i = 0$
  • The optimization of the two weights is performed analytically
  • Realizes gradient descent without leaving the linear constraint (J. Platt)
  • Online versions exist (Li, Long; Gentile)
• 49 SVM Implementations
  • Details of optimization and quadratic programming
  • SVMLight – one of the first practical implementations of SVM (Joachims)
  • Matlab toolboxes for SVM
  • LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab: http://www.cs.iastate.edu/~yasser/wlsvm/

  Why Does SVM Work?
  • The feature space is often very high dimensional. Why don't we have the curse of dimensionality?
  • A classifier in a high-dimensional space has many parameters and is hard to estimate.
  • Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the capacity of the classifier.
  • Typically, a classifier with many parameters is very flexible, but there are also exceptions:
    – Recall $h(x, \alpha) = \mathrm{sgn}(\sin(\alpha x))$ with $x_i = 10^{-i}$, i = 1..n: this classifier can classify all $x_i$ correctly for every possible combination of class labels on the $x_i$.
    – This 1-parameter classifier is very flexible.
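  As a minimal usage sketch related to the implementations listed above, scikit-learn's SVC wraps LIBSVM (scikit-learn is an assumed dependency, not mentioned in the slides; the toy data is hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.4, size=(20, 2)), rng.normal(1, 0.4, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # soft-margin SVM with a Gaussian kernel
clf.fit(X, y)
print(clf.support_)       # indices of the support vectors
print(clf.dual_coef_)     # alpha_i * y_i for the support vectors
print(clf.predict([[0.8, 1.1], [-0.9, -1.2]]))
```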
• 50 Why Does SVM Work?
  • Vapnik argues that the capacity of a classifier should not be characterized by its number of parameters, but by its capacity as formalized by the VC dimension of the classifier.
  • The minimization of ||w||², subject to the condition that the functional margin be 1, has the effect of restricting the VC dimension of the classifier in the feature space.
  • The SVM performs structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized.
  • The SVM loss function is analogous to ridge regression: the term ½||w||² "shrinks" the parameters towards zero to avoid overfitting.

  Choosing the Kernel Function
  • Probably the trickiest part of using SVM.
  • The kernel function should maximize the similarity among instances within a class while accentuating the differences between classes.
  • A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, ...) for different types of data.
  • In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed-dimensional input space.
  • Low-order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.
• 51 Other Aspects of SVM
  • How do we use SVM for multi-class classification?
    – One can change the QP formulation to become multi-class
    – More often, multiple binary classifiers are combined
    – One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
  • How do we interpret the SVM discriminant function value as a probability?
    – By performing logistic regression on the SVM output of a set of data that is not used for training
  • Some SVM software (like LIBSVM) has these features built in

  Recap: How to Use SVM
  • Prepare the data set
  • Select the kernel function to use
  • Select the parameters of the kernel function and the value of C
    – You can use the values suggested by the SVM software, or use a tuning set to tune C
  • Execute the training algorithm to obtain the $\alpha_i$ and b
  • Unseen data can be classified using the $\alpha_i$, the support vectors, and b
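  A hedged sketch of the recap above – tuning C and the RBF width on held-out folds rather than on the training set (GridSearchCV from scikit-learn is an assumed tool; the data is hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100))

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)   # 5-fold cross-validated tuning
search.fit(X, y)
print(search.best_params_, search.best_score_)
```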
• 52 Strengths and Weaknesses of SVM
  • Strengths
    – Training is relatively easy: no local optima
    – It scales relatively well to high-dimensional data
    – The tradeoff between classifier complexity and error can be controlled explicitly
    – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
  • Weaknesses
    – Need to choose a "good" kernel function

  Recent Developments
  • Better understanding of the relation between SVM and regularized discriminative classifiers (logistic regression)
  • Knowledge-based SVM – incorporating prior knowledge as constraints in SVM optimization (Shavlik et al.)
  • A zoo of kernel functions
  • Extensions of SVM to multi-class and structured-label classification tasks
  • One-class SVM (anomaly detection)
• 53 Representative Applications of SVM
  • Handwritten letter classification
  • Text classification
  • Image classification – e.g., face recognition
  • Bioinformatics
    – Gene expression based tissue classification
    – Protein function classification
    – Protein structure classification
    – Protein sub-cellular localization
  • ...