The combined image and text database was obtained from the Internet by searching for images and downloading the adjacent text. Images smaller than 72x72 were discarded. The features comprise 192 image texture features, 768 image colour features and 3591 text features (terms). [Data were retrieved from www.yahoo.com and www.warpig.com]
Transcript
Kernel Methods: the Emergence of a Well-founded Machine Learning John Shawe-Taylor Centre for Computational Statistics and Machine Learning University College London
Overview <ul><li>Celebration of 10 years of kernel methods: </li></ul><ul><ul><li>what has been achieved and </li></ul></ul><ul><ul><li>what can we learn from the experience? </li></ul></ul><ul><li>Some historical perspectives: </li></ul><ul><ul><li>Theory or not? </li></ul></ul><ul><ul><li>Applicable or not? </li></ul></ul><ul><li>Some emphases: </li></ul><ul><ul><li>Role of theory – need for plurality of approaches </li></ul></ul><ul><ul><li>Importance of scalability </li></ul></ul>Surprising fact: there are now 13,444 citations of Support Vectors reported by Google scholar – the vast majority from papers not about machine learning! Ratio NNs/SVs in Google scholar 1987-1996: 220 1997-2006: 11
Caveats <ul><li>Personal perspective with inevitable bias </li></ul><ul><ul><li>One very small slice through what is now a very big field </li></ul></ul><ul><ul><li>Focus on theory with emphasis on frequentist analysis </li></ul></ul><ul><li>There is no pro-forma for scientific research </li></ul><ul><ul><li>But role of theory worth discussing </li></ul></ul><ul><ul><li>Needed to give firm foundation for proposed approaches? </li></ul></ul>
Motivation behind kernel methods <ul><li>Linear learning typically has nice properties </li></ul><ul><ul><li>Unique optimal solutions </li></ul></ul><ul><ul><li>Fast learning algorithms </li></ul></ul><ul><ul><li>Better statistical analysis </li></ul></ul><ul><li>But one big problem </li></ul><ul><ul><li>Insufficient capacity </li></ul></ul>
Historical perspective <ul><li>Minsky and Papert highlighted the weakness in their book Perceptrons </li></ul><ul><li>Neural networks overcame the problem by gluing together many linear units with non-linear activation functions </li></ul><ul><ul><li>Solved problem of capacity and led to very impressive extension of applicability of learning </li></ul></ul><ul><ul><li>But ran into training problems of speed and multiple local minima </li></ul></ul>
Kernel methods approach <ul><li>The kernel methods approach is to stick with linear functions but work in a high dimensional feature space: </li></ul><ul><li>The expectation is that the feature space has a much higher dimension than the input space. </li></ul>
Example <ul><li>Consider the mapping </li></ul><ul><li>If we consider a linear equation in this feature space: </li></ul><ul><li>We actually have an ellipse – i.e. a non-linear shape in the input space. </li></ul>
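The mapping and the equation on this slide did not survive in the transcript. As a hedged illustration, assume the standard degree-2 monomial map on a two-dimensional input (an assumption, not necessarily the slide's exact choice, with coefficients a, b, c, d introduced here). A linear equation in the feature space then reads as a conic in the input space:

```latex
\phi : (x_1, x_2) \mapsto (z_1, z_2, z_3) = \big(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\big)
\qquad
a z_1 + b z_2 + c z_3 = d
\;\Longleftrightarrow\;
a x_1^2 + b x_2^2 + \sqrt{2}\,c\,x_1 x_2 = d
```

Under this assumption the solution set is an ellipse whenever a, b, d > 0 and c^2 < 2ab, since the quadratic form on the right is then positive definite.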
Capacity of feature spaces <ul><li>The capacity is proportional to the dimension – for example: </li></ul><ul><li>2-dim: </li></ul>
Form of the functions <ul><li>So kernel methods use linear functions in a feature space: </li></ul><ul><li>For regression this could be the function </li></ul><ul><li>For classification require thresholding </li></ul>
Problems of high dimensions <ul><li>Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means unlikely to generalise well </li></ul><ul><li>Computational costs involved in dealing with large vectors </li></ul>
Overview <ul><li>Two theoretical approaches converged on very similar algorithms: </li></ul><ul><ul><li>Frequentist led to Support Vector Machine </li></ul></ul><ul><ul><li>Bayesian approach led to Bayesian inference using Gaussian Processes </li></ul></ul><ul><li>First we briefly discuss the Bayesian approach before mentioning some of the frequentist results </li></ul>
Bayesian approach <ul><li>The Bayesian approach relies on a probabilistic analysis by positing </li></ul><ul><ul><li>a noise model </li></ul></ul><ul><ul><li>a prior distribution over the function class </li></ul></ul><ul><li>Inference involves updating the prior distribution with the likelihood of the data </li></ul><ul><li>Possible outputs: </li></ul><ul><ul><li>MAP function </li></ul></ul><ul><ul><li>Bayesian posterior average </li></ul></ul>
Bayesian approach <ul><li>Avoids overfitting by </li></ul><ul><ul><li>Controlling the prior distribution </li></ul></ul><ul><ul><li>Averaging over the posterior </li></ul></ul><ul><li>For Gaussian noise model (for regression) and Gaussian process prior we obtain a ‘kernel’ method where </li></ul><ul><ul><li>Kernel is covariance of the prior GP </li></ul></ul><ul><ul><li>Noise model translates into addition of ridge to kernel matrix </li></ul></ul><ul><ul><li>MAP and averaging give the same solution </li></ul></ul><ul><ul><li>Link with infinite hidden node limit of single hidden layer Neural Networks – see seminal paper </li></ul></ul><ul><ul><li>Williams, Computation with infinite neural networks (1997) </li></ul></ul>Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function!
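For reference, the standard Gaussian process regression formulas make the "ridge" remark concrete. The symbols are assumptions introduced here: K is the kernel matrix on the training inputs, k(x) is the vector of kernel evaluations between the training inputs and a test point x, y is the vector of training targets, and sigma^2 is the noise variance.

```latex
\bar f(x) = \mathbf{k}(x)^{\top} \big(K + \sigma^2 I\big)^{-1} \mathbf{y},
\qquad
\operatorname{var}[f(x)] = \kappa(x, x) - \mathbf{k}(x)^{\top} \big(K + \sigma^2 I\big)^{-1} \mathbf{k}(x)
```

The term sigma^2 I is the ridge added to the kernel matrix, and the posterior mean has the same form as the kernel ridge regression solution.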
Bayesian approach <ul><li>Subject to assumptions about noise model and prior distribution: </li></ul><ul><ul><li>Can get error bars on the output </li></ul></ul><ul><ul><li>Compute evidence for the model and use for model selection </li></ul></ul><ul><li>Approach developed for different noise models </li></ul><ul><ul><li>eg classification </li></ul></ul><ul><ul><li>Typically requires approximate inference </li></ul></ul>
Frequentist approach <ul><li>Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data </li></ul><ul><li>Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived </li></ul><ul><li>Main focus is on generalisation error analysis </li></ul>
Capacity problem <ul><li>What do we mean by generalisation? </li></ul>
Generalisation of a learner
Example of Generalisation <ul><li>We consider the Breast Cancer dataset from the UCI repository </li></ul><ul><li>Use the simple Parzen window classifier: weight vector is $w = \mu_+ - \mu_-$ </li></ul><ul><li>where $\mu_+$ ($\mu_-$) is the average of the positive (negative) training examples. </li></ul><ul><li>Threshold is set so hyperplane bisects the line joining these two points. </li></ul>
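A minimal sketch of such a classifier, assuming the weight vector is the difference of the two class means and the threshold places the hyperplane through their midpoint. The function names and the toy data are hypothetical; this is not the Breast Cancer experiment itself.

```python
def mean(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_mean_classifier(X, y):
    """Return (w, b): w is the difference of the class means and b is chosen
    so that the hyperplane <w, x> = b passes through the midpoint of the means."""
    mu_plus = mean([x for x, label in zip(X, y) if label == +1])
    mu_minus = mean([x for x, label in zip(X, y) if label == -1])
    w = [p - m for p, m in zip(mu_plus, mu_minus)]
    midpoint = [(p + m) / 2 for p, m in zip(mu_plus, mu_minus)]
    return w, dot(w, midpoint)

def predict(w, b, x):
    """Label +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if dot(w, x) - b >= 0 else -1

# Toy data, made up for illustration
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]]
y = [+1, +1, -1, -1]
w, b = train_mean_classifier(X, y)
print(w, b, predict(w, b, [1.0, 1.5]))
```

Because training only averages the examples, the classifier is cheap to retrain on many random subsets, which is what the resampling experiment on the following slides requires.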
Example of Generalisation <ul><li>By repeatedly drawing random training sets S of size m we estimate the distribution of </li></ul><ul><li>by using the test set error as a proxy for the true generalisation </li></ul><ul><li>We plot the histogram and the average of the distribution for various sizes of training set </li></ul><ul><li>648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7 . </li></ul>
Example of Generalisation <ul><li>Since the expected classifier is in all cases the same </li></ul><ul><li>we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be the same exactly. </li></ul>
Error distribution: full dataset
Error distribution: dataset size: 342
Error distribution: dataset size: 273
Error distribution: dataset size: 205
Error distribution: dataset size: 137
Error distribution: dataset size: 68
Error distribution: dataset size: 34
Error distribution: dataset size: 27
Error distribution: dataset size: 20
Error distribution: dataset size: 14
Error distribution: dataset size: 7
Observations <ul><li>Things can get bad if number of training examples small compared to dimension (in this case input dimension is 9) </li></ul><ul><li>Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong </li></ul><ul><li>Key ingredient of learning – keep flexibility high while still ensuring good generalisation </li></ul>
Controlling generalisation <ul><li>The critical method of controlling generalisation for classification is to force a large margin on the training data: </li></ul>
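The formula on this slide is missing from the transcript. One common way to quantify the margin, used here as an assumption, is the smallest signed distance of a training example from the separating hyperplane. The function name and data below are hypothetical.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest value of y_i * (<w, x_i> + b) / ||w|| over the training set.
    It is positive only if every example is classified correctly."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, label in zip(X, y))

X = [[2.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]]
y = [+1, +1, -1, -1]

centred = geometric_margin([1.0, 0.0], -0.5, X, y)   # hyperplane x1 = 0.5
shifted = geometric_margin([1.0, 0.0], -1.5, X, y)   # hyperplane x1 = 1.5
print(centred, shifted)
```

Both hyperplanes separate the toy data, but the centred one has the larger margin, which is the quantity a large-margin method would prefer.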
Intuitive and rigorous explanations <ul><li>Makes classification robust to uncertainties in inputs </li></ul><ul><li>Can randomly project into lower dimensional spaces and still have separation – so effectively low dimensional </li></ul><ul><li>Rigorous statistical analysis shows effective dimension </li></ul><ul><li>This is not structural risk minimisation over VC classes since hierarchy depends on the data: data-dependent structural risk minimisation </li></ul><ul><li>see S-T, Bartlett, Williamson & Anthony (1996 and 1998) </li></ul>Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were wake-up call that classical PAC analysis was not capturing critical factors in real-world applications!
Learning framework <ul><li>Since there are lower bounds in terms of the VC dimension the margin is detecting a favourable distribution/task alignment – luckiness framework captures this idea </li></ul><ul><li>Now consider using an SVM on the same data and compare the distribution of generalisations </li></ul><ul><li>SVM distribution in red </li></ul>
Error distribution: dataset size: 205
Error distribution: dataset size: 137
Error distribution: dataset size: 68
Error distribution: dataset size: 34
Error distribution: dataset size: 27
Error distribution: dataset size: 20
Error distribution: dataset size: 14
Error distribution: dataset size: 7
Handling training errors <ul><li>So far only considered case where data can be separated </li></ul><ul><li>For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin </li></ul><ul><li>These amounts are often referred to as slack variables from optimisation theory </li></ul>
Support Vector Machines <ul><li>SVM optimisation </li></ul><ul><li>Analysis of this case given using augmented space trick in </li></ul><ul><li>S-T and Cristianini (1999 and 2002) On the generalisation of soft margin algorithms. </li></ul>
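The optimisation problem itself is missing from the transcript. For orientation, a standard 1-norm soft margin formulation is shown below. It may differ in form from the one on the slide, and the trade-off parameter C and slack variables xi_i are notation assumed here.

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, \dots, m
```

Each slack variable xi_i measures the amount by which example i fails to meet the margin, and C sets how heavily such failures are penalised.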
Complexity problem <ul><li>Let’s apply the quadratic example </li></ul><ul><li>to a 20x30 image of 600 pixels – gives approximately 180000 dimensions! </li></ul><ul><li>Would be computationally infeasible to work in this space </li></ul>
Dual representation <ul><li>Suppose weight vector is a linear combination of the training examples: </li></ul><ul><li>can evaluate inner product with new example </li></ul>
Learning the dual variables <ul><li>The α i are known as dual variables </li></ul><ul><li>Since any component orthogonal to the space spanned by the training data has no effect, general result that weight vectors have dual representation: the representer theorem. </li></ul><ul><li>Hence, can reformulate algorithms to learn dual variables rather than weight vector directly </li></ul>
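A small numerical check of the dual representation, using the identity feature map for simplicity (an assumption; the names and data are hypothetical). The inner product with a new example is the same whether the weight vector is formed explicitly or only inner products with the training examples are used.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_weight(alphas, X):
    """w = sum_i alpha_i * x_i, formed explicitly."""
    dim = len(X[0])
    return [sum(a * x[j] for a, x in zip(alphas, X)) for j in range(dim)]

def dual_evaluate(alphas, X, x_new):
    """<w, x_new> computed only from inner products with the training examples."""
    return sum(a * dot(x, x_new) for a, x in zip(alphas, X))

X = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
alphas = [0.5, -1.0, 0.25]
x_new = [2.0, 1.0]

primal_value = dot(primal_weight(alphas, X), x_new)
dual_value = dual_evaluate(alphas, X, x_new)
print(primal_value, dual_value)
```

The dual route never needs the weight vector itself, which is what allows the feature vectors to be replaced by kernel evaluations later on.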
Dual form of SVM <ul><li>The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives: </li></ul><ul><li>Note that threshold must be determined from border examples </li></ul>
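The dual problem is missing from the transcript. The standard dual of the 1-norm soft margin SVM is given below as a reference. The notation is assumed: C is the trade-off parameter, and the labels y_i are written explicitly rather than absorbed into the dual variables.

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m} \alpha_i
- \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle
\quad \text{subject to} \quad
\sum_{i=1}^{m} \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C
```

The resulting classifier is the sign of sum_i alpha_i y_i <phi(x_i), phi(x)> + b, where b can be fixed from any example with 0 < alpha_i < C, since such border examples lie exactly on the margin.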
Using kernels <ul><li>Critical observation is that again only inner products are used </li></ul><ul><li>Suppose that we now have a shortcut method of computing: </li></ul><ul><li>Then we do not need to explicitly compute the feature vectors either in training or testing </li></ul>
Kernel example <ul><li>As an example consider the mapping </li></ul><ul><li>Here we have a shortcut: </li></ul>
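The mapping and the shortcut formula are missing from the transcript. A sketch under the assumption that the feature map consists of all ordered degree-2 products of the inputs, for which the shortcut is the squared inner product (function names are hypothetical):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic_features(x):
    """All ordered degree-2 products x_i * x_j (n * n features for n inputs)."""
    return [xi * xj for xi in x for xj in x]

def quadratic_kernel(x, z):
    """Shortcut: the same inner product from a single n-dimensional dot product."""
    return dot(x, z) ** 2

x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
explicit = dot(quadratic_features(x), quadratic_features(z))
shortcut = quadratic_kernel(x, z)
print(explicit, shortcut)
```

With ordered products the explicit feature space has n^2 dimensions. Counting each unordered pair once gives n(n+1)/2, which for 600 pixels is 180300 and presumably corresponds to the figure of approximately 180000 quoted for the image example.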
Efficiency <ul><li>Hence, in the pixel example rather than work with 180000 dimensional vectors, we compute a 600 dimensional inner product and then square the result! </li></ul><ul><li>Can even work in infinite dimensional spaces, eg using the Gaussian kernel: </li></ul>
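The Gaussian kernel formula is missing from the transcript. A minimal sketch using one common parametrisation, exp(-||x - z||^2 / (2 sigma^2)); the width parameter sigma and the helper names are assumptions.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 * sigma^2)); sigma is the width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(X, kernel):
    """Matrix of all pairwise kernel evaluations on the list of points X."""
    return [[kernel(a, b) for b in X] for a in X]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = kernel_matrix(X, gaussian_kernel)
for row in K:
    print([round(v, 4) for v in row])
```

Only the kernel matrix is ever needed by a dual algorithm, so the infinite-dimensional feature vectors are never formed.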
Using Gaussian kernel for Breast: 273
Data size 342 Surprising fact: kernel methods are invariant to rotations of the coordinate system – so any special information encoded in the choice of coordinates will be lost! Surprisingly one of the most successful applications of SVMs has been for text classification in which one would expect the encoding to be very informative!
Constraints on the kernel <ul><li>There is a restriction on the function: </li></ul><ul><li>This restriction for any training set is enough to guarantee function is a kernel </li></ul>Surprising fact: the property that guarantees the existence of a feature space is precisely the property required to ensure convexity of the resulting optimisation problem for SVMs, etc!
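The restriction itself is missing from the transcript. The usual condition is that the kernel matrix is positive semi-definite on every finite set of points, i.e. c^T K c >= 0 for all vectors c. The sketch below, with hypothetical helper names and parameters, evaluates this quadratic form for a Gaussian kernel and for a sigmoidal (tanh) kernel with a negative offset.

```python
import math

def quadratic_form(K, c):
    """c^T K c for a square matrix K given as a list of rows."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def gaussian_kernel(x, z, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, r=-1.0):
    return math.tanh(a * sum(p * q for p, q in zip(x, z)) + r)

X = [[0.0], [1.0], [2.5]]
K_gauss = [[gaussian_kernel(u, v) for v in X] for u in X]
K_sig = [[sigmoid_kernel(u, v) for v in X] for u in X]

gauss_value = quadratic_form(K_gauss, [1.0, -2.0, 1.0])
sig_value = quadratic_form(K_sig, [1.0, 0.0, 0.0])   # equals tanh(-1), a negative diagonal entry
print(gauss_value, sig_value)
```

A single negative value is enough to show that a function is not a valid kernel, whereas a non-negative value for one choice of c proves nothing on its own; the condition must hold for every c and every finite set of points.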
What have we achieved? <ul><li>Replaced problem of neural network architecture by kernel definition </li></ul><ul><ul><li>Arguably more natural to define but restriction is a bit unnatural </li></ul></ul><ul><ul><li>Not a silver bullet as fit with data is key </li></ul></ul><ul><ul><li>Can be applied to non-vectorial (or high dim) data </li></ul></ul><ul><li>Gained more flexible regularisation/ generalisation control </li></ul><ul><li>Gained convex optimisation problem </li></ul><ul><ul><li>i.e. NO local minima! </li></ul></ul>
Historical note <ul><li>First use of kernel methods in machine learning was in combination with perceptron algorithm: </li></ul><ul><li>Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964) </li></ul><ul><li>Apparently failed because of generalisation issues – margin not optimised but this can be incorporated with one extra parameter! </li></ul>Surprising fact: good covering number bounds on generalisation of SVMs use bound on updates of the perceptron algorithm to bound the covering numbers and hence generalisation – but the same bound can be applied directly to the perceptron without margin using a sparsity argument, so bound is tighter for classical perceptron!
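To illustrate the combination of a kernel with the perceptron, here is a sketch of the perceptron in dual form. It is a textbook version, not a reconstruction of the 1964 method; the kernel (an inhomogeneous quadratic), the function names and the toy data are assumptions.

```python
def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: alpha[i] counts the mistakes made on example i."""
    m = len(X)
    alpha = [0] * m
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(m))
            if y[i] * f <= 0:          # misclassified (or undecided): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:              # a full pass without errors: converged
            break
    return alpha

def predict(alpha, X, y, kernel, x_new):
    f = sum(a * label * kernel(x, x_new) for a, label, x in zip(alpha, y, X))
    return 1 if f > 0 else -1

def quadratic_kernel(x, z):
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

# XOR-like toy data: not linearly separable in the input space
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
y = [+1, +1, -1, -1]
alpha = kernel_perceptron(X, y, quadratic_kernel)
predictions = [predict(alpha, X, y, quadratic_kernel, x) for x in X]
print(alpha, predictions)
```

The algorithm stops as soon as the training data are separated, with no attempt to enlarge the margin, which is the limitation the slide points to.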
PAC-Bayes: General perspectives <ul><li>The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena </li></ul><ul><li>Several theories of machine learning: notably Bayesian and frequentist </li></ul><ul><li>Different assumptions and hence different range of applicability and range of results </li></ul><ul><li>Bayesian able to make more detailed probabilistic predictions </li></ul><ul><li>Frequentist makes only i.i.d. assumption </li></ul>
Evidence and generalisation <ul><li>Evidence is a measure used in Bayesian analysis to inform model selection </li></ul><ul><li>Evidence related to the posterior volume of weight space consistent with the sample </li></ul><ul><li>Link between evidence and generalisation hypothesised by MacKay </li></ul><ul><li>First such result was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator </li></ul><ul><li>Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space -- included a dependence on the dim of the space but applies to non-linear function classes </li></ul><ul><li>Used Luckiness framework where luckiness measured by the volume of the ball </li></ul>
PAC-Bayes Theorem <ul><li>PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions </li></ul><ul><li>First version proved by McAllester in 1999 </li></ul><ul><li>Improved proof and bound due to Seeger in 2002 with application to Gaussian processes </li></ul><ul><li>Application to SVMs by Langford and S-T also in 2002 </li></ul><ul><li>Gives tightest bounds on generalisation for SVMs </li></ul><ul><li>see Langford tutorial on JMLR </li></ul>
Examples of bound evaluation
<table>
<tr><th>Problem</th><th>Test Error</th><th>PAC-Bayes prior</th><th>PAC-Bayes</th></tr>
<tr><td>Wdbc</td><td>0.073 ±.021</td><td>0.306 ±.018</td><td>0.334 ±.005</td></tr>
<tr><td>Image</td><td>0.074 ±.014 (0.056 ±.01)</td><td>0.184 ±.010</td><td>0.254 ±.003</td></tr>
<tr><td>Waveform</td><td>0.089 ±.005</td><td>0.151 ±.005</td><td>0.198 ±.002</td></tr>
<tr><td>Ringnorm</td><td>0.026 ±.005</td><td>0.109 ±.024</td><td>0.212 ±.002</td></tr>
</table>
Surprising fact: optimising the bound does not typically improve the test error – despite significant improvements in the bound itself! Training an SVM on half the data to learn a prior and then using the rest to learn relative to this prior further improves the bound with almost no effect on the test error!
Principal Components Analysis (PCA) <ul><li>Classical method is principal component analysis – looks for directions of maximum variance, given by eigenvectors of covariance matrix </li></ul>
Dual representation of PCA <ul><li>Eigenvectors of kernel matrix give dual representation: </li></ul><ul><li>Means we can perform PCA projection in a kernel defined feature space: kernel PCA </li></ul>
Kernel PCA <ul><li>Need to take care of normalisation to obtain: </li></ul><ul><li>where λ is the corresponding eigenvalue. </li></ul>
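The formulas are missing from the transcript. The standard normalisation is given below as a reference, with assumed notation: v_j is a unit-norm eigenvector of the (centred) kernel matrix K with eigenvalue lambda_j > 0, and u_j is the corresponding direction in feature space.

```latex
K v_j = \lambda_j v_j,\quad \|v_j\| = 1
\;\;\Longrightarrow\;\;
u_j = \sum_{i=1}^{m} \alpha_i^{j}\, \phi(x_i),\quad
\alpha^{j} = \lambda_j^{-1/2}\, v_j,
\qquad
\langle u_j, \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i^{j}\, \kappa(x_i, x)
```

The scaling by lambda_j^{-1/2} gives u_j unit norm, since ||u_j||^2 = (alpha^j)^T K alpha^j = lambda_j^{-1} v_j^T K v_j = 1, and the projection of a new point needs only kernel evaluations.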
Generalisation of k-PCA <ul><li>How reliable are estimates obtained from a sample when working in such high dimensional spaces </li></ul><ul><li>Using Rademacher complexity bounds can show that if low dimensional projection captures most of the variance in the training set, will do well on test examples as well even in high dimensional spaces </li></ul><ul><li>see S-T, Williams, Cristianini and Kandola (2004) On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA </li></ul><ul><li>The bound also gives a new bound on the difference between the process and empirical eigenvalues </li></ul>Surprising fact: no frequentist bounds apply for learning in the representation given by k-PCA – the data is used twice, once to find the subspace and then to learn the classifier/regressor! Intuitively, this shouldn’t be a problem since the labels are not used by k-PCA!
Latent Semantic Indexing <ul><li>Developed in information retrieval to overcome problem of semantic relations between words in BoWs representation </li></ul><ul><li>Use Singular Value Decomposition of Term-doc matrix: </li></ul>
Lower dimensional representation <ul><li>If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense: </li></ul><ul><li>In this case we can write a ‘semantic matrix’ as </li></ul>
Latent Semantic Kernels <ul><li>Can perform the same transformation in a kernel defined feature space by performing an eigenvalue decomposition of the kernel matrix (this is equivalent to kernel PCA): </li></ul>
Related techniques <ul><li>Number of related techniques: </li></ul><ul><ul><li>Probabilistic LSI (pLSI) </li></ul></ul><ul><ul><li>Non-negative Matrix Factorisation (NMF) </li></ul></ul><ul><ul><li>Multinomial PCA (mPCA) </li></ul></ul><ul><ul><li>Discrete PCA (DPCA) </li></ul></ul><ul><li>All can be viewed as alternative decompositions: </li></ul>
Different criteria <ul><li>Vary by: </li></ul><ul><ul><li>Different constraints (eg non-negative entries) </li></ul></ul><ul><ul><li>Different prior distributions (eg Dirichlet, Poisson) </li></ul></ul><ul><ul><li>Different optimisation criteria (eg max likelihood, Bayesian) </li></ul></ul><ul><li>Unlike LSI typically suffer from local minima and so require EM type iterative algorithms to converge to solutions </li></ul>
Other subspace methods <ul><li>Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition – greedy kernel PCA </li></ul><ul><li>Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics – takes account of labels </li></ul><ul><li>Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views </li></ul>
Paired corpora <ul><li>Can we use paired corpora to extract more information? </li></ul><ul><li>Two views of same semantic object – hypothesise that both views contain all of the necessary information, eg document and translation to a second language: </li></ul>
[Figure: aligned text; documents E_1, E_2, ..., E_i, ..., E_N paired with documents F_1, F_2, ..., F_i, ..., F_N]
Canadian parliament corpus
E_12: LAND MINES Ms. Beth Phinney (Hamilton Mountain, Lib.): Mr. Speaker, we are pleased that the Nobel peace prize has been given to those working to ban land mines worldwide. We hope this award will encourage the United States to join the over 100 countries planning to come to …
F_12: LES MINES ANTIPERSONNEL Mme Beth Phinney (Hamilton Mountain, Lib.): Monsieur le Président, nous nous réjouissons du fait que le prix Nobel ait été attribué à ceux qui oeuvrent en faveur de l'interdiction des mines antipersonnel dans le monde entier. Nous espérons que cela incitera les Américains à se joindre aux représentants de plus de 100 pays qui ont l'intention de venir à …
cross-lingual lsi via svd M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross-language information retrieval . Kluwer, 1998.
cross-lingual kernel canonical correlation analysis [Figure: the feature map Φ(x) takes the input “English” space and the input “French” space to the feature “English” space and the feature “French” space, with projection directions f_E1, f_E2 and f_F1, f_F2]
kernel canonical correlation analysis
Regularization <ul><li>using kernel functions may result in overfitting </li></ul><ul><li>need to control flexibility of the projections f_E and f_F: add a diagonal term, controlled by the regularization parameter, to the matrix D </li></ul><ul><li>Theoretical analysis shows that provided the norms of the weight vectors are small the correlation will still hold for new data </li></ul><ul><li>see S-T and Cristianini (2004) Kernel Methods for Pattern Analysis </li></ul>
Pseudo query test [Figure: a query q^e_i generated from test document E_i is matched against the documents F_1, F_2, ..., F_i, ..., F_N] Queries were generated from each test document by extracting the 5 words with the highest TFIDF weights and using them as a query.
Experimental Results <ul><li>The goal was to retrieve the paired document. </li></ul><ul><li>Experimental procedure: </li></ul><ul><li>(1) LSI/KCCA trained on paired documents, </li></ul><ul><li>(2) All test documents projected into the LSI/KCCA semantic space, </li></ul><ul><li>(3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query. </li></ul>
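Step (3) can be sketched as a nearest-neighbour search by cosine similarity. The projections into the semantic space are taken as given; the vectors and function names below are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieve(query_vec, doc_vecs):
    """Index of the document whose projection has the highest cosine similarity to the query."""
    return max(range(len(doc_vecs)),
               key=lambda i: cosine_similarity(query_vec, doc_vecs[i]))

# Hypothetical 3-dimensional semantic-space projections
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.1, 1.0]]
query = [0.5, 0.7, 0.1]
best = retrieve(query, docs)
print(best)
```

In the experiment described, a retrieval counts as correct when the returned document is the pair of the document from which the query was generated.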
English-French retrieval accuracy, %
Applying to different data types <ul><li>Data </li></ul><ul><ul><li>Combined image and associated text obtained from the web </li></ul></ul><ul><ul><li>Three categories: sport, aviation and paintball </li></ul></ul><ul><ul><li>400 examples from each category (1200 overall) </li></ul></ul><ul><ul><li>Features extracted: HSV, Texture, Bag of words </li></ul></ul><ul><li>Tasks </li></ul><ul><ul><li>Classification of web pages into the 3 categories </li></ul></ul><ul><ul><li>Text query -> image retrieval </li></ul></ul>
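Classification error of baseline method <ul><li>Previous error rates obtained using probabilistic ICA classification done for single feature groups </li></ul>
<table>
<tr><th>Colour</th><th>Texture</th><th>Colour + Texture</th><th>Text</th><th>All</th></tr>
<tr><td>22.9%</td><td>18.3%</td><td>13.6%</td><td>9.0%</td><td>3.0%</td></tr>
</table>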
Classification with multi-views <ul><li>If we use KCCA to generate a semantic feature space and then learn with an SVM, can envisage combining the two steps into a single ‘SVM-2k’ </li></ul><ul><li>learns two SVMs one on each representation, but constrains their outputs to be similar across the training (and any unlabelled data). </li></ul><ul><li>Can give classification for single mode test data – applied to patent classification for Japanese patents </li></ul><ul><li>Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap </li></ul><ul><li>see Farquhar, Hardoon, Meng, S-T, Szedmak (2005) Two view learning: SVM-2K, Theory and Practice </li></ul>
Targeted Density Learning <ul><li>Learning a probability density function in an L 1 sense is hard </li></ul><ul><li>Learning for a single classification or regression task is ‘easy’ </li></ul><ul><li>Consider a set of tasks (Touchstone Class) F that we think may arise with a distribution emphasising those more likely to be seen </li></ul><ul><li>Now learn density that does well for the tasks in F . </li></ul><ul><li>Surprisingly it can be shown that just including a small subset of tasks gives us convergence over all the tasks </li></ul>
Example plot <ul><li>Consider fitting cumulative distribution up to sample points as proposed by Mukherjee and Vapnik. </li></ul><ul><li>Plot shows loss as a function of the number of constraints (tasks) included: </li></ul>
Idealised view of progress Study problem to develop theoretical model Derive analysis that indicates factors that affect solution quality Translate into optimisation maximising factors – relaxing to ensure convexity Develop efficient solutions using specifics of the task
Role of theory <ul><li>Theoretical analysis plays a critical auxiliary role: </li></ul><ul><ul><li>Can never be complete picture </li></ul></ul><ul><ul><li>Needs to capture the factors that most influence quality of result </li></ul></ul><ul><ul><li>Means to an end – only as good as the result </li></ul></ul><ul><ul><li>Important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms </li></ul></ul>
Conclusions and future directions <ul><li>Extensions to learning data with output structure that can be used to inform the learning </li></ul><ul><li>More general subspace methods: algorithms and theoretical foundations </li></ul><ul><li>Extensions to more complex tasks, eg system identification, reinforcement learning </li></ul><ul><li>Linking learning into tasks with more detailed prior knowledge eg stochastic differential equations for climate modelling </li></ul><ul><li>Extending theoretical framework for more general learning tasks eg learning a density function </li></ul>