Nonlinear component analysis as a kernel eigenvalue problem

  1. Presentation of paper #7: "Nonlinear component analysis as a kernel eigenvalue problem", Schölkopf, Smola, Müller, Neural Computation 10, 1299-1319, MIT Press (1998).
Group C: M. Filannino, G. Rates, U. Sandouk
COMP61021: Modelling and Visualization of high-dimensional data
  2-3. Introduction
● Kernel Principal Component Analysis (KPCA)
  ○ KPCA is an extension of Principal Component Analysis
  ○ It computes PCA in a new feature space
  ○ Useful for feature extraction, dimensionality reduction
  4. Motivation: possible solutions (Principal Curves)
Trevor Hastie; Werner Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
● Optimization (including the quality of data approximation)
● Natural geometric meaning
● Natural projection
http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png
  5. Motivation: possible solutions (Autoencoders)
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507.
● Feed-forward neural network
● Approximates the identity function
http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png
  6. Motivation: some new problems
● Limited to low input dimensions
● Problem dependent
● Hard optimization problems
  7. Motivation: kernel trick
KPCA captures the overall variance of patterns
  8-11. Motivation: kernel trick (illustration slides; slide 11: video)
  12-13. Principle
(Diagram: Data → New features)
"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones."
  14. Principle
Given a data set of N centered observations x_1, ..., x_N in a d-dimensional space:
● PCA diagonalizes the covariance matrix: C = (1/N) Σ_j x_j x_jᵀ
● It is necessary to solve the following system of equations: λ v = C v (the eigenvalue problem for C)
● We can define the same computation in another dot product space F, related to the input space by a map Φ: R^d → F
(A minimal sketch of the linear case follows below.)
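Not part of the original slides: a minimal numpy sketch of the linear case just described, diagonalizing the covariance matrix and projecting onto the leading eigenvectors. The function name linear_pca and the toy data are illustrative only.

```python
import numpy as np

def linear_pca(X, n_components):
    """Linear PCA: diagonalize C = (1/N) sum_j x_j x_j^T and project onto the top eigenvectors."""
    Xc = X - X.mean(axis=0)                  # centered observations
    C = (Xc.T @ Xc) / Xc.shape[0]            # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues, orthonormal eigenvectors
    V = eigvecs[:, ::-1][:, :n_components]   # principal directions, largest variance first
    return Xc @ V                            # coordinates of the data along those directions

# Toy usage: 100 correlated points in 2 dimensions reduced to 1 component
X = np.random.default_rng(0).normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
Z = linear_pca(X, n_components=1)
```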
  15. Principle
Given a data set of N centered observations Φ(x_1), ..., Φ(x_N) in a high-dimensional space F:
● Covariance matrix in the new space: C_F = (1/N) Σ_j Φ(x_j) Φ(x_j)ᵀ
● Again, it is necessary to solve the following system of equations: λ V = C_F V
● This means that all solutions V with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_N), so V = Σ_i α_i Φ(x_i)
  16. Principle
● Combining the last three equations, we obtain an eigenvalue problem expressed entirely in terms of dot products (Φ(x_i) · Φ(x_j))
● We define a new function (the kernel): k(x_i, x_j) = (Φ(x_i) · Φ(x_j))
● and a new N x N matrix: K_ij = k(x_i, x_j)
● our equation becomes: N λ α = K α
  17. Principle
● Let λ_1 ≤ λ_2 ≤ ... ≤ λ_N denote the eigenvalues of K, and α^1, ..., α^N the corresponding eigenvectors, with λ_p the first nonzero eigenvalue. We then require that the corresponding vectors V^k = Σ_i α_i^k Φ(x_i) are normalized in F, i.e. (V^k · V^k) = 1, which translates into λ_k (α^k · α^k) = 1
● Encoding a data point y means computing its projections onto the V^k: (V^k · Φ(y)) = Σ_i α_i^k k(x_i, y)
  18. Algorithm
● Centralization: for a given data set, subtract the mean from all observations to obtain centered data.
● Finding principal components: compute the matrix K using the kernel function, then find its eigenvectors and eigenvalues.
● Encoding training/test data: project a point x onto the components via Σ_i α_i^k k(x_i, x). This can be done since the eigenvalues and eigenvectors have been calculated.
(A minimal numpy sketch of these steps follows below.)
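A minimal numpy sketch of these steps (Gram matrix, the feature-space centering adjustment from slide 20, eigendecomposition, normalization, projection). The degree-2 polynomial kernel and the y = x² + noise toy data follow the deck's simple example; the function name kernel_pca and all other choices are illustrative, not taken from the paper.

```python
import numpy as np

def kernel_pca(X, n_components, kernel):
    """Kernel PCA as on the slides: Gram matrix, centering, eigenproblem, normalization, projection."""
    N = X.shape[0]
    K = kernel(X, X)                                       # K_ij = k(x_i, x_j)
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n     # centering adjustment (see slide 20)
    eigvals, alphas = np.linalg.eigh(Kc)                   # eigenvalues in ascending order
    eigvals = eigvals[::-1][:n_components]                 # keep the largest eigenvalues
    alphas = alphas[:, ::-1][:, :n_components]             # and their eigenvectors alpha^k
    alphas = alphas / np.sqrt(eigvals)                     # enforce lambda_k (alpha^k . alpha^k) = 1
    return Kc @ alphas, alphas                             # projections of the training points, coefficients

# Toy usage on the deck's simple example: y = x^2 plus noise of sd 0.2, degree-2 polynomial kernel
poly2 = lambda A, B: (A @ B.T) ** 2
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
X = np.column_stack([x, x**2 + rng.normal(0.0, 0.2, 200)])
Z, alphas = kernel_pca(X, n_components=3, kernel=poly2)
```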
  19. Algorithm
● Reconstructing training data: the operation cannot be done exactly, because the eigenvectors do not have pre-images in the original space.
● Reconstructing a test data point: likewise, the operation cannot be done exactly, because the eigenvectors do not have pre-images in the original space.
  20. Disadvantages
● Centering in the original space does not mean centering in F; we need to adjust the K matrix as follows: K ← K − 1_N K − K 1_N + 1_N K 1_N, where 1_N is the N x N matrix with all entries equal to 1/N
● KPCA is now a parametric technique:
  ○ choice of a proper kernel function
    ■ Gaussian, sigmoid, polynomial
  ○ Mercer's theorem
    ■ k(x, y) must be continuous, symmetric, and positive semi-definite (xᵀ A x ≥ 0 for the Gram matrix A)
    ■ it guarantees that the kernel corresponds to a dot product in some feature space F (non-negative eigenvalues)
● Data reconstruction is not possible, unless an approximation (pre-image) formula is used
(A small numerical check of the centering adjustment follows below.)
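Not from the slides: a small numerical check of the centering adjustment, assuming a homogeneous degree-2 polynomial kernel whose feature map Φ(x) = (x1², x2², √2·x1·x2) can be written out explicitly, so the adjusted Gram matrix can be compared with explicit centering in F. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
N = X.shape[0]

# Explicit feature map for k(x, y) = (x . y)^2 in two dimensions
Phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

K = (X @ X.T) ** 2                                       # Gram matrix computed via the kernel
one_n = np.full((N, N), 1.0 / N)
K_adj = K - one_n @ K - K @ one_n + one_n @ K @ one_n    # the adjustment from the slide

Phi_c = Phi - Phi.mean(axis=0)                           # centering done explicitly in F
print(np.allclose(K_adj, Phi_c @ Phi_c.T))               # True: the adjustment equals centering in F

# Centering in the input space, by contrast, does not center the data in F:
Xc = X - X.mean(axis=0)
K_in = (Xc @ Xc.T) ** 2
print(np.allclose(K_in.mean(axis=0), 0.0))               # False: column means are not zero
```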
  21. Advantages
● Time complexity
  ○ we will return to this point later
● Handles non-linearly separable problems
● Extraction of more principal components than PCA
  ○ feature extraction vs. dimensionality reduction
  22. Experiments
● Applications
● Data sets
● Methods compared
● Assessment
● Experiments
● Results
  23. Applications
● Clustering
  ○ Density estimation
    ■ e.g. high correlation between features
  ○ De-noising
    ■ e.g. removing lighting effects from bright images
  ○ Compression
    ■ e.g. image compression
● Classification
  ○ e.g. categorisation
  24. Datasets
● Simple example 1: y = x² + C, with noise C of sd 0.1 and x drawn from a uniform distribution on [-1, 1] (unlabelled, 2 dimensions)
● Simple example 2 (three clusters): three Gaussians with sd 0.1 in [-1, 1] x [-0.5, 1] (unlabelled, 2 dimensions)
● Kernels: a circle and a square (unlabelled)
● De-noising: the eleven Gaussians with zero mean in [-1, 1] (unlabelled, 10 dimensions)
● Character recognition: USPS handwritten digits (labelled, 256 dimensions, 9298 digits)
  25. Experiments
1. Simple example 1: dataset y = x² + noise, uniform distribution, sd = 0.2; kernel: polynomial, degree 1 to 4.
2. USPS character recognition: dataset USPS; kernel PCA with a polynomial kernel of degree 1 to 7 and 32 to 2048 components (doubling each time); methods compared: five-layer neural networks, kernel PCA + SVM, linear PCA + SVM; the neural networks and SVMs use the best parameters for the task.
3. De-noising: dataset: the eleven Gaussians, sd = 0.1; methods compared: kernel autoencoders, principal curves, kernel PCA, linear PCA; best parameters for the task.
4. Kernels: radial basis function and sigmoid kernels; best parameters for the task.
  26. Methods
These are the methods used in the experiments:
● Dimensionality reduction (unsupervised): linear PCA, kernel PCA, kernel autoencoders, principal curves
● Classification (supervised): neural networks, SVM (linear and non-linear), kernel LDA (face recognition)
  27. Assessment
● 1. Accuracy: classification: exact classification; clustering: comparable to other clusters
● 2. Time complexity: the time to compute
● 3. Storage complexity: the storage of the data
● 4. Interpretability: how easy it is to understand
  28. Simple example
● Recreated example: dataset: the function y = x² + B, with noise B of sd 0.2 and x drawn from a uniform distribution on [-1, 1]; kernel: polynomial, degree 1 to 4; PCs: 1 to 3 (2D and 3D plots of the eigenvectors of highest eigenvalue). Result: accurate clustering of the non-linear features by kernel PCA.
● Example from the nonlinear PCA paper: dataset: the USPS handwritten digits; training set: 3000; classifier: SVM with dot-product kernel of degree 1 to 7; PCs: 32 to 2048.
  29. Character recognition
Dataset: the USPS handwritten digits; training set: 3000; classifier: SVM with dot-product (polynomial) kernel of degree 1 to 7; PCs: 32 to 2048.
● A linear classifier trained on non-linear components performs better than one trained on linear components.
● Performance improves over linear PCA as the number of components is increased.
Fig: results of the character recognition experiment.
  30. De-noising
Dataset: the de-noising eleven Gaussians; training set: 100; kernel: Gaussian, with sd parameter; PCs: 2.
De-noising is performed on the non-linear features of the distribution.
Fig: results of the de-noising experiment.
  31. Kernels
The choice of kernel regulates the accuracy of the algorithm and depends on the application. The kernels must be Mercer kernels, i.e. their Gram matrices are positive semi-definite.
Experiments:
● Radial basis function: dataset: three Gaussians, sd 0.1; kernel: k(x, y) = exp(-||x - y||² / 0.1); PCs: 1 to 8.
● Sigmoid: dataset: three Gaussians, sd 0.1; kernel: sigmoid (tanh); PCs: 1 to 3.
(Plain-function definitions of these kernels are sketched below.)
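For reference, the kernels used in the experiments written out as plain functions. The RBF width 0.1 and the polynomial degrees follow the slides; the sigmoid gain and offset are assumed placeholders, since the slide does not give them.

```python
import numpy as np

def polynomial_kernel(A, B, degree=4):
    """k(x, y) = (x . y)^d; degrees 1 to 7 are used in the experiments."""
    return (A @ B.T) ** degree

def rbf_kernel(A, B, width=0.1):
    """k(x, y) = exp(-||x - y||^2 / width); width 0.1 as on the slide."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / width)

def sigmoid_kernel(A, B, gain=1.0, offset=0.0):
    """k(x, y) = tanh(gain * (x . y) + offset); gain and offset are assumed values."""
    return np.tanh(gain * (A @ B.T) + offset)
```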
  32. Results
● RBF kernel (figure panels PC 1 to PC 8): PCs 1-2 separate the 3 clusters; PCs 3-5 halve the clusters; PCs 6-8 split them orthogonally. The clusters are split into 12 regions.
● Sigmoid kernel (figure panels PC 1 to PC 3): PCs 1-2 separate the 3 clusters; PC 3 halves the 3 clusters. The same number of PCs separates the clusters, but the sigmoid kernel needs fewer PCs to halve them.
  33. Results (summary)
● Experiment 1: kernel: polynomial, degree 4; components: 8 (split to 12); interpretability: very good.
● Experiment 2: kernel: polynomial, degree 4; components: 512; accuracy: 4.4; interpretability: very good.
● Experiment 3: kernel: Gaussian, 0.2; components: 2; interpretability: complicated.
● Experiment 4: kernel: sigmoid; components: 3 (split to 6); interpretability: very good.
  34. Discussions: KDA
Kernel Fisher Discriminant (KDA)
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller
● Best discriminant projection
http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png
  35. Discussions
Doing PCA in F rather than in R^d:
● The first k principal components carry more variance than any other k directions
● The mean squared approximation error incurred by using only the first k principal components is minimal
● The principal components are uncorrelated
  36. Discussions
Going into a higher dimensionality to obtain a lower dimensionality:
● Pick the right high-dimensional space
Need for a proper kernel:
● What kernel to use?
  ○ Gaussian, sigmoidal, polynomial
● Problem dependent
  37. Discussions
Time complexity:
● A lot of features (a lot of dimensions).
● KPCA still works!
  ○ It operates in the subspace of F spanned by the observed x's
  ○ No explicit computation of dot products in F
● Computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than plain dot products
  ○ (if the kernel is easy to compute)
  ○ e.g. polynomial kernels
Payback: a linear classifier can then be used.
(A quick dimension count for the USPS setting follows below.)
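To make the point concrete (this arithmetic is not on the slide): for 16x16 USPS images and a degree-5 polynomial kernel, the explicit space of monomial features has on the order of 10^10 dimensions, while KPCA only ever needs the N x N Gram matrix.

```python
from math import comb

d, p = 256, 5                        # 16x16 pixel images, polynomial kernel of degree 5
n_features = comb(d + p - 1, p)      # number of distinct degree-p monomials
print(f"explicit feature space: {n_features:.3e} dimensions")   # ~9.5e+09

N = 3000                             # training set size used in the experiments
print(f"kernel evaluations for K: {N * N:.3e}")                 # 9.0e+06 dot products
```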
  38. Discussions
● Pre-image reconstruction may be impossible; an approximation can be done in F.
● Reconstruction would need an explicit Φ; instead it can be approached as:
  ○ a regression learning problem
  ○ a non-linear optimization problem
  ○ an algebraic solution (rarely)
(A sketch of the Gaussian-kernel fixed-point approximation follows below.)
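For Gaussian kernels, reference [9] and the Mika et al. de-noising paper describe a fixed-point iteration for approximate pre-images. The sketch below is a minimal version of that idea for an uncentered kernel; the names (preimage_rbf, betas) and the starting guess are illustrative, and it does not claim to reproduce the papers' exact implementation.

```python
import numpy as np

def preimage_rbf(X, alphas, betas, width=0.1, n_iter=100):
    """Approximate the pre-image z of a point given by its kernel PCA coordinates.

    X      : (N, d) training data
    alphas : (N, k) normalized eigenvector coefficients alpha_i^k
    betas  : (k,)   projections of the target point onto the first k components
    The target in F is sum_i gamma_i Phi(x_i), with gamma_i = sum_k beta_k * alpha_i^k.
    """
    gamma = alphas @ betas                         # expansion coefficients in F
    z = X[np.argmax(np.abs(gamma))].copy()         # a simple starting guess
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / width)   # gamma_i * k(z, x_i)
        denom = w.sum()
        if abs(denom) < 1e-12:                     # guard against a degenerate denominator
            break
        z = (w[:, None] * X).sum(axis=0) / denom   # fixed-point update of z
    return z
```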
  39. Discussions
Interpretability:
● Cross-feature features
  ○ dependent on the kernel
● Reduced-space features
  ○ preserve the highest variance among the data in F
  40. Conclusions
Applications:
● Feature extraction (classification)
● Clustering
● De-noising
● Novelty detection
● Dimensionality reduction (compression)
  41-42. References
[1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G.E. Hinton and R.R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 313, 504-507, 2006.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher Discriminant Analysis with Kernels," 1999.
[4] T. Hastie and W. Stuetzle, "Principal Curves," Journal of the American Statistical Association, vol. 84, no. 406, pp. 502-516, Jun. 1989.
[5] G. Moser, "Analisi delle componenti principali" (Principal component analysis), Tecniche di trasformazione di spazi vettoriali per analisi statistica multi-dimensionale.
[6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
[7] Wikipedia, "Kernel Principal Component Analysis," 2011.
[8] A. Ghodsi, "Data Visualization," 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.
[10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces."
  43. Thank you
