Presentation of the Kernel PCA paper.

- Presentation of paper #7: "Nonlinear component analysis as a kernel eigenvalue problem," Schölkopf, Smola, Müller, Neural Computation 10, 1299-1319, MIT Press (1998). Group C: M. Filannino, G. Rates, U. Sandouk. COMP61021: Modelling and Visualization of high-dimensional data.
- Introduction. Kernel Principal Component Analysis (KPCA) is an extension of Principal Component Analysis: it computes PCA in a new feature space and is useful for feature extraction and dimensionality reduction. (A minimal usage sketch follows.)
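As a quick illustration (not part of the original slides), here is a minimal sketch of running KPCA in practice with scikit-learn's `KernelPCA`; the toy data and parameter values are placeholder assumptions:

```python
# Minimal KPCA usage sketch; dataset and parameters are illustrative.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))  # toy 2-D data (placeholder)

# RBF kernel; n_components and gamma are arbitrary placeholder choices
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)  # nonlinear principal components
print(X_kpca.shape)  # (300, 2)
```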
- Motivation, possible solutions: Principal Curves. Trevor Hastie and Werner Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516. ● Optimization (including the quality of data approximation) ● natural geometric meaning ● natural projection. (Image: http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png)
- Motivation, possible solutions: Autoencoders. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507. ● A feed-forward neural network ● approximates the identity function. (Image: http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png)
- Motivation, some new problems: ● low input dimensions ● problem dependent ● hard optimization problems.
- Motivation: the kernel trick. KPCA captures the overall variance of the patterns.
- Principle. "We are not interested in PCs in the input space; we are interested in PCs of features that are nonlinearly related to the original ones."
- Principle. Given a data set of N centered observations $x_1, \ldots, x_N$ in a d-dimensional space: ● PCA diagonalizes the covariance matrix $C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top$. ● It is necessary to solve the eigenvalue problem $\lambda v = C v$. ● We can define the same computation in another dot product space F, related to the input space by a (possibly nonlinear) map $\Phi: \mathbb{R}^d \to F$.
- Principle. Given a data set of N centered observations in the high-dimensional space F: ● the covariance matrix in the new space is $\bar{C} = \frac{1}{N}\sum_{j=1}^{N} \Phi(x_j)\Phi(x_j)^\top$. ● Again, it is necessary to solve the eigenvalue problem $\lambda V = \bar{C} V$. ● This means that all solutions V with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_N)$.
- Principle. ● Combining the last three equations, we can expand $V = \sum_{i=1}^{N} \alpha_i \Phi(x_i)$. ● We define a new function $k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))$ ● and a new N x N matrix $K_{ij} := k(x_i, x_j)$. ● Our equation becomes $N\lambda\alpha = K\alpha$. (The substitution is spelled out below.)
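Spelling out the step the slide compresses (following the paper's derivation), in LaTeX:

```latex
% Expand V in the span of the mapped data and substitute into
% \lambda V = \bar{C} V, projected onto each \Phi(x_k):
V = \sum_{i=1}^{N} \alpha_i \Phi(x_i), \qquad
\lambda \,(\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V)
\quad \text{for all } k.
% With K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)) this becomes
\lambda\, K \alpha = \tfrac{1}{N}\, K^2 \alpha,
% and it suffices to solve the eigenvalue problem
N \lambda\, \alpha = K \alpha.
```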
- Principle. ● Let $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_N$ denote the eigenvalues of K, and $\alpha^1, \ldots, \alpha^N$ the corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We then require that the corresponding eigenvectors are normalized in F, $(V^k \cdot V^k) = 1$, which translates to $\lambda_k (\alpha^k \cdot \alpha^k) = 1$. ● Encoding a data point y means computing its projections onto the eigenvectors: $(V^k \cdot \Phi(y)) = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, y)$.
- Algorithm. ● Centralization: for a given data set, subtract the mean from all observations to obtain centered data. ● Finding principal components: compute the matrix K using the kernel function, then find its eigenvectors and eigenvalues. ● Encoding training/testing data: project a point x onto the eigenvectors via $\sum_i \alpha_i^k\, k(x_i, x)$; this can be done since we have calculated the eigenvalues and eigenvectors.
- Algorithm. ● Reconstructing training data: the operation cannot be done exactly, because the eigenvectors do not have pre-images in the original space. ● Reconstructing a test data point: likewise impossible, for the same reason. (A from-scratch sketch of the whole algorithm follows.)
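The whole procedure fits in a few lines of NumPy. The following is a from-scratch sketch under stated assumptions: the RBF kernel choice, function names, and default parameters are ours, not the paper's:

```python
# From-scratch sketch of the KPCA algorithm described in the slides.
# Kernel choice (RBF), names, and defaults are illustrative assumptions.
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, n_components=2, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)

    # Center the kernel matrix in feature space F (see the next slide):
    # K~ = K - 1_N K - K 1_N + 1_N K 1_N, where (1_N)_ij = 1/N
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending

    # Normalize so that lambda_k (alpha^k . alpha^k) = 1
    # (assumes the leading eigenvalues are strictly positive)
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])

    # Encode the training data: projections onto the eigenvectors in F
    return Kc @ alphas
```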
- Disadvantages. ● Centering in the original space does not mean centering in F; we need to adjust the K matrix as follows: $\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$, where $(1_N)_{ij} := 1/N$. ● KPCA is now a parametric technique: ○ choice of a proper kernel function (Gaussian, sigmoid, polynomial); ○ Mercer's theorem: k(x,y) must be continuous, symmetric, and positive semi-definite ($x^\top A x \geq 0$), which guarantees non-negative eigenvalues. ● Data reconstruction is not possible, except via an approximation formula [1, 9, 10].
- Advantages. ● Time complexity (we will return to this point later). ● Handles non-linearly separable problems. ● Extraction of more principal components than PCA (feature extraction vs. dimensionality reduction).
- Experiments (outline): ● applications ● datasets ● methods compared ● assessment ● experiments ● results.
- Applications. ● Clustering: ○ density estimation (e.g. high correlation between features); ○ de-noising (e.g. removing lighting from bright images); ○ compression (e.g. image compression). ● Classification (e.g. categorisation).
- Datasets:

| Experiment | Dataset | Representation |
| --- | --- | --- |
| Simple example 1 | y = x² + noise (sd 0.1), x drawn uniformly from [-1, 1] | unlabelled, 2 dimensions |
| Simple example 2 | three clusters: three Gaussians with sd = 0.1 on [-1, 1] × [-0.5, 1] | unlabelled, 2 dimensions |
| Kernels | a circle and a square | unlabelled, 2 dimensions |
| De-noising | eleven Gaussians with zero mean, on [-1, 1] | unlabelled, 10 dimensions |
| USPS | handwritten digit recognition | labelled, 256 dimensions, 9298 digits |
- Experiments:
  1. Simple example 1. Dataset: y = x² + noise from the uniform distribution, sd = 0.2; polynomial kernels of degree 1-4.
  2. USPS character recognition. Dataset: USPS; kernel PCA with polynomial kernels of degree 1-7 and 32-2048 components (doubling). Methods compared: five-layer neural networks, SVM on kernel PCA, and SVM on linear PCA, each with the best parameters for the task.
  3. De-noising. Dataset: eleven Gaussians with sd = 0.1. Methods compared: kernel autoencoders, principal curves, kernel PCA, and linear PCA, each with the best parameters for the task.
  4. Kernels. Radial basis function and sigmoid kernels, with the best parameters for the task.
- Methods. These are the methods used in the experiments. ● Dimensionality reduction (unsupervised): linear PCA, kernel PCA, kernel autoencoders, principal curves. ● Classification (supervised): neural networks, SVM (linear and kernel), kernel LDA (face recognition).
- Assessment. 1. Accuracy: exact classification (for classification), comparability to other clusters (for clustering). 2. Time complexity: the time to compute. 3. Storage complexity: the storage of the data. 4. Interpretability: how easy the result is to understand.
- Simple example (recreated from the nonlinear PCA paper). Dataset: y = x² + B, with noise B of sd = 0.2 and x drawn from the uniform distribution on [-1, 1]; polynomial kernels of degree 1-4; principal components 1-3. The eigenvector of the highest eigenvalue found by kernel PCA gives an accurate clustering of the non-linear features. (A code reconstruction follows.)
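A hedged reconstruction of this toy experiment in plain NumPy; the sample size is our guess, while the noise sd (0.2) and interval follow the slide:

```python
# Toy experiment y = x^2 + noise, reconstructed from the slide.
# Sample size is an assumption; noise sd 0.2 and [-1, 1] are as stated.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = x ** 2 + rng.normal(0.0, 0.2, size=200)  # noise with sd 0.2
X = np.column_stack([x, y])

# Degree-2 polynomial kernel: k(a, b) = (a . b)^2
K = (X @ X.T) ** 2

# Center K and take the leading eigenvector, as in the algorithm above
N = len(X)
one_n = np.full((N, N), 1.0 / N)
Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
eigvals, eigvecs = np.linalg.eigh(Kc)
alpha1 = eigvecs[:, -1] / np.sqrt(eigvals[-1])  # top component
pc1 = Kc @ alpha1  # first nonlinear principal component of each point
```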
- Character recognition. Dataset: the USPS handwritten digits; training set: 3000; classifier: SVM on dot-product (polynomial) kernels of degree 1-7; principal components: 32-2048 (doubling). ● The performance of a linear classifier trained on non-linear components is better than on linear components. ● The performance improves over linear PCA as the number of components is increased. (Figure: results of the character recognition experiment.)
- De-noising. Dataset: the eleven-Gaussians de-noising set; training set: 100; Gaussian kernel with an sd parameter; 2 principal components. The de-noising acts on the non-linear features of the distribution. (Figure: result of the de-noising experiment.)
- Kernels. The choice of kernel regulates the accuracy of the algorithm and is dependent on the application; Mercer kernels yield valid Gram matrices. Experiments: ● Radial basis function: dataset of three Gaussians with sd 0.1; kernel $k(x, y) = \exp(-\lVert x - y \rVert^2 / 0.1)$; principal components 1-8. ● Sigmoid: dataset of three Gaussians with sd 0.1; principal components 1-3. (The kernel functions are written out in code below.)
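For concreteness, the three kernel families the slides mention, written as plain Python functions; the default parameter values are placeholders, not the slides' tuned settings:

```python
# Kernel functions compared in the slides; defaults are placeholders.
import numpy as np

def polynomial_kernel(x, y, degree=4):
    """k(x, y) = (x . y)^degree"""
    return (x @ y) ** degree

def rbf_kernel(x, y, sigma2=0.1):
    """k(x, y) = exp(-||x - y||^2 / sigma2)"""
    return np.exp(-np.sum((x - y) ** 2) / sigma2)

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa * (x . y) + theta)"""
    return np.tanh(kappa * (x @ y) + theta)
```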
- Results. ● RBF: PCs 1-2 separate the three clusters; PCs 3-5 halve the clusters; PCs 6-8 split them orthogonally; the clusters are split into 12 regions. ● Sigmoid: PCs 1-2 separate the three clusters; PC 3 halves the three clusters; the same number of PCs separates the clusters, and the sigmoid needs fewer PCs to halve them.
- Results summary:

| Criterion | Experiment 1 | Experiment 2 | Experiment 3 | Experiment 4 |
| --- | --- | --- | --- | --- |
| Kernel | Polynomial 4 | Polynomial 4 | Gaussian 0.2 | Sigmoid |
| Components | 8 (split to 12) | 512 | 2 | 3 (split to 6) |
| Accuracy | | 4.4 | | |
| Time | | | | |
| Space | | | | |
| Interpretability | Very good | Very good | Complicated | Very good |
- Discussions: KDA. Kernel Fisher Discriminant (KDA); Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller [3]. ● Best discriminant projection. (Image: http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png)
- Discussions. Doing PCA in F rather than in R^d: ● the first k principal components carry more variance than any other k directions; ● the mean squared error incurred by the first k principal components is minimal; ● the principal components are uncorrelated.
- Discussions. Going into a higher dimensionality to reach a lower dimensionality: ● pick the right high-dimensional space. Need for a proper kernel: ● which kernel to use (Gaussian, sigmoid, polynomial)? ● problem dependent.
- Discussions. Time complexity: ● a lot of features (a lot of dimensions), yet KPCA works: ○ it operates in the subspace of F spanned by the observed x's; ○ no explicit dot product calculation is needed. ● Computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products (if the kernel is easy to compute, e.g. polynomial kernels). Payback: we can then use a linear classifier.
- Discussions. Pre-image reconstruction may be impossible; an approximation can be done in F. Needing an explicit ϕ leads to: ● a regression learning problem; ● a non-linear optimization problem; ● an algebraic solution (rarely).
- Discussions. Interpretability: ● cross-feature features (dependent on the kernel); ● reduced-space features (preserve the highest variance among the data in F).
- Conclusions. Applications: ● feature extraction (classification) ● clustering ● de-noising ● novelty detection ● dimensionality reduction (compression).
- References:
  [1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
  [2] G.E. Hinton and R.R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 313, 504-507, 2006.
  [3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller (Kernel Fisher Discriminant).
  [4] T. Hastie and W. Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
  [5] G. Moser, "Analisi delle componenti principali" (Principal component analysis), Tecniche di trasformazione di spazi vettoriali per analisi statistica multi-dimensionale.
  [6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
  [7] Wikipedia, "Kernel Principal Component Analysis," 2011.
  [8] A. Ghodsi, "Data Visualization," 2006.
  [9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.
  [10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
  [11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
  [12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces."
- Thank you
