
- 1. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines Dr. C. Chandra Sekhar Speech and Vision Laboratory Dept. of Computer Science and Engineering Indian Institute of Technology Madras Chennai-600036 chandra@cse.iitm.ac.in Presented at ICAC-09, Tiruchirappalli on August 8, 2009 1
- 2. Outline of the Talk • Kernel methods for pattern analysis • Support vector machines for pattern classification • Multiple kernel learning • Representation and feature selection • Image classification 2
- 3. Image Classification (figure: example images from the categories Residential Interiors, Mountains, Military Vehicles, Sacred Places, Sunsets & Sunrises) 3
- 4. Pattern Analysis • Pattern: Any regularity, relation or structure in data or source of data • Pattern analysis: Automatic detection and characterization of relations in data • Statistical and machine learning methods for pattern analysis assume that the data is in vectorial form • Relations among data are expressed as – Classification rules – Regression functions – Cluster structures 4
- 5. Evolution of Pattern Analysis Methods • Stage I (1960’s): – Learning methods to detect linear relations among data – Perceptron models • Stage II (Mid 1980’s): – Learning methods to detect nonlinear patterns – Gradient descent methods: Local minima problem – Multilayer feedforward neural networks • Stage III (Mid 1990’s): – Kernel methods to detect nonlinear relations – Problem of local minima does not exist – Methods are applicable to non-vectorial data also • Graphs, sets, texts, strings, biosequences, trees – Support vector machines 5
- 6. Artificial Neural Networks 6
- 7. Neuron with Threshold Logic Activation Function • McCulloch-Pitts neuron (figure: input values x1 … xM, weights w1 … wM, activation a = Σ wi xi − θ, output signal s = f(a), with a threshold logic activation function) 7
- 8. Perceptron • Linearly separable classes: Regions of two classes are separable by a linear surface (line, plane or hyperplane) x2 x1 8
- 9. Hard Problems • Nonlinearly separable classes a2 a1 9
- 10. Neuron with Continuous Activation Function (figure: input values x1 … xM, weights w1 … wM, activation a = Σ wi xi − θ, output signal s = f(a), with a sigmoidal activation function) 10
- 11. Multilayer Feedforward Neural Network • Architecture: – Input layer: linear neurons – One or more hidden layers: nonlinear neurons – Output layer: linear or nonlinear neurons (figure: fully connected network with input layer x1 … xI, hidden layer of J nonlinear neurons, output layer s1 … sK) 11
- 12. Gradient Descent Method (figure: error as a function of weight, showing two local minima and a global minimum) 12
- 13. Artificial Neural Networks: Summary • Perceptrons, with threshold logic function as activation function, are suitable for linearly separable classes • Multilayer feedforward neural networks, with sigmoidal function as activation function, are suitable for nonlinearly separable classes – Complexity of the model depends on • Dimension of the input pattern vector • Number of classes • Shapes of the decision surfaces to be formed – Architecture of the model is empirically determined – A large number of training examples is required when the complexity of the model is high – Local minima problem – Suitable for vectorial type of data 13
- 14. Kernel Methods 14
- 15. Key Aspects of Kernel Methods • Kernel methods involve – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel – Detection of optimal linear solutions in the kernel feature space • Transformation to a higher dimensional space is expected to be helpful in conversion of nonlinear relations into linear relations (Cover’s theorem) – Nonlinearly separable patterns to linearly separable patterns – Nonlinear regression to linear regression – Nonlinear separation of clusters to linear separation of clusters • Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required. They involve computation of pair-wise inner-products only. • The pair-wise inner-products are computed efficiently directly from the original representation of data using a kernel function (Kernel trick) 15
- 16. Illustration of Transformation (figure: mapping Φ(x, y) = (x², y², √2·xy) from the 2-D input space to a 3-D feature space) 16
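The transformation on slide 16 can be checked numerically. The sketch below (names are mine, not from the slides) verifies the kernel trick for the polynomial kernel k(u, v) = (uᵀv)²: the kernel value computed in the 2-D input space equals a plain inner product in the 3-D feature space induced by Φ(x, y) = (x², y², √2·xy).

```python
import numpy as np

def phi(p):
    """Explicit feature map Phi(x, y) = (x^2, y^2, sqrt(2)*x*y)."""
    x, y = p
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def poly_kernel(u, v):
    """Degree-2 homogeneous polynomial kernel, computed in input space."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Kernel trick: k(u, v) in 2-D equals <phi(u), phi(v)> in 3-D.
assert np.isclose(poly_kernel(u, v), np.dot(phi(u), phi(v)))
```

This is why kernel methods never need the feature-space representation explicitly: only pairwise inner products are required, and those come directly from the kernel function.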
- 17. Support Vector Machines for Pattern Classification 17
- 18. Optimal Separating Hyperplane (figure: separating hyperplane in the feature space, axes z1 and z2) 18
- 19. Maximum Margin Hyperplane 19
- 20. Maximum Margin Hyperplane Margin = 1 / ||w|| Hyperplanes: wᵀΦ(x) + b = +1, wᵀΦ(x) + b = 0 (separating hyperplane), wᵀΦ(x) + b = −1 20
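The margin formula can be illustrated with a tiny separable problem (the data and the large-C setting below are my own choices, not from the slides). For a trained linear SVM, the distance from the separating hyperplane wᵀx + b = 0 to each supporting hyperplane wᵀx + b = ±1 is 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated by the vertical line x1 = 1; the geometric margin
# on each side should therefore be 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: hard-margin behaviour
w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)             # distance to w^T x + b = +/-1
```

Here the optimal hyperplane is x1 = 1, so w is (approximately) (1, 0) and the margin per side comes out close to 1.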
- 21.-30. (Slides 21-30: figures and equations only; no extractable text.)
- 31. Kernel Functions • Mercer kernels: Kernel functions that satisfy Mercer’s theorem • Kernel gram matrix: The matrix containing the values of the kernel function on all pairs of data points in the training data set • Kernel gram matrix should be positive semi-definite, i.e., the eigenvalues should be non-negative, for convergence of the iterative method used for solving the constrained optimization problem • Kernels for vectorial data: – Linear kernel – Polynomial kernel – Gaussian kernel 31
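The positive semi-definiteness condition on the Gram matrix stated above is easy to check numerically. The sketch below (data and the bandwidth gamma are illustrative choices of mine) builds the Gaussian-kernel Gram matrix for a random dataset and confirms that all eigenvalues are non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 points in 5 dimensions

# Gaussian (RBF) kernel Gram matrix: K[i, j] = exp(-gamma * ||xi - xj||^2).
gamma = 0.5
sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
K = np.exp(-gamma * D2)

# Mercer condition in practice: eigenvalues non-negative (up to round-off).
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10
```

The same check applies to any candidate kernel on any data: if some eigenvalue of the Gram matrix is clearly negative, the function is not a Mercer kernel on that data.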
- 32. Kernel Functions • Kernels for non-vectorial data: – Kernels on graphs – Kernels on sets – Kernels for texts • Vector space kernels • Semantic kernels • Latent semantic kernels – Kernels on strings • String kernels • Trie-based kernels – Kernels on trees – Kernels from generative models • Fisher kernels – Kernels on images • Hausdorff kernels • Histogram intersection kernels • Kernels learnt from data • Kernels optimized for data 32
- 33.-35. (Slides 33-35: figures only; no extractable text.)
- 36. Applications of Kernel Methods • Genomics and computational biology • Knowledge discovery in clinical microarray data analysis • Recognition of white blood cells of Leukaemia • Classification of interleaved human brain tasks in fMRI • Image classification and retrieval • Hyperspectral image classification • Perceptual representations for image coding • Complex SVM approach to OFDM coherent demodulation • Smart antenna array processing • Speaker verification • Kernel CCA for learning the semantics of text G.Camps-Valls, J.L.Rojo-Alvarez and M.Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007 36
- 37. Work on Kernel Methods at IIT Madras Focus: Design of suitable kernels and development of kernel methods for speech, image and video processing tasks Tasks: • Speech recognition • Speech segmentation • Speaker segmentation • Voice activity detection • Handwritten character recognition • Face detection • Image classification • Video shot boundary detection • Time series data processing 37
- 38. Kernel Methods: Summary • Kernel methods involve – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel – Construction of optimal linear solutions in the kernel feature space • Complexity of model is not dependent on the dimension of data • Models can be trained with small size datasets • Kernel methods can be used for non-vectorial type of data also • Performance of kernel methods is dependent on the choice of the kernel. Design of a suitable kernel is a research problem. • Approaches to the design of kernels by learning the kernels from data or by constructing the optimized kernels for a given data are being explored 38
- 39. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines Dileep A.D. and C.Chandra Sekhar, “Representation and feature selection using multiple kernel learning”, in Proceedings of International Joint Conference on Neural Networks, vol.1, pp.717-722, Atlanta, USA, June 2009. 39
- 40. Representation and Feature Selection • Early fusion for combining multiple representations (figure: representations 1 … m concatenated and fed to a single classifier) • Feature Selection – Compute a class-separability criterion function – Rank features in decreasing order of the criterion function – Select the features with the best values of the criterion function • The multiple kernel learning (MKL) approach provides an integrated framework that performs both representation selection and feature selection 40
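The rank-and-select procedure above can be sketched concretely. The slides do not name a specific separability criterion, so the example below uses the Fisher score (ratio of between-class to within-class variance per feature) as one common choice; the data is synthetic, with feature 0 deliberately made informative.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: between-class variance / within-class variance."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 4))
X[:, 0] += 3 * y                      # make feature 0 class-discriminative

# Rank features in decreasing order of the criterion, then keep the best.
ranked = np.argsort(fisher_score(X, y))[::-1]
assert ranked[0] == 0                 # the informative feature ranks first
```

MKL replaces this two-stage rank-then-select pipeline with a single optimization, as the following slides describe.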
- 41. Multiple Kernel Learning (MKL) • A new kernel can be obtained as a linear combination of many base kernels – Each of the base kernels is a Mercer kernel: k(xi, xj) = Σl γl kl(xi, xj) – The non-negative coefficient γl represents the weight of the lth base kernel • Issues – Choice of the base kernels – Technique for determining the values of the weights 41
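A key property behind this construction is that a non-negative combination of Mercer kernels is again a Mercer kernel. The sketch below (the three base kernels and the weights are illustrative, not learned) verifies this on random data by checking that the combined Gram matrix stays positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))

# Three base Gram matrices: linear, polynomial, and Gaussian kernels.
K_linear = X @ X.T
K_poly = (1 + X @ X.T) ** 2
sq = np.sum(X ** 2, axis=1)
K_rbf = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# Combined kernel k = sum_l gamma_l * k_l with gamma_l >= 0
# (weights fixed by hand here; MKL would learn them from data).
gammas = np.array([0.2, 0.3, 0.5])
K = gammas[0] * K_linear + gammas[1] * K_poly + gammas[2] * K_rbf

assert np.linalg.eigvalsh(K).min() > -1e-8   # still positive semi-definite
```

Because the combination preserves the Mercer property, the combined kernel can be dropped into any standard SVM solver unchanged.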
- 42. MKL Framework for SVM • In the SVM framework, the MKL task is treated as optimizing the kernel weights while training the SVM • The optimization problem of the SVM (dual problem) is given as: max over α of Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj Σ_{l=1}^{m} γl kl(xi, xj) s.t. Σ_{i=1}^{n} yi αi = 0, 0 ≤ αi ≤ C for i = 1, …, n, γl ≥ 0 for l = 1, …, m • Both the Lagrange coefficients αi and the base kernel weights γl are determined by solving the optimization problem 42
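The inner step of this framework, training an SVM on a weighted combination of base kernels, can be sketched with scikit-learn's precomputed-kernel interface. Note the hedge: a true MKL solver optimizes the weights γl jointly with the dual variables αi; here the weights are fixed by hand purely to show the mechanics, and the data is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)

def gram(A, B, w=(0.5, 0.5)):
    """Weighted combination of a linear and a Gaussian base kernel."""
    lin = A @ B.T
    sq_a = np.sum(A ** 2, axis=1)[:, None]
    sq_b = np.sum(B ** 2, axis=1)[None, :]
    rbf = np.exp(-0.5 * (sq_a + sq_b - 2 * A @ B.T))
    return w[0] * lin + w[1] * rbf

# Pass the combined Gram matrix to the SVM instead of raw feature vectors.
clf = SVC(kernel="precomputed", C=1.0).fit(gram(X, X), y)
train_acc = clf.score(gram(X, X), y)
```

In an actual MKL solver the outer loop would update the weights (e.g. by gradient steps on the dual objective) and re-solve this inner SVM until convergence.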
- 43. Representation and Feature Selection using MKL • A base kernel is computed for each representation of data • MKL method is used to determine the weights for the base kernels • Base kernels with non-zero weights indicate the relevant representations • Similarly, a base kernel is computed for each feature in a representation • MKL method is used to select the relevant features with non-zero weights of the corresponding base kernels 43
- 44. Studies on Image Categorization • Dataset: – Data of 5 classes in Corel image database – Number of images per class: 100 (Training: 90, Test: 10, 10-fold evaluation) Class 1 [African specialty animals] Class 2 [Alaskan wildlife] Class 3 [Arabian horses] Class 4 [North American wildlife] Class 5 [Wild life Galapagos] 44 44
- 45. Representations (information type: representation, tiling, dimension)
Colour: DCT coefficients of average colour in 20 x 20 grid, global, dim 12
Colour: CIE L*a*b* colour coordinates of two dominant colour clusters, global, dim 6
Shape: Magnitude of the 16 x 16 FFT of Sobel edge image, global, dim 128
Colour and Structure: Haar transform of quantized HSV colour histogram, global, dim 256
Shape: MPEG-7 edge histogram, 4 x 4, dim 80
Colour: Three first central moments of CIE L*a*b* colour distribution, 5, dim 45
Colour: Average CIE L*a*b* colour, 5, dim 15
Shape: Co-occurrence matrix of four Sobel edge directions, 5, dim 80
Shape: Histogram of four Sobel edge directions, 5, dim 20
Texture: Texture feature based on histogram of relative brightness of neighbouring pixels, 5, dim 40 45
- 46. Classification Performance for Different Representations • Classification accuracy (in %) of SVM based classifier (linear kernel / Gaussian kernel):
DCT coefficients of average colour in 20 x 20 grid: 60.8 / 69.2
CIE L*a*b* colour coordinates of two dominant colour clusters: 60.8 / 70.2
Magnitude of the 16 x 16 FFT of Sobel edge image: 46.6 / 57.6
Haar transform of quantized HSV colour histogram: 84.2 / 86.8
MPEG-7 edge histogram: 53.8 / 53.4
Three first central moments of CIE L*a*b* colour distribution: 83.8 / 85.6
Average CIE L*a*b* colour: 68.2 / 78.6
Co-occurrence matrix of four Sobel edge directions: 44.0 / 49.8
Histogram of four Sobel edge directions: 36.8 / 49.2
Texture feature based on histogram of relative brightness of neighbouring pixels: 46.0 / 51.6 46
- 47. Class-wise Performance • Gaussian kernel SVM based classifier; classification accuracy (in %) for Class1 / Class2 / Class3 / Class4 / Class5:
DCT coefficients of average colour in 20 x 20 grid: 77 / 40 / 97 / 74 / 73
CIE L*a*b* colour coordinates of two dominant colour clusters: 78 / 39 / 96 / 65 / 80
Magnitude of the 16 x 16 FFT of Sobel edge image: 53 / 29 / 83 / 61 / 72
Haar transform of quantized HSV colour histogram: 86 / 80 / 99 / 90 / 92
MPEG-7 edge histogram: 38 / 31 / 82 / 44 / 56
Three first central moments of CIE L*a*b* colour distribution: 86 / 79 / 100 / 86 / 80
Average CIE L*a*b* colour: 75 / 52 / 96 / 81 / 84
Co-occurrence matrix of four Sobel edge directions: 59 / 30 / 77 / 41 / 52
Histogram of four Sobel edge directions: 49 / 22 / 71 / 24 / 55
Texture feature based on histogram of relative brightness of neighbouring pixels: 40 / 33 / 89 / 43 / 52 47
- 48. Performance for Combination of All Representations (figure: all ten representations concatenated into a supervector and fed to one classifier) • Supervector obtained using all the representations (dim 682): classification accuracy 90.40% with linear kernel, 91.20% with Gaussian kernel • Supervector obtained using all representations gives a better performance than any one representation • Dimension of the combined representation is high 48
- 49. Representations Selected using MKL • Linear kernel as base kernel; weights of the representations with non-zero base-kernel weight:
Magnitude of the 16 x 16 FFT of Sobel edge image (dim 128): 0.1170
Haar transform of quantized HSV colour histogram (dim 256): 0.5627
MPEG-7 edge histogram (dim 80): 0.1225
Three first central moments of CIE L*a*b* colour distribution (dim 45): 0.1615
Co-occurrence matrix of four Sobel edge directions (dim 80): 0.0261
Texture feature based on histogram of relative brightness of neighbouring pixels (dim 40): 0.0101
(The remaining four representations received zero weight.) 49
- 50. Representations Selected using MKL • Gaussian kernel as base kernel; weights of the representations with non-zero base-kernel weight:
CIE L*a*b* colour coordinates of two dominant colour clusters (dim 6): 0.0064
Magnitude of the 16 x 16 FFT of Sobel edge image (dim 128): 0.0030
Haar transform of quantized HSV colour histogram (dim 256): 0.6045
MPEG-7 edge histogram (dim 80): 0.1535
Average CIE L*a*b* colour (dim 15): 0.2326
(The remaining five representations received zero weight.) 50
- 51. Performance for Different Methods of Combining Representations • Classification accuracy (in %):
Supervector of all the representations: linear kernel, dim 682, 90.4; Gaussian kernel, dim 682, 91.2
MKL approach: linear kernel, dim 682, 92.0; Gaussian kernel, dim 682, 92.4
Supervector of representations with non-zero weights determined using the MKL method: linear kernel, dim 629, 91.8; Gaussian kernel, dim 485, 93.0
• Supervector obtained using representations with non-zero weights gives comparable performance to the MKL approach, with a significantly reduced dimension of the combined representation 51
- 52. Feature Selection using MKL • Representations and features selected using MKL with Gaussian kernel as base kernel • Dimension (actual features / selected features) and classification accuracy in % (actual features / weighted features):
CIE L*a*b* colour coordinates of two dominant colour clusters: dim 6 / 6, accuracy 70.2 / 70.6
Magnitude of the 16 x 16 FFT of Sobel edge image: dim 128 / 62, accuracy 57.6 / 60.4
Haar transform of quantized HSV colour histogram: dim 256 / 63, accuracy 86.8 / 88.0
MPEG-7 edge histogram: dim 80 / 53, accuracy 53.4 / 53.8
Average CIE L*a*b* colour: dim 15 / 14, accuracy 78.6 / 79.0 52
- 53. Performance of Different Methods for Feature Selection
Supervector of representations with non-zero weight values determined using the MKL method: dim 485, classification accuracy 93.0%
Supervector of features with non-zero weight values determined using the MKL method: dim 198, classification accuracy 93.0% 53
- 54. Multiple Kernel Learning: Summary • Multiple kernel learning (MKL) approach is useful for combining representations as well as selecting features, with the weights of representations/features determined automatically • Selection of representations and features using the MKL approach is studied for an image categorization task • Combining representations using the MKL approach performs marginally better than the combination of all representations • Dimensionality of the feature vector obtained using the MKL approach is significantly lower 54
- 55. Text Books
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
3. Satish Kumar, Neural Networks - A Class Room Approach, Tata McGraw-Hill, 2004.
4. Simon Haykin, Neural Networks - A Comprehensive Foundation, 2nd Edition, Prentice Hall, 1999.
5. John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
6. B. Scholkopf and A. J. Smola, Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
7. R. Herbrich, Learning Kernel Classifiers - Theory and Algorithms, MIT Press, 2002.
8. B. Scholkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.
9. G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
10. F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis - Theory and Applications, Springer, 2008.
Web resource: www.kernelmachines.org 55
- 56. Thank You 56
