Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines

  1. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines
     Dr. C. Chandra Sekhar
     Speech and Vision Laboratory
     Dept. of Computer Science and Engineering
     Indian Institute of Technology Madras, Chennai-600036
     chandra@cse.iitm.ac.in
     Presented at ICAC-09, Tiruchirappalli, on August 8, 2009
  2. Outline of the Talk
     • Kernel methods for pattern analysis
     • Support vector machines for pattern classification
     • Multiple kernel learning
     • Representation and feature selection
     • Image classification
  3. Image Classification
     [Example images from five categories: Residential Interiors, Mountains, Military Vehicles, Sacred Places, Sunsets & Sunrises]
  4. Pattern Analysis
     • Pattern: any regularity, relation or structure in data or in the source of data
     • Pattern analysis: automatic detection and characterization of relations in data
     • Statistical and machine learning methods for pattern analysis assume that the data is in vectorial form
     • Relations among data are expressed as
       – Classification rules
       – Regression functions
       – Cluster structures
  5. Evolution of Pattern Analysis Methods
     • Stage I (1960s):
       – Learning methods to detect linear relations among data
       – Perceptron models
     • Stage II (mid 1980s):
       – Learning methods to detect nonlinear patterns
       – Gradient descent methods: local minima problem
       – Multilayer feedforward neural networks
     • Stage III (mid 1990s):
       – Kernel methods to detect nonlinear relations
       – The problem of local minima does not exist
       – Methods are applicable to non-vectorial data also: sets, texts, strings, biosequences, graphs, trees
       – Support vector machines
  6. Artificial Neural Networks
  7. Neuron with Threshold Logic Activation Function
     [Figure: a neuron with inputs x_1, ..., x_M, weights w_1, ..., w_M, activation value a, and output signal s]
     Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
     Output signal: s = f(a), where f is the threshold logic activation function
     • McCulloch-Pitts neuron
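     As a concrete illustration of the neuron above, here is a minimal Python sketch of a McCulloch-Pitts style threshold-logic neuron. The AND-gate weights and threshold are illustrative choices, not values from the talk.

```python
# A minimal sketch of the threshold-logic neuron described on this slide.
# The weights, threshold, and inputs are illustrative, not from the talk.
import numpy as np

def threshold_neuron(x, w, theta):
    """Output s = f(a), with a = sum_i w_i * x_i - theta and
    f the threshold logic (step) activation function."""
    a = np.dot(w, x) - theta        # activation value
    return 1 if a >= 0 else 0       # output signal

# Example: a 2-input neuron realizing logical AND (hypothetical parameters)
w = np.array([1.0, 1.0])
theta = 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, threshold_neuron(np.array(x, dtype=float), w, theta))
```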
  8. Perceptron
     • Linearly separable classes: regions of two classes are separable by a linear surface (line, plane or hyperplane)
     [Figure: two linearly separable classes in the (x_1, x_2) plane]
  9. Hard Problems
     • Nonlinearly separable classes
     [Figure: two nonlinearly separable classes in the (a_1, a_2) plane]
 10. Neuron with Continuous Activation Function
     [Figure: the same neuron structure as before, with a sigmoidal activation function]
     Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
     Output signal: s = f(a), where f is a sigmoidal activation function
 11. Multilayer Feedforward Neural Network
     • Architecture:
       – Input layer: linear neurons
       – One or more hidden layers: nonlinear neurons
       – Output layer: linear or nonlinear neurons
     [Figure: a fully connected network with input layer x_1 ... x_I, hidden layer neurons 1 ... J, and output layer s_1 ... s_K]
 12. Gradient Descent Method
     [Figure: error as a function of a weight, showing two local minima and a global minimum]
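     The local-minima behaviour sketched in the figure can be reproduced in a few lines. Below is a minimal gradient descent sketch on a hypothetical one-dimensional error surface; the error function, learning rate, and starting points are all illustrative.

```python
# A minimal sketch of gradient descent on a one-dimensional error surface,
# illustrating the local-minima problem noted on this slide.
import numpy as np

def error(w):
    return np.sin(3 * w) + 0.5 * w**2      # a surface with several minima

def grad(w):
    return 3 * np.cos(3 * w) + w           # derivative of error(w)

def gradient_descent(w0, lr=0.05, steps=200):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)                   # move against the gradient
    return w

# Different starting weights can converge to different minima.
for w0 in (-2.0, 0.5, 2.0):
    print(f"start {w0:+.1f} -> minimum near w = {gradient_descent(w0):+.3f}")
```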
 13. Artificial Neural Networks: Summary
     • Perceptrons, with the threshold logic function as activation function, are suitable for linearly separable classes
     • Multilayer feedforward neural networks, with the sigmoidal function as activation function, are suitable for nonlinearly separable classes
       – Complexity of the model depends on
         • Dimension of the input pattern vector
         • Number of classes
         • Shapes of the decision surfaces to be formed
       – Architecture of the model is empirically determined
       – A large number of training examples is required when the complexity of the model is high
       – Local minima problem
       – Suitable for vectorial data only
 14. Kernel Methods
 15. Key Aspects of Kernel Methods
     • Kernel methods involve
       – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
       – Detection of optimal linear solutions in the kernel feature space
     • Transformation to a higher dimensional space is expected to help convert nonlinear relations into linear relations (Cover's theorem):
       – Nonlinearly separable patterns to linearly separable patterns
       – Nonlinear regression to linear regression
       – Nonlinear separation of clusters to linear separation of clusters
     • Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required; they involve computation of pairwise inner products only
     • The pairwise inner products are computed efficiently, directly from the original representation of the data, using a kernel function (the kernel trick)
 16. Illustration of Transformation
     A point X = (x, y) in the input space is mapped to the feature space by
     Φ(X) = (x², y², √2·xy)
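     This mapping makes the kernel trick easy to verify numerically: the inner product of the transformed points equals the degree-2 polynomial kernel (u·v)² evaluated directly on the inputs. A minimal sketch, with illustrative sample points:

```python
# Verifying the kernel trick for the mapping on this slide:
# for Phi(x, y) = (x^2, y^2, sqrt(2)*x*y), the feature-space inner product
# equals the polynomial kernel k(u, v) = (u . v)^2 on the original inputs.
import numpy as np

def phi(p):
    x, y = p
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

rng = np.random.default_rng(0)
u, v = rng.standard_normal(2), rng.standard_normal(2)

feature_space = np.dot(phi(u), phi(v))   # explicit transformation
kernel_trick  = np.dot(u, v) ** 2        # kernel function on the inputs
print(np.isclose(feature_space, kernel_trick))  # True
```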
 17. Support Vector Machines for Pattern Classification
 18. Optimal Separating Hyperplane
     [Figure: two classes in the (z_1, z_2) plane separated by a hyperplane]
 19. Maximum Margin Hyperplane
 20. Maximum Margin Hyperplane
     Margin = 1 / ||w||
     [Figure: the separating hyperplane and its margins, defined by]
        wᵀΦ(x) + b = +1
        wᵀΦ(x) + b = 0
        wᵀΦ(x) + b = −1
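     For reference, a maximum-margin classifier of this form can be trained with an off-the-shelf SVM implementation. A minimal sketch using scikit-learn's SVC on toy two-class data (illustrative, not the image data used later in the talk):

```python
# A minimal sketch of training a maximum-margin classifier with an SVM.
# The toy data below is illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((20, 2)) - 2,
               rng.standard_normal((20, 2)) + 2])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # soft-margin trade-off controlled by C
clf.fit(X, y)
# Width between the two margin hyperplanes is 2/||w||
print("margin width =", 2 / np.linalg.norm(clf.coef_))
print("number of support vectors:", clf.support_vectors_.shape[0])
```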
 21-30. [Figure-only slides; no text content was recoverable]
 31. Kernel Functions
     • Mercer kernels: kernel functions that satisfy Mercer's theorem
     • Kernel Gram matrix: the matrix containing the values of the kernel function on all pairs of data points in the training data set
     • The kernel Gram matrix should be positive semi-definite, i.e., its eigenvalues should be non-negative, for convergence of the iterative method used to solve the constrained optimization problem
     • Kernels for vectorial data:
       – Linear kernel
       – Polynomial kernel
       – Gaussian kernel
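     The positive semi-definiteness requirement can be checked numerically by inspecting the eigenvalues of the Gram matrix. A minimal sketch, assuming a Gaussian base kernel and illustrative random data:

```python
# Checking the Mercer (positive semi-definite) property of a kernel Gram
# matrix via its eigenvalues, as described on this slide.
import numpy as np

def gaussian_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

X = np.random.default_rng(0).standard_normal((50, 4))
K = gaussian_gram(X)

eigvals = np.linalg.eigvalsh(K)   # K is symmetric
print("smallest eigenvalue:", eigvals.min())          # >= 0 up to round-off
print("positive semi-definite:", eigvals.min() > -1e-10)
```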
 32. Kernel Functions
     • Kernels for non-vectorial data:
       – Kernels on graphs
       – Kernels on sets
       – Kernels for texts
         • Vector space kernels
         • Semantic kernels
         • Latent semantic kernels
       – Kernels on strings
         • String kernels
         • Trie-based kernels
       – Kernels on trees
       – Kernels from generative models
         • Fisher kernels
       – Kernels on images
         • Hausdorff kernels
         • Histogram intersection kernels
     • Kernels learnt from data
     • Kernels optimized for data
 33-35. [Figure-only slides; no text content was recoverable]
 36. Applications of Kernel Methods
     • Genomics and computational biology
     • Knowledge discovery in clinical microarray data analysis
     • Recognition of white blood cells of leukaemia
     • Classification of interleaved human brain tasks in fMRI
     • Image classification and retrieval
     • Hyperspectral image classification
     • Perceptual representations for image coding
     • Complex SVM approach to OFDM coherent demodulation
     • Smart antenna array processing
     • Speaker verification
     • Kernel CCA for learning the semantics of text
     [G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007]
 37. Work on Kernel Methods at IIT Madras
     Focus: design of suitable kernels and development of kernel methods for speech, image and video processing tasks
     Tasks:
     • Speech recognition
     • Speech segmentation
     • Speaker segmentation
     • Voice activity detection
     • Handwritten character recognition
     • Face detection
     • Image classification
     • Video shot boundary detection
     • Time series data processing
 38. Kernel Methods: Summary
     • Kernel methods involve
       – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
       – Construction of optimal linear solutions in the kernel feature space
     • Complexity of the model does not depend on the dimension of the data
     • Models can be trained with small datasets
     • Kernel methods can also be used for non-vectorial data
     • Performance of kernel methods depends on the choice of the kernel; the design of a suitable kernel is a research problem
     • Approaches to kernel design that learn the kernel from data, or construct a kernel optimized for the given data, are being explored
 39. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines
     Dileep A. D. and C. Chandra Sekhar, "Representation and feature selection using multiple kernel learning", in Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 717-722, Atlanta, USA, June 2009.
 40. Representation and Feature Selection
     • Early fusion for combining multiple representations: the m feature vectors of a pattern (Representation 1, Representation 2, ..., Representation m) are concatenated into a single supervector that is given to the classifier (see the sketch after this slide)
     • Feature selection:
       – Compute a class-separability criterion function
       – Rank features in decreasing order of the criterion function
       – Features corresponding to the best values of the criterion function are then selected
     • The multiple kernel learning (MKL) approach provides an integrated framework to perform both representation selection and feature selection
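     Early fusion itself amounts to concatenating the per-representation feature vectors. A minimal sketch, with hypothetical dimensions standing in for the representations listed later in the talk:

```python
# A minimal sketch of early fusion: feature vectors from several
# representations of the same image are concatenated into one supervector
# before classification. All dimensions below are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_images = 10

# Hypothetical per-image feature vectors from three representations
colour_hist = rng.random((n_images, 256))   # e.g. a colour histogram
edge_hist   = rng.random((n_images, 80))    # e.g. an edge histogram
texture     = rng.random((n_images, 40))    # e.g. texture features

supervector = np.hstack([colour_hist, edge_hist, texture])
print(supervector.shape)   # (10, 376): one concatenated vector per image
```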
 41. Multiple Kernel Learning (MKL)
     • A new kernel can be obtained as a linear combination of many base kernels
       – Each of the base kernels is a Mercer kernel:
            k(x_i, x_j) = Σ_l γ_l k_l(x_i, x_j)
       – The non-negative coefficient γ_l represents the weight of the l-th base kernel
     • Issues
       – Choice of the base kernel
       – Technique for determining the values of the weights
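     A minimal sketch of forming such a combined kernel, assuming linear base kernels over three hypothetical representations and illustrative weights:

```python
# Forming an MKL combined kernel as on this slide: a weighted sum of base
# Gram matrices, one per representation. Data and weights are illustrative.
import numpy as np

def linear_gram(X):
    """Gram matrix of the linear base kernel k_l(x_i, x_j) = x_i . x_j."""
    return X @ X.T

rng = np.random.default_rng(0)
# One feature matrix per representation of the same 10 patterns
reps = [rng.random((10, d)) for d in (12, 6, 128)]

base_grams = [linear_gram(X) for X in reps]
gamma = np.array([0.5, 0.3, 0.2])           # non-negative kernel weights

K = sum(g * G for g, G in zip(gamma, base_grams))
print(K.shape)   # (10, 10) combined Gram matrix
```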
 42. MKL framework for SVM
     • In the SVM framework, the MKL task is considered as a way of optimizing the kernel weights while training the SVM
     • The optimization problem of the SVM (dual problem) is given as:

          max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j Σ_{l=1}^{m} γ_l k_l(x_i, x_j)

          s.t.   Σ_{i=1}^{n} y_i α_i = 0
                 0 ≤ α_i ≤ C,   i = 1, ..., n
                 γ_l ≥ 0,       l = 1, ..., m

     • Both the Lagrange coefficients α_i and the base kernel weights γ_l are determined by solving the optimization problem
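     The paper's actual solver is not reproduced here. As a rough illustration of the idea, the sketch below alternates between training an SVM on the combined precomputed kernel with the weights γ_l fixed, and re-weighting each base kernel by its contribution to the margin term of the dual objective. This heuristic and all names in it (mkl_alternating, the toy data) are assumptions for illustration only.

```python
# An illustrative alternating-optimization heuristic for the MKL problem
# above, NOT the exact algorithm used in the paper.
import numpy as np
from sklearn.svm import SVC

def mkl_alternating(base_grams, y, C=1.0, iters=10):
    m = len(base_grams)
    gamma = np.full(m, 1.0 / m)             # start with uniform weights
    for _ in range(iters):
        K = sum(g * G for g, G in zip(gamma, base_grams))
        svm = SVC(kernel="precomputed", C=C).fit(K, y)
        sv, alpha_y = svm.support_, svm.dual_coef_[0]   # alpha_i * y_i
        # Weight each base kernel by its share of the dual margin term
        contrib = np.array([alpha_y @ G[np.ix_(sv, sv)] @ alpha_y
                            for G in base_grams])
        contrib = np.maximum(contrib, 0)
        if contrib.sum() > 0:
            gamma = contrib / contrib.sum()
    return gamma, svm

# Toy usage with random linear base kernels (illustrative only)
rng = np.random.default_rng(0)
grams = [X @ X.T for X in (rng.random((20, d)) for d in (5, 8, 3))]
y = np.array([0] * 10 + [1] * 10)
gamma, _ = mkl_alternating(grams, y)
print("learned kernel weights:", np.round(gamma, 3))
```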
 43. Representation and Feature Selection using MKL
     • A base kernel is computed for each representation of the data
     • The MKL method is used to determine the weights for the base kernels
     • Base kernels with non-zero weights indicate the relevant representations
     • Similarly, a base kernel is computed for each feature in a representation
     • The MKL method is used to select the relevant features, i.e., those whose corresponding base kernels have non-zero weights (a sketch follows)
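     A minimal sketch of the feature-selection variant, assuming one one-dimensional Gaussian base kernel per feature and reusing the illustrative mkl_alternating() from the previous sketch; the threshold for "non-zero" weight is an arbitrary choice.

```python
# Feature selection via MKL as described on this slide: one base kernel
# per feature, then keep the features with non-zero learned weights.
# Assumes mkl_alternating() from the previous sketch is in scope.
import numpy as np

def per_feature_grams(X, gamma_rbf=0.5):
    """One Gaussian base Gram matrix per feature dimension."""
    grams = []
    for f in range(X.shape[1]):
        d2 = (X[:, f, None] - X[None, :, f]) ** 2
        grams.append(np.exp(-gamma_rbf * d2))
    return grams

rng = np.random.default_rng(0)
X = rng.random((20, 6))
y = np.array([0] * 10 + [1] * 10)

weights, _ = mkl_alternating(per_feature_grams(X), y)
selected = [f for f, w in enumerate(weights) if w > 1e-3]  # non-zero weights
print("selected features:", selected)
```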
 44. Studies on Image Categorization
     • Dataset:
       – Data of 5 classes in the Corel image database
       – Number of images per class: 100 (training: 90, test: 10; 10-fold evaluation)
     • Classes:
       – Class 1: African specialty animals
       – Class 2: Alaskan wildlife
       – Class 3: Arabian horses
       – Class 4: North American wildlife
       – Class 5: Wildlife Galapagos
 45. Representations
     Information          | Representation                                                                    | Tiling | Dimension
     Colour               | DCT coefficients of average colour in 20 x 20 grid                                | Global | 12
     Colour               | CIE L*a*b* colour coordinates of two dominant colour clusters                     | Global | 6
     Shape                | Magnitude of the 16 x 16 FFT of Sobel edge image                                  | Global | 128
     Colour and structure | Haar transform of quantized HSV colour histogram                                  | Global | 256
     Shape                | MPEG-7 edge histogram                                                             | 4 x 4  | 80
     Colour               | Three first central moments of CIE L*a*b* colour distribution                     | 5      | 45
     Colour               | Average CIE L*a*b* colour                                                         | 5      | 15
     Shape                | Co-occurrence matrix of four Sobel edge directions                                | 5      | 80
     Shape                | Histogram of four Sobel edge directions                                           | 5      | 20
     Texture              | Texture feature based on histogram of relative brightness of neighbouring pixels  | 5      | 40
 46. Classification Performance for Different Representations
     • Classification accuracy (in %) of the SVM based classifier
     Representation                                                                    | Linear kernel | Gaussian kernel
     DCT coefficients of average colour in 20 x 20 grid                                | 60.8          | 69.2
     CIE L*a*b* colour coordinates of two dominant colour clusters                     | 60.8          | 70.2
     Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 46.6          | 57.6
     Haar transform of quantized HSV colour histogram                                  | 84.2          | 86.8
     MPEG-7 edge histogram                                                             | 53.8          | 53.4
     Three first central moments of CIE L*a*b* colour distribution                     | 83.8          | 85.6
     Average CIE L*a*b* colour                                                         | 68.2          | 78.6
     Co-occurrence matrix of four Sobel edge directions                                | 44.0          | 49.8
     Histogram of four Sobel edge directions                                           | 36.8          | 49.2
     Texture feature based on histogram of relative brightness of neighbouring pixels  | 46.0          | 51.6
 47. Class-wise Performance
     • Gaussian kernel SVM based classifier; classification accuracy (in %)
     Representation                                                                    | Class 1 | Class 2 | Class 3 | Class 4 | Class 5
     DCT coefficients of average colour in 20 x 20 grid                                | 77      | 40      | 97      | 74      | 73
     CIE L*a*b* colour coordinates of two dominant colour clusters                     | 78      | 39      | 96      | 65      | 80
     Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 53      | 29      | 83      | 61      | 72
     Haar transform of quantized HSV colour histogram                                  | 86      | 80      | 99      | 90      | 92
     MPEG-7 edge histogram                                                             | 38      | 31      | 82      | 44      | 56
     Three first central moments of CIE L*a*b* colour distribution                     | 86      | 79      | 100     | 86      | 80
     Average CIE L*a*b* colour                                                         | 75      | 52      | 96      | 81      | 84
     Co-occurrence matrix of four Sobel edge directions                                | 59      | 30      | 77      | 41      | 52
     Histogram of four Sobel edge directions                                           | 49      | 22      | 71      | 24      | 55
     Texture feature based on histogram of relative brightness of neighbouring pixels  | 40      | 33      | 89      | 43      | 52
 48. Performance for Combination of All Representations
     [Diagram: Representations 1 to 10 concatenated into a supervector given to the classifier]
     Combination of representations                      | Dimension | Linear kernel | Gaussian kernel
     Supervector obtained using all the representations  | 682       | 90.40         | 91.20
     • The supervector obtained using all representations gives better performance than any single representation
     • The dimension of the combined representation is high
 49. Representations Selected using MKL
     • Linear kernel as base kernel
     Representation                                                                    | Dimension | Weight
     Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 128       | 0.1170
     Haar transform of quantized HSV colour histogram                                  | 256       | 0.5627
     MPEG-7 edge histogram                                                             | 80        | 0.1225
     Three first central moments of CIE L*a*b* colour distribution                     | 45        | 0.1615
     Co-occurrence matrix of four Sobel edge directions                                | 80        | 0.0261
     Texture feature based on histogram of relative brightness of neighbouring pixels  | 40        | 0.0101
     (The remaining representations received zero weight.)
 50. Representations Selected using MKL
     • Gaussian kernel as base kernel
     Representation                                                  | Dimension | Weight
     CIE L*a*b* colour coordinates of two dominant colour clusters   | 6         | 0.0064
     Magnitude of the 16 x 16 FFT of Sobel edge image                | 128       | 0.0030
     Haar transform of quantized HSV colour histogram                | 256       | 0.6045
     MPEG-7 edge histogram                                           | 80        | 0.1535
     Average CIE L*a*b* colour                                       | 15        | 0.2326
     (The remaining representations received zero weight.)
 51. Performance for Different Methods of Combining Representations
     • Classification accuracy (in %)
     Combination of representations                          | Linear kernel: Dim / Accuracy | Gaussian kernel: Dim / Accuracy
     Supervector of all the representations                  | 682 / 90.4                    | 682 / 91.2
     MKL approach                                            | 682 / 92.0                    | 682 / 92.4
     Supervector of representations with non-zero weights
     determined using the MKL method                         | 629 / 91.8                    | 485 / 93.0
     • The supervector obtained using the representations with non-zero weights performs comparably to the MKL approach, with a significantly reduced dimension of the combined representation
 52. Feature Selection using MKL
     • Representations and features selected using MKL with the Gaussian kernel as base kernel
     • Classification accuracy (in %)
     Representation                                                  | Features: actual / selected | Accuracy: actual / weighted features
     CIE L*a*b* colour coordinates of two dominant colour clusters   | 6 / 6                       | 70.2 / 70.6
     Magnitude of the 16 x 16 FFT of Sobel edge image                | 128 / 62                    | 57.6 / 60.4
     Haar transform of quantized HSV colour histogram                | 256 / 63                    | 86.8 / 88.0
     MPEG-7 edge histogram                                           | 80 / 53                     | 53.4 / 53.8
     Average CIE L*a*b* colour                                       | 15 / 14                     | 78.6 / 79.0
 53. Performance of Different Methods for Feature Selection
     Representation                                                                               | Dimension | Classification accuracy (in %)
     Supervector of representations with non-zero weight values determined using the MKL method   | 485       | 93.0
     Supervector of features with non-zero weight values determined using the MKL method          | 198       | 93.0
 54. Multiple Kernel Learning: Summary
     • The multiple kernel learning (MKL) approach is useful for combining representations as well as for selecting features, with the weights of the representations/features determined automatically
     • Selection of representations and features using the MKL approach is studied for an image categorization task
     • Combining representations using the MKL approach performs marginally better than the combination of all representations
     • The dimension of the feature vector obtained using the MKL approach is significantly smaller
 55. Text Books
     1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
     2. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
     3. Satish Kumar, Neural Networks: A Class Room Approach, Tata McGraw-Hill, 2004.
     4. Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd Edition, Prentice Hall, 1999.
     5. John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
     6. B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
     7. R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, MIT Press, 2002.
     8. B. Scholkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.
     9. G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
     10. F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis: Theory and Applications, Springer, 2008.
     Web resource: www.kernelmachines.org
 56. Thank You
