Kernel Descriptors for Visual Recognition
by L. Bo, X. Ren and D. Fox
A Term Paper Report by Priyatham Bollimpalli (10010148)
Summary of the Paper
Popular computer vision algorithms such as SIFT and HOG compute feature descriptors for an
image. A descriptor is, in simple terms, a concise representation of image properties that
enables many practical applications such as object recognition, scene detection and image
matching. Inspired by the orientation-histogram approach used in SIFT and HOG, this paper
gives a kernel view of orientation histograms and then designs kernel descriptors for
gradient, colour and local binary pattern (shape) using match kernels. These kernels move
beyond the coarse granularity of low-level pixel features and give a principled notion of
similarity between patches (higher-level features). To make the kernels computationally
feasible, the match kernels are first approximated in a finite-dimensional space using a set
of basis vectors taken from sampled normalized gradient vectors. Then, to reduce redundancy
and generate compact features, Kernel Principal Component Analysis (KPCA) is applied. It is
shown experimentally that the error introduced in these two stages is very small. The
gradient, colour and shape kernel descriptors can then be computed efficiently and in a
simple, straightforward way over images.
Experiments are done on four publicly available datasets: Scene-15, Caltech-101, CIFAR10 and
CIFAR10-ImageNet. These are image-classification datasets, and Laplacian kernel SVMs are
used as the classifier. The gradient kernel descriptor performs best among the proposed
kernel descriptors, and all of them perform better than the SIFT descriptor and other
sophisticated feature-learning methods.
The main novelty of the paper is that this is the first kernel-based approach to low-level
visual feature learning, and it outperforms very well-known methods that are the default
choice for many applications. Its limitations are the high computational cost (even after
optimization) compared to other methods, and the difficulty of learning pixel attributes
from large image collections to approximate the kernel. Since this area of research is new,
alternative kernel functions, or combinations of the existing ones with other kernel
methods, may get around these limitations, further improving performance or extending the
approach to other areas where SIFT is used, such as object tracking and multi-view matching.
Details and Explanation of the paper
The gradient orientation at a pixel plays an important role in describing image features,
and this concept has been used extensively in many image descriptors. For example, the SIFT
descriptor assigns orientations to 8 bins over each cell of a 4 x 4 block. The feature
vector of each pixel z is defined as F(z) = m(z)δ(z), where m(z) is the gradient magnitude
and the i-th component of δ(z) is 1 if the gradient orientation θ(z) falls in the i-th bin
and 0 otherwise. A soft-binning formulation can also be used:
δ_i(z) = max(cos(θ(z) − a_i), 0)⁹, where a_i is the i-th bin center. Over a patch P, the
histogram of gradients is obtained as
F_h(P) = Σ_{z∈P} m̃(z) δ(z), where m̃(z) = m(z) / √(Σ_{z∈P} m(z)² + ε) are the normalized
magnitudes.
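The soft-binned histogram construction can be sketched directly; the number of bins, the ε value and the array-based interface are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def soft_bin_histogram(grad_mag, grad_angle, n_bins=8, eps=1e-8):
    """Soft-binned orientation histogram F_h over a patch.

    grad_mag, grad_angle: 1-D arrays of per-pixel gradient magnitudes
    and orientations (radians) for all pixels z in the patch P.
    """
    bin_centers = np.arange(n_bins) * 2 * np.pi / n_bins   # a_i
    # Normalize magnitudes: m~(z) = m(z) / sqrt(sum_z m(z)^2 + eps)
    m_tilde = grad_mag / np.sqrt(np.sum(grad_mag ** 2) + eps)
    # Soft assignment: delta_i(z) = max(cos(theta(z) - a_i), 0)^9
    delta = np.maximum(np.cos(grad_angle[:, None] - bin_centers[None, :]), 0) ** 9
    # F_h = sum_z m~(z) delta(z)
    return m_tilde @ delta
```

With all gradients aligned to a bin center, the histogram concentrates its mass in that bin, while nearby bins receive small soft contributions.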
Intuitively, the similarity between two patches P and Q from different images is then the
inner product of their histograms:
K(P, Q) = F_h(P)ᵀ F_h(Q) = Σ_{z∈P} Σ_{z′∈Q} m̃(z) m̃(z′) δ(z)ᵀ δ(z′).
Since the right-hand side contains only inner products, kernel functions can be defined
between pairs of pixels, giving a kernelized notion of similarity between two patches (as in
HOG). But defining the kernel in this way introduces quantization errors and leads to poor
performance.
To capture image variations properly, the gradient match kernel is instead defined as
K_grad(P, Q) = Σ_{z∈P} Σ_{z′∈Q} m̃(z) m̃(z′) k_o(θ̃(z), θ̃(z′)) k_p(z, z′),
where k_p and k_o are Gaussian kernels over pixel positions and orientations respectively.
For better accuracy and a uniform definition, the pixel positions and orientations are
normalized, with θ̃(z) = [sin θ(z), cos θ(z)].
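The gradient match kernel can be sketched as a direct (unoptimized) double sum over pixel pairs; the kernel widths gamma_o and gamma_p below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def gaussian_k(x, y, gamma):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def gradient_match_kernel(P, Q, gamma_o=5.0, gamma_p=3.0, eps=1e-8):
    """K_grad(P, Q): double sum over the pixels of two patches.

    Each patch is a list of (position, orientation, magnitude) tuples,
    with positions already normalized to the patch.
    """
    def normalized_mags(patch):
        mags = np.array([m for _, _, m in patch])
        return mags / np.sqrt(np.sum(mags ** 2) + eps)

    mP, mQ = normalized_mags(P), normalized_mags(Q)
    k = 0.0
    for (z1, th1, _), m1 in zip(P, mP):
        for (z2, th2, _), m2 in zip(Q, mQ):
            # Normalized orientation vector theta~(z) = [sin th, cos th]
            t1 = np.array([np.sin(th1), np.cos(th1)])
            t2 = np.array([np.sin(th2), np.cos(th2)])
            k += m1 * m2 * gaussian_k(t1, t2, gamma_o) * gaussian_k(z1, z2, gamma_p)
    return k
```

For two identical single-pixel patches the three factors are all 1, so the kernel value is (up to the ε term) 1, matching the intuition that a patch is maximally similar to itself.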
The motivation for defining the gradient match kernel K_grad as a product of three kernels
is as follows. First, the contribution of each pixel's gradient magnitude must be weighted;
the normalized linear kernel m̃(z)m̃(z′) does this. Next, a measure of similarity between
gradient orientations is needed, provided by the Gaussian kernel k_o. Finally, the Gaussian
kernel k_p measures how close two pixels are spatially. With similar motivation, the colour
match kernel is defined as
K_col(P, Q) = Σ_{z∈P} Σ_{z′∈Q} k_c(c(z), c(z′)) k_p(z, z′),
where c(z) is the colour at pixel z and k_c is a Gaussian kernel over colour values.
The shape match kernel is
K_shape(P, Q) = Σ_{z∈P} Σ_{z′∈Q} s̃(z) s̃(z′) k_b(b(z), b(z′)) k_p(z, z′),
where s(z) is the standard deviation of pixel values in the 3 x 3 neighborhood around z
(normalized to s̃(z), analogously to m̃(z)), and b(z) is a binary column vector formed by
binarizing the pixel-value differences in a local window around z. Thus in the shape kernel
descriptor, the contribution of each local binary pattern is weighted by s̃(z), and shape
similarity is measured through the Gaussian kernel k_b over the local binary patterns b(z).
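The per-pixel quantities s(z) and b(z) can be sketched for a 3 x 3 window; the exact window size and binarization threshold are assumptions for illustration:

```python
import numpy as np

def lbp_features(img, i, j):
    """s(z) and b(z) for the pixel at (i, j).

    s: standard deviation of the 3x3 neighborhood around the pixel.
    b: binarized differences of the 8 neighbors to the center pixel
       (1 where neighbor >= center), a simple local binary pattern.
    """
    patch = img[i - 1:i + 2, j - 1:j + 2].astype(float)
    s = patch.std()
    diffs = patch.flatten() - patch[1, 1]
    b = (np.delete(diffs, 4) >= 0).astype(float)  # drop the center itself
    return s, b
```

The resulting b(z) is an 8-dimensional binary vector, and s(z) is large exactly where the local window has strong intensity variation, i.e. where the pattern carries shape information.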
Features over image patches can then be expressed as
F_grad(P) = Σ_{z∈P} m̃(z) φ_o(θ̃(z)) ⊗ φ_p(z),
where φ_o and φ_p are the feature maps of the Gaussian kernels k_o and k_p. Since Gaussian
kernels are used, F_grad(P) is infinite-dimensional, and directly applying KPCA may be
computationally infeasible when the number of patches is very large. So the match kernels
are first approximated by learning finite-dimensional features, obtained by projecting
F_grad(P) onto a set of basis vectors. For example, the Gaussian kernel over gradient
orientations is approximated in d dimensions by projecting φ_o onto the span of
{φ_o(x_1), …, φ_o(x_d)}, where the x_i are sampled normalized gradient vectors.
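This basis-vector approximation can be sketched as a Nyström-style projection: the approximate d-dimensional feature map is built from the kernel values against the basis and the inverse square root of the basis Gram matrix (the helper names and the gamma value are illustrative):

```python
import numpy as np

def gaussian_gram(X, Y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def finite_dim_map(basis, gamma):
    """Return phi(x) of dimension d = len(basis) with
    phi(x).phi(y) ~ k(x, y): a projection onto span{phi(x_1..x_d)}."""
    K = gaussian_gram(basis, basis, gamma)
    # G = K^{-1/2} via eigendecomposition (K is symmetric positive definite)
    w, V = np.linalg.eigh(K)
    G = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    return lambda x: G @ gaussian_gram(np.atleast_2d(x), basis, gamma).ravel()
```

For points inside the span of the basis the approximation is exact, which is why the paper can report very small approximation error when the basis vectors cover the (low-dimensional) space of normalized gradients well.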
Note that the Kronecker product ⊗ used to combine the orientation and position features
still results in a large number of dimensions. To obtain more compact features, KPCA is then
applied, which makes the evaluation time practical. The t-th kernel principal component is
written as a linear combination of the joint basis vectors,
PC^t = Σ_i Σ_j α^t_{ij} φ_o(x_i) ⊗ φ_p(y_j),
and the gradient kernel descriptor is finally obtained by projecting F_grad(P) onto these
components:
F̄^t_grad(P) = Σ_i Σ_j α^t_{ij} { Σ_{z∈P} m̃(z) k_o(θ̃(z), x_i) k_p(z, y_j) }.
It is shown that the error incurred in approximating the match kernels in this way is very
small.
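The compaction step can be illustrated with plain (uncentered) PCA via SVD on the stacked finite-dimensional features; this is a simplification of the paper's KPCA over joint basis vectors, shown only to convey the dimensionality reduction:

```python
import numpy as np

def compact_features(F, t):
    """Keep the top-t principal directions of the large joint feature
    vectors F (shape: n_patches x D) and project onto them.

    Returns the compact n_patches x t features and the t x D projection.
    """
    # SVD gives the principal directions as the rows of Vt
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    W = Vt[:t]
    return F @ W.T, W
```

Because the rows of W are orthonormal, the projection preserves the dominant variance of the features while shrinking each descriptor from D to t dimensions.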
The gradient (KDES-G), colour (KDES-C) and shape (KDES-S) kernel descriptors are compared to
SIFT and several other state-of-the-art object-recognition algorithms on the four publicly
available datasets Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet. Except on CIFAR10,
Laplacian kernel SVMs are used as classifiers. A summary of the results is shown below.
Combining the three kernel descriptors is observed to boost performance by about 2%. The
proposed kernel descriptors thus outperform all the other methods.
Scene-15:     KDES 86.7%,  SIFT 82.2%
Caltech-101:  KDES 76.4%,  LCC [4] 73.4%,  CDBN [2] 65.5%,  SPM [1] 64.4%
CIFAR10:      KDES 76.0%,  LCC [4] 74.5%,  TCNN [5] 73.1%,  mcRBM-DBN [3] 71.0%
[1] Lazebnik, Schmid, Ponce, CVPR '06. [2] Lee, Grosse, Ranganath, Ng, ICML '09.
[3] Ranzato, Hinton, CVPR '10. [4] Yu, Zhang, ICML '10.
[5] Le, Ngiam, Chen, Chia, Koh, Ng, NIPS '10.
