This document proposes a new method called Pyramid Histogram of Multi-scale Block Local Binary Pattern (PH-MBLBP) for scene classification. PH-MBLBP encodes both micro- and macro-structures of image patterns to provide a more complete representation than basic LBP. It divides images into spatial regions at multiple resolutions to capture geometric information. Experiments on 15 scene categories show PH-MBLBP outperforms SIFT and provides a powerful yet fast texture descriptor for scene recognition.
1. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014
Scene Classification Using Pyramid Histogram of
Multi-Scale Block Local Binary Pattern
Dipankar Das
Department of Information and Communication Engineering, University of Rajshahi,
Rajshahi-6205, Bangladesh.
ABSTRACT
Pyramid Histogram of Multi-scale Block Local Binary Pattern (PH-MBLBP) descriptor for recognizing
scene categories, is presented in this paper. We show that scene categorization, especially for indoor and
outdoor environments, requires its visual descriptor to process properties that are different from other
vision domains (e.g., SIFT descriptor used for object categorization). Our proposed PH-MBLBP satisfies
these properties and suits the scene categorization task. Since the proposed PH-MBLBP mainly encodes
micro- and macro-structures of image patterns, thus, it provides relatively more complete image descriptor
than the basic LBP operator. Moreover, our PH-MBLBP descriptor is more powerful texture descriptor
than the conventional operator and it can also be calculated extremely fast. Our experiments demonstrate
that PH-MBLBP outperforms the other descriptor such as SIFT.
KEYWORDS
Scene recognition, visual descriptor, local binary pattern, SIFT
1.INTRODUCTION
Humans are very proficient at perceiving natural scenes and understanding their contents.
However, scene classes, such as forest, building, office, mountain etc., are very ambiguous and
change a wide range of illumination and scale condition. Thus, in computer vision, it is very
difficult to classify such scenes categories into different classes. In the literature [1, 2], two basic
strategies can be found for classifying scene categories. The first approach uses low-level features
such as texture, colour, power spectrum, etc. This type of approaches have been proposed in [3,
4], where the scene is considered as an individual object. In [3, 4], the authors consider only a
small number of scene categories (city versus landscape, indoor versus outdoor, etc.). On the
other hand, the second approach as proposed in [5, 6, 7], uses an intermediate representation of
scene and then uses a classifier for classifying the scene into different categories. This strategy
has been shown to classify a larger number (up to 15) of scene categories. In this paper we
propose a classification algorithm based on spatial Pyramid Histogram of Multi-scale Block
Local Binary Pattern (PH-MBLBP) to represent the natural scene categories that can be used to
classify a large number of scene classes.
In recent years, several image representation techniques such as, probabilistic latent semantic
analysis (pLSA) model [8], part-based model [9], bag of visual words (BoW) model [10, 1], etc.
are used for scene categorization. Among these methods, the BoW model proposed by Anna
Bosch et al. [1] has shown an excellent performance due to its robustness to scale, translation and
rotation invariance. However, the BoW model is less discriminative for texture classification [11].
DOI:10.5121/ijcsa.2014.4402 15
2. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014
Moreover, natural images are rich in texture information, hence, how to classify the texture
images effectively is of challenge for scene classification problem. GIST descriptor, a holistic
representation of the spatial envelope, proposed by Oliva and Torralba [6], is an effective texture
descriptor. The proposed method achieved high accuracy in recognizing natural scene categories
such as mountain and forest. However, when classes of indoor environment are added, its
performance drops dramatically [12].
Local Binary Patterns (LBP) [13] is one of the best performing texture descriptor, and it has been
widely used in various applications such as scene classification, object recognition, and face
recognition. In the last few years, various extensions are made for the conventional LBP
descriptors [14, 15, 16]. Since basic LBP and its simple extension is very easy to implement and
produces very good performance on some selected object categories, thus it has been used by
several researchers for classifying scene classes and recognizing human face . In [16], Zhang et
al. propose basic LBP in Gabor transform domain. The combination of Gabor transform and LBP
descriptors improves the discriminative power of LBP descriptors. Recently, Wu et al. [12]
propose a Census transformation based approach which is equivalent to LBP code to recognize
scene categories. However, the census transformation based approach proposed in [12], uses only
gray level information of the image and is not invariant to rotation. Liao et al. [14] propose a
Dominant local binary pattern (DLBP) which is an extension of basic uniform LBP. Their main
target is to capture the dominating patterns in texture images by counting the co-occurrence
frequencies of all rotation invariant patterns defined in LBP groups. The DLBP proposed in [14]
is very useful in describing images with texture characteristic. The method also invariant in
curvature edges, complicated shapes, crossing boundaries and corners. However, most of the
previous representation methods have the common disadvantage of poorly capturing the intrinsic
multi-scale and multi-orientation geometric information of objects within images.
The basic LBP is proved as a successful and powerful local descriptor for micro-structures of
images. In the basic LBP operator, we replace the pixels values of the 3× 3-neighborhood of an
image by the thresholding of each pixel with respect to the center pixel and taking the result as a
“binary string” or a “decimal number”. However, the basic LBP operator has the following
disadvantages for classifying scene categories: (i) noisy image produces very worse result for
basic LBP operator because it is constructed in very small spatial support area, and (ii) the
dominant features of the scene (macro-structure) may be ignored in basic LBP operator because it
is calculated in the local 3×3 neighbourhood areas.
Thus, in this research, we propose an alternative representation technique, called spatial Pyramid
Histogram of Multi-scale Block LBP (PH-MBLBP). Our proposed approach overcomes some of
the limitations of basic LBP operator and successfully applies to scene categorization task. Our
main target of this research is to represent an image of a scene by its local feature in micro-structure
and macro-structure, and the spatial layout of these structures. In our approach the
micro-structure of a scene is represented by using the basic LBP and macro-structure is captured
by the average values of block sub-regions, instead of individual pixels. Then the spatial layout of
these micro- and macro-structure features is represented by tiling the image into regions at
multiple resolutions. Thus, the proposed PH-MBLBP descriptor presents several advantages: (1)
spatial pyramid histogram of MBLBP is used to capture geometric information by tiling the
image into regions at multiple resolutions, and (2) MBLBP perform better than the basic LBP for
the following reasons- (i) MBLBP encodes both micro- and micro-structures of scene image
patterns, and therefore it provides a more reliable image representation technique than the original
LBP operator; and (ii) we can efficiently compute the MBLBP using an integral image
representation techniques.
16
3. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014
17
2. PYRAMID HISTOGRAM OF MBLBP
In this research, the scene image is represented by the spatial pyramid histogram of multi-scale
block local binary pattern (MBLBP). First, the approach computes the block local binary pattern
of the image at different scale. Then MBLBP is represented by the pyramid histogram.
2.1. Multi-Scale Block Local Binary Pattern (MBLBP)
The original local binary pattern (LBP) is a form of non-parametric local transform originally
developed for finding correspondence between local patches. The LBP maps the local
neighbourhood surrounding a pixel, P, to a “bit string” representing the set of neighbouring pixels
in a square window, as illustrated in Error! Reference source not found.. If the intensity
value of the pixel P is less than or equal to one of its neighbouring pixels, a bit `1' is set in the
corresponding position. Otherwise a bit `0' is set. In this intensity comparison, we generate 8-bit
for the corresponding 8-pixel positions. Finally, we put together the generated 8-bit in any order
to produce the LBP code. Here we arrange bits from left to right and from top to bottom. The
converted LBP value is a base-10 number in the range {0⋯⋯255}.
Figure 1: The basic LBP coding
Then to generate a texture descriptor of an image, we construct a histogram of the LBP code.
Error! Reference source not found. shows an illustration of the basic LBP coding technique.
Multi-scale LBP [17] is an extension to the basic LBP, with respect to neighbourhoods of
different sizes.
7 7 5
6 8 7
9 7 8
(00001010)2 1010
0 0 0
0 0
1 0 1
Figure 2: (a) The MBLBP operator with block size 3×3, (b) Thresholding binary value, (c) Binary code,
(d) Corresponding decimal value.
In this paper, we propose a distinctive rectangle features, called Multi-scale Block Local Binary
Pattern (MBLBP) feature. In MBLBP, we first calculate the average intensity values of
rectangular sub-regions as shown in Error! Reference source not found.(a). Then we
compare these average intensity values with respect to the center value in a similar way of the
basic LBP to generate the MBLBP code (Error! Reference source not found.(b)-(d)). Each
sub-region is a square block containing neighbouring pixels (or just one pixel, in this case it is
original LBP). Formally, given a pixel( , ), the resulting MBLBP can be expressed in decimal
form as:
(1)
40 44 49
38 45 50
38 39 49
0 0 1
0 1
0 0 1
(0011100) 5610
P
(a) (b) (c) (d)
4. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
where i covers the 8 neighbours rectangular sub-regions,
( ,⋯,) are the average
intensities of those rectangular sub-regions, and
is the average intensity of the center
rectangular sub-region. The function () is defined as:
18
(2)
A more detailed description of the MBLBP operator is illustrated in Error! Reference source
not found.. The whole MBLBP is composed of 9 blocks. We take the size s of the block as a
parameter, and s×s denoting the scale of the block size of MBLBP operator (particularly, the
MBLBP with block size 1×1 is in fact the original LBP). In this research, the average pixel value
of each block is computed using the integral image as proposed in[18]. For this reason, we can
compute MBLBP code in a very fast way. Thus the complexity of the proposed MBLBP coding
takes a little more time than the basic LBP operator.
The MBLBP feature relies solely upon the set of block comparisons, and is therefore robust to
illumination changes, gamma variation, etc. The MBLBP retains global structures of an image
besides capturing the local structures. An original image and its MBLBP image with different
block sizes (pixel values are replaced by its LBP values) are shown in Error! Reference
source not found.(a)-(d).
(a) (b)
(c) (d)
Figure 3: (a) Original image, (b) MBLBP with block size 1×1 (i.e. original LBP),
(c) MBLBP with block size 3×3, (b) MBLBP with block size 9×9.
In MBLBP, the LBP values of neighbouring pixels are highly correlated and there exist some
direct and indirect constrains posed by the center pixels. The transitive property of such
constrains make them propagate to pixels that are far apart. The propagated constrains make LBP
values and their histograms implicitly contain information for describing global structures. As
5. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
shown in Fig. 3 (b)-(d), the LBP operation maps any 3×3 pixels into one of 28 cases each
corresponding to a special type of local structure of pixel intensities. Here the LBP value acts as
an index to these different local structures. We compute histograms of MBLBP values at different
pyramid levels on the scene image. Finally, these histograms are used as the visual descriptors of
the scene.
19
2.2. Spatial Pyramid Histogram of MBLBP
Our objective is to represent an image by its local small and large scale structures, and the spatial
layout of the structure. In this research, we first use the MBLBP with different size of windows to
represent local small and large scale structure. Then, the spatial layout of the features is
represented by dividing the image into regions at multiple resolutions. The idea is illustrated in
Error! Reference source not found.. The descriptor consists of a histogram of MBLBP
features over each image sub-region at each resolution level – a Pyramid of Histograms of
MBLBP (PH-MBLBP). The distance between two PH-MBLBP image descriptors then reflects
the extent to which the images contain similar structure and the extent to which the structure
correspond in their spatial layout.
(a)
+
(b)
+ +
(c)
Figure 4: Spatial Pyramid framework. The image is recursively split and histogram of MBLBP image is
then computed for each grid cell at each pyramid resolution level. (a) Histogram of MBLBP, (b) histogram
with 2-level spatial pyramid, and (c) histogram with 3-level spatial pyramid.
6. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
To construct the pyramid representation, we repetitively double the number of sub-divisions in
both axis directions. In each sub-division, we count the number of co-occurrence frequency. Our
representation is called pyramid representation because the number of co-occurrence frequency in
a sub-division at one level is simply the sum over those contained in the 4-sub-division it is
divided into at the next level.
If we denote a pyramid level by r, then it has 2r sub-divisions along each dimension. Therefore, if
our MBLBP histogram is represented by K bins, then the level 0 is represented by K feature
vectors, level 1 by 4K feature vectors, and so on. The dimension of the PH-MBLBP descriptor for
the entire image is given by:
20
4
∈! (3)
were R represent the total number of pyramid level. For an illustration, let R = 1, and K = 100
bins, then the dimension of the PH-MBLBP descriptor will be a 500-vector.
2. TRAINING
In this research, the LIBLINEAR classifier [19] is learned with PH-MBLBP feature vector. Since
our database consists of large number of images with high dimensional features vector, thus we
use LIBLINEAR classifier as it is shown to be effective for large sparse data with large number
of instances [19]. Moreover, LIBLINEAR is an easy-to-use tool to deal with such data.
LIBLINEAR classifier inherits many important characteristics (such as open source, well written
documentation, simple to use) of the most widely used support vector machine (SVM) classifier
known as LIBSVM [20]. Another important feature of the LIBLINEAR classifier over LIBSVM
is that it takes relatively very small amount of time to learn a huge amount of examples. For
instances, in our experiment to learn more than 4000 examples, the LIBLINEAR takes several
seconds whereas the LIBSVM takes several minutes. In addition, LIBLINEAR classifier is
competitive with or in some cases it is faster than other state of the art linear classifiers such as
Pegasos [21].
2. EXPERIMENTS
In this section, we reported results on fifteen scene categories database [22]. Multi-class
classification was done with a LIBLINEAR classifier trained using the one-versus-all rule: a
classifier was learned to separate each class from the rest, and a test image was assigned the label
of the classifier with the highest response.
2.3. Database
Our experimental dataset contains 15-scene categories. The 15 class scene dataset was gradually
built. The initial 8 classes were collected by Oliva and Torralba [6], and then 5 categories were
added by Fei-Fei and Perona [5]; finally, 2 additional categories were introduced by Lazebnik et
al. [22]. The 15 scene categories are: (1) bedroom (216 images), (2) CAL-suburb (241 images),
(3) industrial (311 images), (4) kitchen (210 images), (5) livingroom (289 images), (6) MIT-cost
(360 images), (7) MIT-forest (328 images), (8) MIT-highway (260 images), (9) MIT-insidecity
(308 images), (10) MIT-mountain (374 images), (11) MIT-opencountry (410 images), (12) MIT-street
(292 images), (13) MIT-tallbuilding (356 images), (14) PAR-office (215 images), and (15)
store (315 images). The database contained a total of 4485 images and the average image size was
300×250 pixels. The majority of images of the database were collected from the Google image
search, personal photographs, and COREL image collection. In scene categorization, this is one
7. International Journal on Computational putational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
of the most widely used databases in literature so far.
because it contains a wide range of indoor and outdoor images
ed The database is popular for researchers
nge images.
(1) Bedroom (2) CALsuburb
) (3) Industrial (4) Kitchen (5) Livingroom
(6) MITcost (7) MITforest
) (8) MIThighway (9) MITinsidecity (10) MITmountain
(11)MITopencountry (l2) MITstreet
Figure 5: Retrieval
2.4. Experimental Results
(13) MITtallbuilding (14) PARoffice (15) Store
of images from 15-category scene database.
In experimental result, we compared our proposed feature descriptor (PH
commonly used SIFT descriptor
. one-versus-all
methodology. We use the confusion table
success rates were calculated by the average value of the diagonal entries of the confusion table.
Figure 6: Recognition performance for 15
PH-MBLBP) with the most
[23]. The testing and training were performed in one
to measure the performance of the system. The
overall
15-category scene database for SIFT and PH-MBLBP descriptors.
21
) tore
8. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
22
2.6.1. PH-MBLBP Versus SIFT
We compared our MBLBP feature descriptor with the SIFT descriptor to evaluate the
effectiveness of the proposed descriptor.
Figure 7: Cross-category confusion matrix for PH-MBLBP descriptor.
Figure 8: Cross-category confusion matrix for SIFT descriptor.
9. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
We computed the dense feature for both descriptors with patch size of 10. The 1 versus all
recognition rate of MBLBP and SIFT descriptors are 82.5% and 80.08%, respectively. Figure 6
shows the recognition performance for 15-category scene database.
We also computed the confusion matrix for both PH-MBLBP and SIFT descriptors. The cross-category
confusion matrices for both descriptors are shown in Error! Reference source not
found. and Figure 8, respectively. Finally, we computed the ROC curve and the Area Under the
ROC Curve (AUC) for both descriptors. The ROC curve for PH-MBLBP and SIFT descriptors
are shown in Error! Reference source not found.9. The AUC accuracy for PH-MBLBP and
SIFT descriptor were 98.264% and 97.277%, respectively.
The above results and analyses indicate that our proposed approach PH-MBLBP performs better
than the SIFT descriptor for 15-category scene database.
23
Figure 9: The ROC curves for both SIFT and PH-MBLBP descriptors.
3.CONCLUSION
In this paper, we have proposed a pyramid histogram of multi-scale block local binary pattern
(PH-MBLBP) descriptor for scene classification. We use the LIBLINEAR classifier to classify
the scene categories. The proposed PH-MBLBP represents images as micro- and macro-structures,
and hence the representation is more robust than the basic LBP operator and other
conventional operator. The LIBLINEAR classifier successfully classifies 15-category scene
classes with high classification accuracy with the PH-MBLBP descriptor. We have shown that the
PH-MBLBP adapted to incorporate at multiple blocks has superior performance with the
conventional SIFT descriptor. The proposed PH-MBLBP performed by approximately 2.5%
higher accuracy than the SIFT descriptor on the same database.
10. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
24
REFERENCES
[1] A. Bosch, A. Zisserman and X. Muñoz, (2008) Scene classification using a hybrid
generative/discriminative approach, IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 30, no. 4, pp. 712-727.
[2] A. Bosch, X. Muñoz and R. Marti, (2007) A review: Which is the best way to organize/classify
images by content?, Image and Vision Computing, vol. 25, no. 6, pp. 778–791.
[3] M. Szummer and R. W. Picard, (1998) Indoor-outdoor image classification, in In ICCV Workshop
on Content-based Access of Image and Video Databases, Bombay, India.
[4] A. Vailaya, A. Figueiredo, A. Jain and H. Zhang, (2001) Image classification for content-based
indexing, IEEE Transactions on Image Processing, vol. 10, p. 117–129.
[5] L. Fei-Fei and P. Perona, (2005) A bayesian hierarchical model for learning natural scene categories,
in In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,
pages 524–531, Washington, DC, USA, 2005., Washington, DC, USA.
[6] A. Oliva and A. Torralba, (2001) Modeling the shape of the scene: a holistic representation of the
spatial envelope, International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175.
[7] J. Vogel and B. Schiele, (2007) Semantic modeling of natural scenes for content-based image
retrieval, International Journal of Computer Vision, vol. 72, no. 2, pp. 133-157.
[8] T. Hofmann, (2001) Unsupervised learning by probabilistic latent semantic analysis, Machine
Learning, vol. 41, no. 2, pp. 177-196.
[9] R. Fergus, P. Perona and A. Zisserman, (2005) A sparse object category model for efficient learning
and exhaustive recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR
05).
[10] D. Gokalp and S. Aksoy, (2007) Scene classification using bag-of-regions representations, in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR 07).
[11] X. Qian, X. Hua and L. K. P. Chen, (2011) PLBP: An effective local binary patterns texture
descriptor with pyramid representation, Pattern Recognition, vol. 44, no. 10-11, pp. 2502-2515.
[12] J. Wu and J. M. Rehg, (2011) CENTRIST: A Visual Descriptor for Scene Categorization, IEEE
Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1489-1501.
[13] T. Ojala, M. Pietikainen and D. Harwood, (1996) A comparative study of texture measures with
classification based on feature distributions, Pattern Recognition, vol. 29, no. 1, pp. 51-59.
[14] S. Liao, M. Law and A. Chung, (2009) Dominant Local Binary Patterns for Texture Classification,
IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 1107-1118.
[15] G. Zhao and M. Pietikainen, (2007) Dynamic texture recognition using local binary patterns with an
application to facial expressions, IEEE Transactions on PAMI, vol. 29, no. 6, pp. 915-928.
[16] W. Zhang, S. Shan, W. Gao, X. Chen and H. Zhang, (2005) Local Gabor Binary Pattern Histogram
Sequence (LGBPHS): A Novel Non-Statistical Model for Face Representation and Recognition, in
IEEE International Conference on Computer Vision (ICCV).
[17] T. Ojala, M. Pietikainen and M. Maenpaa, (2002) Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 24, no. 7, pp. 971-987.
[18] P. Viola and M. Jones, (2001) Robust real time object detection, in IEEE ICCV Workshop on
Statistical and Computational Theories of Vision, Vancouver, Canada.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang and C.-J. Lin, (2008) LIBLINEAR: A Library for
Large Linear Classification, Journal of Machine Learning Research, vol. 9, pp. 1871-1874.
[20] C.-C. Chang and C.-J. Lin, (2001) LIBSVM: a library for support vector machines, Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[21] S. Shalev-Shwartz, Y. Singer and N. Srebro, (2007) Pegasos: primal estimated sub-gradient solver for
SVM, in International Conference on Machine Learning, Corvallis, Oregon, USA.
[22] S. Lazebnik, C. Schmid and J. Ponce, (2006) Beyond Bags of Features: Spatial Pyramid Matching for
Recognizing Natural Scene Categories, in IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, New York, NY, USA.
[23] D. G. Lowe, (2004) Distinctive Image Features from Scale-Invariant Keypoints, International Journal
of Computer Vision, vol. 60, no. 2, pp. 91-110.
11. International Journal on Computational Sciences Applications (IJCSA) Vol.4, No.4, August 2014
25
Author
Dipankar Das received his B.Sc. and M.Sc. degree in Computer Science and
Technology from the University of Rajshahi, Rajshahi, Bangladesh in 1996 and1997,
respectively. He also received his PhD degree in Computer Vision from Saitama
University Japan in 2010. He was a Postdoctoral fellow in Robot Vision from October
2011 to March 2014 at the same university. He is currently working as an associate
professor of the Department of Information and Communication Engineering,
University of Rajshahi. His research interests include Object Recognition and Human
Computer Interaction.