http://www.iaeme.com/IJARET/index.asp 104 editor@iaeme.com
International Journal of Advanced Research in Engineering and Technology
(IJARET)
Volume 6, Issue 12, Dec 2015, pp. 104-133, Article ID: IJARET_06_12_010
Available online at
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication
REVIEW ON GENERIC OBJECT
RECOGNITION TECHNIQUES:
CHALLENGES AND OPPORTUNITIES
Prof. Deepika Shukla
Comp. Science and Engineering Department,
Institute of Technology, Nirma University, Ahmedabad, India
Apurva Desai
Department of Computer Science and Information Technology,
VNSGU, Surat India
ABSTRACT
Recognizing objects automatically in an image is a fundamental step for
many real-world computer vision applications. It is the task of identifying an
instance of an object in an image or video sequence with little or no human
intervention and assistance. In spite of its very high complexity, human beings
perform this task with little effort, even in a state of minimal attention.
Humans need little effort to recognize a huge number and variety of object
categories in images, even though the object in an image may differ in
size/scale, viewpoint, position or orientation. We can even recognize objects
that are only partially visible or set against a cluttered background.
Moreover, recognition may target a specific instance of an object or an object
category/class. When the task is performed for classes of objects, it is known
as generic object recognition, object-class detection or category-level object
recognition. Over the years many techniques have evolved for recognizing
object classes in images, but no automated object recognition system to date
has attained this capability fully on a par with human beings. This very fact
makes the recognition of objects in an image the most basic and fundamental
challenge in the field of computer vision research. The purpose of this study
is to give an overview and categorization of the approaches used in the
literature for generic object recognition and of the various technical
advancements achieved in the field. The survey mostly focuses on the leading
work since the year 2000.
We have discussed the challenges that the field currently faces. We have
also attempted to suggest future research directions in the area of generic
object recognition. Finally, we conclude the study with the hope that in the
near future more sophisticated object-class recognition systems will be
developed in an efficient and cost-effective manner.
Key words: Object Recognition, Generic Object Recognition, Object Class
Recognition, Scene Understanding, Scene Categorization, Image Analysis,
Computer Vision, Machine Vision, Scene Analysis.
Cite this Article: Prof. Deepika Shukla and Apurva Desai, Review on
Generic Object Recognition Techniques: Challenges and Opportunities.
International Journal of Advanced Research in Engineering and Technology,
6(12), 2015, pp. 104-133.
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12
1. INTRODUCTION
Automated recognition of objects in images is a critical and fundamental step for
many real-world computer vision applications. It is the task of finding a given object
in an image or video sequence with little or no human intervention or assistance.
Very little effort is required on our part to detect and recognize a huge number of
classes of objects in images, even though the image of the object may differ in
size/scale, viewpoint, position or orientation. Human beings are able to recognize
objects in an image even when they are only partially visible or set against a
cluttered background. Moreover, the ability to generalize from examples and to
categorize objects, events, scenes and places is one of the core capabilities of the
human visual system. For a human being this is a mundane activity, but imbuing
machines with these capabilities has proved to be a significantly challenging task for
computer vision systems in general.
The reason may be rooted in the fact that automatic object recognition
requires an understanding of human visual perception and thus becomes a
multidisciplinary research area involving knowledge and expertise from fields such as
optics, psychology, pattern recognition, artificial intelligence, machine learning and,
most importantly, cognitive science, which in itself needs sophisticated concepts and
tools from mathematics as well as computer science [1].
Object recognition is a dominant field of research in computer vision as well
as in image analysis applications, and even the simplest machine vision task cannot be
solved without the help of recognition. This is evidenced by the vast volume of
research conducted in the area over the past three decades: entering "object
recognition from images" as a search string on ieeexplore.org returns more than
20,000 results. Given the substantial volume of literature on the topic, we can also
say that the field of object recognition is closely tied to, and is part and parcel of,
computer vision research.
This paper reviews most of the leading state-of-the-art research performed in the
area of generic object recognition. More specifically, it is focused on gaining
insight into the following research questions pertaining to the topic of Generic
Object Recognition.
What generic object recognition techniques and approaches are found in the
literature?
What different techniques are used for object representation?
Which feature detection and extraction methods are used by most of the prominent
researchers on the topic?
Which classification/learning techniques have been used in the classification stage of
the object recognition pipeline?
The rest of this paper is organized as follows. Section 2 introduces and explains
the problem of generic object recognition, which can be considered a specific subset
of the object recognition problem. Section 3 concentrates on the challenges that the
field of object recognition faces in general, and generic object recognition in
particular. Section 4 discusses the vast literature existing on the topic. Section 5
presents a roadmap of future research areas and directions. Section 6 concludes
the study.
2. GENERIC OBJECT RECOGNITION PROBLEM
The problem of object recognition can be viewed as a classification or labelling
problem in which models/representations of known objects are available to the system
and, when a novel image is given, the system has to predict the class of the object(s)
present in the image. Formally, it can be stated as follows: given an image containing
one or more objects of interest (and background) and a set of labels corresponding to
a set of models known to the system, the system should assign correct labels to
regions, or sets of regions, in the image. In other words, an object recognition
system should assign a high-level definition to an object based on the image data by
which it is represented.
The task of object recognition is often considered as broadly comprising three
sub-tasks:
Object detection: detecting whether an instance of the object category is present in
the image or not.
Localization: giving the location of the object category. Drawing a bounding box
around the object instance is the most prominent way of showing the localization
result in the literature.
Visual category recognition: recognizing and labelling the class/category of the
object present in the image.
Moreover, the image presented to the object recognition framework may contain a
single instance of some object class, multiple instances of a single class, or
multiple instances of multiple classes. At the top-most level, object recognition
approaches can broadly be categorized as following a top-down, bottom-up or hybrid
approach, and within that they can target a specific or a generic object. So,
basically, image-based object recognition can be stated as: given a database of
objects and an image, determine which, if any, of the objects are present in the
image. Thus the problem of object-class recognition can be considered an instance of
supervised classification.
Another dimension along which the task of object recognition can be categorized
is the following. In the first case, the specific object to be recognized is known to
the system and the system is trained for that specific object category only; examples
are face recognition and pedestrian recognition. In the second case we have a generic
object recognition system. Generic object recognition means that the computer
recognizes objects in images by their general name [2] or common name. Figure 1
shows an instance of generic object recognition. Generic object recognition has also
been referred to in the literature as object-class detection or category-level object
recognition [14]; it aims at recognizing the class to which the object present in the
image belongs. The images can contain a single instance of a class, multiple
instances of the same class, or multiple instances of multiple classes. When multiple
objects of multiple classes in an image are categorized, the task is known as scene
categorization.
Figure 1 Generic Object Recognition
2.1. Architecture of the object recognition system
Current vision systems can be said to consist of the activities shown in Figure 2.
Figure 2 Activities involved in a typical vision system
Any recognition system involves these activities, or some subset of them, in its
life cycle. In general, after the image acquisition stage, the image is pre-processed
for noise removal and some kind of enhancement. The pre-processing stage is followed
by the feature extraction and description/representation stage, whose output is then
passed on for recognition. In the representation stage, objects can be represented in
2-D or 3-D. Figure 3 shows the general architecture of an object recognition system.
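The stages above can be sketched as a simple pipeline. This is an illustrative toy, not the architecture of any particular surveyed system: the 3x3 mean filter stands in for real pre-processing, and `extract_features`/`classify` are hypothetical pluggable callables representing the later stages.

```python
import numpy as np

def preprocess(image):
    """Pre-processing stage: a 3x3 mean filter as a simple stand-in
    for noise removal/enhancement."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    return np.mean(
        [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)],
        axis=0)

def recognize(image, extract_features, classify):
    """Run one image through the pipeline of Figures 2 and 3:
    pre-processing -> feature extraction/description -> classification."""
    denoised = preprocess(image)
    descriptor = extract_features(denoised)
    return classify(descriptor)
```

Any concrete system plugs its own descriptor and classifier into the two callables; the control flow stays the same.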
The object recognition task is affected by several factors and can differ according
to various aspects, as shown in Figure 4, which categorizes the aspects along which
work is going on in the field. Approaches may differ in the form and representation
of objects, the matching scheme, the image formation model, the type of features, the
type of image and the type of data suited for categorization. Having studied these
various aspects, we found that the approaches mainly differ in the object
representation method, based on the type of features, or in the classification
approach adopted in the recognition phase.
As these factors change, the approach can be observed to change substantially, but
the approaches broadly follow three paradigms for formulating and attempting a
solution to the problem of object recognition in an image: the bottom-up, top-down
and hybrid paradigms [103].
Figure 3 Generic Architecture of Object Recognition System
Bottom-up: This can be considered image analysis starting from low-level data and is
based on image segmentation techniques. It considers the raw image data in the form
in which it is acquired. Boundaries of homogeneous regions are extracted by
performing non-purposive segmentation without prior knowledge about the properties of
individual object classes; no prior assumptions are made about what the objects are.
A fixed set of attributes is used to characterize these regions, and objects are
linked together to characterize the scene itself. However, without some additional
information, purely bottom-up approaches had, as of 2009, been unable to yield
figure-ground segmentations of sufficient quality for object categorization
[Leibe & Schiele]. Since then, many approaches have been developed [85, 86, 88, 89]
that use bottom-up segmentation methods, as discussed in [82] and [85], and have
achieved remarkable results; these will be discussed in detail in the literature
review section of this paper.
Top-down: This is image analysis starting from semantic-level data. In contrast to
the previous approach, this methodology proceeds on the assumption that the image
does contain a particular object or, if the problem is scene categorization, that it
shows a particular type of scene. The system attempts to verify the existence of a
hypothesized object. Purposive segmentation may be performed, or specialized ways of
representing the object are used.
Hybrid: In this kind of approach a combination of the two earlier paradigms is used
[61], [79].
3. KEY CHALLENGES
3.1. Challenges overview
As stated earlier, the problem of Object recognition in general and Generic Object
Recognition in particular faces various challenges.
(I) The appearance of an object in the image can have a large range of variation due
to:
1. Viewpoint changes
2. Scale, Orientation and Shape changes (e.g., non-rigid objects)
3. Photometric effects (scene illumination etc.)
4. Scene/Background clutter (therefore objects may be occluded)
(II) Different views of the same object can give rise to widely different images.
(III) A large number of object categories exist in the real world, and these
categories may exhibit very little inter-class variation.
Figure 4 Factors affecting the task of object recognition
3.2. Description
Object recognition can be considered yet another data processing task, so data is
given the highest priority and acquisition should be considered the most important
step. In recent years, with the advent of high-quality cameras and other image
capturing devices, we can collect a huge amount of data (images) in various forms,
such as intensity images and range images, and from various sources such as the web.
However, the major problem that the computer vision research community faces today is
the scarcity of accurately and precisely labelled image examples. As stated earlier,
the object recognition problem can essentially be considered a supervised
classification task, and for that to work successfully, labelled image examples are
needed. The problem is aggravated by the fact that annotation is labour intensive,
and the lack of human experts who can perform the image annotation task efficiently
and accurately makes it more challenging still.
Feature extraction is the next crucial step in the generic object recognition
pipeline. Assuming that the data is available, feature extraction becomes the most
important stage of the entire object recognition framework. If suitable features of
the right dimensions are not extracted, this phase can become the bottleneck of the
recognition pipeline. Although many sophisticated approaches have recently been
developed and exist in the literature, they are not sufficient to describe every
object, so feature extraction becomes highly object-specific and varies with the
viewpoint, size and illumination conditions under which the image is captured. Thus,
representing images by effective features is crucial to the performance of various
image analysis tasks. Features can be low-level (colour, texture, intensity),
middle-level (image patches) or high-level (objects, textually annotated objects).
Figure 5 shows one possible classification of the different kinds of features.
Choosing and deploying an appropriate classifier is the next important step of the
pipeline. The classifier can be linear or non-linear. Various classifiers, such as
the Bayesian classifier, SVMs, decision trees and neural networks, have been used in
the literature for classification, each with its own benefits and drawbacks. One
important issue inherent to the classification stage is the scalability of the
classifier. The number of object categories in the real world is very large, and many
visual features are required to model each category, forcing the system to hold a
huge volume of training data in order to model the variety of category classifiers.
To keep scalability manageable, a linear classifier is commonly used, but its
classification performance is inferior to that of a non-linear one, while non-linear
classifiers are more computation intensive. To remedy this shortcoming of linear
classifiers, a rich image feature set (which is, after all, a key factor in the
success of an image recognition system) must be designed per object class, so that
the system can distinctly recognize objects in images exhibiting inter-class and
intra-class variation, as shown in Figure 6. Additionally, the classifier has to be
updated continuously: even if it has been trained once for a category/class of
object, when previously unseen instances emerge or the appearance of the object
evolves, the previously trained classifier will no longer give correct results. This
kind of flexibility and resilience to change is inherently expected of any object
recognition framework.
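To make the linear option concrete, the sketch below trains a perceptron-style linear classifier on descriptor vectors. It is a toy illustration of why linear models scale well (training and prediction cost grow only linearly with the feature dimension), not the classifier of any particular system surveyed here.

```python
import numpy as np

def train_linear(X, y, epochs=100, lr=0.1):
    """Perceptron-style linear classifier over descriptor vectors.
    y holds labels +1/-1; a bias feature is appended to each sample."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:              # misclassified: update
                w += lr * yi * xi
    return w

def predict(w, X):
    """Apply the learned linear decision boundary."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```

A non-linear (e.g. kernel) classifier would replace the dot product `w @ xi` with a comparison against stored training samples, which is where the extra computation comes from.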
Figure 5 Classification of Image Features
Figure 6 Images of different instances of an object class (dog) under varied imaging conditions.
Intra-class appearance variations refer to the appearance differences among
different objects of the same class [14]. They may be due to differences in the
colour, shape and size of the object instances, or to differences in imaging
conditions; for example, images of the same object taken at different times of day,
in different seasons, at different places, with different devices and from different
viewpoints will be entirely different. In addition to intra-class appearance
variations, a generic object recognition system also has to handle inter-class
appearance variations efficiently and distinctly; in many cases these are very small,
as shown in Figure 7. For example, an object recognition system should be capable of
distinguishing between a donkey and a horse, or a horse and a mule.
Figure 7 Images of horses and donkeys with very small inter-class appearance
variation; the lower row shows images of horses (adapted from [14])
The performance of a generic object recognition framework is generally judged on
criteria such as robustness against noise; invariance to basic geometric
transformations, illumination and viewpoint changes; the ability to handle large
numbers and different types of objects; the ability to handle intra-class and
inter-class variations; the ability to recognize objects in the presence of clutter
or a complicated background; and the ability to recognize an object accurately and
efficiently even when it is partially occluded. These requirements are implicitly
expected of, and must be met by, any object recognition framework, and as a result
these issues can be considered the key challenges of the field of generic object
recognition.
4. LITERATURE REVIEW
4.1. Overview
The object recognition pipeline, as stated in the earlier section, consists of key
tasks such as image acquisition, pre-processing, feature extraction, feature
representation/description and classification. The image acquisition and
pre-processing phases, however, fall outside the scope of this study. Although most
of the related work surveyed and cited here focuses on one or another phase of this
pipeline, our main focus in this study is on feature extraction and description
techniques, and on obtaining answers to the research questions put forward at the
beginning of this manuscript. Various groups of researchers have attempted to survey
and review work in the field of computer vision, but these reviews either relate to
some specific object (for example, surveys on face recognition are presented in
[108, 109]), compare and survey various descriptors [14, 45, 48], or, as in the
separate survey in [114], cover object recognition using deep neural networks; that
is, one particular aspect of the topic is explored and the related literature is
reviewed alongside the authors' core work. Comprehensive surveys on generic object
recognition [14, 15] have been published periodically in the past, but given the
rapid pace of achievements in the field, it seems natural to survey the most recent
developments and object recognition techniques available in the literature. In this
study we have mostly tried to review the work done in the field since the year 2000,
with more emphasis on work done after 2011.
The rationale behind this is that most existing surveys and papers talk extensively
about the approaches before 2011. Given the pace of technical advancement in the
field, a lot of approaches have emerged since 2011 that demand detailed coverage,
and covering them is the basic motive of this review. The study therefore also aims
to present the survey in a way that helps the reader gain insight into this field of
research. As noted in the introduction, the task of object recognition is considered
as broadly comprising three sub-tasks, object detection, object localization and
object classification, but in this manuscript we have studied approaches to generic
object recognition, which is the highest-level task among the object categorization
sub-tasks: to categorize the class of an object in the image, object detection is
inherently performed, and the objects often need to be localized as well. For this
reason we have not segregated the approaches on the basis of detection, localization
or categorization.
4.2. Features and Feature Descriptors
The foundations of the field can be traced back to the 1950s and 1960s, when early
work was done in very simplistic domains [1]. The world was modelled as being
composed of blocks defined by the coordinates of their vertices and by edge
information. The "block image" represented areas of uniform brightness in the image,
and the edges of blocks were located in areas of intensity discontinuity. It was soon
realised, however, that this is not an ideal way to represent the complicated
information present in an image. Since then, various strategies have been developed
for the task of object recognition, with an emphasis on the feature extraction stage
and on the use of novel and efficient types of feature descriptor.
Object recognition approaches can be grouped into several broad categories,
including model-based, shape-based and appearance-based approaches. Model-based
approaches try to represent objects as sets of three-dimensional primitives
[1, 12, 13] such as generalized cylinders, cones, cubes, cuboids and spheres.
Shape-based approaches [13, 19, 20, 21, 52, 53] represent objects by shape primitives
such as boundary fragments, contours and shapelets. In appearance-based models, by
contrast, only the appearance is used, which is usually captured by different
two-dimensional views of the object of interest. It is thus easy to see that,
whatever the representation method, object representation takes centre stage in the
entire object recognition pipeline, and the problem of object-class recognition in
turn reduces to generating an efficient representation of the object that can detect,
localise and identify the class of the object discriminatively and repeatably.
As stated earlier, extracting and describing the features of the objects in the
images efficiently decides the fate and success of a typical object recognition
system. In a generic object recognition or categorization system, the relevant
features or descriptors of a characteristic point, patch or region of an image are
obtained by a variety of approaches. As shown in Figure 5, at the top-most level
features can be divided into two categories, global and local: the former
characterize the image as a whole, whereas the latter represent some local
information in the form of a pixel, patch or region. Another direction along which
many researchers have tried to classify features is structural versus statistical.
Although there are various classifications of features, significant overlap exists
among the classes; local features, for example, can be structural as well as
statistical. These features are often combined to form various descriptors; in
particular, region-level descriptors are formed by combining colour, texture and
other such low-level features.
As far as pixel-level features are concerned, they are regarded as low-level
information about the image; they are computed directly from the grayscale values of
individual pixels and are generally used to build more sophisticated patch-level or
region-level descriptors. We now briefly discuss some of the best-performing
descriptors proposed and used over the years. This is not meant to be an exhaustive
discussion of the existing approaches, but rather a sample of some relatively
successful and widely used ones.
4.2.1. Appearance-Based Object Representation
The Scale-Invariant Feature Transform (SIFT) [3][4], introduced by Lowe, is regarded
as one of the most popular patch-level feature descriptors reported in the
literature. The features identified are shown to be invariant to basic geometric
transformations and partially invariant to illumination changes and occlusion. SIFT
features proved successful because they do not depend on the exact grey-level
distribution within an image patch, but instead use the general configuration of the
image gradient [60]. This was considered one of the most prominent approaches in the
area of object recognition, and the work is considered a milestone in research on
object recognition, computer vision and other image analysis problems. However, since
the descriptors are appearance-based, they may produce poor results, especially if
the object does not carry enough texture information. SIFT has been applied to the
problem of object recognition in many works; two such usages are described in [3]
and [4]. In various other works [2, 39, 42, 75, 110, 111, 112], improvements have
been achieved by combining other features with SIFT or by using filters other than
the Gaussian [110]. The number of keypoints obtained when SIFT is applied is
relatively large, resulting in high-dimensional data. This drawback was recognized by
the authors of [5], who extended SIFT as PCA-SIFT, in which Principal Component
Analysis is applied to the normalized gradient patch, yielding a lower-dimensional
descriptor. PCA-SIFT yields a 36-dimensional descriptor that is fast to compute and
match but less distinctive [6], while the GLOH (Gradient Location-Oriented Histogram)
descriptor introduced by Mikolajczyk and Schmid [45] is another variant of SIFT that
proved more distinctive at the same dimensionality [6]. A colour image-based SIFT has
also been demonstrated in [75], in which colour gradients are used in the Gaussian
framework in place of intensity gradients.
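The central idea, describing a patch by the configuration of its image gradient rather than its raw grey levels, can be sketched as below. This is a deliberately simplified, hypothetical illustration; real SIFT additionally performs scale-space keypoint detection, splits the patch into a 4x4 grid of sub-cells, and interpolates and clips histogram entries.

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Describe a patch by its gradient-orientation distribution,
    weighted by gradient magnitude (the SIFT building block)."""
    gy, gx = np.gradient(patch.astype(float))   # image gradients
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)      # orientation in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi),
                           weights=mag)
    norm = np.linalg.norm(hist)
    # Normalising gives partial invariance to illumination changes
    return hist / norm if norm > 0 else hist
```

Because only the *distribution* of orientations is kept, a uniform brightness shift of the patch leaves the descriptor unchanged.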
As mentioned earlier, the high dimensionality of the descriptor is the major
limitation of SIFT. Another effective patch-level descriptor, SURF (Speeded-Up Robust
Features), was proposed by Bay et al. in [6]. The authors make use of integral
images, which yields features that are not only faster to compute but also
distinctive and repeatable. They base their descriptor on the Hessian matrix but use
a very basic approximation. Moreover, only 64 dimensions are used, much fewer than
SIFT's 128-dimensional vector. One can argue that PCA-SIFT results in only a
36-dimensional vector, but it loses distinctiveness in the process, whereas SURF has
proved more distinctive and repeatable.
Another level at which feature descriptors are generated in numerous papers is the
region level. Dalal and Triggs [32, 33, 34] used grids of locally normalised
Histograms of Oriented Gradients (HOG) as descriptors for object detection in static
images. The technique counts occurrences of gradient orientations in localized
portions of an image. The detector window is tiled with a grid of overlapping blocks,
in each of which a Histogram of Oriented Gradients feature vector is extracted. The
resulting detector is contrast-based, which makes it robust to small changes in image
contour locations and directions and to significant changes in image illumination and
colour, while
remaining highly discriminative for overall visual form. The work of Dalal and Triggs
[32, 33, 34] is aimed at the detection of humans in particular, but the descriptor
also proved effective in detecting other object classes in images. HOG has proved to
be a very efficient descriptor for representing structured objects; for example, it
has outperformed all other descriptors in pedestrian detection from videos and
images. Inspired by HOG [32], Bosch et al. [36] proposed a novel descriptor called
PHOG (Pyramid of HOG). The idea is to represent local image shape and its spatial
layout, together with a spatial pyramid kernel of Bag of Features (BoF) [25, 26].
Each image is divided into a sequence of increasingly finer spatial grids by
repeatedly doubling the number of divisions along each axis (like a quadtree), and
the number of points in each grid cell is recorded. A HOG vector is computed for each
grid cell at each pyramid resolution level, and the final PHOG descriptor for the
image is the concatenation of all the HOG vectors. This concatenated HOG vector is
then normalized to ensure that texture-rich images, or images with many edges, are
not weighted more strongly than others. Another descriptor built on the idea of the
histogram of gradients is CoHOG (Co-occurrence Histograms of Oriented Gradients),
proposed in [37]. CoHOG can express shapes in more detail than HOG because its
histograms have pairs of gradient orientations as their basic units; the histogram is
referred to as a co-occurrence matrix. This pairing increases the vocabulary size,
resulting in a more specific expression of the shape of the object in the image. The
use of a higher-dimensional matrix makes CoHOG powerful in terms of discriminative
power, but at the same time highly computation intensive.
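The HOG family above shares one computational core: per-cell orientation histograms concatenated over a grid. The following stripped-down sketch is our simplification for illustration only; the full Dalal-Triggs detector additionally uses overlapping blocks with per-block contrast normalisation.

```python
import numpy as np

def hog_sketch(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (0-180
    degrees), concatenated over a non-overlapping grid of cells."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = image.shape
    cells = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            hist, _ = np.histogram(ang[i:i + cell, j:j + cell],
                                   bins=bins, range=(0, 180),
                                   weights=mag[i:i + cell, j:j + cell])
            cells.append(hist)
    vec = np.concatenate(cells)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

PHOG would repeat this at several grid resolutions and concatenate the results; CoHOG would histogram *pairs* of orientations at fixed offsets instead of single orientations.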
Bag of Features and visual codebook based approaches
This approach is inspired by the BoW (Bag of Words) approach, first proposed in 1997
by [38] for describing textual data for the purpose of various text analysis tasks.
BoW represents a text document, or a sentence written in natural language, as a set
of words, taking into consideration neither its grammar nor the order in which the
words occur in the original text. The frequency of occurrence of each word is
calculated and then used for various language processing tasks. The analogous term
BoF (Bag of Features) is used for the image-based approach: similarly to the BoW
model, an image is represented as an orderless collection of local image features.
Similar terms, such as Bag of Keypoints (BoK) and Bag of Visual Words (BoVW), are
used by various researchers in their work. The method is based on vector quantization
of affine-invariant descriptors of image patches [39]: a bag of keypoints corresponds
to a histogram of the number of occurrences of particular image patterns in a given
image. The method uses clustering to obtain quite high-dimensional feature vectors
for a classifier. Since the BoF approach constructs a codebook, it is also often
referred to as a codebook-based approach. The method includes the following main
steps.
 Detection of image patches for the computation of patch descriptors.
 Computation of patch descriptors for these patches. These can be any invariant descriptor such as SIFT [3, 4] or a variant of it, or any other lower-level descriptor such as Harris-affine [43] or MSER.
 Construction of a visual codebook/vocabulary/dictionary by assigning the patch descriptors to predetermined clusters (a vocabulary) with a vector quantization algorithm that groups similar features together. Several clustering techniques have been used for this purpose; k-means clustering is applied most frequently [39], while hierarchical k-means clustering is adopted in [49] and mean-shift in [35].
 Generation of a histogram of the number of occurrences of the patches assigned to each cluster. The size of the resulting histogram equals the size of the codebook, and hence the number of clusters obtained from the clustering step [40].
 Treating the bag of features as a feature vector and using a classifier to classify the respective image. A distance measure is required when comparing two term vectors for similarity, but this measure operates in the term-vector space as opposed to the feature space.
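The pipeline above can be sketched end-to-end with a toy implementation, using random vectors to stand in for SIFT-like patch descriptors and a plain k-means loop for the vector quantization step (the cluster count and iteration budget are arbitrary choices here):

```python
import numpy as np

def build_vocabulary(descriptors, k=16, iters=10, seed=0):
    """Toy k-means vector quantization over patch descriptors."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bof_histogram(descriptors, centers):
    """Quantize one image's descriptors against the vocabulary and
    return the normalized occurrence histogram (the BoF vector)."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

The resulting histogram is the fixed-length "term vector" that a classifier such as an SVM consumes.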
The bag-of-features (BoF) image representation proved popular for indexing and categorization applications for two reasons. First, the representation benefits from powerful local descriptors such as SIFT; second, these vector representations can be compared with standard distances and subsequently used by robust classification methods such as support vector machines [50]. In addition, codebook-model-based approaches, while ignoring any structural aspect of vision, provide state-of-the-art performance on current datasets [40]. The discriminative power of the visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Codebook-based approaches are considered simple and efficient, and can also be made robust to clutter, occlusion, viewpoint change, and even non-rigid deformations [26, 25]. In spite of being one of the popular and successful approaches, BoF and visual-codebook generation also have certain limitations. Because BoF expresses the image as appearance-frequency histograms of visual words by quantizing SIFT-like features, location information and the geometric relationships between keypoints are lost. Moreover, since vector quantization is involved, some loss of information is inherent. Finally, because the geometric relations between features are lost, localization of the object is not possible.
To overcome the limitation of this orderless representation, several researchers have proposed ways to augment the bag of features with global spatial relations that significantly improve classification performance while remaining simple and computationally efficient enough for real-world applications [27]. The authors of [27] demonstrated that the bag-of-features description of an image can be extended to spatial pyramids so that the spatial locations of the features are retained. To generate these spatial pyramids, the input image is partitioned into increasingly fine sub-regions, histograms of local features are computed over these sub-regions, and the histograms are concatenated to form the final feature. This representation is combined with the kernel-based pyramid matching scheme proposed in [24], which efficiently computes an approximate global geometric correspondence between the sets of features in two images. While the spatial pyramid representation sacrifices the geometric invariance properties of bags of features, it compensates for this loss with the increased discriminative power derived from global spatial information.
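A minimal sketch of this spatial-pyramid extension, assuming keypoint coordinates and already-quantized visual-word indices as inputs (a made-up interface; real implementations work directly from detector output):

```python
import numpy as np

def spatial_pyramid(xs, ys, words, vocab_size, img_w, img_h, levels=2):
    """Sketch of a spatial-pyramid BoF feature: visual-word histograms
    computed per grid cell at each pyramid level and concatenated."""
    feats = []
    for level in range(levels + 1):
        n = 2 ** level                                # n x n grid at this level
        cx = np.minimum(xs * n // img_w, n - 1)
        cy = np.minimum(ys * n // img_h, n - 1)
        for i in range(n):
            for j in range(n):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(words[in_cell], minlength=vocab_size)
                feats.append(hist)
    return np.concatenate(feats)
```

Each keypoint is counted once per level, so coarse layout information is retained while the per-level histograms stay orderless within their cells.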
Similarly, in [2], to overcome this inherent limitation of the BoF approach, a graph is constructed by connecting SIFT keypoints with lines. The keypoints thereby maintain their relationships, and a structural representation with location information is obtained. Since a graph representation is not suitable for statistical processing, the graph is embedded into a vector space according to the graph edit distance; with this approach, the authors achieved improved recognition accuracy compared to the conventional method in their experiments on the PASCAL VOC and Caltech-101 datasets. The basic idea for improving the BoF approach, then, is to incorporate the spatial locations of features into the BoF representation so that the method can be used not only for recognition but also for object localization. The
authors in [47] achieved an improvement by adding binary signatures to the descriptors: first, a Hamming Embedding (HE) of the SIFT descriptors, analogous to the Hamming distance; and second, a weak geometric consistency (WGC) check integrated within the inverted file system, which penalizes descriptors that are not consistent in terms of angle and scale. In this way, geometric information is incorporated in the index even for very large datasets. At the same time, both HE and WGC require additional information to be stored, so the memory requirement of the index increases.
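The Hamming side of this scheme reduces to comparing short binary signatures; a minimal sketch, with the signature width and the acceptance threshold chosen arbitrarily for illustration:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary signatures packed as ints."""
    return bin(a ^ b).count("1")

def consistent(sig_query: int, sig_db: int, threshold: int = 24) -> bool:
    """A descriptor match is kept only if its binary signature lies
    within a small Hamming radius of the query's signature."""
    return hamming(sig_query, sig_db) <= threshold
```

Because the signatures are plain integers, the filtering step costs one XOR and a popcount per candidate, which is what makes the scheme viable at very large scale.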
The visual-codebook approach has been used by several other researchers in slightly different ways. For example, Leibe et al. [7, 8, 9] adopted a two-stage approach. In the first stage, a codebook of local appearances is learnt that captures which local structures may appear on objects of the target category. Next, an Implicit Shape Model (ISM) specifies where on the object the codebook entries may occur. To create the codebook, the authors adopted the method presented in [17] by Agarwal and Roth: from a variety of images, 25 x 25 pixel patches are extracted with the Harris interest-point detector, and these patches are grouped by agglomerative clustering to generate a compact set of clusters. These codebook entries are used to define the implicit shape model of the objects. The approach does not try to create a separate model for every possible shape an object can take; rather, it defines the shapes of an object in terms of patches that are consistent in local appearance. Due to this, fewer training examples are needed to learn an object's probable shapes. In a second pass, the codebook entries are scanned, and all entries whose similarity exceeds a chosen threshold are activated; this threshold is the same as the one used during clustering in the first step. In the recognition stage, a generalized Hough transform is performed to identify possible object centres.
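The voting step of the recognition stage can be sketched as follows, assuming each activated codebook entry carries a stored offset to the object centre (the tuple format is invented for illustration):

```python
import numpy as np

def vote_for_centre(matches, img_h, img_w):
    """ISM-style sketch of generalized Hough voting: each activated
    codebook entry votes for the object centre via its stored offset,
    and the vote-map maximum is the hypothesised centre.
    `matches` is a list of (patch_y, patch_x, offset_y, offset_x)."""
    votes = np.zeros((img_h, img_w))
    for py, px, dy, dx in matches:
        cy, cx = py + dy, px + dx
        if 0 <= cy < img_h and 0 <= cx < img_w:
            votes[cy, cx] += 1
    return np.unravel_index(votes.argmax(), votes.shape)
```

Real implementations weight the votes by match quality and smooth the vote map before taking maxima; this sketch keeps only the accumulation idea.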
GIST: Humans can recognize the gist of a novel image in a single glance, independent of its complexity [69], considering the image in a "holistic" manner while overlooking most of the details of its constituent objects. Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene. The input image is divided into non-overlapping regions; each region is further divided into sub-regions, and a gradient-orientation histogram is computed for each sub-region. The GIST descriptor for a region is formed by concatenating the gradient-orientation histograms of all its sub-regions. The approach is more prevalently used for scene understanding. GIST-based approaches cannot be considered an alternative to local-feature-based image analysis, but they can serve as additional support for recognition by helping to constrain the local-feature-based analysis. In [72, 73], short binary codes are used to compress GIST descriptors, and the authors demonstrate that the approach works on millions of images obtained from the Internet without sacrificing recognition accuracy or effectiveness.
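The binary-code compression of [72, 73] can be illustrated with a random-projection sketch, where the sign of each projection gives one bit; the bit width and projection scheme here are illustrative assumptions, not the exact method of those papers:

```python
import numpy as np

def binary_code(descriptor, n_bits=64, seed=0):
    """Compress a (GIST-like) descriptor to a short binary code via
    random projections: the sign of each projection gives one bit."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((n_bits, len(descriptor)))
    return (proj @ np.asarray(descriptor, dtype=float) > 0).astype(np.uint8)

def code_distance(c1, c2):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(c1 != c2))
```

Nearby descriptors tend to receive nearby codes, so nearest-neighbour search over millions of images reduces to cheap Hamming comparisons.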
4.2.2. Shape-Based Approaches
Many approaches based on intensity or colour gradients of image patches or regions were discussed in the previous part of this paper. As noted, these descriptors are very powerful and have been shown to perform object recognition with remarkable effectiveness. Still, there are cases where two object classes share the same colour and texture and differ only in shape, or where the appearance varies greatly across instances of the object. Such objects cannot be represented
with colour- and intensity-based features alone. For example, within a produce class, a raw mango and a capsicum are both green but have entirely different shapes. The recognition community also understood early on that, across the exemplars belonging to a category, shape is a more invariant property than appearance. As a result, the majority of recognition systems from the mid-1960s to the late 1990s attempted to extract shape features, typically beginning with the extraction of edges: at occluding boundaries and surface discontinuities, edges capture shape information. Shape is thus another important cue that can be used to generate a discriminative representation of objects. Different authors have taken different approaches to computing the shape of an object. Shape cues are frequently captured and described at the region level for object-class recognition or detection using contour or boundary fragments [19], shapelets, edgelets [20], shock graphs, etc. Another area of research in shape-based detection is how to set up the correspondence between shapes extracted from training and test images, i.e., how to decide whether two shapes match [52, 53]. One limitation of shape-based object description is that it cannot capture intra-class variations in a very discriminative way; for example, a zebra cannot be differentiated from a horse by shape alone. Shape-based cues are therefore often combined with other, appearance-based cues.
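As a toy illustration of shape description and matching, the sketch below uses a centroid-distance signature; real systems use far richer descriptors such as shape contexts, and the input format here is an assumption:

```python
import numpy as np

def shape_signature(contour, n_samples=32):
    """Toy shape descriptor: distances from contour points to their
    centroid, resampled and scale-normalized.  `contour` is an (N, 2)
    array of boundary points (a made-up input format)."""
    centroid = contour.mean(axis=0)
    d = np.linalg.norm(contour - centroid, axis=1)
    idx = np.linspace(0, len(d) - 1, n_samples).astype(int)
    sig = d[idx]
    return sig / sig.max()              # scale invariance

def shape_distance(sig_a, sig_b):
    """Match two shapes as the best alignment over circular shifts,
    giving crude invariance to the contour's starting point."""
    return min(float(np.abs(np.roll(sig_a, s) - sig_b).sum())
               for s in range(len(sig_a)))
```

The zebra-versus-horse problem noted above shows up immediately here: two contours of similar outline produce near-identical signatures regardless of their surface appearance.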
4.2.3. Part-Based Approaches
Object as 3D volumetric parts
The earliest attempts at solving the object recognition problem used high-level 3D parts to model objects, such as generalized cylinders (Binford), geons (Biederman [13]) and superquadrics (Pentland) [79]. The common characteristic of these representations is that they are all based on symmetry, a physical regularity of our world that the human visual system exploits. In practice, however, extracting such parts efficiently and inexpensively is too complex. Once extracted, though, they are semantically closer to a description of the image content, and such parts are limited in number compared with approaches that describe the object using low-level and mid-level features. Although methods based on low-level and mid-level features score on simplicity, ease of extraction and attractive invariance properties, they have proved weak at expressing high-level semantic information about the image. These facts made object representation using 3D volumetric parts attract a great deal of attention in the 1970s and 1980s. Detailed coverage of the topic is beyond the scope of this study, but the works of Binford and Nevatia [115] can be explored for further information.
Recognition based on parts
In part-based object recognition approaches, an object is modeled as a geometrically constrained set of parts, where each part has a distinctive appearance and spatial position; shape is represented by the mutual positions of the parts [22]. Using such features, it is determined whether an instance of the object of interest exists in the image and, if so, where it is located. The methods in the literature differ in how the parts are detected, how their positions are represented, and how many parts should ideally represent an image; generally these parameters are tuned to the requirements of the approach. In [22], objects are modelled as flexible constellations of parts. A probabilistic representation (in this case Gaussian) is used for all aspects of the object: shape, appearance, occlusion and relative scale. To learn and model an object category, regions and their scales are first detected. The identified regions are cropped from the image and rescaled to a small patch, typically 11 x 11 pixels, and the parameters of the above densities are then estimated from these regions such that the model gives a maximum-likelihood description of the training data. To detect the features, a histogram of the intensities in a circular region of some radius is generated for each point in the image; the entropy of this histogram is computed, and its local maxima determine the scale of the region. The N regions with the highest saliency over the image provide the features for learning and recognition. PCA is used to reduce the dimension of the feature set.
Deformable part-based approach
Deformable Part Models (DPMs) constitute the state of the art for sliding-window object detection [99]. DPMs are inspired by the pictorial structure representation introduced by Fischler and Elschlager in [91], where an object is modelled by a collection of parts arranged in a deformable configuration [92]. Small picture segments represent the visual properties of the object, while the deformable configuration is captured by spring-like connections between these segments. An energy function is formed by summing a match cost for each part and a deformation cost for each pair of connected parts, and this energy is minimized to find the best match of the model within an image. The effectiveness of the pictorial representation for image matching demonstrated in [91] is due to its simplicity; in addition, the representation is widely applicable because it does not depend on any particular scheme for modelling the appearance of the parts, and so can represent quite generic objects. On the other hand, the model suffers from some critical limitations. Many parameters are involved in constructing the model, so minimizing the energy function becomes very computation intensive. Moreover, only the single best match is found; if the image contains multiple instances of the same object, they are not all detected by the pictorial representation of [91]. These issues were aptly handled by Felzenszwalb in the pioneering work reported in [92]. The pictorial representation of Fischler and Elschlager can be viewed as a general graph, whereas Felzenszwalb and Huttenlocher used a tree representation, observing that many real-world objects, especially human beings and animals, can be modelled by a tree structure. With this improvement, the best match of the model to an image can be computed in polynomial time. The approach requires that the graph representing the object be acyclic, and that the function dij(li, lj), measuring the degree of deformation of the model when part vi is placed at location li and part vj at location lj, be a Mahalanobis distance between transformed locations. While deformable models can capture significant variations in appearance, a single deformable model is often not expressive enough to represent a rich object category [93]. It can also be noted that, in practice, simple models often outperform deformable part-based representations, because simpler models are easier to train than sophisticated models such as DPMs. The authors of [93] represent an object by a low-resolution root filter and a set of higher-resolution part filters arranged in a flexible spatial configuration. The flexible spatial configuration helps to
model the visual appearance at multiple scales. The approach has achieved benchmark results in the PASCAL object detection challenges. It basically uses HOG features [32] in a star-structured part-based model: a root filter similar to the filter used in [32], together with a set of part filters and deformation models.
The model presented in [101] is effective for shallow structures of at most two layers, but as the number of layers increases it becomes difficult to scale the model without introducing and tuning additional parameters. Yuille et al. [106] extended the model discussed in [101], proposing to describe an object class using several templates from different viewpoints. Each template is represented as a three-layer tree structure: the first layer represents the entire image; the second layer divides the image into 9 sub-images; and the third layer divides each second-layer sub-image into four, giving 36 sub-images in the third layer.
The approach used by Dalal and Triggs [32] to detect pedestrians fails in the presence of articulation, whereas [93, 94, 95, 96] allow an intermediate layer of parts that can be shifted with respect to each other, making the overall model deformable and thereby achieving generalization. Such approaches still do not work when the goal is to extract human pose from images. In [102], Bourdev and Malik introduced "poselets", parts that are tightly clustered in both appearance and configuration, for detection and pose estimation in images containing human bodies. In [79], Pablo et al. unified the approaches of Dalal and Triggs [32], Felzenszwalb [95], and Bourdev and Malik [102] into a single recognition framework that tries to take the benefits of each: region-based object descriptors are used to perform purposive semantic segmentation, and their outputs are then combined to improve performance.
4.2.4. Recent Approaches and Advancements
We have discussed many approaches, with their benefits and limitations, in the earlier sections. Note that all of those approaches make essential use of machine learning methods, and most machine learning methods work well because of human-designed representations and input features. Early conventional approaches involve hand-crafted features for object representation and then look for those features in the image; to do this, the programmer needed deep knowledge of the data and would laboriously engineer each of the feature-detection algorithms [114]. There have been big improvements in image analysis over the last few years due to the adoption of deep neural networks for vision problems. Fig-8 schematically shows the difference between traditional vision systems and recent deep-neural-network-based systems.
Neural Nets for Object Recognition: Neural networks have been used in object recognition systems for decades. They implement a classification approach, and their attraction lies in their ability to partition the feature space with nonlinear class boundaries. Earlier, neural networks served only as the classifier in the classification stage of the object recognition pipeline (Figure 8); more recently, with progress in vision research and the increase in computational power, neural networks are used for automatic feature learning (from the raw image data) as well as for classification. In 1989, LeCun [123] demonstrated an algorithm for training neural networks in a supervised way and showed that applications such as hand-written digit recognition perform remarkably well with it. Since then, convolutional neural networks have been used by many research communities.
Convolutional neural networks differ from conventional approaches such as BoF and DPM (Deformable Part Model) for two important reasons. First, they are deep architectures, whereas the conventional approaches were shallow; second, they do not need prior knowledge of the image data. Deep learning made it possible to learn features directly from data instead of handcrafting them explicitly. This has helped vision tasks, and object recognition in particular, by enabling effective capture of low-level as well as mid-level cues of the object to be recognized. As a result, deep neural networks have brought huge improvements in image analysis performance over the last few years.
What makes deep architectures achieve such good results?
Conventional neural networks used one or two layers of neurons, whereas a deep neural network (DNN) stacks several (often many) layers of neurons on top of each other. As a result, a DNN can learn more complex models without the need for hand-designed features. DNNs have shown good results on the ImageNet dataset [126]: on the test data, the authors achieved top-1 and top-5 error rates of 37.5% and 17%. Their network consisted of 650,000 neurons, had 5 convolutional layers, and learnt 60 million parameters in ILSVRC 2010.
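One convolutional stage of such a network can be sketched in plain NumPy; stacking several of these stages, followed by fully connected layers, is what makes the architecture deep (this is a didactic forward pass only, with no learning):

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (really cross-correlation, as in CNNs)."""
    h, w = x.shape[0] - k.shape[0] + 1, x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    return out

def forward(x, kernel):
    """One convolutional stage: convolution -> ReLU -> 2x2 max pooling."""
    a = np.maximum(conv2d(x, kernel), 0.0)           # ReLU nonlinearity
    h2, w2 = a.shape[0] // 2, a.shape[1] // 2
    return a[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))
```

In a trained network the kernel weights are learnt from data by backpropagation; that learnt convolution is precisely the "automatic feature learning" contrasted with hand-crafted descriptors above.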
Like every other approach, deep architectures also have certain limitations.
 They need very sophisticated hardware, and typically require images of a fixed size (e.g. 224 x 224).
 They contain a huge number of parameters to be trained, making them computation intensive.
 When trained using gradient descent, the gradient may not trickle down to the lower layers, so sub-optimal sets of weights are obtained [114].
Various modifications to DNNs have been suggested in the literature to overcome these limitations. To remove the fixed-image-size constraint, several efficient pooling strategies have been proposed. In [113], the network is equipped with a spatial pyramid pooling strategy (SPP-net); SPP-net can generate a fixed-length representation irrespective of image scale and size. Spatial pyramid pooling is based on spatial pyramid matching [24], which in turn is an extension of the BoF approach [26]. Another improvement is the RNN (Recursive Neural Network) [130] used for scene classification, which predicts a tree structure for scene images.
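The pooling idea behind SPP-net can be sketched for a single feature map: max-pooling over a fixed set of grids yields the same output length whatever the input size (the pyramid levels chosen here are illustrative):

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial-pyramid-pooling sketch: max-pool one feature map over an
    n x n grid for each level and concatenate, giving a fixed-length
    vector regardless of the input size."""
    out = []
    for n in levels:
        for rows in np.array_split(np.arange(feature_map.shape[0]), n):
            for cols in np.array_split(np.arange(feature_map.shape[1]), n):
                out.append(feature_map[np.ix_(rows, cols)].max())
    return np.array(out)     # length = sum of n*n over levels (here 21)
```

Because the output length is fixed, the fully connected layers that follow never see the variation in input image size.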
Figure 8(a). Block diagram representing typical traditional object recognition system
Various competitive challenges and Datasets
Here we present some of the challenges that the computer vision community organizes annually to invite, evaluate and report innovative approaches developed by research groups across the world. These challenges provide a common platform on which researchers present their work and compete with one another in object detection, localization and categorization. They also provide datasets of sample images so that the approaches can be evaluated over a wide range of image content and image-capturing conditions.
Figure 8(b). Block diagram of Deep Architectures (Image taken from:
ufldl.stanford.edu/eccv10-tutorial/eccv10-tutorial_part4.ppt)
Early approaches to object recognition used very small sets of images to evaluate their algorithms. With the advent of the world-wide web, however, large numbers of annotated images became readily available in public and private repositories, and datasets have been created from these repositories and made available to the research community. As mentioned earlier, the challenges are an effort to bring the research community together in a framework of competition so that the best approaches in computer vision can be evaluated and publicized. Each challenge consists of two components: first, a publicly available dataset with ground-truth annotations and standardised evaluation software; and second, a competition and workshop [119]. To review these challenges, we first discuss the datasets made available by them, along with certain other widely used datasets.
Datasets: No research is possible in any area without appropriate datasets [30], and the same applies to object recognition and computer vision. Appropriate datasets are needed at all stages of recognition research: for learning visual models of objects and scene categories, for detecting and localizing instances of those models in images, and for evaluating the performance of recognition algorithms. The work in [30] reviews existing image datasets from the point of view of expectations, challenges and limitations. Ideally, datasets should offer a wide range of image variability and should be sufficiently challenging for algorithms to be evaluated meaningfully. One of the major difficulties in creating such datasets is that the images must be annotated. This annotation has to be done by human experts, and it turns out to be a mammoth task considering the huge number of real-world objects to be recognized for various applications; it is not easy to get human experts to accomplish it effectively, correctly and efficiently. An elegant approach to automatic dataset collection from the web, using object recognition techniques incrementally, is described in [66]: images found on the web are used to learn the model in a robust way. Another solution for obtaining annotated training examples is crowdsourcing, but the most common error an untrained annotator makes is failing to consider a relevant class as a possible label because they are unaware of its existence.
Now we discuss some of the most prevalent datasets.
Caltech-101 & Caltech-256: Caltech-101 is a collection of pictures of objects belonging to 101 categories, collected by Fei-Fei et al. [64] in 2003. There are about 40 to 800 images per category, with most categories having about 50 images. Most images have little or no clutter, and the objects tend to be centered in each image and in a stereotypical pose. In comparison, Caltech-256 is a collection of 256 object categories with 30,608 images in total. Fig-9 compares Caltech-101 and Caltech-256.
Figure 9 (Courtesy: http://www.vision.caltech.edu/Image_Datasets/Caltech256/details.html)
TRECVID: TRECVID organizes a competition every year and releases a dataset of video shots for evaluating performance. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Annotations are not provided by the organizers.
LabelMe: LabelMe [74] is a publicly available annotated image database open to public contribution. The dataset is provided with an annotation tool so that anyone can annotate any image. Because images are annotated by experts as well as casual users, the annotations cannot be relied on for a test set, but a huge quantity of training images can be obtained.
COIL-20 & COIL-100: COIL-20 and COIL-100 are databases of grayscale images of 20 and 100 object categories respectively [120]. Different poses of the objects were generated by placing them on a rotating turntable and capturing images at angular displacements of 5 degrees, giving 72 views per object. The release also includes 720 unprocessed images of 10 object categories, along with 1440 size-normalized images.
Microsoft COCO: The Common Objects in Context database is a large-scale image database addressing three core research problems in scene understanding: detecting non-iconic views of objects, contextual reasoning between objects, and precise 2D localization of objects [122]. Contextual knowledge can help boost all components of the object recognition framework, and the dataset is designed to support object recognition based on the context in which objects occur in the scene. The dataset contains 91 common object categories, 82 of which have more than 5,000 labeled instances; in total it has 2,500,000 labeled instances in 328,000 images. Compared with other popular large-scale datasets such as PASCAL VOC and ImageNet (discussed in the following sections), COCO has fewer object categories but a very high number of instances per category.
The PASCAL Visual Object Classes Challenge
The PASCAL VOC challenge was first organized in 2005 and was then held annually up to 2012. The challenge consists of two components: a publicly released dataset of images of objects from 20 categories, obtained from the Flickr web-site; and a competition involving object classification, detection, segmentation, action classification and person layout. Everingham et al. review PASCAL VOC in [119]. The images are fully annotated for each of the objects. Note that such a rich dataset was not released in 2005: only a dataset of four categories (motorbikes, bicycles, cars, and people) was available that year, but the organizers kept enriching it every year until 2011. To assess different methods, bootstrapping of the ROC curve is used.
The evaluation technique is used in a number of different ways: to judge the variability of a given method, to compare the relative strengths of two methods, or to look at rank ranges in order to get an overall sense of all the methods in a competition [119].
ImageNet Large Scale Visual Recognition Challenge (ILSVRC): ILSVRC was first organized in 2010 and has been held annually since. It is one of the most prestigious series of competitions and workshops in the computer vision community for evaluating the performance of contemporary approaches developed by various researchers. The challenge is reviewed from various aspects in [118]. Similar to PASCAL VOC, ILSVRC provides a huge collection of annotated images, under the name ImageNet, by Deng et al.
ImageNet Dataset: ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images; currently there are over five hundred images per node [121], and on average about 1,000 images illustrate each synset. Images of each concept are quality-controlled and human-annotated. ILSVRC is of much higher dimension than PASCAL VOC [121]: as of 2010, ImageNet was organized into 12 subtrees with 5,247 synsets and 3.2 million images in total. Since ImageNet's organization is inspired by the WordNet structure, which contains around 80,000 noun synsets, ImageNet aims to cover the majority of those 80,000 synsets with an average of 500-1,000 clean, full-resolution images each. To evaluate the approaches, the bootstrapping strategy used by PASCAL VOC is employed in the ILSVRC series as well. In Table 2 we present a comparison between the PASCAL VOC and ILSVRC challenges as of 2012, as referred from [101].
Table 2 Comparison of PASCAL VOC and ILSVRC as per [118]

Aspect for comparison              PASCAL VOC                        ILSVRC
Diversity of object classes        Objects carry only one class      Objects are refined into
                                   label, e.g. "boat" for all        subcategories, e.g. a boat is
                                   types of boat, be it lifeboat     not just a boat but a lifeboat
                                   or fireboat                       or gondola
Chance Performance of              8.8% on validation set for        20.8% for all 1000 categories
Localization (CPL)                 20 categories
Average object scale per class     0.241                             0.358
Average number of instances        1.69                              1.59
per class
Clutter per class (computed as     129.96                            106.98
number of bounding boxes)
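In both challenges, a detection is scored by the overlap between the predicted and ground-truth bounding boxes: a prediction counts as correct when its intersection-over-union (IoU) with the ground truth exceeds a threshold, conventionally 0.5. A minimal Python sketch of this criterion (an illustration, not the official evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred, truth, threshold=0.5):
    return iou(pred, truth) >= threshold

# A prediction overlapping half of the ground-truth box scores IoU = 1/3:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Localization metrics such as the CPL row in Table 2 are built on this kind of overlap criterion.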
In addition to all these, various other datasets are used in the literature, like GRAZ-01 by Opelt and Pinz, which contains four types of images: bikes, people, background with no bikes, and background with no people; the INRIA (people) dataset used by Dalal and Triggs in their work in [32, 33]; MNIST, a dataset of handwritten digits; ImageCLEF; INRIA (horses, cars); the TinyImages dataset by Torralba et al.; ETH-80; etc. The CIFAR-10 set has 6000 examples of each of 10 classes, and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes [125]. The list that we have considered is not exhaustive but exemplary; for an exhaustive list, [127] can further be explored.
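For concreteness, the CIFAR batches described in [125] store each image as a flat row of 3072 values: 1024 red, then 1024 green, then 1024 blue values, with each colour plane laid out row-major as 32x32. A small sketch of unpacking one such row, shown on a synthetic row rather than a real batch file:

```python
def cifar_row_to_image(row):
    """Unpack one flat CIFAR row (1024 red, 1024 green, 1024 blue values,
    each plane stored row-major as 32x32) into a 32x32 grid of RGB tuples."""
    assert len(row) == 3072, "CIFAR rows are 32 * 32 * 3 = 3072 values"
    r, g, b = row[0:1024], row[1024:2048], row[2048:3072]
    return [[(r[y * 32 + x], g[y * 32 + x], b[y * 32 + x]) for x in range(32)]
            for y in range(32)]

# Illustrative only: a synthetic row with constant channel values stands in
# for a row of the real (pickled) data batches.
row = [10] * 1024 + [20] * 1024 + [30] * 1024
img = cifar_row_to_image(row)
# img[y][x] is the (r, g, b) pixel at row y, column x.
```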
5. FUTURE RESEARCH DIRECTIONS
Object recognition is one of the most exciting research areas in the field of computer vision. Today the need is to develop systems which are computationally efficient and at the same time cost effective. We suggest some future research directions which can be explored and in turn be incorporated in recognition systems. These directions are suggested at the algorithmic level or at the product level; many of them can at present be considered ideas which may need a knowledge base from multidisciplinary fields.
 Deep learning is the current state-of-the-art in object recognition and has produced promising results, but deep networks suffer from the serious limitation of being resource intensive; in the absence of sophisticated hardware, DNNs cannot be adopted for object recognition. In such cases, enhancing the performance of conventional feature extraction techniques on shallow architectures can be helpful. New approaches are needed which require only shallow architectures and are still efficient. It has also been observed that DNNs have not shown very impressive results in the task of object localization; this area can further be explored.
 It is also to be understood how the learning of features takes place in convolutional neural networks. What makes deep architectures give such high recognition accuracy?
 Due to the advent of mobile and other hand-held devices with very good image capturing abilities, recognition algorithms suited to such devices are in great need.
 Considerable work exists in the literature on action recognition systems, and a complete line of research is going on in this direction, as the area in itself involves many and varied issues and research problems. Products can be developed involving action and activity recognition from videos.
 Computer vision techniques can be a good way to build assistive technology for blind people. For example, products can be developed which see the surroundings, generate a natural language description of the scene, and give it as output in spoken form. This will help blind people to understand their surroundings and to navigate.
 Research in the area of understanding videos from their content has already started but is still in its infancy. Generic object recognition also paves the path for research in areas like emotion recognition, which will actually enable us to recognize the meaning of the content in a video.
 Robotics is another important field which can benefit from active object recognition. Today's robots are able to work only in well-structured and constrained environments, whereas the requirement is to develop robots which can learn, adapt and execute their tasks in real human environments.
 Almost every device has a camera, and devices are now powerful enough to record and process live video. These videos can be exploited for real-time applications. How do we organize and personalize all of this content for the common man?
 New performance evaluation techniques are needed.
 Many rich datasets, like ImageNet and PASCAL VOC, have been generated and made publicly available by the computer vision research community. Although these datasets hold huge numbers of images pertaining to various categories, if we are to reach near-human vision capabilities in terms of flexibility and dynamism, these datasets have to be enriched further. Novel ways of labelling huge amounts of unlabeled image data should be found, so that images annotated with ground truth can be generated and made available publicly.
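As an illustration of the first point above, the conventional shallow features that could be strengthened as an alternative to DNNs are typically built from local gradient statistics. The toy gradient-orientation histogram below is in the spirit of HOG; it is a simplified sketch, not Dalal and Triggs' full descriptor [32]:

```python
import math

def orientation_histogram(gray, bins=9):
    """Toy gradient-orientation histogram, the core idea behind HOG-style
    shallow features: accumulate gradient magnitude into orientation bins."""
    h, w = len(gray), len(gray[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]   # central differences
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[min(int(ang * bins / 180.0), bins - 1)] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist  # L1-normalised

# A vertical edge: intensity changes only along x, so all gradient energy
# falls into the 0-degree orientation bin.
patch = [[0] * 4 + [255] * 4 for _ in range(8)]
hist = orientation_histogram(patch)
```

A real descriptor would add cell/block pooling and contrast normalization, but the cheap, shallow character of the computation is what makes such features attractive without specialized hardware.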
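One family of "novel ways of labelling unlabeled data", mentioned in the last point above, is self-training, where a model's own confident predictions become pseudo-labels for retraining. A toy sketch follows; the classifier, threshold and data here are purely illustrative:

```python
def self_train(classify, labeled, unlabeled, confidence=0.9, rounds=3):
    """Toy self-training loop: repeatedly pseudo-label the unlabeled pool
    with the model's most confident predictions and retrain.

    `classify(labeled, item) -> (label, score)` is any scoring classifier
    that "trains" on the current labeled set.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        confident, rest = [], []
        for item in pool:
            label, score = classify(labeled, item)
            (confident if score >= confidence else rest).append((item, label))
        if not confident:
            break
        labeled.extend(confident)   # accept pseudo-labels as ground truth
        pool = [item for item, _ in rest]
    return labeled, pool

def nn_classify(labeled, item):
    # 1-nearest-neighbour on scalar "features"; confidence decays with distance.
    near, label = min(labeled, key=lambda pair: abs(pair[0] - item))
    return label, 1.0 / (1.0 + abs(near - item))

labeled_seed = [(0.0, "low"), (10.0, "high")]
unlabeled = [0.5, 1.0, 9.5, 5.0]
grown, leftover = self_train(nn_classify, labeled_seed, unlabeled, confidence=0.6)
# Points near the seeds get pseudo-labelled; the ambiguous 5.0 stays unlabeled.
```

The design risk, which any real system must address, is that a wrong pseudo-label is treated as ground truth forever; confidence thresholds and human spot-checks are the usual mitigations.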
6. CONCLUSION
From the literature available on the subject, it was found that the demand for efficient generic object recognition systems is increasing very fast, as the spectrum of applications in which object recognition is needed is very wide and rich. One major problem in generic object recognition is that the categories present in the real world are varied and huge in number. Due to this, training a recognition system for such a large number of categories and classes becomes a challenging task, however sophisticated the approach may be. Such a system is also expected to have the property of plasticity, i.e., the ability to gradually train itself for unseen categories, which further adds to its complexity; systems should be developed which are flexible enough to train themselves for new classes of objects. Another important issue with generic object recognition systems lies in the feature extraction and description phase. In most approaches, the number of features obtained is too large, and the features are handcrafted. This critical limitation has been overcome by deep architectures, which in turn exploit the sophisticated hardware accelerations that have evolved recently. Approaches are needed which make the entire setup cost effective and require fewer resources. Ideally, the recognition task should be performed at the semantic level, which will result in near-human vision systems.
One of the key objectives behind this survey was to answer the research questions identified by us in Section 1. From the literature surveyed, it can be deduced that earlier work on generic object recognition put more weight on the feature extraction stage and the type of features, whereas later works gave more prominence to the type of classifier used. Recent approaches, moreover, learn features directly from the image data, which can be regarded as a very striking innovation achieved by the vision community. Ways are now needed to enhance these approaches further, and the above efforts can also be extended to 3D images and to videos.
As a result of this study and from the referred material, a general remark can also be made about the kind of work that has been done in the field. Most papers before 2008 mainly present novel ways of modelling the object class, i.e., they emphasize novel ways of feature detection and description. In work published in the recent past, since 2011, with the advent of sophisticated hardware, more emphasis is given to handling more categories accurately and efficiently.
In this paper, the current scenario of generic object recognition is portrayed in brief, with the hope that in the near future an object recognition system will be developed which is capable of performing vision tasks similar to the human vision system, with the least possible effort and in a cost-effective manner.
REFERENCES
[1] Bennamoun, Mohammed, and George J. Mamic. Object recognition:
fundamentals and case studies. Springer Science & Business Media, 2002.
[2] Takahiro Hori, Tetsuya Takiguchi, Yasuo Ariki. Generic Object Recognition
Using Graph Embedding into a Vector Space, American Journal of Software
Engineering and Applications. Vol. 2, No. 1, 2013, pp. 13-18.
[3] David G. Lowe, "Object Recognition from Local Scale-Invariant Features", Proc. of the International Conference on Computer Vision, Corfu, September 1999.
[4] David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision 60.2 (2004): 91-110.
[5] Ke, Yan, and Rahul Sukthankar. "PCA-SIFT: A more distinctive
representation for local image descriptors." Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on. Vol. 2. IEEE, 2004.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "Surf: Speeded up robust
features." Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006.
404-417.
[7] B. Leibe, A. Leonardis, and B. Schiele. "Combined Object Categorization and Segmentation with an Implicit Shape Model", In ECCV 2004 Workshop on Statistical Learning in Computer Vision, pages 17-32, May 2004.
[8] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, Bernt Schiele, Luc Van Gool, "Towards Multi-View Object Class Detection", Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).
[9] Bastian Leibe, Aleš Leonardis, and Bernt Schiele, "Robust Object Detection with Interleaved Categorization and Segmentation", IJCV (2008) 77, 259-289.
[10] Jia, Menglei , Li, Hua, Xie, Xing , Chen, Zheng Ma, Wei-ying, “Automatic
Classification Of Objects Within IMAGES” United states Microsoft
Corporation (Redmond, WA, US)20080037877
http://www.freepatentsonline.com/y2008/0037877.html
[11] Pisipati, Radha Krishna (Hyderabad, IN), Syed, Shahanaz (Guntur, IN),
Jonna, Kishore (Proddatur, IN), Bandyopadhyay, Subhadip (Kolkata, IN),
Narayan, Rudra Narayan (Jemadeipur, IN) 2014. Systems And Methods
For Multi-Dimensional Object Detection United States 20140029852
http://www.freepatentsonline.com/y2014/0029852.html
[12] Besl, Paul J., and Ramesh C. Jain. "Three-dimensional object recognition."
ACM Computing Surveys (CSUR) 17.1 (1985): 75-145.
[13] Irving Biederman. Recognition-by-components: A theory of human image
understanding. Psychological Review, 94(2):115-147, 1987.
[14] Zhang, Xin, et al. "Object class detection: A survey." ACM Computing
Surveys (CSUR) 46.1 (2013): 10.
[15] Andreopoulos, Alexander, and John K. Tsotsos. "50 Years of object
recognition: Directions forward." Computer Vision and Image Understanding
117.8 (2013): 827-891.
[16] Roth, Peter M., and Martin Winter. "Survey of appearance-based methods for
object recognition." Inst. for Computer Graphics and Vision, Graz University
of Technology, Austria, Technical Report ICGTR0108 (ICG-TR-01/08)
(2008).
[17] Agarwal, Shivani, Aatif Awan, and Dan Roth. "Learning to detect objects in
images via a sparse, part-based representation." Pattern Analysis and
Machine Intelligence, IEEE Transactions on 26.11 (2004): 1475-1490.
[18] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object class
recognition by unsupervised scale-invariant learning." Computer Vision and
Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2003.
[19] A. Opelt, A. Pinz, and A. Zisserman. A Boundary-Fragment Model For
Object Detection. In Proc. ECCV, volume 2, pp 575–588, May 2006.
[20] Wu, Bo, and Ramakant Nevatia. "Detection of multiple, partially occluded
humans in a single image by bayesian combination of edgelet part
detectors."Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on. Vol. 1. IEEE, 2005.
[21] Andreas Opelt, Axel Pinz,Andrew Zisserman,” Incremental Learning Of
Object Detectors Using A Visual Shape Alphabet”, Proceedings of the 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06)
[22] Andreas Opelt, Axel Pinz ,Andrew Zisserman,,“Fusing shape and appearance
information for object category detection”, 2006 - eprints.pascal-network.org
[23] Andreas Opelt, Axel Pinz ,Andrew Zisserman, “Learning an Alphabet of
Shape and Appearance for Multi-Class Object Detection”, IJCV (2008) 80:
16–44
[24] Grauman, Kristen, and Trevor Darrell. "Pyramid match kernel and related
techniques." U.S. Patent No. 7,949,186. 24 May 2011.
[25] Zhang, Jianguo, et al. "Local features and kernels for classification of texture
and object categories: A comprehensive study." International journal of
computer vision 73.2 (2007): 213-238.
[26] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond bags of
features: Spatial pyramid matching for recognizing natural scene categories."
Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2006.
[27] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Spatial pyramid
matching." Object Categorization: Computer and Human Vision Perspectives
3 (2009): 4.
[28] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. “A Discriminative
Framework for Texture and Object Recognition Using Local Image Features.
In Toward Category-Level Object Recognition. Springer-Verlag Lecture
Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A.
Zisserman (eds.), 2006.
[29] Lazebnik, Svetlana. "Local, semi-local and global models for texture, object
and scene recognition." (2006).
[30] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik,
M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J.
Zhang, and A. Zisserman. “Dataset Issues in Object Recognition” In Toward
Category-Level Object Recognition. Springer-Verlag Lecture Notes in
Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.),
2006.
[31] Dorko, Gyuri, and Cordelia Schmid. "Object class recognition using
discriminative local features." (2005): 22.
[32] Dalal, N. And Triggs, B. 2005. Histograms of Oriented Gradients for Human
Detection. In Proceedings of theIEEE Conference on Computer Vision and
Pattern Recognition (CVPR’05).
[33] Dalal,N.,Triggs, B., And Schmid,C. 2006. Human Detection Using Oriented
Histograms Of Flow And Appearance. In Proceedings of the European
Conference on Computer Vision (ECCV’06).
[34] Dalal, Navneet. Finding people in images and videos. Diss. Institut National
Polytechnique de Grenoble-INPG, 2006.
[35] Jurie, Frederic, and Bill Triggs. "Creating efficient codebooks for visual
recognition." Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on. Vol. 1. IEEE, 2005.
[36] Bosch Anna, Andrew Zisserman, and Xavier Munoz. "Representing shape
with a spatial pyramid kernel." Proceedings of the 6th ACM international
conference on Image and video retrieval. ACM, 2007.
[37] Watanabe, Tomoki, Satoshi Ito, and Kentaro Yokoi. "Co-occurrence
histograms of oriented gradients for pedestrian detection." Advances in
Image and Video Technology. Springer Berlin Heidelberg, 2009. 37-47.
[38] JOACHIMS, T. 1997. A probabilistic analysis of the rocchio algorithm with
tfidf for text categorization. In Proceedings of the International Conference
on Machine Learning (ICML’97)
[39] Csurka, Gabriella, et al. "Visual categorization with bags of keypoints."
Workshop on statistical learning in computer vision, ECCV. Vol. 1. No. 1-22.
2004.
[40] Ramanan, Amirthalingam, and Mahesan Niranjan. "A review of codebook
models in patch-based visual object recognition." Journal of Signal
Processing Systems 68.3 (2012): 333-352.
[41] K. Mikolajczyk, C.Schmid,” Indexing based on scale invariant interest
points”, International Conference on Computer Vision (ICCV '01) 1 (2001)
525—531.
[42] K.Mikolajczyk A. Zisserman C. Schmid,” Shape recognition with edge-based
features”, British Machine Vision Conference (BMVC '03) 2 (2003) 779—
788.
[43] K. Mikolajczyk and C. Schmid. “Scale and affine invariant interest point
detectors”. Int. J. Comput. Vision, 60(1):63–86, 2004.
[44] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, "A Comparison of Affine Region Detectors", International Journal of Computer Vision 65, 1/2 (2005) 43-72.
[45] K.Mikolajczyk and C.Schmid,”A performance evaluation of local
descriptors”, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27, 10 (2005) 1615—1630.
[46] Douze, Matthijs, et al. "Evaluation of gist descriptors for web-scale image
search." Proceedings of the ACM International Conference on Image and
Video Retrieval. ACM, 2009.
[47] Jégou, Hervé, Matthijs Douze, and Cordelia Schmid. "Improving bag-of-
features for large scale image search." International Journal of Computer
Vision 87.3 (2010): 316-336.
[48] Tuytelaars, Tinne, and Krystian Mikolajczyk. "Local invariant feature
detectors: a survey." Foundations and Trends® in Computer Graphics and
Vision 3.3 (2008): 177-280.
[49] K. Mikolajczyk, Bastian Leibe, Bernt Schiele,” Multiple Object Class
Detection with a Generative Model”, Computer Vision and Pattern
Recognition, 2006 IEEE Computer Society Conference on vol-1,pp 26 – 36
[50] Jégou, Hervé, et al. "Aggregating local descriptors into a compact image
representation." Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on. IEEE, 2010.
[51] Wengert, Christian, Matthijs Douze, and Hervé Jégou. "Bag-of-colors for
improved image search." Proceedings of the 19th ACM international
conference on Multimedia. ACM, 2011.
[52] Belongie, S., Malik, J., And Puzicha, J. 2001. “Matching shapes”, In
Proceedings of the IEEE International Conference on Computer Vision
(ICCV’01).
[53] Belongie, Serge, Jitendra Malik, and Jan Puzicha. "Shape matching and
object recognition using shape contexts." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 24.4 (2002): 509-522
[54] Andras Ferencz, Erik G. Learned-Miller, Jitendra Malik,” Building a
Classification Cascade for Visual Identification from One Example”,
Proceedings of the Tenth IEEE International Conference on Computer Vision
(ICCV’05).
[55] Hao Zhang Alexander C. Berg Michael Maire Jitendra Malik,”SVM-
KNN:Discriminative Nearest Neighbor Classification for Visual
Recognition”, Proceedings of the 2006 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’06)
[56] Bjorn Ommer Jitendra Malik,”Multi-Scale Object Detection by Clustering
Lines”, 2009 IEEE 12th International Conference on Computer Vision
(ICCV)pp 484-491
[57] Subhransu Maji, Jitendra Malik,”Object Detection using a Max-Margin
Hough Transform”,IEEE 2009, pp 1038-1045 .
[58] Vidal-Naquet, Michel, and Shimon Ullman. "Object Recognition with
Informative Features and Linear Classification." ICCV. Vol. 3. 2003
[59] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object class
recognition by unsupervised scale-invariant learning." Computer Vision and
Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2003.
[60] Boris Epshtein Shimon Ullman,” Identifying Semantically Equivalent Object
Fragments”, Proceedings of the 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05)
[61] Eran Borenstein and Shimon Ullman,” Combined Top-Down/Bottom-Up
Segmentation”, IEEE Transactions On Pattern Analysis And Machine
Intelligence, Vol. 30, No. 12, December 2008, pp 2109-2125
[62] L. Fei-Fei, R. Fergus, and P. Perona,”Learning generative visual models from
few training examples: An incremental bayesian approach tested on 101
object categories.” In Proc. CVPR Workshop on Generative-Model Based
Vision, 2004.
[63] Fei-Fei, Li, Rob Fergus, and Pietro Perona. "Learning generative visual
models from few training examples: An incremental bayesian approach tested
on 101 object categories." Computer Vision and Image Understanding 106.1
(2007): 59-70.
[64] www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[65] Fei-Fei, Li, Robert Fergus, and Pietro Perona. "One-shot learning of object
categories." Pattern Analysis and Machine Intelligence, IEEE Transactions
on28.4 (2006): 594-611.
[66] Li-Jia Li, Gang Wang and Li Fei-Fei,” OPTIMOL: Automatic Online Picture
Collection via Incremental Model Learning”, 2007 IEEE
[67] Hao Su, Min Sun, Li Fei-Fei,Silvio Savarese,”Learning a Dense Multi-View
Representation For Detection, Viewpoint Classification And Synthesis Of
Object Categories”, 2009 IEEE 12th International Conference on Computer
Vision (ICCV)
[68] Bangpeng Yao, Li Fei-Fei,” Recognizing Human-Object Interactions in Still
Images by Modeling Mutual Context of Object and Human Pose in Human-
Object Interaction Activities”, Ieee Transactions on Pattern Analysis and
Machine Intelligence, Vol. 34, No. 9, September 2012, pp 1691-1703
[69] Oliva, Aude, and Antonio Torralba. "Building the gist of a scene: The role of
global image features in recognition." Progress in brain research 155 (2006):
23-36.
[70] Antonio Torralba, Kevin P. Murphy and William T. Freeman, “Sharing
Visual Features for Multiclass and Multiview Object Detection”, April 2004.
[71] Antonio Torralba , Kevin P. Murphy, William T. Freeman,” Sharing features:
efficient boosting procedures for multiclass object detection”
[72] Oliva, Aude, and Antonio Torralba. "Modeling the shape of the scene: A
holistic representation of the spatial envelope." International journal of
computer vision 42.3 (2001): 145-175.
[73] Torralba, Antonio, Robert Fergus, and Yair Weiss. "Small codes and large
image databases for recognition." Computer Vision and Pattern Recognition,
2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[74] BC Russell, A Torralba, KP Murphy,“LabelMe: a database and web-based
tool for image annotation”,International journal of Computer Vision,
2008(77) – Springer, pp-157-173.
[75] Taha H. Rassem, Bee Ee Khoo,” Object Class Recognition using
Combination of Color SIFT Descriptors”, 2011 IEEE
[76] Gyuri Dorko, Cordelia Schmid, "Object Class Recognition Using Discriminative Local Features", Technical Report.
[77] Gy. Dorko and C. Schmid. Selection of scale-invariant parts for object class
recognition”. In Proceedings of the Ninth IEEE International Conference on
Computer Vision (ICCV’03), pages 634–639, 2003.
[78] Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and
Tomaso Poggio,” Robust Object Recognition with Cortex-Like
Mechanisms”, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND
MACHINE INTELLIGENCE, VOL. 29, NO. 3, MARCH 2007 pp 411-426.
[79] B. Mayurathan, A. Ramanan, S. Mahesan & U.A.J. Pinidiyaarachchi,”
Speeded-up and Compact Visual Codebook for Object Recognition”,
International Journal of Image Processing (IJIP), Volume (7): Issue (1): 2013
pp 31-50
[80] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing.
Prentice-Hall Inc., 2002.
[81] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, “Learning object
Categories from google’s Image Search,” Computer Vision, 2005. ICCV’
2005. Tenth IEEE International Conference on, 2005.
[82] Joao Carreira and Cristian Sminchisescu, “Constrained Parametric Min-Cuts
for Automatic Object Segmentation”, Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference pp 3241-3248
[83] Van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2008). Evaluation of
color descriptors for object and scene recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition (CVPR’08)
[84] Uijlings, Jasper RR, et al. "Selective search for object
recognition."International journal of computer vision 104.2 (2013): 154-171.
[85] Van de Sande, Koen EA, et al. "Segmentation as selective search for object recognition." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[86] Fuxin Li, Joao Carreira, and Cristian Sminchisescu, "Object Recognition as Ranking Holistic Figure-Ground Hypotheses", Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference, pp. 1712-1719.
[87] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection
and semantic segmentation." Computer Vision and Pattern Recognition
(CVPR), 2014 IEEE Conference on. IEEE, 2014.
[88] Carreira, Joao, et al. "Semantic segmentation with second-order pooling."
Computer Vision–ECCV 2012. Springer Berlin Heidelberg, 2012. 430-443.
[89] Carreira, Joao, and Cristian Sminchisescu. "CPMC: Automatic object
segmentation using constrained parametric min-cuts." Pattern Analysis and
Machine Intelligence, IEEE Transactions on 34.7 (2012): 1312-1328.
[90] Li, Fuxin, Joao Carreira, and Cristian Sminchisescu. "Object recognition as
ranking holistic figure-ground hypotheses." Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010
[91] Fischler, Martin A., and Robert A. Elschlager. "The representation and
matching of pictorial structures." IEEE Transactions on Computers 22.1
(1973): 67-92.
[92] Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Pictorial structures for
object recognition." International Journal of Computer Vision 61.1 (2005):
55-79.
[93] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained
part-based models." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 32.9 (2010): 1627-1645.
[94] Girshick, Ross B., Pedro F. Felzenszwalb, and D. McAllester.
"Discriminatively trained deformable part models, release 5." (2012).
[95] Felzenszwalb, Pedro, David McAllester, and Deva Ramanan. "A
discriminatively trained, multiscale, deformable part model." Computer
Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.
IEEE, 2008.
[96] Felzenszwalb, Pedro F., Ross B. Girshick, and David McAllester. "Cascade
object detection with deformable part models." Computer vision and pattern
recognition (CVPR), 2010 IEEE conference on. IEEE, 2010.
[97] Ferrari, Vittorio, Frederic Jurie, and Cordelia Schmid. "Accurate object
detection with deformable shape models learnt from images." Computer
Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE,
2007.
[98] Pentland, Alex P. "Automatic extraction of deformable part models."
International Journal of Computer Vision 4.2 (1990): 107-126.
[99] Pandey, Megha, and Svetlana Lazebnik. "Scene recognition and weakly
supervised object localization with deformable part-based models." Computer
Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[100] Ren, Xiaofeng, and Deva Ramanan. "Histograms of sparse codes for object
detection." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE
Conference on. IEEE, 2013.
[101] Yang, Yi, and Deva Ramanan. "Articulated pose estimation with flexible
mixtures-of-parts." Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on. IEEE, 2011.
[102] Bourdev, Lubomir, and Jitendra Malik. "Poselets: Body part detectors trained
using 3d human pose annotations." Computer Vision, 2009 IEEE 12th
International Conference on. IEEE, 2009.
[103] Arbeláez, Pablo, et al. "Semantic segmentation using regions and parts."
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
on. IEEE, 2012.
[104] Bourdev, Lubomir, et al. "Detecting people using mutually consistent poselet
activations." Computer Vision–ECCV 2010. Springer Berlin Heidelberg,
2010. 168-181.
[105] Arbeláez, Pablo, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir
Bourdev, and Jitendra Malik. "Semantic segmentation using regions and
parts." In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pp. 3378-3385. IEEE, 2012.
[106] Zhu, Long, et al. "Latent hierarchical structural learning for object detection."
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference
on. IEEE, 2010.
[107] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding
convolutional networks." Computer Vision–ECCV 2014. Springer
International Publishing, 2014. 818-833.
[108] Zhao, Wenyi, et al. "Face recognition: A literature survey." Acm Computing
Surveys (CSUR) 35.4 (2003): 399-458.
[109] Yang, Ming-Hsuan, David Kriegman, and Narendra Ahuja. "Detecting faces
in images: A survey." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 24.1 (2002): 34-58.
[110] T. Yamazaki, T. Fujikawa, J. Katto, "Improving the performance of SIFT using Bilateral Filter and its Application to Generic Object Recognition", ICASSP 2012, IEEE, pp. 945-948.
[111] Chiu, Liang-Chi, et al. "Fast SIFT Design For Real-Time Visual Feature
Extraction." Image Processing, IEEE Transactions on 22.8 (2013): 3158-
3167.
[112] Kamencay, Patrik, et al. "Feature extraction for object recognition using
PCA-KNN with application to medical image analysis." Telecommunications
and Signal Processing (TSP), 2013 36th International Conference on. IEEE,
2013.
[113] He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks
for visual recognition."arXiv preprint arXiv: 1406.4729 (2014).
[114] Goyal, Soren, and Paul Benjamin. "Object Recognition Using Deep Neural
Networks: A Survey." arXiv preprint arXiv: 1412.3684 (2014).
[115] Nevatia, Ramakant, and Thomas O. Binford. "Description and recognition of
curved objects." Artificial Intelligence 8.1 (1977): 77-98.
[116] Fidler, Sanja, Marko Boben, and Ales Leonardis. "Learning a hierarchical
compositional shape vocabulary for multi-class object representation." arXiv
preprint arXiv: 1408.5516 (2014).
[117] Lee, Tom, Sanja Fidler, and Sven Dickinson. "Multi-cue mid-level
grouping."
[118] Russakovsky, Olga, et al. "Imagenet large scale visual recognition
challenge." arXiv preprint arXiv: 1409.0575 (2014).
[119] Everingham, Mark, et al. "The pascal visual object classes challenge: A
retrospective." International Journal of Computer Vision 111.1 (2014): 98-
136.
[120] Nene, Sameer A., Shree K. Nayar, and Hiroshi Murase. Columbia object
image library (COIL-20). Technical Report CUCS-005-96, 1996.
[121] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-
scale hierarchical image database, IEEE Computer Vision and Pattern
Recognition, 2009. <http://www.image-net.org/>.
[122] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in
context." Computer Vision–ECCV 2014. Springer International Publishing,
2014. 740-755.
[123] LeCun, Yann, et al. "Backpropagation applied to handwritten zip code
recognition." Neural computation 1.4 (1989): 541-551.
[124] Humphrey, Eric J., Juan Pablo Bello, and Yann LeCun. "Moving Beyond
Feature Design: Deep Architectures and Automatic Feature Learning in
Music Informatics." ISMIR. 2012.
[125] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features
from tiny images." Computer Science Department, University of Toronto,
Tech. Rep 1.4 (2009): 7.
[126] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in neural
information processing systems. 2012.
[127] http://riemenschneider.hayko.at/vision/dataset/index.php, as referred on 12th April 2015.
[128] http://image-net.org/challenges/LSVRC/2012/analysis/, as referred on 12th April 2015.
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 818-833.
[129] Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153.
[130] Bhisekar, Manisha, and Prajakta Deshmane. "Image Retrieval and Face Recognition Techniques: Literature Survey." International Journal of Electronics and Communication Engineering and Technology 5.1 (2014): 52-58.
[131] Almeida, Yoel E., Ashray S. Bhandare, and Aishwary P. Nipane. "Computer Vision Based Adaptive Lighting Solutions for Smart and Efficient System." International Journal of Computer Engineering and Technology 6.3 (2015): 01-11.
[132] Socher, Richard, et al. "Parsing natural scenes and natural language with recursive neural networks." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Recently uploaded (20)

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

REVIEW ON GENERIC OBJECT RECOGNITION TECHNIQUES: CHALLENGES AND OPPORTUNITIES

http://www.iaeme.com/IJARET/index.asp 104 editor@iaeme.com

International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 6, Issue 12, Dec 2015, pp. 104-133, Article ID: IJARET_06_12_010
Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication

Prof. Deepika Shukla
Comp. Science and Engineering Department, Institute of Technology, Nirma University, Ahmedabad, India

Apurva Desai
Department of Computer Science and Information Technology, VNSGU, Surat, India

ABSTRACT

Recognizing objects automatically in an image is a fundamental step for many real-world computer vision applications. Object recognition is the task of identifying an instance of an object in an image or video sequence with little or no human intervention and assistance. Despite the very high complexity of the task, human beings perform it with little effort, even in a state of minimal attention. Humans readily recognize a huge number and variety of object categories in images, even though the object in the image may differ in size/scale, viewpoint, position or orientation. We can even recognize objects in an image when they are only partially visible or appear against a cluttered background. Moreover, recognition may target either a specific instance of an object or an object category/class. When the task is performed for classes of objects, it is known as generic object recognition, object-class detection or category-level object recognition. Over the years many techniques have evolved for recognizing object classes from images, but to date no automated object recognition system has fully matched this capability of human beings.
This very fact makes the recognition of objects in an image one of the most basic and fundamental challenges in computer vision research. The purpose of this study is to give an overview and categorization of the approaches used in the literature for generic object recognition, and of the various technical advancements achieved in the field. The survey focuses mainly on the leading work since the year 2000. We discuss the challenges that the field is currently facing and also attempt to suggest future research directions in the area of generic object recognition. Finally, we conclude the study with the hope that in the near future more sophisticated object class recognition systems will be developed in an efficient and cost-effective manner.

Key words: Object Recognition, Generic Object Recognition, Object Class Recognition, Scene Understanding, Scene Categorization, Image Analysis, Computer Vision, Machine Vision, Scene Analysis.

Cite this Article: Prof. Deepika Shukla and Apurva Desai, Review on Generic Object Recognition Techniques: Challenges and Opportunities. International Journal of Advanced Research in Engineering and Technology, 6(12), 2015, pp. 104-133.
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12

1. INTRODUCTION

Automated recognition of objects in images is a critical and fundamental step for many real-world computer vision applications. It is the task of finding a given object in an image or video sequence with little or no human intervention or assistance. Very little effort is required on our part to detect and recognize a huge number of classes of objects in images, even though the image of an object may differ in size/scale, viewpoint, position or orientation. Human beings are able to recognize objects in an image even when they are only partially visible or appear against a cluttered background. Moreover, the ability to generalize from examples and to categorize objects, events, scenes and places is one of the core capabilities of the human visual system. For human beings this is a mundane activity, but imbibing these capabilities in a machine has proved to be a significantly challenging task for computer vision systems in general.
The reason for this may be rooted in the fact that automatic object recognition requires an understanding of human visual perception, and so becomes a multidisciplinary research area involving knowledge and expertise from fields such as optics, psychology, pattern recognition, artificial intelligence, machine learning and, most importantly, cognitive science, which in itself needs sophisticated concepts and tools from mathematics as well as computer science [1]. Object recognition is a dominant field of research in computer vision as well as image analysis applications, and even the simplest machine vision task cannot be solved without the help of recognition. This is evidenced by the vast volume of research conducted in the area over the past three decades: a search for "object recognition from images" on ieeexplore.org returns more than 20,000 results. From the substantial volume of current literature on the topic, we can also say that object recognition is closely tied to, and is part and parcel of, computer vision research. This paper reviews most of the leading state-of-the-art research performed in the area of generic object recognition. More specifically, it is focused on gaining insight into the following research questions pertaining to generic object recognition. What generic object recognition techniques and approaches are drawn upon by the literature? What different techniques are used for object representation?
Which feature detection and extraction methods are used by most of the prominent researchers on the topic? Which classification/learning techniques have been used in the classification stage of the object recognition pipeline?

The rest of this paper is organized as follows. Section 2 introduces and explains the problem of generic object recognition, which can be considered a specific subset of the object recognition problem. Section 3 concentrates on the challenges that the field of object recognition faces in general, and generic object recognition in particular. Section 4 discusses the vast literature existing on the topic. Section 5 presents a roadmap to future research areas and directions. Section 6 finally presents the conclusion of the study.

2. GENERIC OBJECT RECOGNITION PROBLEM

The problem of object recognition can be viewed as a classification or labelling problem, where models/representations of known objects are available to the system and, when a novel image is given, the system has to predict the class of the object(s) present in the image. Formally, it can be stated as: given an image containing one or more objects of interest (and background) and a set of labels corresponding to a set of models known to the system, the system should assign correct labels to regions, or sets of regions, in the image. That is, an object recognition system should assign a high-level definition to an object based on the image data by which it is represented. Oftentimes, the task of object recognition is considered as broadly comprising three sub-tasks. Object detection: detecting whether an instance of the object category is present in the image or not. Localization: giving the location of the object category; drawing a bounding box around the object instance is the way most prominently used in the literature to show the result of localization.
Visual category recognition: recognizing and labelling the class/category of the object present in the image. Moreover, the image presented to the object recognition framework may contain a single instance of some object class, multiple instances of a single class, or multiple instances of multiple classes. Therefore, at the top-most level, object recognition approaches can broadly be categorized as following a top-down, bottom-up or hybrid approach, and within that they may target specific or generic objects. So basically, image-based object recognition can be stated as: given a database of objects and an image, determine which, if any, of the objects are present in the image. Thus the problem of object class recognition can be considered an instance of supervised classification. Another dimension along which the task of object recognition can be categorized is the following. First, systems in which a specific object to be recognized is known and the system is trained for that specific object category only, for example face recognition or pedestrian recognition. Second, generic object recognition systems. Generic object recognition means that the computer recognizes objects in images by their general name [2] or common name. Figure 1 shows an instance of generic object recognition. Generic object recognition has also been referred to in the literature as object-class detection or category-level object recognition [14]; it aims at recognizing the class to which the object present in the image belongs. The images can contain a single instance of a class, multiple instances of the same class, or multiple instances of multiple classes. When categorization of multiple objects of multiple classes in an image is performed, it is known as scene categorization.

Figure 1 Generic Object Recognition

2.1. Architecture of the object recognition system

Current vision systems can be said to consist of the activities shown in Figure 2.

Figure 2 Activities involved in a typical vision system

Any recognition system involves these activities, or some subset of them, in its life cycle. In general, after the image acquisition stage, the image is pre-processed to perform noise removal and some kind of enhancement. The pre-processing stage is followed by the feature extraction and description/representation stage, whose outputs are then passed on for recognition. In the representation stage, objects can be represented in 2-D or 3-D. Figure 3 shows the general architecture of an object recognition system. The object recognition task is affected by several factors and can differ according to various aspects, as shown in Figure 4, which categorizes the aspects along which work is going on in the field. Approaches may differ on the basis of the form and representation of objects, the matching scheme, the image formation model, the type of features, the type of image and the type of data suited for categorization. Having studied these aspects, we found that the approaches mainly differ in the object representation method, based on the type of features, or in the classification approach adopted in the recognition phase. As these factors change, the approach can be observed to change substantially; nonetheless, the approaches broadly follow three paradigms for formulating and attempting a solution to the problem of object recognition from an image: bottom-up, top-down and hybrid [103].

Figure 3 Generic Architecture of Object Recognition System

Bottom-up: This can be considered as image analysis starting from the low-level data, and is based on image segmentation techniques. It considers the raw image data in the form in which it is acquired. Boundaries of homogeneous regions are extracted by performing non-purposive segmentation, without prior knowledge about the properties of individual object classes; no prior assumptions are made about what the objects are. A fixed set of attributes is used to characterize these regions, and objects are linked together to characterize the scene itself. However, without some additional information, purely bottom-up approaches had, until 2009, been unable to yield figure-ground segmentations of sufficient quality for object categorization [Leibe & Schiele]. Since then, many approaches have been developed [85, 86, 88, 89] which use bottom-up segmentation methods, as discussed in [82] and [85], and have achieved remarkable results; these will be discussed in detail in the literature review section of this paper.

Top-down: This is image analysis starting from the semantic-level data. In contrast to the previous approach, this methodology proceeds from the assumption that the image does contain a particular object; if the problem is one of scene categorization, it assumes that the image shows a particular type of scene.
The system attempts to verify the existence of a hypothesized object. Purposive segmentation may be performed, or specialized ways may be used to represent the object.

Hybrid: A combination of the two earlier paradigms is used in this kind of approach [61], [79].

3. KEY CHALLENGES

3.1. Challenges overview

As stated earlier, the problem of object recognition in general, and generic object recognition in particular, faces various challenges.

(I) The appearance of an object in the image can vary over a large range due to:
1. Viewpoint changes
2. Scale, orientation and shape changes (e.g., non-rigid objects)
3. Photometric effects (scene illumination etc.)
4. Scene/background clutter (so objects may be occluded)

(II) Different views of the same object can give rise to widely different images.

(III) A large number of object categories exist in the real world, and these categories may have very little inter-class variation.

Figure 4 Factors affecting the task of object recognition

3.2. Description

Object recognition can be considered yet another data processing task, so the data is given the highest priority and acquisition should be considered the most important step. In recent years, with the advent of high-quality cameras and other image capturing devices, we can collect a huge amount of data (images) in various forms, such as intensity images and range images, and from various sources such as the web; but the major problem the computer vision research community faces today is the scarcity of accurately and precisely labelled image examples. As stated earlier, the object recognition problem can essentially be considered a supervised classification task, and for that to work successfully, labelled image examples are needed. The problem is aggravated by the fact that labelling is labour intensive, and the non-availability of human experts who can perform the image annotation task efficiently and accurately makes it more challenging still. Feature extraction is the next crucial step in the generic object recognition pipeline. Assuming that the data is available, feature extraction becomes the most important stage of the entire object recognition framework: if suitable features of the right dimensions are not extracted, this phase can become the bottleneck of the recognition pipeline.
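To make the feature extraction step above concrete, the following minimal sketch (our own illustration, not taken from any cited work; the function name and bin count are arbitrary choices) computes one of the simplest low-level colour features, a normalised per-channel colour histogram, as a fixed-length feature vector:

```python
# Illustrative sketch of a low-level colour feature: a normalised
# per-channel histogram over an RGB image. Not a method from the
# surveyed literature; names and parameters are chosen for clarity.
import numpy as np

def colour_histogram(image, bins=8):
    """Return a normalised colour histogram feature vector.

    image: H x W x 3 uint8 array; result: vector of length 3*bins.
    """
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel],
                               bins=bins, range=(0, 256))
        features.append(hist)
    vec = np.concatenate(features).astype(float)
    return vec / vec.sum()   # normalise so image size does not matter

# A synthetic 4x4 "image" that is pure red.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :, 0] = 255
f = colour_histogram(img)
assert f.shape == (24,)
assert abs(f.sum() - 1.0) < 1e-9
```

Because the histogram is normalised, images of different sizes yield comparable feature vectors; note, however, that such a global descriptor carries no spatial information, which is one reason richer descriptors are discussed later in this survey.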
Although many sophisticated approaches have recently been developed and exist in the literature, they are not sufficient to describe every object; feature extraction thus becomes very object specific and varies as the viewpoint, size or illumination conditions of image capture vary. Representing images by effective features is therefore crucial to the performance of various image analysis tasks. Features can be low-level (colour, texture, intensity), middle-level (image patches) or high-level (objects, textually annotated objects). Figure 5 shows a possible classification of different kinds of features. Choosing and deploying an appropriate classifier is the next important step of the pipeline. The classifier can be linear or non-linear. Various classifiers, such as the Bayesian classifier, SVMs, decision trees and neural networks, are utilized in the literature for classification, each possessing its own benefits and drawbacks. One important issue inherent to the classification stage is the scalability of the classifier. The number of object categories existing in the real world is very large, and many visual features are required to model each category, forcing the system to hold a huge volume of training data in order to model the many category classifiers. To keep scalability manageable, a linear classifier is commonly utilized, but its classification performance is inferior to that of a non-linear one, whereas non-linear classifiers are more computation intensive. To remedy this shortcoming of linear classifiers, a rich image feature set (which is, after all, a key factor in the success of an image recognition system) must be designed per object class, so that the system can distinctly recognize objects in images exhibiting inter-class and intra-class variation, as shown in Figure 6. Additionally, the classifier has to be updated continuously: even if it has been trained once for a category/class of object, when previously unseen instances of the object emerge or the appearance of the object evolves, the earlier trained classifier will not give correct results. This kind of flexibility and resilience to change is inherently expected from any object recognition framework.

Figure 5 Classification of Image Features

Figure 6 Images of different instances of an object (dog) in varied imaging conditions

Intra-class appearance variations refer to the appearance differences among different objects of the same class [14]. Intra-class appearance variation may be due either to differences in the colour, shape and size of the object's instances or to differences in imaging conditions.
For example, images of the same object taken at different times of day, in different seasons or at different places, with different devices and from different viewpoints, will be entirely different. In addition to intra-class appearance variations, a generic object recognition system has to efficiently and distinctly handle inter-class appearance variations, which in many cases may be very small, as shown in Figure 7. For example, an object recognition system should be capable of distinguishing between a donkey and a horse, or a horse and a mule.

Figure 7 Images of horses and donkeys with very small inter-class appearance variation; the lower row shows images of horses (adapted from [14])

The performance of a generic object recognition framework is generally judged on criteria such as robustness against noise; invariance to basic geometric transformations; invariance to illumination and viewpoint changes; the ability to handle large numbers and different types of objects; the ability to handle intra-class and inter-class variations; the ability to recognize objects in the presence of clutter or a complicated background; and the ability to recognize an object accurately and efficiently even when it is partially occluded. These requirements are implicitly expected of, and must be met by, any framework for object recognition, and as a result these issues can be considered the key challenges for the field of generic object recognition.

4. LITERATURE REVIEW

4.1. Overview

The object recognition pipeline, as stated in the earlier section, consists of the key tasks of image acquisition, pre-processing, feature extraction, feature representation/description and classification. The image acquisition and pre-processing phases, however, fall outside the scope of this study. Although most of the related work surveyed and cited here focuses on one or another phase of this pipeline, our main focus in this study is on feature extraction and description techniques, and on obtaining answers to the research questions put up at the beginning of this manuscript.
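The classification stage of the pipeline named above can be illustrated with a deliberately simple toy (not any method from the surveyed literature): a nearest-centroid classifier over feature vectors, which is linear and cheap, reflecting the scalability trade-off between linear and non-linear classifiers discussed in Section 3. The class names and feature values here are invented for illustration:

```python
# Toy illustration of the classification stage as supervised labelling.
# A nearest-centroid classifier: each class is modelled by the mean of
# its training feature vectors; a novel image's feature vector is given
# the label of the nearest centroid.
import numpy as np

def fit_centroids(features, labels):
    """Average the training feature vectors of each class."""
    return {c: features[labels == c].mean(axis=0) for c in set(labels)}

def predict(centroids, x):
    """Assign the label of the class centroid nearest to x."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Hypothetical 2-D "feature vectors" for two object classes.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array(["cat", "car", "car", "car"][:0] + ["cat", "cat", "car", "car"])
model = fit_centroids(X, y)
assert predict(model, np.array([0.0, 0.0])) == "cat"
assert predict(model, np.array([1.0, 1.0])) == "car"
```

Real systems replace both ingredients: the 2-D toy features become high-dimensional descriptors, and the centroid rule becomes an SVM, Bayesian classifier or neural network, as surveyed below.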
Various groups of researchers have attempted to survey and review work in the field of computer vision, but these surveys are either related to some specific object class (e.g., surveys on face recognition are presented in [108, 109]), compare various descriptors [14, 45, 48], or, as in [114], cover object recognition using deep neural networks separately. That is, one particular aspect of the topic is explored and the related literature is reviewed while discussing the core work. Comprehensive surveys on generic object recognition [14, 15] have been published periodically in the past, but given the rapid pace of achievements in the field, it seems natural to survey the most recent developments and object recognition techniques available in the literature. In this study, we have mostly tried to review the work done in the field since the year 2000, with more emphasis on surveying the work done after 2011.
The rationale behind this is that most existing surveys and papers talk extensively about the approaches before 2011. Given the pace of technical advancement in the field, a lot of approaches have emerged since 2011 which demand detailed reportage, and covering them is the basic motive of this review. The study therefore also aims to present the survey in a way that helps the reader gain insight into this field of research. As noted in the introduction, the task of object recognition is broadly considered to comprise three sub-tasks: object detection, object localization and object classification. In this manuscript we have studied approaches to generic object recognition, which is the highest-level task among the object categorization sub-tasks; i.e., to categorize the class of an object in an image, object detection is inherently performed, and many a time the object also needs to be localized. For this reason we have not segregated the approaches on the basis of detection, localization or categorization.

4.2. Features and Feature Descriptors

The foundations of the field can be traced back to the 1950s and 1960s, when early work was done in very simplistic domains [1]. The world was modelled as being composed of blocks defined by the coordinates of their vertices and edge information. The "block image" represented areas of uniform brightness in the image, and the edges of blocks were located in areas of intensity discontinuity. It was soon realised that this is not an ideal way to represent the complicated information present in an image. Since then, various strategies have been developed for the task of object recognition, with an emphasis on the feature extraction stage and on the usage of novel and efficient feature descriptors. Object recognition approaches can be classified into various broad categories.
These include model-based, shape-based and appearance-based approaches. Model-based approaches try to represent objects as sets of three-dimensional primitives [1, 12, 13] such as generalized cylinders, cones, cubes, cuboids, spheres, etc. Shape-based approaches [13, 19, 20, 21, 52, 53] represent objects by shape primitives like boundary fragments, contours, shapelets, etc. In contrast, appearance-based models use only appearance, usually captured by different two-dimensional views of the object-of-interest. Whatever the representation method, object representation takes centre stage in the entire object recognition pipeline, and in turn the problem of object class recognition reduces to generating an efficient representation of the object which can detect, localise and identify its class discriminatively and repeatably. As stated earlier, extracting and describing the features of the objects in an image efficiently decides the fate and success of a typical object recognition system. In a generic object recognition or categorization system, the relevant features or descriptors for a characteristic point, patch or region of an image are obtained by different approaches. As shown in Figure 5, features can be divided at the topmost level into two categories, global and local, where the former characterizes the image as a whole and the latter represents local information in the form of a pixel, patch or region. Yet another direction along which many researchers have tried to classify features is structural versus statistical. Although there are various classifications for features, significant overlap exists among these classes; for example, local features can be structural as well as statistical.
These features are often combined to form various descriptors; in particular, region-level descriptors are formed by combining colour, texture and other such low-level features.
As far as pixel-level features are concerned, they are regarded as low-level information of the image, are computed directly from the grayscale values of individual pixels, and are generally used to build more sophisticated patch-level or region-level descriptors. We now briefly discuss some of the best performing descriptors proposed and utilized over the years. This is not meant to be an exhaustive discussion of the existing approaches, but rather to provide a sample of some relatively successful and widely used approaches.

4.2.1. Appearance-Based Object Representation

The Scale-Invariant Feature Transform (SIFT) [3][4], introduced by Lowe, is regarded as one of the most popular patch-level feature descriptors reported in the literature. The features identified are shown to be completely invariant to basic geometric transformations and partially invariant to illumination changes and occlusion. SIFT features proved successful because they do not depend on the exact grey-level distribution within an image patch, but instead use the general configuration of image gradients [60]. This was considered one of the most prominent approaches in the area of object recognition, and the work is considered a milestone in the research on object recognition, computer vision and other image analysis problems. However, as the descriptors are appearance-based, they may produce poor results if the object does not have enough texture information. SIFT has been applied to the problem of object recognition in many works; two such usages are mentioned in [3] and [4]. In various other works [2, 39, 42, 75, 110, 111, 112], improvements have been achieved by combining other features with SIFT or by using filters other than Gaussian [110].
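To make the idea concrete, the core of a SIFT-like descriptor can be sketched as a grid of magnitude-weighted gradient orientation histograms over a patch. The following is a hypothetical simplification for illustration only: it omits Lowe's scale-space keypoint detection, Gaussian weighting and trilinear interpolation, and all names here are our own.

```python
import numpy as np

def patch_orientation_histogram(patch, n_bins=8, grid=4):
    """Toy SIFT-style descriptor: a grid x grid array of gradient
    orientation histograms over one image patch (illustration only,
    not Lowe's full implementation)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                               # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)          # orientation in [0, 2*pi)
    h, w = patch.shape
    desc = []
    for i in range(grid):                                # split the patch into cells
        for j in range(grid):
            sl = (slice(i * h // grid, (i + 1) * h // grid),
                  slice(j * w // grid, (j + 1) * w // grid))
            hist, _ = np.histogram(ang[sl], bins=n_bins,
                                   range=(0, 2 * np.pi),
                                   weights=mag[sl])      # magnitude-weighted votes
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc             # normalise for illumination robustness

patch = np.random.default_rng(0).random((16, 16))        # stand-in for a real keypoint patch
d = patch_orientation_histogram(patch)
print(d.shape)   # (128,) -- same dimensionality as the standard SIFT descriptor
```

With grid = 4 and 8 orientation bins the sketch reproduces SIFT's 4 x 4 x 8 = 128-dimensional layout, which is exactly the high dimensionality that PCA-SIFT and SURF, discussed next, set out to reduce.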
SIFT descriptors are high-dimensional (128 values per keypoint) and the number of keypoints obtained is relatively large, resulting in high-dimensional data. This drawback was recognised by the authors of [5], who extended SIFT as PCA-SIFT, where Principal Component Analysis is applied to the normalized gradient patch, resulting in a lower-dimensional descriptor. PCA-SIFT yields a 36-dimensional descriptor which is fast to compute and match but less distinctive [6], while the descriptor introduced by Mikolajczyk and Schmid [45], GLOH (Gradient Location-Oriented Histogram), is another variant of SIFT which proved to be more distinctive at the same dimension [6]. A colour-image-based SIFT has also been demonstrated in [75], wherein colour gradients are used in place of intensity gradients in the Gaussian framework. As mentioned earlier, high dimensionality of the descriptor is the major limitation of SIFT; another effective patch-level descriptor, SURF (Speeded-Up Robust Features), was proposed in [6] by Bay et al. The authors make use of integral images, which yields features that are not only faster to compute but also distinctive and repeatable. They base their descriptor on the Hessian matrix but use a very basic approximation. Moreover, only 64 dimensions are used, much less than SIFT's 128-dimensional vector. Though one can argue that PCA-SIFT results in only a 36-dimensional vector, it loses distinctiveness, whereas SURF has proved to be more distinctive and repeatable. Another level at which feature descriptors are generated in numerous papers is the region level. Dalal and Triggs [32, 33, 34] used grids of locally normalised Histograms of Oriented Gradients (HOG) as descriptors for object detection in static images. The technique counts occurrences of gradient orientations in localized portions of an image. The detector window is tiled with a grid of overlapping blocks, in which Histogram of Oriented Gradient feature vectors are extracted.
The detector thus presented is contrast-based, which makes it robust to small changes in image contour locations and directions and to significant changes in image illumination and colour, while remaining highly discriminative for overall visual form. The work of Dalal and Triggs [32, 33, 34] is aimed at the detection of humans in particular, but has also proved effective in detecting other object classes in images. The HOG descriptor has proved very efficient for representing structured objects; for example, it has outperformed all other descriptors in pedestrian detection from videos and images. Inspired by HOG [32], Bosch et al. [36] proposed a novel descriptor called PHOG (Pyramid of HOG). The idea is to represent local image shape and its spatial layout, together with a spatial pyramid kernel of Bag of Features (BoF) [25, 26]. Each image is divided into a sequence of increasingly finer spatial grids by repeatedly doubling the number of divisions along each axis (like a quadtree), and the number of points in each grid cell is recorded. A HOG vector is computed for each grid cell at each pyramid resolution level, and the final PHOG descriptor for the image is a concatenation of all the HOG vectors. This concatenated vector is then normalized to ensure that texture-rich images, or images with more edges, are not weighted more strongly than others. Another descriptor built on the idea of histograms of gradients is CoHOG (Co-occurrence Histograms of Oriented Gradients), proposed in [37]. CoHOG can express shapes in more detail than HOG, as CoHOG histograms have pairs of gradient orientations as their basic units; the histogram is referred to as a co-occurrence matrix. Due to this pairing, the vocabulary size increases, resulting in a more specific expression of the shape of the object in the image. The use of a higher-dimensional matrix makes CoHOG powerful in terms of discriminative power, but at the same time highly computation-intensive.
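The cell-and-block computation behind HOG can be sketched as follows. This is a simplified illustration (unsigned gradients, L2 normalisation over overlapping 2 x 2 blocks of cells), not the exact Dalal-Triggs implementation, and the function name and parameters are our own.

```python
import numpy as np

def hog_descriptor(img, cell=8, n_bins=9):
    """Minimal HOG sketch: per-cell unsigned gradient orientation
    histograms, then L2 normalisation over overlapping 2x2 blocks."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)              # unsigned orientation in [0, pi)
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    cells = np.zeros((n_cy, n_cx, n_bins))
    for i in range(n_cy):                                # one histogram per cell
        for j in range(n_cx):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            cells[i, j], _ = np.histogram(ang[sl], bins=n_bins,
                                          range=(0, np.pi), weights=mag[sl])
    feats = []
    for i in range(n_cy - 1):                            # overlapping 2x2 blocks of cells
        for j in range(n_cx - 1):
            block = cells[i:i + 2, j:j + 2].ravel()
            feats.append(block / (np.linalg.norm(block) + 1e-6))  # local contrast normalisation
    return np.concatenate(feats)

img = np.random.default_rng(1).random((64, 64))          # stand-in detector window
f = hog_descriptor(img)
print(f.shape)   # 7*7 blocks x (2*2*9) values = (1764,)
```

The per-block normalisation is what gives HOG the contrast robustness described above: each 36-value block is scaled independently, so local illumination changes cancel out.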
Bag of Features and visual codebook based approaches

This approach is inspired by the BoW (Bag of Words) approach, first proposed in 1997 by [38] for describing textual data for various text analysis tasks. It represents a text document or a sentence written in natural language as a set of words, without taking into consideration grammar or the order in which the words occur in the original text. The frequency of occurrence of each word is calculated and then used for various language processing tasks. The analogous term BoF (Bag of Features) is used for the image-based approach: similar to the BoW model, an image is represented as an orderless collection of local image features. Similar terms such as Bag of Keypoints (BoK) and Bag of Visual Words (BoVW) are used by various researchers in their works. The method is based on vector quantization of affine-invariant descriptors of image patches [39]. A bag of keypoints corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The method uses clustering to obtain fairly high-dimensional feature vectors for a classifier. As a codebook is constructed in the BoF approach, it is often also referred to as a codebook-based approach. The method includes the following main steps:
- Detection of image patches for computation of patch descriptors.
- Computing patch descriptors for these patches. These can be any invariant feature descriptors, such as SIFT [3, 4] or any of its variants, or other lower-level descriptors like Harris-affine [43] or MSER.
- Construction of a visual codebook/vocabulary/dictionary by assigning patch descriptors to predetermined clusters (a vocabulary) with a vector quantization algorithm that groups similar features together. Several clustering techniques have been used for determining the clusters; most frequently, k-means clustering is applied [39], while hierarchical k-means clustering is adopted in [49] and mean-shift in [35].
- Generating a histogram of the number of occurrences of the patches assigned to each cluster. The size of the resulting histogram equals the size of the codebook, and hence the number of clusters obtained from the clustering technique [40].
- Treating the bag of features as a feature vector and using a classifier to classify the respective image.

A distance measure is required when comparing two term vectors for similarity, but this measure operates in the term-vector space as opposed to the feature space. There are two reasons why the bag-of-features image representation proved popular for indexing and categorization applications. First, this representation benefits from powerful local descriptors such as the SIFT descriptor; second, these vector representations can be compared with standard distances and subsequently used by robust classification methods such as support vector machines [50]. Also, codebook-model-based approaches, while ignoring any structural aspect of vision, provide state-of-the-art performance on current datasets [40]. The discriminative power of the visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Codebook-based approaches are considered simple and efficient, and can also be made robust to clutter, occlusion, viewpoint change, and even non-rigid deformations [26, 25]. In spite of being one of the most popular and successful approaches, BoF and visual codebook generation have certain limitations. As BoF expresses the image as appearance-frequency histograms of visual words obtained by quantizing SIFT-like features, location information and the geometric relationships between keypoints are lost.
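The codebook pipeline described in the steps above can be sketched as follows. The descriptors here are random stand-ins for real SIFT-like vectors, and the plain k-means is illustrative only; real systems typically use optimised clustering libraries.

```python
import numpy as np

rng = np.random.default_rng(42)

def kmeans(X, k, iters=20):
    """Plain k-means for codebook construction (illustrative stand-in
    for the vector quantization step)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest visual word
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):                               # move each centre to the mean
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers

def bof_histogram(desc, codebook):
    """Quantise one image's local descriptors against the codebook and
    count occurrences of each visual word."""
    dist = np.linalg.norm(desc[:, None, :] - codebook[None, :, :], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                             # normalised word histogram

# stand-in for SIFT-like patch descriptors pooled from training images
train_desc = rng.random((500, 128))
codebook = kmeans(train_desc, k=32)                      # 32 visual words

image_desc = rng.random((60, 128))                       # descriptors from one test image
h = bof_histogram(image_desc, codebook)
print(h.shape)   # (32,) -- one bin per visual word
```

The resulting fixed-length histogram `h` is what gets fed to an SVM or other classifier; note that it records only how often each word occurs, not where, which is exactly the loss of location information discussed here.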
Also, as vector quantization is involved, some loss of information inherently occurs, and due to the loss of geometric relations between the features, localization of the object is not possible. To overcome the limitation of this orderless representation of objects, several researchers have proposed approaches that augment the bag of features with global spatial relations in a way that significantly improves classification performance while remaining simple and computationally efficient enough for real-world applications [27]. The authors of [27] demonstrated that the bag-of-features description of an image can be extended to spatial pyramids so that the spatial location information of the features is retained. To generate these spatial pyramids, the input image is partitioned into increasingly fine sub-regions. Histograms of local features are computed over these sub-regions, and the histograms are then concatenated to generate the final features. This representation is combined with a kernel-based pyramid matching scheme proposed by [24] that efficiently computes approximate global geometric correspondence between sets of features in two images. While the spatial pyramid representation sacrifices the geometric invariance properties of bags of features, it compensates for this loss with the increased discriminative power derived from the global spatial information. Similarly in [2], to overcome the same inherent problem of the BoF approach, a graph is constructed by connecting SIFT keypoints with lines. As a result, the keypoints maintain their relationships, and a structural representation with location information is achieved. Since a graph representation is not suitable for statistical processing, the graph is embedded into a vector space according to the graph edit distance. With this, the authors achieved improved recognition accuracy compared to the conventional method in their experiments on the PASCAL VOC and Caltech-101 datasets.
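The spatial-pyramid extension of BoF described above can be sketched as follows: visual-word histograms are computed over increasingly fine grids (1 x 1, 2 x 2, 4 x 4, ...) and concatenated, so coarse spatial layout is retained. This is a simplified illustration (no pyramid-match weighting), and the function name and inputs are our own.

```python
import numpy as np

def spatial_pyramid_histogram(keypoints, words, n_words, levels=2):
    """Concatenated visual-word histograms over a pyramid of grids.
    `keypoints` are (x, y) locations normalised to [0, 1);
    `words` are their codebook (visual word) indices."""
    feats = []
    for lvl in range(levels + 1):
        g = 2 ** lvl                                     # g x g grid at this level
        cell_x = np.minimum((keypoints[:, 0] * g).astype(int), g - 1)
        cell_y = np.minimum((keypoints[:, 1] * g).astype(int), g - 1)
        for cy in range(g):
            for cx in range(g):
                in_cell = (cell_x == cx) & (cell_y == cy)
                # word histogram restricted to this spatial cell
                feats.append(np.bincount(words[in_cell], minlength=n_words))
    return np.concatenate(feats)

rng = np.random.default_rng(3)
kp = rng.random((100, 2))                 # normalised keypoint locations
w = rng.integers(0, 32, size=100)         # their visual-word assignments
f = spatial_pyramid_histogram(kp, w, n_words=32)
print(f.shape)   # (1 + 4 + 16) cells x 32 words = (672,)
```

The level-0 block of the output is exactly the plain BoF histogram; the finer levels add the spatial layout information that plain BoF discards.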
So, the basic idea for improving the BoF approach is to somehow incorporate the spatial location information of features into the BoF features, so that the method can be used not only for recognition but also successfully for object localization. The authors of [47] achieved improvement by adding binary signatures to the descriptors: first, a Hamming Embedding (HE) of the SIFT descriptors, analogous to the Hamming distance; and second, a weak geometric consistency (WGC) check integrated within the inverted file system, which penalizes descriptors that are not consistent in terms of angle and scale. In this way geometric information is incorporated in the index even for very large datasets. At the same time, both HE and WGC require additional information to be stored, so the memory requirement of the index increases. The visual codebook approach has been used by several other researchers in slightly different ways. For example, Leibe et al. in [7, 8, 9] adopt a two-stage approach. In the first stage, a codebook of local appearances is learnt which captures which local structures may appear on objects of the target category. Next, an Implicit Shape Model (ISM) is learnt that specifies where on the object the codebook entries may occur. To create the codebook, the authors adopt the method presented in [17] by Agarwal and Roth: from a variety of images, 25 x 25 pixel patches are extracted with the Harris interest point detector, and these patches are grouped using agglomerative clustering to generate a compact codebook. The codebook entries are used to define the implicit shape model of the objects. The approach does not try to create a separate model for every possible shape an object can take, but rather defines the shapes of an object in terms of patches that are consistent in local appearance. Due to this, fewer training examples are needed to learn an object's probable shapes. In a second pass, the codebook entries are scanned and all entries whose similarity is above a certain chosen threshold are activated.
The threshold chosen is the same as the threshold used during the clustering performed in the first step. In the recognition stage, a generalized Hough transform is performed to identify possible object centres.

GIST: Humans can recognize the gist of a novel image in a single glance, independent of its complexity [69], by considering it in a "holistic" manner while overlooking most of the details of the constituent objects. Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene. The input image is divided into non-overlapping regions; each region is further divided into sub-regions, and a gradient orientation histogram is computed for each sub-region. The GIST descriptor for a region is formed by concatenating these gradient orientation histograms over all of its sub-regions. The approach is more prevalently used for scene understanding. Approaches based on GIST cannot be considered an alternative to image analysis based on local features, but they can serve as additional support for recognition problems by helping to constrain local-feature-based image analysis. In [72, 73], short binary codes are used to compress local GIST descriptors, and it is demonstrated that the approach works on millions of images obtained from the internet without sacrificing recognition accuracy or effectiveness.

4.2.2. Shape-Based Approaches

Many approaches based on intensity or colour gradients of image patches or regions have been discussed in the previous part of this paper. Although these descriptors are very powerful and have been shown to perform object recognition with remarkable effectiveness, there may still be cases where two object classes exist with the same colour and texture, differing only in shape, or classes whose appearance varies greatly across instances.
Such objects cannot be represented with colour- and intensity-based features alone. For example, a raw mango and a capsicum are both green in colour but have entirely different shapes. The recognition community also understood quite early that, across the exemplars that belong to a category, shape is a more invariant property than appearance. As a result, the majority of recognition systems from the mid-1960s to the late 1990s attempted to extract shape features, typically beginning with the extraction of edges; at occluding boundaries and surface discontinuities, edges capture shape information. Shape is thus another important cue which can be used to generate a discriminative representation of objects. To compute the shape of an object, different authors have taken different approaches. Shape cues are frequently captured and described at the region level for object class recognition or detection using contour or boundary fragments [19], shapelets, edgelets [20], shock graphs, etc. Another area of research in shape-based detection is how to establish correspondence between the shapes extracted from training and test images, i.e., how to decide that two shapes match [52, 53]. One limitation of shape-based object description is that it cannot capture intra-class variations in a very discriminative way; for example, a zebra cannot be differentiated from a horse. Shape-based cues are therefore often combined with other appearance-based object cues.

4.2.3. Part-Based Approaches

Objects as 3D volumetric parts

The earliest attempts at solving the object recognition problem used high-level 3D part-based representations of objects, such as generalized cylinders (Binford) and other deformable primitives such as geons (Biederman [13]) and superquadrics (Pentland) [79].
The common characteristic among all of these is that they are based on symmetry, a physical regularity in our world which is exploited by the human visual system. In practice, however, it is too complex to extract such parts efficiently and inexpensively; but once extracted, they are semantically closer to a description of the image content. Such parts are limited in number compared to approaches where low-level and mid-level features are used to describe the object. Although methods based on low-level and mid-level features score on simplicity, ease of extraction and attractive invariance properties, they have proved weak at expressing high-level semantic information of the image. These facts made object representation using 3D volumetric parts attract a lot of attention in the 1970s and 1980s. Detailed coverage of the topic is beyond the scope of this study, but the works of Binford and Nevatia [115] can be explored for further information on the concept.

Recognition based on parts

In part-based object recognition approaches, an object is modelled as a geometrically constrained set of parts, where each part has a distinctive appearance and spatial position. In such approaches, shape is represented by the mutual positions of the parts [22]. Using such features, it is determined whether an instance of the object of interest exists in the image and, if so, where. Various methods exist in the literature, differing in how the parts are detected, how their positions are represented, and what the ideal number of parts to represent an image should be; generally these parameters are tuned to the requirements of the approach. In [22], objects are modelled as flexible constellations of parts. A probabilistic representation (in this case Gaussian) is used for all aspects of the object, such as shape, appearance, occlusion and relative scale. To learn and model an object category, regions and their scales are first detected. Once the regions are identified, they are cropped from the image and rescaled to the size of a small patch, typically 11 x 11 pixels, and the parameters of the above densities are estimated from these regions such that the model gives a maximum-likelihood description of the training data. To detect the features, a histogram of the intensities in a circular region of some radius is generated for each point in the image. The entropy of this histogram is calculated, and its local maxima determine the scale of the region. The N regions with the highest saliency over the image provide the features for learning and recognition. To reduce the dimension of the feature set, PCA is used.

Deformable part based approach

Deformable Part Models (DPMs) constitute the state of the art for sliding-window object detection [99]. DPMs are inspired by the pictorial structure representation introduced in [91] by Fischler and Elschlager, where an object is modelled by a collection of parts arranged in a deformable configuration [92]. Small picture segments are used to represent the visual properties of the object, and the deformable configuration is captured by spring-like connections between these segments. An energy function is computed from the match cost of each part and the deformation cost of each pair of connected parts, and this energy function is minimized to find the best match of the model within an image. The effectiveness of the pictorial representation for image matching demonstrated in [91] is due to the fact that the representation is simple.
In addition, the representation possesses wide general applicability, as it does not depend on any particular scheme for modelling the appearance of the parts and so can be used to represent quite generic objects. On the other hand, the model suffers from certain critical limitations. Too many parameters are involved in the construction of the model, so solving the energy minimization problem becomes very computation-intensive. Also, only the best match is found; if the image contains multiple instances of the same object, they will not all be detected by the pictorial representation of [91]. These issues are aptly handled by Felzenszwalb and Huttenlocher in the pioneering work reported in [92]. The pictorial representation proposed by Fischler and Elschlager can be viewed as a general graph, whereas Felzenszwalb and Huttenlocher used a tree representation, realising that many real-world objects, especially human beings and animals, can be represented by a tree structure. With this restriction, finding the best match of a model to an image can be computed in polynomial time. The approach demands that the graph generated to represent the object be acyclic, and that the function dij(li, lj), measuring the degree of deformation of the model when part vi is placed at location li and part vj at location lj, be a Mahalanobis distance between transformed locations. DPMs are an impressive way of representing objects. While deformable models can capture significant variations in appearance, a single deformable model is often not expressive enough to represent a rich object category [93]. It can also be noted that in practice simple models generally outperform approaches using deformable part-based representations, the reason being that simpler models can be trained easily, whereas it is more difficult to train sophisticated models like DPMs.
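The pictorial-structure matching just described can be summarised by a single energy function. Writing L = (l1, ..., ln) for the candidate part locations, mi for the match cost of placing part vi at li, and dij for the deformation cost over the connected pairs E, the best configuration minimises (the notation here is the commonly used one, not copied verbatim from [91, 92]):

```latex
L^{*} = \arg\min_{L} \left( \sum_{i=1}^{n} m_i(l_i)
        \;+\; \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \right)
```

When the graph E is a tree and dij is a Mahalanobis distance between transformed locations, this minimisation can be carried out by dynamic programming over the tree, which is the source of the polynomial-time matching mentioned above.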
The authors of [93] describe a deformable part-based model that represents an object by a low-resolution root filter and a set of higher-resolution part filters arranged in a flexible spatial configuration. The flexible spatial configuration helps to model the visual appearance at multiple scales. The approach has achieved benchmark results in the PASCAL object detection challenges. It basically uses HOG features [32] in a star-structured part-based model defined by a root filter similar to the filter used in [32], together with a set of part filters and deformation models. The model presented in [101] is effective for shallow structures consisting of at most two layers, but as the number of layers increases, it becomes difficult to scale the model without incorporating and tuning additional parameters. Yuille et al. in [106] have extended the model discussed in [101]. They propose describing an object class using several templates from different viewpoints. Each template is represented as a tree structure consisting of three layers: the first layer represents the entire image; the second layer divides the image into 9 sub-images; and the third layer divides each sub-image of the second layer into four sub-images, giving 36 sub-images in the third layer. The approach used by Dalal and Triggs [32] to detect pedestrians fails in the presence of articulation, whereas [93, 94, 95, 96] allow an intermediate layer of parts that can be shifted with respect to each other, making the overall model deformable and thereby achieving generalization. But such approaches do not work for extracting human pose from images. In [102], Bourdev and Malik introduced 'poselets' (parts that are tightly clustered in both appearance and configuration) for detection and pose estimation in images containing human bodies.
In [79], Pablo et al. unify the approaches presented by Dalal and Triggs [32], Felzenszwalb [95] and Bourdev and Malik [102] into a single recognition framework, trying to take the benefit of each approach: region-based object descriptors are used to perform purposive semantic segmentation, and their outputs are subsequently combined, improving overall performance.

4.2.4. Recent Approaches and Advancements

We have discussed many approaches, with their benefits and limitations, in the earlier sections. It can also be noted that all those approaches to object recognition make essential use of machine learning methods. Most such machine learning methods work well because of human-designed representations and input features: early conventional approaches involve hand-crafted features for object representation and look for these features in the image. To do this, the programmer was required to have deep knowledge of the data and to laboriously engineer each of the feature detection algorithms [114]. There have been big improvements in image analysis over the last few years due to the adoption of deep neural networks to solve vision problems. Figure 8 shows schematically the difference between traditional vision systems and recent deep-neural-network-based systems.

Neural Nets for Object Recognition: Neural networks have been used in object recognition systems for decades. Neural nets implement a classification approach; their attraction lies in their ability to partition the feature space using nonlinear class boundaries. Earlier, neural networks were used only as the classifier in the classification stage of the object recognition pipeline (Figure 8), but more recently, with the progress in vision research and the increase in computational power, neural networks are utilized for automatic feature learning (from the raw image data) as well as for classification.
LeCun [123] demonstrated in 1989 an algorithm to train neural networks in a supervised way and showed that applications such as hand-written digit recognition benefit from it and perform remarkably well. Since then, convolutional neural networks have been used by many research communities.
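The basic operation such convolutional networks stack and train is a 2-D convolution whose kernels are learned rather than hand-designed. A minimal NumPy sketch of the forward operation follows; the fixed edge kernel is used purely for illustration, standing in for a filter a network would learn.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation: the basic operation a
    convolutional layer applies, here with a single channel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the kernel with the image patch under it
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# a horizontal-difference kernel, the kind of edge filter a CNN might learn
img = np.zeros((5, 5)); img[:, 2:] = 1.0
k = np.array([[-1.0, 1.0]])
resp = conv2d_valid(img, k)      # strong response at the vertical edge
```

A convolutional layer applies many such kernels in parallel, interleaves them with nonlinearities and pooling, and adjusts the kernel weights by backpropagation instead of fixing them by hand.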
Convolutional neural networks differ from conventional approaches such as BoF and DPM (Deformable Part Model) for two very important reasons. First, they are deep architectures, whereas the conventional approaches were shallow. Second, they do not need prior knowledge of the image data. Deep learning made it possible to learn features directly from the data instead of handcrafting them explicitly. This has greatly helped vision tasks, particularly object recognition, by enabling effective capture of both low-level and mid-level cues of the object to be recognized. As a result, deep neural networks have brought large improvements in the performance of image analysis over the last few years. What makes deep architectures achieve such good results? Conventional neural nets used one to two layers of neurons, whereas a "deep neural network" (DNN) uses architectures with several layers stacked on top of each other. As a result, a DNN can learn more complex models without the need for hand-designed features. DNNs have shown good results on the ImageNet dataset [126]: on the test data in ILSVRC 2010, the authors achieved top-1 and top-5 error rates of 37.5% and 17%. Their neural network consisted of 650,000 neurons, had 5 convolutional layers, and learnt 60 million parameters. Like every other approach, deep architectures also have certain limitations:
• They need very sophisticated hardware, and images of a fixed size, typically 224 x 224.
• They contain a huge number of parameters to be trained, and are therefore computation intensive.
• When trained using gradient descent, the gradient does not trickle down to the lower layers, so sub-optimal sets of weights are obtained [114].
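The fixed-input-size limitation noted above is what spatial-pyramid-style pooling addresses: max-pooling a convolutional feature map over a fixed grid of bins at several pyramid levels yields a vector whose length depends only on the channel count and the level sizes, not on the image size. A rough NumPy sketch follows; the pyramid levels and shapes are illustrative, not those of any particular paper, and the feature map is assumed to be at least as large as the finest grid.

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map into a fixed-length vector:
    each level splits the map into an n x n grid and takes the max
    per bin, so the output length is independent of H and W."""
    c, h, w = fmap.shape
    parts = []
    for n in levels:
        # bin edges chosen so the bins exactly cover the whole map
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = fmap[:, ys[i]:ys[i+1], xs[j]:xs[j+1]]
                parts.append(bin_.max(axis=(1, 2)))
    return np.concatenate(parts)   # length = C * sum(n*n for n in levels)

rng = np.random.default_rng(1)
v_small = spatial_pyramid_pool(rng.random((3, 13, 9)))    # small 'image'
v_large = spatial_pyramid_pool(rng.random((3, 40, 57)))   # larger 'image'
# both vectors have the same fixed length: 3 * (1 + 4 + 16) = 63
```

Placed between the last convolutional layer and the fully connected layers, such pooling lets the fully connected part of the network see a constant-size input regardless of the original image dimensions.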
Various modifications to DNNs have been suggested in the literature to overcome these limitations. To overcome the fixed-size input constraint of deep neural networks, several efficient pooling strategies have been proposed. In [113], the network is equipped with a spatial pyramid pooling strategy (SPP-net). SPP-net can generate a fixed-length representation irrespective of image scale and size. Spatial pyramid pooling is based on spatial pyramid matching [24], which in turn is an extension of the BoF approach [26]. Another improvement is the Recursive Neural Network (RNN) [130], used for scene classification; the method predicts a tree structure for scene images.

Figure 8(a). Block diagram representing a typical traditional object recognition system

Various Competitive Challenges and Datasets
In this section we present some of the challenges that the computer vision community organizes annually to invite, evaluate, and report the innovative approaches developed by research groups across the world. These challenges serve the purpose of setting a common platform for researchers to present their work and compete with each other in the areas of object detection, localization, and categorization. These
challenges also provide datasets of sample images so that the approaches can be evaluated over a wide range of image content and variations in image capturing conditions.

Figure 8(b). Block diagram of deep architectures (image taken from: ufldl.stanford.edu/eccv10-tutorial/eccv10-tutorial_part4.ppt)

Early approaches to object recognition used very small sets of images to evaluate their algorithms. With the advent of the sophisticated world-wide web, however, large numbers of annotated images have become readily available in public as well as private repositories. To harness the benefit of such repositories, datasets have been created and made available to the research community. As mentioned earlier, these challenges are an effort to bring the research community together in a framework of competition so that the best approaches in computer vision can be evaluated and publicized. They consist of two components: first, a publicly available dataset with ground-truth annotations and standardised evaluation software, and second, a competition and workshop [119]. To review these challenges, we first discuss the datasets made available by the competitions, along with certain other widely used datasets.

Datasets: No research is possible in any research area without appropriate datasets [30], and the same applies to object recognition and computer vision research. Appropriate datasets are needed at all stages of recognition research: for learning visual models of objects and scene categories, for detecting and localizing instances of these models in images, and for evaluating the performance of recognition algorithms. The work in [30] reviews existing image datasets from the point of view of expectations, challenges, and limitations.
Datasets should ideally offer a wide range of image variability and should be sufficiently challenging that algorithms can be evaluated meaningfully. One of the major limitations in creating such datasets is that the images must be annotated. This annotation has to be done by human experts and turns out to be a mammoth task, considering the huge number of real-world objects to be recognized for various applications, and it is not easy to get human experts to accomplish it effectively, correctly, and efficiently. An elegant approach to automatic dataset collection from the web is presented in [66], which applies an object recognition technique incrementally: images found on the web are used to learn the model in a robust way. Another solution for getting annotated
training examples is crowdsourcing, but the most common error an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label, simply because they are unaware of its existence. We now discuss some of the most prevalent datasets.

Caltech-101 & Caltech-256: Caltech-101 is a collection of pictures of objects belonging to 101 categories, collected by Fei-Fei et al. [64] in 2003. There are about 40 to 800 images per category; most categories have about 50 images. Most images have little or no clutter, and the objects tend to be centered in each image in a stereotypical pose. In comparison to Caltech-101, Caltech-256 is a collection of 256 object categories, with 30,608 images in total. Figure 9 compares Caltech-101 and Caltech-256.

Figure 9 (Courtesy: http://www.vision.caltech.edu/Image_Datasets/Caltech256/details.html)

TRECVID: TRECVID organizes a competition every year and, for evaluating performance, releases a dataset consisting of video shots. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Annotation is not provided by the organizers.

LabelMe: LabelMe [74] is a publicly available annotated image database open for public contribution. The dataset is provided with an annotation tool, so that anyone can annotate any image. As images are annotated by experts as well as casual users, it cannot be relied on for obtaining a test set, but a huge quantity of training images can be obtained from it.

COIL-20 & COIL-100: COIL-20 and COIL-100 are databases of grayscale images of 20 and 100 object categories respectively [120].
Different poses of the objects were generated by placing each object on a rotating turntable and capturing images at angular displacements of 5 degrees, giving 72 views per object. The unprocessed set consists of 720 images of 10 object categories; 1,440 size-normalized images are also provided.

Microsoft COCO: The Common Objects in Context database is a large-scale image database that addresses three core research problems in scene understanding: detecting non-iconic views of objects, contextual reasoning between objects, and the precise 2D localization of objects [122]. Contextual knowledge can help boost all components of the object recognition framework, and the dataset is provided to support object recognition based on the context in which objects lie in the scene. The dataset contains 91 common object categories, 82 of which have more than 5,000 labeled instances; in total it has 2,500,000 labeled instances in 328,000 images. The dataset consists of fewer object categories but a very high number of instances per category, which differentiates it from other popular large-scale datasets such as PASCAL VOC and the ImageNet dataset, discussed in the following sections.

The PASCAL Visual Object Classes Challenge: The PASCAL VOC challenge was first organized in the year 2005 and was then held annually up to 2012. The challenge basically consists of two components. A dataset consisting of 1,000 images related to objects of
20 categories, obtained from the Flickr web-site, was made publicly available, together with a competition involving object classification, detection, segmentation, action classification, and person layout. Everingham et al. have reviewed PASCAL VOC in [119]. Each of the objects was fully annotated. The dataset was not so rich from the beginning: in 2005 only four categories (motorbikes, bicycles, cars, and people) were made available, but every year the organizers kept enriching it, until the full dataset was released in 2011. To assess the different methods, bootstrapping of the ROC curve is used. This evaluation technique is applied in a number of different ways: to simply judge the variability of a given method, to compare the relative strength of two methods, or to look at rank ranges in order to get an overall sense of all methods in a competition [119].

ImageNet Large Scale Visual Recognition Challenge (ILSVRC): ILSVRC was first organized in 2010, and since then the event has been held annually. ILSVRC is one of the most prestigious series of competitions and workshops in the computer vision community for evaluating the performance of contemporary approaches developed by various researchers. The challenge is nicely reviewed from various aspects in [118]. Similar to PASCAL VOC, ILSVRC provides a huge collection of annotated images, under the name ImageNet, by Deng et al.

ImageNet Dataset: ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images; currently there are over five hundred images per node [121]. In ImageNet, on average about 1,000 images are provided to illustrate each synset. Images of each concept are quality-controlled and human-annotated.
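The bootstrapping used for such evaluations can be sketched generically: resample the per-image outcomes with replacement many times and read off a confidence interval for a score such as accuracy. This is a simplified illustration of the resampling idea, not the exact VOC protocol; the outcome vector and interval settings are hypothetical.

```python
import numpy as np

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Estimate a confidence interval for a method's accuracy by
    resampling its per-image outcomes with replacement."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    # accuracy of each bootstrap resample, sorted for percentile lookup
    stats = np.sort([rng.choice(correct, n, replace=True).mean()
                     for _ in range(n_boot)])
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return correct.mean(), (lo, hi)

# toy outcomes: 1 = image correctly classified, 0 = error
outcomes = [1]*80 + [0]*20
acc, (lo, hi) = bootstrap_ci(outcomes)   # point estimate plus interval
```

The same resampling applied to two methods' outcomes gives a direct sense of whether an observed gap between them exceeds the evaluation noise.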
ILSVRC is of much larger scale than PASCAL VOC [121]. As per 2010 data, it is organized in the form of 12 subtrees with 5,247 synsets and 3.2 million images in total. As ImageNet's organization is inspired by the WordNet structure, and there are around 80,000 noun synsets in WordNet, ImageNet aims at providing the majority of those 80,000 synsets with an average of 500-1,000 clean, full-resolution images each. To evaluate the approaches, the effective bootstrapping strategy used by PASCAL VOC is employed in the ILSVRC series as well. In Table 2, we present a comparison between the PASCAL VOC and ILSVRC challenges as of 2012, as referred from [101].

Table 2 Comparison of PASCAL VOC and ILSVRC as per [101]
Aspect for comparison | PASCAL VOC | ILSVRC
Diversity of object classes | Objects carry only one class label, e.g. 'boat' for all types of boat, be it lifeboat or fireboat | Objects are further refined into subcategories, e.g. not just 'boat' but lifeboat, gondola
Chance performance of localization (CPL) | 8.8% on the validation set for 20 categories | 20.8% for all 1000 categories
Average object scale per class | 0.241 | 0.358
Average number of instances per class | 1.69 | 1.59
Clutter per class (computed as number of boxes) | 129.96 | 106.98
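Challenges such as PASCAL VOC also summarize ranked detection or classification outputs with precision-recall measures, most commonly average precision (AP). A simplified, non-interpolated sketch of AP over a ranked list follows; the scores and labels are toy values for illustration.

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision of a ranked list: sort by score descending,
    then average the precision measured at each true positive
    (the simple non-interpolated form of the measure)."""
    order = np.argsort(scores)[::-1]           # highest score first
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                   # true positives so far
    precision = hits / np.arange(1, len(labels) + 1)
    return float(np.sum(precision * labels) / labels.sum())

# three detections: the two highest-scored are true positives
ap = average_precision([0.9, 0.8, 0.3], [1, 1, 0])   # perfect ranking
```

A perfect ranking, with all positives ahead of all negatives, yields AP = 1.0; any negative ranked above a positive pulls the value down.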
In addition to all these, various other datasets are used in the literature: GRAZ-01 by Opelt and Pinz, which contains four types of images (bikes, people, background with no bikes, background with no people); the INRIA person dataset used by Dalal and Triggs in [32, 33]; MNIST, a dataset of handwritten digits; ImageCLEF; INRIA (horses, cars); the TinyImages dataset by Torralba et al.; ETH-80; and others. The CIFAR-10 set has 6,000 examples of each of 10 classes, and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes [125]. The list we have considered is exemplary rather than exhaustive; for an exhaustive list, [127] can be explored further.

5. FUTURE RESEARCH DIRECTIONS
Object recognition is one of the most exciting research areas in the field of computer vision. The need today is to develop systems that are computationally efficient and at the same time cost effective. We suggest some future research directions which can be explored and in turn incorporated into recognition systems, at the algorithmic level or at the product level; many of these can at present be considered as ideas that may need a knowledge base from multidisciplinary fields.
• Deep learning is the current state of the art in object recognition and has produced promising results, but it suffers from the serious limitation of being resource intensive; in the absence of sophisticated hardware, DNNs cannot be adopted for object recognition. In such cases, enhancing the performance of conventional feature extraction techniques on shallow architectures can be helpful. New approaches are needed which require only shallow architectures yet remain efficient. It is also recognized that DNNs have not shown very impressive results on the task of object localization, so this area can be explored further.
• It also remains to be understood how the learning of features takes place in convolutional neural networks: what makes deep architectures achieve such high recognition accuracy?
• With the advent of mobile and other hand-held devices with very good image-capturing abilities, recognition algorithms suited to such devices are in great demand.
• Considerable work exists in the literature on action recognition, and a complete line of research continues in this direction, as the area involves many varied issues and research problems; products can be developed involving action and activity recognition from videos.
• Computer vision techniques can be a good way to build assistive technology for blind people. For example, products can be developed that observe the surroundings, generate a natural-language description of the scene, and give it as output in spoken form. This will help blind people understand their surroundings and navigate.
• Research on understanding videos from their content has already started but is still in its infancy. Generic object recognition also paves the path for research in areas such as emotion recognition, which would actually enable us to recognize the meaning of the content in a video.
• Robotics is another important field that can benefit from active object recognition. Today's robots are able to work only in well-structured, constrained environments, whereas the requirement is to develop robots that can learn, adapt, and execute their tasks in real human environments.
• Almost every device has a camera, and devices are now powerful enough to record and process live video. These videos can be exploited for real-time applications. How do we organize and personalize all of this content for the common man?
• New performance evaluation techniques are needed.
• Many rich datasets, such as ImageNet and PASCAL VOC, have been generated by the computer vision research community and made available to the public. Although these datasets hold huge numbers of images pertaining to various categories, if we are to reach the level of near-human vision capability in terms of flexibility and dynamism, these datasets must be enriched further. Novel ways of labelling the huge amount of unlabeled image data should be found, so that images annotated with ground truth can be generated and made publicly available.

6. CONCLUSION
From the literature available on the subject, it was found that demand for efficient generic object recognition systems is increasing very fast, as the spectrum of applications in which object recognition is needed is wide and rich. One major problem in generic object recognition is that the categories present in the real world are varied and huge in number. Due to this fact, training a recognition system for such a large number of categories and classes becomes a challenging task, however sophisticated the approach may be. Such systems are also expected to exhibit the property of plasticity, i.e., the ability to gradually train themselves on unseen categories, which further adds to their complexity. Systems should therefore be developed that are flexible enough to train themselves for new classes of objects. Another important issue with generic object recognition systems lies in the feature extraction and description phase.
In most approaches, the number of features obtained is too large, and the features are handcrafted. This critical limitation has been overcome by deep architectures, which in turn exploit the sophisticated hardware acceleration that has evolved recently. Approaches are now needed that make the entire setup cost effective and less resource hungry. Ideally, the recognition task should be performed at the semantic level, which would result in near-human vision systems. One of the key objectives behind this survey was to answer the research questions identified by us in Section 1. From the literature surveyed, it can be deduced that earlier work on generic object recognition put more weight on the feature extraction stage and the type of features, whereas later work gave more prominence to the type of classifier used. Recent approaches, moreover, learn features directly from the image data; this can be regarded as a very striking innovation achieved by the vision community. Ways are now needed to further enhance these approaches, and the above efforts can also be extended to 3D images and videos. As a result of this study and the referred material, a general remark can also be made about the kind of work done in the field: most papers before 2008 mainly present novel ways of modelling the object class, i.e., they emphasize novel ways of feature detection and description, whereas work presented and published in the recent past, since 2011 and with the advent of sophisticated hardware, places more emphasis on handling more categories accurately and efficiently. In this paper, the current scenario of generic object recognition is portrayed in brief, with the hope that in the near future an object recognition system will be developed
which is capable of performing vision tasks similar to the human vision system, with the least possible effort and in a cost-effective manner.

REFERENCES
[1] Bennamoun, Mohammed, and George J. Mamic. Object Recognition: Fundamentals and Case Studies. Springer Science & Business Media, 2002.
[2] Hori, Takahiro, Tetsuya Takiguchi, and Yasuo Ariki. "Generic Object Recognition Using Graph Embedding into a Vector Space." American Journal of Software Engineering and Applications 2.1 (2013): 13-18.
[3] Lowe, David G. "Object Recognition from Local Scale-Invariant Features." Proc. International Conference on Computer Vision, Corfu, Sept. 1999.
[4] Lowe, David G. "Distinctive Image Features from Scale-Invariant Keypoints." 2004.
[5] Ke, Yan, and Rahul Sukthankar. "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors." Proc. CVPR 2004, Vol. 2. IEEE, 2004.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "SURF: Speeded Up Robust Features." Computer Vision - ECCV 2006. Springer Berlin Heidelberg, 2006. 404-417.
[7] Leibe, B., A. Leonardis, and B. Schiele. "Combined Object Categorization and Segmentation with an Implicit Shape Model." ECCV 2004 Workshop on Statistical Learning in Computer Vision, May 2004. 17-32.
[8] Thomas, Alexander, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, Bernt Schiele, and Luc Van Gool. "Towards Multi-View Object Class Detection." Proc. CVPR 2006. IEEE, 2006.
[9] Leibe, Bastian, Aleš Leonardis, and Bernt Schiele. "Robust Object Detection with Interleaved Categorization and Segmentation." International Journal of Computer Vision 77 (2008): 259-289.
[10] Jia, Menglei, Hua Li, Xing Xie, Zheng Chen, and Wei-Ying Ma. "Automatic Classification of Objects within Images." US Patent Application 20080037877, Microsoft Corporation (Redmond, WA, US). http://www.freepatentsonline.com/y2008/0037877.html
[11] Pisipati, Radha Krishna, Shahanaz Syed, Kishore Jonna, Subhadip Bandyopadhyay, and Rudra Narayan Narayan. "Systems and Methods for Multi-Dimensional Object Detection." US Patent Application 20140029852, 2014. http://www.freepatentsonline.com/y2014/0029852.html
[12] Besl, Paul J., and Ramesh C. Jain. "Three-Dimensional Object Recognition." ACM Computing Surveys 17.1 (1985): 75-145.
[13] Biederman, Irving. "Recognition-by-Components: A Theory of Human Image Understanding." Psychological Review 94.2 (1987): 115-147.
[14] Zhang, Xin, et al. "Object Class Detection: A Survey." ACM Computing Surveys 46.1 (2013): 10.
[15] Andreopoulos, Alexander, and John K. Tsotsos. "50 Years of Object Recognition: Directions Forward." Computer Vision and Image Understanding 117.8 (2013): 827-891.
[16] Roth, Peter M., and Martin Winter. "Survey of Appearance-Based Methods for Object Recognition." Technical Report ICG-TR-01/08, Inst. for Computer Graphics and Vision, Graz University of Technology, Austria, 2008.
[17] Agarwal, Shivani, Aatif Awan, and Dan Roth. "Learning to Detect Objects in Images via a Sparse, Part-Based Representation." IEEE Transactions on Pattern Analysis and Machine Intelligence 26.11 (2004): 1475-1490.
[18] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object Class Recognition by Unsupervised Scale-Invariant Learning." Proc. CVPR 2003, Vol. 2. IEEE, 2003.
[19] Opelt, A., A. Pinz, and A. Zisserman. "A Boundary-Fragment Model for Object Detection." Proc. ECCV, Vol. 2, May 2006. 575-588.
[20] Wu, Bo, and Ramakant Nevatia. "Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors." Proc. ICCV 2005, Vol. 1. IEEE, 2005.
[21] Opelt, Andreas, Axel Pinz, and Andrew Zisserman. "Incremental Learning of Object Detectors Using a Visual Shape Alphabet." Proc. CVPR 2006. IEEE, 2006.
[22] Opelt, Andreas, Axel Pinz, and Andrew Zisserman. "Fusing Shape and Appearance Information for Object Category Detection." 2006. eprints.pascal-network.org
[23] Opelt, Andreas, Axel Pinz, and Andrew Zisserman. "Learning an Alphabet of Shape and Appearance for Multi-Class Object Detection." International Journal of Computer Vision 80 (2008): 16-44.
[24] Grauman, Kristen, and Trevor Darrell. "Pyramid Match Kernel and Related Techniques." U.S. Patent No. 7,949,186, 24 May 2011.
[25] Zhang, Jianguo, et al.
"Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study." International Journal of Computer Vision 73.2 (2007): 213-238.
[26] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories." Proc. CVPR 2006, Vol. 2. IEEE, 2006.
[27] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Spatial Pyramid Matching." Object Categorization: Computer and Human Vision Perspectives 3 (2009): 4.
[28] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "A Discriminative Framework for Texture and Object Recognition Using Local Image Features." In Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), 2006.
[29] Lazebnik, Svetlana. "Local, Semi-Local and Global Models for Texture, Object and Scene Recognition." 2006.
[30] Ponce, J., T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. "Dataset Issues in Object Recognition." In Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), 2006.
[31] Dorko, Gyuri, and Cordelia Schmid. "Object Class Recognition Using Discriminative Local Features." 2005.
[32] Dalal, N., and B. Triggs. "Histograms of Oriented Gradients for Human Detection." Proc. CVPR 2005. IEEE, 2005.
[33] Dalal, N., B. Triggs, and C. Schmid. "Human Detection Using Oriented Histograms of Flow and Appearance." Proc. ECCV 2006.
[34] Dalal, Navneet. Finding People in Images and Videos. Dissertation, Institut National Polytechnique de Grenoble (INPG), 2006.
[35] Jurie, Frederic, and Bill Triggs. "Creating Efficient Codebooks for Visual Recognition." Proc. ICCV 2005, Vol. 1. IEEE, 2005.
[36] Bosch, Anna, Andrew Zisserman, and Xavier Munoz. "Representing Shape with a Spatial Pyramid Kernel." Proc. 6th ACM International Conference on Image and Video Retrieval. ACM, 2007.
[37] Watanabe, Tomoki, Satoshi Ito, and Kentaro Yokoi. "Co-occurrence Histograms of Oriented Gradients for Pedestrian Detection." Advances in Image and Video Technology. Springer Berlin Heidelberg, 2009. 37-47.
[38] Joachims, T. "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization." Proc. ICML 1997.
[39] Csurka, Gabriella, et al. "Visual Categorization with Bags of Keypoints." Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[40] Ramanan, Amirthalingam, and Mahesan Niranjan. "A Review of Codebook Models in Patch-Based Visual Object Recognition." Journal of Signal Processing Systems 68.3 (2012): 333-352.
[41] Mikolajczyk, K., and C. Schmid. "Indexing Based on Scale Invariant Interest Points." Proc. ICCV 2001, Vol. 1. 525-531.
[42] Mikolajczyk, K., A. Zisserman, and C. Schmid. "Shape Recognition with Edge-Based Features." Proc. BMVC 2003, Vol. 2. 779-788.
[43] Mikolajczyk, K., and C. Schmid. "Scale and Affine Invariant Interest Point Detectors." International Journal of Computer Vision 60.1 (2004): 63-86.
[44] Mikolajczyk, K., T. Tuytelaars, C. Schmid, and A. Zisserman. "A Comparison of Affine Region Detectors." International Journal of Computer Vision 65.1-2 (2005): 43-72.
[45] Mikolajczyk, K., and C. Schmid. "A Performance Evaluation of Local Descriptors." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.10 (2005): 1615-1630.
[46] Douze, Matthijs, et al. "Evaluation of GIST Descriptors for Web-Scale Image Search." Proc. ACM International Conference on Image and Video Retrieval. ACM, 2009.
[47] Jégou, Hervé, Matthijs Douze, and Cordelia Schmid. "Improving Bag-of-Features for Large Scale Image Search." International Journal of Computer Vision 87.3 (2010): 316-336.
[48] Tuytelaars, Tinne, and Krystian Mikolajczyk. "Local Invariant Feature Detectors: A Survey." Foundations and Trends in Computer Graphics and Vision 3.3 (2008): 177-280.
[49] Mikolajczyk, K., Bastian Leibe, and Bernt Schiele. "Multiple Object Class Detection with a Generative Model." Proc. CVPR 2006, Vol. 1. IEEE, 2006. 26-36.
[50] Jégou, Hervé, et al. "Aggregating Local Descriptors into a Compact Image Representation." Proc. CVPR 2010. IEEE, 2010.
[51] Wengert, Christian, Matthijs Douze, and Hervé Jégou. "Bag-of-Colors for Improved Image Search." Proc. 19th ACM International Conference on Multimedia. ACM, 2011.
[52] Belongie, S., J. Malik, and J. Puzicha. "Matching Shapes." Proc. ICCV 2001. IEEE, 2001.
[53] Belongie, Serge, Jitendra Malik, and Jan Puzicha. "Shape Matching and Object Recognition Using Shape Contexts." IEEE Transactions on Pattern Analysis and Machine Intelligence 24.4 (2002): 509-522.
[54] Ferencz, Andras, Erik G. Learned-Miller, and Jitendra Malik. "Building a Classification Cascade for Visual Identification from One Example." Proc. ICCV 2005. IEEE, 2005.
[55] Zhang, Hao, Alexander C. Berg, Michael Maire, and Jitendra Malik. "SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Recognition." Proc. CVPR 2006. IEEE, 2006.
[56] Ommer, Bjorn, and Jitendra Malik. "Multi-Scale Object Detection by Clustering Lines." Proc. ICCV 2009. IEEE, 2009. 484-491.
[57] Maji, Subhransu, and Jitendra Malik. "Object Detection Using a Max-Margin Hough Transform." IEEE, 2009. 1038-1045.
[58] Vidal-Naquet, Michel, and Shimon Ullman. "Object Recognition with Informative Features and Linear Classification." Proc. ICCV, Vol. 3,
2003.
[59] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object Class Recognition by Unsupervised Scale-Invariant Learning." Proc. CVPR 2003, Vol. 2. IEEE, 2003.
[60] Epshtein, Boris, and Shimon Ullman. "Identifying Semantically Equivalent Object Fragments." Proc. CVPR 2005. IEEE, 2005.
[61] Borenstein, Eran, and Shimon Ullman. "Combined Top-Down/Bottom-Up Segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence 30.12 (2008): 2109-2125.
[62] Fei-Fei, L., R. Fergus, and P. Perona. "Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories." Proc. CVPR Workshop on Generative-Model Based Vision, 2004.
[63] Fei-Fei, Li, Rob Fergus, and Pietro Perona. "Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories." Computer Vision and Image Understanding 106.1 (2007): 59-70.
[64] www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[65] Fei-Fei, Li, Robert Fergus, and Pietro Perona. "One-shot learning of object categories." Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.4 (2006): 594-611.
[66] Li, Li-Jia, Gang Wang, and Li Fei-Fei. "OPTIMOL: Automatic Online Picture Collection via Incremental Model Learning." IEEE, 2007.
[67] Su, Hao, Min Sun, Li Fei-Fei, and Silvio Savarese. "Learning a Dense Multi-View Representation for Detection, Viewpoint Classification and Synthesis of Object Categories." 2009 IEEE 12th International Conference on Computer Vision (ICCV).
[68] Yao, Bangpeng, and Li Fei-Fei. "Recognizing Human-Object Interactions in Still Images by Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 9, September 2012, pp. 1691-1703.
[69] Oliva, Aude, and Antonio Torralba. "Building the gist of a scene: The role of global image features in recognition." Progress in Brain Research 155 (2006): 23-36.
[70] Torralba, Antonio, Kevin P. Murphy, and William T. Freeman. "Sharing Visual Features for Multiclass and Multiview Object Detection." April 2004.
[71] Torralba, Antonio, Kevin P. Murphy, and William T. Freeman. "Sharing features: efficient boosting procedures for multiclass object detection."
[72] Oliva, Aude, and Antonio Torralba. "Modeling the shape of the scene: A holistic representation of the spatial envelope." International Journal of Computer Vision 42.3 (2001): 145-175.
[73] Torralba, Antonio, Robert Fergus, and Yair Weiss. "Small codes and large image databases for recognition." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[74] Russell, B. C., A. Torralba, and K. P. Murphy. "LabelMe: a database and web-based tool for image annotation." International Journal of Computer Vision 77 (2008): 157-173. Springer.
[75] Rassem, Taha H., and Bee Ee Khoo. "Object Class Recognition using Combination of Color SIFT Descriptors." IEEE, 2011.
[76] Dorko, Gyuri, and Cordelia Schmid. "Object Class Recognition Using Discriminative Local Features." Technical Report.
[77] Dorko, Gy., and C. Schmid. "Selection of scale-invariant parts for object class recognition." Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV'03), pp. 634-639, 2003.
[78] Serre, Thomas, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. "Robust Object Recognition with Cortex-Like Mechanisms." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 3, March 2007, pp. 411-426.
[79] Mayurathan, B., A. Ramanan, S. Mahesan, and U. A. J. Pinidiyaarachchi. "Speeded-up and Compact Visual Codebook for Object Recognition." International Journal of Image Processing (IJIP) 7.1 (2013): 31-50.
[80] Gonzalez, Rafael C., and Richard E. Woods. Digital Image Processing. Prentice-Hall Inc., 2002.
[81] Fergus, R., L. Fei-Fei, P. Perona, and A. Zisserman. "Learning Object Categories from Google's Image Search." Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, 2005.
[82] Carreira, Joao, and Cristian Sminchisescu. "Constrained Parametric Min-Cuts for Automatic Object Segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3241-3248.
[83] Van de Sande, K. E. A., T. Gevers, and C. G. M. Snoek. "Evaluation of color descriptors for object and scene recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008.
[84] Uijlings, Jasper R. R., et al. "Selective search for object recognition." International Journal of Computer Vision 104.2 (2013): 154-171.
[85] Van de Sande, Koen E. A., et al. "Segmentation as selective search for object recognition." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[86] Li, Fuxin, Joao Carreira, and Cristian Sminchisescu. "Object Recognition as Ranking Holistic Figure-Ground Hypotheses." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1712-1719.
[87] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[88] Carreira, Joao, et al. "Semantic segmentation with second-order pooling." Computer Vision–ECCV 2012. Springer Berlin Heidelberg, 2012. 430-443.
[89] Carreira, Joao, and Cristian Sminchisescu. "CPMC: Automatic object segmentation using constrained parametric min-cuts." Pattern Analysis and Machine Intelligence, IEEE Transactions on 34.7 (2012): 1312-1328.
[90] Li, Fuxin, Joao Carreira, and Cristian Sminchisescu. "Object recognition as ranking holistic figure-ground hypotheses." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[91] Fischler, Martin A., and Robert A. Elschlager. "The representation and matching of pictorial structures." IEEE Transactions on Computers 22.1 (1973): 67-92.
[92] Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Pictorial structures for object recognition." International Journal of Computer Vision 61.1 (2005): 55-79.
[93] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.9 (2010): 1627-1645.
[94] Girshick, Ross B., Pedro F. Felzenszwalb, and D. McAllester. "Discriminatively trained deformable part models, release 5." (2012).
[95] Felzenszwalb, Pedro, David McAllester, and Deva Ramanan. "A discriminatively trained, multiscale, deformable part model." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[96] Felzenszwalb, Pedro F., Ross B. Girshick, and David McAllester. "Cascade object detection with deformable part models." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[97] Ferrari, Vittorio, Frederic Jurie, and Cordelia Schmid. "Accurate object detection with deformable shape models learnt from images." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
[98] Pentland, Alex P. "Automatic extraction of deformable part models." International Journal of Computer Vision 4.2 (1990): 107-126.
[99] Pandey, Megha, and Svetlana Lazebnik. "Scene recognition and weakly supervised object localization with deformable part-based models." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[100] Ren, Xiaofeng, and Deva Ramanan. "Histograms of sparse codes for object detection." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[101] Yang, Yi, and Deva Ramanan. "Articulated pose estimation with flexible mixtures-of-parts." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
[102] Bourdev, Lubomir, and Jitendra Malik. "Poselets: Body part detectors trained using 3D human pose annotations." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
[103] Arbeláez, Pablo, et al. "Semantic segmentation using regions and parts." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[104] Bourdev, Lubomir, et al. "Detecting people using mutually consistent poselet activations." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 168-181.
[105] Arbeláez, Pablo, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. "Semantic segmentation using regions and parts." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3378-3385. IEEE, 2012.
[106] Zhu, Long, et al. "Latent hierarchical structural learning for object detection." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[107] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 818-833.
[108] Zhao, Wenyi, et al. "Face recognition: A literature survey." ACM Computing Surveys (CSUR) 35.4 (2003): 399-458.
[109] Yang, Ming-Hsuan, David Kriegman, and Narendra Ahuja. "Detecting faces in images: A survey." Pattern Analysis and Machine Intelligence, IEEE Transactions on 24.1 (2002): 34-58.
[110] Yamazaki, T., T. Fujikawa, and J. Katto. "Improving the performance of SIFT using Bilateral Filter and its Application to Generic Object Recognition." ICASSP 2012, IEEE, pp. 945-948.
[111] Chiu, Liang-Chi, et al. "Fast SIFT Design for Real-Time Visual Feature Extraction." Image Processing, IEEE Transactions on 22.8 (2013): 3158-3167.
[112] Kamencay, Patrik, et al. "Feature extraction for object recognition using PCA-KNN with application to medical image analysis." Telecommunications and Signal Processing (TSP), 2013 36th International Conference on. IEEE, 2013.
[113] He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." arXiv preprint arXiv:1406.4729 (2014).
[114] Goyal, Soren, and Paul Benjamin. "Object Recognition Using Deep Neural Networks: A Survey." arXiv preprint arXiv:1412.3684 (2014).
[115] Nevatia, Ramakant, and Thomas O. Binford. "Description and recognition of curved objects." Artificial Intelligence 8.1 (1977): 77-98.
[116] Fidler, Sanja, Marko Boben, and Ales Leonardis. "Learning a hierarchical compositional shape vocabulary for multi-class object representation." arXiv preprint arXiv:1408.5516 (2014).
[117] Lee, Tom, Sanja Fidler, and Sven Dickinson. "Multi-cue mid-level grouping."
[118] Russakovsky, Olga, et al. "ImageNet large scale visual recognition challenge." arXiv preprint arXiv:1409.0575 (2014).
[119] Everingham, Mark, et al. "The PASCAL visual object classes challenge: A retrospective." International Journal of Computer Vision 111.1 (2014): 98-136.
[120] Nene, Sameer A., Shree K. Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96, 1996.
[121] Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: a large-scale hierarchical image database." IEEE Computer Vision and Pattern Recognition, 2009. <http://www.image-net.org/>.
[122] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in context." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 740-755.
[123] LeCun, Yann, et al. "Backpropagation applied to handwritten zip code recognition." Neural Computation 1.4 (1989): 541-551.
[124] Humphrey, Eric J., Juan Pablo Bello, and Yann LeCun. "Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics." ISMIR, 2012.
[125] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." Computer Science Department, University of Toronto, Tech. Rep 1.4 (2009): 7.
[126] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[127] http://riemenschneider.hayko.at/vision/dataset/index.php, as accessed on 12 April 2015.
[128] http://image-net.org/challenges/LSVRC/2012/analysis/, as accessed on 12 April 2015.
[129] Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153.
[130] Bhisekar, Manisha, and Prajakta Deshmane. "Image Retrieval and Face Recognition Techniques: Literature Survey." International Journal of Electronics and Communication Engineering and Technology 5(1), 2014, pp. 52-58.
[131] Almeida, Yoel E., Ashray S. Bhandare, and Aishwary P. Nipane. "Computer Vision Based Adaptive Lighting Solutions for Smart and Efficient System." International Journal of Computer Engineering and Technology 6(3), 2015, pp. 01-11.
[132] Socher, Richard, et al. "Parsing natural scenes and natural language with recursive neural networks." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.