Visual Object Category
Recognition
Ashish Gupta
Centre for Vision, Speech, and Signal Processing
Contents
• Introduction
• Related work
• Overview: Object recognition system
• Object classification & detection
• Conclusions
• Future work
Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
DIPLECS: Artificial cognitive system for autonomous systems.
• Interested in object interactions determined by
their functional properties.
• All objects in the same category have the same
functional properties.
• Recognition is based on an object’s visual
properties.
Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
• A very large training set is required to learn the
large appearance variation in a category.
• So we utilize huge image datasets such as Flickr®
and Google™ Images.
• These images are noisy and incompletely
labelled.
• Therefore, weakly supervised learning is
utilized which can handle corrupt and noisy
training data.
Challenges
• Intra-category appearance variation
• Pose
• Clutter
• Scale
• Occlusion
• Illumination
• Articulation
• Camouflage
• Background
Work done
Visual Recognition System
SIFT feature descriptor
Occurrence frequency of visual words is characteristic of the object
Object model : bag-of-visual words
Creating a visual codebook
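As a rough illustration of this step, here is a minimal sketch of codebook construction, assuming OpenCV’s SIFT implementation and scikit-learn’s KMeans; the vocabulary size and helper names are illustrative, not taken from the slides.

```python
# Sketch: build a visual codebook by clustering SIFT descriptors.
# Assumes opencv-python (cv2.SIFT_create) and scikit-learn are available.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(image_paths):
    """Stack SIFT descriptors from all training images into one array."""
    sift = cv2.SIFT_create()
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc)
    return np.vstack(all_desc)

def build_codebook(image_paths, n_words=200):
    """Cluster descriptors; the cluster centres are the visual words."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    kmeans.fit(extract_sift(image_paths))
    return kmeans

def bow_histogram(image_path, kmeans):
    """Encode one image as a normalized histogram of visual-word frequencies."""
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()
```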
Object model : bag-of-visual words
A test image can be classified
based on the distance of its
normalized codebook (visual-word histogram)
from the codebooks of the positive and negative
training samples.
(Figure: codebook histograms for the positive samples, the negative samples, and a test image)
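As a hedged illustration of this nearest-codebook decision (the slides do not specify the distance measure; Euclidean distance between normalized histograms is assumed here):

```python
import numpy as np

def classify_by_codebook_distance(test_hist, pos_hist, neg_hist):
    """Label a test image by whichever class codebook histogram it is closer to.
    All histograms are assumed to be normalized to sum to 1."""
    d_pos = np.linalg.norm(test_hist - pos_hist)
    d_neg = np.linalg.norm(test_hist - neg_hist)
    return "positive" if d_pos < d_neg else "negative"
```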
Object model : bag-of-visual words
Visual codebooks for positive and negative samples of ‘car’ category in
PASCAL VOC 2006
Object model : bag-of-visual words
Visual codebooks for ‘car’ and ‘cow’ categories in PASCAL VOC 2009 dataset
Classification
ROC (receiver operating
characteristic) curves are used to evaluate
classification performance.
ROC for ‘car’ category in
PASCAL VOC 2006
The linear kernel K(x, y) = xᵀy was used since it is fast.
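A minimal sketch of this classification and evaluation step, assuming scikit-learn (the slides do not name the SVM implementation used); X_train/X_test hold bag-of-words histograms and y_train/y_test hold binary category labels:

```python
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Linear-kernel SVM on bag-of-words histograms, evaluated with an ROC curve."""
    clf = SVC(kernel="linear")              # K(x, y) = x^T y
    clf.fit(X_train, y_train)
    scores = clf.decision_function(X_test)  # continuous scores needed for the ROC
    fpr, tpr, _ = roc_curve(y_test, scores)
    return fpr, tpr, auc(fpr, tpr)
```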
Improve Classification
Larger Visual Codebook:
• More representative of category
• Higher computational cost
ROC of ‘car’ category in PASCAL VOC
2006 for codebook sizes from 20 to
20000 visual words.
Improve Classification
Training and test images in the
dataset scaled down by the same factor.
Training and test images scaled down by
different factors.
Improve Classification
Scale-down factor | Training samples: Dataset 1 | Training samples: Dataset 2
/1                | Y                           | N
/2                | Y                           | Y
(Y/N: whether the test image was classified correctly)
Improve Classification
ROC for 20 visual categories in
PASCAL VOC 2009
The PASCAL VOC 2009 dataset is
larger and more challenging than the
2006 dataset.
Improve Classification
ROC for PASCAL VOC 2009 with training
and test images scaled down
by a factor of 2
ROC for PASCAL VOC 2009 using a
universal visual vocabulary
Object localization using sliding window
The poor localization results are due to:
• Lack of structural information in the
bag-of-words object model
• The classifier learning the object’s background rather than the object itself (see the sketch below)
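A minimal sliding-window sketch under the same assumptions: each window is encoded as a bag-of-words histogram and scored by the trained classifier. The window size, stride, and helper names (encode, clf) are illustrative.

```python
def sliding_window_detect(image, encode, clf, window=(128, 128), stride=32,
                          threshold=0.0):
    """Score every window with the bag-of-words classifier and keep boxes above
    threshold. `encode` maps an image patch to a normalized BoW histogram;
    `clf` is a trained classifier exposing decision_function."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            patch = image[y:y + window[1], x:x + window[0]]
            score = clf.decision_function(encode(patch).reshape(1, -1))[0]
            if score > threshold:
                detections.append((x, y, window[0], window[1], score))
    return detections
```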
Visual codebook
(Diagram: training images with bounding-boxes give a good codebook with an equal
population of positive and negative visual words; training images without
bounding-boxes split into the cases where the positive-image background is
different from, or similar to, the negative images.)
With no bounding-box utilized, the codebook consists of a majority of negative visual words.
Visual codebook
(Same diagram: bounding-boxes vs. no bounding-boxes, and positive background
different from vs. similar to the negative images.)
Classification is then based on object context (background) rather than object features.
Improve Classification
At each iteration, detection estimates a bounding box, which provides a better
visual codebook, which in turn leads to better detection.
Object detection
• Key-point configurations as
features are a discriminative
object feature set.
• A configuration of visual words
adds structural information
to the bag-of-words model.
• Harvest frequent and discriminative configurations.
• Encode each configuration as a transaction vector.
• An association between a transaction vector and the
training class label is an association rule.
• The Apriori algorithm finds association rules with high
confidence in a support-confidence framework.
(Figure: a transaction vector encoding a key-point configuration)
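The slides do not spell out the exact encoding, but one plausible sketch is: each key-point’s transaction is the set of visual-word IDs among its k spatially nearest neighbouring key-points (the neighbourhood size k is an assumption).

```python
import numpy as np

def keypoint_transactions(keypoint_xy, keypoint_words, k=5):
    """For every key-point, build a transaction: the set of visual-word IDs
    found among its k nearest neighbouring key-points (by image position)."""
    transactions = []
    for i, p in enumerate(keypoint_xy):
        dists = np.linalg.norm(keypoint_xy - p, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        transactions.append(frozenset(int(keypoint_words[j]) for j in neighbours))
    return transactions
```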
Apriori algorithm
• Uses breadth-first search and a tree structure.
• Longer configurations will have lower support as
they are infrequent but higher confidence as they
are more discriminative.
• Downward closure lemma: prune configurations
with infrequent sub-sets.
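A compact sketch of the support-confidence idea and downward-closure pruning, operating on the transactions defined above (simplified; a full implementation would also generate the class association rules more efficiently):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Breadth-first Apriori: grow itemsets level by level, pruning any candidate
    that has an infrequent sub-set (downward closure lemma)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-sub-set of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [s for level in frequent for s in level]

def rule_confidence(itemset, pos_transactions, all_transactions):
    """Confidence of the rule: itemset -> positive class."""
    covered = sum(itemset <= t for t in all_transactions)
    covered_pos = sum(itemset <= t for t in pos_transactions)
    return covered_pos / covered if covered else 0.0
```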
Object localization
(Pipeline: training data set → generate transactions → Apriori data mining →
association rules; test image → generate transactions → confidence generated for
each transaction → transactions thresholded on confidence.)
• A confidence is assigned to every
key-point in the image.
• Key-points with sufficiently high
confidence are retained.
• Key-points which occur on
common background objects like
doors and windows can have high
confidence.
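A hedged sketch of this scoring step, reusing the hypothetical helpers above; taking the maximum confidence over the rules covering each transaction is an assumption, not stated on the slide.

```python
def localize_keypoints(keypoint_xy, transactions, rules, threshold=0.8):
    """rules: list of (itemset, confidence) pairs mined from the training set.
    A key-point is kept if some rule covering its transaction is confident enough."""
    kept = []
    for p, t in zip(keypoint_xy, transactions):
        conf = max((c for itemset, c in rules if itemset <= t), default=0.0)
        if conf >= threshold:
            kept.append((p, conf))
    return kept
```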
Object classification using Apriori
(Pipeline: training data set → generate transactions → Apriori data mining →
association rules; test images → generate transactions → confidence generated for
each transaction → confidences summed per image.)
ROC for ‘car’ category in PASCAL VOC 2006
The summed confidence score depends
upon object scale in the image, which
explains the comparatively poor
performance of this approach.
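An illustrative version of the summed-confidence image score (same hypothetical rule format as above). The scale dependence noted above is visible directly: more key-points simply means more terms in the sum.

```python
def image_confidence(transactions, rules):
    """Image-level score: sum, over all key-point transactions, of the best rule
    confidence covering each one. Larger apparent object scale gives more
    key-points and so inflates the sum."""
    return sum(max((c for itemset, c in rules if itemset <= t), default=0.0)
               for t in transactions)
```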
Conclusions
• The ‘bag-of-words’ model is good for classification, but poor for localization.
• Separate foreground-background for better visual codebooks.
• The good classification performance on the PASCAL VOC 2006 dataset is attributed to
recognition of object context rather than object features.
• The dataset utilized should have sufficient variation in appearance of the
object and its background.
• Larger visual vocabulary gives slightly better classification, but is
computationally more expensive.
• The visual vocabulary built has a majority of background visual words since
bounding-boxes are not utilized during training.
Conclusions
• Improving the proportion of visual words representing the object in the
vocabulary is vital for good classification.
• Incorporate the object boundary contour into the descriptor.
• Use of frequent and discriminative key-point configurations is a promising
approach for object localization.
• A low-quality dataset results in a weak visual codebook and classifiers biased
to the training data.
• Classification using key-point configurations was poor compared to ‘bag-of-
words’ for PASCAL VOC 2006.
Future Work
• Improve a visual codebook by increasing the proportion of visual words
pertaining to object features. Combine Apriori based localization and
clustering for visual word selection in an iterative approach.
• Model visual scene information (use the GIST descriptor by Torralba). Learn
co-occurrence statistics of scenes and visual categories. Recognition of the
scene serves as a prior for object presence and improves object recognition
performance.
• Improve object localization by using context priming.
• Model object contextual information to aid foreground-background
disambiguation for better object localization.
Future Work
• Share feature information between visual categories. The size of a
universal visual vocabulary should increase sub-linearly with increase in
number of visual categories.
• Combine image segmentation and classification to improve the object
model to provide better classification performance.
• Build a hierarchical framework for visual categorization:
• Representation: combine local and global features.
• Model: combine semantic and structural object models.
• Classification: combine generative and discriminative approaches.
Questions?
