Visual Object Category
Recognition
Ashish Gupta
Centre for Vision, Speech, and Signal Processing
Contents
• Introduction
• Related work
• Overview: Object recognition system
• Object classification & detection
• Conclusions
• Future work
Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
DIPLECS: Artificial cognitive system for autonomous systems.
• Interested in object interactions determined by
their functional properties.
• All objects in the same category have the same
functional properties.
• Recognition is based on an object’s visual
properties.
Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
• A very large training set is required to learn the
large appearance variation in a category.
• So we utilize huge image datasets such as Flickr®
and Google™ Images.
• These images are noisy and incompletely
labelled.
• Therefore, weakly supervised learning is
utilized which can handle corrupt and noisy
training data.
Challenges
• Intra-category appearance variation
• Pose
• Clutter
• Scale
• Occlusion
• Illumination
• Articulation
• Camouflage
• Background
Work done
Visual Recognition System
SIFT feature descriptor
Occurrence frequency of visual words is characteristic of the object
Object model : bag-of-visual words
Creating a visual codebook
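As a rough illustration of this step, here is a minimal sketch of codebook construction, assuming OpenCV’s SIFT implementation and scikit-learn’s KMeans; the vocabulary size and helper names are illustrative, not taken from the slides.

```python
# Sketch: build a visual codebook by clustering SIFT descriptors.
# Assumes opencv-python (cv2.SIFT_create) and scikit-learn are available.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(image_paths):
    """Stack SIFT descriptors from all training images into one array."""
    sift = cv2.SIFT_create()
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc)
    return np.vstack(all_desc)

def build_codebook(image_paths, n_words=200):
    """Cluster descriptors; the cluster centres are the visual words."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    kmeans.fit(extract_sift(image_paths))
    return kmeans

def bow_histogram(image_path, kmeans):
    """Encode one image as a normalized histogram of visual-word frequencies."""
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()
```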
Object model : bag-of-visual words
A test image can be classified
based on the distance of its
normalized codebook (visual-word histogram)
from the codebooks of the positive and negative
training samples.
(Figure: codebook histograms for the positive samples, the negative samples, and a test image)
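As a hedged illustration of this nearest-codebook decision (the slides do not specify the distance measure; Euclidean distance between normalized histograms is assumed here):

```python
import numpy as np

def classify_by_codebook_distance(test_hist, pos_hist, neg_hist):
    """Label a test image by whichever class codebook histogram it is closer to.
    All histograms are assumed to be normalized to sum to 1."""
    d_pos = np.linalg.norm(test_hist - pos_hist)
    d_neg = np.linalg.norm(test_hist - neg_hist)
    return "positive" if d_pos < d_neg else "negative"
```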
Object model : bag-of-visual words
Visual codebooks for positive and negative samples of ‘car’ category in
PASCAL VOC 2006
Object model : bag-of-visual words
Visual codebooks for ‘car’ and ‘cow’ categories in PASCAL VOC 2009 dataset
Classification
ROC (receiver operating
characteristic) curves are used to evaluate
classification performance.
ROC for ‘car’ category in
PASCAL VOC 2006
The linear kernel K(x, y) = xᵀy was used since it is fast.
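A minimal sketch of this classification and evaluation step, assuming scikit-learn (the slides do not name the SVM implementation used); X_train/X_test hold bag-of-words histograms and y_train/y_test hold binary category labels:

```python
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Linear-kernel SVM on bag-of-words histograms, evaluated with an ROC curve."""
    clf = SVC(kernel="linear")              # K(x, y) = x^T y
    clf.fit(X_train, y_train)
    scores = clf.decision_function(X_test)  # continuous scores needed for the ROC
    fpr, tpr, _ = roc_curve(y_test, scores)
    return fpr, tpr, auc(fpr, tpr)
```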
Improve Classification
Larger Visual Codebook:
• More representative of category
• Higher computational cost
ROC of ‘car’ category in PASCAL VOC
2006 for codebook sizes from 20 to
20000 visual words.
Improve Classification
Training and test images in the
dataset scaled down by the same factor.
Training and test images scaled down by
different factors.
Improve Classification
Scale-down factor | Training samples: Dataset 1 | Training samples: Dataset 2
/1                | Y                           | N
/2                | Y                           | Y
(Y/N: whether the test image was classified correctly)
Improve Classification
ROC for 20 visual categories in
PASCAL VOC 2009
The PASCAL VOC 2009 dataset is
larger and more challenging than the
2006 dataset.
Improve Classification
ROC for PASCAL VOC 2009 with training
and test images scaled down
by a factor of 2
ROC for PASCAL VOC 2009 using a
universal visual vocabulary
Object localization using sliding window
The poor localization results are due to:
• Lack of structural information in the
bag-of-words object model
• The classifier learning the object’s background rather than the object itself (see the sketch below)
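A minimal sliding-window sketch under the same assumptions: each window is encoded as a bag-of-words histogram and scored by the trained classifier. The window size, stride, and helper names (encode, clf) are illustrative.

```python
def sliding_window_detect(image, encode, clf, window=(128, 128), stride=32,
                          threshold=0.0):
    """Score every window with the bag-of-words classifier and keep boxes above
    threshold. `encode` maps an image patch to a normalized BoW histogram;
    `clf` is a trained classifier exposing decision_function."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            patch = image[y:y + window[1], x:x + window[0]]
            score = clf.decision_function(encode(patch).reshape(1, -1))[0]
            if score > threshold:
                detections.append((x, y, window[0], window[1], score))
    return detections
```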
Visual codebook
(Diagram: training images with bounding-boxes give a good codebook with an equal
population of positive and negative visual words; training images without
bounding-boxes split into the cases where the positive-image background is
different from, or similar to, the negative images.)
With no bounding-box utilized, the codebook consists of a majority of negative visual words.
Visual codebook
(Same diagram: bounding-boxes vs. no bounding-boxes, and positive background
different from vs. similar to the negative images.)
Classification is then based on object context (background) rather than object features.
Improve Classification
At each iteration, detection estimates a bounding box, which provides a better
visual codebook, which in turn leads to better detection.
Object detection
• Key-point configurations as
features are a discriminative
object feature set.
• A configuration of visual words
adds structural information
to the bag-of-words model.
• Harvest frequent and discriminative configurations.
• Encode each configuration as a transaction vector.
• An association between a transaction vector and the
training class label is an association rule.
• The Apriori algorithm finds association rules with high
confidence in a support-confidence framework.
(Figure: a transaction vector encoding a key-point configuration)
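The slides do not spell out the exact encoding, but one plausible sketch is: each key-point’s transaction is the set of visual-word IDs among its k spatially nearest neighbouring key-points (the neighbourhood size k is an assumption).

```python
import numpy as np

def keypoint_transactions(keypoint_xy, keypoint_words, k=5):
    """For every key-point, build a transaction: the set of visual-word IDs
    found among its k nearest neighbouring key-points (by image position)."""
    transactions = []
    for i, p in enumerate(keypoint_xy):
        dists = np.linalg.norm(keypoint_xy - p, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        transactions.append(frozenset(int(keypoint_words[j]) for j in neighbours))
    return transactions
```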
Apriori algorithm
• Uses breadth-first search and a tree structure.
• Longer configurations will have lower support as
they are infrequent but higher confidence as they
are more discriminative.
• Downward closure lemma: prune configurations
with infrequent sub-sets.
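A compact sketch of the support-confidence idea and downward-closure pruning, operating on the transactions defined above (simplified; a full implementation would also generate the class association rules more efficiently):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Breadth-first Apriori: grow itemsets level by level, pruning any candidate
    that has an infrequent sub-set (downward closure lemma)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-sub-set of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [s for level in frequent for s in level]

def rule_confidence(itemset, pos_transactions, all_transactions):
    """Confidence of the rule: itemset -> positive class."""
    covered = sum(itemset <= t for t in all_transactions)
    covered_pos = sum(itemset <= t for t in pos_transactions)
    return covered_pos / covered if covered else 0.0
```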
Object localization
(Pipeline: training data set → generate transactions → Apriori data mining →
association rules; test image → generate transactions → confidence generated for
each transaction → transactions thresholded on confidence.)
• A confidence is assigned to every
key-point in the image.
• Key-points with sufficiently high
confidence are retained.
• Key-points which occur on
common background objects like
doors and windows can have high
confidence.
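A hedged sketch of this scoring step, reusing the hypothetical helpers above; taking the maximum confidence over the rules covering each transaction is an assumption, not stated on the slide.

```python
def localize_keypoints(keypoint_xy, transactions, rules, threshold=0.8):
    """rules: list of (itemset, confidence) pairs mined from the training set.
    A key-point is kept if some rule covering its transaction is confident enough."""
    kept = []
    for p, t in zip(keypoint_xy, transactions):
        conf = max((c for itemset, c in rules if itemset <= t), default=0.0)
        if conf >= threshold:
            kept.append((p, conf))
    return kept
```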
Object classification using Apriori
(Pipeline: training data set → generate transactions → Apriori data mining →
association rules; test images → generate transactions → confidence generated for
each transaction → confidences summed per image.)
ROC for ‘car’ category in PASCAL VOC 2006
The summed confidence score depends
upon object scale in the image, which
explains the comparatively poor
performance of this approach.
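An illustrative version of the summed-confidence image score (same hypothetical rule format as above). The scale dependence noted above is visible directly: more key-points simply means more terms in the sum.

```python
def image_confidence(transactions, rules):
    """Image-level score: sum, over all key-point transactions, of the best rule
    confidence covering each one. Larger apparent object scale gives more
    key-points and so inflates the sum."""
    return sum(max((c for itemset, c in rules if itemset <= t), default=0.0)
               for t in transactions)
```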
Conclusions
• The ‘bag-of-words’ model is good for classification, but poor for localization.
• Separate foreground-background for better visual codebooks.
• The good classification performance on the PASCAL VOC 2006 dataset is attributed to
recognition of object context rather than object features.
• The dataset utilized should have sufficient variation in appearance of the
object and its background.
• Larger visual vocabulary gives slightly better classification, but is
computationally more expensive.
• The visual vocabulary built has a majority of background visual words since
bounding-boxes are not utilized during training.
Conclusions
• Improving the proportion of visual words representing the object in the
vocabulary is vital for good classification.
• Incorporate the object boundary contour into the descriptor.
• Use of frequent and discriminative key-point configurations is a promising
approach for object localization.
• A low-quality dataset results in a weak visual codebook and classifiers biased
to the training data.
• Classification using key-point configurations was poor compared to ‘bag-of-
words’ for PASCAL VOC 2006.
Future Work
• Improve a visual codebook by increasing the proportion of visual words
pertaining to object features. Combine Apriori based localization and
clustering for visual word selection in an iterative approach.
• Model visual scene information (use the GIST descriptor by Torralba). Learn
co-occurrence statistics of scenes and visual categories. Recognition of the
scene serves as a prior for object presence and improves object recognition
performance.
• Improve object localization by using context priming.
• Model object contextual information to aid foreground-background
disambiguation for better object localization.
Future Work
• Share feature information between visual categories. The size of a
universal visual vocabulary should increase sub-linearly with increase in
number of visual categories.
• Combine image segmentation and classification to improve the object
model to provide better classification performance.
• Build a hierarchical framework for visual categorization:
• Representation: combine local and global features.
• Model: combine semantic and structural object models.
• Classification: combine generative and discriminative approaches.
Questions?
