Contextless Object Recognition 
with Shape-enriched SIFT and 
Bags of Features 
Marcel Tella Amo 
Directed by Dr. Matthias Zeppelzauer (TU Wien) 
Codirected by Dr. Xavier Giró-i-Nieto (UPC)
Motivation 
Object Recognition and Classification 
Categories 
• Ball 
• Airplane 
• Chair 
• Beaver 
• … 
Examples: Ball, Airplane, Chair
Shape information
Texture information
Index 
Requirements 
State of the Art 
Design 
Results
Requirements 
Requirements State of the Art Design Results 
Design shape features that can be used in an aggregated framework, such as Bag of Words, without the need for matching or alignment.
Take a successful method (SIFT) and add shape information.
Analyse the implications of the vocabulary size with respect to the size of the shape features.
[Figure labels: SIFT, Shape]
The proposed features should be at least scale-, rotation- and translation-invariant and, if possible, flip-invariant as well.
Need for segmentation to encode the shape
Study the limitations of shape coding when using a state-of-the-art segmentation.
Manual annotations vs. automatic segmentation
State of the Art 
Object Candidates algorithms 
Multiscale Combinatorial Grouping (MCG) 
Ranking by object plausibility, from high to low
Arbelaez, P., Pont-Tuset, J., Barron, J. T., Marques, F., Malik, J. (2014). Multiscale Combinatorial Grouping. CVPR.
Shape Context
G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. PAMI, 27(11), 2005.
Interest point descriptors: 
SIFT descriptor 
Simplified example 
Typically 4x4 divisions * 8 bins/hist = 128 features 
dense SIFT 
sparse SIFT 
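The arithmetic on this slide, and the difference between dense and sparse sampling, can be sketched in a few lines. This is only an illustration: the function names and the grid offset are assumptions, not part of the original work. Dense SIFT places a descriptor on a regular grid, whereas sparse SIFT only describes detected interest points.

```python
def sift_descriptor_length(grid_size=4, orientation_bins=8):
    """Number of values in a SIFT-style descriptor: one histogram per grid cell."""
    return grid_size * grid_size * orientation_bins


def dense_grid_keypoints(width, height, step):
    """Dense sampling: one keypoint every `step` pixels, regardless of image content."""
    offset = step // 2  # assumption: centre the grid inside the image
    return [(x, y)
            for y in range(offset, height, step)
            for x in range(offset, width, step)]


descriptor_length = sift_descriptor_length()   # 4 * 4 * 8
keypoints = dense_grid_keypoints(64, 48, 8)    # regular grid over a 64x48 image
print(descriptor_length, len(keypoints))
```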
David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004), no. 2, 91-110.
Enrichment of SIFT
Extra features: absolute spatial location (X, Y) or angle and distance
Rene Grzeszick, Leonard Rothacker, and Gernot A. Fink, "Bag-of-features representations using spatial visual vocabularies for object classification," in IEEE Intl. Conf. on Image Processing, Melbourne, Australia, 2013
Extra features: relative position + aspect ratio + scale ratio + color space
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg.
128-dimensional SIFT descriptor Extra features
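A minimal sketch of what "enriching" a descriptor could look like: the extra features are simply appended to the 128 SIFT values before clustering. The function name and the `weight` parameter are hypothetical; the slides do not say whether or how the extra features are scaled relative to SIFT.

```python
def enrich_descriptor(sift, extra_features, weight=1.0):
    """Append extra features to a 128-dimensional SIFT descriptor.

    `weight` is an assumed knob for balancing the extra features against SIFT.
    """
    if len(sift) != 128:
        raise ValueError("expected a 128-dimensional SIFT descriptor")
    return list(sift) + [weight * value for value in extra_features]


sift = [0.0] * 128
enriched = enrich_descriptor(sift, [0.5, 0.25], weight=2.0)
print(len(enriched))
```

With only a few extra values next to 128 SIFT values, the extra features may have little influence on the clustering unless they are weighted, which is one reason the relative sizes matter.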
Bag of Words 
Bags of Words - Pipeline 
Training: Get descriptors → Clustering (K-means) → Create histograms → Train model (SVM)
Testing: Image → Create histogram → Evaluate (SVM)
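The first three stages of the pipeline can be sketched with the standard library alone. This is a toy illustration under stated assumptions: descriptors are 2-D tuples instead of 128-D SIFT vectors, k-means is plain Lloyd's algorithm with random initialisation, histograms are normalised to sum to 1, and the SVM stage is left out (the histograms would be the input to whatever classifier is trained).

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centres (the visual vocabulary)."""
    rng = random.Random(seed)
    centres = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(d, centres)].append(d)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster becomes empty
                centres[i] = [sum(col) / len(group) for col in zip(*group)]
    return centres


def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual-word occurrences for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


# Toy data: descriptors pooled from all training images.
training = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
vocabulary = kmeans(training, k=2)
histogram = bow_histogram([(0.05, 0.05), (4.9, 5.0), (5.2, 5.1)], vocabulary)
print(histogram)
```

Note that no matching or alignment between images is needed: each image is reduced to a fixed-length histogram whose length equals the vocabulary size.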
Design 
Why dense SIFT? 
Main principle: Combination of dense SIFT and Object Candidates 
Distance to the nearest border (DNB) 
Logarithmic distance to the nearest border (LDNB) 
Less influence of big distances 
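A rough sketch of both features on a binary region mask. The slides do not give the exact definitions, so the following are assumptions: the border is the set of region pixels with a 4-neighbour outside the region, DNB is the Euclidean distance to the closest border pixel, and LDNB is log(1 + DNB).

```python
import math


def border_pixels(mask):
    """Region pixels that touch the outside (4-neighbourhood) or the image edge."""
    height, width = len(mask), len(mask[0])
    border = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    border.append((x, y))
                    break
    return border


def dnb(point, border):
    """Distance to the nearest border pixel (Euclidean)."""
    px, py = point
    return min(math.hypot(px - bx, py - by) for bx, by in border)


def ldnb(point, border):
    """Logarithmic variant: log(1 + d) compresses large distances."""
    return math.log1p(dnb(point, border))


# 7x7 image containing a 5x5 square region.
mask = [[1 if 1 <= x <= 5 and 1 <= y <= 5 else 0 for x in range(7)] for y in range(7)]
border = border_pixels(mask)
print(dnb((3, 3), border), ldnb((3, 3), border))
```

The raw distance is measured in pixels, so a scale-invariant version would presumably need some normalisation by region size; that step is not shown here.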
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order 
pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg.
Distance and Angle to the nearest border (DANB) 
Problem: cases that are very similar in 2D can produce very different values.
Solution: encode them as two separate features.
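One plausible reading of this problem is angle wrap-around: directions of 1° and 359° are almost identical in 2D, yet their raw angle values are far apart. The sketch below assumes that reading and assumes the two features are the cosine and sine of the angle; the slides do not confirm either choice.

```python
import math


def encode_angle(theta):
    """Represent an angle by two features that vary smoothly across 0 / 2*pi."""
    return math.cos(theta), math.sin(theta)


a = math.radians(1)
b = math.radians(359)

raw_gap = abs(a - b)  # raw values are almost a full turn apart

(ca, sa), (cb, sb) = encode_angle(a), encode_angle(b)
encoded_gap = math.hypot(ca - cb, sa - sb)  # the encoded pairs are close together

print(raw_gap, encoded_gap)
```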
Rotation Invariant Angle to the nearest border 
Distance to the center (DC) 
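A small sketch of this feature, assuming that "center" means the centroid of the region mask and that the distance is Euclidean. Both are assumptions, since the slide shows only a figure.

```python
import math


def region_centroid(mask):
    """Mean (x, y) position of all pixels that belong to the region."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    return sum_x / count, sum_y / count


def distance_to_center(point, mask):
    """Euclidean distance from a keypoint to the region centroid."""
    cx, cy = region_centroid(mask)
    return math.hypot(point[0] - cx, point[1] - cy)


# 7x7 image containing a 5x5 square region centred at (3, 3).
mask = [[1 if 1 <= x <= 5 and 1 <= y <= 5 else 0 for x in range(7)] for y in range(7)]
print(region_centroid(mask), distance_to_center((5, 3), mask))
```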
η - Angular Scan (ηAS) 
WINNER! 
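The slide itself shows only a figure, so the following is a sketch of one way to read the feature, based on two hints elsewhere in the deck: the 8-Angular Scan yields "8 distances (different angles)", and ηAS does not cross the region contour. Under that reading, η rays are cast from the keypoint at equally spaced angles and each ray's length up to the contour is recorded. The step size, the rounding to pixel coordinates and the starting angle are all assumptions.

```python
import math


def angular_scan(mask, point, n_angles=8, step=1.0):
    """Length of n_angles equally spaced rays from `point` to the region contour."""
    height, width = len(mask), len(mask[0])
    px, py = point
    distances = []
    for i in range(n_angles):
        theta = 2.0 * math.pi * i / n_angles
        dx, dy = math.cos(theta), math.sin(theta)
        travelled = 0.0
        while True:
            candidate = travelled + step
            x = int(round(px + candidate * dx))
            y = int(round(py + candidate * dy))
            inside = 0 <= x < width and 0 <= y < height and mask[y][x]
            if not inside:
                break  # the ray stops at the contour and never crosses it
            travelled = candidate
        distances.append(travelled)
    return distances


# 7x7 image containing a 5x5 square region; scan from its centre.
mask = [[1 if 1 <= x <= 5 and 1 <= y <= 5 else 0 for x in range(7)] for y in range(7)]
four_rays = angular_scan(mask, (3, 3), n_angles=4)
eight_rays = angular_scan(mask, (3, 3), n_angles=8)
print(four_rays, eight_rays)
```

As written, the scan is not rotation-invariant, because the first ray always points along the x axis; making it so would require choosing the starting angle from the data, which is not shown here.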
Shape Context from a dense SIFT (DSC) 
Note: It crosses the contour of the region like Shape Context. 
ηAS does not! 
Rotation Invariant Region Quantization (RIRQ) 
Main idea: Get spatial information. 
Easily extensible to a pyramid! 
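Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 2, pp. 2169-2178). IEEE.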
Achieving flip invariance (RIRQ) 
[Figure: the four regions, numbered 1 to 4, are shown for an image and for its flipped version. Sorting the region labels (for example, 4 2 and 2 4 both become 2 4) gives the same code in both cases.]
Where do we integrate our features? 
Two main Architectures 
Enriched SIFT (eSIFT): SIFT + shape features → visual vocabulary → bag of eSIFT visual words
BoW+Shape: SIFT → visual vocabulary → bag of words, concatenated with a shape histogram
BoW+Shape: creation of the shape histograms
SIFT → visual vocabulary → bag of words; accumulation of features → shape histogram
1. Accumulate the same feature for all points.
2. Create a histogram of X bins for that feature.
3. Concatenate the histograms to create the final one.
Example: 8-Angular Scan gives 8 distances (different angles) for each SIFT keypoint.
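The three steps can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated on the slide: feature values are taken to lie in [0, max_value], and each per-feature histogram is normalised to sum to 1.

```python
def feature_histogram(values, n_bins, max_value):
    """Step 2: histogram of one shape feature over all keypoints."""
    counts = [0] * n_bins
    for v in values:
        index = min(int(v / max_value * n_bins), n_bins - 1)  # clamp v == max_value
        counts[index] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def shape_histogram(features_per_keypoint, n_bins, max_value):
    """Steps 1 to 3: gather each feature over all keypoints, bin it, concatenate."""
    histogram = []
    for values in zip(*features_per_keypoint):  # step 1: same feature, all points
        histogram += feature_histogram(values, n_bins, max_value)  # steps 2 and 3
    return histogram


def bow_plus_shape(bow_histogram, shape_hist):
    """BoW+Shape: the shape histogram is appended after quantization."""
    return list(bow_histogram) + list(shape_hist)


# Three keypoints, each with 8 angular-scan distances.
features = [[1.0] * 8, [3.0] * 8, [7.9] * 8]
shape_hist = shape_histogram(features, n_bins=8, max_value=8.0)
final = bow_plus_shape([0.5, 0.5], shape_hist)
print(len(shape_hist), len(final))
```

With 8 features and 8 bins each, the shape histogram has 64 values regardless of how many keypoints the image contains.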
Results and conclusions 
The dataset: Caltech-101 
• Well-recognized dataset
• 101 different categories of images
• Ground-truth annotations available
• From 40 to 800 images per category
Metrics: Accuracy (%) 
Accuracy (%) = 100 × Correct classifications / (Correct + Incorrect classifications)
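As a small worked example of the metric, the sketch below counts matching labels and divides by the total number of classified images. The function name and the toy labels are only illustrative.

```python
def accuracy_percent(predicted, actual):
    """Percentage of images whose predicted category equals the true category."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


predicted = ["ball", "chair", "beaver", "ball"]
actual = ["ball", "chair", "airplane", "ball"]
score = accuracy_percent(predicted, actual)  # 3 of 4 labels match
print(score)
```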
Experiments setup 
• 30 images per category for training and 30-50 for testing.
• 101 categories + background category.
• Different vocabulary sizes on the X axis.
• Accuracy (%) on the Y axis.
• Experiments and analysis:
  • eSIFT
  • BoW+S
  • eSIFT vs BoW+S
  • Performance achieved
  • Comparison between adding features before or after quantization
  • Number of bins per histogram
  • Ground truth vs MCG Object Candidates
  • Context vs Shape
Results enriched SIFT 
Results BoW+S 
Performance achieved 
Conclusion
With Angular Scan, accuracy increases from 16% to around 41%.
Comparison between adding features before and after quantization
Conclusion
With Angular Scan, if the number of shape features is high, both architectures tend to converge.
Number of bins per histogram 
Conclusion
With Angular Scan, 8 bins per histogram gives the best performance.
Ground truth vs MCG Object Candidates 
Conclusions
1. Larger vocabularies make the approach more robust to segmentation errors.
2. Shape-based methods are more sensitive to segmentation errors than texture-based ones.
Context gain vs Shape gain 
[Figure labels: Object, Context]
Conclusion
Encoding the shape gives better performance than encoding the context of the image.
Future Work
Comparison between our work and Second Order Pooling
PhD thesis of Carles Ventura 
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order 
pooling. In Computer Vision-ECCV 2012 (pp. 430-443). Springer Berlin Heidelberg. 
Distance to the nearest border (DNB) 
Conclusions 
1. With Angular Scan, accuracy increases from 16% to around 41%.
2. With Angular Scan, if the number of shape features is high, both architectures tend to converge.
3. With Angular Scan, 8 bins per histogram gives the best performance.
4. Larger vocabularies make the approach more robust to segmentation errors.
5. Shape-based methods are more sensitive to segmentation errors than texture-based ones.
6. Encoding the shape gives better performance than encoding the context of the image.
Thank you!
Questions?
