Dog Breed Classification Using Part Localization

Jiongxin Liu 1, Angjoo Kanazawa 2, David Jacobs 2, and Peter Belhumeur 1
1 Columbia University   2 University of Maryland
Fine-grained classification
[Branson et al ’10]   [Nilsback and Zisserman ’08]   [Parkhi et al ’12]   [Kumar et al ’12]
Related work
• Dense feature extraction:
  – Mine discriminative regions with random forests [Yao et al ’11]
  – Multiple Kernel Learning [Nilsback and Zisserman ’08]
  – Post-segmentation [Parkhi and Zisserman ’12]
• Pose-normalized appearance:
  – Birdlets [Farrell et al ’11]
Related work
• Dense feature extraction:
  – Mine discriminative regions with random forests [Yao et al ’11]
  – Multiple Kernel Learning [Nilsback and Zisserman ’08]
  – Post-segmentation [Parkhi and Zisserman ’12]
• Pose-normalized appearance:
  – Birdlets [Farrell et al ’11]

Generic sampling of features contains more noise than useful information for fine-grained classification!
Same breed or not?              NO!!
Entlebucher Mountain Dog   Greater Swiss Mountain Dog
Key insight: Differences in common parts are
              more informative
  Entlebucher Mountain Dog                 Greater Swiss Mountain Dog




Localize parts based on a non-parametric method by [Belhumeur et al ‘11]
“Columbia dogs with parts” dataset
       133 breeds, 8351 images
Low inter-breed variation
   Norfolk Terrier or Cairn Terrier?
High intra-breed variation
      Both are Labrador Retrievers
Innumerable Poses
Diverse Appearances
Varying geometry of parts
Overview of the system
1. Face Detection
2. Part Detection and ear localization
3. Feature Extraction
4. One vs All classification
Pipeline 1: Dog Face Detection

Keep the 5 highest scoring windows
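A minimal sketch of this sliding-window scan; `score_window` is a hypothetical stand-in for the paper's actual window scorer (eight concatenated SIFT descriptors fed to an RBF-SVM regressor), so the code below only illustrates the scan-and-keep-top-k structure:

```python
import numpy as np

def detect_faces(image, score_window, window=64, stride=16, keep=5):
    """Slide a square window across the image, score every position with
    `score_window`, and keep the `keep` highest-scoring windows."""
    h, w = image.shape[:2]
    candidates = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            s = score_window(image[y:y + window, x:x + window])
            candidates.append((s, (x, y, window)))
    # rank windows by score, best first
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:keep]
```

Each surviving window then generates its own part-location hypotheses in the next stage; only the best-fitting one is kept.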
Pipeline 2: Localize Parts
            Part locations    Detector responses




Idea: From the “fit” to the K most similar exemplars, weighted by the detector output, take the most probable part location
Review: Consensus of Exemplars




Local Part Detectors → Exemplar Selection → Part Localization
                                             Slide from Neeraj Kumar
RANSAC-like Exemplar Selection
1. Repeat r times:
   a. Choose random exemplar k
   b. Choose 2 random modes of the local detector outputs D = {d_i} on the query
   c. Find the similarity transform t that aligns the exemplar to these points
   d. Evaluate the match of all n face parts for this (k,t) pair:

      P(X_{k,t} | D) = C ∏_{i=1}^{n} P(x^i_{k,t} | d^i)

      (the probability of this configuration given the detector outputs is the product, over parts, of the part detector probability at each aligned location)

   e. Add the (k,t) pair to a list of possible exemplars, ranked by score

2. Take top M (k,t) pairs for determining global configuration
                                                                          Slide from Neeraj Kumar
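A minimal numpy sketch of the RANSAC-like loop above. The detector probability maps, mode lists, and exemplar part arrays are illustrative stand-ins for the system's trained detectors, and scores here are raw products (the constant C is dropped since it does not affect the ranking):

```python
import numpy as np

def similarity_from_two_points(src, dst):
    """Similarity transform (R, t) mapping the two src points onto dst."""
    v_s, v_d = src[1] - src[0], dst[1] - dst[0]
    s = np.linalg.norm(v_d) / np.linalg.norm(v_s)
    a = np.arctan2(v_d[1], v_d[0]) - np.arctan2(v_s[1], v_s[0])
    R = s * np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return R, dst[0] - R @ src[0]

def select_exemplars(exemplars, modes, prob_maps, r=500, M=10, seed=0):
    """Repeat r times: pick a random exemplar k and two random detector
    modes, fit the aligning similarity transform t, and score the (k, t)
    pair by the product of detector probabilities at all aligned parts.
    Returns the top-M (score, k, (R, t)) triples."""
    rng = np.random.default_rng(seed)
    scored = []
    for _ in range(r):
        k = int(rng.integers(len(exemplars)))
        ex = exemplars[k]                       # (n_parts, 2) part locations
        i, j = rng.choice(len(ex), size=2, replace=False)
        dst = np.array([modes[i][rng.integers(len(modes[i]))],
                        modes[j][rng.integers(len(modes[j]))]], dtype=float)
        R, t = similarity_from_two_points(ex[[i, j]], dst)
        warped = ex @ R.T + t                   # all parts under this (k, t)
        score = 1.0
        for p, (x, y) in enumerate(warped):
            xi = int(np.clip(round(x), 0, prob_maps[p].shape[1] - 1))
            yi = int(np.clip(round(y), 0, prob_maps[p].shape[0] - 1))
            score *= prob_maps[p][yi, xi]       # detector prob at aligned loc
        scored.append((score, k, (R, t)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:M]
```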
Final Part Localization
For each face part i:
   a. Compute distribution of this part from all M aligned exemplars
   b. For each of the top M aligned exemplars [(k,t) pairs]:
      Multiply normalized local detector outputs with global distribution of part computed from
      exemplars to get scores at each pixel location
   c. Add all scores together to get final scores at each pixel and choose max




                                                                           Slide from Neeraj Kumar
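A sketch of steps b and c above for a single part, under the assumption (from the speaker notes) that the per-exemplar spatial term is a 2D Gaussian centered at each aligned exemplar's prediction; `detector_map` stands in for the normalized local detector output:

```python
import numpy as np

def localize_part(aligned_locs, detector_map, sigma=3.0):
    """aligned_locs: (M, 2) array of this part's (x, y) prediction under
    each of the M aligned exemplars. For each exemplar, multiply the
    normalized detector map by a Gaussian prior centered at that
    prediction; sum the M score maps and return the argmax pixel."""
    h, w = detector_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    det = detector_map / detector_map.sum()   # normalize detector output
    total = np.zeros_like(det)
    for x, y in aligned_locs:
        prior = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        total += det * prior                  # one score map per exemplar
    yi, xi = np.unravel_index(np.argmax(total), total.shape)
    return xi, yi
```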
Pipeline 2: Localize Parts
                          Part locations           Detector responses




                              Difference between current part
                              location and that of exemplar




From K most similar exemplars and the detector output,
        take the most probable part location
Pipeline 3: Infer ears using detected parts




     With r(=10) exemplars from each breed
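A toy sketch of the per-breed ear inference, under the simplifying assumption (ours, for illustration) that each breed exemplar is aligned to the query via the two detected eyes and ranked by its nose residual; the actual system scores fits probabilistically, as in the part-localization stage:

```python
import numpy as np

def similarity_from_two_points(src, dst):
    """Similarity transform (R, t) mapping the two src points onto dst."""
    v_s, v_d = src[1] - src[0], dst[1] - dst[0]
    s = np.linalg.norm(v_d) / np.linalg.norm(v_s)
    a = np.arctan2(v_d[1], v_d[0]) - np.arctan2(v_s[1], v_s[0])
    R = s * np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return R, dst[0] - R @ src[0]

def infer_ears_for_breed(query_parts, exemplars, r_keep=10):
    """query_parts: (3, 2) detected (left eye, right eye, nose).
    exemplars: list of dicts with 'parts' (3, 2) and 'ears' (4, 2)
    annotations for one breed. Align each exemplar to the query via the
    eyes, rank by nose residual, and average the warped ear annotations
    of the r_keep best fits -> this breed's ear hypothesis."""
    fits = []
    for ex in exemplars:
        R, t = similarity_from_two_points(ex['parts'][:2], query_parts[:2])
        resid = np.linalg.norm(R @ ex['parts'][2] + t - query_parts[2])
        fits.append((resid, ex['ears'] @ R.T + t))
    fits.sort(key=lambda f: f[0])
    return np.mean([ears for _, ears in fits[:r_keep]], axis=0)
```

Running this once per breed yields one ear hypothesis per breed (133 in total), reflecting how breed-dependent ear geometry is.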
Pipeline 4: Classification




Extract SIFT at part locations for each breed + color histogram → one-vs-all linear SVM classifier
Qualitative Results: Successful
Qualitative Results: Failures
Results: ROC curves
Available in iTunes now
Take a Picture

By tapping the nose
Get the breed!
Browse Dog Breeds
Thank you!!


Editor's Notes

  1. This is a joint work with Jiongxin Liu, Peter Belhumeur, and David Jacobs.
  2. Fine-grained classification deals with categories in which instances from different classes share common parts but have wide variation in shape and appearance; examples are identifying species. These problems lie between the two extremes of identifying individuals (such as face identification) and basic-level categories (such as Caltech-256). Motivation: a vision system that can do things humans aren't very good at, with applications in education (for example, Leafsnap) and in the broader domain of automatic species identification, which is extremely useful for biodiversity studies and general education. It is a very challenging problem to solve, and success in the dog domain will certainly lead to further success in that broader domain. We chose dogs as our test domain. (Highlight dogs.)
  3. Birdlets builds on poselets, finding 3D volumetric primitives and describing classes based on their variations. Our work is complementary: Birdlets focuses on large, articulated parts, while we utilize parts describable at point locations. We also use a hierarchical approach in which the face and the more rigid parts of the face are found first and then used to find class-specific parts such as ears. Built on top of recent methods for visual object recognition, related work addresses fine-grained categorization mainly by mining discriminative features via randomized sampling, with a multiple kernel learning framework, or by extracting dense features over a segmented image. Most relevant to our approach is the work by Farrell et al., which uses the poselet framework to localize the head and body of birds, enabling part-based feature extraction.
  4. Dense feature extraction is often very powerful for object recognition and general visual classification tasks. However, this is not the case for fine-grained categorization: since the categories are so visually similar, many regions contain more noise than useful information, and generic sampling can miss the fine details needed for correct classification. In this work, we argue and demonstrate that fine-grained classification can be significantly improved if features are localized at corresponding object parts. There is a vast literature on face detection and on localizing parts of human faces; we localize parts of the dog face building on the consensus-of-exemplars approach by Belhumeur et al., originally a non-parametric face-parts detector.
  5. Here is an example that demonstrates this insight.
  6. Subordinate categories such as dogs or leaves all share semantic parts (legs for chairs, stems for leaves, ears for cats and dogs), and the differences in those parts are more informative than generic sampling of features. These two dogs are of different breeds; the texture of their fur and their color distributions are strikingly similar. But in general, Entlebucher Mountain Dogs have a shorter snout, rounder nostrils, and more pendant, v-shaped, flatter ears, while Greater Swiss Mountain Dogs have a longer snout, nostrils that cut to the side with a visible septum (a line between the nostrils), and folded ears that hang on the side of the head. In this work, we argue and demonstrate that fine-grained classification can be significantly improved if features are localized at corresponding object parts. We localize dog face parts building on the consensus-of-exemplars approach by Belhumeur et al., a non-parametric face-parts detector, and extend their method, previously applied only to part detection, to perform object classification.
  7. All the dogs face the camera; the dog images shown are from the dataset. We chose dog breed identification as a test case to demonstrate our method. Dogs are an excellent domain for fine-grained categorization: after humans, dogs are possibly the most photographed species (perhaps after cats) on the internet. Determining dog breeds is a very challenging task, sharing many of the challenges seen in fine-grained classification, and success in this domain will certainly lead to further success in the broader domain of automatic species identification, which is extremely useful for biodiversity studies and general education. Since we focus on localizing dog parts, we have annotated 8 parts on every dog in our dataset: the 2 eyes, the nose, the ear tips, the ear bases, and the top of the head. Because we only look at these parts, all the dogs in our dataset face the camera, but with varying pose, scale, and rotation, so detecting face parts is far from a trivial task. The first challenge is the number of classes: in this work we deal with 133 breeds of dogs. (As a side note, all the pictures on these slides are images of dogs from our dataset.)
  8. Many subsets of dog breeds are quite similar in appearance.
  9. On top of that, there is also great variation within breeds. These two factors make identification of breeds very challenging, especially for humans without expert knowledge. (Try to go back to slide 7 and point to the Lakeland Terrier.)
  10. They come in innumerable poses, considerably more varied than human faces.
  11. Dogs are very diverse in their visual appearance.
  12. The geometry of their faces is also very deformable, again far more so than human faces, especially their ear tips: breeds like Beagles have hanging ears, whereas breeds like Akitas have pointy upright ears. (Also note how the nose has more degrees of freedom than the human nose, as in this picture where the eyes and nose are almost collinear, because dog faces are less flat than ours.) These factors make localization of parts very challenging.
  13. Here is the overview of our pipeline: first we detect the dog face, then localize three parts and extract features at those locations to find the most similar exemplars, which are used to detect the rest of the face parts. Then, using all the parts, we do breed classification; here is a sample result. A green border indicates the correct breed.
  14. We use a sliding-window RBF-SVM regressor to detect dog faces. Each window has eight SIFT descriptors, indicated by these boxes, concatenated into a 1024-dimensional feature vector. We experimented with a cascaded AdaBoost detector with Haar-like features, which works very well for human faces, but perhaps due to the extreme variability in the geometry and appearance of dog faces, it produced far too many false detections; for details please refer to the paper. We keep the 10 highest-scoring face detection windows, generate hypotheses of part locations for each of them, and keep the face window with the highest score in the next step.
  15. We want the part location that maximizes the probability of that location given the detector responses: we impose geometric constraints on the detector outputs by combining low-level detectors with labeled exemplars. Exemplars help create conditional independence between different parts, since we assume each part is generated by one of the exemplars, so we can rewrite the objective by including the exemplars in the calculation of (1) and marginalizing them out. Intuitively, the K exemplars most similar to the locations of the modes of the detector output are selected and transformed to fit the current query image. The P(delta) term is modeled as a 2D Gaussian, and the difference between the current part and the exemplar measures how well the model fits the location p_i. We pick the part location that best fits all K models, weighted by the confidence of the detector output. To localize face parts, we first train sliding-window linear-SVM detectors for each dog part using a single SIFT feature. If we denote by C the detector responses for parts in image I, and by p^I the ground-truth locations of the parts in the image, our goal is to compute (1). Using exemplars (labeled training samples) we can write the above for each i-th part as (2), where t stands for the similarity transformation of model k. The K models are selected by a RANSAC-like procedure (K = 100).
  16. This is a different approach to part detection compared to DPM, but both perform essentially the same MAP estimation: DPM enforces geometric constraints between parts by parameterizing the deformation between connected parts, while consensus of exemplars enforces geometric constraints non-parametrically (although not latently, and part labels are necessary).
  18. Similarly, we infer the ears by an extension of the consensus-of-exemplars approach; the equations are demonstrated by the animation here. Assuming the three parts detected in the previous stage are accurate, from each breed we find the R closest exemplars, apply a similarity transform, and find the most probable part locations.
  19. Again, we do this for each breed. We take this hierarchical approach to detecting ears because the geometry of ears is very breed-dependent, so in the end we have 133 hypothesized ear locations (R = 10).
  20. The feature vector is only 1440-dimensional (11 parts + k-means). Finally, for each of the 133 part hypotheses, we extract SIFT features at the part locations, concatenate them with a color histogram of the face window, and send the result to a linear one-vs-all SVM. One may wonder whether we are missing information from body features or fur, which is discriminative for dogs like Dachshunds, but it is much harder to accurately localize body parts because of their deformability and occlusion; and if two dogs are easily discriminated by their fur, those breeds have low similarity in appearance and are easier to classify. The real problem arises when features such as fur color and texture are very similar and not discriminative enough, and in those cases looking at the rest of the dog's parts is not so useful. One of our contributions is that we get a very good result just by considering the face.
  21. Note the similarity between the query and the incorrect first guesses.
  22. Look at the magenta curve: our first guess achieves 67% accuracy, and within the first 10 guesses we achieve 93% accuracy. The green curve is a bag-of-words approach on dense SIFT features extracted within the face detection window, the baseline method for object recognition. The cyan and blue curves are state-of-the-art approaches used earlier for fine-grained categorization: the cyan curve uses LLC (locality-constrained linear coding) to encode the dictionary for BoW, and the blue one uses the MKL framework, which is extremely inefficient. The second ROC plot gives quantitative justification of our steps: the pink curve is our proposed method; the red curve uses only the highest-scoring face detection window (note that we keep 10); the green curve shows that without part localization (features extracted on a grid within the face detection window) the accuracy is much lower.
  23. Speaking of efficiency, our system runs in real time: we have a working iPhone application available in iTunes now.
  24. Bare layout for ECCV 2012 video preparation. You may submit the .pptx file, or use "File → Save and Send → Create a Video". Remember: author names and title will be added above the video by us.