1. Dog Breed Classification Using Part Localization
Jiongxin Liu1, Angjoo Kanazawa2, David Jacobs2, and Peter Belhumeur1
1 Columbia University, 2 University of Maryland
2. Fine-grained classification
[Branson et al '10]  [Nilsback and Zisserman '08]  [Parkhi et al '12]  [Kumar et al '12]
3. Related work
• Dense feature extraction:
– Mine discriminative regions with random forests [Yao et al '11]
– Multiple Kernel Learning [Nilsback and Zisserman ’08]
– Post-segmentation [Parkhi and Zisserman ’12]
• Pose-normalized appearance:
– Birdlets [Farrell et al ’11]
4. Related work
• Dense feature extraction:
– Mine discriminative regions with random forests [Yao et al '11]
– Multiple Kernel Learning [Nilsback and Zisserman '08]
– Post-segmentation [Parkhi and Zisserman '12]
Generic sampling of features contains more noise than useful information for fine-grained classification!
• Pose-normalized appearance:
– Birdlets [Farrell et al '11]
5. Same breed or not? NO!!
Entlebucher Mountain Dog Greater Swiss Mountain Dog
6. Key insight: Differences in common parts are
more informative
Entlebucher Mountain Dog Greater Swiss Mountain Dog
Localize parts based on a non-parametric method by [Belhumeur et al '11]
13. Overview of the system
1. Face Detection   2. Part Detection and ear localization   3. Feature Extraction   4. One vs All classification
14. Pipeline 1: Dog Face Detection
Keep the 5 highest scoring windows
15. Pipeline 2: Localize Parts
Part locations    Detector responses
Idea: From the "fit" to the K most similar exemplars, weighted by the detector output, take the most probable part location
16. Review: Consensus of Exemplars
...
Local Part Detectors Exemplar Selection Part Localization
Slide from Neeraj Kumar
17. RANSAC-like Exemplar Selection
1. Repeat r times:
a. Choose random exemplar k
b. Choose 2 random modes of local detector outputs D={d i} on query
c. Find similarity transform t that aligns exemplar to these points
d. Evaluate match of all n face parts for this (k,t) pair:

   P(X^{k,t} | D) = C ∏_{i=1}^{n} P(x_i^{k,t} | d_i)

   (the probability of this configuration given the detector outputs D; each factor is the part detector probability at the aligned location)
e. Add (k,t) pair to list of possible exemplars, ranked by score
2. Take top M (k,t) pairs for determining global configuration
Slide from Neeraj Kumar
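The selection loop above can be sketched in code. This is a minimal sketch with synthetic data: the exemplar format, the least-squares similarity fit, and the per-part probability lookup are my assumptions, not the authors' implementation.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform (scale, rotation, translation)
    mapping points src onto dst, via the complex map z -> a*z + b."""
    s = src[:, 0] + 1j * src[:, 1]
    d = dst[:, 0] + 1j * dst[:, 1]
    A = np.stack([s, np.ones_like(s)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return a, b

def apply_sim(a, b, pts):
    z = pts[:, 0] + 1j * pts[:, 1]
    w = a * z + b
    return np.stack([w.real, w.imag], axis=1)

def select_exemplars(exemplars, detector_prob, modes, r=200, top_m=5, rng=None):
    """RANSAC-like selection: sample an exemplar k and two detector modes,
    align the exemplar to those modes, and score the whole configuration
    by the product of part-detector probabilities at the aligned locations."""
    rng = np.random.default_rng(rng)
    h, w_, n_parts = detector_prob.shape  # one probability map per part
    scored = []
    for _ in range(r):
        k = rng.integers(len(exemplars))
        i, j = rng.choice(n_parts, size=2, replace=False)
        a, b = fit_similarity(exemplars[k][[i, j]], modes[[i, j]])
        aligned = apply_sim(a, b, exemplars[k])
        xy = np.clip(np.round(aligned).astype(int), 0, [w_ - 1, h - 1])
        score = np.prod([detector_prob[y, x, p] for p, (x, y) in enumerate(xy)])
        scored.append((score, k, (a, b)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_m]  # top M (k,t) pairs with their scores
```

The constant C from the slide is omitted since it does not affect the ranking of (k,t) pairs.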
18. Final Part Localization
For each face part i:
a. Compute distribution of this part from all M aligned exemplars
b. For each of the top M aligned exemplars [(k,t) pairs]:
Multiply normalized local detector outputs with the global distribution of the part computed from exemplars to get scores at each pixel location
c. Add all scores together to get final scores at each pixel and choose max
Slide from Neeraj Kumar
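The scoring scheme in steps a–c can be sketched for a single part as follows. The Gaussian form of the per-exemplar distribution and the normalization are assumptions made for this sketch.

```python
import numpy as np

def localize_part(detector_map, aligned_part_locs, sigma=2.0):
    """Combine a part's normalized detector map with the distribution of
    that part over the M aligned exemplars, then pick the argmax.

    detector_map:      (H, W) local detector responses for one part.
    aligned_part_locs: (M, 2) array of (x, y) predictions of this part
                       from the M aligned exemplars.
    """
    h, w = detector_map.shape
    norm = detector_map / (detector_map.sum() + 1e-12)
    ys, xs = np.mgrid[0:h, 0:w]
    total = np.zeros((h, w))
    # For each aligned exemplar, form a Gaussian around its predicted
    # location, multiply by the detector map, and accumulate the scores.
    for x0, y0 in aligned_part_locs:
        g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
        total += norm * g
    y, x = np.unravel_index(np.argmax(total), total.shape)
    return x, y
```

With two equally strong detector peaks, the exemplar predictions break the tie toward the geometrically consistent one.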
19. Pipeline 2: Localize Parts
Part locations    Detector responses
Difference between the current part location and that of the exemplar
From the K most similar exemplars and the detector output, take the most probable part location
20. Pipeline 3: Infer ears using detected parts
With r(=10) exemplars from each breed
21. Pipeline 3: Infer ears using detected parts
With r(=10) exemplars from each breed
This is joint work with Jiongxin Liu, Peter Belhumeur, and David Jacobs.
Fine-grained classification deals with problems in which instances from different classes share common parts but have wide variation in shape and appearance. Examples are identifying species of … These problems lie between the two extremes of identifying individuals, such as face identification, and basic-level categories, such as Caltech-256.

Motivation: a vision system that can do things that humans aren't very good at, with applications for education (for example, Leafsnap) in the domain of automatic species identification, which is extremely useful for biodiversity studies and general education. It is a very challenging problem to solve. We chose dogs as our test domain, and success in the dog domain will certainly lead to further success in this broader domain. (Highlight dogs)
Built on top of recent methods for visual object recognition, related work addresses the problem of fine-grained categorization mainly by mining discriminative features via randomized sampling, with a multiple kernel learning framework, or by extracting dense features over a segmented image. Most relevant to our approach is the work by Farrell et al., which uses the poselet framework to localize the head and body of birds, enabling part-based feature extraction. Birdlets builds on poselets: it finds 3D volumetric primitives and describes classes based on their variations. Our work is complementary to theirs in that Birdlets focuses on using large, articulated parts while we utilize parts describable at point locations. We also use a hierarchical approach in which the face and the more rigid parts of the face are found first and then used to find class-specific parts such as ears.
Dense feature extraction is often very powerful for object recognition and general visual classification tasks. However, this is not the case for fine-grained categorization: since categories are so visually similar, many regions contain more noise than useful information, and such generic sampling can miss fine details that are needed for correct classification. In this work, we argue and demonstrate that fine-grained classification can be significantly improved if the features are localized at corresponding object parts. There is a vast literature on face detection and on localizing parts of human faces. We localize parts of the dog face by building on the consensus of exemplars approach by Belhumeur et al., which is originally a non-parametric face parts detector.
Here is an example that demonstrates this insight. Subordinate categories such as dogs or leaves all share semantic parts (legs for chairs, stems for leaves, ears for cats and dogs), and the differences in those parts are more informative than generic sampling of features. These two dogs are of different breeds. The texture of their fur and the color distribution are strikingly similar. But in general, Entlebucher Mountain Dogs have a shorter snout, rounder nostrils, and more pendant, v-shaped, flatter ears, while Greater Swiss Mountain Dogs have a longer snout, nostrils that cut to the side with a visible septum (a line between the nostrils), and folded ears that hang on the side of the head. We extend the consensus of exemplars method, which has previously been applied only to part detection, to perform object classification.
We chose dog breed identification as a test case to demonstrate our method. Dogs are an excellent domain for fine-grained categorization: after humans, dogs are possibly the most photographed species (perhaps after cats) on the internet. Determining dog breeds is a very challenging task, sharing many of the challenges seen in fine-grained classification, and success in this domain will certainly lead to further success in the broader domain of automatic species identification, which is extremely useful for biodiversity studies and general education.

Since we focus on localizing dog parts, we have annotated 8 parts of all dogs in our dataset: the 2 eyes, the nose, the ear tips, the ear bases, and the top of the head. Because we only look at these parts, all of the dogs in our dataset face the camera (the dog images are from the dataset), but with varying poses, scale, and rotation, so detecting the face parts is far from a trivial task.

Now I will go over the challenges in recognizing breeds of dogs from a single picture. The first challenge, as you can see, is that there are many classes: in this work we deal with 133 breeds of dogs. (As a side note, all of the pictures you see on these slides are images of dogs from our dataset.)
Many subsets of dog breeds are quite similar in appearance.
On top of that, there is also great variation within breeds. These two factors make identification of breeds very challenging, especially for humans without expert knowledge. (Try to go back to slide 7 and point to the Lakeland Terrier.)
Dogs come in innumerable poses, considerably more than human faces. They are also very diverse in their visual appearance. The geometry of their face is very deformable, again far more than the deformation in human faces, especially the ear tips: breeds like Beagles have hanging ears, whereas breeds like Akitas have pointy upright ears. (Also note that the nose has greater degrees of freedom than the human nose, as in this picture where the eyes and nose are almost collinear, because dog faces are more 3D, i.e. less flat, than ours.) These factors make localization of parts very challenging.
Here is the overview of our pipeline: first we detect the dog face, then we localize three parts and extract features at those places to find the most similar exemplars, which detect the rest of the face parts. Then, using all the parts, we do breed classification, and here is a sample result: a green border indicates the correct breed.
We use a sliding-window RBF-SVM regressor to detect dog faces. Each window has eight SIFT descriptors, indicated by these boxes, concatenated into a 1024-dimensional feature vector. We experimented with a cascaded AdaBoost detector with Haar-like features, which works very well for human faces; perhaps due to the extreme variability in the geometry and appearance of dog faces, the cascaded AdaBoost detector produced far too many false detections. For details please refer to the paper. We keep the 10 highest-scoring face detection windows and generate hypotheses of part locations for each of them; we keep the face window with the highest score in the next step.
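The detection step described here can be sketched as a plain sliding-window scorer. The `score_fn` below is a stand-in for the SIFT + RBF-SVM regressor, and the stride and single window size are assumptions made for this sketch (the real system would scan multiple scales).

```python
import numpy as np

def sliding_window_detect(image, win, stride, score_fn, keep=5):
    """Score every win x win window of `image` with `score_fn` and keep
    the `keep` highest-scoring windows as face hypotheses.

    Returns a list of (score, x, y) tuples, best first."""
    h, w = image.shape[:2]
    hits = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            hits.append((score_fn(image[y:y + win, x:x + win]), x, y))
    hits.sort(key=lambda t: t[0], reverse=True)
    return hits[:keep]
```

In the full system each window would be described by eight SIFT descriptors concatenated into a 1024-d vector before scoring.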
We want the part locations that maximize the probability of those locations given the detector responses. To localize face parts, we first train sliding-window linear-SVM detectors for each dog part using a single SIFT feature. If we denote by C the detector responses for the parts in image I, and by p^I the ground-truth locations of the parts in the image, our goal is to compute (1).

We want to impose geometric constraints on the detector outputs by combining the low-level detectors with labeled exemplars. Exemplars help create conditional independence between different parts, since we assume that each part is generated by one of the exemplars, so we can rewrite the objective: we include the exemplars in the calculation of (1) and marginalize them out. Using exemplars (labeled training samples), we can write the above for each ith part as (2), where t stands for the similarity transformation of model k. The K models are selected by a RANSAC-like procedure (K = 100?).

Intuitively, the K exemplars most similar to the locations of the modes of the detector output are selected. They are then transformed to fit the current query image. The P(delta) term is modeled as a 2D Gaussian, and the difference between the current part and the exemplar gives how well the model fits the location p_i. We pick the part location that has the highest fit to all K models, weighted by the confidence of the detector output.
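The equations (1) and (2) referenced in this note are not shown in the extracted text; reconstructed from the surrounding description of the consensus-of-exemplars model, they are approximately:

```latex
% (1) MAP estimate of part locations given all detector responses C
\hat{p}^I = \arg\max_{p^I} \; P\!\left(p^I \mid C\right)

% (2) per-part marginalization over exemplars k and similarity transforms t,
% assuming each part is generated by one transformed exemplar;
% P(delta) is the 2D Gaussian on the offset from the transformed exemplar part
P(p_i \mid C) \;\propto\; \sum_{k} \int_{t} P\!\left(p_i \mid k, t\right) P\!\left(k, t \mid C\right) dt,
\qquad
P\!\left(p_i \mid k, t\right) = \mathcal{N}\!\left(p_i - t\!\left(x_i^{k}\right);\, 0,\, \Sigma\right)
```

This is a reconstruction for readability, not a quote of the slide's formulas.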
This is a different approach to part detection compared to DPM, but basically they both do the same MAP estimation. DPM enforces geometric constraints between parts by parameterizing the deformation between connected parts; consensus of exemplars enforces geometric constraints non-parametrically (although it is not latent, and part labels are necessary).
Similarly, we infer the ears by an extension of the consensus of exemplars approach; the equations are demonstrated by the animation here. Assuming that the three parts detected in the previous stage are accurate, from each breed we find the R closest exemplars, apply a similarity transform, and find the most probable part locations.
Again, we do this for each breed (R = 10). The reason we take this hierarchical approach to detect ears is that the geometry of ears is very breed-dependent. So in the end we have 133 hypothesized ear locations.
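The per-breed ear inference can be sketched as follows. The exemplar format, the least-squares similarity alignment, and averaging the transformed ear locations are assumptions for this sketch, standing in for the probabilistic selection described above.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform as the complex map z -> a*z + b."""
    s = src[:, 0] + 1j * src[:, 1]
    d = dst[:, 0] + 1j * dst[:, 1]
    A = np.stack([s, np.ones_like(s)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return a, b

def infer_ears(base_parts, breed_exemplars, r=10):
    """For one breed: pick the r exemplars whose eye/nose geometry best
    matches the detected base parts, align each to the query, and average
    the transformed ear locations as that breed's ear hypothesis.

    base_parts:      (3, 2) detected eyes + nose in the query image.
    breed_exemplars: list of dicts {'base': (3, 2), 'ears': (4, 2)}.
    """
    scored = []
    for ex in breed_exemplars:
        a, b = fit_similarity(ex['base'], base_parts)
        z = ex['base'][:, 0] + 1j * ex['base'][:, 1]
        fitted = a * z + b
        resid = np.abs(fitted - (base_parts[:, 0] + 1j * base_parts[:, 1])).sum()
        scored.append((resid, a, b, ex))
    scored.sort(key=lambda t: t[0])  # smallest alignment residual first
    ears = []
    for resid, a, b, ex in scored[:r]:
        ze = ex['ears'][:, 0] + 1j * ex['ears'][:, 1]
        we = a * ze + b
        ears.append(np.stack([we.real, we.imag], axis=1))
    return np.mean(ears, axis=0)
```

Running this once per breed yields the 133 ear hypotheses mentioned above.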
The feature vector is only 1440-dimensional (11 parts + k-means). Finally, for each of the 133 part hypotheses, we extract SIFT features at those part locations, concatenate them with a color histogram of the face window, and send the result to a linear one-vs-all SVM.

One may wonder whether we are missing a lot of information from body features or fur, which is discriminative for dogs like dachshunds, but it is much harder to accurately localize body parts because of their deformability and occlusion, and if two dogs are easily discriminated by their fur, those breeds have low similarity in appearance and are easier to classify. The real problem is when features such as fur color and texture are very similar and not discriminative enough, and in those cases looking at the rest of the dog's parts is not so useful. One of our contributions is that we get a very good result just by considering their faces.
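The classification stage can be sketched with plain one-vs-all linear scoring. The weight matrices here are synthetic stand-ins for trained SVMs, and the per-breed feature layout is an assumption based on the description above.

```python
import numpy as np

def one_vs_all_predict(features, weights, biases):
    """Score one feature vector against every breed's linear classifier
    and return the breed indices ranked best-first.

    features: (d,) e.g. concatenated part SIFTs + face color histogram.
    weights:  (n_breeds, d) one linear-SVM weight vector per breed.
    biases:   (n_breeds,)
    """
    scores = weights @ features + biases
    return np.argsort(-scores)

def classify(per_breed_features, weights, biases):
    """In this pipeline there is one part hypothesis per breed, so breed b
    is scored on the features extracted at its own ear hypothesis."""
    scores = np.einsum('bd,bd->b', weights, per_breed_features) + biases
    return int(np.argmax(scores))
```

Ranking all breeds (rather than returning only the top one) is what makes the top-k accuracy curves in the results possible.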
Note the similarity between the query and the incorrect first guesses.
Look at the magenta curve: our first guess achieves 67% accuracy, and within the first 10 guesses we achieve 93% accuracy. The green curve uses a bag-of-words approach on dense SIFT features extracted in the face detection window, a baseline method for object recognition. The cyan and blue curves are state-of-the-art approaches used earlier for fine-grained categorization: the cyan curve uses LLC (locality-constrained linear coding) to encode the dictionary for BoW, and the blue uses the MKL framework, which is extremely inefficient.

The second ROC curve gives quantitative justification of our steps. The pink curve is our proposed method; the red curve is if we only use the highest-scoring face detection window (note that we keep 10). The green curve shows that without part localization (features extracted on a grid within the face detection window) the accuracy is much lower.
Speaking of efficiency, our system runs in real time: we have a working iPhone application available in iTunes now.
Bare layout for ECCV 2012 video preparation. You may submit the .pptx file, or use “File->Save and Send->Create a Video”.Remember: Author names and title will be added above the video by us.