Learning structured representations Deva Ramanan UC Irvine
Visual representations
• Geometric models (1970s-1990s): hand-coded models
• Statistical classifiers (1990s-present): large-scale training, appearance-based representations

Training
fw(x) = w · Φ(x)
• Training data consists of images with labeled bounding boxes
• Need to learn the model structure, filters and deformation costs
Learned model: positive weights, negative weights
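The scoring rule fw(x) = w · Φ(x) and its training from positives and negatives can be sketched as follows. This is a minimal illustration, not the training code behind the slides: the feature map `phi`, the toy 2D data, and the hinge-loss subgradient updates with the chosen step size and regularizer are all assumptions made for the example.

```python
# Minimal sketch: linear scoring f_w(x) = w . Phi(x), trained on labeled
# positives and negatives. All names and values here are hypothetical.

def phi(x):
    """Toy feature map (assumption): raw values plus a constant bias feature."""
    return list(x) + [1.0]

def score(w, x):
    """f_w(x) = w . Phi(x)"""
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

def train(examples, epochs=100, lr=0.1, reg=0.01):
    """Stochastic subgradient descent on a regularized hinge loss.
    examples: list of (x, y) with y = +1 (positive) or -1 (negative)."""
    dim = len(phi(examples[0][0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            f = phi(x)
            margin = y * sum(wi * fi for wi, fi in zip(w, f))
            for k in range(dim):
                grad = reg * w[k]          # regularizer term
                if margin < 1.0:           # example violates the margin
                    grad -= y * f[k]
                w[k] -= lr * grad
    return w

positives = [(2.0, 2.5), (3.0, 2.0), (2.5, 3.0)]
negatives = [(-2.0, -1.5), (-3.0, -2.0), (-1.5, -2.5)]
data = [(x, +1) for x in positives] + [(x, -1) for x in negatives]
w = train(data)
```

In a detector, `phi` would be replaced by image features computed over a candidate window, and the same dot product would be evaluated at every location and scale.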
Learned visual representations
Where is invariance built in?
• Representation (linear classifier, ...)
• Features
Viola Jones; Dalal Triggs
Learned visual representations
Where is invariance built in?
• Representation (latent-variable classifier)
• Features
Felzenszwalb et al 09
Fig. 1. Detections obtained with a single component person model. The model is defined by a coarse root filter (a), several higher resolution part filters (b) and a spatial model for the location of each part relative to the root (c). The filters specify weights for histogram of oriented gradients features. Their visualization show the positive weights at different orientations. The visualization of the spatial models reflects the "cost" of placing the center of a part at different locations relative to the root.
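The caption describes a root filter, part filters, and a spatial model that charges a cost for placing a part away from its ideal location. One way to read that as a scoring rule is sketched below. This is an illustration under stated assumptions, not the published implementation: the quadratic form of the deformation cost, the search radius, and all function names and toy response maps are hypothetical.

```python
# Sketch of a part-based score: root score plus, for each part, the best
# trade-off between part-filter response and deformation cost.

def best_part_placement(response, anchor, defcoef, radius=2):
    """Return (best score, best (x, y)) for one part.
    response: 2D list of part-filter scores indexed [y][x]
    anchor:   (x, y) ideal part location relative to the root placement
    defcoef:  (cx, cy) quadratic deformation cost coefficients (assumed form)
    """
    ax, ay = anchor
    cx, cy = defcoef
    best = (float("-inf"), None)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = ax + dx, ay + dy
            if 0 <= y < len(response) and 0 <= x < len(response[0]):
                s = response[y][x] - (cx * dx * dx + cy * dy * dy)
                if s > best[0]:
                    best = (s, (x, y))
    return best

def model_score(root_score, parts):
    """Sum the root score and the best placement score of every part."""
    total = root_score
    places = []
    for response, anchor, defcoef in parts:
        s, loc = best_part_placement(response, anchor, defcoef)
        total += s
        places.append(loc)
    return total, places

def peak_map(h, w, peak_xy, value):
    """Toy response map: zero everywhere except one peak."""
    m = [[0.0] * w for _ in range(h)]
    m[peak_xy[1]][peak_xy[0]] = value
    return m

parts = [
    (peak_map(5, 5, (3, 2), 3.0), (2, 2), (1.0, 1.0)),  # strong peak, 1 px off
    (peak_map(5, 5, (4, 2), 1.0), (2, 2), (1.0, 1.0)),  # weak peak, 2 px off
]
total, places = model_score(1.5, parts)
```

In this toy case the first part moves one pixel to reach its strong response, while the second part stays at its anchor because the weak response does not pay for the deformation cost. The part placements act as latent variables that are maximized over at scoring time.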
Where does learning fit in?
Diagram labels: Training images; Ground truth output; Matching alg (example outputs: person, bottle, cat)
Tune parameters ( , ) till desired output on training set
'Graduate Student Descent' might take a while (phrase from Marshall Tappen)
Matching results (after non-maximum suppression)
~1 second to search all scales

5 years of PASCAL people detection
[Plot: average precision (axis 0 to 50) by year, 2005 to 2010]
1% to 47% in 5 years
How do we move beyond the plateau?
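The matching results are reported after non-maximum suppression. A common greedy form of that step is sketched below. It is a generic illustration, not the procedure used for the results on the slide: the intersection-over-union overlap measure, the 0.5 threshold, and the toy detections are assumptions.

```python
# Greedy non-maximum suppression: keep the highest-scoring detection,
# drop any lower-scoring detection that overlaps a kept one too much.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(dets, thresh=0.5):
    """dets: list of (score, box). Returns kept detections, best first."""
    kept = []
    for s, box in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= thresh for k in kept):
            kept.append((s, box))
    return kept

dets = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),      # near-duplicate of the first box
    (0.7, (100, 100, 140, 140)),  # separate object
]
kept = nms(dets)
```

Here the second detection is suppressed as a near-duplicate of the first, and the third survives because it does not overlap either of them.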
How do we move beyond the plateau?
1. Develop more structured models with less invariant features
Invariance vs Search
• Projective Invariants
• View-Based Mixtures
Invariance vs Parametric Search
• Part-Based Models
[Figure: example detections labeled person, bottle, and cat; part-based model panels (a), (b), (c)]
Learned visual representations
Where is invariance built in?
• Representation (latent-variable classifier)
• Features
Yi & Ramanan 11
Buffy performance: 88% vs 73%
How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
The forgotten challenge....
Results: both methods use detectors trained on other data
Oxford method does not attempt to detect feet
[Table: Head / Hand / Foot scores for BCNPCL_HumanLayout and OXFORD_SBD; numeric entries not recoverable from the extraction]
Structured classifiers
Diagram labels: shape, reflectance → classifier → Estimated shape, Estimated reflectance

[Background page, shown behind the diagram; left margin cut off in the extraction]

Figure 8: Top: heat equilibrium for two bones. Bottom: the result of rotating the right bone with the heat-based attachment

Figure 10: A centaur pirate with a centaur skeleton embedded looks at a cat with a quadruped skeleton embedded

Figure 11: The human scan on the left is rigged by Pinocchio and is posed on the right by changing joint angles in the embedded skeleton. The well-known deficiencies of LBS can be seen in the right knee and hip areas.

[...] the character volume as an insulated heat-conducting body and force the temperature of bone i to be 1 while keeping the temperature of all of the other bones at 0. Then we can take the equilibrium temperature at each vertex on the surface as the weight of bone i at that vertex. Figure 8 illustrates this in two dimensions.

Solving for heat equilibrium over a volume would require tessellating the volume and would be slow. Therefore, for simplicity, Pinocchio solves for equilibrium over the surface only, but at some vertices, it adds the heat transferred from the nearest bone. The equilibrium over the surface for bone i is given by $\frac{\partial w^i}{\partial t} = \Delta w^i + H(p^i - w^i) = 0$, which can be written as

$$-\Delta w^i + H w^i = H p^i, \qquad (1)$$

where $\Delta$ is the discrete surface Laplacian, calculated with the cotangent formula [Meyer et al. 2003], $p^i$ is a vector with $p^i_j = 1$ if the nearest bone to vertex j is i and $p^i_j = 0$ otherwise, and $H$ is the diagonal matrix with $H_{jj}$ being the heat contribution weight of the nearest bone to vertex j. Because $\Delta$ has units of length$^{-2}$, so must $H$. Letting $d(j)$ be the distance from vertex j to the nearest bone, Pinocchio uses $H_{jj} = c/d(j)^2$ if the shortest line segment from the vertex to the bone is contained in the character volume and $H_{jj} = 0$ if it is not. It uses the precomputed distance field to determine whether a line segment is entirely contained in the character volume. For $c \approx 0.22$, this method gives weights with similar transitions to those computed by finding the equilibrium over the volume. Pinocchio uses $c = 1$ (corresponding to anisotropic heat diffusion) because the results look more natural. When k bones are equidistant from vertex j, heat contributions from all of them are used: $p_j$ is $1/k$ for all of them, and $H_{jj} = kc/d(j)^2$.

Equation (1) is a sparse linear system, and the left hand side matrix $-\Delta + H$ does not depend on i, the bone we are interested in. Thus we can factor the system once and back-substitute to find the weights for each bone. Botsch et al. show how to use a sparse Cholesky solver to compute the factorization for this kind of system. Pinocchio uses the TAUCS [Toledo 2003] library for this computation. Note also that the weights $w^i$ sum to 1 for each vertex: if we sum (1) over i, we get $(-\Delta + H) \sum_i w^i = H \cdot \mathbf{1}$, which yields $\sum_i w^i = \mathbf{1}$.

It is possible to speed up this method slightly by finding vertices that are unambiguously attached to a single bone and forcing their weight to 1. An earlier variant of our algorithm did this, but the improvement was negligible, and this introduced occasional artifacts.

Results
We evaluate Pinocchio with respect to the three criteria stated in the introduction: generality, quality, and performance. To ensure an objective evaluation, we use inputs that were not used during development. To this end, once the development was complete, we tested Pinocchio on 16 biped Cosmic Blobs models that we had not previously tried.

5.1 Generality
Figure 9 shows our 16 test characters and the skeletons Pinocchio embedded. The skeleton was correctly embedded into 13 of these models (81% success). For Models 7, 10 and 13, a hint for a single joint was sufficient to produce a good embedding.
These tests demonstrate the range of proportions that our method can tolerate: we have a well-proportioned human (Models 1–4, 8), large arms and tiny legs (6; in 10, this causes problems), and large legs and small arms (15; in 13, the small arms cause problems). For other characters we tested, skeletons were almost always correctly embedded into well-proportioned characters whose pose matched the given skeleton. Pinocchio was even able to transfer a biped walk onto a human hand, a cat on its hind legs, and a donut.
The most common issues we ran into on other characters were:
• The thinnest limb into which we may hope to embed a bone has a radius of 2τ. Characters with extremely thin limbs often fail because the graph we extract is disconnected. Reducing τ, however, hurts performance.
• Degree 2 joints such as knees and elbows are often positioned incorrectly within a limb. We do not know of a reliable way to identify the right locations for them: on some characters they are thicker than the rest of the limb, and on others they are thinner.
Although most of our tests were done with the biped skeleton, we have also used other skeletons for other characters (Figure 10).

5.2 Quality
Figure 11 shows the results of manually posing a human scan using our attachment. Our video [Baran and Popović 2007b] demonstrates the quality of the animation produced by Pinocchio.
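The heat-equilibrium weights in equation (1) of the background page can be illustrated on a toy problem. The sketch below is not Pinocchio's implementation and makes several assumptions: the surface is replaced by a 5-vertex path whose graph Laplacian stands in for the cotangent Laplacian, a small dense Gaussian elimination stands in for the sparse Cholesky factorization, the system is re-solved per bone rather than factored once, and the distances and nearest-bone assignments are invented.

```python
# Toy version of (-Laplacian + H) w^i = H p^i on a path of n vertices.

def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def heat_weights(nearest_bone, dist, num_bones, c=1.0):
    """Return one weight vector per bone.
    nearest_bone[j]: index of the bone nearest to vertex j (hypothetical data)
    dist[j]:         distance from vertex j to that bone (hypothetical data)
    """
    n = len(dist)
    # Graph Laplacian of a path (plays the role of -Delta; an assumption).
    L = [[0.0] * n for _ in range(n)]
    for j in range(n - 1):
        L[j][j] += 1.0
        L[j + 1][j + 1] += 1.0
        L[j][j + 1] -= 1.0
        L[j + 1][j] -= 1.0
    H = [c / dist[j] ** 2 for j in range(n)]          # H_jj = c / d(j)^2
    A = [[L[r][k] + (H[r] if r == k else 0.0) for k in range(n)]
         for r in range(n)]
    weights = []
    for i in range(num_bones):
        rhs = [H[j] if nearest_bone[j] == i else 0.0 for j in range(n)]
        weights.append(solve(A, rhs))
    return weights

weights = heat_weights(nearest_bone=[0, 0, 0, 1, 1],
                       dist=[1.0, 1.0, 2.0, 1.0, 1.0],
                       num_bones=2)
```

Because the rows of the Laplacian sum to zero, the per-vertex weights across bones add up to 1, matching the argument given for equation (1). The left-hand matrix is the same for every bone, which is why a single factorization could be reused in a real implementation.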
Structured object reports
"If you're not winning the game, change the rules"

[Background text, right margin cut off in the extraction]
Lead: Jitendra Malik (UC Berkeley)
Participants: Deva Ramanan (UC Irvine), Steve Seitz (U Washington [...]
Introduction/goal: Human detection and pose estimation are tasks with many applications, including next-generation human-computer interfaces and activity understanding. Detection [...] a classification problem (does this window contain a person or not?), while pose estimation [...] cast as a regression problem, where given an image or sequence of frames, one [...] joint angles. This project will take a more general view and cast both tasks as one of [...] a full syntactic parse will report the number of people present (if any), their body [...]
Caveat: we need more pixels
We should focus on high-resolution data (in contrast to most learning methods)

[Poster shown on slide] Multiresolution models for object detection
Dennis Park, Deva Ramanan, Charless Fowlkes
Motivation & Goal
Objects in images come with various resolutions.
Most recognition systems are scale-invariant, i.e. fixed-size template.
More pixels mean more information! We want to use the information when it is available.
Goal:
1. We want to use more pixels.
2. We want to detect small instances as well.
3. In addition, we try to address the correlation between resolution and the role of context.
Building blocks: HOG features & SVM
[Remaining poster columns are cut off in the extraction; surviving fragments mention a star model, LR and HR templates, a scoring function Φ(x, s, z), and f(x, s)]
Caltech Pedestrian Benchmark
Multiresolution model
Park et al. 2010
[Figure legend: missed detections, detections]
[Figure caption, partly cut off] On the left, we show the result of our low-resolution rigid-template baseline. [...] fails to detect large instances. On the right, we show detections of [...], part-based baseline, which fails to find small instances. On the [...] detections of our multiresolution model that is able to detect both [...] instances. The threshold of each model is set to yield the same rate of [...]
Multiresolution representations decrease the error by 2X compared to previous work
How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
3. Generate ground-truth datasets of structured labels
Case study: small or big parts
Skeleton; Parts/Poselets; Mini-parts

What are good representations?
Exemplars; Parts; Attributes; Visual Phrases; Grammars; ?
Even worse: what are the parts (if any)? Is there any structure to label here?
How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score "nuisance" variables as meaningful output
3. Generate ground-truth datasets of structured labels
Diagram for Eero
Diagram labels: Machine Learning; Vision
Vision as applied machine learning
Diagram for Eero
Diagram labels: Vision; Graphics; Machine Learning; (shape & appearance)
Vision as structured pattern recognition