Weakly supervised learning of interactions between humans and objects



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 3, MARCH 2012, p. 601

Weakly Supervised Learning of Interactions between Humans and Objects

Alessandro Prest, Student Member, IEEE, Cordelia Schmid, Senior Member, IEEE, and Vittorio Ferrari, Member, IEEE

A. Prest is with the Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7, Zurich CH-8092, Switzerland, and the LEAR team, INRIA Rhone-Alpes, 655 Avenue de l'Europe, Montbonnot, Saint-Ismier Cedex F-38334, France. E-mail: prest@vision.ee.ethz.ch.
C. Schmid is with the LEAR team, INRIA Rhone-Alpes, 655 Avenue de l'Europe, Montbonnot, Saint-Ismier Cedex F-38334, France. E-mail: Cordelia.Schmid@inrialpes.fr.
V. Ferrari is with the Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7, Zurich CH-8092, Switzerland. E-mail: ferrari@vision.ee.ethz.ch.

Manuscript received 8 Sept. 2010; revised 15 Apr. 2011; accepted 12 July 2011; published online 28 July 2011. Recommended for acceptance by S. Belongie. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2010-09-0690. Digital Object Identifier no. 10.1109/TPAMI.2011.158. 0162-8828/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

Abstract—We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set.

Index Terms—Action recognition, weakly supervised learning, object detection.

1 INTRODUCTION

Human action recognition is one of the most challenging problems in computer vision. It is important for a wide range of applications, such as video indexing and surveillance, but also image search. It is a challenging task due to the variety of human appearances and poses.

Most existing methods for action recognition either learn a spatio-temporal model of an action [3], [4], [5] or are based on human pose [6], [7]. Spatio-temporal models measure the motion characteristics of a human action. They are, for example, based on bags of space-time interest points [3], [8], [9] or represent the human action as a distribution over motion features localized in space and time [5], [4], [10]. Pose-based models learn the characteristic human poses from still images. The pose can, for example, be represented by a histogram-of-gradient (HOG) [7], [11] or based on shape correspondences [6].

Our approach, in contrast, defines an action as the interaction between a human and an object. Interactions are often the main characteristic of an action (Figs. 9, 10, 11, and 14). For example, the action "tennis serve" can be described as a human holding a tennis racket in a certain position. Characteristic features are the racket and its spatial relation to the human. Similarly, the actions "riding bike" and "wearing a hat" are defined by an object and its relation to the human.

In this paper, we introduce a weakly supervised (WS) approach for learning interaction models between humans and objects from a set of images depicting an action. We automatically localize the relevant object as well as its spatial relation to the human (Figs. 9, 10, 11, and 14). Our approach is weakly supervised in that it can learn from images annotated only with the action label, without being given the locations of humans or objects.

Most related to our approach are the works of Yao and Fei-Fei [12] and Gupta et al. [1], who also learn human-object spatial interactions. However, these approaches operate in a fully supervised (FS) setting, requiring training images with annotated object locations as well as human silhouettes [1] or limb locations [12]. Another work by Yao and Fei-Fei [13] deals with a somewhat different formulation of the problem. Their goal is to discriminate subtle situations where a human is holding an object without using it versus a human performing a particular action with the object (e.g., "holding a violin" versus "playing a violin"). Note how this model requires manually localized humans at both training and testing time.

A recent work [14] models the contextual interaction between human pose and nearby objects, but requires manually annotated human and object locations at training time for learning the pose and object models. A previous work by the same authors [15] models spatial relations between object classes such as cars and motorbikes for object localization in a fully supervised setting.

Interactions are used to improve human pose estimation in [16] by inferring pose parameters (i.e., joint angles) from the properties of objects involved in a particular human-object interaction.

Co-occurrence relations between humans and objects have been exploited for action recognition in videos by Ikizler-Cinbis and Sclaroff [17]. However, these relations are
looser than what we propose, as there is no spatial modeling of the interaction.

Fig. 1. Overview of our approach. See the main text for details.

1.1 Overview of the Method

1.1.1 Training

Our method takes as input a set of training images showing humans performing the action. Our approach runs over the following stages (Fig. 1):

1. Detect humans in the training set (Section 2). Our overall detector combines several detectors for different human parts, including face, upper body, and full body. This improves coverage, as it can detect humans at varying degrees of visibility. The detector provides the human reference frame necessary for modeling the spatial interaction with the object in stages 2 and 3.

2. Localize the action object on the training set (Section 3.1). The basic idea is to find an object recurring over many images at similar relative positions with respect to the human and with similar appearance between images. Related to our approach are weakly supervised methods for learning object classes [18], [19], [20], which attempt to find objects as recurring appearance patterns.

3. Given the localized humans and objects from stages 1 and 2, learn the probability distribution of human-object spatial relations, such as relative location and relative size. This defines the human-object interaction model (Section 3.4). Additionally, we learn an object appearance classifier based on the localized objects from stage 2. This appearance classifier, together with the human-object interaction model, constitutes the action model.

4. Based on the information estimated in steps 1-3, we train a binary action classifier to decide whether a novel test image contains an instance of this action class (Section 4).

1.1.2 Testing

Given a novel test image I and n different action models learned in the previous section, we want to assign one of the n possible action labels to I (Fig. 1, bottom):

1. Detect the single most prominent human in I.

2. For each action model, find the best fitting location for the action object given the detected human, the human-object interaction model, and the object appearance classifier.

3. Compute different features based on the information extracted in steps 1 and 2.

4. Classify I into an action class, based on the information estimated in steps 1-3 (Section 4). This uses the n classifiers trained in stage 4 of Section 1.1.1.

1.2 Overview of the Experiments

In Section 5, we present experiments on the data set of Gupta et al. [1] and on a new human-object interaction data set. The new data set and the corresponding annotations will be made available online upon acceptance of this paper. The experiments show that our method, learning with weak supervision only, obtains classification performance comparable to [1] and [12]. This is despite using only action labels for training, which is far less supervision than required by Gupta et al. [1] and Yao and Fei-Fei [12]. Moreover, we demonstrate that our model learns meaningful human-object spatial relations.

In Section 6, we present experiments on the PASCAL Action 2010 data set [2], where our method outperforms the state of the art for action classes involving humans and objects. Furthermore, we show how our method can also handle actions not involving objects (e.g., walking).

2 A PART-BASED HUMAN DETECTOR

In real-world images of human actions, the person can be fully or partially visible (Figs. 5, 10, and 11). In this context, a single detector (full person, upper body, or face) is insufficient. Our detector builds on the one by Felzenszwalb et al. [21]; it trains several detectors for different human parts, adds a state-of-the-art face detector, and learns how to combine the different part detectors. Our combination strategy goes beyond the maximum-score selection strategy of Felzenszwalb et al. [21] and is shown experimentally to outperform their approach (Section 2.5). Furthermore, it provides the human reference frame necessary for modeling the spatial interaction with the object.

2.1 Individual Part Detectors

We use four part detectors: one for the full human body (FB), two for the upper body (UB1, UB2), and one for the face (F). For the full-body detector (FB) and the first upper-body detector (UB1), we use the two components of the
human detector by Felzenszwalb et al. [21] (code available at http://people.cs.uchicago.edu/~pff/latent), learned on the PASCAL VOC07 training data [22]. Note that we use the two components as two separate part detectors. For the second upper-body detector (UB2), we train [21] on another data set of near-frontal upper bodies [23] (data available at www.robots.ox.ac.uk/~vgg/software/UpperBody). Therefore, UB2 is specialized to the frontal case, which frequently occurs in real images. Our experiments show UB2 to provide detections complementary to UB1 (Section 2.5).

For the face detector, we use the technique of Rodriguez [24], which is similar to the popular Viola-Jones detector [25], but replaces the Haar features with local binary patterns, providing better robustness to illumination changes [26]. The detector is trained for both front and side views.

2.2 Mapping to a Common Reference Frame

As the detection windows returned by different detectors cover different areas of the human body, they must be mapped to a common reference frame before they can be combined. Here, we learn regressors for this mapping (Fig. 2).

Fig. 2. Left: Detection windows returned by the individual detectors (green: FB+UB1, blue: UB2, red: F). Right: Corresponding regressed windows.

For each part detector we learn a linear regressor R(w, p) mapping a detection window w to a common reference frame. A regressor R is defined by

R(w, p) = (x − W·p1, y − H·p2, W·p3, W·p3·p4),   (1)

where w = (x, y, W, H) is a detection window defined by its top-left coordinates (x, y), its width W, and its height H. The regression parameters p = (p1, p2, p3, p4) are determined from the training data as follows.

We have a set of n training pairs of detection windows w_i and corresponding manually annotated ground-truth reference windows h_i. We find the optimal regression parameters p* as

p* = arg max_p Σ_{i=1}^{n} IoU(h_i, R(w_i, p)),   (2)

where IoU(a, b) = |a ∩ b| / |a ∪ b| is the intersection-over-union between two windows a, b. The optimal parameters p* assure the best overlap between the mapped detection windows R(w_i, p) and the ground-truth references h_i.

Fig. 3 shows an example of the original stickman annotation and the common reference frame derived from it. The height of the reference frame is given by the distance between the top point of the head stick and the midpoint of the torso stick. The width is fixed to 90 percent of the height.

Fig. 3. Example of an annotated image from the ETHZ PASCAL Stickmen data set. Left: The original stickman annotation. Right: The common reference frame we derived from the sticks.

2.3 Clustering Part Detections

After mapping detection windows from the part detectors to a common reference frame, detections of the same person result in similar windows. Therefore, we find small groups of detections corresponding to different people by clustering all mapped detection windows for an image in the 4D space defined by their coordinates.

Clustering is performed with a weighted dynamic-bandwidth mean-shift algorithm based on [27]. At each iteration the bandwidth is set proportionally to the expected localization variance of the regressed windows (i.e., to the diagonal of the window defined by the center of the mean-shift kernel in the 4D space). This automatically adapts the clustering to the growing error of the part detectors with scale.

To achieve high recall, it is important to set a very low threshold on the part detectors. This results in many false positives, which cause substantial drift in the traditional mean-shift procedure. To maintain a robust localization, at each iteration we compute the new cluster center as the mean of its members weighted by their detection scores. The final mean-shift location in the 4D space also gives a weighted average reference window for each cluster, which is typically more accurately localized than the individual part detections in the cluster.

2.4 Discriminative Score Combination

Given a cluster C containing a set of part detections, the goal is to determine a single combined score for the cluster. Each cluster C has an associated representative detection window, computed as the weighted mean of the part detection windows in C.

To compute the score of a cluster, we use the 4D vector c, where each dimension corresponds to one of the detectors. The value of an entry c_d is set to the maximum detection score for detector d within the cluster. If the cluster does not contain a detection for detector d, we set c_d = θ_d, with θ_d the threshold at which the detector is operating (see Section 2.5).
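To make the geometry of Section 2.2 concrete, the window mapping of Eq. (1) and the IoU objective of Eq. (2) can be sketched in Python. The grid search over candidate parameter tuples is only a stand-in for whatever optimizer the authors actually used, which this text does not specify:

```python
def regress(w, p):
    """Eq. (1): map a detection window w = (x, y, W, H) to the common
    reference frame using parameters p = (p1, p2, p3, p4):
    R(w, p) = (x - W*p1, y - H*p2, W*p3, W*p3*p4)."""
    x, y, W, H = w
    p1, p2, p3, p4 = p
    return (x - W * p1, y - H * p2, W * p3, W * p3 * p4)

def iou(a, b):
    """Intersection-over-union of two windows (x, y, W, H),
    with (x, y) the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def fit_regressor(windows, refs, grid):
    """Eq. (2): pick the parameters maximizing the summed IoU between
    regressed windows and ground-truth reference windows.  A plain grid
    search over candidate tuples is an assumption on our part; the paper
    does not state how the maximization is carried out."""
    return max(grid, key=lambda p: sum(iou(h, regress(w, p))
                                       for w, h in zip(windows, refs)))
```

For instance, a face window maps to a larger reference window below it; the regressor learns that offset and rescaling once per part detector.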
Given the 4D score vector for each cluster, we learn a linear SVM to separate positive (human detections) from negative examples. The score for a test image is then the confidence value of the SVM. Section 2.5 explains how we collect positive (T+) and negative (T−) training examples. The training set for this score-combiner SVM is the same used to train the regressors.

2.5 Experimental Evaluation

The experimental evaluation is carried out on the ETHZ PASCAL Stickmen data set [28] (available at http://www.vision.ee.ethz.ch/~calvin/datasets.html). It contains 549 images from the PASCAL VOC 2008 person class. In each image, one person is annotated by line segments defining the position and orientation of the head, torso, and upper and lower arms (Fig. 3). As we want the common reference frame to be visible in most images, we set it as a square window starting from the top of the head and ending at the middle of the torso (Fig. 2). Note that this choice has no effect on the combined human detector.

We build our positive training set T+ out of the first 400 images and use the remaining 149 as a positive test set S+. The negative examples are obtained from Caltech-101 [29] as well as from PASCAL VOC [22], [30]. We end up with 5,158 negative images: 3,956 are randomly selected as the negative training set T−, while the remaining form the negative test set S−.

The optimal regressor parameters p* are learned on the positive training set T+ (as described in Section 2.2). The score-combiner SVM is trained on the clusters obtained from the entire training set T+ ∪ T−. All clusters from T− are labeled as negative examples. Clusters from T+ are labeled as positive examples if their representative detection has an IoU with a ground-truth person bounding-box greater than 50 percent. All other clusters from T+ are discarded, as their ground-truth label is unknown (although an image in ETHZ PASCAL Stickmen might contain multiple people, only one is annotated). Note that before clustering we only keep detections scoring above a low threshold θ_d, so as to remove weak detections likely to be false positives.

Fig. 4 shows a quantitative evaluation on our test set S+ ∪ S− as a precision-recall curve. The recall axis indicates the percentage of annotated humans that were correctly detected (true positives, IoU with the ground-truth greater than 50 percent). All detections in S− are counted as false positives. Notice how in S+ only one human per image is annotated. Hence, only true positives in S+ are counted in the evaluation and all other detections are discarded, as their ground-truth label is unknown. Precision is defined as the ratio between the number of true positives and the total number of detections at a certain recall value.

Fig. 4. Precision-recall curves for the individual detectors and the combined ones. We consider a detection as correct when the intersection-over-union (IoU) with a ground-truth annotation is at least 50 percent. In parentheses are average precision values (AP), defined as the area under the respective curve.

Our combined human detector UB1+FB+UB2+F brings a considerable increase in average precision compared to the state-of-the-art human detector of Felzenszwalb et al. [21], which it incorporates. For a fair comparison, its detection windows are also regressed to a common reference frame (using the same regressor as in our combined detector).

Note that the person model of Felzenszwalb et al. [21] uses its two components (FB and UB1) in a "max-score-first" combination: If two detections from the two different components overlap by more than 50 percent IoU, then the lower scoring one is discarded. In the experiment UB1+FB, we use our novel combination strategy to combine only the two components UB1 and FB. This performs significantly better than the original model [21], further demonstrating the power of our combination strategy. In all experiments, all detection windows are regressed to the same common reference frame as ours.

Although the face detector performs much below the other detectors, it is valuable in close-up images where the other detectors do not fire.

3 LEARNING HUMAN-OBJECT INTERACTIONS

This section presents our human-object interaction model and how to learn it from weakly supervised images. The goal is to automatically determine the object relevant for the action as well as its spatial relation to the human. The intuition behind our human-object model is that some geometric properties relating the human to the action object are stable across different instances of the same action. Let's imagine a human playing a trumpet: The trumpet is always at approximately the same relative distance with respect to the human. We model this intuition with spatial cues involving the human and the object. We measure them relative to the position and scale of the reference frame provided by the human detector from Section 2. This makes the cues comparable between different images.

Our model (Section 3.1) incorporates several cues (Section 3.3). Some relate the human to the object, while others are defined purely by the appearance of the object. Once the action objects have been localized in the images, we use them together with the human locations to learn probability distributions of human-object spatial relations (Section 3.4). Experimental results show that these relations are characteristic for the action, e.g., a bike is below the
person riding it, whereas a hat is on top of the person wearing it (Section 5). These distributions constitute our human-object interaction model.

3.1 The Human-Object Model

Our model inputs a set of training images {I^i} showing an action, e.g., "tennis forehand" (Fig. 5, left) and "croquet" (Fig. 5, right). We retain for each image i the single highest scored human detection h^i and use it as an anchor for defining the human-object spatial relations. Furthermore, for each I^i we have a set X^i = {b^i_j} of candidate windows potentially containing the action object (Fig. 5). We use the generic object detector [31] to select 500 windows likely to contain an object rather than background (Section 3.2).

Fig. 5. Two images with three candidate windows each. The blue boxes indicate the location of the human calculated by the detector. The green boxes show possible action object locations.

Our goal is to select one window b^i ∈ X^i containing the action object for each image I^i. We model this selection problem in energy minimization terms. Formally, the objective is to find the configuration B* of windows (one window per image) so that the following energy is minimized:

E(B | H, Θ) = Σ_{b^i_j ∈ B} Θ_U(h^i, b^i_j) + Σ_{(b^i_j, b^l_m) ∈ B×B} Θ_H(b^i_j, b^l_m, h^i, h^l) + Σ_{(b^i_j, b^l_m) ∈ B×B} Θ_P(b^i_j, b^l_m).   (3)

Here we give a brief overview of the terms in this model, and explain them in more detail in Section 3.3.

Θ_U is a sum of unary cues measuring 1) how likely a window b^i_j is to contain an object of any class (φ_o(b^i_j)), and 2) the amount of overlap between the window and the human (φ_a(h^i, b^i_j)):

Θ_U(h^i, b^i_j) = φ_o(b^i_j) + φ_a(h^i, b^i_j).   (4)

Θ_H is a sum of pairwise cues capturing spatial relations between the human and the object. They encourage the model to select windows with similar spatial relations to the human across images (e.g., Δ_d measures the difference in relative distance between two human-object pairs). These cues are illustrated in Fig. 7:

Θ_H(b^i_j, b^l_m, h^i, h^l) = Δ_d(b^i_j, b^l_m, h^i, h^l) + Δ_s(b^i_j, b^l_m, h^i, h^l) + Δ_l(b^i_j, b^l_m, h^i, h^l) + Δ_o(b^i_j, b^l_m, h^i, h^l).   (5)

Finally, Θ_P is a sum of pairwise cues measuring the appearance similarity between pairs of candidate windows in different images. These cues prefer B* to contain windows of similar appearance across images. They are χ² distances on color histograms (Δ_c) and bag-of-visual-words descriptors (Δ_i):

Θ_P(b^i_j, b^l_m) = Δ_c(b^i_j, b^l_m) + Δ_i(b^i_j, b^l_m).   (6)

We normalize the range of all cues to [0, 1], but do not perform any other reweighting beyond this.

As the pairwise terms connect all pairs of images, our model is fully connected. Every candidate window in an image is compared to every candidate window in another. Fig. 6 shows an illustration of the connectivity in our model. We perform inference on this model using the TRW-S algorithm [32], obtaining a very good approximation of the global optimum B* = arg min_B E(B | H, Θ).

Fig. 6. A pair of training images from the "tennis serve" action. Candidate windows are depicted as white boxes. We employ a fully connected model, meaning that pairwise potentials (green lines) connect each pair of candidate windows between each pair of training images.

3.2 Candidate Windows

To obtain the candidate windows X and the unary cue φ_o, we use the objectness measure of Alexe et al. [31], which quantifies how likely it is for a window to contain an object of any class rather than background. Objectness is trained to distinguish windows containing an object with a well-defined boundary and center, such as cows and telephones, from amorphous background windows, such as grass and road.
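For a toy number of images and candidates, the energy of Eq. (3) can be evaluated and minimized exhaustively. The sketch below uses brute-force enumeration in place of the TRW-S inference the paper actually runs, and treats windows and cue functions abstractly; note that each unordered pair is visited twice here, which only scales the pairwise terms by a constant factor:

```python
from itertools import product

def energy(assignment, humans, candidates, theta_u, theta_h, theta_p):
    """Evaluate Eq. (3) for one chosen window per image.
    assignment[i] indexes into candidates[i]; theta_u, theta_h, theta_p
    are the unary, human-object pairwise, and appearance pairwise
    terms of Eqs. (4)-(6), passed in as callables."""
    n = len(assignment)
    B = [candidates[i][assignment[i]] for i in range(n)]
    e = sum(theta_u(humans[i], B[i]) for i in range(n))
    for i in range(n):
        for l in range(n):
            if i != l:  # fully connected model: all ordered image pairs
                e += theta_h(B[i], B[l], humans[i], humans[l])
                e += theta_p(B[i], B[l])
    return e

def minimize_brute_force(humans, candidates, theta_u, theta_h, theta_p):
    """Exact optimum B* by enumerating every assignment.  Only feasible
    for tiny instances; the paper uses TRW-S [32] for the real problem
    with 500 candidates per image."""
    best = min(product(*[range(len(c)) for c in candidates]),
               key=lambda a: energy(a, humans, candidates,
                                    theta_u, theta_h, theta_p))
    return [candidates[i][best[i]] for i in range(len(candidates))]
```

With a unary term favoring windows near some value and a pairwise term favoring agreement across images, the minimizer picks one mutually consistent window per image, which is exactly the selection behavior Eq. (3) is designed to produce.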
Objectness combines several image cues measuring distinctive characteristics of objects, such as appearing different from their surroundings, having a closed boundary, and sometimes being unique within the image.

We use objectness as a location prior in our model by evaluating it for all windows in an image and then sampling 500 windows according to their scores. These form the set of states for a node, i.e., the candidate windows the model can choose from.

This procedure brings two advantages. First, it greatly reduces the computational complexity of the optimization, which grows with the square of the number of windows (there are millions of windows in an image). Second, the sampled windows and their scores φ_o attract the model toward selecting objects rather than background windows.

For the experiments we used the code of Alexe et al. [31], available online (source code at www.vision.ee.ethz.ch/~calvin/software.html), without any modifications or tuning. It takes only about 3 seconds to compute candidate windows for one image.

3.3 Cues

3.3.1 Unary Cues

Each candidate window b is scored separately by the unary cues φ_o and φ_a.

The cue φ_o(b) = −log(p_obj(b)), where p_obj(b) ∈ [0, 1] is the objectness probability [31] of b, which measures how likely b is to contain an object of any class (Section 3.2).

The cue φ_a(h, b) = −log(1 − IoU(h, b)) measures the overlap between a candidate window b and the human h (with IoU(·, ·) ∈ [0, 1]). It penalizes windows with a strong overlap with the human since, in most images of human-object interactions, the object is near the human, but not on top of it. This cue proved to be very successful in suppressing trivial outputs such as selecting a window covering the human upper body in every image, i.e., the most frequently recurring pattern in human action data sets.

3.3.2 Human-Object Pairwise Cues

Candidate windows from two different images I^i, I^l are pairwise connected as shown in Fig. 6. Human-object pairwise cues compare two windows b^i_j, b^l_m according to different spatial layout cues. We define four cues measuring different spatial relations between the human and the object (Fig. 7). These cues prefer pairs of candidate windows with a similar spatial relation to the human in their respective images. Such recurring spatial relations are characteristic for the kind of human-object interactions we are interested in (e.g., tennis serve).

Fig. 7. For each human-object cue we show three possible configurations of human-object windows. The two leftmost configurations have a low pairwise energy, while the rightmost has high energy compared to either of the first two.

Let

l(b^i_j, h^i) = ((x^i_j − x^i)/W^i, (y^i_j − y^i)/H^i)   (7)

be the 2D location of a candidate object window b^i_j = (x^i_j, y^i_j, W^i_j, H^i_j) in the reference frame defined by the human h^i = (x^i, y^i, W^i, H^i) in image I^i.

With this notation, the four cues are:

1. The difference in the relative scale between the object and the human in the two images:

Δ_s(b^i_j, b^l_m, h^i, h^l) = max( a(h^i, b^i_j)/a(h^l, b^l_m), a(h^l, b^l_m)/a(h^i, b^i_j) ) − 1,   (8)

where

a(h^i, b^i_j) = area(b^i_j)/area(h^i)   (9)

is the ratio between the area (in pixels) of a candidate window and the human window.

2. The difference in the euclidean distance between the object and the human:

Δ_d(b^i_j, b^l_m, h^i, h^l) = abs( ‖l(b^i_j, h^i)‖ − ‖l(b^l_m, h^l)‖ ).   (10)

3. The difference in the overlap area between the object and the human (normalized by the area of the human):

Δ_o(b^i_j, b^l_m, h^i, h^l) = abs( |b^i_j ∩ h^i| / area(h^i) − |b^l_m ∩ h^l| / area(h^l) ),   (11)

where a ∩ b indicates the overlapping area (in pixels) between two windows a and b.

4. The difference in the relative location between the object and the human:

Δ_l(b^i_j, b^l_m, h^i, h^l) = ‖ l(b^i_j, h^i) − l(b^l_m, h^l) ‖.   (12)

3.3.3 Object-Only Pairwise Cues

The similarity Θ_P(b^i_j, b^l_m) between a pair of candidate windows b^i_j, b^l_m from two images is computed as the χ² difference between histograms describing their appearance. We use two descriptors. The first is a color histogram, Δ_c(b^i_j, b^l_m). The second is a bag-of-visual-words on a 3-level spatial pyramid using SURF features [33], Δ_i(b^i_j, b^l_m) (whose vocabulary is learned from the positive training images and is composed of 500 visual words). These cues prefer object windows with similar appearance across images.
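Using the window convention (x, y, W, H) from Section 2.2, the spatial cues of Eqs. (7)-(12) translate almost directly into code. The following is a sketch for illustration, not the authors' implementation:

```python
from math import hypot

def rel_location(b, h):
    """Eq. (7): object location in the human reference frame."""
    bx, by, bw, bh = b
    hx, hy, hw, hh = h
    return ((bx - hx) / hw, (by - hy) / hh)

def area_ratio(h, b):
    """Eq. (9): object area relative to human area."""
    return (b[2] * b[3]) / (h[2] * h[3])

def delta_scale(bi, bl, hi, hl):
    """Eq. (8): difference in relative scale across two images."""
    r1, r2 = area_ratio(hi, bi), area_ratio(hl, bl)
    return max(r1 / r2, r2 / r1) - 1.0

def delta_distance(bi, bl, hi, hl):
    """Eq. (10): difference in euclidean human-object distance."""
    return abs(hypot(*rel_location(bi, hi)) - hypot(*rel_location(bl, hl)))

def overlap_area(a, b):
    """Overlap area (in pixels) between two windows."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return ix * iy

def delta_overlap(bi, bl, hi, hl):
    """Eq. (11): difference in human-normalized overlap area."""
    return abs(overlap_area(bi, hi) / (hi[2] * hi[3])
               - overlap_area(bl, hl) / (hl[2] * hl[3]))

def delta_location(bi, bl, hi, hl):
    """Eq. (12): difference in relative location."""
    (x1, y1), (x2, y2) = rel_location(bi, hi), rel_location(bl, hl)
    return hypot(x1 - x2, y1 - y2)
```

Because every quantity is normalized by the human window's size, two human-object pairs at different image scales but with the same layout yield zero cue values, which is exactly the invariance the cues are designed for.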
3.4 Learning Human-Object Interactions

Given the human detections H and the object windows B* minimizing (3), we learn the interactions between the human and the action object as two relative spatial distributions. More precisely, we focus on the relative location (7) and the relative scale (9).

We estimate a 2D probability density function for the location of the object with respect to the human (7) as

  k_l(B*, H) = Σ_i 1/(√(2π) σ) · exp(−l(b^i, h^i)² / (2σ²)),   (13)

where b^i ∈ B* is the selected object window in image I^i, h^i ∈ H is the reference human detection in that image, and the bandwidth σ is set automatically by a diffusion algorithm [34].

A second density is given by the scale of the object relative to the human (9):

  k_s(B*, H) = Σ_i 1/(√(2π) σ) · exp(−a(b^i, h^i)² / (2σ²)).   (14)

The learned spatial relations for various actions are presented in Section 5.4.

Additionally, we train an object appearance classifier φ_t. This classifier is an SVM on a bag-of-words representation [35] using dense SURF descriptors [33]. As positive training samples we use the selected object windows B*. As negative samples we use random windows from images of other action classes.

The spatial distributions k_l and k_s, together with the object appearance classifier φ_t, constitute the action model A = (k_l, k_s, φ_t).

4 ACTION RECOGNITION

The previous section described how we automatically learn an action model from a set of training images {I}. Given a test image T and n action models {A_a}, a = 1, ..., n, we want to determine which action is depicted in it.

In Sections 4.1 to 4.3, we present three descriptors, each capturing a different aspect of an image. The human-object descriptor (Section 4.1) exploits the spatial relations and the object appearance model in A (Section 3) to localize the action object and then describes the human-object configuration. Sections 4.2 and 4.3 present two descriptors capturing contextual information at a global (Section 4.2) and a local (Section 4.3) level. Finally, in Section 4.4, we show how we combine the different descriptors for classifying T.

4.1 Human-Object Descriptor

We compute a low-dimensional descriptor for an image (the same procedure is applied equally to either a training or a test image): 1) detect humans and keep the highest scoring one h as the anchor for computing human-object relations; 2) compute a set of candidate object windows B using [31] (Section 3.2); 3) for every action model {A_a}, a = 1, ..., n, select the window b_a ∈ B minimizing the energy

  E(b | h, A_a) = φ_t^a(b) + k_l^a(h, b) + k_s^a(h, b),   (15)

where k_l^a(h, b) and k_s^a(h, b) are unary terms based on the probability distributions k_l and k_s learned during training (Section 3.4); φ_t^a(b) is the object appearance classifier, also learned during training. The optimal window can be found efficiently, as the complexity of this optimization is linear in |B|.

For each action model A_a, we create a descriptor vector containing the energy of the three terms in (15), evaluated for the selected window b_a. The overall human-object descriptor for the image is the concatenation over all actions and has dimensionality 3n. Based on this concatenated representation, the system can learn the relative merits of the various terms in the context of all actions. This is useful to adapt to correlations in the appearance and relative location of the objects between actions (e.g., if two actions involve similar relative positions of the object with respect to the human, the appearance energy will be given higher weight).

4.2 Whole-Image Descriptor

As shown by Gupta et al. [1], describing the whole image using GIST [36] provides a valuable cue for action classification. This descriptor can capture the context of an action, which is often quite distinctive [37].

4.3 Pose-from-Gradients Descriptor

Both [1] and [12] use human pose as a feature for action recognition. In those approaches, pose is represented by silhouettes [1] or limb locations [12], which are expensive to annotate manually on training images. In the same spirit of leveraging human pose for action classification, but avoiding the additional annotation effort, we propose a much simpler descriptor to capture pose information. Given an image and the corresponding human detection h, we extract the GIST descriptor [36] from an image window obtained by enlarging h by a constant factor so as to include more of the arm pose. Fig. 8 shows example human detections and the corresponding enlarged windows. While this descriptor does not require any additional supervision on the training images, it proved successful in discriminating difficult cases (see results in Section 5.3). Moreover, it takes further advantage of using a robust human detector, such as the one in Section 2.

Fig. 8. Human pose has a high discriminative power for distinguishing actions. The solid window is the original human detection, while the dashed window shows the area from which the pose-from-gradients descriptor is extracted.

4.4 Action Classifiers

For training, we extract the descriptors of Sections 4.1-4.3 from the same training images {I^i} used for learning the human-object model (notice how only the action class label is necessary as supervision, and not human or object bounding-boxes [1], [12], human silhouettes [1], or limb locations [12]). We obtain a separate RBF kernel for each descriptor and then compute a linear combination of them.
Given the resulting combined kernel, we learn a multiclass SVM. The combination weights are set by cross validation to maximize the classification accuracy [38].

Given a new test image T, we compute the three descriptors and average the corresponding kernels according to the weights learned at training time. Finally, we classify T (i.e., assign T an action label) according to the multiclass SVM learned during training.

5 EXPERIMENTAL RESULTS ON THE SPORTS AND TRUMPETS, BIKES, AND HATS (TBH) DATA SETS

We present here action recognition results on two data sets: the six sports actions of Gupta et al. [1] and a new data set of three actions we collected, called the Trumpets, Bikes, and Hats data set. The TBH data set and the corresponding annotations will be released online upon acceptance of this paper. Section 5.1 describes the data sets. Section 5.2 presents the experimental setup, namely, the two levels of supervision we evaluate on. Section 5.3 reports quantitative results and comparisons to [1] and [12]. The learned human-object interactions are illustrated in Section 5.4.

Fig. 9. Columns 1-5: Examples of action-object windows localized by our method in weakly supervised training images of the Sports data set [1] (columns 1-3) and of the TBH data set (columns 4-5). Both the human (green) and the object (dashed pink) are found automatically. Each column shows three images from the same class. The method is able to handle multimodal human-object spatial configurations. Columns 6-8: Action-object windows automatically selected from images of the PASCAL Action 2010 data set [2]. The human window is given beforehand, as prescribed by the PASCAL protocol. The object is localized automatically by our method.

5.1 Data Sets

5.1.1 TBH Data Set

We introduce a new action data set called TBH. It is built from Google Images and the IAPR TC-12 data set [39], and contains three actions: "playing trumpet," "riding bike," and "wearing hat."

We use Google Images to retrieve images for the action "playing trumpet." We manually select the first 100 images depicting the action in a set of images obtained by searching for "person OR man OR woman," followed by the action verb ("playing") and the object name ("trumpet"). The amount of negative images that were manually discarded is 25 percent. We split these 100 positive images into training (60) and testing (40), i.e., the same proportions as in the sports data set [1].

For the actions "riding bike" and "wearing hat" we collected images from the IAPR TC-12 data set. Each image in this large data set has an accompanying text caption describing the image. We ran a natural language processor (NLP) [40] on the text captions to retrieve images showing the action. In detail, a caption should contain: 1) a subject, specified as either "person," "man," "woman," or "boy," and 2) a verb-object pair. The verb is specified in the infinitive form, with the object given as a set of synonyms (e.g., "hat" and "cap"). Due to the high quality of the captions, this process returns almost only relevant images. We manually removed just one irrelevant image from each class. The resulting data set contains 117 images for "riding bike" (70 training, 47 testing) and 124 images for "wearing hat" (74 training, 50 testing). In the resulting TBH data set, images are annotated only with the label of the action class they depict.

5.1.2 Sports Data Set [1]

This data set is composed of six actions of people doing sports: "cricket batting," "cricket bowling," "croquet," "tennis forehand," "tennis serve," and "volleyball smash." Each action has 30 training images and 20 test images. These images come with a rich set of annotations. The approaches of Gupta et al. [1] and Yao and Fei-Fei [12] are in fact trained with full supervision, using all these annotations. More precisely, for each training image they need:

A1. The action label.
A2. A ground-truth bounding-box for the action object.
A3. A manually segmented human silhouette [1] or limb locations [12].
A4. Gupta et al. [1] also require a set of training images for each action object, collected from Google Images (e.g., by querying for "tennis racket" and then manually discarding irrelevant images).

5.2 Experimental Setups

5.2.1 Weakly Supervised

Our method learns human actions from images labeled only with the action they contain (A1), i.e., weakly supervised images. At training time we localize objects in the training set by applying the model presented in Section 3 (Fig. 9).
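Returning briefly to the classifier of Section 4.4: the per-descriptor RBF kernels and their weighted combination can be sketched as follows. The kernel width gamma and the weights are placeholders here; in the paper the weights are set by cross validation [38].

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    # RBF kernel between two descriptor vectors.
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

def combined_kernel(x_parts, y_parts, weights, gamma=1.0):
    # Weighted combination of one RBF kernel per descriptor
    # (human-object, whole-image, pose-from-gradients).
    return sum(w * rbf_kernel(x, y, gamma)
               for w, x, y in zip(weights, x_parts, y_parts))
```

A weighted sum of positive-definite kernels with nonnegative weights is itself a valid kernel, so the combined kernel can be handed directly to any kernel SVM solver.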
Given the localized objects and the human locations, we learn the spatial relations as well as an object appearance classifier (Section 3.4). At test time we recognize human actions in test images by applying the procedure described in Section 4.

5.2.2 Fully Supervised

In order to fairly compare our approach with [1] and [12], we introduce a fully supervised variant of our model where we use A1 and A2. Instead of A3, we just use ground-truth bounding-boxes on the human, which is less supervision than silhouettes [1] or limb locations [12]. It is then straightforward to learn the human-object relation models and the object appearance classifier (Section 3.4) from these ground-truth bounding-boxes. We also train a sliding-window detector [21] for the action object using the ground-truth bounding-boxes A2. This detector then gives the appearance cue φ_t in (15).

In the following, we denote by FS our fully supervised setting using one human bounding-box and one object bounding-box per training image. Instead, we denote by FS-[12] the setting using A1-A3 and by FS-[1] the setting using A1-A4.

In the FS setup, we recognize human actions in test images by applying the procedure described in Section 4. In step 2 of Section 4.1, we run the action object detector to obtain the candidate windows B, i.e., all windows returned by the detector, without applying any threshold or nonmaxima suppression.

5.3 Experimental Evaluation

5.3.1 Sports Data Set [1]

Table 1 presents results on the sports data set [1], where the task is to classify each test image into one of six actions.

TABLE 1. Classification results on the Sports data set [1]. First row: our method with WS; second row: our method with FS; third row: Gupta et al. [1] with FS-[1]; fourth row: Yao and Fei-Fei [12] with FS-[12] (they only report results for their full model). Each entry is the classification accuracy averaged over all six classes. Column "Full model" in rows 1 and 2 includes our human-object spatial relations.

In the WS setup (first row), combining the object appearance classifier (Section 3.4), the pose-from-gradients descriptor, and the whole-image classifier improves over using any of them alone and already obtains good performance (76 percent). Importantly, adding the human-object interaction model ("Full model" column) raises performance to 81 percent, confirming that our model learns human-object spatial relations beneficial for action classification. Figs. 10 and 11 show humans and objects automatically detected on the test images by our full method.

Fig. 10. Example results on the TBH data set for test images that were correctly classified by our approach. Two images are shown for each action class (from left to right: "playing trumpet," "riding bike," and "wearing hat"). First row: results from the weakly supervised setting WS. Second row: results from the fully supervised setting FS.

An important point is that the performance of our model trained in the WS setup is 2 percent better than the FS-[1] approach of Gupta et al. [1] and 2 percent below the FS-[12] approach of Yao and Fei-Fei [12]. This confirms the main claim of the paper: Our method can effectively learn actions defined by human-object interactions in a WS setting. Remarkably, it reaches performance comparable to state-of-the-art methods in FS settings, which are very expensive in terms of training annotation.

The second row of Table 1 shows results for our method in the FS setup. As expected, the object appearance classifier performs better than the WS one, as we can train it from ground-truth bounding-boxes. Again, the combination with the pose-from-gradients descriptor and the whole-scene classifier significantly improves results (now to 80 percent). Furthermore, also in this FS setup, adding the human-object spatial relations raises performance ("Full model"). The classification accuracy exceeds that of Gupta et al. [1] and is on par with [12]. We note how Gupta et al. [1] and Yao and Fei-Fei [12] use human body part locations or silhouettes for training, while we use only human bounding-boxes, which are cheaper to obtain.
Interestingly, although trained with much less supervision, our pose-from-gradients descriptor performs on par with the human pose descriptor of Gupta et al. [1].

Fig. 11. Example results from the sports data set of Gupta et al. [1] for test images that were correctly classified by our approach. One image per class is shown (from left to right: "cricket batting," "cricket bowling," "croquet," "tennis forehand," "tennis serve," and "volleyball"). First row: weakly supervised setting. Second row: fully supervised setting.

While, in the FS setup, our method localizes the action objects more accurately, in many cases it already detects them well in the WS setup, despite being trained without any bounding-box. Failure cases are shown and discussed in Fig. 12.

Fig. 12. Example failures of our method on several test images (after training in the WS setting). Action labels indicate the (incorrect) classes the images were assigned to. The main reasons are: missed humans due to tilted pose or poor visibility (first, fourth, and sixth images), similarities between different action classes (fifth image), and truncation or poor visibility of the action object (second and third images).

5.3.2 TBH Data Set

Table 2 shows results on the TBH data set, which reinforce the conclusions drawn on the sports data set: 1) combining the object appearance classifier, pose-from-gradients, and whole-scene classifier is beneficial in both WS and FS setups; 2) the human-object interaction model brings further improvements in both setups; 3) the performance of the full model in the WS setup is only 5 percent below that of the FS setup, confirming that our method is a good solution for WS learning.

TABLE 2. Classification results on the TBH human action data set. First row: our method with weak supervision; second row: our method with full supervision; other rows: variants of our approach. See the text for details.

We note that the performance gap of the object appearance classifier between FS and WS is smaller than on the sports data set. This might be due to the greater difference between action objects in the TBH data set, where a weaker object model already works well.

Finally, we note how the whole-scene descriptor has lower discriminative power than on the sports data set (67 percent across six classes versus 58 percent across three classes). This is likely due to the greater intraclass variability of backgrounds and scenes in the TBH data set. Figs. 10 and 11 show example results for automatically localized action objects on the test data from the two data sets.

5.3.3 Influence of the Human Detector

To demonstrate the influence of our human detector (Section 2) on action classification results, we evaluate two variants of our WS setup which use alternative ways to select a human reference frame (both at training and test time). The first variant (WS-Human [21]) uses the highest scoring human detection returned by Felzenszwalb et al. [21]. The second variant (WS-HumanGT) uses the ground-truth human annotation as the reference frame. We report in Table 2 results on the TBH data set, which has a high variability of human poses and viewpoints.

The difference between rows "WS" and "WS-Human [21]" demonstrates that using our detector (Section 2) results in significantly better action recognition performance than using [21] alone (+5 percent). Interestingly, using our human detector leads to performance close to using ground-truth detections (row "WS-HumanGT") (−1 percent).

5.3.4 Influence of the Choice of Candidate Windows

In all experiments so far we have used the objectness measure of Alexe et al. [31] to automatically propose a set of candidate windows X^i from which our algorithm chooses the most consistent solution over a set of training images (Section 3.1). To show the impact of the objectness measure, we compare to a simple baseline based on the intuition that image patches close to the human are more likely to contain the action object. This baseline samples arbitrary windows overlapping with the human detection h^i. More precisely, for each training image i, we randomly sample 10^6 windows uniformly and score each window w with s = 1 − |0.5 − IoU(w, h^i)| / 0.5. This score is highest for windows that overlap about 50 percent with the human, and lowest for windows either completely on top of it or not overlapping with it at all (i.e., background). This is a good criterion, as the action object is typically held in the human's hand, and so it partially overlaps with it.
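The baseline's scoring function can be written down directly from its definition. Windows are (x, y, w, h) tuples here; the IoU helper is our own, not part of the method.

```python
def iou(a, b):
    # Intersection-over-union of two (x, y, w, h) windows.
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def baseline_score(w, h):
    # s = 1 - |0.5 - IoU(w, h)| / 0.5: highest at ~50% overlap with the
    # human, zero for complete overlap or no overlap at all.
    return 1.0 - abs(0.5 - iou(w, h)) / 0.5
```

A window identical to the human detection and a window far away from it both score zero, while a window covering about half the human scores one.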
To form the set of candidate windows, we randomly sample 500 windows according to s.

We report in Table 2 results on the TBH data set (WS-AltCands). This alternative strategy for sampling candidate windows leads to moderately worse action recognition results than when using objectness windows (−3 percent). Moreover, it is interesting to note how the spatial relations learned from the alternative windows are weaker, as they do not bring a positive contribution when combined in the full model (cf. the fourth and fifth columns).

5.4 Learned Human-Object Interactions

Fig. 13 compares human-object spatial relations obtained from automatically localized humans and objects in the WS setup to those derived from ground-truth bounding-boxes in the FS setup (Section 3.4). The learned relations are clearly meaningful. The location of the cricket bat (first column) is near the chest of the person, whereas the croquet mallet (second column) is below the torso. Trumpets are distributed near the center of the human reference frame, as they are often played at the mouth (third column). As the fourth column shows, the relative scale between the human and the object for the "Volleyball" action indicates that a volleyball is about half the size of a human detection (see also the rightmost column of Fig. 11).

Importantly, the spatial relations learned in the WS setting are similar to those learned in the FS setting, albeit less peaked. This demonstrates that our weakly supervised approach does learn human-object interactions correctly.

Fig. 13. Human-object spatial distributions learned in the FS setting (top) and in the WS setting (bottom). (a)-(c): Relative location of the action object with respect to the human (k_l in Section 3.4). Dashed boxes indicate the size and location of the human windows. (a) "Cricket Batting," (b) "Croquet," (c) "Playing Trumpet." (d): Distribution of the object scale relative to the human scale for the action "Volleyball" (k_s in Section 3.4). The horizontal axis represents the x-scale and the vertical axis the y-scale. A cross indicates the scale of the human.
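The densities k_l and k_s visualized in Fig. 13 are kernel density estimates in the spirit of (13) and (14). A minimal 1D sketch with a fixed bandwidth follows (the paper selects the bandwidth automatically with a diffusion-based method [34]; the relative-location helper and the fixed bandwidth are our simplifications):

```python
import math

def gaussian_kde(samples, bandwidth):
    # Density estimate built from training samples, as in (13)/(14):
    # a normalised sum of Gaussians centred on the samples.
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth * len(samples))
    def density(x):
        return norm * sum(math.exp(-(x - s) ** 2 / (2.0 * bandwidth ** 2))
                          for s in samples)
    return density

def relative_location(obj, human):
    # Object centre expressed in the human reference frame, normalised
    # by the human window size; windows are (x, y, w, h).
    ox, oy = obj[0] + obj[2] / 2.0, obj[1] + obj[3] / 2.0
    hx, hy = human[0] + human[2] / 2.0, human[1] + human[3] / 2.0
    return ((ox - hx) / human[2], (oy - hy) / human[3])
```

Feeding the per-image relative locations (or relative scales) of the selected object windows into such an estimator yields maps like those in Fig. 13.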
6 EXPERIMENTAL RESULTS ON THE PASCAL ACTION 2010 DATA SET

6.1 Data Set and Protocol

The PASCAL Action [2] data set contains nine action classes, seven of which involve a human and an object: "phoning," "playing instrument," "reading," "riding bike," "riding horse," "taking photo," and "using computer." The two actions involving no object are "running" and "walking." Each class has between 50 and 60 images, divided equally into training and testing subsets. Each image is annotated with ground-truth bounding-boxes on the humans performing the action (there might be more than one). For images with multiple human annotations, we duplicate the image and assign each human to a different image. In this way, we maintain our method unchanged while also making our results fully comparable with previous work.

We perform experiments following the official protocol of the PASCAL Challenge [2], where human ground-truth bounding-boxes are given to the algorithm both at training and at test time. We train a separate 1-vs-all action classifier with our method from Sections 3 and 4, using the ground-truth human annotations as H. However, object locations are not given, and they are automatically found by our model (Fig. 9).

At test time, we evaluate the classification accuracy for each action separately by computing a precision-recall curve. This means that each action classifier is applied to all test images from all classes, and the resulting confidence values are used to compute the precision-recall curve. We report the average precision, i.e., the area under the precision-recall curve. This is the official measure of the PASCAL Challenge [2].

6.2 Experimental Evaluation

The first nine rows of Table 3 show the average precision for each of the nine actions. We present the mean Average Precision over classes (mAP) in the last two rows.

TABLE 3. Classification results on the PASCAL Action 2010 data set. We show average precision results for individual classes. In the last column, we show results from the best contestant in the challenge, Everingham et al. [2]. Each entry in the first nine rows is the average precision of one class, while the last two rows present the mean average precision over several classes. Column "Full model" includes our human-object spatial relations (i.e., the interaction model).

Fig. 14 shows results on example test images. Note how the object appearance classifier and human-object interaction components of our model are trained in a weakly supervised manner, as the location of the action object is not given (neither at training nor at test time). The results demonstrate that these components improve the performance of our method compared to using information on the human alone ("Pose from gradients" column). Also note how the whole-scene classifier is only moderately informative on this data set, leaving most of the contribution to the overall performance to the object and interaction components ("Full model").

Our full model achieves a 7 percent improvement over the best method in the challenge, i.e., Everingham et al. [2], when averaged over the seven classes involving both humans and objects (last row of Table 3). Moreover, when considering all classes, it performs on par with it (second to last row). As the "running" and "walking" rows show, our method can also handle classes involving no object, delivering good performance even though it was not designed for this purpose.

Fig. 14. Example results on the PASCAL Action 2010 test set [2]. Each column shows two test images from a class. From left to right: "playing instrument," "reading," "taking photo," "riding horse," and "walking."
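The per-class average precision used in this evaluation can be sketched as follows. This is a common non-interpolated AP variant; the official PASCAL development kit computes a slightly different, interpolated version.

```python
def average_precision(confidences, labels):
    # Rank test images by classifier confidence, accumulate precision
    # at each positive image, and average over the positives.
    ranked = sorted(zip(confidences, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    true_pos, ap = 0, 0.0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            true_pos += 1
            ap += true_pos / rank
    return ap / n_pos if n_pos else 0.0
```

A perfect ranking (all positives ahead of all negatives) yields an AP of 1.0, and every misranked positive lowers the score.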
The reason is that our method selects image patches on the legs as the "action object," as they are a recurring pattern which is distinctive for walking (last column of Fig. 14).

7 CONCLUSION

This paper introduced a novel approach for learning human-object interactions automatically from weakly labeled images. Our approach automatically determines the objects relevant for the action and their spatial relations to the human. The performance of our method is comparable to state-of-the-art fully supervised approaches [1], [12] on the Sports data set of Gupta et al. [1]. Moreover, on the PASCAL Action Challenge 2010 [2], it outperforms the best contestant (Everingham et al. [2]) on classes involving humans and objects.

In future work, we plan to extend our approach to videos, where temporal information can improve the detection of humans and objects. Furthermore, temporal information can help to model variations within an action class and action sequences over time.

ACKNOWLEDGMENTS

This work was partially funded by the QUAERO project supported by OSEO, the French State agency for innovation, the joint Microsoft/INRIA project, and the Swiss National Science Foundation.

REFERENCES

[1] A. Gupta, A. Kembhavi, and L. Davis, "Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009.
[2] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results," http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
[3] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proc. 17th Int'l Conf. Pattern Recognition, 2004.
[4] I. Laptev and P. Perez, "Retrieving Actions in Movies," Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[5] K. Mikolajczyk and H. Uemura, "Action Recognition with Motion-Appearance Vocabulary Forest," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[6] J. Sullivan and S. Carlsson, "Recognizing and Tracking Human Action," Proc. Seventh European Conf. Computer Vision, 2002.
[7] N. Ikizler-Cinbis, G. Cinbis, and S. Sclaroff, "Learning Actions from the Web," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," Proc. Second IEEE Joint Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning Realistic Human Actions from Movies," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[10] G. Willems, J.H. Becker, T. Tuytelaars, and L. van Gool, "Exemplar-Based Action Recognition in Video," Proc. British Machine Vision Conf., 2009.
[11] C. Thurau and V. Hlavac, "Pose Primitive Based Human Action Recognition in Videos or Still Images," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[12] B. Yao and L. Fei-Fei, "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[13] B. Yao and L. Fei-Fei, "Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[14] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative Models for Static Human-Object Interactions," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition Workshops, 2010.
[15] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative Models for Multi-Class Object Layout," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[16] A. Gupta, T. Chen, F. Chen, D. Kimber, and L. Davis, "Context and Observation Driven Latent Variable Model for Human Pose Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[17] N. Ikizler-Cinbis and S. Sclaroff, "Object, Scene and Actions: Combining Multiple Features for Human Action Recognition," Proc. 11th European Conf. Computer Vision, 2010.
[18] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2003.
[19] J. Winn, A. Criminisi, and T. Minka, "Object Categorization by Learned Universal Visual Dictionary," Proc. 10th IEEE Int'l Conf. Computer Vision, 2005.
[20] T. Deselaers, B. Alexe, and V. Ferrari, "Localizing Objects While Learning Their Appearance," Proc. 11th European Conf. Computer Vision, 2010.
[21] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, Sept. 2010.
[22] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html, 2007.
[23] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive Search Space Reduction for Human Pose Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[24] Y. Rodriguez, "Face Detection and Verification Using Local Binary Patterns," PhD thesis, EPF Lausanne, 2006.
[25] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2001.
[26] G. Heusch, Y. Rodriguez, and S. Marcel, "Local Binary Patterns as an Image Preprocessing for Face Authentication," Proc. Seventh IEEE Int'l Conf. Automatic Face and Gesture Recognition, 2006.
[27] D. Comaniciu, V. Ramesh, and P. Meer, "The Variable Bandwidth Mean Shift and Data-Driven Scale Selection," Proc. Eighth IEEE Int'l Conf. Computer Vision, 2001.
[28] M. Eichner and V. Ferrari, "Better Appearance Models for Pictorial Structures," Proc. British Machine Vision Conf., 2009.
[29] R. Fergus and P. Perona, "Caltech Object Category Datasets," http://www.vision.caltech.edu/html-files/archive.html, 2003.
[30] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results," http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html, 2008.
[31] B. Alexe, T. Deselaers, and V. Ferrari, "What Is an Object?" Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[32] V. Kolmogorov, "Convergent Tree-Reweighted Message Passing for Energy Minimization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1568-1583, Oct. 2006.
[33] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, vol. 110, pp. 346-359, 2008.
[34] Z. Botev, "Nonparametric Density Estimation via Diffusion Mixing," The Univ. of Queensland, Postgraduate Series, Nov. 2007.
[35] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study," Int'l J. Computer Vision, vol. 73, pp. 213-238, 2007.
[36] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," Int'l J. Computer Vision, vol. 42, pp. 145-175, 2001.
[37] L.J. Li and L. Fei-Fei, "What, Where and Who? Classifying Events by Scene and Object Recognition," Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[38] P. Gehler and S. Nowozin, "On Feature Combination for Multiclass Object Classification," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[39] M. Grubinger, P.D. Clough, H. Müller, and T. Deselaers, "The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems," Proc. Int'l Conf. Language Resources and Evaluation, 2006.
[40] R. Johansson and P. Nugues, "Dependency-Based Syntactic-Semantic Analysis with PropBank and NomBank," Proc. 12th Conf. Computational Natural Language Learning, 2008.

Alessandro Prest received the MSc cum laude in computer science from the University of Udine in July 2007. Since 2009, he has been working toward the PhD degree in computer vision at ETH Zurich. He has been working as a research assistant in different institutions since 2004. In 2008, he was the recipient of the Best Applied Physics award from the Italian Physical Society for his work on renewable energies. He is a student member of the IEEE.

Cordelia Schmid received the MS degree in computer science from the University of Karlsruhe and a doctorate from the Institut National Polytechnique de Grenoble. She is a research director at INRIA Grenoble and directs the project-team called LEAR, for LEArning and Recognition in Vision. She is the author of more than 100 technical publications. In 2006, she was awarded the Longuet-Higgins prize for fundamental contributions in computer vision that have withstood the test of time. She is a senior member of the IEEE.

Vittorio Ferrari received the PhD degree from ETHZ in 2004 and has been a postdoctoral researcher at INRIA Grenoble and the University of Oxford. He is an assistant professor at ETH Zurich. In 2008, he was awarded a Swiss National Science Foundation Professorship grant for outstanding young researchers.
He is the author of 40 technical publications, most of them in the highest ranked conferences and journals in computer vision and machine learning. He is a member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.