3D People Tracking with Gaussian Process Dynamical Models*

Raquel Urtasun (Computer Vision Laboratory, EPFL, Switzerland, raquel.urtasun@epfl.ch), David J. Fleet (Dept. of Computer Science, University of Toronto, Canada, fleet@cs.toronto.edu), Pascal Fua (Computer Vision Laboratory, EPFL, Switzerland, pascal.fua@epfl.ch)

Abstract

We advocate the use of Gaussian Process Dynamical Models (GPDMs) for learning human pose and motion priors for 3D people tracking. A GPDM provides a low-dimensional embedding of human motion data, with a density function that gives higher probability to poses and motions close to the training data. With Bayesian model averaging a GPDM can be learned from relatively small amounts of data, and it generalizes gracefully to motions outside the training set. Here we modify the GPDM to permit learning from motions with significant stylistic variation. The resulting priors are effective for tracking a range of human walking styles, despite weak and noisy image measurements and significant occlusions.

1. Introduction

Prior models of pose and motion play a central role in 3D monocular people tracking, mitigating problems caused by ambiguities, occlusions, and image measurement noise. While powerful models of 3D human pose are emerging, sophisticated motion models remain rare. Most state-of-the-art approaches rely on linear-Gaussian Markov models which do not capture the complexities of human dynamics. Learning richer models is challenging because of the high-dimensional variability of human pose, the nonlinearity of human dynamics, and the relative difficulty of acquiring large amounts of training data.

This paper shows that effective models for people tracking can be learned using the Gaussian Process Dynamical Model (GPDM) [22], even when modest amounts of training data are available. The GPDM is a latent variable model with a nonlinear probabilistic mapping from latent positions x to human poses y, and a nonlinear dynamical mapping on the latent space. It provides a continuous density function over poses and motions that is generally non-Gaussian and multimodal. Given training sequences, one simultaneously learns the latent embedding, the latent dynamics, and the pose reconstruction mapping. Bayesian model averaging is used to lessen problems of over-fitting and under-fitting that are otherwise problematic with small training sets [10, 12].

We propose a form of GPDM, the balanced GPDM, for learning smooth models from training motions with stylistic diversity, and show that such models are effective for monocular people tracking. We formulate the tracking problem as a MAP estimator on short pose sequences in a sliding temporal window. Estimates are obtained with deterministic optimization, and look remarkably good despite very noisy, missing or erroneous image data and significant occlusions.

2. Related Work

The dynamical models used in many tracking algorithms are weak. Most models are linear with Gaussian process noise, including simple first- and second-order Markov models [3, 9], and auto-regressive (AR) models [14]. Such models are often suitable for low-dimensional problems and admit closed-form analysis, but they apply to a restricted class of systems. For high-dimensional data, the number of parameters that must be manually specified or learned for AR models is untenable. When used for people tracking they usually include large amounts of process noise, and thereby provide very weak temporal predictions.

Switching LDS and hybrid dynamics provide much richer classes of temporal behaviors [8, 14, 15]. Nevertheless, they are computationally challenging to learn, and require large amounts of training data, especially as the dimension of the state space grows. Non-parametric models can also handle complex motions, but they too require very large amounts of training data [11, 17]. Further, they do not produce a density function. Howe et al. [7] use mixture model density estimation to learn a distribution of short sequences of poses. Again, with such high-dimensional data, density estimation will have problems of under- and over-fitting unless one has vast amounts of training data.

One way to cope with high-dimensional data is to learn low-dimensional latent variable models. The simplest case involves a linear subspace projection with an AR dynamical process. In [2, 4] a subspace is first identified using PCA, after which a subspace AR model is learned. Linear models are tractable, but they often lack the ability to capture the complexities of human pose and motion.

* This work was supported in part by the Swiss National Science Foundation, NSERC Canada, and the Canadian Institute for Advanced Research. We thank A. Hertzmann and J. Wang for many useful discussions.
Richer parameterizations of human pose and motion can be found through nonlinear dimensionality reduction [5, 16, 18, 21]. Geometrical methods such as Isomap and LLE learn such embeddings, yielding mappings from the pose space to the latent space. But they do not provide a probabilistic density model over poses, a mapping back from pose space to latent space, nor a dynamical model. Thus one requires additional steps to construct an effective model. For example, Sminchisescu and Jepson [18] use spectral embedding, then a Gaussian mixture to model the latent density, an RBF mapping to reconstruct poses from latent positions, and a hand-specified first-order, linear dynamical model. Agarwal and Triggs [1] learn a mapping from silhouettes to poses using relevance vector machines, and then a second-order AR dynamical model. Rahimi et al. [16] learn an embedding through a nonlinear RBF regression with an AR dynamical model to encourage smoothness in the latent space. Our approach is similar in spirit, as this is a natural way to produce well-behaved latent mappings for time-series data. However, our model is probabilistic and allows for nonlinear dynamics.

We use a form of probabilistic dimensionality reduction similar in spirit to the Gaussian Process latent variable model (GPLVM) [10]. The GPLVM has been used to constrain human poses during interactive animation [6], as a prior for 2D upper-body pose estimation [19], and as a prior for 3D monocular people tracking [20]. While powerful, the GPLVM is a static model; it has no intrinsic dynamics and does not produce smooth latent paths from smooth time-series data. Thus, even with an additional dynamical model, our GPLVM-based people tracker often fails due to anomalous jumps in the latent space and to occlusions [20].

3. Gaussian Process Dynamical Model

The GPDM is a latent variable dynamical model, comprising a low-dimensional latent space, a probabilistic mapping from the latent space to the pose space, and a dynamical model in the latent space [22]. The GPDM is derived from a generative model for zero-mean poses $y_t \in \mathbb{R}^D$ and latent positions $x_t \in \mathbb{R}^d$, at time $t$, of the form

  $x_t = \sum_i a_i \phi_i(x_{t-1}) + n_{x,t}$   (1)

  $y_t = \sum_j b_j \psi_j(x_t) + n_{y,t}$   (2)

for weights $A = [a_1, a_2, \ldots]$ and $B = [b_1, b_2, \ldots]$, basis functions $\phi_i$ and $\psi_j$, and additive zero-mean white Gaussian noise $n_{x,t}$ and $n_{y,t}$. For linear basis functions, (1) and (2) represent the common subspace AR model (e.g., [4]). With nonlinear basis functions, the model is significantly richer.

In conventional regression (e.g., with AR models) one fixes the number of basis functions and then fits the model parameters, A and B. From a Bayesian perspective, A and B are nuisance parameters and should therefore be marginalized out through model averaging. With an isotropic Gaussian prior on each $b_j$, one can marginalize over B in closed form [12, 13] to yield a multivariate Gaussian data likelihood of the form

  $p(Y \mid X, \bar\beta) = \frac{|W|^N}{\sqrt{(2\pi)^{ND} |K_Y|^D}} \exp\!\left( -\tfrac{1}{2} \mathrm{tr}\!\left( K_Y^{-1} Y W^2 Y^T \right) \right)$   (3)

where $Y = [y_1, \ldots, y_N]^T$ is a matrix of training poses, $X = [x_1, \ldots, x_N]^T$ contains the associated latent positions, and $K_Y$ is a kernel matrix. The elements of the kernel matrix are defined by a kernel function, $(K_Y)_{i,j} = k_Y(x_i, x_j)$, which we take to be a common radial basis function (RBF) [12]:

  $k_Y(x, x') = \beta_1 \exp\!\left( -\tfrac{\beta_2}{2} \|x - x'\|^2 \right) + \frac{\delta_{x,x'}}{\beta_3}$   (4)

The scaling matrix $W \equiv \mathrm{diag}(w_1, \ldots, w_D)$ is used to account for the different variances in the different data dimensions; this is equivalent to a Gaussian Process (GP) with kernel function $k(x, x')/w_l^2$ for dimension $l$. Finally, $\bar\beta = \{\beta_1, \beta_2, \ldots, W\}$ comprises the kernel hyperparameters that control the output variance, the RBF support width, and the variance of the additive noise $n_{y,t}$.

The latent dynamics are similar; i.e., we form the joint density over latent positions and weights, A, and then we marginalize out A [22]. With an isotropic Gaussian prior on the $a_i$, the density over latent trajectories reduces to

  $p(X \mid \bar\alpha) = \frac{p(x_1)}{\sqrt{(2\pi)^{(N-1)d} |K_X|^d}} \exp\!\left( -\tfrac{1}{2} \mathrm{tr}\!\left( K_X^{-1} X_{out} X_{out}^T \right) \right)$   (5)

where $X_{out} = [x_2, \ldots, x_N]^T$, $K_X$ is the $(N{-}1) \times (N{-}1)$ kernel matrix constructed from $X_{in} = [x_1, \ldots, x_{N-1}]$, and $x_1$ is given an isotropic Gaussian prior. For dynamics the GPDM uses a "linear + RBF" kernel, with parameters $\alpha_i$:

  $k_X(x, x') = \alpha_1 \exp\!\left( -\tfrac{\alpha_2}{2} \|x - x'\|^2 \right) + \alpha_3 x^T x' + \frac{\delta_{x,x'}}{\alpha_4}$

The linear term is useful for motion subsequences that are approximately linear.

While the GPDM is defined above for a single input sequence, it is easily extended to multiple sequences $\{Y_j\}$. One simply concatenates all the input sequences, ignoring temporal transitions from the end of one sequence to the beginning of the next. Each input sequence is then associated with a separate sequence of latent positions, $\{X_j\}$, all within a shared latent space. Accordingly, in what follows, let $Y = [Y_1^T, \ldots, Y_m^T]^T$ be the $m$ training motions. Let X denote the associated latent positions, and for the definition of (5) let $X_{out}$ comprise all but the first latent position for each sequence. Let $K_X$ be the kernel matrix computed from all but the last latent position of each sequence.
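To make the marginalized likelihood concrete, the RBF kernel of (4) and the negative log of the data likelihood (3) can be written in a few lines of numpy. This is an illustrative sketch, not the authors' implementation; the helper names (`rbf_kernel`, `neg_log_likelihood`) and the hyperparameter values in the usage below are our own assumptions.

```python
import numpy as np

def rbf_kernel(X, Xp, beta1, beta2, beta3):
    """RBF kernel of Eq. (4): beta1*exp(-beta2/2 ||x-x'||^2) + delta_{x,x'}/beta3.
    The delta (noise) term is added only when the two inputs are the same set."""
    sq = np.sum((X[:, None, :] - Xp[None, :, :]) ** 2, axis=-1)
    K = beta1 * np.exp(-0.5 * beta2 * sq)
    if X is Xp:
        K = K + np.eye(len(X)) / beta3
    return K

def neg_log_likelihood(Y, X, beta1, beta2, beta3, w):
    """-ln p(Y | X, beta) from Eq. (3), dropping the constant (2*pi)^{ND} term.
    Y: N x D poses, X: N x d latent positions, w: per-dimension scales (diag of W)."""
    N, D = Y.shape
    KY = rbf_kernel(X, X, beta1, beta2, beta3)
    _, logdetK = np.linalg.slogdet(KY)
    YW = Y * w  # Y W, so that YW @ YW.T = Y W^2 Y^T
    return (0.5 * D * logdetK
            + 0.5 * np.trace(np.linalg.solve(KY, YW @ YW.T))
            - N * np.sum(np.log(w)))
```

Minimizing this quantity (plus the dynamics term (5) and the hyperparameter priors) with respect to X and the hyperparameters is what the learning procedure of Section 3.1 does.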
3.1. Learning

Learning the GPDM entails estimating the latent positions and the kernel hyperparameters. Following [22] we adopt simple prior distributions over the hyperparameters, i.e., $p(\bar\alpha) \propto \prod_i \alpha_i^{-1}$ and $p(\bar\beta) \propto \prod_i \beta_i^{-1}$,[1] with which the GPDM posterior becomes

  $p(X, \bar\alpha, \bar\beta \mid Y) \propto p(Y \mid X, \bar\beta)\, p(X \mid \bar\alpha)\, p(\bar\alpha)\, p(\bar\beta)$.   (6)

The latent positions and hyperparameters are found by minimizing the negative log posterior

  $L = \frac{d}{2} \ln |K_X| + \frac{1}{2} \mathrm{tr}\!\left( K_X^{-1} X_{out} X_{out}^T \right) - N \ln |W| + \frac{D}{2} \ln |K_Y| + \frac{1}{2} \mathrm{tr}\!\left( K_Y^{-1} Y W^2 Y^T \right) + \sum_i \ln \alpha_i + \sum_i \ln \beta_i + C$,   (7)

where C is a constant. The first two terms come from the log dynamics density (5), and the next three terms come from the log reconstruction density (3).

Figure 1. Golf Swing: (a) GPLVM, (b) GPDM and (c) balanced GPDM learned from 9 different golf swings performed by the same subject. (d) Volumetric visualization of reconstruction variance; warmer colors (i.e., red) depict lower variance.

Over-Fitting: While the GPDM has advantages over the GPLVM, usually producing much smoother latent trajectories, it can still produce large gaps between the latent positions of consecutive poses; e.g., Fig. 1 shows a GPLVM and a GPDM learned from the same golf swing data (large gaps are shown with red arrows). Such problems tend to occur when the training set includes a relatively large number of individual motions (e.g., from different people or from the same person performing an activity multiple times). The problem arises because of the large number of unknown latent coordinates and the fact that uncertainty in latent positions is not modeled. In practical terms, GPDM learning estimates the latent positions by simultaneously minimizing squared reconstruction errors in pose space and squared temporal prediction errors in the latent space. In Fig. 1 the pose space is 80D and the latent space is 3D, so it is not surprising that the errors in pose reconstruction dominate the objective function, and thus the latent positions.

3.2. Balanced GPDM

Ideally one should marginalize out the latent positions to learn hyperparameters, but this is computationally expensive. Instead, we propose a simple but effective modification of the GPDM to balance the influence of the dynamics and the pose reconstruction in learning. That is, we discount the differences in the pose and latent space dimensions in the two regressions by raising the dynamics density function in (6) to the ratio of their dimensions, i.e., $\lambda = D/d$; for learning this rescales the first two terms in (7) to be

  $\lambda \left( \frac{d}{2} \ln |K_X| + \frac{1}{2} \mathrm{tr}\!\left( K_X^{-1} X_{out} X_{out}^T \right) \right)$.   (8)

The resulting models are easily learned and very effective.

3.3. Model Results

Figures 1–4 show models learned from motion capture data. In each case, before minimizing L, the mean pose, $\mu$, was subtracted from the input pose data, and PCA or Isomap was used to obtain an initial latent embedding of the desired dimension. We typically use a 3D latent space as this is the smallest dimension for which we can robustly learn complex motions with stylistic variability. The hyperparameters were initially set to one. The negative log posterior L was minimized using Scaled Conjugate Gradient.

Golf Swing: Fig. 1 shows models learned from 9 golf swings from one subject (from the CMU database). The body pose was parameterized with 80 joint angles, and the sequence lengths varied by 15 percent. The balanced GPDM (Fig. 1(c)) produces smoother latent trajectories, and hence a more reliable dynamical model, than the original GPDM. Fig. 1(d) shows a volume visualization of the log variance of the reconstruction mapping, $\ln \sigma^2_{y|x,X,Y,\bar\beta}$, as a function of latent position. Warmer colors correspond to lower variances, and thus to latent positions to which the model assigns higher probability; this shows the model's preference for poses close to the training data.

[1] Such priors prefer small output scale (i.e., $\alpha_1, \alpha_3, \beta_1$), large RBF support (i.e., small $\alpha_2, \beta_2$), and large noise variances (i.e., small $\alpha_4^{-1}, \beta_3^{-1}$). The fact that the priors are improper is insignificant for optimization.

Walking: Figs. 2 and 3 show models learned from one gait cycle from each of 6 subjects walking at the same speed on a
Figure 2. Walking GPLVM: Learned from 1 gait cycle from each of 6 subjects. Plots show side and top views of the 3D latent space. Circles and arrows denote latent positions and temporal sequence.

Figure 3. Walking GPDM: Balanced GPDM learned from 1 gait cycle from 6 subjects. (a,b) Side and top views of 3D latent space. (c) Volumetric visualization of reconstruction variance. (d) Green trajectories are fair samples from the dynamics model.

Figure 4. Speed Variation: 2D models learned for 2 different subjects, each walking at 9 speeds ranging from 3 to 7 km/h. Red points are latent positions of training poses. Intensity is proportional to $-\ln \sigma^2_{y|x,X,Y,\bar\beta}$, so brighter regions have smaller pose reconstruction variance. The subject on the left is healthy while the one on the right has a knee pathology and walks asymmetrically.

treadmill. For each subject the first pose is replicated at the end of the sequence to encourage cyclical paths in the latent space. The body was parameterized with 20 joint angles. With the treadmill we do not have global position data, and hence we cannot learn the coupling between the joint angle time series and the global translational velocity.

Fig. 2 shows the large jumps in adjacent poses in the latent trajectories obtained with a GPLVM. By comparison, Fig. 3(a,b) show the smooth, clustered latent trajectories learned from the training data. Fig. 3(c) shows a volume visualization of the log reconstruction variance. Fig. 3(d) helps to illustrate the model dynamics by plotting 20 latent trajectories drawn at random from the dynamical model. The trajectories are smooth and close to the training data.

Speed Variation: Fig. 4 shows 2D GPDMs learned from two subjects, each of whom walked four gait cycles at each of 9 speeds between 3 and 7 km/h (equispaced). The learned latent trajectories are approximately circular and organized by speed; the innermost and outermost trajectories correspond to the slowest and fastest speeds respectively. Interestingly, the subject on the left is healthy while the subject on the right has a knee pathology. As the treadmill speed increases, the side of the body with the pathology performs the motion at slower speeds to avoid pain, and so the other side of the gait cycle must speed up to maintain the overall speed. This explains the anisotropy of the latent space.

3.4. Prior over New Motions

The GPDM also defines a smooth probability density over new motions $(Y', X')$. That is, just as we did with multiple sequences above, we write the joint density over the concatenation of the sequences. The conditional density of the new sequence is proportional to the joint density, but with the training data and latent positions held fixed:

  $p(X', Y' \mid X, Y, \bar\alpha, \bar\beta) \propto p([X, X'], [Y, Y'] \mid \bar\alpha, \bar\beta)$   (9)

This density can also be factored to provide:

  $p(Y' \mid X', X, Y, \bar\beta)\, p(X' \mid X, \bar\alpha)$.   (10)

For tracking we are typically given an initial state $x_0$, so instead of (10), we have

  $p(Y' \mid X', X, Y, \bar\beta)\, p(X' \mid X, \bar\alpha, x_0)$.   (11)

4. Tracking

Our tracking formulation is based on a state-space model, with a GPDM prior over pose and motion. The state at time $t$ is defined as $\phi_t = [G_t, y_t, x_t]$, where $G_t$ denotes the global position and orientation of the body, $y_t$ denotes the articulated joint angles, and $x_t$ is a latent position. The goal is to estimate a state sequence, $\phi_{1:T} \equiv (\phi_1, \ldots, \phi_T)$, given an image sequence, $I_{1:T} \equiv (I_1, \ldots, I_T)$, and a learned GPDM, $M \equiv (X, Y, \bar\alpha, \bar\beta)$. Toward that end there are
two common approaches: Online methods infer $\phi_t$ given the observation history $I_{1:t-1}$. The inference is causal, and usually recursive, but suboptimal as it ignores future data. Batch methods infer states $\phi_t$ given all past, present and future data, $I_{1:T}$. Inference is optimal, but requires all future images, which is impossible in many tracking applications.

Here we propose a compromise that allows some use of future data along with predictions from previous times. In particular, at each time $t$ we form the posterior distribution over a (noncausal) sequence of $\tau{+}1$ states

  $p(\phi_{t:t+\tau} \mid I_{1:t+\tau}, M) = c\, p(I_{t:t+\tau} \mid \phi_{t:t+\tau})\, p(\phi_{t:t+\tau} \mid I_{1:t-1}, M)$.   (12)

Inference of $\phi_t$ is improved with the use of future data, but at the cost of a small temporal delay.[2] With a Markov chain model one could use a forward-backward inference algorithm [23] in which separate beliefs about each state from past and future data are propagated forward and backward in time. Here, instead, we consider the posterior over the entire window, without requiring the Markov factorization.

With the strength of the GPDM prior, we also assume that we can use hill-climbing to find good state estimates (i.e., MAP estimates). In effect, we assume a form of approximate recursive estimation:

  $p(\phi_{t:t+\tau} \mid I_{1:t+\tau}, M) \approx c\, p(I_{t:t+\tau} \mid \phi_{t:t+\tau})\, p(\phi_{t:t+\tau} \mid \phi^{MAP}_{1:t-1}, M)$   (13)

where $\phi^{MAP}_{1:t-1}$ denotes the MAP estimate history. This has the disadvantage that complete beliefs are not propagated forward. But with the temporal window we still exploit data over several frames, yielding smooth tracking.

At each time step we minimize the negative log posterior over states from time $t$ to time $t{+}\tau$. At this minimum we obtain the approximate MAP estimate at time $t$. The estimate is approximate in two ways. First, we do not represent and propagate uncertainty forward from time $t{-}1$ in (13). Second, because previous MAP estimates are influenced by future data, the information propagated forward is biased.

Figure 5. WSL Tracks: The 2D tracked regions for the different tracked sequences (in yellow) are noisy and sometimes missing.

Image Likelihood: The current version of our 3D tracker uses a simplistic observation model. That is, the image observations are the approximate 2D image locations of a small number (J) of 3D body points (see Fig. 5). They were obtained with the WSL image-based tracker [9].

While measurement errors in tracking are often correlated over time, as is common we assume that image measurements conditioned on states are independent; i.e.,

  $p(I_{t:t+\tau} \mid \phi_{t:t+\tau}) = \prod_{i=t}^{t+\tau} p(I_i \mid \phi_i)$.   (14)

Further, we assume zero-mean Gaussian measurement noise in the 2D image positions provided by the tracker. Let the perspective projection of the $j$-th body point, $p_j$, in pose $\phi_t$, be denoted $P(p_j(\phi_t))$, and let the associated 2D image measurement from the tracker be $\hat m^j_t$. Then, the negative log likelihood of the observations at time $t$ is

  $-\ln p(I_t \mid \phi_t) = \frac{1}{2\sigma_e^2} \sum_{j=1}^{J} \| \hat m^j_t - P(p_j(\phi_t)) \|^2$.   (15)

Here we set $\sigma_e = 10$ pixels, based on empirical results.

Prediction Distribution: We factor the prediction density $p(\phi_{t:t+\tau} \mid \phi^{MAP}_{1:t-1}, M)$ into a prediction over global motion, and one over poses y and latent positions x. The reason, as discussed above, is that our training sequences did not contain the global motion. So, we assume that

  $p(\phi_{t:t+\tau} \mid \phi^{MAP}_{1:t-1}, M) = p(X_t, Y_t \mid x^{MAP}_{t-1}, M)\, p(G_{t:t+\tau} \mid G^{MAP}_{t-1:t-2})$,   (16)

where $X_t \equiv x_{t:t+\tau}$ and $Y_t \equiv y_{t:t+\tau}$.

For the global rotation and translation, $G_t$, we assume a second-order Gauss-Markov model. The negative log transition density is, up to an additive constant,

  $-\ln p(G_t \mid G^{MAP}_{t-1:t-2}) = \frac{\|G_t - \hat G_t\|^2}{2\sigma_G^2}$,   (17)

where the mean prediction is just $\hat G_t = 2 G^{MAP}_{t-1} - G^{MAP}_{t-2}$.

For the prior over $X_t, Y_t$, we approximate the GPDM in two ways. First we assume that the density over the pose sequence, $p(Y_t \mid X_t, M)$, can be factored into the densities over individual poses. This is convenient computationally since the GPDM density over a single pose, given a latent position, is Gaussian [6, 20]. Thus we obtain

  $-\ln p(Y_t \mid X_t, M) \approx -\sum_{j=t}^{t+\tau} \ln p(y_j \mid x_j, \bar\beta, X, Y) = \sum_{j=t}^{t+\tau} \left[ \frac{\| W (y_j - \mu_Y(x_j)) \|^2}{2\sigma^2(x_j)} + \frac{D}{2} \ln \sigma^2(x_j) + \frac{1}{2} \|x_j\|^2 \right]$   (18)

where the mean and variance are given by

  $\mu_Y(x) = \mu + Y^T K_Y^{-1} k_Y(x)$,   (19)

  $\sigma^2(x) = k_Y(x, x) - k_Y(x)^T K_Y^{-1} k_Y(x)$,   (20)

[2] However, an online estimate of $\phi_{t+\tau}$ would still be available at $t{+}\tau$.
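The reconstruction mean and variance of (19) and (20) are standard Gaussian-process prediction. A minimal numpy sketch, assuming a plain RBF kernel with additive noise standing in for the $\delta_{x,x'}/\beta_3$ term (the function names and the `noise_var` default are our assumptions, not from the paper; `Y0` holds the mean-subtracted training poses):

```python
import numpy as np

def rbf(A, B, beta1=1.0, beta2=1.0):
    """beta1 * exp(-beta2/2 ||a - b||^2) between rows of A and rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return beta1 * np.exp(-0.5 * beta2 * sq)

def gp_reconstruction(x_star, X, Y0, mu, noise_var=1e-2):
    """Pose mean and variance at latent point x_star, per Eqs. (19)-(20)."""
    KY = rbf(X, X) + noise_var * np.eye(len(X))   # K_Y with additive noise
    k_star = rbf(X, x_star[None, :])[:, 0]        # the vector k_Y(x)
    alpha = np.linalg.solve(KY, k_star)           # K_Y^{-1} k_Y(x)
    mean = mu + Y0.T @ alpha                      # Eq. (19)
    var = rbf(x_star[None, :], x_star[None, :])[0, 0] - k_star @ alpha  # Eq. (20)
    return mean, var
```

The behavior matches the volume visualizations in Figs. 1(d) and 3(c): near the training latent positions the variance is small, while far from the data the mean reverts to the mean pose and the variance to the prior output scale.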
Figure 6. Tracking 63 frames of a walking motion, with noisy and missing data. The skeleton of the recovered 3D model is projected onto the images. The points tracked by WSL are shown in red.

and $k_Y(x)$ is the vector with elements $k_Y(x, x_j)$ for all other latent positions $x_j$ in the model.

Second, we anneal the dynamics $p(X_t \mid x^{MAP}_{t-1}, M)$, because the learned GPDM dynamics often differ in important ways from the video motion. The most common problem occurs when the walking speed in the video differs from the training data. To accommodate this, we effectively blur the dynamics; this is achieved by raising the dynamics density to a small exponent, simply by using a smaller value of $\lambda$ in (8), for which the kernel matrix must also be updated to include $X_t$. For tracking, we fix $\lambda = 0.5$.

Optimization: Tracking is performed by minimizing the approximate negative log posterior in (13). With the approximations above this becomes

  $E = -\sum_{j=t}^{t+\tau} \ln p(I_j \mid \phi_j) - \sum_{j=t}^{t+\tau} \ln p(G_j \mid G^{MAP}_{j-1:j-2}) - \ln p(X_t \mid \bar\alpha, X) - \sum_{j=t}^{t+\tau} \ln p(y_j \mid x_j, \bar\beta, X, Y)$   (21)

To minimize E in (21) with respect to $\phi_{t:t+\tau}$, we find that the following procedure helps to speed up convergence, and to reduce getting trapped in local minima. Each new state is first set to be the mean prediction, and then optimized in a temporal window. For the experiments we use $\tau = 2$.

Algorithm 1 Optimization Strategy (at each time step t)
  $x_{t+\tau} \leftarrow \mu_X(x_{t+\tau-1}) = X_{out}^T K_X^{-1} k_X(x_{t+\tau-1})$
  $y_{t+\tau} \leftarrow \mu_Y(x_{t+\tau}) = \mu + Y^T K_Y^{-1} k_Y(x_{t+\tau})$
  $G_{t+\tau} \leftarrow 2 G_{t+\tau-1} - G_{t+\tau-2}$
  for n = 1 ... iter do
    $X_t \leftarrow$ min E with respect to $X_t$
    $\phi_{t:t+\tau} \leftarrow$ min E with respect to $\phi_{t:t+\tau}$
  end for
  $X_t \leftarrow$ min E with respect to $X_t$

One can also significantly speed up the minimization when one knows that the motion of the tracked object is very similar to the training motions. In that case, one can assume that there is negligible uncertainty in the reconstruction mapping, and hence a pose is directly given by $y = \mu_Y(x)$. This reduces the pose reconstruction likelihood in (18) to $\frac{D}{2} \ln \sigma^2(x) + \frac{1}{2} \|x\|^2$, and the state at $t$ to $\phi_t = (G_t, x_t)$, which can be optimized straightforwardly.

5. Tracking Results

Here we focus on tracking different styles and speeds for the same activity. We use the balanced GPDM shown in Fig. 3 for tracking all walking sequences below. In Fig. 6 we use a well-known sequence to demonstrate the robustness of our algorithm to data loss. In the first frame, we supply nine 2D points: the head, left shoulder, left hand, both knees and feet, and the center of the spine (the root). They are then tracked automatically using WSL [9]. As shown in Fig. 5(d) the tracked points are very noisy; the right knee is lost early in the sequence and the left knee is extremely inaccurate. By the end of the sequence the right foot and left hand are also lost. Given such poor input, our algorithm can nevertheless recover the correct 3D motion, as shown by the projections of the skeleton onto the original images.

While better image measurements can be obtained for this sequence, this is not always an option when there are occlusions and image clutter. E.g., Fig. 7 depicts a cluttered scene in which the subject becomes hidden by a shrub; only the head remains tracked by the end of the sequence (see Fig. 5(e)). For these frames only the global translation is effectively constrained by the image data, so the GPDM plays a critical role. In Fig. 7, note how the projected skeleton still appears to walk naturally behind the shrub.

Figure 8 shows a sequence in which the subject is completely occluded for a full gait cycle. When the occlusion begins, the tracking is governed mainly by the prior.[3] The 3D tracker is then switched back on, and the global motion during the occlusion is refined by linear interpolation between the 3D tracked poses before and after the occlusion. Before an occlusion, it is very important to have a good estimate of x, as subsequent predictions depend significantly

[3] We manually specify the beginning and end of the occlusion. We use a template-matching 2D detector to automatically re-initialize WSL after the occlusion, as shown in Fig. 5(c).
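The mean-prediction initialization step of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes an RBF-only dynamics kernel (the paper's $k_X$ also has a linear term), and the helper names and `noise` default are hypothetical. The returned guesses would then be refined by minimizing E in (21) with a gradient-based optimizer.

```python
import numpy as np

def rbf(A, B, out_scale=1.0, inv_width=1.0):
    """out_scale * exp(-inv_width/2 ||a - b||^2) between rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return out_scale * np.exp(-0.5 * inv_width * sq)

def init_new_state(X, Xin, Xout, Y0, mu, G_hist, x_prev, noise=1e-2):
    """Mean predictions for the newest state in the window (Algorithm 1, top).
    X: all latent positions; Xin/Xout: dynamics inputs/outputs; Y0: mean-subtracted
    poses; mu: mean pose; G_hist: past global motions; x_prev: previous latent point."""
    # x_{t+tau} <- mu_X(x_prev) = Xout^T K_X^{-1} k_X(x_prev)
    KX = rbf(Xin, Xin) + noise * np.eye(len(Xin))
    x_new = Xout.T @ np.linalg.solve(KX, rbf(Xin, x_prev[None, :])[:, 0])
    # y_{t+tau} <- mu_Y(x_new) = mu + Y^T K_Y^{-1} k_Y(x_new), Eq. (19)
    KY = rbf(X, X) + noise * np.eye(len(X))
    y_new = mu + Y0.T @ np.linalg.solve(KY, rbf(X, x_new[None, :])[:, 0])
    # G_{t+tau} <- 2 G_{t+tau-1} - G_{t+tau-2}, the second-order Gauss-Markov mean
    g_new = 2.0 * G_hist[-1] - G_hist[-2]
    return x_new, y_new, g_new
```

With these mean predictions as the starting point, the alternating minimization over $X_t$ and then the full window $\phi_{t:t+\tau}$ proceeds as in Algorithm 1.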
Figure 7. Tracking 56 frames of a walking motion with an almost total occlusion (just the head is visible) in a very cluttered and moving background. Note how the prior encourages realistic motion as occlusion becomes a problem.

Figure 8. Tracking 72 frames of a walking motion with a total occlusion. During the occlusion the tracker is switched off and the mean prediction is used. Note the quality of the tracking before and after the occlusion and the plausible motion during it.

on the latent position. To reduce the computational cost of estimating the latent positions with great accuracy, we assume perfect reconstruction, i.e., $y = \mu_Y(x)$, and use the second algorithm described in Section 4.

The latent coordinates obtained by the tracker for all of the above sequences are shown in Fig. 10. The trajectories are smooth and reasonably close to the training data. Further, while the training gait period was 32 frames, these three sequences involve gait periods ranging from 22 to 40 frames (by comparison, natural walking gaits span about 1.5 octaves). Thus the prior generalizes well to different speeds.

To demonstrate the ability of the model to generalize to different walking styles, we also track the exaggerated walk shown in Fig. 9. Here, the subject's motion is exaggerated and stylistically unlike the training motions; this includes the stride length, the lack of bending of the limbs, and the rotation of the shoulders and hips. Despite this, the 3D tracker does an excellent job. The last two rows of Fig. 9 show the inferred poses with a simple character, shown from two viewpoints, one of which is quite different from that of the camera. The latent coordinates obtained by the tracker are shown in Fig. 10; the distance of the trajectory from the training data is a result of the unusual walking style.

6. Conclusions

We have introduced the balanced GPDM for learning smooth prior models of human pose and motion for 3D people tracking. We showed that these priors can be learned from modest amounts of training motion including stylistic diversity. Further, they are shown to be effective for tracking a range of human walking styles, despite weak and noisy image measurements and significant occlusions. The quality of the results, in light of such a simple measurement model, attests to the utility of the GPDM priors.

References
[1] Agarwal, A. and Triggs, B.: Recovering 3D human pose from monocular images. To appear, IEEE Trans. PAMI, 2005.
[2] Bissacco, A.: Modeling and learning contact dynamics in human motion. Proc. CVPR, V1, pp. 421-428, San Diego, 2005.
[3] Choo, K., Fleet, D.: People tracking using hybrid Monte Carlo filtering. Proc. ICCV, V2, pp. 321-328, Vancouver, 2001.
[4] Doretto, G., Chiuso, A., Wu, Y.N., and Soatto, S.: Dynamic textures. IJCV, 51(2):91-109, 2003.
[5] Elgammal, A., Lee, C.: Inferring 3D body pose from silhouettes using activity manifold learning. Proc. CVPR, V2, pp. 681-688, Washington, 2004.
[6] Grochow, K., Martin, S., Hertzmann, A., Popovic, Z.: Style-based inverse kinematics. Proc. SIGGRAPH, pp. 522-531, 2004.
[7] Howe, N.R., Leventon, M.E., and Freeman, W.T.: Bayesian reconstructions of 3D human motion from single-camera video. NIPS 12, pp. 281-288, MIT Press, 2000.
[8] Isard, M. and Blake, A.: A mixed-state Condensation tracker with automatic model-switching. Proc. ICCV, pp. 107-112, Mumbai, 1998.
Figure 9. Tracking 37 frames of an exaggerated gait. Note that the results are very accurate even though the style is very different from any of the training motions. The last two rows depict two different views of the 3D inferred poses of the second row.

Figure 10. Tracked Latent Positions: Side and top views of the 3D latent space show the latent trajectories for the tracking results of Figs. 6, 7, 8, and 9 in red, blue, black, and green. The learned model latent positions are cyan.

[9] Jepson, A.D., Fleet, D.J., El-Maraghi, T.: Robust on-line appearance models for vision tracking. IEEE Trans. PAMI, 25(10):1296-1311, 2003.
[10] Lawrence, N.D.: Gaussian process latent variable models for visualisation of high dimensional data. NIPS 16, pp. 329-336, MIT Press, 2004.
[11] Lee, J., Chai, J., Reitsma, P., Hodgins, J., Pollard, N.: Interactive control of avatars animated with human motion data. Proc. SIGGRAPH, pp. 491-500, 2002.
[12] MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[13] Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics No. 118, Springer-Verlag, 1996.
[14] North, B., Blake, A., Isard, M., and Rittscher, J.: Learning and classification of complex dynamics. IEEE Trans. PAMI, 25(9):1016-1034, 2000.
[15] Pavlovic, J.M., Rehg, J., and MacCormick, J.: Learning switching linear models of human motion. NIPS 13, pp. 981-987, MIT Press, 2000.
[16] Rahimi, A., Recht, B., Darrell, T.: Learning appearance manifolds from video. Proc. CVPR, pp. 868-875, San Diego, 2005.
[17] Sidenbladh, H., Black, M.J., Sigal, L.: Implicit probabilistic models of human motion for synthesis and tracking. Proc. ECCV, pp. 784-800, Copenhagen, 2002.
[18] Sminchisescu, C., Jepson, A.: Generative modeling for continuous non-linearly embedded visual inference. Proc. ICML, Banff, 2004.
[19] Tian, T., Li, R., Sclaroff, S.: Articulated pose estimation in a learned smooth space of feasible solutions. CVPR Learning Workshop, San Diego, 2005.
[20] Urtasun, R., Fleet, D.J., Hertzmann, A., Fua, P.: Priors for people tracking from small training sets. Proc. ICCV, V1, pp. 403-410, Beijing, 2005.
[21] Wang, Q., Xu, G., Ai, H.: Learning object intrinsic structure for robust visual tracking. Proc. CVPR, V2, pp. 227-233, Madison, 2003.
[22] Wang, J., Fleet, D.J., Hertzmann, A.: Gaussian Process dynamical models. NIPS 18, MIT Press, 2005.
[23] Weiss, Y.: Interpreting images by propagating Bayesian beliefs. NIPS 9, pp. 908-915, MIT Press, 1997.