Choi ECCV12 presentation
1. A Unified Framework for Multi-Target Tracking and Collective Activity Recognition
Wongun Choi and Silvio Savarese
University of Michigan, Ann Arbor
5. Our Goal
• Multiple target tracking
• Recognize activities at different levels of granularity
  – Atomic activity & pose
[Figure: pedestrians annotated with atomic labels, e.g. Walking / Facing-back, Walking / Facing-front]
6. Our Goal
• Multiple target tracking
• Recognize activities at different levels of granularity
  – Atomic activity & pose
  – Pairwise interaction
[Figure: pairs annotated with interaction labels, e.g. Walking-Side-by-Side, Moving-to-Opposite]
7. Our Goal
• Multiple target tracking
• Recognize activities at different levels of granularity
  – Atomic activity & pose
  – Pairwise interaction
  – Collective activity
[Figure: scene labeled with the collective activity Crossing]
8. Our Goal
• Multiple target tracking
• Recognize activities at different levels of granularity
  – Atomic activity & pose
  – Pairwise interaction
  – Collective activity
• Solve all problems jointly!!
[Figure: scene labeled with the collective activity Crossing]
9. Background
Prior work has investigated each problem largely in isolation:
• Target Tracking: Wu et al. 2007; Avidan 2007; Zhang et al. 2008; Breitenstein et al. 2009; Ess et al. 2009; Wojek et al. 2009; Geiger et al. 2011; Brendel et al. 2011; Pirsiavash et al. 2011
• Atomic Activity: Bobick & Davis 2001; Efros et al. 2003; Schuldt et al. 2004; Dollar et al. 2005; Niebles et al. 2006; Laptev et al. 2008; Rodriguez et al. 2008; Wang & Mori 2009; Gupta et al. 2009; Liu et al. 2009; Marszalek et al. 2009; Liu et al. 2011
• Pairwise Interaction: Zhou et al. 2008; Ryoo & Aggarwal 2009; Yao et al. 2010; Choi et al. 2010; Patron-Perez et al. 2010
• Collective Activity: Choi et al. 2009; Li et al. 2009; Lan et al. 2010; Ryoo & Aggarwal 2010; Choi et al. 2011; Khamis et al. 2011; Lan et al. 2012; Khamis et al. 2012; Amer et al. 2012
Hierarchy of activities: Lan et al. 2010; Khamis et al. 2012; Amer et al. 2012
15. Contributions
• Bottom-up activity understanding.
• Contextual information propagates top-down.
[Figure: the collective activity Crossing guiding track association, compared with a simple social force model based on repulsion & attraction (Pellegrini et al. 2009; Choi et al. 2010; Leal-Taixe et al. 2012; etc.)]
19. Hierarchical Activity Model
[Figure: factor graph unrolled over frames t-1, t, t+1. A collective activity variable C (with observation O_C) sits above pairwise interaction variables I_12, I_13, I_23; these sit above atomic activity variables A_1, A_2, A_3, each grounded in a tracklet observation O_1, O_2, O_3 (O_A(t)).]
20. Hierarchical Activity Model
[Figure: the same factor graph, showing the temporal links between the C, I, and A variables across frames t-1, t, t+1.]
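As a rough schematic of how such a model is typically written (our own notation and an assumption about the exact factorization, not copied from the paper), the joint energy sums the local potentials discussed on the following slides:

$$
\Psi(C, I, A, f \mid O) \;=\; \psi(A, O_A) \;+\; \psi(I, A, f) \;+\; \psi(C, I) \;+\; \psi(C, O_C) \;+\; \psi_{\mathrm{time}}(C, I, A) \;+\; \psi(f)
$$

Here the terms correspond, in order, to the atomic-observation potential (slide 21), the interaction-atomic potential (slides 22-23), the collective-interaction potential (slides 24-25), the collective-observation potential (slide 26), the activity transition potential (slide 27), and the trajectory estimation term (slide 28).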
21. Atomic-Observation Potential
Atomic Activity Models
• Action: BoW with STIP (Dollar et al. 06; Niebles et al. 07)
• Pose: HoG (Dalal and Triggs, 05)
[Figure: the atomic-observation potential highlighted in the factor graph.]
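As a minimal sketch of the action branch on this slide (a generic BoW-over-STIP pipeline; the function names, use of scikit-learn, and codebook size are our assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(train_stip_descriptors: np.ndarray, k: int = 200) -> KMeans:
    """Cluster STIP descriptors pooled from training videos into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_stip_descriptors)

def bow_action_descriptor(stip_descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """L1-normalized histogram of visual-word assignments for one tracklet."""
    words = codebook.predict(stip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # guard against empty tracklets
```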
22. Interaction-Atomic Potential
[Figure: a learned model for the interaction I: Standing-in-a-line. The observed atomic activities (A: Standing, Facing-left; A: Standing, Facing-left) are compatible with the model, so the potential is high.]
23. Interaction-Atomic Potential
[Figure: the same Standing-in-a-line model. The observed atomic activities (A: Standing, Facing-right; A: Standing, Facing-left) are incompatible with the model, so the potential is low.]
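Schematically (our notation; the pairwise feature $\phi$ is an illustrative assumption), this potential scores how well each pair's atomic labels and relative geometry agree with the hypothesized interaction label:

$$
\psi(I, A, f) \;=\; \sum_{(i,j)} \mathbf{w}_{I_{ij}}^{\top}\, \phi\big(A_i, A_j, f_i, f_j\big)
$$

where $\phi$ encodes, e.g., the two atomic activity and pose labels and the relative location of targets $i$ and $j$ along their associated tracks, so that compatible configurations (standing, facing left, located in a line) score high and incompatible ones score low.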
24. Collective-Interaction Potential
[Figure: a learned model for the collective activity C: Queuing. The observed interactions (I: one-after-the-other, I: one-after-the-other, I: standing-side-by-side) are compatible with the model, so the potential is high.]
25. Collective-Interaction Potential
[Figure: the same Queuing model. The observed interactions (I: facing-opposite-side, I: facing-each-other, I: facing-each-other) are incompatible with the model, so the potential is low.]
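In the same schematic spirit (again our notation, not the paper's exact form), this potential rewards interaction labels that co-occur frequently with the hypothesized collective activity:

$$
\psi(C, I) \;=\; \sum_{(i,j)} \mathbf{w}^{\top}\, \phi\big(C, I_{ij}\big)
$$

where $\phi(C, I_{ij})$ is an indicator over (collective activity, interaction) label pairs, so $\mathbf{w}$ stores a learned co-occurrence score for each combination: high for (Queuing, one-after-the-other), low for (Queuing, facing-each-other).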
26. Collective-Observation Potential
Collective Activity
• STL descriptor of all targets (Choi et al. 09)
[Figure: the collective-observation potential highlighted in the factor graph.]
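To give the flavor of such a crowd-context style descriptor (a loose sketch only; the actual STL descriptor of Choi et al. 09 differs in its binning and temporal pooling, and every parameter below is an illustrative assumption):

```python
import numpy as np

def crowd_context(anchor_xy, people_xy, people_pose,
                  n_poses=8, ring_edges=(2.0, 6.0)):
    """Histogram of neighbors per (distance ring, pose label) around one anchor.

    people_pose: integer pose label in [0, n_poses) per neighbor.
    """
    desc = np.zeros((len(ring_edges), n_poses))
    for xy, pose in zip(people_xy, people_pose):
        d = np.linalg.norm(np.asarray(xy, float) - np.asarray(anchor_xy, float))
        for ring, edge in enumerate(ring_edges):
            if d <= edge:            # assign neighbor to the innermost ring it fits
                desc[ring, pose] += 1
                break
    return desc.ravel()              # flattened (ring x pose) histogram
```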
27. Activity Transition Potential
Smooth activity transitions across adjacent frames.
[Figure: temporal links between the C, I, and A variables at t-1, t, t+1 highlighted in the factor graph.]
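A minimal sketch of one such smoothness term, written for the collective activity chain (analogous terms would tie the I and A chains across frames; the exact form is our assumption):

$$
\psi_{\mathrm{time}}\big(C^{(t)}, C^{(t+1)}\big) \;=\; \mathbf{w}^{\top}\, \phi\big(C^{(t)}, C^{(t+1)}\big)
$$

where $\phi$ indicates the label pair, so transitions between identical or frequently consecutive labels receive higher scores.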
28. Trajectory Estimation
[Figure: the trajectory estimation (tracklet association) term highlighted in the factor graph.]
30. Tracklet Association Problem
Input: fragmented trajectories (tracklets), due to
• Detector failures
• Occlusion between targets
• Scene clutter
• etc.
Output: a set of trajectories with correct IDs (color).
31. Tracklet Association Model
Simple match costs, c:
• Location affinity
• Appearance/color
• ...
[Figure: association hypotheses f between tracklets in the factor graph; "?" marks matches where these cues are ambiguous.]
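A minimal sketch of such a simple match cost, assuming ground-plane positions and color histograms per tracklet (the weights, speed cap, and Bhattacharyya appearance term are our illustrative choices, not the paper's):

```python
import numpy as np

def match_cost(end_pos, start_pos, gap_frames, end_hist, start_hist,
               w_loc=1.0, w_app=1.0, max_speed=2.0):
    """Cost of linking the end of one tracklet to the start of another
    `gap_frames` frames later; lower is better."""
    # Location affinity: penalize implied speed beyond a plausible walking speed.
    dist = np.linalg.norm(np.asarray(end_pos, float) - np.asarray(start_pos, float))
    loc_cost = max(dist / max(gap_frames, 1) - max_speed, 0.0)
    # Appearance: Bhattacharyya distance between normalized color histograms.
    app_cost = -np.log(np.sum(np.sqrt(end_hist * start_hist)) + 1e-9)
    return w_loc * loc_cost + w_app * app_cost
```

As the slide's question marks suggest, both terms degrade exactly where association matters most: at crossing points and among similarly dressed targets.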
32. Tracklet Association Model
[Figure: given the collective activity Crossing, a candidate tracklet association is proposed.]
33. Tracklet Association Model
[Figure: an association compatible with the learned model for Crossing gives rise to high energy.]
34. Tracklet Association Model
[Figure: an association incompatible with the learned model for Crossing gives rise to low energy.]
37. Inference
Alternate between two subproblems:
• Activity recognition given f: iterative belief propagation
• Tracklet association given C, I, A: novel branch and bound
[Figure: the factor graph split into the two subproblems; the example collective activity is Crossing.]
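The alternation itself can be sketched as below (a skeleton under our assumptions: the two solvers are passed in as callables standing in for iterative belief propagation and the branch-and-bound of slide 62):

```python
def joint_inference(f0, infer_activities, associate_tracklets, max_iters=10):
    """Alternate the two sub-solvers until the association stops changing.

    infer_activities(f)          -> (C, I, A)  # stands in for belief propagation
    associate_tracklets(C, I, A) -> f          # stands in for branch-and-bound
    """
    f = f0                                     # initial tracklet association
    C = I = A = None
    for _ in range(max_iters):
        C, I, A = infer_activities(f)          # activity recognition given f
        f_new = associate_tracklets(C, I, A)   # association given C, I, A
        if f_new == f:                         # fixed point: labels and
            break                              # association agree
        f = f_new
    return C, I, A, f
```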
38. Training
• Model weights are learned in a max-margin framework using a structural SVM (Tsochantaridis et al. 2004).
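For reference, the standard margin-rescaled structural SVM objective of Tsochantaridis et al. 2004 is the generic form below; the loss $\Delta$ and joint feature map $\Phi$ are model-specific and not spelled out on this slide:

$$
\min_{\mathbf{w},\,\xi \ge 0}\; \frac{1}{2}\|\mathbf{w}\|^2 + \lambda \sum_n \xi_n
\quad \text{s.t.}\quad
\mathbf{w}^{\top}\big(\Phi(x_n, y_n) - \Phi(x_n, y)\big) \;\ge\; \Delta(y_n, y) - \xi_n \quad \forall n,\; \forall y \ne y_n
$$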
43. Target Association
                            Tracklet
# of errors                 1556
Improvement over tracklets  0%
Results on the VSWS09 dataset.
44. Target Association
                            Tracklet   No Interaction
# of errors                 1556       1109
Improvement over tracklets  0%         28.73%
Results on the VSWS09 dataset.
45. Target Association
                            Tracklet   No Interaction   With Interaction
# of errors                 1556       1109             894
Improvement over tracklets  0%         28.73%           42.54%
Results on the VSWS09 dataset.
46. Target Association
                            Tracklet   No Interaction   With Interaction   With GT Activities
# of errors                 1556       1109             894                736
Improvement over tracklets  0%         28.73%           42.54%             52.76%
Results on the VSWS09 dataset.
47. Example Classification Result
Interaction labels
AP: approaching
FE: facing-each-other
SR: standing-in-a-row
...
48. Example Classification Result
Atomic activities
Action:
  W - walking
  S - standing
Pose (8 directions):
  L - left
  LF - left/front
  F - front
  RF - right/front
  etc.
49. Example Classification Result
Pair-interactions
• AP: approaching
• FE: facing-each-other
• SS: standing-side-by-side
• SQ: standing-in-a-queue
• ...
51. Example Classification Result
Tracklet association
Color/number: target ID
Solid boxes: tracklets
Dashed boxes: match hypotheses
52. Conclusion
• Propose a novel model for joint activity recognition and target association.
[Figure: single-frame hierarchy with C at the top, interactions I_12, I_13, I_23, atomic activities A_1, A_2, A_3, and observations O_1, O_2, O_3, O_C.]
53. Conclusion
• Propose a novel model for joint activity recognition and target association.
• High-level contextual information helps improve target association accuracy significantly.
[Figure: the Crossing example.]
54. Conclusion
• Propose a novel model for joint activity recognition and target association.
• High-level contextual information helps improve target association accuracy significantly.
• Best classification results on collective activity recognition to date.
55. Thanks to
Yingze Bao, Byungsoo Kim, Min Sun, Yu Xiang,
ONR and anonymous reviewers
62. Branch-and-Bound Association
• Branch
  – Divide the problem into two disjoint sub-problems, e.g.
    Q0 => f = [1, x, x, x, x, x, ...]
    Q1 => f = [0, x, x, x, x, x, ...]
  – Find the most ambiguous variable.
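A compact best-first branch-and-bound skeleton matching this slide's branching rule (generic; `score` and `upper_bound` are illustrative stand-ins for the paper's objective and bound, and for simplicity we branch on the first undecided variable rather than the most ambiguous one):

```python
import heapq

def branch_and_bound(n, score, upper_bound):
    """Maximize score(f) over binary association vectors f in {0,1}^n.

    score(f)       -> value of a complete assignment (list of 0/1)
    upper_bound(p) -> optimistic bound for a partial assignment (0/1/None entries)
    """
    best_val, best_f = float("-inf"), None
    root = [None] * n                        # all match variables undecided
    heap, tie = [(-upper_bound(root), 0, root)], 1
    while heap:
        neg_bound, _, partial = heapq.heappop(heap)
        if -neg_bound <= best_val:
            continue                         # prune: cannot beat incumbent
        undecided = [i for i, v in enumerate(partial) if v is None]
        if not undecided:
            val = score(partial)
            if val > best_val:
                best_val, best_f = val, partial
            continue
        i = undecided[0]
        for bit in (1, 0):                   # branch: Q0 fixes f_i = 1, Q1 fixes f_i = 0
            child = list(partial)
            child[i] = bit
            bound = upper_bound(child)
            if bound > best_val:
                heapq.heappush(heap, (-bound, tie, child))
                tie += 1
    return best_f, best_val
```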
Editor's Notes
Good afternoon. I'm Wongun Choi from the University of Michigan. This is joint work with my advisor, Silvio Savarese.
Consider a video sequence with multiple people.
Our goal is to understand the behavior of all individuals in the scene.
First, we want to estimate the trajectories of all individuals.
We also want to recognize the semantic activities of the people at different levels of granularity. Such activities include: [show1] single-person activities in isolation, which we call atomic activities, [show2] such as walking or standing, [show3] and the pose of individuals, [show4] such as facing-front or facing-back.
We also want to recognize the interplay between pairs of people, which we call interactions, for instance, [show1] walking side by side or [show2] moving in opposite directions.
And finally, we want to identify the overall behavior of all people in the scene, the collective activity, [show1] for example crossing.
Most importantly, we address all the problems in a unified framework.
So far, a large body of literature has proposed methods for [show1] tracking multiple targets and recognizing [show2] atomic activities, [show3] pairwise interactions, [show4] and collective activities [show5], but most of the time these problems are addressed in isolation. [show6] Some exceptions are shown here, but they only focus on solving atomic and collective activity recognition.
As opposed to previous works, we propose to solve all four problems in a joint fashion. Let me explain this concept in a little more detail.
Our model is able to transfer information in a bottom-up fashion. [show1] From the estimation of the trajectories of individuals and their atomic activities, we can obtain robust characterizations of interactions and collective activities.
For instance, [show1] if we observe these trajectories, [show2] we can easily infer that the activity is meeting and leaving, [show3] but if we see these trajectories, [show4] we will say it is a crossing activity.
At the same time, information flows from [show1] top to bottom so as to provide critical contextual information to the lower levels of the hierarchy: collective activities help understand interactions; interactions help understand atomic activities and track associations.
Let me give you an example of how activity understanding helps trajectory estimation. [show1] For example, if we are given this set of broken trajectories and [show2] we know that the underlying activity is meeting and leaving, we can interpret these trajectories as follows. [show3]
Now, given the same broken trajectories, if we know that the activity is crossing, [show1] we can interpret the trajectories as follows. [show2] A similar concept was also introduced as a social force model in previous works; however, these works considered only a few hand-designed types of interactions, such as [show3] repulsion and attraction. We generalize this concept and let high-level activity understanding guide the process of associating tracks.
3 minutes up to this!! This is an outline of today's talk.
Let's begin with our joint model.
Given a video with a set of short trajectory fragments, [show1] tracklets, the activities of all individuals are encoded as a hierarchical graphical model using a factor graph. The components of our factor graph model are shown on the left.
[show1] From each of the tracklets, we obtain an observation cue for each individual. [show2] Grounded on each observation, we model the atomic activity with variable A. [show3] We encode the pairwise interaction between individuals with variable I. [show4] Finally, we assign one collective activity variable, C, to characterize the overall behavior of the individuals. [show5] We also provide a top-down observation cue for variable C. [show6] By considering the temporal relationships among variables as well, our full model can be compactly represented as the graph shown here.
The model can be written as an energy function, as shown on the right side. [show1] The energy can be factorized into multiple local potentials, each of which encodes relationships among variables. Let's see the details of what each factor represents.
The highlighted potential encodes the compatibility between atomic activity models and the corresponding observations. [show1] Atomic activities are modeled by a BoW representation equipped with spatio-temporal interest point features, [show2] and poses are described using HOG.
The second potential, psi(I, A, f), captures the compatibility between interaction models and observations of atomic activities. For instance, [show1] here we illustrate a visualization of a possible learnt model for the "standing-in-a-line" interaction. This model captures the property that two people who "stand in a line" tend to be located nearby and face the same direction. [show2] Thus, if we are given these observations of atomic activities, standing and facing left, which are compatible with the learnt "standing-in-a-line" interaction, [show3] the potential is high.
On the other hand, if we are given these observations of atomic activities, standing, facing left, facing right, the compatibility with the model is weak, and thus [show1] the potential is low. [show2] Such a relationship can be compactly represented by the equation below.
Similarly, the potential psi(C, I) encodes the compatibility between collective activity models and a set of observations of pairwise interactions. [show1] For example, here we illustrate a visualization of a possible learnt model for the collective activity "queuing". This model captures the probability of occurrence of interaction labels such as "one-after-the-other" and "facing-each-other". For the collective activity "queuing", the interaction "one-after-the-other" is highly probable to occur; the interaction "facing-each-other" is much less so. [show2] Thus, if we observe this set of interactions, [show3] standing-side-by-side, [show4] one-after-the-other, [show5] and so on, together with queuing, the potential psi(C, I) is high.
On the other hand, [show1] if we observe these interactions, facing-each-other and facing-opposite-direction, [show2] the potential is low, since these are not compatible with the learned model for queuing. [show3] Such a relationship is encoded by this equation.
Similar to the bottom-up cues for atomic activities, the highlighted potential encodes the compatibility between collective activity models and the corresponding observations. [show1] These are obtained using the crowd context descriptor introduced by Choi et al. in 2009.
Also, we encode temporal smoothness between [show1] collective activities, [show2] interactions, [show3] and atomic activities in adjacent time frames.
The last term captures the potential related to trajectory estimation. Before we talk about it, let me briefly discuss how we define the tracking problem.
Suppose we have the activity "people crossing"; [show1] it is likely that our low-level observations won't be just a pair of clean tracks such as the red and black ones.
Rather, we observe a fragmented set of trajectories, which we call tracklets. This is because of [show1] detection failures, occlusions, scene clutter, and so on. Given such a set of initial inputs, our goal is [show2] to obtain trajectories with consistent IDs by associating the tracklets.
As in traditional track association problems, we introduce [show1] the variable f to capture association hypotheses. A simple solution is to have the association cost vector c encode properties such as [show2] location affinity, color similarity, and so on. As this example shows, these properties don't always work: [show3] location affinity becomes ambiguous at the point of crossing, and [show4] color or appearance similarity is not always reliable when people wear similar clothing.
In addition to traditional cues, we advocate that the interaction labels provide critical contextual information to guide the process of associating tracklets. [show1] Such information is encoded by the interaction potential that we discussed earlier. [show1] For instance, if we know the interaction is "crossing", this type of association [next]
[cont.] will give rise to high energy, [show1] since this association is compatible with the learnt model for crossing,
whereas this type of association [show1] will give rise to low energy. In the interest of time, we skip the details of the mathematics here.
Let’s see how we solve the inference problem and train the model.
The inference problem can be represented as finding the configuration of all variables that maximizes the joint energy function. [show1] The optimization, however, is computationally very demanding.
Thus, we introduce a new iterative method to solve the problem. [show1] Given an initial tracklet association, we obtain activity labels using [show2] iterative belief propagation, [show3] and given activity labels, [show4] we obtain the tracklet association using our novel branch-and-bound method. In the interest of time, we skip the details of each inference method.
Finally, we obtain the model parameters from a set of training data using a structural SVM framework.
Let’s discuss our experimental evaluation
For the evaluation, we use the collective activity dataset, proposed by us in 2009. In addition to the collective activity labels, we provide annotations for [show1] target identities, [show2] interactions between pairs of people, [show3] and atomic properties of individuals.
Also, we collect an additional dataset to test our framework, composed of 32 videos with 6 collective activities. We similarly provide labels for target identities, interactions, and atomic activities.
First, we compare the collective activity classification accuracy of a baseline method using the crowd context descriptor we introduced in 2009 and 2011, and of our full hierarchical representation. We obtain a good improvement, about 6%, in overall classification by utilizing the hierarchical structure in our model, tested on the collective activity dataset. Again, we observe a similar improvement, about 5%, on the new dataset.
Now, we analyze the tracklet association results. The first row shows the number of errors in tracklet matching, and the second row shows the % improvement over the input tracklets.
By solving the tracklet association without interaction cues, we obtain about a 30% improvement over the tracklets, i.e., about 400 fewer matching errors.
By incorporating our interaction model with the estimated activity labels, we obtain a 14% further improvement, i.e., about 200 fewer matching errors than the baseline without interaction.
Notice that if ground-truth activity labels are given, we obtain an upper bound on the target association error of 736, corresponding to a 53% improvement over the input. More quantitative evaluation can be found in the paper.
Here we show exemplar results obtained on the newly proposed dataset. The estimated collective activity label is overlaid on top, and interaction labels are displayed between pairs of people.
Now let me show results from a more complex sequence. Here, we show the estimated atomic activity label for each individual, overlaid below each bounding box, where ....
and interactions between pairs of people. SQ represents the interaction standing-in-a-queue.
Finally, we show the estimated collective activity variable on top.
Here, the video shows the tracklet association result we obtained using our unified framework. The color and number on top of each bounding box show the identity of the target, solid boxes represent tracklets, and dashed boxes show smoothed paths that associate two tracklets. As you can see, our model keeps the targets' identities consistent even after occlusion by associating tracklets correctly.
Finally, we show an exemplar comparison between the tracklet association results obtained without and with the interaction model. [show1] Due to the severe occlusion incurred by the car motion, [show2] the traditional method without interaction context loses the targets' identities and assigns new IDs to all targets after the occlusion. [show3] On the other hand, when we include the interaction in the association model, [show4] we can keep the identities of the targets even in such a challenging scenario.
In this paper, we proposed a novel model that seamlessly relates target tracking with atomic activity, interaction, and collective activity understanding. We also show that by solving all of these together, we can achieve a better understanding of high-level activities. Most interestingly, we show that high-level activities can help improve trajectory estimation.