1/26
Joint Action Recognition and Summarization
by Submodular Inference
Fairouz Hussein, Massimo Piccardi
University of Technology Sydney
UTS MMSP workshop, 12 April 2017
Fairouz Hussein, Massimo Piccardi Joint Action and Summary by Submodular Inference
2/26
Aims
• Action recognition in video and video summarization are
two well-established research areas
• In this work, we explore the benefits of performing them
jointly
• We leverage submodular inference throughout
• Work published at ICASSP 2016 and in press at ACM TOMM
3/26
Action recognition
• Action recognition in video is a well-established research
area
• Leverages local features (STIP, DTF, ITF, MBH, bags of
concepts, deep learning . . . )
• Has reached remarkable accuracy also in realistic
scenarios
• At short distance, depth cameras are helping achieve
greater accuracy (e.g., Stereolabs ZED)
4/26
Video summarization
• Video summarization, too, is a well-established research area
• Summary as a set or sequence of frames
• Leverages clustering, shot detection or “key” frame
detection
5/26
Joint action recognition and video summarization
• Action recognition and video summarization can be
performed independently, cascaded . . .
• Recognising actions from a subset of key frames is an established idea
• However, do the key frames meet the requirements of a
good summary, i.e. coverage and non-redundancy?
• Here, we attempt to perform them jointly with a single,
unified scoring function
6/26
Our graphical model
[Figure: graphical model — a chain of binary summary variables h_1, …, h_{t−1}, h_t, …, h_T over frame measurements x_1, …, x_T, all connected to the action label y]
• y: action class; h: binary variables: $h_t = 1$ → frame t is in the summary; x: frame measurements
• Scoring function:
$w^\top \psi(x, y, h) = w^\top \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y)$
7/26
Inference
• Given a trained model, w, and a measurement sequence,
x, how to find the best y, h?
• Inference:
$y^*, h^* = \arg\max_{y,h}\ w^\top \psi(x, y, h)$
• Inference is not just left-to-right because the nodes are all connected
• Number of possible summaries = $2^T$; a 1-minute video at 30 fps → $T \approx 1{,}800$
8/26
Submodularity
• Inference is rescued by submodularity
• A submodular function abides by the “law of diminishing returns”: for sets $A \subseteq B$ and an element $v \notin B$,
$f(A \cup \{v\}) - f(A) \geq f(B \cup \{v\}) - f(B)$
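The diminishing-returns property can be checked numerically. Below is a minimal sketch with a toy set-cover function; the sets and indices are invented for illustration, not from the paper:

```python
def coverage(S, sets):
    """f(S) = number of distinct frames covered by the chosen sets."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

# Toy ground sets (hypothetical): overlapping groups of frame indices.
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}}

# Diminishing returns: for A ⊆ B and v ∉ B,
# f(A ∪ {v}) - f(A) >= f(B ∪ {v}) - f(B).
A, B, v = {0}, {0, 1}, 2
gain_A = coverage(A | {v}, sets) - coverage(A, sets)  # 4 - 2 = 2
gain_B = coverage(B | {v}, sets) - coverage(B, sets)  # 4 - 3 = 1
assert gain_A >= gain_B
```

Adding set 2 to the larger collection B gains less because part of its contribution is already covered.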
9/26
Submodular inference
• The maximum achieved by a greedy algorithm over a monotone submodular function is $\geq (1 - 1/e) \approx 0.632$ times the actual maximum [Nemhauser et al. 1978]
• The complexity of the greedy algorithm is $O(T^2)$ vs $O(2^T)$
• We further constrain the summary to a maximum number of frames (budget):
$y^*, h^* = \arg\max_{y,h}\ w^\top \psi(x, y, h) \quad \text{s.t. } |h| = B$
• Complexity drops to $O(BT)$
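The greedy step can be sketched as follows. This is a minimal sketch, not the authors' implementation; in the paper the maximization is joint over y and h, which can be handled by running the greedy selection once per action class and keeping the best-scoring pair. The facility-location score and similarity values below are invented for illustration:

```python
import numpy as np

def greedy_summary(score, T, B):
    """Greedily maximize a monotone submodular score s.t. |S| <= B.

    Each of the B steps scans all T frames for the largest marginal
    gain, so the number of score evaluations is O(B*T).
    """
    S = set()
    for _ in range(B):
        best_t, best_gain = None, -np.inf
        for t in range(T):
            if t in S:
                continue
            gain = score(S | {t}) - score(S)
            if gain > best_gain:
                best_t, best_gain = t, gain
        S.add(best_t)
    return S

# Toy facility-location score: how well the picked frames "cover" all frames.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
score = lambda S: sim[:, sorted(S)].max(axis=1).sum() if S else 0.0
summary = greedy_summary(score, T=3, B=2)  # picks two dissimilar frames
```

With these similarities the greedy pass avoids the near-duplicate pair (frames 0 and 1) and returns one of them plus the dissimilar frame 2.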
10/26
Submodular score function: summary
• For the summary, we use a submodular score function:
$\phi(x_i, x_j, h_i, h_j) = \lambda(h_i, h_j)\, s(x_i, x_j)$
$\lambda(h_i, h_j) = \begin{cases} \lambda_1, & h_i = 1,\ h_j = 0 \quad (\text{coverage}) \\ -\lambda_2, & h_i = 1,\ h_j = 1 \quad (\text{non-redundancy}) \\ 0, & h_i = 0,\ h_j = 0 \end{cases}$
$\lambda_1, \lambda_2, s(x_i, x_j) \geq 0$
• $s(x_i, x_j)$ is a similarity between frames i and j
• It is easy to prove that this function is submodular
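A numerical sketch of this pairwise term. The Gaussian similarity and the unit λ values are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def summary_score(x, h, lam1=1.0, lam2=1.0):
    """Pairwise summary term: sum of λ(h_i, h_j) · s(x_i, x_j) over i ≠ j."""
    T = len(h)
    total = 0.0
    for i in range(T):
        for j in range(T):
            if i == j:
                continue
            s = np.exp(-np.sum((x[i] - x[j]) ** 2))  # similarity in (0, 1]
            if h[i] == 1 and h[j] == 0:
                total += lam1 * s   # coverage: pick i represents skipped j
            elif h[i] == 1 and h[j] == 1:
                total -= lam2 * s   # non-redundancy: similar picks penalized
    return total

x = np.array([[0.0], [0.05], [3.0]])   # frames 0 and 1 are near-duplicates
# Picking the two dissimilar frames scores higher than the duplicate pair:
assert summary_score(x, [1, 0, 1]) > summary_score(x, [1, 1, 0])
```

The coverage reward and redundancy penalty pull the summary toward frames that represent the rest of the video without repeating each other.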
11/26
Submodular score function: summary + action
• For action recognition, we add a unary term (the second sum):
$\psi(x, y, h) = \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y) + \sum_{i=1}^{T} \lambda_3\, \mathbb{I}[h_i = 1, y]\, x_i$
• In the ICASSP paper, we prove that this function is still submodular
• $w \geq 0$ → positive sum of submodular functions
12/26
Learning
• Learning: structural SVM
• Given a training set $\{x^n, h^n, y^n\}$, $n = 1 \ldots N$, we find:
$\arg\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } w^\top \psi(x^n, y^n, h^n) - w^\top \psi(x^n, y, h) \geq \Delta(y^n, y) - \xi_n, \quad n = 1 \ldots N,\ \forall (y, h) \in W$
• W is a small (i.e., polynomial) set of constraints that guarantees a solution arbitrarily close to the full solution
13/26
Learning: populating the constraint set
• In the structural SVM framework, the constraint set is
populated with the “most violating” labelings
• For every sample, the “most violating” labeling is given by:
$\bar{y}^n, \bar{h}^n = \arg\max_{y,h} \left[ w^\top \psi(x^n, y, h) + \Delta(y^n, y) \right]$
• “Loss-augmented inference”: same greedy algorithm as the regular inference
14/26
Learning: latent variables
• The ground truth for h is unknown! → the algorithm
alternates between these two steps until convergence:
Step 1:
$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } w^\top \left[ \psi(x^n, y^n, h^{n*}) - \psi(x^n, y, h) \right] \geq \Delta(y^n, y) - \xi_n, \quad \forall n,\ (y, h) \in W$
Step 2:
$h^{n*} = \arg\max_h\ w^\top \psi(x^n, y^n, h)$
• At the first iteration, h is initialized arbitrarily
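The alternation can be sketched as a generic loop. This is a structural sketch only: `fit_ssvm` and `best_h` are hypothetical placeholders standing in for the actual cutting-plane solver (Step 1) and the submodular completion of the latent variables (Step 2):

```python
def latent_ssvm(X, Y, init_H, fit_ssvm, best_h, max_iters=20):
    """Alternating optimization for the latent structural SVM.

    Step 1 (fit_ssvm): train w with the latent summaries H held fixed.
    Step 2 (best_h):   re-complete each latent summary under the new w.
    Stops when the summaries no longer change (convergence).
    """
    H = list(init_H)      # arbitrary initialization of the summaries
    w = None
    for _ in range(max_iters):
        w = fit_ssvm(X, Y, H)                               # Step 1
        new_H = [best_h(w, x, y) for x, y in zip(X, Y)]     # Step 2
        if new_H == H:
            break
        H = new_H
    return w, H

# Toy stand-ins to show the control flow only (not real learning):
fit_stub = lambda X, Y, H: sum(H)
h_stub = lambda w, x, y: 1
w, H = latent_ssvm([0, 0], [0, 1], [0, 0], fit_stub, h_stub)
```

With the stubs, the summaries converge after one update and the loop exits early, illustrating the stopping criterion.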
15/26
Method: recap
1. A submodular scoring function for summary + action:
$w^\top \psi(x, y, h) = w^\top \left[ \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y) + \sum_{i=1}^{T} \lambda_3\, \mathbb{I}[h_i = 1, y]\, x_i \right]$
2. A greedy inference algorithm with performance guarantees
3. Learning by latent structural SVM from an arbitrary initialisation of the summary variables
16/26
V-JAUNE: a measure for the quality of a video
summary
• Video summarization lacks a generally-accepted
performance measure which accounts for both the content
and the frame order
• We propose V-JAUNE:
$\Delta(h, \bar{h}) = \sum_{i=1}^{B} \delta(h_i, \bar{h}_i)$
$\delta(h_i, \bar{h}_i) = \min_j \left\{ \| x_{h_j} - x_{\bar{h}_i} \|^2 \right\}, \quad \text{s.t. } i - \epsilon \leq j \leq i + \epsilon$
($\epsilon$: a small temporal tolerance, so the measure scores both content and approximate frame order)
17/26
Multi-annotator V-JAUNE
• Multi-annotator version:
$\Delta(h^{1:M}, \bar{h}) = \sum_{m=1}^{M} \Delta(h^m, \bar{h})$
• We normalize it by the disagreement between the annotators:
$D = \frac{2}{M(M-1)} \sum_{p,q} \Delta(h^p, h^q), \quad p = 1 \ldots M,\ q = p{+}1 \ldots M$
$\Delta'(h^{1:M}, \bar{h}) = \Delta(h^{1:M}, \bar{h}) / D$
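The normalization can be computed as below (a sketch; `delta` is any summary loss, and the scalar "summaries" are stand-ins for illustration):

```python
from itertools import combinations

def normalized_v_jaune(delta, annotator_summaries, h_hat):
    """Multi-annotator V-JAUNE: total loss to the M reference summaries,
    divided by the annotators' mean pairwise disagreement D."""
    M = len(annotator_summaries)
    total = sum(delta(h_m, h_hat) for h_m in annotator_summaries)
    D = (2.0 / (M * (M - 1))) * sum(
        delta(p, q) for p, q in combinations(annotator_summaries, 2))
    return total / D

# Toy scalar stand-ins for summaries, with an absolute-difference loss:
delta = lambda a, b: abs(a - b)
score = normalized_v_jaune(delta, [0, 2], 1)  # total = 2, D = 2 → 1.0
```

A score of 1.0 means the candidate summary disagrees with the annotators about as much as they disagree with each other.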
18/26
Experiments
• Dataset: MSR Daily Activities 3D: A dataset with 16 action
classes (drink, eat, read book, call cellphone, write on a
paper, use laptop, use vacuum cleaner, cheer up, sit still,
toss paper, play game, lie down on sofa, walk, play guitar,
stand up, sit down), 320 instances from 10 actors
• RGB, depth and skeletal data from MS Kinect
19/26
Experiments
• Dataset: Actions for Cooking Eggs (ACE aka KCSGR): A
dataset where actors cook egg recipes using 8 atomic
actions (cutting, seasoning, peeling, boiling, turning,
baking, mixing and breaking); 161 instances for training,
95 for testing (different actors)
20/26
Experiments
• Measurements: STIP features only from depth frames,
encoded as VLAD (no RGB, no skeletons)
• Action recognition comparison:
• standard SVM
• the proposed method with no summary features and all
frames
• the proposed method with budget B = 10
• the proposed method on the RGB data
• results from the literature
• Summary comparison: sum of absolute differences (SAD)
21/26
Results: action recognition on MSR
Method                                     Accuracy
SVM                                        34.4%
Dynamic time warping                       54.0%
Proposed method (no summary, all frames)   48.8%
Proposed method                            60.6%
Proposed method (RGB videos)               46.3%
• The summary component helps the action recognition accuracy!
• Accuracy from depth frames is higher than from RGB, even without dedicated features (skeletal data would deliver higher accuracy, but are available only in limited situations)
22/26
Results: action recognition on ACE
Method                                     Accuracy
SVM                                        62.1%
PA-Pooling                                 72.2%
Proposed method (no summary, all frames)   54.7%
Proposed method (no summary, 10 frames)    66.3%
Proposed method                            77.9%
• The summary component helps, again
• Selecting the frames helps
23/26
Results: MSR summarization
• The proposed method's V-JAUNE score is also better than SAD's: 5.22 vs 5.65 (lower is better)
• In many cases, the summaries from the proposed method
(top) appear more appealing than SAD’s (bottom):
24/26
Results: ACE summarization
• Here the proposed method's V-JAUNE score is slightly worse than SAD's: 0.947 vs 0.927
• 10% summary supervision improves both the accuracy
and V-JAUNE: 81.1%, 0.926
• Example with the proposed method:
25/26
Conclusions
• The action recognition accuracy is higher than that of
comparable methods (depth data only)
• We obtain a summary as a by-product! Action recognition
and video summarization proved synergistic
• In many cases, the summaries are more appealing than
with a low-level method (SAD)
• Efficient inference and loss-augmented inference
(Hamming loss, V-JAUNE loss under review)
26/26
Any questions?
• Thank you very much for your attention!
• Any questions?
Fairouz Hussein, Massimo Piccardi
University of Technology Sydney, NSW, Australia