1/26
Joint Action Recognition and Summarization
by Submodular Inference
Fairouz Hussein, Massimo Piccardi
University of Technology Sydney
UTS MMSP workshop, 12 April 2017
Fairouz Hussein, Massimo Piccardi Joint Action and Summary by Submodular Inference
2/26
Aims
• Action recognition in video and video summarization are
two well-established research areas
• In this work, we explore the benefits of performing them
jointly
• We leverage submodular inference throughout
• Work published at ICASSP 2016 and in press at ACM TOMM
3/26
Action recognition
• Action recognition in video is a well-established research
area
• Leverages local features (STIP, DTF, ITF, MBH, bags of
concepts, deep learning . . . )
• Has reached remarkable accuracy also in realistic
scenarios
• At short distance, depth cameras are helping achieve
greater accuracy (e.g., Stereolabs ZED)
4/26
Video summarization
• Video summarization, too, is a well-established research area
• Summary as a set or sequence of frames
• Leverages clustering, shot detection or “key” frame
detection
5/26
Joint action recognition and video summarization
• Action recognition and video summarization can be
performed independently, cascaded . . .
• Recognising actions from a subset of key frames is an established idea
• However, do the key frames meet the requirements of a
good summary, i.e. coverage and non-redundancy?
• Here, we attempt to perform them jointly with a single,
unified scoring function
6/26
Our graphical model
[Figure: graphical model — a chain of binary summary variables h_1, …, h_{t−1}, h_t, …, h_T over frame measurements x_1, …, x_T, all connected to the action label y]
• y: action class; h: binary variables: $h_t = 1$ → frame t is in the summary; x: frame measurements
• Scoring function:
$w^\top \psi(x, y, h) = w^\top \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y)$
7/26
Inference
• Given a trained model, w, and a measurement sequence,
x, how to find the best y, h?
• Inference:
$y^*, h^* = \arg\max_{y,h}\ w^\top \psi(x, y, h)$
• Inference is not just left-to-right because the nodes are all connected
• Number of possible summaries = $2^T$; a 1-minute video at 30 fps → $T \approx 1{,}800$
8/26
Submodularity
• Inference is rescued by submodularity
• A submodular function abides by the “law of diminishing returns”: for sets $A \subseteq B$ and an element $v \notin B$,
$f(A \cup \{v\}) - f(A) \geq f(B \cup \{v\}) - f(B)$
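The diminishing-returns property can be checked numerically. Below is a minimal sketch with a toy set-cover function; the sets and indices are invented for illustration, not from the paper:

```python
def coverage(S, sets):
    """f(S) = number of distinct frames covered by the chosen sets."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

# Toy ground sets (hypothetical): overlapping groups of frame indices.
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}}

# Diminishing returns: for A ⊆ B and v ∉ B,
# f(A ∪ {v}) - f(A) >= f(B ∪ {v}) - f(B).
A, B, v = {0}, {0, 1}, 2
gain_A = coverage(A | {v}, sets) - coverage(A, sets)  # 4 - 2 = 2
gain_B = coverage(B | {v}, sets) - coverage(B, sets)  # 4 - 3 = 1
assert gain_A >= gain_B
```

Adding set 2 to the larger collection B gains less because part of its contribution is already covered.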
9/26
Submodular inference
• The maximum achieved by a greedy algorithm over a monotone submodular function is $\geq (1 - 1/e) \approx 0.632$ times the actual maximum [Nemhauser et al. 1978]
• The complexity of the greedy algorithm is $O(T^2)$ vs $O(2^T)$
• We further constrain the summary to a maximum number of frames (budget):
$y^*, h^* = \arg\max_{y,h}\ w^\top \psi(x, y, h) \quad \text{s.t. } |h| = B$
• Complexity drops to $O(BT)$
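The greedy step can be sketched as follows. This is a minimal sketch, not the authors' implementation; in the paper the maximization is joint over y and h, which can be handled by running the greedy selection once per action class and keeping the best-scoring pair. The facility-location score and similarity values below are invented for illustration:

```python
import numpy as np

def greedy_summary(score, T, B):
    """Greedily maximize a monotone submodular score s.t. |S| <= B.

    Each of the B steps scans all T frames for the largest marginal
    gain, so the number of score evaluations is O(B*T).
    """
    S = set()
    for _ in range(B):
        best_t, best_gain = None, -np.inf
        for t in range(T):
            if t in S:
                continue
            gain = score(S | {t}) - score(S)
            if gain > best_gain:
                best_t, best_gain = t, gain
        S.add(best_t)
    return S

# Toy facility-location score: how well the picked frames "cover" all frames.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
score = lambda S: sim[:, sorted(S)].max(axis=1).sum() if S else 0.0
summary = greedy_summary(score, T=3, B=2)  # picks two dissimilar frames
```

With these similarities the greedy pass avoids the near-duplicate pair (frames 0 and 1) and returns one of them plus the dissimilar frame 2.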
10/26
Submodular score function: summary
• For the summary, we use a submodular score function:
$\phi(x_i, x_j, h_i, h_j) = \lambda(h_i, h_j)\, s(x_i, x_j)$
$\lambda(h_i, h_j) = \begin{cases} \lambda_1, & h_i = 1,\ h_j = 0 \quad (\text{coverage}) \\ -\lambda_2, & h_i = 1,\ h_j = 1 \quad (\text{non-redundancy}) \\ 0, & h_i = 0,\ h_j = 0 \end{cases}$
$\lambda_1, \lambda_2, s(x_i, x_j) \geq 0$
• $s(x_i, x_j)$ is a similarity between frames i and j
• It is easy to prove that this function is submodular
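A numerical sketch of this pairwise term. The Gaussian similarity and the unit λ values are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def summary_score(x, h, lam1=1.0, lam2=1.0):
    """Pairwise summary term: sum of λ(h_i, h_j) · s(x_i, x_j) over i ≠ j."""
    T = len(h)
    total = 0.0
    for i in range(T):
        for j in range(T):
            if i == j:
                continue
            s = np.exp(-np.sum((x[i] - x[j]) ** 2))  # similarity in (0, 1]
            if h[i] == 1 and h[j] == 0:
                total += lam1 * s   # coverage: pick i represents skipped j
            elif h[i] == 1 and h[j] == 1:
                total -= lam2 * s   # non-redundancy: similar picks penalized
    return total

x = np.array([[0.0], [0.05], [3.0]])   # frames 0 and 1 are near-duplicates
# Picking the two dissimilar frames scores higher than the duplicate pair:
assert summary_score(x, [1, 0, 1]) > summary_score(x, [1, 1, 0])
```

The coverage reward and redundancy penalty pull the summary toward frames that represent the rest of the video without repeating each other.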
11/26
Submodular score function: summary + action
• For action recognition, we add a unary term (the second sum):
$\psi(x, y, h) = \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y) + \sum_{i=1}^{T} \lambda_3\, \mathbb{I}[h_i = 1, y]\, x_i$
• In the ICASSP paper, we prove that this function is still submodular
• $w \geq 0$ → positive sum of submodular functions
12/26
Learning
• Learning: structural SVM
• Given a training set $\{x^n, h^n, y^n\}$, $n = 1 \ldots N$, we find:
$\arg\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } w^\top \psi(x^n, y^n, h^n) - w^\top \psi(x^n, y, h) \geq \Delta(y^n, y) - \xi_n, \quad n = 1 \ldots N,\ \forall (y, h) \in W$
• W is a small (i.e., polynomial) set of constraints that guarantees a solution arbitrarily close to the full solution
13/26
Learning: populating the constraint set
• In the structural SVM framework, the constraint set is
populated with the “most violating” labelings
• For every sample, the “most violating” labeling is given by:
$\bar{y}^n, \bar{h}^n = \arg\max_{y,h} \left[ w^\top \psi(x^n, y, h) + \Delta(y^n, y) \right]$
• “Loss-augmented inference”: same greedy algorithm as the regular inference
14/26
Learning: latent variables
• The ground truth for h is unknown! → the algorithm
alternates between these two steps until convergence:
Step 1:
$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } w^\top \left[ \psi(x^n, y^n, h^{n*}) - \psi(x^n, y, h) \right] \geq \Delta(y^n, y) - \xi_n, \quad \forall n,\ (y, h) \in W$
Step 2:
$h^{n*} = \arg\max_h\ w^\top \psi(x^n, y^n, h)$
• At the first iteration, h is initialized arbitrarily
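The alternation can be sketched as a generic loop. This is a structural sketch only: `fit_ssvm` and `best_h` are hypothetical placeholders standing in for the actual cutting-plane solver (Step 1) and the submodular completion of the latent variables (Step 2):

```python
def latent_ssvm(X, Y, init_H, fit_ssvm, best_h, max_iters=20):
    """Alternating optimization for the latent structural SVM.

    Step 1 (fit_ssvm): train w with the latent summaries H held fixed.
    Step 2 (best_h):   re-complete each latent summary under the new w.
    Stops when the summaries no longer change (convergence).
    """
    H = list(init_H)      # arbitrary initialization of the summaries
    w = None
    for _ in range(max_iters):
        w = fit_ssvm(X, Y, H)                               # Step 1
        new_H = [best_h(w, x, y) for x, y in zip(X, Y)]     # Step 2
        if new_H == H:
            break
        H = new_H
    return w, H

# Toy stand-ins to show the control flow only (not real learning):
fit_stub = lambda X, Y, H: sum(H)
h_stub = lambda w, x, y: 1
w, H = latent_ssvm([0, 0], [0, 1], [0, 0], fit_stub, h_stub)
```

With the stubs, the summaries converge after one update and the loop exits early, illustrating the stopping criterion.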
15/26
Method: recap
1. A submodular scoring function for summary + action:
$w^\top \psi(x, y, h) = w^\top \left[ \sum_{i,j=1,\, j \neq i}^{T} \phi(x_i, x_j, h_i, h_j, y) + \sum_{i=1}^{T} \lambda_3\, \mathbb{I}[h_i = 1, y]\, x_i \right]$
2. A greedy inference algorithm with performance guarantees
3. Learning by latent structural SVM from an arbitrary initialisation of the summary variables
16/26
V-JAUNE: a measure for the quality of a video
summary
• Video summarization lacks a generally-accepted
performance measure which accounts for both the content
and the frame order
• We propose V-JAUNE:
$\Delta(h, \bar{h}) = \sum_{i=1}^{B} \delta(h_i, \bar{h}_i)$
$\delta(h_i, \bar{h}_i) = \min_j \left\{ \| x_{h_j} - x_{\bar{h}_i} \|^2 \right\}, \quad \text{s.t. } i - \epsilon \leq j \leq i + \epsilon$
($\epsilon$: a small temporal tolerance, so the measure scores both content and approximate frame order)
17/26
Multi-annotator V-JAUNE
• Multi-annotator version:
$\Delta(h^{1:M}, \bar{h}) = \sum_{m=1}^{M} \Delta(h^m, \bar{h})$
• We normalize it by the disagreement between the annotators:
$D = \frac{2}{M(M-1)} \sum_{p,q} \Delta(h^p, h^q), \quad p = 1 \ldots M,\ q = p{+}1 \ldots M$
$\Delta'(h^{1:M}, \bar{h}) = \Delta(h^{1:M}, \bar{h}) / D$
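The normalization can be computed as below (a sketch; `delta` is any summary loss, and the scalar "summaries" are stand-ins for illustration):

```python
from itertools import combinations

def normalized_v_jaune(delta, annotator_summaries, h_hat):
    """Multi-annotator V-JAUNE: total loss to the M reference summaries,
    divided by the annotators' mean pairwise disagreement D."""
    M = len(annotator_summaries)
    total = sum(delta(h_m, h_hat) for h_m in annotator_summaries)
    D = (2.0 / (M * (M - 1))) * sum(
        delta(p, q) for p, q in combinations(annotator_summaries, 2))
    return total / D

# Toy scalar stand-ins for summaries, with an absolute-difference loss:
delta = lambda a, b: abs(a - b)
score = normalized_v_jaune(delta, [0, 2], 1)  # total = 2, D = 2 → 1.0
```

A score of 1.0 means the candidate summary disagrees with the annotators about as much as they disagree with each other.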
18/26
Experiments
• Dataset: MSR Daily Activities 3D: A dataset with 16 action
classes (drink, eat, read book, call cellphone, write on a
paper, use laptop, use vacuum cleaner, cheer up, sit still,
toss paper, play game, lie down on sofa, walk, play guitar,
stand up, sit down), 320 instances from 10 actors
• RGB, depth and skeletal data from MS Kinect
19/26
Experiments
• Dataset: Actions for Cooking Eggs (ACE aka KCSGR): A
dataset where actors cook egg recipes using 8 atomic
actions (cutting, seasoning, peeling, boiling, turning,
baking, mixing and breaking); 161 instances for training,
95 for testing (different actors)
20/26
Experiments
• Measurements: STIP features only from depth frames,
encoded as VLAD (no RGB, no skeletons)
• Action recognition comparison:
• standard SVM
• the proposed method with no summary features and all
frames
• the proposed method with budget B = 10
• the proposed method on the RGB data
• results from the literature
• Summary comparison: sum of absolute differences (SAD)
21/26
Results: action recognition on MSR
Method                                     Accuracy
SVM                                        34.4%
Dynamic time warping                       54.0%
Proposed method (no summary, all frames)   48.8%
Proposed method                            60.6%
Proposed method (RGB videos)               46.3%
• The summary component helps the action recognition accuracy!
• Accuracy from depth frames is higher than from RGB, even without dedicated features (skeletal data would deliver higher accuracy, but are available only in limited situations)
22/26
Results: action recognition on ACE
Method                                     Accuracy
SVM                                        62.1%
PA-Pooling                                 72.2%
Proposed method (no summary, all frames)   54.7%
Proposed method (no summary, 10 frames)    66.3%
Proposed method                            77.9%
• The summary component helps, again
• Selecting the frames helps
23/26
Results: MSR summarization
• The proposed method's V-JAUNE score is also better than SAD's: 5.22 vs 5.65 (lower is better)
• In many cases, the summaries from the proposed method
(top) appear more appealing than SAD’s (bottom):
24/26
Results: ACE summarization
• Here the proposed method's V-JAUNE score is slightly worse than SAD's: 0.947 vs 0.927
• 10% summary supervision improves both the accuracy
and V-JAUNE: 81.1%, 0.926
• Example with the proposed method:
25/26
Conclusions
• The action recognition accuracy is higher than that of
comparable methods (depth data only)
• We obtain a summary as a by-product! Action recognition
and video summarization proved synergistic
• In many cases, the summaries are more appealing than
with a low-level method (SAD)
• Efficient inference and loss-augmented inference
(Hamming loss, V-JAUNE loss under review)
26/26
Any questions?
• Thank you very much for your attention!
• Any questions?
Fairouz Hussein, Massimo Piccardi
University of Technology Sydney, NSW, Australia