The document proposes a framework for recognizing actions across cameras by exploring correlation subspaces. It first learns a joint subspace using Canonical Correlation Analysis (CCA) on unlabeled multi-view data. It then trains a Support Vector Machine (SVM) in this subspace with a novel correlation regularizer that favors dimensions with higher correlation between views, improving generalization to target views. Experiments on the IXMAS dataset show the method outperforms baselines, with the regularizer successfully suppressing weights for less correlated dimensions.
Recently, WaveNet, which auto-regressively predicts the probability distribution of speech samples, has provided a new paradigm for speech synthesis tasks.
Since the usage of WaveNet for speech synthesis varies with the conditioning vectors, it is very important to design the baseline system structure effectively.
In this talk, I will first introduce various types of WaveNet vocoders, such as the conventional speech-domain approach and the recently proposed source-filter theory-based approach.
Then, I will explain linear prediction (LP)-based WaveNet speech synthesis, i.e., LP-WaveNet, which overcomes the limitations of source-filter theory-based WaveNet vocoders caused by the mismatch between the speech excitation signal and the vocal tract filter.
While presenting the experimental setups and results, I will also share some know-how for successfully training the network.
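For readers new to the linear prediction decomposition behind LP-WaveNet, here is a minimal NumPy/SciPy sketch, not from the talk: LP coefficients are estimated from a frame's autocorrelation by solving a Toeplitz system, and the excitation is recovered as the prediction residual. The frame length and LP order are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_analysis(frame, order=16):
    """Estimate LP coefficients and the excitation (residual) of one frame.

    Solves the autocorrelation normal equations R a = r so that
    s[n] ~= sum_k a[k] * s[n-k]; the residual e[n] = s[n] - prediction
    approximates the excitation signal of the source-filter model.
    """
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Symmetric Toeplitz system from lags 0..order-1; rhs is lags 1..order.
    a = solve_toeplitz((ac[:order], ac[:order]), ac[1:order + 1])
    pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
    return a, frame - pred

# Toy usage: a synthetic harmonic frame stands in for 16 kHz speech.
t = np.arange(400) / 16000.0
frame = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 660 * t)
coeffs, excitation = lp_analysis(frame)
print(np.var(excitation) < np.var(frame))  # residual carries less energy
```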
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had until now been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F... - Simone Ercoli
I presented an interesting paper during the Vision and Multimedia Reading Group: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition (pdf).
It is a thorough evaluation of features extracted from the activations of a deep convolutional network trained on a large-scale dataset.
This is work by Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell from UC Berkeley.
An introduction to selective search for object proposals; deep dives into the R-CNN family and the state-of-the-art RetinaNet model for object detection; the mAP concept for evaluating models; and how anchor boxes let the model learn where to draw bounding boxes.
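As a companion to the mAP discussion in these slides, a minimal sketch, mine rather than the slides', of the two building blocks of detection evaluation: IoU between boxes, and average precision over a score-ranked list of detections.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP: mean precision at each true-positive rank."""
    order = np.argsort(scores)[::-1]               # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    return float(np.sum(precision * tp) / num_gt)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))                   # ~0.143
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], 2))  # ~0.833
```

mAP is then the mean of such per-class APs, with detections usually matched to ground truth at an IoU threshold such as 0.5.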
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with the intra-class appearance and shape variations that commonly exist among different instances within the same object category, we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner, so that the smoothness constraint is naturally imposed within the deep network.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitation of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervision by leveraging correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
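To make the coarse-to-fine composition step concrete, a hypothetical sketch (not the PARN code; the helper name and toy residuals are mine) of composing per-level residual affine transformations, each a 2x3 matrix [A | t], into a final transformation:

```python
import numpy as np

def compose_affine(base, residual):
    """Apply `base` then `residual`: x -> A_r (A_b x + t_b) + t_r."""
    A_b, t_b = base[:, :2], base[:, 2]
    A_r, t_r = residual[:, :2], residual[:, 2]
    return np.hstack([A_r @ A_b, (A_r @ t_b + t_r)[:, None]])

# Coarse-to-fine: start from the identity, refine with per-level residuals
# (random near-identity perturbations stand in for network predictions).
rng = np.random.default_rng(0)
field = np.hstack([np.eye(2), np.zeros((2, 1))])
for _ in range(3):
    residual = np.hstack([np.eye(2) + 0.01 * rng.standard_normal((2, 2)),
                          0.1 * rng.standard_normal((2, 1))])
    field = compose_affine(field, residual)
print(field)
```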
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had until now been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
http://imatge-upc.github.io/retrieval-2017-cam/
Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable.
Convolutional neural networks trained for image classification over large datasets have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches are based on encoding the activations of convolutional layers, as they convey the image's spatial information. Our proposal goes further and aims at a local-aware encoding of these features depending on the predicted image semantics, with the advantage of using only the knowledge contained inside the network.
In particular, we employ Class Activation Maps (CAMs) to obtain the most discriminative regions from a semantic perspective. Additionally, CAMs are also used to generate object proposals during an unsupervised re-ranking stage after a first fast search.
Our experiments on two publicly available datasets for instance retrieval, Oxford5k and Paris6k, demonstrate that our system is competitive and even outperforms the current state of the art when using off-the-shelf models trained on the object classes of ImageNet.
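A minimal sketch of the CAM-weighted encoding idea, assuming the standard CAM construction (a global-average-pooling network with a final linear layer); the function names and toy shapes are mine, not the authors' code:

```python
import numpy as np

def class_activation_map(feats, fc_weights, class_idx):
    """CAM for one class: weighted sum of conv maps, rectified and normalized.

    feats: (K, H, W) activations of the last conv layer;
    fc_weights: (num_classes, K) linear weights after global average pooling.
    """
    cam = np.tensordot(fc_weights[class_idx], feats, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

def cam_weighted_descriptor(feats, cam):
    """Spatially weight activations by the CAM, sum-pool, l2-normalize."""
    desc = (feats * cam[None]).sum(axis=(1, 2))               # (K,)
    return desc / (np.linalg.norm(desc) + 1e-8)

# Toy usage with random activations and classifier weights.
rng = np.random.default_rng(0)
feats = np.abs(rng.standard_normal((512, 14, 14)))
fc_w = rng.standard_normal((1000, 512))
cam = class_activation_map(feats, fc_w, class_idx=283)
descriptor = cam_weighted_descriptor(feats, cam)
```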
Distributed Parallel Process Particle Swarm Optimization on Fixed Charge Netw... - Corey Clark, Ph.D.
We are developing a parallel-process particle swarm optimization (PSO) on an HTML5-based, dynamically distributed system and assessing its performance as applied to the multicommodity fixed charge (MCFC) network flow problem. The MCFC problem is motivated by a real-world cash management problem faced by large national banks and is NP-hard. We compare the performance of serial and distributed parallel-process PSO implementations and empirically evaluate the optimality gap for multiple instances.
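For reference, a minimal serial PSO sketch; this is the generic textbook algorithm, not the HTML5 distributed implementation described here, and the sphere objective merely stands in for the MCFC objective:

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer for a continuous objective."""
    rng = np.random.default_rng(0)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                       # per-particle best positions
    best_val = np.apply_along_axis(objective, 1, pos)
    g_idx = best_val.argmin()                   # swarm-wide (global) best
    g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g_pos - pos)
        pos += vel
        vals = np.apply_along_axis(objective, 1, pos)
        improved = vals < best_val
        best_pos[improved], best_val[improved] = pos[improved], vals[improved]
        if best_val.min() < g_val:
            g_idx = best_val.argmin()
            g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]
    return g_pos, g_val

# Toy usage on a sphere function; the MCFC objective would replace it.
best_x, best_f = pso(lambda x: float(np.sum(x ** 2)), dim=10)
print(best_f)
```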
We are currently in the process of converting JaHOVA OS into a high-performance multithreaded game and simulation engine (GEn3CIS). One feature of GEn3CIS is its ability to distribute processing across any internet-enabled device with a modern browser. Essentially, this allows users to take their phones, tablets, PCs/Macs, etc., and utilize their combined computing power to solve complex simulation, learning, and/or optimization problems.
The first part of this dissertation focuses on an analysis of the spatial context in semantic image segmentation. First, we review how spatial context has been tackled in the literature by local features and spatial aggregation techniques. From a discussion about whether the context is beneficial or not for object recognition, we extend a Figure-Border-Ground segmentation for local feature aggregation with ground truth annotations to a more realistic scenario where object proposal techniques are used instead. Whereas the Figure and Ground regions represent the object and the surround respectively, the Border is a region around the object contour, which is found to be the region with the richest contextual information for object recognition. Furthermore, we propose a new contour-based spatial aggregation technique of the local features within the object region by a division of the region into four subregions. Both contributions have been tested on a semantic segmentation benchmark with a combination of free and non-free context local features that allows the models to automatically learn whether the context is beneficial or not for each semantic category.
The second part of this dissertation addresses the semantic segmentation of a set of closely-related images from an uncalibrated multiview scenario. State-of-the-art semantic segmentation algorithms fail to correctly segment the objects from some viewpoints when the techniques are independently applied to each viewpoint image. The lack of large annotated datasets for multiview segmentation does not allow obtaining a model that is robust to viewpoint changes. In this second part, we exploit the spatial correlation that exists between the different viewpoint images to obtain a more robust semantic segmentation. First, we review the state-of-the-art co-clustering, co-segmentation and video segmentation techniques that aim to segment the set of images in a generic way, i.e., without considering semantics. Then, a new architecture that considers motion information and provides a multiresolution segmentation is proposed for the co-clustering framework and outperforms state-of-the-art techniques for generic multiview segmentation. Finally, the proposed multiview segmentation is combined with the semantic segmentation results, giving a method for automatic resolution selection and a coherent semantic multiview segmentation.
The slides cover the techniques used in the Temporal Segment Network (TSN), including the basic ideas, a recap of BN-Inception, optical flow, and tricks used in practice. Used in a group paper reading at the University of Sydney.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD) - Saimunur Rahman
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo... - MLconf
Graph Representation Learning with Deep Embedding Approach:
Graphs are a commonly used data structure for representing real-world relationships, e.g., molecular structures, knowledge graphs, and social and communication networks. The effective encoding of graphical information is essential to the success of such applications. In this talk I'll first describe a general deep learning framework, namely structure2vec, for end-to-end graph feature representation learning. Then I'll present direct applications of this model to graph problems at different scales, including community detection and molecule graph classification/regression. We then extend the embedding idea to a temporally evolving user-product interaction graph for recommendation. Finally I'll present our latest work on leveraging reinforcement learning techniques for graph combinatorial optimization, including the vertex cover problem for social influence maximization and the traveling salesman problem for scheduling management.
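A toy sketch of the neighborhood-aggregation idea behind such graph embeddings, written in the spirit of structure2vec but not its exact parameterization; the weights, features, and readout are illustrative:

```python
import numpy as np

def graph_embed(adj, node_feats, n_layers=3, dim=16, seed=0):
    """Generic neighborhood-aggregation embedding.

    Iterates mu_v <- relu(W1 x_v + W2 sum_{u in N(v)} mu_u), then sums node
    embeddings into a graph-level vector (a simple readout).
    """
    rng = np.random.default_rng(seed)
    w1 = 0.1 * rng.standard_normal((dim, node_feats.shape[1]))
    w2 = 0.1 * rng.standard_normal((dim, dim))
    mu = np.zeros((adj.shape[0], dim))
    for _ in range(n_layers):
        agg = adj @ mu                            # sum over neighbors
        mu = np.maximum(node_feats @ w1.T + agg @ w2.T, 0)
    return mu.sum(axis=0)                         # graph-level readout

# Toy usage: a 4-node path graph with one-hot degree features.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
feats = np.eye(4)[adj.sum(1).astype(int)]         # degree one-hot
print(graph_embed(adj, feats))
```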
Multimodal Residual Networks for Visual QA - Jin-Hwa Kim
Deep neural networks continue to advance the state of the art in image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of deep residual learning. Unlike deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings, exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention effect of the joint representations for each learning block using the back-propagation algorithm, even though the visual features are collapsed without spatial information.
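A simplified sketch of the element-wise-product residual fusion described above; this is my own reading of the idea, not the paper's exact parameterization or shapes:

```python
import numpy as np

def mrn_block(h, v, w_q, w_v, w_s):
    """One multimodal residual block (simplified sketch).

    The joint residual tanh(h W_q) * tanh(v W_v) is an element-wise
    product of mapped question and visual features, added to a linearly
    mapped shortcut of the question state h.
    """
    return h @ w_s + np.tanh(h @ w_q) * np.tanh(v @ w_v)

# Toy usage: stack three blocks, as in a deep residual network.
rng = np.random.default_rng(0)
q_state = rng.standard_normal((1, 256))          # question representation
v_feat = rng.standard_normal((1, 256))           # visual representation
for _ in range(3):
    w_q, w_v, w_s = (0.05 * rng.standard_normal((256, 256)) for _ in range(3))
    q_state = mrn_block(q_state, v_feat, w_q, w_v, w_s)
print(q_state.shape)
```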
Two strategies for large-scale multi-label classification on the YouTube-8M d... - Dalei Li
A project for the Kaggle YouTube-8M video understanding competition. Four algorithms that can run on a single machine are implemented: multi-label k-nearest neighbors, a multi-label radial basis function network (one-vs-rest), multi-label logistic regression, and a one-vs-rest multi-layer neural network.
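To illustrate the one-vs-rest strategy mentioned above, a minimal scikit-learn sketch on synthetic multi-label data (a stand-in for YouTube-8M features and labels, not the competition code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data: X are video-level features, Y is a binary indicator
# matrix where each row can have several active labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))
Y = (X @ rng.standard_normal((32, 5)) + rng.standard_normal((500, 5)) > 1.0)

# One-vs-rest fits one binary logistic regression per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y.astype(int))
probs = clf.predict_proba(X[:3])   # per-label probabilities for ranking
print(probs.shape)                 # (3, 5)
```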
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine - Soma Boubou
Object recognition from RGB-D sensors has recently emerged as a renowned and challenging research topic. Current systems often require large amounts of time to train the models and to classify new data. We propose an effective and fast object recognition approach for 3D data acquired from depth sensors such as the Structure or Kinect sensors.
Our contribution in this work is a novel, fast, and effective approach for real-time object recognition from 3D depth data:
- First, we extract simple but effective frame-level features, which we call differential frames, from the raw depth data (a sketch follows below).
- Second, we build a recognition system based on an Extreme Learning Machine classifier with Local Receptive Fields (ELM-LRF).
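A hedged sketch of what such frame-level features could look like, assuming "differential frames" are temporal differences of consecutive depth frames; the abstract does not define them precisely, so this is an illustration only:

```python
import numpy as np

def differential_frames(depth_video):
    """Frame-level temporal differences of a depth video (T, H, W).

    Each feature frame is the change in depth between consecutive raw
    frames, which emphasizes moving object parts and suppresses the
    static background. (An assumed reading of 'differential frames'.)
    """
    return np.diff(depth_video.astype(np.float32), axis=0)

# Toy usage: 10 frames of 64x64 synthetic depth data.
video = np.random.default_rng(0).integers(0, 4096, (10, 64, 64))
feats = differential_frames(video)
print(feats.shape)  # (9, 64, 64)
```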
Optic Flow
Brightness Constancy Constraints
Aperture Problem
Regularization and Smoothness Constraints
Lucas-Kanade algorithm
Focus of Expansion (FOE)
Discrete Optimization for Optical Flow
Large Displacement Optical Flow: Descriptor Matching
DeepFlow: Large displ. optical flow with deep matching
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
Optical Flow with Piecewise Parametric Model
Flow Fields: Dense Correspondence Fields for Accurate Large Displacement Optical Flow Estimation
Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids
FlowNet: Learning Optical Flow with Convol. Networks
Deep Discrete Flow
Optical Flow Estimation using a Spatial Pyramid Network
A Large Dataset to Train ConvNets for Disparity, Optical Flow, and Scene Flow Estimation
DeMoN: Depth and Motion Network for Learning Monocular Stereo
Unsupervised Learning of Depth and Ego-Motion from Video
Appendix A: A Database and Evaluation Methodology for Optical Flow
Appendix B: Learning and optimization
1. Recognizing Actions Across Cameras by Exploring the Correlation Subspace
4th International Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR), in conjunction with ECCV 2012
Chun-Hao Huang, Yi-Ren Yeh, and Yu-Chiang Frank Wang
Research Center for IT Innovation, Academia Sinica, Taiwan
Oct 12th, 2012
2. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
3. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
4. Representing an Action
• Actions are represented as high-dimensional vectors.
• Spatio-temporal interest points [Laptev, IJCV, 2005] [Dollár et al., ICCV WS on VS-PETS, 2005]
• Bag of spatio-temporal visual words model.
• State-of-the-art classifiers (e.g., SVM) are applied to address the recognition task.
5. Cross-Camera Action Recognition
• Models learned at source views typically do not generalize well to target views.
[Figure: labeled source-view actions (check watch, punch, kick) $\mathbf{v}_1^{s},\mathbf{v}_2^{s},\mathbf{v}_3^{s}$ in $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$, and target-view data $\mathbf{v}_1^{t},\mathbf{v}_2^{t},\mathbf{v}_3^{t}$ in $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$. Colored: labeled data; uncolored: test data.]
6. Cross-Camera Action Recognition (cont'd)
• An unsupervised strategy:
Only unlabeled data are available at target views.
They are exploited to learn the relationship between data at the source and target views.
• This is one branch of transfer learning.
[Figure: source and target views. Colored: labeled data; gray: unlabeled data; the rest: test data.]
7. Approaches based on Transfer Learning
• To learn a common feature representation (e.g., a joint subspace)
for both source and target view data.
• Training/testing can be performed in terms of such representations.
• How to exploit unlabeled data from both views for determining this
joint subspace is the key issue.
• Previous approaches:
1. Splits-based feature transfer [Farhadi and Tabrizi, ECCV ‘08 ]
Requires frame-wise correspondence
2. Bag of bilingual words model (BoBW) [Liu et al., CVPR ‘11 ]
Considers each dimension of the derived representation to be equally important.
8. Outline
• Introduction
• Our Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
9. Overview of Our Proposed Method
1. Learn a joint subspace via canonical correlation analysis (CCA).
2. Project the labeled source data onto it.
3. Learn a new SVM with constraints on domain transfer ability.
4. Prediction.
[Figure: source data in $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$ and target data in $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$ are both mapped into the correlation subspace $\mathcal{X}^{c}\in\mathbb{R}^{d}$, whose dimensions $\mathbf{v}_1^{s,t},\mathbf{v}_2^{s,t},\ldots$ are shared by the two views.]
10. Requirements of CCA
• CCA requires unlabeled data pairs, i.e., unlabeled actions observed by both cameras (at both views).
[Figure: source and target views. Colored: labeled data; gray: unlabeled data pairs; the rest: test data.]
11. Learning the Correlation Subspace via CCA
• CCA aims at maximizing the correlation between two sets of variables.
• Given two sets of $n$ centered unlabeled observations $\mathbf{X}^{s}=[\mathbf{x}_1^{s},\ldots,\mathbf{x}_n^{s}]\in\mathbb{R}^{d_s\times n}$ and $\mathbf{X}^{t}=[\mathbf{x}_1^{t},\ldots,\mathbf{x}_n^{t}]\in\mathbb{R}^{d_t\times n}$,
• CCA learns two projection vectors $\mathbf{u}_s$ and $\mathbf{u}_t$ maximizing the correlation coefficient $\rho$ between the projected data, i.e.,
$$\rho=\max_{\mathbf{u}_s,\mathbf{u}_t}\frac{\mathbf{u}_s^{\top}\boldsymbol{\Sigma}_{st}\mathbf{u}_t}{\sqrt{\mathbf{u}_s^{\top}\boldsymbol{\Sigma}_{ss}\mathbf{u}_s}\,\sqrt{\mathbf{u}_t^{\top}\boldsymbol{\Sigma}_{tt}\mathbf{u}_t}},$$
where $\boldsymbol{\Sigma}_{st}=\mathbf{X}^{s}\mathbf{X}^{t\top}$, $\boldsymbol{\Sigma}_{ss}=\mathbf{X}^{s}\mathbf{X}^{s\top}$, and $\boldsymbol{\Sigma}_{tt}=\mathbf{X}^{t}\mathbf{X}^{t\top}$ are the covariance matrices.
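A compact NumPy/SciPy sketch of this construction (a generic regularized CCA, not the authors' code). It solves the generalized eigenproblem given in the appendix slide; those eigenvalues satisfy $\eta=\rho^2$, so the correlation coefficients follow directly:

```python
import numpy as np
from scipy.linalg import eigh

def cca(Xs, Xt, dim, reg=1e-3):
    """CCA via a generalized eigenvalue problem on centered data.

    Xs: (d_s, n), Xt: (d_t, n). Returns Us (d_s, dim), Ut (d_t, dim)
    and the correlation coefficients rho (largest first).
    """
    Xs = Xs - Xs.mean(axis=1, keepdims=True)
    Xt = Xt - Xt.mean(axis=1, keepdims=True)
    Sss = Xs @ Xs.T + reg * np.eye(Xs.shape[0])   # regularized covariances
    Stt = Xt @ Xt.T + reg * np.eye(Xt.shape[0])
    Sst = Xs @ Xt.T
    # Generalized eigenproblem: Sst Stt^-1 Sst^T u_s = eta Sss u_s.
    lhs = Sst @ np.linalg.solve(Stt, Sst.T)
    eta, Us = eigh(lhs, Sss)
    order = np.argsort(eta)[::-1][:dim]           # largest eta <-> largest rho
    eta, Us = eta[order], Us[:, order]
    Ut = np.linalg.solve(Stt, Sst.T @ Us)         # u_t from u_s (up to scale)
    Ut /= np.linalg.norm(Ut, axis=0, keepdims=True)
    return Us, Ut, np.sqrt(np.clip(eta, 0, 1))    # eta = rho^2

# Toy usage: two 'views' that share a 5-dimensional latent signal.
rng = np.random.default_rng(0)
z = rng.standard_normal((5, 200))
Xs = rng.standard_normal((20, 5)) @ z + 0.1 * rng.standard_normal((20, 200))
Xt = rng.standard_normal((30, 5)) @ z + 0.1 * rng.standard_normal((30, 200))
Ps, Pt, rho = cca(Xs, Xt, dim=5)
print(np.round(rho, 2))   # close to 1 for the shared dimensions
```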
12. CCA Subspace as Common Feature Representation
• Stack the projection vectors into $\mathbf{P}^{s}=[\mathbf{u}_1^{s}\cdots\mathbf{u}_d^{s}]\in\mathbb{R}^{d_s\times d}$ and $\mathbf{P}^{t}=[\mathbf{u}_1^{t}\cdots\mathbf{u}_d^{t}]\in\mathbb{R}^{d_t\times d}$.
• Source-view data $\mathbf{x}^{s}\in\mathcal{X}^{s}$ and target-view data $\mathbf{x}^{t}\in\mathcal{X}^{t}$ are mapped into the correlation subspace $\mathcal{X}^{c}\in\mathbb{R}^{d}$ by $\mathbf{P}^{s\top}\mathbf{x}^{s}$ and $\mathbf{P}^{t\top}\mathbf{x}^{t}$, respectively.
• Each dimension $\mathbf{v}_i^{s,t}$ of the subspace is associated with a triplet $(\rho_i,\mathbf{u}_i^{s},\mathbf{u}_i^{t})$.
[Figure: source view $\mathcal{X}^{s}\in\mathbb{R}^{d_s}$ and target view $\mathcal{X}^{t}\in\mathbb{R}^{d_t}$ projected into the shared correlation subspace $\mathcal{X}^{c}$.]
13. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
14. Domain Transfer Ability of CCA
• Learn SVMs in the derived CCA subspace... problem solved?
- Yes and no!
• Domain transfer ability:
- In the CCA subspace, each dimension $\mathbf{v}_i^{s,t}$ is associated with a different $\rho_i$.
- How well can classifiers learned (in this subspace) from the projected source-view data generalize to the projected target-view data?
• See the example below...
15. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with a Novel Correlation Regularizer
• Experiments
• Conclusion
16. Our Proposed SVM with Domain Transfer Ability
• Proposed SVM formulation:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2-\frac{1}{2}\mathbf{r}^{\top}\mathrm{Abs}(\mathbf{w})+C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ \ge\ 1-\xi_i,\quad \xi_i\ge 0,\quad \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• The introduced correlation regularizer $\mathbf{r}^{\top}\mathrm{Abs}(\mathbf{w})$ uses
$$\mathrm{Abs}(\mathbf{w})=\big[\,|w_1|,|w_2|,\ldots,|w_d|\,\big]^{\top}\quad\text{and}\quad \mathbf{r}=[\rho_1,\rho_2,\ldots,\rho_d]^{\top}.$$
• Larger/smaller $\rho_i$
→ stronger/weaker correlation between source- and target-view data
→ the SVM weight $w_i$ is more/less reliable at that dimension of the CCA space.
• Our regularizer favors SVM solutions that are dominant in reliable CCA dimensions (i.e., larger correlation coefficients $\rho_i$ imply larger $|w_i|$ values).
• Classification of (projected) target-view test data:
$$f(\mathbf{x}^{t})=\mathrm{sgn}\big(\langle\mathbf{w},\,\mathbf{P}^{t\top}\mathbf{x}^{t}\rangle+b\big).$$
17. An Approximation for the Proposed SVM
• It is not straightforward to solve the previous formulation with $\mathrm{Abs}(\mathbf{w})$.
• An approximate solution can be derived by relaxing $\mathrm{Abs}(\mathbf{w})$ to $\mathbf{w}\odot\mathbf{w}$, where $\odot$ indicates element-wise multiplication:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2-\frac{1}{2}\mathbf{r}^{\top}(\mathbf{w}\odot\mathbf{w})+C\sum_{i=1}^{N}\xi_i^2\quad\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ge 1-\xi_i,\ \xi_i\ge 0,\ \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• We can further simplify the approximated problem as:
$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\sum_{i=1}^{d}(1-\rho_i)\,w_i^2+C\sum_{i=1}^{N}\xi_i^2\quad\text{s.t. }\ y_i\big(\langle\mathbf{w},\,\mathbf{P}^{s\top}\mathbf{x}_i^{s}\rangle+b\big)\ge 1-\xi_i,\ \xi_i\ge 0,\ \forall\,(\mathbf{x}_i^{s},y_i)\in D_l.$$
• We apply SSVM* to solve the above optimization problem.
*: Lee et al., Computational Optimization and Applications, 2001
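One way to see the simplified problem: under the substitution $\tilde{w}_i=\sqrt{1-\rho_i}\,w_i$ it becomes a standard SVM on features scaled by $1/\sqrt{1-\rho_i}$. The sketch below exploits this with scikit-learn's LinearSVC, whose squared hinge loss stands in for the SSVM-style smooth loss; it is a reimplementation of the idea, not the authors' solver:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_correlation_svm(Z_src, y, rho, C=1.0):
    """Approximate correlation-regularized SVM via feature rescaling.

    The objective (1/2) sum_i (1 - rho_i) w_i^2 + C sum_j xi_j^2 turns
    into a standard SVM after scaling feature i by 1 / sqrt(1 - rho_i):
    highly correlated (reliable) dimensions become cheap to weight, so
    |w_i| grows there, as the regularizer intends.
    """
    scale = 1.0 / np.sqrt(np.maximum(1.0 - rho, 1e-6))
    clf = LinearSVC(C=C).fit(Z_src * scale, y)
    w = clf.coef_.ravel() * scale          # map back to the original w
    return w, clf.intercept_[0]

def predict(Z_tgt, w, b):
    """Classify projected target-view data with the learned hyperplane."""
    return np.sign(Z_tgt @ w + b)

# Toy usage: Z_src are CCA-projected source data, rho the correlations.
rng = np.random.default_rng(0)
Z_src = rng.standard_normal((100, 10))
y = np.sign(Z_src[:, 0] + 0.1 * rng.standard_normal(100))
rho = np.linspace(0.95, 0.1, 10)           # decreasing correlations
w, b = fit_correlation_svm(Z_src, y, rho)
print(np.round(np.abs(w), 2))              # larger |w_i| at high-rho dims
```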
18. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with a Novel Correlation Regularizer
• Experiments
• Conclusion
19. Dataset
• IXMAS multi-view action dataset
Action videos of eleven action classes
Each action video is performed three times by twelve actors
The actions are captured simultaneously by five cameras
20. Experiment Setting
• 2/3 of the data as unlabeled data: learning correlation subspaces via CCA.
• 1/3 of the data as labeled data: training and testing.
• Leave-one-class-out protocol (LOCO): e.g., training without the Kick action.
[Figure: source and target views with example actions: check-watch, scratch-head, sit-down, kick.]
21. Experimental Results
• A: BoW from source view directly
• B: BoBW + SVM [Liu et al., CVPR '11]
• C: BoBW + our SVM
• D: CCA + SVM
• E: our proposed framework (CCA + our SVM)

Recognition rates (%):

      | camera0                       | camera1                       | camera2
      | A     B     C     D     E     | A     B     C     D     E     | A     B     C     D     E
c0    | -                             | 9.29  60.96 63.03 63.18 64.90 | 11.62 41.21 50.76 56.97 60.61
c1    | 10.71 58.08 59.70 66.72 70.25 | -                             | 7.12  33.54 38.03 57.83 59.34
c2    | 8.79  52.63 49.34 57.37 62.47 | 6.67  50.86 45.79 59.19 61.87 | -
c3    | 6.31  40.35 44.44 65.30 66.01 | 9.75  33.59 33.27 46.77 52.68 | 5.96  41.26 43.99 61.36 61.36
c4    | 5.35  38.59 40.91 54.39 55.76 | 9.44  37.53 37.00 53.59 55.00 | 9.19  34.80 38.28 57.88 60.15
avg.  | 7.79  47.41 48.60 60.95 63.62 | 8.79  45.73 44.77 55.68 58.61 | 8.47  37.70 42.77 58.51 60.37

      | camera3                       | camera4
      | A     B     C     D     E     | A     B     C     D     E
c0    | 7.78  39.65 41.36 63.64 62.17 | 7.12  24.60 37.02 43.69 48.23
c1    | 12.02 35.91 39.14 48.59 54.85 | 8.89  26.87 22.22 44.24 49.29
c2    | 6.46  41.46 42.78 60.00 61.46 | 10.35 28.03 33.43 45.05 51.82
c3    | -                             | 8.89  27.53 28.28 40.66 41.06
c4    | 9.60  27.68 34.60 48.03 48.89 | -
avg.  | 8.96  36.17 39.47 55.06 56.84 | 8.81  26.76 30.24 43.41 47.60
22. Effects of the Correlation Coefficient ρ
• Example: source camera 3, target camera 2, left-out action get-up.
• We successfully suppress the SVM weights $|w_i|$ at dimensions where a lower $\rho_i$ results.
• Recognition rates for the two models were 47.22% and 77.78%, respectively.
[Figure: averaged $|w_i|$ versus dimension index; (a) standard SVM, (b) our SVM.]
23. Outline
• Introduction
• The Proposed Framework
Learning Correlation Subspaces via CCA
Domain Transfer Ability of CCA
SVM with A Novel Correlation Regularizer
• Experiments
• Conclusion
24. Conclusions
• We presented a transfer-learning-based approach to cross-camera action recognition.
• We considered the domain transfer ability of CCA and proposed a novel SVM formulation with a correlation regularizer.
• Experimental results on the IXMAS dataset confirmed the performance improvements obtained with our proposed method.
27. Representing an Action
• Spatio-temporal volumes:
- Space-time shapes [Blank et al., ICCV, 2005]
- Motion history volume [Weinland et al., CVIU, 2006]
28. Split-Based Feature Transfer (ECCV '08)
[Figure: in both the source view ($\mathcal{X}^{s}\in\mathbb{R}^{40}$) and the target view ($\mathcal{X}^{t}\in\mathbb{R}^{40}$), each frame of an action video is first described in $\mathbb{R}^{276}$ and quantized by K-means; matching according to the split-based feature yields the target instance in the source representation.]
29. How to Construct the Split-Based Feature
[Figure, source view: frame descriptors in $\mathbb{R}^{276}$ are mapped by 1000 different random projections to $\mathbb{R}^{30}$; Max Margin Clustering assigns each projection a binary split (+/-), and the best 25 random projections are picked, giving a split-based feature in $\mathbb{R}^{25}$. Target view: the same best 25 random projections are applied to unlabeled frames, and SVMs are trained using the split-based feature as labels.]
30. Bag of Bilingual Words (CVPR '11)
1. Exploit unlabeled data to model the two codebooks (visual words $\mathbf{v}_1^{s},\mathbf{v}_2^{s},\ldots,\mathbf{v}_{d_s}^{s}$ and their target-view counterparts) as a bipartite graph.
2. Perform spectral clustering.
3. Construct the codebook of bilingual words.
4. Train models and predict with this representation.
[Figure: source- and target-view codebooks linked as a bipartite graph.]
31. Learning correlation subspace via CCA
• The projection vector $\mathbf{u}_s$ can be solved by a generalized eigenvalue decomposition problem:
$$\boldsymbol{\Sigma}_{st}\boldsymbol{\Sigma}_{tt}^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s=\eta\,\boldsymbol{\Sigma}_{ss}\mathbf{u}_s,\qquad\text{or, with regularization,}\qquad \boldsymbol{\Sigma}_{st}(\boldsymbol{\Sigma}_{tt}+\mathbf{I}_t)^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s=\eta\,(\boldsymbol{\Sigma}_{ss}+\mathbf{I}_s)\mathbf{u}_s.$$
• The largest $\eta$ corresponds to the largest $\rho$: eigenvalues $\eta_1>\cdots>\eta_d$ correspond to correlation coefficients $\rho_1>\cdots>\rho_d$.
• Once $\mathbf{u}_s$ is obtained, $\mathbf{u}_t$ can be calculated by $\mathbf{u}_t\propto\boldsymbol{\Sigma}_{tt}^{-1}\boldsymbol{\Sigma}_{st}^{\top}\mathbf{u}_s$.
• Stacking the eigenvectors gives the projection matrices $\mathbf{P}^{s}=[\mathbf{u}_1^{s}\cdots\mathbf{u}_d^{s}]\in\mathbb{R}^{d_s\times d}$ and $\mathbf{P}^{t}=[\mathbf{u}_1^{t}\cdots\mathbf{u}_d^{t}]\in\mathbb{R}^{d_t\times d}$.
32. LOCO Protocol in a Real Application: A New Action Class
[Figure: source and target views. Colored: labeled data; gray: unlabeled data; the rest: test data from the new (left-out) action class.]
Editor's Notes
Among these approaches,….
Partly inspired by the progress of feature extraction in image classification/ tracking and detection, …etc
In our work, we adopt the… to represent an action
This is regarded as “cross-camera action recognition”
Traditional learning methods fail to predict test data in another view successfully, not only because of the different distributions, but also because the dimensions of the two views can even differ.
Since test data are not available beforehand, one has to assume there are some other data in the target view, in order to facilitate the recognition task.
One scenario is introducing unlabeled data, whose labels are not of our interest for the time being.
This scenario is often called “unsupervised cross-camera action recognition” because there are no labeled data in the target view.
And it belongs to a branch of transfer learning.
Specifically, transfer learning
Note that we only have labeled source data for training
standard SVM: aims at separating the data in the correlation subspace without considering the domain transfer ability (i.e., the correlation between projected data), and thus we still observe prominent |wi| values at non-dominant feature dimensions (i.e., the 11th dimension)
our proposed SVM: suppresses the contributions of non-dominant feature dimensions in the correlated subspace, and thus only results in large |wi| values for dominant feature dimensions.
To recognize an action, one has to decide how to represent it.
Some authors utilized a human body model. They determined body poses by tracking limbs and torso, and recognized actions accordingly.
Besides that, some researchers focused more on the action itself rather than human body.
They proposed spatio-temporal volumes, which encode not only the spatial shape of the silhouette but also its change in the temporal domain.
After a short derivation, the maximization problem reduces to a generalized eigenvalue decomposition (GED) problem.
Usually one introduces a regularization term to alleviate singularity and overfitting issues.