【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

Recognition of Transitional Action for Short-Term Action
Prediction using Discriminative Temporal CNN Feature
Hirokatsu Kataoka, Ph.D.
Computer Vision Research Group (CVRG), AIST
http://www.hirokatsukataoka.net/
Yudai Miyashita (TDU), Masaki Hayashi (Liquid Inc., Keio Univ.)，
Kenji Iwata, Yutaka Satoh (AIST)

Related work: Early Action Recognition
•  [Ryoo, ICCV2011]
M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on
Computer Vision (ICCV), pp.1036-1043, 2011.

Related work: Action Prediction
•  [Kataoka+, VISAPP2016]
？？？ Daytime
(Time Zone)
Walking
(Previous Activity)
Sitting
(Current Activity)
???
(Next Activity)
xtimezone
xprevious xcurrent
θ = “Using a PC”
Given Not given
Time series
H. Kataoka, Y. Aoki, K. Iwata, Y. Satoh, “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016.

Problem of related works
•  Early action recognition
–  Action recognition in an early frame of the action
–  Enough cue is required, so almost equals to action recognition
•  Action prediction
–  Complete future prediction in an unstable situation

Proposal
•  Transitional Action (TA): Action-class while an action is transitive
–  TA contains cue of prediction: Earlier than early action recognition
–  Recognition-like future action prediction: More stable prediction
[Applications] Autonomous driving, active safety and robotics
Δt
【Proposal】
Short-term action prediction
recognize “cross” at time t5
【Previous works】
Early action recognition recognize
“cross” at time t9
Walk straight
(Action)
Cross
(Action)
Walk straight – Cross
(Transitional action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12

Problem settings
Framework Problem
Action Recognition
Early Action Recognition
Action Prediction
Transitional Action Recognition
f (F1...t
A
) → At
f (F1...t−L
A
) → At
f (F1...t
A
) → At+L
f (F1...t
TA
) → At+L

Difference
Framework Problem
Action Recognition
Action Prediction
f (F1...t
A
) → At
f (F1...t−L
A
) → At
f (F1...t
A
) → At+L
f (F1...t
TA
) → At+L
Walk straight
(Action)
Cross
(Action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
f (F1...t−L
A
) → At
A(cross)The objective action is
-  Early action recognition is late response

Difference
Framework Problem
Action Recognition
Action Prediction
f (F1...t
A
) → At
f (F1...t−L
A
) → At
f (F1...t
A
) → At+L
f (F1...t
TA
) → At+L
Walk straight
(Action)
Cross
(Action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
f (F1...t
A
) → At+L
-  Action prediction is unstable

Difference
Framework Problem
Action Recognition
Action Prediction
f (F1...t
A
) → At
f (F1...t−L
A
) → At
f (F1...t
A
) → At+L
f (F1...t
TA
) → At+L
Walk straight
(Action)
Cross
(Action)
Walk straight – Cross
(Transitional action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
-  Transitional action recognition is reasonable
f (F1...t
TA
) → At+L

Details of transitional action (TA)
•  Annotation for TA
–  TA and normal action (NA) classes are partially overlapped each other
•  Difficulty of TA
–  Temporally mixed between NA and TA

Subtle Motion Descriptor (SMD)
•  A discriminative temporal CNN feature
–  To divide classes between NA and TA

•  Activation feature from VGG-16
–  Fully-connected layer (N = 4,096)
–  Based on pooled time series (PoT) [Ryoo+, CVPR2015]

•  Temporal difference ΔVt is calculated
–  (Frame t) – (Frame t-1)

•  Temporal pooling from ΔV t
–  Plus and minus
–  Zero-around values are pooled (→This is the contribution of SMD)
–  TH is experimentally fixed

Datasets
•  Temporal action datasets
–  NTSEL [Kataoka+, ITSC2015]
•  Walk (NA), cross (NA), bicycle (NA), turn (TA) with human bbox
–  UTKinect-Action [Xia+, CVPRW2012]
•  Ordered 10 NAs (e.g. walk, throw, sit)
•  8 TAs (excluding push/pull; next page)
•  Without human bbox
–  Watch-n-Patch [Wu+, CVPR2015]
•  Daily 10 NAs (e.g. read, turn on monitor, leave office)
•  Top frequent 10 TAs (next page)
•  Without human bbox

Experimental settings (list of TAs)
•  @UTKinect-Action @Watch-n-Patch

Implements
•  Action recognition appraoches
–  Temporal CNN models
•  Pooled Time-series (PoT) [Ryoo+, CVPR2015]
•  CNN accumulation
•  CNN + IDT [Jain+, ECCVW2014]
–  Improved dense trajectories (IDT) and with improved features
•  IDT [Wang+, ICCV2013]
•  IDT + cooccurrence-feature [Kataoka+, ACCV2014]
•  All Features in IDT

Exploration experiment
•  Parameters
–  Frame accumulation
–  Thresholding value TH
–  Layer fc6 vs fc7

•  Temporal accumulation [frames]
–  Faster prediction: 3 [frames] (0.1s)
–  Toward state-of-the-art: 10 [frames] (0.33s)
–  Baseline should be 3 and 10 frames accumulation

•  Thresholding value
–  Depending on data

•  Layer fc6 vs fc7
–  Layer fc6 is better

Results
•  SMD (ours) is state-of-the-art in transitional action recognition

Comparison of PoT
•  Subtle motion is effective for transitional action recognition
–  NTSEL: +2.18%, +8.63%
–  UTKinect: +7.19%, +4.31%
–  Watch-n-Patch: +4.82%, +5.12%

Conclulsion
•  Two contribusions:
1.  Definition of transitional action for short-term action prediction
2.  Subtle Motion Descriptor (SMD) to classify transitional and normal actions

【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Viewers also liked

Viewers also liked (18)

Similar to 【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

Similar to 【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature (13)

Recently uploaded

Recently uploaded (20)

【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature