【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

  1. Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature Hirokatsu Kataoka, Ph.D. Computer Vision Research Group (CVRG), AIST http://www.hirokatsukataoka.net/ Yudai Miyashita (TDU), Masaki Hayashi (Liquid Inc., Keio Univ.), Kenji Iwata, Yutaka Satoh (AIST)
  2. Related work: Early Action Recognition •  [Ryoo, ICCV2011] M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.
  3. Related work: Action Prediction •  [Kataoka+, VISAPP2016] H. Kataoka, Y. Aoki, K. Iwata, Y. Satoh, “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016. •  [Figure: time series. Given: time zone (daytime, x_timezone), previous activity (walking, x_previous), current activity (sitting, x_current). Not given: next activity (???). θ = “Using a PC”]
  4. Problems of related work •  Early action recognition –  Recognizes an action from its early frames –  Requires sufficient cues, so it is almost the same as action recognition •  Action prediction –  Predicts an entirely future action, so the prediction is unstable
  5. Proposal •  Transitional action (TA): the action class for the interval in which one action transitions into another –  TA contains cues for prediction: earlier than early action recognition –  Recognition-like prediction of the future action: more stable prediction •  Applications: autonomous driving, active safety, and robotics •  [Figure: timeline t1 to t12 with Walk straight (action), Walk straight – Cross (transitional action), and Cross (action). Proposal (short-term action prediction): recognizes “cross” at time t5. Previous work (early action recognition): recognizes “cross” at time t9. The gap between the two is Δt.]
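  6. Problem settings (framework: problem) •  Action recognition: $f(F^{A}_{1 \ldots t}) \rightarrow A_{t}$ •  Early action recognition: $f(F^{A}_{1 \ldots t-L}) \rightarrow A_{t}$ •  Action prediction: $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$ •  Transitional action recognition: $f(F^{TA}_{1 \ldots t}) \rightarrow A_{t+L}$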
  7. Difference: early action recognition •  $f(F^{A}_{1 \ldots t-L}) \rightarrow A_{t}$, where the objective action is A(cross) •  [Figure: timeline t1 to t12, Walk straight (action) followed by Cross (action)] –  Early action recognition responds late
  8. Difference: action prediction •  $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$, where the objective action is A(cross) •  [Figure: timeline t1 to t12, Walk straight (action) followed by Cross (action)] –  Action prediction is unstable
  9. Difference: transitional action recognition •  $f(F^{TA}_{1 \ldots t}) \rightarrow A_{t+L}$, where the objective action is A(cross) •  [Figure: timeline t1 to t12, Walk straight (action), Walk straight – Cross (transitional action), Cross (action)] –  Transitional action recognition is a reasonable compromise
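The four problem settings differ only in which frames are observed and which time step is labeled. The sketch below is one way to spell that out; the function name `observed_and_target` and the setting strings are hypothetical and do not come from the slides. Frames are 1-indexed to match the timelines.

```python
from typing import Tuple


def observed_and_target(setting: str, t: int, lag: int) -> Tuple[range, int]:
    """Return (observed frame indices, time index of the predicted label).

    `lag` plays the role of L in the formulas. This helper is illustrative only.
    """
    if setting == "action_recognition":
        # f(F_{1..t}) -> A_t
        return range(1, t + 1), t
    if setting == "early_action_recognition":
        # f(F_{1..t-L}) -> A_t: only the first t-L frames of the action are seen
        return range(1, t - lag + 1), t
    if setting in ("action_prediction", "transitional_action_recognition"):
        # f(F_{1..t}) -> A_{t+L}: same indices for both settings; they differ in
        # whether the observed frames belong to a normal action (A) or to a
        # transitional action (TA).
        return range(1, t + 1), t + lag
    raise ValueError(f"unknown setting: {setting}")


frames, target = observed_and_target("transitional_action_recognition", t=5, lag=4)
print(list(frames), target)
```

With `t=5` and `lag=4`, the transitional setting observes frames 1 to 5 and outputs the label for time 9, which mirrors the t5 versus t9 example on the proposal slide.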
  10. Details of transitional action (TA) •  Annotation for TA –  TA and normal action (NA) classes partially overlap each other •  Difficulty of TA –  NA and TA are temporally mixed
  11. Subtle Motion Descriptor (SMD) •  A discriminative temporal CNN feature –  Designed to separate the NA and TA classes
  12. Subtle Motion Descriptor (SMD) •  Activation feature from VGG-16 –  Fully-connected layer (N = 4,096) –  Based on pooled time series (PoT) [Ryoo+, CVPR2015]
  13. Subtle Motion Descriptor (SMD) •  The temporal difference ΔV_t is calculated –  ΔV_t = (feature at frame t) − (feature at frame t−1)
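As a minimal sketch of the temporal difference step, assuming each frame is already represented by a feature vector (4,096-dimensional in the slides, 3-dimensional here for readability), the per-dimension difference between consecutive frames can be computed as follows. The function name and the toy values are illustrative, not taken from the slides.

```python
from typing import List


def temporal_difference(features: List[List[float]]) -> List[List[float]]:
    """Per-dimension difference between consecutive frame features.

    Row t-1 of the result holds (frame t) - (frame t-1), so T frames
    yield T-1 difference vectors.
    """
    return [
        [cur - prev for cur, prev in zip(features[t], features[t - 1])]
        for t in range(1, len(features))
    ]


# Toy features for 3 frames (hypothetical values).
V = [
    [0.0, 1.0, 2.0],
    [0.5, 1.0, 1.0],
    [0.5, 3.0, 1.0],
]
dV = temporal_difference(V)
print(dV)
```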
  14. Subtle Motion Descriptor (SMD) •  Temporal pooling of ΔV_t –  Positive and negative values are pooled –  Near-zero values are also pooled (this is the contribution of SMD) –  The threshold TH is fixed experimentally
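The pooling step could be sketched as below. The slides state only that positive, negative, and near-zero differences are pooled and that TH is a threshold; the exact rule used here is an assumption: differences above TH are summed into a "plus" bin, differences below −TH into a "minus" bin (by magnitude), and the absolute values of everything in between into a "zero" bin. The function name `pool_subtle_motion` is hypothetical.

```python
from typing import List, Tuple


def pool_subtle_motion(
    dV: List[List[float]], th: float
) -> Tuple[List[float], List[float], List[float]]:
    """Pool temporal differences per feature dimension into three bins.

    Assumed rule (not confirmed by the slides):
      plus  - sum of differences greater than th
      minus - sum of magnitudes of differences less than -th
      zero  - sum of magnitudes of differences with |d| <= th
    """
    n_dims = len(dV[0])
    plus = [0.0] * n_dims
    minus = [0.0] * n_dims
    zero = [0.0] * n_dims
    for row in dV:
        for i, d in enumerate(row):
            if d > th:
                plus[i] += d
            elif d < -th:
                minus[i] += -d
            else:
                zero[i] += abs(d)
    return plus, minus, zero


# Toy differences for 2 time steps and 3 dimensions (hypothetical values).
dV = [
    [0.5, 0.05, -1.0],
    [-0.02, 2.0, 0.0],
]
plus, minus, zero = pool_subtle_motion(dV, th=0.1)
descriptor = plus + minus + zero  # concatenated per-dimension bins
print(descriptor)
```

Concatenating the three bins triples the feature dimension; without the "zero" bin, small differences such as 0.05 and −0.02 would fall into the positive and negative bins and not be represented separately.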
  15. Datasets •  Temporal action datasets –  NTSEL [Kataoka+, ITSC2015] •  Walk (NA), cross (NA), bicycle (NA), turn (TA) with human bbox –  UTKinect-Action [Xia+, CVPRW2012] •  10 ordered NAs (e.g. walk, throw, sit) •  8 TAs (excluding push/pull; next page) •  Without human bbox –  Watch-n-Patch [Wu+, CVPR2015] •  10 daily NAs (e.g. read, turn on monitor, leave office) •  The 10 most frequent TAs (next page) •  Without human bbox
  16. Experimental settings (list of TAs) •  @UTKinect-Action @Watch-n-Patch
  17. Implementation •  Action recognition approaches –  Temporal CNN models •  Pooled Time-series (PoT) [Ryoo+, CVPR2015] •  CNN accumulation •  CNN + IDT [Jain+, ECCVW2014] –  Improved dense trajectories (IDT) with improved features •  IDT [Wang+, ICCV2013] •  IDT + co-occurrence feature [Kataoka+, ACCV2014] •  All features in IDT
  18. Exploration experiment •  Parameters –  Frame accumulation –  Thresholding value TH –  Layer fc6 vs fc7
  19. Exploration experiment •  Temporal accumulation [frames] –  Faster prediction: 3 [frames] (0.1 s) –  Toward state-of-the-art: 10 [frames] (0.33 s) –  The baselines should use 3- and 10-frame accumulation
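The durations quoted for the two accumulation settings are consistent with video at 30 frames per second. That frame rate is inferred from the figures (3 frames ≈ 0.1 s, 10 frames ≈ 0.33 s) and is not stated on the slide, so the default below is an assumption.

```python
def accumulation_seconds(n_frames: int, fps: float = 30.0) -> float:
    """Duration covered by n_frames at the given frame rate.

    fps=30.0 is an assumption inferred from the slide's numbers.
    """
    return n_frames / fps


for n in (3, 10):
    print(n, "frames ->", round(accumulation_seconds(n), 2), "s")
```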
  20. Exploration experiment •  Thresholding value TH –  The best value depends on the data
  21. Exploration experiment •  Layer fc6 vs fc7 –  Layer fc6 is better
  22. Results •  SMD (ours) is state-of-the-art in transitional action recognition
  23. Comparison with PoT •  Subtle motion is effective for transitional action recognition –  NTSEL: +2.18%, +8.63% –  UTKinect: +7.19%, +4.31% –  Watch-n-Patch: +4.82%, +5.12%
  24. Conclusion •  Two contributions: 1.  Definition of transitional action for short-term action prediction 2.  Subtle Motion Descriptor (SMD) to classify transitional and normal actions