
【VISAPP2016】Activity Prediction Using a Space-Time CNN and Bayesian Framework



We present a technique to address activity prediction, a new challenge in the field of computer vision. In activity prediction, we infer the next human activity from classified activities and an analysis of accumulated activity data. Moreover, the prediction must run in real time so that dangerous or anomalous activities can be avoided. The combination of space-time convolutional neural networks (ST-CNN) and improved dense trajectories (IDT) effectively captures human activities in image sequences. After categorizing human activities, we insert activity tags into an activity database in order to sample a distribution of human activity. A naive Bayes classifier allows us to achieve real-time activity prediction because only three elements are needed for parameter estimation. The contributions of this paper are: (i) activity prediction within a Bayesian framework and (ii) ST-CNN and IDT features for activity recognition. Human activity prediction in real scenes is achieved with 81.0% accuracy.


  1. Activity Prediction Using a Space-Time CNN and Bayesian Framework
     Hirokatsu KATAOKA, Yoshimitsu AOKI†, Kenji IWATA, Yutaka SATOH
     National Institute of Advanced Industrial Science and Technology (AIST), † Keio University
  2. Background
     •  Computer vision for human sensing
        –  Detection, tracking, trajectory analysis
        –  Posture estimation, action analysis
        –  Action recognition can extend human-sensing applications
     (Figure: human-sensing tasks, including detection, tracking, gaze estimation, posture estimation, face recognition, trajectory extraction, and action analysis, e.g. shaking hands)
  3. Related work 1: Action Recognition
     •  Action is a low-level primitive with semantic meaning
        –  e.g. walking, running, sitting
     (Figure: action recognition as classification with the person's location given, e.g. "this image contains a man walking" → "Walking")
  4. Is action recognition enough?
     (Figure: timeline contrasting post-detection event detection, which assigns an action tag Ai, with pre-estimation event prediction, which assigns a prediction tag Aj)
  5. Related work 2: Early Action Recognition
     •  Prediction in the early part of an action
        –  Integral bag-of-words
        –  Accumulating likelihood through the time sequence
     M. S. Ryoo, "Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos", International Conference on Computer Vision (ICCV), pp. 1036-1043, 2011.
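The integral bag-of-words idea cited here, accumulating a feature histogram as the video streams in and scoring the partial observation against per-class models, can be sketched roughly as follows. This is a simplified illustration, not Ryoo's implementation; all names and the histogram-intersection score are my assumptions.

```python
import numpy as np

def integral_bow(frame_word_ids, vocab_size):
    """Cumulative bag-of-words histograms: row t holds the visual-word
    counts observed in frames 0..t (an 'integral' histogram)."""
    hist = np.zeros((len(frame_word_ids), vocab_size))
    running = np.zeros(vocab_size)
    for t, words in enumerate(frame_word_ids):
        for w in words:
            running[w] += 1
        hist[t] = running
    return hist

def early_recognition_scores(partial_hist, class_models):
    """Score a partial observation against per-class mean histograms
    using histogram intersection; higher means more likely."""
    p = partial_hist / max(partial_hist.sum(), 1.0)
    return {c: np.minimum(p, m).sum() for c, m in class_models.items()}
```

Because each row reuses the running counts, an ongoing activity can be re-scored at every frame without recomputing the histogram from scratch.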
  6. Proposal
     •  Action prediction within an ST-CNN and Bayesian framework
        –  Action recognition
        –  Database analysis
     (Figure: graphical model over a time series; x_timezone = "Daytime", x_previous = "Walking", and x_current = "Sitting" are given, and the next action θ = "Using a PC" is not given and must be predicted)
  7. Problem settings
     •  Three different tasks in action analysis
        –  Action recognition: recognize A_t given frames 1 ... t, i.e. f(F^A_{1...t}) → A_t
        –  Early action recognition: recognize A_t given frames 1 ... t−L, i.e. f(F^A_{1...t−L}) → A_t
        –  Action prediction: recognize A_{t+L} given frames 1 ... t, i.e. f(F^A_{1...t}) → A_{t+L}
  8. Process flow
     •  Consists of (i) action recognition and (ii) action prediction
     1.  Action recognition
         1.1 Improved dense trajectories (IDT): pedestrian detection, trajectory extraction over t + L frames, feature extraction (HOG, HOF, MBH, Traj.), bag-of-words (BoW)
         1.2 Space-time convolutional neural networks (ST-CNN): Oxford VGG architecture (VGGNet)
     2.  Action prediction
         2.1 Bayesian framework
         2.2 Database
  9. Action Recognition (1/2)
     •  Improved Dense Trajectories (IDT) [Wang+, ICCV2013]
        –  Pyramidal image sequences and flow tracking
        –  Feature descriptors on trajectories
        –  Feature representation with bag-of-words (BoW)
     (Figure: example trajectories for "walking" and "sitting")
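The bag-of-words step in this pipeline quantizes each local descriptor (HOG, HOF, MBH, or trajectory shape) against a learned codebook and pools the assignments into a histogram. A minimal sketch, assuming a k-means codebook is already trained (function name and normalization choice are mine):

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Quantize local descriptors against a codebook and return an
    L1-normalized visual-word histogram.
    Shapes: descriptors (n, d), codebook (k, d)."""
    # Squared Euclidean distance from every descriptor to every center
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

One such histogram per descriptor type per video clip yields the fixed-length vectors that the classifier consumes.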
  10. Action Recognition (1/2)
     •  IDT + Co-occurrence HOG [Kataoka+, ACCV2014]
        –  CoHOG: edge-pair counting into the corresponding histogram position
        –  Extended CoHOG (ECoHOG): edge-magnitude accumulation
        –  PCA dim. reduction: 10^3–10^4 dims reduced to 10^1–10^2, easier to separate in feature space
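CoHOG's "edge-pair counting" means tallying co-occurrences of quantized gradient orientations at fixed spatial offsets, one co-occurrence matrix per offset. A rough sketch of that counting step (naive loops for clarity; names and the 8-bin choice are illustrative, not from the paper):

```python
import numpy as np

def cohog(orientations, offsets, n_bins=8):
    """Co-occurrence HOG: for each spatial offset (dy, dx), count pairs
    of quantized gradient orientations (bin at pixel, bin at offset
    pixel). `orientations` is a 2-D array of bin indices in [0, n_bins).
    Returns one n_bins x n_bins matrix per offset."""
    H, W = orientations.shape
    mats = []
    for dy, dx in offsets:
        m = np.zeros((n_bins, n_bins))
        # Only visit pixels whose offset neighbor stays inside the image
        for y in range(max(0, -dy), min(H, H - dy)):
            for x in range(max(0, -dx), min(W, W - dx)):
                m[orientations[y, x], orientations[y + dy, x + dx]] += 1
        mats.append(m)
    return np.stack(mats)
```

ECoHOG would accumulate gradient magnitudes instead of raw counts at the same histogram positions, which is why the slide's PCA step is needed: the stacked matrices easily reach 10^3 to 10^4 dimensions.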
  11. Action Recognition (2/2)
     •  Space-time Convolutional Neural Networks (ST-CNN)
        –  Based on the VGG 16-layer architecture (VGGNet) [Simonyan+, ICLR2015]
        –  Spatio-temporal feature concatenation (around 10 frames)
     (Figure: ST-CNN built on VGGNet: stacks of Conv/Conv/Pool blocks followed by FC layers and a Softmax output)
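The "spatio-temporal feature concatenation" above amounts to stacking roughly 10 consecutive frames into one multi-channel input block before it enters the network. A minimal sketch of that stacking, with the boundary-padding policy being my assumption rather than the paper's:

```python
import numpy as np

def stack_frames(frames, t, length=10):
    """Concatenate `length` consecutive frames ending at index t into one
    (H, W, length) spatio-temporal block, the kind of input a space-time
    CNN consumes. Pads by repeating the first frame near the clip start."""
    start = max(0, t - length + 1)
    clip = list(frames[start:t + 1])
    while len(clip) < length:
        clip = [clip[0]] + clip  # repeat earliest frame to fill the window
    return np.stack(clip, axis=-1)
```

Feeding such blocks to a VGG-style network lets the first convolution mix information across time as well as space, instead of classifying each frame independently.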
  12. Action Prediction (1/2)
     •  Prediction model
        –  Action sequence: predicting "Using a PC" from "Walk" => "Sit"
        –  Time zone (supplemental info.): daytime
     (Figure: graphical model over a time series; given x_timezone = "Daytime", x_previous = "Walking", and x_current = "Sitting", predict the next activity θ = "Using a PC")
  13. Action Prediction (2/2)
     •  Database: spatio-temporal action tags + attributes
        –  Time zone: "morning", "daytime", "night"
        –  Previous & current action: "walk", "bend", "stand", "sit", ...
        –  Next action (objective): "use a PC", "read", "meal", ...
     (Figure: action-history DB entry: Daytime, Walking, Sitting → Using a PC)
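The database analysis boils down to the naive Bayes classifier mentioned in the abstract: with only the three observed variables (time zone, previous action, current action), the next action is the θ maximizing P(θ) ∏ᵢ P(xᵢ | θ). A minimal sketch, assuming tags are stored as (timezone, previous, current, next) tuples; the function names, the add-alpha smoothing, and the rough vocabulary size are my assumptions, not the authors' implementation:

```python
from collections import Counter, defaultdict

def train_counts(db):
    """db: iterable of (timezone, previous, current, next_action) tags."""
    prior = Counter()                # prior[theta] = count of next action theta
    cond = defaultdict(Counter)      # cond[i][(theta, x)] = count of x_i given theta
    for tz, prev, cur, theta in db:
        prior[theta] += 1
        for i, x in enumerate((tz, prev, cur)):
            cond[i][(theta, x)] += 1
    return prior, cond

def predict(prior, cond, tz, prev, cur, alpha=1.0):
    """argmax over theta of P(theta) * prod_i P(x_i | theta),
    with add-alpha smoothing for unseen attribute values."""
    total = sum(prior.values())
    best, best_p = None, -1.0
    for theta, n in prior.items():
        p = n / total
        for i, x in enumerate((tz, prev, cur)):
            p *= (cond[i][(theta, x)] + alpha) / (n + alpha * 10)  # 10: assumed vocab size
        if p > best_p:
            best, best_p = theta, p
    return best
```

Because training is just counting and prediction is a handful of multiplications per candidate action, this is what makes the real-time claim in the abstract plausible.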
  14. Experiments on the Daily Living Data
      –  Total 20 h of video
      –  3 different scenes
      –  640x480, 30 fps
  15. Results
      •  Action recognition
         –  IDT (HOG, HOF, MBH, CoHOG, ECoHOG, All)
         –  Per-frame CNN
         –  ST-CNN
         –  Combined vector
  16. Results
      •  Action prediction
      (Figure: given the time-zone and action attributes, the estimated intention is PC (0.82) vs. Read (0.11); the predicted activity is Read (1.00) vs. PC (0.00))
  17. Conclusion
      •  Action prediction approach combining recognition and database analysis
         –  Concatenated vector of IDT and ST-CNN features
         –  Bayesian framework
         –  Database