This presentation surveys methodologies for action and gesture recognition using deep learning models and feature engineering methods.
(6th April 2021)
3. Introduction
> Action and Gesture recognition + Deep learning
> Challenging problem: large amounts of data to be processed, high model complexity
> Proposed models: RNN and LSTM for action/gesture recognition
+ 3D convolutional networks
+ pre-computed motion-based features
+ combination of multiple visual cues
> Our goal: examine how these models treat the temporal dimension of the data
#Kookmin_University #Natural_Language_Processing_lab. 2
Computer vision and pattern recognition
Temporal dimension in sequences
6. Architectures
> How to deal with the temporal dimension
in deep learning-based human action and gesture recognition?
1) Using 3D filters in the convolutional layer
> It captures discriminative features along both spatial and temporal dimensions
while maintaining a certain temporal structure
2) Motion features
> Motion features are pre-computed
> The features are fed to the network as additional channels
3) Combining a 2D (or 3D) CNN applied to individual frames with temporal sequence modeling
> with RNN or LSTM
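As an illustration of option 2 above, the sketch below appends a motion channel to an RGB clip. It assumes simple frame differencing as a crude stand-in for real motion features such as optical flow; the function name add_motion_channel and the array layout (T, H, W, C) are illustrative choices, not taken from the surveyed methods.

```python
import numpy as np

def add_motion_channel(frames):
    """Append a frame-difference channel to an RGB clip.

    frames: float array of shape (T, H, W, 3).
    Returns an array of shape (T, H, W, 4).
    """
    gray = frames.mean(axis=-1)  # (T, H, W) grayscale approximation
    diff = np.zeros_like(gray)
    # Absolute difference to the previous frame; frame 0 has no
    # predecessor, so its motion channel stays at zero.
    diff[1:] = np.abs(gray[1:] - gray[:-1])
    return np.concatenate([frames, diff[..., None]], axis=-1)

rng = np.random.default_rng(0)
clip = rng.random((8, 32, 32, 3))          # synthetic clip, assumption for the demo
clip_with_motion = add_motion_channel(clip)
print(clip_with_motion.shape)              # (8, 32, 32, 4)
```

A network consuming this clip would only need its first layer widened from 3 to 4 input channels.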
8. Fusion Strategies
> Main variants for information fusion in deep learning models
1) Early
> Information from multiple sources is fused directly,
> before the data is fed into the model
2) Late
> Outputs of deep learning models are combined
3) Middle
> Intermediate layers fuse information
Additional fusion strategies: ensembles or stacked networks
to combine the information from parts of a segmented video sequence
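The three main fusion variants can be contrasted in a few lines. This is a minimal sketch with untrained random weights and two hypothetical modalities (named x_rgb and x_flow here); all sizes and helper names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense(x, w):
    return np.maximum(x @ w, 0.0)  # linear layer followed by ReLU

def weights(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out))  # untrained, random

n, d_rgb, d_flow, d_hid, n_cls = 4, 16, 12, 8, 3
x_rgb = rng.normal(size=(n, d_rgb))    # features of modality 1
x_flow = rng.normal(size=(n, d_flow))  # features of modality 2

# Early fusion: concatenate the inputs, then run a single model.
x_early = np.concatenate([x_rgb, x_flow], axis=1)
p_early = softmax(dense(x_early, weights(d_rgb + d_flow, d_hid)) @ weights(d_hid, n_cls))

# Middle fusion: separate first layers, fused at an intermediate layer.
h_rgb = dense(x_rgb, weights(d_rgb, d_hid))
h_flow = dense(x_flow, weights(d_flow, d_hid))
p_middle = softmax(np.concatenate([h_rgb, h_flow], axis=1) @ weights(2 * d_hid, n_cls))

# Late fusion: two complete models, their output probabilities averaged.
p_rgb = softmax(dense(x_rgb, weights(d_rgb, d_hid)) @ weights(d_hid, n_cls))
p_flow = softmax(dense(x_flow, weights(d_flow, d_hid)) @ weights(d_hid, n_cls))
p_late = 0.5 * (p_rgb + p_flow)

print(p_early.shape, p_middle.shape, p_late.shape)
```

Averaging is only one way to combine outputs in late fusion; weighted sums or a small classifier on top of the scores are common alternatives.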
12. Reviews: Action/Activity & Gesture Recognition
1. 3D Convolutional Neural Networks
2. Motion-based Features
3. Temporal Deep Learning Models: RNN and LSTM
4. Deep Learning with Fusion Strategies
13. 3D Convolutional Neural Networks
> Extending the convolution along the temporal axis (in 3D CNN)
- Initializing the weights of a 3D CNN by using 2D weights learned from ImageNet
- Factorizing the 3D convolutional kernel learning
as a sequential process of learning 2D spatial and 1D temporal kernels in different layers
- Performing 3D convolutions over stacks of optical flow maps
- Using multiple 3D CNNs in multiple stages
- Combining 3D CNN models with sequence modeling methods
or hand-crafted feature descriptors
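To make the temporal extension and the kernel factorization concrete, the sketch below implements a naive single-channel 3D convolution (cross-correlation, "valid" padding) in NumPy and checks that, for a separable kernel, a 2D spatial pass followed by a 1D temporal pass gives the same result. The separable kernel and the absence of a nonlinearity between the two passes are simplifying assumptions; factorized networks in practice learn the two kernels in different layers and are not restricted to this case.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d(clip, kernel):
    """Valid cross-correlation of a (T, H, W) clip with a (kt, kh, kw) kernel."""
    windows = sliding_window_view(clip, kernel.shape)  # (T', H', W', kt, kh, kw)
    return np.einsum("thwijk,ijk->thw", windows, kernel)

rng = np.random.default_rng(0)
clip = rng.random((10, 16, 16))       # synthetic grayscale clip
k_time = rng.normal(size=3)           # 1D temporal kernel
k_space = rng.normal(size=(3, 3))     # 2D spatial kernel

# Full 3D kernel built as an outer product (the separable case): 27 weights.
k_full = k_time[:, None, None] * k_space[None, :, :]
out_full = conv3d(clip, k_full)

# Factorized version: 2D spatial pass, then 1D temporal pass: 9 + 3 weights.
out_space = conv3d(clip, k_space[None, :, :])        # (10, 14, 14)
out_fact = conv3d(out_space, k_time[:, None, None])  # (8, 14, 14)

print(out_full.shape, np.allclose(out_full, out_fact))
```

The output keeps a temporal axis (8 time steps from 10 input frames), which is what lets later layers continue to model temporal structure.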
14. Motion-based Features
> Incorporating pre-computed temporal features within the deep model
- Presenting two-stream CNN (spatial and temporal networks)
- Exploiting motion vectors from video compression
- Extending the convolutions in time with long-term temporal convolutions
> Extending the CNN capabilities using trajectory features
- Pooling and normalization
- Learning bag-of-features from dense trajectories of synthetic 3D human models
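A possible way to prepare inputs for a two-stream model is sketched below: the spatial stream receives one RGB frame, and the temporal stream receives several consecutive flow fields stacked along the channel axis. The flow fields are assumed to be pre-computed elsewhere (random arrays stand in for them here), and the function name two_stream_inputs, the stack length L, and the channel ordering are illustrative assumptions.

```python
import numpy as np

def two_stream_inputs(frames, flows, t, L):
    """Build the inputs of the spatial and temporal streams.

    frames: (T, H, W, 3) RGB clip.
    flows:  (T - 1, H, W, 2) pre-computed flow, flows[i] relating frame i to i + 1.
    Returns (spatial, temporal) with shapes (H, W, 3) and (H, W, 2 * L).
    """
    if t < 0 or t + L > len(flows):
        raise ValueError("not enough flow fields starting at frame t")
    spatial = frames[t]
    # Stack L flow fields channel-wise: [dx_t, dy_t, dx_t+1, dy_t+1, ...]
    temporal = np.concatenate(list(flows[t:t + L]), axis=-1)
    return spatial, temporal

rng = np.random.default_rng(0)
frames = rng.random((12, 24, 24, 3))      # synthetic clip
flows = rng.normal(size=(11, 24, 24, 2))  # placeholder for real optical flow
spatial, temporal = two_stream_inputs(frames, flows, t=2, L=5)
print(spatial.shape, temporal.shape)      # (24, 24, 3) (24, 24, 10)
```

Each stream would then be processed by its own CNN, with the two predictions combined afterwards.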
15. Temporal Deep Learning Models: RNN and LSTM
> Combining CNN with temporal sequence models (RNN or LSTM)
- Modeling how motion information changes between successive frames
- Presenting a multi-stream (motion and appearance) using bi-directional RNN
- Observing video frames and deciding both where to look next and when to emit a
prediction
- Using 3D skeleton sequences to regularize an LSTM network (LSTM+CNN) on video frames
- Using an RNN in a multimodal (depth video, skeleton, and speech) system
- Using multiple RNNs to facilitate the handling of variable-length gestures
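The general pattern of per-frame features followed by a recurrent model can be sketched as follows. This is a simplified illustration: a plain (Elman-style) RNN replaces the LSTM, a hand-written statistic replaces the per-frame CNN, and all weights are random and untrained. The names frame_features and classify_clip are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_HID, N_CLS = 6, 5, 4
W_x = rng.normal(scale=0.3, size=(D_FEAT, D_HID))   # input-to-hidden weights
W_h = rng.normal(scale=0.3, size=(D_HID, D_HID))    # hidden-to-hidden weights
W_out = rng.normal(scale=0.3, size=(D_HID, N_CLS))  # hidden-to-class weights

def frame_features(frame):
    """Stand-in for a per-frame CNN: per-channel mean and std of an RGB frame."""
    return np.concatenate([frame.mean(axis=(0, 1)), frame.std(axis=(0, 1))])

def classify_clip(frames):
    """Run the recurrent update over all frames and score the final state."""
    h = np.zeros(D_HID)
    for frame in frames:
        h = np.tanh(frame_features(frame) @ W_x + h @ W_h)
    return h @ W_out  # one score per class

short_clip = rng.random((5, 16, 16, 3))
long_clip = rng.random((12, 16, 16, 3))
print(classify_clip(short_clip).shape, classify_clip(long_clip).shape)  # (4,) (4,)
```

Because the hidden state has a fixed size, clips of different lengths map to outputs of the same shape, which is one reason recurrent models suit variable-length gestures.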
16. Deep Learning with Fusion Strategies
> Using diverse fusion schemes to improve action recognition performance
- Learning an end-to-end hierarchical RNN with skeleton data
- DeepConvLSTM based on convolutional and LSTM recurrent units
- HMM (Hidden Markov Model), GMM (Gaussian Mixture Model)
17. Discussion
> Comprehensive overview of deep learning-based models for action and gesture recognition
- How does a method deal with temporal information?
- How can such a large network be trained with small datasets?
> 3D networks over a long sequence can learn complex temporal patterns
> Temporal models (RNN and LSTM) have the crucial advantage of coping with longer-range
temporal relations
> Ensemble learning (fusion strategies) can reduce the bias and variance errors
of the learning algorithm