This presentation explains how to integrate an attention mechanism with CNN and LSTM models.
The paper carries out video classification using attention-based CNN-LSTM models.
(9th April 2021)
3. Introduction
> Traditional visual features
: color-based, shape-based, motion-based
> Hand-crafted features on machine learning
: support vector machines (SVM) and hidden Markov models (HMM)
> For image/video classification: Convolutional neural network (CNN)
> For temporal information: Long short-term memory (LSTM)
> For weighting the signal toward the most relevant information: Attention mechanism
>> CNN + LSTM with an attention mechanism (minimal sketch below)
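A minimal sketch of this combination, assuming a TensorFlow/Keras pipeline; the backbone choice (VGG19), frame count, hidden size, and class count are placeholder assumptions, not the paper's exact settings.

```python
# Minimal sketch (not the authors' exact code): a 2D CNN extracts per-frame
# features, a Bi-LSTM models temporal order, and a soft-attention layer
# weights the LSTM outputs before classification.
import tensorflow as tf
from tensorflow.keras import layers, Model

num_frames, num_classes = 30, 101              # placeholder values
frame_shape = (224, 224, 3)

# Frozen ImageNet backbone as the per-frame feature extractor.
backbone = tf.keras.applications.VGG19(include_top=False, pooling="avg",
                                       input_shape=frame_shape)
backbone.trainable = False

frames = layers.Input(shape=(num_frames,) + frame_shape)
feats = layers.TimeDistributed(backbone)(frames)                   # (B, T, 512)
seq = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(feats)

# Soft attention over time: score each step, normalize, take a weighted sum.
scores = layers.Dense(1)(seq)                                      # (B, T, 1)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([seq, weights])

outputs = layers.Dense(num_classes, activation="softmax")(context)
model = Model(frames, outputs)
```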
4. Attention Integrated Deep Networks
> 2D CNN: VGG16, VGG19, Inception V3, ResNet50, Xception
> LSTM: Bi-directional LSTM
> Attention: before LSTM, after LSTM (placement sketch after this list)
To extract relevant features that can represent individual video frames
To preserve information from both past and future
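The sketch on the previous slide places attention after the Bi-LSTM outputs; below is a hedged sketch of the other placement listed here, where per-frame CNN features are reweighted before entering the Bi-LSTM (Keras assumed, layer sizes are illustrative only).

```python
# Hedged sketch of the "attention before LSTM" placement (assumed Keras layers,
# not the authors' exact implementation).
import tensorflow as tf
from tensorflow.keras import layers

def attention_before_lstm(cnn_feats, lstm_units=256):
    """cnn_feats: (batch, time, feat_dim) per-frame CNN features."""
    scores = layers.Dense(1)(cnn_feats)                 # one score per frame
    weights = layers.Softmax(axis=1)(scores)            # normalize over time
    # Reweight the frames but keep the sequence so the Bi-LSTM sees every step.
    weighted = layers.Lambda(lambda x: x[0] * x[1])([cnn_feats, weights])
    # The final Bi-LSTM state summarizes the reweighted clip.
    return layers.Bidirectional(layers.LSTM(lstm_units))(weighted)
```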
5. Experiments
Network hyper-parameters
> Hidden units of LSTM: 64, 128, 256, 512
> The size of the dense layer for attention: set to the average number of video frames used
- longer video sequences: extra frames are discarded
- shorter video sequences: zero-padded (see the sketch after this list)
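A short sketch of the frame-count normalization described above, using NumPy; `target_len` stands in for the average frame count used as the fixed sequence length.

```python
import numpy as np

def fix_sequence_length(frames: np.ndarray, target_len: int) -> np.ndarray:
    """frames: (T, H, W, C) video clip. Truncate long clips, zero-pad short ones."""
    t = frames.shape[0]
    if t >= target_len:
        return frames[:target_len]                      # long clip: discard extra frames
    pad = np.zeros((target_len - t,) + frames.shape[1:], dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)        # short clip: zero padding
```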
Evaluation results
> Dataset
(1) UCF101: 13,320 videos (101 action categories)
(2) Sports-1M: 1 million YouTube videos (487 classes)
- select videos shorter than 20 seconds, covering 202 of the 487 classes
- keep only classes with more than 100 video files
- total: 18,319 video sequences (99 classes) >> Sports-1M-99 (selection sketched below)
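A hedged sketch of this subset selection, assuming a hypothetical metadata table (`sports1m_metadata.csv` with `duration_sec` and `label` columns); the slides do not give the actual selection script.

```python
import pandas as pd

meta = pd.read_csv("sports1m_metadata.csv")             # hypothetical metadata file

# Keep videos shorter than 20 seconds.
short = meta[meta["duration_sec"] < 20]

# Keep only classes that still have more than 100 videos.
counts = short["label"].value_counts()
keep = counts[counts > 100].index
sports1m_99 = short[short["label"].isin(keep)]          # ~99 classes, ~18k clips
```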
7. Summary
1. Applying attention to LSTM outputs achieves better accuracy
2. VGG19 is more suitable for integrating the attention block because of its lower-dimensional feature output
3. 2D CNN outperforms 3D CNN
> Integrating the attention mechanism into 2D CNNs and LSTM
for video classification