Understanding of deep learning
- CNN for video data
17.05.26 You Sung Min
Tran, Du, et al. "Learning spatiotemporal features with 3D
convolutional networks." Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2015.
Paper review
1. Review of Convolutional Neural Networks (2D)
2. 3-D CNN for temporal features (C3D model)
3. C3D evaluation on video tasks
Contents
Convolutional Neural Network (2D)
 Convolution layer
 Subsampling (Pooling) layer
Review of Convolutional Neural Networks
Feature Extractor Classifier
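The two layer types above can be sketched in plain NumPy. This is a minimal illustration of one convolution pass and one pooling pass, not the network used in the paper; the averaging kernel is an arbitrary choice for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1) of a single-channel image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the weighted sum of one kh x kw patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (subsampling) with a size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0          # simple averaging kernel (illustrative)
fmap = conv2d(image, kernel)            # (4, 4) feature map
pooled = max_pool2d(fmap)               # (2, 2) after 2x2 subsampling
```

The convolution layer extracts local features; the pooling layer then halves each spatial dimension, which is what makes the stacked feature maps progressively smaller.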
Convolutional Neural Network
Review of Convolutional Neural Networks
Convolutional Neural Network
Review of Convolutional Neural Networks
Feature map
Review of Convolutional Neural Networks
Visualization of feature map (Deconvnet)
Yosinski, Jason, et al.
"Understanding neural networks through deep visualization."
Deconvnet: feature maps → unpooling → rectify → deconvolution → input image
It maps activation values (machine domain) back to pixel values (human visual domain).
CNN for multi-dimensional data
 How to apply CNN on multi-dimensional input (video)?
3-D CNN for temporal features
Image → convolution, pooling
Video → convolution? pooling?
CNN for RGB images
3-D CNN for temporal features
R, G, B channels of an m by n color image → m * n * 3
Multi-frame input with a 2D conv kernel → ?
3-D CNN for temporal features
2D convolution on an RGB image: the kernel spans the full channel depth
(height * width * channel), so convolving input and kernel collapses the
R, G, B channels into a single 2-D feature map.
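A minimal NumPy sketch of why 2-D convolution collapses the channel axis (the image and kernel sizes are illustrative assumptions):

```python
import numpy as np

def conv2d_rgb(image, kernel):
    """Valid 2-D convolution of an (H, W, C) image with a (kh, kw, C) kernel."""
    kh, kw, _ = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The sum runs over height, width AND all channels,
            # so each output position is a single scalar.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))     # m x n x 3 color image
k = rng.random((3, 3, 3))       # kernel depth equals the channel depth
fmap = conv2d_rgb(rgb, k)       # result is 2-D: the channel axis is gone
```

Because the output is already 2-D, stacking video frames along the channel axis and applying such a kernel would merge all temporal information after a single layer.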
CNN for multi-dimensional data
 RGB image : height * width * channel (color)
 RGB video : height * width * channel (color) * time
 Convolution for temporal axis
3-D CNN for temporal features
Video (with temporal info.) → convolution? pooling?
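Convolution along the temporal axis can be sketched the same way. The clip length and kernel depth below follow the C3D settings, but the single-channel, single-kernel setup is a simplification:

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid 3-D convolution over (time, height, width), stride 1."""
    kd, kh, kw = kernel.shape
    od = clip.shape[0] - kd + 1
    oh = clip.shape[1] - kh + 1
    ow = clip.shape[2] - kw + 1
    out = np.empty((od, oh, ow))
    for t in range(od):
        for i in range(oh):
            for j in range(ow):
                # The kernel slides along time as well as space.
                out[t, i, j] = np.sum(clip[t:t+kd, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
clip = rng.random((16, 112, 112))   # 16 frames, as in the C3D input
k = rng.random((3, 3, 3))           # temporal depth d = 3, the paper's best
vol = conv3d(clip, k)               # output keeps a temporal axis
```

Unlike the 2-D case, the output volume still has a time dimension, so stacked 3-D layers can keep modeling motion across frames.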
3D convolutional Networks (C3D model)
3-D CNN for temporal features
In 2D convolution, the L input frames are treated as channels;
in 3D convolution, L is preserved as the time (frame) axis.
3D convolution kernel – depth select
 In general, height & width of kernel are 3
 Temporal depth experiment
- Fixed networks : 1, 3, 5, 7
- Increasing network : 3-3-5-5-7
- Decreasing network : 7-5-5-3-3
 Trained and tested on UCF101 dataset
- 13,320 videos covering 101 classes of human action
3-D CNN for temporal features
d : Temporal depth
<UCF 101 – Human Action Recognition Dataset>
3D convolution kernel – depth select
 Fixed network with temporal depth 3 showed the best performance
3-D CNN for temporal features
2D conv
3D conv
C3D network
 8 convolution layers (3 * 3 * 3)
 5 max-pooling layers (2 * 2 * 2), (1 * 2 * 2 for the 1st pooling layer)
 Video input shape : 16 * 112 * 112 (frames, height, width)
3-D CNN for temporal features
Video
Input
Feature Extractor Classifier
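A quick sanity check of how the 16 * 112 * 112 input shrinks through the five pooling layers (the conv layers use same-padding, so only pooling changes the shape; the spatial padding at pool5 is an implementation detail assumed here so the final volume matches the commonly reported 1 * 4 * 4):

```python
# Trace the (frames, height, width) volume through the C3D pooling stack.
def pool(shape, kd, kh, kw):
    """Shape after non-overlapping max pooling with a kd x kh x kw window."""
    d, h, w = shape
    return (d // kd, h // kh, w // kw)

shape = (16, 112, 112)                 # network input (frames, height, width)
shape = pool(shape, 1, 2, 2)           # pool1: spatial only -> (16, 56, 56)
shape = pool(shape, 2, 2, 2)           # pool2 -> (8, 28, 28)
shape = pool(shape, 2, 2, 2)           # pool3 -> (4, 14, 14)
shape = pool(shape, 2, 2, 2)           # pool4 -> (2, 7, 7)
# Assumed: pool5 pads height/width by 1 so 7 -> 4 rather than 3.
shape = pool((shape[0], shape[1] + 1, shape[2] + 1), 2, 2, 2)

flat = 512 * shape[0] * shape[1] * shape[2]   # 512 filters in the last conv
```

With this trace the flattened pool5 output has 8192 values, which is what the first 4096-unit fully connected layer consumes.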
C3D network training and test
 Sports-1M dataset
- 1 million (1,133,158) videos of sports
- Annotated with 487 sports labels
C3D evaluation on video tasks
C3D network training and test
C3D evaluation on video tasks
C3D network feature visualization
C3D evaluation on video tasks
Video
Input
Feature Extractor Classifier
Deconvolution
C3D network feature visualization
C3D evaluation on video tasks
C3D network feature evaluation
 Tested on UCF101 dataset
 Action recognition
C3D evaluation on video tasks
Video
Input
Feature Extractor Classifier
Encoded features
(4096)
Classifiers
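In the paper the 4096-d fc activations are used as fixed video descriptors for a linear classifier. As a stand-in sketch, the example below separates synthetic 4096-d feature vectors with a nearest-centroid rule; the data, class count, and classifier choice are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 5, 4096                      # dim matches the fc feature size
# Synthetic "encoded features": one Gaussian cluster per class.
centers = rng.normal(size=(n_classes, dim)) * 3
features = np.vstack([c + rng.normal(size=(20, dim)) for c in centers])
labels = np.repeat(np.arange(n_classes), 20)

# Fit: one centroid per class in feature space.
centroids = np.stack([features[labels == c].mean(axis=0)
                      for c in range(n_classes)])

# Predict: assign each clip to the nearest class centroid.
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == labels).mean()
```

The point of the evaluation is that good features make even a simple linear-style classifier on top of them work well; the feature extractor itself is frozen.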
C3D network feature evaluation
C3D evaluation on video tasks
Handcrafted feature
RGB framewise input
Multi-feature
combination input
C3D network feature evaluation
 t-Distributed Stochastic Neighbor Embedding (t-SNE)
: dimensionality reduction for visualization
C3D evaluation on video tasks
(2D conv) (3D conv)
Conclusion
 C3D network showed outstanding performance on several
video tasks
C3D evaluation on video tasks
- 42 types of daily objects in first-person view
- 130 videos of 13 scene categories
- 420 videos of 14 scene categories
- 3,631 videos of 432 action classes
References
 Image Source from https://deeplearning4j.org/convolutionalnets
 Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional
networks." European Conference on Computer Vision, Springer International
Publishing, 2014.
 Jia-Bin Huang, “Lecture 29 Convolutional Neural Networks”, Computer Vision Spring
2015
 Yosinski, Jason, et al. "Understanding neural networks through deep visualization."
 Soomro et al. "UCF101: A dataset of 101 human actions classes from videos in the wild."
 Peng, Xiaojiang, et al. "Large margin dimensionality reduction for action similarity labeling." IEEE
Signal Processing Letters 21.8 (2014): 1022-1025.
 Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks." Proceedings of
the IEEE International Conference on Computer Vision. 2015.

Editor's Notes

  • #24 Computing the activations of a 13-layer convolutional network requires roughly 30 billion operations.