Understanding of deep learning
- CNN for video data
17.05.26 You Sung Min
Tran, Du, et al. "Learning spatiotemporal features with 3D
convolutional networks." Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2015.
Paper review
1. Review of Convolutional Neural Networks (2D)
2. 3-D CNN for temporal features (C3D model)
3. C3D evaluation on video tasks
Contents
Convolutional Neural Network (2D)
 Convolution layer
 Subsampling (Pooling) layer
Review of Convolutional Neural Networks
Feature Extractor Classifier
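The two layer types above can be sketched in plain NumPy. This is a minimal illustration of one convolution pass and one pooling pass, not the network used in the paper; the averaging kernel is an arbitrary choice for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1) of a single-channel image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the weighted sum of one kh x kw patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (subsampling) with a size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0          # simple averaging kernel (illustrative)
fmap = conv2d(image, kernel)            # (4, 4) feature map
pooled = max_pool2d(fmap)               # (2, 2) after 2x2 subsampling
```

The convolution layer extracts local features; the pooling layer then halves each spatial dimension, which is what makes the stacked feature maps progressively smaller.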
Convolutional Neural Network
Review of Convolutional Neural Networks
Convolutional Neural Network
Review of Convolutional Neural Networks
Feature map
Review of Convolutional Neural Networks
Visualization of feature map (Deconvnet)
Yosinski, Jason, et al.
"Understanding neural networks through deep visualization."
Deconvnet: feature maps → unpooling → rectify → deconvolution → input image
It maps activation values (machine domain) back to pixel values (human visual domain).
CNN for multi-dimensional data
 How to apply CNN on multi-dimensional input (video)?
3-D CNN for temporal features
Image → convolution, pooling
Video → convolution? pooling?
CNN for RGB images
3-D CNN for temporal features
R, G, B channels of an m by n color image → m * n * 3
Multi-frame input with a 2D conv kernel → ?
3-D CNN for temporal features
2D convolution on an RGB image: the kernel spans the full channel depth
(height * width * channel), so convolving input and kernel collapses the
R, G, B channels into a single 2-D feature map.
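A minimal NumPy sketch of why 2-D convolution collapses the channel axis (the image and kernel sizes are illustrative assumptions):

```python
import numpy as np

def conv2d_rgb(image, kernel):
    """Valid 2-D convolution of an (H, W, C) image with a (kh, kw, C) kernel."""
    kh, kw, _ = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The sum runs over height, width AND all channels,
            # so each output position is a single scalar.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))     # m x n x 3 color image
k = rng.random((3, 3, 3))       # kernel depth equals the channel depth
fmap = conv2d_rgb(rgb, k)       # result is 2-D: the channel axis is gone
```

Because the output is already 2-D, stacking video frames along the channel axis and applying such a kernel would merge all temporal information after a single layer.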
CNN for multi-dimensional data
 RGB image : height * width * channel (color)
 RGB video : height * width * channel (color) * time
 Convolution for temporal axis
3-D CNN for temporal features
Video (with temporal info.) → convolution? pooling?
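Convolution along the temporal axis can be sketched the same way. The clip length and kernel depth below follow the C3D settings, but the single-channel, single-kernel setup is a simplification:

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid 3-D convolution over (time, height, width), stride 1."""
    kd, kh, kw = kernel.shape
    od = clip.shape[0] - kd + 1
    oh = clip.shape[1] - kh + 1
    ow = clip.shape[2] - kw + 1
    out = np.empty((od, oh, ow))
    for t in range(od):
        for i in range(oh):
            for j in range(ow):
                # The kernel slides along time as well as space.
                out[t, i, j] = np.sum(clip[t:t+kd, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
clip = rng.random((16, 112, 112))   # 16 frames, as in the C3D input
k = rng.random((3, 3, 3))           # temporal depth d = 3, the paper's best
vol = conv3d(clip, k)               # output keeps a temporal axis
```

Unlike the 2-D case, the output volume still has a time dimension, so stacked 3-D layers can keep modeling motion across frames.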
3D convolutional Networks (C3D model)
3-D CNN for temporal features
In 2D convolution, the L input frames are treated as channels;
in 3D convolution, L is preserved as the time (frame) axis.
3D convolution kernel – depth select
 In general, height & width of kernel are 3
 Temporal depth experiment
- Fixed networks : 1, 3, 5, 7
- Increasing network : 3-3-5-5-7
- Decreasing network : 7-5-5-3-3
 Trained and tested on UCF101 dataset
- 13,320 videos covering 101 classes of human action
3-D CNN for temporal features
d : Temporal depth
<UCF 101 – Human Action Recognition Dataset>
3D convolution kernel – depth select
 Fixed network with temporal depth 3 showed the best performance
3-D CNN for temporal features
2D conv
3D conv
C3D network
 8 convolution layers (3 * 3 * 3)
 5 max-pooling layers (2 * 2 * 2), (1 * 2 * 2 for the 1st pooling layer)
 Video input shape : 16 * 112 * 112 (frames, height, width)
3-D CNN for temporal features
Video
Input
Feature Extractor Classifier
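A quick sanity check of how the 16 * 112 * 112 input shrinks through the five pooling layers (the conv layers use same-padding, so only pooling changes the shape; the spatial padding at pool5 is an implementation detail assumed here so the final volume matches the commonly reported 1 * 4 * 4):

```python
# Trace the (frames, height, width) volume through the C3D pooling stack.
def pool(shape, kd, kh, kw):
    """Shape after non-overlapping max pooling with a kd x kh x kw window."""
    d, h, w = shape
    return (d // kd, h // kh, w // kw)

shape = (16, 112, 112)                 # network input (frames, height, width)
shape = pool(shape, 1, 2, 2)           # pool1: spatial only -> (16, 56, 56)
shape = pool(shape, 2, 2, 2)           # pool2 -> (8, 28, 28)
shape = pool(shape, 2, 2, 2)           # pool3 -> (4, 14, 14)
shape = pool(shape, 2, 2, 2)           # pool4 -> (2, 7, 7)
# Assumed: pool5 pads height/width by 1 so 7 -> 4 rather than 3.
shape = pool((shape[0], shape[1] + 1, shape[2] + 1), 2, 2, 2)

flat = 512 * shape[0] * shape[1] * shape[2]   # 512 filters in the last conv
```

With this trace the flattened pool5 output has 8192 values, which is what the first 4096-unit fully connected layer consumes.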
C3D network training and test
 Sports-1M dataset
- 1 million (1,133,158) videos of sports
- Annotated with 487 sports labels
C3D evaluation on video tasks
C3D network training and test
C3D evaluation on video tasks
C3D network feature visualization
C3D evaluation on video tasks
Video
Input
Feature Extractor Classifier
Deconvolution
C3D network feature visualization
C3D evaluation on video tasks
C3D network feature evaluation
 Tested on UCF101 dataset
 Action recognition
C3D evaluation on video tasks
Video
Input
Feature Extractor Classifier
Encoded features
(4096)
Classifiers
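In the paper the 4096-d fc activations are used as fixed video descriptors for a linear classifier. As a stand-in sketch, the example below separates synthetic 4096-d feature vectors with a nearest-centroid rule; the data, class count, and classifier choice are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 5, 4096                      # dim matches the fc feature size
# Synthetic "encoded features": one Gaussian cluster per class.
centers = rng.normal(size=(n_classes, dim)) * 3
features = np.vstack([c + rng.normal(size=(20, dim)) for c in centers])
labels = np.repeat(np.arange(n_classes), 20)

# Fit: one centroid per class in feature space.
centroids = np.stack([features[labels == c].mean(axis=0)
                      for c in range(n_classes)])

# Predict: assign each clip to the nearest class centroid.
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == labels).mean()
```

The point of the evaluation is that good features make even a simple linear-style classifier on top of them work well; the feature extractor itself is frozen.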
C3D network feature evaluation
C3D evaluation on video tasks
Handcrafted feature
RGB framewise input
Multi-feature
combination input
C3D network feature evaluation
 t-Distributed Stochastic Neighbor Embedding (t-SNE)
: dimensionality reduction for visualization
C3D evaluation on video tasks
(2D conv) (3D conv)
Conclusion
 C3D network showed outstanding performance on several
video tasks
C3D evaluation on video tasks
- 42 types of daily objects in first-person view
- 130 videos of 13 scene categories
- 420 videos of 14 scene categories
- 3,631 videos of 432 action classes
References
 Image Source from https://deeplearning4j.org/convolutionalnets
 Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional
networks." European Conference on Computer Vision, Springer International
Publishing, 2014.
 Jia-Bin Huang, “Lecture 29 Convolutional Neural Networks”, Computer Vision Spring
2015
 Yosinski, Jason, et al. "Understanding neural networks through deep visualization."
 Soomro et al. "UCF101: A dataset of 101 human actions classes from videos in the wild."
 Peng, Xiaojiang, et al. "Large margin dimensionality reduction for action similarity labeling." IEEE
Signal Processing Letters 21.8 (2014): 1022-1025.
 Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks." Proceedings of
the IEEE International Conference on Computer Vision. 2015.

Editor's Notes

  • #24 Computing the activations of a 13-layer convolutional network requires roughly 30 billion operations.