Temporal Segment Network:
Two-stream CNN and its application in action
recognition
Dongang Wang
15 Sep 2017
Contents
• Temporal Segment Network (TSN) :
  basic ideas, method and tricks in training and test phases.
• Two-Stream CNN:
  combination of spatial and temporal features, late fusion comparison.
• BN-Inception:
  review of the structure in detail, derivation from GoogLeNet, usage in TSN.
• Optical Flow and Warped Optical Flow:
  basic idea and different methods, dense flow, warped flow.
Authors
• Limin Wang (王利民): BS in NJU, PhD in CUHK with Xiaoou Tang, now
postdoc in ETHZ.
• Yuanjun Xiong (熊元军): BE in Tsinghua, PhD in CUHK with Xiaoou Tang,
now postdoc in CUHK.
• Zhe Wang (王哲): BE in ZJU, PhD in CUHK with Xiaogang Wang.
• Yu Qiao (乔宇): Professor in SIAT.
• Dahua Lin (林达华): Professor in CUHK. BS in USTC, PhD in MIT.
• Xiaoou Tang (汤晓鸥): Professor in CUHK. BE in USTC, PhD in MIT.
• Luc Van Gool: Professor in ETHZ.
General Structure of TSN
[Wang, ECCV2016]
Issues
 1. Segments: How to select key frames/segments?
 2. Modality: How to compute optical flow features, and how to use the
flow features in a CNN?
 3. Training and test: How to train and how to test?
 4. Fusion of two CNNs: Are there any other ways besides late fusion?
Temporal Segment Network
 Structure:
• Two-Stream CNN
• Batch Normalization -> Partial Batch Normalization
 Modality:
• Optical Flow
• Warped Flow
 Tricks:
• Initialization
• Data augmentation
• Segments
• Test
Two-Stream CNN
The idea comes from the human visual cortex, which contains two pathways: the ventral stream
(object recognition) and the dorsal stream (motion detection).
[Simonyan, NIPS2014]
Two-Stream CNN
 This method has proved useful. The following picture shows the 96 learnt 7x7
filters for a flow stack (10 for x and 10 for y).
 The image also shows how optical flow features are used: flow images are
stacked as channels. TSN builds on the same idea.
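A minimal sketch of stacking flow images as channels, assuming NumPy arrays and an interleaved x/y channel order. The helper name stack_flow and the channel order are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def stack_flow(flow_x, flow_y):
    """Interleave L x-flow and L y-flow frames into one (2L, H, W) input."""
    assert len(flow_x) == len(flow_y)
    channels = []
    for fx, fy in zip(flow_x, flow_y):
        channels.extend([fx, fy])        # assumed order: x1, y1, x2, y2, ...
    return np.stack(channels, axis=0)

L, H, W = 10, 224, 224
flow_x = [np.zeros((H, W), dtype=np.float32) for _ in range(L)]
flow_y = [np.ones((H, W), dtype=np.float32) for _ in range(L)]
stacked = stack_flow(flow_x, flow_y)
print(stacked.shape)  # (20, 224, 224)
```

With 10 x-flow and 10 y-flow images, the first convolution layer sees 20 input channels instead of 3.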
BN-Inception
Partial BN: freeze the mean and variance parameters of all BN layers except the first
layer.
[Ioffe, ICML2015]
Recall: GoogLeNet
Differences from BN-Inception:
 layers
 filter numbers
 avg poolings
 add BN layers before each ReLU
[Szegedy, CVPR2015]
BN and Partial BN
 Batch Normalization in Caffe: Two layers
– BatchNorm Layer: normalize each scalar feature independently
– Scale Layer: enable the net to recover the original activations
 In TSN, however, things have changed:
– Flow images are quite different from RGB images, so it does not make
sense to transfer the features or layer parameters directly from ImageNet.
– Even RGB images are in a different domain from ImageNet, because we are dealing
with action recognition instead of object recognition.
 In that case: Partial Batch Normalization
– The mean and variance parameters are frozen at the values initialized from
ImageNet, except for the BN layer after the first conv layer.
– The scale parameters (slope and bias) are treated as usual.
[Ioffe, ICML2015]
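A minimal NumPy sketch of the partial BN idea, not the Caffe implementation. The function batch_norm, the two-layer setup and the "pre-trained" statistics are hypothetical; the same input is fed to both layers only to keep the example short:

```python
import numpy as np

def batch_norm(x, running_mean, running_var, gamma, beta,
               freeze_stats, momentum=0.1, eps=1e-5):
    """Normalize x of shape (N, C) per channel.

    freeze_stats=True: use the stored mean/variance and leave them unchanged.
    freeze_stats=False: use batch statistics and update the stored values.
    """
    if freeze_stats:
        mean, var = running_mean, running_var
    else:
        mean, var = x.mean(axis=0), x.var(axis=0)
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta (slope and bias) stay trainable in both cases.
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
# Hypothetical pre-trained statistics for two BN layers.
stats = [(np.zeros(4), np.ones(4)) for _ in range(2)]
gamma, beta = np.ones(4), np.zeros(4)

for i, (mean, var) in enumerate(stats):
    freeze = i > 0   # partial BN: only the first BN layer re-estimates statistics
    _, new_mean, new_var = batch_norm(x, mean, var, gamma, beta, freeze)
    stats[i] = (new_mean, new_var)
```

After the loop, the statistics of the first layer have moved towards the new data, while the second layer still holds its initial values.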
Optical Flow
 Core problem:
– How to locate the corresponding point in the latter frame?
 Basic assumption:
– Brightness of an image point remains constant over time.
– Displacement and time steps are small.
 Methods (built in OpenCV):
– Lucas-Kanade Method and its pyramidal implementation: the first method,
sparse optical flow (calcOpticalFlowPyrLK)
– Farneback Method: used in TSN, dense optical flow (calcOpticalFlowFarneback)
– Brox Method: used in Two-stream CNN (BroxOpticalFlow)
Optical Flow: Lucas-Kanade Method
 Suppose the point $(x, y)$ in the image has brightness $I(x, y, t)$.
 Optical flow is defined as $(u, v)$, where:
$$u = \frac{\partial x}{\partial t}, \quad v = \frac{\partial y}{\partial t}$$
 With the two assumptions and Taylor's theorem:
$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t)$$
$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t$$
 we have
$$\frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0$$
 Assume that within a small patch, $(u, v)$ remains the same. We could solve the
above equation using the least squares method.
[Lucas, 1981]
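A minimal NumPy sketch of the least squares step for a single patch, assuming one constant (u, v) for the whole patch and a time step of 1. The function name lucas_kanade_patch and the synthetic test pattern are illustrative choices, and this is not the pyramidal OpenCV implementation:

```python
import numpy as np

def lucas_kanade_patch(frame0, frame1):
    """Estimate one (u, v) for a whole patch by least squares."""
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    Iy, Ix = np.gradient(frame0)
    It = frame1 - frame0                             # temporal derivative, delta t = 1
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # one row [Ix, Iy] per pixel
    b = -It.ravel()                                  # Ix*u + Iy*v = -It
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

def pattern(xs, ys):
    # Smooth synthetic brightness pattern (an assumption for the demo).
    return np.sin(0.2 * xs) + np.cos(0.15 * ys)

ys, xs = np.mgrid[0:32, 0:32].astype(np.float64)
true_u, true_v = 0.3, 0.2
frame0 = pattern(xs, ys)
frame1 = pattern(xs - true_u, ys - true_v)   # same pattern moved by (true_u, true_v)
u, v = lucas_kanade_patch(frame0, frame1)
print(u, v)  # should be close to the true shift
```

The estimate is only approximate because the derivation keeps first-order terms, which is why the displacement has to be small.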
Warped Flow
Intuition:
– The movement of camera is encoded in the frames.
Method:
– Find the correspondences between two frames
• Compute SURF descriptors of consecutive frames.
• Compute OF using Farneback Method and select the
motion vectors for salient feature points
• Estimate the homography using RANSAC
– Remove inconsistent matches due to humans
(human motion is an outlier with respect to
camera motion)
• Use human detector for each frame
• Remove feature matches inside the human bounding
box during homography estimation
– Remove camera movement from optical flow
[Wang, ICCV2013]
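A simplified sketch of the last step, assuming the homography has already been estimated from the background matches. Here the camera-induced displacement is subtracted from a dense flow field; the function names and the toy translation homography are assumptions, and the original method may differ in detail (for example by warping the second frame and recomputing the flow):

```python
import numpy as np

def camera_flow(homography, height, width):
    """Displacement of every pixel that is explained by the homography alone."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3) homogeneous
    mapped = pts @ homography.T
    mapped_xy = mapped[..., :2] / mapped[..., 2:3]        # back to pixel coords
    return mapped_xy - pts[..., :2]

def remove_camera_motion(flow, homography):
    """flow: (H, W, 2) array of (dx, dy) per pixel."""
    h, w = flow.shape[:2]
    return flow - camera_flow(homography, h, w)

# Toy example: the camera pans by (2, -1) pixels, an "actor" moves 3 px more in x.
H_cam = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0],
                  [0.0, 0.0, 1.0]])
flow = np.tile(np.array([2.0, -1.0]), (8, 8, 1))
flow[4:6, 4:6] += np.array([3.0, 0.0])
warped = remove_camera_motion(flow, H_cam)
print(warped[0, 0], warped[5, 5])
```

In the toy example the background flow cancels out and only the motion of the actor region remains.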
Training: Initialization
 For the RGB ConvNet, they use the BN-Inception model pre-trained
on ImageNet.
 For the Flow ConvNet, they use a modified RGB pre-trained model.
– Rescale the flow fields to the [0, 255] range, so that the value range of optical
flow fields is the same as that of RGB images.
– Modify the weights of the first convolution layer of the RGB model by averaging the
weights across the RGB channels and replicating the average by the channel
number of the temporal network input.
 Original channel numbers of each ConvNet:
– Spatial (RGB) net: 3, stands for RGB
– Temporal (Flow) net: 10, stands for 5 x-flow and 5 y-flow
[Wang, ECCV2016]
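The weight modification can be sketched in NumPy as follows. The function name init_flow_conv1 and the (64, 3, 7, 7) weight shape are assumptions for illustration, and random values stand in for the real pre-trained weights:

```python
import numpy as np

def init_flow_conv1(rgb_weights, flow_channels):
    """(out, 3, kH, kW) RGB weights -> (out, flow_channels, kH, kW) flow weights."""
    mean_w = rgb_weights.mean(axis=1, keepdims=True)   # average over R, G, B
    return np.repeat(mean_w, flow_channels, axis=1)    # replicate for each flow channel

rng = np.random.default_rng(0)
rgb_w = rng.normal(size=(64, 3, 7, 7))    # stand-in for pre-trained conv1 weights
flow_w = init_flow_conv1(rgb_w, flow_channels=10)
print(flow_w.shape)  # (64, 10, 7, 7)
```

Every flow channel starts from the same averaged filter, so the later layers of the RGB model can be reused unchanged.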
Training: Segment selection and processing
 Why use segments:
– ConvNets are unable to model long-range temporal structure.
– A sparsely sampled sequence could represent the action.
 Steps:
– Divide the original video into K segments of equal durations.
– Randomly sample one frame during each segment.
– In the classifier layers, each frame has a score for every class. Even
averaging gives better results than maximum and weighted averaging.
 Specifically, when K=3, the input dims of the two nets (train_val):
– Spatial (RGB) net: N x 9 x 224 x 224
– Temporal (Flow) net: N x 30 x 224 x 224
[Wang, ECCV2016]
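A minimal sketch of the sampling and averaging steps in plain Python. The function names are hypothetical, and the handling of segment boundaries is one possible choice rather than the one used in the released code:

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=random):
    """Pick one random frame index from each of K equal-duration segments."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))   # keep the range non-empty
        indices.append(rng.randrange(start, end))
    return indices

def average_consensus(snippet_scores):
    """K lists of per-class scores -> one list of evenly averaged scores."""
    k = len(snippet_scores)
    return [sum(col) / k for col in zip(*snippet_scores)]

idx = sample_segment_indices(num_frames=90, num_segments=3, rng=random.Random(0))
scores = average_consensus([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(idx, scores)
```

With K=3, the three sampled RGB frames give 3 x 3 = 9 input channels, and three stacks of 10 flow images give 30, which matches the dims listed above.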
Training: Data Augmentation
 The original size of the input images is 256 x 340. When fed into the net,
the images are cropped to 224 x 224.
 Corner Cropping:
– Previous method is random cropping, which means any part of the large image
could be selected.
– For this method, only four corners and the center are taken into consideration.
 Scale Jittering:
– Randomly select sizes from [256, 224, 192, 168], width and height are the same.
– Rescale the cropped image into 224 x 224.
 Although two methods are exploited, the number of frames in each batch is
not increased. However, each frame can have up to 40 variants.
[Wang, ECCV2016]
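One way to arrive at the figure of 40 is sketched below, assuming that horizontal flips are counted as variants (5 positions x 4 sizes x 2 flips). The helper crop_offsets is hypothetical, and each crop would still be rescaled to 224 x 224 afterwards:

```python
FRAME_H, FRAME_W = 256, 340
CROP_SIZES = [256, 224, 192, 168]

def crop_offsets(size, frame_h=FRAME_H, frame_w=FRAME_W):
    """Top-left (y, x) offsets of the four corner crops and the center crop."""
    max_y, max_x = frame_h - size, frame_w - size
    return [(0, 0), (0, max_x), (max_y, 0), (max_y, max_x),
            (max_y // 2, max_x // 2)]

# Enumerate every (size, position, flip) combination.
# Note: for size 256 the top and bottom corners coincide, since max_y is 0.
variants = [
    (size, offset, flip)
    for size in CROP_SIZES
    for offset in crop_offsets(size)
    for flip in (False, True)
]
print(len(variants))  # 40
```

During training only one of these variants is drawn per frame, which is why the batch size does not grow.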
Test: Get video level scores and accuracy
 There is no segment operation in the test phase. According to the paper, the batch
size is set to 25, so the input sizes of the two nets become:
– Spatial (RGB) net: 25 x 3 x 224 x 224
– Temporal (Flow) net: 25 x 10 x 224 x 224
 However, there are still tricks in the process:
– For short videos with fewer than 25 frames: repeat the first frame 25 times.
– Each input frame still has the original size of 256 x 340, so cropping at the
four corners and the center, together with horizontal flipping, is still applied. In that case,
the output blob for each video is 25 x 10 x class_num.
– We want video-level accuracy instead of frame-level accuracy. The above
blobs are averaged first over the 10 variants and then over the 25 frames to get the scores.
– Combination of two modalities: with weights 1 for RGB, 1.5 for Flow.
[Wang, ECCV2016]
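The averaging and fusion steps can be sketched in NumPy as follows. The function names are hypothetical, and random numbers stand in for the real network outputs:

```python
import numpy as np

def video_scores(frame_crop_scores):
    """(25, 10, class_num) blob -> (class_num,) video-level scores."""
    per_frame = frame_crop_scores.mean(axis=1)   # average over the 10 variants
    return per_frame.mean(axis=0)                # then over the 25 frames

def fuse(rgb_scores, flow_scores, w_rgb=1.0, w_flow=1.5):
    """Weighted late fusion of the two modalities."""
    return w_rgb * rgb_scores + w_flow * flow_scores

rng = np.random.default_rng(0)
class_num = 101                                   # e.g. UCF101
rgb = video_scores(rng.random((25, 10, class_num)))
flow = video_scores(rng.random((25, 10, class_num)))
fused = fuse(rgb, flow)
pred = int(np.argmax(fused))                      # predicted class for the video
```

Video-level accuracy is then the fraction of videos whose predicted class matches the label.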
Evaluation
For example, for UCF101 split 1, my test result is 86.02% for RGB, and 87.63% for Flow.
The combined result (1:1.5) is 93.5%.
Contributions of TSN
 Features:
• Use warped flow for ConvNets
• Tried RGB difference features, but this modality proved not to be useful
 Structures:
• Two-stream based on batch normalization
• Segment ConvNets
 Methods:
• Partial Batch Normalization
• Cross-Modality Initialization
Reference
[Wang, ECCV2016] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016, October). Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (pp. 20-36). Springer International Publishing.
[Simonyan, NIPS2014] Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (pp. 568-576).
[Ioffe, ICML2015] Ioffe, S., & Szegedy, C. (2015, June). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448-456).
[Szegedy, CVPR2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
[Lucas, 1981] Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, 1981: 120-131.
[Wang, ICCV2013] Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision (pp. 3551-3558).

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Temporal Segment Network

  • 1. Temporal Segment Network: Two-stream CNN and its application in action recognition Dongang Wang 15 Sep 2017
  • 2. Contents • Temporal Segment Network (TSN): basic ideas, method, and tricks in the training and test phases. • Two-Stream CNN: combination of spatial and temporal features, late fusion comparison. • BN-Inception: review of the structure in detail, derivation from GoogLeNet, usage in TSN. • Optical Flow and Warped Optical Flow: basic idea and different methods, dense flow, warped flow.
  • 3. Authors • Limin Wang (王利民): BS in NJU, PhD in CUHK with Xiaoou Tang, now postdoc in ETHZ. • Yuanjun Xiong (熊元军): BE in Tsinghua, PhD in CUHK with Xiaoou Tang, now postdoc in CUHK. • Zhe Wang (王哲): BE in ZJU, PhD in CUHK with Xiaogang Wang. • Yu Qiao (乔宇): Professor in SIAT. • Dahua Lin (林达华): Professor in CUHK. BS in USTC, PhD in MIT. • Xiaoou Tang (汤晓鸥): Professor in CUHK. BE in USTC, PhD in MIT. • Luc Van Gool: Professor in ETHZ.
  • 4. General Structure of TSN [Wang, ECCV2016]
  • 5. Issues 1. Segments: How to select key frames/segments? 2. Modality: How to compute optical flow features, and how to use them in a CNN? 3. Training and test: How to train and how to test? 4. Fusion of two CNNs: Are there other ways besides late fusion?
  • 6. Temporal Segment Network Structure: • Two-Stream CNN • Batch Normalization -> Partial Batch Normalization Modality: • Optical Flow • Warped Flow Tricks: • Initialization • Data augmentation • Segments • Test
  • 7. Two-Stream CNN The idea comes from the human visual cortex, which contains two pathways: the ventral stream (object recognition) and the dorsal stream (motion detection). [Simonyan, NIPS2014]
  • 8. Two-Stream CNN This method has proved useful. The figure shows the 96 learnt 7x7 filters for the flow stack (10 for x and 10 for y). The figure also shows how optical flow features are used: flow images are stacked as channels. TSN builds on this idea.
  • 9. BN-Inception Partial BN: freeze the mean and variance parameters of all BN layers except the first layer. [Ioffe, ICML2015]
  • 10. Recall: GoogLeNet Differences from BN-Inception: • layers • filter numbers • avg poolings • BN layers added before each ReLU [Szegedy, CVPR2015]
  • 14. BN and Partial BN Batch Normalization in Caffe uses two layers: – BatchNorm layer: normalizes each scalar feature independently. – Scale layer: lets the net recover the original activations. In TSN, the situation is different: – Flow images are quite different from RGB images, so it does not make sense to transfer features or layer parameters directly from ImageNet. – Even the RGB images come from a different domain than ImageNet, because the task is action recognition rather than object recognition. Hence Partial Batch Normalization: – The mean and variance parameters are frozen at the values initialized from ImageNet, except for the first conv layer. – The scale parameters (slope and bias) are trained as usual. [Ioffe, ICML2015]
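For reference, the BatchNorm and Scale layers together compute the standard batch normalization transform. The symbols below are notation chosen here for illustration and do not appear on the slide: $\mu$ and $\sigma^2$ are the mean and variance, $\gamma$ and $\beta$ the slope and bias, and $\epsilon$ a small constant for numerical stability.

```latex
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
\quad \text{(BatchNorm layer)}, \qquad
y = \gamma \hat{x} + \beta
\quad \text{(Scale layer)}
```

Under partial BN as described above, $\mu$ and $\sigma^2$ stay fixed at their ImageNet-initialized values in every BN layer except the first, while $\gamma$ and $\beta$ continue to be learned.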
  • 16. Optical Flow Core problem: – How to locate the corresponding point in the next frame? Basic assumptions: – The brightness of an image point remains constant over time. – Displacements and time steps are small. Methods (built into OpenCV): – Lucas-Kanade method and its pyramidal implementation: the earliest of the three, sparse optical flow (calcOpticalFlowPyrLK) – Farneback method: used in TSN, dense optical flow (calcOpticalFlowFarneback) – Brox method: used in Two-stream CNN (BroxOpticalFlow)
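  • 17. Optical Flow: Lucas-Kanade Method Suppose the point $(x, y)$ in the image has brightness $I(x, y, t)$. Optical flow is defined as $(u, v)$, where $u = \partial x / \partial t$ and $v = \partial y / \partial t$. Brightness constancy gives $I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t)$, and a first-order Taylor expansion (valid for small displacements) gives $I(x + \delta x, y + \delta y, t + \delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t$. Combining the two and dividing by $\delta t$, we have $\frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0$. Assume that $(u, v)$ remains the same within a small patch; the resulting system of equations, one per pixel, can then be solved with the least-squares method. [Lucas, 1981]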
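The least-squares step can be sketched in a few lines of plain Python. This is a minimal illustration, not the OpenCV implementation: the function name `lucas_kanade_patch` and the flat-list inputs are assumptions made here, and the image derivatives are taken as already computed for every pixel of one patch.

```python
def lucas_kanade_patch(ix, iy, it):
    """Least-squares flow (u, v) for one patch.

    ix, iy, it: equal-length sequences holding dI/dx, dI/dy, dI/dt
    at each pixel of the patch (hypothetical input layout).
    """
    # Entries of the 2x2 normal equations A^T A [u, v]^T = -A^T b.
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))

    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # Gradients point in one direction only, so (u, v) is not unique.
        raise ValueError("patch has too little texture to determine the flow")

    # Closed-form inverse of the 2x2 system.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Derivatives synthesized so that ix*u + iy*v + it = 0 holds for (u, v) = (1, -0.5).
ix = [1.0, 0.0, 2.0, 1.0]
iy = [0.0, 1.0, 1.0, -1.0]
it = [-1.0, 0.5, -1.5, -1.5]
u, v = lucas_kanade_patch(ix, iy, it)
```

The determinant check reflects why the patch assumption matters: a single pixel gives one equation in two unknowns, and a patch whose gradients are all parallel is no better.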
  • 18. Warped Flow Intuition: – Camera motion is encoded in the frames. Method: – Find the correspondences between two frames • Compute SURF descriptors of consecutive frames. • Compute optical flow using the Farneback method and select the motion vectors of salient feature points. • Estimate the homography using RANSAC. – Remove inconsistent matches due to humans (human motion is an outlier with respect to camera motion) • Run a human detector on each frame. • Remove feature matches inside the human bounding boxes during homography estimation. – Remove camera motion from the optical flow. [Wang, ICCV2013]
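The slide does not say how the camera motion is removed once the homography is known. The sketch below shows one simple reading, under stated assumptions: subtract from a flow vector the displacement that the homography alone would induce at that pixel. An alternative is to warp the second frame with the homography and recompute the flow. The function names and the nested-list representation of the 3x3 homography are hypothetical.

```python
def camera_displacement(h, x, y):
    """Displacement of pixel (x, y) induced by a 3x3 homography h (nested lists)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    xp = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    yp = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return xp - x, yp - y


def remove_camera_motion(flow, h, x, y):
    """Subtract the homography-induced displacement from the flow (u, v) at (x, y)."""
    du, dv = camera_displacement(h, x, y)
    return flow[0] - du, flow[1] - dv


# A camera pan modeled as a pure translation of (3, -2) pixels.
pan = [[1.0, 0.0, 3.0],
       [0.0, 1.0, -2.0],
       [0.0, 0.0, 1.0]]
residual = remove_camera_motion((4.0, -2.0), pan, x=10.0, y=20.0)
```

In this toy case the residual flow is what remains after the pan is discounted, which is the motion one would attribute to the actor rather than the camera.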
  • 20. Training: Initialization For the RGB ConvNet, they use a BN-Inception model pre-trained on ImageNet. For the Flow ConvNet, they use a modified version of the RGB pre-trained model: – Rescale the flow images to the [0, 255] range, so that the value range of the optical flow fields matches that of RGB images. – Modify the weights of the first convolution layer of the RGB model by averaging them across the RGB channels and replicating the average by the number of input channels of the temporal network. Original input channel numbers of each ConvNet: – Spatial (RGB) net: 3, for R, G, and B – Temporal (Flow) net: 10, for 5 x-flow and 5 y-flow images [Wang, ECCV2016]
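The weight modification for the first convolution layer can be sketched as follows. This is an illustration in plain Python rather than the authors' Caffe code; the function name `cross_modality_init` and the nested-list layout `[out_ch][in_ch][kh][kw]` are assumptions made here.

```python
def cross_modality_init(rgb_weights, flow_channels=10):
    """Build first-layer weights for the flow net from RGB first-layer weights.

    rgb_weights: nested lists shaped [out_ch][3][kh][kw] (hypothetical layout).
    Returns nested lists shaped [out_ch][flow_channels][kh][kw].
    """
    flow_weights = []
    for filt in rgb_weights:
        n_in = len(filt)
        kh, kw = len(filt[0]), len(filt[0][0])
        # Average the kernel over the R, G, B input channels.
        mean_kernel = [[sum(filt[c][i][j] for c in range(n_in)) / n_in
                        for j in range(kw)]
                       for i in range(kh)]
        # Replicate the averaged kernel once per flow input channel.
        flow_weights.append([[row[:] for row in mean_kernel]
                             for _ in range(flow_channels)])
    return flow_weights


# One output filter with 1x1 kernels, to keep the numbers easy to follow.
rgb = [[[[1.0]], [[2.0]], [[3.0]]]]
flow = cross_modality_init(rgb)  # 10 input channels, each holding the mean 2.0
```

Each replicated kernel is copied rather than shared, so later updates to one input channel do not silently change the others.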
  • 21. Training: Segment selection and processing Why use segments: – ConvNets that see a single frame or a short stack of frames cannot model long-range temporal structure. – A sparsely sampled sequence of frames can represent the action. Steps: – Divide the original video into K segments of equal duration. – Randomly sample one frame from each segment. – In the classifier layers, each sampled frame yields a score vector over all classes. Averaging the scores evenly gives better results than taking the maximum or a weighted average. Specifically, when K=3, the input dimensions of the two nets (train_val) are: – Spatial (RGB) net: N x 9 x 224 x 224 – Temporal (Flow) net: N x 30 x 224 x 224 [Wang, ECCV2016]
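The sampling and averaging steps above can be sketched as below. The function names are hypothetical, and the handling of frame counts that do not divide evenly by K (integer segment boundaries) is an assumption of this sketch, not something the slide specifies.

```python
import random


def sample_segment_indices(num_frames, k=3, rng=random):
    """Pick one random frame index from each of k (nearly) equal-duration segments."""
    if num_frames < k:
        raise ValueError("need at least k frames")
    # Segment i covers frame indices [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]


def average_consensus(snippet_scores):
    """Evenly average per-snippet class scores into one video-level score vector."""
    k = len(snippet_scores)
    return [sum(col) / k for col in zip(*snippet_scores)]


indices = sample_segment_indices(90, k=3, rng=random.Random(0))
video_scores = average_consensus([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
```

With K=3 and a 90-frame video, one index falls in each of the ranges 0-29, 30-59, and 60-89, so the three sampled frames always span the whole video.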
  • 22. Training: Data Augmentation The original size of the input images is 256 x 340. Before being fed into the net, the images are cropped to 224 x 224. Corner Cropping: – The previous method is random cropping, in which any part of the large image can be selected. – With corner cropping, only the four corners and the center are considered. Scale Jittering: – Randomly select the crop size from [256, 224, 192, 168]; width and height are the same. – Rescale the cropped image to 224 x 224. Although both methods are used, the number of frames in each batch does not increase. However, each frame can have up to 40 variants. [Wang, ECCV2016]
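The crop geometry can be made concrete with a short sketch. The function names are hypothetical, and the code only enumerates crop rectangles; it does not touch pixels. Five positions times four sizes gives 20 rectangles; that horizontal flipping doubles this to the 40 variants mentioned above is an assumption of this note.

```python
SCALES = [256, 224, 192, 168]  # square crop sizes listed on the slide


def five_crop_offsets(img_h=256, img_w=340, crop=224):
    """Top-left (y, x) offsets of the four corner crops and the center crop."""
    max_y, max_x = img_h - crop, img_w - crop
    return [(0, 0), (0, max_x), (max_y, 0), (max_y, max_x),
            (max_y // 2, max_x // 2)]


def enumerate_variants(img_h=256, img_w=340, scales=SCALES):
    """All (crop_size, y, x) rectangles; each crop is later resized to 224 x 224."""
    return [(s, y, x)
            for s in scales
            for (y, x) in five_crop_offsets(img_h, img_w, s)]


offsets_224 = five_crop_offsets()   # positions of the 224 x 224 crops
variants = enumerate_variants()     # 5 positions x 4 sizes
```

During training one rectangle is drawn at random per frame, which is why the batch size stays the same even though many variants are possible.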
  • 23. Test: Get video-level scores and accuracy There is no segment operation in the test phase. Following the paper, the batch size is set to 25, so the input sizes of the two nets become: – Spatial (RGB) net: 25 x 3 x 224 x 224 – Temporal (Flow) net: 25 x 10 x 224 x 224 However, there are still tricks in the process: – For short videos with fewer than 25 frames, the first frame is repeated 25 times. – Each input frame still has the original size of 256 x 340, so cropping at the four corners and the center, plus horizontal flipping, is still applied. The output blob for each video is therefore 25 x 10 x class_num. – We want video-level accuracy instead of frame-level accuracy, so the blob is averaged first over the 10 variants and then over the 25 frames to get the scores. – The two modalities are combined with weights 1 for RGB and 1.5 for Flow. [Wang, ECCV2016]
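The averaging and fusion steps at test time can be sketched as follows. The function names and the nested-list blob layout `blob[frame][variant][class]` are assumptions of this sketch; only the averaging order and the 1 : 1.5 weights come from the slide.

```python
def video_level_scores(blob):
    """Average blob[frame][variant][class] over variants, then over frames."""
    num_classes = len(blob[0][0])
    per_frame = [[sum(variant[c] for variant in frame) / len(frame)
                  for c in range(num_classes)]
                 for frame in blob]
    return [sum(frame[c] for frame in per_frame) / len(per_frame)
            for c in range(num_classes)]


def fuse(rgb_scores, flow_scores, w_rgb=1.0, w_flow=1.5):
    """Weighted late fusion of the two streams."""
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]


def predict(rgb_blob, flow_blob):
    """Index of the class with the highest fused video-level score."""
    fused = fuse(video_level_scores(rgb_blob), video_level_scores(flow_blob))
    return max(range(len(fused)), key=fused.__getitem__)


# Toy blobs: 2 frames x 2 variants x 2 classes instead of 25 x 10 x class_num.
rgb_blob = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
flow_blob = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
label = predict(rgb_blob, flow_blob)
```

In the toy case the RGB stream votes for class 0 and the flow stream for class 1; the heavier flow weight decides the fused prediction.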
  • 24. Evaluation For example, for UCF101 split 1, my test result is 86.02% for RGB, and 87.63% for Flow. The combined result (1:1.5) is 93.5%.
  • 25. Contributions of TSN Features: • Use warped flow for ConvNets • Tried RGB difference features, but this modality proved not to be useful Structures: • Two-stream based on batch normalization • Segment ConvNets Methods: • Partial Batch Normalization • Cross-Modality Initialization
  • 26. References [Wang, ECCV2016] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016, October). Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (pp. 20-36). Springer International Publishing. [Simonyan, NIPS2014] Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576). [Ioffe, ICML2015] Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456). [Szegedy, CVPR2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9). [Lucas, 1981] Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, 1981: 120-131. [Wang, ICCV2013] Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3551-3558).