Temporal Segment Network

Temporal Segment Network:
Two-stream CNN and its application in action
recognition
Dongang Wang
15 Sep 2017

Contents
• Temporal Segment Network (TSN) :
  basic ideas, method and tricks in training and test phases.
• Two-Stream CNN:
  combination of spatial and temporal features, late fusion comparison.
• BN-Inception:
  review the structure in detail, derivation from GoogLeNet, usage in TSN
• Optical Flow and Warped Optical Flow:
  basic idea and different methods, dense flow, warped flow.

Authors
• Limin Wang (王利民): BS in NJU, PhD in CUHK with Xiaoou Tang, now
postdoc in ETHZ.
• Yuanjun Xiong (熊元军): BE in Tsinghua, PhD in CUHK with Xiaoou Tang,
now postdoc in CUHK.
• Zhe Wang (王哲): BE in ZJU, PhD in CUHK with Xiaogang Wang.
• Yu Qiao (乔宇): Professor in SIAT.
• Dahua Lin (林达华): Professor in CUHK. BS in USTC, PhD in MIT.
• Xiaoou Tang (汤晓鸥): Professor in CUHK. BE in USTC, PhD in MIT.
• Luc Van Gool: Professor in ETHZ.

General Structure of TSN
[Wang, ECCV2016]

Issues
 1. Segments: How to select key frames/segments?
 2. Modality: How to compute Optical Flow features? And how to utilize the
flow features in CNN?
 3. Training and test: How to train and how to test?
 4. Fusion of two CNNs: Are there any other ways besides late fusion?

Temporal Segment Network
 Structure:
• Two-Stream CNN
• Batch Normalization -> Partial Batch Normalization
 Modality:
• Optical Flow
• Warped Flow
 Tricks:
• Initialization
• Data augmentation
• Segments
• Test

Two-Stream CNN
The idea comes from the human visual cortex, which contains two pathways: the ventral
stream (object recognition) and the dorsal stream (motion detection).
[Simonyan, NIPS2014]

Two-Stream CNN
 This method has proved useful. The figure on this slide shows the 96 learnt
7x7 first-layer filters for a flow stack (10 channels for x and 10 for y).
 The figure also shows how optical flow features are used: flow images are
stacked as input channels. TSN adopts the same scheme.
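
As a rough illustration of stacking flow images as channels, the sketch below interleaves 10 horizontal and 10 vertical flow fields into a 20-channel input. Plain Python lists stand in for image arrays; the function name `stack_flow` and the interleaved x/y channel order are assumptions made for illustration, not details taken from [Simonyan, NIPS2014].

```python
def stack_flow(flow_x, flow_y):
    """Interleave x-flow and y-flow fields into one multi-channel stack.

    flow_x, flow_y: lists holding the same number of 2-D flow fields.
    The x1, y1, x2, y2, ... channel order is an assumption of this sketch.
    """
    if len(flow_x) != len(flow_y):
        raise ValueError("need the same number of x and y flow fields")
    channels = []
    for fx, fy in zip(flow_x, flow_y):
        channels.extend([fx, fy])
    return channels


# Ten consecutive flow fields per direction for a tiny 2x2 "image".
num_flows = 10
flow_x = [[[float(t)] * 2 for _ in range(2)] for t in range(num_flows)]
flow_y = [[[-float(t)] * 2 for _ in range(2)] for t in range(num_flows)]
stacked = stack_flow(flow_x, flow_y)  # 20 channels in total
```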

BN-Inception
Partial BN: freeze the mean and variance parameters of all BN layers except the first
layer.
[Ioffe, ICML2015]

Recall: GoogLeNet
Differences from BN-Inception:
 layers
 filter numbers
 avg poolings
 BN layers added before each ReLU
[Szegedy, CVPR2015]

BN and Partial BN
 Batch Normalization in Caffe: Two layers
– BatchNorm Layer: normalize each scalar feature independently
– Scale Layer: enable the net to recover the original activations
 In TSN, however, the situation is different:
– Flow images are quite different from RGB images, so it does not make sense to
transfer the features or layer parameters directly from ImageNet.
– Even the RGB images are in a different domain from ImageNet, because we are
dealing with action recognition instead of object recognition.
 In that case: Partial Batch Normalization
– The mean and variance parameters of all BN layers are frozen at the values
initialized from the ImageNet model, except for the first BN layer (the one after
the first conv layer).
– The scale parameters (slope and bias) are trained as usual.
[Ioffe, ICML2015]
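
The freezing rule can be sketched with a toy one-dimensional batch-norm layer. This is a minimal illustration under stated assumptions, not the Caffe implementation: the class `BatchNorm`, the helper `apply_partial_bn`, and the momentum-based running-statistics update are all hypothetical stand-ins.

```python
import math


class BatchNorm:
    """Toy 1-D batch-norm layer with a switch that freezes its statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0  # stands in for the pre-trained mean
        self.running_var = 1.0   # stands in for the pre-trained variance
        self.gamma = 1.0         # scale (slope), trained as usual
        self.beta = 0.0          # bias, trained as usual
        self.momentum = momentum
        self.eps = eps
        self.frozen = False

    def forward(self, batch):
        if self.frozen:
            # Frozen: reuse the stored statistics and leave them untouched.
            mean, var = self.running_mean, self.running_var
        else:
            # Not frozen: re-estimate statistics from the current batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in batch]


def apply_partial_bn(bn_layers):
    """Freeze mean/variance of every BN layer except the first one."""
    for index, layer in enumerate(bn_layers):
        layer.frozen = index > 0


layers = [BatchNorm() for _ in range(3)]
apply_partial_bn(layers)
for layer in layers:
    layer.forward([1.0, 2.0, 3.0])
```

After the forward pass only the first layer's running statistics have moved; the later layers still hold their initial values.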

Optical Flow
 Core problem:
– How to locate the corresponding point in the next frame?
 Basic assumption:
– Brightness of an image point remains constant over time.
– Displacement and time steps are small.
 Methods (built in OpenCV):
– Lucas-Kanade Method and its pyramidal implementation: the classic method,
sparse optical flow (calcOpticalFlowPyrLK)
– Farneback Method: dense optical flow (calcOpticalFlowFarneback); [Wang, ECCV2016]
itself reports extracting flow with the TV-L1 algorithm
– Brox Method: used in Two-stream CNN (BroxOpticalFlow)

Optical Flow: Lucas-Kanade Method
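 Suppose the point $(x, y)$ in the image has brightness $I(x, y, t)$.
 Optical flow is defined as $(u, v)$, where:
$$u = \frac{\partial x}{\partial t}, \quad v = \frac{\partial y}{\partial t}$$
 With the two assumptions and a first-order Taylor expansion:
$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t)$$
$$I(x + \delta x, y + \delta y, t + \delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t$$
 we have
$$\frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0$$
 Assume that within a small patch, $(u, v)$ remains the same. Each pixel in the
patch then contributes one such equation, and the resulting overdetermined system
can be solved with the least-squares method.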
[Lucas, 1981]
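
The least-squares step can be sketched as follows. The function takes the per-pixel derivatives of one patch and solves the 2x2 normal equations of the brightness-constancy constraint in closed form. The function name `lucas_kanade_patch` and the synthetic gradient values are assumptions for illustration; a practical implementation such as calcOpticalFlowPyrLK adds image pyramids and iteration on top of this.

```python
def lucas_kanade_patch(ix, iy, it):
    """Least-squares flow (u, v) for one patch.

    ix, iy, it: lists of dI/dx, dI/dy, dI/dt, one entry per pixel.
    Solves  [sum(ix*ix) sum(ix*iy)] [u]   [-sum(ix*it)]
            [sum(ix*iy) sum(iy*iy)] [v] = [-sum(iy*it)]
    """
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # Gradients all point the same way: the aperture problem.
        raise ValueError("patch has no unique flow solution")
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Synthetic patch: pick gradients, then make dI/dt satisfy the
# constraint exactly for a known flow of (0.3, -0.2).
true_u, true_v = 0.3, -0.2
ix = [1.0, 2.0, 0.0, 1.5]
iy = [0.5, -1.0, 2.0, 1.0]
it = [-(a * true_u + b * true_v) for a, b in zip(ix, iy)]
u, v = lucas_kanade_patch(ix, iy, it)
```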

Warped Flow
Intuition:
– The movement of camera is encoded in the frames.
Method:
– Find the correspondences between two frames
• Compute SURF descriptors of consecutive frames.
• Compute optical flow using the Farneback Method and select
the motion vectors for salient feature points
• Estimate the homography using RANSAC
– Remove inconsistent matches due to humans
(human motion is an outlier with respect to
camera movement)
• Use human detector for each frame
• Remove feature matches inside the human bounding
box during homography estimation
– Remove camera movement from optical flow
[Wang, ICCV2013]
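
The last step, removing camera movement, can be illustrated with a simplified sketch: given a 3x3 homography describing the camera motion, subtract the displacement it predicts at each pixel from the measured flow. This direct subtraction is a simplifying assumption; the pipeline in [Wang, ICCV2013] instead warps the second frame with the homography and recomputes the flow. The names `apply_homography` and `remove_camera_motion` are hypothetical.

```python
def apply_homography(h, x, y):
    """Map point (x, y) through the 3x3 homography h (nested lists)."""
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xs / w, ys / w


def remove_camera_motion(flow, h):
    """Subtract the homography-induced displacement from each flow vector.

    flow: dict mapping pixel (x, y) to its measured flow vector (u, v).
    """
    warped = {}
    for (x, y), (u, v) in flow.items():
        cam_x, cam_y = apply_homography(h, x, y)
        warped[(x, y)] = (u - (cam_x - x), v - (cam_y - y))
    return warped


# Camera pans by (2, 1) pixels; one pixel also carries object motion.
pan = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0],
       [0.0, 0.0, 1.0]]
flow = {(0, 0): (2.0, 1.0),   # background: pure camera motion
        (5, 5): (3.0, 1.0)}   # moving object: camera motion plus (1, 0)
warped = remove_camera_motion(flow, pan)
```

Background pixels end up with zero flow, while the moving pixel keeps only its own motion.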

Training: Initialization
 For the RGB ConvNet, the authors use a BN-Inception model pre-trained on
ImageNet.
 For the Flow ConvNet, they use a modified version of the RGB pre-trained model.
– Rescale the flow images to a [0, 255] range, which makes the range of optical
flow fields the same as that of RGB images.
– Modify the weights of first convolution layer of RGB models by averaging the
weights across the RGB channels and replicating the average by the channel
number of the temporal network input.
 Original channel numbers of each ConvNet:
– Spatial (RGB) net: 3, stands for RGB
– Temporal (Flow) net: 10, stands for 5 x-flow and 5 y-flow
[Wang, ECCV2016]
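
The weight modification of the first convolution layer can be sketched as below: average each filter over its RGB input channels, then replicate that average once per flow input channel. Nested Python lists stand in for the weight tensor, and the function name `cross_modality_init` and the `[out][in][row][col]` layout are assumptions for illustration.

```python
def cross_modality_init(rgb_weights, flow_channels):
    """Build first-layer weights for the flow net from RGB weights.

    rgb_weights: nested lists indexed [out][in][row][col], with in == 3.
    Returns weights indexed [out][flow_channels][row][col].
    """
    flow_weights = []
    for filt in rgb_weights:
        n_in = len(filt)
        rows, cols = len(filt[0]), len(filt[0][0])
        # Average the kernel across the RGB input channels.
        mean_kernel = [
            [sum(filt[c][i][j] for c in range(n_in)) / n_in
             for j in range(cols)]
            for i in range(rows)
        ]
        # Replicate the averaged kernel for every flow input channel.
        flow_weights.append(
            [[row[:] for row in mean_kernel] for _ in range(flow_channels)]
        )
    return flow_weights


# One output filter with 3 input channels and a tiny 1x2 kernel.
rgb = [[[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]]
flow = cross_modality_init(rgb, flow_channels=10)
```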

Training: Segment selection and processing
 Why use segments:
– ConvNets that see only a single frame or a short stack of frames are unable to
model long-range temporal structure.
– A sparsely sampled sequence could represent the action.
 Steps:
– Divide the original video into K segments of equal durations.
– Randomly sample one frame from each segment.
– In the classifier layers, each frame produces a score vector over all classes. Even
averaging gives better results than maximum or weighted averaging.
 In particular, when K=3, the input dimensions of the two nets (train_val) are:
– Spatial (RGB) net: N x 9 x 224 x 224
– Temporal (Flow) net: N x 30 x 224 x 224
[Wang, ECCV2016]
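
The sampling and averaging steps can be sketched in a few lines. The function names and the handling of leftover frames (frames beyond K equal-length segments are ignored) are assumptions of this sketch rather than details from the paper.

```python
import random


def sample_segment_indices(num_frames, k, rng=random):
    """Pick one random frame index from each of k equal-length segments."""
    seg_len = num_frames // k  # leftover frames at the end are ignored
    if seg_len < 1:
        raise ValueError("video is shorter than the number of segments")
    return [i * seg_len + rng.randrange(seg_len) for i in range(k)]


def average_consensus(segment_scores):
    """Average per-segment class scores into one video-level score list."""
    k = len(segment_scores)
    return [sum(col) / k for col in zip(*segment_scores)]


# A 90-frame video split into K=3 segments of 30 frames each.
indices = sample_segment_indices(90, 3, random.Random(0))

# Hypothetical class scores for the three sampled frames (2 classes).
consensus = average_consensus([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```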

Training: Data Augmentation
 The original size of the input images is 256 x 340. Before being fed into the
net, the images are cropped to 224 x 224.
 Corner Cropping:
– The previous method is random cropping, in which any part of the large image
could be selected.
– With corner cropping, only the four corners and the center are considered.
 Scale Jittering:
– Randomly select the crop size from [256, 224, 192, 168]; width and height are the same.
– Rescale the cropped image to 224 x 224.
 Although both methods are used, the number of frames in each batch is not
increased. However, each frame can yield up to 40 variants (presumably
5 crop positions x 4 sizes x 2 for horizontal flipping).
[Wang, ECCV2016]
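
One way to arrive at the count of 40 is to enumerate the crops directly. The sketch below assumes square crops, five positions (four corners plus center), the four listed sizes, and an optional horizontal flip; that breakdown and the function name `crop_variants` are assumptions, not something the slides or the paper spell out.

```python
def crop_variants(img_h=256, img_w=340, sizes=(256, 224, 192, 168)):
    """List (top, left, size, flipped) for every corner/center crop."""
    variants = []
    for size in sizes:
        max_top = img_h - size
        max_left = img_w - size
        positions = [
            (0, 0),                          # top-left
            (0, max_left),                   # top-right
            (max_top, 0),                    # bottom-left
            (max_top, max_left),             # bottom-right
            (max_top // 2, max_left // 2),   # center
        ]
        for top, left in positions:
            for flipped in (False, True):
                variants.append((top, left, size, flipped))
    return variants


variants = crop_variants()
# At size 256 the crop spans the full image height, so the top and
# bottom corners coincide and some of the listed variants are identical.
distinct = set(variants)
```

Each selected crop would then be rescaled to 224 x 224 before entering the network.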

Test: Get video level scores and accuracy
 There is no segment operation in the test phase. Following the paper, the batch
size is set to 25. So the input sizes of the two nets become:
– Spatial (RGB) net: 25 x 3 x 224 x 224
– Temporal (Flow) net: 25 x 10 x 224 x 224
 However, there are still tricks in the process:
– For short videos with fewer than 25 frames: repeat the first frame 25 times.
– For each input frame, the original size is still 256 x 340, so cropping at the
four corners and the center, plus horizontal flipping, is still applied. In that case,
the output blob for each video is 25 x 10 x class_num.
– We want video-level accuracy instead of frame-level accuracy. The above
blob is averaged first over the 10 variants and then over the 25 frames to get scores.
– Combination of two modalities: with weights 1 for RGB, 1.5 for Flow.
[Wang, ECCV2016]
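
The averaging and fusion steps can be sketched as follows, using nested lists in place of the 25 x 10 x class_num blob. The function names are hypothetical and the tiny score values are made up purely to exercise the code.

```python
def video_scores(blob):
    """Average blob[frame][variant][class] over variants, then frames."""
    per_frame = [
        [sum(col) / len(frame) for col in zip(*frame)] for frame in blob
    ]
    return [sum(col) / len(per_frame) for col in zip(*per_frame)]


def fuse(rgb_scores, flow_scores, w_rgb=1.0, w_flow=1.5):
    """Weighted late fusion of the two modalities (1 : 1.5 by default)."""
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]


def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy blob: 2 frames x 2 variants x 2 classes.
blob = [[[1.0, 0.0], [3.0, 0.0]],
        [[0.0, 2.0], [0.0, 2.0]]]
rgb = video_scores(blob)              # per-class video-level scores
fused = fuse([1.0, 0.0], [0.0, 1.0])  # RGB favors class 0, Flow class 1
label = predict(fused)
```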

Evaluation
For example, for UCF101 split 1, my test result is 86.02% for RGB, and 87.63% for Flow.
The combined result (1:1.5) is 93.5%.

Contributions of TSN
 Features:
• Use warped flow for ConvNets
• Tried RGB difference features, but this modality proved not to be useful
 Structures:
• Two-stream based on batch normalization
• Segment ConvNets
 Methods:
• Partial Batch Normalization
• Cross-Modality Initialization

Reference
[Wang, ECCV2016] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016, October). Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (pp. 20-36). Springer International Publishing.
[Simonyan, NIPS2014] Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).
[Ioffe, ICML2015] Ioffe, S., & Szegedy, C. (2015, June). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448-456).
[Szegedy, CVPR2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
[Lucas, 1981] Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, 1981: 120-131.
[Wang, ICCV2013] Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3551-3558).
