Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Alberto Montes
July 15th, 2016
Xavi Giró, Amaia Salvador
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
2
Motivation
3
Motivation
4
Problem Definition
5
Videos
Problem Definition
6
Videos
Activity Classification
Longboarding
Problem Definition
7
Videos
Activity Temporal Localization
Longboarding
Problem Definition
8
How?
Problem Definition
9
Neural Network
Activity
Problem Definition
10
Activity
CNN + RNN
11
Large-Scale Activity Recognition
Challenge
Stats:
● 19,994 Videos
● 200 Activities
● 660 hours of video
● 313 hours of activities
● 65.6 million frames
Dataset
12
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
13
Literature Approaches
14
Activity
CNN + RNN
Convolutional Neural Network
15
Convolutional Layer
Recurrent Neural Network
16
c0
c1
c2
Literature Approaches
17
Activity
CNN + RNN
3D Convolution
18
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In IEEE ICCV 2015 (pp. 4489-4497).
3D Convolution
19
● 16-frame video clip as input
● 80 million parameters
● 3x3x3 filter size at all conv layers
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In IEEE ICCV 2015 (pp. 4489-4497).
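As a toy illustration of the 3x3x3 spatiotemporal filters above, here is a naive single-channel 3D convolution in plain NumPy (the real C3D network stacks many multi-channel layers; the 8x8 spatial size is illustrative only):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution ('valid' padding) over a
    (time, height, width) clip, as used in C3D-style networks."""
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # each output value looks at a 3D neighborhood: space AND time
                out[i, j, k] = np.sum(clip[i:i+kt, j:j+kh, k:k+kw] * kernel)
    return out

clip = np.random.rand(16, 8, 8)      # 16-frame clip, toy 8x8 spatial resolution
kernel = np.ones((3, 3, 3)) / 27.0   # 3x3x3 averaging filter, as in C3D
features = conv3d_valid(clip, kernel)
print(features.shape)                # (14, 6, 6): the temporal dim is convolved too
```

Unlike a 2D convolution applied frame by frame, the temporal dimension shrinks as well, which is how motion information enters the features.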
Literature Approaches
20
Activity
CNN + RNN
Literature Approaches
21
Activity
CNN + RNN
Segment Proposals
22
Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR 2016.
Literature Approaches
23
Activity
CNN + RNN
RNN for Activity Localization
24
Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. (2016). End-to-end learning of action detection from frame glimpses in videos. In CVPR 2016.
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
25
Architecture Overview
26
Each 16-frame clip → 200 activities + background (one prediction per clip along the video)
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
27
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
28
C3D Network
29
C3D model implemented in Caffe; each 16-frame clip is mapped to a feature vector.
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
31
Audio Features
32
C3D video features are concatenated with the audio features to form the Recurrent Neural Network input.
Audio Features:
● MFCC
● Spectral
Provided by Ignasi Esquerra
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
33
Network Architecture
34
Network Architecture
35
Network Architecture
36
LSTM with previous output feedback
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
37
Training Methodology
Categorical Cross Entropy Loss
38
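The loss named on the slide can be written out explicitly (notation mine: $y_{t,c}$ is the one-hot ground truth and $\hat{y}_{t,c}$ the softmax output for clip $t$ and class $c$, over the 200 activities plus background):

```latex
\mathcal{L} = -\sum_{t} \sum_{c=0}^{200} y_{t,c} \, \log \hat{y}_{t,c}
```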
Training Methodology
For unbalanced data, a weighted loss is used:
39
660 hours of video
313 hours of activities
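The slide motivates the weighting with the imbalance above (347 of the 660 hours are background). One common choice, assumed here rather than taken from the slides, is inverse-frequency class weights:

```latex
\mathcal{L} = -\sum_{t} \sum_{c} w_c \, y_{t,c} \, \log \hat{y}_{t,c},
\qquad w_c \propto \frac{1}{\mathrm{freq}(c)}
```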
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
40
Classification Post-Processing
41
Background
Activity 1
Activity 2
Activity 200
Clip1
Clip2
Clip3
ClipN
Classification Post-Processing
42
Background
Activity 1
Activity 2
Activity 200
Clip1
Clip2
Clip3
ClipN
Average
Classification Post-Processing
43
Background
Activity 1
Activity 2
Activity 200
Clip1
Clip2
Clip3
ClipN
Average
Max Probability
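The two post-processing steps above (average over clips, then take the class with maximum probability) can be sketched as follows; toy sizes are used instead of the real N clips x (200 activities + background):

```python
import numpy as np

# Per-clip class probabilities: rows are clips, columns are classes.
rng = np.random.default_rng(0)
clip_probs = rng.random((5, 4))
clip_probs /= clip_probs.sum(axis=1, keepdims=True)  # each row sums to 1

video_probs = clip_probs.mean(axis=0)  # "Average" step: one distribution per video
label = int(np.argmax(video_probs))    # "Max Probability" step: predicted class
```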
Detection Post-Processing
44
Background
Activity 1
Activity 2
Activity 200
Clip1
Clip2
Clip3
ClipN
A mean filter of k samples is applied over time.
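A minimal sketch of this smoothing step (NumPy moving average; k = 3 is just an example value):

```python
import numpy as np

def mean_filter(probs, k):
    """Smooth a per-clip probability sequence with a k-sample mean filter."""
    kernel = np.ones(k) / k
    return np.convolve(probs, kernel, mode="same")

activity_probs = np.array([0.1, 0.9, 0.2, 0.8, 0.9, 0.1])
smoothed = mean_filter(activity_probs, k=3)  # isolated spikes are damped
```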
Detection Post-Processing
45
Background
Activity
Clip1
Clip2
Clip3
ClipN
Activity threshold Ɣ
Detection Post-Processing
46
Activity threshold Ɣ applied to the smoothed probabilities
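Assuming the Ɣ threshold is applied per class to the smoothed probabilities, consecutive above-threshold clips can be merged into temporal segments like this (the merging details are my assumption, not taken from the slides):

```python
import numpy as np

def extract_segments(probs, gamma):
    """Return (start_clip, end_clip) segments where probs >= gamma."""
    active = probs >= gamma
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                       # segment opens
        elif not a and start is not None:
            segments.append((start, i - 1)) # segment closes
            start = None
    if start is not None:
        segments.append((start, len(active) - 1))
    return segments

probs = np.array([0.1, 0.7, 0.8, 0.2, 0.9, 0.9, 0.3])
print(extract_segments(probs, gamma=0.5))   # [(1, 2), (4, 5)]
```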
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
47
Classification: Audio Features
48
Video only: mAP = 0.5938 · Video + audio: mAP = 0.5755
Music unrelated to the activity is often added to the videos in post-processing,
causing a decrease in performance when audio and video features are combined.
Classification: Depth Analysis
49
mAP = 0.5938 mAP = 0.5492 mAP = 0.5635
Deeper networks overfit.
Classification Results Per Activity
50
Classification Results Per Activity
51
Using the Pommel Horse
Sailing
Playing Ice Hockey
Rock Climbing
BMX
Classification Results Per Activity
52
Drinking Coffee
Peeling Potatoes
Having an Ice Cream
Rock-Paper-Scissors
Polishing shoes
Top Level Classification
53
Detection
54
mAP = 0.2251 mAP = 0.2067
Model with feedback did not improve results
Training with feedback
55
When training, the previous ground-truth label (one-hot, e.g. 0 0 1 0 0 0) is concatenated with the video features and fed to the 512-LSTM.
Training with feedback
56
When testing, the previous prediction (e.g. 0 0.1 0.6 0.2 0.1 0) is concatenated with the video features and fed to the 512-LSTM.
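The two feedback modes differ only in what gets concatenated with the video features at each timestep; a toy sketch (sizes are made up, not the thesis configuration):

```python
import numpy as np

def lstm_input(video_feat, prev_label):
    """Build the LSTM input: video features + previous-step label vector."""
    return np.concatenate([video_feat, prev_label])

feat = np.zeros(8)  # toy 8-d clip feature

# Training: teacher forcing with the one-hot ground truth of the previous clip.
train_step = lstm_input(feat, np.array([0, 0, 1, 0, 0, 0]))

# Testing: the network's own softmax prediction from the previous clip.
test_step = lstm_input(feat, np.array([0, 0.1, 0.6, 0.2, 0.1, 0]))
```

The mismatch between these two regimes (clean labels at training time, noisy predictions at test time) is one plausible reason the feedback model did not improve results.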
Comparing Post-Processing
57
Grid search for the optimal post-processing parameters (mean filter size k and activity threshold Ɣ)
Detection Results per Activity
58
Detection Results per Activity
59
Windsurfing
Riding Bumper Cars
Playing Racquetball
Using the Pommel Horse
Using Parallel Bars
Detection Results per Activity
60
Drinking Coffee
Putting on Shoes
Rock-Paper-Scissors
Removing Curlers
Smoking a Cigarette
Top Level Detection
61
Qualitative Evaluation
62
Ground Truth:
Playing water polo
Prediction:
0.765 Playing water polo
0.202 Swimming
0.007 Springboard diving
Qualitative Evaluation
63
Ground Truth:
Hopscotch
Prediction:
0.848 Running a marathon
0.023 Triple jump
0.022 Javelin throw
Qualitative Evaluation
64
Qualitative Evaluation
65
Challenge Results
66
Classification Task
(24 participants)
Baseline: 42.20% mAP
UPC Team: 58.74% mAP
Average performance: 66.26% mAP
Winner: 93.23% mAP
* results over test subset
Slide Design by Issey Masuda
Challenge Results
67
Detection Task
(6 participants)
Baseline: 9.70% mAP
UPC Team: 22.36% mAP
Average performance: 29.94% mAP
Winner: 42.47% mAP
* results over test subset
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
68
Conclusions
69
Classification:
Longboarding
Detection:
42.7s – 193.5s Longboarding
Conclusions
70
Two-stream architecture (Spatial Net + Temporal Net): the winning entry of the ActivityNet Classification task.
Wang, L., et al. (2015). Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159.
Conclusions
72
Best results were obtained for sport categories, due to C3D being pretrained on the Sports-1M dataset.
Future Work: E2E Training
73
Training the whole pipeline end-to-end would reduce the bias towards sport categories.
Future Work: Attention Models
74
Temporal
Attention
Filters
Neural Network
Challenge Submission
75
Open Sourced Contributions
76
github.com/imatge-upc/activitynet-2016-cvprw
Thank you for your attention
77
78
Questions?
79
Support Slides
Metrics
80
Classification: Hit@3
Detection: IoU
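Both metrics can be sketched in a few lines (toy inputs; the actual challenge evaluation aggregates these over many videos and classes):

```python
import numpy as np

def hit_at_3(class_probs, true_label):
    """Hit@3: is the ground-truth class among the three most likely ones?"""
    top3 = np.argsort(class_probs)[-3:]
    return true_label in top3

def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(hit_at_3(np.array([0.1, 0.5, 0.3, 0.1]), 2))  # True: class 2 is in the top-3
print(temporal_iou((0, 10), (5, 15)))               # 5 / 15 = 0.333...
```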
Smoothing Effect Comparison
81
Post-Processing Effect
82
Smoothing Filter:
Post-Processing Effect
83
Activity Threshold:
Activities Duration
84
AP and Video Appearance Correlation
85
AP and Video Appearance Correlation
86
Preparing Data
87
batch 1
batch 2
Preparing Data
88
Sequence of Video Vector Features
Sequence of Activities
time
Preparing Data
89
time
timesteps
Preparing Data
90
Preparing Data
91
Gradient Propagation
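The data preparation above (splitting the per-clip feature sequence into fixed-length timestep chunks so gradients propagate over a bounded window) can be sketched as follows; the sizes and helper name are mine:

```python
import numpy as np

def make_chunks(features, timesteps):
    """Split a (clips, dims) feature sequence into (batch, timesteps, dims)
    chunks for truncated backpropagation through time; the ragged tail
    that does not fill a full chunk is dropped."""
    n = (len(features) // timesteps) * timesteps
    return features[:n].reshape(-1, timesteps, features.shape[1])

features = np.random.rand(10, 4)        # 10 clips, toy 4-d features
batches = make_chunks(features, timesteps=3)
print(batches.shape)                    # (3, 3, 4)
```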
Gathering Audio Features
92
Each 16-frame clip is aligned over time with the 10 ms MFCC feature frames and spectral features that fall inside it.
Gathering Audio Features
93
For each 16-frame clip, the MFCC frames are aggregated into mean MFCC and std MFCC feature vectors, together with the spectral features.
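A sketch of this per-clip aggregation with made-up dimensions (13 MFCC coefficients and a 5-d spectral vector are assumptions, not the thesis configuration):

```python
import numpy as np

# All 10 ms MFCC frames that fall inside one 16-frame video clip are
# summarized by their per-coefficient mean and std, then concatenated
# with the clip's spectral features into one audio vector per clip.
rng = np.random.default_rng(1)
mfcc_frames = rng.random((66, 13))   # ~66 MFCC frames per clip, 13 coefficients
spectral = rng.random(5)

clip_audio = np.concatenate([mfcc_frames.mean(axis=0),
                             mfcc_frames.std(axis=0),
                             spectral])
print(clip_audio.shape)              # (31,): 13 mean + 13 std + 5 spectral
```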
Convolutional Neural Network
95
Convolutional Layer
Convolutional Neural Network
96
Pooling Layer
Convolutional Neural Network
97
Fully-Connected Layer
Qualitative Evaluation
98