Summary
1. Problem framing
2. Benchmark Datasets
3. How about constructing your own dataset?
4. Unsupervised Approaches
a. Convolutional LSTM Autoencoder
b. Memory-Augmented Autoencoder
c. Memory-augmented Conv2D Autoencoder (MemConv2DAE)
5. Experiment Results
6. Conclusions
1. Problem Framing
The task is to identify the frames within a video that contain anomalous events.
● In surveillance videos: the presence, absence, or movement of an object
● In industrial process videos: irregularities in a process, such as the shape of a flame
Zhu, S., Chen, C., & Sultani, W. (2020). Video Anomaly Detection for Smart Surveillance. http://arxiv.org/abs/2004.00222
1. Problem Framing
This is a challenging task due to two major difficulties:
1. The data imbalance between positive (anomalous) and negative (normal) samples
2. The high variance within positive samples (although negative samples can also show high variance)
Usually addressed by:
● Training a model to represent normal events and treating outliers as anomalous events
● Identifying outliers by high scores in some form of reconstruction loss, or by low scores in metrics that are the inverse of the loss - such as the regularity score
1. Problem Framing
Another aspect to consider is that a sample fed to an anomaly detection model usually has
four dimensions (excluding the batch size), namely:
T (temporal depth) x h (height) x w (width) x c (channels)
The unsupervised models follow an autoencoder configuration and the goal is to
reconstruct the input sequence.
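As a concrete illustration, here is a minimal numpy sketch of this layout (the sizes are illustrative, not taken from the talk):

```python
import numpy as np

# One sample: T frames of height h and width w, with c channels
T, h, w, c = 15, 64, 64, 1          # illustrative sizes
sample = np.zeros((T, h, w, c), dtype=np.float32)

# A batch adds a leading dimension: (batch, T, h, w, c)
batch = np.stack([sample, sample])

print(sample.shape)  # (15, 64, 64, 1)
print(batch.shape)   # (2, 15, 64, 64, 1)
```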
1. Problem Framing
Input sequences are constructed with frame skipping, because consecutive frames may contain redundant information.
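The skipping scheme can be sketched as a simple stride over frame indices (the function and parameter names here are ours, for illustration):

```python
def sample_with_skip(num_frames, temporal_depth, skip):
    """Return the frame indices of one input sequence starting at frame 0,
    taking every (skip + 1)-th frame until temporal_depth frames are collected."""
    step = skip + 1
    return list(range(0, num_frames, step))[:temporal_depth]

# e.g. a depth-5 sequence that skips 2 frames between samples
print(sample_with_skip(num_frames=30, temporal_depth=5, skip=2))
# -> [0, 3, 6, 9, 12]
```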
1. Problem Framing
Abnormality score, based on the reconstruction losses e(t) of a set of sequences, and the regularity score derived from it.
Y. S. Chong and Y. H. Tay, “Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder,” Advances in Neural Networks - ISNN 2017. pp. 189–196, 2017,
https://arxiv.org/abs/1701.01546
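Our reading of the score definitions in the cited paper (Chong & Tay), sketched with numpy: the reconstruction errors e(t) are min-max normalized into an abnormality score, and the regularity score is its complement.

```python
import numpy as np

def abnormality_score(errors):
    """Normalize the reconstruction errors e(t):
    s_a(t) = (e(t) - min_t e(t)) / max_t e(t)."""
    e = np.asarray(errors, dtype=np.float64)
    return (e - e.min()) / e.max()

def regularity_score(errors):
    """Regularity is the inverse of abnormality: s_r(t) = 1 - s_a(t)."""
    return 1.0 - abnormality_score(errors)

e = [0.2, 0.25, 0.9, 0.3]            # toy per-sequence losses
print(regularity_score(e).round(2))  # low score at the high-loss sequence
```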
2. Benchmark Datasets

| Dataset | Total videos | Training videos | Test videos | Avg. frames per video | Anomalous frames | Abnormal events | Scenes | Anomaly examples |
|---|---|---|---|---|---|---|---|---|
| UCSD Ped1 | 70 | 34 | 36 | 201 | 4,005 | 40 | Groups of people walking towards and away from the camera, with some amount of perspective distortion | Bikers, small carts |
| UCSD Ped2 | 28 | 16 | 12 | 163 | 1,636 | 12 | Scenes with pedestrian movement parallel to the camera plane | Bikers, small carts |
| Subway Entrance | 1 | -- | -- | 121,749 | 2,400 | 66 | People entering the subway | Wrong direction, no payment |
| Subway Exit | 1 | -- | -- | 64,901 | 720 | 19 | People exiting the subway | Wrong direction, no payment |
| CUHK Avenue | 37 | 16 | 21 | 30,652 | 3,820 | 47 | CUHK campus avenue videos | Run, throw, new object |
| ShanghaiTech | 437 | 330 | 107 | 317,398 | 17,090 | 130 | Scenes from the campus of ShanghaiTech | Bikers, cars |
| UCF Crime | 1,900 | 1,610 | 290 | 7,247 | -- | 13 | Videos covering 13 real-world anomaly events | Arson, accident, burglary, fighting |
3. How about constructing your own dataset?
Motivation
● Lack of datasets featuring scenes of industrial processes (which we care about at Ridge-i, given our projects)
● The need for an “easy” dataset with well-defined anomalies on which we can test different models
Method
● As a domain, we selected the operation of a domestic oven
○ It is an everyday object, so it is easily accessible
○ Allows for the regulation of flame intensity
○ It is possible to place contents inside and record their respective interaction with the flames
3. How about constructing your own dataset?
Oven3 Dataset
● Normal (flame at maximum size): 73,964 frames
● Anomaly (small flame): 11,106 frames
● Anomaly (smoke): 14,529 frames
● Anomaly (ash and flame deformation): 5,780 frames
The clips in the Oven3 dataset were recorded at 60 fps with a resolution of 1080x1920.
4. Unsupervised Approaches
Convolutional LSTM Autoencoder (ConvLSTMAE)
A spatiotemporal architecture with two main components: one
for spatial feature representation and one for learning the
temporal evolution of patterns.
Loss function: the Euclidean reconstruction error between the input sequence and its reconstruction.
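In our reading of the cited paper, the loss is the per-sequence Euclidean reconstruction error (notation ours):

```latex
% e(t): reconstruction error of the sequence starting at frame t,
% where f_W is the trained autoencoder
e(t) = \left\lVert x(t) - f_{W}\bigl(x(t)\bigr) \right\rVert_{2}
```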
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
Sometimes the autoencoder generalizes so well that it reconstructs even anomalous inputs accurately.
The MemAE aims to address this issue.
D. Gong et al., “Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection,” 2019 IEEE/CVF International
Conference on Computer Vision (ICCV). 2019, https://arxiv.org/abs/1904.02639
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
[Equations from the paper: the latent representation read from the memory, the entropy of the addressing weights, and the combined loss function]
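A minimal numpy sketch of our understanding of the MemAE memory read (simplified, not the authors' implementation): the addressing weights are a softmax over query-item similarity, a hard-shrinkage step sparsifies them, and the entropy of the weights acts as a regularizer in the loss.

```python
import numpy as np

def memae_read(z, memory, shrink_thres=0.005):
    """Sketch of the MemAE memory read for one encoded query z (size C)
    against N memory items (N x C), following our reading of Gong et al."""
    # cosine similarity between the query and each memory item
    sim = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + 1e-12)
    w = np.exp(sim) / np.exp(sim).sum()             # softmax addressing weights
    w = np.maximum(w - shrink_thres, 0.0)           # hard shrinkage (sparsify)
    w = w / (w.sum() + 1e-12)                       # re-normalize
    z_hat = w @ memory                              # latent fed to the decoder
    entropy = -(w[w > 0] * np.log(w[w > 0])).sum()  # sparsity regularizer term
    return z_hat, w, entropy

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 16))   # N=100 items, C=16 (illustrative)
z = rng.normal(size=16)
z_hat, w, entropy = memae_read(z, memory)
print(z_hat.shape, round(w.sum(), 3))
```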
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
Robustness to the memory size (M): on the UCSD-Ped2 dataset, the AUC saturates at around M = 1000.
4. Unsupervised Approaches
Memory-augmented Conv2D Autoencoder (MemConv2DAE)
Unlike the MemAE, the MemConv2DAE uses the output of 2D convolutional layers as queries and adds feature compactness and separateness losses, allowing for a much smaller number of memory items (10 vs. 2,000 in the MemAE).
The model consists of three parts: an encoder, a memory module, and a decoder. The encoder extracts a feature map of size H x W x C from an input video frame It at time t, which is treated as H x W queries qt of size 1 x 1 x C. The memory module reads and updates memory items pm of size 1 x 1 x C using these queries.
H. Park, J. Noh, and B. Ham, “Learning Memory-Guided Normality for Anomaly Detection,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2020, https://arxiv.org/abs/2003.13228
4. Unsupervised Approaches
Memory-augmented Conv2D Autoencoder (MemConv2DAE)
Multi-loss function:
● Reconstruction loss
● Feature compactness loss
● Feature separateness loss
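A sketch of how we understand the two feature losses (the names and margin value are ours, simplified from the paper): each query is pulled toward its nearest memory item (compactness) and kept farther from the second-nearest one (separateness).

```python
import numpy as np

def compactness_separateness(queries, items, margin=1.0):
    """Sketch (our reading of Park et al.) of the feature compactness and
    separateness losses: each query is pulled toward its nearest memory item
    and pushed away from the second-nearest one, with a margin."""
    compact, separate = 0.0, 0.0
    for q in queries:
        d = np.linalg.norm(items - q, axis=1)   # distance to every memory item
        first, second = np.argsort(d)[:2]       # nearest and 2nd-nearest items
        compact += d[first] ** 2                # pull toward the nearest item
        # triplet-style push: nearest stays closer than 2nd-nearest by margin
        separate += max(d[first] ** 2 - d[second] ** 2 + margin, 0.0)
    n = len(queries)
    return compact / n, separate / n

rng = np.random.default_rng(1)
items = rng.normal(size=(10, 8))     # M=10 memory items of size C=8
queries = rng.normal(size=(5, 8))    # 5 illustrative queries
print(compactness_separateness(queries, items))
```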
5. Experiment Results
Baseline configuration for the Oven3 sequences
(established with the ConvLSTMAE)
● Temporal depth (T): 15 frames
● Skip: 15 frames
● Frame size: 64x64
○ Resizing frames to a smaller size improved anomaly detection, and 64x64 was the smallest size that still yielded an improvement
● Color space: grayscale
○ Grayscale usually produced better results than RGB, but RGB was always evaluated as well
[Figure: test sequence]
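An illustrative preprocessing sketch matching this baseline configuration (block-average downsampling stands in for whatever resizing the experiments actually used; it assumes the input dimensions are multiples of 64):

```python
import numpy as np

def to_gray_64(frame):
    """Average the RGB channels to grayscale, then block-average the result
    down to 64x64, scaled to [0, 1]. Assumes H and W are multiples of 64."""
    gray = frame.mean(axis=-1)                     # (H, W)
    H, W = gray.shape
    bh, bw = H // 64, W // 64
    small = gray[:bh * 64, :bw * 64].reshape(64, bh, 64, bw).mean(axis=(1, 3))
    return small.astype(np.float32) / 255.0

frame = np.full((128, 192, 3), 255, dtype=np.uint8)   # toy white frame
print(to_gray_64(frame).shape)  # (64, 64)
```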
5. Experiment Results
● The lower the regularity score on anomalies, the better
● The MemAE and the MemConv2DAE show lower regularity scores for the most subtle anomaly: the small flame
● The MemConv2DAE shows lower scores overall for every anomaly, as well as faster recoveries from anomaly back to normal
6. Conclusions
● The ConvLSTMAE is very robust to changes in the parameters of the training data and the hyperparameters of the model - when faced with a new task, it is always worth trying this model!
● The MemAE and the MemConv2DAE (in RGB mode) outperform the ConvLSTMAE and are more sensitive to anomalies - they are good at detecting subtle anomalies!
● The MemAE was the fastest model overall.
● With the MemAE, it is necessary to pay attention to the learning rate (the lower the better) and the memory size (the larger the better, up to a certain point).