Summary
1. Problem framing
2. Benchmark Datasets
3. How about constructing your own dataset?
4. Unsupervised Approaches
a. Convolutional LSTM Autoencoder
b. Memory-Augmented Autoencoder
c. Memory-augmented Conv2D Autoencoder (MemConv2DAE)
5. Experiment Results
6. Conclusions
1. Problem Framing
The task is to identify the frames within a video that contain anomalous events.
● In surveillance videos: the presence, absence, or movement of an object
● In industrial process videos: irregularities in a process, such as the shape of a flame
Zhu, S., Chen, C., & Sultani, W. (2020). Video Anomaly Detection for Smart Surveillance. http://arxiv.org/abs/2004.00222
1. Problem Framing
This is a challenging task due to two major difficulties:
1. The data imbalance between positive (anomalous) and negative (normal) samples
2. The high variance within positive samples (although negative samples can also show high variance)
Usually addressed by:
● Training a model to represent normal events and treating outliers as anomalous events
● Identifying outliers by high scores in some form of reconstruction loss, or by low scores in metrics that are the inverse of the loss - such as the regularity score
1. Problem Framing
Another aspect to consider is that a sample fed to an anomaly detection model usually has
four dimensions (excluding the batch size), namely:
T (temporal depth) x h (height) x w (width) x c (channels)
The unsupervised models follow an autoencoder configuration and the goal is to
reconstruct the input sequence.
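As a concrete illustration, here is a minimal numpy sketch of this layout (the sizes are illustrative, not taken from the talk):

```python
import numpy as np

# One sample: T frames of height h and width w, with c channels
T, h, w, c = 15, 64, 64, 1          # illustrative sizes
sample = np.zeros((T, h, w, c), dtype=np.float32)

# A batch adds a leading dimension: (batch, T, h, w, c)
batch = np.stack([sample, sample])

print(sample.shape)  # (15, 64, 64, 1)
print(batch.shape)   # (2, 15, 64, 64, 1)
```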
1. Problem Framing
Input sequences are constructed with frame skipping, because consecutive frames may contain redundant information.
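The skipping scheme can be sketched as a simple stride over frame indices (the function and parameter names here are ours, for illustration):

```python
def sample_with_skip(num_frames, temporal_depth, skip):
    """Return the frame indices of one input sequence starting at frame 0,
    taking every (skip + 1)-th frame until temporal_depth frames are collected."""
    step = skip + 1
    return list(range(0, num_frames, step))[:temporal_depth]

# e.g. a depth-5 sequence that skips 2 frames between samples
print(sample_with_skip(num_frames=30, temporal_depth=5, skip=2))
# -> [0, 3, 6, 9, 12]
```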
1. Problem Framing
Abnormality score, based on the reconstruction losses e(t) of a set of sequences, and the regularity score derived from it.
Y. S. Chong and Y. H. Tay, “Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder,” Advances in Neural Networks - ISNN 2017. pp. 189–196, 2017,
https://arxiv.org/abs/1701.01546
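Our reading of the score definitions in the cited paper (Chong & Tay), sketched with numpy: the reconstruction errors e(t) are min-max normalized into an abnormality score, and the regularity score is its complement.

```python
import numpy as np

def abnormality_score(errors):
    """Normalize the reconstruction errors e(t):
    s_a(t) = (e(t) - min_t e(t)) / max_t e(t)."""
    e = np.asarray(errors, dtype=np.float64)
    return (e - e.min()) / e.max()

def regularity_score(errors):
    """Regularity is the inverse of abnormality: s_r(t) = 1 - s_a(t)."""
    return 1.0 - abnormality_score(errors)

e = [0.2, 0.25, 0.9, 0.3]            # toy per-sequence losses
print(regularity_score(e).round(2))  # low score at the high-loss sequence
```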
2. Benchmark Datasets

| Dataset | Total videos | Training videos | Test videos | Avg. frames per video | Anomalous frames | Abnormal events | Scenes | Anomaly examples |
|---|---|---|---|---|---|---|---|---|
| UCSD Ped1 | 70 | 34 | 36 | 201 | 4,005 | 40 | Groups of people walking towards and away from the camera, with some amount of perspective distortion | Bikers, small carts |
| UCSD Ped2 | 28 | 16 | 12 | 163 | 1,636 | 12 | Scenes with pedestrian movement parallel to the camera plane | Bikers, small carts |
| Subway Entrance | 1 | -- | -- | 121,749 | 2,400 | 66 | People entering the subway | Wrong direction, no payment |
| Subway Exit | 1 | -- | -- | 64,901 | 720 | 19 | People exiting the subway | Wrong direction, no payment |
| CUHK Avenue | 37 | 16 | 21 | 30,652 | 3,820 | 47 | CUHK campus avenue videos | Run, throw, new object |
| ShanghaiTech | 437 | 330 | 107 | 317,398 | 17,090 | 130 | Scenes from the campus of ShanghaiTech | Bikers, cars |
| UCF Crime | 1,900 | 1,610 | 290 | 7,247 | -- | 13 | Videos covering 13 real-world anomaly events | Arson, accident, burglary, fighting |
3. How about constructing your own dataset?
Motivation
● Lack of datasets featuring scenes of industrial processes (which we care about at Ridge-i, given our projects)
● The need for an “easy” dataset with well-defined anomalies on which we can test different models
Method
● As a domain, we selected the operation of a domestic oven
○ It is an everyday object, so it is easily accessible
○ Allows for the regulation of flame intensity
○ It is possible to place contents inside and record their respective interaction with the flames
3. How about constructing your own dataset?
Oven3 Dataset
● Normal (flame at maximum size): 73,964 frames
● Anomaly (small flame): 11,106 frames
● Anomaly (smoke): 14,529 frames
● Anomaly (ash and flame deformation): 5,780 frames
The clips in the Oven3 dataset were recorded at 60 fps with a resolution of 1080x1920.
4. Unsupervised Approaches
Convolutional LSTM Autoencoder (ConvLSTMAE)
A spatiotemporal architecture with two main components: one
for spatial feature representation and one for learning the
temporal evolution of patterns.
Loss function: the Euclidean reconstruction error between the input sequence and its reconstruction.
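In our reading of the cited paper, the loss is the per-sequence Euclidean reconstruction error (notation ours):

```latex
% e(t): reconstruction error of the sequence starting at frame t,
% where f_W is the trained autoencoder
e(t) = \left\lVert x(t) - f_{W}\bigl(x(t)\bigr) \right\rVert_{2}
```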
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
Sometimes the autoencoder generalizes so well that it reconstructs even anomalous inputs accurately.
The MemAE aims to address this issue.
D. Gong et al., “Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection,” 2019 IEEE/CVF International
Conference on Computer Vision (ICCV). 2019, https://arxiv.org/abs/1904.02639
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
[Equations from the paper: the latent representation read from the memory, the entropy of the addressing weights, and the combined loss function]
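A minimal numpy sketch of our understanding of the MemAE memory read (simplified, not the authors' implementation): the addressing weights are a softmax over query-item similarity, a hard-shrinkage step sparsifies them, and the entropy of the weights acts as a regularizer in the loss.

```python
import numpy as np

def memae_read(z, memory, shrink_thres=0.005):
    """Sketch of the MemAE memory read for one encoded query z (size C)
    against N memory items (N x C), following our reading of Gong et al."""
    # cosine similarity between the query and each memory item
    sim = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + 1e-12)
    w = np.exp(sim) / np.exp(sim).sum()             # softmax addressing weights
    w = np.maximum(w - shrink_thres, 0.0)           # hard shrinkage (sparsify)
    w = w / (w.sum() + 1e-12)                       # re-normalize
    z_hat = w @ memory                              # latent fed to the decoder
    entropy = -(w[w > 0] * np.log(w[w > 0])).sum()  # sparsity regularizer term
    return z_hat, w, entropy

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 16))   # N=100 items, C=16 (illustrative)
z = rng.normal(size=16)
z_hat, w, entropy = memae_read(z, memory)
print(z_hat.shape, round(w.sum(), 3))
```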
4. Unsupervised Approaches
Memory-augmented Autoencoder (MemAE)
Robustness to the memory size (M): on the UCSD-Ped2 dataset, the AUC saturates at around M = 1000.
4. Unsupervised Approaches
Memory-augmented Conv2D Autoencoder (MemConv2DAE)
Unlike the MemAE, the MemConv2DAE uses the output of 2D convolutional layers as queries and adds feature compactness and separateness losses, allowing for a much smaller number of memory items (10 vs. 2,000 in the MemAE).
The model consists of three parts: an encoder, a memory module, and a decoder. The encoder extracts a feature map of size H x W x C from an input video frame It at time t, which is treated as H x W queries qt of size 1 x 1 x C. The memory module reads and updates memory items pm of size 1 x 1 x C using these queries.
H. Park, J. Noh, and B. Ham, “Learning Memory-Guided Normality for Anomaly Detection,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2020, https://arxiv.org/abs/2003.13228
4. Unsupervised Approaches
Memory-augmented Conv2D Autoencoder (MemConv2DAE)
Multi-loss function:
● Reconstruction loss
● Feature compactness loss
● Feature separateness loss
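A sketch of how we understand the two feature losses (the names and margin value are ours, simplified from the paper): each query is pulled toward its nearest memory item (compactness) and kept farther from the second-nearest one (separateness).

```python
import numpy as np

def compactness_separateness(queries, items, margin=1.0):
    """Sketch (our reading of Park et al.) of the feature compactness and
    separateness losses: each query is pulled toward its nearest memory item
    and pushed away from the second-nearest one, with a margin."""
    compact, separate = 0.0, 0.0
    for q in queries:
        d = np.linalg.norm(items - q, axis=1)   # distance to every memory item
        first, second = np.argsort(d)[:2]       # nearest and 2nd-nearest items
        compact += d[first] ** 2                # pull toward the nearest item
        # triplet-style push: nearest stays closer than 2nd-nearest by margin
        separate += max(d[first] ** 2 - d[second] ** 2 + margin, 0.0)
    n = len(queries)
    return compact / n, separate / n

rng = np.random.default_rng(1)
items = rng.normal(size=(10, 8))     # M=10 memory items of size C=8
queries = rng.normal(size=(5, 8))    # 5 illustrative queries
print(compactness_separateness(queries, items))
```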
5. Experiment Results
Baseline configuration for the Oven3 sequences
(established with the ConvLSTMAE)
● Temporal depth (T): 15 frames
● Skip: 15 frames
● Frame size: 64x64
○ Resizing frames to a smaller size improved anomaly detection, and 64x64 was the smallest size that still yielded an improvement
● Color space: grayscale
○ Grayscale usually produced better results than RGB, but RGB was always evaluated as well
[Figure: test sequence]
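An illustrative preprocessing sketch matching this baseline configuration (block-average downsampling stands in for whatever resizing the experiments actually used; it assumes the input dimensions are multiples of 64):

```python
import numpy as np

def to_gray_64(frame):
    """Average the RGB channels to grayscale, then block-average the result
    down to 64x64, scaled to [0, 1]. Assumes H and W are multiples of 64."""
    gray = frame.mean(axis=-1)                     # (H, W)
    H, W = gray.shape
    bh, bw = H // 64, W // 64
    small = gray[:bh * 64, :bw * 64].reshape(64, bh, 64, bw).mean(axis=(1, 3))
    return small.astype(np.float32) / 255.0

frame = np.full((128, 192, 3), 255, dtype=np.uint8)   # toy white frame
print(to_gray_64(frame).shape)  # (64, 64)
```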
5. Experiment Results
● The lower the regularity score on anomalies, the better
● The MemAE and the MemConv2DAE show lower regularity scores for the most subtle anomaly: the small flame
● The MemConv2DAE shows lower scores overall for every anomaly, as well as faster recoveries from anomaly back to normal
6. Conclusions
● The ConvLSTMAE is very robust to changes in the parameters of the training data and the hyperparameters of the model - when faced with a new task, it is always worth trying this model!
● The MemAE and the MemConv2DAE (in RGB mode) outperform the ConvLSTMAE and are more sensitive to anomalies - they are good at detecting subtle anomalies!
● The MemAE was the fastest model overall.
● With the MemAE, it is necessary to pay attention to the learning rate (the lower the better) and the memory size (the larger the better, up to a certain point).