Study Meeting Presentation:



Unsupervised Video Anomaly Detection: A brief overview

Author: Tiago Oliveira



Date: 2021/11/10 

Summary
1. Problem framing
2. Benchmark Datasets
3. How about constructing your own dataset?
4. Unsupervised Approaches
a. Convolutional LSTM Autoencoder
b. Memory-Augmented Autoencoder
c. Memory-augmented Conv2D Autoencoder (MemConv2DAE)
5. Experiment Results
6. Conclusions
1. Problem Framing
Identification of frames within a video containing anomalous events.
In surveillance videos:
Presence or absence of an object or movement of an object
In industrial process videos:
Irregularities in a process such as the shape of a flame
Zhu, S., Chen, C., & Sultani, W. (2020). Video Anomaly Detection for Smart Surveillance. http://arxiv.org/abs/2004.00222
This is a challenging task due to two major difficulties:
1. The imbalance between positive (anomalous) and negative (normal) samples
2. The high variance within positive samples (although negative samples can also show high variance)
Usually addressed by:
● Training a model to represent normal events and treating outliers as anomalous events
● Identifying outliers by high scores in some form of reconstruction loss, or by low scores in metrics that are the inverse of the loss, such as the regularity score
Another aspect to consider is that a sample fed to an anomaly detection model usually has
four dimensions (excluding the batch size), namely:
T (temporal depth) x h (height) x w (width) x c (channels)
The unsupervised models follow an autoencoder configuration and the goal is to
reconstruct the input sequence.
[Figure: input sequence]
Input sequence with skipping (because consecutive frames may contain redundant info)
[Figure: input sequence with skipping. Labels: sequence size, skip, frame checked in prediction]
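The sampling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `build_sequences` is hypothetical, and "skip" is interpreted here as the step between consecutive frames sampled into one sequence, which the slides do not define precisely.

```python
import numpy as np

def build_sequences(frames, temporal_depth, skip):
    """Stack strided frames into sequences of shape (N, T, h, w, c).

    Assumption: `skip` is the step between consecutive frames that are
    sampled into the same sequence (skip=1 means no skipping).
    """
    frames = np.asarray(frames)
    # Number of original frames covered by one sequence
    span = (temporal_depth - 1) * skip + 1
    n_sequences = frames.shape[0] - span + 1
    if n_sequences <= 0:
        raise ValueError("video is too short for this temporal depth and skip")
    # Row i holds the frame indices i, i + skip, ..., i + (T - 1) * skip
    index = np.arange(n_sequences)[:, None] + skip * np.arange(temporal_depth)[None, :]
    return frames[index]

# Example: 100 grayscale frames of 64x64, T = 4, skip = 3
video = np.zeros((100, 64, 64, 1), dtype=np.float32)
sequences = build_sequences(video, temporal_depth=4, skip=3)
print(sequences.shape)  # (91, 4, 64, 64, 1)
```

Each sample keeps the T x h x w x c layout mentioned earlier, with the batch dimension N added in front.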
Abnormality score, based on the losses e(t) of a set of sequences: $s_a(t) = \frac{e(t) - e(t)_{\min}}{e(t)_{\max}}$
Regularity score: $s_r(t) = 1 - s_a(t)$
Y. S. Chong and Y. H. Tay, “Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder,” Advances in Neural Networks - ISNN 2017. pp. 189–196, 2017,
https://arxiv.org/abs/1701.01546
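A minimal sketch of how the two scores could be computed, assuming the min-max style normalization used by Chong and Tay (2017) and a Euclidean reconstruction error per frame. The function names are hypothetical and the reconstructions would come from whichever autoencoder is being evaluated.

```python
import numpy as np

def reconstruction_error(frames, reconstructions):
    """Per-frame Euclidean distance e(t) between input and reconstruction."""
    frames = np.asarray(frames, dtype=np.float64)
    reconstructions = np.asarray(reconstructions, dtype=np.float64)
    diff = (frames - reconstructions).reshape(len(frames), -1)
    return np.linalg.norm(diff, axis=1)

def regularity_score(errors):
    """Map errors e(t) to regularity scores s_r(t) = 1 - s_a(t).

    Assumption: s_a(t) = (e(t) - min e) / max e, computed over one video.
    """
    e = np.asarray(errors, dtype=np.float64)
    if e.max() <= 0:
        # All frames reconstructed perfectly: treat every frame as regular
        return np.ones_like(e)
    abnormality = (e - e.min()) / e.max()
    return 1.0 - abnormality

# Example with three hand-picked errors: regularity is 1.0, 0.75, 0.25
print(regularity_score([1.0, 2.0, 4.0]))
```

Frames with a low regularity score are the candidates for anomalies.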
2. Benchmark Datasets
Dataset           Total videos  Training videos  Test videos  Avg. frames per video  Anomalous frames  Abnormal events
UCSD Ped1                   70               34           36                    201             4,005               40
UCSD Ped2                   28               16           12                    163             1,636               12
Subway Entrance              1               --           --                121,749             2,400               66
Subway Exit                  1               --           --                 64,901               720               19
CUHK Avenue                 37               16           21                 30,652             3,820               47
Shanghai Tech              437              330          107                317,398            17,090              130
UCF Crime                1,900            1,610          290                  7,247                --               13
Scenes and anomaly examples:
● UCSD Ped1: groups of people walking towards and away from the camera, with some amount of perspective distortion. Anomaly examples: bikers, small carts
● UCSD Ped2: scenes with pedestrian movement parallel to the camera plane. Anomaly examples: bikers, small carts
● Subway Entrance: people entering the subway. Anomaly examples: wrong direction, no payment
● Subway Exit: people exiting the subway. Anomaly examples: wrong direction, no payment
● CUHK Avenue: CUHK campus avenue videos. Anomaly examples: run, throw, new object
● Shanghai Tech: scenes from the campus of ShanghaiTech. Anomaly examples: bikers, cars
● UCF Crime: videos covering 13 real-world anomaly events. Anomaly examples: arson, accident, burglary, fighting
[Figure: Shanghai Tech]
[Figure: UCF Crime]
3. How about constructing your own dataset?
Motivation
● Lack of datasets with scenes of industrial processes (which we care about at Ridge-i, given our projects)
● The need for an “easy” dataset with well-defined anomalies on which we can test different models
Method
● As a domain, we selected the operation of a domestic oven
○ It is an everyday object, so it is easily accessible
○ Allows for the regulation of flame intensity
○ It is possible to place contents inside and record their respective interaction with the flames
Oven3 Dataset
● Normal: flame at maximum size (73,964 frames)
● Anomaly: small flame (11,106 frames)
● Anomaly: smoke (14,529 frames)
● Anomaly: ash and flame deformation (5,780 frames)
The clips in the Oven3 dataset were recorded at 60 fps with a resolution of 1080x1920.
4. Unsupervised Approaches
Convolutional LSTM Autoencoder (ConvLSTMAE)
A spatiotemporal architecture with two main components: one
for spatial feature representation and one for learning the
temporal evolution of patterns.
Loss function
Y. S. Chong and Y. H. Tay, “Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder,” Advances in Neural Networks - ISNN 2017. pp. 189–196, 2017,
https://arxiv.org/abs/1701.01546
Memory-augmented Autoencoder (MemAE)
Sometimes the autoencoder generalizes so well that it also reconstructs anomalous inputs very well, which keeps their reconstruction error low and makes them hard to detect.
The MemAE aims to address this issue.
D. Gong et al., “Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection,” 2019 IEEE/CVF International
Conference on Computer Vision (ICCV). 2019, https://arxiv.org/abs/1904.02639
[Equations: latent representation, entropy, loss function]
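The memory read, the entropy term, and the combined loss can be sketched in NumPy as follows. This is a simplified reading of Gong et al. (2019), not their implementation: it handles a single latent vector, the function names are hypothetical, and the shrinkage threshold and the entropy weight `alpha` are illustrative defaults.

```python
import numpy as np

def memory_read(z, memory, shrink_threshold, eps=1e-12):
    """Rebuild a latent vector z (C,) from memory items (M, C).

    Steps: cosine similarity -> softmax -> hard shrinkage -> L1 renormalization.
    """
    z = np.asarray(z, dtype=np.float64)
    memory = np.asarray(memory, dtype=np.float64)
    sim = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + eps)
    w = np.exp(sim - sim.max())
    w /= w.sum()
    # Hard shrinkage: weights at or below the threshold become exactly zero
    w = np.maximum(w - shrink_threshold, 0.0) * w / (np.abs(w - shrink_threshold) + eps)
    w /= w.sum() + eps
    return w @ memory, w

def entropy(w, eps=1e-12):
    """Entropy of the addressing weights; low entropy means sparse addressing."""
    return float(-(w * np.log(w + eps)).sum())

def memae_loss(x, x_hat, w, alpha=0.0002):
    """Reconstruction error plus a weighted entropy term (alpha is illustrative)."""
    return float(np.sum((np.asarray(x) - np.asarray(x_hat)) ** 2)) + alpha * entropy(w)

# Example: three orthogonal memory items, query aligned with the first one
memory = np.eye(3)
z_hat, w = memory_read(np.array([1.0, 0.0, 0.0]), memory, shrink_threshold=1.0 / 3.0)
print(np.argmax(w), entropy(w))  # item 0 receives almost all the weight
```

Because the decoder only sees combinations of a few memorized normal patterns, an anomalous input tends to be reconstructed poorly, which is the behavior the MemAE is after.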
Robustness to the memory size (M): on the UCSD Ped2 dataset, the AUC saturates at around M = 1000.
Memory-augmented Conv2D Autoencoder (MemConv2DAE)
Unlike the MemAE, the MemConv2DAE uses the output of 2D convolutional layers as queries and adds feature compactness and separateness losses, allowing for a much smaller number of memory items (10 vs 2000 in the MemAE).
The model consists of three parts: an encoder, a memory module, and a decoder. The encoder extracts a query map q_t of size H x W x C from an input video frame I_t at time t. The memory module reads and updates M memory items p_m, each of size 1 x 1 x C, using the individual queries of size 1 x 1 x C taken from q_t.
H. Park, J. Noh, and B. Ham, “Learning Memory-Guided Normality for Anomaly Detection,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2020, https://arxiv.org/abs/2003.13228
Multi-loss function:
● Reconstruction loss
● Feature compactness loss
● Feature separateness loss
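The two feature losses can be illustrated with a small NumPy sketch. It is a simplification made for this overview and not the code of Park et al. (2020): the nearest and second-nearest memory items are picked here by squared Euclidean distance (the paper selects them through matching probabilities), and the function name and the `margin` value are assumptions.

```python
import numpy as np

def memory_feature_losses(queries, memory, margin=1.0):
    """Compactness and separateness losses for queries (K, C) and memory (M, C).

    Assumptions: squared Euclidean distances, M >= 2, and nearest items
    chosen by distance rather than by matching probability.
    """
    queries = np.asarray(queries, dtype=np.float64)
    memory = np.asarray(memory, dtype=np.float64)
    # Pairwise squared distances between every query and every memory item
    d2 = ((queries[:, None, :] - memory[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1)
    rows = np.arange(len(queries))
    nearest = d2[rows, order[:, 0]]
    second = d2[rows, order[:, 1]]
    # Compactness pulls each query towards its nearest item
    compactness = float(nearest.sum())
    # Separateness is a triplet-style margin loss that pushes the
    # second-nearest item away from the query
    separateness = float(np.maximum(nearest - second + margin, 0.0).sum())
    return compactness, separateness

queries = np.array([[0.0, 0.0], [4.0, 0.0]])
memory = np.array([[0.0, 0.0], [3.0, 0.0]])
print(memory_feature_losses(queries, memory))  # (1.0, 0.0)
```

Keeping the items compact and well separated is what lets a handful of memory items cover the normal patterns.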
AUC scores (%) of the selected approaches on the benchmark datasets
Model                 UCSD Ped1  UCSD Ped2  CUHK Avenue  Subway Entrance  Subway Exit  Shanghai Tech
ConvLSTMAE                 89.9       87.4         80.3             84.7         94.0             --
MemAE                        --       94.1         83.3               --           --           71.2
MemConv2DAE (Recon.)         --       90.2         82.8               --           --           69.8
MemConv2DAE (Pred.)          --       97.0         88.5               --           --           70.5
5. Experiment Results
Baseline configuration for the Oven3 sequences
(established with the ConvLSTMAE)
● Temporal depth (T): 15 frames
● Skip: 15 frames
● Frame size: 64x64
○ Resizing frames to a smaller size improved the detection of anomalies; 64x64 was the smallest size that still showed an improvement
● Color space: grayscale
○ Grayscale usually produced better results than RGB, but RGB was always evaluated as well
[Figure: test sequence]
● The lower the regularity score for anomalies the better
● The MemAE and the MemConv2DAE show lower regularity scores for the most subtle anomaly: small flame
● The MemConv2DAE shows overall lower scores for every anomaly and faster recoveries from anomaly to normal
Dataset configuration columns: size, color space, temporal depth, skip frames
No.  Model        Size  Color space  Temporal depth  Skip frames  AUC     Inference speed
1    ConvLSTMAE   64    gray         15              30           0.9350  13 fps
2    ConvLSTMAE   64    RGB          15              30           0.9456  13 fps
3    MemAE        64    gray         15              30           0.9442  165 fps
4    MemAE        64    RGB          15              30           0.9363  160 fps
5    MemConv2DAE  64    gray         15              30           0.9617  110 fps
6    MemConv2DAE  64    RGB          15              30           0.9639  104 fps
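The slides do not state how the AUC values were computed. A common choice is the frame-level ROC AUC, which can be sketched as below under that assumption, using an anomaly score such as 1 minus the regularity score. The function name is hypothetical, and the pairwise comparison is quadratic in memory, so it only suits short sequences.

```python
import numpy as np

def frame_level_auc(anomaly_scores, labels):
    """ROC AUC as the fraction of (anomalous, normal) frame pairs ranked correctly.

    `labels` is 1 for anomalous frames and 0 for normal ones; ties count as half.
    """
    scores = np.asarray(anomaly_scores, dtype=np.float64)
    labels = np.asarray(labels).astype(bool)
    positives = scores[labels]
    negatives = scores[~labels]
    if len(positives) == 0 or len(negatives) == 0:
        raise ValueError("both normal and anomalous frames are required")
    greater = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(positives) * len(negatives)))

# Example: three of the four (anomalous, normal) pairs are ordered correctly
print(frame_level_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```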
6. Conclusions
● The ConvLSTMAE is very robust to changes in the parameters of the training data and the hyperparameters of the model. When faced with a new task, it is always worth trying this model!
● The MemAE and the MemConv2DAE (in RGB mode) are better than the ConvLSTMAE and are more sensitive to anomalies, so they are good at detecting subtle anomalies!
● The MemAE was the fastest model overall (160 to 165 fps).
● With the MemAE, it is necessary to pay attention to the learning rate (the lower the better) and the memory size (the larger the better, up to a certain point).
Acknowledgements
Thank you to Abe-san and Motaz-san for their collaboration on the contents of this presentation.