Adam Gibson demonstrates how to use variational autoencoders to automatically label time-series location data. You'll explore the challenges of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatic labeling (and the pitfalls of doing so), and discover how you can deploy these techniques in your organization.
5. • There are five main approaches to anomaly detection:
• Probabilistic
• Distance-based
• Domain-based
• Reconstruction-based
• Information-theoretic
• All of these methods have some drawback that prevents them from being applicable to every type of data. They either:
• Have built-in assumptions about the data (like Gaussian mixture models)
• Require specific domain knowledge
• Only detect certain patterns of anomalies
• Are not suitable for data with strong temporal dependencies
• Are not suitable for multivariate data, or are computationally infeasible at our scale
• Multiple approaches are necessary for a comprehensive detection pipeline
Anomaly Detection Approaches
7. Cluster-Based Methods
Cluster-based methods work by building a dictionary of non-anomalous data and, at inference time, finding the entry that best matches the actual data.
If the pattern has never been seen before, the reconstruction will be very different from the actual data.
confidential 7
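The dictionary idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the talk's implementation: k-means centroids serve as the dictionary of non-anomalous patterns, and the anomaly score of a new point is its distance to the best-matching entry. (Farthest-point initialization and the toy two-cluster data are assumptions made for the sketch.)

```python
import numpy as np

def build_dictionary(X, k, iters=50):
    """Plain k-means; the resulting centroids act as the 'dictionary'
    of non-anomalous patterns."""
    # Farthest-point initialization: start from X[0], then repeatedly
    # add the point farthest from all centroids chosen so far.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def anomaly_score(x, centroids):
    """Distance to the best-matching dictionary entry; a pattern that was
    never seen before matches nothing well and scores high."""
    return np.linalg.norm(centroids - x, axis=1).min()

# Toy non-anomalous data: two tight clusters.
rng = np.random.default_rng(1)
normal = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(200, 2)),
])
dictionary = build_dictionary(normal, k=2)

print(anomaly_score(np.array([0.05, -0.02]), dictionary))  # small: near a known cluster
print(anomaly_score(np.array([2.5, 2.5]), dictionary))     # large: pattern never seen
```

In practice the dictionary would be built from far richer representations, but the scoring principle is the same.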
9. Example workflow for anomaly detection
(Diagram: join the raw data, transform it, feed groups into the autoencoder, and save the reconstruction error of each group's center; input data shown alongside its reconstruction.)
10. Example of a VAE (variational autoencoder) detecting anomalies
(Figure: normal patterns produce low reconstruction error; anomalous patterns produce high reconstruction error.)
11. Data Processing Step 3: Training
The autoencoder will be trained to recreate the input data as closely as possible
(one row at a time)
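Row-at-a-time training can be illustrated with a toy one-hidden-layer linear autoencoder in numpy. This is an illustrative stand-in; the deck doesn't specify the architecture, learning rate, or data, and all of those are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4-column rows that actually live on a 2-D structure.
t = rng.uniform(0, 2 * np.pi, 1000)
data = np.stack([np.sin(t), np.cos(t), np.sin(t), np.cos(t)], axis=1)

# Linear autoencoder: 4 -> 2 -> 4.
W_enc = rng.normal(0, 0.1, (4, 2))
W_dec = rng.normal(0, 0.1, (2, 4))
lr = 0.05

for epoch in range(30):
    for row in data:            # one row at a time (plain SGD)
        h = row @ W_enc         # encode
        recon = h @ W_dec       # decode
        err = recon - row       # reconstruction residual
        # Gradients of the squared error w.r.t. both weight matrices.
        g_dec = np.outer(h, err)
        g_enc = np.outer(row, err @ W_dec.T)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc

# After training, the autoencoder recreates the input closely.
mse = (((data @ W_enc) @ W_dec - data) ** 2).mean()
print(mse)  # small after training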
12. Data Processing Step 4: Ranking
The trained autoencoder will then be run on the new data and we will store the error
(sum of mean squared difference per column per row) in a Ranking engine.
Unusual patterns will have a high reconstruction error
(Diagram: Input Data → Output (Reconstruction); Error = (Reconstruction − Input)², stored in an error table.)
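A minimal sketch of this ranking step, using a linear reconstruction model (PCA via SVD) as a stand-in for the trained autoencoder; the per-row error is the sum over columns of the squared difference between reconstruction and input, as described above. The toy 5-column data is an assumption for the sketch.

```python
import numpy as np

def fit_reconstructor(X, n_components=2):
    """Stand-in for the trained autoencoder: project rows onto the top
    principal components and map them back to the original space."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:n_components].T
    return mean, V

def reconstruction_error(X, mean, V):
    """Per-row error: sum over columns of the squared difference
    between the reconstruction and the input."""
    recon = (X - mean) @ V @ V.T + mean
    return ((recon - X) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=500)
# Normal rows lie near a 2-D structure inside a 5-column space.
X = np.stack([np.sin(t), np.cos(t), np.sin(t) + np.cos(t),
              0.5 * np.sin(t), -np.cos(t)], axis=1)
X += rng.normal(scale=0.01, size=X.shape)

mean, V = fit_reconstructor(X)

# Score new rows and rank them: unusual patterns come out on top.
new_rows = np.vstack([X[:3], np.full(5, 3.0)])   # last row is anomalous
errors = reconstruction_error(new_rows, mean, V)
ranking = errors.argsort()[::-1]                  # highest error first
print(ranking)  # the anomalous row (index 3) ranks first
```

In the real pipeline these scores would go into the ranking engine rather than being sorted in memory.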
13. Problem: No Free Lunch
Wolpert's no-free-lunch theorem states that no single machine learning algorithm can perform well on every task. Deep learning is itself a set of very different techniques that are good or bad at various problems.
The system will have to use various algorithms to detect different types of anomalies and possibly different root causes.
Class imbalance will be a problem at the beginning of the system's lifetime. Labeled data will be overshadowed by unlabeled data, and the anomaly detectors will not be able to improve for some time. One possible solution is pseudo-labeling: using a trained classifier to label all the data. These labels will be very noisy at first, so they might have to be deployed in stages.
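The pseudo-labeling idea can be sketched as follows. This is a minimal illustration with assumptions the deck doesn't make: a from-scratch logistic-regression classifier, a confidence threshold of 0.95, and toy two-blob data. The pattern is what matters: train on the small labeled set, predict on the unlabeled pool, keep only high-confidence predictions as (noisy) labels, and retrain.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=500):
    """Tiny logistic-regression trainer (batch gradient descent)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

rng = np.random.default_rng(0)
# Small labeled set, large unlabeled pool (two separated blobs).
X_lab = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(2, 0.5, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.vstack([rng.normal(-2, 0.5, (500, 2)), rng.normal(2, 0.5, (500, 2))])

w, b = train_logreg(X_lab, y_lab)

# Pseudo-label only the confident predictions; the rest stay unlabeled.
p = predict_proba(X_unlab, w, b)
confident = (p < 0.05) | (p > 0.95)
pseudo_y = (p > 0.5).astype(int)[confident]
X_pseudo = X_unlab[confident]

# Retrain on the labeled set plus the pseudo-labels.
w2, b2 = train_logreg(np.vstack([X_lab, X_pseudo]),
                      np.concatenate([y_lab, pseudo_y]))
print(f"pseudo-labeled {confident.sum()} of {len(X_unlab)} rows")
```

Raising the confidence threshold over successive stages is one way to deploy the noisy labels gradually, as the slide suggests.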
14. • The system needs to handle terabytes of data at minimum
• The system is required to train a large number of neural networks within a short time frame. GPU servers are a cost-effective way to obtain the necessary compute resources.
• GPU servers are storage-inefficient, so the system will employ the Hadoop Distributed File System (HDFS) and Spark on commodity servers to meet the storage requirements.
• The system must scale to larger problems.
System Requirements
15. Use k-means and t-SNE cluster highlighting to label data points
• Uses the representation from the autoencoder to automatically group data
• t-SNE visualization allows highlighting and automatic labeling
• Use k-NN and VP-trees to sample the hidden activations learned from the neural net to interactively label
Automatic Labeling
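The nearest-neighbour part of this idea can be sketched in plain numpy. Brute-force k-NN stands in for the VP-tree (which the deck uses to make neighbour search feasible at scale), and the "activations" are simulated: labels the analyst gives to a handful of highlighted points are propagated to everything else through nearest neighbours in the autoencoder's hidden-activation space.

```python
import numpy as np

def knn_propagate(acts_labeled, labels, acts_unlabeled, k=3):
    """Label each unlabeled activation by majority vote of its k nearest
    labeled activations (brute force; a VP-tree would avoid building the
    full distance matrix at scale)."""
    d = np.linalg.norm(acts_unlabeled[:, None, :] - acts_labeled[None, :, :],
                       axis=2)
    nn = d.argsort(axis=1)[:, :k]     # indices of the k nearest labeled points
    votes = labels[nn]
    # Majority vote per unlabeled row.
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
# Pretend these are hidden activations from the autoencoder: two groups.
acts_a = rng.normal(0.0, 0.3, (100, 8))
acts_b = rng.normal(3.0, 0.3, (100, 8))

# The analyst interactively labels just a handful of points per group.
acts_labeled = np.vstack([acts_a[:3], acts_b[:3]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Everything else inherits a label from its neighbours.
acts_unlabeled = np.vstack([acts_a[3:], acts_b[3:]])
pred = knn_propagate(acts_labeled, labels, acts_unlabeled, k=3)
```

A few interactive labels per cluster are enough here because the autoencoder's representation already groups similar data together.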
16. • Autoencoders can be trained to identify causes of certain kinds of behavior
• "Spikes" in reconstruction error on time series can be used to detect problems in infrastructure as well as in network monitoring (dropped connections, unusually high latency)
• Use k-NN and VP-trees to sample the hidden activations learned from the neural net to interactively label
Root Cause Analysis
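One simple way to flag the "spikes" mentioned above (a sketch, not the deck's implementation): mark any point in the reconstruction-error series that exceeds a rolling mean of the preceding window by several rolling standard deviations. The window size, threshold, and injected incidents are all assumptions of the sketch.

```python
import numpy as np

def spike_indices(errors, window=50, n_sigmas=5.0):
    """Flag reconstruction-error spikes: points more than n_sigmas rolling
    standard deviations above the rolling mean of the preceding window."""
    errors = np.asarray(errors, dtype=float)
    spikes = []
    for i in range(window, len(errors)):
        hist = errors[i - window:i]
        mu, sigma = hist.mean(), hist.std()
        if errors[i] > mu + n_sigmas * max(sigma, 1e-9):
            spikes.append(i)
    return spikes

rng = np.random.default_rng(0)
# Steady baseline error with two injected incidents (e.g. dropped
# connections showing up as unusually high reconstruction error).
errors = rng.normal(1.0, 0.05, 500)
errors[200] += 2.0
errors[350] += 1.5
print(spike_indices(errors))  # flags the injected incidents
```

Using the preceding window only (rather than a symmetric one) means spikes can be flagged online, as new reconstruction errors stream out of the detector.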
17. • The system will not use redundant environments, since disaster recovery is not a requirement.
• Inside an environment the system will have redundant hardware and can tolerate the loss of one GPU node and one app node without service degradation.
• The system does not employ a remote backup strategy because all data is ephemeral and can be recreated from the data on S3.
Design Considerations for Production