Anomaly Detection using Spark MLlib and Spark Streaming

Anomaly Detection
Offline Training using Spark Mllib;
Online Testing using Spark Streaming;
Details: https://github.com/keiraqz/anomaly-detection
Keira Zhou Dec, 2015

The Model
 Model is trained using KMeans(Spark MLlib K-means)
approach
 Trained on "normal" dataset only
 After the model is trained, the centroid of the "normal"
dataset will be returned as well as a threshold
 During the validation stage, any data points that are
further than the threshold from the centroid are
considered as "anomalies".

Dataset
 The dataset is downloaded from KDD Cup 1999 Data
for Anomaly Detection [1]
 Training Set: The training set is separated from the
whole dataset with the data points that are labeled as
"normal" only
 Validation Set: The validation set is using the whole
dataset. All data points that are NOT labeled as
"normal" are considered as "anomalies”
[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Offline Training
 The majority of the training code mainly follows the
tutorial from Sean Owen, Cloudera:
 Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
 Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-
apache-spark
 Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-
apache-spark-2
 Couple of modifications have been made to fit
personal interest:
 Instead of training multiple clusters, the code only trains on "normal"
data points
 Only one cluster center is recorded and threshold is set to the last of
the furthest 2000 data points
 During later validating stage, all points that are further than the
threshold is labeled as "anomaly"

Online Testing
 Validation is run as a streaming job using Spark
Streaming
 Currently the application reads the input data from a
local file
 In an ideal situation, the program will read the data from
some ingestion tools such as Kafka
 Also, the trained model (centroid and threshold) is
also saved in a local file
 In production, the information should be saved into a
database

 Spark Streaming context: process every 3 seconds
 Load the trained model:
 Load from local file and put into a queueStream
 The streaming task: Calculate the distance between the data point
and the centroid, then compare to the threshold

Notes
 Currently the application reads the input data from a
local file
 In an ideal situation, the program will read the data from
some ingestion tools such as Kafka
 Also, the trained model (centroid and threshold) is
also saved in a local file
 In production, the information should be saved into a
database
 The output of the testing can be saved into a
database for visualization

More Details
 https://github.com/keiraqz/anomaly-detection

Anomaly Detection using Spark MLlib and Spark Streaming

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Anomaly Detection using Spark MLlib and Spark Streaming

Similar to Anomaly Detection using Spark MLlib and Spark Streaming (20)

Recently uploaded

Recently uploaded (20)

Anomaly Detection using Spark MLlib and Spark Streaming