(the rest of this talk…)
Data Scientist @ Endgame
Time Series Anomaly Detection
Problem:
Highlight when recorded metrics deviate from
normal patterns.
For example, a high number of connections might be an
indication of a brute-force attack.
For example, a large volume of outgoing data might be an
indication of an exfiltration event.
Solution:
Build a system that can track and store
historical records of any metric. Develop an
algorithm that will detect irregular behavior
with minimal false positives.
kairos
A Python interface to backend storage databases
(Redis in my case; others are available) tailored for
time-series storage.
Takes care of expiring data and different types of time
series (series, histogram, count, gauge, set).
Open sourced by Agora Games.
https://github.com/agoragames/kairos
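A minimal sketch of setting up a kairos time series, following the kairos README; the Redis location and the "months" interval configuration here are illustrative, not the talk's actual config:
import time
from redis import Redis
from kairos import Timeseries

# store hourly histogram buckets and keep 720 of them (~1 month);
# the step/steps numbers are illustrative assumptions
client = Redis("localhost", 6379)
ktseries = Timeseries(client, type="histogram", read_func=float,
                      intervals={"months": {"step": 3600, "steps": 720}})

# record one observation for the "connections" metric
ktseries.insert("connections", 42.0, timestamp=time.time())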
kafka-python
A Python interface to Apache Kafka, where Kafka is
publish-subscribe messaging rethought as a
distributed commit log.
Allows me to subscribe to events in real time as they
come in.
https://github.com/mumrah/kafka-python
kafka-python
Example code:
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

# connect to the broker and consume the "rawmsgs" topic
# as the "timevault" consumer group
kclient = KafkaClient("localhost:9092")
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

for message in kconsumer:
    insert_to_kairos(message)
pyspark
A Python interface to Apache Spark, where Spark is a
fast and general engine for large-scale data
processing.
Allows me to backfill historical data into the time series
when I add or modify metrics.
http://spark.apache.org/
pyspark
Example code:
from json import loads
from functools import partial

import timevault as tv
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("local[*]")  # run Spark in local mode
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(tv.conf.hdfs_files)  # read raw JSON events from HDFS
       .map(loads)                      # parse each line as JSON
       .flatMap(tv.flatten_message)
       .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
       .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
       .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
       .count())                        # force evaluation of the pipeline
the end result
from pandas import DataFrame, to_datetime

# pull the stored series out of kairos into a pandas DataFrame
series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))
building models
The first naïve model simply uses the mean and standard
deviation of the metric across all time (sketched below).
blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit
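A minimal sketch of this first model, assuming the DataFrame built on "the end result" slide; the function name and the stdlimit default are illustrative:
# flag any point more than stdlimit standard deviations
# from the all-time mean of the series
def mean_std_outlier(df, stdlimit=4):
    mean, std = df["data"].mean(), df["data"].std()
    df["data_outlier"] = (df["data"] - mean).abs() > stdlimit * std
    return df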
building models
The second, slightly less naïve model fits a sine curve
to the whole series (sketched below).
blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit
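A sketch of the sine fit using scipy.optimize.curve_fit; the exact parameterization and the initial guesses are my assumptions:
import numpy as np
from scipy.optimize import curve_fit

def sine(t, amp, freq, phase, offset):
    return amp * np.sin(2 * np.pi * freq * t + phase) + offset

def fit_sine(df):
    # seconds since the start of the series
    t = (df.index.astype(np.int64) // 10**9).astype(float)
    t = t - t[0]
    # initial guess: one-day period, centered on the series mean
    guess = [df["data"].std(), 1.0 / 86400, 0.0, df["data"].mean()]
    params, _ = curve_fit(sine, t, df["data"].values, p0=guess)
    return sine(t, *params)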
classification
Both naïve models left a lot to be desired. Two simple
classifications would help us treat different types of
time series appropriately:
Does this metric show a weekly pattern (i.e., different
behavior on weekends versus weekdays)?
Does this metric show a daily pattern?
classification
weekly
Fit a sine curve to the weekday and weekend periods separately. Use the ratio of the levels of those fits to decide whether weekdays should be modeled separately from weekends, as in the sketch below.
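A sketch of that weekly classification, reusing the sine model from the earlier sketch; fitted_level and weekly_ratio are illustrative names:
import numpy as np
from scipy.optimize import curve_fit

def sine(t, amp, freq, phase, offset):
    return amp * np.sin(2 * np.pi * freq * t + phase) + offset

def fitted_level(series):
    # fit the sine model and return its offset (the series' level)
    t = (series.index.astype(np.int64) // 10**9).astype(float)
    guess = [series.std(), 1.0 / 86400, 0.0, series.mean()]
    params, _ = curve_fit(sine, t - t[0], series.values, p0=guess)
    return params[3]

def weekly_ratio(tsdf):
    # compare the fitted level on weekdays to the level on weekends
    weekday = fitted_level(tsdf["conns"][tsdf.index.dayofweek < 5])
    weekend = fitted_level(tsdf["conns"][tsdf.index.dayofweek >= 5])
    return weekend / weekday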
classification
daily
Take a Fourier transform of the time series and inspect the bins associated with a frequency of one day. Use the ratio of those bins to the first bin (the constant or DC component) to classify the time series.
classification
daily
Find the bin associated with the frequency of one day using daybin = (1 / 86400) / Δf, where Δf = 1 / (nbins · Δt):
import numpy as np

def daily_ratio(tsdf):
    nbins = len(tsdf)
    # sample spacing in seconds, and the FFT's frequency resolution
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds
    deltaf = 1.0 / (nbins * deltat)
    # index of the bin corresponding to a period of one day
    daybin = int((1.0 / (24 * 3600)) / deltaf)
    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    # power near the one-day bin relative to the DC component
    return np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
ewma
Exponentially weighted moving average:
y_t = α·x_t + (1 − α)·y_{t−1}
The decay parameter is specified as a span, s, in
pandas, related to α by:
α = 2 / (s + 1)
A normal EWMA analysis is done when the metric shows no daily pattern (a sketch of this case follows). A stacked EWMA analysis is done when there is a daily pattern.
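A sketch of the normal (non-stacked) case, mirroring the column naming of the stacked example on the next slide; the stdlimit and span defaults are illustrative:
import pandas as pd

def ewma_outlier(tsdf, stdlimit=4, span=10):
    # predict each point from the EWMA of everything before it
    tsdf["conns_binpred"] = pd.ewma(tsdf["conns"], span=span).shift(1)
    tsdf["conns_binstd"] = pd.ewmstd(tsdf["conns"], span=span).shift(1)
    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                          tsdf["conns_binstd"])
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf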
ewma
stacked
Shift the EWMA results by a day and overlay them on the original DataFrame.
import pandas as pd

def stacked_outlier(tsdf, stdlimit=4, span=10):
    # group by time of day, then run EWMA mean/std within each group
    grouped = tsdf.groupby("timeofday")["conns"]
    gbdf = pd.DataFrame({"conns_binpred": grouped.apply(pd.ewma, span=span),
                         "conns_binstd": grouped.apply(pd.ewmstd, span=span)})
    # shift the predictions forward by one day so each point is
    # compared against the previous day's estimate
    interval = tsdf.timeofday[1] - tsdf.timeofday[0]
    nshift = int(86400.0 / interval)
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)
    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                          tsdf["conns_binstd"])
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf
ewma
stacked
blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit
arima
I am currently investigating using ARIMA
(autoregressive integrated moving average) models to
make better predictions.
I’m not convinced that this level of detail is necessary
for the analysis I’m doing, but I wanted to highlight
another cool scientific computing library that’s
available.
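As a sketch of what that could look like with statsmodels (this exact call is my assumption, not the talk's code):
from statsmodels.tsa.arima.model import ARIMA

# fit an ARIMA(1, 1, 1) model, matching the p = d = q = 1
# configuration on the next slide, and forecast the next 24 points
model = ARIMA(df["data"], order=(1, 1, 1))
results = model.fit()
forecast = results.forecast(steps=24)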
arima
p = d = q = 1
blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit
takeaways
Python provides simple and usable interfaces to most
data handling projects.
Combined, these interfaces can build a complete
pipeline from data collection through analysis.