### Time Series Analysis for Network Security

• 2. Time Series Analysis for Network Security. Phil Roth, Data Scientist @ Endgame. mrphilroth.com
• 3. First, an introduction. My history of Python scientific computing, in function calls:
• 4. os.path.walk: Physics Undergraduate @ PSU, AMANDA Neutrino Telescope
• 5. pylab.plot: Physics Graduate Student @ UMD, IceCube Neutrino Telescope
• 6. numpy.fft.fft: Radar Scientist @ User Systems, Inc., Various Radar Simulations
• 9. (the rest of this talk…): Data Scientist @ Endgame, Time Series Anomaly Detection
• 10. Problem: Highlight when recorded metrics deviate from normal patterns. For example, a high number of connections might indicate a brute-force attack, and a large volume of outgoing data might indicate an exfiltration event.
• 11. Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.
• 13. Data flow: data sources arrive as a real-time stream via Kafka topics (distributed message passing) and as batch historical data in HDFS (a large-scale distributed data store), with Redis (an in-memory key-value data store) holding the time series.
• 14. kairos: A Python interface to backend storage databases (Redis in my case; others available) tailored for time series storage. Takes care of expiring data and of different time series types (series, histogram, count, gauge, set). Open sourced by Agora Games. https://github.com/agoragames/kairos
• 15. kairos example code:

```python
from redis import Redis
from kairos import Timeseries

intervals = {"days": {"step": 60, "steps": 2880},
             "months": {"step": 1800, "steps": 4032}}
rclient = Redis("localhost", 6379)
ktseries = Timeseries(rclient, type="histogram", intervals=intervals)
ktseries.insert(metric_name, metric_value, timestamp)
```
• 16. kafka-python: A Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log. Allows me to subscribe to events as they come in, in real time. https://github.com/mumrah/kafka-python
• 17. kafka-python example code:

```python
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kclient = KafkaClient("localhost:9092")
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")
for message in kconsumer:
    insert_to_kairos(message)
```
• 18. pyspark: A Python interface to Apache Spark, where Spark is a fast and general engine for large-scale data processing. Allows me to fill in historical data for a time series when I add or modify metrics. http://spark.apache.org/
• 19. pyspark example code:

```python
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)
rdd = (sc.textFile(hdfs_files)
       .map(insert_to_kairos)
       .count())
```
• 20. pyspark example code:

```python
from json import loads
from functools import partial

import timevault as tv
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)
rdd = (sc.textFile(tv.conf.hdfs_files)
       .map(loads)
       .flatMap(tv.flatten_message)
       .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
       .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
       .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
       .count())
```
• 21. The end result:

```python
from pandas import DataFrame, to_datetime

series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))
```
• 22. Building models: The first naïve model is simply the mean and standard deviation across all time. (Plot legend: blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit.)
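The slides don't show code for this first model; a minimal sketch of it, assuming a DataFrame with a `conns` column like the one built above (the function name and `stdlimit` parameter are illustrative, not the talk's exact code):

```python
def mean_std_outlier(tsdf, stdlimit=5):
    """Flag rows whose 'conns' value sits more than `stdlimit`
    standard deviations from the all-time mean."""
    mean = tsdf["conns"].mean()
    std = tsdf["conns"].std()
    tsdf["conns_outlier"] = (tsdf["conns"] - mean).abs() > stdlimit * std
    return tsdf
```

Because the mean and deviation are computed over the whole history, a single global threshold is applied everywhere, which is exactly why this model struggles with daily and weekly cycles.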
• 23. Building models: The second, slightly less naïve model fits a sine curve to the whole series. (Plot legend: blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit.)
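A sketch of how this second model could flag outliers, reusing the sine functional form that the classification slides below fit with `scipy.optimize.leastsq` (the function name and threshold logic here are my assumptions, not the talk's exact code):

```python
import numpy as np
from scipy.optimize import leastsq

def sine_fit_outlier(t, y, stdlimit=5):
    """Fit a sine with a one-day period to the whole series and flag
    points whose residual exceeds `stdlimit` residual standard deviations.

    t: timestamps in seconds, y: metric values (numpy arrays).
    """
    day = 24 * 3600.0
    def fitfunc(p, x):
        return p[0] * (1 - p[1] * np.sin(2 * np.pi / day * (x + p[2])))
    def residuals(p, y, x):
        return y - fitfunc(p, x)
    p0 = np.array([y.mean(), 1.0, 0.0])          # level, amplitude, phase
    plsq, _ = leastsq(residuals, p0, args=(y, t))
    resid = y - fitfunc(plsq, t)
    return np.abs(resid) > stdlimit * resid.std()
```

A single global sine still assumes every day looks the same, which motivates the weekly/daily classification on the next slides.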
• 24. Classification: Both naïve models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately: Does this metric show a weekly pattern (i.e. different behavior on weekends versus weekdays)? Does this metric show a daily pattern?
• 25. Classification (weekly): Fit a sine curve to the weekday and weekend periods. The ratio of the levels of those fits determines whether weekdays will be divided from weekends.
• 26. Classification (weekly):

```python
import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x):
    return (p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2]))))

def residuals(p, y, x):
    return y - fitfunc(p, x)

def fit(tsdf):
    tsgb = tsdf.groupby(tsdf.timeofday).mean()
    p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(residuals, p0,
                        args=(tsgb["conns"], np.array(tsgb.index)))
    return plsq
```
• 27. Classification (weekly):

```python
import pandas as pd

def weekend_ratio(tsdf):
    tsdf["weekday"] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
    tsdf["timeofday"] = (tsdf.index.second + tsdf.index.minute * 60 +
                         tsdf.index.hour * 3600)
    wdayplsq = fit(tsdf[tsdf.weekday == 1])
    wendplsq = fit(tsdf[tsdf.weekday == 0])
    return wendplsq[0] / wdayplsq[0]
```

(Ratios between the cutoff and 1/cutoff indicate no weekly variation.)
• 29. Classification (daily): Take a Fourier transform of the time series and inspect the bins associated with a frequency of one day. Use the ratio of those bins to the first bin (the constant, or DC, component) to classify the time series.
• 30. Classification (daily): Time series on weekdays, showing a strong daily pattern. Fourier transform with the bins around the day frequency highlighted.
• 31. Classification (daily): Time series on weekends, showing no daily pattern. Fourier transform with the bins around the day frequency highlighted.
• 32. Classification (daily): Find the bin associated with the frequency of a day:

```python
import numpy as np

def daily_ratio(tsdf):
    nbins = len(tsdf)
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds
    deltaf = 1.0 / (nbins * deltat)
    daybin = int((1.0 / (24 * 3600)) / deltaf)
    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    return np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
```
• 33. ewma: Exponentially weighted moving average. In pandas the decay parameter is specified as a span, s, related to the decay factor α by α = 2 / (s + 1). A normal EWMA analysis is done when the metric shows no daily pattern; a stacked EWMA analysis is done when there is a daily pattern.
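To make the span/α relation concrete, here is a hand-rolled EWMA recurrence (this is the unadjusted, recursive form; pandas' default adjusted weighting differs slightly at the start of a series, and the function name is mine):

```python
import numpy as np

def ewma(values, span):
    """EWMA via the recurrence y[t] = alpha*x[t] + (1 - alpha)*y[t-1],
    with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(values))
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out
```

A larger span means a smaller α, so the average forgets old values more slowly; span=1 gives α=1, which just tracks the input.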
• 34. ewma (normal):

```python
import pandas as pd

def ewma_outlier(tsdf, stdlimit=5, span=15):
    tsdf["conns_binpred"] = pd.ewma(tsdf["conns"], span=span).shift(1)
    tsdf["conns_binstd"] = pd.ewmstd(tsdf["conns"], span=span).shift(1)
    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                          tsdf["conns_binstd"])
    tsdf["conns_outlier"] = (tsdf["conns_stds"].abs() > stdlimit)
    return tsdf
```
• 35. ewma (normal): blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit.
• 36. ewma (normal): blue: actual response size; green: prediction window; red: actual value exceeded the standard deviation limit.
• 40. ewma (stacked): Shift the EWMA results by a day and overlay them on the original DataFrame.

```python
import pandas as pd

def stacked_outlier(tsdf, stdlimit=4, span=10):
    gbdf = tsdf.groupby("timeofday")["conns"]
    gbdf = pd.DataFrame({"conns_binpred": gbdf.apply(pd.ewma, span=span),
                         "conns_binstd": gbdf.apply(pd.ewmstd, span=span)})
    interval = tsdf.timeofday[1] - tsdf.timeofday[0]
    nshift = int(86400.0 / interval)  # number of bins in one day
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)
    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                          tsdf["conns_binstd"])
    tsdf["conns_outlier"] = (tsdf["conns_stds"].abs() > stdlimit)
    return tsdf
```
• 41. ewma (stacked): blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit.
• 42. arima: I am currently investigating ARIMA (autoregressive integrated moving average) models to make better predictions. I’m not convinced that this level of detail is necessary for the analysis I’m doing, but I wanted to highlight another cool scientific computing library that’s available.
• 43. arima:

```python
from statsmodels.tsa.arima_model import ARIMA

def arima_model_forecast(tsdf, p, d, q):
    arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
    forecast, stderr, conf_int = arima_model.forecast(1)
    tsdf["conns_binpred"][-1] = forecast[0]
    tsdf["conns_binstd"][-1] = stderr[0]
    return tsdf
```
• 44. arima (p = d = q = 1): blue: actual number of connections; green: prediction window; red: actual value exceeded the standard deviation limit.
• 45. Takeaways: Python provides simple and usable interfaces to most data handling projects. Combined, these interfaces can create a full data analysis pipeline, from collection to analysis.

### Editor's Notes

1. \( y = p_0 \left[ 1 - p_1 \sin\left( \frac{2\pi}{24 \cdot 3600} (x + p_2) \right) \right] \)