Endgame Proprietary
Time Series Analysis for Network Security
Phil Roth
Data Scientist @ Endgame
mrphilroth.com
First, an introduction. My history of Python
scientific computing, in function calls:
os.path.walk
Physics Undergraduate @ PSU
AMANDA Neutrino Telescope
pylab.plot
Physics Graduate Student @ UMD
IceCube Neutrino Telescope
numpy.fft.fft
Radar Scientist @ User Systems, Inc.
Various Radar Simulations
pandas.io.parsers.read_csv
Side Projects
Scraping data from the web
sklearn.linear_model.LogisticRegression
Side Projects
Machine learning competitions
(the rest of this talk…)
Data Scientist @ Endgame
Time Series Anomaly Detection
Problem:
Highlight when recorded metrics deviate from
normal patterns.
for example: a high number of connections might be an
indication of a brute force attack
for example: a large volume of outgoing data might be an
indication of an exfiltration event
Solution:
Build a system that can track and store
historical records of any metric. Develop an
algorithm that will detect irregular behavior
with minimal false positives.
Gathering Data
kairos
kafka-python
pyspark
Building Models
classification
ewma
arima
Data flow (diagram):
Data Sources
Kafka Topics: distributed message passing
Redis: in-memory key-value data store
HDFS: large-scale distributed data store
Flow labels: real-time stream; batch historical
kairos
A Python interface to backend storage databases
(Redis in my case, others available), tailored for time
series storage.
It takes care of expiring data and supports different types of
time series (series, histogram, count, gauge, set).
Open sourced by Agora Games.
https://github.com/agoragames/kairos
kairos
Example code:
    from redis import Redis
    from kairos import Timeseries

    # step = bucket width in seconds, steps = number of buckets kept
    intervals = {"days": {"step": 60, "steps": 2880},
                 "months": {"step": 1800, "steps": 4032}}
    rclient = Redis("localhost", 6379)
    ktseries = Timeseries(rclient, type="histogram", intervals=intervals)
    # metric_name, metric_value and timestamp come from the incoming event
    ktseries.insert(metric_name, metric_value, timestamp)
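The two intervals above work out to concrete retention windows. A minimal sketch of that arithmetic, assuming `step` is the bucket width in seconds and `steps` is the number of buckets kept; `retention_seconds` and `bucket_start` are hypothetical helpers written for this note, not part of kairos:

```python
# Hypothetical helpers (not kairos API) illustrating the step/steps arithmetic.
intervals = {"days": {"step": 60, "steps": 2880},
             "months": {"step": 1800, "steps": 4032}}

def retention_seconds(interval):
    """Total time span covered by one interval configuration."""
    return interval["step"] * interval["steps"]

def bucket_start(timestamp, step):
    """Start of the bucket that a Unix timestamp falls into."""
    return int(timestamp) - int(timestamp) % step

# Retention expressed in days for each configured interval.
retention_days = {name: retention_seconds(cfg) / 86400.0
                  for name, cfg in intervals.items()}
print(retention_days)
```

Under that reading, "days" keeps 2 days of one-minute buckets and "months" keeps 84 days of half-hour buckets.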
kafka-python
A Python interface to Apache Kafka, where Kafka is
publish-subscribe messaging rethought as a
distributed commit log.
Allows me to subscribe to the events as they come in
real time.
https://github.com/mumrah/kafka-python
kafka-python
Example code:
    from kafka.client import KafkaClient
    from kafka.consumer import SimpleConsumer

    kclient = KafkaClient("localhost:9092")
    # arguments: client, consumer group, topic
    kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")
    for message in kconsumer:
        insert_to_kairos(message)
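The slides do not show `insert_to_kairos`. The sketch below is one guess at its shape: decode a JSON message and hand the metric to a time series object. The field names ("metric", "value", "ts"), the extra `series` argument, and the `FakeSeries` stand-in are all assumptions made for illustration.

```python
import json

class FakeSeries:
    """Stand-in for a kairos Timeseries; records inserts in a list."""
    def __init__(self):
        self.rows = []

    def insert(self, name, value, timestamp):
        self.rows.append((name, value, timestamp))

def insert_to_kairos(raw_message, series):
    """Decode one JSON message and store its metric.

    The message schema here is hypothetical; the real one is not
    shown in the slides.
    """
    record = json.loads(raw_message)
    series.insert(record["metric"], record["value"], record["ts"])

store = FakeSeries()
insert_to_kairos('{"metric": "conns", "value": 17, "ts": 1400000000}', store)
print(store.rows)
```

Passing the series in explicitly keeps the function testable without a running Redis.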
pyspark
A Python interface to Apache Spark, where Spark is a
fast and general engine for large scale data
processing.
Allows me to fill in historical data to the time series
when I add or modify metrics.
http://spark.apache.org/
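pyspark
Example code:
    from pyspark import SparkContext, SparkConf

    # "localhost" is kept from the slide; Spark master URLs normally look
    # like "local[*]" or "spark://host:7077"
    spark_conf = (SparkConf()
                  .setMaster("localhost")
                  .setAppName("timevault-update"))
    sc = SparkContext(conf=spark_conf)
    # map() is lazy; count() forces every record through insert_to_kairos
    rdd = (sc.textFile(hdfs_files)
           .map(insert_to_kairos)
           .count())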
pyspark
Example code:
    from json import loads
    from functools import partial

    import timevault as tv
    from pyspark import SparkContext, SparkConf

    spark_conf = (SparkConf()
                  .setMaster("localhost")
                  .setAppName("timevault-update"))
    sc = SparkContext(conf=spark_conf)
    rdd = (sc.textFile(tv.conf.hdfs_files)
           .map(loads)
           .flatMap(tv.flatten_message)
           .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
           # keep only tuples whose third element is below limit_time
           .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
           .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
           .count())
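The same map / flatMap / filter chain can be followed without a cluster. This is a plain-Python sketch of the pipeline's shape only; `flatten_message`, `emit_metrics`, and the message layout are hypothetical stand-ins for the `timevault` helpers, which the slides do not show.

```python
import json
from functools import partial
from itertools import chain

def flatten_message(msg):
    """Yield one (host, field, value, timestamp) tuple per field in a message."""
    for field, value in msg["fields"].items():
        yield (msg["host"], field, value, msg["ts"])

def emit_metrics(row, metrics):
    """Yield (metric_name, value, timestamp) for the fields being tracked."""
    host, field, value, ts = row
    if field in metrics:
        yield ("%s.%s" % (host, field), value, ts)

lines = [
    '{"host": "a", "ts": 100, "fields": {"conns": 3, "bytes": 900}}',
    '{"host": "b", "ts": 250, "fields": {"conns": 7}}',
]
limit_time = 200.0

messages = map(json.loads, lines)                            # .map(loads)
rows = chain.from_iterable(map(flatten_message, messages))   # .flatMap(...)
emit = partial(emit_metrics, metrics={"conns"})
metrics = chain.from_iterable(map(emit, rows))               # .flatMap(...)
kept = [tup for tup in metrics if tup[2] < limit_time]       # .filter(...)
print(kept)
```

Only the "a.conns" tuple survives: the "bytes" field is not a tracked metric, and host "b" reports after the time limit.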
the end result
    from pandas import DataFrame, to_datetime

    series = ktseries.series(metric_name, "months", transform=transform)
    ts, fields = zip(*series.items())
    df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))
building models
The first naïve model is simply the mean and standard
deviation across all time.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
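A minimal sketch of this first model, assuming NumPy; the function name and the limit of 3 standard deviations are choices made for this example, not values from the talk:

```python
import numpy as np

def global_outliers(values, stdlimit=3.0):
    """Flag points more than stdlimit standard deviations from the global mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return np.abs(values - mean) > stdlimit * std

# Twenty quiet samples followed by one burst of connections.
conns = np.array([10.0] * 20 + [100.0])
flags = global_outliers(conns)
print(flags)
```

Because the mean and standard deviation come from the whole history, a large spike also inflates the limit it is measured against, which is one reason this model is only a starting point.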
building models
The second, slightly less naïve model fits a sine curve
to the whole series.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
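One way to sketch a whole-series sine fit is to fix the period at one day and solve a linear least-squares problem for the level, sine, and cosine coefficients. This is a swapped-in formulation chosen because it needs only NumPy; it is not necessarily how the model on this slide was fitted.

```python
import numpy as np

DAY = 24 * 3600

def fit_daily_sine(t, y):
    """Least-squares fit of y = a + b*sin(wt) + c*cos(wt) with a one-day period."""
    w = 2 * np.pi / DAY
    design = np.column_stack([np.ones_like(t, dtype=float),
                              np.sin(w * t), np.cos(w * t)])
    coeffs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs, design @ coeffs

t = np.arange(0, 7 * DAY, 600)             # one week sampled every 10 minutes
y = 50 + 20 * np.sin(2 * np.pi * t / DAY)  # synthetic series with a daily cycle
coeffs, predicted = fit_daily_sine(t, y)
residual_std = np.std(y - predicted)       # spread to compare new points against
print(coeffs, residual_std)
```

The sine and cosine terms together are equivalent to a single sine with a free amplitude and phase, so no nonlinear optimizer is needed.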
classification
Both naïve models left a lot to be desired. Two simple
classifications help us treat different types of
time series appropriately:
Does this metric show a weekly pattern (i.e., different
behavior on weekends versus weekdays)?
Does this metric show a daily pattern?
classification weekly
Fit a sine curve to the weekday and weekend periods.
Use the ratio of the levels of those fits to determine whether
weekdays should be treated separately from weekends.
classification weekly
    import numpy as np
    from scipy.optimize import leastsq

    def fitfunc(p, x):
        # p[0]: level, p[1]: relative amplitude, p[2]: phase offset in seconds
        return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2])))

    def residuals(p, y, x):
        return y - fitfunc(p, x)

    def fit(tsdf):
        # average the series by time of day, then fit the daily sine
        tsgb = tsdf.groupby(tsdf.timeofday).mean()
        p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
        plsq, suc = leastsq(residuals, p0, args=(tsgb["conns"],
                                                 np.array(tsgb.index)))
        return plsq
classification weekly
    import pandas as pd

    def weekend_ratio(tsdf):
        tsdf["weekday"] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
        tsdf["timeofday"] = (tsdf.index.second + tsdf.index.minute * 60 +
                             tsdf.index.hour * 3600)
        wdayplsq = fit(tsdf[tsdf.weekday == 1])
        wendplsq = fit(tsdf[tsdf.weekday == 0])
        # ratio of the fitted weekend level to the fitted weekday level
        return wendplsq[0] / wdayplsq[0]
[Figure: ratio axis marked 0, cutoff, 1, and 1 / cutoff, labeled "No weekly variation."]
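One reading of the axis on this slide is that a ratio between cutoff and 1 / cutoff counts as "no weekly variation", and anything outside that band counts as a weekly pattern. A small sketch under that assumption; the function name and the cutoff of 0.5 are placeholders, not values from the talk:

```python
def has_weekly_pattern(ratio, cutoff=0.5):
    """True when the weekend/weekday level ratio falls outside [cutoff, 1/cutoff].

    The default cutoff is a placeholder chosen for this sketch.
    """
    if not 0 < cutoff < 1:
        raise ValueError("cutoff must lie strictly between 0 and 1")
    return ratio < cutoff or ratio > 1.0 / cutoff

# Weekend much quieter, weekend about the same, weekend much busier.
examples = {ratio: has_weekly_pattern(ratio) for ratio in (0.2, 0.9, 3.0)}
print(examples)
```

Using cutoff and its reciprocal makes the test symmetric: a metric that is busier on weekends is treated the same way as one that is quieter.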
classification weekly
[Figures: one time series labeled "Weekly pattern." and one labeled "No weekly pattern."]
classification daily
Take a Fourier transform of the time series, and inspect
the bins associated with a frequency of one cycle per day.
Use the ratio of those bins to the first bin (the constant, or DC,
component) to classify the time series.
classification daily
Time series on weekdays, shown with a strong daily pattern.
Fourier transform with the bins around the day frequency highlighted.
classification daily
Time series on weekends, shown with no daily pattern.
Fourier transform with the bins around the day frequency highlighted.
classification daily
Find the bin associated with the frequency of a day using:
    import numpy as np

    def daily_ratio(tsdf):
        nbins = len(tsdf)
        deltat = (tsdf.index[1] - tsdf.index[0]).seconds   # sample spacing, s
        deltaf = 1.0 / (nbins * deltat)                    # frequency resolution, Hz
        daybin = int((1.0 / (24 * 3600)) / deltaf)         # bin of one cycle per day
        rfft = np.abs(np.fft.rfft(tsdf["conns"]))
        # sum the three bins around the daily frequency, relative to the DC bin
        return np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
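To see the ratio separate the two cases, here is a sketch on synthetic data, assuming NumPy arrays instead of a DataFrame. With N samples spaced Δt seconds apart, the daily bin is N·Δt / 86400. This version rounds the bin index rather than truncating it, which is a small departure from the slide made to avoid floating-point truncation.

```python
import numpy as np

DAY = 24 * 3600

def daily_ratio(values, deltat):
    """Ratio of spectral magnitude near one cycle per day to the DC component."""
    nbins = len(values)
    deltaf = 1.0 / (nbins * deltat)            # frequency resolution in Hz
    daybin = int(round((1.0 / DAY) / deltaf))  # bin index of the daily frequency
    spectrum = np.abs(np.fft.rfft(values))
    return spectrum[daybin - 1:daybin + 2].sum() / spectrum[0]

deltat = 600                                   # 10-minute samples
t = np.arange(0, 14 * DAY, deltat)             # two weeks of data
daily = 100 + 40 * np.sin(2 * np.pi * t / DAY) # strong daily cycle
flat = np.full(t.shape, 100.0)                 # no daily cycle

ratio_daily = daily_ratio(daily, deltat)
ratio_flat = daily_ratio(flat, deltat)
print(ratio_daily, ratio_flat)
```

Two weeks of 10-minute samples give 2016 points, so the daily frequency lands in bin 14. The sinusoidal series scores about half its relative amplitude, and the flat series scores essentially zero.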
ewma
Exponentially weighted moving average:
The decay parameter is specified as a span, s, in
pandas, related to α by:
α = 2 / (s + 1)
A normal EWMA analysis is done when the metric
shows no daily pattern. A stacked EWMA analysis is
done when there is a daily pattern.
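The formula image on this slide did not survive extraction. A commonly used recursive form of the EWMA is given here as an assumption about what was shown; x_t is the observed value and ŷ_t the smoothed estimate:

```latex
\hat{y}_t = \alpha\, x_t + (1 - \alpha)\, \hat{y}_{t-1},
\qquad \alpha = \frac{2}{s + 1}
```

For example, a span of s = 15 gives α = 2 / 16 = 0.125, and a span of s = 10 gives α = 2 / 11 ≈ 0.18.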
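ewma normal
    import pandas as pd

    # pd.ewma and pd.ewmstd are the pandas API of the time; later releases
    # spell these Series.ewm(span=span).mean() and .std()
    def ewma_outlier(tsdf, stdlimit=5, span=15):
        # shift(1) so each point is compared with a prediction from earlier data
        tsdf["conns_binpred"] = pd.ewma(tsdf["conns"], span=span).shift(1)
        tsdf["conns_binstd"] = pd.ewmstd(tsdf["conns"], span=span).shift(1)
        tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                              tsdf["conns_binstd"])
        tsdf["conns_outlier"] = (tsdf["conns_stds"].abs() > stdlimit)
        return tsdf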
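The same analysis can be sketched with the `Series.ewm` interface found in current pandas releases. The synthetic data and the injected spike are made up for illustration; the column names follow the slide.

```python
import numpy as np
import pandas as pd

def ewma_outlier(tsdf, stdlimit=5, span=15):
    """Flag points far from the previous EWMA, measured in EW standard deviations."""
    ewm = tsdf["conns"].ewm(span=span)
    out = tsdf.copy()
    out["conns_binpred"] = ewm.mean().shift(1)   # prediction uses past data only
    out["conns_binstd"] = ewm.std().shift(1)
    out["conns_stds"] = (out["conns"] - out["conns_binpred"]) / out["conns_binstd"]
    out["conns_outlier"] = out["conns_stds"].abs() > stdlimit
    return out

index = pd.date_range("2014-01-01", periods=200, freq="min")
rng = np.random.default_rng(0)
conns = 100 + rng.normal(0, 2, size=200)         # steady traffic with small noise
conns[150] = 200                                 # injected spike
result = ewma_outlier(pd.DataFrame({"conns": conns}, index=index))
print(result.loc[result["conns_outlier"], ["conns", "conns_stds"]])
```

The first row has no prediction because nothing precedes it, and very early rows can be noisy while the moving standard deviation is still based on a handful of points.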
ewma normal
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
ewma normal
blue: actual response size
green: prediction window
red: actual value exceeded standard deviation limit
ewma stacked
[Three figure-only slides.]
ewma stacked
Shift the EWMA results by a day and overlay them on the
original DataFrame.
    import pandas as pd

    def stacked_outlier(tsdf, stdlimit=4, span=10):
        # EWMA across days within each time-of-day bin (pandas API of the time)
        gbdf = tsdf.groupby("timeofday")["conns"]
        gbdf = pd.DataFrame({"conns_binpred": gbdf.apply(pd.ewma, span=span),
                             "conns_binstd": gbdf.apply(pd.ewmstd, span=span)})
        # number of samples per day, so the shift moves results forward one day
        interval = tsdf.timeofday.iloc[1] - tsdf.timeofday.iloc[0]
        nshift = int(86400.0 / interval)
        gbdf = gbdf.shift(nshift)
        tsdf = gbdf.combine_first(tsdf)
        tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                              tsdf["conns_binstd"])
        tsdf["conns_outlier"] = (tsdf["conns_stds"].abs() > stdlimit)
        return tsdf
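A sketch of the stacked idea with current pandas. Instead of shifting the whole frame by one day's worth of rows, it shifts by one position inside each time-of-day group, which amounts to the same thing when the sampling is regular. That substitution, and the synthetic data, are assumptions of this example.

```python
import numpy as np
import pandas as pd

def stacked_outlier(tsdf, stdlimit=4, span=10):
    """EWMA across days at the same time of day, predicting from earlier days only."""
    out = tsdf.copy()
    out["timeofday"] = (out.index.hour * 3600 + out.index.minute * 60
                        + out.index.second)
    grouped = out.groupby("timeofday")["conns"]
    # Within a time-of-day group, shift(1) moves each value forward one day.
    out["conns_binpred"] = grouped.transform(
        lambda s: s.ewm(span=span).mean().shift(1))
    out["conns_binstd"] = grouped.transform(
        lambda s: s.ewm(span=span).std().shift(1))
    out["conns_stds"] = (out["conns"] - out["conns_binpred"]) / out["conns_binstd"]
    out["conns_outlier"] = out["conns_stds"].abs() > stdlimit
    return out

hours = np.arange(10 * 24)                       # ten days of hourly samples
index = pd.date_range("2014-01-01", periods=hours.size, freq="60min")
rng = np.random.default_rng(1)
conns = (100 + 50 * np.sin(2 * np.pi * hours / 24)
         + rng.normal(0, 1, size=hours.size))
conns[9 * 24 + 5] += 60                          # spike on the last day at 05:00
result = stacked_outlier(pd.DataFrame({"conns": conns}, index=index))
print(result.loc[result["conns_outlier"], ["conns", "conns_stds"]])
```

Because each hour is only compared with the same hour on earlier days, the large daily swing does not count against the metric; only the injected spike stands out against its own history.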
ewma stacked
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
arima
I am currently investigating using ARIMA
(autoregressive integrated moving average) models to
make better predictions.
I’m not convinced that this level of detail is necessary
for the analysis I’m doing, but I wanted to highlight
another cool scientific computing library that’s
available.
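arima
    # legacy statsmodels API; newer releases provide
    # statsmodels.tsa.arima.model.ARIMA instead
    from statsmodels.tsa.arima_model import ARIMA

    def arima_model_forecast(tsdf, p, d, q):
        # fit on everything except the last point, then forecast that point
        arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
        forecast, stderr, conf_int = arima_model.forecast(1)
        tsdf.loc[tsdf.index[-1], "conns_binpred"] = forecast[0]
        tsdf.loc[tsdf.index[-1], "conns_binstd"] = stderr[0]
        return tsdf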
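To make the "AR" and "I" parts concrete without statsmodels, here is a deliberately simplified sketch: difference the series once, fit a single autoregressive coefficient by least squares, and forecast one step. This corresponds to ARIMA(1, 1, 0) with no constant, so it omits the moving-average term that p = d = q = 1 would include. The function name is hypothetical.

```python
import numpy as np

def ar1_diff_forecast(values):
    """One-step forecast from an AR(1) model fitted to first differences.

    Returns (forecast, stderr), where stderr is the standard deviation
    of the fit residuals on the differenced series.
    """
    values = np.asarray(values, dtype=float)
    diffs = np.diff(values)                      # the "I" step with d = 1
    x, y = diffs[:-1], diffs[1:]
    phi = np.dot(x, y) / np.dot(x, x)            # least-squares AR(1) coefficient
    stderr = np.std(y - phi * x)
    forecast = values[-1] + phi * diffs[-1]      # undo the differencing
    return forecast, stderr

series = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
forecast, stderr = ar1_diff_forecast(series)
print(forecast, stderr)
```

On this doubling series each difference is twice the previous one, so the fitted coefficient is 2 and the next value is forecast as 64 with zero residual spread.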
arima
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
p = d = q = 1
takeaways
Python provides simple and usable interfaces to most
data handling projects.
Combined, these interfaces can create a full data
analysis pipeline from collection to analysis.
© 2014 Endgame

Editor's Notes

  1. $y = p_0 \left[ 1 - p_1 \sin\left( \frac{2\pi}{24 \cdot 3600} \left( x - p_2 \right) \right) \right]$