To understand and react to their security situation, many cybersecurity operations now use security information and event management (SIEM) software. Using a traditional SIEM in a large company such as Hewlett Packard Enterprise is a challenge due to the increasing volume and rate of data. We present the solution used to reduce the data volume processed by the SIEM using Spark Streaming, and the results obtained in processing one of the largest data feeds in HPE: firewall logs. Testing SIEM rules the traditional way is time-consuming: usually it is necessary to wait a day to get results and statistics for one day of production data. We draft an alternative approach to building a SIEM using Spark and other big data technologies, and present results of "fast forward" processing of production data snapshots. HPE is the target of sophisticated, well-crafted attacks, and the deployed cyber security tools are not able to detect all of them. We describe a simple application, built using Spark MLlib and trained on company-specific data, for detecting malicious trending domains. Takeaways: Spark Streaming can be used to pre-process cybersecurity data and reduce its volume for further processing; Spark MLlib can be used to add detection capability for specific use cases.
In this presentation, we will share how Hewlett Packard Enterprise has implemented Apache Spark to deal with three main cyber security use cases:
1) Using Spark to help Security information and event management (SIEM) process an increasing amount of data
2) Using Spark to test SIEM rules by “fast forward” processing of production data snapshots.
3) Implementing machine learning to add an additional detection capability
3. Agenda
• Introduction
• Challenges in Cyber Security
• Using Spark to help process an increasing amount of data
– Offloading current applications
– Replacing current applications by Big Data technologies
• Adding additional detection capabilities by Machine Learning
– Machine Learning Introduction
– Use Cases
– High level architecture
– Lessons learned
• Q&A
3#UnifiedDataAnalytics #SparkAISummit
4. Introduction - Team
[Diagram: data sources (network traffic logs, user actions, vulnerabilities, risk and governance, advanced threats) feed a Big Data Platform that delivers actionable intelligence to the Global Cyber Security Fusion Center Data Science Team and the Cyber Security Operation Center.]
5. Introduction - SIEM
SIEM - security information and event management
Security Event Manager (SEM): generates alerts based on predefined rules and input events.
Security Information Manager (SIM): stores relevant cyber security data and allows querying to get context data.
[Diagram: incoming events pass through aggregation, filtering and enriching; the SEM produces alerts and the SIM answers query/context requests, both serving security analysts.]
6. Challenges in Cyber Security
• Scalability and performance
– Increasing amount of data: according to Gartner, 25K EPS is enterprise scale, but large organizations see several hundred thousand EPS.
– Limited storage for historical data.
– Long query response time.
– IoT makes the situation even worse.
• Quickly evolving requirements
• Lack of qualified and skilled professionals
7. Using Spark to help process an increasing amount of data
8. Big Data Processing - Offloading current applications
• Offload of aggregation, filtering and enriching
• Offload of storage and querying
[Diagram: the SIEM's aggregation/filtering/enriching stage and the SIM's storage and querying are offloaded to big data storage, with an API and UI providing query/context to security analysts alongside the SEM's alerts.]
9. Big Data Processing – high level
[Diagram: a NetFlow collector and a syslog collector feed distributed batch and streaming processing (deduplication, filtering, aggregation, enriching); results land in HDFS, a columnar store and an in-memory data grid, and reduced NetFlow/syslog streams are forwarded to the SIEM.]
11. Big Data Processing
Firewall logs aggregation
• A highly available load balancer sends syslog events to live collectors (custom build).
• The syslog collector sends syslog events to Kafka (custom build).
12. Big Data Processing
Firewall logs aggregation
• The Firewall Aggregation job (5 sec. streaming job) aggregates events using DStream.reduceByKey.
• DNS enrichment adds DNS names using DHCP and DNS logs.
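The aggregation step can be sketched in plain Python: the production job passes an equivalent key and reduce function to DStream.reduceByKey, while the field names and counters below are assumptions for illustration.

```python
# Hypothetical aggregation key and reduce function, mirroring what a
# firewall-log job would pass to DStream.reduceByKey (field names are
# illustrative, not HPE's production schema).
def agg_key(event):
    # Events sharing source, destination, port and action within one
    # micro-batch collapse into a single aggregated event.
    return (event["src_ip"], event["dst_ip"], event["dst_port"], event["action"])

def agg_reduce(a, b):
    # Merge two events with the same key: sum counters, widen the time window.
    return {
        **a,
        "bytes": a["bytes"] + b["bytes"],
        "count": a["count"] + b["count"],
        "first_seen": min(a["first_seen"], b["first_seen"]),
        "last_seen": max(a["last_seen"], b["last_seen"]),
    }

def aggregate(events):
    # Pure-Python stand-in for reduceByKey over one 5-second micro-batch.
    groups = {}
    for e in events:
        k = agg_key(e)
        groups[k] = agg_reduce(groups[k], e) if k in groups else e
    return list(groups.values())

events = [
    {"src_ip": "10.0.0.1", "dst_ip": "8.8.8.8", "dst_port": 53, "action": "allow",
     "bytes": 120, "count": 1, "first_seen": 1, "last_seen": 1},
    {"src_ip": "10.0.0.1", "dst_ip": "8.8.8.8", "dst_port": 53, "action": "allow",
     "bytes": 80, "count": 1, "first_seen": 2, "last_seen": 2},
    {"src_ip": "10.0.0.2", "dst_ip": "8.8.4.4", "dst_port": 53, "action": "deny",
     "bytes": 60, "count": 1, "first_seen": 1, "last_seen": 1},
]
aggregated = aggregate(events)
```

Because repeated flows collapse into one record per key and micro-batch, this is where the roughly 2x reduction in event volume comes from.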
14. Big Data Processing
Firewall logs aggregation
• The Columnar Store Loader (5 sec. streaming job) loads aggregated events into the columnar store.
• The columnar store offloads storage and querying.
15. Big Data Processing
Firewall logs aggregation
• Environment
  – Inputs of 65,000 EPS and 32,000 EPS
  – 5 sec micro-batches (Spark Streaming)
  – 24 executors x 11 cores each on a non-dedicated, heavily utilized Hortonworks cluster
• Results
  – The number of events is reduced by half
  – Query times are reduced to seconds
16. SIEM functionality using Big Data technology
Microservices based on big data technologies implement SIEM functionality:
• Easy to add/modify functionality
• Design driven by users
• Easier integration with processes
[Diagram: events flow into microservices (MS) coordinated by an orchestration service on top of big data storage; security analysts receive alerts and obtain query/context through an API/UI.]
17. SIEM functionality using Big Data technology
Rule development and testing similar to software testing:
• Similar process and tools (Jira, Git, etc.)
• Tools: Spark, in-memory data grid
• Preliminary results:
  – 15 - 20 minutes to test a rule on 24h of data (2B events) with 24 executors
  – Linearly scalable
Workflow: Rule Development → Unit Testing → Fast Forward Testing with Production Sample → Production Deployment
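Treating a rule like code means it can be unit-tested on a hand-crafted sample before being "fast forwarded" over a production snapshot. A minimal sketch, assuming a hypothetical brute-force-login rule (the rule logic and threshold are invented, not HPE's):

```python
# Hypothetical detection rule written as a plain function, so it can be
# unit-tested locally and then replayed over a production data sample.
def brute_force_rule(events, threshold=5):
    """Alert on sources with more than `threshold` failed logins."""
    failures = {}
    for e in events:
        if e["type"] == "login_failure":
            failures[e["src"]] = failures.get(e["src"], 0) + 1
    return [src for src, n in failures.items() if n > threshold]

# Unit test on a small hand-crafted sample, exactly as one would test code.
sample = [{"type": "login_failure", "src": "10.0.0.9"}] * 6 \
       + [{"type": "login_success", "src": "10.0.0.1"}]
alerts = brute_force_rule(sample)
```

The same function can then run as a Spark batch job over a day of stored events to get the alert counts and statistics in minutes instead of waiting a day.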
19. Machine Learning - Introduction
[Plots: labeled points separated by a decision boundary (supervised learning) and unlabeled points grouped into clusters with outliers (unsupervised learning).]
Labeled data – supervised learning: we can find a function f and its parameters that fit the training data and can be used for classification and regression.
Unlabeled data – unsupervised learning: we can derive structure from data and find outliers.
20. Machine Learning - Supervised
[Diagram: a training algorithm consumes labeled training data and produces model parameters (a hypothesis); a classification/regression algorithm applies those parameters to new data to produce classification/regression results.]
Training: finding a function and its parameters that fit the training data.
21. Machine Learning – Example
• f: if x2 > (p0 + p1 * x1) then O else X
• Find parameters that minimize the number of wrongly classified data points (the cost function)

p0  | p1   | Cost
0.6 | 0    | 3
0.9 | -0.9 | 2
0.8 | -0.7 | 0

[Plots: the labeled training data with the candidate decision lines.]
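The parameter search above can be sketched as a brute-force grid scan; the labeled points below are invented for illustration, not the data behind the plot.

```python
# Classifier from the slide: predict "O" if x2 > p0 + p1 * x1, else "X".
# Hypothetical labeled training points (separable by the line x2 = 0.8 - 0.7*x1).
points = [
    (0.1, 0.9, "O"), (0.5, 0.7, "O"), (0.9, 0.4, "O"),
    (0.2, 0.3, "X"), (0.6, 0.2, "X"), (0.8, 0.1, "X"),
]

def cost(p0, p1):
    # Number of wrongly classified points (the cost function on the slide).
    wrong = 0
    for x1, x2, label in points:
        predicted = "O" if x2 > p0 + p1 * x1 else "X"
        wrong += predicted != label
    return wrong

# Scan a coarse grid and keep the parameters with the lowest cost.
candidates = [(cost(p0 / 10, p1 / 10), p0 / 10, p1 / 10)
              for p0 in range(0, 11) for p1 in range(-10, 1)]
best_cost, best_p0, best_p1 = min(candidates)
```

Real training algorithms minimize a differentiable cost with gradient methods rather than a grid scan, but the idea of searching for parameters that minimize misclassification is the same.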
22. Machine Learning - example
Classification: if x2 > (0.8 - 0.7 * x1) then O else X
[Plots: new unlabeled data on the left; the same points classified by the learned boundary on the right.]
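Applying the learned parameters is then a one-line check; a sketch using the boundary above on hypothetical new points:

```python
# Apply the learned decision boundary (p0 = 0.8, p1 = -0.7) to new data.
def classify(x1, x2):
    return "O" if x2 > 0.8 - 0.7 * x1 else "X"

# Hypothetical new, unlabeled points.
new_data = [(0.2, 0.9), (0.7, 0.1)]
labels = [classify(x1, x2) for x1, x2 in new_data]
```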
23. Machine Learning – Terminology
Precision = True Positives / (True Positives + False Positives) = proportion of selected items that are relevant
Recall = True Positives / (True Positives + False Negatives) = proportion of relevant items that were selected
Source: https://en.wikipedia.org/wiki/Precision_and_recall
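These definitions translate directly into code; the example counts below are the ones from the malware scenario worked through on the following slides.

```python
# Precision and recall from raw counts, matching the definitions above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Example: 99 true positives, 99 false positives, 1 false negative.
p = precision(99, 99)   # half of the flagged machines are really infected
r = recall(99, 1)       # almost all infected machines are flagged
```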
24. Machine Learning – Challenges
• Too many false positives
  – Precision of ~99% can be too low
• Data cleanliness
  – A wrong time on a device can be detected as an anomaly
• Missing labeled data
  – Hard to evaluate recall
25. Machine Learning – Challenges
• An ML algorithm for detecting a specific malware infection is 99% accurate on each class: 99% of infected machines are flagged (sensitivity) and 99% of clean machines are correctly passed (specificity).
• The infection is relatively rare: 1% of computers are infected.
What is the probability that a computer is really infected if it is classified as infected?
(99%, 91%, 50% or 1%)
Is 99% accuracy good enough?
26. Machine Learning – Challenges
Suppose there are 10,000 computers:
• 100 are infected
  – 99 infected are correctly classified as infected (true positives)
  – 1 infected is classified as not infected (false negative)
• 9,900 are clean
  – 99 are incorrectly classified as infected (false positives)
  – 9,801 are correctly classified as not infected (true negatives)
99 true positives + 99 false positives = 198 computers classified as infected, but only 99 are really infected, so the probability that a computer classified as infected is really infected is 50%.

Using Bayes' theorem:
P(infected | classified as infected) = P(classified as infected | infected) * P(infected) / P(classified as infected) = (0.99 * 0.01) / (0.99 * 0.01 + 0.01 * 0.99) = 0.5
27. Machine Learning – Challenges
• Usually a human should make the final assessment.
• Reasonable use cases:
  – High ratio of “infection”
  – Limited (selected) data

Classifier that is 99% accurate on each class:

infected computers [%] | really infected / classified as infected [%]
1.00%  | 50%
0.10%  | 9%
0.01%  | 1%
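The table values follow directly from Bayes' theorem; a short check, assuming 99% per-class accuracy as above:

```python
# Posterior probability of infection given a positive classification,
# for a classifier that is 99% accurate on both classes.
def posterior(base_rate, accuracy=0.99):
    tp = accuracy * base_rate                 # P(flagged and infected)
    fp = (1 - accuracy) * (1 - base_rate)     # P(flagged and clean)
    return tp / (tp + fp)

# Reproduce the table for decreasing base rates of infection.
rows = {rate: round(posterior(rate), 2) for rate in (0.01, 0.001, 0.0001)}
```

As the base rate drops, false positives swamp true positives, which is why rare-event detection with a fixed-accuracy classifier needs human triage.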
28. Machine Learning and Spark
• MLlib is Apache Spark's scalable machine learning library:
  – ML algorithms
  – ML workflow utilities (data → features, evaluation, persistence, ...)
• Several deep learning frameworks run on Spark:
  – Databricks – spark-deep-learning (Deep Learning Pipelines for Apache Spark)
  – Yahoo – TensorFlowOnSpark
  – Intel – BigDL
  – ...
29. Machine Learning Use Cases
Use Case | Data source | Features | Algorithm
Detect malicious URL | Web proxy log | Entropy, number of special chars, path length, URL length, contains org. domain out of position, has been seen, ... | Random Forest, Long Short-Term Memory
Generated (malicious) domain detection | DNS log | Domain string | Long Short-Term Memory
Classify server account activity | Active Directory log | Network distance, organization distance, time distance | Naïve Bayes, Random Forest
30. Machine Learning Use Cases
Use Case | Data source | Features | Algorithm
Detect command and control communication | NetFlow data | Duration of TCP/IP session, cardinality, octets/packet, etc. | Naïve Bayes, Random Forest
31. Machine Learning - Architecture
[Diagram: a Spark MLlib batch job runs a feature extractor over training data and feeds a training algorithm; the resulting model parameters are stored in HDFS.]
parameters
32. Spark
MLlib
Batch or Streaming Job
Machine Learning - Architecture
32#UnifiedDataAnalytics #SparkAISummit
Feature
extractor
New Data
Algorithm
HDFS Model
parameters
Classification
Classified
data
33. Machine Learning - Lessons Learned
• Do not implement ML just to be able to claim “we are using ML”
• Have good use cases, including precision and recall requirements
• Visualization can be more useful than ML in some cases
• In most cases it is necessary to have an analyst validate a detection
• Cyber security analysts like to see the reasoning (why the classifier decided something is malicious)
34. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT