To understand and react to their security situation, many cybersecurity operations now use security information and event management (SIEM) software. Using a traditional SIEM in a large company such as Hewlett Packard Enterprise is a challenge due to the increasing volume and rate of data. We present the solution used to reduce the data volume processed by the SIEM using Spark Streaming, and the results obtained in processing one of the largest data feeds in HPE: firewall logs. Testing SIEM rules the traditional way is time-consuming: usually it is necessary to wait a day to get results and statistics for one day of production data. We draft an alternative approach to building a SIEM using Spark and other big data technologies, and present results of "fast forward" processing of production data snapshots. HPE is the target of sophisticated, well-crafted attacks, and the deployed cyber security tools are not able to detect all of them. We describe a simple application, built using Spark MLlib and trained on company-specific data, for detecting malicious trending domains. Takeaways: Spark Streaming can be used to pre-process cybersecurity data and reduce its volume for further processing; Spark MLlib can be used to add detection capability for specific use cases.
In this presentation, we will share how Hewlett Packard Enterprise has implemented Apache Spark to deal with three main cyber security use cases:
1) Using Spark to help Security information and event management (SIEM) process an increasing amount of data
2) Using Spark to test SIEM rules by “fast forward” processing of production data snapshots.
3) Implementing machine learning to add an additional detection capability
3. Agenda
• Introduction
• Challenges in Cyber Security
• Using Spark to help process an increasing amount of data
– Offloading current applications
– Replacing current applications by Big Data technologies
• Adding additional detection capabilities by Machine Learning
– Machine Learning Introduction
– Use Cases
– High level architecture
– Lessons learned
• Q&A
3#UnifiedDataAnalytics #SparkAISummit
4. Introduction - Team
[Diagram: data sources (network traffic logs, user actions, vulnerabilities, risk and governance, advanced threats) feed a Big Data Platform that delivers actionable intelligence to the Global Cyber Security Fusion Center Data Science Team and the Cyber Security Operation Center.]
5. Introduction - SIEM
SIEM - security information and event management
Security Event Manager (SEM): generates alerts based on predefined rules and input events.
Security Information Manager (SIM): stores relevant cyber security data and allows querying to get context data.
[Diagram: incoming events pass through aggregation, filtering and enriching; the SEM produces alerts and the SIM answers query/context requests, both serving security analysts.]
6. Challenges in Cyber Security
• Scalability and performance
– Increasing amount of data: according to Gartner, 25K EPS is enterprise scale, but large organizations see several hundred thousand EPS.
– Limited storage for historical data.
– Long query response time.
– IoT makes the situation even worse.
• Quickly evolving requirements
• Lack of qualified and skilled professionals
7. Using Spark to help process an increasing amount of data
8. Big Data Processing - Offloading current applications
• Offload of aggregation, filtering and enriching
• Offload of storage and querying
[Diagram: the SIEM's aggregation/filtering/enriching stage and the SIM's storage and querying are offloaded to big data storage, with an API and UI providing query/context to security analysts alongside the SEM's alerts.]
9. Big Data Processing – high level
[Diagram: a NetFlow collector and a syslog collector feed distributed batch and streaming processing (deduplication, filtering, aggregation, enriching); results land in HDFS, a columnar store and an in-memory data grid, and reduced NetFlow/syslog streams are forwarded to the SIEM.]
11. Big Data Processing
Firewall logs aggregation
• A highly available load balancer sends syslog events to live collectors (custom build).
• The syslog collector sends syslog events to Kafka (custom build).
12. Big Data Processing
Firewall logs aggregation
• The Firewall Aggregation job (5 sec. streaming job) aggregates events using DStream.reduceByKey.
• DNS enrichment adds DNS names using DHCP and DNS logs.
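The aggregation step can be sketched in plain Python: the production job passes an equivalent key and reduce function to DStream.reduceByKey, while the field names and counters below are assumptions for illustration.

```python
# Hypothetical aggregation key and reduce function, mirroring what a
# firewall-log job would pass to DStream.reduceByKey (field names are
# illustrative, not HPE's production schema).
def agg_key(event):
    # Events sharing source, destination, port and action within one
    # micro-batch collapse into a single aggregated event.
    return (event["src_ip"], event["dst_ip"], event["dst_port"], event["action"])

def agg_reduce(a, b):
    # Merge two events with the same key: sum counters, widen the time window.
    return {
        **a,
        "bytes": a["bytes"] + b["bytes"],
        "count": a["count"] + b["count"],
        "first_seen": min(a["first_seen"], b["first_seen"]),
        "last_seen": max(a["last_seen"], b["last_seen"]),
    }

def aggregate(events):
    # Pure-Python stand-in for reduceByKey over one 5-second micro-batch.
    groups = {}
    for e in events:
        k = agg_key(e)
        groups[k] = agg_reduce(groups[k], e) if k in groups else e
    return list(groups.values())

events = [
    {"src_ip": "10.0.0.1", "dst_ip": "8.8.8.8", "dst_port": 53, "action": "allow",
     "bytes": 120, "count": 1, "first_seen": 1, "last_seen": 1},
    {"src_ip": "10.0.0.1", "dst_ip": "8.8.8.8", "dst_port": 53, "action": "allow",
     "bytes": 80, "count": 1, "first_seen": 2, "last_seen": 2},
    {"src_ip": "10.0.0.2", "dst_ip": "8.8.4.4", "dst_port": 53, "action": "deny",
     "bytes": 60, "count": 1, "first_seen": 1, "last_seen": 1},
]
aggregated = aggregate(events)
```

Because repeated flows collapse into one record per key and micro-batch, this is where the roughly 2x reduction in event volume comes from.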
14. Big Data Processing
Firewall logs aggregation
• The Columnar Store Loader (5 sec. streaming job) loads aggregated events into the columnar store.
• The columnar store offloads storage and querying.
15. Big Data Processing
Firewall logs aggregation
• Environment
  – Inputs of 65,000 EPS and 32,000 EPS
  – 5 sec micro-batches (Spark Streaming)
  – 24 executors x 11 cores each on a non-dedicated, heavily utilized Hortonworks cluster
• Results
  – The number of events is reduced by half
  – Query times are reduced to seconds
16. SIEM functionality using Big Data technology
Microservices based on big data technologies implement SIEM functionality:
• Easy to add/modify functionality
• Design driven by users
• Easier integration with processes
[Diagram: events flow into microservices (MS) coordinated by an orchestration service on top of big data storage; security analysts receive alerts and obtain query/context through an API/UI.]
17. SIEM functionality using Big Data technology
Rule development and testing similar to software testing:
• Similar process and tools (Jira, Git, etc.)
• Tools: Spark, in-memory data grid
• Preliminary results:
  – 15 - 20 minutes to test a rule on 24h of data (2B events) with 24 executors
  – Linearly scalable
Workflow: Rule Development → Unit Testing → Fast Forward Testing with Production Sample → Production Deployment
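Treating a rule like code means it can be unit-tested on a hand-crafted sample before being "fast forwarded" over a production snapshot. A minimal sketch, assuming a hypothetical brute-force-login rule (the rule logic and threshold are invented, not HPE's):

```python
# Hypothetical detection rule written as a plain function, so it can be
# unit-tested locally and then replayed over a production data sample.
def brute_force_rule(events, threshold=5):
    """Alert on sources with more than `threshold` failed logins."""
    failures = {}
    for e in events:
        if e["type"] == "login_failure":
            failures[e["src"]] = failures.get(e["src"], 0) + 1
    return [src for src, n in failures.items() if n > threshold]

# Unit test on a small hand-crafted sample, exactly as one would test code.
sample = [{"type": "login_failure", "src": "10.0.0.9"}] * 6 \
       + [{"type": "login_success", "src": "10.0.0.1"}]
alerts = brute_force_rule(sample)
```

The same function can then run as a Spark batch job over a day of stored events to get the alert counts and statistics in minutes instead of waiting a day.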
19. Machine Learning - Introduction
[Plots: labeled points separated by a decision boundary (supervised learning) and unlabeled points grouped into clusters with outliers (unsupervised learning).]
Labeled data – supervised learning: we can find a function f and its parameters that fit the training data and can be used for classification and regression.
Unlabeled data – unsupervised learning: we can derive structure from data and find outliers.
20. Machine Learning - Supervised
[Diagram: a training algorithm consumes labeled training data and produces model parameters (a hypothesis); a classification/regression algorithm applies those parameters to new data to produce classification/regression results.]
Training: finding a function and its parameters that fit the training data.
21. Machine Learning – Example
• f: if x2 > (p0 + p1 * x1) then O else X
• Find parameters that minimize the number of wrongly classified data points (the cost function)

p0  | p1   | Cost
0.6 | 0    | 3
0.9 | -0.9 | 2
0.8 | -0.7 | 0

[Plots: the labeled training data with the candidate decision lines.]
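The parameter search above can be sketched as a brute-force grid scan; the labeled points below are invented for illustration, not the data behind the plot.

```python
# Classifier from the slide: predict "O" if x2 > p0 + p1 * x1, else "X".
# Hypothetical labeled training points (separable by the line x2 = 0.8 - 0.7*x1).
points = [
    (0.1, 0.9, "O"), (0.5, 0.7, "O"), (0.9, 0.4, "O"),
    (0.2, 0.3, "X"), (0.6, 0.2, "X"), (0.8, 0.1, "X"),
]

def cost(p0, p1):
    # Number of wrongly classified points (the cost function on the slide).
    wrong = 0
    for x1, x2, label in points:
        predicted = "O" if x2 > p0 + p1 * x1 else "X"
        wrong += predicted != label
    return wrong

# Scan a coarse grid and keep the parameters with the lowest cost.
candidates = [(cost(p0 / 10, p1 / 10), p0 / 10, p1 / 10)
              for p0 in range(0, 11) for p1 in range(-10, 1)]
best_cost, best_p0, best_p1 = min(candidates)
```

Real training algorithms minimize a differentiable cost with gradient methods rather than a grid scan, but the idea of searching for parameters that minimize misclassification is the same.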
22. Machine Learning - example
Classification: if x2 > (0.8 - 0.7 * x1) then O else X
[Plots: new unlabeled data on the left; the same points classified by the learned boundary on the right.]
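Applying the learned parameters is then a one-line check; a sketch using the boundary above on hypothetical new points:

```python
# Apply the learned decision boundary (p0 = 0.8, p1 = -0.7) to new data.
def classify(x1, x2):
    return "O" if x2 > 0.8 - 0.7 * x1 else "X"

# Hypothetical new, unlabeled points.
new_data = [(0.2, 0.9), (0.7, 0.1)]
labels = [classify(x1, x2) for x1, x2 in new_data]
```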
23. Machine Learning – Terminology
Precision = True Positives / (True Positives + False Positives) = proportion of selected items that are relevant
Recall = True Positives / (True Positives + False Negatives) = proportion of relevant items that were selected
Source: https://en.wikipedia.org/wiki/Precision_and_recall
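These definitions translate directly into code; the example counts below are the ones from the malware scenario worked through on the following slides.

```python
# Precision and recall from raw counts, matching the definitions above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Example: 99 true positives, 99 false positives, 1 false negative.
p = precision(99, 99)   # half of the flagged machines are really infected
r = recall(99, 1)       # almost all infected machines are flagged
```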
24. Machine Learning – Challenges
• Too many false positives
  – Precision of ~99% can be too low
• Data cleanliness
  – A wrong time on a device can be detected as an anomaly
• Missing labeled data
  – Hard to evaluate recall
25. Machine Learning – Challenges
• An ML algorithm for detecting a specific malware infection is 99% accurate on each class: 99% of infected machines are flagged (sensitivity) and 99% of clean machines are correctly passed (specificity).
• The infection is relatively rare: 1% of computers are infected.
What is the probability that a computer is really infected if it is classified as infected?
(99%, 91%, 50% or 1%)
Is 99% accuracy good enough?
26. Machine Learning – Challenges
Suppose there are 10,000 computers:
• 100 are infected
  – 99 infected are correctly classified as infected (true positives)
  – 1 infected is classified as not infected (false negative)
• 9,900 are clean
  – 99 are incorrectly classified as infected (false positives)
  – 9,801 are correctly classified as not infected (true negatives)
99 true positives + 99 false positives = 198 computers classified as infected, but only 99 are really infected, so the probability that a computer classified as infected is really infected is 50%.

Using Bayes' theorem:
P(infected | classified as infected) = P(classified as infected | infected) * P(infected) / P(classified as infected) = (0.99 * 0.01) / (0.99 * 0.01 + 0.01 * 0.99) = 0.5
27. Machine Learning – Challenges
• Usually a human should make the final assessment.
• Reasonable use cases:
  – High ratio of “infection”
  – Limited (selected) data

Classifier that is 99% accurate on each class:

infected computers [%] | really infected / classified as infected [%]
1.00%  | 50%
0.10%  | 9%
0.01%  | 1%
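The table values follow directly from Bayes' theorem; a short check, assuming 99% per-class accuracy as above:

```python
# Posterior probability of infection given a positive classification,
# for a classifier that is 99% accurate on both classes.
def posterior(base_rate, accuracy=0.99):
    tp = accuracy * base_rate                 # P(flagged and infected)
    fp = (1 - accuracy) * (1 - base_rate)     # P(flagged and clean)
    return tp / (tp + fp)

# Reproduce the table for decreasing base rates of infection.
rows = {rate: round(posterior(rate), 2) for rate in (0.01, 0.001, 0.0001)}
```

As the base rate drops, false positives swamp true positives, which is why rare-event detection with a fixed-accuracy classifier needs human triage.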
28. Machine Learning and Spark
• MLlib is Apache Spark's scalable machine learning library:
  – ML algorithms
  – ML workflow utilities (data → features, evaluation, persistence, ...)
• Several deep learning frameworks run on Spark:
  – Databricks – spark-deep-learning (Deep Learning Pipelines for Apache Spark)
  – Yahoo – TensorFlowOnSpark
  – Intel – BigDL
  – ...
29. Machine Learning Use Cases
Use Case | Data source | Features | Algorithm
Detect malicious URL | Web proxy log | Entropy, number of special chars, path length, URL length, contains org. domain out of position, has been seen, ... | Random Forest, Long Short-Term Memory
Generated (malicious) domain detection | DNS log | Domain string | Long Short-Term Memory
Classify server account activity | Active Directory log | Network distance, organization distance, time distance | Naïve Bayes, Random Forest
30. Machine Learning Use Cases
Use Case | Data source | Features | Algorithm
Detect command and control communication | NetFlow data | Duration of TCP/IP session, cardinality, octets/packet, etc. | Naïve Bayes, Random Forest
31. Machine Learning - Architecture
[Diagram: a Spark MLlib batch job runs a feature extractor over training data and feeds a training algorithm; the resulting model parameters are stored in HDFS.]
parameters
32. Spark
MLlib
Batch or Streaming Job
Machine Learning - Architecture
32#UnifiedDataAnalytics #SparkAISummit
Feature
extractor
New Data
Algorithm
HDFS Model
parameters
Classification
Classified
data
33. Machine Learning - Lessons Learned
• Do not implement ML just to be able to claim “we are using ML”
• Have good use cases, including precision and recall requirements
• Visualization can be more useful than ML in some cases
• In most cases it is necessary to have an analyst validate a detection
• Cyber security analysts like to see the reasoning (why the classifier decided something is malicious)
34. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT