Josef Niedermeier, HPE
Apache Spark for Cyber Security in an Enterprise Company
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Introduction
• Challenges in Cyber Security
• Using Spark to help process an increasing amount of data
– Offloading current applications
– Replacing current applications by Big Data technologies
• Adding additional detection capabilities by Machine Learning
– Machine Learning Introduction
– Use Cases
– High level architecture
– Lessons learned
• Q&A
Introduction - Team
[Diagram labels: Network Traffic Logs; Users; Actions; Big Data Platform; Actionable Intelligence; Global Cyber Security Fusion Center Data Science Team; Vulnerabilities; Risk and Governance; Cyber Security Operation Center; Advanced Threat; SIEM]
Introduction - SIEM
• SIEM: security information and event management
• Security Event Manager (SEM): generates alerts based on predefined rules and input events
• Security Information Manager (SIM): stores relevant cyber security data and allows querying to get context data
[Diagram: events → aggregation, filtering, enriching → SEM (alerts) and SIM (query/context) → security analysts]
Challenges in Cyber Security
• Scalability and performance
– Increasing amount of data: according to Gartner, 25K EPS (events per second) is enterprise size, but large organizations see several hundred thousand EPS.
– Limited storage for historical data.
– Long query response times.
– IoT makes the situation even worse.
• Quickly evolving requirements
• Lack of qualified and skilled professionals
Using Spark to help process an increasing amount of data
Offloading current applications
• Offload of aggregation, filtering, and enriching
• Offload of storage and querying
[Diagram: events → Big Data Processing (aggregation, filtering, enriching) → SIEM (SEM, SIM) → alerts and query/context for security analysts; Big Data Storage with API/UI also serves query/context]
Big Data Processing – high level
[Diagram labels: NetFlow → NetFlow Collector; Syslog (log) → Syslog Collector; Distributed Processing, batch and streaming (deduplication, filtering, aggregation, enriching); HDFS; Columnar Store; In-Memory Data Grid; outputs to the SIEM (NetFlow, Syslog)]
Big Data Processing
Firewall logs aggregation
• A highly available load balancer sends syslog events to live collectors (custom build).
• The Syslog Collector sends syslog events to Kafka (custom build).
• Firewall Aggregation (5 sec. streaming job) aggregates events using DStream.reduceByKey.
• DNS enrichment adds DNS names using DHCP and DNS logs.
• The SIEM Loader (5 sec. streaming job) sends aggregated events to the SIEM.
• The Columnar Store Loader (5 sec. streaming job) loads aggregated events into the Columnar Store, which offloads storage and querying.
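The firewall aggregation step can be pictured with a small, Spark-free sketch. It mimics what DStream.reduceByKey does inside one 5-second micro-batch: events that share a key are merged by a reduce function. The event fields (`src`, `dst`, `port`, `action`, `bytes`) and the choice of key are assumptions for illustration; the talk does not specify them.

```python
from collections import namedtuple

# Hypothetical firewall event; the field names are illustrative only.
Event = namedtuple("Event", "src dst port action bytes")


def key_of(event):
    """Events with the same connection tuple and action are merged."""
    return (event.src, event.dst, event.port, event.action)


def merge(a, b):
    """Reduce function: values are (event_count, total_bytes) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey on one micro-batch."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def aggregate_batch(events):
    pairs = ((key_of(e), (1, e.bytes)) for e in events)
    return reduce_by_key(pairs, merge)


batch = [
    Event("10.0.0.1", "10.0.0.9", 443, "allow", 100),
    Event("10.0.0.1", "10.0.0.9", 443, "allow", 250),
    Event("10.0.0.2", "10.0.0.9", 22, "deny", 60),
    Event("10.0.0.1", "10.0.0.9", 443, "allow", 50),
]
aggregated = aggregate_batch(batch)
for key, (count, total_bytes) in aggregated.items():
    print(key, count, total_bytes)
```

In a real streaming job the same key and merge functions would be applied to the stream of events rather than to a Python list.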
Firewall logs aggregation
• Environment
– Inputs: 65,000 EPS and 32,000 EPS
– 5 sec. micro-batches (Spark Streaming)
– 24 executors x 11 cores each on a non-dedicated, heavily utilized Hortonworks cluster
• Results
– The number of events is reduced by half
– Query times are reduced to seconds
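A rough back-of-the-envelope calculation shows what these figures mean per micro-batch. It assumes that both inputs are handled by the same job, so their rates add up, and that all cores are available to it; neither is stated on the slide.

```python
# Figures taken from the slide.
input_eps = [65_000, 32_000]
batch_seconds = 5
executors, cores_per_executor = 24, 11

# Assumption: both inputs feed the same job, so the rates are summed.
total_eps = sum(input_eps)
events_per_batch = total_eps * batch_seconds
total_cores = executors * cores_per_executor
events_per_core_per_batch = events_per_batch / total_cores

print(total_eps)                          # 97000
print(events_per_batch)                   # 485000
print(total_cores)                        # 264
print(round(events_per_core_per_batch))   # 1837
```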
SIEM functionality using Big Data technology
Microservices based on Big Data technologies implement SIEM functionality:
• Easy to add or modify functionality
• Design driven by users
• Easier integration with processes
[Diagram labels: Events; microservices (MS); Orchestration; Big Data Storage; API/UI; Alerts; Query/Context; Security Analysts]
SIEM functionality using Big Data technology
• Rule development and testing similar to software testing
– Similar process and tools (Jira, Git, etc.)
• Tools
– Spark, In-Memory Data Grid
• Preliminary results
– 15-20 minutes to test a rule on 24 h of data (2B events) with 24 executors
– Linearly scalable
[Diagram: Rule Development → Unit Testing → Fast Forward Testing with Production Sample → Production Deployment]
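Treating a rule like software means it can be written as a plain function and unit-tested before it ever sees production data. The sketch below is one possible shape for that; the rule itself (`port_scan_rule`), its threshold, and the event fields are hypothetical and not taken from the talk.

```python
def port_scan_rule(events, min_ports=3):
    """Hypothetical rule: return source IPs denied on at least `min_ports` distinct ports."""
    ports_by_src = {}
    for event in events:
        if event["action"] == "deny":
            ports_by_src.setdefault(event["src"], set()).add(event["port"])
    return {src for src, ports in ports_by_src.items() if len(ports) >= min_ports}


def test_flags_scanner():
    events = [{"src": "10.0.0.5", "port": p, "action": "deny"} for p in (21, 22, 23)]
    assert port_scan_rule(events) == {"10.0.0.5"}


def test_ignores_allowed_traffic():
    events = [{"src": "10.0.0.6", "port": p, "action": "allow"} for p in (21, 22, 23)]
    assert port_scan_rule(events) == set()


def test_below_threshold_is_not_flagged():
    events = [{"src": "10.0.0.7", "port": p, "action": "deny"} for p in (21, 22)]
    assert port_scan_rule(events) == set()


# Run the unit tests; the same function could then be replayed over a production sample.
test_flags_scanner()
test_ignores_allowed_traffic()
test_below_threshold_is_not_flagged()
print("all rule tests passed")
```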
Adding additional detection capabilities by Machine Learning
Machine Learning - Introduction
• Labeled data: supervised learning. We can find a function f and its parameters that fit the training data and can be used for classification and regression.
• Unlabeled data: unsupervised learning. We can derive structure from the data and find outliers.
[Figures: two scatter plots with axes x1 and x2 (range 0 to 1), titled "Supervised Learning" and "Unsupervised Learning"]
Machine Learning - Supervised
• Training: finding a function and its parameters to fit the training data
[Diagram: Training Labeled Data → Training Algorithm → Model Parameters (hypothesis)]
• Actual classification/regression
[Diagram: New Data and Model Parameters → Classification/Regression Algorithm → Classification/Regression Results]
Machine Learning – Example
• f: if x2 > (p0 + p1 * x1) then O else X
• Find the parameters that minimize the number of wrongly classified data points (the cost function)

p0     p1      Cost
0.6    0       3
0.9    -0.9    2
0.8    -0.7    0

[Figures: scatter plots of the training labeled data with axes x1 and x2 (range 0 to 1), titled "Supervised Learning", with one line drawn per parameter pair]
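The cost function from this example can be sketched in a few lines. The training points below are hypothetical (the points from the slide's figure are not recoverable), so the costs differ from the table above; only the decision rule and the three candidate parameter pairs come from the slide.

```python
def classify(x1, x2, p0, p1):
    """Decision rule from the slide: O above the line x2 = p0 + p1 * x1, else X."""
    return "O" if x2 > (p0 + p1 * x1) else "X"


def cost(params, data):
    """Number of wrongly classified data points for a parameter pair (p0, p1)."""
    p0, p1 = params
    return sum(1 for x1, x2, label in data if classify(x1, x2, p0, p1) != label)


# Hypothetical labeled training data (x1, x2, label); NOT the points from the slide.
training_data = [
    (0.9, 0.9, "O"), (0.5, 0.8, "O"), (0.8, 0.5, "O"), (0.3, 0.7, "O"),
    (0.1, 0.2, "X"), (0.3, 0.3, "X"), (0.6, 0.1, "X"), (0.2, 0.5, "X"),
    (0.9, 0.12, "X"),
]

# Candidate parameter pairs (p0, p1) taken from the slide.
candidates = [(0.6, 0.0), (0.9, -0.9), (0.8, -0.7)]

costs = {params: cost(params, training_data) for params in candidates}
best = min(candidates, key=lambda params: costs[params])

for params in candidates:
    print(params, costs[params])
print("best:", best)
```

A real training algorithm searches the parameter space instead of comparing three hand-picked candidates, but the quantity it minimizes plays the same role.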
Machine Learning – Example
• Classification: if x2 > (0.8 - 0.7 * x1) then O else X
[Figures: "New data" and "Classified new data"]
Machine Learning – Terminology
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$
= proportion of selected items that are relevant

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$
= proportion of relevant items that were selected

Source: https://en.wikipedia.org/wiki/Precision_and_recall
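As a minimal sketch, both definitions translate directly into code. The counts used here are hypothetical and chosen only to make the two values differ.

```python
def precision(true_positive, false_positive):
    """Proportion of selected items that are relevant."""
    return true_positive / (true_positive + false_positive)


def recall(true_positive, false_negative):
    """Proportion of relevant items that were selected."""
    return true_positive / (true_positive + false_negative)


# Hypothetical counts for illustration.
tp, fp, fn = 80, 20, 40
print(f"precision = {precision(tp, fp):.2f}")  # precision = 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # recall    = 0.67
```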
Machine Learning – Challenges
• Too many false positives
– Precision of ~99% can be too low
• Data cleanliness
– A wrong time on a device can be detected as an anomaly
• Missing labeled data
– Hard to evaluate recall
Machine Learning – Challenges
Is 99% accuracy good enough?
• An ML algorithm for detecting a specific malware infection:
– recall = 99% (infected computers classified as infected)
– true negative rate = 99% (clean computers classified as not infected)
• The infection is relatively rare: 1% of computers are infected.
What is the probability that a computer is really infected if it is classified as infected? (99%, 91%, 50%, or 1%?)
Machine Learning – Challenges
Suppose there are 10,000 computers:
• 100 are infected
– 99 infected are correctly classified as infected (true positives)
– 1 infected is classified as not infected (false negative)
• 9,900 are clean
– 99 are incorrectly classified as infected (false positives)
– 9,801 are correctly classified as not infected (true negatives)
• 99 true positives and 99 false positives = 198 computers classified as infected, but only 99 are really infected, so the probability that a computer classified as infected is really infected is 50%.

Using Bayes' theorem:
$$P(\text{infected} \mid \text{classified as infected}) = \frac{P(\text{classified as infected} \mid \text{infected}) \cdot P(\text{infected})}{P(\text{classified as infected})} = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.01 \cdot 0.99} = 0.5$$
Machine Learning – Challenges
• Usually a human should make the final assessment.
• Reasonable use cases:
– High ratio of "infection"
– Limited (selected) data

Classifier that correctly classifies 99% of infected and 99% of clean computers:

Infected computers [%]    Really infected / classified as infected [%]
1.00                      50
0.10                      9
0.01                      1
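The table can be reproduced with Bayes' theorem. In this sketch the function name and parameter names are my own; the 99% true positive rate and 1% false positive rate are the values used in the worked example with 10,000 computers.

```python
def posterior(prevalence, true_positive_rate=0.99, false_positive_rate=0.01):
    """P(infected | classified as infected) for a given share of infected computers."""
    flagged_infected = true_positive_rate * prevalence
    flagged_clean = false_positive_rate * (1.0 - prevalence)
    return flagged_infected / (flagged_infected + flagged_clean)


for prevalence in (0.01, 0.001, 0.0001):
    share = posterior(prevalence)
    print(f"{prevalence * 100:.2f}% infected -> {share * 100:.0f}% of flagged computers really infected")
```

The rarer the infection, the more the false positives from the large clean population dominate the alerts.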
Machine Learning and Spark
• MLlib is Apache Spark's scalable machine learning library.
– ML algorithms
– ML workflow utilities (data → feature, evaluation, persistence, ...)
• Several deep learning frameworks
– Databricks: spark-deep-learning (Deep Learning Pipelines for Apache Spark)
– Yahoo: TensorFlowOnSpark
– Intel: BigDL
– ...
Machine Learning Use Cases
• Use case: Detect malicious URL
– Data source: Web proxy log
– Features: entropy, number of special characters, path length, URL length, contains org. domain out of position, has been seen, ...
– Algorithm: Random Forest, Long Short-Term Memory (LSTM)
• Use case: Generated (malicious) domains detection
– Data source: DNS log
– Features: domain string
– Algorithm: Long Short-Term Memory (LSTM)
• Use case: Classify server account activity
– Data source: Active Domain log
– Features: network distance, organization distance, time distance
– Algorithm: Naïve Bayes, Random Forest
• Use case: Detect command and control communication
– Data source: NetFlow data
– Features: duration of TCP/IP session, cardinality, octets/packet, etc.
– Algorithm: Naïve Bayes, Random Forest
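A few of the features listed for the malicious URL use case are simple to compute. The sketch below is an illustration only: the exact feature definitions used by the team are not given, so "special characters" is assumed here to mean any non-alphanumeric character and entropy is assumed to be Shannon entropy over the characters of the whole URL.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url):
    """Hypothetical feature extractor for one URL from a web proxy log."""
    parts = urlsplit(url)
    return {
        "entropy": shannon_entropy(url),
        # Assumption: a "special character" is anything that is not a letter or digit.
        "special_chars": sum(1 for ch in url if not ch.isalnum()),
        "path_length": len(parts.path),
        "url_length": len(url),
    }


features = url_features("http://example.com/a/b?x=1")
print(features)
```

A dictionary like this would become one row of the feature matrix passed to a classifier such as Random Forest.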
Machine Learning – Architecture
• Training (Spark MLlib batch job)
[Diagram: Training Data → Feature extractor → Training Algorithm → Model parameters stored in HDFS]
• Classification (Spark MLlib batch or streaming job)
[Diagram: New Data → Feature extractor → Classification Algorithm, using Model parameters from HDFS → Classified data]
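The split between a training job that writes model parameters and a classification job that reads them can be sketched without Spark. Everything here is a stand-in: a local JSON file replaces HDFS, a two-parameter line replaces an MLlib model, and the record fields are hypothetical.

```python
import json
import os
import tempfile


def extract_features(record):
    """Hypothetical feature extractor shared by both jobs."""
    return record["x1"], record["x2"]


def predict(params, record):
    x1, x2 = extract_features(record)
    return "O" if x2 > params["p0"] + params["p1"] * x1 else "X"


def training_job(labeled_records, candidates, model_path):
    """Pick the candidate with the fewest errors and persist its parameters."""
    def errors(candidate):
        params = {"p0": candidate[0], "p1": candidate[1]}
        return sum(1 for r in labeled_records if predict(params, r) != r["label"])

    p0, p1 = min(candidates, key=errors)
    with open(model_path, "w") as handle:
        json.dump({"p0": p0, "p1": p1}, handle)


def classification_job(new_records, model_path):
    """Load the stored parameters and classify new data."""
    with open(model_path) as handle:
        params = json.load(handle)
    return [predict(params, r) for r in new_records]


labeled = [
    {"x1": 0.9, "x2": 0.9, "label": "O"},
    {"x1": 0.8, "x2": 0.5, "label": "O"},
    {"x1": 0.1, "x2": 0.2, "label": "X"},
    {"x1": 0.6, "x2": 0.1, "label": "X"},
]
new_data = [{"x1": 0.5, "x2": 0.8}, {"x1": 0.3, "x2": 0.3}]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "model_parameters.json")  # stand-in for an HDFS path
    training_job(labeled, [(0.6, 0.0), (0.8, -0.7)], path)
    classified = classification_job(new_data, path)

print(classified)
```

Keeping the feature extractor in one shared function is the point of the design: training and classification must compute features identically.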
Machine Learning – Lessons Learned
• Do not implement ML just to tick the "we are using ML" box.
• Have good use cases, including precision and recall requirements.
• Visualization can be more useful than ML in some cases.
• In most cases, a detection needs to be validated by an analyst.
• Cyber security analysts appreciate reasoning (why the classifier decided that something is malicious).
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Monitoring Microservices
Monitoring MicroservicesMonitoring Microservices
Monitoring MicroservicesWeaveworks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습동현 강
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 

What's hot (20)

Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Monitoring Microservices
Monitoring MicroservicesMonitoring Microservices
Monitoring Microservices
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
Apache spark
Apache sparkApache spark
Apache spark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 

Similar to Apache Spark for Cyber Security in an Enterprise Company

July 2021 Virtual PNW Splunk User Group Slides
July 2021 Virtual PNW Splunk User Group SlidesJuly 2021 Virtual PNW Splunk User Group Slides
July 2021 Virtual PNW Splunk User Group SlidesAmanda Richardson
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...Alluxio, Inc.
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Big Data for Security - DNS Analytics
Big Data for Security - DNS AnalyticsBig Data for Security - DNS Analytics
Big Data for Security - DNS AnalyticsMarco Casassa Mont
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with GimelAlluxio, Inc.
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelDeepak Chandramouli
 
Using bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-REDUsing bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-REDLionel Mommeja
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
Visualization in the Age of Big Data
Visualization in the Age of Big DataVisualization in the Age of Big Data
Visualization in the Age of Big DataRaffael Marty
 
Use Machine Learning to Get the Most out of Your Big Data Clusters
Use Machine Learning to Get the Most out of Your Big Data ClustersUse Machine Learning to Get the Most out of Your Big Data Clusters
Use Machine Learning to Get the Most out of Your Big Data ClustersDatabricks
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
 
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...Machine Learning for Your Enterprise: Operations and Security for Mainframe E...
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...Precisely
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptxArthur240715
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 

Similar to Apache Spark for Cyber Security in an Enterprise Company (20)

July 2021 Virtual PNW Splunk User Group Slides
July 2021 Virtual PNW Splunk User Group SlidesJuly 2021 Virtual PNW Splunk User Group Slides
July 2021 Virtual PNW Splunk User Group Slides
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...
 
LEGaTO: Use cases
LEGaTO: Use casesLEGaTO: Use cases
LEGaTO: Use cases
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Big Data for Security - DNS Analytics
Big Data for Security - DNS AnalyticsBig Data for Security - DNS Analytics
Big Data for Security - DNS Analytics
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with Gimel
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | Gimel
 
Using bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-REDUsing bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-RED
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Visualization in the Age of Big Data
Visualization in the Age of Big DataVisualization in the Age of Big Data
Visualization in the Age of Big Data
 
Use Machine Learning to Get the Most out of Your Big Data Clusters
Use Machine Learning to Get the Most out of Your Big Data ClustersUse Machine Learning to Get the Most out of Your Big Data Clusters
Use Machine Learning to Get the Most out of Your Big Data Clusters
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...Machine Learning for Your Enterprise: Operations and Security for Mainframe E...
Machine Learning for Your Enterprise: Operations and Security for Mainframe E...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 

Apache Spark for Cyber Security in an Enterprise Company

  • 2. Josef Niedermeier, HPE Apache Spark for Cyber Security in an Enterprise Company #UnifiedDataAnalytics #SparkAISummit
  • 3. Agenda
      • Introduction
      • Challenges in Cyber Security
      • Using Spark to help process an increasing amount of data
          – Offloading current applications
          – Replacing current applications by Big Data technologies
      • Adding additional detection capabilities by Machine Learning
          – Machine Learning Introduction
          – Use Cases
          – High level architecture
          – Lessons learned
      • Q&A
  • 4. Introduction - Team
      Diagram labels: Network Traffic Logs; Users Actions; Big Data Platform; Actionable Intelligence; Global Cyber Security Fusion Center Data Science Team; Vulnerabilities; Risk and Governance; Cyber Security Operation Center; Advanced Threat
  • 5. Introduction - SIEM
      ● SIEM: security information and event management
      ● Security Event Manager (SEM): generates alerts based on predefined rules and input events
      ● Security Information Manager (SIM): stores relevant cyber security data and allows querying to get context data
      Diagram labels: SIEM, events, Aggregation, Filtering, Enriching, SEM, SIM, Alerts, Query/Context, Security Analysts
  • 6. Challenges in Cyber Security
      • Scalability and performance
          – Increasing amount of data: according to Gartner, 25K events per second (EPS) is enterprise size, but big organizations can see several hundred thousand EPS.
          – Limited storage for historical data.
          – Long query response times.
          – IoT makes the situation even worse.
      • Quickly evolving requirements
      • Lack of qualified and skilled professionals
  • 7. Using Spark to help process an increasing amount of data
  • 8. Offloading current applications
      ● offload of aggregation, filtering and enriching
      ● offload of storage and querying
      Diagram labels: events, Big Data Processing (Aggregation, Filtering, Enriching), SIEM (SEM, SIM), Alerts, Big Data Storage, API, UI, Query/Context, Security Analysts
  • 9. Big Data Processing – high level
      Diagram labels: NetFlow, Syslog, Netflow Collector, Syslog Collector, NetFlow Log, HDFS, Distributed Processing (Batch and Streaming): deduplication, filtering, aggregation, enriching; In Memory Data Grid, Columnar Store, SIEM
  • 10. Big Data Processing: Firewall logs aggregation
  • 11. Big Data Processing: Firewall logs aggregation
      ● Syslog Collector (custom build) sends syslog events to Kafka.
      ● Highly available load balancer (custom build) sends syslog events to live collectors.
  • 12. Big Data Processing: Firewall logs aggregation
      ● Firewall Aggregation (5 sec. streaming job) aggregates events (using DStream.reduceByKey).
      ● DNS enrichment adds DNS names using DHCP and DNS logs.
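The per-batch aggregation on slide 12 can be pictured with a minimal sketch in plain Python (no Spark required). It mimics what a `reduceByKey` step does within one micro-batch: map each event to a key and a partial value, then merge values that share a key. The event shape, the key fields, and the `(count, bytes)` value are assumptions made for illustration; the slides do not show the actual aggregation key.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (src_ip, dst_ip, dst_port, action, byte_count)
Event = Tuple[str, str, int, str, int]
Key = Tuple[str, str, int, str]
Value = Tuple[int, int]  # (event_count, total_bytes)


def merge(a: Value, b: Value) -> Value:
    """Associative merge function, the role played by the reduceByKey argument."""
    return (a[0] + b[0], a[1] + b[1])


def aggregate_batch(events: Iterable[Event]) -> Dict[Key, Value]:
    """Collapse one micro-batch into a single record per key."""
    totals: Dict[Key, Value] = {}
    for src, dst, port, action, nbytes in events:
        key = (src, dst, port, action)
        value = (1, nbytes)
        totals[key] = merge(totals[key], value) if key in totals else value
    return totals


# One hypothetical 5-second micro-batch
batch = [
    ("10.0.0.1", "10.0.0.9", 443, "allow", 1200),
    ("10.0.0.1", "10.0.0.9", 443, "allow", 800),
    ("10.0.0.2", "10.0.0.9", 22, "deny", 60),
]
aggregated = aggregate_batch(batch)
for key, (count, total_bytes) in aggregated.items():
    print(key, count, total_bytes)
```

Because the merge function is associative, the same logic can be applied to partial results from different partitions, which is what makes this pattern suitable for distributed execution.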
  • 13. Big Data Processing: Firewall logs aggregation
      ● SIEM Loader (5 sec. streaming job) sends aggregated events to the SIEM.
  • 14. Big Data Processing: Firewall logs aggregation
      ● Columnar Store Loader (5 sec. streaming job) loads aggregated events into the Columnar Store.
      ● Columnar Store offloads storage and querying.
  • 15. Big Data Processing: Firewall logs aggregation
      ● Environment
          ● Inputs: 65,000 EPS and 32,000 EPS
          ● 5 sec. micro-batches (Spark Streaming)
          ● 24 executors x 11 cores each on a non-dedicated, heavily utilized Hortonworks cluster
      ● Results
          ● The number of events is reduced by half
          ● Query times are reduced to seconds
  • 16. SIEM functionality using BigData technology
      Micro services (MS) based on Big Data technologies implement SIEM functionality:
      ● Easy to add/modify functionality
      ● Design driven by users
      ● Easier integration with processes
      Diagram labels: Events, MS (several boxes), Orchestration, Big Data Storage, Query/Context, API/UI, Alerts, Security Analysts
  • 17. SIEM functionality using BigData technology
      ● Rule development and testing is similar to software testing
          ● Similar process and tools (Jira, Git, etc.)
      ● Tools
          ● Spark, In Memory Data Grid
      ● Preliminary results
          ● 15-20 minutes to test a rule on 24 h of data (2B events) (24 executors)
          ● linearly scalable
      ● Workflow: Rule Development → Unit Testing → Fast Forward Testing With Production Sample → Production Deployment
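Treating a detection rule like software, as slide 17 describes, could look roughly like the following sketch: the rule is a pure function over events, and a unit test pins down its behavior before it is replayed against a production sample. The rule itself, the event fields (`src`, `action`), and the threshold are hypothetical; the slides do not show actual rule code.

```python
from collections import Counter
from typing import Dict, Iterable, List


def repeated_deny_rule(events: Iterable[Dict[str, str]], threshold: int = 3) -> List[str]:
    """Hypothetical rule: alert on sources with at least `threshold` denied events."""
    denies = Counter(e["src"] for e in events if e["action"] == "deny")
    return sorted(src for src, n in denies.items() if n >= threshold)


def test_repeated_deny_rule() -> None:
    """Unit test with a small hand-made sample, as in ordinary software testing."""
    sample = [
        {"src": "10.0.0.5", "action": "deny"},
        {"src": "10.0.0.5", "action": "deny"},
        {"src": "10.0.0.5", "action": "deny"},
        {"src": "10.0.0.7", "action": "deny"},
        {"src": "10.0.0.7", "action": "allow"},
    ]
    assert repeated_deny_rule(sample) == ["10.0.0.5"]
    assert repeated_deny_rule(sample, threshold=4) == []


test_repeated_deny_rule()
```

Keeping the rule free of I/O is a design choice that lets the same function run in a unit test and, wrapped in a distributed job, over a full day of recorded events.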
  • 18. Adding additional detection capabilities by Machine Learning
  • 19. Machine Learning - Introduction
      ● Labeled data – supervised learning: we can find a function f and its parameters that fit training data and can be used for classification and regression.
      ● Unlabeled data – unsupervised learning: we can derive structure from data and find outliers.
      [Figures: two scatter plots with axes x1 and x2 (range 0 to 1), titled "Supervised Learning" and "Unsupervised Learning"]
  • 20. Machine Learning - Supervised
      ● Training: finding a function and its parameters to fit training data
          Training Labeled Data → Training Algorithm → Model Parameters (hypothesis)
      ● Actual Classification/Regression
          New Data + Model Parameters → Classification/Regression Algorithm → Classification/Regression Results
  • 21. Machine Learning – Example
      ● f: if x2 > (p0 + p1 * x1) then O else X
      ● Training means finding parameters that minimize the number of wrongly classified data points (cost function).
      Candidate parameters and their cost (each row corresponds to one line drawn in the plot):
          p0     p1      Cost
          0.6    0       3
          0.9    -0.9    2
          0.8    -0.7    0
      [Figure: scatter plot of the labeled training data, axes x1 and x2 (range 0 to 1), titled "Supervised Learning"]
  • 22. Machine Learning - example
      ● Classification: if x2 > (0.8 – 0.7 * x1) then O else X
      [Figures: "New data" and "Classified new data"]
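The linear classifier and its cost function from slides 21 and 22 can be sketched in a few lines of Python. The decision rule and the three candidate parameter pairs come from the slides; the training points are hypothetical (the slide data is only available as a plot), so the costs computed here are not expected to match the slide's cost column.

```python
from typing import List, Tuple

Point = Tuple[float, float, str]  # (x1, x2, label) with label "O" or "X"


def classify(x1: float, x2: float, p0: float, p1: float) -> str:
    """Decision rule from the slide: O above the line p0 + p1 * x1, otherwise X."""
    return "O" if x2 > (p0 + p1 * x1) else "X"


def cost(points: List[Point], p0: float, p1: float) -> int:
    """Cost function: number of wrongly classified training points."""
    return sum(1 for x1, x2, label in points if classify(x1, x2, p0, p1) != label)


# Hypothetical labeled training data (not taken from the slides)
training: List[Point] = [
    (0.0, 0.85, "O"),
    (0.8, 0.7, "O"),
    (0.9, 0.3, "O"),
    (0.1, 0.3, "X"),
    (0.5, 0.2, "X"),
    (0.3, 0.5, "X"),
]

# Candidate (p0, p1) pairs listed on slide 21
candidates = [(0.6, 0.0), (0.9, -0.9), (0.8, -0.7)]

# "Training" here is a brute-force search over the candidates
best = min(candidates, key=lambda p: cost(training, p[0], p[1]))
print("best parameters:", best, "cost:", cost(training, *best))

# Classify a new, unlabeled point with the chosen parameters
print(classify(0.5, 0.9, *best))
```

A real training algorithm would search the parameter space systematically rather than compare three hand-picked lines, but the roles of the hypothesis, the parameters, and the cost function are the same.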
  • 23. Machine Learning – Terminology
      $\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$ = proportion of selected items that are relevant
      $\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$ = proportion of relevant items that were selected
      Source: https://en.wikipedia.org/wiki/Precision_and_recall
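As a small illustration of the two definitions on slide 23, the following sketch computes precision and recall from confusion-matrix counts. The counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision(true_positive: int, false_positive: int) -> float:
    """Proportion of selected items that are relevant."""
    selected = true_positive + false_positive
    return true_positive / selected if selected else 0.0


def recall(true_positive: int, false_negative: int) -> float:
    """Proportion of relevant items that were selected."""
    relevant = true_positive + false_negative
    return true_positive / relevant if relevant else 0.0


# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives
tp, fp, fn = 80, 20, 10
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
```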
  • 24. Machine Learning – Challenges
      ● Too many false positives
          ● Precision ~ 99% can be too low
      ● Data cleanliness
          ● Wrong time on a device can be detected as an anomaly
      ● Missing labeled data
          ● Hard to evaluate recall
  • 25. Machine Learning – Challenges
      Is 99% precision good enough?
      ● An ML algorithm for detecting a specific malware infection:
          ● precision = 99%
          ● recall = 99%
      ● The infection is relatively rare: 1% of computers are infected.
      What is the probability that a computer is really infected if it is classified as infected? (99% or 91% or 50% or 1%)
  • 26. Machine Learning – Challenges
      Suppose there are 10,000 computers and the classifier labels 99% of infected and 99% of clean computers correctly:
      ● 100 are infected
          ● 99 infected are correctly classified as infected (true positive)
          ● 1 infected is classified as not infected (false negative)
      ● 9,900 are clean
          ● 99 are classified incorrectly as infected (false positive)
          ● 9,801 are correctly classified as not infected (true negative)
      ● 99 true positives and 99 false positives = 198 computers classified as infected, but only 99 are really infected, so the probability that a computer classified as infected is really infected is 50%.
      Using Bayes' theorem:
      $P(\text{infected} \mid \text{classified as infected}) = \frac{P(\text{classified as infected} \mid \text{infected}) \cdot P(\text{infected})}{P(\text{classified as infected})} = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.01 \cdot 0.99} = 0.5$
  • 27. Machine Learning – Challenges
      ● Usually a human should make the final assessment.
      ● Reasonable use cases:
          ● High ratio of “infection”
          ● Limited (selected) data
      Classifier with precision and recall 99%:
          Infected computers [%]    Really infected / classified as infected [%]
          1.00                      50
          0.10                      9
          0.01                      1
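The table on slide 27 can be reproduced with Bayes' theorem. The sketch below assumes, as in the worked example with 10,000 computers, that 99% of infected machines are flagged and 1% of clean machines are flagged by mistake; the function and parameter names are chosen here for illustration.

```python
def probability_really_infected(
    infected_ratio: float,
    detection_rate: float = 0.99,       # P(classified as infected | infected)
    false_positive_rate: float = 0.01,  # P(classified as infected | clean)
) -> float:
    """P(infected | classified as infected) via Bayes' theorem."""
    flagged_and_infected = detection_rate * infected_ratio
    flagged_and_clean = false_positive_rate * (1.0 - infected_ratio)
    return flagged_and_infected / (flagged_and_infected + flagged_and_clean)


# Infection ratios from the slide: 1%, 0.1% and 0.01% of computers
for ratio in (0.01, 0.001, 0.0001):
    posterior = probability_really_infected(ratio)
    print(f"{ratio:.2%} infected -> {posterior:.0%} of flagged computers really infected")
```

The rarer the infection, the more the false positives from the large clean population dominate, which is why the slide recommends use cases with a high infection ratio or pre-selected data.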
  • 28. Machine Learning and Spark
      ● MLlib is Apache Spark's scalable machine learning library.
          ● ML algorithms
          ● ML workflow utilities (data → feature, evaluation, persistence, ...)
      ● Several deep learning frameworks
          ● Databricks – spark-deep-learning, Deep Learning Pipelines for Apache Spark
          ● Yahoo – TensorFlowOnSpark
          ● Intel – BigDL
          ● ...
  • 29. Machine Learning Use Cases
      ● Use case: Detect malicious URL
          Data source: Web proxy log
          Features: Entropy, no. of special chars, path length, URL length, contains org. domain out of position, has been seen, ...
          Algorithm: Random Forest, Long Short-Term Memory
      ● Use case: Generated domains (malicious) detection
          Data source: DNS log
          Features: Domain string
          Algorithm: Long Short-Term Memory
      ● Use case: Classify server account activity
          Data source: Active Domain log
          Features: Network distance, organization distance, time distance
          Algorithm: Naïve Bayes, Random Forest
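A few of the URL features listed for the malicious-URL use case (entropy, number of special characters, path length, URL length) can be computed with the Python standard library. This is only a sketch of a feature extractor: the exact definitions used by the author are not given in the slides, so treating every non-alphanumeric character as "special" and computing character-level Shannon entropy over the whole URL are assumptions.

```python
import math
from collections import Counter
from typing import Dict
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> Dict[str, float]:
    """Extract a small, assumed subset of the features named on the slide."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "path_length": len(parsed.path),
        # Assumption: "special" means any character that is not a letter or digit
        "special_chars": sum(1 for ch in url if not ch.isalnum()),
        "entropy": shannon_entropy(url),
    }


features = url_features("http://example.com/a/b")
print(features)
```

The resulting dictionary could then be turned into a numeric feature vector for a classifier such as a random forest.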
  • 30. Machine Learning Use Cases
      ● Use case: Detect command and control communication
          Data source: Netflow data
          Features: Duration of TCP/IP session, cardinality, octets/packet, etc.
          Algorithm: Naïve Bayes, Random Forest
  • 31. Machine Learning - Architecture
      Spark MLlib Batch Job
      Diagram labels: Training Data, Feature extractor, Algorithm Training, HDFS, Model parameters
  • 32. Machine Learning - Architecture
      Spark MLlib Batch or Streaming Job
      Diagram labels: New Data, Feature extractor, Algorithm Classification, HDFS, Model parameters, Classified data
  • 33. Machine Learning – Lessons Learned
      ● Do not implement ML just to tick the “we are using ML” box.
      ● Have good use cases, including precision and recall requirements.
      ● Visualization can be more useful than ML in some cases.
      ● In most cases, an analyst needs to validate a detection.
      ● Cyber security analysts appreciate reasoning (why the classifier decided that something is malicious).