Accelerating Cyber Threat Detection With GPU
1. Joshua Patterson | Director of Applied Solutions Engineering | GTC DC 2017
@datametrician
2. 2
RULES & PEOPLE DON’T SCALE
Right now, financial services firms report that it takes an average of 98 days to detect
an Advanced Threat, while retailers say it can take about seven months.
Once the security community moves beyond the mantras “encrypt
everything” and “secure the perimeter,” it can begin developing intelligent
prioritization and response plans to various kinds of breaches – with a
strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical
deployment, and making them reliable for large networks. This is where the
security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
Current methods are too slow
3. 3
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank’s Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
4. 4
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks become more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning, is essential to reduce false positives, detect threats faster, and empower analysts
to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data
from devices require a new approach that combines all data sources to allow for more
intelligent, advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allow analysts to label and
train deep learning models faster, and to validate machine learning predictions.
5. 5
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks become more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning, is essential to reduce false positives, detect threats faster, and empower analysts
to be more efficient.
7. 7
DATA PLATFORM-AS-A-SERVICE
SCALE
• Handles 1M events/second
• Auto-scales the cluster

HIGH AVAILABILITY
• Offers HA with no data loss
• Always-on architecture
• Data replication

SECURITY
• Data platform security implemented with VPCs in AWS
• Dashboard access using NVIDIA LDAP

SELF SERVICE
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
11. 11
ANOMALY DETECTION USING DEEP LEARNING
[Diagram: Data Platform → AD AI Framework (Keras + TensorFlow) → NGC/NGN GPU Cluster (GPU Cloud) → Anomaly Detection]

Top Features
• Automated Alerts & Dashboards
• Early Detection
• Self Service
• Better accuracy & less noise
12. 12
ANOMALY DETECTION FRAMEWORK

Raw Dataset (Time, X1, X2)
↓
Feature Learning. Algorithm: Recurrent Neural Network (RNN), Autoencoders (AE) → (Time, X1, X2, X', X'')
↓
Unsupervised Learning: Multivariate Gaussian; Supervised Learning: Logistic Regression → (Time, X1, X2, X', X'', Y), Y ∈ {0, 1}
↓
Anomaly Post-processing: Univariate Analysis → (Time, X1, X2, Y, Anomaly Description)
↓
Anomaly Detection: Email alerts, Dashboards, with feedback from users looping back into training
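The unsupervised stage of the framework (a multivariate Gaussian fitted over the learned features) can be sketched with NumPy. This is an illustrative toy on synthetic data, not the production framework:

```python
import numpy as np

def fit_gaussian(X):
    # Fit a multivariate Gaussian to the feature matrix X (rows = time windows)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
    return mu, cov

def anomaly_scores(X, mu, cov):
    # Squared Mahalanobis distance of each row from the fitted distribution
    diff = X - mu
    inv = np.linalg.inv(cov)
    return np.einsum('ij,jk,ik->i', diff, inv, diff)

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))       # stand-in for learned features
mu, cov = fit_gaussian(normal)

# Flag points whose score exceeds the 99th percentile of the training scores
threshold = np.quantile(anomaly_scores(normal, mu, cov), 0.99)
test = np.vstack([normal[:5], [[8.0, 8.0]]])   # last row is an obvious outlier
flags = anomaly_scores(test, mu, cov) > threshold
```

The logistic-regression stage would then consume these scores, plus analyst labels fed back from the dashboards, as supervised training signal.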
13. 13
ANOMALY DETECTION BENEFITS WITH DEEP LEARNING
Top Features
• Automated Alerts & Dashboards
• Early Detection
• Self Service
• Better accuracy & less noise
14. 14
ANOMALY DETECTION TRAINING
• Evolution (CPU vs GPU):
  • V0: Manual feature creation
  • V1: Automatic feature creation using DL (Theano)
  • V2: Multi-GPU support + TensorFlow Serving (Keras + TensorFlow)
• Learnings:
  • Manual feature extraction does not scale
  • Dataset preparation is the long pole
  • Training on CPU takes longer than the data collection rate
15. 15
INFERENCING V1
• Use Case: Detecting anomalies in user activity
• Inferencing flow from 10k feet
• Started with python scripts for windowed aggregation
[Flow: Live Streaming Data → AD Platform → ETL aggregations for inferencing]

Python script performance (execution time in seconds, by window size):
10 mins: 73 | 30 mins: 103 | 60 mins: 154
• Learnings: hard to scale for near real time. The AD platform runs inferencing
only every 3 minutes because we are limited by the speed of data processing
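A windowed aggregation of the kind those first scripts performed can be sketched in plain Python. This is a toy version; the function and field names here are illustrative, and the real scripts computed far more features per window:

```python
from collections import defaultdict

def window_counts(events, window_secs=180):
    """Bucket (timestamp, user) events into fixed windows (default 3 minutes).

    Returns {window_start_sec: {user: event_count}} for downstream inferencing.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ts, user in events:
        start = int(ts // window_secs) * window_secs  # align to window boundary
        counts[start][user] += 1
    return {w: dict(u) for w, u in counts.items()}

events = [(10, "alice"), (20, "alice"), (30, "bob"), (200, "alice")]
agg = window_counts(events)
# agg[0] == {"alice": 2, "bob": 1}; agg[180] == {"alice": 1}
```

Scanning every event in Python per window is exactly the part that failed to keep up with the data rate, motivating the move to Presto in V2.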
16. 16
INFERENCING V2
• V2: To improve performance, we started using Presto
with data on S3 in JSON format
• Live data is streamed from Kafka to S3; we use
Presto for our data warehousing needs
• Presto is an open-source distributed SQL query engine
optimized for low-latency, ad-hoc analysis of data*
Query execution time in seconds, by window size:

                    10 mins   30 mins   60 mins
Presto on JSON      20        25        30
Presto on Parquet   4         6         8
• Learnings: Presto with Parquet has the best performance, but we need
to batch data at 30-second intervals, so it's not completely real time
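The 30-second batching constraint can be captured by a small micro-batcher. A toy stdlib sketch (the class and its API are hypothetical; the real pipeline streams from Kafka and writes Parquet to S3):

```python
import time

class MicroBatcher:
    """Buffer streaming records and flush them every `interval` seconds.

    `sink` is any callable that persists a batch (e.g. a Parquet writer);
    `clock` is injectable so the behavior can be tested deterministically.
    """
    def __init__(self, sink, interval=30.0, clock=time.monotonic):
        self.sink, self.interval, self.clock = sink, interval, clock
        self.buf, self.last_flush = [], clock()

    def add(self, record):
        self.buf.append(record)
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(self.buf)   # persist the whole batch at once
            self.buf = []
        self.last_flush = self.clock()

# Deterministic demo with a fake clock
batches = []
t = [0.0]
b = MicroBatcher(batches.append, interval=30.0, clock=lambda: t[0])
b.add("e1")
t[0] = 31.0
b.add("e2")   # crossing the 30 s mark triggers a flush of ["e1", "e2"]
```

Batching this way amortizes Parquet's columnar encoding cost, which is why the query side got fast at the price of up to 30 seconds of staleness.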
17. 17
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks become more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning, is essential to reduce false positives, detect threats faster, and empower analysts
to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data
from devices require a new approach that combines all data sources to allow for more
intelligent, advanced threat hunting and exploration at scale across machine data.
18. 18
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else and how else can we use
GPU acceleration?
• Dashboards
• Accelerating data pipeline
• Stream processing
• Building better models faster
• First: GPU databases
Pipeline stages: Data Ingestion → Data Processing → Visualization → Model Training → Inferencing
19. 19
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Big Data Solution (10-node cluster, ~$60k in hardware) vs SIEM

Test data: production SIEM of a Fortune 500 enterprise
450+ columns
~250 million events per day
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
20. 20
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Typical Scenario                                        Time Period   SIEM            Big Data   Speed Up
1. Show all network communication from one host         1 Day         3h 20m 13s      1m 44s     114x faster
   (IP) to multiple hosts (IPs)                         1 Week        Not Feasible*   4m 05s
2. Retrieve failed logon attempts in Active             1 Day         18m 26s         1m 37s     10x faster
   Directory                                            1 Week        2h 13m 45s      3m 10s     41x faster
3. Search for malware (exe) in Symantec logs            1 Day         3h 24m 36s      1m 37s     125x faster
                                                        1 Week        Not Feasible*   3m 22s
4. View all proxy logs for a specific domain            1 Day         4h 30m 13s      2m 54s     92x faster
                                                        1 Week        Not Feasible*   1m 09s**
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
21. 21
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
Query times in milliseconds:

           MapD DGX-1   MapD 4 x P100   Redshift 6-node   Spark 11-node
Query 1    21           30              1560              10190
Query 2    80           99              1250              8134
Query 3    150          269             2250              19624
Query 4    372          696             2970              85942

Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of
Mark Litwintschik's blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
22. 22
MAPD
MapD Core MapD Immerse
LLVM Backend Rendering Streaming
LLVM creates one custom function that
runs at speeds approaching hand-written
code. LLVM enables generic targeting of
different architectures, and the same code
can run simultaneously on CPU and GPU.
Speed eliminates need to pre-index or
aggregate data. Compute resides on
GPUs freeing CPUs to parse + ingest.
Finally, newest data can be combined with
billions of rows of “near historical” data.
Data goes from compute (CUDA) to
graphics (OpenGL) pipeline without copy
and comes back as compressed PNG
(~100 KB) rather than raw data (> 1GB).
23. 23
MAPD ARCHITECTURE
Visualization Libraries
JavaScript libraries, based on DC.js, that
allow users to build custom web-based
visualization apps powered by a MapD
Core database.
LLVM
MapD Core SQL queries are
compiled with a just-in-time
(JIT) LLVM based compiler,
and run as NVIDIA GPU
machine code.
Distributed Scale-out
MapD Core has native
distributed scale-out
capabilities. MapD Core users
can query and visualize larger
datasets with much smaller
cluster sizes than traditional
solutions.
High Availability
MapD Core has high
availability functionality that
provides durability and
redundancy. Ingest and
queries are load balanced
across servers for additional
throughput.
24. 24
MAPD + IMMERSE VS ELASTIC + KIBANA
Elastic + Kibana
• Fantastic for complex search
• Scales easily (up to a point)
• Indexing consumes more storage (~4-6x)
• Kibana for KPI dashboarding?
MapD Core
• Very fast OLAP queries
• JIT LLVM query compiler
• GPUs for compute
• CPUs for parse + ingest
• Limited join support (for now)
Immerse
• c3/d3 + crossfilter = nice dashboards
• Backend rendering
27. 27
INFERENCING V3
• V3: Explored GPU databases like MapD to improve the performance of queries on live streaming data
• MapD offers constant query response times
• MapD has some SQL limitations, so we use Presto as an interface and built a "MapD -> Presto" connector for full
ANSI SQL features
GPU database performance (execution time in seconds, by window size):

                    10 mins   30 mins   60 mins
Presto on JSON      20        25        30
Presto on Parquet   4         6         8
MapD                0.1       0.1       0.1
Presto + MapD       1.2       1.2       1.2
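The connector itself is not described in detail, but the routing idea, sending queries the GPU database cannot handle to Presto and everything else down the fast path, can be sketched as follows. The function names and the keyword list are hypothetical:

```python
# Hypothetical query router: send simple queries to the GPU database,
# fall back to Presto for SQL features the GPU DB lacks.
UNSUPPORTED = ("ROLLUP", "GROUPING SETS", "FULL OUTER JOIN")

def route_query(sql, gpu_db, presto):
    """Dispatch `sql` to one of two callables based on required SQL features."""
    upper = sql.upper()
    if any(kw in upper for kw in UNSUPPORTED):
        return presto(sql)   # full ANSI SQL path
    return gpu_db(sql)       # fast path on the GPU database

# Demo with stub backends that just record which path was taken
calls = []
route_query("SELECT count(*) FROM events",
            lambda q: calls.append("gpu"), lambda q: calls.append("presto"))
route_query("SELECT a FROM t FULL OUTER JOIN u ON t.k = u.k",
            lambda q: calls.append("gpu"), lambda q: calls.append("presto"))
# calls == ["gpu", "presto"]
```

A production connector would also have to translate results between the two engines; the dispatch above only captures the "use Presto as the interface" half of the idea.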
28. 28
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks become more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning, is essential to reduce false positives, detect threats faster, and empower analysts
to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data
from devices require a new approach that combines all data sources to allow for more
intelligent, advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allow analysts to label and
train deep learning models faster, and to validate machine learning predictions.
32. 32
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• 1/10th the hardware
• 1-2 orders of magnitude more performance
33. 33
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• 1/10th the hardware
• 1-2 orders of magnitude more performance
Real time visualization of 100K+ nodes 1M+ Edges
50-100x faster clustering than other solutions
34. 34
LISTS DO NOT VISUALLY SCALE
Text search is a great starting point, but it
does not scale: you do not see the 30K+
events, nor the IPs, the users, or how
they relate…
36. 36
GRAPHS:
A KEY MISSING
VIEW
Unified Model
Shows entities, events, and relationships
Multipurpose: connect, see, interact
Visual
Inspect individual items
See behavior, patterns, and outliers
Scale to enterprise workloads
37. 37
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
Uni
Ex: Network mapping. "Is it safe to reboot this?" (nodes: ip, ip)

Hyper
Ex: Incident response. "Did this escalate?" (nodes: ip, user, event)

Multi
Ex: SSH trails. "Is a user crossing zones?" (nodes: ip, user)
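The multi-graph case, parallel edges each carrying event metadata, can be modeled directly. A minimal sketch with made-up IPs and zones:

```python
from collections import defaultdict

class MultiGraph:
    """Tiny multigraph: parallel edges between nodes, each carrying an event."""
    def __init__(self):
        self.edges = defaultdict(list)  # (src, dst) -> [event, event, ...]

    def add_event(self, src, dst, event):
        self.edges[(src, dst)].append(event)

    def crossings(self, zone_of):
        # Edges whose endpoints sit in different zones, e.g. SSH across zones
        return [(s, d, ev) for (s, d), evs in self.edges.items()
                for ev in evs if zone_of[s] != zone_of[d]]

g = MultiGraph()
g.add_event("10.0.0.5", "10.1.0.9", {"user": "alice", "type": "ssh"})
g.add_event("10.0.0.5", "10.0.0.6", {"user": "alice", "type": "ssh"})
zones = {"10.0.0.5": "corp", "10.0.0.6": "corp", "10.1.0.9": "prod"}
hops = g.crossings(zones)   # finds only the corp-to-prod hop
```

Keeping parallel edges (rather than collapsing them into one weighted edge) is what lets an analyst answer "is a user crossing zones?" per event, not just per host pair.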
40. 40
EXPAND GPU USAGE
More Data, Less Hardware
[Chart: Peak double-precision TFLOPS by year (2008-2017), NVIDIA GPU vs x86 CPU]
Scaling up and out with
GPU co-processors
42. 42
GPU DATA FRAME
Pre-GOAI
Each application (H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, MapD) moved data between CPU and GPU with its own copy & convert step at every hand-off between apps.

Evolution of Performance
No copy & converts, full interoperability: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD all share the GPU Data Frame, built on Apache Arrow.
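The zero-copy idea can be illustrated on CPU memory with NumPy: two consumers interpret one shared allocation without any copy & convert step. The GDF applies the same principle to GPU memory via the Arrow columnar layout:

```python
import numpy as np

# Two "applications" view the same underlying buffer without copying.
buf = bytearray(8 * 4)                       # one shared 32-byte allocation
app_a = np.frombuffer(buf, dtype=np.int32)   # app A's view: 8 int32 values
app_a[:] = np.arange(8)

app_b = np.frombuffer(buf, dtype=np.int32)   # app B views the same memory
app_a[0] = 42                                # a write through A is visible to B
```

Because both views agree on one memory layout, no serialization happens at the hand-off, which is exactly what the copy & convert arrows in the pre-GOAI picture cost.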
43. 43
CYBERWORKS
CYBERWORKS SIEM SDK
Goals
• Open Source Ecosystem & Select ISVs
• Integration Points w/ leading security vendors
• FireEye
• Splunk
• Palo Alto Networks
Purpose
A platform that lets analysts hunt and
analyze data faster and at greater scale than
traditional big data tools, to find unknown and
zero-day threats. It will accelerate the threat
detection ecosystem and harden cyber defenses
using GPU ISVs and deep learning frameworks.
Purpose Built SDK For SIEM Analytics
44. 44
CYBERWORKS ACTIVITIES
Continuous Improvement
• Use GPU-accelerated databases to analyze data to improve hunting today, as well as to enrich and label data for deep learning.
• Connect accelerated DBs to Splunk for event management, hunting, and exploration. Use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and to connect/bolt on GPU DBs/Graphistry.
• Use ML and graph analytics (nvGRAPH) for feature extraction and behavioral analytics, an ensemble approach to detection. Expand deep learning training as more data is labeled/classified and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs.
• Generalize deep learning for supervised and unsupervised anomaly and threat detection (insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator. Use best practices from DriveWorks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well.
• While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D.
46. 46
CYBERWORKS HARDWARE
Scale out Cluster
[Diagram: scale-out DGX cluster with NAS, connected to the SIEM via a messaging queue; accessed by notebooks, end users, and 3rd-party apps]
Accelerating your SIEM
47. 47
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics
GPU Data Frame (GDF) pipeline: Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
@gpuoai
Apache Arrow
48. 48
ANACONDA
Python ETL on GPU
Numba: a Python open-source just-in-time
optimizing compiler that uses LLVM to
produce native machine instructions.
Anaconda is the primary contributor to PyGDF.

Dask: a flexible parallel computing
library for analytic computing, with
dynamic task scheduling and big data
collections.
Anaconda is the primary contributor to Dask_GDF.
Jeremy Howard
Deep learning researcher & educator.
Founder: fast.ai; Faculty: USF & Singularity University.
Previously: CEO, Enlitic; President, Kaggle; CEO, Fastmail.
Rewrote @scikit_learn PolynomialFeatures in
@ContinuumIO Numba. Got a 40x speedup (would
be bigger with more data!) 12 lines of code
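A degree-2 polynomial feature expansion like the one in that tweet is a tight numeric loop, which is exactly the shape of code Numba accelerates well. A plain-Python sketch of the kernel (without the JIT decorator; ordering follows the bias / linear / pairwise-product convention):

```python
def poly2_features(row):
    """Degree-2 polynomial expansion of one sample.

    Emits: bias term, the raw features, then all pairwise products
    x_i * x_j for i <= j (squares and cross terms).
    """
    out = [1.0]
    out.extend(row)
    n = len(row)
    for i in range(n):
        for j in range(i, n):
            out.append(row[i] * row[j])
    return out

feats = poly2_features([2.0, 3.0])
# [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]: bias, x0, x1, x0*x0, x0*x1, x1*x1
```

Decorating such a function with Numba's `@njit` compiles the nested loops to native code via LLVM, which is where speedups of the magnitude claimed in the tweet come from on larger inputs.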