The GPU Open Analytics Initiative (GOAI) is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
Predictive Maintenance Using Recurrent Neural Networks - Justin Brandenburg
My presentation from AnacondaCON 2018, where I discussed using recurrent neural networks, Python, TensorFlow, and the MapR Platform to develop and deploy a predictive maintenance model for an IoT device in the manufacturing industry.
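The windowing step behind a sequence model like this can be sketched in plain Python; the function name, sensor values, and window size below are illustrative, not taken from the talk:

```python
def make_windows(series, window, horizon=1):
    """Slice a univariate sensor series into (window, target) training pairs.

    Each input is `window` consecutive readings; the target is the reading
    `horizon` steps after the window ends (the value the RNN learns to predict).
    """
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        pairs.append((series[start:end], series[end + horizon - 1]))
    return pairs

temps = [70, 71, 73, 78, 85, 97]          # readings from one machine sensor
pairs = make_windows(temps, window=3)     # 3 past readings -> next reading
```

An RNN built in TensorFlow would then train on these (window, target) pairs to predict the next reading and flag drift toward failure thresholds.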
Data analytics, Spark, Hadoop, and AI have become fundamental tools to drive digital transformation. A critical challenge is moving from isolated experiments to an organizational or enterprise production infrastructure. In this talk, we break apart the modern data analytics workflow to focus on the data challenges across different phases of the analytics and AI life cycle. By presenting a unified approach to data storage for AI and analytics, organizations can reduce costs, modernize their data strategy, and build a sustainable enterprise data lake. By anticipating how Hadoop, Spark, TensorFlow, Caffe, and traditional analytics like SAS and HPC can share data, IT departments and data science practitioners can not only co-exist but speed time to insight. We will present the tangible benefits of a reference architecture using real-world installations that span proprietary and open-source frameworks. Using intelligent software-defined shared storage, users are able to eliminate silos, reduce multiple data copies, and improve time to insight.
PALLAVI GALGALI, Offering Manager, IBM, and DOUGLAS O'FLAHERTY, Portfolio Product Manager, IBM
Accelerate AI w/ Synthetic Data Using GANs - Renee Yao
Presentation from the Strata Data Conference, September 2018
Description:
Synthetic data will drive the next wave of deployment and application of deep learning in the real world, across problems involving speech recognition, image classification, object recognition, and language. All industries and companies will benefit: synthetic data can create conditions through simulation instead of authentic situations (virtual worlds let you avoid the cost of damages, spare human injuries, and sidestep other risks), and it gives an unparalleled ability to test products, and interactions with them, in any environment.
Join us for this introductory session to learn more about how generative adversarial networks (GANs) are successfully used to improve data generation. We will cover specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs... - Databricks
Risk-management analytics is applied in many fields, especially financial services. We present a framework for accelerated risk analytics and show a large-scale financial-sector application where this framework is used to run backtesting algorithms on risk-based securities such as options. These applications require highly computation-intensive operations on extremely large data sets, with objects numbering in the tens of billions.
An Intel FPGA and the FinLib library for financial applications are used to offload the computation; however, another challenging problem (which we have resolved) is how to feed data to the FPGA at optimal speed without custom coding. A combination of Apache Spark and Levyx’s persistent dataframes addresses this problem: these dataframes allow the computation to be absorbed from Spark and offloaded to FinLib in an automated way. This example can be expanded to many other areas of risk management, such as insurance and cybersecurity.
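As a rough illustration of the per-contract arithmetic a backtest like this offloads, here is a toy long-call sketch in plain Python; the function names and numbers are illustrative and have nothing to do with FinLib's actual API:

```python
def call_pnl(strike, premium, settle_prices):
    """P&L of holding one long call to expiry, for each historical settle price."""
    return [max(s - strike, 0.0) - premium for s in settle_prices]

def backtest(strike, premium, history):
    """Aggregate per-scenario P&Ls into simple backtest statistics."""
    pnls = call_pnl(strike, premium, history)
    return {"total": sum(pnls), "wins": sum(p > 0 for p in pnls)}

result = backtest(strike=100.0, premium=2.0, history=[95.0, 101.0, 104.0])
```

At the scale the abstract describes (tens of billions of objects), exactly this kind of embarrassingly parallel inner loop is what gets pushed to the FPGA.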
Solution Brief: Real-Time Pipeline Accelerator - BlueData, Inc.
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics, leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
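The micro-batch model at the heart of a Spark Streaming pipeline like the ones described above can be illustrated without a cluster; this toy pure-Python sketch (names and event shape hypothetical) groups a stream into fixed-size batches and aggregates each before a downstream write:

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Group an event stream into fixed-size micro-batches, Spark Streaming style."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def count_by_key(stream, batch_size):
    """Per-batch aggregation: count events by key, the kind of stage that would
    feed a Cassandra write in a real pipeline."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(e["key"] for e in batch)
    return dict(totals)

clicks = [{"key": "home"}, {"key": "cart"}, {"key": "home"}, {"key": "home"}]
counts = count_by_key(clicks, batch_size=2)
```

In the real stack, Kafka supplies the unbounded event stream and Spark Streaming performs the batching and aggregation across executors.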
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... - DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science for its users. Because of the relative immaturity of data science platform infrastructure integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced streaming analytics application.
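The deployment-and-discovery pattern described above can be sketched as a registry that looks up a model by name and version and applies it to records streaming past. This is a toy stand-in for illustration only, not Metron's actual API (Metron's discovery is ZooKeeper-backed and its queries go through its DSL):

```python
class ModelRegistry:
    """Toy stand-in for a model-as-a-service registry with name/version lookup."""
    def __init__(self):
        self._models = {}

    def deploy(self, name, version, fn):
        self._models[(name, version)] = fn

    def apply(self, name, version, record):
        return self._models[(name, version)](record)

registry = ModelRegistry()
# a trivial "model": flag any domain longer than 20 characters as suspicious
registry.deploy("dga_detector", "1.0", lambda r: len(r["domain"]) > 20)

scores = [registry.apply("dga_detector", "1.0", r)
          for r in [{"domain": "example.com"},
                    {"domain": "xk3v9q2m4lz8p1w7r5t6y0.biz"}]]
```

The point of the pattern is that the streaming topology never hard-codes a model: it resolves whatever is currently deployed under that name and version.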
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where he looks after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied computer science and bioinformatics, and he is now pursuing an MBA at Imperial.
Undertaking a digital journey starts with clearly articulating the success factors for the entire journey, and our experience from the field has shown this to be an Achilles heel for most CXOs across Fortune 500 organizations. Our findings were corroborated when a McKinsey study reported that only 15% of organizations are able to calculate the ROI of a digital initiative.
In this talk we will walk through demonstrated examples from multi-billion-dollar businesses of proven methodologies to measure the value of a digital enterprise. The panel will share experiences and provide actionable advice for immediate next steps around the following:
Successful metrics for measuring the value of Digital / IoT / AI / machine learning engagements
How 'Digital Traction Metrics' can provide actionable insights even before financial metrics have been reported
The best-in-class organizational constructs and futuristic employee-engagement methods to facilitate the digital revolution
Panelists for this session include:
• Christian Bilien - Head of Global Data at Societe Generale
• Pierre Alexandre Pautrat – Head of Big Data at BPCE/Natixis
• Ronny Fehling – VP, Airbus
• Juergen Urbanski – Silicon Valley Data Science
• Abhas Ricky - EMEA Lead, Innovation & Strategy, Hortonworks
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... - Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into Elasticsearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
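The parse-and-enrich stage described above, performed there by Metron and Storm, can be illustrated with a minimal sketch; the log format, field names, and asset data below are hypothetical:

```python
def parse_event(raw):
    """Split a 'timestamp|source_ip|action' log line into a structured record."""
    ts, ip, action = raw.split("|")
    return {"timestamp": ts, "source_ip": ip, "action": action}

def enrich(record, asset_db):
    """Attach asset-ownership context, as Metron enrichments attach context
    (threat intel, geo, asset data) to each parsed event."""
    record["owner"] = asset_db.get(record["source_ip"], "unknown")
    return record

assets = {"10.0.0.5": "payments-team"}
event = enrich(parse_event("2017-03-14T09:00:00|10.0.0.5|login_failed"), assets)
```

Structured, enriched records like this are what make the downstream Elasticsearch queries and batch analysis useful.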
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their processes, to help cut costs or improve efficiency.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow - Databricks
Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario and 'muting' other people and background noise. This is known as the cocktail party effect. For other people, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets, and smart hearing aids that will run deep neural networks.
Participants will gain insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
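A common baseline behind the separation models listed above is predicting a time-frequency mask and multiplying it onto the mixture spectrogram. Here is a toy ideal-binary-mask sketch in plain Python; real systems operate on STFT magnitudes, and the tiny matrices below are purely illustrative:

```python
def ideal_binary_mask(speech_mag, noise_mag):
    """1 where speech dominates a time-frequency bin, else 0."""
    return [[1 if s > n else 0 for s, n in zip(srow, nrow)]
            for srow, nrow in zip(speech_mag, noise_mag)]

def apply_mask(mixture_mag, mask):
    """Keep only the bins the mask marks as speech-dominated."""
    return [[x * m for x, m in zip(xrow, mrow)]
            for xrow, mrow in zip(mixture_mag, mask)]

speech = [[5, 1], [4, 0]]           # toy magnitude spectrograms
noise = [[1, 3], [2, 6]]
mixture = [[6, 4], [6, 6]]          # speech + noise
mask = ideal_binary_mask(speech, noise)
separated = apply_mask(mixture, mask)
```

A neural separator is trained to predict a mask like this from the mixture alone, since at inference time the clean speech and noise are of course unknown.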
An introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
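The batch-versus-stream distinction in this outline comes down to whether an aggregate is recomputed over a complete dataset or updated incrementally as elements arrive; a minimal sketch of the same statistic computed both ways:

```python
def batch_mean(data):
    """Batch processing: the full dataset is available; compute in one pass."""
    return sum(data) / len(data)

class StreamingMean:
    """Stream processing: the data is unbounded, so keep constant-size state
    and update it per arriving element."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, x):
        self.n += 1
        self.total += x
        return self.total / self.n   # current running mean

readings = [2.0, 4.0, 6.0]
stream = StreamingMean()
running = [stream.update(x) for x in readings]
```

The streaming version never holds the full dataset, which is exactly why unbounded sources require this style of state management.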
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr... - DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage, and get visibility into, the risks around their data. The challenge multiplies when data is moving and shared across different data silos such as on-premise Hadoop and public cloud infrastructures such as AWS, Azure, and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies, and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
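Step one, discovering where sensitive data is stored, often starts with pattern-based scanning. A minimal sketch with illustrative regexes follows; production scanners use far richer rules, context, and validation than these two patterns:

```python
import re

# illustrative detectors only; real classifiers validate matches in context
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    """Return which sensitive-data categories appear in a text blob."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

findings = scan("contact: alice@example.com, ssn 123-45-6789")
```

Running a scanner like this over each silo yields the inventory that the tracking and misuse-detection steps then build on.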
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23) - Jeff Hung
It is a common belief that Hadoop should run on physical servers. However, this requires a huge up-front capital investment with no guarantee of returns, so things usually end up as proving big data with not-that-big data. One approach to work around this dilemma is to run cloud computing in the cloud: with the elasticity that AWS provides, you can spend little but run big! But is it really a good idea? In this talk, we will try to answer that question, based on the results of a one-year journey with a real application and real big data.
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets - Kinetica
Enterprises are now faced with wrangling massive volumes of complex, streaming data from a variety of sources, a new paradigm known as extreme data. However, the traditional data integration model, based on structured batch data and stable data movement patterns, makes it difficult to analyze extreme data in real time. Join Matt Hawkins, Principal Solutions Architect at Kinetica, and Mark Brooks, Solution Engineer at StreamSets, as they share how innovative organizations are modernizing their data stacks with StreamSets and Kinetica to enable faster data movement and analysis.
In this webinar we'll explore:
The modern data architecture required for dealing with extreme data
How StreamSets enables continuous data movement and transformation across the enterprise
How Kinetica harnesses the power of GPUs to accelerate analytics on streaming data
A live demo of the StreamSets-Kinetica connector enabling high-speed data ingestion, queries, and data visualization
Reliable Data Ingestion in Big Data / IoT - Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical Big Data processing). In the past few years, new tools have emerged that are especially capable of handling this process of integrating data from outside, often called data ingestion. From the outside, they look very similar to the traditional enterprise service bus infrastructures that larger organizations often use to handle message-driven and service-oriented systems. But there are important differences: they are typically easier to scale horizontally, offer a more distributed setup, can handle high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets, and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
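One mechanic shared by Flume channels, NiFi connection queues, and Kafka's bounded buffers is a capacity-limited channel between source and sink whose fullness signals backpressure. A toy sketch of that idea (class and method names are illustrative, not any tool's API):

```python
from collections import deque

class BoundedChannel:
    """Bounded buffer between a source and a sink; a full channel tells the
    producer to slow down instead of silently dropping events."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, event):
        if len(self.queue) >= self.capacity:
            return False              # backpressure: caller must retry later
        self.queue.append(event)
        return True

    def poll(self):
        return self.queue.popleft() if self.queue else None

ch = BoundedChannel(capacity=2)
accepted = [ch.offer(e) for e in ("e1", "e2", "e3")]
drained = [ch.poll(), ch.poll(), ch.poll()]
```

The real tools differ mainly in where this buffer lives (in-process, on disk, or in a replicated broker) and how durable it is.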
Building a future-proof cyber security platform with Apache Metron - DataWorks Summit
QSight IT gives you insight into how we use Metron to secure our customers by continuously analyzing and monitoring users, applications, data, and networks. We show you how we implemented Metron as a replacement for our former security platform based on rule-based security. Since we are dealing with a non-conventional use case, serving many customers with one platform, we developed a business classification module that enables us to score threats according to the customer's input.
To be future ready, we are extending this rule-based detection with machine learning models for web defacement, suspicious URLs, UEBA, and many more to come.
In order to provide all the necessary information to the SOC analysts at a glance, we are developing a custom SOC application from which they can handle security alarms, analyze captured data, and have historical data at hand. We regard our new Metron-based security platform as an emerging giant: a future-proof cyber security platform!
Speaker
Bas van de Lustgraaf, Big Data Engineer, QSight IT
Machiel van Tilborg, BI Engineer, QSight IT
Big Data Day LA 2016 / Use Case Driven track - Shaping the Role of Data Scienc... - Data Con LA
At IRIS.TV, our business builds algorithmic solutions for video recommendation, with the end goal of delivering a great user experience, as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy featuring more prescriptive analytics, driven by our data science team.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea... - Spark Summit
Everybody agrees that IoT is changing the world… and creating new challenges for software developers, architects, and DevOps. How can we build efficient and highly scalable distributed applications using open-source technologies? What are the characteristics of data generated by IoT devices, and how do they differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open-source technologies like Apache Spark, Riak, and Mesos to build scalable IoT pipelines that ingest, store, and analyze huge amounts of data while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL, and object storage products for storing and archiving IoT data, then cover best practices for using Spark with the Riak NoSQL database. We will describe how Spark's advanced modules (Spark SQL, Spark Streaming, and MLlib) can solve problems common to IoT apps while using Riak for fast and scalable persistence. Finally, we will explain why Structured Streaming is a godsend for IoT data and make a case for time-series databases deserving a separate category in the NoSQL classification.
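One pattern implied by the time-series argument above is quantizing timestamps into bucket keys so that a device's readings co-locate in storage and a time-range query touches few partitions. A sketch with an illustrative key format (not Riak's actual key scheme):

```python
def bucket_key(device_id, epoch_seconds, bucket_seconds=3600):
    """Partition key: the device plus the hour-aligned bucket its reading
    falls into, so one device-hour of readings lands in one partition."""
    bucket_start = epoch_seconds - (epoch_seconds % bucket_seconds)
    return f"{device_id}:{bucket_start}"

# readings 60 s apart share a bucket; a later one starts a new bucket
keys = [bucket_key("sensor-7", t) for t in (7200, 7260, 10800)]
```

Choosing the bucket width is the design trade-off: wide buckets mean fewer partitions per query, narrow buckets mean better write spreading.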
AI on Spark for Malware Analysis and Anomalous Threat Detection - Databricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats: with thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will compare Spark to other tools we used for solving these problems.
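Anomaly detection on threat-count time series is often baselined with a rolling z-score before reaching for deep models. A pure-Python sketch follows; the window, threshold, and counts are illustrative, not Avast's actual configuration:

```python
import statistics

def zscore_anomalies(counts, window=3, threshold=2.0):
    """Flag points far from the mean of the preceding `window` observations."""
    flags = []
    for i, x in enumerate(counts):
        history = counts[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)       # not enough history yet
            continue
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history) or 1.0   # avoid divide-by-zero
        flags.append(abs(x - mu) / sigma > threshold)
    return flags

hourly_threats = [10, 11, 10, 50, 11]   # one spike among normal counts
flags = zscore_anomalies(hourly_threats)
```

A baseline like this also gives analysts a sanity check on what the neural model's alerts should at minimum catch.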
Big Data Day LA 2016 / NoSQL track - Analytics at the Speed of Light with Redi... - Data Con LA
Spark is in-memory; Redis is in-memory. The Spark-Redis connector gives Spark access to Redis' data structures as RDDs. Redis, with its blazing-fast performance and optimized in-memory data structures, reduces Spark processing time by up to 98%. In this talk, Dave will share the top use cases for Spark-Redis, such as time series, recommendations, and real-time bid management.
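A typical Redis time-series layout stores samples in a sorted set scored by timestamp (ZADD to write, ZRANGEBYSCORE to read a window). The pure-Python stand-in below sketches that pattern without a Redis server; real code would use a client such as redis-py, and the tick data is made up:

```python
import bisect

class SortedSet:
    """Minimal stand-in for a Redis sorted set: members ordered by score."""
    def __init__(self):
        self._items = []              # (score, member) pairs, kept sorted

    def zadd(self, score, member):
        bisect.insort(self._items, (score, member))

    def zrangebyscore(self, lo, hi):
        return [m for s, m in self._items if lo <= s <= hi]

ticks = SortedSet()
for ts, price in [(1000, "99.5"), (1010, "99.7"), (1999, "100.2"), (2050, "100.0")]:
    ticks.zadd(ts, price)

window = ticks.zrangebyscore(1000, 1999)    # all samples in one time window
```

Because the score is the timestamp, range reads map directly onto the time windows that Spark jobs or bid-management logic ask for.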
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speedup, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allow both graph representations and graph-based analytics to achieve similar speedups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
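The RAPIDS graph library mirrors NetworkX-style analytics on the GPU. As a CPU-side illustration of the kind of analytic it accelerates (a sketch in plain Python, not RAPIDS code), here is a minimal connected-components pass over an edge list:

```python
def connected_components(edges):
    """Label each vertex with a representative of its component via union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components
    return {v: find(v) for v in parent}

labels = connected_components([("a", "b"), ("b", "c"), ("x", "y")])
same = labels["a"] == labels["c"]           # a and c share a component
```

On GPUs this same analytic runs over edge lists held in GPU dataframes, which is what lets the whole pipeline (feature extraction, graph analytic, enrichment) stay device-resident.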
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application
built on Apache Storm and Hadoop. One of its core missions is to enable
advanced analytics through machine learning and data science to the
users. Because of the relative immaturity of data science platform
infrastructure integrated into Hadoop that is oriented to streaming
analytics applications, we have been forced to create the requisite
platform components out of necessity, utilizing many of the pieces of
the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and
how it utilizes a custom data science model deployment and autodiscovery
service that is tightly integrated with Hadoop via Yarn and Zookeeper.
We will discuss how we interact with the models deployed there via a
custom domain specific language that can query models as data streams
past. We will generally discuss the full-stack data science tooling that
has been created to enable data science at scale on an advanced analytics
streaming application.
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where is is looking after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.
Undertaking a digital journey starts with clearly articulating the success factors for the entire digital journey, and our experience from the field has shown it to be an Achilles heel for most CXOs, across Fortune 500 organizations. Our findings were corroborated when a Mckinsey study reported that only 15% of the organizations are able to calculate the ROI of a digital initiative.
In this talk we will deliberate on demonstrated examples from multi-billion dollar businesses around proven methodologies to measure the value of a digital enterprise. The panel will share experiences as well as provide actionable advice for immediate next steps around the following:
Successful metrics for measuring the value for Digital / IoT / AI/ Machine learning engagements
How can 'Digital Traction Metrics' help with actionable insights even before the Financial Metrics have been reported
What are the best in-class organizational constructs and futuristic employee engagement methods to facilitate the digital revolution
Panelists for this session include:
• Christian Bilien - Head of Global Data at Societe Generale
• Pierre Alexandre Pautrat – Head of Big Data at BPCE/Nattixis
• Ronny Fehling – VP , Airbus
• Juergen Urbanski – Silicon Valley Data Science
• Abhas Ricky - EMEA Lead, Innovation & Strategy, Hortonworks
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
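The hot/cold split described above (real-time alerting via Elasticsearch, long-term batch storage in S3) can be sketched in a few lines. This is only an illustrative stand-in, not Capital One's code: the field names, the severity rule, and the in-memory lists standing in for Elasticsearch and S3 are all assumptions for the example.

```python
import json

def enrich(raw: str) -> dict:
    """Parse a raw JSON event and attach a derived severity field."""
    event = json.loads(raw)
    event["severity"] = "high" if event.get("failed_logins", 0) > 3 else "low"
    return event

def route(event: dict, hot: list, cold: list) -> None:
    """Every event is archived; only high-severity events hit the hot path."""
    cold.append(event)       # cold path: batch store (ORC files in S3 in the real system)
    if event["severity"] == "high":
        hot.append(event)    # hot path: search/alerting (Elasticsearch in the real system)

hot_path, cold_path = [], []
for raw in ['{"host": "a", "failed_logins": 5}',
            '{"host": "b", "failed_logins": 1}']:
    route(enrich(raw), hot_path, cold_path)

print(len(hot_path), len(cold_path))  # -> 1 2
```

In the real pipeline, Kafka sits between the collection (NiFi) and both consumers, so the hot and cold paths read the same stream independently rather than sharing one loop.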
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their processes, helping them cut costs or improve efficiency.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario and 'muting' other people and background noise. This is known as the cocktail party effect. For others, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about the gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets and smart hearing aids that will be running deep neural networks.
Participants will get insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
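The core idea behind mask-based separation can be shown without a network at all. In this toy sketch (my own illustration, not from the talk), we compute an "ideal ratio mask" directly from two known toy sources and apply it to their mixture; a real system trains a DNN to estimate such a mask in the time-frequency domain from the mixture alone. The frequencies and sample rate are arbitrary choices.

```python
import math

n = 8
# toy "speech" and "noise" signals (both positive over this short window)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(n)]
noise  = [0.5 * math.sin(2 * math.pi * 60 * t / 8000) for t in range(n)]
mix    = [s + v for s, v in zip(speech, noise)]

# ideal ratio mask: per-sample fraction of energy belonging to the target
eps = 1e-12  # avoids division by zero where both sources are silent
mask = [abs(s) / (abs(s) + abs(v) + eps) for s, v in zip(speech, noise)]

# apply the mask to the mixture to recover an estimate of the target
estimate = [m * x for m, x in zip(mask, mix)]

err = sum((e - s) ** 2 for e, s in zip(estimate, speech)) / n
print(f"reconstruction MSE: {err:.6f}")
```

The MSE is essentially zero here because both toy sources share the same sign at every sample; with real audio the ratio mask is an approximation, and the network's job is to predict it well enough from the mixture.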
An introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
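The batch-versus-stream distinction above can be made concrete with a word count: the batch version needs the whole bounded dataset before it answers, while the streaming version maintains a running result that is queryable after every event. A minimal stdlib sketch:

```python
from collections import Counter

events = ["error", "ok", "error", "ok", "error"]

# Batch: one pass over a complete, bounded dataset, answer at the end.
batch_counts = Counter(events)

# Streaming: update incrementally as each event of an unbounded stream arrives.
stream_counts = Counter()
snapshots = []
for e in events:
    stream_counts[e] += 1
    snapshots.append(stream_counts["error"])  # result is queryable at any time

print(batch_counts["error"], snapshots)  # -> 3 [1, 1, 2, 2, 3]
```

Real stream processors add what this sketch omits: windowing, out-of-order event handling, and fault-tolerant state, which are exactly the research issues the line above alludes to.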
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage and get visibility into risks to their data. The challenge multiplies when data is moving and shared across different data silos such as on-premises Hadoop and public cloud infrastructures such as AWS, Azure and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies, and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
It is a common belief that Hadoop should run on physical servers. However, this requires a huge up-front capital investment with no guarantee of returns, so things usually end up as proving big data with not-that-big data. One way to work around this dilemma is to run cloud computing in the cloud: with the elasticity that AWS provides, you can spend little but run big! But is it really a good idea? In this talk, we will try to answer that question, based on the results of a one-year journey with a real application and real big data.
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsKinetica
Enterprises are now faced with wrangling massive volumes of complex, streaming data from a variety of different sources, a new paradigm known as extreme data. However, the traditional data integration model that’s based on structured batch data and stable data movement patterns makes it difficult to analyze extreme data in real time. Join Matt Hawkins, Principal Solutions Architect at Kinetica, and Mark Brooks, Solution Engineer at StreamSets, as they share how innovative organizations are modernizing their data stacks with StreamSets and Kinetica to enable faster data movement and analysis. In this webinar we’ll explore:
The modern data architecture required for dealing with extreme data
How StreamSets enables continuous data movement and transformation across the enterprise
How Kinetica harnesses the power of GPUs to accelerate analytics on streaming data
A live demo of the StreamSets-to-Kinetica connector enabling high-speed data ingestion, queries and data visualization
Reliable Data Ingestion in BigData / IoTGuido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years, new tools have emerged that are especially capable of handling the process of integrating data from outside, often called data ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
Building a future-proof cyber security platform with Apache MetronDataWorks Summit
Qsight IT gives you insight into how we use Metron to secure our customers by continuously analyzing and monitoring users, applications, data, and networks. We show you how we implemented Metron as a replacement for our former security platform based on rule-based security. Since we are dealing with a non-conventional use case, “serving many customers with one platform,” we developed a business classification module that enables us to score threats according to the customer’s input.
To be future ready, we are working on extending this rule-based way of detection with machine learning models for web defacement, suspicious URLs, UEBA, and many more to come.
In order to provide all the necessary information to the SOC analysts at a glance, we are developing a custom SOC application from where they can handle security alarms, analyze captured data, and have historical data at hand. We regard our new Metron based Security Platform as an emerging giant—a future-proof cyber security platform!
Speaker
Bas van de Lustgraaf, Big Data Engineer, QSight IT
Machiel van Tilborg, BI Engineer, QSight IT
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Data Con LA
At IRIS.TV, our business builds algorithmic solutions for video recommendation with the end goal of delivering a great user experience, as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy that features more prescriptive analytics, driven by our data science team.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Spark Summit
Everybody agrees that IoT is changing the world… and creates new challenges for software developers, architects and DevOps. How can we build efficient and highly scalable distributed applications using open-source technologies? What are the characteristics of data generated by IoT devices, and how does it differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open-source technologies, like Apache Spark, Riak and Mesos, to build scalable IoT pipelines to ingest, store and analyze huge amounts of data, while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL and object storage products for storing and archiving IoT data. Then we will cover best practices for using Spark with the Riak NoSQL database, and describe how Apache Spark’s advanced modules (Spark SQL, Spark Streaming and MLlib) can solve problems common to IoT apps, while using Riak for fast and scalable persistence. At the end, we will explain why Spark Structured Streaming is a godsend for IoT data and make a case for time series databases deserving a separate category in the NoSQL classification.
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will compare Spark with other tools we have used for solving these problems.
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
Spark is in-memory; Redis is in-memory. The Spark-Redis connector gives Spark access to Redis' data structures as RDDs. Redis, with its blazing fast performance and optimized in-memory data structures, reduces Spark processing time by up to 98%. In this talk, Dave will share the top use cases for Spark-Redis such as time series, recommendations and real-time bid management.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
End to End Machine Learning Open Source Solution Presented in Cisco Developer...Manish Harsh
The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA® based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
With advances in computer hardware such as 10-gigabit network cards, InfiniBand, and solid-state drives all becoming commodity offerings, the new bottleneck in big data technologies is very commonly the processing power of the CPU. In order to meet the computational demand desired by users, enterprises have had to resort to extreme scale-out approaches just to get the processing power they need. One of the best-known technologies in this space, Apache Spark, has numerous enterprises publicly talking about the challenges of running multiple 1000+ node clusters to give their users the processing power they need. This talk is based on work completed by NVIDIA’s Applied Solutions Engineering team. Attendees will learn how they were able to GPU-accelerate UDFs in PySpark using open source technologies such as Numba and PyGDF, the lessons they learned in the process, and how they were able to accelerate workloads in a fraction of the hardware footprint.
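Much of the speed-up described above comes from replacing a Python call per row with one call over a whole column, which a compiler like Numba can then turn into a GPU kernel. The stdlib sketch below shows only that interface difference, not the actual Numba or PyGDF APIs; the function and column are illustrative.

```python
def udf_per_row(x):
    """Row-at-a-time UDF: one interpreter call per row (the slow pattern)."""
    return x * 2 + 1

def udf_batched(column):
    """Columnar UDF: one call over the whole column. This is the shape of
    work a JIT compiler or GPU kernel wants, since it can vectorize the loop."""
    return [x * 2 + 1 for x in column]

column = list(range(5))
assert [udf_per_row(x) for x in column] == udf_batched(column)
print(udf_batched(column))  # -> [1, 3, 5, 7, 9]
```

In the GPU version, the column lives in device memory (a GPU DataFrame), so the kernel also avoids per-row serialization between the JVM and Python, which is a large part of classic PySpark UDF overhead.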
If you're like most of the world, you're in an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winner in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from petabytes to exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today, and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long-term view.
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline, from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Masters degree in Electrical & Computer Engineering from Georgia Tech and a Bachelors from Clemson University.
NVIDIA and Kinetica presented together about trends in GPU use cases across industries. GPU architecture basics were discussed, along with how GPUs compare with ASICs and FPGAs.
Kinetica presented their GPU-powered in-memory database platform, which provides capabilities for fast analytics, geospatial analytics and a real-time ML/deep learning execution engine.
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
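The in-memory data/compute-grid idea from the bullets above can be sketched in stdlib Python: data lives partitioned in memory, and the computation is shipped to each partition rather than the data to the code. Hazelcast's real API is Java and far richer; the partitions and word-count task here are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Data already resides partitioned across the grid's memory.
partitions = [
    ["big", "data", "fast"],
    ["fast", "data"],
    ["big", "big"],
]

def count_partition(words):
    """The 'map' step: runs where the data already is, no data movement."""
    return Counter(words)

# Each grid member processes its own partition in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_partition, partitions))

# The 'reduce' step merges the small partial results, not the raw data.
total = sum(partials, Counter())
print(total["big"], total["data"], total["fast"])  # -> 3 2 2
```

The contrast with Hadoop in the talk is that the partitions never touch disk between stages, which is what makes iterative and low-latency workloads practical on a grid.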
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is central to expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms only addresses a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports Property Graphs, Knowledge Graphs, Hyper-Graphs, Bipartite graphs, and the basic directed and undirected graph.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, then conclude with a preview of upcoming features, like graph query language support, and the general RAPIDS roadmap.
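cuGraph ingests graphs as edge-list DataFrame columns; a stdlib sketch of the same representation with plain lists shows the idea for a small bipartite (users x items) graph. The column names, vertices, and the degree/neighbor computation are illustrative, not cuGraph's API.

```python
from collections import defaultdict

# Edge list as two parallel "columns", the way a DataFrame would hold it.
src = ["u1", "u1", "u2", "u3"]
dst = ["itemA", "itemB", "itemA", "itemB"]

degree = defaultdict(int)
neighbors = defaultdict(set)
for s, d in zip(src, dst):
    # Treat edges as undirected for this bipartite example.
    degree[s] += 1
    degree[d] += 1
    neighbors[s].add(d)
    neighbors[d].add(s)

print(degree["itemA"], sorted(neighbors["itemB"]))  # -> 2 ['u1', 'u3']
```

The point of the DataFrame-first design is that these same `src`/`dst` columns can be handed unchanged to a machine learning step (cuML, in RAPIDS), with no conversion or copy off the GPU.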
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabaseKinetica
Freed from the constraints of storage, network and memory, many big data analytics systems now routinely reveal themselves to be compute bound. To compensate, big data analytic systems often sprawl horizontally (300-node Spark or NoSQL clusters are not unusual!) to bring in enough compute for the task at hand. High system complexity and crushing operational costs often result. As the world shifts from physical to virtual assets and methods of engagement, there is an increasing need for systems of intelligence to live alongside the more traditional systems of record and systems of analysis. New approaches to data processing are required to support the real-time workloads that drive these systems of intelligence.
Join 451 Research and Kinetica to learn:
•An overview of the business and technical trends driving widespread interest in real-time analytics
•Why systems of analysis need to be transformed and augmented with systems of intelligence bringing new approaches to data processing
•How a new class of solution—a GPU-accelerated, scale out, in-memory database–can bring you orders of magnitude more compute power, significantly smaller hardware footprint, and unrivaled analytic capabilities.
•Hear how other companies in a variety of industries, such as financial services, entertainment, pharmaceutical, and oil and gas, benefit from augmenting their legacy systems with a modern analytics database.
How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...Precisely
In this presentation from Syncsort and Cloudera, you'll learn how to bridge the technical, skill and cost gaps between mainframe and Hadoop. We discuss the top challenges of ingesting and processing mainframe data in Hadoop – and how to solve them.
Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...Steven Totman
Lightning talk from the Hadoop Summit 2013 in Amsterdam covering how Syncsort is helping make Hadoop ready for prime time. It includes the pluggable sort contribution - the impact on sort, join, aggregation, merge and filter in Hadoop - and Syncsort's ability to move mainframe data to Hadoop - Big Iron to Big Data.
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning data required to meet business goals.
As data analytics and data-driven intelligence take centre stage in today’s digital economy, logical data integration across the widest variety of data sources, with a proper security and governance structure in place, has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t by itself solve the problem: you still need to grab changes from the source, push them into Kafka, and consume the data from Kafka for processing. And if something unexpected happens, such as connectivity being lost on either the source or the target side, you don’t want to have to fix things by hand or start over because the data is out of sync.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
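The "don't start over when something breaks" requirement above boils down to checkpointed offsets: commit your position only after a change is processed, so a crash and restart resumes exactly where it left off, with nothing lost and nothing redone. A stdlib sketch of that pattern (the in-memory list stands in for a Kafka topic; the dict stands in for committed consumer offsets, both illustrative):

```python
log = [{"id": i} for i in range(5)]    # stands in for a Kafka topic/changelog

processed = []
offset_store = {"offset": 0}           # stands in for committed offsets

def consume(log, offset_store, crash_at=None):
    while offset_store["offset"] < len(log):
        i = offset_store["offset"]
        if crash_at is not None and i == crash_at:
            raise RuntimeError("connectivity lost")   # simulated failure
        processed.append(log[i]["id"])
        offset_store["offset"] = i + 1  # commit only AFTER successful processing

try:
    consume(log, offset_store, crash_at=3)   # fails mid-stream at record 3...
except RuntimeError:
    pass
consume(log, offset_store)                   # ...and resumes with no rework
print(processed)  # -> [0, 1, 2, 3, 4]
```

Committing after processing gives at-least-once delivery; exactly-once additionally requires the processing and the commit to be atomic, which is the harder problem production CDC tooling solves.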
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
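Reef proves (non-)matching for PCRE-style patterns, and Python's `re` module accepts the same feature set listed above (alternation, ranges, capture groups, lookarounds), so it is a handy way to sanity-check a pattern before committing to a proof. The password rule and email pattern below are my own illustrative examples, not ones from the paper, and of course `re` only matches; it produces no zero-knowledge proof.

```python
import re

# Lookaheads enforce "has a digit AND an uppercase letter"; the character
# class restricts the alphabet; {8,} is a minimum-length bound.
strong = re.compile(r"^(?=.*\d)(?=.*[A-Z])[A-Za-z0-9!?_-]{8,}$")

assert strong.match("Tr0ub4dor_x")
assert not strong.match("alllowercase1")   # no uppercase letter
assert not strong.match("Short1")          # under 8 characters

# Capture groups, as in the redacted-email provenance use case: prove
# something about a captured part without revealing the whole document.
m = re.match(r"From: (\S+)@(\S+)", "From: alice@example.org")
print(m.group(2))  # -> example.org
```

In Reef's password application, the prover shows a committed password matches `strong` without revealing the password itself, which is exactly what a plain regex engine cannot do.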
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
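At its simplest, a testing heatmap scores each area under test by failure likelihood times business impact and ranks the results. The sketch below is only an illustration of that idea; the module names and scores are made up, and Test Manager derives the real values from execution and defect data.

```python
# (failure likelihood, business impact), both on a 0..1 scale -- illustrative
modules = {
    "Order entry": (0.9, 0.8),
    "Invoicing":   (0.4, 0.9),
    "Master data": (0.2, 0.3),
}

# Risk score per module: likelihood x impact.
scores = {name: round(l * i, 2) for name, (l, i) in modules.items()}

# Render a crude ASCII heat bar, hottest first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    bar = "#" * int(score * 10)
    print(f"{name:12} {score:4} {bar}")
```

The ranking is the actionable output: testing effort goes to the top rows first, which is the prioritization the heatmap visualizes.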
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
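One concrete way test techniques shrink the number of tests (and therefore compute and carbon): replace the full combinatorial matrix with "each-choice" coverage, where every parameter value appears in at least one test. This is a simpler cousin of the pairwise techniques the talk alludes to; the parameters below are my own example.

```python
from itertools import product, zip_longest

browsers = ["chrome", "firefox", "safari"]
locales  = ["en", "fi"]
devices  = ["desktop", "mobile"]

# Full cartesian matrix: every combination, 3 * 2 * 2 runs.
full = list(product(browsers, locales, devices))

# Each-choice: zip the value lists together, padding shorter lists with
# their first value, so every individual value is exercised at least once.
each_choice = [
    (b or browsers[0], l or locales[0], d or devices[0])
    for b, l, d in zip_longest(browsers, locales, devices)
]

print(len(full), len(each_choice))  # -> 12 3
```

Each-choice catches faults triggered by a single parameter value; pairwise additionally covers every value *pair* and needs a few more runs, still far fewer than the full matrix.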
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
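Of the triggers mentioned above, the directory watcher is the easiest to demystify: poll a folder, diff against the last snapshot, and fire an action per new file. FME implements this natively; the stdlib sketch below (with an illustrative file name) only shows the logic of a single polling step.

```python
import os
import tempfile

def poll(directory, seen):
    """Return newly arrived files and update the seen-set in place."""
    current = set(os.listdir(directory))
    new = sorted(current - seen)   # files not present at the last poll
    seen |= current
    return new

with tempfile.TemporaryDirectory() as d:
    seen = set()
    poll(d, seen)                               # initial snapshot: empty dir
    open(os.path.join(d, "orders.csv"), "w").close()   # a file "arrives"
    triggered = poll(d, seen)                   # next poll detects it
    print(triggered)  # -> ['orders.csv']
```

Each name returned by `poll` is what the automation passes to its actions, typically as the path parameter of the workspace it runs.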
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
2.
SPARK ECOSYSTEM
The Glue of Big Data
• Spark has become almost synonymous with Hadoop and Big Data
• It is the interface/API for app-to-app communication in big data
• The processing layer for big data and the leading ML framework
3.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
4.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
5.
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 core parts: SQL, Streaming (micro-batched Spark functions), Machine Learning, and Graph
• Spark is currently optimizing its existing code base and adding usability; no GPU support yet
6.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
GPU/Spark In-Memory Processing (5-10x improvement, more code, language rigid, substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
7.
Pre-GPU DATA FRAME
Too Much Glue Code & Lack Of Standards
[Diagram: H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, and MapD on the GPU, with APP A and APP B on the CPU, connected by repeated "Copy & Convert" hops]
• For GPU applications to talk to each other, data must be copied and converted up to three times
• Each company has to build and maintain its own connectors to copy and convert
• Some products wanted direct connectors to other products: fewer hops, but more for them to maintain and develop
• ISVs were always starting from scratch: a barrier to entry and integration
• A standard was needed
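The glue-code explosion described above can be quantified: without a shared format, every ordered pair of applications needs its own copy-and-convert connector, while a standard interchange format needs only one adapter per application. A minimal back-of-the-envelope sketch:

```python
def connectors_without_standard(n_apps):
    # Every ordered pair of apps needs its own copy-and-convert path.
    return n_apps * (n_apps - 1)

def connectors_with_standard(n_apps):
    # Each app needs only one adapter to and from the shared format.
    return n_apps

# Six GPU products (H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, MapD):
print(connectors_without_standard(6))  # 30 pairwise connectors to maintain
print(connectors_with_standard(6))     # 6 adapters with a shared GPU Data Frame
```

This quadratic-versus-linear growth is why the slide concludes that a standard was needed.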
8.
Pre-GPU DATA FRAME
Data Movement Kills Performance
[Chart: the number of data handoffs grows with the volume of data; each handoff adds copy-and-convert overhead]
9.
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: APP A and APP B each load and read data through the CPU, then copy & convert it into their own private GPU data; H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD each maintain a separate GPU data format]
11.
INTEROPERABILITY IN BIG DATA
Lessons Learned From Apache Arrow & Parquet
• Both Apache Arrow and Apache Parquet are compressed columnar storage formats
• Arrow resides in memory, whereas Parquet resides on disk
• There is a major push in the big data world to remove the bottleneck of copying and converting data between systems, which was also a major issue in the GPU world
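The copy-and-convert problem Arrow addresses can be illustrated in plain Python: a contiguous columnar buffer can be handed to another consumer as a zero-copy view instead of being serialized and re-parsed. This is a stdlib-only sketch of the principle; real Arrow buffers apply the same idea in a language-independent format:

```python
from array import array

# A "column" stored contiguously, the way Arrow lays out data in memory.
prices = array("d", [10.5, 11.0, 9.75, 12.25])

# The producer hands out a view of the raw buffer: no copy, no conversion.
view = memoryview(prices)
assert view[2] == 9.75

# The view shares storage with the original column, so an update by the
# producer is visible through the view without any data movement.
prices[2] = 8.0
assert view[2] == 8.0
print(view.nbytes)  # 32: four float64 values in one contiguous buffer
```

The GPU Data Frame applies the same shared-buffer idea to GPU memory, so multiple GPU applications can read one copy of the data.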
12.
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: data is loaded once into a GPU Data Frame in GPU memory, based on Apache Arrow; H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD all read the same data]
13.
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics | @gpuoai
GPU Data Frame (GDF), based on Apache Arrow
Pipeline: Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
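The GDF pipeline above can be sketched as a chain of stages that pass one shared frame along rather than copying data out between tools. This is an illustrative stdlib-only sketch, not GOAI code: the stage names mirror the slide, and the dict-of-lists "frame" stands in for a real GPU Data Frame:

```python
def ingest_parse(raw):
    # Parse CSV-like text into a columnar frame (dict of column -> list).
    rows = [line.split(",") for line in raw.strip().splitlines()]
    return {"x": [float(r[0]) for r in rows],
            "y": [float(r[1]) for r in rows]}

def feature_engineering(frame):
    # Add a derived column in place: the same frame flows on, nothing is
    # copied out to another process or format.
    frame["x_sq"] = [v * v for v in frame["x"]]
    return frame

def train(frame):
    # Trivial "model" (the mean of y), a placeholder for an ML/DL stage.
    return sum(frame["y"]) / len(frame["y"])

frame = ingest_parse("1.0,2.0\n2.0,4.0\n3.0,6.0")
model = train(feature_engineering(frame))
print(model)  # 4.0
```

In GOAI the same hand-off happens on the GPU: each stage reads and extends the Arrow-backed frame in device memory instead of round-tripping through the CPU.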
19.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
GPU/Spark In-Memory Processing (5-10x improvement, more code, language rigid, substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
End to End GPU Processing (GOAI) (25-100x improvement, same code, language flexible, primarily on GPU):
Arrow Read → Query → ETL → ML Train
20.
Expand GPU Usage
More Data, Less Hardware
[Chart: peak double-precision TFLOPS from 2008 to 2017; NVIDIA GPU growth far outpaces x86 CPU]
Scaling up and out with GPU co-processors
21.
ANACONDA
Python ETL for GPU
Numba: an open-source just-in-time optimizing Python compiler that uses LLVM to produce native machine instructions. Anaconda is the primary contributor to PyGDF.
Dask: a flexible parallel computing library for analytic computing, with dynamic task scheduling and big data collections. Anaconda is the primary contributor to Dask_GDF.
Jeremy Howard (deep learning researcher & educator; founder: fast.ai; faculty: USF & Singularity University; previously CEO: Enlitic, president: Kaggle, CEO: Fastmail):
"Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code"
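As an illustration of the kind of loop the tweet describes, degree-2 polynomial feature expansion is just a nested loop over column pairs, exactly the shape of numeric code that Numba's JIT can compile to native speed. This is a NumPy-only sketch of the computation, not the scikit-learn or Numba implementation itself:

```python
import numpy as np

def polynomial_features_deg2(X):
    """Append all pairwise column products (degree-2 terms) to X."""
    n_samples, n_features = X.shape
    cols = [X]
    for i in range(n_features):
        for j in range(i, n_features):
            # Each degree-2 feature is an elementwise product of two columns.
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack(cols)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
out = polynomial_features_deg2(X)
print(out.shape)  # (2, 5): 2 original columns plus x0*x0, x0*x1, x1*x1
```

A decorator such as Numba's @jit can compile exactly this kind of loop, which is how a short pure-Python rewrite can reach large speedups.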
29.
MAPD
MapD Core, MapD Immerse
LLVM Backend: LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures, and can run simultaneously on CPU and GPU.
Streaming: speed eliminates the need to pre-index or aggregate data. Compute resides on GPUs, freeing CPUs to parse and ingest. The newest data can be combined with billions of rows of "near historical" data.
Rendering: data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy, and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
30.
MAPD ARCHITECTURE
Open Source | Commercial
Visualization Libraries: JavaScript libraries, based on DC.js, that let users build custom web-based visualization apps powered by a MapD Core database.
LLVM: MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM-based compiler and run as NVIDIA GPU machine code.
Distributed Scale-out: MapD Core has native distributed scale-out capabilities. Users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions.
High Availability: MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput.
32.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve, as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data from devices requires a new approach that combines all data sources to allow more intelligent, advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, allowing analysts to label and train deep learning models faster and to validate machine learning predictions.
33.
RULES & PEOPLE DON'T SCALE
Current methods are too slow
Right now, financial services reports it takes an average of 98 days to detect an Advanced Threat, but retailers say it can be about seven months.
Once the security community moves beyond the mantras "encrypt everything" and "secure the perimeter," it can begin developing intelligent prioritization and response plans to various kinds of breaches, with a strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical deployment, and making them reliable for large networks. This is where the security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
34.
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank's Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
35.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
36.
MULTI MODEL APPROACH
No Silver Bullet In Cyber Security
nvGRAPH | https://github.com/h2oai/h2o4gpu
[Graph benchmark residue: # edges = E * 2^S, ~34M]
37.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
2. Event management is an accelerated analytics problem: combine all data sources for intelligent threat hunting and exploration at scale across machine data.
38.
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else, and how else, can we use GPU acceleration?
• Dashboards
• Accelerating the data pipeline
• Stream processing
• Building better models faster
• First: GPU databases
Pipeline stages: Data Ingestion, Data Processing, Visualization, Model Training, Inferencing
39.
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Production SIEM of a Fortune 500 enterprise: 450+ columns, ~250 million events per day
SIEM vs Big Data solution: 10-node cluster, ~$60k in hardware
Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
40.
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM

Typical Scenario                                          | Time Period | SIEM          | Big Data | Speed Up
1 Show all network communication from one host (IP)       | 1 Day       | 3h 20m 13s    | 1m 44s   | 114 times faster
  to multiple hosts (IPs)                                 | 1 Week      | Not Feasible* | 4m 05s   |
2 Retrieve failed logon attempts in Active Directory      | 1 Day       | 18m 26s       | 1m 37s   | 10 times faster
                                                          | 1 Week      | 2h 13m 45s    | 3m 10s   | 41 times faster
3 Search for malware (exe) in Symantec logs               | 1 Day       | 3h 24m 36s    | 1m 37s   | 125 times faster
                                                          | 1 Week      | Not Feasible* | 3m 22s   |
4 View all proxy logs for a specific domain               | 1 Day       | 4h 30m 13s    | 2m 54s   | 92 times faster
                                                          | 1 Week      | Not Feasible* | 1m 09s** |

Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
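The speed-up column follows directly from the raw times in the table. A small helper (hypothetical, stdlib-only) shows the arithmetic for scenario 1:

```python
def to_seconds(t):
    """Parse a duration like '3h 20m 13s' into total seconds."""
    total = 0
    for part in t.split():
        unit = part[-1]          # trailing h / m / s
        value = int(part[:-1])
        total += value * {"h": 3600, "m": 60, "s": 1}[unit]
    return total

def speedup(siem_time, bigdata_time):
    return to_seconds(siem_time) / to_seconds(bigdata_time)

# Scenario 1, one day of data: 3h 20m 13s vs 1m 44s
print(speedup("3h 20m 13s", "1m 44s"))  # roughly 115x, in line with the table
```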
41.
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
[Bar chart: query time in milliseconds for Queries 1-4]
                 Query 1   Query 2   Query 3   Query 4
MapD DGX-1            21        80       150       372
MapD 4 x P100         30        99       269       696
Redshift 6-node     1560      1250      2250      2970
Spark 11-node      10190      8134     19624     85942
Source: MapD benchmarks on DGX from internal NVIDIA testing following the guidelines of Mark Litwintschik's blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster; Spark 2.1, 11 x m3.xlarge cluster with HDFS
42.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
2. Event management is an accelerated analytics problem: combine all data sources for intelligent threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, allowing analysts to label and train deep learning models faster and to validate machine learning predictions.
44.
DATA PLATFORM-AS-A-SERVICE
SCALE
• Handles 1M events/second
• Auto-scales the cluster
HIGH AVAILABILITY
• Offers HA with no data loss
• Always-on architecture
• Data replication
SECURITY
• Data platform security implemented with VPCs in AWS
• Dashboard access using NVIDIA LDAP
SELF SERVICE
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
50.
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• 1/10th the hardware
• 1-2 orders of magnitude more performance
51.
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• Real-time visualization of 100K+ nodes and 1M+ edges
• 50-100x faster clustering than other solutions
52.
LISTS DO NOT VISUALLY SCALE
Text search is a great starting point, but it does not scale: you do not see the 30K+ events, the IPs, the users, or how they relate.
54.
GRAPHS: A KEY MISSING VIEW
Unified Model
• Shows entities, events, and relationships
• Multipurpose: connect, see, interact
Visual
• Inspect individual items
• See behavior, patterns, and outliers
• Scale to enterprise workloads
55.
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
• Uni-graph (one node type: ip-ip). Example: network mapping. "Is it safe to reboot this?"
• Hypergraph (events link many entities: event, user, ip). Example: incident response. "Did this escalate?"
• Multi-graph (multiple node types: ip, user). Example: SSH trails. "Is a user crossing zones?"
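To make the multi-graph concrete, a tiny stdlib-only sketch (with hypothetical node names) models SSH logins as user-to-host edges tagged with the host's zone, and answers the slide's "is a user crossing zones?" question:

```python
from collections import defaultdict

# Each SSH login is an edge: (user, host, zone of host). Hypothetical data.
logins = [
    ("alice", "web-01", "dmz"),
    ("alice", "db-01", "internal"),
    ("bob", "web-02", "dmz"),
]

def users_crossing_zones(edges):
    zones_by_user = defaultdict(set)
    for user, host, zone in edges:
        zones_by_user[user].add(zone)
    # A user whose logins touch more than one zone may indicate
    # lateral movement worth a closer look.
    return {u for u, zones in zones_by_user.items() if len(zones) > 1}

print(users_crossing_zones(logins))  # {'alice'}
```

At enterprise scale the same query runs over millions of edges, which is where GPU graph analytics and visual tools like Graphistry come in.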
57.
CYBERWORKS
CYBERWORKS SIEM SDK: Purpose Built SDK For SIEM Analytics
Purpose
A platform to let analysts hunt and analyze data at greater scale and speed than traditional big data, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and deep learning frameworks.
Goals
• Open source ecosystem & select ISVs
• Integration points with leading security vendors: FireEye, Splunk, Palo Alto Networks
58.
CYBERWORKS ACTIVITIES
Continuous Improvement
• Use GPU-accelerated databases to improve hunting today, and to enrich and label data for deep learning.
• Connect accelerated databases to Splunk for event management, hunting, and exploration; use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and to connect or bolt on GPU databases and Graphistry.
• Use ML and graph analytics for feature extraction and behavioral analytics: an ensemble approach to detection. Expand deep learning training as more data is labeled and classified and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs.
• Generalize deep learning for supervised and unsupervised anomaly and threat detection (insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator, using best practices from DriveWorks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well.
• While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D.