Data Lake to GPUs
@blazingdb
CPUs can no longer handle the growing data demands
of data science workloads
Suboptimal Infrastructure: hundreds to tens of thousands of CPU servers are needed in data centers.
Slow Process: preparing data and training models can take days or even weeks.
GPUs are well known for accelerating the training of
machine learning and deep learning models.
Deep Learning (Neural Networks) · Machine Learning
Performance improvements increase at scale: up to a 40x improvement over CPU.
An end-to-end analytics solution on GPUs.
Expertise: GPU DBMS · GPU Columnar Analytics · Data Lakes
Expertise: CUDA · Machine Learning · Deep Learning
Expertise: Python · Data Science · Machine Learning
Expertise: Graph Analytics · Visualization · Apache Arrow
RAPIDS, the end-to-end GPU analytics ecosystem
cuDF — Data Preparation
cuML — Machine Learning
cuGraph — Graph Analytics
Data Preparation · Model Training · Visualization
A set of open source libraries for GPU-accelerated data preparation and machine learning.
In GPU Memory
RAPIDS, the end-to-end GPU analytics ecosystem
cuDF — Data Preparation
cuML — Machine Learning
cuGraph — Graph Analytics
Data Preparation · Model Training · Visualization
A set of open source libraries for GPU-accelerated data preparation and machine learning.
In GPU Memory · Cross language · Zero-copy reads · Columnar
Apache Arrow
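Arrow's in-memory columnar layout is what lets these libraries hand data to one another without serialization. As a rough pure-Python sketch of why a columnar layout matters for analytic scans (illustrative only — this is the idea behind Arrow's format, not its implementation):

```python
# Illustrative sketch: row-oriented vs. columnar storage for an analytic scan.
# (Not Arrow's actual implementation; record and field names are made up.)

# Row-oriented: each record is stored together, like a row in a CSV.
rows = [
    {"event_id": 1, "bytes": 1200, "device": "fw-01"},
    {"event_id": 2, "bytes": 800,  "device": "fw-02"},
    {"event_id": 3, "bytes": 4096, "device": "fw-01"},
]

# Columnar: each column is one contiguous array, so a scan over a single
# column touches only that column's memory -- which also vectorizes well
# on GPUs and allows zero-copy sharing between libraries.
columns = {
    "event_id": [1, 2, 3],
    "bytes":    [1200, 800, 4096],
    "device":   ["fw-01", "fw-02", "fw-01"],
}

# Summing one column: the row layout must walk every full record...
total_row_layout = sum(r["bytes"] for r in rows)
# ...while the columnar layout reads one contiguous array.
total_columnar = sum(columns["bytes"])

assert total_row_layout == total_columnar
```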
RAPIDS, the end-to-end GPU analytics ecosystem
cuDF — Data Preparation
cuML — Machine Learning
cuGraph — Graph Analytics
Data Preparation · Model Training · Visualization
A set of open source libraries for GPU-accelerated data preparation and machine learning.
In GPU Memory · GPU Compute Kernels · Pandas-like API
CUDA DataFrame (cuDF)
cuDF — Data Preparation
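The "Pandas-like API" point means existing pandas code ports to cuDF largely by swapping the import. A minimal sketch using pandas itself to show the shape of the API (with RAPIDS installed, `cudf` exposes the same calls executed on the GPU; the loan data here is made up):

```python
import pandas as pd  # with RAPIDS, `import cudf` gives the GPU analogue

# Hypothetical loan data, loosely mirroring the 'performance' table
# queried in the BlazingSQL example later in this deck.
df = pd.DataFrame({
    "loan_id": [101, 102, 103, 104],
    "state":   ["CA", "TX", "CA", "NY"],
    "balance": [250_000.0, 180_000.0, 320_000.0, 150_000.0],
})

# The same groupby/aggregation calls exist on a cuDF DataFrame.
by_state = df.groupby("state")["balance"].sum()
print(by_state["CA"])  # -> 570000.0
```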
BlazingSQL, a GPU-accelerated SQL engine
cuDF — Data Preparation · cuML — Machine Learning · cuGraph — Graph Analytics
A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast, with full interoperability with the RAPIDS stack.
In GPU Memory
from pyblazing import BlazingContext
bc = BlazingContext()

# Register filesystem
bc.hdfs('data', host='129.13.0.12', port=54310)

# Create table
bc.table('performance',
         file_type='parquet',
         path='hdfs://data/performance/')

# Execute query
result_gdf = bc.sql('SELECT * FROM performance WHERE YEAR(maturity_date) > 2005')
print(result_gdf)
BlazingSQL + Graphistry Netflow Analysis
Visually analyze the VAST netflow data set inside Graphistry to quickly detect anomalous events.
Netflow Data → ETL → Visualization
65M Events
2 Weeks
1,440 Devices
Benchmarks
Netflow Demo Timings (Load & ETL)
Benchmarks
Netflow Demo Timings (ETL Only)
Upcoming BlazingSQL Releases
V0.1 — Query GDFs (external dependency): use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the cuDF API.
V0.2 — Direct Query Flat Files: integrate the FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
V0.3 — String Support: string support and string operation support.
V0.4 — Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.
V0.5 — Physical Plan Optimizer: partition culling for WHERE clauses and joins.
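The partition-culling item works by checking a WHERE predicate against per-partition min/max statistics (of the kind Parquet footers record) and skipping partitions that cannot possibly match. A simplified sketch of the idea, using the deck's `YEAR(maturity_date) > 2005` query (hypothetical helper and file names — not BlazingSQL's actual planner code):

```python
# Simplified sketch of partition culling: skip any partition whose
# min/max statistics prove it cannot satisfy the predicate.
# (Illustrative only; file names and stats are invented.)

partitions = [
    {"path": "part-0.parquet", "min_year": 1999, "max_year": 2003},
    {"path": "part-1.parquet", "min_year": 2004, "max_year": 2008},
    {"path": "part-2.parquet", "min_year": 2009, "max_year": 2015},
]

def cull(parts, lower_bound):
    """Keep only partitions that may contain rows with year > lower_bound."""
    return [p for p in parts if p["max_year"] > lower_bound]

# For WHERE YEAR(maturity_date) > 2005, part-0 is never read at all.
to_scan = cull(partitions, 2005)
print([p["path"] for p in to_scan])  # -> ['part-1.parquet', 'part-2.parquet']
```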
Scale Up & Out
Workload sizes: 5 GB · 32 GB · ~200 GB · ~400 GB · ~10 TB
Hardware: GCP (1× T4 GPU) · GCP (4× T4 GPUs) · DGX-1 (8× V100 GPUs) · DGX-2 (16× V100 GPUs) · DGX-2 (16× V100 GPUs, with the Distributed Scheduler (V0.4) and Physical Plan Optimizer (V0.5))
Get Started
BlazingSQL is quick to get up and running using either
DockerHub or Conda:
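The commands themselves were not captured in this export; the following is an illustrative sketch only — the Docker image name and conda channel/package names are assumptions, so check the current BlazingSQL documentation for the exact ones:

```shell
# Docker (image name is an assumption; see the project's DockerHub page)
docker pull blazingdb/blazingsql
docker run -it --runtime=nvidia blazingdb/blazingsql

# Conda (channel and package names are assumptions)
conda install -c blazingsql blazingsql
```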

BlazingSQL & Graphistry - Netflow Demo
