This document discusses accelerating Python user-defined functions (UDFs) in PySpark with Numba and PyGDF. Data movement between the JVM and Python workers is currently a major bottleneck for PySpark Python UDFs, because rows must be serialized and deserialized as they cross the process boundary. With Apache Arrow, data can instead be transferred in a columnar format without per-row serialization, improving performance. PyGDF goes further by enabling UDFs that operate directly on GPU data frames, with Numba compiling the UDF for additional acceleration. This allows GPUs to be leveraged to optimize complex UDFs in PySpark. Future work includes optimizing joins in PyGDF and supporting distributed GPU processing.
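The serialization argument above can be sketched with a toy model, assuming nothing about the real PySpark internals: the `udf`, the data, and the use of `pickle` as a stand-in for the JVM/Python serialization boundary are all illustrative. The row-at-a-time path pays a serialization round trip per row, while the Arrow-style path moves one batch across the boundary and applies the UDF to the whole column.

```python
import pickle

# Hypothetical UDF: increment a value by one.
def udf(x):
    return x + 1

rows = list(range(1000))

def row_at_a_time(rows):
    """Pre-Arrow style: every row is serialized and deserialized individually."""
    out = []
    for r in rows:
        payload = pickle.dumps(r)          # one serialization round trip per row
        out.append(udf(pickle.loads(payload)))
    return out

def columnar(rows):
    """Arrow-style: a single batch crosses the boundary, then the UDF runs per element."""
    batch = pickle.dumps(rows)             # one serialization for the whole column
    col = pickle.loads(batch)
    return [udf(x) for x in col]

# Both paths compute the same result; the columnar path amortizes the transfer cost.
assert row_at_a_time(rows) == columnar(rows)
```

In real PySpark the same idea appears as Arrow-backed batch transfer between the JVM and the Python worker; PyGDF extends it by keeping the batch on the GPU so the compiled UDF never has to copy data back to the host.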