
GPU Computing with Apache Spark and Python

Talk at the GPU Technology Conference in San Jose, CA on April 5 by Numba team contributors Stan Seibert and Siu Kwan Lam.

We'll demonstrate how Python and the Numba JIT compiler can be used for GPU programming that easily scales from your workstation to an Apache Spark cluster. Using an example application, we show how to write CUDA kernels in Python, compile and call them using the open source Numba JIT compiler, and execute them both locally and remotely with Spark. We also describe techniques for managing Python dependencies in a Spark cluster with the tools in the Anaconda Platform. Finally, we conclude with some tips and tricks for getting best performance when doing GPU computing with Spark and Python.

  1. © 2015 Continuum Analytics - Confidential & Proprietary
     GPU Computing with Apache Spark and Python
     Stan Seibert, Siu Kwan Lam
     April 5, 2016
  2. My Background
     • Trained in particle physics
     • Using Python for data analysis for 10 years
     • Using GPUs for data analysis for 7 years
     • Currently lead the High Performance Python team at Continuum
  3. About Continuum Analytics
     • We give superpowers to people who change the world!
     • We want to help everyone analyze their data with Python (and other tools), so we offer:
       • Enterprise Products
       • Consulting
       • Training
       • Open Source
  4. • I'm going to use Anaconda throughout this presentation.
     • Anaconda is a free Mac/Win/Linux Python distribution:
       • Based on conda, an open source package manager
       • Installs both Python and non-Python dependencies
       • Easiest way to get the software I will talk about today
     • https://www.continuum.io/downloads
  5. Overview
     1. Why Python?
     2. Numba: A Python JIT compiler for the CPU and GPU
     3. PySpark: Distributed computing for Python
     4. Example: Image Registration
     5. Tips and Tricks
     6. Conclusion
  6. WHY PYTHON?
  7. Why is Python so popular?
     • Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists
     • Great community:
       • Lots of tutorial and reference materials
       • Easy to interface with other languages
       • Vast ecosystem of useful libraries
  8. (image slide)
  9. … But, isn't Python slow?
     • Pure, interpreted Python is slow.
     • Python excels at interfacing with other languages used in HPC:
       • C: ctypes, CFFI, Cython
       • C++: Cython, Boost.Python
       • FORTRAN: f2py
     • Secret: Most scientific Python packages put the speed-critical sections of their algorithms in a compiled language.
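As a concrete illustration of the ctypes interfacing mentioned above, a minimal sketch (the choice of `sqrt` and the fallback logic are ours, not from the slides):

```python
import ctypes
import ctypes.util

# Load the C math library; fall back to symbols already linked into the
# running process if find_library cannot resolve a name on this platform.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# Declare the C signature so ctypes converts Python floats correctly.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

result = libm.sqrt(2.0)  # calls the compiled C sqrt directly
print(result)
```

This is the "tedious boilerplate" the next slide complains about: every C function needs its types declared by hand.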
  10. Is there another way?
      • Switching languages for speed in your projects can be a little clunky:
        • Sometimes tedious boilerplate for translating data types across the language barrier
        • Generating compiled functions for the wide range of data types can be difficult
      • How can we use cutting-edge hardware, like GPUs?
  11. NUMBA: A PYTHON JIT COMPILER
  12. Compiling Python
      • Numba is an open-source, type-specializing compiler for Python functions
      • Can translate Python syntax into machine code if all type information can be deduced when the function is called
      • Implemented as a module. Does not replace the Python interpreter!
      • Code generation done with:
        • LLVM (for CPU)
        • NVVM (for CUDA GPUs)
  13. Supported Platforms
      OS: Windows (7 and later), OS X (10.9 and later), Linux (~RHEL 5 and later)
      HW: 32- and 64-bit x86 CPUs, CUDA-capable NVIDIA GPUs, HSA-capable AMD GPUs, experimental support for ARMv7 (Raspberry Pi 2)
      SW: Python 2 and 3, NumPy 1.7 through 1.10
  14. How Does Numba Work?
      Python function (bytecode) + function arguments → bytecode analysis → type inference → Numba IR → rewrite IR → lowering → LLVM IR → LLVM/NVVM JIT → machine code (cached) → Execute!
      @jit
      def do_math(a, b): …
      >>> do_math(x, y)
  15. Numba on the CPU
  16. Numba on the CPU (annotated example)
      • Numba decorator (nopython=True not required)
      • Array allocation
      • Looping over ndarray x as an iterator
      • Using numpy math functions
      • Returning a slice of the array
      • 2.7x speedup!
  17. CUDA Kernels in Python
  18. CUDA Kernels in Python (annotated example)
      • Decorator will infer the type signature when you call it
      • NumPy arrays have the expected attributes and indexing
      • Helper function to compute blockIdx.x * blockDim.x + threadIdx.x
      • Helper function to compute blockDim.x * gridDim.x
  19. Calling the Kernel from Python
      Works just like CUDA C, except Numba handles allocating and copying data to/from the host if needed
  20. Handling Device Memory Directly
      Memory allocation matters in small tasks.
  21. Higher Level Tools: GPU Ufuncs
  22. Higher Level Tools: GPU Ufuncs (annotated example)
      • Decorator for creating a ufunc
      • List of supported type signatures
      • Code generation target
  23. GPU Ufuncs Performance
      4x speedup incl. host<->device round trip on a GeForce GT 650M
  24. Accelerate Library Bindings: cuFFT
      >2x speedup incl. host<->device round trip on a GeForce GT 650M (vs. MKL-accelerated FFT)
  25. PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON
  26. What is Apache Spark?
      • An API and an execution engine for distributed computing on a cluster
      • Based on the concept of Resilient Distributed Datasets (RDDs)
        • Dataset: Collection of independent elements (files, objects, etc.) in memory from previous calculations, or originating from some data store
        • Distributed: Elements in RDDs are grouped into partitions and may be stored on different nodes
        • Resilient: RDDs remember how they were created, so if a node goes down, Spark can recompute the lost elements on another node
  27. Computation DAGs
      Fig from: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
  28. How does Spark Scale?
      • All cluster scaling is about minimizing I/O. Spark does this in several ways:
        • Keep intermediate results in memory with rdd.cache()
        • Move computation to the data whenever possible (functions are small and data is big!)
        • Provide computation primitives that expose parallelism and minimize communication between workers: map, filter, sample, reduce, …
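Those primitives have direct local analogues; a plain-Python sketch of the same map/filter/reduce pipeline that Spark distributes (the pipeline itself is our illustration):

```python
from functools import reduce

data = range(10)

# Each stage describes a transformation; on a cluster Spark would run the
# map and filter partition-by-partition and only communicate for the reduce.
squares = map(lambda x: x * x, data)
evens = filter(lambda x: x % 2 == 0, squares)
total = reduce(lambda a, b: a + b, evens)
print(total)
```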
  29. Python and Spark
      • Spark is implemented in Java & Scala on the JVM
      • Full API support for Scala, Java, and Python (+ limited support for R)
      • How does Python work, since it doesn't run on the JVM? (not counting IronPython)
  30. PySpark architecture: on the client, Python drives the Java Spark Context through Py4J; on the cluster, each Java Spark Worker launches a Python interpreter to execute the Python code.
  31. Getting Started
      • conda can create a local environment for Spark for you:
        conda create -n spark -c anaconda-cluster python=3.5 spark numba cudatoolkit ipython-notebook
        source activate spark
        # Uncomment below if on Mac OS X
        # export JRE_HOME=$(/usr/libexec/java_home)
        # export JAVA_HOME=$(/usr/libexec/java_home)
        IPYTHON_OPTS="notebook" pyspark  # starts jupyter notebook
  32. Using Numba (CPU) with Spark
  33. Using Numba (CPU) with Spark (annotated example)
      • Load arrays onto the Spark workers
      • Apply my function to every element in the RDD and return the first element
  34. Using CUDA Python with Spark (annotated example)
      • Define the CUDA kernel; compilation happens here
      • Wrap the CUDA kernel launching logic
      • Create a Spark RDD (8 partitions)
      • Apply gpu_work on each partition
  35. Kernel compilation workflow: the client compiles the CUDA Python kernel to LLVM IR and then PTX, and serializes both to the cluster. Workers deserialize them, use the serialized PTX if their CUDA architecture matches the client's (recompiling if it differs), and finalize the PTX into a CUDA binary.
  36. This happens on every worker process.
  37. EXAMPLE: IMAGE REGISTRATION
  38. Basic Algorithm
      • Group similar images (unsupervised kNN clustering)
      • Attempt image registration on every pair in each group (phase correlation based; FFT heavy)
      (flowchart labels: image set, unused images, new images, Progress?)
  39. Basic Algorithm (annotated)
      • Image registration: 2D-FFT heavy, expensive, mostly running on the GPU
      • Clustering: reduces the number of pairwise image registration attempts, runs on the CPU
  40. Cross Power Spectrum
      Core of the phase correlation based image registration algorithm
  41. Cross Power Spectrum (annotated example)
      • cuFFT
      • explicit memory transfer
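The cross power spectrum of slides 40-41 can be sketched in NumPy (standing in for cuFFT); the test image and the known shift are our own illustration:

```python
import numpy as np

def cross_power_spectrum(a, b):
    # Normalized cross power spectrum: keeps only the phase difference
    # between the two images' spectra.
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    prod = fa * np.conj(fb)
    return prod / np.maximum(np.abs(prod), 1e-15)  # guard against divide-by-zero

rng = np.random.default_rng(0)
img = rng.random((32, 32))
shifted = np.roll(img, shift=(3, 5), axis=(0, 1))  # known translation to recover

# The inverse FFT of the cross power spectrum peaks at the relative shift.
corr = np.fft.ifft2(cross_power_spectrum(shifted, img)).real
dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
print(dy, dx)
```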
  42. Scaling on Spark
      • Random partitioning of the image set: rdd.repartition(num_parts)
      • Apply ImgReg on each partition: rdd.mapPartitions(imgreg_func)
      • Machines have multiple GPUs
      • Each worker computes a partition at a time
  43. CUDA Multi-Process Service (MPS)
      • Sharing one GPU between multiple workers can be beneficial
      • nvidia-cuda-mps-control
      • Better GPU utilization from multiple processes
      • For our example app: 5-10% improvement with MPS
      • Effect can be bigger for apps with higher GPU utilization
  44. Scaling with Multiple GPUs and MPS
      • 1 CPU worker: 1 Tesla K20
      • 2 CPU workers: 2 Tesla K20
      • 4 CPU workers: 2 Tesla K20 (2 workers per GPU using MPS)
      • Spark + Dask = external process performing the GPU calculation
  45. Spark vs CUDA
      • Spark logic is similar to CUDA host logic
      • mapPartitions is like a kernel launch
      • Spark network transfer overhead vs. PCI-Express transfer overhead
      • Partitions are like CUDA blocks
      • Work cooperatively within a partition
  46. TIPS AND TRICKS
  47. Parallelism is about communication
      • Be aware of the communication overhead when deciding how to chunk your work in Spark.
      • Bigger chunks: risk of underutilization of resources
      • Smaller chunks: risk of computation swamped by overhead
      ➡ Start with big chunks, then move to smaller ones if you fail to reach full utilization.
  48. Start Small
      • Easier to debug your logic on your local system!
      • Run a Spark cluster on your workstation where it is easier to debug and profile.
      • Move to the cluster only after you've seen things work locally. (Spark makes this fairly painless.)
  49. Be interactive with Big Data!
      • Run a Jupyter notebook on a Spark + multi-GPU cluster
      • Live experimentation helps you develop and understand your progress
      • Spark keeps track of intermediate data and recomputes it if it is lost
  50. Amortize Setup Costs
      • Make sure that one-time costs are actually done once:
        • GPU state (like FFT plans) can be slow to initialize. Do this at the top of your mapPartitions call and reuse it for each element in the RDD partition.
        • Coarser chunking of tasks allows GPU memory allocations to be reused between steps inside a single task.
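The mapPartitions setup pattern can be mocked locally without a cluster; `make_fft_plan` is a hypothetical stand-in for expensive GPU state such as an FFT plan:

```python
def make_fft_plan():
    # Hypothetical expensive one-time setup (e.g. a cuFFT plan).
    make_fft_plan.calls += 1
    return {"plan": "ready"}
make_fft_plan.calls = 0

def process_partition(iterator):
    plan = make_fft_plan()       # paid once per partition, not once per element
    for element in iterator:
        yield (element, plan["plan"])

# Locally this mirrors what rdd.mapPartitions(process_partition)
# does to a single partition.
out = list(process_partition(iter([10, 20, 30])))
print(out, make_fft_plan.calls)
```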
  51. Be(a)ware of Global State
      • Beware that PySpark may (quite often!) spawn new processes
      • New processes may have the same initial random seed
        • The Python random module must be seeded properly in each newly spawned process
      • Use memoization
        • Example: use a global dictionary to track shared state and remember performed initializations
  52. Spark and Multi-GPU
      • Spark is not aware of GPU resources
      • Efficient usage of GPU resources requires some workarounds, e.g.:
        • Option 1: random GPU assignment: numba.cuda.select_device(random.randrange(num_of_gpus))
        • Option 2: use mapPartitionsWithIndex: selected_gpu = partition_index % num_of_gpus
        • Option 3: manage GPUs externally, like CaffeOnSpark
  53. © 2015 Continuum Analytics - Confidential & Proprietary
      • GPU assignment by partition index
      • Delegate GPU work to an external GPU-aware process
      • >2x speedup
  54. CONCLUSION
  55. PySpark and Numba for GPU clusters
      • Numba lets you create compiled CPU and CUDA functions right inside your Python applications.
      • Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs.
      • There is room for improvement in how Spark interacts with the GPU, but things do work.
      • Beware of accidentally multiplying fixed initialization and compilation costs.
