Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library

GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulations. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as Pandas, NumPy, scikit-learn, and XGBoost. Through its Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes.

The goal of this talk is to discuss RAPIDS: its functionality, its architecture, and the way it integrates with Spark, on many occasions providing several orders of magnitude of acceleration versus its CPU-only counterparts.

  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Miguel Martínez, NVIDIA Thomas Graves, NVIDIA Accelerating Apache Spark with RAPIDS.ai #UnifiedDataAnalytics #SparkAISummit
  3. 3. 3#UnifiedDataAnalytics #SparkAISummit
  4. 4. 4#UnifiedDataAnalytics #SparkAISummit Miguel Martínez. NVIDIA. Deep Learning Solution Architect Thomas Graves. NVIDIA. Distributed Systems Software Engineer – What is RAPIDS – How to start – cuDF – cuML – cuGraph – XGBoost – Summary – Coming up next – XGBoost4j-Spark Deep Dive – Spark GPU Scheduling – Spark Stage Level Scheduling – Spark SQL with Rapids – Summary
  5. 5. 5#UnifiedDataAnalytics #SparkAISummit WHAT IS RAPIDS
  6. 6. 6#UnifiedDataAnalytics #SparkAISummit Data Preparation Visualization Model Training GPU Accelerated End-to-End Data Science RAPIDS is a set of open source libraries for GPU accelerating the end-to-end data science and analytics pipelines. rapids.ai In GPU Memory cuXFilter Visualization cuML Machine Learning cuGraph Graph Analytics Deep Learning cuDF Analytics
  7. 7. 7#UnifiedDataAnalytics #SparkAISummit cuDF • GPU-accelerated data preparation and feature engineering • Python drop-in Pandas replacement cuML • GPU-accelerated traditional machine learning libraries • XGBoost, PCA, Kalman, K-means, k-NN, DBScan, tSVD… cuGraph • GPU-accelerated graph analytics libraries cuXfilter • Web Data Visualization library • DataFrame kept in GPU-memory throughout the session
  8. 8. 8#UnifiedDataAnalytics #SparkAISummit LEARNING FROM Apache Arrow. Without Arrow (Pandas, Spark, Drill, Impala, Parquet, Cassandra, Kudu, HBase each copy & convert data): each system has its own internal memory format, 70-80% of computation is wasted on serialization & deserialization, and similar functionality is implemented in multiple projects. With Arrow memory: all systems utilize the same memory format, there is no overhead for cross-system communication, and projects can share functionality. Source: https://bit.ly/apachearrow
  9. 9. 9#UnifiedDataAnalytics #SparkAISummit HOW TO START
  10. 10. 10#UnifiedDataAnalytics #SparkAISummit On-premises In the cloud https://github.com/rapidsai Source code on GitHub https://ngc.nvidia.com Containers on NGC & Docker Hub https://anaconda.org/rapidsai Conda packages
  11. 11. 11#UnifiedDataAnalytics #SparkAISummit A step-by-step installation guide (MS Azure) 1. Create a NC6s_v2 virtual machine, and select NVIDIA GPU Cloud Image for Deep Learning and HPC. 2. Connect to the virtual machine: $ ssh -L 8080:localhost:8888 -L 8787:localhost:8787 username@public_ip_address 3. Pull the RAPIDS container from NGC. Run it. $ docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04 $ docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04 4. Run JupyterLab: (rapids)$ bash /rapids/notebooks/utils/start-jupyter.sh 5. Open your browser, and navigate to http://localhost:8080. 6. Navigate to cuml folder for cuML examples, or mortgage folder for XGBoost examples.
  12. 12. 12#UnifiedDataAnalytics #SparkAISummit cuDF
  13. 13. 13#UnifiedDataAnalytics #SparkAISummit GPU-Accelerated ETL The average data scientist spends 90+% of their time in ETL, as opposed to training models
  14. 14. 14#UnifiedDataAnalytics #SparkAISummit • Follow Pandas APIs and provide >10x speedup – CSV Reader/Writer – Parquet Reader – ORC Reader – JSON Reader – Avro Reader • GPU Direct Storage integration in progress for bypassing PCIe bottlenecks! • Key is GPU-accelerating both parsing and decompression wherever possible EXTRACTION IS THE CORNERSTONE cuDF for Faster Data Loading
  15. 15. 15#UnifiedDataAnalytics #SparkAISummit ETL – THE BACKBONE OF DATA SCIENCE libcuDF is… cuDF is… • Low level library containing function implementations and C/C++ API • Importing/exporting Apache Arrow in GPU memory using CUDA IPC • CUDA kernels to perform element-wise math operations on GPU DataFrame columns • CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames • A Python library for manipulating GPU DataFrames following the Pandas API • Python interface to CUDA C++ library with additional functionality • Create GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables • JIT compilation of User-Defined Functions (UDFs) using Numba CUDA C++ Library Python Library
  16. 16. 16#UnifiedDataAnalytics #SparkAISummit BENCHMARKS Single-GPU Speedup vs Pandas Environment • cuDF v0.9 • Pandas v0.24.2 • GPU NVIDIA Tesla V100 32GB • CPU Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Benchmark Setup • DataFrames: 2x int32 key columns, 3x int32 value columns. • Inner Merge • GroupBy: count, sum, min, max calculated for each value column.
  17. 17. 17#UnifiedDataAnalytics #SparkAISummit Create an empty DataFrame, and add a column Create a DataFrame with two columns Load a CSV file into a GPU DataFrame Use Pandas to load a CSV file, and copy its content into a GPU DataFrame LOADING DATA INTO A GPU DATAFRAME cuDF code examples
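The code on this slide is embedded in images, so here is a minimal sketch of the four loading patterns using current cuDF APIs; the file name 'data.csv' and column names are placeholders:

    import cudf
    import pandas as pd

    # Create an empty DataFrame, and add a column
    df = cudf.DataFrame()
    df['a'] = [0, 1, 2, 3]

    # Create a DataFrame with two columns
    df2 = cudf.DataFrame({'a': [0, 1, 2, 3], 'b': [0.1, 0.2, 0.3, 0.4]})

    # Load a CSV file into a GPU DataFrame
    gdf = cudf.read_csv('data.csv')

    # Use Pandas to load a CSV file, and copy its content into a GPU DataFrame
    pdf = pd.read_csv('data.csv')
    gdf2 = cudf.from_pandas(pdf)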
  18. 18. 18#UnifiedDataAnalytics #SparkAISummit Return the first three rows as a new DataFrame Row slicing with column selection Find the mean and standard deviation of a column Count number of occurrences per value, and number of unique values Transform column values with a custom function Change the data type of a column WORKING WITH GPU DATAFRAMES cuDF code examples
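Again the slide's code is image-only; a hedged sketch of the listed operations on a small example cuDF DataFrame follows (in older cuDF releases the UDF path is applymap, as shown here; newer releases use apply):

    import cudf

    gdf = cudf.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                          'b': [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]})

    # Return the first three rows as a new DataFrame
    first_three = gdf.head(3)

    # Row slicing with column selection
    subset = gdf.loc[2:5, ['a', 'b']]

    # Find the mean and standard deviation of a column
    mean_a, std_a = gdf['a'].mean(), gdf['a'].std()

    # Count number of occurrences per value, and number of unique values
    counts = gdf['a'].value_counts()
    n_unique = gdf['a'].nunique()

    # Transform column values with a custom function (JIT-compiled via Numba)
    gdf['a_squared'] = gdf['a'].applymap(lambda x: x * x)

    # Change the data type of a column
    gdf['a'] = gdf['a'].astype('float64')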
  19. 19. 19#UnifiedDataAnalytics #SparkAISummit Query a DataFrame with a boolean expression Return the first ‘n’ rows ordered by ‘columns’ Sort a column by its values One-hot encoding Group by column with aggregate function Join and merge DataFrames cuDF code examples QUERY, SORT, GROUP, JOIN, …
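A sketch of the same operations with small, made-up DataFrames; the 'key' column and the second DataFrame are illustrative only:

    import cudf

    gdf = cudf.DataFrame({'key': [1, 2, 1, 2],
                          'a': [1, 2, 3, 4],
                          'b': [0.1, 0.2, 0.3, 0.4]})
    other = cudf.DataFrame({'key': [1, 2], 'c': [10, 20]})

    # Query a DataFrame with a boolean expression
    filtered = gdf.query('a > 2')

    # Return the first 'n' rows ordered by 'columns'
    top_rows = gdf.nlargest(3, ['a'])

    # Sort a column by its values
    sorted_gdf = gdf.sort_values('a')

    # One-hot encoding
    encoded = cudf.get_dummies(gdf, columns=['key'])

    # Group by column with aggregate function
    grouped = gdf.groupby('key').agg({'a': 'sum', 'b': 'max'})

    # Join and merge DataFrames
    merged = gdf.merge(other, on='key', how='inner')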
  20. 20. 20#UnifiedDataAnalytics #SparkAISummit cuML
  21. 21. 21#UnifiedDataAnalytics #SparkAISummit cuML roadmap October 2019 – RAPIDS 0.10 cuML Single-GPU Multi-GPU Multi-Node Multi-GPU XGBoost GBDT GLM Logistic Regression Random Forest K-Means K-NN DBSCAN UMAP ARIMA & Holt-Winters Kalman Filter t-SNE Principal Components Singular Value Decomposition SVM
  22. 22. 22#UnifiedDataAnalytics #SparkAISummit cuML roadmap March 2020 – RAPIDS 0.14 cuML Single-GPU Multi-GPU Multi-Node Multi-GPU XGBoost GBDT GLM Logistic Regression Random Forest K-Means K-NN DBSCAN UMAP ARIMA & Holt-Winters Kalman Filter t-SNE Principal Components Singular Value Decomposition SVM
  23. 23. 23#UnifiedDataAnalytics #SparkAISummit PRINCIPAL COMPONENT ANALYSIS (PCA): CPU vs GPU. Specific to each version: import the CPU vs the GPU algorithm; move the DataFrame from Pandas to the GPU. Common to both: data loading, algorithm parameters, and model training. Training results: CPU 57.1 seconds, GPU 4.28 seconds. System: AWS p3.8xlarge; CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM; GPU: Tesla V100 SXM2 16GB
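The slide's code is shown as images; a minimal sketch of the CPU vs GPU versions, assuming a CSV file ('data.csv' is a placeholder) and cuML's scikit-learn-like PCA API:

    import pandas as pd
    import cudf
    from sklearn.decomposition import PCA as skPCA   # specific: import CPU algorithm
    from cuml import PCA as cumlPCA                  # specific: import GPU algorithm

    # Common: data loading and algorithm parameters
    pdf = pd.read_csv('data.csv')
    n_components = 10

    # CPU: fit directly on the Pandas DataFrame
    pca_cpu = skPCA(n_components=n_components)
    pca_cpu.fit(pdf)

    # GPU: move the DataFrame from Pandas to GPU memory, then fit
    gdf = cudf.from_pandas(pdf)
    pca_gpu = cumlPCA(n_components=n_components)
    pca_gpu.fit(gdf)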
  24. 24. 24#UnifiedDataAnalytics #SparkAISummit K-NEAREST NEIGHBORS (KNN): CPU vs GPU. Specific to each version: import the CPU vs the GPU algorithm; move the DataFrame from Pandas to the GPU; model training. Common to both: data loading and algorithm parameters. Training results: CPU 537 seconds, GPU 4.28 seconds. System: AWS p3.8xlarge; CPU: Intel(R) Xeon(R) E5-2686 @ 2.30GHz, 32 vCPU cores, 244 GB RAM; GPU: Tesla V100 SXM2 16GB
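A similar sketch for k-nearest neighbors; the cuML class mirrors scikit-learn's NearestNeighbors, and 'data.csv' is again a placeholder:

    import pandas as pd
    import cudf
    from sklearn.neighbors import NearestNeighbors as skNN   # specific: CPU algorithm
    from cuml.neighbors import NearestNeighbors as cuNN      # specific: GPU algorithm

    # Common: data loading and algorithm parameters
    pdf = pd.read_csv('data.csv')
    k = 5

    # CPU: build the index and query it on the host
    knn_cpu = skNN(n_neighbors=k)
    knn_cpu.fit(pdf)
    dist_cpu, idx_cpu = knn_cpu.kneighbors(pdf)

    # GPU: DataFrame from Pandas to GPU, then fit and query on the device
    gdf = cudf.from_pandas(pdf)
    knn_gpu = cuNN(n_neighbors=k)
    knn_gpu.fit(gdf)
    dist_gpu, idx_gpu = knn_gpu.kneighbors(gdf)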
  25. 25. 25 #UnifiedDataAnalytics #SparkAISummit TRAINING TIME COMPARISON: CPU vs GPU. The bigger the dataset, the larger the training performance difference between CPU and GPU. Dataset size trained in 15 minutes: CPU ~130,000 rows; GPU ~5,900,000 rows. Specs (NC6s_v2): 6 cores (Broadwell 2.6GHz), 1 x P100 GPU, 112 GB memory, ~700 GB SSD local disk, Azure network.
  26. 26. 26#UnifiedDataAnalytics #SparkAISummit cuGraph
  27. 27. 27#UnifiedDataAnalytics #SparkAISummit Focus on Features and User Experience GOALS AND BENEFITS OF CUGRAPH • Property Graph support via DataFrames Seamless Integration with cuDF & cuML • Up to 500 million edges on a single 32GB GPU • Multi-GPU support for scaling into the billions of edges Breakthrough Performance • Python: Familiar NetworkX-like API • C/C++: lower-level granular control for application developers Multiple APIs • Extensive collection of algorithm, primitive, and utility functions Growing Functionality
  28. 28. 28#UnifiedDataAnalytics #SparkAISummit Louvain Single Run Louvain returns: cudf.DataFrame with two named columns: louvain["vertex"]: The vertex id. louvain["partition"]: The assigned partition. G = cugraph.Graph() G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"]) df, mod = cugraph.nvLouvain(G)
  29. 29. 29#UnifiedDataAnalytics #SparkAISummit cuGraph roadmap October 2019 – RAPIDS 0.10 cuGraph Single-GPU Multi-GPU Multi-Node Multi-GPU Jaccard and Weighted Jaccard Page Rank Personal Page Rank SSSP BFS Triangle Counting Subgraph Extraction Katz Centrality Betweenness Centrality Connected Components Louvain Spectral Clustering K-Cores
  30. 30. 30#UnifiedDataAnalytics #SparkAISummit cuGraph roadmap March 2020 – RAPIDS 0.14 cuGraph Single-GPU Multi-GPU Multi-Node Multi-GPU Jaccard and Weighted Jaccard Page Rank Personal Page Rank SSSP BFS Triangle Counting Subgraph Extraction Katz Centrality Betweenness Centrality Connected Components Louvain Spectral Clustering K-Cores
  31. 31. 31#UnifiedDataAnalytics #SparkAISummit WHAT IS XGBOOST
  32. 32. XGBOOST 32#UnifiedDataAnalytics #SparkAISummit DEFINITION XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is a powerful tool for solving classification and regression problems in a supervised learning setting.
  33. 33. 33#UnifiedDataAnalytics #SparkAISummit Input: age, gender, hair colour, … Does the person like computer games? Age < 30 Is male? +2 +1 -1 -1 -1 Prediction score in each leaf Single Decision Tree HOW XGBOOST WORKS
  34. 34. 34#UnifiedDataAnalytics #SparkAISummit Age < 30 Is male? +2 +1 -1 -1 -1 Use computer daily? +0.9 +0.9 +0.9 -0.9 -0.9 Tree 1 Tree 2 f(‘Bill’) = 2 + 0.9 = 2.9 f(‘Sam’) = -1 - 0.9 = -1.9 Ensembled Decision Trees HOW XGBOOST WORKS
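To make the additive scoring concrete, here is a tiny plain-Python sketch of the two trees above; the leaf values follow the slide, while the exact split that yields the +1 leaf is assumed:

    # Tree 1: "Age < 30", then "Is male?", with leaf scores +2 / +1 / -1
    def tree1(person):
        if person['age'] < 30:
            return 2.0 if person['is_male'] else 1.0
        return -1.0

    # Tree 2: "Use computer daily?", with leaf scores +0.9 / -0.9
    def tree2(person):
        return 0.9 if person['uses_computer_daily'] else -0.9

    # The ensemble prediction is simply the sum of the individual tree scores
    def predict(person):
        return tree1(person) + tree2(person)

    bill = {'age': 15, 'is_male': True, 'uses_computer_daily': True}
    sam = {'age': 60, 'is_male': False, 'uses_computer_daily': False}
    print(predict(bill))  # f('Bill') = 2 + 0.9 = 2.9
    print(predict(sam))   # f('Sam') = -1 - 0.9 = -1.9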
  35. 35. 35#UnifiedDataAnalytics #SparkAISummit TRAINED MODELS VISUALIZATION Source: https://goo.gl/GWNdEm Single Decision Tree vs Ensembled Decision Trees
  36. 36. 36#UnifiedDataAnalytics #SparkAISummit WHY XGBOOST
  37. 37. 37#UnifiedDataAnalytics #SparkAISummit Winner of Caterpillar Kaggle Contest 2015 – Machinery component pricing Winner of CERN Large Hadron Collider Kaggle Contest 2015 – Classification of rare particle decay phenomena Winner of KDD Cup 2016 – Research institutions’ impact on the acceptance of submitted academic papers Winner of ACM RecSys Challenge 2017 – Job posting recommendation A STRONG HISTORY OF SUCCESS On a Wide Range of Problems
  38. 38. 38#UnifiedDataAnalytics #SparkAISummit Source: https://goo.gl/R8Y8Pp Lower Is better WHICH ML ALGORITHM PERFORMS BETTER Average Rank Across 165 Datasets
  39. 39. 39#UnifiedDataAnalytics #SparkAISummit XGBOOST + RAPIDS
  40. 40. 40#UnifiedDataAnalytics #SparkAISummit XGBoost – Tuned for eXtreme performance and high efficiency – Multi-GPU and Multi-Node Support RAPIDS – E2E data science & analytics pipeline entirely on GPU – User-friendly Python interfaces – Relies on CUDA primitives, exposes parallelism and high-memory bandwidth – Dask integration for managing workers and data in distributed environments +
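As a rough illustration of the Dask integration mentioned above, here is a hedged multi-GPU training sketch; it assumes dask_cuda and the xgboost.dask module (available in later XGBoost releases), and the file pattern and 'label' column are placeholders:

    import xgboost as xgb
    import dask_cudf
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    # One Dask worker per local GPU; Dask manages workers and data placement
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Distributed GPU DataFrame, partitioned across the workers
    ddf = dask_cudf.read_csv('data_*.csv')
    X, y = ddf.drop(columns=['label']), ddf['label']

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    result = xgb.dask.train(client,
                            {'tree_method': 'gpu_hist', 'max_depth': 8},
                            dtrain, num_boost_round=100)
    booster = result['booster']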
  41. 41. 41#UnifiedDataAnalytics #SparkAISummit End-to-End BENCHMARKS (time in seconds, shorter is better). Configurations compared: 20 / 30 / 50 / 100 CPU nodes, DGX-2, 5x DGX-1. cuDF – Load and Data Prep: 2,290 / 1,956 / 1,999 / 1,948 vs 169 / 157. cuML – XGBoost: 2,741 / 1,675 / 715 / 379 vs 42 / 19. End-to-End (including data conversion): 8,763 / 6,147 / 3,926 / 3,221 vs 322 / 213. Benchmark: 200GB CSV dataset; data preparation includes joins, variable transformations. CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark. DGX cluster configuration: 5x DGX-1 on InfiniBand network.
  42. 42. 42#UnifiedDataAnalytics #SparkAISummit LEARN MORE
  43. 43. 43#UnifiedDataAnalytics #SparkAISummit www.rapids.ai
  44. 44. 44#UnifiedDataAnalytics #SparkAISummit DEMO
  45. 45. 45#UnifiedDataAnalytics #SparkAISummit SUMMARY
  46. 46. 46#UnifiedDataAnalytics #SparkAISummit GPU Accelerated Data Science RAPIDS is a set of open source software libraries which gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. www.rapids.ai
  47. 47. 47#UnifiedDataAnalytics #SparkAISummit COMING UP NEXT – XGBoost4j-Spark Deep Dive – Spark GPU Scheduling – Spark Stage Level Scheduling – Spark SQL with RAPIDS – Summary
  48. 48. 48#UnifiedDataAnalytics #SparkAISummit Miguel Martínez. NVIDIA. Deep Learning Solution Architect Thomas Graves. NVIDIA. Distributed Systems Software Engineer – What is RAPIDS – How to start – cuDF – cuML – cuGraph – XGBoost – Summary – Coming up next – XGBoost4j-Spark Deep Dive – Spark GPU Scheduling – Spark Stage Level Scheduling – Spark SQL with Rapids – Summary
  49. 49. 49#UnifiedDataAnalytics #SparkAISummit Data Preparation Visualization Model Training GPU Accelerated End-to-End Data Science RAPIDS is a set of open source libraries for GPU accelerating the end-to-end data science and analytics pipelines. rapids.ai In GPU Memory cuXFilter Visualization cuML Machine Learning cuGraph Graph Analytics Deep Learning cuDF Analytics
  50. 50. 50#UnifiedDataAnalytics #SparkAISummit XGBoost4J-Spark
  51. 51. 51#UnifiedDataAnalytics #SparkAISummit + ● XGBoost4J-Spark enables XGBoost to train and run inference on data in Apache Spark across nodes. ● Works with Apache Spark 2.x clusters.
  52. 52. 52#UnifiedDataAnalytics #SparkAISummit + DISTRIBUTED TRAINING MULTI-GPU, MULTI-NODE ● Leverage NCCL ● Similar to dmlc/rabit (Allreduce/Broadcast)
  53. 53. 53#UnifiedDataAnalytics #SparkAISummit + Training Performance with T4 (Preliminary Results on Google Cloud). Max Tree Depth = 8: CPU (15 threads) AUC 0.832, training loop 1071.0 s; GPU AUC 0.832, training loop 139.6 s; speedup 767%. Max Tree Depth = 20: CPU (15 threads) AUC 0.833, training loop 1088.7 s; GPU AUC 0.833, training loop 165.9 s; speedup 656%.
  54. 54. 54#UnifiedDataAnalytics #SparkAISummit XGBOOST + SPARK + RAPIDS
  55. 55. 55#UnifiedDataAnalytics #SparkAISummit What about end-to-end? ● XGBoost training is pretty fast on GPUs ● Data loading/ETL is slow in comparison ● We need to accelerate the machine learning workflow end to end ++
  56. 56. 56#UnifiedDataAnalytics #SparkAISummit ● GPU-accelerated DataFrames ● Data science operations: filter, sort, join, groupby,… ● Read CSV/Parquet/ORC ● Pandas-like API ● Bare-metal, CUDA/C++ performance ● cuDF uses an Apache Arrow-compatible memory format ● Contiguous, column-major data representation ++ RAPIDS cuDF
  57. 57. 57#UnifiedDataAnalytics #SparkAISummit ● Read CSV/Parquet/ORC directly to GPU memory ● Converts column-major cuDF to sparse, row-major DMatrix ● Requires memcpy, but all data stays in GPU memory ++ XGBoost using cuDF
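A single-GPU sketch of that path, assuming an XGBoost build that accepts cuDF input to DMatrix; 'mortgage.csv' and the 'label' column are placeholders:

    import cudf
    import xgboost as xgb

    # Read the CSV directly into GPU memory (parsing and decompression on the GPU)
    gdf = cudf.read_csv('mortgage.csv')
    X = gdf.drop(columns=['label'])
    y = gdf['label']

    # The column-major cuDF data is converted to a DMatrix without leaving GPU memory
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({'tree_method': 'gpu_hist', 'max_depth': 8},
                        dtrain, num_boost_round=100)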
  58. 58. 58#UnifiedDataAnalytics #SparkAISummit RAPIDS XGBOOST4J-SPARK Training a classification model on 17 years of mortgage data (190GB) On-prem: 9X speedup/5X cost saving • 2 Dell PowerEdge R740 (16 cores) • 2 PowerEdge R740 each w/ 4x T4 (16GB) AWS: 34X speedup/6X cost saving • 4 r5a.4xlarge servers (16 cores) • 1 p3.16xlarge server w/ 8x V100 (16GB)
  59. 59. 59#UnifiedDataAnalytics #SparkAISummit Databricks XGBoost DEMO
  60. 60. 60#UnifiedDataAnalytics #SparkAISummit Collaboration with the community: ● cuDF integration into XGBoost (XGBoost-4745) ● External memory support for GPU/Out-of-core XGBoost GPU (XGBoost-4357) ++
  61. 61. 61#UnifiedDataAnalytics #SparkAISummit ● Blog post: https://news.developer.nvidia.com/gpu-accelerated-spark-xgboost/ ● Sample applications, notebooks and getting started guides at: https://github.com/rapidsai/spark-examples ● Submit issues at https://github.com/rapidsai/spark-examples/issues and join the conversation! ++
  62. 62. 62#UnifiedDataAnalytics #SparkAISummit ● GPU scheduling ● Stage Level scheduling ● Columnar processing on the GPU Spark 3.0 Empowers GPU Apps
  63. 63. 63#UnifiedDataAnalytics #SparkAISummit SPARK GPU SCHEDULING
  64. 64. 64#UnifiedDataAnalytics #SparkAISummit ● GPUs used in deep learning and other workloads. ● Spark unaware of GPUs and cannot schedule directly ● Difficult to use GPU Scheduling
  65. 65. 65#UnifiedDataAnalytics #SparkAISummit ● Accelerator-aware scheduling (SPARK-24615) ○ Request Executor resources (GPU, FPGA, etc) ○ Request Driver resources ○ Discover resources ○ Request Task resources ○ Scheduler assigns resources to Tasks ○ API to get resources ○ Supported on YARN, Kubernetes, and Standalone ○ Spark 3.0 GPU Scheduling
  66. 66. 66#UnifiedDataAnalytics #SparkAISummit EXAMPLE : ./bin/spark-shell --master yarn --executor-cores 2 --conf spark.driver.resource.gpu.amount=1 --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh --conf spark.executor.resource.gpu.amount=2 --conf spark.executor.resource.gpu.discoveryScript=./getGpuResources.sh --conf spark.task.resource.gpu.amount=1 --files examples/src/main/scripts/getGpusResources.sh GPU Scheduling Example Discovery script in Spark github
  67. 67. 67#UnifiedDataAnalytics #SparkAISummit Task API: val context = TaskContext.get() val resources = context.resources() val assignedGpuAddrs = resources("gpu").addresses ... … Pass assignedGpuAddrs into Tensorflow or other AI code GPU Scheduling
  68. 68. 68#UnifiedDataAnalytics #SparkAISummit Driver API: scala> sc.resources scala.collection.Map[String,org.apache.spark.resource.ResourceInformation] = Map(gpu -> [name: gpu, addresses: 0]) scala> sc.resources("gpu").addresses Array[String] = Array(0) GPU Scheduling
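The same resource APIs are available from PySpark in Spark 3.0; a hedged sketch, assuming a SparkContext `sc` started with the GPU configs shown in the earlier spark-shell example:

    from pyspark import TaskContext

    # Driver side: GPUs assigned to the driver
    print(sc.resources['gpu'].addresses)

    def use_gpus(rows):
        # Task side: GPU address(es) assigned to this task
        gpu_addrs = TaskContext.get().resources()['gpu'].addresses
        # ... pass gpu_addrs to TensorFlow or other AI code ...
        yield gpu_addrs

    print(sc.parallelize(range(8), 4).mapPartitions(use_gpus).collect())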
  69. 69. 69#UnifiedDataAnalytics #SparkAISummit GPU Scheduling flow: the Spark driver requests executor containers w/ GPU(s) from the cluster manager (YARN/K8s/etc); Spark launches the executor on a node with CPU and GPU; the executor registers with its GPU addresses; the scheduler assigns GPU(s) and launches the task; the task runs, the user gets the assigned GPU addresses and passes them to TensorFlow or other AI code.
  70. 70. 70#UnifiedDataAnalytics #SparkAISummit GPU Scheduling
  71. 71. 71#UnifiedDataAnalytics #SparkAISummit ● SPIP Documentation: https://issues.apache.org/jira/browse/SPARK-24615 ● User Documentation: https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview ● Working with Berkeley RISELab to install GPUs in CI Jenkins nodes ● Fractional support: SPARK-29151 GPU Scheduling
  72. 72. 72#UnifiedDataAnalytics #SparkAISummit SPARK STAGE LEVEL SCHEDULING
  73. 73. 73#UnifiedDataAnalytics #SparkAISummit Stage Level Scheduling: in a Spark ML application, the ETL stage runs on CPU nodes while the ML stage runs on a node with a GPU.
  74. 74. 74#UnifiedDataAnalytics #SparkAISummit ● Stage level resource scheduling (SPARK-27495) ○ Specify Resource requirements per RDD operation ○ Spark dynamically allocates containers to meet resource requirements ○ Spark schedules tasks on appropriate containers Stage Level Scheduling
  75. 75. 75#UnifiedDataAnalytics #SparkAISummit ... do some ETL using default configs then… val rp = new ResourceProfile() rp.require(new ExecutorResourceRequest("memory", 2048)) rp.require(new ExecutorResourceRequest("cores", 2)) rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus"))) rp.require(new TaskResourceRequest("gpu", 1)) val rdd = sc.makeRDD(1 to 10, 5).mapPartitions { it => val context = TaskContext.get() val resources = context.resources() val assignedGpuAddrs = resources("gpu").addresses.iterator … feed into ML algorithm… }.withResources(rp) rdd.collect() Stage Level Scheduling
  76. 76. 76#UnifiedDataAnalytics #SparkAISummit SPARK SQL WITH GPU PROCESSING
  77. 77. 77#UnifiedDataAnalytics #SparkAISummit ● Users want to process data in a columnar manner - for instance run on the GPU. ○ Vectorized R Execution in Apache Spark ○ Making Nested Columns as First Citizen in Apache Spark SQL - Apple ○ Vectorized Query Execution in Apache Spark at Facebook ● The Dataset/DataFrame API in Spark only exposes one row at a time when processing data. This doesn’t take advantage of columnar processing. + SQL GPU Columnar Processing
  78. 78. 78#UnifiedDataAnalytics #SparkAISummit ● Columnar processing (SPARK-27396) ○ Catalyst API for columnar processing ○ Plugins can modify query plan with columnar ops ○ Spark 3.0 + SQL GPU Columnar Processing
  79. 79. 79#UnifiedDataAnalytics #SparkAISummit ● Plugin that allows running Spark on a GPU ● Requires Spark 3.0 ● No code changes required ● Only need the plugin jar and cuDF jar, and to set spark.sql.extensions ● Runs the operations available on the GPU ● If an operation is not implemented or not compatible with the GPU, it runs on the CPU ● Handles transitioning from Row to Columnar and back ● Uses the RAPIDS cuDF library + SQL GPU Columnar Processing RAPIDS for Apache Spark Plugin
  80. 80. 80#UnifiedDataAnalytics #SparkAISummit + SQL GPU Columnar Processing RAPIDS for Apache SPARK Plugin ./bin/spark-shell --master yarn --executor-cores 2 --conf spark.sql.extensions=ai.rapids.spark.Plugin --jars ‘rapids-4-spark.jar, cudf.jar’ --conf spark.driver.resource.gpu.amount=1 --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh --conf spark.executor.resource.gpu.amount=2 --conf spark.executor.resource.gpu.discoveryScript=./getGpuResources.sh --conf spark.task.resource.gpu.amount=1 --files examples/src/main/scripts/getGpusResources.sh
  81. 81. 81#UnifiedDataAnalytics #SparkAISummit A.join(B, A("longs") === B("blongs")).sort("longs") +
  82. 82. 82#UnifiedDataAnalytics #SparkAISummit BENCHMARK select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '1993-07-01' and o_orderdate < date '1993-07-01' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority TPCH Query 4 3.55X speedup Setup: 100GB data, GCP, 5 workers, n1-highmem-16 using 80 shuffle partitions
  83. 83. 83#UnifiedDataAnalytics #SparkAISummit ● Optimizer to choose when to run things on GPU ● More operations ● Handle larger data - spilling ● GPU direct storage ● RDMA and Shuffle improvements ● Better exchange format for ETL to ML + SQL GPU Columnar Processing Future Enhancements
  84. 84. 84#UnifiedDataAnalytics #SparkAISummit DEMO
  85. 85. 85#UnifiedDataAnalytics #SparkAISummit SUMMARY
  86. 86. 86#UnifiedDataAnalytics #SparkAISummit Spark 3.0 Empowers GPU Apps ● GPU scheduling ● Stage Level scheduling ● Columnar processing ● RAPIDS for Apache Spark Plugin
  87. 87. 87#UnifiedDataAnalytics #SparkAISummit ++ • XGBoost on GPU is ready for wide adoption • Classification and regression • Single GPU; Multi-GPU; Multi-node • In-core; External memory • Available on Spark • Kubernetes, YARN, Standalone • Open source collaboration • XGBoost • External memory • RAPIDS • cuDF, cuML
  88. 88. 88#UnifiedDataAnalytics #SparkAISummit
  89. 89. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
