
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters


XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers.


  1. Scalable Acceleration of XGBoost Training on Spark GPU Clusters
     Rong Ou, Bobby Wang (NVIDIA)
  2. Agenda
     ▪ Rong Ou: Introduction to XGBoost, gradient-based sampling, learning to rank
     ▪ Bobby Wang: XGBoost training with GPUs on Spark 2.x/3.0
  3. XGBoost
  4. XGBoost
     ▪ Open source gradient boosting library
     ▪ Supports regression, classification, ranking, and user-defined objectives
     ▪ Has won many data science and machine learning challenges
     ▪ Used in production by multiple companies
  5. Distributed XGBoost
     ▪ Supports distributed training on multiple machines, including AWS, GCE, Azure, and YARN clusters
     ▪ Can be integrated with Flink, Spark, and other cloud dataflow systems
  6. XGBoost GPU Support
     ▪ Tree construction (training) and prediction can be accelerated with CUDA-capable GPUs
     ▪ Use gpu_hist as the tree method (see the parameter sketch below)
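     A minimal sketch of what this looks like in booster parameters (tree_method
     and gpu_hist are standard XGBoost names; the objective and tuning values
     here are illustrative placeholders, not from the slides):

       // Sketch: switch tree construction to the CUDA-accelerated
       // histogram algorithm with a single parameter.
       val params = Map(
         "objective"   -> "binary:logistic", // placeholder objective
         "tree_method" -> "gpu_hist",        // build trees on the GPU
         "max_depth"   -> 8,                 // placeholder tuning values
         "eta"         -> 0.1
       )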
  7. Gradient-based Sampling
  8. Out-of-core Boosting
     ▪ GPU memory is typically smaller than main memory
     ▪ Large datasets may not fit in GPU memory, even on a production cluster
     ▪ Naively streaming data over the PCIe bus is too slow
  9. Sampling
     ▪ At the beginning of each iteration, sample the data, then use the sample to build the tree
     ▪ Uniform sampling requires at least 50% of the data to be sampled
  10. Gradient-based Sampling
      ▪ Sample with probability proportional to the gradients
      ▪ Gradient-based One-Side Sampling (GOSS)
      ▪ Minimal Variance Sampling (MVS)
      ▪ Sampling ratio as low as 0.1 without loss of accuracy (see the sketch below)
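      As a sketch, this maps onto XGBoost parameters roughly as follows
      (assuming an XGBoost release that supports the sampling_method parameter,
      which is tied to the gpu_hist tree method; values are illustrative):

        // Sketch: out-of-core GPU training with gradient-based sampling.
        val samplingParams = Map(
          "tree_method"     -> "gpu_hist",
          "sampling_method" -> "gradient_based", // row probability proportional to gradient
          "subsample"       -> 0.1               // f = 0.1: ~10% of rows per iteration
        )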
  11. Maximum Data Size
      Mode                      # Rows
      In-core GPU               9 million
      Out-of-core GPU           12 million
      Out-of-core GPU, f = 0.1  85 million
      Synthetic dataset with 500 columns, NVIDIA Tesla V100 GPU (16 GB)
  12. Training Time
      Mode                      Time (seconds)  AUC
      CPU In-core               1309.64         0.8393
      CPU Out-of-core           1228.53         0.8393
      GPU In-core               241.52          0.8398
      GPU Out-of-core, f = 1.0  211.91          0.8396
      GPU Out-of-core, f = 0.5  427.41          0.8395
      GPU Out-of-core, f = 0.3  421.59          0.8399
      Higgs dataset, NVIDIA Titan V
  13. Model Accuracy
  14. Learning to Rank
  15. Learning to Rank (LTR) in a Nutshell
      ▪ Used in the Information Retrieval (IR) class of problems
      ▪ A search engine indexes billions of documents
      ▪ A user's search query should return the most relevant documents
      ▪ Hence, pages are first grouped by query relevance, domain, subdomain, etc.
      ▪ Within each group, the pages are ranked
      ▪ Initial ranking is based on editorial judgment of user queries
      ▪ The ranking is iteratively refined based on the performance of the previous model
  16. LTR in XGBoost
      ▪ XGBoost incrementally builds a better model by combining multiple weak models
      ▪ Models are built by gradient descent using an objective function such as LTR
      ▪ XGBoost uses the LambdaMART ranking algorithm, which takes a pairwise ranking approach
      ▪ This minimizes pairwise loss by repeatedly sampling pairs of instances (see the loss sketch below)
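      As a sketch of the objective behind this (the standard RankNet-style
      pairwise loss that LambdaMART builds on; the notation is assumed here,
      not taken from the slides), for a sampled pair where document i should
      rank above document j:

        % s_i, s_j are the model's scores for documents i and j;
        % sigma is a scaling constant.
        C_{ij} = \log\left(1 + e^{-\sigma (s_i - s_j)}\right)

      Descending the gradient of this loss pushes the score of the preferred
      document above that of the other, pair by pair.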
  17. LTR Algorithms
      ▪ 3 algorithms are supported:
      ▪ Pairwise (default)
      ▪ mAP - mean Average Precision
      ▪ nDCG - normalized Discounted Cumulative Gain
      ▪ mAP and nDCG further minimize the pairwise loss by adjusting it with the weight of the instance pair chosen
  18. Enable and Measure Model Performance
      ▪ Train on the GPU (tree_method = gpu_hist)
      ▪ Choose the appropriate objective function (objective = rank:map)
      ▪ Measure model performance after each training round by enabling one of the following ranking metrics (eval_metric = map); see the parameter sketch after this list
      ▪ mAP - mean Average Precision (default)
      ▪ pre[@n] - precision [for top n documents]
      ▪ nDCG[@n] - normalized Discounted Cumulative Gain [for top n documents]
      ▪ auc - area under the ROC curve
      ▪ aucpr - area under the precision-recall curve
      ▪ Ranking and metric evaluation are both accelerated on the GPU
      ▪ For more information and paper references, please refer to this blog
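      A minimal parameter sketch tying these together (standard XGBoost
      parameter names; the values are the examples from the slide):

        // Sketch: GPU-accelerated learning-to-rank configuration.
        val ltrParams = Map(
          "tree_method" -> "gpu_hist", // train on the GPU
          "objective"   -> "rank:map", // LambdaMART objective optimizing mAP
          "eval_metric" -> "map"       // report mean Average Precision each round
        )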
  19. Performance - Environment and Configuration
      ▪ Used the Microsoft benchmark ranking dataset
      ▪ Consists of ~11.3 million training instances, scattered across ~95K groups and consuming ~13 GB of disk space
      ▪ System info
      ▪ Intel Xeon 2.3 GHz, 1 socket, 6 cores / socket, 2 threads / core, 80 GB system memory, 1 NVIDIA V100 16 GB GPU; hyperthreads were not used (only 6 cores for training)
      ▪ Training configuration
      ▪ Used the default training configuration on the GPU; built 100 trees; used the pairwise, ndcg, and map ranking algorithms, with map to measure model performance
  20. Performance - Numbers
      Algorithm  pairwise  ndcg    map
      GPU        1.72      2.54    2.73
      CPU        42.37     59.33   46.38
      Speedup    24.63x    23.36x  16.99x
      Ranking + metric computation times (in seconds), using XGBoost HEAD from 5/18/20
  21. XGBoost + Spark 2.x
  22. XGBoost
      ▪ How do you train XGBoost on existing data?
      ▪ Convert the existing data to numeric data
      ▪ Do ETL on the existing data (see the sketch below)
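      A hypothetical sketch of that ETL step (the path, column names, and the
      indexing choice are illustrative, not from the slides; assumes a
      SparkSession named spark is in scope):

        import org.apache.spark.ml.feature.StringIndexer
        import org.apache.spark.sql.functions.col

        val raw = spark.read.parquet("/data/raw")   // placeholder path

        // Map a string column to numeric category indices.
        val indexed = new StringIndexer()
          .setInputCol("category")                  // hypothetical column
          .setOutputCol("categoryIdx")
          .fit(raw)
          .transform(raw)

        // Ensure numeric types and handle missing values simply.
        val numeric = indexed
          .withColumn("amount", col("amount").cast("double"))
          .na.fill(0.0)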
  23. XGBoost4J-Spark
      ▪ Integrates XGBoost with Apache Spark
      ▪ Uses the high-performance algorithm implementation of XGBoost
      ▪ Leverages the powerful data processing engine of Spark
  24. XGBoost + Spark 2.x + Rapids
      ▪ Rapids cuDF (libcudf + language bindings)
  25. XGBoost + Spark 2.x + Rapids
      ▪ Read CSV/Parquet/ORC directly into GPU memory
      ▪ Chunked loading
      ▪ Convert column-major cuDF to a sparse, row-major DMatrix
  26. Training on GPUs with Spark 2.x

      CPU:
        val df = spark.read.parquet(path)
        val featureNames = Seq("f1", "f2", "f3")
        val vectorAssembler = new VectorAssembler()
          .setInputCols(featureNames.toArray)
          .setOutputCol("features")
        val xgbInput = vectorAssembler
          .transform(df)
          .select("features", labelColName)
        val xgbClassifier = new XGBoostClassifier(params)
          .setLabelCol(labelColName)
          .setTreeMethod("hist")
          .setFeaturesCol("features")
        val model = xgbClassifier.fit(xgbInput)

      GPU:
        val gpuDf = new GpuDataReader(spark).parquet(path)
        val featureNames = Seq("f1", "f2", "f3")
        val xgbClassifier = new XGBoostClassifier(params)
          .setLabelCol(labelColName)
          .setTreeMethod("gpu_hist")
          .setFeaturesCols(featureNames)
        val model = xgbClassifier.fit(gpuDf)
  27. XGBoost + Spark 2.x + Rapids
      ▪ Training a classification model on 17 years of mortgage data (190 GB)
  28. XGBoost + Spark 3.0
  29. XGBoost + Spark 3.0 + Rapids
      ▪ rapids-plugin-4-spark
      ▪ An Apache Spark plugin that leverages GPUs to accelerate processing via the Rapids libraries
  30. Seamless Integration with Spark 3.0
      ▪ Features
      ▪ Use existing (unmodified) customer code (see the configuration sketch below)
      ▪ Spark features that are not GPU-enabled run transparently on the CPU
      ▪ Initial release - GPU acceleration of:
      ▪ Spark DataFrames
      ▪ Spark SQL
      ▪ ML/DL training frameworks
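      A configuration sketch of how unmodified code gets accelerated (property
      names follow the RAPIDS Accelerator and Spark 3.0 GPU-scheduling
      conventions, but exact names and values may vary by version; treat this
      as illustrative, not definitive):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("gpu-accelerated-app")
          .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // load the RAPIDS plugin
          .config("spark.rapids.sql.enabled", "true")            // GPU SQL/DataFrame ops
          .config("spark.executor.resource.gpu.amount", "1")     // Spark 3.0 GPU scheduling
          .config("spark.task.resource.gpu.amount", "0.125")     // example tasks-per-GPU share
          .getOrCreate()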
  31. Rapids Plugin
      Architecture diagram: the RAPIDS Accelerator for Spark sits between distributed scale-out Spark applications (DataFrame API, Spark SQL, Spark Shuffle) and Apache Spark Core. JNI bindings map Java/Scala calls to the Rapids C++ libraries, CUDA, and the UCX libraries. For each operation: if gpu_enabled(operation, data_type), call out to RAPIDS; else execute the standard Spark operation. The shuffle component is a custom implementation of Spark Shuffle, optimized to use RDMA and GPU-to-GPU direct communication.
  32. XGBoost + Spark 3.0 + Rapids
      ▪ GPU scheduling
      ▪ GPU-accelerated data reader
      ▪ Chunked loading
      ▪ Operators run on the GPU, e.g. filter, sort, join, groupBy, etc.
  33. Training on GPUs with Spark 3.0

      CPU:
        val df = spark.read.parquet(path)
        val featureNames = Seq("f1", "f2", "f3")
        val vectorAssembler = new VectorAssembler()
          .setInputCols(featureNames.toArray)
          .setOutputCol("features")
        val xgbInput = vectorAssembler
          .transform(df)
          .select("features", labelColName)
        val xgbClassifier = new XGBoostClassifier(params)
          .setLabelCol(labelColName)
          .setTreeMethod("hist")
          .setFeaturesCol("features")
        val model = xgbClassifier.fit(xgbInput)

      GPU:
        val df = spark.read.parquet(path)
        val featureNames = Seq("f1", "f2", "f3")
        val xgbClassifier = new XGBoostClassifier(params)
          .setLabelCol(labelColName)
          .setTreeMethod("gpu_hist")
          .setFeaturesCols(featureNames)
        val model = xgbClassifier.fit(df)
  34. XGBoost + Spark 3 + Rapids
      ▪ Training a classification model on 23 days of Criteo data (1 TB)
  35. New eBook: Accelerating Spark 3
      Download at: nvidia.com/Spark-book
      In this eBook you'll learn about:
      ● The data processing evolution, from Hadoop to GPUs and the NVIDIA RAPIDS™ library
      ● Spark, what it is, what it does, and why it matters
      ● GPU acceleration in Spark
      ● DataFrames and Spark SQL
      ● A Spark regression example with a random forest classifier
      ● An example of an end-to-end machine learning workflow GPU-accelerated with XGBoost
  36. Reference
      ▪ XGBoost for Spark 2.x: https://github.com/rapidsai/xgboost/tree/rapids-spark
      ▪ XGBoost for Spark 3: https://github.com/rapidsai/xgboost/tree/rapids-spark3.0
      ▪ XGBoost example for Spark 2.x: https://github.com/rapidsai/spark-examples/tree/master
      ▪ XGBoost example for Spark 3: https://github.com/rapidsai/spark-examples/tree/support-spark3.0
      ▪ Blog: Machine learning with XGBoost gets faster with Dataproc on GPUs
      ▪ Blog: GPU-Accelerated Spark XGBoost – A Major Milestone on the Road to Large-Scale AI
  37. Feedback
      Your feedback is important to us. Don't forget to rate and review the sessions.
