Choose Your Weapon: Comparing Spark on FPGAs vs GPUs


Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged, providing cost, performance, and power-consumption advantages. Field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well known for high-performance, highly regular dense-matrix operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to make informed decisions about whether and how to accelerate.

This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL via services on the Amazon Web Services (AWS) cloud. These solutions aim to provide Spark users with high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user's application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.

  1. Choose Your Weapon: Comparing GPU-Based vs. FPGA-Based Acceleration of Apache Spark
     Bishwa “Roop” Ganguly, Chief Solution Architect
  2. The Need for Higher Performance
     • There is mounting pressure on companies to use big data analysis to gain a competitive edge, and the amount of data is ever-growing.
     • CPU-based approaches are not meeting these needs satisfactorily:
       • Non-linear scaling
       • Cost
       • Moore's law slowing
     • Customers are missing their SLAs because demand for computation is outgrowing the budget.
     • Data scientist productivity is suffering due to the time required to run their analytics.
  3. Today's Approach for Spark: CPU Clusters
     • CPU clusters are the most prevalent way that Spark is run.
     • Most common ways to meet performance demands: scale-up, scale-out
       • Costly
       • Typically highly sub-linear in terms of performance improvement, i.e. adding N times the servers yields far less than Nx performance
     • Solutions?
       • Code optimizations
       • Caching approaches
       • Spark configuration optimizations
       • SW-based acceleration
     • Bigstream has developed SW-based acceleration:
       • Compiles Spark tasks into native code, running on standard CPU instances
       • Zero user code change
       • Complements scaling of CPU clusters
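The sub-linear scaling claim above can be made concrete with Amdahl's-law-style arithmetic. A minimal sketch, where the serial fraction is an illustrative assumption rather than a measured Spark number:

```python
# Amdahl's-law sketch of why adding N times the servers yields far
# less than Nx performance. The serial (non-parallelizable) fraction
# is an illustrative assumption, not a measurement from the talk.

def scaled_speedup(n_servers, serial_fraction):
    # Time on 1 server = serial part + parallel part (normalized to 1);
    # only the parallel part splits across N servers.
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_servers)

# With even 10% serial work (shuffles, scheduling, stragglers),
# 8x the servers gives nowhere near 8x the throughput:
print(scaled_speedup(8, 0.10))   # ~4.7x, not 8x
```

This is why the talk treats scale-out as costly: each added server contributes less than the one before it, while the bill grows linearly.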
  4. Bigstream (SW-only) Speedup Results over Spark 3.0: Average Speedup 2.2x
     • All runs on identical md5.2xl EMR clusters; baseline: Spark 3.0
     • 4 workers/cluster, S3 storage
     • 250SF CSV gzip-compressed standard data (approx. 72GB)
     • [Chart: per-query speedup over Spark 3.0 across the TPC-DS query set; y-axis 0 to 3x]
     • Available on AWS Marketplace
  5. Hardware Accelerators
     • Programmable hardware: FPGA, GPU, ASIC
     • Designed for efficient execution of specialized code, in contrast with general-purpose CPUs
     • ASICs typically support domain-specific workloads
     • FPGAs and GPUs provide flexibility, but don't natively connect to big data platforms; middleware is needed
     • Both can provide performance, power, and cost advantages w.r.t. CPU scale-up and scale-out
     • Designed to attach physically to existing servers simply (e.g. via a PCIe slot)
  6. HW Accelerator Market is Trending (Source: ARK Invest “Big Ideas 2021”)
     • Accelerators = GPUs, ASICs, and FPGAs
     • Projected to be a $41 billion industry within the next 10 years, surpassing CPUs
     • Driven by big data analytics and AI
  7. Architectural Comparison (GPU, FPGA)
     GPU
     • Features:
       • Programmable via ISA (simpler programming)
       • High degree of data-level parallelism
     • Challenges:
       • Branch divergence can be very costly
       • Power consumption has been shown to be high for some analytics operations
     FPGA
     • Features:
       • Logic configured per operation can maximize efficiency
       • Low power consumption per computation
       • Bit-level, byte-level, and “irregular” parallelism can be leveraged
       • High on-chip bandwidth
     • Challenges:
       • Exploiting fine-grain parallelism requires FPGA architecture expertise
  8. Example Irregular Computation: IF/ELSE
     • Application code: if (shirtsize == 'large') { green code; } else { red code; }
     • GPU (SIMD lanes): the “if” path runs while lanes on the “else” path wait, due to SIMD execution (partial lane utilization)
     • FPGA (MIMD lanes): the “if” and “else” paths run in parallel despite the branch divergence (full utilization)
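The utilization difference on this slide can be captured in a toy cost model. The cycle counts below are illustrative assumptions, not measurements; the point is only the shape of the cost:

```python
# Toy cost model for one batch of elements hitting an if/else.
# Assumption (not from the slides): each branch body has a fixed
# cycle cost, and elements take either branch.

def simd_cycles(if_cost, else_cost, frac_if):
    # SIMD lanes execute both paths serially, masking inactive lanes,
    # so any mixed batch pays for BOTH branch bodies.
    if frac_if == 1.0:
        return if_cost            # uniform batch: no divergence
    if frac_if == 0.0:
        return else_cost
    return if_cost + else_cost    # divergent batch: paths serialized

def mimd_cycles(if_cost, else_cost, frac_if):
    # Independent (MIMD-style) lanes each run only their own branch,
    # so the batch finishes when the slower branch body finishes.
    return max(if_cost if frac_if > 0 else 0,
               else_cost if frac_if < 1 else 0)

# A 50/50 divergent batch with 10-cycle branch bodies:
print(simd_cycles(10, 10, 0.5))   # 20 cycles (both paths, serialized)
print(mimd_cycles(10, 10, 0.5))   # 10 cycles (paths overlap)
```

On uniform batches (every element taking the same branch) the two models agree; the gap opens only on divergent data, which is exactly the "irregular" case the slide highlights.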
  9. SQL/Analytics/ML Performance Hypotheses/Observations
     • Scan: better on FPGA because of the if/else (MIMD) efficiency of its logic vs. SIMD
     • SQL operations: GPU vs. FPGA depends on the degree of regularity of the operation
     • ML training: GPU is better
       • Typically regular matrix operations
       • Training requires high-precision, floating-point parallel computation
     • Inference: depends on the precision being used
       • If lower precision can give equally good answers, FPGA may have an advantage
  10. Spark Performance Results
      • Goal: initial assessment of FPGA-based and GPU-based solutions on AWS for a typical SparkSQL workload
      • Experimental setup (both technologies):
        • All runs use 4-node worker clusters
        • 8 vCPUs per worker
        • 1 executor/worker, 7 cores/executor
        • Baseline: Spark 3.0.1 on CPU
        • TPC-DS benchmark suite (90 queries)
        • Identical, standard code: TPC-DS SQL as downloaded from www.tpc.org
        • Identical, standard data: TPC-DS CSV-format data (gzipped)
        • Input data coming from the same AWS S3 bucket
      • Desired result: assessment of acceleration performance over standard Spark
      • Note: the two solutions run on different instance types, so this is not a head-to-head comparison
  11. Experimental Setup per Technology
      • RAPIDS GPU-based Spark acceleration:
        • Cluster allocated via AWS EMR
        • g4dn.2xlarge instance type (1 NVIDIA T4 GPU/instance)
        • 4 physical cores (3196 MHz)
        • 22GB executor memory
        • Optimized Spark configurations as recommended by NVIDIA literature
      • Bigstream FPGA-based Spark acceleration:
        • Cluster allocated via a single Bigstream-provided Terraform script v1.1 (note: F1 instances not yet available in EMR)
        • f1.2xlarge instance type (1 Xilinx FPGA/instance)
        • 4 physical cores (2670 MHz)
        • 72GB executor memory
        • Optimized Spark configurations as recommended by Bigstream
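For context, enabling the RAPIDS GPU path on a Spark cluster follows a known pattern of plugin and resource settings. This is only a sketch: the job script name is a placeholder, and the specific values (cores, memory, per-task GPU share) are illustrative assumptions rather than the configurations used in these tests:

```shell
# Sketch of a RAPIDS-accelerated Spark submission (illustrative values;
# your_query_job.py is a placeholder, not a file from the talk).
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.cores=7 \
  --conf spark.executor.memory=22g \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.143 \
  your_query_job.py
```

The fractional `spark.task.resource.gpu.amount` (here roughly 1/7, matching 7 cores per executor) lets multiple tasks share the executor's single GPU; the Bigstream FPGA path, by contrast, is configured through its provided Terraform deployment rather than Spark plugin settings.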
  12. RAPIDS Speedup Results: Average Speedup 1.9x
      • 250SF CSV gzip-compressed standard data (approx. 72GB)
      • [Chart: per-query speedup over Spark 3.0 across the TPC-DS query set; y-axis 0 to 3.5x]
  13. Bigstream Speedup Results: Average Speedup 3.6x
      • 250SF CSV gzip-compressed standard data (approx. 72GB)
      • [Chart: per-query speedup over Spark 3.0 across the TPC-DS query set; y-axis 0 to 7x]
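The per-query speedups behind averages like these reduce to a few lines of arithmetic. A minimal sketch, where the query names and timings are invented placeholders, not the measured TPC-DS numbers:

```python
# Compute per-query and average speedup of an accelerated run over a
# baseline run. Timings are invented placeholders, NOT the measured
# results from the slides.
baseline_secs    = {"q1": 120.0, "q3": 80.0, "q5": 200.0}
accelerated_secs = {"q1": 40.0,  "q3": 50.0, "q5": 50.0}

speedups = {q: baseline_secs[q] / accelerated_secs[q] for q in baseline_secs}
avg_speedup = sum(speedups.values()) / len(speedups)  # arithmetic mean

print(speedups)      # {'q1': 3.0, 'q3': 1.6, 'q5': 4.0}
print(avg_speedup)
```

The slides say "average speedup" without specifying the mean; an arithmetic mean is assumed here, though benchmark suites often report the geometric mean, which damps the influence of a few outlier queries.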
  14. Bigstream Hyperacceleration Layer
      • Zero code changes
      • Cross-platform versatility
      • Up to 10x acceleration
      • Adaptation: intelligent, automatic computation slicing
      • Cross-acceleration hardware
      • Bigstream Dataflow; Bigstream Runtime
  15. Summary: HW Accelerated Spark for Analytics
      • Available today: cloud and on-premise
      • Zero code change
      • Provides the next level of performance, over and above traditional Spark optimizations
      • Use case examples:
        • Highest-performing analytics on a given infrastructure size
        • Ability to leverage more data: more sources, more lookback, larger sizes
        • Overcoming cluster scaling limitations
        • Total cost of operations (TCO) savings
  16. Thank you! roop@bigstream.co

