Performance Analysis of Apache Spark and Presto in Cloud Environments
Today, users have multiple options for big data analytics, both in terms of open-source and proprietary systems and of cloud computing service providers. To obtain the best value for their money in a SaaS cloud environment, users need to be aware of the performance of each service and its associated costs, while also taking into account aspects such as usability, monitoring, interoperability, and administration capabilities.

We present an independent analysis of two mature and well-known data analytics systems, Apache Spark and Presto, both running on the Amazon EMR platform; in the case of Apache Spark, we also analyze the Databricks Unified Analytics Platform and its associated runtime and optimization capabilities. Our analysis is based on running the TPC-DS benchmark and thus focuses on SQL performance, which is still indispensable for data scientists and engineers. In this talk we present quantitative results that we expect to be valuable for end users, accompanied by an in-depth look at the advantages and disadvantages of each alternative.

Thus, attendees will be better informed about the current big data analytics landscape and better positioned to avoid common pitfalls in deploying data analytics at scale.


  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Víctor Cuevas-Vicenttín, Barcelona Supercomputing Center. Performance Analysis of Apache Spark and Presto in Cloud Environments. #UnifiedDataAnalytics #SparkAISummit
  3. About BSC. The Barcelona Supercomputing Center (BSC) is the Spanish national supercomputing facility and a top EU research institution, established in 2005 by the Spanish government, the Catalan government, and the UPC/BarcelonaTECH university. The mission of BSC is to be at the service of the international scientific community and of industry in need of HPC resources. BSC's research lines are developed within the framework of European Union research funding programmes, and the centre also does basic and applied research in collaboration with companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
  4. 13.7 Petaflops
  5. TPC-DS Benchmark Work. The BSC collaborated with Databricks to benchmark comparisons on large-scale analytics computations, using the TPC-DS Toolkit v2.10.1rc3. The Transaction Processing Performance Council (TPC) Benchmark DS has the objective of evaluating decision support systems, which process large volumes of data in order to provide answers to real-world business questions. Our results are not official TPC Benchmark DS results. Databricks provided BSC an account and credits, which BSC then independently used for the benchmarking study with other analytics products on the market. The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry.
  6. Context and motivation
     • Need to adopt data analytics in a cost-effective manner
       – SQL still very relevant
       – Open-source based analytics platforms
       – On-demand computing resources from the Cloud
     • Evaluate Cloud-based SQL engines
  7. Systems Under Test (SUTs)
     • Databricks Unified Analytics Platform
       – Based on Apache Spark but with optimized Databricks Runtime
       – Notebooks for interactive development and production Jobs
       – JDBC and custom API access
       – Delta storage layer supporting ACID transactions
  8. Systems Under Test (SUTs)
     • AWS EMR Presto
       – Distributed SQL engine created by Facebook
       – Connectors to non-relational and relational sources
       – JDBC and CLI access
       – Based on in-memory, pipelined parallel execution
     • AWS EMR Spark
       – Based on open-source Apache Spark
  9. Plan
     • TPC Benchmark DS
     • Hardware and software configuration
     • Benchmarking infrastructure
     • Benchmark results and their analysis
     • Usability and developer productivity
     • Conclusions
  10. TPC Benchmark DS
      • Created around 2006 to evaluate decision support systems
      • Based on a retailer with several channels of distribution
      • Processes large volumes of data to answer real-world business questions
  11. TPC Benchmark DS
      • Snowflake schema: fact tables associated with multiple dimension tables
      • Data produced by a data generator
      • 99 queries of various types: reporting, ad hoc, iterative, data mining
  12. (figure-only slide: TPC-DS schema diagram)
  13. TPC Benchmark DS
      • Load Test (1 TB)
      • Power Test
      • Data Refresh
      • Throughput Test
      (diagram: .dat files loaded as ORC/Parquet; the Power Test runs Query 1 … Query 99 sequentially; the Throughput Test runs n concurrent streams, each executing Query i,1 … Query i,99)
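The test structure above can be sketched as follows. This is an illustrative stand-in, not the TPC-DS driver: `run_query` is a placeholder for submitting a query to the system under test, and real streams use the query permutations prescribed by the TPC-DS tooling rather than sequential order.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(stream_id, query_id):
    # Placeholder: a real driver would submit TPC-DS query `query_id`
    # to the SUT and time its execution.
    return (stream_id, query_id)

def run_stream(stream_id, query_order):
    # One stream runs its 99 queries sequentially, in its own order.
    return [run_query(stream_id, q) for q in query_order]

def throughput_test(num_streams, num_queries=99):
    # Each stream should use the permutation prescribed by the TPC-DS
    # tooling; here every stream simply runs 1..99 in order.
    orders = [list(range(1, num_queries + 1)) for _ in range(num_streams)]
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        return list(pool.map(run_stream, range(num_streams), orders))
```

With `num_streams = 1` this degenerates into the Power Test's single sequential pass.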
  14. Hardware configuration
      Node type: i3.2xlarge — 8 vCPUs (2.3 GHz Intel Xeon E5-2686 v4), 61 GiB memory, 1 × 1,900 GB NVMe SSD local storage
      Cluster: 1 master node, 8 worker nodes
  15. Software configuration
      • Databricks — Runtime 5.5, Spark 2.4.3, Scala 2.11
        spark.sql.broadcastTimeout: 7200; spark.sql.crossJoin.enabled: true
      • EMR Presto — emr-5.26.0, Presto 0.220
        hive.allow-drop-table: true; hive.compression-codec: SNAPPY; hive.s3-file-system-type: PRESTO; query.max-memory: 240 GB
      • EMR Spark — emr-5.26.0, Spark 2.4.3
        spark.sql.broadcastTimeout: 7200; spark.driver.memory: 5692M
  16. (diagram: benchmarking infrastructure — a client application drives cluster execution via JARs; SQL queries run over .dat data converted to Parquet/ORC, with the AWS Glue Metastore as catalog; logs are collected for analysis in .xlsx)
  17. Benchmark execution time (base)
  18. Cost-Based Optimizer (CBO) stats
      • Collect table- and column-level statistics to create optimized query evaluation plans
        – distinct count, min, max, null count
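On Spark-based systems this kind of statistics collection can be done with Spark SQL's `ANALYZE TABLE … COMPUTE STATISTICS` commands. The deck does not show the exact commands used, so the helper below is only a sketch; the table and column names in the usage example are hypothetical, not the study's actual list.

```python
def analyze_statements(table, columns):
    """Yield the Spark SQL statements that collect table-level and,
    optionally, column-level statistics for the CBO."""
    yield f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if columns:
        yield (f"ANALYZE TABLE {table} COMPUTE STATISTICS "
               f"FOR COLUMNS {', '.join(columns)}")
```

Each generated statement would then be executed via `spark.sql(...)` before running the benchmark queries.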
  19. Benchmark execution time (stats) — CBO enabled: ↑ 27.11
  20. Speedup with table and column stats — CBO enabled: ↓ 0.60
  21. TPC-DS Power Test – geom. mean
  22. TPC-DS Power Test – arith. mean
  23. Additional configuration for Presto
      Query-specific configuration parameters:
      • Queries 5, 75, 78, and 80 — join_distribution_type: PARTITIONED
      • Queries 78 and 85 — join_reordering_strategy: NONE
      • Query 67 — task_concurrency: 32
      • Query 18 — join_reordering_strategy: ELIMINATE_CROSS_JOINS
      Session configuration for all queries:
      • query_max_stage_count: 102
      • join_reordering_strategy: AUTOMATIC
      • join_distribution_type: AUTOMATIC
      Query modifications (carried over to all systems):
      • Query 72 — manual join re-ordering
      • Query 95 — add DISTINCT clause
  24. TPC-DS Power Test – Query 72
      • Manually modified join order:
        catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
      • Databricks optimized join order, no stats: same as the modified join order, plus pushed-down selections and projections
      • Original benchmark join order:
        catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
  25. TPC-DS Power Test – Query 72
      • Databricks optimized join order with stats:
        ((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item) ⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse)) ⟕ promotion) ⟕ catalog_returns
        plus pushed-down selections and projections
      • EMR Spark optimized join order with stats and CBO enabled/disabled: same as the modified join order plus pushed-down selections and projections, but different physical plans
  26. Dynamic data partitioning
      • Splits a table based on the value of a particular column
        – Split only the 7 largest tables, by date surrogate keys
        – One S3 bucket folder for each value
      • Databricks and EMR Spark: limit the number of files per partition
      • EMR Presto: out-of-memory error for the largest table
        – Used Hive with Tez to load the data
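The Hive-style layout described above — one S3 "folder" per date-surrogate-key value — can be illustrated with a small path builder. The bucket, table, and column names in the example are hypothetical, chosen only to show the shape of the layout.

```python
def partition_path(bucket, table, partition_column, partition_value):
    """Build the Hive-style S3 prefix for one partition of a table:
    one folder per distinct value of the partitioning column."""
    return f"s3://{bucket}/{table}/{partition_column}={partition_value}/"
```

An engine that understands this layout can then prune partitions: a filter on the date surrogate key only reads the matching folders instead of scanning the whole table.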
  27. Benchmark exec. time (part + stats) — Power Test: 2 failed queries; Throughput Test: 6 failed queries
  28. Speedup with partitioning and stats
  29. TPC Benchmark total execution time
  30. TPC Benchmark DS metric
      • The modified primary performance metric is
        QphDS@SF = (SF × Q) / ∛(T_PT × T_TT × T_LD)
        where SF is the scale factor; Q is the number of weighted queries (num. streams × 99); T_LD is the load factor (0.1 × num. streams × load time); and T_PT and T_TT are the Power Test and Throughput Test times.
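As a numeric sketch, the modified metric defined on this slide can be computed as follows. The cube root over the three times mirrors the geometric form of the official TPC-DS metric; it is a reconstruction from the slide's legend, not the official formula, and the argument values below are illustrative.

```python
def qph_ds(sf, num_streams, t_pt, t_tt, load_time):
    """Modified TPC-DS primary metric (QphDS@SF) as defined on the slide:
    Q = num_streams * 99 weighted queries,
    T_LD = 0.1 * num_streams * load_time,
    all times in hours."""
    q = num_streams * 99
    t_ld = 0.1 * num_streams * load_time
    return sf * q / (t_pt * t_tt * t_ld) ** (1 / 3)
```

The metric rewards both a fast single-stream pass (T_PT) and good concurrent behavior (T_TT), scaled up by the data volume (SF).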
  31. TPC Benchmark DS metric
  32. System costs
      • Cost = num. nodes × node cost per hour × exec. time in hours
        where node cost per hour = node hardware costs + node software costs
      • Per-node hourly costs:
        EMR Presto — hardware $0.624, software $0.156
        EMR Spark — hardware $0.624, software $0.156
        Databricks — hardware $0.624, software $0.30
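The cost model on this slide reduces to simple arithmetic; the sketch below applies it to a 9-node cluster (1 master + 8 workers, matching the hardware configuration) at the EMR Presto rates shown above.

```python
def system_cost(num_nodes, hw_per_hour, sw_per_hour, exec_hours):
    """Total cost = number of nodes x (hardware + software cost per
    node-hour) x execution time in hours."""
    return num_nodes * (hw_per_hour + sw_per_hour) * exec_hours

# 9-node EMR Presto cluster for one hour: 9 * (0.624 + 0.156) * 1
one_hour_cost = system_cost(9, 0.624, 0.156, 1.0)
```

Since the hardware rate is identical across all three systems, the cost ranking is driven entirely by the software rate and the measured execution time.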
  33. TPC Benchmark DS cost
  34. TPC-DS price-performance
  35. Disk utilization
      • Databricks
        – Automatically caches hot input data
        – Requires machines with NVMe SSDs
      • EMR Presto
        – Experimental spilling of state to disk
        – “we do not configure any of the Facebook deployments to spill…local disks would increase hardware costs…” (Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802–1813)
  36. (chart: Databricks vs. EMR Presto)
  37. (chart: Databricks vs. EMR Presto)
  38. Usability and developer productivity
      Feature                                           | EMR Presto | EMR Spark | Databricks
      Easy and flexible cluster creation                | ✓          | ✓         | ✓
      Framework configuration at cluster creation time  | ✓          | ✓         | ✓
      Direct distributed file system support            | ✗          | ✗         | ✓
      Independent data catalog (metastore)              | ✓          | ✓         | ✓
      Support for notebooks                             | ✓          | ✓         | ✓
      Integrated Web GUI                                | ✗          | ✗         | ✓
  39. Usability and developer productivity (cont.)
      Feature                                                      | EMR Presto | EMR Spark | Databricks
      JDBC access                                                  | ✓          | ✓         | ✓
      Programmatic interface                                       | ✗          | ✓         | ✓
      Job creation and management infrastructure                   | ✗          | ✗         | ✓
      Customized visualization of query plan execution             | ✓          | ✓         | ✓
      Resource utilization monitoring with Ganglia and CloudWatch  | ✓          | ✓         | ✓
  40. Conclusions
      • Databricks is about 4x faster than EMR Presto without statistics
        – About 3x faster with them
      • The difference is smaller with EMR Spark
        – Databricks still more cost-effective
        – More efficient runtime, cache, and cost-based optimizer
      • Databricks and EMR Spark deal better with concurrency and benefit from data partitioning
  41. Conclusions
      • EMR Presto requires significantly more tuning
        – Minimal for Databricks and EMR Spark
      • Functionality of Databricks and EMR Presto/Spark for SQL is very similar
        – Databricks more user-friendly in some aspects
  42. DON’T FORGET TO RATE AND REVIEW THE SESSIONS — SEARCH SPARK + AI SUMMIT