
The state of Spark in the cloud

Originally presented at Strata EU 2017: https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57631
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major performance improvements and the release of v2, making it challenging to keep production services, both on-premises and in the cloud, up to date, compatible, and stable.

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDinsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).

The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has recently been extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how advanced users can easily repeat the benchmarks through ALOJA and use BigBench to optimize their Spark clusters. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. (A preprint copy can be obtained here.)



  1. The state of Spark in the cloud
     Nicolas Poggi, May 2017 (Spark ASCII-art logo)
  2. Outline
     1. Intro to BSC and ALOJA
     2. Motivation and background
     3. BigBench and PaaS
     4. Sequential tests 1GB-1TB: data scales and cost
     5. Concurrency tests
     6. Summary
  3. Barcelona Supercomputing Center (BSC)
     • Spanish national supercomputing center with a 22-year history in computer architecture, networking, and distributed systems research
     • Based at BarcelonaTech University (UPC)
     • Large ongoing life-science computational projects
     • Prominent body of research activity around Hadoop
       • 2008-2013: SLA adaptive scheduler, accelerators, locality awareness, performance management (7 publications)
       • 2013-present: cost-efficient upcoming Big Data architectures (ALOJA, 8+ publications)
  4. ALOJA: towards cost-effective Big Data
     • Research project for automating characterization and optimization of Big Data deployments
     • Open source benchmarking-to-insights platform and tools
     • Largest Big Data public repository (70,000+ jobs)
     • Community collaboration with industry and academia
     • http://aloja.bsc.es
     (Diagram: Big Data benchmarking, online repository, web/ML analytics)
  5. Platform-as-a-Service Spark
     • Cloud-based managed Hadoop services: ready-to-use Spark, Hive, …
     • Simplified management: deploys in minutes, on-demand, elastic
     • You select the instance type and the number of processing nodes
     • Decoupled compute and storage
     • Pay-as-you-go pricing model
     • Optimized for general purpose, fine-tuned to the cloud provider architecture
  6. Motivation
     • 2016 SQL-on-Hadoop paper and presentations
       • Focused on Hive, since SparkSQL was not ready to use: different versions (1.3, 1.5, 1.6), some in preview mode, not carefully tuned
       • Used the TPC-H SQL-only benchmark
     • Early 2017: BigBench work on Hive and Spark, testing more than SQL
       • FOSDEM and Hadoop Summit EU presentations
       • New code available this month for MLlib v2 compatibility
     • Goal: evaluate the current out-of-the-box experience of Spark in PaaS clouds: readiness, scalability, price, and performance
  7. Surveyed Hadoop/Hive PaaS services
     • Amazon Elastic MapReduce (EMR). Released: Apr 2009. OS: Amazon Linux AMI (RHEL-like). SW stack: EMR 5.5.0 with Spark 2.1.0 and Hive 2.1
     • Google Cloud Dataproc (CDP). Released: Feb 2016. OS: Debian GNU/Linux 8.4. SW stack: preview version with Spark 2.1.0, v1.1 with Spark 2.0.2; both with Hive 2.1
     • Azure HDInsight (HDI). Released: Oct 2013. OS: Windows Server and Ubuntu 16.04. SW stack: HDP 2.6-based, with Spark 2.1.0 and 1.6.3, and Hive 1.2
     • Target deployment: 16 data nodes with 8 cores each, a master node with 16 cores, and decoupled storage only (object store / elastic stores)
  8. VM instances and characteristics
     • Amazon Elastic MapReduce (EMR): 16x M4.2xlarge data nodes (8-core, 32GB RAM), 1x M4.4xlarge master (16-core, 64GB RAM). Storage: 2x EBS GP2 volumes. Price/hr: $10.96 (billed by the hour)
     • Google Cloud Dataproc (CDP): 16x n1-standard-8 data nodes (8-core, 30GB RAM), 1x n1-standard-16 master (16-core, 60GB RAM). Storage: GCS. Price/hr: $10.38 (billed by the minute)
     • Azure HDInsight (HDI): 16x D4v2 data nodes (8-core, 28GB RAM), 2x D14v2 masters (16-core, 112GB RAM). Storage: WASB (Azure Blob Store). Price/hr: $20.68 (billed by the minute)
     • Disclaimer: snapshot of the out-of-the-box price and performance during May 2017. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark; we use per-second billing.
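The hourly prices and billing granularities above can be turned into a rough per-run cost estimate. A minimal sketch, where the `run_cost` helper, the round-up-to-billing-unit policy, and the 5,400 s example runtime are our own illustrative assumptions (the deck's later cost/performance chart instead computes costs to the second):

```python
import math

# Hourly cluster prices from the slide above (May 2017, non-discounted).
PRICE_PER_HOUR = {"EMR": 10.96, "CDP": 10.38, "HDI": 20.68}

# Billing granularity in seconds: EMR billed by the hour,
# CDP and HDI by the minute (as stated on the slide).
BILLING_UNIT_S = {"EMR": 3600, "CDP": 60, "HDI": 60}

def run_cost(provider: str, runtime_s: float) -> float:
    """Cost of one benchmark run, with runtime rounded up to the billing unit."""
    unit = BILLING_UNIT_S[provider]
    billed_s = math.ceil(runtime_s / unit) * unit
    return PRICE_PER_HOUR[provider] * billed_s / 3600.0

# Example: a hypothetical 90-minute (5,400 s) power run on each provider.
for p in ("EMR", "CDP", "HDI"):
    print(p, round(run_cost(p, 5400), 2))
```

For short runs the granularity matters: a 90-minute run on EMR is billed as two full hours, while the per-minute providers bill almost exactly the elapsed time.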
  9. What is BigBench (TPCx-BB)?
     • End-to-end application-level benchmark specification, the result of many years of collaboration between industry and academia
     • Covers most Big Data analytical properties (3Vs)
     • Covers 30 business use cases for a retailer company
     • Defines data scale factors: 1GB to PBs
     • BigBench history: 2012 launched at WBDB; 2013 published at SIGMOD; 2014 first implementation on GitHub; 2016 standardized by TPC (Feb), TPCx-BB version 1.2 (Nov); 2017 Spark MLlib v2 compatibility (under testing, May)
  10. BigBench use cases and process overview
     • 30 business use cases covering merchandising, pricing optimization, product returns, customers...
     • Implementation resulted in: 14 declarative queries (SQL), 7 with natural language processing, 4 with data preprocessing in M/R jobs, and 5 with machine learning jobs
     • Benchmark process: 1. data generation, 2. data loading, 3. power test, 4. throughput test 1, 5. data refresh, 6. throughput test 2. Result: BB queries / hour
  11. BigBench v1.2 – Reference implementation
     • Stack layers (diagram): filesystem HDFS; table metastore Hive Metastore; execution engines MapReduce, Tez, and Spark on YARN; SQL engines Hive and Spark SQL; machine learning Mahout and custom Spark MLlib
     • Combination options: Hive + MapReduce + Mahout; Hive + MapReduce + Spark MLlib (v1 and v2); Hive + Tez + Mahout; Hive + Tez + Spark MLlib; Spark SQL + Mahout; Spark SQL + Spark MLlib v1; Spark 2 SQL + Mahout; Spark 2 SQL + Spark MLlib (v1 and v2); (also Hive-on-Spark, etc.)
  12. Previous results: M/R vs. Tez and Mahout vs. MLlib v1
     • Average of three executions using 100 GB scale factor
     • (Chart annotations: 3x for M/R vs. Tez, 2x for Mahout vs. MLlib v1)
  13. Sequential Spark 2.1 runs
     • Queries 1-30 on Spark 2.1 (power runs), per provider and combined
     • (Spark 2.1.0 shell banner)
  14. BB 1GB-1TB: Spark 2.1 – Cloud Dataproc (CDP)
     • Chart shows the time increase as we move from 1GB to 1TB in data scale
     • From 1 to 100GB, there is less than a 2x increase in time for 100x more data, indicating over-provisioning
     • From 10GB to 1TB, the increase in time is 3x
  15. BB 1GB-1TB: Spark 2.1 – Elastic MapReduce (EMR)
     • Chart shows the time increase as we move from 1GB to 1TB in data scale for EMR
     • From 1 to 100GB, there is less than a 2x increase in time for 100x more data, indicating over-provisioning
     • From 10GB to 1TB, the increase in time is 4x, while CDP's was only 3x
     • The M/R jobs take a higher proportion of the run
  16. BB 1GB-1TB: Spark 2.1 – HDInsight (HDI)
     • Chart shows the time increase as we move from 1GB to 1TB in data scale for HDI
     • From 1 to 10GB, there is only a 5% increase in time
     • From 10GB to 1TB, the increase in time is 2.5x, less than the other providers
  17. BB 1GB-1TB: Spark 2.1 – All providers
     • Chart shows the time increase as we move from 1GB to 1TB in data scale for all providers
     • EMR is the fastest up to 100GB
     • At 1TB, HDI is the fastest and EMR the slowest; EMR has the largest increase in the M/R queries
  18. Errors…
     • Everything was run out-of-the-box, except for:
       • Queries 14 and 17 require cross joins to be enabled in Spark v2
       • At 10TB, spark.sql.broadcastTimeout (default 300) had to be increased in HDI (the timeout in seconds for the broadcast wait time in broadcast joins)
     • At 1TB, memory issues:
       • Queries 3, 4, 8: TimSort java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate
       • Queries 2 and 30: 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
     • Configs involved: spark.yarn.driver.memoryOverhead, spark.yarn.executor.memoryOverhead, and spark.executor.memory
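Collected in one place, the settings named on this slide could go into spark-defaults.conf. The values below are illustrative assumptions; the deck reports only which keys needed tuning, not the final numbers (note also that spark.yarn.*.memoryOverhead is the key spelling of the Spark 2.1-on-YARN era; later releases renamed it to spark.executor.memoryOverhead):

```properties
# Illustrative values; the deck does not state the final tuned numbers.
# Needed by queries 14 and 17 on Spark 2.x:
spark.sql.crossJoin.enabled          true
# Default 300 s; had to be raised for the 10TB HDI runs:
spark.sql.broadcastTimeout           1200
# Default is max(384 MB, 10% of executor memory):
spark.yarn.executor.memoryOverhead   1024
spark.yarn.driver.memoryOverhead     1024
spark.executor.memory                5g
```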
  19. Versions and Spark config
     • Java version: EMR OpenJDK 1.8.0_121, CDP OpenJDK 1.8.0_121, HDI OpenJDK 1.8.0_131
     • Spark version: EMR 2.1.0, CDP 2.1, HDI 2.1.0.2.6.0.2-76
     • Driver memory: EMR 5G, CDP 5G, HDI 5G
     • Executor memory: EMR 5G, CDP 10G, HDI 4G
     • Executor cores: EMR 4, CDP 4, HDI 3
     • Executor instances: EMR dynamic, CDP dynamic, HDI 20
     • dynamicAllocation enabled: EMR true, CDP true, HDI false
     • Executor memoryOverhead: EMR default (384MB), CDP 1,117MB, HDI 384MB
  20. BB 1TB M/R-only: Spark 2.1 – All providers
     • Zooming in by query, we can see that query 2 is the slowest on EMR, while on CDP and HDI it is in proportion to the rest
  21. BB 1TB Q2: Spark 2.1 – CPU utilization % on EMR and HDI
     • Q2: find the top 30 products that are mostly viewed together with a given product in the online store
     • CREATE TEMPORARY FUNCTION makePairs AS 'io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';
     • The job was CPU bound; the log showed: WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
     • Solution: increased memory for the executors; time was lowered from 6,417s to 1,501s
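The 5.5 GB limit in that log line can be reproduced from the defaults: in Spark 2.1 on YARN, the executor container is sized as spark.executor.memory plus an overhead that defaults to max(384 MB, 10% of the executor memory). A small sketch (the helper name is ours, not Spark's):

```python
def container_limit_mb(executor_memory_mb, overhead_mb=None):
    """YARN container size: executor memory plus memory overhead.

    When spark.yarn.executor.memoryOverhead is unset, Spark 2.1
    defaults it to max(384 MB, 10% of the executor memory).
    """
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb

# 5 GB executors, as in the EMR column of the config table:
limit = container_limit_mb(5 * 1024)
print(limit, limit / 1024)  # 5632 MB, i.e. the 5.5 GB limit in the log
```

With 5 GB executors the default overhead is 512 MB, giving a 5,632 MB (5.5 GB) container; once the JVM's actual footprint crossed that line, YARN killed the container, which is exactly the warning quoted above.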
  22. The sky (10TB) is the limit… and a price/performance comparison
  23. BigBench 10TB SQL-only: All providers
     • At 10TB, only the SQL part ran correctly in Spark
     • EMR got the fastest results
     • The rest still needs tuning to complete, but we are reaching the limit of the cluster / PaaS config
  24. BB 1GB-1TB: Spark 2.1 – Cost/performance overview
     • Chart shows an execution time-to-cost plot; costs calculated to the second (not to the billing fractions)
     • EMR is the cheapest to run at all sizes, and also the fastest up to 100GB
     • CDP is second, but at 1TB HDI becomes more cost-effective
  25. Other Spark comparisons: 2.0.2 vs. 2.1.0, 1.6.3 vs. 2.1.0, MLlib v1 vs. v2, Hive vs. Spark
  26. BigBench 1GB-1TB: Spark 2.0.2 vs. 2.1.0 (CDP)
     • Spark 2.1 is a bit faster at small scales, but slower at 100GB and 1TB on the UDF/NLP queries
  27. BigBench 1GB-1TB: Spark 1.6.3 + MLlib v1 vs. 2.1.0 + MLlib v2 (HDI)
     • Spark 2.1 is always faster than 1.6.3 in HDI
     • MLlib v2, using DataFrames over RDDs, is only slightly faster than v1
  28. BigBench 10GB and 1TB: Hive (+MLlib v2) vs. Spark 2.1
     • At 10GB, Hive is faster in both HDI and EMR, but slower in CDP
     • CDP shows a scalability problem at 1TB with Hive, as it doesn't enable Tez by default; this was observed in the previous study as well
     • At 1TB, Spark on both CDP and HDI is faster than Hive (HDI)
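For the CDP result above, the natural session-level fix is to switch Hive's execution engine away from MapReduce. A sketch only: hive.execution.engine is the standard Hive property, but the deck does not say whether Tez was available on Dataproc at the time.

```sql
-- Switch Hive from the default MapReduce engine to Tez for this session.
SET hive.execution.engine=tez;
```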
  29. Concurrency runs (throughput): 2 to 32 parallel streams
  30. BigBench 1-32 streams: Spark 2.1 at 1GB scale
     • From 16 streams on, the bottleneck is the CPU utilization on the master
     • HDI is faster at concurrency, but also showed the worst variability in its numbers
  31. Conclusions
     • All providers have up-to-date (2.1.0) and well-tuned versions of Spark; they could run BigBench up to 1TB on a medium-sized cluster, [almost] out of the box
     • Performance is similar among providers for similar cluster types and disk configs, with differences according to scale (and pricing)
     • Spark 2.1.0 is faster than previous versions, as is MLlib v2 with DataFrames, but the improvements are within the 30% range
     • Hive (+Tez +MLlib) is still slightly faster than Spark at lower scales, but very similar at larger scales; and using mainly Spark simplifies the pipeline
     • BigBench has been useful to stress a cluster with different workloads: it highlights config problems fast, stresses scale limits, and is helpful for tuning the clusters
     • And yes, Spark is now production-ready and performant in PaaS in the cloud
  32. Future work / WiP
     • Compare Hive versions 1 and 2 (HDI is still on v1); test LLAP with different settings
     • Variability study for Spark workloads in the cloud
     • Fix the 10TB runs to complete the results
     • Compare to on-prem runs
     • Optimizations: test the G1 GC; fat vs. thin executor configs
  33. Resources and references
     BigBench and ALOJA:
     • BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de): https://github.com/carabolic/Big-Data-Benchmark-for-Big-Bench/tree/spark2
     • Original BigBench implementation repository: https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
     • ALOJA benchmarking platform: https://github.com/Aloja/aloja and http://aloja.bsc.es/publications
     • ALOJA fork of BigBench (adds support for HDI and fixes Spark): https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench
     • The State of SQL-on-Hadoop in the Cloud, N. Poggi et al.: https://doi.org/10.1109/BigData.2016.7840751
     Big Data benchmarking:
     • Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations): http://clds.sdsc.edu/bdbc/community
     • Workshop on Big Data Benchmarking (WBDB): http://clds.sdsc.edu/bdbc/workshops
     • SPEC Research Big Data working group: http://research.spec.org/working-groups/big-data-working-group.html
     • Benchmarking slides and video:
       • Benchmarking Hadoop: https://www.slideshare.net/ni_po/benchmarking-hadoop
       • Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/
       • Tilmann Rabl's Big Data benchmarking tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  34. Thanks, questions?
     Follow up / feedback: Nicolas.Poggi@bsc.es, Twitter: ni_po
