How to Automate Performance Tuning for Apache Spark


Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How do I make sure that my pipeline always finishes on time and meets its SLA?

These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by data platforms or by third parties like the Data Mechanics platform.

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Jean-Yves Stephan & Julien Dumazert, Founders of Data Mechanics. Automating Performance Tuning for Apache Spark. #UnifiedDataAnalytics #SparkAISummit
  3. What is performance tuning? Cluster parameters: ● Size ● Instance type ● # of processors ● Amount of memory ● Disks ● ... Spark configurations: ● Parallelism ● Shuffle ● Storage ● JVM tuning ● Feature flags ● ...
  4. Why automate performance tuning? The tuning loop: pick new params → run the job → analyze logs → repeat.
  5. Why automate performance tuning? Done by hand, the loop means hard manual work, frequent outages, and slow, expensive jobs: a pager ringing at 3am, 30% of your engineers' time, SLAs missed every week.
  6. Agenda: ● Manual performance tuning ● Automated performance tuning
  7. Hands-on performance tuning
  8. Perf tuning is an iterative process. For the first run, there are rules of thumb for some params: ● # of partitions: 3x the number of cores in the cluster ● # of cores per executor: 4-8 ● memory per executor: 85% * (node memory / # of executors per node). For the other params: make an educated guess! (A small sketch of these rules follows.)
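
A minimal sketch of how these rules of thumb translate into a starting configuration. The node shape (16 vCPUs, 104 GB, roughly an n1-highmem-16) and the choice of 5 cores per executor are illustrative assumptions, not values prescribed by the talk:

```python
# Rules of thumb from the slide, turned into first-run parameters.
node_cores, node_memory_gb = 16, 104        # assumed node shape
num_nodes = 26

cores_per_executor = 5                      # rule of thumb: 4-8
executors_per_node = node_cores // cores_per_executor                       # -> 3
memory_per_executor_gb = int(0.85 * node_memory_gb / executors_per_node)    # -> 29
total_cores = num_nodes * executors_per_node * cores_per_executor           # -> 390
shuffle_partitions = 3 * total_cores        # rule of thumb: 3x total cores

print(f"--executor-cores {cores_per_executor} "
      f"--executor-memory {memory_per_executor_gb}g "
      f"--conf spark.sql.shuffle.partitions={shuffle_partitions}")
```
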
  9. Perf tuning is an iterative process. On the first attempt, the job crashes or does not meet the SLA. What to do? Iterate through the loop (pick new params → run the job → analyze logs) to: ● ensure stability of the job ● solve performance issues ● adjust the speed-cost trade-off.
  10. Common issues: lack of parallelism. Only 8 cores used on each machine! Configuration: ● 26 instances n1-highmem-16 ● spark.executor.cores = 16
  11. Common issues: lack of parallelism. Configuration: ● 26 instances n1-highmem-16 ● spark.executor.cores = 16. Only 8 cores used on each machine! The reason: with 26 executors and spark.sql.shuffle.partitions = 200, there are only 200 / 26 ≈ 7.7 tasks per executor. Fix: use 400+ partitions (halving both duration and cost). See also adaptive execution (SPARK-23128) for a way to set the number of partitions dynamically and automatically. A config sketch follows.
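
A hedged sketch of the fix above; the values are illustrative. It raises the number of shuffle partitions well above the total core count, and also shows the Spark 3.x adaptive-execution settings that size partitions dynamically:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "400")   # 400+ for 416 cores
    # On Spark 3.x, adaptive execution (SPARK-23128) can coalesce
    # shuffle partitions automatically at runtime:
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```
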
  12. Common issues: shuffle spill. The deserialized data produced by the map stages in a shuffle does not fit in memory. Spark temporarily writes it to disk, which degrades performance. Fixes: 1. Reduce the input data of each task by increasing the number of partitions. 2. Increase the memory available to each task, either by increasing spark.executor.memory or by decreasing the number of cores per executor. (Sketched below.)
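
A hedged sketch of the two shuffle-spill fixes, assuming an existing SparkSession named `spark` (for instance the one created in the previous sketch); all values are illustrative:

```python
# Fix 1: more partitions means less shuffle input per task. This is a SQL
# conf, so it can be changed at runtime on an existing session:
spark.conf.set("spark.sql.shuffle.partitions", "800")

# Fix 2: more memory per task. These are cluster-level settings that must
# be set before the application starts, e.g. on spark-submit:
#   --conf spark.executor.memory=24g     (raise executor memory)
#   --conf spark.executor.cores=4        (fewer concurrent tasks share it)
```
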
  13. Common issues: data skew. This issue is not addressable with parameter tuning; a change in code is required! Options: 1. Find a better partition key if possible. 2. Use a map-side (broadcast) join. 3. Use a salted key (sketched below).
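
A minimal sketch of fix 3 (salted key), assuming two hypothetical DataFrames: `skewed` (large, with a hot "key" column) and `small`. Each hot key is spread over SALT partitions, and `small` is replicated SALT times so every salted key still finds its match:

```python
from pyspark.sql import functions as F

SALT = 8  # number of sub-partitions per key; tune to the skew

# Add a random salt to the skewed side, and all salt values to the other side.
salted_left = skewed.withColumn("salt", (F.rand() * SALT).cast("int"))
salted_right = small.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT)]))
)
result = salted_left.join(salted_right, on=["key", "salt"]).drop("salt")

# Fix 2 (map-side join) is simpler whenever `small` fits in memory:
# result = skewed.join(F.broadcast(small), on="key")
```
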
  14. Improvements based on node metrics: ● Low CPU usage ⇒ consider oversubscription, i.e. telling Spark to schedule, say, 2x more tasks per executor than the number of cores. ● Low memory usage ⇒ consider pruning the excess memory and switching to CPU-intensive instances. ● IO-bound queries ⇒ consider switching instance type, or Spark IO configurations such as compression or IO buffer sizes. (See the sketch below.)
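
Hedged sketches of these three directions as spark-submit flags; the values are illustrative, not recommendations:

```python
# Low CPU usage -> oversubscription: declare more executor cores than the
# node physically has, so ~2x more tasks run concurrently per executor:
#   --conf spark.executor.cores=32       # on a 16-vCPU node

# Low memory usage -> shrink executor memory and pick CPU-heavy instances:
#   --executor-memory 12g                # e.g. alongside n1-highcpu nodes

# IO-bound -> try IO-related configs such as compression and buffers:
#   --conf spark.io.compression.codec=zstd
#   --conf spark.shuffle.file.buffer=1m
```
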
  15. Cost-speed trade-off. On the efficient frontier: cheaper ⇒ longer; shorter ⇒ more expensive. Once performance issues are solved, adjust your trade-off given your needs. [Chart: cost vs. duration, with the efficient frontier; solving performance issues moves you toward the frontier, adjusting the cost-speed trade-off moves you along it.]
  16. Cost-speed trade-off. Example: impact of the # of instances. [Chart: cost vs. duration for 4, 10, and 40 instances.]
  17. Recap: manual perf tuning. An iterative process: solve performance issues, then adjust the cost-speed trade-off. Most of the impact comes from a few parameters: ● # and type of instances for executors and driver ● executor and driver size (memory, # of cores) ● # of partitions
  18. Open source tuning tools. To detect performance issues: Dr. Elephant (LinkedIn). To simulate the cost-speed trade-off: Sparklens (only supports adjusting the # of executors).
  19. Automated performance tuning
  20. Motivations. Performance tuning can make periodic workloads 2x faster and more stable. But it is a tedious manual process and requires expertise. → To scale it to 100+ pipelines, automation is required!
  21. Architecture (tech). Components: a scheduler, a gateway, the optimization engine, the job history, and a Spark listener, around the customer's code running on the Data Mechanics Kubernetes cluster. Flow: 1) an unoptimized Spark job description arrives; 2) the optimization engine identifies the job from its history and returns a config; 3) the optimized Spark job description is submitted; 4) an agent exports event logs and system metrics during the job run.
  22. Architecture (algo). [Diagram: a timeline of daily runs of the same job, Jun 2nd through Jun 7th.]
  23. Architecture (algo). [Diagram: the Spark events log and metrics from past runs (Jun 2nd-6th) feed heuristics A and B ahead of the Jun 7th run.]
  24. Architecture (algo). [Diagram: heuristics A and B propose param sets (A1, A3, B2), which evaluators A and B assess. Evaluators leverage historical data.]
  25. Architecture (algo). [Diagram: the evaluated param sets (A1, B2) are passed to the experiment manager.]
  26. Architecture (algo). [Diagram: the experiment manager emits the optimistic best param set for the Jun 7th run.]
  27. Heuristics. Heuristics look for performance issues: ● FewerTasksThanTotalCores ● ShuffleSpill ● LongShuffleReadTime ● LongGC ● ExecMemoryLargerThanNeeded ● TooShortTasks ● CPUTimeShorterThanComputeTime ● ... Or push in a given direction: ● IncreaseDefaultNumPartitions ● IncreaseTotalCores ● ... Every heuristic proposes different param sets.
  28. Heuristics example: FewerTasksThanTotalCores. If a stage has fewer tasks than the total number of cores: 1. Increase the default number of partitions, if applicable. 2. Decrease the number of instances. 3. Decrease the # of cores per instance (and adjust memory). A toy version is sketched below.
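
A toy sketch of this heuristic (not Data Mechanics' actual engine): flag stages whose task count under-uses the cluster, and propose the fixes from the slide. The stage-dict shape is a hypothetical stand-in for data parsed from the Spark events log:

```python
def fewer_tasks_than_total_cores(stages, total_cores, default_partitions):
    """`stages`: list of dicts like {"stage_id": 3, "num_tasks": 120}."""
    proposals = []
    for stage in stages:
        if stage["num_tasks"] < total_cores:
            # 1. Raise the default number of partitions, if applicable.
            proposals.append({"spark.sql.shuffle.partitions":
                              max(default_partitions, 3 * total_cores)})
            # 2./3. Or shrink the cluster to match the actual parallelism.
            proposals.append({"action": "decrease number of instances"})
            proposals.append({"action": "decrease cores per instance"})
    return proposals

print(fewer_tasks_than_total_cores(
    [{"stage_id": 3, "num_tasks": 120}],
    total_cores=416, default_partitions=200))
```
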
  29. Ranking param sets. Param sets' costs and durations are evaluated by evaluators. The experiment manager then selects the best solution according to customer settings.
  30. Evaluator. Estimates cost and duration distributions: from history if possible, by simulation otherwise. The simulation is an optimistic model of the Spark scheduler. It takes a Spark app (its Spark events log) as input and simulates a new execution under different conditions: a different # of partitions, cores per executor, or executors. It makes optimistic assumptions: no GC time, no shuffle read time. Why optimistic? It encourages exploration! A toy simulator is sketched below.
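
A deliberately optimistic toy model in the spirit of this slide (not the actual engine): replay a stage's task durations on a hypothetical number of cores, ignoring GC and shuffle-read time, so the result is a lower bound on stage duration:

```python
import heapq

def simulate_stage(task_durations_s, total_cores):
    """Greedy longest-task-first scheduling of tasks onto cores."""
    cores = [0.0] * min(total_cores, len(task_durations_s))
    heapq.heapify(cores)
    for duration in sorted(task_durations_s, reverse=True):
        # Assign each task to the core that frees up first.
        heapq.heappush(cores, heapq.heappop(cores) + duration)
    return max(cores)

# 400 tasks of ~2s each on 416 cores finish in a single wave:
print(simulate_stage([2.0] * 400, total_cores=416))  # -> 2.0
```
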
  31. Experiment manager. Selects the best param set given customer objectives like: ● as cheap as possible within a maximum duration ● as fast as possible within budget ● finish at 6am no matter what. Contains Bayesian optimization logic to account for noise. A minimal selection step is sketched below.
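
A minimal sketch of the selection step for the first objective above ("as cheap as possible within a maximum duration"). Per the slide, the real system also models noise with Bayesian optimization; this sketch does not:

```python
def pick_best(evaluated, max_duration_s):
    """`evaluated`: (params, est_cost, est_duration_s) tuples from evaluators."""
    feasible = [e for e in evaluated if e[2] <= max_duration_s]
    if not feasible:                                  # nothing meets the SLA:
        return min(evaluated, key=lambda e: e[2])     # fall back to the fastest
    return min(feasible, key=lambda e: e[1])          # cheapest within the SLA

print(pick_best(
    [({"spark.sql.shuffle.partitions": 400}, 3.2, 1800),
     ({"spark.sql.shuffle.partitions": 200}, 2.1, 4000)],
    max_duration_s=3600,
))  # -> the 400-partition set, the cheapest that meets the one-hour SLA
```
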
  32. Architecture (algo), recap. [Diagram: the full pipeline, from the Spark events log and metrics of past runs (Jun 2nd-6th), through heuristics and evaluators, to the experiment manager's optimistic best param set for the Jun 7th run.]
  33. Impact of automated tuning: ● Stability: automatic remediation of OOMs and timeouts (upon retry). ● Performance: 50% cost reduction. ● The algorithm typically converges, and adapts to changes, in 5-10 iterations.
  34. Data Mechanics platform: ● A managed platform for containerized Spark apps in your cloud account. ● Just send your code; we automate the scaling and configuration tuning. ● Pricing based on real Spark compute time, not on idle server uptime. [Diagram: data engineers and data scientists connect through a gateway to the platform and its data source.]
  35. The hassle-free Spark platform, powered by Kubernetes. Learn more and sign up for the private beta at https://www.datamechanics.co. Also, we're hiring :) jobs@datamechanics.co
  36. Don't forget to rate and review the sessions. Search "Spark + AI Summit".