
Auto-Pilot for Apache Spark Using Machine Learning


At Qubole, users run Spark at scale on the cloud (900+ concurrent nodes). At this scale, tuning Spark configurations is essential for efficiently running SLA-critical jobs, but it remains a difficult undertaking, largely driven by trial and error. In this talk, we address the problem of auto-tuning SQL workloads on Spark; the same technique can also be adapted for non-SQL Spark workloads. In earlier work [1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run them. However, for auto-tuning Spark configurations, we saw scope for improvement. On exploration, we found previous work addressing auto-tuning with machine learning techniques. A major drawback of the simple model [1] is that it cannot use multiple runs of a query to improve its recommendation, whereas the major drawback of machine learning techniques is that they lack domain-specific knowledge. Hence, we decided to combine both. Our auto-tuner interacts with both models to arrive at good configurations: once a user selects a query to auto-tune, the next configuration is computed from the models and the query is run with it. Metrics from the event log of the run are fed back to the models to obtain the next configuration. The auto-tuner continues exploring configurations until it exhausts the fixed budget specified by the user. In practice, this method gives much better configurations than those chosen even by experts on real workloads, and it converges quickly to an optimal configuration. In this talk, we present a novel ML model technique and the way it was combined with our earlier approach. Results on real workloads are presented, along with limitations and challenges in productionizing them. [1] Margoor et al., "Automatic Tuning of SQL-on-Hadoop Engines", IEEE CLOUD 2018.



1. Amogh Margoor (Qubole Inc), Mayur Bhosale (Qubole Inc). Auto-Pilot for Apache Spark using Machine Learning. #UnifiedDataAnalytics #SparkAISummit
2. Agenda • Motivation • Approach • Scope • Previous Work • Gaussian Process • Domain-based Model • Uchit - Spark Auto Tuner • Demo • Experimental Evaluation • Open Source
3. Motivation
4. Tuning a Spark Application. Benefits: • Performance • Resource Efficiency. On the public cloud, this translates to $$ saved.
5. Tuning is a Hard Problem! • Manual • Requires domain knowledge • Too many knobs to configure
6. Optimize TPC-DS q2 • Analyze the query plan: ○ The 3 joins in the red circle are SortMerge joins. ○ All 3 can be converted to Broadcast joins.
7. Optimize TPC-DS q2 • Analyze the query plan: ○ The 3 joins in the red circle are SortMerge joins. ○ All 3 can be converted to Broadcast joins. • Manual • Requires domain knowledge • Too many knobs
8. Approach
9. Scope – Goals: improve runtime or cloud cost. – Insights through SparkLens are quite helpful (demo). Can we also auto-tune the Spark configuration for the above goals? – Target repetitive queries: ETL, reporting, etc.
10. Previous Work: "Standing on the shoulders of giants"
– S. Kumar, S. Padakandla, C. Lakshminarayanan, P. Parihar, K. Gopinath, S. Bhatnagar, "Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach", vol. abs/1611.10052, 2016.
– H. Herodotou, S. Babu, "Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs", Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111-1122, 2011.
– H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, S. Babu, "Starfish: A Self-tuning System for Big Data Analytics", CIDR, pp. 261-272, 2011.
– A. J. Storm, C. Garcia-Arellano, S. Lightstone, Y. Diao, M. Surendra, "Adaptive Self-tuning Memory in DB2", VLDB, 2006.
– D. G. Sullivan, M. I. Seltzer, A. Pfeffer, "Using Probabilistic Reasoning to Automate Software Tuning", SIGMETRICS, 2004.
– D. N. Tran, P. C. Huynh, Y. C. Tay, A. K. H. Tung, "A New Approach to Dynamic Self-tuning of Database Buffers", ACM Transactions on Storage, 4(1), 2008.
– B. Zhang, D. Van Aken, J. Wang, T. Dai, S. Jiang, et al., "A Demonstration of the OtterTune Automatic Database Management System Tuning Service", PVLDB.
– S. Duan, V. Thummala, S. Babu, "Tuning Database Configuration Parameters with iTuned", VLDB, August 2009.
– Prasad M. Deshpande, Amogh Margoor, Rajat Venkatesh, "Automatic Tuning of SQL-on-Hadoop Engines on Cloud Platforms", IEEE CLOUD 2018.
11. Tuning a Spark Application
• Machine Learning based:
– B. Zhang, D. Van Aken, J. Wang, T. Dai, S. Jiang, et al., "A Demonstration of the OtterTune Automatic Database Management System Tuning Service", PVLDB.
– S. Duan, V. Thummala, S. Babu, "Tuning Database Configuration Parameters with iTuned", VLDB, August 2009.
• Domain knowledge based:
– Prasad M. Deshpande, Amogh Margoor, Rajat Venkatesh, "Automatic Tuning of SQL-on-Hadoop Engines on Cloud Platforms", IEEE CLOUD 2018.
12. Machine Learning Approach
13. Machine Learning Approaches • Following previous work, our approach is: – Iterative: • Step 1: predict a good config based on previous runs. • Step 2: run with the predicted config and add the result to the previous runs. • Repeat Steps 1 and 2 for n iterations. – Gaussian Process based.
14. Gaussian Process ● A Gaussian Process is a non-parametric approach. ● Parametric regression techniques start with a fixed assumption about the parameters. Problems: ○ y = θ0 + θ1x: a linear equation with 2 parameters may not be enough for the data. ○ y = θ0 + θ1x + θ2x²: a quadratic equation with 3 parameters may be more appropriate. ● A Gaussian Process is non-parametric, i.e., it places a distribution over functions rather than committing to one fixed form. (Image source: https://katbailey.github.io/post/gaussian-processes-for-dummies/)
15. Gaussian Process: prior vs. posterior. (Image source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d)
16. Gaussian Process. (Image source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d)
17. Gaussian Process - Advantage. How does a Gaussian Process help in finding good configs iteratively? The GP reports the degree of certainty of its predictions, low or high, which balances exploitation and exploration. Exploration: try configs with low-certainty predictions, i.e., configs different from the training data. Exploitation: pick configs with a high-certainty prediction of improvement. (Image source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d)
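A minimal sketch of this predict-then-run loop's scoring step (illustrative only: the 1-D normalized config space, the RBF kernel, its length scale, and the 1.5 exploration weight are all assumptions, not Uchit's actual model). The GP posterior yields a mean and an uncertainty per candidate config, and a lower-confidence-bound score trades off exploitation (low predicted runtime) against exploration (high uncertainty):

```python
import numpy as np

def rbf_kernel(a, b, length=0.3):
    # squared-exponential kernel over 1-D normalized configs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    # standard GP regression: posterior mean and per-point std deviation
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_obs, x_cand)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = rbf_kernel(x_cand, x_cand) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# two observed runs: normalized shuffle-partition setting -> runtime (s)
x_obs = np.array([0.25, 1.0])
y_obs = np.array([120.0, 90.0])
x_cand = np.linspace(0.0, 1.0, 50)

y0 = y_obs.mean()                      # center targets around a zero-mean prior
mean, std = gp_posterior(x_obs, y_obs - y0, x_cand)
mean = mean + y0
# exploitation: low predicted runtime; exploration: bonus for uncertainty
score = mean - 1.5 * std
next_x = x_cand[np.argmin(score)]      # config to run in the next iteration
```

After running `next_x`, its measured runtime would be appended to `x_obs`/`y_obs` and the loop repeated, which is the iterative scheme the slide describes.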
18. ML Model Issues • Each training point is an actual run of the job, and learning over multiple runs may be required to capture: – Correlations between configs. – Sensitivity of individual configs for a particular job. – A large config space to explore for the global optimum. – Domain-specific insights (e.g., cloud insights). • Too many runs can be expensive.
19. ML Model Issues • The model searches for the optimal config using historical data. – Problem: it might need multiple iterations to prune out obviously non-optimal configs. – Solution: to converge sooner, use domain knowledge to prune non-optimal configs.
20. Domain-based Model
21. Insight 1: Spills are expensive and should be avoided at all costs, since spilling increases disk I/O significantly. Avoid them by: ○ Increasing the memory of tasks/containers. ○ Using more fine-grained tasks, i.e., increased parallelism (e.g., decreasing split sizes or increasing shuffle partitions). Evaluation: time for TPC-DS q46 drops by almost 30% when shuffle partitions are increased from 100 to 200.
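As a back-of-the-envelope illustration of the parallelism lever (the helper name, the 1.5x headroom factor, and the sizes are assumptions for the sketch, not Qubole's actual rule), one can lower-bound the shuffle partition count so that each task's share of the shuffle data fits in its execution memory and never spills:

```python
import math

def min_shuffle_partitions(shuffle_bytes, task_mem_bytes, headroom=1.5):
    """Smallest partition count keeping each task's shuffle data in memory.

    Illustrative heuristic only: the headroom factor leaves slack for
    execution overheads so tasks are unlikely to spill to disk.
    """
    return math.ceil(shuffle_bytes * headroom / task_mem_bytes)

# e.g. 150 GB of shuffle data with ~1 GB of execution memory per task
parts = min_shuffle_partitions(150 * 1024**3, 1024**3)  # -> 225 partitions
```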
22. Insight 2: For Spark, use a single fat executor that uses all the cores in a node. Reasons for the improvement: ● Better memory sharing between cores. ● Fewer replicas of broadcast tables. ● Reduced overheads. Evaluation: ● The adjacent figure shows the effect of increasing cores per executor. ● We increased spark.executor.cores from 1 to 8 and correspondingly varied spark.executor.memory from 1152MB to 11094MB, keeping memory per core constant. ● We saw a performance benefit of up to 25% with the fatter executor.
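A minimal sizing sketch for the fat-executor idea, under stated assumptions: a hypothetical 8-core, 30 GB node, ~1 GB reserved for the OS and daemons, and Spark's default 10% executor memory overhead. The helper name is invented for illustration:

```python
def fat_executor_conf(node_cores, node_mem_mb, os_reserve_mb=1024,
                      overhead_frac=0.10):
    """Size a single fat executor for one node (illustrative helper).

    Gives the executor all cores, and a heap sized so heap plus the
    ~10% spark.executor.memoryOverhead fits in the node memory left
    after the OS/daemon reservation.
    """
    usable_mb = node_mem_mb - os_reserve_mb
    heap_mb = int(usable_mb / (1 + overhead_frac))
    return {
        "spark.executor.cores": str(node_cores),
        "spark.executor.memory": f"{heap_mb}m",
    }

conf = fat_executor_conf(node_cores=8, node_mem_mb=30 * 1024)
```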
23. Insight 3: Memory/vCPU ratio. YARN allocates containers along two dimensions, memory and vCPU, and each container is given 1 vCPU and some memory. The memory/vCPU ratio of the containers should match the memory/vCPU ratio of the machine type; otherwise, resources are wasted!
24. Machine Family. Different machine families have different memory/CPU characteristics. The recommended memory profile for a query's containers should match the family's ratio; otherwise, recommend changing the machine family of the cluster.
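The ratio-matching idea above can be sketched as a simple predicate (the function, the 25% tolerance, and the node shapes are illustrative assumptions, not the talk's actual rule):

```python
def ratio_matches_family(container_mem_gb, container_vcpu,
                         node_mem_gb, node_vcpu, tol=0.25):
    """True if the container memory/vCPU ratio is close to the node's.

    A mismatch means one dimension (memory or vCPU) runs out first and
    the rest of the node sits idle; the fix is a different machine family.
    """
    node_ratio = node_mem_gb / node_vcpu
    container_ratio = container_mem_gb / container_vcpu
    return abs(container_ratio - node_ratio) / node_ratio <= tol

# 2 GB/vCPU containers on a memory-optimized 8 GB/vCPU node: a mismatch,
# so most of the node's memory would be wasted
fits = ratio_matches_family(2, 1, 64, 8)
```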
25. Insight 4: Generate better SQL plans. ● Collect statistics for the Catalyst optimizer. ● Tune configurations for better plans, e.g., more broadcast joins in TPC-DS q2.
26. Uchit – Spark Auto Tuner
27. Uchit – Spark Auto Tuner
28. Config Sampling • Discretize configurations. E.g., if spark.executor.memory for r3.xlarge can vary between 2GB and 24GB, the discretized values are {2, 4, 6, 8, ..., 24}. • The number of possible configs for 5 parameters is ≃ 29 million. • With Latin Hypercube Sampling we reduce this to a space of 2000 configs.
29. Sampling: Latin Hypercube
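A minimal Latin Hypercube sampler over discretized config grids can look like this (a pure-Python sketch; the two grids and the sample count are illustrative, and production implementations such as scipy.stats.qmc.LatinHypercube sample the continuous unit cube instead). The defining property is that each dimension's strata are each used once, so the samples spread across every parameter's range:

```python
import random

def latin_hypercube_sample(n, grids):
    """Draw n configs over discrete per-parameter grids.

    Each dimension is split into n strata; each stratum contributes one
    value, and the strata are shuffled independently per dimension.
    """
    columns = []
    for grid in grids:
        strata = [grid[int(i * len(grid) / n)] for i in range(n)]
        random.shuffle(strata)
        columns.append(strata)
    return [tuple(col[i] for col in columns) for i in range(n)]

# hypothetical grids: executor memory (GB) and shuffle partitions
mem_grid = list(range(2, 25, 2))         # {2, 4, ..., 24}
part_grid = list(range(100, 1100, 100))  # {100, 200, ..., 1000}
samples = latin_hypercube_sample(10, [mem_grid, part_grid])
```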
30. Combined Model. Pipeline: historical runs → Latin Hypercube Sampler → Normalizer → ML Model + Math Model → Combiner → DeNormalizer → best conf. Math Model and Combiner: ● A novel technique to combine the domain-based Math Model and the ML model. ● The Combiner combines the models; its main functions are: ○ Pruning non-optimal spaces. ○ Guiding the search towards optimal settings.
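One way to picture the Combiner's pruning role (a hypothetical rule written for this sketch, not Uchit's actual Math Model): before the ML model spends a real run on a sampled config, drop configs that a domain insight already marks as non-optimal, here using the memory-per-core ratio from Insight 3:

```python
def prune_non_optimal(samples, node_mem_gb, node_cores):
    """Sketch of the Combiner's pruning step (hypothetical rule).

    Keeps only (executor_mem_gb, executor_cores) pairs whose
    memory-per-core ratio is within 0.5x-2x of the node's own ratio;
    everything else is pruned before the ML model ever sees it.
    """
    node_ratio = node_mem_gb / node_cores
    return [
        (mem_gb, cores)
        for mem_gb, cores in samples
        if 0.5 * node_ratio <= mem_gb / cores <= 2.0 * node_ratio
    ]

# three sampled configs on a 30 GB, 8-core node: only the middle one
# has a sane memory-per-core ratio
kept = prune_non_optimal([(4, 8), (24, 8), (30, 2)],
                         node_mem_gb=30, node_cores=8)
```

Pruning like this shrinks the space the GP must explore, which is how the combined model converges in fewer real runs than the ML model alone.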
31. Demo: https://github.com/qubole/uchit/blob/master/Uchit%20Tutorial.ipynb
32. Experimental Evaluation
33. Experimental Evaluation - I
34. Experimental Evaluation - q2
35. Experimental Evaluation: Config 1 vs. Config 2. With Config 2, more joins are converted from SortMerge join to Broadcast join.
36. Correct configs
37. Correct configs
38. Combined Model vs. ML Model ● Config space reduction by 400X (i.e., 2000 configs to 55 configs). ● Iterations reduced by around 3X.
39. Uchit OS: https://github.com/qubole/uchit ● A pluggable "bring your own model" framework. ● Clearly defined interfaces for combining models. ● Scope for tuning other engines such as Tez.
40. Questions?
41. DON'T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
