Amogh Margoor, Qubole Inc
Mayur Bhosale, Qubole Inc
Auto-Pilot for Apache Spark using Machine Learning
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Motivation
• Approach
• Scope
• Previous Work
• Gaussian Process
• Domain based Model
• Uchit - Spark Auto Tuner
• Demo
• Experimental Evaluation
• Open Source
2
Motivation
3
Tuning a Spark Application
Benefits
• Performance
• Resource Efficiency
4
On the public cloud, this translates to $$ saved.
Tuning is a Hard Problem!!
● Manual
● Requires Domain Knowledge
● Too many Knobs to configure
5
6
Optimize TPC-DS q2
● Analyze the query plan
  ○ The 3 joins in the red circle are SortMerge Joins.
  ○ All 3 can be converted to Broadcast Joins.
7
Optimize TPC-DS q2
● Analyze the query plan
  ○ The 3 joins in the red circle are SortMerge Joins.
  ○ All 3 can be converted to Broadcast Joins.
● Manual
● Requires Domain Knowledge
● Too many Knobs
Approach
8
Scope
9
– Goals: Improve Runtime or Cloud Cost.
– Insights through SparkLens are quite helpful (demo). Can we also auto-tune the Spark configuration for the above goals?
– Target repetitive queries – ETL, reporting, etc.
Previous Work
10
• “Standing on the shoulders of Giants”
– S. Kumar, S. Padakandla, C. Lakshminarayanan, P. Parihar, K. Gopinath, S. Bhatnagar. Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach. arXiv abs/1611.10052, 2016.
– H. Herodotou, S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111-1122, 2011.
– H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pp. 261-272, 2011.
– A. J. Storm, C. Garcia-Arellano, S. Lightstone, Y. Diao, M. Surendra. Adaptive Self-tuning Memory in DB2. In VLDB, 2006.
– D. G. Sullivan, M. I. Seltzer, A. Pfeffer. Using Probabilistic Reasoning to Automate Software Tuning. In SIGMETRICS, 2004.
– D. N. Tran, P. C. Huynh, Y. C. Tay, A. K. H. Tung. A New Approach to Dynamic Self-tuning of Database Buffers. ACM Transactions on Storage, 4(1), 2008.
– B. Zhang, D. Van Aken, J. Wang, T. Dai, S. Jiang, et al. A Demonstration of the OtterTune Automatic Database Management System Tuning Service. PVLDB.
– S. Duan, V. Thummala, S. Babu. Tuning Database Configuration Parameters with iTuned. VLDB, August 2009.
– Prasad M. Deshpande, Amogh Margoor, Rajat Venkatesh. Automatic Tuning of SQL-on-Hadoop Engines on Cloud Platforms. IEEE CLOUD 2018.
Tuning a Spark Application
11
• Machine Learning Based:
– B. Zhang, D. Van Aken, J. Wang, T. Dai, S. Jiang, et al. A Demonstration of the OtterTune Automatic Database Management System Tuning Service. PVLDB.
– S. Duan, V. Thummala, S. Babu. Tuning Database Configuration Parameters with iTuned. VLDB, August 2009.
• Domain Knowledge Based:
– Prasad M. Deshpande, Amogh Margoor, Rajat Venkatesh. Automatic Tuning of SQL-on-Hadoop Engines on Cloud Platforms. IEEE CLOUD 2018.
Machine Learning Approach
12
Machine Learning Approaches
13
• Following previous work, our approach is:
– Iterative:
  • Step 1: Predict a good config based on previous runs.
  • Step 2: Run with the predicted config and add the result to the previous runs.
  • Repeat Steps 1 and 2 for `n` iterations (sketched below).
– Gaussian Process based.
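A minimal sketch of this loop, assuming two hypothetical helpers: `run_spark_job` (launch the job with a config, return its runtime) and `predict_good_config` (the model's prediction step). Neither is part of any real API.

```python
# Sketch of the iterative "predict, run, add to history" loop described above.

def run_spark_job(config):
    """Hypothetical: run the Spark job with `config` and return its runtime in seconds."""
    raise NotImplementedError

def predict_good_config(history):
    """Hypothetical: use a model (e.g. a Gaussian Process) over `history`,
    a list of (config, runtime) pairs, to pick the next config to try."""
    raise NotImplementedError

def tune(initial_configs, n_iterations):
    # Seed the history ("previous runs") with a few initial configs.
    history = [(c, run_spark_job(c)) for c in initial_configs]
    for _ in range(n_iterations):
        config = predict_good_config(history)              # Step 1
        history.append((config, run_spark_job(config)))    # Step 2
    return min(history, key=lambda item: item[1])[0]       # best config observed
```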
Gaussian Process
14
● Image Source: https://katbailey.github.io/post/gaussian-processes-for-dummies/
● A Gaussian Process is a non-parametric approach.
● Other, parametric regression techniques start with a fixed assumption about the parameters. Problems:
  ○ y = θ₀ + θ₁x: a linear equation with 2 parameters is not enough for the data.
  ○ y = θ₀ + θ₁x + θ₂x²: a quadratic equation with 3 parameters will be more appropriate.
● A Gaussian Process is non-parametric, i.e., it does not fix the functional form in advance and considers all the possibilities.
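As an illustration only (not the tuner's actual code), scikit-learn can show the difference between a fixed-form regression and a GP, which also reports its own uncertainty:

```python
# Toy comparison: a fixed 2-parameter linear model vs. a non-parametric GP (scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()                       # toy data with a clearly non-linear shape

linear = LinearRegression().fit(X, y)       # y = θ0 + θ1·x: underfits this data
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True).fit(X, y)

X_new = np.array([[2.5], [7.5]])
mean, std = gp.predict(X_new, return_std=True)   # the GP also returns its uncertainty
print(linear.predict(X_new), mean, std)
```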
Gaussian Process
15
● Image Source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
(Figure panels: Prior | Posterior)
Gaussian Process
16
● Image Source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
Gaussian Process - Advantage
17
● Image Source: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d
How does a Gaussian Process help in finding good configs iteratively?
A GP reports the degree of certainty of its predictions (low or high), which lets us balance Exploitation and Exploration.
Exploration: try configs with low-certainty predictions, i.e., configs different from the training data.
Exploitation: pick configs with a high-certainty prediction of improvement.
(A sketch of this selection rule follows below.)
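A minimal sketch of such a rule, assuming a fitted scikit-learn `GaussianProcessRegressor` that predicts runtime for each candidate config vector; a Lower Confidence Bound is used here, but other acquisition functions work too:

```python
# Balance exploitation (low predicted runtime) and exploration (high uncertainty).
import numpy as np

def next_config(gp, candidate_configs, kappa=2.0):
    """`gp` predicts runtime (lower is better) for each row of `candidate_configs`.
    The std term favours configs the model is still uncertain about."""
    mean, std = gp.predict(candidate_configs, return_std=True)
    lcb = mean - kappa * std              # optimistic estimate of each config's runtime
    return candidate_configs[int(np.argmin(lcb))]
```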
ML Model Issues
• Each training point is an actual run of the job. Learning over multiple runs might be required to capture:
– Correlation between configs.
– Sensitivity of individual configs for a particular job.
– Exploration of a large config space for the global optimum.
– Domain-specific insights, e.g., cloud-specific behaviour.
• Too many runs can be expensive.
18
ML Model Issues
• The model searches for the optimal config using historical data.
– Problem: it might need multiple iterations to prune out obviously non-optimal configs.
– Solution: to converge sooner, domain knowledge can be used to prune non-optimal configs.
19
Domain based model
20
Insight 1: Spills are expensive and should be avoided at all cost.
Spill increases disk I/O significantly.
Avoided by:
○ Increasing the memory of tasks/containers.
○ More fine-grained tasks, i.e., increased parallelism: e.g., decreasing split sizes or increasing shuffle partitions.
Evaluation
Runtime for TPC-DS q46 drops by almost 30% when shuffle partitions are increased from 100 to 200 (see the config sketch below).
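A sketch of the knobs mentioned above on a PySpark session; the values are illustrative examples, not recommendations:

```python
# Illustrative spill-avoidance settings (values are examples only).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpcds-q46")
         .config("spark.sql.shuffle.partitions", "200")        # was 100: finer-grained shuffle tasks
         .config("spark.executor.memory", "6g")                # more memory per container
         .config("spark.sql.files.maxPartitionBytes", "64m")   # smaller splits, more read tasks
         .getOrCreate())
```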
Insight 2: For Spark, use a single fat executor that uses all the cores on a node.
Reasons for the improvement:
● More efficient memory usage shared across cores
● Fewer replicas of broadcast tables
● Reduced overheads
Evaluation
● The figure alongside shows the effect of increasing cores per executor.
● spark.executor.cores was increased from 1 to 8, with spark.executor.memory varied correspondingly from 1152 MB to 11094 MB, keeping memory per core constant.
● A performance benefit of up to 25% was seen with the fatter executor (a config sketch follows below).
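As a sketch, the two executor shapes from this experiment can be written as config dictionaries (numbers taken from the slide; the helper below is illustrative):

```python
# "Thin" vs. "fat" executors, with roughly constant memory per core (per the slide).
thin = {"spark.executor.cores": "1", "spark.executor.memory": "1152m"}
fat  = {"spark.executor.cores": "8", "spark.executor.memory": "11094m"}  # ~1387 MB per core

def apply_conf(builder, conf):
    """Illustrative helper: apply a dict of Spark configs to a SparkSession builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```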
Insight 3: Memory/vCPU ratio
YARN allocates containers along two dimensions: memory and vCPU.
Each container is given 1 vCPU and some memory.
The memory/vCPU ratio of the containers should match the memory/vCPU ratio of the machine type; otherwise resources are wasted!
Machine Family
Different machine families have different memory/CPU characteristics.
The recommended memory profile for a query's containers should match the family's ratio.
Otherwise, recommend a change of machine family for the cluster (a small ratio check is sketched below).
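A small sketch of the ratio check described above; the machine specs in the example are illustrative assumptions:

```python
# Does a container's memory/vCPU ratio match the machine type's ratio?
def matches_machine_ratio(machine_mem_gb, machine_vcpus, container_mem_gb,
                          container_vcpus=1, tolerance=0.2):
    machine_ratio = machine_mem_gb / machine_vcpus
    container_ratio = container_mem_gb / container_vcpus
    return abs(container_ratio - machine_ratio) / machine_ratio <= tolerance

# Example (illustrative specs): a memory-optimized node with 61 GB / 8 vCPUs
# paired with a 2 GB, 1 vCPU container wastes most of the node's memory.
print(matches_machine_ratio(61, 8, 2))   # False -> consider a different machine family
```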
Insight 4: Generate better SQL plans
● Collect statistics for the Catalyst optimizer.
● Tune configurations for better plans: e.g., more broadcast joins in TPC-DS q2.
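For example (assuming an existing SparkSession `spark`; the table, column, and threshold below are illustrative, drawn from TPC-DS):

```python
# Collect statistics for Catalyst and nudge the planner towards broadcast joins.
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")                               # table-level stats
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_sold_date_sk")   # column stats
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))          # 100 MB (example)
```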
Uchit – Spark Auto Tuner
26
Uchit – Spark Auto Tuner
27
Config Sampling
• Discretize the configuration space.
E.g., if spark.executor.memory for r3.xlarge can vary between 2 GB and 24 GB,
the discretized values = {2, 4, 6, 8, … 24}.
• Possible combinations for 5 configs ≃ 29 million.
• With sampling we can reduce this to a space of ~2000 configs: Latin Hypercube Sampling (sketched below).
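A sketch of Latin Hypercube Sampling over a 5-dimensional config space using SciPy; the config names, bounds, and discretization are illustrative:

```python
# Draw 2000 well-spread points over 5 config dimensions, then snap to the grid.
import numpy as np
from scipy.stats import qmc

lower = [2,  1,  50,  32,  10]   # e.g. executor memory (GB), cores, shuffle partitions, ...
upper = [24, 8, 400, 256, 200]   # illustrative upper bounds

sampler = qmc.LatinHypercube(d=5, seed=42)
unit_samples = sampler.random(n=2000)                      # points in [0, 1)^5
configs = np.round(qmc.scale(unit_samples, lower, upper))  # scale to bounds, snap to integers
```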
Sampling – Latin Hypercube
Combined Model
(Architecture diagram with components: Historical runs, Latin HyperCube Sampler, Normalizer, DeNormalizer, ML Model, Math Model, Combiner, Best Conf.)
Math Model and Combiner:
● A novel technique to combine a domain-based Math Model with the ML model.
● The Combiner merges the two models; its main functions are:
○ Prune non-optimal parts of the config space
○ Guide the search towards optimal settings
(An illustrative combiner is sketched below.)
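An illustrative combiner, showing how a domain-based rule can prune the sampled space before the ML model ranks the survivors; this is a sketch with assumed names, not Uchit's actual interface:

```python
# Domain model prunes non-optimal configs; the GP then picks among the survivors.
import numpy as np

def math_model_ok(config):
    """Hypothetical domain rule, e.g. reject configs whose memory per core
    is low enough to risk spills (threshold is illustrative)."""
    return config["executor_mem_gb"] / config["executor_cores"] >= 1.5

def combined_next_config(gp, sampled_configs, to_vector, kappa=2.0):
    feasible = [c for c in sampled_configs if math_model_ok(c)]          # prune
    mean, std = gp.predict([to_vector(c) for c in feasible], return_std=True)
    return feasible[int(np.argmin(mean - kappa * std))]                  # guide towards the optimum
```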
Demo:
https://github.com/qubole/uchit/blob/master/Uchit%20Tutorial.ipynb
31
Experimental Evaluation
32
Experimental Evaluation - I
33
Experimental Evaluation - q2
34
Experimental Evaluation
35
Config 1 vs Config 2: more joins converted from SortMerge Join to Broadcast Join.
Correct configs
36
Correct configs
37
Combined Model vs ML Model
38
● Config space reduction by 400X (i.e., 2000 configs to 55 configs)
● Reduces iterations by around 3X
Uchit OS
39
https://github.com/qubole/uchit
● Pluggable `Bring your own model` framework.
● Clearly defined interfaces for combining models.
● Scope for tuning other engines like Tez, etc.
40
Questions?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
