Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

Recent workload trends indicate rapid growth in the deployment of machine learning, genomics, and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict application performance under various resource configurations so that we can automatically choose the optimal one. We present Ernest, a performance prediction framework for large-scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while the training overhead is less than 5% for long-running jobs.

  1. Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark. Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, Ion Stoica
  2. Workload Trends: Advanced Analytics
  3. Workload Trends: Advanced Analytics
  4. Workload Trends: Advanced Analytics
  5. Workload Trends: Advanced Analytics
  6. KeystoneML TIMIT Pipeline: Raw Data → Cosine Transform → Normalization → Linear Solver
  7. KeystoneML TIMIT Pipeline properties: Raw Data → Cosine Transform → Normalization → Linear Solver, ~100 iterations. Iterative (each iteration launches many jobs); long-running → expensive; numerically intensive.
  8. Cloud Computing Choices.
     Amazon EC2: t2.nano, t2.micro, t2.small; m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge; m3.medium; c4.large, c4.xlarge, c4.2xlarge; c3.large, c3.xlarge, c3.4xlarge; r3.large, r3.xlarge, r3.4xlarge; i2.2xlarge, i2.4xlarge; d2.xlarge, d2.2xlarge, d2.4xlarge; …
     Microsoft Azure: Basic tier: A0, A1, A2, A3, A4; Optimized Compute: D1, D2, D3, D4, D11, D12, D13, D1v2, D2v2, D3v2, D11v2, …; Latest CPUs: G1, G2, G3, …; Network Optimized: A8, A9; Compute Intensive: A10, A11, …
     Google Compute Engine: n1-standard-1, n1-standard-2, n1-standard-4, n1-standard-8, n1-standard-16; n1-highmem-2, n1-highmem-4, n1-highmem-8; n1-highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32; f1-micro, g1-small; …
     The choice spans both instance types and the number of instances.
  9. Tyranny of Choice
  10. User concerns: "What is the cheapest configuration to run my job in 2 hours?" "Given a budget, how fast can I run my job?" "What kind of instances should I use on EC2?"
  11. Do choices matter? Matrix multiply, 400K by 1K. [Bar chart: time (s), 0 to 30, for five configurations with the same total cores (16), memory (244 GB), and cost ($2.66/hr): 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large.]
  12. Do choices matter? [Two bar charts, time (s) across the same five configurations: Matrix Multiply (400K by 1K) and QR Factorization (1M by 1K); annotations mark network-bound and memory-bandwidth-bound regimes.]
  13. Do choices matter? [Line chart: time (s) vs. cores (0 to 600), actual vs. ideal, for QR Factorization (1M by 1K) on r3.4xlarge instances.] Computation + communication → non-linear scaling.
  14. Approach: Job + Data → Performance Model. Challenges: black-box jobs; model-building overhead. What makes it tractable: jobs have regular structure, and a few iterations suffice for training.
  15. Modeling Jobs
  16. Computation patterns. [Sketches: time as a function of input size, and time as a function of 1/machines.]
  17. Communication patterns: one-to-one → constant; tree DAG → log(machines); all-to-one → linear in machines.
  18. Basic model: time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines, where x1 captures serial execution, x2 computation (linear in per-machine input), x3 a tree DAG, and x4 an all-to-one DAG. Collect training data, then fit a linear regression.
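To make the fitting step concrete, here is a minimal sketch in Python using non-negative least squares (the approach the Ernest paper describes, so that no term contributes negative time); the training measurements below are invented for illustration:

```python
import numpy as np
from scipy.optimize import nnls

def features(scale, machines):
    """Feature vector for the basic Ernest model:
    [serial term, computation, tree DAG, all-to-one DAG]."""
    return [1.0, scale / machines, np.log(machines), float(machines)]

# (input scale, machines, measured time in seconds); numbers invented for illustration.
observations = [
    (0.01, 1, 12.0), (0.02, 1, 22.5), (0.02, 2, 13.1),
    (0.04, 2, 24.1), (0.04, 4, 14.2), (0.08, 4, 27.3),
    (0.08, 8, 16.9),
]

A = np.array([features(s, m) for s, m, _ in observations])
b = np.array([t for _, _, t in observations])

# Non-negative least squares keeps every coefficient >= 0,
# so no term can contribute "negative time" to a prediction.
x, _ = nnls(A, b)

def predict(scale, machines):
    return float(np.dot(x, features(scale, machines)))

# Extrapolate: full input (scale = 1.0) on a 64-machine cluster.
print(predict(1.0, 64))
```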
  19. Does the model match real applications? (n: number of data items, p: number of processors; number of features assumed constant)
     GLM Regression: compute O(n/p); communication O(1), O(log(p))
     KMeans: compute O(n/p); communication O(1), O(log(p)); other O(centers)
     Naïve Bayes: compute O(n/p); communication O(log(p))
     Pearson Correlation: compute O(n/p); communication O(log(p))
     PCA: compute O(n/p); communication O(1), O(log(p))
     QR (TSQR): compute O(n/p); communication O(log(p))
  20. Collecting training data: a grid over input fraction (1%, 2%, 4%, 8%) and machines (1, 2, 4, 8). Associate a cost with each experiment. Baseline: run the cheapest configurations first.
  21. Optimal Design of Experiments. Given a linear model, consider estimating a vector x ∈ R^n from m measurements y_i = a_i^T x + w_i, i = 1, …, m, where the noise terms w_i are independent zero-mean, unit-variance Gaussians and the measurement vectors a_1, …, a_m span R^n. The maximum-likelihood (and minimum-variance) estimate of x is the least-squares solution x̂ = (Σ_i a_i a_i^T)^(-1) Σ_i y_i a_i, and the estimation error e = x̂ − x has zero mean and covariance (Σ_i a_i a_i^T)^(-1). Lower variance → better model. Let λ_i be the fraction of times experiment i is run; experiment design minimizes the trace of the covariance matrix: minimize tr((Σ_i λ_i a_i a_i^T)^(-1)) subject to 0 ≤ λ_i ≤ 1. For Ernest, each candidate experiment's features are just the data scale and the number of machines, and only experiments with non-zero λ_i are actually run. Accounting for cost: each experiment with scale s_i on m_i machines has a cost c_i, whether in time (training on a larger input fraction is more expensive) or in machines (a fixed cost to launch each one); adding the constraint Σ_i c_i λ_i ≤ B bounds the total training cost by a budget B. Here the cost is the time taken to collect training data, ignoring machine setup costs.
  22. Optimal design of experiments over the same grid of input fractions and machine counts: solve for the λ values with an off-the-shelf solver (CVX).
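The trace-minimization above can be written almost verbatim with a convex solver. Here is a minimal sketch using CVXPY (a Python analogue of the CVX toolbox named on the slide); the candidate grid matches the slides, while the cost values and budget are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

# Candidate experiments: the (input fraction, machines) grid from the slides.
scales = [0.01, 0.02, 0.04, 0.08]
machines = [1, 2, 4, 8]
candidates = [(s, m) for s in scales for m in machines]

def features(s, m):
    # Same feature vector as the basic Ernest model.
    return np.array([1.0, s / m, np.log(m), float(m)])

A = [features(s, m) for s, m in candidates]
# Illustrative cost proxy: roughly the running time of each experiment
# (a larger input fraction on fewer machines runs longer).
costs = np.array([s / m for s, m in candidates])
budget = 0.1  # assumed budget, in the same arbitrary cost units

lam = cp.Variable(len(candidates), nonneg=True)
M = sum(lam[i] * np.outer(a, a) for i, a in enumerate(A))
# tr(M^{-1}) expressed as matrix_frac(I, M), which CVXPY knows is convex.
objective = cp.Minimize(cp.matrix_frac(np.eye(4), M))
prob = cp.Problem(objective, [lam <= 1, costs @ lam <= budget])
prob.solve()

# Run only the experiments assigned non-negligible weight.
chosen = [c for c, l in zip(candidates, lam.value) if l > 1e-3]
print(chosen)
```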
  23. Using Ernest: experiment design selects (machines, input size) configurations; training jobs run the job binary on those configurations, using only a few iterations for training; the results fit a linear model. [Chart: predicted time vs. machines, 1 to 30.]
  24. Evaluation
  25. Workloads: KeystoneML, Spark MLlib, ADAM, GenBase, Sparse GLMs, Random Projections. Objectives: optimal number of machines; prediction accuracy; model training overhead; importance of experiment design; choosing EC2 instance types; model extensions.
  26. (Repeat of slide 25.)
  27. Number of instances: KeystoneML. TIMIT pipeline on r3.xlarge instances, 100 iterations. [Chart: time (s), 0 to 24000, vs. machines, 0 to 140.] For a 2-hour deadline, up to 5x lower cost.
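A deadline query like the one on this slide can be answered by sweeping the fitted model across cluster sizes and costing each feasible point. Here is a minimal sketch, reusing the hypothetical predict() function from the NNLS sketch above (the hourly price and search bound are illustrative assumptions):

```python
# Reuses predict(scale, machines) from the NNLS sketch above.
PRICE_PER_MACHINE_HOUR = 0.35  # illustrative r3.xlarge on-demand price, USD
DEADLINE_SECONDS = 2 * 3600

def cheapest_config(max_machines=140):
    """Smallest-cost machine count whose predicted time meets the deadline."""
    best = None
    for m in range(1, max_machines + 1):
        t = predict(1.0, m)  # predicted time on the full input
        if t > DEADLINE_SECONDS:
            continue  # this cluster size misses the deadline
        cost = PRICE_PER_MACHINE_HOUR * m * (t / 3600.0)
        if best is None or cost < best[1]:
            best = (m, cost)
    return best  # (machines, cost in USD), or None if nothing meets it

print(cheapest_config())
```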
  28. Accuracy: KeystoneML. TIMIT pipeline on r3.xlarge instances, time per iteration. [Chart: time per iteration (s), 0 to 200, vs. machines, 0 to 140.] 10%-15% prediction error.
  29. Accuracy: Spark MLlib. [Chart: predicted/actual ratio, 0 to 1.2, for Regression, KMeans, NaiveBayes, PCA, Pearson, at 64 and 45 machines.]
  30. Training time: KeystoneML. TIMIT pipeline on r3.xlarge instances, 100 iterations. Experiment design used 7 data points, up to 16 machines, up to 10% of the data. [Bar chart: time (s), 0 to 6000, at 42 machines: training time vs. running time.]
  31. Accuracy, overhead: GLM Regression. MLlib regression: < 15% error. [Bar chart: actual, predicted, and training time (s), 0 to 5000, at 64 and 45 machines; training overhead 3.9% and 2.9%; predicted time 113.4% and 112.7% of actual.]
  32. Is experiment design useful? [Chart: prediction error (%), 0% to 100%, for Regression, Classification, KMeans, PCA, TIMIT, comparing experiment design against a cost-based baseline.]
  33. More details: detecting when the model is wrong; model extensions; Amazon EC2 variations over time; straggler mitigation strategies. http://shivaram.org/publications/ernest-nsdi.pdf
  34. Open source: Python-based implementation; experiment design and predictor modules; example using SparkML + RCV1. https://github.com/amplab/ernest
  35. In conclusion: workload trends point to advanced analytics in the cloud; computation and communication patterns affect scalability; Ernest delivers performance predictions with low overhead, using an end-to-end linear model and optimal experiment design. Have workloads or traces to share? Shivaram Venkataraman, shivaram@cs.berkeley.edu
