Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

Ernest
Efficient Performance Prediction for
Advanced Analytics on Apache Spark
Shivaram Venkataraman, Zongheng Yang
Michael Franklin, Benjamin Recht, Ion Stoica

Workload TRENDS: ADVANCED ANALYTICS

Workload Trends: Advanced Analytics

4

5

Keystone-ML TIMIT PIPELINE
Cosine Transform
Normalization
Linear Solver
Raw Data

Cosine
Transform
Normalization
Linear Solver
~100 iterations
Iterative
(each iteration many jobs)
Long Running à Expensive
Numerically Intensive
7
Keystone-ML TIMIT PIPELINE
Raw
Data
Properties

Cloud Computing CHOICEs
Amazon EC2
t2.nano, t2.micro, t2.small
m4.large, m4.xlarge, m4.2xlarge,
m4.4xlarge, m3.medium,
c4.large, c4.xlarge, c4.2xlarge,
c3.large, c3.xlarge, c3.4xlarge,
r3.large, r3.xlarge, r3.4xlarge,
i2.2xlarge, i2.4xlarge, d2.xlarge
d2.2xlarge, d2.4xlarge,…
MICROSOFT AZURE
Basic tier: A0, A1, A2, A3, A4
Optimized Compute : D1, D2,
D3, D4, D11, D12, D13
D1v2, D2v2, D3v2, D11v2,…
Latest CPUs: G1, G2, G3, …
Network Optimized: A8, A9
Compute Intensive: A10, A11,…
n1-standard-1, ns1-standard-2,
ns1-standard-4, ns1-standard-8,
ns1-standard-16, ns1highmem-2,
ns1-highmem-4, ns1-highmem-8,
n1-highcpu-2, n1-highcpu-4, n1-
highcpu-8, n1-highcpu-16, n1-
highcpu-32, f1-micro, g1-small…
Google Cloud Engine
Instance Types and Number of Instances

User
Concerns
“What is the cheapest conﬁguration to run my job in 2 hours?”

Given a budget, how fast can I run my job ?

“What kind of instances should I use on EC2 ?”

10

DO CHOICES MATTER ? MATRIX MULTIPLY
0
5
10
15
20
25
30
Time(s)
1 r3.8xlarge
2 r3.4xlarge
4 r3.2xlarge
8 r3.xlarge
16 r3.large
Matrix size: 400K by 1K
Cores = 16
Memory = 244 GB
Cost = $2.66/hr

DO CHOICES MATTER ?
0
5
10
15
20
25
30
Time(s)
1 r3.8xlarge
2 r3.4xlarge
4 r3.2xlarge
8 r3.xlarge
16 r3.large
Matrix Multiply: 400K by 1K
0
5
10
15
20
25
30
35
Time(s)
QR Factorization 1M by 1K
Network Bound
Mem Bandwidth Bound

0
10
20
30
0
100
200
300
400
500
600
Time(s)
Cores
Actual
Ideal
r3.4xlarge instances, QR Factorization:1M by 1K
13
Do choices MATTER ?
Computation + Communication à Non-linear Scaling

APPROACH
Job
Data
+
Performance Model
Challenges
Black Box Jobs

Model Building Overhead
Regular Structure + Few Iterations

Computation patterns
Time INPUT Time 1
machines

Communication Patterns
ONE-To-ONE Tree DAG All-to-one
CONSTANT LOG LINEAR
17

BASIC Model
time = x1 + x2 ∗
input
machines
+ x3 ∗ log(machines)+ x4 ∗ (machines)
Serial
Execution
Computation (linear)
Tree DAG
All-to-One DAG
Collect Training Data
Fit Linear Regression

Does THE MODEL match real applications ?
Algorithm
Compute
Communication
O(n/p)
O(n2/p)
O(1)
O(p)
O(log(p))
Other
GLM Regression
✔
✔
✔
KMeans
✔
✔
✔
O(center)
Naïve Bayes
✔
✔
Pearson Correlation
✔
✔
PCA
✔
✔
✔
QR (TSQR)
✔
✔
Number of data items: n, Number of processors: p, Constant number of features

COLLECTING TRAINING DATA
1%
2%
4%
8%
1
2
4
8
Input
Machines
Grid of
input, machines
Associate cost with
each experiment
Baseline: Cheapest
conﬁgurations ﬁrst

OPTIMAL Design of EXPERIMENTSeriment design
nsider the problem of estimating a vector x ∈ Rn
from measurements or
ments
yi = aT
i x + wi, i = 1, . . . , m,
wi is measurement noise. We assume that wi are independent Gaussian
m variables with zero mean and unit variance, and that the measurement
a1, . . . , am span Rn
. The maximum likelihood estimate of x, which is the
s the minimum variance estimate, is given by the least-squares solution
ˆx =
m
i=1
aiaT
i
−1 m
i=1
yiai.
sociated estimation error e = ˆx − x has zero mean and covariance matrix
m −1
Given a Linear Model
21
Lower variance à
Better model
λi - Fraction of times each experiment is run
ata
me
g-
ic-
ly.
We
ata
to
ed
is
can set i as the fraction of time an experiment is chosen and
minimize the trace of the covariance matrix:
Minimize tr((
mX
i=1
iaiaT
i ) 1
)
subject to i 0, i  1
Using Experiment Design: The predictive model we de-
scribed in the previous section can be formulated as an ex-
periment design problem. The feature set used for model de-
sign consisted of just the scale of the data used and the num-
Bound total cost
d jobs.
refers to the study of
experiment given the
riment design speciﬁ-
ments that are optimal
on. At a high-level the
ne data points that can
accurate model. In or-
of training data points
ained with those data
em where we are try-
surements y1, . . . , ym
urement. Each feature
er of dimensions (say
design setup described above and only choose to run those
experiments whose values are non-zero.
Accounting for Cost: One additional factor we need to
consider in using experiment design is that each experiment
we run costs a different amount. This cost could be in terms
of time (i.e. it is more expensive to train with larger fraction
of the input) or in terms of machines (i.e. there is a ﬁxed
cost to say launching a machine). To account for the cost
of an experiment we can augment the optimization problem
we setup above with an additional constraint that the total
cost should be lesser than some budget. That is if we have a
cost function which gives us a cost ci for an experiment with
scale si and mi machines, we add a constraint to our solver
that
mP
i=1
ci i  B where B is the total budget. For the rest
of this paper we use the time taken to collect training data
as the cost and ignore any machine setup costs as we can

OPTIMAL Design of EXPERIMENTS
1%
2%
4%
8%
1
2
4
8
Input
Machines
Use off-the-shelf solver
(CVX)

USING ERNEST
Training
Jobs
Job
Binary
Machines,
Input Size
Linear
Model
Experiment
Design
Use few iterations for
training
0
200
400
600
800
1000
1
30
Time
Machines
ERNEST

Workloads
Keystone-ML
Spark MLlib
ADAM
GenBase
Sparse GLMs
Random Projections
OBJECTIVES

Optimal number of machines
Prediction accuracy
Model training overhead
Importance of experiment design
Choosing EC2 instance types
Model extensions

NUMBER OF INSTANCES: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, 100 iterations
27
0
4000
8000
12000
16000
20000
24000
0
20
40
60
80
100
120
140
Time(s)
Machines
For 2 hour deadline, up to 5x lower cost

ACCURACY: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, Time Per Iteration
28
0
40
80
120
160
200
0
20
40
60
80
100
120
140
TimePerIteration(s)
Machines
10%-15% Prediction error

Accuracy: SparK MLlib
29
0
0.2
0.4
0.6
0.8
1
1.2
Regression
KMeans
NaiveBayes
PCA
Pearson
Predicted / Actual
64 machines
45 machines

TRAINING TIME: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, 100 iterations
30
7 data points
Up to 16 machines
Up to 10% data
EXPERIMENT DESIGN
0
1000
2000
3000
4000
5000
6000
42 machines
Time (s)
Training Time
Running Time

Accuracy, Overhead: GLM REGRESSION
MLlib Regression: < 15% error
3.9%
2.9%
113.4%
112.7%
0
1000
2000
3000
4000
5000
64
45
Time (seconds)
Machines
Actual
Predicted
Training
31

0%
20%
40%
60%
80%
100%
Regression
Classiﬁcation
KMeans
PCA
TIMIT
Prediction Error (%)
Experiment Design
Cost-based
Is Experiment Design useful ?
32

MORE DETAILS

Detecting when the model is wrong

Model extensions

Amazon EC2 variations over time

Straggler mitigation strategies
http://shivaram.org/publications/ernest-nsdi.pdf

OPEN SOURCE

Python-based implementation

Experiment Design, Predictor Modules

Example using SparkML + RCV1
https://github.com/amplab/ernest

IN Conclusion
Workload Trends: Advanced Analytics in the Cloud
Computation, Communication patterns affect scalability

Ernest: Performance predictions with low overhead

End-to-end linear model

Optimal experimental design

Workloads / Traces ?

Shivaram Venkataraman shivaram@cs.berkeley.edu

35

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

Recommended

Recommended

More Related Content

What's hot

What's hot (8)

Similar to Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

Similar to Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman