SlideShare a Scribd company logo
Ernest
Efficient Performance Prediction for
Advanced Analytics on Apache Spark
Shivaram Venkataraman, Zongheng Yang 
Michael Franklin, Benjamin Recht, Ion Stoica
Workload TRENDS: ADVANCED ANALYTICS
Workload Trends: Advanced Analytics
4
Workload Trends: Advanced Analytics
5
Workload Trends: Advanced Analytics
Keystone-ML TIMIT PIPELINE
Cosine Transform
 Normalization
 Linear Solver
Raw Data
Cosine
Transform
Normalization
Linear Solver
~100 iterations
Iterative 
(each iteration many jobs)
Long Running à Expensive
Numerically Intensive
7
Keystone-ML TIMIT PIPELINE
Raw
Data
Properties
Cloud Computing CHOICEs
Amazon EC2
t2.nano, t2.micro, t2.small
m4.large, m4.xlarge, m4.2xlarge,
m4.4xlarge, m3.medium,
c4.large, c4.xlarge, c4.2xlarge,
c3.large, c3.xlarge, c3.4xlarge,
r3.large, r3.xlarge, r3.4xlarge,
i2.2xlarge, i2.4xlarge, d2.xlarge
d2.2xlarge, d2.4xlarge,…
MICROSOFT AZURE
Basic tier: A0, A1, A2, A3, A4
Optimized Compute : D1, D2,
D3, D4, D11, D12, D13
D1v2, D2v2, D3v2, D11v2,…
Latest CPUs: G1, G2, G3, …
Network Optimized: A8, A9
Compute Intensive: A10, A11,…
n1-standard-1, ns1-standard-2,
ns1-standard-4, ns1-standard-8,
ns1-standard-16, ns1highmem-2,
ns1-highmem-4, ns1-highmem-8,
n1-highcpu-2, n1-highcpu-4, n1-
highcpu-8, n1-highcpu-16, n1-
highcpu-32, f1-micro, g1-small…
Google Cloud Engine
Instance Types and Number of Instances
TYRANNY of CHOICE
User
Concerns
“What is the cheapest configuration to run my job in 2 hours?”

Given a budget, how fast can I run my job ?

“What kind of instances should I use on EC2 ?” 

10
DO CHOICES MATTER ? MATRIX MULTIPLY
0
5
10
15
20
25
30
Time(s)
1 r3.8xlarge
2 r3.4xlarge
4 r3.2xlarge
8 r3.xlarge
16 r3.large
Matrix size: 400K by 1K 
Cores = 16
Memory = 244 GB
Cost = $2.66/hr
DO CHOICES MATTER ?
0
5
10
15
20
25
30
Time(s)
1 r3.8xlarge
2 r3.4xlarge
4 r3.2xlarge
8 r3.xlarge
16 r3.large
Matrix Multiply: 400K by 1K 
0
5
10
15
20
25
30
35
Time(s)
QR Factorization 1M by 1K 
Network Bound
 Mem Bandwidth Bound
0
10
20
30
0
 100
 200
 300
 400
 500
 600
Time(s)
Cores
Actual
 Ideal
r3.4xlarge instances, QR Factorization:1M by 1K 
13
Do choices MATTER ?
Computation + Communication à Non-linear Scaling
APPROACH
Job
Data
+
Performance Model
Challenges
Black Box Jobs



Model Building Overhead
Regular Structure + Few Iterations
Modeling Jobs
15
Computation patterns
Time INPUT Time 1
machines
Communication Patterns
ONE-To-ONE Tree DAG All-to-one
CONSTANT LOG LINEAR
17
BASIC Model
time = x1 + x2 ∗
input
machines
+ x3 ∗ log(machines)+ x4 ∗ (machines)
Serial
Execution
Computation (linear)
Tree DAG
All-to-One DAG
Collect Training Data
 Fit Linear Regression
Does THE MODEL match real applications ?
Algorithm
Compute
 Communication
O(n/p)
 O(n2/p)
 O(1)
 O(p)
 O(log(p))
 Other
GLM Regression
 ✔
 ✔
 ✔
KMeans
 ✔
 ✔
 ✔
 O(center)
Naïve Bayes
 ✔
 ✔
Pearson Correlation
 ✔
 ✔
PCA
 ✔
 ✔
 ✔
QR (TSQR)
 ✔
 ✔
Number of data items: n, Number of processors: p, Constant number of features
COLLECTING TRAINING DATA
1%
2%
4%
8%
1
 2
 4
 8
Input
Machines
Grid of 
input, machines
Associate cost with 
each experiment
Baseline: Cheapest
configurations first
OPTIMAL Design of EXPERIMENTSeriment design
nsider the problem of estimating a vector x ∈ Rn
from measurements or
ments
yi = aT
i x + wi, i = 1, . . . , m,
wi is measurement noise. We assume that wi are independent Gaussian
m variables with zero mean and unit variance, and that the measurement
a1, . . . , am span Rn
. The maximum likelihood estimate of x, which is the
s the minimum variance estimate, is given by the least-squares solution
ˆx =
m
i=1
aiaT
i
−1 m
i=1
yiai.
sociated estimation error e = ˆx − x has zero mean and covariance matrix
m −1
Given a Linear Model
21
Lower variance à 
Better model 
λi - Fraction of times each experiment is run
ata
me
g-
ic-
ly.
We
ata
to
ed
is
can set i as the fraction of time an experiment is chosen and
minimize the trace of the covariance matrix:
Minimize tr((
mX
i=1
iaiaT
i ) 1
)
subject to i 0, i  1
Using Experiment Design: The predictive model we de-
scribed in the previous section can be formulated as an ex-
periment design problem. The feature set used for model de-
sign consisted of just the scale of the data used and the num-
Bound total cost
d jobs.
refers to the study of
experiment given the
riment design specifi-
ments that are optimal
on. At a high-level the
ne data points that can
accurate model. In or-
of training data points
ained with those data
em where we are try-
surements y1, . . . , ym
urement. Each feature
er of dimensions (say
design setup described above and only choose to run those
experiments whose values are non-zero.
Accounting for Cost: One additional factor we need to
consider in using experiment design is that each experiment
we run costs a different amount. This cost could be in terms
of time (i.e. it is more expensive to train with larger fraction
of the input) or in terms of machines (i.e. there is a fixed
cost to say launching a machine). To account for the cost
of an experiment we can augment the optimization problem
we setup above with an additional constraint that the total
cost should be lesser than some budget. That is if we have a
cost function which gives us a cost ci for an experiment with
scale si and mi machines, we add a constraint to our solver
that
mP
i=1
ci i  B where B is the total budget. For the rest
of this paper we use the time taken to collect training data
as the cost and ignore any machine setup costs as we can
OPTIMAL Design of EXPERIMENTS
1%
2%
4%
8%
1
 2
 4
 8
Input
Machines
Use off-the-shelf solver
(CVX)
USING ERNEST
Training
Jobs
Job
Binary
Machines,
Input Size 
Linear
Model
Experiment
Design
Use few iterations for
training
0
200
400
600
800
1000
1
 30
Time
Machines
ERNEST
EVALUATION
24
Workloads
Keystone-ML 
Spark MLlib
ADAM
GenBase
Sparse GLMs
Random Projections
OBJECTIVES

Optimal number of machines
Prediction accuracy
Model training overhead
Importance of experiment design
Choosing EC2 instance types
Model extensions
Workloads
Keystone-ML 
Spark MLlib
ADAM
GenBase
Sparse GLMs
Random Projections
OBJECTIVES

Optimal number of machines
Prediction accuracy
Model training overhead
Importance of experiment design
Choosing EC2 instance types
Model extensions
NUMBER OF INSTANCES: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, 100 iterations
27
0
4000
8000
12000
16000
20000
24000
0
 20
 40
 60
 80
 100
 120
 140
Time(s)
Machines
For 2 hour deadline, up to 5x lower cost
ACCURACY: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, Time Per Iteration
28
0
40
80
120
160
200
0
 20
 40
 60
 80
 100
 120
 140
TimePerIteration(s)
Machines
10%-15% Prediction error
Accuracy: SparK MLlib
29
0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
Regression
KMeans
NaiveBayes
PCA
Pearson
Predicted / Actual
64 machines
 45 machines
TRAINING TIME: Keystone-ml
TIMIT Pipeline on r3.xlarge instances, 100 iterations
30
7 data points
Up to 16 machines
Up to 10% data
EXPERIMENT DESIGN
0
 1000
 2000
 3000
 4000
 5000
 6000
42 machines
Time (s)
Training Time
Running Time
Accuracy, Overhead: GLM REGRESSION
MLlib Regression: < 15% error
3.9%
2.9%
113.4%
112.7%
0
 1000
 2000
 3000
 4000
 5000
64
45
Time (seconds)
Machines
Actual
Predicted
Training
31
0%
 20%
 40%
 60%
 80%
 100%
Regression
Classification
KMeans
PCA
TIMIT
Prediction Error (%)
Experiment Design
Cost-based
Is Experiment Design useful ?
32
MORE DETAILS

Detecting when the model is wrong

Model extensions

Amazon EC2 variations over time

Straggler mitigation strategies
http://shivaram.org/publications/ernest-nsdi.pdf
OPEN SOURCE

Python-based implementation

Experiment Design, Predictor Modules

Example using SparkML + RCV1
https://github.com/amplab/ernest
IN Conclusion
Workload Trends: Advanced Analytics in the Cloud
Computation, Communication patterns affect scalability

Ernest: Performance predictions with low overhead

End-to-end linear model

Optimal experimental design

Workloads / Traces ? 

Shivaram Venkataraman shivaram@cs.berkeley.edu

35

More Related Content

What's hot

Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
Shohei Hido
 
Đồ án Chi tiết máy - Đỗ Văn Vinh
Đồ án Chi tiết máy - Đỗ Văn VinhĐồ án Chi tiết máy - Đỗ Văn Vinh
Đồ án Chi tiết máy - Đỗ Văn Vinh
Amanda Quitzon
 
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điệnLuận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
Dịch vụ viết bài trọn gói ZALO 0917193864
 
Đồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
Đồ Án Thiết Kế Hệ Dẫn Động Cơ KhíĐồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
Đồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
nataliej4
 
Mẫu template power point đẹp
Mẫu template power point đẹpMẫu template power point đẹp
Mẫu template power point đẹp
DSLIDES
 
Opening credit analysis of world war z
Opening credit analysis of world war zOpening credit analysis of world war z
Opening credit analysis of world war zTN0111
 
Bộ đề toán rời rạc thi cao học
Bộ đề toán rời rạc thi cao họcBộ đề toán rời rạc thi cao học
Bộ đề toán rời rạc thi cao họcNấm Lùn
 
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad) Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
nataliej4
 

What's hot (8)

Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"
 
Đồ án Chi tiết máy - Đỗ Văn Vinh
Đồ án Chi tiết máy - Đỗ Văn VinhĐồ án Chi tiết máy - Đỗ Văn Vinh
Đồ án Chi tiết máy - Đỗ Văn Vinh
 
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điệnLuận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
Luận án: Thuật toán trí tuệ nhân tạo cho bài toán tái cấu trúc lưới điện
 
Đồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
Đồ Án Thiết Kế Hệ Dẫn Động Cơ KhíĐồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
Đồ Án Thiết Kế Hệ Dẫn Động Cơ Khí
 
Mẫu template power point đẹp
Mẫu template power point đẹpMẫu template power point đẹp
Mẫu template power point đẹp
 
Opening credit analysis of world war z
Opening credit analysis of world war zOpening credit analysis of world war z
Opening credit analysis of world war z
 
Bộ đề toán rời rạc thi cao học
Bộ đề toán rời rạc thi cao họcBộ đề toán rời rạc thi cao học
Bộ đề toán rời rạc thi cao học
 
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad) Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
Thiết Kế Hệ Thống Dẫn Động Băng Tải Phương Án Số 8 (Full Bản Vẽ Cad)
 

Similar to Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

A04230105
A04230105A04230105
A04230105
ijceronline
 
autoTVM
autoTVMautoTVM
autoTVM
Yi-Wen Hung
 
C3 w2
C3 w2C3 w2
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
University of Huddersfield
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Spark Summit
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Intel® Software
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
byteLAKE
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Databricks
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Intel® Software
 
Rapport_Cemracs2012
Rapport_Cemracs2012Rapport_Cemracs2012
Rapport_Cemracs2012Jussara F.M.
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptx
skilljiolms
 
Accelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer ModelsAccelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer Models
Philippe Laborie
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
Joel Falcou
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cachesqlserver.co.il
 
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...eArtius, Inc.
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
Manish Pandey
 

Similar to Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman (20)

A04230105
A04230105A04230105
A04230105
 
autoTVM
autoTVMautoTVM
autoTVM
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
C3 w2
C3 w2C3 w2
C3 w2
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Rapport_Cemracs2012
Rapport_Cemracs2012Rapport_Cemracs2012
Rapport_Cemracs2012
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptx
 
Accelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer ModelsAccelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer Models
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cache
 
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 

Recently uploaded (20)

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

  • 1. Ernest Efficient Performance Prediction for Advanced Analytics on Apache Spark Shivaram Venkataraman, Zongheng Yang Michael Franklin, Benjamin Recht, Ion Stoica
  • 6. Keystone-ML TIMIT PIPELINE Cosine Transform Normalization Linear Solver Raw Data
  • 7. Cosine Transform Normalization Linear Solver ~100 iterations Iterative (each iteration many jobs) Long Running à Expensive Numerically Intensive 7 Keystone-ML TIMIT PIPELINE Raw Data Properties
  • 8. Cloud Computing CHOICEs Amazon EC2 t2.nano, t2.micro, t2.small m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge d2.2xlarge, d2.4xlarge,… MICROSOFT AZURE Basic tier: A0, A1, A2, A3, A4 Optimized Compute : D1, D2, D3, D4, D11, D12, D13 D1v2, D2v2, D3v2, D11v2,… Latest CPUs: G1, G2, G3, … Network Optimized: A8, A9 Compute Intensive: A10, A11,… n1-standard-1, ns1-standard-2, ns1-standard-4, ns1-standard-8, ns1-standard-16, ns1highmem-2, ns1-highmem-4, ns1-highmem-8, n1-highcpu-2, n1-highcpu-4, n1- highcpu-8, n1-highcpu-16, n1- highcpu-32, f1-micro, g1-small… Google Cloud Engine Instance Types and Number of Instances
  • 10. User Concerns “What is the cheapest configuration to run my job in 2 hours?” Given a budget, how fast can I run my job ? “What kind of instances should I use on EC2 ?” 10
  • 11. DO CHOICES MATTER ? MATRIX MULTIPLY 0 5 10 15 20 25 30 Time(s) 1 r3.8xlarge 2 r3.4xlarge 4 r3.2xlarge 8 r3.xlarge 16 r3.large Matrix size: 400K by 1K Cores = 16 Memory = 244 GB Cost = $2.66/hr
  • 12. DO CHOICES MATTER ? 0 5 10 15 20 25 30 Time(s) 1 r3.8xlarge 2 r3.4xlarge 4 r3.2xlarge 8 r3.xlarge 16 r3.large Matrix Multiply: 400K by 1K 0 5 10 15 20 25 30 35 Time(s) QR Factorization 1M by 1K Network Bound Mem Bandwidth Bound
  • 13. 0 10 20 30 0 100 200 300 400 500 600 Time(s) Cores Actual Ideal r3.4xlarge instances, QR Factorization:1M by 1K 13 Do choices MATTER ? Computation + Communication à Non-linear Scaling
  • 14. APPROACH Job Data + Performance Model Challenges Black Box Jobs Model Building Overhead Regular Structure + Few Iterations
  • 17. Communication Patterns ONE-To-ONE Tree DAG All-to-one CONSTANT LOG LINEAR 17
  • 18. BASIC Model time = x1 + x2 ∗ input machines + x3 ∗ log(machines)+ x4 ∗ (machines) Serial Execution Computation (linear) Tree DAG All-to-One DAG Collect Training Data Fit Linear Regression
  • 19. Does THE MODEL match real applications ? Algorithm Compute Communication O(n/p) O(n2/p) O(1) O(p) O(log(p)) Other GLM Regression ✔ ✔ ✔ KMeans ✔ ✔ ✔ O(center) Naïve Bayes ✔ ✔ Pearson Correlation ✔ ✔ PCA ✔ ✔ ✔ QR (TSQR) ✔ ✔ Number of data items: n, Number of processors: p, Constant number of features
  • 20. COLLECTING TRAINING DATA 1% 2% 4% 8% 1 2 4 8 Input Machines Grid of input, machines Associate cost with each experiment Baseline: Cheapest configurations first
  • 21. OPTIMAL Design of EXPERIMENTSeriment design nsider the problem of estimating a vector x ∈ Rn from measurements or ments yi = aT i x + wi, i = 1, . . . , m, wi is measurement noise. We assume that wi are independent Gaussian m variables with zero mean and unit variance, and that the measurement a1, . . . , am span Rn . The maximum likelihood estimate of x, which is the s the minimum variance estimate, is given by the least-squares solution ˆx = m i=1 aiaT i −1 m i=1 yiai. sociated estimation error e = ˆx − x has zero mean and covariance matrix m −1 Given a Linear Model 21 Lower variance à Better model λi - Fraction of times each experiment is run ata me g- ic- ly. We ata to ed is can set i as the fraction of time an experiment is chosen and minimize the trace of the covariance matrix: Minimize tr(( mX i=1 iaiaT i ) 1 ) subject to i 0, i  1 Using Experiment Design: The predictive model we de- scribed in the previous section can be formulated as an ex- periment design problem. The feature set used for model de- sign consisted of just the scale of the data used and the num- Bound total cost d jobs. refers to the study of experiment given the riment design specifi- ments that are optimal on. At a high-level the ne data points that can accurate model. In or- of training data points ained with those data em where we are try- surements y1, . . . , ym urement. Each feature er of dimensions (say design setup described above and only choose to run those experiments whose values are non-zero. Accounting for Cost: One additional factor we need to consider in using experiment design is that each experiment we run costs a different amount. This cost could be in terms of time (i.e. it is more expensive to train with larger fraction of the input) or in terms of machines (i.e. there is a fixed cost to say launching a machine). To account for the cost of an experiment we can augment the optimization problem we setup above with an additional constraint that the total cost should be lesser than some budget. That is if we have a cost function which gives us a cost ci for an experiment with scale si and mi machines, we add a constraint to our solver that mP i=1 ci i  B where B is the total budget. For the rest of this paper we use the time taken to collect training data as the cost and ignore any machine setup costs as we can
  • 22. OPTIMAL Design of EXPERIMENTS 1% 2% 4% 8% 1 2 4 8 Input Machines Use off-the-shelf solver (CVX)
  • 23. USING ERNEST Training Jobs Job Binary Machines, Input Size Linear Model Experiment Design Use few iterations for training 0 200 400 600 800 1000 1 30 Time Machines ERNEST
  • 25. Workloads Keystone-ML Spark MLlib ADAM GenBase Sparse GLMs Random Projections OBJECTIVES Optimal number of machines Prediction accuracy Model training overhead Importance of experiment design Choosing EC2 instance types Model extensions
  • 26. Workloads Keystone-ML Spark MLlib ADAM GenBase Sparse GLMs Random Projections OBJECTIVES Optimal number of machines Prediction accuracy Model training overhead Importance of experiment design Choosing EC2 instance types Model extensions
  • 27. NUMBER OF INSTANCES: Keystone-ml TIMIT Pipeline on r3.xlarge instances, 100 iterations 27 0 4000 8000 12000 16000 20000 24000 0 20 40 60 80 100 120 140 Time(s) Machines For 2 hour deadline, up to 5x lower cost
  • 28. ACCURACY: Keystone-ml TIMIT Pipeline on r3.xlarge instances, Time Per Iteration 28 0 40 80 120 160 200 0 20 40 60 80 100 120 140 TimePerIteration(s) Machines 10%-15% Prediction error
  • 29. Accuracy: SparK MLlib 29 0 0.2 0.4 0.6 0.8 1 1.2 Regression KMeans NaiveBayes PCA Pearson Predicted / Actual 64 machines 45 machines
  • 30. TRAINING TIME: Keystone-ml TIMIT Pipeline on r3.xlarge instances, 100 iterations 30 7 data points Up to 16 machines Up to 10% data EXPERIMENT DESIGN 0 1000 2000 3000 4000 5000 6000 42 machines Time (s) Training Time Running Time
  • 31. Accuracy, Overhead: GLM REGRESSION MLlib Regression: < 15% error 3.9% 2.9% 113.4% 112.7% 0 1000 2000 3000 4000 5000 64 45 Time (seconds) Machines Actual Predicted Training 31
  • 32. 0% 20% 40% 60% 80% 100% Regression Classification KMeans PCA TIMIT Prediction Error (%) Experiment Design Cost-based Is Experiment Design useful ? 32
  • 33. MORE DETAILS Detecting when the model is wrong Model extensions Amazon EC2 variations over time Straggler mitigation strategies http://shivaram.org/publications/ernest-nsdi.pdf
  • 34. OPEN SOURCE Python-based implementation Experiment Design, Predictor Modules Example using SparkML + RCV1 https://github.com/amplab/ernest
  • 35. IN Conclusion Workload Trends: Advanced Analytics in the Cloud Computation, Communication patterns affect scalability Ernest: Performance predictions with low overhead End-to-end linear model Optimal experimental design Workloads / Traces ? Shivaram Venkataraman shivaram@cs.berkeley.edu 35