Optimizing Machine Learning Workloads in
Collaborative Environments
Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Ziawasch Abedjan, Tilmann Rabl, and Volker Markl
1
Behrouz Derakhshan et al. Optimizing Machine Learning Workloads in Collaborative Environments
Introduction
2
[Venn diagram: Data Science and Machine Learning lies at the intersection of Math and Statistics, Computer Science, and Domain Knowledge, realized through a Data Science and ML Toolkit]
High-quality DS and ML applications require effective collaboration
Collaborative DS and ML
3
• Jupyter Notebooks enable sharing code and results
• Containerized environments enable execution and deployment of the notebooks [1]
• Platforms, such as Google Colab [2] and Kaggle [3], enable effective collaboration among data scientists
1. https://www.docker.com/
2. https://colab.research.google.com/
3. https://www.kaggle.com/
Problems
4
Lack of management for generated data artifacts
Re-execution of existing notebooks
The top 3 notebooks of the Home Credit Default Risk Kaggle competition1 generate
hundreds of GBs of intermediate data artifacts and are copied 10,000 times
1. https://www.kaggle.com/c/home-credit-default-risk
The lack of data management in existing collaborative environments leads to
1000s of hours of redundant data processing and model training
Collaborative ML Workload Optimizer
5
Experiment Graph (EG)
○ Union of all the Workload DAGs
○ Vertices: data artifacts
○ Edges: operations
Materializer
○ Stores data artifacts with a high likelihood of future reuse
Optimizer
○ Linear-time reuse algorithm
○ Finds the optimal execution DAG
[Architecture: the client parses an ML workload into a Workload DAG and sends it to the server; the optimizer rewrites it into an optimized DAG using the Experiment Graph, the executor runs it, and the materializer stores selected artifacts of the annotated DAG back into the Experiment Graph]
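The Experiment Graph described above can be pictured as a small data structure. The following is an illustrative sketch only (the real system stores richer metadata per vertex and operation); the class and field names are hypothetical:

```python
class ExperimentGraph:
    """Union of all workload DAGs: vertices are artifacts, edges are operations."""

    def __init__(self):
        self.vertices = {}   # artifact id -> metadata (e.g. usage frequency)
        self.edges = {}      # (src, dst) -> operation name

    def merge(self, workload_dag):
        """Union a new workload DAG, given as (src, operation, dst) triples."""
        for src, op, dst in workload_dag:
            for v in (src, dst):
                self.vertices.setdefault(v, {"frequency": 0})
            # count how often an artifact is produced/used across workloads
            self.vertices[dst]["frequency"] += 1
            self.edges[(src, dst)] = op
```

Merging two workloads that share a prefix keeps a single copy of the shared vertices while their usage frequency grows, which later feeds the materialization utility.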
Materialization Problem
6
Given a storage budget, materialize a subset of the artifacts in order to minimize
the execution cost
Challenges:
1. Future workloads are unknown
2. Even if future workloads are known a priori, the problem is NP-hard (Bhattacherjee, 2015)
3. Accommodating large graphs and a fast rate of incoming workloads
Materialization Algorithm
7
For every artifact compute a utility value: (run-time × frequency × potential) / size
[Example Experiment Graph, split across client and server: vertices are annotated with (size in MB, potential) pairs, e.g. (50, 0.91) and (40, 0.9), and edges with run-times; materialized and un-materialized artifacts are distinguished]
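A minimal sketch of a utility-driven greedy materializer, assuming the utility is (run-time × frequency × potential) / size as the slide layout suggests; the exact weighting in the paper may differ, and the Artifact fields are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    name: str
    size: float       # storage footprint (MB)
    run_time: float   # cost to recreate the artifact (s)
    frequency: int    # how often past workloads used it
    potential: float  # quality of the best model reachable from it

def utility(a: Artifact) -> float:
    # Costly-to-recreate, frequently used, high-potential artifacts
    # are favored; large artifacts are penalized.
    return (a.run_time * a.frequency * a.potential) / a.size

def materialize(artifacts, budget_mb):
    """Greedily pick artifacts by utility until the budget is exhausted."""
    chosen = []
    for a in sorted(artifacts, key=utility, reverse=True):
        if a.size <= budget_mb:
            chosen.append(a)
            budget_mb -= a.size
    return chosen
```

Because future workloads are unknown and the offline problem is NP-hard, a greedy pass over utilities like this trades optimality for speed on large graphs.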
Storage-aware Materialization
8
● Many duplicated columns in intermediate data artifacts
○ Feature selection operations
○ Feature generation operations
● Apply a column deduplication strategy
while budget is not exhausted:
1. run the materialization algorithm
2. deduplicate the materialized artifacts
3. update the sizes of un-materialized artifacts
Improves storage utilization and run-time
[Figure: a feature selection operation maps an input artifact to an output artifact that shares most of its columns]
Reuse Problem
9
Given the EG and a new workload, find the set of workload artifacts to reuse from
the EG and the set of artifacts to compute
Challenges:
1. Exponential time complexity of exhaustive search
2. Accommodating large graphs and a fast rate of incoming workloads; the state-of-the-art
has polynomial time complexity O(|V|·|E|²) (Helix, 2018)
Reuse Algorithm
10
A linear-time algorithm that computes the optimal execution plan with a forward and
a backward pass on the workload DAG
Forward pass
• Accumulate run-times
• For every vertex, decide to load or compute
Backward pass
• Prune unnecessary loaded vertices
[Example workload DAG with vertices train, t_subset, y, top_feats, train_2, top_feats_2, merged_feats, model_3; materialized artifacts are distinguished from artifacts that are un-materialized or do not exist in the EG]
Evaluation
11
Workload
● Kaggle
○ Kaggle Home Credit Default Risk1 competition
○ 9 source datasets
○ 5 real and 3 generated workloads
○ Total of 130 GB of artifacts
Baselines
● Helix
○ State-of-the-art iterative ML framework
○ Materialization algorithm:
■ Only execution time is considered
■ Nodes are not prioritized
○ Reuse algorithm:
■ Utilizes the Edmonds-Karp max-flow algorithm, which runs in O(|V|·|E|²)
● Naïve
○ No optimization
1. https://www.kaggle.com/c/home-credit-default-risk
End-to-end Run-time
12
[Figure: repeated executions of Kaggle workloads (materialization budget = 16 GB)]
[Figure: execution of Kaggle workloads in sequence (materialization budget = 16 GB)]
Optimizing ML workloads improves run-time by up to an order of magnitude for repeated
executions and by 50% for different workloads
Materialization Impact
13
[Figure: total run-time of the Kaggle workloads with different materialization strategies and budgets]
Exploiting artifact characteristics, such as utility and duplication rate, improves the
materialization process and reduces run-time by 50%
Reuse Overhead
14
Overhead of different reuse algorithms for 10,000 synthetically generated workloads
[Plot: cumulative overhead (s, 0 to 3k) vs. number of workloads (10^0 to 10^4), comparing Colab-reuse and Helix]
Our linear-time reuse algorithm generates negligible overhead in real collaborative
environments where 1000s of workloads are executed
Summary
15
● Optimization of ML workloads in collaborative
environments through materialization and
reuse, while incurring negligible overhead
● Things not covered in the talk:
○ API and DAG construction
○ Quality-based materialization
○ Model warmstarting
[Architecture diagram: the client parses the ML workload into a Workload DAG; the server's optimizer, materializer, and executor operate on the Experiment Graph to produce an optimized DAG]
References
16
● Bhattacherjee, Souvik, et al. "Principles of dataset versioning: Exploring the recreation/storage tradeoff." Proceedings of the VLDB Endowment 8.12 (2015).
● Xin, Doris, et al. "Helix: Holistic optimization for accelerating iterative machine learning." Proceedings of the VLDB Endowment 12.4 (2018): 446-460.
● Edmonds, Jack, and Richard M. Karp. "Theoretical improvements in algorithmic efficiency for network flow problems." Journal of the ACM 19.2 (1972): 248-264. https://doi.org/10.1145/321694.321699
● Kluyver, Thomas, et al. "Jupyter Notebooks - a publishing format for reproducible computational workflows." Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.). IOS Press, 2016, 87-90.
● Burns, Brendan, et al. "Borg, Omega, and Kubernetes." Queue 14.1 (2016): 70-93.
Evaluation (2)
17
Workload
● OpenML
○ Classification Task 312
○ 2000 scikit-learn pipelines
○ 1.5 GB of artifacts
DAG Construction
19
train = pd.read_csv('train.csv')  # [ad_desc,ts,u_id,price,y]
ad_desc = train['ad_desc']
vectorizer = CountVectorizer()
count_vectorized = vectorizer.fit_transform(ad_desc)
print(count_vectorized.head())
selector = SelectKBest(k=2)
t_subset = train[['ts','u_id','price']]
y = train['y']
top_features = selector.fit_transform(t_subset, y)
model_1 = svm.SVC().fit(top_features, y)
print(model_1)  # terminal vertex
X = pd.concat([count_vectorized, top_features], axis=1)
model_2 = svm.SVC().fit(X, y)
print(model_2)  # terminal vertex
[Resulting DAG: the source vertex train (40 MB) branches into ad_desc, t_subset, and y; operations produce cnt_vect, cnt_head, top_feats, and X; model_1 and model_2 are terminal vertices]
DAG Annotation
20
● Edge run-time
● Vertex size
● Vertex potential
○ Quality of the best reachable model
[Annotated example DAG: edges carry run-times (0.05 s to 5 s) and vertices carry sizes (0.5 MB to 50 MB); model_1 has accuracy 0.9 and model_2 has accuracy 0.91, so each vertex's potential is 0.9 or 0.91 depending on the best model reachable from it, and 0.0 for vertices from which no model is reachable]
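The vertex potential described above (quality of the best reachable model) can be computed with a memoized traversal toward the terminal model vertices. A minimal sketch; the edge-list representation and function names are illustrative, not the system's API:

```python
from collections import defaultdict

def compute_potentials(edges, model_quality):
    """edges: (parent, child) pairs of a workload DAG.
    model_quality: terminal model vertex -> quality score.
    Returns each vertex's potential: the quality of the best model
    reachable from it, or 0.0 if no model is reachable."""
    children = defaultdict(list)
    vertices = set()
    for u, v in edges:
        children[u].append(v)
        vertices.update((u, v))

    memo = {}
    def potential(v):
        if v in memo:
            return memo[v]
        # a model vertex contributes its own quality; others start at 0.0
        best = model_quality.get(v, 0.0)
        for c in children[v]:
            best = max(best, potential(c))
        memo[v] = best
        return best

    return {v: potential(v) for v in vertices}
```

On the slide's example, every vertex on a path to model_2 gets potential 0.91, vertices that only reach model_1 get 0.9, and dead ends such as cnt_head get 0.0.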
Model Materialization and Warmstarting
21
Run-time of model benchmarking scenario
Impact of Model Warmstarting on Accuracy
22
Reuse Algorithm (Forward-pass)
24
Forward-pass
1. Traverse from the source
2. Accumulate run-times
3. For every materialized node, compare the accumulated run-time with the load time
[Worked example: accumulated run-times (Acc = 0.1, 0.05, 2.1) are compared against load costs (e.g. load cost = 0.06 and load = 1.5, each with Load < Acc); artifacts to load: train, t_subset, y, top_feats; artifacts to execute: train_2, top_feats_2, merged_feats, model_3; materialized artifacts are distinguished from artifacts that are un-materialized or do not exist in the EG]
Reuse Algorithm (Backward-pass)
25
Backward-pass
1. Traverse backward from the terminal
2. For every loaded artifact, stop the traversal of its parents
3. Prune artifacts that are not visited
[Worked example: the traversal from model_3 stops at the loaded artifacts y and top_feats; train and t_subset are never visited and are pruned; train_2, top_feats_2, merged_feats, and model_3 remain to execute]
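The two passes can be sketched in a few lines. This is a simplified illustration under stated assumptions (the dag/parents/run_time/load_cost structures are hypothetical stand-ins for the system's workload DAG representation, and operation semantics are ignored):

```python
def optimize(dag, parents, run_time, load_cost, materialized, terminals):
    """dag: vertices in topological order; parents: vertex -> list of parents.
    Returns (to_load, to_execute) after a forward and a backward pass."""
    # Forward pass: accumulate recreation cost; at every materialized
    # vertex, decide whether loading beats recomputing.
    acc, to_load = {}, set()
    for v in dag:
        cost = run_time.get(v, 0.0) + sum(acc[p] for p in parents.get(v, []))
        if v in materialized and load_cost[v] < cost:
            to_load.add(v)
            cost = load_cost[v]
        acc[v] = cost

    # Backward pass: walk from the terminals; loaded vertices cut off
    # their parents; anything never visited is pruned.
    visited, stack = set(), list(terminals)
    while stack:
        v = stack.pop()
        if v in visited:
            continue
        visited.add(v)
        if v not in to_load:
            stack.extend(parents.get(v, []))
    to_load &= visited
    to_execute = visited - to_load
    return to_load, to_execute
```

Each vertex and edge is visited a constant number of times, which is where the linear-time bound over the O(|V|·|E|²) max-flow approach comes from.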
Storage-aware Materialization
26
● Many duplicated columns in intermediate data artifacts
● Apply a column deduplication strategy
train = pd.read_csv('train.csv')  # [ad_desc,ts,u_id,price,y]
ad_desc = train['ad_desc']
t_subset = train[['ts','u_id','price']]
y = train['y']
[DAG: train branches into ad_desc, t_subset, and y, which all share columns with train]
while budget is not exhausted:
1. run the materialization algorithm
2. deduplicate the materialized artifacts
3. update the sizes of un-materialized artifacts
Improves storage utilization and run-time
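Column-level deduplication can be illustrated with pandas: store each distinct column once, keyed by a content hash, and keep only column references per artifact. A minimal sketch (the DedupStore class is illustrative; a real implementation would also verify equality to guard against hash collisions):

```python
import pandas as pd
from pandas.util import hash_pandas_object

class DedupStore:
    """Stores each distinct column once; artifacts keep column references."""

    def __init__(self):
        self.columns = {}    # content hash -> pd.Series
        self.artifacts = {}  # artifact name -> list of (column name, hash)

    def put(self, name, df):
        refs = []
        for col in df.columns:
            h = hash_pandas_object(df[col], index=True).sum()
            self.columns.setdefault(h, df[col])   # keep only the first copy
            refs.append((col, h))
        self.artifacts[name] = refs

    def get(self, name):
        # Reassemble the artifact from the shared column pool.
        refs = self.artifacts[name]
        return pd.DataFrame({col: self.columns[h] for col, h in refs})
```

Storing both train and its projection t_subset this way keeps the shared ts, u_id, and price columns once, which is why deduplication frees budget for more artifacts.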
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...
Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...
Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 
HPC + Ai: Machine Learning Models in Scientific Computing
HPC + Ai: Machine Learning Models in Scientific ComputingHPC + Ai: Machine Learning Models in Scientific Computing
HPC + Ai: Machine Learning Models in Scientific Computing
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animationsRoots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
Hfsp bringing size based scheduling to hadoop
Hfsp bringing size based scheduling to hadoopHfsp bringing size based scheduling to hadoop
Hfsp bringing size based scheduling to hadoop
 
Dynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersDynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using Containers
 

Similar to Optimizing Machine Learning Pipelines in Collaborative Environments

Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
DataWorks Summit
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
VMware Tanzu
 
Parsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in PythonParsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in Python
Daniel S. Katz
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore Architecture
Luiz Henrique Zambom Santana
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
MongoDB APAC
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
Sathish24111
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
Nik Peric
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
Aamir Ameen
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
Amazon Web Services
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing Clusters
Tarik Reza Toha
 
Materialization and Reuse Optimizations for Production Data Science Pipelines
Materialization and Reuse Optimizations for Production Data Science PipelinesMaterialization and Reuse Optimizations for Production Data Science Pipelines
Materialization and Reuse Optimizations for Production Data Science Pipelines
Behrouz Derakhshan
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
Mahantesh Angadi
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Ramiro Aduviri Velasco
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Flink Forward
 

Similar to Optimizing Machine Learning Pipelines in Collaborative Environments (20)

Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
Parsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in PythonParsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in Python
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore Architecture
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing Clusters
 
Materialization and Reuse Optimizations for Production Data Science Pipelines
Materialization and Reuse Optimizations for Production Data Science PipelinesMaterialization and Reuse Optimizations for Production Data Science Pipelines
Materialization and Reuse Optimizations for Production Data Science Pipelines
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 

Recently uploaded

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 

Recently uploaded (20)

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 

Optimizing Machine Learning Pipelines in Collaborative Environments

  • 1. Optimizing Machine Learning Workloads in Collaborative Environments. Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Ziawasch Abedjan, Tilmann Rabl, and Volker Markl
  • 2–4. Introduction: Data science and machine learning combine math and statistics, computer science, and domain knowledge, supported by a DS and ML toolkit. High-quality DS and ML applications require effective collaboration.
  • 5–8. Collaborative DS and ML: Jupyter Notebooks enable sharing code and results. Containerized environments [1] enable execution and deployment of the notebooks. Platforms, such as Google Colab [2] and Kaggle [3], enable effective collaboration among data scientists. 1. https://www.docker.com/ 2. https://colab.research.google.com/ 3. https://www.kaggle.com/
  • 9–12. Problems: Lack of management for generated data artifacts, and re-execution of existing notebooks. The top 3 notebooks of the Home Credit Default Risk Kaggle competition1 generate hundreds of GBs of intermediate data artifacts and are copied 10,000 times. The lack of data management in existing collaborative environments leads to thousands of hours of redundant data processing and model training. 1. https://www.kaggle.com/c/home-credit-default-risk
  • 13–14. Collaborative ML Workload Optimizer: The client submits a Workload DAG to the server, where a Parser produces an Annotated DAG, the Optimizer produces an Optimized DAG, and the Executor runs it; the Materializer maintains the Experiment Graph. Experiment Graph: the union of all the workload DAGs, with data artifacts as vertices and operations as edges. Materializer: stores data artifacts with a high likelihood of future reuse. Optimizer: a linear-time reuse algorithm that finds the optimal execution DAG.
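The Experiment Graph described on these slides can be pictured as a plain adjacency structure. Below is a minimal Python sketch (class and field names are illustrative assumptions, not the system's actual API): unioning each incoming workload DAG increments a per-artifact frequency counter, which later feeds the materialization utility.

```python
# Minimal sketch of an Experiment Graph (EG): the union of all workload
# DAGs, with data artifacts as vertices and operations as edges.
# All names here are illustrative assumptions, not the system's real API.

class ExperimentGraph:
    def __init__(self):
        self.vertices = {}   # artifact id -> {"size": bytes, "frequency": int}
        self.edges = {}      # (src artifact, dst artifact) -> operation name

    def union_workload(self, artifacts, operations):
        """Union one workload DAG into the EG.

        artifacts:  {artifact_id: size_in_bytes}
        operations: [(src_id, dst_id, op_name), ...]
        """
        for aid, size in artifacts.items():
            v = self.vertices.setdefault(aid, {"size": size, "frequency": 0})
            v["frequency"] += 1          # artifact appeared in one more workload
        for src, dst, op in operations:
            self.edges[(src, dst)] = op  # identical operations collapse to one edge


eg = ExperimentGraph()
eg.union_workload({"train": 50, "top_feats": 5},
                  [("train", "top_feats", "feature_selection")])
eg.union_workload({"train": 50, "model": 1},
                  [("train", "model", "fit")])
print(eg.vertices["train"]["frequency"])  # -> 2 (train used by both workloads)
```

Because identical artifacts collapse into a single vertex, the graph grows with the number of distinct artifacts rather than the number of workloads.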
  • 15–16. Materialization Problem: Given a storage budget, materialize a subset of the artifacts so as to minimize the execution cost. Challenges: 1. Future workloads are unknown. 2. Even if future workloads are known a priori, the problem is NP-hard (Bhattacherjee, 2015). 3. The system must accommodate large graphs and a fast rate of incoming workloads.
  • 17–18. Materialization Algorithm: For every artifact, compute a utility value from its run-time (recreation cost), size, frequency, and potential for future reuse; the materializer then stores the highest-utility artifacts of the Experiment Graph, leaving the rest unmaterialized.
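The utility-driven selection can be sketched as a greedy loop. The concrete utility formula below is an assumption that only combines the slide's four ingredients (run-time, frequency, potential, size), not the paper's exact function: artifacts that are expensive to recompute, frequently used, and small get priority under the storage budget.

```python
# Greedy sketch of utility-driven materialization. The utility formula is
# an illustrative assumption, not the paper's exact function.

def utility(a):
    # run_time: cost to recreate; frequency: uses across workloads;
    # potential: estimated likelihood of future reuse; size: storage cost
    return a["run_time"] * a["frequency"] * a["potential"] / a["size"]

def materialize(artifacts, budget):
    chosen = []
    for a in sorted(artifacts, key=utility, reverse=True):
        if a["size"] <= budget:       # skip artifacts that no longer fit
            chosen.append(a["id"])
            budget -= a["size"]
    return chosen

arts = [
    {"id": "train",     "run_time": 50, "frequency": 5, "potential": 0.9, "size": 40},
    {"id": "top_feats", "run_time": 10, "frequency": 4, "potential": 0.9, "size": 3},
    {"id": "model_3",   "run_time": 23, "frequency": 3, "potential": 0.9, "size": 4},
]
print(materialize(arts, budget=10))  # -> ['model_3', 'top_feats']
```

With a budget of 10, the large `train` artifact is skipped even though it is costly to recompute, because the two small artifacts deliver more utility per stored byte.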
  • 19–22. Storage-aware Materialization: Intermediate data artifacts contain many duplicated columns, produced by feature selection and feature generation operations (the output artifact of a feature selection shares most of its columns with its input artifact). Apply a column deduplication strategy: while the budget is not exhausted, 1. run the materialization algorithm, 2. deduplicate the materialized artifacts, 3. update the sizes of the unmaterialized artifacts. Storage-aware materialization improves both storage utilization and run-time.
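The three-step loop above can be sketched as follows. The artifact layout (a dict of column name to bytes) and the utility-per-effective-byte ranking are illustrative assumptions: the key point is that a column already stored by some materialized artifact costs nothing to store again, so each round shrinks the effective sizes of the remaining artifacts and lets more of them fit the budget.

```python
# Sketch of the storage-aware materialization loop with column
# deduplication (artifact layout and ranking are illustrative assumptions).

def effective_size(artifact, stored_columns):
    # Only columns not yet in the deduplicated store consume budget.
    return sum(size for col, size in artifact["columns"].items()
               if col not in stored_columns)

def storage_aware_materialize(artifacts, budget):
    stored_columns = set()   # columns already in the deduplicated store
    materialized = []
    progress = True
    while progress:
        progress = False
        # 1. run the materialization step, ranking by utility per effective byte
        candidates = sorted(
            (a for a in artifacts if a["id"] not in materialized),
            key=lambda a: a["utility"] / max(effective_size(a, stored_columns), 1),
            reverse=True)
        for a in candidates:
            size = effective_size(a, stored_columns)
            if size <= budget:
                # 2. deduplicate: only new columns consume budget
                stored_columns.update(a["columns"])
                budget -= size
                materialized.append(a["id"])
                progress = True
                break  # 3. remaining sizes changed, so re-rank the candidates
    return materialized

arts = [
    {"id": "train",    "utility": 10, "columns": {"a": 4, "b": 4, "c": 4}},
    {"id": "t_subset", "utility": 8,  "columns": {"a": 4, "b": 4}},
]
# Raw sizes sum to 20, but deduplication fits both into a 14-byte budget.
print(storage_aware_materialize(arts, budget=14))  # -> ['t_subset', 't rain'][sic] -> ['t_subset', 'train']
```

Here `t_subset` is a feature-selection output that shares columns `a` and `b` with `train`, so once it is stored, materializing `train` only costs column `c`.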
  • 23–24. Reuse Problem: Given the Experiment Graph (EG) and a new workload, find the set of workload artifacts to reuse from the EG and the set of artifacts to compute. Challenges: 1. An exhaustive search has exponential time complexity. 2. The system must accommodate large graphs and a fast rate of incoming workloads, while the state of the art has polynomial time complexity 𝓞(|𝒱|·|𝘌|²) (Helix, 2018).
  • 25–28. Reuse Algorithm: A linear-time algorithm computes the optimal execution plan with a forward and a backward pass over the workload DAG (example vertices: train, t_subset, y, top_feats, train_2, top_feats_2, merged_feats, model_3; some are materialized in the EG, the others are unmaterialized or do not exist in the EG). Forward pass: accumulate run-times and, for every vertex, decide whether to load it or compute it. Backward pass: prune unnecessarily loaded vertices.
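The two passes can be sketched as below (node fields and the cost model are illustrative assumptions; "materialized" means a copy exists in the Experiment Graph). Each vertex and edge is visited a constant number of times, which is what makes the algorithm linear in the size of the workload DAG.

```python
# Sketch of the two-pass, linear-time reuse algorithm over a workload DAG,
# visited in topological order. Node fields are illustrative assumptions.

def forward_pass(dag, order):
    """For every vertex, accumulate run-times and choose load vs. compute."""
    cost, plan = {}, {}
    for v in order:
        node = dag[v]
        compute_cost = node["op_time"] + sum(cost[p] for p in node["parents"])
        if node.get("materialized") and node["load_time"] < compute_cost:
            plan[v], cost[v] = "load", node["load_time"]
        else:
            plan[v], cost[v] = "compute", compute_cost
    return plan

def backward_pass(dag, order, plan, targets):
    """Prune loaded vertices that no computed vertex actually consumes."""
    needed = set(targets)
    for v in reversed(order):
        if v in needed and plan[v] == "compute":
            needed.update(dag[v]["parents"])   # computing v needs its inputs
    return {v: plan[v] for v in plan if v in needed}

dag = {
    "train":     {"parents": [], "op_time": 10, "materialized": True, "load_time": 2},
    "top_feats": {"parents": ["train"], "op_time": 5, "materialized": True, "load_time": 1},
    "model_3":   {"parents": ["top_feats"], "op_time": 20},
}
order = ["train", "top_feats", "model_3"]
plan = backward_pass(dag, order, forward_pass(dag, order), targets=["model_3"])
print(plan)  # train is pruned: top_feats is loaded, so train is never needed
```

In this example the forward pass marks both `train` and `top_feats` as loads, but since `model_3` only consumes `top_feats` (which is loaded, not computed), the backward pass prunes the now-unnecessary load of `train`.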
  • 29. Evaluation 11 ● Workload ○ Kaggle: Home Credit Default Risk1 competition ○ 9 source datasets ○ 5 real and 3 generated workloads ○ 130 GB of artifacts in total ● Baselines ○ Helix: state-of-the-art iterative ML framework ■ Materialization algorithm: only execution time is considered; nodes are not prioritized ■ Reuse algorithm: utilizes the Edmonds-Karp max-flow algorithm, which runs in 𝓞(|𝒱|·|𝘌|²) ○ Naïve: no optimization 1. https://www.kaggle.com/c/home-credit-default-risk
  • 30. End-to-end Run-time 12 Repeated executions of Kaggle workloads (materialization budget = 16 GB); execution of Kaggle workloads in sequence (materialization budget = 16 GB). Optimizing ML workloads improves run-time by up to one order of magnitude for repeated executions and by 50% for different workloads
  • 33. Materialization Impact 13 Total run-time of the Kaggle workloads with different materialization strategies and budgets. Exploiting artifact characteristics, such as utility and duplication rate, improves the materialization process and reduces run-time by 50%
  • 35. Reuse Overhead 14 Overhead of different reuse algorithms for 10,000 synthetically generated workloads (chart: cumulative overhead in seconds, 0 to 3k, vs. number of workloads, 10^0 to 10^4, for Colab-reuse and Helix). Our linear-time reuse algorithm incurs negligible overhead in real collaborative environments where 1000s of workloads are executed
  • 37. Summary 15 (Architecture diagram: Client and Server; components: Parser, Optimizer, Materializer, Executor, Experiment Graph; DAGs: Workload DAG, Optimized DAG, Annotated DAG) ● Optimization of ML workloads in collaborative environments through Materialization and Reuse, while incurring negligible overhead ● Things not covered in the talk: ○ API and DAG Construction ○ Quality-based Materialization ○ Model Warmstarting
  • 38. References 16 ● Souvik Bhattacherjee et al. "Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff." Proceedings of the VLDB Endowment 8.12 (2015). ● Doris Xin et al. "Helix: Holistic Optimization for Accelerating Iterative Machine Learning." Proceedings of the VLDB Endowment 12.4 (2018): 446-460. ● Jack Edmonds and Richard M. Karp. "Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems." J. ACM 19.2 (1972): 248-264. https://doi.org/10.1145/321694.321699 ● Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, et al. "Jupyter Notebooks – a Publishing Format for Reproducible Computational Workflows." Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.). IOS Press, 2016. 87-90. ● Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. "Borg, Omega, and Kubernetes." Queue 14.1 (2016): 70-93.
  • 39. Evaluation (2) 17 ● Workload: OpenML ○ Classification Task 312 ○ 2,000 scikit-learn pipelines ○ 1.5 GB of artifacts
  • 40. DAG Construction 18
train = pd.read_csv('train.csv') # [ad_desc,ts,u_id,price,y]
ad_desc = train['ad_desc']
vectorizer = CountVectorizer()
count_vectorized = vectorizer.fit_transform(ad_desc)
print(count_vectorized.head())
selector = SelectKBest(k=2)
t_subset = train[['ts','u_id','price']]
y = train['y']
top_features = selector.fit_transform(t_subset, y)
model_1 = svm.SVC().fit(top_features, y)
print(model_1) # terminal vertex
X = pd.concat([count_vectorized, top_features], axis=1)
model_2 = svm.SVC().fit(X, y)
print(model_2) # terminal vertex
(DAG: vertices train, ad_desc, t_subset, y, cnt_vect, cnt_head, top_feats, X, model_1, model_2 are artifacts; edges are operations; vertices carry annotations such as size = 40 MB, run-time = 5 s, and accuracy = 0.91)
  • 45. DAG Construction 19 The same DAG with vertex roles: train is the source vertex; cnt_head, model_1, and model_2 are terminal vertices; edges are operations
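The artifact/operation mapping above can be sketched in plain Python. This is an illustrative stand-in, not the system's actual API: the `WorkloadDAG` class and its method names are made up for the example.

```python
class WorkloadDAG:
    """Illustrative workload DAG: vertices are artifacts, edges are operations."""

    def __init__(self):
        self.parents = {}   # artifact -> list of (parent artifact, operation)
        self.kind = {}      # artifact -> "source" | "artifact" | "terminal"

    def add_source(self, name):
        self.parents[name] = []
        self.kind[name] = "source"

    def add_operation(self, op, inputs, output):
        # The output of every operation becomes a new artifact vertex;
        # the edge from each input carries the operation label.
        self.parents[output] = [(inp, op) for inp in inputs]
        self.kind[output] = "artifact"

    def mark_terminal(self, name):
        self.kind[name] = "terminal"

    def ancestors(self, name):
        # All artifacts the given artifact transitively depends on.
        seen, stack = set(), [name]
        while stack:
            for parent, _ in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

# Re-creating the slide's example pipeline as a DAG:
dag = WorkloadDAG()
dag.add_source("train")
dag.add_operation("project[ad_desc]", ["train"], "ad_desc")
dag.add_operation("project[ts,u_id,price]", ["train"], "t_subset")
dag.add_operation("project[y]", ["train"], "y")
dag.add_operation("CountVectorizer.fit_transform", ["ad_desc"], "cnt_vect")
dag.add_operation("head", ["cnt_vect"], "cnt_head")
dag.add_operation("SelectKBest.fit_transform", ["t_subset", "y"], "top_feats")
dag.add_operation("SVC.fit", ["top_feats", "y"], "model_1")
dag.add_operation("concat", ["cnt_vect", "top_feats"], "X")
dag.add_operation("SVC.fit", ["X", "y"], "model_2")
dag.mark_terminal("cnt_head")
dag.mark_terminal("model_1")
dag.mark_terminal("model_2")

print(sorted(dag.ancestors("model_1")))
# ['t_subset', 'top_feats', 'train', 'y']
```

The `ancestors` view is what the reuse algorithm later walks: everything a terminal vertex depends on is a candidate for loading from EG instead of recomputing.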
  • 49. DAG Annotation 20 ● Edge run-time ● Vertex size ● Vertex potential ○ Quality of the best reachable model (Figure: the example DAG annotated with edge run-times from 0.05 s to 5 s, vertex sizes from 0.5 MB to 50 MB, and vertex potentials propagated back from model_1 (accuracy = 0.9) and model_2 (accuracy = 0.91); vertices that reach no model have potential 0.0)
  • 53. Model Materialization and Warmstarting 21 Run-time of the model benchmarking scenario
  • 54. Impact of Model Warmstarting on Accuracy 22
  • 55. Reuse Algorithm 23 A linear-time algorithm to compute the optimal execution plan with a forward and a backward pass over the workload DAG (example DAG: train, t_subset, y, top_feats, train_2, top_feats_2, merged_feats, model_3; legend: materialized artifacts vs. artifacts that are un-materialized or do not exist in EG)
  • 56. Reuse Algorithm (Forward-pass) 24 1. Traverse from the source 2. Accumulate run-times 3. For every materialized node, compare the accumulated run-time with the load time; load when loading is cheaper, otherwise compute. Example: train has accumulated run-time 0.1 and load cost 0.06, so it is loaded (Load < Acc); top_feats has accumulated run-time 2.1 and load cost 1.5, so it is also loaded. After the forward pass — artifacts to load: train, t_subset, y, top_feats; artifacts to execute: train_2, top_feats_2, merged_feats, model_3
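The forward pass can be sketched as a single sweep over a topological order of the workload DAG. The function name and any cost values not shown on the slide (and the assumption that model_3 also consumes y directly) are illustrative:

```python
def forward_pass(order, parents, compute_cost, load_cost, materialized):
    """Forward pass of the reuse algorithm (illustrative sketch).
    `order` is a topological order of the workload DAG; `compute_cost[v]`
    is v's own operation run-time; `load_cost[v]` is defined for artifacts
    materialized in the experiment graph (EG). Returns a LOAD/COMPUTE
    decision per vertex, picking whichever accumulated cost is cheaper."""
    acc, decision = {}, {}
    for v in order:
        # Cost of computing v = its own run-time plus the accumulated
        # cost of obtaining all of its parents.
        cost = compute_cost[v] + sum(acc[p] for p in parents[v])
        if v in materialized and load_cost[v] < cost:
            acc[v], decision[v] = load_cost[v], "LOAD"
        else:
            acc[v], decision[v] = cost, "COMPUTE"
    return decision

# The slide's example DAG; costs use the slide's numbers where shown
# (train: load 0.06 vs. accumulated 0.1) and made-up values elsewhere.
parents = {"train": [], "t_subset": ["train"], "y": ["train"],
           "top_feats": ["t_subset", "y"],
           "train_2": [], "top_feats_2": ["train_2"],
           "merged_feats": ["top_feats", "top_feats_2"],
           "model_3": ["merged_feats", "y"]}
compute = {"train": 0.1, "t_subset": 0.05, "y": 0.05, "top_feats": 2.0,
           "train_2": 0.1, "top_feats_2": 2.0, "merged_feats": 0.5,
           "model_3": 3.0}
load = {"train": 0.06, "t_subset": 0.04, "y": 0.03, "top_feats": 1.5}
order = ["train", "t_subset", "y", "top_feats", "train_2",
         "top_feats_2", "merged_feats", "model_3"]
plan = forward_pass(order, parents, compute, load, set(load))
print(plan["train"], plan["top_feats"], plan["train_2"], plan["model_3"])
# LOAD LOAD COMPUTE COMPUTE
```

Because each vertex and edge is touched once, this pass is linear in the size of the workload DAG, in contrast to the 𝓞(|𝒱|·|𝘌|²) max-flow-based approach.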
  • 63. Reuse Algorithm (Backward-pass) 25 1. Traverse backward from the terminal 2. For every loaded artifact, stop the traversal of its parents 3. Prune artifacts that are not visited. Example: top_feats is loaded, so its parents train and t_subset are never visited and are pruned. Final plan — artifacts to load: y, top_feats; artifacts to execute: train_2, top_feats_2, merged_feats, model_3
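The backward pass can be sketched as a reverse reachability walk from the terminal vertex. The function name, the decision dictionary (taken as the forward pass's output on the slide's example), and the assumption that model_3 also consumes y directly are illustrative:

```python
def backward_pass(terminals, parents, decision):
    """Backward pass of the reuse algorithm (illustrative sketch).
    Walks backward from the terminal vertices; loading an artifact makes
    its parents unnecessary, so traversal stops there. Every vertex never
    visited is pruned from the execution plan."""
    visited, stack = set(), list(terminals)
    while stack:
        v = stack.pop()
        if v in visited:
            continue
        visited.add(v)
        if decision[v] != "LOAD":   # loaded vertices cut off their parents
            stack.extend(parents[v])
    return {v: (decision[v] if v in visited else "PRUNE") for v in decision}

# Continuing the slide's example after the forward pass:
parents = {"train": [], "t_subset": ["train"], "y": ["train"],
           "top_feats": ["t_subset", "y"],
           "train_2": [], "top_feats_2": ["train_2"],
           "merged_feats": ["top_feats", "top_feats_2"],
           "model_3": ["merged_feats", "y"]}
decision = {"train": "LOAD", "t_subset": "LOAD", "y": "LOAD",
            "top_feats": "LOAD", "train_2": "COMPUTE",
            "top_feats_2": "COMPUTE", "merged_feats": "COMPUTE",
            "model_3": "COMPUTE"}
plan = backward_pass(["model_3"], parents, decision)
print(plan["train"], plan["t_subset"], plan["y"], plan["top_feats"])
# PRUNE PRUNE LOAD LOAD
```

The result matches the slide: train and t_subset, although loadable, are pruned because top_feats is loaded directly, while y stays loaded since the terminal still needs it.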
  • 67. Storage-aware Materialization 26 ● Many duplicated columns in intermediate data artifacts ● Apply a column deduplication strategy
train = pd.read_csv('train.csv') # [ad_desc,ts,u_id,price,y]
ad_desc = train['ad_desc']
t_subset = train[['ts','u_id','price']]
y = train['y']
(here ad_desc, t_subset, and y together duplicate every column of train)
Storage-aware materialization: while the budget is not exhausted: 1. run the materialization algorithm 2. deduplicate the materialized artifacts 3. update the sizes of the unmaterialized artifacts
Improves storage utilization and run-time
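The column-level deduplication idea can be sketched with a content-hashed column pool. `DedupStore` and `column_key` are illustrative stand-ins, not the paper's implementation; columns with the same name and content are stored once, and artifacts only keep references:

```python
import hashlib

def column_key(name, values):
    """Content hash of a (column name, values) pair; identical columns
    appearing in different artifacts collide on purpose."""
    h = hashlib.sha256()
    h.update(name.encode())
    h.update(repr(values).encode())
    return h.hexdigest()

class DedupStore:
    """Illustrative column-deduplicating artifact store: a column shared
    by many artifacts costs its storage size only once."""

    def __init__(self):
        self.columns = {}    # content hash -> column values
        self.artifacts = {}  # artifact name -> {column name: content hash}

    def materialize(self, name, artifact):
        refs = {}
        for col, values in artifact.items():
            key = column_key(col, values)
            self.columns.setdefault(key, values)  # store each column once
            refs[col] = key
        self.artifacts[name] = refs

    def stored_cells(self):
        return sum(len(v) for v in self.columns.values())

# train and its projections share columns; dedup stores each column once.
train = {"ad_desc": ("a", "b"), "ts": (1, 2), "u_id": (7, 8),
         "price": (9.0, 8.0), "y": (0, 1)}
store = DedupStore()
store.materialize("train", train)
store.materialize("t_subset", {c: train[c] for c in ("ts", "u_id", "price")})
store.materialize("y", {"y": train["y"]})
print(len(store.columns))  # 5 distinct columns stored, not 9
```

With this layout, materializing a projection such as t_subset is nearly free in storage, which is why the slide's loop re-runs the materialization algorithm with updated (post-dedup) artifact sizes until the budget is exhausted.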