Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias Boehm

© 2015 IBM Corporation
S7/8: SystemML’s Optimizer and Runtime
Matthias Boehm (IBM Research – Almaden), Arvind C. Surve (IBM Spark Technology Center)
IBM Research

Abstraction: The Good, the Bad and the Ugly
q = t(X) %*% (w * (X %*% v))
[adapted from Peter Alvaro: "I See What You Mean", Strange Loop, 2015]
Simple & Analysis-Centric
Data Independence
Platform Independence
Adaptivity
(Missing) Size Information
Operator Selection
(Missing) Rewrites
Distributed Operations
Distributed Storage
(Implicit) Copy-on-Write
Data Skew
Load Imbalance
Latency
Complex Control Flow
Local / Remote Memory Budgets
The Ugly: Expectations ≠ Reality
⇒ Understanding of optimizer and runtime techniques underpinning declarative, large-scale ML
Efficiency & Performance

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
– ParFor Optimizer/Runtime
– Buffer Pool + Specific Optimizations
– Spark-Specific Rewrites
– Partitioning-Preserving Operations
– Update In-Place
– Ongoing Research (CLA)

Optimization through ParFor
§ Motivation
– SystemML focuses primarily on data parallelism
– Dedicated parfor construct for task parallelism
§ ParFor approach:
– Complementary parfor parallelization strategies
– Cost-based optimization framework for task-parallel ML
– Memory budget as common constraint

Recap: Basic HOP DAG Compilation
Example Pearson Correlation
§ DML Script
X = read( "./in/X" ); #data on HDFS
Y = read( "./in/Y" );
m = nrow(X);
sigmaX = sqrt( centralMoment(X,2)*(m/(m-1.0)) );
sigmaY = sqrt( centralMoment(Y,2)*(m/(m-1.0)) );
r = cov(X,Y) / (sigmaX * sigmaY);
write( r, "./out/r" );
$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$
§ HOP DAG
[Figure: HOP DAG of the script. Inputs X ("./in/X", 10^6 x 1) and Y ("./in/Y", 10^6 x 1); operators b(cov), b(cm), b(*), u(sqrt), b(/), b(-); constants 2, 1,000,000 and 1; output r ("./out/r"). b(/) and b(-) over 1,000,000 and 1 are shown w/o constant folding (1.000001).]
– u() … unary operator
– b() … binary operator
– cov … covariance
– cm … central moment
– sqrt … square root
Exploit Spark/MR data parallelism if beneficial/required
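The arithmetic of the DML script can be traced on a single node with a small Python sketch. It assumes that centralMoment(X,2) uses 1/m normalization (hence the m/(m-1) correction in the script) and that cov(X,Y) uses 1/(m-1) normalization; the helper names are illustrative and not part of SystemML.

```python
import math

def central_moment2(v):
    """Second central moment with 1/m normalization (assumed semantics)."""
    m = len(v)
    mean = sum(v) / m
    return sum((x - mean) ** 2 for x in v) / m

def cov(x, y):
    """Sample covariance with 1/(m-1) normalization (assumed semantics)."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1)

def pearson(x, y):
    """Mirror of the script: r = cov(X,Y) / (sigmaX * sigmaY)."""
    m = len(x)
    sigma_x = math.sqrt(central_moment2(x) * (m / (m - 1.0)))
    sigma_y = math.sqrt(central_moment2(y) * (m / (m - 1.0)))
    return cov(x, y) / (sigma_x * sigma_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly correlated with x
r = pearson(x, y)
```

Every operator here (central moment, covariance, sqrt, division) corresponds to one node of the HOP DAG, which is what the compiler later maps to local or distributed operations.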

Running Example: Pairwise Pearson Correlation
§ Representative for more complex bivariate statistics
(Pearson's R, ANOVA F, chi-squared, degrees of freedom, p-value, Cramér's V, Spearman, etc.)
D = read("./input/D");
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
}}
write(R, "./output/R");
Challenges:
• Triangular nested loop
• Column-wise access on unordered distributed data
• Bivariate all-to-all data shuffling pattern
Exploit task and data parallelism if beneficial/required

Overview Parallelization Strategies
§ Conceptual Design: Master/worker
– Task: group of parfor iterations
§ Task Partitioning
– Naive, static, fixed, factoring, factoring_cmax
– Task overhead vs load balance?
§ Task Execution
– Local, remote (Spark/MR), remoteDP (Spark/MR)
– Various runtime optimizations
– Degree of parallelism/IO/latency?
§ Result Aggregation
– Local memory, local file, remote (Spark/MR)
– W/ and w/o compare
– Result locality/IO/latency?
[Figure: parfor example with n = 12; loop body X = D[ ,i]; … R[i,j] = …]
⇒ Optimizer leverages these to generate efficient execution plans

Example Task Partitioning
§ Scenario: k=24 workers, 10,000 iterations
[Figure: # of iterations per task for each partitioning scheme]
– Naive: 1 iteration per task, tasks 1 to 10000
– Static: tasks 1 to 24
– Fixed (250): tasks 1 to 40
– Factoring: tasks 1 to 208
– Factoring CMAX (150): tasks 1 to 228
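The partitioning schemes can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not SystemML's implementation: factoring is taken to hand out batches of k tasks, each task covering ceil(remaining / (2k)) iterations, and the CMAX variant simply caps the task size. With k=24 and 10,000 iterations the sketch yields 10000, 24, 40 and 208 tasks for naive, static, fixed(250) and factoring, consistent with the axis labels "Tasks (1 to …)" of the figure. The exact rounding of the CMAX variant is not given in the slides, so its task count may differ from the figure.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b

def chunks(n, size):
    """Split n iterations into consecutive tasks of at most `size` iterations."""
    return [min(size, n - start) for start in range(0, n, size)]

def naive(n):
    return [1] * n                      # one iteration per task

def static(n, k):
    return chunks(n, ceil_div(n, k))    # one equally sized task per worker

def fixed(n, size):
    return chunks(n, size)              # user-defined task size

def factoring(n, k, cmax=None):
    """Batches of k tasks; each batch covers about half of the remaining work."""
    tasks, remaining = [], n
    while remaining > 0:
        size = ceil_div(remaining, 2 * k)
        if cmax is not None:
            size = min(size, cmax)      # assumed meaning of factoring_cmax
        for _ in range(k):
            if remaining == 0:
                break
            take = min(size, remaining)
            tasks.append(take)
            remaining -= take
    return tasks

k, n = 24, 10_000
sizes = {
    "naive": naive(n),
    "static": static(n, k),
    "fixed": fixed(n, 250),
    "factoring": factoring(n, k),
    "factoring_cmax": factoring(n, k, cmax=150),
}
```

The trade-off named on the previous slide is visible in the task lists: naive maximizes load balance at the cost of per-task overhead, static minimizes overhead, and factoring starts with large tasks and ends with small ones.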

Task Execution: Local and Remote Parallelism
§ Local execution (multicore): ParFOR (local)
– Task partitioning into a task queue
– Local ParWorkers 1..k: while(w ← deq()) foreach pi ∈ w: execute(prog(pi))
– Parallel result aggregation
§ Remote execution (cluster): ParFOR (remote)
– Task partitioning
– Hadoop ParWorker mappers 1..k: map(key,value): w ← parse(value); foreach pi ∈ w: execute(prog(pi))
– Parallel result aggregation (figure label: A|MATRIX|./out/A7tmp)
§ Example tasks (both cases): w1: i, {1,2,3}; w2: i, {4,5,6}; w3: i, {7,8}; w4: i, {9,10}; w5: i, {11}
§ Hybrid parallelism: combinations of local/remote and data-parallel jobs
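The local master/worker loop can be sketched with Python threads. The function run_parfor_local and its arguments are hypothetical names chosen for illustration; the comments map each step to the worker pseudocode of the slide. The sketch only shows the control structure (task queue, workers, result aggregation), not actual multicore speedup.

```python
import queue
import threading

def run_parfor_local(tasks, body, k):
    """Execute body(i) for every iteration i of every task with k local workers."""
    task_queue = queue.Queue()
    for task in tasks:                  # task partitioning fills the queue
        task_queue.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                w = task_queue.get_nowait()    # w <- deq()
            except queue.Empty:
                return                          # queue drained, worker ends
            for i in w:                         # foreach p_i in w
                value = body(i)                 # execute(prog(p_i))
                with lock:                      # result aggregation
                    results[i] = value

    threads = [threading.Thread(target=worker) for _ in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Tasks w1..w5 as in the figure: groups of parfor iterations.
tasks = [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10], [11]]
squares = run_parfor_local(tasks, lambda i: i * i, k=2)
```

In the remote case the same loop body runs inside a mapper, with the task read from the input record instead of a shared queue.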

Task Execution: Runtime Optimizations
§ Data Partitioning
– Problem: Repeated MR jobs for indexed access
– Access-awareness (cost estimation, correct plan generation)
– Operators: local file-based, remote MR job
§ Data Locality
– Problem: Co-location of parfor tasks to partitions/matrices
– Location reporting per logical parfor task (e.g., for parfor(i) → D[, i])
[Figure: data partitioning and locality for the nested loop]
X = D[ ,i]; …
parfor( j in (i+1):n ){
  Y = D[ ,j]; …
}}
– Partitions: Node 1 holds D1, D2, D6, D7, D8; Node 2 holds D3, D4, D5, D9, D10, D11
– Task file: w1: i, {1,2,3}; w2: i, {4,5,6}; w3: i, {7,8}; w4: i, {9,10}; w5: i, {11}
– Reported locations per task: Node 1, Node 2, or both (Node 1, 2)

Optimization Framework – Problem Formulation
§ Design: Runtime optimization for each top-level parfor
§ Plan Tree P
– Nodes N_P
• Exec type et
• Parallelism k
• Attributes A
– Height h
– Exec contexts EC_P
§ Plan Tree Optimization Problem
–
[Figure: plan tree P with nodes ParFOR, Generic, RIX, LIX, b(cm), b(cov), shown for two execution contexts ec0 and ec1 (MR); constraint annotations: cm_ec = 600 MB, ck_ec = 1 and cm_ec = 1024 MB, ck_ec = 16]
ec … execution context
cm … memory constraint
ck … parallelism constraint
[M. Boehm et al.: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7), 2014]
[M. Boehm et al.: Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs. CoRR, 2015]

Optimization Framework – Cost Model / Optimizer
§ Overview Heuristic Optimizer
– Time- and memory-based cost model w/o shared reads
– Heuristic high-impact rewrites
– Transformation-based search strategy with global opt scope
§ Cost Model
– HOP DAG size propagation
– Worst-case memory estimates
– Time estimates
– Plan tree statistics aggregation
[Figure: plan tree P (ParFOR with k=4) and its mapped HOP DAGs, annotated with dimensions and memory estimates M = (<output mem>, <operation mem>)]
– D: d1=1M, d2=10, M=(80 MB, 80 MB)
– RIX: d1=1M, d2=1, M=(8 MB, 88 MB)
– X, Y: d1=1M, d2=1, M=(8 MB, 8 MB)
– b(cm): d1=0, d2=0, M=(0 MB, 8 MB)
– b(cov): d1=0, d2=0, M=(0 MB, 16 MB)
– Aggregated: M=88MB per worker, M=352MB for k=4
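The memory numbers in the figure are consistent with a simple worst-case model, sketched below. The model is an assumption made for illustration: dense matrices with 8 bytes per cell, decimal megabytes, operation memory equal to the sum of input and output sizes, and the parfor estimate taken as k times the largest operation estimate of its body. The pairing of estimates with operators is likewise read off the figure labels.

```python
MB = 10**6            # assumption: decimal megabytes
BYTES_PER_CELL = 8    # assumption: dense double-precision cells

def dense_mem_mb(rows, cols):
    """Worst-case size of a dense rows x cols matrix in MB."""
    return rows * cols * BYTES_PER_CELL / MB

def op_mem(inputs, output):
    """Return (output mem, operation mem), operation mem = inputs + output."""
    out = dense_mem_mb(*output)
    return out, sum(dense_mem_mb(*dims) for dims in inputs) + out

D = (1_000_000, 10)      # d1=1M, d2=10
col = (1_000_000, 1)     # d1=1M, d2=1
scalar = (0, 0)          # scalars are annotated as d1=0, d2=0

estimates = {
    "RIX": op_mem([D], col),              # X = D[ ,i]
    "b(cm)": op_mem([col], scalar),       # centralMoment(X,2)
    "b(cov)": op_mem([col, col], scalar)  # cov(X,Y)
}

k = 4
per_worker = max(op for _, op in estimates.values())
parfor_total = k * per_worker
```

Comparing parfor_total against the memory constraint of the execution context is what lets the optimizer decide on the degree of parallelism k.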

Hands-On Lab: Task-Parallel ParFor Programs
§ Exercise: Pairwise Pearson Correlation
– a) Simple for loop w/ -stats
– b) Task-parallel parfor w/ -stats
D = rand(rows=100000, cols=100);
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
}}
write(R, "./tmp/R", format="binary");

Buffer Pool Overview
§ Motivation
– Exchange of intermediates between local and remote operations
(HDFS, RDDs, GPU device memory)
– Eviction of in-memory objects (integrated with garbage collector)
§ Primitives
– acquireRead, acquireModify, release, exportData, getRdd, getBroadcast
§ Spark Specifics
– Lineage tracking RDDs/broadcasts
– Guarded RDD collect/parallelize
– Partitioned broadcast variables
[Figure: MatrixObject/WriteBuffer, Lineage Tracking]

Spark-Specific Optimizations
§ Spark-Specific Rewrites
– Automatic caching/checkpoint injection
(MEM_DISK / MEM_DISK_SER)
– Automatic repartition injection
§ Operator Selection
– Spark exec type selection
– Transitive Spark exec type
– Physical operator selection
§ Extended ParFor Optimizer
– Deferred checkpoint/repartition injection
– Eager checkpointing/repartitioning
– Fair scheduling for concurrent jobs
– Local degree of parallelism
§ Runtime Optimizations
– Lazy Spark context creation
– Short-circuit read/collect
Ex: Checkpoint Injection LinregCG
X = read($1);
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X)%*%(X%*%p) + lambda*p;
  alpha = norm_r2 / (t(p)%*%q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);
Injected: chkpt X MEM_DISK
[Figure: Spark Exec (24 cores): 25% user, 75% data&exec (50% Min & 75% Max)]

SystemML on Spark: Lessons Learned
§ Spark over Custom Framework
– Well engineered framework with strong contributor base
– Seamless data preparation and feature engineering
§ Stateful Distributed Caching
– Standing executors with distributed caching and fast task scheduling
– Challenges: task parallelism, memory constraints, fair resource management
§ Memory Efficiency
– Compact data structures to avoid cache spilling (serialization, CSR)
– Custom serialization and compression
§ Lazy RDD Evaluation
– Automatic grouping of operations into distributed jobs, incl partitioning
– Challenges: multiple actions/repeated execution, runtime plan compilation!
§ Declarative ML
– Introduction of Spark backend did not require algorithm changes!
– Automatically exploit distributed caching and partitioning via rewrites
25% tasks

Partitioning-Preserving Operations on Spark
§ Partitioning-preserving ops
– Op is partitioning-preserving if key not changed (guaranteed)
– 1) Implicit: Use restrictive APIs (mapValues() vs mapToPair())
– 2) Explicit: Partition computation w/ declaration of partitioning-preserving (memory-efficiency via "lazy iterators")
§ Partitioning-exploiting ops
– 1) Implicit: Operations based on join, cogroup, etc
– 2) Explicit: Custom physical operators on original keys (e.g., zipmm)
[Figure: Physical Blocking and Partitioning]

Partitioning-Exploiting ZIPMM
§ Operation:
Z = t(X) %*% y
§ Approach: Naive
– Operations: Transpose, Join, Multiplication
– Shuffle: partitions not preserved after transpose, as keys changed
§ Approach: zipmm
– Operations: Join, Transpose & Multiplication
– Avoid unnecessary shuffle
[Figure: inputs X and y and output Z with block keys 1,1 1,2 1,3 (input, zipmm); t(X) with block keys 1,1 2,1 3,1 (naive)]
Example Multiclass SVM
§ Example: Multiclass SVM
– Vectors in nrow(X) neither fit into driver nor broadcast
(MapMM not applicable)
– ncol(X) ≤ Bc (zipmm applicable)
parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local) ...
  }
}
Plan annotations:
– repart, chkpt X MEM_DISK
– chkpt y_local MEM_DISK
– zipmm
– chkpt Xd, Xw MEM_DISK

Hands-On Lab: Partitioning-Preserving Operations
§ Exercise: MultiClass SVM
– W/o repartition injection
– W/ repartition injection
parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local) ...
  }
}

Update In-Place
§ Loop Update In-Place
– 1) ParFor result indexing / intermediates (w/ pinned matrix objects)
– 2) For/while/parfor loops with pure left indexing access to variable
– Both require pinning / shallow serialize to overcome buffer pool serialization
– Example Type 2:
for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;
§ Where we cannot apply Update In-Place
– Matrix object cannot fit into local memory budget (CP only)
– Interleaving operations (mix of update and reference, might be non-obvious)
– Example (update in-place would create incorrect results!):
R = X;
X[i,j] = i+j;
y = sum(R);
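Why the interleaved case is unsafe can be shown with a small Python sketch, where nested lists stand in for matrix objects and plain assignment shares the underlying buffer. The helper names are hypothetical. With copy-on-write semantics R keeps the old values after X is updated; with a blind in-place update R silently observes the new value.

```python
def left_index_copy(x, i, j, value):
    """Copy-on-write left indexing: returns a new matrix, x is untouched."""
    new_x = [row[:] for row in x]
    new_x[i][j] = value
    return new_x

def left_index_inplace(x, i, j, value):
    """In-place left indexing: mutates the shared buffer."""
    x[i][j] = value
    return x

def total(x):
    return sum(sum(row) for row in x)

# Copy-on-write: R = X; X[i,j] = 5; y = sum(R)
X = [[0, 0], [0, 0]]
R = X
X = left_index_copy(X, 0, 0, 5)
sum_cow = total(R)        # R still holds the original zeros

# Unsafe in-place update on a matrix that is still referenced via R
X = [[0, 0], [0, 0]]
R = X
X = left_index_inplace(X, 0, 0, 5)
sum_inplace = total(R)    # R now sees the update
```

The rewrite is therefore only applied when the updated variable is accessed purely through left indexing inside the loop.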

Hands-On Lab: Update In-Place
§ Exercise: Update In-Place (SystemML master/0.11 only):
– a) Update in-place application (investigate -explain and -stats)
for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;
– b) Update in-place not applicable – why?
for(i in 1:nrow(X)) {
  for(j in 1:ncol(X)) {
    print(sum(X));
    X[i,j] = i+j;
  }
}

Compressed Linear Algebra
§ Motivation / Problem
– Iterative ML algorithms w/ repeated read-only data access
– IO-bound matrix-vector multiplications ⇒ crucial to fit data in memory
– General-purpose heavy-/lightweight techniques too slow / modest comp. ratios
§ Goals
– Performance close to uncompressed
– Good compression ratios
[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]

Compressed Linear Algebra (2)
§ Approach
– Database compression
– LA over compressed rep.
– Column-compression schemes (OLE, RLE, UC)
– Cache-conscious CLA ops
– Sampling-based compression algorithm
§ Results
Algorithm  Dataset            ULA      Snappy   CLA
GLM        Mnist40m (90GB)    409s     647s     397s
GLM        Mnist240m (540GB)  74,301s  23,717s  2,787s
MLogreg    Mnist40m (90GB)    630s     875s     622s
MLogreg    Mnist240m (540GB)  83,153s  27,626s  4,379s
L2SVM      Mnist40m (90GB)    394s     461s     429s
L2SVM      Mnist240m (540GB)  14,041s  8,423s   2,593s
Up to 26x
[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]
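Linear algebra over a compressed representation can be illustrated with a deliberately simplified Python sketch. Each column is stored as a map from distinct non-zero value to the list of row offsets where it occurs, loosely in the spirit of offset-list encoding (OLE); the actual CLA formats (column groups, segments, RLE, sampling-based planning) are more involved and are not reproduced here. The matrix-vector product then needs one multiplication per distinct value instead of one per cell.

```python
def compress_column(col):
    """Map each distinct non-zero value to the rows where it occurs."""
    offsets = {}
    for r, val in enumerate(col):
        if val != 0:
            offsets.setdefault(val, []).append(r)
    return offsets

def compressed_matvec(columns, v, nrows):
    """q = X %*% v computed directly on the compressed columns."""
    q = [0.0] * nrows
    for j, offsets in enumerate(columns):
        for val, rows in offsets.items():
            scaled = val * v[j]        # one multiply per distinct value
            for r in rows:
                q[r] += scaled         # scatter-add along the offset list
    return q

X = [[7.0, 0.0],
     [7.0, 3.0],
     [0.0, 3.0],
     [7.0, 3.0]]
v = [2.0, 1.0]

cols = [compress_column([row[j] for row in X]) for j in range(2)]
q = compressed_matvec(cols, v, len(X))

# Uncompressed reference for comparison
q_ref = [sum(a * b for a, b in zip(row, v)) for row in X]
```

Columns with few distinct values compress well under such a scheme, which matches the motivation of fitting read-only data into memory for repeated matrix-vector multiplications.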

SystemML is Open Source:
• Apache Incubator Project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml
