Size-Based Scheduling: From Theory To Practice, And Back

Matteo Dell’Amico
EURECOM
24 April 2014

Credits
Joint work with
Pietro Michiardi, Mario Pastorelli (EURECOM)
Antonio Barbuzzi (ex EURECOM, now @VisualDNA, UK)
Damiano Carra (University of Verona, Italy)

Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors

Big Data and MapReduce

Big Data: Definition
Data that is too big for you to handle the way you normally do

The 3 (+2) Vs
Volume, Velocity, Variety
… plus Veracity and Value
…But Still…
Why is everybody talking about Big Data now?

Big Data: Why Now?
1991: Maxtor 7040A
40 MB
600-700 KB/s
One minute to read it all
Now: Western Digital Caviar
4 TB
128 MB/s
9 hours to read
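As a rough check of the two read times, the sketch below divides capacity by sequential throughput. It assumes decimal units and takes 650 KB/s as the midpoint of the quoted 600-700 KB/s range; the function name is illustrative.

```python
def full_read_seconds(capacity_bytes, throughput_bytes_per_s):
    """Time needed to read a whole disk sequentially, in seconds."""
    return capacity_bytes / throughput_bytes_per_s

KB, MB, TB = 10**3, 10**6, 10**12  # decimal units (an assumption)

maxtor = full_read_seconds(40 * MB, 650 * KB)   # about one minute
caviar = full_read_seconds(4 * TB, 128 * MB)    # 31250 s, a little under 9 hours

print(f"Maxtor 7040A: {maxtor:.0f} s")
print(f"WD Caviar:    {caviar / 3600:.1f} h")
```

Capacity grew by a factor of 100,000 while throughput grew by roughly 200, which is why a full scan went from about a minute to hours.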

Moore and His Brothers
Moore’s Law: processing power doubles every 18 months
Kryder’s Law: storage capacity doubles every year
Nielsen’s Law: bandwidth doubles every 21 months
Storage is cheap: we never throw away anything
Processing all that data is expensive
Moving it around is even worse
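To see how far the three trends drift apart, the sketch below compounds the doubling periods quoted above over ten years. The ten-year horizon and the function name are assumptions made for illustration.

```python
def growth_factor(years, doubling_months):
    """How many times a quantity multiplies in `years` if it doubles every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)

# Doubling periods in months, as quoted on the slide.
DOUBLING_MONTHS = {"processing": 18, "storage": 12, "bandwidth": 21}

for name, months in DOUBLING_MONTHS.items():
    print(f"{name:10s} x{growth_factor(10, months):.0f} in 10 years")
```

Over a decade, storage grows about a thousandfold, processing about a hundredfold and bandwidth about fiftyfold: data accumulates faster than it can be processed or moved.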

MapReduce
Bring the computation to the data – split in blocks across a cluster
Map
One task per block
Hadoop filesystem (HDFS): 64 MB blocks by default
Stores key-value pairs locally
e.g., for word count: [(red, 15), (green, 7), ...]
Reduce
Number of tasks set by the programmer
Mapper output is partitioned by key and pulled from the mappers
The Reduce function operates on all values for a single key
e.g., (green, [7, 42, 13, ...])
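The Map, partition and Reduce steps can be sketched in plain Python for the word count example. This is an in-memory illustration, not Hadoop code: the function names, the byte-sum partitioner and the two-reducer default are all assumptions.

```python
from collections import defaultdict

def map_task(block):
    """One Map task per block: emit (word, count) pairs for that block."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return list(counts.items())

def partition(key, n_reducers):
    """Deterministically assign a key to a reducer (hypothetical partitioner)."""
    return sum(key.encode("utf-8")) % n_reducers

def reduce_task(key, values):
    """The Reduce function sees all values for a single key."""
    return key, sum(values)

def word_count(blocks, n_reducers=2):
    map_outputs = [map_task(block) for block in blocks]
    # Shuffle: each reducer pulls its partition of every mapper's output.
    reducer_inputs = [defaultdict(list) for _ in range(n_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer_inputs[partition(key, n_reducers)][key].append(value)
    result = {}
    for grouped in reducer_inputs:
        for key, values in grouped.items():
            word, total = reduce_task(key, values)
            result[word] = total
    return result

print(word_count(["red green red", "green blue"]))
```

The number of reducers changes how keys are grouped across tasks, but not the final counts.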

The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., VLDB ’13; Appuswamy et al., SOCC ’13]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters, long queues exacerbate the problem

Size-Based Scheduling for MapReduce

Shortest Remaining Processing Time
[Figure: two panels plotting cluster usage (%) against time (s) for jobs 1, 2 and 3; time-axis ticks at 10, 15, 37.5, 42.5 and 50 in one panel and at 10, 20, 30 and 50 in the other.]
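A minimal sketch of SRPT next to processor sharing (PS) on a single unit-speed server, assuming job sizes are known exactly. The three jobs are illustrative values chosen here, not data taken from the talk, and `simulate` is a hypothetical helper.

```python
EPS = 1e-9

def simulate(jobs, policy):
    """Completion time of each job on one unit-speed server.

    jobs: list of (arrival_time, size); policy: "SRPT" or "PS".
    """
    by_arrival = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining, done = {}, {}
    t, nxt = 0.0, 0
    while nxt < len(by_arrival) or remaining:
        if not remaining:                      # idle: jump to the next arrival
            t = max(t, jobs[by_arrival[nxt]][0])
        while nxt < len(by_arrival) and jobs[by_arrival[nxt]][0] <= t + EPS:
            i = by_arrival[nxt]
            remaining[i] = float(jobs[i][1])
            nxt += 1
        if policy == "SRPT":                   # whole server to the shortest job
            rates = {min(remaining, key=lambda i: (remaining[i], i)): 1.0}
        else:                                  # PS: equal shares for all jobs
            rates = {i: 1.0 / len(remaining) for i in remaining}
        dt = min(remaining[i] / r for i, r in rates.items())
        if nxt < len(by_arrival):              # stop early at the next arrival
            dt = min(dt, jobs[by_arrival[nxt]][0] - t)
        for i, r in rates.items():
            remaining[i] -= r * dt
        t += dt
        for i in [i for i in remaining if remaining[i] <= EPS]:
            done[i] = t
            del remaining[i]
    return [done[i] for i in range(len(jobs))]

jobs = [(0, 30), (10, 10), (15, 10)]           # illustrative (arrival, size) pairs
for policy in ("PS", "SRPT"):
    finish = simulate(jobs, policy)
    mean = sum(f - a for f, (a, _) in zip(finish, jobs)) / len(jobs)
    print(policy, [round(f, 2) for f in finish], "mean sojourn:", round(mean, 2))
```

With these inputs, PS completes the jobs at about 50, 37.5 and 42.5 s (mean sojourn time 35 s), while SRPT completes them at 50, 20 and 30 s (mean 25 s). The large job finishes at the same time under both policies.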

Size-Based Scheduling
Shortest Remaining Processing Time (SRPT)
Minimizes average sojourn time (the time between job submission and completion)
Fair Sojourn Protocol (FSP)
Jobs are scheduled in the order in which they would complete under Processor Sharing (PS)
Avoids starving large jobs
Fairness: each job is guaranteed to complete no later than it would under Processor Sharing
[Friedman & Henderson, SIGMETRICS ’03]
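The FSP rule can be sketched as follows: keep a simulated PS system alongside the real server, and always serve the pending job that the simulation would finish first. This is a minimal single-server sketch with exactly known sizes; the function name and the example jobs are assumptions, not the authors' implementation.

```python
EPS = 1e-9

def fsp(jobs):
    """Simulate FSP on one unit-speed server.

    jobs: list of (arrival_time, size) with exactly known sizes.
    Returns (real, virtual): per-job completion times under FSP and in
    the simulated processor-sharing (PS) system that drives it.
    """
    arrivals = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    virt, real = {}, {}        # remaining work: in the PS simulation / on the server
    vdone, rdone = {}, {}
    t, nxt = 0.0, 0
    while nxt < len(arrivals) or real or virt:
        while nxt < len(arrivals) and jobs[arrivals[nxt]][0] <= t + EPS:
            i = arrivals[nxt]
            virt[i] = real[i] = float(jobs[i][1])
            nxt += 1
        steps = []                                  # time to each candidate event
        if nxt < len(arrivals):
            steps.append(jobs[arrivals[nxt]][0] - t)
        if virt:                                    # next completion in the PS simulation
            steps.append(min(virt.values()) * len(virt))
        served = None
        if real:                                    # earliest virtual completion goes first
            served = min(real, key=lambda i: (virt.get(i, 0.0), i))
            steps.append(real[served])
        dt = min(steps)
        share = dt / len(virt) if virt else 0.0
        for i in list(virt):
            virt[i] -= share
            if virt[i] <= EPS:
                vdone[i] = t + dt
                del virt[i]
        if served is not None:
            real[served] -= dt
            if real[served] <= EPS:
                rdone[served] = t + dt
                del real[served]
        t += dt
    n = len(jobs)
    return [rdone[i] for i in range(n)], [vdone[i] for i in range(n)]

real, virtual = fsp([(0, 30), (10, 10), (15, 10)])  # illustrative jobs
print("FSP:", [round(x, 2) for x in real])
print("PS: ", [round(x, 2) for x in virtual])
```

For these illustrative jobs, FSP completes them at 50, 20 and 30 s, against 50, 37.5 and 42.5 s in the simulated PS system: no job finishes later than it would under PS.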
Unknown Job Size
…and what if we can only estimate job size?

Multi-Processor Size-Based Scheduling
[Figure: two panels plotting cluster usage (%) against time (s) for jobs 1, 2 and 3; time-axis ticks at 10, 13, 23.5, 24.5 and 39 in one panel and at 10, 13, 20, 23 and 39 in the other.]

HFSP In A Nutshell
Job Size Estimation
Naive estimate at first
After the first s “training” tasks have run, we update the estimate
s = 5 by default
On t task slots, we give priority to training tasks
Limiting this to t slots avoids starving “old” jobs
“Shortcut” for very small jobs
Scheduling Policy
We treat Map and Reduce phases as separate jobs
Virtual time: per-job simulated completion time
When a task slot frees up, we schedule a task from the job that completes earliest in virtual time

Job Size Estimation
Initial Estimation
k · l
k: number of tasks
l: average size of past Map/Reduce tasks
Second Estimation
After the s sample tasks have run, compute l′ as the average size of the sample tasks
Timeout (60 s by default): if the sample tasks have not completed by then, use their progress percentage
Predicted job size: k · l′
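The two estimates can be sketched as below. Extrapolating an unfinished sample task as elapsed time divided by progress is one plausible reading of "use progress %", not something the slides spell out. All function names are hypothetical.

```python
from statistics import mean

def initial_estimate(num_tasks, past_task_sizes):
    """k * l: number of tasks times the average size of past tasks."""
    return num_tasks * mean(past_task_sizes)

def sample_task_size(elapsed_s, progress):
    """Size of one sample task in seconds.

    progress is a fraction in (0, 1]: 1.0 for a finished task, less for a
    task still running at the timeout (assumed linear extrapolation).
    """
    return elapsed_s / progress

def second_estimate(num_tasks, samples):
    """k * l': samples is a list of (elapsed_s, progress) for the s training tasks."""
    return num_tasks * mean(sample_task_size(e, p) for e, p in samples)

print(initial_estimate(100, [10, 20]))                 # 100 tasks, past average 15 s
print(second_estimate(100, [(30, 1.0), (60, 0.5)]))    # one finished, one half done
```

In the second call, the task that is half done after 60 s is extrapolated to 120 s, so l′ is 75 s and the predicted job size is 7500 s of serialized work.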

Virtual Time
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion time, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
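A simplified sketch of such a virtual-time computation follows. It assumes that all jobs are already present, that a job can never use more slots than it has tasks, and that slots are divided max-min fairly; none of these details is confirmed by the slides, and both function names are hypothetical.

```python
def fair_shares(demands, slots):
    """Max-min fair split of `slots` among jobs, capped by each job's demand."""
    shares = {}
    left = float(slots)
    for job in sorted(demands, key=demands.get):       # smallest demand first
        share = min(demands[job], left / (len(demands) - len(shares)))
        shares[job] = share
        left -= share
    return shares

def virtual_completion_times(jobs, slots):
    """jobs: {name: (serialized_size_in_slot_seconds, num_tasks)}, num_tasks >= 1.

    Returns the time at which each job completes in the simulated
    processor-sharing cluster.
    """
    remaining = {name: float(size) for name, (size, _) in jobs.items()}
    tasks = {name: n for name, (_, n) in jobs.items()}
    t, finish = 0.0, {}
    while remaining:
        shares = fair_shares({name: tasks[name] for name in remaining}, slots)
        dt = min(remaining[name] / shares[name] for name in remaining)
        for name in list(remaining):
            remaining[name] -= shares[name] * dt
            if remaining[name] <= 1e-9:
                finish[name] = t + dt
                del remaining[name]
        t += dt
    return finish

times = virtual_completion_times({"A": (40, 2), "B": (120, 10)}, slots=4)
print(times)
print("schedule first:", min(times, key=times.get))
```

Here job A holds 2 slots and finishes at t = 20; job B then takes all 4 slots and finishes at t = 40, so a freed slot would go to A first. A real scheduler would rerun the computation whenever a job arrives or a task completes.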

Experimental Setup
Platform
36 machines with 4 CPUs, 16 GB RAM
Workloads
Generated with the PigMix benchmark: realistic operations on synthetic data
Data sizes inspired by known measurements [Chen et al., VLDB ’12; Ren et al., VLDB ’13]
Configuration
We compare to Hadoop’s FAIR scheduler
similar to processor-sharing
Delay scheduling enabled both for FAIR and HFSP

Sojourn Time
[Figure: ECDF of sojourn time (s) for HFSP and FAIR, in two panels: “small” workload (~16% better) and “large” workload (~75% better).]
Sojourn time: the time between a job’s submission and its termination
With higher load, the scheduler becomes decisive
Analogous results on a different platform and a different workload

Job Size Estimation
[Figure: ECDF of the estimation error, from 0.25 to 4, for MAP and REDUCE.]
Error: real size / estimated size
Fits a log-normal distribution
The estimation isn’t even that good! Why does HFSP work that well?

Size-Based Scheduling With Errors

Scheduling Simulation
How does size-based scheduling behave in the presence of errors?
Lu et al. (MASCOTS 2004) suggest much worse results
We wrote a simulator to understand this better, with Hadoop-like workloads [Chen et al., VLDB ’12]
Written in Python: efficient, and easy to prototype new schedulers with

Log-Normal Error Distribution
[Figure: PDF of the log-normal error distribution for sigma = 0.125, 0.25, 1 and 4; x from 0 to 3.]
Error: real size / estimated size
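A sketch of how size estimates with log-normal error could be generated for a simulation. Centering the error at mu = 0, so that the median error is 1, is an assumption, and `estimated_size` is a hypothetical helper rather than the simulator's API.

```python
import random

def estimated_size(real_size, sigma, rng=random):
    """Draw an estimate such that error = real size / estimated size is log-normal.

    mu = 0 is an assumption: over- and under-estimation by the same factor
    are then equally likely. sigma must be positive.
    """
    error = rng.lognormvariate(0.0, sigma)
    return real_size / error

rng = random.Random(42)
for sigma in (0.125, 1.0, 4.0):
    estimates = [estimated_size(1000.0, sigma, rng) for _ in range(5)]
    print(sigma, [round(e, 1) for e in estimates])
```

Small sigma keeps estimates close to the real size; at sigma = 4 they can be off by several orders of magnitude in either direction.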

Weibull Job Size Distribution
[Figure: PDF of the Weibull job size distribution for shape = 0.125, 1, 2 and 4; x from 0 to 3.]
Interpolates between
heavy-tailed job size distributions (shape<1)
exponential distributions (shape=1)
bell-shaped distributions (shape>1)
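The effect of the shape parameter can be illustrated by sampling job sizes and measuring how much of the total work sits in the largest jobs. Normalizing the mean size to 1 and the helper names are choices made here for illustration.

```python
import math
import random

def weibull_sizes(n, shape, mean=1.0, rng=random):
    """Sample n job sizes from a Weibull distribution with the given shape and mean."""
    scale = mean / math.gamma(1.0 + 1.0 / shape)       # Weibull mean = scale * Gamma(1 + 1/shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]

def top_share(sizes, fraction=0.01):
    """Fraction of the total work contributed by the largest `fraction` of jobs."""
    k = max(1, int(len(sizes) * fraction))
    return sum(sorted(sizes, reverse=True)[:k]) / sum(sizes)

rng = random.Random(1)
for shape in (0.25, 1.0, 2.0):
    sizes = weibull_sizes(20000, shape, rng=rng)
    print(f"shape={shape}: largest 1% of jobs hold {top_share(sizes):.1%} of the work")
```

For shape below 1, a handful of very large jobs dominates the workload, which is the regime where estimation errors matter most.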

Size-Based Scheduling With Errors
[Figure: two heat maps, for SRPT and FSP, showing MST/MST(PS) on a scale from 0.25 to 128 as a function of shape and sigma, each from 0.125 to 4.]
Problems for heavy-tailed job size distributions
Otherwise, size-based scheduling works very well

Over-Estimations and Under-Estimations
[Figure: over-estimation and under-estimation panels plotting remaining size over time t for jobs J1 to J6.]
Under-estimations can wreak havoc with heavy-tailed workloads

FSP + PS
Idea
Without errors, real jobs always complete before virtual ones
When they don’t (they are late), there has been an estimation error
The scheduler can detect this and take corrective action
Realization
To keep late jobs from blocking the system, share the processor among them (PS) instead of scheduling only the “most late” one
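The corrective rule can be sketched as a function that decides how the server is split at a given instant. A job is treated as late when it has already completed in the virtual system but is still pending on the real one. The data layout and the function name are assumptions.

```python
def service_rates(real_pending, virtual_remaining):
    """Fraction of the server given to each pending job under FSP + PS.

    real_pending: set of jobs not yet completed on the real system.
    virtual_remaining: {job: remaining virtual work} for jobs still present
    in the simulated PS system; jobs missing from it have completed virtually.
    """
    late = sorted(job for job in real_pending if job not in virtual_remaining)
    if late:
        # Estimation error detected: late jobs share the server equally.
        return {job: 1.0 / len(late) for job in late}
    if not real_pending:
        return {}
    # No late jobs: plain FSP, serve the job that completes first virtually.
    first = min(real_pending, key=lambda job: (virtual_remaining[job], job))
    return {first: 1.0}

print(service_rates({"a", "b", "c"}, {"c": 3.0}))       # a and b are late
print(service_rates({"a", "b"}, {"a": 2.0, "b": 1.0}))  # nobody is late
```

With processor sharing among late jobs, a severely under-estimated large job can no longer monopolize the server while smaller late jobs wait behind it.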

FSP + PS: Results
[Figure: two heat maps, for FSP and FSP + PS, showing MST/MST(PS) on a scale from 0.25 to 128 as a function of shape and sigma, each from 0.125 to 4.]

Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors
Therefore, simple rough estimations are sufficient
HFSP is available as free software at http://github.com/bigfootproject/hfsp
Scheduling simulator at https://bitbucket.org/bigfootproject/schedsim
HFSP: published at IEEE BIGDATA 2013
Scheduling simulator and FSP+PS: under submission, available at http://arxiv.org/abs/1403.5996

Bonus Content

Schedulers vs. SRPT
[Figure: MST/MST(SRPT) as a function of shape, from 0.125 to 4, for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO.]

Real Workloads: Facebook
[Figure: MST/MST(SRPT) as a function of sigma, from 0.125 to 4, for SRPTE, FSPE, FSPE+PS, PS and LAS, in two panels: synthetic workload (shape=0.25) and Facebook Hadoop cluster.]

Real Workloads: Web Cache
[Figure: MST/MST(SRPT) as a function of sigma, from 0.125 to 4, for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO, in two panels: synthetic workload (shape=0.177) and IRCache web cache.]

Job Preemption
Supported in Hadoop
Kill running tasks: wastes work
Wait for them to finish: may take long
Our Choice
Map tasks: Wait (they are generally small)
Reduce tasks: we implemented Suspend and Resume, which avoids the drawbacks of both Wait and Kill

Job Preemption: Suspend and Resume
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming
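The mechanism can be tried out on any POSIX system with a small Python sketch: a child process stands in for a Reduce task and is stopped and continued with the two signals. This only illustrates the OS facility, not the Hadoop implementation, and the helper names are hypothetical.

```python
import os
import signal
import subprocess
import sys

def suspend(pid):
    """Stop a process; it keeps its memory, which the OS may swap out if needed."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid):
    """Let a stopped process continue from where it left off."""
    os.kill(pid, signal.SIGCONT)

def suspend_resume_demo():
    """Suspend and resume a sleeping child; return True if it was seen stopped."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        suspend(child.pid)
        # WUNTRACED makes waitpid report stopped children, not only exited ones.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        resume(child.pid)
        return stopped
    finally:
        child.kill()
        child.wait()

if __name__ == "__main__":
    print("child was stopped:", suspend_resume_demo())
```

Because the task is only stopped, none of its work is lost, and it consumes no CPU while suspended.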
Configurable maximum number of suspended tasks
if reached, switch to Wait
hard limit on memory allocated to suspended tasks
Among preemptable running tasks, suspend the youngest
it is likely to finish later
it may have a smaller memory footprint
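The selection rule, combined with the cap on suspended tasks, can be sketched as a small decision function. The task representation and all names are assumptions made for illustration.

```python
def choose_preemption(running, suspended_count, max_suspended):
    """Decide how to free a slot.

    running: list of (task_id, start_time) for preemptable running tasks.
    Returns ("suspend", task_id) for the youngest task, or ("wait", None)
    when nothing can be suspended or the cap on suspended tasks is reached.
    """
    if not running or suspended_count >= max_suspended:
        return ("wait", None)
    task_id, _ = max(running, key=lambda task: task[1])   # latest start = youngest
    return ("suspend", task_id)

tasks = [("r1", 100.0), ("r2", 250.0), ("r3", 180.0)]
print(choose_preemption(tasks, suspended_count=0, max_suspended=2))
print(choose_preemption(tasks, suspended_count=2, max_suspended=2))
```

The first call suspends r2, the most recently started task; the second falls back to Wait because the cap has been reached.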
