Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems
Mario Pastorelli
Jury:
Prof. Ernst BIERSACK
Prof. Guillaume URVOY-KELLER
Prof. Giovanni CHIOLA
Dr. Patrick BROWN
Supervisor:
Prof. Pietro MICHIARDI
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
Context 1/3
In 2004, Google presented MapReduce, a system used to process large quantities of data. The key ideas are:
Client-Server architecture
Move the computation, not the data
Programming model inspired by Lisp list functions:
map : (k1, v1) → [(k2, v2)]
reduce : (k2, [v2]) → [(k3, v3)]
Hadoop, the main open-source implementation of MapReduce, was released one year later. It is widely adopted and used by many major companies (Facebook, Twitter, Yahoo, IBM, Microsoft, . . . )
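As an illustration of the programming model, here is a word count expressed with the two functions above. This is a plain-Python sketch, not Hadoop code; the function names and the toy input are our own:

```python
from itertools import groupby

# map:    (k1, v1)   -> [(k2, v2)]   here: (doc_id, text)  -> [(word, 1)]
# reduce: (k2, [v2]) -> [(k3, v3)]   here: (word, counts)  -> [(word, total)]

def map_fn(_doc_id, text):
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    return [(word, sum(counts))]

records = [(0, "the data the code"), (1, "the cluster")]
# "shuffle" step: group the intermediate pairs by key, as the framework would
pairs = sorted(kv for doc in records for kv in map_fn(*doc))
result = [out
          for key, group in groupby(pairs, key=lambda kv: kv[0])
          for out in reduce_fn(key, [v for _, v in group])]
print(result)  # [('cluster', 1), ('code', 1), ('data', 1), ('the', 3)]
```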
Context 2/3
In MapReduce, the Scheduling Policy is fundamental
Complexity of the system
Distributed resources
Multiple jobs running in parallel
Jobs are composed of two sequential phases, the map phase and the reduce phase
Each phase is composed of multiple tasks, and each task runs on a slot of a client
Heterogeneous workloads
Big differences in jobs sizes
Interactive jobs (e.g. data exploration, algorithm tuning,
orchestration jobs. . . ) must run as soon as possible. . .
. . . without impacting batch jobs too much
Context 3/3
Schedulers (strive to) optimize one or more metrics. For example:
Fairness: how a job is treated compared to the others
Mean response time: of jobs, that is the responsiveness of the system
. . .
Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness
rather than other metrics
Short response times are very important! Usually one or more system administrators make manual, ad-hoc configurations
Fine-tuning of the scheduler parameters
Configuration of pools of jobs with priorities
Complex, error-prone, and difficult to adapt to workload/cluster changes
Motivations
Size-based schedulers are more efficient than other schedulers (in
theory). . .
Job priority based on the job size
Focus resources on a few jobs instead of splitting them among many
jobs
. . . but (in practice) they are not adopted in real systems
Job size is unknown
No studies on applicability to distributed systems
MapReduce is suitable for size-based scheduling
We don’t have the job size but we have the time to estimate it
No perfect estimation is required . . .
. . . as long as very different jobs are sorted correctly
Size-Based Schedulers: Example
Job    Arrival Time    Size
job1   0s              30s
job2   10s             10s
job3   15s             10s

Scheduler            AVG sojourn time
Processor Sharing    35s
SRPT                 25s
[Gantt charts of the Processor Sharing and Shortest Remaining Processing Time (SRPT) schedules for the three jobs]
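The averages in the table can be checked with a small single-server simulation. This event loop is our own illustrative sketch, not part of the thesis:

```python
def simulate(jobs, policy):
    """Single-server event simulation; jobs is a list of (arrival, size)
    tuples, policy is 'PS' or 'SRPT'. Returns the mean sojourn time."""
    remaining = {i: size for i, (_, size) in enumerate(jobs)}
    pending = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    active, done, t = set(), {}, 0.0
    while remaining:
        while pending and jobs[pending[0]][0] <= t + 1e-9:
            active.add(pending.pop(0))
        if not active:                        # idle until the next arrival
            t = jobs[pending[0]][0]
            continue
        if policy == 'SRPT':                  # serve the job closest to done
            running = [min(active, key=remaining.get)]
        else:                                 # PS: share the server equally
            running = list(active)
        rate = 1.0 / len(running)
        dt = min(remaining[i] / rate for i in running)   # next completion ...
        if pending:                                      # ... or next arrival
            dt = min(dt, jobs[pending[0]][0] - t)
        for i in running:
            remaining[i] -= rate * dt
            if remaining[i] <= 1e-9:
                del remaining[i]
                active.discard(i)
                done[i] = t + dt - jobs[i][0]            # sojourn time
        t += dt
    return sum(done.values()) / len(done)

jobs = [(0, 30), (10, 10), (15, 10)]    # the three jobs from the table
print(simulate(jobs, 'PS'))             # 35.0
print(simulate(jobs, 'SRPT'))           # 25.0
```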
Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Estimation errors: how do you cope with an approximated size?
Scheduler for real and distributed systems: can we design a
size-based scheduler that works for existing systems?
Job preemption: preemption is fundamental for scheduling, but current systems only partially support it. Can we improve that?
The Hadoop Fair Sojourn Protocol
Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Size estimation is done in two steps by the Estimation Module
Estimated sizes are then fed to the Aging Module, which converts them into virtual sizes to avoid starvation
Jobs with the smallest virtual sizes are scheduled first
Estimation Module
Two ways to estimate a job size:
Offline: based on information available a priori (number of tasks, block size, past history, . . . ):
available from job submission, but not very precise
Online: based on the performance of a subset of t tasks:
needs training time, but more precise
We need both:
Offline estimation for the initial size, because jobs need a size from the moment they are submitted
Online estimation because it is more precise: when training completes, the job size is updated to the final estimate
Tiny Jobs: jobs with fewer than t tasks are considered tiny and get the highest possible priority
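The online step can be sketched as follows. This is our own simplification (the phase size is taken as the mean sample-task duration times the number of tasks; the function name and the example numbers are illustrative):

```python
def online_estimate(sample_durations, num_tasks):
    """Estimate a phase size from the durations of the t completed
    training tasks, assuming the remaining tasks behave similarly."""
    mean_task_time = sum(sample_durations) / len(sample_durations)
    return mean_task_time * num_tasks

# e.g. 5 training tasks of ~12 s each, for a 40-task map phase
print(online_estimate([11, 12, 13, 12, 12], 40))  # 480.0
```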
Aging Module 1/2
Aging: the longer a job stays in the queue, the higher its priority becomes
A technique used in the literature to age jobs is the Virtual Size
Each job is run in a simulation using processor sharing
The output of the simulation is the job's virtual size, i.e. the job size aged by the time the job has spent in the simulation
Jobs are sorted by remaining virtual size, and resources are assigned to the job with the smallest one
[Plots: job virtual time in the simulated Processor Sharing ("Virtual Size (Simulation)") and job size under the real scheduling ("Real Size (Real Scheduling)") for Jobs 1-3]
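For jobs all present at time 0, the simulated PS completion times (the virtual sizes) can be computed in closed form, since between consecutive completions the k unfinished jobs each progress at rate 1/k. An illustrative sketch, not the thesis implementation:

```python
def ps_virtual_sizes(sizes):
    """Completion times under processor sharing for jobs all present at
    t=0, returned in the same order as `sizes`."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    virtual, t, prev = [0.0] * len(sizes), 0.0, 0.0
    for rank, i in enumerate(order):
        # the len(sizes) - rank unfinished jobs share the server equally
        t += (sizes[i] - prev) * (len(sizes) - rank)
        virtual[i], prev = t, sizes[i]
    return virtual

print(ps_virtual_sizes([1, 2, 4]))  # [3.0, 5.0, 7.0]
```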
Aging Module 2/2
In HFSP, the estimated sizes are converted into virtual sizes by the Aging Module
The simulation runs in a virtual cluster with the same resources as the real one
Processor Sharing is simulated with Max-Min Fair Sharing
The number of tasks of a job determines how fast it can age
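Max-min fair sharing can be sketched as water-filling: satisfy the smallest demands first and split what is left equally. This is an illustrative sketch; here a job's demand is its number of pending tasks:

```python
def max_min_share(slots, demands):
    """Max-min fair allocation of `slots` among jobs, where demands[j]
    is job j's number of pending tasks (its maximum useful share)."""
    alloc, remaining = {}, float(slots)
    for j in sorted(demands, key=demands.get):    # smallest demand first
        fair_share = remaining / (len(demands) - len(alloc))
        alloc[j] = min(demands[j], fair_share)    # never more than needed
        remaining -= alloc[j]
    return alloc

# j1 gets its full 2 slots; j2 and j3 split the remaining 8 slots
print(max_min_share(10, {"j1": 2, "j2": 4, "j3": 10}))
```

Jobs with more tasks can absorb a larger share of the simulated cluster, which is why they age faster, as the slide notes.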
Task Scheduling Policy
When a job is submitted
If it is tiny, assign it a final size of 0
Else
assign it an initial size based on its number of tasks
mark the job as in the training stage and select t training tasks
When a resource becomes available
If there are jobs in the training stage, assign a task from the job with the smallest initial virtual size
Else, assign a task from the job with the smallest final virtual size
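The two rules above can be sketched as follows. The field names, the helper `initial_estimate`, and the constant `T` are our own illustrative choices:

```python
from dataclasses import dataclass

T = 5  # number of training tasks (the slide's t); value is an assumption

@dataclass
class Job:
    name: str
    num_tasks: int
    virtual_size: float = 0.0   # kept aged by the Aging Module
    training: bool = False

def initial_estimate(job, mean_task_time=1.0):
    # offline guess: proportional to the number of tasks (assumption)
    return job.num_tasks * mean_task_time

def on_submit(job):
    if job.num_tasks < T:            # tiny job: final size 0, top priority
        job.virtual_size = 0.0
    else:
        job.virtual_size = initial_estimate(job)
        job.training = True          # will run its T training tasks first

def pick_job(jobs):
    """Called when a slot frees: jobs in the training stage come first;
    within a group, the smallest virtual size wins."""
    in_training = [j for j in jobs if j.training]
    pool = in_training or jobs
    return min(pool, key=lambda j: j.virtual_size)

jobs = [Job("tiny", 3), Job("big", 200), Job("medium", 40)]
for j in jobs:
    on_submit(j)
print(pick_job(jobs).name)   # "medium": smallest job still in training
```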
Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Three kinds of workloads inspired by existing traces
Bin    Dataset Size    Avg. num. Map Tasks    DEV    TEST    PROD
1      1 GB            < 5                    65%    30%     0%
2      10 GB           10–50                  20%    40%     10%
3      100 GB          50–150                 10%    10%     60%
4      1 TB            > 150                  5%     20%     30%
Each experiment is composed of 100 jobs taken from PigMix and was executed 5 times
HFSP is compared to the Fair Scheduler
Performance Metrics
Mean Response Time
A job's response time is the time elapsed between its submission and its completion
The mean of the response times of all jobs indicates the responsiveness of the system under that scheduling policy
Fairness
A common approach is to use the job slowdown, i.e. the ratio between a job's response time and its size, to indicate how fairly the scheduler has treated that job
In the literature, a scheduler whose slowdowns are equal to or smaller than those of Processor Sharing is considered fair
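Both metrics in code, on a toy trace with made-up numbers (`size` stands for the job's isolation runtime):

```python
def response_time(job):
    """Time elapsed between submission and completion."""
    return job["finish"] - job["submit"]

def slowdown(job):
    """Response time over size: 1.0 means the job ran as if alone."""
    return response_time(job) / job["size"]

trace = [
    {"submit": 0,  "finish": 50, "size": 30},
    {"submit": 10, "finish": 20, "size": 10},
]
mean_rt = sum(response_time(j) for j in trace) / len(trace)
print(mean_rt)              # 30.0
print(slowdown(trace[1]))   # 1.0: this job was never delayed
```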
Results: Mean Response Time
[Bar chart: mean response time (s), HFSP vs Fair. DEV: 25 vs 38 (-34%); TEST: 28 vs 38 (-26%); PROD: 109 vs 163 (-33%)]
Overall, HFSP decreases the mean response time by ∼30%
Tiny jobs (bin 1) are treated the same way by the two schedulers: they run as soon as possible
Medium, large, and huge jobs (bins 2, 3, and 4) are consistently treated better by HFSP thanks to its size-based sequential nature
Results: Fairness
[ECDFs of response time / isolation runtime, HFSP vs Fair, one panel per workload (DEV, TEST, PROD)]
HFSP is globally fairer to jobs than the Fair Scheduler
The "heavier" the workload, the better HFSP treats jobs compared to the Fair Scheduler
For the PROD workload, the gap between the median slowdown under HFSP and the one under Fair is one order of magnitude
Impact of the Errors
Task Times and Estimation Errors
Task durations within a single job are stable
Even a small number of training tasks is enough to estimate the phase size
[ECDFs: task time / mean task time (map and reduce); estimation error using 5 samples (map and reduce)]
error = estimated size / real size
error > 1 ⇒ the estimated size is bigger than the real one (over-estimation)
error < 1 ⇒ the estimated size is smaller than the real one (under-estimation)
The biggest errors are over-estimations of map phases
Estimation Errors: Job Sizes and Phases
[Box plots of the estimation error per bin (bins 2-4), for the map phase and the reduce phase]
The majority of estimated sizes are close to the correct one
There is a tendency to over-estimate in all the bins
Errors are smaller on medium jobs (bin 2) than on large and huge ones (bins 3 and 4)
Errors large enough to swap the relative order of two jobs are highly unlikely
FSP with Estimation Errors
Our experiments show that, in Hadoop, estimation errors do not hurt the performance of our size-based scheduler
Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies?
Simulative approach: simulations are fast, making it possible to try different workloads, job arrival times, and errors
Our results show that size-based schedulers, like FSP and SRPT, are tolerant to errors in many cases
We created FSP+PS, which tolerates even more "extreme" conditions [MASCOTS 2014]
Task Preemption in HFSP
In theory
Preemption consists in taking resources away from a running job and granting them to another one
Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts
In practice
Preemption is difficult to implement
In Hadoop
Task preemption is supported through the kill primitive: it removes resources from a task by killing it ⇒ all of its work is lost!
The disadvantages of kill are well known, and it is usually disabled or used very carefully
HFSP is a preemptive scheduler and supports the task kill primitive
Results: Kill Preemption
[ECDFs of slowdown and sojourn time under kill vs wait]
Kill improves fairness and the response times of small and medium jobs. . .
. . . but it heavily impacts the response times of large jobs
OS-Assisted Preemption
Kill preemption is suboptimal: it preempts running tasks, but at a high cost
Can we build a mechanism closer to ideal preemption?
Idea . . .
Instead of killing a task, we can suspend it where it is running
When the task should run again, we can resume it where it was running
. . . but how can this be implemented?
Operating systems know very well how to suspend and resume processes
At low level, tasks are processes
Exploit OS capabilities to get a new preemption primitive: Task Suspension [DCPERF 2014]
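The primitive boils down to two standard POSIX signals. A minimal POSIX-only sketch, with `sleep` standing in for a task:

```python
import os
import signal
import subprocess

task = subprocess.Popen(["sleep", "30"])   # stand-in for a running task

os.kill(task.pid, signal.SIGSTOP)          # preempt: freeze the process,
                                           # keeping all of its state
# ... the freed slot can now serve a higher-priority task ...
os.kill(task.pid, signal.SIGCONT)          # resume where it stopped

task.terminate()                           # cleanup for this demo
exit_code = task.wait()                    # -15 (killed by SIGTERM)
```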
Conclusion
Size-based schedulers with estimated (imprecise) sizes can outperform non-size-based schedulers in real systems
We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop
HFSP is fair and achieves small mean response times
It can also use Hadoop's preemption mechanism to improve fairness and the response times of small jobs, but this affects the performance of large and huge jobs
Future Work
HFSP + Suspension: adding the suspension mechanism to HFSP raises many challenges, such as the eviction policy and reduce locality
Recurring Jobs: exploit past runs of recurring jobs to obtain an almost perfect estimation from the moment they are submitted
Complex Jobs: high-level languages and libraries push the scheduling problem from simple jobs to complex jobs, i.e. chains of simple jobs. Can we adapt HFSP to such jobs?
Size-Based Scheduling with Estimated Sizes
Impact of Over-estimation and Under-estimation
[Diagrams: remaining size over time, with over-estimation (jobs J1–J3) and under-estimation (jobs J4–J6)]
Over-estimating a job affects only that job; other jobs in the queue are not affected
Under-estimating a job can affect other jobs in the queue
FSP+PS
In FSP, under-estimated jobs can complete in the virtual system before they complete in the real system. We call them late jobs
When a job is late, it should not prevent other jobs from executing
FSP+PS solves the problem by scheduling late jobs using processor sharing
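The late-job rule can be sketched as follows; the dictionary field names are our own illustrative choices:

```python
def split_late(now, jobs):
    """Jobs whose simulated (virtual) completion time has already passed,
    but which are still running in the real system, are 'late'; FSP+PS
    runs them under processor sharing instead of letting them block the
    queue."""
    late = [j for j in jobs if j["virtual_finish"] <= now]
    on_time = [j for j in jobs if j["virtual_finish"] > now]
    return late, on_time

jobs = [{"name": "a", "virtual_finish": 90},
        {"name": "b", "virtual_finish": 120}]
late, on_time = split_late(100, jobs)
print([j["name"] for j in late])   # ['a']
```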
OS-Assisted Task Preemption
The kill preemption primitive has many drawbacks; can we do better?
At low level, tasks are processes, and processes can be suspended and resumed by the operating system
We exploit this mechanism by enabling task suspension and resumption
No need to change existing jobs! Done at low level, transparently to the user
Bonus: the operating system manages the memory of processes
Memory of suspended tasks can be granted by the OS to other (running) tasks . . .
. . . and because the OS knows how much memory a process needs, only the memory required is taken from the suspended task
OS-Assisted Task Preemption: Thrashing
Thrashing: when data is continuously read from and written to swap space, machine performance degrades to the point that the machine no longer works properly
Thrashing is caused by a working set that is larger than the system memory
In Hadoop this does not happen because:
The number of running tasks per machine is limited
The heap space per task is limited
OS-Assisted Task Preemption: Experiments
We test the worst case for suspension, i.e. when the jobs allocate all the memory
Two jobs, th and tl, each allocating 2 GB of memory
[Plots: sojourn time of th (s) and makespan (s) as a function of tl's progress when th is launched, comparing wait, kill, and suspend]
Our primitive outperforms kill and wait
The swapping overhead does not affect the jobs significantly
OS-Assisted Task Preemption: Conclusions
Task Suspension/Resume outperforms current preemption implementations . . .
. . . but it raises new challenges, e.g. state locality for suspended tasks
With a good scheduling policy (and eviction policy), OS-assisted preemption can replace the current preemption mechanism