Size-Based Disciplines for Job Scheduling
in Data-Intensive Scalable Computing
Systems
Mario Pastorelli
Jury:
Prof. Ernst BIERSACK
Prof. Guillaume URVOY-KELLER
Prof. Giovanni CHIOLA
Dr. Patrick BROWN
Supervisor:
Prof. Pietro MICHIARDI
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
Context 1/3
In 2004, Google presented MapReduce, a system used to process
large quantities of data. The key ideas are:
Client-Server architecture
Move the computation, not the data
Programming model inspired by Lisp list functions:
map : (k1, v1) → [(k2, v2)]
reduce : (k2, [v2]) → [(k3, v3)]
Hadoop, the main open-source implementation of MapReduce, was
released one year later. It is widely adopted and used by many
important companies (Facebook, Twitter, Yahoo, IBM, Microsoft...)
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
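To make the programming model above concrete, here is a minimal word-count sketch in Python that follows the two signatures; it is an illustrative example, not code from the thesis or from Hadoop.

from collections import defaultdict

def map_fn(k1, v1):
    # k1: document name, v1: document text; emits (word, 1) pairs
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, v2s):
    # k2: word, v2s: list of partial counts; emits (word, total)
    return [(k2, sum(v2s))]

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
intermediate = defaultdict(list)
for k1, v1 in documents.items():           # map phase
    for k2, v2 in map_fn(k1, v1):
        intermediate[k2].append(v2)        # shuffle: group values by key
results = [kv for k2, v2s in intermediate.items() for kv in reduce_fn(k2, v2s)]
print(results)                             # [('the', 2), ('quick', 1), ...]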
Context 2/3
In MapReduce, the Scheduling Policy is fundamental
Complexity of the system
Distributed resources
Multiple jobs running in parallel
Jobs are composed of two sequential phases, the map phase and the
reduce phase
Each phase is composed of multiple tasks, where each task runs on a
slot of a client
Heterogeneous workloads
Big differences in job sizes
Interactive jobs (e.g. data exploration, algorithm tuning,
orchestration jobs...) must run as soon as possible...
... without impacting batch jobs too much
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
Context 3/3
Schedulers (strive to) optimize one or more metrics. For example:
Fairness: how a job is treated compared to the others
Mean response time of jobs, i.e. the responsiveness of the system
...
Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness
rather than on other metrics
Short response times are very important! Usually one or more
system administrators make a manual, ad-hoc configuration
Fine-tuning of the scheduler parameters
Configuration of pools of jobs with priorities
Complex, error-prone and difficult to adapt to workload/cluster
changes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
Motivations
Size-based schedulers are more efficient than other schedulers (in
theory)...
Job priority based on the job size
Focus resources on a few jobs instead of splitting them among many
jobs
... but (in practice) they are not adopted in real systems
Job size is unknown
No studies on applicability to distributed systems
MapReduce is suitable for size-based scheduling
We don’t have the job size, but we have the time to estimate it
No perfect estimation is required...
... as long as very different jobs are sorted correctly
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
Size-Based Schedulers: Example
Job Arrival Time Size
job1 0s 30s
job2 10s 10s
job3 15s 10s
Scheduler AVG sojourn time
Processor Sharing 35s
SRPT 25s
[Figure: schedules of the three jobs under Processor Sharing and under Shortest Remaining Processing Time (SRPT)]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
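The averages in the table can be reproduced with a small single-server simulation. The Python sketch below is illustrative (not thesis code): it replays the three jobs above under Processor Sharing and under SRPT.

def mean_sojourn(jobs, policy="PS"):
    """jobs: list of (arrival_time, size); returns the mean sojourn time."""
    pending = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining, finish, t = {}, {}, 0.0
    while len(finish) < len(jobs):
        next_arr = jobs[pending[0]][0] if pending else float("inf")
        if not remaining:
            t = next_arr                                   # server idle: jump to next arrival
        elif policy == "SRPT":
            i = min(remaining, key=remaining.get)          # smallest remaining size runs alone
            t_new = min(t + remaining[i], next_arr)
            remaining[i] -= t_new - t
            t = t_new
            if remaining[i] <= 1e-9:
                finish[i] = t
                del remaining[i]
        else:                                              # Processor Sharing: equal shares
            rate = 1.0 / len(remaining)
            t_new = min(t + min(remaining.values()) / rate, next_arr)
            for i in list(remaining):
                remaining[i] -= (t_new - t) * rate
                if remaining[i] <= 1e-9:
                    finish[i] = t_new
                    del remaining[i]
            t = t_new
        while pending and jobs[pending[0]][0] <= t:        # admit newly arrived jobs
            i = pending.pop(0)
            remaining[i] = jobs[i][1]
    return sum(finish[i] - jobs[i][0] for i in finish) / len(jobs)

slide_jobs = [(0, 30), (10, 10), (15, 10)]                 # (arrival, size) from the table
print(mean_sojourn(slide_jobs, "PS"))                      # 35.0
print(mean_sojourn(slide_jobs, "SRPT"))                    # 25.0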
Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Estimation errors: how do you cope with an approximated size?
Scheduler for real and distributed systems: can we design a
size-based scheduler that works for existing systems?
Job preemption: preemption is fundamental for scheduling, but
current systems support it only partially. Can we improve that?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
The Hadoop Fair Sojourn Protocol
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8
Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Size estimation is done in two steps by the Estimation Module
Estimated sizes are then given as input to the Aging Module, which
converts them into virtual sizes to avoid starvation
Schedule the jobs with the smallest virtual sizes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
Estimation Module
Two ways to estimate a job size:
Offline: based on the information available a priori (num. tasks, block
size, past history...):
available since job submission but not very precise
Online: based on the performance of a subset of t tasks:
needs time for training but is more precise
We need both:
Offline estimation for the initial size, because jobs need a size from
the moment they are submitted
Online estimation because it is more precise: when it completes, the
job size is updated to the final size
Tiny Jobs: jobs with fewer than t tasks are considered tiny and have
the highest priority possible
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
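A minimal Python sketch of the two estimation steps. The names and the simple rule "number of tasks times average sample duration" are illustrative assumptions; the actual HFSP estimator is the one described in the BIGDATA 2013 paper.

TRAINING_TASKS = 5        # the "t" sample tasks used for online estimation (assumed value)

def offline_estimate(num_tasks, expected_task_time=1.0):
    """A-priori size: available at submission, not very precise."""
    return num_tasks * expected_task_time

def online_estimate(num_tasks, sample_durations):
    """Refined size, available once the t training tasks have completed."""
    return num_tasks * (sum(sample_durations) / len(sample_durations))

def phase_size(num_tasks, sample_durations=()):
    if num_tasks < TRAINING_TASKS:                        # tiny job: highest priority
        return 0.0
    if len(sample_durations) < TRAINING_TASKS:            # training not finished yet
        return offline_estimate(num_tasks)                # initial size
    return online_estimate(num_tasks, sample_durations)   # final size

print(phase_size(100))                                    # initial (offline) size
print(phase_size(100, (12.0, 9.5, 11.0, 10.5, 10.0)))     # final (online) size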
Aging Module 1/2
Aging: the more a job stays in queue, the higher its priority will be
A technique used in the literature to age jobs is the Virtual Size
Each job is run in a simulation using processor sharing
The output of the simulation is the job virtual size, that is the job size
aged by the amount of time the job has spent in the simulation
Jobs are sorted by remaining virtual size, and resources are assigned to
the job with the smallest virtual size
[Figure: job virtual time vs. time under the Virtual Size (Simulation) schedule and job size vs. time under the Real Size (Real Scheduling) schedule, for Jobs 1-3]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 11
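A minimal Python sketch of the aging idea: a simulated processor is shared equally among the pending jobs, and the job with the smallest remaining virtual size is the one that should receive the real resources. For simplicity the sketch does not redistribute leftover simulation time when a job finishes mid-interval; it is an illustration, not the HFSP implementation.

class AgingModule:
    """Tracks, per job, the remaining virtual size under a simulated processor-sharing run."""

    def __init__(self):
        self.virtual_remaining = {}        # job id -> remaining virtual size
        self.last_update = 0.0

    def add_job(self, job_id, estimated_size, now):
        self.advance(now)
        self.virtual_remaining[job_id] = estimated_size

    def advance(self, now):
        """Age all pending jobs by their equal share of the elapsed time."""
        elapsed, self.last_update = now - self.last_update, now
        if elapsed <= 0 or not self.virtual_remaining:
            return
        share = elapsed / len(self.virtual_remaining)
        for job_id in self.virtual_remaining:
            self.virtual_remaining[job_id] = max(self.virtual_remaining[job_id] - share, 0.0)

    def next_job(self, now):
        """The job with the smallest remaining virtual size gets the resources."""
        self.advance(now)
        return min(self.virtual_remaining, key=self.virtual_remaining.get, default=None)

aging = AgingModule()
aging.add_job("big", 30.0, now=0.0)
aging.add_job("small", 10.0, now=5.0)
print(aging.next_job(now=6.0))   # "small": it ages at the same rate but has a much smaller size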
Aging Module 2/2
In HFSP the estimated sizes are converted into virtual sizes by the
Aging Module
The simulation is run in a virtual cluster that has the same resources
as the real one
Simulating Processor Sharing with Max-Min Fair Sharing
The number of tasks of a job determines how fast it can age
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 12
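A small sketch of max-min fair sharing, the classic water-filling computation named on this slide: slots are split equally, and any share a job cannot use (because it has fewer pending tasks) is redistributed to the others. Names and example numbers are illustrative, not HFSP code.

def max_min_fair_share(demands, capacity):
    """demands: dict job -> pending tasks (slots wanted); returns dict job -> granted slots."""
    share = {j: 0.0 for j in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-9:
        equal = remaining / len(unsatisfied)           # tentative equal split of what is left
        for j in list(unsatisfied):
            grant = min(equal, demands[j] - share[j])  # a job never gets more than it asks for
            share[j] += grant
            remaining -= grant
            if demands[j] - share[j] <= 1e-9:
                unsatisfied.discard(j)                 # satisfied jobs free their unused share
    return share

# e.g. three jobs asking 2, 10 and 50 tasks on 40 map slots of the virtual cluster:
print(max_min_fair_share({"j1": 2, "j2": 10, "j3": 50}, 40))  # {'j1': 2.0, 'j2': 10.0, 'j3': 28.0}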
Task Scheduling Policy
When a job is submitted
If it is tiny, then assign it a final size of 0
Else
assign it an initial size based on its number of tasks
mark the job as in the training stage and select t training tasks
When a resource becomes available
If there are jobs in the training stage then assign a task from the job
with the smallest initial virtual size
Else assign a task from the job with the smallest final virtual size
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 13
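A self-contained Python sketch of the two rules above. The Job fields and the offline rule (number of tasks times an assumed average task time) are simplifications introduced for illustration; they are not the HFSP data structures.

from dataclasses import dataclass

TRAINING_TASKS = 5

@dataclass
class Job:
    job_id: str
    num_tasks: int
    pending_tasks: int = 0
    in_training: bool = False
    virtual_size: float = 0.0     # maintained (aged) by the Aging Module

def on_job_submitted(job, avg_task_time=1.0):
    """Rule 1: give the job a size as soon as it is submitted."""
    job.pending_tasks = job.num_tasks
    if job.num_tasks < TRAINING_TASKS:
        job.virtual_size = 0.0                            # tiny job: highest priority
    else:
        job.virtual_size = job.num_tasks * avg_task_time  # initial (offline) size
        job.in_training = True                            # its first t tasks are training tasks

def on_slot_available(jobs):
    """Rule 2: choose the job whose task gets the freed slot."""
    runnable = [j for j in jobs if j.pending_tasks > 0]
    training = [j for j in runnable if j.in_training]
    pool = training if training else runnable             # training jobs go first
    if not pool:
        return None
    chosen = min(pool, key=lambda j: j.virtual_size)       # smallest virtual size wins
    chosen.pending_tasks -= 1                              # launch one of its tasks
    return chosen.job_id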
Experimental Evaluation
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 14
Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Three kinds of workloads inspired by existing traces
Bin | Dataset Size | Avg. num. Map Tasks |       Workload
    |              |                     | DEV | TEST | PROD
 1  | 1 GB         | < 5                 | 65% | 30%  |  0%
 2  | 10 GB        | 10 - 50             | 20% | 40%  | 10%
 3  | 100 GB       | 50 - 150            | 10% | 10%  | 60%
 4  | 1 TB         | > 150               |  5% | 20%  | 30%
Each experiment is composed of 100 jobs taken from PigMix and
has been executed 5 times
HFSP is compared to the Fair Scheduler
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
Performance Metrics
Mean Response Time
A job’s response time is the time elapsed between its submission and
its completion
The mean of the response times of all jobs indicates the
responsiveness of the system under that scheduling policy
Fairness
A common approach is to use the job slowdown, i.e. the ratio
between a job’s response time and its size, to indicate how fair the
scheduler has been with that job
In the literature, a scheduler with the same or smaller slowdowns than
Processor Sharing is considered fair
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 16
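A small worked example of the two metrics, assuming that for each job we know its submission time, completion time and size (e.g. its runtime in isolation). The trace reuses the SRPT schedule from the earlier example; names are illustrative.

def mean_response_time(jobs):
    """jobs: list of (submission, completion, size) tuples."""
    return sum(done - sub for sub, done, _ in jobs) / len(jobs)

def slowdowns(jobs):
    """Per-job slowdown: response time divided by the job size."""
    return [(done - sub) / size for sub, done, size in jobs]

trace = [(0, 50, 30), (10, 20, 10), (15, 30, 10)]   # the SRPT schedule of the earlier example
print(mean_response_time(trace))                     # 25.0
print(slowdowns(trace))                              # [1.67, 1.0, 1.5] (rounded)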
Results: Mean Response Time
[Figure: mean response time (s), HFSP vs. Fair. DEV: 25 vs. 38 (-34%); TEST: 28 vs. 38 (-26%); PROD: 109 vs. 163 (-33%)]
Overall, HFSP decreases the mean
response time by ∼30%
Tiny jobs (bin 1) are treated in the same
way by the two schedulers: they run as
soon as possible
Medium, large and huge jobs (bins 2, 3
and 4) are consistently treated better
by HFSP thanks to its size-based
sequential nature
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 17
Results: Fairness
[Figure: ECDF of response time / isolation runtime under HFSP and Fair, for the DEV, TEST and PROD workloads]
HFSP is globally more fair to jobs than the Fair Scheduler
The “heavier” the workload is, the better HFSP treats jobs compared
to the Fair Scheduler
For the PROD workload, the gap between the median under HFSP
and the one under Fair is one order of magnitude
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 18
Impact of the errors
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 19
Task Times and Estimation Errors
Tasks of a single job are stable
Even a small number of
training tasks is enough for
estimating the phase size
[Figure: ECDF of task time / mean task time, for map and reduce tasks]
[Figure: ECDF of the estimation error using 5 samples, for the map and reduce phases]
error = estimated size / real size
error > 1 ⇒ estimated size is bigger
than the real one (over-estimation)
error < 1 ⇒ estimated size is smaller
than the real one (under-estimation)
The biggest errors are over-estimations of the map phase
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 20
Estimation Errors: Job Sizes and Phases
[Figure: per-bin distribution of the estimation error (bins 2-4), for the Map Phase and the Reduce Phase]
The majority of estimated sizes are close to the correct ones
Tendency to over-estimate in all the bins
Smaller errors on medium jobs (bin 2) compared to large and huge
ones (bins 3 and 4)
Errors large enough to switch the order of two jobs are highly unlikely
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 21
FSP with Estimation Errors
Our experiments show that, in Hadoop, the estimation errors do not
hurt the performance of our size-based scheduler
Can we abstract from Hadoop and extract a general rule on the
applicability of size-based scheduling policies?
Simulation-based approach: simulations are fast, making it possible to
try different workloads, job arrival times and errors
Our results show that size-based schedulers, like FSP and SRPT, are
tolerant to errors in many cases
We created FSP+PS, which tolerates even more “extreme” conditions
[MASCOTS 2014]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 22
Task Preemption
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 23
Task Preemption in HFSP
In theory
Preemption consists of removing resources from a running job and
granting them to another one
Without knowledge of the workload, preemptive schedulers outmatch
their non-preemptive counterparts
In practice
Preemption is difficult to implement
In Hadoop
Task preemption is supported through the kill primitive: it removes
resources from a task by killing it ⇒ all its work is lost!
The disadvantages of kill are well known, and usually it is disabled or
used very carefully
HFSP is a preemptive scheduler and supports the task kill primitive
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 24
Results: Kill Preemption
[Figure: ECDF of job slowdown (s) and of job sojourn time (s), under kill and wait]
Kill improves the fairness and response times of small and medium
jobs...
... but heavily impacts the response times of large jobs
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 25
OS-Assisted Preemption
Kill preemption is non-optimal: it preempts running tasks but has a
high cost
Can we design a mechanism that is closer to an ideal preemption?
Idea...
Instead of killing a task, we can suspend it where it is running
When the task should run again, we can resume it where it was
running
... but how can this be implemented?
Operating Systems know very well how to suspend and resume
processes
At a low level, tasks are processes
Exploit OS capabilities to get a new preemption primitive: Task
Suspension [DCPERF 2014]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 26
Conclusions
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 27
Conclusion
Size-based schedulers with estimated (imprecise) sizes can
outperform non-size-based schedulers in real systems
We showed this by designing the Hadoop Fair Sojourn Protocol, a
size-based scheduler for a real and distributed system such as
Hadoop
HFSP is fair and achieves small mean response time
It can also use Hadoop’s preemption mechanism to improve the fairness
and response times of small jobs, but this affects the performance
of large and huge jobs
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 28
Future Work
HFSP + Suspension: adding the suspension mechanism to HFSP
raises many challenges, such as the eviction policy and reduce-task
locality
Recurring Jobs: exploit the past runs of recurring jobs to obtain an
almost perfect estimation from the moment they are submitted.
Complex Jobs: high-level languages and libraries push the scheduling
problem from simple jobs to complex jobs, i.e. chains of simple
jobs. Can we adapt HFSP to such jobs?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 29
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 30
Size-Based Scheduling with Estimated
Sizes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
Impact of Over-estimation and Under-estimation
[Figure: remaining size over time for jobs J1-J3 under over-estimation and for jobs J4-J6 under under-estimation]
Over-estimating a job affects only that job. Other jobs in queue are
not affected
Under-estimating a job can affect other jobs in queue
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
FSP+PS
In FSP, under-estimated jobs can complete in the virtual system
before they complete in the real system. We call them late jobs
When a job is late, it should not prevent other jobs from running
FSP+PS solves the problem by scheduling late jobs using processor
sharing
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
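A minimal Python sketch of the FSP+PS rule, under the assumption that for each job we track its completion time in the simulated (virtual) system and its remaining work in the real one; the data layout is illustrative, not the MASCOTS 2014 implementation.

def jobs_to_serve(jobs, now):
    """jobs: list of dicts with 'virtual_completion' (simulation) and 'real_remaining'."""
    running = [j for j in jobs if j["real_remaining"] > 0]
    if not running:
        return []
    late = [j for j in running if j["virtual_completion"] <= now]
    if late:
        return late                                        # FSP+PS: share resources among all late jobs
    # plain FSP: dedicate resources to the job that finishes first in the simulation
    return [min(running, key=lambda j: j["virtual_completion"])]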
OS-Assisted Task Preemption
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
OS-Assisted Task Preemption
Kill preemption primitive has many drawbacks, can we do better?
At a low level, tasks are processes, and processes can be suspended
and resumed by the Operating System
We exploit this mechanism by enabling task suspension and
resumption
No need to change existing jobs! Done at a low level and transparent
to the user
Bonus: the operating system manages the memory of processes
Memory of suspended tasks can be granted to other (running) tasks by
the OS...
... and because the OS knows how much memory the process needs,
only the required memory will be taken from the suspended task
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
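A minimal sketch of the OS-level primitive on a Unix system, in Python, assuming we know the PID of the process running the task. It shows only the suspend/resume mechanism (SIGSTOP/SIGCONT); the scheduler integration and the eviction policy are the subject of the DCPERF 2014 paper.

import os
import signal

def suspend_task(pid: int) -> None:
    """Stop the task process; the OS keeps its state and can swap its memory out if needed."""
    os.kill(pid, signal.SIGSTOP)

def resume_task(pid: int) -> None:
    """Resume the task exactly where it was stopped: no work is lost."""
    os.kill(pid, signal.SIGCONT)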
OS-Assisted Task Preemption: Thrashing
Thrashing: when data is continuously read from and written to swap
space, machine performance is degraded to the point that
the machine doesn’t work properly anymore
Thrashing is caused by a working set (memory) that is larger than
the system memory
In Hadoop this doesn’t happen because:
Running tasks per machine are limited
Heap space per task is limited
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
OS-Assisted Task Preemption: Experiments
Test the worst case for suspension, that is when the jobs allocate all
the memory
Two jobs, th and tl, allocating 2 GB of memory
[Figure: sojourn time of th (s) and makespan (s) vs. progress of tl at the launch of th (%), for wait, kill and susp]
Our primitive outperforms kill and wait
The swapping overhead doesn’t affect the jobs too much
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
OS-Assisted Task Preemption: Conclusions
Task Suspension/Resume outperforms current preemption
implementations...
... but it raises new challenges, e.g. state locality for suspended tasks
With a good scheduling policy (and eviction policy), OS-assisted
preemption can replace the current preemption mechanism
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8

More Related Content

Similar to Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Project Analytics
Project AnalyticsProject Analytics
Project AnalyticsDanTrietsch
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analyticssunnypatil1778
 
From Lab to Factory: Creating value with data
From Lab to Factory: Creating value with dataFrom Lab to Factory: Creating value with data
From Lab to Factory: Creating value with dataPeadar Coyle
 
Weapons of Math Instruction: Evolving from Data0-Driven to Science-Driven
Weapons of Math Instruction: Evolving from Data0-Driven to Science-DrivenWeapons of Math Instruction: Evolving from Data0-Driven to Science-Driven
Weapons of Math Instruction: Evolving from Data0-Driven to Science-Drivenindeedeng
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsAkin Osman Kazakci
 
Agilelessons scanagile-final 2013
Agilelessons scanagile-final 2013Agilelessons scanagile-final 2013
Agilelessons scanagile-final 2013lokori
 
Revisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job SizesRevisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job SizesMatteo Dell'Amico
 
SE - Lecture 11 - Software Project Estimation.pptx
SE - Lecture 11 - Software Project Estimation.pptxSE - Lecture 11 - Software Project Estimation.pptx
SE - Lecture 11 - Software Project Estimation.pptxTangZhiSiang
 
From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valuePeadar Coyle
 
Symposium 2019 : Gestion de projet en Intelligence Artificielle
Symposium 2019 : Gestion de projet en Intelligence ArtificielleSymposium 2019 : Gestion de projet en Intelligence Artificielle
Symposium 2019 : Gestion de projet en Intelligence ArtificiellePMI-Montréal
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Felix Z. Hoffmann
 
Estimation and Velocity - Scrum Framework
Estimation and Velocity - Scrum FrameworkEstimation and Velocity - Scrum Framework
Estimation and Velocity - Scrum FrameworkUpekha Vandebona
 
Novel Scheduling Algorithm in DFS9(1)
Novel Scheduling Algorithm in DFS9(1)Novel Scheduling Algorithm in DFS9(1)
Novel Scheduling Algorithm in DFS9(1)kota Ankita
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
 
Machine learning at b.e.s.t. summer university
Machine learning  at b.e.s.t. summer universityMachine learning  at b.e.s.t. summer university
Machine learning at b.e.s.t. summer universityLászló Kovács
 
Effort estimation for software development
Effort estimation for software developmentEffort estimation for software development
Effort estimation for software developmentSpyros Ktenas
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEnrico Daga
 
Requirements Engineering - System Vision
Requirements Engineering - System VisionRequirements Engineering - System Vision
Requirements Engineering - System VisionBirgit Penzenstadler
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsOsman Ali
 

Similar to Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems (20)

Project Analytics
Project AnalyticsProject Analytics
Project Analytics
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
 
From Lab to Factory: Creating value with data
From Lab to Factory: Creating value with dataFrom Lab to Factory: Creating value with data
From Lab to Factory: Creating value with data
 
Weapons of Math Instruction: Evolving from Data0-Driven to Science-Driven
Weapons of Math Instruction: Evolving from Data0-Driven to Science-DrivenWeapons of Math Instruction: Evolving from Data0-Driven to Science-Driven
Weapons of Math Instruction: Evolving from Data0-Driven to Science-Driven
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analytics
 
Agilelessons scanagile-final 2013
Agilelessons scanagile-final 2013Agilelessons scanagile-final 2013
Agilelessons scanagile-final 2013
 
Revisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job SizesRevisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job Sizes
 
SE - Lecture 11 - Software Project Estimation.pptx
SE - Lecture 11 - Software Project Estimation.pptxSE - Lecture 11 - Software Project Estimation.pptx
SE - Lecture 11 - Software Project Estimation.pptx
 
From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into value
 
Symposium 2019 : Gestion de projet en Intelligence Artificielle
Symposium 2019 : Gestion de projet en Intelligence ArtificielleSymposium 2019 : Gestion de projet en Intelligence Artificielle
Symposium 2019 : Gestion de projet en Intelligence Artificielle
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?
 
Estimation and Velocity - Scrum Framework
Estimation and Velocity - Scrum FrameworkEstimation and Velocity - Scrum Framework
Estimation and Velocity - Scrum Framework
 
Novel Scheduling Algorithm in DFS9(1)
Novel Scheduling Algorithm in DFS9(1)Novel Scheduling Algorithm in DFS9(1)
Novel Scheduling Algorithm in DFS9(1)
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
Machine learning at b.e.s.t. summer university
Machine learning  at b.e.s.t. summer universityMachine learning  at b.e.s.t. summer university
Machine learning at b.e.s.t. summer university
 
Effort estimation for software development
Effort estimation for software developmentEffort estimation for software development
Effort estimation for software development
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 
Requirements Engineering - System Vision
Requirements Engineering - System VisionRequirements Engineering - System Vision
Requirements Engineering - System Vision
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 

Recently uploaded

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf203318pmpc
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projectssmsksolar
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 

Recently uploaded (20)

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 

Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

  • 1. Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems Mario Pastorelli Jury: Prof. Ernst BIERSACK Prof. Guillaume URVOY-KELLER Prof. Giovanni CHIOLA Dr. Patrick BROWN Supervisor: Prof. Pietro MICHIARDI Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
  • 2. Context 1/3 In 2004, Google presented MapReduce, a system used to process large quantity of data. The key ideas are: Client-Server architecture Move the computation, not the data Programming model inspired by Lisp lists functions: map : (k1, v1) → [(k2, v2)] reduce : (k2, [v2]) → [(k3, v3)] Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
  • 3. Context 1/3 In 2004, Google presented MapReduce, a system used to process large quantity of data. The key ideas are: Client-Server architecture Move the computation, not the data Programming model inspired by Lisp lists functions: map : (k1, v1) → [(k2, v2)] reduce : (k2, [v2]) → [(k3, v3)] Hadoop, the main open-source implementation of MapReduce, is released one year later. It is widely adopted and used by many important companies (Facebook, Twitter, Yahoo, IBM, Microsoft. . . ) Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
  • 4. Context 2/3 In MapReduce, the Scheduling Policy is fundamental Complexity of the system Distributed resources Multiple jobs running in parallel Jobs are composed by two sequential phases, the map and the reduce phase Each phase is composed by multiple tasks, where each task runs on a slot of a client Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
  • 5. Context 2/3 In MapReduce, the Scheduling Policy is fundamental Complexity of the system Distributed resources Multiple jobs running in parallel Jobs are composed by two sequential phases, the map and the reduce phase Each phase is composed by multiple tasks, where each task runs on a slot of a client Heterogeneous workloads Big differences in jobs sizes Interactive jobs (e.g. data exploration, algorithm tuning, orchestration jobs. . . ) must run as soon as possible. . . . . . without impacting batch jobs too much Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
  • 6. Context 3/3 Schedulers (strive to) optimize one or more metrics. For example: Fairness: how a job is treated compared to the others Mean response time: of jobs, that is the responsiveness of the system . . . Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
  • 7. Context 3/3 Schedulers (strive to) optimize one or more metrics. For example: Fairness: how a job is treated compared to the others Mean response time: of jobs, that is the responsiveness of the system . . . Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness rather than other metrics Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
  • 8. Context 3/3 Schedulers (strive to) optimize one or more metrics. For example: Fairness: how a job is treated compared to the others Mean response time: of jobs, that is the responsiveness of the system . . . Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness rather than other metrics Short response times are very important! Usually there is one or more system administrators making a manual ad-hoc configuration Fine-tuning of the scheduler parameters Configuration of pools of jobs with priorities Complex, error prone and difficult to adapt to workload/cluster changes Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
  • 9. Motivations Size-based schedulers are more efficient than other schedulers (in theory). . . Job priority based on the job size Focus resources on a few jobs instead of splitting them among many jobs Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 10. Motivations Size-based schedulers are more efficient than other schedulers (in theory). . . Job priority based on the job size Focus resources on a few jobs instead of splitting them among many jobs . . . but (in practice) they are not adopted in real systems Job size is unknown No studies on applicability to distributed systems Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 11. Motivations Size-based schedulers are more efficient than other schedulers (in theory). . . Job priority based on the job size Focus resources on a few jobs instead of splitting them among many jobs . . . but (in practice) they are not adopted in real systems Job size is unknown No studies on applicability to distributed systems MapReduce is suitable for size-based scheduling We don’t have the job size but we have the time to estimate it No perfect estimation is required . . . . . . as long as very different jobs are sorted correctly Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 12. Size-Based Schedulers: Example Job Arrival Time Size job1 0s 30s job2 10s 10s job3 15s 10s Processor Sharing Shortest Remaining Processing Time (SRPT) Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
  • 13. Size-Based Schedulers: Example Job Arrival Time Size job1 0s 30s job2 10s 10s job3 15s 10s Scheduler AVG sojourn time Processor Sharing 35s SRPT 25s Processor Sharing Shortest Remaining Processing Time (SRPT) Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
  • 14. Challenges Job sizes are unknown: how do you obtain an approximation of a job size while the job is running? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
  • 15. Challenges Job sizes are unknown: how do you obtain an approximation of a job size while the job is running? Estimation errors: how do you cope with an approximated size? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
  • 16. Challenges Job sizes are unknown: how do you obtain an approximation of a job size while the job is running? Estimation errors: how do you cope with an approximated size? Scheduler for real and distributed systems: can we design a size-based scheduler that works for existing systems? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
  • 17. Challenges Job sizes are unknown: how do you obtain an approximation of a job size while the job is running? Estimation errors: how do you cope with an approximated size? Scheduler for real and distributed systems: can we design a size-based scheduler that works for existing systems? Job preemption: preemption is fundamental for scheduling, but current system support it partially. Can we improve that? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
  • 18. The Hadoop Fair Sojourn Protocol Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8
  • 19. Hadoop Fair Sojourn Protocol [BIGDATA 2013] Size-based scheduler for Hadoop that is fair and achieves small response times Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
  • 20. Hadoop Fair Sojourn Protocol [BIGDATA 2013] Size-based scheduler for Hadoop that is fair and achieves small response times The map and the reduce phases are treated independently and thus a job has two sizes Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
  • 21. Hadoop Fair Sojourn Protocol [BIGDATA 2013] Size-based scheduler for Hadoop that is fair and achieves small response times The map and the reduce phases are treated independently and thus a job has two sizes Sizes estimations are done in two steps by the Estimation Module Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
  • 22. Hadoop Fair Sojourn Protocol [BIGDATA 2013] Size-based scheduler for Hadoop that is fair and achieves small response times The map and the reduce phases are treated independently and thus a job has two sizes Sizes estimations are done in two steps by the Estimation Module Estimated sizes are then given in input to the Aging Module that converts them into virtual sizes to avoid starvation Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
  • 23. Hadoop Fair Sojourn Protocol [BIGDATA 2013] Size-based scheduler for Hadoop that is fair and achieves small response times The map and the reduce phases are treated independently and thus a job has two sizes Sizes estimations are done in two steps by the Estimation Module Estimated sizes are then given in input to the Aging Module that converts them into virtual sizes to avoid starvation Schedule jobs with smallest virtual sizes Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
  • 24. Estimation Module Two ways to estimate a job size: Offline: based on the information available a priori (num tasks, block size, past history . . . ): available since job submission but not very precise Online: based on the performance of a subset of t tasks: need time for training but more precise Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
  • 25. Estimation Module Two ways to estimate a job size: Offline: based on the information available a priori (num tasks, block size, past history . . . ): available since job submission but not very precise Online: based on the performance of a subset of t tasks: need time for training but more precise We need both: Offline estimation for the initial size, because jobs need size since their submission Online estimation because it is more precise: when it is completed, the job size is updated to the final size Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
  • 26. Estimation Module Two ways to estimate a job size: Offline: based on the information available a priori (num tasks, block size, past history . . . ): available since job submission but not very precise Online: based on the performance of a subset of t tasks: need time for training but more precise We need both: Offline estimation for the initial size, because jobs need size since their submission Online estimation because it is more precise: when it is completed, the job size is updated to the final size Tiny Jobs: jobs with less than t tasks are considered tiny and have the highest priority possible Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
  • 27. Aging Module 1/2 Aging: the more a job stays in queue, the higher its priority will be Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 11
  • 28. Aging Module 1/2 Aging: the more a job stays in queue, the higher its priority will be A technique used in the literature to age jobs is the Virtual Size Each job is run in a simulation using processor sharing The output of the simulation is the job virtual size, that is the job size aged by the amount of time the job has spent in the simulation Jobs are sorted by remaining virtual size and resources are assigned to the job with smallest virtual size 0 1 2 3 4 5 6 7 8 9 10 time (s) 0.5 1 1.5 2 2.5 3 3.5 4 jobvirtualtime(s) Job 1 Job 2 Job 3 Virtual Size (Simulation) 0 1 2 3 4 5 6 7 8 9 10 time (s) 0.5 1 1.5 2 2.5 3 3.5 4 jobsize(s) Job 1 Job 2 Job 3 Real Size (Real Scheduling) Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 11
  • 29. Aging Module 2/2 In HFSP the estimated sizes are converted into virtual sizes by the Aging Module The simulation is run in a virtual cluster that has the same resources as the real one Processor Sharing is simulated with Max-Min Fair Sharing Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 12
  • 30. Aging Module 2/2 In HFSP the estimated sizes are converted into virtual sizes by the Aging Module The simulation is run in a virtual cluster that has the same resources as the real one Processor Sharing is simulated with Max-Min Fair Sharing The number of tasks of a job determines how fast it can age Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 12
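Below is a minimal sketch of one aging step, assuming discrete time steps and a virtual cluster with a fixed number of slots; the function and variable names are illustrative, not the HFSP implementation.

```python
# Sketch of virtual-time aging: a max-min fair share (processor sharing) simulation.

def max_min_share(slots, demands):
    """Split `slots` among jobs, capping each job at its demand (max-min fairness)."""
    share = {j: 0.0 for j in demands}
    active = {j for j, d in demands.items() if d > 0}
    remaining = float(slots)
    while active and remaining > 1e-9:
        equal = remaining / len(active)
        for j in list(active):
            give = min(equal, demands[j] - share[j])
            share[j] += give
            remaining -= give
            if demands[j] - share[j] < 1e-9:
                active.discard(j)
    return share

def age_one_step(virtual_sizes, num_tasks, slots, dt=1.0):
    """Advance the virtual (processor-sharing) cluster by dt seconds."""
    pending = [j for j, v in virtual_sizes.items() if v > 0]
    # A job cannot use more slots than it has tasks, so its task count bounds how fast it ages.
    demands = {j: float(num_tasks[j]) for j in pending}
    for j, s in max_min_share(slots, demands).items():
        virtual_sizes[j] = max(0.0, virtual_sizes[j] - s * dt)
    return virtual_sizes

# Example: three jobs aged on a 4-slot virtual cluster; the real scheduler always
# serves the job with the smallest remaining virtual size.
vs = {"job1": 40.0, "job2": 10.0, "job3": 80.0}
tasks = {"job1": 8, "job2": 2, "job3": 16}
vs = age_one_step(vs, tasks, slots=4)
print(min(vs, key=vs.get))   # job2
```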
  • 31. Task Scheduling Policy When a job is submitted If it is tiny, then assign it a final size of 0 Else assign it an initial size based on its number of tasks, mark the job as being in the training stage and select t training tasks Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 13
  • 32. Task Scheduling Policy When a job is submitted If it is tiny, then assign it a final size of 0 Else assign it an initial size based on its number of tasks, mark the job as being in the training stage and select t training tasks When a resource becomes available If there are jobs in the training stage, then assign a task from the job with the smallest initial virtual size Else assign a task from the job with the smallest final virtual size Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 13
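A minimal sketch of these two decision points follows; the field names, the training threshold and the per-task guess are illustrative assumptions rather than HFSP code.

```python
# Sketch of the task scheduling policy (illustrative names and constants).

T = 5                    # number of training tasks
GUESS_PER_TASK = 30.0    # assumed per-task duration (s) for the initial size

def on_job_submission(job):
    if job["num_tasks"] < T:                     # tiny job
        job["final_vsize"] = 0.0                 # highest possible priority
    else:
        job["initial_vsize"] = job["num_tasks"] * GUESS_PER_TASK
        job["in_training"] = True                # its t training tasks still have to run

def pick_job_for_free_slot(jobs):
    training = [j for j in jobs if j.get("in_training")]
    if training:   # finish estimations first, ordered by initial virtual size
        return min(training, key=lambda j: j["initial_vsize"])
    return min(jobs, key=lambda j: j.get("final_vsize", float("inf")))

# Example: a freshly submitted job competes with one whose estimation is already done.
jobs = [
    {"name": "new_job", "num_tasks": 200},
    {"name": "trained_job", "num_tasks": 50, "final_vsize": 900.0},
]
on_job_submission(jobs[0])
print(pick_job_for_free_slot(jobs)["name"])   # new_job: training tasks are scheduled first
```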
  • 33. Experimental Evaluation Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 14
  • 34. Experimental Setup 20 TaskTrackers (MapReduce clients) for a total of 40 map and 20 reduce slots Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
  • 35. Experimental Setup 20 TaskTrackers (MapReduce clients) for a total of 40 map and 20 reduce slots Three kinds of workloads inspired by existing traces
Bin   Dataset size   Avg. num. map tasks   Workload mix (DEV / TEST / PROD)
1     1 GB           < 5                   65% / 30% /  0%
2     10 GB          10–50                 20% / 40% / 10%
3     100 GB         50–150                10% / 10% / 60%
4     1 TB           > 150                  5% / 20% / 30%
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
  • 36. Experimental Setup 20 TaskTrackers (MapReduce clients) for a total of 40 map and 20 reduce slots Three kinds of workloads inspired by existing traces
Bin   Dataset size   Avg. num. map tasks   Workload mix (DEV / TEST / PROD)
1     1 GB           < 5                   65% / 30% /  0%
2     10 GB          10–50                 20% / 40% / 10%
3     100 GB         50–150                10% / 10% / 60%
4     1 TB           > 150                  5% / 20% / 30%
Each experiment is composed of 100 jobs taken from PigMix and has been executed 5 times Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
  • 37. Experimental Setup 20 TaskTrackers (MapReduce clients) for a total of 40 map and 20 reduce slots Three kinds of workloads inspired by existing traces
Bin   Dataset size   Avg. num. map tasks   Workload mix (DEV / TEST / PROD)
1     1 GB           < 5                   65% / 30% /  0%
2     10 GB          10–50                 20% / 40% / 10%
3     100 GB         50–150                10% / 10% / 60%
4     1 TB           > 150                  5% / 20% / 30%
Each experiment is composed of 100 jobs taken from PigMix and has been executed 5 times HFSP is compared to the Fair Scheduler Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
  • 38. Performance Metrics Mean Response Time A job's response time is the time elapsed between its submission and its completion The mean of the response times of all jobs indicates the responsiveness of the system under that scheduling policy Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 16
  • 39. Performance Metrics Mean Response Time A job's response time is the time elapsed between its submission and its completion The mean of the response times of all jobs indicates the responsiveness of the system under that scheduling policy Fairness A common approach is to use the job slowdown, i.e. the ratio between a job's response time and its size, to indicate how fair the scheduler has been with that job In the literature, a scheduler with slowdowns equal to or smaller than those of Processor Sharing is considered fair Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 16
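A minimal sketch of how both metrics are computed from per-job timestamps; the dictionary fields and the example values are invented for illustration.

```python
# Sketch of the two evaluation metrics.

def mean_response_time(jobs):
    """Average of (completion - submission) over all jobs."""
    return sum(j["completion"] - j["submission"] for j in jobs) / len(jobs)

def slowdowns(jobs):
    """Per-job slowdown: response time divided by the job size (its isolation runtime)."""
    return [(j["completion"] - j["submission"]) / j["size"] for j in jobs]

jobs = [
    {"submission": 0.0, "completion": 40.0, "size": 30.0},
    {"submission": 5.0, "completion": 20.0, "size": 10.0},
]
print(mean_response_time(jobs))   # 27.5
print(slowdowns(jobs))            # [1.33..., 1.5]
```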
  • 40. Results: Mean Response Time Mean response time (s), HFSP vs. Fair: DEV 25 vs. 38 (−34%), TEST 28 vs. 38 (−26%), PROD 109 vs. 163 (−33%) Overall, HFSP decreases the mean response time by ∼30% Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 17
  • 41. Results: Mean Response Time Mean response time (s), HFSP vs. Fair: DEV 25 vs. 38 (−34%), TEST 28 vs. 38 (−26%), PROD 109 vs. 163 (−33%) Overall, HFSP decreases the mean response time by ∼30% Tiny jobs (bin 1) are treated the same way by the two schedulers: they run as soon as possible Medium, large and huge jobs (bins 2, 3 and 4) are consistently treated better by HFSP thanks to its size-based, sequential nature Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 17
  • 42. Results: Fairness [Figure: ECDF of response time / isolation runtime under HFSP and the Fair Scheduler, for the DEV, TEST and PROD workloads] HFSP is globally fairer to jobs than the Fair Scheduler The “heavier” the workload, the better HFSP treats jobs compared to the Fair Scheduler For the PROD workload, the gap between the median under HFSP and the one under Fair is one order of magnitude Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 18
  • 43. Impact of the Errors Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 19
  • 44. Task Times and Estimation Errors Task durations within a single job are stable Even a small number of training tasks is enough to estimate the phase size [Figure: ECDF of task time / mean task time, for map and reduce tasks] Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 20
  • 45. Task Times and Estimation Errors Task durations within a single job are stable Even a small number of training tasks is enough to estimate the phase size [Figure: ECDF of task time / mean task time, for map and reduce tasks] [Figure: ECDF of the estimation error using 5 samples, for the map and reduce phases] error = estimated size / real size error > 1 ⇒ the estimated size is bigger than the real one (over-estimation) error < 1 ⇒ the estimated size is smaller than the real one (under-estimation) The biggest errors are over-estimations of the map phase Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 20
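As a small worked example of the error metric, with invented numbers:

```python
# Worked example of the error metric: estimated size divided by real size.
estimated_size, real_size = 1500.0, 1200.0   # seconds (invented values)
error = estimated_size / real_size
print(error)   # 1.25 > 1: the phase size was over-estimated by 25%
```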
  • 46. Estimation Errors: Job Sizes and Phases [Figure: distribution of the estimation error per bin (2–4), for the map phase and the reduce phase] The majority of the estimated sizes are close to the correct one There is a tendency to over-estimate in all the bins Errors are smaller on medium jobs (bin 2) than on large and huge ones (bins 3 and 4) Switching the order of two jobs because of these errors is highly unlikely Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 21
  • 47. FSP with Estimation Errors Our experiments show that, in Hadoop, estimation errors do not hurt the performance of our size-based scheduler Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 22
  • 48. FSP with Estimation Errors Our experiments show that, in Hadoop, estimation errors do not hurt the performance of our size-based scheduler Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies? Simulation-based approach: simulations are fast, making it possible to try different workloads, job arrival patterns and error distributions Our results show that size-based schedulers, such as FSP and SRPT, are tolerant to errors in many cases Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 22
  • 49. FSP with Estimation Errors Our experiments show that, in Hadoop, estimation errors do not hurt the performance of our size-based scheduler Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies? Simulation-based approach: simulations are fast, making it possible to try different workloads, job arrival patterns and error distributions Our results show that size-based schedulers, such as FSP and SRPT, are tolerant to errors in many cases We created FSP+PS, which tolerates even more “extreme” conditions [MASCOTS 2014] Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 22
  • 50. Task Preemption Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 23
  • 51. Task Preemption in HFSP In theory Preemption consists of removing resources from a running job and granting them to another one Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 24
  • 52. Task Preemption in HFSP In theory Preemption consists of removing resources from a running job and granting them to another one Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts In practice Preemption is difficult to implement Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 24
  • 53. Task Preemption in HFSP In theory Preemption consists of removing resources from a running job and granting them to another one Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts In practice Preemption is difficult to implement In Hadoop Task preemption is supported through the kill primitive: it removes resources from a task by killing it ⇒ all its work is lost! The disadvantages of kill are well known, and it is usually disabled or used very carefully HFSP is a preemptive scheduler and supports the kill primitive Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 24
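A toy cost model (not Hadoop code) makes the trade-off concrete: with wait, a newly arrived high-priority job pays for the work remaining in the running task; with kill, it starts immediately but all the work already done by the killed task is thrown away. The figures in the example are invented.

```python
# Toy model of the two existing options when a high-priority job arrives while a
# low-priority task holds the only free slot.

def wait_cost(remaining_low, high_size):
    """WAIT: the high-priority job waits for the running task to finish."""
    return {"high_response": remaining_low + high_size, "work_lost": 0.0}

def kill_cost(done_low, high_size):
    """KILL: the slot is freed immediately, but the killed task restarts from zero."""
    return {"high_response": high_size, "work_lost": done_low}

print(wait_cost(remaining_low=300.0, high_size=30.0))  # {'high_response': 330.0, 'work_lost': 0.0}
print(kill_cost(done_low=500.0, high_size=30.0))       # {'high_response': 30.0, 'work_lost': 500.0}
```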
  • 54. Results: Kill Preemption [Figure: ECDF of the job slowdown and of the sojourn time (s) under the kill and wait strategies] Kill improves the fairness and the response times of small and medium jobs. . . . . . but heavily impacts the response times of large jobs Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 25
  • 55. OS-Assisted Preemption Kill preemption is non-optimal: it preempts running tasks but at a high cost Can we devise a mechanism that is closer to ideal preemption? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 26
  • 56. OS-Assisted Preemption Kill preemption is non-optimal: it preempts running tasks but at a high cost Can we devise a mechanism that is closer to ideal preemption? Idea . . . Instead of killing a task, we can suspend it where it is running When the task should run again, we can resume it where it was running . . . but how can this be implemented? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 26
  • 57. OS-Assisted Preemption Kill preemption is non-optimal: it preempts running tasks but at a high cost Can we devise a mechanism that is closer to ideal preemption? Idea . . . Instead of killing a task, we can suspend it where it is running When the task should run again, we can resume it where it was running . . . but how can this be implemented? Operating Systems know very well how to suspend and resume processes At low level, tasks are processes Exploit OS capabilities to get a new preemption primitive: Task Suspension [DCPERF 2014] Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 26
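A minimal sketch of the primitive using POSIX signals: each task runs as a child process, and the scheduler suspends or resumes it through the OS. This only illustrates the mechanism; it is not the actual Hadoop patch, and the stand-in command is arbitrary.

```python
# Sketch of OS-assisted preemption: suspend/resume a task process with POSIX signals.

import os
import signal
import subprocess

def launch_task(cmd):
    """Start a task as its own process (in Hadoop, each task runs in a child JVM)."""
    return subprocess.Popen(cmd)

def suspend_task(proc):
    """Preempt without losing work: the OS keeps the process state (and may swap it out)."""
    os.kill(proc.pid, signal.SIGSTOP)

def resume_task(proc):
    """Give the slot back to the task exactly where it left off."""
    os.kill(proc.pid, signal.SIGCONT)

if __name__ == "__main__":
    task = launch_task(["sleep", "60"])   # stand-in for a long-running task
    suspend_task(task)                    # the slot is granted to a higher-priority task
    resume_task(task)                     # the higher-priority task is done, resume
    task.terminate()
```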
  • 58. Conclusions Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 27
  • 59. Conclusion Size-based schedulers using estimated (imprecise) sizes can outperform non-size-based schedulers in real systems Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 28
  • 60. Conclusion Size-based schedulers using estimated (imprecise) sizes can outperform non-size-based schedulers in real systems We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 28
  • 61. Conclusion Size-based schedulers using estimated (imprecise) sizes can outperform non-size-based schedulers in real systems We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop HFSP is fair and achieves small mean response times Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 28
  • 62. Conclusion Size-based schedulers using estimated (imprecise) sizes can outperform non-size-based schedulers in real systems We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop HFSP is fair and achieves small mean response times It can also use the Hadoop preemption mechanism to improve fairness and the response times of small jobs, but this affects the performance of large and huge jobs Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 28
  • 63. Future Work HFSP + Suspension: adding the suspension mechanism to HFSP raises many challenges, such as the eviction policy and reduce-task locality Recurring Jobs: exploit the past runs of recurring jobs to obtain an almost perfect estimate from the moment they are submitted Complex Jobs: high-level languages and libraries push the scheduling problem from simple jobs to complex jobs, i.e. chains of simple jobs. Can we adapt HFSP to such jobs? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 29
  • 64. Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 30
  • 65. Size-Based Scheduling with Estimated Sizes Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
  • 66. Impact of Over-estimation and Under-estimation [Figure: remaining size vs. time for jobs J1–J3 under over-estimation and for jobs J4–J6 under under-estimation] Over-estimating a job affects only that job: other jobs in the queue are not affected Under-estimating a job can affect other jobs in the queue Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
  • 67. FSP+PS In FSP, under-estimated jobs can complete in the virtual system before they complete in the real system. We call them late jobs When a job is late, it should not prevent other jobs from being executed FSP+PS solves the problem by scheduling late jobs using processor sharing Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
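The sketch below shows one plausible reading of this rule, in which the late jobs share the slots, processor-sharing style, together with the job plain FSP would serve next; the exact FSP+PS policy is the one defined in the MASCOTS 2014 paper, and all names and values here are assumptions.

```python
# Illustrative sketch only: late jobs (finished in the virtual system, still running
# in the real one) no longer monopolise the cluster; they share it PS-style.

def assign_slots(jobs, slots):
    pending = [j for j in jobs if j["real_remaining"] > 0]
    if not pending:
        return {}
    late = [j for j in pending if j["virtual_remaining"] <= 0]
    not_late = [j for j in pending if j["virtual_remaining"] > 0]
    if not late:   # plain FSP: all slots to the job at the head of the virtual-time order
        head = min(not_late, key=lambda j: j["virtual_remaining"])
        return {head["name"]: float(slots)}
    served = late + ([min(not_late, key=lambda j: j["virtual_remaining"])] if not_late else [])
    share = slots / len(served)   # processor sharing among the served jobs
    return {j["name"]: share for j in served}

jobs = [
    {"name": "underestimated", "virtual_remaining": 0.0,  "real_remaining": 120.0},
    {"name": "next_in_line",   "virtual_remaining": 35.0, "real_remaining": 35.0},
]
print(assign_slots(jobs, slots=20))   # {'underestimated': 10.0, 'next_in_line': 10.0}
```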
  • 68. OS-Assisted Task Preemption Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
  • 69. OS-Assisted Task Preemption The kill preemption primitive has many drawbacks; can we do better? Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 70. OS-Assisted Task Preemption The kill preemption primitive has many drawbacks; can we do better? At low level, tasks are processes, and processes can be suspended and resumed by the Operating System Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 71. OS-Assisted Task Preemption The kill preemption primitive has many drawbacks; can we do better? At low level, tasks are processes, and processes can be suspended and resumed by the Operating System We exploit this mechanism by enabling task suspension and resumption Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 72. OS-Assisted Task Preemption The kill preemption primitive has many drawbacks; can we do better? At low level, tasks are processes, and processes can be suspended and resumed by the Operating System We exploit this mechanism by enabling task suspension and resumption No need to change existing jobs! Done at low level and transparent to the user Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 73. OS-Assisted Task Preemption The kill preemption primitive has many drawbacks; can we do better? At low level, tasks are processes, and processes can be suspended and resumed by the Operating System We exploit this mechanism by enabling task suspension and resumption No need to change existing jobs! Done at low level and transparent to the user Bonus: the operating system manages the memory of processes The memory of suspended tasks can be granted to other (running) tasks by the OS. . . . . . and because the OS knows how much memory each process needs, only the memory required will be taken from the suspended task Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
  • 74. OS-Assisted Task Preemption: Thrashing Thrashing: when data is continuously read from and written to swap space, machine performance is degraded to the point that the machine no longer works properly Thrashing is caused by a working set (memory) that is larger than the system memory Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
  • 75. OS-Assisted Task Preemption: Thrashing Thrashing: when data is continuously read from and written to swap space, machine performance is degraded to the point that the machine no longer works properly Thrashing is caused by a working set (memory) that is larger than the system memory In Hadoop this does not happen because: The number of running tasks per machine is limited The heap space per task is limited Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
  • 76. OS-Assisted Task Preemption: Experiments Test the worst case for suspension, that is when the jobs allocate all the available memory Two jobs, th and tl, allocating 2 GB of memory [Figure: sojourn time of th and overall makespan as a function of the progress of tl when th is launched, for the wait, kill and suspend primitives] Our primitive outperforms kill and wait The swapping overhead does not affect the jobs too much Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
  • 77. OS-Assisted Task Preemption: Conclusions Task suspension/resume outperforms the current preemption implementations. . . Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8
  • 78. OS-Assisted Task Preemption: Conclusions Task suspension/resume outperforms the current preemption implementations. . . . . . but it raises new challenges, e.g. state locality for suspended tasks Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8
  • 79. OS-Assisted Task Preemption: Conclusions Task suspension/resume outperforms the current preemption implementations. . . . . . but it raises new challenges, e.g. state locality for suspended tasks With a good scheduling policy (and eviction policy), OS-assisted preemption can replace the current preemption mechanisms Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8