"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
1. HFSP: Size-based Scheduling for Hadoop
Mario Pastorelli∗ Antonio Barbuzzi∗ Matteo Dell’Amico∗
Damiano Carra† Pietro Michiardi∗
∗EURECOM, France
†University of Verona, Italy
IEEE BigData 2013
Mario Pastorelli et al. (EURECOM) HFSP: Size-based Scheduling for Hadoop IEEE BigData 2013 1 / 15
3. Why a new scheduler?
Focus on short system response times
heterogeneous workloads [VLDB12,VLDB13,SOCC13]
big differences in job sizes
data exploration, preliminary analyses, algorithm tuning, orchestration jobs, ...
Current schedulers need manual setup
fine-tuning of the scheduler parameters
configuration of pools of jobs
complex, error-prone, and difficult to adapt to workload or cluster changes
5. Size-based schedulers
Size-based schedulers are more efficient than other schedulers
job priority based on the job size
focus resources on a few jobs instead of splitting them among many jobs
... but the job size must be known
MapReduce is suitable for size-based scheduling
we don’t know the job size in advance, but we have time to estimate it
a perfect estimate is not required ...
... as long as jobs of very different sizes are sorted correctly
7. Size-based schedulers: example
Job     Arrival time   Size
job1    0s             30s
job2    10s            10s
job3    15s            10s

Scheduler          Avg. sojourn time
Processor Share    35s
SRPT               25s

[Figure: execution timelines of the three jobs under Processor Share and under SRPT]
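The average sojourn times in the table can be checked with a small simulation. The sketch below is not part of the original slides: it assumes a single server that processes one unit of work per second, and the names `simulate`, `mean_sojourn` and `JOBS` are made up for illustration.

```python
# Jobs from the example table: (name, arrival time in s, size in s).
JOBS = [("job1", 0, 30), ("job2", 10, 10), ("job3", 15, 10)]


def simulate(jobs, policy):
    """Simulate one unit-rate server under "PS" or "SRPT".

    Returns a dict mapping each job name to its sojourn time
    (completion time minus arrival time).
    """
    pending = sorted(jobs, key=lambda j: j[1])  # jobs that have not arrived yet
    remaining = {}  # name -> remaining work
    arrival = {}    # name -> arrival time
    sojourn = {}
    now = 0.0
    while pending or remaining:
        if not remaining:
            # Server is idle: jump to the next arrival.
            now = max(now, pending[0][1])
        while pending and pending[0][1] <= now:
            name, arr, size = pending.pop(0)
            remaining[name] = float(size)
            arrival[name] = arr
        if policy == "PS":
            # Processor Share: every running job gets an equal share.
            rates = {n: 1.0 / len(remaining) for n in remaining}
        else:
            # SRPT: the job with the least remaining work gets everything.
            shortest = min(remaining, key=remaining.get)
            rates = {n: 1.0 if n == shortest else 0.0 for n in remaining}
        # Advance to the next completion or the next arrival, whichever is first.
        dt = min(remaining[n] / r for n, r in rates.items() if r > 0)
        if pending:
            dt = min(dt, pending[0][1] - now)
        for n, r in rates.items():
            remaining[n] -= r * dt
        now += dt
        for n in [n for n, w in remaining.items() if w <= 1e-9]:
            sojourn[n] = now - arrival[n]
            del remaining[n]
    return sojourn


def mean_sojourn(jobs, policy):
    times = simulate(jobs, policy)
    return sum(times.values()) / len(times)
```

Under these assumptions, `mean_sojourn(JOBS, "PS")` comes out at about 35 s and `mean_sojourn(JOBS, "SRPT")` at about 25 s, matching the table. In both cases job1 finishes at 50 s; the gain comes entirely from job2 and job3 finishing earlier under SRPT.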
8. Hadoop Fair Sojourn Protocol
Like SRPT, HFSP aims to be efficient, but unlike SRPT it avoids starvation
How: Shortest Remaining Virtual Time first (SRVT)
Each job has a virtual size based on the real one
Virtual size decreases with time
Jobs are scheduled by ascending virtual size
9. Hadoop Fair Sojourn Protocol: challenges
Job size estimation
Virtual size and aging
Task scheduling policy
11. Job size estimation (1/2)
Two ways to estimate a job size:
Offline: based on the information available a priori (number of tasks, block size, past history, ...):
available since job submission
not very precise
Online: based on the performance of a subset of tasks:
needs time for training
more precise
We need both:
Offline estimation for the initial size, because jobs need a size from the moment they are submitted
Online estimation because it is more precise: when it completes, the job size is updated
12. Job size estimation (2/2)
Implementation details:
Online estimation is done while the job progresses, so no work is wasted
Estimation technique: first-order statistics are good enough
The Map and Reduce phases of a job are treated as independent
Further details in the paper . . .
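The two-stage estimation described above can be sketched as follows. This is an illustrative assumption, not the estimator from the paper: the offline guess is taken to be the number of tasks times a default task duration, and the online estimate uses the sample mean (a first-order statistic) of the first few completed tasks. `PhaseEstimator`, `default_task_time` and `sample_size` are hypothetical names.

```python
from statistics import mean


class PhaseEstimator:
    """Size estimate for one phase (Map or Reduce) of a job.

    Each phase gets its own estimator, since the two phases are
    treated as independent.
    """

    def __init__(self, num_tasks, default_task_time, sample_size):
        self.num_tasks = num_tasks
        self.sample_size = sample_size
        self.samples = []
        # Offline estimate: available as soon as the job is submitted.
        self.size = num_tasks * default_task_time

    def record(self, task_duration):
        """Record the duration of a completed sample task.

        The sample tasks are real tasks of the job, so running them
        wastes no work. Once enough samples are in, the offline guess
        is replaced by the online estimate.
        """
        if len(self.samples) < self.sample_size:
            self.samples.append(task_duration)
            if len(self.samples) == self.sample_size:
                self.size = self.num_tasks * mean(self.samples)
```

For example, a phase with 100 tasks and a default of 10 s per task starts at an estimated 1000 s; after sample durations of 4 s, 6 s and 8 s are recorded with `sample_size=3`, the estimate is updated to 600 s.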
13. Virtual size and aging
Like SRPT, HFSP aims to be efficient, but unlike SRPT it avoids starvation
How:
Each job has a “virtual” size
A “virtual” Fair Scheduler lets each job make virtual progress
We use virtual job sizes to make scheduling decisions in the real cluster
→ Priority to small jobs
→ Every job eventually gets small, hence no starvation
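A minimal sketch of the aging idea, under simplifying assumptions that are not taken from the slides: the virtual cluster has a fixed capacity that is split equally among all jobs with virtual work left, and the function names `age` and `priority_order` are hypothetical.

```python
EPS = 1e-9


def age(virtual, elapsed, capacity=1.0):
    """Advance the virtual fair scheduler by `elapsed` seconds, in place.

    `virtual` maps job name -> remaining virtual size. Every job with
    virtual work left receives an equal share of `capacity`.
    """
    while elapsed > EPS:
        active = [j for j, v in virtual.items() if v > EPS]
        if not active:
            break
        share = capacity / len(active)
        # Step to the next virtual completion or to the end of the interval.
        step = min(elapsed, min(virtual[j] for j in active) / share)
        for j in active:
            left = virtual[j] - share * step
            virtual[j] = left if left > EPS else 0.0
        elapsed -= step
    return virtual


def priority_order(virtual):
    """Jobs sorted by ascending remaining virtual size (highest priority first)."""
    return sorted(virtual, key=virtual.get)
```

With `virtual = {"big": 30.0, "small": 10.0}`, calling `age(virtual, 10.0)` leaves 25 and 5, so the small job keeps priority. After enough additional time both virtual sizes reach zero, and a later arrival with virtual size 5 is ordered behind the big job. That is how aging prevents a stream of small jobs from starving a large one.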
14. Task scheduling policy
When a task slot becomes free:
Schedule a task for online estimation, if any
otherwise, schedule a task from the highest priority job
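The two-step rule above could look like the sketch below. The `Job` class and `next_task` function are hypothetical, and two details are assumptions not stated on the slide: "highest priority" is read as "smallest virtual size", and ties between jobs that still need sample tasks are also broken by virtual size.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    virtual_size: float
    # Tasks still needed to complete the online size estimation.
    training_tasks: list = field(default_factory=list)
    # All other runnable tasks of the job.
    regular_tasks: list = field(default_factory=list)


def next_task(jobs):
    """Pick the task to run on a slot that has just become free.

    Returns (job name, task), or None when nothing is runnable.
    """
    by_priority = sorted(jobs, key=lambda j: j.virtual_size)
    # Step 1: tasks used for online estimation go first.
    for job in by_priority:
        if job.training_tasks:
            return job.name, job.training_tasks.pop(0)
    # Step 2: otherwise serve the highest-priority job with runnable tasks.
    for job in by_priority:
        if job.regular_tasks:
            return job.name, job.regular_tasks.pop(0)
    return None
```

Running sample tasks first means a newly submitted job gets a precise size quickly, even when its offline estimate would have placed it far back in the queue.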
16. Experimental Setup
Task Trackers: 36
CPUs per Task Tracker: 4
RAM per Task Tracker: 8 GB
Map slots: 72
Reduce slots: 36
Network speed: 1 Gbps
Using PigMix jobs
Two kinds of workloads
inspired by existing traces
Dataset size   Map tasks   Workload SMALL   Workload LARGE
1 GB           < 5         65%              0%
10 GB          10 - 50     20%              10%
40 GB          50 - 150    10%              60%
100 GB         > 150       5%               30%
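As an illustration of how such a workload mix can be turned into a job sequence, the sketch below draws dataset sizes according to the percentages in the table. It is not the generator used in the experiments; `WORKLOADS`, `sample_jobs` and the fixed seed are assumptions made for the example.

```python
import random

# Share of jobs (in percent) per dataset size, copied from the workload table.
WORKLOADS = {
    "SMALL": {"1 GB": 65, "10 GB": 20, "40 GB": 10, "100 GB": 5},
    "LARGE": {"1 GB": 0, "10 GB": 10, "40 GB": 60, "100 GB": 30},
}


def sample_jobs(workload, count, seed=0):
    """Draw `count` dataset sizes following the given workload mix."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=count)
```

For instance, `sample_jobs("LARGE", 100)` never contains a 1 GB job, since that class has weight 0 in the LARGE workload.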
18. Results
SMALL
[Figure: ECDF of sojourn time (s, log scale from 10^1 to 10^3) for the SMALL workload, HFSP vs. FAIR]
Same performance for tiny jobs
Large difference for other jobs
Mean sojourn time decreased by 16% using HFSP
LARGE
[Figure: ECDF of sojourn time (s, log scale from 10^1 to 10^4) for the LARGE workload, HFSP vs. FAIR]
Jobs completed within 100 seconds: FAIR 2%, HFSP 30%
Jobs completed within 1000 seconds: FAIR 15%, HFSP 90%
20. Experiments: task times and estimation errors
Task times are skewed
10% of the reduce tasks take much longer than the other tasks
[Figure: ECDF of task time (log scale from 10^0 to 10^4) for MAP and REDUCE tasks]
[Figure: ECDF of the estimation error (x axis from 0.25 to 4) for the MAP and REDUCE phases]
error = estimated size / real size
About 60% of jobs are overestimated
the impact of overestimation is mitigated by the aging function
21. Conclusions
HFSP strives for efficiency and avoids starvation
Particularly suitable for loaded clusters
Requires no manual, per-job priorities
→ heterogeneous workloads can coexist in the same cluster
HFSP was developed within the BigFoot project
Available at: https://github.com/bigfootproject/HFSP