Size-Based Scheduling:
From Theory To Practice, And Back
Matteo Dell’Amico
EURECOM
24 April 2014
Credits
Joint work with
Pietro Michiardi, Mario Pastorelli (EURECOM)
Antonio Barbuzzi (ex EURECOM, now @VisualDNA, UK)
Damiano Carra (University of Verona, Italy)
Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
Big Data and MapReduce
Big Data: Definition
Data that is too big for you to handle the way you normally do

The 3 (+2) Vs
Volume, Velocity, Variety
… plus Veracity and Value

…But Still…
Why is everybody talking about Big Data now?
Big Data: Why Now?
1991: Maxtor 7040A
40 MB
600-700 KB/s
One minute to read it all

Now: Western Digital Caviar
4 TB
128 MB/s
9 hours to read
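The read times above follow from dividing capacity by sequential throughput. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes) and 650 KB/s as the midpoint of the 600-700 KB/s range; the helper name `full_read_seconds` is hypothetical:

```python
def full_read_seconds(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to read a whole disk sequentially at a constant throughput."""
    return capacity_bytes / throughput_bytes_per_s

KB, MB, TB = 1e3, 1e6, 1e12  # decimal units (an assumption)

# 650 KB/s is an assumed midpoint of the 600-700 KB/s quoted on the slide.
maxtor_s = full_read_seconds(40 * MB, 650 * KB)
caviar_s = full_read_seconds(4 * TB, 128 * MB)

print(f"1991: {maxtor_s:.0f} s, about {maxtor_s / 60:.1f} minutes")
print(f"Now:  {caviar_s:.0f} s, about {caviar_s / 3600:.1f} hours")
```

Under these assumptions capacity grew by a factor of 100,000 while throughput grew by roughly 200, which is why a full scan went from about a minute to several hours.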
Moore and His Brothers
Moore’s Law: processing power doubles every 18 months
Kryder’s Law: storage capacity doubles every year
Nielsen’s Law: bandwidth doubles every 21 months

Storage is cheap: we never throw away anything
Processing all that data is expensive
Moving it around is even worse
MapReduce
Bring the computation to the data – split in blocks across a cluster

Map
One task per block
Hadoop filesystem (HDFS) block size: 64 MB by default
Stores key-value pairs locally
e.g., for word count: [(red, 15), (green, 7), …]

Reduce
# of tasks set by the programmer
Mapper output is partitioned by key and pulled from “mappers”
The Reduce function operates on all values for a single key
e.g., (green, [7, 42, 13, . . .])
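The Map and Reduce steps above can be sketched in a single process for the word count example. This is only an illustration of the data flow, not Hadoop's API: the function names (`map_block`, `partition`, `reduce_key`, `word_count`) and the hash-based partitioning are assumptions.

```python
from collections import Counter, defaultdict

def map_block(block: str) -> list[tuple[str, int]]:
    """Map: one task per block, emitting locally aggregated (word, count) pairs."""
    return list(Counter(block.split()).items())

def partition(pairs, n_reducers: int):
    """Partition mapper output by key, grouping all values of a key together."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[hash(key) % n_reducers][key].append(value)
    return buckets

def reduce_key(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: operates on all values for a single key."""
    return key, sum(values)

def word_count(blocks, n_reducers: int = 2) -> dict[str, int]:
    # Each block stands in for one HDFS block processed by one Map task.
    mapped = [pair for block in blocks for pair in map_block(block)]
    result = {}
    for bucket in partition(mapped, n_reducers):  # one bucket per Reduce task
        for key, values in bucket.items():
            word, total = reduce_key(key, values)
            result[word] = total
    return result

blocks = ["red green red", "green red blue"]
print(word_count(blocks))
```

The number of reducers is a parameter here, mirroring the fact that the number of Reduce tasks is set by the programmer.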
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., VLDB ’13; Appuswamy et al., SOCC ’13]

Consequence
Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem
Size-Based Scheduling for MapReduce
Shortest Remaining Processing Time
[Figure: cluster usage (%) over time (s) for jobs 1, 2 and 3 under two schedules, with time-axis marks at 10, 15, 37.5, 42.5 and 50 s in one panel and at 10, 20, 30 and 50 s in the other.]
Size-Based Scheduling
Shortest Remaining Processing Time (SRPT)
Minimizes average sojourn time (between job submission and completion)

Fair Sojourn Protocol (FSP)
Jobs are scheduled in the order they would complete if doing Processor Sharing (PS)
Avoids starving large jobs
Fairness: each job is guaranteed to complete no later than it would under Processor Sharing
[Friedman & Henderson, SIGMETRICS ’03]
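To make the difference between Processor Sharing and SRPT concrete, here is a minimal single-server simulation sketch. The three jobs are an assumption: their sizes and arrival times were chosen so that completions fall on the time-axis marks of the Shortest Remaining Processing Time figure (37.5, 42.5 and 50 s for one schedule; 20, 30 and 50 s for the other). The names `simulate`, `ps` and `srpt` are hypothetical.

```python
def simulate(jobs, policy):
    """Single-server, preemptive simulation.

    jobs:   list of (arrival_time, size) tuples
    policy: maps {job: remaining size} to {job: share of the server}
    Returns the sojourn time of each job, in input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining, sojourn, now, nxt = {}, {}, 0.0, 0
    while len(sojourn) < len(jobs):
        if not remaining:  # idle server: jump to the next arrival
            now = max(now, jobs[order[nxt]][0])
        while nxt < len(order) and jobs[order[nxt]][0] <= now:
            remaining[order[nxt]] = jobs[order[nxt]][1]
            nxt += 1
        shares = policy(remaining)
        # Advance to the next completion or arrival, whichever comes first.
        dt = min(remaining[i] / s for i, s in shares.items())
        if nxt < len(order):
            dt = min(dt, jobs[order[nxt]][0] - now)
        now += dt
        for i, s in shares.items():
            remaining[i] -= s * dt
        for i in [i for i, r in remaining.items() if r <= 1e-9]:
            sojourn[i] = now - jobs[i][0]
            del remaining[i]
    return [sojourn[i] for i in range(len(jobs))]

def ps(remaining):
    """Processor Sharing: every pending job gets an equal share."""
    return {i: 1.0 / len(remaining) for i in remaining}

def srpt(remaining):
    """SRPT: the job with the least remaining work gets the whole server."""
    return {min(remaining, key=remaining.get): 1.0}

# Assumed workload: (arrival time, size), both in seconds of full-cluster work.
jobs = [(0, 30), (10, 10), (15, 10)]
for name, policy in (("PS", ps), ("SRPT", srpt)):
    times = simulate(jobs, policy)
    print(name, times, "mean:", sum(times) / len(times))
```

With this assumed workload the large job finishes at the same time under both policies, while the two small jobs finish much earlier under SRPT, which lowers the mean sojourn time.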

Unknown Job Size
…and what if we can only estimate job size?
Multi-Processor Size-Based Scheduling
[Figure: two panels of cluster usage (%) over time (s) for jobs 1, 2 and 3 on a multi-processor cluster, with time-axis marks at 10, 13, 23.5, 24.5 and 39 s in one panel and at 10, 13, 20, 23 and 39 s in the other.]
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s “training” tasks have run, we update it
s = 5 by default
On t task slots, we give priority to training tasks
t avoids starving “old” jobs
“shortcut” for very small jobs

Scheduling Policy
We treat Map and Reduce phases as separate jobs
Virtual time: per-job simulated completion time
When a task slot frees up, we schedule a task from the job that completes earliest in virtual time
Job Size Estimation
Initial Estimation
Estimated job size: k · l
k: # of tasks
l: average size of past Map/Reduce tasks

Second Estimation
After the s sample tasks have run, compute l′ as the average size of the sample tasks
Timeout (60 s by default): if sample tasks have not completed by then, use their progress %
Predicted job size: k · l′
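The two estimates can be sketched as follows. This is an illustration rather than HFSP's code: the function names are hypothetical, and extrapolating an unfinished sample task as elapsed time divided by progress is an assumption about what "use progress %" means.

```python
from statistics import mean

def initial_estimate(num_tasks: int, past_task_sizes: list[float]) -> float:
    """First estimate k * l, with l the average size of past tasks of this phase."""
    return num_tasks * mean(past_task_sizes)

def refined_estimate(num_tasks: int,
                     sample_elapsed: list[float],
                     sample_progress: list[float]) -> float:
    """Second estimate k * l'.

    sample_elapsed[i]:  seconds the i-th sample task has run so far
    sample_progress[i]: completed fraction in (0, 1]; 1.0 means finished
    Unfinished tasks (at the timeout) are extrapolated as elapsed / progress,
    which is an assumption made for this sketch.
    """
    sizes = [t / p for t, p in zip(sample_elapsed, sample_progress)]
    return num_tasks * mean(sizes)

k = 100                                   # number of tasks in the job
print(initial_estimate(k, [20.0, 30.0, 40.0]))
print(refined_estimate(k,
                       sample_elapsed=[12.0, 15.0, 60.0, 60.0, 9.0],
                       sample_progress=[1.0, 1.0, 0.5, 0.75, 1.0]))
```

In the usage example, two of the five sample tasks hit the 60 s timeout, so their sizes are extrapolated from their progress before averaging.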
Virtual Time
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion
time, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
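A minimal sketch of such a virtual-time simulation, under simplifying assumptions that are not taken from the slides: slots are split max-min fairly, a job never uses more slots than it has tasks, and the task count of a job stays fixed during the simulation. The names `fair_shares` and `virtual_completion_times` are hypothetical.

```python
def fair_shares(max_slots, total_slots):
    """Max-min fair split of slots: no job gets more slots than it has tasks."""
    shares, left = {}, float(total_slots)
    pending = sorted(max_slots, key=max_slots.get)
    while pending:
        equal = left / len(pending)
        if max_slots[pending[0]] <= equal:
            job = pending.pop(0)          # small job: fully satisfied
            shares[job] = float(max_slots[job])
            left -= shares[job]
        else:
            for job in pending:           # everyone else splits what is left
                shares[job] = equal
            break
    return shares

def virtual_completion_times(sizes, tasks, total_slots):
    """Simulate processor sharing on the cluster.

    sizes: job -> estimated remaining size in "serialized" slot-seconds
    tasks: job -> number of tasks (caps the job's parallelism)
    Returns job -> completion time in the simulated (virtual) cluster.
    """
    remaining, now, finish = dict(sizes), 0.0, {}
    while remaining:
        shares = fair_shares({j: tasks[j] for j in remaining}, total_slots)
        dt = min(remaining[j] / shares[j] for j in remaining)
        now += dt
        for j in list(remaining):
            remaining[j] -= shares[j] * dt
            if remaining[j] <= 1e-9:
                finish[j] = now
                del remaining[j]
    return finish

finish = virtual_completion_times(
    sizes={"A": 100.0, "B": 40.0, "C": 400.0},
    tasks={"A": 2, "B": 10, "C": 10},
    total_slots=8,
)
# The scheduler would serve jobs in this order when slots free up.
print(sorted(finish, key=finish.get), finish)
```

Re-running the function whenever a job arrives or a task completes corresponds to the simulation updates listed above.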
Experimental Setup
Platform
36 machines with 4 CPUs, 16 GB RAM

Workloads
Generated with the PigMix benchmark: realistic operations on synthetic data
Data sizes inspired by known measurements [Chen et al., VLDB ’12; Ren et al., VLDB ’13]

Configuration
We compare to Hadoop’s FAIR scheduler
similar to processor-sharing
Delay scheduling enabled both for FAIR and HFSP
Sojourn Time
[Figure: ECDF of sojourn time (s, log scale) for HFSP and FAIR, two panels. “small” workload: ~16% better; “large” workload: ~75% better.]
Sojourn time: time that passes between the moment a job is submitted and the moment it terminates
With higher load, the scheduler becomes decisive
Analogous results on different platform & different workload
Job Size Estimation
[Figure: ECDF of the estimation error (log scale, 0.25 to 4) for MAP and REDUCE.]
Error = real size / estimated size
Fits a log-normal distribution
The estimation isn’t even that good! Why does HFSP work that well?
Size-Based Scheduling With Errors
Scheduling Simulation
How does size-based scheduling behave in the presence of errors?
Lu et al. (MASCOTS 2004) suggest much worse results
We wrote a simulator to understand this better, with Hadoop-like workloads [Chen et al., VLDB ’12]
Written in Python, efficient, and easy to use for prototyping new schedulers
Log-Normal Error Distribution
[Figure: PDF of the log-normal error distribution for sigma = 0.125, 0.25, 1 and 4.]
Error = real size / estimated size
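A sketch of how estimates with log-normal error could be generated for a simulation. It assumes the error has location parameter 0 (so its median is 1 and under- and over-estimation are equally likely), which is not stated on the slide; `estimate_size` is a hypothetical helper.

```python
import random

def estimate_size(real_size: float, sigma: float, rng: random.Random) -> float:
    """Return an estimate such that real_size / estimate is log-normal.

    The location parameter mu = 0 is an assumption of this sketch.
    """
    error = rng.lognormvariate(0.0, sigma)   # error = real size / estimated size
    return real_size / error

rng = random.Random(0)
for sigma in (0.125, 0.25, 1, 4):
    estimates = [estimate_size(100.0, sigma, rng) for _ in range(10_000)]
    within = sum(50.0 <= e <= 200.0 for e in estimates) / len(estimates)
    print(f"sigma={sigma}: {within:.0%} of estimates within a factor of 2")
```

Small values of sigma keep nearly all estimates close to the real size, while large values make estimates that are off by orders of magnitude common.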
Weibull Job Size Distribution
[Figure: PDF of the Weibull job size distribution for shape = 0.125, 1, 2 and 4.]
Interpolates between
heavy-tailed job size distributions (shape<1)
exponential distributions (shape=1)
bell-shaped distributions (shape>1)
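The effect of the shape parameter can be checked by sampling. The sketch below normalizes the mean job size to 1, which is an assumption made here for comparability; `weibull_job_sizes` and `top_share` are hypothetical helpers.

```python
import math
import random

def weibull_job_sizes(n, shape, rng, mean_size=1.0):
    """Draw n job sizes from a Weibull distribution with the given shape.

    The scale is chosen so that the expected size is mean_size
    (an assumption of this sketch): mean = scale * Gamma(1 + 1/shape).
    """
    scale = mean_size / math.gamma(1.0 + 1.0 / shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]

def top_share(sizes, fraction=0.01):
    """Share of the total work carried by the largest `fraction` of jobs."""
    k = max(1, int(len(sizes) * fraction))
    return sum(sorted(sizes, reverse=True)[:k]) / sum(sizes)

rng = random.Random(42)
for shape in (0.125, 1, 2, 4):
    sizes = weibull_job_sizes(20_000, shape, rng)
    print(f"shape={shape}: largest 1% of jobs hold {top_share(sizes):.1%} of the work")
```

The smaller the shape, the more of the total work is concentrated in a handful of very large jobs, which is the heavy-tailed regime.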
Size-Based Scheduling With Errors
[Figure: two heatmaps, SRPT and FSP, showing MST/MST(PS) (color scale from 0.25 to 128) as a function of shape and sigma, each ranging from 0.125 to 4.]
Problems for heavy-tailed job size distributions
Otherwise, size-based scheduling works very well
Over-Estimations and Under-Estimations
[Figure: panels “Over-estimation” and “Under-estimation”, each plotting remaining size over time t, for jobs J1, J2, J3 and J4, J5, J6.]
Under-estimations can wreak havoc with heavy-tailed workloads
FSP + PS
Idea
Without errors, real jobs always complete before virtual ones
When they don’t (they are late), there has been an estimation error
The scheduler can realize this, and take corrective action

Realization
To keep late jobs from blocking the system, just do processor sharing among them instead of scheduling only the “most late” one
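The idea can be sketched on a single server. The code below is an illustrative simulation, not the authors' simulator: the function name `fsp`, the job tuples and the example workload are assumptions. A virtual Processor Sharing system runs on the estimated sizes; a job is late once it has completed there but not in the real system.

```python
def fsp(jobs, late_ps):
    """jobs: list of (arrival, real_size, estimated_size), sorted by arrival.

    late_ps=False: late jobs are served one at a time, oldest lateness first.
    late_ps=True:  the server is shared equally among all late jobs (FSP + PS).
    Returns the sojourn time of each job.
    """
    EPS = 1e-9
    real, virt, late, sojourn = {}, {}, [], {}
    now, nxt = 0.0, 0
    while len(sojourn) < len(jobs):
        while nxt < len(jobs) and jobs[nxt][0] <= now + EPS:
            real[nxt], virt[nxt] = jobs[nxt][1], jobs[nxt][2]
            nxt += 1
        if late:
            served = list(late) if late_ps else late[:1]
        else:  # no late job: serve the one finishing first in the virtual system
            pending = [j for j in virt if j in real]
            served = [min(pending, key=virt.get)] if pending else []
        rate = 1.0 / len(served) if served else 0.0
        # Next event: real completion, virtual completion, or arrival.
        events = [real[j] / rate for j in served]
        if virt:
            events.append(min(virt.values()) * len(virt))
        if nxt < len(jobs):
            events.append(jobs[nxt][0] - now)
        dt = min(events)
        now += dt
        virt_rate = 1.0 / len(virt) if virt else 0.0
        for j in list(virt):  # advance the virtual processor-sharing system
            virt[j] -= virt_rate * dt
            if virt[j] <= EPS:
                del virt[j]
                if j in real:
                    late.append(j)  # done virtually, not really: late
        for j in served:  # advance the real system
            real[j] -= rate * dt
            if real[j] <= EPS:
                sojourn[j] = now - jobs[j][0]
                del real[j]
                if j in late:
                    late.remove(j)
    return [sojourn[j] for j in range(len(jobs))]

# Assumed workload: one job of size 100 estimated as 1, then five small jobs.
workload = [(0.0, 100.0, 1.0)] + [(float(t), 1.0, 1.0) for t in range(1, 6)]
for label, flag in (("FSP", False), ("FSP + PS", True)):
    times = fsp(workload, late_ps=flag)
    print(label, "mean sojourn time:", sum(times) / len(times))
```

In this assumed workload the under-estimated job becomes late almost immediately; without sharing it monopolizes the server until it finishes, while with sharing the small jobs get served as soon as they become late themselves.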
FSP + PS: Results
[Figure: two heatmaps, FSP and FSP + PS, showing MST/MST(PS) (color scale from 0.25 to 128) as a function of shape and sigma, each ranging from 0.125 to 4.]
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters

Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors
therefore, simple rough estimations are sufficient
HFSP is available as free software at
http://github.com/bigfootproject/hfsp
Scheduling simulator at
https://bitbucket.org/bigfootproject/schedsim
HFSP: published at IEEE BIGDATA 2013
scheduling simulator and FSP+PS: under submission, available at
http://arxiv.org/abs/1403.5996
Bonus Content
Schedulers vs. SRPT
[Figure: MST/MST(SRPT) as a function of shape (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO.]
Facebook
[Figure: MST/MST(SRPT) as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS and LAS. Two panels: synthetic workload (shape=0.25) and Facebook Hadoop cluster.]
Web Cache
[Figure: MST/MST(SRPT) (log scale) as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO. Two panels: synthetic workload (shape=0.177) and IRCache web cache.]
Job Preemption
Supported in Hadoop
Kill running tasks
wastes work
Wait for them to finish
may take long

Our Choice
Map tasks: Wait
generally small
For Reduce tasks, we implemented Suspend and Resume
avoids the drawbacks of both Wait and Kill
Job Preemption: Suspend and Resume
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT

The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming

Configurable maximum number of suspended tasks
if reached, switch to Wait
hard limit on memory allocated to suspended tasks

Between preemptable running tasks, suspend the youngest
likely to finish later
may have smaller memory footprint
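The mechanism can be illustrated with plain POSIX signals. This is a sketch for a POSIX system, not the Hadoop implementation: the helper names, the `MAX_SUSPENDED` value and the child process standing in for a Reduce task are all assumptions.

```python
import os
import signal
import subprocess
import sys

MAX_SUSPENDED = 2  # hypothetical value for the configurable limit

def suspend(pid: int) -> None:
    """Stop the task; the OS may swap its memory out if and when needed."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid: int) -> None:
    """Let the task continue; swapped pages are loaded back on demand."""
    os.kill(pid, signal.SIGCONT)

def pick_victim(start_times: dict) -> int:
    """Among preemptable running tasks, pick the youngest (latest start)."""
    return max(start_times, key=start_times.get)

def preempt(start_times: dict, suspended: set) -> str:
    """Suspend the youngest task, or fall back to Wait at the limit."""
    if len(suspended) >= MAX_SUSPENDED:
        return "wait"
    pid = pick_victim(start_times)
    suspend(pid)
    suspended.add(pid)
    return "suspended"

# A child process stands in for a running task (assumption for the demo).
task = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
suspended = set()
action = preempt({task.pid: 0.0}, suspended)
print(action, "still alive:", task.poll() is None)
resume(task.pid)
suspended.discard(task.pid)
task.terminate()
task.wait()
```

A stopped process keeps its state, so no work is lost as with Kill, and the slot is freed immediately rather than after the task ends as with Wait.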
33

More Related Content

What's hot

Time Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal RecoveryTime Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal Recovery
Daniel Cuneo
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
Albert Bifet
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
Ted Dunning
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
Cloudera, Inc.
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
dbpublications
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
Zhong Wang
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
NAVER Engineering
 

What's hot (10)

Time Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal RecoveryTime Series Analysis:Basic Stochastic Signal Recovery
Time Series Analysis:Basic Stochastic Signal Recovery
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 

Similar to Size-Based Scheduling: From Theory To Practice, And Back

HFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn ProtocolHFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn Protocol
Matteo Dell'Amico
 
Revisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job SizesRevisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job Sizes
Matteo Dell'Amico
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
ijccsa
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
OS-Assisted Task Preemption for Hadoop
OS-Assisted Task Preemption for HadoopOS-Assisted Task Preemption for Hadoop
OS-Assisted Task Preemption for Hadoop
Matteo Dell'Amico
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
Dilum Bandara
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endthkoch
 
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
Mario Pastorelli
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
David Ribeiro Alves
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
Kelly Technologies
 
E031201032036
E031201032036E031201032036
E031201032036
ijceronline
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
Sina Ebrahimi
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
Habiba Abderrahim
 
Hadoop - How It Works
Hadoop - How It WorksHadoop - How It Works
Hadoop - How It Works
Vladimír Hanušniak
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
ShimoFcis
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
TSANKARARAO
 

Similar to Size-Based Scheduling: From Theory To Practice, And Back (20)

HFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn ProtocolHFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn Protocol
 
Revisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job SizesRevisiting Size-Based Scheduling with Estimated Job Sizes
Revisiting Size-Based Scheduling with Estimated Job Sizes
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
OS-Assisted Task Preemption for Hadoop
OS-Assisted Task Preemption for HadoopOS-Assisted Task Preemption for Hadoop
OS-Assisted Task Preemption for Hadoop
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
HadoopDB a major step towards a dead end
HadoopDB a major step towards a dead endHadoopDB a major step towards a dead end
HadoopDB a major step towards a dead end
 
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
E031201032036
E031201032036E031201032036
E031201032036
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop
HadoopHadoop
Hadoop
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Hadoop - How It Works
Hadoop - How It WorksHadoop - How It Works
Hadoop - How It Works
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 

Recently uploaded

Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 

Recently uploaded (20)

Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Deep Software Variability and Frictionless Reproducibility
Size-Based Scheduling: From Theory To Practice, And Back

  • 1. Size-Based Scheduling: From Theory To Practice, And Back. Matteo Dell’Amico, EURECOM, 24 April 2014.
  • 2. Credits. Joint work with Pietro Michiardi and Mario Pastorelli (EURECOM), Antonio Barbuzzi (ex EURECOM, now @VisualDNA, UK), and Damiano Carra (University of Verona, Italy).
  • 3. Outline. (1) Big Data and MapReduce; (2) Size-Based Scheduling for MapReduce; (3) Size-Based Scheduling With Errors.
  • 4. Section 1: Big Data and MapReduce.
  • 7. Big Data: Definition. Data that is too big for you to handle the way you normally do. The 3 (+2) Vs: Volume, Velocity, Variety, plus Veracity and Value. But still, why is everybody talking about Big Data now?
  • 9. Big Data: Why Now? 1991, Maxtor 7040A: 40 MB, 600-700 KB/s, one minute to read it all. Now, Western Digital Caviar: 4 TB, 128 MB/s, 9 hours to read.
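The read times on this slide follow from dividing capacity by sequential throughput. A minimal sketch of that arithmetic, assuming decimal units and the midpoint (650 KB/s) of the quoted 600-700 KB/s range; the function name is illustrative:

```python
def full_read_time_s(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Seconds needed to read a whole disk sequentially at a constant rate."""
    return capacity_bytes / throughput_bytes_per_s


KB, MB, TB = 10**3, 10**6, 10**12  # decimal units (an assumption)

# 1991 Maxtor 7040A: 40 MB at about 650 KB/s (assumed midpoint of 600-700 KB/s).
maxtor_s = full_read_time_s(40 * MB, 650 * KB)

# Western Digital Caviar: 4 TB at 128 MB/s.
caviar_s = full_read_time_s(4 * TB, 128 * MB)

print(f"Maxtor 7040A: {maxtor_s:.0f} s")
print(f"WD Caviar:    {caviar_s / 3600:.1f} h")
```

Capacity grew by a factor of 100,000 while throughput grew by roughly 200, which is why a full scan went from about a minute to close to nine hours.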
  • 11. Moore and His Brothers. Moore’s Law: processing power doubles every 18 months. Kryder’s Law: storage capacity doubles every year. Nielsen’s Law: bandwidth doubles every 21 months. Storage is cheap, so we never throw anything away. Processing all that data is expensive. Moving it around is even worse.
  • 13. MapReduce. Bring the computation to the data, which is split in blocks across a cluster. Map: one task per block (Hadoop filesystem, HDFS: 64 MB by default); each task stores key-value pairs locally, e.g., for word count: [(red, 15), (green, 7), ...]. Reduce: the number of tasks is set by the programmer; mapper output is partitioned by key and pulled from the “mappers”; the Reduce function operates on all values for a single key, e.g., (green, [7, 42, 13, ...]).
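The word-count example can be sketched in plain Python to show how the Map, partition and Reduce steps fit together. This is an in-memory illustration, not Hadoop code: the function names, the CRC32 partitioner and the use of short strings as "blocks" are all assumptions made for the sketch.

```python
from collections import defaultdict
from zlib import crc32


def map_block(block: str) -> dict:
    """Map: one task per block, emitting local (word, count) pairs."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return dict(counts)


def partition(key: str, n_reducers: int) -> int:
    """Assign a key to a reducer (hypothetical hash partitioner)."""
    return crc32(key.encode("utf-8")) % n_reducers


def shuffle(mapper_outputs, n_reducers):
    """Group mapper output by key, one group of keys per reducer."""
    reducers = [defaultdict(list) for _ in range(n_reducers)]
    for output in mapper_outputs:
        for key, value in output.items():
            reducers[partition(key, n_reducers)][key].append(value)
    return reducers


def reduce_key(key, values):
    """Reduce: operates on all values for a single key."""
    return key, sum(values)


def word_count(blocks, n_reducers=2):
    mapper_outputs = [map_block(b) for b in blocks]
    result = {}
    for reducer_input in shuffle(mapper_outputs, n_reducers):
        for key, values in reducer_input.items():
            word, total = reduce_key(key, values)
            result[word] = total
    return result


print(word_count(["red green red", "green blue", "red"]))
```

Each reducer sees a list such as (green, [1, 1]), one value per mapper that emitted the key, which mirrors the (green, [7, 42, 13, ...]) example on the slide.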
  • 14. The Problem With Scheduling. Current workloads: huge job size variance; running time from seconds to hours; I/O from KBs to TBs [Chen et al., VLDB ’12; Ren et al., VLDB ’13; Appuswamy et al., SOCC ’13]. Consequence: interactive jobs are delayed by long ones; in smaller clusters, long queues exacerbate the problem.
  • 15. Section 2: Size-Based Scheduling for MapReduce.
  • 17. Shortest Remaining Processing Time. [Figure: cluster usage (%) over time (s) for job 1, job 2 and job 3, in two panels; time-axis ticks at 10, 15, 37.5, 42.5 and 50 s in one panel and at 10, 20, 30 and 50 s in the other.]
  • 18. Size-Based Scheduling. Shortest Remaining Processing Time (SRPT): minimizes average sojourn time (the time between job submission and completion). Fair Sojourn Protocol (FSP): jobs are scheduled in the order they would complete under Processor Sharing (PS); this avoids starving large jobs; fairness: jobs are guaranteed to complete no later than they would under Processor Sharing [Friedman & Henderson, SIGMETRICS ’03]. Unknown job size: what if we can only estimate job size?
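To make the difference between SRPT and Processor Sharing concrete, here is a minimal single-server, event-driven sketch. The job list is an assumption: a 30 s job arriving at t = 0 and two 10 s jobs arriving at t = 10 and t = 15, chosen because they reproduce the time-axis ticks of the Shortest Remaining Processing Time figure (10, 15, 37.5, 42.5, 50 s under PS and 10, 20, 30, 50 s under SRPT). All names are illustrative.

```python
def simulate(jobs, policy):
    """Single-server simulation. jobs: list of (arrival_time, size), sizes > 0.

    policy maps {job_index: remaining_work} to {job_index: server_share}.
    Returns the sojourn time of each job, in input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining = {}                      # work left for jobs in the system
    sojourn = [0.0] * len(jobs)
    now, nxt = 0.0, 0
    while nxt < len(order) or remaining:
        if not remaining:               # idle: jump to the next arrival
            now = max(now, jobs[order[nxt]][0])
        while nxt < len(order) and jobs[order[nxt]][0] <= now:
            remaining[order[nxt]] = jobs[order[nxt]][1]
            nxt += 1
        shares = policy(remaining)
        # Advance to the next event: a completion or an arrival.
        dt = min(remaining[j] / s for j, s in shares.items() if s > 0)
        if nxt < len(order):
            dt = min(dt, jobs[order[nxt]][0] - now)
        now += dt
        for j, s in shares.items():
            remaining[j] -= s * dt
        for j in [j for j, r in remaining.items() if r <= 1e-9]:
            sojourn[j] = now - jobs[j][0]
            del remaining[j]
    return sojourn


def srpt(remaining):
    """Give the whole server to the job with the least remaining work."""
    return {min(remaining, key=remaining.get): 1.0}


def ps(remaining):
    """Split the server equally among all jobs in the system."""
    return {j: 1.0 / len(remaining) for j in remaining}


def mean(values):
    return sum(values) / len(values)


jobs = [(0, 30), (10, 10), (15, 10)]    # assumed (arrival, size) pairs
print("PS   mean sojourn:", mean(simulate(jobs, ps)))
print("SRPT mean sojourn:", mean(simulate(jobs, srpt)))
```

In this example the large job finishes at t = 50 under both policies, while the two small jobs finish much earlier under SRPT, so the mean sojourn time drops without penalizing the large job.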
  • 19. Multi-Processor Size-Based Scheduling. [Figure: cluster usage (%) over time (s) for job 1, job 2 and job 3, in two panels; time-axis ticks at 10, 13, 23.5, 24.5 and 39 s in one panel and at 10, 13, 20, 23 and 39 s in the other.]
  • 21. HFSP In A Nutshell. Job size estimation: naive estimation at first; after the first s “training” tasks have run, we update it (s = 5 by default); on t task slots, we give priority to training tasks; t avoids starving “old” jobs; “shortcut” for very small jobs. Scheduling policy: we treat Map and Reduce phases as separate jobs; virtual time: per-job simulated completion time; when a task slot frees up, we schedule a task from the job that completes earliest in virtual time.
  • 22. Job Size Estimation. Initial estimation: k · l, where k is the number of tasks and l is the average size of past Map/Reduce tasks. Second estimation: after the s samples have run, compute l′ as the average size of the sample tasks; timeout (60 s by default): if tasks have not completed by then, use their progress percentage. Predicted job size: k · l′.
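The two estimation steps can be sketched as follows. This is not the HFSP code: function names are illustrative, and the way a timed-out sample is turned into a size (elapsed time divided by its progress fraction, i.e. linear extrapolation) is an assumption, since the slide only says that progress is used.

```python
def initial_estimate(num_tasks: int, past_avg_task_size: float) -> float:
    """Initial estimation: k * l."""
    return num_tasks * past_avg_task_size


def sample_task_size(elapsed_s: float, progress: float) -> float:
    """Size of one sample task.

    progress is 1.0 for a completed task; for a task cut off by the timeout,
    its size is extrapolated linearly from its progress (an assumption).
    """
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_s / progress


def refined_estimate(num_tasks: int, samples) -> float:
    """Second estimation: k * l', with l' the average size of the sample tasks.

    samples: list of (elapsed_seconds, progress) pairs, one per training task.
    """
    sizes = [sample_task_size(elapsed, progress) for elapsed, progress in samples]
    l_prime = sum(sizes) / len(sizes)
    return num_tasks * l_prime


# A job with k = 100 tasks; past tasks averaged 20 s.
print(initial_estimate(100, 20.0))
# s = 5 training tasks; the third hit the 60 s timeout at 50% progress.
print(refined_estimate(100, [(30, 1.0), (40, 1.0), (60, 0.5), (20, 1.0), (50, 1.0)]))
```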
  • 23. Virtual Time. The estimated job size is in a “serialized” single-machine format. HFSP simulates a processor-sharing cluster to compute completion times, based on the number of tasks per job and the task slots available in the real cluster. The simulation is updated when new jobs arrive and when tasks complete.
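A rough sketch of such a virtual cluster, under several simplifying assumptions: all jobs are present at time 0, each job can use at most as many slots as it has tasks, that cap stays constant while the job runs, and slots are split max-min fairly. The names and the example jobs are illustrative and not taken from HFSP.

```python
def fair_shares(caps, slots):
    """Max-min fair split of `slots` among jobs, each capped at caps[job]."""
    shares = {}
    left = float(slots)
    pending = sorted(caps, key=caps.get)     # smallest cap first
    while pending:
        equal = left / len(pending)
        job = pending[0]
        if caps[job] <= equal:
            # The job cannot use a full equal share: give it its cap.
            shares[job] = float(caps[job])
            left -= caps[job]
            pending.pop(0)
        else:
            # Every remaining job can use the equal share.
            for other in pending:
                shares[other] = equal
            break
    return shares


def virtual_completion_times(jobs, slots):
    """jobs: name -> (serialized_size, num_tasks). Returns name -> finish time."""
    remaining = {name: size for name, (size, _) in jobs.items()}
    done, now = {}, 0.0
    while remaining:
        caps = {name: jobs[name][1] for name in remaining}
        shares = fair_shares(caps, slots)
        dt = min(remaining[name] / shares[name] for name in remaining)
        now += dt
        for name in list(remaining):
            remaining[name] -= shares[name] * dt
            if remaining[name] <= 1e-9:
                done[name] = now
                del remaining[name]
    return done


jobs = {"a": (100.0, 10), "b": (20.0, 2), "c": (300.0, 10)}
finish = virtual_completion_times(jobs, slots=10)
# Jobs would be served in order of virtual completion time.
print(sorted(finish, key=finish.get), finish)
```

A real scheduler would rerun or incrementally update this simulation whenever a job arrives or a task completes, as the slide describes.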
  • 24. Experimental Setup. Platform: 36 machines with 4 CPUs and 16 GB RAM. Workloads: generated with the PigMix benchmark (realistic operations on synthetic data); data sizes inspired by known measurements [Chen et al., VLDB ’12; Ren et al., VLDB ’13]. Configuration: we compare to Hadoop’s FAIR scheduler, which is similar to processor sharing; delay scheduling is enabled for both FAIR and HFSP.
  • 25. Sojourn Time. [Figure: ECDF of sojourn time (s, log scale) for HFSP and FAIR, in two panels.] “Small” workload: ~16% better; “large” workload: ~75% better. Sojourn time: the time between the moment a job is submitted and the moment it terminates. With higher load, the scheduler becomes decisive. Analogous results on a different platform and a different workload.
  • 26. Job Size Estimation. [Figure: ECDF of the estimation error for MAP and REDUCE; error axis from 0.25 to 4.] Error: ratio between real size and estimated size. Fits a log-normal distribution. The estimation isn’t even that good! Why does HFSP work that well?
  • 27. Section 3: Size-Based Scheduling With Errors.
  • 28. Scheduling Simulation. How does size-based scheduling behave in the presence of errors? Lu et al. (MASCOTS 2004) suggest much worse results. We wrote a simulator to understand this better: it uses Hadoop-like workloads [Chen et al., VLDB ’12], is written in Python, is efficient, and makes it easy to prototype new schedulers.
  • 29. Log-Normal Error Distribution. [Figure: PDF of the log-normal distribution for sigma = 0.125, 0.25, 1 and 4.] Error: ratio between real size and estimated size.
  • 30. Weibull Job Size Distribution. [Figure: PDF of the Weibull distribution for shape = 0.125, 1, 2 and 4.] Interpolates between heavy-tailed job size distributions (shape < 1), exponential distributions (shape = 1), and bell-shaped distributions (shape > 1).
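A synthetic workload of this kind can be generated with the Python standard library alone. The sketch below is an assumption about how one might do it, not an excerpt from the authors' simulator: job sizes are drawn from a Weibull distribution with a given shape, and each estimate is the real size multiplied by a log-normal factor with median 1. Such a factor has the same distribution as its reciprocal, so the direction of the error ratio does not change the sketch.

```python
import random


def sample_job_sizes(n: int, shape: float, rng: random.Random, scale: float = 1.0):
    """Draw n job sizes from a Weibull distribution.

    shape < 1 gives a heavy tail, shape = 1 an exponential distribution,
    shape > 1 a bell-like curve.
    """
    # random.weibullvariate takes (scale, shape), in that order.
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def estimate_with_error(real_size: float, sigma: float, rng: random.Random) -> float:
    """Multiply the real size by a log-normal factor with median 1."""
    return real_size * rng.lognormvariate(0.0, sigma)


rng = random.Random(42)                      # fixed seed for repeatability
sizes = sample_job_sizes(5, shape=0.25, rng=rng)
estimates = [estimate_with_error(s, sigma=1.0, rng=rng) for s in sizes]
for real, est in zip(sizes, estimates):
    print(f"real={real:10.4f}  estimated={est:10.4f}")
```

With sigma = 0 every estimate equals the real size; larger sigma values spread the estimates over several orders of magnitude.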
  • 31. Size-Based Scheduling With Errors. [Figure: heatmaps of MST/MST(PS) for SRPT and FSP as a function of shape and sigma (both from 0.125 to 4); color scale from 0.25 to 128.] Problems for heavy-tailed job size distributions; otherwise, size-based scheduling works very well.
  • 32. Over-Estimations and Under-Estimations. [Figure: remaining size over time t for jobs J1 to J6, in over-estimation and under-estimation panels.] Under-estimations can wreak havoc with heavy-tailed workloads.
  • 34. FSP + PS. Idea: without errors, real jobs always complete before virtual ones; when they don’t (they are late), there has been an estimation error; the scheduler can detect this and take corrective action. Realization: to prevent late jobs from blocking the system, do processor sharing among them instead of scheduling only the “most late” one.
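The corrective rule can be sketched as a single decision function for a single server. The data layout is hypothetical: `virtual_remaining` holds each job's remaining work in the simulated processor-sharing system (0 once it has finished there), and `real_done` is the set of jobs that have actually completed. With equal shares in the virtual system, the job with the least virtual remaining work is the one that completes first there.

```python
def jobs_to_serve(virtual_remaining, real_done):
    """Return {job: share of the server} for the next scheduling step.

    A job is "late" when it has finished in the virtual system but not in the
    real one, which can only happen if its size was mis-estimated.
    """
    running = [job for job in virtual_remaining if job not in real_done]
    late = [job for job in running if virtual_remaining[job] <= 0]
    if late:
        # FSP + PS: share the server among all late jobs.
        return {job: 1.0 / len(late) for job in late}
    if not running:
        return {}
    # Plain FSP: serve the job that completes earliest in the virtual system.
    first = min(running, key=virtual_remaining.get)
    return {first: 1.0}


print(jobs_to_serve({"a": 5.0, "b": 2.0}, real_done=set()))
print(jobs_to_serve({"a": 0.0, "b": 0.0, "c": 3.0}, real_done=set()))
print(jobs_to_serve({"a": 0.0, "b": 0.0, "c": 3.0}, real_done={"a"}))
```

Sharing among late jobs means that one badly under-estimated large job cannot monopolize the server while other late jobs wait behind it.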
  • 35. FSP + PS: Results. [Figure: heatmaps of MST/MST(PS) for FSP and FSP + PS as a function of shape and sigma (both from 0.125 to 4); color scale from 0.25 to 128.]
  • 36. Take-Home Messages. Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters. Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors; therefore, simple rough estimations are sufficient. HFSP is available as free software at http://github.com/bigfootproject/hfsp and the scheduling simulator at https://bitbucket.org/bigfootproject/schedsim . HFSP was published at IEEE BIGDATA 2013; the scheduling simulator and FSP+PS are under submission, available at http://arxiv.org/abs/1403.5996
  • 37. Bonus Content: Schedulers vs. SRPT. [Figure: MST/MST(SRPT) as a function of shape (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO.]
  • 38. Bonus Content: Real Workloads, Facebook. [Figure: MST/MST(SRPT) as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS and LAS, in two panels: synthetic workload (shape = 0.25) and Facebook Hadoop cluster.]
  • 39. Bonus Content: Real Workloads, Web Cache. [Figure: MST/MST(SRPT) on a log scale as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS and LAS, plus FIFO in one panel; two panels: synthetic workload (shape = 0.177) and IRCache web cache.]
  • 41. Bonus Content: Job Preemption. Supported in Hadoop: Kill running tasks (wastes work) or Wait for them to finish (may take long). Our choice: for Map tasks, Wait, since they are generally small; for Reduce tasks, we implemented Suspend and Resume, which avoids the drawbacks of both Wait and Kill.
  • 45. Bonus Content: Job Preemption, Suspend and Resume. Our solution: we delegate to the OS, using SIGSTOP and SIGCONT. The OS will swap tasks if and when memory is needed; there is no risk of thrashing, since swapped data is loaded only when resuming. Configurable maximum number of suspended tasks: if reached, switch to Wait; this sets a hard limit on the memory allocated to suspended tasks. Among preemptable running tasks, suspend the youngest: it is likely to finish later and may have a smaller memory footprint.
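On a POSIX system, suspending and resuming a process with these two signals takes one call each. The sketch below is only an illustration of the mechanism, not the Hadoop integration: the helper names and the `{pid: start_time}` bookkeeping used to pick the youngest task are assumptions.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Stop a process; the OS may swap its memory out if it needs the space."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a stopped process continue from where it was suspended."""
    os.kill(pid, signal.SIGCONT)


def youngest_task(start_times: dict) -> int:
    """Pick the most recently started task from a {pid: start_time} mapping."""
    return max(start_times, key=start_times.get)


if __name__ == "__main__":
    # Stand-in for a Reduce task: a child process that just sleeps.
    task = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    suspend(task.pid)
    print("suspended, still alive:", task.poll() is None)
    resume(task.pid)
    print("resumed, still alive:", task.poll() is None)
    task.kill()
    task.wait()
```

Because SIGSTOP cannot be caught or ignored, the task needs no cooperation from its own code; the cost is that its memory stays allocated (or swapped) until it is resumed, which is what the cap on suspended tasks bounds.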