Size-Based Scheduling: From Theory To Practice, And Back
 

The proof that the best response time in queuing systems is obtained by scheduling the jobs with the shortest remaining processing time dates back to 1966; since then, other size-based scheduling protocols that pair near-optimal response times with strong fairness guarantees have been proposed. Yet, despite these very desirable properties, size-based scheduling policies are almost never used in practice: a key reason is that, in real systems, it is prohibitive to know exact job sizes a priori.
In this talk, I will first describe our efforts to put theoretical concepts into practice by developing HFSP, a size-based scheduler for Hadoop MapReduce that uses size estimates rather than exact size information. The results were surprisingly good even with very inaccurate estimates, which motivated us to return to theory and perform an in-depth study of scheduling based on estimated sizes. The findings are very promising: for a large class of workloads, size-based scheduling performs well even with very rough size estimates; for the remaining workloads, simple modifications to the existing scheduling protocols are enough to greatly improve performance.

Presentation Transcript

• Size-Based Scheduling: From Theory To Practice, And Back. Matteo Dell’Amico, EURECOM, 24 April 2014
• Credits: joint work with Pietro Michiardi and Mario Pastorelli (EURECOM), Antonio Barbuzzi (ex EURECOM, now at VisualDNA, UK), and Damiano Carra (University of Verona, Italy)
• Outline: 1. Big Data and MapReduce; 2. Size-Based Scheduling for MapReduce; 3. Size-Based Scheduling With Errors
• Section 1: Big Data and MapReduce
• Big Data: Definition: data that is too big for you to handle the way you normally do. The 3 (+2) Vs: Volume, Velocity, Variety, plus Veracity and Value. But still: why is everybody talking about Big Data now?
• Big Data: Why Now? 1991: Maxtor 7040A: 40 MB, 600-700 KB/s, one minute to read it all. Now: Western Digital Caviar: 4 TB, 128 MB/s, 9 hours to read it all.
• Moore and His Brothers. Moore’s Law: processing power doubles every 18 months; Kryder’s Law: storage capacity doubles every year; Nielsen’s Law: bandwidth doubles every 21 months. Storage is cheap, so we never throw anything away; processing all that data is expensive; moving it around is even worse.
• MapReduce: bring the computation to the data, which is split in blocks across a cluster. Map: one task per block (Hadoop filesystem, HDFS: 64 MB blocks by default); stores key-value pairs locally, e.g., for word count: [(red, 15), (green, 7), ...]. Reduce: the number of tasks is set by the programmer; mapper output is partitioned by key and pulled from the “mappers”; the Reduce function operates on all values for a single key, e.g., (green, [7, 42, 13, ...]). (A sketch of the word-count example follows.)
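To make the word-count example above concrete, here is a minimal, hypothetical sketch of the Map and Reduce functions with a toy in-memory driver; it is not Hadoop's actual API, and the names (map_fn, reduce_fn, run) are made up for illustration.

```python
# Minimal sketch of word-count Map and Reduce with a toy in-memory driver.
from collections import defaultdict

def map_fn(block):
    """Map: emit (word, count) pairs for one input block."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return list(counts.items())          # e.g. [("red", 15), ("green", 7), ...]

def reduce_fn(key, values):
    """Reduce: combine all partial counts for a single key."""
    return key, sum(values)              # e.g. ("green", [7, 42, 13]) -> ("green", 62)

def run(blocks):
    """Toy driver: shuffle map output by key, then reduce each key."""
    shuffled = defaultdict(list)
    for block in blocks:
        for key, value in map_fn(block):
            shuffled[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in shuffled.items())

print(run(["red green red", "green blue"]))   # {'red': 2, 'green': 2, 'blue': 1}
```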
• The Problem With Scheduling. Current workloads: huge job size variance; running times from seconds to hours; I/O from KBs to TBs [Chen et al., VLDB ’12; Ren et al., VLDB ’13; Appuswamy et al., SOCC ’13]. Consequence: interactive jobs are delayed by long ones; in smaller clusters, long queues exacerbate the problem.
• Section 2: Size-Based Scheduling for MapReduce
• Shortest Remaining Processing Time: [figure: cluster usage (%) over time (s) for three jobs, comparing processor sharing with SRPT scheduling]
• Size-Based Scheduling. Shortest Remaining Processing Time (SRPT): minimizes average sojourn time (the time between job submission and completion). Fair Sojourn Protocol (FSP): jobs are scheduled in the order they would complete if doing Processor Sharing (PS); avoids starving large jobs; fairness: jobs are guaranteed to complete no later than under Processor Sharing [Friedman & Henderson, SIGMETRICS ’03]. Unknown job size: and what if we can only estimate job sizes? (A sketch of the two policies follows.)
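As an illustration of the two policies named above, here is a hedged single-machine sketch (my own simplification, not the paper's code): SRPT picks the job with the least remaining work, while FSP simulates Processor Sharing to obtain virtual completion times and serves jobs in that order. It assumes no arrivals during the simulation; the real protocol re-simulates whenever a job arrives.

```python
# Sketch: SRPT vs. FSP ordering on one machine. Jobs are dicts with an 'id' and a
# 'remaining' processing time; all helper names here are illustrative.
def srpt_pick(jobs):
    """SRPT: serve the job with the shortest remaining processing time."""
    return min(jobs, key=lambda j: j["remaining"])

def fsp_order(jobs):
    """FSP: serve jobs in the order they would complete under Processor Sharing
    (all active jobs share the machine equally). With no future arrivals, that
    order equals the remaining-work order; the loop also computes the virtual
    completion times themselves."""
    by_work = sorted(jobs, key=lambda j: j["remaining"])
    t, drained, virtual_finish = 0.0, 0.0, {}
    for i, job in enumerate(by_work):
        still_active = len(by_work) - i                 # jobs sharing the machine now
        t += (job["remaining"] - drained) * still_active
        drained = job["remaining"]
        virtual_finish[job["id"]] = t
    return [j["id"] for j in by_work], virtual_finish

jobs = [{"id": "J1", "remaining": 5}, {"id": "J2", "remaining": 2}, {"id": "J3", "remaining": 9}]
print(srpt_pick(jobs)["id"])   # J2
print(fsp_order(jobs))         # (['J2', 'J1', 'J3'], {'J2': 6.0, 'J1': 12.0, 'J3': 16.0})
```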
• Multi-Processor Size-Based Scheduling: [figure: cluster usage (%) over time (s) for three jobs scheduled across multiple processors]
• HFSP In A Nutshell. Job size estimation: naive estimation at first; after the first s “training” tasks have run, we update it (s = 5 by default); on t task slots, we give priority to training tasks (t avoids starving “old” jobs; a “shortcut” handles very small jobs). Scheduling policy: we treat the Map and Reduce phases as separate jobs; virtual time: per-job simulated completion time; when a task slot frees up, we schedule a task from the job that completes earliest in virtual time.
• Job Size Estimation. Initial estimation: k · l, where k is the number of tasks and l is the average size of past Map/Reduce tasks. Second estimation: after the s sample tasks have run, compute l′ as the average size of the sample tasks (timeout of 60 s by default: if tasks have not completed by then, use their progress %); predicted job size: k · l′. (A sketch follows.)
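A minimal sketch of the two-stage estimate described on that slide, under stated assumptions: the initial guess is k · l with l the average size of previously observed tasks, and the refined guess is k · l′ with l′ computed from the s sampled tasks, extrapolating unfinished samples from their progress fraction after the timeout. The names (estimate_initial, estimate_refined) and the exact extrapolation rule are illustrative, not HFSP's actual code.

```python
# Hypothetical sketch of HFSP-style job size estimation (not the real implementation).
def estimate_initial(num_tasks, avg_past_task_size):
    """Initial estimate: k * l, with l the average size of past Map/Reduce tasks."""
    return num_tasks * avg_past_task_size

def estimate_refined(num_tasks, samples, timeout=60.0):
    """Refined estimate: k * l', with l' the average size of the s sample tasks.
    samples: list of (elapsed_seconds, progress_fraction, finished) tuples.
    Unfinished samples past the timeout are extrapolated from their progress %."""
    sizes = []
    for elapsed, progress, finished in samples:
        if finished:
            sizes.append(elapsed)
        elif elapsed >= timeout and progress > 0:
            sizes.append(elapsed / progress)      # extrapolate total size from progress
    l_prime = sum(sizes) / len(sizes) if sizes else 0.0
    return num_tasks * l_prime

print(estimate_initial(40, 12.0))                                  # 480.0
print(estimate_refined(40, [(10, 1.0, True), (60, 0.5, False)]))   # 40 * (10 + 120) / 2 = 2600.0
```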
• Virtual Time. The estimated job size is in a “serialized” single-machine format. HFSP simulates a processor-sharing cluster to compute each job’s completion time, based on the number of tasks per job and the available task slots in the real cluster. The simulation is updated when new jobs arrive and when tasks complete. (A sketch follows.)
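The following is a hedged sketch of the virtual-time computation: simulate a processor-sharing cluster whose task slots are split equally among the active jobs and record when each job would finish. The equal-split rule and the function name are simplifying assumptions for illustration; as the slide says, the real scheduler also accounts for each job's number of tasks and re-runs the simulation on arrivals and task completions.

```python
# Sketch: virtual completion times in a simulated processor-sharing cluster.
def virtual_completion_times(remaining_work, slots):
    """remaining_work: {job_id: remaining work in task-seconds}; slots: cluster task slots.
    Returns {job_id: virtual time from now at which the job would complete under PS}."""
    work = dict(remaining_work)
    finish, now = {}, 0.0
    while work:
        rate = slots / len(work)                       # equal share of slots per active job
        dt = min(w / rate for w in work.values())      # time until the next virtual completion
        now += dt
        for job_id in list(work):
            work[job_id] -= rate * dt
            if work[job_id] <= 1e-9:
                finish[job_id] = now
                del work[job_id]
    return finish

print(virtual_completion_times({"J1": 100.0, "J2": 40.0, "J3": 300.0}, slots=10))
# approximately {'J2': 12.0, 'J1': 24.0, 'J3': 44.0}
```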
• Experimental Setup. Platform: 36 machines with 4 CPUs and 16 GB RAM each. Workloads: generated with the PigMix benchmark (realistic operations on synthetic data); data sizes inspired by known measurements [Chen et al., VLDB ’12; Ren et al., VLDB ’13]. Configuration: we compare against Hadoop’s FAIR scheduler (similar to processor sharing); delay scheduling is enabled for both FAIR and HFSP.
• Sojourn Time: [figure: ECDFs of sojourn time (s) for HFSP vs. FAIR on two workloads]. “Small” workload: ~16% better; “large” workload: ~75% better. Sojourn time: the time that passes between the moment a job is submitted and the moment it terminates. With higher load, the scheduler becomes decisive. Analogous results on a different platform and a different workload.
• Job Size Estimation: [figure: ECDF of the estimation error for Map and Reduce tasks]. Error: real size / estimated size; it fits a log-normal distribution. The estimation isn’t even that good: why does HFSP work so well?
• Section 3: Size-Based Scheduling With Errors
• Scheduling Simulation. How does size-based scheduling behave in the presence of errors? Lu et al. (MASCOTS 2004) suggest much worse results. We wrote a simulator to understand this better, using Hadoop-like workloads [Chen et al., VLDB ’12]; it is written in Python, efficient, and easy to use for prototyping new schedulers.
• Log-Normal Error Distribution: [figure: PDFs for sigma = 0.125, 0.25, 1, 4]. Error: real size / estimated size.
• Weibull Job Size Distribution: [figure: PDFs for shape = 0.125, 1, 2, 4]. The Weibull family interpolates between heavy-tailed job size distributions (shape < 1), the exponential distribution (shape = 1), and bell-shaped distributions (shape > 1). (A sketch of the simulation’s size and error models follows.)
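A minimal sketch, assuming numpy, of how job sizes and estimation errors of the kind described on these two slides could be drawn in a simulator: real sizes from a Weibull distribution with a given shape, and estimated sizes obtained by dividing by a log-normal error factor with a given sigma. The parameterization details (scale, zero mean of the log) are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: drawing Weibull job sizes and log-normally distributed estimation errors.
import numpy as np

def sample_jobs(n, shape, sigma, scale=1.0, rng=None):
    """Return (real_sizes, estimated_sizes) for n jobs.
    shape: Weibull shape (shape < 1 gives heavy-tailed sizes); scale: Weibull scale.
    sigma: log-normal sigma of the multiplicative error real / estimated."""
    rng = rng or np.random.default_rng(0)
    real = scale * rng.weibull(shape, size=n)               # real job sizes
    error = rng.lognormal(mean=0.0, sigma=sigma, size=n)    # error = real / estimated
    estimated = real / error
    return real, estimated

real, est = sample_jobs(5, shape=0.25, sigma=0.5)
print(np.round(real, 3), np.round(est, 3))
```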
• Size-Based Scheduling With Errors: [figure: heatmaps of mean sojourn time relative to PS (MST/MST(PS)) over shape and sigma, for SRPT and FSP]. Problems arise for heavy-tailed job size distributions; otherwise, size-based scheduling works very well.
• Over-Estimations and Under-Estimations: [figure: remaining-size timelines showing how over-estimation and under-estimation affect the schedules of jobs J1-J6]. Under-estimations can wreak havoc with heavy-tailed workloads.
• FSP + PS. Idea: without errors, real jobs always complete before their virtual counterparts; when they don’t (they are “late”), there has been an estimation error, and the scheduler can recognize this and take corrective action. Realization: to keep late jobs from blocking the system, do processor sharing among them, instead of scheduling only the “most late” one. (A sketch follows.)
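The following is a hedged sketch of that corrective rule, with invented names (pick_runnable): jobs whose virtual copy has already finished but that are still running in reality are "late"; if any exist, share the machine among all of them, otherwise serve the job that finishes earliest in virtual time, as plain FSP would.

```python
# Sketch of the FSP+PS dispatch rule: processor-share the "late" jobs, else behave like FSP.
def pick_runnable(jobs, virtual_now):
    """jobs: list of dicts with 'id', 'virtual_finish' (simulated PS completion time)
    and 'done' (whether the real job has completed). Returns the job ids to run now."""
    active = [j for j in jobs if not j["done"]]
    late = [j for j in active if j["virtual_finish"] <= virtual_now]
    if late:
        return [j["id"] for j in late]                     # PS among late jobs
    best = min(active, key=lambda j: j["virtual_finish"])  # plain FSP otherwise
    return [best["id"]]

jobs = [
    {"id": "J1", "virtual_finish": 10.0, "done": False},   # late: virtually done, still running
    {"id": "J2", "virtual_finish": 25.0, "done": False},
    {"id": "J3", "virtual_finish": 18.0, "done": True},
]
print(pick_runnable(jobs, virtual_now=12.0))   # ['J1']
```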
• FSP + PS: Results: [figure: heatmaps of MST/MST(PS) over shape and sigma, for FSP and FSP + PS]
• Take-Home Messages. Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters. Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors; therefore, simple rough estimations are sufficient. HFSP is available as free software at http://github.com/bigfootproject/hfsp; the scheduling simulator is at https://bitbucket.org/bigfootproject/schedsim. HFSP was published at IEEE BIGDATA 2013; the scheduling simulator and FSP+PS are under submission, available at http://arxiv.org/abs/1403.5996.
• Bonus Content: Comparison with SRPT: [figure: MST/MST(SRPT) vs. shape for SRPTE, FSPE, FSPE+PS, PS, LAS, FIFO]
• Bonus Content: Real Workloads (Facebook): [figures: MST/MST(SRPT) vs. sigma for SRPTE, FSPE, FSPE+PS, PS, LAS, on a synthetic workload (shape = 0.25) and on a Facebook Hadoop cluster trace]
• Bonus Content: Real Workloads (Web Cache): [figures: MST/MST(SRPT) vs. sigma for SRPTE, FSPE, FSPE+PS, PS, LAS, FIFO, on a synthetic workload (shape = 0.177) and on an IRCache web cache trace]
• Job Preemption. Supported in Hadoop: Kill running tasks (wastes work) or Wait for them to finish (may take long). Our choice: for Map tasks, Wait, since they are generally small; for Reduce tasks, we implemented Suspend and Resume, which avoids the drawbacks of both Wait and Kill.
• Job Preemption: Suspend and Resume. Our solution: we delegate to the OS with SIGSTOP and SIGCONT. The OS will swap suspended tasks out if and when memory is needed; there is no risk of thrashing, since swapped data is loaded only when resuming. A configurable maximum number of suspended tasks (if reached, switch to Wait) puts a hard limit on the memory allocated to suspended tasks. Among preemptable running tasks, we suspend the youngest: it is likely to finish later and may have a smaller memory footprint. (A sketch follows.)
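As a final illustration, a hedged sketch of the suspend/resume mechanism described above, using the standard os.kill with SIGSTOP and SIGCONT; the surrounding policy (picking the youngest preemptable task, falling back to Wait when the suspension cap is reached) uses invented names and is a simplification of what the slides describe.

```python
# Sketch: OS-level task preemption via SIGSTOP/SIGCONT, with a cap on suspended tasks.
import os
import signal

MAX_SUSPENDED = 4          # illustrative cap; if reached, fall back to the Wait strategy
suspended_pids = set()

def suspend_task(pid):
    """Suspend a running task's process; the OS may swap it out if memory is needed."""
    if len(suspended_pids) >= MAX_SUSPENDED:
        return False       # cap reached: the caller should Wait instead
    os.kill(pid, signal.SIGSTOP)
    suspended_pids.add(pid)
    return True

def resume_task(pid):
    """Resume a previously suspended task; swapped pages are reloaded on demand."""
    os.kill(pid, signal.SIGCONT)
    suspended_pids.discard(pid)

def pick_victim(running_tasks):
    """Among preemptable running tasks, choose the youngest (largest start time):
    it is likely to finish later and may have a smaller memory footprint."""
    return max(running_tasks, key=lambda t: t["start_time"])
```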