
GoodFit: Multi-Resource Packing of Tasks with Dependencies


  1. 1. GoodFit: Multi-Resource Packing of Tasks with Dependencies
  2. 2. Cluster Scheduling for Jobs. A cluster scheduler matches tasks (with dependencies) to machines, the file system, and the network. Goals: high cluster utilization, fast job completion time, predictable performance / fairness. Examples: BigData frameworks (Hive, SCOPE, Spark) and CloudBuild. Compared with VM placement, the scheduler need not keep resource "buffers", is more dynamic (tasks last seconds), and aggregate properties are important (e.g., all tasks in a job should finish).
  3. 3. Need careful multi-resource planning. Problem 1, fragmentation: a packer scheduler improves on current schedulers from 2 tasks/T to 3 tasks/T (+50%). Problem 2, over-allocation of network/disk: a packer improves from 2 tasks/2T to 2 tasks/T (+100%).
  4. 4. … worse with dependencies. [Figure: a DAG whose nodes are labeled {duration, resource demand}; a critical-path schedule finishes in ~nT·t while the best schedule finishes in ~T·t.] Critical-path scheduling can be n times off since it ignores resource demands; packers can be d times off since they ignore future work (d = number of resources).
  5. 5. Typical job scheduler infrastructure: a resource manager (RM) and node managers (NMs) exchange node heartbeats and task assignments with per-DAG application masters (AMs). Our additions: a schedule constructor per DAG, plus packing, bounded unfairness, schedule merging, and overbooking.
  6. 6. Main ideas in multi-resource packing. Task packing resembles multi-dimensional bin packing, but that problem is very hard ("APX-hard") and available heuristics do not directly apply because task demands change with placement. Packing heuristic: alignment score A = D · R, the dot product of a task's resource demand vector D and the machine's free resource vector R, considered only on machines where the task fits. Job completion time heuristic: shortest remaining work, P = (remaining # tasks) × (avg. task duration) × (avg. task resource demand). Used alone, A delays job completion and P loses packing efficiency, so the two are combined. Fairness trade-off: we show that {best "perf" | bounded unfairness} ~ best "perf", whereas insisting on instantaneous fairness loses both.
  7. 7. Main ideas in packing dependent tasks: 1. identify troublesome tasks (the "meat") and place them first; 2. systematically place the other tasks without deadlocks; 3. at runtime, use a precedence order from the computed schedule plus the heuristics on the previous slide, and (a) overbook where useful; 4. better lower bounds for DAG completion time. [Figure: virtual resource-time space with the meat (M) placed first and parents (P), children (C), and other (O) tasks overlaid around it.]
  8. 8. Results - 1: Packing and Packing + Deps. compared against a lower bound [20K DAGs from Cosmos].
  9. 9. Results - 2: Tez + Packing vs. Tez + Packing + Deps. [200 jobs from TPC-DS on a 200-server cluster].
  10. 10. Bundling
  11. 11. Temporal relaxation of fairness
  12. 12. Temporal relaxation of fairness. Problem: instantaneous fairness can be up to d× worse on makespan (d resources). [Example: two identical jobs, each a disk-bound map followed by a network-bound reduce, under strict 50/50 instantaneous sharing vs. letting each phase take 100% of its bottleneck resource; the relaxed schedule finishes the jobs earlier.] 1) Temporal relaxation of fairness: a job will finish within (1 + f)× the time it takes given its strict share. 2) Optimal trade-off with performance: (1 + f)× fairness costs (2 + 2f − 2√(f + f²))× on makespan. 3) A simple (offline) algorithm achieves this trade-off. Fairness slack f = 0 (perfectly fair) → 2× perf loss; f = 1 (< 2× longer) → 1.1×; f = 2 (< 3× longer) → 1.07×.
  13. 13. Landscape: bare-metal VM allocation (e.g., HDInsight, AzureBatch) vs. data-parallel jobs, where a job is tasks plus dependencies. Examples of the latter: BigData stacks (YARN: ~100K servers, 40K at Yahoo; Cosmos: >50K servers, >2 EB stored, >6K devs; Spark) and CloudBuild (3,500 servers, 3,500 users, >20M targets/day).
  14. 14. Takeaways: 1) job scheduling has specific aspects: tasks are short-lived (tens of seconds), have peculiarly shaped demands, composites are important (a job needs all of its tasks to finish), it is OK to kill and restart tasks, and locality matters; 2) exploiting these aspects will speed up the average job (and reduce resource cost); 3) this is a mix of research and practice.
  15. 15. Resource-aware scheduling improves SLOs and return per dollar.
  16. 16. Cluster Scheduling for Jobs. A cluster scheduler matches tasks (with dependencies) to machines, the file system, and the network. Goals: high cluster utilization, fast job completion time, predictable performance / fairness, and efficiency (decisions in milliseconds). Examples: HDInsight, AzureBatch; BigData frameworks (Hive, SCOPE, Spark); CloudBuild.
  17. 17. Need careful multi-resource planning. Problem 1, fragmentation: a packer scheduler improves on current schedulers from 2 tasks/T to 3 tasks/T (+50%). Problem 2, over-allocation of network/disk: a packer improves from 2 tasks/2T to 2 tasks/T (+100%).
  18. 18. … worse with dependencies. [Figure: a DAG whose nodes are labeled {duration, resource demand}; a critical-path schedule finishes in ~nT·t while the best schedule finishes in ~T·t.] Critical-path scheduling can be n times off since it ignores resource demands; packers can be d times off since they ignore future work (d = number of resources).
  19. 19. Typical job scheduler infrastructure: a resource manager (RM) and node managers (NMs) exchange node heartbeats and task assignments with per-DAG application masters (AMs). Our additions: a schedule constructor per DAG, plus packing, bounded unfairness, schedule merging, and overbooking.
  20. 20. Main ideas in packing dependent tasks: 1. identify troublesome tasks (T) and place them first; 2. systematically place the other tasks without dead-ends; 3. at runtime, enforce the computed schedule plus the heuristics on the previous slide, with (a) overbooking where useful; 4. better lower bounds for DAG completion time. [Figure: virtual resource-time space with the troublesome tasks (T) placed first and parents (P), children (C), and other (O) tasks overlaid around them.]
  21. 21. Results - 1: Packing vs. Packing + Deps. vs. a lower bound, with markers at 1.5X and 2X [20K DAGs from Cosmos].
  22. 22. Results - 2: Tez + Packing vs. Tez + Packing + Deps. [200 jobs from TPC-DS on a 200-server cluster].
  23. 23. Multi-Resource Packing for Cluster Schedulers
  24. 24. Performance of cluster schedulers. We observe that: resources are fragmented, i.e., machines run below capacity; even at 100% usage, goodput is much smaller because of over-allocation; and even Pareto-efficient multi-resource fair schemes result in much lower performance. Tetris: up to 40% improvement in makespan (time to finish a set of jobs) and job completion time, with near-perfect fairness.
  25. 25. Findings from analysis of Bing and Facebook traces. Diversity in multi-resource requirements: tasks need varying amounts of each resource, and demands for different resources are weakly correlated. This matters because multiple resources become tight and there is no single bottleneck resource (e.g., there is enough cross-rack network bandwidth to use all CPU cores). Upper-bounding the potential gains: reduce makespan by up to 49% and average job completion time by up to 46%.
  26. 26. Why so bad, #1: production schedulers neither pack tasks nor consider all of their relevant resource demands. This shows up as #1 resource fragmentation and #2 over-allocation.
  27. 27. Resource Fragmentation (RF). Current schedulers allocate resources in terms of slots and leave free resources that cannot be assigned to tasks. Example: machines A and B each have 4 GB of memory; tasks are T1 (2 GB), T2 (2 GB), and T3 (4 GB). A current scheduler spreads T1 and T2 across the machines, so T3 must wait (avg. task completion time 1.33t); a packer places T1 and T2 on one machine and T3 on the other, finishing everything in t (avg. 1t). RF increases with the number of resources being allocated. (A small simulation of this example follows.)
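A tiny round-based simulation makes the fragmentation effect concrete. This is a sketch of the slide's example only (two 4 GB machines, tasks of 2, 2, and 4 GB, unit task duration); the greedy policies and names are illustrative, not the actual scheduler code.

```python
def schedule(sizes, policy, capacity=4.0, machines=2):
    """Greedy round-based scheduler: each round lasts one task duration;
    tasks that do not fit anywhere wait for the next round."""
    pending = list(range(len(sizes)))
    finish = {}
    clock = 0
    while pending:
        clock += 1
        free = [capacity] * machines          # all tasks last exactly one round
        still_pending = []
        for i in pending:
            for m in policy(free):            # machine indexes to try, best first
                if sizes[i] <= free[m]:
                    free[m] -= sizes[i]
                    finish[i] = clock
                    break
            else:
                still_pending.append(i)       # did not fit anywhere this round
        pending = still_pending
    return sum(finish.values()) / len(finish)

# "Current" slot-style spreading vs. a packer that fills the fullest machine that fits.
spread = lambda free: sorted(range(len(free)), key=lambda m: -free[m])
pack   = lambda free: sorted(range(len(free)), key=lambda m: free[m])

tasks = [2.0, 2.0, 4.0]                       # GB of memory per task
print(schedule(tasks, spread))                # 1.33...: T3 waits a round (fragmentation)
print(schedule(tasks, pack))                  # 1.0: all three tasks start immediately
```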
  28. 28. Over-allocation. Not all task resource demands are explicitly allocated: disk and network get over-allocated. Example: machine A has 4 GB of memory and 20 MB/s of network; T1 and T2 each need 2 GB of memory and 20 MB/s of network, while T3 needs 2 GB of memory and no network. A current scheduler (which allocates only memory explicitly) starts T1 and T2 together, over-subscribing the 20 MB/s link so both slow down, and runs T3 last: avg. task completion time 2.33t. A packer pairs T1 with T3 and then runs T2: avg. 1.33t.
  29. 29. Why so bad, #2: work conserving != no fragmentation or over-allocation, and multi-resource fairness schemes do not help either. They treat the cluster as one big bag of resources, which hides the impact of resource fragmentation, and they assume a job has a fixed resource profile, whereas different tasks in the same job have different demands. In fact the schedule impacts a job's current resource profile, so the scheduler can create complementary profiles. Packer scheduler vs. DRF: avg. job completion time improves ~50% and makespan ~33%. Pareto-efficient (no job can increase its share without decreasing another's) != performant.
  30. 30. Competing objectives: cluster efficiency vs. job completion time vs. fairness. Current schedulers suffer from 1. resource fragmentation, 2. over-allocation, and 3. fair allocations that sacrifice performance.
  31. 31. #1: Pack tasks along multiple resources to improve cluster efficiency and reduce makespan.
  32. 32. Theory: multi-resource packing of tasks is similar to multi-dimensional bin packing (balls are tasks; bins are machine × time), which is APX-hard (a strict subset of NP-hard). Avoiding fragmentation looks like tight bin packing, which reduces the number of bins used and hence the makespan. Practice: existing heuristics do not directly apply here; they assume balls of a fixed size that are known a priori, whereas task demands vary with time and machine placement, are elastic, and the scheduler must cope with the online arrival of jobs, dependencies, and other cluster activity.
  33. 33. #1 Packing heuristic: alignment score (A), the dot product of a task's resource demand vector and a machine's free resource vector, computed only over machines that pass the fit check. "A" works because: 1. checking for fit ensures no over-allocation; 2. bigger balls get bigger scores; 3. abundant resources get used first, which reduces resource fragmentation; 4. load can spread across machines. (A minimal sketch follows.)
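Below is a minimal sketch of the fit check and alignment score just described. It is illustrative only, not Tetris's code: the dict-based resource vectors, helper names, and example numbers are assumptions.

```python
def fits(demand, free):
    """1. Check for fit: never over-allocate any resource."""
    return all(demand[r] <= free.get(r, 0.0) for r in demand)

def alignment_score(demand, free):
    """A = dot(demand, free). Bigger tasks get bigger scores, and machines with
    more of the resources the task needs score higher, so abundant resources
    are used first and load spreads across machines."""
    return sum(demand[r] * free[r] for r in demand)

# Example: a network-heavy task only lands where it fits, and among fitting
# machines it aligns with the one that has spare network.
task = {"cpu": 0.2, "mem": 0.1, "net": 0.6}
machines = {
    "m1": {"cpu": 0.5, "mem": 0.5, "net": 0.7},
    "m2": {"cpu": 0.7, "mem": 0.6, "net": 0.3},
}
eligible = {m: r for m, r in machines.items() if fits(task, r)}
best = max(eligible, key=lambda m: alignment_score(task, eligible[m]))
print(best)   # "m1": m2 fails the fit check on network
```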
  34. 34. #2: Faster average job completion time.
  35. 35. Challenge #2: a job completion time heuristic. Shortest Remaining Time First (SRTF) [M. Harchol-Balter et al., Connection Scheduling in Web Servers, USITS '99] schedules jobs in ascending order of their remaining time. Q: what is the shortest "remaining time" here? It is the "remaining work", a combination of the remaining # of tasks, their durations, and their resource demands. The heuristic gives a score P to every job, extending SRTF to incorporate multiple resources. (A sketch of P follows.)
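A sketch of a remaining-work score P in the spirit of this slide. The job layout (a list of (duration, demand) pairs) and the exact way demands are averaged are assumptions for illustration.

```python
def remaining_work(job):
    """P = (# remaining tasks) x (avg. task duration) x (avg. task demand):
    SRTF's "remaining time" extended to multiple resources."""
    tasks = job["remaining_tasks"]            # assumed: list of (duration, demand-dict)
    n = len(tasks)
    avg_duration = sum(d for d, _ in tasks) / n
    avg_demand = sum(sum(dem.values()) / len(dem) for _, dem in tasks) / n
    return n * avg_duration * avg_demand

# Job A has more and heavier remaining tasks than job B, so it scores higher
# (more remaining work); an SRTF-style policy would prefer job B.
job_a = {"remaining_tasks": [(10.0, {"cpu": 0.5, "mem": 0.4})] * 8}
job_b = {"remaining_tasks": [(5.0, {"cpu": 0.2, "mem": 0.1})] * 3}
print(remaining_work(job_a), remaining_work(job_b))   # ~36.0 and ~2.25
```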
  36. 36. Challenge #2 (continued): combine the A and P scores, since A alone delays job completion time and P alone loses packing efficiency. Pseudocode: 1: among J runnable jobs, 2: score(j) = A(t, R) + P(j), 3: where t is the task in j with maximal alignment whose demand(t) ≤ R (free resources), 4: pick j*, t* = argmax score(j). (A sketch follows.)
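A sketch of that combining loop, reusing fits(), alignment_score(), and remaining_work() from the sketches above. How the completion-time term is signed and weighted against A is an assumption here (remaining work is negated so that jobs closer to completion score higher), not the exact Tetris formula; the job fields are also illustrative.

```python
def pick_next(runnable_jobs, free, weight=1.0):
    """Mirror of the slide's loop: for each runnable job take its best-fitting
    task by alignment score, add a completion-time term, and pick the argmax."""
    # Jobs are assumed to carry both "runnable_tasks" (dispatchable now, each a
    # dict with a "demand" vector) and "remaining_tasks" (all unfinished work),
    # matching the earlier sketches.
    best = None
    for job in runnable_jobs:
        fitting = [t for t in job["runnable_tasks"] if fits(t["demand"], free)]
        if not fitting:                                   # line 3: demand(t) <= R
            continue
        task = max(fitting, key=lambda x: alignment_score(x["demand"], free))
        score = alignment_score(task["demand"], free) - weight * remaining_work(job)
        if best is None or score > best[0]:
            best = (score, job, task)
    return best                                           # (score, j*, t*) or None
```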
  37. 37. #3: Achieve performance and fairness.
  38. 38. #3 Fairness heuristic. A says "task i should go here to improve packing efficiency"; P says "schedule job j next to improve job completion time"; fairness says "this set of jobs should be scheduled next". There is typically a feasible solution that satisfies all of them. Performance and fairness do not mix well in general, but we can get "perfect fairness" and much better performance.
  39. 39. #3 Fairness heuristic (continued). Fairness is not a tight constraint: long-term fairness matters more than short-term fairness, and losing a bit of fairness buys a lot of performance. Fairness knob F ∈ [0, 1): F = 0 gives the most efficient scheduling, F → 1 comes close to perfect fairness. Heuristic: pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share. (A sketch follows.)
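A sketch of that knob under stated assumptions: jobs carry a fair_share and a current usage, the "distance from fair share" is simply fair_share minus usage, and rounding always keeps at least one job eligible. The real accounting in Tetris may differ.

```python
def eligible_jobs(jobs, fairness_knob):
    """F in [0, 1): F = 0 considers every job (most efficient scheduling);
    F -> 1 restricts the choice to the jobs furthest below their fair share."""
    by_deficit = sorted(jobs, key=lambda j: j["fair_share"] - j["usage"], reverse=True)
    k = max(1, int(round((1.0 - fairness_knob) * len(by_deficit))))
    return by_deficit[:k]

# F = 0: all four jobs compete on performance; F = 0.75: only the most-starved job.
jobs = [{"name": n, "fair_share": 0.25, "usage": u}
        for n, u in [("a", 0.40), ("b", 0.30), ("c", 0.10), ("d", 0.05)]]
print([j["name"] for j in eligible_jobs(jobs, 0.0)])    # ['d', 'c', 'b', 'a']
print([j["name"] for j in eligible_jobs(jobs, 0.75)])   # ['d']
```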
  40. 40. Putting it all together. We saw: packing efficiency, preferring small remaining work, and the fairness knob. Other things in the paper: estimating task demands, dealing with inaccuracies and barriers, and ingestion / evacuation. [Figure: YARN architecture with the changes for Tetris shown in orange: job managers send multi-resource asks and barrier hints; node managers track resource usage and enforce allocations; the cluster-wide resource manager gains new logic to match tasks to machines (+packing, +SRTF, +fairness) and exchanges asks, offers, allocations, and resource availability reports.]
  41. 41. Evaluation: pluggable scheduler in YARN 2.4; 250-machine cluster deployment; replay of Bing and Facebook traces.
  42. 42. Efficiency. Tetris improves makespan and avg. job completion time by 29% and 30% over the Capacity Scheduler, and by 28% and 35% over DRF. Gains come from avoiding fragmentation and avoiding over-allocation. [Figure: utilization (%) of CPU, memory, network-in, and storage over time for Tetris and for the Capacity Scheduler; values above 100% indicate over-allocation, and lower values indicate higher resource fragmentation.]
  43. 43. Fairness. The fairness knob quantifies the extent to which Tetris adheres to fair allocation. With no fairness (F = 0): makespan improves 50%, job completion time 40%, avg. slowdown over impacted jobs 25%. With F = 0.25: 25%, 35%, and 5%. With full fairness (F → 1): 10%, 23%, and 2%.
  44. 44. Summary: pack efficiently along multiple resources, prefer jobs with less "remaining work", and incorporate fairness. Tetris combines heuristics that improve packing efficiency with heuristics that lower average job completion time, and achieving the desired amount of fairness can coexist with improving cluster performance. Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results. We are working towards a YARN check-in. http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
  45. 45. Backup slides
  46. 46. Estimating resource demands. Peak-usage estimates come from finished tasks in the same phase, from statistics collected on recurring jobs, and from the size and location of a task's inputs (placement impacts network/disk requirements). A Resource Tracker reports unused resources and is aware of other cluster activities such as ingestion and evacuation. [Figure: used vs. free inbound network bandwidth (MB/s) over time on one machine.]
  47. 47. Packer scheduler vs. DRF. Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across users. Cluster: 18 cores, 36 GB memory. Jobs (task profile, # tasks): A [1 core, 2 GB] × 18; B [3 cores, 1 GB] × 6; C [3 cores, 1 GB] × 6. DRF maximizes allocations subject to qA + 3qB + 3qC ≤ 18 (CPU), 2qA + qB + qC ≤ 36 (memory), and qA/18 = qB/6 = qC/6 (equalized dominant shares), giving DS = 1/3: each time step runs 6 A tasks, 2 B tasks, and 2 C tasks (18 cores, 16 GB), so A, B, and C all finish at 3t. A packer instead runs all of A's tasks, then B's, then C's, finishing them at t, 2t, and 3t: a 33% improvement. (The arithmetic is checked in the snippet below.)
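A few lines are enough to check the slide's DRF numbers. This only reproduces this particular example, not a general DRF implementation, and the variable names are mine.

```python
# Check the slide's DRF arithmetic: 18 cores, 36 GB, three jobs.
cluster = {"cpu": 18, "mem": 36}
jobs = {"A": {"cpu": 1, "mem": 2}, "B": {"cpu": 3, "mem": 1}, "C": {"cpu": 3, "mem": 1}}

# Dominant share of a single task = max over resources of demand / capacity.
dom = {j: max(d[r] / cluster[r] for r in d) for j, d in jobs.items()}

# DRF equalizes dominant shares: with common share s, job j runs q_j = s / dom[j]
# tasks at once. Grow s until the first resource saturates.
per_unit = {r: sum(jobs[j][r] / dom[j] for j in jobs) for r in cluster}
s = min(cluster[r] / per_unit[r] for r in cluster)
alloc = {j: round(s / dom[j]) for j in jobs}

print(dom)    # A: 1/18, B: 1/6, C: 1/6
print(s)      # 0.333... -> the DS = 1/3 on the slide
print(alloc)  # {'A': 6, 'B': 2, 'C': 2} tasks running concurrently
              # (18 cores, 16 GB), so every job takes 3t under DRF,
              # while the packer finishes them at t, 2t, and 3t.
```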
  48. 48. Packing efficiency does not achieve everything: achieving it does not necessarily improve job completion time. Example: machines 1 and 2 each have [2 cores, 4 GB]; job A has 6 tasks of [2 cores, 3 GB] and job B has 2 tasks of [1 core, 2 GB]. The packing schedule runs A's big tasks first, two per time step, and B's last: A finishes at 3t and B at 4t. A non-packing schedule runs B's two tasks at the start alongside one A task, so B finishes at t and A at 4t, about a 29% improvement in average completion time.
  49. 49. Ingestion / evacuation: other cluster activities that produce background traffic. Ingestion = storing incoming data for later analytics; some clusters report volumes of up to 10 TB per hour. Evacuation = data evacuated and re-replicated before maintenance operations, e.g., rack decommission for machine re-imaging. The Resource Tracker reports these activities, and Tetris uses the reports to avoid contention between its tasks and these activities.
  50. 50. Workload analysis
  51. 51. Alternative Packing Heuristics
  52. 52. Fairness vs. Efficiency
  53. 53. Fairness vs. Efficiency
  54. 54. Virtual machine packing != Tetris. VM packing consolidates VMs with multi-dimensional resource requirements onto the fewest servers, but it focuses on different challenges and not on task packing: balancing load across servers, ensuring VM availability in spite of failures, and allowing quick software and hardware updates. There is no entity corresponding to a job, so job completion time is inexpressible, and explicit resource requirements (e.g., a "small" VM) make VM packing simpler.
  55. 55. Barrier knob, b ∈ [0, 1): Tetris gives preference to the last tasks in a stage. It offers resources to tasks in a stage preceding a barrier once a b fraction of that stage's tasks have finished; b = 1 means no tasks are preferentially treated.
  56. 56. Starvation prevention. Could it take a long time to accommodate large tasks? In practice no: 1. most tasks have demands within one order of magnitude of one another; 2. machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed by tasks that finished in the preceding period at once and can make reservations for large tasks.
  57. 57. Cluster load vs. Tetris performance
  58. 58. Packing and Dependency-aware Scheduling for Data-Parallel Clusters
  59. 59. Performance of cluster schedulers. We observe that cluster schedulers typically do dependency-aware scheduling OR multi-resource packing, and that none of the existing solutions are close to optimal for more than 50% of the production jobs. Graphene: >30% improvements in makespan (time to finish a set of jobs) and job completion time for more than 50% of the jobs.
  60. 60. Findings from analysis of Bing traces: job structures have evolved into complex DAGs of tasks; the median job DAG has a depth of 7 and on the order of 10³ tasks. A good cluster scheduler should be aware of dependencies.
  61. 61. Findings from analysis of Bing traces: applications have (very) diverse resource needs across CPU, memory, network, and disk; many resources show a high coefficient of variation (~1) in demand, and demands for different resources are weakly correlated. This matters because multiple resources become tight and there is no single bottleneck resource (e.g., there is enough cross-rack network bandwidth to use all CPU cores). A good cluster scheduler should pack resources.
  62. 62. Why so bad: production schedulers don't both pack tasks and consider dependencies; they do one or the other.
  63. 63. Dependency-aware vs. packing schedulers. Schedulers that consider the DAG structure, e.g., Critical Path Scheduling (CPSched) and Breadth First Search (BFS), do not account for tasks' resource demands, or assume tasks have homogeneous demands. Packers, e.g., Tetris, handle tasks with multiple resource requirements but ignore dependencies and take local greedy choices. Any scheduler that does not pack can be up to n× OPTIMAL (n = number of tasks); any scheduler that ignores dependencies can be d× OPTIMAL (d = number of resource dimensions).
  64. 64. Where does the "work" lie in a DAG? ("Work" = the stages in a DAG where most of the resources × time is spent.) Production DAGs are large and are neither a bunch of unrelated stages nor a chain of stages. In >40% of the DAGs most of the "work" is on the critical path, where CPSched performs well; in >30% of the DAGs the "work" is placed such that packers perform well; for ~50% of the DAGs, neither packers nor critical-path-based schedulers may perform well.
  65. 65. Pack tasks along multiple resources while considering task dependencies. Outline: state-of-the-art techniques are suboptimal; key ideas in Graphene; conclusion.
  66. 66. State-of-the-art scheduling techniques are suboptimal: on this example DAG, both CPSched and Tetris take ~3T while the optimal schedule takes ~T, i.e., they are 3× off optimal. [Figure: a six-task DAG (t0..t5), each task labeled duration {rsrc.1, rsrc.2}, with total capacity 1 in each dimension, and the CPSched, Tetris, and optimal schedules.] Key insights: t0, t2, and t5 are the troublesome tasks; schedule them as soon as possible.
  67. 67. #1 Schedule construction: identify troublesome tasks and place them accordingly on a virtual resource-time space.
  68. 68. Schedule construction. Identify the tasks that can lead to a poor schedule (troublesome tasks, T): those more likely to be on the critical path and more difficult to pack. Break the other tasks into P (parents), C (children), and O (the rest) based on their relationship to the tasks in T. Place the tasks in T on a virtual resource-time space, then overlay the others to fill any resultant holes in this space. This is nearly optimal for over three quarters of our analyzed production DAGs. (A simplified sketch of the grouping step follows.)
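A highly simplified sketch of the grouping step above. The real construction places T on a virtual resource-time space and overlays the other groups in a packing-aware way; here the result is only the four groups, which an online scheduler could use as a coarse preference order while still dispatching a task only after its parents finish. The DAG representation, the is_troublesome predicate, and the toy example are assumptions.

```python
from collections import defaultdict

def placement_groups(tasks, edges, is_troublesome):
    """Split a DAG into the slide's four groups: troublesome tasks (T),
    their ancestors (P), their descendants (C), and everything else (O)."""
    parents, children = defaultdict(set), defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
        children[u].add(v)

    def reachable(start, nbrs):
        seen, stack = set(), list(start)
        while stack:
            for y in nbrs[stack.pop()]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    T = {t for t in tasks if is_troublesome(t)}
    P = reachable(T, parents) - T          # ancestors of troublesome tasks
    C = reachable(T, children) - T         # descendants of troublesome tasks
    O = set(tasks) - T - P - C
    return T, P, C, O

# Toy DAG: t2 is troublesome (long / hard to pack); t0, t1 are its parents,
# t3 its child, t4 is unrelated.
edges = [("t0", "t2"), ("t1", "t2"), ("t2", "t3")]
print(placement_groups(["t0", "t1", "t2", "t3", "t4"], edges, lambda t: t == "t2"))
# T={'t2'}, P={'t0','t1'}, C={'t3'}, O={'t4'}
```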
  69. 69. #2 Online component: enforces the desired schedules of the various DAGs.
  70. 70. Online scheduling. Each DAG's schedule constructor produces a preference order; the runtime component in the resource manager merges these schedules and assigns tasks on node heartbeats. For job completion time it prefers jobs with less remaining work and enforces the priority ordering with local placement; for makespan it adds multi-resource packing and judicious overbooking of malleable resources; for fairness it uses deficit counters to bound unfairness and enables the implementation of different fairness schemes. (A sketch of the deficit-counter idea follows.)
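The deficit-counter idea can be sketched in a few lines. The update rule, the bound, and the use of remaining work as the performance preference are illustrative assumptions, not the exact Graphene logic.

```python
def next_job(jobs, max_deficit=2.0):
    """Bounded unfairness with deficit counters (illustrative update rule):
    usually pick the performance-preferred job (least remaining work here),
    but any job whose accumulated deficit exceeds the bound goes first."""
    starved = [j for j in jobs if j["deficit"] > max_deficit]
    pool = starved if starved else jobs
    chosen = min(pool, key=lambda j: j["remaining_work"])
    for j in jobs:
        if j is chosen:
            j["deficit"] = 0.0                   # got its turn: reset
        else:
            j["deficit"] += j["fair_share"]      # passed over: fall further behind
    return chosen["name"]

jobs = [
    {"name": "big",   "remaining_work": 90.0, "fair_share": 0.5, "deficit": 0.0},
    {"name": "small", "remaining_work": 10.0, "fair_share": 0.5, "deficit": 0.0},
]
print([next_job(jobs) for _ in range(6)])
# ['small', 'small', 'small', 'small', 'small', 'big']
# 'big' is delayed for performance but served within a bounded number of rounds.
```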
  71. 71. Evaluation: implemented in YARN and Tez; 250-machine cluster deployment; replay of Bing traces and TPC-DS / TPC-H workloads.
  72. 72. Efficiency results. Graphene improves makespan / avg. job completion time by 29% / 27% vs. Tetris, 31% / 33% vs. Critical Path, and 23% / 24% vs. BFS. Gains come from a view of the entire DAG and from placing the troublesome tasks first: a more compact schedule, better packing, and overbooking.
  73. 73. Conclusion: Graphene combines various mechanisms to improve packing efficiency while considering task dependencies. It constructs a good schedule by placing tasks on a virtual resource-time space, and online heuristics softly enforce the desired schedules. Implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results.
  74. 74. Efficiency results (repeated). Graphene improves makespan / avg. job completion time by 29% / 27% vs. Tetris, 31% / 33% vs. Critical Path, and 23% / 24% vs. BFS. [Figure: running tasks over time, Graphene vs. BFS.] Gains come from a view of the entire DAG and from placing the troublesome tasks first: a more compact schedule, better packing, and overbooking.
