HFSP: the Hadoop Fair Sojourn Protocol

Size-based scheduling for Hadoop providing both efficiency and fairness.

Transcript

  • 1. HFSP: the Hadoop Fair Sojourn Protocol
    Mario Pastorelli, Antonio Barbuzzi, Damiano Carra, Matteo Dell'Amico, Pietro Michiardi
    May 13, 2013
  • 2. Outline
    1 Hadoop and MapReduce
    2 Fair Sojourn Protocol
    3 HFSP Implementation
    4 Experiments
  • 3. Outline: Hadoop and MapReduce
  • 4-5. Hadoop and MapReduce / MapReduce
    Bring the computation to the data: the input is split in blocks across the cluster
    MAP
      One task per block; the Hadoop filesystem (HDFS) uses 64 MB blocks by default
      Stores key-value pairs locally, e.g., for word count: [(manzana, 15), (melocoton, 7), ...]
    REDUCE
      The number of tasks is set by the programmer
      Mapper output is partitioned by key and pulled from the "mappers"
      The REDUCE function operates on all values for a single key, e.g., (melocoton, [7, 42, 13, ...])
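A minimal, self-contained sketch of the word-count example above in plain Python (Hadoop would run the equivalent logic through its Java API; the block contents and function names here are illustrative):

```python
from collections import defaultdict

def map_fn(block):
    """MAP: emit a (word, 1) pair for every word in an input block."""
    for word in block.split():
        yield (word, 1)

def reduce_fn(key, values):
    """REDUCE: operate on all values collected for a single key."""
    return (key, sum(values))

# The framework shuffles mapper output: partition and group by key.
blocks = ["manzana melocoton manzana", "melocoton manzana"]
grouped = defaultdict(list)
for block in blocks:
    for key, value in map_fn(block):
        grouped[key].append(value)

print([reduce_fn(k, v) for k, v in grouped.items()])
# [('manzana', 3), ('melocoton', 2)]
```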
  • 6. Hadoop and MapReduce / Problem Statement
    The problem with scheduling
    Current workloads
      Huge variance in job sizes
      Running times from seconds to hours; I/O from KBs to TBs
      [Chen et al., VLDB '12; Ren et al., CMU TR '12]
    Consequence
      Interactive jobs are delayed by long ones
      In smaller clusters, long queues exacerbate the problem
  • 7. Outline: Fair Sojourn Protocol
  • 8. Fair Sojourn Protocol / Introduction to FSP
    Fair Sojourn Protocol [Friedman & Henderson, SIGMETRICS '03]
    [Figure: two schedules of jobs 1-3 as cluster usage (%) over time (s)]
    Simulate completion times using a simulated processor-sharing discipline
    Schedule all resources to the job that would complete first
  • 9. Fair Sojourn Protocol / Introduction to FSP
    Multi-processor FSP
    [Figure: schedules of jobs 1-3 as cluster usage (%) over time (s), multi-processor case]
    In our case, some jobs may not require all cluster resources
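A minimal sketch of the FSP idea, assuming a single-resource cluster and illustrative job sizes: simulate processor sharing to obtain each job's virtual completion time, then hand the whole cluster to the job that would finish first.

```python
def virtual_completion_times(remaining, rate=1.0):
    """Simulate processor sharing: all active jobs share the cluster rate
    equally; return each job's simulated completion time."""
    pending = dict(remaining)        # job -> remaining size
    now, done = 0.0, {}
    while pending:
        share = rate / len(pending)  # equal share per active job
        job = min(pending, key=pending.get)
        dt = pending[job] / share    # time until the smallest job finishes
        now += dt
        for j in pending:
            pending[j] -= share * dt
        done[job] = now
        del pending[job]
    return done

jobs = {"job1": 10.0, "job2": 30.0, "job3": 20.0}
times = virtual_completion_times(jobs)
# FSP devotes ALL resources to the job that would complete first.
print(times, "-> schedule", min(times, key=times.get))
# {'job1': 30.0, 'job3': 50.0, 'job2': 60.0} -> schedule job1
```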
  • 10. Outline: HFSP Implementation
  • 11-12. HFSP Implementation / HFSP In General
    HFSP in a nutshell
    Job size estimation
      Naive estimation at first
      After the first s "training" tasks have run, we make a better estimation (s = 5 by default)
      On t task slots, we give priority to training tasks
        t avoids starving "old" jobs
        a "shortcut" for very small jobs
    Scheduling policy
      We treat the MAP and REDUCE phases as separate jobs
      A virtual cluster outputs a per-job simulated completion time
      Preempt running tasks of jobs that complete later in the virtual cluster
  • 13. HFSP Implementation / Size Estimation
    Job size estimation (1)
    Initial estimation: ξ · k · l
      k: number of tasks
      l: average size of past MAP/REDUCE tasks
      ξ ∈ [1, ∞]: aggressiveness for scheduling jobs in the training phase
        ξ = 1 (default): tend to schedule training jobs right away; they may have to be preempted
        ξ = ∞: wait for training to end before deciding; may require more "waves"
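Spelled out in code, the initial estimate is a one-line product; the function and parameter names here are illustrative:

```python
def initial_estimate(k, l, xi=1.0):
    """Initial job size estimate: xi * k * l, where k is the number of
    tasks, l the average size of past MAP/REDUCE tasks, and xi >= 1 tunes
    how aggressively jobs still in their training phase are scheduled."""
    return xi * k * l

# E.g., 40 tasks with an average past task size of 12.5 (arbitrary units):
print(initial_estimate(k=40, l=12.5))  # 500.0
```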
  • 14-15. HFSP Implementation / Size Estimation
    Job size estimation (2)
    MAP phase
      From the sizes of the s samples, generate an empirical CDF
      (Least-squares) fit it to a parametric distribution
      Predicted job size: k times the expected value of the fitted distribution
    Data locality
      Experimentally, we find it is not an issue: for the s sample jobs, there are plenty of unprocessed blocks around
      We use delay scheduling [Zaharia et al., EuroSys '10]
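A sketch of the MAP-phase estimate under a simplifying assumption: the samples are fitted here to an exponential distribution by maximum likelihood (so the fitted mean is just the sample mean), whereas HFSP least-squares-fits the empirical CDF to a parametric distribution; all names are illustrative.

```python
import numpy as np

def estimate_map_size(sample_sizes, k):
    """Predict total MAP-phase size as k times the mean of a distribution
    fitted to the s sample task sizes. With an exponential fit by maximum
    likelihood, the fitted mean equals the sample mean."""
    fitted_mean = np.mean(sample_sizes)
    return k * fitted_mean

print(estimate_map_size([8.0, 12.0, 9.0, 11.0, 10.0], k=40))  # 400.0
```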
  • 16. HFSP Implementation / Size Estimation
    Job size estimation (3)
    REDUCE phase
      Shuffle time: getting data to the reducer
        the time between scheduling a REDUCE task and the first execution of the REDUCE function
        estimated as the average of the sample shuffle sizes, weighted by data size
      Execution time
        we set a timeout ∆ (60 s by default)
        if the timeout is hit, the estimated execution time is ∆ / p, where the progress p is the fraction of data processed
      Compute the estimated REDUCE time as before
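A sketch of the execution-time rule for REDUCE tasks, with illustrative names: tasks that finish before the timeout report their real duration, while a task that hits the timeout is extrapolated from its progress.

```python
def estimated_execution_time(elapsed, progress, timeout=60.0):
    """Tasks that finish before the timeout report their real duration;
    a task that hits the timeout Delta is extrapolated as Delta / p, with
    p the fraction of its data processed so far."""
    if elapsed < timeout:
        return elapsed
    if progress <= 0.0:
        return float("inf")  # no measurable progress yet
    return elapsed / progress

# A task that processed 25% of its data when the 60 s timeout fired:
print(estimated_execution_time(elapsed=60.0, progress=0.25))  # 240.0
```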
  • 17. HFSP Implementation / Virtual Cluster
    Virtual cluster
      The estimated job size is in a "serialized", single-machine format
      Simulates a processor-sharing cluster to compute completion times, based on
        the number of tasks per job
        the available task slots in the real cluster
      The simulation is updated when new jobs arrive and when tasks complete
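A slot-aware sketch refining the earlier processor-sharing example: each active job receives an equal share of task slots, capped by its own parallelism, which captures the point of slide 9 that some jobs cannot use the whole cluster. The names and event-by-event structure are illustrative, not the HFSP implementation itself.

```python
def simulate_completions(jobs, total_slots):
    """Event-by-event sketch of the virtual cluster: each active job gets
    an equal share of the task slots, capped by its own task count, and
    virtual time advances until every job drains. `jobs` maps a name to
    (number of tasks, average task duration)."""
    pending = {j: n * d for j, (n, d) in jobs.items()}  # remaining work
    width = {j: n for j, (n, _) in jobs.items()}        # max parallelism
    now, done = 0.0, {}
    while pending:
        fair = total_slots / len(pending)
        rate = {j: min(fair, width[j]) for j in pending}
        nxt = min(pending, key=lambda j: pending[j] / rate[j])
        dt = pending[nxt] / rate[nxt]
        now += dt
        for j in list(pending):
            pending[j] -= rate[j] * dt
        done[nxt] = now
        del pending[nxt]
    return done

# Job A cannot use more than 4 slots, so B runs alongside it:
print(simulate_completions({"A": (4, 10.0), "B": (100, 10.0)}, total_slots=20))
# {'A': 10.0, 'B': 55.0}
```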
  • 18-19. HFSP Implementation / Preemption
    Job preemption
    Two options are supported in Hadoop:
      KILL running tasks: wastes work
      WAIT for them to finish: may take a long time
    Our choice
      MAP tasks: WAIT, since they are generally small
      For REDUCE tasks, we implemented SUSPEND and RESUME, which avoids the drawbacks of both WAIT and KILL
  • 20-23. HFSP Implementation / Preemption
    Job preemption: SUSPEND and RESUME
    Our solution: we delegate to the OS, via SIGSTOP and SIGCONT
      The OS will swap suspended tasks out if and when memory is needed
        no risk of thrashing: swapped data is loaded back only when resuming
      Configurable maximum number of suspended tasks
        if it is reached, switch to WAIT
        a hard limit on the memory allocated to suspended tasks
      If not all running tasks should be preempted, suspend the youngest ones
        likely to finish later
        may have a smaller memory footprint
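A minimal sketch of OS-delegated preemption, using only the standard POSIX signals via Python's os.kill; the suspension policy mirrors the bullets above, and all function names are illustrative.

```python
import os
import signal

def suspend(pid):
    """SUSPEND a task by freezing its process; the OS swaps its memory out
    only if and when memory is actually needed."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid):
    """RESUME a suspended task; swapped pages are reloaded lazily."""
    os.kill(pid, signal.SIGCONT)

def preempt(running, suspended, max_suspended):
    """Suspend the youngest running tasks first (likely to finish later,
    may have a smaller footprint); once the configurable cap is reached,
    fall back to WAIT, i.e., leave the remaining tasks running.
    `running` is a list of (pid, start_time) pairs."""
    for pid, _ in sorted(running, key=lambda t: t[1], reverse=True):
        if len(suspended) >= max_suspended:
            break  # cap reached: WAIT for the rest
        suspend(pid)
        suspended.add(pid)
```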
  • 24. Outline: Experiments
  • 25. Experiments / Setup and Traces
    Experimental setup
    Platform
      100 m1.xlarge Amazon EC2 instances
      4 × 2 GHz cores, 1.6 TB storage, 15 GB RAM each
    Workloads
      Generated with the SWIM workload generator [Chen et al., MASCOTS '11]
      Synthesized from Facebook traces [Chen et al., VLDB '12]
      FB2009: 100 jobs, most of them small; 22-minute submission schedule
      FB2010: 93 jobs, small jobs filtered out; 1-hour submission schedule
    Configuration
      We compare to Hadoop's FAIR scheduler, which is similar to a processor-sharing discipline
      Delay scheduling is enabled for both FAIR and HFSP
  • 26. Experiments / Results
    FB2009
    [Figure: CDFs of the fraction of completed jobs vs. sojourn time (min), HFSP vs. FAIR, with panels for small, medium, and large jobs]
    The FIFO scheduler would mostly fall outside of the graph
    Small jobs (few tasks) are not problematic in either case: they are allocated enough tasks
    Medium and large jobs instead require a significant amount of the cluster resources: "focusing" all the resources of the cluster pays off
  • 27. Experiments / Results
    FB2010
    [Figure: CDFs of the fraction of completed jobs vs. time (min), HFSP vs. FAIR, with panels for the MAP phase, the REDUCE phase, and the aggregate sojourn time]
    Larger jobs and longer queues put more pressure on the scheduler
    The median MAP sojourn time is more than halved
    Main reason: fewer "waves", because cluster resources are focused
    On aggregate, by the time the first job completes with FAIR, 20% of the jobs are done with HFSP
  • 28. Experiments / Results
    Cluster size
    [Figure: average sojourn time (min) vs. number of cluster nodes (10 to 100), HFSP vs. FAIR]
    Experiment done with Mumak, the official Hadoop emulator, on FB2009
    For smaller clusters, scheduling makes a bigger difference
  • 29. Experiments / Results
    Robustness to estimation errors
    [Figure: average sojourn time (s) as a function of the error parameter α, with FAIR and error-free HFSP (α = 0) as references]
    Experimental settings as before: FB2009 and Mumak again
    For a job size estimate of θ, we introduce an error by picking a value uniformly in [(1 − α)θ, (1 + α)θ]
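The error model of this experiment spelled out in code, with illustrative names:

```python
import random

def perturb(theta, alpha):
    """Replace a true job size theta with a value drawn uniformly from
    [(1 - alpha) * theta, (1 + alpha) * theta]."""
    return theta * random.uniform(1.0 - alpha, 1.0 + alpha)

random.seed(0)
print(perturb(theta=100.0, alpha=0.5))  # some value in [50.0, 150.0]
```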
  • 30. Experiments / Results
    Preemption: costs
    Question: could the costs associated with swapping make SUSPEND not worth it?
    Measurements
      Linux can read and write swap close to the maximum disk speed: 100 MB/s for us
    Worst-case analysis
      In the FB2010 experiment, 10% of REDUCE tasks are suspended
      The JVM heap space for REDUCE tasks is 1 GB, as advised in the Hadoop docs
      Therefore a SUSPEND/RESUME induces swapping for at most 20 s (1 GB swapped out and back in at 100 MB/s takes about 10 s each way), one order of magnitude less than the average size of preempted tasks
  • 31. Experiments / Conclusions
    Take-home messages
      Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
      Even simple, approximate means of size estimation are sufficient, as HFSP is robust with respect to errors
      Delegating to the POSIX SIGSTOP and SIGCONT signals is an efficient way to perform preemption in Hadoop
    HFSP is available as free software at http://bitbucket.org/bigfootproject/hfsp
    Paper at http://arxiv.org/abs/1302.2749