SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Computing Infrastructures
Simon Delamare Gilles Fedak Derrick Kondo Oleg Lodygensky
High-Performance Parallel and Distributed Computing, 2012

    Presentation Transcript

    • SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Computing Infrastructures
      Simon Delamare (1), Gilles Fedak (2), Derrick Kondo (3), Oleg Lodygensky (4)
      (1) LIP/CNRS, Univ. Lyon, France
      (2) LIP/INRIA, Univ. Lyon, France
      (3) LIG/INRIA, Univ. Grenoble, France
      (4) LAL/CNRS, Univ. Paris XI, France
      High-Performance Parallel and Distributed Computing, 2012
      S. Delamare, G. Fedak, D. Kondo and O. Lodygensky (LIP/CNRS, Univ. Lyon, France, LIP/INRIA, Univ. Lyon, France, LIG/INRIA, Univ. Grenoble, France, LAL/SpeQuloS HPDC’12 1 / 18
    • Introduction
      BE-DCI = “Best-Effort” Distributed Computing Infrastructure
      → Large computing power at low cost; avoids wasting resources
      → No availability guarantee
      Desktop Grids
      → BOINC projects: PetaFLOPS for free
      Grids used in Best-Effort mode
      → ≈ 40% utilization in Grid5000@Lyon
      Cloud “Spot” Instances
      → c1.large instance price: $0.12/h (spot) vs. $0.32/h (regular)
      Relevant for BoT execution ...
      Bag of Tasks (BoT): set of independent tasks to compute
      → ... but low QoS level, especially compared to regular infrastructures
    • Performance Problem Addressed
      The BoT completion rate drops at the end of execution
      → Tail Effect
      [Figure: BoT completion ratio vs. time, showing the BoT completion curve and the tail part of the BoT. The ideal time is obtained by a continuation performed at 90% of completion; the tail duration is the gap between the ideal time and the actual completion time.]
      Measured by Slowdown:
      $S = \frac{\text{RealCompletionTime}}{\text{IdealCompletionTime}} = \frac{\text{TailDuration} + \text{IdealTime}}{\text{IdealTime}}$
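The slowdown definition on this slide can be illustrated with a minimal sketch. It assumes the "continuation at 90% of completion" means extrapolating linearly from the time at which 90% of the BoT is done; the function names and the sample numbers are hypothetical and not taken from the slides.

```python
def ideal_time(time_at_ratio: float, ratio: float = 0.9) -> float:
    """Ideal completion time, assuming the completion rate observed
    up to `ratio` (90% by default) had stayed constant until the end."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return time_at_ratio / ratio


def tail_slowdown(actual_time: float, time_at_ratio: float, ratio: float = 0.9) -> float:
    """Slowdown S = actual completion time / ideal completion time."""
    return actual_time / ideal_time(time_at_ratio, ratio)


# Hypothetical execution: 90% of the BoT done after 45 time units,
# the whole BoT done after 100 time units.
ideal = ideal_time(45.0)
tail_duration = 100.0 - ideal
slowdown = tail_slowdown(100.0, 45.0)
print(f"ideal={ideal:.1f} tail={tail_duration:.1f} S={slowdown:.2f}")
```

With these made-up numbers the tail takes about as long as the rest of the execution, so S is close to 2.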
    • Slowdown by Tail Effect
      Slowdown reported on BoT execution
      [Figure: fraction of executions where tail slowdown < S, as a function of tail slowdown S (completion time observed divided by ideal completion time), for BOINC and XWHEP.]
      Best 50% ⇒ S < 1.3
      25% to 33% ⇒ S > 2
      Worst 5% ⇒ S > 4 to 10

                           Avg. % of BoT in tail    Avg. % of time in tail
      BE-DCI trace         BOINC      XWHEP         BOINC      XWHEP
      Desktop Grids        4.65       5.11          51.8       45.2
      Best Effort Grids    3.74       6.40          27.4       16.5
      Spot Instances       2.94       5.19          22.7       21.6

      → Caused by no more than the last 7% of the BoT
    • How to improve the situation?
      Better scheduling
        QoS in Grid scheduling ([12], [20], [38])
        → Requires heavy modification of middleware
        → No satisfactory solution for unreliable infrastructures ([7])
        Addressing the tail effect
        → e.g. in MapReduce ([3], [39]), but this requires precise information from compute nodes, which is hard to obtain in large DCIs
      Building Hybrid DCIs
        Grid & Desktop Grid ([35], [36])
        → Mostly to offload Grid usage
        Using Cloud computing ([10], [28], [37])
        → To address peak demands
    • SpeQuloS Service
      → Improving the QoS perceived by BE-DCI users
        Speeding up BoT execution
        Providing information on expected BoT execution time
      By dynamic provisioning of Cloud resources
      → Monitoring BoT execution
      → Executing the tail on the Cloud
      Features:
      1 Our context: existing BE-DCIs and Clouds that we do not administer, treated as black boxes
      2 Interface with users: QoS requests, state of completion, prediction of remaining time
      3 Careful utilization of Cloud resources, with billing & accounting of usage
    • Framework
      SpeQuloS modules:
        Information: collects QoS-related information from DGs
        Oracle: strategies to use Cloud resources appropriately / QoS prediction for users
        Scheduler: starts/stops Cloud resources, usage accounting
        Credit System: bills Cloud usage to the user, who spends “credits” to buy Cloud resource cpu.h
      Implementation
        Independent modules using Python & MySQL
        Supported Clouds: EC2, OpenNebula, etc.
        Supported DG middleware: BOINC & XtremWeb-HEP
    • Cloud Provisioning Strategies
      When to start Cloud resources?
        At 90% of BoT completion (9C)
        At 90% of BoT assignment (9A)
        When the tail appears, detected by monitoring execution time variance (V)
      How many Cloud resources to start (for a given amount of credits)?
        Greedy: as many as possible, for 1 hour of Cloud usage (G)
        Conservative: ensure there are enough credits to run Cloud resources up to an estimated completion time (C)
      How to use Cloud resources?
        Flat: Cloud workers are not differentiated from BE-DCI workers (F)
        Reschedule: the scheduler reschedules tasks executing on the BE-DCI to the Cloud (R)
        Cloud Duplication: uncompleted tasks are duplicated to a dedicated Cloud infrastructure (D)
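The "when" and "how many" decisions above can be sketched as two small functions. This is only an illustration under stated assumptions: the function names, the credit price per worker-hour, and the hourly rounding in the Conservative case are hypothetical, and the variance-based tail detection (V) is reduced to a boolean supplied by the caller because the slides do not detail it.

```python
import math


def should_start_cloud(trigger: str, completed: int, assigned: int, total: int,
                       tail_detected: bool = False) -> bool:
    """Decide whether Cloud workers should be started now.

    trigger is "9C" (90% of tasks completed), "9A" (90% of tasks assigned)
    or "V" (tail detected by some external variance monitor).
    """
    if trigger == "9C":
        return completed / total >= 0.9
    if trigger == "9A":
        return assigned / total >= 0.9
    if trigger == "V":
        return tail_detected
    raise ValueError(f"unknown trigger: {trigger}")


def workers_to_start(sizing: str, credits: float, credits_per_worker_hour: float,
                     est_hours_remaining: float) -> int:
    """Number of Cloud workers affordable with the available credits.

    "G" (Greedy) spends everything on one hour of usage; "C" (Conservative)
    keeps enough credits to run until the estimated completion time.
    Assumption: usage is billed per started hour.
    """
    if sizing == "G":
        hours = 1
    elif sizing == "C":
        hours = max(1, math.ceil(est_hours_remaining))
    else:
        raise ValueError(f"unknown sizing: {sizing}")
    return int(credits // (credits_per_worker_hour * hours))


# Hypothetical BoT: 1000 tasks, 905 completed, 100 credits, 3.5 hours left.
if should_start_cloud("9C", completed=905, assigned=980, total=1000):
    print("greedy:", workers_to_start("G", 100, 1.0, 3.5))
    print("conservative:", workers_to_start("C", 100, 1.0, 3.5))
```

A combination such as 9C-C-R would pair the first trigger and the Conservative sizing with the Reschedule deployment, which is not modelled here.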
    • Experimentation Setup (1)
      Simulations using availability traces from real BE-DCI infrastructures, various BoT workloads, and the BOINC and XWHEP middleware
      BE-DCI availability traces:
        Desktop Grids: seti, nd (SETI@Home & NotreDame traces from FTA)
        Best Effort Grids: g5klyo, g5kgre (available resources in Grid5000 Lyon & Grenoble clusters in December 2010)
        Cloud Spot instances: spot10, spot100 (maximum number of instances for a rental cost of 10 or 100 $ per hour, fluctuates according to market price)

      trace    length (days)  mean    deviation  min    max    av. quartiles (s)   unav. quartiles (s)  avg. power (nops/s)  power std. dev.
      seti     120            24391   6793       15868  31092  61, 531, 5407       174, 501, 3078       1000                 250
      nd       413.87         180     4.129      77     501    952, 3840, 26562    640, 960, 1920       1000                 250
      g5klyo   31             90.573  105.4      6      226    21, 51, 63          191, 236, 480        3000                 0
      g5kgre   31             474.69  178.7      184    591    5, 182, 11268       23, 547, 6891        3000                 0
      spot10   90             82.186  3.814      29     87     4415, 5432, 17109   4162, 5034, 9976     3000                 300
      spot100  90             823.95  4.945      196    877    1063, 5566, 22490   383, 1906, 10274     3000                 300
    • Experimentation Setup (2)
      BoT workloads:

                Size                        nops / task                     Arrival time
      SMALL     1000                        3600000                         0
      BIG       10000                       60000                           0
      RANDOM    norm(µ = 1000, σ² = 200)    norm(µ = 60000, σ² = 10000)     weib(λ = 91.98, k = 0.57)

      Simulation methodology:
        Reproducible executions without and with SpeQuloS
        SpeQuloS credits provisioned with 10% of the BoT workload (in Cloud resource cpu.hour equivalent)
      → 25000 BoT execution traces
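As an illustration of the RANDOM workload row, the sketch below draws one BoT with Python's standard `random` module. Several points are assumptions rather than statements from the slides: σ² is read as a variance (so the standard deviation is its square root), each task gets its own arrival time drawn from the Weibull distribution, and draws are clamped to stay positive. The function name `random_bot` is hypothetical.

```python
import math
import random


def random_bot(rng: random.Random) -> list[tuple[float, float]]:
    """Draw one RANDOM-category BoT as a list of (nops, arrival_time) tasks."""
    # BoT size ~ norm(mu=1000, sigma^2=200), rounded to at least one task.
    size = max(1, round(rng.normalvariate(1000, math.sqrt(200))))
    tasks = []
    for _ in range(size):
        # Work per task ~ norm(mu=60000, sigma^2=10000), kept positive.
        nops = max(1.0, rng.normalvariate(60000, math.sqrt(10000)))
        # Arrival time ~ weib(lambda=91.98, k=0.57);
        # weibullvariate takes (scale, shape) in that order.
        arrival = rng.weibullvariate(91.98, 0.57)
        tasks.append((nops, arrival))
    return tasks


# A fixed seed keeps the generated workload reproducible.
bot = random_bot(random.Random(0))
print(len(bot), "tasks, total nops:", round(sum(nops for nops, _ in bot)))
```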
    • Strategies Comparison
      Tail Removal Efficiency
      → Tail duration with SpeQuloS vs. tail duration without SpeQuloS
      [Figures: fraction of BoTs where tail removal efficiency > P, as a function of P (percentage), one panel per deployment strategy. Flat: 9C-G-F, 9A-G-F, V-G-F, 9C-C-F, 9A-C-F, V-C-F. Reschedule: 9C-G-R, 9A-G-R, V-G-R, 9C-C-R, 9A-C-R, V-C-R. Cloud Duplication: 9C-G-D, 9A-G-D, V-G-D, 9C-C-D, 9A-C-D, V-C-D.]
      Best strategies are able to
        Suppress the tail for 50% of executions
        Halve the tail for 80% of executions
      Flat (F) < Reschedule (R) & Cloud Duplication (D)
      Tail Detection (V) triggers the Cloud too late
    • Cloud Resources Consumption
      Percentage of credits spent vs. credits provisioned (= 10% of BoT workload).
      10% to 25% of the provisioned credits are actually used by Cloud resources
      [Figure: percentage of credits used for each combination of SpeQuloS strategies (9C-G-F, 9C-G-R, 9C-G-D, 9C-C-F, 9C-C-R, 9C-C-D, 9A-G-F, 9A-G-R, 9A-G-D, 9A-C-F, 9A-C-R, 9A-C-D, V-G-F, V-G-R, V-G-D, V-C-F, V-C-R, V-C-D).]
      → ≈ 2.5% of the BoT workload is executed on the Cloud
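The step from "10% to 25% of credits used" to "≈ 2.5% of the workload" is a short multiplication, sketched below. It assumes that spent credits translate one-to-one into workload executed on the Cloud, as suggested by credits being provisioned in cpu.hour equivalent; the function name is hypothetical.

```python
PROVISIONED_FRACTION = 0.10  # credits provisioned = 10% of the BoT workload


def cloud_workload_fraction(credits_used_fraction: float,
                            provisioned_fraction: float = PROVISIONED_FRACTION) -> float:
    """Fraction of the BoT workload run on the Cloud, given the share
    of the provisioned credits that was actually spent."""
    return provisioned_fraction * credits_used_fraction


low = cloud_workload_fraction(0.10)   # 10% of credits spent
high = cloud_workload_fraction(0.25)  # 25% of credits spent
print(f"{low:.1%} to {high:.1%} of the BoT workload runs on the Cloud")
```

Under this reading, the ≈ 2.5% figure corresponds to the upper end of the reported range.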
    • Completion Time
      Combination of strategies used: 9C-C-R
      [Figures: completion time (s) without and with SpeQuloS on each BE-DCI (SETI, ND, G5KLYO, G5KGRE, SPOT10, SPOT100), one panel per middleware and BoT category: BOINC & SMALL BoT, BOINC & BIG BoT, BOINC & RANDOM BoT, XWHEP & SMALL BoT, XWHEP & BIG BoT, XWHEP & RANDOM BoT.]
      → Up to 9x speedup
      → Depends on the middleware used and on BE-DCI volatility
    • Completion Time Prediction
      → Users can ask for a prediction at any moment of BoT execution
      Predicted completion time: $t_p = \alpha \times \frac{t(r)}{r}$
        Current completion ratio: r
        Time elapsed since submission: t(r)
        α: adjustment factor, depends on the execution environment:
          DG server & middleware
          Application & BoT size
        → Adjusted after BoT execution to minimize the difference with the observed completion time
      Statistical uncertainty (±x%): success rate of the prediction on previous executions
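The prediction formula can be written down directly. The slide does not say how α is adjusted, so the least-squares fit through the origin used in `fit_alpha` is an assumption chosen for illustration, as are the function names and the sample history.

```python
def predict_completion_time(elapsed: float, ratio: float, alpha: float = 1.0) -> float:
    """t_p = alpha * t(r) / r, with r the current completion ratio in (0, 1]."""
    if not 0 < ratio <= 1:
        raise ValueError("completion ratio must be in (0, 1]")
    return alpha * elapsed / ratio


def fit_alpha(history: list[tuple[float, float, float]]) -> float:
    """Fit alpha on past executions given as (elapsed, ratio, observed_time).

    Assumption: alpha minimizes the squared error between alpha * t(r) / r
    and the observed completion time (least squares through the origin).
    """
    xs = [elapsed / ratio for elapsed, ratio, _ in history]
    ys = [observed for _, _, observed in history]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical history: two BoTs sampled at 50% completion.
alpha = fit_alpha([(50.0, 0.5, 120.0), (30.0, 0.5, 72.0)])
print(round(alpha, 3), round(predict_completion_time(40.0, 0.5, alpha), 1))
```

In this made-up history both executions finished 20% later than the naive extrapolation t(r)/r, so the fitted α comes out near 1.2.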
    • Prediction Results
      Completion time prediction:
        Made at 50% of BoT execution
        Uncertainty: ± 20%
        α adjusted after 30 executions with the same BE-DCI, middleware and BoT workload

      Prediction success rate (%) by BoT category & middleware:
                SMALL             BIG               RANDOM
      BE-DCI    BOINC   XWHEP     BOINC   XWHEP     BOINC   XWHEP     Mixed
      seti      100     100       100     82.8      100     87.0      94.1
      nd        100     100       100     100       100     96.0      99.4
      g5klyo    88.0    89.3      96.0    87.5      75      75        85.6
      g5kgre    96.3    88.5      100     92.9      83.3    34.8      83.3
      spot10    100     100       100     100       100     100       100
      spot100   100     100       100     100       76      3.6       78.3
      Mixed     97.6    96.1      99.2    93.5      89.6    65.3      90.2

      → Successful prediction in 9 cases out of 10
      → Lower results with heterogeneous BoTs
      → Needs a learning phase with the same BoT (or at least the same application), executed on the same BE-DCI.
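One way to read the success rates in this table is sketched below. It assumes a prediction counts as successful when the observed completion time falls within ±20% of the predicted one; that criterion, the function names, and the sample data are assumptions rather than details given on the slide.

```python
def is_successful(predicted: float, observed: float, uncertainty: float = 0.20) -> bool:
    """True if the observed time lies within +/- uncertainty of the prediction."""
    return abs(observed - predicted) <= uncertainty * predicted


def success_rate(pairs: list[tuple[float, float]], uncertainty: float = 0.20) -> float:
    """Percentage of (predicted, observed) pairs that count as successful."""
    hits = sum(is_successful(p, o, uncertainty) for p, o in pairs)
    return 100.0 * hits / len(pairs)


# Hypothetical predictions of 100 time units against four observed times.
runs = [(100.0, 110.0), (100.0, 125.0), (100.0, 81.0), (100.0, 79.0)]
print(success_rate(runs))
```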
    • SpeQuloS Deployment in the European Desktop Grid Initiative
      EDGI project: bringing European Desktop Grid computing resources to scientific communities.
    • Conclusion
      BE-DCIs: “low-cost” solution but poor QoS (tail effect)
      SpeQuloS: uses Cloud resources to improve the QoS delivered to BE-DCI users
        Efficiently removes the tail problem
        → Speeds up BoT execution
        → Only requires a few % of the workload to be executed on the Cloud
        Enables completion time prediction for users
      → A step towards BE-DCI usability in the computing landscape?
      Future work:
        Better strategies to anticipate problems (tail effect)
        Analysis of user feedback from SpeQuloS deployments