Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience

Presented at ADAPTIVE 2011.

The High Energy Physics community requires a large amount of compute resources and had to adopt the Grid computing paradigm to satisfy its needs. Scheduling jobs in the Grid environment is, however, very challenging: the participating resource providers enjoy a high degree of autonomy, so the user community must constantly adapt to ever-changing conditions. The CMS experiment addressed this problem by developing the glideinWMS system, which uses an overlay compute pool and a few simple rules for provisioning the needed resources. This approach has been very successful and is now being used by several other communities in the Open Science Grid. This presentation describes the glideinWMS system and the algorithms used, and analyzes the experience CMS has had using the system.

Paper available at: http://www.thinkmind.org/index.php?view=article&articleid=adaptive_2011_2_20_50040

ADAPTIVE 2011 web site: http://www.iaria.org/conferences2011/ADAPTIVE11.html

  1. Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience
     Adaptive 2011
     by Igor Sfiligoi (UCSD), Benjamin Hass (UCSD), Frank Würthwein (UCSD), and Burt Holzman (FNAL)
     Rome, Sep 2011, Adapting with few simple rules in glideinWMS
  2. The Grid landscape
     ● Within scientific Grid environments (e.g. OSG, EGI)
       ● Many highly autonomous Grid sites
       ● Many diverse user communities
       ● Nothing shared
     ● How can users efficiently schedule their jobs?
  3. Scheduling problem
     ● Grid sites expose only partial information
       ● Access to finer details restricted to site admins
     ● Each user community wants independence
       ● No centralized, Grid-wide job scheduling
     ● As a result
       ● Cannot accurately predict even the near future
       ● Partitioning across sites is mostly guesswork
       ● Adapting to the ever-changing state is a must
  4. Traditional approaches
     ● Force sites to expose as much info as possible
       ● Sites end up publishing lots of garbage
     ● Implement retries
       ● Long tail before ALL jobs in a workflow finish
     ● Start at many sites concurrently, then kill some
       ● Wasteful and with semantic problems
     ● Mediocre results and complex code
  5. The glideinWMS
     ● The glideinWMS approach to the problem
       ● Use the pilot paradigm
       ● Pressure based scheduling
       ● Avoid using external information
       ● Range reduction
     ● The glideinWMS is a Grid job scheduler initially developed at FNAL by the CMS experiment
       ● Based on the CDF glideCAF concept
       ● With contributions from several other institutes
       ● Widely used in OSG, with a large instance at UCSD
  6. The pilot paradigm
     ● Send pilots to Grid sites (never user jobs)
       ● Pilots are not user specific
     ● Create a dynamic overlay pool of compute resources
       ● One pool per user community
       ● Jobs scheduled within this overlay pool
     ● Scheduling in the overlay pool is easy
       ● Complete info
       ● Full control
     ● Problem moved to the pilot submitter
     [Diagram: pilot submitter sending pilots to Site 1 ... Site N, which together form the overlay pool]
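
As a rough illustration of the pilot paradigm on slide 6, the Python sketch below models late binding: pilots carry no user job when they are sent to a site, and a pilot that actually starts pulls the next waiting job from a single overlay queue. The names `OverlayPool` and `run_pilot`, and the per-pilot limit `max_jobs`, are hypothetical and are not part of glideinWMS.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OverlayPool:
    """Hypothetical overlay pool: one central queue of user jobs."""
    queue: deque = field(default_factory=deque)

    def submit(self, job: str) -> None:
        self.queue.append(job)

    def next_job(self):
        # Return the next waiting job, or None when the queue is empty.
        return self.queue.popleft() if self.queue else None


def run_pilot(site: str, pool: OverlayPool, max_jobs: int = 2) -> list:
    """A pilot that has started at `site` pulls jobs until the queue is
    empty or it has run `max_jobs` (an assumed per-pilot limit)."""
    done = []
    for _ in range(max_jobs):
        job = pool.next_job()
        if job is None:
            break  # nothing to run: the pilot exits, wasting only its own slot
        done.append((site, job))
    return done


pool = OverlayPool()
for name in ["job-a", "job-b", "job-c"]:
    pool.submit(name)

# Pilots are identical; which site runs which job is decided only at start time.
results = run_pilot("Site 1", pool) + run_pilot("Site N", pool)
```

Because jobs are bound to a slot only after a pilot has started, a pilot that never starts or finds the queue empty costs CPU time but never strands a user job.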
  7. Is pilot scheduling easier?
     ● User jobs
       ● Every job is important => users wait for last to finish
       ● A failed job is a problem for the user
       ● Many users => priority handling between them
       ● Must handle each and every one
     ● Pilot jobs
       ● All the same
       ● A failed pilot job is just wasted CPU time
       ● Single credential => no need to prioritize
       ● Only number of pilot jobs counts
  8. Pressure based scheduling
     ● The glideinWMS pilot scheduling is based on the concept of pilot pressure
       ● Keep a fixed no. of pending pilots in remote queues
       ● Site by site
     ● Furthermore, split pilot scheduling from pilot submission
       ● Scheduling in VO frontend
     [Diagram: VO frontend, Glidein factories, and pilots at Site 1 ... Site N]
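
The "keep a fixed number of pending pilots" rule on slide 8 can be sketched as a per-site top-up calculation. This is a minimal sketch under stated assumptions: the function name `pilots_to_submit` and the site snapshot are hypothetical, and the slides do not say whether excess queued pilots are ever removed, so this version only adds pilots.

```python
def pilots_to_submit(target_pressure: int, pending_pilots: int) -> int:
    """Number of new pilots needed to bring a site's queue up to the target.

    Never negative: this sketch does not remove pilots already queued.
    """
    return max(0, target_pressure - pending_pilots)


# Hypothetical snapshot: site name -> (target pressure, pilots currently pending)
sites = {"Site 1": (10, 4), "Site 2": (10, 10), "Site N": (3, 7)}

# Evaluated site by site, as on the slide.
requests = {
    name: pilots_to_submit(target, pending)
    for name, (target, pending) in sites.items()
}
```

In the split described on the slide, a calculation like this would sit on the scheduling side (VO frontend), with the resulting request handed to whatever component performs the submission.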
  9. Determining the pressure
     ● Calculating the proper pressure is important
       ● Too low => small overlay pool => long job wait
       ● Too high => no jobs when pilot starts => wasted CPU
     ● Must be recalculated often: $P_s(t) = f(I_s(t))$
     ● Each site has its own pressure
     ● Input to pressure calculation
       ● Only the no. of matching pending (i.e. idle) user jobs
       ● Grid status is incomplete and unreliable
     ● Some jobs can run on multiple Grid sites
       ● Count them as the appropriate fraction against each
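
The per-site input I_s(t) from slide 9 can be sketched as follows. The slide says only that a job matching several sites is counted "as the appropriate fraction against each"; the even 1/k split used here is an assumption, and the function name `idle_jobs_per_site` and the input layout are hypothetical.

```python
from collections import defaultdict


def idle_jobs_per_site(idle_jobs):
    """Compute a fractional idle-job count per site.

    idle_jobs: list of lists, each inner list naming the sites one idle
    job matches. A job matching k sites adds 1/k to each of them
    (assumed even split, not confirmed by the slides).
    """
    counts = defaultdict(float)
    for matching_sites in idle_jobs:
        if not matching_sites:
            continue  # a job that matches no site contributes nothing
        share = 1.0 / len(matching_sites)
        for site in matching_sites:
            counts[site] += share
    return dict(counts)


# Hypothetical queue: one job pinned to A, two that fit A or B, one that fits B or C.
idle = [["A"], ["A", "B"], ["A", "B"], ["B", "C"]]
per_site = idle_jobs_per_site(idle)
```

With this split the per-site counts always sum to the number of matchable idle jobs, so a job that fits many sites does not inflate the total demand.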
  10. Simple pressure function
     ● Experience tells us Grid jobs have relatively flat start and terminate rates
       ● Typical O(10/few mins), max O(100/few mins)
     ● So pressure can be capped in the O(10) range
     ● Small range => tuning only when few jobs
       ● Using simple heuristic of dividing by 3
       ● Just to have a reasonable edge-case policy
     $f(I_s(t)) = \min(I_s(t)/3, C_s)$
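
The capped pressure function on slide 10, f(I_s(t)) = min(I_s(t)/3, C_s), is short enough to write out directly. The slide does not say how fractional results are rounded; rounding up, so that a single idle job still requests one pilot, is an assumption of this sketch, as is the function name `pressure`.

```python
import math


def pressure(idle_jobs: float, cap: int) -> int:
    """Capped pressure for one site: min(idle_jobs / 3, cap).

    Rounding up is an assumption; the slides do not specify it.
    """
    return min(math.ceil(idle_jobs / 3), cap)


# With a cap of 10, the divide-by-3 rule only matters when few jobs are idle.
examples = [pressure(n, cap=10) for n in (0, 1, 12, 300)]
```

The example values show the two regimes: below roughly 3 × C_s idle jobs the pressure tracks demand, and above it the cap alone sets the number of pending pilots.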
  11. Operational experience (1)
     ● CMS@UCSD has 2 years of experience
       ● Serving O(4k) users
       ● Using O(100) Grid sites located in the Americas, Europe and Asia
     [Figures: Status of the overlay pool; Grid sites concurrently used]
  12. Operational experience (2)
     ● CMS@UCSD has 2 years of experience
     ● The glideinWMS logic works very efficiently
       ● Quick job startup times
     [Figure: Status of the overlay pool]
  13. Operational experience (3)
     ● CMS@UCSD has 2 years of experience
     ● The glideinWMS logic works very efficiently
       ● Quick job startup times
       ● Little over-provisioning (~5%)
     [Figure: Status of the overlay pool]
  14. Related work
     ● Non-pilot WMS (i.e. direct submission)
       ● gLite WMS and OSG MM
       ● More complex and brittle since they require accurate and complete info from Grid sites
     ● Pilot WMS
       ● PANDA: pressure based, with basically constant pressure over time => high load on sites
       ● DIRAC and MyCluster: require services at Grid sites to gather site state => many Grid sites do not allow this
  15. Summary
     ● Direct Grid-wide job scheduling is hard
       ● Pilot paradigm simplifies it by making it uniform
     ● The glideinWMS uses pressure logic
       ● Based on number of pending user jobs only
       ● Pressure function capped => simple rules
     ● CMS experience at UCSD shows it works
       ● and it works well
  16. For more information
     ● The glideinWMS home page: http://tinyurl.com/glideinWMS
     ● Relevant papers:
       ● I. Sfiligoi et al., "The pilot way to grid resources using glideinWMS," CSIE, WRI World Cong. on, vol. 2, pp. 428-432, 2009, doi:10.1109/CSIE.2009.950
       ● The CMS Collaboration et al., "The CMS experiment at the CERN LHC," J. Inst, vol. 3, S08004, pp. 1-334, 2008, doi:10.1088/1748-0221/3/08/S08004
  17. Acknowledgment
     ● This work is partially sponsored by
       ● the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG), and
       ● the US National Science Foundation under Grant No. PHY-0612805 (CMS Maintenance & Operations).
  18. Copyright notice
     ● This presentation contains graphics copyright of Toon-a-day that were licensed to Igor Sfiligoi for use in this presentation
       ● Any other use strictly prohibited
