Reducing the Human Cost of Grid Computing with glideinWMS

438 views

Published on

Presented at Cloud Computing 2011.

The switch from dedicated, tightly controlled compute clusters to a widely distributed, shared Grid infrastructure has introduced significant operational overheads. If not properly managed, this human cost could grow to a point where it would undermine the benefits of increased resource availability of Grid computing. The glideinWMS system addresses the human cost problem by drastically reducing the number of people directly exposed to the Grid infrastructure. This presentation provides an analysis of what steps have been taken to reduce the human cost problem, alongside the experience of glideinWMS use within the Open Science Grid.

Paper available at: http://www.thinkmind.org/index.php?view=article&articleid=cloud_computing_2011_8_40_20068

Cloud Computing 2011 Web site: http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
438
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Reducing the Human Cost of Grid Computing with glideinWMS

  1. 1. Cloud Computing 2011 Reducing the Human Cost of Grid Computing with glideinWMS by Igor Sfiligoi1, F. Würthwein1, J.M. Dost1, I. MacNeill1, B. Holzman2, and P. Mhashilkar2 1 UCSD 2FNALRoma, Sept 2011 Reducing human cost with glideinWMS 1
  2. 2. Our environment - Grid computing ● Set of loosely coupled compute clusters (i.e. sites) ● Great for resource providers (i.e. site operators) ● High autonomy ● Easy sharing between communities (VOs) ● High utilization ● Not so great for users ● Actually, not too bad for users when things work ● But handling failures extremely time consuming – May need to contact multiple site adminsRoma, Sept 2011 Reducing human cost with glideinWMS 2
  3. 3. A problem of scale ● O(100) sites ● Aggregate of O(100k) CPUs ● At least a few sites have some broken nodes at any point in time ● O(10k) users ● O(100) users likely hit by those broken nodes every day ● If each spent even 30 mins debugging – O(10) scientific FTEs wasted (and I am being an optimist) – Plus drastic reduction in usability (users expect things “to just work”)Roma, Sept 2011 Reducing human cost with glideinWMS 3
  4. 4. The glideinWMS ● The glideinWMS approach to the problem ● Use the pilot paradigm ● Split pilot submission from pilot regulation ● Emphasize sharing The glideinWMS is a Grid job scheduler initially developed at FNAL by the CMS of pilot submission experiment ● Based on the CDF glideCAF concept service ● With contributions from several other institutes ● Widely used in OSG, with a large instance at UCSDRome, Sep 2011 Adapting with few simple rules in glideinWMS 4
  5. 5. The pilot paradigm ● Send pilots to Grid sites (never user jobs) ● Create a dynamic overlay pool of compute resources ● Handle user jobs within Pilots not user specific this overlay pool Site 1 ● A broken node will Pilot fail pilot jobs ● So they will not join Overlay pool Pilot submitter the overlay pool ● No user job ever fails Pilot ● Problem moved to One pool x Site N the pilot submitter user communityRoma, Sept 2011 Reducing human cost with glideinWMS 5
  6. 6. Cost reduction ● Difference in job types Estimates for a sizable OSG VO Entities Metric to debug ● All user jobs are precious => must recover O(10M) jobs O(100k) ● Pilot jobs are all the same Assuming 1% error rate => pilot failures not critical – Failures used to detect O(1k) nodes O(10) broken compute nodes Reduction by – Diagnose node problem several orders of magnitude ● Fewer humans exposed ● Can be more expert => lower cost per eventRoma, Sept 2011 Reducing human cost with glideinWMS 6
  7. 7. Traditional pilots & multiple VOs ● Each Site 1 user community (VO) Overlay wants its own pool Pilot pilot infrastructure Pilot submitter ● To maintain control Site k Pilot over Overlay submitter scheduling policies pool Pilot Site N ● Many pilot admins, debugging the same sitesRoma, Sept 2011 Reducing human cost with glideinWMS 7
  8. 8. Splitting the process ● The glideinWMS separates ● pilot submission Site 1 (glidein factory) Pilot ● from pilot regulation (VO frontend) Overlay Glidein pool factory ● Credential owed by VO frontend Pilot ● And delegated to factory VO frontend as needed Site N ● All scheduling policy implemented in the frontendRoma, Sept 2011 Reducing human cost with glideinWMS 8
  9. 9. The factory can be shared ● Each VO runs only its own VO frontend VO frontend Site 1 (with the associated Overlay pool Pilot overlay pool) ● While still having Site k Glidein factory full control over policy Overlay ● All debugging pool Pilot handled by a VO frontend Site N single factory teamRoma, Sept 2011 Reducing human cost with glideinWMS 9
  10. 10. Risk of common factory? ● A single factory is a single point of failure ● And possibly a scalability choke point ● The glideinWMS allows for multiple factories ● For redundancy, scalability, trust, etc. ● Of course the cost goes up ● How many factories to use is a balance between low cost and low risk ● Each VO can decide what works best for itRoma, Sept 2011 Reducing human cost with glideinWMS 10
  11. 11. OSG experience ● Operating a multi-VO factory since 2009 ● 12 VOs at the time of writing ● Gliding into ~100 Grid sites ● We include sites that claim to support the VOs we serve ● Significant fraction shared ● Weekly statistics One VO much bigger than the otherRoma, Sept 2011 Reducing human cost with glideinWMS 11
  12. 12. Effort investment ● About 1 FTE total ● Only fraction for actual Grid debugging ● Comparable fraction helping VOs debug problems between Grid nodes and their VO overlay pool ● We also help with know-how in configuring and operating the overlay poolRoma, Sept 2011 Reducing human cost with glideinWMS 12
  13. 13. Savings estimate ● Not counting the consulting services ● Those tend to be high at start-up and then level off ● For the remainder of the effort: 7x cheaperRoma, Sept 2011 Reducing human cost with glideinWMS 13
  14. 14. Conclusions ● Failures in a highly distributed system like the scientific Grids can have a high human cost ● The pilot paradigm drastically reduces this by ● Catching errors during provisioning ● Debugging by expert staff only ● The glideinWMS further reduces the cost by allowing for a shared pilot factory ● Confirmed by the OSG experienceRoma, Sept 2011 Reducing human cost with glideinWMS 14
  15. 15. For more information ● The glideinWMS home page http://tinyurl.com/glideinWMS ● Relevant papers and supporting material: ● I. Sfiligoi et al., "The pilot way to grid resources using glideinWMS," CSIE, WRI World Cong. on, vol. 2, pp. 428-432, 2009, doi:10.1109/CSIE.2009.950 ● Open Science Grid home page, http://www.opensciencegrid.org/Rome, Sep 2011 Adapting with few simple rules in glideinWMS 15
  16. 16. Acknowledgment ● This work is partially sponsored by ● US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG) ● the US National Science Foundation under Grants No. PHY-0612805 (CMS Maintenance & Operations), and OCI-0943725 (STCI).Rome, Sep 2011 Adapting with few simple rules in glideinWMS 16

×