An argument for moving the requirements out of user hands - The CMS Experience

340 views
283 views

Published on

This talks makes an argument why users should not write arbitrary Requirements expressions in Condor, but only express the desires, and let someone else write the actual policy.

Presentation at Condor Week 2012.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
340
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

An argument for moving the requirements out of user hands - The CMS Experience

  1. 1. Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented by Igor Sfiligoi UC San DiegoCondor Week 2012 - May 2012 No user requirements 1
  2. 2. The basics ● Condor architecture clearly separates ● Resource providers Machines (aka worker nodes) CPUs, Memory, IO,... from ● Resource consumers Job queues (aka submit nodes) Jobs submitted by users ● Each has a daemon process to represent it Schedd Negotiator Startd for resource provides ... ● Collector Schedd Startd ● Schedd for resource consumers ... Startd ● A central service connects them all ... Startd ● Managed by a Collector/Negotiator pairCondor Week 2012 - May 2012 No user requirements 2
  3. 3. Matchmaking ● In order for a job to start running on a resource ● The job requirements must evaluate to True ● The machine requirements must evaluate to True There is also the ranking, but thats 2nd level optimizationCondor Week 2012 - May 2012 No user requirements 3
  4. 4. Matchmaking ● In order for a job to start running on a resource ● The job requirements must evaluate to True ● The machine requirements Most manuals must evaluate to True focus on job reqs There is also the ranking, but thats 2nd level optimizationCondor Week 2012 - May 2012 No user requirements 4
  5. 5. Matchmaking ● In order for a job to start running on a resource ● The job requirements must evaluate to True ● The machine requirements Most manuals must evaluate to True focus on job reqs There is also the ranking, but thats 2nd level optimization Machine reqs deemed for handling Owner state onlyCondor Week 2012 - May 2012 No user requirements 5
  6. 6. My argument Get rid of Job Requirements Put all logic in the Machine RequirementsCondor Week 2012 - May 2012 No user requirements 6
  7. 7. Some background ● I am a big glideinWMS user http://tinyurl.com/glideinWMS ● And glideinWMS has 2 level matchmaking ● One at VO Frontend level – where to send glideins ● One at Negotiator level – which job to start in a glidein Collector Schedd Negotiator glidein Execution node Startd Frontend Job FactoryCondor Week 2012 - May 2012 No user requirements 7
  8. 8. Matchmaking problem ● Both levels must be in sync, or you either ● Ask for glideins which never match any jobs Job matched to site in VO Frontend but not to sites machine in the NegotiatorCondor Week 2012 - May 2012 No user requirements 8
  9. 9. Matchmaking problem ● Both levels must be in sync, or you either ● Ask for glideins which never match any jobs, or ● Have job waiting in the queue when site available Job doesnt match to site in VO Frontend but would to sites machine in the Negotiator if glideins were requestedCondor Week 2012 - May 2012 No user requirements 9
  10. 10. Matchmaking problem ● Both levels must be in sync, or you either ● Ask for glideins which never match any jobs, or ● Have job waiting in the queue when site available ● But site and machine adds have different attributes ● Not all machines on a site are exactly the same So I need 2 different requirementsCondor Week 2012 - May 2012 No user requirements 10
  11. 11. The usual dilemma ● Where do I define these requirements? ● In user job ClassAd? ● Or in the Resource ClassAds? – And we have 2 different resource types hereCondor Week 2012 - May 2012 No user requirements 11
  12. 12. The usual dilemma ● Where do I define these requirements? ● In user job ClassAd? ● Or in the Resource ClassAds? – And we have 2 different resource types here The 2 resources are handled by the same admin (in VO Frontend)Condor Week 2012 - May 2012 No user requirements 12
  13. 13. The usual dilemma ● Where do I define these requirements? ● In user job ClassAd? ● Or in the Resource ClassAds? – And we have 2 different resource types here Potentially O(1k) The 2 resources users! are handled by the same admin (in VO Frontend) Typically O(1)Condor Week 2012 - May 2012 No user requirements 13
  14. 14. The usual dilemma ● Where do I define these requirements? ● In user job ClassAd? ● Or in the Resource ClassAds? – And we have 2 different resource types here And do you really trust your users The 2 resources to do the right thing? are handled by the same admin (in VO Frontend) Potentially Typically O(1k) O(1)Condor Week 2012 - May 2012 No user requirements 14
  15. 15. Moving reqs to the resources ● So we went for setting the requirements in the resources themselvesCondor Week 2012 - May 2012 No user requirements 15
  16. 16. Moving reqs to the resources ● So we went for setting the requirements in the resources themselves ● But users still need a way to select resources! ● How do they do it???Condor Week 2012 - May 2012 No user requirements 16
  17. 17. Moving reqs to the resources ● So we went for setting the requirements in the resources themselves ● But users still need a way to select resources! ● How do they do it??? They express their needs through attributes!Condor Week 2012 - May 2012 No user requirements 17
  18. 18. Fixed schema ● The resource provider defines the requirements a.k.a startd + VO Frontend ● The VO Frontend admin in our case ● Those requirements look for well-defined user-provided attributes Site attribute Machine attribute Job attribute entry_req = stringListMember(GLIDEIN_Site,DESIRED_Sites)&& ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False)Example Start = stringListMember(GLIDEIN_Site,DESIRED_Sites)&& ((Memory>DESIRED_Mem)=!=False)Condor Week 2012 - May 2012 No user requirements 18
  19. 19. Simple user job submit file ● No complex requirements to write ● Very little user training ● Low error rate a.submit Executable = a.sh Output = a.out +DESIRED_Sites=”UCSD,Nebraska” Requirements=True queueCondor Week 2012 - May 2012 No user requirements 19
  20. 20. How well does it work? ● Advantages ● Disadvantages ● Easy to keep the two ● Rigid schema levels in sync ● Easy to define reasonable defaults ● Easy on the users ● And more... (wait for later slides)Condor Week 2012 - May 2012 No user requirements 20
  21. 21. How well does it work? ● Advantages ● Disadvantages ● Easy to keep the two ● Rigid schema levels in sync ● Easy to define reasonable defaults ● Easy on the users ● And more... In our experience, (wait for later slides) does not need to change more than a couple of times a yearCondor Week 2012 - May 2012 No user requirements 21
  22. 22. Is this glideinWMS specific? ● Advantages ● Disadvantages ● Easy to keep the two Dont think so! ● Rigid schema levels in sync ● Easy to define reasonable defaults The ease of use ● Easy on the users is there for any Condor setupCondor Week 2012 - May 2012 No user requirements 22
  23. 23. Is this glideinWMS specific? ● Advantages ● Disadvantages ● Easy to keep the two Dont think so! ● Rigid schema levels in sync ● Easy to define reasonable defaults ● Easy on the users The ease of use I would argue it is there for should become any Condor setup standard practiceCondor Week 2012 - May 2012 No user requirements 23
  24. 24. And there is still more to itCondor Week 2012 - May 2012 No user requirements 24
  25. 25. Side effect ● Discovered one unexpected nice side effectCondor Week 2012 - May 2012 No user requirements 25
  26. 26. Side effect ● Discovered one unexpected nice side effect We can outsmart our users!Condor Week 2012 - May 2012 No user requirements 26
  27. 27. The overflow use case ● Normally, CMS jobs run near the data ● So users provide a whitelist of sites to run on ● And we have the appropriate glideinWMS expression a.submit Executable = a.sh Output = a.out +DESIRED_Sites=”UCSD,Nebraska” Requirements=True entry_req = stringListMember(GLIDEIN_Site,DESIRED_Sites)&& queue ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False)Example Start = stringListMember(GLIDEIN_Site,DESIRED_Sites)&& ((Memory>DESIRED_Mem)=!=False)Condor Week 2012 - May 2012 No user requirements 27
  28. 28. The overflow use case ● Normally, CMS jobs run near the data ● But some jobs could run over the WAN ● It is just slightly less efficient Xrootd based WAN access if you are interestedCondor Week 2012 - May 2012 No user requirements 28
  29. 29. The overflow use case ● Normally, CMS jobs run near the data ● Jobs could run over the WAN ● It is just slightly less efficient ● But if some CPUs are idle due to low demand ● A low efficiency job is still better than no job! ● As long as it results in users getting We call this their results sooner “overflowing” – i.e. only if “optimal resources” are not availableCondor Week 2012 - May 2012 No user requirements 29
  30. 30. The overflow use case ● Normally, CMS jobs run near the data ● Jobs could run over the WAN ● It is just slightly less efficient ● But if some CPUs are idle due to low demand ● A low efficiency job is still better than no job! ● As long as it results in users getting their results sooner So, how do we – i.e. only if implement this? “optimal resources” are not availableCondor Week 2012 - May 2012 No user requirements 30
  31. 31. Overflow configuration ● Essentially, we change the rules! ● Without involving the users ● We write the requirements based on where the data is, not where the CPUs are ● Since not all sites have xrootd installed entry_req = ((CurrentTime-QDate)>21600)&&(Country=?=”US”)&& stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&&Example ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False) Start = ((CurrentTime-QDate)>21600)&& stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&& ((Memory>DESIRED_Mem)=!=False)Condor Week 2012 - May 2012 No user requirements 31
  32. 32. Overflow configuration ● Essentially, we change the rules! ● Without involving the users No change to the user submit file! ● We write the requirements based on where the data is, not where the CPUs are a.submit Executable = a.sh ● Since not all sites have xrootd installed Output = a.out +DESIRED_Sites=”UCSD,Nebraska” Requirements=True entry_req = ((CurrentTime-QDate)>21600)&&(Country=?=”US”)&& queue stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&&Example ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False) Start = ((CurrentTime-QDate)>21600)&& stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&& ((Memory>DESIRED_Mem)=!=False)Condor Week 2012 - May 2012 No user requirements 32
  33. 33. Overflow configuration ● Essentially,decide And we can we change the rules! where to overflow from users ● Without involving the No change to the hours after user submit file! the jobs werethe requirements ● We write submitted based on where the data is, not where the CPUs are a.submit Executable = a.sh ● Since not all sites have xrootd installed Output = a.out +DESIRED_Sites=”UCSD,Nebraska” Requirements=True entry_req = ((CurrentTime-QDate)>21600)&&(Country=?=”US”)&& queue ”UCSD,Wisc” stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&&Example ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False) Start = ((CurrentTime-QDate)>21600)&& stringListsIntersect(DESIRED_Sites,”UCSD,Wisc”)&& ((Memory>DESIRED_Mem)=!=False)Condor Week 2012 - May 2012 No user requirements 33
  34. 34. Looking at the futureCondor Week 2012 - May 2012 No user requirements 34
  35. 35. Looking at the future ● The attribute schema now a fixed one ● At least regarding matchmaking ● Opens up new interesting possibilitiesCondor Week 2012 - May 2012 No user requirements 35
  36. 36. Looking at the future ● The attribute schema now a fixed one ● At least regarding matchmaking ● Can consider RDBMS techniques ● Example uses – RDMBS driven matchmaking – Tracking of job and/or machine history in a DBCondor Week 2012 - May 2012 No user requirements 36
  37. 37. Looking at the future ● The attribute schema now a fixed one ● At least regarding matchmaking ● Can consider RDBMS techniques ● Example uses – RDMBS driven matchmaking – Tracking of job and/or machine history in a DB ● Will likely need code changes in Condor – UCSD committed to try it our in the near future – If the results end up as good as hoped for, will push for official adoptionCondor Week 2012 - May 2012 No user requirements 37
  38. 38. Conclusions ● Matchmaking is made of requirements ● But there are two places where they can be defined ● Usually, users expected to set requirements ● CMS glideinWMS setup moves it completely into the resource domain ● Experience with no-user-req setup very positive ● Advocating that this should be the recommended way for all Condor deployments ● Also opens up interesting new venuesCondor Week 2012 - May 2012 No user requirements 38
  39. 39. Acknowledgments ● This work is partially sponsored by ● the US National Science Foundation under Grants No. PHY-1104549 (AAA) and PHY-0612805 (CMS Maintenance & Operations) and ● the US Department of Energy under Grant No. DE- FC02-06ER41436 subcontract No. 647F290 (OSG).Condor Week 2012 - May 2012 No user requirements 39

×