Introduction to glideinWMS

Introductory talk on glideinWMS, aimed at the users of a glideinWMS installation.

Transcript

  • 1. glideinWMS for users: Introduction to glideinWMS, by Igor Sfiligoi (UCSD). CERN, Dec 2012.
  • 2. Scope of this talk
    This talk provides a user perspective of glideinWMS for users with previous experience with Grid computing. It does not provide much detail but concentrates on the concepts behind the system instead.
  • 3. The problem(s)
    ● Users have many jobs that must be run
      ● Each user has multiple tasks at once
    ● Resources are provided by O(100) Grid sites
    ● How do we schedule them to get the results in the shortest amount of time?
      ● Assuming one result per task
    ● How do we treat all users in a fair way?
      ● Independently of how many jobs they submit
  • 4. glideinWMS approach
    ● Separates
      ● Resource provisioning from
      ● Resource scheduling
    ● In practice, sends out pilot jobs to the Grid (never the user jobs!)
      ● And creates an “overlay batch system”
    ● Pilot jobs get ownership of the Grid slots
      ● At least for a limited time
    ● Known also as the pilot approach
    [Diagram: several Grid Sites joined into an overlay batch system]
  • 5. Job scheduling
    ● Once we have the overlay batch system, job scheduling works like in any dedicated batch system
      ● We own the batch system and can set the job scheduling policies
    ● glideinWMS is based on HTCondor
      ● So it can do whatever HTCondor can do
      ● Which is quite flexible
        – But also nothing more...
    ● HTCondor-based pilots are usually called glideins
    HTCondor (formerly known as Condor) is a widely used batch system. Thus glideinWMS stands for “glidein based Workflow Management System”. More details later on.
  • 6. Creating the overlay, i.e. resource provisioning
    ● glideinWMS will grow and shrink the overlay batch system automatically
      ● No human intervention needed
    ● Expansion is based on user jobs in the queue
      ● The more jobs, the faster it will try to grow
      ● Since not all jobs can run at all sites, different growth rates are attempted for different sites
    ● Shrinks automatically if resources are unused
      ● Again, based on user jobs in the queue
    Each glidein should run at least one user job, but will try to run many if the Grid slot is long enough.
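The last remark on slide 6 (one glidein may run several user jobs, and an unused glidein goes away) can be illustrated with a toy model. This is only a sketch under strong assumptions: the function name `run_glidein` is made up for illustration, and job runtimes are taken as known in advance, which a real glidein cannot assume.

```python
def run_glidein(slot_lifetime_s, job_runtimes_s):
    """Return the runtimes of the jobs one glidein manages to run.

    Toy model: the glidein keeps pulling jobs from the queue for as long as
    the next job still fits in the remaining lifetime of the Grid slot.
    """
    remaining = slot_lifetime_s
    completed = []
    for runtime in job_runtimes_s:
        if runtime > remaining:
            break  # not enough time left; the glidein exits and frees the slot
        completed.append(runtime)
        remaining -= runtime
    return completed


# A one-hour slot runs the first two 20-minute jobs, then stops.
print(run_glidein(3600, [1200, 1200, 1800, 600]))
# With an empty queue the glidein runs nothing and simply goes away,
# which is how the overlay shrinks in this toy model.
print(run_glidein(3600, []))
```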
  • 7. glideinWMS in a picture
    [Diagram; labels: Grid Site, glideinWMS, HTCondor CPU Handler, User Job, HTCondor Job Repository]
  • 8. From the user point of view
    ● Users see just a “regular” HTCondor system
      ● Just a dynamic one
    ● However
      ● Users have to be aware of the resource provisioning logic
        – No native HTCondor tools to help with this
      ● Debugging system problems is much harder
        – Again, no native HTCondor tools to help with this
        – The most likely question a user will ask is “Why is my job not starting?”
  • 9. A few more details
    ● So, glideinWMS is really HTCondor++
      ● First you have to understand how HTCondor works
      ● If you do, 99% of the problems are solved
    ● HTCondor is composed of 3 logical pieces
      ● Submit nodes – keep the job queue(s)
      ● Execute nodes – own and operate a resource
      ● A central manager – glues the other two categories together and executes policies
  • 10. HTCondor in a picture
    [Diagram: a Central manager connecting Submit nodes and Execute nodes]
  • 11. Even more details
    ● The actual work quanta are
      ● Jobs – on the submit node, typically many per node
      ● Slots – on the execute node, typically only a few per node
    ● While internally implemented differently, jobs and slots are conceptually very similar
      ● Both describe a logical entity (a “ClassAd” in HTCondor speak)
      ● Both have attributes describing it
      ● Both have requirements
    ● The HTCondor policy engine is really all about matchmaking jobs to slots
  • 12. Matchmaking
    ● Jobs that are not running (i.e. are “idle” in HTCondor speak) will be matched against Slots that don't yet run anything (i.e. are “Unclaimed” in HTCondor speak)
    ● Requirements expressions can (and usually do) reference attributes in the other ClassAd
    ● Both sides must evaluate to True for a match
      ● Although we encourage all logic to reside in the Slot Requirements in glideinWMS setups (more on this later on)
  • 13. A non-technical example
    Buyer Ad
        MyType = “Buyer”
        TargetType = “Pet”
        Requirements = (PetType == “Dog”) && (Price <= AcctBalance) && (Size == "Large" || Size == "Very Large")
        AcctBalance = 1000
        DogLover = True
        LegalName = “Curly Howard”
        ...
    Pet Ad
        MyType = “Pet”
        TargetType = “Buyer”
        Requirements = DogLover == True
        PetType = “Dog”
        Color = “Brown”
        Price = 75
        Breed = "Saint Bernard"
        Size = "Very Large"
        ...
    Buyer ~= Job
    Dog == Resource ~= Slot
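The two ads above can be mimicked in a few lines of plain Python to show what "both sides must evaluate to True" means. This is a minimal sketch, not HTCondor's ClassAd engine: ads are modelled as dictionaries, Requirements as ordinary functions, and the names `matches`, `buyer_requirements` and `pet_requirements` are made up for illustration.

```python
# Attributes copied from the Buyer Ad and Pet Ad on slide 13.
buyer = {"MyType": "Buyer", "AcctBalance": 1000, "DogLover": True,
         "LegalName": "Curly Howard"}
pet = {"MyType": "Pet", "PetType": "Dog", "Color": "Brown", "Price": 75,
       "Breed": "Saint Bernard", "Size": "Very Large"}


def buyer_requirements(my, target):
    # References attributes of the *other* ad (PetType, Price, Size)
    # as well as one of its own (AcctBalance).
    return (target.get("PetType") == "Dog"
            and target.get("Price", float("inf")) <= my["AcctBalance"]
            and target.get("Size") in ("Large", "Very Large"))


def pet_requirements(my, target):
    return target.get("DogLover") is True


def matches(ad_a, req_a, ad_b, req_b):
    """Symmetric match: each side's Requirements must accept the other ad."""
    return req_a(ad_a, ad_b) and req_b(ad_b, ad_a)


print(matches(buyer, buyer_requirements, pet, pet_requirements))

# The same buyer does not match a pet of a different type.
cat = dict(pet, PetType="Cat")
print(matches(buyer, buyer_requirements, cat, pet_requirements))
```

In the job/slot analogy, `buyer` plays the role of the Job ClassAd and `pet` the role of the Slot ClassAd.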
  • 14. Matching order
    ● Most of the time, there are way more idle jobs than Unclaimed slots
      ● So order is important
    ● Two policies
      1) Jobs from the highest priority user first
      2) Priority-FIFO policy for jobs of the same user
    ● User priority is based on usage
      ● The more resources you use, the lower the priority (with priority recovery over time)
      ● But some users may be marked as “more important” (priority multipliers and group quotas)
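The two ordering policies on slide 14 amount to a two-level sort. The sketch below is illustrative only: the job fields (`user`, `job_prio`, `submit_time`) and the function `negotiation_order` are made-up names, and it assumes the HTCondor conventions that a numerically lower user priority value is better and a numerically higher job priority runs first. Usage decay, priority multipliers and group quotas are left out.

```python
def negotiation_order(idle_jobs, user_priority):
    """Order idle jobs: best user first, then priority-FIFO within a user."""
    return sorted(
        idle_jobs,
        key=lambda job: (
            user_priority[job["user"]],  # lower value = better user priority
            -job["job_prio"],            # higher job priority first
            job["submit_time"],          # then first in, first out
        ),
    )


idle_jobs = [
    {"id": "a", "user": "alice", "job_prio": 0, "submit_time": 1},
    {"id": "b", "user": "bob", "job_prio": 0, "submit_time": 2},
    {"id": "c", "user": "alice", "job_prio": 5, "submit_time": 3},
    {"id": "d", "user": "bob", "job_prio": 0, "submit_time": 1},
]
# Alice has used many resources recently, so her priority value is worse.
user_priority = {"alice": 10.0, "bob": 0.5}

order = [job["id"] for job in negotiation_order(idle_jobs, user_priority)]
print(order)
```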
  • 15. The glideinWMS part
    ● i.e. the layer on top of HTCondor that finds resources on which to start the “Execute node” daemons
      ● i.e. glideins
    ● Composed of two parts:
      ● Glidein Factory – the abstraction layer
      ● VO Frontend – the brain
  • 16. Glidein Factory
    ● The splitting in two allows for the Glidein Factory to be generic (i.e. serve many VO FEs)
      ● I.e. it can (and should) be shared between many VOs
    ● The G.F. is really just an abstraction layer
      ● It insulates the VO from the provisioning details
        – e.g. knowing the name of the Grid CE and relative RSL
      ● Allows new technology to be added seamlessly (e.g. Clouds)
    ● It also provides a troubleshooting service
      ● The factory operators are supposed to address any Grid related problems they observe
  • 17. The VO Frontend
    ● The name may be misleading
      ● It is really the “matchmaker of Grid resources”
    ● Introduces a new quantum
      ● Entry – logical equivalent of a “queue at a site”; the basic working block of a G.F.
    ● The VO Frontend
      1) Matches idle Jobs to Entries
      2) Instructs the affected G.F. to increase or decrease the number of glideins on that Entry
    Thus it regulates the resource provisioning.
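The two Frontend steps on slide 17 can be sketched as a counting exercise. This is not the actual glideinWMS algorithm: the function `glidein_requests`, the `desired_sites` job attribute, the Entry names and the rule of splitting a job evenly among the Entries it matches are all assumptions made for illustration.

```python
import math
from collections import Counter


def glidein_requests(idle_jobs, entries):
    """Count, per Entry, how many glideins the idle jobs would justify."""
    pressure = Counter()
    for job in idle_jobs:
        # Step 1: match the idle job against the Entries.
        matching = [name for name, entry in entries.items()
                    if entry["site"] in job["desired_sites"]]
        # Assumption: a job matching several Entries is split evenly among
        # them, so it is not counted once per Entry.
        for name in matching:
            pressure[name] += 1 / len(matching)
    # Step 2: the per-Entry totals are what would be sent to the Factory.
    return {name: math.ceil(count) for name, count in pressure.items()}


entries = {
    "SiteA_queue": {"site": "SiteA"},
    "SiteB_queue": {"site": "SiteB"},
}
idle_jobs = ([{"desired_sites": {"SiteA"}}] * 3
             + [{"desired_sites": {"SiteA", "SiteB"}}] * 2)

requests = glidein_requests(idle_jobs, entries)
print(requests)
```

With fewer idle jobs in the queue the computed requests drop, which is the sense in which the Frontend regulates the provisioning in this toy model.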
  • 18. Updated glideinWMS picture
    [Diagram: the HTCondor picture (Central manager, Submit nodes, Execute nodes) extended with a VO FE and G.F.s submitting to the Grid; labels on the G.F.s: +3, +1]
  • 19. VO FE Matchmaking logic
    ● Based on Job attributes
      ● Jobs don't have “FE-specific requirements”
      ● The exact matchmaking policy depends on the VO FE instance (CMS policies will be described in a different talk)
    ● glideinWMS has 2-level matchmaking
      ● Once in the FE, then in the HTCondor C.M.
      ● It is recommended to avoid explicit “HTCondor requirements” in the Job ClassAd
        – The glideins should set “Slot requirements” based on the same attributes used by the VO FE, instead (since the VO FE configures the glideins)
  • 20. What is the user to do?
    0) Learn how to use HTCondor
    1) Learn what the VO FE policy is
    2) Create the HTCondor submit file (i.e. JDL) containing the necessary attributes
    3) Submit jobs
    4) Wait for the results to come back
    5) Rinse and repeat (from (2))
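Step 2 above can be made concrete with a small helper that renders a submit file. This is a hedged sketch: `render_submit` is a made-up helper, and the attribute name `DESIRED_Sites`, the site names and the script name are placeholders, since the attributes a job must carry depend on the policy of the VO Frontend in use. The `+Name = "value"` lines use the standard HTCondor submit syntax for adding custom attributes to the Job ClassAd.

```python
def render_submit(executable, arguments, custom_attrs, n_jobs=1):
    """Render an HTCondor submit description as text."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "output = job.$(Cluster).$(Process).out",
        "error = job.$(Cluster).$(Process).err",
        "log = job.$(Cluster).log",
    ]
    # Custom string attributes that a VO Frontend policy could match on.
    for name, value in custom_attrs.items():
        lines.append(f'+{name} = "{value}"')
    lines.append(f"queue {n_jobs}")
    return "\n".join(lines) + "\n"


# Hypothetical attribute and values; check your VO FE policy for the real ones.
submit_text = render_submit("analysis.sh", "$(Process)",
                            {"DESIRED_Sites": "SiteA,SiteB"}, n_jobs=10)
print(submit_text)
```

The resulting text can be saved to a file and passed to `condor_submit` (step 3).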
  • 21. This is it
    ● Hopefully you have a high level view of the system now
    ● More details in separate talks
  • 22. Pointers
    ● glideinWMS Home Page: http://tinyurl.com/glideinWMS
    ● HTCondor Home Page: http://research.cs.wisc.edu/htcondor/
    ● HTCondor support: htcondor-users@cs.wisc.edu, htcondor-admin@cs.wisc.edu
    ● glideinWMS support: glideinwms-support@fnal.gov
  • 23. Acknowledgments
    ● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system