glideinWMS for users



    Matchmaking in glideinWMS
             in CMS
                     by Igor Sfiligoi (UCSD)




CERN, Dec 2012          glideinWMS matchmaking   1
Scope of this talk



                      This talk provides a
                 high level description of how
                  glideinWMS matchmaking
                        works in CMS.



                 The reader is expected to be familiar with the CMS experiment environment
                                        http://cms.web.cern.ch/


glideinWMS architecture
 ●   A reminder
      [Figure: glideinWMS architecture. Components shown: VO FE, two G.F. boxes (labelled "+3" and "+1"),
       the Grid hosting Condor Execute nodes, the Central manager (Negotiator) and the Submit nodes (Schedd).]




Two levels of matchmaking
 ●   First in the VO Frontend
      ●   To decide where to provision resources
      ●   i.e. where to send glideins
 ●   Then in the HTCondor Negotiator
      ●   To decide which Job gets the glidein Slot
 ●   The two must have compatible policies
      [Figure: the glideinWMS architecture diagram, repeated.]


Defining the policy
 ●    The VO FE configures the glideins
       ●   So it can define the Slot Requirements
 ●    The preferred strategy is to leave all policy
      decisions in the VO FE's hands, i.e. both
       ●   VO FE matchmaking policy
       ●   HTCondor matchmaking policy
           (it is easier to keep them in sync this way)
 ●    This implies
       ●   Users should not define Job Requirements
       ●   Instead, publish attributes describing requirements
     http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands


CMS Production @ CERN
                Policies




Description
 ●   The VO FE @ CERN serves
     the production needs
      ●   i.e. Reconstruction and MC production
 ●   Job submission is regulated by a service managed
     by a dedicated team,
     so jobs are
      ●   Targeted
      ●   Well behaved (at least by and large)



Matchmaking policy
 ●   Two dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
 ●   The actual policy is the AND of both
 ●   Both the VO FE policy and the HTCondor policy
     are defined in the VO FE instance configuration




Matching on Grid site name
 ●   User Jobs are expected to publish the attribute
     DESIRED_Sites (a string list)
      ●   e.g. +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
 ●   The G.F. and the glideins advertise
     GLIDEIN_CMSSite
 ●   The matchmaking policy is
     GLIDEIN_CMSSite ∈ DESIRED_Sites
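The membership test above can be illustrated with a small Python sketch. This is not the actual Frontend or Negotiator expression: the ads are modelled as plain dictionaries, and the helper names `parse_string_list` and `site_matches` are hypothetical, introduced only for illustration.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def site_matches(job_ad, glidein_ad):
    """Return True when GLIDEIN_CMSSite is a member of DESIRED_Sites."""
    desired = parse_string_list(job_ad.get("DESIRED_Sites", ""))
    return glidein_ad.get("GLIDEIN_CMSSite") in desired


# Job and glidein ads modelled as plain dictionaries (an assumption of this sketch).
job = {"DESIRED_Sites": "T2_DE_DESY,T2_US_UCSD"}
print(site_matches(job, {"GLIDEIN_CMSSite": "T2_US_UCSD"}))
print(site_matches(job, {"GLIDEIN_CMSSite": "T1_US_FNAL"}))
```

A job that does not publish DESIRED_Sites matches no glidein in this sketch, which is why production jobs are expected to always set it.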




Matching on Job Type
 ●   User Jobs can publish the attribute
     DESIRES_HTPC (integer representation of a Boolean value)
      ●   e.g. +DESIRES_HTPC = 1
      ●   If not defined, defaults to 0
 ●   The G.F. and the glideins may advertise
     GLIDEIN_Is_HTPC (Boolean value)
      ●   If not defined, defaults to False
 ●   The matchmaking policy is
     (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1)
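The policy says that HTPC jobs only run on HTPC glideins and single-CPU jobs only on non-HTPC glideins. A minimal Python sketch of that equality, with the documented defaults applied, could look as follows; the dictionaries and the function name `htpc_matches` are illustrative assumptions, not part of glideinWMS.

```python
def htpc_matches(job_ad, glidein_ad):
    """Match only when the job and the glidein agree on HTPC."""
    # DESIRES_HTPC is an integer flag on the job; it defaults to 0.
    wants_htpc = job_ad.get("DESIRES_HTPC", 0) == 1
    # GLIDEIN_Is_HTPC is a Boolean on the glidein; it defaults to False.
    is_htpc = glidein_ad.get("GLIDEIN_Is_HTPC", False) is True
    return is_htpc == wants_htpc


print(htpc_matches({"DESIRES_HTPC": 1}, {"GLIDEIN_Is_HTPC": True}))
print(htpc_matches({}, {"GLIDEIN_Is_HTPC": True}))
```

Because both sides have defaults, a job and a glidein that publish neither attribute still match each other.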


Example submit file


          Universe       = vanilla
          Executable     = mcgen
          Arguments      = -k 1543.3
          Output         = mcgen.out
          Error          = mcgen.err
          Log            = mcgen.log
          +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
          +DESIRES_HTPC  = 0
          Requirements   = True
          Queue 1




CMS AnaOps @ UCSD
                      Policies




Description
 ●   VO FE @ UCSD serves CMS analysis users
 ●   User Jobs are much more chaotic
      ●   Most users don't really understand their needs
      ●   Must protect from accidental errors
      ●   Yet keep the system flexible
 ●   Net result
      ●   More complex policy



Two different policies
 ●   The AnaOps FE actually has two policies
      ●   The Regular policy
      ●   The Overflow policy
 ●   The Regular policy tries to match resources
      ●   Based on User desires
 ●   The Overflow policy “outsmarts” the Users
      ●   Will violate User desires without breaking the Jobs
      ●   The aim is to finish user jobs sooner
      ●   Users can opt out, if they wish
The Regular M.M. policy
 ●   Four+one dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
      ●   Memory usage
      ●   Job duration
                                           Due to preemption
      ●   Number of Job Starts
 ●   The actual policy is the AND of all of them
 ●   Both the VO FE policy and the HTCondor policy
     are defined in the VO FE instance configuration
Grid site selection
 ●   This is both similar to and different from
     the Production FE @ CERN
      ●   Serves the same purpose, but supports three
          different ways to select a site
           –     Due to historical evolution
 ●   The three options are
      ●   GLIDEIN_CMSSite ∈ DESIRED_Sites
      ●   GLIDEIN_SEs ∈ DESIRED_SEs
           –     Planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅
      ●   GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers
 ●   The actual policy is the OR of the three
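As a rough illustration of the OR of the three options, the following Python sketch checks each glidein attribute against the corresponding job list and succeeds on the first hit. It follows the current membership rule, in which each glidein attribute is treated as a single value; the planned intersection for GLIDEIN_SEs is not modelled. The dictionaries, `SELECTORS` and the function names are assumptions made for this sketch.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


# (glidein attribute, job attribute) pairs for the three selection options.
SELECTORS = (
    ("GLIDEIN_CMSSite", "DESIRED_Sites"),
    ("GLIDEIN_SEs", "DESIRED_SEs"),
    ("GLIDEIN_Gatekeeper", "DESIRED_Gatekeepers"),
)


def grid_site_matches(job_ad, glidein_ad):
    """Return True when at least one of the three membership tests succeeds."""
    for glidein_attr, job_attr in SELECTORS:
        value = glidein_ad.get(glidein_attr)
        desired = parse_string_list(job_ad.get(job_attr, ""))
        if value is not None and value in desired:
            return True
    return False


job = {"DESIRED_SEs": "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"}
glidein = {"GLIDEIN_CMSSite": "T2_IT_Pisa", "GLIDEIN_SEs": "stormfe1.pi.infn.it"}
print(grid_site_matches(job, glidein))
```

A job only needs to publish one of the three DESIRED_ lists; the other two tests simply fail on the empty list.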

Job type selection
 ●   Just like @ CERN




Memory Usage
●   Most Grid sites put strict limits on the amount of
    memory that can be used
    ●   Will kill glideins if they exceed the limit
●   G.F. and glideins advertise the Entry-specific limit
    GLIDEIN_MaxMemMBs
●   Jobs can explicitly declare the needed memory
    request_memory (native Condor attribute, no + needed)
     ● Condor will also measure it at run time
         –   ImageSize – Virtual memory used
         –   ResidentSetSize – True memory usage
     ● Use a combination of these to calculate the actual JobMemory
●   Policy: JobMemory <= GLIDEIN_MaxMemMBs
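The slides do not spell out how the declared and measured values are combined into JobMemory, so the Python sketch below is only one plausible reading. It assumes that JobMemory is the larger of request_memory (in MB) and the measured ResidentSetSize, falling back to ImageSize when no resident size is known, that the measured values are in KiB as HTCondor reports them, and that a missing value on either side does not block the match. All function names are hypothetical.

```python
KIB_PER_MB = 1024


def job_memory_mb(job_ad):
    """Estimate JobMemory in MB (the combination rule is an assumption)."""
    estimates = []
    if job_ad.get("request_memory") is not None:
        estimates.append(job_ad["request_memory"])
    # Prefer the true memory usage; fall back to the virtual size.
    measured_kib = job_ad.get("ResidentSetSize")
    if measured_kib is None:
        measured_kib = job_ad.get("ImageSize")
    if measured_kib is not None:
        estimates.append(measured_kib / KIB_PER_MB)
    return max(estimates) if estimates else None


def memory_matches(job_ad, glidein_ad):
    """Apply the policy JobMemory <= GLIDEIN_MaxMemMBs."""
    needed = job_memory_mb(job_ad)
    limit = glidein_ad.get("GLIDEIN_MaxMemMBs")
    if needed is None or limit is None:
        return True  # assumption: with nothing known, do not block the match
    return needed <= limit


glidein = {"GLIDEIN_MaxMemMBs": 2000}
print(memory_matches({"request_memory": 1500}, glidein))
print(memory_matches({"request_memory": 1500, "ResidentSetSize": 3145728}, glidein))
```

In the second call the job declared 1500 MB but was measured at 3072 MB on an earlier run, so the sketch no longer matches it to a 2000 MB glidein.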
Job Duration                  1/2




 ●   Glideins have a limited lifetime
      ●   Must fit within the limits of the Grid site's queue
      ●   Glideins publish the deadline
          GLIDEIN_ToDie
           –     Jobs must finish before reaching the deadline
 ●   Final user job lifetime is unpredictable
      ●   Depends on the type of computing done
      ●   User should indicate the expected job lifetime
           –     Else we have to assume reasonable defaults
           –     Not many users set these values right now
Job Duration                  2/2




 ●   The same type of computation may take
     different amounts of time
      ●   e.g. based on the type of input
 ●   Jobs can declare two attributes
      ●   NormMaxWallTimeMins – Expected limit
      ●   MaxWallTimeMins – Absolute max limit
 ●   The matchmaking logic is
      ●   Use NormMaxWallTimeMins for
          the first job startup
      ●   Use MaxWallTimeMins for all others
           –     Based on the simple assumption that the job
                 was killed for hitting the deadline
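The two-step logic can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the slides: the first startup is detected through the HTCondor job attribute NumJobStarts, GLIDEIN_ToDie is treated as an absolute Unix timestamp, and a job that declares no limit is not blocked (the Frontend would instead apply its own defaults). The function names are hypothetical.

```python
def wall_time_limit_mins(job_ad):
    """Pick the wall time limit to use for the next start of the job."""
    first_start = job_ad.get("NumJobStarts", 0) == 0
    if first_start and "NormMaxWallTimeMins" in job_ad:
        return job_ad["NormMaxWallTimeMins"]
    # After a restart, assume the job was killed for running too long.
    return job_ad.get("MaxWallTimeMins")


def duration_matches(job_ad, glidein_ad, now):
    """Return True when the job is expected to finish before GLIDEIN_ToDie."""
    limit = wall_time_limit_mins(job_ad)
    if limit is None:
        return True  # assumption: no declared limit, nothing to check here
    return now + limit * 60 <= glidein_ad["GLIDEIN_ToDie"]


job = {"NormMaxWallTimeMins": 7200, "MaxWallTimeMins": 14400}
glidein = {"GLIDEIN_ToDie": 600000}  # seconds, relative to now=0 in this example
print(duration_matches(job, glidein, now=0))
print(duration_matches(dict(job, NumJobStarts=1), glidein, now=0))
```

The same glidein accepts the job on its first start but rejects it after a restart, because the larger MaxWallTimeMins no longer fits before the deadline.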

Cut on number of re-starts
 ●   Not really a user-configurable property
      ●   More of an emergency brake
 ●   In a properly configured system,
     should never be triggered
      ●   But unexpected problems happen
      ●   So better limit the damage




The Overflow Use case
 ●   User Jobs specify a list of sites,
     because the data they need is there
 ●   With recent versions of CMSSW, jobs can
     access the data remotely
      ●   With a small performance penalty
 ●   We can thus schedule jobs “anywhere”
      ●   As long as the needed data is
          at a Site that has joined the xrootd federation
      ●   But only if no CPU is available “close to the data”
           –     And not too far, either
                      http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557
                      http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557

The Overflow M.M. policy
 ●   Violate only the “Site selection” rule
      ●   Keep all the others
 ●   Plus, add one+one more:
      ●   An opt-out mechanism
      ●   Delayed matching




New Site M.M. policy
 ●   The user specified attribute is used
     to flag the job as “Overflowable”
      ●   i.e. the job will match if and only if
          (DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠ ∅
           –     Still support all 3 types of site identification
 ●   Matching jobs can then run on any glidein
      ●   Additional limits can be put in place by the FE,
          but mostly invisible to the user
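The "Overflowable" test is a set intersection rather than a membership test. A minimal Python sketch is shown below; it assumes the supported lists are available to the Frontend as sets keyed by names of the form SUPPORTED_Sites, SUPPORTED_SEs and SUPPORTED_Gatekeepers, which are instances of the SUPPORTED_<site>s pattern chosen for illustration. The function names are hypothetical.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_overflowable(job_ad, supported):
    """Return True when any DESIRED_ list overlaps its SUPPORTED_ set."""
    for kind in ("Sites", "SEs", "Gatekeepers"):
        desired = parse_string_list(job_ad.get("DESIRED_" + kind, ""))
        if desired & supported.get("SUPPORTED_" + kind, set()):
            return True
    return False


# Hypothetical content of the supported sets.
supported = {"SUPPORTED_Sites": {"T2_US_UCSD", "T2_US_Nebraska"}}
print(is_overflowable({"DESIRED_Sites": "T2_DE_DESY,T2_US_UCSD"}, supported))
print(is_overflowable({"DESIRED_Sites": "T2_DE_DESY"}, supported))
```

Once a job passes this test, the glidein's own site no longer takes part in the decision, which is what lets the job run on any glidein.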




The opt-out mechanism
 ●   The Overflow policy
     considers all jobs by default
      ●   But Users may want to opt some of the Jobs out
           –     Sometimes it is a real need
                 (to get deterministic results, e.g. for testing a site)
 ●   To opt out, the user defines
     +CMS_ALLOW_OVERFLOW = False
      ●   The FE will not consider such jobs for Overflowing




Delayed matching
 ●   As said initially,
     Jobs should preferentially run close to the data
      ●   Overflow should only consider jobs
          “that cannot find resources close to the data”
 ●   We implemented it based on time
      ●   Jobs are matched only
          if waiting in the queue for more than 6 hours

                                    Users cannot influence it
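The opt-out flag and the 6 hour delay together decide whether the Overflow policy looks at a job at all. The Python sketch below combines the two; it assumes that the waiting time is measured from the HTCondor job attribute QDate (the submission time as a Unix timestamp), which the slides do not state. The constant and function names are hypothetical.

```python
OVERFLOW_DELAY_SECS = 6 * 3600  # the 6 hour threshold from the slide


def overflow_considers(job_ad, now):
    """Return True when the Overflow policy may consider this job."""
    # Jobs are considered by default; only an explicit False opts out.
    if job_ad.get("CMS_ALLOW_OVERFLOW", True) is False:
        return False
    # Only jobs that have waited in the queue for more than 6 hours qualify.
    return now - job_ad["QDate"] > OVERFLOW_DELAY_SECS


print(overflow_considers({"QDate": 0}, now=7 * 3600))
print(overflow_considers({"QDate": 0}, now=5 * 3600))
print(overflow_considers({"QDate": 0, "CMS_ALLOW_OVERFLOW": False}, now=7 * 3600))
```

The delay is a constant rather than a job attribute in this sketch, mirroring the fact that users cannot influence it.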




Example submit file

  Universe             = vanilla
  Executable           = myana
  Arguments            = -k 1543.3
  Output               = myana.out
  Error                = myana.err
  Log                  = myana.log
  request_memory       = 1500
  +DESIRED_SEs         = "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"
  +NormMaxWallTimeMins = 7200
  +MaxWallTimeMins     = 14400
  +DESIRES_HTPC        = 0
  +CMS_ALLOW_OVERFLOW  = True
  Requirements         = True
  Queue 1


The End




Pointers
 ●   glideinWMS Home Page
     http://tinyurl.com/glideinWMS
 ●   HTCondor Home Page
     http://research.cs.wisc.edu/htcondor/
 ●   HTCondor support
     htcondor-users@cs.wisc.edu
     htcondor-admin@cs.wisc.edu
 ●   glideinWMS support
     glideinwms-support@fnal.gov

Acknowledgments
 ●   The creation of this document was sponsored
     by grants from the US NSF and US DOE,
     and by the University of California system




