Pilot Factory


This slide show illustrates preliminary work on the "pilot mechanism" using the Condor system. The goal is to create a uniform user interface to computational resources across the network and, at the same time, to increase the parallelism of user tasks towards optimal throughput in the long run.


  1. Pilot Factory using Schedd Glidein
     Barnett Chiu, BNL, 10.04.07
  2. Problem to Solve (1)
     • Pilot
       • Probe the resource (http, environment, interpreter, other executables, etc.)
       • Pull jobs from a remote server (e.g. Panda server)
       • Matchmaking
         • Group jobs in different categories, e.g. production jobs, analysis jobs (CHARMM …), test jobs …
         • Other criteria: number of CPUs, RAM, etc.
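The matchmaking step on this slide can be illustrated with a minimal Python sketch. It assumes a pilot has already probed its resource and holds a list of candidate jobs pulled from the server; the `Job` and `Resource` classes, their field names, and the sample values are all hypothetical and are not taken from Panda or Condor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; field names are illustrative only.
    job_id: int
    category: str      # e.g. "production", "analysis", "test"
    min_cpus: int
    min_ram_mb: int


@dataclass(frozen=True)
class Resource:
    # Hypothetical result of the pilot probing its host.
    cpus: int
    ram_mb: int
    categories: frozenset  # job categories this pilot accepts


def match_job(resource, queue):
    """Return the first queued job the probed resource can run, or None."""
    for job in queue:
        if (job.category in resource.categories
                and job.min_cpus <= resource.cpus
                and job.min_ram_mb <= resource.ram_mb):
            return job
    return None


queue = [
    Job(1, "production", min_cpus=4, min_ram_mb=8000),
    Job(2, "analysis", min_cpus=1, min_ram_mb=2000),
    Job(3, "test", min_cpus=1, min_ram_mb=500),
]
probed = Resource(cpus=2, ram_mb=4000,
                  categories=frozenset({"analysis", "test"}))
print(match_job(probed, queue))
```

Here the production job is skipped on both category and CPU count, so the analysis job is the first match. A real matchmaker would likely rank candidates rather than take the first fit.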
  3. Problem to Solve (2)
     • Current approach to pilot submission
       • Local pool: Vanilla
       • Remote pool: Condor-G
     • Large amounts of user jobs (production + analysis) ~ large amounts of Condor-G pilot jobs ~ computational overhead on gatekeepers (e.g. large memory consumption)
  4. Solution
     • Is there any way to bypass GRAM to submit jobs to remote machines?
     • Local submissions, but how?
       • We need something that continuously submits local pilot jobs on the gatekeeper
       • Solution: Pilot Factory
  5. Pilot Factory Overview
     • Pilot Factory is an application that combines the following ideas:
       • schedd glidein
       • pilot submission program (or pilot generator)
     • What is glidein?
       • Mini Condor pool on a remote machine
         • A complete Condor pool has at least 5 components, i.e. master, startd, schedd, collector, negotiator
         • Glidein: {master, startd}, {master, schedd}, … etc.
       • Properly configured Condor daemons submitted as a batch job
  6. Glidein (1)
     • Two major steps
       • Condor-G #1: installation
         • glidein setup script
         • Condor configuration file
         • glidein startup script
         • download Condor binaries (http, gsiftp, etc.)
       • Condor-G #2: execution
         • exec glidein startup script → condor_master
  7. Glidein (2)
     [Figure: glidein deployment diagram. Labels: ~/Condor_glidein, Startup script, Tarball server, Glidein config {master, schedd …}, Central Manager, Collector, Submit Host, Execute hosts, Glidein types (master, startd, schedd).]
  8. Schedd Glidein
     • Logic based on startd glidein (two Condor-G steps to set up)
     • Usage: by running a glidein schedd on the gatekeeper, the schedd serves as a gateway between the submit host and grid sites
     • Mini Condor pool with schedd functionality:
       • Submit host
       • Maintain a persistent queue of jobs
       • Communicate with the native batch system and forward user jobs
         • Condor, PBS, LSF, etc.
       • Manipulate job queues through the following commands:
         • condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio
       • Security features (GSI)
         • Who is authorized to set up Pilot Factory?
  9. Schedd Glidein Example (1)
     • Command: // schedd glidein #1
       condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk01.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
       Use fork since we want the schedd to be on the gatekeeper!
     • Command: // schedd glidein #2
       condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk02.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
     • Command: // schedd glidein #3, #4, #5
       condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork nostos.cs.wisc.edu/jobmanager-fork -type schedd -forcesetup
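The three commands on this slide differ only in the gatekeeper host and the count, so they could be generated from a small table. The Python sketch below only assembles the argument lists using the flags shown on the slide; it does not run anything. The helper name `glidein_command` and its default values are assumptions made for illustration.

```python
def glidein_command(host, count=1, arch="6.8.1-i686-pc-Linux-2.4",
                    jobmanager="jobmanager-fork", glidein_type="schedd"):
    """Build the condor_glidein argument list used on the slide.

    Hypothetical helper: it only mirrors the flags shown in the slide
    and does not validate them against any Condor version.
    """
    return [
        "condor_glidein",
        "-count", str(count),
        "-arch", arch,
        f"-setup_jobmanager={jobmanager}",
        f"{host}/{jobmanager}",
        "-type", glidein_type,
        "-forcesetup",
    ]


# Gatekeeper hosts and glidein counts as listed on the slide.
targets = [
    ("gridgk01.racf.bnl.gov", 1),
    ("gridgk02.racf.bnl.gov", 1),
    ("nostos.cs.wisc.edu", 3),
]

for host, count in targets:
    print(" ".join(glidein_command(host, count)))
```

Keeping the command as a list rather than a single string would make it straightforward to hand to `subprocess.run` later without shell quoting concerns.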
  10. Schedd Glidein Example (2)
      Command: condor_status -schedd

      Name                  Machine     TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
      agrd0926@gridgk01.ra  gridgk01.r                 0              0              0
      agrd0926@gridgk02.ra  gridgk02.r                 0              0              0
      pleiades@gridui01.us  gridui01.u                 0              0              0
      pleiades@ribera.cs.w  ribera.cs.                 0              0              0
      pleiades@ron.cs.wisc  ron.cs.wis                 0              0              0
      pleiades@vail.cs.wis  vail.cs.wi                 0              0              0

                                        TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
      Total                                            0              0              0
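To check from a script that the expected schedd glideins have registered, the rows of this listing could be parsed as whitespace-separated columns. The Python sketch below works on the sample rows copied from the slide; the function name `parse_schedd_rows` and the dictionary keys are hypothetical, and the parser assumes the five-column layout shown here rather than any guaranteed output format.

```python
# Rows copied from the slide's condor_status -schedd listing.
SAMPLE = """\
agrd0926@gridgk01.ra gridgk01.r 0 0 0
agrd0926@gridgk02.ra gridgk02.r 0 0 0
pleiades@gridui01.us gridui01.u 0 0 0
pleiades@ribera.cs.w ribera.cs. 0 0 0
pleiades@ron.cs.wisc ron.cs.wis 0 0 0
pleiades@vail.cs.wis vail.cs.wi 0 0 0
Total 0 0 0
"""


def parse_schedd_rows(text):
    """Parse schedd rows (name, machine, running, idle, held).

    Lines that do not have five columns with a user@host name in the
    first column (headers, totals, blanks) are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5 or "@" not in parts[0]:
            continue
        name, machine, running, idle, held = parts
        rows.append({
            "name": name,
            "machine": machine,
            "running": int(running),
            "idle": int(idle),
            "held": int(held),
        })
    return rows


rows = parse_schedd_rows(SAMPLE)
print(len(rows), "schedds,", sum(r["running"] for r in rows), "running jobs")
```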
  11. Pilot Submission Program (Generator)
      • Communicates with a DB server that maintains information about pilot jobs
        • E.g. pilot_type, pilot_queue
      • Pulls the desired pilot script from an external server
      • Periodically submits pilot jobs (with the pilot script as executable)
        • condor_submit
        • qsub? No, not necessary, since …
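The generator loop described on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual generator: the DB lookup and the submission call are injected as plain functions (`fetch_pilot_info` and `submit`, both hypothetical), and the record fields other than `pilot_type` and `pilot_queue` are invented for the example. The submit description uses standard Condor submit commands for a vanilla universe job.

```python
import time


def make_submit_description(pilot_script, n_pilots):
    """Build a Condor submit description that queues n_pilots pilots."""
    return "\n".join([
        "universe = vanilla",
        f"executable = {pilot_script}",
        "output = pilot.$(Cluster).$(Process).out",
        "error = pilot.$(Cluster).$(Process).err",
        "log = pilot.log",
        f"queue {n_pilots}",
    ]) + "\n"


def run_generator(fetch_pilot_info, submit, cycles,
                  interval_s=0.0, sleep=time.sleep):
    """Periodically look up pilot settings and submit pilots.

    fetch_pilot_info: stands in for the DB query (hypothetical).
    submit: stands in for handing the description to condor_submit.
    Returns the total number of pilots requested.
    """
    submitted = 0
    for _ in range(cycles):
        info = fetch_pilot_info()
        submit(make_submit_description(info["pilot_script"], info["count"]))
        submitted += info["count"]
        sleep(interval_s)
    return submitted


# Dry run: collect descriptions in a list instead of calling condor_submit.
sent = []
total = run_generator(
    lambda: {"pilot_type": "generic", "pilot_queue": "test",
             "pilot_script": "pilot.sh", "count": 2},
    sent.append,
    cycles=3,
    sleep=lambda seconds: None,
)
print(total)
print(sent[0])
```

In a real deployment the `submit` function might write the description to a file and invoke `condor_submit` on it; injecting it keeps the loop testable without a Condor installation.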
  12. Build Pilot Factory with Glidein
      • Schedd glidein installed and executed on the gatekeeper
      • User submits a Condor-C job with the pilot generator as the executable
        • Generator runs on the gatekeeper as a local universe job supervised by the glidein schedd
      • Generator submits pilots
        • Types, frequency adjustable by users
        • Depending on the native batch system, pilots can be submitted as grid universe jobs
        • Along with GAHP and related binaries, the schedd has the ability to communicate with different batch systems
      [Figure labels: Grid Resource, JobManager, LSF, PBS, master, schedd, Pilot generator.]
  13. Pilot Factory
      [Figure: Pilot Factory architecture diagram. Labels: Submit Node (Collector, Master, Negotiator, Schedd); Glidein request; Gatekeeper with {Globus, Condor | PBS | …}; Pilot Factory (master, schedd); Connected to Collector; Submit Pilots; Cluster Worker Nodes.]
  14. Future Work
      • Integrating the pilot with the Condor startd to implement a startd-based pilot
        • The startd-based pilot retrieves the payload of a user job in the same way as the generic pilot, but it also inherits the functionality of the Condor startd
        • Original intention was to run PFs with the startd-pilots on worker nodes (too greedy, unacceptable?)
        • Using the Condor startd makes it easier to integrate with gLexec
      • Transform Generic PF (GPF) to Startd PF (SPF)