Configuring & Enabling Condor in LHC Computing Grid
1. HEP, Cavendish Laboratory
Condor is a specialized workload management system for compute-intensive jobs that can effectively manage a variety of clusters of dedicated compute nodes. Today there are grid schedulers, resource managers, and workload management systems that either provide the functionality of a traditional batch queuing system (e.g. Torque/PBS) or provide the ability to harness cycles from idle desktop workstations. Condor addresses both of these areas with a single tool. In a Grid-style computing environment, Condor's "flocking" technology allows multiple Condor installations to work together, opening a wide range of options for resource sharing.
[Figure: Condor daemon layout. The Central Manager runs the master, collector, negotiator, schedd, and startd daemons; a Submit Node runs master and schedd; Execute Nodes run master and startd; a Regular Node runs master, schedd, and startd. Arrows indicate spawned processes and ClassAd communication pathways.]
Condor Working Model. Like other full-featured batch systems, Condor provides the traditional job queuing mechanism, scheduling policy, and priority scheme, along with resource classifications. In a nutshell, a job is submitted to Condor via a machine running a scheduler (schedd). The scheduler communicates with the collector process on the Central Manager (CM). The negotiator on the CM performs a matchmaking service and sends jobs to an available machine on the network, which then begins running them. Machines that can run jobs (Execute Nodes) also report to the collector via a startd process. A shadow process on the Submit Node keeps communicating with the running job, so Condor can detect if the job stops executing (e.g. if the job or the machine crashes). If checkpointing is not in use, such jobs can be restarted by Condor if requested and allowed.
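The submission path described above starts from a submit description file handed to condor_submit. A minimal sketch is shown below; the executable name, file names, and requirements expression are hypothetical, chosen only to illustrate the ClassAd attributes the negotiator matches against:

```
# job.sub -- minimal Condor submit description file (illustrative;
# executable, file names, and requirements are hypothetical).
# Submit with: condor_submit job.sub
universe        = vanilla
executable      = analysis.sh
arguments       = run001
output          = job.out
error           = job.err
log             = job.log

# ClassAd expression the negotiator uses for matchmaking against
# the resource offers (machine ClassAds) published by each startd
requirements    = (OpSys == "LINUX") && (Memory >= 512)

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

The schedd advertises this job ClassAd to the collector, and the negotiator pairs it with a machine whose offer satisfies the `requirements` expression.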
Although Condor, as a batch system, is officially supported by gLite/EGEE, various parts of the middleware are still limited to PBS/Torque in terms of transparent integration. We have extended the support so that the middleware works seamlessly with Condor and can interact with local/university compute clusters. We provide details of the configuration, implementation, and testing of Condor for LCG in a mixed environment, where a common cluster is used for different types of jobs. The system is presented as an extension to the default LCG/gLite configuration that provides transparent access to the common resource for both LCG and local jobs. Using Condor and Chirp/Parrot, we have extended the possibilities for running LCG/gLite jobs on university clusters in an entirely non-privileged way.
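The non-privileged access pattern rests on two cctools components: a Chirp server exporting a directory without root privileges, and Parrot interposing on an unmodified job's I/O so remote files appear in its namespace. A rough sketch of the idea follows; host names, the port, and the exported path are hypothetical:

```
# Illustrative only: host names, port, and paths are hypothetical.
# On a storage/service host, export a directory via an unprivileged
# Chirp server:
chirp_server -r /data/lcg-software -p 9094

# On a worker node, run an unmodified command under Parrot so the
# remote directory is visible through the /chirp namespace:
parrot_run sh -c 'ls /chirp/hep-se.example.ac.uk:9094/'
```

Because neither tool requires root, this lets LCG/gLite jobs reach grid software areas on clusters where we have no administrative control.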
[Figure: CamGrid project model. Pools are maintained by individual groups/departments. The HEP cluster of execute nodes is fed by the HEP CM + gLite submit node (LCG/gLite submission), a local (HEP) submit node used by local users, and central submit nodes used by grid users; arrows distinguish the HEP, CamGrid, and LCG/gLite submission paths.]
Condor provides a High Throughput Computing (HTC) environment. In addition to the typical usage scenario, Condor can also effectively manage non-dedicated resources by taking advantage of spare cycles when those resources are idle. The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (jobs) with resource offers (machines). This is why we have chosen Condor as the primary batch system for our LCG farm. The same cluster is also part of the CamGrid project. The Condor central manager is configured to act as a submit node only for LCG/gLite submission; an additional submit node serves our local users and CamGrid submission. CamGrid is made up of a number of Condor pools belonging to the departments, which share their resources using Condor's flocking mechanism. This federated approach means that there is no single point of failure in the environment, and the grid does not depend on any individual pool to continue working. Grid jobs run only on the HEP cluster, but CamGrid jobs, submitted through the central submit hosts or the local submit host, run everywhere across the CamGrid infrastructure.
Santanu Das, 3rd EGEE User Forum, 11–14 February 2008, Clermont-Ferrand, France. santanu@hep.phy.cam.ac.uk