2. Office of Instructional and
Research Technology
A little background - why use clusters?
• Computing power - more is more
• Omelets - a dozen chicken eggs work as well as one ostrich
egg and are cheaper
• Big tasks can run for months
• Moore’s law (doubling the number of transistors on a chip
every 24 months) still only nibbles at the Grand Challenges
What is Condor?
• Via the Condor Project home page - Condor is a “specialized
workload management system”
• Via Wikipedia - Condor is a “high-throughput computing
software framework for coarse grain distributed parallelization
of computationally intensive tasks”
What is Condor again?
• Via Eric - Condor is a “way to get work done on computers
that no one is using right now”
• Condor is an application that runs your programs on other
computers
• Condor is a job scheduler aka a batch scheduler
Why use a batch scheduler?
• Generally high performance machines are expensive enough
that people would like to have them in use all the time
• Generally people have noticed that computers are good at
doing mundane tasks like waiting for a program to finish and
then starting another program
• Generally people use batch schedulers to ensure that a
resource, like a cluster, is being fully used without
overwhelming the resource or without having the resource sit
idle, merely generating heat
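To make the batch scheduler idea concrete, here is a minimal sketch in Python. It is not how Condor is implemented; the function name `run_batch` and the `max_running` limit are assumptions chosen for illustration. It keeps a fixed number of jobs running, starts the next queued job when a slot frees up, and never starts more than the limit.

```python
import subprocess
import sys
from collections import deque


def run_batch(jobs, max_running=2):
    """Run each command in `jobs`, keeping at most `max_running` active at once.

    Returns a dict mapping each job's position in `jobs` to its exit code.
    """
    pending = deque(enumerate(jobs))
    running = {}      # job index -> Popen handle, oldest first
    exit_codes = {}
    while pending or running:
        # Fill free slots so the resource does not sit idle while work is queued.
        while pending and len(running) < max_running:
            index, command = pending.popleft()
            running[index] = subprocess.Popen(command)
        # Simplification: wait for the oldest running job, then refill slots.
        index = next(iter(running))
        exit_codes[index] = running.pop(index).wait()
    return exit_codes


if __name__ == "__main__":
    # Hypothetical jobs: four tiny Python one-liners.
    demo = [[sys.executable, "-c", f"print('job {n} done')"] for n in range(4)]
    print(run_batch(demo, max_running=2))
```

A real scheduler also handles priorities, failures, and many machines; the sketch only shows the "wait for one program to finish, then start another" loop.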
Condor twist on idle resources
• The clever folks at University of Wisconsin-Madison expanded
the notion of idle resources to include all desktop computers
that were not in use at the moment
• Like the SETI screensaver idea, Condor runs jobs on
computers that are otherwise idle
• Once a user moves a mouse or touches a key, Condor gets
out of the way and lets the user have the full machine back
How does Condor work?
Configure who can submit jobs
How does Condor work?
Configure where to run jobs
Condor - more of big picture
• Developed by the University of Wisconsin - Madison
• Cost: Free (Open source-ish license)
• Overhead: Need a ‘Master server’ and workstations. Each
workstation runs a daemon that watches user I/O and CPU
load. When a workstation has been idle for two hours, a job
from the batch queue is assigned to the workstation and will
run until the daemon detects a keystroke, mouse motion, or
high non-Condor CPU usage. At that point, the job will be
removed from the workstation and placed back on the batch
queue
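The start-and-evict policy described above can be sketched as a small decision function. This is an illustrative model only, not Condor's daemon code: the `Workstation` class, the `step` function, the polling interval, and the CPU load threshold are all assumptions. Only the two-hour idle period comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_SECONDS_BEFORE_START = 2 * 60 * 60  # "idle for two hours", from the slide
POLL_SECONDS = 5                         # assumed: how often the daemon checks
CPU_LOAD_LIMIT = 0.3                     # assumed threshold for "high" non-Condor load


@dataclass
class Workstation:
    seconds_since_input: float   # time since the last keystroke or mouse motion
    non_condor_load: float       # CPU load from everything except the Condor job
    job: Optional[str] = None    # name of the job currently running here, if any


def step(ws, queue):
    """Apply one policy decision to `ws`; return "start", "evict" or "no change"."""
    owner_active = (ws.seconds_since_input < POLL_SECONDS
                    or ws.non_condor_load > CPU_LOAD_LIMIT)
    if ws.job is not None and owner_active:
        queue.append(ws.job)     # the job goes back on the batch queue
        ws.job = None
        return "evict"
    long_idle = ws.seconds_since_input >= IDLE_SECONDS_BEFORE_START
    if ws.job is None and queue and long_idle and not owner_active:
        ws.job = queue.pop(0)    # assign the next queued job to this workstation
        return "start"
    return "no change"


if __name__ == "__main__":
    queue = ["job-a"]
    ws = Workstation(seconds_since_input=3 * 3600, non_condor_load=0.05)
    print(step(ws, queue))       # idle long enough, so the job starts
    ws.seconds_since_input = 1   # the owner touches the keyboard
    print(step(ws, queue))       # the job is evicted and requeued
```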
Condor - yet more of big picture
• Condor can run both sequential and parallel jobs. Sequential jobs
can be run in several different "universes", including "vanilla" which
provides the ability to run most "batch ready" programs, and
"standard universe" in which the target application is re-linked with
the Condor I/O library which provides for remote job I/O and job
checkpointing.
• Condor supports the standard Message Passing Interface (MPI) and
Parallel Virtual Machine (PVM) as well as a Globus module
• Supported on Windows 2000, 2003, XP, Vista
• Supported on Solaris 8, 9, 10
• Supported on Linux Red Hat (7.1 and on), SuSE (8 & 9), Debian 3.1
• Supported on Mac OS X 10.3 and on
• Other supported platforms
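As a rough illustration of what a "vanilla" universe job looks like to the user, the sketch below builds the text of a submit description file in Python. The helper `render_submit_description` and the file names (`job.log`, `job.$(Process).out`) are hypothetical; the keywords it emits (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard submit description commands, and the resulting file would normally be handed to `condor_submit`.

```python
def render_submit_description(executable, arguments="", universe="vanilla", count=1):
    """Return the text of a simple Condor submit description file."""
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        # $(Process) expands to 0, 1, 2, ... so each queued job gets its own files.
        "output     = job.$(Process).out",
        "error      = job.$(Process).err",
        "log        = job.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program name and arguments.
    print(render_submit_description("simulate", arguments="--steps 100", count=3))
```

Passing `universe="standard"` only changes the first line of the file; the program itself would still have to be relinked with the Condor library, as described above.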
Condor pluses
• No need to recompile jobs (checkpointing, however, requires
relinking with the Condor library)
• Users do not have to worry about details, logons, etc. on
remote machines
• Respects the owner of the remote system
• Flexible system for matching resources with requests
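The matching of resources with requests can be pictured with a toy matchmaker. This is a simplified sketch, not Condor's actual matching language or algorithm: the dictionary layout, the `match` function, and the machine and owner names are invented for illustration. It shows the two-sided idea, where a job states what it requires and prefers, and a machine's owner states which jobs the machine will accept.

```python
def match(job, machines):
    """Return the name of the best machine for `job`, or None if nothing qualifies."""
    candidates = [
        m for m in machines
        if job["requirements"](m)   # the job's side: is this machine suitable?
        and m["accepts"](job)       # the owner's side: is this job welcome?
    ]
    if not candidates:
        return None
    # Among the suitable machines, pick the one the job ranks highest.
    return max(candidates, key=job["rank"])["name"]


# Hypothetical machines and job, for illustration only.
MACHINES = [
    {"name": "lab-01", "os": "LINUX", "memory_mb": 512,
     "accepts": lambda job: True},
    {"name": "lab-02", "os": "LINUX", "memory_mb": 2048,
     "accepts": lambda job: job["owner"] != "guest"},
    {"name": "office-07", "os": "WINDOWS", "memory_mb": 4096,
     "accepts": lambda job: True},
]

JOB = {
    "owner": "alice",
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 256,
    "rank": lambda m: m["memory_mb"],   # prefer machines with more memory
}

if __name__ == "__main__":
    print(match(JOB, MACHINES))
```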
Condor limitations
• Most platforms other than Linux/Unix only support the “vanilla
universe” which means no checkpointing, so a preempted job
is suspended or killed outright
• “Standard universe” checkpointing cannot handle even simple
multi-process jobs - no fork(), no exec(), no system()
• No interprocess communication - no pipes, semaphores or
shared memory
• No reading or writing of files larger than 2 GB
• Limits on signals, timers and file locks
• Network communication must be brief
Questions?
Website: The Condor Project http://www.cs.wisc.edu/condor/
Eric Marshall
Office of Instructional and Research Technology
eric.marshall@rutgers.edu
732 445-2262