Presentation 15 condor-v1

Transcript

  • 1. Condor – A Distributed Job Scheduler. Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny. Chapter 15 of Beowulf Cluster Computing with Linux, Thomas Sterling, editor, Oct. 2001. Summarized by Simon Kim.
  • 2. Contents • Introduction to Condor • Using Condor • Condor Architecture • Installing Condor under Linux • Configuring Condor • Administration Tools • Cluster Setup Scenarios
  • 3. Introduction to Condor • Distributed Job Scheduler • Condor Research Project at the University of Wisconsin-Madison Department of Computer Sciences • Renamed HTCondor in 2012 – http://research.cs.wisc.edu/htcondor
  • 4. Introduction to Condor • [Diagram: the user submits jobs and a policy; Condor queues the jobs, runs them on idle nodes, monitors their progress, and reports back when they complete.]
  • 5. Introduction to Condor • Workload Management System • Job Queuing Mechanism • Scheduling Policy • Priority Scheme • Resource Monitoring and Management
  • 6. Condor Features • Distributed Submission • User/Job Priorities • Job Dependency - DAG • Multiple Job Models – Serial/Parallel Jobs • ClassAds – Job : Machine Matchmaking • Job Checkpoint and Migration • Remote System Calls – Seamless I/O Redirection • Grid Computing – Interaction with Globus Resources
  • 7. ClassAds and Matchmaking • Job ClassAd – Looking for Machine – Requirements: Intel, Linux, Disk Space, … – Rank: Memory, Kflops, … • Machine ClassAd – Looking for Job – Requirements – Rank
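    To make the matchmaking concrete, the job side of a ClassAd can be written directly in a submit description file. A minimal sketch, assuming a hypothetical executable name (myprog) and the standard machine attributes named on the slide (Arch, OpSys, Disk, Memory, KFlops):

        # Requirements and Rank become part of the job ClassAd
        universe     = vanilla
        executable   = myprog                  # hypothetical program name
        requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= 10000)
        rank         = Memory + KFlops         # prefer more memory and faster floating point
        queue

    A machine ClassAd, published by the machine's daemon, carries its own Requirements and Rank, and a match is made only when both sides' Requirements are satisfied.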
  • 8. Using Condor • Roadmap to Using Condor • Submitting a Job • User Commands • Universes • Standard Universe – Process Checkpointing – Remote System Calls – Relinking – Limitations • Data File Access • DAGMan Scheduler
  • 9. Using Condor • Prepare a batch job (STDIN/STDOUT/STDERR redirected to files) • Write a submit description file, e.g.: universe = vanilla, executable = foo, log = foo.log, input = input.data, output = output.data, queue • Submit with $ condor_submit • Universes (runtime environments): Standard, Vanilla, PVM, MPI, Grid, Scheduler – covering serial jobs, parallel jobs, and the meta-scheduler. A cleaned-up submit file follows below.
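    The submit description sketched on the slide, written out as a complete file (the file name foo.submit is hypothetical; the executable and data file names come from the slide):

        # foo.submit
        universe   = vanilla
        executable = foo
        log        = foo.log          # Condor writes job events here
        input      = input.data       # becomes the job's STDIN
        output     = output.data      # captures the job's STDOUT
        queue                         # place one instance of the job in the queue

        $ condor_submit foo.submit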
  • 10. Status of Submitted Jobs • $ condor_status -submitters
  • 11. All Jobs in the Queue • $ condor_q • Removing a Job – $ condor_rm 350.0 • Changing Job Priority: range -20 to +20 (higher is better), default 0 – $ condor_prio -p -15 350.1
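    A short hypothetical session tying the queue-management commands together (the job IDs 350.0 and 350.1 come from the slide):

        $ condor_q                     # list all jobs in the local queue
        $ condor_rm 350.0              # remove job 350.0 from the queue
        $ condor_prio -p -15 350.1     # lower the priority of job 350.1 (range -20..+20, default 0)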
  • 12. Universes • Execution Environment - Universe • Vanilla – Serial Jobs – Binary Executable and Scripts • MPI Universe – MPI Programs – Parallel Jobs – Only on Dedicated Resources # Submit Description Universe = mpi … machine_count = 8 queue
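    The slide's MPI submit description with its elided fields filled in by hypothetical values:

        # MPI universe: the job is scheduled onto 8 dedicated nodes at once
        universe      = mpi
        executable    = my_mpi_prog    # hypothetical MPI binary
        machine_count = 8              # number of nodes required
        log           = mpi.log
        queue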
  • 13. Universes • PVM Universe – Master-Worker Style Parallel Programs Written for the Parallel Virtual Machine Interface – Both Dedicated and Non-dedicated Resources (workstations) – Condor Acts as Resource Manager for the PVM Daemon (nodes added via pvm_addhosts()) – Dynamic Node Allocation # Submit Description: Universe = pvm … machine_count = 1..75 queue
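    Likewise for the PVM universe, with hypothetical names filled in; the 1..75 range asks Condor to start the master with at least one host and to add up to 75 as non-dedicated machines become available:

        # PVM universe: master-worker job on non-dedicated workstations
        universe      = pvm
        executable    = pvm_master     # hypothetical master program
        machine_count = 1..75          # minimum..maximum hosts, as on the slide
        log           = pvm.log
        queue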
  • 14. Universes • Scheduler Universe – Meta-Scheduler – DAGMan Scheduler • Complex Interdependencies Between Jobs – Example DAG: A runs first, B and C run in parallel, then D (job sequence A -> B and C -> D); see the DAGMan input file sketched below.
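    The diamond-shaped example on the slide can be expressed as a DAGMan input file; the per-node submit file names are hypothetical, while JOB and PARENT/CHILD are standard DAGMan keywords:

        # diamond.dag -- A runs first, B and C run in parallel, D runs last
        JOB A a.sub
        JOB B b.sub
        JOB C c.sub
        JOB D d.sub
        PARENT A CHILD B C
        PARENT B C CHILD D

        $ condor_submit_dag diamond.dag    # runs DAGMan itself as a scheduler-universe job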
  • 15. Universes • Standard Universe – Serial Job – Process Checkpoint, Restart, and Migration – Remote System Calls
  • 16. Process Checkpointing • Checkpoint – Snapshot of the Program’s Current State – Preemptive Resume Scheduling – Periodic Checkpoints – Fault Tolerance – No Program Source Code Change • Relinking with Condor System Call Library – Signal Handler • Process State Written to a Local/Network File • Stack/Data Segments, CPU state, Open Files, Signal Handlers and Pending Signals – Optional Checkpoint Server • Checkpoint Repository
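    Checkpoints can also be requested explicitly and stored on a checkpoint server. A hedged sketch, assuming the condor_checkpoint administration tool and the CKPT_SERVER_HOST configuration macro (host names are hypothetical):

        $ condor_checkpoint node07.example.edu     # ask jobs on that machine to checkpoint now

        # condor_config: point execute machines at a checkpoint repository
        CKPT_SERVER_HOST = ckpt-server.example.edu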
  • 17. Remote System Calls • Redirects File I/O – open(), read(), write() -> Network Socket I/O – Sent to the ‘condor_shadow’ process on the Submit Machine • Handles the Actual File I/O • Note that the Job Runs on a Remote Machine • Relinking with the Condor Remote System Call Library – $ condor_compile cc myprog.o -o myprog
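    Putting relinking and submission together for a standard-universe job (the condor_compile command line is from the slide; the submit file name is hypothetical):

        $ condor_compile cc myprog.o -o myprog     # relink against the Condor system call library

        # myprog.submit -- standard universe adds checkpointing and remote system calls
        universe   = standard
        executable = myprog
        log        = myprog.log
        queue

        $ condor_submit myprog.submit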
  • 18. Standard Universe Limitations • No Multi-Process Jobs – fork(), exec(), system() • No IPC – Pipes, Semaphores, and Shared Memory • Brief Network Communication – Long Connection -> Delay Checkpoints and Migration • No Kernel-level Threads – User-level Threads Are Allowed • File Access: Read-only or Write-only – Read-Write: Hard to Roll Back to Old Checkpoint • On Linux, Must be Statically Linked
  • 19. Data Access from a Job • Remote System Calls – Standard Universe • Shared Network File System • What About Non-dedicated Machines (Desktops)? – Condor File Transfer – Before the Run, Input Files Are Transferred to the Remote Machine – On Completion, Output Files Are Transferred Back to the Submit Machine – Requested in the Submit Description File • transfer_input_files = <…>, transfer_output_files = <…> • transfer_files = <ONEXIT | ALWAYS | NEVER>
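    A sketch of Condor file transfer in a submit description, using the commands named on the slide (the executable and data file names are hypothetical):

        universe              = vanilla
        executable            = analyze            # hypothetical
        transfer_input_files  = input.data, params.cfg
        transfer_output_files = results.out
        transfer_files        = ONEXIT             # send outputs back when the job exits
        queue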
  • 20. Condor Architecture • [Diagram: the Central Manager machine runs the Negotiator and Collector; Machines 1 through N each run a Startd and a Schedd.]
  • 21. Condor Architecture • [Diagram: on the Submit machine (Machine 1), the Schedd spawns a Shadow for the job; on the Execute machine (Machine N), the Startd spawns a Starter, which runs the Job; the job's I/O flows back to the Shadow through Condor Remote System Calls.]
  • 22. Cluster Setup Scenarios • Uniformly Owned Dedicated Cluster – MPI Jobs on Dedicated Nodes • Cluster of Multi-Processor Nodes – 1 VM per Processor • Cluster of Distributively Owned Nodes – Jobs from the Owner Preferred • Desktop Submission to Cluster – Submit-only Node Setup • Non-Dedicated Computing Resources – Opportunistic Scheduling and Matchmaking with Process Checkpointing, Migration, Suspend and Resume (a policy configuration sketch follows below)
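    For the distributively owned and non-dedicated scenarios, the owner-preference and opportunistic policies are expressed as startd expressions in condor_config. A heavily hedged sketch with illustrative thresholds (the exact values are not from the slides):

        # Run foreign jobs only while the desktop looks idle
        START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
        # Suspend when the owner returns, resume after the machine is idle again
        SUSPEND  = KeyboardIdle < $(MINUTE)
        CONTINUE = KeyboardIdle > 5 * $(MINUTE)
        # Vacate (triggering checkpoint and migration) if suspended too long
        PREEMPT  = (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity) > 10 * $(MINUTE)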
  • 23. Conclusion • Distinct Features – Matchmaking with Job and Machine ClassAds – Preemptive Scheduling and Migration with Checkpointing – Condor Remote System Calls • A Powerful Tool for Distributed Job Scheduling – Within and Beyond Beowulf Clusters • A Unique Combination of Dedicated and Opportunistic Scheduling
