Rob Gardner presented on bringing OSG pools to campus clusters and connecting clusters to the Open Science Grid (OSG). The presentation covered installing the Connect Client software to submit jobs from a campus cluster to OSG resources totaling more than 100,000 CPU cores. Users can check job status and see where jobs ran with Connect Client commands such as "connect q" and "connect histogram". Tutorials are available online to help users learn new techniques for running on OSG Connect.
1. Rob Gardner • University of Chicago
Bring OSG pools to your cluster.
Share your HTCondor pools with OSG.
Introduction to High Throughput Computing
for Users and System Administrators
#TechEX15 Cleveland OH, October 8, 2015
2. Goals for this session
● Quick introduction to the OSG
● Let's submit to OSG from your campus
● Connecting your cluster to the OSG
7. Lightweight Connect Client
Idea is to bring the
submit point to
“home” campus
Submit locally,
run globally
Heavy lifting done
by OSG hosted
services
8. What is Connect Client?
On the OSG Connect login host, we encapsulate many
common operations under the connect command:
● connect status
● connect watch
● connect histogram
● connect project
9. What is Connect Client?
We bring those commands and a bit more to your campus:
● connect setup
● connect pull
● connect submit
● connect q
● connect history
● connect rm
● connect status
● ...
10. Command summary
● connect setup remote-username
○ one-time authorization setup. (Creates a new SSH key pair and uses your
password to authorize it.)
○ connect test can validate access at any time
● connect pull / push
○ lightweight access means no local service monitors file readiness for
transfer
○ instead, explicit commands perform uni- or bi-directional file
synchronization between local and remote (the “connected” server); the
sync occurs over a secure SSH channel
11. Command summary
● connect submit
○ like condor_submit, submits a job from a job control file (submit script).
Implicitly performs a push beforehand.
● connect q
○ runs condor_q remotely
● connect history
● connect status
● connect rm
○ also condor_* wrappers
● connect shell
○ gives you a login on the connect server at the location of your job
12. Get an OSG Connect credential
● Sign up at http://osgconnect.net/
● Test that you can login:
$ ssh username@login.osgconnect.net
● You can work from here, but to submit from
your campus cluster, log out for now
13. Setup your campus cluster
● We will install the Connect Client
● Works for CentOS 6.x and similar
● You will need sys privs (unless pre-reqs installed)
● You can practice today using our docker service
and a vanilla CentOS 6 container image
● So, either login to your home cluster, or follow
instructions on next slide...
14. Aside: practice within a container
First you will need to login via SSH to docker.osgconnect.net using your
OSG Connect credentials.
Once there, create a new Docker container:
docker run -ti centos:centos6 /bin/sh
Once your container is ready, you will see a prompt similar to this:
sh-4.1#
You are now inside your container as the super user.
15. Prerequisite software
Before installing the Connect Client, we will need to install some other prerequisite
software:
yum install -y git python-paramiko
A lot of text will scroll by as the packages are downloaded and installed into your
container. When it’s finished, you will see the following message and be returned
to your shell:
Complete!
sh-4.1#
16. Get the Connect Client
Now that the dependencies are installed, you can fetch the
Connect Client from GitHub:
cd
git clone --recursive https://github.com/CI-Connect/connect-client
cd connect-client
git checkout v0.5
You should see some more text, ending with:
HEAD is now at ca309c9... tag release v0.5
17. Setup environment
Once installed, you will need to return to the home directory
and add the Connect Client to PATH:
cd
export PATH=$PATH:~/connect-client/connect/bin:~/connect-client/scripts/tutorial
18. Setup the connection to OSG
sh-4.1# connect setup
Please enter the user name that you created during Connect registration. Note that it
consists only of letters and numbers, with no @ symbol.
You will be connecting via the connect-client.osgconnect.net server.
Enter your Connect username: rwg
Password for rwg@connect-client.osgconnect.net:
notice: Ongoing client access has been authorized at connect-client.osgconnect.net.
notice: Use "connect test" to verify access.
sh-4.1# connect test
Success! Your client access to connect-client.osgconnect.net is working.
sh-4.1#
20. sh-4.1# tutorial quickstart
Installing quickstart (master)...
Tutorial files installed in ./tutorial-quickstart.
Running setup in ./tutorial-quickstart...
sh-4.1# cd tutorial-quickstart/
Try it out: $ tutorial quickstart
21. Prepare to submit 10 jobs
sh-4.1# cat tutorial03.submit
Universe = vanilla
Executable = short.sh
Arguments = 5 # to sleep 5 seconds
Error = log/job.err.$(Cluster)-$(Process)
Output = log/job.out.$(Cluster)-$(Process)
Log = log/job.log.$(Cluster)
#+ProjectName="ConnectTrain"
Queue 10
sh-4.1#
Change arg to 60 seconds
Change ConnectTrain to TechEX15
Change Queue value to 10
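The three edits above can be scripted; a sketch using sed, assuming the tutorial03.submit contents shown on this slide:

```shell
# Recreate the submit file from the slide, then apply the three edits
# (sleep 60 seconds, project TechEX15, Queue 10) with sed.
cat > tutorial03.submit <<'EOF'
Universe = vanilla
Executable = short.sh
Arguments = 5 # to sleep 5 seconds
Error = log/job.err.$(Cluster)-$(Process)
Output = log/job.out.$(Cluster)-$(Process)
Log = log/job.log.$(Cluster)
#+ProjectName="ConnectTrain"
Queue 10
EOF
sed -i \
  -e 's/^Arguments = 5 # to sleep 5 seconds/Arguments = 60 # to sleep 60 seconds/' \
  -e 's/ConnectTrain/TechEX15/' \
  -e 's/^Queue .*/Queue 10/' tutorial03.submit
cat tutorial03.submit
```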
22. Inspect the job script itself
sh-4.1# cat short.sh
#!/bin/bash
# short.sh: a short discovery job
printf "Start time: "; /bin/date
printf "Job is running on node: "; /bin/hostname
printf "Job running as user: "; /usr/bin/id
printf "Job is running in directory: "; /bin/pwd
echo
echo "Working hard..."
sleep ${1-15}
echo "Science complete!"
sh-4.1#
Script that runs
on OSG pools
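Since short.sh is plain bash, it can be smoke-tested locally before submitting, here with a 1-second sleep:

```shell
# Recreate short.sh as shown above and run it locally with a short
# sleep, to confirm the script behaves before it goes to OSG pools.
cat > short.sh <<'EOF'
#!/bin/bash
# short.sh: a short discovery job
printf "Start time: "; /bin/date
printf "Job is running on node: "; /bin/hostname
printf "Job running as user: "; /usr/bin/id
printf "Job is running in directory: "; /bin/pwd
echo
echo "Working hard..."
sleep ${1-15}
echo "Science complete!"
EOF
chmod +x short.sh
./short.sh 1
```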
23. sh-4.1# connect submit tutorial03.submit
+++++.+.+++
9 objects sent; 2 objects up to date; 0 errors
Submitting job(s)..........
10 job(s) submitted to cluster 4070.
sh-4.1#
Submit $ connect submit
24. Check queue
sh-4.1# connect q
-- Submitter: login02.osgconnect.net : <192.170.227.251:37303> : login02.osgconnect.net
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4070.0 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.1 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.2 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.3 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.4 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.5 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.6 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.7 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.8 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
4070.9 rwg 10/8 04:28 0+00:00:00 I 0 0.0 short.sh 60 # to s
10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended
sh-4.1#
$ connect q
25. Where did the jobs run?
$ connect histogram
or
$ connect histogram --last
26. Where are the results?
● Nothing was returned to the client host
automatically:
sh-4.1# ls
README.md log short.sh tutorial01.submit tutorial02.submit tutorial03.submit
sh-4.1# ls log/
sh-4.1#
● Results are sitting on the OSG Connect server
27. Check and bring back
● On OSG Connect: inspect results directly
$ connect shell
● On campus: sync them back
$ connect pull
28. connect command
● connect show-projects
show projects you have access to
● connect project
set your accounting project
● connect status
show condor_status in all pools
● connect q
check progress of your job queue
● connect histogram [--last]
shows where your jobs have been run
29. tutorial command
sh$ tutorial
$ tutorial
usage: tutorial list - show available tutorials
tutorial info <tutorial-name> - show details of a tutorial
tutorial <tutorial-name> - set up a tutorial
Currently available tutorials:
AutoDockVina .......... Ligand-Receptor docking with AutoDock Vina
R ..................... Estimate Pi using the R programming language
ScalingUp-R ........... Scaling up compute resources - R example
blast ................. blast sequence analysis
cp2k .................. How-to for the electronic structure package CP2K
dagman-namd ........... Launch a series of NAMD simulations via an HTCondor DAG
error101 .............. Use condor_q -better-analyze to analyze stuck jobs
30. tutorial command
● Tutorials are maintained in github and
downloaded on demand
● Each tutorial’s README is in the OSG Support
site
○ http://osg.link/connect/userguide
○ http://osg.link/connect/recipes
● These are recommended for learning new
techniques on OSG Connect
33. OSG for resource providers
● Connect your campus users to the OSG
○ connect-client - job submit client for the local cluster
○ provide “burst”-like capability for HTC jobs on shared opportunistic
resources
● Connect campus cluster to OSG
○ Lightweight connect : OSG sends “glidein” jobs to your cluster, using a
simple user account
■ No local software or services needed!
○ Large scale: deploy the OSG software stack
■ Support more science communities at larger scale
34. “Quick Connect” Process
● Phone call to discuss particulars of cluster
○ does not need to be HTCondor -- Slurm, PBS, and others are
supported
○ Nodes need outbound network connectivity
● Create an osgconnect account that the OSG team
uses for access
38. ● 2014 stats
○ 67% the size of XD, 35% of Blue Waters
○ 2.5 Million CPU hours/day
○ 800M hours/year
○ 125M/y provided opportunistic
● >1 petabyte of data transferred per day
● 50+ research groups
● thousands of users
● XD service provider for XSEDE
Rudi Eigenmann
Program Director Division of
Advanced Cyberinfrastructure (ACI)
NSF CISE
CASC Meeting, April 1, 2015
Distributed HTC on OSG
42. OSG Connect Service
View OSG as an
HTC cluster
★ Login host
★ Job scheduler
★ Software
★ Storage
43. Software & tools on the OSG
● Distributed software file system OASIS
● Special module command
○ identical software on all clusters
○ 170 libraries
#!/bin/bash
switchmodules oasis
module load R
module load matlab
...
44. Submit jobs to OSG with HTCondor
● Simple HTCondor submission
● Complexity hidden from the user
● No grid (X509) certificates required
● Uses HTCondor ClassAd and glidein
technology
● DAGMan and other workflow tools