2. 28 May 2003 2
LSF @ AMD – History
•We’ve used LSF for 7 years, from K6 to the
Opteron and beyond
•My group has 3 large clusters; the largest has
several thousand systems, mostly Athlon Linux.
3. 3
Large LSF Clusters = Hard Work!
•How do we do it?
•A Great LSF Team
•Finding and using good tools
•I’ll talk about types of tools and some specific
examples
4. 4
Tools by Example
•These examples are from our large environment
•They should be useful anywhere people want to
work smarter.
5. 5
Grim Reality: Thousands of Systems
•System Database
–MySQL + Perl + local programs
–Updated daily, automatically
•Trouble Ticket Database
–RT - Request Tracker
–Used by sysadmins and customers
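The slides do not show the system database itself. As a minimal sketch of the idea (one row per host, refreshed by a daily job), the Python below uses SQLite in place of the MySQL + Perl setup named above so that it runs anywhere. The table layout, column names, and function names are hypothetical.

```python
import sqlite3
from datetime import date


def open_inventory(path=":memory:"):
    """Open (or create) the inventory database with one row per host."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS systems ("
        " hostname TEXT PRIMARY KEY, os TEXT, cpu TEXT, last_seen TEXT)"
    )
    return db


def record_system(db, hostname, os_name, cpu, seen=None):
    """Insert a host, or overwrite its row if it was already known."""
    seen = seen or date.today().isoformat()
    db.execute(
        "INSERT OR REPLACE INTO systems VALUES (?, ?, ?, ?)",
        (hostname, os_name, cpu, seen),
    )
    db.commit()


def stale_systems(db, cutoff):
    """Return hosts whose last report is older than the ISO date `cutoff`."""
    rows = db.execute(
        "SELECT hostname FROM systems WHERE last_seen < ? ORDER BY hostname",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

A daily job would call `record_system` for every host that answers; `stale_systems` then lists the machines that have stopped reporting.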
6. 6
Coordination: Large Systems Team,
Large Customer Base
•RT: trouble ticket system
•We use RT to:
–Track customer problems
–Track bugs in vendor software
–Schedule and control changes to the LSF cluster
•You need one of these, unless you like to work too
hard.
7. 7
Test or Change Many (or All!)
Systems?
•We use clsh (the cluster shell) to run programs on
many systems serially or in parallel.
•Clsh can execute programs in our cluster at over
600 systems/minute.
•Example: run 'uname -a' on all systems in the
tx_linux netgroup:
–$ clsh -ng tx_linux 'uname -a'
•Scared yet?
–$ sudo clsh -ng tx_linux 'halt'
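clsh is a local tool and the slides do not show how it works inside. The Python below is only a rough sketch of the same fan-out idea: run one command on many hosts, serially or in parallel. It assumes password-less ssh to each host; the function names and the `parallel` and `runner` parameters are hypothetical, and netgroup expansion is left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_host(host, command, timeout=30):
    """Run one command on one host over ssh; return (host, exit code, output)."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout,
        )
        return host, proc.returncode, proc.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        # ssh missing or host not answering: report it instead of crashing.
        return host, None, str(exc)


def run_on_hosts(hosts, command, parallel=32, runner=run_on_host):
    """Run `command` on every host, at most `parallel` hosts at a time.

    parallel=1 gives a serial run. Results come back in the order of `hosts`.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(lambda host: runner(host, command), hosts))
```

For example, `run_on_hosts(["node001", "node002"], "uname -a")` returns one `(host, exit code, output)` tuple per host; the host names here are made up.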
8. 8
Programs Crashing or Hanging?
•Trace a running program
–# strace -p <PID>
•Run a program while tracing it
–$ strace -t -v -f /bin/hostname
•Everything is a file, find the files
–$ lsof -p <PID>
10. 10
The Tool for Tools - Perl
•Obvious facts (?)
–Cross-platform
–Great software library (CPAN)
–Well known in the EDA and Unix world
–Fun to use (for some strange folks, anyway)
–The strong attractive force
11. 11
1000-foot view of the cluster
•Cricket/RRDtool
•System Accounting
•Syslog Server
12. 12
The Second Law / Entropy
•Entropy is
•Misteaks happen
•RCS/CVS/SCCS/…
–You must use revision control, or chaos will win
•Sudo
–Use sudo for root access: it logs each command and lets you grant limited
privileges
13. 13
Acute vs Chronic Trouble
•How do we diagnose and fix symptoms that are not
easily reproduced?
•Lsfbug: a program for users
– Saves Unix environment
– Saves LSF environment
– Submits a test job to LSF
– Emails the output to the LSF team
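Lsfbug itself is not shown in the slides. The four steps above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the `LSF_`/`LSB_` prefix test for spotting LSF variables, the `bsub -I hostname` test job, the function names, and the mail addresses.

```python
import os
import subprocess
from email.message import EmailMessage


def collect_environment(environ=None):
    """Split the environment into (unix_vars, lsf_vars)."""
    environ = dict(os.environ if environ is None else environ)
    # Assumption: LSF settings are the variables prefixed LSF_ or LSB_.
    lsf = {k: v for k, v in environ.items() if k.startswith(("LSF_", "LSB_"))}
    unix = {k: v for k, v in environ.items() if k not in lsf}
    return unix, lsf


def run_test_job(command=("bsub", "-I", "hostname"), timeout=300):
    """Submit a trivial job and capture whatever comes back."""
    try:
        proc = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
        return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"test job could not run: {exc}"


def build_report(unix_env, lsf_env, job_output, sender, recipient):
    """Assemble the message that would be mailed to the LSF team."""
    def fmt(env):
        return "\n".join(f"{k}={v}" for k, v in sorted(env.items()))

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "lsfbug report"
    msg.set_content(
        "== Unix environment ==\n" + fmt(unix_env)
        + "\n\n== LSF environment ==\n" + fmt(lsf_env)
        + "\n\n== Test job ==\n" + job_output + "\n"
    )
    return msg
```

Sending is left out on purpose; with a local mail relay, `smtplib.SMTP("localhost").send_message(msg)` would deliver the report. Capturing the environment at the moment the user sees the problem is what makes hard-to-reproduce symptoms diagnosable later.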
14. 14
Cross-Platform Compatibility
•Use similar paths for similar tools – regardless of
the OS or OS version
– Perl should always be in the same place, even on AIX and Linux
and HP-UX and …
•Install user tools on NFS servers
•Use package management software (opt_depot,
stow)
•Install systems with Kickstart/Jumpstart/Ignite
18. 18
Tools List 3
•strace http://sourceforge.net/projects/strace
•sudo http://www.courtesan.com/sudo
•tusc Use Google
•vnc http://www.uk.research.att.com/vn
•xchat http://www.xchat.org
19. 19
Reading List
•The Practice of System and Network
Administration, by Limoncelli and Hogan
–http://www.sysadminfocus.com
•The Unix System Administration Handbook
–http://www.admin.com
20. 20
Trademark Attribution
AMD, the AMD Arrow Logo and combinations thereof
are trademarks of Advanced Micro Devices, Inc.
Other product names used in this presentation are
for identification purposes only and may be
trademarks of their respective companies.