Justin Senseney of NIST's presentation from StackiFest 2017.
Stacki was used to upgrade a high-performance computing (HPC) cluster at the National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland. NIST is the United States’ federal metrology institute, performing research and creating standards for measurements and technology, including materials, data, and cyber-security. A 1,200-node CentOS 5 Maui/Torque cluster was upgraded to CentOS 7 with a Slurm queuing system. At the same time, hundreds of servers were removed from and added to this cluster. This presentation shows how Stacki was applied to this HPC cluster and contrasts it with the provisioning methods used previously. Stacki carts and pallets are used to provision role-based servers, including GPU, high-memory, and multiple login servers. Ideas are proposed that would allow us to extend this approach to managing multiple clusters. Any mention of commercial products within this presentation, including Stacki, is for information purposes; it does not imply recommendation or endorsement by NIST.
3. Disclaimer
• Any mention of commercial products, including Stacki, within this presentation is for information only; it does not imply recommendation or endorsement by NIST.
4. About NIST
• Part of U.S. Department of Commerce
• Non-regulatory
• A metrology institute – maintains measurement standards
• Time server - time.nist.gov
• Two-factor authentication – HSPD-12
• Standard reference data
• Gaithersburg, MD
• Boulder, CO
5. Our ideas
• Ordering of carts when multiple carts are used
• Stack compile cart for changes
• Version number that node is installed with
• Clean upgrade process to preserve git repository
7. Existing system
• In place for the last 10+ years:
• Hardware – heterogeneous hardware
• Support different hardware configurations
• Network – simple, flat network
• Increase topology complexity
• Software – CentOS 5, Maui/Torque, consistent image
• Software installed on local machine
8. Hardware
• Located in Gaithersburg, MD.
• Owned by NIST scientific organizational units, managed by the Office of Information Systems Management (OISM).
• Nodes - 8GB/core, procured 2010 - 2016:
• Different vendors
• Network – InfiniBand, Ethernet
• Daisy chained switches
• Different vendors
9. Network
• Protocols
• InfiniBand on a separate card
• Ethernet on-board
• Head node
• Login
• Provisioning
• Queue manager
• Flat network
Raritan uses this typical HPC setup. All incoming connections go to the head node.
Picture from: http://www.udel.edu/it/research/training/config_laptop/
10. Software
• Maui/Torque
• SystemImager – provided a consistent image
• CentOS 5 and 7
• SSH keys shared
• Directories:
• /usr/local/bin for shared software
• /tmp for local computing
• /wrk for shared storage
• /home for user data
11. New HPC Architecture
• Advanced Provisioning (Stacki)
• Networking Design & VLAN(s)
• Outfitting
• IP Management
• NIS+ Replacement
• SSH keys
• Directory structure
• Software
• Modules
• Development/Test Environments
Rocks           Stacki
Appliance       Box
Roll            Pallet
Distribution    Cart
17. Stacki
• Stacki box
• Our box contains multiple Pallets
• Our box contains 1 Cart
• Pallets
• Third-party software, in different versions
• Carts
• Custom/configured software
• System files
https://commons.wikimedia.org/wiki/File:Box.agr.jpg
https://upload.wikimedia.org/wikipedia/commons/8/88/A1210.jpg
https://commons.wikimedia.org/wiki/Category:Utility_carts#/media/File:Moebelhunt_fcm.jpg
• Ordering of carts when multiple carts are used
• stack compile cart for changes
22. SSH keys
• SystemImager duplicates SSH keys; how do users log in now:
• MUNGE authentication:
• Slurm lets users launch commands through MUNGE, which uses a shared key
• SSH keys:
• Have users add their SSH key to .ssh/authorized_keys in their home directory
• Placed a script for doing this in /share/sw (this needs an understandable directory structure)
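
The script mentioned on this slide is not shown in the presentation. As a rough illustration only, a helper that adds a user's public key to .ssh/authorized_keys might look like the sketch below. The function name `authorize_key` and its behavior (skipping duplicates, tightening permissions) are assumptions, not the script NIST placed in /share/sw.

```python
import os
from pathlib import Path


def authorize_key(public_key: str, home: Path) -> bool:
    """Append public_key to <home>/.ssh/authorized_keys unless already present.

    Hypothetical helper: returns True if the key was added, False otherwise.
    """
    key = public_key.strip()
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # mkdir's mode is filtered by umask, so set it explicitly

    auth_file = ssh_dir / "authorized_keys"
    text = auth_file.read_text() if auth_file.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False  # key already authorized, nothing to do

    with auth_file.open("a") as fh:
        if text and not text.endswith("\n"):
            fh.write("\n")  # avoid gluing the new key onto an unterminated line
        fh.write(key + "\n")
    os.chmod(auth_file, 0o600)  # keep the file private to its owner
    return True
```

A user could call it as `authorize_key(Path.home().joinpath(".ssh/id_rsa.pub").read_text(), Path.home())`, where the key file name is again only an example. With /home shared across nodes, one call would cover every node that mounts it.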
• Version number that node is installed with
• Clean upgrade process to preserve git repository
31. Directory structure
• /export/apps/configfiles – cart pulls from here
• A git repository
• /export/stack/carts/extend – the config cart
• A git repository
• /export/sw – 3rd party/licensed software installed here
• Can get messy
• /export/stack/spreadsheets
• Created by Stacki
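
Because the cart directory is a git repository, one possible approach to the "version number that node is installed with" idea is to record the cart's current commit at install time. The sketch below is an assumption about how that could be done, not something Stacki provides; the function names and the stamp format are hypothetical.

```python
import subprocess
from datetime import datetime, timezone


def cart_commit(repo_dir: str) -> str:
    """Return the short hash of the commit checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,          # run git inside the repository
        check=True,            # raise if git fails (e.g. not a repository)
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def version_stamp(commit: str, when: datetime) -> str:
    """Format a one-line stamp that could be written to a file on the node."""
    return f"cart-commit={commit} installed={when.strftime('%Y-%m-%dT%H:%M:%SZ')}"
```

For example, `version_stamp(cart_commit("/export/stack/carts/extend"), datetime.now(timezone.utc))` would produce a line tying an installed node back to a specific commit of the config cart.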
32. Cluster software – all by yum install
• Programming libraries: LAPACK, MKL, ACML, BLACS, ScaLAPACK, CMLIB, Python, R, Java
• Editing files:
• Without X11: vi, Vim, Emacs, nano
• With X11: gedit, Sublime
• Comparing files:
• Without X11: vimdiff
• With X11: KDiff3, Meld
39. Our ideas
• Ordering of carts when multiple carts are used
• Stack compile cart for changes
• Version number that node is installed with
• Comments column for hostfile.csv and database entry
• Firewall rules ordering and modification
• Clean upgrade process to preserve git repository
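
Until a comments column is supported natively, one workaround for that idea is to keep comments in the working copy of hostfile.csv and strip them before loading. The sketch below illustrates this under stated assumptions: the column name `COMMENTS` is hypothetical, and `NAME`, `APPLIANCE`, `RACK`, and `RANK` are assumed to be the host spreadsheet headers in use.

```python
import csv
import io


def split_comments(csv_text: str, comment_col: str = "COMMENTS"):
    """Separate a comments column from the host rows.

    Returns (clean_csv_text, {host_name: comment}); hosts with an empty
    comment are left out of the dictionary.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name != comment_col]

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()

    comments = {}
    for row in reader:
        note = (row.get(comment_col) or "").strip()
        if note:
            comments[row["NAME"]] = note  # assumes a NAME column identifies the host
        writer.writerow(row)  # the comment column is dropped by extrasaction="ignore"
    return out.getvalue(), comments
```

The cleaned text can then be written to a file and loaded as usual, while the comments dictionary stays with the administrators' notes.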
40. Questions?
• My email: Justin.Senseney@nist.gov