Speaker: Renato Golin
Speaker Bio:
He started programming in C on PCs in the late 80s, after a few years playing with 8-bit computers, but only began programming professionally in the late 90s, during the .com bubble. After many years working on Internet back-ends, he moved to the UK and spent a few years on bioinformatics at the EBI before joining ARM, where he worked on the DS-5 debugger and on the EDG-to-LLVM bridge, becoming the LLVM Tech Lead. More recently, he worked with large clusters and big data at HPCC before moving to Linaro.
Talk Title: OpenHPC Automation with Ansible
Talk Abstract: "In order to test OpenHPC packages and components, and to use it as a platform to benchmark HPC applications, Linaro is developing an automated deployment strategy, using Ansible, Mr-Provisioner and Jenkins, to install the OS and OpenHPC and to prepare the environment on varied architectures (Arm, x86). This work is meant to replace the existing ageing Bash-based recipes upstream while keeping the documents intact. Our aim is to make it easier to vary hardware configuration, to allow for different provisioning techniques, and to adapt internal infrastructure logic to different labs, while still using the same recipes. We hope this will help more people use OpenHPC, with a better out-of-the-box experience and more robust results."
7. Existing OpenHPC Automation
● A recipe with all rules described in the official documents
● LaTeX snippets containing shell code
● These are converted and merged into an (OS-dependent) Bash script
● The scripts ship in OHPC packages, installed after the main OpenHPC package
● Plus an input.local file with cluster-specific configuration / environment
8. Existing OpenHPC Automation
Shortcomings:
● The input.local file exports shell variables
● The recipe.sh is not idempotent
● Extensibility is impossible without editing the files
9. Ansible Playbooks
Ansible is a widely used automation tool which can describe the structure and
configuration of IT infrastructure with YAML “playbooks” and “roles”.
OpenHPC with Ansible:
● Ansible playbooks can more easily be made idempotent
● Ansible can manage nodes/tasks according to the structure of the cluster
● Configuration is passed as a YAML file (no environment handling)
● Composition, using playbooks and roles, building on third-party content
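The points above can be sketched as a minimal playbook. This is an illustrative example only, not the actual Linaro recipe: the host group, variable names (`slurm_client_packages`, `sms_ip`, `sms_name`) and file layout are assumptions.

```yaml
# Hypothetical sketch: an idempotent playbook whose cluster configuration
# comes from a YAML vars file instead of exported shell variables.
- hosts: compute_nodes
  become: true
  vars_files:
    - group_vars/all.yml        # plays the role of input.local, as YAML
  tasks:
    - name: Install the Slurm client packages
      package:                  # safe to re-run: installs only when missing
        name: "{{ slurm_client_packages }}"
        state: present

    - name: Point compute nodes at the head node
      lineinfile:               # idempotent edit: line added only if absent
        path: /etc/hosts
        line: "{{ sms_ip }} {{ sms_name }}"
```

Because each module checks the current state before acting, re-running the playbook is a no-op, unlike the recipe.sh approach.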
10. Ansible OpenHPC Benefits
● Flexible cluster configuration
○ Fine grained / composable
○ Cluster wide / node group wide / node specific
● Works on both x86_64 and AArch64
○ Ansible gathers information about architecture
○ Same playbook runs on both
● OS is directly inferred by Ansible (gather_facts)
○ Ansible gathers information about OS
○ Yum, apt, zypper… can be switched in the roles’ logic
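A sketch of how gathered facts can drive that switching inside a role; the package name and the CentOS-only command are hypothetical, used only to show the pattern.

```yaml
# Hypothetical role tasks: let gather_facts decide OS and architecture.
- name: Install a package with the distro's own package manager
  package:                      # resolves to yum, apt or zypper from facts
    name: ntp
    state: present

- name: Apply a distro-specific workaround
  command: /usr/local/bin/centos-fix    # hypothetical command
  when: ansible_facts['distribution'] == 'CentOS'

- name: Record the architecture for later repository selection
  set_fact:
    ohpc_arch: "{{ ansible_facts['architecture'] }}"   # x86_64 or aarch64
```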
11. Ansible OpenHPC Recipe
The basic structure of the Ansible playbook:
playbook
+-- group_vars/
|     +-- all.yml ............. cluster-wide configuration
|     +-- group1,group2 … ..... node-group-specific configuration (e.g., compute nodes, login, DFS)
+-- host_vars/
|     +-- host1,host2 … ....... host-specific configuration
+-- roles/ .................... package-specific tasks, templates, config files, and config variables
      +-- package-name/
            +-- tasks/main.yml .... YAML file describing how to install package-name
            +-- vars/main.yml ..... package-specific configuration variables
            +-- templates/*.j2 .... template files to generate configuration files
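A role's tasks/main.yml in this layout might look like the following; the package name, template name and handler are illustrative, not taken from the actual recipes.

```yaml
# Hypothetical roles/package-name/tasks/main.yml for the layout above.
- name: Install package-name
  package:
    name: package-name
    state: present

- name: Generate its configuration from a Jinja2 template
  template:
    src: package-name.conf.j2          # lives in the role's templates/
    dest: /etc/package-name.conf       # values come from vars/main.yml
  notify: restart package-name         # hypothetical handler in the role
```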
13. Upstreaming our work
After multiple discussions, our proposal is:
● Generate from LaTeX sources at the same time as recipe.sh
○ Each block/section as a separate role
○ Control OS/Arch via gather_facts/config
○ xCAT vs Warewulf, Slurm vs PBS: only add what hasn't been covered yet
● Push to a separate repository (openhpc/ansible-role ?)
○ Control release branches/tags so as not to overwrite previous recipes
○ Repo can have additional playbooks (CI, new users)
○ Once release is validated, push to Ansible Galaxy
● Direct pull requests to playbooks / support roles
○ Only recipe roles are auto-generated
○ Support roles (specific to Ansible, playbook) are maintained on the repo
15. Test Suite
Most tests are green, however:
● Intel-specific tests (CILK, TBB, IMB) disabled
● Others need package install (PDF, CDF, HDF), but pass when installed
● TAU fails because Lmod defaults to openmpi (needs openmpi3)
● Lustre fails as package depends on kernel 4.2 (which won’t work on our machines)
● MiniDFT and PRK had make failures, but we haven’t investigated yet
● --enable-long doesn't actually enable the long tests; we need to look into why
The plan from now on is:
1. Automate package install conditional on enabled tests, fix remaining errors
2. Work with members to prioritise long-term ones (like Lustre)
3. Use it for additional packages, so we can test them before sending upstream
4. Add a benchmark mode, making sure to use entire cluster