OpenStack at the
Sanger Institute
Dave Holland
From zero knowledge to
“Pan-Prostate Genome Blaster”
in 18 months
What I’ll talk about
● The Sanger Institute
● Motivations for using OpenStack
● Our journey
● Some decisions we made (and why)
● Some problems we encountered (and how we addressed them)
● Projects that are using it so far
● Next steps
The Sanger Institute
LSF 9
~10,000 cores in main compute farm
~10,000 cores across smaller project-specific farms
13PB Lustre storage
Almost everything is available everywhere - “isolation” is based on POSIX file
permissions
Motivations
LSF great for HPC utilization but…
● It doesn’t address data size/sharing/locality
● It’s quicker to move an image (or an image definition) to the data
○ benefit from existing data security arrangements
○ benefit from tenant isolation
LSF isn’t going away - complementary to cloud-style computing
Our journey
● 2015, June: sysadmin training
● July: experiments with RHOSP6 (Juno)
● August: RHOSP7 (Kilo) released
● December: pilot “beta” system opened to testers
● 2016, first half: Science As A Service
● July: pilot “gamma” system opened using proper Ceph hardware
● August: datacentre shutdown
● September: production system hardware installation
● 2017, January: “delta” system opened to early adopters
● February: Sanger Flexible Compute Platform announced
Science As A Service
First half of 2016
Proof-of-concept of a user-friendly orchestration portal (CloudForms) on top
of OpenStack and VMware
Consultancy and development input from RedHat
Presented to the Scientific Working Group at the Barcelona summit, October 2016
Decisions we made
Hardware
We approached current vendors, and SuperMicro via BIOS-IT
Wanted to get most bang for buck
Arista provided seed switch kit and offered VXLAN support
Production OpenStack (1)
• 107 Compute nodes (Supermicro) each with:
• 512GB of RAM, 2 * 25Gb/s network interfaces
• 1 * 960GB local SSD, 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• 6 Control nodes (Supermicro) allow 2 OpenStack deployments
• 256GB RAM, 2 * 100Gb/s network interfaces
• 1 * 120GB local SSD, 1 * Intel P3600 NVMe (/var)
• 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• Total of 53 TB of RAM, 2996 cores, 5992 with hyperthreading
• RHOSP8 (Liberty) deployed with Triple-O
Production OpenStack (2)
• 9 Storage nodes (Supermicro) each with:
• 512GB of RAM
• 2 * 100Gb/s network interfaces
• 60 * 6TB SAS discs, 2 system SSDs
• 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• 4TB of Intel P3600 NVMe used for journal
• Ubuntu Xenial
• 3 PB of disc space, 1PB usable
• Single instance (1.3 GBytes/sec write, 200 MBytes/sec read)
• Ceph benchmarks imply 7 GBytes/sec
Production OpenStack (3)
• 3 racks of equipment, 24kW load per rack
• 10 Arista 7060CX-32S switches
• 1U, 32 * 100Gb/s -> 128 * 25Gb/s
• Hardware VXLAN support integrated with OpenStack *
• Layer two traffic limited to rack, VXLAN used inter-rack
• Layer three between racks and interconnect to legacy systems
• All network switch software can be upgraded without disruption
• True Linux systems
• 400 Gb/s from racks to spine, 160 Gb/s from spine to legacy systems
* VXLAN in the ml2 plugin was not used in the first iteration because of software issues
OpenStack installation
RHOSP vs Packstack vs …
• Paid-for support from RedHat
• Terminology confusion: Triple-O undercloud and overcloud
• Need wellness checks of undercloud and overcloud before each
(re)deploy
• Keep deployment configuration in git and deploy with a script for
consistency
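A minimal sketch of the kind of wrapper script we mean (paths and template names are illustrative, not our real ones):

  #!/bin/bash
  # always deploy from the stack user's home, with config pulled from git,
  # using the same command line every time
  set -euo pipefail
  cd ~stack
  git -C ~stack/templates pull --ff-only      # deployment configuration lives in git
  source ~stack/stackrc                       # undercloud credentials
  openstack overcloud deploy \
      --templates \
      -e ~stack/templates/network-environment.yaml \
      -e ~stack/templates/sanger-customisations.yaml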
Ceph installation
Integrated or standalone?
• Deployment by RHOSP is easier but is tied to that OpenStack
• A separate self-supported Ceph was more cost effective and a
better fit for staff knowledge at the time
• It’s possible to share a Ceph between multiple OpenStacks
• ceph-ansible is seductive but brings some headaches
• e.g. --check causes problems like changing the fsid
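One defence is to pin the live cluster’s fsid so a re-run or --check can never regenerate it - a sketch, assuming ceph-ansible’s documented fsid/generate_fsid variables (check them against your version):

  FSID=$(ceph fsid)                 # record the running cluster's fsid
  grep -q '^fsid:' group_vars/all.yml || \
      printf 'fsid: %s\ngenerate_fsid: false\n' "$FSID" >> group_vars/all.yml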
Networking
We wanted VXLAN support in switches to enable metal-as-a-service
Unfortunately we’re not there yet…
e.g. ml2 driver bugs: “reserved” is not a valid UUID
We currently have VXLAN double encapsulation
Local customisations
Puppet or what?
We chose to use Ansible
• There’s only a single Puppet post-deploy hook
• Wider strategic use of Ansible within Sanger IT
• Keep configuration in git
Our customisations
• scheduler tweaks (stack not spread, CPU/RAM overcommit) - sketch below
• hypervisor tweaks (instance root disk on Ceph or hypervisor)
• enable SSL for Horizon and API
• change syslog destination
• add “MOTD” to Horizon login page
• change session timeouts
• register systems with RedHat
• and more...
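A sketch of the scheduler tweaks above as the nova.conf settings they boil down to (option names from the Liberty-era docs, values illustrative; reapplied by Ansible after each deploy):

  [DEFAULT]
  ram_weight_multiplier = -1.0   # negative weight stacks instances rather than spreading them
  cpu_allocation_ratio = 4.0     # CPU overcommit
  ram_allocation_ratio = 1.0     # no RAM overcommit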
Customisation pitfalls
Some customisations become obsolete when moving to a newer
version of OpenStack - can’t blindly carry them forward
A redeploy (e.g. to add compute nodes) overwrites configuration so
the customisations need to be reapplied - and there’s a window when
they’re absent
Restarting too many services too quickly upsets HAproxy, rabbitmq...
Flavours and host aggregates
Three main flavour types:
1. Standard “m1.*”
• True cloud-style compute; root disk on hypervisor; 90% of compute
nodes
2. Ceph “c1.*”
• Root disk on Ceph allows live migration; 6 compute nodes support this
3. Reserved “h1.*”
• Limited to tenants running essential availability services
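A sketch of how a flavour class gets tied to its hosts with a host aggregate (requires the AggregateInstanceExtraSpecsFilter; aggregate, host and flavour names are illustrative):

  openstack aggregate create ceph-root
  openstack aggregate add host ceph-root compute-042          # one of the Ceph-root hypervisors
  openstack aggregate set --property flavour_class=c1 ceph-root
  openstack flavor set --property aggregate_instance_extra_specs:flavour_class=c1 c1.medium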
Flavours and host aggregates
Per-project flavours:
• For Cancer group “k1.*”
• True cloud-style compute, like “m1.*”
• Sized to fit two instances on each hypervisor: half the disk, half the CPUs,
half the RAM
• Trying to prevent Ceph “double load” caused by data movement:
Ceph→S3→instance→Cinder volume→Ceph
• Only viable with homogeneous hypervisors and known/predictable
resource requirements
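A sketch of a “half a hypervisor” flavour (numbers illustrative for a 2x14-core / 512GB node, leaving headroom for the host; project name hypothetical):

  openstack flavor create k1.half --vcpus 28 --ram 245760 --disk 400 --private
  openstack flavor set --project pan-prostate k1.half         # grant the tenant access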
Deployment thoughts
“Premature optimisation is the root of all evil” - Knuth
“Get it working, then make it faster” - my boss Pete
“Keep it simple (because I’m) stupid” - me
Turn off h/w acceleration (10GbE offloads guilty until proven innocent)
Find some enthusiastic early adopters to shake the problems out
Deploy, monitor, tweak, rinse, repeat
Metrics, monitoring, logging
Metrics
Find the balance between
“if it moves, graph it”
and
“don’t overload the metrics server”
50,000 metrics every 10 seconds is optimistic
Architecture
We’re using collectd → graphite/carbon → grafana
Modular plugins make it easy to record new metrics e.g.
entropy_avail
Using the collectd libvirt plugin means new instances are
automatically measured
...although the automatic naming isn’t great:
openstack_flex2.instance-00000097_bbb85e84-6c0c-4fe8-9b3c-db17a665e7ef.libvirt.virt_cpu_total
Per-tenant graphs
Logging
We wanted something like Splunk
...but without the £££
We’re using ELK
Today it is just a syslog destination; planning to use rsyslog to watch
OpenStack component log files
Monitoring
Bare minimum in Opsview (Nagios)
• Horizon and API availability
• Controllers up
• radosgw S3 availability
• Ceph nodes up
We’d like hardware status reporting but SuperMicro IPMI is not helpful
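A sketch of the radosgw availability probe (endpoint URL hypothetical; radosgw normally answers an unauthenticated GET / with S3-style XML, so any such reply counts as “up”):

  #!/bin/bash
  URL="https://cog.sanger.example/"
  if curl -ksS --max-time 10 "$URL" | grep -qE 'ListAllMyBucketsResult|<Error>'; then
      echo "OK: radosgw S3 endpoint responding"; exit 0
  else
      echo "CRITICAL: radosgw S3 endpoint not responding"; exit 2
  fi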
Pitfalls and problems
“Space,” it says, “is big. Really big. You just won't believe how vastly,
hugely, mindbogglingly big it is.”
The same is true of OpenStack: there’s a substantial learning curve for
admins and developers
Problems with Docker
Docker likes to use 172.17.0.0/16 for its bridge network
Sanger uses 172.16.0.0/12 for its internal network
...oh.
Also problems with bridge MTU > instance MTU and PMTUD not
working. Fix: --bip=192.168.3.3/24 --mtu=1400
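The same fix expressed as /etc/docker/daemon.json (a sketch; daemon.json has been supported since Docker 1.12, and "bip"/"mtu" are the matching keys):

  cat > /etc/docker/daemon.json <<'EOF'
  {
    "bip": "192.168.3.3/24",
    "mtu": 1400
  }
  EOF
  systemctl restart docker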
Problems with radosgw
Ceph radosgw implements most but not all AWS S3 features
ACLs are implemented, policies are not
We’re trying to implement a write-only bucket using nginx as a proxy
to rewrite the auth header
Problems with DHCP
On Ceph nodes, Ubuntu DHCP client doesn’t request a default
gateway
Infoblox DHCP server sends Classless Static Routes option
DHCP client can override a server-supplied value but not ignore it
The Ceph nodes’ default route ends up pointing down the 1GbE
management NIC not the 2x100GbE bond
...oh.
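One workaround is to discard the offending options on the management NIC before dhclient-script applies them - a sketch assuming the ISC dhclient on Ubuntu and its /etc/dhcp/dhclient-enter-hooks.d mechanism (interface name illustrative):

  cat > /etc/dhcp/dhclient-enter-hooks.d/no-default-route <<'EOF'
  # drop any default-gateway information offered on the 1GbE management NIC
  if [ "$interface" = "eno1" ]; then
      unset new_routers
      unset new_rfc3442_classless_static_routes
  fi
  EOF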
Problems with rabbitmq
rabbitmq partitions are really painful
We sometimes end up rebooting all the controllers - there must be a
better way
Fortunately running instances aren’t affected
Problems with deployment
Running the overcloud deployment from the wrong directory is
very bad
The deployer doesn’t find the file containing the service
passwords and proceeds to change them all, which is very tedious
to recover from
The deployment script really really really needs to have
cd ~stack
to prevent accidents
Problems with cinder
When a volume is destroyed, cinder overwrites the volume with
zeroes
If a user is running a pipeline which creates and destroys many 1TB
volumes this produces a lot of I/O
Consider setting volume_clear and/or volume_clear_size in
cinder.conf
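For reference, a sketch of the cinder.conf settings in question (values illustrative):

  [DEFAULT]
  volume_clear = none        # skip the wipe; the default is "zero", "shred" also exists
  # or keep wiping but bound it:
  # volume_clear_size = 100  # MiB to overwrite on delete; 0 means the whole volume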
Use cases
Prostate cancer analysis
Pan-Prostate builds on previous Pan-Cancer work
Multiple participating institutes using Docker to provide a consistent
analysis framework
In the past that required admin time to build an isolated network,
now OpenStack gives us that for free - and lets the scientists drive it
themselves
wr - Workflow Runner
Reimplementation of Vertebrate Resequencing Group’s pipeline
manager in Go
Designed to be fast, powerful and easy to use
Can manage LSF like the existing version, and adds OpenStack support
https://github.com/VertebrateResequencing/wr
wr - Workflow Runner
Lessons learned:
• “There’s a surprising amount of stuff you have to do to get
everything working well”
• There are annoying gaps in the Go SDK
• Lots of things can go wrong if end users bring up servers, so handle
all the details for them
New Pipeline Group
Using s3fs as a shim on top of radosgw S3 speeds development
s3fs presents a bucket as a filesystem (but it’s turtles all the way
down)
In tests launching up to 240 instances for read-only access to a few GB of
reference sequence data (with caching turned on), up to ~8 might get stuck
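A sketch of the shim itself - mount a radosgw bucket read-only with local caching (bucket name, endpoint and credentials file are illustrative):

  s3fs reference-data /mnt/ref \
      -o url=https://cog.sanger.example \
      -o use_path_request_style \
      -o passwd_file=$HOME/.passwd-s3fs \
      -o ro \
      -o use_cache=/tmp/s3fs-cache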
Human Genetics Informatics
Working towards a production Arvados system
Speedbumps around many tools/SDKs assuming real AWS S3, not
some S3-alike
Sending patches to open-source projects (Packer, Terraform…)
What next?
More Ceph
...because 1PB isn’t enough…
This has implications for DC placement (due to cooling requirements)
and Ceph CRUSH map (to ensure data replicas are properly
separated)
Should we split rbd pools from radosgw pools?
OpenStack version upgrade
We will probably skip to RHOSP10 (Newton)
Need Arista driver integrations for VXLAN for metal-as-a-service
We will install a new system alongside the current one and migrate
users and then compute nodes
$THING-as-a-service
metal - deploy instances on bare metal (Ironic)
key management (Barbican) to enable encrypted volumes
DNS (Designate)
shared filesystem (Manila)
…though many of these can already be achieved with creative use of
images/heat/user-data
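For example (a sketch, all names illustrative), a plain user-data script already gets a long way towards “shared filesystem as a service” without Manila:

  cat > userdata.sh <<'EOF'
  #!/bin/bash
  # turn a stock image into a simple NFS server at boot
  apt-get update && apt-get install -y nfs-kernel-server
  echo '/srv/share 10.0.0.0/16(rw,sync,no_subtree_check)' >> /etc/exports
  mkdir -p /srv/share && exportfs -ra
  EOF
  openstack server create --image ubuntu-xenial --flavor m1.medium \
      --key-name mykey --user-data userdata.sh share-server-01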
Federation
JISC Assent looks interesting
Lots of internal process to work through first
Open questions about:
• scheduling - pre-emptible instances would help
• charging - market-based instance pricing?
Lustre
We have 13PB of Lustre storage
Consider exposing some of it to tenants using Lustre routers, NID
mapping and sub-mounts
Little things
• expose hypervisor RNG to instances (sketch below)
• could make instance key generation go faster
• have LogStash report metrics of “log per host”
• to spot log volume anomalies
• ...
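A sketch of the RNG exposure (image and flavour names illustrative; properties as documented for the libvirt driver):

  openstack image set  --property hw_rng_model=virtio ubuntu-xenial
  openstack flavor set --property hw_rng:allowed=True m1.medium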
Thanks
My colleagues at Sanger - both in Systems and across the institute
The OpenStack community
Helpful people on mailing lists
Questions?
dh3@sanger.ac.uk