Workday’s Next Generation Private Cloud
The Fifth Generation of an OpenStack Platform inside Workday
The Leading Enterprise Cloud for Finance and HR
Customer Satisfaction: 97%
Workers Supported: 60M+
Fortune 500 Companies: 50%+
Silvano Buback
Principal, Software Development Engineer
Jan Gutter
Senior Software Development Engineer
Workday Private Cloud
OpenStack at Workday
8 SREs
9 Developers
SLO: 99% API call success
87 Clusters
2 Million Cores
12.5 PB RAM
60k concurrent VMs
241k VMs recreated weekly
Simple set of OpenStack components to deliver a resilient platform.
● Single client (PaaS)
● ~300 compute nodes per cluster
● Workday service weekend maintenance “The Patch”
● OpenStack projects are used to denote Workday services
● Unique tooling for batch scheduling and capacity planning
Workday’s Use Case
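Since OpenStack projects are used to denote Workday services (per the slide above), a per-service view of the fleet falls out of a grouped server listing. A minimal sketch with openstacksdk, not Workday's actual tooling; the cloud name is a placeholder.

import collections
import openstack

# Hypothetical clouds.yaml entry; any admin-scoped cloud definition works.
conn = openstack.connect(cloud="wday-dev")

# Count instances per project; with one project per Workday service, this
# doubles as a per-service capacity snapshot.
servers_per_project = collections.Counter()
for server in conn.compute.servers(details=True, all_projects=True):
    servers_per_project[server.project_id] += 1

project_names = {p.id: p.name for p in conn.identity.projects()}
for project_id, count in servers_per_project.most_common():
    print(f"{project_names.get(project_id, project_id)}: {count} instances")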
Regular maintenance window every weekend where service VMs are recreated and the Workday application gets upgraded
● “The Power of One” is an important mission for us
● Largest impact to control and data plane during this time
● SLO target is 99% success for all API calls over the week
● 60% of instances deleted/created during “The Patch”
● Remaining 40% are recreated throughout the week
“The Patch”
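The 99% weekly SLO above covers all API calls; checked against Prometheus, it could look something like the sketch below. The metric name and labels are hypothetical stand-ins for whatever the log-derived API metrics actually expose.

import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder endpoint
# Hypothetical counter derived from HAProxy/Apache access logs.
QUERY = (
    'sum(increase(api_requests_total{status!~"5.."}[7d]))'
    ' / sum(increase(api_requests_total[7d]))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
result = resp.json()["data"]["result"]
ratio = float(result[0]["value"][1]) if result else 0.0
print(f"7-day API success ratio: {ratio:.4%} (SLO target: 99%)")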
Development Environment
● We treat everything as a high-security environment
● Weekly builds in dev clusters
● Dev clusters run internal services
● Dev and production run very different workloads
Workday Private Cloud
Private Cloud Evolution
Fourth Generation
• OpenStack Mitaka
• CentOS Linux 7
• Chef
• RPM
• Contrail
• Overlay Networks
• Jenkins for CI
• Jenkins + Internal solution for CD
• Single branch, releases are snapshots
Fifth Generation
• OpenStack Victoria
• CentOS Stream 8
• Kolla-Ansible + plain Ansible
• Kolla Containers (built from source)
• Calico
• L3 only BGP Fabric
• Zuul CI
• Internal solution for CD
• Branch for each stable series
First use of gated development!
CI Tooling
Target multiple scenarios:
• CLI
• Zuul
• Custom Ansible Orchestration Service
Three types of clusters:
• Overcloud - a cluster built from instances in a single tenant
• Zuul - a cluster built from a nodeset
• Baremetal
Pain Point - Multiple Deployment Scenarios
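One way to read the slide above: the entry point varies (CLI, Zuul, the custom Ansible orchestration service), but the deployment underneath should not. A simplified sketch of that idea, with made-up inventory paths around the stock kolla-ansible CLI:

import subprocess

# Illustrative inventory locations, one per cluster type from the slide above.
INVENTORIES = {
    "overcloud": "/etc/kolla/inventory/overcloud",   # instances in a single tenant
    "zuul": "/etc/kolla/inventory/zuul-nodeset",     # built from a Zuul nodeset
    "baremetal": "/etc/kolla/inventory/baremetal",
}

def deploy(cluster_type: str) -> None:
    """Run the same kolla-ansible deploy regardless of which frontend asked."""
    subprocess.run(
        ["kolla-ansible", "-i", INVENTORIES[cluster_type], "deploy"],
        check=True,
    )

if __name__ == "__main__":
    deploy("overcloud")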
Zuul: Expectations vs Reality
● Successfully keeps a lot of core code stable
● Naively expected to reuse community pipeline
● Evolved pipeline multiple times with no interruptions
● Community pipelines tied to community infrastructure
● Use branches for stable releases
● Nothing new about this: OpenStack community also uses this
● “branch for stable release” model was a new concept for us
● We forked https://opendev.org/openstack/releases to handle this
Zuul Pipeline Design
For every tool / service, there’s a Workday name!
Home Grown Tools
List of Home Grown Tools
● DNS Infrastructure
● IP Address Management
● Certificate Authority
● Ansible Orchestration
● Multi Cluster Cloud Overview
● Compute Node Health Check
List of (more) Home Grown Tools
● Capacity Management
● Chef Implementation
● Batch Scheduling
● PaaS (Image Build Service, Instance lifecycle management)
● BM Lifecycle Tracking
● Bare Metal Provisioning Service
Differences with community version
Downstream Changes
Downstream Changes
● TLS everywhere
● Compute nodes use Prometheus/OpenStack integration
● Prometheus upgraded to newer version
● Custom tags based on Kolla-Ansible inventory
● Wavefront integration while we transition to Cortex
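A rough sketch of the "custom tags based on Kolla-Ansible inventory" point above: hosts from an INI-style inventory can be rewritten into Prometheus file_sd targets labelled with their inventory group. Paths and the node-exporter port are assumptions, not Workday's configuration.

import json

INVENTORY = "/etc/kolla/multinode"                  # assumed inventory location
FILE_SD_OUT = "/etc/prometheus/targets/kolla.json"  # assumed file_sd target file

groups = {}
current = None
with open(INVENTORY) as fh:
    for raw in fh:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line.strip("[]")
            groups.setdefault(current, [])
        elif current and ":" not in current:  # skip [group:children] / [group:vars]
            groups[current].append(line.split()[0])  # hostname only, drop host vars

file_sd = [
    {"targets": [f"{host}:9100" for host in hosts],  # node exporter port assumed
     "labels": {"kolla_group": group}}
    for group, hosts in groups.items() if hosts
]
with open(FILE_SD_OUT, "w") as fh:
    json.dump(file_sd, fh, indent=2)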
● New Prometheus Exporters (some are upgrades)
○ libvirt exporter
○ OpenStack exporter upgrade
○ BIRD exporter (BGP router)
● Fluentd parses HAProxy/Apache logs to provide API request metrics
● “Singleton” containers
○ One running container per cluster
○ Using Keepalived for HA
○ Examples: Prometheus, DB Backup, openstack-exporter
● Timeouts/Retry/Performance improvements on K-A deployment
(more) Downstream Changes
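The exporters listed above (libvirt, OpenStack, BIRD) are existing projects; for the smaller custom ones, the pattern is roughly the prometheus_client skeleton below. The metric, port, and collection source are placeholders.

import time
from prometheus_client import Gauge, start_http_server

# Placeholder metric; a real exporter would report libvirt domain state,
# BGP session health, etc.
running_vms = Gauge(
    "hypervisor_running_vms",
    "Number of running VMs reported by the local hypervisor",
)

def collect() -> int:
    # Placeholder for the actual data source (libvirt, OpenStack API, BIRD).
    return 0

if __name__ == "__main__":
    start_http_server(9200)  # arbitrary example port
    while True:
        running_vms.set(collect())
        time.sleep(30)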
● Kolla containers for Calico
● Enabled etcdv3 in Kolla-Ansible
● Building C8 binaries
● Using a local fork of the Neutron plugin
● Wrote our own metadata proxy (TLS support)
● Numerous small changes
○ MTU
○ Newer version of OpenStack
○ DHCP service monitoring
● Most of the changes were in the Neutron plugin; the Felix code is essentially unchanged
Calico Fork
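To illustrate the "wrote our own metadata proxy (TLS support)" item above, here is the general shape of a TLS-terminating proxy in front of the Nova metadata API. It is a sketch only: a real proxy must also supply the instance/tenant identification headers and signature that Nova expects, and the hostnames, port, and certificate path are assumptions.

import http.server
import ssl
import requests

NOVA_METADATA = "http://nova-metadata.internal:8775"  # placeholder backend URL

class MetadataProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request path to the metadata backend over plain HTTP.
        backend = requests.get(NOVA_METADATA + self.path, timeout=10)
        self.send_response(backend.status_code)
        self.send_header("Content-Type",
                         backend.headers.get("Content-Type", "text/plain"))
        self.end_headers()
        self.wfile.write(backend.content)

if __name__ == "__main__":
    server = http.server.HTTPServer(("0.0.0.0", 8443), MetadataProxy)
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("/etc/metadata-proxy/proxy.pem")  # assumed cert path
    server.socket = ctx.wrap_socket(server.socket, server_side=True)
    server.serve_forever()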
Q & A
Random notes about our environment
Other Interesting Bits
● Every instance gets an internally routable IPv4 address. 🤯
● Multiple layers of network security
● Previously: Contrail with virtual overlay networks
● Now: Calico with routing fabric
Requirements for Networking
● In preparation for OpenStack Victoria, we reduced the use of file
injection in our PaaS system significantly
● We were fortunate because we could move service accounts from
one cluster to another
● To reduce transition time, we allocate overlapping ranges
● During The Patch, instances running on the previous generation
are removed
Forklift
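The "overlapping ranges" point above presumably means the outgoing and incoming clusters may draw from address space that overlaps during the transition, since the old instances disappear during The Patch anyway. Purely as an illustration, Python's ipaddress module makes such an overlap explicit; the networks below are made up.

import ipaddress

old_cluster_pool = ipaddress.ip_network("10.20.0.0/16")    # example only
new_cluster_pool = ipaddress.ip_network("10.20.128.0/17")  # example only

print(old_cluster_pool.overlaps(new_cluster_pool))  # True: the pools overlap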
Thank You
