SlideShare a Scribd company logo
Workday’s Next Generation Private Cloud
The Fifth Generation of an OpenStack Platform inside Workday
The Leading Enterprise
Cloud for Finance and HR
Customer Satisfaction:
Workers Supported
97%
60M+
Fortune 500 Companies:
50%+
Silvano Buback
Principal, Software Development Engineer
Jan Gutter
Senior Software Development Engineer
Workday
Private
Cloud
OpenStack at Workday
8 SREs
9 Developers
SLO: 99% API Call
success
87 Clusters
2 Million Cores
12.5 PB RAM
60k concurrent VMs
241k VMs recreated weekly
Simple set of OpenStack components to deliver a resilient platform.
● Single client (PaaS)
● ~300 compute nodes per cluster
● Workday service weekend maintenance “The Patch”
● OpenStack projects are used to denote Workday services
● Unique tooling for batch scheduling and capacity planning
Workday’s Use Case
Regular maintenance window every weekend where where service
VMs are recreated and the Workday application gets upgraded
● “The Power of One” is an important mission for us
● Largest impact to control and data plane during this time
● SLO target is 99% success for all API calls over the week
● 60% of instances deleted/created during “The Patch”
● Remaining 40% are recreated throughout the week
“The Patch”
Development Environment
We Treat
Everything
as a High
Security
Environment
Weekly
Builds in
Dev
Clusters
Dev Clusters
Run Internal
Services
Dev and
Production
Run Very
Different
Workloads
Workday
Private
Cloud
5
Fourth Generation
Private Cloud Evolution
• OpenStack Victoria
• CentOS Stream 8
• Kolla-Ansible + plain Ansible
• Kolla Containers (built from source)
• Calico
• L3 only BGP Fabric
• Zuul CI
• Internal solution for CD
• Branch for each stable series
Fifth Generation
• OpenStack Mitaka
• CentOS Linux 7
• Chef
• RPM
• Contrail
• Overlay Networks
• Jenkins for CI
• Jenkins + Internal solution for CD
• Single branch, releases are snapshots
First use of gated development!
CI Tooling
Target multiple scenarios:
• CLI
• Zuul
• Custom Ansible Orchestration Service
Three types of clusters:
• Overcloud - a cluster built from instances in a single tenant
• Zuul - a cluster built from a nodeset
• Baremetal
Pain Point - Multiple Deployment Scenarios
Zuul: Expectations vs Reality
Successfully
keeps a lot
of core code
stable
Naively
expected to
reuse
community
pipeline
Evolved pipeline
multiple times
with no
interruptions
Community
pipelines tied
to community
infrastructure
● Use branches for stable releases
● Nothing new about this: OpenStack community also uses this
● “branch for stable release” model was a new concept for us
● We forked https://opendev.org/openstack/releases to handle this
Zuul Pipeline Design
For every tool / service, there’s a Workday
name!
Home Grown
Tools
List of Home Grown Tools
DNS
Infrastructure IP Address
Management
Certificate
Authority
Ansible
Orchestration
Multi Cluster
Cloud
Overview
Compute
Node
Health
Check
List of (more) Home Grown Tools
Capacity
Management
Chef
Implementation
Batch
Scheduling
PaaS
(Image Build
Service, Instance
lifecycle
management) BM Lifecycle
Tracking
Bare Metal
Provisioning
Service
Differences with community version
Downstream
Changes
Downstream Changes
TLS everywhere
Compute nodes use
Prometheus/OpenStack
integration
Prometheus upgraded to newer version
Custom tags based on
Kolla-Ansible inventory
Wavefront integration
while we transition to
Cortex
● New Prometheus Exporters (some are upgrades)
○ libvirt exporter
○ OpenStack exporter upgrade
○ BIRD exporter (BGP router)
● Fluentd parses HAproxy/Apache logs to provide API request metrics
● “Singleton” containers
○ One running container per cluster
○ Using Keepalived for HA
○ Examples: Prometheus, DB Backup, openstack-exporter
● Timeouts/Retry/Performance improvements on K-A deployment
(more) Downstream Changes
● Kolla containers for Calico
● Enabled etcdv3 in Kolla-Ansible
● Building C8 binaries
● Using a local fork of the Neutron plugin
● Wrote our own metadata proxy (TLS support)
● Numerous small changes
○ MTU
○ Newer version of OpenStack
○ DHCP service monitoring
● Most of the changes were in the Neutron plugin, Felix code is
essentially unchanged
Calico Fork
Q & A
Random notes about our environment
Other
Interesting Bits
● Every instance gets an internally routable IPv4 address. 🤯
● Multiple layers of network security
● Previously: Contrail with virtual overlay networks
● Now: Calico with routing fabric
Requirements for Networking
● In preparation for OpenStack Victoria, we reduced the use of file
injection in our PaaS system significantly
● We were fortunate because we could move service accounts from
one cluster to another
● To reduce transition time, we allocate overlapping ranges
● During The Patch, instances running on the previous generation
are removed
Forklift
Thank You

More Related Content

What's hot

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
CI/CD on Google Cloud Platform
CI/CD on Google Cloud PlatformCI/CD on Google Cloud Platform
CI/CD on Google Cloud Platform
DevOps Indonesia
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
confluent
 
Migrating To GitHub
Migrating To GitHub  Migrating To GitHub
Migrating To GitHub
Sridhar Peddinti
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
confluent
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
Weaveworks
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
CI:CD in Lightspeed with kubernetes and argo cd
CI:CD in Lightspeed with kubernetes and argo cdCI:CD in Lightspeed with kubernetes and argo cd
CI:CD in Lightspeed with kubernetes and argo cd
Billy Yuen
 
GitOps 101 Presentation.pdf
GitOps 101 Presentation.pdfGitOps 101 Presentation.pdf
GitOps 101 Presentation.pdf
ssuser31375f
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
C4Media
 
Difference between Github vs Gitlab vs Bitbucket
Difference between Github vs Gitlab vs BitbucketDifference between Github vs Gitlab vs Bitbucket
Difference between Github vs Gitlab vs Bitbucket
jeetendra mandal
 
Modern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOpsModern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOps
GlobalLogic Ukraine
 
Containers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes IstioContainers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes Istio
Araf Karsh Hamid
 
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
HostedbyConfluent
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
Rico Chen
 
The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
Weaveworks
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
Running Kafka as a Native Binary Using GraalVM with Ozan GünalpRunning Kafka as a Native Binary Using GraalVM with Ozan Günalp
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
HostedbyConfluent
 
Datadog- Monitoring In Motion
Datadog- Monitoring In Motion Datadog- Monitoring In Motion
Datadog- Monitoring In Motion
Cloud Native Apps SF
 

What's hot (20)

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
CI/CD on Google Cloud Platform
CI/CD on Google Cloud PlatformCI/CD on Google Cloud Platform
CI/CD on Google Cloud Platform
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
 
Migrating To GitHub
Migrating To GitHub  Migrating To GitHub
Migrating To GitHub
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
CI:CD in Lightspeed with kubernetes and argo cd
CI:CD in Lightspeed with kubernetes and argo cdCI:CD in Lightspeed with kubernetes and argo cd
CI:CD in Lightspeed with kubernetes and argo cd
 
GitOps 101 Presentation.pdf
GitOps 101 Presentation.pdfGitOps 101 Presentation.pdf
GitOps 101 Presentation.pdf
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Difference between Github vs Gitlab vs Bitbucket
Difference between Github vs Gitlab vs BitbucketDifference between Github vs Gitlab vs Bitbucket
Difference between Github vs Gitlab vs Bitbucket
 
Modern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOpsModern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOps
 
Containers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes IstioContainers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes Istio
 
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
Schema Registry 101 with Bill Bejeck | Kafka Summit London 2022
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
Running Kafka as a Native Binary Using GraalVM with Ozan GünalpRunning Kafka as a Native Binary Using GraalVM with Ozan Günalp
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
 
Datadog- Monitoring In Motion
Datadog- Monitoring In Motion Datadog- Monitoring In Motion
Datadog- Monitoring In Motion
 

Similar to Workday's Next Generation Private Cloud

Running Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWSRunning Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWS
DoiT International
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview
VMware Tanzu
 
Intel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-finalIntel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-final
Deepak Mane
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
Weaveworks
 
Red Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechRed Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure Tech
ProxyServices
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
 
Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
DigitalOcean
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_GalloAndrea Gallo
 
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVMSven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
ShapeBlue
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
javier ramirez
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz Rice
CloudOps2005
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
Nicolas Brousse
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
OpenEBS
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
Peter Clapham
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
Peter Clapham
 
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebula Project
 
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
Oleg Shalygin
 
Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013
Arthur Berezin
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
Karol Chrapek
 

Similar to Workday's Next Generation Private Cloud (20)

Running Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWSRunning Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWS
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview
 
Intel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-finalIntel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-final
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 
Red Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechRed Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure Tech
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_Gallo
 
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVMSven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz Rice
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
 
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
 
Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Workday's Next Generation Private Cloud

  • 1. Workday’s Next Generation Private Cloud The Fifth Generation of an OpenStack Platform inside Workday
  • 2. The Leading Enterprise Cloud for Finance and HR Customer Satisfaction: Workers Supported 97% 60M+ Fortune 500 Companies: 50%+
  • 3. Silvano Buback Principal, Software Development Engineer Jan Gutter Senior Software Development Engineer
  • 5. OpenStack at Workday 8 SREs 9 Developers SLO: 99% API Call success 87 Clusters 2 Million Cores 12.5 PB RAM 60k concurrent VMs 241k VMs recreated weekly
  • 6. Simple set of OpenStack components to deliver a resilient platform. ● Single client (PaaS) ● ~300 compute nodes per cluster ● Workday service weekend maintenance “The Patch” ● OpenStack projects are used to denote Workday services ● Unique tooling for batch scheduling and capacity planning Workday’s Use Case
  • 7. Regular maintenance window every weekend where where service VMs are recreated and the Workday application gets upgraded ● “The Power of One” is an important mission for us ● Largest impact to control and data plane during this time ● SLO target is 99% success for all API calls over the week ● 60% of instances deleted/created during “The Patch” ● Remaining 40% are recreated throughout the week “The Patch”
  • 8. Development Environment We Treat Everything as a High Security Environment Weekly Builds in Dev Clusters Dev Clusters Run Internal Services Dev and Production Run Very Different Workloads
  • 10. Fourth Generation Private Cloud Evolution • OpenStack Victoria • CentOS Stream 8 • Kolla-Ansible + plain Ansible • Kolla Containers (built from source) • Calico • L3 only BGP Fabric • Zuul CI • Internal solution for CD • Branch for each stable series Fifth Generation • OpenStack Mitaka • CentOS Linux 7 • Chef • RPM • Contrail • Overlay Networks • Jenkins for CI • Jenkins + Internal solution for CD • Single branch, releases are snapshots
  • 11. First use of gated development! CI Tooling
  • 12. Target multiple scenarios: • CLI • Zuul • Custom Ansible Orchestration Service Three types of clusters: • Overcloud - a cluster built from instances in a single tenant • Zuul - a cluster built from a nodeset • Baremetal Pain Point - Multiple Deployment Scenarios
  • 13. Zuul: Expectations vs Reality Successfully keeps a lot of core code stable Naively expected to reuse community pipeline Evolved pipeline multiple times with no interruptions Community pipelines tied to community infrastructure
  • 14. ● Use branches for stable releases ● Nothing new about this: OpenStack community also uses this ● “branch for stable release” model was a new concept for us ● We forked https://opendev.org/openstack/releases to handle this Zuul Pipeline Design
  • 15. For every tool / service, there’s a Workday name! Home Grown Tools
  • 16. List of Home Grown Tools DNS Infrastructure IP Address Management Certificate Authority Ansible Orchestration Multi Cluster Cloud Overview Compute Node Health Check
  • 17. List of (more) Home Grown Tools Capacity Management Chef Implementation Batch Scheduling PaaS (Image Build Service, Instance lifecycle management) BM Lifecycle Tracking Bare Metal Provisioning Service
  • 18. Differences with community version Downstream Changes
  • 19. Downstream Changes TLS everywhere Compute nodes use Prometheus/OpenStack integration Prometheus upgraded to newer version Custom tags based on Kolla-Ansible inventory Wavefront integration while we transition to Cortex
  • 20. ● New Prometheus Exporters (some are upgrades) ○ libvirt exporter ○ OpenStack exporter upgrade ○ BIRD exporter (BGP router) ● Fluentd parses HAproxy/Apache logs to provide API request metrics ● “Singleton” containers ○ One running container per cluster ○ Using Keepalived for HA ○ Examples: Prometheus, DB Backup, openstack-exporter ● Timeouts/Retry/Performance improvements on K-A deployment (more) Downstream Changes
  • 21. ● Kolla containers for Calico ● Enabled etcdv3 in Kolla-Ansible ● Building C8 binaries ● Using a local fork of the Neutron plugin ● Wrote our own metadata proxy (TLS support) ● Numerous small changes ○ MTU ○ Newer version of OpenStack ○ DHCP service monitoring ● Most of the changes were in the Neutron plugin, Felix code is essentially unchanged Calico Fork
  • 22. Q & A
  • 23. Random notes about our environment Other Interesting Bits
  • 24. ● Every instance gets an internally routable IPv4 address. 🤯 ● Multiple layers of network security ● Previously: Contrail with virtual overlay networks ● Now: Calico with routing fabric Requirements for Networking
  • 25. ● In preparation for OpenStack Victoria, we reduced the use of file injection in our PaaS system significantly ● We were fortunate because we could move service accounts from one cluster to another ● To reduce transition time, we allocate overlapping ranges ● During The Patch, instances running on the previous generation are removed Forklift