Silicon Valley Code Camp 2019 - Data center automation using pipeline as code
Have you ever thought of automating the end-to-end workflow for setting up a new data center with a single click? Have you ever thought of automating infrastructure setups so that work which generally takes months of effort is reduced to hours, with a single click? Some examples of such setups are:
1. Setting up the DC network in a reliable and reproducible manner.
2. Automatic OS provisioning on blade servers.
3. Configuring the DC components using idempotent automation workflows.
4. Setting up a highly available internal private cloud / container orchestration platform like Kubernetes on auto-provisioned infrastructure.
5. Managing a very complex inventory state life-cycle workflow.
To accomplish reliable, reproducible and idempotent automation for infrastructure setup, the NVIDIA DevOps team has been working on implementing *DC Automation Manager*, a framework developed using the CI/CD tools ecosystem.
In this presentation we will talk about the design and automation used at NVIDIA GPU Cloud to set up a new DC of 1000s of GPU and CPU blade servers from scratch, using Jenkins and GitOps for:
1. Streamlining inventory life cycle
2. L2/L3 network setups
3. Node provisioning and OS configuration with dynamic inventory capabilities
4. Setting up container orchestration platforms on BM/Cloud
5. Bridging the gap between application engineering and operation engineering.
Why DevOps in Data Center Setup?
As in other production units, such as a Dunkin' Donuts bakery
On-Prem vs Cloud
An analogy for the difference between an on-prem and a cloud setup:
baking a cake on your own vs. buying it from a bakery.
Factors: Cloud ERP vs. On-premise ERP

Cost
  Cloud ERP: Predictable costs, cheaper investment and no additional hardware for the short term.
  On-premise ERP: Initial CAPEX cost is high to set up the BM, with risks involved. Capacity utilization is

Security
  Cloud ERP: Data security is on the vendor side, with associated risks. Matured security infrastructure through IAM (Identity Access Management).
  On-premise ERP: Security is self-driven, with more control, but it is a challenge to integrate access with IAMs/AD/SSO, key management, network segmentation and software/system life-cycle management functions.

Customization
  Cloud ERP: Fewer customizations, offering greater stability but no control over the underlying HW.
  On-premise ERP: Ability to customize and control the underlying HW structure, performance to containers, and kernel settings directly on BM, but associated with challenges in managing them (network, nodes, boot process, virtual infrastructure).

Implementation
  Cloud ERP: Less time to implement a complete workflow.
  On-premise ERP: More control over the implementation process, but it takes longer to implement a workflow.

Resources
  Cloud ERP: Provisioning of resources for compute, storage and networking is taken care of.
  On-premise ERP: Provisioning resources is challenging and needs better understanding.

Use case
  Cloud ERP: Small and midsize businesses seeking lower upfront costs, system stability and ease of
  On-premise ERP: Larger enterprise businesses with higher budgets, a desire to customize system operations and the existing infrastructure for bandwidth-heavy and performance (AI/ML centric) workloads, along with security control.
Automation of DC setup workflow helps...
● Set up DC network in a reliable and reproducible manner.
● Manage Inventory life-cycle through an automated flow.
● Automate OS provisioning on bare metal.
● Configure the DC components.
● Set up highly available private cloud and container orchestration platform.
● Communicate effectively across cross-functional teams.
E2E Automation is the key to better manage spending, security, resources,
customizations and implementation in a stable, reliable, reproducible and
idempotent manner for setting up DC infrastructure.
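To make "idempotent" concrete, here is a minimal sketch of an idempotent configuration step. The helper name `ensure_line` and the sample setting are hypothetical and not part of DC Automation Manager; the point is only that running the step a second time leaves the system unchanged.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed and False if it was
    already in the desired state, so repeated runs never append twice.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "sysctl.conf"  # hypothetical config file
        print(ensure_line(cfg, "net.ipv4.ip_forward = 1"))  # first run changes the file
        print(ensure_line(cfg, "net.ipv4.ip_forward = 1"))  # second run is a no-op
```

Reporting "changed" vs. "unchanged" is a common convention in configuration management tools, and it lets a pipeline tell a real change from a harmless re-run.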
Microservices Inspired Automation
● Autonomous scripts, meant for specific configuration tasks.
● Independent automations communicate with each other using job
● One centralized orchestration job for E2E setup.
● Each automation has a separate codebase and can be managed by its
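The microservices-inspired layout above can be sketched as one orchestration function that chains independent jobs. This is an illustration only: the `Job` class, the `run_pipeline` function and the use of a plain dictionary to hand values from one job to the next are assumptions, not the actual Jenkins implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Params = Dict[str, str]


@dataclass
class Job:
    """One autonomous automation; in practice it would live in its own codebase."""
    name: str
    run: Callable[[Params], Params]  # takes parameters, returns its outputs


def run_pipeline(jobs: List[Job], params: Params) -> Tuple[List[str], Params]:
    """Run jobs in order, feeding each job's outputs to the jobs after it.

    Stops at the first failing job and reports which jobs had completed.
    """
    context = dict(params)
    completed: List[str] = []
    for job in jobs:
        try:
            outputs = job.run(context)
        except Exception as exc:
            raise RuntimeError(
                f"job {job.name!r} failed; completed so far: {completed}"
            ) from exc
        context.update(outputs)
        completed.append(job.name)
    return completed, context


if __name__ == "__main__":
    # Hypothetical job names and outputs, for illustration only.
    pipeline = [
        Job("network_setup", lambda ctx: {"vlan": "100"}),
        Job("os_provisioning", lambda ctx: {"nodes": "node1,node2"}),
        Job("k8s_setup", lambda ctx: {"cluster": "k8s-on-" + ctx["nodes"]}),
    ]
    print(run_pipeline(pipeline, {"dc": "demo-dc"}))
```

Because each job only sees a dictionary of parameters, any job can be re-run or replaced on its own without touching the others.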
Container Orchestration Platform Objectives
● Set up a container orchestration platform like Kubernetes.
● Persist logs of Cluster Management.
● Version control changes to configuration of the cluster.
● Execute the cluster management activities in Jenkins or any CICD tool.
● Enable CI for Container Orchestration.
Container Orchestration Platform
Orchestration Platform: Kubernetes
Management tasks for the following activities can be scripted:
1. k8s cluster creation
2. k8s cluster reset
3. k8s cluster service upgrade
4. k8s cluster scaling up & down
5. k8s cluster validation
K8s Cluster Setup validation
Automate k8s setup & cluster-validation for:
● identifying setup issues much earlier in the cycle
● setting the platform for reliable application deployment.
Example test cases:
● All k8s masters have "Ready" status.
● All k8s nodes have "Ready" status.
● Component status returns healthy for all components.
● All pods in the kube-system namespace are running and healthy.
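Two of the test cases above can be sketched as checks over the JSON that `kubectl get nodes -o json` and `kubectl get pods -n kube-system -o json` return. The function names are hypothetical, and treating a pod as healthy when its phase is `Running` or `Succeeded` is a simplification; a fuller check would also look at container readiness.

```python
import json
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return names of nodes (masters included) whose "Ready" condition is not "True"."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


def unhealthy_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return names of pods whose phase is neither Running nor Succeeded."""
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") not in ("Running", "Succeeded")
    ]


def validate(nodes_json: str, pods_json: str) -> List[str]:
    """Return a list of failure messages; an empty list means the checks passed."""
    failures = []
    for name in not_ready_nodes(json.loads(nodes_json)):
        failures.append(f"node {name} is not Ready")
    for name in unhealthy_pods(json.loads(pods_json)):
        failures.append(f"pod {name} is not running")
    return failures
```

A CI job could feed the two kubectl outputs into `validate` and fail the build when the returned list is not empty.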
● Deploy Containerized Applications in Kubernetes
● Track every deployment
● Group logically related Kubernetes resources.
● Environment Specific Configuration
● Helm charts
○ Allow you to install and upgrade k8s resources.
○ Can be versioned.
○ Can establish dependencies to other charts.
● Tiller - server-side component of Helm.
○ It is deployed in Kubernetes as a pod.
○ It keeps track of the revisions of each deployed release.
○ It allows rollback based on revision number.
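Revision tracking and rollback by revision number can be pictured with a toy in-memory model. This is not Tiller's implementation and the class names are hypothetical; the one behaviour it borrows from Helm is that a rollback is itself recorded as a new revision rather than deleting history.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Revision:
    number: int
    values: Dict[str, str]
    description: str


@dataclass
class ReleaseHistory:
    """Toy model of the revision history kept for one release."""
    name: str
    revisions: List[Revision] = field(default_factory=list)

    def _record(self, values: Dict[str, str], description: str) -> Revision:
        # Revision numbers only ever grow; history is never rewritten.
        rev = Revision(len(self.revisions) + 1, dict(values), description)
        self.revisions.append(rev)
        return rev

    def upgrade(self, values: Dict[str, str]) -> Revision:
        """Record an install (first call) or an upgrade (later calls)."""
        return self._record(values, "upgrade" if self.revisions else "install")

    def rollback(self, number: int) -> Revision:
        """Re-apply the values of an earlier revision as a new revision."""
        for rev in self.revisions:
            if rev.number == number:
                return self._record(rev.values, f"rollback to {number}")
        raise ValueError(f"release {self.name!r} has no revision {number}")

    @property
    def current(self) -> Revision:
        return self.revisions[-1]


if __name__ == "__main__":
    history = ReleaseHistory("web")  # hypothetical release name
    history.upgrade({"image": "v1"})
    history.upgrade({"image": "v2"})
    history.rollback(1)
    for rev in history.revisions:
        print(rev.number, rev.description, rev.values)
```

Keeping the rollback as its own revision means every deployment, including the undo, stays traceable.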
Challenges and Approaches
● Open source tools may not support every use case. Be ready to extend them.
● While setting up Kubernetes at scale, you may encounter failures. We had to
tune the configuration to suit our requirements and fix a few bugs too.
● Kubernetes, Helm, Security, Network switch and router configuration has
● Ansible is stateless. You need an additional dashboard to store the cluster state.
● Versioning infrastructure, unlike versioning an application, is a very complicated process.
● Not all systems support webhooks. You may have to use polling to fetch status
in a few integrations.
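Where a system offers no webhook, the polling fallback mentioned in the last point can be sketched as below. The function and parameter names are hypothetical; `fetch_status` stands in for whatever call returns the remote system's current status.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch_status: Callable[[], T],
    is_done: Callable[[T], bool],
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch_status repeatedly until is_done accepts the result.

    Raises TimeoutError when another wait would pass the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"gave up waiting; last status: {status!r}")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status source standing in for a system without webhooks.
    statuses = iter(["pending", "running", "done"])
    print(poll_until(lambda: next(statuses), lambda s: s == "done", interval_s=0.1))
```

A fixed interval keeps the sketch short; a real integration might back off between attempts to avoid overloading the polled system.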
● Automation can be used to create idempotent, reproducible
infrastructure. Automate now, don’t delay.
● Version the infrastructure code. Follow pull-request & review for each
● Maintain the inventory life cycle in a well-defined state workflow.
● Use a stateful CM (configuration management) tool that can record the state of your infra, if possible.
● Treat security as a primary customer.
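A well-defined state workflow for inventory can be sketched as a small state machine that rejects transitions not listed explicitly. The states and the allowed transitions below are invented for illustration; the talk does not list the actual ones.

```python
from enum import Enum
from typing import Dict, Set


class NodeState(Enum):
    # Hypothetical life-cycle states, not taken from the talk.
    RECEIVED = "received"
    RACKED = "racked"
    PROVISIONED = "provisioned"
    IN_SERVICE = "in_service"
    MAINTENANCE = "maintenance"
    RETIRED = "retired"


# Every legal move is listed explicitly; anything else is rejected.
ALLOWED: Dict[NodeState, Set[NodeState]] = {
    NodeState.RECEIVED: {NodeState.RACKED},
    NodeState.RACKED: {NodeState.PROVISIONED},
    NodeState.PROVISIONED: {NodeState.IN_SERVICE},
    NodeState.IN_SERVICE: {NodeState.MAINTENANCE, NodeState.RETIRED},
    NodeState.MAINTENANCE: {NodeState.PROVISIONED, NodeState.RETIRED},
    NodeState.RETIRED: set(),
}


def transition(current: NodeState, target: NodeState) -> NodeState:
    """Return the new state, or raise if the move is not in the workflow."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


if __name__ == "__main__":
    state = NodeState.RECEIVED
    for nxt in (NodeState.RACKED, NodeState.PROVISIONED, NodeState.IN_SERVICE):
        state = transition(state, nxt)
        print(state.value)
```

Keeping the transition table as data makes the workflow easy to review in a pull request, in line with versioning the infrastructure code.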
Intelligent DevOps systems should be capable of providing:
○ Auto healing
○ Auto error detection
○ Auto rollback
○ Unsupervised ML that automatically tests and verifies deployments.
○ Analysis of data from various tools.
○ Automated testing tools that look for anomalies and failures.