Kubernetes - 7 lessons learned from 7 data centers in 7 months

© 2019 Adobe. Open Infrastructure Summit - Denver 2019
Open Infrastructure Summit, Denver 2019
Tony Gosselin, Site Reliability Engineer @ Adobe
Mike Tougeron, Senior Site Reliability Engineer @ Adobe

$ whoami && id | grep Adobe
 Mike Tougeron
 Senior Site Reliability Engineer @ Adobe
 Twitter: @mtougeron
 Started using Kubernetes in 2015
 Tony Gosselin
 Site Reliability Engineer @ Adobe
 Twitter: @sfotony
 Started using Kubernetes in 2018

Agenda
 Quick Introduction to Adobe Advertising Cloud’s Kubernetes Infrastructure
 Lesson 1: Communication, Teamwork & Training
 Lesson 2: Code to production pipelines
 Lesson 3: The ABCs of Production apps
 Lesson 4: Multi-cloud challenges
 Lesson 5: Knowing your application
 Lesson 6: Metrics based monitoring
 Lesson 7: Autoscaling benefits & challenges

High Traffic: 350 billion requests a day
Latency: <50ms @ 95th percentile
Huge Datasets: Billions of objects to store

Adobe Advertising Cloud’s Kubernetes Overview
 ~225 worker nodes; growing to ~300 in May/June
 6 OpenStack data centers across 4 regions
 Running on VMs
 No persistent storage
 No autoscaling; “fixed” footprint
 Smaller but growing
 3 AWS clusters in us-east-1
 Running on m5d.12xlarge EC2 instances
 EBS volumes for persistent storage
 Uses cluster-autoscaler
 Autoscaling events many times per hour
 Prometheus for monitoring
 Dozens of Machine Learning workloads in AWS
 Reason for frequent autoscaling events
 Cluster updates done via a new image and a rolling update of existing nodes
 Updates are deployed approximately every 4-6 weeks

Lesson 1: Communication, Teamwork & Training

Communication: Reaching large, distributed teams

Teamwork: Who’s responsible for what?

Abstraction vs Experts
 Need understanding of core resources but also need easy onboarding
 Pair programming training sessions
 Remove need for boilerplate
 Don’t avoid abstraction; that leads to duplicated effort
 Don’t abstract to the point where you’re not using Kubernetes
 kubectl should *not* be your entrypoint

Lesson 2: Code to production pipelines
[Pipeline diagram. Labels shown: Dev, Pull Request, Unit testing, merge, master, Deploy bot, Integration testing, Production]
Insert your steps here!

Tools to help build application resources
 Helm (templating and/or tiller)
 Kustomize
 Kapitan
 and more…
 We use a combination of Helm templating for infrastructure/3rd-party and Kustomize for application teams

"GitOps"
 Overloaded term that means different things to different people
 Builds happen at a different stage than people were used to
 Poly-repo => mono-repo migration – everyone has opinions
 "But how do I control when my app is released?"

Lesson 3: The ABCs of Production
 HorizontalPodAutoscaler
 PodDisruptionBudget
 "DevOps"
 Cluster Upgrades

HorizontalPodAutoscaler
 Easily scale on CPU or Memory usage
 Also able to scale on custom metrics like http_requests from Ingress resources
 Don’t set replicas in your Deployment
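
As a rough illustration of how the HorizontalPodAutoscaler arrives at a replica count, here is a minimal Python sketch of the scaling rule described in the Kubernetes documentation: desired = ceil(current * currentMetric / targetMetric), with no change while the ratio stays within a tolerance of 1.0. The function name and the sample numbers are assumptions for illustration; min/max replica limits and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the documented HPA rule (hypothetical helper, not a real API).

    desired = ceil(current_replicas * current_metric / target_metric)
    The 0.1 tolerance mirrors the documented controller default.
    """
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 10 pods averaging 30% CPU against a 60% target: scale in.
print(desired_replicas(10, 30, 60))
```

Because the autoscaler owns the replica count, a hard-coded replicas value in the Deployment manifest would overwrite its decision each time the manifest is re-applied, which is presumably why the slide advises leaving it unset.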

PodDisruptionBudget
 Not the same thing as a Deployment strategy
 Helps prevent taking down so many Pods that the application is overwhelmed
 Can set by minAvailable or maxUnavailable by number or percentage
 Good for helping keep quorum
 Doesn’t apply to manual deletions
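
To make the minAvailable / maxUnavailable arithmetic concrete, the sketch below computes how many voluntary evictions a budget would still allow. It assumes the behavior described in the Kubernetes documentation, where percentages are resolved against the expected Pod count and rounded up. The function names and the example numbers are hypothetical, and this is only an approximation of what the disruption controller does.

```python
import math


def resolve(value, expected_pods: int) -> int:
    """Turn an int or a percentage string such as "50%" into a Pod count (rounded up)."""
    if isinstance(value, str) and value.endswith("%"):
        return math.ceil(expected_pods * int(value[:-1]) / 100)
    return int(value)


def allowed_disruptions(expected_pods: int, healthy_pods: int,
                        min_available=None, max_unavailable=None) -> int:
    """How many Pods could be evicted right now without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = resolve(min_available, expected_pods)
    else:
        required_healthy = expected_pods - resolve(max_unavailable, expected_pods)
    return max(0, healthy_pods - required_healthy)


# A 3-member quorum service that must keep 2 members up.
print(allowed_disruptions(3, 3, min_available=2))
# Same service with one member already unhealthy: no further evictions allowed.
print(allowed_disruptions(3, 2, min_available=2))
```

The second call shows the quorum case from the slide: once one member is down, a node drain has to wait. A manual Pod deletion bypasses this check entirely.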

DevOps
 Expertise/specialists
 But empowerment & speed
 Things get lost in shuffle
 Everyone can do everything; aka don’t forget your guardrails

Cluster Upgrades - Blue/Green or Canary?
 Who really has the hardware to run a 2nd full Kubernetes cluster in their datacenter?
 Public cloud is easier, but you still have cost considerations
 Are the application team(s) able to handle deploying to a 2nd mirrored cluster?
 Does it make more sense to run N workers of a different version/config for a period of time?
 Do you have the visibility into the cluster to know how one performs vs the other?

Lesson 4: Multi-Cloud Challenges
 Advertising Cloud lives in two (cloud) worlds
 AWS runs the majority of our persistent storage workloads
 ML
 User data ingress
 “Master” clusters – Kafka, Aerospike, internal applications
 Only in one region
 OpenStack runs our bidding and ad server infrastructure
 Multiple regions worldwide
 Regions persist long-term data to AWS

Our Purpose-Built OpenStack Regions
 We originally built for specific needs/design w/just Nova & Neutron
 Deploy full rack at a time (no piecemeal expansion)
 Mantra: Build What You Need
 “Suddenly”, Kubernetes!
 What to do?
 Square AWS peg, round OpenStack hole
 No persistent storage
 Compute and rack anti-affinity now matter
 No autoscaling
 Wrapping heads around the custom/specific design can be hard

Multiple code-bases but consistent infrastructure
 Packer – Shared modular code base, different builders
 Terraform – Separate but closely aligned code bases
 Puppet – Same code base
 Helm – Same modular code base
 Leverage templating to build the same deployments for different (and future) clouds
 Re-use, re-use, re-use!
 Lab environments in all clouds
 OSSIA for HV/rack metadata for region/zone

Lesson 5: Knowing your applications
 Seems like an obvious statement but it’s easy to forget to think about
 Kubernetes brings advantages, but not all the ones that bare metal and virtual machines bring out of the box
 Think about how your app actually functions
 Service Discovery
 Persistent Storage
 Shared Storage (e.g. replication, sharding, etc)
 Scheduling / Restarting
 Networking Ingress / Egress
 Think about how your app is going to handle the way Kubernetes does things
https://imgur.com/gallery/B4D7Lf1

Elasticsearch as Deployment (What We Did)
https://www.slideshare.net/JoergHenning/elasticsearch-on-kubernetes (slightly modified)

Oops…yeah Touge, I think something is wrong…

Elasticsearch as StatefulSet (What We Should Have Done)
https://www.slideshare.net/JoergHenning/elasticsearch-on-kubernetes (slightly modified)
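
The slides do not spell out what went wrong with the Deployment, so the following is only one commonly cited difference, not the authors' diagnosis. A StatefulSet gives each Pod a stable ordinal name and, together with a headless Service, a stable DNS name, while Deployment Pods get a fresh random suffix on every reschedule. The sketch below builds those stable names; the names es-data, es-discovery, and search are hypothetical.

```python
def statefulset_pod_dns(statefulset: str, headless_service: str, namespace: str,
                        replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Stable per-Pod DNS names for a StatefulSet behind a headless Service.

    Pods are named <statefulset>-<ordinal>, and each one resolves at
    <pod>.<service>.<namespace>.svc.<cluster_domain>.
    """
    return [
        f"{statefulset}-{ordinal}.{headless_service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Hypothetical three-node Elasticsearch data tier.
for host in statefulset_pod_dns("es-data", "es-discovery", "search", 3):
    print(host)
```

A clustered datastore can use a fixed list like this as its seed hosts, and each ordinal can keep its own persistent volume claim across restarts.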

You don’t have to know everything

Lesson 6: Metrics-Based Monitoring

Monitoring All The Things
 In k8s, everything is transient
 We care more about deltas and patterns than about events
• Resource requests versus actual usage
• Disk performance for machine learning
• Predicting volume utilization
 (We still care about events, too)
 This lesson is applicable beyond just Kubernetes
 Every new component needs monitoring
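
As an illustration of "deltas and patterns" applied to predicting volume utilization, the sketch below fits a least-squares line through recent free-space samples and extrapolates it forward. It is similar in spirit to a linear-prediction query in a metrics system, but it is not the authors' actual rule. The function name and sample values are assumptions.

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Extrapolate a metric with a least-squares line.

    samples: (timestamp_seconds, value) pairs, oldest first.
    Returns the predicted value seconds_ahead after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Free space in GiB sampled hourly; the volume is losing about 10 GiB per hour.
free_gib = [(0, 100.0), (3600, 90.0), (7200, 80.0)]
predicted = predict_linear(free_gib, 4 * 3600)
if predicted < 50:
    print(f"volume predicted to drop to {predicted:.0f} GiB free within 4 hours")
```

Alerting on the predicted value rather than on a fixed threshold gives warning while there is still time to act, which matches the slide's preference for patterns over single events.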

Lesson 7: Autoscaling benefits & challenges
 Cost savings by running only what you need
 Easier auto-remediation
 Frequent autoscaling means frequent rescheduling
 Can be slow to re-attach EBS volumes / can't re-attach across AZs

Immutable images vs config management
 We use immutable images for core Kubernetes changes
 e.g., apiserver, kubelet, docker
 Apps like kube-proxy, calico are deployed via GitOps
 Works great in AWS where we have persistent storage; not so much where we don't
 Exploring “light” config management to reduce the need to change images

Bonus Lesson (aka off by 1 error): Take a deep breath
 Same team so we all learn & fix together
 Experience has been enlightening & engineers have had fun
 Teams already onboarded are moving faster than before
 Dev cycle to production is faster as we integrate more automated testing

Thanks!
SLIDE DECK URL GOES HERE
Mike Tougeron
Email: tougeron@adobe.com
Twitter: @mtougeron
Tony Gosselin
Email: gosselin@adobe.com
Twitter: @sfotony
Images from https://stock.adobe.com
