© 2019 Adobe. Open Infrastructure Summit - Denver 2019
Kubernetes - 7 lessons learned from 7 data centers in 7 months
Open Infrastructure Summit, Denver 2019
Tony Gosselin, Site Reliability Engineer @ Adobe
Mike Tougeron, Senior Site Reliability Engineer @ Adobe
$ whoami && id | grep Adobe
• Mike Tougeron
  • Senior Site Reliability Engineer @ Adobe
  • Twitter: @mtougeron
  • Started using Kubernetes in 2015
• Tony Gosselin
  • Site Reliability Engineer @ Adobe
  • Twitter: @sfotony
  • Started using Kubernetes in 2018
Agenda
• Quick Introduction to Adobe Advertising Cloud’s Kubernetes Infrastructure
• Lesson 1: Communication, Teamwork & Training
• Lesson 2: Code to production pipelines
• Lesson 3: The ABCs of Production apps
• Lesson 4: Multi-cloud challenges
• Lesson 5: Knowing your application
• Lesson 6: Metrics-based monitoring
• Lesson 7: Autoscaling benefits & challenges
High Traffic: 350 billion requests a day
Latency: <50ms @ 95th percentile
Huge Datasets: Billions of objects to store
Adobe Advertising Cloud’s Kubernetes Overview
• ~225 worker nodes; growing to ~300 in May/June
• 6 OpenStack data centers across 4 regions
  • Running on VMs
  • No persistent storage
  • No autoscaling; “fixed” footprint
  • Smaller but growing
• 3 AWS clusters in us-east-1
  • Running on m5d.12xlarge EC2 instances
  • EBS volumes for persistent storage
  • Uses cluster-autoscaler
  • Autoscaling events many times per hour
• Prometheus for monitoring
• Dozens of Machine Learning workloads in AWS
  • Reason for frequent autoscaling events
• Cluster updates done via new image and rolling update of existing nodes
  • Updates are deployed approx. every 4-6 weeks
Lesson 1: Communication, Teamwork & Training
Communication: Reaching large, distributed teams
Teamwork: Who’s responsible for what?
Abstraction vs Experts
• Need understanding of core resources but also need easy onboarding
• Pair programming training sessions
• Remove need for boilerplate
• Don’t avoid abstraction to the point where teams duplicate effort
• Don’t abstract to the point where you’re not using Kubernetes
• kubectl should *not* be your entrypoint
Lesson 2: Code to production pipelines
[Pipeline diagram; labels: Dev, Pull Request, master, Unit testing, merge, Deploy bot, Production, Integration testing. Caption: "Insert your steps here!"]
Tools to help build application resources
• Helm (templating and/or Tiller)
• Kustomize
• Kapitan
• and more…
• We use a combination of Helm templating for infrastructure/3rd-party and Kustomize for application teams
"GitOps"
• Overloaded term that means different things to different people
• Builds happen at a different stage than people were used to
• Poly-repo => mono-repo migration: everyone has opinions
• "But how do I control when my app is released?"
Lesson 3: The ABCs of Production
• HorizontalPodAutoscaler
• PodDisruptionBudget
• "DevOps"
• Cluster Upgrades
HorizontalPodAutoscaler
• Easily scale on CPU or memory usage
• Also able to scale on custom metrics like http_requests from Ingress resources
• Don’t set replicas in your Deployment
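The scaling decision behind the HorizontalPodAutoscaler can be sketched in a few lines. This is a minimal illustration of the ratio formula described in the Kubernetes documentation (desired = ceil(current replicas × current metric / target metric)), not the controller's actual code. The function name, the replica bounds, the 0.1 tolerance, and the sample numbers are assumptions chosen for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA ratio formula (hypothetical helper, not a Kubernetes API)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


# 4 Pods averaging 90% CPU against a 60% target scale out to 6 Pods.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

Because the autoscaler keeps recomputing this number, a hard-coded replicas value in the Deployment would likely reset it on every apply, which is one reading of the last bullet above.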
PodDisruptionBudget
• Not the same thing as a Deployment strategy
• Helps prevent taking down so many Pods that the application is overwhelmed
• Can set by minAvailable or maxUnavailable by number or percentage
• Good for helping keep quorum
• Doesn’t apply to manual deletions
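How a budget turns into "how many Pods may be evicted right now" can be sketched as follows. This is a simplified model, not the disruption controller itself: the helper names are hypothetical, and percentages are rounded up, which matches how the Kubernetes documentation describes percentage values.

```python
def resolve(value, total):
    """Turn an int or a percentage string such as "50%" into a Pod count (rounded up)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (total * percent + 99) // 100  # integer ceiling
    return int(value)


def allowed_disruptions(expected_pods, healthy_pods,
                        min_available=None, max_unavailable=None):
    """Pods that may still be evicted without violating the budget (hypothetical helper)."""
    if min_available is not None:
        must_stay = resolve(min_available, expected_pods)
    else:
        must_stay = expected_pods - resolve(max_unavailable, expected_pods)
    return max(0, healthy_pods - must_stay)


# Three masters that need two members for quorum: one eviction at a time.
print(allowed_disruptions(3, 3, min_available=2))
# With one master already down, a node drain has to wait.
print(allowed_disruptions(3, 2, min_available=2))
```

The quorum case is why a budget pairs well with node drains: the drain pauses instead of taking out a second member.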
DevOps
• Expertise/specialists
• But empowerment & speed
• Things get lost in shuffle
• Everyone can do everything; aka don’t forget your guardrails
Cluster Upgrades - Blue/Green or Canary?
• Who really has the hardware to run a 2nd full Kubernetes cluster in their datacenter?
• Public cloud is easier, but you still have cost considerations
• Are the application team(s) able to handle deploying to a 2nd mirrored cluster?
• Does it make more sense to run N workers of a different version/config for a period of time?
• Do you have the visibility into the cluster to know how one performs vs the other?
Lesson 4: Multi-Cloud Challenges
• Advertising Cloud lives in two (cloud) worlds
• AWS runs the majority of our persistent storage workloads
  • ML
  • User data ingress
  • “Master” clusters: Kafka, Aerospike, internal applications
  • Only in one region
• OpenStack runs our bidding and ad server infrastructure
  • Multiple regions worldwide
  • Regions persist long-term data to AWS
Our Purpose-Built OpenStack Regions
• We originally built for specific needs/design w/just Nova & Neutron
  • Deploy full rack at a time (no piecemeal expansion)
  • Mantra: Build What You Need
• “Suddenly”, Kubernetes!
• What to do?
  • Square AWS peg, round OpenStack hole
  • No persistent storage
  • Compute and rack anti-affinity now matter
  • No autoscaling
  • Wrapping heads around the custom/specific design can be hard
Multiple code bases but consistent infrastructure
• Packer: shared modular code base, different builders
• Terraform: separate but closely aligned code bases
• Puppet: same code base
• Helm: same modular code base
• Leverage templating to build the same deployments for different (and future) clouds
• Re-use, re-use, re-use!
• Lab environments in all clouds
• OSSIA for HV/rack metadata for region/zone
Lesson 5: Knowing your applications
• Seems like an obvious statement but it’s easy to forget to think about
• Kubernetes brings advantages, but not all the ones that bare metal and virtual machines bring out of the box
• Think about how your app actually functions
  • Service Discovery
  • Persistent Storage
  • Shared Storage (e.g. replication, sharding, etc)
  • Scheduling / Restarting
  • Networking Ingress / Egress
• Think about how your app is going to handle the way Kubernetes does things
https://imgur.com/gallery/B4D7Lf1
Elasticsearch as Deployment (What We Did)
https://www.slideshare.net/JoergHenning/elasticsearch-on-kubernetes (slightly modified)
Oops…yeah Touge, I think something is wrong…
Elasticsearch as StatefulSet (What We Should Have Done)
https://www.slideshare.net/JoergHenning/elasticsearch-on-kubernetes (slightly modified)
You don’t have to know everything
Lesson 6: Metrics-Based Monitoring
Monitoring All The Things
• In k8s, everything is transient
• We care more about deltas and patterns than about events
  • Resource requests versus actual usage
  • Disk performance for machine learning
  • Predicting volume utilization
• (We still care about events, too)
• This lesson is applicable beyond just Kubernetes
• Every new component needs monitoring
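Two of the deltas above, requests versus actual usage and predicting volume utilization, reduce to simple arithmetic. The sketch below is a hand-rolled illustration with hypothetical function names and made-up sample values. In practice a monitoring system would do this for you; Prometheus's predict_linear() is similar in spirit, though it fits a linear regression rather than using only two endpoints.

```python
def request_utilization(requested, used):
    """Fraction of a resource request that is actually being used."""
    return used / requested


def hours_until_full(samples, capacity):
    """Linear extrapolation from (hour, amount_used) samples.

    Uses only the first and last sample; returns None when usage is not growing.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)
    if rate <= 0:
        return None
    return (capacity - u1) / rate


# A Pod requesting 4 CPUs but using 1 is only 25% utilized.
print(request_utilization(requested=4, used=1))
# A 100 GiB volume that grew from 40 to 60 GiB in 10 hours fills in about 20 more hours.
print(hours_until_full([(0, 40), (10, 60)], capacity=100))
```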
Lesson 7: Autoscaling benefits & challenges
• Cost savings by running only what you need
• Easier auto-remediation
• Frequent autoscaling means frequent rescheduling
• Can be slow to re-attach EBS volumes / can't re-attach across AZs
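The availability-zone constraint on EBS volumes can be pictured as a filter on candidate nodes. This is a toy model, not scheduler code: the function, node names, and zone layout are hypothetical.

```python
def schedulable_nodes(nodes, volume_zone=None):
    """Names of nodes a Pod could land on (hypothetical helper).

    nodes maps node name -> availability zone. A Pod with an EBS-backed
    volume can only run in the zone where that volume lives.
    """
    return [name for name, zone in nodes.items()
            if volume_zone is None or zone == volume_zone]


nodes = {"node-1": "us-east-1a", "node-2": "us-east-1b", "node-3": "us-east-1b"}

# A stateless Pod can go anywhere.
print(schedulable_nodes(nodes))
# A Pod whose volume lives in us-east-1a has a single candidate.
print(schedulable_nodes(nodes, volume_zone="us-east-1a"))
# If the autoscaler removes node-1, that Pod has nowhere to go until a new 1a node appears.
del nodes["node-1"]
print(schedulable_nodes(nodes, volume_zone="us-east-1a"))
```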
Immutable images vs config management
• We use immutable images for core Kubernetes changes
  • e.g., apiserver, kubelet, docker
• Apps like kube-proxy, calico are deployed via GitOps
• Works great in AWS where we have persistent storage; not so much where we don't
• Exploring “light” config management to reduce the need to change images
Bonus Lesson (aka off by 1 error): Take a deep breath
• Same team so we all learn & fix together
• Experience has been enlightening & engineers have had fun
• Teams already onboarded are moving faster than before
• Dev cycle to production is faster as we integrate more automated testing
Thanks!
Mike Tougeron
Email: tougeron@adobe.com
Twitter: @mtougeron
Tony Gosselin
Email: gosselin@adobe.com
Twitter: @sfotony
Images from https://stock.adobe.com

Editor's Notes

  • #5 Adobe Advertising Cloud allows you to manage video, display, and search advertising across traditional TV and digital formats.
  • #7  ./deploy-ami.py master --context aws-lab
  • #9 Repeat, repeat, repeat. There's always a medium that someone doesn't read even if they are supposed to. Shout it from the mountain top. Still drives me nuts.
  • #10 Deploybot deploys YAML after it is committed to git. Team A wrote the app, Team X had a failure: who gets alerts? Assumptions were made by all parties involved. Same type of problem with the Registry server. It all boils down to lack of communication.
  • #11 Don’t have a good answer for everyone. Balance is key to success.
  • #12 Crucial to success. A slow pipeline slows down adoption and creates friction. An easy pipeline creates the “that’s it?” question far too often :)
  • #18 Reverse of the saying: now it's “can't see the trees for the forest”.
  • #20 We chose canary: app teams are not far enough along to support cross-cluster LB.
  • #21 Most data warehousing and analytics processing happens in AWS. Bidding and ad serving then happen in one of our six OpenStack regions throughout the world. This allows us the best of both worlds: burstable compute and storage when we need it, and the cheap, fast, low-latency compute that the majority of our workload needs.
  • #22 BWYN = Build What You Need. We focused on Nova and Neutron only, because we only used EC2 in our remote regions (no persistent storage). This helped make on-prem incredibly cost-effective, easy to manage, and durable. Business needs change fast, sometimes faster than equipment depreciation (five years doesn’t work). Replacing everything with a greenfield deployment is rare. We needed to make our Kubernetes architecture work on our purpose-built OS stack without sacrificing our internal SLAs.
  • #23 We re-used much of the AWS code, and adapted it to be modular based on the target cloud. Consistency across clusters and clouds. Write once, target. OSSIA (Open Stack Simple Inventory API) was written in-house by Mykola Moglyenko. It allows us to tag pods by their physical location in the cage, and make decisions that evenly spread out workloads. Adobe will be open-sourcing this tool this spring.
  • #24 Does a fixed hostname make a difference (e.g., app-0, app-1, as with ZooKeeper)? How does the app/service save its state? What about data? Does it need to replicate? How well does it handle rescheduling? How do other applications or teams access the app/service?
  • #25 How many people have run an Elasticsearch cluster, or at least know about Elasticsearch? We followed a blog post to set it up in K8s. Not a bad thing! We just didn’t think in a Kubernetes way. It looked like this. This lived in our AWS cluster, where our ML jobs were causing a lot of autoscaling up and down. Fair amount of volatility. When we first deployed it, it worked! Then we upgraded our nodes, which meant draining and replacing them one at a time. Lots of app rescheduling. Lots of autoscaler activity.
  • #26 While deploying new worker images to our nodes, we noticed this happening to Elasticsearch. Everything was suddenly in CLBO (CrashLoopBackOff), with unassigned primaries and replicas. When we got things back up, we found we had lost 7% of our data (this was in dev).
  • #27 Converted the es-master Deployment to a StatefulSet. This makes sure that master nodes are gracefully removed and re-added, without impacting quorum. Adjusted cluster deployment scripts. Respect the pod disruption budget for longer timeouts. Pre-cordon nodes. Increase size of cluster before draining nodes. Disabled the cluster-autoscaler (so the cluster will stay inflated).
  • #28 You’re not alone! There are resources to help ensure you are using best practices needed to run your application, now and during future upgrades
  • #30 Nodes, Pods, Deployments. Deltas and patterns: resource requests versus actual usage, disk performance, volume utilization.
  • #31 More rescheduling of apps. Slower for apps with EBS volumes.
  • #32 Roll out config management for “light” changes, such as patches and minor point releases. This will help us reduce the occurrence of image replacement. When we reached the point of creating our on-prem clusters in our OpenStack-based data centers, we ran into new problems that challenged our immutable approach. Less overhead capacity: we don’t have the room to spin up 50% new workers to shift things around. This means, invariably, some loads are going to get shifted to worker nodes that have yet to be upgraded, and if that happens, they will be rescheduled more than once. As we saw in lesson 5, if the app is configured properly it can weather this problem. However, it isn’t an ideal experience to have multiple nodes flapping (triggering deployment health alarms along the way). Scale: at capacity, each region will have 400-450 workers. If we are upgrading in groups, we are looking at a process that will take hours and hours, with multiple shifts in application loads. No persistent storage: this causes headaches for both developers and ops. For developers, it means when a clustered node is rescheduled, it must re-sync whatever sharded data it needs before it can re-join the cluster; this takes time. For ops, it means recovery of etcd must happen in a semi-manual way, as we must both tell the cluster that the old node is gone and point the new node at the existing cluster.
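The "hours and hours" point in the last note is easy to check with back-of-the-envelope arithmetic. The sketch below assumes workers are replaced in fixed-size batches, one batch after another. The batch size of 20 and the 15 minutes per batch are invented for illustration and do not come from the talk; only the 400-450 worker range does.

```python
import math


def upgrade_hours(workers, batch_size, minutes_per_batch):
    """Wall-clock hours to replace every worker, one batch at a time."""
    batches = math.ceil(workers / batch_size)
    return batches * minutes_per_batch / 60


# Assumed values: 20 workers per batch, 15 minutes to drain and replace a batch.
for workers in (400, 450):
    print(workers, upgrade_hours(workers, batch_size=20, minutes_per_batch=15))
```

Under these assumed values a single region takes roughly five to six hours.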