Adopting Karpenter for
Cost and Simplicity
Logan Ballard
Senior Software Engineer
When Cluster Autoscaler isn’t meeting your
needs in EKS, consider Karpenter. We will walk
through our journey to Karpenter, the alternatives
we considered, the associated trade-offs, and why
Karpenter has ultimately helped us reduce
costs and complexity to better serve our
customers on AWS.
Optimizing Autoscaling
Overview of platform engineering at Grafana Labs
● Company goal: one-stop-shop for observability
○ “Query, visualize, alert on, and understand your data no matter where it’s stored. With Grafana you can
create, explore, and share all of your data through beautiful, flexible dashboards.”
● Grafana platform team provides a PaaS for the product layer
○ Managing k8s, CI/CD, and interfaces for the “golden path”
● To deliver the platform, we use managed Kubernetes offerings
○ GKE (Google)
○ EKS (Amazon)
○ AKS (Azure)
● Platform team is further divided into sub-specialties, one of which is capacity
Overview of capacity at Grafana Labs
● Capacity is responsible for engineering around usage optimization as well as reliability
○ Observability into cost and utilization
○ Ownership of scaling tools
○ Company-wide efforts to optimize resource usage
● Today we’ll be focusing on autoscaling
Overview of Cluster Autoscaler
● Up until last year, we used
Cluster Autoscaler exclusively
● Scaling heuristic:
○ Demand goes up and down in the
form of pod resource requests
○ If there is too much CPU or Memory
requested, more nodes are added
○ If resource requests scale down,
nodes are removed
Source: https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/
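The scaling heuristic above can be sketched roughly as follows. This is a deliberately simplified illustration, not how Cluster Autoscaler is implemented: it sums requests across the cluster and compares them to a single hypothetical node shape, whereas the real tool reasons about individual pods and node groups. The names `NodeShape`, `nodes_needed`, and `scaling_decision` are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeShape:
    """Allocatable resources of one (hypothetical) node type."""
    cpu: float
    memory_gib: float


def nodes_needed(pod_requests, shape):
    """Return the node count implied by summed pod requests.

    pod_requests is a list of (cpu, memory_gib) tuples.
    The tighter of the two resources decides the count.
    """
    total_cpu = sum(cpu for cpu, _ in pod_requests)
    total_mem = sum(mem for _, mem in pod_requests)
    by_cpu = math.ceil(total_cpu / shape.cpu)
    by_mem = math.ceil(total_mem / shape.memory_gib)
    return max(by_cpu, by_mem)


def scaling_decision(current_nodes, pod_requests, shape):
    """Compare the current node count to the count the requests imply."""
    target = nodes_needed(pod_requests, shape)
    if target > current_nodes:
        return ("add", target - current_nodes)
    if target < current_nodes:
        return ("remove", current_nodes - target)
    return ("hold", 0)
```

For example, ten pods requesting 1 CPU and 2 GiB each on 4-CPU, 16-GiB nodes are CPU-bound, so the sketch asks for three nodes even though memory alone would fit on two.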
What we liked about Cluster Autoscaler
● Simple, standardized, and has lots of support
● Accomplished basic scaling needs
● Infrastructure was managed in a standardized way
○ Node Groups defined in Terraform
● Deployment was managed in a standardized way, as either:
○ GKE: part of the managed offering
○ EKS: as-code in jsonnet, same as all our other workloads
● …we already had it
What we didn’t like about Cluster Autoscaler
● Lots of overhead when defining diverse machine types
● Inability to understand max pod IP restrictions when making scaling decisions
● No support for “soft” scheduling constraints
○ E.g. preferredDuringSchedulingIgnoredDuringExecution
● Experimentation was challenging
● Upgrades were a long process
● No fallback for workloads from spot nodes -> on demand nodes
Why change?
● We measure “idle ratio” to calculate scheduling efficiency
● Idle ratio is calculated as a product of:
○ ((CPU allocatable) - (CPU used)) / (CPU allocatable)
○ ((Memory allocatable) - (Memory used)) / (Memory allocatable)
● Idle ratio is a “golden signal” for capacity squad
● High idle ratio is wasted $$$
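As a worked illustration of the idle ratio, here is a minimal sketch that takes the slide literally: the idle fraction of CPU multiplied by the idle fraction of memory. The function names are hypothetical, and whether "used" means requested or actually consumed resources is not stated in the slides, so treat the inputs as an assumption.

```python
def idle_fraction(allocatable, used):
    """Share of one resource that is allocatable but not used."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    return (allocatable - used) / allocatable


def idle_ratio(cpu_allocatable, cpu_used, mem_allocatable, mem_used):
    """Product of the CPU and memory idle fractions, as described on the slide."""
    return (idle_fraction(cpu_allocatable, cpu_used)
            * idle_fraction(mem_allocatable, mem_used))
```

With 100 allocatable cores of which 50 are used, and 400 GiB allocatable of which 300 are used, the idle fractions are 0.5 and 0.25, giving an idle ratio of 0.125.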
Idle Ratio Visualized
GKE VS AWS
⚠ Idle ratio in AWS was almost double that of GKE ⚠
What to do?
● Nothing?
● Try to replicate the setup of GKE with its OPTIMIZE_UTILIZATION autoscaling profile
○ Run a custom scheduler: ‘MostAllocated’ scoring strategy
○ Configure Cluster Autoscaler: set expanders to priority? Most pods?
○ All of this is obscured behind the GKE control plane and would be a tedious process
● Manage our own control plane?
○ No
● Karpenter?
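To make the ‘MostAllocated’ idea concrete, here is a rough sketch of that style of scoring: among nodes where the pod fits, prefer the one that ends up fullest. It is an illustration of the concept only, with equal weights per resource and hypothetical helper names; it is not the kube-scheduler implementation.

```python
def most_allocated_score(node, pod):
    """Score a node from 0 to 100 by how full it would be after placing the pod.

    node: {"allocatable": {...}, "requested": {...}} keyed by resource name.
    pod:  {resource name: requested amount}.
    Returns None when the pod does not fit.
    """
    fractions = []
    for resource, capacity in node["allocatable"].items():
        after = node["requested"].get(resource, 0) + pod.get(resource, 0)
        if after > capacity:
            return None
        fractions.append(after / capacity)
    return 100 * sum(fractions) / len(fractions)


def pick_node(nodes, pod):
    """Return the name of the feasible node with the highest score, if any."""
    scored = []
    for name, node in nodes.items():
        score = most_allocated_score(node, pod)
        if score is not None:
            scored.append((score, name))
    return max(scored)[1] if scored else None
```

Packing pods onto already-busy nodes leaves the emptier nodes free to be removed, which is the behaviour that lowers idle ratio.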
Overview of Karpenter
● Open Source
● EKS specific
○ Update: Core logic of Karpenter is now part
of the autoscaling SIG, as of November
2023
● “Just in time” model for upscaling
○ Matches workload size and shape to
machines
● “Consolidation” model for downscaling
○ Continual bin-packing
● High level of visibility into decisions
● …we were sold
Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/
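The "continual bin-packing" idea can be illustrated with a classic first-fit-decreasing heuristic. This is a generic textbook technique chosen for illustration, not Karpenter's actual consolidation algorithm, and it considers only CPU on a single hypothetical node size.

```python
def pack(pod_cpus, node_cpu):
    """Place pods onto as few equally sized nodes as first-fit-decreasing finds.

    pod_cpus: CPU request of each pod.
    node_cpu: allocatable CPU of one node.
    Returns a list of nodes, each a list of the pod requests placed on it.
    """
    nodes = []
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_cpu:
            raise ValueError("pod is larger than a node")
        for placed in nodes:
            # Use the first existing node with enough room left.
            if sum(placed) + cpu <= node_cpu:
                placed.append(cpu)
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append([cpu])
    return nodes
```

Re-running a packing like this as workloads change shows which nodes could be emptied and removed, which is the intuition behind consolidation.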
Goals to get there
● Zero or minimal downtime
● Transparent to end-users of platform
● Reproducible migration across different environments
● Observability in all steps of the migration
Migration details
● Define all resources in code
○ Supporting Infrastructure (IAM, SQS) defined as Terraform
○ Kubernetes resources (deployment, service account, etc) defined in jsonnet
● Migration ordering of events is very important
○ Create “critical” node group for Karpenter itself to run on
○ Deploy all Karpenter resources except provisioners and node templates
○ Turn off Cluster Autoscaler
○ Deploy Karpenter Provisioners + Node Templates
○ Gradually cordon and drain existing nodes
○ Delete existing node groups
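The ordering above can be written down as data, with the "gradual" drain step expressed as small batches. This is only a sketch: the step labels paraphrase the slide, and `drain_batches` is a hypothetical helper that decides which nodes to handle together; it does not talk to a cluster.

```python
# Step labels paraphrase the slide; the order is the point.
MIGRATION_STEPS = (
    "create critical node group for Karpenter",
    "deploy Karpenter without Provisioners and Node Templates",
    "turn off Cluster Autoscaler",
    "deploy Provisioners and Node Templates",
    "cordon and drain existing nodes gradually",
    "delete existing node groups",
)


def drain_batches(node_names, batch_size):
    """Yield the old nodes in small groups so workloads move over gradually."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(node_names), batch_size):
        yield node_names[start:start + batch_size]
```

Turning off Cluster Autoscaler before the Provisioners exist means the two autoscalers are never making decisions about the same pods at the same time.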
Results
● Migration accomplished without downtime
● Massively lowered idle ratio, costs
○ More effective distribution of resources
○ Allowed us to take advantage of varying machine types
● Improved reliability
● Better developer experience
● Simplified infrastructure
Results - idle ratio visualized
● Costs lowered
● Idle ratio lowered
Results: cost efficiency visualized
Large cluster (~350 nodes): pre-Karpenter
Results: cost efficiency visualized
Large cluster (~350 nodes): post-Karpenter
Results - improved reliability
Before: Rigid Options
● Workloads could either require spot or on-demand
● Maxing out autoscaling groups would result in alerts, downtime
● Single instance node groups subject to scarcity
After: More Flexibility
● Workloads can prefer spot, but fall back to on-demand
● Fleet is not bounded by autoscaling group limitations
● Wide variety of instance types all but guarantees availability
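One way a workload can "prefer spot, but fall back to on-demand" is a soft node affinity on Karpenter's `karpenter.sh/capacity-type` label. The sketch below builds that pod-spec fragment as a plain Python dict; the helper name is hypothetical, and whether the team expressed the preference exactly this way is an assumption.

```python
import json


def prefer_spot_affinity(weight=100):
    """Build a pod `affinity` fragment that prefers, but does not require, spot nodes."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "nodeAffinity": {
            # A soft constraint: the pod can still land on an on-demand node.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "karpenter.sh/capacity-type",
                                "operator": "In",
                                "values": ["spot"],
                            }
                        ]
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(prefer_spot_affinity(), indent=2))
```

Because the constraint is only a preference, a pod remains schedulable on on-demand capacity when no spot capacity is available.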
Results - developer experience
Before: High Barrier to Experimentation
● Trying out new instance types required lots of overhead
● Infrastructure managed in Terraform
● Specialized workloads required specialized node groups
After: Easy to Play Around
● New instance types are easy to try out
● Infrastructure managed by Kubernetes objects
● Expressive “requirements” syntax for Karpenter allows workloads to ask for what they want, not peg themselves to a machine type
Results - simplified infrastructure
Before: Slow and Manual
● Upgrades handled on a node group basis
● Definition of cluster in Terraform was largely copy/paste
After: Fast and Automated
● Upgrades can be hands-off
● Definition of cluster in Karpenter is largely programmatic
Trade-offs
● No unified autoscaling solution across cloud providers, EKS only
○ Although… Azure is under development!
● Karpenter is still considered “beta”
○ Recent release has a breaking API change
● No spot-to-spot consolidation at the moment
● Deprovisioning lacks fine-grained control
Wrap-up
● Measuring idleness helps greatly for cost optimization
● Karpenter requires buy-in, but is a significant EKS opportunity
● Ensuring a smooth migration was high effort, but worth it
● Passing information on to end-users of the platform yielded and continues to yield high
benefits
● We are excited to see where this tool goes!
Questions?
Additional Info
● Contact me
○ loganballard@gmail.com
○ https://github.com/logyball
● Karpenter
○ https://karpenter.sh/
● Grafana
○ “How Grafana Labs switched to Karpenter to reduce costs and
complexities in Amazon EKS”
○ https://bit.ly/grafana-karpenter
Take Grafana Labs
Observability Survey

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Adopting Karpenter for Cost and Simplicity at Grafana Labs.pdf

  • 1. Adopting Karpenter for Cost and Simplicity Logan Ballard Senior Software Engineer
  • 3. When Cluster Autoscaler isn’t meeting your needs in EKS, consider Karpenter. We will walk through our journey to Karpenter, alternatives we considered, associated trade-offs, and why Karpenter ultimately has helped us reduce costs and complexity to better serve our customers on AWS. Adopting Karpenter for Cost and Simplicity: Optimizing Autoscaling
  • 7. Overview of platform engineering at Grafana Labs ● Company goal: one-stop-shop for observability ○ “Query, visualize, alert on, and understand your data no matter where it’s stored. With Grafana you can create, explore, and share all of your data through beautiful, flexible dashboards.” ● Grafana platform team provides a PaaS for the product layer ○ Managing k8s, CI/CD, and interfaces for the “golden path” ● To deliver the platform, we use managed Kubernetes offerings ○ GKE (Google) ○ EKS (Amazon) ○ AKS (Azure) ● Platform team is further divided into sub-specialties, one of which is capacity
  • 8. Overview of capacity at Grafana Labs ● Capacity is responsible for engineering around usage optimization as well as reliability ○ Observability into cost and utilization ○ Ownership of scaling tools ○ Company-wide efforts to optimize resource usage ● Today we’ll be focusing on autoscaling
  • 9. Overview of Cluster Autoscaler ● Up until last year, we used Cluster Autoscaler exclusively ● Scaling heuristic: ○ Demand goes up and down in the form of pod resource requests ○ If there is too much CPU or Memory requested, more nodes are added ○ If resource requests scale down, nodes are removed Source: https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/
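The scaling heuristic on this slide can be sketched in a few lines of Python. This is a deliberately simplified model, not Cluster Autoscaler's actual implementation: it assumes a single node size and compares summed pod requests against summed node capacity. The function names and the units (millicores and MiB) are assumptions made for illustration.

```python
import math


def nodes_needed(cpu_requested_m, mem_requested_mib, node_cpu_m, node_mem_mib):
    """Smallest number of identical nodes whose total capacity covers the requests.

    Simplification: looks only at aggregate requests and ignores per-pod fit,
    taints, affinities, and everything else a real autoscaler considers.
    """
    by_cpu = math.ceil(cpu_requested_m / node_cpu_m)
    by_mem = math.ceil(mem_requested_mib / node_mem_mib)
    return max(by_cpu, by_mem)


def scaling_delta(current_nodes, cpu_requested_m, mem_requested_mib,
                  node_cpu_m, node_mem_mib):
    """Positive result: add nodes. Negative result: nodes can be removed."""
    target = nodes_needed(cpu_requested_m, mem_requested_mib,
                          node_cpu_m, node_mem_mib)
    return target - current_nodes
```

With 10 CPU cores and 20 GiB requested on 4-core, 16 GiB nodes, CPU is the binding resource and the sketch asks for three nodes.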
  • 14. What we liked about Cluster Autoscaler ● Simple, standardized, and has lots of support ● Accomplished basic scaling needs ● Infrastructure was managed in a standardized way ○ Node Groups defined in Terraform ● Deployment was managed in a standardized way, as either: ○ GKE: part of the managed offering ○ EKS: as-code in jsonnet, same as all our other workloads ● …we already had it
  • 20. What we didn’t like about Cluster Autoscaler ● Lots of overhead when defining diverse machine types ● Inability to understand max pod IP restrictions when making scaling decisions ● No support for “soft” scheduling constraints ○ E.g. preferredDuringSchedulingIgnoredDuringExecution ● Experimentation was challenging ● Upgrades were a long process ● No fallback for workloads from spot nodes -> on demand nodes
  • 25. Why change? ● We measure “idle ratio” to calculate scheduling efficiency ● Idle ratio is calculated as a product of: ○ ((CPU allocatable) - (CPU used)) / (CPU allocatable) ○ ((Memory allocatable) - (Memory used)) / (Memory allocatable) ● Idle ratio is a “golden signal” for capacity squad ● High idle ratio is wasted $$$
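As a worked example, the idle ratio described on this slide could be computed as below. This is a minimal sketch: the function names are hypothetical, and it assumes the "used" and "allocatable" figures are already aggregated across the cluster in matching units.

```python
def idle_fraction(allocatable, used):
    """Share of allocatable capacity that is not used, between 0 and 1."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    return (allocatable - used) / allocatable


def idle_ratio(cpu_allocatable, cpu_used, mem_allocatable, mem_used):
    """Product of the CPU idle fraction and the memory idle fraction."""
    return (idle_fraction(cpu_allocatable, cpu_used)
            * idle_fraction(mem_allocatable, mem_used))
```

For instance, a cluster with 100 cores allocatable and 50 used, and 200 GiB allocatable and 150 used, has idle fractions of 0.5 and 0.25, so an idle ratio of 0.125.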
  • 28. Idle Ratio Visualized GKE VS AWS ⚠ Idle ratio in AWS was almost double that of GKE ⚠
  • 34. What to do? ● Nothing? ● Try to replicate the setup of GKE with its OPTIMIZE_UTILIZATION autoscaling profile ○ Run a custom scheduler: ‘MostAllocated’ scoring strategy ○ Configure Cluster Autoscaler: set expanders to priority? Most pods? ○ All of this is obscured behind the GKE control plane and would be a tedious process ● Manage our own control plane? ○ No ● Karpenter?
  • 35. Overview of Karpenter ● Open Source ● EKS specific ○ Update: Core logic of Karpenter is now part of the autoscaling SIG, as of November 2023 ● “Just in time” model for upscaling ○ Matches workload size and shape to machines ● “Consolidation” model for downscaling ○ Continual bin-packing ● High level of visibility into decisions ● …we were sold Source: https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/
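To illustrate what "bin-packing" means here, the sketch below packs pod CPU requests onto as few equally sized nodes as a simple first-fit-decreasing heuristic finds. This is a textbook heuristic chosen for illustration, not Karpenter's actual consolidation algorithm. The function name and the single-resource, single-node-size model are assumptions.

```python
def first_fit_decreasing(requests_m, node_capacity_m):
    """Pack CPU requests (millicores) into nodes of equal capacity.

    Places the largest requests first, each on the first node with room,
    and opens a new node only when no existing node fits.
    """
    nodes = []
    for request in sorted(requests_m, reverse=True):
        if request > node_capacity_m:
            raise ValueError(f"request {request}m exceeds node capacity")
        for node in nodes:
            if sum(node) + request <= node_capacity_m:
                node.append(request)
                break
        else:
            # No existing node had room, so start a new one.
            nodes.append([request])
    return nodes
```

Five pods requesting 2, 5, 2, 4, and 3 cores fit onto two 8-core nodes under this heuristic, where a naive one-pod-per-node layout would use five.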
  • 40. Goals to get there ● Zero or minimal downtime ● Transparent to end-users of platform ● Reproducible migration across different environments ● Observability in all steps of the migration
  • 48. Migration details ● Define all resources in code ○ Supporting Infrastructure (IAM, SQS) defined as Terraform ○ Kubernetes resources (deployment, service account, etc) defined in jsonnet ● Migration ordering of events is very important ○ Create “critical” node group for Karpenter itself to run on ○ Deploy all Karpenter resources except provisioners and node templates ○ Turn off Cluster Autoscaler ○ Deploy Karpenter Provisioners + Node Templates ○ Gradually cordon and drain existing nodes ○ Delete existing node groups
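The "gradually cordon and drain" step could be planned with a small helper like the one below. It only builds the `kubectl` command strings in batches and does not execute anything. The node names, the batch size, and the choice of drain flags are assumptions for illustration, not the procedure used in the talk.

```python
def drain_plan(node_names, batch_size=1):
    """Return batches of kubectl commands: cordon each node, then drain it.

    Working in small batches leaves time for the new autoscaler to provision
    replacement capacity before the next group of nodes is emptied.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(node_names), batch_size):
        batch = node_names[start:start + batch_size]
        commands = [f"kubectl cordon {name}" for name in batch]
        commands += [
            f"kubectl drain {name} --ignore-daemonsets --delete-emptydir-data"
            for name in batch
        ]
        batches.append(commands)
    return batches
```

Each batch would be run, and the cluster checked for pending pods, before moving on to the next one.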
  • 54. Results ● Migration accomplished without downtime ● Massively lowered idle ratio, costs ○ More effective distribution of resources ○ Allowed us to take advantage of varying machine types ● Improved reliability ● Better developer experience ● Simplified infrastructure
  • 55. Results - idle ratio visualized ● Costs lowered ● Idle ratio lowered
  • 56. Results: cost efficiency visualized Large cluster (~350 nodes): pre-Karpenter
  • 57. Results: cost efficiency visualized Large cluster (~350 nodes): post-Karpenter
  • 58. Results - improved reliability ● Before (Rigid Options): ○ Workloads could either require spot or on-demand ○ Maxing out autoscaling groups would result in alerts, downtime ○ Single instance node groups subject to scarcity ● After (More Flexibility): ○ Workloads can prefer spot, but fall back to on-demand ○ Fleet is not bounded by autoscaling group limitations ○ Wide variety of instance types all but guarantees availability
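One way a workload can "prefer spot, but fall back to on-demand" is a soft node affinity using the `preferredDuringSchedulingIgnoredDuringExecution` field mentioned earlier in the talk. The sketch below builds that fragment of a pod spec as a Python dictionary. The helper name is hypothetical, and whether the team expressed the preference exactly this way is an assumption. The `karpenter.sh/capacity-type` label is the one Karpenter sets on the nodes it launches.

```python
def prefer_spot_affinity(weight=100):
    """Build a pod `affinity` fragment that prefers spot nodes without requiring them."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "nodeAffinity": {
            # Soft constraint: the scheduler favors matching nodes but may
            # still place the pod elsewhere, for example on on-demand nodes.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "karpenter.sh/capacity-type",
                                "operator": "In",
                                "values": ["spot"],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

The returned dictionary would be placed under `spec.affinity` of a pod template.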
  • 59. Results - developer experience ● Before (High Barrier to Experimentation): ○ Trying out new instance types required lots of overhead ○ Infrastructure managed in Terraform ○ Specialized workloads required specialized node groups ● After (Easy to Play Around): ○ New instance types are easy to try out ○ Infrastructure managed by Kubernetes object ○ Expressive “requirements” syntax for Karpenter allows workloads to ask for what they want, not peg themselves to a machine type
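The idea behind a "requirements" syntax can be illustrated with a toy matcher: each requirement names a label key, an operator, and a list of values, and any machine whose labels satisfy all requirements is a candidate. The sketch below models the standard Kubernetes selector operators (`In`, `NotIn`, `Exists`, `DoesNotExist`) and is not Karpenter's code. The function names and the candidate data are hypothetical.

```python
def satisfies(labels, requirement):
    """Check one requirement (key, operator, values) against a label set."""
    key = requirement["key"]
    operator = requirement["operator"]
    values = requirement.get("values", [])
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {operator}")


def matching_candidates(candidates, requirements):
    """Keep only the candidates whose labels satisfy every requirement."""
    return [
        labels for labels in candidates
        if all(satisfies(labels, req) for req in requirements)
    ]
```

A workload that asks for `kubernetes.io/arch In [arm64]` is then matched against every ARM instance type on offer, instead of being pinned to one machine type.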
  • 60. Results - simplified infrastructure ● Before (Slow and Manual): ○ Upgrades handled on a node group basis ○ Definition of cluster in Terraform was largely copy/paste ● After (Fast and Automated): ○ Upgrades can be hands-off ○ Definition of cluster in Karpenter is largely programmatic
  • 65. Trade-offs ● No unified autoscaling solution across cloud providers, EKS only ○ Although… Azure is under development! ● Karpenter is still considered “beta” ○ Recent release has a breaking API change ● No spot-to-spot consolidation at the moment ● Deprovisioning lacks fine-grained control
  • 71. Wrap-up ● Measuring idleness helps greatly for cost optimization ● Karpenter requires buy-in, but is a significant EKS opportunity ● Ensuring a smooth migration was high effort, but worth it ● Passing information on to end-users of the platform yielded and continues to yield high benefits ● We are excited to see where this tool goes!
  • 73. Additional Info ● Contact me ○ loganballard@gmail.com ○ https://github.com/logyball ● Karpenter ○ https://karpenter.sh/ ● Grafana ○ “How Grafana Labs switched to Karpenter to reduce costs and complexities in Amazon EKS” ○ https://bit.ly/grafana-karpenter QR code to relevant blog post