Getting the most from your
Kubernetes Clusters
Abstract
Deploying and running complex distributed applications involves a trade-off; we want to maximise performance and availability, and at the same time
make responsible use of finite and costly compute resource.
In this talk we will look at the features and tools that Kubernetes provides to help manage this trade-off, particularly those features provided by the
Kubernetes Scheduler.
We'll also take a brief look at how the scheduler can be extended for custom use-cases.
We will look at how we can define different levels of coupling between our application components, how we can scale those components
dynamically, and how we can observe an application's performance.
We will also show that Kubernetes can provide a way for application owners to express the runtime requirements of their applications in a way that
operations and platform teams can easily consume as those teams design and build platforms.
Agenda
● Hello :)
● What problem are we trying to solve?
● Brief Kubernetes Recap
● How Kubernetes, and in particular the Kubernetes Scheduler, can help us
○ Resources, Priorities, Affinity
○ Hopefully share some pointers and good practice
● Q & A
Tony Scully
● https://twitter.com/tonyjscully
● tscully@vmware.com
● Staff Field Engineer at VMware
● Previously Red Hat, Sysdig
● 20-something years working with
UNIX/Linux/Distributed Systems
Definitely much more Ops than Dev!
What I do at VMware
● VMware Tanzu Labs Platform Services
● Work with customers to build, manage and run platforms
● At least as much about culture and practice as technology
● Platform Teams
● Platform as a Product
● Define Delivery Paths to Production
● Self Service
The Benefits of Platforms based on Kubernetes 1/2
● Kubernetes is a platform to build platforms
● The ‘right’ abstraction over infrastructure
● API centric
● Allows for progressive delivery of changes to the platform and workloads
● Moves some application architecture concerns into the platform
● Leads to increased velocity
The Benefits of Platforms based on Kubernetes 2/2
VMware ‘State of Kubernetes 2022’ Survey
The Problem: How can we
Improve Utilization?
Improving utilization is the Right Thing to do
● How to Love K8s and Not Wreck the Planet - Holly Cummins, Worldwide IBM
Garage Dev Lead
KubeCon EU Keynote 2020
https://youtu.be/j5jql3e6hTA
And can address other issues
● Reduce costs
● Resource constrained environments
● Lack of elasticity on-premises
A Trade Off: Utilization vs Fault Tolerance vs Complexity
● What are our reliability requirements?
● How are those requirements best achieved?
● How simple or complex are our workloads?
● Right-sizing is hard and introduces complexity
● Try not to over optimize at the cost of complexity
Kubernetes Recap
The Kubernetes Resource Model
● A set of primitives to work with
● However, unlikely to be at a level you want to work at day-to-day
● Worthwhile exploring these API resources
● Sorry about the YAML :)
Nodes and Pods
● Nodes are the representation of a compute instance
○ Can play different roles in a cluster
○ VMs, cloud instances, bare metal servers etc.
● Pods are the atomic unit of deployment
○ Almost invariably created by a higher order controller
○ Provide access to the operating system resources the container(s) need
○ Can specify one or more containers
○ Allows for close-coupling, for example, initContainers, sidecars and proxies
Clusters and Namespaces
● Clusters can be used as a boundary for teams and applications
○ Can lead to sprawl, overprovisioning and underutilisation
○ Sizing the nodes and cluster is key
○ Clusters can consist of worker nodes with different characteristics
● Namespaces are a way to ‘divide up’ a cluster
○ Are an organisation boundary for some resource types, and for RBAC
○ Cluster admin can create resourcequota and limitrange objects at the namespace level
Labels
● Arbitrary key:value pairs that can be set/unset on objects
● Allows for implicit grouping
● Stored in the cluster
● Can be added/removed/updated at any time
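The labels above can be sketched in a manifest; the keys and values here are purely illustrative examples, not a convention Kubernetes mandates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend        # hypothetical pod name
  labels:                   # arbitrary key:value pairs
    app: web
    tier: frontend
    environment: test
```

Label selectors then do the implicit grouping, e.g. `kubectl get pods -l app=web,environment=test`, and the same mechanism is used by Services, Deployments and the scheduler's affinity rules.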
Kube-scheduler
The role of the Scheduler
● Watches for pods with no value set for the nodeName:
○ These could be newly created, or evicted and waiting to be rescheduled etc.
○ You are unlikely to create singleton pods directly but the pod is the smallest atomic unit
● Determines the most suitable node to run the pod
○ Filtering – which nodes are feasible to run the pod using a set of predicates
○ Scoring – which of those is the ‘best’ to run the pod using a set of priorities
● The default scheduler is extensible or can be entirely replaced
● You can run multiple schedulers in the same cluster
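When multiple schedulers run in a cluster, a pod opts into one via `spec.schedulerName`; a minimal sketch, where the scheduler name is a hypothetical example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler   # hypothetical; omit to use the default scheduler
  containers:
  - name: app
    image: myimage:v1.4
```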
Scheduler Framework
Some example predicates and priorities
● ImageLocality
○ Favors nodes that already have the container images that the Pod runs. Extension points: score.
● TaintToleration
○ Implements taints and tolerations. Implements extension points: filter, preScore, score.
● NodeName
○ Checks if a Pod spec node name matches the current node. Extension points: filter.
● NodePorts
○ Checks if a node has free ports for the requested Pod ports. Extension points: preFilter, filter.
● NodeAffinity
○ Implements node selectors and node affinity. Extension points: filter, score.
● NodeResourcesFit
○ Checks if the node has all the resources that the Pod is requesting. The score can use one of three strategies: LeastAllocated (default),
MostAllocated and RequestedToCapacityRatio. Extension points: preFilter, filter, score.
● InterPodAffinity
○ Implements inter-Pod affinity and anti-affinity. Extension points: preFilter, filter, preScore, score.
● PrioritySort
○ Provides the default priority based sorting. Extension points: queueSort.
● VolumeBinding
○ Checks if the node has or if it can bind the requested volumes. Extension points: preFilter, filter, reserve, preBind, score.
● PodTopologySpread
○ Implements Pod topology spread. Extension points: preFilter, filter, preScore, score.
Source: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins © Kubernetes.io CC-BY 4.0
Resource Fit
● CPU
● Memory
● Ephemeral Storage
● Memory Hugepages
● And we can define custom resources
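A custom (extended) resource, advertised by a device plugin or patched onto the node status, is requested like any built-in resource; the resource name below is a hypothetical example. Note that for extended resources Kubernetes only accepts integer amounts, and requests and limits must be equal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: app
    image: myimage:v1.4
    resources:
      limits:
        example.com/gpu: 1   # hypothetical extended resource name
```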
Node
Capacity:
  attachable-volumes-aws-ebs: 25
  cpu: 4
  ephemeral-storage: 83873772Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 7806352Ki
  pods: 110
Allocatable:
  attachable-volumes-aws-ebs: 25
  cpu: 3800m
  ephemeral-storage: 77298068148
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 7499152Ki
  pods: 110
Resources in the pod.spec.containers
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: basic-pod
  name: basic-pod
spec:
  containers:
  - image: myimage:v1.4
    name: basic-pod
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 512Mi
        cpu: 500m
What are Requests and Limits
● The request values are used by the scheduler to decide ‘fit’ on the node
● Requests can be considered ‘reservations’ or ‘guarantees’
● Limits are boundaries
● The mechanisms for managing this vary with resource type
CPU
● Requests are managed by cgroups
○ Uses shares of the CPU time available so the Linux CFS gives each process the allocated
amount of time on the CPU
● Limits are managed by CFS throttling
○ Within each time slice the process will be able to use up to the set limit of processor time, but
will be removed from the CPU once that limit is reached
● CPU demand will not trigger eviction
● Throttling can be an issue for some workloads
● The CPU manager policy can be changed to use a more deterministic method
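The CPU manager policy is set in the kubelet configuration; a sketch of the relevant fragment, assuming you control kubelet config on your nodes (often the platform team's job on managed clusters):

```yaml
# KubeletConfiguration fragment: the static policy gives Guaranteed pods
# with integer CPU requests exclusive cores instead of CFS shares,
# trading utilization for deterministic CPU behaviour
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
```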
Memory
● Is not compressible!
● Managed by cgroups, the process will get the requested value as a
‘reservation’ and be allowed to burst to the limit, without any attempt to
reclaim memory
● If the process attempts to allocate more than the limit, the kernel will report
out of memory and will terminate the process (OOM killer)
Setting values for Request and Limits
● Test, observe and measure, using metrics
● Iterate and refine the values over time
● Consider horizontally scaling if possible
● Aim for good and iterate
Oversized Containers (requests are higher than used)
topk(10, sum by (namespace,pod,container)(
  (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m])
   - on (namespace,pod,container) group_left
     avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="cpu"})
  ) * -1 > 0
))
CPU Throttling
rate(container_cpu_cfs_throttled_seconds_total[5m])
Eviction
● API or resource based
● The application running in the container needs to gracefully handle this
○ Probes, timeouts, signals
○ Prestop hooks
Pod QoS classes, used for eviction decisions
● Best Effort
○ No requests or limits set
● Burstable
○ Requests set, with Limits > Requests (or limits unset)
● Guaranteed
○ Limits == Requests
○ Or only limits set
Priority and Preemption
● Cluster admin can define priority classes that can be used per-pod
● In queue or at execution time
● Used by default to protect system pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
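A pod opts into a priority class by name; a minimal sketch referencing the class defined above (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-pod      # hypothetical
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
  - name: app
    image: myimage:v1.4
```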
Scaling up (and down)
● Horizontal Pod Autoscaler (HPA)
○ Add/Remove replicas
○ Consider start/warm up time
○ Scale up/down decision based on metrics
● Vertical Pod Autoscaler (VPA)
○ Increases request/limit settings
○ Advisory mode can be useful in testing
● Cluster Autoscaler (CA)
○ Consider the Node creation time, may make sense to pre-define some capacity
○ Simple for public cloud, on premises is harder
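A minimal HPA sketch using the autoscaling/v2 API: scale a hypothetical Deployment named 'web' to keep average CPU utilisation near 70% of the pods' requests (which is why the requests discussed earlier matter for autoscaling too):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical target workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```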
Pod Disruption Budgets
● A PDB specifies a minimum availability that the cluster will honour
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rails
Pod placement: Affinity and Anti-Affinity
● A pod can have affinity for a node based on the labels assigned to the node
● A pod can have anti-affinity for a node based on taints on the node
● A pod can have affinity for other pods based on labels on the pods
● A pod can have anti-affinity for other pods based on labels on the pods
● Node and Pod affinity/anti-affinity are currently schedule-time operations
● Taints can be used to evict currently running pods
● Can be ‘required’ (hard) or ‘preferred’ (soft)
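Taints and tolerations are the counterpart to affinity: the node repels pods unless they tolerate the taint. A sketch with a hypothetical node and key; the taint itself would be applied with something like `kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload       # hypothetical
spec:
  tolerations:
  - key: dedicated         # must match the taint on the node
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: app
    image: myimage:v1.4
```

A NoExecute effect, by contrast, also evicts already-running pods that lack the toleration.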
Pod with Node Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: environment
            operator: In
            values:
            - test
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: type
            operator: In
            values:
Pod with Pod Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: caching
            operator: In
            values:
            - high
        topologyKey: topology.kubernetes.io/zone
Pod with Pod Anti-Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - public
        topologyKey: topology.kubernetes.io/zone
Summary
How to learn more
● Tanzu Developer Center
○ https://tanzu.vmware.com/developer/
● Kubernetes Documentation
○ https://kubernetes.io/
● Run a cluster locally
○ https://tanzucommunityedition.io/
● Run a simulator
○ https://github.com/kubernetes-sigs/kube-scheduler-simulator
Good Practice
● Work closely with your platform team
● Experiment, test and measure your workloads
● As a minimum, set requests for cpu
● Set requests and limits for memory
● Utilise pod affinity and anti-affinity for availability and better resource usage
● Protect scarce resources with taints
● Use node pools to better control resources and match workloads
● Experiment with node dimensions and grouping
● Enable limitranges and resource quotas as guardrails
● Use scaling for unpredictable workload changes
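The limitrange and resource quota guardrails above can be sketched for a namespace as follows; all values are illustrative, not recommendations:

```yaml
# LimitRange: inject default requests/limits into containers that set none,
# so nothing in the namespace runs as Best Effort by accident
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
---
# ResourceQuota: cap the namespace's aggregate requests and limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```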
Try to avoid
● Avoid optimizing too early or too aggressively
● Very complex rules are hard to troubleshoot and can leave the cluster unable
to make progress
● Pod to Pod Affinity/Anti-Affinity can be costly in larger clusters
● Avoid eviction of pods with applications that don’t tolerate it well
Thank You!
Questions?
