2. Abstract
Deploying and running complex distributed applications involves a trade-off; we want to maximise performance and availability, and at the same time
make responsible use of finite and costly compute resources.
In this talk we will look at the features and tools that Kubernetes provides to help manage this trade-off, particularly those features provided by the
Kubernetes Scheduler.
We'll also take a brief look at how the scheduler can be extended for custom use-cases.
We will look at how we can define different levels of coupling between our application components, how we can scale those components
dynamically, and how we can observe an application's performance.
We will also show that Kubernetes can provide a way for application owners to express the runtime requirements of their applications in a way that
operations and platform teams can easily consume as those teams design and build platforms.
3. Agenda
● Hello :)
● What problem are we trying to solve?
● Brief Kubernetes Recap
● How Kubernetes, and in particular the Kubernetes Scheduler, can help us
○ Resources, Priorities, Affinity
○ Hopefully share some pointers and good practice
● Q & A
4. Tony Scully
● https://twitter.com/tonyjscully
● tscully@vmware.com
● Staff Field Engineer at VMware
● Previously Red Hat, Sysdig
● 20-something years working with
UNIX/Linux/Distributed Systems
Definitely much more Ops than Dev!
5. What I do at VMware
● VMware Tanzu Labs Platform Services
● Work with customers to build, manage and run platforms
● At least as much about culture and practice as technology
● Platform Teams
● Platform as a Product
● Define Delivery Paths to Production
● Self Service
6. The Benefits of Platforms based on Kubernetes 1/2
● Kubernetes is a platform to build platforms
● The ‘right’ abstraction over infrastructure
● API centric
● Allows for progressive delivery of changes to the platform and workloads
● Moves some application architecture concerns into the platform
● Leads to increased velocity
7. The Benefits of Platforms based on Kubernetes 2/2
VMware ‘State of Kubernetes 2022’ Survey
9. Improving utilization is the Right Thing to do
● How to Love K8s and Not Wreck the Planet - Holly Cummins, Worldwide IBM
Garage Dev Lead
Kubecon EU Keynote 2020
https://youtu.be/j5jql3e6hTA
10. And can address other issues
● Reduce costs
● Resource constrained environments
● Lack of elasticity on-premises
11. A Trade Off: Utilization vs Fault Tolerance vs Complexity
● What are our reliability requirements?
● How are those requirements best achieved?
● How simple or complex are our workloads?
● Right-sizing is hard and introduces complexity
● Try not to over-optimize at the cost of complexity
13. The Kubernetes Resource Model
● A set of primitives to work with
● However, unlikely to be at a level you want to work at day-to-day
● Worthwhile exploring these API resources
● Sorry about the YAML :)
14. Nodes and Pods
● Nodes are the representation of a compute instance
○ Can play different roles in a cluster
○ VMs, cloud instances, bare metal servers etc.
● Pods are the atomic unit of deployment
○ Almost invariably created by a higher order controller
○ Provides access to the operating system resources the container(s) need
○ Can specify one or more containers
○ Allows for close-coupling, for example, initContainers, sidecars and proxies
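As a sketch of that close-coupling, a Pod might pair an initContainer and a sidecar-style container with the main application container (the names, images and label below are illustrative, not from the talk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # illustrative name
  labels:
    app: web
spec:
  initContainers:
    - name: fetch-config        # runs to completion before the app containers start
      image: busybox:1.36
      command: ["sh", "-c", "echo config fetched"]
  containers:
    - name: web                 # main application container
      image: nginx:1.25
    - name: log-shipper         # sidecar sharing the pod's network and lifecycle
      image: busybox:1.36
      command: ["sh", "-c", "sleep infinity"]
```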
15. Clusters and Namespaces
● Clusters can be used as a boundary for teams and applications
○ Can lead to sprawl, overprovisioning and underutilisation
○ Sizing the nodes and cluster is key
○ Clusters can consist of worker nodes with different characteristics
● Namespaces are a way to ‘divide up’ a cluster
○ Are an organisational boundary for some resource types, and for RBAC
○ Cluster admins can create ResourceQuota and LimitRange objects at the namespace level
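A minimal sketch of those namespace-level guardrails, assuming a hypothetical `team-a` namespace (the quota and default values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota              # illustrative name
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"           # total CPU requests allowed in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                  # applied when a container sets no limit
        cpu: 500m
        memory: 256Mi
```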
16. Labels
● Arbitrary key:value pairs that can be set/unset on objects
● Allows for implicit grouping
● Stored in the cluster
● Can be added/removed/updated at any time
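For example, labels are just metadata on an object (the names and values here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-pod                 # illustrative name
  labels:
    app: api                    # arbitrary key:value pairs
    tier: backend
    environment: test
spec:
  containers:
    - name: app
      image: nginx:1.25
```

Label selectors then do the implicit grouping, e.g. `kubectl get pods -l app=api,environment=test`.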
18. The role of the Scheduler
● Watches for pods with no value set for the nodeName:
○ These could be newly created, or evicted and waiting to be rescheduled etc.
○ You are unlikely to create singleton pods directly but the pod is the smallest atomic unit
● Determines the most suitable node to run the pod
○ Filtering – which nodes are feasible to run the pod using a set of predicates
○ Scoring – which of those is the ‘best’ to run the pod using a set of priorities
● The default scheduler is extensible or can be entirely replaced
● You can run multiple schedulers in the same cluster
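When multiple schedulers are running, a pod opts into one via `schedulerName` in its spec (a sketch; `my-scheduler` is a hypothetical second scheduler's name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled        # illustrative name
spec:
  schedulerName: my-scheduler   # hypothetical; defaults to "default-scheduler" if unset
  containers:
    - name: app
      image: nginx:1.25
```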
24. What are Requests and Limits
● The request values are used by the scheduler to decide ‘fit’ on the node
● Requests can be considered ‘reservations’ or ‘guarantees’
● Limits are boundaries
● The mechanisms for managing this vary with resource type
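Requests and limits are set per container in the pod spec. A minimal sketch with illustrative values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sized-pod               # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 250m             # used by the scheduler to decide 'fit' on a node
          memory: 128Mi         # effectively a reservation
        limits:
          cpu: 500m             # enforced via CFS throttling
          memory: 256Mi         # allocating beyond this triggers the OOM killer
```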
25. CPU
● Requests are managed by cgroups
○ Uses CPU shares, so that the Linux CFS (Completely Fair Scheduler) gives each process its
allocated amount of time on the CPU
● Limits are managed by CFS throttling
○ Within each time slice the process will be able to use up to the set limit of processor time, but
will be removed from the CPU once that limit is reached
● CPU demand will not trigger eviction
● Throttling can be an issue for some workloads
● The CPU manager policy can be changed to use a more deterministic method
26. Memory
● Is not compressible!
● Managed by cgroups, the process will get the requested value as a
‘reservation’ and be allowed to burst to the limit, without any attempt to
reclaim memory
● If the process attempts to allocate more than the limit, the kernel will report
out of memory and will terminate the process (OOM killer)
27. Setting values for Request and Limits
● Test, observe and measure, using metrics
● Iterate and refine the values over time
● Consider horizontally scaling if possible
● Aim for good and iterate
28. Oversized Containers (requests are higher than used)
topk(10,
  sum by (namespace, pod, container) (
    (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m])
      - on (namespace, pod, container) group_left
        avg by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
    ) * -1 > 0
  )
)
30. Eviction
● API or resource based
● The application running in the container needs to gracefully handle this
○ Probes, timeouts, signals
○ Prestop hooks
31. Pod QoS classes, used for eviction decisions
● Best Effort
○ No requests or limits set
● Burstable
○ Requests set; limits unset or higher than requests
● Guaranteed
○ Limits == Requests
○ Or only limits set
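For instance, setting limits equal to requests for every container yields the Guaranteed class (illustrative values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod          # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:               # requests == limits → QoS class "Guaranteed"
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

Omitting both blocks would make this pod BestEffort, the first to be considered for eviction under node pressure.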
32. Priority and Preemption
● Cluster admin can define priority classes that can be used per-pod
● Considered both in the scheduling queue and at execution time (preemption)
● Used by default to protect system pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
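A pod then references the class by name in its spec, for example (illustrative pod name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-pod           # illustrative name
spec:
  priorityClassName: high-priority-nonpreempting  # must match an existing PriorityClass
  containers:
    - name: app
      image: nginx:1.25
```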
33. Scaling up (and down)
● Horizontal Pod Autoscaler (HPA)
○ Add/Remove replicas
○ Consider start/warm up time
○ Scale up/down decision based on metrics
● Vertical Pod Autoscaler (VPA)
○ Increases request/limit settings
○ Advisory mode can be useful in testing
● Cluster Autoscaler (CA)
○ Consider the Node creation time, may make sense to pre-define some capacity
○ Simple for public cloud, on premises is harder
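A sketch of an HPA scaling a hypothetical `web` Deployment on CPU utilization (the target and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                 # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                   # hypothetical workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale up when average CPU exceeds 70% of requests
```

Note that utilization targets are relative to the pods' CPU requests, another reason to set requests deliberately.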
34. Pod Disruption Budgets
● A PDB specifies a minimum availability that the cluster will honour during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rails
35. Pod placement: Affinity and Anti-Affinity
● A pod can have affinity for a node based on the labels assigned to the node
● A pod can have anti-affinity for a node based on taints on the node
● A pod can have affinity for other pods based on labels on the pods
● A pod can have anti-affinity for other pods based on labels on the pods
● Node and pod affinity/anti-affinity are currently schedule-time operations
● Taints can be used to evict currently running pods
● Can be ‘required’ (hard) or ‘preferred’ (soft)
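For the taint-based side, a sketch: after tainting a node (e.g. `kubectl taint nodes node1 gpu=true:NoSchedule`, a hypothetical node and key), only pods with a matching toleration can schedule there:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                 # illustrative name
spec:
  tolerations:
    - key: gpu                  # must match the taint's key
      operator: Equal
      value: "true"
      effect: NoSchedule        # tolerates the NoSchedule taint above
  containers:
    - name: app
      image: nginx:1.25
```

A `NoExecute` taint would additionally evict already-running pods that lack a matching toleration.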
36. Pod with Node Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: environment
                operator: In
                values:
                  - test
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: type
                operator: In
                values:
37. Pod with Pod Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: caching
                operator: In
                values:
                  - high
          topologyKey: topology.kubernetes.io/zone
38. Pod with Pod Anti-Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - public
          topologyKey: topology.kubernetes.io/zone
40. How to learn more
● Tanzu Developer Center
○ https://tanzu.vmware.com/developer/
● Kubernetes Documentation
○ https://kubernetes.io/
● Run a cluster locally
○ https://tanzucommunityedition.io/
● Run a simulator
○ https://github.com/kubernetes-sigs/kube-scheduler-simulator
41. Good Practice
● Work closely with your platform team
● Experiment, test and measure your workloads
● As a minimum, set requests for cpu
● Set requests and limits for memory
● Utilise pod affinity and anti-affinity for availability and better resource usage
● Protect scarce resources with taints
● Use node pools to better control resources and match workloads
● Experiment with node dimensions and grouping
● Enable limitranges and resource quotas as guardrails
● Use scaling for unpredictable workload changes
42. Try to avoid
● Avoid optimizing too early or too aggressively
● Very complex rules are hard to troubleshoot and can leave the cluster unable
to make progress
● Pod to Pod Affinity/Anti-Affinity can be costly in larger clusters
● Avoid eviction of pods with applications that don’t tolerate it well