2. Abstract
Deploying and running complex distributed applications involves a trade-off; we want to maximise performance and availability, and at the same time
make responsible use of finite and costly compute resources.
In this talk we will look at the features and tools that Kubernetes provides to help manage this trade-off, particularly those features provided by the
Kubernetes Scheduler.
We'll also take a brief look at how the scheduler can be extended for custom use-cases.
We will look at how we can define different levels of coupling between our application components, how we can scale those components
dynamically, and how we can observe an application's performance.
We will also show that Kubernetes can provide a way for application owners to express the runtime requirements of their applications in a way that
operations and platform teams can easily consume as those teams design and build platforms.
3. Agenda
● Hello :)
● What problem are we trying to solve?
● Brief Kubernetes Recap
● How Kubernetes, and in particular the Kubernetes Scheduler, can help us
○ Resources, Priorities, Affinity
○ Hopefully share some pointers and good practice
● Q & A
4. Tony Scully
● https://twitter.com/tonyjscully
● tscully@vmware.com
● Staff Field Engineer at VMware
● Previously Red Hat, Sysdig
● 20-something years working with
UNIX/Linux/Distributed Systems
Definitely much more Ops than Dev!
5. What I do at VMware
● VMware Tanzu Labs Platform Services
● Work with customers to build, manage and run platforms
● At least as much about culture and practice as technology
● Platform Teams
● Platform as a Product
● Define Delivery Paths to Production
● Self Service
6. The Benefits of Platforms based on Kubernetes 1/2
● Kubernetes is a platform to build platforms
● The ‘right’ abstraction over infrastructure
● API centric
● Allows for progressive delivery of changes to the platform and workloads
● Moves some application architecture concerns into the platform
● Leads to increased velocity
7. The Benefits of Platforms based on Kubernetes 2/2
VMware ‘State of Kubernetes 2022’ Survey
9. Improving utilization is the Right Thing to do
● How to Love K8s and Not Wreck the Planet - Holly Cummins, Worldwide IBM
Garage Dev Lead
Kubecon EU Keynote 2020
https://youtu.be/j5jql3e6hTA
10. And can address other issues
● Reduce costs
● Resource constrained environments
● Lack of elasticity on-premises
11. A Trade Off: Utilization vs Fault Tolerance vs Complexity
● What are our reliability requirements?
● How are those requirements best achieved?
● How simple or complex are our workloads?
● Right-sizing is hard and introduces complexity
● Try not to over-optimize at the cost of complexity
13. The Kubernetes Resource Model
● A set of primitives to work with
● However, unlikely to be at a level you want to work at day-to-day
● Worthwhile exploring these API resources
● Sorry about the YAML :)
14. Nodes and Pods
● Nodes are the representation of a compute instance
○ Can play different roles in a cluster
○ VMs, cloud instances, bare metal servers etc.
● Pods are the atomic unit of deployment
○ Almost invariably created by a higher order controller
○ Provides access to the operating system resources the container(s) need
○ Can specify one or more containers
○ Allows for close-coupling, for example, initContainers, sidecars and proxies
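As a sketch of that close-coupling, a Pod might pair an initContainer and a sidecar-style container with the main application container (the names, images and label below are illustrative, not from the talk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # illustrative name
  labels:
    app: web
spec:
  initContainers:
    - name: fetch-config        # runs to completion before the app containers start
      image: busybox:1.36
      command: ["sh", "-c", "echo config fetched"]
  containers:
    - name: web                 # main application container
      image: nginx:1.25
    - name: log-shipper         # sidecar sharing the pod's network and lifecycle
      image: busybox:1.36
      command: ["sh", "-c", "sleep infinity"]
```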
15. Clusters and Namespaces
● Clusters can be used as a boundary for teams and applications
○ Can lead to sprawl, overprovisioning and underutilisation
○ Sizing the nodes and cluster is key
○ Clusters can consist of worker nodes with different characteristics
● Namespaces are a way to ‘divide up’ a cluster
○ Are an organisational boundary for some resource types, and for RBAC
○ Cluster admins can create ResourceQuota and LimitRange objects at the namespace level
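A minimal sketch of those namespace-level guardrails, assuming a hypothetical `team-a` namespace (the quota and default values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota              # illustrative name
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"           # total CPU requests allowed in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                  # applied when a container sets no limit
        cpu: 500m
        memory: 256Mi
```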
16. Labels
● Arbitrary key:value pairs that can be set/unset on objects
● Allows for implicit grouping
● Stored in the cluster
● Can be added/removed/updated at any time
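For example, labels are just metadata on an object (the names and values here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-pod                 # illustrative name
  labels:
    app: api                    # arbitrary key:value pairs
    tier: backend
    environment: test
spec:
  containers:
    - name: app
      image: nginx:1.25
```

Label selectors then do the implicit grouping, e.g. `kubectl get pods -l app=api,environment=test`.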
18. The role of the Scheduler
● Watches for pods with no value set for the nodeName:
○ These could be newly created, or evicted and waiting to be rescheduled etc.
○ You are unlikely to create singleton pods directly but the pod is the smallest atomic unit
● Determines the most suitable node to run the pod
○ Filtering – which nodes are feasible to run the pod using a set of predicates
○ Scoring – which of those is the ‘best’ to run the pod using a set of priorities
● The default scheduler is extensible or can be entirely replaced
● You can run multiple schedulers in the same cluster
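When multiple schedulers are running, a pod opts into one via `schedulerName` in its spec (a sketch; `my-scheduler` is a hypothetical second scheduler's name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled        # illustrative name
spec:
  schedulerName: my-scheduler   # hypothetical; defaults to "default-scheduler" if unset
  containers:
    - name: app
      image: nginx:1.25
```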
24. What are Requests and Limits
● The request values are used by the scheduler to decide ‘fit’ on the node
● Requests can be considered ‘reservations’ or ‘guarantees’
● Limits are boundaries
● The mechanisms for managing this vary with resource type
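Requests and limits are set per container in the pod spec. A minimal sketch with illustrative values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sized-pod               # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 250m             # used by the scheduler to decide 'fit' on a node
          memory: 128Mi         # effectively a reservation
        limits:
          cpu: 500m             # enforced via CFS throttling
          memory: 256Mi         # allocating beyond this triggers the OOM killer
```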
25. CPU
● Requests are managed by cgroups
○ Uses CPU shares, so that the Linux CFS (Completely Fair Scheduler) gives each process its
allocated amount of time on the CPU
● Limits are managed by CFS throttling
○ Within each time slice the process will be able to use up to the set limit of processor time, but
will be removed from the CPU once that limit is reached
● CPU demand will not trigger eviction
● Throttling can be an issue for some workloads
● The CPU manager policy can be changed to use a more deterministic method
26. Memory
● Is not compressible!
● Managed by cgroups, the process will get the requested value as a
‘reservation’ and be allowed to burst to the limit, without any attempt to
reclaim memory
● If the process attempts to allocate more than the limit, the kernel will report
out of memory and will terminate the process (OOM killer)
27. Setting values for Request and Limits
● Test, observe and measure, using metrics
● Iterate and refine the values over time
● Consider horizontally scaling if possible
● Aim for good and iterate
28. Oversized Containers (requests are higher than used)
topk(10,
  sum by (namespace, pod, container) (
    (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m])
      - on (namespace, pod, container) group_left
        avg by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
    ) * -1 > 0
  )
)
30. Eviction
● API or resource based
● The application running in the container needs to gracefully handle this
○ Probes, timeouts, signals
○ Prestop hooks
31. Pod QoS classes, used for eviction decisions
● Best Effort
○ No requests or limits set
● Burstable
○ Requests set; limits unset or higher than requests
● Guaranteed
○ Limits == Requests
○ Or only limits set
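For instance, setting limits equal to requests for every container yields the Guaranteed class (illustrative values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod          # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:               # requests == limits → QoS class "Guaranteed"
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

Omitting both blocks would make this pod BestEffort, the first to be considered for eviction under node pressure.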
32. Priority and Preemption
● Cluster admin can define priority classes that can be used per-pod
● Considered both in the scheduling queue and at execution time (preemption)
● Used by default to protect system pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
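A pod then references the class by name in its spec, for example (illustrative pod name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-pod           # illustrative name
spec:
  priorityClassName: high-priority-nonpreempting  # must match an existing PriorityClass
  containers:
    - name: app
      image: nginx:1.25
```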
33. Scaling up (and down)
● Horizontal Pod Autoscaler (HPA)
○ Add/Remove replicas
○ Consider start/warm up time
○ Scale up/down decision based on metrics
● Vertical Pod Autoscaler (VPA)
○ Increases request/limit settings
○ Advisory mode can be useful in testing
● Cluster Autoscaler (CA)
○ Consider the Node creation time, may make sense to pre-define some capacity
○ Simple for public cloud, on premises is harder
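A sketch of an HPA scaling a hypothetical `web` Deployment on CPU utilization (the target and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                 # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                   # hypothetical workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale up when average CPU exceeds 70% of requests
```

Note that utilization targets are relative to the pods' CPU requests, another reason to set requests deliberately.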
34. Pod Disruption Budgets
● A PDB specifies a minimum availability that the cluster will honour during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rails
35. Pod placement: Affinity and Anti-Affinity
● A pod can have affinity for a node based on the labels assigned to the node
● A pod can have anti-affinity for a node based on taints on the node
● A pod can have affinity for other pods based on labels on the pods
● A pod can have anti-affinity for other pods based on labels on the pods
● Node and pod affinity/anti-affinity are currently schedule-time operations
● Taints can be used to evict currently running pods
● Can be ‘required’ (hard) or ‘preferred’ (soft)
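For the taint-based side, a sketch: after tainting a node (e.g. `kubectl taint nodes node1 gpu=true:NoSchedule`, a hypothetical node and key), only pods with a matching toleration can schedule there:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                 # illustrative name
spec:
  tolerations:
    - key: gpu                  # must match the taint's key
      operator: Equal
      value: "true"
      effect: NoSchedule        # tolerates the NoSchedule taint above
  containers:
    - name: app
      image: nginx:1.25
```

A `NoExecute` taint would additionally evict already-running pods that lack a matching toleration.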
36. Pod with Node Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: environment
                operator: In
                values:
                  - test
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: type
                operator: In
                values:
37. Pod with Pod Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: caching
                operator: In
                values:
                  - high
          topologyKey: topology.kubernetes.io/zone
38. Pod with Pod Anti-Affinity
apiVersion: v1
kind: Pod
~~~ <snip many lines>
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - public
          topologyKey: topology.kubernetes.io/zone
40. How to learn more
● Tanzu Developer Center
○ https://tanzu.vmware.com/developer/
● Kubernetes Documentation
○ https://kubernetes.io/
● Run a cluster locally
○ https://tanzucommunityedition.io/
● Run a simulator
○ https://github.com/kubernetes-sigs/kube-scheduler-simulator
41. Good Practice
● Work closely with your platform team
● Experiment, test and measure your workloads
● As a minimum, set requests for cpu
● Set requests and limits for memory
● Utilise pod affinity and anti-affinity for availability and better resource usage
● Protect scarce resources with taints
● Use node pools to better control resources and match workloads
● Experiment with node dimensions and grouping
● Enable limitranges and resource quotas as guardrails
● Use scaling for unpredictable workload changes
42. Try to avoid
● Avoid optimizing too early or too aggressively
● Very complex rules are hard to troubleshoot and can leave the cluster unable
to make progress
● Pod to Pod Affinity/Anti-Affinity can be costly in larger clusters
● Avoid eviction of pods with applications that don’t tolerate it well