Kubernetes pods / container scheduling 201 - pod and node affinity and anti-affinity, node selectors, taints and tolerations, persistent volume constraints, scheduler configuration, custom scheduler development, and more.
5. What’s in the slides
• Kubernetes overview
• Scheduling algorithm
• Scheduling controls
• Advanced scheduling techniques
• Examples, use cases, and recommendations
@olgch; @kublr
6. Kubernetes | Nodes and Pods
[Diagram: two nodes; Node 1 runs Pod A-1 (10.0.0.3) and Pod B-1 (10.0.0.8), Node 2 runs Pod A-2 (10.0.1.5); each pod contains one or more containers]
@olgch; @kublr
7. Kubernetes | Container Orchestration
[Cluster diagram: the User talks to the K8S Master API; Scheduler(s) and Controller(s) work through the API; each node runs Kubelet and Docker; Pods A, B, and C are spread across Node 1 and Node 2]
@olgch; @kublr
8. Kubernetes | Container Orchestration
[Cluster diagram: Master API, Scheduler(s), Controller(s), User; Node 1 with Kubelet and Docker] It all starts empty.
@olgch; @kublr
9. Kubernetes | Container Orchestration
[Cluster diagram] Kubelet registers a Node object in the master.
@olgch; @kublr
11. Kubernetes | Container Orchestration
[Cluster diagram: Node 1 and Node 2 registered; Pods A, B, and C pending] User creates (unscheduled) Pod objects in the master.
@olgch; @kublr
12. Kubernetes | Container Orchestration
[Cluster diagram] Scheduler notices unscheduled Pods…
@olgch; @kublr
13. Kubernetes | Container Orchestration
[Cluster diagram] …identifies the best node to run them on…
@olgch; @kublr
14. Kubernetes | Container Orchestration
[Cluster diagram] …and marks the pods as scheduled on the corresponding nodes.
@olgch; @kublr
15. Kubernetes | Container Orchestration
[Cluster diagram] Kubelet notices pods scheduled to its node…
@olgch; @kublr
16. Kubernetes | Container Orchestration
[Cluster diagram] …starts the pods’ containers…
@olgch; @kublr
17. Kubernetes | Container Orchestration
[Cluster diagram] …and reports the pods as “running” to the master.
@olgch; @kublr
18. Kubernetes | Container Orchestration
[Cluster diagram] Scheduler finds the best node to run pods. HOW?
@olgch; @kublr
19. Kubernetes | Scheduling Algorithm
For each pod that needs scheduling:
1. Filter nodes
2. Calculate node priorities
3. Schedule the pod if possible
@olgch; @kublr
20. Kubernetes | Scheduling Algorithm
Volume filters
• Do the zones of the pod’s requested volumes fit the node’s zone?
• Can the node attach the volumes?
• Are there conflicts with already-mounted volumes?
• Are there additional volume topology constraints?
[Pipeline: Volume filters → Resource filters → Topology filters → Prioritization]
@olgch; @kublr
21. Kubernetes | Scheduling Algorithm
Resource filters
• Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit within the node’s available resources?
• Can the pod’s requested ports be opened on the node?
• Is the node free of memory and disk pressure?
@olgch; @kublr
22. Kubernetes | Scheduling Algorithm
Topology filters
• Is the pod requested to run on this specific node?
• Are inter-pod affinity constraints satisfied?
• Does the node match the pod’s node selector?
• Can the pod tolerate the node’s taints?
@olgch; @kublr
24. Scheduling | Controlling Pods Destination
• Resource requirements
• Be aware of volumes
• Node constraints
• Affinity and anti-affinity
• Priorities and Priority Classes
• Scheduler configuration
• Custom / multiple schedulers
@olgch; @kublr
25. Scheduling Controlled | Resources
• CPU, RAM, other (GPU)
• Requests and limits
• Reserved resources
kind: Node
status:
  allocatable:
    cpu: "4"
    memory: 8070796Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8Gi
    pods: "110"

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
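Limits can be set alongside requests; a minimal sketch (values are illustrative), with limits equal to requests as recommended later in the Dos and Don’ts:

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:        # used by the scheduler to place the pod
        cpu: 100m
        memory: 1Gi
      limits:          # cap runtime usage; here equal to the requests
        cpu: 100m
        memory: 1Gi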
@olgch; @kublr
26. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Zone A contains Node 1 (Pod A) and Node 2 (Pod B, Volume 2); Pod C requests a volume located in Zone B and is unschedulable]
@olgch; @kublr
27. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Node 1 running Pods A and B with Volumes 1 and 2 attached; Pod C requests an additional volume]
@olgch; @kublr
28. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Volume 1 attached to Pod A on Node 1, Volume 2 attached to Pod B on Node 2; Pod C awaits placement]
@olgch; @kublr
29. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv
spec:
  ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node
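For dynamically provisioned volumes, zone mismatches can also be avoided by delaying volume binding until a pod is scheduled. A minimal sketch of such a StorageClass (the name is illustrative; the provisioner assumes an AWS EBS setup):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: kubernetes.io/aws-ebs
# provision and bind the volume only once a consuming pod is scheduled,
# so the volume is created in the zone of the chosen node
volumeBindingMode: WaitForFirstConsumer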
@olgch; @kublr
30. Scheduling Controlled | Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
[Diagram: Pod A pinned directly to Node 1]

kind: Pod
spec:
  nodeName: node1

kind: Node
metadata:
  name: node1
@olgch; @kublr
31. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
[Diagram: of Nodes 1–3, only the node labeled tier: backend can receive Pod A]

kind: Node
metadata:
  labels:
    tier: backend

kind: Pod
spec:
  nodeSelector:
    tier: backend
@olgch; @kublr
32. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
kind: Pod
spec:
  tolerations:
  - key: error
    value: disk
    operator: Equal
    effect: NoExecute
    tolerationSeconds: 60

kind: Node
spec:
  taints:
  - effect: NoExecute
    key: error
    value: disk
    timeAdded: null

[Diagram: Node 1 is tainted; Pod A tolerates the taint and may run there, Pod B may not]
@olgch; @kublr
33. Scheduling Controlled | Taints
Taints communicate node conditions
• Key – condition category
• Value – specific condition
• Operator – how the taint value is matched
• Equal – value equality
• Exists – key existence (value ignored)
• Effect
• NoSchedule – filter at scheduling time
• PreferNoSchedule – prioritize at scheduling time
• NoExecute – filter at scheduling time, evict if executing
• TolerationSeconds – time to tolerate “NoExecute” taint
kind: Pod
spec:
  tolerations:
  - key: <taint key>
    value: <taint value>
    operator: <match operator>
    effect: <taint effect>
    tolerationSeconds: 60
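As a usage note, a taint like the one described above can be applied with standard kubectl syntax, for example (node name, key, and value are illustrative):

kubectl taint nodes node1 error=disk:NoExecute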
@olgch; @kublr
40. Scheduling Controlled | Affinity Example
affinity:
  topologyKey: tier
  labelSelector:
    matchLabels:
      group: a

[Diagram: Nodes 1, 3, and 4 labeled tier: a / tier: b act as topology domains; placement of Pod B (group: a) follows the tier topology key]
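Expanded into a full pod spec, the abbreviated snippet above corresponds to something like the following sketch (assuming required pod affinity; the complete syntax appears in the use-case slides below):

kind: Pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: tier          # co-locate within the same "tier" topology domain
        labelSelector:
          matchLabels:
            group: a               # with pods labeled group: a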
@olgch; @kublr
48. Scheduling Controlled | Custom Scheduler
Naive implementation
• In an infinite loop:
• Get list of Nodes: /api/v1/nodes
• Get list of Pods: /api/v1/pods
• Select Pods with
status.phase == Pending and
spec.schedulerName == our-name
• For each pod:
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
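For the custom scheduler to pick a pod up, the pod must opt in by naming it in its spec; a minimal sketch (container name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: our-name   # matches the spec.schedulerName check in the loop above
  containers:
  - name: main
    image: nginx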
@olgch; @kublr
49. Scheduling Controlled | Custom Scheduler
Better implementation
• Watch Pods: /api/v1/pods
• On each Pod event:
• Process the Pod if
status.phase == Pending and
spec.schedulerName == our-name
• Get list of Nodes: /api/v1/nodes
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
@olgch; @kublr
50. Scheduling Controlled | Custom Scheduler
Even better implementation
• Watch Nodes: /api/v1/nodes
• On each Node event:
• Update Node cache
• Watch Pods: /api/v1/pods
• On each Pod event:
• Process the Pod if
status.phase == Pending and
spec.schedulerName == our-name
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
@olgch; @kublr
51. Use Case | Distributed Pods
apiVersion: v1
kind: Pod
metadata:
  name: db-replica-3
  labels:
    component: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: db-replica-1, db-replica-2, and db-replica-3 each land on a different node (Nodes 1–3)]
@olgch; @kublr
52. Use Case | Co-located Pods
apiVersion: v1
kind: Pod
metadata:
  name: app-replica-1
  labels:
    component: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: app-replica-1 is co-located with db-replica-1 on Node 3; db-replica-2 runs on Node 2]
@olgch; @kublr
53. Use Case | Reliable Service on Spot Nodes
• “fixed” node group
Expensive, more reliable, fixed size
Tagged with label nodeGroup: fixed
• “spot” node group
Inexpensive, unreliable, auto-scaled
Tagged with label nodeGroup: spot
• Scheduling rules:
• At least two pods on “fixed” nodes
• All other pods favor “spot” nodes
• Implemented with a custom scheduler or multiple Deployments (see the sketch below)
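Without a custom scheduler, the rules above can be approximated with two Deployments; a sketch assuming the nodeGroup labels from this slide (names, images, replica counts, and the app/pool labels are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fixed
spec:
  replicas: 2                    # the "at least two pods on fixed nodes" part
  selector:
    matchLabels:
      app: myapp
      pool: fixed
  template:
    metadata:
      labels:
        app: myapp
        pool: fixed
    spec:
      nodeSelector:
        nodeGroup: fixed         # hard requirement: only "fixed" nodes
      containers:
      - name: main
        image: myapp:latest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spot
spec:
  replicas: 4                    # the auto-scaled remainder
  selector:
    matchLabels:
      app: myapp
      pool: spot
  template:
    metadata:
      labels:
        app: myapp
        pool: spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: nodeGroup
                operator: In
                values: [ "spot" ]   # soft preference: spill over to other nodes if needed
      containers:
      - name: main
        image: myapp:latest

A single Service selecting app: myapp would then cover pods from both Deployments.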
@olgch; @kublr
54. Scheduling | Dos and Don’ts
DO
• Prefer scheduling based on resources and pod affinity over node constraints and node affinity
• Specify resource requests
• Keep requests == limits
• Especially for non-elastic resources
• Memory is non-elastic!
• Safeguard against missing resource specs
• Namespace default limits (see the LimitRange sketch below)
• Admission controllers
• Plan the architecture of localized volumes (EBS, local)
DON’T
• ... assign pods to nodes directly
• ... use node affinity or node constraints
• ... use pods with no resource requests
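As a safeguard against pods with no resource specs, namespace defaults can be set with a LimitRange; a minimal sketch (namespace and values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits resource requests
      cpu: 100m
      memory: 256Mi
    default:               # applied when a container omits resource limits
      cpu: 100m
      memory: 256Mi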
@olgch; @kublr
55. Scheduling | Key Takeaways
• Scheduling filters and priorities
• Resource requests and availability
• Inter-pod affinity/anti-affinity
• Volume localization (availability zones)
• Node labels and selectors
• Node affinity/anti-affinity
• Node taints and tolerations
• Scheduler(s) tweaking and customization
@olgch; @kublr
56. Next steps
• Pod priority, preemption, and eviction
• Pod Overhead
• Scheduler Profiles
• Scheduler performance considerations
• Admission Controllers and dynamic admission control
• Dynamic policies and OPA
@olgch; @kublr
“If you like something you hear today, please tweet at me @olgch”
I will spend a few minutes reintroducing Docker and Kubernetes architecture concepts before we dig into Kubernetes scheduling.
Talking about scheduling, I’ll try to explain
capabilities, …
controls available to cluster users and administrators, …
and extension points
We’ll also look at a couple of examples and…
Some recommendations
Nodes register with the master
Pods are assigned to nodes
Each pod gets an address allocated from the overlay-network address pool assigned to the node at registration
Containers in a pod are started together
Containers in a pod share the pod’s network address space and data volumes
The pod defines the common life cycle of its containers
The pod life cycle is very simple: a pod cannot be moved or changed, it must be re-created
The Master API maintains the overall picture: the desired state and the current known state
The master relies on other components (controllers, kubelet) to update the current known state
The user modifies the desired state and reads the current state
Controllers “clarify” the desired state
The kubelet performs actions to achieve the desired state and reports the current state
The scheduler is just one of the controllers, responsible for assigning unassigned pods to specific nodes
First there was nothing
The pod requests new volumes: can they be created in a zone where they can be attached to the node?
If the requested volumes already exist, can they be attached to the node?
If the volumes are already attached/mounted, can they be mounted on this node?
Are there any other user-specified constraints?
This most often matters in AWS, where an EBS volume can only be attached to instances in the same AZ as the volume.
This pod should be co-located (affinity) or not co-located (anti-affinity)
with the pods matching the labelSelector in the specified namespaces,
where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running.
Empty topologyKey:
For PreferredDuringScheduling pod anti-affinity, empty topologyKey is interpreted as "all topologies" ("all topologies" here means all the topologyKeys indicated by scheduler command-line argument --failure-domains);
For affinity and for RequiredDuringScheduling pod anti-affinity, empty topologyKey is not allowed.
Unified application delivery and ops platform wanted: monitoring, logs, security, multiple environments, ...
Where the project comes from
Company overview
Kubernetes as a solution – standardized delivery platform
Kubernetes is great for managing containers, but who manages Kubernetes?
How to streamline monitoring and collection of logs with multiple Kubernetes clusters?