Kubernetes pods / container scheduling 201 - pod and node affinity and anti-affinity, node selectors, taints and tolerations, persistent volume constraints, scheduler configuration, custom scheduler development, and more.
5. What’s in the slides
• Kubernetes overview
• Scheduling algorithm
• Scheduling controls
• Advanced scheduling techniques
• Examples, use cases, and recommendations
@olgch; @kublr
6. Kubernetes | Nodes and Pods
[Diagram: two nodes; Node 1 runs Pod A-1 (10.0.0.3) and Pod B-1 (10.0.0.8), Node 2 runs Pod A-2 (10.0.1.5); each pod contains one or more containers]
@olgch; @kublr
7. Kubernetes | Container Orchestration
[Cluster diagram: the User talks to the K8S Master API; Scheduler(s) and Controller(s) work through the API; each node runs Kubelet and Docker; Pods A, B, and C are spread across Node 1 and Node 2]
@olgch; @kublr
8. Kubernetes | Container Orchestration
[Cluster diagram: Master API, Scheduler(s), Controller(s), User; Node 1 with Kubelet and Docker] It all starts empty.
@olgch; @kublr
9. Kubernetes | Container Orchestration
[Cluster diagram] Kubelet registers a Node object in the master.
@olgch; @kublr
11. Kubernetes | Container Orchestration
[Cluster diagram: Node 1 and Node 2 registered; Pods A, B, and C pending] User creates (unscheduled) Pod objects in the master.
@olgch; @kublr
12. Kubernetes | Container Orchestration
[Cluster diagram] Scheduler notices unscheduled Pods…
@olgch; @kublr
13. Kubernetes | Container Orchestration
[Cluster diagram] …identifies the best node to run them on…
@olgch; @kublr
14. Kubernetes | Container Orchestration
[Cluster diagram] …and marks the pods as scheduled on the corresponding nodes.
@olgch; @kublr
15. Kubernetes | Container Orchestration
[Cluster diagram] Kubelet notices pods scheduled to its node…
@olgch; @kublr
16. Kubernetes | Container Orchestration
[Cluster diagram] …starts the pods’ containers…
@olgch; @kublr
17. Kubernetes | Container Orchestration
[Cluster diagram] …and reports the pods as “running” to the master.
@olgch; @kublr
18. Kubernetes | Container Orchestration
[Cluster diagram] Scheduler finds the best node to run pods. HOW?
@olgch; @kublr
19. Kubernetes | Scheduling Algorithm
For each pod that needs scheduling:
1. Filter nodes
2. Calculate node priorities
3. Schedule the pod if possible
@olgch; @kublr
20. Kubernetes | Scheduling Algorithm
Volume filters
• Do the zones of the pod’s requested volumes fit the node’s zone?
• Can the node attach the volumes?
• Are there conflicts with already-mounted volumes?
• Are there additional volume topology constraints?
[Pipeline: Volume filters → Resource filters → Topology filters → Prioritization]
@olgch; @kublr
21. Kubernetes | Scheduling Algorithm
Resource filters
• Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit within the node’s available resources?
• Can the pod’s requested ports be opened on the node?
• Is the node free of memory and disk pressure?
@olgch; @kublr
22. Kubernetes | Scheduling Algorithm
Topology filters
• Is the pod requested to run on this specific node?
• Are inter-pod affinity constraints satisfied?
• Does the node match the pod’s node selector?
• Can the pod tolerate the node’s taints?
@olgch; @kublr
24. Scheduling | Controlling Pods Destination
• Resource requirements
• Be aware of volumes
• Node constraints
• Affinity and anti-affinity
• Priorities and Priority Classes
• Scheduler configuration
• Custom / multiple schedulers
@olgch; @kublr
25. Scheduling Controlled | Resources
• CPU, RAM, other (GPU)
• Requests and limits
• Reserved resources
kind: Node
status:
  allocatable:
    cpu: "4"
    memory: 8070796Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8Gi
    pods: "110"

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
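Limits can be set alongside requests; a minimal sketch (values are illustrative), with limits equal to requests as recommended later in the Dos and Don’ts:

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:        # used by the scheduler to place the pod
        cpu: 100m
        memory: 1Gi
      limits:          # cap runtime usage; here equal to the requests
        cpu: 100m
        memory: 1Gi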
@olgch; @kublr
26. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Zone A contains Node 1 (Pod A) and Node 2 (Pod B, Volume 2); Pod C requests a volume located in Zone B and is unschedulable]
@olgch; @kublr
27. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Node 1 running Pods A and B with Volumes 1 and 2 attached; Pod C requests an additional volume]
@olgch; @kublr
28. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

[Diagram: Volume 1 attached to Pod A on Node 1, Volume 2 attached to Pod B on Node 2; Pod C awaits placement]
@olgch; @kublr
29. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv
spec:
  ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node
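For dynamically provisioned volumes, zone mismatches can also be avoided by delaying volume binding until a pod is scheduled. A minimal sketch of such a StorageClass (the name is illustrative; the provisioner assumes an AWS EBS setup):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: kubernetes.io/aws-ebs
# provision and bind the volume only once a consuming pod is scheduled,
# so the volume is created in the zone of the chosen node
volumeBindingMode: WaitForFirstConsumer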
@olgch; @kublr
30. Scheduling Controlled | Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
[Diagram: Pod A pinned directly to Node 1]

kind: Pod
spec:
  nodeName: node1

kind: Node
metadata:
  name: node1
@olgch; @kublr
31. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
[Diagram: of Nodes 1–3, only the node labeled tier: backend can receive Pod A]

kind: Node
metadata:
  labels:
    tier: backend

kind: Pod
spec:
  nodeSelector:
    tier: backend
@olgch; @kublr
32. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
kind: Pod
spec:
  tolerations:
  - key: error
    value: disk
    operator: Equal
    effect: NoExecute
    tolerationSeconds: 60

kind: Node
spec:
  taints:
  - effect: NoExecute
    key: error
    value: disk
    timeAdded: null

[Diagram: Node 1 is tainted; Pod A tolerates the taint and may run there, Pod B may not]
@olgch; @kublr
33. Scheduling Controlled | Taints
Taints communicate node conditions
• Key – condition category
• Value – specific condition
• Operator – how the taint value is matched
• Equal – value equality
• Exists – key existence (value ignored)
• Effect
• NoSchedule – filter at scheduling time
• PreferNoSchedule – prioritize at scheduling time
• NoExecute – filter at scheduling time, evict if executing
• TolerationSeconds – time to tolerate “NoExecute” taint
kind: Pod
spec:
  tolerations:
  - key: <taint key>
    value: <taint value>
    operator: <match operator>
    effect: <taint effect>
    tolerationSeconds: 60
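As a usage note, a taint like the one described above can be applied with standard kubectl syntax, for example (node name, key, and value are illustrative):

kubectl taint nodes node1 error=disk:NoExecute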
@olgch; @kublr
40. Scheduling Controlled | Affinity Example
affinity:
  topologyKey: tier
  labelSelector:
    matchLabels:
      group: a

[Diagram: Nodes 1, 3, and 4 labeled tier: a / tier: b act as topology domains; placement of Pod B (group: a) follows the tier topology key]
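Expanded into a full pod spec, the abbreviated snippet above corresponds to something like the following sketch (assuming required pod affinity; the complete syntax appears in the use-case slides below):

kind: Pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: tier          # co-locate within the same "tier" topology domain
        labelSelector:
          matchLabels:
            group: a               # with pods labeled group: a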
@olgch; @kublr
48. Scheduling Controlled | Custom Scheduler
Naive implementation
• In an infinite loop:
• Get list of Nodes: /api/v1/nodes
• Get list of Pods: /api/v1/pods
• Select Pods with
status.phase == Pending and
spec.schedulerName == our-name
• For each pod:
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
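For the custom scheduler to pick a pod up, the pod must opt in by naming it in its spec; a minimal sketch (container name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: our-name   # matches the spec.schedulerName check in the loop above
  containers:
  - name: main
    image: nginx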
@olgch; @kublr
49. Scheduling Controlled | Custom Scheduler
Better implementation
• Watch Pods: /api/v1/pods
• On each Pod event:
• Process the Pod if
status.phase == Pending and
spec.schedulerName == our-name
• Get list of Nodes: /api/v1/nodes
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
@olgch; @kublr
50. Scheduling Controlled | Custom Scheduler
Even better implementation
• Watch Nodes: /api/v1/nodes
• On each Node event:
• Update Node cache
• Watch Pods: /api/v1/pods
• On each Pod event:
• Process the Pod if
status.phase == Pending and
spec.schedulerName == our-name
• Calculate target Node
• Create a new Binding object:
POST /api/v1/namespaces/{namespace}/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
@olgch; @kublr
51. Use Case | Distributed Pods
apiVersion: v1
kind: Pod
metadata:
  name: db-replica-3
  labels:
    component: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: db-replica-1, db-replica-2, and db-replica-3 each land on a different node (Nodes 1–3)]
@olgch; @kublr
52. Use Case | Co-located Pods
apiVersion: v1
kind: Pod
metadata:
  name: app-replica-1
  labels:
    component: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: app-replica-1 is co-located with db-replica-1 on Node 3; db-replica-2 runs on Node 2]
@olgch; @kublr
53. Use Case | Reliable Service on Spot Nodes
• “fixed” node group
Expensive, more reliable, fixed size
Tagged with label nodeGroup: fixed
• “spot” node group
Inexpensive, unreliable, auto-scaled
Tagged with label nodeGroup: spot
• Scheduling rules:
• At least two pods on “fixed” nodes
• All other pods favor “spot” nodes
• Implemented with a custom scheduler or multiple Deployments (see the sketch below)
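Without a custom scheduler, the rules above can be approximated with two Deployments; a sketch assuming the nodeGroup labels from this slide (names, images, replica counts, and the app/pool labels are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fixed
spec:
  replicas: 2                    # the "at least two pods on fixed nodes" part
  selector:
    matchLabels:
      app: myapp
      pool: fixed
  template:
    metadata:
      labels:
        app: myapp
        pool: fixed
    spec:
      nodeSelector:
        nodeGroup: fixed         # hard requirement: only "fixed" nodes
      containers:
      - name: main
        image: myapp:latest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spot
spec:
  replicas: 4                    # the auto-scaled remainder
  selector:
    matchLabels:
      app: myapp
      pool: spot
  template:
    metadata:
      labels:
        app: myapp
        pool: spot
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: nodeGroup
                operator: In
                values: [ "spot" ]   # soft preference: spill over to other nodes if needed
      containers:
      - name: main
        image: myapp:latest

A single Service selecting app: myapp would then cover pods from both Deployments.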
@olgch; @kublr
54. Scheduling | Dos and Don’ts
DO
• Prefer scheduling based on resources and pod affinity over node constraints and node affinity
• Specify resource requests
• Keep requests == limits
• Especially for non-elastic resources
• Memory is non-elastic!
• Safeguard against missing resource specs
• Namespace default limits (see the LimitRange sketch below)
• Admission controllers
• Plan the architecture of localized volumes (EBS, local)
DON’T
• ... assign pods to nodes directly
• ... use node affinity or node constraints
• ... use pods with no resource requests
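As a safeguard against pods with no resource specs, namespace defaults can be set with a LimitRange; a minimal sketch (namespace and values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits resource requests
      cpu: 100m
      memory: 256Mi
    default:               # applied when a container omits resource limits
      cpu: 100m
      memory: 256Mi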
@olgch; @kublr
55. Scheduling | Key Takeaways
• Scheduling filters and priorities
• Resource requests and availability
• Inter-pod affinity/anti-affinity
• Volume localization (availability zones)
• Node labels and selectors
• Node affinity/anti-affinity
• Node taints and tolerations
• Scheduler(s) tweaking and customization
@olgch; @kublr
56. Next steps
• Pod priority, preemption, and eviction
• Pod Overhead
• Scheduler Profiles
• Scheduler performance considerations
• Admission Controllers and dynamic admission control
• Dynamic policies and OPA
@olgch; @kublr
“If you like something you hear today, please tweet at me @olgch”
I will spend a few minutes reintroducing Docker and Kubernetes architecture concepts before we dig into Kubernetes scheduling.
Talking about scheduling, I’ll try to explain
capabilities, …
controls available to cluster users and administrators, …
and extension points
We’ll also look at a couple of examples and…
Some recommendations
Nodes register with the master
Pods are assigned to nodes
Each pod gets an address allocated from the overlay-network address pool assigned to the node at registration
Containers in a pod are started together
Containers in a pod share the pod’s network address space and data volumes
The pod defines the common life cycle of its containers
The pod life cycle is very simple: a pod cannot be moved or changed, it must be re-created
The Master API maintains the overall picture: the desired state and the current known state
The master relies on other components (controllers, kubelet) to update the current known state
The user modifies the desired state and reads the current state
Controllers “clarify” the desired state
The kubelet performs actions to achieve the desired state and reports the current state
The scheduler is just one of the controllers, responsible for assigning unassigned pods to specific nodes
First there was nothing
The pod requests new volumes: can they be created in a zone where they can be attached to the node?
If the requested volumes already exist, can they be attached to the node?
If the volumes are already attached/mounted, can they be mounted on this node?
Are there any other user-specified constraints?
This most often matters in AWS, where an EBS volume can only be attached to instances in the same AZ as the volume.
This pod should be co-located (affinity) or not co-located (anti-affinity)
with the pods matching the labelSelector in the specified namespaces,
where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running.
Empty topologyKey:
For PreferredDuringScheduling pod anti-affinity, empty topologyKey is interpreted as "all topologies" ("all topologies" here means all the topologyKeys indicated by scheduler command-line argument --failure-domains);
For affinity and for RequiredDuringScheduling pod anti-affinity, empty topologyKey is not allowed.
Unified application delivery and ops platform wanted: monitoring, logs, security, multiple environments, ...
Where the project comes from
Company overview
Kubernetes as a solution – standardized delivery platform
Kubernetes is great for managing containers, but who manages Kubernetes?
How to streamline monitoring and collection of logs with multiple Kubernetes clusters?