Implement Advanced Scheduling Techniques in Kubernetes

Oleg Chunikhin | CTO, Kublr | February 2018

Introduction
• Oleg Chunikhin
  • CTO @ Kublr
  • Chief Software Architect @ EastBanc Technologies
• Kublr
  • Enterprise Kubernetes cluster manager
  • Application delivery platform

What to Look For
• Kubernetes overview
• Scheduling algorithm
• Scheduling controls
• Advanced scheduling techniques
• Examples, use cases, and recommendations

Kubernetes | Technology Stack
Kubernetes
• Orchestration
• Network
• Configuration
• Service discovery
• Ingress
• Persistence
• …
Docker
• Distribution
• Configuration
• Isolation

Docker | Architecture
[Diagram: a Docker image repository and an instance holding images, app data, the Docker CLI, the Docker daemon, an overlay network, and application containers]

Kubernetes | Architecture
[Diagram: a master node running the K8s master components (etcd, scheduler, api, controller) with K8s metadata; nodes running Docker, kubelet, app data, and the K8s node components (overlay network, discovery, connectivity); infrastructure and application containers on each; an overlay network connecting them]

Kubernetes | Nodes and Pods
Node 1
• Pod A-1 (10.0.0.3): Cnt1, Cnt2
• Pod B-1 (10.0.0.8): Cnt3
Node 2
• Pod A-2 (10.0.1.5): Cnt1, Cnt2

Kubernetes | Container Orchestration
[Diagram: User, K8S Master API, K8S Scheduler(s), and K8S Controller(s); Node 1 (Docker, Kubelet) runs Pod A and Pod B, Node 2 runs Pod C]

1. It all starts empty.
2. Kubelet registers node object in master.
3. User creates (unscheduled) Pod object(s) in master.
4. Scheduler notices unscheduled Pods…
5. …identifies the best node to run them on…
6. …and marks the pods as scheduled on corresponding nodes.
7. Kubelet notices pods scheduled to its nodes…
8. …and starts pods’ containers.
Scheduler finds the best node to run pods. HOW?

Kubernetes | Scheduling Algorithm
For each pod that needs scheduling:
1. Filter nodes
2. Calculate node priorities
3. Schedule pod if possible
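
The three steps above can be sketched as a small filter-then-score loop. This is only an illustration under simplifying assumptions: the `Node` and `Pod` classes, the two filters, and the single weighted priority are hypothetical stand-ins, not the real kube-scheduler predicates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:  # hypothetical, simplified node model
    name: str
    free_cpu_millis: int
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:  # hypothetical, simplified pod model
    name: str
    cpu_millis: int
    node_selector: dict = field(default_factory=dict)


# Step 1: filters answer "can this pod run on this node at all?"
def fits_resources(pod: Pod, node: Node) -> bool:
    return pod.cpu_millis <= node.free_cpu_millis


def matches_selector(pod: Pod, node: Node) -> bool:
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


# Step 2: priorities score the nodes that passed the filters (higher is better).
def least_requested(pod: Pod, node: Node) -> int:
    return node.free_cpu_millis - pod.cpu_millis


FILTERS = [fits_resources, matches_selector]
PRIORITIES = [(least_requested, 1)]  # (priority function, weight)


def schedule(pod: Pod, nodes: list) -> Optional[str]:
    """Return the chosen node name, or None if the pod is unschedulable."""
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None  # Step 3: nothing fits, the pod stays pending

    def score(node: Node) -> int:
        return sum(weight * prio(pod, node) for prio, weight in PRIORITIES)

    best = max(feasible, key=score)
    best.free_cpu_millis -= pod.cpu_millis  # account for the new pod
    return best.name
```

Keeping filters and priorities as plain lists mirrors the idea that both sets are configurable and that priorities carry weights.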

Volume filters
• Do the zones of the pod’s requested volumes fit the node’s zone?
• Can the node attach the volumes?
• Are there conflicts between mounted volumes?
• Are there additional volume topology constraints?
[Pipeline: Volume filters → Resource filters → Topology filters → Prioritization]

Resource filters
• Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources?
• Can the pod’s requested ports be opened on the node?
• Is the node free of memory and disk pressure?

Topology filters
• Is the pod requested to run on this
node?
• Are there inter-pod affinity
constraints?
• Does the node match the pod’s
node selector?
• Can the pod tolerate the node’s
taints?

Prioritize with weights for
• Pod replicas distribution
• Least (or most) node utilization
• Balanced resource usage
• Inter-pod affinity priority
• Node affinity priority
• Taint toleration priority

Scheduling | Controlling Pod Destination
• Specify resource requirements
• Be aware of volumes
• Use node constraints
• Use affinity and anti-affinity
• Scheduler configuration
• Custom / multiple schedulers

Scheduling Controlled | Resources
• CPU, RAM, other (GPU)
• Requests and limits
• Reserved resources
kind: Node
status:
  allocatable:
    cpu: "4"
    memory: 8070796Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8Gi
    pods: "110"
---
kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
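
A rough sketch of the resource check implied by the manifests above: the pod's summed requests must fit into the node's allocatable amount minus what pods already on the node have requested. The helper names and the `requested_*` parameters are assumptions for illustration, and only the common quantity suffixes are parsed.

```python
from decimal import Decimal

# Binary suffixes are listed first so "Gi" is tried before "G".
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> int:
    """Return CPU in millicores: "100m" -> 100, "4" -> 4000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return memory in bytes: "1Gi" -> 1073741824."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def pod_fits(pod: dict, node: dict, requested_cpu_m: int = 0, requested_mem: int = 0) -> bool:
    """pod and node are dicts shaped like the manifests above.

    requested_cpu_m / requested_mem are the requests of pods already
    placed on the node (hypothetical parameters for this sketch).
    """
    cpu = mem = 0
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        cpu += parse_cpu(requests.get("cpu", "0"))
        mem += parse_memory(requests.get("memory", "0"))
    allocatable = node["status"]["allocatable"]
    return (
        cpu <= parse_cpu(allocatable["cpu"]) - requested_cpu_m
        and mem <= parse_memory(allocatable["memory"]) - requested_mem
    )
```

Note that the comparison uses requests, not limits, and allocatable, not capacity.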

Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure the node can attach
enough volumes
• Avoid volume location conflicts
• Use volume topology constraints
(alpha in 1.7)
[Diagram: Node 1 (Pod A) and Node 2 (Pod B, Volume 2) are in Zone A; the volume requested by Pod C is in Zone B, so Pod C is unschedulable]

[Diagram: Node 1 with Pod A, Pod B, and Pod C; Volume 1, Volume 2, and the volume requested by Pod C]
[Diagram: Node 1 with Pod A and Volume 1; Node 2 with Pod B and Volume 2; Pod C]
annotations:
  "volume.alpha.kubernetes.io/node-affinity": '{
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [{
        "matchExpressions": [{
          "key": "kubernetes.io/hostname",
          "operator": "In",
          "values": ["docker03"]
        }]
      }]
    }}'

Scheduling Controlled | Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations
[Diagram: Pod A assigned to Node 1]
kind: Pod
spec:
  nodeName: node1
---
kind: Node
metadata:
  name: node1
Scheduling Controlled | Node Constraints
[Diagram: Node 1, Node 2, and Node 3; Pod A; label tier: backend]
kind: Node
metadata:
  labels:
    tier: backend
---
kind: Pod
spec:
  nodeSelector:
    tier: backend
Scheduling Controlled | Node Constraints
kind: Pod
spec:
  tolerations:
  - key: error
    value: disk
    operator: Equal
    effect: NoExecute
    tolerationSeconds: 60
---
kind: Node
spec:
  taints:
  - effect: NoExecute
    key: error
    value: disk
    timeAdded: null
[Diagram: Node 1 is tainted; Pod A tolerates the taint; Pod B]
Scheduling Controlled | Taints
Taints communicate node conditions
• Key – condition category
• Value – specific condition
• Operator – value wildcard
  • Equal
  • Exists
• Effect
  • NoSchedule – filter at scheduling time
  • PreferNoSchedule – prioritize at scheduling time
  • NoExecute – filter at scheduling time, evict if executing
• TolerationSeconds – time to tolerate “NoExecute” taint
kind: Pod
spec:
  tolerations:
  - key: <taint key>
    value: <taint value>
    operator: <match operator>
    effect: <taint effect>
    tolerationSeconds: 60
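
The matching rules listed above can be sketched roughly as follows. The function names are hypothetical, and the logic reflects the documented behaviour as understood here (an empty effect or key on a toleration acts as a wildcard); it is not the scheduler's actual code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Does a single toleration match a single taint?"""
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False  # an empty effect matches any effect
    key = toleration.get("key")
    if key and key != taint.get("key"):
        return False  # an empty key (with Exists) matches any key
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True  # the value is ignored
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def passes_taint_filter(pod: dict, node: dict) -> bool:
    """NoSchedule and NoExecute taints filter the node out unless tolerated.

    PreferNoSchedule is skipped here because it only affects prioritization.
    """
    tolerations = pod.get("spec", {}).get("tolerations", [])
    for taint in node.get("spec", {}).get("taints", []):
        if taint.get("effect") == "PreferNoSchedule":
            continue
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True
```

One consequence worth noticing: a toleration with effect NoExecute does not cover a NoSchedule taint, even when key and value match.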

Scheduling Controlled | Affinity
• Node affinity
• Inter-pod affinity
• Inter-pod anti-affinity
kind: Pod
spec:
  affinity:
    nodeAffinity: { ... }
    podAffinity: { ... }
    podAntiAffinity: { ... }
Scheduling Controlled | Node Affinity
Scope
• Preferred during scheduling, ignored during execution
• Required during scheduling, ignored during execution
kind: Pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference: { <node selector term> }
      - ...
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - { <node selector term> }
        - ...
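
A sketch of how the two scopes could be evaluated for one node: required terms act as a filter (terms are ORed, expressions inside a term are ANDed), and preferred terms add their weight to the node's score. Function names are hypothetical, and `matchFields` is left out for brevity.

```python
def expr_matches(expr: dict, labels: dict) -> bool:
    """Evaluate one matchExpressions entry against node labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def term_matches(term: dict, labels: dict) -> bool:
    """All expressions of a term must match; an empty term matches nothing."""
    exprs = term.get("matchExpressions", [])
    return bool(exprs) and all(expr_matches(e, labels) for e in exprs)


def node_affinity(affinity: dict, labels: dict) -> tuple:
    """Return (feasible, score) for a node with the given labels."""
    feasible = True
    required = affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if required is not None:
        terms = required.get("nodeSelectorTerms", [])
        feasible = any(term_matches(t, labels) for t in terms)
    preferred = affinity.get("preferredDuringSchedulingIgnoredDuringExecution", [])
    score = sum(p["weight"] for p in preferred if term_matches(p["preference"], labels))
    return feasible, score
```

Both scopes are "ignored during execution": a label change after placement does not evict the pod.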

Scheduling Controlled | Inter-pod Affinity
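Scope
• Preferred during scheduling, ignored during execution
• Required during scheduling, ignored during execution
kind: Pod
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        podAffinityTerm: { <pod affinity term> }
      - ...
      requiredDuringSchedulingIgnoredDuringExecution:
      - { <pod affinity term> }
      - ...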
Scheduling Controlled | Inter-pod Anti-affinity
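Scope
• Preferred during scheduling, ignored during execution
• Required during scheduling, ignored during execution
kind: Pod
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        podAffinityTerm: { <pod affinity term> }
      - ...
      requiredDuringSchedulingIgnoredDuringExecution:
      - { <pod affinity term> }
      - ...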
Scheduling Controlled | Pod Affinity Terms
• topologyKey – nodes’ label key defining co-location
• labelSelector and namespaces – select group of pods
<pod affinity term>:
  topologyKey: <topology label key>
  namespaces: [ <namespace>, ... ]
  labelSelector:
    matchLabels:
      <label key>: <label value>
      ...
    matchExpressions:
    - key: <label key>
      operator: In | NotIn | Exists | DoesNotExist
      values: [ <value 1>, ... ]
    ...

Scheduling Controlled | Affinity Example
affinity:
  topologyKey: tier
  labelSelector:
    matchLabels:
      group: a
[Diagram: nodes labeled tier: a or tier: b; Pod B (group: a) runs on Node 1 (tier: a)]
Scheduling Controlled | Scheduler Configuration
• Algorithm provider
• Policy configuration file / ConfigMap
• Extender

Default Scheduler | Algorithm Provider
kube-scheduler
--scheduler-name=default-scheduler
--algorithm-provider=DefaultProvider
--algorithm-provider=ClusterAutoscalerProvider

Default Scheduler | Custom Policy Config
kube-scheduler
--config=<file>
--policy-config-file=<file>
--use-legacy-policy-config=<true|false>
--policy-configmap=<config map name>
--policy-configmap-namespace=<config map ns>

Default Scheduler | Custom Policy Config
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    ...
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    ...
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "hardPodAffinitySymmetricWeight" : 10,
  "alwaysCheckAllPredicates" : false
}

Default Scheduler | Scheduler Extender
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [...],
  "priorities" : [...],
  "extenders" : [{
    "urlPrefix": "http://127.0.0.1:12346/scheduler",
    "filterVerb": "filter",
    "bindVerb": "bind",
    "prioritizeVerb": "prioritize",
    "weight": 5,
    "enableHttps": false,
    "nodeCacheCapable": false
  }],
  "hardPodAffinitySymmetricWeight" : 10,
  "alwaysCheckAllPredicates" : false
}

Default Scheduler | Scheduler Extender
func filter(pod, nodes) api.NodeList
func prioritize(pod, nodes) HostPriorityList
func bind(pod, node)

Scheduling Controlled | Multiple Schedulers
kind: Pod
metadata:
  name: pod2
spec:
  schedulerName: my-scheduler
---
kind: Pod
metadata:
  name: pod1
spec:
  ...
Scheduling Controlled | Custom Scheduler
Naive implementation
• In an infinite loop:
  • Get list of Nodes: /api/v1/nodes
  • Get list of Pods: /api/v1/pods
  • Select Pods with status.phase == Pending and spec.schedulerName == our-name
  • For each pod:
    • Calculate target Node
    • Create a new Binding object: POST /api/v1/bindings
apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1

Better implementation
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Get list of Nodes: /api/v1/nodes
  • Calculate target Node
  • Create a new Binding object: POST /api/v1/bindings

Even better implementation
• Watch Nodes: /api/v1/nodes
• On each Node event:
  • Update Node cache
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Calculate target Node using the Node cache
  • Create a new Binding object: POST /api/v1/bindings
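
One pass of the naive loop can be sketched without any HTTP code: take the JSON lists returned for pods and nodes, select the pods this scheduler owns, and build one Binding per pod. The function names and the `pick_node` callback are hypothetical, and the extra "not bound yet" check is an addition of this sketch, since a pod that is already bound can still report phase Pending while its containers start.

```python
def pending_pods(pod_list: dict, scheduler_name: str) -> list:
    """Pods that are Pending, ask for our scheduler, and have no node yet."""
    return [
        pod
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Pending"
        and pod.get("spec", {}).get("schedulerName") == scheduler_name
        and not pod.get("spec", {}).get("nodeName")  # assumption: skip bound pods
    ]


def make_binding(pod: dict, node_name: str) -> dict:
    """Build a Binding object that assigns the pod to the given node."""
    return {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {
            "namespace": pod["metadata"]["namespace"],
            "name": pod["metadata"]["name"],
        },
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }


def schedule_once(pod_list: dict, node_list: dict, scheduler_name: str, pick_node) -> list:
    """Return the Binding objects for one scheduling pass.

    pick_node(pod, node_names) is a caller-supplied strategy that returns
    a node name or None when no node is suitable.
    """
    node_names = [n["metadata"]["name"] for n in node_list.get("items", [])]
    bindings = []
    for pod in pending_pods(pod_list, scheduler_name):
        target = pick_node(pod, node_names)
        if target is not None:
            bindings.append(make_binding(pod, target))
    return bindings
```

The caller would then submit each returned object to the API server; keeping that step outside makes the selection logic easy to test on its own.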

Custom Scheduler | Standard Filters
• Minimal set of filters
• kube-scheduler
• Extend
• Re-implement
GitHub kubernetes/kubernetes
plugin/pkg/scheduler/scheduler.go
plugin/pkg/scheduler/algorithm/predicates/predicates.go

Use Case | Distributed Pods
apiVersion: v1
kind: Pod
metadata:
  name: db-replica-3
  labels:
    component: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]
[Diagram: db-replica-1, db-replica-2, and db-replica-3 spread across Node 1, Node 2, and Node 3]
Use Case | Co-located Pods
apiVersion: v1
kind: Pod
metadata:
  name: app-replica-1
  labels:
    component: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]
[Diagram: Node 1, Node 2, and Node 3 with db-replica-1, db-replica-2, and app-replica-1; app-replica-1 is co-located with a db replica]
Use Case | Reliable Service on Spot Nodes
• “fixed” node group
  • Expensive, more reliable, fixed number
  • Tagged with label nodeGroup: fixed
• “spot” node group
  • Inexpensive, unreliable, auto-scaled
  • Tagged with label nodeGroup: spot
• Scheduling rules:
  • At least two pods on “fixed” nodes
  • All other pods favor “spot” nodes
• Custom scheduler
Scheduling | Dos and Don’ts
DO
• Use resource-based scheduling instead of node-based
• Specify resource requests
• Keep requests == limits
  • Especially for non-elastic resources
  • Memory is non-elastic!
• Safeguard against missing resource specs
  • Namespace default limits
  • Admission controllers
• Plan architecture of localized volumes (EBS, local)
• Use inter-pod affinity/anti-affinity if possible
DON’T
• ... assign pods to nodes directly
• ... use pods with no resource requests
• ... use node affinity or node assignment where resource requests can do the job instead

Scheduling | Key Takeaways
• Scheduling filters and priorities
• Resource requests and availability
• Inter-pod affinity/anti-affinity
• Volumes localization (AZ)
• Node labels and selectors
• Node affinity/anti-affinity
• Node taints and tolerations
• Scheduler(s) tweaking and customization

Oleg Chunikhin
Chief Technology Officer
oleg@kublr.com
kublr.com
Thank you!
