Container Scheduling
A Practical Guide
Mandy Waite · @tekgrrl · +MandyWaite
#kubecon #kubernetes
[Diagram: Borg architecture. borgcfg (fed by a config file and binary) and web browsers talk to the replicated BorgMaster, with its link shard, UI shard, scheduler, and Paxos-backed persistent store; the BorgMaster drives the Borglets running on the machines of a cell, with binaries held in cell storage.]
Developer View
job hello_world = {
  runtime = { cell = 'ic' }               // Cell (cluster) to run in
  binary = '.../hello_world_webserver'    // Program to run
  args = { port = '%port%' }              // Command line parameters
  requirements = {                        // Resource requirements
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                            // Number of tasks
}
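For comparison, a rough Kubernetes equivalent of this Borg job, as a sketch only (the controller name and image path are illustrative, and Borg's disk requirement has no direct counterpart in a pod spec):

apiVersion: v1
kind: ReplicationController
metadata:
  name: hello-world
spec:
  replicas: 5                        # replicas = 5
  selector:
    app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world-webserver
        image: gcr.io/example/hello-world-webserver   # illustrative image
        resources:
          requests:
            cpu: 100m                # roughly cpu = 0.1
            memory: 100Mi            # roughly ram = 100M
        ports:
        - containerPort: 80          # Borg allocates ports dynamically via %port%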
[Image: a Google datacenter full of running "Hello world!" tasks. Image by Connie Zhou.]
Developer View
Hello world!
“Internally, we don't use VMs - we just use containers to pack multiple tasks onto one machine, and stop them treading on one another.” - John Wilkes
Failures
[Chart: task-eviction rates and causes.]
Images by Connie Zhou
A 2000-machine service will have >10 task exits per day.
This is not a problem: it's normal.
Efficiency
Advanced bin-packing algorithms
[Chart: experimental placement of a production VM workload, July 2014, showing available resources and stranded resources on one machine.]
[Chart: used CPU (in cores) against used memory, showing available resources and stranded resources.]
Multiple applications per machine
[Chart: median tasks per machine, from the CPI^2 paper, EuroSys 2013.]
Cells run both prod and non-prod (batch) tasks
[Diagram: the Borg cell architecture again, with prod and batch tasks scheduled side by side on the Borglets.]
Sharing cells between prod and non-prod is better
[Chart: a shared cell (original and compacted) compared with the prod and non-prod loads compacted into separate cells; the difference represents the overhead of running prod and non-prod in their own cells.]
Resource reclamation
[Chart: a task's resource usage over time.]
limit: amount of resource requested
usage: actual resource consumption
reservation: estimate of future usage
The gap between the limit and the reservation is potentially reusable resource.
Resource reclamation could be more aggressive
[Charts: cluster reclamation data, Nov/Dec 2013.]
Kubernetes
[Diagram: Kubernetes architecture. kubectl (fed by a config file) and web browsers talk to the K8s Master, which runs the API Server, dashboard, scheduler, controllers, and etcd; each node runs a Kubelet, and container images are pulled from a container registry.]
Kubernetes without a Scheduler
[Diagram: the K8s Master (API Server, dashboard, controllers, etcd, scheduler) and four nodes, k8s-minion-xyz, k8s-minion-abc, k8s-minion-fig, and k8s-minion-cat, each running a Kubelet.]
apiVersion: v1
kind: Pod
metadata:
  name: bursty-static
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
Kubernetes without a Scheduler
[Diagram: the same cluster, with the pod poddy now running on k8s-minion-xyz.]
apiVersion: v1
kind: Pod
metadata:
  name: poddy
spec:
  nodeName: k8s-minion-xyz
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
With nodeName set in the spec, the pod bypasses the scheduler entirely: the Kubelet on k8s-minion-xyz simply runs it.
Resources
Kubernetes Resources
A Resource is something that can be requested, allocated, or consumed to/by a pod or a container.
CPU: specified in units of cores; what a core is depends on the provider.
Memory: specified in units of bytes.
CPU is compressible (i.e. it has a rate and can be throttled).
Memory is incompressible; it can't be throttled.
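To make the units concrete, a minimal container fragment (values are illustrative); fractions of a core are written in millicores, memory in byte quantities:

resources:
  requests:
    cpu: 500m      # half a core
    memory: 128Mi  # 128 mebibytes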
Kubernetes Resources (contd)
Future plans include more resources:
● Network Ops
● Network Bandwidth
● Storage
● IOPS
● Storage Time
and a Kubernetes Compute Unit (KCU).
Resource based Scheduling
my-controller.yaml:
...
spec:
  containers:
  - name: locust
    image: gcr.io/rabbit-skateboard/guestbook:gdg-rtv
    resources:
      requests:
        memory: "300Mi"
        cpu: "100m"
      limits:
        memory: "300Mi"
        cpu: "100m"
Resource based Scheduling (Work In Progress)
Provide QoS for scheduled Pods.
Per-container CPU and Memory requirements, specified as Request and Limit.
Future releases will [better] support:
● Best Effort (Request == 0)
● Burstable (Request < Limit)
● Guaranteed (Request == Limit)
Best Effort scheduling for low-priority workloads improves utilization at Google by 20%.
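As a sketch of the three request/limit patterns (container names, image, and values are illustrative):

containers:
- name: best-effort          # Best Effort: no requests or limits set
  image: nginx
- name: burstable            # Burstable: request < limit
  image: nginx
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 500m
      memory: 200Mi
- name: guaranteed           # Guaranteed: request == limit
  image: nginx
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 100m
      memory: 100Mi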
Scheduling Pods: Nodes
[Diagram: a K8s node running a Kubelet, with its resources, labels (disk = ssd), and attached disks.]
Nodes may not be homogeneous; they can differ in important ways:
● CPU and Memory Resources
● Attached Disks
● Specific Hardware
Location may also be important.
Pod Scheduling: Identifying Potential Nodes
What CPU and Memory resources does the pod need?
These can also be used as a measure of priority.
[Diagram: a K8s node (Kubelet, Proxy) with CPU and memory capacity.]
Pod Scheduling: Finding Potential Nodes
What resources does it need?
What disk(s) does it need (GCE PD and EBS), and can they be mounted without conflict?
Note: 1.1 limits to
[Diagram: a K8s node (Kubelet, Proxy) with CPU and memory capacity.]
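As an illustration of the disk constraint, a pod that mounts a GCE persistent disk (the pdName below is a placeholder for a pre-created disk); the scheduler must pick a node where this disk can be attached without conflicting with other pods that use it:

apiVersion: v1
kind: Pod
metadata:
  name: pd-pod
spec:
  containers:
  - name: web
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    gcePersistentDisk:
      pdName: my-data-disk   # placeholder: an existing GCE PD
      fsType: ext4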
Pod Scheduling: Identifying Potential Nodes
What resources does it need?
What disk(s) does it need?
What node(s) can it run on (Node Selector)?
[Diagram: a K8s node (Kubelet, Proxy) with CPU, memory, and the label disktype = ssd.]
kubectl label nodes node-3 disktype=ssd
(pod) spec:
  nodeSelector:
    disktype: ssd
nodeAffinity (Alpha in 1.2)
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "beta.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["n1-highmem-2", "n1-highmem-4"]
            }
          ]
        }
      ]
    }
  }
}
http://kubernetes.github.io/docs/user-guide/node-selection/
Implemented through annotations in 1.2, through fields in 1.3.
Can be ‘Required’ or ‘Preferred’ during scheduling.
In future it can be ‘Required’ during execution (node labels can change).
Will eventually replace nodeSelector.
If you specify both nodeSelector and nodeAffinity, both must be satisfied.
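For comparison, a sketch of the field-based form as it later landed (the label key and values are illustrative, and the exact layout was still settling at 1.2), here using a ‘preferred’ rather than ‘required’ rule:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: nginx
    image: nginx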
Pod Scheduling: Ranking Potential Nodes
● Prefer the node with the most free resource left after the pod is deployed
● Prefer nodes with the specified label
● Minimise the number of Pods from the same service on the same node
● Prefer nodes where CPU and Memory usage is balanced after the Pod is deployed [Default]
[Diagram: three candidate nodes, Node1, Node2, Node3.]
Extending the Scheduler
1. Add rules to the scheduler and recompile
2. Run your own scheduler process instead of, or as well as, the Kubernetes scheduler (see the Binding sketch below)
3. Implement a "scheduler extender" that the Kubernetes scheduler calls out to as a final pass when making scheduling decisions
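A custom scheduler (option 2) ultimately does what the default scheduler does: watch for pods with no nodeName and post a Binding for each one through the API server. A minimal sketch of such a Binding, reusing the pod and node names from the earlier slide:

apiVersion: v1
kind: Binding
metadata:
  name: poddy                # must match the pod being bound
target:
  apiVersion: v1
  kind: Node
  name: k8s-minion-xyz       # the node the custom scheduler chose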
Admission Control
Admission Control enforces certain conditions before a request is accepted by the API Server.
AC functionality is implemented as plugins, which are executed in the sequence they are specified.
AC is performed after AuthN checks.
Enforcement usually results in either:
● a Request denial
● mutation of the Request Resource
● mutation of related Resources
[Diagram: the K8s Master, with Admission Control in the API Server path ahead of the scheduler and controllers.]
Admission Control Examples
NamespaceLifecycle: enforces that a Namespace undergoing termination cannot have new objects created in it, and ensures that requests in a non-existent Namespace are rejected.
LimitRanger: observes the incoming request and ensures that it does not violate any of the constraints enumerated in the LimitRange object in a Namespace.
ServiceAccount: implements automation for serviceAccounts.
ResourceQuota: observes the incoming request and ensures that it does not violate any of the constraints enumerated in the ResourceQuota object in a Namespace.
Default plug-ins in 1.2: --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,PersistentVolumeLabel
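For instance, a LimitRange like the following (the name and values are illustrative) gives LimitRanger something to enforce in a Namespace: containers that exceed the max are rejected, and containers that omit requests or limits get the defaults:

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
spec:
  limits:
  - type: Container
    max:
      cpu: "1"
      memory: 1Gi
    defaultRequest:          # applied when a container sets no request
      cpu: 100m
      memory: 200Mi
    default:                 # applied when a container sets no limit
      cpu: 200m
      memory: 300Mi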
Resources
Mandy’s Canonical K8s deck: http://bit.ly/1oRMS0r (one, little-o, R, M, S, zero, little-r)
Setting Pod and CPU Limits
Runtime Constraints Example
Extending the Scheduler
Resource Model Design Doc (beyond 1.1)
Kubernetes is Open Source
We want your help!
http://kubernetes.io
https://github.com/kubernetes/kubernetes
Slack: #kubernetes-users
@kubernetesio
Images by Connie Zhou
cloud.google.com
