Container Scheduling
A Practical Guide
Mandy Waite · @tekgrrl · +MandyWaite
#kubecon #kubernetes
[Diagram: Borg architecture. borgcfg (fed by a config file and binary) and web browsers talk to the replicated BorgMaster, with its link shard, UI shard, scheduler, and Paxos-backed persistent store; the BorgMaster drives the Borglets running on the machines of a cell, with binaries held in cell storage.]
Developer View
job hello_world = {
  runtime = { cell = 'ic' }               // Cell (cluster) to run in
  binary = '.../hello_world_webserver'    // Program to run
  args = { port = '%port%' }              // Command line parameters
  requirements = {                        // Resource requirements
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                            // Number of tasks
}
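For comparison, a rough Kubernetes equivalent of this Borg job, as a sketch only (the controller name and image path are illustrative, and Borg's disk requirement has no direct counterpart in a pod spec):

apiVersion: v1
kind: ReplicationController
metadata:
  name: hello-world
spec:
  replicas: 5                        # replicas = 5
  selector:
    app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world-webserver
        image: gcr.io/example/hello-world-webserver   # illustrative image
        resources:
          requests:
            cpu: 100m                # roughly cpu = 0.1
            memory: 100Mi            # roughly ram = 100M
        ports:
        - containerPort: 80          # Borg allocates ports dynamically via %port%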
[Image: a Google datacenter full of running "Hello world!" tasks. Image by Connie Zhou.]
Developer View
Hello world!
“Internally, we don't use VMs - we just use containers to pack multiple tasks onto one machine, and stop them treading on one another.” - John Wilkes
Failures
[Chart: task-eviction rates and causes.]
Images by Connie Zhou
A 2000-machine service will have >10 task exits per day.
This is not a problem: it's normal.
Efficiency
Advanced bin-packing algorithms
[Chart: experimental placement of a production VM workload, July 2014, showing available resources and stranded resources on one machine.]
[Chart: used CPU (in cores) against used memory, showing available resources and stranded resources.]
Multiple applications per machine
[Chart: median tasks per machine, from the CPI^2 paper, EuroSys 2013.]
Cells run both prod and non-prod (batch) tasks
[Diagram: the Borg cell architecture again, with prod and batch tasks scheduled side by side on the Borglets.]
Sharing cells between prod and non-prod is better
[Chart: a shared cell (original and compacted) compared with the prod and non-prod loads compacted into separate cells; the difference represents the overhead of running prod and non-prod in their own cells.]
Resource reclamation
[Chart: a task's resource usage over time.]
limit: amount of resource requested
usage: actual resource consumption
reservation: estimate of future usage
The gap between the limit and the reservation is potentially reusable resource.
Resource reclamation could be more aggressive
[Charts: cluster reclamation data, Nov/Dec 2013.]
Kubernetes
[Diagram: Kubernetes architecture. kubectl (fed by a config file) and web browsers talk to the K8s Master, which runs the API Server, dashboard, scheduler, controllers, and etcd; each node runs a Kubelet, and container images are pulled from a container registry.]
Kubernetes without a Scheduler
[Diagram: the K8s Master (API Server, dashboard, controllers, etcd, scheduler) and four nodes, k8s-minion-xyz, k8s-minion-abc, k8s-minion-fig, and k8s-minion-cat, each running a Kubelet.]
apiVersion: v1
kind: Pod
metadata:
  name: bursty-static
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
Kubernetes without a Scheduler
[Diagram: the same cluster, with the pod poddy now running on k8s-minion-xyz.]
apiVersion: v1
kind: Pod
metadata:
  name: poddy
spec:
  nodeName: k8s-minion-xyz
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
With nodeName set in the spec, the pod bypasses the scheduler entirely: the Kubelet on k8s-minion-xyz simply runs it.
Resources
Kubernetes Resources
A Resource is something that can be requested, allocated, or consumed to/by a pod or a container.
CPU: specified in units of cores; what a core is depends on the provider.
Memory: specified in units of bytes.
CPU is compressible (i.e. it has a rate and can be throttled).
Memory is incompressible; it can't be throttled.
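To make the units concrete, a minimal container fragment (values are illustrative); fractions of a core are written in millicores, memory in byte quantities:

resources:
  requests:
    cpu: 500m      # half a core
    memory: 128Mi  # 128 mebibytes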
Kubernetes Resources (contd)
Future plans include more resources:
● Network Ops
● Network Bandwidth
● Storage
● IOPS
● Storage Time
and a Kubernetes Compute Unit (KCU).
Resource based Scheduling
my-controller.yaml:
...
spec:
  containers:
  - name: locust
    image: gcr.io/rabbit-skateboard/guestbook:gdg-rtv
    resources:
      requests:
        memory: "300Mi"
        cpu: "100m"
      limits:
        memory: "300Mi"
        cpu: "100m"
Resource based Scheduling (Work In Progress)
Provide QoS for scheduled Pods.
Per-container CPU and Memory requirements, specified as Request and Limit.
Future releases will [better] support:
● Best Effort (Request == 0)
● Burstable (Request < Limit)
● Guaranteed (Request == Limit)
Best Effort scheduling for low-priority workloads improves utilization at Google by 20%.
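As a sketch of the three request/limit patterns (container names, image, and values are illustrative):

containers:
- name: best-effort          # Best Effort: no requests or limits set
  image: nginx
- name: burstable            # Burstable: request < limit
  image: nginx
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 500m
      memory: 200Mi
- name: guaranteed           # Guaranteed: request == limit
  image: nginx
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 100m
      memory: 100Mi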
Scheduling Pods: Nodes
[Diagram: a K8s node running a Kubelet, with its resources, labels (disk = ssd), and attached disks.]
Nodes may not be homogeneous; they can differ in important ways:
● CPU and Memory Resources
● Attached Disks
● Specific Hardware
Location may also be important.
Pod Scheduling: Identifying Potential Nodes
What CPU and Memory resources does the pod need?
These can also be used as a measure of priority.
[Diagram: a K8s node (Kubelet, Proxy) with CPU and memory capacity.]
Pod Scheduling: Finding Potential Nodes
What resources does it need?
What disk(s) does it need (GCE PD and EBS), and can they be mounted without conflict?
Note: 1.1 limits to
[Diagram: a K8s node (Kubelet, Proxy) with CPU and memory capacity.]
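As an illustration of the disk constraint, a pod that mounts a GCE persistent disk (the pdName below is a placeholder for a pre-created disk); the scheduler must pick a node where this disk can be attached without conflicting with other pods that use it:

apiVersion: v1
kind: Pod
metadata:
  name: pd-pod
spec:
  containers:
  - name: web
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    gcePersistentDisk:
      pdName: my-data-disk   # placeholder: an existing GCE PD
      fsType: ext4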
Pod Scheduling: Identifying Potential Nodes
What resources does it need?
What disk(s) does it need?
What node(s) can it run on (Node Selector)?
[Diagram: a K8s node (Kubelet, Proxy) with CPU, memory, and the label disktype = ssd.]
kubectl label nodes node-3 disktype=ssd
(pod) spec:
  nodeSelector:
    disktype: ssd
nodeAffinity (Alpha in 1.2)
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "beta.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["n1-highmem-2", "n1-highmem-4"]
            }
          ]
        }
      ]
    }
  }
}
http://kubernetes.github.io/docs/user-guide/node-selection/
Implemented through annotations in 1.2, through fields in 1.3.
Can be ‘Required’ or ‘Preferred’ during scheduling.
In future it can be ‘Required’ during execution (node labels can change).
Will eventually replace nodeSelector.
If you specify both nodeSelector and nodeAffinity, both must be satisfied.
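For comparison, a sketch of the field-based form as it later landed (the label key and values are illustrative, and the exact layout was still settling at 1.2), here using a ‘preferred’ rather than ‘required’ rule:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: nginx
    image: nginx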
Pod Scheduling: Ranking Potential Nodes
● Prefer the node with the most free resource left after the pod is deployed
● Prefer nodes with the specified label
● Minimise the number of Pods from the same service on the same node
● Prefer nodes where CPU and Memory usage is balanced after the Pod is deployed [Default]
[Diagram: three candidate nodes, Node1, Node2, Node3.]
Extending the Scheduler
1. Add rules to the scheduler and recompile
2. Run your own scheduler process instead of, or as well as, the Kubernetes scheduler (see the Binding sketch below)
3. Implement a "scheduler extender" that the Kubernetes scheduler calls out to as a final pass when making scheduling decisions
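A custom scheduler (option 2) ultimately does what the default scheduler does: watch for pods with no nodeName and post a Binding for each one through the API server. A minimal sketch of such a Binding, reusing the pod and node names from the earlier slide:

apiVersion: v1
kind: Binding
metadata:
  name: poddy                # must match the pod being bound
target:
  apiVersion: v1
  kind: Node
  name: k8s-minion-xyz       # the node the custom scheduler chose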
Admission Control
Admission Control enforces certain conditions before a request is accepted by the API Server.
AC functionality is implemented as plugins, which are executed in the sequence they are specified.
AC is performed after AuthN checks.
Enforcement usually results in either:
● a Request denial
● mutation of the Request Resource
● mutation of related Resources
[Diagram: the K8s Master, with Admission Control in the API Server path ahead of the scheduler and controllers.]
Admission Control Examples
NamespaceLifecycle: enforces that a Namespace undergoing termination cannot have new objects created in it, and ensures that requests in a non-existent Namespace are rejected.
LimitRanger: observes the incoming request and ensures that it does not violate any of the constraints enumerated in the LimitRange object in a Namespace.
ServiceAccount: implements automation for serviceAccounts.
ResourceQuota: observes the incoming request and ensures that it does not violate any of the constraints enumerated in the ResourceQuota object in a Namespace.
Default plug-ins in 1.2: --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,PersistentVolumeLabel
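For instance, a LimitRange like the following (the name and values are illustrative) gives LimitRanger something to enforce in a Namespace: containers that exceed the max are rejected, and containers that omit requests or limits get the defaults:

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
spec:
  limits:
  - type: Container
    max:
      cpu: "1"
      memory: 1Gi
    defaultRequest:          # applied when a container sets no request
      cpu: 100m
      memory: 200Mi
    default:                 # applied when a container sets no limit
      cpu: 200m
      memory: 300Mi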
Resources
Mandy’s Canonical K8s deck: http://bit.ly/1oRMS0r (one, little-o, R, M, S, zero, little-r)
Setting Pod and CPU Limits
Runtime Constraints Example
Extending the Scheduler
Resource Model Design Doc (beyond 1.1)
Kubernetes is Open Source
We want your help!
http://kubernetes.io
https://github.com/kubernetes/kubernetes
Slack: #kubernetes-users
@kubernetesio
Images by Connie Zhou
cloud.google.com
