Containers are at the forefront of a new wave of technology innovation, but the methods for scheduling and managing them are still new to most developers. In this talk we'll look at the kinds of problems that container scheduling solves, and at how maximising efficiency and maximising QoS don't have to be exclusive goals. We'll take a behind-the-scenes look at the Kubernetes scheduler: How does it prioritize? What about node selection and external dependencies? How do you schedule based on your own specific needs? How does it scale, and what's in it both for developers already using containers and for those who aren't? We'll use a combination of slides, code, and demos to answer all these questions and hopefully all of yours.
Sched Link: http://sched.co/6BZa
3. @tekgrrl #kubecon #kubernetes
(Borg architecture diagram: users submit jobs via borgcfg config files or web-browser UIs to the BorgMaster — a replicated service of link shards and UI shards backed by a Paxos persistent store; a scheduler assigns tasks to Borglet agents on the machines of a cell, with binaries and data in cell storage.)
Developer View
job hello_world = {
  runtime = { cell = 'ic' }             // Cell (cluster) to run in
  binary = '.../hello_world_webserver'  // Program to run
  args = { port = '%port%' }            // Command line parameters
  requirements = {                      // Resource requirements
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                          // Number of tasks
}
Developer View
Hello world!
“Internally, we don't use VMs - we just use containers to
pack multiple tasks onto one machine, and stop them
treading on one another.” - John Wilkes
Efficiency: advanced bin-packing algorithms.
(Figure: experimental placement of a production VM workload, July 2014 — available resources vs. stranded resources on one machine.)
Efficiency: cells run both prod and non-prod (batch) tasks.
Efficiency: sharing cells between prod and non-prod is better.
(Figure: prod and non-prod load compacted into a shared cell vs. the original shared cell; the difference represents the overhead of running prod and non-prod in their own cells.)
Efficiency: resource reclamation
● limit: amount of resource requested
● usage: actual resource consumption
● reservation: estimate of future usage
(Chart over time: the gap between the reservation and the limit is potentially reusable resource.)
Kubernetes Resources
A Resource is something that can be
requested, allocated, or consumed to/by
a pod or a container
CPU: specified in units of cores;
what a core is depends on the provider
Memory: specified in units of bytes
CPU is compressible (i.e. it has a rate
and can be throttled)
Memory is incompressible; it can't be
throttled
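As an illustration, these quantities appear under a container's `resources` field in a pod spec; a minimal sketch (pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world            # hypothetical pod name
spec:
  containers:
  - name: web
    image: gcr.io/example/hello-world-webserver   # hypothetical image
    resources:
      requests:
        cpu: "0.1"       # 0.1 cores (can also be written 100m)
        memory: 100Mi    # bytes, here in Mi (2^20) units
```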
Kubernetes Resources (contd)
Future plans — more resources:
● Network ops
● Network bandwidth
● Storage
● IOPS
● Storage time
Kubernetes Compute Unit (KCU)
Resource-based Scheduling (Work In Progress)
Provide QoS for scheduled Pods
Per-container CPU and memory requirements,
specified as Request and Limit
Future releases will [better] support:
● Best Effort (Request == 0)
● Burstable (Request < Limit)
● Guaranteed (Request == Limit)
Best-effort scheduling for low-priority workloads improves
utilization at Google by 20%
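The three classes follow directly from how requests and limits compare; for example, a container whose request is below its limit falls into the Burstable class. A sketch (names and values are hypothetical):

```yaml
spec:
  containers:
  - name: burstable-example      # hypothetical container name
    image: gcr.io/example/app    # hypothetical image
    resources:
      requests:
        cpu: 100m      # request < limit → Burstable
        memory: 100Mi
      limits:
        cpu: 200m      # request == limit on both resources would be Guaranteed;
        memory: 200Mi  # no requests at all would be Best Effort
```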
Scheduling Pods: Nodes
(Diagram: K8s node — Kubelet, resources, labels, disks; e.g. disk = ssd)
Nodes may not be homogeneous; they
can differ in important ways:
● CPU and memory resources
● Attached disks
● Specific hardware
Location may also be important
Pod Scheduling: Identifying Potential Nodes
What CPU and memory resources
does it need?
Can also be used as a measure of
priority
(Diagram: K8s node — Kubelet, proxy, CPU and memory capacity.)
Pod Scheduling: Finding Potential Nodes
What resources does it need?
What disk(s) does it need (GCE PD and
EBS), and can it/they be mounted
without conflict?
Note: 1.1 limits to
(Diagram: K8s node — Kubelet, proxy, CPU and memory capacity.)
Pod Scheduling: Identifying Potential Nodes
What resources does it need?
What disk(s) does it need?
What node(s) can it run on (node
selector)?
(Diagram: K8s node — Kubelet, proxy, CPU and memory; node labeled disktype = ssd.)

kubectl label nodes node-3 disktype=ssd

(pod) spec:
  nodeSelector:
    disktype: ssd
nodeAffinity (Alpha in 1.2)
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "beta.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["n1-highmem-2", "n1-highmem-4"]
            }
          ]
        }
      ]
    }
  }
}
http://kubernetes.github.io/docs/user-guide/node-selection/
Implemented through Annotations in 1.2,
through fields in 1.3
Can be ‘Required’ or ‘Preferred’ during
scheduling
In future it can be 'Required' during
execution (node labels can change)
Will eventually replace NodeSelector
If you specify both nodeSelector and
nodeAffinity, both must be satisfied
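For the 'Preferred' form, the affinity carries a `preferredDuringSchedulingIgnoredDuringExecution` clause of weighted terms instead; a hedged sketch (the weight, key, and values here are illustrative, not from the talk):

```json
{
  "nodeAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [
      {
        "weight": 10,
        "preference": {
          "matchExpressions": [
            { "key": "disktype", "operator": "In", "values": ["ssd"] }
          ]
        }
      }
    ]
  }
}
```

Unlike the required form, a node that fails a preferred term is still schedulable; it just scores lower during ranking.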
Pod Scheduling: Ranking Potential Nodes
● Prefer the node with the most free resource
left after the pod is deployed
● Prefer nodes with the specified label
● Minimise the number of pods from the
same service on the same node
● CPU and memory are balanced after the
pod is deployed [default]
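The "most free resource" heuristic can be sketched as a least-requested style score: the fraction of each node's capacity left unrequested after placing the pod, scaled to 0–10 and averaged across resources. This is a simplified sketch, not the exact Kubernetes implementation:

```python
def least_requested_score(capacity, requested):
    """Score a node 0-10 by the fraction of capacity still free;
    higher means more free resource left after placement.
    `capacity` and `requested` are dicts like {"cpu": ..., "mem": ...}."""
    scores = []
    for res in ("cpu", "mem"):
        cap = capacity[res]
        if cap == 0:
            scores.append(0)
            continue
        free = max(cap - requested[res], 0)
        scores.append(10 * free / cap)
    # average the per-resource scores into one node score
    return sum(scores) / len(scores)

# Rank candidate nodes: the least-loaded node wins
nodes = {
    "node-1": {"capacity": {"cpu": 4.0, "mem": 16.0},
               "requested": {"cpu": 3.0, "mem": 12.0}},
    "node-2": {"capacity": {"cpu": 4.0, "mem": 16.0},
               "requested": {"cpu": 1.0, "mem": 4.0}},
}
best = max(nodes, key=lambda n: least_requested_score(**nodes[n]))
```

Here `node-2`, with three quarters of its CPU and memory still free, outscores the heavily loaded `node-1`.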
Extending the Scheduler
1. Add rules to the scheduler and
recompile
2. Run your own scheduler process
instead of, or as well as, the
Kubernetes scheduler
3. Implement a "scheduler extender"
that the Kubernetes scheduler calls
out to as a final pass when making
scheduling decisions
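Option 3 works by the scheduler calling out to your extender with the pod and the candidate nodes, and the extender returning the subset it approves. The filtering logic itself can be sketched as below; the `needs-gpu` / `example.com/gpu` label convention is hypothetical, and a real extender would wrap this in an HTTP handler speaking the extender's JSON types:

```python
def extender_filter(pod, nodes):
    """Keep only nodes that satisfy a custom rule the built-in
    predicates don't know about -- here, a hypothetical GPU label
    that the pod may demand."""
    needs_gpu = pod.get("labels", {}).get("needs-gpu") == "true"
    if not needs_gpu:
        return nodes  # no extra constraint; pass all candidates through
    return [n for n in nodes
            if n.get("labels", {}).get("example.com/gpu") == "true"]

pod = {"name": "trainer", "labels": {"needs-gpu": "true"}}
nodes = [
    {"name": "node-1", "labels": {"example.com/gpu": "true"}},
    {"name": "node-2", "labels": {}},
]
filtered = extender_filter(pod, nodes)
```

The Kubernetes scheduler then continues its normal ranking over whatever nodes the extender lets through.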
Admission Control
Admission Control enforces certain conditions before a
request is accepted by the API Server
AC functionality is implemented as plugins, which are
executed in the sequence they are specified
AC is performed after AuthN checks
Enforcement usually results in either
● a request denial
● mutation of the request resource
● mutation of related resources
(Diagram: K8s master — API Server with Admission Control, scheduler, controllers.)
Admission Control Examples
NamespaceLifecycle
Enforces that a Namespace that is undergoing termination cannot have new objects created in it, and ensures that
requests in a non-existent Namespace are rejected
LimitRanger
Observes the incoming request and ensures that it does not violate any of the constraints enumerated in the
LimitRange object in a Namespace
ServiceAccount
Implements automation for serviceAccounts
ResourceQuota
Observes the incoming request and ensures that it does not violate any of the constraints enumerated in the
ResourceQuota object in a Namespace
Default plug-ins in 1.2: --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,PersistentVolumeLabel
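As an illustration of what LimitRanger enforces, a LimitRange object might cap per-container resources in a namespace like this (a sketch; the name and values are hypothetical):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: example-limits      # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    default:                # applied when a container sets no limit
      cpu: 200m
      memory: 256Mi
    max:                    # pods requesting above this are rejected
      cpu: "1"
      memory: 1Gi
```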
Resources
Mandy's Canonical K8s deck: http://bit.ly/1oRMS0r
(that's one lower-case 'o', R M S, zero, lower-case 'r')
Setting Pod and CPU Limits
Runtime Constraints Example
Extending the Scheduler
Resource Model Design Doc (beyond 1.1)
Kubernetes is Open Source
We want your help!
http://kubernetes.io
https://github.com/kubernetes/kubernetes
Slack: #kubernetes-users
@kubernetesio