Container Orchestration
from Theory to Practice
Laura Frank & Stephen Day
Open Source Summit, Los Angeles
September 2017
v0
Stephen Day
Docker, Inc.
stephen@docker.com
@stevvooe
Laura Frank
Codeship
laura@codeship.com
@rhein_wein
Agenda
- Understanding the SwarmKit object model
- Node topology
- Distributed consensus and Raft
SwarmKit
An open source framework for building orchestration systems
https://github.com/docker/swarmkit
SwarmKit Object Model
Orchestration
A control system for your cluster
[Control loop diagram: D → Orchestrator (O) → Δ → Cluster (C) → St, with St fed back and compared against D]

D  = Desired State
O  = Orchestrator
C  = Cluster
St = State at time t
Δ  = Operations to converge St toward D
https://en.wikipedia.org/wiki/Control_theory
Convergence
A functional view
D  = Desired State
O  = Orchestrator
C  = Cluster
St = State at time t

f(D, S_{n-1}, C) → S_n  |  min(S_n − D)
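Read as a control loop, this function can be sketched in Go; the types and helper names below (Spec, State, observe, apply) are illustrative, not SwarmKit's:

package orchestrator

import "time"

// Illustrative types: desired and observed state, keyed by service name.
type Spec map[string]int  // service -> desired replica count
type State map[string]int // service -> observed replica count

// Converge repeatedly observes the cluster and applies operations (Δ) that
// move the observed state S toward the desired state D.
func Converge(desired Spec, observe func() State, apply func(service string, delta int)) {
	for {
		current := observe()
		for svc, want := range desired {
			if have := current[svc]; have != want {
				apply(svc, want-have) // Δ: scale up (positive) or down (negative)
			}
		}
		time.Sleep(time.Second) // real systems react to events rather than polling
	}
}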
Observability and Controllability
The Problem
[Spectrum from low to high observability: Failure · Process State · User Input]
Data Model Requirements
- Represent differences in cluster state
- Maximize observability
- Support convergence
- Do this while being extensible and reliable
Show me your data structures and I'll show you your orchestration system
Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the runtime details to the container process
- Implement these services by distributing processes across a cluster
[Diagram: service processes distributed across Node 1, Node 2, Node 3]
Declarative
$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz
$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
Reconciliation
Spec → Object

Spec   = desired state
Object = current state
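A rough sketch of what the Spec/Object split looks like as data structures; the field and type names here are illustrative (SwarmKit's real objects are protobuf messages):

package store

// ServiceSpec is the user-declared desired state.
type ServiceSpec struct {
	Image    string
	Replicas uint64
}

// Service is the stored object: Spec records what the user asked for, while
// the remaining fields record what the system has observed and done.
type Service struct {
	ID      string
	Spec    ServiceSpec // desired state, written through the API
	Version uint64      // Raft index at last update (see the Sequencer section)
	// ... fields owned by the orchestrator that reflect current state ...
}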
Demo Time!
Task Model
Prepare: set up resources
Start: start the task
Wait: wait until the task exits
Shutdown: stop the task cleanly

[Diagram: the task lifecycle spans the Runtime and the Orchestrator]
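The per-task lifecycle maps naturally onto an interface. This is a trimmed, illustrative sketch covering only the four steps above, not SwarmKit's full controller method set:

package agent

import "context"

// TaskController is a trimmed, illustrative view of a per-task lifecycle
// controller, limited to the four steps listed above.
type TaskController interface {
	Prepare(ctx context.Context) error  // set up resources (networks, mounts, images)
	Start(ctx context.Context) error    // start the task
	Wait(ctx context.Context) error     // block until the task exits
	Shutdown(ctx context.Context) error // stop the task cleanly
}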
Task Model
Atomic Scheduling Unit of SwarmKit
[Diagram: within the Manager, the object's Spec (desired state) is expanded into Task0 … Taskn and handed to the Scheduler; the Object tracks current state]
Data Flow

ServiceSpec (embeds a TaskSpec)
  → Service (stores the ServiceSpec)
  → Task (carries a copy of the TaskSpec)
  → Worker (executes the Task)
Consistency
Field Ownership
Only one component of the system can write to a field
Consistency
Task State

[State diagram: the Manager moves a task through New → Allocated → Assigned; from Assigned onward the Worker drives it through Preparing → Ready → Starting → Running]
Terminal states: Complete, Shutdown, Failed, Rejected
Field Handoff
Task Status

State         Owner
< Assigned    Manager
>= Assigned   Worker
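The handoff reduces to an ordering rule on the state enum. A minimal sketch, with illustrative names and values ordered as in the diagram:

package state

// TaskState values are ordered so that ownership can be decided by comparison.
type TaskState int

const (
	StateNew TaskState = iota
	StateAllocated
	StateAssigned
	StatePreparing
	StateReady
	StateStarting
	StateRunning
	// terminal states (Complete, Shutdown, Failed, Rejected) follow
)

// ManagerOwnsStatus reports whether the manager may still write the task's
// status; from Assigned onward, the worker owns it.
func ManagerOwnsStatus(s TaskState) bool {
	return s < StateAssigned
}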
Recap: orchestration is a control loop. The Orchestrator continually applies Δ to converge cluster state St toward the desired state D, and the Task is its atomic scheduling unit.
SwarmKit Doesn’t Quit!
Node Topology in SwarmKit
We’ve got a bunch of nodes… now what?
Push Model vs Pull Model

Push: Workers (1) register with a discovery system (e.g. ZooKeeper); the Manager (2) discovers them there and (3) pushes the payload to each Worker.
Pull: the Worker connects directly to the Manager; registration and payload travel over the same connection.
Push Model vs Pull Model

Push Model
- Pros: provides better control over the communication rate; Managers decide when to contact Workers
- Cons: requires a discovery mechanism; more failure scenarios; harder to troubleshoot

Pull Model
- Pros: simpler to operate; Workers connect to Managers and don't need to bind a port; can easily traverse networks; easier to secure; fewer moving parts
- Cons: Workers must maintain a connection to Managers at all times
Push vs Pull
• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode
Rate Control in a Pull Model
Rate Control: Heartbeats
• The Manager dictates the heartbeat rate to Workers
• The rate is configurable (though not by the end user)
• Managers agree on the same rate by consensus via Raft
• Managers add jitter so pings are spread over time (avoiding bursts)

[Diagram: Worker: "Ping?"  Manager: "Pong! Ping me back in 5.2 seconds"]
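A minimal sketch of how a manager could compute the jittered deadline it hands back with each pong; the period and jitter fraction are illustrative, not SwarmKit's actual values:

package dispatcher

import (
	"math/rand"
	"time"
)

// NextHeartbeat returns how long a worker should wait before its next ping:
// the cluster-wide period stretched by a random jitter so that pings from
// many workers spread out instead of arriving in bursts.
func NextHeartbeat(period time.Duration, jitterFraction float64) time.Duration {
	jitter := time.Duration(rand.Float64() * jitterFraction * float64(period))
	return period + jitter
}

For example, NextHeartbeat(5*time.Second, 0.25) returns something between 5s and 6.25s, like the 5.2 seconds in the diagram.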
Rate Control: Workloads
• The Worker opens a gRPC stream to receive workloads
• The Manager can send data whenever it wants to
• The Manager sends data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
• Adds a little delay (at most 100 ms) but drastically reduces the amount of communication

[Diagram: Worker: "Give me work to do"; the Manager streams batches over time, e.g.
 100ms [Batch of 12], 200ms [Batch of 26], 300ms [Batch of 32], 340ms [Batch of 100],
 360ms [Batch of 100], 460ms [Batch of 42], 560ms [Batch of 23]]
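The buffering rule (flush at 100 changes or after 100 ms, whichever comes first) can be sketched as a small Go loop; this is illustrative, not the dispatcher's actual code:

package dispatcher

import "time"

// Change stands in for a single state change destined for a worker.
type Change struct{ /* ... */ }

// Batch buffers changes from in and flushes them downstream once 100 have
// accumulated or 100 ms have passed since the first buffered change,
// whichever comes first.
func Batch(in <-chan Change, flush func([]Change)) {
	const (
		maxBatch = 100
		maxWait  = 100 * time.Millisecond
	)
	var (
		buf   []Change
		timer <-chan time.Time // nil (blocks forever) until something is buffered
	)
	for {
		select {
		case c, ok := <-in:
			if !ok {
				if len(buf) > 0 {
					flush(buf) // flush the remainder when the stream ends
				}
				return
			}
			buf = append(buf, c)
			if len(buf) == 1 {
				timer = time.After(maxWait) // start the clock on the first change
			}
			if len(buf) >= maxBatch {
				flush(buf)
				buf, timer = nil, nil
			}
		case <-timer:
			flush(buf)
			buf, timer = nil, nil
		}
	}
}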
Replication
Running multiple managers for high availability
Replication

[Diagram: three Managers (one Leader, two Followers) and a Worker]

• A Worker can connect to any Manager
• Followers forward traffic to the Leader
Replication

[Diagram: three Managers (one Leader, two Followers) and several Workers]

• Followers multiplex all of their workers to the Leader over a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces the Leader's networking load by spreading connections evenly

Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections, and each follower forwards its 2,000 workers to the leader over a single socket.
Replication

[Diagram: the Leader fails and one of the Followers is elected the new Leader]

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new Leader
• Transparent to workers
Replication

[Diagram: the Worker holds the list of manager addresses: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]

• Managers send the list of all managers' addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers reconnect to a different manager, chosen at random from that list
Presence
Scalable presence in a distributed environment
Presence
• The Leader commits Worker state (Up vs Down) into Raft
  − Propagates to all managers
  − Recoverable in case of leader re-election
• Heartbeat TTLs are kept in Leader memory
  − Too expensive to store "last ping time" in Raft: every ping would result in a quorum write
  − The Leader keeps worker/TTL pairs in a heap (time.AfterFunc); see the sketch below
  − Upon leader failover, workers are given a grace period to reconnect
    • Workers are considered Unknown until they reconnect
    • If they do, they move back to Up; if they don't, they move to Down
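A minimal sketch of TTL tracking in leader memory built on time.AfterFunc; SwarmKit keeps the worker/TTL pairs in a heap, but the effect is the same, and all names here are illustrative:

package presence

import (
	"sync"
	"time"
)

// Heartbeats keeps one timer per worker in the leader's memory; pings only
// reset timers, so nothing is written to Raft until a worker's state
// actually changes.
type Heartbeats struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // worker ID -> TTL expiration
	onDown func(workerID string)  // e.g. commit the Down state to Raft
}

func NewHeartbeats(onDown func(string)) *Heartbeats {
	return &Heartbeats{timers: make(map[string]*time.Timer), onDown: onDown}
}

// Ping resets the worker's TTL; if no further ping arrives within ttl,
// onDown fires for that worker.
func (h *Heartbeats) Ping(workerID string, ttl time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if t, ok := h.timers[workerID]; ok {
		t.Reset(ttl)
		return
	}
	h.timers[workerID] = time.AfterFunc(ttl, func() { h.onDown(workerID) })
}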
Distributed Consensus
The Raft Consensus Algorithm

Orchestration systems typically use some kind of service to maintain state in a distributed system:
- etcd
- ZooKeeper
- …

Many of these services are backed by the Raft consensus algorithm.
SwarmKit and Raft

Docker chose to implement the algorithm directly:
- Fast
- No need to set up a separate service to get started with orchestration
- A differentiator between SwarmKit/Docker and other orchestration systems
demo.consensus.group
secretlivesofdata.com
Consistency
Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version and are rejected if it is out of date
● Similar to compare-and-swap (CAS)
● Also exposed through API calls that change objects in the store
Versioned Updates
Consistency
service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
update(spec, service.Version) // rejected if Version is no longer current
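Because an update with a stale base Version is rejected, callers re-read and retry, in the spirit of compare-and-swap. A sketch of that loop, reusing the slide's illustrative helpers (getCurrentService and update are not real SwarmKit API names):

// Retry the versioned update until it is accepted; each attempt re-reads the
// object so the base Version reflects the latest Raft index.
for {
	service := getCurrentService()
	spec := service.Spec
	spec.Image = "my.serv/myimage:mytag"
	if err := update(spec, service.Version); err == nil {
		break // accepted: the store bumps Version to the new Raft index
	}
	// rejected: the object changed since we read it; re-read and try again
}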
Sequencer

Original object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.3.0, ...
    Version = 189   ← Raft index when the object was last updated

Update request:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 189   (base version supplied by the caller; matches the stored object)
Sequencer

The update is accepted and the store bumps the Version to the new Raft index.

Updated object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 190
Sequencer

Updated object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 190

Update request (stale):
  Service ABC
    Spec: Replicas = 5, Image = registry:2.3.0, ...
    Version = 189

The base Version (189) no longer matches the stored Version (190), so the update is rejected; the caller must re-read the object and retry.
github.com/docker/swarmkit
github.com/coreos/etcd (Raft)
THANK YOU
