Container Orchestration
from Theory to Practice
Laura Frank & Stephen Day
Open Source Summit, Los Angeles
September 2017
v0
Stephen Day
Docker, Inc.
stephen@docker.com
@stevvooe
Laura Frank
Codeship
laura@codeship.com
@rhein_wein
Agenda
- Understanding the SwarmKit object model
- Node topology
- Distributed consensus and Raft
SwarmKit
An open source framework for building orchestration systems
https://github.com/docker/swarmkit
SwarmKit Object Model
Orchestration
A control system for your cluster
[Control loop diagram: D → Orchestrator (O) → Δ → Cluster (C) → St, with St fed back and compared against D]

D  = Desired State
O  = Orchestrator
C  = Cluster
St = State at time t
Δ  = Operations to converge St toward D
https://en.wikipedia.org/wiki/Control_theory
Convergence
A functional view
D  = Desired State
O  = Orchestrator
C  = Cluster
St = State at time t

f(D, S_{n-1}, C) → S_n  |  min(S_n − D)
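Read as a control loop, this function can be sketched in Go; the types and helper names below (Spec, State, observe, apply) are illustrative, not SwarmKit's:

package orchestrator

import "time"

// Illustrative types: desired and observed state, keyed by service name.
type Spec map[string]int  // service -> desired replica count
type State map[string]int // service -> observed replica count

// Converge repeatedly observes the cluster and applies operations (Δ) that
// move the observed state S toward the desired state D.
func Converge(desired Spec, observe func() State, apply func(service string, delta int)) {
	for {
		current := observe()
		for svc, want := range desired {
			if have := current[svc]; have != want {
				apply(svc, want-have) // Δ: scale up (positive) or down (negative)
			}
		}
		time.Sleep(time.Second) // real systems react to events rather than polling
	}
}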
Observability and Controllability
The Problem
[Spectrum from low to high observability: Failure · Process State · User Input]
Data Model Requirements
- Represent differences in cluster state
- Maximize observability
- Support convergence
- Do this while being extensible and reliable
Show me your data structures and I'll show you your orchestration system
Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the runtime details to the container process
- Implement these services by distributing processes across a cluster
[Diagram: service processes distributed across Node 1, Node 2, Node 3]
Declarative
$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz
$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
Reconciliation
Spec → Object

Spec   = desired state
Object = current state
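A rough sketch of what the Spec/Object split looks like as data structures; the field and type names here are illustrative (SwarmKit's real objects are protobuf messages):

package store

// ServiceSpec is the user-declared desired state.
type ServiceSpec struct {
	Image    string
	Replicas uint64
}

// Service is the stored object: Spec records what the user asked for, while
// the remaining fields record what the system has observed and done.
type Service struct {
	ID      string
	Spec    ServiceSpec // desired state, written through the API
	Version uint64      // Raft index at last update (see the Sequencer section)
	// ... fields owned by the orchestrator that reflect current state ...
}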
Demo Time!
Task Model
Prepare: set up resources
Start: start the task
Wait: wait until the task exits
Shutdown: stop the task cleanly

[Diagram: the task lifecycle spans the Runtime and the Orchestrator]
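The per-task lifecycle maps naturally onto an interface. This is a trimmed, illustrative sketch covering only the four steps above, not SwarmKit's full controller method set:

package agent

import "context"

// TaskController is a trimmed, illustrative view of a per-task lifecycle
// controller, limited to the four steps listed above.
type TaskController interface {
	Prepare(ctx context.Context) error  // set up resources (networks, mounts, images)
	Start(ctx context.Context) error    // start the task
	Wait(ctx context.Context) error     // block until the task exits
	Shutdown(ctx context.Context) error // stop the task cleanly
}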
Task Model
Atomic Scheduling Unit of SwarmKit
[Diagram: within the Manager, the object's Spec (desired state) is expanded into Task0 … Taskn and handed to the Scheduler; the Object tracks current state]
Data Flow

ServiceSpec (embeds a TaskSpec)
  → Service (stores the ServiceSpec)
  → Task (carries a copy of the TaskSpec)
  → Worker (executes the Task)
Consistency
Field Ownership
Only one component of the system can write to a field
Consistency
Task State

[State diagram: the Manager moves a task through New → Allocated → Assigned; from Assigned onward the Worker drives it through Preparing → Ready → Starting → Running]
Terminal states: Complete, Shutdown, Failed, Rejected
Field Handoff
Task Status

State         Owner
< Assigned    Manager
>= Assigned   Worker
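The handoff reduces to an ordering rule on the state enum. A minimal sketch, with illustrative names and values ordered as in the diagram:

package state

// TaskState values are ordered so that ownership can be decided by comparison.
type TaskState int

const (
	StateNew TaskState = iota
	StateAllocated
	StateAssigned
	StatePreparing
	StateReady
	StateStarting
	StateRunning
	// terminal states (Complete, Shutdown, Failed, Rejected) follow
)

// ManagerOwnsStatus reports whether the manager may still write the task's
// status; from Assigned onward, the worker owns it.
func ManagerOwnsStatus(s TaskState) bool {
	return s < StateAssigned
}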
Recap: orchestration is a control loop. The Orchestrator continually applies Δ to converge cluster state St toward the desired state D, and the Task is its atomic scheduling unit.
SwarmKit Doesn’t Quit!
Node Topology in SwarmKit
We’ve got a bunch of nodes… now what?
Push Model vs Pull Model

Push: Workers (1) register with a discovery system (e.g. ZooKeeper); the Manager (2) discovers them there and (3) pushes the payload to each Worker.
Pull: the Worker connects directly to the Manager; registration and payload travel over the same connection.
Push Model vs Pull Model

Push Model
- Pros: provides better control over the communication rate; Managers decide when to contact Workers
- Cons: requires a discovery mechanism; more failure scenarios; harder to troubleshoot

Pull Model
- Pros: simpler to operate; Workers connect to Managers and don't need to bind a port; can easily traverse networks; easier to secure; fewer moving parts
- Cons: Workers must maintain a connection to Managers at all times
Push vs Pull
• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode
Rate Control in a Pull Model
Rate Control: Heartbeats
• The Manager dictates the heartbeat rate to Workers
• The rate is configurable (though not by the end user)
• Managers agree on the same rate by consensus via Raft
• Managers add jitter so pings are spread over time (avoiding bursts)

[Diagram: Worker: "Ping?"  Manager: "Pong! Ping me back in 5.2 seconds"]
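A minimal sketch of how a manager could compute the jittered deadline it hands back with each pong; the period and jitter fraction are illustrative, not SwarmKit's actual values:

package dispatcher

import (
	"math/rand"
	"time"
)

// NextHeartbeat returns how long a worker should wait before its next ping:
// the cluster-wide period stretched by a random jitter so that pings from
// many workers spread out instead of arriving in bursts.
func NextHeartbeat(period time.Duration, jitterFraction float64) time.Duration {
	jitter := time.Duration(rand.Float64() * jitterFraction * float64(period))
	return period + jitter
}

For example, NextHeartbeat(5*time.Second, 0.25) returns something between 5s and 6.25s, like the 5.2 seconds in the diagram.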
Rate Control: Workloads
• The Worker opens a gRPC stream to receive workloads
• The Manager can send data whenever it wants to
• The Manager sends data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
• Adds a little delay (at most 100 ms) but drastically reduces the amount of communication

[Diagram: Worker: "Give me work to do"; the Manager streams batches over time, e.g.
 100ms [Batch of 12], 200ms [Batch of 26], 300ms [Batch of 32], 340ms [Batch of 100],
 360ms [Batch of 100], 460ms [Batch of 42], 560ms [Batch of 23]]
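The buffering rule (flush at 100 changes or after 100 ms, whichever comes first) can be sketched as a small Go loop; this is illustrative, not the dispatcher's actual code:

package dispatcher

import "time"

// Change stands in for a single state change destined for a worker.
type Change struct{ /* ... */ }

// Batch buffers changes from in and flushes them downstream once 100 have
// accumulated or 100 ms have passed since the first buffered change,
// whichever comes first.
func Batch(in <-chan Change, flush func([]Change)) {
	const (
		maxBatch = 100
		maxWait  = 100 * time.Millisecond
	)
	var (
		buf   []Change
		timer <-chan time.Time // nil (blocks forever) until something is buffered
	)
	for {
		select {
		case c, ok := <-in:
			if !ok {
				if len(buf) > 0 {
					flush(buf) // flush the remainder when the stream ends
				}
				return
			}
			buf = append(buf, c)
			if len(buf) == 1 {
				timer = time.After(maxWait) // start the clock on the first change
			}
			if len(buf) >= maxBatch {
				flush(buf)
				buf, timer = nil, nil
			}
		case <-timer:
			flush(buf)
			buf, timer = nil, nil
		}
	}
}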
Replication
Running multiple managers for high availability
Replication

[Diagram: three Managers (one Leader, two Followers) and a Worker]

• A Worker can connect to any Manager
• Followers forward traffic to the Leader
Replication

[Diagram: three Managers (one Leader, two Followers) and several Workers]

• Followers multiplex all of their workers to the Leader over a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces the Leader's networking load by spreading connections evenly

Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections, and each follower forwards its 2,000 workers to the leader over a single socket.
Replication

[Diagram: the Leader fails and one of the Followers is elected the new Leader]

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new Leader
• Transparent to workers
Replication

[Diagram: the Worker holds the list of manager addresses: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]

• Managers send the list of all managers' addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers reconnect to a different manager, chosen at random from that list
Presence
Scalable presence in a distributed environment
Presence
• The Leader commits Worker state (Up vs Down) into Raft
  − Propagates to all managers
  − Recoverable in case of leader re-election
• Heartbeat TTLs are kept in Leader memory
  − Too expensive to store "last ping time" in Raft: every ping would result in a quorum write
  − The Leader keeps worker/TTL pairs in a heap (time.AfterFunc); see the sketch below
  − Upon leader failover, workers are given a grace period to reconnect
    • Workers are considered Unknown until they reconnect
    • If they do, they move back to Up; if they don't, they move to Down
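A minimal sketch of TTL tracking in leader memory built on time.AfterFunc; SwarmKit keeps the worker/TTL pairs in a heap, but the effect is the same, and all names here are illustrative:

package presence

import (
	"sync"
	"time"
)

// Heartbeats keeps one timer per worker in the leader's memory; pings only
// reset timers, so nothing is written to Raft until a worker's state
// actually changes.
type Heartbeats struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // worker ID -> TTL expiration
	onDown func(workerID string)  // e.g. commit the Down state to Raft
}

func NewHeartbeats(onDown func(string)) *Heartbeats {
	return &Heartbeats{timers: make(map[string]*time.Timer), onDown: onDown}
}

// Ping resets the worker's TTL; if no further ping arrives within ttl,
// onDown fires for that worker.
func (h *Heartbeats) Ping(workerID string, ttl time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if t, ok := h.timers[workerID]; ok {
		t.Reset(ttl)
		return
	}
	h.timers[workerID] = time.AfterFunc(ttl, func() { h.onDown(workerID) })
}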
Distributed Consensus
The Raft Consensus Algorithm

Orchestration systems typically use some kind of service to maintain state in a distributed system:
- etcd
- ZooKeeper
- …

Many of these services are backed by the Raft consensus algorithm.
SwarmKit and Raft

Docker chose to implement the algorithm directly:
- Fast
- No need to set up a separate service to get started with orchestration
- A differentiator between SwarmKit/Docker and other orchestration systems
demo.consensus.group
secretlivesofdata.com
Consistency
Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version and are rejected if it is out of date
● Similar to compare-and-swap (CAS)
● Also exposed through API calls that change objects in the store
Versioned Updates
Consistency
service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
update(spec, service.Version) // rejected if Version is no longer current
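Because an update with a stale base Version is rejected, callers re-read and retry, in the spirit of compare-and-swap. A sketch of that loop, reusing the slide's illustrative helpers (getCurrentService and update are not real SwarmKit API names):

// Retry the versioned update until it is accepted; each attempt re-reads the
// object so the base Version reflects the latest Raft index.
for {
	service := getCurrentService()
	spec := service.Spec
	spec.Image = "my.serv/myimage:mytag"
	if err := update(spec, service.Version); err == nil {
		break // accepted: the store bumps Version to the new Raft index
	}
	// rejected: the object changed since we read it; re-read and try again
}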
Sequencer

Original object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.3.0, ...
    Version = 189   ← Raft index when the object was last updated

Update request:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 189   (base version supplied by the caller; matches the stored object)
Sequencer

The update is accepted and the store bumps the Version to the new Raft index.

Updated object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 190
Sequencer

Updated object:
  Service ABC
    Spec: Replicas = 4, Image = registry:2.4.0, ...
    Version = 190

Update request (stale):
  Service ABC
    Spec: Replicas = 5, Image = registry:2.3.0, ...
    Version = 189

The base Version (189) no longer matches the stored Version (190), so the update is rejected; the caller must re-read the object and retry.
github.com/docker/swarmkit
github.com/coreos/etcd (Raft)
THANK YOU
