Heart of the SwarmKit: Store, Topology & Object Model by Aaron, Andrea, Stephen D (Docker)
SwarmKit repo - https://github.com/docker/swarmkit
Liveblogging: http://canopy.mirage.io/Liveblog/SwarmKitDDS2016
5. Orchestration
A control system for your cluster
[Diagram: control loop; the Desired State (D) feeds the Orchestrator (O), which applies operations (Δ) to the Cluster (C), and the cluster state at time t (St) feeds back into the Orchestrator]
D = Desired State
O = Orchestrator
C = Cluster
St = State at time t
Δ = Operations to converge St to D
https://en.wikipedia.org/wiki/Control_theory
8. Data Model Requirements
- Represent differences in cluster state
- Maximize Observability
- Support Convergence
- Do this while being Extensible and Reliable
9. Show me your data structures and I’ll show you your orchestration system
10. Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the runtime details to the container process
- Implement these services by distributing processes across a cluster
[Diagram: a service's tasks distributed across Node 1, Node 2, and Node 3]
11. Declarative
$ docker network create -d overlay backend
31ue4lvbj4m301i7ef3x8022t
$ docker service create -p 6379:6379 --network backend redis
bhk0gw6f0bgrbhmedwt5lful6
$ docker service scale serene_euler=3
serene_euler scaled to 3
$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
dj0jh3bnojtm  serene_euler  3/3       redis
15. Service Spec
message ServiceSpec {
    // Task defines the task template this service will spawn.
    TaskSpec task = 2 [(gogoproto.nullable) = false];

    // UpdateConfig controls the rate and policy of updates.
    UpdateConfig update = 6;

    // Service endpoint specifies the user provided configuration
    // to properly discover and load balance a service.
    EndpointSpec endpoint = 8;
}
Protobuf Example
16. Service Object
message Service {
    ServiceSpec spec = 3;

    // UpdateStatus contains the status of an update, if one is in
    // progress.
    UpdateStatus update_status = 5;

    // Runtime state of service endpoint. This may be different
    // from the spec version because the user may not have entered
    // the optional fields like node_port or virtual_ip and it
    // could be auto allocated by the system.
    Endpoint endpoint = 4;
}
Protobuf Example
31. Push vs Pull
Push
• Pros: Provides better control over communication rate
  − Managers decide when to contact Workers
• Cons: Requires a discovery mechanism
  − More failure scenarios
  − Harder to troubleshoot
Pull
• Pros: Simpler to operate
  − Workers connect to Managers and don’t need to bind
  − Can easily traverse networks
  − Easier to secure
  − Fewer moving parts
• Cons: Workers must maintain a connection to Managers at all times
32. Push vs Pull
• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode
34. Rate Control: Heartbeats
• Manager dictates the heartbeat rate to Workers
• Rate is configurable
• Managers agree on the same rate by consensus (Raft)
• Managers add jitter so pings are spread over time, avoiding bursts (see the sketch below)
[Diagram: Manager and Worker exchange "Ping? Pong!"; the Manager replies "Ping me back in 5.2 seconds"]
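To make the rate control concrete, here is a minimal Go sketch of the jittered scheduling idea; the 5-second base period and 25% jitter fraction are assumptions for illustration, not SwarmKit's actual dispatcher values.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// heartbeatPeriod stands in for the rate the managers agreed on via
// Raft; the value is an assumption for illustration.
const heartbeatPeriod = 5 * time.Second

// nextHeartbeat returns the deadline the manager hands back to a
// worker: the base period plus up to 25% random jitter, so pings from
// many workers spread out over time instead of arriving in bursts.
func nextHeartbeat() time.Duration {
    jitter := time.Duration(rand.Int63n(int64(heartbeatPeriod) / 4))
    return heartbeatPeriod + jitter
}

func main() {
    // The manager would send this back in its heartbeat response:
    // "ping me back in 5.2 seconds".
    fmt.Printf("ping me back in %.1f seconds\n", nextHeartbeat().Seconds())
}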
35. Rate Control: Workloads
• Worker opens a gRPC stream to receive workloads
• Manager can send data whenever it wants to
• Manager will send data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first (see the sketch below)
• Adds little delay (at most 100 ms) but drastically reduces the amount of communication
[Diagram: Worker asks the Manager "Give me work to do"; the Manager streams batches over time]
100ms - [Batch of 12 ]
200ms - [Batch of 26 ]
300ms - [Batch of 32 ]
340ms - [Batch of 100]
360ms - [Batch of 100]
460ms - [Batch of 42 ]
560ms - [Batch of 23 ]
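The buffering behavior can be sketched as a small Go loop that flushes on whichever trigger fires first; the thresholds mirror the numbers on the slide, but the channel types and names are invented, not SwarmKit's dispatcher code.

package main

import (
    "fmt"
    "time"
)

const (
    maxBatchSize  = 100                    // flush once 100 changes are buffered...
    maxBatchDelay = 100 * time.Millisecond // ...or after 100 ms, whichever comes first
)

// batchChanges reads individual changes and emits them in batches,
// trading at most maxBatchDelay of latency for far fewer messages on
// the stream.
func batchChanges(changes <-chan string, batches chan<- []string) {
    var buf []string
    timer := time.NewTimer(maxBatchDelay)
    defer timer.Stop()

    flush := func() {
        if len(buf) > 0 {
            batches <- buf
            buf = nil
        }
        timer.Reset(maxBatchDelay)
    }

    for {
        select {
        case c, ok := <-changes:
            if !ok { // input closed: emit what is left and stop
                flush()
                close(batches)
                return
            }
            buf = append(buf, c)
            if len(buf) >= maxBatchSize {
                flush()
            }
        case <-timer.C:
            flush()
        }
    }
}

func main() {
    changes := make(chan string)
    batches := make(chan []string)
    go batchChanges(changes, batches)

    go func() {
        for i := 0; i < 250; i++ {
            changes <- fmt.Sprintf("change-%d", i)
        }
        close(changes)
    }()

    for b := range batches {
        fmt.Printf("batch of %d\n", len(b))
    }
}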
38. Replication
[Diagram: three Managers (one Leader, two Followers), each serving a share of the Workers; Followers forward worker traffic to the Leader]
• Followers multiplex all workers to the Leader using a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces Leader networking load by spreading the connections evenly
Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.
39. Replication
[Diagram: the same topology after the Leader fails; a Follower is elected the new Leader and the other managers redirect worker traffic to it]
• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new one
• Transparent to workers
46. Presence
• Leader commits Worker state (Up vs Down) into Raft
  − Propagates to all managers
  − Recoverable in case of leader re-election
• Heartbeat TTLs kept in Leader memory (see the sketch below)
  − Too expensive to store “last ping time” in Raft
    • Every ping would result in a quorum write
  − Leader keeps worker<->TTL in a heap (time.AfterFunc)
  − Upon leader failover, workers are given a grace period to reconnect
    • Workers are considered Unknown until they reconnect
    • If they do, they move back to Up
    • If they don’t, they move to Down
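A minimal sketch of the in-memory TTL tracking: one timer per worker armed with time.AfterFunc, as the slide mentions. The type names and the onExpire callback are invented for illustration, and nothing here touches Raft, which is what keeps pings cheap.

package main

import (
    "fmt"
    "sync"
    "time"
)

// ttlTracker keeps one timer per worker in leader memory. A ping
// re-arms the worker's timer; if the timer fires, the worker is
// reported down.
type ttlTracker struct {
    mu       sync.Mutex
    timers   map[string]*time.Timer
    onExpire func(workerID string)
}

func newTTLTracker(onExpire func(string)) *ttlTracker {
    return &ttlTracker{timers: make(map[string]*time.Timer), onExpire: onExpire}
}

// heartbeat records a ping from a worker and (re)arms its TTL timer.
func (t *ttlTracker) heartbeat(workerID string, ttl time.Duration) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if timer, ok := t.timers[workerID]; ok {
        timer.Stop()
    }
    t.timers[workerID] = time.AfterFunc(ttl, func() { t.onExpire(workerID) })
}

func main() {
    tracker := newTTLTracker(func(id string) { fmt.Println(id, "is down") })
    tracker.heartbeat("worker-1", 100*time.Millisecond)
    time.Sleep(200 * time.Millisecond) // no second ping: worker-1 expires
}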
47. Heart of the SwarmKit: Distributed Data Store
Aaron Lehmann
Docker
48. What we store
● State of the cluster
● User-defined configuration
● Organized into objects:
○ Cluster
○ Node
○ Service
○ Task
○ Network
○ etc...
49. Why embed the distributed data store?
● Ease of setup
● Fewer round trips
● Can maintain local indices
50. In-memory data structures
● Objects are protocol buffers messages
● go-memdb used as in-memory database (see the sketch below):
https://github.com/hashicorp/go-memdb
● Underlying data structure: radix trees
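For concreteness, here is a small, self-contained example of go-memdb usage; the Task type and schema are invented for illustration and are not SwarmKit's actual table definitions.

package main

import (
    "fmt"

    memdb "github.com/hashicorp/go-memdb"
)

// Task is a stand-in for a store object; SwarmKit's real objects are
// protobuf messages.
type Task struct {
    ID        string
    ServiceID string
}

func main() {
    // One table with a unique "id" index and a secondary index on
    // ServiceID, both backed by radix trees.
    schema := &memdb.DBSchema{
        Tables: map[string]*memdb.TableSchema{
            "task": {
                Name: "task",
                Indexes: map[string]*memdb.IndexSchema{
                    "id": {
                        Name:    "id",
                        Unique:  true,
                        Indexer: &memdb.StringFieldIndex{Field: "ID"},
                    },
                    "service": {
                        Name:    "service",
                        Indexer: &memdb.StringFieldIndex{Field: "ServiceID"},
                    },
                },
            },
        },
    }

    db, err := memdb.NewMemDB(schema)
    if err != nil {
        panic(err)
    }

    txn := db.Txn(true) // write transaction
    if err := txn.Insert("task", &Task{ID: "id1", ServiceID: "svc1"}); err != nil {
        panic(err)
    }
    txn.Commit()

    read := db.Txn(false) // read transaction: an atomic snapshot
    raw, err := read.First("task", "id", "id1")
    if err != nil {
        panic(err)
    }
    fmt.Println(raw.(*Task).ServiceID)
}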
51. Radix trees for indexing
[Diagram: radix tree with compressed shared prefixes]
Hel ─┬─ Hello
     └─ Helpful
Wo ─┬─ Wor ─┬─ World
    │       ├─ Work
    │       └─ Word
    └─ Won
52. Radix trees for indexing
[Diagram: one radix tree holding a primary index (id:) and a secondary index (node:)]
id: ─┬─ id:abcd
     ├─ id:efgh
     ├─ id:ijkl
     └─ id:mnop
node: ─┬─ node:1234 ─┬─ node:1234:abcd
       │             └─ node:1234:efgh
       └─ node:5678 ─┬─ node:5678:ijkl
                     └─ node:5678:mnop
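go-memdb's tables are backed by immutable radix trees (github.com/hashicorp/go-immutable-radix). A short sketch of a prefix walk over the node: index in the diagram above; the keys are the invented examples from the figure.

package main

import (
    "fmt"

    iradix "github.com/hashicorp/go-immutable-radix"
)

func main() {
    // Each Insert returns a new immutable tree that shares structure
    // with the old one; old snapshots stay valid for readers.
    r := iradix.New()
    for _, key := range []string{
        "node:1234:abcd", "node:1234:efgh",
        "node:5678:ijkl", "node:5678:mnop",
    } {
        r, _, _ = r.Insert([]byte(key), nil)
    }

    // A secondary-index lookup is just a prefix walk: all tasks on
    // node 1234.
    it := r.Root().Iterator()
    it.SeekPrefix([]byte("node:1234"))
    for key, _, ok := it.Next(); ok; key, _, ok = it.Next() {
        fmt.Println(string(key))
    }
}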
57. Transactions
● We provide a transactional interface to read or write data in the store
● Read transactions are just atomic snapshots
● Write transaction:
  ○ Take a snapshot
  ○ Make changes
  ○ Replace the tree root with the modified tree’s root (atomic pointer swap; see the sketch below)
● Only one write transaction allowed at a time
● Commit of a write transaction blocks until changes are committed to Raft
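A minimal sketch of the snapshot-and-swap mechanism under a single-writer lock, built on the immutable radix tree from earlier; this shows the shape of the idea, not SwarmKit's actual store code, and it omits the Raft commit that a real write transaction blocks on.

package main

import (
    "fmt"
    "sync"
    "sync/atomic"

    iradix "github.com/hashicorp/go-immutable-radix"
)

type store struct {
    root    atomic.Pointer[iradix.Tree] // current snapshot, swapped atomically
    writeMu sync.Mutex                  // only one write transaction at a time
}

// View runs fn against an atomic snapshot; concurrent writes never
// affect the tree fn sees.
func (s *store) View(fn func(*iradix.Tree)) {
    fn(s.root.Load())
}

// Update takes a snapshot, applies fn to produce a modified tree, and
// publishes it with a single atomic pointer swap. (The real store
// would block here until the change log is committed to Raft.)
func (s *store) Update(fn func(*iradix.Tree) *iradix.Tree) {
    s.writeMu.Lock()
    defer s.writeMu.Unlock()
    s.root.Store(fn(s.root.Load()))
}

func main() {
    s := &store{}
    s.root.Store(iradix.New())

    s.Update(func(t *iradix.Tree) *iradix.Tree {
        t, _, _ = t.Insert([]byte("id:abcd"), "task")
        return t
    })

    s.View(func(t *iradix.Tree) {
        v, ok := t.Get([]byte("id:abcd"))
        fmt.Println(v, ok)
    })
}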
58. Transaction example: Read
dataStore.View(func(tx store.ReadTx) {
    tasks, err = store.FindTasks(tx, store.ByServiceID(serviceID))
    if err == nil {
        for _, t := range tasks {
            fmt.Println(t.ID)
        }
    }
})
59. Transaction example: Write
err := dataStore.Update(func(tx store.Tx) error {
    t := store.GetTask(tx, "id1")
    if t == nil {
        return errors.New("task not found")
    }
    t.DesiredState = api.TaskStateRunning
    return store.UpdateTask(tx, t)
})
60. Watches
● Code can register to receive specific creation, update, or deletion events on a Go channel (see the sketch below)
● Selectors on particular fields in the objects
● Currently an internal feature; will be exposed through the API in the future
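SwarmKit's internal watch API has its own types and helpers; the generic sketch below only illustrates the idea of channel-based subscriptions with per-field selectors, and all names in it are invented.

package main

import "fmt"

// Event describes a change to a store object; Kind and ServiceID act
// as the fields a selector can match on.
type Event struct {
    Kind      string // "create", "update", or "delete"
    TaskID    string
    ServiceID string
}

// Watcher fans events out to subscribers whose selector matches.
type Watcher struct {
    subs []chan Event
    sels []func(Event) bool
}

// Watch registers a selector and returns a channel of matching events.
func (w *Watcher) Watch(sel func(Event) bool) <-chan Event {
    ch := make(chan Event, 16)
    w.subs = append(w.subs, ch)
    w.sels = append(w.sels, sel)
    return ch
}

// Publish delivers an event to every subscriber whose selector matches.
func (w *Watcher) Publish(e Event) {
    for i, ch := range w.subs {
        if w.sels[i](e) {
            ch <- e
        }
    }
}

func main() {
    var w Watcher
    // Select only update events for tasks of service "svc1".
    ch := w.Watch(func(e Event) bool {
        return e.Kind == "update" && e.ServiceID == "svc1"
    })
    w.Publish(Event{Kind: "update", TaskID: "id1", ServiceID: "svc1"})
    fmt.Println(<-ch)
}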
63. Replication
● Only the Raft leader does writes
● During a write transaction, every change is logged in addition to updating the radix tree
● The transaction log is serialized and replicated through Raft
● Since our internal types are protobuf types, serialization is very easy
● Followers replay the log entries into the radix tree (see the sketch below)
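A sketch of the follower side: replaying a change log into the radix tree. The storeAction type is an invented stand-in for the serialized protobuf log entries.

package main

import (
    "fmt"

    iradix "github.com/hashicorp/go-immutable-radix"
)

// storeAction is an invented stand-in for one entry of the replicated
// transaction log; SwarmKit serializes the real thing as protobuf.
type storeAction struct {
    Kind  string // "update" or "delete"
    Key   []byte
    Value interface{}
}

// replay applies log entries from the leader to a follower's tree,
// producing the same snapshot the leader committed.
func replay(root *iradix.Tree, log []storeAction) *iradix.Tree {
    for _, a := range log {
        switch a.Kind {
        case "update":
            root, _, _ = root.Insert(a.Key, a.Value)
        case "delete":
            root, _, _ = root.Delete(a.Key)
        }
    }
    return root
}

func main() {
    root := replay(iradix.New(), []storeAction{
        {Kind: "update", Key: []byte("id:abcd"), Value: "task"},
    })
    fmt.Println(root.Len())
}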
64. Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version and are rejected if it is out of date
● Similar to CAS (compare-and-swap; see the sketch below)
● Also exposed through API calls that change objects in the store
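The version check amounts to a compare-and-swap on the Version field; here is a minimal sketch, with the object type and the raftIndex parameter invented for illustration.

package main

import (
    "errors"
    "fmt"
)

// object is a minimal stand-in for a store object with a Version
// holding the Raft index of its last update.
type object struct {
    ID      string
    Spec    string
    Version uint64
}

var errSequenceConflict = errors.New("update out of sequence")

// updateObject applies an update only if the caller's base Version
// matches the stored one, then stamps the new Raft index: the
// CAS-style check from the slide.
func updateObject(stored *object, update object, raftIndex uint64) error {
    if update.Version != stored.Version {
        return errSequenceConflict // caller read a stale copy
    }
    stored.Spec = update.Spec
    stored.Version = raftIndex
    return nil
}

func main() {
    svc := &object{ID: "ABC", Spec: "replicas=4", Version: 190}
    // This request was built against Version 189, so it is rejected.
    err := updateObject(svc, object{ID: "ABC", Spec: "replicas=5", Version: 189}, 191)
    fmt.Println(err)
}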
70. Sequencer
Updated object (already committed in the store):
Service ABC
  Spec
    Replicas = 4
    Image = registry:2.4.0
    ...
  Version = 190
Update request (carries base Version 189, which is out of date, so it is rejected):
Service ABC
  Spec
    Replicas = 5
    Image = registry:2.3.0
    ...
  Version = 189
72. Write batching
● Every write transaction involves a Raft round trip to get consensus
● Costly to do many transactions, but want to limit the size of writes to Raft
● Batch primitive lets the store automatically split a group of changes across multiple writes to Raft
73. Write batching
_, err = d.store.Batch(func(batch *store.Batch) error {
    for _, n := range nodes {
        err := batch.Update(func(tx store.Tx) error {
            node := store.GetNode(tx, n.ID)
            node.Status = api.NodeStatus{
                State:   api.NodeStatus_UNKNOWN,
                Message: `Node moved to "unknown" state`,
            }
            return store.UpdateNode(tx, node)
        })
        if err != nil {
            return err
        }
    }
    return nil
})