
SwarmKit in Theory and Practice

Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice, and walk away with more insights into your production applications.



  1. 1. Container Orchestration from Theory to Practice Laura Frank & Stephen Day Open Source Summit, Los Angeles September 2017 v0
  2. 2. 2 Stephen Day Docker, Inc. stephen@docker.com @stevvooe Laura Frank Codeship laura@codeship.com @rhein_wein
  3. 3. 3 Agenda - Understanding the SwarmKit object model - Node topology - Distributed consensus and Raft
  4. 4. SwarmKit An open source framework for building orchestration systems https://github.com/docker/swarmkit
  5. 5. SwarmKit Object Model Laura Frank & Stephen Day Container Orchestration from Theory to Practice
  6. 6. 6 Orchestration: a control system for your cluster. The orchestrator O compares the desired state D against the observed cluster state S(t) and issues operations Δ against the cluster C to close the gap. D = Desired State, O = Orchestrator, C = Cluster, S(t) = State at time t, Δ = Operations to converge S toward D. https://en.wikipedia.org/wiki/Control_theory
  7. 7. 7 Convergence: a functional view. f(D, S(n-1), C) → S(n) | min(S(n) - D), i.e. each reconciliation pass takes the desired state D, the previous state S(n-1), and the cluster C, and produces a new state S(n) that minimizes the distance to D. D = Desired State, O = Orchestrator, C = Cluster, S(t) = State at time t.
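In code, this control loop amounts to repeatedly diffing desired state against observed state and applying the difference. A minimal Go sketch of such a loop; the State map, Cluster interface, and fakeCluster below are illustrative stand-ins, not SwarmKit types:

```go
package main

import (
	"fmt"
	"time"
)

// State maps a service name to its replica count (illustrative).
type State map[string]int

// Cluster stands in for the system under control.
type Cluster interface {
	Observe() State                  // S(t): the observed state
	Apply(service string, delta int) // Δ: operations issued to converge
}

// converge is one pass of f(D, S, C): diff desired against observed
// and apply the difference.
func converge(desired State, c Cluster) {
	observed := c.Observe()
	for svc, want := range desired {
		if have := observed[svc]; have != want {
			c.Apply(svc, want-have)
		}
	}
}

// fakeCluster is a toy in-memory cluster for the example.
type fakeCluster struct{ state State }

func (f *fakeCluster) Observe() State { return f.state }
func (f *fakeCluster) Apply(svc string, delta int) {
	fmt.Printf("reconcile %s: %+d replicas\n", svc, delta)
	f.state[svc] += delta
}

func main() {
	c := &fakeCluster{state: State{"redis": 1}}
	desired := State{"redis": 3}
	for i := 0; i < 3; i++ { // the control loop: observe, diff, apply, repeat
		converge(desired, c)
		time.Sleep(10 * time.Millisecond)
	}
}
```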
  8. 8. 8 Observability and Controllability: the problem. Inputs to the control system span a spectrum from low observability (failures, process state) to high observability (user input).
  9. 9. 9 Data Model Requirements - Represent difference in cluster state - Maximize observability - Support convergence - Do this while being extensible and reliable
  10. 10. Show me your data structures and I’ll show you your orchestration system
  11. 11. 11 Services - Express desired state of the cluster - Abstraction to control a set of containers - Enumerates resources, network availability, placement - Leave the details of runtime to the container process - Implement these services by distributing processes across a cluster (diagram: processes spread across Node 1, Node 2, Node 3)
  12. 12. 12 Declarative $ docker network create -d overlay backend vpd5z57ig445ugtcibr11kiiz $ docker service create -p 6379:6379 --network backend redis pe5kyzetw2wuynfnt7so1jsyl $ docker service scale serene_euler=3 serene_euler scaled to 3 $ docker service ls ID NAME REPLICAS IMAGE COMMAND pe5kyzetw2wu serene_euler 3/3 redis
  13. 13. 13 Reconciliation: Spec → Object. The Spec carries the desired state; the Object carries the current state.
  14. 14. 14 Demo Time!
  15. 15. Task Model - Runtime lifecycle: Prepare: set up resources; Start: start the task; Wait: wait until the task exits; Shutdown: stop the task cleanly.
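The runtime half of the task model maps naturally onto an interface. The sketch below mirrors the four lifecycle steps above; SwarmKit's real interface (agent/exec.Controller) carries a larger method set, and noopController is a made-up stand-in for an actual container runtime:

```go
package main

import (
	"context"
	"fmt"
)

// Controller is a simplified sketch of a per-task runtime hook:
// Prepare -> Start -> Wait -> Shutdown.
type Controller interface {
	Prepare(ctx context.Context) error  // set up resources (images, networks, mounts)
	Start(ctx context.Context) error    // start the task
	Wait(ctx context.Context) error     // block until the task exits
	Shutdown(ctx context.Context) error // stop the task cleanly
}

// Run drives a single task through its lifecycle.
func Run(ctx context.Context, ctlr Controller) error {
	if err := ctlr.Prepare(ctx); err != nil {
		return err
	}
	if err := ctlr.Start(ctx); err != nil {
		return err
	}
	if err := ctlr.Wait(ctx); err != nil {
		return err
	}
	return ctlr.Shutdown(ctx)
}

// noopController stands in for a real container runtime in this sketch.
type noopController struct{ name string }

func (c noopController) Prepare(context.Context) error  { fmt.Println(c.name, "prepare"); return nil }
func (c noopController) Start(context.Context) error    { fmt.Println(c.name, "start"); return nil }
func (c noopController) Wait(context.Context) error     { fmt.Println(c.name, "wait"); return nil }
func (c noopController) Shutdown(context.Context) error { fmt.Println(c.name, "shutdown"); return nil }

func main() {
	_ = Run(context.Background(), noopController{name: "task-1"})
}
```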
  16. 16. 16 Task Model: the task is the atomic scheduling unit of SwarmKit. Each object pairs a Spec (desired state) with its current state; the Orchestrator expands the service into Task0, Task1, …, Taskn, which are handed to the Scheduler.
  17. 17. Data Flow: a ServiceSpec (containing a TaskSpec) is submitted to the Manager and stored as a Service; the Service is expanded into Tasks, each carrying a copy of the TaskSpec, and the Tasks are dispatched to Workers.
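The flow is easier to see with concrete types. The structs below are a simplified illustration (SwarmKit's real objects are protobuf messages in its api package); the point is that the TaskSpec travels inside the ServiceSpec and is copied into every Task the orchestrator creates:

```go
package main

import "fmt"

// TaskSpec describes how to run one task (illustrative fields only).
type TaskSpec struct {
	Image string
}

// ServiceSpec is what the user submits; it carries the TaskSpec.
type ServiceSpec struct {
	Name     string
	Replicas int
	Task     TaskSpec
}

// Service is the stored object: desired state as submitted.
type Service struct {
	ID   string
	Spec ServiceSpec
}

// Task is the unit dispatched to a worker, with the TaskSpec copied in.
type Task struct {
	ID        string
	ServiceID string
	Spec      TaskSpec
	NodeID    string // filled in when the task is assigned to a worker
}

// expand turns a Service into one Task per desired replica.
func expand(s Service) []Task {
	tasks := make([]Task, 0, s.Spec.Replicas)
	for i := 0; i < s.Spec.Replicas; i++ {
		tasks = append(tasks, Task{
			ID:        fmt.Sprintf("%s.%d", s.ID, i),
			ServiceID: s.ID,
			Spec:      s.Spec.Task,
		})
	}
	return tasks
}

func main() {
	svc := Service{
		ID:   "pe5kyzetw2wu",
		Spec: ServiceSpec{Name: "redis", Replicas: 3, Task: TaskSpec{Image: "redis"}},
	}
	for _, t := range expand(svc) {
		fmt.Printf("dispatch %s (image %s) to a worker\n", t.ID, t.Spec.Image)
	}
}
```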
  18. 18. Consistency
  19. 19. 19 Consistency: Field Ownership. Only one component of the system can write to a given field.
  20. 20. Task State: New → Allocated → Assigned → (pre-run: Preparing → Ready → Starting) → Running, with terminal states Complete, Shutdown, Failed, and Rejected. The Manager owns the early states; from Assigned onward the Worker reports progress (see the field handoff on the next slide).
  21. 21. Field Handoff: ownership of the Task Status field by state. States < Assigned are written by the Manager; states >= Assigned are written by the Worker.
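Because the states are ordered, the handoff rule reduces to a single comparison. A sketch with state names taken from the slides (SwarmKit's real enum lives in its protobuf definitions):

```go
package main

import "fmt"

// TaskState values are ordered; the ordering is what makes the handoff
// rule ("< Assigned: manager, >= Assigned: worker") cheap to enforce.
type TaskState int

const (
	StateNew TaskState = iota
	StateAllocated
	StateAssigned
	StatePreparing
	StateReady
	StateStarting
	StateRunning
	StateComplete
	StateShutdown
	StateFailed
	StateRejected
)

// ownedByWorker reports which component may write the task's status
// once the task has reached the given state.
func ownedByWorker(s TaskState) bool {
	return s >= StateAssigned
}

func main() {
	fmt.Println(ownedByWorker(StateAllocated)) // false: the manager still owns the status
	fmt.Println(ownedByWorker(StateRunning))   // true: the worker reports from Assigned onward
}
```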
  22. 22. 22 Observability and Controllability: the problem. Inputs to the control system span a spectrum from low observability (failures, process state) to high observability (user input).
  23. 23. 23 Orchestration: a control system for your cluster. The orchestrator O compares the desired state D against the observed cluster state S(t) and issues operations Δ against the cluster C to close the gap. D = Desired State, O = Orchestrator, C = Cluster, S(t) = State at time t, Δ = Operations to converge S toward D. https://en.wikipedia.org/wiki/Control_theory
  24. 24. 24 Task Model: the task is the atomic scheduling unit of SwarmKit. Each object pairs a Spec (desired state) with its current state; the Orchestrator expands the service into Task0, Task1, …, Taskn, which are handed to the Scheduler.
  25. 25. SwarmKit Doesn’t Quit!
  26. 26. Node Topology in SwarmKit Laura Frank & Stephen Day Container Orchestration from Theory to Practice
  27. 27. We’ve got a bunch of nodes… now what?
  28. 28. 28 Push Model vs Pull Model. Push: Workers register with a discovery system such as ZooKeeper (1 - register), Managers discover them (2 - discover) and push work to them (3 - payload). Pull: Workers connect directly to a Manager for both registration and payload.
  29. 29. 29 Push Model vs Pull Model. Push pros: better control over the communication rate (Managers decide when to contact Workers). Push cons: requires a discovery mechanism, more failure scenarios, harder to troubleshoot. Pull pros: simpler to operate (Workers connect to Managers and don't need to bind), can easily traverse networks, easier to secure, fewer moving parts. Pull cons: Workers must maintain a connection to Managers at all times.
  30. 30. 30 Push vs Pull • SwarmKit adopted the Pull model • Favored operational simplicity • Engineered solutions to provide rate control in pull mode
  31. 31. Rate Control in a Pull Model
  32. 32. 32 Rate Control: Heartbeats • Manager dictates heartbeat rate to Workers • Rate is configurable (not by the end user) • Managers agree on the same rate by consensus via Raft • Managers add jitter so pings are spread over time (avoid bursts) (Worker: "Ping?" Manager: "Pong! Ping me back in 5.2 seconds")
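A sketch of the jitter idea: the worker schedules its next ping at the manager-dictated period plus a small random offset, so pings from many workers spread out instead of arriving in bursts. The period and the jitter fraction below are illustrative, not SwarmKit's actual values:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextHeartbeat returns when the worker should ping again: the period the
// manager dictated plus up to 25% random jitter.
func nextHeartbeat(period time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(period / 4)))
	return period + jitter
}

func main() {
	period := 5 * time.Second // dictated by the manager in its heartbeat response
	for i := 0; i < 3; i++ {
		fmt.Printf("ping me back in %v\n", nextHeartbeat(period))
	}
}
```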
  33. 33. 33 Rate Control: Workloads • Worker opens a gRPC stream to receive workloads • Manager can send data whenever it wants to • Manager will send data in batches • Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first • Adds little delay (at most 100ms) but drastically reduces amount of communication Manager Worker Give me work to do 100ms - [Batch of 12 ] 200ms - [Batch of 26 ] 300ms - [Batch of 32 ] 340ms - [Batch of 100] 360ms - [Batch of 100] 460ms - [Batch of 42 ] 560ms - [Batch of 23 ]
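A sketch of the batching idea using a Go channel: flush when 100 items have accumulated or 100 ms after the first pending item, whichever comes first. The channel-and-select shape is illustrative; in SwarmKit the batches travel over the worker's gRPC stream:

```go
package main

import (
	"fmt"
	"time"
)

// batch collects items from ch and flushes them when either maxSize items
// have accumulated or maxDelay has elapsed since the first unflushed item.
func batch(ch <-chan string, maxSize int, maxDelay time.Duration, flush func([]string)) {
	var pending []string
	var timer <-chan time.Time
	for {
		select {
		case item, ok := <-ch:
			if !ok {
				if len(pending) > 0 {
					flush(pending)
				}
				return
			}
			if len(pending) == 0 {
				timer = time.After(maxDelay) // start the clock on the first item
			}
			pending = append(pending, item)
			if len(pending) >= maxSize {
				flush(pending)
				pending, timer = nil, nil
			}
		case <-timer:
			flush(pending)
			pending, timer = nil, nil
		}
	}
}

func main() {
	ch := make(chan string)
	go func() {
		for i := 0; i < 250; i++ {
			ch <- fmt.Sprintf("task-update-%d", i)
		}
		close(ch)
	}()
	batch(ch, 100, 100*time.Millisecond, func(items []string) {
		fmt.Printf("flush batch of %d\n", len(items))
	})
}
```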
  34. 34. Replication Running multiple managers for high availability
  35. 35. 35 Replication (diagram: three Managers, one Leader and two Followers, with a Worker attached) • Worker can connect to any Manager • Followers will forward traffic to the Leader
  36. 36. 36 Replication • Followers multiplex all workers to the Leader using a single connection • Backed by gRPC channels (HTTP/2 streams) • Reduces Leader networking load by spreading the connections evenly • Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.
  37. 37. 37 Replication • Upon Leader failure, a new one is elected • All managers start redirecting worker traffic to the new one • Transparent to workers
  38. 38. 38 Replication (diagram: a former Follower is now the Leader) • Upon Leader failure, a new one is elected • All managers start redirecting worker traffic to the new one • Transparent to workers
  39. 39. 39 Replication • Manager sends list of all managers' addresses to Workers (Manager 1 Addr, Manager 2 Addr, Manager 3 Addr) • When a new manager joins, all workers are notified • Upon manager failure, workers will reconnect to a different manager
  40. 40. 40 Replication • Manager sends list of all managers' addresses to Workers • When a new manager joins, all workers are notified • Upon manager failure, workers will reconnect to a different manager
  41. 41. 41 Replication • Manager sends list of all managers' addresses to Workers • When a new manager joins, all workers are notified • Upon manager failure, workers will reconnect to a different manager (diagram: the Worker reconnects to a random manager)
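On a failure the worker simply picks another manager from the address list it was given. A toy sketch of that selection; the real agent's selection and backoff logic is more involved and is omitted here:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickManager chooses a manager address at random from the list the
// managers push to each worker, skipping the one that just failed.
func pickManager(addrs []string, failed string) (string, bool) {
	candidates := make([]string, 0, len(addrs))
	for _, a := range addrs {
		if a != failed {
			candidates = append(candidates, a)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	addrs := []string{"manager-1:2377", "manager-2:2377", "manager-3:2377"}
	next, ok := pickManager(addrs, "manager-3:2377")
	fmt.Println(next, ok)
}
```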
  42. 42. Presence Scalable presence in a distributed environment
  43. 43. 43 Presence • Leader commits Worker state (Up vs Down) into Raft − Propagates to all managers − Recoverable in case of leader re-election • Heartbeat TTLs kept in Leader memory − Too expensive to store “last ping time” in Raft • Every ping would result in a quorum write − Leader keeps worker<->TTL in a heap (time.AfterFunc) − Upon leader failover workers are given a grace period to reconnect • Workers considered Unknown until they reconnect • If they do they move back to Up • If they don’t they move to Down
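A sketch of the in-memory TTL tracking described above, built on time.AfterFunc: each heartbeat re-arms a per-worker timer, and an expired timer marks the worker Down, with no Raft write per ping. The Unknown state and the grace period after leader failover are omitted, and all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// presence tracks worker liveness in leader memory only.
type presence struct {
	mu     sync.Mutex
	ttl    time.Duration
	timers map[string]*time.Timer
}

func newPresence(ttl time.Duration) *presence {
	return &presence{ttl: ttl, timers: make(map[string]*time.Timer)}
}

// Heartbeat records a ping from a worker and (re)starts its TTL timer.
// onExpire fires if the worker misses its TTL.
func (p *presence) Heartbeat(worker string, onExpire func(worker string)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.timers[worker]; ok {
		t.Reset(p.ttl)
		return
	}
	p.timers[worker] = time.AfterFunc(p.ttl, func() { onExpire(worker) })
}

func main() {
	p := newPresence(100 * time.Millisecond)
	down := func(w string) { fmt.Println(w, "-> Down") }
	p.Heartbeat("worker-1", down)
	p.Heartbeat("worker-2", down)
	p.Heartbeat("worker-1", down)      // a fresh ping keeps worker-1 alive a bit longer
	time.Sleep(250 * time.Millisecond) // both eventually miss their TTL
}
```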
  44. 44. Distributed Consensus Laura Frank & Stephen Day Container Orchestration from Theory to Practice
  45. 45. 45 The Raft Consensus Algorithm Orchestration systems typically use some kind of service to maintain state in a distributed system - etcd - ZooKeeper - … Many of these services are backed by the Raft consensus algorithm
  46. 46. 46 SwarmKit and Raft Docker chose to implement the algorithm directly - Fast - Don’t have to set up a separate service to get started with orchestration - Differentiator between SwarmKit/Docker and other orchestration systems
  47. 47. demo.consensus.group secretlivesofdata.com
  48. 48. Consistency
  49. 49. 49 Sequencer ● Every object in the store has a Version field ● Version stores the Raft index when the object was last updated ● Updates must provide a base Version and are rejected if it is out of date ● Similar to CAS (compare-and-swap) ● Also exposed through API calls that change objects in the store
  50. 50. 50 Versioned Updates (Consistency): service := getCurrentService(); spec := service.Spec; spec.Image = "my.serv/myimage:mytag"; update(spec, service.Version)
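Because a stale base Version is rejected, callers typically wrap the read-modify-write in a retry loop, compare-and-swap style. A self-contained sketch with a toy in-memory store standing in for the manager's API; getCurrentService and update on the slide correspond to the get and update callbacks here:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types: SwarmKit's real Version is a Raft index carried on
// every object in the store.
type ServiceSpec struct {
	Image string
}

type Service struct {
	Spec    ServiceSpec
	Version uint64
}

var errOutOfDate = errors.New("update out of sequence: base version is stale")

// updateImage retries the read-modify-write until the version check succeeds.
func updateImage(get func() Service, update func(ServiceSpec, uint64) error, image string) error {
	for {
		service := get()
		spec := service.Spec
		spec.Image = image
		err := update(spec, service.Version)
		if errors.Is(err, errOutOfDate) {
			continue // someone else updated the object first; re-read and retry
		}
		return err
	}
}

func main() {
	// Toy in-memory "store" standing in for the manager's API.
	stored := Service{Spec: ServiceSpec{Image: "registry:2.3.0"}, Version: 189}
	get := func() Service { return stored }
	update := func(spec ServiceSpec, base uint64) error {
		if base != stored.Version {
			return errOutOfDate
		}
		stored = Service{Spec: spec, Version: stored.Version + 1}
		return nil
	}
	if err := updateImage(get, update, "registry:2.4.0"); err != nil {
		fmt.Println("update failed:", err)
		return
	}
	fmt.Printf("updated to %s at version %d\n", stored.Spec.Image, stored.Version)
}
```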
  51. 51. 51 Sequencer. Original object: Service ABC, Spec { Replicas = 4, Image = registry:2.3.0, ... }, Version = 189 (the Raft index when it was last updated)
  52. 52. 52 Sequencer. Original object: Service ABC at Version = 189. Update request: Service ABC, Spec { Replicas = 4, Image = registry:2.4.0, ... }, base Version = 189
  53. 53. 53 Sequencer. The update request's base Version (189) matches the stored Version, so the update is accepted.
  54. 54. 54 Sequencer. Updated object: Service ABC, Spec { Replicas = 4, Image = registry:2.4.0, ... }, Version = 190
  55. 55. 55 Sequencer. Updated object: Service ABC at Version = 190. A second update request arrives: Service ABC, Spec { Replicas = 5, Image = registry:2.3.0, ... }, base Version = 189
  56. 56. 56 Sequencer. The second request's base Version (189) is older than the stored Version (190), so it is rejected.
  57. 57. github.com/docker/swarmkit github.com/coreos/etcd (Raft)
  58. 58. THANK YOU
