
Container Orchestration from Theory to Practice

Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice and walk away with more insights into your production applications.



  1. 1. Stephen Day & Laura Frank Container Orchestration from Theory to Practice
  2. 2. Stephen Day Docker, Inc. stephen@docker.com @stevvooe Laura Frank Codeship laura@codeship.com @rhein_wein
  3. 3. Agenda ● Understanding the SwarmKit object model ● Node topology ● Distributed consensus and Raft
  4. 4. SwarmKit: an open source framework for building orchestration systems. github.com/docker/swarmkit
  5. 5. Layers: docker (user tooling) / kubernetes, swarmkit (orchestration) / containerd (container runtime)
  6. 6. Orchestration The Old Way Cluster https://en.wikipedia.org/wiki/Control_theory
  7. 7. Orchestration: a control system for your cluster. D = Desired State, O = Orchestrator, C = Cluster, S = State, Δ = operations to converge S to D. https://en.wikipedia.org/wiki/Control_theory
  8. 8. Convergence, a functional view: f(D, S_{t-1}, C) → S_t, such that min(S_t - D). D = Desired State, O = Orchestrator, C = Cluster, S_t = State at time t
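The convergence function f(D, S_{t-1}, C) → S_t can be sketched as a reconciliation step that computes the operations (Δ) needed to move the current state toward the desired state. This is a toy illustration with hypothetical names, not SwarmKit's actual code:

```go
package main

import "fmt"

// State counts running replicas per service: a toy stand-in for cluster state.
type State map[string]int

// Desired is the target replica count per service (the D in the slides).
type Desired map[string]int

// reconcile computes the operations needed to converge state s toward
// desired state d, mirroring the Δ the orchestrator applies each round.
func reconcile(d Desired, s State) []string {
	var ops []string
	for svc, want := range d {
		have := s[svc]
		for ; have < want; have++ {
			ops = append(ops, "start "+svc)
		}
		for ; have > want; have-- {
			ops = append(ops, "stop "+svc)
		}
	}
	return ops
}

func main() {
	d := Desired{"redis": 3}
	s := State{"redis": 1}
	fmt.Println(reconcile(d, s)) // two "start redis" operations
}
```

Real orchestrators run this computation continuously, so a transient failure that lowers S is corrected on the next pass.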
  9. 9. The Problem: Observability and Controllability. Inputs range from low to high observability: Failure, Process State, User Input.
  10. 10. SwarmKit Object Model
  11. 11. Data Model Requirements ● Represent difference in cluster state ● Maximize observability ● Support convergence ● Do this while being extensible and reliable
  12. 12. Show me your data structures and I’ll show you your orchestration system
  13. 13. Services ● Express desired state of the cluster ● Abstraction to control a set of containers ● Enumerates resources, network availability, placement ● Leave the details of runtime to container process ● Implement these services by distributing processes across a cluster (Node 1, Node 2, Node 3)
  14. 14. Service Spec Protobuf Example

    message ServiceSpec {
        // Task defines the task template this service will spawn.
        TaskSpec task = 2 [(gogoproto.nullable) = false];

        // UpdateConfig controls the rate and policy of updates.
        UpdateConfig update = 6;

        // Service endpoint specifies the user provided configuration
        // to properly discover and load balance a service.
        EndpointSpec endpoint = 8;
    }
  15. 15. Service Protobuf Example

    message Service {
        ServiceSpec spec = 3;

        // UpdateStatus contains the status of an update, if one is in
        // progress.
        UpdateStatus update_status = 5;

        // Runtime state of service endpoint. This may be different
        // from the spec version because the user may not have entered
        // the optional fields like node_port or virtual_ip and it
        // could be auto allocated by the system.
        Endpoint endpoint = 4;
    }
  16. 16. Spec = Desired State; Object = Current State. Reconciliation drives the Object toward the Spec (Spec -> Object).
  17. 17. Task Model Runtime. Prepare: set up resources. Start: start the task. Wait: wait until the task exits. Shutdown: stop the task cleanly.
  18. 18. Task Spec Protobuf Example

    message TaskSpec {
        oneof runtime {
            NetworkAttachmentSpec attachment = 8;
            ContainerSpec container = 1;
        }

        // Resource requirements for the container.
        ResourceRequirements resources = 2;

        // RestartPolicy specifies what to do when a task fails or finishes.
        RestartPolicy restart = 4;

        // Placement specifies node selection constraints.
        Placement placement = 5;

        // Networks specifies the list of network attachment
        // configurations (which specify the network and per-network
        // aliases) that this task spec is bound to.
        repeated NetworkAttachmentConfig networks = 7;
    }
  19. 19. Task Protobuf Example

    message Task {
        // Spec defines the desired state of the task as specified by the user.
        // The system will honor this and will *never* modify it.
        TaskSpec spec = 3 [(gogoproto.nullable) = false];

        // DesiredState is the target state for the task. It is set to
        // TaskStateRunning when a task is first created, and changed to
        // TaskStateShutdown if the manager wants to terminate the task. This field
        // is only written by the manager.
        TaskState desired_state = 10;

        TaskStatus status = 9 [(gogoproto.nullable) = false];
    }
  20. 20. Task Model: the atomic scheduling unit of SwarmKit. The Orchestrator reconciles Spec (desired state) against the Object (current state) to produce Task0, Task1, …, Taskn, which are placed by the Scheduler.
  21. 21. Parallel Concepts in Kubernetes

    type Deployment struct {
        // Specification of the desired behavior of the Deployment.
        // +optional
        Spec DeploymentSpec

        // Most recently observed status of the Deployment.
        // +optional
        Status DeploymentStatus
    }
  22. 22. Declarative

    $ docker network create -d overlay backend
    vpd5z57ig445ugtcibr11kiiz

    $ docker service create -p 6379:6379 --network backend redis
    pe5kyzetw2wuynfnt7so1jsyl

    $ docker service scale serene_euler=3
    serene_euler scaled to 3

    $ docker service ls
    ID            NAME          REPLICAS  IMAGE  COMMAND
    pe5kyzetw2wu  serene_euler  3/3       redis
  23. 23. Peanut Butter Demo Time!
  24. 24. And now with jelly!
  25. 25. Consistency
  26. 26. Data Flow: the user submits a ServiceSpec (containing a TaskSpec) to the Manager, which stores it as a Service. The Manager expands the Service into Tasks, each carrying a copy of the TaskSpec, and dispatches them to Workers.
  27. 27. Consistency: Field Ownership. Only one component of the system can write to a given field.
  28. 28. Task Protobuf Example, with per-field owners (User, Orchestrator, Allocator, Scheduler, or Shared in the original slide's ownership column)

    message Task {
        TaskSpec spec = 3;
        string service_id = 4;
        uint64 slot = 5;
        string node_id = 6;
        TaskStatus status = 9;
        TaskState desired_state = 10;
        repeated NetworkAttachment networks = 11;
        Endpoint endpoint = 12;
        Driver log_driver = 13;
    }
  29. 29. Field Handoff: Task Status. While a task's state is < Assigned, the Manager owns the status field; once the state is >= Assigned, ownership hands off to the Worker.
  30. 30. Task State: New → Allocated → Assigned → Ready → Starting → Running, ending in a terminal state (Complete, Shutdown, Failed, Rejected). The Manager handles the pre-run states; the Worker takes over from Preparing onward.
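One invariant implied by this lifecycle is that task state only moves forward. A minimal sketch of that check, using the states named on the slide (real SwarmKit defines a few additional states):

```go
package main

import "fmt"

// order lists task states in their forward-only sequence, as on the slide.
var order = []string{
	"New", "Allocated", "Assigned", "Ready", "Starting",
	"Running", "Complete", "Shutdown", "Failed", "Rejected",
}

// rank maps each state to its position in the sequence.
var rank = func() map[string]int {
	m := make(map[string]int)
	for i, s := range order {
		m[s] = i
	}
	return m
}()

// canTransition reports whether a task may move from one state to another:
// states never move backward.
func canTransition(from, to string) bool {
	return rank[to] > rank[from]
}

func main() {
	fmt.Println(canTransition("Assigned", "Running")) // true
	fmt.Println(canTransition("Running", "New"))      // false
}
```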
  31. 31. The Problem: Observability and Controllability. Inputs range from low to high observability: Failure, Task/Pod State, User Input.
  32. 32. Node Topology in SwarmKit
  33. 33. We’ve got a bunch of nodes… now what? Task management is only half the fun. These tasks can be scheduled across a cluster for HA!
  34. 34. We’ve got a bunch of nodes… now what? Your cluster needs to communicate.
  35. 35. ● Not the mechanism, but the sequence and frequency ● Three scenarios important to us now ○ Node registration ○ Workload dispatching ○ State reporting Manager <-> Worker Communication
  36. 36. ● Most approaches take the form of two patterns ○ Push, where the Managers push to the Workers ○ Pull, where the Workers pull from the Managers Manager <-> Worker Communication
  37. 37. Registration & Payload. Push Model: 1 - Register (with a discovery service), 2 - Discover, 3 - Payload pushed from Manager to Worker. Pull Model: Workers connect to a Manager and pull the Payload directly.
  38. 38. Push Model. Pros: provides better control over communication rate (Managers decide when to contact Workers). Cons: requires a discovery mechanism, more failure scenarios, harder to troubleshoot. Pull Model. Pros: simpler to operate (Workers connect to Managers and don't need to bind), can easily traverse networks, easier to secure, fewer moving parts. Cons: Workers must maintain a connection to Managers at all times.
  39. 39. Kubernetes (Manager-initiated, push-style): fetching logs, attaching to running pods via kubectl, port-forwarding.
  40. 40. SwarmKit (Manager → Worker): work dispatching in batches, next heartbeat timeout.
  41. 41. SwarmKit (Worker → Manager, connection always open): self-registration, healthchecks, task status, ...
  42. 42. Rate Control in SwarmKit
  43. 43. Rate Control: Heartbeats ● Manager dictates heartbeat rate to Workers ● Rate is configurable (not by end user) ● Managers agree on same rate by consensus via Raft ● Managers add jitter so pings are spread over time (avoid bursts) Ping? Pong! Ping me back in 5.2 seconds
  44. 44. Rate Control: Workloads ● Worker opens a gRPC stream to receive workloads ● Manager can send data whenever it wants to ● Manager will send data in batches ● Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first ● Adds little delay (at most 100ms) but drastically reduces amount of communication Give me work to do 100ms - [Batch of 12 ] 200ms - [Batch of 26 ] 300ms - [Batch of 32 ] 340ms - [Batch of 100] 360ms - [Batch of 100] 460ms - [Batch of 42 ] 560ms - [Batch of 23 ]
  45. 45. Distributed Consensus
  46. 46. The Raft Consensus Algorithm ● This talk is about SwarmKit, but Raft is also used by Kubernetes ● The implementation is slightly different but the behavior is the same
  47. 47. The Raft Consensus Algorithm SwarmKit and Kubernetes don’t make the decision about how to handle leader election and log replication, Raft does! github.com/kubernetes/kubernetes/tree/master/vendor/github.com/coreos/etcd/raft github.com/docker/swarmkit/tree/master/vendor/github.com/coreos/etcd/raft No really...
  48. 48. The Raft Consensus Algorithm secretlivesofdata.com Want to know more about Raft?
  49. 49. Single Source of Truth ● SwarmKit implements Raft directly, instead of via etcd ● The logs live in /var/lib/docker/swarm (or, for etcd under Kubernetes, /var/lib/etcd/member/wal) ● Easy to observe (and even read)
  50. 50. Manager <-> Worker Communication in SwarmKit
  51. 51. Communication (Leader, Follower, Follower) ● Worker can connect to any reachable Manager ● Followers will forward traffic to the Leader
  52. 52. Reducing Network Load ● Followers multiplex all their workers to the Leader using a single connection ● Backed by gRPC channels (HTTP/2 streams) ● Reduces Leader networking load by spreading the connections evenly. Example: on a cluster with 10,000 workers and 5 managers, each manager will only have to handle about 2,000 connections, and each follower will forward its 2,000 workers using a single socket to the leader.
  53. 53. Leader Failure (Raft!) ● Upon Leader failure, a new one is elected ● All managers start redirecting worker traffic to the new one ● Transparent to workers
  54. 54. Leader Election (Raft!) ● Upon Leader failure, a new one is elected ● All managers start redirecting worker traffic to the new one ● Transparent to workers
  55. 55. Manager Failure ● Manager sends the list of all managers' addresses (Manager 1 Addr, Manager 2 Addr, Manager 3 Addr) to Workers ● When a new manager joins, all workers are notified ● Upon manager failure, workers will reconnect to a different manager
  56. 56. Manager Failure (Worker POV) ● Manager sends the list of all managers' addresses to Workers ● When a new manager joins, all workers are notified ● Upon manager failure, workers will reconnect to a different manager
  57. 57. Manager Failure (Worker POV): upon failure, the Worker reconnects to a random manager from its list.
  58. 58. Presence Scalable presence in a distributed environment
  59. 59. Presence ● Swarm still handles node management, even if you use the Kubernetes scheduler ● Manager Leader commits Worker state (Up or Down) into Raft ○ Propagates to all managers via Raft ○ Recoverable in case of leader re-election
  60. 60. Presence ● Heartbeat TTLs are kept in Manager Leader memory ○ Otherwise, every ping would result in a quorum write ○ The Leader keeps worker<->TTL in a heap (time.AfterFunc) ○ Upon leader failover, workers are given a grace period to reconnect ■ Workers are considered Unknown until they reconnect ■ If they do, they move back to Up ■ If they don't, they move to Down
  61. 61. Consistency
  62. 62. Sequencer ● Every object in the store has a Version field ● Version stores the Raft index when the object was last updated ● Updates must provide a base Version and are rejected if it is out of date ● This provides CAS semantics ● Versions are also exposed through API calls that change objects in the store
  63. 63. Consistency: Versioned Updates

    service := getCurrentService()
    spec := service.Spec
    spec.Image = "my.serv/myimage:mytag"
    update(spec, service.Version)
  64. 64. Sequencer. Original object (Version = Raft index when it was last updated): Service ABC { Version = 189, Spec { Replicas = 4, Image = registry:2.3.0, ... } }
  65. 65. Sequencer. Original object: Service ABC { Version = 189, Spec { Replicas = 4, Image = registry:2.3.0, ... } }. Update request: Service ABC { Version = 189, Spec { Replicas = 4, Image = registry:2.4.0, ... } }
  66. 66. Sequencer. The update's base Version (189) matches the stored object's Version, so the update is accepted.
  67. 67. Sequencer. Updated object: Service ABC { Version = 190, Spec { Replicas = 4, Image = registry:2.4.0, ... } }
  68. 68. Sequencer. Updated object: Service ABC { Version = 190, Spec { Replicas = 4, Image = registry:2.4.0, ... } }. Update request: Service ABC { Version = 189, Spec { Replicas = 5, Image = registry:2.3.0, ... } }
  69. 69. Sequencer. The update's base Version (189) is now out of date, so the update is rejected.
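The accept/reject sequence above is plain compare-and-swap. A minimal in-memory sketch with hypothetical types (the real store compares against the Raft index):

```go
package main

import (
	"errors"
	"fmt"
)

// object is a stored record whose version is the index of its last update
// (simplified here to a plain counter).
type object struct {
	spec    string
	version uint64
}

type store struct {
	objects map[string]object
	index   uint64
}

var errOutOfDate = errors.New("update rejected: stale version")

// update applies CAS semantics: it succeeds only if the caller's base
// version matches the stored one, then advances the version.
func (s *store) update(id, spec string, baseVersion uint64) error {
	cur, ok := s.objects[id]
	if !ok || cur.version != baseVersion {
		return errOutOfDate
	}
	s.index++
	s.objects[id] = object{spec: spec, version: s.index}
	return nil
}

func main() {
	s := &store{objects: map[string]object{"ABC": {spec: "registry:2.3.0", version: 189}}, index: 189}
	fmt.Println(s.update("ABC", "registry:2.4.0", 189)) // <nil>: accepted, version becomes 190
	fmt.Println(s.update("ABC", "replicas=5", 189))     // rejected: base version is stale
}
```

A writer that loses the race simply re-reads the object, obtains the new Version, and retries its update against current state.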
  70. 70. github.com/docker/swarmkit github.com/coreos/etcd (Raft)
  71. 71. Thank you! @stevvooe @rhein_wein
