This document discusses using Apache Zookeeper to orchestrate microservice deployments. It describes how Zookeeper can be used to define service topology, enable one-button deployments through a coordinator service called Maestro, and ensure high availability and failure recovery. The Maestro coordinator initiates and manages deployments by monitoring global state in Zookeeper and determining which nodes to deploy next. Maestro agents on each node receive notifications, create execution plans to deploy updates, and publish status to Zookeeper. Different propagation strategies like canary deployments and rollback capabilities provide health mediation during deployments.
Global State Management of Micro Services
1. Global State Management of Micro Services
Using Apache Zookeeper for orchestrating deployments
Murat Ezbiderli
Principal Software Engineer, Search Cloud
2. (Micro-)Service Management Lifecycle
❏ Host provisioning and management
❏ Defining service topology
❏ Deployment
❏ Configuration
❏ Monitoring
❏ Alerting and mitigation
3. Host Provisioning and Management
❏ Hardware acquisition
❏ DNS/DHCP configurations
❏ IP Address Management
❏ Base image setup
❏ Role assignment
❏ Provisioning and configuring OS/app stack
❏ Host and base service monitoring
6. Defining Service Topology
❏ Defining service height (global, data center or lower)
❏ Clusters (environments)
❏ HA/redundancy groups
❏ Server roles and mapping
❏ Deployment groups
❏ Service stack and dependencies
❏ Network access requirements/restrictions (ACLs)
8. Features of a Deployment System
❏ Specification of source(s) and target(s)
❏ Initiating rollout
❏ Canarying
❏ Sophisticated orchestration
❏ Health mediation
❏ Automated failure detection and recovery
❏ Rollbacks
❏ High availability
❏ Ability to work with different artifactories and package formats
❏ Performance
❏ Efficient use of network and server resources
❏ Security and audit trail
❏ Monitoring and reporting
9. 10K Feet View
[Diagram: overview of the deployment pipeline]
❏ Components: SCM (p4, sd), Build Pipeline, Artifactory, Deployment System (Maestro), Zookeeper, Group-1 through Group-N Search Servers
❏ Environments: Production, Internal, Dev
❏ Inputs: Deployment spec, Service topology, Config updates
❏ Flow labels: Render/Package, Publish, Detect/Queue, Publish declared state, Receive notifications, Receive declared state, Publish actual state
10. Technical Highlights
❏ Zookeeper encapsulates both global state and notification mechanism
❏ Single point of failure, hence heavily clustered
❏ Coordinator service responsible for:
• Detecting and queuing deployment requests
• Initiating and maintaining global deployment state
• Determining which node(s) to deploy next
• Administrative operations (suspend/cancel/rollback etc.)
11. Maestro Coordinator
❏ Coordinator service responsible for:
• Detecting and queuing deployment requests
• Initiating and managing deployments
• Monitoring and maintaining global deployment state
• Determining which node(s) to deploy next (FSM)
• Administrative operations (suspend/cancel/rollback etc.)
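The slides do not show the coordinator's state machine, so the following is only a minimal sketch of how a deployment FSM with administrative events could look. The state names, event names, and the `next_state` helper are all hypothetical and chosen for illustration; they are not Maestro's actual states.

```python
from enum import Enum


class State(Enum):
    # Hypothetical deployment states; not taken from Maestro itself.
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    SUSPENDED = "suspended"
    ROLLING_BACK = "rolling_back"
    COMPLETED = "completed"
    CANCELLED = "cancelled"
    ROLLED_BACK = "rolled_back"


# (current state, event) -> next state. Any pair not listed is rejected.
TRANSITIONS = {
    (State.QUEUED, "start"): State.IN_PROGRESS,
    (State.QUEUED, "cancel"): State.CANCELLED,
    (State.IN_PROGRESS, "suspend"): State.SUSPENDED,
    (State.IN_PROGRESS, "finish"): State.COMPLETED,
    (State.IN_PROGRESS, "rollback"): State.ROLLING_BACK,
    (State.IN_PROGRESS, "cancel"): State.CANCELLED,
    (State.SUSPENDED, "resume"): State.IN_PROGRESS,
    (State.SUSPENDED, "rollback"): State.ROLLING_BACK,
    (State.SUSPENDED, "cancel"): State.CANCELLED,
    (State.ROLLING_BACK, "finish"): State.ROLLED_BACK,
}


def next_state(current, event):
    """Return the state reached from `current` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {current.name}"
        ) from None
```

Keeping the transitions in a single table makes the set of legal administrative operations easy to audit, and the current state is a single value that can be persisted and restored.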
12. Maestro Agent
❏ Agents responsible for:
• Receiving deployment notifications (declared state)
• Creating an execution plan to close the delta between declared and actual state
• Performing execution plan
• Publishing actual state after each step
❏ An execution plan contains tasks like
• Downloading and extracting service packages
• Fine-grained configuration on disk
• Stopping old service, starting new service, etc.
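The idea of planning from the delta between declared and actual state can be sketched as follows. This is an illustrative example only: the representation of state as a mapping from service name to version, the task names, and the `build_plan` function are assumptions, not the agent's real data model.

```python
def build_plan(declared, actual):
    """Return an ordered list of tasks that moves `actual` toward `declared`.

    Both arguments are dicts mapping a service name to a version string
    (an assumed representation). Services already at the declared version
    produce no tasks.
    """
    plan = []
    for service in sorted(declared):
        want = declared[service]
        have = actual.get(service)
        if have == want:
            continue  # no delta for this service
        plan.append(("download", service, want))
        plan.append(("extract", service, want))
        plan.append(("configure", service, want))
        if have is not None:
            plan.append(("stop", service, have))  # stop the old version
        plan.append(("start", service, want))
    return plan


# Usage: one service needs an upgrade, the other is already up to date.
plan = build_plan(
    declared={"indexer": "1.4", "search": "2.0"},
    actual={"indexer": "1.4", "search": "1.9"},
)
```

Because the plan is derived purely from the two states, an agent can recompute it at any time and get an empty plan once the host has converged.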
13. Health Mediation and Different Propagation Strategies
❏ We should not throw bad bits to all hosts simultaneously
❏ Some strategies include:
• Canarying on a small group of servers first
• Following a Fibonacci/exponential sequence
• Respecting HA groupings (do not take down an entire group at once)
❏ Check cluster health after each iteration
• Ephemeral nodes in ZK
• Cluster level watchdogs and other signals
❏ Rollback if things go sideways
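The strategies above can be combined in a simple batch planner. The sketch below is one possible interpretation, not the method used by Maestro: it starts with a canary batch, multiplies the batch size by a growth factor each round (exponential rather than Fibonacci), and never puts every host of a multi-host HA group into the same batch. The health check assumes the list of live hosts comes from something like ephemeral nodes; the function names, parameters, and threshold are hypothetical.

```python
def plan_batches(ha_groups, canary_size=1, growth=2):
    """Split hosts into deployment batches of growing size.

    ha_groups maps a group name to its list of hosts. Within one batch,
    a group with more than one host always keeps at least one host out.
    """
    remaining = {g: list(hosts) for g, hosts in ha_groups.items()}
    sizes = {g: len(hosts) for g, hosts in ha_groups.items()}
    batches = []
    target = canary_size
    while any(remaining.values()):
        batch = []
        taken = {g: 0 for g in remaining}
        progressed = True
        # Round-robin across groups until the batch is full or capped.
        while len(batch) < target and progressed:
            progressed = False
            for g in sorted(remaining):
                cap = max(1, sizes[g] - 1)  # leave one host per group
                if len(batch) < target and remaining[g] and taken[g] < cap:
                    batch.append(remaining[g].pop(0))
                    taken[g] += 1
                    progressed = True
        batches.append(batch)
        target *= growth
    return batches


def cluster_healthy(expected_hosts, live_hosts, min_live_fraction=0.9):
    """Return True if enough of the expected hosts are reported live."""
    live = len(set(expected_hosts) & set(live_hosts))
    return live >= min_live_fraction * len(expected_hosts)
```

A coordinator loop would deploy one batch, call a check like `cluster_healthy`, and either continue with the next batch or trigger a rollback.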
14. High Availability and Failure Recovery
❏ Coordinator:
• Persist and restore (non-)active deployment state
• Re-initialize FSM and resume deployment propagation
❏ Agent:
• Re-entrant tasks allow the agent to fail at any time
• Always check declared state upon startup and create a new execution plan as necessary
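Re-entrancy on the agent side can be illustrated with a small simulation. This is a sketch under assumptions: the `published` set stands in for the actual state the agent publishes after each step (in the real system this would live in Zookeeper), and `run_plan` and the `fail_at` parameter are hypothetical names used only to simulate a crash.

```python
def run_plan(steps, published, log, fail_at=None):
    """Run `steps` in order, skipping any step already recorded as done.

    published: set of completed steps (stand-in for published actual state).
    log: list recording the steps actually executed, for inspection.
    fail_at: optional step name before which a crash is simulated.
    """
    for step in steps:
        if step in published:
            continue  # completed in an earlier run; safe to skip
        if step == fail_at:
            raise RuntimeError(f"simulated crash before {step!r}")
        log.append(step)     # perform the work for this step
        published.add(step)  # publish actual state after the step


# Usage: the first run crashes midway, the second run resumes.
steps = ["download", "extract", "configure", "restart"]
published, log = set(), []
try:
    run_plan(steps, published, log, fail_at="configure")
except RuntimeError:
    pass  # the agent process died here
run_plan(steps, published, log)  # after restart: re-check state and continue
```

After the second run, each step has been executed exactly once, which is the property that makes it safe for an agent to fail and restart at any point.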
15. Service Configuration
❏ Reduce reaction time to minutes rather than hours or days
❏ Cluster configuration (topology)
❏ Service (application) configuration
❏ Utilize the same pipeline with trimmed tasks
• Detect config overrides
• Render on disk or Zookeeper
• Staggered restart of services