Global State Management of Micro Services

Using Apache Zookeeper for orchestrating deployments
Murat Ezbiderli
Principal Software Engineer, Search Cloud

(Micro-)Service Management Lifecycle
❏ Host provisioning and management
❏ Defining service topology
❏ Deployment
❏ Configuration
❏ Monitoring
❏ Alerting and mitigation

Host Provisioning and Management
❏ Hardware acquisition
❏ DNS/DHCP configurations
❏ IP Address Management
❏ Base image setup
❏ Role assignment
❏ Provisioning and configuring OS/app stack
❏ Host and base service monitoring

Host Provisioning Technologies
❏ Puppet (www.puppet.com)
❏ Razor (https://github.com/puppetlabs/Razor)
❏ Rundeck (http://www.rundeck.org)
❏ Python Celery (http://www.celeryproject.org/)
❏ Docker (www.docker.com)

Defining Service Topology
[Diagram: service topology hierarchy. Labels: SFDC, DC, Pod-1 … Pod-N, Buddy-1 … Buddy-N, Mario, Luigi, Peer 1, Peer 2, Peer 3]

Defining Service Topology
❏ Defining service height (global, data center or lower)
❏ Clusters (environments)
❏ HA/redundancy groups
❏ Server roles and mapping
❏ Deployment groups
❏ Service stack and dependencies
❏ Network access requirements/restrictions (ACLs)
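The topology attributes listed above could be captured in a declarative record. What follows is a minimal sketch, not the format used in the talk: the class name `ServiceTopology`, every field name, and the sample hosts are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceTopology:
    """Hypothetical topology record; field names are illustrative only."""
    name: str
    height: str                      # "global", "datacenter", or lower
    cluster: str                     # environment the service runs in
    ha_groups: Dict[str, List[str]] = field(default_factory=dict)  # group -> hosts
    roles: Dict[str, str] = field(default_factory=dict)            # host -> role
    deployment_groups: List[List[str]] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    acls: List[str] = field(default_factory=list)

    def hosts(self) -> List[str]:
        """All hosts that have a role assigned, in sorted order."""
        return sorted(self.roles)


# Sample values are made up for the example.
topology = ServiceTopology(
    name="search",
    height="datacenter",
    cluster="prod",
    ha_groups={"buddy-1": ["host-a", "host-b"]},
    roles={"host-a": "indexer", "host-b": "query"},
    deployment_groups=[["host-a"], ["host-b"]],
    dependencies=["zookeeper"],
)
```

Keeping the record declarative means the same description can drive role assignment, deployment grouping, and HA checks.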

(One button) Deployment
❏ Automated means to get bits out there without interfering with production

Features of a Deployment System
❏ Specification of source(s) and target(s)
❏ Initiating rollout
❏ Canarying
❏ Sophisticated orchestration
❏ Health mediation
❏ Automated failure detection and recovery
❏ Rollbacks
❏ High availability
❏ Ability to work with different artifactories and package formats
❏ Performance
❏ Efficient use of network and server resources
❏ Security and audit trail
❏ Monitoring and reporting

10K Feet View
[Diagram: deployment pipeline overview. Components: SCM (p4, sd), Build Pipeline, Artifactory, Deployment System (Maestro), Zookeeper, Group-1 … Group-N Search Servers; environments: Internal, Dev, Production. Flow labels: Deployment spec, Service topology, Config updates; Render/Package; Publish; Detect/Queue; Publish declared state; Receive notifications; Publish actual state; Receive declared state]

Technical Highlights
❏ Zookeeper encapsulates both the global state and the notification mechanism
❏ Single point of failure, hence heavily clustered
❏ Coordinator service is responsible for:
• Detecting and queuing deployment requests
• Initiating and maintaining global deployment state
• Determining which node(s) to deploy next
• Administrative operations (suspend/cancel/rollback etc.)
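The pairing of shared state with change notifications can be illustrated without a real Zookeeper ensemble. The sketch below is an in-memory stand-in, not a Zookeeper client: `StateStore`, the `/declared/...` and `/actual/...` paths, and the callback shape are all assumptions chosen for the example, and real Zookeeper watches behave differently in detail.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Watcher = Callable[[str, str], None]


class StateStore:
    """In-memory stand-in for a shared store with change notifications."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Watcher]] = defaultdict(list)

    def watch(self, path: str, callback: Watcher) -> None:
        """Register a callback invoked on every write to `path`."""
        self._watchers[path].append(callback)

    def set(self, path: str, value: str) -> None:
        """Store the value, then notify everyone watching this path."""
        self._data[path] = value
        for callback in list(self._watchers[path]):
            callback(path, value)

    def get(self, path: str) -> Optional[str]:
        return self._data.get(path)


store = StateStore()


def agent_on_declared(path: str, version: str) -> None:
    # The agent reacts to declared state, then reports what it actually runs.
    store.set("/actual/host-a", version)


store.watch("/declared/host-a", agent_on_declared)
store.set("/declared/host-a", "v2")  # coordinator publishes declared state
```

After the last line runs, the actual state for `host-a` matches the declared version, which is the convergence loop the slides describe.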

Maestro Coordinator
❏ Coordinator service is responsible for:
• Detecting and queuing deployment requests
• Initiating and managing deployments
• Monitoring and maintaining global deployment state
• Determining which node(s) to deploy next (FSM)
• Administrative operations (suspend/cancel/rollback etc.)
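A finite state machine for a deployment's lifecycle might look like the sketch below. The state names, event names, and transition table are assumptions made for illustration; the talk does not specify Maestro's actual states.

```python
from enum import Enum


class DeployState(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    SUSPENDED = "suspended"
    COMPLETED = "completed"
    CANCELLED = "cancelled"
    ROLLED_BACK = "rolled_back"


# (current state, event) -> next state; any pair not listed is rejected.
TRANSITIONS = {
    (DeployState.QUEUED, "start"): DeployState.IN_PROGRESS,
    (DeployState.QUEUED, "cancel"): DeployState.CANCELLED,
    (DeployState.IN_PROGRESS, "suspend"): DeployState.SUSPENDED,
    (DeployState.IN_PROGRESS, "finish"): DeployState.COMPLETED,
    (DeployState.IN_PROGRESS, "cancel"): DeployState.CANCELLED,
    (DeployState.IN_PROGRESS, "rollback"): DeployState.ROLLED_BACK,
    (DeployState.SUSPENDED, "resume"): DeployState.IN_PROGRESS,
    (DeployState.SUSPENDED, "cancel"): DeployState.CANCELLED,
    (DeployState.SUSPENDED, "rollback"): DeployState.ROLLED_BACK,
}


def next_state(state: DeployState, event: str) -> DeployState:
    """Return the state reached by `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} not allowed in state {state.name}") from None
```

An explicit table makes the administrative operations (suspend, cancel, rollback) auditable: an operator request is either a listed transition or an error.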

Maestro Agent
❏ Agents are responsible for:
• Receiving deployment notifications (declared state)
• Creating an execution plan to close the delta between declared and actual state
• Performing the execution plan
• Publishing actual state after each step
❏ An execution plan contains tasks such as:
• Downloading and extracting service packages
• Fine-grained configuration on disk
• Stopping old service, starting new service, etc.
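Deriving an execution plan from the gap between declared and actual state can be sketched as follows. The state keys (`version`, `config`) and the task names are hypothetical, chosen to mirror the task types listed above rather than taken from Maestro.

```python
from typing import Dict, List


def build_plan(declared: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the ordered tasks needed to move `actual` to `declared`."""
    plan: List[str] = []
    if declared.get("version") != actual.get("version"):
        plan += ["download_package", "extract_package"]
    if declared.get("config") != actual.get("config"):
        plan.append("write_config")
    if plan:
        # Any change on disk requires swapping the running service.
        plan += ["stop_old_service", "start_new_service"]
    return plan
```

When the two states already match, the plan is empty and the agent has nothing to do, which is what makes repeated notifications harmless.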

Health Mediation and Different Propagation Strategies
❏ We should not push bad bits to all hosts simultaneously
❏ Some strategies include:
• Canarying on a small group of servers first
• Following a Fibonacci/exponential sequence
• Respecting HA groupings (do not take down an entire group at once)
❏ Check cluster health after each iteration
• Ephemeral nodes in ZK
• Cluster level watchdogs and other signals
❏ Rollback if things go sideways
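The propagation strategies above can be sketched as a batch planner plus an HA guard. This is an illustrative sketch under assumptions: the function names, the doubling rule, and the group layout are not from the talk, and a real system would also gate each batch on a health check.

```python
from typing import Dict, List


def exponential_batches(hosts: List[str], canary_size: int = 1) -> List[List[str]]:
    """Split hosts into batches that start at `canary_size` and double each round."""
    if canary_size < 1:
        raise ValueError("canary_size must be at least 1")
    batches: List[List[str]] = []
    start, size = 0, canary_size
    while start < len(hosts):
        batches.append(hosts[start:start + size])
        start += size
        size *= 2
    return batches


def breaks_ha(batch: List[str], ha_groups: Dict[str, List[str]]) -> bool:
    """True if the batch would take down every member of some HA group."""
    taken = set(batch)
    return any(members and set(members) <= taken for members in ha_groups.values())


hosts = [f"host-{i}" for i in range(10)]
batch_sizes = [len(batch) for batch in exponential_batches(hosts)]
```

With ten hosts and a canary of one, `batch_sizes` is `[1, 2, 4, 3]`: the final batch is simply whatever remains. A coordinator could reorder or split any batch for which `breaks_ha` returns `True`.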

High Availability and Failure Recovery
❏ Coordinator:
• Persist and restore (non-)active deployment state
• Re-initialize FSM and resume deployment propagation
❏ Agent:
• Re-entrant tasks allow the agent to fail at any time and resume safely
• Always check declared state upon startup and create a new execution plan as necessary
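Re-entrancy can be illustrated with a runner that skips tasks already recorded as done, so a restart after a crash picks up where it left off. This is a sketch under assumptions: the `completed` set stands in for state that would be persisted across restarts, and the task names and simulated crash are invented for the example.

```python
from typing import Callable, List, Set


def run_plan(plan: List[str], completed: Set[str],
             execute: Callable[[str], None]) -> None:
    """Run each task not yet completed; safe to call again after a failure."""
    for task in plan:
        if task in completed:
            continue  # already done in an earlier attempt
        execute(task)
        completed.add(task)  # recorded only after the task succeeds


plan = ["download_package", "write_config", "start_new_service"]
completed: Set[str] = set()   # stands in for state persisted across restarts
log: List[str] = []
crash_once = {"write_config"}  # simulate one failure on this task


def execute(task: str) -> None:
    if task in crash_once:
        crash_once.discard(task)
        raise RuntimeError(f"simulated crash during {task}")
    log.append(task)


try:
    run_plan(plan, completed, execute)   # first attempt fails midway
except RuntimeError:
    pass
run_plan(plan, completed, execute)       # restart resumes from the failed task
```

The design relies on each task being safe to repeat, since a task that crashed partway through is executed again from its beginning.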

Service Configuration
❏ Reduce reaction time to minutes rather than hours or days
❏ Cluster configuration (topology)
❏ Service (application) configuration
❏ Utilize the same pipeline with trimmed tasks
• Detect config overrides
• Render on disk or Zookeeper
• Staggered restart of services
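Detecting config overrides and rendering the effective configuration can be sketched with plain dictionaries. The function names and sample keys are assumptions for illustration; the talk does not describe the actual override format.

```python
from typing import Dict, Set


def changed_keys(current: Dict[str, str], overrides: Dict[str, str]) -> Set[str]:
    """Keys whose override value differs from, or is missing in, `current`."""
    return {
        key for key, value in overrides.items()
        if key not in current or current[key] != value
    }


def render(current: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Effective config: overrides win over current values."""
    return {**current, **overrides}


current = {"timeout_ms": "500", "log_level": "info"}
overrides = {"timeout_ms": "250", "log_level": "info"}
to_apply = changed_keys(current, overrides)
```

Only when `to_apply` is non-empty does the pipeline need to render the new config and schedule a staggered restart.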

Thank you!
mezbiderli@salesforce.com
