Everything You Thought You Already Knew About Orchestration

Laura Frank
Director of Engineering,
Codeship

Agenda
Managing Distributed State with Raft
Quorum 101
Leader Election
Log Replication
Service Scheduling
Failure Recovery
bonus debugging tips!

The Big Problem(s)
What are tools like Swarm and Kubernetes trying to do?
They’re trying to get a collection of nodes to behave like a single node.
• How does the system maintain state?
• How does work get scheduled?

[Diagram: a raft consensus group of three managers (one leader, two followers), with worker nodes attached]

Quorum
The minimum number of votes that a consensus
group needs in order to be allowed to perform
an operation.
Without quorum, your system can’t do work.

Math!
Managers   Quorum   Fault Tolerance
1          1        0
2          2        0
3          2        1
4          3        1
5          3        2
6          4        2
7          4        3
Quorum = floor(N/2) + 1
In simpler terms, it means a majority.
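The quorum table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names `quorum` and `fault_tolerance` are made up for illustration, and N/2 is read as integer division.

```python
def quorum(managers: int) -> int:
    """Smallest majority of a group of `managers` nodes: floor(N/2) + 1."""
    return managers // 2 + 1


def fault_tolerance(managers: int) -> int:
    """How many managers can fail while a majority is still reachable."""
    return managers - quorum(managers)


if __name__ == "__main__":
    print("Managers  Quorum  Fault Tolerance")
    for n in range(1, 8):
        print(f"{n:<9} {quorum(n):<7} {fault_tolerance(n)}")
```

Note that going from an odd number of managers to the next even number raises the quorum without raising the fault tolerance.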

Having two managers instead of
one actually doubles your
chances of losing quorum.
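One way to see why: with two managers the quorum is two, so losing either one loses quorum. A rough sketch, assuming each manager fails independently with the same probability `p` (the value 0.01 is an arbitrary illustration, not a measured failure rate, and `quorum_loss_probability` is a hypothetical helper):

```python
from math import comb


def quorum_loss_probability(managers: int, p: float) -> float:
    """Probability that fewer than a majority of managers are up,
    assuming independent failures with per-node probability p."""
    quorum = managers // 2 + 1
    max_failures = managers - quorum  # failures the group can absorb
    survive = sum(
        comb(managers, k) * p**k * (1 - p) ** (managers - k)
        for k in range(max_failures + 1)
    )
    return 1 - survive


if __name__ == "__main__":
    p = 0.01  # hypothetical per-manager failure probability
    for n in (1, 2, 3):
        print(n, quorum_loss_probability(n, p))
```

Under these assumptions the two-manager figure is 1 - (1 - p)^2, which is close to 2p for small p, while three managers come out well below the single-manager case.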

Quorum With Multiple Regions
Pay attention to datacenter topology when placing managers.
Manager Nodes   Distribution across 3 Regions
3               1-1-1
5               1-2-2
7               3-2-2
9               3-3-3
Magically works with Docker for AWS.
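The distributions listed here share one property: losing any single region still leaves a majority of managers. A small sketch that checks this, where `survives_any_region_loss` is a hypothetical helper written for this example:

```python
def survives_any_region_loss(distribution: list[int]) -> bool:
    """True if the managers left after losing any one region still form a majority."""
    total = sum(distribution)
    quorum = total // 2 + 1
    return all(total - region >= quorum for region in distribution)


if __name__ == "__main__":
    for layout in ([1, 1, 1], [1, 2, 2], [3, 2, 2], [3, 3, 3]):
        print(layout, survives_any_region_loss(layout))
```

By contrast, a lopsided layout such as 3-1-1 fails the check, because losing the region with three managers leaves only two of five.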

I think I’ll just write my
own distributed
consensus algorithm.
-no sensible person

Raft is responsible for…
• Log replication
• Leader election
• Safety (won’t talk about this much today)
• Being easier to understand

Raft is used everywhere…
…that etcd is used
Orchestration systems typically use a key/value store backed by a consensus algorithm.
In a lot of cases, that algorithm is Raft!

SwarmKit implements the
Raft algorithm directly.

In most cases, you don’t want to run work on your manager nodes
Participating in a Raft consensus group is work, too.
Make your manager nodes unavailable for tasks:
docker node update --availability drain <NODE>
*I will run work on managers for educational purposes

Leader Election &
Log Replication

[Diagram: four managers in different states: leader, candidate, follower, offline]

The log is the source of truth for
your application.

In the context of distributed computing (and this
talk), a log is an append-only, time-based record
of data.
[Diagram: a log holding 2 10 30 25 5 12; the first entry is on the left, and new entries are appended on the right]
This log is for computers, not humans.

In simple systems, the log is pretty straightforward.
[Diagram: a client sends the entry 12 to a single server, which appends it to its log: 2 10 30 25 5 12]

In a manager group, that log entry can only “become truth” once it is confirmed by a majority of the managers (quorum!)
[Diagram: a client sends the entry 12 to the leader manager, which replicates it to two follower managers]
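The majority rule can be sketched with a toy model. This is not SwarmKit's implementation, and real Raft also tracks terms, log indexes, and retries; the `Manager` class and `replicate` function are hypothetical names used only to show the idea that the leader counts acknowledgements before treating an entry as committed.

```python
from dataclasses import dataclass, field


@dataclass
class Manager:
    name: str
    online: bool = True
    log: list[int] = field(default_factory=list)

    def append(self, entry: int) -> bool:
        """Store the entry and acknowledge it, unless this manager is offline."""
        if not self.online:
            return False
        self.log.append(entry)
        return True


def replicate(leader: Manager, followers: list[Manager], entry: int) -> bool:
    """Append on the leader, send to the followers, and report whether the
    entry is stored on a majority of the group (leader included)."""
    group_size = 1 + len(followers)
    quorum = group_size // 2 + 1
    leader.log.append(entry)
    acks = 1  # the leader counts toward the majority
    acks += sum(follower.append(entry) for follower in followers)
    return acks >= quorum


if __name__ == "__main__":
    leader = Manager("manager-1")
    followers = [Manager("manager-2"), Manager("manager-3", online=False)]
    print(replicate(leader, followers, 12))  # 2 of 3 managers acknowledged
```

With one follower offline the entry still reaches two of three managers, so it counts as committed; with both followers offline it would not.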

In distributed computing, it’s essential that you
understand log replication.
bit.ly/logging-post

Debugging Tip
Watch the Raft logs.
Monitor via inotifywait OR
just read them directly!

[Diagram: three kinds of problems: HA application problems, scheduling problems, orchestrator problems]

Scheduling constraints
Restrict services to specific nodes, such as specific
architectures, security levels, or types
docker service create \
  --constraint 'node.labels.type==web' my-app

New in 17.04.0-ce
Topology-aware scheduling!!1!
Implements a spread strategy over nodes that belong to a
certain category.
Unlike --constraint, this is a “soft” preference
--placement-pref 'spread=node.labels.dc'
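As a toy illustration of what spreading over a category means, the sketch below hands out tasks evenly across the values of a node label. It is not Swarm's scheduler; the `spread` function, the node names, and the `dc` values are all hypothetical.

```python
def spread(tasks: int, node_labels: dict[str, str]) -> dict[str, int]:
    """Assign tasks one at a time to the label value (for example a
    datacenter) that currently has the fewest tasks."""
    categories = sorted(set(node_labels.values()))
    placed = {category: 0 for category in categories}
    for _ in range(tasks):
        target = min(categories, key=lambda category: placed[category])
        placed[target] += 1
    return placed


if __name__ == "__main__":
    # Hypothetical nodes and their node.labels.dc values.
    nodes = {"node-1": "east", "node-2": "east", "node-3": "east", "node-4": "west"}
    print(spread(6, nodes))
```

In this sketch the tasks are balanced across datacenters rather than across individual nodes, so the single node in "west" receives as many tasks as the three "east" nodes combined.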

Swarm will not rebalance
healthy tasks when a new
node comes online

Debugging Tip
Add a manager to your
Swarm running with
--availability drain
and in Engine debug mode

Regain quorum
• Bring the downed nodes back online (derp)
• On a healthy manager, run
  docker swarm init --force-new-cluster
  This will create a new cluster with one healthy manager
• You need to promote new managers

Restore from a backup in 5 easy steps!
• Bring up a new manager and stop Docker
• sudo rm -rf /var/lib/docker/swarm
• Copy backup to /var/lib/docker/swarm
• Start Docker
• docker swarm init (--force-new-cluster)

But wait, there’s a bug… or a feature
• In general, users shouldn’t be allowed to modify IP addresses of nodes
• Restoring from a backup == old IP address for node1
• Workaround is to use elastic IPs with ability to reassign
