Clustered computing 
with CoreOS, fleet and etcd 
DEVIEW 2014 
Jonathan Boulle 
CoreOS, Inc.
@baronboulle 
jonboulle 
Jonathan Boulle 
@baronboulle 
jonboulle 
Who am I? 
South Africa -> Australia -> London -> San Francisco 
Red Hat -> Twitter -> CoreOS 
Linux, Python, Go, FOSS
Agenda 
● CoreOS Linux 
– Securing the internet 
– Application containers 
– Automatic updates 
● fleet 
– cluster-level init system 
– etcd + systemd 
● fleet and... 
– systemd: the good, the bad 
– etcd: the good, the bad 
– Golang: the good, the bad 
● Q&A
CoreOS Linux 
A minimal, automatically-updated 
Linux distribution, 
designed for distributed systems.
Why? 
● CoreOS mission: “Secure the internet” 
● Status quo: set up a server and never touch it 
● Internet is full of servers running years-old software 
with dozens of vulnerabilities 
● CoreOS: make updating the default, seamless option 
● Regular 
● Reliable 
● Automatic
How do we achieve this? 
● Containerization of applications 
● Self-updating operating system 
● Distributed systems tooling to make applications 
resilient to updates
[Diagram: a traditional distro bundles everything: KERNEL, SYSTEMD, SSH, DOCKER, PYTHON, JAVA, NGINX, MYSQL, OPENSSL and the APP all ship as part of the distro]
[Diagram: the same stack, with PYTHON, JAVA, NGINX, MYSQL, OPENSSL and the APP pulled out of the distro]
[Diagram: application containers (e.g. Docker) carry PYTHON, JAVA, NGINX, MYSQL, OPENSSL and the APP; the distro provides only KERNEL, SYSTEMD, SSH and DOCKER]
[Diagram: CoreOS base OS: KERNEL, SYSTEMD, SSH, DOCKER]
- Minimal base OS (~100MB) 
- Vanilla upstream components 
wherever possible 
- Decoupled from applications 
- Automatic, atomic updates
Automatic updates 
How do updates work? 
● Omaha protocol (check-in/retrieval) 
– Simple XML-over-HTTP protocol developed by 
Google to facilitate polling and pulling updates 
from a server
Omaha protocol 
Client sends application id and current version to the update server: 
<request protocol="3.0" version="CoreOSUpdateEngine-0.1.0.0"> 
  <app appid="{e96281a6-d1af-4bde-9a0a-97b76e56dc57}" 
       version="410.0.0" track="alpha" from_track="alpha"> 
    <event eventtype="3"></event> 
  </app> 
</request> 
Update server responds with the URL of an update to be applied: 
<url codebase="https://commondatastorage.googleapis.com/update-storage.core-os.net/amd64-usr/452.0.0/"></url> 
<package hash="D0lBAMD1Fwv8YqQuDYEAjXw6YZY=" name="update.gz" 
         size="103401155" required="false"> 
Client downloads the data, verifies the hash & cryptographic signature, and applies the update. 
Updater exits with a response code, then reports the update to the update server.
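To make the check-in step concrete, here is a minimal sketch in Go, using only the standard library, of POSTing the request XML above to an update server and printing the response; the endpoint URL is illustrative, not CoreOS's real update service.

// Minimal Omaha-style check-in sketch (illustrative endpoint, stdlib only).
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

const checkin = `<request protocol="3.0" version="CoreOSUpdateEngine-0.1.0.0">
  <app appid="{e96281a6-d1af-4bde-9a0a-97b76e56dc57}"
       version="410.0.0" track="alpha" from_track="alpha">
    <event eventtype="3"></event>
  </app>
</request>`

func main() {
	// POST the check-in XML to a (hypothetical) update server.
	resp, err := http.Post("https://update.example.com/v1/update/", "text/xml",
		strings.NewReader(checkin))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The response XML carries the <url> and <package> elements describing
	// where to fetch update.gz and how to verify it.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}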
Automatic updates 
How do updates work? 
● Omaha protocol (check-in/retrieval) 
– Simple XML-over-HTTP protocol developed by 
Google to facilitate polling and pulling updates from 
a server 
● Active/passive read-only root partitions 
– One for running the live system, one for updates
Active/passive root partitions 
Booted off partition A. 
Download update and commit 
to partition B. Change GPT.
Active/passive root partitions 
Reboot (into Partition B). 
If tests succeed, continue 
normal operation and mark 
success in GPT.
Active/passive root partitions 
But what if partition B fails 
update tests...
Active/passive root partitions 
Change GPT to point to 
previous partition, reboot. 
Try update again later.
Active/passive /usr partitions 
core-01 ~ # cgpt show /dev/sda3 
start size contents 
264192 2097152 Label: "USR-A" 
Type: Alias for coreos-rootfs 
UUID: 7130C94A-213A-4E5A-8E26-6CCE9662 
Attr: priority=1 tries=0 successful=1 
core-01 ~ # cgpt show /dev/sda4 
start size contents 
2492416 2097152 Label: "USR-B" 
Type: Alias for coreos-rootfs 
UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A 
Attr: priority=2 tries=1 successful=0
Active/passive /usr partitions 
● Single image containing most of the OS 
– Mounted read-only onto /usr 
– / is mounted read-write on top (persistent data) 
– Parts of /etc generated dynamically at boot 
– A lot of work moving default configs from /etc to /usr 
# /etc/nsswitch.conf: 
passwd: files usrfiles   (usrfiles -> /usr/share/baselayout/passwd) 
shadow: files usrfiles   (usrfiles -> /usr/share/baselayout/shadow) 
group:  files usrfiles   (usrfiles -> /usr/share/baselayout/group)
Atomic updates 
● Entire OS is a single read-only image 
core-01 ~ # touch /usr/bin/foo 
touch: cannot touch '/usr/bin/foo': Read-only file system 
– Easy to verify cryptographically 
● sha1sum on AWS or bare metal gives the same result 
– No chance of inconsistencies due to partial updates 
● e.g. pull a plug on a CentOS system during a yum update 
● At large scale, such events are inevitable
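A minimal sketch in Go of the verification idea above: hash the raw, read-only USR partition and compare digests across machines. The device path is illustrative; on a real CoreOS host the active partition may be /dev/sda3 or /dev/sda4.

// Hash the read-only OS image, mirroring the sha1sum check mentioned above.
package main

import (
	"crypto/sha1"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("/dev/sda3") // needs root; read-only access is enough
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	// Identical images produce identical digests, on AWS or on bare metal.
	fmt.Printf("%x\n", h.Sum(nil))
}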
Automatic, atomic updates are great! But... 
● Problem: 
– updates still require a reboot (to use new kernel and mount 
new filesystem) 
– Reboots cause application downtime... 
● Solution: fleet 
– Highly available, fault tolerant, distributed process scheduler 
● ... The way to run applications on a CoreOS cluster 
– fleet keeps applications running during server downtime
fleet – the “cluster-level init system” 
● fleet is the abstraction between machine and application: 
– init system manages processes on a machine 
– fleet manages applications on a cluster of machines 
● Similar to Mesos, but very different architecture (e.g. based on etcd/Raft, not Zookeeper/Paxos) 
● Uses systemd for machine-level process management, etcd for cluster-level co-ordination
fleet – low level view 
● fleetd binary (running on all CoreOS nodes) 
– encapsulates two roles: 
• engine (cluster-level unit scheduling – talks to etcd) 
• agent (local unit management – talks to etcd and systemd) 
● fleetctl command-line administration tool 
– create, destroy, start, stop units 
– retrieve current status of units/machines in the cluster 
● HTTP API
fleet – high level view 
[Diagram: fleet across the cluster and on a single machine]
systemd 
● Linux init system (PID 1) – manages processes 
– Relatively new; replaces SysVinit, upstart, OpenRC, ... 
– Being adopted by all major Linux distributions 
● Fundamental concept is the unit 
– Units include services (e.g. applications), mount points, sockets, timers, etc. 
– Each unit is configured with a simple unit file
Quick comparison
fleet + systemd 
● systemd exposes a D-Bus interface 
– D-Bus: message bus system for IPC on Linux 
– One-to-one messaging (methods), plus pub/sub abilities 
● fleet uses godbus to communicate with systemd (see the sketch below) 
– Sending commands: StartUnit, StopUnit 
– Retrieving current state of units (to publish to the cluster)
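As a rough illustration of the D-Bus interaction described above (not fleet's actual code), the sketch below starts a unit and lists unit states via the go-systemd dbus package that fleet builds on. It assumes a recent version of that package; the exact import path and signatures have changed over time.

// Drive systemd over D-Bus from Go: start a unit, then list unit states.
package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	// Connect to systemd's D-Bus API (needs appropriate privileges).
	conn, err := dbus.New()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ask systemd to start a unit; the job result arrives on the channel.
	done := make(chan string, 1)
	if _, err := conn.StartUnit("hello.service", "replace", done); err != nil {
		log.Fatal(err)
	}
	fmt.Println("StartUnit:", <-done)

	// Retrieve current unit states (what the fleet agent publishes).
	units, err := conn.ListUnits()
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range units {
		fmt.Println(u.Name, u.ActiveState, u.SubState)
	}
}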
systemd is great! 
● Automatically handles: 
– Process daemonization 
– Resource isolation/containment (cgroups) 
• e.g. MemoryLimit=512M 
– Health-checking, restarting failed services 
– Logging (journal) 
• applications can just write to stdout, systemd adds metadata 
– Timers, inter-unit dependencies, socket activation, ...
fleet + systemd 
● systemd takes care of things so we don't have to 
● fleet configuration is just systemd unit files 
● fleet extends systemd to the cluster level, and adds some features of its own (using [X-Fleet]) 
– Template units (run n identical copies of a unit) 
– Global units (run a unit everywhere in the cluster) 
– Machine metadata (run only on certain machines)
systemd is... not so great 
● Problem: unreliable pub/sub 
– fleet agent initially used a systemd D-Bus subscription to track unit status 
– Every change in unit state in systemd triggers an event in fleet (e.g. “publish this new state to the cluster”) 
– Under heavy load, or byzantine conditions, unit state changes would be dropped 
– As a result, unit state in the cluster became stale
systemd is... not so great 
● Problem: unreliable pub/sub 
● Solution: polling for unit states 
– Every n seconds, retrieve state of units from systemd, and synchronize with cluster 
– Less efficient, but much more reliable 
– Optimize by caching state and only publishing changes (see the sketch below) 
– Any state inconsistencies are quickly fixed
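A minimal sketch of that poll-and-publish loop; this is not fleet's actual code, and listUnits/publish are hypothetical stand-ins for the systemd D-Bus query and the etcd write.

// Poll systemd every interval, publish only unit states that changed.
package main

import "time"

type unitState struct {
	ActiveState string
	SubState    string
}

// Hypothetical stubs: listUnits would query systemd over D-Bus, publish would
// write the state into etcd for the rest of the cluster to see.
func listUnits() map[string]unitState  { return nil }
func publish(name string, s unitState) {}

func pollUnitStates(stop <-chan struct{}, interval time.Duration) {
	cache := map[string]unitState{} // last state published to the cluster
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			for name, cur := range listUnits() {
				// Only publish units whose state actually changed.
				if prev, ok := cache[name]; !ok || prev != cur {
					publish(name, cur)
					cache[name] = cur
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go pollUnitStates(stop, 5*time.Second)
	time.Sleep(12 * time.Second)
	close(stop)
}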
systemd (and docker) are... not so great 
● Problem: poor integration with Docker 
– Docker is the de facto application container manager 
– Docker and systemd do not always play nicely together... 
– Both Docker and systemd manage cgroups and processes, and when the two are trying to manage the same thing, the results are mixed
systemd (and docker) are... not so great 
● Example: sending signals to a container 
– Given a simple container: 
[Service] 
ExecStart=/usr/bin/docker run busybox /bin/bash -c "while true; do echo Hello World; sleep 1; done" 
– Try to kill it with systemctl kill hello.service 
– ... Nothing happens 
– The kill command sends SIGTERM, but bash runs as PID 1 inside the Docker container, and PID 1 happily ignores the signal...
systemd (and docker) are... not so great 
● Example: sending signals to a container 
● OK, SIGTERM didn't work, so escalate to SIGKILL: 
systemctl kill -s SIGKILL hello.service 
● Now the systemd service is gone: 
hello.service: main process exited, code=killed, status=9/KILL 
● But... the Docker container still exists? 
# docker ps 
CONTAINER ID COMMAND STATUS NAMES 
7c7cf8ffabb6 /bin/sh -c 'while tr Up 31 seconds hello 
# ps -ef|grep '[d]ocker run' 
root 24231 1 0 03:49 ? 00:00:00 /usr/bin/docker run -name hello ...
systemd (and docker) are... not so great 
● Why? 
– The Docker client does not run containers itself; it just sends a command to the Docker daemon, which actually forks and starts running the container 
– systemd expects processes to fork directly so they will be contained under the same cgroup tree 
– Since the Docker daemon's cgroup is entirely separate, systemd cannot keep track of the forked container
systemd (and docker) are... not so great 
# systemctl cat hello.service 
[Service] 
ExecStart=/bin/bash -c 'while true; do echo Hello World; sleep 1; done' 
# systemd-cgls 
... 
├─hello.service 
│ ├─23201 /bin/bash -c while true; do echo Hello World; sleep 1; done 
│ └─24023 sleep 1
systemd (and docker) are... not so great 
# systemctl cat hello.service 
[Service] 
ExecStart=/usr/bin/docker run -name hello busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done" 
# systemd-cgls 
... 
│ ├─hello.service 
│ │ └─24231 /usr/bin/docker run -name hello busybox /bin/sh -c while true; do echo Hello World; sleep 1; done 
... 
│ ├─docker-51a57463047b65487ec80a1dc8b8c9ea14a396c7a49c1e23919d50bdafd4fefb.scope 
│ │ ├─24240 /bin/sh -c while true; do echo Hello World; sleep 1; done 
│ │ └─24553 sleep 1
systemd (and docker) are... not so great 
● Problem: poor integration with Docker 
● Solution: ... work in progress 
– systemd-docker – a small application that moves the cgroups of Docker containers back under systemd's cgroup 
– Use Docker for image management, but systemd-nspawn for runtime (e.g. CoreOS's toolbox) 
– (proposed) Docker standalone mode: the client starts the container directly rather than through the daemon
fleet – high level view 
[Diagram: fleet across the cluster and on a single machine]
etcd 
● A consistent, highly available key/value store 
– Shared configuration, distributed locking, ... 
● Driven by Raft 
– Consensus algorithm similar to Paxos 
– Designed for understandability and simplicity 
● Popular and widely used 
– Simple HTTP API + libraries in Go, Java, Python, Ruby, ... (see the sketch below)
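As a taste of that HTTP API, here is a minimal sketch in Go, standard library only, that sets a key and reads it back through etcd's v2 key/value endpoint; the address assumes the old default client port (4001) and the key name is illustrative.

// Set and get a key via etcd's v2 HTTP API.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"strings"
)

const etcdBase = "http://127.0.0.1:4001/v2/keys" // default client port at the time

func main() {
	// Set a key: PUT /v2/keys/message with form value=hello
	req, _ := http.NewRequest("PUT", etcdBase+"/message",
		strings.NewReader(url.Values{"value": {"hello"}}.Encode()))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// Read it back: GET /v2/keys/message returns a JSON node
	resp, err = http.Get(etcdBase + "/message")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}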
etcd connects CoreOS hosts
fleet + etcd 
● fleet needs a consistent view of the cluster to make 
scheduling decisions: etcd provides this view 
– What units exist in the cluster? 
– What machines exist in the cluster? 
– What are their current states? 
● All unit files, unit state, machine state and 
scheduling information is stored in etcd
etcd is great! 
● Fast and simple API 
● Handles all cluster-level/inter-machine 
communication so we don't have to 
● Powerful primitives: 
– Compare-and-Swap allows for atomic operations and 
implementing locking behaviour 
– Watches (like pub-sub) provide event-driven behaviour
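A minimal sketch of the Compare-and-Swap primitive used as a crude lock, again with nothing but the Go standard library; the key name, TTL and endpoint are illustrative.

// Try to acquire a lock by creating a key only if it does not already exist.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"strings"
)

// tryLock creates the key only if it does not already exist (prevExist=false).
// etcd performs the check and the write atomically, so exactly one caller wins.
func tryLock(key, holder string) (bool, error) {
	u := "http://127.0.0.1:4001/v2/keys" + key + "?prevExist=false"
	body := strings.NewReader(url.Values{"value": {holder}, "ttl": {"300"}}.Encode())
	req, err := http.NewRequest("PUT", u, body)
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// 201 Created: we got the lock; 412 Precondition Failed: someone else holds it.
	return resp.StatusCode == http.StatusCreated, nil
}

func main() {
	ok, err := tryLock("/locks/reboot", "core-01")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("acquired:", ok)
}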
etcd is... not so great 
● Problem: unreliable watches 
– fleet initially used a purely event-driven architecture 
– watches in etcd used to trigger events 
● e.g. “machine down” event triggers unit rescheduling 
– Highly efficient: only take action when necessary 
– Highly responsive: as soon as a user submits a new 
unit, an event is triggered to schedule that unit 
– Unfortunately, many places for things to go wrong...
Unreliable watches 
● Example: 
– etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period 
– Change occurs (e.g. a machine leaves the cluster) 
– Event is missed 
– fleet doesn't know the machine is lost! 
– Now fleet doesn't know to reschedule units that were running on that machine
etcd is... not so great 
● Problem: limited event history 
– etcd retains a history of all events that occur 
– Can “watch” from an arbitrary point in the past, but.. 
– History is a limited window! 
– With a busy cluster, watches can fall out of this window 
– Can't always replay event stream from the point we 
want to
Limited event history 
● Example: 
– etcd holds history of the last 1000 events 
– fleet sets a watch at i=100 to watch for machine loss 
– Meanwhile, many changes occur in other parts of the keyspace, advancing the index to i=1500 
– Leader election/network hiccup occurs and severs the watch 
– fleet tries to recreate the watch at i=100 and fails: 
err="401: The event in requested index is outdated and cleared (the requested history has been cleared [1500/100])"
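For reference, this is roughly the request pattern involved: a minimal sketch in Go of an etcd v2 watch with an explicit waitIndex, which comes back with the errorCode 401 payload above once the requested index has fallen out of the bounded event history. The endpoint and key are illustrative.

// Watch a key range from a fixed index via etcd's v2 HTTP API.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Block until something changes under /machines at or after index 100.
	watchURL := "http://127.0.0.1:4001/v2/keys/machines?wait=true&recursive=true&waitIndex=100"
	resp, err := http.Get(watchURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// If index 100 has already been evicted from the bounded event history,
	// the body is an error payload (errorCode 401) instead of an event.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}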
etcd is... not so great 
● Problem: unreliable watches 
– Missed events lead to unrecoverable situations 
● Problem: limited event history 
– Can't always replay entire event stream 
● Solution: move to “reconciler” model
Reconciler model 
In a loop, run periodically until stopped: 
1. Retrieve current state (how the world is) and desired state (how the world should be) from the datastore (etcd) 
2. Determine necessary actions to transform current state --> desired state 
3. Perform actions and save results as the new current state
Reconciler model 
Example: fleet's engine (scheduler) looks something like: 
for { // loop forever 
    select { 
    case <-stopChan: // if stopped, exit 
        return 
    case <-time.After(5 * time.Minute): 
        units := fetchUnits() 
        machines := fetchMachines() 
        schedule(units, machines) 
    } 
}
etcd is... not so great 
● Problem: unreliable watches 
– Missed events lead to unrecoverable situations 
● Problem: limited event history 
– Can't always replay entire event stream 
● Solution: move to “reconciler” model 
– Less efficient, but extremely robust 
– Still many paths for optimisation (e.g. using watches to 
trigger reconciliations)
fleet – high level view 
[Diagram: fleet across the cluster and on a single machine]
golang 
● Standard language for all CoreOS projects (above OS) 
– etcd 
– fleet 
– locksmith (semaphore for reboots during updates) 
– etcdctl, updatectl, coreos-cloudinit, ... 
● fleet is ~10k LOC (and another ~10k LOC tests)
Go is great! 
● Fast! 
– to write (concise syntax) 
– to compile (builds typically <1s) 
– to run tests (O(seconds), including with race detection) 
– Never underestimate the power of rapid iteration 
● Simple, powerful tooling 
– Built-in package management, code coverage, etc.
Go is great! 
● Rich standard library 
– “Batteries are included” 
– e.g.: completely self-hosted HTTP server, no need for 
reverse proxies or worker systems to serve many 
concurrent HTTP requests 
● Static compilation into a single binary 
– Ideal for a minimal OS with no libraries
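A minimal sketch of that point: the standard library's net/http serves each connection on its own goroutine, so a single static binary can handle many concurrent requests without a reverse proxy or worker system in front.

// A complete, self-hosted HTTP server using only the standard library.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello from a single static binary")
	})
	// ListenAndServe handles each connection concurrently.
	log.Fatal(http.ListenAndServe(":8080", nil))
}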
Go is... not so great 
● Problem: managing third-party dependencies 
– modular package management but: no versioning 
– import “github.com/coreos/fleet” - which SHA? 
● Solution: vendoring :-/ 
– Copy entire source tree of dependencies into repository 
– Slowly maturing tooling: goven, third_party.go, Godep
Go is... not so great 
● Problem: (relatively) large binary sizes 
– “relatively”, but... CoreOS is the minimal OS 
– ~10MB per binary, many tools, quickly adds up 
● Solutions: 
– upgrading golang! 
• go1.2 to go1.3 = ~25% reduction 
– sharing the binary between tools
Sharing a binary 
● client/daemon often share much of the same code 
– Encapsulate multiple tools in one binary, symlink the different command names, and switch on the command name 
– Example: fleetd/fleetctl 
func main() { 
    switch os.Args[0] { 
    case "fleetctl": 
        Fleetctl() 
    case "fleetd": 
        Fleetd() 
    } 
} 
Before: 
 9150032 fleetctl 
 8567416 fleetd 
After: 
11052256 fleetctl 
       8 fleetd -> fleetctl
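A self-contained sketch of the same symlink-dispatch pattern (not fleet's actual code); the tool names here are made up, and filepath.Base is used so the switch works regardless of how the binary is invoked.

// Dispatch on the name the binary was invoked as.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func runCtl()    { fmt.Println("running the client") }
func runDaemon() { fmt.Println("running the daemon") }

func main() {
	// filepath.Base strips any leading path from argv[0], so this works for
	// both ./mytoolctl and /usr/bin/mytoolctl.
	switch filepath.Base(os.Args[0]) {
	case "mytoolctl":
		runCtl()
	case "mytoold":
		runDaemon()
	default:
		fmt.Fprintln(os.Stderr, "unknown command name:", os.Args[0])
		os.Exit(1)
	}
}

Build it once, then create the second command name with: ln -s mytoolctl mytoold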
Go is... not so great 
● Problem: young language => immature libraries 
– CLI frameworks 
– godbus, go-systemd, go-etcd :-( 
● Solutions: 
– Keep it simple 
– Roll your own (e.g. fleetctl's command line)
fleetctl CLI 
type Command struct { 
    Name        string 
    Summary     string 
    Usage       string 
    Description string 
    Flags       flag.FlagSet 
    Run         func(args []string) int 
}
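A minimal sketch of how a hand-rolled CLI might dispatch over a table of Command values like the struct above; the command names and behaviour are illustrative, not fleetctl's real ones.

// Look up the subcommand by name, parse its flags, and run it.
package main

import (
	"flag"
	"fmt"
	"os"
)

type Command struct {
	Name        string
	Summary     string
	Usage       string
	Description string
	Flags       flag.FlagSet
	Run         func(args []string) int
}

var commands = []*Command{
	{Name: "version", Summary: "print the version", Run: func(args []string) int {
		fmt.Println("0.1.0")
		return 0
	}},
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: tool <command> [args]")
		os.Exit(2)
	}
	for _, c := range commands {
		if c.Name == os.Args[1] {
			c.Flags.Parse(os.Args[2:]) // parse any command-specific flags
			os.Exit(c.Run(c.Flags.Args()))
		}
	}
	fmt.Fprintln(os.Stderr, "unknown command:", os.Args[1])
	os.Exit(2)
}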
Wrap up/recap 
● CoreOS Linux 
– Minimal OS with cluster capabilities built-in 
– Containerized applications --> a[u]tom[at]ic updates 
● fleet 
– Simple, powerful cluster-level application manager 
– Glue between local init system (systemd) and cluster-level awareness (etcd) 
– golang++
Questions?
Thank you :-) 
● Everything is open source – join us! 
– https://github.com/coreos 
● Any more questions? Feel free to email 
– jonathan.boulle@coreos.com 
● CoreOS stickers!
References 
● CoreOS updates 
https://coreos.com/using-coreos/updates/ 
● Omaha protocol 
https://code.google.com/p/omaha/wiki/ServerProtocol 
● Raft algorithm 
http://raftconsensus.github.io/
References 
● fleet 
https://github.com/coreos/fleet 
● etcd 
https://github.com/coreos/etcd 
● toolbox 
https://github.com/coreos/toolbox 
● systemd-docker 
https://github.com/ibuildthecloud/systemd-docker 
● systemd-nspawn 
http://0pointer.de/public/systemd-man/systemd-nspawn.html
Brief side-note: locksmith 
● Reboot manager for CoreOS 
● Uses a semaphore in etcd to co-ordinate reboots 
● Each machine in the cluster: 
1. Downloads and applies update 
2. Takes lock in etcd (using Compare-And-Swap) 
3. Reboots and releases lock
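A minimal sketch of that three-step flow; this is not locksmith's real code. applyUpdate, tryLock, unlock and reboot are hypothetical stubs (tryLock would be the etcd Compare-and-Swap shown earlier), and in practice the lock is only released after the machine has come back up.

// Apply the update, wait for the reboot semaphore, then reboot.
package main

import (
	"log"
	"time"
)

// Hypothetical stubs standing in for the real pieces.
func applyUpdate() error         { return nil } // download + apply to the passive partition
func tryLock(holder string) bool { return true } // Compare-and-Swap in etcd
func unlock(holder string)       {}              // release the semaphore
func reboot() error              { return nil }  // reboot the machine

func main() {
	// 1. Download and apply the update.
	if err := applyUpdate(); err != nil {
		log.Fatal(err)
	}

	// 2. Wait our turn: the semaphore limits how many machines in the
	//    cluster reboot at the same time.
	for !tryLock("core-01") {
		time.Sleep(30 * time.Second)
	}

	// 3. Reboot; in the real flow the lock is released once the machine is
	//    back up on the new image.
	if err := reboot(); err != nil {
		unlock("core-01")
		log.Fatal(err)
	}
}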

[2C4]Clustered computing with CoreOS, fleet and etcd

  • 1.
    Clustered computing withCoreOS, fleet and etcd DEVIEW 2014 Jonathan Boulle CoreOS, Inc.
  • 2.
    @baronboulle jonboulle Whoam I? Jonathan Boulle
  • 3.
    Jonathan Boulle @baronboulle jonboulle Who am I? South Africa -> Australia -> London -> San Francisco
  • 4.
    Jonathan Boulle @baronboulle jonboulle Who am I? South Africa -> Australia -> London -> San Francisco Red Hat -> Twitter -> CoreOS
  • 5.
    Jonathan Boulle @baronboulle jonboulle Who am I? South Africa -> Australia -> London -> San Francisco Red Hat -> Twitter -> CoreOS Linux, Python, Go, FOSS
  • 6.
  • 7.
  • 8.
    Agenda ● CoreOSLinux – Securing the internet
  • 9.
    Agenda ● CoreOSLinux – Securing the internet – Application containers
  • 10.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates
  • 11.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet
  • 12.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system
  • 13.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd
  • 14.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd ● fleet and...
  • 15.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd ● fleet and... – systemd: the good, the bad
  • 16.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd ● fleet and... – systemd: the good, the bad – etcd: the good, the bad
  • 17.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd ● fleet and... – systemd: the good, the bad – etcd: the good, the bad – Golang: the good, the bad
  • 18.
    Agenda ● CoreOSLinux – Securing the internet – Application containers – Automatic updates ● fleet – cluster-level init system – etcd + systemd ● fleet and... – systemd: the good, the bad – etcd: the good, the bad – Golang: the good, the bad ● Q&A
  • 19.
  • 20.
  • 21.
    CoreOS Linux Aminimal, automatically-updated Linux distribution, designed for distributed systems.
  • 22.
  • 23.
    Why? ● CoreOSmission: “Secure the internet”
  • 24.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it
  • 25.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it ● Internet is full of servers running years-old software with dozens of vulnerabilities
  • 26.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it ● Internet is full of servers running years-old software with dozens of vulnerabilities ● CoreOS: make updating the default, seamless option
  • 27.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it ● Internet is full of servers running years-old software with dozens of vulnerabilities ● CoreOS: make updating the default, seamless option ● Regular
  • 28.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it ● Internet is full of servers running years-old software with dozens of vulnerabilities ● CoreOS: make updating the default, seamless option ● Regular ● Reliable
  • 29.
    Why? ● CoreOSmission: “Secure the internet” ● Status quo: set up a server and never touch it ● Internet is full of servers running years-old software with dozens of vulnerabilities ● CoreOS: make updating the default, seamless option ● Regular ● Reliable ● Automatic
  • 30.
    How do weachieve this?
  • 31.
    How do weachieve this? ● Containerization of applications
  • 32.
    How do weachieve this? ● Containerization of applications ● Self-updating operating system
  • 33.
    How do weachieve this? ● Containerization of applications ● Self-updating operating system ● Distributed systems tooling to make applications resilient to updates
  • 34.
    KERNEL SYSTEMD SSH DOCKER PYTHON JAVA NGINX MYSQL OPENSSL distro distro distro distro distro distro distro distro distro distro distro APP
  • 35.
    KERNEL SYSTEMD SSH DOCKER distro distro distro distro distro distro distro distro distro distro distro PYTHON JAVA NGINX MYSQL OPENSSL APP
  • 36.
    KERNEL SYSTEMD SSH DOCKER Application Containers (e.g. Docker) PYTHON JAVA NGINX MYSQL OPENSSL distro distro distro distro distro distro distro distro distro distro distro APP
  • 37.
    KERNEL SYSTEMD SSH DOCKER - Minimal base OS (~100MB) - Vanilla upstream components wherever possible - Decoupled from applications - Automatic, atomic updates
  • 38.
  • 39.
    Automatic updates Howdo updates work?
  • 40.
    Automatic updates Howdo updates work? ● Omaha protocol (check-in/retrieval)
  • 41.
    Automatic updates Howdo updates work? ● Omaha protocol (check-in/retrieval) – Simple XML-over-HTTP protocol developed by Google to facilitate polling and pulling updates from a server
  • 42.
    Omaha protocol Clientsends application id and current version to the update server <request protocol="3.0" version="CoreOSUpdateEngine-0.1.0.0"> <app appid="{e96281a6-d1af-4bde-9a0a-97b76e56dc57}" version="410.0.0" track="alpha" from_track="alpha"> <event eventtype="3"></event> </app> </request>
  • 43.
    Omaha protocol Clientsends application id and current version to the update server Update server responds with the URL of an update to be applied <url codebase="https://commondatastorage.googleapis.com/update-storage. core-os.net/amd64-usr/452.0.0/"></url> <package hash="D0lBAMD1Fwv8YqQuDYEAjXw6YZY=" name="update.gz" size="103401155" required="false">
  • 44.
    Omaha protocol Clientsends application id and current version to the update server Update server responds with the URL of an update to be applied Client downloads data, verifies hash & cryptographic signature, and applies the update
  • 45.
    Omaha protocol Clientsends application id and current version to the update server Update server responds with the URL of an update to be applied Client downloads data, verifies hash & cryptographic signature, and applies the update Updater exits with response code then reports the update to the update server
  • 46.
  • 47.
    Automatic updates Howdo updates work? ● Omaha protocol (check-in/retrieval) – Simple XML-over-HTTP protocol developed by Google to facilitate polling and pulling updates from a server ● Active/passive read-only root partitions
  • 48.
    Automatic updates Howdo updates work? ● Omaha protocol (check-in/retrieval) – Simple XML-over-HTTP protocol developed by Google to facilitate polling and pulling updates from a server ● Active/passive read-only root partitions – One for running the live system, one for updates
  • 49.
    Active/passive root partitions Booted off partition A. Download update and commit to partition B. Change GPT.
    Active/passive root partitions Reboot (into Partition B). If tests succeed, continue normal operation and mark success in GPT.
    Active/passive root partitions But what if partition B fails update tests...
    Active/passive root partitions Change GPT to point to previous partition, reboot. Try update again later.
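The GPT flip is just a change to the priority/tries/successful attributes shown on the next slides. As a rough, hedged sketch – the update engine does this itself, and the exact cgpt invocations here are an assumption – the manual equivalent would look something like:
core-01 ~ # cgpt prioritize -i 4 /dev/sda      # make USR-B (partition 4) the preferred boot target
core-01 ~ # cgpt add -i 4 -T 1 -S 0 /dev/sda   # allow one boot attempt, not yet marked successful
core-01 ~ # cgpt add -i 4 -S 1 /dev/sda        # after a successful boot off USR-B, mark it good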
Active/passive root partitions
core-01 ~ # cgpt show /dev/sda3
      start       size    contents
     264192    2097152    Label: "USR-A"
                          Type: Alias for coreos-rootfs
                          UUID: 7130C94A-213A-4E5A-8E26-6CCE9662
                          Attr: priority=1 tries=0 successful=1
core-01 ~ # cgpt show /dev/sda4
      start       size    contents
    2492416    2097152    Label: "USR-B"
                          Type: Alias for coreos-rootfs
                          UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A
                          Attr: priority=2 tries=1 successful=0
Active/passive /usr partitions
core-01 ~ # cgpt show /dev/sda3
      start       size    contents
     264192    2097152    Label: "USR-A"
                          Type: Alias for coreos-rootfs
                          UUID: 7130C94A-213A-4E5A-8E26-6CCE9662
                          Attr: priority=1 tries=0 successful=1
core-01 ~ # cgpt show /dev/sda4
      start       size    contents
    2492416    2097152    Label: "USR-B"
                          Type: Alias for coreos-rootfs
                          UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A
                          Attr: priority=2 tries=1 successful=0
    Active/passive /usr partitions ● Single image containing most of the OS
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr – / is mounted read-write on top (persistent data)
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr – / is mounted read-write on top (persistent data) – Parts of /etc generated dynamically at boot
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr – / is mounted read-write on top (persistent data) – Parts of /etc generated dynamically at boot – A lot of work moving default configs from /etc to /usr
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr – / is mounted read-write on top (persistent data) – Parts of /etc generated dynamically at boot – A lot of work moving default configs from /etc to /usr # /etc/nsswitch.conf: passwd: files usrfiles shadow: files usrfiles group: files usrfiles
    Active/passive /usr partitions ● Single image containing most of the OS – Mounted read-only onto /usr – / is mounted read-write on top (persistent data) – Parts of /etc generated dynamically at boot – A lot of work moving default configs from /etc to /usr # /etc/nsswitch.conf: passwd: files usrfiles /usr/share/baselayout/passwd shadow: files usrfiles /usr/share/baselayout/shadow group: files usrfiles /usr/share/baselayout/group
Atomic updates ● Entire OS is a single read-only image core-01 ~ # touch /usr/bin/foo touch: cannot touch '/usr/bin/foo': Read-only file system
Atomic updates ● Entire OS is a single read-only image core-01 ~ # touch /usr/bin/foo touch: cannot touch '/usr/bin/foo': Read-only file system – Easy to verify cryptographically ● sha1sum on AWS or bare metal gives the same result
Atomic updates ● Entire OS is a single read-only image core-01 ~ # touch /usr/bin/foo touch: cannot touch '/usr/bin/foo': Read-only file system – Easy to verify cryptographically ● sha1sum on AWS or bare metal gives the same result – No chance of inconsistencies due to partial updates ● e.g. pull a plug on a CentOS system during a yum update ● At large scale, such events are inevitable
Automatic, atomic updates are great! But...
Automatic, atomic updates are great! But... ● Problem:
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem)
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem) – Reboots cause application downtime...
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem) – Reboots cause application downtime... ● Solution: fleet
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem) – Reboots cause application downtime... ● Solution: fleet – Highly available, fault tolerant, distributed process scheduler
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem) – Reboots cause application downtime... ● Solution: fleet – Highly available, fault tolerant, distributed process scheduler ● ... The way to run applications on a CoreOS cluster
Automatic, atomic updates are great! But... ● Problem: – updates still require a reboot (to use new kernel and mount new filesystem) – Reboots cause application downtime... ● Solution: fleet – Highly available, fault tolerant, distributed process scheduler ● ... The way to run applications on a CoreOS cluster – fleet keeps applications running during server downtime
fleet – the “cluster-level init system”
fleet – the “cluster-level init system” ● fleet is the abstraction between machine and application:
fleet – the “cluster-level init system” ● fleet is the abstraction between machine and application: – init system manages processes on a machine – fleet manages applications on a cluster of machines
fleet – the “cluster-level init system” ● fleet is the abstraction between machine and application: – init system manages processes on a machine – fleet manages applications on a cluster of machines ● Similar to Mesos, but very different architecture (e.g. based on etcd/Raft, not Zookeeper/Paxos)
fleet – the “cluster-level init system” ● fleet is the abstraction between machine and application: – init system manages processes on a machine – fleet manages applications on a cluster of machines ● Similar to Mesos, but very different architecture (e.g. based on etcd/Raft, not Zookeeper/Paxos) ● Uses systemd for machine-level process management, etcd for cluster-level co-ordination
fleet – low-level view
fleet – low-level view ● fleetd binary (running on all CoreOS nodes)
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles:
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd)
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd) • agent (local unit management – talks to etcd and systemd)
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd) • agent (local unit management – talks to etcd and systemd) ● fleetctl command-line administration tool
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd) • agent (local unit management – talks to etcd and systemd) ● fleetctl command-line administration tool – create, destroy, start, stop units
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd) • agent (local unit management – talks to etcd and systemd) ● fleetctl command-line administration tool – create, destroy, start, stop units – Retrieve current status of units/machines in the cluster
fleet – low-level view ● fleetd binary (running on all CoreOS nodes) – encapsulates two roles: • engine (cluster-level unit scheduling – talks to etcd) • agent (local unit management – talks to etcd and systemd) ● fleetctl command-line administration tool – create, destroy, start, stop units – Retrieve current status of units/machines in the cluster ● HTTP API
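A quick illustration of the fleetctl workflow – the unit name here is hypothetical and the output is abridged (column layout varies by fleet version), but submit/start/list-units are real fleetctl subcommands:
$ fleetctl submit hello.service
$ fleetctl start hello.service
$ fleetctl list-units
UNIT            MACHINE                 ACTIVE   SUB
hello.service   148a18ff.../10.10.1.1   active   running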
fleet – high-level view [Diagram: the cluster vs. a single machine]
systemd ● Linux init system (PID 1) – manages processes
systemd ● Linux init system (PID 1) – manages processes – Relatively new, replaces SysVinit, upstart, OpenRC, ...
systemd ● Linux init system (PID 1) – manages processes – Relatively new, replaces SysVinit, upstart, OpenRC, ... – Being adopted by all major Linux distributions
systemd ● Linux init system (PID 1) – manages processes – Relatively new, replaces SysVinit, upstart, OpenRC, ... – Being adopted by all major Linux distributions ● Fundamental concept is the unit
systemd ● Linux init system (PID 1) – manages processes – Relatively new, replaces SysVinit, upstart, OpenRC, ... – Being adopted by all major Linux distributions ● Fundamental concept is the unit – Units include services (e.g. applications), mount points, sockets, timers, etc.
systemd ● Linux init system (PID 1) – manages processes – Relatively new, replaces SysVinit, upstart, OpenRC, ... – Being adopted by all major Linux distributions ● Fundamental concept is the unit – Units include services (e.g. applications), mount points, sockets, timers, etc. – Each unit configured with a simple unit file
    fleet + systemd  systemd exposes a D-Bus interface
    fleet + systemd  systemd exposes a D-Bus interface – D-Bus: message bus system for IPC on Linux
    fleet + systemd  systemd exposes a D-Bus interface – D-Bus: message bus system for IPC on Linux – One-to-one messaging (methods), plus pub/sub abilities
    fleet + systemd  systemd exposes a D-Bus interface – D-Bus: message bus system for IPC on Linux – One-to-one messaging (methods), plus pub/sub abilities  fleet uses godbus to communicate with systemd
    fleet + systemd  systemd exposes a D-Bus interface – D-Bus: message bus system for IPC on Linux – One-to-one messaging (methods), plus pub/sub abilities  fleet uses godbus to communicate with systemd – Sending commands: StartUnit, StopUnit
    fleet + systemd  systemd exposes a D-Bus interface – D-Bus: message bus system for IPC on Linux – One-to-one messaging (methods), plus pub/sub abilities  fleet uses godbus to communicate with systemd – Sending commands: StartUnit, StopUnit – Retrieving current state of units (to publish to the cluster)
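A minimal sketch of what this looks like from Go, using the github.com/coreos/go-systemd/dbus bindings (not fleet's actual code; method signatures follow a recent go-systemd release and may differ from the version used at the time; the unit name is hypothetical):
package main

import (
    "fmt"
    "log"

    "github.com/coreos/go-systemd/dbus"
)

func main() {
    conn, err := dbus.New() // connect to systemd's D-Bus interface
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Ask systemd to start a unit; "replace" is the job mode.
    done := make(chan string, 1)
    if _, err := conn.StartUnit("hello.service", "replace", done); err != nil {
        log.Fatal(err)
    }
    fmt.Println("job result:", <-done)

    // Retrieve the current state of all units (what fleet publishes to the cluster).
    units, err := conn.ListUnits()
    if err != nil {
        log.Fatal(err)
    }
    for _, u := range units {
        fmt.Println(u.Name, u.ActiveState, u.SubState)
    }
}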
    systemd is great!  Automatically handles:
    systemd is great!  Automatically handles: – Process daemonization
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups)
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups) • e.g. MemoryLimit=512M
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups) • e.g. MemoryLimit=512M – Health-checking, restarting failed services
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups) • e.g. MemoryLimit=512M – Health-checking, restarting failed services – Logging (journal)
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups) • e.g. MemoryLimit=512M – Health-checking, restarting failed services – Logging (journal) • applications can just write to stdout, systemd adds metadata
    systemd is great!  Automatically handles: – Process daemonization – Resource isolation/containment (cgroups) • e.g. MemoryLimit=512M – Health-checking, restarting failed services – Logging (journal) • applications can just write to stdout, systemd adds metadata – Timers, inter-unit dependencies, socket activation, ...
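For example, a hypothetical hello.service unit – the directives below are standard systemd options for exactly the features above:
[Unit]
Description=Hello World

[Service]
ExecStart=/bin/bash -c "while true; do echo Hello World; sleep 1; done"
MemoryLimit=512M
Restart=on-failure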
    fleet + systemd  systemd takes care of things so we don't have to
    fleet + systemd  systemd takes care of things so we don't have to  fleet configuration is just systemd unit files
    fleet + systemd  systemd takes care of things so we don't have to  fleet configuration is just systemd unit files  fleet extends systemd to the cluster-level, and adds some features of its own (using [X-Fleet])
    fleet + systemd  systemd takes care of things so we don't have to  fleet configuration is just systemd unit files  fleet extends systemd to the cluster-level, and adds some features of its own (using [X-Fleet]) – Template units (run n identical copies of a unit)
    fleet + systemd  systemd takes care of things so we don't have to  fleet configuration is just systemd unit files  fleet extends systemd to the cluster-level, and adds some features of its own (using [X-Fleet]) – Template units (run n identical copies of a unit) – Global units (run a unit everywhere in the cluster)
    fleet + systemd  systemd takes care of things so we don't have to  fleet configuration is just systemd unit files  fleet extends systemd to the cluster-level, and adds some features of its own (using [X-Fleet]) – Template units (run n identical copies of a unit) – Global units (run a unit everywhere in the cluster) – Machine metadata (run only on certain machines)
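A hedged example of what such a unit might look like – the service itself (image name, ports) is hypothetical, but Global, MachineMetadata and Conflicts are real [X-Fleet] options, and the @ suffix is how template units are instantiated:
# apache@.service – template unit; start e.g. apache@1.service, apache@2.service
[Service]
ExecStart=/usr/bin/docker run --name apache-%i -p 80:80 coreos/apache

[X-Fleet]
# only schedule onto machines tagged as frontends...
MachineMetadata=role=frontend
# ...and never run two instances on the same machine
Conflicts=apache@*.service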
systemd is... not so great ● Problem: unreliable pub/sub
systemd is... not so great ● Problem: unreliable pub/sub – fleet agent initially used a systemd D-Bus subscription to track unit status
systemd is... not so great ● Problem: unreliable pub/sub – fleet agent initially used a systemd D-Bus subscription to track unit status – Every change in unit state in systemd triggers an event in fleet (e.g. “publish this new state to the cluster”)
systemd is... not so great ● Problem: unreliable pub/sub – fleet agent initially used a systemd D-Bus subscription to track unit status – Every change in unit state in systemd triggers an event in fleet (e.g. “publish this new state to the cluster”) – Under heavy load, or byzantine conditions, unit state changes would be dropped
systemd is... not so great ● Problem: unreliable pub/sub – fleet agent initially used a systemd D-Bus subscription to track unit status – Every change in unit state in systemd triggers an event in fleet (e.g. “publish this new state to the cluster”) – Under heavy load, or byzantine conditions, unit state changes would be dropped – As a result, unit state in the cluster became stale
systemd is... not so great ● Problem: unreliable pub/sub
systemd is... not so great ● Problem: unreliable pub/sub ● Solution: polling for unit states
systemd is... not so great ● Problem: unreliable pub/sub ● Solution: polling for unit states – Every n seconds, retrieve state of units from systemd, and synchronize with cluster
systemd is... not so great ● Problem: unreliable pub/sub ● Solution: polling for unit states – Every n seconds, retrieve state of units from systemd, and synchronize with cluster – Less efficient, but much more reliable
systemd is... not so great ● Problem: unreliable pub/sub ● Solution: polling for unit states – Every n seconds, retrieve state of units from systemd, and synchronize with cluster – Less efficient, but much more reliable – Optimize by caching state and only publishing changes
systemd is... not so great ● Problem: unreliable pub/sub ● Solution: polling for unit states – Every n seconds, retrieve state of units from systemd, and synchronize with cluster – Less efficient, but much more reliable – Optimize by caching state and only publishing changes – Any state inconsistencies are quickly fixed
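A minimal sketch of that poll-and-diff loop – the function names and canned data are illustrative stand-ins, not fleet's actual API:
package main

import (
    "fmt"
    "time"
)

// fetchUnitStates and publish are stand-ins for the real systemd and etcd calls.
func fetchUnitStates() map[string]string {
    return map[string]string{"hello.service": "active/running"}
}

func publish(name, state string) {
    fmt.Println("publishing", name, state) // in fleet this would be an etcd write
}

// publishLoop polls unit state every interval and publishes only what changed.
func publishLoop(interval time.Duration, stop <-chan struct{}) {
    cache := map[string]string{} // unit name -> last published state
    for {
        select {
        case <-stop:
            return
        case <-time.After(interval):
            for name, state := range fetchUnitStates() {
                if cache[name] != state {
                    publish(name, state)
                    cache[name] = state
                }
            }
        }
    }
}

func main() {
    stop := make(chan struct{})
    go publishLoop(5*time.Second, stop)
    time.Sleep(12 * time.Second)
    close(stop)
}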
systemd (and docker) are... not so great
systemd (and docker) are... not so great ● Problem: poor integration with Docker
systemd (and docker) are... not so great ● Problem: poor integration with Docker – Docker is the de facto application container manager
systemd (and docker) are... not so great ● Problem: poor integration with Docker – Docker is the de facto application container manager – Docker and systemd do not always play nicely together...
systemd (and docker) are... not so great ● Problem: poor integration with Docker – Docker is the de facto application container manager – Docker and systemd do not always play nicely together... – Both Docker and systemd manage cgroups and processes, and when the two are trying to manage the same thing, the results are mixed
systemd (and docker) are... not so great
systemd (and docker) are... not so great ● Example: sending signals to a container
systemd (and docker) are... not so great ● Example: sending signals to a container – Given a simple container: [Service] ExecStart=/usr/bin/docker run busybox /bin/bash -c "while true; do echo Hello World; sleep 1; done"
systemd (and docker) are... not so great ● Example: sending signals to a container – Given a simple container: [Service] ExecStart=/usr/bin/docker run busybox /bin/bash -c "while true; do echo Hello World; sleep 1; done" – Try to kill it with systemctl kill hello.service
systemd (and docker) are... not so great ● Example: sending signals to a container – Given a simple container: [Service] ExecStart=/usr/bin/docker run busybox /bin/bash -c "while true; do echo Hello World; sleep 1; done" – Try to kill it with systemctl kill hello.service – ... Nothing happens
systemd (and docker) are... not so great ● Example: sending signals to a container – Given a simple container: [Service] ExecStart=/usr/bin/docker run busybox /bin/bash -c "while true; do echo Hello World; sleep 1; done" – Try to kill it with systemctl kill hello.service – ... Nothing happens – The kill command sends SIGTERM, but bash runs as PID 1 inside the Docker container and happily ignores the signal...
systemd (and docker) are... not so great
systemd (and docker) are... not so great ● Example: sending signals to a container
systemd (and docker) are... not so great ● Example: sending signals to a container ● OK, SIGTERM didn't work, so escalate to SIGKILL: systemctl kill -s SIGKILL hello.service
systemd (and docker) are... not so great ● Example: sending signals to a container ● OK, SIGTERM didn't work, so escalate to SIGKILL: systemctl kill -s SIGKILL hello.service ● Now the systemd service is gone: hello.service: main process exited, code=killed, status=9/KILL
systemd (and docker) are... not so great ● Example: sending signals to a container ● OK, SIGTERM didn't work, so escalate to SIGKILL: systemctl kill -s SIGKILL hello.service ● Now the systemd service is gone: hello.service: main process exited, code=killed, status=9/KILL ● But... the Docker container still exists?
# docker ps
CONTAINER ID   COMMAND                STATUS          NAMES
7c7cf8ffabb6   /bin/sh -c 'while tr   Up 31 seconds   hello
# ps -ef | grep '[d]ocker run'
root  24231  1  0 03:49 ?  00:00:00 /usr/bin/docker run -name hello ...
systemd (and docker) are... not so great
systemd (and docker) are... not so great ● Why?
systemd (and docker) are... not so great ● Why? – The Docker client does not run containers itself; it just sends a command to the Docker daemon, which actually forks and starts running the container
systemd (and docker) are... not so great ● Why? – The Docker client does not run containers itself; it just sends a command to the Docker daemon, which actually forks and starts running the container – systemd expects processes to fork directly so they will be contained under the same cgroup tree
systemd (and docker) are... not so great ● Why? – The Docker client does not run containers itself; it just sends a command to the Docker daemon, which actually forks and starts running the container – systemd expects processes to fork directly so they will be contained under the same cgroup tree – Since the Docker daemon's cgroup is entirely separate, systemd cannot keep track of the forked container
systemd (and docker) are... not so great
# systemctl cat hello.service
[Service]
ExecStart=/bin/bash -c 'while true; do echo Hello World; sleep 1; done'
# systemd-cgls
...
├─hello.service
│ ├─23201 /bin/bash -c while true; do echo Hello World; sleep 1; done
│ └─24023 sleep 1
systemd (and docker) are... not so great
# systemctl cat hello.service
[Service]
ExecStart=/usr/bin/docker run -name hello busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
# systemd-cgls
...
│ ├─hello.service
│ │ └─24231 /usr/bin/docker run -name hello busybox /bin/sh -c while true; do echo Hello World; sleep 1; done
...
│ ├─docker-51a57463047b65487ec80a1dc8b8c9ea14a396c7a49c1e23919d50bdafd4fefb.scope
│ │ ├─24240 /bin/sh -c while true; do echo Hello World; sleep 1; done
│ │ └─24553 sleep 1
systemd (and docker) are... not so great
systemd (and docker) are... not so great ● Problem: poor integration with Docker ● Solution: ... work in progress
systemd (and docker) are... not so great ● Problem: poor integration with Docker ● Solution: ... work in progress – systemd-docker – a small application that moves the cgroups of Docker containers back under systemd's cgroup
systemd (and docker) are... not so great ● Problem: poor integration with Docker ● Solution: ... work in progress – systemd-docker – a small application that moves the cgroups of Docker containers back under systemd's cgroup – Use Docker for image management, but systemd-nspawn for runtime (e.g. CoreOS's toolbox)
systemd (and docker) are... not so great ● Problem: poor integration with Docker ● Solution: ... work in progress – systemd-docker – a small application that moves the cgroups of Docker containers back under systemd's cgroup – Use Docker for image management, but systemd-nspawn for runtime (e.g. CoreOS's toolbox) – (proposed) Docker standalone mode: client starts the container directly rather than through the daemon
fleet – high-level view [Diagram: the cluster vs. a single machine]
etcd ● A consistent, highly available key/value store
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ...
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ... ● Driven by Raft
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ... ● Driven by Raft ● Consensus algorithm similar to Paxos
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ... ● Driven by Raft ● Consensus algorithm similar to Paxos ● Designed for understandability and simplicity
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ... ● Driven by Raft ● Consensus algorithm similar to Paxos ● Designed for understandability and simplicity ● Popular and widely used
etcd ● A consistent, highly available key/value store – Shared configuration, distributed locking, ... ● Driven by Raft ● Consensus algorithm similar to Paxos ● Designed for understandability and simplicity ● Popular and widely used – Simple HTTP API + libraries in Go, Java, Python, Ruby, ...
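For instance, setting and reading a key over the v2 HTTP API – 4001 was etcd's default client port at the time; responses abridged:
$ curl -L -X PUT http://127.0.0.1:4001/v2/keys/message -d value="Hello DEVIEW"
{"action":"set","node":{"key":"/message","value":"Hello DEVIEW","modifiedIndex":3,"createdIndex":3}}
$ curl -L http://127.0.0.1:4001/v2/keys/message
{"action":"get","node":{"key":"/message","value":"Hello DEVIEW","modifiedIndex":3,"createdIndex":3}}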
    fleet + etcd ● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view
    fleet + etcd ● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view – What units exist in the cluster?
    fleet + etcd ● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view – What units exist in the cluster? – What machines exist in the cluster?
    fleet + etcd ● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view – What units exist in the cluster? – What machines exist in the cluster? – What are their current states?
    fleet + etcd ● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view – What units exist in the cluster? – What machines exist in the cluster? – What are their current states? ● All unit files, unit state, machine state and scheduling information is stored in etcd
    etcd is great! ● Fast and simple API
    etcd is great! ● Fast and simple API ● Handles all cluster-level/inter-machine communication so we don't have to
    etcd is great! ● Fast and simple API ● Handles all cluster-level/inter-machine communication so we don't have to ● Powerful primitives:
    etcd is great! ● Fast and simple API ● Handles all cluster-level/inter-machine communication so we don't have to ● Powerful primitives: – Compare-and-Swap allows for atomic operations and implementing locking behaviour
    etcd is great! ● Fast and simple API ● Handles all cluster-level/inter-machine communication so we don't have to ● Powerful primitives: – Compare-and-Swap allows for atomic operations and implementing locking behaviour – Watches (like pub-sub) provide event-driven behaviour
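A minimal sketch of the Compare-and-Swap idea against etcd's v2 HTTP API – the key name, TTL and port are assumptions, and this is an illustration rather than fleet's or locksmith's actual code:
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

// tryLock attempts to atomically create /v2/keys/locks/reboot.
// prevExist=false makes the PUT fail if the key already exists,
// which is etcd's atomic create / Compare-and-Swap primitive.
func tryLock(machineID string) (bool, error) {
    form := url.Values{"value": {machineID}, "ttl": {"300"}}
    req, err := http.NewRequest("PUT",
        "http://127.0.0.1:4001/v2/keys/locks/reboot?prevExist=false",
        strings.NewReader(form.Encode()))
    if err != nil {
        return false, err
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    // 201 Created: we won the lock; 412 Precondition Failed: someone else holds it.
    return resp.StatusCode == http.StatusCreated, nil
}

func main() {
    ok, err := tryLock("machine-1234")
    if err != nil {
        panic(err)
    }
    fmt.Println("got lock:", ok)
}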
etcd is... not so great ● Problem: unreliable watches
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture – watches in etcd used to trigger events
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture – watches in etcd used to trigger events ● e.g. “machine down” event triggers unit rescheduling
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture – watches in etcd used to trigger events ● e.g. “machine down” event triggers unit rescheduling – Highly efficient: only take action when necessary
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture – watches in etcd used to trigger events ● e.g. “machine down” event triggers unit rescheduling – Highly efficient: only take action when necessary – Highly responsive: as soon as a user submits a new unit, an event is triggered to schedule that unit
etcd is... not so great ● Problem: unreliable watches – fleet initially used a purely event-driven architecture – watches in etcd used to trigger events ● e.g. “machine down” event triggers unit rescheduling – Highly efficient: only take action when necessary – Highly responsive: as soon as a user submits a new unit, an event is triggered to schedule that unit – Unfortunately, many places for things to go wrong...
● Example: Unreliable watches – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period
● Example: Unreliable watches – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period – Change occurs (e.g. a machine leaves the cluster)
● Example: Unreliable watches – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period – Change occurs (e.g. a machine leaves the cluster) – Event is missed
● Example: Unreliable watches – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period – Change occurs (e.g. a machine leaves the cluster) – Event is missed – fleet doesn't know the machine is lost!
● Example: Unreliable watches – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period – Change occurs (e.g. a machine leaves the cluster) – Event is missed – fleet doesn't know the machine is lost! – Now fleet doesn't know to reschedule the units that were running on that machine
etcd is... not so great ● Problem: limited event history
etcd is... not so great ● Problem: limited event history – etcd retains a history of all events that occur
etcd is... not so great ● Problem: limited event history – etcd retains a history of all events that occur – Can “watch” from an arbitrary point in the past, but...
etcd is... not so great ● Problem: limited event history – etcd retains a history of all events that occur – Can “watch” from an arbitrary point in the past, but... – History is a limited window!
etcd is... not so great ● Problem: limited event history – etcd retains a history of all events that occur – Can “watch” from an arbitrary point in the past, but... – History is a limited window! – With a busy cluster, watches can fall out of this window
etcd is... not so great ● Problem: limited event history – etcd retains a history of all events that occur – Can “watch” from an arbitrary point in the past, but... – History is a limited window! – With a busy cluster, watches can fall out of this window – Can't always replay the event stream from the point we want to
● Example: Limited event history – etcd holds a history of the last 1000 events
● Example: Limited event history – etcd holds a history of the last 1000 events – fleet sets a watch at i=100 to watch for machine loss
● Example: Limited event history – etcd holds a history of the last 1000 events – fleet sets a watch at i=100 to watch for machine loss – Meanwhile, many changes occur in other parts of the keyspace, advancing the index to i=1500
● Example: Limited event history – etcd holds a history of the last 1000 events – fleet sets a watch at i=100 to watch for machine loss – Meanwhile, many changes occur in other parts of the keyspace, advancing the index to i=1500 – Leader election/network hiccup occurs and severs the watch
● Example: Limited event history – etcd holds a history of the last 1000 events – fleet sets a watch at i=100 to watch for machine loss – Meanwhile, many changes occur in other parts of the keyspace, advancing the index to i=1500 – Leader election/network hiccup occurs and severs the watch – fleet tries to recreate the watch at i=100 and fails:
● Example: Limited event history – etcd holds a history of the last 1000 events – fleet sets a watch at i=100 to watch for machine loss – Meanwhile, many changes occur in other parts of the keyspace, advancing the index to i=1500 – Leader election/network hiccup occurs and severs the watch – fleet tries to recreate the watch at i=100 and fails: err="401: The event in requested index is outdated and cleared (the requested history has been cleared [1500/100])"
etcd is... not so great ● Problem: unreliable watches – Missed events lead to unrecoverable situations ● Problem: limited event history – Can't always replay entire event stream
etcd is... not so great ● Problem: unreliable watches – Missed events lead to unrecoverable situations ● Problem: limited event history – Can't always replay entire event stream ● Solution: move to “reconciler” model
Reconciler model In a loop, run periodically until stopped:
Reconciler model In a loop, run periodically until stopped: 1. Retrieve current state (how the world is) and desired state (how the world should be) from datastore (etcd)
Reconciler model In a loop, run periodically until stopped: 1. Retrieve current state (how the world is) and desired state (how the world should be) from datastore (etcd) 2. Determine necessary actions to transform current state --> desired state
Reconciler model In a loop, run periodically until stopped: 1. Retrieve current state (how the world is) and desired state (how the world should be) from datastore (etcd) 2. Determine necessary actions to transform current state --> desired state 3. Perform actions and save results as new current state
Reconciler model Example: fleet's engine (scheduler) looks something like:
for {                                   // loop forever
    select {
    case <-stopChan:                    // if stopped, exit
        return
    case <-time.After(5 * time.Minute):
        units = fetchUnits()
        machines = fetchMachines()
        schedule(units, machines)
    }
}
etcd is... not so great ● Problem: unreliable watches – Missed events lead to unrecoverable situations ● Problem: limited event history – Can't always replay entire event stream ● Solution: move to “reconciler” model
etcd is... not so great ● Problem: unreliable watches – Missed events lead to unrecoverable situations ● Problem: limited event history – Can't always replay entire event stream ● Solution: move to “reconciler” model – Less efficient, but extremely robust
etcd is... not so great ● Problem: unreliable watches – Missed events lead to unrecoverable situations ● Problem: limited event history – Can't always replay entire event stream ● Solution: move to “reconciler” model – Less efficient, but extremely robust – Still many paths for optimisation (e.g. using watches to trigger reconciliations)
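One possible shape of that optimisation – a hedged sketch, not fleet's actual code: a watch simply nudges the same reconcile loop that the timer already drives, so a missed event only delays work until the next periodic pass:
package main

import (
    "fmt"
    "time"
)

func reconcile() {
    // fetch current + desired state, compute and perform actions (as above)
    fmt.Println("reconciling at", time.Now())
}

func main() {
    trigger := make(chan struct{}, 1) // an etcd watch would send on this channel
    stop := make(chan struct{})

    // hypothetical watcher: nudge the loop when something changes in etcd
    go func() {
        time.Sleep(2 * time.Second)
        trigger <- struct{}{}
    }()

    for {
        select {
        case <-stop:
            return
        case <-trigger: // event-driven: reconcile immediately
        case <-time.After(5 * time.Minute): // fallback: reconcile periodically anyway
        }
        reconcile()
    }
}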
fleet – high-level view [Diagram: the cluster vs. a single machine]
golang ● Standard language for all CoreOS projects (above OS)
golang ● Standard language for all CoreOS projects (above OS) – etcd
golang ● Standard language for all CoreOS projects (above OS) – etcd – fleet
golang ● Standard language for all CoreOS projects (above OS) – etcd – fleet – locksmith (semaphore for reboots during updates)
golang ● Standard language for all CoreOS projects (above OS) – etcd – fleet – locksmith (semaphore for reboots during updates) – etcdctl, updatectl, coreos-cloudinit, ...
golang ● Standard language for all CoreOS projects (above OS) – etcd – fleet – locksmith (semaphore for reboots during updates) – etcdctl, updatectl, coreos-cloudinit, ... ● fleet is ~10k LOC (and another ~10k LOC tests)
Go is great! ● Fast!
Go is great! ● Fast! – to write (concise syntax)
Go is great! ● Fast! – to write (concise syntax) – to compile (builds typically <1s)
Go is great! ● Fast! – to write (concise syntax) – to compile (builds typically <1s) – to run tests (O(seconds), including with race detection)
Go is great! ● Fast! – to write (concise syntax) – to compile (builds typically <1s) – to run tests (O(seconds), including with race detection) – Never underestimate the power of rapid iteration
Go is great! ● Fast! – to write (concise syntax) – to compile (builds typically <1s) – to run tests (O(seconds), including with race detection) – Never underestimate the power of rapid iteration ● Simple, powerful tooling
Go is great! ● Fast! – to write (concise syntax) – to compile (builds typically <1s) – to run tests (O(seconds), including with race detection) – Never underestimate the power of rapid iteration ● Simple, powerful tooling – Built-in package management, code coverage, etc.
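All of it comes with the standard toolchain, for example:
$ go get github.com/coreos/fleet/...
$ go build ./...
$ go test -race -cover ./...
$ go vet ./...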
    Go is great! ● Rich standard library
    Go is great! ● Rich standard library – “Batteries are included”
    Go is great! ● Rich standard library – “Batteries are included” – e.g.: completely self-hosted HTTP server, no need for reverse proxies or worker systems to serve many concurrent HTTP requests
    Go is great! ● Rich standard library – “Batteries are included” – e.g.: completely self-hosted HTTP server, no need for reverse proxies or worker systems to serve many concurrent HTTP requests ● Static compilation into a single binary
    Go is great! ● Rich standard library – “Batteries are included” – e.g.: completely self-hosted HTTP server, no need for reverse proxies or worker systems to serve many concurrent HTTP requests ● Static compilation into a single binary – Ideal for a minimal OS with no libraries
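The canonical example – a complete, concurrent HTTP server using nothing but the standard library:
package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, DEVIEW!") // each request is served in its own goroutine
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}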
Go is... not so great
Go is... not so great ● Problem: managing third-party dependencies
Go is... not so great ● Problem: managing third-party dependencies – modular package management but: no versioning
Go is... not so great ● Problem: managing third-party dependencies – modular package management but: no versioning – import “github.com/coreos/fleet” - which SHA?
Go is... not so great ● Problem: managing third-party dependencies – modular package management but: no versioning – import “github.com/coreos/fleet” - which SHA? ● Solution: vendoring :-/
Go is... not so great ● Problem: managing third-party dependencies – modular package management but: no versioning – import “github.com/coreos/fleet” - which SHA? ● Solution: vendoring :-/ – Copy entire source tree of dependencies into repository
Go is... not so great ● Problem: managing third-party dependencies – modular package management but: no versioning – import “github.com/coreos/fleet” - which SHA? ● Solution: vendoring :-/ – Copy entire source tree of dependencies into repository – Slowly maturing tooling: goven, third_party.go, Godep
Go is... not so great
Go is... not so great ● Problem: (relatively) large binary sizes
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS – ~10MB per binary, many tools, quickly adds up
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS – ~10MB per binary, many tools, quickly adds up ● Solutions:
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS – ~10MB per binary, many tools, quickly adds up ● Solutions: – upgrading golang!
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS – ~10MB per binary, many tools, quickly adds up ● Solutions: – upgrading golang! ● go1.2 to go1.3 = ~25% reduction
Go is... not so great ● Problem: (relatively) large binary sizes – “relatively”, but... CoreOS is the minimal OS – ~10MB per binary, many tools, quickly adds up ● Solutions: – upgrading golang! ● go1.2 to go1.3 = ~25% reduction – sharing the binary between tools
    Sharing a binary ● client/daemon often share much of the same code
    Sharing a binary ● client/daemon often share much of the same code – Encapsulate multiple tools in one binary, symlink the different command names, switch off command name
    Sharing a binary ● client/daemon often share much of the same code – Encapsulate multiple tools in one binary, symlink the different command names, switch off command name – Example: fleetd/fleetctl
Sharing a binary ● client/daemon often share much of the same code – Encapsulate multiple tools in one binary, symlink the different command names, switch off command name – Example: fleetd/fleetctl
func main() {
    // dispatch on the name this binary was invoked as
    switch filepath.Base(os.Args[0]) {
    case "fleetctl":
        Fleetctl()
    case "fleetd":
        Fleetd()
    }
}
Sharing a binary ● client/daemon often share much of the same code – Encapsulate multiple tools in one binary, symlink the different command names, switch off command name – Example: fleetd/fleetctl
func main() {
    // dispatch on the name this binary was invoked as
    switch filepath.Base(os.Args[0]) {
    case "fleetctl":
        Fleetctl()
    case "fleetd":
        Fleetd()
    }
}
Before: 9150032 fleetctl  8567416 fleetd
After: 11052256 fleetctl  8 fleetd -> fleetctl
Go is... not so great
Go is... not so great ● Problem: young language => immature libraries
Go is... not so great ● Problem: young language => immature libraries – CLI frameworks
Go is... not so great ● Problem: young language => immature libraries – CLI frameworks – godbus, go-systemd, go-etcd :-(
Go is... not so great ● Problem: young language => immature libraries – CLI frameworks – godbus, go-systemd, go-etcd :-( ● Solutions:
Go is... not so great ● Problem: young language => immature libraries – CLI frameworks – godbus, go-systemd, go-etcd :-( ● Solutions: – Keep it simple
Go is... not so great ● Problem: young language => immature libraries – CLI frameworks – godbus, go-systemd, go-etcd :-( ● Solutions: – Keep it simple – Roll your own (e.g. fleetctl's command line)
fleetctl CLI
type Command struct {
    Name        string
    Summary     string
    Usage       string
    Description string
    Flags       flag.FlagSet
    Run         func(args []string) int
}
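Each subcommand is then just a value of this struct – a hypothetical example (not fleetctl's real definition; the version string is illustrative):
var cmdVersion = Command{
    Name:    "version",
    Summary: "Print the version of fleetctl",
    Usage:   "fleetctl version",
    Run: func(args []string) int {
        fmt.Println("fleetctl version 0.8.3") // illustrative output
        return 0
    },
}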
    Wrap up/recap CoreOS Linux
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in – Containerized applications --> a[u]tom[at]ic updates
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in – Containerized applications --> a[u]tom[at]ic updates  fleet
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in – Containerized applications --> a[u]tom[at]ic updates  fleet – Simple, powerful cluster-level application manager
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in – Containerized applications --> a[u]tom[at]ic updates  fleet – Simple, powerful cluster-level application manager – Glue between local init system (systemd) and cluster-level awareness (etcd)
    Wrap up/recap CoreOS Linux – Minimal OS with cluster capabilities built-in – Containerized applications --> a[u]tom[at]ic updates  fleet – Simple, powerful cluster-level application manager – Glue between local init system (systemd) and cluster-level awareness (etcd) – golang++
    Thank you :-) ● Everything is open source – join us! – https://github.com/coreos ● Any more questions, feel free to email – jonathan.boulle@coreos.com ● CoreOS stickers!
References ● CoreOS updates https://coreos.com/using-coreos/updates/ ● Omaha protocol https://code.google.com/p/omaha/wiki/ServerProtocol ● Raft algorithm http://raftconsensus.github.io/
    References ● fleet https://github.com/coreos/fleet ● etcd https://github.com/coreos/etcd ● toolbox https://github.com/coreos/toolbox ● systemd-docker https://github.com/ibuildthecloud/systemd-docker ● systemd-nspawn http://0pointer.de/public/systemd-man/systemd-nspawn.html
    Brief side-note: locksmith ● Reboot manager for CoreOS ● Uses a semaphore in etcd to co-ordinate reboots ● Each machine in the cluster: 1. Downloads and applies update 2. Takes lock in etcd (using Compare-And-Swap) 3. Reboots and releases lock