DockerCon14 Cluster Management and Containerization

Cluster Management and Containerization

Benjamin Hindman, @benh

Twitter, Inc.
$ whoami

2006 - 2011
2009 -
2010 -

cluster management
(server/IT automation)

cluster management

① configuration/package management
② deployment
③ naming
④ monitoring

ops, developers

cluster management

configuration/package management
naming
deployment


configuration/package management

“what/how do things get installed?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ ssh host ./configure && make install


configuration/package management

“what/how do things get installed?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ ssh host rpm -ivh pkg-x.y.z.rpm
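The hosts.txt plus ssh pattern on these slides amounts to a loop over hosts. A minimal sketch of that loop, assuming the file holds one hostname per line; the helper names `load_hosts` and `ssh_commands` are made up for illustration, and the commands are only built and printed, never executed:

```python
import shlex

# Host list as it appears on the slide.
HOSTS_TXT = """\
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
"""


def load_hosts(text):
    """Return the non-empty, non-comment lines of a hosts.txt-style file."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def ssh_commands(hosts, remote_command):
    """Build one ssh argument list per host (nothing is run here)."""
    return [["ssh", host, remote_command] for host in hosts]


hosts = load_hosts(HOSTS_TXT)
commands = ssh_commands(hosts, "rpm -ivh pkg-x.y.z.rpm")

for argv in commands:
    # shlex.join quotes the remote command so the line can be pasted into a shell.
    print(shlex.join(argv))
```

This works for 10’s of machines; every new host or package is another manual edit to the list or the command.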

deployment

“what should run where?”
“how should it be started/stopped?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ ssh host nohup myapp

deployment

“what should run where?”
“how should it be started/stopped?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ ssh host monit start myapp

deployment

“what should run where?”
“how should it be started/stopped?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ scp myapp host
$ ssh host monit myapp

deployment

“what should run where?”
“how should it be started/stopped?”

(10’s of machines)

hosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

$ ssh host git pull && monit myapp

naming

“how should apps find each other?”

(10’s of machines)

webhosts.txt:
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com

dbhosts.txt:
db1.twttr.com
db2.twttr.com
db3.twttr.com
db4.twttr.com

(10’s of machines)

(100’s -> 1000’s of machines)

to scale, need more automation

Twitter, circa 2010

webhosts.txt, dbhosts.txt (configuration/package management)

$ ssh host … (deployment)

MySQL, Cassandra, Rails, Hadoop, memcached

types of failure: fault domains

machine (disk, memory, CPU, etc)
rack (switch, PDU)
datacenter

challenges ② maintenance

(aka “planned failures”)

maintenance

① upgrading software (i.e., installing and uninstalling packages)
② replacing machines, switches, PDUs, etc

ops, developers

challenges ③ utilization

utilization

[figure: utilization of the Rails, Hadoop, and memcached clusters]

buy fewer machines or run more applications!
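The arithmetic behind that claim can be sketched with made-up numbers. In the example below the per-slot machine demands for Rails, Hadoop, and memcached are hypothetical, not Twitter data: a static partition sizes each cluster for its own peak, while a shared cluster only needs to cover the peak of the combined demand.

```python
# Hypothetical machines needed per time slot by each framework (illustrative only).
demand = {
    "Rails":     [80, 100, 60, 20],
    "Hadoop":    [20, 30, 70, 100],
    "memcached": [40, 40, 40, 40],
}


def static_partition_size(demand):
    """Each framework gets its own cluster, sized for its own peak."""
    return sum(max(series) for series in demand.values())


def shared_cluster_size(demand):
    """One shared cluster, sized for the peak of the summed demand."""
    return max(sum(slot) for slot in zip(*demand.values()))


def mean_utilization(demand, machines):
    """Fraction of machine-slots actually used, averaged over all slots."""
    slots = len(next(iter(demand.values())))
    used = sum(sum(series) for series in demand.values())
    return used / (machines * slots)


static = static_partition_size(demand)
shared = shared_cluster_size(demand)

print(static, round(mean_utilization(demand, static), 2))
print(shared, round(mean_utilization(demand, shared), 2))
```

With these numbers the static partition needs 240 machines and the shared cluster 170, because the Rails and Hadoop peaks fall in different slots.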

challenges

① failures
② maintenance
③ utilization

planning for failure?
planning for utilization?

planning for utilization

intra-machine resource sharing:
share a single machine’s resources between multiple applications (multi-tenancy)

intra-datacenter resource sharing:
share multiple machines’ resources between multiple applications

Twitter, circa 2010

what software can help me!?

cluster management

industry, academia

different software

academia:
• MPI (Message Passing Interface)

industry:
• Apache (mod_perl, mod_php)
• web services (Java, Ruby, …)

different scale (at first)

academia:
• 1,000’s of machines

industry:
• 10’s of machines

cluster management

academia:
• PBS (Portable Batch System)
• TORQUE
• SGE (Sun Grid Engine)

industry:
• ssh
• Puppet/Chef
• Capistrano/Ansible

cluster managers

cluster manager provides a level-of-indirection between hardware resources (machines) and applications/jobs

[diagram: Rails, Hadoop, memcached, … on top of the cluster manager]

cluster management

academia:
• PBS (Portable Batch System)
• TORQUE
• SGE (Sun Grid Engine)

industry:
• ssh
• Puppet/Chef
• Capistrano/Ansible

batch computation!

Mesos is a modern general purpose cluster manager

(i.e., not just focused on batch scheduling)

Mesos

service, batch, storage, streaming, …

support many different types of computation/scheduling

Mesos

service, batch, storage, streaming, …

(1) coordinate for resources
(2) launch tasks
(3) launch tasks
(4) task termination
(5) task status update
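The steps on these slides (coordinate for resources, launch tasks, task termination, task status update) can be sketched as a toy model. This is not the Mesos API: `ToyMaster`, `Offer`, `Task`, `first_fit`, and the status strings are all invented for illustration, and a real scheduler talks to Mesos through its scheduler interface instead.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    machine: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class ToyMaster:
    free: dict                                    # machine -> (cpus, mem_mb) still unallocated
    running: dict = field(default_factory=dict)   # task name -> (machine, Task)
    updates: list = field(default_factory=list)   # (task name, status) in arrival order

    def offers(self):
        """(1) Offer whatever is currently free on each machine."""
        return [Offer(m, c, mem) for m, (c, mem) in self.free.items() if c > 0 and mem > 0]

    def launch(self, offer, task):
        """(2)/(3) Launch a task against an offer and deduct its resources."""
        cpus, mem = self.free[offer.machine]
        if task.cpus > cpus or task.mem_mb > mem:
            raise ValueError("task does not fit in offer")
        self.free[offer.machine] = (cpus - task.cpus, mem - task.mem_mb)
        self.running[task.name] = (offer.machine, task)
        self.updates.append((task.name, "RUNNING"))

    def terminate(self, name):
        """(4)/(5) Task ends: return its resources and record a status update."""
        machine, task = self.running.pop(name)
        cpus, mem = self.free[machine]
        self.free[machine] = (cpus + task.cpus, mem + task.mem_mb)
        self.updates.append((name, "FINISHED"))


def first_fit(offers, task):
    """A scheduler's placement decision: take the first offer that fits, else None."""
    for offer in offers:
        if offer.cpus >= task.cpus and offer.mem_mb >= task.mem_mb:
            return offer
    return None


master = ToyMaster(free={"m1": (4.0, 8192)})
task = Task("web", cpus=1.0, mem_mb=1024)

offer = first_fit(master.offers(), task)
master.launch(offer, task)
free_while_running = master.free["m1"]
master.terminate("web")
print(master.updates)
```

The point of the split is that the placement decision (`first_fit` here) lives in each scheduler, while the master only tracks which resources are free.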

challenges revisited
① failures

② maintenance

③ utilization

Mesos

service, batch, storage, streaming, …

(1) when resources become idle, they can be scheduled and reused by other schedulers
(2) multi-tenancy on individual machines

multi-tenancy

[diagram: containers on one machine, each running a task]

containerization

started leveraging containerization technology in ~2011

timeline (2011 to 2014): LXC, cgroups, Docker (preliminary)

how Mesos has changed cluster management at Twitter today


configuration/package management

developers:
(1) bundle services as jar, tar/gzip
(2) upload to HDFS


configuration/package management (planning)

developers:
(1) bundle services as jar, tar/gzip, and using Docker
(2) upload to HDFS (or use registry)

deployment

Apache Aurora (incubating), a scheduler for running stateless services written in any language (but primarily used at Twitter for JVM services)

deployment (via Aurora)

developers:
(1) describe service using Python based DSL
(2) submit service to Aurora using CLI

deployment (via Marathon)

developers:
(1) describe services using JSON
(2) submit service to Marathon via REST
naming

using Apache ZooKeeper and server sets (github.com/twitter/commons)

(1) task gets launched on machine
(2) service gets registered in a server set in ZooKeeper
(3) other services use ZooKeeper to find services they need
(4) services connect directly with one another
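The register-then-discover flow above can be sketched with an in-memory stand-in. `ToyServerSet` below is not ZooKeeper and not the twitter/commons server set API; it only mimics the idea that each member is tied to a session and disappears when that session ends. Hostnames, ports, and session ids are invented.

```python
class ToyServerSet:
    """In-memory stand-in for a ZooKeeper-backed server set (illustration only)."""

    def __init__(self):
        self._members = {}  # service name -> {session id: (host, port)}

    def join(self, service, session, host, port):
        """(2) A launched task registers its endpoint under the service name."""
        self._members.setdefault(service, {})[session] = (host, port)

    def expire(self, session):
        """A dead task's session ends, so its entries vanish from every set."""
        for members in self._members.values():
            members.pop(session, None)

    def lookup(self, service):
        """(3) Clients ask for the current endpoints of a service."""
        return sorted(self._members.get(service, {}).values())


registry = ToyServerSet()
registry.join("web", "session-1", "web1.twttr.com", 31000)
registry.join("web", "session-2", "web2.twttr.com", 31001)

before = registry.lookup("web")
registry.expire("session-1")   # the task on web1 went away
after = registry.lookup("web")

# (4) A client would now connect directly to one of the endpoints in `after`.
print(before)
print(after)
```

The lookup happens inside each client, which is why this approach needs a library linked into the services that use it.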

naming alternative

(1) task gets launched on machine
(2) update HAProxy with new service location
(3) other services send traffic through HAProxy

ZooKeeper/server sets require injecting code into your clients!

where are we today?

ops, developers

deploys decoupled from ops (many deploys per day, per service)

maintenance consists of “draining” hosts, getting tasks rescheduled, then pulling the cord
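Draining can be sketched as a pure function over a placement map. This is a simplified illustration: the `drain` helper and the placement data are hypothetical, resources are ignored, and tasks simply move to whichever remaining host currently runs the fewest tasks.

```python
def drain(placement, host):
    """Reschedule every task off `host`, then drop the host from the placement."""
    remaining = {h: list(tasks) for h, tasks in placement.items() if h != host}
    if not remaining:
        raise ValueError("no other host to reschedule onto")
    for task in placement.get(host, []):
        # Pick the least-loaded remaining host; ties go to the first name in sorted order.
        target = min(sorted(remaining), key=lambda h: len(remaining[h]))
        remaining[target].append(task)
    return remaining


# Hypothetical placement: host -> tasks currently running there.
placement = {"web1": ["a", "b"], "web2": ["c"], "web3": []}

after_drain = drain(placement, "web1")
print(after_drain)
```

Once the returned placement no longer mentions the host, it is safe to pull the cord on it.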

wait … don’t virtual machines solve my cluster management challenges?

No. VMs are neither sufficient nor necessary!

big computers

small applications

① failures
② maintenance
③ utilization

with public or private IaaS, failures still occur (on EC2 you have availability zones instead of racks, and regions instead of datacenters)

① failures
② maintenance
③ utilization

provider wins with public IaaS, better resource sharing with private IaaS, but a static partition of VMs is still a static partition!

Mesos: level of abstraction

IaaS: provision and manage machines
Mesos: build and run using resources

Mesos on IaaS/hardware

[diagram: Mesos on top of IaaS or hardware]

use OpenStack or EC2 or physical machines to run Mesos

aggregation, not virtualization

physical machines, virtual machines
physical machines, datacenter computer

small computers

big applications?

power wall

[figure: time axis; complex single core, then simple many core]

applications don’t fit on a single computer anymore

(1) lots of data …
(2) lots of users …

growing every day

these applications need lots of resources (CPUs, memory, I/O)

these applications need datacenters

the datacenter is the new computer

desktop computer: OS
server: OS
datacenter: OS

the datacenter computer needs an OS

operating system

“a collection of software that manages the computer hardware resources and provides common services for computer programs”
- Wikipedia

datacenter operating system

“a collection of software that manages the datacenter computer hardware resources and provides common services for computer programs”
- Wikipedia

today / tomorrow

Your App
API
datacenter OS

provides common functionality every new distributed system re-implements:

• failure detection
• package distribution
• task starting
• resource isolation
• resource monitoring
• task killing, cleanup
• …

Don’t reinvent the wheel!

case studies

distributed in-memory analytics framework: github.com/apache/spark
distributed cron scheduler (with dependencies): github.com/airbnb/chronos

cluster management w/ Docker + Mesos

① configuration/package management
② deployment
③ naming

datacenter OS

[diagram: your distributed system on Mesos, on IaaS or hardware]

provide common functionality via an API (kernel)

Mesos 0.19.0 released today!

mesos.apache.org
mesos.apache.org/blog
@ApacheMesos
