Dr. Bernd Mathiske

Senior Software Architect

Mesosphere

Why the Datacenter needs an Operating System

1
Bringing Google-Scale Computing to Everybody
A Slice of Google Tech Transfer History
2005: MapReduce -> Hadoop (Yahoo)
2007: Linux cgroups for lightweight isolation (Google)
2009: BigTable -> HBase, MongoDB, and the wider NoSQL wave
2009: “The Datacenter as a Computer” - Barroso, Hölzle (Google)



2009: Mesos - a distributed operating system kernel (UC Berkeley)
2010: Large scale production Mesos deployment (Twitter)
since 2010: Many more frameworks and quite a few meta-frameworks

Notable Operating System Developments
Single-something => multi-something: user, tasking, threading, core, …

More: bits, memory, storage, bandwidth…

OS virtualization => lightweight virtualization (cgroups, LXCs, jails, …)

Packaging => containers (docker, rkt, lmctfy, …)

Static libraries => dynamic libraries => static libraries

4
Cluster Operating Systems (Hardware Clustering)
Researched since the 1980s

Trying to provide (the illusion of) a single system image

Aiming at HA, load balancing, location transparency (e.g. for storage)

Many systems: Amoeba, ChorusOS, GLUnix, Hurricane, MOSIX, Plan9, RHCS,
Spring, Sprite, Sumo, QNX, Solaris MC, UnixWare, VAXclusters, …



Relatively low scale (up to 100s of nodes) 

Complicated to manage, less dynamic than software clustering

5
From HPC Grid to Enterprise Cloud
Condor, LSF, Maui, Moab, Quartz, SLURM, …

Typically for batch jobs

Also cover services => SOA => more job schedulers

=> grid computing => grid middleware … => cloud stacks

6
From Server Virtualization to App Aggregation
Client-Server Era: small apps, big servers => Server Virtualization (one server hosts App | App | App | App)

Cloud Era: big apps, small servers => App Aggregation (one App spans Serv | Serv | Serv | Serv)
Cloud Computing
SaaS: Salesforce demonstrated success, then many followed

PaaS: Deis, Dotcloud, OpenShift, Heroku, Pivotal, Stackato, …

IaaS: AWS, Azure, DigitalOcean, GCE…

Private cloud stacks including IaaS: Eucalyptus, CloudStack,
Joyent, OpenStack, SmartCloud, vSphere, …

8
Datacenter
✴ A facility used to house computer systems and associated
components (e.g. networking, storage, cooling, sensors)

✴ In this talk we focus on how to manage and use a single
production cluster of networked computers in a datacenter

✴ Such clusters range in size from 10s to 10000s of nodes

✴ Why should we, and how can we, end up with just one production cluster?

9
Datacenter Services
✴ LAMP (Linux, Apache, MySQL, PHP) or similar

✴ MEAN (MongoDB, Express.js, Angular.js, Node.js) or similar

✴ Cassandra, ElasticSearch, Exelixi, Hadoop, Hypertable, Jenkins,
Kafka, MPI, Spark, Storm, SSSP, Torque, …

✴ Private PaaS: Deis, …

✴ …
10
Operate your Laptop like your Datacenter?
From Static Partitioning to Elastic Sharing
[Chart: under Static Partitioning, dedicated WEB, HADOOP, and CACHE partitions each run far below 100%, leaving most capacity WASTED; under Elastic Sharing, the three workloads share one pool and unused capacity remains FREE headroom instead of being wasted.]
Software Clustering
Layer between the node OS and application frameworks

Scale
Multi-tenancy
High availability
Available Open Source Components
✴ 2-level scheduler: Apache Mesos

✴ Meta-frameworks / schedulers: Aurora, Chronos, Marathon,
Kubernetes, Swarm, …

✴ Service discovery: Consul, HAProxy, Mesos DNS, …

✴ Highly available configuration: zk, etcd, …

✴ Storage: HDFS, Ceph, …

✴ Node OSs: lots of Linux variants

✴ Lots of app frameworks: Spark, Storm, Cassandra, Kafka, …
14
2-Level Scheduling
Scale: from 1 node to at least 10000s of nodes

Optimizing resource management

End-to-end principle: “application-specific functions ought to reside
in the end nodes of a network rather than intermediary nodes”

-> Requirement for general multi-tenancy

-> Requirement for having only one production cluster
15
How Mesos Works
16
[Diagram: a Framework consists of a Scheduler plus Executors. The Scheduler registers with the leading Mesos Master; standby Masters coordinate leader election via zk/etcd. Each Slave node offers its resources through the Master and hosts the framework's Executors, which run the Tasks that make up the App.]
Ways to Run an Application
1. Vanilla job

• Employ meta-framework for invocation: Chronos, Aurora, Kubernetes, …

2. Application of an adapted framework

• Hadoop, Spark, Storm, ElasticSearch, Cassandra, Kafka, many more…

3. Non-adapted services

• Employ meta-framework for invocation: Marathon, Aurora, Kubernetes, … (see the sketch after this list)

• Provide (select) a service discovery solution

4. Program your own scheduler (and executor)
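
As a concrete sketch of option 3: launching a non-adapted service via Marathon is just an HTTP request carrying an app definition. This is a minimal sketch of a Marathon v2 app definition; the app id, command, and sizes are made-up example values:

POST /v2/apps
{
  "id": "/my-service",
  "cmd": "python -m SimpleHTTPServer $PORT0",
  "cpus": 0.25,
  "mem": 64,
  "instances": 2
}

Marathon then keeps two instances of this command running somewhere on the cluster and restarts them on failure.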

17
The Mesos Framework API
✴ Currently the same mechanism as internal Mesos communication:

• protobuf messages over HTTP

✴ Soon:

• JSON messages over HTTP (stream)

=> no need to link against the native Mesos library, and less to reimplement (a sketch of the JSON API follows below)

bindings for ca. a dozen programming languages => any language
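
To make the planned JSON API concrete, a framework registration might look roughly like this (shape modeled on the eventual Mesos v1 scheduler API; treat the exact field names as illustrative):

POST /api/v1/scheduler
{
  "type": "SUBSCRIBE",
  "subscribe": {
    "framework_info": { "user": "demo", "name": "MyFramework" }
  }
}

The master answers with an event stream (SUBSCRIBED, OFFERS, UPDATE, …) over the same HTTP connection.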
18
How to Implement a Framework
✴ Scheduler interface: one half of 2-level scheduling

• The framework knows best when to do what with what kind of resources

• About a dozen callbacks, main functionality in 2 of them:

- receive resource offers

- receive task status updates

✴ Executor interface: task life-cycle management and monitoring

• Command line executor included in Mesos

• Docker executor included in Mesos

• Custom executors often not needed
19
Scheduler SPI (implemented by Framework)
20
public interface Scheduler {
  void registered(SchedulerDriver driver, FrameworkID frameworkId,
                  MasterInfo masterInfo);
  void reregistered(SchedulerDriver driver, MasterInfo masterInfo);
  void resourceOffers(SchedulerDriver driver, List<Offer> offers);
  void offerRescinded(SchedulerDriver driver, OfferID offerId);
  void statusUpdate(SchedulerDriver driver, TaskStatus status);
  void frameworkMessage(SchedulerDriver driver, ExecutorID executorId,
                        SlaveID slaveId, byte[] data);
  void disconnected(SchedulerDriver driver);
  void slaveLost(SchedulerDriver driver, SlaveID slaveId);
  void executorLost(SchedulerDriver driver, ExecutorID executorId,
                    SlaveID slaveId, int status);
  void error(SchedulerDriver driver, String message);
}
Minimal Scheduler Implementation
class MyFrameworkScheduler implements Scheduler {
  …
  private TaskGenerator _taskGen;

  public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
    if (_taskGen.doneCreatingTasks()) {
      // Nothing left to launch: give the offered resources back right away.
      for (Offer offer : offers) {
        driver.declineOffer(offer.getId());
      }
    } else {
      for (Offer offer : offers) {
        List<TaskInfo> taskInfos = _taskGen.generateTaskInfos(offer);
        driver.launchTasks(offer.getId(), taskInfos, _filters);
      }
    }
  }

  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    _taskGen.observeTaskStatusUpdate(status);
    if (_taskGen.done()) {
      driver.stop();
    }
  }
  …
}
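
The TaskGenerator above is elided on the slide. Purely for illustration, a minimal version of its generateTaskInfos could build tasks with the Mesos Java protobuf API (org.apache.mesos.Protos) like this; the class name, fields, and resource sizes are hypothetical example values:

import java.util.ArrayList;
import java.util.List;
import org.apache.mesos.Protos.*;

class TaskGenerator {
  private int _tasksLaunched = 0;
  private final int _totalTasks = 5;  // arbitrary example workload size

  boolean doneCreatingTasks() { return _tasksLaunched >= _totalTasks; }

  // Build one task per offer, taking 1 CPU and 128 MB from the offer and
  // running under the built-in command executor. A real scheduler would
  // first check that the offer actually contains enough cpus/mem.
  List<TaskInfo> generateTaskInfos(Offer offer) {
    TaskID taskId = TaskID.newBuilder()
        .setValue("task-" + (++_tasksLaunched)).build();
    TaskInfo task = TaskInfo.newBuilder()
        .setName(taskId.getValue())
        .setTaskId(taskId)
        .setSlaveId(offer.getSlaveId())  // run on the offering node
        .addResources(scalar("cpus", 1.0))
        .addResources(scalar("mem", 128.0))
        .setCommand(CommandInfo.newBuilder().setValue("echo hello"))
        .build();
    List<TaskInfo> tasks = new ArrayList<TaskInfo>();
    tasks.add(task);
    return tasks;
  }

  private static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }
}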
21
The Developer’s Perspective
✴ Focus on application logic, not datacenter structure

✴ Avoid networking-related code

✴ Reuse built-in fault-tolerance and high availability

✴ Reuse distributed (infrastructure) frameworks (e.g., storage)

=> API, SDK for datacenter services
22
The Operations Engineer’s Perspective
✴ Ease of deployment/management

✴ Uniformity of deployment/management

✴ Hardware utilization rate

✴ Scaling up as business grows

✴ Scaling out sporadically 

✴ Cost and time for moving to a different datacenter

✴ High availability and fault-tolerance of system services

✴ Monitoring

✴ Troubleshooting
23
Necessary Multi-Tenancy Features
Task containerization

Resource isolation

Resource and task attributes

Static and dynamic resource reservations

Reservation levels

Meta-frameworks

Dynamic scheduler update and reconfiguration

Security

24
Desirable Multi-Tenancy Features
Optimistic offers

Oversubscription

Task preemption, migration, resizing, reconfiguration

Rate limiting

Auto-scaling => hybrid cloud

Infrastructure frameworks

25
Using Docker Containers in Mesos
26
Mesos Master Server
init
 |
 + mesos-master
 |
 + marathon

Mesos Slave Server
init
 |
 + docker
 |  |
 |  + lxc
 |     |
 |     + (user task, under container init system)
 |
 + mesos-slave
    |
    + /var/lib/mesos/executors/docker
       |
       + docker run …

When a user requests a container, Mesos, LXC, and Docker are tied together for launch; the image is pulled from a Docker Registry. (The original slide's numbered callouts 1-8 trace this launch sequence.)
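
In practice, requesting such a container through Marathon is again a small app definition; a sketch, with the image name and sizes as example values:

POST /v2/apps
{
  "id": "/docker-hello",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "busybox", "network": "BRIDGE" }
  },
  "cmd": "while true; do echo hello; sleep 5; done",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1
}

Marathon forwards the container spec to a mesos-slave, which hands it to the Docker executor shown above.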
Other Schedulers as Meta-Frameworks in a 2-level Scheduler

YARN => https://github.com/mesos/myriad

Kubernetes => https://github.com/mesosphere/kubernetes-mesos

Swarm => Swarm on Mesos (new project)

=> run everything in one cluster

27
Myriad: Virtual YARN Clusters on Mesos
28
• POST /api/clusters: Registers a new YARN cluster

• GET /api/clusters: Lists all registered clusters

• GET /api/clusters/{clusterId}: Shows the cluster with {clusterId}

• PUT /api/clusters/{clusterId}/flexup: Expands the size of the cluster with {clusterId} (example below)

• PUT /api/clusters/{clusterId}/flexdown: Shrinks the size of the cluster with {clusterId}

• DELETE /api/clusters/{clusterId}: Unregisters the YARN cluster with {clusterId} and kills all its nodes
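
As a usage sketch, and assuming the flexup endpoint accepts an instance count in a JSON body (the payload schema is not shown here, so treat the field as hypothetical), expanding a registered cluster might look like:

PUT /api/clusters/myCluster/flexup
{
  "instances": 2
}

Each added instance makes Myriad launch another YARN NodeManager on Mesos, as the diagram below sketches.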
[Diagram: the Myriad Scheduler runs beside the YARN ResourceManager (RM) on the master node. On a flexUp, Mesos offers slave resources (e.g. 2.5 CPU / 2.5 GB) to a Myriad Executor, which performs "1. Launch NodeManager"; the new NodeManager (NM) then hosts YARN containers (C1, C2) within its share (e.g. 2.0 CPU / 2.0 GB).]
29
Kubernetes in Mesos
Portability
30
[Diagram: Framework Apps, Meta-Frameworks, Vanilla Apps, and Infrastructure Frameworks all run on Mesos, and Mesos itself runs unchanged on a Public Cloud, a Managed Cloud, or Your Own DC.]
The Application User’s Perspective
✴ Focus on apps, services, parameters, results

✴ Avoid dealing with datacenter operations/management

✴ Avoid adjusting system settings

✴ High availability

✴ Throughput

✴ Responsiveness

✴ Predictability

✴ Run everything I need

✴ Return on and safety of investment
31
The Datacenter is the new form factor
✴ 2-level scheduler => single production cluster

✴ scalability and portability => avoiding hardware/cloud lock-in

✴ built-in container support => running containers at scale

✴ automation => operator efficiency

✴ repositories => apps/services readily available

✴ API and SDK => productive/quick app/service development
32
33
Above the Clouds
with Open Source!
