Dr. Bernd Mathiske

Senior Software Architect

Mesosphere

Why the Datacenter needs an Operating System

1
Bringing Google-Scale Computing to Everybody
A Slice of Google Tech Transfer History
2005: MapReduce -> Hadoop (Yahoo)
2007: Linux cgroups for lightweight isolation (Google)
2009: BigTable -> HBase, MongoDB, and the wider NoSQL wave
2009: “The Datacenter as a Computer” - Barroso, Hölzle (Google)



2009: Mesos - a distributed operating system kernel (UC Berkeley)
2010: Large scale production Mesos deployment (Twitter)
since 2010: Many more frameworks and quite a few meta-frameworks

Notable Operating System Developments
Single-something => multi-something: user, tasking, threading, core, …

More: bits, memory, storage, bandwidth…

OS virtualization => lightweight virtualization (cgroups, LXCs, jails, …)

Packaging => containers (docker, rkt, lmctfy, …)

Static libraries => dynamic libraries => static libraries

4
Cluster Operating Systems (Hardware Clustering)
Researched since the 1980s

Trying to provide (the illusion of) a single system image

Aiming at HA, load balancing, location transparency (e.g. for storage)

Many systems: Amoeba, ChorusOS, GLUnix, Hurricane, MOSIX, Plan9, RHCS,
Spring, Sprite, Sumo, QNX, Solaris MC, UnixWare, VAXclusters, …



Relatively low scale (up to 100s of nodes) 

Complicated to manage, less dynamic than software clustering

5
From HPC Grid to Enterprise Cloud
Condor, LSF, Maui, Moab, Quartz, SLURM, …

Typically for batch jobs

Also cover services => SOA => more job schedulers

=> grid computing => grid middleware … => cloud stacks

6
From Server Virtualization to App Aggregation
Client-Server Era: small apps, big servers => Server Virtualization (one server hosts App | App | App | App)

Cloud Era: big apps, small servers => App Aggregation (one App spans Serv | Serv | Serv | Serv)
Cloud Computing
SaaS: Salesforce demonstrated success, then many followed

PaaS: Deis, Dotcloud, OpenShift, Heroku, Pivotal, Stackato, …

IaaS: AWS, Azure, DigitalOcean, GCE…

Private cloud stacks including IaaS: Eucalyptus, CloudStack,
Joyent, OpenStack, SmartCloud, vSphere, …

8
Datacenter
✴ A facility used to house computer systems and associated
components (e.g. networking, storage, cooling, sensors)

✴ In this talk we focus on how to manage and use a single
production cluster of networked computers in a datacenter

✴ Such clusters range in size from 10s to 10000s of nodes

✴ Why should we, and how can we, end up with just one production cluster?

9
Datacenter Services
✴ LAMP (Linux, Apache, MySQL, PHP) or similar

✴ MEAN (MongoDB, Express.js, Angular.js, Node.js) or similar

✴ Cassandra, ElasticSearch, Exelixi, Hadoop, Hypertable, Jenkins,
Kafka, MPI, Spark, Storm, SSSP, Torque, …

✴ Private PaaS: Deis, …

✴ …
10
Operate your Laptop like your Datacenter?
From Static Partitioning to Elastic Sharing
[Chart: under Static Partitioning, dedicated WEB, HADOOP, and CACHE partitions each run far below 100%, leaving most capacity WASTED; under Elastic Sharing, the three workloads share one pool and unused capacity remains FREE headroom instead of being wasted.]
Software Clustering
Layer between the node OS and application frameworks

Scale
Multi-tenancy
High availability
Available Open Source Components
✴ 2-level scheduler: Apache Mesos

✴ Meta-frameworks / schedulers: Aurora, Chronos, Marathon,
Kubernetes, Swarm, …

✴ Service discovery: Consul, HAProxy, Mesos DNS, …

✴ Highly available configuration: zk, etcd, …

✴ Storage: HDFS, Ceph, …

✴ Node OSs: lots of Linux variants

✴ Lots of app frameworks: Spark, Storm, Cassandra, Kafka, …
14
2-Level Scheduling
Scale: from 1 node to at least 10000s of nodes

Optimizing resource management

End-to-end principle: “application-specific functions ought to reside
in the end nodes of a network rather than intermediary nodes”

-> Requirement for general multi-tenancy

-> Requirement for having only one production cluster
15
How Mesos Works
16
[Diagram: a Framework consists of a Scheduler plus Executors. The Scheduler registers with the leading Mesos Master; standby Masters coordinate leader election via zk/etcd. Each Slave node offers its resources through the Master and hosts the framework's Executors, which run the Tasks that make up the App.]
Ways to Run an Application
1. Vanilla job

• Employ meta-framework for invocation: Chronos, Aurora, Kubernetes, …

2. Application of an adapted framework

• Hadoop, Spark, Storm, ElasticSearch, Cassandra, Kafka, many more…

3. Non-adapted services

• Employ meta-framework for invocation: Marathon, Aurora, Kubernetes, … (see the sketch after this list)

• Provide (select) a service discovery solution

4. Program your own scheduler (and executor)
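
As a concrete sketch of option 3: launching a non-adapted service via Marathon is just an HTTP request carrying an app definition. This is a minimal sketch of a Marathon v2 app definition; the app id, command, and sizes are made-up example values:

POST /v2/apps
{
  "id": "/my-service",
  "cmd": "python -m SimpleHTTPServer $PORT0",
  "cpus": 0.25,
  "mem": 64,
  "instances": 2
}

Marathon then keeps two instances of this command running somewhere on the cluster and restarts them on failure.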

17
The Mesos Framework API
✴ Currently the same mechanism as internal Mesos communication:

• protobuf messages over HTTP

✴ Soon:

• JSON messages over HTTP (stream)

=> no need to link against the native Mesos library, and less to reimplement (a sketch of the JSON API follows below)

bindings for ca. a dozen programming languages => any language
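
To make the planned JSON API concrete, a framework registration might look roughly like this (shape modeled on the eventual Mesos v1 scheduler API; treat the exact field names as illustrative):

POST /api/v1/scheduler
{
  "type": "SUBSCRIBE",
  "subscribe": {
    "framework_info": { "user": "demo", "name": "MyFramework" }
  }
}

The master answers with an event stream (SUBSCRIBED, OFFERS, UPDATE, …) over the same HTTP connection.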
18
How to Implement a Framework
✴ Scheduler interface: one half of 2-level scheduling

• The framework knows best when to do what with what kind of resources

• About a dozen callbacks, main functionality in 2 of them:

- receive resource offers

- receive task status updates

✴ Executor interface: task life-cycle management and monitoring

• Command line executor included in Mesos

• Docker executor included in Mesos

• Custom executors often not needed
19
Scheduler SPI (implemented by Framework)
20
public interface Scheduler {
  void registered(SchedulerDriver driver, FrameworkID frameworkId,
                  MasterInfo masterInfo);
  void reregistered(SchedulerDriver driver, MasterInfo masterInfo);
  void resourceOffers(SchedulerDriver driver, List<Offer> offers);
  void offerRescinded(SchedulerDriver driver, OfferID offerId);
  void statusUpdate(SchedulerDriver driver, TaskStatus status);
  void frameworkMessage(SchedulerDriver driver, ExecutorID executorId,
                        SlaveID slaveId, byte[] data);
  void disconnected(SchedulerDriver driver);
  void slaveLost(SchedulerDriver driver, SlaveID slaveId);
  void executorLost(SchedulerDriver driver, ExecutorID executorId,
                    SlaveID slaveId, int status);
  void error(SchedulerDriver driver, String message);
}
Minimal Scheduler Implementation
class MyFrameworkScheduler implements Scheduler {
  …
  private TaskGenerator _taskGen;

  public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
    if (_taskGen.doneCreatingTasks()) {
      // Nothing left to launch: give the offered resources back right away.
      for (Offer offer : offers) {
        driver.declineOffer(offer.getId());
      }
    } else {
      for (Offer offer : offers) {
        List<TaskInfo> taskInfos = _taskGen.generateTaskInfos(offer);
        driver.launchTasks(offer.getId(), taskInfos, _filters);
      }
    }
  }

  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    _taskGen.observeTaskStatusUpdate(status);
    if (_taskGen.done()) {
      driver.stop();
    }
  }
  …
}
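
The TaskGenerator above is elided on the slide. Purely for illustration, a minimal version of its generateTaskInfos could build tasks with the Mesos Java protobuf API (org.apache.mesos.Protos) like this; the class name, fields, and resource sizes are hypothetical example values:

import java.util.ArrayList;
import java.util.List;
import org.apache.mesos.Protos.*;

class TaskGenerator {
  private int _tasksLaunched = 0;
  private final int _totalTasks = 5;  // arbitrary example workload size

  boolean doneCreatingTasks() { return _tasksLaunched >= _totalTasks; }

  // Build one task per offer, taking 1 CPU and 128 MB from the offer and
  // running under the built-in command executor. A real scheduler would
  // first check that the offer actually contains enough cpus/mem.
  List<TaskInfo> generateTaskInfos(Offer offer) {
    TaskID taskId = TaskID.newBuilder()
        .setValue("task-" + (++_tasksLaunched)).build();
    TaskInfo task = TaskInfo.newBuilder()
        .setName(taskId.getValue())
        .setTaskId(taskId)
        .setSlaveId(offer.getSlaveId())  // run on the offering node
        .addResources(scalar("cpus", 1.0))
        .addResources(scalar("mem", 128.0))
        .setCommand(CommandInfo.newBuilder().setValue("echo hello"))
        .build();
    List<TaskInfo> tasks = new ArrayList<TaskInfo>();
    tasks.add(task);
    return tasks;
  }

  private static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }
}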
21
The Developer’s Perspective
✴ Focus on application logic, not datacenter structure

✴ Avoid networking-related code

✴ Reuse built-in fault-tolerance and high availability

✴ Reuse distributed (infrastructure) frameworks (e.g., storage)

=> API, SDK for datacenter services
22
The Operations Engineer’s Perspective
✴ Ease of deployment/management

✴ Uniformity of deployment/management

✴ Hardware utilization rate

✴ Scaling up as business grows

✴ Scaling out sporadically 

✴ Cost and time for moving to a different datacenter

✴ High availability and fault-tolerance of system services

✴ Monitoring

✴ Troubleshooting
23
Necessary Multi-Tenancy Features
Task containerization

Resource isolation

Resource and task attributes

Static and dynamic resource reservations

Reservation levels

Meta-frameworks

Dynamic scheduler update and reconfiguration

Security

24
Desirable Multi-Tenancy Features
Optimistic offers

Oversubscription

Task preemption, migration, resizing, reconfiguration

Rate limiting

Auto-scaling => hybrid cloud

Infrastructure frameworks

25
Using Docker Containers in Mesos
26
Mesos Master Server
init
 |
 + mesos-master
 |
 + marathon

Mesos Slave Server
init
 |
 + docker
 |  |
 |  + lxc
 |     |
 |     + (user task, under container init system)
 |
 + mesos-slave
    |
    + /var/lib/mesos/executors/docker
       |
       + docker run …

When a user requests a container, Mesos, LXC, and Docker are tied together for launch; the image is pulled from a Docker Registry. (The original slide's numbered callouts 1-8 trace this launch sequence.)
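
In practice, requesting such a container through Marathon is again a small app definition; a sketch, with the image name and sizes as example values:

POST /v2/apps
{
  "id": "/docker-hello",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "busybox", "network": "BRIDGE" }
  },
  "cmd": "while true; do echo hello; sleep 5; done",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1
}

Marathon forwards the container spec to a mesos-slave, which hands it to the Docker executor shown above.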
Other Schedulers as Meta-Frameworks in a 2-level Scheduler

YARN => https://github.com/mesos/myriad

Kubernetes => https://github.com/mesosphere/kubernetes-mesos

Swarm => Swarm on Mesos (new project)

=> run everything in one cluster

27
Myriad: Virtual YARN Clusters on Mesos
28
• POST /api/clusters: Registers a new YARN cluster

• GET /api/clusters: Lists all registered clusters

• GET /api/clusters/{clusterId}: Shows the cluster with {clusterId}

• PUT /api/clusters/{clusterId}/flexup: Expands the size of the cluster with {clusterId} (example below)

• PUT /api/clusters/{clusterId}/flexdown: Shrinks the size of the cluster with {clusterId}

• DELETE /api/clusters/{clusterId}: Unregisters the YARN cluster with {clusterId} and kills all its nodes
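
As a usage sketch, and assuming the flexup endpoint accepts an instance count in a JSON body (the payload schema is not shown here, so treat the field as hypothetical), expanding a registered cluster might look like:

PUT /api/clusters/myCluster/flexup
{
  "instances": 2
}

Each added instance makes Myriad launch another YARN NodeManager on Mesos, as the diagram below sketches.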
[Diagram: the Myriad Scheduler runs beside the YARN ResourceManager (RM) on the master node. On a flexUp, Mesos offers slave resources (e.g. 2.5 CPU / 2.5 GB) to a Myriad Executor, which performs "1. Launch NodeManager"; the new NodeManager (NM) then hosts YARN containers (C1, C2) within its share (e.g. 2.0 CPU / 2.0 GB).]
29
Kubernetes in Mesos
Portability
30
[Diagram: Framework Apps, Meta-Frameworks, Vanilla Apps, and Infrastructure Frameworks all run on Mesos, and Mesos itself runs unchanged on a Public Cloud, a Managed Cloud, or Your Own DC.]
The Application User’s Perspective
✴ Focus on apps, services, parameters, results

✴ Avoid dealing with datacenter operations/management

✴ Avoid adjusting system settings

✴ High availability

✴ Throughput

✴ Responsiveness

✴ Predictability

✴ Run everything I need

✴ Return on and safety of investment
31
The Datacenter is the new form factor
✴ 2-level scheduler => single production cluster

✴ scalability and portability => avoiding hardware/cloud lock-in

✴ built-in container support => running containers at scale

✴ automation => operator efficiency

✴ repositories => apps/services readily available

✴ API and SDK => productive/quick app/service development
32
33
Above the Clouds
with Open Source!
