Mesos
A Platform for Fine-Grained Resource Sharing in the Data Center
@PapersWeLoveSEA
@ankurcha - Ankur Chauhan
Benjamin Hindman, Andy Konwinski, Matei Zaharia,
Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Background
Rapid innovation in distributed/cluster
computing
● Hadoop, Spark, Flink, Tez, HDFS, Dryad, etc.
● Micro-services / web-services (long running)
● <Your custom framework> …
… There is a lot of wasted effort.
Problem
● Rapid innovation in cluster/dist. frameworks
● No single framework is optimal for all
workloads
● What do we want?
○ Run multiple frameworks on a single cluster (big)
■ Maximize utilization of resources. (elastic)
■ Share data between frameworks.
Goal
Solution
Mesos Goals
● High utilization of resources
● Support diverse frameworks
● Scalability to 10,000s of nodes
● Reliability in face of failures
● Efficient with minimal overhead
● Highly available via leader election of
master.
Other benefits
● Run multiple instances of a framework
○ Isolate production and experimental tasks
○ Run multiple versions of a framework
● Build specialized frameworks targeting a
particular domain
○ e.g. Spark
Mesos Architecture
● Resource Offers
○ Offer available resources to frameworks
● 2-level scheduler
○ Mesos decides which framework gets each
offer.
○ Frameworks decide whether to accept or reject the offer.
● Fine grained sharing
○ Improved utilization, responsiveness, data locality
● Pluggable allocation module
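The 2-level flow above can be sketched as a toy simulation. This is a hypothetical illustration, not the real Mesos API (which is C++/protobuf-based); `allocate`, `picky_scheduler`, and the round-robin allocation stand-in are all invented for the example.

```python
# Toy sketch of 2-level scheduling: the master picks which framework
# sees each offer; the framework picks whether to use it.

def allocate(offers, frameworks):
    """Level 1: the master decides which framework gets each offer."""
    launched = []
    for i, offer in enumerate(offers):
        # Round-robin stands in for the pluggable allocation module.
        fw = frameworks[i % len(frameworks)]
        # Level 2: the framework's scheduler accepts (returns a task)
        # or rejects (returns None) the offer.
        task = fw["scheduler"](offer)
        if task is not None:
            launched.append((fw["name"], task, offer))
    return launched

def picky_scheduler(offer):
    """A framework that only accepts offers with >= 2 CPUs."""
    if offer["cpus"] >= 2:
        return {"name": "task", "cpus": 2}
    return None  # rejected; resources stay with the master

offers = [{"cpus": 1}, {"cpus": 4}]
frameworks = [{"name": "fw1", "scheduler": picky_scheduler}]
launched = allocate(offers, frameworks)
```

The point of the split: the master never needs to understand framework-specific constraints (data locality, software versions); it only hands out resources, and the framework encodes its own placement logic in the accept/reject decision.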
Mesos Architecture
High-level architecture
Design components
● Mesos Master
○ Runs the allocation module
● Mesos Slave
○ Offers resources
○ task/executor isolation
● Scheduler
○ Accepts/rejects resource offers
● Executor
○ Runs tasks
● Task
○ Unit of work
Mesos Architecture
● Making resource offers scalable and robust
○ Let schedulers define filters
○ Offered resources counted as allocations
○ Rescind offers if not accepted within timeout
● Fault tolerance
○ Zookeeper for leader election (hot-standby)
○ Mesos reports slave and executor failures to
scheduler
○ Minimal internal state in master
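Two of the robustness tricks above — counting outstanding offers as allocations, and rescinding offers after a timeout — can be sketched in a few lines. The class, field names, and 5-second timeout are illustrative assumptions, not the real master's implementation (filters are omitted for brevity):

```python
# Sketch of offer bookkeeping in the master (hypothetical structure).

OFFER_TIMEOUT = 5.0  # seconds; illustrative value

class Master:
    def __init__(self):
        self.outstanding = {}  # offer_id -> (framework, cpus, sent_at)
        self.allocated = {}    # framework -> cpus counted as allocated

    def send_offer(self, offer_id, framework, cpus, now):
        self.outstanding[offer_id] = (framework, cpus, now)
        # Counting an unanswered offer against the framework's share
        # keeps a slow framework from hoarding offers it never answers.
        self.allocated[framework] = self.allocated.get(framework, 0) + cpus

    def rescind_expired(self, now):
        for oid, (fw, cpus, sent) in list(self.outstanding.items()):
            if now - sent > OFFER_TIMEOUT:
                del self.outstanding[oid]
                self.allocated[fw] -= cpus  # resources become offerable again
```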
Resource offers
● Mesos decides how many
resources to offer to each
framework, based on the
allocation policy (fair share),
while frameworks decide which
resource offers to accept and
which tasks to run on them.
● When a framework accepts an
offer, it passes Mesos a
description of the task (and
executor) to launch.
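The fair-share policy the master uses to size offers can be sketched as single-resource max-min fairness: repeatedly split the leftover capacity among the still-unsatisfied frameworks, capped by each one's demand. This is a simplified illustration (the function name and demand shapes are invented), not the actual allocation module:

```python
# Max-min fair share of a single resource, capped by demand.

def max_min_share(total, demands):
    """Split `total` units among frameworks, capped by each demand."""
    alloc = {fw: 0.0 for fw in demands}
    remaining = dict(demands)  # unmet demand per framework
    while total > 1e-9 and remaining:
        share = total / len(remaining)  # equal split of what's left
        for fw in list(remaining):
            give = min(share, remaining[fw])
            alloc[fw] += give
            remaining[fw] -= give
            total -= give
            if remaining[fw] <= 1e-9:
                del remaining[fw]  # satisfied; its share goes to others
    return alloc

# 10 CPUs among demands {a: 2, b: 8, c: 8}: a is satisfied with 2,
# and the surplus splits evenly, so b and c get 4 each.
alloc = max_min_share(10, {"a": 2, "b": 8, "c": 8})
```

A framework that demands less than its equal split gets exactly what it asked for, and its unused share flows to the heavier frameworks, which is what makes the sharing elastic.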
Analysis
● Resource offers work well when:
○ Frameworks can scale elastically.
○ Task durations are homogeneous.
○ Frameworks have many preferred nodes.
● These conditions hold in many frameworks
(Hadoop, Dryad, Spark, …)
○ Work is divided into short tasks
○ Data is replicated across multiple nodes
Limitations of distributed scheduling
● Fragmentation
○ With heterogeneous resource demands, distributed
collection of frameworks may not be able to bin pack
optimally.
● Independent framework constraints
○ Esoteric dependency scenarios can only be satisfied
with a centralized scheduler
● Framework complexity
○ Need to deal with resource offers (not onerous)
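The fragmentation limitation can be made concrete with a toy first-fit packing (the function, task sizes, and 8-CPU nodes are invented for illustration): frameworks greedily accepting offers in arrival order can strand capacity that a central bin-packer would have used.

```python
# Sketch of fragmentation under decentralized, greedy acceptance.

def greedy_pack(task_sizes, nodes, cap=8):
    """First-fit placement; returns (tasks placed, free CPUs per node)."""
    free = [cap] * nodes
    placed = 0
    for size in task_sizes:
        for i in range(nodes):
            if free[i] >= size:
                free[i] -= size
                placed += 1
                break  # task placed; move to the next one
    return placed, free

# Tasks of 2, 4, 4, 6 CPUs on two 8-CPU nodes, accepted in order:
# first-fit leaves the nodes at (2, 4) free, so the 6-CPU task is
# stranded even though 6 CPUs remain cluster-wide. A central packer
# could fit everything: node 1 = 2+6, node 2 = 4+4.
placed, free = greedy_pack([2, 4, 4, 6], nodes=2)
```

The paper's mitigation is a minimum-offer-size policy for clusters dominated by large tasks, not full centralized packing.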
Implementation
● ~10,000 lines of C++
● libprocess - actor-based programming (Akka
for C++)
● cgroups / Docker for task isolation
● Zookeeper for leader election (HA)
● Ports / plugins
○ Hadoop - MapReduce
○ Torque / MPI
○ Spark - iterative jobs
Evaluation
Dynamic Resource
sharing
Evaluation
Data locality with resource offers
○ 16 Hadoop instances using 93 EC2 instances
○ 1.7x speedup with Mesos
○ 97% data locality with 5s delay scheduling
Evaluation
Mesos scalability
● 99 EC2 instances
● Scaled to 50k slaves, 200 frameworks, 100k tasks
Evaluation
● Failure recovery
○ Mean time to recovery was 4-8 seconds, with
95% confidence bounds of 3s on either side.
● Performance isolation
○ Linux containers are not perfect at isolation
○ 30% increase in request latency with isolation,
vs. a 550% increase without
■ Apache server colocated with a CPU-hog process
The end
