Mesos and its container schedulers provide a powerful platform for cloud-based applications of all types. In this session, we give practical advice for running and securing large batch, streaming, and service workloads (e.g. Flink) within Mesos, and describe some of the more popular schedulers, including Marathon and Aurora.
21. Where we wanted to get to
● Single platform for deploying applications
● Development teams don’t care about machines
● Development teams do care about applications
● Development teams describe their application and operational characteristics declaratively
30. Framework capabilities
● Choice between Aurora and the others boils down to the need for batch jobs and pre-emption
● Do you run DAG jobs?
● Do you have a large engineering team? Or multiple isolated customers that need quotas?
33. Customers…
● Some customers pay more than others, and need more capacity
● Some jobs are higher priority than others
● We don’t need all capacity, all the time
34. Aurora - Roles
Aurora has a concept of a Role, e.g. a Customer or an Internal Team. Work is submitted into a Role. Quotas can be assigned to Roles.
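Concretely, the Aurora CLI addresses work by a job key that embeds the Role (cluster/role/environment/name). A sketch, with placeholder cluster and job names:

```shell
# Submit a job into the 'customer_a' Role on cluster 'devcluster'
# (names are placeholders; transcode.aurora is the job's config file).
aurora job create devcluster/customer_a/prod/transcode_job transcode.aurora
```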
35. Aurora - Goal
● Ensure high-priority work gets done
● Provide quota isolation across Roles
● Allow low-priority work to use all resources unless high-priority work arrives
36. Aurora - Pre-emption
Aurora supports pre-emption: killing and rescheduling work when higher-priority work arrives. It does this under two conditions...
37. Aurora - Priority killing
1. A Role submits a job with higher priority than an existing job in the same Role
38. Aurora - Intra-Role Prioritization
2. A production job requires quota from a non-production job of any Role
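The two conditions above can be sketched as a toy model. This is illustrative only, not Aurora's actual scheduler code; the `Job` type and `may_preempt` function are invented for the example:

```python
# Toy model of Aurora's two pre-emption rules (illustrative only).
from dataclasses import dataclass

@dataclass
class Job:
    role: str
    priority: int
    production: bool  # production jobs are backed by the Role's quota

def may_preempt(pending: Job, running: Job) -> bool:
    """Return True if `pending` is allowed to pre-empt `running`."""
    # Condition 1: higher priority than an existing job in the same Role.
    if pending.role == running.role and pending.priority > running.priority:
        return True
    # Condition 2: a production job may take resources from a
    # non-production job of any Role.
    if pending.production and not running.production:
        return True
    return False

batch = Job(role="customer_a", priority=0, production=False)
critical = Job(role="customer_b", priority=0, production=True)
print(may_preempt(critical, batch))  # True: production pre-empts batch
```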
39. Multi-Tenancy Scheduling - Production Quotas
● Aurora allows us to give a quota to Roles for their critical work
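Per-Role quotas are set with Aurora's admin tooling. A sketch; the cluster name and resource sizes are placeholders:

```shell
# Grant the 'customer_a' Role quota for its production (critical) work:
# 20 CPU cores, 40 GiB RAM, 100 GiB disk (placeholder sizes).
aurora_admin set_quota devcluster customer_a 20 40gb 100gb
```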
41. Multi-Tenancy Scheduling - Slot Non-Critical Work Around Critical Work
Aurora allows us to slot low-priority (typically batch) work around critical work, yet discard it if more critical work arrives
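In the job definition, this distinction is carried by the `production` and `priority` fields on `Job`. A fragment of an .aurora file; `service_task` and `batch_task` are assumed to be defined elsewhere and all names are placeholders:

```python
# Critical service: backed by the Role's quota, safe from batch pre-emption.
critical_job = Job(
  cluster = 'devcluster',
  role = 'customer_a',
  environment = 'prod',
  name = 'api_service',
  task = service_task,
  production = True
)

# Batch filler: runs in spare capacity, pre-empted when critical work arrives.
batch_job = Job(
  cluster = 'devcluster',
  role = 'customer_a',
  environment = 'devel',
  name = 'nightly_batch',
  task = batch_task,
  production = False,
  priority = 0
)
```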
48. Job DAGs
Aurora allows Jobs to contain parallel and sequential stages, resulting in a DAG. However, all processes will be run in the same container. It is not a high-level job orchestrator.
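Ordering within a Task is expressed via constraints on its processes, e.g. with the DSL's `order()` helper. A sketch with placeholder process names: two parallel fetches followed by a merge, all inside one container:

```python
# Fragment of an .aurora file (process names are placeholders).
fetch_a = Process(name = 'fetch_a', cmdline = 'curl -O {{url_a}}')
fetch_b = Process(name = 'fetch_b', cmdline = 'curl -O {{url_b}}')
merge   = Process(name = 'merge',   cmdline = './merge.sh')

dag_task = Task(
  name = 'fetch_and_merge',
  processes = [fetch_a, fetch_b, merge],
  # fetch_a and fetch_b have no ordering between them, so they run in
  # parallel; merge is constrained to run after both complete.
  constraints = order(fetch_a, merge) + order(fetch_b, merge),
  resources = Resources(cpu = 1, ram = 256*MB, disk = 1*GB)
)
```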
49. Domain Specific Job Definition Language With Templating
transcode_proc = Process(
  name = 'transcode',
  cmdline = 'ffmpeg -i {{input}} {{output}}'
)
transcode_task = Task(
  name = 'transcode_task',
  processes = [transcode_proc],
  resources = Resources(cpu = 1, ram = 128*MB, disk = 1*GB)
)
jobs = [
  Job(cluster = 'devcluster',
      environment = 'sandbox',
      role = 'customer_a',
      name = 'transcode_job',
      task = transcode_task,
      container = Docker(image = 'ffmpeg_image:16.04')
  )
]
50. Customer Extensions - Extending the DSL
transcode_task = (transcode()
  .input('s3://input.mp4')
  .output('s3://output.avi'))
jobs = Flow.create(transcode_task)
No nice hook exists to plug in DSL extensions. We need to patch and build the Aurora client and then register this within the configuration loader.