Mesos and its container schedulers provide a powerful platform for cloud-based applications of all types. In this session, we give practical advice for running and securing large batch, streaming, and service workloads (e.g. Flink) within Mesos, and describe some of the more popular schedulers, including Marathon and Aurora.
21. Where we wanted to get to
● Single platform for deploying applications
● Development teams don’t care about machines
● Development teams do care about applications
● Development teams describe their application and operational characteristics declaratively
30. Framework capabilities
● Choice between Aurora and the others boils down to the need for batch jobs and pre-emption
● Do you run DAG jobs?
● Do you have a large engineering team? Or multiple isolated customers that need quotas?
33. Customers…
● Some customers pay more than others, and need more capacity
● Some jobs are higher priority than others
● We don’t need all capacity, all the time
34. Aurora - Roles
Aurora has a concept of a Role, e.g. a Customer or an Internal Team. Work is submitted into a Role. Quotas can be assigned to Roles.
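Concretely, the Aurora CLI addresses work by a job key that embeds the Role (cluster/role/environment/name). A sketch, with placeholder cluster and job names:

```shell
# Submit a job into the 'customer_a' Role on cluster 'devcluster'
# (names are placeholders; transcode.aurora is the job's config file).
aurora job create devcluster/customer_a/prod/transcode_job transcode.aurora
```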
35. Aurora - Goal
● Ensure high-priority work gets done
● Provide quota isolation across Roles
● Allow low-priority work to use all resources unless high-priority work arrives
36. Aurora - Pre-emption
Aurora supports pre-emption: killing and rescheduling work when higher-priority work arrives. It does this under two conditions...
37. Aurora - Priority killing
1. A Role submits a job with higher priority than an existing job in the same Role
38. Aurora - Intra-Role Prioritization
2. A production job requires quota from a non-production job of any Role
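The two conditions above can be sketched as a toy model. This is illustrative only, not Aurora's actual scheduler code; the `Job` type and `may_preempt` function are invented for the example:

```python
# Toy model of Aurora's two pre-emption rules (illustrative only).
from dataclasses import dataclass

@dataclass
class Job:
    role: str
    priority: int
    production: bool  # production jobs are backed by the Role's quota

def may_preempt(pending: Job, running: Job) -> bool:
    """Return True if `pending` is allowed to pre-empt `running`."""
    # Condition 1: higher priority than an existing job in the same Role.
    if pending.role == running.role and pending.priority > running.priority:
        return True
    # Condition 2: a production job may take resources from a
    # non-production job of any Role.
    if pending.production and not running.production:
        return True
    return False

batch = Job(role="customer_a", priority=0, production=False)
critical = Job(role="customer_b", priority=0, production=True)
print(may_preempt(critical, batch))  # True: production pre-empts batch
```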
39. Multi-Tenancy Scheduling - Production Quotas
● Aurora allows us to give a quota to Roles for their critical work
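Per-Role quotas are set with Aurora's admin tooling. A sketch; the cluster name and resource sizes are placeholders:

```shell
# Grant the 'customer_a' Role quota for its production (critical) work:
# 20 CPU cores, 40 GiB RAM, 100 GiB disk (placeholder sizes).
aurora_admin set_quota devcluster customer_a 20 40gb 100gb
```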
41. Multi-Tenancy Scheduling - Slot Non-Critical Work Around Critical Work
Aurora allows us to slot low-priority (typically batch) work around critical work, yet discard it if more critical work arrives
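In the job definition, this distinction is carried by the `production` and `priority` fields on `Job`. A fragment of an .aurora file; `service_task` and `batch_task` are assumed to be defined elsewhere and all names are placeholders:

```python
# Critical service: backed by the Role's quota, safe from batch pre-emption.
critical_job = Job(
  cluster = 'devcluster',
  role = 'customer_a',
  environment = 'prod',
  name = 'api_service',
  task = service_task,
  production = True
)

# Batch filler: runs in spare capacity, pre-empted when critical work arrives.
batch_job = Job(
  cluster = 'devcluster',
  role = 'customer_a',
  environment = 'devel',
  name = 'nightly_batch',
  task = batch_task,
  production = False,
  priority = 0
)
```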
48. Job DAGs
Aurora allows Jobs to contain parallel and sequential stages, resulting in a DAG. However, all processes will be run in the same container. It is not a high-level job orchestrator.
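Ordering within a Task is expressed via constraints on its processes, e.g. with the DSL's `order()` helper. A sketch with placeholder process names: two parallel fetches followed by a merge, all inside one container:

```python
# Fragment of an .aurora file (process names are placeholders).
fetch_a = Process(name = 'fetch_a', cmdline = 'curl -O {{url_a}}')
fetch_b = Process(name = 'fetch_b', cmdline = 'curl -O {{url_b}}')
merge   = Process(name = 'merge',   cmdline = './merge.sh')

dag_task = Task(
  name = 'fetch_and_merge',
  processes = [fetch_a, fetch_b, merge],
  # fetch_a and fetch_b have no ordering between them, so they run in
  # parallel; merge is constrained to run after both complete.
  constraints = order(fetch_a, merge) + order(fetch_b, merge),
  resources = Resources(cpu = 1, ram = 256*MB, disk = 1*GB)
)
```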
49. Domain Specific Job Definition Language With Templating
transcode_proc = Process(
  name = 'transcode',
  cmdline = 'ffmpeg -i {{input}} {{output}}'
)
transcode_task = Task(
  name = 'transcode_task',
  processes = [transcode_proc],
  resources = Resources(cpu = 1, ram = 128*MB, disk = 1*GB)
)
jobs = [
  Job(cluster = 'devcluster',
      environment = 'sandbox',
      role = 'customer_a',
      name = 'transcode_job',
      task = transcode_task,
      container = Docker(image = 'ffmpeg_image:16.04')
  )
]
50. Customer Extensions - Extending the DSL
transcode_task = (transcode()
  .input('s3://input.mp4')
  .output('s3://output.avi'))
jobs = Flow.create(transcode_task)
No nice hook exists to plug in DSL extensions. We need to patch and build the Aurora client and then register this within the configuration loader.