NEPTUNE
Scheduling Suspendable Tasks
for Unified Stream/Batch Applications
SoCC, Santa Cruz, California, November 2019
Panagiotis Garefalakis
Imperial College London
pgaref@imperial.ac.uk
Konstantinos Karanasos
Microsoft
kokarana@microsoft.com
Peter Pietzuch
Imperial College London
prp@imperial.ac.uk
Unified application example
[Figure: a unified stream/batch application. A training (batch) job iterates over historical data to produce a trained model; an inference (stream) job applies the shared model to real-time data and returns low-latency responses.]
Evolution of analytics frameworks
[Timeline figure, 2010 to 2018: batch frameworks, then stream frameworks, then frameworks with hybrid stream/batch applications, and finally unified stream/batch frameworks such as Spark Structured Streaming.]
Stream/Batch application requirements
> Latency: Execute the inference job with minimum delay
> Throughput: Batch jobs should not be compromised
> Efficiency: Achieve high cluster resource utilization
Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling
[Figure: example stream/batch application and its submission path. The inference (stream) job has two stages with two tasks each, every task taking time T; the training (batch) job has a first stage with four tasks of 3T and a second stage with three tasks of T. The application code is submitted to the driver (submitApp), and the driver runs each job through the DAG scheduler using the same context.]
Stream/Batch application scheduling
> Static allocation: dedicate resources to each job
[Figure: core-time chart with dedicated executors. The training (batch) job occupies its executor's cores until completion, while the inference (stream) job finishes on its own executor after 2T, leaving those cores idle (wasted resources) for the rest of the run.]
Resources cannot be shared across jobs
Stream/Batch application scheduling
> FIFO: first job runs to completion
[Figure: core-time chart with shared executors. The training (batch) job is scheduled first and occupies all cores; the inference (stream) tasks queue behind it and only run once the batch stages complete.]
Long batch jobs increase stream job latency
Stream/Batch application scheduling
> FAIR: weight-share resources across jobs
[Figure: core-time chart with shared executors. FAIR interleaves an equal share of batch and stream tasks, but stream tasks still queue behind running batch tasks.]
Better packing with non-optimal latency
Stream/Batch application scheduling
> KILL: avoid queueing by preempting batch tasks
[Figure: core-time chart with shared executors. Running batch (3T) tasks are killed when stream tasks arrive and must later be restarted from scratch.]
Better latency at the expense of extra work
Stream/Batch application scheduling
> NEPTUNE: minimize queueing and wasted work!
[Figure: core-time chart with shared executors. Running batch tasks are suspended when stream tasks arrive and are resumed once cores free up, preserving their progress.]
Challenges
> How to minimize queuing for latency-sensitive jobs and wasted work?
Implement suspendable tasks
> How to natively support stream/batch applications?
Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?
Introduce custom scheduling policies
NEPTUNE
Execution framework for Stream/Batch applications
> Support suspendable tasks
> Introduce pluggable scheduling policies
> Unified execution framework on top of Structured Streaming
Typical tasks
> Tasks: apply a function to a partition of data
> Subroutines that run in the executor to completion
> Preemption problem:
> Loss of progress (kill)
> Unpredictable preemption times (checkpointing)
[Figure: a task runs as a subroutine on the executor's stack, which holds the return-value slot, the function reference, the context, the iterator, and the task's intermediate state.]
Suspendable tasks
> Idea: use coroutines
> Separate stacks to store task state
> Yield points handing over control to the executor
> Cooperative preemption:
> Suspend and resume in milliseconds
> Work-preserving
> Transparent to the user
[Figure: the coroutine's function, context, iterator, and state live on a separate coroutine stack; the executor drives the task through call/yield and keeps only the task handle and return value on its own stack.]
https://github.com/storm-enroute/coroutines
(A minimal coroutine sketch follows below.)
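To make the mechanism concrete, here is a minimal, hedged sketch of a suspendable record-processing loop. It assumes the org.coroutines API from the library linked above (coroutine, yieldval, call, resume, value, result); it is an illustration only, not Neptune's actual task code:

  import org.coroutines._

  object SuspendableLoop {
    // Local state (the running sum and the iterator position) lives on the
    // coroutine's own stack, so it survives across suspensions.
    val work = coroutine { (records: Iterator[Int]) =>
      var sum = 0
      while (records.hasNext) {
        sum += records.next()
        yieldval(sum)            // yield point: hand control back to the caller
      }
      sum
    }

    def main(args: Array[String]): Unit = {
      val frame = call(work(Iterator(1, 2, 3)))
      while (frame.resume) {     // resume returns false once the coroutine completes
        // Between resumes the caller (the "executor") is free to run a
        // higher-priority task before handing control back.
        println(s"suspended with partial sum ${frame.value}")
      }
      println(s"final result: ${frame.result}")
    }
  }

The task only gives up the core at its own yield points, which is what makes the preemption cooperative and work-preserving.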
Execution framework
> Idea: centralized scheduler with pluggable policies
> Problem: not just assign tasks, but also suspend and resume them
[Figure: architecture. Application jobs, marked with low/high priorities, pass through the incrementalizer and optimizer to the DAG scheduler; the task scheduler, guided by the scheduling policy and by executor metrics, launches tasks on executors and can suspend running low-priority tasks (Running -> Paused) to run high-priority ones immediately.]
(A simplified sketch of the suspend-and-run decision follows below.)
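The suspend-and-run behaviour can be pictured with the following simplified sketch. All names here are hypothetical illustrations of the described flow, not Neptune's actual classes, and a real executor tracks cores and task state far more carefully:

  // Hypothetical model of one executor's cores and of the suspend-and-run decision.
  sealed trait Priority
  case object High extends Priority   // latency-sensitive (stream) tasks
  case object Low extends Priority    // latency-tolerant (batch) tasks

  final case class Task(id: Long, priority: Priority)

  final class ExecutorSlots(cores: Int) {
    private var running: List[Task] = Nil
    private var paused: Vector[Task] = Vector.empty   // FIFO of suspended tasks

    private def freeCores: Int = cores - running.size

    // Launch a task; if no core is free and the task is high priority,
    // cooperatively suspend one running low-priority task first.
    def launch(task: Task): Boolean =
      if (freeCores > 0) { running = task :: running; true }
      else if (task.priority == High) {
        running.find(_.priority == Low) match {
          case Some(victim) =>
            running = running.filterNot(_ == victim)
            paused = paused :+ victim   // work-preserving: the victim keeps its state
            running = task :: running
            true
          case None => false            // every core runs a high-priority task: queue
        }
      } else false                      // low-priority tasks simply queue

    // When a task finishes, resume the oldest suspended task if a core frees up.
    def finish(task: Task): Unit = {
      running = running.filterNot(_ == task)
      if (paused.nonEmpty && freeCores > 0) {
        running = paused.head :: running   // oldest first, to avoid starvation
        paused = paused.tail
      }
    }
  }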
Scheduling policies
> Idea: policies trigger task suspension and resumption
> Guarantee that stream tasks bypass batch tasks
> Satisfy higher-level objectives, e.g. balance cluster load
> Avoid starvation by suspending a task only up to a bounded number of times
> Load-balancing (LB): takes executors' memory conditions into account and equalizes the number of tasks per node
> Locality- and memory-aware (LMA): respects task locality preferences in addition to load balancing (a sketch of both policies follows below)
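As a rough illustration of what pluggable means here, the sketch below models a policy interface together with the two policies just described. The names, fields, and the memory threshold are assumptions made for illustration and do not mirror Neptune's actual code:

  // Hypothetical view of the per-executor state exposed to a policy.
  final case class ExecutorState(id: String,
                                 runningTasks: Int,
                                 memoryUsedFraction: Double,
                                 cachedPartitions: Set[Int])

  trait SchedulingPolicy {
    // Pick the executor on which a stream task should run
    // (possibly after suspending a batch task there).
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState
  }

  // Load-balancing (LB): prefer executors with spare memory, equalize tasks per node.
  object LoadBalancingPolicy extends SchedulingPolicy {
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState = {
      val fits = executors.filter(_.memoryUsedFraction < 0.9)   // assumed threshold
      (if (fits.nonEmpty) fits else executors).minBy(_.runningTasks)
    }
  }

  // Locality- and memory-aware (LMA): respect locality preferences, then fall back to LB.
  object LocalityMemoryAwarePolicy extends SchedulingPolicy {
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState = {
      val local = executors.filter(_.cachedPartitions.contains(partition))
      LoadBalancingPolicy.chooseExecutor(if (local.nonEmpty) local else executors, partition)
    }
  }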
Implementation
> Built as an extension to Apache Spark 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark's DAG scheduler to allow job stages with different requirements (priorities)
> Added additional executor performance metrics as part of the heartbeat mechanism
Azure deployment
> Cluster
– 75 nodes with 4 cores and 32 GB of memory each
> Workloads
– LDA: ML training/inference application uncovering hidden topics from a group of documents
– Yahoo Streaming Benchmark (YSB): ad-analytics on a stream of ad impressions
– TPC-H: decision support benchmark
Benefit of NEPTUNE in stream latency
> LDA: training (batch) job using all available resources, with a latency-sensitive inference (stream) job using 15% of resources
[Figure: 5th/median/99th-percentile latency of the latency-sensitive job under Static allocation, FIFO, FAIR, KILL, Isolation, Neptune-LB, and Neptune-LMA.]
NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs
Impact of resource demands on performance
> YSB: increasing the stream job's resource demands while the batch job uses all available resources
[Figure: stream latency and batch throughput as the stream job's share of cores grows from 0% (static allocation, "past") to 100% (share everything, "future"); even at full sharing, batch throughput drops by only 1.5%.]
Efficiently share resources with low impact on throughput
Summary
NEPTUNE supports complex unified applications with diverse job requirements!
> Suspendable tasks using coroutines
> Pluggable scheduling policies
> Continuous unified analytics
https://github.com/lsds/Neptune
Thank you! Questions?
Panagiotis Garefalakis
pgaref@imperial.ac.uk
BACKUP SLIDES
Suspension mechanism effectiveness
> TPC-H: Task runtime distribution for each query ranges from 100s of milliseconds to 10s of seconds
Suspension mechanism effectiveness
> TPC-H: Continuously transition tasks from Paused to Resumed states until completion
Suspendable tasks effectively pause and
resume with sub-millisecond latencies
Suspension mechanism effectiveness
> TPC-H: Continuously transition tasks from Paused to Resumed states until completion
Coroutine tasks have minimal performance overhead by bypassing the OS scheduler
[Figure: pause latency (ms) versus parallelism (2 to 64) for Coroutines and ThreadSync.]
Demo
> Run a simple unified application with
> a high-priority latency-sensitive job
> a low-priority latency-tolerant job
> Schedule them with default Spark and with Neptune
> Goal: show the benefit of Neptune and ease of use (a hypothetical sketch follows below)
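A hypothetical sketch of what such a demo application could look like on top of Spark. The priority property key "neptune.job.priority" is purely an assumption for illustration (Neptune's actual way of marking job priorities may differ; see the repository), whereas SparkSession, setLocalProperty, and the rate source are standard Spark APIs:

  import org.apache.spark.sql.SparkSession

  object UnifiedDemoApp {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("stream-batch-demo")
        .master("local[4]")
        .getOrCreate()

      // Low-priority, latency-tolerant batch job, submitted from a separate thread.
      val batch = new Thread(new Runnable {
        def run(): Unit = {
          spark.sparkContext.setLocalProperty("neptune.job.priority", "low")  // assumed key
          spark.range(0L, 1000000000L).selectExpr("sum(id)").show()
        }
      })
      batch.start()

      // High-priority, latency-sensitive streaming job over the built-in rate source.
      spark.sparkContext.setLocalProperty("neptune.job.priority", "high")     // assumed key
      val stream = spark.readStream.format("rate").option("rowsPerSecond", "1000").load()
      val query = stream.groupBy("value").count()
        .writeStream.outputMode("complete").format("console").start()

      query.awaitTermination()
    }
  }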
Suspendable tasks
Subroutine version (runs on the executor to completion; it cannot be suspended):

  val collect: (TaskContext, Iterator[T]) => Array[T] =
    (context: TaskContext, itr: Iterator[T]) => {
      val result = new mutable.ArrayBuffer[T]
      while (itr.hasNext) {
        result.append(itr.next)
      }
      result.toArray
    }

Coroutine version (yields an Int at pause points and returns Array[T]; context.isPaused() is Neptune's extension of the task context):

  val collect = coroutine { (context: TaskContext, itr: Iterator[T]) => {
    val result = new mutable.ArrayBuffer[T]
    while (itr.hasNext) {
      result.append(itr.next)
      if (context.isPaused())
        yieldval(0)
    }
    result.toArray
  } }

(A sketch of how an executor could drive such a coroutine follows below.)
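For completeness, here is a hedged sketch of how an executor-like caller could drive such a coroutine, again assuming the org.coroutines API. PausableContext is a stand-in for Neptune's extended task context, not a real Spark or Neptune class:

  import org.coroutines._
  import scala.collection.mutable

  object DriveCollect {
    // Minimal stand-in for a task context that carries a pause flag (hypothetical).
    final class PausableContext {
      @volatile private var paused = false
      def pause(): Unit = { paused = true }
      def unpause(): Unit = { paused = false }
      def isPaused(): Boolean = paused
    }

    val collectInts = coroutine { (context: PausableContext, itr: Iterator[Int]) =>
      val result = mutable.ArrayBuffer.empty[Int]
      while (itr.hasNext) {
        result.append(itr.next())
        if (context.isPaused()) yieldval(0)   // yield only when asked to pause
      }
      result.toArray
    }

    def main(args: Array[String]): Unit = {
      val ctx = new PausableContext
      val frame = call(collectInts(ctx, Iterator(1, 2, 3, 4)))

      ctx.pause()               // the executor asks the task to suspend
      frame.resume              // the task yields at the next record boundary
      // ... a higher-priority task could run here ...
      ctx.unpause()
      while (frame.resume) {}   // drive the task to completion
      println(frame.result.mkString(","))   // prints 1,2,3,4
    }
  }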

Editor's Notes

1. Hello everyone, I am Panagiotis, and this talk is about Neptune: a new task execution framework for modern unified applications. This is joint work with Konstantinos Karanasos from Microsoft and my supervisor Peter Pietzuch at the LSDS group at Imperial. Neptune is a data processing framework; and developers today...
2. ...want to use such frameworks to develop complex ML applications. This figure depicts such a production application implementing a real-time malicious behavior detection service, consisting of a training job and an inference job: the training job uses historical data to train a machine learning model, and the inference job performs latency-sensitive inference to detect malicious behavior using the previously trained model, which is shared in memory. This type of unified application design allows such jobs to easily share application state and application logic, obtain result consistency, and even share computation.
3. To support such applications, analytics frameworks such as the ones shown, which were traditionally dedicated to either batch or stream processing, evolved to unify different use cases by exposing different APIs (the DStreams and RDD APIs); but using these APIs, jobs were still deployed as separate applications (no true unification). Only recently did frameworks start exposing unified programming interfaces to seamlessly express different types of jobs as part of the same application: Spark Structured Streaming and the Flink Table API provide such programming interfaces, enabling users to develop unified applications. Such unified apps are known under different names, such as hybrid or continuous applications; for the rest of the presentation I will refer to them as stream/batch.
4. However, despite the unified application support from the API point of view, very little has been done on the scheduling side to capture and satisfy the diverse job requirements of those applications. This is a new challenge: how to schedule unified application jobs to satisfy their diverse requirements.
5. To schedule such an application in a data processing framework today, the application code is submitted to a centralized driver using a submit command; the driver, using the same context, then runs the application jobs through the DAG scheduler. The DAG scheduler splits each job into stages; in our example, two stages each (the dashed rectangles represent those stages). Every stage consists of tasks running computation and taking time T; some training tasks are more computationally intensive and take longer (3T). The number of tasks of each stage is defined by the number of partitions it operates on: stages 1 and 2 of the inference job operate on 2 partitions, while stage 1 of the training job operates on 4. For execution...
6. A typical approach to satisfy the requirements of the application is static resource allocation, where we dedicate a portion of resources to each job, for example 3/4. We then schedule every stage of each job using its available resources until completion. In this scenario, even though the stream job latency is low, it is still 2x higher than the optimal. More importantly, we end up with wasted resources: since in static allocation resources cannot be shared across jobs, they remain unused.
7. A more resource-efficient solution would be to use a unified framework such as Spark Structured Streaming, which enables sharing executors across jobs. By default, jobs are scheduled in a FIFO fashion where the first submitted job runs to completion. Assuming the training job is submitted first, it uses all resources until completion, and only then are the inference stages scheduled. However, FIFO causes significant queuing delays, especially when the jobs scheduled before the latency-sensitive ones are long: the stream job has worse latency than under static allocation.
8. An alternative is the FAIR policy, which shares resources equally across jobs (possibly weighted). FAIR first schedules two tasks from each job stage, and then the next equal share, until completion. Even though FAIR packs resources better and reduces the stream job's response time by 2T compared to FIFO, it cannot guarantee optimal latency for the stream job, as it cannot avoid queuing. So FAIR achieves better packing...
9. To avoid queuing delays for stream tasks completely, it is possible to use non-work-preserving preemption, coupled with a strategy such as FAIR, to kill batch tasks when needed. In that scenario we would schedule tasks in a FAIR manner, and when stream tasks need to run we would preempt already-running batch tasks. We would then need to restart all killed tasks, losing any progress. KILL can achieve better latency, but at the expense of extra work. Given that batch and stream jobs share the same runtime, can we do better?
10. What we want to do instead is to suspend low-priority batch tasks in favor of higher-priority stream tasks, and resume them when free resources are available.
  11. There are a few challenges we need to solve – such as:
12. Neptune is our new execution framework tackling these challenges, with support for: suspendable tasks (coroutines can suspend latency-tolerant tasks within milliseconds and resume them later, avoiding losing progress); a unified execution framework (users express such applications using existing programming interfaces); and pluggable scheduling policies (to satisfy the different requirements of stream/batch applications).
13. Tasks in analytics frameworks such as Spark today are implemented as subroutines, applying a function on a partition of data given as an iterator. To run a task, the executor allocates space for the return value of the function in the executor call stack, adds a function reference, and allocates space for the variables and intermediate state. At the end of the execution, the return value is written to the appropriate slot and we return to the executor call site. However, to preempt any of those tasks we either have to kill it, losing progress, or checkpoint the intermediate state, with unpredictable latency.
14. Instead, in Neptune we implement tasks using lightweight coroutines. Coroutines use a separate stack to store their variables, and add yield points after the processing of each record. To run a task, the executor now invokes a special call function, which creates the coroutine stack instance to store the function reference, variables, and state. To preempt, the executor marks the task context as paused and the task then yields at the next record iteration; the executor can later resume the same task instance or invoke another function. Using coroutines in Neptune provides a mechanism that is transparent to the user and supports function composition, which is fundamental for more complex logic.
15. The problem now is that we don't just assign tasks to executors but also decide which tasks get suspended and when; Neptune supports pluggable policies for these decisions. Let's see how an application is executed in Neptune: users develop an application and mark jobs with low or high priorities; the app goes through an optimization phase creating a plan in the form of a DAG; the DAG scheduler computes the execution plan of each job of the application; ready-to-execute tasks are then passed on to the task scheduler, which immediately executes a task if there are enough free resources. Otherwise, the policy can decide to suspend running batch tasks to free up resources so that stream tasks start their execution immediately. The oldest suspended task is scheduled for execution when others terminate, to avoid losing progress. Of course, there are system challenges: task suspension should be fast and smart, and the scheduling policy can take into account multiple task and cluster properties.
16. For the evaluation, we deployed Neptune in a 75-node Azure cluster. For workloads we used… (a latency-sensitive and a latency-tolerant instance).
17. First I want to show you how Neptune compares with existing approaches in terms of latency, using a unified LDA app with a latency-sensitive job and a latency-tolerant job. We first run the stream job in total isolation to get the ideal case for stream latency. We then run LDA with FIFO and FAIR as implemented in Spark: FAIR reduces the 99th-percentile latency by 37% compared to FIFO, but the median is still 2x higher than running in isolation. Adding preemption (KILL) to FAIR improves the median by 54% compared to FAIR, but its 99th-percentile latency is 2x higher (it cannot preempt more than a fair share of resources). We then run LDA with Neptune and two different policies: NEP-LMA achieves a latency that is just 13% higher than isolation for the 99th percentile, while Neptune without cache awareness (NEP-LB) achieves 61% worse latency for the 99th percentile compared to NEP-LMA. Finally, running the jobs in different executors has a tighter distribution but high tail latencies due to executor interference. Overall, NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs.
18. Since some applications require more resources than others, we also measured the impact of increasing resources (expressed as cores) on Neptune's performance. We run two instances of YSB. NEPTUNE maintains low latency across all percentages even though the number of tasks that must be suspended increases, showing the effectiveness of our preemption mechanism. At the same time, as the preempted batch tasks increase they take a hit in throughput; in this case, however, sharing 100% of the cluster resources only drops throughput by 1.5%. As a result we achieve an efficient share of resources with low impact on throughput. This figure is also like a glimpse from past to future, with 0% of cores being the static allocation approach and 100% being the share-everything approach, which is the future!
19. NEPTUNE is an execution framework that supports… It implements suspendable tasks that can pause and resume in milliseconds, and supports smart scheduling policies deciding how to suspend.
20. To measure the effectiveness of the suspension mechanism we use TPC-H: we run the TPC-H benchmark on a cluster of 4 Azure machines and measure the task runtime distribution for each query (in grey). Queries follow different task runtime distributions, ranging from 100s of milliseconds to 10s of seconds.
21. We re-run the benchmark and continuously transition tasks from the PAUSED to the RESUMED state until completion, while measuring the latency of each transition. By continuously transitioning between states and triggering yield points, we measure the worst-case scenario in terms of transition latency for each query's tasks. Although queries follow different task runtime distributions, Neptune manages to pause and resume tasks with sub-millisecond latencies. An exception is Q14, for which the 75th percentile of the pause latency is 100 ms: its data reside in a single partition and the Parquet reader operates at partition granularity.
22. TPC-H query 1. ThreadSync relies on the preemption of the OS scheduler and is implemented using thread wait and notify calls. We alternate between PAUSED and RESUMED states while increasing parallelism on the x-axis. As the parallelism increases, the ThreadSync latency increases: 2.6x for 14 threads, up to 600 ms for 64 threads. The OS scheduler must continuously arbitrate between wait and run queues; we bypass the OS scheduler.
23. Now let me show you how to use Neptune to run a unified application. I am going to run two jobs, one with low and one with high priority. Goal: compare default Spark with Neptune and show the effect on latency.
24. To give you an idea of how suspendable tasks are implemented in Neptune: the first code snippet shows the implementation of the collect action, which receives the task context and a record iterator as arguments and returns all dataset elements. The second snippet shows the same logic implemented with coroutines: when the task context is marked as paused, the task yields a value to the executor; the executor can then resume the same task instance or invoke another task.