NEPTUNE
Scheduling Suspendable Tasks
for Unified Stream/Batch Applications
SoCC, Santa Cruz, California, November 2019
Panagiotis Garefalakis
Imperial College London
pgaref@imperial.ac.uk
Konstantinos Karanasos
Microsoft
kokarana@microsoft.com
Peter Pietzuch
Imperial College London
prp@imperial.ac.uk
Unified application example
[Figure: a unified stream/batch application. A training (batch) job iterates over historical data to produce a trained model; an inference (stream) job applies the shared model to real-time data and returns low-latency responses.]
Evolution of analytics frameworks
[Timeline figure, 2010 to 2018: batch frameworks, then stream frameworks, then frameworks with hybrid stream/batch applications, and finally unified stream/batch frameworks such as Spark Structured Streaming.]
Stream/Batch application requirements
> Latency: Execute the inference job with minimum delay
> Throughput: Batch jobs should not be compromised
> Efficiency: Achieve high cluster resource utilization
Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling
[Figure: example stream/batch application and its submission path. The inference (stream) job has two stages with two tasks each, every task taking time T; the training (batch) job has a first stage with four tasks of 3T and a second stage with three tasks of T. The application code is submitted to the driver (submitApp), and the driver runs each job through the DAG scheduler using the same context.]
Stream/Batch application scheduling
> Static allocation: dedicate resources to each job
[Figure: core-time chart with dedicated executors. The training (batch) job occupies its executor's cores until completion, while the inference (stream) job finishes on its own executor after 2T, leaving those cores idle (wasted resources) for the rest of the run.]
Resources cannot be shared across jobs
Stream/Batch application scheduling
> FIFO: first job runs to completion
[Figure: core-time chart with shared executors. The training (batch) job is scheduled first and occupies all cores; the inference (stream) tasks queue behind it and only run once the batch stages complete.]
Long batch jobs increase stream job latency
Stream/Batch application scheduling
> FAIR: weight-share resources across jobs
[Figure: core-time chart with shared executors. FAIR interleaves an equal share of batch and stream tasks, but stream tasks still queue behind running batch tasks.]
Better packing with non-optimal latency
Stream/Batch application scheduling
> KILL: avoid queueing by preempting batch tasks
[Figure: core-time chart with shared executors. Running batch (3T) tasks are killed when stream tasks arrive and must later be restarted from scratch.]
Better latency at the expense of extra work
Stream/Batch application scheduling
> NEPTUNE: minimize queueing and wasted work!
[Figure: core-time chart with shared executors. Running batch tasks are suspended when stream tasks arrive and are resumed once cores free up, preserving their progress.]
Challenges
> How to minimize queuing for latency-sensitive jobs and wasted work?
Implement suspendable tasks
> How to natively support stream/batch applications?
Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?
Introduce custom scheduling policies
NEPTUNE
Execution framework for Stream/Batch applications
> Support suspendable tasks
> Introduce pluggable scheduling policies
> Unified execution framework on top of Structured Streaming
Typical tasks
> Tasks: apply a function to a partition of data
> Subroutines that run in the executor to completion
> Preemption problem:
> Loss of progress (kill)
> Unpredictable preemption times (checkpointing)
[Figure: a task runs as a subroutine on the executor's stack, which holds the return-value slot, the function reference, the context, the iterator, and the task's intermediate state.]
Suspendable tasks
> Idea: use coroutines
> Separate stacks to store task state
> Yield points handing over control to the executor
> Cooperative preemption:
> Suspend and resume in milliseconds
> Work-preserving
> Transparent to the user
[Figure: the coroutine's function, context, iterator, and state live on a separate coroutine stack; the executor drives the task through call/yield and keeps only the task handle and return value on its own stack.]
https://github.com/storm-enroute/coroutines
(A minimal coroutine sketch follows below.)
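To make the mechanism concrete, here is a minimal, hedged sketch of a suspendable record-processing loop. It assumes the org.coroutines API from the library linked above (coroutine, yieldval, call, resume, value, result); it is an illustration only, not Neptune's actual task code:

  import org.coroutines._

  object SuspendableLoop {
    // Local state (the running sum and the iterator position) lives on the
    // coroutine's own stack, so it survives across suspensions.
    val work = coroutine { (records: Iterator[Int]) =>
      var sum = 0
      while (records.hasNext) {
        sum += records.next()
        yieldval(sum)            // yield point: hand control back to the caller
      }
      sum
    }

    def main(args: Array[String]): Unit = {
      val frame = call(work(Iterator(1, 2, 3)))
      while (frame.resume) {     // resume returns false once the coroutine completes
        // Between resumes the caller (the "executor") is free to run a
        // higher-priority task before handing control back.
        println(s"suspended with partial sum ${frame.value}")
      }
      println(s"final result: ${frame.result}")
    }
  }

The task only gives up the core at its own yield points, which is what makes the preemption cooperative and work-preserving.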
Execution framework
> Idea: centralized scheduler with pluggable policies
> Problem: not just assign tasks, but also suspend and resume them
[Figure: architecture. Application jobs, marked with low/high priorities, pass through the incrementalizer and optimizer to the DAG scheduler; the task scheduler, guided by the scheduling policy and by executor metrics, launches tasks on executors and can suspend running low-priority tasks (Running -> Paused) to run high-priority ones immediately.]
(A simplified sketch of the suspend-and-run decision follows below.)
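The suspend-and-run behaviour can be pictured with the following simplified sketch. All names here are hypothetical illustrations of the described flow, not Neptune's actual classes, and a real executor tracks cores and task state far more carefully:

  // Hypothetical model of one executor's cores and of the suspend-and-run decision.
  sealed trait Priority
  case object High extends Priority   // latency-sensitive (stream) tasks
  case object Low extends Priority    // latency-tolerant (batch) tasks

  final case class Task(id: Long, priority: Priority)

  final class ExecutorSlots(cores: Int) {
    private var running: List[Task] = Nil
    private var paused: Vector[Task] = Vector.empty   // FIFO of suspended tasks

    private def freeCores: Int = cores - running.size

    // Launch a task; if no core is free and the task is high priority,
    // cooperatively suspend one running low-priority task first.
    def launch(task: Task): Boolean =
      if (freeCores > 0) { running = task :: running; true }
      else if (task.priority == High) {
        running.find(_.priority == Low) match {
          case Some(victim) =>
            running = running.filterNot(_ == victim)
            paused = paused :+ victim   // work-preserving: the victim keeps its state
            running = task :: running
            true
          case None => false            // every core runs a high-priority task: queue
        }
      } else false                      // low-priority tasks simply queue

    // When a task finishes, resume the oldest suspended task if a core frees up.
    def finish(task: Task): Unit = {
      running = running.filterNot(_ == task)
      if (paused.nonEmpty && freeCores > 0) {
        running = paused.head :: running   // oldest first, to avoid starvation
        paused = paused.tail
      }
    }
  }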
Scheduling policies
> Idea: policies trigger task suspension and resumption
> Guarantee that stream tasks bypass batch tasks
> Satisfy higher-level objectives, e.g. balance cluster load
> Avoid starvation by suspending a task only up to a bounded number of times
> Load-balancing (LB): takes executors' memory conditions into account and equalizes the number of tasks per node
> Locality- and memory-aware (LMA): respects task locality preferences in addition to load balancing (a sketch of both policies follows below)
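As a rough illustration of what pluggable means here, the sketch below models a policy interface together with the two policies just described. The names, fields, and the memory threshold are assumptions made for illustration and do not mirror Neptune's actual code:

  // Hypothetical view of the per-executor state exposed to a policy.
  final case class ExecutorState(id: String,
                                 runningTasks: Int,
                                 memoryUsedFraction: Double,
                                 cachedPartitions: Set[Int])

  trait SchedulingPolicy {
    // Pick the executor on which a stream task should run
    // (possibly after suspending a batch task there).
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState
  }

  // Load-balancing (LB): prefer executors with spare memory, equalize tasks per node.
  object LoadBalancingPolicy extends SchedulingPolicy {
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState = {
      val fits = executors.filter(_.memoryUsedFraction < 0.9)   // assumed threshold
      (if (fits.nonEmpty) fits else executors).minBy(_.runningTasks)
    }
  }

  // Locality- and memory-aware (LMA): respect locality preferences, then fall back to LB.
  object LocalityMemoryAwarePolicy extends SchedulingPolicy {
    def chooseExecutor(executors: Seq[ExecutorState], partition: Int): ExecutorState = {
      val local = executors.filter(_.cachedPartitions.contains(partition))
      LoadBalancingPolicy.chooseExecutor(if (local.nonEmpty) local else executors, partition)
    }
  }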
Implementation
> Built as an extension to Apache Spark 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark's DAG scheduler to allow job stages with different requirements (priorities)
> Added additional executor performance metrics as part of the heartbeat mechanism
Azure deployment
> Cluster
– 75 nodes with 4 cores and 32 GB of memory each
> Workloads
– LDA: ML training/inference application uncovering hidden topics from a group of documents
– Yahoo Streaming Benchmark (YSB): ad-analytics on a stream of ad impressions
– TPC-H: decision support benchmark
Benefit of NEPTUNE in stream latency
> LDA: training (batch) job using all available resources, with a latency-sensitive inference (stream) job using 15% of resources
[Figure: 5th/median/99th-percentile latency of the latency-sensitive job under Static allocation, FIFO, FAIR, KILL, Isolation, Neptune-LB, and Neptune-LMA.]
NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs
Impact of resource demands on performance
> YSB: increasing the stream job's resource demands while the batch job uses all available resources
[Figure: stream latency and batch throughput as the stream job's share of cores grows from 0% (static allocation, "past") to 100% (share everything, "future"); even at full sharing, batch throughput drops by only 1.5%.]
Efficiently share resources with low impact on throughput
Summary
NEPTUNE supports complex unified applications with diverse job requirements!
> Suspendable tasks using coroutines
> Pluggable scheduling policies
> Continuous unified analytics
https://github.com/lsds/Neptune
Thank you! Questions?
Panagiotis Garefalakis
pgaref@imperial.ac.uk
BACKUP SLIDES
Suspension mechanism effectiveness
> TPC-H: Task runtime distribution for each query ranges from 100s of milliseconds to 10s of seconds
Suspension mechanism effectiveness
> TPC-H: Continuously transition tasks from Paused to Resumed states until completion
Suspendable tasks effectively pause and
resume with sub-millisecond latencies
Suspension mechanism effectiveness
> TPC-H: Continuously transition tasks from Paused to Resumed states until completion
Coroutine tasks have minimal performance overhead by bypassing the OS scheduler
[Figure: pause latency (ms) versus parallelism (2 to 64) for Coroutines and ThreadSync.]
Demo
> Run a simple unified application with
> a high-priority latency-sensitive job
> a low-priority latency-tolerant job
> Schedule them with default Spark and with Neptune
> Goal: show the benefit of Neptune and ease of use (a hypothetical sketch follows below)
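A hypothetical sketch of what such a demo application could look like on top of Spark. The priority property key "neptune.job.priority" is purely an assumption for illustration (Neptune's actual way of marking job priorities may differ; see the repository), whereas SparkSession, setLocalProperty, and the rate source are standard Spark APIs:

  import org.apache.spark.sql.SparkSession

  object UnifiedDemoApp {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("stream-batch-demo")
        .master("local[4]")
        .getOrCreate()

      // Low-priority, latency-tolerant batch job, submitted from a separate thread.
      val batch = new Thread(new Runnable {
        def run(): Unit = {
          spark.sparkContext.setLocalProperty("neptune.job.priority", "low")  // assumed key
          spark.range(0L, 1000000000L).selectExpr("sum(id)").show()
        }
      })
      batch.start()

      // High-priority, latency-sensitive streaming job over the built-in rate source.
      spark.sparkContext.setLocalProperty("neptune.job.priority", "high")     // assumed key
      val stream = spark.readStream.format("rate").option("rowsPerSecond", "1000").load()
      val query = stream.groupBy("value").count()
        .writeStream.outputMode("complete").format("console").start()

      query.awaitTermination()
    }
  }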
Suspendable tasks
Subroutine version (runs on the executor to completion; it cannot be suspended):

  val collect: (TaskContext, Iterator[T]) => Array[T] =
    (context: TaskContext, itr: Iterator[T]) => {
      val result = new mutable.ArrayBuffer[T]
      while (itr.hasNext) {
        result.append(itr.next)
      }
      result.toArray
    }

Coroutine version (yields an Int at pause points and returns Array[T]; context.isPaused() is Neptune's extension of the task context):

  val collect = coroutine { (context: TaskContext, itr: Iterator[T]) => {
    val result = new mutable.ArrayBuffer[T]
    while (itr.hasNext) {
      result.append(itr.next)
      if (context.isPaused())
        yieldval(0)
    }
    result.toArray
  } }

(A sketch of how an executor could drive such a coroutine follows below.)
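For completeness, here is a hedged sketch of how an executor-like caller could drive such a coroutine, again assuming the org.coroutines API. PausableContext is a stand-in for Neptune's extended task context, not a real Spark or Neptune class:

  import org.coroutines._
  import scala.collection.mutable

  object DriveCollect {
    // Minimal stand-in for a task context that carries a pause flag (hypothetical).
    final class PausableContext {
      @volatile private var paused = false
      def pause(): Unit = { paused = true }
      def unpause(): Unit = { paused = false }
      def isPaused(): Boolean = paused
    }

    val collectInts = coroutine { (context: PausableContext, itr: Iterator[Int]) =>
      val result = mutable.ArrayBuffer.empty[Int]
      while (itr.hasNext) {
        result.append(itr.next())
        if (context.isPaused()) yieldval(0)   // yield only when asked to pause
      }
      result.toArray
    }

    def main(args: Array[String]): Unit = {
      val ctx = new PausableContext
      val frame = call(collectInts(ctx, Iterator(1, 2, 3, 4)))

      ctx.pause()               // the executor asks the task to suspend
      frame.resume              // the task yields at the next record boundary
      // ... a higher-priority task could run here ...
      ctx.unpause()
      while (frame.resume) {}   // drive the task to completion
      println(frame.result.mkString(","))   // prints 1,2,3,4
    }
  }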

Editor's Notes

1. Hello everyone, I am Panagiotis, and this talk is about Neptune: a new task execution framework for modern unified applications. This is joint work with Konstantinos Karanasos from Microsoft and my supervisor Peter Pietzuch at the LSDS group at Imperial. Neptune is a data processing framework; and developers today...
2. ...want to use such frameworks to develop complex ML applications. This figure depicts such a production application implementing a real-time malicious behavior detection service, consisting of a training job and an inference job: the training job uses historical data to train a machine learning model, and the inference job performs latency-sensitive inference to detect malicious behavior using the previously trained model, which is shared in memory. This type of unified application design allows such jobs to easily share application state and application logic, obtain result consistency, and even share computation.
3. To support such applications, analytics frameworks such as the ones shown, which were traditionally dedicated to either batch or stream processing, evolved to unify different use cases by exposing different APIs (the DStreams and RDD APIs); but using these APIs, jobs were still deployed as separate applications (no true unification). Only recently did frameworks start exposing unified programming interfaces to seamlessly express different types of jobs as part of the same application: Spark Structured Streaming and the Flink Table API provide such programming interfaces, enabling users to develop unified applications. Such unified apps are known under different names, such as hybrid or continuous applications; for the rest of the presentation I will refer to them as stream/batch.
4. However, despite the unified application support from the API point of view, very little has been done on the scheduling side to capture and satisfy the diverse job requirements of those applications. This is a new challenge: how to schedule unified application jobs to satisfy their diverse requirements.
5. To schedule such an application in a data processing framework today, the application code is submitted to a centralized driver using a submit command; the driver, using the same context, then runs the application jobs through the DAG scheduler. The DAG scheduler splits each job into stages; in our example, two stages each (the dashed rectangles represent those stages). Every stage consists of tasks running computation and taking time T; some training tasks are more computationally intensive and take longer (3T). The number of tasks of each stage is defined by the number of partitions it operates on: stages 1 and 2 of the inference job operate on 2 partitions, while stage 1 of the training job operates on 4. For execution...
6. A typical approach to satisfy the requirements of the application is static resource allocation, where we dedicate a portion of resources to each job, for example 3/4. We then schedule every stage of each job using its available resources until completion. In this scenario, even though the stream job latency is low, it is still 2x higher than the optimal. More importantly, we end up with wasted resources: since in static allocation resources cannot be shared across jobs, they remain unused.
7. A more resource-efficient solution would be to use a unified framework such as Spark Structured Streaming, which enables sharing executors across jobs. By default, jobs are scheduled in a FIFO fashion where the first submitted job runs to completion. Assuming the training job is submitted first, it uses all resources until completion, and only then are the inference stages scheduled. However, FIFO causes significant queuing delays, especially when the jobs scheduled before the latency-sensitive ones are long: the stream job has worse latency than under static allocation.
8. An alternative is the FAIR policy, which shares resources equally across jobs (possibly weighted). FAIR first schedules two tasks from each job stage, and then the next equal share, until completion. Even though FAIR packs resources better and reduces the stream job's response time by 2T compared to FIFO, it cannot guarantee optimal latency for the stream job, as it cannot avoid queuing. So FAIR achieves better packing...
9. To avoid queuing delays for stream tasks completely, it is possible to use non-work-preserving preemption, coupled with a strategy such as FAIR, to kill batch tasks when needed. In that scenario we would schedule tasks in a FAIR manner, and when stream tasks need to run we would preempt already-running batch tasks. We would then need to restart all killed tasks, losing any progress. KILL can achieve better latency, but at the expense of extra work. Given that batch and stream jobs share the same runtime, can we do better?
10. What we want to do instead is to suspend low-priority batch tasks in favor of higher-priority stream tasks, and resume them when free resources are available.
  11. There are a few challenges we need to solve – such as:
12. Neptune is our new execution framework tackling these challenges, with support for: suspendable tasks (coroutines can suspend latency-tolerant tasks within milliseconds and resume them later, avoiding losing progress); a unified execution framework (users express such applications using existing programming interfaces); and pluggable scheduling policies (to satisfy the different requirements of stream/batch applications).
13. Tasks in analytics frameworks such as Spark today are implemented as subroutines, applying a function on a partition of data given as an iterator. To run a task, the executor allocates space for the return value of the function in the executor call stack, adds a function reference, and allocates space for the variables and intermediate state. At the end of the execution, the return value is written to the appropriate slot and we return to the executor call site. However, to preempt any of those tasks we either have to kill it, losing progress, or checkpoint the intermediate state, with unpredictable latency.
14. Instead, in Neptune we implement tasks using lightweight coroutines. Coroutines use a separate stack to store their variables, and add yield points after the processing of each record. To run a task, the executor now invokes a special call function, which creates the coroutine stack instance to store the function reference, variables, and state. To preempt, the executor marks the task context as paused and the task then yields at the next record iteration; the executor can later resume the same task instance or invoke another function. Using coroutines in Neptune provides a mechanism that is transparent to the user and supports function composition, which is fundamental for more complex logic.
15. The problem now is that we don't just assign tasks to executors but also decide which tasks get suspended and when; Neptune supports pluggable policies for these decisions. Let's see how an application is executed in Neptune: users develop an application and mark jobs with low or high priorities; the app goes through an optimization phase creating a plan in the form of a DAG; the DAG scheduler computes the execution plan of each job of the application; ready-to-execute tasks are then passed on to the task scheduler, which immediately executes a task if there are enough free resources. Otherwise, the policy can decide to suspend running batch tasks to free up resources so that stream tasks start their execution immediately. The oldest suspended task is scheduled for execution when others terminate, to avoid losing progress. Of course, there are system challenges: task suspension should be fast and smart, and the scheduling policy can take into account multiple task and cluster properties.
16. For the evaluation, we deployed Neptune in a 75-node Azure cluster. For workloads we used… (a latency-sensitive and a latency-tolerant instance).
17. First I want to show you how Neptune compares with existing approaches in terms of latency, using a unified LDA app with a latency-sensitive job and a latency-tolerant job. We first run the stream job in total isolation to get the ideal case for stream latency. We then run LDA with FIFO and FAIR as implemented in Spark: FAIR reduces the 99th-percentile latency by 37% compared to FIFO, but the median is still 2x higher than running in isolation. Adding preemption (KILL) to FAIR improves the median by 54% compared to FAIR, but its 99th-percentile latency is 2x higher (it cannot preempt more than a fair share of resources). We then run LDA with Neptune and two different policies: NEP-LMA achieves a latency that is just 13% higher than isolation for the 99th percentile, while Neptune without cache awareness (NEP-LB) achieves 61% worse latency for the 99th percentile compared to NEP-LMA. Finally, running the jobs in different executors has a tighter distribution but high tail latencies due to executor interference. Overall, NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs.
18. Since some applications require more resources than others, we also measured the impact of increasing resources (expressed as cores) on Neptune's performance. We run two instances of YSB. NEPTUNE maintains low latency across all percentages even though the number of tasks that must be suspended increases, showing the effectiveness of our preemption mechanism. At the same time, as the preempted batch tasks increase they take a hit in throughput; in this case, however, sharing 100% of the cluster resources only drops throughput by 1.5%. As a result we achieve an efficient share of resources with low impact on throughput. This figure is also like a glimpse from past to future, with 0% of cores being the static allocation approach and 100% being the share-everything approach, which is the future!
19. NEPTUNE is an execution framework that supports… It implements suspendable tasks that can pause and resume in milliseconds, and supports smart scheduling policies deciding how to suspend.
20. To measure the effectiveness of the suspension mechanism we use TPC-H: we run the TPC-H benchmark on a cluster of 4 Azure machines and measure the task runtime distribution for each query (in grey). Queries follow different task runtime distributions, ranging from 100s of milliseconds to 10s of seconds.
21. We re-run the benchmark and continuously transition tasks from the PAUSED to the RESUMED state until completion, while measuring the latency of each transition. By continuously transitioning between states and triggering yield points, we measure the worst-case scenario in terms of transition latency for each query's tasks. Although queries follow different task runtime distributions, Neptune manages to pause and resume tasks with sub-millisecond latencies. An exception is Q14, for which the 75th percentile of the pause latency is 100 ms: its data reside in a single partition and the Parquet reader operates at partition granularity.
22. TPC-H query 1. ThreadSync relies on the preemption of the OS scheduler and is implemented using thread wait and notify calls. We alternate between PAUSED and RESUMED states while increasing parallelism on the x-axis. As the parallelism increases, the ThreadSync latency increases: 2.6x for 14 threads, up to 600 ms for 64 threads. The OS scheduler must continuously arbitrate between wait and run queues; we bypass the OS scheduler.
23. Now let me show you how to use Neptune to run a unified application. I am going to run two jobs, one with low and one with high priority. Goal: compare default Spark with Neptune and show the effect on latency.
24. To give you an idea of how suspendable tasks are implemented in Neptune: the first code snippet shows the implementation of the collect action, which receives the task context and a record iterator as arguments and returns all dataset elements. The second snippet shows the same logic implemented with coroutines: when the task context is marked as paused, the task yields a value to the executor; the executor can then resume the same task instance or invoke another task.