Low Latency Execution for Apache Spark
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
Published on: Spark Summit 2016. Talk by Shivaram Venkataraman (UC Berkeley) and Aurojit Panda (UC Berkeley).

Published in: Data & Analytics
Low Latency Execution For Apache Spark

  1. Low Latency Execution for Apache Spark. Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
  2. Who am I? PhD candidate, AMPLab, UC Berkeley. Dissertation: system design for large-scale machine learning. Apache Spark PMC member; contributions to Spark core, MLlib, SparkR.
  3. Low latency: SPARK STREAMING
  6. Low latency: execution. Scheduler: few milliseconds; large clusters.
  7. This talk: a low-latency execution engine for iterative workloads (machine learning, stream processing, state).
  8. Execution Models
  9. Execution models: batch processing (e.g., Microsoft Dryad). Centralized task scheduling; lineage-based, parallel recovery. The driver coordinates shuffles and data sharing with control messages to tasks; each iteration involves network transfer for shuffle and broadcast.
  10. Execution models: parallel operators (e.g., Naiad). Long-lived operators; checkpointing-based fault tolerance. No per-iteration driver control messages, so latency is low; network transfer handles shuffles each iteration.
  11. SCALING behavior. [Chart: median task-time breakdown (network wait, compute, task fetch, scheduler delay) vs. number of machines, 4 to 128; time 0 to 250 ms.] Cluster: 4-core r3.xlarge machines. Workload: sum of 10k numbers per core.
  12. Can we achieve low latency with Apache Spark?
  13. DESIGN INSIGHT: fine-grained execution with coarse-grained scheduling.
  14. DRIZZLE: batch scheduling, pre-scheduling shuffles, distributed shared variables. [Diagram: an iteration with shuffle and broadcast steps.]
  15. DAG scheduling. Assign tasks to hosts using (a) locality preferences, (b) straggler mitigation, (c) fair sharing, etc. The driver serializes task metadata and launches tasks on the hosts. The DAG structure is the same for many iterations, so scheduling decisions can be reused.
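The scheduling reuse on this slide can be sketched in a few lines of plain Scala; the class and field names below are illustrative, not Drizzle's actual code:

```scala
// Toy scheduler that caches task placement per DAG signature, so the
// placement logic (locality, etc.) runs once and is replayed for every
// later iteration with the same structure. All names are hypothetical.
case class Task(id: Int, preferredHost: String)

class CachingScheduler(hosts: Seq[String]) {
  private val cache = scala.collection.mutable.Map.empty[String, Map[Int, String]]
  var placements = 0 // counts how often the placement logic actually ran

  private def place(tasks: Seq[Task]): Map[Int, String] =
    tasks.map { t =>
      // Honor the locality preference when that host exists, else round-robin.
      val host =
        if (hosts.contains(t.preferredHost)) t.preferredHost
        else hosts(t.id % hosts.size)
      t.id -> host
    }.toMap

  def schedule(dagSignature: String, tasks: Seq[Task]): Map[Int, String] =
    cache.getOrElseUpdate(dagSignature, { placements += 1; place(tasks) })
}

val sched = new CachingScheduler(Seq("host1", "host2"))
val tasks = Seq(Task(0, "host1"), Task(1, "host2"), Task(2, "missing"))
// 100 iterations with the same DAG structure: placement runs only once.
val assignments = (1 to 100).map(_ => sched.schedule("stage-sum-10k", tasks))
```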
  16. Batch scheduling: schedule a batch of iterations at once (b = 2, one stage in each iteration). Fault tolerance and scheduling happen at batch boundaries.
  17. How much does this help? [Chart: time/iteration (ms, log scale) vs. number of machines, 4 to 128, for Apache Spark, Drizzle-10, Drizzle-50, Drizzle-100.] Workload: sum of 10k numbers per core; single-stage job, 100 iterations, varying Drizzle batch size.
  18. DRIZZLE: batch scheduling, pre-scheduling shuffles, distributed shared variables.
  19. Coordinating shuffles: APACHE SPARK. The driver sends metadata (control messages); tasks pull intermediate data (data messages).
  20. Coordinating shuffles: PRE-SCHEDULING. Pre-schedule downstream tasks on executors; trigger tasks once their dependencies are met.
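The pre-scheduling trigger (run a downstream task once all its upstream outputs have arrived, without a driver round-trip) can be modeled with a simple dependency counter; everything below is a toy sketch with hypothetical names:

```scala
// Toy executor: downstream tasks are registered before their inputs exist,
// and fire locally once the last dependency arrives.
class Executor {
  private case class Pending(remaining: Int, run: () => Unit)
  private val pending = scala.collection.mutable.Map.empty[String, Pending]
  val ran = scala.collection.mutable.ArrayBuffer.empty[String]

  // Driver pre-schedules the task; no map output has been produced yet.
  def preSchedule(taskId: String, numDeps: Int): Unit =
    pending(taskId) = Pending(numDeps, () => ran += taskId)

  // An upstream output arrived; the last arrival triggers the task
  // with no round-trip back to the driver.
  def dependencyMet(taskId: String): Unit = {
    val p = pending(taskId)
    if (p.remaining == 1) { p.run(); pending -= taskId }
    else pending(taskId) = p.copy(remaining = p.remaining - 1)
  }
}

val exec = new Executor
exec.preSchedule("reduce-0", numDeps = 2)
exec.dependencyMet("reduce-0") // first map output: not yet runnable
exec.dependencyMet("reduce-0") // second map output: task fires
```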
  21. Push-metadata, pull-data. Coalesce metadata across cores; transfer intermediate data during the downstream stage.
  22. Micro-benchmark: 2 stages. [Chart: time/iteration (ms) vs. number of machines, 4 to 128, for Apache Spark and Drizzle-Pull.] Workload: sum of 10k numbers per core, 100 iterations.
  23. DRIZZLE: batch scheduling, pre-scheduling shuffles, distributed shared variables.
  24. Shared variables: fine-grained updates to shared data. Distributed shared variables elsewhere: MPI using MPI_AllReduce; Bloom, CRDTs; parameter servers, key-value stores; reduce followed by broadcast.
  25. Drizzle: replicated shared variables with custom merge strategies for commutative updates, e.g., vector addition. [Diagram: tasks apply updates to local replicas, which are then merged.]
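A minimal sketch of a replicated shared variable with a commutative merge (vector addition, as on the slide); the API shown is illustrative, not Drizzle's:

```scala
// Each task updates its own replica; replicas are merged elementwise.
// Because the updates commute, merge order does not matter.
class ReplicatedVector(size: Int) {
  private val local = Array.fill(size)(0.0)
  def add(update: Array[Double]): Unit = // commutative local update
    for (i <- update.indices) local(i) += update(i)
  def snapshot: Array[Double] = local.clone()
}

// Custom merge strategy: elementwise sum across replicas.
def merge(replicas: Seq[Array[Double]]): Array[Double] =
  replicas.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

val r1 = new ReplicatedVector(3)
val r2 = new ReplicatedVector(3)
r1.add(Array(1.0, 0.0, 2.0)) // task on replica 1
r2.add(Array(0.0, 3.0, 1.0)) // task on replica 2
val merged = merge(Seq(r1.snapshot, r2.snapshot))
```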
  26. Enabling async updates. Asynchronous semantics within a batch: tasks may read stale versions. Synchronous semantics enforced at batch boundaries by merging replicas.
  27. Evaluation. Micro-benchmarks: single stage, multiple stages, shared variables. End-to-end experiments: streaming benchmarks, logistic regression. Implemented on Apache Spark 1.6.1; integrations with Spark Streaming and MLlib.
  28. USING DRIZZLE API (DAGScheduler.scala). Existing: def runJob(rdd: RDD[T], func: Iterator[T] => U). Internal: def runBatchJob(rdds: Seq[RDD[T]], funcs: Seq[Iterator[T] => U]).
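To see what runBatchJob buys, here is a plain-Scala mock of the batched-submission semantics, with Seq standing in for RDD and a counter standing in for scheduler round-trips; this is an illustration of the idea, not Spark or Drizzle code:

```scala
// One "scheduling" event covers a whole Seq of jobs instead of one per job.
var schedulerInvocations = 0

def runJob[T, U](rdd: Seq[T], func: Iterator[T] => U): U = {
  schedulerInvocations += 1 // one scheduling round-trip per job
  func(rdd.iterator)
}

def runBatchJob[T, U](rdds: Seq[Seq[T]], funcs: Seq[Iterator[T] => U]): Seq[U] = {
  schedulerInvocations += 1 // one round-trip for the whole batch
  rdds.zip(funcs).map { case (rdd, f) => f(rdd.iterator) }
}

// 10 iterations submitted as one batch: the per-iteration scheduling
// overhead is amortized away.
val data = Seq.fill(10)(Seq(1, 2, 3))
val sums = runBatchJob[Int, Int](data, Seq.fill(10)(it => it.sum))
```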
  29. Using Drizzle (StreamingContext.scala). Spark Streaming: def this(sc: SparkContext, batchDuration: Duration). With Drizzle: def this(sc: SparkContext, batchDuration: Duration, jobsPerBatch: Int).
  30. Choosing batch size. b = 1 → batch processing: higher overhead, smaller window for fault tolerance. b = N → parallel operators: lower overhead, larger window for fault tolerance. In practice: the smallest batch such that overhead is below a fixed threshold; e.g., for 128 machines a batch of a few seconds is enough.
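One way to operationalize this trade-off is to pick the smallest batch whose amortized scheduling overhead is under the threshold (overhead only falls as b grows, so "below threshold" pins down the minimum useful b). The cost model and numbers below are illustrative assumptions, not measurements from the talk:

```scala
// Fraction of time spent on scheduling when one scheduling pass of
// schedMs is amortized over b iterations of iterMs each.
def overhead(b: Int, schedMs: Double, iterMs: Double): Double =
  schedMs / (schedMs + b * iterMs)

// Smallest batch keeping overhead below the threshold (keeps the
// fault-tolerance window as small as possible).
def chooseBatch(schedMs: Double, iterMs: Double, threshold: Double, maxB: Int): Int =
  (1 to maxB).find(b => overhead(b, schedMs, iterMs) < threshold).getOrElse(maxB)

// e.g. 100 ms to schedule a batch, 10 ms of work per iteration, 5% target
val b = chooseBatch(schedMs = 100.0, iterMs = 10.0, threshold = 0.05, maxB = 1000)
```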
  31. Evaluation. Micro-benchmarks: single stage, multiple stages, shared variables. End-to-end experiments: streaming benchmarks, logistic regression. Implemented on Apache Spark 1.6.1; integrations with Spark Streaming and MLlib.
  32. Streaming BENCHMARK. [Chart: CDF of event latency (ms); Drizzle 90 ms, Spark Streaming 560 ms.] Yahoo Streaming Benchmark: 1M JSON ad-events per second on 64 machines. Event latency: difference between window end and processing end.
  33. MLlib: LOGISTIC REGRESSION. [Chart: time/iteration (ms) vs. number of machines, 4 to 128, for Apache Spark and Drizzle.] Sparse logistic regression on RCV1: 677K examples, 47K features.
  34. Work in progress: automatic batch size tuning; open source release; Apache Spark JIRA to discuss potential contribution.
  35. Conclusion. There are overheads when using Apache Spark for streaming and ML workloads. Drizzle: a low-latency execution engine that decouples execution from centralized scheduling and achieves millisecond latency for iterative workloads. Workloads / questions / contributions: Shivaram Venkataraman, shivaram@cs.berkeley.edu
