My EPFL candidacy exam presentation: http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/vozniuk_andrii_candidacy_writeup.pdf
Here I present how schedulers work in three distributed data processing systems and their possible optimizations. I consider Gamma, a parallel database; MapReduce, a data-intensive system; and Condor, a compute-intensive system.
This talk is based on the following papers:
1) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt
2) Improving MapReduce Performance in Heterogeneous Environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica
3) Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon
2. Big Data
Data explosion
Processing gets more complicated
Example: one source generates 25 TB/day and stores 10 PB/year; another generates 40 TB/day and stores 20 PB/year
Resources of many computers should be used
3. Typical Data Processing Pipeline
Log data, sensor data → ETL-like batch processing → clean data → analyze data → user model ("Particle found!")
Batch processing uses the resources of many organizations
Efficient query execution over the query data
No one-size-fits-all system currently exists
4. Outline
Ɣ Gamma - parallel database
MapReduce - data-intensive system
Condor - compute-intensive system
Conclusions
Future Research
5. Scheduling In Distributed Systems
Scheduling:
Policy: setting an ordering of tasks
Assigning resources to tasks
How to match resources and tasks?
Scheduling is challenging in distributed systems
6. Matching Tasks With Resources
Perspectives: the data model and the execution model

System      | Data model    | Execution model
Gamma       | Relational    | Multioperator
MapReduce   | Unconstrained | MapReduce
Condor      | Unconstrained | Unconstrained

How is scheduling influenced by the data and execution models?
7. Gamma Ɣ
Pioneering parallel database
Data model: constrained
Relational data model
Relations are horizontally partitioned
Execution model: constrained
Multioperator queries
Operators employ hash-based algorithms
8. Gamma: Scheduler Ɣ
Example query: SELECT r FROM R WHERE r < 'k'
The query arrives at the host machine, where the Query Manager consults the Catalog
The Gamma Scheduler optimizes the query, compiles the plan, and schedules database operator processes
Operator processes execute on the relevant nodes (e.g. Node 1 holds partition a-m, Node 2 holds n-z)
Scheduling is done at the operator level
9. Gamma: Batch Scheduling Ɣ
Exploit sharing by scheduling in a batch
Example of selection sharing: two selections σ1 and σ2 over the same relation A
Without sharing, A is scanned once per selection; with a shared scan, reads of A are shared by applying the predicates in turn
The shared relation A is scanned only once
Batch scheduling trades latency for throughput
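The shared-scan idea can be sketched in a few lines of Python. This is a minimal illustration, not Gamma's actual implementation; the tuple and predicate representations are assumptions:

```python
def shared_scan(relation, predicates):
    """Scan `relation` once, applying every query's predicate to each
    tuple, instead of performing one scan per query."""
    results = [[] for _ in predicates]   # one result list per query
    for tup in relation:                 # single pass over the data
        for i, pred in enumerate(predicates):
            if pred(tup):
                results[i].append(tup)
    return results

# Two selections share one scan of A (predicates are hypothetical):
A = ["apple", "kiwi", "melon", "banana"]
r1, r2 = shared_scan(A, [lambda r: r < "k",    # sigma1: r < 'k'
                         lambda r: r >= "m"])  # sigma2: r >= 'm'
```

Each query still sees exactly its own result set; only the I/O over A is shared.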
10. Gamma: Batch Scheduling Joins Ɣ
Several hash-joins in a batch of queries
Hash table for the same relation can be shared
Example (assuming 100% selectivity of σ): joins A ⋈ B and A ⋈ C in the same batch can share the hash table built for A
Sharing reduces I/O and memory usage
Sharing among joins reduces total execution time
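The hash-table sharing above can be sketched as follows. This is a simplified sketch with dictionary rows and a single join key, not Gamma's operator implementation:

```python
from collections import defaultdict

def build_hash_table(relation, key):
    """Build the hash table for the shared relation once."""
    table = defaultdict(list)
    for row in relation:
        table[row[key]].append(row)
    return table

def hash_join(ht, probe, probe_key):
    """Probe an existing hash table; no rebuild per query."""
    return [(a, b) for b in probe for a in ht[b[probe_key]]]

# Hypothetical relations sharing A's hash table across two joins:
A = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
B = [{"id": 1, "y": "b1"}]
C = [{"id": 2, "z": "c1"}]
ht_A = build_hash_table(A, "id")   # built once, kept in memory
AB = hash_join(ht_A, B, "id")      # A ⋈ B
AC = hash_join(ht_A, C, "id")      # A ⋈ C reuses ht_A
```

Building `ht_A` once saves both the repeated scan of A and the memory for duplicate hash tables, which is exactly the saving the slide claims.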
11. Limitations Of Gamma Ɣ
Gamma offers
Efficient query execution
Sharing in a batch of queries
Gamma operates on structured data
Gamma is not suitable for
Unstructured data processing
ETL-type workloads
Running at large scale
A different system for ETL processing is needed
12. MapReduce
System for data-intensive applications
Execution model: constrained
Job is a set of map and reduce tasks
Tasks are independent
Data model: unconstrained
Arbitrary data format
Files are partitioned into chunks
Each chunk is replicated several times
13. MapReduce: Scheduling
Example: a MapReduce job with 4 map tasks and 2 reduce tasks
Map 1-4 each read one input chunk (Chunk 1-4) and write intermediate output (Temp 1-4)
Reduce 1 and Reduce 2 read the intermediate outputs and write Result 1 and Result 2
Tasks are scheduled close to the data
Execution is scalable and fault-tolerant
Execution is elastic
Fine-grained scheduling improves fault tolerance and elasticity
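Scheduling tasks close to the data can be sketched as a simple locality-aware assignment. The chunk-to-replica map and task representation here are assumptions for illustration, not the actual MapReduce scheduler:

```python
def assign_map_task(free_node, pending_tasks, replicas):
    """Prefer a pending map task whose input chunk has a replica on
    `free_node`; fall back to any task (remote read) otherwise."""
    for task, chunk in pending_tasks:
        if free_node in replicas[chunk]:
            return task                      # data-local execution
    return pending_tasks[0][0] if pending_tasks else None

# Hypothetical cluster state: each chunk is replicated on some nodes
replicas = {"chunk1": {"n1", "n2"}, "chunk2": {"n3"}}
pending = [("map1", "chunk1"), ("map2", "chunk2")]
```

When node `n3` frees up it gets `map2` (its local chunk) even though `map1` is first in the queue; a node holding no pending chunk still gets work, just with a remote read.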
14. MapReduce: Speculative Execution
Nodes may become slow
Speculative execution minimizes job’s response time
A backup task is launched if a task's progress is 20% below the average
The straggler (a temporarily slow node) keeps the original copy; the backup runs on a normal node
Speculative execution works well in a homogeneous environment
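The 20% rule above can be written down directly. A minimal sketch, assuming progress scores in [0, 1]; the threshold and representation are simplifications of the real scheduler:

```python
def needs_backup(progress, all_progress, threshold=0.2):
    """Speculate a task whose progress score is more than `threshold`
    (20%) below the average of its category, per the rule above."""
    avg = sum(all_progress) / len(all_progress)
    return progress < avg - threshold

# Hypothetical progress scores: average is 0.75, so only the task at
# 0.5 falls more than 0.2 below it and gets a backup copy.
scores = [0.9, 0.85, 0.5]
flags = [needs_backup(p, scores) for p in scores]
```

In a homogeneous cluster this works: the only tasks far below average really are stragglers.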
15. Emerging Heterogeneous Infrastructures
Replacement of failed components
Extending existing cluster with new machines
Virtualized data centers of cloud providers
CPU and RAM are isolated
Contention for disk and network
[Chart: I/O performance per VM (MB/s) vs. number of VMs on one physical host (1-7): per-VM throughput degrades as more VMs share the host]
In many real-life cases the infrastructure is heterogeneous
16. MapReduce: Heterogeneous Cluster
Performance degrades on a cluster mixing fast and slow nodes:
Slow nodes are wasted
Backup tasks are launched on slow nodes
All straggling tasks are treated equally
Thrashing occurs due to excessive speculative execution
Speculative execution should be improved for heterogeneous clusters
17. MapReduce: LATE Scheduler
Idea: back up the task with the largest estimated finish
time (Longest Approximate Time to End)
progress rate = progress score / execution time
estimated time left = (1 − progress score) / progress rate
Thresholds
Limit the number of backup tasks
Launch backup tasks on fast nodes
Backup only sufficiently slow tasks
LATE looks forward in time to prioritize which tasks to speculate on
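The two formulas translate directly to code. A minimal sketch of the estimate, using the numbers from the example on slide 18 (where the slowdown factors give the progress rates directly):

```python
def progress_rate(progress_score, execution_time):
    """progress rate = progress score / execution time"""
    return progress_score / execution_time

def time_left(progress_score, rate):
    """estimated time left = (1 - progress score) / progress rate"""
    return (1 - progress_score) / rate

# Task on a 3x-slower node, 66% done:   (1 - 0.66) / (1/3)   ~ 1.0 min
# Task on a 1.9x-slower node, ~5% done: (1 - 0.05) / (1/1.9) ~ 1.8 min
t_slow3 = time_left(0.66, 1 / 3)
t_slow19 = time_left(0.05, 1 / 1.9)
# LATE backs up the task with the largest estimate: the 1.9x task,
# even though the 3x task looks "slower" by raw slowdown factor.
```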
18. MapReduce: LATE Example
Back up the task with the Longest Approximate Time to End
Example, two minutes in, with a normal node processing 1 task/min:
Node 2 (3x slower): progress = 66%, estimated time left = (1 − 0.66) / (1/3) ≈ 1 min
Node 3 (1.9x slower): progress = 5.3%, estimated time left = (1 − 0.05) / (1/1.9) ≈ 1.8 min
LATE backs up the task on node 3, correctly identifying the task that hurts the response time the most
19. Limitations Of MapReduce
MapReduce offers
High scalability
Good fault tolerance
Handling of unstructured data
MapReduce is not suitable for
Running on a multi-organization infrastructure
Harvesting idle resources within an organization
A different system is needed for multi-organization infrastructures
20. Condor
Compute-intensive system harvesting idle resources
Data model: arbitrary
Execution model: arbitrary
How to increase utilization while respecting the machine owners?
Resource utilization is increased by scheduling jobs on idle machines
23. Condor Scheduler: Hybrid!
Schedulers hold information about tasks; the Matchmaker also receives information about nodes
1. Schedulers and nodes advertise themselves to the Matchmaker
2. The Matchmaker finds a match
3. Both parties are notified
4. The scheduler claims the node and runs the job there
The hybrid approach has the best of both worlds
24. ClassAds: Describing Jobs and Resources
Job Description:
[ MyType = "Job"
  TargetType = "Machine"
  Department = "CompSci"
  Requirements = (other.OpSys == "LINUX" && other.Disk > 10000000)
  Rank = Memory ]

Machine Description:
[ MyType = "Machine"
  TargetType = "Job"
  Machine = "nostos.cs.wisc.edu"
  OpSys = "LINUX"
  Disk = 3076077
  Requirements = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))
  Rank = other.Department == self.Department ]

The Requirements of both ads must be satisfied
The candidate with the highest rank is returned
The matchmaker is suitable for heterogeneous shared clusters
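The matchmaking semantics can be sketched compactly. This is a toy model, not the ClassAd language: ads are plain dicts and `Requirements`/`Rank` are Python lambdas taking the counterpart ad, with all attribute names hypothetical:

```python
def matches(ad_a, ad_b):
    """Symmetric matchmaking: each ad's Requirements must hold with
    `other` bound to the counterpart ad (here passed as an argument)."""
    return ad_a["Requirements"](ad_b) and ad_b["Requirements"](ad_a)

def best_match(job, machines):
    """Among machines that mutually match, return the one the job
    Ranks highest (the candidate with the highest rank)."""
    candidates = [m for m in machines if matches(job, m)]
    return max(candidates, key=job["Rank"], default=None)

# Toy ads mirroring the slide: the job wants LINUX with enough disk
# and ranks machines by Memory; both machines accept any job here.
job = {
    "Requirements": lambda o: o["OpSys"] == "LINUX" and o["Disk"] > 10_000_000,
    "Rank": lambda o: o["Memory"],
}
machines = [
    {"OpSys": "LINUX", "Disk": 30_000_000, "Memory": 4,
     "Requirements": lambda o: True},
    {"OpSys": "LINUX", "Disk": 50_000_000, "Memory": 16,
     "Requirements": lambda o: True},
]
winner = best_match(job, machines)
```

Because both sides express constraints and preferences in the same language, the same mechanism handles arbitrary heterogeneous machines and jobs.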
25. Conclusions
Scheduling is done at different levels
Gamma: operator-level scheduling enables sharing
MapReduce and Condor: arbitrary code, so sharing is hard
Condor: matchmaking gives control over job placement
Hybrid approaches are promising for big data processing
Scheduling in heterogeneous deployments is challenging
26. Thank you for your attention!
Feedback & Questions?
Andrii.Vozniuk@epfl.ch
27. References
Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon.
Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt.
Improving MapReduce Performance in Heterogeneous Environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica.
Slides 14 and 18 adapt presentation ideas from the LATE slides for OSDI 2008 by Matei Zaharia.