Hadoop Scheduling - a 7 year perspective

Talk given at Flipkart's SlashN conference, 2014: a seven-year perspective on Hadoop job scheduling.

  1. Job Scheduling in Hadoop: an exposé (Joydeep Sen Sarma)
  2. About Me
     • c 2007, Facebook: ran/managed Hadoop ~3 years
     • Wrote Hive
     • Mentor/PM of the Hadoop Fair-Scheduler
     • Used Hadoop/Hive (as Warehouse/ETL dev)
     • Re-wrote significant chunks of Hadoop job scheduling (incl. Corona)
     • c 2014, Qubole: running the world’s largest Hadoop clusters on AWS
  3. The Crime: Shared Hadoop Clusters
     • Statistical multiplexing
     • Largest jobs only fit on pooled hardware
     • Data locality
     • Easier to manage
  4. … and the Punishment
     • “Have you no Hadoop etiquette?” (c 2007) (reducer count capped in response)
     • User takes down entire cluster (OOM) (c 2007-09)
     • Bad job slows down entire cluster (c 2009)
     • Steady-state latencies get intolerable (c 2010-)
     • “How do I know I am getting my fair share?” (c 2011)
     • “Too few reducer slots, cluster idle” (c 2013)
  5. The Perfect Weapon: the Scheduler
     • Efficient
     • Scalable
     • Strong isolation
     • Fair
     • Fault tolerant
     • Low latency
  6. Quick Review
     • Fair Scheduler (fairness/isolation)
     • Speculation (fault tolerance/latency)
     • Preemption (fairness)
     • Usage monitoring/limits (isolation)
  7. And then there’s Hadoop (1.x) …
     • Single JobTracker for all jobs
       – Does not scale; SPOF
     • Pull-based architecture
       – Scalability and low latency at permanent war
       – Inefficient: leaves idle time
     • Slot-based scheduling
       – Inefficient
     • Pessimistic locking in the tracker
       – Scalability bottleneck
     • Long-running tasks
       – Fairness and efficiency at permanent war
  8. Poll Driven Scheduling
     • Example query: insert overwrite table dest select … from ads join campaigns on … group by …;
     • [Diagram: the JobTracker (master) hands map and reduce tasks to a TaskTracker (slave) child process in response to its heartbeat]
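     To make the pull model above concrete, here is a minimal Java sketch. It is
     not the real Hadoop 1.x code; the class and method names (JobTrackerSketch,
     pickBestTasks, HeartbeatResponse) are illustrative assumptions. The point is
     that the master can only hand out work when a slave polls it.

        import java.util.Collections;
        import java.util.List;

        class Task {
            final String id;
            Task(String id) { this.id = id; }
        }

        class HeartbeatResponse {
            final List<Task> tasksToLaunch;
            HeartbeatResponse(List<Task> tasksToLaunch) { this.tasksToLaunch = tasksToLaunch; }
        }

        class JobTrackerSketch {
            // Work is handed out only when a TaskTracker polls; between
            // heartbeats, freed slots sit idle (the inefficiency noted on slide 7).
            synchronized HeartbeatResponse processHeartbeat(String trackerId,
                                                            int freeMapSlots,
                                                            int freeReduceSlots) {
                List<Task> picked = pickBestTasks(trackerId, freeMapSlots, freeReduceSlots);
                return new HeartbeatResponse(picked);
            }

            private List<Task> pickBestTasks(String trackerId, int maps, int reduces) {
                // A pluggable scheduler (e.g. the Fair Scheduler) would choose here.
                return Collections.emptyList();
            }
        }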
  9. Pessimistic Locking
     getBestTask():
       for pool: sortedPools
         for job: pool.sortedJobs()
           for task: job.tasks()
             if betterMatch(task)
               …

     processHeartbeat():
       synchronized(world):
         return getBestTask()
  10. Slot Based Scheduling
      • N cpus, M map slots, R reduce slots
        – Memory cannot be oversubscribed!
      • How to divide?
        – M < N → not enough mappers at times
        – R < N → not enough reducers at times
        – N = M = R → enough memory to run 2N tasks?
      • Reduce tasks problematic
        – Network intensive to start, CPU wasted
        – Memory intensive later
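      As a rough illustration of the arithmetic behind the dilemma above (all
      numbers below are invented for the example, not recommendations):

         // Back-of-the-envelope check of the slot dilemma; numbers are made up.
         public class SlotMath {
             public static void main(String[] args) {
                 int cpus = 16;               // N
                 double taskHeapGb = 2.0;     // memory reserved per slot
                 double nodeMemGb = 48.0;

                 // M = R = N avoids "not enough mappers/reducers", but the node
                 // must then hold 2N tasks in memory at once, even though only
                 // N of them can be on CPU at any moment.
                 int mapSlots = cpus, reduceSlots = cpus;
                 double worstCaseGb = (mapSlots + reduceSlots) * taskHeapGb;
                 System.out.printf("Worst case: %.0f GB needed, %.0f GB available%n",
                                   worstCaseGb, nodeMemGb);
             }
         }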
  11. Long Running Reducers
      • Online scheduling
        – No advance information about future workload
      • Greedy + fair scheduling
        – Schedule ASAP
        – Preempt if the future workload disagrees
      • Long-running reducers
        – Preemption causes restart and wasted work
        – No effective way to use short bursts of idle CPU
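      A minimal sketch of the greedy-then-preempt policy above, assuming a
      simplified fair-share model; the SchedulerPool/RunningTask types and the
      newest-first victim choice are illustrative assumptions, not the Fair
      Scheduler's actual code.

         import java.util.Comparator;
         import java.util.List;
         import java.util.stream.Collectors;

         class RunningTask {
             long startTimeMs;
             boolean isReduce;
         }

         class SchedulerPool {
             int runningSlots;
             int fairShare;             // slots this pool is entitled to right now
             List<RunningTask> tasks;
         }

         class PreemptionSketch {
             // Reclaim slots from pools running above their fair share, newest
             // tasks first, so the least work is thrown away. A reducer that has
             // already run for hours is exactly the worst possible victim.
             List<RunningTask> tasksToPreempt(List<SchedulerPool> pools, int slotsNeeded) {
                 return pools.stream()
                         .filter(p -> p.runningSlots > p.fairShare)
                         .flatMap(p -> p.tasks.stream())
                         .sorted(Comparator.comparingLong((RunningTask t) -> t.startTimeMs).reversed())
                         .limit(slotsNeeded)
                         .collect(Collectors.toList());
             }
         }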
  12. Optimistic Locking
      Task[] getBestTaskCandidates():
        for pool: sortedPools
          for job: pool.sortedJobs.clone()
            for task: job.tasks.clone()
              synchronized(task):
                …

      processHeartbeat():
        tasks = getBestTaskCandidates()
        synchronized(world):
          return acquireTasks(tasks)
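      The pseudocode above leaves the selection and acquisition details elided.
      The sketch below shows one way the check-then-confirm pattern can look in
      Java, assuming a per-task lock plus a brief global critical section; all
      names (TaskAttempt, OptimisticScheduler, tryAssign) are invented for the
      illustration, not Corona's real classes.

         import java.util.ArrayList;
         import java.util.List;

         class TaskAttempt {
             private boolean assigned;

             // Fine-grained lock: only this task is locked while claiming it.
             synchronized boolean tryAssign() {
                 if (assigned) return false;   // lost the race to another heartbeat
                 assigned = true;
                 return true;
             }
         }

         class OptimisticScheduler {
             private final Object world = new Object();

             // Phase 1 scans a cloned snapshot with no global lock held; phase 2
             // takes the global lock only long enough to confirm the candidates.
             List<TaskAttempt> processHeartbeat(List<TaskAttempt> snapshot, int freeSlots) {
                 int n = Math.min(freeSlots, snapshot.size());
                 List<TaskAttempt> candidates = new ArrayList<>(snapshot.subList(0, n));

                 List<TaskAttempt> acquired = new ArrayList<>();
                 synchronized (world) {
                     for (TaskAttempt t : candidates) {
                         if (t.tryAssign()) {
                             acquired.add(t);  // still free: take it
                         }
                         // else: stale candidate, skip it
                     }
                 }
                 return acquired;
             }
         }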
  13. Corona: Push Scheduling
      1. JT subscribes for M maps and R reduces
         – Receives availability from the Cluster Manager (CM)
      2. CM publishes availability ASAP
         – Pushes events to the JT
      3. JT pushes tasks to available TTs
         – In parallel
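      A minimal event-driven sketch of this push flow, assuming a simple
      subscribe/notify interface; ResourceGrantListener and the other names are
      assumptions for the illustration, not Corona's actual API.

         import java.util.Queue;
         import java.util.concurrent.ConcurrentLinkedQueue;

         interface ResourceGrantListener {
             void onGrant(String nodeId, int mapSlots, int reduceSlots);
         }

         class ClusterManagerSketch {
             private final Queue<ResourceGrantListener> subscribers = new ConcurrentLinkedQueue<>();

             void subscribe(ResourceGrantListener perJobTracker) {
                 subscribers.add(perJobTracker);
             }

             // Called as soon as a node reports free capacity: the grant is pushed
             // to subscribed job trackers immediately, instead of being parked
             // until that node's next heartbeat poll.
             void onNodeCapacityFreed(String nodeId, int mapSlots, int reduceSlots) {
                 for (ResourceGrantListener jt : subscribers) {
                     jt.onGrant(nodeId, mapSlots, reduceSlots);
                 }
             }
         }

         class PerJobTrackerSketch implements ResourceGrantListener {
             @Override
             public void onGrant(String nodeId, int mapSlots, int reduceSlots) {
                 // Push tasks to the TaskTracker on nodeId right away, in
                 // parallel with grants for other nodes.
             }
         }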
  14. Corona/YARN: Scalability
      1. A JobTracker per job, now independent
         – More fault tolerant and isolated as well
      2. Centralized Cluster/Resource Manager
         – Must be super-efficient!
      3. Fundamental differences
         – Corona ~ latency
         – YARN ~ heterogeneous workloads
  15. Pesky Reducers
      • Hadoop 2 removes the distinction between M and R slots
      • Not enough:
        – Reduce tasks don’t use much CPU in shuffle
        – Still long-running and bad to preempt
      → Re-architect to run millions of small reducers
  16. The Future is Cloudy
      • Data-center assumption:
        – Cluster characteristics known
        – Job spec fits the cluster
      • In the cloud:
        – Cluster can grow/shrink and change node type
        – Job spec must be dynamic
        – Uniform task configuration untenable
  17. Questions?
      joydeep@qubole.com
      http://www.linkedin.com/in/joydeeps
