More Related Content


Scheduling on large clusters - Google's Borg and Omega, YARN, Mesos

  1. Scheduling on Large Clusters Based on Google’s Omega Paper Sameer Tiwari Hadoop Architect, Pivotal Inc., @sameertech
  2. Scheduling on Large Clusters ● Goals o High Utilization o Honor User Defined constraints o Maintain High Efficiency ● Issues o Un-predictable load o Varying types of load o Increasing load and cluster size
  3. Types of Schedulers ● Monolithic o Single Resource Manager and Scheduler o Google Borg ● Two Level o Single Resource Management and multiple schedulers o Mesos, Hadoop-on-Demand (HOD project) ● Shared state o Multiple schedulers with access to all resources o Google Omega
  4. Monolithic Schedulers ● Stable, been around since 1990s ● Issues o Head of line blocking o Scalability is limited o Popular with HPC community  Maui -> Moab(R), Platform LSF (IBM) o Multi Path scheduling addresses some of these problems
  5. Statically Partitioned Schedulers ● Common with Hadoop deployments o Assumes full control of resources o Dedicated or statically partitioned clusters ● Issues o Low utilization o Data fragmentation
  6. ● AKA: Two-level Schedulers o Resource Manager dynamically partitions a cluster o Resources presented to partitions as “offers” o Partitions request resources as needed o e.g. Mesos and Hadoop on Demand (HOD) ● Issues o Pessimistic locking is used during allocation o Not suitable for “long running” jobs o Gang scheduling (e.g. MPI jobs) can cause deadlocks o Each scheduler has no idea about any other scheduler  Pre-emption is tricky Dynamic Schedulers
  7. ● What type of scheduler is Hadoop YARN? o App Master requests single RM, per job o But, the App Master provides job-mgmt service, not scheduling o Effectively, its a Monolithic Scheduler Trivia
  8. ● No external Resource Manager ● Each scheduler has full access to cluster ● A copy of the cluster state is at each scheduler ● Optimistic concurrency control o Updates are made atomically in a transaction o Only one commit will succeed o Failed transactions will try again ● Gang scheduling, will not result in resource hoarding Shared State Schedulers
  9. ● Each scheduler, free to choose a policy ● Requires a common understanding of o Resources o Precedence ● Relies on post-facto enforcement ● Results in high utilization and efficiency Shared State Schedulers
  10. Questions?

Editor's Notes

  1. Users can ask for colocation or ask for a particular rack or machine Efficiency is : Fast allocation
  2. Works well with small jobs (<<cluster resources) and short lived jobs that give up resources frequently
  3. Works well with small jobs (<<cluster resources) and short lived jobs that give up resources frequently
  4. * Addresses two issues of the two-level scheduler approach – limited parallelism due to pessimistic concurrency control - restricted visibility of resources in a scheduler framework - no head-of-line blocking * Potential cost of redoing work when the optimistic concurrency assumptions are incorrect * Resource Hoarding not possible in an all-or-nothing resource allocation * To prevent starvation: Incremental transactions == accept all but conflicting txns