The document discusses three proposed cluster computing frameworks: CloudMirror, Mesos, and Omega.
CloudMirror addresses challenges of providing bandwidth guarantees for interactive workloads in the cloud. It proposes a new network abstraction model based on application communication structure and a workload placement algorithm for efficient resource allocation.
Mesos targets sharing cluster resources across frameworks. It introduces a two-level resource allocation and isolation model to allow sharing while preventing interference. Mesos was implemented in C++ and evaluated using various macrobenchmarks showing improved resource utilization and scalability.
Omega is a proposed scheduler architecture that avoids centralized control. It grants schedulers parallel access to the entire cluster and uses optimistic concurrency control to resolve conflicts. Simulations showed that Omega improves scheduling performance.
2. CloudMirror: Background Problem and Challenges
Cloud-hosted Application Problem
Interactive applications are not as simple as batch frameworks like Hadoop or Pregel
Interactive = predictable throughput & latency required
100 msec of added latency = 1% sales loss (Amazon)
Interactive workloads consume at least as much CPU as batch workloads
Oversubscribing bandwidth to guarantee application performance is very expensive
There is no bandwidth-to-vCPU ratio that can guarantee bandwidth usage
Key Challenges
• An "easy" network abstraction model to specify bandwidth requirements
• A workload placement algorithm for efficient resource allocation
• A scalable runtime to enforce bandwidth guarantees and efficient usage
3. CloudMirror: Proposed Solutions
*) New network abstraction based on the application communication structure: the TAG (Tenant Application Graph), combined with a workload placement algorithm, makes up CloudMirror.
TAG Deployment
• Bandwidth allocation at the DC uplink matches the TAG model requirements
• Bandwidth savings from VM collocation within a subtree
• A VM placement algorithm bridges the gap between the high-level TAG and the low-level infrastructure
• Guarantees anti-affinity for HA tiers and opportunistic anti-affinity for non-HA tiers
TAG model
• Each graph vertex represents an application component/tier
• Intuitive, descriptive, efficient, and flexible
• Can be produced by extensions to OpenStack Heat and AWS CloudFormation
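The TAG abstraction above can be sketched as a small data structure (class and field names are illustrative, not CloudMirror's API): vertices are application tiers with a VM count, and directed edges carry per-VM send/receive bandwidth guarantees between tiers.

```python
# Minimal sketch of a Tenant Application Graph (TAG); the class and
# field names are illustrative, not CloudMirror's API. Each vertex is
# an application tier with a VM count, and each directed edge carries
# the per-VM send/receive bandwidth guarantees between two tiers.
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    vm_count: int

@dataclass
class TAG:
    tiers: dict = field(default_factory=dict)   # name -> Tier
    edges: list = field(default_factory=list)   # (src, dst, send_bw, recv_bw)

    def add_tier(self, name, vm_count):
        self.tiers[name] = Tier(name, vm_count)

    def add_edge(self, src, dst, send_bw, recv_bw):
        # send_bw: per-VM bandwidth guarantee from src toward dst
        self.edges.append((src, dst, send_bw, recv_bw))

# A three-tier web application expressed as a TAG.
tag = TAG()
tag.add_tier("web", 10)
tag.add_tier("app", 20)
tag.add_tier("db", 5)
tag.add_edge("web", "app", send_bw=50, recv_bw=50)
tag.add_edge("app", "db", send_bw=30, recv_bw=120)
print(len(tag.tiers), len(tag.edges))  # 3 2
```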
4. CloudMirror: Simulation and Evaluation Result
Evaluation
1) Efficiency
a) Reserving less network bandwidth
b) Accepting more tenant requests
2) Placement's ability to guarantee and improve availability
3) Feasibility of deployment on a real testbed
Result Highlights
• Resource balancing pays off once bandwidth capacity constraints are introduced into the network topology
• The tenant rejection rate is below 2.2% and is usually caused by large VM/bandwidth requirements
• Guaranteeing high availability under a stricter WCS requirement increases the rejection rate
• Scalability: ~200 msec for 100 VMs/tenant, a few seconds for 1000 VMs/tenant
5. Mesos: Background Problem + Challenges and Target Environment
Cluster Computing Frameworks Today
New frameworks keep emerging, but no single framework is best for all applications
Multiplexing improves utilization and enables data sharing without costly data replication across frameworks
Static partitioning or per-framework VM allocation achieves neither high utilization nor efficient sharing
>> no fine-grained sharing across frameworks
Key Challenges
• Complexity: a scheduler API rich enough to capture all frameworks' requirements, plus online optimization over millions of tasks
• New frameworks and new scheduling policies: current frameworks are still being actively developed
• Expensive refactoring: moving each framework's scheduling logic into a global scheduler
Target Environment:
Clusters that run Hadoop jobs/tasks alongside MPI jobs at the same time
(e.g., the Facebook or Yahoo data warehouses)
6. Mesos: Proposed Solutions
Key Features
1) Resource allocation
• Two allocation modules: max-min fairness over multiple resources, and strict priorities (similar to Hadoop and Dryad)
• Task revocation mechanism: when revocation is triggered, low-impact tasks are killed
2) Resource isolation between framework executors
• Leverages existing OS isolation mechanisms via pluggable modules
• Currently uses Linux Containers and Solaris Projects
3) Scalable and robust resource offers, via three mechanisms:
• Filters let frameworks declare resources they will always reject
• A response timer bounds how long a framework can hold an offer
• If a framework does not respond, the resources are re-offered to other frameworks
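The three offer mechanisms above can be sketched as a simple loop (function and field names are mine, not the real Mesos API): the master offers a slave's free resources to frameworks in turn, skipping any whose filter would reject the offer, and moving on when a framework times out.

```python
# Illustrative sketch of Mesos's resource-offer flow (names are mine,
# not the real Mesos API): the master offers free resources to one
# framework at a time; filters let it skip frameworks that would always
# reject the offer, and a response timeout lets it re-offer the
# resources to the next framework.

def offer_resources(offer, frameworks, respond):
    """Try each framework in allocation order until one accepts.

    offer      -- dict of resources, e.g. {"cpus": 4, "mem": 8192}
    frameworks -- list of framework dicts with a "filters" predicate
    respond    -- respond(fw, offer) -> list of tasks, or None on timeout
    """
    for fw in frameworks:
        # Mechanism 1: skip frameworks whose filter rejects this offer.
        if fw["filters"](offer):
            continue
        # Mechanisms 2 & 3: wait a bounded time; on timeout, move on.
        tasks = respond(fw, offer)
        if tasks is not None:
            return fw["name"], tasks
    return None, []

frameworks = [
    {"name": "mpi",    "filters": lambda o: o["mem"] < 16384},  # wants big nodes
    {"name": "hadoop", "filters": lambda o: False},             # accepts anything
]
winner, _ = offer_resources({"cpus": 4, "mem": 8192}, frameworks,
                            respond=lambda fw, o: [("task", o)])
print(winner)  # hadoop
```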
• A master process manages the Mesos slave daemons running on each cluster node
• Frameworks run their tasks on the slaves
• Each framework has two components: a scheduler (registers with the master to receive resources) and an executor (runs the tasks)
Mesos API: callback and action functions for schedulers and executors
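The two API halves can be sketched as abstract interfaces; the method names approximate callbacks described in the Mesos paper (resource offer, status update, launch/kill task), but the signatures here are illustrative, not the real Mesos API.

```python
# Sketch of the two Mesos API halves as abstract base classes; method
# names approximate the paper's callbacks, signatures are illustrative.
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """Framework-side: registers with the master and reacts to offers."""

    @abstractmethod
    def resource_offer(self, offer_id, resources):
        """Called by the master; return the tasks to launch (may be [])."""

    @abstractmethod
    def status_update(self, task_id, status):
        """Called when a task finishes, fails, or is lost."""

class Executor(ABC):
    """Slave-side: launched by Mesos to run the framework's tasks."""

    @abstractmethod
    def launch_task(self, task):
        """Start the given task on this slave."""

    @abstractmethod
    def kill_task(self, task_id):
        """Stop a running task (e.g., on revocation)."""

class CountingScheduler(Scheduler):
    """Toy scheduler that accepts every offer with one task."""
    def __init__(self):
        self.launched = 0
    def resource_offer(self, offer_id, resources):
        self.launched += 1
        return [{"id": self.launched, "resources": resources}]
    def status_update(self, task_id, status):
        pass

s = CountingScheduler()
tasks = s.resource_offer("offer-1", {"cpus": 2})
print(len(tasks), s.launched)  # 1 1
```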
7. Mesos: Simulation and Evaluation Result
Evaluation
1) Macrobenchmark workloads (Facebook Hadoop mix, large Hadoop mix, Spark, Torque/MPI)
2) Overhead
3) Data Locality through Delay Scheduling
4) Iterative jobs using Spark
5) Mesos Scalability
6) Failure Recovery
7) Performance Isolation
Implementation
• ~10,000 lines of C++ code
• Runs on Linux, Solaris, and OS X
• Supports frameworks written in Java, C++, and Python
• ZooKeeper for leader election
• Linux Containers for CPU and memory isolation
• Tested frameworks: Hadoop, Torque, MPICH2, and Spark
Figures: resource utilization, Mesos scalability, and macrobenchmark speedup results.
8. Omega: Background Problem + Requirements and Solution Approach
Cluster Scheduler Problem
Many different goals (high resource utilization, rapid decisions, business constraints, etc.), yet the scheduler must be robust and always available
Clusters and workloads keep growing rapidly
Monolithic and two-level scheduling are unsatisfactory (new policies are difficult to add, and scheduling itself becomes difficult)
Complexity from hardware and workload heterogeneity
Design Issues for Cluster Schedulers
• Partitioning the scheduling work
• Choice of resources visible to each scheduler
• Interference (optimistic vs. pessimistic)
• Allocation granularity (policy flexibility)
• Cluster-wide behavior
9. Omega: Proposed Solutions
Key Features
1) Every scheduler is granted full access to the entire cluster (schedulers compete in a free-for-all manner)
2) Optimistic concurrency control mediates clashes when updating the cluster state
3) No central resource allocator; all decisions are made in the schedulers
4) Each scheduler holds a copy of the resource allocations (the cluster state is called a "cell")
5) Cell state is synchronized via transactions; a failed transaction is retried
6) Schedulers run in parallel and do not wait for other jobs (no inter-scheduler blocking)
7) Schedulers may apply different policies, and jobs carry a relative importance (called "precedence")
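Features 2, 4, and 5 above can be sketched as a simple optimistic transaction loop (a minimal sketch with my own naming, not Omega's implementation): a scheduler works against a private snapshot of the cell state, then commits its claims; a version check detects conflicting updates, and a failed commit is retried against fresh state.

```python
# Sketch of Omega's shared-state scheduling loop (simplified, my own
# naming): each scheduler works against a private copy of the cell
# state and commits its placement as an optimistic transaction; a
# version check detects conflicts, and a failed commit is retried.

class CellState:
    """Authoritative cluster state: machine -> free resource units."""
    def __init__(self, machines):
        self.free = dict(machines)
        self.version = 0

    def snapshot(self):
        return dict(self.free), self.version

    def commit(self, claims, seen_version):
        # Optimistic concurrency: reject if anyone committed since our copy.
        if seen_version != self.version:
            return False
        for machine, amount in claims.items():
            if self.free[machine] < amount:
                return False
        for machine, amount in claims.items():
            self.free[machine] -= amount
        self.version += 1
        return True

def schedule(cell, demand, max_retries=10):
    """Greedy first-fit against a local copy, then try to commit."""
    for _ in range(max_retries):
        local, version = cell.snapshot()
        claims = {}
        for machine, free in local.items():
            if free >= demand:
                claims[machine] = demand
                break
        if not claims:
            return None        # no machine can fit the task
        if cell.commit(claims, version):
            return claims      # transaction succeeded
        # else: another scheduler committed first; retry with fresh state
    return None

cell = CellState({"m1": 4, "m2": 8})
print(schedule(cell, demand=6))   # {'m2': 6}
print(cell.free["m2"])            # 2
```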
• Monolithic: used in HPC; a single instance applies the same algorithm to all jobs
• Two-level: used by Mesos and Hadoop-on-Demand; many different schedulers are controlled by a central resource allocator
• Shared state: used by Omega; avoids the two-level approach and its limited parallelism
Omega: a new parallel scheduler architecture built around shared state with lock-free optimistic concurrency control.
10. Omega: Simulation and Evaluation Result
Evaluation via Trace-Driven Simulation
1) Scheduling performance: how service-scheduler busyness varies with the number of jobs and tasks
2) Scaling the workload: time to schedule tasks when conflicts occur
3) Load-balancing the batch scheduler: longer decision times for large batch jobs
4) Dealing with conflicts, with two choices: coarse-grained conflict detection and all-or-nothing scheduling
5) MapReduce scheduler impact on utilization and job completion time
Figure: lightweight simulator results.
Simulators
1) Lightweight simulator: compares scheduler architectures under the same conditions with identical synthetic workloads
2) High-fidelity simulator: replays historical Google workload traces
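The two conflict-detection granularities mentioned in point 4 of the evaluation can be contrasted with a tiny sketch (my simplification, not the actual simulator code): coarse-grained detection treats any change to a machine since the scheduler's snapshot as a conflict, while fine-grained detection rejects a claim only when the requested resources are genuinely no longer free.

```python
# Sketch of two conflict-detection granularities (my simplification of
# the idea in Omega's evaluation): coarse-grained detection rejects a
# claim whenever the target machine changed at all since the snapshot,
# while fine-grained detection rejects only on an actual shortage.

def coarse_conflict(machine, snapshot_seqno):
    # Any change to the machine since our snapshot counts as a conflict.
    return machine["seqno"] != snapshot_seqno

def fine_conflict(machine, demand):
    # Only an actual shortage of free resources counts as a conflict.
    return machine["free"] < demand

# Another scheduler placed a small task on m, bumping its sequence
# number but leaving plenty of spare capacity.
m = {"seqno": 5, "free": 10}
snapshot_seqno = 4

print(coarse_conflict(m, snapshot_seqno))  # True  (forces a retry)
print(fine_conflict(m, demand=3))          # False (commit can proceed)
```

Fine-grained detection avoids needless retries at the cost of a more detailed resource check on every commit.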