CloudStack Scalability

By Alex Huang
Current Status
• 10k resources managed per management server
  node
• Scales out horizontally (must disable stats
  collector)
• Real production deployments of tens of thousands
  of resources
• Internal testing with software simulators up to
  30k physical resources with 30k VMs managed by
  4 management server nodes
• We believe we can at least double that scale per
  management server node
Balancing Incoming Requests
• Each management server has two worker thread pools for incoming
  requests: effectively two servers in one.
   – Executor threads provided by Tomcat
   – Job threads waiting on a job queue
• Incoming requests that require mostly DB operations are short in
  duration and are executed by the executor threads, since incoming
  requests are already load balanced by the load balancer
• Incoming requests that need resources, which are often long-running,
  are checked against the ACL by the executor threads and then queued
  and picked up by job threads (see the sketch below).
• The # of job threads is scaled to the # of DB connections available
  to the management server
• Requests may take a long time depending on the constraints of the
  resources, but they do not fail.
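A minimal sketch of this two-pool model, in Java since CloudStack is a
Java codebase. The names here (RequestDispatcher, Request, isDbOnly) are
illustrative assumptions, not the actual CloudStack API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: the real CloudStack classes differ.
    public class RequestDispatcher {
        // Job threads are sized to the DB connections available to this
        // management server, so the DB pool is never oversubscribed.
        private final ExecutorService jobThreads;

        public RequestDispatcher(int dbConnectionsAvailable) {
            jobThreads = Executors.newFixedThreadPool(dbConnectionsAvailable);
        }

        // Called on a Tomcat executor thread.
        public void handle(Request req) {
            if (req.isDbOnly()) {
                req.run();                    // short DB work: run in-line
            } else {
                checkAcl(req);                // ACL check up front...
                jobThreads.execute(req::run); // ...then queue for a job thread
            }
        }

        private void checkAcl(Request req) { /* hypothetical ACL check */ }

        public interface Request {
            boolean isDbOnly();
            void run();
        }
    }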
The Much Harder Problem
• CloudStack performs a number of tasks on behalf of
  the users, and those tasks increase with the number of
  virtual and physical resources available
   –   VM Sync
   –   SG Sync
   –   Hardware capacity monitoring
   –   Virtual resource usage statistics collection
   –   More to come
• When the counts are in the hundreds, this is no big deal.
• As numbers increase, this problem magnifies.
• How to scale this horizontally across management
  servers?
Comparison of Two Approaches
• Stats Collector – collects capacity statistics
   – Fires every five minutes to collect stats about host CPU and
     memory capacity
   – Smart server and dumb client model: the resource only
     collects info; the management server processes it
   – Runs the same way on every management server
• VM Sync
   – Fires every minute
   – Peer-to-peer model: the resource does a full sync on
     connection and delta syncs thereafter (sketched below).
     The management server trusts the resource for correct
     information.
   – Only runs against resources connected to the management
     server node
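A minimal sketch of the VM Sync model: full sync on connection, delta
pushes thereafter. The names (ResourceAgent, VmStateListener) are
assumptions for illustration; the real CloudStack agent framework
differs. The lastReported map is the "journal of changes" a later slide
refers to:

    import java.util.HashMap;
    import java.util.Map;

    public class ResourceAgent {
        // What the management server has already been told, keyed by VM
        // name. Keeping this journal is the memory cost noted later.
        private final Map<String, String> lastReported = new HashMap<>();
        private final VmStateListener managementServer;

        public ResourceAgent(VmStateListener managementServer) {
            this.managementServer = managementServer;
        }

        // On (re)connection: push the full picture once.
        public void onConnect(Map<String, String> currentVmStates) {
            lastReported.clear();
            lastReported.putAll(currentVmStates);
            managementServer.fullSync(currentVmStates);
        }

        // Thereafter: push only out-of-band changes the server has not seen.
        public void onHostPoll(Map<String, String> currentVmStates) {
            Map<String, String> delta = new HashMap<>();
            for (Map.Entry<String, String> e : currentVmStates.entrySet()) {
                if (!e.getValue().equals(lastReported.get(e.getKey()))) {
                    delta.put(e.getKey(), e.getValue());
                }
            }
            if (!delta.isEmpty()) {
                lastReported.putAll(delta);
                managementServer.deltaSync(delta);
            }
        }

        public interface VmStateListener {
            void fullSync(Map<String, String> states);
            void deltaSync(Map<String, String> changes);
        }
    }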
Numbers
•   Assume 10k hosts and 500k VMs (50 VMs per host)
•   Stats Collector
     – Fires off 10k requests every 5 minutes, or about 33 requests a second.
     – Bad but not too bad: occupies 33 threads every second.
     – But just wait:
          •   2 management servers: 66 requests
          •   3 management servers: 99 requests
     – It gets worse as the # of management servers increases, because the
       collector does not auto-balance across management servers
     – And it gets worse still: the 10k hosts are now spread across 3
       management servers, so while 99 requests are generated, the number of
       threads involved is roughly three-fold because each request must be
       routed to the management server that owns the host (see the arithmetic
       sketch after this slide)
     – It keeps the management servers about 20% busy even with no load from
       incoming requests
•   VM Sync
     – Fires off 1 request at resource connection to sync about 50 VMs
     – Then the resource pushes: it knows what it has pushed before and only
       pushes changes that happen out-of-band.
     – So essentially no threads occupied for a much larger data set.
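The arithmetic above as a runnable back-of-the-envelope check, using the
numbers from this slide:

    public class StatsCollectorLoad {
        public static void main(String[] args) {
            int hosts = 10_000;
            int intervalSeconds = 5 * 60;            // fires every 5 minutes
            int perServer = hosts / intervalSeconds; // ~33 requests/s
            // Every management server polls all 10k hosts, so adding
            // servers multiplies the request rate instead of dividing it.
            for (int servers = 1; servers <= 3; servers++) {
                System.out.println(servers + " management server(s): "
                        + servers * perServer + " requests/s");
            }
        }
    }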
What’s the Downside?
• Resources must reconcile the VM states caused by
  management server commands with the VM states they
  collect from the physical hardware, so more CPU is
  required
• Resources must use more memory to keep
  track of what amounts to a journal of changes
  since the last sync point.
• But data centers are full of these two resources
  (CPU and memory).
Resource Load Balancing
• As a management server is added to the cluster, resources are rebalanced
  seamlessly (sketched below):
    –   MS2 signals to MS1 to hand over a resource
    –   MS1 waits for in-flight commands on the resource to finish
    –   MS1 holds further commands in a queue
    –   MS1 signals to MS2 to take over
    –   MS2 connects
    –   MS2 signals to MS1 to complete the transfer
    –   MS1 discards the resource and flows the held commands to MS2
• Listeners are provided so business logic can watch connection status and
  adjust work based on who is connected.
• By working only on resources connected to the management server the
  process is on, work is auto-balanced between management servers.
• This also reduces message routing between the management servers.
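A minimal sketch of the handover sequence from MS1's point of view. The
names (ResourceOwner, ClusterPeer, Command) are hypothetical, not the
actual CloudStack cluster-manager API:

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class ResourceOwner {
        private final Queue<Command> held = new ArrayDeque<>();
        private volatile boolean transferring = false;

        // MS2 signaled: "hand over this resource".
        public void onHandoverRequested(ClusterPeer ms2, String resourceId) {
            transferring = true;                 // new commands are held, not sent
            waitForInFlightCommands(resourceId);
            ms2.takeOver(resourceId);            // signal MS2 to connect
        }

        // MS2 connected and signaled that the transfer is complete.
        public void onTransferComplete(ClusterPeer ms2, String resourceId) {
            disconnect(resourceId);              // MS1 discards the resource
            Command c;
            while ((c = held.poll()) != null) {
                ms2.forward(resourceId, c);      // flow held commands to MS2
            }
            transferring = false;
        }

        // Commands arriving mid-transfer are queued rather than rejected,
        // which is why requests may take a long time but do not fail.
        public void submit(String resourceId, Command c) {
            if (transferring) {
                held.add(c);
            } else {
                send(resourceId, c);
            }
        }

        private void waitForInFlightCommands(String resourceId) { /* ... */ }
        private void disconnect(String resourceId) { /* ... */ }
        private void send(String resourceId, Command c) { /* ... */ }

        public interface Command {}
        public interface ClusterPeer {
            void takeOver(String resourceId);
            void forward(String resourceId, Command c);
        }
    }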
Designing for Scalability
• Take advantage of the most abundant resources in a data center
  (CPU, RAM)
• Auto-scale to the least abundant resource (DB)
• Do not hold DB connections/Transactions across resource calls.
   – Use a lock table implementation (Merovingian2 or the
     GenericDao.acquireLockInTable() call) instead of database row locks
     in this situation.
   – Database row locks are still fine for quick, short lockouts.
• Balance resource-intensive tasks as the # of management server
  nodes increases and decreases
   – Use job queues to balance long-running processes across management
     servers
   – Make use of resource rebalancing in CloudStack to auto-balance your
     workload (see the sketch below).
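A sketch of the lock-table pattern recommended above. The DAO method
names follow the slide's GenericDao.acquireLockInTable() mention; the
exact signatures in the CloudStack source may differ, so the interface
below is an assumption:

    // Hold a lock-table lock, not a DB row lock or an open transaction,
    // across a long-running resource call.
    public class VmStartHandler {
        private final GenericDao<VmVO, Long> vmDao;

        public VmStartHandler(GenericDao<VmVO, Long> vmDao) {
            this.vmDao = vmDao;
        }

        public void startVm(long vmId) {
            VmVO vm = vmDao.acquireLockInTable(vmId);
            if (vm == null) {
                throw new IllegalStateException("VM " + vmId
                        + " is locked by another job");
            }
            try {
                // Long-running resource call: no DB connection pinned,
                // no transaction held open while it runs.
                sendStartCommandToHost(vm);
            } finally {
                vmDao.releaseFromLockTable(vmId);
            }
        }

        private void sendStartCommandToHost(VmVO vm) { /* resource call */ }

        // Assumed shape of the DAO; not the verbatim CloudStack interface.
        public interface GenericDao<T, ID> {
            T acquireLockInTable(ID id);
            boolean releaseFromLockTable(ID id);
        }
        public static class VmVO {}
    }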
Reliability

By Alex Huang
The Five W’s of Unreliability
• What is unreliable? Everything
• Who is unreliable? Developers & administrators
• When does unreliability happen? 3:04 a.m. no
  matter which time zone… Any time.
• Where does unreliability happen? In carefully
  planned, everything-has-been-considered data
  centers.
• How does unreliability happen? Rather
  nonchalantly
Dealing with Unreliability
•   Don’t assume!
•   Don’t bang your head against the wall!
•   Know when you don’t know any better.
•   Ask for help!
Designs against Unreliability
• Management servers keep a heartbeat with the DB: one
  write a minute.
• A management server self-fences if it cannot write the
  heartbeat (sketched below)
• Other management servers wait to make sure the down
  management server is no longer writing the heartbeat,
  then signal interested software to recover
• Checkpoints at every call to a resource, and code to
  recover from those checkpoints
• Database records are never actually deleted, to help with
  manual recovery when needed
• Write code that is idempotent
• Respect modularity when writing your code
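A minimal sketch of the heartbeat and self-fencing behavior, assuming a
hypothetical heartbeat table (the real CloudStack schema and cluster
code differ):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Scheduled once a minute on each management server.
    public class HeartbeatTask implements Runnable {
        private final Connection db;
        private final long nodeId;

        public HeartbeatTask(Connection db, long nodeId) {
            this.db = db;
            this.nodeId = nodeId;
        }

        @Override
        public void run() {
            try (PreparedStatement ps = db.prepareStatement(
                    "UPDATE heartbeat SET last_update = NOW() WHERE node_id = ?")) {
                ps.setLong(1, nodeId);
                if (ps.executeUpdate() != 1) {
                    selfFence();
                }
            } catch (SQLException e) {
                // Cannot write the heartbeat: assume this node is the
                // unhealthy one and stop before doing more damage.
                selfFence();
            }
        }

        private void selfFence() {
            System.err.println("Heartbeat write failed; self-fencing.");
            Runtime.getRuntime().halt(1);   // hard stop, no cleanup
        }
    }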
