Current Status• 10k resources managed per management server node• Scales out horizontally (must disable stats collector)• Real production deployment of tens of thousands of resources• Internal testing with software simulators up to 30k physical resources with 30k VMs managed by 4 management server nodes• We believe we can at least double that scale per management server node
Balancing Incoming Requests• Each management server has two worker thread pools for incoming requests: effectively two servers in one. – Executor threads provided by tomcat – Job threads waiting on job queue• All incoming requests that requires mostly DB operations are short in duration and are executed by executor threads because incoming requests are already load balanced by the load balancer• All incoming requests needing resources, which often have long running durations, are checked against ACL by the executor threads and then queued and picked up by job threads.• # of job threads are scaled to the # of DB connections available to the management server• Requests may take a long time depending on the constraint of the resources but they don’t fail.
The Much Harder Problem• CloudStack performs a number of tasks on behalf of the users and those tasks increases with the number of virtual and physical resources available – VM Sync – SG Sync – Hardware capacity monitoring – Virtual resource usage statistics collection – More to come• When done in number of hundreds, no big deal.• As numbers increase, this problem magnifies.• How to scale this horizontally across management servers?
Comparison of two Approaches• Stats Collector – collects capacity statistics – Fires every five minutes to collect stats about host CPU and memory capacity – Smart server and dumb client model: Resource only collects info and management server processes – Runs the same way on every management server• VM Sync – Fires every minute – Peer to peer model: Resource does a full sync on connection and delta syncs thereafter. Management server trusts on resource for correct information. – Only runs against resources connected to the management server node
Numbers• Assume 10k hosts and 500k VMs (50 VMs per host)• Stats Collector – Fires off 10k requests every 5 minutes or 33 requests a second. – Bad but not too bad: Occupies 33 threads every second. – But just wait: • 2 management servers: 66 requests • 3 management servers: 99 requests – It gets worse as # of management servers increase because it did not auto-balance across management servers – Oh but it gets worse still: Because the 10k hosts is now spread across 3 management servers. While it’s 99 requests generated, the number of threads involved is three-fold because requests need to be routed to the right management server. – It keeps the management server at 20% busy even at no load from incoming requests• VM Sync – Fires off 1 request at resource connection to sync about 50 VMs – Then, push from resource as resource knows what it has pushed before and only pushes changes that are out-of-band. – So essentially no threads occupied for a much larger data set.
What’s the Down Side?• Resources must reconcile between VM states caused by management server commands and VM states it collects from the physical hardware so it requires more CPU• Resources must use more memory to keep track of what amounts to a journal of changes since the last sync point.• But data centers are full of these two resources.
Resource Load Balancing• As management server is added into the cluster, resources are rebalanced seamlessly. – MS2 signals to MS1 to hand over a resource – MS1 wait for the commands on the resources to finish – MS1 holds further commands in a queue – MS1 signals to MS2 to take over – MS2 connects – MS2 signals to MS1 to complete transfer – MS1 discards its resource and flows the commands being held to MS2• Listeners are provided to business logic to listen on connection status and adjusts work based on who’s connected.• By only working on resources that are connected to the management server the process is on, work is auto-balanced between management servers.• Also reduces the message routing between the management servers.
Designing for Scalability• Take advantage of the most abundant resources in a data center (CPU, RAM)• Auto-scale to the least abundant resource (DB)• Do not hold DB connections/Transactions across resource calls. – Use lock table implementation (Merovingian2 or GenericDao.acquireLockInTable() call) over database row locks in this situation. – Database row locks are still fine quick short lock outs.• Balance the resource intensive tasks as # of management server nodes increases and decreases – Use job queues to balance long running processes across management servers – Make use of resource rebalancing in CloudStack to auto-balance your world load.
The Five W’s of Unreliability• What is unreliable? Everything• Who is unreliable? Developers & administrators• When does unreliability happen? 3:04 a.m. no matter which time zone… Any time.• Where does unreliability happen? In carefully planned, everything has been considered data centers.• How does unreliability happen? Rather nonchalantly
Dealing with Unreliability• Don’t assume!• Don’t bang your head against the wall!• Know when you don’t know any better.• Ask for help!
Designs against Unreliability• Management Servers keeps an heartbeat with the DB. One ping a minute.• Management Servers self-fences if it cannot write the heartbeat• Other management servers wait to make sure the down management server is no longer writing to the heartbeat and then signal interested software to recover• Check points at every call to a resource and code to deal with recovering from those check points• Database records are not actually deleted to help with manual recovery when needed• Write code that is idempotent• Respect modularity when writing your code