5  scalability Cloudstack Developer Day

By Alex Huang
Architect, Cloud Platforms Group, Citrix Systems Inc.

Published in Technology
Transcript

  • 1. CloudStack Scalability: Testing, Development, Results, and Futures
  • 2. Apache CloudStack: a project in incubation • Secure, multi-tenant cloud orchestration platform – Turnkey platform for delivering IaaS clouds – Hypervisor agnostic – Highly scalable, secure and open – Complete Self-service portal – Open source, open standards – Deploys on premise
  • 3. Management Server Cluster: manage hosts, create VMs, virtual disks, virtual networks, meter usage, … [Architecture diagram; labels: Admin, Internet, Router, Load Balancer, Primary MySQL, Backup MySQL, L3 Core Switch, Top of Rack Switch, Object Storage Servers, Availability Zone 1, Pod 1, Pod 2, Pod 3, … Pod N]
  • 4. Thinking about cloud orchestration at scale • Host management • Capacity management • What host to use to deploy a new VM • Failure handling • Security group propagation • Set a goal
  • 5. We can’t afford this as our QA lab
  • 6. Simulator enables scale testing [Diagram; labels: User API, Admin API, Load Balancer, Mgmt. Server (×4), MySQL (×2), Zone, Simulator]
  • 7. Environment • Mgmt. Server: 2 cores (4 with Hyper-Threading), 2.2 GHz Xeon, 16 GB RAM, 12 GB JVM heap • MySQL: single spinning disk, later single SSD; 32 GB RAM; MySQL 5.5 [Same diagram as slide 6]
  • 8. Allocator performance is awful with 1000 hosts • Two minutes to decide which host to use for a new VM! • Computing capacity for every pod repeatedly • Fixed that, but still 12 seconds to decide • Use host tags, down to 2 seconds • Major changes required to improve further • In 2.2.0, store capacity info in DB, skip pod altogether • Harness the power of SQL select and all is well
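The last two bullets of slide 8 (store capacity in the database, let a SQL select choose the host) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `host_capacity` table, its columns, and the `pick_host` function are hypothetical, and SQLite stands in for MySQL. It is not CloudStack's actual schema or allocator.

```python
import sqlite3

def pick_host(conn, cpu_needed, ram_needed, tag=None):
    """Return the id of one host with enough free capacity, or None.

    The database does the filtering and ranking in a single query,
    instead of the caller recomputing capacity pod by pod.
    """
    query = (
        "SELECT host_id FROM host_capacity "
        "WHERE cpu_total - cpu_used >= ? AND ram_total - ram_used >= ? "
    )
    params = [cpu_needed, ram_needed]
    if tag is not None:  # optional host tag narrows the candidate set
        query += "AND tag = ? "
        params.append(tag)
    # Prefer the host with the most free RAM; host_id breaks ties.
    query += "ORDER BY (ram_total - ram_used) DESC, host_id LIMIT 1"
    row = conn.execute(query, params).fetchone()
    return row[0] if row else None

def demo_connection():
    """Build an in-memory database with three hypothetical hosts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host_capacity ("
        "host_id INTEGER PRIMARY KEY, tag TEXT, "
        "cpu_total INTEGER, cpu_used INTEGER, "
        "ram_total INTEGER, ram_used INTEGER)"
    )
    conn.executemany(
        "INSERT INTO host_capacity VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "ssd", 8000, 7500, 16384, 16000),  # nearly full
            (2, "ssd", 8000, 1000, 16384, 4096),
            (3, "hdd", 8000, 0, 16384, 0),         # empty
        ],
    )
    return conn
```

Calling `pick_host(demo_connection(), 1000, 2048, tag="ssd")` selects host 2 in this toy data set; without the tag, the emptier host 3 wins.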
  • 9. Polling doesn’t scale: TRUE? FALSE? Sometimes, it is good enough
  • 10. Host management • Check host state via TCP connection • Check every minute • 30,000 checks per minute, 500 per second • But they take 10 seconds, so 5000 in parallel • Not using async I/O so 5000 threads required… • Single JVM can support 2000+ threads so this is concerning but may not be the limiting factor
  • 11. Host management • What is the maximum feasible JVM heap size? • Some people use heaps with hundreds of GB • Commercial tools can help, but cost • We decided to stay below 20 GB (GC concerns) • How much CPU is required for background processing?
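The figures on slide 10 follow from simple rate arithmetic: checks per second is hosts divided by the polling interval, and the number of checks in flight is that rate times the duration of one check. A small sketch, where the function name `polling_load` is made up for illustration:

```python
def polling_load(hosts, interval_s, check_duration_s):
    """Return (checks started per second, checks in flight at once)."""
    checks_per_second = hosts / interval_s
    # Each check occupies a slot (here: a thread) for its whole duration.
    concurrent_checks = checks_per_second * check_duration_s
    return checks_per_second, concurrent_checks

rate, in_flight = polling_load(hosts=30_000, interval_s=60, check_duration_s=10)
print(rate, in_flight)  # 500.0 5000.0
```

With blocking I/O each in-flight check needs its own thread, which is where the slide's 5000-thread figure comes from.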
  • 12. [Chart] CPU utilization while deploying 30,000 VMs on 30,000 hosts. Y-axis: CPU utilization (400% is maximum). X-axis: time. Annotations: 20,000; 5000; 5000; Idle
  • 13. [Chart] Deploy time from 25,000 to 30,000 VMs. Y-axis: seconds to deploy. X-axis: VM number (25,000 plus X)
  • 14. Problem: agent load balancing • Management servers start/stop/fail/crash • How do newly started Management Servers get agents / work? • When a Management Server exits, how do others pick up its load? • When new hosts are added how is the load distributed? [Diagram: Mgmt Server 1, Mgmt Server 2]
  • 15. Common use case timings at scale • 30,000 hosts and 4 Management Servers • 4 Management Servers running, 1 fails: 10 minutes to redistribute 7500 agents • 3 Management Servers running, add a fourth: 40 minutes to redistribute load evenly [IMPORTANT] • 0 Management Servers running, start all 4 simultaneously: 16 minutes to connect to all 30,000 hosts
  • 16. Understanding security groups [Diagram: Web VMs in a Web Security Group, DB VMs in a DB Security Group] Ingress Rule: Allow VMs in Web Security Group access to VMs in DB Security Group on Port 3306
  • 17. L3 isolation with distributed firewalls [Diagram: Internet, Public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36, 65.37.141.80; Load Balancer; L3 Core Switch; Pod 1 L2 Switch (10.1.0.1) with Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3), Tenant 1 VM 2 (10.1.0.4); Pod 2 L2 Switch (10.1.8.1); Pod 3 L2 Switch (10.1.16.1)]
  • 18. L3 isolation with distributed firewalls [Same diagram as slide 17; Pod 3 adds Tenant 1 VM 3 (10.1.16.47) and Tenant 1 VM 4 (10.1.16.85)]
  • 19. L3 isolation with distributed firewalls [Same diagram as slide 18; Pod 3 adds Tenant 2 VM 2 (10.1.16.12) and Tenant 2 VM 3 (10.1.16.21)]
  • 20. One firewall per Virtual Machine
  • 21. One million firewalls? [Diagram: large grid of VMs]
  • 22. Orchestrating hundreds of thousands of firewalls. Well-known software scaling techniques: • Message queues • Consistency tradeoffs • Idempotent configuration & retries. CloudStack uses: • Special purpose queues • Optimized for large security groups • Eventual consistency for rule updates
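The techniques named on slide 22 can be sketched in a few lines. This is not CloudStack's implementation; the class names `CoalescingQueue` and `Firewall`, and the use of a version number per rule set, are assumptions made for illustration. The queue keeps only the newest pending rule set per VM, and applying a rule set is idempotent, so retries and duplicate deliveries are harmless and every firewall eventually converges on the latest version.

```python
from collections import OrderedDict

class CoalescingQueue:
    """Holds at most one pending update per VM; newer versions replace older."""

    def __init__(self):
        self._pending = OrderedDict()  # vm_id -> (version, frozenset of rules)

    def submit(self, vm_id, version, rules):
        current = self._pending.get(vm_id)
        if current is None or version > current[0]:
            self._pending[vm_id] = (version, frozenset(rules))

    def pop(self):
        """Return (vm_id, (version, rules)) for the oldest entry, or None."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)

class Firewall:
    """Per-VM firewall state that accepts only newer rule sets."""

    def __init__(self):
        self.version = 0
        self.rules = frozenset()
        self.writes = 0  # counts real reconfigurations

    def apply(self, version, rules):
        """Idempotent: stale or repeated deliveries change nothing."""
        if version <= self.version:
            return False
        self.version, self.rules = version, rules
        self.writes += 1
        return True

queue = CoalescingQueue()
queue.submit("vm-1", 1, {"10.1.16.31"})
queue.submit("vm-1", 2, {"10.1.16.31", "10.1.189.5"})  # supersedes version 1
vm_id, (version, rules) = queue.pop()
firewall = Firewall()
firewall.apply(version, rules)   # applied
firewall.apply(version, rules)   # retry of the same message: ignored
```

Coalescing matters for large security groups: a burst of membership changes collapses into one reconfiguration per VM rather than one per change.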
  • 23. Problem: firewall rules explosion in dom0
    Allow Security Group {Web} on TCP port 3060
      -A FORWARD -m tcp -p tcp --dport 3060 --src 10.1.16.31 -j ACCEPT
      -A FORWARD -m tcp -p tcp --dport 3060 --src 10.1.45.112 -j ACCEPT
      -A FORWARD -m tcp -p tcp --dport 3060 --src 10.1.189.5 -j ACCEPT
      …
      -A FORWARD -m tcp -p tcp --dport 3060 --src 10.21.9.77 -j ACCEPT
    Performance suffers for large security groups
  • 24. Problem: firewall rules explosion in dom0
    Fix with ipsets:
      ipset -N web_sg iptreemap
      ipset -A web_sg 10.1.16.31
      ipset -A web_sg 10.1.16.112
      ipset -A web_sg 10.1.189.5
      ipset -A web_sg 10.21.9.77
      …
      -A FORWARD -p tcp -m tcp --dport 3060 -m set --match-set web_sg src -j ACCEPT
    See also http://daemonkeeper.net/781/mass-blocking-ip-addresses-with-ipset/
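To make the contrast between slides 23 and 24 concrete, here is a small sketch that builds the command strings for a security group in the form shown on slide 24. The helper name `ipset_commands` is hypothetical, it only formats text and runs nothing, and the `iptreemap` set type is taken from the slide; newer ipset releases may use different set type names, so check your version.

```python
def ipset_commands(set_name, addresses, port, set_type="iptreemap"):
    """Return (ipset command lines, single iptables rule) for one group.

    Only builds strings in the syntax shown on the slide; nothing is executed.
    """
    commands = [f"ipset -N {set_name} {set_type}"]
    commands += [f"ipset -A {set_name} {addr}" for addr in addresses]
    rule = (
        f"-A FORWARD -p tcp -m tcp --dport {port} "
        f"-m set --match-set {set_name} src -j ACCEPT"
    )
    return commands, rule

members = ["10.1.16.31", "10.1.189.5", "10.21.9.77"]
commands, rule = ipset_commands("web_sg", members, 3060)
for line in commands:
    print(line)
print(rule)
```

However many members the group has, the FORWARD chain gains one rule instead of one rule per source address; membership changes become set updates rather than chain edits.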
  • 25. [Chart] Security group propagation time. Y-axis: seconds to fully synced. X-axis: number of VMs in security group
  • 26. Problem: database connection management • Scale testing resulted in several “too many open connections” errors from MySQL • Common problem: holding open connections while doing long-running operations • Took some code clean up and refactoring • No longer an issue • MySQL supports 10,000 connections • CloudStack is far below that
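The "common problem" on slide 26 is easiest to see in code. The sketch below is a toy, not CloudStack's code: `CountingPool` is a hypothetical stand-in for a connection pool that only counts how many connections are open, and `deploy_vm` shows the general pattern of releasing the connection before a long-running step and taking a new one afterwards.

```python
import contextlib

class CountingPool:
    """Toy stand-in for a connection pool; it only tracks open connections."""

    def __init__(self):
        self.open_now = 0
        self.peak = 0

    @contextlib.contextmanager
    def connection(self):
        self.open_now += 1
        self.peak = max(self.peak, self.open_now)
        try:
            yield self
        finally:
            self.open_now -= 1  # always returned, even on error

def deploy_vm(pool, long_running_step):
    """Hold a connection only for the short read and the short write."""
    with pool.connection():
        plan = "deployment-plan"          # placeholder for a quick query
    result = long_running_step(plan)      # no connection held during this call
    with pool.connection():
        pass                              # placeholder for a quick status update
    return result

pool = CountingPool()
observed = []
deploy_vm(pool, lambda plan: observed.append(pool.open_now) or "done")
print(observed, pool.peak)  # [0] 1
```

If the long-running step ran inside the first `with` block instead, every in-progress deployment would pin a connection for its full duration, which is the pattern that exhausts the database's connection limit under load.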
  • 27. [Chart] DB connections per MS while deploying 30,000 VMs. Y-axis: number of DB connections. X-axis: time. Annotations: 5,000; 5,000; 20,000
  • 28. Other considerations (beyond control plane) • Network design and devices • Object store scalability • Per-host and cluster scalability • Storage • Understand your workload
  • 29. Future work • Improve simulator accuracy • Publish results of advanced network (VLAN) testing • Verify assumption of VM density not impacting scale
  • 30. More information and joining the project
    Project web site: http://incubator.apache.org/projects/cloudstack.html
    Mailing lists: cloudstack-dev-subscribe@incubator.apache.org, cloudstack-users-subscribe@incubator.apache.org
    Scalability study: http://wiki.cloudstack.org/pages/viewpage.action?pageId=14320020
  • 31. Q&A