
Capacity Management/Provisioning (Cloud's full, Can't build here)

As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we will detail a myriad of challenges around adding new compute capacity. These include: planning, automation, organizational, quality assurance, monitoring, security, networking, integration, and more.


  1. Capacity Management and Provisioning (Cloud's full, can't build here) • Matt Van Winkle, Manager, Cloud Engineering @mvanwink • Andy Hill, Systems Engineer @andyhky • Joel Preas, Systems Engineer @joelintheory
  2. Public Cloud Capacity at Rackspace • Rackspace Public Cloud has deployed 100+ cells in ~2 years • New cells used to require hands-on engineer assembly: 3-5 weeks after bare OS install • One year later, builds are done by on-shift operators in ~1 week (as little as 1 day) • Usually constrained by networking
  3. Control Plane Sizing • Data plane operations impact both the cell and top-level control planes – image downloads/uploads • How large should the Nova DB be? – the breaking point of a 'standard' cell control plane buildout, particularly the database
  4. Cell Sizing Considerations • Efficient use of private IP address space – used for connections to services like Swift and dedicated environments • Broadcast domains • Aim for a minimal control plane to limit overhead/complexity
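The private-IP constraint above can be made concrete with a back-of-the-envelope calculation. A minimal sketch, assuming an illustrative /20 block and per-hypervisor address counts; none of these numbers come from the talk:

```python
import ipaddress

def max_hypervisors(cidr: str, ips_per_hypervisor: int, control_plane_ips: int) -> int:
    """Rough count of hypervisors that fit in a private subnet.

    All parameters are illustrative assumptions, not Rackspace's real figures.
    """
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 2  # drop network and broadcast addresses
    return (usable - control_plane_ips) // ips_per_hypervisor

# e.g. a /20 private block, 4 IPs per hypervisor, 200 reserved for control plane
print(max_hypervisors("10.208.0.0/20", 4, 200))
```

A calculation like this bounds cell size before broadcast-domain limits even enter the picture.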
  5. Hypervisor Sizing Considerations • Enough spare drive space for COW images – a XenServer VHD can easily reach 2x the space given to the guest during normal operation! – errors in cleaning up "snapshots" are exacerbated by tight disk overhead constraints • Drive space for pre-cached images – cache_images=some # nova – use_cow_images=True # nova – cache_in_nova=True # glance
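The 2x VHD growth noted above implies a simple worst-case disk budget per hypervisor. A sketch with made-up guest counts and sizes; `hypervisor_disk_needed` is a hypothetical helper, not a Nova function:

```python
def hypervisor_disk_needed(guests: int, guest_disk_gb: int,
                           cached_images_gb: int, reserve_gb: int = 50) -> int:
    """Estimate local disk a hypervisor needs, assuming worst-case COW growth.

    Assumes each XenServer VHD can grow to ~2x the disk handed to the guest
    (per the slide); the other numbers are illustrative.
    """
    cow_worst_case = guests * guest_disk_gb * 2  # COW images at 2x guest disk
    return cow_worst_case + cached_images_gb + reserve_gb

# 40 guests with 20 GB disks, plus 100 GB of pre-cached base images
print(hypervisor_disk_needed(40, 20, 100))
```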
  6. Other Sizing Notes • Need reserve space for emergencies (host evacuation) • Reserve space is cell-bound, because instances cannot move between cells – https://review.openstack.org/#/c/125607/ – cells.host_reserve_percent • VM overhead – https://wiki.openstack.org/wiki/XenServer/Overhead – https://review.openstack.org/#/c/60087/
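The effect of a percentage-based reserve like cells.host_reserve_percent can be sketched in a few lines. This helper is only an illustration of the idea, not the actual option's implementation:

```python
def schedulable_slots(total_slots: int, host_reserve_percent: float) -> int:
    """Slots the scheduler may hand out once an emergency reserve is held back.

    Mirrors the idea behind cells.host_reserve_percent referenced above;
    the 10% figure in the example is an assumption, not Rackspace's setting.
    """
    reserved = int(total_slots * host_reserve_percent / 100)
    return total_slots - reserved

# 1000 instance slots in a cell with a 10% host-evacuation reserve
print(schedulable_slots(1000, 10))
```

Because the reserve is cell-bound, each cell must carry its own headroom; reserve in one cell cannot absorb an evacuation in another.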
  7. Problems • Load Balancers • Glance and Swift • Fraud / Non-Payment • Routes • Road Testing
  8. Load Balancers • Alternate routes needed for high-bandwidth operations – generally Glance • The load balancer can become a bottleneck • Database queries returning lots of rows (cell sizing)
  9. Swift and Glance Bandwidth Problems: • Creates a single bottleneck • Imaging speeds are monitored; exceeding thresholds triggers investigation / scale-out • Cache is not shared between glance-api nodes
  10. Swift and Glance Bandwidth Monitoring / Solutions: • Get downloads out of the control plane's path (compute direct to image store) • Cache base images – pre-seed when possible – images can be cached on the hypervisor ahead of time for fast cloning: https://wiki.openstack.org/wiki/FastCloningForXenServer • Glance and Swift sharing request IDs would be nice • A shared cache might raise the hit rate and save bandwidth • What about when scaling out doesn't work? Re-architect.
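The bandwidth saving from a higher cache hit rate is easy to estimate. A sketch with illustrative request counts and image sizes (the function and its numbers are assumptions, not measurements from the talk):

```python
def origin_bandwidth_gb(requests: int, image_size_gb: float, hit_rate: float) -> float:
    """Bandwidth pulled from the image store once a Glance cache absorbs hits.

    Per the slide, a cache shared between glance-api nodes should raise
    hit_rate and shrink this figure; all inputs here are illustrative.
    """
    misses = requests * (1 - hit_rate)  # only cache misses reach Swift
    return misses * image_size_gb

# 10,000 downloads of a 2 GB base image at a 60% cache hit rate
print(origin_bandwidth_gb(10_000, 2, 0.60))
```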
  11. Fraud and Non-Payment Fraud • Mark the instance as suspended • It still takes capacity • What to do? • Account Actioneer Non-Payment • Similar to fraud, but worse for capacity! • Try to give the customer as much time as possible to return to the fold • Same overall strategy as fraud, but instances are kept longer
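The suspend-then-reclaim policy above can be sketched as a grace-period check. The grace periods and the `next_action` helper are hypothetical; the real Account Actioneer's policy is not described in the talk beyond "non-payment instances are kept longer":

```python
from datetime import datetime, timedelta

# Hypothetical grace periods; only the ordering (non-payment > fraud) is from the slide.
GRACE = {"fraud": timedelta(days=3), "non_payment": timedelta(days=14)}

def next_action(reason: str, suspended_at: datetime, now: datetime) -> str:
    """Decide whether a suspended instance can finally be reclaimed.

    Suspended instances still consume capacity, so once the grace period
    expires the instance is deleted and its slot returned to the pool.
    """
    if now - suspended_at >= GRACE[reason]:
        return "delete"          # reclaim capacity
    return "keep_suspended"      # give the customer time to return
```

Usage: `next_action("non_payment", datetime(2014, 11, 1), datetime(2014, 11, 10))` keeps the instance suspended, since nine days is inside the assumed fourteen-day window.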
  12. Road Testing Nodes Before Enabling • New cell – bypass URLs (cell-specific API nodes) • Different nova.conf not using cells – compute_api_class=nova.compute.api.API # before • Cell tenant restrictions • Existing cell / re-kick – not as easy :( – How to ensure customer builds don't land on a box that isn't road tested?
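The re-kick problem above amounts to gating scheduling on a burn-in pass. A toy sketch of that gate (this is not a real Nova scheduler filter class, and the host names are invented):

```python
# Hosts that have passed burn-in / road testing (illustrative names)
ROAD_TESTED = {"compute-001", "compute-002"}

def host_passes(host: str, road_tested: set) -> bool:
    """Illustrative scheduler-style check: only road-tested hosts accept
    customer builds, so a re-kicked box in an existing cell stays idle
    until it passes testing."""
    return host in road_tested

# Filter candidate hosts down to the road-tested ones
print([h for h in ["compute-001", "compute-003"] if host_passes(h, ROAD_TESTED)])
```

In a real deployment this state would live somewhere the scheduler can see it, which is exactly the gap the "disabled host" upstream work on a later slide addresses.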
  13. Managing the Capacity Management • Supply chain / resource pipeline • Impact from product development • Gaps/challenges from upstream
  14. Capacity Pipeline • Large customer requests • Triggers – % used – # of largest slots per flavor • IPv4 addresses – cells and the scheduler are unaware of them :( – Auditor + Resolver • Control plane (runs on OpenStack too)
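The two triggers above (overall % used and remaining slots for the largest flavor) combine into a simple alert condition. A sketch with assumed thresholds; `needs_capacity` and its defaults are illustrative, not the real pipeline's logic:

```python
def needs_capacity(free_slots_by_flavor: dict, used_fraction: float,
                   min_largest_slots: int = 10, max_used: float = 0.80) -> bool:
    """Fire a capacity-pipeline trigger on either of the slide's two signals.

    free_slots_by_flavor maps flavor size (e.g. RAM in MB) to free slots;
    the 10-slot and 80% thresholds are assumptions.
    """
    largest_flavor = max(free_slots_by_flavor)  # biggest flavor key
    return (used_fraction > max_used or
            free_slots_by_flavor[largest_flavor] < min_largest_slots)

# A 30 GB flavor nearly exhausted even though the cell is only 70% used
print(needs_capacity({512: 4000, 30720: 6}, 0.70))
```

Tracking the largest flavor separately matters because fragmentation can exhaust big slots long before the raw utilization number looks alarming.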
  15. Product Implications • Keep up with code deploys (hotpatches) • Adjusting provisioning playbooks for: – new flavor types – new configurations/applications (quantum->neutron, nova-conductor) – control plane changes (10G Glance) – new hardware manufacturers (OCP) • Non-production environments
  16. Upstream Challenges • Disabled flag for cells – Blueprint: http://bit.do/CellDisableBP – Bug: http://bit.do/CellDisableBug • Build to a "disabled" host – testing after a re-provision – testing when adding new capacity to an existing cell • Scheduling based on IP capacity – a new scheduler service? – currently handled by an outside service, "Resolver", similar to Entropy • General "cells as a first-class citizen" effort led by alaski
  17. Questions? THANK YOU RACKSPACE® | 1 FANATICAL PLACE, CITY OF WINDCREST | SAN ANTONIO, TX 78218 US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM © RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES.
