
Capacity Management/Provisioning (Cloud's full, Can't build here)


As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we will detail a myriad of challenges around adding new compute capacity. These include: planning, automation, organizational, quality assurance, monitoring, security, networking, integration, and more.


  1. Capacity Management and Provisioning (Cloud's full, can't build here) Matt Van Winkle, Manager Cloud Engineering @mvanwink Andy Hill, Systems Engineer @andyhky Joel Preas, Systems Engineer @joelintheory
  2. Public Cloud Capacity at Rackspace • Rackspace Public Cloud has deployed 100+ cells in ~2 years • New cells used to require hands-on engineer assembly, taking 3-5 weeks after a bare OS install • A year later, builds are done by on-shift operators in ~1 week (as low as 1 day) • Usually constrained by networking
  3. Control Plane Sizing • Data plane operations impacting both cell and top-level control plane – Image downloads/uploads • How large should the Nova DB be? – Breaking point of ‘standard’ cell control plane buildout, particularly the database
  4. Cell Sizing Considerations • Efficient use of private IP address space – Used for connections to services like Swift and dedicated environments • Broadcast domains • Attempt to keep the control plane minimal to limit overhead/complexity
  5. Hypervisor Sizing Considerations • Enough spare drive space for COW images – XenServer VHD size can easily be 2x the space given to the guest during normal operation! – Errors in cleaning up “snapshots” are exacerbated by tight disk overhead constraints • Drive space for pre-cached images – cache_images=some # nova – use_cow_images=True # nova – cache_in_nova=True # glance
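The three flags above can be gathered into a single configuration sketch. The option names are quoted from the slide, but section placement and exact spellings vary across OpenStack releases, so treat this as illustrative rather than a drop-in config:

```ini
# nova.conf on the hypervisor -- illustrative sketch, not a drop-in config
[DEFAULT]
use_cow_images = True   # back guests with copy-on-write images
cache_images = some     # only pre-cache images explicitly marked cacheable

# glance-api.conf
[DEFAULT]
cache_in_nova = True    # let nova keep a local cache of base images
```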
  6. Other Sizing Notes • Need reserve space for emergencies (host evac) • Reserve space is cell-bound, due to instances being unable to move between cells – https://review.openstack.org/#/c/125607/ – cells.host_reserve_percent • VM overhead – https://wiki.openstack.org/wiki/XenServer/Overhead – https://review.openstack.org/#/c/60087/
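The interaction between the reserve and per-VM overhead can be sketched numerically. The function below is illustrative, not Rackspace tooling; the overhead figure is an assumed example, not taken from the XenServer/Overhead wiki:

```python
def usable_slots(host_ram_mb, reserve_percent, flavor_ram_mb, vm_overhead_mb):
    """Estimate how many instances of one flavor fit on a host after
    holding back a reserve (cf. cells.host_reserve_percent) and charging
    each VM its hypervisor overhead. Illustrative sketch only."""
    schedulable_mb = host_ram_mb * (1 - reserve_percent / 100.0)
    return int(schedulable_mb // (flavor_ram_mb + vm_overhead_mb))

# 256 GB host, 10% reserve, 8 GB flavor, ~200 MB assumed per-VM overhead
print(usable_slots(262144, 10, 8192, 200))  # -> 28
```

Without the reserve and overhead the same host would fit 32 such guests, so the two effects together cost four slots per host, cell-wide.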
  7. Problems • Load Balancers • Glance and Swift • Fraud / Non-Payment • Routes • Road Testing
  8. Load Balancers • Alternate routes needed for high-BW operations – Generally Glance • Load balancer can become a bottleneck • Database queries returning lots of rows (cell sizing)
  9. Swift and Glance Bandwidth Problems: • Creates a single bottleneck • Imaging speeds are monitored; exceeding thresholds triggers investigation / scale-out • Cache not shared between glance-api nodes
  10. Swift and Glance Bandwidth Monitoring / Solutions: • Need to get downloads out of the path of the control plane (compute direct to image store) • Cache base images – Pre-seed when possible – Can cache images to the HV ahead of time for fast cloning: https://wiki.openstack.org/wiki/FastCloningForXenServer • Glance and Swift having shared request IDs would be nice • A shared cache might raise hit rate and save bandwidth What about when scaling out doesn’t work? Rearchitecture.
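The pre-seeding idea above can be sketched as a greedy planner: cache the most-requested base images first, skip what is already present, and never exhaust the hypervisor's spare disk. The data shapes and function name here are hypothetical:

```python
def images_to_preseed(popular, cached, free_bytes):
    """Greedy pre-seed plan for one hypervisor. Illustrative sketch,
    not real Rackspace tooling.

    popular:    {image_id: (request_count, size_bytes)}
    cached:     set of image_ids already on the hypervisor
    free_bytes: spare disk budget for the image cache
    """
    plan = []
    # Most-requested images first, so the hottest bases hit the cache.
    for image_id, (requests, size) in sorted(
            popular.items(), key=lambda kv: -kv[1][0]):
        if image_id in cached or size > free_bytes:
            continue  # already present, or it would blow the disk budget
        plan.append(image_id)
        free_bytes -= size
    return plan
```

A planner like this is deliberately conservative: tight disk overhead on the hypervisor (see the sizing slide) means the cache budget must never collide with space reserved for guest COW growth.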
  11. Fraud and Non-Payment Fraud • Mark instance as suspended • Still takes capacity • What to do? • Account Actioneer Non-Payment • Similar to fraud, but worse for capacity! • Try to give the customer as much time as possible to return to the fold • Same overall strategy as fraud, but instances are kept longer
  12. Road Testing nodes before enabling • New Cell – Bypass URLs (cell-specific API nodes) • Different nova.conf not using cells – compute_api_class=nova.compute.api.API # before • Cell tenant restrictions • Existing Cell/Rekick – Not as easy :( – How to ensure customer builds don’t land on a box that isn’t road tested?
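The "different nova.conf not using cells" bypass might look like the fragment below for a cell-specific API node; the option and class name are quoted from the slide (Nova cells v1 era), and the surrounding comments are this sketch's interpretation:

```ini
# nova.conf for a cell-specific "bypass" API node -- illustrative
[DEFAULT]
# Point the API straight at this cell's compute API instead of the
# cells API, so test builds land only in the new, not-yet-enabled cell.
compute_api_class = nova.compute.api.API
```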
  13. Managing the Capacity Management • Supply Chain/Resource Pipeline • Impact from Product Development • Gaps/Challenges from upstream
  14. Capacity Pipeline • Large customer requests • Triggers – % Used – # Largest slots per flavor • IPv4 addresses – Cells and scheduler unaware :( – Auditor + Resolver • Control plane (runs on OpenStack too)
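The "% used" and "largest slots per flavor" triggers can be combined into one small check. The threshold values, names, and MB-based accounting below are invented for illustration:

```python
def needs_capacity(used_mb, total_mb, hosts_free_mb, largest_flavor_mb,
                   pct_threshold=80.0, min_slots=10):
    """Fire a capacity trigger when cell utilization crosses a threshold
    OR too few slots of the largest flavor remain across the cell's
    hosts. Thresholds here are made-up examples."""
    pct_used = 100.0 * used_mb / total_mb
    # Slots are per-host: free RAM fragments smaller than the largest
    # flavor are unusable for it, which is why % used alone is not enough.
    slots = sum(free // largest_flavor_mb for free in hosts_free_mb)
    return pct_used >= pct_threshold or slots < min_slots
```

The second condition matters because a cell can look comfortably below the utilization threshold while its free RAM is fragmented into pieces too small for the largest flavor.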
  15. Product Implications • Keep up with code deploys (hotpatches) • Adjusting provisioning playbooks for: – new flavor types – new configurations/applications (quantum->neutron, nova-conductor) – control plane changes (10G glance) – new hardware manufacturers (OCP) • Non-production environments
  16. Upstream Challenges • Disabled flag for cells – Blueprint: http://bit.do/CellDisableBP – Bug: http://bit.do/CellDisableBug • Build to “disabled” host – Testing after a re-provision – Testing before adding new capacity to an existing cell • Scheduling based on IP capacity – New scheduler service? – Currently handled by outside service “Resolver”, similar to Entropy • General “Cells as first class citizen” effort led by alaski
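Since cells and the scheduler are unaware of IP exhaustion, an outside auditor has to compute headroom itself. A minimal check, assuming each cell has one private CIDR and a count of allocated addresses, might look like:

```python
import ipaddress

def cell_ip_headroom(cidr, allocated):
    """Usable private IPs remaining in a cell's block. The scheduler
    cannot see this, so an external auditor/resolver runs checks like
    this one. Illustrative sketch; names are hypothetical."""
    net = ipaddress.ip_network(cidr)
    # Exclude the network and broadcast addresses from the usable pool.
    return net.num_addresses - 2 - allocated

print(cell_ip_headroom("10.0.0.0/24", 200))  # -> 54
```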
  17. Questions? THANK YOU RACKSPACE® | 1 FANATICAL PLACE, CITY OF WINDCREST | SAN ANTONIO, TX 78218 US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM © RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES.
