As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we will detail a myriad of challenges around adding new compute capacity. These include: planning, automation, organizational, quality assurance, monitoring, security, networking, integration, and more.
1.
Capacity Management
and Provisioning
(Cloud's full, can't build here)
Matt Van Winkle, Manager Cloud Engineering @mvanwink
Andy Hill, Systems Engineer @andyhky
Joel Preas, Systems Engineer @joelintheory
2.
Public Cloud Capacity at Rackspace
• Rackspace Public Cloud has deployed 100+ cells in ~2 years
• New cells used to take hands-on engineer assembly and 3-5 weeks after the bare OS install
• 1 year later, builds are done by on-shift operators in ~1 week (as low as 1 day)
• Usually constrained by networking
3.
Control Plane Sizing
• Data plane operations impact both the cell and top-level control planes
– Image downloads/uploads
• How large should the Nova DB be?
– Breaking point of the ‘standard’ cell control plane buildout, particularly the database
4.
Cell Sizing Considerations
• Efficient use of Private IP address space
– Used for connections to services like Swift and the dedicated environment
• Broadcast domains
• Aim for a minimal control plane to limit overhead/complexity
5.
Hypervisor Sizing Considerations
• Enough spare drive space for COW images
– XS VHD size can easily be 2x the space given to the guest during normal operation! (sizing sketch below)
– Errors in cleaning up “snapshots” are exacerbated by tight disk overhead constraints
• Drive space for pre-cached images
– cache_images=some # nova
– use_cow_images=True # nova
– cache_in_nova=True # glance
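A minimal sizing sketch of the drive-space reasoning above, in Python. The growth factor, cache reservation, snapshot slack, and the example guest mix are illustrative assumptions, not Rackspace's actual figures:

    # Rough hypervisor local-disk budget, assuming XenServer-style COW/VHD growth.
    # All constants below are illustrative assumptions.
    VHD_GROWTH_FACTOR = 2.0     # a guest's VHD can easily reach ~2x its flavor disk size
    IMAGE_CACHE_GB = 200        # assumed space reserved for pre-cached base images
    SNAPSHOT_SLACK_GB = 500     # assumed slack for snapshot-cleanup failures

    def required_local_disk_gb(guests_per_flavor):
        """guests_per_flavor: {flavor disk size in GB: expected guest count}."""
        guest_gb = sum(disk_gb * count * VHD_GROWTH_FACTOR
                       for disk_gb, count in guests_per_flavor.items())
        return guest_gb + IMAGE_CACHE_GB + SNAPSHOT_SLACK_GB

    # Example: 20 x 40 GB guests and 10 x 160 GB guests on one hypervisor
    print(required_local_disk_gb({40: 20, 160: 10}))   # -> 5500.0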
6.
Other Sizing Notes
• Need reserve space for emergencies (host evac)
• Reserve space is cell-bound, due to instances being unable to move between cells (reserve sketch below)
– https://review.openstack.org/#/c/125607/
– cells.host_reserve_percent
• VM overhead
– https://wiki.openstack.org/wiki/XenServer/Overhead
– https://review.openstack.org/#/c/60087/
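A minimal sketch, with made-up numbers, of how a percentage reserve like the cells host_reserve_percent option above withholds per-cell capacity (the exact calculation in Nova's cells code may differ):

    # Capacity a cell may hand out after withholding a reserve, kept per cell
    # because instances cannot move between cells.
    def schedulable_units(total_units, reserve_percent):
        return int(total_units * (100 - reserve_percent) / 100)

    # A cell with 10,000 largest-flavor slots and a 10% emergency reserve
    # (host evacuations) only exposes 9,000 slots for scheduling.
    print(schedulable_units(10000, 10))   # -> 9000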
7.
Problems
• Load Balancers
• Glance and Swift
• Fraud / Non Payment
• Routes
• Road Testing
8.
Load Balancers
• Alternate routes needed for high-bandwidth operations
– Generally Glance
• Load balancer can become a bottleneck
• Database queries returning lots of rows (cell sizing)
9.
Swift and Glance Bandwidth
Problems:
• Creates a single bottleneck
• Imaging speeds are monitored; exceeding thresholds triggers investigation / scale out
• Cache not shared between glance-api nodes
10.
Swift and Glance Bandwidth
Monitoring / Solutions:
• Need to get downloads out of the path of the control plane (compute direct to image store)
• Cache base images
– Pre-seed when possible
– Can cache images to the HV ahead of time for fast-cloning (pre-seeding sketch below)
https://wiki.openstack.org/wiki/FastCloningForXenServer
• Glance and Swift having shared request IDs would be nice
• A shared cache might raise the hit rate and save bandwidth
What about when scaling out doesn’t work? Rearchitecture.
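A hedged pre-seeding sketch: stream popular base images from Glance v2 into a local cache directory before customer builds arrive. The endpoint, token, image IDs, and cache path are placeholders, and the XenServer fast-clone cache placement from the wiki link above is not shown:

    # Warm a hypervisor-local image cache by downloading popular base images
    # via the Glance v2 "file" API, so later builds skip the big transfer.
    import os
    import requests

    GLANCE_ENDPOINT = "http://glance.example.com:9292"   # assumed internal endpoint
    AUTH_TOKEN = os.environ["OS_TOKEN"]                  # assumed pre-issued token
    CACHE_DIR = "/var/cache/base-images"                 # assumed local cache path
    POPULAR_IMAGE_IDS = ["<image-uuid-1>", "<image-uuid-2>"]

    def preseed(image_id):
        """Stream one image into the local cache directory."""
        url = "{}/v2/images/{}/file".format(GLANCE_ENDPOINT, image_id)
        headers = {"X-Auth-Token": AUTH_TOKEN}
        dest = os.path.join(CACHE_DIR, image_id)
        with requests.get(url, headers=headers, stream=True, timeout=300) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1024 * 1024):
                    out.write(chunk)

    for image_id in POPULAR_IMAGE_IDS:
        preseed(image_id)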
11.
Fraud and Non-Payment
Fraud
• Mark instance as suspended
• Still takes capacity
• What do? → Account Actioneer (suspend sketch below)
Non-Payment
• Similar to fraud but worse for capacity!
• Try to give the customer as much time as possible to return to the fold
• Same overall strategy as fraud, but instances are kept longer
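A hedged sketch of what an "Account Actioneer"-style job might do for a fraud-flagged tenant, assuming python-novaclient with admin credentials; the URLs, credentials, and tenant ID are placeholders and the real internal tooling is not shown:

    # Suspend every ACTIVE instance owned by a flagged tenant. Suspended
    # instances stop running but, as noted above, still take capacity.
    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client

    auth = v3.Password(auth_url="http://keystone.example.com:5000/v3",
                       username="ops-admin", password="secret",
                       project_name="admin",
                       user_domain_id="default", project_domain_id="default")
    nova = client.Client("2.1", session=session.Session(auth=auth))

    def suspend_tenant(tenant_id):
        servers = nova.servers.list(search_opts={"all_tenants": 1,
                                                 "tenant_id": tenant_id})
        for server in servers:
            if server.status == "ACTIVE":
                server.suspend()

    suspend_tenant("<flagged-tenant-uuid>")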
12.
Road Testing nodes before enabling
• New Cell
– Bypass URLs (cell-specific API nodes; road-test sketch below)
• Different nova.conf not using cells
– compute_api_class=nova.compute.api.API # before
• Cell tenant restrictions
• Existing Cell/Rekick - Not as easy :(
– How to ensure customer builds don’t land on a box that isn’t road tested?
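A hedged road-test sketch for the new-cell case, assuming python-novaclient pointed at a cell-specific bypass endpoint via endpoint_override; the URLs, credentials, image, and flavor are placeholders, and a real road test covers far more than one boot:

    # Boot a canary server directly against the cell's own API nodes (the
    # "bypass URL") and confirm it reaches ACTIVE before enabling the cell.
    import time
    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client

    auth = v3.Password(auth_url="http://keystone.example.com:5000/v3",
                       username="ops-admin", password="secret",
                       project_name="admin",
                       user_domain_id="default", project_domain_id="default")
    nova = client.Client("2.1", session=session.Session(auth=auth),
                         endpoint_override="http://cell042-api.example.com:8774/v2.1")

    server = nova.servers.create(name="roadtest-canary",
                                 image="<base-image-uuid>",
                                 flavor="<small-flavor-id>")
    status = "BUILD"
    for _ in range(60):                      # wait up to ~10 minutes
        status = nova.servers.get(server.id).status
        if status in ("ACTIVE", "ERROR"):
            break
        time.sleep(10)
    print("road test result:", status)
    nova.servers.delete(server.id)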
13.
Managing the Capacity Management
● Supply Chain/Resource Pipeline
● Impact from Product Development
● Gaps/Challenges from upstream
14.
Capacity Pipeline
• Large Customer Requests
• Triggers (see the sketch below)
– % Used
– # Largest Slots per flavor
• IPv4 Addresses
– Cells and scheduler unaware :(
– Auditor + Resolver
• Control Plane (runs on OpenStack too)
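A minimal sketch of the capacity triggers above (% used and remaining largest-flavor slots), with illustrative per-cell numbers and thresholds rather than real ones:

    # Flag a cell for the capacity pipeline when it is too full or when too
    # few "largest flavor" slots remain. All numbers are assumptions.
    LARGEST_FLAVOR = {"ram_mb": 30720, "disk_gb": 1200}
    MIN_LARGEST_SLOTS = 50
    MAX_PERCENT_USED = 80.0

    def needs_capacity(cell):
        """cell: dict with total/used RAM (MB) and disk (GB) for one cell."""
        percent_used = 100.0 * cell["ram_used_mb"] / cell["ram_total_mb"]
        free_ram = cell["ram_total_mb"] - cell["ram_used_mb"]
        free_disk = cell["disk_total_gb"] - cell["disk_used_gb"]
        largest_slots = min(free_ram // LARGEST_FLAVOR["ram_mb"],
                            free_disk // LARGEST_FLAVOR["disk_gb"])
        return percent_used > MAX_PERCENT_USED or largest_slots < MIN_LARGEST_SLOTS

    cell = {"ram_total_mb": 10000000, "ram_used_mb": 8500000,
            "disk_total_gb": 400000, "disk_used_gb": 300000}
    print(needs_capacity(cell))   # -> True (85% RAM used, only 48 largest slots left)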
15.
Product Implications
• Keep up with code deploys (hotpatches)
• Adjusting provisioning playbooks to:
– new flavor types
– new configurations/applications (quantum->neutron, nova-conductor)
– control plane changes (10g glance)
– new hardware manufacturers (OCP)
• Non production environments
16.
Upstream Challenges
• Disabled flag for cells
– Blueprint: http://bit.do/CellDisableBP
– Bug: http://bit.do/CellDisableBug
• Build to “disabled” host
– Testing after a re-provision
– Testing for adding new capacity to existing cell
• Scheduling based on IP capacity
– New scheduler service?
– Currently handled by outside service “Resolver”, similar to Entropy
• General “Cells as first class citizen” effort led by alaski