Learning to Scale OpenStack: An Update from the Rackspace Public Cloud

At the Portland and Hong Kong Summits, Rackspace invited the OpenStack community into their experiences deploying OpenStack trunk to their Public Cloud infrastructure. In this presentation, Rackspace's Deployment System Team will provide an update on the latest challenges, triumphs, and lessons learned deploying and operating a production OpenStack public cloud during the Icehouse cycle. We'll conclude by sharing the vision for our next steps in OpenStack deployments during the Juno cycle and beyond.

  • Introductions, welcome to the talk.
  • A review of the Rackspace Public Cloud – sets the context for the conversation
  • This is our third summit presenting on this topic. Here is a brief review of some of the scale issues we were facing back at the Havana Summit in Portland
    Our window was 30 minutes of perceived downtime within 4-hour deploy windows
    Code coverage wasn't great; lots of errors were discovered in production
    Upstream moved very fast, and we couldn't keep up with all the testing downstream
  • Here is a comparison of how we met some of our challenges
    Our deploys are much faster – some as short as 10 minutes total in our largest environment, with 3 minutes of API interruption
    Deploys are now more reliable
    Migration timing is known ahead of time, and bad migrations are blocked upstream (a rough timing sketch follows this note)
    We still haven't solved keeping up with upstream. Many factors there.
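    A minimal sketch, not from the talk, of one way to learn migration timing downstream: run the schema migration against a restored copy of production data and record how long it takes, so the deploy window can be planned around it. The nova-manage command is real; pointing it at a production-sized database copy is assumed.

        # Time a database migration so the deploy window can be planned
        # around the measured duration.
        import subprocess
        import time

        def time_migration(command=("nova-manage", "db", "sync")):
            """Run the migration command and return how long it took in seconds."""
            start = time.monotonic()
            subprocess.check_call(command)
            return time.monotonic() - start

        if __name__ == "__main__":
            print("migration took %.1f seconds" % time_migration())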
  • We are also learning the contours of OpenStack by being the largest public cloud operator. We get to sweat up the hills and coast back down.
  • Some of our new challenges – scaling is about more than just deploying bits onto nodes as fast as we can.
    Scaling Services
    Scaling Deployments
    Scaling Frequency
    While we are trying to be a thought leader and front-runner, collaboration is the key to success. The developer, operator, and testing communities need to be aware of these scaling challenges.
  • Scaling Services – As the size of our cloud grows and the features of our cloud grow, the services used need to scale along with them. Here we will walk through two scaling scenarios that highlight the challenge.
  • Glance is an interesting case. Our Glance acts as an intermediary between hypervisors and Swift. As Glance got used more, a bottleneck emerged – partly due to our own configuration, but partly due to the nature of Glance. (A rough sketch of spreading image traffic across glance-api nodes follows this note.)
    Once we resolve the Glance issues, Swift could be the next bottleneck; care will be needed to make sure we don't just kick performance problems down the line to the next group.
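    A minimal sketch, assuming hypothetical endpoints, of the idea behind scaling the number of glance-api nodes: spread image API traffic across several nodes instead of saturating one. In practice nova-compute is simply pointed at a list of glance-api servers in its configuration; this stand-alone example only illustrates the fan-out.

        # Round-robin image metadata requests across several glance-api nodes.
        # Endpoint hostnames are hypothetical.
        import itertools
        import requests

        GLANCE_ENDPOINTS = [
            "http://glance01.example.com:9292",
            "http://glance02.example.com:9292",
            "http://glance03.example.com:9292",
        ]
        _endpoints = itertools.cycle(GLANCE_ENDPOINTS)

        def fetch_image_metadata(image_id, token):
            """Fetch image metadata from the next glance-api node in the rotation."""
            endpoint = next(_endpoints)
            resp = requests.get(
                "%s/v2/images/%s" % (endpoint, image_id),
                headers={"X-Auth-Token": token},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()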
  • The nova-cells service is responsible for relaying data between the global cell and all the child cells. Doing this with just a single instance was never going to scale; we just ran out of runway before the pain hit. (A sketch of the consumer-scaling pattern follows this note.)
    Through collaboration with upstream, we are now better able to scale out nova-cells as our cell counts grow.
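    A minimal, generic sketch of the scaling pattern described above: run several consumers against one work queue so messages are drained faster than a single process can manage. This is plain Python, not the nova-cells code itself.

        # Multiple workers draining one queue of cell messages; adding workers
        # is the analogue of running more nova-cells services per region.
        import multiprocessing as mp
        import queue

        def cell_message_worker(worker_id, messages):
            """Consume messages until the queue is empty."""
            while True:
                try:
                    msg = messages.get(timeout=1)
                except queue.Empty:
                    return
                print("worker %d handled %s" % (worker_id, msg))

        if __name__ == "__main__":
            work = mp.Queue()
            for i in range(100):
                work.put({"instance_update": i})

            workers = [mp.Process(target=cell_message_worker, args=(n, work))
                       for n in range(4)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()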
  • These challenges will repeat. New bottlenecks will be found and new resource limits will be discovered. Staying ahead of the pain is key. We will not be the only ones to experience this; we are looking for collaboration on how best to manage this kind of scale.
  • Our next scale challenge involves deployments.
    We made great strides around Havana; what have we been doing since?
  • Orchestration has been our theme around deployments. We continue to iterate on the parts of the deployment causing the most pain, always making improvements for the next time.
    Walk through each block and explain why the change was made
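    A minimal sketch, with hypothetical host groups and commands, of the targeted bring-up described on the orchestration slide: pre-stage content outside the window, then bring services up in a deliberate order – API nodes first, computes afterwards, tolerating a few downed hosts. The real orchestration is Ansible behind a single bin/deploy entry point; this only illustrates the ordering.

        # Illustrative deploy ordering; "stage-release" and "activate-release"
        # are hypothetical commands standing in for the real tooling.
        import subprocess

        API_HOSTS = ["api01", "api02"]
        COMPUTE_HOSTS = ["compute01", "compute02", "compute03"]

        def run(host, command):
            subprocess.check_call(["ssh", host, command])

        def deploy(release):
            # 1. Pre-stage the release everywhere before the window opens.
            for host in API_HOSTS + COMPUTE_HOSTS:
                run(host, "stage-release %s" % release)

            # 2. Bring the API up on the new code first.
            for host in API_HOSTS:
                run(host, "activate-release %s" % release)

            # 3. Then roll the computes, tolerating a few downed hosts.
            failed = []
            for host in COMPUTE_HOSTS:
                try:
                    run(host, "activate-release %s" % release)
                except subprocess.CalledProcessError:
                    failed.append(host)
            return failed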
  • Even with the improvements, we still treat OpenStack like a legacy application: upgrading in place, not utilizing load balancers, stopping everything to migrate databases, preventing mixed versions, etc. There are many things preventing us from getting to zero downtime, and that's where we can all work together! (A sketch of the load-balancer-driven alternative follows this note.)
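    A hypothetical sketch of the "cloud application" style upgrade the note says we are not yet doing: drain one API node at a time behind a load balancer, upgrade it, and return it to the pool, so the API as a whole never disappears. The LoadBalancer class and upgrade() call are illustrative stand-ins, not real tooling.

        # Rolling upgrade of API nodes behind a (simulated) load balancer.
        class LoadBalancer:
            def __init__(self, hosts):
                self.active = set(hosts)

            def drain(self, host):
                self.active.discard(host)   # stop routing new requests to host

            def restore(self, host):
                self.active.add(host)       # put host back into rotation

        def upgrade(host, release):
            print("upgrading %s to %s" % (host, release))  # placeholder for real work

        def rolling_upgrade(lb, hosts, release):
            for host in hosts:
                lb.drain(host)
                upgrade(host, release)
                lb.restore(host)

        if __name__ == "__main__":
            api_hosts = ["api01", "api02", "api03"]
            rolling_upgrade(LoadBalancer(api_hosts), api_hosts, "2014.1")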
  • A third scale challenge is frequency. This is the scale of doing things much more often.
  • A very relevant quote – though unlike cycling, when you do something more often in the DevOps world it does tend to get easier. Still, there are challenges to going faster!
  • Change comes from many sources. These changes need to be distributed to the environments, but with as little customer impact as possible. If we can't deploy changes often enough, we fall behind upstream, we fall behind our features, and we have larger deployments to consume. A snowball effect.
    Our work on creating multiple release pipelines, improving our deployment methods, and moving our tests upstream has enabled us to move faster, but not fast enough.
  • This is our limit. We absolutely have to make this better. This is a global need, throughout the community of developers, operators, and testers.
  • A quick look at what we've got cooking for the Juno cycle
  • In Icehouse, Nova made great strides toward live upgrade with the object model and conductor, which give us the ability to run multiple versions of OpenStack at the same time. Notably, we could run a newer nova-api against an older version in the rest of the environment and shield nova-compute from migrations. This could let us roll the update through without API downtime and with less interruption to the computes. (A sketch of the read-only idea follows this note.)
    Investigate putting API nodes in read-only mode during migrations to satisfy some requests and queue others
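    A hypothetical WSGI middleware sketching the read-only idea above: while a migration window is open, let GET/HEAD/OPTIONS requests through and turn writes away with a Retry-After (queuing them, as the note suggests, would be a further step). This is not an existing OpenStack component.

        READ_METHODS = {"GET", "HEAD", "OPTIONS"}

        class ReadOnlyDuringMigration(object):
            """WSGI middleware that rejects writes while a migration runs."""

            def __init__(self, app, migration_in_progress):
                self.app = app
                # Callable reporting whether a migration window is open.
                self.migration_in_progress = migration_in_progress

            def __call__(self, environ, start_response):
                if (environ["REQUEST_METHOD"] not in READ_METHODS
                        and self.migration_in_progress()):
                    start_response(
                        "503 Service Unavailable",
                        [("Content-Type", "text/plain"), ("Retry-After", "120")],
                    )
                    return [b"API is read-only while a database migration runs.\n"]
                return self.app(environ, start_response)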
  • This is an ongoing conversation. If we allow each service to work independently, what does that do to the version test matrix? Can we reliably validate anything? While individual projects/services might go faster, does that allow the entire pipeline to go faster? This ties into the discussions happening now at the design summit about cross project interactions.
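    A back-of-the-envelope illustration of the version matrix concern: with S independently deployed services and V versions of each allowed in production at once, the combinations to validate grow as V ** S. The service list and version count below are only an example.

        # How quickly per-service pipelines inflate the test matrix.
        services = ["nova", "glance", "neutron", "cinder", "swift", "keystone"]
        versions_in_flight = 3  # e.g. previous, current, next

        combinations = versions_in_flight ** len(services)
        print("%d service-version combinations to validate" % combinations)  # 729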
  • Yeah, we need fully automated environments. Setting them up is hard; let's work together to make them easier.
    The ops meetups are great for collaborating on the issues at hand.
  • We do a lot of things that are hard, but if it wasn't hard, it wouldn't be as satisfying. That's what keeps us coming back.
    Scaling is more than just tossing code on nodes. There are a lot more considerations to take into account.
    The development, operator, and tester communities need to collaborate more on where the painful parts are, particularly at scale, and work together on solutions.
  • Learning to Scale OpenStack: An Update from the Rackspace Public Cloud

    1. An Update from the Rackspace Public Cloud – Learning to Scale OpenStack
       Rainya Mosher and Jesse Keating – Deployment Engineering
       @rainyamosher @iamjkeating
    2. The Rackspace Public Cloud
       6 Public Regions
       3 Pre-Production Regions
       10s of Thousands of nodes
       Growing continually
       Frequent deployments
       Staying aligned with upstream
    3. Our Old Challenges
       • We could not deploy code in a reasonable window of time
       • We did not have confidence in the code we were deploying
       • We could not keep up with upstream
    4. Old Challenges Met
       • Deploys taking 6+ hours → Deploys take an hour, as short as 10 minutes
       • Deploys often failed the first time → Deploys rarely fail the first time
       • Migrations were an unknown factor → Migrations tested upstream and timed downstream
       • Deploys roughly 2 months behind upstream → Still up to 2 months behind
    5. "It is by riding a bicycle that you learn the contours of a country best, since you have to sweat up the hills and coast down them." ~ Ernest Hemingway
    6. Our New Challenges
       • Scaling Services
       • Scaling Deployments
       • Scaling Frequency
    7. Scaling Services
    8. Scaling Glance
       • Scheduled Images feature went live
       • Glance saw much more usage
       • Glance servers became saturated
       • Builds and snapshots slowed down, eventually piling up faster than could be consumed
       • Resolved by:
         – Scaling number of glance-api nodes
         – Scaling size of glance-api nodes
         – Scaling use of glance-bypass feature
    9. Scaling Nova Cells
       • Performance Cells went live
       • More and more cells added to regions
       • Nova cells service became a single funnel slowing down the exchange of data
       • Eventually our single nova-cells service could not consume messages faster than they were being produced
       • Resolved by:
         – Scaling number of nova-cells services
         – Optimizing instance healing calls
         – Optimizing database usage from cells service
    10. How do we anticipate where our growth will hurt and proactively scale to match?
    11. Scaling Deployments
    12. Higher Form Orchestration
       • Pre-staging content outside of deploy window
       • Increased tolerance of "downed" hosts
       • Targeted bring up of services
         – API first, then computes
       • More deployment options
         – Factonly
         – Cellonly
         – No migrations
       • Reduced complexity
         – Single entry point: bin/deploy
         – Single orchestration system: Ansible
    13. We still treat OpenStack as a legacy software deployment. As a community we need to treat it more like a cloud application, but that requires collaboration!
    14. Scaling Frequency
    15. "It never gets easier, you just go faster." ~ Greg LeMond
    16. Scaling Change
       • New features coming
       • New configurations coming
       • Accommodate without interrupting customer experience
       • Change faster, change frequently, on an ever growing fleet of systems
       • Resolved by:
         – Understanding change before it happens
         – Scheduling changes to not conflict
         – Dedicating release iterations to risky change on top of known good code
         – Custom deploy modes per change type
    17. Customer Experience is our most important measurement of how fast we can scale.
    18. The Next Iteration
    19. Zero Perceived Downtime
       • Leverage object model in Icehouse for mixed-version services
       • Implement Nova conductor service
       • Investigate read-only states
    20. Individual Service Deployment Pipelines
       • Can we give Glance its own pipeline and deployment capability, independent of Nova or other services?
       • How do we combat the exponential growth of service version combinations?
       • Does this actually make the whole pipeline any faster?
    21. Fully Automated Environments
       • Creating not just ephemeral environments, but production ones as well
       • Upgrades are easy, initial setups are a lot harder
       • Validation is critical
       • Developers and Operators need to collaborate on this use case when services are being designed
    22. "I have always struggled to achieve excellence. One thing that cycling has taught me is that if you can achieve something without a struggle it's not going to be satisfying." ~ Greg LeMond
    23. RACKSPACE® HOSTING | 5000 WALZEM ROAD | SAN ANTONIO, TX 78218
       US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM
       RACKSPACE® HOSTING | © RACKSPACE US, INC. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM
