Using Ansible at Scale to Manage a Public Cloud


An overview of three scale challenges at Rackspace and how Ansible's key features help us solve them.

Published in: Technology

  • I'm Jesse Keating, and I work at Rackspace. I'm going to talk about what Rackspace does with Ansible.
  • At Rackspace we care about scale: the scale of the number of server systems, the scale of product environments, and the scale of engineers doing awesome things at Rackspace. I'm going to cover three scale challenges, with three case studies that highlight the key Ansible features that have made it my go-to tool in the box.
  • First is the scale of servers. I work in the Rackspace Public Cloud product group. We have... It is a lot to handle. We have existing inventory files for use with pssh and similar tools. Admins worry about what's there, engineers work on growing capacity and automation, and developers work on new code and new tools to deploy code. We all work together: DevOps.
  • A real-world example from a couple of days ago: we needed to copy one file out to the nova-compute VMs and restart the nova-compute service. We wanted to avoid flooding the admins with alerts, and we wanted easy-to-read output to know what happened. Before, this would have meant manual actions on the nagios hosts, a bash script around pssh, lots of output noise, and repeated delays on inactive hosts.
  • Key things Ansible brings to the party
  • Example of existing inventory contents: regions with cells with groups.
  • More of the existing inventory contents.
  • JSON output that Ansible can use: groups of groups, group_vars, addresses.
  • A fairly simple Python script to hand to Ansible (but it can be anything, so long as it hands back JSON).
  • A silly example of a one-off task.
  • The actual playbook used to hot-patch production.
  • This is how we're using Ansible RIGHT NOW with our production environment. We're building up a toolbox as we go.
  • Next I want to talk about the scale of our environments. Again I'll be focusing on our public cloud, which is powered by OpenStack. Stop me when you spot the problem. Servers, block storage, object storage, networks, auth, usage, etc... CI is really just for automated tests to gauge health. There are way too many moving parts for one pre-production environment, which makes it risky to deploy code in a timely manner. It is not easy to deploy from a personal branch or fork.
  • What we want to do is build out pre-production environments for each group or individual developer. It's a big task. Before, it could be days or weeks before an environment could be created, and then it could sit unused for long periods of time. Devs couldn't do it; engineers had to find time to fit it in.
  • Why we went with Ansible to back this service
  • Apologies for the puppet/mco stuff here, but that is what is pre-existing. Localhost actions prepare files for the new hosts.
  • Use the host loop to parallelize host boot-up in one of our internal Nova environments. Eventually this will use the rax module, which could do the DNS step for us.
  • Now do some actions on the remote hosts. Not showing everything; this is still in development.
  • Inventory files look a little different here, with more details per host. They make use of some YAML syntax to have defaults that can be overridden.
  • A plugin reads the files and makes use of --host.
  • What could take days or weeks to get done can now take minutes. We're automating the part that isn't already automated, filling the gap. We will hook it into a web service where developers can make a reservation and provide input as to what they want deployed. There is significant overlap with the process to roll out new production environments, which is the obvious next step.
  • Finally, let's talk about the scale of our engineering organization(s). There are no hard rules about what tech must be used; best practices bubble up. It is a real challenge to bring on new employees, and worse to bring on an intern and make the most of their time.
  • Once more I'm talking about our cloud group, ozone. This is not the full story, but it gives some idea of what has to happen. It took me weeks to get fully set up, and I think I'm still missing some stuff, exacerbated by sometimes being remote and off-hours from the main group.
  • How can Ansible help here?
  • Ansibox is a project I'm working on personally to help with onboarding, taking inspiration from GitHub's Boxen project. Roles are where the magic happens.
  • Engineers should have to give only limited input to Ansibox for it to be able to perform the setup. These inputs could be prompted for in the future. The engineer names a role and provides a location to find that role.
  • The top-level playbook fetches all the roles and can optionally update them. It then generates another playbook to actually go through and apply the roles to the host. The generated playbook comes from a template and is very simple.
  • Here is a look at it after it gets generated. It sets sudo to no at this level; each task in each role can decide to use sudo if the author wants it.
  • A very simple start to an ansibox executable. Two playbooks are necessary due to Ansible's design. The prompt is there for the second play in case any role wants sudo.
  • This is the start of a task list for the ozone role. Repos get cloned, tools get installed, and configuration files get put into place. Here we could also check for permissions to services and prompt the engineer on what to do to gain access.
  • With this system it becomes easy for an engineer to bootstrap a system, and easy for a group to own that process for the group. Engineers can also add their own roles for personal setups and be unafraid to refresh devices. Engineers can also contribute to the system as gaps are found.
  • Using Ansible at Scale to Manage a Public Cloud

    1. Jesse Keating – Linux Systems Engineer IV – Cloud Servers – @iamjkeating
       Using Ansible at Scale to Manage a Public Cloud
       06/13/2013 – AnsibleFest
    2. RACKSPACE® HOSTING | WWW.RACKSPACE.COM
       Rackspace cares about scale
       ● Scale of server systems
       ● Scale of environments
       ● Scale of engineers
    3. Scale of Server Systems
    4. Rackspace Public Cloud
       ● 4 “Production” regions
         – 1 to 8 cells per region
         – 250 to 500 nodes per cell
       ● Nearly 15K “systems” in production
       ● Another 500~ in CI/pre-production
       ● Mixed use of copy-pasta pssh scripts, pre-configured agent actions, jenkins automation, and host-based config management
       ● Managed by admins, engineers, developers
    5. Case study: Hotpatch One Production Environment
       ● 3900~ compute-nodes
         – Spread across 8 cells
         – Out of 6000~ total hosts
       ● Alerting will flood admins
       ● Output is hard to parse
    6. Ansible Key Features
       ● Inventory plugin
       ● Simple process flow
       ● Reusable playbooks with variable adjustments
       ● Avoids repeated actions on downed hosts
       ● Cleaner output
    7. Need to change
    8. .. and
    9. to...
    10. So we can do...
    11. Or this
    12. Ansible Use
        ● Replacing use of pssh for Random Tasks
        ● Replacing use of pssh for Expected Tasks (outside config management)
        ● Reuse existing inventory content
        ● Easily bolt together processes such as disabling nagios alerts prior to execution
    13. Scale of Environments
    14. Rackspace OpenStack Development
        ● At least 7 major software projects
          – Different feature schedules within each
        ● One Continuous Integration environment
        ● One Pre-production environment
        ● One branch of code that can easily be deployed
        ● New code deploys every two weeks
    15. Case Study: Create production-like environment to test disruptive product code change
        ● 30~ virtual instances
          – DB servers
          – Rabbit servers
          – Service providers
        ● 40~ capacity nodes
          – Hypervisor + nova-compute VM
        ● Mixed use of fabric, shell scripts, copy-pasta
        ● No self service
    16. Ansible Key Features
        ● Intermix local actions and remote actions
        ● External inventory plugin
        ● Start from nothing
        ● API to use directly within another application
    17. Start with localhost prep
    18. Local actions to boot instances
    19. Remote actions on hosts
    20. Existing yaml for host vars
    21. Ansible Use
        ● Replacing use of fabric, pssh, copy-pasta
        ● Boot strapping environment to the point where existing config management can take over
        ● Freeing up Engineer time by making it self-service
        ● Freeing up resources by tearing down environments after use
        ● Working toward using same process to build out production environments
    22. Scale of Engineers
    23. Rackspace Engineering
        ● Between 4K and 6K employees/contractors
        ● Between 500 and 1K Engineer/Developer types
        ● Many dozens of summer interns
        ● Countless groups
        ● Countless projects
        ● Rapid team creation / shifting of resources
        ● Mixed use of Mac OS X and Linux
        ● Mixed use of automation, configuration, et al tools
        ● Disjoint ownership of engineering onboarding
    24. Case study: Ozone Onboard
        ● 30+ git repos
        ● 5+ utilities w/ configuration
        ● Permissions to a plethora of services
        ● Configuration for CI/preprod/prod environments
        ● Details scattered throughout wiki pages and tribal knowledge
    25. Ansible Key Features
        ● Modular Roles
        ● Minimal dependencies
        ● OS agnostic
        ● Idempotent
        ● Fast
        ● Easy to use and extend
    26. Overview of Ansibox
    27. User edited file
    28. Top level playbook
    29. Generated Playbook
    30. Making it go
    31. Ozone Tasks
    32. Ansible Use
        ● Developer bootstraps their own system by selecting roles and providing details
        ● Teams own role definitions within a shared framework
        ● Repeatable process
          – Ansible playbook to clone/update roles
          – Second playbook to process roles
    33. Conclusion
        ● Ansible solves many problems Rackspace faces
        ● Chip away at edges with Ansible, perhaps one day replace existing config management systems with Ansible
        ● Continue to assist in development of Ansible modules, plugins, and scale testing
        ● Launch Ansibox soon!
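
The inventory plugin from the first case study appears only as images on the slides, so here is a minimal sketch of how such a script might look. It assumes the existing inventory can be loaded as a region, cell, group hierarchy; every region, cell, host name, and address below is hypothetical and not taken from the talk. The script follows Ansible's external inventory convention: `--list` prints all groups as JSON (using the `hosts`, `children`, and `vars` keys) and `--host NAME` prints the variables for one host. The `ansible_host` variable is the current name for the connection address; older releases used `ansible_ssh_host`.

```python
#!/usr/bin/env python
"""Sketch of an external inventory script. All names below are hypothetical."""
import argparse
import json

# Hypothetical stand-in for the existing pssh-style inventory files:
# region -> cell -> group -> list of hosts.
EXISTING = {
    "region1": {
        "cell0001": {
            "compute": ["c0001-compute-01", "c0001-compute-02"],
            "db": ["c0001-db-01"],
        },
        "cell0002": {
            "compute": ["c0002-compute-01"],
        },
    },
}

# Hypothetical addresses (documentation range), keyed by host name.
ADDRESSES = {
    "c0001-compute-01": "192.0.2.11",
    "c0001-compute-02": "192.0.2.12",
    "c0001-db-01": "192.0.2.21",
    "c0002-compute-01": "192.0.2.31",
}


def build_inventory(existing):
    """Flatten region/cell/group data into Ansible's group dictionary."""
    inventory = {}
    for region, cells in existing.items():
        region_group = inventory.setdefault(
            region, {"children": [], "vars": {"region": region}})
        for cell, groups in cells.items():
            cell_name = "%s-%s" % (region, cell)
            region_group["children"].append(cell_name)
            cell_group = inventory.setdefault(
                cell_name, {"children": [], "vars": {"cell": cell}})
            for group, hosts in groups.items():
                leaf = "%s-%s" % (cell_name, group)
                cell_group["children"].append(leaf)
                inventory[leaf] = {"hosts": list(hosts)}
                # Also collect hosts into a cross-cell group such as "compute".
                inventory.setdefault(group, {"hosts": []})["hosts"].extend(hosts)
    return inventory


def host_vars(hostname):
    """Return per-host variables; unknown hosts get an empty dictionary."""
    if hostname in ADDRESSES:
        return {"ansible_host": ADDRESSES[hostname]}
    return {}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Inventory sketch")
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    # Unrelated extra arguments are ignored rather than treated as errors.
    args, _unknown = parser.parse_known_args(argv)
    if args.host:
        print(json.dumps(host_vars(args.host)))
    else:
        print(json.dumps(build_inventory(EXISTING)))


if __name__ == "__main__":
    main()
```

Made executable and passed to `ansible` or `ansible-playbook` with `-i`, a script of this shape lets a play target one cell, one region, or every compute node without maintaining a second copy of the inventory.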
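
The Ansibox slides (user edited file, top level playbook, generated playbook) are also images, so the following is only a rough Python illustration of the "generate a second playbook from a template" step described in the notes. It is not Ansibox's actual template or file format: the role names, role locations, and the `render_playbook` helper are hypothetical. The play-level `sudo: no` mirrors the speaker notes; later Ansible releases replaced the `sudo` keyword with `become`.

```python
from string import Template

# Hypothetical user-edited input: each entry names a role and where to find it.
USER_ROLES = [
    {"name": "ozone", "src": "git@example.com:ozone/ansibox-role.git"},
    {"name": "personal", "src": "git@example.com:someuser/personal-role.git"},
]

# Hypothetical template for the generated playbook. sudo is off at the play
# level so that each task inside a role can opt in to it.
PLAYBOOK_TEMPLATE = Template("""\
---
- hosts: localhost
  connection: local
  sudo: no
  roles:
$role_lines
""")


def render_playbook(roles):
    """Render a playbook that applies the named roles to the local machine."""
    role_lines = "\n".join("    - %s" % role["name"] for role in roles)
    return PLAYBOOK_TEMPLATE.substitute(role_lines=role_lines)


if __name__ == "__main__":
    print(render_playbook(USER_ROLES))
```

In this sketch only the role names reach the generated playbook; the `src` locations would be consumed by the first playbook, the one that clones or updates the roles.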