Sean Schofield & Richard Lister, Spree Commerce: Fearless Deployment @ Open Commerce Conference 2016


Presented at the Open Commerce Conference on June 28-29, 2016 in New York City

  1. 1. Fearless Deployment Sean Schofield (@uberzealot) Richard Lister (@bnzmnzhnz)
  2. 2. Background ● Open Source ● Consulting company ● VC Backed ● Acquired by First Data in 2015
  3. 3. What are we afraid of? 1. The “Real World” 2. Instability 3. Going Slow
  4. 4. The “Real World” ● Differences between staging and production ● Volume of data ● Nature of data ● Missing configuration
  5. 5. Instability ● Deployments cause most of the problems that impact customers ● Code being deployed as well as the deployment itself ● Risk increases over time ● External sources of instability
  6. 6. Going slow ● Speed of development ○ We don’t want stability at the expense of speed ○ Whatever solution we come up with will just slow us down ● Intervals between deployments ○ The longer we go between deploys, the more worried we are about the next one ○ Migrations are more likely to fail ○ We’re only making the problem worse by delaying our deployments
  7. 7. Goal #1: Embrace the Real World
  8. 8. Embracing the “Real World” ● Two things keep us separated from the “Real World” ○ Application behavior ○ User behavior ● Let’s figure out a way to eliminate those differences ● No more surprises when we deploy!
  9. 9. Replace Staging Environment with Stacks
  10. 10. Use the stacks to go live ● Each release is done as a self-contained “stack” ● No more staging environment ● No more RAILS_ENV ● Think release candidate for your infrastructure ● No more surprises based on real world data
  11. 11. Stop separating the test data ● DynamoDB is designed for massive amounts of data ● Test data and live customer data can peacefully co-exist ● Use a test attribute to identify our test records ● Everything lives together in a single database!
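A minimal sketch of the test-attribute idea, in Python for illustration only (the talk's own stack is Ruby). The attribute name `test` and the helper names are assumptions; the slide only says that a test attribute identifies test records that share a table with live data.

```python
# Items mimic rows that live side by side in one table.
# ASSUMPTION: the marker attribute is called "test"; the slides do not name it.
def is_test_item(item):
    """Treat an item as test data only when it carries a truthy test flag."""
    return bool(item.get("test", False))


def partition_items(items):
    """Split items into (live, flagged) without copying or modifying them."""
    live, flagged = [], []
    for item in items:
        (flagged if is_test_item(item) else live).append(item)
    return live, flagged


items = [
    {"store_id": "store-1", "object_id": "order-1", "object_value": {"total": 20}},
    {"store_id": "store-1", "object_id": "order-2", "object_value": {"total": 5}, "test": True},
]

live_items, flagged_items = partition_items(items)
```

Keeping the flag on the record itself means reports and customer-facing queries can exclude test data with a single filter, while both kinds of record still exercise the same table.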
  12. 12. Stop using ActiveRecord ● Learned things the hard way with Spree ● Really slow when doing a lot of writes ● Use Plain Old Ruby Objects (PORO) instead ● All of our tables have the same structure ○ store_id ○ object_id ○ object_value
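The uniform table shape can be sketched with a plain object that knows how to turn itself into a `store_id` / `object_id` / `object_value` row and back. This is a Python illustration, not the team's Ruby library. The `Product` class, its fields, and the choice of JSON for `object_value` are all assumptions made for the example.

```python
import json


class Product:
    """A plain object with no ORM base class (hypothetical example type)."""

    def __init__(self, store_id, object_id, name, price_cents):
        self.store_id = store_id
        self.object_id = object_id
        self.name = name
        self.price_cents = price_cents

    def to_item(self):
        # ASSUMPTION: object_value holds the attributes serialized as JSON.
        value = {"name": self.name, "price_cents": self.price_cents}
        return {
            "store_id": self.store_id,
            "object_id": self.object_id,
            "object_value": json.dumps(value, sort_keys=True),
        }

    @classmethod
    def from_item(cls, item):
        value = json.loads(item["object_value"])
        return cls(item["store_id"], item["object_id"],
                   value["name"], value["price_cents"])


item = Product("store-1", "product-42", "Mug", 1200).to_item()
restored = Product.from_item(item)
```

Because every table has the same three columns, a generic copy or migration job can move any kind of object without knowing its schema.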
  13. 13. Protect the real world data ● No database write access for developers ● Only the store owner can change their own data ● No super admin ● Impossible for developers to change data while testing ● Ensure no real world side effects whenever we write data
  14. 14. Complete copy of the database ● Every stack has a complete database copy ● Migrations are performed at the same time as copy ● Shoryuken workers for multi-threaded processing ● We can copy 500,000 records in under ten minutes
  15. 15. Sync changes after the copy ● Track changes since our bulk copy ● DynamoDB streams to monitor these changes ● New data is continuously migrated ● Same migration logic as with bulk copy ● No more migrations on release day!
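The bulk copy and the follow-up sync share one migration function, which is why no migration is left for release day. The sketch below shows that shape in Python with in-memory dicts standing in for the tables. The change records are a simplified, hypothetical format (not the real DynamoDB Streams record layout), and the `title` to `name` rename is an invented migration.

```python
def migrate(item):
    """Hypothetical migration: rename 'title' to 'name' inside object_value."""
    value = dict(item["object_value"])
    if "title" in value:
        value["name"] = value.pop("title")
    return {**item, "object_value": value}


def bulk_copy(source, target):
    """Copy every item into the new stack's table, migrating as we go."""
    for key, item in source.items():
        target[key] = migrate(item)


def apply_change(target, change):
    """Replay one change captured after the bulk copy, using the same migrate()."""
    if change["type"] == "REMOVE":
        target.pop(change["key"], None)
    else:  # INSERT or MODIFY
        target[change["key"]] = migrate(change["new_item"])


production = {
    "p1": {"store_id": "s1", "object_id": "p1", "object_value": {"title": "Mug"}},
    "p2": {"store_id": "s1", "object_id": "p2", "object_value": {"title": "Hat"}},
}
new_stack = {}
bulk_copy(production, new_stack)

# Changes that arrived in production after the bulk copy started.
changes = [
    {"type": "MODIFY", "key": "p1",
     "new_item": {"store_id": "s1", "object_id": "p1", "object_value": {"title": "Big Mug"}}},
    {"type": "REMOVE", "key": "p2"},
]
for change in changes:
    apply_change(new_stack, change)
```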
  16. 16. Goal #2: Stability
  17. 17. Ops Code as First Class Citizen ● Infrastructure must be change-controlled and repeatable ● Operations source code is in the same git repo as application code ● Every release is tracked as a single SHA in GitHub ● Check out a SHA to get a fully self-contained ops+app setup ● We use AWS CloudFormation templates to describe all resources
  18. 18. CloudFormation Top Tip Don’t do this Do this
  19. 19. The stack contains everything we need ● Networking ● Load-balancers ● Auto-scaling groups ● Instance config ● Permissions ● Database
  20. 20. Docker Containers ● Provide a runnable application artifact ● Dependency management ○ System libraries ○ Ruby + Gems ○ Application code
  21. 21. Docker Decouples Application from OS ● Protect against changes in the underlying OS, which just provides: ○ Kernel ○ Docker daemon ○ Systemd, to start containers ● We are safer making OS updates ○ Updates to system libraries do not affect application
  22. 22. Amazon Machine Image ● AMI provides a runnable server artifact ○ We get the same artifact every time ● What if Docker repository goes down? ○ Create AMI with packer and bake in all docker images ○ We’re happy to trade AMI build time for stability ● What if Github or rubygems are down? ○ Instance needs no external information to start app
  23. 23. The Dreaded AWS Degradation Email
  24. 24. Cattle vs Pets Don’t do this Do this
  25. 25. Auto Scaling ● Stop caring about individual instances ● Autoscaling replaces failed instances ● We trust replacement because we do it all the time ● Cope easily with changing load
  26. 26. Production Deployment
  27. 27. Release Procedure ● Tag branch in git ● Build docker container ● Build AMI ● Create stack ● Copy data from production ● Sync new data from production ● Test, test, test ● Update DNS ● Delete old stack
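One way to read the release procedure is as an ordered pipeline in which the DNS update comes only after testing. The Python sketch below models that ordering with stub actions; `run_release`, `always_ok`, and `failing` are hypothetical helpers, and the real steps would call git, Docker, Packer, and AWS rather than return booleans.

```python
RELEASE_STEPS = [
    "tag branch in git",
    "build docker container",
    "build AMI",
    "create stack",
    "copy data from production",
    "sync new data from production",
    "test",
    "update DNS",
    "delete old stack",
]


def run_release(actions):
    """Run steps in order; stop at the first failure.

    Returns (completed_steps, failed_step), where failed_step is None on success.
    """
    completed = []
    for step in RELEASE_STEPS:
        if not actions[step]():
            return completed, step
        completed.append(step)
    return completed, None


def always_ok():
    return True


def failing():
    return False


# Simulate a release whose test phase fails on the new stack.
actions = {step: always_ok for step in RELEASE_STEPS}
actions["test"] = failing
completed, failed_step = run_release(actions)
```

In this model a failed test run stops before "update DNS", so traffic keeps flowing to the old stack, which has not yet been deleted.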
  28. 28. Immutable once we go live ● New releases require a new stack ● Emergency hotfixes require a new AMI ● Instances are replaced, not modified ● Once deployed nothing can be changed ● There is no SSH
  29. 29. Goal #3: Go Fast
  30. 30. Continuous Deployment for Developers ● We deploy many times a day - just not to production ○ Devs get a stack for each feature branch, with a full copy of production data ○ Go crazy, break things, it will be entirely deleted when done ● Docker lets us build images fast ○ We don’t want to wait for a brand new AMI with each commit ○ Write Dockerfile to use caching in a smart way ● Dev stacks can be deployed by just replacing docker image
  31. 31. Argus for Fast Docker Builds ● Enqueue docker builds using SQS ● Distributed workers for fast builds ● Workers pre-pull existing image layers ● This means all workers can use docker cache ● Pushes image to AWS EC2 Container Registry
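The queue-and-workers pattern behind this slide can be sketched in Python with an in-memory `queue.Queue` standing in for SQS and threads standing in for distributed workers. This is not Argus itself: the job fields (`repo`, `sha`), the tag format, and the function names are assumptions, and no Docker build is actually run.

```python
import queue
import threading


def build_worker(jobs, results, lock):
    """Pull build requests until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        # A real worker would pre-pull existing layers, build, and push here.
        tag = f"{job['repo']}:{job['sha'][:7]}"
        with lock:
            results.append(tag)


def run_builds(build_requests, worker_count=2):
    """Enqueue all requests, let the workers drain the queue, return sorted tags."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=build_worker, args=(jobs, results, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for request in build_requests:
        jobs.put(request)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker so each one exits
    for worker in workers:
        worker.join()
    return sorted(results)


tags = run_builds([
    {"repo": "app", "sha": "0123456789abcdef"},
    {"repo": "app", "sha": "fedcba9876543210"},
])
```

Decoupling the request from the build through a queue lets any idle worker pick up the next job, which is what makes it worthwhile for every worker to keep a warm layer cache.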
  32. 32. Developer Deploys
  33. 33. Developer Deploys Are Fast ● If the bundle is cached, docker build takes about 15 seconds ● AWS SSM Run Command runs a canned script ● Simply pulls latest docker image and restarts container ● Access is controlled with IAM ● Logs are in logstash
  34. 34. Summary ● All infrastructure and code is in the stack ● The stack is immutable ● We use stacks instead of having a special staging environment ● We use a complete copy of real world data in our stacks ● We’re constantly deploying - just not to production ● Production deploys are just updating the DNS to the new stack
  35. 35. Resources ● - Ruby library for PORO ● - asynchronous Ruby workers with SQS ● - fast Docker build and push to ECR ● - Ruby library for common stack operations ● - Ruby DSL for CloudFormation templates ● - guidelines for stateless software as a service
  36. 36. Questions?