
Shitlist-driven development and other tricks for working on large codebases

Working on large codebases is hard. Doing so with 700 people is even harder. Deploying it 50 times a day is almost impossible. We will look at productivity tricks and automations that we use at Shopify to get stuff done. We will learn how we fix the engine while the plane is running, how to quickly change code that lots of people depend on, how to automatically track down productivity killers like unreliable tests, how to maintain a level of agility that keeps developers happy and allows them to ship fast, and most importantly what the heck a "shitlist" is.

Published in: Technology

  1. Shitlist-driven development and other tricks for working on large codebases. FLORIAN WEINGARTEN, flo@shopify.com, @fw1729
  3. “Programmers at work maintaining a Ruby on Rails application” (Classic Programmer Paintings)
  4. The Shopify Monolith
      • >400k shops (multi-tenant architecture).
      • 20k-40k RPS (80k RPS peak).
      • ~800 contributors (developers, designers, …).
      • Everybody can merge to master and deploy to production.
      • 40-50 deploys (50-100 PRs) shipped to production per day.
  5. MONOLITH AT SCALE. PRODUCTIVITY PROBLEM 1: DEPLOYS BECOME A BOTTLENECK
  10. Deploy bottleneck: Speed
      • More people => more PRs => more deploys or bigger deploys.
      • Small deploys: Fewer changes at once is safer, easier to debug, etc.
      • Observation: If you want small and often, you need fast.
      • Shopify: 40-50 deploys/day, that’s ~6 per (business) hour. If deploys become slower than ~10min, they become a productivity problem for us.
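The arithmetic behind the ~10 minute budget can be sketched in a few lines of Ruby. The 8-hour business day and the assumption that deploys run one after another are not stated on the slide; they are illustrative, as is the method name `minutes_per_deploy`.

```ruby
# Rough time budget per deploy, assuming deploys run serially.
# business_hours = 8.0 is an assumption, not a figure from the slides.
def minutes_per_deploy(deploys_per_day, business_hours = 8.0)
  deploys_per_hour = deploys_per_day / business_hours
  60.0 / deploys_per_hour
end

[40, 50].each do |deploys|
  puts format("%d deploys/day leaves about %.1f minutes per deploy",
              deploys, minutes_per_deploy(deploys))
end
```

Under these assumptions the budget lands between roughly 9.6 and 12 minutes, which is consistent with the ~10 minute threshold quoted above.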
  17. Deploy bottleneck: Speed
      • Parallel CI builds.
      • Build containers in advance and quickly.
      • Avoid booting application multiple times during container builds.
      • Deploy to many servers in parallel.
      • Reduce application boot time.
      • Reduce application shutdown time (e.g. Unicorn timeout, …).
  24. Deploy bottleneck: Humans
      • Asking ops team to deploy doesn’t scale.
      • Asking people to decide when a good time to deploy is doesn’t scale.
      • Asking everyone to pay attention to master CI doesn’t scale.
      • Asking everyone to pay attention to errors during a deploy doesn’t scale.
      • Asking developers to deploy themselves doesn’t scale.
      • Humans don’t scale. Automate!
  25. Automatic deploy when CI is passing
  26. Automatic range lock for reverts
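The slides do not show how these two automations work, so the following Ruby sketch is only one plausible reading: deploy the newest commit whose CI is passing, and when a bad commit is reverted, lock every commit from the bad one up to (but not including) the revert so that none of them can be picked. `Commit`, `lock_range_for_revert`, and `next_deployable` are hypothetical names, not part of any real deploy tool.

```ruby
# Hypothetical commit record; commit lists are ordered oldest to newest.
Commit = Struct.new(:sha, :ci_passing, :locked)

# Assumed meaning of "range lock": every commit from the bad one up to,
# but excluding, the revert still contains the bad change, so lock them all.
def lock_range_for_revert(commits, bad_sha, revert_sha)
  from = commits.index { |c| c.sha == bad_sha }
  to = commits.index { |c| c.sha == revert_sha }
  return commits if from.nil? || to.nil? || from >= to

  commits[from...to].each { |c| c.locked = true }
  commits
end

# Newest commit that has passing CI and is not locked, or nil if none.
def next_deployable(commits)
  commits.reverse.find { |c| c.ci_passing && !c.locked }
end

history = [
  Commit.new("a1", true, false),
  Commit.new("b2", true, false),  # turns out to be bad in production
  Commit.new("c3", true, false),
  Commit.new("d4", false, false)  # revert of b2, CI still running
]
lock_range_for_revert(history, "b2", "d4")
puts next_deployable(history).sha # falls back to a1 until d4 is green
```

In this sketch nobody has to remember that c3 still contains the bad change; the lock makes that decision automatically.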
  27. MONOLITH AT SCALE. PRODUCTIVITY PROBLEM 2: TOO MANY COOKS IN THE KITCHEN
  28. Yay, everything is fixed!
  30. In the meantime … Someone “unfixed” it. Someone added new shit. Too many cooks in the kitchen!
  32. Have to fix everything at once now :-( Idea: Can we raise only for B but not for C?
  35. Fixed. Can’t be accidentally “unfixed”. Still shitlisted. Only B is allowed to do it wrong. No new shit can be introduced.
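The code shown on these slides is not part of the transcript, so here is a minimal Ruby sketch of the idea, assuming the deprecated method can be told who is calling it. `Shitlist`, `ALLOWED_CALLERS`, `deprecated_behaviour`, and the caller names "B" and "C" are illustrative, not Shopify's actual code.

```ruby
class ShitlistViolation < StandardError; end

module Shitlist
  # Known offenders that existed when the deprecation started.
  # Entries are only ever removed from this list, never added.
  ALLOWED_CALLERS = %w[B].freeze

  def self.check!(caller_name)
    return if ALLOWED_CALLERS.include?(caller_name)

    raise ShitlistViolation,
          "#{caller_name} called deprecated_behaviour, which is deprecated. " \
          "Existing callers are tracked in Shitlist::ALLOWED_CALLERS; " \
          "please do not add new ones."
  end
end

# The deprecated code path: still works for shitlisted callers,
# raises for everyone else.
def deprecated_behaviour(caller_name)
  Shitlist.check!(caller_name)
  :legacy_result
end

puts deprecated_behaviour("B")
begin
  deprecated_behaviour("C")
rescue ShitlistViolation => e
  puts e.message
end
```

Once B has been fixed, its entry is deleted from the list, and from then on B cannot be accidentally “unfixed” either.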
  36. Problems:
      - Not always possible to change the API.
      - Sometimes you want different “granularity”.
  38. Granularity is now at the web request and job level. All jobs and all requests are now “registering” themselves so the shitlist can verify which codepaths are allowed to call the deprecated code.
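One way to picture this “registering” is a thread-local that each request or job sets before doing any work, so the deprecated code needs no API change to find out who is running. This is a sketch under that assumption; `ExecutionContext`, `CONTEXT_SHITLIST`, `deprecated_call`, and the job and controller names are made up for illustration.

```ruby
class ContextShitlistViolation < StandardError; end

# Each web request or background job registers its name while it runs.
module ExecutionContext
  def self.with(name)
    previous = Thread.current[:execution_context]
    Thread.current[:execution_context] = name
    yield
  ensure
    Thread.current[:execution_context] = previous
  end

  def self.current
    Thread.current[:execution_context]
  end
end

# Hypothetical entry points that are still allowed to hit the old code.
CONTEXT_SHITLIST = ["LegacyExportJob", "OrdersController#index"].freeze

# Deprecated code path: no caller argument needed, it looks up
# which request or job is currently executing.
def deprecated_call
  context = ExecutionContext.current
  unless CONTEXT_SHITLIST.include?(context)
    raise ContextShitlistViolation,
          "#{context.inspect} is not allowed to use deprecated_call"
  end
  :legacy_result
end

puts ExecutionContext.with("LegacyExportJob") { deprecated_call }
begin
  ExecutionContext.with("NewReportJob") { deprecated_call }
rescue ContextShitlistViolation => e
  puts e.message
end
```

In a real application the registration would presumably live in shared request and job wrappers so that individual developers never have to call it by hand.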
  39. Shitlists
      • Great for changing very “broad” behaviour.
      • Great for breaking down a huge task into many small chunks.
      • Great for generating “To-Do lists”.
      • Great for “educating” a large team about how you want them to write code and enforcing the new behaviour.
  40. Shitlist error messages
      • Bad error message: “Someone decided that the thing that worked yesterday is now wrong. Good luck fixing it yourself.”
      • Good error message: “Your code tried to make an HTTP request within a MySQL database transaction. This has been deprecated since it can negatively impact database performance. Using after_commit instead of after_save is often a good fix. If you need more help, please come see us in Slack in the #database-team channel.”
  41. MONOLITH AT SCALE. PRODUCTIVITY PROBLEM 3: UNRELIABLE TESTS
  42. Unlikely problems become likely at scale
      • Unreliable test: On the same version of the code, the test sometimes passes and sometimes fails.
      • Shopify: About 750 CI runs per day, ~10 min and ~70k tests each.
      • If only a single one of those 70k tests is unreliable and fails 1% of the time, we lose over 1 hour of productivity per day.
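The “over 1 hour” figure follows from the numbers on the slide if each spurious failure costs roughly one extra CI run; that cost model is an assumption. The second method is an addition not taken from the slides: it shows how quickly red builds become the norm once several unreliable tests fail independently. Both method names are illustrative.

```ruby
# Minutes lost per day, assuming every spurious failure costs one rerun.
def lost_minutes_per_day(ci_runs_per_day, failure_rate, minutes_per_run)
  ci_runs_per_day * failure_rate * minutes_per_run
end

# Chance that a single CI run goes red when several unreliable tests
# each fail independently at the same rate (independence is assumed).
def chance_of_red_build(unreliable_tests, failure_rate)
  1.0 - (1.0 - failure_rate)**unreliable_tests
end

puts format("%.0f minutes lost per day", lost_minutes_per_day(750, 0.01, 10))
puts format("%.1f%% of builds red with 10 such tests",
            chance_of_red_build(10, 0.01) * 100)
```

With the slide's numbers the first calculation gives 750 × 0.01 × 10 = 75 minutes per day for a single unreliable test.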
  43. Types of unreliable tests
      • Flaky test: time-dependent, load-dependent, …
      • Leaky test: order-dependent (test B fails if test A ran first)
  44. Automatic test grind
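The slide gives no detail on how the grind works. A common reading is “run the same test many times on the same code and count the failures”, which the Ruby sketch below implements. `grind` is a hypothetical helper, and it assumes the test is a block that returns true on success and returns false or raises a StandardError on failure.

```ruby
# Runs the given block `runs` times and reports how often it failed.
def grind(runs)
  raise ArgumentError, "runs must be positive" unless runs.positive?

  failures = 0
  runs.times do
    passed = begin
      yield
    rescue StandardError
      false
    end
    failures += 1 unless passed
  end
  { runs: runs, failures: failures, failure_rate: failures.to_f / runs }
end

# Stand-in for an unreliable test: fails on every fourth run.
attempt = 0
report = grind(20) do
  attempt += 1
  raise "timeout" if (attempt % 4).zero?
  true
end
puts report
```

A test that passes every run of a long grind is probably not flaky on its own, which makes an order-dependent (leaky) failure the more likely explanation.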
  45. Automatic leaky test "bisect"
      • Take list of all tests that ran before the failing test.
      • Binary search through list of candidates.
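The two steps above can be sketched as a binary search in Ruby. This assumes exactly one earlier test leaks state and that the failure reproduces deterministically. `find_leaky_test` is a hypothetical name, and the block stands in for “run these tests, then the failing test, in a fresh process, and report whether it failed”.

```ruby
# candidates: tests that ran before the failing test, in order.
# The block receives a subset of candidates and must return true if the
# failing test still fails after running that subset first.
def find_leaky_test(candidates, &fails_after)
  # If the failure does not reproduce with all candidates, there is
  # nothing to bisect.
  return nil unless fails_after.call(candidates)

  suspects = candidates
  while suspects.size > 1
    half = suspects.size / 2
    first_half = suspects.first(half)
    second_half = suspects.drop(half)
    # Single-culprit assumption: it is in exactly one of the halves.
    suspects = fails_after.call(first_half) ? first_half : second_half
  end
  suspects.first
end

tests = %w[test_a test_b test_c test_d test_e test_f test_g]
# Simulated CI: the failure appears whenever test_e ran beforehand.
culprit = find_leaky_test(tests) { |subset| subset.include?("test_e") }
puts culprit
```

Each round halves the candidate list, so even tens of thousands of preceding tests need fewer than twenty extra runs after the initial reproduction.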
  46. TL;DR: SUMMARY AND KEY TAKEAWAYS
  47. Summary: Monolith productivity at scale
      • Productivity problem 1: Deploys.
      • Solution: Often and small. Make them fast and automate everything.
      • Productivity problem 2: Too many cooks in the kitchen.
      • Solution: Shitlist-driven development.
      • Productivity problem 3: Unreliable tests.
      • Solution: Tracking and alerting. Bisect and grind. Automation.
  48. Thanks! Questions? FLORIAN WEINGARTEN, flo@shopify.com, @fw1729
