WinOps Conf 2016 - Peter Mounce - DoS yourself in production every night to prove you can take it



At JUST EAT, for over two years now, there hasn't been an embarrassing performance regression that we failed to notice and put right on the same day we deployed it. We also haven't been taken down by unexpected load.

We have around 150 engineers deploying tens of times a week across around 120 different components in production. In 2014, we pushed more than 800 discrete changes and coped with around 50% more traffic as we grew. Our uptime has… gone up.

We don't:

  • have engineers run a performance test before we release
  • have engineers run capacity tests every month or so when we remember to

Instead, we do something a bit different...

Published in: Technology


  1. DoS yourself in production every night to prove you can take it (@petemounce + @justeat_tech)
  2. Any questions? Shout them out as we go. That's more fun.
  3. Who are JUST EAT?
  4. Performance?
  5. When do you suppose peak time is? The same time we DoS ourselves, of course!
  6. We have cyclic demand
  7. The problem with continuous delivery: Everyone wants to change everything, all the time.
  8. Traditional approach: Let's make an environment like production and run load through that.
  9. Individual tests take too long
  10. But of course...
  11. So, test all the time
  12. So, test in production #YOLO! We deploy 10s of small changes a day and we have alerts. I bet we won't break production (without noticing) #WhatCouldPossiblyGoWrong? Let's just do it in production with fake traffic at the same time as customers!
  13. Reasons why this isn't insane
  14. How did we start doing this? Technology aspects and people aspects
  15. Have the idea to start (we didn't invent this)
  16. Choose scenarios we care about
  17. Choose a load agent
  18. Gain confidence outside of peak time (this part is also about reassuring stakeholders that you've got it all under control...)
  19. Start adding data variety
  20. Make the computer do it every day. This is the most vital part!
  21. Get more elaborate later: Fake away external dependencies (x-traffic-flavour: fake)
  22. And even more elaborate... Fake away more complicated things
  23. How have we kept doing this? ... and what did we learn to do better
  24. Didn't allow tests to be red (for long)
  25. Needed to tune levels over time
  26. Got smarter about data management
  27. Embraced the fact that things break
  28. What battle scars did we get... lately? All of these would have hurt badly if we hadn't had the ability to turn the pain off ourselves.
  29. Find unbounded result sets before customers
  30. Monitoring needs to be solid!
  31. Realise AWS account limits are closer than we thought...
  32. Realise haproxy should balance, not magnify load...
  33. Realise we're not as smart as we think... Dear, this is why I had to leave early last year...
  34. But... Discovered problems during peacetime, not peak time
  35. What did we gain?
  36. Peace of mind, #1: Continuous, early warning about getting slower and running out of capacity.
  37. Peace of mind, #2: Good, simple, clear operational response to most surprises: Is fake load running? Stop it. Scale up. Now, start to think.
  38. Peace of mind, #3: If we find a problem Thursday night: (1) turn off fake load for the weekend, (2) enjoy weekend, (3) fix it next week with less pressure.
  39. Performance & operability == 1st class concern
  40. Alerts become automated tests in production
  41. git push production is one step closer: Continuous testing in production can be applied to more than just performance & capacity.
  42. Online takeaway. Harder than you might think. We've got many open spots for talented engineers (London, Bristol, Kiev), if you're interested. tech.just­ Get in touch ­ peter.mounce@just­
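
Slide 21 mentions faking away external dependencies and shows the header `x-traffic-flavour: fake`. The transcript does not say how that header is consumed, so the following is only a minimal sketch of one plausible reading: a service inspects the header on each request and swaps in a fake client for an external dependency when the request is synthetic load. The payment gateway classes and the function names are hypothetical, invented here for illustration; only the header name and value come from the slide.

```python
from typing import Mapping, Protocol

# Header name and value as shown on slide 21.
FLAVOUR_HEADER = "x-traffic-flavour"
FAKE_FLAVOUR = "fake"


class PaymentGateway(Protocol):
    """Hypothetical external dependency; not named in the talk."""

    def charge(self, order_id: str, pence: int) -> str: ...


class RealPaymentGateway:
    def charge(self, order_id: str, pence: int) -> str:
        # A real implementation would call the third-party provider here.
        raise RuntimeError("real provider call is out of scope for this sketch")


class FakePaymentGateway:
    def charge(self, order_id: str, pence: int) -> str:
        # Synthetic load never reaches the provider; return a canned receipt.
        return f"fake-receipt-{order_id}"


def is_fake_traffic(headers: Mapping[str, str]) -> bool:
    """Return True when the request is marked as fake load."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == FLAVOUR_HEADER:
            return value.strip().lower() == FAKE_FLAVOUR
    return False


def gateway_for(headers: Mapping[str, str]) -> PaymentGateway:
    """Pick the fake or the real dependency for this request."""
    if is_fake_traffic(headers):
        return FakePaymentGateway()
    return RealPaymentGateway()
```

In a design like this the header would also need to be forwarded on every downstream call, so that services deeper in the request path can make the same decision. That propagation is an assumption of the sketch, not something the slides state.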