Production testing through monitoring

Identifying and fixing issues in new code before deploying it to production is important in every software development cycle. However, relying only on traditional testing methods in an age of Internet-scale, data-driven problems may prove insufficient. Identifying and fixing issues in production quickly is crucial, but it requires insight into usage patterns and trends across the whole architecture and application logic. In this talk I touch on the inefficiencies of some of the most common testing methods, provide real-world examples of discovering odd edge cases with monitoring, and offer recommendations on top-down metric instrumentation to help DevOps organizations identify and act on business-affecting problems.

Production testing through monitoring

  1. @papa_fire Troubleshooting with monitoring Testing in production DevOps monitoring [something] testing [something] monitoring [something] in production Leon Fayer
  2. ❖ @papa_fire ❖ leon@omniti.com ❖ fayerplay.com ❖ slideshare.net/LeonFayer1 THAT’S ME WHO AM I? ๏ engineer for 20+ years ๏ professional cynic ๏ @ OmniTI ๏ build and operate big systems ๏ we are hiring! ๏ omniti.com/is/hiring
  3. I HATE TESTING
  4. testing is required
  5. testing is not enough
  6. > unit testing > functional testing > resilience testing > performance testing > …
  7. testing can give a false sense of security
  8. testing is deterministic
  9. data problem
  10. > quantity of data > frequency of data > quality of data
  11. example: Wolfe+585
  12. example: Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhunderttausendjahresvorandieerscheinenvonderersteerdemenschderraumschiffgenachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchennachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwohinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvorandererintelligentgeschopfsvonhinzwischensternartigraum, Sr.
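
The data problem in the slides above can be sketched in a few lines. This is only an illustration: the 32-character limit, the `store_surname` function, and the test names are hypothetical and do not come from the talk. The point is that a deterministic test passes for every input its author imagined, while production data is free to fall outside that set.

```python
MAX_SURNAME_LEN = 32  # hypothetical column width, assumed for illustration


def store_surname(surname: str) -> str:
    """Pretend to persist a surname; reject values the schema cannot hold."""
    if len(surname) > MAX_SURNAME_LEN:
        raise ValueError(f"surname too long: {len(surname)} > {MAX_SURNAME_LEN}")
    return surname


def test_store_surname() -> bool:
    """A typical unit test: it only covers the names the author thought of."""
    return all(store_surname(name) == name for name in ["Smith", "Fayer", "O'Neil"])


def try_store(surname: str) -> bool:
    """Return True if the surname is accepted, False if it is rejected."""
    try:
        store_surname(surname)
        return True
    except ValueError:
        return False


# Even the shortened form of the surname from the slide breaks the assumption.
real_world_surname = "Wolfeschlegelsteinhausenbergerdorff"

print("unit test passes:", test_store_surname())
print("real surname accepted:", try_store(real_world_surname))
```

The unit test stays green while the real record is rejected, which is the false sense of security the earlier slides describe.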
  13. user problem
  14. “Users (n): distributed fault injection test suite for production”
  15. example: Corrupted Blood bug
  16. example
  17. other factors
  18. > lack of foresight (Y2K bug) > too many use-cases (female Tauren bug) > changes to assumptions
  19. testing is great for “known knowns”
  20. testing is ok for “known unknowns”
  21. testing is bad for “unknown unknowns”
  22. enter monitoring
  23. why monitor?
  24. because testing isn’t enough
  25. > software is never perfect > systems are complex > external dependency worry > proactive is better than reactive > …
  26. because things change
  27. because things change in production
  28. what to monitor?
  29. “in God we trust, all others we monitor”
  30. > systems > databases > applications > integration points > performance > user behavior > …
  31. is it enough?
  32. is it too much?
  33. what is important?
  34. what is important? (i.e. what to alert on)
  35. example > servers up and running > HTTP checks return 200 > tweets are lost
  36. s/system checks/unit tests/
  37. “I don’t give a **** if the datacenter is on fire as long as I am still making money” - CEO
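
The gap between system checks and what the business cares about, as in the lost-tweets example above, might look like the following sketch. The `Snapshot` fields, the 1% loss threshold, and the sample numbers are assumptions made for illustration, not details from the talk.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical readings collected over one monitoring window."""
    servers_up: bool
    http_status: int
    tweets_received: int  # accepted by the API during the window
    tweets_stored: int    # actually persisted during the window


def system_check(s: Snapshot) -> bool:
    """Infrastructure-level view: are the boxes up and answering 200?"""
    return s.servers_up and s.http_status == 200


def business_check(s: Snapshot, max_loss_ratio: float = 0.01) -> bool:
    """Business-level view: did what users sent actually make it through?"""
    if s.tweets_received == 0:
        return True  # nothing could be lost in this window
    lost = s.tweets_received - s.tweets_stored
    return lost / s.tweets_received <= max_loss_ratio


# Everything looks healthy from the outside, yet a quarter of the tweets vanish.
incident = Snapshot(servers_up=True, http_status=200,
                    tweets_received=10_000, tweets_stored=7_500)

print("system check ok:", system_check(incident))
print("business check ok:", business_check(incident))
```

Alerting on the second check rather than the first is one way to read "what is important" in the slides above.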
  38. we monitor because things change
  39. changes affect business
  40. top-down approach > understand business > define baseline > correlate data
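
A minimal sketch of the "define baseline" step, assuming a simple mean and standard deviation over past values of one business metric. The revenue figures, the three-sigma threshold, and the function names are invented for illustration; real baselines usually also need to account for seasonality and trends.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Return (mean, standard deviation) of past values for one metric."""
    return mean(history), stdev(history)


def deviates(value, baseline, threshold=3.0):
    """True if value is more than `threshold` standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        return value != avg
    return abs(value - avg) / sd > threshold


# Hypothetical revenue for the same hour on the previous seven days.
revenue_history = [1000, 1040, 980, 1010, 995, 1025, 990]
baseline = build_baseline(revenue_history)

print("1015 deviates:", deviates(1015, baseline))
print("600 deviates:", deviates(600, baseline))
```

Starting with a business metric such as revenue, and only then drilling into the systems underneath it, follows the top-down order listed on the slide.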
  41. example ๏ online marketing company ๏ major e-commerce component ๏ ~100 million users ๏ 1 billion emails/month ๏ 300,000 lines of code ๏ 5600+ metrics collected
  42. it all starts with a call …
  43. revenue
  44. revenue + traffic
  45. revenue + traffic + load time
  46. revenue + traffic + load time + db
  47. revenue + traffic + load time + db + email
  48. what if … email wasn’t monitored?
  49. what if … email wasn’t monitored? (it would be after this)
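
The "correlate data" step from the walkthrough above can be sketched as ranking every collected metric by how closely it moves with revenue. All series below are invented for illustration, and the metric names are assumptions. Correlation only narrows down where to look; it does not prove which metric is the cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(vx * vy)


def rank_by_correlation(metrics, target="revenue"):
    """Metric names ordered by absolute correlation with the target metric."""
    scores = {name: pearson(series, metrics[target])
              for name, series in metrics.items() if name != target}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Hypothetical hourly samples; revenue drops in the last three hours.
metrics = {
    "revenue":      [100, 102, 98, 101, 60, 55, 58],
    "traffic":      [500, 510, 495, 505, 300, 280, 290],
    "load_time_ms": [210, 205, 215, 208, 212, 207, 211],
    "db_queries":   [900, 905, 890, 910, 895, 900, 905],
    "emails_sent":  [50, 51, 49, 50, 5, 4, 6],
}

ranking = rank_by_correlation(metrics)
print(ranking)
```

In this made-up data, traffic and email volume track the revenue drop while load time and database activity do not. A ranking like this is only possible if the email metric was being collected in the first place.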
  50. instrumentation is never done
  51. example > same symptoms > higher decline rates > all metrics are within norm
  52. example > same symptoms > higher decline rates > all metrics are within norm AmEx blocked
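
One way a problem like the AmEx case above could be surfaced is by breaking the decline rate down per card type and comparing each against its own baseline. The sketch below assumes that approach; the transaction data, the baselines, and the tolerance are hypothetical, and the talk does not say how the issue was actually found.

```python
from collections import defaultdict


def decline_rates(transactions):
    """Return {card_type: declined / total} for (card_type, approved) pairs."""
    totals = defaultdict(int)
    declined = defaultdict(int)
    for card_type, approved in transactions:
        totals[card_type] += 1
        if not approved:
            declined[card_type] += 1
    return {card: declined[card] / totals[card] for card in totals}


def outliers(rates, baseline, tolerance=0.10):
    """Card types whose decline rate exceeds their baseline by more than `tolerance`."""
    return sorted(card for card, rate in rates.items()
                  if rate - baseline.get(card, 0.0) > tolerance)


# Hypothetical window: AmEx is declined every time, other cards behave normally.
transactions = (
    [("visa", True)] * 90 + [("visa", False)] * 10
    + [("mastercard", True)] * 45 + [("mastercard", False)] * 5
    + [("amex", False)] * 10
)
baseline = {"visa": 0.10, "mastercard": 0.10, "amex": 0.10}

rates = decline_rates(transactions)
print(outliers(rates, baseline))
```

Adding a per-card-type metric after an incident like this is an instance of "instrumentation is never done".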
  53. tl;dr
  54. testing and monitoring, not testing or monitoring
  55. understand the business
  56. continuous improvement
  57. {also bad at conclusions}
  58. THANK YOU questions?
