Production testing through monitoring

4,640 views

Identifying and fixing issues in new code before deploying it to production is important in every software development cycle. However, relying only on traditional testing methods in the age of Internet-scale, data-driven problems may prove insufficient. Identifying and fixing issues in production quickly is crucial, but it requires insight into usage patterns and trends across the whole architecture and application logic. In this talk I touch on the inefficiencies of some of the most common testing methods, provide real-world examples of discovering odd edge cases with monitoring, and offer recommendations on top-down metric instrumentation to help DevOps organizations identify and act on business-affecting problems.

Published in: Technology

Production testing through monitoring

  1. @papa_fire Troubleshooting with monitoring Testing in production DevOps monitoring [something] testing [something] monitoring [something] in production Leon Fayer
  2. ❖ @papa_fire ❖ leon@omniti.com ❖ fayerplay.com ❖ slideshare.net/LeonFayer1 THAT’S ME WHO AM I? ๏ engineer for 20+ years ๏ professional cynic ๏ @ OmniTI ๏ build and operate big systems ๏ we are hiring! ๏ omniti.com/is/hiring
  3. I HATE TESTING
  4. testing is required
  5. testing is not enough
  6. > unit testing > functional testing > resilience testing > performance testing > …
  7. testing can give a false sense of security
  8. testing is deterministic
  9. data problem
  10. > quantity of data > frequency of data > quality of data
  11. example Wolfe+585
  12. example Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwaren- gewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbe schutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhundert tausendjahresvorandieerscheinenvonderersteerdemenschderraumschiff genachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinur sprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchen nachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwo hinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicher freuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvor andererintelligentgeschopfsvonhinzwischensternartigraum, Sr.
  13. user problem
  14. “Users (n) - distributed fault injection test suite for production”
  15. example Corrupted Blood bug
  16. example
  17. other factors
  18. > lack of foresight (Y2K bug) > too many use-cases (female Tauren bug) > change to assumptions
  19. testing is great for “known knowns”
  20. testing is ok for “known unknowns”
  21. testing is bad for “unknown unknowns”
  22. enter monitoring
  23. why monitor?
  24. because testing isn’t enough
  25. > software is never perfect > systems are complex > external dependency worry > proactive is better than reactive > …
  26. because things change
  27. because things change in production
  28. what to monitor?
  29. “in God we trust, all others we monitor”
  30. > systems > databases > applications > integration points > performance > user behavior > …
  31. is it enough?
  32. is it too much?
  33. what is important?
  34. what is important? (i.e. what to alert on)
  35. example > servers up and running > HTTP checks return 200 > tweets are lost
  36. s/system checks/unit tests/
  37. “I don’t give a **** if the datacenter is on fire as long as I am still making money” - CEO
  38. we monitor because things change
  39. changes affect business
  40. top-down approach > understand business > define baseline > correlate data
  41. example ๏ online marketing company ๏ major e-commerce component ๏ ~100 million users ๏ 1 billion emails/month ๏ 300,000 lines of code ๏ 5600+ metrics collected
  42. it all starts with a call …
  43. revenue
  44. revenue + traffic
  45. revenue + traffic + load time
  46. revenue + traffic + load time + db
  47. revenue + traffic + load time + db + email
  48. what if … email wasn’t monitored?
  49. what if … email wasn’t monitored? (it would be after this)
  50. instrumentation is never done
  51. example > same symptoms > higher decline rates > all metrics are within norm
  52. example > same symptoms > higher decline rates > all metrics are within norm AmEx blocked
  53. tl;dr
  54. testing and monitoring not testing or monitoring
  55. understand the business
  56. continuous improvement
  57. {also bad at conclusions}
  58. THANK YOU questions?
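
The top-down approach from the slides (understand the business, define a baseline, correlate data) could be sketched roughly as follows. This is only an illustration, not the tooling used in the talk: the metric names mirror the "revenue + traffic + load time + db + email" progression, but every number, the three-standard-deviation threshold, and the helper names `deviates` and `investigate` are assumptions made up for the example.

```python
from statistics import mean, stdev


def deviates(value, history, threshold=3.0):
    """Return True when `value` is more than `threshold` standard
    deviations away from the baseline (the mean of `history`)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def investigate(history, current, order):
    """Walk the metrics top-down, business metric first, and return
    the names of those that are outside their baseline."""
    return [name for name in order if deviates(current[name], history[name])]


# Hypothetical samples; none of these numbers come from the talk.
history = {
    "revenue": [1000, 1020, 980, 1010, 990],
    "traffic": [5000, 5100, 4900, 5050, 4950],
    "load_time_ms": [200, 210, 190, 205, 195],
    "db_latency_ms": [20, 21, 19, 20, 22],
    "emails_sent": [40000, 41000, 39000, 40500, 39500],
}
current = {
    "revenue": 600,
    "traffic": 3000,
    "load_time_ms": 202,
    "db_latency_ms": 20,
    "emails_sent": 5000,
}
# Start from the number the business cares about, then drill down.
order = ["revenue", "traffic", "load_time_ms", "db_latency_ms", "emails_sent"]

suspects = investigate(history, current, order)
print(suspects)
```

With this invented data, revenue, traffic, and email volume fall outside their baselines while load time and database latency do not, which narrows where to look. The sketch only works if the email metric is collected in the first place, which is the point of "instrumentation is never done".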
