Adaptive Fault Detection
Published in: Technology
  • Slide note: seq 1 15 | awk '{printf "%.2f%%\n", 100-(.499**$1*100)}'
  • Transcript

    • 1. Adaptive Fault Detection: Baron Schwartz • Percona Live NYC 2012
    • 2. Me ✤ Author of High Performance MySQL ✤ Creator of some tools that you might use ✤ I love hearing from people just like you: @xaprb on Twitter, http://www.linkedin.com/in/xaprb [Book cover: High Performance MySQL: Optimization, Backups, Replication, and more; 3rd Edition, Covers Version 5.5; Baron Schwartz, Peter Zaitsev & Vadim Tkachenko]
    • 3. Conventional Fault Detection: Metrics, Thresholds, and Actions
    • 4. Nagios and Thresholds: Is there a right answer?
    • 5. Motivations ✤ Detect unknown failure mode ✤ Capture diagnostic data automatically ✤ Surface relevant information
    • 6. Six Sigmas: 99.7% of measurements fall within ±3 sigmas of the mean in a normal distribution
    • 7. Abnormality Detection: Statistical process control, operations research, and intuition
    • 8. Shewhart Control Charts: Metrics that fall ± too many standard deviations from the mean are out of bounds
    • 9. Holt-Winters Forecasting: Predict the future based on history, trend, and seasonality
    • 10. Brownian Motion: A random walk shouldn’t go the same way for long
    • 11. Coin Tossing: QPS increases and decreases with ~equal probability [Chart: Probability of Increase/Decrease; Increase 49.88%, Same 0.21%, Decrease 49.91%]
    • 12. I Feel Normal Today: Long runs of QPS increases/decreases behave like a coin toss [Chart: Length of QPS Runs; y-axis 0 to 100000, x-axis run length -9 to 9]
    • 13-16. Oddly Even: How long is an unusually long random walk? Slides 14-16 repeat the same table and add callouts: “Two !” beside run length 4, “Three !” beside run length 8, and “Same thing, but two variables”.
        Run Length   Improbability
             1       50.10%   75.09%
             2       75.10%   93.79%
             3       87.57%   98.46%
             4       93.80%   99.62%
             5       96.91%   99.90%
             6       98.46%   99.98%
             7       99.23%   99.99%
             8       99.62%
             9       99.90%
            10       99.95%
    • 17. Houston, We Have an Opportunity: These techniques fall short of what’s needed
    • 18. God hath chosen the foolish things of the world to confound the wise [1 Cor 1:27]
    • 19. Unexpected Things Happen: ... but who says abnormal is bad?
    • 20. Bottleneck Detection: Find abnormalities, then determine whether they are system faults
    • 21. Metrics That Matter: Throughput, concurrency, and change, but not response time (why?)
    • 22. Algorithms: Various combinations of severity, directionality, run length, duration, and more
    • 23. Out of Bounds: How often each algorithm detected abnormalities
    • 24. Drilling Down: One of the algorithms triggered at the gray line
    • 25. Another Example: This one from a more selective algorithm
    • 26. Brownian Commotion: Just because it’s long and aimed the right way doesn’t mean it’s scary
    • 27-28. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup [Chart series: QPS, Cxn, Runs]
    • 29. A New Approach: Combinations of algorithms to avoid run-based false positives
    • 30. I’d Like Fries With That: And supersize my Threads_running, please
    • 31. Why Not Response Time? We are legion and we want your carrot patch
    • 32. Mass Application: Does it generalize to many different settings?
    • 33. Dataset #2: Looks reasonable on a different workload, so far
    • 34. Stall Detection In Action: Sorry, no witty comment here
    • 35. Oh Crud. What does “bad” mean on a workload like this?
    • 36. He taketh the wise in their own craftiness. [1 Cor 3:19]
    • 37. Different Like Everybody Else: Variance-to-mean ratio / index of dispersion to the rescue?
    • 38. Still Life With Purple Line: If you think this is art, I am happy to sell it to you
    • 39. When In Doubt, Get Empirical: Measure the 99.7th percentile V:M on a well-behaved dataset. Good enough for government work.
    • 40. Workload #1 Redux: Still finds lots of the same “bad spots” with the V:M ratio filter
    • 41. If you can read this, flip me over! Workload #2 With New Filter
    • 42. Conclusions ✤ Workloads differ. GIGO. ✤ Unlikely events are not necessarily bad. ✤ Naive techniques fail; more sophisticated methods are required. ✤ Thresholds are too simplistic, but reappear. ✤ Common sense beats elaborate math. ✤ Automated problem detection is a good thing?
    • 43. Jobs ✤ Want to work and live in America’s #1 City? ✤ I am hiring DBAs and developers. Talk to me.
    • 44. Questions? @xaprb • http://www.linkedin.com/in/xaprb
    • 45. Image Credits ✤ http://www.flickr.com/photos/katej/853418592/ ✤ http://www.flickr.com/photos/domesticat/2963393184/ ✤ http://www.flickr.com/photos/markho/481969187/ ✤ http://www.flickr.com/photos/exquisitur/3502317741/ ✤ http://www.flickr.com/photos/calleephoto/4952091078/ ✤ http://www.flickr.com/photos/paperpariah/4150220583/ ✤ http://www.flickr.com/photos/stevewall/6057281066/ ✤ http://www.flickr.com/photos/mybloodyself/5879425774/ ✤ http://www.flickr.com/photos/josephrobertson/92849605/ ✤ http://en.wikipedia.org/
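The run-length table on slides 13-16 follows from the slide note's formula, 100-(.499**n*100): if each step goes a given direction with probability of roughly 0.499, a run of n such steps becomes less likely as n grows. The following is a minimal Python sketch of that arithmetic, plus a helper that counts signed runs in a series. The function names and the sample QPS values are illustrative assumptions, not taken from the talk.

```python
from itertools import groupby


def run_improbability(length, p=0.499):
    """Percent 'improbability' of `length` consecutive steps in one direction.

    Same arithmetic as the slide note: 100 - (p**length * 100),
    where p is the per-step probability of moving in that direction.
    """
    return 100 - (p ** length) * 100


def signed_runs(series):
    """Return run lengths of consecutive increases (+) and decreases (-).

    Assumption for this sketch: an unchanged sample (delta == 0)
    ends the current run and is not counted as a run of its own.
    """
    deltas = [b - a for a, b in zip(series, series[1:])]
    signs = [(d > 0) - (d < 0) for d in deltas]
    runs = []
    for sign, group in groupby(signs):
        if sign != 0:
            runs.append(sign * len(list(group)))
    return runs


if __name__ == "__main__":
    for n in range(1, 9):
        print(f"{n:2d}  {run_improbability(n):.2f}%")
    qps = [100, 104, 109, 107, 103, 101, 99, 102]  # made-up samples
    print(signed_runs(qps))
```

For run lengths 1 through 8 this prints the same percentages as the first column of the table. A detector built on it would compare each observed run length against a chosen improbability cutoff, which is a design choice the slides leave open.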

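Slide 8 describes Shewhart control charts as flagging metrics that fall too many standard deviations from the mean. A minimal sketch of that check might look like the following, assuming the mean and standard deviation are taken from a separate baseline window and that three sigmas is the limit (slide 6 cites ±3 sigmas). The function name, the use of the population standard deviation, and the sample numbers are assumptions made for illustration.

```python
from statistics import mean, pstdev


def out_of_bounds(samples, baseline, sigmas=3.0):
    """Return (index, value) pairs for samples outside the control limits.

    The limits are mean(baseline) +/- sigmas * pstdev(baseline).
    Using a separate baseline window is an assumption of this sketch.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    lower = center - sigmas * spread
    upper = center + sigmas * spread
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]


if __name__ == "__main__":
    baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # made-up QPS samples
    print(out_of_bounds([101, 99, 120, 100, 90], baseline))
```

As slide 19 points out, a point outside the limits is only abnormal, not necessarily a fault, so this check is a first filter rather than a verdict.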
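Slides 37-40 mention the variance-to-mean ratio (index of dispersion) and an empirical threshold taken at the 99.7th percentile of V:M on a well-behaved dataset. The slides do not say how the filter is applied, so the sketch below is one possible reading: compute V:M over sliding windows, take the nearest-rank 99.7th percentile on a baseline, and flag windows in new data that exceed it. The window size, the nearest-rank percentile, the direction of the comparison, and all names are assumptions.

```python
import math
from statistics import mean, pvariance


def vmr(window):
    """Variance-to-mean ratio (index of dispersion) of one window.

    Assumes a non-negative metric such as QPS; an all-zero window
    is treated as having no dispersion.
    """
    m = mean(window)
    if m == 0:
        return 0.0
    return pvariance(window) / m


def rolling_vmr(series, size):
    """V:M for every full window of `size` consecutive samples."""
    return [vmr(series[i:i + size]) for i in range(len(series) - size + 1)]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def flag_windows(series, size, threshold):
    """Start indexes of windows whose V:M exceeds the threshold."""
    return [i for i, r in enumerate(rolling_vmr(series, size)) if r > threshold]


if __name__ == "__main__":
    baseline = [10, 11, 10, 9, 10, 11, 10, 9, 10, 10]  # made-up, well-behaved
    threshold = percentile(rolling_vmr(baseline, 4), 99.7)
    spike = [10, 10, 10, 10, 50, 10, 10, 10]  # made-up, one burst
    print(threshold, flag_windows(spike, 4, threshold))
```

With so few baseline windows the 99.7th percentile is simply the largest observed ratio; a realistic baseline would need far more samples for that percentile to mean anything.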