Adaptive Fault Detection
  • Speaker notes (slides 13-16): seq 1 15 | awk '{printf "%.2f%%\n", 100-(.499**$1*100)}'

Adaptive Fault Detection Presentation Transcript

  • 1. Adaptive Fault Detection. Baron Schwartz • Percona Live NYC 2012
  • 2. Me
      ✤ Author of High Performance MySQL
      ✤ Creator of some tools that you might use
      ✤ I love hearing from people just like you:
        ✤ @xaprb on Twitter
        ✤ http://www.linkedin.com/in/xaprb
      [Book cover: High Performance MySQL: Optimization, Backups, Replication, and more. 3rd Edition, Covers Version 5.5. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko]
  • 3. Conventional Fault Detection: Metrics, Thresholds, and Actions
  • 4. Nagios and Thresholds: Is there a right answer?
  • 5. Motivations
      ✤ Detect unknown failure mode
      ✤ Capture diagnostic data automatically
      ✤ Surface relevant information
  • 6. Six Sigmas: 99.7% of measurements fall within ±3 sigmas of mean in a normal distribution
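The 99.7% figure on slide 6 can be checked against the normal distribution's error function. The following is a minimal Python sketch, not something from the talk; the function name `coverage_within` is made up for illustration.

```python
import math

def coverage_within(k: float) -> float:
    """Fraction of a normal distribution within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Print the coverage for 1, 2 and 3 sigmas.
for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {coverage_within(k):.2%}")
```

This only holds for normally distributed data; real server metrics may not be.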
  • 7. Abnormality Detection: Statistical process control, operations research, and intuition
  • 8. Shewhart Control Charts: Metrics that fall ± too many standard deviations from the mean are out of bounds
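The rule on slide 8 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the baseline and sample numbers are invented, the name `out_of_bounds` is hypothetical, and the 3-sigma default is a common convention rather than a value given in the talk.

```python
from statistics import mean, stdev

def out_of_bounds(baseline, samples, k=3.0):
    """Return (index, value) pairs in `samples` that lie more than k standard
    deviations from the mean of `baseline`."""
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the baseline
    lower, upper = center - k * sigma, center + k * sigma
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]

# Made-up QPS numbers, for illustration only.
baseline_qps = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
new_qps = [100, 104, 60, 99, 140]
print(out_of_bounds(baseline_qps, new_qps))
```

The control limits here come from a fixed baseline; recomputing them over a sliding window is another option the slide does not rule out.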
  • 9. Holt-Winters Forecasting: Predict the future based on history, trend, and seasonality
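To make slide 9 concrete, here is a minimal Python sketch of one common formulation, additive Holt-Winters (triple exponential smoothing). The slide does not say which variant, initialisation, or smoothing parameters were used, so the function name, the simple initialisation, and every parameter value below are assumptions.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Fit additive Holt-Winters on `series` and return `horizon` forecasts.

    alpha, beta, gamma smooth the level, trend and seasonal terms.
    This simple initialisation needs at least two full seasons of data.
    """
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    first, second = series[:m], series[m:2 * m]
    level = sum(first) / m                        # average of the first season
    trend = (sum(second) - sum(first)) / (m * m)  # average per-step change
    seasonal = [x - level for x in first]         # offsets from the level

    for t, x in enumerate(series):
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % m]
            for h in range(horizon)]

# Made-up, perfectly periodic data with a season of 4 samples.
history = [10, 20, 30, 20] * 6
print(holt_winters_additive(history, 4, 0.5, 0.5, 0.5, 4))
```

A detector built on this would compare each new measurement with its forecast and flag large deviations; how large is "large" is again a threshold choice.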
  • 10. Brownian Motion: A random walk shouldn’t go the same way for long
  • 11. Coin Tossing: QPS increases and decreases with ~equal probability
        Probability of Increase/Decrease
        Increase   49.88%
        Same        0.21%
        Decrease   49.91%
  • 12. I Feel Normal Today: Long runs of QPS increases/decreases behave like a coin toss
        [Chart: Length of QPS Runs; horizontal axis -9 to 9, vertical axis 0 to 100000]
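The measurements behind slides 11 and 12 (how often QPS goes up, stays the same, or goes down, and how long the runs in one direction are) could be computed roughly as follows. This is a sketch on made-up data; the function names are hypothetical and the talk does not show its own implementation.

```python
from collections import Counter

def direction_fractions(series):
    """Fraction of sample-to-sample steps that increase, stay the same, or decrease."""
    steps = [(b > a) - (b < a) for a, b in zip(series, series[1:])]
    n = len(steps)
    return {name: steps.count(v) / n
            for name, v in (("increase", 1), ("same", 0), ("decrease", -1))}

def signed_run_lengths(series):
    """Signed lengths of consecutive runs: positive for increases, negative
    for decreases. An unchanged sample ends the current run."""
    runs = []
    current = 0
    for prev, cur in zip(series, series[1:]):
        step = (cur > prev) - (cur < prev)  # +1, -1, or 0
        if step != 0 and step * current > 0:
            current += step                 # run continues in the same direction
        else:
            if current != 0:
                runs.append(current)        # close the finished run
            current = step                  # start a new run, or none if unchanged
    if current != 0:
        runs.append(current)
    return runs

qps = [1, 2, 3, 2, 1, 0, 5, 5, 6]           # made-up samples
print(direction_fractions(qps))
print(Counter(signed_run_lengths(qps)))     # histogram of run lengths
```

The `Counter` output corresponds to the histogram on slide 12, with negative keys for runs of decreases.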
  • 13. Oddly Even: How long is an unusually long random walk?
        Run Length   Improbability
        1            50.10%   75.09%
        2            75.10%   93.79%
        3            87.57%   98.46%
        4            93.80%   99.62%
        5            96.91%   99.90%
        6            98.46%   99.98%
        7            99.23%   99.99%
        8            99.62%
        9            99.90%
        10           99.95%
  • 14. Oddly Even (same table as slide 13, with the callout “Two !” added next to run length 4)
  • 15. Oddly Even (same table, with the callouts “Two !” next to run length 4 and “Three !” next to run length 8)
  • 16. Oddly Even (same table and callouts, plus the label “Same thing, but two variables”)
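The awk one-liner in the speaker notes computes 100 - 0.499^n * 100 for each run length n, that is, how improbable a run of n same-direction steps is if every step is an independent coin toss. A Python equivalent is sketched below. The `variables` parameter is a hypothetical extension for the "two variables" case; it assumes the variables move independently with the same step probability, which the slides do not state.

```python
def run_improbability(run_length, p_step=0.499, variables=1):
    """Probability of NOT seeing `run_length` same-direction steps in a row,
    assuming independent steps that each go that way with probability p_step.
    With variables > 1, every variable must make its run at the same time."""
    return 1.0 - p_step ** (run_length * variables)

# One variable, run lengths 1 to 5.
for n in range(1, 6):
    print(n, f"{run_improbability(n):.2%}")
```

Under these assumptions, requiring a simultaneous run in a second variable reaches a given level of improbability at half the run length.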
  • 17. Houston, We Have an Opportunity: These techniques fall short of what’s needed
  • 18. God hath chosen the foolish things of the world to confound the wise [1 Cor 1:27]
  • 19. Unexpected Things Happen... but who says abnormal is bad?
  • 20. Bottleneck Detection: Find abnormalities, then determine whether they are system faults
  • 21. Metrics That Matter: Throughput, concurrency, and change, but not response time (why?)
  • 22. Algorithms: Various combinations of severity, directionality, run length, duration, and more
  • 23. Out of Bounds: How often each algorithm detected abnormalities
  • 24. Drilling Down: One of the algorithms triggered at the gray line
  • 25. Another Example: This one from a more selective algorithm
  • 26. Brownian Commotion: Just because it’s long and aimed the right way doesn’t mean it’s scary
  • 27. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup [Chart legend: QPS, Cxn, Runs]
  • 28. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup [Chart legend: QPS, Cxn, Runs]
  • 29. A New Approach: Combinations of algorithms to avoid run-based false positives
  • 30. I’d Like Fries With That: And supersize my Threads_running, please
  • 31. Why Not Response Time? We are legion and we want your carat patch
  • 32. Mass Application: Does it generalize to many different settings?
  • 33. Dataset #2: Looks reasonable on a different workload, so far
  • 34. Stall Detection In Action: Sorry, no witty comment here
  • 35. Oh Crud. What does “bad” mean on a workload like this?
  • 36. He taketh the wise in their own craftiness. [1 Cor 3:19]
  • 37. Different Like Everybody Else: Variance-to-mean ratio / index of dispersion to the rescue?
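The variance-to-mean ratio named on slide 37 is the variance of a set of measurements divided by their mean. A minimal Python sketch follows; the function name and sample values are made up, and the use of population variance rather than sample variance is an assumption.

```python
from statistics import mean, pvariance

def index_of_dispersion(window):
    """Variance-to-mean ratio (V:M) of a window of measurements."""
    mu = mean(window)
    if mu == 0:
        raise ValueError("V:M is undefined when the mean is zero")
    return pvariance(window) / mu

print(index_of_dispersion([4, 4, 4, 4]))    # perfectly steady window
print(index_of_dispersion([1, 9, 2, 12]))   # bursty window, same order of magnitude
```

Because the variance is divided by the mean, the ratio gives a way to compare dispersion across workloads that run at different throughput levels.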
  • 38. Still Life With Purple Line: If you think this is art, I am happy to sell it to you
  • 39. When In Doubt, Get Empirical: Measure the 99.7th percentile V:M on a well-behaved dataset. Good enough for government work.
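The empirical approach on slide 39 might look like the sketch below: compute V:M over sliding windows of a well-behaved dataset, then take the 99.7th percentile as a cutoff. The window size, the nearest-rank percentile method, and all function names are assumptions; the slide gives only the percentile.

```python
import math
from statistics import mean, pvariance

def rolling_vm(series, window):
    """V:M ratio for each full sliding window over `series`.
    Assumes every window has a non-zero mean."""
    out = []
    for i in range(len(series) - window + 1):
        chunk = series[i:i + window]
        out.append(pvariance(chunk) / mean(chunk))
    return out

def empirical_threshold(values, pct=99.7):
    """Nearest-rank percentile of `values` (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up "well-behaved" samples; a real dataset would be far longer.
good_qps = [100, 103, 98, 101, 97, 102, 99, 100, 104, 96, 101, 99]
threshold = empirical_threshold(rolling_vm(good_qps, window=4))
print(threshold)
```

Windows on another workload whose V:M exceeds `threshold` would then be candidates for closer inspection.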
  • 40. Workload #1 Redux: Still finds lots of the same “bad spots” with the V:M ratio filter
  • 41. If you can read this, flip me over! Workload #2 With New Filter
  • 42. Conclusions
      ✤ Workloads differ. GIGO.
      ✤ Unlikely events are not necessarily bad.
      ✤ Naive techniques fail; more sophisticated methods are required.
      ✤ Thresholds are too simplistic, but reappear.
      ✤ Common sense beats elaborate math.
      ✤ Automated problem detection is a good thing?
  • 43. Jobs
      ✤ Want to work and live in America’s #1 City?
      ✤ I am hiring DBAs and developers. Talk to me.
  • 44. Questions? @xaprb • http://www.linkedin.com/in/xaprb
  • 45. Image Credits
      ✤ http://www.flickr.com/photos/katej/853418592/
      ✤ http://www.flickr.com/photos/domesticat/2963393184/
      ✤ http://www.flickr.com/photos/markho/481969187/
      ✤ http://www.flickr.com/photos/exquisitur/3502317741/
      ✤ http://www.flickr.com/photos/calleephoto/4952091078/
      ✤ http://www.flickr.com/photos/paperpariah/4150220583/
      ✤ http://www.flickr.com/photos/stevewall/6057281066/
      ✤ http://www.flickr.com/photos/mybloodyself/5879425774/
      ✤ http://www.flickr.com/photos/josephrobertson/92849605/
      ✤ http://en.wikipedia.org/