Finger pointing


As Boundary changes the game with second-by-second application monitoring, this will sometimes affect how you apply your problem analysis steps. Perhaps things can change.

Published in: Technology, Health & Medicine

  1. Finger Pointing, by Mahendra Kutare (Twitter: @imaxxs)
  2. FingerPointing? Finger pointing is a way through which humans communicate emotions of urgency, surprise, joy, acknowledgment, achievement, blame, frustration, fear and more.
  3. FingerPointing? Some do it with one. Some need two.
  4. FingerPointing? Some do it with one. Some need two.
  5. Systems FingerPointing? Some do it everywhere...
  6. Human Computer FingerPointing? Some do it with...
  7. Systems Control Loop [diagram; stage labels: Monitor, Collect Info, Analysis (Local, Global), Act, Recover; interval labels: Time to Collect, Time to Detect/Analyze, Time to Recover]
  8. Systems Control Loop [diagram; stage labels: Meter, Collector, Engine (Local, Global), Recover; interval labels: Time to Collect, Time to Detect/Analyze, Time to Recover]
  9. Problem Determination: Detection - identifies violations or anomalies. Diagnosis - analyzes violations or anomalies. Remediation - recovers the system to a normal state.
  10. Detection: Threshold, Signature, Anomaly
  11. Detection: Thresholds - matching a single value/predicate. Signature - matching faults against known fault signatures; it can detect a set of known faults. Anomalies - learn to recognize the normal runtime behavior; it can detect previously unseen faults.
  12. Aniketos: No use of statistical machine learning. Uses computational geometry - convex hull. Convex hull - an encompassing shape around a group of points. Works independently of whether metrics are correlated or not. (Stehle, Lynch, ICAC 2010)
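The convex-hull idea on slide 12 can be illustrated in two dimensions. This is only a minimal sketch, not the Aniketos implementation: it assumes exactly two metrics, builds the hull with the standard monotone-chain algorithm, and treats any point inside or on the hull as normal. The function names and the training points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Monotone-chain convex hull, returned in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def is_normal(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return p in hull  # degenerate hull: only exact matches count
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


# Hypothetical training observations of two metrics (A, B).
training = [(5, 10), (5, 20), (10, 20), (10, 10), (7, 15), (8, 12)]
hull = convex_hull(training)
print(is_normal(hull, (7, 15)))   # inside the hull
print(is_normal(hull, (15, 15)))  # outside the hull
```

Unlike a per-metric threshold, the hull follows the shape of the training data, so it does not matter whether the metrics are correlated.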
  13. Fault Detection
  14. Training Phase: No one knows when enough training data has been collected. If a system has an extensive test suite that represents normal behavior, then execution of the test suite will produce a good training dataset. Replay request logs of the production system on a test system.
  15. Bounded Box Example: Given two metrics A and B, if the safe range of A is 5 to 10 and B is 10 to 20, the normal behavior of the system can be represented as a 2D rectangle with vertices (5,10), (5,20), (10,20) and (10,10). Any datapoint that falls within that rectangle, for example (7,15), is classified as normal. Any datapoint that falls outside of the rectangle, for example (15,15), is classified as anomalous.
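The bounded box example from slide 15 can be written down directly. A minimal sketch, assuming the safe ranges are already known and stored per metric; the names `SAFE_RANGES` and `classify` are illustrative and not taken from the slides.

```python
from typing import Dict, Tuple

# Safe range per metric, as given in the example: A in [5, 10], B in [10, 20].
SAFE_RANGES: Dict[str, Tuple[float, float]] = {"A": (5, 10), "B": (10, 20)}


def classify(datapoint: Dict[str, float]) -> str:
    """Return 'normal' if every metric lies within its safe range."""
    for metric, (low, high) in SAFE_RANGES.items():
        if not low <= datapoint[metric] <= high:
            return "anomalous"
    return "normal"


print(classify({"A": 7, "B": 15}))   # normal
print(classify({"A": 15, "B": 15}))  # anomalous
```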
  16. Detection Phase
  17. Egress/Ingress Data: volume_1s_meter_ip query, 6000 data points
  18. Egress/Ingress Data: volume_1s_meter_ip query, 150,000 data points
  19. Fault Detection Comparison: Maximum fault coverage, with a tradeoff in false positives
  20. Diagnosis: Dependency Inference, Correlation Analysis, Peer Analysis
  21. E2EProf: Useful for debugging distributed systems of black boxes. (Sandeep et al., DSN 2007)
  22. Service Paths: Client requests take different “paths” through the software, invoking dynamic dependencies across distributed systems. The ensemble of paths taken by client requests is called “Service Paths”. Key idea - convert message traces per service node to per-edge signals and compute cross-correlations of these signals.
  23. Path Discovery: A request path VC1->VS1->VS2->VS4. Collect timestamp and source/dest IP at each VS node. Calculate the cross-correlation between time series signals across VS nodes. If the cross-correlation has a spike at a phase lag equal to the latency between nodes, there exists a path/edge between the VS nodes.
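The path discovery step on slide 23 can be sketched as a lagged correlation between two per-node message-count signals. This is a simplified illustration, not E2EProf's algorithm: it assumes messages have already been binned into fixed time slots, uses plain Pearson correlation over the overlapping window at each lag, and uses a hypothetical spike threshold of 0.8. All names and the synthetic signals are assumptions.

```python
import random
from typing import List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def best_lag(upstream: List[float], downstream: List[float],
             max_lag: int) -> Tuple[int, float]:
    """Return (lag, correlation) where downstream best matches delayed upstream."""
    scores = []
    for lag in range(max_lag + 1):
        # Compare upstream[t] with downstream[t + lag] over the overlap.
        xs = upstream[: len(upstream) - lag]
        ys = downstream[lag:]
        scores.append((pearson(xs, ys), lag))
    peak, lag = max(scores)
    return lag, peak


def has_edge(upstream: List[float], downstream: List[float],
             max_lag: int = 10, threshold: float = 0.8) -> bool:
    """Infer an edge if the correlation spike exceeds the (assumed) threshold."""
    _, peak = best_lag(upstream, downstream, max_lag)
    return peak >= threshold


# Synthetic message counts per time slot at three hypothetical nodes.
rng = random.Random(0)
vs1 = [float(rng.randint(0, 9)) for _ in range(200)]
vs2 = [0.0, 0.0, 0.0] + vs1[:-3]        # VS2 sees VS1's traffic 3 slots later
vs3 = [float(random.Random(1).randint(0, 9)) for _ in range(200)]  # unrelated

print(best_lag(vs1, vs2, max_lag=10)[0])  # 3, the latency between VS1 and VS2
print(has_edge(vs1, vs2), has_edge(vs1, vs3))
```

The lag at which the spike occurs doubles as an estimate of the latency between the two nodes.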
  24. App Vis: Network topology view. Augment with “service paths”??
  25. Remediation: Software Rejuvenation for Software Aging (Reactive - reboots, micro-reboots; Proactive - time or load based). Checkpointing and Recovery. Treating bugs as allergies.
  26. Software Aging: Patriot missiles, used during the Gulf War to destroy Iraq’s Scud missiles, used a computer whose software accumulated errors, i.e. software aging. The effect of aging in this case was misinterpretation of an incoming Scud as not a missile but just a false alarm, which resulted in the death of 28 US soldiers.
  27. Software Rejuvenation: Periodic preemptive rollback of continuously running applications to prevent failures in the future. Open - not based on feedback from the system: elapsed time, cumulative jobs in the system. Closed - based on some notion of system health: continuously monitor and analyze the estimated time to exhaustion of a resource. (Trivedi et al., Duke University)
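The closed (feedback-based) variant on slide 27 hinges on estimating the time to exhaustion of a resource. One simple way to do that, shown here only as a sketch, is to fit a straight line to recent usage samples and extrapolate to the capacity limit. The linear model, the function names and the sample values are assumptions for illustration, not the estimator used in the cited work.

```python
import math
from typing import List, Tuple


def time_to_exhaustion(samples: List[Tuple[float, float]],
                       capacity: float) -> float:
    """Estimate seconds until usage reaches capacity, via a least-squares line.

    samples is a list of (timestamp_seconds, resource_used) pairs.
    Returns math.inf when usage is flat or shrinking.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    if sxx == 0:
        return math.inf
    slope = sxy / sxx
    if slope <= 0:
        return math.inf
    intercept = mean_u - slope * mean_t
    exhausted_at = (capacity - intercept) / slope
    return max(0.0, exhausted_at - samples[-1][0])


def should_rejuvenate(samples: List[Tuple[float, float]],
                      capacity: float, horizon: float) -> bool:
    """Trigger rejuvenation if exhaustion is predicted within the horizon."""
    return time_to_exhaustion(samples, capacity) <= horizon


# Hypothetical memory readings (MB) that grow by 2 MB per second.
leak = [(0, 100), (10, 120), (20, 140), (30, 160)]
print(time_to_exhaustion(leak, capacity=200))        # 20.0
print(should_rejuvenate(leak, capacity=200, horizon=30))
```

An open policy would ignore these measurements and restart on a fixed schedule or after a fixed number of jobs instead.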
  28. Apache Web Server: MaxRequestsPerChild - If this value is set to a positive value, then the parent process of Apache kills a child process as soon as MaxRequestsPerChild requests have been handled by this child process. By doing this, Apache limits “the amount of memory a process can consume by accidental memory leak” and “helps reduce the number of processes when server load reduces.”
  29. Treating Bugs as Allergies: Inspired by allergy treatment in real life. If you are allergic to milk, remove dairy products from your diet. Roll back the program to a recent checkpoint when a bug is detected, dynamically change the execution environment based on failure symptoms, and then re-execute the program in the modified environment. (Qin et al., SOSP 2005)
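The rollback, change-environment, re-execute cycle described on slide 29 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Rx system: the checkpoint is a deep copy of an in-memory dictionary, the "environment" is a plain dictionary of settings, and `buggy_handler`, `run_with_rx` and the `padding` setting are hypothetical names invented for the example.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Env = Dict[str, int]


def buggy_handler(state: State, request: str, env: Env) -> None:
    """Hypothetical handler that overflows an 8-byte buffer unless it is padded."""
    state["in_progress"] = request            # side effect before the failure
    capacity = 8 + env.get("padding", 0)
    if len(request) > capacity:
        raise MemoryError("buffer overflow")
    state["handled"].append(request)
    state["in_progress"] = None


def run_with_rx(state: State, request: str,
                handler: Callable[[State, str, Env], None],
                env_changes: List[Env]) -> Tuple[State, Env]:
    """Try the default environment, then each change, rolling back in between."""
    checkpoint = copy.deepcopy(state)
    for env in [{}] + list(env_changes):
        attempt = copy.deepcopy(checkpoint)   # roll back to the checkpoint
        try:
            handler(attempt, request, env)    # re-execute in this environment
            return attempt, env
        except Exception:
            continue                          # failure: try the next change
    raise RuntimeError("no environment change avoided the failure")


state: State = {"handled": [], "in_progress": None}
new_state, env_used = run_with_rx(
    state, "x" * 12, buggy_handler,
    env_changes=[{"zero_fill": 1}, {"padding": 8}],
)
print(env_used)              # {'padding': 8}
print(new_state["handled"])  # ['xxxxxxxxxxxx']
```

Because every attempt starts from a fresh copy of the checkpoint, the partial side effect left by a failed attempt never leaks into the state that is finally returned.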
  30. Treating Bugs As Allergies
  31. Examples: Uninitialized reads may be avoided if every newly allocated buffer is filled with zeros. Data races can be avoided by changing timing-related events such as thread scheduling and asynchronous events.
  32. Environment Changes
  33. Comparison of Rx and Alternative Approaches: For systems where a reboot (~5 sec) is not good enough. Checkpoint and replay, bounded by reboot time (~5 sec).