Detection
Thresholds - Matching a single value/predicate.
Signature - Matching faults with known fault signatures. It can detect a set of known faults.
Anomalies - Learn to recognize the normal runtime behavior. It can detect previously unseen faults.
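To make the difference between the first two approaches concrete, here is a minimal sketch. The metric names, limits, and the "memory_leak" signature are hypothetical, chosen only for illustration.

```python
def threshold_faults(metrics, limits):
    """Threshold detection: flag each metric that exceeds its single limit."""
    return [name for name, limit in limits.items()
            if metrics.get(name, 0) > limit]


def signature_faults(metrics, signatures):
    """Signature detection: flag each known fault whose predicate matches."""
    return [fault for fault, predicate in signatures.items()
            if predicate(metrics)]


# Hypothetical snapshot of runtime metrics.
snapshot = {"cpu": 0.97, "heap_mb": 900, "threads": 40}

# One limit per metric.
limits = {"cpu": 0.90, "heap_mb": 1024}

# A signature combines several symptoms of a fault that has been seen before.
signatures = {
    "memory_leak": lambda m: m["heap_mb"] > 800 and m["threads"] < 50,
}

exceeded = threshold_faults(snapshot, limits)
matched = signature_faults(snapshot, signatures)
```

Neither function can report a fault that has no limit or signature written for it, which is the gap anomaly detection tries to close.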
Aniketos
No use of statistical machine learning.
Uses computational geometry - convex hull.
Convex hull - The smallest convex shape that encloses a group of points.
Works regardless of whether the metrics are correlated.
Stehle, Lynch et al., ICAC 2010
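A minimal 2D sketch of the convex hull idea follows. It is not the Aniketos implementation: it assumes only two metrics, uses Andrew's monotone chain algorithm to build the hull, and the training points are made up.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_normal(hull, point):
    """A point is normal if it lies inside or on the boundary of the hull."""
    n = len(hull)
    if n < 3:
        return point in hull
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


# Hypothetical training data: two metrics that rise together.
training = [(1, 1), (2, 3), (3, 2), (4, 5), (5, 4), (6, 6)]
hull = convex_hull(training)

inside = is_normal(hull, (3, 3))   # follows the trend seen in training
outside = is_normal(hull, (1, 6))  # within each metric's range, but off-trend
```

The point (1, 6) lies within the observed range of each metric taken separately, yet falls outside the hull, which is why the hull handles correlated metrics more tightly than per-metric ranges.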
Training Phase
No one knows when enough training data has been collected.
If a system has an extensive test suite that represents normal behavior, then executing the test suite will produce a good training dataset.
Replay the request logs of the production system on a test system.
Bounding Box Example
Given two metrics A and B, if the safe range of A is 5 to 10 and the safe range of B is 10 to 20, the normal behavior of the system can be represented as a 2D rectangle with vertices (5,10), (5,20), (10,20) and (10,10).
Any datapoint that falls within that rectangle, for example (7,15), is classified as normal.
Any datapoint that falls outside of the rectangle, for example (15,15), is classified as anomalous.
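The rectangle test in this example reduces to one range check per metric. A minimal sketch, using the ranges from the slide (the function names are illustrative):

```python
def fit_bounds(samples):
    """Learn a (min, max) range per metric from training samples."""
    return [(min(column), max(column)) for column in zip(*samples)]


def in_box(bounds, point):
    """A point is normal if every metric lies within its learned range."""
    return all(low <= value <= high
               for (low, high), value in zip(bounds, point))


# Safe ranges from the example: A in [5, 10], B in [10, 20].
bounds = [(5, 10), (10, 20)]

normal_point = in_box(bounds, (7, 15))      # inside the rectangle
anomalous_point = in_box(bounds, (15, 15))  # A is out of range
```

`fit_bounds` shows how the same ranges could be learned from training data instead of being written by hand.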
E2EProf
Useful for debugging distributed systems of black boxes.
Sandip Agarwala et al., DSN 2007
Service Paths
Client requests take different “paths” through the software, invoking dynamic dependencies across distributed systems.
Ensemble of paths taken by client requests - “Service Paths”
Key idea - Convert message traces per service node to per-edge signals and compute cross-correlations of these signals.
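One way to picture the conversion from message traces to per-edge signals is to count messages per (source, destination) pair in fixed time bins. This is a sketch under assumptions: the trace format, bin width, and node names are hypothetical, and E2EProf's actual signal construction may differ.

```python
from collections import defaultdict


def edge_signals(trace, bin_width, num_bins):
    """Turn (timestamp, src, dst) records into one count series per edge."""
    signals = defaultdict(lambda: [0] * num_bins)
    for timestamp, src, dst in trace:
        index = int(timestamp // bin_width)
        if 0 <= index < num_bins:
            signals[(src, dst)][index] += 1
    return dict(signals)


# Hypothetical trace: timestamps in seconds, node names as labels.
trace = [
    (0.1, "VC1", "VS1"),
    (0.4, "VC1", "VS1"),
    (1.2, "VS1", "VS2"),
    (2.5, "VC1", "VS1"),
]

signals = edge_signals(trace, bin_width=1.0, num_bins=3)
```

Each resulting list is a time series that can be cross-correlated with the series of another edge.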
Path Discovery
A request path: VC1->VS1->VS2->VS4
Collect the timestamp and source/dest IP at each VS node.
Calculate the cross-correlation between time series signals across VS nodes.
If the cross-correlation has a spike at a phase lag equal to the latency between nodes, there exists a path/edge between the VS nodes.
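The spike test can be sketched with a normalized cross-correlation over a small range of lags. The signals, the maximum lag, and the 0.8 threshold are assumptions for illustration; the downstream signal is simply the upstream one delayed by two bins.

```python
def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of x against y for lags 0..max_lag."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    result = []
    for lag in range(max_lag + 1):
        total = sum(dx[t] * dy[t + lag] for t in range(n - lag))
        result.append(total / norm if norm else 0.0)
    return result


def detect_edge(x, y, max_lag, threshold=0.8):
    """Return (lag, correlation) if there is a spike, otherwise None."""
    corr = cross_correlation(x, y, max_lag)
    best = max(range(len(corr)), key=corr.__getitem__)
    return (best, corr[best]) if corr[best] >= threshold else None


# Hypothetical message counts per time bin at an upstream node.
upstream = [0, 3, 0, 0, 5, 0, 1, 0, 0, 4, 0, 0, 2, 0, 0, 0, 6, 0, 0, 0]
# The downstream node sees the same traffic two bins later.
downstream = [0, 0] + upstream[:-2]

edge = detect_edge(upstream, downstream, max_lag=5)
```

Here the spike appears at lag 2, so the sketch reports an edge with a latency of two bins. A signal with no relationship to the upstream traffic produces no spike and therefore no edge.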
App Vis
Network topology view
Augment with “service paths”??
Remediation
Software Rejuvenation for Software Aging
  Reactive - Reboots, Micro Reboots
  Proactive - Time or load based
Checkpointing and Recovery
Treating bugs as allergies
Software Aging
Patriot missiles, used during the Gulf War to destroy Iraq's Scud missiles, relied on a computer whose software accumulated errors, i.e., software aging.
The effect of aging in this case was the misinterpretation of an incoming Scud as not a missile but just a false alarm, which resulted in the death of 28 US soldiers.
Software Rejuvenation
Periodic preemptive rollback of continuously running applications to prevent failures in the future.
Open - Not based on feedback from the system - Elapsed time, cumulative jobs in the system.
Closed - Based on some notion of system health. Continuously monitor and analyze the estimated time to exhaustion of a resource.
Trivedi et al., Duke University.
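The two policies can be sketched as follows. The closed policy here estimates time to exhaustion with an ordinary least-squares trend line, which is an assumption made for this sketch and not necessarily the estimator used by Trivedi et al. The sample values and limits are made up.

```python
def open_loop_due(elapsed_seconds, jobs_done, max_elapsed, max_jobs):
    """Open policy: rejuvenate on elapsed time or job count, no feedback."""
    return elapsed_seconds >= max_elapsed or jobs_done >= max_jobs


def time_to_exhaustion(samples, capacity):
    """Estimate seconds from the last sample until usage reaches capacity.

    samples is a list of (time_seconds, usage) pairs. Returns None when
    there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    exhaustion_time = (capacity - intercept) / slope
    return max(0.0, exhaustion_time - samples[-1][0])


def closed_loop_due(samples, capacity, safety_margin):
    """Closed policy: rejuvenate when exhaustion is within the margin."""
    eta = time_to_exhaustion(samples, capacity)
    return eta is not None and eta <= safety_margin


# Hypothetical memory usage in MB, sampled once a minute, capacity 200 MB.
usage = [(0, 100), (60, 110), (120, 120), (180, 130)]
eta = time_to_exhaustion(usage, capacity=200)
```

With these samples usage grows by 10 MB per minute, so the estimate is about 420 seconds after the last sample.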
Apache Web Server
MaxRequestsPerChild - If this is set to a positive value, the Apache parent process kills a child process as soon as that child has handled MaxRequestsPerChild requests.
By doing this, Apache limits “the amount of memory a process can consume by accidental memory leak” and “helps reduce the number of processes when the server load reduces.”
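The effect of such a request limit can be shown with a toy model. This is not Apache code: the `Parent` and `Worker` classes and the fixed leak per request are hypothetical, and the model uses a single child for simplicity.

```python
class Worker:
    """A child process that leaks a little memory on every request."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.handled = 0
        self.leaked_bytes = 0

    def handle(self, leak_per_request):
        self.handled += 1
        self.leaked_bytes += leak_per_request


class Parent:
    """Replaces its child after max_requests_per_child requests (0 = never)."""

    def __init__(self, max_requests_per_child):
        self.max_requests_per_child = max_requests_per_child
        self.spawned = 0
        self.peak_leak = 0
        self.child = self._spawn()

    def _spawn(self):
        self.spawned += 1
        return Worker(self.spawned)

    def serve(self, leak_per_request=1024):
        self.child.handle(leak_per_request)
        self.peak_leak = max(self.peak_leak, self.child.leaked_bytes)
        limit = self.max_requests_per_child
        if limit > 0 and self.child.handled >= limit:
            # Discard the old child and its leaked memory, start a fresh one.
            self.child = self._spawn()


recycling = Parent(max_requests_per_child=100)
never = Parent(max_requests_per_child=0)
for _ in range(250):
    recycling.serve()
    never.serve()
```

After 250 requests the recycling parent has bounded the leak at 100 requests' worth per child, while the parent with no limit keeps accumulating it in one long-lived child.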
Treating Bugs as Allergies
Inspired by allergy treatment in real life. If you are allergic to milk, remove dairy products from your diet.
Roll back the program to a recent checkpoint when a bug is detected, dynamically change the execution environment based on failure symptoms, and then re-execute the program in the modified environment.
Qin et al., SOSP 2005
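The rollback and re-execute loop can be sketched as below. This is an illustration of the idea, not the system described by Qin et al.: the checkpoint is a deep copy of a dictionary, the "environment" is a dictionary of options, and the buggy step with its off-by-one overflow is hypothetical.

```python
import copy


def run_with_recovery(step, state, environments):
    """Run step from a checkpoint, retrying under each environment in turn.

    Returns (result, environment) for the first environment that survives.
    """
    checkpoint = copy.deepcopy(state)
    for env in environments:
        attempt = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            return step(attempt, env), env
        except Exception:
            continue  # failure detected: try the next environment change
    raise RuntimeError("no environment change avoided the failure")


def buggy_copy(state, env):
    """Hypothetical step with an off-by-one buffer overflow."""
    buffer = [0] * (state["size"] + env.get("padding", 0))
    for i in range(state["size"] + 1):  # writes one element too many
        buffer[i] = i
    state["done"] = True
    return state


original = {"size": 4}
# First try the unmodified environment, then one that pads allocations.
environments = [{}, {"padding": 8}]
result, used_env = run_with_recovery(buggy_copy, original, environments)
```

The first attempt fails with an overflow, the state is rolled back, and the second attempt succeeds because the padded buffer absorbs the extra write. The bug is still present; only the environment changed.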