Finger Pointing
Mahendra Kutare
mahendra@boundary.com
twitter - @imaxxs
FingerPointing is a way through
which humans communicate
emotions of urgency, surprise, joy,
acknowledgment, achievement,
blame, frustration, fear and more.
FingerPointing ?
FingerPointing ?
Some do it with one..
Some need two..
Systems FingerPointing ?
Some do it everywhere...
Human Computer FingerPointing ?
Some do it with....
Systems Control Loop
[Diagram labels: Monitor, Collect, Analysis, Recover; Info, Act; Local, Global; Time to Collect, Time to Detect/Analyze, Time to Recover]
Systems Control Loop
[Diagram labels: Meter, Collector, Engine, Recover; Local, Global; Time to Collect, Time to Detect/Analyze, Time to Recover]
Problem Determination
Detection - Identifies violations or
anomalies.
Diagnosis - Analyzes violations or
anomalies.
Remediation - Recovers the
system to normal state
Detection
Threshold
Signature
Anomaly
Detection
Thresholds - Matching a single value/predicate.
Signature - Matching faults with known fault
signatures. It can detect a set of known faults.
Anomalies - Learn to recognize the normal
runtime behavior. It can detect previously
unseen faults.
Aniketos
No use of statistical machine learning.
Uses computational geometry - convex hull.
Convex hull - Encompassing shape around a
group of points.
Works independent of whether metrics are
correlated or not.
Stehle, Lynch et al., ICAC 2010
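The convex hull idea can be sketched in a few lines. This is a minimal 2D illustration only, not the Aniketos implementation: the function names, the training points, and the restriction to two metrics are all assumptions made for the example, and it expects at least three non-collinear training points.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def is_normal(hull, p):
    """A point is normal if it lies inside or on the hull of the training data."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

# Hypothetical training data: two correlated metrics forming a diagonal band.
training = [(0, 0), (2, 0), (10, 8), (10, 10), (8, 10), (0, 2), (5, 5)]
hull = convex_hull(training)

print(is_normal(hull, (5, 5)))  # True: inside the band
print(is_normal(hull, (9, 1)))  # False: inside the per-metric ranges, outside the hull
```

The second point falls within the range of each metric taken separately, so a per-metric bounding box would accept it, while the hull around the observed points rejects it.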
Fault Detection
Training Phase
No one knows when enough training data is
collected.
If a system has an extensive test suite that
represents normal behavior, then executing
the test suite will produce a good training
dataset.
Replay request logs of production system on
test system.
Bounded Box Example
Given two metrics A and B, if the safe range of A
is 5 to 10 and B is 10 to 20, the normal behavior of
the system can be represented as a 2D rectangle
with vertices (5,10), (5,20), (10,20) and (10,10).
Any datapoint that falls within that rectangle, for
example (7,15), is classified as normal.
Any datapoint that falls outside of the rectangle,
for example (15,15), is classified as anomalous.
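The bounded box check above can be written as a short sketch. The ranges and the two test points come from the slide; the function name and the choice to treat boundary values as normal are assumptions.

```python
# Safe ranges from the example: A in [5, 10], B in [10, 20].
SAFE_RANGES = {"A": (5, 10), "B": (10, 20)}

def classify(point, ranges=SAFE_RANGES):
    """Return 'normal' if every metric is inside its safe range, else 'anomalous'."""
    for metric, (low, high) in ranges.items():
        # Assumption: values exactly on the boundary count as normal.
        if not (low <= point[metric] <= high):
            return "anomalous"
    return "normal"

print(classify({"A": 7, "B": 15}))   # normal
print(classify({"A": 15, "B": 15}))  # anomalous
```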
Detection Phase
Egress/Ingress Data
volume_1s_meter_ip query, 6000 data points
Egress/Ingress Data
volume_1s_meter_ip query, 150,000 data points
Fault Detection Comparison
Maximizing fault coverage comes at the cost of more false positives
Diagnosis
Dependency Inference
Correlation Analysis
Peer Analysis
E2EProf
Sandeep et al., DSN 2007
Useful for debugging distributed systems of black boxes.
Service Paths
Client requests take different “paths” through the
software invoking dynamic dependencies across
distributed systems. Ensemble of paths taken by
client requests - “Service Paths”
Key idea - Convert message traces per service
node to per edge signals and compute cross
correlations of these signals.
Path Discovery
A request path VC1->VS1->VS2->VS4
Collect timestamp and source/dest IP at each VS
node.
Calculate the cross correlation between time
series signals across VS nodes.
If the cross correlation has a spike at a phase
lag = latency between nodes, there exists a
path/edge between the VS nodes.
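A rough sketch of the lag-spike idea follows. It is a plain cross-correlation over two per-edge message-count signals, not the E2EProf algorithm itself; the signals, the bucket size, and the function names are assumptions made for illustration.

```python
def cross_correlation(x, y, max_lag):
    """Mean-centered cross correlation of x against y for lags 0..max_lag."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    result = {}
    for lag in range(max_lag + 1):
        result[lag] = sum(
            (x[t] - mean_x) * (y[t + lag] - mean_y) for t in range(len(x) - lag)
        )
    return result

def estimate_lag(x, y, max_lag):
    """Return the lag at which the cross correlation peaks."""
    corr = cross_correlation(x, y, max_lag)
    return max(corr, key=corr.get)

# Hypothetical message counts per time bucket on the edge into VS1 ...
edge_in = [0, 5, 0, 0, 2, 0, 7, 0, 0, 1, 0, 0, 4, 0, 0, 0]
# ... and on the edge out of VS1, delayed by three buckets.
edge_out = [0, 0, 0] + edge_in[:-3]

print(estimate_lag(edge_in, edge_out, max_lag=6))  # 3
```

A clear peak at a non-zero lag suggests that traffic on the second edge is caused by traffic on the first, with the lag approximating the latency between the nodes.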
App Vis
Network topology view
Augment with “service paths” ??
Remediation
Software Rejuvenation for Software Aging
Reactive - Reboots, Micro Reboots
Proactive - Time or load based
Checkpointing and Recovery
Treating bugs as allergies
Software Aging
Patriot missiles, used during the Gulf War to
destroy Iraq’s Scud missiles, relied on a computer
whose software accumulated errors over time, i.e.
software aging.
The effect of aging in this case was the
misinterpretation of an incoming Scud as not a
missile but just a false alarm, which resulted
in the death of 28 US soldiers.
Software Rejuvenation
Periodic preemptive rollback of continuously running
applications to prevent failures in the future.
Open - Not based on feedback from the system -
Elapsed Time, Cumulative jobs in system
Closed - Based on some notion of system health.
Continuously monitor, analyze the estimated time to
exhaustion of a resource.
Trivedi et al., Duke University.
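The closed-loop policy can be sketched as a trend estimate over resource usage samples. The least-squares line used here is a stand-in assumption, not the estimator from the cited work, and the sample values, capacity, and function names are hypothetical.

```python
def time_to_exhaustion(samples, capacity):
    """Estimate seconds until usage reaches capacity, measured from the last sample.

    samples: list of (time_seconds, used_units) pairs.
    Returns None when usage is not growing.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return t_full - samples[-1][0]

def should_rejuvenate(samples, capacity, safety_margin_s):
    """Closed-loop decision: rejuvenate when exhaustion is closer than the margin."""
    remaining = time_to_exhaustion(samples, capacity)
    return remaining is not None and remaining <= safety_margin_s

# Hypothetical memory usage (MB) sampled once a minute, growing by 30 MB/min.
usage = [(0, 100), (60, 130), (120, 160), (180, 190)]

print(time_to_exhaustion(usage, capacity=1000))  # 1620.0
print(should_rejuvenate(usage, capacity=1000, safety_margin_s=600))  # False
```

An open-loop policy would skip the estimate entirely and restart on a fixed timer or job count.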
Apache Web Server
MaxRequestsPerChild - If this is set to a
positive value, the parent process of Apache
kills a child process as soon as that child has
handled MaxRequestsPerChild requests.
By doing this, Apache limits “the amount
of memory a process can consume by
accidental memory leak” and “helps reduce
the number of processes when server load
reduces.”
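The effect of recycling a worker after a fixed number of requests can be shown with a toy simulation. This is not Apache code: the Worker class, the one-unit leak per request, and the serve function are hypothetical and exist only to show how the limit bounds the accumulated leak.

```python
class Worker:
    """Toy child process that leaks a fixed amount on every request."""

    LEAK_PER_REQUEST = 1  # hypothetical leak, in arbitrary units

    def __init__(self):
        self.handled = 0
        self.leaked = 0

    def handle(self, request):
        self.handled += 1
        self.leaked += self.LEAK_PER_REQUEST

def serve(requests, max_requests_per_child):
    """Serve requests with one worker, replacing it once it reaches the limit.

    A limit of 0 means the worker is never recycled.
    Returns (peak leak of any single worker, number of restarts).
    """
    worker = Worker()
    peak_leak = 0
    restarts = 0
    for request in requests:
        worker.handle(request)
        peak_leak = max(peak_leak, worker.leaked)
        if 0 < max_requests_per_child <= worker.handled:
            worker = Worker()  # parent replaces the child; leaked memory is freed
            restarts += 1
    return peak_leak, restarts

print(serve(range(1000), max_requests_per_child=100))  # (100, 10)
print(serve(range(1000), max_requests_per_child=0))    # (1000, 0)
```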
Treating Bugs as Allergies
Inspired by allergy treatment in real life. If
you are allergic to milk, remove dairy
products from your diet.
Roll back the program to a recent checkpoint
when a bug is detected, dynamically change
the execution environment based on failure
symptoms, and then re-execute the program
in the modified environment.
Qin et al., SOSP 2005
Treating Bugs As Allergies
Examples
Uninitialized reads may be avoided if every
newly allocated buffer is filled with zeros.
Data races can be avoided by changing timing-
related events such as thread scheduling and
asynchronous event delivery.
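The checkpoint, change environment, re-execute cycle can be sketched as follows. This is an illustration of the idea only, not the Rx system: the simulated allocator, the zero_fill flag, and the run_with_rx helper are assumptions, and a failed run is detected simply by catching an exception.

```python
import copy

def allocate(size, env):
    """Simulated allocator: cells hold garbage (None) unless zero-fill is enabled."""
    return [0] * size if env.get("zero_fill") else [None] * size

def program(state, env):
    """Buggy program: writes one cell, then reads cells it never initialized."""
    buf = allocate(4, env)
    buf[0] = state["input"]
    return sum(buf)  # raises TypeError when garbage cells are read

def run_with_rx(program, state, environments):
    """Try each environment in turn, rolling back to the checkpoint before each try."""
    checkpoint = copy.deepcopy(state)
    for env in environments:
        attempt_state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            return program(attempt_state, env), env
        except Exception:
            continue  # failure detected: move on to the next environment change
    raise RuntimeError("no environment change avoided the failure")

# First the default environment, then one with zero-filled allocations.
environments = [{}, {"zero_fill": True}]
result, env_used = run_with_rx(program, {"input": 3}, environments)

print(result, env_used)  # 3 {'zero_fill': True}
```

The default environment fails on the uninitialized read; the re-execution with zero-filled buffers succeeds without changing the program.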
Environment Changes
Comparison of Rx and
Alternative Approaches
For systems where reboot ~5sec is not good enough
Checkpoint, Replay bounded by reboot ~5sec
Fingerpointing in Large Scale Distributed Systems
