Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops


Network failures continue to plague datacenter
operators as their symptoms may not have
direct correlation with where or why they occur. We
introduce 007, a lightweight, always-on diagnosis application
that can find problematic links and also
pinpoint problems for each TCP connection. 007 is
completely contained within the end host. During
its two month deployment in a tier-1 datacenter, it
detected every problem found by previously deployed
monitoring tools while also finding the sources of
other problems previously undetected.


Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops

  1. Papers We Love Sept. 2018 007: Democratically Finding The Cause of Packet Drops Michael Kehoe, Staff Site Reliability Engineer NSDI '18: https://www.usenix.org/conference/nsdi18/presentation/arzani
  2. 007: Democratically Finding The Cause of Packet Drops Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang Liu, Jitu Padhye, Boon Thau Loo, Geoff Outhred
  3. Today’s agenda 1 Introduction & Motivation 2 TCP Monitoring Agent 3 Path Discovery Agent 4 Analysis Agent 5 Evaluations: Simulations 6 Evaluations: Production 7 Discussion
  4. Introduction & Motivation
  5. Introduction & Motivation “Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
  6. Introduction & Motivation • Pingmesh [1] • Leaves gaps • Overhead • Out-of-band
  7. Introduction & Motivation • Roy et al. [2] • Requires modifications to routers • Requires additional features on switches
  8. Introduction & Motivation • Everflow [3] • Requires all traffic to be captured
  9. Introduction & Motivation “In a network of ≥ 10^6 links it’s a reasonable assumption that there is a non-zero chance that a number (> 10) of these links are bad (due to device, port, or cable, etc.)… However, currently we do not have a direct way to correlate customer impact with bad links.”
  10. Introduction & Motivation “007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.”
  11. Introduction & Motivation 1. Does not require any changes to network infrastructure 2. Does not require any changes to client software 3. Detects in-band failures 4. Resilient to noise 5. Negligible overhead
  12. Assumptions DISCUSSION 1. L2 networks are not viable unless they: (a) support path discovery methods; (b) support EverFlow 2. No use of Source NATs (SNATs) 3. Assumes ECMP (L3) Clos network design 4. Don’t try to reverse-engineer ECMP
  13. Assumptions DISCUSSION
  14. Design Overview
  15. Design Overview • TCP monitoring agent: detects retransmissions at each end-host. • Path discovery agent: identifies the flow’s path to the Destination IP (DIP). • At the end-hosts, a voting scheme is used based on the paths of flows that had retransmissions. At regular intervals of 30s the votes are tallied by a centralized analysis agent to find the top-voted links.
  16. Design Overview • 6000 lines of C++ code • 600 KB memory usage • 1-3% CPU usage • 200 KB/s bandwidth utilization
  17. TCP Monitoring Agent
  18. TCP Monitoring Agent • TCP monitoring agent notifies the path discovery agent immediately after any retransmit • Use of ‘Event Tracing for Windows’ (ETW) • Could use BPF in Linux
  19. Path Discovery Agent
  20. Path Discovery Agent “The path discovery agent uses traceroute packets to find the path of flows that suffer retransmissions. These packets are used solely to identify the path of a flow. They do not need to be dropped for 007 to operate”
  21. Path Discovery Agent “Once the TCP monitoring agent notifies the path discovery agent that a flow has suffered a retransmission, the path discovery agent checks its cache of discovered path for that epoch… It then sends 15 appropriately crafted TCP packets with TTL values ranging from 1–15.”
  22. Path Discovery Agent ENGINEERING CHALLENGES – ECMP • ECMP algorithms are unknown • All packets of a given flow, defined by the five-tuple, follow the same path
  23. Path Discovery Agent ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS • Traceroute itself may fail • A lossy link may cause one or more BGP sessions to fail, triggering rerouting
  24. Path Discovery Agent ENGINEERING CHALLENGES – ROUTER ALIASING • Have a pre-mapped topology of: • Switch/router names • Router/interface IP addresses
  25. Analysis Agent
  26. Analysis Agent VOTING BASED SCHEME • Good votes are 0 • Bad votes are 1/h, where h is the number of hops on the path • Each link on the path is given a vote
  27. Analysis Agent [Figure: worked voting example on a four-node topology (nodes 1 to 4), with per-link tallies built up from votes of 1/2]
  28. Analysis Agent VOTING BASED SCHEME • Congestion & single drops are akin to noise • Single flow is unlikely to go through more than one failed link • Probability of errors in results diminishes exponentially with the number of flows
  29. Simulations
  30. Simulations PERFORMANCE • Accuracy: Proportion of correctly identified drop causes • Recall: How many of the failures are detected (false negatives) • Precision: How trusted are the results (false positives)
  31. Evaluation: Simulations PERFORMANCE: OPTIMAL CASE • 0.05–1% drop rate • Accuracy is > 96% • Recall and precision are almost always 100% https://github.com/behnazak/Vigil-007SourceCode
  32. Evaluation: Simulations PERFORMANCE: VARYING DROP RATES • Maintains accuracy for both single and multiple failures https://github.com/behnazak/Vigil-007SourceCode
  33. Evaluation: Simulations PERFORMANCE: IMPACT OF NOISE • Almost no impact https://github.com/behnazak/Vigil-007SourceCode
  34. Evaluation: Simulations PERFORMANCE: NUMBER OF CONNECTIONS • Almost no impact https://github.com/behnazak/Vigil-007SourceCode
  35. Evaluation: Simulations PERFORMANCE: TRAFFIC SKEWS • Can tolerate 50% skew • When TOR traffic >50% & >10 failures, accuracy suffers https://github.com/behnazak/Vigil-007SourceCode
  36. Evaluation: Simulations PERFORMANCE: BAD LINKS • 007 can detect up to 7 failures with accuracy > 90% https://github.com/behnazak/Vigil-007SourceCode
  37. Evaluation: Simulations PERFORMANCE: NETWORK SIZE • Single failure: • Accuracy >98% for up to 6 pods • Multiple failures: • Accuracy >98.01% for 30 failed links https://github.com/behnazak/Vigil-007SourceCode
  38. Evaluations: Production
  39. Evaluation: Production • 007 located the bad link correctly in 281 cases of VM reboot in the Microsoft DCN • Identifies an average of 0.45 ± 0.12 links as bad per epoch • Of links dropping packets: • 48%: Server to TOR • 24%: T1 to TOR
  40. Discussion
  41. Discussion • Congestion detection • Ranking with bias • Finding the cause of other problems • 007 can also be used for: • Detection of switch failures
  42. Questions?
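
The voting scheme from the Analysis Agent slides can be illustrated with a small sketch. This is not the authors' implementation (theirs is C++); it only assumes what the slides state: a flow with a retransmission casts a vote of 1/h on each link of its h-hop path, good flows cast 0, and links are ranked by their tally for the epoch. The function names and the link labels such as `ToR1-T1` are hypothetical.

```python
from collections import defaultdict


def tally_votes(bad_flow_paths):
    """Sum the votes for one epoch.

    Each path is a list of link names for a flow that saw a retransmission.
    A flow over h links adds 1/h to every link on its path; flows without
    retransmissions vote 0 and so are simply not passed in.
    """
    votes = defaultdict(float)
    for path in bad_flow_paths:
        if not path:
            continue  # path discovery failed, nothing to blame
        share = 1.0 / len(path)
        for link in path:
            votes[link] += share
    return dict(votes)


def rank_links(votes):
    """Return (link, vote) pairs, highest vote first; ties broken by name."""
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical epoch: three bad flows that all cross the link "ToR1-T1".
paths = [
    ["S1-ToR1", "ToR1-T1"],
    ["S2-ToR1", "ToR1-T1"],
    ["S3-ToR2", "ToR2-T1", "ToR1-T1"],
]
votes = tally_votes(paths)
ranking = rank_links(votes)
print(ranking[0][0])  # the most-blamed link in this epoch
```

Because the shared link collects a share from every flow while the other links collect from only one, it rises to the top of the ranking, which is the intuition behind treating isolated drops as noise.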

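
The recall and precision measures named on the Simulations slide can be sketched as follows, using the standard set-based definitions. The helper name, the link labels, and the convention of returning 1.0 when a set is empty are assumptions made for illustration; the slides do not specify them.

```python
def precision_recall(flagged, truly_bad):
    """Compare the links a detector flagged against the links that really failed.

    precision = true positives / all flagged links   (hurt by false positives)
    recall    = true positives / all truly bad links (hurt by false negatives)
    An empty denominator yields 1.0 here; that convention is an assumption.
    """
    flagged, truly_bad = set(flagged), set(truly_bad)
    true_positives = len(flagged & truly_bad)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(truly_bad) if truly_bad else 1.0
    return precision, recall


# Hypothetical run: three links flagged, four links actually bad, two in common.
precision, recall = precision_recall({"L1", "L2", "L3"}, {"L1", "L2", "L4", "L5"})
print(precision, recall)
```

In this made-up run, one flagged link is a false positive and two failures are missed, so precision is 2/3 and recall is 1/2.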