Arachne: How does Uber check the health of its Network Infrastructure every 10 seconds?

Achieving very high (>99.99%) reliability in any major network infrastructure (e.g. data center, enterprise, or campus) requires an always-on, active system that performs end-to-end functional testing of all network-connected infrastructure components. Such a system must monitor the infrastructure and the external services it depends on with high accuracy and granularity, down to the packet level, while consuming as few computational and network resources as possible.

For packet loss detection, the metrics reported by the original manufacturers cannot be relied upon: their tools may be buggy or, in most cases, provide no APIs for extracting measurements. We therefore needed to create our own tool; this is the gap Arachne fills.

In this talk, we present Arachne, a system for detecting packet loss and underperforming paths. It provides fast and easy active end-to-end functional testing of all the components in Data Center (DC) and Cloud infrastructures. Arachne detects intra-DC, inter-DC, DC-to-Cloud, and DC-to-External-Services issues while generating minimal traffic.
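
As a rough illustration of what packet-level loss detection involves, the sketch below computes a loss percentage per path from the number of probes sent and replies received, then flags paths above a threshold. The `PathStats` type, the `lossy_paths` helper, the 1% threshold, and the sample counts are all hypothetical; none of them comes from Arachne itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathStats:
    """Probe counters for one source/destination pair (hypothetical type)."""
    src: str
    dst: str
    sent: int
    received: int

    @property
    def loss_pct(self) -> float:
        # No probes sent means there is nothing to measure; report 0% loss.
        if self.sent == 0:
            return 0.0
        return 100.0 * (self.sent - self.received) / self.sent


def lossy_paths(stats: List[PathStats], threshold_pct: float = 1.0) -> List[PathStats]:
    """Return the paths whose loss exceeds the (assumed) threshold."""
    return [s for s in stats if s.loss_pct > threshold_pct]


# Made-up sample data: one healthy path and one path dropping a quarter of probes.
sample = [
    PathStats("dc1", "dc2", sent=64, received=64),
    PathStats("dc1", "cloud", sent=64, received=48),
]
flagged = lossy_paths(sample)
```

With the sample data, only the `dc1` to `cloud` path is flagged.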

  14. IP and TCP header layout (Protocol = 6, TCP):

           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |Version|  IHL  |Type of Service|          Total Length         |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |         Identification        |Flags|      Fragment Offset    |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       IP  |  Time to Live |Protocol=6(TCP)|         Header Checksum       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                       Source Address                          |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                    Destination Address                        |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                    Options                    |    Padding    |
        ---+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |          Source Port          |       Destination Port        |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                        Sequence Number                        |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                    Acknowledgment Number                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Data |           |U|A|P|R|S|F|                               |
       TCP | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
           |       |           |G|K|H|T|N|N|                               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           Checksum            |         Urgent Pointer        |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                  TCP Options                  |    Padding    |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                           TCP data                            |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  16. HTTP status codes:

        200: OK
        404: not found
        500: internal server error
        400: bad request

      Configuration (JSON):

        {
          "local": {
            "region": "us-west",
            "src_address": "",
            "interface_name": "eth0",
            "target_tcp_port": 44111,
            "timeout_msec": "200ms",
            "base_src_tcp_port": 31000,
            "num_src_tcp_ports": 64,
            "batch_interval_sec": "10s",
            "qos": "enabled",
            "resolve_dns": "enabled",
            "dns_servers_alternate": "8.8.8.8",
            "poll_orchestrator_interval_success": "2h",
            "poll_orchestrator_interval_failure": "15m"
          },
          "internal": [
            {"ip": "10.10.10.1"},
            {"host_name": "hadoop375.internal.servers"}
            ...
          ],
          "external": [
            {"host_name": "payments.externalservice.com"},
            {"host_name": "messaging.externalservice.com"}
            ...
          ]
        }
  18. Ring #0
  19. Ring #1
  20. Ring #2
  21. Ring #3
  22. Ring #4
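
The TCP header layout on slide 14 and the configuration on slide 16 can be tied together in a short sketch. It loads a trimmed copy of the configuration, derives one source port per entry in the configured range, and packs a 20-byte TCP header for each. Several things here are assumptions rather than facts from the slides: that the source ports run consecutively from `base_src_tcp_port`, that the probes set the SYN flag, and the helper names `parse_duration_ms` and `build_tcp_header`. The checksum is left at zero, so these bytes are not ready to send on the wire.

```python
import json
import re
import struct

# Trimmed copy of the slide's configuration; the elided entries are left out.
CONFIG_TEXT = """
{
  "local": {
    "target_tcp_port": 44111,
    "timeout_msec": "200ms",
    "base_src_tcp_port": 31000,
    "num_src_tcp_ports": 64,
    "batch_interval_sec": "10s"
  },
  "internal": [{"ip": "10.10.10.1"}],
  "external": [{"host_name": "payments.externalservice.com"}]
}
"""

# Milliseconds per unit suffix seen in the configuration values.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_duration_ms(text: str) -> int:
    """Convert strings such as '200ms', '10s', '15m', '2h' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


TCP_FLAG_SYN = 0x02  # assumption: the slides do not say which flag is set


def build_tcp_header(src_port: int, dst_port: int, flags: int = TCP_FLAG_SYN) -> bytes:
    """Pack a 20-byte TCP header (no options) in network byte order."""
    data_offset_words = 5              # 5 x 32-bit words = 20 bytes
    offset_byte = data_offset_words << 4
    seq, ack, window, checksum, urgent = 0, 0, 65535, 0, 0
    return struct.pack(
        "!HHIIBBHHH",
        src_port, dst_port, seq, ack, offset_byte, flags, window, checksum, urgent,
    )


config = json.loads(CONFIG_TEXT)
local = config["local"]

# Assumed interpretation: ports base, base + 1, ..., base + num - 1.
src_ports = [local["base_src_tcp_port"] + i for i in range(local["num_src_tcp_ports"])]
headers = [build_tcp_header(port, local["target_tcp_port"]) for port in src_ports]

timeout_ms = parse_duration_ms(local["timeout_msec"])
batch_interval_ms = parse_duration_ms(local["batch_interval_sec"])
targets = [t.get("ip") or t["host_name"] for t in config["internal"] + config["external"]]
```

Durations are kept as integer milliseconds to avoid floating-point rounding when comparing against timeouts.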