
Debugging Distributed Systems - Velocity Santa Clara 2016


Despite our best efforts, our systems fail. Sometimes it’s our fault—code that we wrote, bugs that we caused. But sometimes the fault is with systems that we have no direct control over. Distributed systems are hard. They are complicated, hard to understand, and very challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them.

ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems like Kafka and Spark. It is used by PagerDuty for many critical systems, and for five months it failed a lot. Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures. Donny explains how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you will want to know about TCP, including an example of machines having a different view of the state of a TCP stream.

http://conferences.oreilly.com/velocity/devops-web-performance-ca/public/schedule/detail/50058



  1. 2016-06-23 Debugging Distributed Systems Donny Nadolny donny@pagerduty.com #VelocityConf
  2. DEBUGGING DISTRIBUTED SYSTEMS 2016-06-23
  3. What is ZooKeeper • Distributed system for building distributed systems • Small in-memory filesystem
  4. ZooKeeper at PagerDuty • Distributed locking • Consistent, highly available
  5. ZooKeeper at PagerDuty… Over A WAN • Distributed locking • Consistent, highly available (diagram: ZK 1, ZK 2, ZK 3 spread across DC-A, DC-B, DC-C, with inter-DC latencies of 24 ms, 24 ms, and 3 ms)
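
The standard way to build distributed locking on ZooKeeper is the ephemeral-sequential-znode lock recipe from the ZooKeeper documentation. Below is a minimal Java sketch of that recipe, simplified so that every waiter re-checks on any child change (a "herd effect"), and assuming the parent path already exists. It illustrates how ZooKeeper-based locking works in general; it is not PagerDuty's implementation.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SimpleZkLock {
        private final ZooKeeper zk;
        private final String lockDir;   // e.g. "/locks/my-lock" (assumed to exist)
        private String myNode;          // full path of our lock-000000NN znode

        public SimpleZkLock(ZooKeeper zk, String lockDir) {
            this.zk = zk;
            this.lockDir = lockDir;
        }

        public void acquire() throws Exception {
            // Ephemeral: the znode (and the lock) vanishes if our session dies.
            // Sequential: ZooKeeper appends a monotonically increasing counter.
            myNode = zk.create(lockDir + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            while (true) {
                CountDownLatch changed = new CountDownLatch(1);
                List<String> children =
                        zk.getChildren(lockDir, event -> changed.countDown());
                Collections.sort(children);  // zero-padded counters sort correctly
                if ((lockDir + "/" + children.get(0)).equals(myNode)) {
                    return;                  // lowest sequence number: we hold the lock
                }
                changed.await();             // someone else holds it; wait for a change
            }
        }

        public void release() throws Exception {
            zk.delete(myNode, -1);           // -1 = any version
        }
    }

Because the lock znode is ephemeral, it disappears automatically if the holder's session dies, which is what makes this kind of lock usable across process crashes.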
  6.–7. The Failure • Network trouble, one follower falls behind • ZooKeeper gets stuck - leader still up (chart, built up over two slides: DBSize over time for the cluster nodes)
  8. First Hint • Leader logs: “Too busy to snap, skipping”
  9.–10. Fault Injection • Disk slow? Let’s test: sshfs donny@some_server:/home/donny /mnt • Similar failure profile • Re-examine disk latency… nope, was a red herring
  11. Health Checks • First warning: application monitoring • Monitoring ZooKeeper: used ruok
  12. Deep Health Checks • Added deep health check: • write to one ZooKeeper key • read from ZooKeeper key
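
A deep health check along these lines can be sketched with the ZooKeeper Java client; the znode path below is a made-up example, not PagerDuty's actual key.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class DeepHealthCheck {
        private static final String PATH = "/deep-health-check";  // illustrative path

        // Healthy only if a full round trip through the ensemble works:
        // write a fresh value, read it back, and compare.
        static boolean check(ZooKeeper zk) {
            try {
                byte[] value = Long.toString(System.currentTimeMillis())
                        .getBytes(StandardCharsets.UTF_8);
                if (zk.exists(PATH, false) == null) {
                    zk.create(PATH, value, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                            CreateMode.PERSISTENT);
                } else {
                    zk.setData(PATH, value, -1);   // -1 = any version
                }
                byte[] readBack = zk.getData(PATH, false, null);
                return Arrays.equals(value, readBack);
            } catch (KeeperException e) {
                return false;                      // any error counts as unhealthy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

Since the failure described in this talk was a hang rather than an error, the caller also needs to bound how long check() may run (e.g. execute it under its own timeout) for the check to catch a stuck ensemble.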
  13. The Stack Trace
      "LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
         java.lang.Thread.State: RUNNABLE
         at java.net.SocketOutputStream.socketWrite0(Native Method)
         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
         …
         at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
         …
         at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
         at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
         - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
         at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
         …
         at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
  14.–17. Threads (Leader) (diagram, built up over four slides: client requests, request processors, learner handler (one per follower))
  18. Write Snapshot Code (simplified)
      void serializeNode(OutputArchive output, String path) {
          DataNode node = getNode(path);
          String[] children = {};
          synchronized (node) {
              output.writeString(path, "path");
              output.writeRecord(node, "node");   // blocking network write
              children = node.getChildren();
          }
          for (String child : children) {
              serializeNode(output, path + "/" + child);
          }
      }
  19. ZooKeeper Heartbeat • Why didn’t a follower take over? • ZK heartbeat: message from leader to follower, follower times out
  20. Threads (Leader) (diagram: client requests, request processors, learner handler (one per follower), quorum peer)
  21. TCP
  22. TCP Data Transmission (diagram: Follower and Leader, both ESTABLISHED after … SYN, SYN-ACK, ACK …; Packet 1 is sent and ACKed)
  23.–28. TCP Data Transmission (diagram, built up over six slides: Packet 1 gets no ACK and is retransmitted after ~200 ms, then ~200 ms, ~400 ms, ~800 ms, the gap doubling up to ~120 sec between attempts; after 15 retries the sender finally gives up, ~120 sec after the last attempt, and the connection goes to CLOSED)
  29. TCP Retransmission (Linux Defaults) • Retransmission timeout (RTO) is based on latency • TCP_RTO_MIN = 200 ms • TCP_RTO_MAX = 2 minutes • /proc/sys/net/ipv4/tcp_retries2 = 15 retries • 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins)
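
The 924.8-second figure on the slide can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope model of the doubling series shown above (initial RTO of 200 ms, doubling per retransmission, capped at 2 minutes), not the kernel's exact retransmission logic.

    public class RetransmitBudget {
        public static void main(String[] args) {
            final double rtoMin = 0.2;    // TCP_RTO_MIN, seconds
            final double rtoMax = 120.0;  // TCP_RTO_MAX, seconds
            final int retries = 15;       // /proc/sys/net/ipv4/tcp_retries2

            // Mirror the slide's series: a leading 0.2 s, then one wait per
            // (re)transmission, doubling each time and capped at TCP_RTO_MAX.
            double rto = rtoMin;
            double total = rtoMin;
            for (int i = 0; i <= retries; i++) {
                total += rto;
                rto = Math.min(rto * 2, rtoMax);
            }
            // Prints: ~924.8 seconds (~15.4 minutes)
            System.out.printf("~%.1f seconds (~%.1f minutes)%n", total, total / 60);
        }
    }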
  30.–31. Timeline
      1. Network trouble begins - packet loss / latency
      2. Follower falls behind, restarts, requests snapshot
      3. Leader begins to send snapshot
      4. Snapshot transfer stalls
      5. Follower ZooKeeper restarts, attempts to close connection
      6. Network heals
      7. … Leader still stuck
  32. TCP Close Connection (diagram: normal close; the Follower sends a FIN and enters FIN_WAIT1, the Leader replies FIN/ACK and enters LAST_ACK, the Follower ACKs and waits in TIME_WAIT for 60 seconds, then both sides end up CLOSED)
  33. TCP Close Connection (diagram: the Follower's FIN gets no response; the kernel retransmits it 8 times and gives up after ~1m40s, leaving the Follower CLOSED)
  34. TCP Close Connection (diagram: the Follower is CLOSED after ~1m40s of FIN retries, while the Leader keeps retransmitting Packet 1 and only reaches CLOSED after ~15.5 mins)
  35. TCP Close Connection (diagram: a retransmitted Packet 1 arriving at the Follower's now-CLOSED socket is answered with a RST)
  36. syslog - Dropped Packets on Follower
      06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0
  37.–38. iptables
      iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
      iptables -A INPUT -p tcp --dport 80 -j ACCEPT
      ... more rules to accept connections …
      iptables -A INPUT -j DROP
      But: iptables connections != netstat connections
  39. conntrack Timeouts • From linux/net/netfilter/nf_conntrack_proto_tcp.c: [TCP_CONNTRACK_LAST_ACK] = 30 SECS
  40. TCP Close Connection (diagram: the Follower's kernel TCP and conntrack timelines side by side; the kernel retransmits the FIN with doubling gaps of ~12.8s, ~25.6s, ~51.2s, ~102.4s, while conntrack's LAST_ACK timeout is only 30s, so once the gap between FINs exceeds 30s the conntrack entry expires and the connection is forgotten)
  41. The Full Story • Packet loss • Follower falls behind, requests snapshot • (Packet loss continues) follower closes connection • Follower conntrack forgets connection • Leader now stuck for ~15 mins, even if network heals
  42. Reproducing (1/3) - Setup • Follower falls behind: tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35% • Wait for a few minutes (Alternate: kill the follower)
  43. Reproducing (2/3) - Request Snapshot
      • Remove latency / packet loss: tc qdisc del dev eth0 root netem
      • Restrict bandwidth:
        tc qdisc add dev eth0 handle 1: root htb default 11
        tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
        tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
      • Restart follower ZooKeeper process
  44. Reproducing (3/3) - Carefully Close Connection
      • Block traffic to leader: iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
      • Remove bandwidth restriction: tc qdisc del dev eth0 root
      • Kill follower ZooKeeper process, kernel tries to close connection
      • Monitor conntrack status, wait for entry to disappear, ~80 seconds: conntrack -L | grep <leader ip>
      • Allow traffic to leader: iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
  45. IPsec
  46. IPsec (diagram: TCP data between Follower and Leader travels inside IPsec ESP (UDP) encapsulation)
  47. IPsec - Establish Connection (diagram: IPsec Phase 1, then IPsec Phase 2, then TCP data)
  48. IPsec - Dropped Packets (diagram: IPsec Phase 1, IPsec Phase 2, TCP data, TCP data)
  49. IPsec - Heartbeat (diagram: IPsec Phase 1, IPsec Phase 2, IPsec Heartbeat, TCP data)
  50. Lessons
  51. Lesson 1 • Don’t lock and block • TCP can block for a really long time • Interfaces / abstract methods make analysis harder
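
As an illustration of "don't lock and block", building on the simplified serializeNode shown earlier: take a copy of the node's state while holding its lock, and do the blocking socket write after the lock is released. This is a sketch of the principle only, not the actual ZooKeeper fix (see the ZOOKEEPER-2201 link below); DataNode, OutputArchive, and getNode are the names from the simplified slide code, and the copy constructor is hypothetical.

    void serializeNode(OutputArchive output, String path) {
        DataNode node = getNode(path);
        String[] children;
        DataNode copy;
        synchronized (node) {
            // Hold the lock only long enough to take a consistent copy of this
            // node; no network I/O happens while the lock is held.
            copy = new DataNode(node);         // hypothetical copy constructor
            children = node.getChildren();
        }
        // The blocking network writes now happen outside the lock, so a stalled
        // follower connection can no longer wedge threads waiting on this node.
        output.writeString(path, "path");
        output.writeRecord(copy, "node");
        for (String child : children) {
            serializeNode(output, path + "/" + child);
        }
    }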
  52. Lesson 2 • Automate debug info collection (stack trace, heap dump, txn logs, etc.)
  53. Lesson 3 • Application/dependency checks should be deep health checks! • Leader/follower heartbeats should be deep health checks!
  54. 2016-06-23 Questions? donny@pagerduty.com Link: “Network issues can cause cluster to hang due to near-deadlock” https://issues.apache.org/jira/browse/ZOOKEEPER-2201
  55. “Mess With The Network” Cheat Sheet
      tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%   # add latency
      tc qdisc del dev eth0 root netem                              # remove latency
      tc qdisc add dev eth0 handle 1: root htb default 11           # restrict bandwidth
      tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
      tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
      tc qdisc del dev eth0 root                                    # remove bandwidth restriction
      # tip: when doing latency / loss / bandwidth restriction, it can be helpful to run
      # "sleep 60 && <tc delete command> & disown" in case you lose ssh access
      tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump   # capture packets, then open locally in wireshark
      iptables -A OUTPUT -p tcp --dport 4444 -j DROP   # block traffic
      iptables -D OUTPUT -p tcp --dport 4444 -j DROP   # allow traffic
      # can use INPUT / OUTPUT chain for incoming / outgoing traffic; also --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>
      sshfs me@123.45.67.89:/tmp/data /mnt   # configure database/application local data directory to be /mnt, then use the tools above against 123.45.67.89
      # alternative: nbd (network block device)
      netstat -peanut   # network connections, regular kernel view
      conntrack -L      # network connections, iptables view
