Despite our best efforts, our systems fail. Sometimes it's our fault: code we wrote, bugs we caused. But sometimes the fault lies with systems we have no direct control over. Distributed systems are hard: complicated, difficult to understand, and very challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them.
ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems, like Kafka and Spark. PagerDuty relies on it for many critical systems, and for five months it failed frequently. Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures. Donny explains how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you will want to know about TCP, including an example of two machines holding different views of the state of the same TCP stream.
http://conferences.oreilly.com/velocity/devops-web-performance-ca/public/schedule/detail/50058
Debugging Distributed Systems (2016-06-23)
• Distributed locking
• Consistent, highly available
ZooKeeper at PagerDuty… Over A WAN
[Diagram: a three-node ZooKeeper ensemble spanning a WAN, with ZK 1 in DC-A, ZK 2 in DC-B, and ZK 3 in DC-C; inter-datacenter link latencies of 24 ms, 24 ms, and 3 ms]
• Disk slow? Let's test:
• sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring
Fault Injection
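The sshfs trick above works because ZooKeeper fsyncs every transaction-log append, so mounting the data directory over a high-latency filesystem injects disk latency directly into the write path. A minimal sketch for measuring that latency is below; the directory argument is whatever mount you point it at (it defaults to the local temp dir only so the sketch runs anywhere):

```python
import os
import tempfile
import time

def worst_fsync_latency(directory, size=4096, samples=5):
    """Time write+fsync in `directory` -- roughly the I/O pattern of a
    ZooKeeper transaction-log append. Point `directory` at the sshfs
    mount from the slide to see the injected latency."""
    payload = os.urandom(size)
    worst = 0.0
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            start = time.monotonic()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            worst = max(worst, time.monotonic() - start)
    return worst

# Local temp dir here so the sketch is self-contained; substitute /mnt
# (the sshfs mount from the slide) to reproduce the test.
print(worst_fsync_latency(tempfile.gettempdir()))
```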
"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable
[0x00007fe6c3193000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
…
at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
…
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
- locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
…
at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
The Stack Trace
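The trace shows the leader's LearnerHandler thread blocked in socketWrite0 while serializeNode holds a DataNode lock: snapshot serialization does blocking network I/O with the lock held, so if the follower stops reading, everything else needing that lock stalls behind a network problem. A toy sketch of the pattern (not ZooKeeper code; the socketpair and sizes are arbitrary stand-ins):

```python
import socket
import threading
import time

a, b = socket.socketpair()
tree_lock = threading.Lock()  # stand-in for the DataNode lock in the trace

def serialize_snapshot():
    # The pattern from the stack trace: take the lock, then do a blocking
    # socket write. The peer (b) never reads, so once the kernel socket
    # buffers fill, sendall() blocks -- with the lock still held.
    with tree_lock:
        a.sendall(b"x" * 16_000_000)  # far more than the buffers hold

writer = threading.Thread(target=serialize_snapshot, daemon=True)
writer.start()
time.sleep(0.5)

# Any other thread that needs the lock is now stuck behind a network stall.
print(tree_lock.locked())  # True
```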
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
iptables
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
But: iptables' view of connections (the conntrack table) != netstat's view (the kernel socket table)
iptables
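That mismatch is the crux: the kernel's socket table (what netstat reads) and netfilter's conntrack table (what `-m state` consults) are independent pieces of state, and only one of them can forget the connection. A toy model of the disagreement, not real netfilter code, with made-up addresses:

```python
# Toy model: the endpoint's socket table and the firewall's conntrack
# table are separate state and can disagree.
flow = ("10.0.0.1", 45874, "10.0.0.2", 2181)  # hypothetical client->ZK flow

socket_table = {flow: "ESTABLISHED"}   # what netstat would show
conntrack_table = set(socket_table)    # in sync at first

def firewall_accepts(f):
    # Mirrors "-m state --state ESTABLISHED,RELATED -j ACCEPT ... -j DROP":
    # only flows conntrack remembers get through.
    return f in conntrack_table

print(firewall_accepts(flow))   # True: both tables agree

conntrack_table.discard(flow)   # conntrack times the entry out...
print(flow in socket_table)     # True: the socket still exists,
print(firewall_accepts(flow))   # False: ...but its packets now get dropped
```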
TCP Close Connection
[Sequence diagram: follower and leader closing a TCP connection, with the follower's kernel and its conntrack table tracked separately. The FIN is retransmitted with exponential backoff (timestamps around ~12.8 s, ~25.6 s, ~51.2 s, ~81.2 s, and ~102.4 s); repeated 30 s conntrack timeouts elapse between retransmits; the endpoints pass through FIN_WAIT1, LAST_ACK, and CLOSED.]
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals
The Full Story
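The "~15 mins" falls out of TCP retransmission behavior. Assuming Linux defaults (tcp_retries2 = 15, an initial RTO floor of 200 ms, RTO doubling each retry and capped at 120 s), an unacknowledged write is retried for about 924.6 seconds, roughly 15.4 minutes, before the kernel gives up. A quick check of that arithmetic (default constants assumed; the real initial RTO is derived from the measured RTT):

```python
# Hypothetical timeout for tcp_retries2 = 15 with the conventional
# doubling schedule: RTO starts at TCP_RTO_MIN (200 ms) and doubles
# on each retransmission, capped at TCP_RTO_MAX (120 s).
RTO_MIN, RTO_MAX, RETRIES = 0.2, 120.0, 15

def total_retry_time(rto_min=RTO_MIN, rto_max=RTO_MAX, retries=RETRIES):
    total, rto = 0.0, rto_min
    # `retries` retransmissions plus the final wait before giving up
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total

print(round(total_retry_time(), 1))       # 924.6 seconds
print(round(total_retry_time() / 60, 1))  # 15.4 minutes
```

Note that the ~12.8 s, ~25.6 s, ~51.2 s, ~102.4 s intervals in the close-connection diagram are the tail of this same doubling schedule.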
• Follower falls behind:
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
(Alternate: kill the follower)
Reproducing (1/3) - Setup
• Remove latency / packet loss:
tc qdisc del dev eth0 root netem
• Restrict bandwidth:
tc qdisc add dev eth0 handle 1: root htb default 11
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process
Reproducing (2/3) - Request Snapshot
• Block traffic to leader:
iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
conntrack -L | grep <leader ip>
• Allow traffic to leader:
iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
Reproducing (3/3) - Carefully Close Connection
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!
Lesson 3
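A shallow check only proves a port is open; a deep check exercises the service. For ZooKeeper, even the real "ruok" four-letter-word command (subject to the command whitelist in newer versions) is only a liveness probe, and a truly deep check would read or write a znode through the client API, but it illustrates the difference. A sketch, using a local stand-in server that answers "ruok" so it runs self-contained:

```python
import socket
import threading

def stand_in_server(ports, ready):
    # NOT ZooKeeper: a local stand-in that answers ZooKeeper's real "ruok"
    # four-letter-word command with "imok", so the sketch is runnable
    # without a live ensemble.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    with conn:
        if conn.recv(4) == b"ruok":
            conn.sendall(b"imok")
    srv.close()

def shallow_check(host, port):
    # Proves only that something accepts TCP connections on the port --
    # it would also "pass" against a hung server.
    try:
        socket.create_connection((host, port), timeout=2).close()
        return True
    except OSError:
        return False

def deep_check(host, port):
    # Exercises the service itself: send "ruok", require "imok" back.
    try:
        with socket.create_connection((host, port), timeout=2) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False

ports, ready = [], threading.Event()
t = threading.Thread(target=stand_in_server, args=(ports, ready))
t.start()
ready.wait()
print(deep_check("127.0.0.1", ports[0]))  # True
t.join()
```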