Despite our best efforts, our systems fail. Sometimes it's our fault: code we wrote, bugs we caused. But sometimes the fault lies with systems we have no direct control over. Distributed systems are hard: complicated, difficult to understand, and very challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them.
ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems, like Kafka and Spark. PagerDuty relies on it for many critical systems, and for five months it failed frequently. Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures. Donny explains how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you will want to know about TCP, including an example of two machines holding different views of the state of the same TCP stream.
http://conferences.oreilly.com/velocity/devops-web-performance-ca/public/schedule/detail/50058
Debugging Distributed Systems (2016-06-23)
• Distributed locking
• Consistent, highly available
ZooKeeper at PagerDuty… Over A WAN
[Diagram: a three-node ZooKeeper ensemble spanning a WAN, with ZK 1 in DC-A, ZK 2 in DC-B, and ZK 3 in DC-C; inter-datacenter link latencies of 24 ms, 24 ms, and 3 ms]
• Disk slow? Let's test:
• sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring
Fault Injection
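The sshfs trick above works because ZooKeeper fsyncs every transaction-log append, so mounting the data directory over a high-latency filesystem injects disk latency directly into the write path. A minimal sketch for measuring that latency is below; the directory argument is whatever mount you point it at (it defaults to the local temp dir only so the sketch runs anywhere):

```python
import os
import tempfile
import time

def worst_fsync_latency(directory, size=4096, samples=5):
    """Time write+fsync in `directory` -- roughly the I/O pattern of a
    ZooKeeper transaction-log append. Point `directory` at the sshfs
    mount from the slide to see the injected latency."""
    payload = os.urandom(size)
    worst = 0.0
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            start = time.monotonic()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            worst = max(worst, time.monotonic() - start)
    return worst

# Local temp dir here so the sketch is self-contained; substitute /mnt
# (the sshfs mount from the slide) to reproduce the test.
print(worst_fsync_latency(tempfile.gettempdir()))
```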
"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable
[0x00007fe6c3193000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
…
at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
…
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
- locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
…
at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
The Stack Trace
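The trace shows the leader's LearnerHandler thread blocked in socketWrite0 while serializeNode holds a DataNode lock: snapshot serialization does blocking network I/O with the lock held, so if the follower stops reading, everything else needing that lock stalls behind a network problem. A toy sketch of the pattern (not ZooKeeper code; the socketpair and sizes are arbitrary stand-ins):

```python
import socket
import threading
import time

a, b = socket.socketpair()
tree_lock = threading.Lock()  # stand-in for the DataNode lock in the trace

def serialize_snapshot():
    # The pattern from the stack trace: take the lock, then do a blocking
    # socket write. The peer (b) never reads, so once the kernel socket
    # buffers fill, sendall() blocks -- with the lock still held.
    with tree_lock:
        a.sendall(b"x" * 16_000_000)  # far more than the buffers hold

writer = threading.Thread(target=serialize_snapshot, daemon=True)
writer.start()
time.sleep(0.5)

# Any other thread that needs the lock is now stuck behind a network stall.
print(tree_lock.locked())  # True
```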
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
iptables
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
But: iptables' view of connections (the conntrack table) != netstat's view (the kernel socket table)
iptables
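That mismatch is the crux: the kernel's socket table (what netstat reads) and netfilter's conntrack table (what `-m state` consults) are independent pieces of state, and only one of them can forget the connection. A toy model of the disagreement, not real netfilter code, with made-up addresses:

```python
# Toy model: the endpoint's socket table and the firewall's conntrack
# table are separate state and can disagree.
flow = ("10.0.0.1", 45874, "10.0.0.2", 2181)  # hypothetical client->ZK flow

socket_table = {flow: "ESTABLISHED"}   # what netstat would show
conntrack_table = set(socket_table)    # in sync at first

def firewall_accepts(f):
    # Mirrors "-m state --state ESTABLISHED,RELATED -j ACCEPT ... -j DROP":
    # only flows conntrack remembers get through.
    return f in conntrack_table

print(firewall_accepts(flow))   # True: both tables agree

conntrack_table.discard(flow)   # conntrack times the entry out...
print(flow in socket_table)     # True: the socket still exists,
print(firewall_accepts(flow))   # False: ...but its packets now get dropped
```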
TCP Close Connection
[Sequence diagram: follower and leader closing a TCP connection, with the follower's kernel and its conntrack table tracked separately. The FIN is retransmitted with exponential backoff (timestamps around ~12.8 s, ~25.6 s, ~51.2 s, ~81.2 s, and ~102.4 s); repeated 30 s conntrack timeouts elapse between retransmits; the endpoints pass through FIN_WAIT1, LAST_ACK, and CLOSED.]
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals
The Full Story
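The "~15 mins" falls out of TCP retransmission behavior. Assuming Linux defaults (tcp_retries2 = 15, an initial RTO floor of 200 ms, RTO doubling each retry and capped at 120 s), an unacknowledged write is retried for about 924.6 seconds, roughly 15.4 minutes, before the kernel gives up. A quick check of that arithmetic (default constants assumed; the real initial RTO is derived from the measured RTT):

```python
# Hypothetical timeout for tcp_retries2 = 15 with the conventional
# doubling schedule: RTO starts at TCP_RTO_MIN (200 ms) and doubles
# on each retransmission, capped at TCP_RTO_MAX (120 s).
RTO_MIN, RTO_MAX, RETRIES = 0.2, 120.0, 15

def total_retry_time(rto_min=RTO_MIN, rto_max=RTO_MAX, retries=RETRIES):
    total, rto = 0.0, rto_min
    # `retries` retransmissions plus the final wait before giving up
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total

print(round(total_retry_time(), 1))       # 924.6 seconds
print(round(total_retry_time() / 60, 1))  # 15.4 minutes
```

Note that the ~12.8 s, ~25.6 s, ~51.2 s, ~102.4 s intervals in the close-connection diagram are the tail of this same doubling schedule.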
• Follower falls behind:
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
(Alternate: kill the follower)
Reproducing (1/3) - Setup
• Remove latency / packet loss:
tc qdisc del dev eth0 root netem
• Restrict bandwidth:
tc qdisc add dev eth0 handle 1: root htb default 11
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process
Reproducing (2/3) - Request Snapshot
• Block traffic to leader:
iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
conntrack -L | grep <leader ip>
• Allow traffic to leader:
iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
Reproducing (3/3) - Carefully Close Connection
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!
Lesson 3
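A shallow check only proves a port is open; a deep check exercises the service. For ZooKeeper, even the real "ruok" four-letter-word command (subject to the command whitelist in newer versions) is only a liveness probe, and a truly deep check would read or write a znode through the client API, but it illustrates the difference. A sketch, using a local stand-in server that answers "ruok" so it runs self-contained:

```python
import socket
import threading

def stand_in_server(ports, ready):
    # NOT ZooKeeper: a local stand-in that answers ZooKeeper's real "ruok"
    # four-letter-word command with "imok", so the sketch is runnable
    # without a live ensemble.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    with conn:
        if conn.recv(4) == b"ruok":
            conn.sendall(b"imok")
    srv.close()

def shallow_check(host, port):
    # Proves only that something accepts TCP connections on the port --
    # it would also "pass" against a hung server.
    try:
        socket.create_connection((host, port), timeout=2).close()
        return True
    except OSError:
        return False

def deep_check(host, port):
    # Exercises the service itself: send "ruok", require "imok" back.
    try:
        with socket.create_connection((host, port), timeout=2) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False

ports, ready = [], threading.Event()
t = threading.Thread(target=stand_in_server, args=(ports, ready))
t.start()
ready.wait()
print(deep_check("127.0.0.1", ports[0]))  # True
t.join()
```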