Donny Nadolny, PagerDuty #Devoxx #distsys
Debugging Distributed Systems
Donny Nadolny
PagerDuty
What is ZooKeeper
• Distributed system for building distributed systems
• Small in-memory filesystem
ZooKeeper API
• create directory
• create file (ZooKeeper term: “node”)
• atomically update a file
• watch a file for changes
• create “ephemeral” file (goes away when client does)
• create sequential file (concurrent attempts to create are ordered)
ZooKeeper at PagerDuty
• Distributed locking
• Consistent, highly available
Current Talk: Debugging Distributed Systems
For Cassandra Consistency Issues, See:
ZooKeeper at PagerDuty
• … over a WAN
[Diagram: three ZooKeeper nodes (ZK 1, ZK 2, ZK 3) spread across data centers DC-A, DC-B and DC-C; link latencies of 24 ms, 24 ms and 3 ms]
ZooKeeper Overview
The Failure
• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up
[Chart: DBSize]
Recovery
• Restart all nodes
• Restart leader
First Hint
• Leader logs: “Too busy to snap, skipping”
Fault Injection
• Disk slow? Let’s test:
  sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring
Health Checks
• First warning: application monitoring
• High-level application checks are good because they catch many problems, but they don’t tell you the cause
• Monitoring ZooKeeper: used ruok, which only confirms that the server process responds
Deep Health Checks
• Added deep health check:
• write to one ZooKeeper key
• read from ZooKeeper key
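A minimal sketch of such a write-then-read check, with a timeout so that a hung cluster counts as unhealthy instead of hanging the checker. `KeyStore` is a hypothetical stand-in for a ZooKeeper client (it is not part of the ZooKeeper API), and the class name, key and timeout are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.*;

/** Hypothetical stand-in for a ZooKeeper client; not part of the ZooKeeper API. */
interface KeyStore {
    void write(String key, byte[] value) throws Exception;
    byte[] read(String key) throws Exception;
}

final class DeepHealthCheck {
    private final KeyStore store;
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "deep-health-check");
        t.setDaemon(true); // a hung check must not keep the JVM alive
        return t;
    });

    DeepHealthCheck(KeyStore store) {
        this.store = store;
    }

    /** True only if a fresh value can be written and read back within the timeout. */
    boolean isHealthy(String key, long timeoutMs) {
        byte[] expected = Long.toString(System.nanoTime()).getBytes(StandardCharsets.UTF_8);
        Future<Boolean> roundTrip = executor.submit(() -> {
            store.write(key, expected);
            return Arrays.equals(expected, store.read(key));
        });
        try {
            return roundTrip.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            roundTrip.cancel(true); // best effort; a blocked write may ignore the interrupt
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // the write or the read itself failed
        }
    }

    public static void main(String[] args) {
        // In-memory store used only to make the sketch runnable.
        Map<String, byte[]> data = new ConcurrentHashMap<>();
        KeyStore inMemory = new KeyStore() {
            public void write(String key, byte[] value) { data.put(key, value); }
            public byte[] read(String key) { return data.get(key); }
        };
        System.out.println(new DeepHealthCheck(inMemory).isHealthy("/health-check", 1000));
    }
}
```

Unlike ruok, this check fails when writes stop making progress, even if the server process still answers.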
The Stack Trace
"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        …
        at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
        …
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
        - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
        …
        at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
Threads (Leader)
[Diagram, built up over several slides: the leader's threads (client requests, request processors, and one learner handler per follower), with 🔒 markers added one at a time until three are shown]
Write Snapshot Code (simplified)
void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    String[] children = {};
    synchronized (node) {
        output.writeString(path, "path");
        // Blocking network write, made while the node lock is held
        output.writeRecord(node, "node");
        children = node.getChildren();
    }
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}
ZooKeeper Heartbeat
• Why didn’t a follower take over?
• restart all nodes - cluster recovers
• restart leader - cluster recovers
• ZK heartbeat: message from leader to follower
• follower gets heartbeat, everything is fine
• follower doesn’t get heartbeat: start an election
Threads (Leader)
[Diagram: the same leader threads with three 🔒 markers, plus a Quorum Peer thread that still sends heartbeats (❤) to the followers]
TCP
TCP Data Transmission
[Sequence diagram, built up over several slides: Follower and Leader are both ESTABLISHED after the SYN, SYN-ACK, ACK handshake. Normally Packet 1 is answered with an ACK. When no ACK arrives, Packet 1 is retransmitted after ~200ms, ~200ms, ~400ms, ~800ms, … with the interval growing to 120sec; after 15 retries the connection is CLOSED.]
TCP Retransmission (Linux Defaults)
• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins)
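The sum above can be checked with a few lines of arithmetic. This is a sketch under simplifying assumptions: the timeout starts at TCP_RTO_MIN, doubles on every retry, and is capped at TCP_RTO_MAX (the real RTO also depends on measured latency). The class and method names are made up for illustration; this is not kernel code.

```java
final class RtoBackoff {
    /**
     * Total wait in milliseconds over the given number of consecutive timeouts,
     * starting at rtoMinMs, doubling each time, capped at rtoMaxMs.
     */
    static long totalMillis(long rtoMinMs, long rtoMaxMs, int timeouts) {
        long total = 0;
        long rto = rtoMinMs;
        for (int i = 0; i < timeouts; i++) {
            total += rto;
            rto = Math.min(rto * 2, rtoMaxMs);
        }
        return total;
    }

    public static void main(String[] args) {
        // Original transmission plus 15 retries: 16 timeouts before giving up.
        long doubling = totalMillis(200, 120_000, 16);
        // The series on the slide starts with two 0.2 s terms, i.e. one extra 200 ms.
        long slideSeries = 200 + doubling;
        System.out.println(doubling / 1000.0 + " s");
        System.out.println(slideSeries / 1000.0 + " s");
    }
}
```

Under this model ten doubling steps (0.2 s to 102.4 s) contribute 204.6 s and six capped steps contribute 720 s; the slide's 924.8 s figure is that total plus the additional leading 0.2 s term.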
[Sequence diagram, TCP data transmission: the same retransmission sequence, with the total time before CLOSED marked as 15.5 mins (or more)]
Timeline
1. Network trouble begins - packet loss / latency
2. Follower falls behind, restarts, requests snapshot
3. Leader begins to send snapshot
4. Snapshot transfer stalls
5. Follower ZooKeeper restarts, attempts to close connection
6. Network heals
7. … Leader still stuck
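TCP Close Connection
[Sequence diagram, normal close: both sides start ESTABLISHED. The Follower sends FIN and enters FIN_WAIT1; the Leader answers with FIN/ACK and enters LAST_ACK; the Follower sends ACK and stays in TIME_WAIT for 60 seconds before CLOSED; the Leader goes to CLOSED.]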
[Sequence diagrams, TCP close with the FIN lost: the Follower in FIN_WAIT1 retransmits its FIN and, after 8 retries (~1m40s), goes to CLOSED while the Leader is still ESTABLISHED. The Leader keeps retransmitting Packet 1 and reaches CLOSED only after ~15.5 mins; the last build shows an RST sent in reply to Packet 1, which closes the Leader's side.]
syslog - Dropped Packets on Follower
06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0
[Sequence diagram, TCP close: the Follower is CLOSED after ~1m40s; the Leader's retransmitted Packet 1 is blocked by iptables on the Follower (marked X), so no RST is sent back.]
iptables
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
But: iptables connections != netstat connections
conntrack Timeouts
• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
• [TCP_CONNTRACK_LAST_ACK] = 30 SECS
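TCP Close Connection
[Timeline diagram, kernel TCP vs. conntrack on the Follower: the kernel retransmits FIN from FIN_WAIT1 at growing intervals (~12.8s, ~25.6s, ~51.2s) and reaches CLOSED at ~102.4s. conntrack holds the connection in LAST_ACK with a 30s timer that restarts on each FIN; once the gap between FINs exceeds 30s, conntrack drops the entry, at ~81.2s.]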
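The two timers in this diagram can be replayed with a small simulation. It is a sketch under assumptions: FIN retransmissions start at 200 ms and double each time, there are 8 retries, and every FIN simply restarts conntrack's 30-second LAST_ACK timer. The class and method names are hypothetical, and the model ignores everything else conntrack does.

```java
final class FinTimeline {
    /**
     * Returns {conntrackExpiryMs, socketClosedMs}, both measured from the first FIN.
     * Simplified model: each FIN restarts a fixed conntrack timer, and the
     * retransmission interval doubles after every retry.
     */
    static long[] simulate(long rtoMinMs, int finRetries, long conntrackTimeoutMs) {
        long lastFin = 0;          // time of the most recent FIN
        long rto = rtoMinMs;       // wait before the next retransmission
        long conntrackExpiry = -1; // not expired yet
        for (int retry = 0; retry < finRetries; retry++) {
            if (conntrackExpiry < 0 && rto > conntrackTimeoutMs) {
                // The next FIN arrives too late to refresh the entry.
                conntrackExpiry = lastFin + conntrackTimeoutMs;
            }
            lastFin += rto;
            rto *= 2;
        }
        if (conntrackExpiry < 0) {
            conntrackExpiry = lastFin + conntrackTimeoutMs;
        }
        // After the last retry the kernel waits one more interval, then gives up.
        long socketClosed = lastFin + rto;
        return new long[] {conntrackExpiry, socketClosed};
    }

    public static void main(String[] args) {
        long[] result = simulate(200, 8, 30_000);
        System.out.println("conntrack entry gone after " + result[0] / 1000.0 + " s");
        System.out.println("kernel socket closed after " + result[1] / 1000.0 + " s");
    }
}
```

With these inputs the conntrack entry disappears about 20 seconds before the kernel closes the socket, which is in line with the ~81.2s and ~102.4s marks on the slide (the small difference comes from how the first 200 ms is counted).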
The Full Story
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals
Reproducing (1/3) - Setup
• Follower falls behind (alternative: kill the follower):
  tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
Reproducing (2/3) - Request Snapshot
• Remove latency / packet loss:
  tc qdisc del dev eth0 root netem
• Restrict bandwidth:
  tc qdisc add dev eth0 handle 1: root htb default 11
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process
Reproducing (3/3) - Close Connection
• Block traffic to leader:
  iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
  tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
  conntrack -L | grep <leader ip>
• Allow traffic to leader:
  iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
IPsec
[Diagram: TCP data between Follower and Leader is wrapped in IPsec and carried as ESP (UDP) packets]
IPsec - Establish Connection
[Sequence diagram between Follower and Leader: IPsec Phase 1, IPsec Phase 2, then TCP data]
IPsec - Dropped Packets
[Sequence diagram between Follower and Leader: TCP data, TCP data, IPsec Phase 1, IPsec Phase 2]
IPsec - Heartbeat
[Sequence diagram between Follower and Leader: IPsec Heartbeat, TCP data, TCP data, IPsec Phase 1, IPsec Phase 2]
Lessons
Lesson 1
• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder
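One way to avoid locking and blocking, sketched on a toy tree: copy what is needed while holding the lock, release it, and only then do the write that may block. `Node` and `SnapshotWriter` are hypothetical, much-reduced stand-ins written for this example; they are not ZooKeeper's classes and this is not presented as the fix ZooKeeper adopted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node; data and children are guarded by synchronized (this). */
final class Node {
    final String path;
    byte[] data;
    final List<Node> children = new ArrayList<>();

    Node(String path, byte[] data) {
        this.path = path;
        this.data = data;
    }
}

final class SnapshotWriter {
    static void serialize(Node node, OutputStream out) throws IOException {
        byte[] dataCopy;
        List<Node> childrenCopy;
        synchronized (node) {
            // Only cheap in-memory copies happen while the lock is held.
            dataCopy = node.data.clone();
            childrenCopy = new ArrayList<>(node.children);
        }
        // The potentially slow write happens after the lock is released,
        // so a stalled socket cannot block threads that need this node.
        out.write((node.path + "\n").getBytes(StandardCharsets.UTF_8));
        out.write(dataCopy);
        out.write('\n');
        for (Node child : childrenCopy) {
            serialize(child, out);
        }
    }

    public static void main(String[] args) throws IOException {
        Node root = new Node("/a", "x".getBytes(StandardCharsets.UTF_8));
        root.children.add(new Node("/a/b", "y".getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(root, buffer);
        System.out.print(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The cost is one extra copy of each node's data; in exchange, the lock is held only for the duration of a memory copy.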
Lesson 2
• Automate debug info collection (stack trace, heap dump, transaction logs, etc.)
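As an illustration of the stack-trace part only, a JVM can produce its own thread dump through the standard `java.lang.management` API. The class name is hypothetical, and triggering the dump from a failed health check is an assumption about how it might be wired up; heap dumps and transaction logs are not covered by this sketch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumper {
    /** Returns a text dump of all live threads, including the monitors they hold. */
    static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            for (MonitorInfo monitor : info.getLockedMonitors()) {
                sb.append("    - locked ").append(monitor.getClassName())
                  .append(" at stack depth ").append(monitor.getLockedStackDepth())
                  .append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In practice this would be written to a file when a check fails.
        System.out.print(dumpAllThreads());
    }
}
```

A dump taken automatically at the moment of failure would show which thread holds a lock while blocked in a socket write, without anyone having to be logged in at the time.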
Lesson 3
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!
Questions?
Link:
“Network issues can cause cluster to hang due to near-deadlock”
https://issues.apache.org/jira/browse/ZOOKEEPER-2201
“Mess With The Network” Cheat Sheet
#add latency
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%
#remove latency
tc qdisc del dev eth0 root netem
#restrict bandwidth
tc qdisc add dev eth0 handle 1: root htb default 11
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
#remove bandwidth restriction
tc qdisc del dev eth0 root
#tip: when doing latency / loss / bandwidth restriction:
#run "sleep 60 && <tc delete command> & disown" in case you lose ssh access
#capture packets, then open locally in wireshark
tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump
iptables -A OUTPUT -p tcp --dport 4444 -j DROP #block traffic
iptables -D OUTPUT -p tcp --dport 4444 -j DROP #allow traffic
#can use INPUT / OUTPUT chain for incoming / outgoing traffic
#other options: --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>
#configure database/application local data directory to be /mnt, then use tools above against 123.45.67.89
sshfs me@123.45.67.89:/tmp/data /mnt
#alternative: nbd (network block device)
netstat -peanut #network connections, regular kernel view
conntrack -L #network connections, iptables view

Debugging Distributed Systems - Devoxx Belgium 2016 [Extended]
