SlideShare a Scribd company logo
Adventures with TCP SACKs
SF Big Analytics
1
Chris Stevens
November 12, 2019
2
A kernel developer in Big Data
● ~4 years working on the Windows NT kernel
● ~4 years building Minoca OS - https://github.com/minoca/os
● ~2 years at Databricks working on Apache Spark resource management
3
Road Map
● A failing benchmark
● TCP SACKs explained
● Debugging TCP connections
● Root cause in the Linux Kernel
4
6x slower benchmark
5
6x slower benchmark
spark
.range(10 * BILLION)
.write
.format(“parquet”)
.mode(“overwrite”)
.save(“dbfs:/tmp/test”)
6
Task Skew in the Spark UI
7
Socket timeouts in the logs
org.apache.spark.SparkException:
... truncated ...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception:
Your socket connection to the server was not read from or written
to within the timeout period. Idle connections will be closed.
8
TCP SACKs Vulnerabilities
● June 17, 2019: CVE-2019-11477, CVE-2019-11478 & CVE-2019-11479
● Discovered by Jonathan Looney (Netflix Information Security)
● Remote attacker can ...
○ Trigger a kernel panic (CVE-2019-11477):
■ BUG_ON(tcp_skb_pcount(skb) < pcount);
○ Cause resource exhaustion (CVE-2019-11478, CVE-2019-11479).
● Mitigations:
○ Disable TCP SACKs
■ Set /proc/sys/net/ipv4/tcp_sack to 0
○ Update Linux kernel
■ Databricks updated on June 24, 2019
9
TCP Protocol
Wikipedia (GNU Free Doc License)
10
TCP without SACKs
Sender Receiver
Segment 1
Segment 1 Received
Segment 2
Segment 3
Segment 4 X
Segments [1-2] Received
Segment 3
Segment 4
Segments [1-4] Received
11
TCP with SACKs
Sender Receiver
Segment 1
Segment 1 Received
Segment 2
Segment 3
Segment 4 X
Segments [1-2, 4] Received
Segment 3
Segments [1-4] Received
12
TCP SACKs Attack
Sender Receiver
Seq = 1, Length = 8
Seq = 9, Length = 8
...
Seq = 801, Length = 8
MSS = 48
SACK = [9-16, 25-32,..., 801-808]
13
TCP SACKs Resource Exhaustion
● SACK processing fragments Linux socket buffers
● Increases resource utilization to handle fragmentation (DoS)
sk_buff
nr_frags = 17
frags[MAX_SKB_FRAGS]
skb_shared_info
Frag 1
Frag 2
Frag 2
Frag 17
...
struct skb_frag_struct
{
Struct page *page;
__u32 page_offset;
__u32 size
}
14
TCP SACKs Resource Exhaustion
● SACK processing fragments Linux socket buffers
● Increases resource utilization to handle fragmentation (DoS)
sk_buff
nr_frags = 1
frags[MAX_SKB_FRAGS]
skb_shared_info
Frag 1
Frag 2
Frag 2
Frag 17
...
sk_buff
nr_frags = 17
frags[MAX_SKB_FRAGS]
sk_shared_info
15
TCP SACKs Regressions
● CVE-2019-11478
● June 17, 2019
○ tcp: tcp_fragment() should apply sane memory limits
● June 21, 2019
○ tcp: refine memory limit test in tcp_fragment()
● July 19, 2019
○ tcp: be more careful in tcp_fragment()
○ Ubuntu updates not available until Sept, 2019
16
Is TCP SACKs the culprit?
● Benchmark regression aligned with Databricks’ June 24 kernel update.
● Impacted many versions of the Databricks Runtime (Apache Spark fork).
● Reverting just the kernel update fixed the benchmark.
○ No other platform updates at fault.
17
Debugging the live repro
18
Debugging the live repro
~# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp6 816 99571 ip-10-0-144-224.u:42464 s3-us-west-2-r-w.:https CLOSE_WAIT
Wikipedia (GNU Free
Doc License)
19
Failed Retransmissions
~# ss -a -t -m -o --info
CLOSE-WAIT 816 99571 ::ffff:10.0.144.224:42464
::ffff:52.218.234.25:https timer: (on,1min37sec,13)
skmem:(r9792,rb236184,t0,tb46080,f732,w103971,o0,bl0) sack cubic wscale:8,7
rto:120000 backoff:13 rtt:0.379/0.055 ato:40 mss:1412 cwnd:1 ssthresh:23
bytes_acked:1398838 bytes_received:14186 segs_out:1085 segs_in:817 send
29.8Mbps lastsnd:592672 lastrcv:552644 lastack:550640 pacing_rate 4759.3Mbps
unacked:80 lost:24 sacked:56 rcv_space:26883
● Retransmission is enabled
● It’s made 13 retransmission attempts.
● It will try another retransmission in 1 minute 37 seconds.
● No packets sent in the last 9 minute 51 seconds.
20
Connection Timeout Explained
● net.ipv4.tcp_retries2 defaults to 15
● TCP_RTO_MIN defaults to 200 milliseconds
● TCP_RTO_MAX defaults to 120 seconds
● Retransmission timeout is 15 mins 24 seconds.
21
TCP Tracing
sudo tcpdump -nnS net 52.218.0.0/16 -w awstcp.pcap
● 22,138 bytes were dropped (SACK Left Edge minus ACK)
● 40 seconds of silence (expected 7 retransmissions)
22
TCP Tracing
sudo tcpdump -nnS net 52.218.0.0/16 -w awstcp.pcap
● Sender enters CLOSE_WAIT when FIN is received.
● Unacknowledged 22,138 bytes were never retransmitted
● 77,433 bytes were SACK’d.
● 22,138 + 77,433 = 99,571 (from netstat Send-Q).
● The SACK’d bytes were never freed.
23
Two Questions
● Why were the retransmissions failing?
● Why were the SACK’d bytes not freed?
24
tcp_fragment()
if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}
if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
skb != tcp_send_head(sk))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}
Fixing the vulnerability: patch
First regression fix: patch
~# nstat | grep
“TcpExtTCPWqueueTooBig”
TcpExtTCPWqueueTooBig 179
0.0
25
Retransmission Failures
int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
{
...
if (skb->len > cur_mss) {
if (tcp_fragment(sk, skb, cur_mss, GFP_ATOMIC))
return -ENOMEM; /* We’ll try again later. */
}
...
}
● cur_mss = 1412 (from ss)
● TCP Segmentation Offloading is enabled, so skb->len should be greater than 1412.
● TCPRETRANSFAIL = 143
26
Retransmission Failures
● “tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one
full [TCP Segmentation Offload] skb (64KB size).” - Linux commit message.
● sk->sk_wmem_queued = 103971 (from ss)
● sk->sk_sndbuf = 46080 (from ss)
● 103971 >> 1 = 51985 > 46080.
if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
skb != tcp_send_head(sk))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}
27
ACK Failures
tcp_ack
-tcp_sacktag_write_queue
-tcp_sacktag_walk
-tcp_shift_skb_data
-tcp_match_skb_to_sack
-tcp_fragment
● TCP fragmentation is invoked if a socket buffer does not completely overlap with the SACK’d region.
● tcp_shift_skb_dataaborts if underlying fragmentation fails.
28
tcp_fragment()
limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
skb != tcp_rtx_queue_head(sk) &&
skb != tcp_rtx_queue_tail(sk))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}
Third time’s a charm: patch
● sk->sk_wmem_queued = 103971 (from ss)
● sk->sk_sndbuf = 46080 (from ss)
● limit = 46080 + 2 * 65536 = 177152
○ Adds 128KB overhead to account for overshooting sk_wmem_queued by 64KB.
● 103971 >> 1 = 51985 < 177152.
29
Lessons Learned
● Consider the whole stack
○ We’ve hit 2 other, unrelated Linux bugs since.
● Benchmarks are just numbers; logging is vital.
● Know the right tools.
○ ss was a lifesaver.
30
Thanks!
Chris Stevens - linkedin.com/in/chriscstevens

More Related Content

What's hot

Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
Rob Skillington
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
John Conley
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
Wayne Chen
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Codemotion Tel Aviv
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Databricks
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
Steven Wu
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
Guozhang Wang
 

What's hot (20)

Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, Tikal
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
 

Similar to SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACKs vulnerability fixes

Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
Andriy Berestovskyy
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
Hajime Tazaki
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2
Chartbeat
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
Alex Maestretti
 
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecasesLF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OpenvSwitch
 
Mininet: Moving Forward
Mininet: Moving ForwardMininet: Moving Forward
Mininet: Moving ForwardON.Lab
 
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in GoCapturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
ScyllaDB
 
Troubleshooting TCP/IP
Troubleshooting TCP/IPTroubleshooting TCP/IP
Troubleshooting TCP/IP
vijai s
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
Brendan Gregg
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoEnkitec
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
Adrien Mahieux
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRC
Ganesan Narayanasamy
 
Polyraptor
PolyraptorPolyraptor
Polyraptor
MohammedAlasmar2
 
Troubleshooting basic networks
Troubleshooting basic networksTroubleshooting basic networks
Troubleshooting basic networks
Arnold Derrick Kinney
 
XS Boston 2008 Network Topology
XS Boston 2008 Network TopologyXS Boston 2008 Network Topology
XS Boston 2008 Network Topology
The Linux Foundation
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
PROIDEA
 
Stress your DUT
Stress your DUTStress your DUT
Stress your DUT
Redge Technologies
 
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PROIDEA
 

Similar to SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACKs vulnerability fixes (20)

Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecasesLF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
 
Mininet: Moving Forward
Mininet: Moving ForwardMininet: Moving Forward
Mininet: Moving Forward
 
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in GoCapturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
 
T.Pollak y C.Yaconi - Prey
T.Pollak y C.Yaconi - PreyT.Pollak y C.Yaconi - Prey
T.Pollak y C.Yaconi - Prey
 
Troubleshooting TCP/IP
Troubleshooting TCP/IPTroubleshooting TCP/IP
Troubleshooting TCP/IP
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl Arao
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRC
 
Polyraptor
PolyraptorPolyraptor
Polyraptor
 
Troubleshooting basic networks
Troubleshooting basic networksTroubleshooting basic networks
Troubleshooting basic networks
 
XS Boston 2008 Network Topology
XS Boston 2008 Network TopologyXS Boston 2008 Network Topology
XS Boston 2008 Network Topology
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
 
Stress your DUT
Stress your DUTStress your DUT
Stress your DUT
 
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
 

More from Chester Chen

SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
Chester Chen
 
zookeeer+raft-2.pdf
zookeeer+raft-2.pdfzookeeer+raft-2.pdf
zookeeer+raft-2.pdf
Chester Chen
 
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
Chester Chen
 
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
Chester Chen
 
A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?
Chester Chen
 
Shopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdataShopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdata
Chester Chen
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a Pro
Chester Chen
 
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleSF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
Chester Chen
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
SFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdapSFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdap
Chester Chen
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
Chester Chen
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Chester Chen
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
Chester Chen
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
Chester Chen
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
Chester Chen
 
Hspark index conf
Hspark index confHspark index conf
Hspark index conf
Chester Chen
 

More from Chester Chen (20)

SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
 
zookeeer+raft-2.pdf
zookeeer+raft-2.pdfzookeeer+raft-2.pdf
zookeeer+raft-2.pdf
 
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
 
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
 
A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?
 
Shopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdataShopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdata
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a Pro
 
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleSF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
SFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdapSFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdap
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 
Hspark index conf
Hspark index confHspark index conf
Hspark index conf
 

Recently uploaded

Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 

Recently uploaded (20)

Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 

SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACKs vulnerability fixes

  • 1. Adventures with TCP SACKs SF Big Analytics 1 Chris Stevens November 12, 2019
  • 2. 2 A kernel developer in Big Data ● ~4 years working on the Windows NT kernel ● ~4 years building Minoca OS - https://github.com/minoca/os ● ~2 years at Databricks working on Apache Spark resource management
  • 3. 3 Road Map ● A failing benchmark ● TCP SACKs explained ● Debugging TCP connections ● Root cause in the Linux Kernel
  • 5. 5 6x slower benchmark spark .range(10 * BILLION) .write .format(“parquet”) .mode(“overwrite”) .save(“dbfs:/tmp/test”)
  • 6. 6 Task Skew in the Spark UI
  • 7. 7 Socket timeouts in the logs org.apache.spark.SparkException: ... truncated ... Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
  • 8. 8 TCP SACKs Vulnerabilities ● June 17, 2019: CVE-2019-11477, CVE-2019-11478 & CVE-2019-11479 ● Discovered by Jonathan Looney (Netflix Information Security) ● Remote attacker can ... ○ Trigger a kernel panic (CVE-2019-11477): ■ BUG_ON(tcp_skb_pcount(skb) < pcount); ○ Cause resource exhaustion (CVE-2019-11478, CVE-2019-11479). ● Mitigations: ○ Disable TCP SACKs ■ Set /proc/sys/net/ipv4/tcp_sack to 0 ○ Update Linux kernel ■ Databricks updated on June 24, 2019
  • 9. 9 TCP Protocol Wikipedia (GNU Free Doc License)
  • 10. 10 TCP without SACKs Sender Receiver Segment 1 Segment 1 Received Segment 2 Segment 3 Segment 4 X Segments [1-2] Received Segment 3 Segment 4 Segments [1-4] Received
  • 11. 11 TCP with SACKs Sender Receiver Segment 1 Segment 1 Received Segment 2 Segment 3 Segment 4 X Segments [1-2, 4] Received Segment 3 Segments [1-4] Received
  • 12. 12 TCP SACKs Attack Sender Receiver Seq = 1, Length = 8 Seq = 9, Length = 8 ... Seq = 801, Length = 8 MSS = 48 SACK = [9-16, 25-32,..., 801-808]
  • 13. 13 TCP SACKs Resource Exhaustion ● SACK processing fragments Linux socket buffers ● Increases resource utilization to handle fragmentation (DoS) sk_buff nr_frags = 17 frags[MAX_SKB_FRAGS] skb_shared_info Frag 1 Frag 2 Frag 2 Frag 17 ... struct skb_frag_struct { Struct page *page; __u32 page_offset; __u32 size }
  • 14. 14 TCP SACKs Resource Exhaustion ● SACK processing fragments Linux socket buffers ● Increases resource utilization to handle fragmentation (DoS) sk_buff nr_frags = 1 frags[MAX_SKB_FRAGS] skb_shared_info Frag 1 Frag 2 Frag 2 Frag 17 ... sk_buff nr_frags = 17 frags[MAX_SKB_FRAGS] sk_shared_info
  • 15. 15 TCP SACKs Regressions ● CVE-2019-11478 ● June 17, 2019 ○ tcp: tcp_fragment() should apply sane memory limits ● June 21, 2019 ○ tcp: refine memory limit test in tcp_fragment() ● July 19, 2019 ○ tcp: be more careful in tcp_fragment() ○ Ubuntu updates not available until Sept, 2019
  • 16. 16 Is TCP SACKs the culprit? ● Benchmark regression aligned with Databricks’ June 24 kernel update. ● Impacted many versions of the Databricks Runtime (Apache Spark fork). ● Reverting just the kernel update fixed the benchmark. ○ No other platform updates at fault.
  • 18. 18 Debugging the live repro ~# netstat Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp6 816 99571 ip-10-0-144-224.u:42464 s3-us-west-2-r-w.:https CLOSE_WAIT Wikipedia (GNU Free Doc License)
  • 19. 19 Failed Retransmissions ~# ss -a -t -m -o --info CLOSE-WAIT 816 99571 ::ffff:10.0.144.224:42464 ::ffff:52.218.234.25:https timer: (on,1min37sec,13) skmem:(r9792,rb236184,t0,tb46080,f732,w103971,o0,bl0) sack cubic wscale:8,7 rto:120000 backoff:13 rtt:0.379/0.055 ato:40 mss:1412 cwnd:1 ssthresh:23 bytes_acked:1398838 bytes_received:14186 segs_out:1085 segs_in:817 send 29.8Mbps lastsnd:592672 lastrcv:552644 lastack:550640 pacing_rate 4759.3Mbps unacked:80 lost:24 sacked:56 rcv_space:26883 ● Retransmission is enabled ● It’s made 13 retransmission attempts. ● It will try another retransmission in 1 minute 37 seconds. ● No packets sent in the last 9 minute 51 seconds.
  • 20. 20 Connection Timeout Explained ● net.ipv4.tcp_retries2 defaults to 15 ● TCP_RTO_MIN defaults to 200 milliseconds ● TCP_RTO_MAX defaults to 120 seconds ● Retransmission timeout is 15 mins 24 seconds.
  • 21. 21 TCP Tracing sudo tcpdump -nnS net 52.218.0.0/16 -w awstcp.pcap ● 22,138 bytes were dropped (SACK Left Edge minus ACK) ● 40 seconds of silence (expected 7 retransmissions)
  • 22. 22 TCP Tracing sudo tcpdump -nnS net 52.218.0.0/16 -w awstcp.pcap ● Sender enters CLOSE_WAIT when FIN is received. ● Unacknowledged 22,138 bytes were never retransmitted ● 77,433 bytes were SACK’d. ● 22,138 + 77,433 = 99,571 (from netstat Send-Q). ● The SACK’d bytes were never freed.
  • 23. 23 Two Questions ● Why were the retransmissions failing? ● Why were the SACK’d bytes not freed?
  • 24. 24 tcp_fragment() if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); return -ENOMEM; } if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf && skb != tcp_send_head(sk))) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); return -ENOMEM; } Fixing the vulnerability: patch First regression fix: patch ~# nstat | grep “TcpExtTCPWqueueTooBig” TcpExtTCPWqueueTooBig 179 0.0
  • 25. 25 Retransmission Failures int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) { ... if (skb->len > cur_mss) { if (tcp_fragment(sk, skb, cur_mss, GFP_ATOMIC)) return -ENOMEM; /* We’ll try again later. */ } ... } ● cur_mss = 1412 (from ss) ● TCP Segmentation Offloading is enabled, so skb->len should be greater than 1412. ● TCPRETRANSFAIL = 143
  • 26. 26 Retransmission Failures ● “tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full [TCP Segmentation Offload] skb (64KB size).” - Linux commit message. ● sk->sk_wmem_queued = 103971 (from ss) ● sk->sk_sndbuf = 46080 (from ss) ● 103971 >> 1 = 51985 > 46080. if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf && skb != tcp_send_head(sk))) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); return -ENOMEM; }
  • 27. 27 ACK Failures tcp_ack -tcp_sacktag_write_queue -tcp_sacktag_walk -tcp_shift_skb_data -tcp_match_skb_to_sack -tcp_fragment ● TCP fragmentation is invoked if a socket buffer does not completely overlap with the SACK’d region. ● tcp_shift_skb_dataaborts if underlying fragmentation fails.
  • 28. 28 tcp_fragment() limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE); if (unlikely((sk->sk_wmem_queued >> 1) > limit && skb != tcp_rtx_queue_head(sk) && skb != tcp_rtx_queue_tail(sk))) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); return -ENOMEM; } Third time’s a charm: patch ● sk->sk_wmem_queued = 103971 (from ss) ● sk->sk_sndbuf = 46080 (from ss) ● limit = 46080 + 2 * 65536 = 177152 ○ Adds 128KB overhead to account for overshooting sk_wmem_queued by 64KB. ● 103971 >> 1 = 51985 < 177152.
  • 29. 29 Lessons Learned ● Consider the whole stack ○ We’ve hit 2 other, unrelated Linux bugs since. ● Benchmarks are just numbers; logging is vital. ● Know the right tools. ○ ss was a lifesaver.
  • 30. 30 Thanks! Chris Stevens - linkedin.com/in/chriscstevens