(NET404) Making Every Packet Count
Kevin Miller, Sr. Manager, EC2 Networking
October 2015
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Many applications are network I/O bound, including common database-based applications and service-based architectures. But operating systems and applications are often untuned to deliver high performance. This session uncovers hidden issues that lead to low network performance, and shows you how to overcome them to obtain the best network performance possible.

  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
     Kevin Miller, Sr. Manager, EC2 Networking
     October 2015
     NET404: Making Every Packet Count
  2. What to Expect from this Session: Tuning TCP on Linux (TCP, Performance, Application)
  3. What to Expect from this Session: Application. Watch us increase network performance 137%.
  4. TCP
  5. TCP
     • Transmission Control Protocol
     • Underlies SSH, HTTP, *SQL, SMTP
     • Stream delivery, flow control
  6. TCP (diagram: Jack and Jill)
  7. (diagram: Jack and Jill)
  8. Limiting in-flight data (diagram: Jack and Jill; receive window, congestion window, round trip time)
  9. Bandwidth delay product (diagram: Jack and Jill, 2 ms round-trip time)
 10. Bandwidth delay product (diagram: Jack and Jill, 100 ms round-trip time)
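     A quick worked example may make the bandwidth delay product concrete. The numbers below are illustrative (a 10 Gbps path), not taken from the session:

     # In-flight bytes needed to fill the pipe = bandwidth (bytes/s) x RTT (s)
     $ echo $(( 10 * 1000 * 1000 * 1000 / 8 * 2 / 1000 ))    # 2 ms RTT -> ~2.4 MB
     2500000
     $ echo $(( 10 * 1000 * 1000 * 1000 / 8 * 100 / 1000 ))  # 100 ms RTT -> ~119 MB
     125000000

     The same bandwidth needs roughly 50x more data in flight at 100 ms than at 2 ms, which is why window sizes matter so much on high-RTT paths.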
 11. Receive window: receiver controlled, signaled to the sender
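     The receive window a host can advertise is bounded by its receive buffer. One way to inspect the kernel's autotuning range (the values shown are common defaults, not figures from the session):

     $ sysctl net.ipv4.tcp_rmem
     net.ipv4.tcp_rmem = 4096 87380 6291456
     # min, default, and max receive buffer in bytes; the kernel autotunes
     # within this range, and TCP window scaling (the wscale field in the ss
     # output later) lets the advertised window exceed 64 KB.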
 12. Congestion window (diagram: Jack and Jill; receive window, congestion window, round trip time)
 13. Congestion window
     • Sender controlled
     • Window is managed by the congestion control algorithm
     • Inputs vary by algorithm
 14. Initial congestion window
     $ ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link
     169.254.169.254 dev eth0 scope link
     (diagram: 1448 + 1448 + 1448 = 4344 bytes)
 15. Initial congestion window
     # ip route change 10.16.16.0/24 dev eth0 proto kernel scope link initcwnd 16
     $ ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link initcwnd 16
     169.254.169.254 dev eth0 scope link
     (diagram: 1448 + 1448 + 1448 + 1448 [+ 12 more] = 23168 bytes)
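     To see why a larger initial window helps short transfers, compare how much data fits in the first round trip (arithmetic only; the 1448-byte segment size is from the slides):

     $ echo $(( 1448 * 3 ))     # three-segment initial window shown above
     4344
     $ echo $(( 1448 * 16 ))    # initcwnd 16
     23168
     # A response of roughly 20 KB fits in a single round trip at initcwnd 16,
     # but needs additional rounds of slow start at the smaller window.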
 16. Impact of loss on TCP throughput (chart: throughput vs. loss rate, 0% to 10%)
 17. Loss is visible as TCP retransmissions
     $ netstat -s | grep retransmit
     58496 segments retransmitted
     52788 fast retransmits
     135 forward retransmits
     3659 retransmits in slow start
     392 SACK retransmits failed
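     These counters are cumulative since boot. To watch the retransmission rate while a test runs, a couple of options (not from the session):

     $ nstat TcpRetransSegs        # by default nstat reports the change since its previous run
     $ watch -d -n 1 'netstat -s | grep -i retrans'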
 18.-23. Socket level diagnostic (six slides, each highlighting one field of the same output)
     $ ss -ite
     State Recv-Q Send-Q Local Address:Port Peer Address:Port
     ESTAB 0 3829960 10.16.16.18:https 10.16.16.75:52008 timer:(on,012ms,0) uid:498 ino:7116021 sk:0001c286 <->
     ts sack cubic wscale:7,7 rto:204 rtt:1.423/0.14 ato:40 mss:1448 cwnd:138 ssthresh:80 send 1123.4Mbps unacked:138 retrans:0/11737 rcv_space:26847
     Highlighted fields: ESTAB (TCP state), Send-Q (bytes queued for transmission), cubic (congestion control algorithm), rto (retransmission timeout), cwnd (congestion window), retrans (retransmissions)
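     On a busy host, ss filters narrow the output to the connections of interest. A sketch (the peer address is the one from the slide, purely illustrative):

     $ ss -ite dst 10.16.16.75                          # sockets to a specific peer
     $ ss -ite state established '( sport = :https )'   # established sockets served on HTTPS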
 24. Monitoring retransmissions in real time
     • Observable using Linux kernel tracing
     # tcpretrans
     TIME PID LADDR:LPORT -- RADDR:RPORT STATE
     03:31:07 106588 10.16.16.18:443 R> 10.16.16.75:52291 ESTABLISHED
     https://github.com/brendangregg/perf-tools/
 25. Congestion control algorithm (diagram: Jack and Jill)
 26. Congestion control algorithms in Linux
     • New Reno: pre-2.6.8
     • BIC: 2.6.8 – 2.6.18
     • CUBIC: 2.6.19+
     • Pluggable architecture
     • Other algorithms often available: Vegas, Illinois, Westwood, Highspeed, Scalable
 27. Tuning congestion control algorithm
     $ sysctl net.ipv4.tcp_available_congestion_control
     net.ipv4.tcp_available_congestion_control = cubic reno
     $ find /lib/modules -name tcp_*
     […]
     # modprobe tcp_illinois
     $ sysctl net.ipv4.tcp_available_congestion_control
     net.ipv4.tcp_available_congestion_control = cubic reno illinois
 28. Tuning congestion control algorithm
     # sysctl net.ipv4.tcp_congestion_control=illinois
     net.ipv4.tcp_congestion_control = illinois
     # echo "net.ipv4.tcp_congestion_control = illinois" > /etc/sysctl.d/01-tcp.conf
     [Restart network processes]
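     The modprobe shown above only loads tcp_illinois for the running kernel, and the mechanism for loading it at boot varies by distribution. A sketch for a systemd-based system (an assumption; the Amazon Linux AMI used later in the session predates systemd), plus a per-connection check:

     # echo tcp_illinois > /etc/modules-load.d/tcp_illinois.conf   # load the module at boot
     # sysctl -p /etc/sysctl.d/01-tcp.conf                         # re-apply the setting now
     $ ss -ti                                                      # each connection lists its algorithm (e.g. illinois)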
 29. Retransmission timer
     • Input to when the congestion control algorithm considers a packet lost
     • Too low: spurious retransmission; congestion control can over-react and be slow to re-open the congestion window
     • Too high: increased latency while the algorithm determines a packet is lost and retransmits
 30. Tuning retransmission timer minimum
     • Default minimum: 200 ms
     # ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link
     169.254.169.254 dev eth0 scope link
     (callout: the 10.16.16.0/24 entry is the route to other instances in our subnet, same AZ)
 31. Tuning retransmission timer minimum
     # ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link
     169.254.169.254 dev eth0 scope link
     # ip route change 10.16.16.0/24 dev eth0 proto kernel scope link rto_min 10ms
     # ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link rto_min lock 10ms
     169.254.169.254 dev eth0 scope link
 32. Queueing along the network path (diagram: Jack and Jill)
 33. Queueing along the network path
     • Intermediate routers along a path have interface buffers
     • High load leads to more packets in buffer
     • Latency increases due to queue time
     • Can trigger retransmission timeouts
 34. Active queue management
     $ tc qdisc list
     qdisc mq 0: dev eth0 root
     qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 […]
     qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 […]
     # tc qdisc add dev eth0 root fq_codel
     qdisc fq_codel 8006: dev eth0 root refcnt 9 limit 10240p flows 1024 quantum 9015 target 5.0ms interval 100.0ms ecn
     http://www.bufferbloat.net/projects/codel/wiki
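     Replacing the root qdisc with tc does not persist across reboots. On kernels that support it (3.12 and later, which includes the 4.1 kernel used in the tests), one general Linux option, not shown in the session, is to set the default qdisc via sysctl:

     # sysctl -w net.core.default_qdisc=fq_codel
     # echo "net.core.default_qdisc = fq_codel" >> /etc/sysctl.d/01-tcp.conf
     # Interfaces whose qdiscs are created after this point come up with fq_codel.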
 35. Amazon EC2 enhanced networking (diagram: Jack and Jill)
 36. Amazon EC2 enhanced networking
     • Higher I/O (packets per second) performance
     • Lower CPU utilization
     • Lower inter-instance latency
     • Low network jitter
     • Instance families: M4, C4, C3, R3, I2, D2 (w/ HVM)
     • Drivers built into Windows, Amazon Linux AMIs
     • Questions? re:Invent 2014 – SDD419
 37. Maximum transmission unit
     • 3.47% overhead (1448B payload) vs. 0.58% overhead (8949B payload)
     • Improvement seen among instances in your VPC
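     The overhead figures follow from the per-segment header bytes. A quick check, assuming 52 bytes of IP and TCP headers per segment (20-byte IP plus a 32-byte TCP header with timestamps; the 52-byte figure is my inference from the slide's numbers, not stated in the session):

     $ echo 'scale=4; 52 * 100 / 1500' | bc    # header share of a 1500-byte frame
     3.4666
     $ echo 'scale=4; 52 * 100 / 9001' | bc    # header share of a 9001-byte frame
     .5777

     Roughly 3.47% versus 0.58%, matching the slide.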
 38. Tuning maximum transmission unit
     # ip link list
     2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
     link/ether 06:f1:b7:e1:3b:e7
     # ip route list
     default via 10.16.16.1 dev eth0
     10.16.16.0/24 dev eth0 proto kernel scope link
     169.254.169.254 dev eth0 scope link
 39. Tuning maximum transmission unit
     # ip route change default via 10.16.16.1 dev eth0 mtu 1500
     # ip route list
     default via 10.16.16.1 dev eth0 mtu 1500
     10.16.16.0/24 dev eth0 proto kernel scope link
     169.254.169.254 dev eth0 scope link
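     Before relying on jumbo frames between two hosts, it can be worth confirming the path actually carries them. A sketch (the peer address is illustrative):

     # 9001-byte MTU minus 20 bytes IP and 8 bytes ICMP headers = 8973-byte payload
     $ ping -M do -s 8973 -c 3 10.16.16.75
     # Replies mean the 9001-byte path works; "Frag needed" or total loss suggests
     # a smaller path MTU. tracepath 10.16.16.75 reports the path MTU hop by hop.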
 40. Applying our new knowledge
 41. Test setup
     • m4.10xlarge instances: Jack and Jill
     • Amazon Linux 2015.09 (kernel 4.1.7-15.23.amzn1)
     • Web server: nginx 1.8.0
     • Client: ApacheBench 2.3
     • TLSv1, ECDHE-RSA-AES256-SHA, 2048, 256
     • Transferring incompressible data (random bits)
     • Origin data stored in tmpfs (RAM based; no server disk I/O)
     • Data discarded once retrieved (no client disk I/O)
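     One way to reproduce the origin-data setup; the mount point and file names are assumptions, not taken from the session:

     $ sudo mount -t tmpfs -o size=12g tmpfs /usr/share/nginx/html/data
     $ sudo dd if=/dev/urandom of=/usr/share/nginx/html/data/100m bs=1M count=100
     $ sudo dd if=/dev/urandom of=/usr/share/nginx/html/data/1g bs=1M count=1024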
 42. Example Apache Bench output
     [ … ]
     Concurrency Level: 100
     Time taken for tests: 59.404 seconds
     Complete requests: 10000
     Failed requests: 0
     Write errors: 0
     Total transferred: 104900000 bytes
     HTML transferred: 102400000 bytes
     Requests per second: 168.34 [#/sec] (mean)
     Time per request: 594.038 [ms] (mean)
     Time per request: 5.940 [ms] (mean, across all concurrent requests)
     Transfer rate: 1724.49 [Kbytes/sec] received
     [ … ]
 43. Application 1: HTTPS with intermediate network loss (diagram: Jack and Jill, 0.2% loss)
 44. Test setup
     • 1 test server instance, 1 test client instance
     • 80 ms RTT
     • 160 parallel clients retrieving a 100 MB object 5 times
       $ ab -n 100 -c 20 https://server/100m [* 8]
     • Simulated packet loss
       # tc qdisc add dev eth0 root netem loss 0.2%
     Goal: Minimize throughput impact with 0.2% loss
 45. Results – application 1
     Test                                            Bandwidth    Mean time
     All defaults – no loss                          4163 Mbps    27.9s
     All defaults – 0.2% simulated loss              1469 Mbps    71.8s
     Increased initial congestion window w/ loss     1328 Mbps    80.6s
     Doubled server-side TCP buffers w/ loss         1366 Mbps    78.6s
     Illinois congestion control algorithm w/ loss   3486 Mbps    28.2s
     137% increase in performance!
 46. Application 2: Bulk data transfer; high RTT path (diagram: Jack and Jill)
 47. Test setup
     • 1 test server instance, 1 test client instance
     • 80 ms RTT
     • 8 parallel clients retrieving a 1 GB object 2 times
       $ ab -n 2 -c 1 https://server/1g [* 8]
     Goal: Maximize the throughput / minimize transfer time
 48. Results – application 2
     Test                                          Bandwidth    Mean time
     All defaults                                  2164 Mbps    30.4s
     Doubled TCP buffers on server end             1780 Mbps    37.4s
     Doubled TCP buffers on client end             2462 Mbps    27.6s
     Active queue management on server             2249 Mbps    29.3s
     Client buffers + AQM                          2730 Mbps    24.5s
     Illinois CC + client buffers + AQM            2847 Mbps    23.0s
     Illinois CC + server & client buffers + AQM   2865 Mbps    23.5s
     32% increase in performance!
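     The slides do not show exactly how the TCP buffers were doubled. One plausible reading is doubling the autotuning maximums of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem; the values below simply double common defaults and are illustrative only:

     # sysctl -w net.ipv4.tcp_rmem="4096 87380 12582912"
     # sysctl -w net.ipv4.tcp_wmem="4096 16384 8388608"
     # Larger maximums mainly help this high-BDP (80 ms) path; note in the table
     # above that raising only the server-side buffers actually hurt.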
 49. Application 3: Bulk data transfer; low RTT path (diagram: Jack and Jill)
 50. Test setup
     • 1 test server instance, 1 test client instance
     • 1.2 ms RTT
     • 8 parallel clients retrieving a 10 GB object 2 times
       $ ab -n 2 -c 1 https://server/100m [* 8]
     • Start at Internet default MTU, then increase
     Goal: Maximize the throughput / minimize transfer time
 51. Results – application 3
     Test                               Bandwidth    Mean time
     All defaults + 1500B MTU           8866 Mbps    74.0s
     9001B MTU                          9316 Mbps    70.4s
     Active queue management (+ MTU)    9316 Mbps    70.4s
     5% increase
 52. Application 4: High transaction rate HTTP service (diagram: Jack and Jill)
 53. Test setup
     • 1 test server instance, 1 test client instance
     • 80 ms RTT
     • HTTP, not HTTPS
     • 6400 parallel clients retrieving a 10k object 100 times
       $ ab -n 20000 -c 200 http://server/10k [* 32]
     Goal: Minimize latency
 54. Results – application 4
     Test                                       Bandwidth    Mean time
     All defaults                               2580 Mbps    195.3ms
     Initial congestion window: 16 packets      2691 Mbps    189.2ms
     Illinois CC + initial congestion window    2649 Mbps    186.2ms
     4.6% decrease in mean time
 55. Take-aways
 56. Take-aways
     • The network doesn’t have to be a black box: Linux tools can be used to interrogate and understand it
     • Simple tweaks to settings can dramatically increase performance: test, measure, change
     • Understand what your application needs from the network, and tune accordingly
 57. Remember to complete your evaluations!
 58. Thank you!
