0
HPC
         TCP/IP




  2009 5 29   SACSIS2009
(2)



            HPC      TCP/IP



•
    •                    1 Gbps
        RTT 100

        •                 240 Mbp...
(2)



                                                HPC                                                        TCP/IP

...
(3)




•                      TCP/IP


    •
    • MPI

• TCP/IP    TCP

 • TCP      Reno [RFC2851]

 •          TCP   CU...
(4)




• TCP/IP
• TCP/IP
• TCP
• TCP/IP MPI
•
• TCP/IP
•

                 5
(5)




TCP/IP
(6)



TCP (Transport Control Protocol)

 •


 •



                                     7
(7)



        TCP

•             •
    •             •
    •             •
    • SACK            •
                      ...
(8)



        TCP
        •
            •
            •
                ACK        ACK

        •                        ...
(9)




•                    RTO

    •         RTT α           ACK

•
    • ACK            3

    • 1 RTT                ...
(10)



TCP
•
• ACK           1 RTT


                                        RTT


               #8   #7    #6     #5
  ...
(11)



TCP
•
    •
•
    •




          12
(13)




•
➡               rwin

 •        ACK

 •                            rwin

                #8     #7   #6    #5
 ...
(-)




     •

     •
     •
     ➔

                   B                               B/2
#5       #4   #3       #2   #...
(14)




•
➡               cwnd

  •
  •             cwnd


                #8     #7   #6     #5
      #8   #1           ...
(15)




      •
      •        cwnd

          •2
 cwnd
 w



w/2
                      ssthresh


                      ...
(16)




      •              cwnd


      •       cwnd

          •                 →
 cwnd
 w



w/2
                   ...
(12)




•                 rwin        ACK   TCP

•                 cwnd

‣

     =min(cwnd, rwin)
                  #8   ...
(17)



         ACK
 •
          RTT

     •


ACK




ACK




                  19
(17)



        ACK
•
           RTT

    •
• ACK
 •
(1) data



                 (2) ACK

                             20
(34)




TCP/IP
(35)



                                          TCP
• RTT   100


                    240 Mbps



                      ...
(36)




•              ×

    •              1Gbps                     50
            RTT 100                     6.25MB
...
(37)




•                         x RTT

    • 1 Gbps × 50    x 2 = 12.5 MB

    •
•                                     ...
(-)



        TCP

•
    ‣ sysctl
•
    ‣          API




                       25
(39)



sysctl                                                    net.*
                  sysctl                          ...
(38)



              sysctl
•
       $ sysctl net.core.wmem_max
       net.core.wmem_max = 131071

       # sysctl -w net...
(41)




   •       Linux


   •                            tcp_wmem
       tcp_rmem

       •               cwnd       4M...
(40)



                                           API
•
• setsockopt(2)                  getsockopt(2)
  int ssiz; /* sen...
(42)



    setsockopt

• listen connect
• Linux                     2

 • man tcp(7)                          TCP




   ...
(43)




      •
      •
                                 1000

                                  900

                   ...
(44)




•                                                      QoS
    NIC

•
•                       1000

             ...
(45)




•                                                      tc
    $ tc -s qdisc show dev eth0
    qdisc pfifo_fast 0: ...
(71)



                                              netstat -s
            •
            •                         Linux...
(46)




TCP
(47)




•       TCP                               cwnd


    •              cwnd

    •
                 1Gbps    RTT 200...
(48)




•
    • PSockets   GridFTP (BitTrrent Swarming)

•
    •
        •                 TCP           TCP Friendly

  ...
(50)




•
•

•                        Inter-protocol fairness

    • TCP Friendliness      TCP

•                        ...
(49)



  TCP
  NewReno                   RFC 2852
HighSpeed TCP               RFC 3649
 Scalable TCP
     BIC
   CUBIC   ...
(51)



                       CUBIC
•
            cwnd     Wmax


• RTT                    cubic
                        ...
(52)



             Compound TCP

• Windows Vista                loss-based (cwnd)

•                                    ...
(53)



                        TCP
•                 TCP

    •
    •
        •
    •
        •
•
•           HPC        ...
(54)




    • RTT
    •
    •
        →
                             RTT: 200
        RTT: 200
        BW: 500 Mbps


S  ...
(55)




• Limited slow start [RFC 3742]
 •   max_ssthresh < cwnd <= ssthresh     RTT
     cwnd      max_ssthresh/2

• Swi...
(18)




TCP/IP MPI
(19)



MPI                      TCP

 • MPI                  TCP
          e.g.

  •
  •
      •



                 Time...
(20)



                                  PC
8 PCs                                                              8 PCs
    ...
(-)




    8 PCs       2                     10MB                                   8 PCs
                               ...
(21)




                    1000

                     800                                                               ...
(22)




       •
           •   Linux   RTO                  cwnd        [RFC 2861]

       • MPI
        • HTTP/1.1
    ...
(23)




•
    •
    • cwnd                                    ACK
                                        →

            ...
(24)




• ACK

 •   ACK                        RTT

 •                RTT cwnd

           cwnd           RTT cwnd




  ...
(25)




                   1000

                    800
Bandwidth (Mbps)




                    600
                   ...
(-)




8 PCs                                                              8 PCs
                             WAN         ...
(29)



                                         All-to-all
      •
      • MPI
      •
                        data

    ...
(28)




                                   1
                            250


                            200           ...
(30)



                        RTO
    •           2                             RTO




    • TCP
            1         ...
(31)



           RTO
• RTO = RTT + α
 •α →
 •α →
                                   }
• RTT 0                  RTO      ...
(32)



                                                    RTO
                                 1
                       ...
(-)




                         1
                   250
                                                                ...
(33)



      MPI   TCP/IP



            ACK
RTO



            cwnd

            MPI      TCP           ,
              ...
(65)
(57)




  63
(58)


    PSPacer


•

• Linux                          OS


•                GPL


    http://www.gridmpi.org/pspacer.js...
(63)




•             500Mbps


•




    PSPacer             PSPacer   500Mbps
                                         ...
(64)



                      TCP
      Sender                                     Receiver

 A
                          ...
(62)



                  PSPacer               10GbE
                          100Mbps       Iperf 5
         GtrcNET-10
...
(66)




•

    •
• ip
        $ ip route show cached to 192.168.0.2
        192.168.0.2 from 192.168.0.1 dev eth1
       ...
(67)




•
    •
        GbE MTU 1500B → 949.3 Mbps

        (Gb       MTU 9000B → 991.4 Mbps
•                           ...
(68)




•   Infiniband      Myrinet

        •

•                                        IEEE 802.3x

    •              ...
(69)




TCP/IP
(70)



Rough consensus, running code
 •                                RFC

     •        Reno [RFC 2851]

 •
     • net/...
(72)



                    Web100
•        Pittsburgh Supercomputing Center (PSC)


•                                    ...
(72)



                 TCP Probe
• KProbes            TCP



    •
            procfs

•
•



                          ...
(73)



TCP Probe
                              cwnd ssthresh




   http://www.linuxfoundation.org/en/Net:TCP_testing
   ...
(76)




1.                                 ×RTT
                         4MB

2. setsockopt   listen   connect

3.


4.  ...
(77)




1.
           tcp_slow_start_after_idle

2.


3.               RTO
     RTO

4.


5.


                          ...
TCP
              ARPANET                                  NCP
              TCP IP         (1970           )

           ...
RFC
                   Spec.                                  sysctl             default
                TCP (RFC 793)    ...
URL
    Project                               URL
  BIC/CUBIC        http://netsrv.csc.ncsu.edu/twiki/bin/view/Main/BIC
  ...
SACSIS2009_TCP.pdf
Upcoming SlideShare
Loading in...5
×

SACSIS2009_TCP.pdf

952

Published on

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
952
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
12
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "SACSIS2009_TCP.pdf"

  1. 1. HPC TCP/IP 2009 5 29 SACSIS2009
  2. 2. (2) HPC TCP/IP • • 1 Gbps RTT 100 • 240 Mbps ☹ • 940 Mbps ☺ 2
  3. 3. (2) HPC TCP/IP • Ethernet PC • TCP/IP 2 10MB 1000 1000 800 x3.5 800 Bandwidth (Mbps) Bandwidth (Mbps) 160 600 600 400 400 200 200 0 0 0 100 200 300 400 500 600 0 100 200 300 400 500 600 Time (msec) Time (msec) 3
  4. 4. (3) • TCP/IP • • MPI • TCP/IP TCP • TCP Reno [RFC2851] • TCP CUBIC Compound TCP 4
  5. 5. (4) • TCP/IP • TCP/IP • TCP • TCP/IP MPI • • TCP/IP • 5
  6. 6. (5) TCP/IP
  7. 7. (6) TCP (Transport Control Protocol) • • 7
  8. 8. (7) TCP • • • • • • • SACK • • • ACK 8
  9. 9. (8) TCP • • • ACK ACK • ACK #5 #4 #3 #2 #1 #5 #1 #2 #3 #4 #5 #6 ※ ACK 1 1 RTT (Round Trip Time) 9
  10. 10. (9) • RTO • RTT α ACK • • ACK 3 • 1 RTT RTO +α #5 #4 #3 #2 #1 #2 #3 #3 #3 ACK ACK RTT (Round Trip Time) 10
  11. 11. (10) TCP • • ACK 1 RTT RTT #8 #7 #6 #5 #8 #1 #9 #4 #1 #2 #3 #4 #5 ACK RTT (Round Trip Time) 11
  12. 12. (11) TCP • • • • 12
  13. 13. (13) • ➡ rwin • ACK • rwin #8 #7 #6 #5 #8 #1 rwin #9 #4#3 #2 #3 #4 #5 ACK RTT (Round Trip Time) 13
  14. 14. (-) • • • ➔ B B/2 #5 #4 #3 #2 #1 #7 ... #4 #3 #2 #1 T 2T 14
  15. 15. (14) • ➡ cwnd • • cwnd #8 #7 #6 #5 #8 #1 rwin #9 #4#3 cwnd #2 #3 #4 #5 ACK RTT (Round Trip Time) 15
  16. 16. (15) • • cwnd •2 cwnd w w/2 ssthresh time 16
  17. 17. (16) • cwnd • cwnd • → cwnd w w/2 ssthresh time 17
  18. 18. (12) • rwin ACK TCP • cwnd ‣ =min(cwnd, rwin) #8 #7 #6 #5 #8 #1 rwin #9 #4#3 #2 #3 #4 #5 cwnd ACK RTT (Round Trip Time) 18
  19. 19. (17) ACK • RTT • ACK ACK 19
  20. 20. (17) ACK • RTT • • ACK • (1) data (2) ACK 20
  21. 21. (34) TCP/IP
  22. 22. (35) TCP • RTT 100 240 Mbps : 1 Gbps RTT: 100 S R GbE GbE 22
  23. 23. (36) • × • 1Gbps 50 RTT 100 6.25MB : 50 : 1 Gbps x 50 = 50 Mbit 1 Gbps = 6.25 MB 2 23
  24. 24. (37) • x RTT • 1 Gbps × 50 x 2 = 12.5 MB • • MB 6.25MB 6.25MB 12.5MB out- RTT (Round Trip Time) of-order 24
  25. 25. (-) TCP • ‣ sysctl • ‣ API 25
  26. 26. (39) sysctl net.* sysctl default net.core.rmem_default 109568 net.core.rmem_max 131071 net.core.wmem_default 109568 net.core.wmem_max 131071 net.core.netdev_max_backlog 1000 net.ipv4.tcp_rmem TCP [min, default, max] 4096, 87380, 4194304 net.ipv4.tcp_wmem TCP [min, default, max] 4096, 16384, 4194304 net.ipv4.tcp_mem [low,pressure, high] 98304, 131072, 196608 net.ipv4.tcp_no_metrics_save 0 net.ipv4.tcp_sack SACK 1 net.ipv4.tcp_congestion_control TCP cubic ※ 1GB 26
  27. 27. (38) sysctl • $ sysctl net.core.wmem_max net.core.wmem_max = 131071 # sysctl -w net.core.wmem_max=1048576 net.core.wmem_max=1048576 • Linux procfs $ cat /proc/sys/net/core/wmem_max 131071 # echo 1048576 > /proc/sys/net/core/wmem_max • $ sysctl -p # /etc/sysctl.conf $ sysctl -a | grep tcp # 27
  28. 28. (41) • Linux • tcp_wmem tcp_rmem • cwnd 4MB TCP min default cwnd max net.ipv4.tcp_wmem[0] tcp_wmem[1] tcp_wmem[2] (4KB) (16KB) (4MB) ※ 28
  29. 29. (40) API • • setsockopt(2) getsockopt(2) int ssiz; /* send buffer size */ cc = setsockopt(so, SOL_SOCKET, SO_SNDBUF, &ssiz, sizeof(ssiz)); socklen_t optlen = sizeof(csiz); cc = getsockopt(so, SOL_SOCKET, SO_SNDBUF, &csiz, &optlen); SOL_SOCKET SO_SNDBUF SOL_SOCKET SO_RCVBUF • SO_SNDBUF • SO_SNDBUF net.core.wmem_max 29
  30. 30. (42) setsockopt • listen connect • Linux 2 • man tcp(7) TCP Iperf $ iperf -w 1m -c 192.168.0.2 ------------------------------------------------------------ Client connecting to 192.168.0.2, TCP port 5001 TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte) ------------------------------------------------------------ 30
  31. 31. (43) • • 1000 900 800 700 Bandwidth (Mbps) 600 500 400 300 : 1 Gbps 200 RTT: 100 100 TCP: CUBIC 0 sndbuf=12.5MB 0 10 20 30 40 50 60 Time (Second) 31
  32. 32. (44) • QoS NIC • • 1000 900 800 700 Bandwidth (Mbps) 600 500 400 300 200 100 sndbuf=12.5MB txqueuelen=10000 sndbuf=12.5MB NIC 0 0 10 20 30 40 50 60 Time (Second) NIC: Network Interface Card 32
  33. 33. (45) • tc $ tc -s qdisc show dev eth0 qdisc pfifo_fast 0: root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 41825726768 bytes 1383455 pkt (dropped 72, overlimits 0 requeues 25164) rate 0bit 0pps backlog 0b 0p requeues 25164 • ifconfig $ ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:22:64:10:cc:9d inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0 inet6 addr: fe80::222:64ff:fe10:cc9d/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:795673 errors:0 dropped:0 overruns:0 frame:0 TX packets:163559 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:72212519 (72.2 MB) TX bytes:29126498 (29.1 MB) Interrupt:17 $ ifconfig eth0 txqueuelen 10000 33
  34. 34. (71) netstat -s • • Linux net-tools $ netstat -s TcpExt: [snip] 55 resets received for embryonic SYN_RECV sockets Tcp: 108 TCP sockets finished time wait in fast timer 49 active connections openings 10969 delayed acks sent 4925 passive connection openings Quick ack mode was activated 67 times 56 failed connection attempts 7961 packets directly queued to recvmsg prequeue. 12 connection resets received 43 bytes directly received in process context from prequeue 1 connections established 69443 packet headers predicted 226842 segments received 11162 acknowledgments not containing data payload received 161944 segments send out 110524 predicted acknowledgments 210 segments retransmited 4 times recovered from packet loss by selective acknowledgements 1 bad segments received. 37 congestion windows recovered without slow start after partial ack 575 resets sent 4 TCP data loss events [snip] 6 timeouts after SACK recovery 1 timeouts in loss state 5 fast retransmits 5 retransmits in slow start 167 other TCP timeouts [snip] 34
  35. 35. (46) TCP
  36. 36. (47) • TCP cwnd • cwnd • 1Gbps RTT 200ms MTU 1500B cwnd 16000 8000 8000 RTT ≒ 27 Time 36
  37. 37. (48) • • PSockets GridFTP (BitTrrent Swarming) • • • TCP TCP Friendly • • • 37
  38. 38. (50) • • • Inter-protocol fairness • TCP Friendliness TCP • Intra-protocol fairness • RTT RTT • Reno RTT 38
  39. 39. (49) TCP NewReno RFC 2852 HighSpeed TCP RFC 3649 Scalable TCP BIC CUBIC Linux H-TCP Vegas Westwood+ FAST Illinois YeAH Compound TCP Windows Vista ≒ ≒ RTT2 > RTT1 39
  40. 40. (51) CUBIC • cwnd Wmax • RTT cubic cwnd cwnd Steady state behavior Probing Wmax 3 3 Wmax β Wcubic = C T− + Wmax C Wmin Time 40
  41. 41. (52) Compound TCP • Windows Vista loss-based (cwnd) • Reno + delay-based (dwnd) • • Reno RTT • = cwnd + dwnd • dwnd 0 Reno loss free period 41
  42. 42. (53) TCP • TCP • • • • • • • HPC TCP http://code.google.com/p/pspacer/wiki/DonkanTcp 42
  43. 43. (54) • RTT • • → RTT: 200 RTT: 200 BW: 500 Mbps S R ( ) GbE GbE 500Mbps 1Gbps 43
  44. 44. (55) • Limited slow start [RFC 3742] • max_ssthresh < cwnd <= ssthresh RTT cwnd max_ssthresh/2 • Swift Start • Packet-pair probing ACK • HyStart (Hybrid slow start) • RTT • Linux 2.6.29 CUBIC 44
  45. 45. (18) TCP/IP MPI
  46. 46. (19) MPI TCP • MPI TCP e.g. • • • Time Time 46
  47. 47. (20) PC 8 PCs 8 PCs WAN Node8 Node0 Host 0 Host 0 Host 0 GtrcNET-1 Host 0 Catalyst 3750 Catalyst 3750 Host 0 Host 0 ……… ……… Node7 Node15 • CPU: Pentium4/2.4GHz • Memory: DDR400 512MB • NIC: Intel PRO/1000 (82547EI) • OS: Linux-2.6.9-1.6 (Fedora Core 2) • Socket buffer: 20MB 47
  48. 48. (-) 8 PCs 2 10MB 8 PCs WAN Node8 Node0 Host 0 Host 0 Host 0 GtrcNET-1 Host 0 Catalyst 3750 Catalyst 3750 Host 0 Host 0 ……… ……… Node7 RTT: 20 ms Node15 : 500 Mbps • CPU: Pentium4/2.4GHz • Memory: DDR400 512MB • NIC: Intel PRO/1000 (82547EI) 1 • OS: Linux-2.6.9-1.6 (Fedora Core 2) • Socket buffer: 20MB 48
  49. 49. (21) 1000 800 2 10MB Bandwidth (Mbps) 600 400 200 0 0 2 4 6 8 10 • Time (sec) 160 10MB 1000 500Mbps • 800 x3.5 Bandwidth (Mbps) 160 600 400 200 0 0 100 200 300 400 500 600 Time (msec) 49
  50. 50. (22) • • Linux RTO cwnd [RFC 2861] • MPI • HTTP/1.1 • kernel 2.6.18 net.ipv4. tcp_slow_start_after_idle=0 ssthresh cwnd time 50
  51. 51. (23) • • • cwnd ACK → cwnd cwnd ACK RTT (Round Trip Time) 51
  52. 52. (24) • ACK • ACK RTT • RTT cwnd cwnd RTT cwnd ACK RTT (Round Trip Time) 52
  53. 53. (25) 1000 800 Bandwidth (Mbps) 600 2 10MB 400 200 0 0 100 200 300 400 500 600 Time (msec) 1000 • cwnd 800 Bandwidth (Mbps) 600 400 • 200 0 0 100 200 300 400 500 600 Time (msec) 53
  54. 54. (-) 8 PCs 8 PCs WAN Node8 Node0 Host 0 Host 0 Host 0 GtrcNET-1 Host 0 Catalyst 3750 Catalyst 3750 Host 0 Host 0 ……… ……… Node7 RTT: 0 ms Node15 : 1 Gbps • CPU: Pentium4/2.4GHz • Memory: DDR400 512MB 16 • NIC: Intel PRO/1000 (82547EI) • OS: Linux-2.6.9-1.6 (Fedora Core 2) • Socket buffer: 20MB 54
  55. 55. (29) All-to-all • • MPI • data A A0 A1 A2 A3 A0 B0 C0 D0 process B B0 B1 B2 B3 A1 B1 C1 D1 C C0 C1 C2 C3 A2 B2 C2 D2 D D0 D1 D2 D3 A3 B3 C3 D3 A B A B A B C D C D C D 1 2 3 55
  56. 56. (28) 1 250 200 230Mbps • Bandwidth (Mbps) 150 • • 100 50 0 0 32 KB200 400 600 Message size (KB) 800 1000 1000 32KB • 800 Bandwidth (Mbps) 600 200 400 • 200 • 200 0 0 500 1000 1500 2000 Time (msec) 56
  57. 57. (30) RTO • 2 RTO • TCP 1 2 3 A B A B A B RTO A C 1 C D C D C D A B A B A B 2 C D C D C D 57
  58. 58. (31) RTO • RTO = RTT + α •α → •α → } • RTT 0 RTO 200 RTO = SRTT + max(G, 4 x RTTVAR) SRTT = (1 - β) x SRTT + β x RTT RTTVAR = (1 - γ) x RTTVAR + γ x |SRTT - RTT| β = 1/8, γ = 1/4, G = 200 msec SRTT: Smoothed Round Trip Time RFC2988 G 1 RTT 500 58
  59. 59. (32) RTO 1 250 RTO = SRTT + max(G, 4 x RTTVAR) 200 • G = 200 Bandwidth (Mbps) 150 100 • G=0 • (kernel 2.6.23 50 32 KB w/o RTO fix w/ RTO fix ) ip route 0 0 200 400 600 800 1000 Message size (KB) RTO RTO 1000 1000 800 800 Bandwidth (Mbps) 600 Bandwidth (Mbps) 600 400 400 200 200 0 0 0 500 1000 1500 2000 0 500 1000 1500 2000 Time (msec) Time (msec) 59
  60. 60. (-) 1 250 • 200 • Bandwidth (Mbps) 150 ‣ 100 50 32 KB w/o RTO fix w/ RTO fix • 0 0 200 400 600 800 1000 Message size (KB) • RTO ‣ RTO • • RTO 60
  61. 61. (33) MPI TCP/IP ACK RTO cwnd MPI TCP , ACS11 , 2005 61
  62. 62. (65)
  63. 63. (57) 63
  64. 64. (58) PSPacer • • Linux OS • GPL http://www.gridmpi.org/pspacer.jsp , ACS14 , 2006 64
  65. 65. (63) • 500Mbps • PSPacer PSPacer 500Mbps 65
  66. 66. (64) TCP Sender Receiver A C 1 Gbps RTT 200ms B D w/o PSPacer w/ PSPacer 535 Mbps 980 Mbps 490 Mbps 394 Mbps 490 Mbps 141 Mbps TCP ※Scalable TCP 66
  67. 67. (62) PSPacer 10GbE 100Mbps Iperf 5 GtrcNET-10 10 (Gbps) 8 PSPacer 10GbE 6 4 HTB: Hierarchical Token Bucket Linux 2 PSPacer+ HTB 0 CPU: Quad-core Xeon (E5430) x 2 0 2 4 6 8 10 NIC: Myricom Myri-10G (PCIe x 8) MTU: 9000 byte Memory: 4GB DDR2-667 (Gbps) 67
  68. 68. (66) • • • ip $ ip route show cached to 192.168.0.2 192.168.0.2 from 192.168.0.1 dev eth1 cache mtu 9000 rtt 187ms rttvar 175ms ssthresh 1024 cwnd 637 advmss 8960 hoplimit 64 • # sysctl -w net.ipv4.tcp_no_metrics_save=1 net.ipv4.tcp_no_metrics_save=1 68
  69. 69. (67) • • GbE MTU 1500B → 949.3 Mbps (Gb MTU 9000B → 991.4 Mbps • “ping -M do” IP “Don’t fragment” bit on $ ping -M do -s 9000 192.168.0.2 NG PING 192.168.0.2 (192.168.0.2) 9000(9028) bytes of data. From 192.168.0.1 icmp_seq=1 Frag needed and DF set (mtu = 9000) From 192.168.0.1 icmp_seq=1 Frag needed and DF set (mtu = 9000) From 192.168.0.1 icmp_seq=1 Frag needed and DF set (mtu = 9000) $ ping -M do -s 8972 192.168.0.2 PING 192.168.0.2 (192.168.0.2) 8972(9000) bytes of data. OK 8980 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=3.76 ms 8980 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.127 ms 8980 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.133 ms 69
  70. 70. (68) • Infiniband Myrinet • • IEEE 802.3x • PAUSE • $ sudo ethtool -A eth0 tx on rx on $ sudo ethtool -a eth0 Pause parameters for eth1: Autonegotiate: off RX: off TX: off Lossless Ethernet 70
  71. 71. (69) TCP/IP
  72. 72. (70) Rough consensus, running code • RFC • Reno [RFC 2851] • • net/ipv4 180 • regression • BIC init_ssthresh ABC • • 72
  73. 73. (72) Web100 • Pittsburgh Supercomputing Center (PSC) • procfs • netstat -s [RFC 4898] • • http://www.web100.org/ 73
  74. 74. (72) TCP Probe • KProbes TCP • procfs • • 74
  75. 75. (73) TCP Probe cwnd ssthresh http://www.linuxfoundation.org/en/Net:TCP_testing 75
  76. 76. (76) 1. ×RTT 4MB 2. setsockopt listen connect 3. 4. TCP 5. 76
  77. 77. (77) 1. tcp_slow_start_after_idle 2. 3. RTO RTO 4. 5. 77
  78. 78. TCP ARPANET NCP TCP IP (1970 ) TCP RFC793 (standard) TCP(v4) IPv4 (1986) → Tahoe RFC2851 (standard) Reno RFC2852 (experimental) NewReno Vegas HSTCP FAST ScalableTCP HTCP BIC CUBIC CompoundTCP TCP 79
  79. 79. RFC Spec. sysctl default TCP (RFC 793) - - Window scaling (RFC 1323) tcp_window_scaling 1 TImestamps option (RFC 1323) tcp_timestamps 1 TIME-WAIT Assassination Hazards (RFC 1337) tcp_rfc1337 0 SACK (RFC 2018, 3517) tcp_sack 1 Control block sharing (RFC 2140) tcp_no_metrics_save 0 MD5 signature option (RFC 2385) - - Initial window size (RFC 2414) - - Congestion control (RFC 2581) - - NewReno (RFC 3782) - - Cwnd validation (RFC 2861) tcp_slow_start_after_idle 1 D-SACK (RFC 2883) tcp_dsack 1 Limited transmit (RFC 3042) - - Explicit congestion notification (RFC 3168) tcp_ecn 0 Appropriate Byte Counting (RFC 3465) tcp_abc 0 Limited slow start (RFC 3742) tcp_max_ssthresh 0 FRTO (RFC 4138) tcp_frto 0 Forward ACK tcp_fack 1 Path MTU discovery tcp_mtu_probing 0 80
  80. 80. URL Project URL BIC/CUBIC http://netsrv.csc.ncsu.edu/twiki/bin/view/Main/BIC FAST http://netlab.caltech.edu/FAST/ TCP Westwood+ http://c3lab.poliba.it/index.php/Westwood Scalable TCP http://www.deneholme.net/tom/scalable/ HighSpeed TCP http://www.icir.org/floyd/hstcp.html H-TCP http://www.hamilton.ie/net/htcp/index.htm TCP-Illinois http://www.ews.uiuc.edu/~shaoliu/tcpillinois/ Compound TCP http://research.microsoft.com/en-us/projects/ctcp/ TCP Vegas http://www.cs.arizona.edu/projects/protocols/ TCP Westwood http://www.cs.ucla.edu/NRL/hpi/tcpw/ Circuit TCP http://www.ece.virginia.edu/cheetah/ TCP Hybla https://hybla.deis.unibo.it/ TCP Low Priority http://www.ece.rice.edu/networks/TCP-LP/ UDT http://www.cs.uic.edu/~ygu1/ XCP http://www.ana.lcs.mit.edu/dina/XCP/ Web100 http://www.web100.org/ TCP Probe http://www.linuxfoundation.org/en/Net:TcpProbe iproute2 http://www.linuxfoundation.org/en/Net:Iproute2 81
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×