
High performance network of Cloud Native Taiwan User Group

An introduction to high-performance networking


  1. 1. HIGH PERFORMANCE NETWORK HUNG WEI CHIU
  2. 2. WHO AM I • Hung-Wei Chiu (邱宏瑋) • hwchiu@linkernetworks.com • hwchiu.com • Experience • Software Engineer at Linker Networks • Software Engineer at Synology (2014~2017) • Co-Founder of SDNDS-TW • Open Source experience • SDN related projects (mininet, ONOS, Floodlight, awesome-sdn)
  3. 3. WHAT WE DISCUSS TODAY • The Drawback of the Current Network Stack • High Performance Network Models • DPDK • RDMA • Case Study
  4. 4. DRAWBACK OF CURRENT NETWORK STACK • Linux Kernel Stack • TCP Stack • Packet Processing in the Linux Kernel
  5. 5. LINUX KERNEL TCP/IP NETWORK STACK • Have you ever imagined how applications communicate over the network?
  6. 6. [Diagram: Chrome on one Linux host sending a packet across the network to a www-server on another Linux host]
  7. 7. IN YOUR APPLICATION (CHROME) • Create a socket • Connect to the Aurora-Server (we use TCP) • Send/receive packets. User-Space / Kernel-Space: • Copy data from user space • Handle TCP • Handle IPv4 • Handle Ethernet • Handle Physical • Handle Driver/NIC
  8. 8. Have you ever written a socket program before?
  9. 9. FOR GO LANGUAGE
  10. 10. FOR PYTHON
  11. 11. FOR C LANGUAGE
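(The original slides show code screenshots for Go, Python, and C that are not reproduced in this transcript. Below is a minimal C sketch of the same idea from slide 7: create a socket, connect over TCP, send a request, and receive a reply. The server address and request are hypothetical placeholders.)

```c
/* Minimal TCP client sketch (hypothetical example, not taken from the slides):
 * create a socket, connect, send a request, read the reply. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* create a TCP socket */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "93.184.216.34", &srv.sin_addr);  /* assumed address, for illustration only */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {  /* 3-way handshake happens here */
        perror("connect"); close(fd); return 1;
    }

    const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    send(fd, req, strlen(req), 0);                        /* data is copied into the kernel here */

    char buf[4096];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);        /* blocks until the kernel has data */
    if (n > 0) { buf[n] = '\0'; printf("%s", buf); }

    close(fd);
    return 0;
}
```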
  12. 12. Have you ever imagined how the kernel handles those operations?
  13. 13. HOW ABOUT THE KERNEL? SEND MESSAGE • User space → send(data…) • SYSCALL_DEFINE3(…) → kernel space • vfs_write • do_sync_write • sock_aio_write • do_sock_write • __sock_sendmsg • security_socket_sendmsg(…)
  14. 14. • inet_sendmsg • tcp_sendmsg → finally TCP … • __tcp_push_pending_frames • tcp_push_one • tcp_write_xmit • tcp_transmit_skb • ip_queue_xmit → finally IP • ip_route_output_ports • ip_route_output_flow → routing • xfrm_lookup → routing • ip_local_out • dst_output • ip_output • …
  15. 15. HOW ABOUT THE KERNEL? RECEIVE MESSAGE • User space → read(data…) • SYSCALL_DEFINE3(…) → kernel space • …
  16. 16. WHAT IS THE PROBLEM • TCP • Linux Kernel Network Stack • How Linux processes packets
  17. 17. THE PROBLEM OF TCP • Designed for WAN network environments • The hardware is very different now from back then • Modify the implementation of TCP to improve its performance • DCTCP (Data Center TCP) • MPTCP (Multipath TCP) • Google BBR (modified congestion control algorithm) • New protocols • [Paper walkthrough] Re-architecting datacenter networks and stacks for low latency and high performance
  18. 18. THE PROBLEM OF LINUX NETWORK STACK • Increasing network speeds: 10G → 40G → 100G • Time between packets gets smaller • For 1538 bytes: • 10 Gbit/s == 1230.4 ns • 40 Gbit/s == 307.6 ns • 100 Gbit/s == 123.0 ns • Refer to http://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf • Network stack challenges at increasing speeds: The 100Gbit/s challenge
  19. 19. THE PROBLEM OF LINUX NETWORK STACK • For the smallest frame size of 84 bytes: • At 10 Gbit/s == 67.2 ns (14.88 Mpps, packets per second) • For a 3 GHz CPU, 201 CPU cycles for each packet • System call overhead • 75.34 ns (Intel CPU E5-2630) • Spinlock + unlock • 16.1 ns
  20. 20. THE PROBLEM OF LINUX NETWORK STACK • A single cache-miss: • 32 ns • Atomic operations • 8.25 ns • Basic sync mechanisms • Spin (16ns) • IRQ (2 ~ 14 ns)
  21. 21. SO… • For the smallest frame size of 84 bytes • At 10 Gbit/s == 67.2 ns (14.88 Mpps, packets per second) • 75.34 + 16.1 + 32 + 8.25 + 14 = 145.69 ns, well over the 67.2 ns budget
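The budget and overhead figures above can be reproduced directly: the smallest Ethernet frame occupies 84 bytes on the wire (64-byte frame plus 8-byte preamble/SFD and 12-byte inter-frame gap), which gives 67.2 ns per packet at 10 Gbit/s, while the per-packet costs quoted in the slides already sum to about 145.69 ns. A small sketch of the arithmetic:

```c
/* Reproduce the per-packet time budget from the slides:
 * smallest frame on the wire = 64 B frame + 8 B preamble/SFD + 12 B inter-frame gap = 84 B. */
#include <stdio.h>

int main(void)
{
    const double wire_bytes = 84.0;
    const double link_bps   = 10e9;                              /* 10 Gbit/s */

    double ns_per_packet = wire_bytes * 8.0 / link_bps * 1e9;    /* ~67.2 ns per packet */
    double mpps          = link_bps / (wire_bytes * 8.0) / 1e6;  /* ~14.88 Mpps */

    /* Per-packet overheads quoted in the slides (ns). */
    double overhead = 75.34 /* syscall */ + 16.1 /* spinlock+unlock */ +
                      32.0  /* cache miss */ + 8.25 /* atomic op */ + 14.0 /* IRQ */;

    printf("budget: %.1f ns/packet (%.2f Mpps)\n", ns_per_packet, mpps);
    printf("summed overheads: %.2f ns -> budget exceeded\n", overhead);
    return 0;
}
```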
  22. 22. PACKET PROCESSING • Let's look at the diagram again
  23. 23. PACKET PROCESSING • When a network card receives a packet: • It puts the packet into its receive queue (RX) • The system (kernel) needs to know the packet has arrived and copy the data into an allocated buffer • Polling/Interrupt • Allocate an sk_buff for the packet • Copy the data to user space • Free the sk_buff
  24. 24. PACKET PROCESSING IN LINUX [Diagram: NIC TX/RX queue and ring buffer → driver → socket in kernel space → application in user space]
  25. 25. PROCESSING MODE • Polling Mode • Busy Looping • CPU overloading • High Network Performance/Throughput
  26. 26. PROCESSING MODE • Interrupt Mode • Read the packet when the interrupt is received • Reduces CPU overhead • We didn't have many CPU cores back then • Worse network performance than polling mode
  27. 27. MIX MODE • Polling + Interrupt mode (NAPI) (New API) • Interrupt first, then poll to fetch packets • Combines the advantages of both modes
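NAPI itself lives inside the kernel, but the trade-off between the modes can be sketched with ordinary sockets: a busy-poll loop reacts immediately at the cost of a fully loaded core, while an event-driven loop sleeps until the kernel signals data, much like interrupt mode. The user-space sketch below assumes an already-connected non-blocking socket fd; it is an analogy, not the kernel's NAPI code:

```c
/* User-space analogue of the two processing modes on an already-connected,
 * non-blocking socket `fd` (assumed to exist; not from the original slides). */
#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Polling mode: busy-loop on recv(); lowest latency, 100% CPU on this core. */
static void busy_poll(int fd)
{
    char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n > 0)
            printf("got %zd bytes\n", n);
        else if (n == 0)
            break;                               /* peer closed the connection */
        else if (errno != EAGAIN && errno != EWOULDBLOCK)
            break;                               /* real error: stop polling */
        /* otherwise EAGAIN: nothing arrived yet, spin again */
    }
}

/* Interrupt-like mode: sleep in epoll_wait() until the kernel signals data. */
static void event_driven(int fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    char buf[2048];
    for (;;) {
        struct epoll_event out;
        if (epoll_wait(ep, &out, 1, -1) > 0) {   /* blocks: no CPU burned while idle */
            ssize_t n = recv(out.data.fd, buf, sizeof(buf), 0);
            if (n <= 0)
                break;
            printf("got %zd bytes\n", n);
        }
    }
    close(ep);
}
```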
  28. 28. SUMMARY • Linux kernel overhead (system calls, locking, cache) • Context switching on blocking I/O • Interrupt handling in the kernel • Data copies between user space and kernel space • Too many unused network stack features • Additional overhead for each packet
  29. 29. HOW TO SOLVE THE PROBLEM • Out-of-tree network stack bypass solutions • Netmap • PF_RING • DPDK • RDMA
  30. 30. HOW TO SOLVE THE PROBLEM • How do those models handle a packet in 67.2 ns? • Batching, preallocation, prefetching • Staying CPU/NUMA local, avoiding locking • Reducing syscalls • Faster, cache-optimal data structures
  32. 32. HOW TO SOLVE • Nowadays there are more and more CPU cores in a server • We can dedicate some CPUs to handling network packets • Polling mode • Zero-copy • Copy to user space only if the application needs to modify it • sendfile(…) • UIO (User Space I/O) • mmap (memory mapping)
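sendfile(2) is one zero-copy primitive already available in Linux: the kernel moves file data toward the socket without ever copying it through a user-space buffer. A minimal sketch, assuming sock_fd is an already-connected socket:

```c
/* Zero-copy file-to-socket transfer with sendfile(2): the file contents never
 * pass through a user-space buffer. `sock_fd` is assumed to be a connected socket. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file_zero_copy(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) { perror("open"); return -1; }

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel feeds the data toward the NIC itself; no read()/write() pair. */
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); break; }
    }

    close(file_fd);
    return (offset == st.st_size) ? 0 : -1;
}
```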
  33. 33. HIGH PERFORMANCE NETWORKING • DPDK (Data Plane Development Kit) • RDMA (Remote Direct Memory Access)
  34. 34. DPDK • Backed by Intel • Only Intel NICs were supported at first • Processor affinity / NUMA • UIO • Polling mode • Batch packet handling • Kernel bypass • …etc
  35. 35. PACKET PROCESSING IN DPDK [Diagram: NIC TX/RX queue and ring buffer exposed through the UIO (user-space I/O) driver directly to the DPDK application in user space, bypassing the kernel]
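In code, the DPDK data path boils down to a busy-poll loop over the NIC's RX/TX queues, handling packets in bursts with no system calls. The sketch below shows only that loop and assumes the EAL, the mbuf pool, and port 0's queues were already set up (rte_eal_init, rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_tx_queue_setup), following the pattern of DPDK's basic forwarding examples:

```c
/* DPDK poll-mode RX/TX loop sketch. Assumes EAL init and port/queue setup
 * (rte_eal_init, rte_eth_dev_configure, rte_eth_rx_queue_setup, ...) were done earlier. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void lcore_forward_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {                                        /* dedicated core: busy-poll forever */
        /* Pull up to BURST_SIZE packets straight from the NIC ring: no interrupt, no syscall. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* ... parse/modify packets here in user space (no kernel network stack) ... */

        /* Send them back out of the same port; free whatever the NIC did not accept. */
        uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```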
  36. 36. COMPARE [Diagram: two packet paths side by side (NIC, network driver, Linux kernel network stack, application), annotated with kernel space vs. user space]
  37. 37. WHAT'S THE PROBLEM? • Without the Linux kernel network stack • How do we know what kind of packets we received? • Layer 2 (MAC/VLAN) • Layer 3 (IPv4, IPv6) • Layer 4 (TCP, UDP, ICMP)
  38. 38. USER SPACE NETWORK STACK • We need to build a user-space network stack • For each application, we need to handle the following issues: • Parse packets • MAC/VLAN • IPv4/IPv6 • TCP/UDP/ICMP • For TCP, we need to handle the three-way handshake
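Without the kernel stack, every application has to do at least this much classification on the raw frame itself. Below is a minimal parsing sketch (using the standard Linux header definitions rather than DPDK's own rte_ether/rte_ip structures, to keep it self-contained); a real user-space stack would continue into the TCP state machine from here:

```c
/* Minimal user-space classification of a raw Ethernet frame:
 * Ethernet -> (optional VLAN) -> IPv4 -> TCP/UDP/ICMP. Sketch only. */
#include <arpa/inet.h>
#include <linux/if_ether.h>   /* struct ethhdr, ETH_P_IP, ETH_P_8021Q */
#include <netinet/ip.h>       /* struct iphdr, IPPROTO_* */
#include <stdint.h>
#include <stdio.h>

static void classify(const uint8_t *frame, size_t len)
{
    if (len < sizeof(struct ethhdr))
        return;

    const struct ethhdr *eth = (const struct ethhdr *)frame;
    uint16_t proto = ntohs(eth->h_proto);
    size_t offset = sizeof(struct ethhdr);

    if (proto == ETH_P_8021Q) {                 /* VLAN tag: skip 4 bytes, re-read ethertype */
        proto = ntohs(*(const uint16_t *)(frame + offset + 2));
        offset += 4;
    }

    if (proto != ETH_P_IP || len < offset + sizeof(struct iphdr))
        return;                                 /* only IPv4 is handled in this sketch */

    const struct iphdr *ip = (const struct iphdr *)(frame + offset);
    switch (ip->protocol) {
    case IPPROTO_TCP:  printf("TCP packet\n");  break;   /* a real stack now needs the TCP state machine */
    case IPPROTO_UDP:  printf("UDP packet\n");  break;
    case IPPROTO_ICMP: printf("ICMP packet\n"); break;
    default:           printf("other L4 protocol %u\n", ip->protocol);
    }
}
```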
  39. 39. FOR ALL EXISTING NETWORK APPLICATIONS • Rewrite all socket-related APIs to DPDK APIs • DIY • Find some OSS to help you • dpdk-ans (C) • mTCP (C) • yanff (Go) • These projects provide a BSD-like socket interface
  40. 40. SUPPORT DPDK? • Storage • Ceph • Software switches • BSS • FD.io • Open vSwitch • …etc
  41. 41. A USE CASE • Software switch • Application • Combine both of the above (run the application as a VM or container)
  42. 42. [Diagram: one host running Open vSwitch (DPDK) with two DPDK NICs in user space, and another host running My Application on a DPDK NIC in user space]
  43. 43. [Diagram: Open vSwitch (DPDK) with DPDK NICs in user space, plus Container 1 and Container 2] How do the containers connect to the Open vSwitch?
  44. 44. PROBLEMS OF CONNECTION • Use veth • Back into kernel space again • Performance degradation • Use virtio_user
  45. 45. RDMA • Remote Direct Memory Access • Originates from DMA (Direct Memory Access) • Access memory without involving the CPU
  46. 46. ADVANTAGES • Zero-copy • Kernel bypass • No CPU involvement • Message-based transactions • Scatter/gather entry support
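With the verbs API (libibverbs), these bullets map onto two calls: ibv_reg_mr pins and registers a buffer so the NIC can DMA it directly, and ibv_post_send with an RDMA_WRITE work request lets the local NIC write into the peer's memory without involving the remote CPU. The sketch below omits all connection management (device, PD, CQ, and QP setup plus QP state transitions) and assumes the peer's remote_addr/rkey were exchanged out of band:

```c
/* RDMA one-sided write sketch with libibverbs. Assumes a connected queue pair `qp`,
 * a protection domain `pd`, and the peer's (remote_addr, rkey) obtained out of band;
 * all device/CQ/QP setup is omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_pd *pd, struct ibv_qp *qp,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* Register (pin) the local buffer so the NIC can DMA from it directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: remote CPU is not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;                   /* scatter/gather: more SGEs could go here */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion on our CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    int rc = ibv_post_send(qp, &wr, &bad_wr);     /* hand the descriptor to the NIC; no data copy */

    /* A real application would now poll the completion queue (ibv_poll_cq)
       before calling ibv_dereg_mr(mr). */
    return rc;
}
```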
  47. 47. WHAT IT PROVIDES • Low CPU usage • High throughput • Low latency • You can't have all of those features at the same time • Refer to: Tips and tricks to optimize your RDMA code
  48. 48. SUPPORT RDMA • Storage • Ceph • DRBD (Distributed Replicated Block Device) • Tensorflow • Case Study - Towards Zero Copy Dataflows using RDMA
  49. 49. CASE STUDY • Towards Zero Copy Dataflows using RDMA • SIGCOMM 2017 poster • Introduction • What is the problem? • How to solve it? • How to implement it? • Evaluation
  50. 50. INTRODUCTION • Based on Tensorflow • Distributed • Based on RDMA • Zero copy • The copy problem • Contributed to Tensorflow (merged)
  51. 51. WHAT PROBLEMS • Dataflow • Directed Acyclic Graph • Large data • Hundreds of MB • Some data is unmodified • Too many copy operations • User Space <-> User Space • User Space <-> Kernel Space • Kernel Space -> Physical devices
  52. 52. WHY DATA COPY IS THE BOTTLENECK • The data buffer is bigger than the system L1/L2/L3 cache • Too many cache misses (increased latency) • A single application is unlikely to saturate the network bandwidth • The authors report: • 20-30 GB/s for 4 KB data buffers • 2-4 GB/s for data buffers > 4 MB • Too many cache misses
  53. 53. HOW TO SOLVE • Too many data copy operations • Same device: use DMA to pass data • Different devices: use RDMA • To read/write a remote GPU: GPUDirect RDMA (published by Nvidia)
  54. 54. HOW TO IMPLEMENT • Implement a memory allocator • Parse the computational graph / distributed graph partition • Register the memory with RDMA/DMA according to the node's type • In Tensorflow, replace the original gRPC transport with RDMA
  55. 55. EVALUATION (TARGETS) • Tensorflow v1.2 • Based on gRPC • RDMA zero-copy Tensorflow • Yahoo open-source RDMA Tensorflow (still has some copy operations)
  56. 56. EVALUATION (RESULTS) • RDMA (zero copy) vs. gRPC • 2.43x • RDMA (zero copy) vs. Yahoo version • 1.21x • Number of GPUs, 16 vs. 1 • 13.8x
  57. 57. Q&A?
  58. 58. EVALUATION (HARDWARE) • Server × 4 • Dual 6-core Intel Xeon E5-2603 v4 CPUs • 4 Nvidia Tesla K40m GPUs • 256 GB DDR4-2400MHz • Mellanox MT27500 40GbE NIC • Switch • 40 GbE RoCE switch • Priority Flow Control
  59. 59. EVALUATION (SOFTWARE) • VGG16 CNN model • Model parameter size is 528 MB • Synchronous training • Number of parameter servers (PS) == number of workers • Workers: use CPU+GPU • Parameter servers: CPU only
