Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

User-space Network Processing


Published on


Published in: Engineering
  • Dating direct: ♥♥♥ ♥♥♥
    Are you sure you want to  Yes  No
    Your message goes here
  • Sex in your area is here: ❶❶❶ ❶❶❶
    Are you sure you want to  Yes  No
    Your message goes here
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website!
    Are you sure you want to  Yes  No
    Your message goes here

User-space Network Processing

  1. 1.   -‐‑‒ +)*. *+ *1 9@ 7
  2. 2. • – d TCP/IP • * • mTCP v memcached – 35% – v 2 *)4 u v
  3. 3. • B3 k6 z 2 l • mTCP +4Intel4DPDK wi • github mTCP+4DPDK orz • Key4Value4Store w k Linux l • RADIS → • d v orz • Memcached → • d 3
  4. 4. A G 7 LNPMXXT 4 0 0.2 0.4 0.6 0.8 1 0 20 40 60 80 100 Key size (bytes) Key size CDF by appearance USR APP ETC VAR SYS 0 0.2 0.4 0.6 0.8 1 1 10 100 1000 10000 100000 1e+06 Value size (bytes) Value Size CDF by appearance USR APP ETC VAR SYS 0 0.2 0.4 0.6 0.8 1 1 10 100 1000 10 Value size (bytes) Value size CDF by total Figure 2: Key and value size distributions for all traces. The leftmost CDF shows the sizes o B.4Atikoglu,4et4al.,4“Workload4Analysis4 of4a4LargeUScale4KeyUValue4Store,”4ACM4SIGMETRICS42012. here. It is important to note, however, that all Memcached instances in this study ran on identical hardware. 2.3 Tracing Methodology Our analysis called for complete traces of traffic passing through Memcached servers for at least a week. This task is particularly challenging because it requires nonintrusive instrumentation of high-traffic volume production servers. Standard packet sniffers such as tcpdump2 have too much overhead to run under heavy load. We therefore imple- mented an efficient packet sniffer called mcap. Implemented as a Linux kernel module, mcap has several advantages over standard packet sniffers: it accesses packet data in kernel space directly and avoids additional memory copying; it in- troduces only 3% performance overhead (as opposed to tcp- dump’s 30%); and unlike standard sniffers, it handles out- of-order packets correctly by capturing incoming traffic af- ter all TCP processing is done. Consequently, mcap has a complete view of what the Memcached server sees, which eliminates the need for further processing of out-of-order packets. On the other hand, its packet parsing is optimized for Memcached packets, and would require adaptations for other applications. The captured traces vary in size from 3T B to 7T B each. This data is too large to store locally on disk, adding another challenge: how to offload this much data (at an average rate of more than 80, 000 samples per second) without interfering with production traffic. We addressed this challenge by com- bining local disk buffering and dynamic offload throttling to take advantage of low-activity periods in the servers. Finally, another challenge is this: how to effectively pro- cess these large data sets? We used Apache HIVE3 to ana- lyze Memcached traces. HIVE is part of the Hadoop frame- work that translates SQL-like queries into MapReduce jobs. We also used the Memcached “stats” command, as well as Facebook’s production logs, to verify that the statistics we computed, such as hit rates, are consistent with the aggre- gated operational metrics collected by these tools. 3. WORKLOAD CHARACTERISTICS This section describes the observed properties of each trace 0 10000 20000 30000 40000 50000 60000 70000 USR APP ETC VAR SYS Requests(millions) Pool DELETE UPDATE GET Figure 1: Distribution of request types per pool, over exactly 7 days. UPDATE commands aggregate all non-DELETE writing operations, such as SET, REPLACE, etc. operations. DELETE operations occur when a cached database entry is modified (but not required to be set again in the cache). SET operations occur when the Web servers add a value to the cache. The rela- tively high number of DELETE operations show that this pool represents database-backed values that are affected by frequent user modifications. ETC has similar characteristics to APP, but with an even higher rate of DELETE requests (of which some may not be currently cached). ETC is the largest and least specific of the pools, so its workloads might be the most representative to emulate. Because it is such a large and heterogenous workload, we pay special attention to this workload throughout the paper. VAR is the only pool sampled that is write-dominated. It stores short-term values such as browser-window size rformance metrics over ekly patterns (Sec. 3.3, be used to generate more We found that the salient r-law distributions, sim- serving systems (Sec. 5). d deployment that can -scale production usage as follows. We begin by cached, its deployment d its workload. Sec. 3 properties of the trace ), while Sec. 4 describes he server point of view). model of the most rep- tion brings the data to- s, followed by a section zing cache behavior and RIPTION ource software package s over the network. As more RAM can be added added to the network. mmunicate with clients. o select a unique server ge of the total number of Table 1: Memcached pools sampled (in one cluster). These pools do not match their UNIX namesakes, but are used for illustrative purposes here instead of their internal names. Pool Size Description USR few user-account status information APP dozens object metadata of one application ETC hundreds nonspecific, general-purpose VAR dozens server-side browser information SYS few system data on service location A new item arriving after the heap is exhausted requires the eviction of an older item in the appropriate slab. Mem- cached uses the Least-Recently-Used (LRU) algorithm to select the items for eviction. To this end, each slab class has an LRU queue maintaining access history on its items. Although LRU decrees that any accessed item be moved to the top of the queue, this version of Memcached coalesces repeated accesses of the same item within a short period (one minute by default) and only moves this item to the top the first time, to reduce overhead. 2.2 Deployment Facebook relies on Memcached for fast access to frequently- accessed values. Web servers typically try to read persistent values from Memcached before trying the slower backend databases. In many cases, the caches are demand-filled, meaning that generally, data is added to the cache after a client has requested it and failed. Modifications to persistent data in the database often propagate as deletions (invalidations) to the Memcached tier. Some cached data, however, is transient and not backed by persistent storage, requiring no invalidations. . VPVNLNRP USR4keys4are416B4or421B 90%4of4VAR4keys4are431B USR4values4are4only42B 90%4of4values4are4smaller4than4500B
  5. 5. vw c *)>M g b*)*) ( 1%/-‐‑‒%*+ 1 t *.C c EI +> ag b+ *)2 ( *. *)/ t * NUXNT ( LNTP * * ii 5
  6. 6. L14(64KB) L24(256KB) 6 CPU4Core L14(64KB) L24(256KB) L14(64KB) L24(256KB) L14(64KB) L24(256KB) CPU4Core L14(64KB) L24(256KB) CPU4Core L14(64KB) L24(256KB) L14(64KB) L24(256KB) L14(64KB) L24(256KB) CPU4Core LLC4(12MB) 44cycles 124cycles 444cycles Memory4(xx4GB) 3004cyclesMatching4Tables (4>4xx4MB)
  7. 7. Copyright420144NTT4Corporaton m x86 v n 7
  8. 8. 2010$Sep. Per$Packet)CPU)Cycles)for)10G 8 1,200 600 1,200 1,600 Cycles' needed Packet'I/O IPv4'lookup ='1,800'cycles ='2,800 Your budget 1,400'cycles 10G, min-sized packets, dual quad-core 2.66GHz CPUs 5,4001,200 … ='6,600 Packet'I/O IPv6'lookup Packet'I/O Encryption'and'hashing IPv4 IPv6 IPsec + + + (in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours) S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010. ※
  9. 9. 2010$Sep. PacketShader:)psio I/O)Optimization 9 Packet'I/O Packet'I/O Packet'I/O Packet'I/O ! 1,200'reduced'to'200'cycles' per'packet ! Main'ideas ! Huge'packet'buffer ! Batch'processing 600 1,600 IPv4'lookup ='1,800'cycles ='2,800 5,400 … ='6,600 IPv6'lookup Encryption'and'hashing + + + 1,200 1,200 1,200 S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
  10. 10. 2010$Sep. PacketShader:)GPU)Offloading 10 Packet'I/O Packet'I/O Packet'I/O ! GPU'Offloading'for ! MemoryMintensive'or ! ComputeMintensive' operations ! Main'topic'of'this'talk 600 1,600 IPv4'lookup 5,400 … IPv6'lookup Encryption'and'hashing + + + S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
  11. 11. Kernel Uses the Most CPU Cycles 4 83% of CPU usage spent inside kernel! Performance bottlenecks 1. Shared resources 2. Broken locality 3. Per packet processing 1) Efficient use of CPU cycles for TCP/IP processing 2.35x more CPU cycles for app 2) 3x ~ 25x better performance Bottleneck removed by mTCPKernel (without TCP/IP) 45% Packet I/O 4% TCP/IP 34% Application 17% CPU Usage Breakdown of Web Server Web server (Lighttpd) Serving a 64 byte file Linux-3.10 11 E.4Jeong,4et4al.,4“mTCP:4A4Highly4Scalable4UserUlevel4TCP4Stack4for4 Multicore4Systems,”4NSDI2014.
  12. 12. 12 Inefficiencies in Kernel from Shared FD 1. Shared resources – Shared listening queue – Shared file descriptor space 5 Per-core packet queue Receive-Side Scaling (H/W) Core 0 Core 1 Core 3Core 2 Listening queue Lock File descriptor space Linear search for finding empty slot E.4Jeong,4et4al.,4“mTCP:4A4Highly4Scalable4UserUlevel4TCP4Stack4for4 Multicore4Systems,”4NSDI2014.
  13. 13. 13 Inefficiencies in Kernel from Broken Locality 2. Broken locality 6 Per-core packet queue Receive-Side Scaling (H/W) Core 0 Core 1 Core 3Core 2 Interrupt handle accept() read() write() Interrupt handling core != accepting core E.4Jeong,4et4al.,4“mTCP:4A4Highly4Scalable4UserUlevel4TCP4Stack4for4 Multicore4Systems,”4NSDI2014.
  14. 14. 14 Inefficiencies in Kernel from Lack of Support for Batching 3. Per packet, per system call processing Inefficient per packet processing Frequent mode switching Cache pollution Per packet memory allocation Inefficient per system call processing 7 accept(), read(), write() Packet I/O Kernel TCP Application thread BSD socket LInux epoll Kernel User E.4Jeong,4et4al.,4“mTCP:4A4Highly4Scalable4UserUlevel4TCP4Stack4for4 Multicore4Systems,”4NSDI2014.
  15. 15. 15 Overview of mTCP Architecture 10 1. Thread model: Pairwise, per-core threading 2. Batching from packet I/O to application 3. mTCP API: Easily portable API (BSD-like) User-level packet I/O library (PSIO) mTCP thread 0 mTCP thread 1 Application Thread 0 Application Thread 1 mTCP socket mTCP epoll NIC device driver Kernel-level 1 2 3 User-level Core 0 Core 1 • [SIGCOMM’10] PacketShader: A GPU-accelerated software router, E.4Jeong,4et4al.,4“mTCP:4A4Highly4Scalable4UserUlevel4TCP4Stack4for4 Multicore4Systems,”4NSDI2014. Intel4DPDK
  16. 16. VPVNLNRP VH E • – k u •l • z z h c.f.4SeaStar 16 main4 thread worker4 thread worker4 thread worker4 thread kernel main4 thread worker4 thread worker4 thread worker4 thread mTCP thread mTCP thread mTCP thread pipe accept() accept() read() write() read() write() read() write() read() write() accept() read() write() accept() read() write()
  17. 17. c b + E MLNT X MLNT b VH E c b US R * -‐‑‒ + % 8 LNRP MP NRVL T b CPVNLNRP * -‐‑‒ + -‐‑‒ % VN MP NRVL T 17 Hardware CPU Intel Xeon E5-22430L/2.0GHz (6 core) x 2 sockets Memory 48 GB PC3-12800 Ethernet Intel X520-SR1 (10 GbE) Software OS Debian GNU/Linux 8.1 kernel Linux 3.16.0-4-amd64 Intel DPDK 2.0.0 mTCP (4603a1a,June 7 2015)
  18. 18. US R 0 20 40 60 80 100 120 140 160 180 0 2 4 6 8 10 12 10004REQUESTS/SECOND #CORES Linux SO_REUSEPORT mTCP higher4is4better • Apache4benchmark • 64B4message • 10004concurrency • 100K4requests 3.3x 5.5x 18
  19. 19. VPVNLNRP c VN MP NRVL T FL S MP NRVL T l c VH E v d G<H .  d><H +   c u d 19 TCP$ w/$1$thread TCP$ w/$3$threads mTCP w/$1$thread SET 85,404 146,3514(1.71) 115,1664(1.35) GET 115,079 139,5754(1.21) 116,8384(1.02) • mcUbenchmark • 64B4message • 5004concurrency • 100K4requests
  20. 20. VH E g c – v d u k u z ls E A u c 9G qP XUU 8E@ v d k P NX P US P S P P l u d c c v c D@ E A w d c v v d dH E(@E v z e E A e 20
  21. 21. • • X86 w • vzh – z • cpufreqUinfo(1) v v1/2 – cgroups CPU4throttling z – kXeon4Phil w z • r FLARE Tilera v 21
  22. 22. o p Supachai Thongprasit e [1]4S.4Thongprasit,4V.4Visoottiviseh,4and4R.4Takano,4“Toward4Fast4and4Scalable4 KeyUValue4Stores4Based4on4User4Space4TCP/IP4Stack,”4AINTEC42015.4 d d k d l e u d ve 22