PFQ@ 9th Italian Networking Workshop (Courmayeur)

PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware

  1. 1. PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi CNIT e Dip. di Ingegneria dell’Informazione - Università di Pisa
  2. 2. Outline • Introduction and motivation • Multi-core programming guidelines • PFQ architecture • Performance evaluation • Conclusion and future work
  3. 3. Introduction and Motivations • Building monitoring applications for fast links on commodity hardware is a very challenging task – The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices… • Current packet-capture software, including some parts of the Linux kernel, is not suited to the new hardware – (+) kernel support for multi-queue network adapters is now implemented – (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap) • The Linux networking subsystem is slow and pointless for monitoring applications – (-) PF_RING is designed for single-processor systems • Traffic monitoring is not limited to packet capturing… – Exploit the current hardware, possibly scaling linearly with the number of cores – Decouple hardware parallelism from software parallelism – Divide-and-conquer approach to steer packets to applications
  4. 4. Multi-thread on Multi-core (1) • What’s wrong with the current software? – Multi-threading paradigms designed for single-processor systems are still valid, but prevent the software from scaling with the number of cores. • For software on a multi-core system to be effective… – Semaphores, mutexes, R/W mutexes and spinlocks are out of the question! – Atomic operations are required, but must be used in moderation • software design determines the use of atomic operations – Sharing (writes to shared data) must be used in moderation too – False sharing must and can always be avoided • wait-free algorithms as well as cache-oblivious algorithms are our friends
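The guideline on false sharing can be illustrated with per-core counters that each occupy their own cache line. This is a minimal sketch and not PFQ code: the 64-byte line size, `MAX_CORES` and every identifier are assumptions made for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size (typical for x86-64) */
#define MAX_CORES  12   /* hypothetical number of cores */

/* One counter per core, each aligned to its own cache line, so a core
 * updating its counter never invalidates the line of another core. */
struct percore_counter {
    alignas(CACHE_LINE) uint64_t packets;
};

static_assert(sizeof(struct percore_counter) == CACHE_LINE,
              "each counter must fill exactly one cache line");

struct capture_stats {
    struct percore_counter core[MAX_CORES];
};

/* Fast path: plain increment of core-private data, no atomics needed. */
static inline void stats_count(struct capture_stats *s, unsigned core)
{
    s->core[core].packets++;
}

/* Slow path: sum the per-core values; meant to be called while the
 * writers are quiescent (for instance after capture has stopped). */
static inline uint64_t stats_total(const struct capture_stats *s)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        sum += s->core[i].packets;
    return sum;
}
```

Without the alignment, several counters would share one line and every increment would bounce that line between cores, even though no data is logically shared.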
  5. 5. PFQ preamble • PFQ is a novel capture system that natively supports 64-bit multi-core architectures and is built on all of the guidelines above to provide the best possible performance • PFQ does not memory-map the device driver’s packet descriptors to user space (as most commercial vendor products do) • PFQ is not a custom driver (such as NetMap or PF_RING DNA): it is an architecture running on top of standard Ethernet drivers, as well as slightly modified “PFQ-aware” drivers (an idea inherited from PF_RING-aware drivers) • PFQ enables packet capturing, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth… • PFQ pre-processing is ideal for bidirectional connection balancing, VoIP and different kinds of tunnels: tasks otherwise left to the user-space applications.
  6. 6. PFQ architecture Built on top of the following components… • DB-MPSC queue: multiple-producer, single-consumer double-buffered queue (for communication with user space): – allows concurrent NAPI contexts to enqueue packets – reduces sharing and eliminates false sharing between user space and NAPI contexts – enables user space to copy packets from the queue to a private buffer in batches • De-multiplexing matrix: – fully concurrently accessible data structure (benign race conditions) – no serialization is required to steer/copy packets • SPSC queue: – enables batching of sk_buff and increases locality for fast packet handlers • Driver awareness: – an effective idea inherited from PF_RING
  7. 7. PFQ architecture
  8. 8. Prefetching queue • Memory allocation in kernels prior to 2.6.39 had a spinlock on the fast path that serialized threads of execution • Allocation/deallocation of sk_buff was not completely parallelized, even when running on different physical cores • Batch processing is a well-known and efficient technique: – Optimizes cache effectiveness through temporal locality of reference – Reduces the probability of contention on the alloc/dealloc structures
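The batching idea can be sketched as a small per-context buffer that hands packets to a handler only once a full batch has accumulated. This is an illustrative sketch, not the PFQ prefetching queue: `struct packet` is an opaque stand-in for `sk_buff`, and `BATCH_LEN`, `batch_queue` and `count_batch` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_LEN 16            /* hypothetical batch length */

struct packet;                  /* opaque stand-in for sk_buff */

typedef void (*batch_handler)(struct packet **pkts, size_t n, void *ctx);

/* Owned by a single execution context: no locking, no atomics. */
struct batch_queue {
    struct packet *slot[BATCH_LEN];
    size_t         len;
    batch_handler  handler;
    void          *ctx;
};

static void batch_init(struct batch_queue *q, batch_handler h, void *ctx)
{
    q->len = 0;
    q->handler = h;
    q->ctx = ctx;
}

/* Hand every pending packet to the handler in a single call. */
static void batch_flush(struct batch_queue *q)
{
    if (q->len == 0)
        return;
    q->handler(q->slot, q->len, q->ctx);
    q->len = 0;
}

/* Store one packet; the handler runs only when the batch is full. */
static void batch_push(struct batch_queue *q, struct packet *p)
{
    assert(q->len < BATCH_LEN);
    q->slot[q->len++] = p;
    if (q->len == BATCH_LEN)
        batch_flush(q);
}

/* Example handler: records how many calls and packets it has seen. */
struct batch_stats { size_t calls, packets; };

static void count_batch(struct packet **pkts, size_t n, void *ctx)
{
    struct batch_stats *st = ctx;
    (void)pkts;
    st->calls++;
    st->packets += n;
}
```

Pushing 40 packets through this queue invokes the handler twice with 16 packets each and leaves 8 pending until `batch_flush` is called, so any per-call cost (such as touching a shared allocator) is paid once per batch instead of once per packet.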
  9. 9. Packet steering • Per-socket filtering is a common paradigm in capture engines – Linearly scanning the socket list to check which sockets may be interested in each packet is O(n)! • In a multi-core environment we need a new paradigm: packet steering • Completely concurrent block (wait-free): – Shared state is mostly read-only – Bitmap-based state that can be updated through atomics (supports up to 64 sockets) – Socket selection is ~ O(1)
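A bitmap-based steering table of this kind might look like the sketch below: one 64-bit mask per hardware queue, one bit per socket, updated with atomic OR/AND and read with a single atomic load on the fast path. The layout and all names (`demux`, `MAX_QUEUES`, and so on) are assumptions for illustration, not the actual PFQ de-multiplexing matrix.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_QUEUES  64          /* hypothetical number of hardware queues */
#define MAX_SOCKETS 64          /* one bit per socket in a 64-bit word */

/* mask[q] has bit s set when socket s is subscribed to hardware queue q. */
struct demux {
    _Atomic uint64_t mask[MAX_QUEUES];
};

static void demux_init(struct demux *d)
{
    for (unsigned q = 0; q < MAX_QUEUES; q++)
        atomic_init(&d->mask[q], 0);
}

/* Control path: set or clear one bit with a single atomic operation. */
static void demux_subscribe(struct demux *d, unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < MAX_SOCKETS);
    atomic_fetch_or_explicit(&d->mask[queue], UINT64_C(1) << sock,
                             memory_order_release);
}

static void demux_unsubscribe(struct demux *d, unsigned queue, unsigned sock)
{
    assert(queue < MAX_QUEUES && sock < MAX_SOCKETS);
    atomic_fetch_and_explicit(&d->mask[queue], ~(UINT64_C(1) << sock),
                              memory_order_release);
}

/* Fast path: one atomic load gives every interested socket at once,
 * regardless of how many sockets exist. */
static uint64_t demux_lookup(struct demux *d, unsigned queue)
{
    assert(queue < MAX_QUEUES);
    return atomic_load_explicit(&d->mask[queue], memory_order_acquire);
}

/* Expand a mask into socket ids; returns how many ids were written. */
static unsigned demux_targets(uint64_t mask, unsigned out[MAX_SOCKETS])
{
    unsigned n = 0;
    for (unsigned s = 0; mask != 0; s++, mask >>= 1)
        if (mask & 1)
            out[n++] = s;
    return n;
}
```

The lookup cost does not depend on the number of open sockets, which is what replaces the O(n) scan of the socket list.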
  10. 10. Packet steering • Given a packet and a set of sockets, which socket needs to receive it? – Filtering (possibly no socket needs to receive the packet) – Load balancing (balance across multiple sockets based on a hash function) • Load balancing groups: – A socket can subscribe to a load balancing group – It will receive a fraction of the overall traffic • Simple subscription: – A socket can subscribe to all of the traffic coming from one or more hardware queues • Both modes can be supported concurrently: – Copy and balancing are handled by PFQ
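The two subscription modes can be combined in one steering decision, sketched here under the assumption that both are represented as 64-bit socket bitmaps: every "simple" subscriber receives a copy, while exactly one member of the load balancing group is chosen by a flow hash. The function names and the bitmap representation are hypothetical, not taken from PFQ.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits (portable; clears the lowest set bit per step). */
static unsigned popcount64(uint64_t m)
{
    unsigned n = 0;
    while (m) {
        m &= m - 1;
        n++;
    }
    return n;
}

/* Position of the k-th (0-based) set bit of m, counted from bit 0. */
static unsigned nth_set_bit(uint64_t m, unsigned k)
{
    assert(k < popcount64(m));
    while (k--)
        m &= m - 1;             /* drop the k lowest set bits */
    unsigned pos = 0;
    while (!(m & 1)) {
        m >>= 1;
        pos++;
    }
    return pos;
}

/* Returns the bitmap of sockets that must receive the packet:
 * all sockets in copy_mask, plus one socket of group_mask selected
 * by the flow hash (none if the group is empty). */
static uint64_t steer(uint64_t copy_mask, uint64_t group_mask,
                      uint32_t flow_hash)
{
    uint64_t out = copy_mask;
    unsigned members = popcount64(group_mask);
    if (members != 0)
        out |= UINT64_C(1) << nth_set_bit(group_mask, flow_hash % members);
    return out;
}
```

With socket 0 as a simple subscriber and sockets 1, 2 and 4 in a balancing group, a packet whose hash is 2 is delivered to sockets 0 and 4; packets with the same hash always reach the same group member.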
  11. 11. Socket queue: DB-MPSC • This is an unavoidable contention point: – Load balancing shuffles packets across sockets • How can contention be handled without impacting performance? – Use a wait-free algorithm: DB-MPSC queues (double-buffered multi-producer single-consumer) – Supports copies/balancing – Reduces coherence traffic among cores: a single (per-packet) atomic operation, to be amortized in future implementations
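One way such a double-buffered MPSC queue could work is sketched below. It is a simplified illustration and not PFQ's implementation: the packed epoch/counter word, the per-slot commit tag, `DB_SLOTS` and all names are assumptions. Each producer reserves a slot with one atomic add; the single consumer swaps buffers with one atomic exchange and then reads the retired buffer. The sketch also assumes that a producer still writing during a swap finishes before the same buffer is reused two swaps later, and it ignores wrap-around of the 32-bit epoch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DB_SLOTS 1024               /* hypothetical capacity per buffer */

struct db_slot {
    _Atomic uint32_t commit;        /* epoch tag, set when payload is complete */
    uint32_t         payload;       /* stand-in for packet header and data */
};

struct db_mpsc {
    /* High 32 bits: epoch (its low bit selects the active buffer).
     * Low 32 bits: number of slots reserved in the active buffer. */
    _Atomic uint64_t word;
    struct db_slot   buf[2][DB_SLOTS];
};

static void db_init(struct db_mpsc *q)
{
    atomic_init(&q->word, 0);
    for (int b = 0; b < 2; b++)
        for (size_t i = 0; i < DB_SLOTS; i++)
            atomic_init(&q->buf[b][i].commit, 0);
}

/* Producer side (any number of threads): one fetch_add per packet. */
static bool db_enqueue(struct db_mpsc *q, uint32_t payload)
{
    uint64_t w = atomic_load_explicit(&q->word, memory_order_relaxed);
    if ((uint32_t)w >= DB_SLOTS)
        return false;               /* active buffer full: drop */
    w = atomic_fetch_add_explicit(&q->word, 1, memory_order_acquire);
    uint32_t epoch = (uint32_t)(w >> 32);
    uint32_t slot  = (uint32_t)w;
    if (slot >= DB_SLOTS)
        return false;               /* lost the race for the last slot */
    struct db_slot *s = &q->buf[epoch & 1][slot];
    s->payload = payload;
    atomic_store_explicit(&s->commit, epoch + 1, memory_order_release);
    return true;
}

/* Consumer side (exactly one thread): swap buffers, read the old one.
 * Slots reserved but not yet committed at swap time are skipped. */
static size_t db_drain(struct db_mpsc *q, uint32_t *out)
{
    uint32_t epoch = (uint32_t)
        (atomic_load_explicit(&q->word, memory_order_relaxed) >> 32);
    uint64_t old = atomic_exchange_explicit(
        &q->word, (uint64_t)(epoch + 1) << 32, memory_order_acq_rel);
    uint32_t reserved = (uint32_t)old;
    if (reserved > DB_SLOTS)
        reserved = DB_SLOTS;
    size_t n = 0;
    for (uint32_t i = 0; i < reserved; i++) {
        struct db_slot *s = &q->buf[epoch & 1][i];
        if (atomic_load_explicit(&s->commit, memory_order_acquire) == epoch + 1)
            out[n++] = s->payload;
    }
    return n;
}
```

Neither side ever waits for the other: producers keep filling the new buffer while the consumer copies the old one out in a batch, and the only shared write on the producer fast path is the single atomic add.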
  12. 12. Testbed: Mascara & Monsters • Two hosts, Mascara and Monsters, connected by a 10 Gb link • Traffic generation host: dual Xeon 6-core L5640 @2.27 GHz, 24 GBytes RAM; Intel 82599 multi-queue 10G Ethernet adapter; new PF_DIRECT socket for generation. By deploying 3-4 cores, it is possible to generate up to 13 Mpps of 64-byte packets. • Traffic capture host: Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM; Intel 82599 multi-queue 10G Ethernet adapter; PFQ on board for traffic capture
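For reference, the 13 Mpps generation rate can be compared with the theoretical maximum frame rate of 10 Gigabit Ethernet. Assuming minimum-size 64-byte frames, each frame occupies an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter and 12-byte inter-frame gap), which gives:

```latex
R_{\max} = \frac{10 \times 10^{9}\ \text{bit/s}}{(64 + 20)\ \text{bytes} \times 8\ \text{bit/byte}}
         = \frac{10 \times 10^{9}}{672}
         \approx 14.88 \times 10^{6}\ \text{packets/s}
```

Under this assumption, 13 Mpps corresponds to roughly 87% of the line rate for minimum-size frames.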
  13. 13. Single socket layout
  14. 14. Fully parallel layout • Not enough generated traffic!
  15. 15. Load balancing across user-space sockets • Keep the number of capturing NAPI contexts fixed (12 with Intel Hyper-Threading) • Change the number of user-space threads • All of the traffic with just 3 threads!
  16. 16. Packet copy • Copying the same traffic to a variable number of user space threads • Still 12 NAPI contexts within the kernel
  17. 17. Future directions • Work on a new packet steering framework: – How can we distribute packets according to application-specific semantics? • Implement balancing groups • Each group is associated with an “application-specific hash function” • Bind a set of sockets to each group • Use case: VoIP analysis – Steer control traffic to a specific core – Load balance candidate RTP flows across a variable number of sockets • Easy (but inaccurate): stateless heuristic • Hard: implement a distributed stateful heuristic, where each core works on a private state that is periodically synchronized with those of the other cores…
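As an example of what an application-specific hash function for a balancing group could be, the sketch below computes a direction-independent hash so that both directions of a connection reach the same socket. It is only an illustration of the idea: `flow_key`, `symmetric_hash` and `pick_socket` are hypothetical names, and the mixing step borrows the 32-bit finalizer of MurmurHash3 rather than anything specified by PFQ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow descriptor (IPv4 addresses and transport ports). */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Integer mixer: the 32-bit finalizer used by MurmurHash3. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* XOR and addition are commutative, so swapping source and
 * destination leaves the result unchanged. */
static uint32_t symmetric_hash(const struct flow_key *k)
{
    uint32_t ips   = k->src_ip ^ k->dst_ip;
    uint32_t ports = (uint32_t)k->src_port + k->dst_port;
    return mix32(ips ^ mix32(ports));
}

/* Index of the socket, within a group of n_sockets, that gets the flow. */
static unsigned pick_socket(const struct flow_key *k, unsigned n_sockets)
{
    assert(n_sockets > 0);
    return symmetric_hash(k) % n_sockets;
}
```

A stateless hash like this corresponds to the "easy" heuristic: it keeps each connection on one socket, but it cannot by itself associate an RTP flow with the control session that negotiated it.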
  18. 18. Conclusions • Modern commodity architectures are increasingly parallel • Huge potential for software-based network devices • Coding and design rules must be strictly followed • PFQ – A novel packet capturing engine – Better scalability with respect to competitors – Flexible packet steering – Decouples kernel-space and user-space parallelism • PFQ webpage and download: – netgroup.iet.unipi.it/software/pfq
