PFQ@ 9th Italian Networking Workshop (Courmayeur)

PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware

  • 1. PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware
      Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi
      CNIT and Dip. di Ingegneria dell’Informazione, Università di Pisa
  • 2. Outline
      • Introduction and motivation
      • Multi-core programming guidelines
      • PFQ architecture
      • Performance evaluation
      • Conclusion and future work
  • 3. Introduction and Motivations
      • Building monitoring applications for fast links on commodity hardware is a very challenging task
          – The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices…
      • Current packet-capture software, including some parts of the Linux kernel, is not suited to the new hardware
          – (+) Kernel support for multi-queue network adapters is now implemented
          – (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap)
              • The Linux networking subsystem is slow and unnecessary for monitoring applications
          – (-) PF_RING is designed for single-processor systems
      • Traffic monitoring is not limited to packet capturing…
          – Exploit the current hardware, scaling ideally linearly with the number of cores
          – Decouple the hardware parallelism from the software parallelism
          – Use a divide-and-conquer approach to steer packets to applications
  • 4. Multi-thread on Multi-core (1)
      • What’s wrong with the current software?
          – The multi-threading paradigms used on single-processor systems are still valid, but they prevent software from scaling with the number of cores
      • For software on a multi-core system to be effective…
          – Semaphores, mutexes, R/W mutexes and spinlocks are out of the question!
          – Atomic operations are required, but must be used in moderation
              • The software design determines how many atomic operations are needed
          – Sharing (writes to shared data) must be used in moderation too
          – False sharing must, and always can, be avoided
      • Wait-free algorithms, as well as cache-oblivious algorithms, are our friends
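The false-sharing guideline above can be made concrete with a small layout check. This is only a sketch: the 64-byte cache line is an assumption (typical for x86-64, not stated in the slides), and the structure and function names are hypothetical. It compares two per-core counters packed side by side with the same counters padded so that each one starts on its own cache line.

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes (typical for x86-64)


class SharedCounters(ctypes.Structure):
    """Two per-core counters packed side by side: both sit in one cache line."""
    _fields_ = [("core0", ctypes.c_uint64),
                ("core1", ctypes.c_uint64)]


class PaddedCounters(ctypes.Structure):
    """Same counters, padded so that each one starts its own cache line."""
    _fields_ = [("core0", ctypes.c_uint64),
                ("_pad0", ctypes.c_char * (CACHE_LINE - 8)),
                ("core1", ctypes.c_uint64),
                ("_pad1", ctypes.c_char * (CACHE_LINE - 8))]


def cache_line_of(struct_type, field_name):
    """Index of the cache line holding a field, assuming a line-aligned struct."""
    return getattr(struct_type, field_name).offset // CACHE_LINE


def falsely_shared(struct_type, field_a, field_b):
    """True when two independently written fields land in the same cache line."""
    return cache_line_of(struct_type, field_a) == cache_line_of(struct_type, field_b)
```

With the packed layout, a write by one core invalidates the line cached by the other core even though the two counters are logically independent; the padded layout trades memory for the absence of that coherence traffic.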
  • 5. PFQ preamble
      • PFQ is a novel capture system with native support for 64-bit multi-core architectures, designed according to all of the guidelines presented above to provide the best possible performance
      • PFQ does not memory-map the packet descriptors of the device driver to user-space (as most commercial vendor products do)
      • PFQ is not a custom driver (such as NetMap or PF_RING DNA): it is an architecture that runs on top of standard Ethernet drivers, as well as on slightly modified “PFQ-aware” drivers (an idea inherited from the PF_RING-aware drivers)
      • PFQ enables packet capturing, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth…
      • PFQ pre-processing is ideal for bidirectional connection balancing, VoIP and different kinds of tunnels: tasks otherwise left to the user-space applications
  • 6. PFQ architecture
      Built on top of the following components…
      • DB-MPSC queue: multiple-producer, double-buffered queue (for the communication to user-space):
          – Allows concurrent NAPI contexts to enqueue packets
          – Reduces sharing and eliminates false sharing between user-space and NAPI contexts
          – Enables user-space to copy packets from the queue to a private buffer in batches
      • De-multiplexing matrix:
          – Fully concurrent data structure (benign race conditions)
          – No serialization is required to steer/copy packets
      • SPSC queue:
          – Enables batching of sk_buff and increases locality for fast packet handlers
      • Driver aware:
          – An effective idea inherited from PF_RING
  • 7. PFQ architecture
  • 8. Prefetching queue
      • Memory allocation in kernels prior to 2.6.39 had a spinlock on the fast path that serialized threads of execution
      • Allocation/deallocation of sk_buff was not completely parallelized, even when running on different physical cores
      • Batch processing is a well-known and efficient technique:
          – Optimizes cache effectiveness through temporal locality of reference
          – Reduces the probability of contention on the alloc/dealloc structures
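The batching idea on this slide can be sketched as a queue that accumulates packets and hands them to a handler in groups, so that any per-hand-off cost (a contended allocator, a cache miss on shared state) is paid once per batch rather than once per packet. The class below is a hypothetical single-threaded model, not PFQ's kernel code; `BatchQueue`, `batch_size` and `handler` are names introduced for illustration.

```python
class BatchQueue:
    """Accumulates packets and passes them on in batches (illustrative model)."""

    def __init__(self, batch_size, handler):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.handler = handler   # called once per batch with a list of packets
        self.pending = []
        self.flushes = 0         # number of hand-offs, i.e. the amortized cost

    def push(self, pkt):
        """Queue one packet; hand the batch over when it is full."""
        self.pending.append(pkt)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Hand over whatever is pending (also used to drain a partial batch)."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.flushes += 1
        self.handler(batch)
```

With a batch size of 16, pushing 100 packets and draining the remainder results in 7 hand-offs instead of 100.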
  • 9. Packet steering
      • Per-socket filtering is a common paradigm in capture engines
          – Linearly scanning the socket list to find which sockets are interested in each packet is O(n)!
      • In a multi-core environment we need a new paradigm: packet steering
      • Completely concurrent block (wait-free):
          – Shared state is mostly read-only
          – Bitmap-based state that can be updated with atomic operations (supports up to 64 sockets)
          – Socket selection is ~O(1)
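A minimal sketch of bitmap-based steering follows, assuming one 64-bit mask per (device, hardware queue) pair in which bit i is set when socket i has subscribed. The layout and all names (`SteeringMatrix`, `subscribe`, `sockets_for`) are assumptions made for illustration; the slides state only that the state is a bitmap updated through atomics and limited to 64 sockets. In a kernel implementation the `|=` and `&=` updates would be atomic OR/AND instructions; here plain Python integers model the logic.

```python
MAX_SOCKETS = 64  # one bit per socket in a 64-bit word, as on the slide


class SteeringMatrix:
    """Per-(device, queue) bitmaps of subscribed sockets (illustrative model)."""

    def __init__(self, n_devices, n_queues):
        self.masks = [[0] * n_queues for _ in range(n_devices)]

    def subscribe(self, dev, queue, sock_id):
        if not 0 <= sock_id < MAX_SOCKETS:
            raise ValueError("socket id out of range")
        self.masks[dev][queue] |= 1 << sock_id      # atomic OR in real code

    def unsubscribe(self, dev, queue, sock_id):
        if not 0 <= sock_id < MAX_SOCKETS:
            raise ValueError("socket id out of range")
        self.masks[dev][queue] &= ~(1 << sock_id)   # atomic AND in real code

    def sockets_for(self, dev, queue):
        """Return the subscribed socket ids by walking only the set bits."""
        mask = self.masks[dev][queue]
        targets = []
        while mask:
            lowest = mask & -mask                   # isolate the lowest set bit
            targets.append(lowest.bit_length() - 1)
            mask ^= lowest                          # clear it and continue
        return targets
```

The lookup is a single array access followed by a loop bounded by the number of interested sockets (at most 64), independent of how many sockets exist in total, which is the sense in which selection is roughly O(1).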
  • 10. Packet steering
      • Given a packet and a set of sockets, which socket needs to receive it?
          – Filtering (possibly no socket needs to receive the packet)
          – Load balancing (balance across multiple sockets based on a hash function)
      • Load balancing groups:
          – A socket can subscribe to a load balancing group
          – It will receive a fraction of the overall traffic
      • Simple subscription:
          – A socket can subscribe to all of the traffic coming from one or more hardware queues
      • Both modes can be supported concurrently:
          – Copy and balancing are handled by PFQ
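The two subscription modes can be sketched together: every simple subscriber gets a copy, and exactly one member of the balancing group is chosen by hashing the flow. The hash below is an assumption (CRC32 over a direction-independent key, so both directions of a connection reach the same socket); the slides do not say which hash function PFQ uses, and `flow_hash` and `steer` are hypothetical names.

```python
import zlib


def flow_hash(src_ip, dst_ip, src_port, dst_port):
    """Symmetric flow hash: swapping source and destination gives the same value."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    return zlib.crc32(key)


def steer(flow, copy_sockets, balancing_group):
    """Return the ids of the sockets that should receive a packet of this flow.

    copy_sockets:    sockets with a simple subscription (each gets a copy)
    balancing_group: sockets sharing the traffic (exactly one gets the packet)
    An empty result means the packet is filtered out.
    """
    targets = list(copy_sockets)
    if balancing_group:
        index = flow_hash(*flow) % len(balancing_group)
        targets.append(balancing_group[index])
    return targets
```

For example, `steer(("10.0.0.1", "10.0.0.2", 1234, 80), [0], [1, 2, 3])` returns socket 0 plus one of sockets 1 to 3, and the reverse direction of the same connection selects the same socket.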
  • 11. Socket queue: DB-MPSC
      • This is an unavoidable contention point:
          – Load balancing shuffles packets across sockets
      • How to handle contention without impacting performance?
          – Use a wait-free algorithm: DB-MPSC queues (double-buffer, multi-producer, single-consumer)
          – Supports copies/balancing
          – Reduces cache-coherence traffic among cores: a single (per-packet) atomic operation, which will be amortized in future implementations
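One way a double-buffered multi-producer queue can get by with a single atomic operation per packet is sketched below. This is a guess at the general technique, not PFQ's actual data layout: it assumes one state word whose top bit selects the active buffer and whose low bits count the reserved slots. Producers reserve a slot with one fetch-and-add; the consumer flips the active buffer with one exchange and then reads the retired buffer without further synchronization. Python has no hardware atomics, so `_fetch_add` and `_exchange` are plain single-threaded stand-ins, and the model ignores the case of a producer that has reserved a slot but not yet written it when the swap happens.

```python
COUNT_BITS = 31                      # low bits: number of reserved slots
COUNT_MASK = (1 << COUNT_BITS) - 1   # the bit above them selects the active buffer


class DoubleBufferQueue:
    """Single-threaded model of a double-buffered MPSC queue (illustrative)."""

    def __init__(self, slots):
        self.slots = slots
        self.buffers = [[None] * slots, [None] * slots]
        self.state = 0               # (active_buffer << COUNT_BITS) | count

    def _fetch_add(self, delta):     # stand-in for an atomic fetch-and-add
        old = self.state
        self.state = old + delta
        return old

    def _exchange(self, new):        # stand-in for an atomic exchange
        old = self.state
        self.state = new
        return old

    def push(self, pkt):
        """Producer side: one atomic op reserves a slot in the active buffer."""
        old = self._fetch_add(1)
        active, pos = old >> COUNT_BITS, old & COUNT_MASK
        if pos >= self.slots:
            return False             # active buffer is full: drop the packet
        self.buffers[active][pos] = pkt
        return True

    def swap_and_drain(self):
        """Consumer side: switch producers to the other buffer, read this one."""
        next_active = 1 - (self.state >> COUNT_BITS)
        old = self._exchange(next_active << COUNT_BITS)
        active, count = old >> COUNT_BITS, old & COUNT_MASK
        return self.buffers[active][:min(count, self.slots)]
```

Because only the consumer ever changes the buffer-selection bit, producers never wait on one another or on the consumer; the cost of contention is confined to the shared state word.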
  • 12. Testbed: Mascara & Monsters
      Mascara and Monsters are connected by a 10 Gb link
      • Mascara (traffic generation):
          – Dual Xeon 6-core L5640 @ 2.27 GHz, 24 GBytes RAM
          – New socket PF_DIRECT for generation
          – Intel 82599 multi-queue 10G Ethernet adapter
          – By deploying 3-4 cores, it is possible to generate up to 13 Mpps of 64-byte packets
      • Monsters (traffic capture):
          – Xeon 6-core X5650 @ 2.57 GHz, 12 GBytes RAM
          – Intel 82599 multi-queue 10G Ethernet adapter
          – PFQ on board for traffic capture
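To put the 13 Mpps generation figure in context, the theoretical packet rate of a 10 Gbit/s Ethernet link can be worked out from the frame size plus the fixed per-frame overhead on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap). The calculation assumes that the 64 bytes quoted on the slide are the minimum Ethernet frame including the FCS; the function name is hypothetical.

```python
WIRE_OVERHEAD_BYTES = 7 + 1 + 12   # preamble + start-of-frame delimiter + inter-frame gap


def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second on an Ethernet link for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


max_pps = line_rate_pps(10e9, 64)      # about 14.88 million packets per second
generated_fraction = 13e6 / max_pps    # share of line rate reached by the generator
```

Under that assumption the generator reaches roughly 87% of line rate with minimum-size frames, which is consistent with the "not enough generated traffic" remark on the fully parallel layout slide.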
  • 13. Single socket layout
  • 14. Fully parallel layout
      Not enough generated traffic!
  • 15. Load balancing across user-space sockets
      • Keep the number of capturing NAPI contexts fixed (12 with Intel Hyper-Threading)
      • Change the number of user-space threads
      All of the traffic with just 3 threads!
  • 16. Packet copy
      • Copying the same traffic to a variable number of user-space threads
      • Still 12 NAPI contexts within the kernel
  • 17. Future directions
      • Work on a new packet steering framework:
          – How can we distribute packets according to application-specific semantics?
              • Implement balancing groups
              • Each group is associated with an “application-specific hash function”
              • Bind a set of sockets to each group
      • Use case: VoIP analysis
          – Steer control traffic to a specific core
          – Load balance candidate RTP flows across a variable number of sockets
              • Easy (but inaccurate): stateless heuristic
              • Hard: implement a distributed stateful heuristic, where each core works on a private state that is periodically synchronized with those of the other cores…
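The "easy but inaccurate" stateless option for the VoIP use case might look like the sketch below. It rests on assumptions not made in the slides: that the control traffic is SIP on its well-known port 5060, and that a UDP payload is a candidate RTP packet when it is at least as long as the fixed 12-byte RTP header and its version field (the top two bits of the first byte) equals 2. The function name and return labels are hypothetical.

```python
SIP_PORT = 5060          # assumed control protocol: SIP on its well-known port
RTP_MIN_HEADER = 12      # fixed RTP header length in bytes
RTP_VERSION = 2          # version field value for RTP


def classify(src_port, dst_port, udp_payload):
    """Stateless per-packet guess: 'control', 'rtp-candidate' or 'other'."""
    if SIP_PORT in (src_port, dst_port):
        return "control"
    if len(udp_payload) >= RTP_MIN_HEADER and udp_payload[0] >> 6 == RTP_VERSION:
        return "rtp-candidate"
    return "other"
```

Packets classified as control would go to the socket bound to the designated core, while candidates would be spread over the group's sockets by the group's hash function. The heuristic is inaccurate by construction: any UDP payload whose first byte happens to start with the bits `10` is accepted, which is why the slide lists a stateful alternative.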
  • 18. Conclusions
      • Modern commodity architectures are increasingly parallel
      • Huge potential for software-based network devices
      • Need to strictly follow coding and design rules
      • PFQ
          – A novel packet capturing engine
          – Better scalability than its competitors
          – Flexible packet steering
          – Decouples kernel-space and user-space parallelism
      • PFQ webpage and download:
          –