
Network Processor Acceleration for a Linux* Netfilter Firewall




Network Processor Acceleration for a Linux* Netfilter Firewall

Kristen Accardi, Tony Bock, Frank Hady, Jon Krueger
Intel® Corporation, 2111 NE 25th Ave, Hillsboro, OR 97124
{kristen.c.accardi, tony.bock, frank.hady, jon.krueger}

ABSTRACT
Network firewalls occupy a central role in computer security, protecting data, compute, and networking resources while still allowing useful packets to flow. Increases in both the work per network packet and the packet rate make it increasingly difficult for general-purpose processor based firewalls to maintain line rate. In a bid to address these evolving requirements we have prototyped a hybrid firewall, using a simple firewall running on a network processor to accelerate a Linux* Netfilter firewall executing on a general-purpose processor. The simple firewall on the network processor provides high rate packet processing for all packets, while the general-purpose processor delivers high rate, full featured firewall processing for those packets that need it.

This paper describes the hybrid firewall prototype, with a focus on the software created to accelerate Netfilter with a network processor resident firewall. Measurements show our hybrid firewall able to maintain close to 2 Gb/s line rate for all packet sizes, a significant improvement over the original firewall. We also include the hard-won lessons learned while implementing the hybrid firewall.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General – Security and protection (e.g., firewalls)

General Terms
Measurement, Performance, Design, Experimentation

Keywords
Network Firewall, Netfilter, Throughput, Network Processor, Prototype, Hybrid Firewall

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ANCS'05, October 26–28, 2005, Princeton, New Jersey, USA. Copyright 2005 ACM 1-59593-082-5/05/0010...$5.00. ® Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. ™ Xeon is a trademark of Intel Corporation or its subsidiaries in the United States and other countries. * Other brands and names are the property of their respective owners.

1. INTRODUCTION
Network firewalls occupy an essential role in computer security, guarding the perimeter of LANs. The goal of any network firewall is to allow desired packets unimpeded access to the network while dropping undesirable packets. In so doing, the firewall protects the data and compute resources of the organization that owns the LAN. Increases in LAN and Internet bandwidth rates, coupled with requirements for more sophisticated packet filtering, have made the work of the firewall still more difficult.

All firewalls make forward/drop decisions for each packet, but the complexity of the work performed to make those decisions varies widely. Packet filter firewalls [1] make forward/drop decisions based on the contents of the packet header and a set of rules. Stateful firewalls make forward/drop decisions based on packet headers, rules lists, and state collected from previous packets. Application layer firewalls (a.k.a. application layer gateways) reconstruct application-level data objects carried within sets of packets and make forward/drop decisions based upon the state of those objects.

In reaction to increasingly sophisticated threats, firewall complexity is increasing. A recent study [2] showed a commercial ISP firewall containing 3000 rules. Almost all commercial firewalls are stateful, and most include application layer features. Firewalls are also incorporating Intrusion Detection and Intrusion Prevention, improving security at the cost of increased computation. The trend toward more secure and robust firewall protection drives an escalating need for more processing power within the firewall.

Since firewalls make per-packet decisions, packet rate is a critical metric of firewall performance. As a rule, firewalls must be able to handle minimum sized packets at the maximum rate delivered by the attached media. Minimum sized packets occur frequently in real traffic – the Network Processing Forum [3] specifies 40-byte packets to occupy 56% of the Internet Mix of IP packets. Denial-of-service attacks like SYN flood [4] commonly exploit minimum sized packets because they represent the most difficult workload for many firewall implementations.

In this paper, we explore the advantages of a hybrid firewall – one that includes an application layer Linux Netfilter firewall running on a general purpose processor and a simple packet filtering firewall executing on a Network Processor (NP) – by building and measuring the performance of such a firewall. The Intel® IXP2800 Network Processor [5] classifies all packets arriving at the firewall. For simple cases, the NP completes the
packet filtering operation required for firewall operation, deciding to forward or drop the packet on its own. We select an NP for our hybrid firewall both because of its packet processing prowess and because it is programmable, allowing changes in response to changing security needs. Programmability is an important factor in our selection of an NP over custom hardware.

Packets requiring additional computation, for instance packets defining a new flow, are processed by a Netfilter firewall running on Intel® Xeon™ Processors (called microprocessors or CPUs). The Intel Xeon Processor is selected for its ability to rapidly complete the complex firewall processing required by a full featured Netfilter firewall.

We first describe the Linux Netfilter firewall and the significant challenge posed by minimum packets. In order to build our hybrid firewall we constructed a library for packet communication between the NPs and the microprocessors over PCI. We describe this library, the two variations implemented, and the performance measured. The hybrid firewall itself is described next, with a focus on interfacing Netfilter to the NP resident firewall. Performance measures show that our hybrid firewall achieves our goal of providing line rate firewall processing for all packets, with more intensive firewall processing for a fraction of the packets. We dissect the performance achieved and present the hard lessons learned along the way. Related work and conclusions finish the paper.

2. NETFILTER FIREWALL
Netfilter [6] is a set of Linux kernel modifications provided to enable packet operations on packets received from the Linux IP networking layer. Introduced in the Linux 2.4 code base, Netfilter and its associated modules replace the previous instantiations of IP filtering used in older versions of Linux: ipfwadm in Linux 2.0 and IPChains in Linux 2.2. Netfilter provides the system code and the application programming interfaces needed to provide access from either kernel processes or user applications. IPTables is a set of user-space tools and kernel loadable modules used in conjunction with Netfilter. IPTables allows a user to define sets of rules governing packet-filtering behavior. IPTables also includes kernel loadable modules supporting many different types of packet operations. Used together, Netfilter and IPTables support packet filtering, packet forwarding, Network Address Translation (NAT), event logging, connection tracking, and more advanced operations (often called "packet mangling") capable of modifying the L4+ packet header or payload. In this paper, we will refer to Netfilter and IPTables interchangeably.

The basic packet flow through Netfilter and IPTables is determined by a series of filter blocks known as chains; this is the same concept used with IPChains in Linux 2.2. The three basic chains are the forwarding chain, the input chain, and the output chain, as shown in Figure 1.

The forwarding chain provides a path and rule set for packets not destined for the local host machine – packets to be routed through the network. This chain allows packet filtering based on user-defined rules, packet header data, and state collected from previous packets.

The input chain provides similar functionality for packets destined to a process on the local host machine. Packets can be passed to the application expecting them, or dropped without notifying the application. The output chain provides a similar mechanism for packets originating from the local host machine.

Figure 1. Rules Chains for Netfilter and IPTables (diagram: IP packets enter the Linux IP layer and traverse the input, forwarding, and output chains; the input chain leads to a local process)

Stateful firewalls based on Netfilter/IPTables store per-connection state. In some cases, "connection tracking" enables the firewall to consult a rules chain only when establishing a new connection, improving performance. IPTables evaluates new connections using helper modules that have protocol-specific knowledge. Connection state storage also enables more advanced firewall features such as Network Address Translation (NAT) and Application Layer Gateways (ALGs). ALGs for protocols like FTP perform packet data inspection/modification and retain significant connection state across packets within a connection.

We used Linux 2.6.7 Netfilter and IPTables to implement the microprocessor based portion of our firewall. Studies have shown that Netfilter firewalls cannot maintain Gigabit/sec line rates for small packets. Brink et al. [7] showed multi-Gb/s Netfilter forwarding rates for large packets, but only about 200 Mb/s for 64-byte Ethernet packets. We measured slightly higher Netfilter firewall performance, but still fell well short of our 2 Gb/s line rate for 64-byte Ethernet packets. Our hybrid firewall uses an NP based firewall in concert with a Netfilter firewall to provide near 2 Gb/s line rate processing for all packet sizes.

3. NP/CPU COMMUNICATION
Our hybrid firewall consists of a Netfilter firewall on a dual 2.4 GHz Intel Xeon Processor system coupled to a simple custom packet filtering firewall executing on a dual 1.4 GHz IXP2800 compact PCI card. Figure 2 is a picture of the hybrid firewall hardware. The two sets of processors connect via a 64-bit, 66 MHz Peripheral Component Interconnect (PCI) bus.

The IXP2800 network processor features sixteen RISC engines called microengines (MEs). Each ME holds enough registers to keep state for up to 8 computational threads and can context swap between these threads in just a few clocks. Our hybrid firewall's Ethernet receive, Ethernet transmit, and IP processing, along with the packet forwarding firewall, execute on the MEs. To move packets and data between the ME based firewall and the microprocessor based firewall, we had to create the communications library described next.
Figure 2. Hybrid Firewall prototype hardware

Figure 3. Message Passing Memory Configuration (each processor's address space holds a local copy of the circular queues and packet buffers, plus a remote copy reached across PCI)

3.1 Message-Passing Architecture
The first step towards packet passing between the MEs and microprocessors was enabling reads and writes from one processor's memory space to the other's. Unlike peripheral devices, which share a small portion of the platform's global address space, processors commonly assume control over all address space. This was true of both the CPU and NP processors in our firewall. A non-transparent PCI-to-PCI bridge provides connectivity between the two processors' address spaces by translating a subset of each processor's address space into the address space of the other processor.

After enabling reads and writes, we built a message-passing architecture between the two groups of processors. Performance measurements of different read and write operations across PCI heavily influenced the final architecture.

3.1.1 Memory Configuration
To implement a message passing architecture, the application establishes a set of circular queues visible to both processors. In order to eliminate the need for mutually exclusive accesses (not available over PCI), the library uses a unidirectional producer/consumer model. Complementary pairs of these queues (CPU-to-ME plus ME-to-CPU) establish bidirectional message passing.

At initialization time, software on each side of the non-transparent bridge communicates the offset of the base address of this circular queue within the PCI address window provided by the non-transparent bridge. Queue size and structure are constant (set at compile time). Each queue consists of a fixed number of entries or elements, with each element containing the following fields:

- IA Packet ID: unique identifier for packets allocated by the microprocessor
- NP Packet ID: unique identifier for packets allocated by the NP processor
- Packet Length: length of packet in bytes, including headers
- Status bits: unidirectional semaphore; producer raises, consumer lowers
- Application ID: message field to tell the consumer how to dispatch the packet
- Packet Buffer Offset: pre-allocated buffer provided by the consumer for the producer to copy packet data into

Figure 3 shows the conceptual memory map within a given processor's address space. This system keeps two copies of packet buffers and circular queues, one in local memory and one in remote (the other processor's) memory across PCI. Since the physical and virtual addresses of each region are different for each processor, a pointer value on one processor means nothing to the other processor. Applications use a system of base and offset pointers to coordinate their data structures, communicating physical offsets within the shared regions rather than passing virtual address pointers.

3.1.2 Message Passing Operations
Operation of the queues is similar in both directions. To prepare or refresh an element for the producer, the consumer allocates a packet buffer and calculates the corresponding physical byte offset to that buffer. It then records this offset along with a unique packet identifier into the queue element and raises the element status bits to the FREE state to indicate to the producer that this element is ready for the produce operation.

To send a packet, the producer reads the next queue element to make sure the status bits indicate FREE. When encountering a full queue, the next queue element will read FULL. If the queue is not full, the producer uses the packet buffer offset in the queue element to copy the packet data, including headers, to the consumer's (remote) memory. Within the queue element, the producer writes the packet's unique identifier, the length in bytes of the copied data, and an indication of operations to perform on the packet via the Application ID field. When the copy completes, the producer sets the status bits to FULL, indicating to the consumer that valid data is available.

The consume operation consists of reading the unique identifier from the local queue element and passing the packet to the function or application indicated by the Application ID field. With the packet dispatched, the consumer performs the refresh operation outlined above and raises the status bits to FREE once again.

Our measurements show reads across the PCI bus to exhibit a latency of hundreds of nanoseconds, too long compared to small packet arrival times. Using a technique herein referred to as "mirrored memory", the PCI Communications Library instantiates a complete copy of the circular queue structures in both the microprocessor and NP memories.
When updating queue elements, the application writes both the local and remote copies. Software performs all reads of queue state locally, so these reads return quickly without having to wait for PCI latency. The developer must impose restraints as to when each portion of the application can access particular data fields within the shared application state. For instance, the producer must never write an element in the FULL state, whereas the consumer must never write an element in the FREE state. Further, software must ensure that all related updates are complete prior to changing the value of the status bits.

The mirrored memory technique removes all runtime reads from the PCI bus. This write-write interface markedly improves overall bus efficiency. The trade-off is the introduction of many small writes to update remote queue elements. Figure 6 shows measurements of the final bandwidth achieved. Both tests assumed an ideal consumer. The data shows that queue state updates represent an acceptably small impact to performance using the mirrored memory approach.

Figure 6. PCI Bus Performance Showing Impact of Small Queue Management Writes (throughput in Mb/s vs. transfer size in bytes, for packet data only and for data plus queue management)

3.1.3 Performance Considerations
Moving packets between processing domains requires a copy of packet data across the PCI bus. The IXP2800 provides DMA engines for ME-to-CPU data moves. Figure 4 shows that these DMA engines achieve data rates of almost 3 Gb/s.

Figure 4. NP Raw DMA Performance (1 & 2 channels) (throughput in Mb/s vs. transfer size in bytes for PCI reads and writes on one and two DMA channels)

Using the NP's DMA engines to perform CPU-to-ME data movement requires PCI reads. Figure 4 shows the throughput achieved here is much lower than in the write case; too low for our needs. We selected an alternate method for CPU-to-ME data moves, using the microprocessor to write the data across the PCI bus to NP memory. To maximize the size of the CPU PCI writes and thereby improve PCI efficiency, we mapped the memory shared with the NP as Write-Combining. Write-Combining memory enables greater CPU write performance than uncached memory.

Using the CPU to move data from CPU to ME costs CPU cycles. When writing a large block of data to an uncached or write-combining region, the processor will stall until the entire write completes. Figure 5 is a conservative estimation of the CPU cycles spent moving 512 bytes across PCI:

    512 bytes / (8 bytes per PCI clock) + 6 clocks of overhead = 70 PCI clocks
    70 PCI clocks at 66 MHz = 3150 CPU clocks at 3 GHz

Figure 5. Cost of Moving Data with CPU Cycles (1)

(1) The six clock overhead estimate assumes only one clock to propagate across the non-transparent bridge. The calculation assumes a perfect target with no contention from other PCI bus masters.

3.2 Application Usage Models
The PCI link between the processors is a likely performance bottleneck. To help best manage this limitation, the PCI Communications Library provides two usage models, one optimized for maximum flexibility (decoupled) and another designed for greater performance for a subset of applications (coupled).

3.2.1 Decoupled Packet Passing
In this usage model, once the producer sets the status bits to FULL during the produce operation, the consumer owns the packet. The producer then drops the original packet and frees the associated memory. This model allows for maximum flexibility in application design because the processors may initiate, drop, and modify any packets they own without regard to any latent state on the other processor. The principal drawback of this model is complete packet data copies in both directions across the PCI bus. This may waste PCI bus cycles and processor cycles for applications that do not modify packet data.

3.2.2 Coupled Packet Passing
Coupled packet passing optimizes for packet filtering applications like firewalls and intrusion detection. Software using coupled packet passing assumes all packets originate within the MEs. As needed, these packets pass to the microprocessor for additional processing, but a copy of the original packet data remains within NP memory for later use. Software on the microprocessor then returns just the original packet identifier with an action directive like "PASS" or "DROP" coded into the Application ID field. The NP references the packet identifier to locate the original packet data and uses this stored copy to perform the indicated action.

Coupled packet passing conserves PCI bus cycles by reducing PCI traffic by roughly half compared to the same application in decoupled mode. Further, the CPU need not burn cycles
moving data over PCI. The resulting coupled packet sharing mechanism delivers about double the data rate available from the decoupled paradigm for periods of heavy PCI utilization, as shown in Figure 7.

Figure 7. Coupled vs. Decoupled Maximum Data Rates (throughput in Mb/s vs. packet size in bytes)

Table 1 and Table 2 show processor cycle counts during each mode of operation. These cycle counts reveal the five routines most frequently called while servicing 1518-byte packets at the maximum attainable data rate. In coupled mode, the processor is mostly busy handling the PCI Comms tasklet, polling the ring buffers, and receiving packets; packet transmission to the MEs is a low-overhead operation.

Table 1. Cycle Count – Coupled mode, 1518-byte packets
    % of Total   Image           Symbol (top 5 shown)
    15.0         Linux           tasklet_action
    13.0         CPU/ME Driver   ia_poll
    11.7         CPU/ME Driver   ia_ingress_tasklet
    10.6         Linux           net_rx_action
     8.1         Linux           do_softirq
    41.6                         other

The same analysis performed on the decoupled model reveals the source of the performance disparity. Copies to the MEs' memory during packet transmission dominate, consuming up to 23% of the processor's time. These copies account for just 5% of the actual retired instructions, suggesting that the processor spends many cycles stalled within the copy routine. These stalls arise because the processor supplies data at a much greater rate than the PCI bus can accept it.

Table 2. Cycle Count – Decoupled mode, 1518-byte packets
    % of Total   Image           Symbol (top 5 shown)
    23.3         CPU/ME Driver   copy_to_me
    11.9         Linux           tasklet_action
    10.8         CPU/ME Driver   ia_poll
     9.6         CPU/ME Driver   ia_ingress_tasklet
     8.4         Linux           net_rx_action
    35.9                         other

With the PCI communications established between MEs and CPU, we move on to describing the hybrid firewall. While this firewall supports concurrent coupled and decoupled operation, this paper addresses the coupled mode as it provides the best performance.

4. THE HYBRID FIREWALL
Our hybrid firewall distributes the firewall work between the Netfilter firewall on the microprocessors and the packet filtering firewall on the network processor. For our prototype, Netfilter packet mangling was not enabled, allowing use of the coupled PCI communications library and benefit from its performance advantages.

Packets arrive from the network at the IXP2800 MEs. The MEs handle base packet processing for all packets received by the firewall, including Ethernet processing, IPv4 forwarding operations, and application of simple firewall rules. Some packets require processing not included in the NP based packet filtering firewall; for our hybrid firewall this includes stateful or application level firewall processing, such as ALG processing. These packets are sent to the Netfilter firewall executing on the microprocessors, allowing our hybrid firewall to benefit from both the time-tested robustness of the Netfilter firewall and the complex processing speed of the microprocessor, while still providing the highly optimized packet processing features of the NP.

A key assumption in the construction of our hybrid firewall is that a large fraction of the packets received may be processed by the packet filtering firewall. There is good reason to believe this is true. We studied traces from seven internet connection points collected on June 6, 2004 by the National Laboratory for Applied Network Research. 90% of the packets were TCP, and we measured an average of 15 packets per connection – so flow setup and teardown, even at the TCP level, could be accomplished by forwarding as little as 13% of the packets to the microprocessor. FTP traffic, which would require special ALG handling, represented only 4% of the packets seen. SSH traffic was also 4%. While the fraction of packets filterable by just the packet filtering firewall will vary with rule and traffic mix, real traffic seems to support the hybrid approach.

4.1 Getting Packets to our Applications
As shown in Figure 8, the hybrid firewall builds upon the PCI communications library and utilizes pre-existing software within the Linux kernel and IXP (yellow or lightest gray). The PCI Comms Kernel Module exposes a standard Linux network driver interface allowing packets to travel between the MEs and the Linux network stack in both directions. Upper layer modules use this driver just as they would a regular NIC driver. A lightweight Netfilter protocol interface module (NFPI) injects packets into the standard Netfilter infrastructure, bypassing the IPv4 forwarding and Ethernet processing already completed by the NP.
Figure 8. Hybrid Firewall Software Architecture (general purpose processor: user-space Snort IDS and IPTables; kernel network stack, Netfilter firewall, NFPI, and PCI Comms Kernel Module; PCI 64/66 link; network processor: PCI Comms Microblock, IPv4 Forwarding, Ethernet)

The PCI Comms Kernel Module polls the ME-to-CPU communication rings for new packets arriving from the MEs. We chose a polling interface both because it is a good match for our high traffic rate application and because interrupts from the MEs were not available to us. For heavy traffic loads, polling mode works best since it allows the operating system to schedule the work, ensuring other applications run uninterrupted.

We simulated interrupts combined with polling by scheduling our polling routine as a "tasklet" during times of light traffic. The operating system schedules tasklets with a much lower frequency than the polling interface, saving CPU cycles. If during polling the driver determined that no packets had been received from the PCI Comms microblock, the driver removed itself from the polling queue and scheduled a tasklet to check the rings later. Because the latency between tasklet calls can be quite long, we had our tasklet immediately place our driver back in the polling queue even if no packets had been received. If interrupts had been available, we would have used them during times of low link utilization, switching to polling under heavy load.

The Linux operating system uses a packet descriptor structure called an sk_buff to hold packet data and metadata as the packet traverses the operating system. NIC devices will typically DMA directly into the packet data portion of the sk_buff. Similarly, the PCI Comms Kernel Module keeps a pool of allocated sk_buffs for use in microengine DMA transactions. The pool slowly drains over time; software replenishes it when it reaches a low water mark, or when there is a break in the packet passing work. We chose this design since keeping DMA buffers available on the queue for the NP requires that the CPU minimize time spent managing buffers between each poll.

The network driver interface under Linux requires that the driver set the protocol field in the sk_buff structure to contain the type field from the Ethernet packet header. This protocol field is then used by the Linux network core to dispatch packets to whichever protocol has registered a receive handler for the specific protocol ID in that packet's descriptor. For example, packets with a protocol ID of 0x800 normally proceed to the IP receive handling routine.

The PCI Comms library uses a unique Application ID to determine the destination application for each packet. For the hybrid firewall application, because Ethernet and IP packet processing has already occurred in the NP, packets should simply be handed to IPTables for firewall processing. We use the existing Linux dispatch code and bypass the Linux IP stack by registering a receive handler for each unique application ID and setting the protocol field in the sk_buff to match the application ID indicated by the microblock. Linux forwards packets to our application-specific "protocol" as long as the protocol field in the sk_buff matches the protocol field registered by the application with the networking core. This paper refers to the protocol module as the NF Protocol Interface, because it is responsible for interfacing to the Netfilter modules.

This method provides two very important advantages. First, we do not have to write our own dispatch code. Second, by using the existing network dispatch code in the Linux network core, we enable other legacy applications to receive our packets as well. For example, to send a packet to an existing Linux protocol stack, the NP sets the Application ID to that particular network protocol ID (such as 0x800 for TCP/IP).

4.2 Filtering Packets
Netfilter's design assumes operation within the context of an IP network protocol. The NF Protocol Interface performs the minimal processing required on the packet's descriptor, adjusting the fields in the sk_buff to point to the IP header as if IP packet processing had occurred. For the FORWARD hook required by packet filtering, this is all that is required to ensure proper IPTables processing. Note that this path avoids unnecessary TCP/IP processing on the general-purpose processor.

Once packets circulate through the standard IPTables entry points, IPTables makes a forward/drop determination based on its own rule set. IPTables then sends the sk_buff to the PCI Communications Module through the NF Protocol Interface along with a flag to indicate whether to drop or accept the packet. The drop/accept notification is placed on the CPU-to-ME communication ring.

4.3 Sharing State
Sharing state (rules tables and connection tracking tables) between a general-purpose processor and the MEs is difficult in our prototype. The MEs cannot match the virtual-to-physical address mapping performed on the microprocessor, and so are unable to follow the virtual pointers within microprocessor resident data structures. Moreover, the MEs cannot atomically update data structures held in CPU memory, since PCI does not provide atomic memory transactions. Without atomic updates, it is impossible to implement semaphores to guard read/write state, forcing the programmer into a single-writer model that may not be performance efficient.

Netfilter does not provide for such state sharing with PCI resident processors. Connection tracking state in IPTables contains both data and pointers to data. Rules state contains function pointers as well as data. To avoid address translation
To avoid address translation pitfalls, an ME-specific copy of the Netfilter connection table was created for the ME-resident firewall. We implemented a UDP socket interface between the microprocessors and the IXP2800 firewall software to enable the microprocessor to place a copy of its connection tracking state into NP-resident memory. This simple RPC-like interface allows the CPU-resident NFPI to update state in the IXP2800's memory. NP-resident state is formatted specifically for optimized ME accesses. Runtime firewall rules updates could use the same mechanism, but were not implemented.

4.4 Extending the hybrid firewall
By sticking with standard (rather than proprietary) interfaces to the Linux kernel, our hybrid firewall can be extended with unmodified "off the shelf" software. We tested this by adding the Snort* Intrusion Detection System (IDS) program to our firewall as shown in Figure 8. This addition required nothing more than the standard installation of the IDS application.

5. HYBRID FIREWALL PERFORMANCE
An expected advantage of the hybrid firewall, and in fact our motivation for exploring it, is superior performance. Figure 9 shows the performance of the hybrid firewall for various percentages of packets sent to the microprocessors. The chart also contains a "Netfilter only" line showing the performance of the original, non-hybrid Netfilter firewall on a platform using a pair of gigabit Ethernet NICs in place of the NP. Throughput includes Ethernet header and higher-layer bits, but not Ethernet preamble or inter-packet gap. The 0% (NP alone handles all packets) line is 2 Gb/s: full line rate.

Figure 9. Hybrid Firewall Performance

When all of the packets go to Netfilter (100%), the performance of the hybrid firewall is roughly equivalent to the Netfilter-only firewall. This indicates that our NP-based packet filtering firewall, and our connection from it to the Netfilter firewall, perform reasonably against the off-the-shelf products used in the "Netfilter only" case.

For cases where the NP handles most or all of the packets, the advantages of our hybrid firewall become clear. The 0% and 10% lines (i.e., 10% forwarded to Netfilter) achieve full line rate for every packet size. The 30% case achieves full line rate with packets of 128 bytes or longer. Finally, the system is able to supply full line rate, even with half of the packets going to the general-purpose CPU, for continuous streams of packets as small as 256 bytes. This represents an excellent performance improvement for the hybrid firewall over the standard Netfilter firewall, verifying our hypothesis.

5.1 Cost of Netfilter Processing
To measure the cost of Netfilter processing, we built a loopback kernel module that returns packets received from the PCI Comms Kernel Module with no processing. Figure 10 compares the throughput achieved by the hybrid firewall sending all packets to Netfilter against the same setup with the loopback driver replacing Netfilter. Netfilter processing imposes only a slight burden: for 64-byte packets, introducing Netfilter processing results in an 11% throughput reduction. Netfilter does not impose enough overhead to influence performance for large packets. Netfilter is clearly not the bottleneck limiting small-packet performance.

Figure 10. Netfilter Firewall compared to Loopback (throughput in Mb/s vs. packet size in bytes, for the Firewall, Loopback, and Line Rate series)

5.2 Analysis of CPU Processing Cycles
Table 3 shows microprocessor utilization for one of the two Intel Xeon Processors used. Both processors showed very similar utilization. The table includes all processes consuming greater than 1% of the processor's cycle count. The data shows the hybrid firewall processing a stream of 64-byte packets, with 100% forwarded to Netfilter. All of these functions relate to packet processing and together account for 95% of the total cycle count.
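As a sanity check, the per-function shares listed in Table 3 below can be totaled; the throwaway C sketch here (values copied from the table) sums them to roughly 94.5%, consistent with the quoted 95% figure:

```c
/* Per-function share of total cycles (%), copied row by row from Table 3. */
const double table3_shares[] = {
    15.3, 13.0, 11.7, 10.1, 7.9, 4.6, 4.2, 3.9, 3.9, 3.8,
     2.0,  2.0,  2.0,  1.9, 1.7, 1.6, 1.3, 1.2, 1.2, 1.2
};

/* Sum the shares of all listed packet-processing functions. */
double table3_total(void)
{
    double sum = 0.0;
    for (unsigned i = 0; i < sizeof table3_shares / sizeof table3_shares[0]; i++)
        sum += table3_shares[i];
    return sum;   /* roughly 94.5 */
}
```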
Table 3. Cycle Count for Hybrid Firewall²

 % of Total   Image              Symbol
 15.3         Linux              tasklet_action
 13.0         PCI Comms Driver   ia_poll
 11.7         PCI Comms Driver   ia_ingress_tasklet
 10.1         Linux              net_rx_action
  7.9         Linux              do_soft_irq
  4.6         PCI Comms Driver   ia_get_msg
  4.2         Linux              netif_receive_skb
  3.9         PCI Comms Driver   ia_rx_msg
  3.9         PCI Comms Driver   ia_msg_rx_alloc_backup
  3.8         Linux              tasklet_schedule
  2.0         PCI Comms Driver   ia_check_rings
  2.0         Linux              eth_type_trans
  2.0         Linux              skb_dequeue
  1.9         Linux              skb_release_data
  1.7         Linux              kfree
  1.6         Linux              kmalloc
  1.3         Linux              skb_queue_tail
  1.2         Linux              alloc_skb
  1.2         Linux              memcpy
  1.2         PCI Comms Driver   me_pkt_forward

The actual firewall packet processing is only a small percentage of the overall cycle count, so small that the functions responsible do not appear in the table. All observed Netfilter function calls accounted for only 0.91% of the cycle count in the above profile. Likewise, the NF protocol interface module accounted for 0.66% of the cycle count.

This analysis identifies several high-cost tasks. Managing the OS buffers (sk_buff) accounts for about 10% of the CPU time (netif_receive_skb, skb_dequeue, skb_release_data, kfree, kmalloc, alloc_skb, skb_queue_tail). Polling rings within the PCI communications library uses a combination of a polling loop (ia_poll, 13%) and Linux tasklet scheduling (tasklet_action, 15.3%, and ia_ingress_tasklet, 11.7%). Clearly, polling is the area we would first look to tune. Having a system with interrupts available would eliminate this tasklet-scheduling overhead.

6. CONCLUSIONS
Using Linux Netfilter, an existing full-featured firewall running on a high-performance general-purpose CPU, we were able to successfully add a simple packet filtering firewall running on a network processor and achieve substantial performance gains. In fact, for small packets our hybrid firewall prototype was able to achieve 2 Gb/s line rate with almost 30% of the packets forwarded to Netfilter. This is a greater than 4X throughput gain over the standard Netfilter firewall. Our hybrid firewall exhibited the superior performance we suspected it would.

A number of characteristics of the PCI connection between our microprocessors and network processor served to make programming difficult. The multiple address spaces associated with each processor, mapped together through the non-transparent PCI-to-PCI bridge, were difficult for the programmer to manage. Synchronizing data structures amongst the multiple address spaces is both complicated and tedious. Lack of atomic transactions to PCI-resident memory resources drove constraints into the creation and use of data structures accessible from both microprocessor and ME. The inability of the MEs to use microprocessor virtual addresses also complicated data structure creation and use. The low bandwidth and long latency associated with ME PCI reads drove our message passing architecture to consume extra memory bandwidth and made effective state sharing almost impossible. Polling and memory allocation for packet buffers consumed more microprocessor cycles than we hoped. We will seek to improve these features in future implementations of our hybrid approach.

7. RELATED WORK
Brink et al. [7] present measurements for IPv4 forwarding, for a Netfilter firewall, and for IPsec on Linux* 2.4.18 running on a dual Intel Xeon Processor based system. Their measurements show lower-than-line-rate throughput for smaller packets, concurring with ours. The authors also present IPv4 forwarding results for a dual Intel® IXP2400 network processor system showing close to maximum theoretical line rate for all packet sizes. Based on benchmarks published by the Network Processing Forum [8] and the IETF [9], Kean [10] presents a methodology for benchmarking firewalls along with measurements for an earlier generation network processor, the Intel® IXP1200. Kean's measurements show significant roll-off for smaller packet sizes, though not as large as the Netfilter firewall's.

Alternate firewall platform architectures have been explored before. The Twin Cities prototype provided a coherent shared memory interface between an Intel Pentium® III Processor and an IXP1200 [11]. Twin Cities did not target an existing CPU firewall (Netfilter) and used custom hardware. Other authors have explored different architectures, including FPGA-based firewalls [12] and firewall-targeted network processor designs [13].

Corrent* [14] advertises a firewall that most closely matches the work described here, using a combination of network processors and general-purpose processors to execute a CheckPoint* Firewall. Corrent* even shows that their approach leads to enhanced performance for small packets. IPFabrics* [15] offers both a PCI add-in card that holds two Intel® IXP2350s and a Packet Processing Language that enables quick programming of applications like firewalls for the NPs.

8. ACKNOWLEDGEMENTS
The authors would like to acknowledge Santosh Balakrishnan, Alok Kumar and Ellen Deleganes for the NP-based firewall that served as an early version for the NP code used in this paper. We would also like to extend a special thank you to Rick Coulson and Sanjay Panditji for their steadfast support of this work and to Raj Yavatkar for his valuable guidance.

® Pentium III is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
* Other brands and names are the property of their respective owners.
² Continuous stream of 64-byte packets.
9. REFERENCES
[1] W. R. Cheswick, S. Bellovin, A. Rubin. Firewalls and Internet Security, Second Edition. Addison-Wesley, 2003.
[2] M. Kounavis, A. Kumar, H. Vin, R. Yavatkar, A. Campbell. "Directions in Packet Classification for Network Processors". 2004.
[3] R. Peschi, P. Chandra, M. Castelino. "IP Forwarding Application Level Benchmark v1.6". Network Processing Forum, May 12, 2003.
[4] Computer Emergency Response Team. "CERT Advisory CA-1996-21 TCP SYN Flooding and IP Spoofing Attacks". Nov 29, 2000.
[5] "Intel® IXP2800 Network Processor". Intel Corporation.
[6] H. Welte. "What is Netfilter/IPTables?"
[7] P. Brink, M. Castelino, D. Meng, C. Rawal, H. Tadepalli. "Network Processing Performance Metrics for IA- and NP-Based Systems". Intel Technology Journal, Volume 7, Issue 4, 2003, pp. 78-91.
[8] "NPF Benchmarking Implementation Agreements". Network Processing Forum.
[9] B. Hickman, D. Newman, S. Tadjudin, T. Martin. "RFC 3511 - Benchmarking Methodology for Firewall Performance". The Internet Society, April 2003.
[10] L. Kean, S. B. M. Nor. "A Benchmarking Methodology for NPU-Based Stateful Firewall". APCC 2003, Volume 3, 21-24 Sept. 2003, pp. 904-908.
[11] F. Hady, T. Bock, M. Cabot, J. Meinecke, K. Oliver, W. Talarek. "Platform Level Support for High Throughput Edge Applications: The Twin Cities Prototype". IEEE Network, July/August 2003, pp. 22-27.
[12] A. Kayssi, L. Harik, R. Ferzli, M. Fawaz. "FPGA-based Internet protocol firewall chip". Electronics, Circuits and Systems, 2000 (ICECS 2000), Volume 1, 17-20 Dec. 2000, pp. 316-319.
[13] K. Vlachos. "A Novel Network Processor For Security Applications in High-Speed Data Networks". Bell Labs Technical Journal 8(1), 2003, pp. 131-149.
[14] "Corrent Security Appliances Sustain Maximum Throughput Under Attack". Corrent*.
[15] "Double Espresso". IPFabrics*.