  • Intro – nothing to say here!
  • VLAN and vint (virtual interface) material is still a bit nebulous – this is where the existing documentation is lightest, and most of my information comes directly from the source code.
  • I’m going to cover primarily Ethernet, although other data link layer protocols are handled very similarly; Ethernet is highly representative.
  • This should be nothing new.
  • I see the sk_buff level of detail as a problem for netv. Because Linux today places so much information into the skb, for so many layers of the protocol stack, the interfaces may not be well designed for protocols that don’t fit that mold – and adding virtual protocol information to the skb means recompiling the kernel, not just importing a new protocol module. This isn’t a stopping point, but it does mean that we may not be able to follow the existing protocol implementations as closely as we might wish.
  • softirq notes: the softirq runs through 4 tasklet types in order: HI_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, and TASKLET_SOFTIRQ. NET_RX_SOFTIRQ has net_rx_action() as its handler. The first action in net_rx_action() is to pass the packet to all protocols registered in ptype_all – a good place for hooks (by default, none are registered). Next comes handle_bridge(), if we’re configured as a bridge. Last is the actual protocol demux: eth_type_trans() has already set the sk_buff’s protocol field, and this protocol ID (IP = 0x0800) is mapped via a hash to a protocol instance handler (ip_rcv() for IPv4, arp_rcv() for ARP, and so on for other protocols as needed). If the packet matches multiple protocols, we send it to ALL of them. There are minor OSI layering violations in Linux here: some protocol instances are technically layer 2 but are still demuxed at this point (802.2 packets are classed as ETH_P_802_2, sent to p8022_rcv(), and demuxed to layer 3 from there).
  • Note that VLAN demux happens on output, not input! Bridging is not important here, but takes place well below the IP layer, in layer 2. Because hooks exist for packet filtering, it is possible to build a layer 2 firewall that nevertheless does full packet reassembly and filtering (I’ve done this). However, these hooks are not based on the regular netfilter code, and are not standard parts of the kernel. Bridging is done from the NET_RX_SOFTIRQ tasklet via net_rx_action(), presumably in netif_receive_skb(), which calls br_handle_frame() from net/bridge/br_input.c. br_handle_frame() does some sanity checking and determines whether to drop the frame, forward it to other interfaces, and/or pass it up the local network stack. Incidentally, according to the references I’ve seen, vints were allegedly invented in Linux to hide multicast implementation details (either IP-in-IP or native Ethernet). I was involved in some of this development, back in 1993, and this does not agree with my memory. I recall the original idea being to place hooks in the vint for packet mangling before/after passing to the real device. Of course, this was 12 years ago, and I was more connected to the UNIX domain socket development (asynchronous I/O with SIGIO).
  • Actual protocol demux takes place in ip_local_deliver(), via a lookup table of packet handlers, ipprot->handler (see the socket interface for the registration method). Forwarding uses ip_route_input(). If a fast hash entry isn’t there, it calls ip_route_input_slow(), which uses the FIB – Forwarding Information Base – to obtain the route. I don’t even pretend to understand the FIB yet; this code is not simple. ARP is handled from ip_route_output() [called from ip_queue_xmit()]. If no ARP entry exists, the packet is added to a queue associated with that destination while an ARP request is sent. Once the reply is received [arp_rcv()], the sk_buffs are removed from the queue and sent to the qdisc. For virtual interfaces, dev_queue_xmit() calls the device’s hard_start_xmit() function. In the case of ipip, this encapsulates the packet appropriately and re-routes it back to the beginning of the IP output routines via IPTUNNEL_XMIT(), which calls ip_send_check().
  • Note: Ingress filtering exists, but is equivalent to firewalling the interface.
  • I consider the actual implementation interface a bit out of scope, but each component type has a well-defined interface. Components are implemented in an object-oriented style, in C, with function pointers used for methods. /dev/kids is used to manage KIDS; registering new components and managing them does, obviously, require root access. KIDS components are not checked for safety at compile time or run time, but a decent inventory of components is already provided.

    1. Networking in the Linux Kernel
    2. Introduction
  • Overview of the Linux networking implementation
  • Covered:
    • Data path through the kernel
    • Quality of Service features
    • Hooks for extensions (netfilter, KIDS, protocol demux placement)
    • VLAN tag processing
    • Virtual interfaces
  • Not covered:
    • Kernels prior to 2.4.20, or 2.6+
    • Specific protocol implementations
    • Detailed analysis of existing protocols, such as TCP. These are covered only in enough detail to see how they link to higher/lower layers.
    3. OSI Model
  • The Linux kernel adheres closely to the OSI 7-layer networking model
  [Diagram: OSI layers mapped to the Linux stack – Application/Presentation/Session → Application (above the socket: HTTP, SSH, etc.); Transport → TCP/UDP; Network → Internet (IP); Data Link/Physical → Data Link (802.x, PPP, SLIP)]
    4. OSI Model (Interplay)
  • Layers generally interact in the same manner, no matter where placed
  [Diagram: layer N+1 data plus layer N+1 control gets a header and/or trailer added, then is passed to layer N as raw data, becoming layer N data]
    5. Socket Buffer
  • When discussing the data path through the Linux kernel, the data being passed is stored in sk_buff structures (socket buffers), which hold:
    • Packet data
    • Management information
  • The sk_buff is first created incomplete, then filled in during its passage through the kernel, both for received packets and for sent packets.
  • Packet data is normally never copied; the kernel just passes around pointers to the sk_buff and changes structure members.
    6. Socket Buffer
  [Diagram: an sk_buff with next/prev/list pointers (all sk_buff’s are members of a queue), sk (source socket), dev and dev_rx (associated device), and head/data/tail/end pointers into the packet data (MAC header, IP header, TCP header, data). Cloned sk_buff’s share data, but not the control structure.]
  • struct sk_buff is defined in include/linux/skbuff.h
    7. Socket Buffer
  • sk_buff features:
    • Reference counts for cloned buffers
    • A separate allocation pool, and support functions for manipulating the data space
    • Very “feature-rich” – this is a very complex, detailed structure, encapsulating information from protocols at multiple layers
  • There are also numerous support functions for queues of sk_buff’s.
    8. Data Path Overview
  [Diagram: packets flow from the NIC hardware DMA rings through the network device driver and net_rx_action() (softirq), then through layer 3 protocol demux (IP), layer 4 protocol demux (TCP/UDP), and socket demux up to the user-space socket; transmission passes back down through the queue discipline.]
    9. OSI Layers 1&2 – Data Link
  • The code presented resides mostly in the following files:
    • include/linux/netdevice.h
    • net/core/skbuff.c
    • net/core/dev.c
    • arch/i386/kernel/irq.c
    • drivers/net/net_init.c
    • net/sched/sch_generic.c
    • net/ethernet/eth.c (for layer 3 demux)
    10. Data Link – Data Path
  [Diagram: the NIC DMAs packets into the DMA rings; net_interrupt (net_rx, net_tx, net_error) calls netif_rx_schedule(), which adds the device pointer to the poll_queue; the softirq runs net_rx_action(), which calls dev->poll() and netif_receive_skb() to hand packets up to IP (layer 2 → layer 3); outbound packets pass through the queue discipline’s enqueue().]
    11. Data Link – Features
  • NAPI
    • Old API would reach interrupt livelock under 60 MBps
    • New API ensures the earliest possible drop under overload:
      • Packet received at NIC
      • NIC copies it to the DMA ring (struct sk_buff *rx_ring[])
      • NIC raises an interrupt via netif_rx_schedule()
      • Further interrupts are blocked
      • Clock-based softirq calls net_rx_action(), which calls dev->poll()
      • dev->poll() calls netif_receive_skb(), which does protocol demux (usually calling ip_rcv())
    • Backward compatibility for non-DMA interfaces is maintained:
      • All legacy devices use the same backlog queue (equivalent to a DMA ring)
      • The backlog queue is treated just like all other modern devices
  • Per-CPU poll_list of devices to poll
    • Ensures no packet re-ordering is necessary
  • No memory copies in the kernel – the packet stays in the sk_buff at the same memory location until passed to user space
    12. Data Link – Transmission
  • Transmission:
    • Packet sent from the IP layer to the queue discipline
    • Any appropriate QoS is applied in the qdisc – discussed later
    • The qdisc notifies the network driver when it’s time to send, calling hard_start_xmit():
      • Places all ready sk_buff pointers in the tx_ring
      • Notifies the NIC that packets are ready to send
      • The NIC signals (via interrupt) when packet(s) are successfully transmitted. (Exactly when the interrupt is sent is highly variable!)
      • The interrupt handler queues transmitted packets for deallocation
    • At the next softirq, all packets in the completion_queue are deallocated
    13. Data Link – VLAN Features
  • Still dependent on individual NICs
    • Not all NICs implement VLAN filtering
      • A partial list is available at need (not included here)
    • For non-VLAN NICs, Linux filters in software and passes packets to the appropriate virtual interface for ingress prioritization and layer 3 protocol demux
      • net/8021q/vlan_dev.c (and others in this directory)
      • The virtual interface passes through to the real interface
    • No VID-based demux is needed for received packets, as different VLANs are irrelevant to the IP layer
    • Some changes in 2.6 – still need to research this
    14. OSI Layer 3: Internet
  • The code presented resides mostly in the following files:
    • net/ipv4/ip_input.c – process packet arrivals
    • net/ipv4/ip_output.c – process packet departures
    • net/ipv4/ip_forward.c – process packet traversal
    • net/ipv4/ip_fragment.c – IP packet fragmentation
    • net/ipv4/ip_options.c – IP options
    • net/ipv4/ipmr.c – IP multicast
    • net/ipv4/ipip.c – IP over IP, also a good virtual interface example
    15. Internet: Data Path
  Note: chart copied from DataTAG’s “A Map of the Networking Code in the Linux Kernel”
    16. Internet: Features
  • Netfilter hooks in many places
    • INPUT, OUTPUT, FORWARD (iptables)
    • NF_IP_PRE_ROUTING – ip_rcv()
    • NF_IP_LOCAL_IN – ip_local_deliver()
    • NF_IP_FORWARD – ip_forward()
    • NF_IP_LOCAL_OUT – ip_build_and_send_pkt()
    • NF_IP_POST_ROUTING – ip_finish_output()
  • Connection tracking is done in IPv4, not in TCP/UDP/ICMP
    • Used for NAT, which must maintain connection state in violation of OSI layering
    • Can also gather statistics for network usage, but all of this functionality comes from the netfilter module
    17. Socket Structure and System Call Mapping
  • The following files are useful:
    • include/linux/net.h
    • net/socket.c
  • There are two significant data structures involved: the socket and the net_proto_family. Both involve arrays of function pointers to handle each relevant system call type.
    18. System Call: socket
  • From user space, an application calls socket(family, type, protocol)
  • The kernel calls sys_socket(), which calls sock_create()
  • sock_create() references net_families[family], an array of network protocol families, to find the corresponding protocol family, loading any modules necessary on the fly.
    • If a module must be loaded, it is requested as “net-pf-<num>”, where the protocol family number is used directly in the string. For TCP, the family is PF_INET (was: AF_INET), and the type is SOCK_STREAM
    • Note: Linux has a hard limit of 32 protocol families. (These include PF_INET, PF_PACKET, PF_NETLINK, PF_INET6, etc.)
    • Layer 4 protocols are registered with inet_add_protocol() (include/net/protocol.h), and socket interfaces are registered with inet_register_protosw(). Raw IP datagram sockets are registered like any other layer 4 protocol.
  • Once the correct family is found, sock_create() allocates an empty socket, obtains a mutex, and calls net_families[family]->create(). This is protocol-specific, and fills in the socket structure. The socket structure includes another function array, ops, which maps all system calls valid on file descriptors.
  • sys_socket() calls sock_map_fd() to map the new socket to a file descriptor, and returns it.
    19. Other socket System Calls
  • Subsequent socket system calls are passed to the appropriate function in socket->ops. These include (exhaustive list):
    • release, bind, connect, socketpair
    • accept, getname, poll, ioctl
    • listen, shutdown, setsockopt, getsockopt
    • sendmsg, recvmsg, mmap, sendpage
  • Technically, Linux offers only one socket system call, sys_socketcall(), which multiplexes to all the others via its first parameter. This means that socket-based protocols could provide new and different system calls via a library and a mux, although this is never done in practice.
    20. PF_PACKET
  • A brief word on the PF_PACKET protocol family
  • PF_PACKET creates a socket bound directly to a network device. The call may specify a packet type. All packets sent to this socket are transmitted directly over the device, and all incoming packets of that type are delivered directly to the socket. No protocol processing is done in the kernel. Thus, this interface can be – and is – used to create user-space protocol implementations. (E.g., PPPoE uses this with packet type ETH_P_PPP_DISC)
    21. Quality of Service Mechanisms
  • Linux has two QoS mechanisms:
    • Traffic Control
      • Provides multiple queues, and priority schemes within those queues, between the IP layer and the network device
      • Defaults are 100-packet queues with 3 priorities and FIFO ordering
    • KIDS (Karlsruhe Implementation architecture of Differentiated Services)
      • Designed to be component-extensible at runtime
      • Consists of a set of components with similar interfaces that can be plugged together into almost arbitrarily complex constructions
  • Neither mechanism implements the higher-level traffic agreements, such as Traffic Conditioning Agreements (TCAs). MPLS is offered in Linux 2.6.
    22. Traffic Control
  • Traffic Control consists of three types of components:
  • Queue Disciplines
    • These implement the actual enqueue() and dequeue()
    • May also have child components
  • Filters
    • Filters classify traffic received at a queue discipline into classes
    • Normally children of a queuing discipline
  • Classes
    • These hold the packets classified by filters, and have associated queuing disciplines to determine the queuing order
    • Normally children of a filter and parents of queuing disciplines
  • Components are connected into structures called “trees,” although technically they aren’t true trees because they allow upward (cyclical) links.
    23. Traffic Control: Example
  This is a typical TC tree. The top-level queuing discipline is the only access point from the outside – the “outer queue.” From external access, this is a single queue structure. Internally, packets received at the outer queue are matched against each filter in order; the first match wins, with a final default case. Dequeue requests to the outer queue are passed along recursively to the inner queues to find a packet ready for sending.
  [Diagram: Queuing Discipline 1:0 (enqueue/dequeue) feeds two Filters and a Default case into Class 1:1 and Class 1:2, which contain Queuing Discipline 2:0 and Queuing Discipline 3:0 respectively.]
    24. Traffic Control (Cont’d)
  • The TC architecture supports a number of pre-built filters, classes, and disciplines, found in net/sched/: cls_* are filters, whereas sch_* are disciplines (classes are collocated with disciplines).
  • Some disciplines:
    • ATM
    • Class-Based Queuing
    • Clark-Shenker-Zhang
    • Differentiated Services marker
    • FIFO
    • RED
    • Hierarchical Fair Service Curve (SIGCOMM ’97)
    • Hierarchical Token Bucket
    • Network Emulator (for protocol testing)
    • Priority (3 levels)
    • Generic RED
    • Stochastic Fairness Queuing
    • Token Bucket
    • Equalizer (for equalizing line rates of different links)
    25. KIDS
  • KIDS establishes 5 general component types (by interface):
  • Operative components – receive a packet and run an algorithm on it. The packet may be modified or simply examined. E.g., token buckets, RED, shapers
  • Queue components – data structures used to enqueue/dequeue. Includes FIFO, Earliest-Deadline-First (EDF), etc.
  • Enqueuing components – enqueue packets using particular methods: tail-enqueue, head-enqueue, EDF-enqueue, etc.
  • Dequeuing components – dequeue using particular methods
  • Strategic components – strategies for serving dequeue requests. E.g., WFQ, round robin
    26. KIDS (Cont’d)
  • KIDS has 8 different hook points in the Linux kernel, 5 at the IP layer and 3 at layer 2:
    • IP_LOCAL_IN – just prior to delivery to layer 4
    • IP_LOCAL_OUT – just after leaving layer 4
    • IP_FORWARD – packet being forwarded (router)
    • IP_PRE_ROUTING – packet newly arrived at the IP layer from an interface
    • IP_POST_ROUTING – packet routed from IP to layer 2
    • L2_INPUT_<dev> – packet has just arrived from the interface
    • L2_ENQUEUE_<dev> – packet is being queued at layer 2
    • L2_DEQUEUE_<dev> – packet is being transmitted by layer 2