Your SlideShare is downloading. ×
Network Router Resource Virtualisation
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Network Router Resource Virtualisation


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Department of Computing Science University of Glasgow M ASTERS R ESEARCH T HESIS MS CI , 2004/2005 Network Router Resource Virtualisation by Ross McIlroy
  • 2. Abstract The Internet was created as a best effort service, however, there is now considerable interest in appli- cations which transport time sensitive data across the network. This project looks into a novel network router architecture, which could improve the Quality of Service guarantees which can be provided to such flows. This router architecture makes use of virtual machine techniques, to assign an individual virtual routelet to each network flow requiring QoS guarantees. In order to evaluate the effectiveness of this virtual routelet architecture, a prototype router was created, and its performance was measured and compared with that of a standard router. The results suggest promise in the virtual routelet archi- tecture, however, the overheads incurred by current virtual machine techniques prevent this architecture from outperforming a standard router. Evaluation of these results suggests a number of enhancements which could be made to virtual machine techniques in order to effectively support this virtual routelet architecture.
  • 3. Acknowledgements I would like to acknowledge the help and advice received from the following people, during the course of this project. My project supervisor, Professor Joe Sventek for the invaluable advice, support and encouragement he gave throughout this project. Dr Peter Dickman and Dr Colin Perkins for the help and advice which they freely provided whenever I had questions. Jonathan Paisley for his invaluable assistance when setting up the complex network routing required by this project. The Xen community for their help with any questions I had and their willingness to provide new unre- leased features for the purposes of this project. Special thanks to Keir Fraser for his advice when I was creating the VIF credit limiter, and Stephan Diestelhorst for providing an unfinished version of the sedf scheduler. The Click Modular Router community for help and advice when writing the Click port to the Linux 2.6 kernel, especially Francis Bogsanyi for his preliminary work and Eddie Kohler for his advice. The technical support staff within the department, especially Mr Gary Gray, Mr Naveed Khan and Mr Stewart MacNeill from the support team, and all of the department technicians, for their help in building the experimental testbed and providing equipment for the project. The other MSci students, especially Elin Olsen, Stephen Strowes and Christopher Bayliss, for answering various questions and for generally keeping me sane. My parents for their support and their help proof reading this report. i
  • 4. Contents 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Survey 3 2.1 Router Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Quality of Service Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Virtualisation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Similar Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Approach 14 4 Integration of Open Source Projects 18 4.1 Overview of Open Source Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.2 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2.1 The Click Modular Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2.2 MPLS RSVP-TE daemon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5 QoS Routelet 26 5.1 Routelet Guest Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1.1 Linux Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1.2 Linux Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1.3 Routelet Startup Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.2 Routelet Packet Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2.1 Overall Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2.2 ProcessShim Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 6 Main Router 33 6.1 Routelet Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.2 Assignment of QoS Network Flows to Routelets . . . . . . . . . . . . . . . . . . . . . . 34 6.2.1 ConnectRoutelet Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6.2.2 ClickCom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.2.3 RSVP-TE Daemon Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.3 Packet Demultiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3.1 Overall Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.3.2 Automatic Demultiplexing Architecture Creation . . . . . . . . . . . . . . . . . 40 6.3.3 MplsSwitch Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.3.4 Returning Routelets to the Idle Pool . . . . . . . . . . . . . . . . . . . . . . . . 41 ii
  • 5. 6.4 Routelet Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.4.1 CPU Time Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.4.2 Network Transmission Rate Allocation . . . . . . . . . . . . . . . . . . . . . . 42 7 Experimental Testbed Setup 44 7.1 Network Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 7.2 Network QoS Measurement Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7.2.1 Throughput Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7.2.2 Per-Packet Timing Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . 47 7.3 Accurate Relative Time Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 7.4 Network Traffic Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 7.5 Straining Router with Generated Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . 49 8 Experimental Results and Analysis 51 8.1 Virtual Machine Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 8.1.1 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 8.1.2 Interpacket Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 8.1.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 8.2 Network Flow Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 8.2.1 Effect of Timer Interrupt Busy Wait . . . . . . . . . . . . . . . . . . . . . . . . 64 8.2.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 8.2.3 Interpacket Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 8.2.4 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 8.3 Future Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 9 Evaluation 72 9.1 Experimental Results Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 9.2 Virtualisation Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 9.2.1 Linux OS for each Routelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 9.2.2 Context Switch Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 9.2.3 Routlet Access to NIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 9.2.4 Classifying Packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 9.2.5 Soft Real Time Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 10 Conclusion 76 10.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 10.2 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 10.3 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Bibliography 81 iii
  • 6. Chapter 1 Introduction There is considerable interest in applications that transport isochronous data across a network, for exam- ple teleconferencing or Voice-over-IP applications. These applications require a network that provides Quality of Service (QoS) guarantees. While there are various networks that provide some form of QoS provisioning, there has been little investigation into the allocation and partitioning of a router’s resources between flows, so that network flows can only access their own allocation of resources, and cannot affect the QoS provided to other flows by the network. This project is based on the hypothesis that virtual machine techniques can be used within network routers to provide effective partitioning between different network flows, and thus provide Quality of Service guarantees to individual network flows. The aim of this research project is to create a prototype experimental router which uses virtualisation techniques to provide QoS guarantees to network flows, in order to prove or disprove this hypothesis. 1.1 Background The Internet was created as a best effort service. Packets are delivered as quickly and reliably as possible; however, there is no guarantee as to how long a packet will be in transit, or even whether it will be delivered. There is an increasing interest in the creation of networks which can provide Quality of Service (QoS) guarantees. Such networks would be able to effectively transport isochronous data, such as audio, video or teleconference data feeds, since each feed could have guaranteed bounds on latency and throughput. A number of attempts have been made to tackle Quality of Service issues in computer networks. Var- ious approaches have been investigated, including resource reservation protocols, packet scheduling algorithms, overlay networks and new network architectures. Various networks have been built using these approaches, however one of the problems they face is allocating a fixed proportion of the network router resources to each stream. For example, if a denial of service attack is attempted on one of these streams, the router could be overwhelmed, and not able to provide the quality of service needed by the other streams. Some method is needed for partitioning the router resources among streams, so that traffic on one stream is minimally affected by the traffic of other streams. One option is to have separate routers for each stream. This over-provisioning is obviously very ex- pensive and inflexible; however, it is currently the standard commercial method of guaranteeing certain bounds on quality of service to customers. This research pproject investigates the feasibility of creating a single routing system, which is built up of a number of virtual routelets. Each QoS flow is assigned its own individual virtual routelet in each physical router along the path it takes through the network. 1
  • 7. These virtual routelets are scheduled by a virtual machine manager, and have a fixed resource allocation dependent upon the QoS required by the flow. This resource partitioning between routelets should pro- vide a similar level of independence between flows as separate routers, but at a much reduced cost. This system should also allow for greater flexibility in the network, with dynamic flow creation and dynamic adjustment of flow QoS requirements. Process sharing queueing schemes, such as Weighted Fair Queueing (WFQ) [27], are often used to guar- antee bounds on transmission latency and throughput for network flows. However, I believe that full router resource virtualisation would offer a number of advantages over WFQ. Firstly, WFQ is typically placed just before packet transmission. The packet has already been fully processed by the router before it reaches this stage. Therefore, although the router’s outgoing transmission resources are fairly shared between network flows, the internal memory and processing resources of the router are not fairly shared between flows. Another advantage router virtualisation offers over WFQ is its ability to support smarter queueing within network flows. In WFQ, when a network flow is scheduled to transmit a packet, it simply transmits the next packet on its queue in FIFO order. A virtual routelet would have much more flexibility in choosing which packet to send, or which packets to drop. For example, if a routelet ser- vicing an MPEG [16] stream become overloaded, could choose to drop packets containing difference (P or B) frames, rather than the key (I) frames, thus ensuring a gradual degradation in performance, rather than a sharp drop-off. 1.2 Document Outline The remainder of this dissertation is presented in the following format. Chapter 2 surveys relevant literature in this research area, and provides a broad background of the various aspects to this project. Chapter 3 outlines the overall approach taken during this project and gives a high level design of the prototype router which was created. Chapter 4 discusses some of the work which needed to be performed in order to integrate the various open source subsystems used by the prototype router. The design and implementation of the virtual QoS routelet is described in Chapter 5. Chapter 6 describes the modifications which were made to the main router, in order to support the creation and assignment of routelets to network flows. In order to measure the virtual routelet prototype’s performance, an experimental testbed was set up. The creation of this testbed is described in Chapter 7, and the results and analysis of the experimental results are presented in Chapter 8. These results are evaluated in more detail, with regard to the virtual routelet architecture, in Chapter 9. Finally, the conclusions drawn from this research are presented in Chapter 10. 2
  • 8. Chapter 2 Survey The aim of this project is to create a router which can allocate a fixed proportion of its resources to each QoS flow it handles. To do this, it is important to first understand how a standard router works. Section 2.1 discusses papers involving both hardware and software routers, which were investigated to understand routing techniques. There have been a number of efforts which have approached the issue of Quality of Service in the Internet. These efforts tackle different areas required by truly QoS-aware networks, such as flow speci- fication, routing, resource reservation, admission control and packet scheduling. As part of this project proposal, I have investigated efforts in a number of these areas, and describe my findings in Section 2.2. This project will make use of virtualisation techniques to partition the individual routlets from each other. There have been a number of virtualisation systems created, each with its own aims and objectives. Section 2.3 discusses the virtualisation techniques and systems which were investigated as part of this proposal. Some efforts are strongly related to this project as they attempt to provide fully QoS-aware networks with resource partitioning. Section 2.4 discusses the techniques used by these networks, and explains how their approach differs from that taken by this project. 2.1 Router Technology The first stage in creating a router which provides QoS guarantees, is to understand how a standard router operates. There are a number of papers which describe the internal workings of modern IP routers (see [22] or [29] for details). A router’s primary function is to forward packets between separate networks. Routers, or collections of routers form the core of an internetwork. Each router must understand the protocols of its connected networks and translate packet formats between these protocols. Packets are forwarded according to the destination address in the network layer (e.g. the IP destination). Each router determines where packets should be forwarded, by comparing this destination address with en- tries in a routing table. However, new services are calling upon routers to do much more than simply forward packets. For example, multicast, voice, video and QoS functions call upon the router to analyse additional information within the packets to determine how they should be handled. A router consists of three main components: network line cards, which physically connect the router to a variety of different networks; a routing processor, which runs routing protocols and builds the routing tables; and a backplane, which connects the line cards and the routing processor together. 3
  • 9. The routing process can be seen more clearly by describing the stages involved in the processing of a packet by a typical IP router. When a packet arrives from a network, the incoming network interface card first processes the packet according to the data-link layer protocol. The data-link layer header is stripped off; the IP header is examined to determine if it is valid; and the destination address is identified. The router then performs a longest match lookup (explained in the following paragraph) in the routing table for this destination address. Once a match has been found, the packet’s time to live field is decremented, a new header checksum is calculated, and the packet is queued for the appropriate network interface. This interface card then encapsulates the IP packet in an appropriate data-link-level frame, and transmits the packet. Each router is unable to store a complete routing table for the whole internetwork. However, IP addresses have a form of geographical hierarchy, where nodes on the same network have a similar class of IP address. This allows entries in the routing table to aggregate entries for similar addresses that exit the router on the same interface card. In other words, routers know in detail the addresses of nearby networks, but only know roughly where to send packets for distant networks. The routing table stores address prefix / mask pairs. The address prefix is used to match the entry to the destination address, and the mask is used to specify which digits of the address prefix are important. A longest match lookup tries to find the entry that has the most bits in common with that of the destination address. For example, if the address is matched against,, and in the routing table, the entry will be chosen. The hardware architecture used in routing systems has evolved over time. Initially, the architecture was based upon that of a typical workstation. Network line cards were connected together with a CPU using a shared bus (e.g. ISA or PCI); and all routing decisions and processing was done by the central CPU. Figure 2.1 shows the architecture of this type of router. The problem with this approach was the single shared bus. Every packet processed has to traverse this bus twice, once from the incoming network line card to the CPU and once from the CPU to the outgoing line card. Since this is a shared bus, only one packet can traverse it at any time. This restricts the overall number of packets which can be processed by a single router to the bandwidth of the shared bus, which is often lower than that of the sum of the network cards’ bandwidths. CPU Example Packet Route Shared Bus Line Line Line Card Card Card Figure 2.1: The architecture of a first generation, “Shared Bus” router. The use of switching fabrics, instead of shared busses, as a backplane increased the amount of network traffic a router could cope with, by allowing packets to traverse the backplane in parallel (Figure 2.2). The bandwidth of the backplane is no longer shared among the network interface cards. A switching fabric can be implemented in a number of ways; for example, a crossbar switch architecture could be 4
  • 10. used, where each line card has a point to point connection to every other line card. Another option is to use shared memory, where each line card can transfer information by writing and reading from the same area of memory. CPU Switch Fabric Line Line Card Card (Input) (Output) Example Packet Route Line Line Card Card (Output) (Input) Figure 2.2: The architecture of a router which uses a switching fabric as its backplane. The next component which comes under stress from increased network traffic is the shared CPU. A single CPU simply cannot cope with the amount of processing which is needed to support multi-gigabit, and therefore multi-mega-packet flows. An approach used by many modern routers is to move as much of the packet processing to the line cards as possible. A forwarding engine is placed on each of the network line cards (or in the case of [29], near the NICs). A portion of the network routing table is cached in each forwarding engine, allowing the forwarding engines to forward packets independently. A central processor is still used to control the overall routing, build and distribute the routing tables, and handle packets with unusual options or addresses outside the forwarding engine’s cached routing table. A number of other improvements have been made to router technology, including improved lookup al- gorithms, advanced buffering techniques and sophisticated scheduling algorithms. The major develop- ments in router hardware, however, have been a movement towards expensive custom hardware, working in parallel, to provide the performance required for multi-gigabit networks. Software routers run on commodity hardware, and allow a normal PC to perform network routing. Software routers do not have the performance or packet throughput of custom built hardware routers, but they are considerably cheaper and provide more flexibility than hardware routers. They are also an ideal tool for router research, as new protocols, algorithms or scheduling policies can be implemented by simply changing the code, rather than building or modifying hardware components. This project will make use of a software router for these reasons. The Zebra platform1 is a popular open source software router. It has initiated a number of spinoff projects, such as Quagga2 and ZebOS3 , a commercial version of Zebra (all collectively referred to as Zebra hereafter). Zebra runs on a number of UNIX-like operating systems (e.g. Linux, FreeBSD, Solaris etc.) and provides similar routing functionality to commercial routers. The Zebra platform is modular in design. It consists of multiple processes, each of which deals with a different routing protocol. For example, a typical Zebra router will consist of a routing daemon for the open shortest path first (OSPF) protocol, and another for the border gateway protocol (BGP). These will work together to create an 1 2 3 advanced.html 5
  • 11. overall routing table, under the control of a master Zebra daemon. This architecture allows protocols to be added at any time without affecting the whole system. There have been a number of studies which have compared the features and performance of Zebra to standard commercial hardware routers. Fatoohi et al [14] compared the functionality and usability of the Zebra router with Cisco routers. They find that Zebra provides much of the features found in Cisco routers with a similar user interface. They do not compare the relative performance of the two router types, however. Ramanath [32] investigates Zebra’s implementation of the BGP and OSPF protocols, but concentrates on the experimental set-up, with little post experimental analysis. One problem with Zebra is that the open source implementations do not provide MPLS (Multiprotocol Label Switching) support. MPLS routing support is a requirement of this project, so another software router is required. NIST Switch [7] is a research routing platform, specifically created for experimenta- tion in novel approaches to QoS guaranteed routing. It is based on two fundamental QoS concepts, local packet labels which define the characteristics of packets queueing, and traffic shaping measures applied to overall packet routes. NIST Switch supports MPLS, and uses these MPLS labels to define the QoS characteristics and route packets. It also uses RSVP, with Traffic Engineering (TE) extensions, to signal QoS requests and distribute labels. This use of RSVP was also an important requirement in this project as RSVP will be used to request router resource requests. MPLS and RSVP are explained in more detail in Section 2.2, and their use in this project is described in Chapter 3 and Chapter 6. NIST Switch only runs on the FreeBSD operating system. Lai et al [20] compared Linux, FreeBSD and Solaris, running on the x86 architecture. They found that FreeBSD has the highest network performance, making it an ideal operating system on which to base a router. However, the difference in network performance, cited by Lai et al, was not significant enough to make a major difference to the performance of the router created in this project. Linux has also been ported to the para-virtualised, virtual machine monitors required as part of this project, whereas FreeBSD has not. It was, therefore, decided that a Linux software router was required. Van Heuven et al [18] describes a MPLS capable Linux router, created by merging a number of existing projects, including as a base, the NIST Switch router. A number of modular routers were also investigated as a means of providing additional functionality in a modular fashion. The PromethOS router architecture, described in [19] and based upon the work in [12], provides a modular router framework, which can be dynamically extended using router plugins. The PromethOS framework is based upon the concept of selecting the implementation of the processing components within the router, based upon the network flow of the packet currently being processed. When a plugin is loaded, it specifies which packets it will deal with by providing a packet filter. When a packet arrives at a PromethOS router, it traverses the network stack in the usual way. However, as it traverses the network stack, it comes across gates, points where the flow of execution can branch off to plugin instances. The packet is compared against the filters specified by the plugin instances, and then sent to the correct plugin. New plugins can be created, and can be inserted at various points along the network stack, using these gates. This allows a variety of different functions to be added to the router, such as new routing algorithms, different queueing technique or enhancements to existing protocols. However, it still uses the standard IP network stack. This means that it cannot be used to investigate other network protocols. The Click Modular Router [24] takes a different approach. Click is not a router itself; instead, it is a soft- ware framework which allows the creation of routers. A Click router consists of a number of modules, called elements. These elements are fine-grained components, which perform simple operations, such as packet queueing or packet classification. New elements can easily be built by writing a C++ class which corresponds to a certain interface. Elements are plugged together using a simple language to create a certain router configuration. The same elements can be used to create different routers, by using this simple language to rearrange the way in which they are organised. These router configurations are 6
  • 12. compiled down to machine code, which runs within the Linux Kernel to provide a fast, efficient router. The approach taken by the Click router provides a flexible method of creating new routers easily. 2.2 Quality of Service Techniques There are a number of issues which need to be addressed when creating a QoS-aware network. Firstly, the overall routing algorithm needs to take QoS into account, dealing with low quality links, or failures in network nodes. QoS flows also require a mechanism for reservation of routing resources on a network. For a flow to reserve the resources it requires, there needs to be some universal description of the flow’s requirements - a flow specification. As network resources are finite, some form of admission control is required to prevent network resources from becoming overloaded. Finally, the packet scheduling algorithms, used by the network routers, need to be aware of the QoS requirements of each flow, so that bounds on latency and throughput can be guaranteed. Multiprotocol label switching (MPLS), described in [9] and [39], is a network routing protocol which uses labels to forward packets. This approach attempts to provide the performance and capabilities of layer 2 switching, while providing, the scalability and heterogeneity of layer 3 routing. It is therefore, often referred to as layer 2.5 routing. These labels also allow an MPLS network to take QoS issues into account while routing packets by, for example, engineering traffic. In conventional (layer 3) routing, a packet’s header is analysed by each router. This information is used to find an element in a routing table, which determines the packet’s next hop. In MPLS routing, this analysis is performed only once, at an ingress router. A Label Edge Router (LER), placed at the edge of the MPLS domain, analyses incoming packets, and maps these packets to an unstructured, fixed length, label. Many different headers can map to the same label, as long as these packets all leave the MPLS domain at the same point. Labels, therefore, define forwarding equivalence classes (FECs). An FEC defines, not only where a packet should be routed, but also how it should be routed across the MPLS domain. This labelling process should not interfere with the layer 3 header, or be affected by the layer 2 technol- ogy used. The label is therefore added as a shim between the layer 2 and the layer 3 headers. This shim is added by the ingress LER as a packet enters the MPLS domain, and removed by the egress LER as it leaves the MPLS domain. Once a packet has entered the MPLS domain, and has been labelled by the LER, it traverses the do- main along a label-switched path (LSP). An LSP is a series of labels along the path from the source to the destination. Packets traverse these LSPs by passing through label switching routers (LSRs). An LSR analyses an incoming packet’s label, decides upon the next hop which that packet should take and changes the label to the one used by the next hop for that packet’s FEC. The LSR can process high packet volumes, because a label is a fixed length, unstructured value, so a label lookup is fast and simple. Each LSR (and LER) assigns a different label to the same FEC. This means that globally unique labels are not required for each FEC, as long as they are locally unique. Each LSR needs to know of the labels that neighbouring LSRs use for each FEC. There are a number of ways in which MPLS can distribute this information. The Label Distribution Protocol was designed specifically for this purpose, however, it is not commonly used in real networks. Existing routing protocols, such as the Open Shortest Path First (OSPF) or Border Gateway Protocol (BGP), have been enhanced so that label information can be “piggybacked” on the existing routing information. This means that traffic going from a source to a destination will always follow the same route. Another option is to use RSVP with traffic engineering extensions (RSVP-TE) to explicitly engineer routes through the MPLS domain. This allows different 7
  • 13. routes to be taken by traffic, travelling to the same destination, depending on its QoS class. For exam- ple, a low latency route could be chosen for voice traffic, while best effort traffic, going to the same destination, travels through a more reliable, but slower route. The Resource ReSerVation Protocol (RSVP) [45] is a protocol which can be used to reserve network resources needed by a packet flow. It reserves resources along simplex, i.e. only single direction, routes. One novel feature of this protocol is that it is receiver-oriented, i.e. the receiver, rather than the sender, initiates the resource reservation, and decides on the resources required. This allows heterogenous data- flows to be accommodated; for example, lower video quality for modem users, than for broadband users. An RSVP session is initiated by a PATH message, sent from the sender to the receiver. This PATH message traverses the network, setting up a route between the sender and receiver in a hop-by-hop manner. The PATH message contains information about the type of traffic the sender expects to generate. Routers on the path can modify this data to specify their capabilities. When the receiver application receives the PATH message, it examines this data and generates a message, RESV, which specifies the resources required by the data flow. This RESV message traverses the network in the reverse of the route taken by the PATH message, informing routers along this path of the specified data flow. Each router uses an admission control protocol to decide whether it has sufficient resources to service the flow. If a router cannot service a flow, it will deny the RSVP flow, and inform the sender. The RSVP protocol is independent of the admission protocol used, so routers can make any decision they wish to decide on the paths it can support. RSVP is also independent of the specification used to describe the flow (the flowspec). As long as the sender, receiver and routers agree on the format, any information can be transmitted in the flowspec to inform routers of the resources required by a flow. RSVP is unaffected by the routing protocol used. PATH messages can be explicitly routed, as in RSVP- TE, and the RESV message follows this route in reverse, thus reserving resources along the explicitly routed path. RSVP can support dynamic changes in path routes, as well as dynamic changes in the data-flow char- acteristics of paths. It does this by maintaining soft, rather than hard, state of a reserved path in the network. Data sources periodically send PATH messages to maintain or update the path state in a net- work. Meanwhile, data receivers periodically send RESV messages to maintain or update the reservation state in the network. If a network router has not received PATH or RESV messages from an RSVP flow for some time, the state of that flow is removed from the router. This makes RSVP robust to network failures. RSVP was created specifically to support the Quality of Service model called Integrated Services ([4] and [43]). The Integrated Services approach attempts to provide real time QoS guarantees for every flow which requires them. This requires every router to store the state of resources requested by every flow, which could involve a huge amount of data. RSVP addresses this somewhat by having a number of multicast reservation types (e.g. no-filter, fixed-filter and dynamic filter) which allow individual switches to decide how resource reservation requests can be aggregated. However, the state required for an Integrated Service router, especially in the core of the network, is still formidable. The differentiated services, or Diffserv model ([3] and [46]) takes a different approach. Each packet carries with it information on how it should be treated. For example, a Type of Service (TOS) byte, carried in the packet header, could specify whether this packet requires low-delay, high-throughput or reliable transport. The problem with this approach is that only simple requirements can be specified (e.g. low latency, rather than latency less than 10ms), so that each packet does not incur excessive overheads carrying the QoS requirements. There is also no way to perform admission control, so it is impossible to guarantee the QoS for individual flows. All that can be guaranteed is that certain classes of traffic will be treated better than other classes. MPLS is in the middle ground between the two extremes of Integrated Services and Diffserv. It provides 8
  • 14. the capability to set up flows of traffic with certain requirements, much like Integrated Services, however Forwarding Equivalency Classes can be used to aggregate common flows together. Also, the use of unstructured labels greatly simplifies the filtering of traffic into flows, compared with the filtering on source and destination address (as well as other possible IP header fields) used by Integrated Service. This is a simplified view of Internet QoS techniques, with specific emphasis on those aspects of the technologies which will be used in this project. A more detailed survey is provided by Xiao et al [44]. 2.3 Virtualisation Techniques This project will use virtualisation techniques to partition resources between virtual routelets. Virtu- alisation was developed by IBM in the mid 1960’s, specifically for mainframe computers. Currently, there are a number of virtualisation systems available, each of which addresses different issues. Hard- ware constraints require that the router created by this project must run on standard, commodity x86 hardware. This proposal, therefore, concentrates on x86 virtualisation techniques, as opposed to more general virtualisation techniques, since the x86 architecture is an inherently difficult platform to virtu- alise, compared with other processor architectures. Virtualisation software allows a single machine to run multiple independent guest operating systems concurrently, in separate virtual machines. These virtual machines provide the illusion of an isolated physical machine for each of the guest operating systems. A Virtual Machine Monitor (VMM) takes complete control of the physical machine’s hardware, and controls the virtual machines’ access to it. VMware [38] and other full-virtualisation systems attempt to fully emulate the appearance of the under- lying hardware architecture in each of the virtual machines. This allows unmodified operating systems to run within a virtual machine, but presents some problems. For example, operating systems typically expect unfettered, privileged access to the underlying hardware, as it is usually their job to control this hardware. However, guest OSs within a virtualised system cannot have privileged mode access, other- wise they could interfere with other guest OS instances by, for example, modifying their virtual memory page tables. The VMM, on the other hand, controls the underlying physical hardware in a virtualised machine, and therefore has privileged mode CPU access. The guest OSs must be given the appearance of having access to privileged CPU instructions, without being able to harm other OSs. Fully virtualised systems, such as VMware, typically approach this problem by allowing guest OSs to call these privileged instructions, but by having them handled by the VMM when they are called. The problem is that the x86 fails silently when a privileged instruction is called within a non-privileged envi- ronment, rather than providing a trap to a handling routine. This means that VMware has to dynamically rewrite portions of the guest OS kernel at runtime, inserting calls to the VMM where privileged instruc- tions are needed. VMware also keeps shadow copies of each guest OS’s virtual page tables, to ensure that guest OS instances do not overwrite each other’s memory. However, this requires a trap to the VMM for every update of the page table. These requirements mean that an operating system, running within a fully virtualised system, will have severely reduced performance, compared to the same operating system running on the same physical hardware without the virtualisation layer. A number of projects have attempted to reduce this performance overhead, at the cost of not providing full physical architecture emulation. Para-virtualisation is an approach where the virtual machines do not attempt to fully emulate the underlying hardware architecture of the machine being virtualised. Instead, an idealised interface is presented to the guest operating systems, which is similar, but not identical to that of the underlying architecture. This interface does not include the privileged instructions which cause problems in fully virtualised systems, replacing them with calls which transfer control to the VMM (sometimes known as Hypercalls). This change in interface means that existing operating 9
  • 15. systems need to be ported to this new interface before they will run as guest operating systems. However, the similarity between this interface and the underlying architecture, means that the application binary interface (ABI) remains unchanged, so application programs can run unmodified on the guest OSs. The idealised interface substantially reduces the performance overhead of virtualisation, and also provides the guest OSs with more information. For example, information on both real and virtual time allows better handling of time sensitive operations, such as TCP timeouts. The Denali Isolation Kernel ([41] and [42]) uses these para-virtualisation techniques to provide an ar- chitecture which allows a large number of untrusted Internet services to be run on a single machine. The architecture provides isolation kernels for each untrusted Internet service. These isolation ker- nels are small guest operating systems which run in separate virtual machines, providing partitioning of resources and achieving safe isolation between untrusted application code. The use of simple guest operating systems, and the simplified virtual machine interface, are designed to support Denali’s aim of scaling to more than 10,000 virtual machines in one, commodity hardware, machine. The problem with using Denali, however, is that the interface has been oversimplified to support the scaling needed to run tens of thousands of virtual machines. There have been a number of omissions made in the application binary interface, compared with the x86 architecture. For example, memory segmentation is not supported. These features are expected by standard x86 operating systems and applications. It would therefore be very difficult to port standard operating systems (such as Linux or FreeBSD) to Denali, and even if they were ported, many applications would also have to be rewritten to run on Denali. The Denali project has therefore provided a simple guest OS called Ilwaco, which runs under a Denali virtual machine. Ilwaco is simply implemented as a library, much like an Exokernel libOS, with applications linking against the OS. There is no hardware protection boundary in Ilwaco, so essentially each virtual machine hosts a single-user, single-application, unprotected operating system. Since Ilwaco was created specifically for Denali, and is so simple, there is no routing software available for it. Therefore, if Denali is used as the visualisation system in this project, a software router would need to be written from scratch. Xen [2] is an x86 virtual machine monitor that uses para-virtualisation techniques to reduce the virtu- alisation overhead. Although it uses an idealised interface like Denali, this interface was specifically created to allow straightforward porting of standard operating systems, and provide an unmodified ap- plication binary interface. A number of common operating systems, such as Linux and NetBSD, have been ported to the Xen platform, and because of the unmodified ABI, unmodified applications run on these ports. The Xen platform consists of a Xen VMM (virtual machine monitor) layer above the physical hardware. This layer provides virtual hardware interfaces to a number of Domains. These domains are effectively virtual machines running the ported guest OSs, however the guest OSs, and their device drivers, are aware that they are running on Xen. The Xen VMM is designed to be as simple as possible, so although it must be involved in data-path aspects (such as CPU scheduling between domains, data block access control etc.), it does not need to be aware of higher level issues, such as how the CPU should be sched- uled. For this reason, the policy (as opposed to the mechanism) is separated from the VMM, and run in a special domain, called Domain 0. Domain 0 is given special privileges. For example, it has the ability to start and stop other domains, and is responsible for building the memory structure and initial register state of a new domain. This significantly reduces the complexity of the VMM, and prevents the need for additional bootstrapping code in ported guest OSs. There are two ways in which the Xen VMM and the overlying domains can communicate - synchronous calls from a domain to Xen can be made using hypercalls, and asynchronous notifications can be sent from Xen to an individual domain using events. Hypercalls made from a domain perform a synchronous trap to the Xen VMM, and are used to perform privileged operations. For example, a guest OS could 10
  • 16. perform a hypercall to request a set of page table updates. The Xen VMM validates these updates, to make sure the guest OS is allowed to access the memory requested, and then performs the requested update. These hypercalls are analogous to system calls in conventional operating systems. So that each hypercall does not involve a change in address space, with the associated TLB flush and page misses this would entail, the first 64MB of each domain’s virtual memory is mapped on to the Xen VMM code. Asynchronous events allow Xen to communicate with individual domains. This is a lightweight noti- fication system which replaces the traditional device interrupt mechanism, used by traditional OSs to receive notifications from hardware. Device interrupts are caught by the Xen VMM, which then per- forms the minimum amount of work necessary to buffer any data and determine the specific domain that should be informed. This domain is then informed using an asynchronous event. This specific domain is not scheduled immediately, but is informed of the event when it is next scheduled by the VMM. At this time, the device driver on the guest OS performs any necessary processing. This approach limits the crosstalk between domains, i.e. a domain will not have its scheduled time eaten into by another domain servicing device interrupts. Xen provides a “Safe Hardware Interface” for virtual machine device I/O. This interface is discussed in detail by Fraser et al [15]. Domains can communicate with devices in two ways. They can either use a legacy interface, or an idealised, unified device interface. The legacy interface allows domains to use legacy devices drivers. Domains, however, cannot share devices to enforce isolation between domains. If a legacy device driver crashes, it will affect the domain it is running in, but not any other domains. The unified device interface provides an idealised hardware interface for each class of devices. This is intended to reduce the cost of porting device drivers to this safe hardware interface. The benefit of this safe hardware interface is that it allows sharing of devices between domains. To enforce isolation between domains, even when they share a device, device drivers can be run within Isolated Driver Domains (IDDs). These IDDs are effectively isolated virtual machines, loaded with the appropriate device driver. If a device driver crashes, the IDD will be affected, but each domain which shares that driver will not crash. In fact, Xen can detect driver failures, and restart the affected IDD, providing minimal disturbance to domains using that device. Guest operating systems communicate with drivers, running within IDDs, using device channels. These channels use shared memory descriptor rings to transfer data between the IDD and domains (see Fig- ure 2.3). Two pairs of producer, consumer indices are placed around the ring, one for data transfer between the domain and the IDD, and the other for data transfer between the IDD and the domain. The use of shared memory descriptor rings avoids the need for an extra data copy between the IDD and domain address space. Although the use of these descriptor rings is a suitable approach for low latency devices, it does not scale to high bandwidth devices. DMA capable devices can transfer data directly into memory, but if a device is shared, and the demultiplexing between domains is not performed in hardware, then there is no way of knowing into which domain’s memory the data should be transferred. In this case, the data is transferred to some memory controlled by Xen. Once I/O has been demultiplexed, Xen knows to which domain this data should be transferred, but the data is in an address space which is not controlled by that domain. The data is mapped into the correct domain’s address space, using a page previously granted by this domain (i.e. this domain’s page table is updated, so that the domain’s granted virtual page now points to the physical memory which contains the data to be transferred). This page table update avoids any additional data copying due to the virtualisation layer. The use of these techniques means that Xen has a very low overhead compared with other virtualisation technologies. The performance of Xen was investigated in [2], which, in a number of benchmarks, found that the average virtualisation overhead of Xen was less than 5%, compared with the standard 11
  • 17. Figure 2.3: This diagram describes the I/O ring used to transfer data between Xen and guest OSs Linux operating system. This is contrasts with overheads of up to 90% in other virtualisation systems such as VMware and User-Mode Linux. These results have been repeated by a team of research students in [11]. 2.4 Similar Work There have been a number of efforts which have attempted to produce QoS-aware networks with some form of resource partitioning. Many of these projects use programmable networks. Programmable networks [5] can be used to rapidly create, manage and deploy network services, based on the demands of users. There are many different levels of network programmability, directed by two main trends - the Open Signalling approach and the Active Networks approach. The Open Signalling approach argues for network switches and routers to be opened up with a set of open programming interfaces. This approach is similar to that of telecommunication services, with services being set up, used, then pulled down. The Active Network community supports a more dynamic deployment of services. So called, “active packets” are transported by active networks, which at one extreme could contain code to be executed by switches and routers as the packet traverses the network. A number of research projects have created programmable networks, in an attempt to support differing objectives. Many of these programmable networks have been created in an attempt to ease network management, however, some have had the objective of creating Quality of Service aware networks. The Darwin project [8] is an attempt to create a set of customisable resource management mechanisms that can support value added services, such as video and voice data streams, which have specific QoS requirements. This project is focused on creating a network resource hierarchy, which can be used to provide various levels of network service to different data streams, depending on their requirements. The architecture contains two main components - Xena and Beagle. Xena is a service broker, which allows components to discover resources, and identify resources needed to meet an application’s requirements. Beagle is a signalling protocol, used to allocate resources by contacting the owners of those resources. The Darwin architecture has been implemented for routers running FreeBSD and NetBSD. However, this project’s focus is on providing a middleware environment for value added network services, not on the underlying mechanisms needed to guarantee quality of service flows within a network. As such, the 12
  • 18. prototype implementation of the Darwin router simplifies the aspect of resource partitioning between QoS flows. Routing resources are managed by delegates, which run within a Java Virtual Machine, with resource partitioning managed using static priorities. This does not provide the required level of performance, or real time guarantees, required for a truly QoS aware router. Members of the DARPA active networking program have developed a router operating system called NodeOS. This provides a node kernel interface for all routers within an active network. The NodeOS interface defines a domain as an abstraction which supports accounting and scheduling of the resources needed for a particular flow. Each domain contains the following set of resources - an input channel, an output channel, a memory pool and a thread pool. When traffic from a certain flow enters the router, it consumes allocated network bandwidth from its domain’s input and output channels. CPU cycles and memory usage are also charged to the domain’s thread and memory pools, as the packet is processed by the router. This framework allows resources to be allocated to QoS flows as required; however, the NodeOS platform is simply a kernel interface, and does not provide details on how the resources used by domains should be partitioned. The Genesis Kernel [6] is another network node operating system which supports active networks. This kernel supports spawning networks, which automate creation, deployment and management of network architectures, “on-the-fly”. The Genesis Kernel supports the automatic spawning of routlets. Virtual overlay networks can be set up over the top of the physical network. Routlets are spawned to process the traffic from these virtual networks. Further child virtual networks can be spawned above these virtual networks. These child networks will automatically inherit the architecture of their parent networks, thus creating a hierarchy of function from parent to child networks. This architecture is created with the intention of automating the support of network management. It supports a virtual network life cycle, through the stages of network profiling to capture the “blueprint” of a virtual network architecture, the spawning of that virtual network, and finally the management to support the network. It does not, however, deal with resource management, and so does not provide service guarantees to QoS flows. The Sago project [10] has a similar aim to this project. The aim is to virtualise the resources of a net- work, to provide QoS guarantees for individual flows. The Sago platform uses virtual overlay networks to provide protection between separate networks which use the same physical hardware. It consists of two main components - a global resource manager (GRM), and local packet processing engine (LPPE). The GRM oversees the allocation of nodes and links on the underlying network. A LPPE is placed at each network node, and performs the actual resource management. The GRM uses a separate adminis- trative network to signal the creation of virtual overlay networks. These virtual overlay networks have bandwidth and delay characteristics, which can be used to provide end to end QoS guarantees. The Sago platform, however, requires significant extra complexity compared with standard networks. For example, two physical networks are needed, one for data transmission, and an entirely separate net- work to deal with control signals. This requires significant extra complexity in the network, increasing the overall cost and the possibility of failures. Also, although Sago attempts to provide end-to-end QoS guarantees, it does not deal specifically with the partitioning of resources within a network router. Although there are many efforts which have attempted to provide QoS guarantees to network flows, very few have specifically dealt with the partitioning of resources among QoS flows. Of the efforts which did deal with router resources, none could provide strict guarantees on the router resources available to network flows. 13
  • 19. Chapter 3 Approach This chapter describes the overall approach taken during this research project. Over the course of this project, the original approach changed as problems were encountered and details were further investi- gated. Any major changes from the original approach, made during the course of this project, are also described here. The goal of this project is to produce an experimental router which demonstrates the feasibility of us- ing virtual machine techniques to partition router resources between independent flows, thus providing guarantees on each flow’s quality of service. It is not necessary that this router exhibit commercial speed or robustness, however, it should be sufficiently usable, such that the partitioning between flows and the service provided to those flows can be investigated. To this end, my original approach involved using the work of open source projects, modified so that they meet the needs of this project. Due to hardware and time constraints on a research project such as this, the router was created as a software router for commodity x86 hardware. The overall architecture of the router (known hereafter as a QuaSAR - Quality of Service Aware Router) consists of multiple guest OSs running on top of a single virtual machine manager. Each of these guest OSs runs routing software to route incoming packets. There is one main guest OS router, whose job is to route all of the best effort, non-QoS traffic. The rest of the guest OSs are known as routelets, and are available for the routing of QoS traffic flows. When a new QoS flow is initiated, the flow’s requirements are sent along the proposed route of the flow. When a QuaSAR router receives details of a new flow’s requirements, it first decides whether it has resources available to service the new flow. If there are insufficient resources available, the flow can be routed as best effort traffic, or the flow can be rejected. If, however, there are enough resources available to support the flow’s requested QoS, then the flow is associated with a particular routelet. This routelet is then allotted sufficient resources (e.g. CPU time, memory, network bandwidth, etc.) to service that flow. When a packet arrives at a QuaSAR router, it is demultiplexed and sent to the routelet which is dealing with that packet’s flow. If the packet is not part of any specified flow, it is sent to the main best-effort router. Figure 3.1 gives an overview of the architecture of the QuaSAR system. Since each routelet is running in a separate operating system (to partition its resource usage), QuaSAR uses virtualisation software to allow multiple operating system instances to be run on the same ma- chine at the same time. This virtualisation software needs to have as little overhead as possible, so that QuaSAR can cope with high-volume packet traffic demands. To this end, this project uses a para- virtualisation system, rather than a fully emulated virtualisation system, because performance and scal- ability are more important in the QuaSAR router than full emulation of the original instruction set, especially where the hardware being used (x86) is inherently virtualisation-unfriendly. Both Xen and Denali provide para-virtualisation in an attempt to increase performance and scalability, however Denali 14
  • 20. QuaSAR Packet Demultiplexer Main Best Effort Router Input NIC Channels QoS Routelets Output NIC Channels Figure 3.1: An overview of the QuaSAR architecture. does not have any well-used operating systems ported to its virtual instruction set, whereas Xen has ports for both Linux and NetBSD. This project, therefore, uses the Xen virtualisation software because: it is open source and can be modified for the project’s requirements; it has good performance and scalability; and the availability of Linux or NetBSD guest OSs allow the project to make use of pre-built routing software. Xen loads guest OSs in separate domains. The guest OS in domain 0 has special privileges, such as being able to start new guest OS domains and configure the resources which they can use. The main best-effort router is therefore started in domain 0, so that this main router can start, stop and control the QoS routelet instances as new QoS flows arrive, or their requirements change. RSVP messages are used by hosts to signal the creation of new QoS flows. When the QuaSAR router receives an RSVP RESV message, specifying the creation of a new QoS flow, it starts a new routelet, or modifies an existing routelet, so that packets from the flow are routed by that routelet. RSVP messages can contain a flowspec, which details the requirements of the new flow being created (e.g. bandwidth, latency etc.). The original approach called for the QuaSAR router to process RSVP flowspecs and automatically allot enough resources to the routelet assigned to that flow, to meet the flow’s QoS re- quirements. However, the mapping between a flow’s QoS requirements and the resources needed by a routelet to meet those requirements were not known before experiments were performed on the QuaSAR router. Therefore, a routelet’s resources are assigned manually in this project, with provisions made for the automatic allotting of resources when this mapping is discovered. When packets arrive at the router, they need to be demultiplexed to the appropriate routelet. Xen con- tains a mechanism for sharing network cards, and demultiplexing packets to the appropriate guest OS according to a set of rules. The original design of the QuaSAR router called for this mechanism to be used to demultiplex packets to the appropriate routelet, however, it was discovered that this mechanism was not appropriate for the QuaSAR router (see Section 6.3 for details). Therefore, an architecture was created, using the Click Modular Router, to demultiplex packets as they arrived in the router, and send them to the routelet processing their flow, or to the best effort router if they are not part of a QoS flow. 15
  • 21. QuaSAR routes MPLS (Multi-Protocol Label Switching) traffic, since this traffic is already routed as a flow according to its forwarding equivalency class label. The main best-effort router needs to route any traffic which is sent to it. It, therefore, needs to create, maintain and use, label and routing tables. It also needs to be able to support MPLS label distribution protocols (e.g. LDP or RSVP-TE), routing distribution protocols (e.g. OSPF or BGP), as well as the RSVP protocol used to create new QoS flows. The main router, therefore, uses an open source implementation of an MPLS Linux router from IBCN (intec broadband communication networks) research group at Ghent University. This was modified so that it could create and configure QoS routelets as required, but provided a solid basis from which to build the router, without writing a completely new router. The QoS routelets do not have complex routing requirements. They are only routing packets from a single flow. The work simply involves obtaining a new packet from the flow’s input NIC, processing the packet (e.g. substituting one label for another), queuing the packet, and sending the packet to the flow’s output NIC. Since only one flow at a time is ever routed by these routelets, they do not need to maintain label or routing tables. There is also no need for them to understand, or initiate routing or label distribution protocols. In fact, if routelets initiated distribution protocols, they could interfere with the main, best-effort router. Instead, the main router keeps track of table entries which affect routelets that have been created. If one of these entries changes, then the main router can inform the routelet which will be affected. Due to the simple requirements of the QoS routelets, they use the Click Modular Router to route packets. The simple work-flow needed by the routelets can be provided by joining a number of Click elements together. New elements can be written in C++ to any extra functionality, such as MPLS label substi- tution, required. The main router can communicate with the routelets by writing to special files which are created by the Click architecture. When these files are written to, functions within the appropriate element are invoked, thus allowing routelets to be dynamically reconfigured. Figure 3.2 gives a more detailed overview of this architecture. Once the prototype QuaSAR system was built, an important part of this project involved performing experiments with this prototype, to evaluate the effectiveness of using virtual routelets to service QoS flows. These experiments should ensure that the virtualisation process does not massively reduce the performance of the router compared with standard routers, and discover if this technique improves the performance and partitioning between QoS flows. To this end, a testbed network was set up, with a number of computers acting as communication nodes and one machine acting as a QuaSAR router. The QuaSAR machine contains multiple network line cards, each of which was connected to a separate test network. Synthesised test network traffic was sent across the QuaSAR router between these networks, to test various aspects of QuaSAR’s performance. The machine running the QuaSAR software also had a standard Linux software router installed, so that QuaSAR’s performance could be compared with that of a standard router. Initially, best effort traffic was sent across both the QuaSAR router and the standard router to discover the additional overhead incurred by running virtualisation software below the router. The next stage involved setting up a Quality of Service flow, and comparing the performance of traffic sent through this flow to that of best effort traffic. This evaluates whether demultiplexing packets between routelets incurs significant overhead, or if the simplified forwarding engine in the QoS routelets actually increases per- formance. Badly-behaved flows, which use more network bandwidth than they have reserved, were then introduced to discover their effect on well-behaved flows. This evaluates the partitioning that QuaSAR provides between network flows and evaluates QuaSAR’s immunity to denial of service attacks. These experiments should evaluate whether ensuring QoS by partitioning a router’s resources, using virtualisation, is effective. It should also identify the areas in which this concept could be improved with future work. 16
  • 22. Idle Routelet Pool Linux in Domain # Linux in Domain 0 Click Xen IBCN MPLS under Linux Router VMM Packet Demultiplexer Main Best Effort Router Linux in Domain # Click modular Router Input NIC Channels QoS Routelets Output NIC Channels Figure 3.2: A detailed overview of the QuaSAR architecture. 17
  • 23. Chapter 4 Integration of Open Source Projects When beginning this project, a decision was made to build the prototype router from a number of open source projects. The use of this open source software meant that the creation of a prototype virtualised router was feasible in the time provided by this project, however, using so many open source projects together brought a number of problems when trying to integrate them. This section describes where the open source projects were used within the QuaSAR router, as well as the changes necessary to integrate these external projects. 4.1 Overview of Open Source Projects The following open source projects were used within the QuaSAR router: Xen Virtual Machine Monitor Xen was chosen as the virtualisation technology for the QuaSAR router because of its relatively minor performance overhead compared with other virtualisation platforms. However, since it uses para-virtualisation to achieve this performance increase, it requires the guest operating systems running on top of it to be modified to support the architecture presented by Xen. The changes required to run Linux as a guest OS within a Xen virtual machine (or Domain) are provided as a patch to the source code of a standard Linux build. The most current version of Xen available at the start of this project was version 2.0.1, which provided a patch for version 2.6.9 of the Linux kernel. Therefore, without substantial rewriting of this patch, Xen 2.0.1 supports only the 2.6.9 version of the Linux kernel. Requirements: Linux Kernel 2.6.9 Click Modular Router The Click modular router provides a framework for the creation of flexible routers by the con- nection of a number of simple packet processing components (or elements), in such a way as to provide the overall required routing functionality. Click was used within the QuaSAR router to implement the routelet’s packet processing code, as this was more easily implemented using Click’s supporting framework, than if it had been written from scratch. Click also provided the ability to quickly create prototype networking code for other purposes through the creation of new elements, or the building of new click configurations, which proved invaluable throughout the project. 18
  • 24. Click has the ability to run either as a user-space process, or as a kernel module. For performance reasons QuaSAR is required to fully run within the Linux Kernel, so that each packet does not incur a switch from user to system operation mode. The QuaSAR router therefore needed to run Click as a Kernel module. The Linux kernel must be modified before it can support Click as a module. This is because Click needs access to packets at various points along the Linux network stack, and the kernel requires basic runtime support for C++, which the Click elements are written in. A Linux kernel patch is supplied with Click to make the necessary changes, however, the latest Linux kernel supported by the click patches was 2.4.22. Requirements: Linux Kernel 2.4.22 MPLS Linux QuaSAR’s main, best effort router needs the ability to set up label switched paths and route any MPLS traffic which it receives. The MPLS Linux open source project provides MPLS support for Linux by modifying the Linux kernel with a source code patch. The most recent version of MPLS Linux, available at the start of this project (1.935), provides a patch for the 2.6.6 version of the Linux kernel. The Linux MPLS project also provides modifications for iproute2 and iptables so that these tools can be used to modify and manage MPLS routing. Requirements: Linux Kernel 2.6.6 iproute2 2.4.7 iptables 1.2.9 IBCN MPLS RSVP-TE daemon The QuaSAR router uses RSVP-TE to set up MPLS label switched paths (LSPs), distribute labels and specify a flow specification (flowspec). QuaSAR’s main, best effort router, therefore, uses an MPLS enabled RSVP-TE daemon, from the IBCN research group at Ghent University, to provide these capabilities. This daemon makes use of a number of components, including an MPLS enabled Linux Kernel, and MPLS enabled iproute2 and iptables tools (MPLS support provided by MPLS Linux). However, there has been no support for this daemon for some time, so the versions of MPLS Linux which it uses is now out of date. Requirements: Linux Kernel 2.4.13 Linux MPLS 1.72 iproute2 2.4.2 iptables 1.2.4 4.2 Integration As can be seen from the previous section, each of the open source projects used as part of the QuaSAR router required a different Linux Kernel version. These projects could obviously not be run together within the router, unless they were modified so that they could all run under the same Linux Kernel version. I decided to use the 2.6.9 Linux Kernel for the QuaSAR router, mainly because substantial changes would need to be made to port the Linux Kernel to the Xen virtual machine architecture, there- fore porting a Linux Kernel other that version 2.6.9 would be unfeasible in the time provided for this project. This choice of Kernel also had the the advantage of being the most recent of the kernels used by any of the projects, therefore porting the other projects to this kernel would benefit the community 19
  • 25. by updating these projects to work under a newer operating system, rather than porting them to older operating systems which would be of limited benefit. The MPLS RSVP-TE daemon did not modify the Linux Kernel, and was therefore less dependent on which version of the Kernel used. However, it did rely on MPLS Linux, which does modify the kernel. The RSVP daemon is written to interface with version 1.72 of MPLS Linux, which in turn is written for version 2.4.13 of the Linux Kernel. The most recent version of MPLS Linux at the time of this project (1.935) is written for the 2.6.6 version of the Linux Kernel, however, it provides a very different interface to user-space processes (such as the RSVP daemon) for MPLS management. There were therefore two choices for integrating these projects into the QuaSAR router: port the 1.72 version of MPLS Linux to the 2.6.9 Kernel; or rewrite the RSVP daemon to make use of the new MPLS interface provided by version 1.935 of MPLS Linux. The first of these options would incur a major change to Linux MPLS, and very little change to the RSVP daemon, whereas the second option would incur a minor port of MPLS Linux from the 2.6.6 to the 2.6.9 kernel, but major modifications to the RSVP daemon. I decided upon the second of these options, because the modifications to the RSVP daemon are in user level code, rather than within the kernel as with MPLS Linux, and so they would be easier to debug. The consequences of this choice are described in Section 4.2.2. The rest of this section discusses how the Click modular router was ported to the 2.6.9 Linux Kernel, and how the MPLS RSVP-TE daemon was ported to MPLS Linux version 1.935. 4.2.1 The Click Modular Router The kernel level Click Modular Router consists of two main parts: a patch for the Linux Kernel to allow it to support Click; and a kernel module which, when loaded, provides Click’s functionality. To integrate these components into the QuaSAR router, they needed to be ported from the 2.4.22 Linux Kernel to the 2.6.9 Kernel. The changes necessary to port these two parts are described below. Click Kernel Patch The Click Kernel patch performs four main functions, C++ support, Click Filesystem support, Network Stack Modifications and Network Driver Polling support. The changes to the Linux Kernel between version 2.4 and 2.6 incurred changes in each of these areas. The patch’s functions are described below, along with a brief description of the changes necessary to port the Click patch to the 2.6.9 kernel. C++ support Click is written in C++, whereas the Linux Kernel is written in C, with no runtime support for C++. The patch, therefore, needs to add basic C++ runtime support to the Kernel, for example, mapping the new() function to kmalloc(), adding support for virtual functions, etc. The patch does not, however, support full C++ functionality within the kernel, for example, exceptions are not supported as this would negatively affect the kernel’s performance. The Click module also makes uses of a number of kernel header files. The kernel header files are written to be included in C source files. Since C++ reserves a number of new keywords compared with C (for example, new(), class or the :: symbol), including these Kernel headers within C++ source code could cause syntax errors. The C++ runtime support did not require any modification when porting. However the Kernel header files had changed significantly between 2.4.22 and 2.6.9, therefore the new header files, which are included by Click, needed to be modified, so that they did not cause C++ syntax errors. 20
  • 26. Xen introduces its own architecture specific header files to Linux, so that it can run on Xen’s idealised virtual machine architecture. When integrating Click with Xen, these additional header files also needed to be modified, so that they were C++ safe. Click Filesystem support Click module provides a virtual filesystem which can be used to control and monitor the click router. This filesystem makes use of Linux’s Proc filesystem, however, a small number of changes to the proc filesystem code are required, before the Click Filesystem can be supported. The Linux proc filesystem code has a slightly different layout in the 2.6 kernel, compared with the 2.4 kernel. Therefore, the appropriate locations for the changes, needed for it to support the Click Filesystem, had to be investigated before these changes could be made in the 2.6.9 kernel. Network Stack Modifications A number of Click elements can intercept, or inject packets at various points in the Linux network stack. Click makes a number of changes to the Linux networking code, to give it access to points within the network stack. These changes also require a minor modification to every network device driver within the Linux Kernel. Again, the network stack had a slightly different structure in the 2.6 kernel, with a number of extra features, compared with the 2.4 kernel. It was necessary to have an overall understanding of the Linux network stack, and its interactions, before the changes necessary for Click could be added without interfering with the features added between 2.4.22 and 2.6.9 in the network stack ([37] walks through the Linux networking functions which are called when a packet arrives at, or is sent from, a host). A number of new network device drivers have been added to Linux between the 2.4 and 2.6 kernels. These new device drivers all needed to be modified to support the network stack changes made by Click. Xen also adds a virtual network device driver to the Linux Kernel. When Click was integrated with Xen, this virtual network device driver also needed to be modified to support the changes in the network stack. Network Driver Polling support When Click was originally written, the Linux network drivers did not include polling (as opposed to interrupt driven) support. Polling support can greatly improve a router’s performance under heavy workloads, therefore Click added polling support to Linux by modifying a number of the network drivers. Linux 2.6 includes polling support for network devices, therefore a decision needs to be made as to whether it is better to keep adding Click polling support to the Linux Kernel, or whether Click should be modified to support the new polling support provided by Linux. I did not intend to add polling support into the QuaSAR prototype router, as this would give QuaSAR an advantage, not included in the project’s hypothesis, over the router it was being compared to, and so could unbalance the experimental results. Since I did not intend to use polling support and there were unresolved decisions which needed to be made by the Click community, I decided not to include Click’s polling support in the 2.6.9 patch. Click Kernel Module The Click Kernel Module performs the Click routing functions when it is loaded into the Linux Kernel. It consists of four main components: the compiled Click elements that are linked together to provide the 21
  • 27. routing functionality; a library of functionality used by all elements; router management components; and a Click filesystem used to monitor and configure Click. Porting the Click Kernel Module to the 2.6.9 kernel involved the following main changes: Update System Calls A number of system and kernel calls have changed format between the 2.4 and 2.6 kernel ver- sions. Changes were also made to the way in which modules register their use by other modules. Previously, the macros MOD INC USE COUNT and MOD DEC USE COUNT were used to record a module’s use count. These macros have now been deprecated, and the details are now dealt with internally. The Click Kernel Module was modified to reflect these changes. Update Click Filesystem The interface between Linux and filesystem implementations changed between the 2.4.22 and 2.6.9 versions. A number of functions were added to the Click Filesystem to provide the additional functionality required by the new Linux Kernel. Once the Click Kernel Module could be compiled and loaded into the 2.6.9 kernel, a problem with the Click Filesystem became apparent. The Click Filesystem caused the Linux Kernel to crash randomly after being mounted within the root filesystem. Since this crash occurred within the Linux Kernel, at a seemingly arbitrary point of time, it was extremely difficult to debug. After investigation, I discovered that the GDB debugging environment could be attached to a User-Mode Linux kernel. User-Mode Linux [13] is a build of Linux which runs as a normal userspace process within a standard Linux operating system, allowing the GDB debugging environment to attach to it, as it would with any normal Linux process. GDB provided the ability to trace through the Kernel source code and probe the value of variables as the Kernel was running. The call tracing showed that the crash was occurring within Linux’s memory management code, specifically the slab memory allocator. When the slab memory con- trol structures were probed, they were found to have nonsensical values. From this I deduced that the Click Filesystem was “walking over” unallocated memory (i.e. putting data into memory which had not been assigned to it) or was unallocating (or freeing) memory which had never been allocated. After experimentation, it was found that the Click Filesystem was freeing a memory location which was unused in the 2.6 Linux Kernel. Memory for this pointer had not been allo- cated, thus when free() was called upon it, the memory management control structures became corrupted. When this free was removed, the Click Filesystem no longer crashed the Linux Kernel. Integrate Click Module into new Kernel Build Process The build process for external modules has changed considerably between the 2.4 and 2.6 kernels. Previously, external modules were simply compiled against the appropriate kernel header files, with a set of standard compiler directives asserted. The 2.6 Linux Kernel build process for external modules involves pointing the kernel’s kBuild script to the external module’s source code, which then builds the module using the kernel’s own build process. Normally this change would not significantly affect the porting of a module, in fact, this would normally simplify the module’s build process. However Click has an unusual build process, due in part to its simultaneous support for user level and kernel level processes, as well as its use of C++ within the kernel. To support user level and kernel level processes simultaneously, Click uses preprocessor com- mands within the source code, with compiler directives creating different object files for a source file, depending on whether it was compiled for the user level or kernel level process. In order that 22
  • 28. these different object files do not get confused with each other, the object files are not compiled into the same directory as their source code, but are instead compiled into a user level directory or a kernel level directory, depending on their target. To easily support this, Click used a vpath command within its makefile (used by the kernel module’s build process). This vpath, or virtual path, command directs the make build tool to search the given paths for source files, but com- piles these source files in the current directory. This allows the object files, compiled for user and kernel level, to remain separate in their own respective directory, even when compiled from the same source file. However, the Linux Kernel build process does not support the vpath command, so these source files were not found when building the kernel module under the 2.6 kernel. New commands were therefore added to the 2.6 Click Kernel module makefile, to identify the location of the source files explicitly. Since Click elements can be added by users, their locations are not necessarily known ahead of time. A script was therefore created, which adds the location of new elements’ source code to a file used by the makefile. The use of C++ within Click also introduced problems for the Linux 2.6 Kernel build process. Since the kernel expects C source files, it does not provide any support for compiling C++ source files. Commands were added to the Click Kernel module to enable C++ compilation within the kernel build process. The compilation of the Click Kernel module within the Linux kernel build process introduces an additional problem, that of common symbols in the object files built. Common symbols are variables without initialised data storage. They are created by the gcc compiler when the compiler finds a global variable which is initialised within a function, rather than immediately after it is declared. For example, the following C or C++ code would create the variable global value as a common symbol: int global_value; void initialise() { count = 1; } The 2.6 Linux Kernel cannot load modules with common symbols, therefore source code within the above code structure needed to be modified. The above case could easily be fixed by ini- tialising global value to any value when it is declared. However, Click contained a number of instances which were more difficult to fix. For example, Click creates an atomically acces- sible integer type, for use in locking situations. This atomic type redefines the = operator with a section of code to ensure its atomicity. The code represented by the atomic type’s = operator cannot be compiled outside of a function. Therefore a global variable of this atomic type cannot be initialised where it is declared, it must be initialised within a function. This problem could be circumvented by having a global pointer to the value, rather than declaring the the value itself globally. Common symbols for atomic types could thus be removed by using the following code: atomic_int * global_value = NULL; void initialise() { global_value = (atomic_int *) kmalloc (sizeof(atomic_int)); (*global_value) = 1; } The final problem, which this new build process introduced, occurred when the Click Kernel Mod- ule was integrated into a Linux Kernel modified for Xen (XenoLinux). Once the Click module 23
  • 29. was fully operational within the 2.6 Linux Kernel, it was tested on a XenoLinux kernel. The XenoLinux kernel would load the Click module correctly, however, as soon as a Click router configuration was loaded, the machine would hang. The problem was eventually traced to a soft interrupt which was called within Click. As discussed previously, XenoLinux contains its own architecture specific header files, so that it can interface with Xen’s idealised virtual machine architecture. The external module build process was compiling Click with the standard x86 archi- tecture header files, rather than the Xen architecture header files. When the interrupt was being called, the standard x86 mechanism was being used, which Xen would not allow, therefore the guest operating system hung. To solve this problem, it was necessary to modify the Linux build process to ensure it used Xen’s own architecture specific header files when compiling Click. 4.2.2 MPLS RSVP-TE daemon As discussed previously, I decided to port the RSVP-TE daemon to a newer version of MPLS Linux, rather than porting an old version of MPLS Linux to the newer kernel. This meant the porting of MPLS Linux to the 2.6.9 kernel was relatively simple, involving relatively minor changes in function call formats and moving the location of some source code which is added to the Linux Kernel for MPLS support. However, this choice meant that the RSVP-TE daemon needed to be substantially modified to support the new MPLS Linux interface. Integration with MPLS Linux 1.935 The interface for management of MPLS label switched paths (LSPs) and labelstacks changed substan- tially between MPLS Linux 1.72 and MPLS Linux 1.935. Fortunately, the RSVP-TE daemon used a single source file to provide a library of MPLS functions to the rest of the daemon. Therefore, instead of having to search throughout the source code for MPLS operations, re-implementing each one for the new MPLS Linux interface separately, it was possible to simply re-implement the complete MPLS library file, thus providing a wrapper to the new MPLS interface. Integration with rtnetlink The library of MPLS Linux functions, used by the RSVP-TE daemon, gains access to the MPLS man- agement functions through the MPLS administration tool (mplsadmin). However, rather than starting mplsadmin each time it required access to an MPLS management function, the MPLS Linux administra- tion program was compiled within the RSVP-TE daemon. This meant that the RSVP-TE daemon could access these commands directly, rather than having to fork a new process (in the form of mplsadmin) whenever it accessed MPLS management functions. When updating the RSVP-TE daemon to support the 1.935 version of MPLS Linux, the original mplsadmin compiled within the RSVP-TE daemon was replaced with the 1.935 version of the mplsadmin. However, replacing the 1.72 mplsadmin with the 1.935 adminstration tool introduced some complica- tions. Both mplsadmin and the RSVP-TE daemon itself use rtnetlink to update the routing table in the Linux Kernel. Both of these tools, therefore, have a library of rtnetlink commands, which they use to provide this communication. The 1.72 mplsadmin uses the same rtnetlink library as the RSVP-TE tool, therefore, when it is compiled into the RSVP-TE daemon, it can use the rtnetlink library provided by the RSVP-TE daemon. However, the 1.935 version of the MPLS administration tool uses a newer ver- sion of the rtnetlink library than that used by the RSVP-TE daemon. The rtnetlink library, within the RSVP-TE daemon, was therefore replaced by the newer library, to provide the functionality required 24
  • 30. by mplsadmin. This required modification of much of the RSVP-TE daemon’s source code which dealt with rtnetlink communication, in order that it conformed to the format of the updated rtnetlink library. 4.2.3 Summary Significant effort has been expended to integrate several open source subsystems to produce the QuaSAR prototype. This level of effort is summarised in the following table. Subsystem Name Lines of Code Type of Code Click Kernel Patch ∼640 Kernel Level Code in C Click Module Patch ∼360 Kernel Level Code in C++/C Click Module Makefile 305 (∼100 significant lines) Makefile Configuration RSVP-TE Daemon Patch ∼3400 User Level C Some of the 640 lines of code required to patch the 2.6 Linux Kernel, so that it can support the Click Modular Router, came from either previous Click Kernel patches, or from some preliminary work which had been made on the 2.6 kernel patch before the start of this project. However, much of the changes required significant modification to support the 2.6 kernel, required substantial debugging before it worked, or was brand new. Similarly, some of the Click Module Patch was based upon preliminary work which had been performed before the start of this project, however, much more of it was brand new and debugging was required on the preliminary work, since it had never been fully tested. Most of the Click Module Makefile was replaced to support the 2.6 build process. Much of this replacement was simple variable set up, however, about 100 lines required careful thought. These changes were passed back to the Click community, where, with slight modification, they have been integrated into the Click development tree. The RSVP-TE daemon patch required by far the most changed lines. A large portion of these line changes consisted of simply replacing the MPLS management tool and the rtnetlink source code files. However, a substantial number of these changes were incurred by rewriting the MPLS function library and by modifying the RSVP-TE daemon’s code to support the updated rtnetlink library. 25
  • 31. Chapter 5 QoS Routelet The original approach of QuaSAR, described in Chapter 3, calls for the creation of simplified routers, or routelets, which provide routing functionality for Quality of Service traffic flows. Many of these routelets run within QuaSAR at any one time, such that each network flow requiring QoS guarantees is serviced by its own independent routelet. As such, it is important that each of these routelets uses as little of the overall system resources as possible, so that QuaSAR can support a reasonable number of network flows at any one time. Fortunately, the routing requirements of a single network flow are relatively simple, therefore the Qua- SAR routelet can have a simple design with modest system requirements. A routelet only routes packets from a single, unidirectional network flow, therefore it only needs to provide static routing and simple packet processing. When a network flow’s route changes, or its QoS requirements change, the main QuaSAR router (described in Chapter 6) will simply reconfigure the routelet. The routelet does not, therefore, need to cope with dynamic routing changes, however, it does need to be dynamically recon- figurable. This Chapter describes the design and implementation of QuaSAR’s QoS routelet. Section 5.1 describes the operating system configuration used by the routelets. Section 5.2 describes the implementation of the routelets’ packet processing and routing functionality. 5.1 Routelet Guest Operating System Each routelet runs within its own Linux guest operating system. The routelets have simple requirements, and could have been supported by a much simpler operating system than Linux (thus reducing the system resources required by each routelet). However, the routelet must be run within a Xen Virtual Machine, therefore it must run on an operating system which has been ported to the Xen virtual machine architecture. Although Linux, and a number of other major operating systems, have been ported to the Xen virtual machine architecture, no simple, small operating systems run on Xen. Porting one of these operating systems to Xen would be beyond the scope of this project, therefore the Linux Operating System was used to support the QuaSAR routelets. However, an attempt was made to use a minimal configuration of Linux, so that QuaSAR could support many routelets. The Linux kernel provides the basis of the Linux operating system, in that it manages the underlying hardware, however, it does not provide the full functionality of a Linux operating system. A fully functional Linux Operating System requires both a Linux Kernel and a collection of libraries (such as the GNU C Library) and system tools (such as ifconfig, which is used to set up network interface cards). Although it is possible to set up a Linux system from scratch with these libraries and system tools, 26
  • 32. they often come bundled in the form of a distribution. The next two sections (5.1.1 and 5.1.2) describe how these two components of the operating system were configured, in order to reduce the resource utilisation of each QuaSAR routelet. Section 5.1.3 describes the changes made to the routelets’ startup scripts, so that they can process network packets, and communicate with the main, best effort router, as soon as they have started. 5.1.1 Linux Kernel The QuaSAR routelet runs within a Xen virtual machine, therefore its Linux Kernel was modified with the XenoLinux patch before being compiled. The routelet also makes use of the Click Modular Router, so the Kernel was also patched to support this (this patch is described in Section 4.2.1). Although a routelet deals with MPLS packets, the processing of these packets occurs within Click, not within the Linux networking stack, therefore the MPLS Linux support was not required within the routelets’ kernel. The Linux Kernel’s build process allows sections of the kernel to be excluded from a build if they are not needed. This meant that the system requirements of each routelet could be reduced by compiling a minimal kernel, where major sections, such as SCSI disk drivers or USB support, were excluded as they would never be used by the routelet. 5.1.2 Linux Distribution It is possible to create a custom distribution, by choosing the appropriate libraries and system tools manually and configuring the system to use these components. This would allow the creation of an ideal minimal distribution for the QuaSAR routelet. However, manually creating a distribution is a very time consuming and error prone task, so I chose to investigate whether a pre-built Linux distribution would provide the features required by the routelet. A Linux distribution typically takes up much more hard drive space than the Linux Kernel, therefore it was important that a small distribution was chosen for use within the routelet. The routelet was required to execute standard Linux programs, so although a small distribution was preferable, it had to contain the full GNU C Library. It also had to contain (or be able to use) the tools necessary to set up the networking configuration needed by the routelet. A number of Linux distributions (e.g. SUSE or Redhat) contain the necessary libraries and tools, however, they are very large and contain features which are unnecessary for the routelet. Similarly, there are many small Linux distributions (e.g. 0sys [25]) which do not use the full GNU C Library or have networking support. The ttyLinux project [36] is an attempt to create a fully featured Linux distribution which is small and can run on limited resources. ttyLinux was the chosen distribution for the QuaSAR routelets as it can run with 4Mb of RAM, yet still provides the full GNU C Library. It also provides the networking support required by a routlet. 5.1.3 Routelet Startup Configuration Each time a routelet is started, a number of tasks need to be performed before that routelet can commu- nicate with the main QuaSAR router, or process network packets. These tasks were integrated into the routelets’ startup scripts. There are three main tasks which a routelet must perform on startup: 27
  • 33. Set up Network Devices Xen virtual network device interfaces (VIFs) are used by QuaSAR to provide communication, and packet passing between domains (i.e. QoS packets are passed from the main QuaSAR router to the appropriate QoS routelet through a VIF). When a new Xen domain is started, a parameter in the domain creation script specifies how many VIFs are created for that domain. These VIFs form a point-to-point link between their back-end within a privileged domain (which is creating this new domain), and their front- end within the new domain. The back-end and front-end parts appear as normal network devices to both domains, and so normal network messaging can be used to communicate between domains. Each routelet, when started, is configured with a VIF to correspond to each of the Ethernet devices in the underlying QuaSAR router. A script on the privileged domain (i.e. the domain with access to the underlying Ethernet devices - in QuaSAR’s case, the main, best effort router’s domain) connects the back-end of these VIFs to their corresponding physical Ethernet devices using a virtual Ethernet bridge. This means that any packets, sent out of a VIF within a routelet, are sent out of the corresponding physical device. When QuaSAR demultiplexes packets destined for a routelet (described in Section 6.3), it sends these packets to the routlet’s VIF which corresponds to the physical device which the packet arrived on. These VIFs therefore appear like the corresponding physical Ethernet devices to the routelet, but with only the traffic destined for that routelet. The routelet routes packets based upon their MPLS label, not based upon their IP header. MPLS is a lower level network protocol than IP (layer 2.5 instead of layer 3), therefore the routelets’ VIFs do not have to be set up with IP addresses. In fact, if they were set up with the addresses of their corresponding Ethernet devices, this would cause confusion within the virtual Ethernet bridge being used to connect the VIFs to the Ethernet devices. The VIFs were, therefore, started without being assigned an IP address. The ttyLinux networking startup script could not bring up (or start) network devices without assigning them an IP address. This script was therefore modified to allow these VIFs to be brought up without an IP address. Each routelet also starts with an extra control VIF, so that the main QuaSAR router can send control information to each of the routelets. A unique IP address is assigned to each routelet’s control VIF, therefore standard TCP/IP communication can be used for communication between the main router and the routelets. The ClickCom tool, described in Section 6.2.2, uses this communication link to allow the main router to control the routelet’s Click Router configuration. Figure 5.1 provides an overview of the routelet network setup1 . Although it appears that a packet must pass though a large number of interfaces to get to a routelet, the packet is never actually copied between these interfaces. Instead, packets move between interfaces by less expensive methods, such as pointer passing, or page table sharing. Page table sharing is used to move data from one Xen domain to another, for example, moving packets between the front-end and back-end of Xen virtual network interfaces. This approach involves manipulating the memory page tables of the two guest operating systems involved, to transfer ownership of the packet’s memory page(s) from one guest operating system to the other. Therefore, transfer of packets between the back-end VIFs in domain 0, and the routelet’s front-end VIFs, involves subtle manipulation of both guest operating systems’ page tables, rather than copying of the packet. 1 Note that Figure 5.1 does not show the packet demultiplexing mechanism used to send incoming packets to the correct routelet for processing. This demultiplexing is described in Section 6.3, however at present it is sufficient to view the bridges as performing the necessary demultiplexing between routelets. 28
  • 34. Key routelet 1 routelet 2 Network Interface Type FE FE FE FE FE FE eth0 eth1 eth2 eth0 eth1 eth2 FE Xen Virtual Interface [name] Front−End BE Xen Virtual Interface BE BE BE BE BE BE [name] Back−End vif1.0 vif1.1 vif1.2 vif2.0 vif2.1 vif2.2 [name] Virtual Ethernet Bridge xbr−eth0 xbr−eth1 xbr−control [name] Physical Ethernet Device (Control Messages) Data Transfer Type eth0 eth1 Memory Page Transfer Best Effort Router Pointer Passing Network Transmission Network Figure 5.1: This diagram outlines the network setup used to connect routelets’ virtual Ethernet devices to the physical Ethernet devices. Mount Click Filesystem The routelets control their Click Router configuration by reading and writing to files in the Click Filesys- tem. This virtual filesystem needs to be mounted within the routelet’s root filesystem before it can be accessed. An entry was entered into the routelet’s fstab configuration file, in order to mount the Click Filesystem automatically on startup. Start ClickCom Server Daemon The ClickCom tool (described in Section 6.2.2) is used by the main QuaSAR router to control the routing and packet processing of routelets, by transferring the required configuration to the routelet’s Click Router. The ClickCom server was added to the routelet’s startup script, allowing the main router to send Click configurations to a routelet as soon as it is started. 5.2 Routelet Packet Processing As each routelet supports a single network flow, it only has to deal with packets arriving at a single inbound network interface, static processing of these packets, then transmission on a single outbound interface. Packets therefore simply need to pass though a single chain of commands, with no branches or decision points, while being processed by a routelet. The QuaSAR routelet uses the Click Modular Router to process packets. As discussed in Section 2.1, a Click Router consists of a number of elements, linked together by one way links. When a packet arrives in a Click Router, it traverses this sequence of one way links, being processed by each element it arrives at, until the packet arrives at an element which causes it to leave the Click Router (this could involve being discarded, being sent to a network device, being passed up to the Linux networking stack, or a number of other options). This structure is ideal for the QuaSAR router, with a single sequence of 29
  • 35. elements performing the necessary processing on packets between their arrival and departure 2 . Many of the routelet’s packet processing requirements can be provided by elements which are already available within Click. Elements can also be written to provide any required functionality which is not provided by the standard Click elements. 5.2.1 Overall Design The QuaSAR router routes packets at the MPLS layer, therefore it has to process both the data-link (in this case Ethernet) header and the MPLS shim header. The routelet, therefore, processes packets by first stripping off the outermost (old) Ethernet header, processing the MPLS shim as required for the next network hop, and finally encapsulating this packet in a new Ethernet header, before sending it to the next hop in that network flow. Figure 5.2 outlines the Click configuration used by each routelet to perform this packet processing. FromDevice ([inbound nic]) Strip (14) ProcessShim ([new MPLS label]) EtherEncap ([src MAC], [dst MAC], 0x8847) ToDevice ([outbound nic]) Queue Figure 5.2: An overview of the Click architecture used by QuaSAR routelets to process packets. Each of the Click elements used by the routelet in the above Click configuration is described in more detail below: FromDevice The FromDevice element captures any packets arriving from a network device and pushes them to the next element. These packets have been captured from the network interface’s device driver, and have therefore not passed through any of Linux’s networking stack. Consequently, the packets are still encapsulated with their original data-link header. The FromDevice element takes a single configuration parameter - the name of the device it should capture packets from. This device name is specified, when the main QuaSAR router configures a routelet for a particular network flow, as the device upon which packets from that flow arrive. The FromDevice element creates a special file in the routelet’s Click Filesystem, which can be used to change the device from which it will capture packets. Therefore, if a network flow’s inbound device changes, the routelet does not have to be completely reconfigured. Instead, the main router can use the ClickCom tool to write the new device name into this special file. Strip The Strip element simply removes a certain number of bytes from the front of a packet. This is used to remove the old data-link header from packets arriving into a routelet. The Strip element 2 Click does allow elements to have more than one output or input, providing branching and merging of packet flows. This ability was used within the main QuaSAR router (Section 6.3), but was not necessary in a routelet’s packet flow. 30
  • 36. accepts one configuration parameter - the number of bytes to be stripped from the front of a packet. This is set to 14 bytes in the QuaSAR router, as this is the size of an Ethernet header. If a different technology is used, this parameter can simply be changed to support the header size of that technology. ProcessShim This element processes the MPLS shim on the packet. This involves changing the shim’s MPLS label to that used by the next hop in the network flow, and decrementing the shim’s time to live (ttl) field. This element is configured by providing the value to which an outgoing packet’s MPLS label should be changed. This element also provides a special file which can be used by the main router to change this outgoing label value, without reconfiguring the whole routelet. The ProcessShim element was created especially for the QuaSAR router. Its design and implementation is described in more detail in Section 5.2.2. EtherEncap The EtherEncap element encapsulates packets in their outgoing Ethernet header. An Ethernet header consists of the source hardware (MAC) address of the outgoing interface, the destination hardware address of the next hop in this packet’s route, and the protocol type of this packet. The EtherEncap element is thus configured by providing the values of these three components. The two MAC addresses are provided by the main router when it configures a routelet, the protocol type always being set to MPLS. The element again provides special files to alter these values dynamically without reconfiguring the whole routelet. Queue Click provides two mechanisms for moving packets between elements - push and pull. If a link is in push mode, then the first element will push packets to the second element as soon as it has finished dealing with them. The second element will then immediately process this pushed packet. In pull mode, the second element is responsible for passing packets - it will ask the first element for a packet when it is prepared to process one. An element can be of type push, pull or agnostic (either push or pull) and the elements can only be linked to another element of the same type (an agnostic element can link to either a push or a pull element, however both its input and output must be of the same type. Once connected, it effectively becomes either a push or pull element, depending on the elements to which it is connected). Most of the elements used by the routelet are agnostic, however, FromDevice is a push element (as it pushes packets as soon as they are captured) and ToDevice is a pull element (as it waits until the network device is ready before asking for the next packet to send). A FromDevice element and a ToDevice element cannot be connected together directly (even with agnostic elements in between) as they are of different types. A Queue element has a push input, and a pull output, therefore a Queue element is placed between the FromDevice and ToDevice elements to connect them. A Queue also provides storage for packets when more are being pushed into it than pulled out. This provides a degree of buffering between the rate of incoming packets and the rate at which they can be sent out of the outbound network interface. The Queue’s storage capacity can be altered, depending upon the requirements of the routelet. ToDevice The ToDevice element sends packets out of a network device. These packets are sent straight to the network interface’s device driver, and have therefore not passed through any of Linux’s networking stack. The ToDevice element is configured by providing the name of the device it should send packets out on. This can be modified using a special file in the same way as the FromDevice element. 31
  • 37. The configuration values for each element are provided by the main router when it configures a routelet for a particular network flow. This process is described in Section 6.2. 5.2.2 ProcessShim Element The ProcessShim Click element was created so that the QuaSAR routelets could modify the MPLS shims of packets as they pass through the router. An MPLS shim is a 32 bit header which is used to route packets in an MPLS network. In MPLS over Ethernet, this shim is located between the Ethernet header and the IP header of a packet. An MPLS shim has four fields: a label, used for routing; an experimental field (which can be used for DiffServ over MPLS); a ”bottom of stack” bit, used when tunnelling MPLS over MPLS; and a time to live (ttl) counter, which allows layer 3 functions, such as ping route tracing, to occur, even though an MPLS router doesn’t have access to the layer 3 header. These fields are arranged as shown in Figure 5.3. 32 bits MPLS Label Exp S TTL 20 3 1 8 Figure 5.3: The layout of an MPLS Shim. The experimental field and the ”bottom of stack” bit are not modified by the QuaSAR routelet, however, both the label and ttl fields must be modified before a packet is sent to the next hop on its route. The ttl field must be decremented by each router the packet traverses across the network. If a packet arrives at a router with a ttl count of zero, then that packet is discarded. This prevents packets travelling forever because of a loop within a network. The ttl is a 1 byte number at the end of the MPLS shim. The ProcessShim element, therefore, accesses the shim’s ttl using an unsigned char pointer to the last byte of the shim. The ProcessShim element tests this ttl against 0, discarding the packet if it is 0, otherwise the ttl is decremented. A label switched path (LSP) does not use a single label value throughout the whole network. Instead, each router assigns its own label to an LSP. This means that it is not necessary to find a globally unique label for an LSP across all of the routers that LSP traverses. When an LSP is created, a label distribution protocol (RSVP-TE in the QuaSAR router) enables neighbouring routers to exchange the label value they have assigned to that LSP. MPLS routers use this information to swap an incoming packet’s MPLS label with the label that the next hop router has assigned to that LSP, before sending the packet. When a new ProcessShim element is created, it is passed a new label value (as an unsigned integer) by a per-element configuration method. A packet’s MPLS label is replaced by the value in new label when it is processed by the ProcessShim element. An MPLS label is a 20 bit integer, therefore, since C++ does not have an integer type which is 20 bits long, bit manipulation is used to mask the 12 bits at the end of the shim when the label is replaced. A write handler was added to the ProcessShim element. This creates a special file in the Click Filesystem and handles the effect of writing to that file. The write handler was implemented so that writing to this special file updates the new label value with the value written to this file, allowing dynamic reconfiguring of the ProcessShim element. 32
  • 38. Chapter 6 Main Router In order to support QoS routelets, a number of changes must be made to the best effort router, which is in overall control of the QuaSAR router. This best effort router runs within a privileged Xen domain (virtual machine) and therefore has direct access to the physical networking devices, and has the ability to create new domains for use by routelets. The main QuaSAR router performs two functions: routing of any best effort, non QoS, packets arriving at the router; and setting up and managing QoS routelets. The routing of best effort packets is performed by two components - Linux IP forwarding for IP packets, and MPLS Linux for best effort MPLS packets. Both provide static routing of packets through a network, with tools available to manually modify their routing tables. The MPLS RSVP-TE daemon is used by the QuaSAR main router to provide dynamic creation of MPLS LSPs when the router receives RSVP- TE path creation messages. Similarly, Quasar could be modified to support dynamic IP routing, with the addition of routing daemons which support dynamic routing protocols, such as OSPF or BGP (for example, those in the Zebra Open Source Router). However, dynamic IP routing was not required by this project. Support for QoS routelets required the addition of four main components in the main QuaSAR router - Routelet Creation, Assignment of QoS Network Flows to Routelets, Packet Demultiplexing and Routelet Management. These components are described fully in the following sections of this chapter. 6.1 Routelet Creation The creation of a new routelet involves creating a new Xen domain, starting Linux within that domain and linking the domain’s virtual network devices to the router’s physical devices. It therefore takes significant CPU time (a matter of seconds) to start each new routelet. Therefore, starting a routelet each time a new network flow, requesting QoS guarantees, arrives would considerably disrupt the routing performance of the QuaSAR router as the routelet is started. Instead, QuaSAR creates a pool of routelets at startup, when no packets are being routed, then assigns routelets to QoS network flows as they arrive. A startroutelets script was implemented, which automatically creates a pool of routelets, ready for use by network flows. A single argument specifies the number of routelets which this script should create. Each routelet requires a unique ID number, both to identify the Xen domain in which it is running and to create a unique IP address, used by the routelet’s control VIF. The StartRoutelets script begins by searching the list of Xen domains currently running, to find the highest current domain ID in use. It does this by running a simple awk script over the output from the Xen management tool’s domain listing option. Each new routelet is assigned a higher ID than the highest currently in use, to ensure this 33
  • 39. new ID is unique, which allows the StartRoutelets script to add routelets to an existing pool, as well as create a new pool. Xen is passed a domain startup script and this unique ID, for each routelet to be created. This domain startup script contains the information used by Xen to create the routelet’s domains, such as: The location of the routelet’s Linux Kernel: A single Linux Kernel is shared by every routelet (de- scribed in Section 5.1.1). Since the Kernel is loaded into the routelet’s memory immediately, sharing a single Kernel file across routelets does not cause interference amongst them. The amount of memory assigned to that routelet: Each routelet is assigned the smallest amount of memory, within which it can run (6Mb), so that the QuaSAR router can support the maximum possible number of routelets. Each routelet is assigned the same amount of memory, regardless of its QoS requirements, since additional memory would not have a significant effect on the simple packet processing provided by the routelets. The location of the routelet’s filesystem: A model routelet filesystem was created with the files needed by each routelet. This model filesystem was initially located on a Logical Volume Manager (LVM) partition, where each routelet could use a LVM snapshot of this model filesystem as their filesys- tem. A snapshot partition will only write to disk changes which the routelet makes from the model filesystem. This allows multiple routelets to share the same filesystem without interfering with each other when they write to that filesystem, whilst still saving disk space, as the shared model filesystem is only stored once on disk. It was discovered, however, that routelets do not require a writable filesystem. By turning off system logging and filesystem verification checks in the routelet’s guest operating system, the routelet simply requires access to a read only filesys- tem. Every routelet is therefore given read only access to a single file backed root filesystem, simplifying the filesystem layout and reducing the disk space required, to that of a single routelet filesystem, no matter how many routelets are running simultaneously. The routelet’s virtual network interface configuration: The routelet domain startup script contains the information necessary to set up a routelet’s network configuration, as described in Section 5.1.3. This includes the number of virtual network interfaces a routelet should have, the virtual bridges to which these network interfaces should be connected, and the IP address (generated from the routelet’s ID number) which should be assigned to the routelet’s control VIF. Once all of the routelets have been started, their domains are paused, until they are assigned to a network flow. Therefore, unused routelets are not scheduled to run by the Xen Virtual Machine Monitor, and the pool of unused routelets does not use any CPU time. The routelets are paused after all of the routelets have been started, rather than after each individual routelet starts, in order to allow the routelets time to boot their operating systems fully. Finally, the framework used to demultiplex incoming packets between routelets is set up. This is dis- cussed in Section 6.3. 6.2 Assignment of QoS Network Flows to Routelets Once a pool of routelets has been created, a mechanism is needed to remove routelets from this pool and assign them to the routing of a QoS network flow. A connectroutelet script was created, to automatically perform the tasks necessary to assign a routelet to a network flow. This script is described in Section 6.2.1. 34
  • 40. In order to configure a routelet for the routing of a QoS network flow, the main QuaSAR router must be able to communicate with routelets. The ClickCom tool was created to provide communication of Click configuration files between the main router and routelets. The design of this tool is described in Section 6.2.2. The assignment of routelets to network flows was integrated within the RSVP-TE daemon, so that routelets are assigned to network flows automatically when an MPLS LSP is set up with RSVP-TE messages. Section 6.2.3 describes this integration. Finally, routelets must be returned to the idle pool after a network flow is torn down. This process is described in Section 6.3.4. 6.2.1 ConnectRoutelet Script Three tasks need to be performed to assign a routelet to a network flow: identify a routelet and remove it from the pool, configure this routelet so that it can route packets from the network flow, and finally configure packet demultiplexing so that packets from that network flow are sent to the newly configured routelet. The main QuaSAR router uses a shell script to perform these tasks automatically when a new QoS network flow arrives. The ConnectRoutelet script identifies which routelets are assigned to network flows and which are in the pool of available routelets, by identifying routelets whose domains are currently in the paused state. An awk script is run over the output from the Xen domain listing to identify routelets which are paused. One of these paused routelets is selected and un-paused using the Xen domain management tool. This routelet will be assigned to the new QoS network flow. In order to configure this routelet for the QoS network flow which has just arrived, the main router must pass a Click configuration, as described in Section 5.2.1, to the routelet. First, however, the configuration parameters of each Click element must be determined, as appropriate for the network flow. Firstly, the routelet must route the packets between the network flow’s incoming and outgoing network interfaces correctly. The network flow’s incoming and outgoing network interfaces are known when a network flow is specified, as they are a fundamental characteristic of the network flow. The names of these network devices are passed to the ConnectRoutelet script, where they are used to fill in the FromDevice and ToDevice element configurations of the routelet’s Click configuration. The routelet must replace the MPLS label of packets it is processing with the label used by the next router for this network flow. This outbound label is also passed to the ConnectRoutelet script, where it is used to configure the ProcessShim element. Finally, the EtherEncap element needs to be configured with the source and destination MAC addresses of the packet’s Ethernet header. The name of the network interface, out of which the packets will be sent, is already known, therefore the source MAC address can be found by searching for this interface name in the output of the ifconfig tool, which lists network interface information, including the interfaces’ MAC addresses. The destination address is the address of the next hop on the network flow’s route. The MAC address of the next hop is not typically known, however, the next hop’s IP address is part of a network flow’s specification. The ConnectRoutelet script is passed the next hop IP address, which it translates to a MAC address by searching for this address in the router’s ARP cache1 . If an ARP entry 1 ARP [30], or address resolution protocol, is used by hosts to find the Ethernet MAC addresses which correspond to a given IP address. When a host sends packets to an IP address, it first sends an ARP message to all neighbouring hosts to ask which would accept packets with this IP address, thereby discovering the MAC address of the host to which it should send packets with this IP address. Recent IP address / MAC address pairs are stored in an ARP cache, so that ARP messages do not have to be sent for packets with IP addresses which have been recently resolved. 35
  • 41. cannot be found for this next hop IP address, then the ConnectRoutelet script pings the next hop IP address, and checks the ARP cache again. If the IP address is still not in the ARP cache, then the next hop cannot be reached by this router, and the network flow cannot be set up. Otherwise, the MAC address, corresponding to the next hop IP address, is used to configure EtherEncap’s destination address. Once these configuration parameters have been discovered, the resulting Click configuration is sent to the newly assigned routelet, using the ClickCom tool (Section 6.2.2). Finally, the QuaSAR demultiplexing support is configured, so that packets arriving to this router from the newly instigated QoS network flow are sent to the assigned routelet. QuaSAR’s demultiplexing details are described in Section 6.3. Performing these tasks using a compiled language such as C, instead of a shell script, was investigated, however, due to QuaSAR’s reliance on external tools (such as the Xen management tool), the speed increase this would provide would be negligible. 6.2.2 ClickCom A Click Router can be loaded, by writing a Click configuration into a config file in the Click Filesys- tem. Once a Click router is running, its elements can be reconfigured by writing values into Click files, created when elements are initialised. Although a routelet can control its own Click Router configu- ration through the Click Filesystem, the main router cannot directly access another operating system’s Click filesystem (i.e. the main router cannot directly manipulate a routelet’s Click Filesystem in order to set up its configuration). It would initially seem that NFS [35] could be used to share this filesystem between the routelet and the main best effort router, however, this is not the case. NFS, or Network File System, can be used to share files across a network, however, the Click Filesystem does not consist of any physical files on disk (the files are effectively just interfaces into procedures within the Click Kernel Module), therefore there are no physical files for NFS to share. A tool (ClickCom) was therefore created to transfer configuration data from the main router into a routelet’s Click Filesystem. A Click configuration is written in a specially created programming lan- guage, therefore it can be transmitted as a simple character stream. ClickCom consists of a client and a server process, which communicate using TCP/IP. The server process runs on each routelet, binding to a known port on that routelet’s control VIF. The client process runs on the main QuaSAR router. When the main router needs to change the configuration of a routelet’s Click Router (for example when a new QoS network flow arrives), it provides the ClickCom client with the routelet’s IP address, a Click configura- tion (either through standard input or from a local file), and the location of the file on the routelet’s Click Filesystem which is to be written to (e.g. Click’s config file). The ClickCom client connects to the routelet’s server, sends the location of the Click file to which the routelet should write the configuration, and then sends the configuration as a character stream. When the server receives a connection, it opens the file at the location sent by the client, then writes the configuration character stream into that file. The ClickCom tool can be used, both to create new Click router configurations within routelets, and to reconfigure the Click elements of running routelets, by writing into their configuration files (see Section 5.2.1 for an account of the reconfiguration which is possible for each element). 6.2.3 RSVP-TE Daemon Integration To support routelets, the RSVP-TE daemon was modified to assign routelets to QoS network flows when a new label switched path (LSP) is created. An LSP is created when a RSVP-TE Resv message, corresponding to a previously received RSVP Path message, is received by the RSVP-TE Daemon. Therefore, the RSVP-TE daemon was modified to call the ConnectRoutelet script when a Path message is received, passing parameters specifying the inbound and outbound network interfaces, the 36
  • 42. incoming and outgoing MPLS labels and the next hop IP address of the network flow to this script. All of these parameters, except for the inbound network interface, are available to the RSVP-TE daemon in a structure which stores detail of the RSVP-TE Resv message. The inbound network interface can be inferred by the interface on which the corresponding RSVP Path message arrived, as RSVP-TE Path and Resv messages follow inverse routes through the network (defining the route of the LSP). The QoS requirements of a network flow can be specified using a flow specification (flowspec) within the RSVP-TE messages used to set up the network flow. The RSVP-TE daemon could be modified to extract the QoS requirement from this flowspec, and map these requirements to the resources required by the routelet assigned to that flow. The RSVP-TE daemon could then use the management tools described in Section 6.4 to allocate the appropriate resources to the newly assigned routelet. However, a flow’s QoS requirements could not automatically be mapped to the resources required by a routelet to provide this QoS before experiments were performed on the QuaSAR router. Therefore, resources are manually assigned to routelets in this project. 6.3 Packet Demultiplexing Initially, QuaSAR’s design called for the use of Xen’s inbuilt packet demultiplexing between domains to send incoming packets to the correct routelet for processing. However, further investigation showed that Xen’s demultiplexing support was not appropriate for the QuaSAR router. Xen’s packet demultiplexing involves the use of a virtual Ethernet bridge. A virtual network interface (VIF) back-end of each target domain is connected to a virtual bridge in the privileged domain, along with the the physical device where packets to be demultiplexed are arriving. Bridges route packets based upon their data-link layer address (i.e. the Ethernet MAC address in this case), therefore each VIF is assigned a unique MAC ad- dress, used by the bridge, to route incoming packets to the correct VIF (and therefore domain). There are two possible methods of demultiplexing packets between domains, using this virtual bridge approach: • Put the physical Ethernet device into promiscuous mode, and direct hosts wishing to send packets to a domain to use the domain VIF’s MAC address, instead of the physical Ethernet device’s MAC address as the packet’s destination. With the physical Ethernet device in promiscuous mode, it would accept these packets, even though their destination MAC address does not match the device’s MAC address. The bridge would then route these packets to the appropriate domain, based upon the VIF’s MAC addresses. This approach is not appropriate for the QuaSAR router because the RSVP-TE messages, used to set up an LSP, use IP to find their route through the network. Since the LSP follows the same route as the RSVP-TE messages used to set up that route, the RSVP-TE messages must be sent to the MAC address of the routelet which is going to process that LSP. A fake IP address could be set up for each routelet, with the QuaSAR router faking ARP message responses, associating that IP address with the routelet VIF’s MAC address. The RSVP-TE messages could then be sent to the appropriate fake IP address, however, this would require that the previous hop on the LSP’s route has knowledge of which routelet will manage the LSP, before the QuaSAR router even knows that a new LSP is being created. Without major changes in LSP creation (which would make the QuaSAR router inoperable with other RSVP-TE routers), this could not be accommodated. This approach would also require a unique IP address for each routelet, which would be extremely wasteful. • Packets are sent to the physical device’s MAC address, where they traverse the Linux network stack until their IP address is exposed. This IP address would be associated with the appropriate domain by entries added to the privileged domain’s ARP cache (either through ARP requests sent 37
  • 43. across the virtual bridge, or static entries added to the ARP cache). Standard Linux IP forwarding could then send the packets to the appropriate VIF’s MAC address, and thus the correct domain, over the virtual bridge. This approach is not appropriate for the QuaSAR router because QuaSAR routes packets based on their MPLS labels, a layer below the IP header. Therefore, the router should not use the IP header of these packets at all. Also, this approach would effectively require two stages of routing - routing of packets to the appropriate routelet’s domain, then routing the packets to the next hop. This would have a serious negative performance impact on the QuaSAR router, and would decrease the partitioning which QuaSAR provides between network flows. Since Xen’s packet demultiplexing is not appropriate for the QuaSAR router, a new method of demulti- plexing packets, based upon their MPLS Label, was required. Click was used to provide a framework for this demultiplexing support, since pre-existing Click elements could provide some of the functionality required to demultiplex packets between routelets. 6.3.1 Overall Design The main QuaSAR router must classify incoming packets, as either belonging to a QoS assured network flow, or being best effort packets, and route them appropriately. Packet classification consists of two stages. Firstly, packets are classified as either being MPLS or non-MPLS packets. Non-MPLS packets cannot be routed by QoS routelets, therefore they are sent to the main router’s networking stack to be processed. Secondly, the packet’s MPLS label is examined to discover which QoS routelet should process it, or whether it is from a best effort MPLS flow. It is important that this classification should be dynamically reconfigurable, so that new QoS flows can be supported as they are created. Figure 6.1 gives an example of the Click architecture which is be used by the main QuaSAR router to support demulitiplexing of two network interfaces between two routelets. Each of the Click elements used by the demultiplexing Click configuration is described in more detail below: FromDevice The FromDevice element captures packets received at an Ethernet device before they traverse the Linux networking stack. The FromDevice element is described in more detail in Section 5.2.1. Classifier The Classifier element classifies packets based upon simple bit pattern matching at a given offset into the received packet. The Classifier is configured with a sequence of patterns assigned to different output ports. The Classifier checks the packet against each pattern in turn, pushing the packet out of the port of the first pattern which matches that packet. The packet demultiplexer uses two pattern / output port pairs. The first compares the protocol field of the packet’s Ethernet header to the MPLS protocol value. If the packet has an MPLS protocol field, it is sent out of port 0, to the MplsSwitch. The second pattern is a default pattern which matches any packet, therefore packets which are not MPLS are pushed out port 1, to the ToHost element. MplsSwitch The MplsSwitch directs an incoming packet to one of its output ports based upon the packet’s MPLS label. The MplsSwitch checks incoming packets’ MPLS labels against a number of MPLS 38
  • 44. FromDevice (eth0) FromDevice (eth1) Classifier(...) Classifier(...) MPLS non-MPLS MPLS non-MPLS MplsSwitch (...) ToHost MplsSwitch (...) ToHost (to best effort router) (to best effort router) ToHost ToHost (to best effort router) (to best effort router) ToDevice (vif1.0) ToDevice (vif1.1) ToDevice (vif2.0) ToDevice (vif2.1) eth0 eth1 eth0 eth1 Routelet 1 Routelet 2 Figure 6.1: An overview of the Click architecture used to demultiplex packets between routelets. label / output port pair entries2 . Packets are sent to the output port, corresponding to their MPLS label. MPLS label / output port pair entries can be added dynamically to the MplsSwitch, to support newly created network flows. The packet demultiplexing architecture contains an MplsSwitch element for packets arriving from each physical Ethernet device. Each MplsSwitch is connected to a VIF on each of the routelets (the VIF which, on the routelet, corresponds to the physical Ethernet device which the Mpls- Switch deals with - see Section 5.1.3). When a new QoS network flow is being set up, the ConnectRoutelet script adds an MPLS label / output port pair to the MplsSwitch (which deals with packets from the network flow’s inbound network device) in order to switch packets from that network flow to its newly assigned routelet. If a packet arrives which does not match any of the MPLS label / output port pairs, it is sent out of the default port 0 (where it is sent to the main, best effort router for processing). The MplsSwitch element was created especially for the QuaSAR router. Its design and implemen- tation is described in more detail in Section 6.3.3. ToHost The ToHost element allows packets to be injected into the Linux networking stack at the same point as they were captured by the FromDevice element. Any packets which the packet demul- tiplexer sends to this element, therefore, appear to the Linux network stack as if they had never been captured by Click in the first place. The ToHost elements are configured with the name of the device from which packets, sent to this element, should appear to have arrived. Best effort packets 2 A hashtable is used to store the MPLS label / output port pairs (see Section 6.3.3), therefore an MPLS label lookup has O(1) complexity, and scales well when the number of network flows increase. 39
  • 45. are sent to the ToHost elements to be processed by the main, best effort router in the privileged Xen domain. Queue Queue elements are needed between FromDevice and ToDevice elements in Click. See Sec- tion 5.2.1 for more details. ToDevice The ToDevice element is used to send packets to the appropriate routelet’s VIF, for processing by that routelet. The ToDevice element is described in more detail in Section 5.2.1. 6.3.2 Automatic Demultiplexing Architecture Creation The packet demultiplexing Click architecture depends upon the number of routelets which have been created by the QuaSAR router. Therefore, when new routelets are created (by the StartRoutelets script) the demultiplexing architecture must be changed. The StartRoutelets script, therefore, creates a new demultiplexing architecture whenever it is run. The script searches the output of the ifconfig tool for a list of the Xen virtual network devices (VIFs) connected to the currently operational routelets. Each routelet has a VIF corresponding to each of the physical network devices on the router, therefore, each routelet’s VIF is connected to the MplsSwitch which handles packets from the corresponding physical device. The VIF is connected to the MplsS- witch’s output port number which corresponds to the routelet’s ID, so that the ConnectRoutelet script can create an MPLS label / MplsSwitch output port pair from a new flow’s MPLS label and the assigned routelet’s ID number. However, if the QuaSAR router has shut down some routelets, there will be gaps in the routelet ID number sequence. This would cause some of the MplsSwitchs’ ports to be unconnected, thus creating an invalid Click configuration. Therefore, the StartRoutelets script keeps track of unconnected ports, and connects a Discard element to those ports, as they will never be used. This may seem wasteful, if many routelets are shut down and restarted, however, this is not normal practice in the QuaSAR router. When an network flow is torn down, the associated routelet is simply returned to the idle pool (see Section 6.3.4) to be reused. Therefore, shutting down a routelet in the QuaSAR router would only occur in exceptional circumstances (such as a crashed routelet). 6.3.3 MplsSwitch Element The MplsSwitch switches packets to an output port based upon the packets’ MPLS label. It does this by comparing a packet’s MPLS label to a table of MPLS label / output port pairs, and sending the packet out of the output port specified by the pair which matches the packet’s label. These pairs are stored in a hash table, indexed by the hashed MPLS label, to ensure fast lookup when a packet arrives. If a packet does not match any of the pairs in the table, the packet is sent out of a default port. Initially the table is empty, but MPLS label / output port pairs can be added by writing to a special file which this element creates. If a character stream with the format: [MPLS Label] > [output port number] is written to this special file, then this pair will be added to the table. This special file is written to by the ConnectRoutelet script, to update the demultiplexing support when a new network flow has been 40
  • 46. assigned to a routelet. Another special file, when written to with an MPLS label, will remove that MPLS label’s entry in the MplsSwitch’s table. The final special file created by the MplsSwitch will, when read, output the current table. 6.3.4 Returning Routelets to the Idle Pool Eventually, network flows will be torn down. The QuaSAR router must therefore be able to discon- nect unused routelets, and return these routelets to the idle pool, otherwise it would quickly run out of routelets, or the resources necessary to start new routelets. Disconnecting a routelet from a network flow involves two tasks. Firstly, the MPLS label / output port pair, used to demultiplex packets to the routelet being disconnected, must be removed from the MplsSwitch. This means that any remaining packets from the flow can be routed through the best effort router, and the MPLS label can be reused for another network flow. A pair can be removed from an MplsSwitch, by writing the flow’s MPLS label to the remove label special file within the Click Filesystem. Secondly, the routelet must be returned to the idle pool so that it can be reused by another network flow. The idle pool simply consists of all the routelets whose domains are paused. Therefore, to return an routelet to the idle pool, its domain simply needs to be paused. A routelet does not need to be reconfigured before being returned to the idle pool, as its configuration is overwritten as soon as it is assigned to a new network flow. These tasks were added to a disconnect option in the ConnectRoutelet script. The overall routelet lifecycle is described by Figure 6.2. Shutdown St ar Routelet tR ou te le ts ... xm ConnectRoutelet connect ... de st ro y Idle ... Working Routelet Routelet ConnectRoutelet disconnect ... Figure 6.2: This Diagram outlines the routelet lifecycle, with the commands which move routelets between states. 6.4 Routelet Management The purpose of routing QoS network flows through routelets was to guarantee an allocation of the router’s resources to each QoS flow, thus allowing the router to provide flows’ QoS requirements even when under heavy load. Routelet management tools are required to assign resources, such as CPU time, network transmission rate, memory and disk access to each routelet, depending on the QoS required by the network flow it is servicing. The QoS provided by a routelet is not significantly affected by the memory or disk access rate granted to that routelet. This is because each routelet only performs very 41
  • 47. simple packet processing and, after startup, does not use the disk or request additional memory at all. These two resources are therefore assigned statically at routelet creation. However, the CPU time and network transmission rate, granted to a routelet, have a considerable effect on the QoS which can be provided by that routelet. Tools are therefore used by the main QuaSAR router to dynamically allocate these resources to routelets, depending on the QoS requirements of the flows they service. 6.4.1 CPU Time Allocation CPU time is assigned to a domain, and therefore to a routelet, by the Xen virtual machine moni- tor’s scheduler. The Xen virtual machine monitor can support multiple schedulers, each with different scheduling policies. To provide a guaranteed allocation of CPU time to each routelet, a Xen scheduler which provides soft real time guarantees must be used by the QuaSAR router. The original version of the Xen virtual machine monitor provided a soft real time scheduler called ATROPOS. With the ATROPOS scheduler, each domain can be assigned a certain slice of the CPU time (e.g. 20000µs of CPU time every 100000µs), which the scheduler will, within limits, guarantee. The original design of QuaSAR called for this ATROPOS scheduler to guarantee CPU time to each routelet, however, the version of Xen used by QuaSAR no longer supports the ATROPOS scheduler. A scheduler, which provides similar guarantees of CPU time allocation between domains, was being written at the time of this project. This scheduler (named SEDF, after its earliest deadline first scheduling policy) was not finished before the end of this project, nevertheless, an early version was obtained, which allowed some experiments to be performed with a soft real time scheduler. Unfortunately, this early scheduler was unstable and could not cope under heavy packet loads, therefore many of the experiments could not make use of this scheduler. The only Xen scheduler which was stable at the time of this project was a time shared scheduler, called BVT (borrowed virtual time). The BVT scheduler bases its scheduling decision on how much CPU time a domain has previously been allotted,with domains which have received less CPU time than others being more likely to be scheduled (in effect providing fair shares of CPU time to each domain). Weights can be assigned to each domain, to skew the balance of CPU time assigned to each domain, giving each domain a proportional fair share of CPU time. However, this division of CPU time between domains is not guaranteed. There was therefore no stable Xen scheduler which could be used to guarantee CPU time partitioning between routelets at the time of this project, and writing such a scheduler was beyond the scope of this project. 6.4.2 Network Transmission Rate Allocation It is important that each routelet can be guaranteed a minimum network transmission rate, so that it can provide the required QoS to the flow it is servicing. The transmission rate of a physical network card is finite, therefore, the network transmission rate of each routelet must be limited, to prevent one routelet from overwhelming a network card, and preventing other routelets from making use of their allotted transmission rate on that card. Originally, the Xen virtual network devices (VIFs) had the ability to limit the transmission rate of a domain. The original design of the QuaSAR router called for this capability to be used to limit the transmission rate of routelets, however, the change to a new virtual network device driver in newer versions of Xen had removed this ability. The Xen community planned to reintroduce the ability to limit a domain’s network transmission rate in a later version of Xen, however, this would not be implemented before this project’s deadline. I therefore implemented transmission rate limiting in the new Xen virtual network device drivers, and passed these changes back to the Xen community. 42
  • 48. The transmission limiting system is implemented as a simple credit allocation scheme. Each virtual interface can be allocated a credit of bytes which it can transmit, and a time period between credit replenishments. Each transmission of x bytes through a VIF, uses up x credits on that VIF. Once a VIF runs out of credit, it can no longer transmit packets, until the VIF’s credit is replenished. This allows a domain’s (and therefore a routelet’s) transmission rate to be shaped, for example, limiting a domain to transmitting 500kB of data every 100ms. To implement this, the Xen virtual network interface driver was modified so that it stores each VIF’s cur- rent credit, its maximum credit, its previous replenishment time and its replenishment period. Each time a request is made for the driver to transmit a packet on the network, the size of that packet is compared against the VIF’s remaining credit. If the VIF has enough remaining credit, the credit is decremented by the size of the packet, and the packet is sent. If the VIF does not have enough remaining credit, the next expected replenishment time (the previous credit replenishment time added to the credit replenishment period) is compared against the current time. If the expected replenishment time has passed, then the credit is replenished immediately (with the previous replenishment time set to the current time), and the packet is sent. If the expected replenishment time has not yet passed, a timer is set to replenish the VIF’s credit at the expected replenishment time, and the packet is dropped. Rather than checking the actual time (which involves a slow do gettimeofday system call), times are compared with Linux jiffy based timing. Linux increments a global jiffies variable each time a hardware timer interrupt occurs. This jiffies variable therefore provides a lightweight method of com- paring different times. However, as the jiffies variable is only incremented each time a hardware timer interrupt occurs, it is course grained, with an accuracy of no more than about 10ms on an x86 system. Therefore, this network transmission limiter cannot be used for fine grained traffic shaping. Finally, the Xen management tool was modified, so that the transmission rate of a VIF can be limited dynamically. These changes were submitted as a patch to the Xen project, where they have been inte- grated into the latest version of the Xen virtual machine. The VIF rate limiting command is used by the QuaSAR router to limit the transmission rate of routelets, to ensure partitioning of network transmission between network flows. 43
  • 49. Chapter 7 Experimental Testbed Setup In order to evaluate the performance of the QuaSAR router, an experimental testbed was set up. This testbed was built to allow experiments which could evaluate the QoS (e.g. latency, jitter, throughput etc.) provided by the QuaSAR router, and compare this QoS with that provided by a standard router. An important aspect of this project involves evaluating the partitioning provided by using routelets to route individual network flows. Therefore, a testbed was required which could evaluate the performance of the QuaSAR prototype, in the presence of conflicting flows. This chapter describes the creation of the experimental testbed which was set up in order to evaluate the QuaSAR prototype. Section 7.1 describes the example network which was built for the experiments on the QuaSAR router. Section 7.2 discusses the tools which were used to measure the various parameters of the QoS provided by a router (such as latency, jitter, throughput etc.). Section 7.3 describes some of the changes in the testbed hosts which were necessary to accurately measure relative times of events (necessary for per-packet measurements of transmission latency). Section 7.4 discusses the creation of a MPLS traffic generator, which was used to overwhelm the router while another flow was being measured, thus allowing evaluation of the partitioning a router provides between network flows. Finally, Section 7.5 discusses the problems encountered when trying to overwhelm the QuaSAR router with network traffic, and how these problems were overcome. 7.1 Network Setup A test network was set up in order to perform experiments to compare the QuaSAR prototype’s per- formance to that of a standard router. The main component of this network is therefore a machine which routes packets between subnetworks, acting either as a QuaSAR router, or a standard router. The machine chosen to act as a router had a 1Ghz Pentium 3 processor and 1Gb of RAM. This roughly approximates the type of machine typically used as a software router at the time of this project with, however, more RAM than would be normal, in order to support a large number of routelets running simultaneously. This router machine was loaded with the software necessary to run either as a QuaSAR router, or a standard router. The standard router chosen to compare against the QuaSAR prototype was MPLS Linux, controlled by a version of the RSVP-TE daemon, not modified to support routelets. These are unmodified versions of the software used by QuaSAR to route best effort traffic. Both routers run on similarly configured 2.6.9 Linux Kernels, with QuaSAR’s main router’s kernel targeted at the Xen architecture, and the standard MPLS router’s kernel targeted at the x86 architecture. Both routers also 44
  • 50. run with the same filesystem image, and use the same Linux distribution (SUSE 9.2), with the same Linux configuration, to eliminate unnecessary disparities between the two routers. A router can route packets from one subnetwork (subnet) to another. The number of subnets a router can connect is dependent upon the number of network interface cards present on that router. The machine used as a router had only three PCI slots, therefore it could support a maximum of three network interface cards, limiting the number of subnets that this router could support to three (unless it used non-standard network cards with more than one port, which were not available during this project). The router was fitted with three similar 100Mbit/s Ethernet network cards, each allocated its own subnet (, and Figure 7.1 gives an overview of the hosts which were connected to these subnets, and the overall network structure. The details of how these hosts were used are given below. Figure 7.1: This diagram shows the basic network setup which was used throughout the QuaSAR router experiments. In order to test the QoS provided by each router, network flows must be measured as their packets traverse the router, travelling from one subnet to another. In order to test the effect that the router has on this traffic, no other traffic should be present on the two subnetworks it traverses, otherwise collisions, queuing and other effects within the network could be attributed to the router. Therefore, two of these subnetworks were reserved as the arrival and departure subnets of the network flow(s) whose QoS is being measured. A client host was connected to each of these subnets, one to act as a source ( on the diagram) and the other as a sink ( for the QoS measuring flow(s). The source and sink 45
  • 51. hosts were connected to their subnets using an crossover Ethernet cable between the client host and the network card of the router assigned to that subnet. To test how the use of routelets affects the partitioning between network flows, an additional traffic generation network flow needs to be set up. However, this traffic generation flow cannot pass through the subnets which contain the source and sink of the QoS measuring flow. Otherwise, this traffic generation flow could interfere with the QoS measuring flow, due to elements of the network other than the router (e.g. queuing on a network switch, or collisions in a network hub). Thus, the results derived would not provide an accurate representation of the router’s partitioning capability. Therefore, the traffic generation flow requires its own source and sink subnets. However, the machine used as the router could only support one additional subnet. If both the source and sink traffic generation hosts were placed on the same subnet, their packets would not normally travel through the router (they would be routed directly through the subnet, through the Ethernet switch or hub used to create the subnet). MPLS, though, is a directed, or traffic engineered, routing protocol, so an LSP can be engineered to route traffic through the router, even though it will be entering and exiting the same subnet. Two hosts ( and, used as a traffic generator source and sink, were therefore connected to the final network card of the router using an Ethernet switch. All hosts which were used to inject traffic into the network were Dell Optiplex Gxa desktops, with 233MHz Pentium 2 processors, 128Mb of memory, and 100Mbit/s Ethernet cards. These machines were loaded with the same SUSE 9.2 Linux distribution and 2.6.9 Linux Kernel as that used by the router machine. The same (unmodified) versions of MPLS Linux and the RSVP-TE daemon were used to create, and route LSPs through the network, as those used by the router. The switch used to connect the cross-traffic generating hosts to the router was a Netgear 100/1000Mbit/s GS516T Ethernet switch. 7.2 Network QoS Measurement Tools A number of tools were used in order to measure the QoS provided by the router. The maximum through- put of a network flow through the router was measured using the iPerf tool, described in Section 7.2.1. Per-packet timing measurements through a network flow, such as latency and jitter, were measured using a specially created tool, described in Section 7.2.2. 7.2.1 Throughput Measurement The throughput of the QuaSAR router (and the standard MPLS router for comparison) was measured using the iPerf [40] network measurement tool. iPerf measures the maximum bandwidth of the network between its client and server components. It does this by setting up a TCP connection between the client and the server, then sending as much data as it can from the client to the server. The TCP protocol will use its sliding window flow control mechanism to regulate the traffic between the client and the server, to the maximum throughput supported by the network in between. The TCP flow control mechanism does not send the maximum amount of data across the network continually. Instead, the transmission rate of a TCP flow fluctuates around the maximum transmission rate, due to dynamic changes in the TCP window size introduced by the flow control mechanism used by Linux (TCP Reno [26]). Therefore, iPerf does not measure the absolute maximum bandwidth of a network flow, but does measure the likely maximum bandwidth which an application will receive through the network flow. 46
  • 52. 7.2.2 Per-Packet Timing Measurement The QuaSAR prototype was created with the needs of isochronous (time sensitive) network flows, such as voice over IP or streaming media, in mind. It was therefore important to measure the effect of the QuaSAR prototype router on per packet timing measures, such as latency and jitter, of network flows. A tool was created which mimics the traffic patterns of a simple voice over IP application, and measures the per-packet timing effects of the router on this traffic. This tool sends UDP traffic from a client component to a server component. A 70 byte packet is sent from the client to the server every 20ms (this was the minimum time period which a Linux process can accurately wait for, without using a busy wait loop which would interfere with packet transmission). Just before a packet is sent across the network, a timestamp of the current time is added to the packet. The packet then traverses the network, through the router, until it reaches the server. As soon as a packet is received by the server, the current time is again checked, and both this received time and the sent time (extracted from the packet) are written to a file1 . These two times can be used to measure: the latency of each packet across the network (received − sent); the interarrival time between packets (received−received−1 ); the interdeparture time between packets (sent−sent−1 ); and the jitter induced by the network (interdeparture − interarrival). 7.3 Accurate Relative Time Measurements In order to measure the per-packet latency and jitter using the tool described above, both the sent and received times must be measured on clocks which are accurate to each other, within the expected ranges of per-packet timings. Initially, a network time synchronisation protocol (such as NTP [23]) was con- sidered to synchronise the clocks on hosts A and B while measuring per-packet timings. Using NTP to synchronise the clocks was investigated, however, NTP could only synchronise clocks to an accuracy of about 4000µs, which was an order of magnitude greater than the minimum packet delay through the router, and two orders of magnitude greater than the jitter induced by the router. Another option was to synchronise the clocks of both hosts using a GPS receiver. GPS [28], or the Global Positioning System, consists of a number of satellites orbiting the earth. Each satellite contains an atomic clock, which it uses to transmit a very accurate time signal . This signal can be used by GPS receivers to synchronise a host’s clock very accurately2 . Therefore, two GPS receivers could be used to synchronise the two hosts to a high accuracy, however, there was not sufficient time, or equipment, to use GPS to synchronise the hosts used by this experiment. The final option was to use the same clock to measure both the received time and the sent time. There- fore, both the source and sink of the time measuring network flow must be located on the same host (i.e. and are on the same machine in Figure 7.1). This project used this approach, of routing traffic from one network interface on a host, through the router, then back to another network interface on the same host, for the per-packet timing analysis (however, throughput measurement continued to be measured using separate hosts). As the same host both sends and receives packets, both the source and sink network interfaces contend for the resources of the same system bus, therefore they could interfere with each other. However, the per-packet timing measurement tool only sends a packet every 20ms, therefore it is unlikely that an outgoing packet would interfere with the previous incoming packet, as the network transmission latency was usually much less than 20ms. 1 The server and client machines’ clocks must be synchronised very accurately for this process to produce accurate results. This synchronisation is discussed in Section 7.3 2 This GPS time signal is usually used to find the location of a GPS receiver. The signal takes a certain amount of time to travel from the satellite to the receiver, dependent on the distance between the two. The GPS receiver can therefore work out its exact position by comparing the time difference between received signals from three or more GPS satellites. 47
  • 53. This method required the routing of network traffic from one network interface to another on the same host, whilst still traversing the network. However, Linux forwards any packets addressed to local in- terfaces directly to those interfaces internally, without sending them across the network, no matter how the routing tables are set up. In order to send packets across the network, the host needs to be tricked into thinking it is sending packets to a non-local address. Packets are therefore sent from one network interface to an imaginary address. The other network interface of the host is set up to respond to ARP requests for this imaginary address, therefore the router will route packets, with this imaginary address as their destination, to the host’s other network interface. However, although the packets have traversed the network from one network interface of the host to the other, the host does not know it has to deal with packets which have this imaginary destination IP address (as it is not a local address). Therefore, NAT (Network Address Translation) was set up on the host to translate the destination address of incoming packets, having this imaginary IP address, to the local address of the interface it arrives on. A similar source address translation takes place on outgoing packets, so that the source of the network flow sees replies from the imaginary address, as it expects. The network flow therefore appears, to the host, to be being sent to another host, and be arriving from another host, even though both its source and sink are on the same machine. 7.4 Network Traffic Generation In order to evaluate the partitioning which the QuaSAR prototype provides between network flows, a traffic generation network flow needs to be sent through the router. This induces contention for router resource between the flows, thereby allowing a router’s partitioning of resources between flows to be measured. The hosts generating this conflicting traffic must generate sufficient traffic to overwhelm, or at least stress, the router, so that the effect of this conflicting traffic on the QoS, provided to other flows, can be measured. Initially, it was thought that an application could produce this level of traffic. However, the system call overheads, involved in sending each packet through the network stack from user-level code, only allowed roughly 50,000 packets per second to be sent by a single host. This was not a sufficient number of packets per second to stress the router3 . The traffic therefore needed to be generated within the Linux kernel, where it would not incur these overheads. The Linux Kernel contains a packet generator module, which can be used to generate a certain number of packets per second. This packet generator creates a packet, then repeatedly sends this packet directly to the network interface’s device driver, thereby bypassing the network stack entirely. This allows a single host to generate sufficient packets per second to stress the router. However, this packet generator module only generates IP over Ethernet network traffic, whereas the QuaSAR routelets only process MPLS packets. Therefore, to perform experiments on QuaSAR routelets, an MPLS packet generator was required. A PktGen Click element was created in order to generate sufficient MPLS traffic for experiments on the QuaSAR prototype. A Click router can be configured so that a chain of elements create any type of packet (e.g. an MPLS packet), which is then pushed into this PktGen element. The PktGen element then repeatedly sends this packet to the network interface device driver, in much the same way as the 3 Note that the amount of processing a router must perform to route a network flow is more dependent on the number of packets per second (pps) of the flow, than the flow’s overall bandwidth. This is because a router needs to perform work to route every packet. Simply increasing the bandwidth of a flow by increasing the size of each packet, without increasing the number of pps, does not significantly increase the processing load on the router. Therefore, the maximum stress a network flow can place on a router occurs when that flow is sending the smallest possible packet size (64 bytes in Ethernet) at the maximum throughput (100Mbit/s for Fast Ethernet), i.e. the maximum pps. 48
  • 54. Linux packet generator module does. This PktGen element can be configured with three parameters - the name of the network interface from which the generated traffic should be sent, the number of copies of each packet to send, and the amount of time which the PktGen element should wait between sending each packet. This allows the rate of packet generation, the length of time spent generating traffic, and outbound network interface to be altered. The PktGen element sends packets to the network device driver in a similar way to the ToDevice Click element, however, the sending is surrounded by a loop which sends a single packet the number of times specified by the PktGen’s configuration. Between each of these packet sends, the element waits for the amount of time specified by its configuration. It uses a busy wait, where it iterates in a loop of useless processing, for the required period. The number of iterations necessary to produce the required delay is calibrated when the element is initialised, with the difference in time (measured using the kernel do gettimeofday system call) between a fixed number of busy loop iterations used to measure the number of iterations the machine performs per second. This is used to calculate the number of busy loop iterations necessary to generate a certain wait period. This busy wait method of specifying a wait period obviously involves unnecessary computation, however, it is the only method of very accurately specifying a wait period, as Linux timer interrupt methods are only accurate to tens of milliseconds. A special file handler was added to the PktGen element, so that the actual number of packets per second it has transmitted can be examined. This pps value is calculated by measuring the time at the start of the traffic generation, and at the end of the traffic generation, then dividing the number of packets sent by this time difference. Overall, this PktGen element allows a single host (with a 233MHz processor) to generate up to 100,000 64 byte MPLS packets per second on a Fast Ethernet card. 7.5 Straining Router with Generated Traffic In order to test the partitioning provided by routelets, the generated traffic needs to stress the router, so that its effect on other flows can be investigated. The PktGen element can transmit a maximum of 100,000 pps. It was found that this level of traffic did not cause a noticeable change in the QoS provide by the router to other network flows. In order to stress the router, an attempt was made to use more than one host to generate an aggregate flow of more than 100,000 pps through the router. However, recall from Section 7.1, there was only one network interface card available for stress inducing traffic. Additional hosts were connected to the Ethernet switch, where the switch combined the traffic generated by these flows and sent it through the single link to the router. However, this aggregation of traffic from additional hosts actually decreased the load on the router. It seems that collisions and queued traffic decreased the overall pps which could be sent to the router. Since the amount of generated traffic could not be increased, the processing power of the router needed to be reduced so that the effect of processing this traffic could be measured. There was not enough time to make all the necessary changes to another, slower machine, in order for it to act as the QuaSAR and standard MPLS routers. Therefore the original machine had to be slowed down in some way. An attempt was made to under-clock the machine being used. Under-clocking is a process by which the machine’s processor’s clock speed is deliberately slowed down (e.g. 500MHz instead of 1GHz), meaning that the processor performs fewer operations every second. Unfortunately, the machine being used as the router did not have any controls (either hardware or software) which could be used to change the processor’s clock speed. 49
  • 55. Another option was to start a busy wait process, which simply used up CPU cycles uselessly. However, when this process was run in user-mode, the Linux Kernel (which was routing the packets) simply pre- empted this busy wait process when it had any work to do, therefore, a user-mode busy wait process had no effect on the routing of packets. The busy wait process could have been run as a kernel thread, where it would not be pre-empted by any packets being routed. However, although the standard MPLS router only has one kernel, the QuaSAR router has multiple kernels running, depending upon the current number of routelets. Therefore, the CPU consumed by the busy wait thread(s) would have to be shared across all of these kernels. It would there- fore be very difficult to generate the same busy wait overhead for each flow on the QuaSAR prototype as for each flow in the standard MPLS router. Instead, an approach was taken which would induce the same overhead in both routers. A clock timer interrupt is produced by the x86 system clock every 1ms 4 . This interrupt is handled by the Linux Kernel every time it occurs. However, in a Xen system, the interrupt is first handled by the Xen VMM, before it is passed to any of the Guest operating systems. Therefore, a busy wait loop could be inserted into the Xen timer interrupt handler on the QuaSAR router, and the Linux timer interrupt handler in the standard MPLS router, to slow down both routers equally. Although this slows down both routers equally, all of the slowdown occurs at the very start of every 1ms period. Since the average latency (on an unloaded router) is an order of magnitude less than this 1ms period, the per-packet timings could be offset in some unexpected way. However, after averaging over a significant number of packets, this effect should even itself out. This approach of slowing down the router by modifying the timer interrupts was only used in the exper- iments which measured the partitioning effect of the router (Section 8.2). All other experiments used unmodified timer interrupt handlers. Section 8.2.1 compares the unloaded QoS provided by each of the routers, and shows that the effect of slowing the router using this method does not overly affect the results. 4 This period is actually programmable at system boot, however both Xen and Linux use the same timer period. 50
  • 56. Chapter 8 Experimental Results and Analysis In order to evaluate the QuaSAR prototype router, a number of experiments were performed to assess the Quality of Service provided, by the router, to various network flows. These experiments can be split into two main sets: those which investigate the overhead which virtualisation introduces (Section 8.1); and those which examine the partitioning provided by routelets to individual network flows (Section 8.2). 8.1 Virtual Machine Overhead The major differences between the QuaSAR router and a standard software router is the use of virtual machine technologies. The use of virtual machines will always incur an overhead, because of guest operating system context switches, protection enforced by the VMM and the additional memory required to run multiple operating systems at once. It is, therefore, important to evaluate the overhead incurred within the QuaSAR router due to its use of virtualisation. In order to determine the overhead incurred by the use of virtual machine technologies in QuaSAR, and discover what causes this overhead, the Quality of Service provided by a number of different routers was measured. A network was set up, as described in Section 7.1, with no cross-traffic being generated. These experiments, therefore, measure the ideal QoS characteristics provided by the router when it is under no additional load other than the flow being measured. Three QoS characteristics were measured - latency, interpacket jitter, and maximum throughput. These characteristics were measured for three types of router: • A standard MPLS router. This router does not use any virtualisation technology, and provides the baseline for these results. • QuaSAR’s best effort router. This is effectively the same as the standard MPLS router, however it runs within a Xen virtual machine. These experiments run with QuaSAR’s best effort router running within a privileged Xen domain, with full access to the physical network devices and no other guest operating systems running concurrently. This, therefore, evaluates the overhead caused by simply running on top of Xen’s virtualisation layer. • Quasar’s routelet. This involves routing packets from the measurement flow through a QuaSAR routelet. The routelet runs within a single unprivileged domain, with incoming packets being clas- sified within the privileged best effort router, as described in Chapters 5 and 6. These experiments evaluate the overhead involved in classifying packets, and passing them between domains. 51
  • 57. The QuaSAR routers were also tested, where possible, with two Xen domain scheduling policies - Borrowed Virtual Time (bvt) and Simple Earliest Deadline First (sedf). The bvt scheduler does not provide any guarantees to routelets, apart from proportional fairness, however it was stable throughout the project. The sedf provides soft real time guarantees to routelets, however it was not stable and many experiments caused this scheduler to fail. In all virtual machine overhead experiments, the schedulers were set up so that they did not limit the CPU time to any domain, i.e. all domains were given a fair share of the CPU time, and any domain could pre-empt another when it received an interrupt or an event notification. This attempts to limit any unexpected results caused by one domain being given priority over another (The partitioning experiments in Section 8.2 purposely give some domains priority over others, in order to assign different amounts of CPU time to each flow). 8.1.1 Latency The unloaded latency of each router was measured using the tool described in Section 7.2.2, using the network setup shown in Figure 8.1. This tool measures end to end packet latency, therefore, it includes the latency induced by the network and the time packets spend traversing the host’s network stack. It, therefore, does not produce an absolute value for the time a packet takes to travel through the router. However, the latency induced by the network and the host’s network stack stays constant throughout the experiments. Therefore, the experiment provides an accurate measure of relative difference between the latencies of each type of router and thus the extra overhead incurred by each type of router. Figure 8.1: This diagram shows the network set up used to perform per-packet timing experiments. Note that the source and sink machines are the same, in order to accurately synchronise time measurement (see Section 7.3). The Labels show the interfaces real IP addresses, and their fake addresses used to trick Linux into sending the packets across the network. The latency of the network and the host’s network stack can, however, be approximated in order to provide a more accurate approximation of the absolute time spent by a packet in each type of router. Each packet transmitted by the latency measuring tool is 70 bytes + a 14 byte Ethernet header + a 20 byte IP header + 8 byte UDP header = 112 bytes. Each 112 byte packet traverses two 100Mbit/s Ethernet links (one from the host to the router, the other from the router to the host), therefore the time a packet spends on the network can be approximated to: (112 × 8) 2× = 17.9µs (8.1) 100, 000, 000 The time spent traversing the host’s networking stack is dependent upon the number of operations which need to be performed on each packet between the application layer and the IO necessary to send the packet from the physical network device. When testing the traffic generating capabilities of the hosts, it was discovered that they could send roughly 50,000pps from user level, before the network stack over- head overwhelmed the host’s CPU (see Section 7.4). These hosts were clocked at 233MHz, therefore 52
  • 58. the number of cycles necessary to process a single packet through the network stack can be estimated as: CP U Cycles per Second 233, 000, 000 = = 4660 cycles per packet (8.2) M aximum P P S 50, 000 Therefore, the time taken for a packet to traverse the host’s network stack twice (once when sending and once when receiving) can be very roughly estimated as: 4660 2× = 40µs (8.3) 233, 000, 000 Therefore, the per-packet latency due to factors other than the router can be estimated as being 57.9µs. These experiments measure the per-packet latency. However, since this value is very small, it can be greatly affected by random events, such as scheduling delays, cache misses, interrupt handling etc. Therefore, each run of the experiment measured the latency of a large number of packets, so that these fluctuations could be averaged out. To ensure that each run is averaged over enough packets to provide a steady result, a run for each router was graphed to show how the average latency changes as the number of packets used to create that average latency increases. The spread of results is also measured against the number packets measured in a run using 10th and 90th percentile error bars. Figure 8.2 shows these graphs for each of the router types. These graphs show that the average latency becomes more steady as the number of packets averaged over increases. They also show that the spread of latency values is bound because the 90th / 10th percentile error bars reduce to their minimum, then hold steady at that minimum and do not fluctuate considerably (notice the QuaSAR best effort router using the sedf scheduler has a considerably larger spread of values, shown by its larger error bars, which will be discussed later). The random fluctuations in per-packet latency average out after between 500 and 1000 packets, depending on the router type. Therefore, each run should be averaged over at least 1000 packets to remove fluctuations and ensure a consistent result. Each run was chosen to run over 3000 packets, since this did not considerably affect the overall time required by experiments. To calculate the number of 3000 packet runs which are necessary to give a stable result, the average latency was graphed against the number of runs averaged over. Figure 8.3 shows this graph from 1 to 5 runs (note the Y axis range has been decreased, compared with the previous graphs, to show the results in more detail). As can be seen, after about three to four runs, the average latency does not fluctuate considerably, and the error bars do not increase any further. Therefore, at least four separate runs must be made to ensure a consistent result. In order to add some leeway, all experiments were averaged over five runs. The average per-packet latency induced by each of the router types is shown by Figure 8.4. A 70 byte packet takes, on average, 194µs to travel through the network using the standard MPLS router. Of this, it is estimated that 136.1µs is spent within the router (recall from above that it is estimated that each packet spends 57.9µs traversing the network, and being processed by the host’s network stack). Figure 8.5(a) shows the per-packet timings for every packet in a single experimental run. From this graph, it is clear that most of the packets incur the minimum latency of about 190µs, with another significant, but much less substantial band at 230µs and then a scattering of packets delayed for much longer. The same packet will have, on average, a latency of 204µs through the network when being routed by the best effort QuaSAR router. Recall, the QuaSAR best effort router (when it uses the bvt Xen scheduler) is running the same routing software as the standard MPLS router, but from within a Xen virtual machine. Therefore, the overhead incurred by simply using Xen, even without context switches between domains, is estimated as about 10µs. This is 7.3% of the time spent in the router, or 5.2% of the overall time spent by each packet within the network. The per-packet timings in Figure 8.5(b) are relatively similar 53
  • 59. 450 400 350 average latency (μs) 300 250 200 150 100 50 0 0 500 1000 1500 2000 2500 3000 number of packets averaged over (a) Standard MPLS router 450 450 400 400 350 350 average latency (μs) average latency (μs) 300 300 250 250 200 200 150 150 100 100 50 50 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 number of packets averaged over number of packets averaged over (b) Quasar Best Effort (bvt) (c) Quasar Routelet (bvt) 450 450 400 400 350 350 average latency (μs) average latency (μs) 300 300 250 250 200 200 150 150 100 100 50 50 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 number of packets averaged over number of packets averaged over (d) Quasar Best Effort (sedf) (e) Quasar Routelet (sedf) Figure 8.2: These graphs compare the number of packets in a run, against the average latency. The error bars show the 10th / 90th percentile values (for every increase of 50 packets). 54
  • 60. 260 average latency (μs) 240 220 200 180 1 2 3 4 5 run number (a) Standard MPLS router 260 260 average latency (μs) average latency (μs) 240 240 220 220 200 200 180 180 1 2 3 4 5 1 2 3 4 5 run number run number (b) Quasar Best Effort (bvt) (c) Quasar Routelet (bvt) 260 260 average latency (μs) average latency (μs) 240 240 220 220 200 200 180 180 1 2 3 4 5 1 2 3 4 5 run number run number (d) Quasar Best Effort (sedf) (e) Quasar Routelet (sedf) Figure 8.3: These graphs compare the number of runs used to calculate the average latency, against the value of the average latency. The error bars show the minimum and maximum run average values. 55
  • 61. 250 average latency (μs) 200 150 100 50 0 Standard Quasar Best Quasar Quasar Best Quasar MPLS Effort (bvt) Routelet (bvt) Effort (sedf) Routelet (sedf) Figure 8.4: The average latency of each type of router. The error bars show the minimum and maximum run (not per packet) latencies. to those of the standard router in Figure 8.5(a), but simply shifted up by about 10µs. Therefore, Xen’s virtualisation appears to incur a consistent overhead in each packet’s latency. This suggests that the overheads incurred purely by Xen are caused by overheads in interrupt handling (which occurs for each packet) and / or overheads in the operations necessary to process each packet, rather than random (e.g. scheduling) delays. When a packet is being processed by a QuaSAR routelet, it passes through two domains, the privileged domain with access to the physical devices, and the QoS routelet’s domain. It is, therefore, expected that there is an additional per-packet overhead caused by moving between domains. Packets processed by a routelet have an average latency of 249µs, therefore, processing a packet through a routelet adds an additional overhead of 45µs, over and above the overhead induced by Xen’s virtualisation (22% of the overall network time, 30.1% of the routing time). The major additional overheads, when processing a packet in a routelet, involve moving the packet between the domains twice (once from the privileged domain to the QoS routelet when the packet arrives, then again from the routelet to the privileged domain for transmission). Moving a packet between domains involves transferring ownership of the area of memory where the packet is stored from one domain to another, then context switching between the domains. Transferring the memory ownership from one domain to another involves manipulating the page tables of both domains and some reasonably complex event channel communication between domains to initiate this process. However, this memory transfer does not account for such a large increase in per-packet overhead. The two domain context switches, on the other hand, involve a number of operations which could signif- icantly contribute to this overhead. Firstly, all of the processor’s state (for example, register values) has to be stored to memory, and the previously saved state of the new domain has to be restored. A more im- portant overhead involved in context switching between domains involves the change of address space. Since different domains have different virtual memory address spaces (to ensure one domain does not have unauthorised access to another domain’s memory), a context switch between the domains must 56
  • 62. 800 700 600 average latency (μs) 500 400 300 200 100 0 0 500 1000 1500 2000 2500 3000 packet number (a) Standard MPLS router 800 800 700 700 600 600 average latency (μs) average latency (μs) 500 500 400 400 300 300 200 200 100 100 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 packet number packet number (b) Quasar Best Effort (bvt) (c) Quasar Routelet (bvt) 800 800 700 700 600 600 average latency (μs) average latency (μs) 500 500 400 400 300 300 200 200 100 100 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 packet number packet number (d) Quasar Best Effort (sedf) (e) Quasar Routelet (sedf) Figure 8.5: The per-packet latencies for an example run of each type of router 57
  • 63. change the page table used to translate virtual addresses into physical memory addresses. This involves invalidating the TLB (Translation Look-aside Buffer - the hardware lookup table used for virtual to physical memory address translation), as the entries it contains will no longer be appropriate for the new domain. The memory cache will also no longer be relevant, as it is likely to be filled with entries used by the old domain. Therefore, a context switch involves a major direct overhead, due to saving and restoring the processor state, as well as major indirect overheads, e.g. slower initial memory access and address translation, caused by the hardware caches being invalidated. It is revealing to roughly estimate the amount of time which is spent processing each of these overheads. The total overhead can be expressed as: (2 × context switch cost) + (2 × packet transf er cost) + other costs = 45µs (8.4) where other costs account for packet demultiplexing and network device bridging. Preliminary experi- ence indicates that the other costs account for approximately 5µs per packet, and that a context switch is approximately three times more costly (due to its indirect overheads) than transferring a packet be- tween domains. When this information is applied to equation 8.4, one finds that a single context switch accounts for 15µs of the overhead and transferring a packet between domains takes approximately 5µs. Note that these overheads have not been directly measured, they are estimates which are simply provided to give a rough overview of the relative sizes of the overheads involved. The per-packet timings in Figure 8.5(c) show that there is a significant increase in the direct, per-packet latency overheads, since the minimum packet latency increases by about 35-40µs compared with the previous two routers. However, there is also a significant increase in the number of packets scattered above this minimum latency, suggesting that indirect overheads, such as cache invalidation and schedul- ing delays, produce a large proportion of the increased overheads. The final two values show the per-packet overhead incurred by running the QuaSAR router with a soft real time scheduler (sedf). The use of a real time scheduler will introduce additional overheads, due to a more complex scheduling operation and domains not necessarily being scheduled immediately when a packet arrives, due to scheduling constraints. The use of an unfinished version of the sedf scheduler is also likely to have contributed to the additional overhead. An additional overhead of 20µs (224µs overall) is incurred in the QuaSAR best effort router when using the sedf scheduler. However, when the actual per-packet timings are examined on an run of this experiment (Figure 8.5(d)) a regular patterning is clearly visible. The solid band of minimum packet latencies in both Figure 8.5(b) and Figure 8.5(d) does not appear to increase significantly when moving from the bvt to the sedf scheduler. Therefore, almost all of the increase in average latency is caused by this patterning effect, rather than a consistent 20µs overhead in every packet. This patterning effect suggests a delay between the time a packet arrives, and a domain being woken up. However, in this experiment, there is only one domain running and that domain is given full access to all the available CPU time, so the domain should never need to be woken up. Therefore, this patterning appears to be an artifact of the sedf scheduler being unfinished at the time of the experiment. The sedf scheduler increases the latency of the QuaSAR routelet by 10µs to 259µs. This increase of 10µs is less than the 20µs incurred by the sedf scheduler in the best effort router. This is unexpected, since packets passing through a routelet cause two additional scheduling decisions. Therefore, it would be expected that the additional scheduling overhead of the sedf scheduler would be more prominent with the QuaSAR routelet. However, comparing Figures 8.5(d) and 8.5(e) it is clear that the sedf routlet configuration does not suffer from the same regular latency patterning as the sedf best effort router. It appears that the context switches between domains, necessary to route packets through a routelet, prevents the apparent delay in domain wake up which may cause the patterning in the sedf best effort router. Thus, the 10µs additional overhead induced by the sedf scheduler in the QuaSAR routelet router, 58
  • 64. is more representative of the overhead which should be expected if a finished soft real time scheduler was used, than the 20µs overhead observed in the QuaSAR best effort router. 8.1.2 Interpacket Jitter Jitter between packet arrival times is another important QoS measure, especially in isochronous network flows, such as VOIP traffic. The jitter induced by each type of router can be estimated by measuring the inter-packet arrival time (i.e. the time difference between arriving packets) and comparing this with the packet period. The inter-packet arrival time is calculated by subtracting the arrival time of one packet, with the arrival time of the next packet (i.e. received − received−1 ). The period between packets is intended to be 20µs, to imitate a VOIP traffic flow, however the timers used by Linux are not accurate enough to guarantee that every packet is sent exactly 20µs after the other. Therefore, to accurately measure the jitter induced by each type of router, the inter-arrival time should be compared against the inter-departure time (sent − sent−1 ). The inter-departure time accurately measures the difference between each packets actual send time. The per-packet jitter can, therefore, be accurately calculated with interarrival − interdeparture. Figure 8.7 shows the jitter induced by each router type over a run of 3000 packets. As can be seen, the jitter fluctuates around 0µs (positive jitter must be balanced by negative jitter, oth- erwise an infinite delay will gradually build up), therefore simple averaging will not generate any useful value. Instead, the root mean square average was taken over each packet’s jitter value, to find the aver- age deviation from an ideal of zero jitter, induced by each router type. These experiments used the same raw data as the latency experiments, therefore the jitter measurements were also averaged over 5 runs of 3000 packets. Figure 8.6 shows the average root mean squared jitter induced by each of the router types. 60 50 Interpacket Jitter (us) 40 30 20 10 0 Standard QuaSAR Best Quasar Quasar Best Quasar MPLS Effort (bvt) Routelet(bvt) Effort (sedf) Routelet (sedf) Figure 8.6: The average root mean squared jitter of each type of router. The error bars show the minimum and maximum run (not per packet) jitter values. The jitter between packets when they are sent through the standard MPLS router is, on average, 12.4µs. Figure 8.7(a) shows that most of the packets have a very small jitter (< 10µs), however, there is a thin 59
  • 65. 600 400 200 Jitter (μs) 0 -200 -400 -600 0 500 1000 1500 2000 2500 3000 packet number (a) Standard MPLS router 600 600 400 400 200 200 Jitter (μs) Jitter (μs) 0 0 -200 -200 -400 -400 -600 -600 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 packet number packet number (b) Quasar Best Effort (bvt) (c) Quasar Routelet (bvt) 600 600 400 400 200 200 Jitter (μs) Jitter (μs) 0 0 -200 -200 -400 -400 -600 -600 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 packet number packet number (d) Quasar Best Effort (sedf) (e) Quasar Routelet (sedf) Figure 8.7: The per-packet jitter for a single run of each type of router 60
  • 66. band of packets with approximately 25µs jitter, and a small scattering of packets with a much larger jitter. The thin band of packets with about 25µs of jitter suggest a regular, intermittent delay. As packets in this band occur roughly every 1 second, this implies that their extra jitter is caused by a timer event which occurs every second. When packets are passed through the QuaSAR best effort router, the jitter increases to 15.1µs, an in- crease of 2.7µs or 21.7%. Closer investigation of the per-packet jitter values (Figure 8.7(b)) reveals that the vast majority of packets still experience less than 10µs of jitter. The band of packets, which experience additional jitter every 1 second, have a jitter of about 50µs, double that of the same band when routed through the standard MPLS router. This suggests that Xen doubles the time taken to pro- cess the timer event which causes this delay, than Linux does in the standard MPLS router (if indeed it is caused by a timer expiring). This may occur if the timer event is being processed by both the Xen VMM, and any other domains running. Therefore, with only a single domain running, each of these timer events would incur twice as much work - once when being run in the in the Xen VMM, then again being processed by the best effort Linux kernel. The QuaSAR best effort router also suffers from more large jitter packets than the standard MPLS router. This suggests that Xen incurs more large, random delays, probably due to processing within the Xen VMM. The average jitter of packets being routed by a QuaSAR routelet is 16.3µs, 1.2µs or 7.9% greater than the jitter produced by the QuaSAR best effort router. Since the QuaSAR routelet incurs such a large increase in overhead, it is surprising that the Quasar routelet does not increase packet jitter more than it does. Processing packets through a routelet involves at least two context switches and all of the scheduling delays and cache misses that entails (see Section 8.1.1). These effects therefore cause fairly consistent increases in delay, without inducing significant jitter. A closer look at the per packet values in Figure 8.7(c) shows that the vast majority of packet jitter values are still less than 10µs. The extra jitter seems to be caused by an increased percentage of packets in the 10 to 100µs range. It is difficult to estimate why there are more packets in this range, however, they still seem to appear in clumps separated by about 1 second, suggesting they may also be caused by the timer event which delays packets in the QuaSAR best effort router. The QuaSAR best effort router, using the sedf scheduler, incurs a massive average jitter of 55.5µs, 40.4µs (267.5%) greater than the same router using the bvt scheduler. Examining the per-packet jitter values (Figure 8.7(d)) shows that the vast majority of packets still have a jitter of less than 10µs, therefore this increase is not caused by a uniform increase in each packet’s jitter. The massive increase in jitter is, instead, clearly due to the regular diagonal bands of increasing jitter. These delayed packets occur every 200ms (i.e. every tenth packet), with each of these delayed packets having roughly 50µs more delay than the previously delayed packet. After about 10 of these delayed packets (every 2 seconds), the delay of these packets decreases to zero again. This strange cycle of delayed packets is highly suggestive of a deficiency in the unfinished sedf scheduler, which should be resolved before the scheduler is released. The sedf scheduler increases the jitter of the QuaSAR routelet by 4.7µs or 28.8% (to 21µs), compared with the bvt scheduler. Again, Figure 8.7(e) shows that the majority of packets have a jitter of less than 10µs. The additional jitter appears to be caused by an increase in the number of packets with a jitter of between 10 and 100µs, rather than an increase in the jitter of packets in this range. There were also more, seemingly random, packets with a jitter greater than 100µs. 8.1.3 Throughput The maximum throughput which each type of router can support through a single network flow was investigated using the iPerf network performance measurement tool (see Section 7.2.1). Figure 8.8 shows the network setup which was used for these throughput experiments. This measurement stresses 61
  • 67. Figure 8.8: This diagram shows the network setup used to perform throughput measurements. the router more than the previous experiments, as it involves sending a much greater traffic flow than that sent by the per-packet timing measurement tool. However, the throughput measurement is more dependent upon the network speed, and the ability of the hosts to produce the required amount of traffic. Figure 8.9 shows the maximum single flow throughput which each type of router can support, averaged over 5 runs. The results show that the overhead, introduced by Xen, decreases a router’s maximum single flow throughput by 0.1%, from 93.68Mbit/s to 93.58Mbit/s. The overheads involved in passing packets to a routelet for processing incur a further 0.02% decrease in throughput, however, this is well within the bounds of experimental uncertainty. Overall, the QuaSAR router does not significantly affect the throughput of a single network flow when using the bvt scheduler. However, when it uses the sedf scheduler, more significant decreases are seen in throughput. The QuaSAR best effort router’s throughput decreases by 2.6% to 91.24Mbit/s when using the sedf sched- uler. This sharp decrease is almost certainly due to the patterning effect seen in the timing experiments performed on the sedf best effort router. Since TCP bases its flow control mechanism on timeouts and packet loss, the delay pattern brought about by the sedf scheduler may confuse TCP’s flow control into thinking that the network has a lower maximum bandwidth than it does. Noticeably, the graph does not report the maximum throughput of a QuaSAR routelet being scheduled using the sedf scheduler. This is because this type of router would not support the bandwidth created by the iPerf tool at all. The routelet’s domain would immediately hang when the experiment was started, so no results could be determined for this router type. These results prove that the sedf scheduler was not stable enough to be used within the QuaSAR router, therefore, the sedf scheduler was not used for any further experiments. The results of these experiments are, however, limited. The throughput of a single flow is limited to 100Mbit/s, because of the speed of the Ethernet network used to connect the hosts to the router. In addition, the specification of the host, used to generate this traffic, was significantly lower than that of the router. The host seemed to struggle to cope with the overheads involved in sending this level of traffic, and its CPU time was fully utilised before it even reached the maximum throughput of 100Mbit/s. Therefore, the results of this experiment were heavily constrained by factors other than the type of router. Each type of router would support an overall throughput, much greater than that of the single flow presented here, however, the router used did not have enough network cards to fully investigate the maximum overall throughput which each router could support. The partitioning experiments in Section 8.2 provide a better overview of the overall traffic load which each router type can sustain, however, these throughput experiments do identify the limitations placed upon each flow due to the 62
  • 68. 95 94 average TCP bandwidth (Mbits/s) 93 92 91 90 Standard Quasar Best Quasar Quasar Best MPLS Effort (bvt) Routelet (bvt) Effort (sedf) Figure 8.9: The average maximum throughput of a single flow in each type of router virtualisation overheads. 8.2 Network Flow Partitioning The main purpose of the QuaSAR router was to use virtualisation techniques to improve the partitioning between network flows. In order to evaluate the effectiveness of the QuaSAR router at providing parti- tioning between network flows, the QoS of one network flow was measured, as a competing flow tried to use an increasing proportion of the router’s resources. Figure 8.10 shows the network setup which was used to perform these partitioning experiments. These experiments mainly evaluate how well each type of router can partition the CPU processing required by different flows, as these flows do not share the same network resources of the router. These experiments again evaluate the three router types (standard MPLS, QuaSAR best effort and QuaSAR routelet), however, all QuaSAR experiments were performed under the bvt scheduler, as the sedf scheduler was not stable enough to support the traffic loads this experiment employed. The QuaSAR routelet experiments used two routelets, one for processing of the measured flow, and one for processing of the cross-traffic flow. This meant that the resources used by the cross-traffic could be controlled better than if it was processed as best effort traffic. The problems of controlling the cross-traffic in the best effort router are discussed in Section 8.3. The cross traffic-flow emulates a misbehaving flow which is sending more traffic than it requested re- sources for beforehand. In this situation, the Xen scheduler should only assign CPU time to the cross- traffic’s routelet for the requested traffic load, not any extra. However, with the bvt scheduler it is not possible to guarantee that the cross-traffic’s routelet is not being assigned more CPU time than it was guaranteed. It is only possible to weight the cost assigned to each domain for a slice of CPU time, so that, ideally, it is scheduled often enough to service the QoS requirements of the requested flow and no more. The QuaSAR routelet experiments therefore weighted both routelets’ (the routelet processing the measured flow and the one processing the cross-traffic) domains so that they were likely to be assigned 63
  • 69. Figure 8.10: This diagram shows the network setup used to perform partitioning experiments. Note that the cross-traffic does not traverse network links used by the measured flow. enough CPU time to process the traffic of the measured flow (so that the scheduler did not limit the measured flow), however, this could not be guaranteed. The same Quality of Service parameters of latency, jitter and throughput were measured in each type of router, while that router was being loaded with an increasing amount of cross-traffic. The parameters were measured using the same tools as were used to evaluate the unloaded routers in Section 8.1, with each measurement also averaged over 5 runs and each per-packet timing run consisting of 3000 packets. 8.2.1 Effect of Timer Interrupt Busy Wait In order to produce usable results with these experiments, the machine used as the router needed to be slowed down. This is because the maximum amount of interference traffic which could be injected into the network was not enough to significantly affect the QoS provided by each of the router types to the measured flow. Therefore, if the routing machine had not been slowed down, the QoS characteristics provided by the routers would have been the same across the whole range of interference traffic load. This would not allow any useful conclusions to be drawn, which had not already been discovered in the unloaded router experiments. The method used to slow down the router machine (described in Section 7.5) involved inserting a busy 64
  • 70. 20 250 Without Busy Wait Without Busy Wait With Busy Wait With Busy Wait 200 15 Latency (μs) Jitter (μs) 150 10 100 5 50 0 0 Standard Quasar Best Quasar Standard Quasar Best Quasar MPLS Effor(bvt) Routelet(bvt) MPLS Effor(bvt) Routelet(bvt) (a) Latency (b) Jitter 95 Without Busy Wait With Busy Wait 94 Throughput (Mbit/s) 93 92 91 90 Standard Quasar Best Quasar MPLS Effor(bvt) Routelet(bvt) (c) Throughput Figure 8.11: This figure shows the difference in unloaded performance of each router type when it is slowed down with the timer interrupt busy wait loop. The three QoS parameters used for the partitioning experiments (Latency, Jitter and Throughput) are shown. wait loop into the timer interrupt handler. While this did mean that the available range of cross-traffic could produce measurable differences in the QoS characteristics of each router type, it is possible that it could introduce unexpected consequences. For example, since the slow down overhead occurs at the start of every 10ms period, the packet jitter could be inadvertently affected. To investigate whether this busy wait loop substantially affected the results presented here, the QoS provided by each unloaded router was investigated both with and without the added busy wait loop. Figure 8.11 shows the results of this investigation. It can be seen that the busy wait loop decreases the performance of each router type in a uniform manner. The latency of each packet’s transmission time increased by between 2-4% for each type of router. The jitter introduced by each router type increased more, by between 5-10% depending upon the router type. However, it is expected that the jitter would increase more significantly than the latency. The busy wait loop introduces all of its delay at the start of every 10ms period, therefore packets just before, and just after the busy wait period will experience significant extra delay and thus jitter. The busy wait loop has much less impact on the maximum, per flow, throughput of each router (between a 0.1-0.3% decrease depending upon the router). Since the 65
  • 71. throughput is more heavily constrained by other factors, such as network speed and host performance, it was expected that the busy wait loop would not greatly affect an unloaded router’s per-flow throughput. Overall, the busy wait loop within the timer interrupt does seem to provide the required result of slowing the machine down in a reasonably uniform manner. Loading the router may cause some unforseen consequences, however, this cannot be directly uncovered since there are no results to compare with the loaded router characteristics. 8.2.2 Latency Figure 8.12 shows the effect of increasing cross-traffic on the latency of a QoS flow in each type of router being evaluated. The latency of the standard MPLS router is not significantly affected until the cross-traffic reaches about 90,000 pps. At the maximum cross-traffic load of 97,700 pps, the latency of the measured flow increases by 140% to 480µs. The latency seems to increase exponentially after 90,000 pps, however, this cannot be confirmed without increasing the cross-traffic load placed upon the router. 20000 "Standard MPLS" "Quasar Routelet (bvt)" 18000 "QuaSAR Best Effort (bvt)" 16000 14000 12000 Latency (μs) 10000 8000 6000 4000 2000 0 0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 Interference Flow Packets per Second Figure 8.12: This graph shows the effect of an increasingly badly behaved network flow on the latency generated on another flow for each type of router. The QuaSAR best effort router has a very different latency characteristic as the router becomes more loaded by the cross-traffic. The latency is not significantly affected until the cross-traffic reaches 60,000 pps, however, between 60,000 and 65,000 pps, the latency suddenly jumps by 2700% to about 6000µs. After 65,000 pps the latency does not significantly increase any further. Recall, that the QuaSAR best effort router uses the same routing software as the standard MPLS router, but within a Xen domain. It therefore identifies the overheads produced by the Xen VMM itself. Such a sudden jump, then no further increase, is highly suggestive of a timer of some sort being missed, rather than an increased per-packet 66
  • 72. overhead caused by Xen. It is possible that the increasing cross-traffic causes the Xen domain scheduler to block the best-effort routing domain. However, since no other domains are running in this experiment, the best effort router should never blocked by the Xen scheduler. It is also possible, although unlikely, that this characteristic is caused by the busy wait loop which was added to Xen’s timer handler to slow the routing machine down. The QuaSAR routelet’s latency also starts rising at 60,000 pps, however, it does not rise in a sudden jump. Instead, this rise starts off gradually, then increases exponentially, eventually reaching almost 100 times the original latency. The overheads incurred by virtualisation clearly prevent the virtual routelet approach achieving a greater partitioning between flows than the standard MPLS router. However, the virtual routelet approach does outperform the best effort router’s latency when the cross-traffic is be- tween 60,000 and 80,000 pps. It seems that the virtual routelet approach prevents the, possibly timer based, glitch which Xen causes at 60,000 pps, by limiting the time spent processing cross-traffic pack- ets. However, increasing the cross-traffic after this point causes an exponential rise in the time spent processing packets from the measured flow. It seems likely that this exponential increase is caused by context switches between domains. Each time a packet arrives from the cross-traffic flow, it causes an interrupt in the router. Xen may delay processing this interrupt somewhat, however, the bvt scheduler is specifically built to make each domain appear highly interactive. Therefore, it is likely to immediately context switch to the domain which handles the interrupt as soon as one arrives (as this usually provides low latency to asynchronous external events, making the domains appear more responsive). If the measured flow’s routelet is processing a packet and an arriving cross-traffic flow packet causes an interrupt, then the bvt scheduler may immediately context switch to the privileged domain, to handle this interrupt, then context switch back to the routelet’s domain to continue processing the packet. Since context switches between domains are very costly (see Section 8.1.1), this exponentially increases the time taken to process a packet in the routelet. This failing could have been negated by using a soft real time scheduler, which doesn’t context switch to process an interrupt until the currently running domain has used its guaranteed time slice. Another approach would be to poll the network card for packets instead of using an interrupt driven model. Neither of these approaches were available during this project for reasons discussed earlier in this report. 8.2.3 Interpacket Jitter Figure 8.13 shows the effect of increasing cross-traffic on the jitter of a QoS flow in each type of router being evaluated. As can be seen, virtualisation does not have such an extreme effect on a flow’s jitter, compared with latency, when the router is under strain. When being processed by the standard MPLS router, the flow’s jitter stays relatively steady until the cross-traffic reaches 80,000 pps. At this point, the jitter steeply increases, until it has increased to an average of 209µs at 97,700 pps. It looks likely that this increase would continue at a similar rate, if more cross-traffic could be generated. The jitter characteristic for the QuaSAR best effort router is similar to its latency characteristic. The jitter suddenly increases substantially, to 370µs when the cross-traffic reaches about 60,000 pps, but does not increase any further with increasing cross-talk. The possible missed timer in Xen, which may be the cause of the sharp increase in latency at 60,000 pps, could also be the cause of the substantial increase in jitter at the same cross-traffic load. This suggests that the timer, which is being missed by Xen, does not equally affect each packet (otherwise the jitter would not increase as the average latency increases). Instead, it seems that the missed timer substantially increases the latency of some packets, but does not affect other packets, increasing the average latency as well as the interpacket jitter. The QuaSAR routelet’s jitter more closely follows the jitter induced by the standard MPLS router than the QuaSAR best effort router does. At the maximum cross-traffic load of 97,700pps, the QuaSAR 67
  • 73. 450 Standard MPLS QuaSAR Best Effort QuaSAR Routelet 400 350 300 250 Jitter (μs) 200 150 100 50 0 0 20000 40000 60000 80000 100000 Interference Flow Packets per Second Figure 8.13: This graph shows the effect of an increasingly badly behaved network flow on the jitter of another flow for each type of router. routelet induces jitter of 325µs, 55% greater than the standard MPLS router at the same point. However, at this point, both routers’ jitter is increasing exponentially. Instead of the QuaSAR routelet’s overheads causing a consistent 55% increase in jitter, it appears that the knee point (i.e. where the jitter starts suddenly increasing exponentially) occurs at 2000 pps less cross-traffic for the QuaSAR routelet than the standard MPLS router. This suggests that the virtual routelet approach incurs the same jitter char- acteristic as the standard MPLS router, except the standard MPLS router can cope with 2000 pps more cross-traffic than the QuaSAR routelet. Overall, the QuaSAR routelet may not provide better jitter characteristics, when under load, than the standard MPLS router. However, it substantially improves upon the jitter characteristics of the QuaSAR best effort router1 . Since the QuaSAR best effort router is the standard MPLS routing software running within a Xen virtual domain, this proves that the virtual routelet approach reduces some of the additional cross-traffic overhead introduced by virtualisation, with respect to a flow’s jitter. 8.2.4 Throughput Figure 8.14 shows the effect of increasing cross-traffic on the maximum bandwidth of a QoS flow in each type of router being evaluated. Testing the throughput of each type of router is a much more strenuous test of the router’s performance than the per-packet timing experiments. This is because the throughput testing tool (iPerf) injects a much greater traffic load in the measured network flow than the per-packet 1 It appears from the graph that the jitter induced by the QuaSAR routelet will overtake the QuaSAR best effort jitter just after 100,000 pps if the QuaSAR best effort jitter stays steady. However, Section 8.2.4 suggests that the jitter induced by the best effort router will jump again as the cross-traffic load continues to increase. Therefore, the jitter introduced by the QuaSAR routelet may never overtake the QuaSAR best effort router’s jitter. 68
  • 74. timing tool does (the per-packet timing tool only injects roughly 28kbit/s, whereas iPerf tries to reach about 100Mbit/s). This means that the router is already significantly loaded, even before the cross-traffic contends for router resources. This experiment, therefore, gives some indication of the effect on each router type of a greater traffic load than can be generated by the cross-traffic flow itself. Standard MPLS QuaSAR Best Effort 100 QuaSAR Routelet 80 TCP Bandwidth (Mbit/s) 60 40 20 0 0 20000 40000 60000 80000 100000 Interference Flow Packets per Second Figure 8.14: This graph shows the effect of an increasingly badly behaved network flow on the maximum throughput of another flow for each type of router. The throughput which the standard MPLS router can provide to a single flow is not affected until the cross-traffic reaches about 90,000 pps. At the maximum cross-traffic of 97,700 pps, the throughput decreases by 4.3% to 89.6Mbit/s. The QuaSAR best effort router again incurs a steep decrease in performance, followed by a level perfor- mance characteristic, as seen in the previous two experiments. However, in this case, the sharp decrease in performance occurs when the cross-traffic reaches 40,000 pps, then again at 60,000 pps. The perfor- mance drop seen at 60,000 pps in the previous two experiments now seems to occur at 40,000 pps. Since the throughput measurement tool increases the load on the router much more than the per-packet timing measurement tool, it is expected that this performance drop would occur at an earlier cross-traffic load. The second drop in performance suggests that if we increased the cross-traffic load in the previous experiments above 100,000pps, we would see another sharp drop in performance in their results as well. The movement of the first performance drop (from 60,000 pps to 40,000 pps) and the second drop (from possibly 100,000 pps to 60,000 pps) is not the same between the timing and throughput experiments. However, the throughput performance does not necessarily scale with increased router load, in the same way as timing performance. This is especially true, since the experiment measures a single network flow’s maximum throughput (which is limited by the speed of the network connections) rather than a router’s overall maximum throughput. Also, the load on the router is affected by the throughput measurement network flow, as well as the cross-traffic network flow. This explains the lack of another performance drop at 80,000 pps, which one would naively expect to find (e.g. if performance drops 69
  • 75. occur every 20,000 pps). At 80,000 pps, the throughput being achieved by iPerf is 3Mbit/s, 3.2% of that achieved at 20,000 pps. Therefore the throughput measurement flow only exerts about 3% of the load it did when the cross-talk was 20,000 pps. So, although the load placed on the router by the cross-talk flow is much greater, the load introduced by the throughput measurement tool is much less. The QuaSAR routelet sees a much more gradual decrease in throughput performance than the best effort router. The throughput performance does, however, start to decrease almost as soon as cross-traffic is injected into the router. This suggests that the overheads involved in passing about 100Mbit/s of traffic between domains (with the context switches this incurs) are using almost all of the available CPU processing time. The QuaSAR routelet’s maximum throughput is not as significantly affected by the cross-traffic after 40,000 pps, as the QuaSAR best effort router’s throughput, suggesting that the virtual routelet design does provide some partitioning between network flows, when the overheads of virtualisation are taken into account. 8.3 Future Experiments A number of other experiments were considered to further analyse the various features of each type of router, and further evaluate the performance of the QuaSAR router. These experiments were rejected either for design issues, time limitations or equipment shortages. The experiments are presented here as possible future work, with reasons for their rejection. Best Effort Cross-Traffic When evaluating the partitioning performance of the QuaSAR routelet in Section 8.2, all of the cross-traffic was routed by a QoS routelet. While this tested the partitioning which could be provided between two QoS network flows (one well behaved and one badly behaved), it did not evaluate a QoS flow’s partitioning from best effort traffic. This was a conscious decision because of a design decision which had been made as the QuaSAR router was being built. Since the best effort router had to control the routelets when RSVP control messages arrive, it must run within a privileged Xen domain (unprivileged Xen domains have no control over other domains). The packet demultiplexing must also run within a privileged domain, since it requires direct access to the physical network devices, which can only be granted to privileged domains. It was therefore decided to run both the best effort router, and the packet demultiplexing within the same privileged domain. This meant that the available CPU time of the best effort router could not be limited, without also affecting the QoS routelet’s packet demultiplexing time or the control code. It would be possible to separate out the control code, the best effort router and the packet process- ing. A very simple domain could act as a hardware device controller and packet demultiplexer. A privileged control domain would be passed all RSVP messages, and a best effort router do- main would handle all best effort traffic. These domains could be hooked up to the simple packet demultiplexing domain in the same way as the QoS routelets are, and could also be limited in their resource (e.g. CPU time, memory, network transmission rate, etc) usage in the same way. Unfortunately, this design would have required more time to build than was available during this project. It would also have incurred extra overheads, since all packets would have to be passed from one domain (the simple packet demultiplexer) to another (the domain which processes that packet) and back again. Scalability One aspect of the QuaSAR router’s performance, which was not measured, was its scalability to an increase in network QoS flows. The QuaSAR router would not have been anywhere near 70
  • 76. as scalable as a normal router, since it runs a separate Linux operating system for each network flow, and so it would quickly run out of resources (e.g. memory) as additional Linux operating systems were started to support new network flows. However, it would still have been interesting to evaluate how the QuaSAR router copes with an increasing number of network flows. To measure the QoS parameters of each network flow to the level of detail provided in the exper- iments presented here, each flow would require its own source and sink hosts. If multiple flows were run from one host, then each could interfere with the other, especially since the host ma- chines were of a significantly lower specification than the router machine. There was not enough time or equipment to set up separate machines for each network flow. There was also not enough time to create experiments which would evaluate a flow’s QoS in less detail, but with greater consistency, when multiple flows are being sourced by a single host. Network Resource Partitioning The partitioning experiments described in Section 8.2 concentrated on partitioning the CPU time assigned to each flow’s routelet. However, the QuaSAR router also allows the network transmis- sion rate of a routelet to be limited. In order to evaluate the network transmission rate partitioning provided by the QuaSAR router, a cross-traffic network flow could be set up which would be routed to the same destination as the measured flow (rather than passing through the same router, but travelling on different network line cards than the measured flow, as in Section 8.2). This cross- traffic flow would consist of maximum sized packets, at a rate designed to saturate a 100Mbit/s network. Since large packets are used, it would take less packets per second (pps) to saturate the network. Therefore the router’s CPU would not be as overwhelmed as with the small packet, high pps, flow used in Section 8.2. The routelet network transmission rate limiting could then be used to prevent the cross-traffic flow from saturating the destination network, providing the measured flow with a better QoS. However, this experiment is very dependent upon the host’s ability to cope with 100Mbit/s of traffic. Because of the limited hosts available during this project, this experiment would have been heavily limited by the host’s performance. It would therefore have been very difficult to make any worthwhile measurements in this experiment. Direct Measurement of Individual Overheads The measurements performed here allow evaluation of the overall overheads involved in the virtualisation of a network router. Although these mea- surements can be used to conjecture a rough estimation of the individual overheads, they do not provide direct evidence of which individual operations contribute the most significantly to these overheads. To uncover exactly which areas of the virtualisation approach need to be modified so that the virtual routelet architecture is effective, more fine grained detail about the individual overheads requires to be measured. To measure these overheads directly, profiling code could be added at strategic locations within the Xen virtual machine monitor. This added profiling code would have to be carefully written, to ensure that it does not add significant overhead to any of the areas being profiled. This would also require detailed knowledge of almost all of the Xen virtual machine monitor’s code, which would not have been possible over the timescale of this project. 71
  • 77. Chapter 9 Evaluation The previous chapter presented the results of the experiments which were performed on the QuaSAR router, and provided some analysis of the possible reasons behind these results. This chapter attempts to provide an overview of these results and bring together the analysis on these results, in order to evaluate the QuaSAR router’s overall performance. Section 9.2 goes on to relate this analysis to the overall design of the QuaSAR router, attempting to identify the areas of QuaSAR’s design which prevented it from providing better partitioning between network flows than a standard router. 9.1 Experimental Results Overview The unloaded latency experiments provide a good measure of the overhead incurred by the various com- ponents of the QuaSAR router, since it measures the actual time taken to process an individual packet. These experiments suggest that the Xen virtualisation software (without context switching between do- mains) incurs roughly a 7.3% processing overhead per routed packet. This compares well with the performance evaluation performed by Barham et al in [2], which found a 5% average overhead incurred by Xen virtualisation. The virtual routelet approach used by QuaSAR incurs an additional 30.1% pro- cessing overhead per packet. This suggests that the major overheads incurred by a QuaSAR routelet are caused by the context switch time and the transferring of packets between domains. The overhead in- curred by a soft real time scheduler could not be fully evaluated, due to the unfinished nature of the sedf scheduler used. However, the fact that the unfinished sedf scheduler incurs only a 10% additional over- head (for QuaSAR routelets) implies that a soft real time scheduler, designed specifically for network routing, could be used by the QuaSAR router without incurring major overheads. The measurement of unloaded jitter identifies how evenly the overheads are spread across individual packet transit times. If a large overhead is identified by an increased average packet latency, but the jitter does not increase substantially, then this suggests that the overhead affects each packet equally. Conversely, if a large increase in average latency is accompanied by a substantial increase in jitter, then it suggests that large overheads affect a small number of packets, whereas other packets are unaffected (e.g. the overhead is caused by a random event, which delays 1 in 10 packets). The 7.3% overhead incurred by Xen virtualisation is accompanied by a 21.7% increase in jitter. This suggests that the overheads caused by Xen are random in nature and do not affect each packet equally. Therefore, the overheads cause by Xen virtualisation occur, either because of an increases in the number of random events which can delay packets (for example, a domain may periodically yield control to the VMM, to ensure they do not hog the CPU), or because the length of random delays, present before 72
  • 78. virtualisation, increases (e.g. the periodic timer interrupt now has more work to do to ensure all domains, and the Xen VMM, are notified of time changes). The 30.1% overhead incurred by the virtual routelet approach to network routing is accompanied by only a 7.9% increase in jitter. It seems likely, therefore, that most of the overheads incurred by the virtual routelet approach are incurred equally across packets. Since the major overheads which were identified (context switching and passing packets between domains) occur for each and every packet, this may not be surprising. However, it would be expected that passing a packet between domains would introduce more random jitter than it does, since a context switch needs to occur before the packet would be processed any further. The event which signals a packet being passed between domains does not immediately cause a context switch in Xen (otherwise thrashing between domains could easily occur), therefore, it would be expected that some random packets would be significantly delayed (causing con- siderable jitter) as they wait for a context switch to occur. The fact that this does not happen shows that Xen correctly assumes that the privileged domain has no more work to do, then context switches to the routelet’s domain, because it has an outstanding event. There are also some indirect consequences of context switching, such as cache memories and TLBs being invalidated. The lack in jitter which accom- panies these events suggests that a consistent set of memory locations are required during each packet’s transmission, so the overhead incurred by these indirect consequences are equal for each packet. The unloaded throughput experiments are less valuable, since their results are considerably constrained by both the maximum network speed and the performance of the hosts. The results could also be affected in unexpected ways by the TCP flow control algorithm. They do, however, provide a useful measurement of the real world performance which could be expected by the QuaSAR router. The results prove that the QuaSAR router can cope with a high throughput load, while confirming a slight overhead introduced by the QuaSAR router, as seen in previous experiments. The evaluation of QuaSAR’s flow partitioning capabilities identifies some other interesting aspects. In all three experiments, the virtualisation overheads incurred by the QuaSAR router prevent it from cop- ing with greater cross-traffic loads than the standard MPLS router. However, in all three experiments there are areas the virtual routelet approach (QuaSAR routelet) can outperform a standard router within a virtual machine (QuaSAR best effort). Since the QuaSAR routelet approach incurs much greater overheads, per-packet, than the QuaSAR best effort approach, this suggests that the virtual routelet ar- chitecture has some promise. The overheads incurred by the form of virtualisation used in this project were, however, too great for the QuaSAR router to be effective, compared with a standard router. The major overheads which prevented the QuaSAR router from providing better partitioning than a standard router are discussed in the next section. 9.2 Virtualisation Overheads The Xen virtualisation software was built to support complex, multi-threaded, multi-address space op- erating systems, in order to provide virtual server farms [33]. It was not built for this type of virtual routelet architecture. There were therefore a number of design decisions which were made in Xen which are aimed at providing support for virtual server farms, and which do not suit the virtual routelet design model. Xen must enforce security between domains, since it is designed to support several mutually untrusted domains. However, trust and security are secondary concerns for the QuaSAR router (each routelet is trusted). What is more important, in a virtual routelet architecture, is an approach which enforces cycle and bandwidth guarantees for each routelet. This section discusses some of the aspects of Xen’s design which incurred major overheads in this virtual routelet architecture or prevented the enforcement of each routelet’s cycle or bandwidth guarantees. It also discusses how these aspects could be designed differently to suit the virtualisation technique required by the QuaSAR router. 73
  • 79. 9.2.1 Linux OS for each Routelet Each routelet within the QuaSAR router requires an individual Linux operating system. As the number of network flows being routed by a QuaSAR router rise, this produces an increasingly large overhead. The QoS routelets have very simple requirements, therefore the use of a complex multi-user operating system, such as Linux, for each routelet vastly increases the resources required by each routelet. This will, in turn, decrease the scalability of the QuaSAR router, since it will more quickly run out of memory as additional flows arrive and unnecessary operations within the multiple Linux operating systems (such as processing timer events) decrease the usable CPU time. As no simple operating systems were available for Xen during the course of this project, the use of a complex operating system was unavoidable. If, however, time had allowed, the scalability of the QuaSAR router and possibly its performance under load could have been improved by designing and implementing a simple operating system aimed directly at the router’s requirements. 9.2.2 Context Switch Overhead One major overhead in QuaSAR’s design seemed to be the time taken to context switch between do- mains. The processing of each QoS packet requires two context switches, therefore this overhead be- comes very significant as the traffic load on the router increases. Xen is designed to support multiple complex multi-user operating systems, each running complex software which is not necessarily trusted. Each domain, therefore, runs within its own virtual memory address space, to ensure isolation between domains and enforce inter-domain security. Although this guarantees that one domain cannot access an- other domain’s memory, it increases the cost of each context switch, as each context switch now requires a change in memory address space (with a move to a different page table, and invalidation of the TLB), as well as saving and restoration of the CPU state. The virtual routelet architecture has a very different set of requirements from the intended uses of Xen. The routelets only run a very small, pre-known set of software, which is therefore trusted. Inter-domain memory security is therefore not very important to the QuaSAR router. The performance of the QuaSAR router could be substantially improved by sharing a single address space between all of the routelets, as this would substantially reduce the overhead caused by context switches between domains. This would reduce some of the partitioning between routelets (as they would now be sharing the same memory), however, if it was carefully designed, so that no guest operating system tramples on the memory of any other guest operating system, it would not affect the architecture’s overall approach. 9.2.3 Routlet Access to NIC Another major problem with the design of the QuaSAR router is that it requires packets to be passed between domains multiple times. Although Xen uses page table manipulation to achieve this packet transfer, it still incurs a significant overhead (due to event messaging, scheduling between domains, and other factors). This approach was necessary when using the Xen virtual machine, since only one domain can have access to the physical networking devices (again, to ensure partitioning and security between domains). As discussed previously, security between routelets is not very important in the virtual routelet archi- tecture, and while partitioning of each routelet’s network transmission rate is important, it could still be achieved if each routelet was given full, but rate limited, access to the network devices. However, naively giving each routelet access to the physical network devices would cause serious problems. For example, if one domain is pre-empted by another, while in the middle of sending a packet to a network 74
  • 80. device, that packet would become corrupted, especially if the other domain also starts sending a packet to the same network device. A shared network device driver would have to be very carefully constructed to ensure one domain cannot interfere with another domain’s packet sending or receiving. This driver would include critical sections to prevent a domain from being pre-empted when it is in the middle of sending or receiving data from a device. These shared device drivers would also need some form of rate limiting, to prevent one routelet from hogging a network device (since the routelets are trusted, the Xen VMM would not necessarily have to ensure this rate control). 9.2.4 Classifying Packets A large proportion of each packet’s processing time is spent classifying the packet and deciding which routelet processes that packet’s flow. This processing time cannot be assigned to the flow’s routelet, because the packet has not yet been classified to its particular flow. The fact the QuaSAR routelet was designed to route MPLS traffic could have exacerbated this problem. MPLS was chosen because it is very simple to classify (by simply checking the MPLS label), and its processing requirements are also extremely simple (replacing the MPLS Label, then encapsulating it in an Ethernet frame header). Al- though this simplified the design and implementation of the QuaSAR router, it meant that the difference between classification time and processing time was not as large as it could have been, thereby reducing the effectiveness of the routelet partitioning. A more complex networking protocol, such as IP, could see better partitioning with the QuaSAR router. Although this would involve more complex classification, the more routing and processing requirements of IP could have offset this classification cost, improving the overall portion of each flow’s processing which could be partitioned. 9.2.5 Soft Real Time Scheduler One of the most important deficiencies in the use of Xen for this project was the lack of a stable soft real time scheduler to provide effective CPU partitioning between domains. The use of the ATROPOS soft real time scheduler to partition each routelet’s CPU processing was an important aspect of the original design. As discussed previously, the ATROPOS scheduler was not available, and its replacement (sedf) was not stable enough to use in the partitioning experiments. Therefore, a time shared (bvt) scheduler had to be used. This scheduler does not guarantee that the routelet processing the cross-traffic does not use more processing time than it is allowed. The cross-traffic flow’s routelet may, therefore, be using some of the CPU time assigned to the measured flow’s routelet’s, thereby reducing the effectiveness of the QuaSAR routelet’s performance. This lack of a soft real time scheduler was one of the most important problems in this project. The purpose of the QuaSAR router was to evaluate the effectiveness of a virtual routelet approach. The other problems reduce the performance of the QuaSAR routelet, however this can be taken into account when evaluating the router. The lack of a soft real time scheduler prevents the QuaSAR router from fully following the design philosophy of the virtual routelet approach. Therefore, the partitioning measure- ment on the QuaSAR router does not accurately evaluate the partitioning performance which could be expected by a true virtual routelet approach. 75
  • 81. Chapter 10 Conclusion 10.1 Aims The main aim of this project was to test the hypothesis that virtual machine techniques can be used within network routers to provide effective partitioning between different network flows, and thus pro- vide Quality of Service guarantees to individual network flows. To this end, the project involved the creation of a prototype experimental router (QuaSAR) which uses virtualisation techniques to provide QoS guarantees to network flows. One of the primary aims of this research project, therefore, involved developing techniques which allow virtual machine techniques to be effectively used within network routers to improve the QoS guarantees they can make to network flows. The virtual routelet architecture was proposed as a technique which could improve the partitioning between network flows using virtualisation methods. This aim, there- fore, lead to the goal of creating design techniques and tools which would enable this virtual routelet architecture, and allow the QuaSAR prototype to be built. An important aspect of this work involved performing experiments on the prototype router, to evaluate its effectiveness and discover the areas of its design that be improved. This evaluation should also uncover whether the underlying virtual routelet architecture is an effective way of increasing the partitioning between flows, and it should suggest ways in which this architecture can be improved. The process of hardware virtualisation adds an inherent overhead to the cost of performing a given task. The amount of overhead incurred depends upon how well the virtualisation process matches the given task. The virtualisation technology used within this project (Xen) was not designed for the process of network routing, but was instead aimed at complex server virtualisation. Another aim of this project was therefore to evaluate the design features of Xen, which incur the greatest overhead in this virtual routelet architecture, and predict the design features needed by a virtualisation approach suited to the virtual routelet architecture. 10.2 Achievements Many of the aims of this project were successfully met. The creation of the QuaSAR router prototype in- volved the development of a number of tools and techniques which enable virtual machine technologies to be used within a network router. The virtual routelet architecture was elaborated upon by the creation of the components necessary to implement this design in the QuaSAR prototype router. As such, the architecture of the virtual routelet was designed and implemented, and packet processing capabilities 76
  • 82. were added so that it could route MPLS traffic. Dumultiplexing support was developed, so that packets could be sent to the appropriate routelet for processing. The complex internal networking, necessary to connect virtual routelets to the the appropriate physical devices, was determined and set up as part of the QuaSAR prototype. Control code was implemented, to allow RSVP messages to set up new network flows and associate those flows with virtual routelets. Finally, methods were developed which allowed management of each routelet’s allocation of overall resources, such as CPU time and network transmission rate. The creation of the QuaSAR router also brought about some indirect benefits to the research community which were not specific aims of this project. The integration of various open source projects meant that a number of the open source projects had to be upgraded to support the latest Linux kernel release. As part of this process, the Click Modular Router was ported from the 2.4 kernel to the 2.6 kernel. The changes I made to the Click Modular Router, along with some preliminary porting work, was passed along to the Click community as a source code patch. These changes (modified to allow backward compatibility with previous kernel releases) have now been integrated into the Click development tree. Therefore, an indirect contribution of this project, to the research community as a whole, was the porting of the Click Modular Router to the most recent Linux Kernel. The Xen virtual network interface (vif) transmission rate limiting support was included in QuaSAR’s original design as a means of guaranteeing a proportion of the overall router’s network transmission resources to each routetet (and thus, each network flow). The original vif rate limiting support had been dropped from Xen as it evolved, therefore, I reimplemented this rate limiting support for the new Xen release and added support to Xen’s management tools to allow user-level control over this function. These improvements were contributed to the Xen community as a source code patch, whereupon they were integrated within the Xen development tree for use in the next Xen release. The creation of the QuaSAR router proved that it is possible to use virtual machine techniques within a router, and still achieve acceptable performance. The QuaSAR router could support high throughput net- work flows and achieved acceptable per-packet latencies and inter-packet jitter for use in a standard edge networking environment. Virtualisation did, however, incur significant performance overheads, partic- ularly in regard to per-packet latencies. These overheads prevented the QuaSAR router from achieving better partitioning between network flows compared with a standard software router, however, when some of the virtualisation overheads were taken into account (by comparing the QuaSAR best effort router to the performance of a QuaSAR routelet), the virtual routelet approach seemed to show promise. The fact that the QuaSAR routelet outperformed QuaSAR best effort routing in some partitioning ex- periments is particularly significant, since when unloaded, the QuaSAR routelet approach incurs 29.5% more virtualisation overheads (due to context switching, packet processing between domains and other overheads) than the QuaSAR best effort router. This suggests that the virtual routelet architecture could be effective at increasing the partitioning between network flows, if the overheads of virtualisation were substantially reduced. The evaluation of the QuaSAR router provided considerable insight into the design features of the vir- tualisation software used, which were not appropriate in the virtual routelet architecture. These insights suggest a number of design features which would be important in the creation of a virtualisation ap- proach, appropriate to the virtual routelet architecture (such as low context switch overhead and shared access to physical devices - see Section 9.2). In fact, as well as evaluating the effectiveness of the virtual routelet architecture, the QuaSAR router is an effective tool with which to measure different performance aspects of virtual machine technologies. The internal routing can be set up to pass packets between any number of different domains before rout- ing them to the correct network. The latency between packet arrivals can therefore be used to measure context switch times, scheduling delays, domain memory sharing and many other virtualisation over- heads. Modification of the virtual machine manager and the QuaSAR router would allow more precise 77
  • 83. measurements of each individual overhead, providing profiling of the performance of different aspects of Xen, or any other virtualisation technology. Since the performance of the virtualisation overheads is profiled externally (i.e. by measuring the latency of packets at the external hosts) the actual profiling code does not affect the measurement results. Throughput measurements of the QuaSAR router could also be used to evaluate scalability issues of virtualisation, with regard, for example, to aspects such as maximum number of context switches that can be performed within a certain period. Overall, although the QuaSAR router could not provide better partitioning between network flows, it did uncover a number of valuable insights into potential improvements in the virtual routelet architecture (for example, in the classification stage) and the virtualisation techniques necessary to make this architecture effective. 10.3 Further Work As well as performing the experiments described in Section 8.3, identification of some of the deficiencies in the QuaSAR router suggest some natural extentions to this work. Firstly, each QoS routelet is only required to perform simple packet forwarding (complex packet forwarding, routing table maintenance and configuration are dealt with by the best effort router in the main guest operating system). However, even though the QoS routelets have such simple processing requirements, they still run on top of a full (albeit, small) Linux operating system. This is excessive in terms of the resources required by each routelet. While the best effort router requires some of the functionality provided by a complex operating system such as Linux, the QoS routelets could be supported by a much simpler operating system. Therefore, one possible extension would be the creation of a simple operating system which performs the required routing, without using extra resources, or introducing additional latency to the time taken to forward packets due to unnecessary complexity. Another problem with this design is the fact that every packet from a QoS flow must be demultiplexed within the best effort router, before being sent to its routelet for processing. This causes possible crosstalk between flows’ traffic patterns, and reduces the effective partitioning between network flows. For example, the QoS flows still require some of the resources of the best effort router, so if a router is overwhelmed with best effort traffic, the QoS flows will also suffer. This could be alleviated somewhat by demultiplexing packets to their required routelet or best effort router within hardware, rather than software. The Arsenic network interface card [31] provided the ability to demultiplex packets between virtual interfaces, based upon bit patterns in their header. While this was created with the intention of demultiplexing packets between applications, it could easily be modified to send packets directly to the routelet processing them. This would remove QoS flows’ dependency on the best effort router’s resources. While it would not remove all possible crosstalk between flows, as the resources used by demultiplexing are still shared between flows, it would at least provide partitioning flows arriving on different network line cards. It would also move the demultiplexing to hardware, increasing its speed (possibly to network line speed) and reducing the probability of the demultiplexing stage becoming overwhelmed. This work could also be extended by investigating the use of another method of resource virtualisation. The virtualisation software used in the prototype (Xen) was developed with the express purpose of sharing a machine’s resources between full multi-threaded, virtual address space, complex operating systems. These capabilities are not required by a router. A similar effect could be realised through the use of a stateless library, real time operating system. Each routelet would run as a separate real time process, with the OS’s scheduler giving each “routelet process” the resources it requires. The use of a stateless system library (such as that used by the Nemesis operating system [34], [17] and [21]) would provide partitioning between the routelets, as any system calls they make would be independent of each 78
  • 84. other (i.e. if a routelet made a system call, it would not have to wait for another routelet’s system call to finish). Therefore, each routelet could only make use of the resources allocated to it by the system’s real time scheduler. So that each QoS flow is not affected by the queues of other flows, each routelet would queue its own outgoing packets. The transmit function of the network line card’s driver would be modified, so that it obtains the next packet to send from the routelets’ outgoing queues, choosing a queue based upon the resources allocated to each routelet. The QuaSAR prototype also suffers by being a software router and using commodity x86 hardware, rather than using the hardware of high performance commercial routers. The architecture presented by a hardware router is substantially different from that of commodity hardware, requiring different design techniques. A hardware router is typically designed to work on a number of packets in parallel. For example, a modern hardware router may use forwarding engines on network interface cards (nics) to process packets independently and in parallel. These packets may then be transferred between nics in parallel using a switching fabric. However, although hardware routers provide much better routing performance, compared with software routers, they are much less flexible. Network processors bridge the gap between hardware and software routers. These processors provide the flexibility of software routers but are generally better suited to processing packets in parallel than commodity hardware. For example, the Intel IXP processor [1] consists of a general purpose CPU (generally used for control purposes), as well as a number of a simple microengines (generally used to process multiple packets in parallel). Developing the router resource virtualisation techniques proposed here for a network processor would be an interesting extension to this work. Network processors, such as the IXP, contain multiple processing engines (or microengines) which can process packets in parallel. Since these processing engines are independent, they could be used to partition the router’s resources between network flows. For example, the processing of packets from each QoS flow could be assigned to separate processing engines. There are 16 processing engines in the IXP2800, which are obviously not enough to service the number of network flows typically processed by an Internet router, if they are assigned in this way1 . One solution to this problem would be to give each network flow its own virtual processing engine, which has limited access to the physical processing engines. These virtual processing engines could be assigned to the physical processing engines by a scheduler, based on the requirements of the QoS flows which they process. These virtual processing engines would be independent of each other, so that QoS guarantees could be made by the router. 10.4 Summary The evaluation of the QuaSAR prototype router suggests that it is indeed feasible to use virtualisa- tion techniques within a network router, without overly compromising the router’s ability to route high throughput, low latency traffic of a single network flow. With regard to the hypothesis that virtual ma- chine techniques can improve a router’s ability to provide partitioning between network flows, the results are ambiguous. While routing through a QuaSAR routelet provided some improvements in partitioning, over the QuaSAR best effort router, the virtualisation overheads involved prevented it ever from reaching the level of performance achieved by a standard software router. This suggests that the virtual routelet architecture has some promise, if the virtualisation overheads can be reduced substantially. Evaluating the results of the QuaSAR router experiments, however, provides some valuable insight into the areas of virtualisation which could be enhanced to improve the effectiveness of the virtual routelet architecture. 1 The IXP actually supports up to 8 hardware threads per processing engine, however, even if each QoS flow was assigned to its own hardware thread, this would only allow for 64 network flows. 79
  • 85. Overall, this project has uncovered a number of interesting aspects of virtual machine research and novel routing architectures. 80
  • 86. Bibliography [1] Matthew Adiletta, Mark Rosenbluth, Debra Bernstein, Filbert Wolrich, and Hugh Wilkinson. The next generation of intel IPX network processors. Intel Technology Journal, 2002. [2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 164–177. ACM Press, 2003. [3] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. RFC 2475: An Architecture for Differentiated Services, December 1998. [4] R. Braden, D. Clark, and S. Shenker. RFC 1633: Integrated services in the Internet architecture: an overview, June 1994. [5] Andrew T. Campbell, Herman G. De Meer, Michael E. Kounavis, Kazuho Miki, John B. Vicente, and Daniel Villela. A survey of programmable networks. SIGCOMM Comput. Commun. Rev., 29(2):7–23, 1999. [6] A.T. Campbell, H.G. De Meer, M.E. Kounavis, K. Miki, J. Vicente, and D.A. Villela. The Genesis Kernel: a virtual network operating system for spawning network architectures. In Second Inter- national Conference on Open Architectures and Network Programming (OPENARCH), pages 115 – 127, New York, March 1999. [7] Mark Carson and Michael Zink. NIST switch: a platform for research on quality-of-service rout- ing. Internet Routing and Quality of Service, 3529(1):44–52, 1998. [8] P. Chandra, A. Fisher, C. Kosak, T.S.E. Ng, P. Steenkiste, E. Takahashi, and Hui Zhang. Dar- win: customizable resource management for value-added network services. In Sixth International Conference on Network Protocols, pages 177 – 188, Austin, October 1998. [9] Cisco. Cisco IOS Switching Services Configuration Guide, Release 12.2, chapter 6, pages 123–185. Cisco Systems Inc, 2003. [10] Tzi cker Chiueh. Resource virtualization techniques for wide-area overlay networks. Technical report, Computer Science Department, State University of New York at Stony Brook, 2003. [11] Bryan Clark, Todd Deshane, Eli Dow, Stephen Evanchik, Matthews Finlayson, Jason Herne, and Jeanna Neefe Matthews. Xen and the art of repeated research. In Usenix annual technical confer- ence, 2004. [12] Dan Decasper, Zubin Dittia, Guru M. Parulkar, and Bernhard Plattner. Router plugins: A software architecture for next generation routers. In SIGCOMM, pages 229–240, 1998. [13] J. Dike. A user-mode port of the linux kernel. 5th Annual Linux Showcase & Conference, 2001. 81
  • 87. [14] Rod Fatoohi and Rupinder Singh. Performance of Zebra Routing Software. Technical report, Computer Engineering, San Jose State University, 1999. [15] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the on demand IT InfraStructure (OASIS’04), October 2004. [16] Didier Le Gall. Mpeg: a video compression standard for multimedia applications. Commun. ACM, 34(4):46–58, 1991. [17] Steven M. Hand. Self-paging in the nemesis operating system. In OSDI ’99: Proceedings of the third symposium on Operating systems design and implementation, pages 73–86. USENIX Association, 1999. [18] P. Van Heuven, S. Van Den Berghe, J. Coppens, and P. Demeester. RSVP-TE daemon for Diffserv over MPLS under Linux. accessed November 2004. [19] R. Keller, L. Ruf, A. Guindehi, and B. Plattner. PromethOS: A dynamically extensible router architecture for active networks. In IWAN 2002, Zurich, Switzerland, 2002. [20] Kevin Lai and Mary Baker. A performance comparison of UNIX operating systems on the pentium. In USENIX Annual Technical Conference, pages 265–278, 1996. [21] IM Leslie, D McAuley, R Black, T Roscoe, P Barham, D Evers, R Fairbairns, and E Hyden. The design and implementation of an operating system to support distributed multimedia applications. In IEEE Journal on Selected Areas in Communications, pages 1280–1297, 1996. [22] Christopher Metz. IP Routers: New Tools for Gigabit Networking. IEEE Internet, 2(6):4–18, November 1998. [23] D L Mills. Internet time synchronization: The network time protocol. IEEE Transactions on Communications, 39(10):1482–1493, 1991. [24] Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek. The click modular router. In Symposium on Operating Systems Principles, pages 217–231, 1999. [25] Kensuke Otake. 0sys Minimal Linux Distribution. [26] J Padhye, V Firoiu, D F Towsley, and J F Kurose. Modeling TCP reno performance: a simple model and its empirical validation. IEEE/ACM Transactions on Networking, 8(2):133–145, 2000. [27] Abhay K. Parekh and Robert G. Gallagher. A generalized processor sharing approach to flow control in integrated services networks: the multiple node case. IEEE/ACM Trans. Netw., 2(2):137– 150, 1994. [28] B.W. Parkinson and J.J. Spilker. The global positioning system: theory and applications. The global positioning system: theory and applications/ edited by Bradford W.Parkinson, James J.Spilker, Jr.; associate editors, Penina Axelrad, Per Enge. Washington, DC: American Institute of Aeronautics and Astronautics, 1996. Progress in astronautics and aeronautics; v. 163-164., 1996. [29] Craig Partridge. A 50-Gb/s IP Router. IEEE/ACM Transactions on Networking, 6(3):237–248, June 1998. [30] D. Plummer. RFC 826: An Ethernet Address Resolution Protocol, November 1982. 82
  • 88. [31] Ian Pratt and Keir Fraser. Arsenic: A user-accessible gigabit ethernet interface. In INFOCOM, pages 67–76, 2001. [32] Avinash Ramanath. A study of the interaction in BGP/OSPF in Zebra/ZebOS/Quagga. Technical report, Computer Science Department, State University of New York at Stony Brook, 2000. [33] Dickon Reed, Ian Pratt, Paul Menage, Stephen Early, and Neil Stratford. Xenoservers: Account- able execution of untrusted programs. In Workshop on Hot Topics in Operating Systems, pages 136–141, 1999. [34] Timothy Roscoe. Linkage in the nemesis single address space operating system. SIGOPS Oper. Syst. Rev., 28(4):48–55, 1994. [35] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and imple- mentation of the Sun Network Filesystem. In Proc. Summer 1985 USENIX Conf., pages 119–130, Portland OR (USA), 1985. [36] Pascal Schmitdt. ttyLinux User Guide. guide.pdf, version 4.5 edition, February 2005. [37] Jonathan Sevy. Linux Network Stack Walkthough. network stack walkthrough.html. [38] Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation’s Hosted Virtual Machine Monitor. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, pages 1–14. USENIX Association, 2001. [39] The International Engineering Consortium. Multiprotocol Label Switching (MPLS). - Accessed November 2004. [40] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs. Iperf: The TCP/UDP Bandwidth Measurement Tool. [41] A. Whitaker, M. Shaw, and S. Gribble. Denali: Lightweight virtual machines for distributed and networked applications. Technical report, University of Washington, 2002. [42] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and performance in the Denali isolation kernel. SIGOPS Operating Systems Review, 36(SI):195–209, 2002. [43] J. Wroclawski. RFC 2210: The use of RSVP with IETF integrated services, September 1997. [44] X. Xiao and L. M. Ni. Internet QoS: A big picture. IEEE Network, 13(2):8–18, March 1999. [45] Lixia Zhang, Steve Deering, Deborah Estrin, Scott Shenker, and Daniel Zappala. RSVP: A new resource reservation protocol. IEEE Network Magazine, 7(5):8–18, September 1993. [46] Junaid Ahmed Zubairi. An automated traffic engineering algorithm for MPLS-Diffserv domain. Proc. Applied Telecommunication Symposium, pages 43–48, April 2002. 83