
Automated paravirtualization of device drivers in Xen



Automated Paravirtualization of Device Drivers in Xen
Nikhil Pujari, Vijayakumar M M, Sireesh Bolla
Stony Brook University

Abstract:
Xen is an x86 virtual machine monitor which adopts a paravirtualization approach to allow multiple operating systems to share conventional hardware without sacrificing performance or functionality. Because paravirtualization provides an idealized virtual machine abstraction, operating systems must be ported before they can be installed and run on the Xen VMM. Instead of emulating existing hardware devices as in full virtualization, Xen provides abstract devices which implement a high-level interface for each device category. Since these interfaces are abstract and provide only the generic operations for a device class, extra effort is needed to give guest operating systems access to the particular functionality that the hardware may provide. Such non-generic facilities are generally exposed by device drivers in the form of device private ioctls, typically used to configure the device, to enable or disable non-generic functionality, or to collect status information. As of Xen 3.3, the netfront driver does not implement device private ioctls; user-mode configuration programs like ifconfig which understand the underlying devices use these ioctls. Our method converts a local ioctl call in a Linux DomU into a remote call to Dom0. This is done by including a generic ioctl wrapper in the netfront driver and a watch in the netback driver. It automates the process of exposing, through device private ioctls, arbitrary functionality provided by the real network hardware to the DomU. Only a specification with the ioctl numbers and buffer sizes needs to be provided; a tool reads it and writes the entries to the XenStore.
Motivation:
Paravirtualization is a virtualization technique that presents a software interface to virtual machines that is similar but not identical to that of the underlying hardware. Paravirtualization may allow the virtual machine monitor (VMM) to be simpler, or allow the virtual machines that run on it to achieve performance closer to non-virtualized hardware. However, operating systems must be explicitly ported to run on top of a paravirtualized VMM. Most full virtualization solutions provide emulated forms of simple devices; the emulated devices are typically chosen to be common hardware, so it is likely that drivers already exist for any given guest.
Paravirtualized guests, however, need to be modified, so the requirement for the virtual environment to use existing drivers disappears. Xen provides abstract devices which correspond to device categories; for example, it provides a single abstract block device instead of separate SCSI and IDE devices. This device abstraction provides the generic calls corresponding to that device category, e.g. read and write calls for the block device. This is done to achieve efficient I/O virtualization, as opposed to the emulation of devices in full virtualization. One of the important optimizations included in this approach is the grouping of I/O operations, which improves efficiency. Hardware manufacturers provide the generic functionality of a device class as well as additional device-specific functionality. For example, a CD/DVD drive, which falls into the block device category, provides the generic read and write capabilities, but also offers the "special"/non-generic capability of multisession support. An ethernet device may offer special capabilities such as jumbo frames and checksum calculation, in addition to the generic functionality of send and receive. The device drivers provided by Xen to the guest operating systems need to be modified in order to enable the guests to exploit these special features provided by the hardware. The aim of the project is to automate this process of modifying the Xen ethernet device drivers as much as possible, given the specifications of the ethernet device/NIC to be used. The initial aim of the project was to evaluate the feasibility of porting arbitrary network device drivers to Xen and whether we could automate the process. After examining Xen's approach to and implementation of network I/O virtualization, we determined that there is little need to port network device drivers to Xen, since the real device driver and Xen's split drivers are two separate components of the Xen network I/O chain.
Existing drivers could be loaded in Xen Dom0 and network I/O would work without any modifications to Xen's split drivers. Hence the project objectives were modified to exploring and implementing ways to expose device-specific functionality to DomUs, in an automated way, given a specification of the non-generic functionalities implemented by the real network driver. The following sections give an overview of the components of Xen's I/O virtualization architecture which we had to study and use in order to implement our mechanisms.

Overview of Xen device driver virtualization:
In Xen, Domain0 is a privileged domain, i.e. a privileged VM, which hosts the administrative management and control interface. It provides the capability to create and terminate other domains (unprivileged domains, or DomUs) and to control their associated scheduling parameters, physical memory allocations, and the access they are given to the machine's physical disks and network devices. It also supports the creation of the virtual network interfaces (VIFs) and virtual block devices (VBDs) used by the unprivileged guests. Xen implements a split device driver model for device driver virtualization. Dom0 is in control of the actual hardware devices, and virtual devices are exported for the DomUs to use. Some domains can also be given control of particular hardware devices, in which case they become driver domains; this is done only if an IOMMU is available, otherwise it compromises security. The actual hardware driver resides in the driver domain/Dom0, and the virtual device driver is split into two parts: the frontend driver and the backend driver. They are separated by a virtual bus, XenBus, which is roughly modeled after a device bus such as PCI. The backend driver resides in the driver domain/Dom0 and the frontend driver resides in the guest. The network frontend driver, i.e. the netfront driver, acts as a device driver for the virtual network. It communicates with the network backend driver, i.e. the netback driver, with the help of shared-memory ring buffers and an event channel which is used for asynchronous notifications. The event channel is the analog of hardware interrupts. The netback driver communicates with the hardware through the actual hardware driver. XenStore is another important component of the Xen architecture. It is a database of configuration information shared between domains. In relation to device drivers, it also fulfills the function of the device tree which is generally the result of querying an external bus such as PCI.
It is used to communicate to the frontend driver the information about the domain hosting the backend driver, the shared memory, the event channel to be used, and device-specific information.

Xen networking:
The Xen network interface employs two I/O ring buffers, one for incoming packets and one for outgoing. Ring buffers are producer-consumer queues implemented in shared memory. These ring buffers are used to transmit instructions, while the actual data is transferred through shared memory pages via the grant mechanism. A grant reference refers to a shared memory page which acts as a buffer for the actual data transfer. Each transmission request contains a grant reference and an offset within the granted page. This allows transmit and receive buffers to be reused, preventing the TLB from needing frequent updates. A similar arrangement is used for receiving packets: the DomU guest inserts a receive request into the ring indicating where to store a packet, and the Dom0 component places the contents there. For each new DomU, Xen creates a new pair of "connected virtual ethernet interfaces", with one end in the DomU and the other in Dom0. For Linux DomUs, the device appears as eth0; the other end of the pair exists within Dom0 as interface vif<id#>.0. The default Xen configuration uses bridging within Dom0 to allow all domains to appear on the network as individual hosts. When xend (the Xen daemon) starts, it runs a script named network-bridge which creates a new bridge named xenbr0. The virtual network interfaces in Dom0 are connected to the real physical interface using this bridge, and the network card runs in promiscuous mode. Each guest gets its own MAC address assigned to its virtual interface, which allows all the guests to appear on the network as individual hosts. A packet arrives at the hardware, is handled by the real ethernet driver, and appears on peth0, the real ethernet interface. The interface peth0 is bound to the bridge, so the packet is passed to the bridge from there. This step happens at the Ethernet level; no IP addresses are set on peth0 or the bridge. The bridge then distributes the packet, just like a switch would: it is passed to the appropriate virtual interface based on the MAC address and from there delivered to the correct guest domain.

XenStore and XenBus:
XenStore is a hierarchical namespace (similar to sysfs or Open Firmware) which is shared between domains. The interdomain communication primitives exposed by Xen are very low-level (virtual IRQ and shared memory).
XenStore is implemented on top of these primitives and provides some higher level operations (read a key, write a key, enumerate a directory, notify when a key changes value). XenStore is a database, hosted by domain 0, that supports transactions and atomic operations. It's accessible by either a Unix domain socket in Dom0, a kernel-level API, or an ioctl interface via /proc/xen/xenbus. XenStore is used to store information about the domains during their execution and as a mechanism of creating and controlling DomU devices. XenBus provides an in-kernel API used by virtual I/O drivers to interact with XenStore.
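To make these higher-level operations concrete, the following is a minimal, purely illustrative Python model of XenStore's hierarchical namespace: an in-memory dict stands in for the store, and the method names merely mirror the operations listed above (read a key, write a key, enumerate a directory, notify when a key changes); none of this is the actual XenStore API.

```python
class ToyXenStore:
    """In-memory stand-in for XenStore: slash-separated paths mapped to
    values, plus watches that fire on changes at or below a path.
    Purely illustrative; not the real libxenstore/XenBus API."""

    def __init__(self):
        self.entries = {}      # path -> value
        self.watches = []      # (path, callback) pairs

    def write(self, path, value):
        self.entries[path] = value
        # Notify every watch registered on this path or an ancestor of it.
        for watched, callback in self.watches:
            if path == watched or path.startswith(watched + "/"):
                callback(path)

    def read(self, path):
        return self.entries.get(path)

    def directory(self, path):
        """Enumerate the immediate children of `path`."""
        prefix = path + "/"
        return sorted({p[len(prefix):].split("/")[0]
                       for p in self.entries if p.startswith(prefix)})

    def rm(self, path):
        """Remove `path` and everything below it."""
        for p in list(self.entries):
            if p == path or p.startswith(path + "/"):
                del self.entries[p]

    def watch(self, path, callback):
        """Invoke `callback(changed_path)` on writes at or below `path`."""
        self.watches.append((path, callback))

store = ToyXenStore()
events = []
store.watch("/local/domain/3/ioctl", events.append)
store.write("/local/domain/3/ioctl/0x89f1/size", "64")
store.write("/local/domain/3/name", "guest1")        # outside the watch
print(store.directory("/local/domain/3/ioctl"))      # ['0x89f1']
print(events)                      # ['/local/domain/3/ioctl/0x89f1/size']
```

The example paths anticipate the ioctl directory layout used later in the paper; domain id 3 is an arbitrary choice for illustration.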
There are three main paths in XenStore:
/vm - stores configuration information about the domain
/local/domain - stores information about the domain on the local node (domid, etc.)
/tool - stores information for the various tools
The /local path currently contains only one directory, /local/domain, which is indexed by domain id and contains the running domain information. It contains directories for each of the device backends, for example vbd for block devices and vif for network devices, in the directory /local/domain/<domain-id>/backend. It consists of status entries and entries for the names and ids of the various entities such as the DomU, the bridge to which it is connected, and the MAC address. This is the directory in which we can store configuration information specific to our netfront-netback drivers. All Xen virtual device drivers register themselves with XenBus at initialization. Most initialization and setup is postponed until XenBus calls the probe function, which is very similar to how the PCI probe function gets called in real ethernet drivers. There are two classes of API which are used to read/write/modify XenStore. One set of functions is for accessing XenStore from tools, while the other is an in-kernel API used to access XenStore from inside driver code.

XenStore API for tools:
The whole set of functions can be found in the file /tools/xenstore/xs.h. It contains functions such as xs_mkdir, xs_read, xs_write, xs_directory and xs_rm, which create directories, read/write entries inside directories, read directory contents, and remove entries/directories, respectively. These functions are very similar to the set of POSIX functions for file/directory operations. They can be called from C programs or perl/python scripts to create/modify/destroy entries in XenStore; various Xen tools use them to operate on XenStore.

XenStore in-kernel API or XenBus API:
This set of functions can be found in the file /include/xen/xenbus.h.
It includes functions such as xenbus_register_frontend/backend, xenbus_read/write, xenbus_mkdir/rm, xenbus_printf/scanf, and register/unregister_xenbus_watch, which register frontend/backend drivers, create/modify/destroy XenStore entries, and set/unset watches on XenStore entries.

XenStore Transactions:
Transactions provide developers with a method for ensuring that multiple operations on the XenStore are seen as a single atomic operation. Any time multiple operations must be performed before any changes are seen by watchers, a transaction must be used to encapsulate the changes. A transaction is started by calling xenbus_transaction_start() on the directory whose contents need to be changed or read. The XenStore API functions can then be used to read/write values in the desired entries, and the transaction is ended by calling xenbus_transaction_end(). Similar functions exist which can be called from userspace tools to modify or read values from XenStore.

XenStore Watches:
A watch is functionality provided by XenStore which allows for registering callback functions that are invoked when a particular XenStore entry, or any entry below the directory being watched, is changed. This allows drivers or applications to respond immediately to changes in the XenStore. Drivers can register a watch by using the function register_xenbus_watch(), which takes as input a structure of type xenbus_watch containing the XenStore entry/directory to be watched and a pointer to the callback function.

Design and Implementation:
Network interfaces are represented inside the Linux kernel by struct net_device. Network drivers populate the structure and register it with the kernel at initialization time. It is the very core of the network driver layer and contains all the different types of information pertaining to the interface: the interface name, hardware information like the DMA channel and IRQ assigned to the device, interface information such as the MAC address and flags, and a function dispatch table with functions such as open, close, transmit, do_ioctl, change_mtu, etc. The do_ioctl method is generally used to implement non-standard functionality specific to the device.
When the ioctl system call is invoked on a socket, the command number is one of the symbols defined in <linux/sockios.h>, and the sock_ioctl function directly invokes a protocol-specific function. Any ioctl command that is not recognized by the protocol layer is passed to the device layer. These device-related ioctl commands accept a third argument from user space, a struct ifreq *. This structure is defined in <linux/if.h>. In addition to using the standardized calls, each interface can define its own ioctl commands. The ioctl implementation for sockets recognizes 16 commands as private to the interface: SIOCDEVPRIVATE through SIOCDEVPRIVATE+15. When one of these commands is recognized, dev->do_ioctl is called in the relevant interface driver. The function receives the same struct ifreq * pointer that the general-purpose ioctl function uses: int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd); The ifr pointer points to a kernel-space address that holds a copy of the structure passed by the user. After do_ioctl returns, the structure is copied back to user space; therefore, the driver can use the private commands to both receive and return data. The device-specific commands can choose to use the fields in struct ifreq, but those fields already convey a standardized meaning, and it is unlikely that the driver can adapt the structure to its needs. The field ifr_data is a caddr_t item (a pointer) that is meant to be used for device-specific needs. The driver and the program used to invoke its ioctl commands should agree about the use of ifr_data: this pointer can point to arbitrary configuration data understood by both the application and the driver. The rest of the methods are standard methods supported by every interface, and hence by the netfront interface provided by Xen to the DomU guest. We have written a generic ioctl wrapper function, the address of which is assigned to the do_ioctl member of the netfront interface. Generally the ifr_data field points to a structure or a buffer with an arbitrary amount of data.
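The dispatch rule just described can be sketched in a few lines of Python. SIOCDEVPRIVATE is 0x89F0 in <linux/sockios.h>; the FakeNetfront class and the dict standing in for struct ifreq are made-up illustrations, not kernel code.

```python
SIOCDEVPRIVATE = 0x89F0   # first device-private command (<linux/sockios.h>)

def sock_ioctl_dispatch(dev, cmd, ifr):
    # Sketch of the dispatch rule: the 16 commands SIOCDEVPRIVATE through
    # SIOCDEVPRIVATE+15 are handed to the interface's own do_ioctl, which
    # may both read and overwrite the ifreq to return data.
    if SIOCDEVPRIVATE <= cmd <= SIOCDEVPRIVATE + 15:
        return dev.do_ioctl(ifr, cmd)
    raise NotImplementedError("standard command; handled elsewhere")

class FakeNetfront:
    # Stand-in for a driver: it reports which private command it handled
    # through ifr_data, the way a real driver returns data in ifreq.
    def do_ioctl(self, ifr, cmd):
        ifr["ifr_data"] = "handled cmd 0x%x" % cmd
        return 0

ifr = {"ifr_name": "eth0", "ifr_data": None}
status = sock_ioctl_dispatch(FakeNetfront(), SIOCDEVPRIVATE + 2, ifr)
print(status, ifr["ifr_data"])   # 0 handled cmd 0x89f2
```

In the kernel the same ifreq copy-in/copy-out happens around dev->do_ioctl, which is what lets private commands carry data in both directions.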
If it points to a structure, the size of the buffer can be derived from the structure definition. But if it points to arbitrary data, the method generally followed by driver developers to communicate the size is to encode it in the first 4 bytes of the buffer: the driver first reads the size from the buffer and then reads the rest of the buffer. A summary of our implementation is as follows. The specification of the non-standard functionalities implemented by the real network driver is provided as a list of the private ioctls it implements (indicated by the command number, which lies between SIOCDEVPRIVATE and SIOCDEVPRIVATE+15), together with the sizes of the buffers pointed to by the ifr_data field. If the buffer points to an arbitrary amount of data, this is encoded in the specification by putting -1 in the size field. A script reads the list of ioctls and buffer sizes from the specification and creates corresponding fields in the XenStore in the directory /local/domain/<domid>/ioctl/, which has been created beforehand to house the ioctls. In the entry for each ioctl, entries are created for the input and return values and the return status. When a private ioctl is invoked from a DomU, the ioctl wrapper function reads the size field from the corresponding entry. It then reads the whole struct ifreq and starts a transaction to write the ifreq structure to the input entry under that ioctl in the XenStore. For each ioctl we have also included a return_ready entry in the XenStore: a Boolean entry which indicates whether the return field is ready, i.e. whether the netback driver has written the ioctl return value into the return entry of that ioctl. After writing the input ifreq to the input entry, the netfront ioctl wrapper writes 0 to the return_ready entry and ends the transaction. It then enters a loop to poll the entry's status. The netback driver registers a watch on the ioctl directory at the time of its initialization, in the netback_init function. This watch is triggered when the netfront do_ioctl writes the ioctl input to the XenStore. The watch handler reads the ifreq structure from the XenStore and calls the real device driver's do_ioctl function. It invokes the dev_get_by_name function, passing it the name of the real network interface, "peth0" (the real interface is renamed from eth0 to peth0 when xend brings up the network bridge). This function returns the net_device structure for the real network interface, on which do_ioctl is then invoked. The return value of the ioctl is passed in the same ifreq structure which was passed to it. This structure is then written back by the watch handler to the return entry under the ioctl in the XenStore, and the return status is written in its entry. The watch handler then toggles the return_ready value under that ioctl in the XenStore. As soon as return_ready becomes one, the netfront ioctl wrapper comes out of the loop and starts a transaction to read the return status.
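The handshake up to this point (the tool loads the spec, the frontend publishes the input, the backend's watch runs the real ioctl and flips return_ready, the frontend polls) can be simulated end-to-end. The sketch below is a toy Python model: a plain dict stands in for XenStore, transactions and event delivery are elided, and the real driver's do_ioctl is mocked as an uppercase transform. The entry names (size, input, return, status, return_ready) follow the text; domain id 3 and the ioctl numbers are arbitrary.

```python
IOCTL_DIR = "/local/domain/3/ioctl"   # created beforehand by the tool
store = {}                            # toy stand-in for XenStore

def tool_load_spec(spec):
    # The spec lists private ioctl command numbers and ifr_data buffer
    # sizes (-1 = size encoded in the first 4 bytes of the buffer).
    for cmd, size in spec:
        base = "%s/0x%x" % (IOCTL_DIR, cmd)
        store[base + "/size"] = str(size)
        store[base + "/return_ready"] = "0"

def netback_watch_fired(cmd):
    # Backend side: read the forwarded ifreq, invoke the real driver's
    # do_ioctl (mocked here), write back result and status, flip the flag.
    base = "%s/0x%x" % (IOCTL_DIR, cmd)
    ifreq = store[base + "/input"]
    store[base + "/return"] = ifreq.upper()   # mock of the real do_ioctl
    store[base + "/status"] = "0"             # 0 = success
    store[base + "/return_ready"] = "1"

def netfront_do_ioctl(cmd, ifreq):
    # Frontend side: write the input (a transaction in the real code),
    # clear return_ready, then poll until the backend has answered.
    base = "%s/0x%x" % (IOCTL_DIR, cmd)
    store[base + "/input"] = ifreq
    store[base + "/return_ready"] = "0"
    netback_watch_fired(cmd)                  # in Xen: the watch fires in Dom0
    while store[base + "/return_ready"] != "1":
        pass                                  # in Xen: netfront polls here
    return int(store[base + "/status"]), store[base + "/return"]

tool_load_spec([(0x89F1, 64), (0x89F2, -1)])
status, ret = netfront_do_ioctl(0x89F1, "get-stats")
print(status, ret)    # 0 GET-STATS
```

In the real implementation the call to netback_watch_fired is replaced by XenStore's watch notification crossing the Dom0/DomU boundary, which is what turns the local ioctl into a remote call.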
If the status indicates that the ioctl call was a success, it reads the return value, writes it back to the buffer pointed to by ifreq, ends the transaction, and returns the status.

Permissions:
DomU guests can be allowed or disallowed to invoke ioctls using the XenStore permissions API. These permissions can be set from tools, which are perl/python scripts or C programs, in Dom0. When we wish to disallow a particular DomU from invoking a particular ioctl, we can revoke its read and write permissions on the path /local/domain/<domain-id>/ioctl/<ioctl-number>. This can be done by invoking the following function from a python script in Dom0:
xstransact.SetPermissions(path, { 'dom' : dom1, 'read' : False, 'write' : False })
The corresponding C function is xs_set_permissions. The first entry denotes the domain whose read/write privileges are being revoked, and path is the path of the ioctl entry in the XenStore. When permission is to be granted again, the same function can be invoked with True in the read and write fields.

Conclusion:
Our method automates the process of exposing non-generic functionality implemented by the real network driver in the form of device private ioctls. XenStore provides the infrastructure to encode the ioctl information and the input and return values, and the transactions API ensures that the reads and writes are consistent and atomic. Since reads and writes to the XenStore are essentially file I/O, they are slower than the shared-memory data transfer mechanism between the frontend and backend; but using that mechanism to transport the input and output of the ioctl would have required major changes to the Xen networking code. In principle, XenStore is a mechanism for storing configuration information, and ioctls are also generally used for configuration purposes. Moreover, since ioctls do not perform data transfer themselves, their execution is not as time-critical as that of data transmit and receive. Newer NICs provide hardware support for virtualization which enables the NIC to be shared between different VMs efficiently and safely. Various efforts are underway to enable Xen to use the large and evolving set of functionalities provided by newer NICs. The Xen netchannel2 protocol, along with a high-level network I/O virtualization management system, is being developed to address this need. The manager would relieve users of the need to make decisions and configurations that are customized to the underlying hardware capabilities.
Instead, the manager would allow users to specify policies at a high level and would then determine the appropriate low-level configurations, specific to the particular hardware environment, that implement those policies. Thus, the manager would provide a clean separation between user-relevant policies and the hardware and software mechanisms used to implement them.