Memory Virtualization in Database Systems

  • 1. Memory Virtualization in Database Systems Angelos Anastasopoulos Master of Science School of Informatics University of Edinburgh August 2008
  • 2. ABSTRACT Virtualization is a technology that is rapidly transforming the IT landscape and fundamentally changing the way people compute. Its benefits are increasingly appealing now that high quality of service and error-free operation are required of every system. Much work has been done on virtual servers, but database virtualization is still an open area. This work addresses an existing problem in Xcalibre's FlexiScale software: providing a transparent, online scalable database system. Database virtualization can be achieved by implementing a virtualized filesystem, i.e., by providing a global namespace that unifies the storage disks of all running virtual servers.
  • 3. ACKNOWLEDGEMENTS First and foremost, I would like to thank my supervisor Dr. S. Viglas for his constant guidance and support. I would also like to express my gratitude to the experienced people of Xcalibre, and especially to Mr. G. Munasinghe, for their invaluable help.
  • 4. DECLARATION I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. (Angelos Anastasopoulos) III
  • 5. TABLE OF CONTENTS
    1 Introduction
      1.1 Background
        1.1.1 Single to Many Virtualization
        1.1.2 Many to Single Virtualization
      1.2 Virtualization Benefits
      1.3 Related Work
        1.3.1 Amazon EC2
        1.3.2 AppLogic
        1.3.3 XCalibre FlexiScale
      1.4 Proposal - Project Aims
      1.5 Specifications
    2 System Design
      2.1 Current System
      2.2 Possible Solutions
        2.2.1 Modifying the Database System
        2.2.2 File System in Kernel Space
        2.2.3 File System in User Space
  • 6.
      2.3 Our Approach
        2.3.1 Overview
        2.3.2 Our Implementation in Greater Detail
      2.4 Graphical User Interface
    3 Evaluation
      3.1 Measuring TPS
        3.1.1 TPS-B
        3.1.2 Selections Only
        3.1.3 Discussion
      3.2 Adding and Removing a Server
        3.2.1 Discussion
      3.3 Clustered vs Unclustered Data
        3.3.1 Discussion
    4 Conclusions
        4.1.1 Future Recommendations
    References
  • 7. Chapter I Introduction The idea of virtualization is well known in both academia and industry. Virtualization is a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources [1]. As computing becomes more and more distributed, having a pool of resources transparently available to the end user has become a de facto requirement for every distributed system. There are many different kinds of virtualization. The main idea is to make a single physical resource appear to function as multiple logical resources, or to make multiple physical resources appear as a single resource. In either case, the end user should not need to care whether the physical resource being used resides in his personal environment (personal computer) or elsewhere in a network. Virtualization essentially lets one computer do the job of multiple computers by sharing the resources of a single computer across multiple environments. This study explores ways of offering an online scalable database by implementing virtualization in a database system. The work was done in collaboration with Xcalibre [2], a leading UK hosting provider, which provided all the required resources. The proposed solution is the design and implementation of a virtualized file system which the database system uses to store its data. The main advantage of developing a virtualized file system is its generality in terms of
  • 8. virtualization, as virtualization concerns not only database tables but ordinary files as well. The developed virtualized file system can interact with any application, i.e. any database system that the end user selects. However, this work has been tested with PostgreSQL [3], a well-established open source database system. 1.1 Background There are various virtualization technologies. The idea of the virtual machine first appeared in the 1960s in the experimental IBM M44/44X system, in which the operating system uses the computing machine to simulate multiple copies of the machine itself [4]. In [5], the different kinds of virtualization encountered in various IT environments are presented. 1.1.1 Single to Many Virtualization This category includes virtualization technologies where a single physical resource appears to function as multiple logical resources. According to the resource that is virtualized (such as a server, an operating system, an application, or storage device) there are different types of virtualization: ● Operating System Virtualization: where multiple logical (or virtual) operating systems (aka "guests") run on top of a fully functioning base (or "host") operating system. This method of virtualization usually uses a standard operating system such as Windows or Linux as the host, plus a virtual machine manager, to run multiple guest operating systems. Some vendors and products providing this type of virtualization include Microsoft Virtual Server, SWSoft Virtuozzo, Parallels Workstation/Desktop, Linux jails, and Sun Solaris containers. ● Server Virtualization (also known as "system virtualization" or "native virtualization"): where multiple virtual operating systems run
  • 9. directly on top of the hardware without an intervening operating system. Typically, virtualization software will run directly on the base hardware, and the operating systems will be installed onto that virtualization software. So called "paravirtualization" is (arguably) a subset of server virtualization that provides a thin interface to run between the base hardware and a modified guest operating system. Examples of server virtualization include VMware, ESX Server, and Xen. Server virtualization facilitates a rapid – or even automatic – restart of applications after a software failure. When used in conjunction with data replication between data centres, it can restart applications at a recovery site following a primary site failure. ● Application Virtualization: where an application is provided to the end- user, generally from a remote location (such as a central server), without needing to completely install this application on the user's local system. Unlike traditional client-server operations, each user has an isolated, fully functional application environment, sharing few if any components with other users. Examples of this include Citrix Presentation Server, Thinstall Virtualization Suite, and Altiris Software Virtualization Solution. ● Desktop Virtualization: where remote access to a complete desktop environment allows access to any authorized application, regardless of where the application is actually located. Examples of this include Microsoft Terminal Services, VMware Virtual Desktop, and Kidaro Managed Workspace. ● Software Streaming: essentially a subset of other virtualization technologies that provides a way for software components - including applications, desktops, and even complete operating systems - to be dynamically delivered from a central location to the end-user over the network. A user can start using streaming software before the entire download has completed, much like video streaming without a complex 3
  • 10. and lengthy installation process. Examples of this technology include AppStream, Ardence (acquired in December 2006 by Citrix), and Microsoft SoftGrid. ● Storage Virtualization: a way for many users or applications to access storage without being concerned about where or how that storage is physically located or managed. Typically storage virtualization applies to larger SAN or NAS arrays, but it is just as accurately applied to the logical partitioning of a local desktop hard drive. Examples include a range of hardware, software, and appliance solutions from IBM, EMC, Network Appliance, and others. ● Data Virtualization: Data virtualization abstracts the source of individual data items – including entire files, database contents, document metadata, messaging information, and more – and provides a common data access layer for different data access methods – such as SQL, XML, JDBC, File access, MQ, JMS, etc. This common data access layer interprets calls from any application using a single protocol, and translates the application request to the specific protocols required to store and retrieve data from any supported data storage method. This allows applications to access data with a single methodology, regardless of how or where the data is actually stored. Fig. 1.1 Data virtualization [4] 4
  • 11. ● Software as a Service (SAAS): is an implementation of virtualization, where software is provided by an external application service provider (ASP), normally on a usage basis. Typically, the end user will access the software service through a Web browser and, in some cases, specialized software may still be required. The complete software application is not hosted locally, or even within the enterprise, but is hosted at a third-party service provider. ● Thin Client: is a local system that has limited or no independent processing, storage, or peripherals of its own, relying entirely on a remote system for virtually all operations. Typically, a thin client will have limited local processing that allows it to merely perform I/O to a central server, which hosts the operating system, desktop, and applications. 1.1.2 Many to Single Virtualization This category includes virtualization technologies where many physical resources appear as a single resource. ● Clustering: A cluster is a form of virtualization that makes several locally- attached physical systems appear to the application and end users as a single processing resource. A typical use case for clustering is to group a number of identical physical servers to provide distributed processing power for high- volume applications, or as a “Web farm”, which is a collection of Web servers that can all handle load for a Web-based application. ● Grid Computing: Like a cluster, a grid provides a way to abstract multiple physical servers from the application they are running. The major difference is that the computing resources are normally spread out over a wide network, potentially across the Internet, and the physical servers that comprise a grid do not have to be identical. Unlike a cluster, where each server is locally connected, is likely to be identical, and can handle the same processing requirements, a grid is made up of heterogeneous systems, in diverse 5
  • 12. locations, each of which may specialize in a particular processing capability. Much greater coordination is needed to allocate the resources to appropriate workloads. 1.2 Virtualization Benefits The benefits of virtualization techniques are increasingly appealing now that high quality of service and error-free operation are required of every system. Virtualization allows easier software migration, including system backup and recovery, which makes it extremely valuable as a Disaster Recovery (DR) or Business Continuity Planning (BCP) solution. Virtualization can duplicate critical servers, so IT does not need to maintain expensive physical duplicates of every piece of hardware for DR purposes. DR systems can even run on dissimilar hardware. In addition, virtualization reduces downtime for maintenance, as a virtual image can be migrated from one physical device to another to maintain availability while maintenance is performed on the original physical server. This applies equally to servers and desktops, or even mobile devices: virtualization allows workers to remain productive and get back online faster when their hardware fails. Other benefits of virtualization include business agility and flexibility (virtualization enables IT to respond rapidly to on-demand changes of the system), server consolidation (which improves server utilization by distributing the workload), reduced downtime (virtual images are easier to restore after a failure), and reduced software and hardware costs. 1.3 Related Work A lot of work has been done in operating system and server virtualization, where numerous software products are widely used. The first successful virtualization software package was released by VMware in the late 1990s. VMware Workstation
  • 13. allows users to run multiple instances of x86 or x86-64 compatible operating systems on a single physical PC, without requiring any changes to processors or operating systems [6]. Other well-known x86 virtualization products are Parallels, Microsoft Virtual PC, QEMU+KQEMU, and VirtualBox. Most virtualization environments enable the end user to run multiple operating systems and multiple applications on the same computer at the same time, increasing the utilization and flexibility of hardware [7]. 1.3.1 Amazon EC2 Amazon Elastic Compute Cloud (Amazon EC2) [8] is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. The main idea of Amazon EC2 is to enable end users to rent and use only the resources that they really need at any given point in time. The "Elastic" nature of the service allows developers to instantly scale to meet spikes in traffic or demand. When computing requirements unexpectedly change (up or down), Amazon EC2 can instantly respond, meaning that developers have the ability to control how many resources are in use at any given point in time. In contrast, traditional hosting services generally provide a fixed number of resources for a fixed amount of time, meaning that users have a limited ability to easily respond when their usage is rapidly changing, is unpredictable, or is known to experience large peaks at various intervals. The end user interacts with EC2 by running his application on a virtual machine, for which he selects the memory, CPU, and instance storage that are optimal for his application.
  • 14. Fig. 1.2 Amazon EC2 end user web interface [8] 1.3.2 AppLogic AppLogic [9] is a grid operating system designed to enable utility computing for web applications. It uses advanced virtualization technologies to ensure complete compatibility with existing operating systems, middleware and applications. As a result, AppLogic makes it easy to move existing web applications onto a grid without modifications. Fig. 1.3 The architecture of AppLogic [9] 8
  • 15. 1.3.3 XCalibre FlexiScale Following the idea of Amazon EC2, FlexiScale [10], developed by XCalibre, is a web service which enables the end user to utilize computing power on demand. The virtual machine, called an instance in EC2, is called a Virtual Dedicated Server (VDS) and enables the user to specify his requirements in memory and storage capacity, as well as his desired operating system, at any given point in time. Fig. 1.4 FlexiScale’s virtual dedicated server [10] The above web services are able to adjust their functionality to the needs of the end user in terms of memory, CPU usage and storage capacity. Things become more complex when the end user requires virtualization to be done on a database system.
  • 16. 1.4 Proposal - Project Aims At present, no service offers an online scalable database. This study explores ways of implementing virtualization in a database system. The foremost obstacle in making a database system scalable is the mapping from the logical, virtual (user-visible) address space to the physical address space. The end user need not have any knowledge of the physical address where the tables of the database are actually stored, as the physical address can change according to the user's storage demands; this is what provides transparency. In other words, the database system should be totally unaware of the underlying file system and of where its data tables are stored. The end user may boot up or kill any server (thereby changing the overall available storage capacity) at any time without the database system noticing. With the standard virtual disks provisioned with each FlexiScale VDS, it is not possible to mount the same disks in multiple servers simultaneously, or to have multiple disks mounted to a single server. Each FlexiScale VDS can be considered an autonomous, independent machine with its own CPU, memory, operating system and storage capacity. The main contribution of this work is to overcome this limitation by enabling any service running on any of the VDSs that an end user has booted up to use any resource from any VDS at his disposal. The developed virtualized file system enables data sharing between servers. To put it differently, the end user is aware of only a single mount point and totally unaware of where data is actually stored: on which storage device, on which server. At this point we should clarify the terms virtualization and transparency. Virtualization makes a resource visible to the end user although this resource does not really exist. On the other hand, a transparent resource exists physically but is invisible to the end user, hidden behind an abstraction layer. This study combines virtualization with transparency.
The virtual layer is responsible for giving the user a single view of the entire storage layer by providing him with a common address space, a single mount point. The user’s space of logical addresses is both virtual, in
  • 17. the sense that it does not really exist, and transparent, as each storage disk has its own memory addresses which are invisible to the user. 1.5 Specifications The developed virtualized file system should meet the following specifications, as assigned by XCalibre: ● high transparency to the end user, i.e., the end user is not aware of the physical address at which the data of any application or service (database) resides. ● on-demand scaling of the database without the database server having to be shut down. ● a generic approach suitable for any application and therefore any database system. ● the system is responsible for adapting to any change in the environment without being noticed by the running applications. As far as the operating system on which the virtualized file system is built is concerned, Linux meets our expectations as it is open source. However, it should be noted that the virtualized file system could also run on a virtual Linux operating system hosted by Windows using any operating system virtualization software.
  • 18. Chapter II System Design This chapter describes how FlexiScale handles on-demand scaling of the system in terms of storage capacity, and reviews possible solutions to the problem of database virtualization. Finally, our approach and system design are presented, including the Graphical User Interface. 2.1 Current System At present, the user of FlexiScale can create a new Virtual Dedicated Server in less than a minute, change the VDS parameters according to his requirements in memory, operating system, and storage capacity on the fly and on demand, and automatically recover from a physical server failure. When the user adds a new VDS, a new autonomous server is created. The added VDS has its own storage system, which is completely independent of the storage system of any other VDS currently in use. In other words, each VDS has its own physical namespace.
  • 19. Fig. 2.1 Current System: File_1 which resides in VDS 1 is different from File_1 in VDS 2 The absence of a single namespace imposes limitations when the user wants to add a new server that will use data from older servers. For instance, if a user has booted up a VDS with 40GB of storage capacity (with 35GB used) and needs to increase his storage capacity to 60GB, a new VDS is created with the desired storage capacity. This will be a replica of the initial server (leaving 60GB - 35GB = 25GB of free space). It is not possible to just add a new VDS with 20GB of storage capacity, as the new VDS would be unaware of the 35GB of data stored in the initial server. Any process running on the initial VDS has to be stopped and must wait until the replication procedure has terminated. Due to the absence of a global namespace, an application running on one VDS cannot use the data stored on another. Fig. 2.2 Current System: Adding a new VDS
  • 20. 2.2 Possible Solutions Possible solutions to the problem of database virtualization include either implementing a virtual layer inside the database system, responsible for mapping the logical addresses used by the storage layer of the database system to the physical addresses where data resides, or implementing a virtual file system in kernel space or in user space. In the former case the way the database interacts with the underlying file system has to be modified, whereas in the latter the underlying database remains unchanged. This work concentrates on the latter case because the former is not generic, as will be shown shortly. 2.2.1 Modifying the Database System In this approach the virtual layer is embedded in the database system. The embedded virtual layer interacts with the storage layer of the database, which can be distributed over multiple VDSs. Therefore the database system, apart from executing queries, is responsible for mapping logical to physical addresses. Fig. 2.3 Modifying the database system (users 1 to n; PostgreSQL; virtual layer mapping the user's logical address to a physical address; file system; storage disks 1 to n) The database engine knows where each table of the database is physically stored and maps the user’s logical address appropriately. When the database system
  • 21. (PostgreSQL) interacts with the user, it uses logical addresses, whereas when interacting with the underlying file system it uses physical addresses. The main drawback of this approach is that the resulting system would work only for the particular database system for which it was implemented. Moreover, if the database is not open source, modifying it would be impossible. Therefore, modifying the database system would result in a static solution which would not meet the objectives of XCalibre. 2.2.2 File System in Kernel Space Figure 2.4 depicts the Linux architecture in the most general terms [11]. User-level programs communicate with the kernel using system calls. When a user process executes a system call, it changes its execution mode from user mode to kernel mode. In kernel mode, while executing the system call, the process has access to the kernel address space. Fig. 2.4 Linux Architecture [11]
  • 22. For the purposes of our work we will concentrate on device drivers, which are the software interface to an I/O device. A device driver is a collection of subroutines which are called when the kernel recognizes that a particular action should be taken by a particular device [12]. A new file system can be implemented as a character device driver either by recompiling the kernel and loading a new kernel image containing the implemented device, or by loading the driver as a kernel module. Fig. 2.5 Linux system and device driver relationship [12] 2.2.3 File System in User Space Some years ago, before the advent of user space filesystems, filesystem development was the job of the kernel developer. Creating filesystems required knowledge of kernel programming and of kernel technologies (like VFS). Filesystem in Userspace (FUSE) is a loadable kernel module for Unix-like operating systems that allows non-privileged users to create their own file systems without editing the kernel code. This is achieved by running the file system code in user space, while the FUSE module only provides a "bridge" to the actual kernel interfaces. FUSE was officially merged into the mainline Linux kernel tree in kernel version 2.6.14. Released under the terms of the GNU General Public License and the GNU Lesser General Public License, FUSE is free software. The FUSE system was originally part of A Virtual Filesystem (AVFS), but has since split off into its own project.
  • 23. FUSE is available for Linux, FreeBSD, NetBSD (as PUFFS), OpenSolaris and Mac OS X [13]. FUSE is particularly useful for writing virtual file systems. Unlike traditional filesystems, which essentially save data to and retrieve data from disk, virtual filesystems do not actually store data themselves. They act as a view or translation of an existing filesystem or storage device. In principle, any resource available to a FUSE implementation can be exported as a file system. With FUSE [14] it is possible to implement a fully functional filesystem as a userspace program with the following features: ● Simple library API ● Simple installation (no need to patch or recompile the kernel) ● Secure implementation ● Userspace - kernel interface is very efficient ● Usable by non privileged users ● Runs on Linux kernels 2.4.X and 2.6.X ● Has proven very stable over time Figure 2.6 shows the path of a filesystem call (e.g. stat). The FUSE kernel module and the FUSE library communicate via a special file descriptor which is obtained by opening /dev/fuse. This file can be opened multiple times, and the obtained file descriptor is passed to the mount syscall, to match up the descriptor with the mounted filesystem. 17
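From the application's side nothing changes when a path happens to be served by a FUSE filesystem: the program issues the same stat call, and the kernel decides which filesystem answers it. The following is a minimal sketch of that application-side view only, not of a FUSE implementation; the helper name `file_size` is hypothetical, and the file lives on whatever filesystem backs the temporary directory.

```python
import os
import tempfile


def file_size(path: str) -> int:
    """Return the size of `path` in bytes using a stat-family system call."""
    # os.stat asks the kernel for the file's metadata. The kernel forwards
    # the request to whichever filesystem owns the path; the caller cannot
    # tell whether that is a disk filesystem or a userspace one.
    return os.stat(path).st_size


# Create a small file, query it, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    demo_path = f.name

demo_size = file_size(demo_path)
os.remove(demo_path)
```

Because the call site is identical in both cases, an application such as a database server needs no modification to run on top of a userspace filesystem.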
  • 24. Fig. 2.6 FUSE flow-chart diagram [14] 2.3 Our Approach 2.3.1 Overview The key idea of our approach is the implementation of file virtualization, which enables multiple physical storage devices, each with its own physical address namespace, to be treated as a single virtual storage device with a global namespace. Each application, no matter which VDS it runs on, is able to use (read, modify, delete) any data created by any application running on any VDS, wherever that data is actually stored. Each VDS has a single mount point which enables its applications to access data located on the storage devices of other VDSs.
  • 25. Fig. 2.7 Global namespace In Figure 2.7 the user has booted up two VDSs with 20 GB of storage capacity each. An application running on VDS 1 requests to read File_1. The application is totally unaware of where File_1 is actually stored. The application passes its request to the virtual filesystem, where the mapping from virtual to physical address is done. The virtual filesystem returns the requested file to the application, although the file is actually stored on a different VDS (VDS 2) from the one on which the application runs. Moreover, every application running on any VDS sees the overall capacity of the system as 40 GB (i.e. the sum of the storage capacities of the running VDSs). As shown in Figure 2.8, files can be moved from one server to another without affecting file paths, which makes it easy to balance load across the underlying storage devices. The virtual address of File_1 stays fixed while its physical address changes, so no change is needed in the application using the file. Our approach succeeds in making the various VDSs available to the user appear to use a single file system.
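The behaviour described for Figures 2.7 and 2.8 can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the classes `VDS` and `GlobalNamespace`, the central lookup table, and the `/export` prefix are all hypothetical and are not how the implemented file system stores its mapping; they only show a virtual path staying fixed while the physical location changes.

```python
from dataclasses import dataclass, field


@dataclass
class VDS:
    """Hypothetical model of one server and its local storage."""
    name: str
    capacity_gb: int
    files: dict = field(default_factory=dict)  # physical path -> contents


class GlobalNamespace:
    """Hypothetical table mapping virtual paths to physical locations."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}
        self.table = {}  # virtual path -> (server name, physical path)

    def total_capacity_gb(self):
        # Every application sees the sum of all server capacities.
        return sum(s.capacity_gb for s in self.servers.values())

    def create(self, vpath, data, server):
        ppath = "/export" + vpath  # assumed physical layout
        self.servers[server].files[ppath] = data
        self.table[vpath] = (server, ppath)

    def read(self, vpath):
        # The caller supplies only the virtual path.
        server, ppath = self.table[vpath]
        return self.servers[server].files[ppath]

    def move(self, vpath, target):
        # Relocate the data; the virtual path is left untouched.
        server, ppath = self.table[vpath]
        data = self.servers[server].files.pop(ppath)
        self.servers[target].files[ppath] = data
        self.table[vpath] = (target, ppath)

    def locate(self, vpath):
        return self.table[vpath][0]


ns = GlobalNamespace([VDS("VDS 1", 20), VDS("VDS 2", 20)])
ns.create("/File_1", b"tuples", server="VDS 2")
before = ns.read("/File_1")   # read while the file is on VDS 2
ns.move("/File_1", "VDS 1")   # physical location changes
after = ns.read("/File_1")    # same virtual path, same contents
```

The application-facing calls (`read` with a virtual path) are identical before and after the move, which is the property the text relies on for load balancing.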
  • 26. Fig. 2.8 Moving files from one VDS to another This approach treats the database system as an ordinary application which stores its data in the virtual filesystem. The virtual filesystem handles every file in the same way, whether it contains tuples or other data. Fig. 2.9 PostgreSQL running and data distribution The implemented file system utilizes GlusterFS [15], free software released under the GNU GPL v3 license, which is a FUSE project. GlusterFS is a clustered file system which aggregates various storage bricks (storage nodes) over an Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS uses
  • 27. translators, which are binary shared objects (.so) loaded at run-time. The idea of translators is borrowed from the GNU/Hurd [16] operating system. A translator is a program that is inserted between the actual content of a file and the user accessing this file, and that can process incoming requests in many different ways. From the kernel's point of view, translators are just another user process (they run in user space). Figure 2.10 depicts a system with two storage nodes running on the same machine (localhost). Fig. 2.10 GlusterFS: server and client volume specification files for two bricks running on localhost [15] 2.3.2 Our Implementation in Greater Detail This section describes our implementation in greater detail and concentrates on how the implemented file system handles the files created by PostgreSQL. However, as already remarked, any application is handled in the same manner. PostgreSQL runs as a service on a single VDS, referred to as the main VDS. It should be noted that all VDSs are equivalent and therefore any VDS could play the
  • 28. role of the main VDS. Each VDS is a brick according to GlusterFS terminology and communicates with other bricks over TCP/IP. The IP address of the main VDS is XCalibre supplied us with five VDSs in total ( A brick is a server-brick when it accepts and stores files which are created by processes running on other bricks, the clients. Each VDS-brick reacts at the same time as both a server and client brick to other VDSs allowing files created by any VDS to reside in any available VDS. As it will be shown shortly, this property is significant when a server is removed. As we have already said, the virtual file system is treated by each VDS as a single filesystem with a global namespace. Therefore, it could be possible that two different VDSs would create simultaneously a file with the same name. Such a thing would be disastrous for any filesystem. In order to overcome such an unpleasant situation, the main VDS, except from running PostGreSQL, is responsible for keeping track of every file name that is already in use. When a VDS creates a new file it should acquire the lock (semaphores are used) of the directory in which it desires to store the file, and only then carry out the creation operation. The global namespace is located in ~/gf_exports/export-namespace directory. There is a single mount point (~/mount) available to each VDS which can been considered as a gate to the virtual filesystem, as it connects the running VDSs. Listing the files of the mount directory will return all files that are handled by the virtual filesystem and not just the files that reside in the specific VDS. In other words, an ls command on the mount directory returns the same result no matter on which VDS is run. The information of which files reside on a VDS is located in ~/gf_exports/export. To be more precise, all files are actually stored in an export directory. 
The mount directory is just an image of all exported directories of the running VDSs (its entries can be thought of as soft links into the exported directories). In other words, if the user unmounts the mount point and then performs an ls in the directory, no files are returned. When it is re-mounted, all data is visible in the mount directory again.
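The create-under-lock rule described above (acquire the directory lock, check the name, only then create) can be sketched as follows. This is a minimal single-process illustration, not the thesis code: `NamespaceRegistry` and its methods are hypothetical names, and `threading.Lock` stands in for the semaphores that the main VDS is said to use across machines.

```python
import threading
from collections import defaultdict


class NamespaceRegistry:
    """Hypothetical stand-in for the file-name bookkeeping done by the main VDS."""

    def __init__(self):
        self._guard = threading.Lock()                  # protects the lock table itself
        self._dir_locks = defaultdict(threading.Lock)   # one lock per directory
        self._owners = defaultdict(dict)                # directory -> {file name: creating VDS}

    def _lock_for(self, directory):
        with self._guard:
            return self._dir_locks[directory]

    def create(self, directory, name, vds):
        """Reserve `name` in `directory` for `vds`; return False if the name is taken."""
        with self._lock_for(directory):          # step 1: acquire the directory lock
            if name in self._owners[directory]:  # step 2: refuse duplicate names
                return False
            self._owners[directory][name] = vds  # step 3: only now record the creation
            return True

    def owner(self, directory, name):
        """Return the VDS that created the file, or None if the name is free."""
        return self._owners[directory].get(name)
```

Two VDSs racing to create the same name in the same directory are serialized by the directory lock, so exactly one of them succeeds; the same name in a different directory is unaffected.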
  • 29. Fig. 2.11 Configuration of filesystem
    In the default configuration of GlusterFS, each brick acts either as a client or as a server to other bricks. Following this naïve approach would result in a system in which one VDS (the main) would only be able to store files on other VDSs. Consequently, the running VDSs would not be equivalent, as only the main VDS could run processes that store data on other VDSs. In other words, all mount directories located on server bricks (that is, on every VDS except the main one) would be read-only. However, such a system would still be able to provide the required virtualized filesystem to the database system, as the database runs on a single VDS. On the one hand, the main benefit of this approach is its simplicity and the clarity of how the files are distributed across the VDSs. Listing the mount directory would return only the files located in this particular brick and not in the whole filesystem. On the other hand, the main drawback appears when a VDS must be removed, i.e., when the database shrinks. Any files stored on the removed VDS must be transferred to the remaining VDSs. In the naïve approach there is no way to send files to the other VDSs through the filesystem, as the mount directory is read-only. Therefore, files would have to be sent through a separate connection that does not use the virtualized filesystem (for instance, through ftp or ssh). In our implementation each brick is both a client and a server at the same time. When a server is selected for removal, all its files are re-distributed through the virtualized filesystem, resulting in better performance.
  • 30. When a new VDS is added to the system there are two options available: i) do not distribute any existing files and use the newly added VDS only for files created from then on, or ii) re-distribute all files. In the former case, the newly added VDS will not host any old files. Such a system is useful to a user who would like an extra VDS for a short time, perhaps for debugging. For instance, the user may want to test a new index structure in PostgreSQL. He/she could add a new VDS to host PostgreSQL's files for the evaluation tests and then remove the VDS without affecting the older system. However, our implementation adopts the latter option: all existing files are re-distributed, which results in load-balanced VDSs.
    2.4 Graphical User Interface
    Figure 2.12 is a screenshot of the Control Panel from which the end user can manage the storage servers. The end user has complete control over the VDSs at his/her disposal and can manage them with a single click. He/she can start the virtual file system by specifying the number of VDSs to include and the scheduler, which determines how created files are distributed to the underlying servers. The available scheduler options are:
    ● round-robin (rr): creates files in a round-robin fashion. Each VDS has its own round-robin loop. This scheduler is a good choice when files are mostly similar in size, which results in load-balanced VDSs.
    ● random: files are stored randomly on any VDS.
    ● nufa: the Non-Uniform Filesystem Scheduler gives the local system priority for file creation over the other VDSs. If there is enough available space, files are stored locally (on the creator VDS). Otherwise, the remaining VDSs are used in a round-robin fashion.
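The three scheduler options listed above can be illustrated with a small sketch. It is written from the descriptions in the text only and is not GlusterFS code; the names `round_robin`, `pick_random`, and `NufaScheduler`, and the `free_bytes` mapping, are assumptions made for the example.

```python
import itertools
import random


def round_robin(servers):
    """'rr': return an endless cycle over the servers; each VDS keeps its own cycle."""
    return itertools.cycle(servers)


def pick_random(servers, rng=random):
    """'random': any server may receive the new file."""
    return rng.choice(servers)


class NufaScheduler:
    """'nufa': prefer the creating (local) VDS, else round-robin over the others.

    Assumes at least one server other than `local` exists.
    """

    def __init__(self, local, servers):
        self.local = local
        self._others = itertools.cycle([s for s in servers if s != local])

    def pick(self, free_bytes, file_size):
        # Store locally while the local VDS has enough free space for the file.
        if free_bytes[self.local] >= file_size:
            return self.local
        # Otherwise fall back to the remaining VDSs in round-robin order.
        return next(self._others)
```

With equally sized files, the round-robin cycle places the same number of files on every server, which is the load-balancing property noted for the rr option.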
  • 31. Fig. 2.12 GUI-Control Panel
    The end user can add any VDS to, or remove it from, the virtual filesystem at any time, on the fly. The applications are not disrupted while the re-distribution of files takes place. However, during the time that the re-configuration takes to come into effect, a file required by a running application may be unavailable. Moreover, the end user can get details about the running servers and locate where each file is actually stored, i.e., on which VDS. Finally, the end user can stop and restart any server.
  • 32. Chapter III Evaluation
    This chapter presents the evaluation of the implemented filesystem when deployed in FlexiScale. We concentrate on the overhead that the virtualized filesystem imposes on the system when PostgreSQL is running. Moreover, we measure the time that it takes to add a VDS to, and remove a VDS from, the developed system. Finally, we examine whether the performance of the database depends on whether the files created by the database are clustered on one VDS or not. Table I describes the system used for these experiments. PostgreSQL is running as a service on the main VDS ( and all VDSs have the same characteristics.
    TABLE I System Characteristics
    CPU: Dual-Core AMD Opteron(tm) Processor 8220
    Memory: 512 MB
    Disk space: 20 GB
    Operating System: Ubuntu 8.04 LTS
    Database: PostgreSQL 8.3.3
  • 33. 3.1. Measuring TPS
    In order to evaluate the performance of the developed filesystem we measured how many transactions per second (TPS) PostgreSQL performs when it runs on various numbers of storage servers (VDSs) and executes various query scenarios. We used the pgbench benchmark [17] to generate source data. PostgreSQL ships with pgbench, its standard benchmarking tool. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second). In our experiments we ran a scenario that is loosely based on TPC-B, involving five SELECT, UPDATE, and INSERT commands per transaction, and a simpler scenario in which SELECT and INSERT commands are issued. Table II shows the tables of the database and Table III the script run by every transaction. In the select scenario the UPDATE commands are not executed. Before each experiment, a TRUNCATE operation is performed in order to free any unused pages from the buffer pool. The total number of transactions was set to 100,000, and different experiments were performed with various scaling factors and numbers of clients. When the scaling factor is set to 1, 10, and 100, the number of tuples in the accounts table is 100,000, 1,000,000, and 10,000,000 respectively.
    TABLE II Tables Used
    Table accounts (Column Type): aid integer; bid integer; abalance integer; filler character(84). Indexes: "accounts_pkey" PRIMARY KEY, btree (aid)
    Table branches (Column Type): bid integer; bbalance integer; filler character(88). Indexes: "branches_pkey" PRIMARY KEY, btree (bid)
    Table tellers (Column Type): tid integer; bid integer; tbalance integer; filler character(84). Indexes: "tellers_pkey" PRIMARY KEY, btree (tid)
    Table history (Column Type): tid integer; bid integer; aid integer; delta integer; mtime timestamp; filler character(22)
  • 34. TABLE III Transaction Script
    BEGIN;
    UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
    SELECT abalance FROM accounts WHERE aid = :aid;
    UPDATE tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
    UPDATE branches SET bbalance = bbalance + :delta WHERE bid = :bid;
    INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
    END;
    3.1.1 TPC-B
    Figure 3.1 shows the results obtained when we ran the TPC-B experiment on different numbers of servers. The blue column represents the results obtained when the experiment was run on localhost without our filesystem deployed. As expected, the values of the blue columns are greater in all cases: the distribution of files through the implemented filesystem imposes overhead, as the files requested by PostgreSQL do not reside solely on localhost. The other columns show the TPS when up to five storage servers (VDSs) are used. In the following charts, S denotes the selected scaling factor and C the number of concurrent clients for each experiment.
  • 35. Fig. 3.1 TPC-B results (bar chart: TPS, 0 to 200, for 1, 2, 3, 4, and 5 servers at s=1 c=1, S=10 c=10, and S=100 c=100; value labels shown in the chart: 188.02, 150.98, 109.96, 81.78, 43.85, 37.61)
    We observe that deploying the virtualized filesystem decreases the TPS of the system by 23% when the scale factor is set to 1 with a single client. Moreover, we notice that the performance of the system is not affected by the addition of further VDSs (the variation of TPS with 2, 3, 4, and 5 servers is less than 5%). When the scaling factor is set to 10 (1,000,000 tuples in the accounts table) with 10 clients, TPS decreases by 24%, whereas with a scaling factor of 100 (10,000,000 tuples) and 100 clients the decrease is 25%. We observe that, while the number of clients increases and the database becomes larger, the overhead imposed on the system can be considered constant.
    3.1.2 Selections Only
    Figure 3.2 shows the results obtained when we ran the Selections Only experiments on different numbers of servers. In these experiments lines 4 and 5 of the script shown in Table III are not executed. The transaction script is simpler than in the TPC-B experiments, resulting in larger TPS values. When the scaling factor is set to 1 with
  • 36. one client, the penalty in performance is less than 11%. As the experiment scenarios become more complex, the overhead that we observe increases (55% TPS decrease for a scaling factor of 10, and 75% for a scaling factor of 100).
    Fig. 3.2 Selections Only results (bar chart: TPS, 0 to 1400, for 1, 2, 3, 4, and 5 servers at s=1 c=1, S=10 c=10, and S=100 c=100; value labels shown in the chart: 1274.34, 1240.99, 802.54, 361.53, 282.97, 76.58)
    3.1.3 Discussion
    The previous experiments showed the impact that deploying our filesystem, with various numbers of servers and different workloads, has on PostgreSQL. All experiments showed that the greatest performance penalty occurs when the virtualized filesystem is introduced into the system, i.e., when one extra server is added to localhost. After that, PostgreSQL's performance remains steady no matter how many servers are used. Moreover, the overhead imposed is constant in TPC-B-like transactions (almost 22%) no matter how many clients are concurrently executing their queries. On the other hand, in simpler queries such as the Selections Only experiments, the performance penalty increases with the number of clients.
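The percentages quoted in this section are relative decreases in TPS. A minimal sketch of the calculation follows; the function name is hypothetical, and the reading of the chart labels 802.54 and 361.53 as the localhost and virtualized-filesystem bars for S=10, c=10 in Fig. 3.2 is an assumption (it happens to agree with the 55% quoted in the text).

```python
def tps_decrease_percent(tps_local, tps_virtualized):
    """Relative TPS loss, in percent, when moving from localhost to the virtualized filesystem."""
    if tps_local <= 0:
        raise ValueError("tps_local must be positive")
    return (tps_local - tps_virtualized) / tps_local * 100.0


# Assumed pairing of two value labels from Fig. 3.2 (Selections Only, S=10, c=10).
decrease = tps_decrease_percent(802.54, 361.53)
print(f"{decrease:.1f}% decrease")
```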
  • 37. Fig. 3.3 % decrease of TPS when deploying our filesystem to PostgreSQL (horizontal bar chart, "% overhead inserting one extra server to localhost": % decrease of TPS, 0 to 80, for TPC-B and Selections Only at s=1 c=1, S=10 c=10, and S=100 c=100)
    The results shown in Figure 3.3 are due to the different workloads posed to PostgreSQL. TPC-B experiments are more complex and include many changes to the files used by the database system (they perform INSERT commands), whereas Selections Only experiments execute read-only operations. Therefore, PostgreSQL is the main bottleneck as scenarios become more complex, and the main overhead is posed by PostgreSQL rather than by the implemented file system.
    3.2. Adding and Removing a server
    As we have already seen, the end user can easily add or remove a server through the developed control panel. The following experiments measure the response time of the system in such cases. In order to evaluate the performance of the system when the end user adds a new server, we measured the time that it takes for the new server to be added and the time that it takes to redistribute the data. The time that it takes to start up a new server is independent of the disk usage of the filesystem. That is, starting up a new server takes constant time no matter how many files are hosted by the filesystem. On the other hand, the redistribution of data is strongly affected by the
  • 38. storage usage. In order to measure the impact that the overall size of the hosted files has on the redistribution of data, different workloads were used. We created a 4 MB file which we replicated 256, 512, and 768 times in order to achieve an overall storage usage of 1, 2, and 3 GB respectively. Initially, all files reside on the main VDS. The user adds a new VDS, redistributes the files (half of the files then reside on the main VDS and half on the newly added one), and finally removes the newly added VDS, so that all files return to the main VDS. Figure 3.4 depicts the obtained results.
    Fig. 3.4 Delay when adding a server (stacked bar chart: time (sec), 0 to 800, for 1GB, 2GB, and 3GB, split into distribute, mv files, and start up server)
    Starting up a new server takes less than 8 seconds regardless of the disk usage. When the user chooses to re-distribute the data to the running servers, two operations take place: i) moving all files to a new directory (which resides in the filesystem) and ii) sending the appropriate files to the other servers. Due to the round-robin scheduler used in these experiments, half of the files are sent to the newly added server. The time that it takes to move the files is affected by the characteristics of the system, i.e., CPU speed and available memory, whereas the time that it takes to send the right files to the right server depends on the available network speed.
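The statement that half of the files travel to the new server follows directly from round-robin placement over two servers. The sketch below reproduces that bookkeeping for the 1 GB workload (256 files of 4 MB); it is an illustration only, and `redistribute` with its dictionary layout is a hypothetical helper, not part of the implemented system.

```python
MIB = 2 ** 20


def redistribute(files, servers):
    """Re-place files round-robin over `servers`.

    `files` maps a file name to (current server, size in bytes).
    Returns (new placement, bytes that must be sent to each server).
    """
    placement = {}
    to_send = {s: 0 for s in servers}
    for i, name in enumerate(sorted(files)):
        current, size = files[name]
        target = servers[i % len(servers)]   # round-robin target for this file
        placement[name] = target
        if target != current:                # only files that change server are sent
            to_send[target] += size
    return placement, to_send


# 1 GB workload: 256 copies of a 4 MB file, all initially on the main VDS.
files = {f"copy_{i:03d}": ("main", 4 * MIB) for i in range(256)}

# Adding a server: half of the data (512 MiB) is sent to the new VDS.
placement, sent_on_add = redistribute(files, ["main", "new"])

# Removing it again: the same half travels back to the main VDS.
after_add = {name: (placement[name], 4 * MIB) for name in files}
_, sent_on_remove = redistribute(after_add, ["main"])
```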
  • 39. Fig. 3.5 Delay when removing a server (stacked bar chart: time (sec), 0 to 500, for 1GB, 2GB, and 3GB, split into shut down server and distribute)
    Figure 3.5 shows the results obtained when removing a server. During the distribution of files, only the files that reside on the removed server actually need to be moved and transferred to the remaining server. Because we have only two servers with the round-robin scheduler, half of the files actually need to be moved. From Figures 3.4 and 3.5 we can roughly estimate the available bandwidth between the servers at 40 Mbps. It should be noted that sending a 1 GB file through scp from one server to the other takes 2 minutes and 20 seconds, i.e., about 60 Mbps. Typical Ethernet speeds are 10/100/1000 Mbps.
    3.2.1 Discussion
    As we saw, the time that it takes to add a new server is the sum of the time needed to start up the server (less than 8 seconds) and the time that it takes to redistribute the files. However, we should note that if the end user does not require load-balanced servers, i.e., when a new server is added only newly created files will be stored on it, then the overall time that it takes to add a new server is just the start-up time.
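The scp figure quoted above can be checked with a one-line conversion from bytes and seconds to megabits per second. The helper names below are hypothetical, and whether "1 GB" means 10^9 or 2^30 bytes is not stated in the text, so both readings are shown; each lands close to the quoted 60 Mbps.

```python
def throughput_mbps(size_bytes, seconds):
    """Average throughput in megabits per second (1 Mbit = 10**6 bits)."""
    return size_bytes * 8 / seconds / 1e6


def transfer_seconds(size_bytes, mbps):
    """Time needed to move `size_bytes` at a sustained rate of `mbps`."""
    return size_bytes * 8 / (mbps * 1e6)


scp_time = 2 * 60 + 20                                  # 2 minutes and 20 seconds
decimal_reading = throughput_mbps(10 ** 9, scp_time)    # 1 GB taken as 10^9 bytes
binary_reading = throughput_mbps(2 ** 30, scp_time)     # 1 GB taken as 2^30 bytes
print(f"{decimal_reading:.1f} Mbps or {binary_reading:.1f} Mbps")
```

`transfer_seconds` inverts the calculation, which is convenient for estimating how long a redistribution of a given size would take at an assumed link speed.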
  • 40. As far as XCalibre's current system is concerned, when the end user needs to increase his/her overall storage capacity, the old server is replicated to a new server with the desired capacity. Therefore, in the current configuration of FlexiScale, the end user has only one server with the overall desired storage capacity. Making a clone of the old server and starting up the new server is accomplished through a NetApp storage system and takes less than 5 seconds. Cloning a LUN (Logical Unit Number, a unique identifier used to distinguish several devices that share the same bus) does not mean that all data is moved to the new LUN. The clone only holds a reference point to the master LUN. Each time a file is changed in the new LUN, the master data is copied to the new LUN. At the NetApp level there is no synchronization, and therefore the master has a separate data set from the clone. In other words, the movement of the files does not occur at once but gradually, as files are updated. Making a new clone is handled by changing the reference point to the master LUN. Our approach does not involve any hardware. It is a pure software solution which can be applied to any computer network. At this point, we should clarify that adding a new server results in an increase of the overall storage capacity. As we have already stated, there is a single database server, running on one VDS. Running a new instance of the database system on another server belongs to the domain of distributed database systems and is outside the scope of this work. Having more than one database server running simultaneously and using the same data can only be accomplished through a distributed database architecture in which the database instances communicate with each other. It is not possible to implement a distributed database system merely by sharing data, without altering the database system.
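The clone behaviour described above (a reference point to the master, with data copied only when it changes in the clone) is a copy-on-write scheme. The sketch below illustrates the idea in a few lines; it is a toy model and makes no claim about how NetApp implements LUN cloning. The class `Lun`, its block-level interface, and the method names are all assumptions.

```python
class Lun:
    """Toy copy-on-write volume: block number -> bytes."""

    def __init__(self, base=None):
        self._base = base or {}   # snapshot of the master's blocks (shared references, never mutated)
        self._own = {}            # blocks this LUN has written itself

    def clone(self):
        # Copying the mapping copies only references to the block data, not the data.
        return Lun(base={**self._base, **self._own})

    def read(self, block_no):
        if block_no in self._own:
            return self._own[block_no]
        return self._base[block_no]          # raises KeyError for an unknown block

    def write(self, block_no, data):
        self._own[block_no] = bytes(data)

    def patch(self, block_no, offset, data):
        # Copy the block (possibly from the master snapshot) and change the private copy.
        buf = bytearray(self.read(block_no))
        buf[offset:offset + len(data)] = data
        self._own[block_no] = bytes(buf)

    def private_blocks(self):
        """Number of blocks that have actually been copied into this LUN."""
        return len(self._own)
```

A fresh clone owns no blocks at all, which is consistent with cloning taking only seconds; blocks accumulate in the clone gradually as they are modified, and later writes to the master do not show through.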
  • 41. 3.3. Clustered vs Unclustered
    In this section we examine whether there is any difference in the performance of the database system when files are clustered on the same server. We created four instances of the same database. The database schema is the same as shown in paragraph 3.1. The scaling factor was set to 10, resulting in 1,000,000 tuples in the accounts table, and the number of concurrent clients was 10, executing the transaction script of Table III 10,000 times. In the first experiment the various files of the database instances were distributed across the system (unclustered), whereas in the second experiment all files of each instance were clustered on one server. That is, all files of the first instance reside on the main VDS, all files of the second instance reside on the second server, and so forth.
    Fig. 3.6 TPC-B results for clustered and unclustered data (bar chart: TPS, 0 to 120, for databases 1 to 4, Clustered vs Unclustered)
    Figure 3.6 shows the results obtained when we ran the TPC-B experiment for clustered and unclustered data. We can see that there is a significant difference in the performance of the database system only for the first database instance (database 1). This behaviour can be explained by the fact that the database server runs on the first server, where the files of the first database reside.
  • 42. Fig. 3.7 Selections Only results for clustered and unclustered data (bar chart: TPS, 0 to 900, for databases 1 to 4, Clustered vs Unclustered)
    Figure 3.7 shows the results obtained when we ran the Selections Only experiment for clustered and unclustered data. As expected, there is a significant difference in the performance of the database system only for the first database instance (database 1).
    3.3.1 Discussion
    The previous experiments showed that there is no apparent change in the performance of the database system whether the database files are clustered on a VDS or not. That means that the implemented filesystem does not impose any additional overhead on the database system when files are spread across servers. The database system remains the bottleneck, and there is no significant change in the performance of the system according to how the files are distributed to the various servers. However, these results verified the results obtained from the experiments of paragraph 3.1 and proved the expected: “Having the database files in the same server where the database is running is always faster than having the files distributed”.
  • 43. Chapter IV Conclusions
    Virtualization is a technology that is rapidly transforming the IT landscape and fundamentally changing the way that people compute. In essence, virtualization is a software layer which enables sharing of hardware resources. The benefits of virtualization are becoming more and more appealing nowadays, when high quality of service and error-free operation are required of every system. A lot of work has been done in the domain of virtual servers, but database virtualization is still an open area. The way that a database system utilizes the underlying file system makes things more complex. This work tried to solve an existing problem in XCalibre's FlexiScale software: provide a transparent, online, scalable database system. The developed virtualized system treats the database system as an ordinary application running on a VDS, and it succeeded in meeting the main requirements of XCalibre:
    ● high transparency to the end user, i.e., the end user is not aware of the physical location where the data of any application or service (database) resides. Any application running on any VDS can use files which reside on different VDSs.
    ● on-demand scaling of the database without having to shut down the database server. There is no disruption of the running applications while a server is
  • 44. added to or removed from the system.
    ● generic approach suitable for any application and therefore any database system.
    ● friendly graphical user interface which enables the user to control the various servers.
    The results obtained from the evaluation of the developed system are promising and show that a solution to the problem of database virtualization can be accomplished by implementing a virtualized filesystem, i.e., by providing a global namespace that unifies the storage disks of the running VDSs.
    4.1. Future Recommendations
    The granularity of the developed system is the file. That is, it treats a file as an indivisible entity and is not able to process the individual pages which form a file. Most applications work with whole files; database systems, however, can also retrieve only a specific page of a file. The development of a FUSE project which is able to process pages as well as files would be of great importance to the database world. FUSE is a powerful tool that enables each user to implement his/her own filesystem according to his/her needs. A database system would then no longer need to bypass the underlying filesystem; it could instead use a filesystem implemented to suit its requirements in order to achieve better performance.
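The difference between file granularity and page granularity can be shown with a short sketch that fetches a single page from a file instead of the whole file. It is plain file I/O, not a FUSE filesystem, and `read_page` is a hypothetical helper; the 8192-byte page size is PostgreSQL's default block size and is used here only as an illustrative default.

```python
PAGE_SIZE = 8192  # PostgreSQL's default block size, used here as an illustrative default


def read_page(path, page_no, page_size=PAGE_SIZE):
    """Return only page `page_no` of the file at `path` (pages are numbered from 0)."""
    with open(path, "rb") as f:
        f.seek(page_no * page_size)   # jump straight to the start of the wanted page
        return f.read(page_size)      # shorter (or empty) if the file ends early


def page_count(size_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold `size_bytes` bytes."""
    return (size_bytes + page_size - 1) // page_size
```

A page-aware filesystem could ship just the requested 8 KB over the network, whereas a file-granular one has to treat the whole file as the unit of placement and transfer.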
  • 45. References
    1. Brad, A. (2007), Power Support in a Virtualization Environment [White paper], retrieved from MGE Office Protection Systems website: irtualizationFINAL.pdf.
    2. XCalibre Home Page, (2007), in XCalibre, retrieved 4 August 2008 from
    3. Stonebraker, M. and Kemnitz, G. (1991), “The POSTGRES next generation database management system”, Communications of the ACM, 34(10), pp. 78-92.
    4. Creasy, R. J. (1981), “The origin of the VM/370 time-sharing system”, IBM Journal of Research & Development, 25(5), pp. 483-490.
    5. Mann, A. (2007), Virtualization 101 [White paper], retrieved from Enterprise Management Associates (EMA) website:
    6. VMware, (2008), in Wikipedia, the free encyclopedia, retrieved 4 August 2008, from
    7. Virtualization Basics, (2008), in VMware, retrieved 4 August 2008, from
    8. Amazon Elastic Compute Cloud (Amazon EC2) – Beta, (2008), in Amazon Web Services, retrieved 4 August 2008, from
    9. AppLogic - Grid Operating System for Web Applications, (2008), in AppLogic, retrieved 4 August 2008 from
    10. How FlexiScale works, (2008), in FlexiScale, retrieved 4 August 2008 from
  • 46. 11. Bach, M. J., The Design of the Unix Operating System, New Jersey, Prentice Hall, 1986.
    12. Coffey, T. and O'Shaughnessy, A., Write a Linux Hardware Device, retrieved from Network Computing website:
    13. Filesystem in Userspace, (2008), in Wikipedia, the free encyclopedia, retrieved 4 August 2008, from
    14. FUSE Home Page, (2008), in FUSE, retrieved 6 August 2008, from
    15. Gluster Storage System, (2008), in Gluster, retrieved 6 August 2008, from
    16. The GNU/Hurd User's Guide, (2008), in HURD, retrieved 6 August 2008,
    17. PostgreSQL 8.4devel Documentation, (2008), in PostgreSQL, retrieved 6 August 2008, from