Transcript of "Memory Virtualization in Database Systems"
Memory Virtualization in Database Systems
Master of Science
School of Informatics
University of Edinburgh
Virtualization is a technology that is rapidly transforming the IT landscape and
fundamentally changing the way that people compute. The benefits of virtualization
techniques are becoming increasingly appealing now that high quality of service and
error-free operation are required of every system. A lot of work has been done in the
domain of virtual servers, but database virtualization is still an open area. This work
addresses an existing problem in XCalibre's FlexiScale software: providing a
transparent, online scalable database system. The problem of database virtualization
can be solved by implementing a virtualized filesystem, i.e., by providing a global
namespace that unifies the storage disks of all running virtual servers.
First and foremost, I would like to thank my supervisor Dr. S. Viglas for his constant
guidance and support.
I would also like to express my gratitude to the experienced people of Xcalibre, and
especially to Mr. G. Munasinghe for their invaluable help.
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has
not been submitted for any other degree or professional qualification except as
TABLE OF CONTENTS
1 Introduction
1.1 Background
1.1.1 Single to Many Virtualization
1.1.2 Many to Single Virtualization
1.2 Virtualization Benefits
1.3 Related Work
1.3.1 Amazon EC2
1.3.2 AppLogic
1.3.3 XCalibre FlexiScale
1.4 Proposal - Project Aims
1.5 Specifications
2 System Design
2.1 Current System
2.2 Possible Solutions
2.2.1 Modifying the Database System
2.2.2 File System in Kernel Space
2.2.3 File System in User Space
1 Introduction
The idea of virtualization is well known in both academia and industry.
Virtualization is a technique for hiding the physical characteristics of computing
resources from the way in which other systems, applications, or end users interact
with those resources. As computing becomes more and more distributed, the need
for a pool of resources available transparently to the end user has become a de
facto requirement for every distributed system.
There are many different kinds of virtualization. The main idea is to make a single
physical resource appear to function as multiple logical resources, or to make multiple
physical resources appear as a single resource. In either case, the end user should not
need to care whether the physical resource being used resides in his personal
environment (personal computer) or elsewhere in a network. Virtualization essentially
lets one computer do the job of multiple computers by sharing the resources of a single
computer across multiple environments.
This study explores ways of offering an online scalable database by implementing
virtualization in a database system. This work has been done in collaboration with
XCalibre, a leading UK hosting provider, which supplied all the required
resources. The proposed solution is the design and implementation of a virtualized
file system which the database system uses to store its data. The main
advantage of developing a virtualized file system is its generality, as
virtualization concerns not only database tables but ordinary
files as well. The developed virtualized file system can interact with any
application, i.e., any database system that the end user selects. However, this work
has been tested with PostgreSQL, a well-established open source database
1.1 Background
There are various virtualization technologies. The idea of the virtual machine first
appeared in the 1960s in the experimental IBM M44/44X system, in which the
operating system uses the computing machine to simulate multiple copies of the
machine itself. The different kinds of virtualization encountered in various IT
environments are presented below.
1.1.1 Single to Many Virtualization
This category includes virtualization technologies where a single physical resource
appears to function as multiple logical resources. According to the resource that is
virtualized (such as a server, an operating system, an application, or storage device)
there are different types of virtualization:
● Operating System Virtualization: where multiple logical (or virtual)
operating systems (aka "guests") run on top of a fully functioning base (or
"host") operating system. This method of virtualization usually uses a
standard operating system such as Windows or Linux as the host, plus a
virtual machine manager, to run multiple guest operating systems. Some
vendors and products providing this type of virtualization include
Microsoft Virtual Server, SWSoft Virtuozzo, Parallels
Workstation/Desktop, Linux jails, and Sun Solaris containers.
● Server Virtualization (also known as "system virtualization" or
"native virtualization"): where multiple virtual operating systems run
directly on top of the hardware without an intervening operating system.
Typically, virtualization software will run directly on the base hardware,
and the operating systems will be installed onto that virtualization
software. So-called "paravirtualization" is (arguably) a subset of server
virtualization that provides a thin interface to run between the base
hardware and a modified guest operating system. Examples of server
virtualization include VMware ESX Server and Xen. Server
virtualization facilitates a rapid – or even automatic – restart of
applications after a software failure. When used in conjunction with data
replication between data centres, it can restart applications at a recovery
site following a primary site failure.
● Application Virtualization: where an application is provided to the end-
user, generally from a remote location (such as a central server), without
needing to completely install this application on the user's local system.
Unlike traditional client-server operations, each user has an isolated, fully
functional application environment, sharing few if any components with
other users. Examples of this include Citrix Presentation Server, Thinstall
Virtualization Suite, and Altiris Software Virtualization Solution.
● Desktop Virtualization: where remote access to a complete desktop
environment allows access to any authorized application, regardless of
where the application is actually located. Examples of this include
Microsoft Terminal Services, VMware Virtual Desktop, and Kidaro
● Software Streaming: essentially a subset of other virtualization
technologies that provides a way for software components - including
applications, desktops, and even complete operating systems - to be
dynamically delivered from a central location to the end-user over the
network. As with video streaming, a user can start using streamed software before
the entire download has completed, and no complex
and lengthy installation process is needed. Examples of this technology include
AppStream, Ardence (acquired in December 2006 by Citrix), and
● Storage Virtualization: a way for many users or applications to access
storage without being concerned about where or how that storage is
physically located or managed. Typically storage virtualization applies to
larger SAN or NAS arrays, but it is just as accurately applied to the
logical partitioning of a local desktop hard drive. Examples include a
range of hardware, software, and appliance solutions from IBM, EMC,
Network Appliance, and others.
● Data Virtualization: Data virtualization abstracts the source of individual
data items – including entire files, database contents, document metadata,
messaging information, and more – and provides a common data access
layer for different data access methods – such as SQL, XML, JDBC, File
access, MQ, JMS, etc. This common data access layer interprets calls
from any application using a single protocol, and translates the application
request to the specific protocols required to store and retrieve data from
any supported data storage method. This allows applications to access
data with a single methodology, regardless of how or where the data is
Fig. 1.1 Data virtualization 
● Software as a Service (SaaS): an implementation of virtualization in which
software is provided by an external application service provider (ASP),
normally on a usage basis. Typically, the end user will access the software
service through a Web browser and, in some cases, specialized software may
still be required. The complete software application is not hosted locally, or
even within the enterprise, but is hosted at a third-party service provider.
● Thin Client: a local system that has limited or no independent processing,
storage, or peripherals of its own, relying entirely on a remote system for
virtually all operations. Typically, a thin client will have limited local
processing that allows it to merely perform I/O to a central server, which
hosts the operating system, desktop, and applications.
1.1.2 Many to Single Virtualization
This category includes virtualization technologies where many physical resources
appear as a single resource.
● Clustering: A cluster is a form of virtualization that makes several locally-
attached physical systems appear to the application and end users as a single
processing resource. A typical use case for clustering is to group a number of
identical physical servers to provide distributed processing power for high-
volume applications, or as a “Web farm”, which is a collection of Web servers
that can all handle load for a Web-based application.
● Grid Computing: Like a cluster, a grid provides a way to abstract multiple
physical servers from the application they are running. The major difference
is that the computing resources are normally spread out over a wide network,
potentially across the Internet, and the physical servers that comprise a grid
do not have to be identical. Unlike a cluster, where each server is locally
connected, is likely to be identical, and can handle the same processing
requirements, a grid is made up of heterogeneous systems, in diverse
locations, each of which may specialize in a particular processing capability.
Much greater coordination is needed to allocate the resources to appropriate
1.2 Virtualization Benefits
The benefits of virtualization techniques are becoming increasingly appealing
now that high quality of service and error-free operation are required of
every system. Virtualization allows easier software migration,
including system backup and recovery, which makes it extremely valuable as a
Disaster Recovery (DR) or Business Continuity Planning (BCP) solution.
Virtualization can duplicate critical servers, so IT does not need to maintain
expensive physical duplicates of every piece of hardware for DR purposes. DR
systems can even run on dissimilar hardware. In addition, virtualization reduces
downtime for maintenance, as a virtual image can be migrated from one physical
device to another to maintain availability while maintenance is performed on the
original physical server. This applies equally to servers and desktops, or even mobile
devices – virtualization allows workers to remain productive and get back online
faster when their hardware fails.
Other benefits of virtualization include business agility and flexibility (virtualization
enables IT to respond rapidly to on-demand changes to the system), server
consolidation (improved server utilization through distribution of the workload), reduced
downtime (virtual images are easier to restore after a failure), reduced software and
1.3 Related Work
A lot of work has been done in operating system and server virtualization, where
numerous software products are widely used. The first successful virtualization
software package was released by VMware in the late 1990s. VMware Workstation
allows users to run multiple instances of x86 or x86-64-compatible operating
systems on a single physical PC, without requiring any changes to
processors or operating systems. Other well-known x86 virtualization products
are Parallels, Microsoft Virtual PC, QEMU+KQEMU, and VirtualBox. Most
virtualization environments enable the end user to run multiple operating systems
and multiple applications on the same computer at the same time, increasing the
utilization and flexibility of hardware.
1.3.1 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides
resizable compute capacity in the cloud. It is designed to make web-scale computing
easier for developers. The main idea of Amazon EC2 is to enable end users to
rent only the resources that they really need at any given point in time. The "Elastic"
nature of the service allows developers to instantly scale to meet spikes in traffic or
demand. When computing requirements unexpectedly change (up or down), Amazon
EC2 can instantly respond, meaning that developers have the ability to control how
many resources are in use at any given point in time. In contrast, traditional hosting
services generally provide a fixed number of resources for a fixed amount of time,
meaning that users have a limited ability to easily respond when their usage is
rapidly changing, is unpredictable, or is known to experience large peaks at various
intervals. The end user interacts with EC2 by running his application
on a virtual machine and selecting the memory, CPU, and
instance storage that are optimal for his application.
Fig. 1.2 Amazon EC2 end user web interface 
1.3.2 AppLogic
AppLogic is a grid operating system designed to enable utility computing for web
applications. It uses advanced virtualization technologies to ensure complete
compatibility with existing operating systems, middleware and applications. As a
result, AppLogic makes it easy to move existing web applications onto a grid without
Fig. 1.3 The architecture of AppLogic 
1.3.3 XCalibre FlexiScale
Following the idea of Amazon EC2, FlexiScale, developed by XCalibre, is a web
service which enables the end user to utilize computing power on demand. The
virtual machine, called an instance in EC2, is called a Virtual Dedicated Server (VDS)
and enables the user to specify his requirements in memory and storage capacity, as
well as his desired operating system, at any given point in time.
Fig. 1.4 FlexiScale’s virtual dedicated server 
The above web services are able to adjust their functionality to the needs of the end
user in terms of memory, CPU usage and storage capacity. Things become more
complex when the end user requires virtualization to be done on a database system.
1.4 Proposal-Project Aims
To date, no service offers an online scalable database. This study explores ways
of implementing virtualization in a database system. The foremost obstacle in
making a database system scalable is the mapping from the logical, virtual
(user-visible) address space to the physical address space. The end user need not have
any knowledge of the physical address where the tables of the database are actually
stored, as the physical address can change according to the storage demands of the
user; this is what makes the system transparent. In other words, the database system
should be totally unaware of the underlying file system and of where its tables are
stored. The end user may boot up or kill any server (thereby changing the overall
available storage capacity) at any time without the database system noticing.
With the standard virtual disks provisioned with each FlexiScale VDS, it is not
possible to mount the same disks in multiple servers simultaneously, or to have
multiple disks mounted to a single server. Each FlexiScale VDS can be considered
an autonomous, independent machine with its own CPU, memory, operating system
and storage capacity. The main contribution of this work is to overcome this
limitation by enabling any service running on any of the VDSs that an end user has
booted up to use any resource from any VDS that is at his disposal. The developed
virtualized file system enables data sharing between servers. To put it differently, the
end user is aware of only a single mount point and is totally unaware of where data is
actually stored: on which storage device and on which server.
At this point we should clarify the terms virtualization and transparency.
Virtualization makes a resource visible to the end user although this resource does not
really exist. A transparent resource, on the other hand, exists physically but is
hidden from the end user behind an abstraction layer. This study combines
virtualization with transparency. The virtual layer is responsible for giving the user
a single view of the entire storage layer by providing him with a common address
space, a single mount point. The user’s space of logical addresses is both virtual, in
the sense that it does not really exist, and transparent, as each storage disk has its
own memory addresses which are invisible to the user.
1.5 Specifications
The developed virtualized file system should meet the following specifications, as
assigned by XCalibre:
● high transparency to the end user, i.e., the end user is not aware of the
physical address at which the data of any application or service (database) resides.
● on-demand scaling of the database without the database server having to be shut
down.
● generic approach suitable for any application and therefore any database
● the system is responsible for adapting to any change in the environment without
the running applications noticing.
As far as the operating system on which the virtualized file system is built is
concerned, Linux meets our expectations as it is open source. However, it should be
noted that the virtualized file system could run on a virtual Linux operating system
hosted by Windows using any operating system virtualization software.
2 System Design
This chapter describes how FlexiScale handles the on-demand scaling of the system in
terms of storage capacity, and presents various possible solutions to the problem of
database virtualization. Finally, our approach and system design are presented,
including the Graphical User Interface.
2.1 Current System
For the time being, the user of FlexiScale can create a new Virtual Dedicated Server
in less than a minute, change the VDS parameters according to his requirements in
memory, operating system, and storage capacity on the fly and on demand, and
automatically recover from a physical server failure. When the user adds a new VDS,
a new autonomous server is created. The added VDS has its own storage system,
which is completely independent of the storage system of any other
VDS currently in use. In other words, each VDS has its own physical
Fig.2.1 Current System: File_1 which resides in VDS 1
is different from File_1 in VDS 2
The absence of a single namespace raises limitations when the user wants to add a
new server which will use data from older servers. For instance, if a user has booted
up a VDS with a 40GB storage capacity (with 35GB used) and needs to increase
his storage capacity to 60GB, a new VDS has to be created with the desired storage
capacity. This will be a replica of the initial server (leaving 25GB of free space). It
is not possible simply to add a new VDS with 20GB storage capacity, as
the new VDS would be unaware of the 35GB of data stored in the initial server. Any
process running on the initial VDS has to be stopped and has to wait until the
replication procedure has terminated. Due to the absence of a global namespace, an
application running on one VDS cannot use the data stored on another.
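The arithmetic of this example can be written out as a short sketch. The two function names are illustrative only and are not part of FlexiScale; the second function describes what a global namespace would allow, under the assumption that existing data can stay where it is.

```python
def grow_by_replication(used_gb, target_gb):
    """Current behaviour: a new VDS of the full target size replaces the old one."""
    return {
        "new_vds_size_gb": target_gb,       # the whole target must be provisioned
        "data_to_copy_gb": used_gb,         # existing data is replicated
        "free_after_gb": target_gb - used_gb,
    }


def grow_by_pooling(capacity_gb, used_gb, target_gb):
    """With a global namespace: only the missing capacity is added."""
    return {
        "new_vds_size_gb": target_gb - capacity_gb,
        "data_to_copy_gb": 0,               # no copy is needed to keep data reachable
        "free_after_gb": target_gb - used_gb,
    }


print(grow_by_replication(used_gb=35, target_gb=60))
print(grow_by_pooling(capacity_gb=40, used_gb=35, target_gb=60))
```

Both routes end with 25GB free, but the replication route provisions 60GB and copies 35GB while services are stopped, whereas pooling only adds a 20GB server.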
Fig. 2.2 Current System: Adding a new VDS
2.2 Possible Solutions
Possible solutions to the problem of database virtualization include either the
implementation of a virtual layer inside the database system, responsible
for mapping the logical addresses used by the storage layer of the database system to
the physical addresses where data resides, or the implementation of a virtual file
system in kernel or in user space. In the former case, the way that the database
interacts with the underlying file system has to be modified, whereas in the latter the
underlying database remains unchanged. This work concentrates on the latter case
because the former lacks generality, as will be shown shortly.
2.2.1 Modifying the Database System
In this approach the virtual layer is embedded in the database system. The embedded
virtual layer interacts with the storage layer of the database, which can be distributed
across multiple VDSs. Therefore, apart from executing queries, the database system is
responsible for mapping logical addresses to physical addresses.
Fig. 2.3 Modifying the database system (users 1 to n pass logical addresses to the
virtual layer, which maps them to physical addresses on disks 1 to n)
The database engine knows where each table of the database is physically stored and
maps the user’s logical address accordingly. When the database system
(PostgreSQL) interacts with the user, it uses logical addresses, whereas when
interacting with the underlying file system it uses physical addresses.
The main drawback of this approach is that the resulting system would work only
for the particular database system for which it was implemented. Moreover, if
the database is not open source, changing the underlying database would be
impossible. Therefore, modifying the database system would result in a static
solution which would not meet the objectives of XCalibre.
2.2.2 File System in Kernel Space
Figure 2.4 depicts the Linux architecture in the most general terms. User-level
programs communicate with the kernel using system calls. When a user process
executes a system call, it changes its execution mode from user to kernel mode. In
kernel mode, while executing the system call, the process has access to the kernel
Fig. 2.4 Linux Architecture 
For the purposes of our work we will concentrate on device drivers, which are the
software interface to an I/O device. A device driver is a collection of subroutines
which are called when the kernel recognizes that a particular action should be taken
by a particular device. A new file system can be implemented as a character
device driver either by recompiling the kernel and loading a new kernel image
containing the implemented device, or by loading the driver as a kernel module.
Fig. 2.5 Linux system and device driver relationship 
2.2.3 File System in User Space
Before the advent of user space filesystems, filesystem development
was the job of the kernel developer. Creating filesystems required knowledge of
kernel programming and of kernel technologies (such as the VFS). Filesystem in
Userspace (FUSE) is a loadable kernel module for Unix-like operating systems that
allows non-privileged users to create their own file systems without editing the
kernel code. This is achieved by running the file system code in user space, while the
FUSE module only provides a "bridge" to the actual kernel interfaces. FUSE was
officially merged into the mainstream Linux kernel tree in kernel version 2.6.14.
Released under the terms of the GNU General Public License and the GNU Lesser
General Public License, FUSE is free software. The FUSE system was originally part
of A Virtual Filesystem (AVFS), but has since split off into its own project on
SourceForge.net. FUSE is available for Linux, FreeBSD, NetBSD (as PUFFS),
OpenSolaris and Mac OS X.
FUSE is particularly useful for writing virtual file systems. Unlike traditional
filesystems, which essentially save data to and retrieve data from disk, virtual
filesystems do not actually store data themselves. They act as a view or translation of
an existing filesystem or storage device. In principle, any resource available to a
FUSE implementation can be exported as a file system.
With FUSE it is possible to implement a fully functional filesystem as a
userspace program with the following features:
● Simple library API
● Simple installation (no need to patch or recompile the kernel)
● Secure implementation
● Userspace - kernel interface is very efficient
● Usable by non privileged users
● Runs on Linux kernels 2.4.X and 2.6.X
● Has proven very stable over time
Figure 2.6 shows the path of a filesystem call (e.g. stat). The FUSE kernel module
and the FUSE library communicate via a special file descriptor which is obtained by
opening /dev/fuse. This file can be opened multiple times, and the obtained file
descriptor is passed to the mount syscall, to match up the descriptor with the
Fig. 2.6 FUSE flow-chart diagram 
2.3 Our approach
The key idea of our approach is the implementation of file virtualization, which
enables multiple physical storage devices, each with its own physical address
namespace, to be treated as a single virtual storage device with a global namespace.
Each application, no matter which VDS it runs on, is able to use (read, modify, delete)
any data created by any application running on any VDS, wherever that data is actually
stored. Each VDS has a single mount point which enables its applications to access
data located on storage devices on other VDSs.
Fig. 2.7 Global namespace
In Figure 2.7 the user has booted up two VDSs with a 20 GB storage capacity each. An
application running on VDS 1 requests to read File_1. The application is totally
unaware of where File_1 is actually stored. The application passes its request to the
virtual filesystem, where the mapping from virtual to physical address is done. The
virtual filesystem returns the requested file to the application, although the file is
actually stored on a different VDS (VDS 2) from the one on which the application runs.
Moreover, it should be noted that every application running on any VDS sees the
overall capacity of the system as 40 GB (i.e., the sum of the storage capacities of
the running VDSs).
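The scenario of Figure 2.7 can be modelled with a minimal sketch. The classes `VDS` and `GlobalNamespace` are hypothetical names introduced here for illustration; they only model the lookup and are not the implemented file system.

```python
class VDS:
    """One virtual dedicated server with its own local storage (illustrative model)."""

    def __init__(self, name, capacity_gb):
        self.name = name
        self.capacity_gb = capacity_gb
        self.files = {}                      # local path -> file contents


class GlobalNamespace:
    """Maps a virtual path to the VDS that physically holds the file."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}
        self.location = {}                   # virtual path -> VDS name

    def write(self, path, data, on):
        self.servers[on].files[path] = data
        self.location[path] = on

    def read(self, path):
        # The caller never names a server; the mapping is resolved here.
        return self.servers[self.location[path]].files[path]

    def total_capacity_gb(self):
        return sum(s.capacity_gb for s in self.servers.values())


ns = GlobalNamespace([VDS("vds1", 20), VDS("vds2", 20)])
ns.write("/File_1", b"hello", on="vds2")     # physically stored on VDS 2
print(ns.read("/File_1"))                    # an application on VDS 1 reads by path only
print(ns.total_capacity_gb())                # capacity seen by every application
```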
As shown in Figure 2.8, files can be moved from one server to another without
affecting file paths, which makes it easy to balance the load across the
underlying storage devices. The virtual address of File_1 is fixed while the physical
address changes, so no change is needed in the application using this file. Our
approach succeeds in making the various VDSs available to the user
appear to use a single file system.
Fig. 2.8 Moving files from one VDS to another
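A minimal sketch of such a move, assuming the namespace is a plain dictionary from virtual path to server name (the helper `move_file` is hypothetical). Only the mapping changes; the path the application uses does not.

```python
def move_file(location, stores, path, target):
    """Move `path` to `target` and update only the virtual-to-physical mapping."""
    source = location[path]
    if source == target:
        return
    stores[target][path] = stores[source].pop(path)
    location[path] = target


stores = {"vds1": {}, "vds2": {"/File_1": b"tuples"}}   # per-server contents
location = {"/File_1": "vds2"}                          # virtual path -> server

before = stores[location["/File_1"]]["/File_1"]
move_file(location, stores, "/File_1", "vds1")
after = stores[location["/File_1"]]["/File_1"]
print(before == after, location["/File_1"])
```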
This approach treats the database system as an ordinary application which stores its
data in the virtual filesystem. The virtual filesystem handles every file in the same
way, whether it contains tuples or other data.
Fig. 2.9 PostgreSQL running and data distribution
The implemented file system utilizes GlusterFS, free software released under
the GNU GPL v3 license, which is a FUSE project. GlusterFS is a clustered file system
which aggregates various storage bricks (storage nodes) over Infiniband RDMA or
TCP/IP interconnect into one large parallel network file system. GlusterFS uses
translators, which are binary shared objects (.so) loaded at run-time. The idea of
translators is borrowed from the GNU/Hurd operating system. A translator is a
program that is inserted between the actual content of a file and the user accessing
this file, and that processes the incoming requests in many different ways. From the
kernel's point of view, translators are just another user process (run in user space).
Figure 2.10 depicts a system with two storage nodes running on the same machine
Fig. 2.10 GlusterFS: server and client volume specification
files for two bricks running on localhost 
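The stacking idea behind translators can be illustrated with a small sketch. This is not the GlusterFS translator interface (real translators are shared objects written in C); the classes `Storage`, `Translator`, `Trace` and `ReadOnly` are hypothetical and only show how each layer receives a request, processes it, and passes it on to the layer below.

```python
class Storage:
    """Bottom of the stack: keeps file contents in a dict (stand-in for a disk)."""

    def __init__(self):
        self.data = {}

    def write(self, path, payload):
        self.data[path] = payload

    def read(self, path):
        return self.data[path]


class Translator:
    """Default behaviour: pass every request to the next layer unchanged."""

    def __init__(self, child):
        self.child = child

    def write(self, path, payload):
        return self.child.write(path, payload)

    def read(self, path):
        return self.child.read(path)


class Trace(Translator):
    """Records each request before forwarding it."""

    def __init__(self, child):
        super().__init__(child)
        self.log = []

    def write(self, path, payload):
        self.log.append(("write", path))
        return super().write(path, payload)

    def read(self, path):
        self.log.append(("read", path))
        return super().read(path)


class ReadOnly(Translator):
    """Rejects writes, forwards reads."""

    def write(self, path, payload):
        raise PermissionError(f"{path}: read-only volume")


disk = Storage()
traced = Trace(disk)
traced.write("/File_1", b"tuples")
locked = ReadOnly(traced)            # stack: ReadOnly -> Trace -> Storage
print(locked.read("/File_1"))
print(traced.log)
```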
2.3.2 Our Implementation in Greater Detail
This section describes our implementation in greater detail and concentrates on how
the implemented file system handles the files created by PostgreSQL. However, as
we have already remarked, any application is handled in the same manner.
PostgreSQL runs as a service on a single VDS, referred to as the main VDS. It
should be noted that all VDSs are equivalent and therefore any VDS could play the
role of the main VDS. Each VDS is a brick in GlusterFS terminology and
communicates with other bricks over TCP/IP. The IP address of the main VDS is
22.214.171.124. XCalibre supplied us with five VDSs in total (126.96.36.199/118).
A brick is a server brick when it accepts and stores files which are created by
processes running on other bricks, the clients. Each VDS brick acts as both a server
and a client brick to the other VDSs at the same time, allowing files created by any
VDS to reside on any available VDS. As will be shown shortly, this property is
significant when a server is removed.
As we have already said, the virtual file system is treated by each VDS as a single
filesystem with a global namespace. Therefore, two different VDSs could
simultaneously create a file with the same name, which would be
disastrous for any filesystem. To prevent this, the
main VDS, apart from running PostgreSQL, is responsible for keeping track of
every file name that is already in use. When a VDS creates a new file it must first
acquire the lock (semaphores are used) of the directory in which it wants to store the
file, and only then carry out the creation operation. The global namespace is located
in the ~/gf_exports/export-namespace directory.
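The locking rule can be sketched as follows. This is an in-process illustration only: the class `NameRegistry` is hypothetical, threads stand in for VDSs, and a real deployment would need a lock that works across machines.

```python
import threading
from collections import defaultdict


class NameRegistry:
    """Kept by the main VDS: one semaphore and one set of used names per directory."""

    def __init__(self):
        self._locks = defaultdict(lambda: threading.Semaphore(1))
        self._names = defaultdict(set)
        self._guard = threading.Lock()       # protects the two dictionaries themselves

    def create(self, directory, name):
        """Return True if the name was free and is now reserved, False otherwise."""
        with self._guard:
            lock = self._locks[directory]
            names = self._names[directory]
        with lock:                           # acquire the directory lock first
            if name in names:
                return False                 # another VDS already created this file
            names.add(name)
            return True


registry = NameRegistry()
results = []


def worker():
    results.append(registry.create("/data", "table_1"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.count(True))                   # number of successful creations
```

Of the eight concurrent attempts to create the same name in the same directory, exactly one succeeds; the same name in a different directory is unaffected.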
There is a single mount point (~/mount) available to each VDS, which can be
considered a gate to the virtual filesystem, as it connects the running VDSs.
Listing the files of the mount directory returns all files that are handled by the
virtual filesystem and not just the files that reside on the specific VDS. In other
words, an ls command on the mount directory returns the same result no matter
which VDS it is run on. The information about which files reside on a VDS is located in
~/gf_exports/export. To be more precise, all files are actually stored in an export
directory. The mount directory is just an image of all exported directories (we can
consider them as soft links to the mount directory) of the running VDSs. Consequently,
if the user unmounts the mount point and then performs an ls in the directory,
no files will be returned. When it is re-mounted, all data will be in the mount
Fig. 2.11 Configuration of filesystem
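The relation between the export directories and the mount directory can be summarized in a few lines. The function `list_mount` and the file names are hypothetical; the sketch only models what an ls would return.

```python
def list_mount(exports, mounted=True):
    """Return the names visible in the mount directory of any VDS."""
    if not mounted:
        return []                        # the bare directory holds no files of its own
    names = set()
    for files in exports.values():       # union of every VDS's export directory
        names.update(files)
    return sorted(names)


# Contents of ~/gf_exports/export on each VDS (illustrative file names).
exports = {"vds1": ["a.dat"], "vds2": ["b.dat", "c.dat"], "vds3": []}

print(list_mount(exports))
print(list_mount(exports, mounted=False))
```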
In the default configuration of GlusterFS, each brick acts either as a client or as a
server to other bricks. Following this naïve approach would result in a system in which
only one VDS (the main) would be able to store files on other VDSs. Consequently, the
running VDSs would not be equivalent, as only the main VDS could run processes
that store data on other VDSs. In other words, all mount directories located on
server bricks (apart from the main VDS) would be read-only directories. However,
such a system would be able to provide the required virtualized filesystem to the
database system, as it runs on a single VDS. On the one hand, the main benefit of this
approach is its simplicity and the clarity of how the files are distributed across the
VDSs. Listing the mount directory would return only files located on this particular
brick and not in the whole filesystem. On the other hand, the main drawback appears
when a VDS has to be removed, i.e., when the database shrinks. Any files stored on the
removed VDS must be transferred to the remaining VDSs. In the naïve approach there is
no way to send files to the other VDSs, as the mount directory is read-only. Therefore,
files would have to be sent through a separate connection that does not use the
virtualized filesystem (for instance, through ftp or ssh). In our implementation each
brick is both a client and a server at the same time. When a server is selected for
removal, all its files are re-distributed through the virtualized filesystem, resulting
in better performance.
When a new VDS is added to the system there are two options available: i) do not
distribute any existing files and use the newly added VDS only for files created
from then on, or ii) re-distribute all files. In the former case, the recently
added VDS will not host any old files. Such a system is useful to a user who
would like to have an extra VDS for a short period of time, perhaps for
debugging purposes. For instance, the user may want to test a new index
structure in PostGreSQL. He/she could add a new VDS to host PostGreSQL's files,
run the evaluation tests, and then remove the VDS without affecting the
older system. However, in our implementation the latter option has been adopted:
all existing files are re-distributed, which results in load-balanced VDSs.
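The two options can be contrasted with a small sketch. It assumes a file placement is simply a mapping from file name to server name and that re-distribution follows a round-robin order; the function names are hypothetical and the sketch does not move any real data.

```python
def add_without_redistribution(placement, servers, new_server):
    """Option (i): old files stay where they are; only the server list grows."""
    return dict(placement), list(servers) + [new_server]


def add_with_redistribution(placement, servers, new_server):
    """Option (ii): reassign every file over all servers in round-robin order."""
    all_servers = list(servers) + [new_server]
    new_placement = {
        name: all_servers[i % len(all_servers)]
        for i, name in enumerate(sorted(placement))
    }
    return new_placement, all_servers
```

With four files on a single server, option (ii) leaves two files on each of the two servers, whereas option (i) leaves all four on the original server until new files are created.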
2.4 Graphical User Interface
Figure 2.12 is a screenshot of the Control Panel from which the end user can manage
the storage servers. The end user has complete control over the VDSs at his/her
disposal and can manage them with a single click. He/she can start the virtual
filesystem by specifying the number of VDSs to include and the
scheduler, which determines how created files are distributed to the underlying
servers. The available scheduler options are:
● round-robin (rr): creates files in a round-robin fashion. Each VDS has its own
round-robin loop. This scheduler is a good choice when files are mostly
similar in size, which results in load-balanced VDSs.
● random: files are stored randomly on any VDS.
● nufa: the Non-Uniform Filesystem Scheduler gives the local system priority
for file creation over the other VDSs. If there is enough available space, files
are stored locally (on the creator VDS). Otherwise, files are placed on the
remaining VDSs in a round-robin fashion.
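The three policies listed above can be sketched as small picker functions. This is a simplified illustration of the described behaviour, not GlusterFS code: the function names, the size argument, and the free_bytes dictionary are assumptions introduced for the example, and one picker instance is meant to exist per VDS.

```python
import itertools
import random


def make_round_robin(servers):
    """Return a picker that cycles through the servers in order."""
    ring = itertools.cycle(servers)
    return lambda size: next(ring)


def make_random(servers, seed=None):
    """Return a picker that chooses any server uniformly at random."""
    rng = random.Random(seed)
    return lambda size: rng.choice(servers)


def make_nufa(local, servers, free_bytes):
    """Return a picker that prefers the creating (local) VDS.

    free_bytes maps server name -> available space; this bookkeeping is a
    hypothetical stand-in. Assumes at least one server other than local.
    """
    remote_ring = itertools.cycle([s for s in servers if s != local])

    def pick(size):
        if free_bytes[local] >= size:
            return local          # enough room: store on the creator VDS
        return next(remote_ring)  # otherwise round-robin over the others

    return pick
```

Each picker takes the size of the file to be created and returns the name of the server that should store it; only the nufa sketch actually uses the size.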
Fig. 2.12 GUI-Control Panel
The end user can add or remove any VDS from the virtual filesystem at any
time, on the fly. Applications are not disrupted while the re-distribution of files
takes place. However, during the time that the re-configuration takes, a file
required by a running application may be temporarily unavailable.
Moreover, the end user can get details about the running servers and
locate where each file is actually stored, i.e., on which VDS. Finally, the end user
can stop and restart any server.
This chapter presents the evaluation of the implemented filesystem when deployed
in FlexiScale. We concentrate on the overhead that the virtualized filesystem imposes
on the system when PostGreSQL is running. Moreover, we measure the time that it
takes to add a VDS to, and remove a VDS from, the developed system. Finally, we
examine whether the performance of the database is affected by whether the files
created by the database are clustered on a VDS or not. Table I describes the system
used for these experiments. PostGreSQL runs as a service on the main VDS
(188.8.131.52) and all VDSs have the same characteristics.
CPU               Dual-Core AMD
Memory            512 MB
Disk space        20 GB
Operating System  Ubuntu 8.04 LTS
Database          PostGreSQL 8.3.3
3.1. Measuring TPS
In order to evaluate the performance of the developed filesystem we measured how
many transactions per second (TPS) PostGreSQL performs when it runs on
various numbers of storage servers (VDSs) and executes various query scenarios. We
used the pgbench benchmark to generate the source data. PostGreSQL is shipped
with pgbench, which is its standard tool for measurements. pgbench runs the same
sequence of SQL commands over and over, possibly in multiple concurrent database
sessions, and then calculates the average transaction rate (transactions per second).
In our experiments we ran a scenario that is loosely based on TPC-B, involving five
SELECT, UPDATE, and INSERT commands per transaction, and a simpler scenario
in which only SELECT and INSERT commands are issued. Table II shows the tables of
the database and Table III the script run by every transaction. In the select
scenario the UPDATE commands are not executed. Before each experiment, a
TRUNCATE operation takes place in order to free any unused pages
from the buffer pool. The overall number of transactions was set to 100000 and
different experiments were performed with various scaling factors and numbers of
clients. When the scaling factor is set to 1, 10, and 100, the total number of tuples
residing in the accounts table is 100,000, 1,000,000, and 10,000,000 respectively.
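The relation between scaling factor and table size, and the way such a run could be launched, can be sketched as follows. The flags -i, -s, -c, and -t are standard pgbench options, and -t counts transactions per client. The database name "bench" is a placeholder, and dividing the 100000 transactions evenly among the clients is an assumption about how the overall figure was applied.

```python
TUPLES_PER_SCALE_UNIT = 100_000  # accounts rows per unit of scaling factor


def accounts_rows(scale):
    """Number of tuples in the accounts table for a given scaling factor."""
    return scale * TUPLES_PER_SCALE_UNIT


def pgbench_commands(scale, clients, total_transactions, dbname="bench"):
    """Build the initialisation and run command lines as argument lists.

    dbname is a hypothetical placeholder. pgbench's -t option is per client,
    so the overall transaction count is split evenly (an assumption).
    """
    per_client = total_transactions // clients
    init = ["pgbench", "-i", "-s", str(scale), dbname]
    run = ["pgbench", "-c", str(clients), "-t", str(per_client), dbname]
    return init, run
```

For example, pgbench_commands(10, 10, 100_000) yields an initialisation at scale 10 and a run with 10 clients executing 10000 transactions each; the lists can be passed to subprocess.run on a machine where pgbench is installed.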
Table accounts
  Column    Type
  aid       integer
  bid       integer
  abalance  integer
  filler    character(84)
  Indexes: "accounts_pkey" PRIMARY KEY, btree (aid)

Table branches
  Column    Type
  bid       integer
  bbalance  integer
  filler    character(88)
  Indexes: "branches_pkey" PRIMARY KEY, btree (bid)

Table tellers
  Column    Type
  tid       integer
  bid       integer
  tbalance  integer
  filler    character(84)
  Indexes: "tellers_pkey" PRIMARY KEY, btree (tid)

Table history
  Column    Type
  tid       integer
  bid       integer
  aid       integer
  delta     integer
  mtime     timestamp
  filler    character(22)
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM accounts WHERE aid = :aid;
UPDATE tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid,
:aid, :delta, CURRENT_TIMESTAMP);
Figure 3.1 shows the results obtained when we ran the TPC-B experiment on
different numbers of servers. The blue column represents the results obtained when
the experiment was run on localhost without our filesystem deployed. As
expected, the values of the blue columns are greater in all cases, because
distributing files through the implemented filesystem imposes overhead: the files
requested by PostGreSQL no longer reside solely on localhost. The other columns show
the TPS when up to five storage servers (VDSs) are used. In the following charts, S
shows the selected scaling factor and C the number of concurrent clients for each
[Figure: TPS by number of storage servers for S=1 C=1, S=10 C=10, and S=100 C=100]
Fig. 3.1 TPC-B results
We observe that deploying the virtualized filesystem decreases
TPS by 23% when the scaling factor is set to 1 with a single
client. Moreover, we notice that the performance of the system is not affected by
the addition of new VDSs (the variation of TPS with 2, 3, 4, and 5 servers is less
than 5%). When the scaling factor is set to 10 (1,000,000 tuples in the accounts
table) with 10 clients, TPS decreases by 24%, whereas with a scaling factor of 100
(10,000,000 tuples) with 100 clients the decrease is 25%. We observe that as the
number of clients increases and the database becomes larger, the overhead
imposed on the system can be considered constant.
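The percentages quoted above are relative decreases with respect to the localhost baseline. A minimal sketch of the calculation follows; the function name is hypothetical and the numbers in the usage note are made up for illustration, not measurements from these experiments.

```python
def tps_decrease_percent(baseline_tps, filesystem_tps):
    """Relative TPS loss, in percent, compared with the localhost baseline."""
    return 100.0 * (baseline_tps - filesystem_tps) / baseline_tps
```

With an invented baseline of 200 TPS and 154 TPS on the virtualized filesystem, the function reports a 23% decrease.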
3.1.2 Selections Only
Figure 3.2 shows the results obtained when we ran the Selections Only experiments on
different numbers of servers. In these experiments lines 4 and 5 of the script shown
in Table III are not executed. The transaction script is simpler than in the TPC-B
experiments, resulting in larger TPS values. When the scaling factor is set to 1 with
one client, the penalty in performance is less than 11%. As the scaling factor and
the number of clients grow, the overhead that we witness increases (55% TPS
decrease for a scaling factor of 10, and 75% for a scaling factor of 100).
[Figure: TPS by number of storage servers for S=1 C=1, S=10 C=10, and S=100 C=100]
Fig. 3.2 Selections Only results
The previous experiments showed the impact that deploying our filesystem
with various numbers of servers and different workloads has on PostGreSQL. All
experiments showed that the greatest performance penalty occurs when the
virtualized filesystem is introduced to the system, i.e., when one extra server is
added to localhost. After that, PostGreSQL's performance remains steady no matter
how many servers are used. Moreover, we observe that the overhead imposed is
constant in TPC-B-like transactions (almost 22%) no matter how many clients are
concurrently executing their queries. On the other hand, for simpler queries like
the Selections Only experiments, the performance penalty increases with the number
of clients.
[Figure: % decrease of TPS, localhost versus one extra server, per scaling factor and number of clients]
Fig. 3.3 % decrease of TPS when deploying our filesystem to PostGreSQL
The results shown in Figure 3.3 are due to the different workloads posed to
PostGreSQL. TPC-B experiments are more complex and include many changes to
the files used by the database system (they perform INSERT and UPDATE commands),
whereas Selections Only experiments execute read-only operations. Therefore, as
scenarios become more complex, PostGreSQL itself is the main bottleneck and most of
the overhead is posed by PostGreSQL rather than by the implemented filesystem.
3.2. Adding and Removing a server
As we have already seen, the end user can easily add or remove a server through the
developed control panel. The following experiments measure the response time of
the system in such cases. In order to evaluate the performance of the system when the
end user adds a new server, we measured the time that it takes for the new server to
be added and the time that it takes to redistribute the data. The time that it takes to
start up a new server is independent of the disk usage of the filesystem. That is,
starting up a new server takes constant time no matter how many files are hosted by
the filesystem. On the other hand, the redistribution of data is strongly affected by the
In order to measure the impact that the overall size of the hosted files has on the
redistribution of data, different workloads were used. We created a 4MB file which
we replicated 256, 512, and 768 times in order to achieve an overall storage usage of
1, 2, and 3 GB respectively. Initially, all files reside on the main VDS. The user adds
a new VDS, redistributes the files (so that half of the files reside on the main VDS
and half on the newly added one), and finally removes the newly added VDS, at which
point all files return to the main VDS. Figure 3.4 depicts the obtained results.
[Figure: "Adding a Server", delay including start up for 1GB, 2GB, and 3GB of stored data]
Fig. 3.4 delay when adding a server
Starting up a new server takes less than 8 seconds regardless of the disk usage. When
the user chooses to re-distribute the data to the running servers, two operations
take place: i) moving all files to a new directory (which resides in the filesystem)
and ii) sending the appropriate files to the other servers. Due to the round-robin
scheduler used in these experiments, half of the files are sent to the newly added
server. The time that it takes to move the files is affected by the characteristics
of the system, i.e., CPU speed and available memory, whereas the time that it takes
to send each file to the right server depends on the available network speed.
[Figure: "Removing a Server", delay including shut down for 1GB, 2GB, and 3GB of stored data]
Fig. 3.5 delay when removing a server
Figure 3.5 shows the results obtained when removing a server. During the
redistribution, only the files that reside on the removed server actually need to be
moved and transferred to the remaining server. Because we have only two
servers with the round-robin scheduler, half of the files need to be moved. From
Figures 3.4 and 3.5 we can roughly estimate the available bandwidth
between the servers at 40 Mbps. For comparison, sending a 1GB file
through scp from one server to the other takes 2 minutes and 20 seconds, i.e.,
roughly 60 Mbps. Typical Ethernet speeds are 10/100/1000 Mbps.
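The scp figure can be checked with a short calculation. The sketch below only converts a transfer size and duration into megabits per second; the function name is hypothetical, and whether "1GB" means 10^9 or 2^30 bytes is not stated in the text, so both readings are covered in the note that follows.

```python
def throughput_mbps(size_bytes, seconds):
    """Average throughput in megabits per second (1 Mbit = 10**6 bits)."""
    return size_bytes * 8 / seconds / 1_000_000


GIB = 1024 ** 3            # binary gigabyte
GB = 10 ** 9               # decimal gigabyte
SCP_SECONDS = 2 * 60 + 20  # 2 minutes and 20 seconds, as reported above

scp_binary = throughput_mbps(GIB, SCP_SECONDS)
scp_decimal = throughput_mbps(GB, SCP_SECONDS)
```

The two readings give about 61.4 Mbps and 57.1 Mbps respectively, both consistent with the rounded value of 60 Mbps.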
As we saw, the time that it takes to add a new server is the sum of the time needed to
start up the server (less than 8 seconds) and the time that it takes to redistribute the
files. However, we should note that if the end user does not require load-balanced
servers, i.e., if a newly added server only stores files created after its addition,
then the overall time that it takes to add a new server is just the start-up time.
As far as XCalibre's current system is concerned, when the end user needs to
increase his/her overall storage capacity, the old server is replicated to a new
server with the desired capacity. Therefore, in the current configuration of
FlexiScale, the end user has only one server with the overall desired storage
capacity. Cloning the old server and starting up the new server is accomplished
through a NetApp storage system and takes less than 5 seconds. Cloning a LUN
(Logical Unit Number, a unique identifier used to distinguish several devices that
share the same bus) does not mean that all data is moved to the new LUN. The clone
only holds a reference point to the master LUN. Each time a file is changed in the
new LUN, the master data is copied to the new LUN. At the NetApp level there is no
synchronization and therefore the master has a separate data set from the clone. In
other words, the movement of the files does not occur at once but gradually, as
files are updated. Making a new clone is handled by changing the reference point to
the master LUN. Our approach does not involve any hardware. It is a pure software
solution which can be applied to any computer network.
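The gradual, copy-on-change behaviour of a clone described above can be illustrated with a toy model. This sketch is not NetApp's implementation or API; the class names are hypothetical, files are modelled as in-memory byte strings, and the only point shown is that data is copied from the master the first time the clone changes it.

```python
class MasterLun:
    """Toy master: a mapping from file name to its contents."""

    def __init__(self, files):
        self.files = dict(files)


class CloneLun:
    """Toy clone that starts as a reference to the master."""

    def __init__(self, master):
        self._master = master
        self._own = {}  # private copies, created only when a file changes

    def read(self, name):
        # Unchanged files are still served from the master's data.
        if name in self._own:
            return bytes(self._own[name])
        return self._master.files[name]

    def write(self, name, offset, data):
        # First change to a file: copy the master data into the clone.
        if name not in self._own:
            self._own[name] = bytearray(self._master.files[name])
        self._own[name][offset:offset + len(data)] = data

    def copied_files(self):
        """Names of the files that have been copied so far."""
        return set(self._own)
```

Creating the clone copies nothing, which matches the short cloning time reported above; after a write, the clone and the master hold separate versions of that file.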
At this point, we should clarify that adding a new server increases only the
overall storage capacity. As we have already stated, there is a single database
server running on the main VDS. Running a new instance of the database system on
another server belongs to the domain of distributed database systems and is outside
the scope of this work. Having more than one database server running
simultaneously and using the same data can only be accomplished through a
distributed database architecture in which the database instances communicate
with each other. It is not possible to implement a distributed database system
merely by sharing data, without altering the database system.
3.3. Clustered vs Unclustered
In this section we examine whether there is any difference in the performance of the
database system when files are clustered on the same server. We created four
instances of the same database. The database schema is the same as shown in
paragraph 3.1. The scaling factor was set to 10, resulting in 1,000,000 tuples in the
accounts table, and the number of concurrent clients was 10, executing the
transaction script of Table III 10,000 times.
In the first experiment the files of the database instances were distributed across
the system (unclustered), whereas in the second experiment all files of an instance
are clustered on one server. That is, all files of the first instance reside on the
main VDS, all files of the second instance reside on the second server, and so forth.
[Figure: results for database instances 1 to 4]
Fig. 3.6 TPC-B results for clustered and unclustered data
Figure 3.6 shows the results obtained when we ran the TPC-B experiment for clustered
and unclustered data. We can see that there is a significant difference in the
performance of the database system only for the first database instance (database 1).
This behaviour is explained by the fact that the database server runs on the
first server, where the files of the first database reside.
[Figure: results for database instances 1 to 4]
Fig. 3.7 Selections Only results for clustered and unclustered data
Figure 3.7 shows the results obtained when we ran the Selections Only experiment for
clustered and unclustered data. As expected, there is a significant difference in
the performance of the database system only for the first database instance (database
The previous experiments showed that, apart from the first instance, there is no
apparent change in the performance of the database system whether the database files
are clustered on a VDS or not. That means that the way files are placed by the
implemented filesystem does not pose any noticeable overhead to the database system.
The database system still remains the bottleneck and there is no significant change
in the performance of the system according to how the files are distributed to the
various servers. However, these results confirmed the results obtained from the
experiments of paragraph 3.1 and proved the expected:
“Having the database files in the same server where the database is running is always
faster than having the files distributed”.
Virtualization is a technology that is rapidly transforming the IT landscape and
fundamentally changing the way that people compute. In essence, virtualization is a
software layer which enables the sharing of hardware resources. The benefits of
virtualization are becoming more and more appealing nowadays, when high quality of
service and error-free operation are required of every system. A lot
of work has been done in the domain of virtual servers, but database virtualization is
still an open area. The way that a database system utilizes the underlying filesystem
makes things more complex.
This work tried to solve an existing problem in XCalibre's FlexiScale software:
providing a transparent, online scalable database system. The developed virtualized
system treats the database system as an ordinary application running on a VDS and it
succeeded in meeting the main requirements of XCalibre:
● high transparency to the end user, i.e., the end user is not aware of the
physical location where the data of any application or service (database) resides.
Any application running on any VDS can use files which reside in different
● on demand scaling of the database without having to shut down the database
server. There is no disruption of the running applications while a server is
added or removed by the system.
● generic approach suitable for any application and therefore any database
● friendly graphical user interface which enables the user to control the various
The results obtained from the evaluation of the developed system are promising and
show that a solution to the problem of database virtualization can be accomplished
by implementing a virtualized filesystem, i.e., provide a global namespace by
unifying the storage disks of the running VDSs.
4.1. Future Recommendations
The granularity of the developed system is the file. That is, it treats a file as
an indivisible entity and is not able to process the individual pages which form a
file. Most applications work with whole files; database systems are the exception,
as they can also retrieve only a specific page of a file. The development of a FUSE
project which is able to process pages in addition to files would be of great
importance to the database world. FUSE is a powerful tool that enables users to
implement their own filesystem according to their needs. A database system would no
longer need to bypass the underlying filesystem; instead, it could use a
filesystem tailored to its requirements in order to achieve better performance.
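To make the difference between file and page granularity concrete, the following sketch reads a single page out of a larger file by seeking to its offset. It is an illustration only and not part of the developed system or of FUSE: the function name is hypothetical, and the page size of 8192 bytes is PostGreSQL's default block size, used here as an assumption.

```python
PAGE_SIZE = 8192  # PostGreSQL's default block size, assumed for this sketch


def read_page(path, page_no, page_size=PAGE_SIZE):
    """Return only the requested page instead of the whole file."""
    with open(path, "rb") as f:
        f.seek(page_no * page_size)  # jump straight to the page's offset
        return f.read(page_size)
```

A page-aware filesystem could serve such a request by transferring one page over the network, whereas a file-granular system has to treat the whole file as the unit of placement.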
1. Brad, A. (2007), Power Support in a Virtualization Environment [White
paper], retrieved from MGE Office Protection Systems website:
2. XCalibre Home Page, (2007), in XCalibre, retrieved 4 August 2008 from
3. Stonebraker, M. and Kemnitz, G. (1991), “The POSTGRES next generation
database management system”, Communications of the ACM, 34(10), pp.
4. Creasy, R. J. (1981), “The origin of the VM/370 time-sharing system”, IBM
Journal of Research & Development, 25(5), pp. 483-490.
5. Mann, A. (2007), Virtualization 101 [White paper], retrieved from Enterprise
Management Associates (EMA) website:
6. Vmware, (2008), in Wikipedia, the free encyclopedia, retrieved 4 August
2008, from http://en.wikipedia.org/wiki/VMware.
7. Virtualization Basics, (2008), in vmware, retrieved 4 August 2008, from
8. Amazon Elastic Compute Cloud (Amazon EC2) – Beta, (2008), in amazon
web services, retrieved 4 August 2008, from
9. AppLogic - Grid Operating System for Web Applications, (2008), in
Applogic, retrieved 4 August 2008 from http://www.3tera.com/applogic.html.
10. How FlexiScale works, (2008), in FlexiScale, retrieved 4 August 2008 from
11. Bach, M., J., The Design of the Unix Operating System, New Jersey, Prentice
12. Coffey, T. and O'Shaughnesssy, A., Write a Linux Hardware Device, retrieved
from Network Computing website:
13. Filesystem in Userspace, (2008), in Wikipedia, the free encyclopedia,
retrieved 4 August 2008, from
14. FUSE Home Page, (2008), in FUSE, retrieved 6 August 2008, from
15. Gluster Storage System, (2008), in Gluster, retrieved 6 August 2008, from
16. The GNU/Hurd User's Guide, (2008), in HURD, retrieved 6 August 2008,
17. PostgreSQL 8.4devel Documentation, (2008), in PostGreSQL, retrieved 6
August 2008, from