Cashing in on the Cache in the Cloud
Hyuck Han, Young Choon Lee, Member, IEEE, Woong Shin, Hyungsoo Jung,
Heon Y. Yeom, Member, IEEE, and Albert Y. Zomaya, Fellow, IEEE
Abstract—Over the past decades, caching has become the key technology used for bridging the performance gap across memory
hierarchies via temporal or spatial localities; in particular, the effect is prominent in disk storage systems. Applications that involve
heavy I/O activities, which are common in the cloud, probably benefit the most from caching. The use of local volatile memory as cache
might be a natural alternative, but many well-known restrictions, such as capacity and the utilization of host machines, hinder its
effective use. In addition to technical challenges, providing cache services in clouds encounters a major practical issue (quality of
service or service level agreement issue) of pricing. Currently, (public) cloud users are limited to a small set of uniform and coarse-
grained service offerings, such as High-Memory and High-CPU in Amazon EC2. In this paper, we present the cache as a service
(CaaS) model as an optional service to typical infrastructure service offerings. Specifically, the cloud provider sets aside a large pool of
memory that can be dynamically partitioned and allocated to standard infrastructure services as disk cache. We first investigate the
feasibility of providing CaaS with the proof-of-concept elastic cache system (using dedicated remote memory servers) built and
validated on the actual system, and practical benefits of CaaS for both users and providers (i.e., performance and profit, respectively)
are thoroughly studied with a novel pricing scheme. Our CaaS model helps to leverage the cloud economy greatly in that 1) the extra
user cost for I/O performance gain is minimal, if any, and 2) the provider's profit increases due to improvements in server
consolidation resulting from that performance gain. Through extensive experiments with eight resource allocation strategies, we
demonstrate that our CaaS model can be a promising cost-efficient solution for both users and providers.
Index Terms—Cloud computing, cache as a service, remote memory, cost efficiency.
1 INTRODUCTION
THE resource abundance (redundancy) in many large
data centers is increasingly engineered to offer the spare
capacity as a service like electricity, water, and gas. For
example, public cloud service providers like Amazon Web
Services virtualize resources, such as processors, storage,
and network devices, and offer them as services on
demand, i.e., infrastructure as a service (IaaS) which is the
main focus of this paper. A virtual machine (VM) is a
typical instance of IaaS. Although a VM acts as an isolated
computing platform which is capable of running multiple
applications, it is assumed in this study to be solely
dedicated to a single application, and thus, we use the
expressions VM and application interchangeably hereafter.
Cloud services as virtualized entities are essentially elastic
making an illusion of “unlimited” resource capacity. This
elasticity with utility computing (i.e., pay-as-you-go pri-
cing) inherently brings cost effectiveness that is the primary
driving force behind the cloud.
However, putting a higher priority on cost efficiency than
cost effectiveness might be more beneficial to both the user
and the provider. Cost efficiency can be characterized by
having the temporal aspect as priority, which can translate to
the cost to performance ratio from the user’s perspective and
improvement in resource utilization from the provider’s
perspective. This characteristic is reflected in the present
economics of the cloud to a certain degree [1]. However, the
conflicting nature of these perspectives (or objectives) and
their resolution remain an open issue for the cloud.
In this paper, we investigate how cost efficiency in the
cloud can be further improved, particularly with applica-
tions that involve heavy I/O activities; hence, I/O-intensive
applications. They account for the majority of applications
deployed on today's cloud platforms. Clearly, their performance is significantly affected by how fast their I/O
activities are processed. Here, caching plays a crucial role in
improving their performance.
Over the past decades, caching has become the key
technology in bridging the performance gap across memory
hierarchies via temporal or spatial localities; in particular,
the effect is prominent in disk storage systems. Currently, the
effective use of cache for I/O-intensive applications in the
cloud is limited for both architectural and practical reasons.
Due to the essentially shared nature of some resources, such as disks (which are not performance isolatable), the virtualization overhead with these resources is not negligible, and it further worsens disk I/O performance. Thus, low disk I/O
performance is one of the major challenges encountered by
most infrastructure services as in Amazon’s relational
database service, which provisions virtual servers with
database servers. At present, the performance issue of I/O-
intensive applications is mainly dealt with by using high-
performance (HP) servers with large amounts of memory,
leaving it as the user’s responsibility.
To overcome low disk I/O performance, there have been
extensive studies on memory-based cache systems [2], [3],
[4], [5]. The main advantage of memory is that its access
time is several orders of magnitude faster than that of disk
storage. Clearly, disk-based information systems with a
memory-based cache can greatly outperform those without
cache. A natural design choice in building a disk-based
information system with ample cache capacity is to exploit a
single, expensive, large memory computer system. This
simple design—using local volatile memory as cache (LM
cache)—costs a great deal, and may not be practically
feasible in the existing cloud services due to various factors
including capacity and the utilization of host machines.
In this paper, we address the issue of disk I/O
performance in the context of caching in the cloud and
present a cache as a service (CaaS) model as an additional
service to IaaS. For example, a user is able to simply specify
more cache memory as an additional requirement to an IaaS
instance with the minimum computational capacity (e.g.,
micro/small instance in Amazon EC2) instead of an instance
with large amount of memory (high-memory instance in
Amazon EC2). The key contribution of this work is that our cache service model greatly augments the cost efficiency and elasticity of the cloud from the perspective of both users and
providers. CaaS as an additional service (provided mostly in
separate cache servers) gives the provider an opportunity to
reduce both capital and operating costs by using fewer active physical machines for IaaS; and this can
justify the cost of cache servers in our model. The user also
benefits from CaaS in terms of application performance with
minimal extra cost; besides, caching is enabled in a user
transparent manner and cache capacity is not limited to local
memory. The specific contributions of this paper are listed as
follows: first, we design and implement an elastic cache
system, as the architectural foundation of CaaS, with remote
memory (RM) servers or solid state drives (SSDs); this system
is designed to be pluggable and file system independent. By
incorporating our software component in existing operating
systems, we can configure various settings of storage
hierarchies without any modification of operating systems
and user applications. Currently, many users exploit the memory of distributed machines (e.g., memcached) by integrating a cache system with their applications at the application level or the file-system level. In such cases, users or administrators must prepare cache-enabled versions of their applications or file systems to obtain any caching benefit.
Hence, file system transparency and application transpar-
ency are some of the key issues since there is a great diversity
of applications or file systems in the cloud computing era.
Second, we devise a service model with a pricing scheme,
as the economic foundation of CaaS, which effectively
balances conflicting objectives between the user and the
provider, i.e., performance versus profit. The rationale
behind our pricing scheme in CaaS is that the scheme
ensures that the user gains I/O performance improvement
with little or no extra cost and at the same time it enables the
provider to get profit increases by improving resource
utilization, i.e., better service (VM) consolidation. Specifi-
cally, the user cost for a particular application increases
proportionally to the performance gain and thus, the user’s
cost eventually remains similar to that without CaaS.
Besides, the performance gains that the user gets with CaaS have
further cost efficiency implications if the user is a business
service provider who rents IaaS instances and offers value-
added services to other users (end users).
Finally, we apply four well-known resource allocation
algorithms (first-fit (FF), next-fit (NF), best-fit (BF), and
worst-fit (WF)) and develop their variants with live VM
migration to demonstrate the efficacy of CaaS.
Our CaaS model and its components are thoroughly
validated and evaluated through extensive experiments in
both a real system and a simulated environment. Our RM-
based elastic cache system is tested in terms of its
performance and reliability to verify its technical feasibility
and practicality. The complete CaaS model is evaluated
through extensive simulations; and their parameters are
modeled based on preliminary experimental results ob-
tained using the actual system.
The remainder of this paper is organized as follows:
Section 2 reviews the related work about caching and its
impact on I/O performance in the context of cloud
computing. Section 3 overviews and conceptualizes the
CaaS model. Section 4 articulates the architectural design of
our “elastic” cache system. Section 5 describes the service
model with a pricing scheme for CaaS. In Section 6, we
present results of experimental validation for the cache
system and evaluation results for our CaaS model. We then
conclude this paper in Section 7.
2 BACKGROUND AND RELATED WORK
There have been a number of studies conducted to
investigate the issue of I/O performance in virtualized
systems. The focus of these investigations includes I/O
virtualization, cache alternatives and caching mechanisms.
In this section, we describe and discuss notable work
related to our study. What primarily distinguishes ours
from previous studies is the practicality with the virtualiza-
tion support of remote memory access and the incorpora-
tion of service model; hence, cache as a service.
2.1 I/O Virtualization
Virtualization enables resources in physical machines to be
multiplexed and isolated for hosting multiple guest OSes
(VMs). In virtualized environments, I/O between a guest
OS and a hardware device should be coordinated in a safe
and efficient manner. However, I/O virtualization is one of
the severe software obstacles that VMs encounter due to its
performance overhead. Menon et al. [6] tackled virtualized
I/O by performing full functional breakdown with their
profiling tools.
Several studies [7], [8], [9] contribute to the efforts
narrowing the gap between virtual and native performance.
Cherkasova and Gardner [7] and Menon et al. [6] studied I/O
performance in the Xen hypervisor [10] and showed a
significant I/O overhead in Xen’s zero copy with the page-
flipping technique. They proposed that page flipping be
simply replaced by the memcpy function to avoid side effects.
Menon et al. [9] optimized I/O performance by introducing
virtual machine monitor (VMM) superpage and global page
mappings. Liu et al. [8] proposed a new device virtualization
called VMM-bypass that eliminates data transfer between
the guest OS and the hypervisor by giving the guest device
driver direct access to the device.
With an increasing emphasis on virtualization, many
hardware vendors have started to support hardware-level
features for virtualization. Hardware-level features have
been actively evaluated to seek for near native I/O
performance [11], [12], [13]. Zhang and Dong [11] used Intel
Virtualization Technology architecture to gain better I/O
performance. Santos et al. [14] used devices that support
multiple contexts. Data transfer is offloaded from the
hypervisor to the guest OS by using mapped contexts. Dong
et al. [13] achieved 98 percent of the native performance by
incorporating several hardware features such as device
semantic preservation with input/output memory manage-
ment unit (IOMMU), effective interrupt sharing with
message signaled interrupts, and reusing direct memory
access (DMA) mappings. All these studies focused on
network I/O, whereas this work looks at disk I/O.
2.2 Cache Device
Cooperative cache [2] is a kind of RM cache that improves
the performance of networked file systems. In particular, it
is adopted in the Serverless Network File System [3]. It uses
participating clients’ memory regions as a cache. A remote
cache is placed between the memory-based cache of a
requesting client and a server disk. Each participating client
exchanges meta information for the cache with others
periodically. Such a caching scheme is effective where RM access is faster than the local disk of the requesting client. Jiang et al.
[4] propose advanced buffer management techniques for
cooperative cache. These techniques are based on the
degree of locality. Data that have high (low) locality scores
are placed on a high-level (low-level) cache. Kim et al. [5]
propose a cooperative caching system that is implemented
at the virtualization layer, and the system reduces disk I/O
operations for shared working sets of virtual machines.
Lim et al. [15] proposed two architectures for RM
systems: 1) block-access RM supported in the coherence
hardware (FGRA), and 2) page-swapped RM at the
virtualization layer (PS). In FGRA, a few hardware changes
of memory producers are necessary. On the other hand, PS
implements a RM sharing module in a VMM.
Marazakis et al. [16] utilize remote direct memory access (RDMA) technology to improve I/O performance in a
storage area network environment. It abstracts disk devices
of remote machines into local block devices. RDMA-
enabled memory regions in remote machines are used as
buffers for write operations. Remote buffers are placed
between virtually addressed pages of requesting clients and
disk devices of remote machines in a storage hierarchy.
These proposals are different from our work in that our
system focuses on improving the I/O performance of a local
disk instead of a remote disk by using RM as a cache.
Recently, SSDs have been used as a file system cache or a
disk device cache in many studies. A hybrid drive [17] is a
NAND flash memory attached disk. Its internal flash
memory is used as the I/O buffer for frequently used data.
It was developed in 2007, but the performance improve-
ment was not significant due to the inadequate size of the
cache [18]. The Drupal data management system [19]
utilizes both SSD and HDD implicitly according to data
usage patterns. It is implemented at the software level. It
uses SSD as a file-system level cache for frequently used
data. Like a hybrid disk, the performance gain of Drupal is
not significant. Lee and Moon [20] showed that SSDs can
benefit transaction processing performance. Makatos et al.
[21] use SSD as a disk cache, and further performance
improvement is gained by employing online compression.
To alleviate performance problems of NAND flash mem-
ory, SSD-based cache systems can adopt striping [22],
parallel I/O [23], NVRAM-based buffer [24], and log-based
I/O [20], and these techniques could significantly help
amortizing the inherent latency of a raw SSD. Nevertheless,
the latency of an SSD is still higher than that of RM.
Ousterhout et al. [25] recently presented a new approach
to data processing, and proposed an architecture called RAMCloud, which stores data entirely in the DRAM of distributed systems. RAMCloud has performance benefits owing to the
extremely low latency. Thus, it can be a good solution to
overcome the I/O problem of cloud computing. However,
RAMCloud incurs high (operational) cost and high energy
usage. In this study, we use remote memory as a cache
device, which stores only data having high locality, to meet
the balanced point of I/O performance and its cost.
3 CACHE AS A SERVICE: OVERVIEW
The CaaS model consists of two main components: an elastic
cache system as the architectural foundation and a service
model with a pricing scheme as the economic foundation.
The basic system architecture for the elastic cache aims
to use RM, which is exported from dedicated memory
servers (or possibly SSDs). It is not a new caching
algorithm. The elastic cache system can use any of the
existing cache replacement algorithms. Near uniform access
time to RM-based cache is guaranteed by a modern high-
speed network interface that supports RDMA as primitive
operations. Each VM in the cloud accesses the RM servers
via the access interface that is implemented and recognized
as a normal block device driver. Based on this access layer,
VMs utilize RM to provision a necessary amount of cache
memory on demand.
As shown in Fig. 1, a group of dedicated memory servers
exports their local memory to VMs, and exported memory
space can be viewed as an available memory pool. This
memory pool is used as an elastic cache for VMs in the
cloud. For billing purposes, cloud service providers could
employ a lease mechanism to manage the RM pool.
To employ the elastic cache system for the cloud, service
components are essential. The CaaS model consists of two
cache service types (CaaS types) based on whether LM or
RM is allocated. Since these types differ in their performance and costs, a pricing scheme that incorporates these characteristics is devised as part of CaaS.
With these components in place, we consider the following scenario. The service
provider sets up a dedicated cache system with a large pool
of memory and provides cache services as an additional
service to IaaS. Now, users have an option to choose a cache
service specifying their cache requirement (cache size) and
that cache service is charged per unit cache size per time.
Specifically, the user first selects an IaaS type (e.g., Standard
small in Amazon EC2) as a base service. The user then
estimates the performance benefit of additional cache to her
application taking into account the extra cost, and deter-
mines an appropriate cache size based on that estimation. We
assume that the user is at least aware whether her application
is I/O intensive, and aware roughly how much data it deals
with. The additional cache in our study can be provided
either from the local memory of the physical machine on
which the base service resides or from the remote memory of
dedicated cache servers. The former LM case can be handled
simply by configuring the memory of the base service to be
the default memory size plus the additional cache size. On
the other hand, the latter RM case requires an atomic
memory allocation method to dedicate a specific region of
remote memory to a single user. Specific technical details of
RM cache handling are presented in Section 4.2.
The cost benefit of our CaaS model is twofold: profit
maximization and performance improvement. Clearly, the
former is the main objective of the service provider. The latter
also contributes to achieving such an objective by reducing
the number of active physical machines. From the user’s
perspective, the performance improvement of application
(I/O-intensive applications in particular) can be obtained
with CaaS in a much more cost efficient manner since
caching capacity is more important than processing power
for those applications.
4 ELASTIC CACHE SYSTEM
In this section, we describe an elastic cache architecture,
which is the key component in realizing CaaS. We first
discuss the design rationale for a RM-based cache, and its
technical details.
4.1 Design Rationale
Among many important factors in designing an elastic
cache system, we particularly focus on the type of cache
medium, the implementation level of our cache system, the
communication medium between a cache server and a VM,
and reliability.
Cache media. We have three alternatives to implement
cache devices. Clearly, LM would be the best option due to
the speed gap between LM and other devices (RM and
SSD). Because LM has a higher cost per unit capacity, which limits the amount that can be provisioned, dedicating a large amount of LM as cache could cause a side effect of memory pressure in the operating system; this capacity issue primarily motivates
us to consider using RM and SSD as alternative cache
media. RM and SSD enable VMs to flexibly provision cache
practically without such a strict capacity limit.
SSDs have recently emerged as a new storage medium
that offers faster and more uniform access time than HDDs.
However, SSDs have a few drawbacks due to the characteristics of NAND flash memory; in-place updates are not possible, and this causes extra overhead (latency) in page update operations (as Ousterhout et al. [25] point out, the low latency of a storage device is pivotal in designing storage systems). Although many strategies [22], [23], [20]
are proposed to alleviate such problems, the latency of an
SSD is still higher than that of RM. In addition to this, RM
has no such limitations so that it can be a good candidate for
cache memory.
Implementation level. Elastic cache can be deployed at
either application or OS level (block device or file system
level). In this paper, it is the fundamental principle that the
cache need not affect application code or file systems owing
to the diversity of applications or file system configurations
on cloud computing. Application level elastic cache such as
memcached (available at http://www.memcached.org) could have better performance than OS level
cache, since application level cache can exploit application
semantics. However, modification of application code is
always necessary for application level cache. A file system
level implementation can also provide many chances for
performance improvements, such as buffering and pre-
fetching. However, it forces users to use a specific file
system with the RM-based cache. In contrast, although a
block-device level implementation has fewer chances of
performance improvements than the application or file
system level counterpart, it does not depend on applica-
tions or file systems to take benefits from the underlying
block-level cache implementation.
RDMA versus TCP/IP. Despite the popularity of TCP/
IP, its use in high performance clusters has some restrictions
due to its higher protocol processing overhead and less
throughput than other cutting edge interconnects, such as
Myrinet and Infiniband. Since disk cache in our system
requires a low latency communication channel, we choose a
RDMA-enabled interface to guarantee fast and uniform
access time to RM space.
Dedicated-server-based cache versus cooperative cache.
Remote memory from dedicated servers might demand
Fig. 1. Overview of CaaS.
more servers and related resources, such as rack and power,
during the operation. However, the total number of
machines for data processing applications is not greater
than that of machines without RM-based cache systems. As
an alternative way, we could implement remote memory
based on a cooperative cache, which uses participants’ local
memory as remote memory. This might help reduce the number of machines used and the energy consumed, but the efficient management of a cooperative cache is a daunting task in large data centers. We thus return to the principle that local memory should be used for a guest OS or an application on virtual machines, rather than as remote memory. We consider this design rationale to be practically less problematic and a better choice for implementing real systems.
Reliability. One of the most important requirements for the
elastic cache is failure resilience. Since we implement the
elastic cache at the block device level, the cache system is
designed to support a RAID-style fault-tolerant mechanism.
Based on a RAID-like policy, the elastic cache can detect the failure of a cache server and recover from it automatically (for a single cache server failure).
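One common way to realize such a RAID-style scheme is to keep a parity block across cache servers; the fragment below sketches XOR-based reconstruction of the block lost with a failed server. It is a generic illustration of the idea (the function and variable names are ours), not the elastic cache's actual recovery code.

  /* Illustrative XOR parity reconstruction (RAID-4/5 style): the block lost
   * with a failed cache server is rebuilt from the surviving data blocks and
   * the parity block. This is a generic sketch, not the elastic cache's code. */
  #include <stddef.h>
  #include <stdint.h>

  static void rebuild_block(uint8_t *out, const uint8_t *const *survivors,
                            size_t nsurvivors, size_t block_size)
  {
      for (size_t b = 0; b < block_size; b++) {
          uint8_t v = 0;
          for (size_t s = 0; s < nsurvivors; s++)
              v ^= survivors[s][b];           /* XOR of surviving blocks plus parity */
          out[b] = v;
      }
  }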
In summary, we suggest that the CaaS model can be
better realized with an RM-based elastic cache system at the
block device level.
4.2 System Architecture
In this section, we discuss the important components of the
elastic cache. The elastic cache system is conceptually
composed of two components: a VM and a cache server.
A VM demands RM for use as a disk cache. We build an
RM-based cache as a block device and implement a new
block device driver (RM-Cache device). In the RM-Cache
device, RM regions are viewed as byte-addressable space.
The block address of each block I/O request is translated
into an offset of each region, and all read/write requests are
also transformed into RDMA read/write operations. We
use the device-mapper module of the Linux operating
system (i.e., DM-Cache, available at http://visa.cis.fiu.edu/ming/dmcache/index.html) to integrate both the RM-Cache
device and a general block device (HDD) into a single block
device. This forms a new virtual block device, which makes
our cache pluggable and file-system independent.
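To illustrate the address translation just described, the following is a minimal sketch in C, assuming fixed-size 512 MB chunks and 512-byte sectors; the identifiers (rm_chunk, rm_translate) are ours and do not correspond to actual names in the RM-Cache device.

  /* Hypothetical sketch: mapping a block (sector) address onto a remote
   * memory chunk and a byte offset within that chunk. Chunk size (512 MB)
   * and sector size (512 B) follow the description in the text; all
   * identifiers are illustrative, not the driver's real ones. */
  #include <stdint.h>

  #define SECTOR_SIZE 512ULL
  #define CHUNK_SIZE  (512ULL << 20)          /* 512 MB chunk */

  struct rm_chunk {                           /* one exported remote region */
      uint64_t remote_addr;                   /* base address on the memory server */
      uint32_t rkey;                          /* RDMA remote key for this region */
  };

  /* Translate a linear sector number into (chunk, byte offset). The caller
   * would then issue an RDMA read or write at that offset. */
  static inline void rm_translate(uint64_t sector, struct rm_chunk *chunks,
                                  struct rm_chunk **chunk, uint64_t *offset)
  {
      uint64_t byte_addr = sector * SECTOR_SIZE;
      *chunk  = &chunks[byte_addr / CHUNK_SIZE];
      *offset = byte_addr % CHUNK_SIZE;
  }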
In order to deal with resource allocation for remote
memory requested from each VM, a memory server offers a
memory pool as a cache pool. When a VM needs cache from
the memory pool, the memory pool provides available
memory. To this end, a memory server in the pool exports a
portion of its physical memory, in basic units called chunks (512 MB each), to VMs, and a server can
have several chunks. A normal server process creates 512 MB
memory space (chunk) via the malloc function, and it exports a
newly created chunk to all VMs, along with Chunk_Lock and
Owner regions to guarantee exclusive access to the chunk.
After a memory server process exchanges RDMA specific
information (e.g., rkey and memory address for corresponding
chunks) with a VM that demands RM, the exported memory
of each machine in the pool can be viewed as actual cache.
When a VM wants to use RM, it must first mark its ownership on the assigned chunks; it can then make use of those chunks as cache. An example of the layered architecture of a VM
and a memory pool, both of which are connected via the
RDMA interface, is concretely described in Fig. 2.
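As an illustration of the export step, the sketch below registers one 512 MB chunk, together with small Chunk_Lock and Owner words, with the InfiniBand verbs library so that VMs can access them via RDMA. The control structures and names are assumptions on our part, not the actual memory-server code.

  /* Hypothetical sketch of a memory server exporting one chunk. Assumes an
   * already-created protection domain (pd); error handling is trimmed. The
   * layout (Chunk_Lock and Owner words kept in a separate header region) is
   * illustrative only. */
  #include <stdint.h>
  #include <stdlib.h>
  #include <infiniband/verbs.h>

  #define CHUNK_SIZE (512UL << 20)            /* 512 MB data area */

  struct chunk_hdr {
      uint64_t chunk_lock;                    /* Chunk_Lock: target of remote CAS */
      uint64_t owner;                         /* Owner: id of the VM holding the chunk */
  };

  struct exported_chunk {
      struct chunk_hdr *hdr;
      void             *data;
      struct ibv_mr    *hdr_mr, *data_mr;     /* memory regions advertised to VMs */
  };

  static int export_chunk(struct ibv_pd *pd, struct exported_chunk *c)
  {
      c->hdr  = calloc(1, sizeof(*c->hdr));
      c->data = malloc(CHUNK_SIZE);
      if (!c->hdr || !c->data)
          return -1;

      /* Register both regions so VMs can perform RDMA read/write on the data
       * and remote atomics (CompareAndSwap) on the header. The rkey and base
       * address of each MR are what the server hands out to VMs. */
      c->hdr_mr  = ibv_reg_mr(pd, c->hdr, sizeof(*c->hdr),
                              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC);
      c->data_mr = ibv_reg_mr(pd, c->data, CHUNK_SIZE,
                              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_REMOTE_WRITE);
      return (c->hdr_mr && c->data_mr) ? 0 : -1;
  }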
When multiple VMs try to mark their ownership on the
same chunk simultaneously, the access conflict can be
resolved by a safe and atomic chunk allocation method,
which is based on the CompareAndSwap operation supported
by Infiniband. The CompareAndSwap operation of InfiniBand
atomically compares the 64-bit value stored at the remote
memory to a given value and replaces the value at the remote
memory to a new value only if they are the same. By the
CompareAndSwap operation, only one node can acquire the
Chunk_Lock lock and it can safely mark its ownership to
the chunk by setting the Owner variable to consumer’s id.
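A minimal sketch of how a VM could acquire a chunk with the CompareAndSwap verb is given below; it assumes a connected reliable (RC) queue pair and a registered 8-byte local buffer, and the constants (e.g., the value 0 meaning "free") are illustrative rather than taken from the actual implementation.

  /* Hypothetical sketch: acquire Chunk_Lock with an RDMA CompareAndSwap.
   * If the 64-bit word at (remote_addr, rkey) equals 0 (assumed to mean
   * "free"), it is atomically replaced by my_id; the previous value is
   * written into result_buf, so success means *result_buf == 0 once the
   * completion is polled. */
  #include <stdint.h>
  #include <infiniband/verbs.h>

  static int try_acquire_chunk(struct ibv_qp *qp, struct ibv_mr *result_mr,
                               uint64_t *result_buf, uint64_t remote_addr,
                               uint32_t rkey, uint64_t my_id)
  {
      struct ibv_sge sge = {
          .addr   = (uintptr_t)result_buf,    /* old remote value lands here */
          .length = sizeof(uint64_t),
          .lkey   = result_mr->lkey,
      };
      struct ibv_send_wr wr = {
          .sg_list    = &sge,
          .num_sge    = 1,
          .opcode     = IBV_WR_ATOMIC_CMP_AND_SWP,
          .send_flags = IBV_SEND_SIGNALED,
      };
      struct ibv_send_wr *bad_wr;

      wr.wr.atomic.remote_addr = remote_addr; /* address of Chunk_Lock */
      wr.wr.atomic.rkey        = rkey;
      wr.wr.atomic.compare_add = 0;           /* expected value: chunk is free */
      wr.wr.atomic.swap        = my_id;       /* new value: consumer's id */

      return ibv_post_send(qp, &wr, &bad_wr); /* poll the CQ before reading result_buf */
  }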
Double paging in RDMA. The double paging problem
was first addressed in [26], and techniques such as ballooning [27] have been proposed to avoid it. Since the problem is a
bit technical but very critical in realizing CaaS in the cloud
platform, we describe what implementation difficulty it
causes and how we overcome the obstacle. Goldberg and
Hassinger [26] define levels of memory as follows:
. Level 0 memory: memory of real machine
. Level 1 memory: memory of VM
. Level 2 memory: virtual memory of VM.
In VM environments, the level 2 (level 1) memory is
mapped into the level 1 (level 0) memory, and this is called
double paging. For RDMA communication, a memory
region (level 0 memory) should be registered to the RDMA
device (i.e., InfiniBand device). Generally, kernel-level
functions mapping virtual to physical addresses (i.e.,
virt_to_phys) are used for memory registration to the
RDMA device. In VMs, the return addresses of functions in
a guest OS are in level 1 memory. Since the RDMA device
cannot understand the context of level 1 memory addresses,
direct registration of level 1 memory space to RDMA leads
to malfunction of RDMA communication.
To avoid this type of double paging anomaly in RDMA
communication, we exploit hardware IOMMUs to get
DMA-able memory (level 0 memory). IOMMUs are hard-
ware devices that manage device DMA addresses. To
virtualize IOMMUs, VMMs like Xen provide software
IOMMUs. Many hardware vendors also redesign IOMMUs
so that they are isolated between multiple operating
systems with direct device access. Thus, we use kernel functions related to IOMMUs to get level 0 memory
addresses. The RM-Cache device allocates level 2 memory
space through kernel level memory allocation functions in
the VM. Then, it remaps the allocated memory to DMA-able
memory space through IOMMU. The mapped address of
Fig. 2. Elastic cache structure and double paging problem.
the DMA-able memory becomes level 0 memory that can
now be registered correctly by RDMA devices. Fig. 2
describes all these mechanisms in detail.
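The kernel-style fragment below sketches this remapping step under stated assumptions: the buffer is allocated with a guest kernel allocator and then mapped through the (virtualized) IOMMU with the standard Linux DMA API, yielding an address that can be registered with the RDMA device. The function and variable names are ours, and the actual RM-Cache driver may differ.

  /* Hypothetical sketch of obtaining a DMA-able (level 0) address for a
   * guest buffer before RDMA registration; dev is the HCA's struct device.
   * Error paths and the subsequent verbs registration call are omitted. */
  #include <linux/slab.h>
  #include <linux/dma-mapping.h>

  #define RM_BUF_SIZE (64 * 1024)

  static dma_addr_t rm_map_buffer(struct device *dev, void **vaddr_out)
  {
      void *vaddr;          /* level 2 address inside the guest */
      dma_addr_t dma_addr;  /* level 0 (machine) address usable by the HCA */

      vaddr = kmalloc(RM_BUF_SIZE, GFP_KERNEL);
      if (!vaddr)
          return 0;

      /* Remap through the IOMMU: the returned handle is what the RDMA
       * device can actually address, avoiding the double paging anomaly. */
      dma_addr = dma_map_single(dev, vaddr, RM_BUF_SIZE, DMA_BIDIRECTIONAL);
      if (dma_mapping_error(dev, dma_addr)) {
          kfree(vaddr);
          return 0;
      }

      *vaddr_out = vaddr;
      return dma_addr;
  }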
5 SERVICE MODEL
In this section, we first describe performance characteristics
of different cache alternatives and design two CaaS types.
Then, we present a pricing model that effectively captures
the tradeoff between performance and cost (profit).
5.1 Modeling Cache Services
I/O-intensive applications can be characterized primarily
by data volume, access pattern, and access type; i.e., file
size, random/sequential and read/write, respectively. The
identification of these characteristics is critical in choosing
the most appropriate cache medium and proper size since
the performance of different storage media (e.g., DRAMs,
SSDs, and HDDs) varies depending on one or more of those
characteristics. For example, the performance bottleneck
sourced from frequent disk accesses may be significantly
improved using SSDs as cache. However, if those accesses
are mostly sequential write operations, the performance
with SSDs might only be marginally improved or even
made worse. Although the use of LM as cache delivers incomparably better I/O performance than other cache alternatives such as RM (surprisingly, the performance of LM cache was only marginally better than that of RM in most of our experiments; we believe the main cause of this unexpected result is the behavior of the "pdflush" daemon in Linux, i.e., frequently writing dirty data back to disk), such a use is limited by several issues including capacity and the utilization of host machines. In consideration of these facts, we have
designed two CaaS types as the following:
. High performance (HP)—makes use of LM as cache, and
thus, its service capacity is bounded by the max-
imum amount of LM.
. Best value (BV)—exploits RM as cache practically
without a limit.
In our CaaS model, it is assumed that a user, who
sends a request with a CaaS option (HP or BV), also
accompanies an application profile including data volume,
data access pattern, and data access type. It can be argued
that these pieces of application specific information might
not be readily available particularly for average users, and
some applications behave unpredictably. In this paper, we
primarily target the scenario in which users repeatedly
and/or regularly run their applications in clouds, and they
are aware of their application characteristics either by
analyzing business logic of their applications or by
obtaining such information using system tools (e.g., sysstat, available at http://sebastien.godard.pagesperso-orange.fr) and/or application profiling [28], [29]. When a user is unable to identify/determine these characteristics, he/she simply rents default IaaS instances without any cache service option, since CaaS is an optional service to IaaS. The service
granularity (cache size) in our CaaS model is set to a
certain size (512 MB/0.5 GB). In this study, we adopt three
default IaaS types: small, medium, and large, with flat rates of $f_s$, $f_m$, and $f_l$, respectively.
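For concreteness, the structure below sketches the kind of request a user might submit: a base IaaS type, a CaaS type, a cache size in 0.5 GB units, and a coarse application profile (data volume, access pattern, access type). The field and type names are illustrative only and do not correspond to an actual API.

  /* Hypothetical CaaS request descriptor, following the service model in
   * the text; none of these identifiers come from a real interface. */
  enum iaas_type  { IAAS_SMALL, IAAS_MEDIUM, IAAS_LARGE };
  enum caas_type  { CAAS_NONE, CAAS_HP, CAAS_BV };
  enum access_pat { ACC_SEQUENTIAL, ACC_RANDOM };

  struct app_profile {
      unsigned long   data_volume_gb;   /* approximate working-set / file size */
      enum access_pat pattern;          /* random vs. sequential */
      unsigned int    read_pct;         /* read share, e.g., 66 for a 2:1 read:write ratio */
  };

  struct caas_request {
      enum iaas_type     base;          /* e.g., a small instance as the base IaaS */
      enum caas_type     cache;         /* HP (local memory) or BV (remote memory) */
      unsigned int       cache_units;   /* number of 0.5 GB cache units requested */
      struct app_profile profile;       /* submitted alongside the request */
  };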
5.2 Pricing
A pricing model that explicitly takes into account various
elastic cache options is essential for effectively capturing the
tradeoff between (I/O) performance and (operational) cost.
With HP, it is rather common to have many “awkward”
memory fragmentations (more generally, resource fragmen-
tations) in the sense that physical machines may not be used
for incoming service requests due to lack of memory. For
example, for a physical machine with four processor cores and a maximum LM of 16 GB, a request with a 13 GB HP cache requirement on top of a small IaaS instance (which uses one core) occupies the majority of the LM, leaving only 3 GB
available. Due to such fragmentations, an extra cost is
imposed on the HP cache option as a fragmentation penalty
(or performance penalty).
The average number of services (VMs) per physical
machine with the HP cache option (or simply HP services)
is defined as

$HP_{services} = \frac{LM_{\max}}{m_{HP}} \times a_{HP}$,  (1)

where $LM_{\max}$ is the maximum local memory available and $m_{HP}$ is the average amount of local memory for HP services. The amount of LM cache requested for HP is assumed to follow a uniform distribution.
And the average number of services per physical machine without HP is defined as

$nonHP_{services} = \sum_{j=0}^{s_t} \left( \frac{LM_{\max}}{m_j} \times a_j \right)$,  (2)

where $s_t$ is the number of IaaS types (i.e., three in this study), $m_j$ is the memory capacity of a service type $j$ ($s_j$), and $a_j$ is the rate of services with type $j$.
Then, the average numbers of services (service count, or $sc$) per physical machine with and without HP requests are defined as

$sc_{HP} = HP_{services} + nonHP_{services}$,  (3)

$sc_{noHP} = \frac{nonHP_{services}}{1 - a_{HP}}$,  (4)

where $a_{HP}$ is the rate of HP services. Note that the sum of all $a_j$ is $1 - a_{HP}$. We assume that the service provider has a means to determine the request rates of service types, including the rate of I/O-intensive applications ($a_{IO}$) and further the rates of those with HP and BV ($a_{HP}$ and $a_{BV}$), respectively. Since services with BV use a separate RM server, they are treated the same as default IaaS types (small, medium, and large).
In the CaaS model, the difference between $sc_{noHP}$ and $sc_{HP}$ can be seen as the consolidation improvement (CI). For a given IaaS type $s_i$, the rates (unit prices) for HP and BV are then defined as

$c_{HP,i} = f_i \times pi_{HP,i} + (f \times CI)/sc_{HP}$,  (5)

$c_{BV,i} = f_i \times pi_{BV,i}$,  (6)

where $pi_{HP,i}$ and $pi_{BV,i}$ are the average performance improvements per unit cache increase (e.g., 0.5 GB) of LM and RM, respectively, and $f$ is the average service rate; these values might be calculated based on application profiles (empirical data).
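As a worked illustration of (1)-(6), the snippet below plugs in hypothetical values (they are not taken from our experiments) to show how the HP and BV unit prices would be computed; it also treats CI as $sc_{noHP} - sc_{HP}$, which is our reading of the fragmentation penalty.

  /* Hypothetical numeric walk-through of the pricing scheme; all inputs are
   * made-up example values, not measured parameters. */
  #include <stdio.h>

  int main(void)
  {
      double LM_max = 16.0;                 /* GB of local memory per machine */
      double m_HP   = 8.0, a_HP = 0.2;      /* avg LM per HP service, HP rate */
      double m[3]   = {1.0, 4.0, 8.0};      /* small/medium/large memory (GB) */
      double a[3]   = {0.4, 0.3, 0.1};      /* rates of non-HP types (sum = 1 - a_HP) */
      double f_i    = 0.1, f = 0.2;         /* small-instance rate, avg service rate ($/hr) */
      double pi_HP  = 1.6, pi_BV = 1.4;     /* avg perf. improvement per 0.5 GB of cache */

      double hp_services = (LM_max / m_HP) * a_HP;                 /* eq. (1) */
      double non_hp = 0.0;
      for (int j = 0; j < 3; j++)
          non_hp += (LM_max / m[j]) * a[j];                        /* eq. (2) */

      double sc_HP   = hp_services + non_hp;                       /* eq. (3) */
      double sc_noHP = non_hp / (1.0 - a_HP);                      /* eq. (4) */
      double CI      = sc_noHP - sc_HP;     /* consolidation improvement */

      double c_HP = f_i * pi_HP + (f * CI) / sc_HP;                /* eq. (5) */
      double c_BV = f_i * pi_BV;                                   /* eq. (6) */

      printf("sc_HP=%.2f sc_noHP=%.2f CI=%.2f c_HP=%.3f c_BV=%.3f\n",
             sc_HP, sc_noHP, CI, c_HP, c_BV);
      return 0;
  }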
With BV, the rate is solely dependent on $pi_{BV}$, and thus, the total price the user pays for a given service request is expected, on average, to be equivalent to that without cache, as shown in Fig. 3. We acknowledge that using an average performance improvement, which results in uniform service rates ($c_{HP,i}$ and $c_{BV,i}$), might not be accurate; however, these rates are only indicative. In the actual experiments, charges for services with a cache option are calculated such that the price for a particular service (application) remains the same regardless of whether a cache option is used and which type of cache option is chosen. The cost efficiency
characteristic of BV can justify the use of an average over varying $pi_{BV}$ values, the different values being due to application characteristics (e.g., data access pattern and type) and cache size. Alternatively, different average performance improvement values (i.e., $pi_{HP}$ and $pi_{BV}$) can be used depending on
application characteristics (e.g., data access pattern and
type) profiled and specified by the user/provider. Further,
rates (pricing) may be mediated between the user and the
provider through service level agreement negotiation.
It might be desirable that the performance gain that users
experience with BV is proportional to that with HP. In other
words, their performance gap may be comparable to the
extra rate imposed on HP. The performance of a BV service
might not be easily guaranteed or accurately predicted since
that performance is heavily dependent on 1) the type and
amount of additional memory, 2) data access pattern and
type, and 3) the interplay of 1 and 2.
6 EVALUATION
In this section, we evaluate CaaS from the viewpoints of
both users and providers. To this end, we first measure the
performance benefit of our elastic cache system—in terms
of performance (e.g., transactions per minute), cache hit
ratio and reliability. The actual system level modification
for our system is not possible with the existing cloud
providers like Amazon and Microsoft. We can neither
dedicate physical servers of the cloud providers to RM
servers nor assign SSDs and RDMA devices to physical
servers. Owing to these issues, we could not test our
systems on real cloud services but we built an RDMA- and
SSD-enabled cloud infrastructure (Fig. 4) to evaluate our
systems. We then simulate a large-scale cloud environment
with more realistic settings for resources and user requests.
This simulation study enables us to examine the cost
efficiency of CaaS. While experimental results in Section 6.1
demonstrate the feasibility of our elastic cache system, those
in Section 6.2 confirm the practicality of CaaS (or applic-
ability of CaaS to the cloud).
6.1 Experimental Validation: Elastic Cache System
We validate the proof-of-concept elastic cache system with
two well-known benchmark suites: a database benchmark
program (TPC-C) and a file system benchmark program
(Postmark). TPC-C, which simulates OLTP activities, is
composed of read-only and update transactions. The TPC-
C benchmark is update intensive with a 1.9:1 I/O read to
write ratio, and it has random I/O access patterns [30].
Postmark, which is designed to evaluate the performance of
e-mail servers, is performed in three phases: file creation,
transaction execution, and file deletion. Operations and files
in the transaction execution phase are randomly chosen. We
choose them because these two benchmarks exhibit all the important characteristics of modern data processing applications. Intensive experiments with these applications show
that the prototype elastic cache architecture is a suitable
model as an efficient caching system for existing IaaS models.
Because of the attractive performance characteristics of
SSDs, the usefulness of our system might be questionable
compared with an SSD-based cache system. To answer
this, we compared our elastic cache system with an SSD-
based system.
6.1.1 Experimental Environments
Throughout this paper, we use experimental environments
as shown in Fig. 4. For performance evaluation, we used a
7-node cluster, each node of which is equipped with an
Intel(R) Core(TM)2 Quad CPU 2.83 GHz and 8 GB RAM.
All nodes are connected via both a switched 1 Gbps
Ethernet and 10 Gbps Infiniband. We used Infinihost IIILx
HCA cards from Mellanox for Infiniband connection. A
memory server runs Ubuntu 8.0.4 with Linux 2.6.24 kernel,
and exports 1 GB memory. One of the clusters instantiates a
VM using Xen 3.4.0. The VM with Linux 2.6.32 has 2 GB
memory and 1 vCPU, and it runs benchmark programs. The
cache replacement policy is Least Recently Used (LRU).
Fig. 3. Cost efficiency of CaaS. $n_c \times C_{HP}$ and $n_c \times C_{BV}$ are the extra costs charged for the HP and BV CaaS types, respectively, where $n_c$ is the number of cache units (e.g., 0.5 GB per cache unit). $t_{HP}$, $t_{BV}$, and $t_{no\text{-}CaaS}$ are the performance delivered with the two CaaS types and without CaaS, respectively. Then, for a given IaaS type $s_i$, we have the following: $(f_i + C_{HP,i} \times n_c) \times t_{HP,i} \approx (f_i + C_{BV,i} \times n_c) \times t_{BV,i} = f_i \times t_{no\text{-}CaaS,i}$.
Fig. 4. Experimental environment.
We configured the VM to use a 16 GB virtual disk
combined with 4 GB elastic cache (i.e., RM) via the RM-
Cache device. The ext3 file system was used for benchmark
tests. To assess the efficiency of our system, we compared
our system to a virtual disk with an SSD-based cache device
and a virtual disk without any cache space. For the SSD-
based cache device, we used one Intel X25-M SSD device.
Throughout this section, we denote “virtual disk with the
RM-based cache,” “virtual disk with the SSD-based cache,”
and “virtual disk without any cache” as RM-cache, SSD-
cache, and No-cache, respectively.
6.1.2 TPC-C Results
We first evaluate the Online Transaction Processing (OLTP)
performance on PostgreSQL, a popular open-source DBMS.
The DBMS server runs inside the VM, and the RM-Cache
device is used for the disk device assigned to databases. To
measure the OLTP performance on PostgreSQL, we used
BenchmarkSQL (available at http://pgfoundry.org/projects/benchmarksql), which is a JDBC benchmark that closely
resembles the TPC-C standard for OLTP. We measured the
transaction rate (transactions per minute, tpmC) with
varying numbers of clients and warehouses. It is worth
noting that “warehouse” or “warehouses” will be abbre-
viated as WH.
Fig. 5 shows the measured tpmC and the database size.
We observe the highest tpmC at the smallest WH instance
in the RM-cache environment. Also, as the number of WHs
and clients increases, the tpmC value decreases in all
device configurations. Measured tpmC values of 60 WH
are between 270 and 400 in the No-cache environment. The
performance of the SSD-cache environment is better than
that without cache by a factor of 8, and the RM-based
cache outperforms the SSD-based cache by a factor of 1.5 due to superior bandwidth and latency (note that we used a new SSD for our evaluation, so the SSD device was in its best condition; it is well known that SSD performance degrades greatly with prolonged use). As shown in Table 1, the PostgreSQL DBMS has a strong locality in its
data access pattern when processing the TPC-C-like work-
load, and SSD-based and RM-based cache devices exploit
this locality. In fact, frequently accessed data, such as indices, are always kept in the cache device, while less frequently accessed data, such as unpopular records, are located on the virtual disk. Results of the 90 and 120 WH cases are similar to
those of the 60 WH case in that the performance of the RM-
cache case is always the best.
6.1.3 Postmark Results
Postmark, which is designed to evaluate the performance of
file servers for applications, such as e-mail, netnews, and
web-based commerce, is performed in three phases: file
creation, transaction execution, and file deletion. In this
experiment, the number of transactions and subdirectories
are 100,000 and 100, respectively. Three experiments are
performed by increasing the number of files.
Fig. 6 and Table 2 show the results of the Postmark
benchmark when 1) a RM-based device is used as a cache of
a virtual disk, 2) an SSD device is used, and 3) no cache
device is used. The total size of files for each experiment (as
the number of files increases from 200,000 to 800,000) is 3.4,
6.8, and 13.4 GB, and this leads to a lower cache hit ratio.
From the figure, we can see that both cache-enabled cases
outperform No-cache cases. Because Postmark is an I/O-
intensive benchmark, I/O operations involve many cache
operations. Thus, cache devices lead to better I/O performance of virtual resources. With 200,000, 400,000, and 800,000 files, RM-cache cases show 9, 5.5, and 2.5 times better performance than No-cache cases, respectively. RM-cache cases also have
up to 130 percent better performance than SSD-cache cases.
6.1.4 Other Experiments
Effects of cache size. Fig. 7 shows the results of the TPC-C
benchmark when the size of RM is varied. A large size of
cache increases the performance of TPC-C, due to the high
probability that a data block will reside in the cache. When a
cache of 1 GB RM is used, the performance with a cache is
Fig. 5. Results of TPC-C Benchmark (12 clients).
TABLE 1. Database Size and Cache Hit Ratio of the TPC-C Benchmark
Fig. 6. Results of postmark benchmark (seconds).
TABLE 2. Cache Hit Ratio of Postmark
Fig. 7. Effects of cache size (RM-cache, TPC-C, 90 WH, and 12 clients).
2.5 times better than that without any cache. A cache of
4 GB (8 GB) RM shows 2.4 (2.6) times better performance
than that of 1 GB RM. From the observation, we can safely
conclude that even a small or a moderate size of RM-based
cache can accelerate data processing applications on
existing cloud services and users can choose the suitable
cache size for their performance criteria.
Effects of file systems. Fig. 8 shows TPC-C results with
various file systems. For this experiment, we used ext2, ext3,
and reiserfs file systems. In all cases, we can see that RM-
cache cases show better performance than No-cache cases.
The ext3 and reiserfs file systems are journaling file systems;
updates to files are first written as predefined compact
entries in the journal region, and then the updates are written
to their destination on the disk. This leads to smaller performance benefits in journaling file systems. In fact, the journal data need not be cached since they are used only for recovery from a file system crash. While the ext3 file system journals both metadata and data, the reiserfs file system journals only metadata. This leads to better performance in the reiserfs case with cache. In contrast, since the ext2 file system is not a journaling file system, the ext2
case with cache shows the best performance among the three.
In the ext2 file system, metablocks, such as superblocks and indirect blocks, must be accessed before actual data are
read. Thus, when such metablocks are located in the cache,
the performance gain of the elastic cache is maximized. From
this experiment, we can see that the elastic cache provided by
our cache system is file system independent and greatly
helpful for the file system performance.
6.1.5 Discussion
From our experimental results, we can draw the following
lessons. First, a small or moderate size of RM-based cache
can improve virtual disk I/O performance. Thus, if users set
an appropriate cache size, it can lead to cost-effective
performance. Second, our system can safely recover from a
single machine crash although the performance gradually
decreases during the recovery; this enhances the reliability.
Third, our system improves virtual disk I/O performance
irrespective of file systems and supports various configura-
tions of data processing applications.
It is well known that main memory databases (MMDBs)
outperform disk-based databases (DDB) due to the locality of
data in local main memory. However, since an MMDB
typically requires a large amount of main memory, it costs a
great deal. It may not be possible to provide adequate main
memory with virtual machines. From the previous section, we can see that a DDB with RM-cache leads to up to 7-8 times better performance than that without any cache for TPC-C, making it a real alternative to an MMDB.
To verify this, we compare MMDBs to DDBs with RM-
cache and RM-based block device. In the experiment, we
use MySQL Cluster and MySQL with InnoDB as MMDB
and DDB, respectively. The core components of the MySQL
Cluster are mysqld, ndbd, and ndb_mgmd. mysqld is the
process that allows external clients to access the data in the
cluster. ndbd stores data in the memory and supports both
replication and fragmentation. ndb_mgmd manages all
processes of MySQL Cluster. An RM-based block device
appears as a mounted file system, but it is stored in RM
instead of a persistent storage device. Table 3 shows TPC-C
results obtained using three cache alternatives. The results
seem somewhat controversial in that the performance of
MMDB is not as good as what is normally expected. The
main reason for this is due to the inherent architecture of
MySQL cluster. An MMDB stores all data (including
records and indices for relational algebraic operations) to
the address space of ndbd processes, and this requires
coordination among MySQL daemons (mysqld and ndbd).
Thus, it usually exchanges many control messages. When
exchanging these messages between mysqld and ndbd,
MySQL is designed to use TCP/IP for all communications
between these processes. This incurs significant overhead
especially when transaction throughput reaches a certain
threshold level that inevitably saturates the performance.
However, DDBs do not incur IPC overhead since the
InnoDB storage engine is directly embedded to mysqld. The
results in Table 3 show that the DDB with RM-cache outperforms the MMDB. In addition, MySQL Cluster supports only a very small temporary space, so queries that require temporary space incur large overhead when processing relational algebraic operations. These factors lead to relatively unfavorable performance for the MMDB.
6.2 Experiments: Cost Efficiency of CaaS
In this section, the cost efficiency of CaaS is evaluated.
Specifically, extensive experiments with the elastic cache
system are performed under a variety of workload char-
acteristics to extract performance metrics, which are to be
used as important parameters for large-scale simulations.
6.2.1 Preliminary Experiments
The performance metric of I/O-intensive applications is
obtained to measure the average performance improvement
of LM and non-LM cache (i.e., $pi_{HP}$ and $pi_{BV}$). To this end,
we slightly modified Postmark so that all I/O operations
are either read or update. The modified Postmark is used to
profile I/O-intensive applications by varying the ratio of
read to update. A set of performance profiles is used as
parameters for our simulation presented in Section 6.2.2.
Fig. 8. Effects of file systems (TPC-C, 90 WH, and 12 clients).
TABLE 3. Comparison between MMDB and DDB with RM-Cache and RM-Based Block Device (TPC-C, 40 WH, and 12 Clients)
The experiment is conducted on the same cluster that
was used in the previous performance experiment (or
Section 6.1). To obtain as many profiles as possible, we
increase the virtual disk space from 16 to 32 GB. We vary
the data set size from 3 to 30 GB (3, 7, 10, 15, and 30) and
(RM/SSD) cache size from 512 MB to 16 GB. In addition, six
different read to update ratios (10:0, 8:2, 6:4, 4:6, 2:8, and
0:10) are used to represent various I/O access patterns. We
set the parameters of Postmark, such as min/max sizes of a
file and the number of subdirectories, to 1.5 KB, 90 KB and
100, respectively.
Fig. 9 shows the measured elapsed time of executing
100,000 transactions only for RM-Cache with 3 and 10 GB
data sets because other results from using SSD and with
other data sets (i.e., 7, 15, and 30 GB) reveal similar
performance characteristics. As the cache size increases,
the performance gain increases as well. Most of the cases
have benefited from the increased cache size, except for the
case when the data set is small. As shown in Fig. 9a, in some cases the hard disk outperforms the elastic cache since the 3 GB data set almost fits into the local memory (2 GB); most of the data can
be loaded and served from the page cache. The use of
additional cache devices like the elastic cache, which is
inherently slower than the page cache, might cause more
overhead than we expect in certain workload configurations.
Increasing the rate of update operations also affects the
performance. As we increase the rate of updates, the
performance of the elastic cache increases when data sets
are large (Fig. 9b) while the performance degrades when
data sets are small (Fig. 9a). Since the coherency protocol of
the elastic cache is the write-back protocol, the cache
operates as if it is a write buffer for the updates, and this
gives performance benefits to update operations. Increase in
the cache size further improves the throughput of the
update intensive workloads. However, with small data sets,
the page cache is better for read operations. While most
read operations can be served from the page cache, updates
suffer from dirty page replacement traffic with relatively
high latency of the cache device and the hard disk.
Apparently, the throughput decreases as the size of data
grows. Specifically, this can be expected because the
advantage of using LM no longer exists. In general, it is
the result of higher latency when accessing larger data sets.
To measure the performance gain of HP jobs, we additionally provide the same amount of extra memory, to make the experiments fair, because BV jobs use that amount of cache space on SSD or the elastic cache. We configure the experiments so that the extra memory is used as the page cache of Linux, which is the user's natural choice.
Fig. 10 shows the measured elapsed time for executing
100,000 transactions. From the figure we see somewhat
unexpected (or controversial) results that the performance
gain of LM depends strongly on the read to update ratio
rather than the amount of page cache; in other words, more
update operations make such an unexpected performance
pattern conspicuous. This is because the "pdflush" daemon in Linux writes dirty data back to disk when either 1) the data have resided in memory for more than 30 seconds, or 2) dirty pages have consumed more than 10 percent of the active, working memory.
6.2.2 Experimental Settings
The cost efficiency of CaaS is evaluated through extensive
simulations with randomly generated workloads, and each
simulation is conducted using the metric for performance
improvement of each cache. Different workload character-
istics were applied. Table 4 summarizes the parameters
used in our experiments. For this evaluation, each compu-
tational resource has two quad-core processors, 16 GB
RAM, 80 GB SSD, and 1 TB HDD, while each RM cache
server has a dual-core processor, 32 GB RAM, and 500 GB
HDD. In this experiment, we adopt three default IaaS types,
and each has the following specification:
. small: one core, 1 GB RAM, and 50 GB disk ($0.1/hr)
. medium: two cores, 4 GB RAM, and 100 GB disk
($0.2/hr)
. large: four cores, 8 GB RAM, and 200 GB disk
($0.4/hr).
Fig. 9. Results of postmark 100k transactions (RM-Cache).
Fig. 10. Results of postmark 100k transactions for 10 GB data (extra
memory).
A distinctive design rationale for CaaS is that the service
provider should be assured of profitability improvement
under various operational conditions; that is, the impact of
the resource scheduling policy that a provider adopts on its
profit should be minimal. To meet such a requirement, we assess the performance characteristics under four well-known resource allocation algorithms—First-Fit, Next-Fit,
Best-Fit, and Worst-Fit—and a variant for each of these four;
hence, eight in total. The four variants adopt live resource
(VM) migration. FF places a user’s resource request in the first
resource that can accommodate the request. NF is a variant of
FF and it searches for an available resource from the resource
that is selected at the previous scheduling. BF (/WF) selects
the smallest (/largest) resource among those that can meet
the user’s resource request. Besides, we consider live VM
migration which has been widely studied primarily for better
resource management [31], [32]. In our service, a resource is migrated to another physical machine only if the application running on that resource is not I/O intensive. The decision on
resource migration is made in a best fit fashion. Thus, we
evaluate our CaaS model using the following eight algo-
rithms: FF, NF, BF, WF, and their migration counterparts,
FFM, NFM, BFM and WFM.
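For clarity, the four base placement policies can be sketched as follows; this is a deliberately simplified, single-dimension (free RAM only) illustration of ours, whereas the simulator described here also accounts for cores, disk, cache capacity, and live migration.

```python
# Simplified sketch (ours) of the four base placement policies. Hosts are
# described only by their free RAM in GB; the actual simulator also tracks
# cores, disk, cache capacity, and live migration.

def first_fit(hosts, demand):
    for i, free in enumerate(hosts):
        if free >= demand:
            return i
    return None  # no existing host fits; a new physical machine is activated

def next_fit(hosts, demand, start):
    n = len(hosts)
    for k in range(n):
        i = (start + k) % n          # resume from the previously selected host
        if hosts[i] >= demand:
            return i
    return None

def best_fit(hosts, demand):
    fits = [(free, i) for i, free in enumerate(hosts) if free >= demand]
    return min(fits)[1] if fits else None    # tightest fit (smallest residual)

def worst_fit(hosts, demand):
    fits = [(free, i) for i, free in enumerate(hosts) if free >= demand]
    return max(fits)[1] if fits else None    # loosest fit (largest residual)

# Example: place a 4 GB (medium) request on hosts with 2, 6, and 16 GB free.
hosts = [2, 6, 16]
print(first_fit(hosts, 4), best_fit(hosts, 4), worst_fit(hosts, 4))  # -> 1 1 2
```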
In our simulations, we set the number of physical
resources to be virtually unlimited.
6.2.3 Performance Metrics
We assume users who select BV are conservative in their spending,
and that their applications are I/O intensive but not mission
critical. Therefore, the performance gain from services with more
cache in BV is very beneficial. The reciprocal benefit of that
performance gain is realized on the service provider's side
through more efficient resource utilization enabled by effective
service consolidation. These benefits are measured using two
performance metrics, both expressed in monetary terms.
Specifically, the benefit for users is measured by the prices paid
for their I/O-intensive applications, whereas that for providers
is quantified by the profit (more specifically, the unit profit)
obtained from running those applications. The former metric is
quite direct: the average price paid for I/O-intensive
applications is adopted. However, the performance metric
for providers is a little more complicated since the cost
related to serving those applications (including the number
of physical resources used) needs to be taken into account,
and thus, neither the total profit nor the average profit may
be an accurate measurement. As a result, the average unit
profit $up$ is devised as the primary performance metric for
providers; it is defined as the total profit $p_{total}$ obtained
over the "relative" number of physical nodes, $rpn$. More formally,

$$p_{total} = \sum_{i=1}^{r} p_i, \qquad (7)$$

$$rpn = \sum_{i=1}^{r} act_i / act_{max}, \qquad (8)$$

and

$$up = p_{total} / rpn, \qquad (9)$$

where $r$ is the total number of service requests (VMs), and
$act_i$ and $act_{max}$ are the active duration of a physical
node $m_i$ (which may vary between nodes) and the maximum
duration among all physical nodes, respectively.
The active duration of a physical node is defined as the
amount of time from the time the node is instantiated to the
end time of a given operation period (or the finish time of a
particular experiment in our study).
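Given per-request profits and per-node active durations, the average unit profit of (7)-(9) is straightforward to compute; the sketch below is ours, and it reads the sum in (8) as ranging over the physical nodes used, which is what the "relative" number of physical nodes measures.

```python
# Sketch (ours) of the unit-profit metric in Eqs. (7)-(9). The sum in Eq. (8)
# is taken over the physical nodes used (our reading of the "relative" number
# of physical nodes).

def unit_profit(request_profits, node_active_durations):
    p_total = sum(request_profits)                      # Eq. (7): total profit
    act_max = max(node_active_durations)
    rpn = sum(act / act_max for act in node_active_durations)  # Eq. (8)
    return p_total / rpn                                # Eq. (9): unit profit

# Example: three requests, two physical nodes active for 10 h and 5 h.
print(unit_profit([1.2, 0.8, 2.0], [10.0, 5.0]))  # 4.0 / 1.5 = 2.67
```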
6.2.4 Results
The number of experiments conducted with the eight resource
allocation algorithms is 320. Each experiment is repeated eight
times, and the average of the eight results is taken as the
average profit for the corresponding parameter setting. These
average unit profits are normalized to the average unit profit of
the WF algorithm. Fig. 11 shows the overall benefit of CaaS. From
the figure, we observe that IaaS requests with CaaS yield more
benefit (36 percent on average) to service providers than those
without CaaS, regardless of the resource allocation algorithm and
VM migration policy. The benefit with VM migration is 32 percent
higher on average than without it. The Best-Fit algorithm yields
more profit than the other algorithms since it minimizes resource
fragmentation, which results in higher resource utilization.
Fig. 12 shows average unit profits when the rate of I/O-
intensive jobs is varied. From results without VM migra-
tion, we can see that I/O-intensive jobs lead to more benefit
due to the efficiency of the elastic cache. The normalized
unit profit with VM migration increases when the number
of non-I/O-intensive jobs increases. This is because VM
HAN ET AL.: CASHING IN ON THE CACHE IN THE CLOUD 1397
TABLE 4
Experimental Parameters
Fig. 11. Overall results.
migration only applies to non-I/O-intensive jobs, and this
leads to more migration chances and higher resource
utilization.
Fig. 13 shows normalized unit profits with various ratios of HP
jobs to BV jobs. The provider profit is noticeably higher with
CaaS than with No-CaaS when the rate of HP jobs is low. However,
a small loss to providers is incurred when the HP to BV ratio is
high (i.e., 2:1 and 1:0); this results from the unexpected LM
results (shown in Fig. 10). Owing to the inherent cost efficiency
of BV, profits obtained from these jobs are promising,
particularly when the rate of BV jobs is high. If a more efficient
LM-based cache is devised, increasing rates of HP jobs are also
likely to lead to high profits.
7 CONCLUSION
With the increasing popularity of infrastructure services
such as Amazon EC2 and Amazon RDS, low disk I/O
performance is one of the most significant problems. In this
paper, we have presented a CaaS model as a cost-efficient
cache solution to mitigate the disk I/O problem in IaaS. To
this end, we have built a prototype elastic cache system using
a remote-memory-based cache, which is pluggable and file-
system independent to support various configurations. This
elastic cache system together with the pricing model devised
in this study has validated the feasibility and practicality of
our CaaS model. Through extensive experiments, we have
confirmed that CaaS helps IaaS improve disk I/O perfor-
mance greatly. The performance improvement gained using
cache services clearly reduces the number of (active) physical
machines the provider uses, increases throughput, and in turn
increases profit. This
profitability improvement enables the provider to adjust its
pricing to attract more users.
ACKNOWLEDGMENTS
Professor Albert Zomaya would like to acknowledge the
Australian Research Council Grant DP A7572. Hyungsoo
Jung is the corresponding author for this paper.
REFERENCES
[1] L. Wang, J. Zhan, and W. Shi, “In Cloud, Can Scientific
Communities Benefit from the Economies of Scale?,” IEEE Trans.
Parallel and Distributed Systems, vol. 23, no. 2, pp. 296-303, Feb.
2012.
[2] M.D. Dahlin, R.Y. Wang, T.E. Anderson, and D.A. Patterson,
“Cooperative Caching: Using Remote Client Memory to Improve
File System Performance,” Proc. First USENIX Conf. Operating
Systems Design and Implementation (OSDI ’94), 1994.
[3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S.
Roselli, and R.Y. Wang, “Serverless Network File Systems,” ACM
Trans. Computer Systems, vol. 14, pp. 41-79, Feb. 1996.
[4] S. Jiang, K. Davis, and X. Zhang, “Coordinated Multilevel
Buffer Cache Management with Consistent Access Locality
Quantification,” IEEE Trans. Computers, vol. 56, no. 1, pp. 95-
108, Jan. 2007.
[5] H. Kim, H. Jo, and J. Lee, “XHive: Efficient Cooperative Caching
for Virtual Machines,” IEEE Trans. Computers, vol. 60, no. 1,
pp. 106-119, Jan. 2011.
[6] A. Menon, J.R. Santos, Y. Turner, G.J. Janakiraman, and W.
Zwaenepoel, “Diagnosing Performance Overheads in the Xen
Virtual Machine Environment,” Proc. First ACM/USENIX Int’l
Conf. Virtual Execution Environments (VEE ’05), 2005.
[7] L. Cherkasova and R. Gardner, “Measuring CPU Overhead for I/O
Processing in the Xen Virtual Machine Monitor,” Proc. Ann. Conf.
USENIX Ann. Technical Conf. (ATC ’05), 2005.
[8] J. Liu, W. Huang, B. Abali, and D.K. Panda, “High Performance
VMM-Bypass I/O in Virtual Machines,” Proc. Ann. Conf. USENIX
Ann. Technical Conf. (ATC ’06), 2006.
[9] A. Menon, A.L. Cox, and W. Zwaenepoel, “Optimizing Network
Virtualization in Xen,” Proc. Ann. Conf. USENIX Ann. Technical
Conf. (ATC ’06), 2006.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R.
Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of
Virtualization,” Proc. 19th ACM Symp. Operating Systems Principles
(SOSP ’03), 2003.
[11] X. Zhang and Y. Dong, “Optimizing Xen VMM Based on Intel
Virtualization Technology,” Proc. IEEE Int’l Conf. Internet Comput-
ing in Science and Eng. (ICICSE ’08), 2008.
[12] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A.L. Cox, and
W. Zwaenepoel, “Concurrent Direct Network Access for Virtual
Machine Monitors,” Proc. IEEE 13th Int’l Symp. High Performance
Computer Architecture (HPCA ’07), 2007.
[13] Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, and Y. Jiang,
“Towards High-Quality I/O Virtualization,” Proc. Israeli
Experimental Systems Conf. (SYSTOR ’09), 2009.
[14] J.R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, “Bridging the
Gap Between Software and Hardware Techniques for I/O
Virtualization,” Proc. Ann. Conf. USENIX Ann. Technical Conf.
(ATC ’08), 2008.
[15] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S.K. Reinhardt, and
T.F. Wenisch, “Disaggregated Memory for Expansion and Sharing
in Blade Servers,” Proc. 36th Ann. Int’l Symp. Computer Architecture
(ISCA ’09), 2009.
[16] M. Marazakis, K. Xinidis, V. Papaefstathiou, and A. Bilas,
“Efficient Remote Block-Level I/O over an RDMA-Capable
NIC,” Proc. 20th Ann. Int’l Conf. Supercomputing (ICS ’06), 2006.
[17] J. Creasey, “Hybrid Hard Drives with Non-Volatile Flash and
Longhorn,” Proc. Windows Hardware Eng. Conf. (WinHEC), 2005.
[18] R. Harris, “Hybrid Drives: Not So Fast,” ZDNet, CBS Interactive,
2007.
[19] E.R. Reid, “Drupal Performance Improvement via SSD Technol-
ogy,” technical report, Sun Microsystems, Inc., 2009.
Fig. 12. Results with varying rates of I/O-intensive jobs.
Fig. 13. Results with varying ratios of HP jobs and BV jobs.
[20] S.-W. Lee and B. Moon, “Design of Flash-Based DBMS: An In-
Page Logging Approach,” Proc. ACM SIGMOD Int’l Conf. Manage-
ment of Data (SIGMOD ’07), 2007.
[21] T. Makatos, Y. Klonatos, M. Marazakis, M.D. Flouris, and A. Bilas,
“Using Transparent Compression to Improve SSD-Based I/O
Caches,” Proc. Fifth European Conf. Computer Systems (EuroSys ’10),
2010.
[22] J.-U. Kang, J.-S. Kim, C. Park, H. Park, and J. Lee, “A Multi-
Channel Architecture for High-Performance NAND Flash-Based
Storage System,” J. Systems Architecture, vol. 53, pp. 644-658, Sept.
2007.
[23] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi,
“A High Performance Controller for NAND Flash-Based Solid
State Disk (NSSD),” Proc. IEEE Non-Volatile Semiconductor Memory
Workshop (NVSMW ’06), 2006.
[24] S. Kang, S. Park, H. Jung, H. Shim, and J. Cha, “Performance
Trade-Offs in Using NVRAM Write Buffer for Flash Memory-
Based Storage Devices,” IEEE Trans. Computers, vol. 58, no. 6,
pp. 744-758, June 2009.
[25] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich,
D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum,
S.M. Rumble, E. Stratmann, and R. Stutsman, “The Case for
RAMClouds: Scalable High-Performance Storage Entirely in
DRAM,” ACM SIGOPS Operating Systems Rev., vol. 43, pp. 92-
105, Jan. 2010.
[26] R.P. Goldberg and R. Hassinger, “The Double Paging Anomaly,”
Proc. Int’l Computer Conf. and Exposition (AFIPS ’74), 1974.
[27] C.A. Waldspurger, “Memory Resource Management in VMware
ESX Server,” Proc. Fifth USENIX Conf. Operating Systems Design
and Implementation (OSDI ’02), 2002.
[28] B. Urgaonkar, P.J. Shenoy, and T. Roscoe, “Resource Overbooking
and Application Profiling in Shared Hosting Platforms,” Proc.
Fifth USENIX Conf. Operating Systems Design and Implementation
(OSDI ’02), 2002.
[29] A.V. Do, J. Chen, C. Wang, Y.C. Lee, A.Y. Zomaya, and B.B. Zhou,
“Profiling Applications for Virtual Machine Placement in
Clouds,” Proc. IEEE Int’l Conf. Cloud Computing, 2011.
[30] S. Chen, A. Ailamaki, M. Athanassoulis, P.B. Gibbons, R. Johnson,
I. Pandis, and R. Stoica, “TPC-E vs. TPC-C: Characterizing the
New TPC-E Benchmark via an I/O Comparison Study,” ACM
SIGMOD Record, vol. 39, pp. 5-10, Feb. 2011.
[31] H. Liu, H. Jin, X. Liao, C. Yu, and C.-Z. Xu, “Live Virtual Machine
Migration via Asynchronous Replication and State Synchroniza-
tion,” IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 12,
pp. 1986-1999, Dec. 2011.
[32] G. Jung, M. Hiltunen, K. Joshi, R. Schlichting, and C. Pu, “Mistral:
Dynamically Managing Power, Performance, and Adaptation
Cost in Cloud Infrastructures,” Proc. IEEE 30th Int’l Conf.
Distributed Computing Systems (ICDCS ’10), pp. 62-73, 2010.
Hyuck Han received the BS, MS, and PhD
degrees in computer science and engineering
from Seoul National University, Korea, in 2003,
2006, and 2011, respectively. Currently, he is a
postdoctoral researcher at Seoul National Uni-
versity. His research interests are distributed
computing systems and algorithms.
Young Choon Lee received the BSc (hons)
degree in 2003 and the PhD degree from the
School of Information Technologies at the
University of Sydney in 2008. He is currently a
postdoctoral research fellow in the Centre for
Distributed and High Performance Computing,
School of Information Technologies. His current
research interests include scheduling and re-
source allocation for distributed computing sys-
tems, nature-inspired techniques, and parallel
and distributed algorithms. He is a member of the IEEE and the IEEE
Computer Society.
Woong Shin received the BS degree in
computer science from Korea University, Seoul,
in 2003. He is currently working toward the MS
degree from Seoul National University. He
worked for Samsung Networks from 2003 to
2006 and TmaxSoft from 2006 to 2009 as a
software engineer. His research interests are in
system performance study, virtualization, sto-
rage systems, and cloud computing.
Hyungsoo Jung received the BS degree in
mechanical engineering from Korea University,
Seoul, in 2002, and the MS and PhD degrees in
computer science from Seoul National Univer-
sity, Korea in 2004 and 2009, respectively. He is
currently a postdoctoral research associate at
the University of Sydney, Sydney, Australia. His
research interests are in the areas of distributed
systems, database systems, and transaction
processing.
Heon Y. Yeom received the BS degree in
computer science from Seoul National Univer-
sity in 1984 and the MS and PhD degrees in
computer science from Texas A&M University in
1986 and 1992, respectively. He is a professor
with the School of Computer Science and
Engineering, Seoul National University. From
1986 to 1990, he worked with Texas Transpor-
tation Institute as a Systems Analyst, and from
1992 to 1993, he was with Samsung Data
Systems as a research scientist. He joined the Department of Computer
Science, Seoul National University in 1993, where he currently teaches
and researches on distributed systems, multimedia systems and
transaction processing. He is a member of the IEEE.
Albert Y. Zomaya is currently the chair professor
of High Performance Computing & Networking
and Australian Research Council Professorial
fellow in the School of Information Technologies,
The University of Sydney. He is also the director
of the Centre for Distributed and High Perfor-
mance Computing which was established in late
2009. He is the author/co-author of seven books,
more than 400 papers, and the editor of nine
books and 11 conference proceedings. He is the
editor-in-chief of the IEEE Transactions on Computers and serves as an
associate editor for 19 leading journals, such as, the IEEE Transactions
on Parallel and Distributed Systems and Journal of Parallel and
Distributed Computing. He is the recipient of the Meritorious Service
Award (in 2000) and the Golden Core Recognition (in 2006), both from
the IEEE Computer Society. Also, he received the IEEE Technical
Committee on Parallel Processing Outstanding Service Award and the
IEEE Technical Committee on Scalable Computing Medal for Excellence
in Scalable Computing, both in 2011. He is a chartered engineer, a fellow
of the AAAS, the IEEE, the IET (United Kingdom), and a distinguished
engineer of the ACM.
However, modification of application code is always necessary for application level cache. A file system level implementation can also provide many chances for performance improvements, such as buffering and pre- fetching. However, it forces users to use a specific file system with the RM-based cache. In contrast, although a block-device level implementation has fewer chances of performance improvements than the application or file system level counterpart, it does not depend on applica- tions or file systems to take benefits from the underlying block-level cache implementation. RDMA versus TCP/IP. Despite the popularity of TCP/ IP, its use in high performance clusters has some restrictions due to its higher protocol processing overhead and less throughput than other cutting edge interconnects, such as Myrinet and Infiniband. Since disk cache in our system requires a low latency communication channel, we choose a RDMA-enabled interface to guarantee fast and uniform access time to RM space. Dedicated-server-based cache versus cooperative cache. Remote memory from dedicated servers might demand 1390 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 8, AUGUST 2012 Fig. 1. Overview of CaaS. 1. As Ousterhout et al. [25] pointed out, the low latency of a storage device is very pivotal in designing storage systems. 2. Available at http://www. memcached.org.
  • 5. more servers and related resources, such as rack and power, during the operation. However, the total number of machines for data processing applications is not greater than that of machines without RM-based cache systems. As an alternative way, we could implement remote memory based on a cooperative cache, which uses participants’ local memory as remote memory. This might help saving the number of machines used and the energy consumed, but the efficient management of cooperative cache is a daunting task in large data centers. We are now back to the principle that local memory should be used for a guest OS or an application on virtual machines, rather than for remote memory. We consider that this design rationale is practically less proble- matic and better choice for implementing real systems. Reliability. One of most important requirements for the elastic cache is failure resilience. Since we implement the elastic cache at the block device level, the cache system is designed to support a RAID-style fault-tolerant mechanism. Based on a RAID-like policy, the elastic cache can detect any failure of cache servers and recovers automatically from the failure (a single cache server failure). In summary, we suggest that the CaaS model can be better realized with an RM-based elastic cache system at the block device level. 4.2 System Architecture In this section, we discuss the important components of the elastic cache. The elastic cache system is conceptually composed of two components: a VM and a cache server. A VM demands RM for use as a disk cache. We build an RM-based cache as a block device and implement a new block device driver (RM-Cache device). In the RM-Cache device, RM regions are viewed as byte-addressable space. The block address of each block I/O request is translated into an offset of each region, and all read/write requests are also transformed into RDMA read/write operations. We use the device-mapper module of the Linux operating system (i.e., DM-Cache3 ) to integrate both the RM-Cache device and a general block device (HDD) into a single block device. This forms a new virtual block device, which makes our cache pluggable and file-system independent. In order to deal with resource allocation for remote memory requested from each VM, a memory server offers a memory pool as a cache pool. When a VM needs cache from the memory pool, the memory pool provides available memory. To this end, a memory server in the pool exports a portion of its physical memory4 to VMs, and a server can have several chunks. A normal server process creates 512 MB memory space (chunk) via the malloc function, and it exports a newly created chunk to all VMs, along with Chunk_Lock and Owner regions to guarantee exclusive access to the chunk. After a memory server process exchanges RDMA specific information (e.g., rkey and memory address for corresponding chunks) with a VM that demands RM, the exported memory of each machine in the pool can be viewed as actual cache. When a VM wants to use RM, a VM should first mark its ownership on assigned chunks, then it can make use of the chunk as cache. An example of layered architecture of a VM and a memory pool, both of which are connected via the RDMA interface, is concretely described in Fig. 2. When multiple VMs try to mark their ownership on the same chunk simultaneously, the access conflict can be resolved by a safe and atomic chunk allocation method, which is based on the CompareAndSwap operation supported by Infiniband. 
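A minimal user-space sketch of this chunk acquisition step is shown below, using the libibverbs atomic compare-and-swap work request over an already-connected reliable (RC) queue pair. The function name, the 8-byte local result buffer, and the convention that a zero Chunk_Lock value means "unowned" are assumptions for illustration, not the authors' actual implementation.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Try to claim a chunk by atomically swapping our VM id into the remote
 * Chunk_Lock word; returns 1 on success, 0 if another VM holds the chunk. */
static int try_acquire_chunk(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr,        /* registered 8-byte local buffer */
                             uint64_t chunk_lock_addr, /* remote Chunk_Lock address      */
                             uint32_t rkey, uint64_t my_vm_id)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,  /* old remote value is returned here */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = chunk_lock_addr;
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 0;          /* expected value: 0 means "unowned" */
    wr.wr.atomic.swap        = my_vm_id;   /* new value: this VM's id           */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return 0;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)   /* busy-wait for the completion */
        ;
    if (wc.status != IBV_WC_SUCCESS)
        return 0;

    /* The swap took effect only if the previous value was the expected 0. */
    return *(uint64_t *)mr->addr == 0;
}

After a successful swap, the VM can record its id in the Owner region with an ordinary RDMA write; a failed compare simply means another VM won the race, and the allocator moves on to the next free chunk.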
The CompareAndSwap operation of InfiniBand atomically compares the 64-bit value stored at the remote memory to a given value and replaces the value at the remote memory to a new value only if they are the same. By the CompareAndSwap operation, only one node can acquire the Chunk_Lock lock and it can safely mark its ownership to the chunk by setting the Owner variable to consumer’s id. Double paging in RDMA. The double paging problem was first addressed in [26], and techniques such as ballooning [27] are proposed to avoid the problem. Since the problem is a bit technical but very critical in realizing CaaS in the cloud platform, we describe what implementation difficulty it causes and how we overcome the obstacle. Goldberg and Hassinger [26] define levels of memory as follows: . Level 0 memory: memory of real machine . Level 1 memory: memory of VM . Level 2 memory: virtual memory of VM. In VM environments, the level 2 (level 1) memory is mapped into the level 1 (level 0) memory, and this is called double paging. For RDMA communication, a memory region (level 0 memory) should be registered to the RDMA device (i.e., InfiniBand device). Generally, kernel-level functions mapping virtual to physical addresses (i.e., virt_to_phys) are used for memory registration to the RDMA device. In VMs, the return addresses of functions in a guest OS are in level 1 memory. Since the RDMA device cannot understand the context of level 1 memory addresses, direct registration of level 1 memory space to RDMA leads to malfunction of RDMA communication. To avoid this type of double paging anomaly in RDMA communication, we exploit hardware IOMMUs to get DMA-able memory (level 0 memory). IOMMUs are hard- ware devices that manage device DMA addresses. To virtualize IOMMUs, VMMs like Xen provide software IOMMUs. Many hardware vendors also redesign IOMMUs so that they are isolated between multiple operating systems with direct device access. Thus, we use kernel functions related with IOMMUs to get level 0 memory addresses. The RM-Cache device allocates level 2 memory space through kernel level memory allocation functions in the VM. Then, it remaps the allocated memory to DMA-able memory space through IOMMU. The mapped address of HAN ET AL.: CASHING IN ON THE CACHE IN THE CLOUD 1391 Fig. 2. Elastic cache structure and double paging problem. 3. Available at http://visa.cis.fiu.edu/ming/dmcache/index.html. 4. A basic unit is called chunk (512 MB).
  • 6. the DMA-able memory becomes level 0 memory that can now be registered correctly by RDMA devices. Fig. 2 describes all these mechanisms in detail. 5 SERVICE MODEL In this section, we first describe performance characteristics of different cache alternatives and design two CaaS types. Then, we present a pricing model that effectively captures the tradeoff between performance and cost (profit). 5.1 Modeling Cache Services I/O-intensive applications can be characterized primarily by data volume, access pattern, and access type; i.e., file size, random/sequential and read/write, respectively. The identification of these characteristics is critical in choosing the most appropriate cache medium and proper size since the performance of different storage media (e.g., DRAMs, SSDs, and HDDs) varies depending on one or more of those characteristics. For example, the performance bottleneck sourced from frequent disk accesses may be significantly improved using SSDs as cache. However, if those accesses are mostly sequential write operations the performance with SSDs might only be marginally improved or even made worse. Although the use of LM as cache delivers incomparably better I/O performance than other cache alternatives (e.g., RM),5 such a use is limited by several issues including capacity and the utilization of host machines. With the consideration of these facts, we have designed two CaaS types as the following: . High performance—makes use of LM as cache, and thus, its service capacity is bounded by the max- imum amount of LM. . Best value (BV)—exploits RM as cache practically without a limit. In our CaaS model, it is assumed that a user, who sends a request with a CaaS option (HP or BV), also accompanies an application profile including data volume, data access pattern, and data access type. It can be argued that these pieces of application specific information might not be readily available particularly for average users, and some applications behave unpredictably. In this paper, we primarily target the scenario in which users repeatedly and/or regularly run their applications in clouds, and they are aware of their application characteristics either by analyzing business logic of their applications or by obtaining such information using system tools (e.g., sysstat6 ) and/or application profiling [28], [29]. When a user is unable to identify/determine he/she simply rents default IaaS instances without any cache service option since CaaS is an optional service to IaaS. The service granularity (cache size) in our CaaS model is set to a certain size (512 MB/0.5 GB). In this study, we adopt three default IaaS types: small, medium, and large with flat rates of fs, fm, and fl, respectively. 5.2 Pricing A pricing model that explicitly takes into account various elastic cache options is essential for effectively capturing the tradeoff between (I/O) performance and (operational) cost. With HP, it is rather common to have many “awkward” memory fragmentations (more generally, resource fragmen- tations) in the sense that physical machines may not be used for incoming service requests due to lack of memory. For example, for a physical machine with four processor cores and the maximum LM of 16 GB a request with 13 GB of HP cache requirement on top of a small IaaS instance (which uses 1 core) occupies the majority of LM leaving only 3 GB available. Due to such fragmentations, an extra cost is imposed on the HP cache option as a fragmentation penalty (or performance penalty). 
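The two CaaS types and the accompanying application profile introduced in Section 5.1 can be pictured as a small request record that a user submits along with the base IaaS selection. The sketch below is illustrative only (the paper defines no concrete request format), and all field names are assumptions.

#include <stdint.h>

enum iaas_type  { IAAS_SMALL, IAAS_MEDIUM, IAAS_LARGE };
enum caas_type  { CAAS_NONE, CAAS_HP, CAAS_BV };     /* HP = local memory, BV = remote memory */
enum access_pat { ACC_SEQUENTIAL, ACC_RANDOM };
enum access_typ { ACC_READ_MOSTLY, ACC_UPDATE_MOSTLY, ACC_MIXED };

/* One service request: a base IaaS instance plus an optional cache service.
 * cache_units counts 0.5 GB granules, matching the service granularity above. */
struct caas_request {
    enum iaas_type  base;           /* e.g., IAAS_SMALL                     */
    enum caas_type  cache;          /* CAAS_NONE if no cache option chosen  */
    uint32_t        cache_units;    /* requested cache size in 0.5 GB units */
    /* application profile supplied by the user */
    uint64_t        data_volume_mb;
    enum access_pat pattern;
    enum access_typ io_type;
};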
The average number of services (VMs) per physical machine with the HP cache option (or simply HP services) is defined as

HP_{services} = (LM_{max} / m_{HP}) \cdot a_{HP},    (1)

where LM_{max} is the maximum local memory available and m_{HP} is the average amount of local memory for HP services. The amount of LM cache requested for HP is assumed to follow a uniform distribution. The average number of services per physical machine without HP is defined as

nonHP_{services} = \sum_{j=0}^{st} (LM_{max} / m_j) \cdot a_j,    (2)

where st is the number of IaaS types (i.e., three in this study), m_j is the memory capacity of a service type j (s_j), and a_j is the rate of services with type j. Then, the average numbers of services (service count, or sc) per physical machine with and without HP requests are defined as

sc_{HP} = HP_{services} + nonHP_{services}    (3)

sc_{noHP} = nonHP_{services} / (1 - a_{HP}),    (4)

where a_{HP} is the rate of HP services. Note that the sum of all a_j is 1 - a_{HP}. We assume that the service provider has a means to determine request rates of service types, including the rate of I/O-intensive applications (a_{IO}) and further the rates of those with HP and BV (a_{HP} and a_{BV}), respectively. Since services with BV use a separate RM server, they are treated the same as default IaaS types (small, medium, and large). In the CaaS model, the difference between sc_{noHP} and sc_{HP} can be seen as consolidation improvement (CI). For a given IaaS type s_i, the rates (unit price) for HP and BV are then defined as

c_{HP,i} = f_i \cdot pi_{HP,i} + (f \cdot CI) / sc_{HP}    (5)

c_{BV,i} = f_i \cdot pi_{BV,i},    (6)

where pi_{HP,i} and pi_{BV,i} are the average performance improvement per unit of LM and RM cache
5. Surprisingly, the performance of LM cache is only marginally better than RM in most of our experiments. The main cause of this unexpected result is believed to be the behavior of the "pdflush" daemon in Linux, i.e., frequently writing back dirty data to disk.
6. Available at http://sebastien.godard.pagesperso-orange.fr.
  • 7. increase (e.g., 0.5 GB), respectively, and f is the average service rate; these values might be calculated based on application profiles (empirical data). With BV, the rate is solely dependent on piBV , and thus, the total price the user pays for a given service request is expected to be equivalent to that without cache on average as shown in Fig. 3. We acknowledge that the use of average performance improvement resulting in the uniformity in service rates (cHP;i and cBV ;i) might not be accurate; however, this is only indicative. In the actual experiments, charges for services with cache option have been accurately calculated in the way that for the price for a particular service (application) remains the same regardless of use of cache option and type of cache option. The cost efficiency characteristic of BV can justify the use of average of varying piBV values, the different values being due to application characteristics (e.g., data access pattern and type) and cache size. Alternatively, different average performance improve- ment values (i.e., piHP and piBV ) can be used depending on application characteristics (e.g., data access pattern and type) profiled and specified by the user/provider. Further, rates (pricing) may be mediated between the user and the provider through service level agreement negotiation. It might be desirable that the performance gain that users experience with BV is proportional to that with HP. In other words, their performance gap may be comparable to the extra rate imposed on HP. The performance of a BV service might not be easily guaranteed or accurately predicted since that performance is heavily dependent on 1) the type and amount of additional memory, 2) data access pattern and type, and 3) the interplay of 1 and 2. 6 EVALUATION In this section, we evaluate CaaS from the viewpoints of both users and providers. To this end, we first measure the performance benefit of our elastic cache system—in terms of performance (e.g., transactions per minute), cache hit ratio and reliability. The actual system level modification for our system is not possible with the existing cloud providers like Amazon and Microsoft. We can neither dedicate physical servers of the cloud providers to RM servers nor assign SSDs and RDMA devices to physical servers. Owing to these issues, we could not test our systems on real cloud services but we built an RDMA- and SSD-enabled cloud infrastructure (Fig. 4) to evaluate our systems. We then simulate a large-scale cloud environment with more realistic settings for resources and user requests. This simulation study enables us to examine the cost efficiency of CaaS. While experimental results in Section 6.1 demonstrate the feasibility of our elastic cache system, those in Section 6.2 confirm the practicality of CaaS (or applic- ability of CaaS to the cloud). 6.1 Experimental Validation: Elastic Cache System We validate the proof-of-concept elastic cache system with two well-known benchmark suites: a database benchmark program (TPC-C) and a file system benchmark program (Postmark). TPC-C, which simulates OLTP activities, is composed of read only and update transactions. The TPC- C benchmark is update intensive with a 1.9:1 I/O read to write ratio, and it has random I/O access patterns [30]. Postmark, which is designed to evaluate the performance of e-mail servers, is performed in three phases: file creation, transaction execution, and file deletion. Operations and files in the transaction execution phase are randomly chosen. 
We choose them because these two benchmarks have all the important characteristics of modern data processing applications. Intensive experiments with these applications show that the prototype elastic cache architecture is a suitable model as an efficient caching system for existing IaaS models. Because of the attractive performance characteristics of SSDs, the usefulness of our system might be questionable compared with an SSD-based cache system. To answer this, we compared our elastic cache system with an SSD-based system. 6.1.1 Experimental Environments Throughout this paper, we use the experimental environment shown in Fig. 4. For performance evaluation, we used a 7-node cluster, each node of which is equipped with an Intel(R) Core(TM)2 Quad CPU 2.83 GHz and 8 GB RAM. All nodes are connected via both a switched 1 Gbps Ethernet and 10 Gbps Infiniband. We used Infinihost IIILx HCA cards from Mellanox for the Infiniband connection. A memory server runs Ubuntu 8.04 with the Linux 2.6.24 kernel, and exports 1 GB memory. One of the cluster nodes instantiates a VM using Xen 3.4.0. The VM with Linux 2.6.32 has 2 GB memory and 1 vCPU, and it runs the benchmark programs. The cache replacement policy is Least Recently Used (LRU).
Fig. 3. Cost efficiency of CaaS. n_c \cdot C_{HP} and n_c \cdot C_{BV} are extra costs charged for HP and BV CaaS types, respectively, where n_c is the number of cache units (e.g., 0.5 GB per cache unit). t_{HP}, t_{BV}, and t_{no-CaaS} are performance delivered with the two CaaS types and without CaaS, respectively. Then, for a given IaaS type s_i, we have the following: (f_i + C_{HP,i} \cdot n_c) \cdot t_{HP,i} \approx (f_i + C_{BV,i} \cdot n_c) \cdot t_{BV,i} = f_i \cdot t_{no-CaaS,i}.
Fig. 4. Experimental environment.
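To make the cost-equivalence relation of Fig. 3 and the rates of equations (5) and (6) concrete, the following small calculation applies them to illustrative numbers; the workload mix, flat rates, and improvement factors below are assumptions, not values reported in this paper.

#include <stdio.h>

int main(void)
{
    /* Illustrative inputs (not the paper's measured values). */
    double LM_max = 16.0;                    /* GB of local memory per machine        */
    double m[3]   = {1.0, 4.0, 8.0};         /* memory of small/medium/large (GB)     */
    double a[3]   = {0.4, 0.2, 0.1};         /* request rates of the default types    */
    double a_HP   = 0.3;                     /* rate of HP requests (all rates sum to 1) */
    double m_HP   = 6.0;                     /* avg LM (instance + cache) of HP, GB   */
    double f[3]   = {0.1, 0.2, 0.4};         /* flat rates f_i in $/hr                */
    double f_avg  = 0.2, pi_HP = 1.6, pi_BV = 1.4;  /* avg rate and per-unit gains    */

    /* Equations (1)-(4): services per physical machine with and without HP. */
    double hp_services = (LM_max / m_HP) * a_HP;
    double non_hp = 0.0;
    for (int j = 0; j < 3; j++)
        non_hp += (LM_max / m[j]) * a[j];
    double sc_HP   = hp_services + non_hp;
    double sc_noHP = non_hp / (1.0 - a_HP);
    double CI      = sc_noHP - sc_HP;        /* difference taken as consolidation improvement */

    /* Equations (5)-(6): unit cache prices for a medium (i = 1) base instance. */
    double c_HP = f[1] * pi_HP + (f_avg * CI) / sc_HP;
    double c_BV = f[1] * pi_BV;

    printf("sc_HP=%.2f sc_noHP=%.2f CI=%.2f c_HP=%.3f c_BV=%.3f $/hr per 0.5 GB\n",
           sc_HP, sc_noHP, CI, c_HP, c_BV);
    return 0;
}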
We configured the VM to use a 16 GB virtual disk combined with 4 GB elastic cache (i.e., RM) via the RM-Cache device. The ext3 file system was used for benchmark tests. To assess the efficiency of our system, we compared our system to a virtual disk with an SSD-based cache device and a virtual disk without any cache space. For the SSD-based cache device, we used one Intel X25-M SSD device. Throughout this section, we denote "virtual disk with the RM-based cache," "virtual disk with the SSD-based cache," and "virtual disk without any cache" as RM-cache, SSD-cache, and No-cache, respectively. 6.1.2 TPC-C Results We first evaluate the Online Transaction Processing (OLTP) performance on PostgreSQL, a popular open-source DBMS. The DBMS server runs inside the VM, and the RM-Cache device is used for the disk device assigned to databases. To measure the OLTP performance on PostgreSQL, we used BenchmarkSQL,7 which is a JDBC benchmark that closely resembles the TPC-C standard for OLTP. We measured the transaction rate (transactions per minute, tpmC) with varying numbers of clients and warehouses. It is worth noting that "warehouse" or "warehouses" will be abbreviated as WH. Fig. 5 shows the measured tpmC and the database size. We observe the highest tpmC at the smallest WH instance in the RM-cache environment. Also, as the number of WHs and clients increases, the tpmC value decreases in all device configurations. Measured tpmC values of 60 WH are between 270 and 400 in the No-cache environment. The performance of the SSD-cache environment is better than that without cache by a factor of 8, and the RM-based cache outperforms the SSD-based cache by a factor of 1.5 due to superior bandwidth and latency.8 As shown in Table 1, the PostgreSQL DBMS has a strong locality in its data access pattern when processing the TPC-C-like workload, and SSD-based and RM-based cache devices exploit this locality. Actually, frequently accessed data, such as indices, is always in the cache device, while less frequently accessed data, such as unpopular records, is located on the virtual disk. Results of 90 and 120 WH cases are similar to those of the 60 WH case in that the performance of the RM-cache case is always the best. 6.1.3 Postmark Results Postmark, which is designed to evaluate the performance of file servers for applications, such as e-mail, netnews, and web-based commerce, is performed in three phases: file creation, transaction execution, and file deletion. In this experiment, the number of transactions and subdirectories are 100,000 and 100, respectively. Three experiments are performed by increasing the number of files. Fig. 6 and Table 2 show the results of the Postmark benchmark when 1) a RM-based device is used as a cache of a virtual disk, 2) an SSD device is used, and 3) no cache device is used. The total size of files for each experiment (as the number of files increases from 200,000 to 800,000) is 3.4, 6.8, and 13.4 GB, and this leads to a lower cache hit ratio. From the figure, we can see that both cache-enabled cases outperform No-cache cases. Because Postmark is an I/O-intensive benchmark, I/O operations involve many cache operations. Thus, cache devices lead to better I/O performance of virtual resources. With 200,000, 400,000, and 800,000 files, RM-cache cases show (9, 5.5, and 2.5 times) better performance than No-cache cases. RM-cache cases also have up to 130 percent better performance than SSD-cache cases. 6.1.4 Other Experiments Effects of cache size. Fig.
7 shows the results of the TPC-C benchmark when the size of RM is varied. A large size of cache increases the performance of TPC-C, due to the high probability that a data block will reside in the cache. When a cache of 1 GB RM is used, the performance with a cache is 1394 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 8, AUGUST 2012 Fig. 5. Results of TPC-C Benchmark (12 clients). TABLE 1 Database Size and Cache Hit Ratio of the TPC-C Benchmark Fig. 6. Results of postmark benchmark (seconds). TABLE 2 Cache Hit Ratio of Postmark 7. Available at http://pgfoundry.org/projects/benchmarksql. 8. For our evaluation, we used a new SSD. Thus, the SSD device used in our experiments had the best condition. It is well known that if the SSD device is used for a long period, the performance is degraded greatly. Fig. 7. Effects of cache size (RM-cache, TPC-C, 90 WH, and 12 clients).
2.5 times better than that without any cache. A cache of 4 GB (8 GB) RM shows 2.4 (2.6) times better performance than that of 1 GB RM. From the observation, we can safely conclude that even a small or a moderate size of RM-based cache can accelerate data processing applications on existing cloud services, and users can choose a suitable cache size for their performance criteria. Effects of file systems. Fig. 8 shows TPC-C results with various file systems. For this experiment, we used the ext2, ext3, and reiserfs file systems. In all cases, we can see that RM-cache cases show better performance than No-cache cases. The ext3 and reiserfs file systems are journaling file systems; updates to files are first written as predefined compact entries in the journal region, and then the updates are written to their destination on the disk. This leads to smaller performance benefits in journaling file systems. In fact, the journal data need not be cached since they are used only for recovery from a file system crash. While the ext3 file system journals both metadata and data, the reiserfs file system journals only metadata. This leads to better performance in the reiserfs case with cache. On the contrary, since the ext2 file system is not a journaling file system, the ext2 case with cache shows the best performance among the three. In the ext2 file system, metablocks, such as superblocks and indirect blocks, should be accessed before actual data are read. Thus, when such metablocks are located in the cache, the performance gain of the elastic cache is maximized. From this experiment, we can see that the elastic cache provided by our cache system is file system independent and greatly helpful for file system performance. 6.1.5 Discussion From our experimental results, we can draw the following lessons. First, a small or moderate size of RM-based cache can improve virtual disk I/O performance. Thus, if users set an appropriate cache size, it can lead to cost-effective performance. Second, our system can safely recover from a single machine crash although the performance gradually decreases during the recovery; this enhances the reliability. Third, our system improves virtual disk I/O performance irrespective of file systems and supports various configurations of data processing applications. It is well known that main memory databases (MMDBs) outperform disk-based databases (DDBs) due to the locality of data in local main memory. However, since an MMDB typically requires a large amount of main memory, it costs a great deal. It may not be possible to provide adequate main memory with virtual machines. From the previous section, we can see that a DDB with RM-cache leads to (up to 7-8 times) better performance than that without any cache for TPC-C, making it a real alternative to an MMDB. To verify this, we compare MMDBs to DDBs with RM-cache and an RM-based block device. In the experiment, we use MySQL Cluster and MySQL with InnoDB as the MMDB and DDB, respectively. The core components of MySQL Cluster are mysqld, ndbd, and ndb_mgmd. mysqld is the process that allows external clients to access the data in the cluster. ndbd stores data in the memory and supports both replication and fragmentation. ndb_mgmd manages all processes of MySQL Cluster. An RM-based block device appears as a mounted file system, but it is stored in RM instead of a persistent storage device. Table 3 shows TPC-C results obtained using three cache alternatives.
The results seem somewhat controversial in that the performance of MMDB is not as good as what is normally expected. The main reason for this is due to the inherent architecture of MySQL cluster. An MMDB stores all data (including records and indices for relational algebraic operations) to the address space of ndbd processes, and this requires coordination among MySQL daemons (mysqld and ndbd). Thus, it usually exchanges many control messages. When exchanging these messages between mysqld and ndbd, MySQL is designed to use TCP/IP for all communications between these processes. This incurs significant overhead especially when transaction throughput reaches a certain threshold level that inevitably saturates the performance. However, DDBs do not incur IPC overhead since the InnoDB storage engine is directly embedded to mysqld. The results in Table 3 identify DDB with RM-cache outperforms MMDB. In addition, MySQL cluster supports only very small sized temporary space, and queries that require temporary space resulting in large overhead when proces- sing relational algebraic operations. These create relatively unfavorable performance to MMDB. 6.2 Experiments: Cost Efficiency of CaaS In this section, the cost efficiency of CaaS is evaluated. Specifically, extensive experiments with the elastic cache system are performed under a variety of workload char- acteristics to extract performance metrics, which are to be used as important parameters for large-scale simulations. 6.2.1 Preliminary Experiments The performance metric of I/O-intensive applications is obtained to measure the average performance improvement of LM and non-LM cache (i.e., piHP and piBV ). To this end, we slightly modified Postmark so that all I/O operations are either read or update. The modified Postmark is used to profile I/O-intensive applications by varying the ratio of read to update. A set of performance profiles is used as parameters for our simulation presented in Section 6.2.2. HAN ET AL.: CASHING IN ON THE CACHE IN THE CLOUD 1395 Fig. 8. Effects of file systems (TPC-C, 90 WH, and 12 clients). TABLE 3 Comparison between MMDB and DDB with RM-Cache and RM-Based Block Device (TPC-C, 40 WH, and 12 Clients)
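The skewed operation mix used in this profiling step can be expressed as a trivial driver loop. The sketch below only illustrates varying the read-to-update ratio and is not the authors' actual Postmark modification; the operation callbacks are placeholders.

#include <stdio.h>
#include <stdlib.h>

/* Issue n_ops operations where a fraction read_ratio are reads and the rest
 * are updates; e.g., read_ratio = 0.8 reproduces the 8:2 mix. */
static void run_mix(unsigned n_ops, double read_ratio,
                    void (*do_read)(void), void (*do_update)(void))
{
    for (unsigned i = 0; i < n_ops; i++) {
        if ((double)rand() / RAND_MAX < read_ratio)
            do_read();
        else
            do_update();
    }
}

static void do_read_op(void)   { /* read a randomly chosen file        */ }
static void do_update_op(void) { /* rewrite part of a randomly chosen file */ }

int main(void)
{
    srand(42);
    run_mix(100000, 0.8, do_read_op, do_update_op);  /* 100,000 ops, 8:2 mix */
    printf("done\n");
    return 0;
}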
  • 10. The experiment is conducted on the same cluster that was used in the previous performance experiment (or Section 6.1). To obtain as many profiles as possible, we increase the virtual disk space from 16 to 32 GB. We vary the data set size from 3 to 30 GB (3, 7, 10, 15, and 30) and (RM/SSD) cache size from 512 MB to 16 GB. In addition, six different read to update ratios (10:0, 8:2, 6:4, 4:6, 2:8, and 0:10) are used to represent various I/O access patterns. We set the parameters of Postmark, such as min/max sizes of a file and the number of subdirectories, to 1.5 KB, 90 KB and 100, respectively. Fig. 9 shows the measured elapsed time of executing 100,000 transactions only for RM-Cache with 3 and 10 GB data sets because other results from using SSD and with other data sets (i.e., 7, 15, and 30 GB) reveal similar performance characteristics. As the cache size increases, the performance gain increases as well. Most of the cases have benefited from the increased cache size, except for the case when the data set is small. As shown in Fig. 9a, in some cases hard disk outperforms the elastic cache since 3 GB data almost fits into the local memory (2 GB); most of the data can be loaded and served from the page cache. The use of additional cache devices like the elastic cache, which is inherently slower than the page cache, might cause more overhead than we expect in certain workload configurations. Increasing the rate of update operations also affects the performance. As we increase the rate of updates, the performance of the elastic cache increases when data sets are large (Fig. 9b) while the performance degrades when data sets are small (Fig. 9a). Since the coherency protocol of the elastic cache is the write-back protocol, the cache operates as if it is a write buffer for the updates, and this gives performance benefits to update operations. Increase in the cache size further improves the throughput of the update intensive workloads. However, with small data sets, the page cache is better for read operations. While most read operations can be served from the page cache, updates suffer from dirty page replacement traffic with relatively high latency of the cache device and the hard disk. Apparently, the throughput decreases as the size of data grows. Specifically, this can be expected because the advantage of using LM no longer exists. In general, it is the result of higher latency when accessing larger data sets. To measure the performance gain of HP jobs, we additionally give the same amount of extra memory to make fair experiments because BV jobs require that amount of cache space on SSD or the elastic cache. We configure experiments accordingly so that the extra memory is used as the page cache of Linux, which is user’s natural choice. Fig. 10 shows the measured elapsed time for executing 100,000 transactions. From the figure we see somewhat unexpected (or controversial) results that the performance gain of LM depends strongly on the read to update ratio rather than the amount of page cache; in other words, more update operations make such an unexpected performance pattern conspicuous. This is because the “pdflush” daemon in Linux writes dirty data to disk if data reside in memory until either 1) they are more than 30 seconds old, or 2) the dirty pages have consumed more than 10 percent of the active, working memory. 
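One plausible way to turn such elapsed-time profiles into the per-unit improvement parameters pi_HP and pi_BV of Section 5.2 is sketched below; the conversion formula and the sample numbers are assumptions for illustration rather than the paper's exact procedure.

#include <stdio.h>

/* Average performance improvement per 0.5 GB cache unit, estimated from two
 * profiled runs of the same workload: one without cache and one with
 * `units` cache units.  The improvement is expressed as a speedup delta. */
static double per_unit_improvement(double t_no_cache, double t_cache, int units)
{
    double speedup = t_no_cache / t_cache;       /* e.g., 2.5x faster        */
    return (speedup - 1.0) / (double)units;      /* gain per 0.5 GB unit     */
}

int main(void)
{
    /* Illustrative elapsed times (seconds) for one 10 GB data set. */
    double t_disk = 4200.0, t_rm_4gb = 1650.0, t_lm_4gb = 1500.0;
    int units = 8;                               /* 4 GB = eight 0.5 GB units */

    printf("pi_BV ~ %.3f per unit\n", per_unit_improvement(t_disk, t_rm_4gb, units));
    printf("pi_HP ~ %.3f per unit\n", per_unit_improvement(t_disk, t_lm_4gb, units));
    return 0;
}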
6.2.2 Experimental Settings The cost efficiency of CaaS is evaluated through extensive simulations with randomly generated workloads, and each simulation is conducted using the metric for performance improvement of each cache. Different workload character- istics were applied. Table 4 summarizes the parameters used in our experiments. For this evaluation, each compu- tational resource has two quad-core processors, 16 GB RAM, 80 GB SSD, and 1TB HDD, while each RM cache server has a dual-core processor, 32 GB RAM, and 500 GB HDD. In this experiment, we adopt three default IaaS types, and each has the following specification: . small: one core, 1 GB RAM, and 50 GB disk ($0.1/hr) . medium: two cores, 4 GB RAM, and 100 GB disk ($0.2/hr) . large: four cores, 8 GB RAM, and 200 GB disk ($0.4/hr). 1396 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 8, AUGUST 2012 Fig. 9. Results of postmark 100k transactions (RM-Cache). Fig. 10. Results of postmark 100k transactions for 10 GB data (extra memory).
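The host capacities and default IaaS types just listed can be captured as a small configuration table for the simulator. The layout below is only a sketch of such simulator input (structure and field names are assumptions); the values themselves are taken from the settings above.

#include <stdint.h>

struct iaas_spec {
    const char *name;
    int    cores;
    double ram_gb;
    double disk_gb;
    double rate_per_hr;   /* flat rate f_i in $/hr */
};

/* Default IaaS types used in the simulation. */
static const struct iaas_spec IAAS_TYPES[] = {
    { "small",  1, 1.0,  50.0, 0.1 },
    { "medium", 2, 4.0, 100.0, 0.2 },
    { "large",  4, 8.0, 200.0, 0.4 },
};

/* Simulated computational host: two quad-core processors and 16 GB RAM. */
static const int    HOST_CORES  = 8;
static const double HOST_RAM_GB = 16.0;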
A distinctive design rationale for CaaS is that the service provider should be assured of profitability improvement under various operational conditions; that is, the impact of the resource scheduling policy that a provider adopts on its profit should be minimal. To meet such a requirement, we assess the performance characteristics under four well-known resource allocation algorithms—First-Fit, Next-Fit, Best-Fit, and Worst-Fit—and a variant for each of these four; hence, eight in total. The four variants adopt live resource (VM) migration. FF places a user's resource request in the first resource that can accommodate the request. NF is a variant of FF, and it searches for an available resource starting from the resource that was selected at the previous scheduling. BF (/WF) selects the smallest (/largest) resource among those that can meet the user's resource request. Besides, we consider live VM migration, which has been widely studied primarily for better resource management [31], [32]. In our service, a resource is only migrated to another physical machine if the application running on that resource is not I/O intensive. The decision on resource migration is made in a best-fit fashion. Thus, we evaluate our CaaS model using the following eight algorithms: FF, NF, BF, WF, and their migration counterparts, FFM, NFM, BFM, and WFM. In our simulations, we set the number of physical resources to be virtually unlimited. 6.2.3 Performance Metrics We assume users who select BV are conservative in terms of their spending, and their applications are I/O intensive and not mission critical. Therefore, the performance gain from services with more cache in BV is very beneficial. The reciprocal benefit of that performance gain is realized on the service provider's side due to more efficient resource utilization by effective service consolidation. These benefits are measured using two performance metrics based primarily on monetary relativity to those benefits. Specifically, the benefit for users is measured by prices paid for their I/O-intensive applications, whereas that for providers is quantified by profit (more specifically, unit profit) obtained from running those applications. The former performance metric is quite direct, and the average price paid for I/O-intensive applications is adopted. However, the performance metric for providers is a little more complicated since the cost related to serving those applications (including the number of physical resources used) needs to be taken into account, and thus, neither the total profit nor the average profit may be an accurate measurement. As a result, the average unit profit up is devised as the primary performance metric for providers, and it is defined as the total profit p_{total} obtained over the "relative" number of physical nodes used, rpn. More formally,

p_{total} = \sum_{i=1}^{r} p_i    (7)

rpn = \sum_{i=1}^{r} act_i / act_{max},    (8)

and

up = p_{total} / rpn,    (9)

where r is the total number of service requests (VMs), act_i and act_{max} are the active duration of a physical node m_i (and it may vary between different nodes) and the maximum duration among all physical nodes, respectively. The active duration of a physical node is defined as the amount of time from the time the node is instantiated to the end time of a given operation period (or the finish time of a particular experiment in our study). 6.2.4 Results The number of experiments conducted with eight different resource allocation algorithms is 320.
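The provider-side metric of (7)-(9) can be computed directly from a simulation trace; a minimal sketch follows, where the per-request profits and per-node active durations are illustrative values and the relative node count is summed over physical nodes.

#include <stdio.h>
#include <stddef.h>

/* Average unit profit (eq. (9)): total profit over the "relative" number of
 * physical nodes, where each node is weighted by how long it stayed active
 * relative to the longest-running node (eq. (8)). */
static double avg_unit_profit(const double *profit, size_t n_requests,
                              const double *active, size_t n_nodes)
{
    double p_total = 0.0, act_max = 0.0, rpn = 0.0;

    for (size_t i = 0; i < n_requests; i++)      /* eq. (7): sum per-request profits */
        p_total += profit[i];

    for (size_t i = 0; i < n_nodes; i++)
        if (active[i] > act_max)
            act_max = active[i];
    for (size_t i = 0; i < n_nodes; i++)         /* eq. (8): relative node count     */
        rpn += active[i] / act_max;

    return p_total / rpn;                         /* eq. (9)                          */
}

int main(void)
{
    double profit[] = {1.2, 0.8, 2.0, 1.5};       /* $ per request (illustrative)       */
    double active[] = {10.0, 4.0};                /* node active hours (illustrative)   */
    printf("up = %.3f\n", avg_unit_profit(profit, 4, active, 2));
    return 0;
}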
Eight repeated trials are executed for each experiment, and we obtained the average value of eight results as average profit under the corresponding parameter. These average unit profits are normalized based on average unit profit of the WF algorithm. Fig. 11 shows overall benefit of CaaS. From the figure, we identify that IaaS requests with CaaS can give more benefit (36 percent on average) to service providers than those without CaaS regardless of the resource allocation algorithms and VM migration policies. The benefit of using VM migration is 32 percent on average more than that without VM migration. The Best-Fit algorithm gives more profit than other algorithms since it minimizes resource fragmentation, which results in higher resource consumption. Fig. 12 shows average unit profits when the rate of I/O- intensive jobs is varied. From results without VM migra- tion, we can see that I/O-intensive jobs lead to more benefit due to the efficiency of the elastic cache. The normalized unit profit with VM migration increases when the number of non-I/O-intensive jobs increases. This is because VM HAN ET AL.: CASHING IN ON THE CACHE IN THE CLOUD 1397 TABLE 4 Experimental Parameters Fig. 11. Overall results.
  • 12. migration only applies to non-I/O-intensive jobs, and this leads to more migration chances and higher resource utilization. Fig. 13 shows normalized unit profits with various ratios of HP jobs to BV jobs. The provider profit is noticeably higher with CaaS than No-CaaS when the rate of HP jobs is low. However, a small loss to providers is incurred when the HP to BV ratio is high (i.e., 2:1 and 1:0); this results from the unexpected LM results (shown in Fig. 11). With the inherent cost efficiency of BV, profits obtained from these jobs are promising, particularly when the rate of BV jobs is high. If a more efficientLM-based cacheis devised,profits with respect to increases in HP jobs are most likely to lead to high profits. 7 CONCLUSION With the increasing popularity of infrastructure services such as Amazon EC2 and Amazon RDS, low disk I/O performance is one of the most significant problems. In this paper, we have presented a CaaS model as a cost efficient cache solution to mitigate the disk I/O problem in IaaS. To this end, we have built a prototype elastic cache system using a remote-memory-based cache, which is pluggable and file- system independent to support various configurations. This elastic cache system together with the pricing model devised in this study has validated the feasibility and practicality of our CaaS model. Through extensive experiments, we have confirmed that CaaS helps IaaS improve disk I/O perfor- mance greatly. The performance improvement gained using cache services clearly leads to reducing the number of (active) physical machines the provider uses, increases throughput, and in turn results in profit increase. This profitability improvement enables the provider to adjust its pricing to attract more users. ACKNOWLEDGMENTS Professor Albert Zomaya would like to acknowledge the Australian Research Council Grant DP A7572. Hyungsoo Jung is the corresponding author for this paper. REFERENCES [1] L. Wang, J. Zhan, and W. Shi, “In Cloud, Can Scientific Communities Benefit from the Economies of Scale?,” IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 2, pp. 296-303, Feb. 2012. [2] M.D. Dahlin, R.Y. Wang, T.E. Anderson, and D.A. Patterson, “Cooperative Caching: Using Remote Client Memory to Improve File System Performance,” Proc. First USENIX Conf. Operating Systems Design and Implementation (OSDI ’94), 1994. [3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang, “Serverless Network File Systems,” ACM Trans. Computer Systems, vol. 14, pp. 41-79, Feb. 1996. [4] S. Jiang, K. Davis, and X. Zhang, “Coordinated Multilevel Buffer Cache Management with Consistent Access Locality Quantification,” IEEE Trans. Computers, vol. 56, no. 1, pp. 95- 108, Jan. 2007. [5] H. Kim, H. Jo, and J. Lee, “XHive: Efficient Cooperative Caching for Virtual Machines,” IEEE Trans. Computers, vol. 60, no. 1, pp. 106-119, Jan. 2011. [6] A. Menon, J.R. Santos, Y. Turner, G.J. Janakiraman, and W. Zwaenepoel, “Diagnosing Performance Overheads in the Xen Virtual Machine Environment,” Proc. First ACM/USENIX Int’l Conf. Virtual Execution Environments (VEE ’05), 2005. [7] L. Cherkasova and R. Gardner, “Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor,” Proc. Ann. Conf. USENIX Ann. Technical Conf. (ATC ’05), 2005. [8] J. Liu, W. Huang, B. Abali, and D.K. Panda, “High Performance VMM-Bypass I/O in Virtual Machines,” Proc. Ann. Conf. USENIX Ann. Technical Conf. (ATC ’06), 2006. [9] A. Menon, A.L. Cox, and W. 
Zwaenepoel, “Optimizing Network Virtualization in Xen,” Proc. Ann. Conf. USENIX Ann. Technical Conf. (ATC ’06), 2006. [10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” Proc. 19th ACM Symp. Operating Systems Principles (SOSP ’03), 2003. [11] X. Zhang and Y. Dong, “Optimizing Xen VMM Based on Intel Virtualization Technology,” Proc. IEEE Int’l Conf. Internet Comput- ing in Science and Eng. (ICICSE ’08), 2008. [12] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A.L. Cox, and W. Zwaenepoel, “Concurrent Direct Network Access for Virtual Machine Monitors,” Proc. IEEE 13th Int’l Symp. High Performance Computer Architecture (HPCA ’07), 2007. [13] Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, and Y. Jiang, “Towards High-Quality I/O Virtualization,” SYSTOR ’09: Proc. Israeli Experimental Systems Conf., 2009. [14] J.R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, “Bridging the Gap Between Software and Hardware Techniques for I/O Virtualization,” Proc. Ann. Conf. USENIX Ann. Technical Conf. (ATC ’08), 2008. [15] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S.K. Reinhardt, and T.F. Wenisch, “Disaggregated Memory for Expansion and Sharing in Blade Servers,” Proc. 36th Ann. Int’l Symp. Computer Architecture (ISCA ’09), 2009. [16] M. Marazakis, K. Xinidis, V. Papaefstathiou, and A. Bilas, “Efficient Remote Block-Level I/O over an RDMA-Capable NIC,” Proc. 20th Ann. Int’l Conf. Supercomputing (ICS ’06), 2006. [17] J. Creasey, “Hybrid Hard Drives with Non-Volatile Flash and Longhorn,” Proc. Windows Hardware Eng. Conf. (WinHEC), 2005. [18] R. Harris, “Hybrid Drives: Not So Fast,” ZDNet, CBS Interactive, 2007. [19] E.R. Reid, “Drupal Performance Improvement via SSD Technol- ogy,” technical report, Sun Microsystems, Inc., 2009. 1398 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 8, AUGUST 2012 Fig. 12. Results with varying rates of I/O-intensive jobs. Fig. 13. Results with varying ratios of HP jobs and BV jobs.
  • 13. [20] S.-W. Lee and B. Moon, “Design of Flash-Based DBMS: An In- Page Logging Approach,” Proc. ACM SIGMOD Int’l Conf. Manage- ment of Data (SIGMOD ’07), 2007. [21] T. Makatos, Y. Klonatos, M. Marazakis, M.D. Flouris, and A. Bilas, “Using Transparent Compression to Improve SSD-Based I/O Caches,” Proc. Fifth European Conf. Computer Systems (EuroSys ’10), 2010. [22] J.-U. Kang, J.-S. Kim, C. Park, H. Park, and J. Lee, “A Multi- Channel Architecture for High-Performance NAND Flash-Based Storage System,” J. Systems Architecture, vol. 53, pp. 644-658, Sept. 2007. [23] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi, “A High Performance Controller for NAND Flash-Based Solid State Disk (NSSD),” Proc. IEEE Non-Volatile Semiconductor Memory Workshop (NVSMW ’06), 2006. [24] S. Kang, S. Park, H. Jung, H. Shim, and J. Cha, “Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory- Based Storage Devices,” IEEE Trans. Computers, vol. 58, no. 6, pp. 744-758, June 2009. [25] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S.M. Rumble, E. Stratmann, and R. Stutsman, “The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM,” ACM SIGOPS Operating Systems Rev., vol. 43, pp. 92- 105, Jan. 2010. [26] R.P. Goldberg and R. Hassinger, “The Double Paging Anomaly,” Proc. Int’l Computer Conf. and Exposition (AFIPS ’74), 1974. [27] C.A. Waldspurger, “Memory Resource Management in VMware ESX Server,” Proc. Fifth USENIX Conf. Operating Systems Design and Implementation (OSDI ’02), 2002. [28] B. Urgaonkar, P.J. Shenoy, and T. Roscoe, “Resource Overbooking and Application Profiling in Shared Hosting Platforms,” Proc. Fifth USENIX Conf. Operating Systems Design and Implementation (OSDI ’02), 2002. [29] A.V. Do, J. Chen, C. Wang, Y.C. Lee, A.Y. Zomaya, and B.B. Zhou, “Profiling Applications for Virtual Machine Placement in Clouds,” Proc. IEEE Int’l Conf. Cloud Computing, 2011. [30] S. Chen, A. Ailamaki, M. Athanassoulis, P.B. Gibbons, R. Johnson, I. Pandis, and R. Stoica, “TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study,” ACM SIGMOD Record, vol. 39, pp. 5-10, Feb. 2011. [31] H. Liu, H. Jin, X. Liao, C. Yu, and C.-Z. Xu, “Live Virtual Machine Migration via Asynchronous Replication and State Synchroniza- tion,” IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 12, pp. 1986-1999, Dec. 2011. [32] G. Jung, M. Hiltunen, K. Joshi, R. Schlichting, and C. Pu, “Mistral: Dynamically Managing Power, Performance, and Adaptation Cost in Cloud Infrastructures,” Proc. IEEE 30th Int’l Conf. Distributed Computing Systems (ICDCS ’10), pp. 62-73, 2010. Hyuck Han received the BS, MS, and PhD degrees in computer science and engineering from Seoul National University, Korea, in 2003, 2006, and 2011, respectively. Currently, he is a postdoctoral researcher at Seoul National Uni- versity. His research interests are distributed computing systems and algorithms. Young Choon Lee received the BSc (hons) degree in 2003 and the PhD degree from the School of Information Technologies at the University of Sydney in 2008. He is currently a postdoctoral research fellow in the Centre for Distributed and High Performance Computing, School of Information Technologies. His current research interests include scheduling and re- source allocation for distributed computing sys- tems, nature-inspired techniques, and parallel and distributed algorithms. 
He is a member of the IEEE and the IEEE Computer Society. Woong Shin received the BS degree in computer science from Korea University, Seoul, in 2003. He is currently working toward the MS degree from Seoul National University. He worked for Samsung Networks from 2003 to 2006 and TmaxSoft from 2006 to 2009 as a software engineer. His research interests are in system performance study, virtualization, sto- rage systems, and cloud computing. Hyungsoo Jung received the BS degree in mechanical engineering from Korea University, Seoul, in 2002, and the MS and PhD degrees in computer science from Seoul National Univer- sity, Korea in 2004 and 2009, respectively. He is currently a postdoctoral research associate at the University of Sydney, Sydney, Australia. His research interests are in the areas of distributed systems, database systems, and transaction processing. Heon Y. Yeom received the BS degree in computer science from Seoul National Univer- sity in 1984 and the MS and PhD degrees in computer science from Texas AM University in 1986 and 1992, respectively. He is a professor with the School of Computer Science and Engineering, Seoul National University. From 1986 to 1990, he worked with Texas Transpor- tation Institute as a Systems Analyst, and from 1992 to 1993, he was with Samsung Data Systems as a research scientist. He joined the Department of Computer Science, Seoul National University in 1993, where he currently teaches and researches on distributed systems, multimedia systems and transaction processing. He is a member of the IEEE. Albert Y. Zomaya is currently the chair professor of High Performance Computing Networking and Australian Research Council Professorial fellow in the School of Information Technologies, The University of Sydney. He is also the director of the Centre for Distributed and High Perfor- mance Computing which was established in late 2009. He is the author/co-author of seven books, more than 400 papers, and the editor of nine books and 11 conference proceedings. He is the editor-in-chief of the IEEE Transactions on Computers and serves as an associate editor for 19 leading journals, such as, the IEEE Transactions on Parallel and Distributed Systems and Journal of Parallel and Distributed Computing. He is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. Also, he received the IEEE Technical Committee on Parallel Processing Outstanding Service Award and the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing, both in 2011. He is a chartered engineer, a fellow of the AAAS, the IEEE, the IET (United Kingdom), and a distinguished engineer of the ACM. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib. HAN ET AL.: CASHING IN ON THE CACHE IN THE CLOUD 1399