PowerAlluxio: Fast Storage System with Balanced Memory Utilization
Chi-Fan Chu
cfchu@umich.edu
University of Michigan, Ann Arbor
Chao-Han Tsai
chatsai@umich.edu
University of Michigan, Ann Arbor
Abstract
With high-speed networks available at reasonable prices, sharing memory across machines has become a feasible architecture for cluster computing. In addition to disk, under-utilized cluster memory gives heavily loaded machines another choice for data placement. With a high network-to-disk-bandwidth ratio (NDR), fetching files from the memory of a remote machine is even faster than a local disk read.

We developed PowerAlluxio, an in-memory file system based on Alluxio that provides a shared-memory abstraction. PowerAlluxio improves cluster average task completion time by 14.11× when data can be fully cached across all machines. To further improve efficiency, we propose a new data eviction policy, Smart LRU (SLRU), which helps PowerAlluxio reduce elapsed time by 24.76% when dealing with large data sets. We also observe that even though the traditional disk-locality problem has become less important, memory locality turns out to be a critical factor that dominates application performance.
1 Introduction
The need for distributed file systems arises as industry requires fault-tolerant and scalable storage solutions. Large-scale computing applications often involve reading and writing high volumes of data and produce huge amounts of intermediate files, so an efficient and scalable storage solution is required to store them. The Google file system [1] and the Hadoop distributed file system were proposed for large distributed data-intensive applications.

The bottleneck of these distributed file systems is that they are I/O-bound. Although read performance can be improved by caching data in memory, write performance cannot, since data must be replicated across nodes for fault tolerance.
Current approaches to the disk I/O bottleneck use either in-memory file systems or network RAID. Alluxio [2, 3], previously Tachyon, is an in-memory file system that leverages the concept of lineage to improve write performance without compromising fault tolerance. Flat datacenter storage [4] divides files into multiple partitions and stores each partition on a different machine; with full bisection network bandwidth, clients can exploit the full disk bandwidth of the cluster to relieve the single-disk I/O bottleneck.

High-speed networks have become affordable and available. Remote Direct Memory Access (RDMA) capable networks such as InfiniBand FDR can provide 5 GB/s of bandwidth, and the cost of NICs continues to drop. These have become new options for server interconnection inside the datacenter.
With an InfiniBand connection, reading from the memory of a remote machine is faster than reading from local disk: a SATA 7200 rpm hard drive delivers about 130 MB/s, while the read bandwidth via InfiniBand FDR is 2.9 GB/s. Therefore, fully utilizing the memory on all machines to reduce disk I/O can greatly improve throughput. At the same time, memory locality becomes a significant factor that should be considered in file system design.
We implemented our ideas on top of Alluxio, an existing in-memory file system project, and call the result PowerAlluxio. It uses a shared-memory scheme to exploit the memory on all machines and increase cluster memory utilization without sacrificing memory locality. Clients can efficiently use the memory on remote machines to avoid potential disk I/O. We carried out a set of experiments to evaluate the performance gain: running PowerAlluxio in a simulated high-speed network environment with a network-to-disk-bandwidth ratio (NDR) of 40.93, we achieve a 14.11× speed-up over the original Alluxio.
This paper is organized as follows: §2 provides
the background on RDMA and Alluxio. §3 gives an
overview on how PowerAlluxio works. §4 highlights
some implementation details about PowerAlluxio. Ex-
periment results and analysis of PowerAlluxio and Al-
luxio appear in §5. §6 discusses the challenges we en-
countered and the future direction of this project. Finally,
we summarize our conclusions in §7.
2 Background
2.1 RDMA
Remote Direct Memory Access (RDMA) enables direct access to memory on remote machines. Messages are sent to the NIC and handled by the remote NIC without involving the CPUs, allowing zero-copy transfers that avoid kernel-crossing overhead. RDMA NICs provide reliable and efficient transmission through hardware-level retransmission of lost packets and kernel bypass for every communication. However, RDMA was not widely used in datacenters in the past due to its high cost.

InfiniBand is a communication standard with very high throughput and very low latency. It supports RDMA operations and is widely used as a server interconnect in the high-performance computing (HPC) community.
2.2 Alluxio
Alluxio [2] is an in-memory file system that leverages lineage to avoid disk writes without compromising fault tolerance. Alluxio supports major distributed storage systems such as the Hadoop distributed file system (HDFS) and Amazon S3 as its underlying persistent layer. On write operations, Alluxio first writes files to memory storage and writes the corresponding lineage to the persistent layer. The files are then asynchronously checkpointed to the persistent layer.
To illustrate how lineage works, consider the following example. Assume that we have a program P that reads input files A1, A2, ..., An and writes output files B1, B2, ..., Bm. Before Alluxio writes the output, it first records the corresponding lineage information L to the persistent layer. The lineage contains all the information required to run the program, such as the locations of the input files, the configuration, and the command. If some output file Bi is later lost, it can be recomputed from the lineage information L.
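To make the lineage record concrete, the following is a minimal sketch of the information such a record could carry; the class and field names are illustrative and do not reflect Alluxio's actual API.

```java
// Hypothetical sketch of a lineage record; names are illustrative,
// not Alluxio's actual classes.
import java.util.List;

public class LineageRecord {
    private final List<String> inputFiles;   // paths of A1..An
    private final List<String> outputFiles;  // paths of B1..Bm
    private final String command;            // command used to run program P
    private final String configuration;      // serialized job configuration

    public LineageRecord(List<String> inputFiles, List<String> outputFiles,
                         String command, String configuration) {
        this.inputFiles = inputFiles;
        this.outputFiles = outputFiles;
        this.command = command;
        this.configuration = configuration;
    }

    // If an output file Bi is lost, the record supplies everything needed
    // to re-run P and regenerate it.
    public String recomputeCommand() {
        return command + " " + String.join(" ", inputFiles);
    }
}
```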
Alluxio consists of a centralized master and a worker on each machine, similar to the architecture of the Google file system and HDFS. The Alluxio master is responsible for managing file metadata, periodically checkpointing files to the persistent layer, and allocating resources for re-computation. Workers are responsible for data storage: files are split into multiple blocks and placed on workers, and the master keeps track of the location of each block. Every worker has a daemon that manages local resources and reports memory usage to the master by periodic heartbeat.
3 Approach Overview
3.1 Problems of Alluxio
In-memory file systems dramatically shorten application run time by caching part of the file system in memory, so that the number of low-bandwidth disk read/write operations is reduced. Alluxio, the leading distributed in-memory file system project, has been deployed by many companies and has demonstrated its strength for cluster computing.
However, we found that the current design of Alluxio fails to use cluster memory efficiently. Space-intensive applications manipulate very large files that cannot fit into the memory of a single machine. Once a client runs out of space on an Alluxio worker, the worker, under the default configuration (local-first policy), starts evicting blocks from its memory storage even though other workers have free memory. A client accessing an evicted block triggers a cache miss in Alluxio, which then either reads the file from disk (if the file has been checkpointed) or recomputes it from lineage information (if it is not persisted); either way slows down the client. With the default storage policy, a client can only utilize the memory of a single worker. Since memory usage is typically unbalanced within a cluster, other workers often have free memory available for extra data. With a high-speed network, it is preferable to cache data in the memory of a remote machine rather than store it on local disk, because accessing data through the network is faster than reading it from disk.
Alluxio can be configured with advanced policies, such as selecting workers in round-robin fashion or selecting the worker with the most available space, so that the client is not restricted to a single worker's capacity. However, these approaches lose memory locality. Even though cluster-wide memory utilization increases, the latest data, which tends to be the "hottest" data, may be placed on remote machines rather than the local machine. Recent research [5] argues that disk locality is irrelevant, but memory locality remains a critical issue; after all, local memory still outperforms existing networks in terms of bandwidth. The system should therefore cache the latest data as close to the client as possible.
3.2 PowerAlluxio
We implemented our ideas as new features on top of the Alluxio codebase. To differentiate the projects, we call our system PowerAlluxio. The primary goal of PowerAlluxio is to increase cluster-wide memory utilization without sacrificing memory locality. To further boost performance, we propose a new data eviction mechanism, Smart LRU (SLRU). Figure 1 shows the complete workflow.

Figure 1: Flow of remote worker data block transfer
3.2.1 Shared Memory Storage Space
With PowerAlluxio, the client is not restricted to a single worker's storage space; from the client's perspective, PowerAlluxio provides a shared memory storage abstraction. Lightly loaded workers can assist heavily loaded workers by caching extra blocks, which increases cluster-wide memory utilization. Furthermore, to exploit memory locality, the client always writes files to its local worker. If the local worker runs out of space, instead of directly evicting blocks it handles the eviction in one of two cases:
1. The evicted data is persisted on disk: The local worker immediately removes the blocks from its memory storage to create space, which avoids blocking the client's process. The local worker then uses a background process to fetch the evicted data from the persistent storage layer and transfer and cache it on another worker. Once finished, the data blocks can be accessed from the memory of the remote machine over the network.
2. The evicted data is NOT persisted on disk: The local worker blocks the client and transfers the blocks to another worker before eviction. The cost of recomputation is expected to be higher than the cost of simply blocking the client and transferring the data.
A data transfer may cause the receiving worker to run out of space, and that worker would then transfer data to other machines, causing a chain effect. To prevent this, we add an argument to the RPC interface that lets workers differentiate incoming write streams from clients and from workers. If the stream comes from a worker, no more data is forwarded to other machines and the chain stops; a minimal sketch of this decision logic is shown below. Since the total amount of data cached in cluster memory increases while memory locality is preserved, we can expect PowerAlluxio to improve the cluster's average task completion time.
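The following sketch captures this eviction path; the method names (isPersisted, scheduleBackgroundTransfer, transferBlocking) and the stream-origin flag are hypothetical stand-ins for PowerAlluxio's internals, not its real interfaces.

```java
// Sketch of the local worker's behavior when it must make room for new data.
// Method and field names are hypothetical.
public class EvictionHandler {

    void handleEviction(long blockId, boolean streamFromWorker) {
        // Chain-effect prevention: if the write stream that forced this
        // eviction came from another worker, never forward data again.
        if (streamFromWorker) {
            evictLocally(blockId);
            return;
        }
        if (isPersisted(blockId)) {
            // Case 1: the data is already on disk. Free memory immediately so
            // the client is not blocked, then re-read the data from the
            // persistent layer and cache it on a remote worker in the background.
            evictLocally(blockId);
            scheduleBackgroundTransfer(blockId);
        } else {
            // Case 2: the data exists only in memory. Block the client and push
            // the block to a remote worker before removing it; this is cheaper
            // than a later lineage recomputation.
            transferBlocking(blockId);
            evictLocally(blockId);
        }
    }

    // Stubs standing in for PowerAlluxio internals.
    void evictLocally(long blockId) { /* remove block from local memory store */ }
    boolean isPersisted(long blockId) { return false; /* query master/under-store */ }
    void scheduleBackgroundTransfer(long blockId) { /* read from disk, send to a remote worker */ }
    void transferBlocking(long blockId) { /* stream block data to a remote worker synchronously */ }
}
```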
3.2.2 Data Eviction Policy
When out of space, a worker needs to decide which data block to send and to whom, or whether it should simply evict the block because it will not be accessed again. According to prior analyses [3, 6, 7], data-intensive applications have the following characteristics:
• Access Frequency: File access often follows a Zipf-
like distribution.
• Access Temporal Locality: 75% of the re-accesses
take place within 6 hours.
Based on these characteristics, Alluxio uses LRU as its default eviction policy. Every worker maintains a local LRU queue over all the blocks it holds, and the queue is updated whenever an operation is performed at that worker.
In PowerAlluxio, however, since all memory is shared, a local LRU cannot provide the information needed for block transfer. For example, the worker containing the "oldest" block is the ideal destination, but such a destination cannot be identified without a cluster-wide LRU. An easy solution is to maintain a global LRU at the master. We did not choose this approach because the master could become a scaling bottleneck, as every operation would need to go through it. Instead, PowerAlluxio uses an approximate global LRU that we call Smart LRU (SLRU). The details are as follows:
1. Blocks carry last-access-time information: A local LRU needs no time information; on every operation it simply moves the accessed block to the end of the queue so that the front of the queue is always the oldest block. To compare the "age" of blocks from different workers, we add an extra attribute to the block metadata recording the last access time. At the transfer destination, the recorded time determines the position of the transferred block in the local LRU queue, so PowerAlluxio does not corrupt the client's data access history.
2. The master tracks the age of each worker: The age of a worker is defined as the average age of a number (tunable according to cluster workload) of its oldest blocks. Each worker reports its age to the master by periodic heartbeat and asks the master which worker is the oldest when choosing a transfer destination.
3. Do not move stale blocks: If the evicted block has not been accessed for longer than a properly set time period, called the active time threshold, it is unlikely to be accessed again in the future. Transferring stale blocks only consumes network bandwidth with no benefit.
Although the worker age information kept at the master may not be fresh because of the relatively low heartbeat frequency (once per second by default), we consider this acceptable: the blocks at the worker that was oldest at the last heartbeat are still comparatively old blocks in the cluster. We found that SLRU works surprisingly well in practice, further reducing PowerAlluxio's average task finish time by up to 24.76%. A sketch of the destination-selection logic follows.
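The sketch below illustrates how SLRU could pick a transfer destination under the rules above; the WorkerInfo fields and the way worker ages reach the policy are simplifying assumptions rather than PowerAlluxio's actual code.

```java
// Sketch of the SLRU destination decision. WorkerInfo and its fields are
// hypothetical simplifications of the information the master reports.
import java.util.Comparator;
import java.util.List;

public class SlruPolicy {
    private final long activeTimeThresholdMs; // tuned per cluster workload / NDR

    public SlruPolicy(long activeTimeThresholdMs) {
        this.activeTimeThresholdMs = activeTimeThresholdMs;
    }

    /** Returns the destination worker id, or -1 if the block should simply be evicted. */
    public long chooseDestination(long blockLastAccessMs, List<WorkerInfo> workers) {
        long idleMs = System.currentTimeMillis() - blockLastAccessMs;
        if (idleMs > activeTimeThresholdMs) {
            // Stale block: unlikely to be accessed again, so do not spend bandwidth on it.
            return -1;
        }
        // Prefer the "oldest" worker, i.e. the one whose average age of its
        // oldest blocks (reported via heartbeat) is largest, among workers
        // that still have free space.
        return workers.stream()
                .filter(w -> w.freeBytes > 0)
                .max(Comparator.comparingLong((WorkerInfo w) -> w.ageMs))
                .map(w -> w.workerId)
                .orElse(-1L);
    }

    public static class WorkerInfo {
        long workerId;
        long freeBytes;
        long ageMs; // average age of this worker's oldest blocks
    }
}
```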
4 Implementation Highlights
This section describes some key parts of PowerAlluxio. First, we introduce the new RPC interfaces that facilitate inter-worker communication. Second, we explain several design challenges behind the advanced features of PowerAlluxio. Last, we explain how we simulate a high-speed network for our experiments.
4.1 New RPC Interfaces
In the original Alluxio, all commands and messages between clients, the master, and workers are issued through remote procedure calls (RPC). There are three major types of RPC interfaces: (1) client-side library to master, for file metadata operations such as registering new files and querying which workers hold a specific file; (2) client-side library to worker, for direct data read/write streams from clients to workers; and (3) worker to master, for worker registration and data block commits.
The original Alluxio architecture is centralized, with no direct communication between workers. Workers report run-time statistics to the master by periodic heartbeat, and the master passes instructions back to workers in the heartbeat responses. To realize PowerAlluxio, new RPC interfaces were added (a sketch of their shape follows the list):
1. Worker to master: getFileInfoWithBlock() lets a worker look up file metadata given a block ID. When data is transferred in the background (the target data block is removed from worker memory storage immediately so as not to block the client process, see §3.2.1), the worker reads the evicted data back from disk. The worker needs to know which file, and which offset within it (a file may be divided into several blocks), the evicted block belongs to, so a reverse mapping from block ID to file ID is maintained at the PowerAlluxio master. getWorkerInfoList() lets workers query the availability and age of all workers when choosing a data transfer destination; the master returns the address, capacity, used bytes, and age of every worker in the response.
2. Worker to worker: The Netty transport-layer module from the client-side library is reused and adapted to transfer data between workers. Because the transferred data size is known, the buffer size can be set accordingly to improve efficiency. An identifier is added to the interface so that stream receivers can distinguish client senders from worker senders; when a worker sender commits a data block, different operations are performed, such as restoring the last access time and inserting the block at the correct position in the LRU queue.
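The sketch below shows the shape of the new worker-to-master interface. The method names getFileInfoWithBlock() and getWorkerInfoList() come from the text above, but the parameter and return types are assumptions for illustration; the real implementation goes through Alluxio's RPC framework.

```java
// Sketch of the worker-to-master RPC surface; types are illustrative.
import java.util.List;

public interface PowerAlluxioMasterService {
    /** Reverse lookup: given a block id, return the owning file's metadata
     *  (file id, path, and the offset of this block within the file). */
    FileInfo getFileInfoWithBlock(long blockId);

    /** Address, capacity, used bytes, and age of every worker, used by an
     *  evicting worker to pick a transfer destination. */
    List<WorkerInfo> getWorkerInfoList();
}

class FileInfo {
    long fileId;
    String path;
    long blockOffset;
}

class WorkerInfo {
    String address;
    long capacityBytes;
    long usedBytes;
    long ageMs;
}
```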
4.2 BlockMover Module
Every worker has a BlockMover module. When the worker runs out of space, the eviction planner triggers the BlockMover by passing it a list of block IDs. To save thread creation and termination overhead, the BlockMover maintains a thread pool to serve background block transfers.
Once a worker evicts a block from its storage, it uses the heartbeat to tell the master to remove it from the block holder list; each block is associated with a block holder list that records the workers storing it. Because of the fixed heartbeat frequency, the state at the worker and the master may be inconsistent. In the original Alluxio this is acceptable: a client directed to this worker for a certain file by the master simply triggers a cache miss. In PowerAlluxio, however, that cache miss may be unnecessary, since the block might already have been transferred and cached at another worker. We want the master to update its state faster to avoid such inconsistency. Increasing the heartbeat frequency would solve the problem but would introduce significant overhead at the master; instead, we make workers issue an instant heartbeat after transferring blocks. A minimal sketch of the BlockMover follows.
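The sketch below illustrates such a BlockMover; the class layout, the pool size, and the transfer and heartbeat hooks are illustrative assumptions rather than PowerAlluxio's actual implementation.

```java
// Sketch of the BlockMover module: a fixed thread pool serving background
// block transfers. Names and the pool size are illustrative.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockMover {
    // Reusing threads avoids per-transfer creation/termination overhead.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Called by the eviction planner with the list of blocks to move. */
    public void moveBlocks(List<Long> blockIds, String destinationWorker) {
        for (long blockId : blockIds) {
            pool.submit(() -> transfer(blockId, destinationWorker));
        }
    }

    private void transfer(long blockId, String destinationWorker) {
        // Read the evicted block back from the persistent layer, stream it to
        // the destination worker, then issue an instant heartbeat so the
        // master's block holder list is updated without waiting a full period.
    }
}
```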
During implementation, we found a concurrency bug in the original Alluxio that occurs when two clients simultaneously access a file that was previously evicted. Both trigger a cache miss and both start to cache data under the same block ID (since it comes from the same file at the same offset). The one that commits later is thrown a BlockAlreadyExistsException, which crashes the client program while Alluxio itself keeps running. This bug, however, crashes PowerAlluxio in a similar scenario, when a client and a worker cache the same file at the same place and the client commits first, and the frequency of such failures is not negligible. We finally solved the problem by changing the exception-handling logic of the Netty transport layer to perform appropriate garbage collection at the remote worker.
4.3 Simulating a High-Speed Network
Integrating RDMA into PowerAlluxio is extremely difficult due to limited hardware resources and the lack of reliable third-party software drivers. Since releasing deployable software is not our primary goal, we decided to run PowerAlluxio in a simulated high-speed network over an Ethernet setup. Relative network speed can be evaluated with the network-to-disk-bandwidth ratio (NDR), defined as follows:
NDR = network bandwidth / disk bandwidth
By fixing the network bandwidth and reducing the disk bandwidth, we can increase the NDR just as if we were using a high-speed network.
Limiting disk bandwidth at the kernel level is no easy task either. To simulate a reduced-bandwidth disk, our first attempt was to read the same file several extra times. However, since the file may be cached by the kernel, those extra reads may never actually reach the disk. As an alternative, we follow every read operation with multiple write-and-flush operations of a dummy file of the same size. This relies on the assumption that HDD sequential read and sequential write speeds are of the same order of magnitude. By changing the number of dummy writes per read operation, we create various effective disk bandwidths for the experiments. A sketch of this throttling scheme is shown below.
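The sketch below illustrates this application-level throttling; the class, the file path, and the padding factor are hypothetical, and it assumes that each extra write pass costs roughly as much disk time as the original read.

```java
// Sketch of application-level disk throttling: every read is followed by N
// dummy write+flush passes of the same size, which divides the effective
// disk bandwidth by roughly N + 1.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ThrottledDiskReader {
    private final int extraWritePasses; // larger value => lower effective bandwidth

    public ThrottledDiskReader(int extraWritePasses) {
        this.extraWritePasses = extraWritePasses;
    }

    public byte[] read(String path) throws IOException {
        byte[] data;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            data = new byte[(int) in.length()];
            in.readFully(data);
        }
        // Dummy writes of the same size, unlike repeated reads, are not
        // satisfied by the kernel page cache and always reach the disk.
        byte[] dummy = new byte[data.length];
        for (int i = 0; i < extraWritePasses; i++) {
            try (FileOutputStream out = new FileOutputStream("/tmp/dummy_throttle.bin")) {
                out.write(dummy);
                out.flush();
                out.getFD().sync(); // force the dummy data to disk
            }
        }
        return data;
    }
}
```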
Figure 2: Bandwidth grows linearly with file size and saturates when the file size reaches the GB level.
Figure 3: Transferring a 512 MB file with different message sizes.
5 Evaluation
In this section, we first measured the performance of InfiniBand FDR. Based on the measurement results, we set up a simulated high-speed network in an Ethernet environment and conducted three experiments to compare the performance of PowerAlluxio and Alluxio.
5.1 InfiniBand
We measured the transmission bandwidth over Infini-
Band FDR with a server and a client. Each machine
had two Xeon E5-2650v2 processors (8 cores each,
2.6 GHz), a 64 GB memory, two 1 TB 7200 rpm disks,
and a Mellanox FDR CX3 NICs that support 40 Gbps
InfiniBand FDR.
Figure 4: Runtime of Alluxio and PowerAlluxio with
different network-to-disk-speed ratio (NDR) when all
files can be cached.
Figure 5: Speed-up of PowerAlluxio over Alluxio with
different NDR when all files can be cached.
To find how file size influences transmission bandwidth, we transferred a file in a single session and repeated the process 100 times to obtain the average bandwidth. For file sizes from 4 KB to 1 MB, we transferred the file in a single message; for file sizes from 4 MB to 2 GB, we divided the file into multiple 1 MB partitions and transferred each partition as a single message. Figure 2 shows that the transmission bandwidth continues to grow with file size and saturates at the GB level.

To discover how message size affects bandwidth, we transferred a 512 MB file (the default block size of Alluxio) with messages of different sizes. The peak bandwidth occurred when the message size was 64 KB, as shown in Figure 3.
Figure 6: Runtime of Alluxio and PowerAlluxio with
different network-to-disk-speed ratio (NDR) when all
files can be cached. The time for the first read from disk
is omitted.
Figure 7: Speed-up of PowerAlluxio over Alluxio with
different NDR when all files can be cached. The time for
the first read from disk is omitted.
5.2 Scenario 1: All Files Can Be Cached
A client read multiple files on PowerAlluxio, and all files could be cached in the memory of either the local machine or remote machines. In this setting, we had a master and three workers. Each worker was configured with 2 GB of memory for data storage. All machines were configured with two Xeon E5-2650v2 processors (8 cores each, 2.6 GHz), 64 GB of memory, two 1 TB 7200 rpm disks, and an Intel X520 PCIe dual-port 10 Gbps Ethernet NIC.

The client first read 30 files from persistent storage, with file sizes ranging uniformly from 150 MB to 220 MB, and then randomly performed 1000 read operations on these files. The total file size was 5633 MB, which could be fully cached in the memory of the three machines (6 GB in total).
Figure 4 shows the runtime of Alluxio and PowerAlluxio with different network-to-disk-speed ratios, and Figure 5 shows the speed-up of PowerAlluxio over Alluxio with different NDR values. PowerAlluxio achieves a 14.11× speed-up when the NDR is 40.93.

The main factor limiting further speed-up is the first disk read of each persistent file, which is unfortunately inevitable in practice. To show this, we repeated the experiment but omitted the time of the first disk read. Figure 6 shows the runtime of Alluxio and PowerAlluxio with different NDR, and Figure 7 the corresponding speed-up; the speed-up is 34.57× when the NDR is 40.93.
5.3 Scenario 2: Multiple Clients
Three clients on three different machines read and write files on PowerAlluxio. Client1 performs 1000 read operations on 30 files of 100 MB to 160 MB each. Client2 performs 3000 read operations on 20 files of 40 MB to 55 MB each. Client3 performs 1200 operations on 10 files of 80 MB to 120 MB each. We intentionally kept the working sets of Client2 and Client3 smaller than a single worker's space. The hardware specification was the same as in the previous experiment.
Figure 8 shows the completion times under different configurations. The default storage policy of the original Alluxio makes clients use only the memory of their local machines. Client1 operated on files with a total size of 3968 MB, which could not be fully cached in the 2 GB of memory on its local machine, so its completion time was long due to the many disk I/Os caused by cache misses. Client2 and Client3 operated on files with total sizes of 1007 MB and 1008 MB respectively, which could be fully cached in the memory of their local machines, so their completion times were much shorter.
Another storage policy of the original Alluxio, the most-available policy, assigns the machine with the most free memory to the client. Since Client1 could exploit the memory of remote machines under this policy, its completion time was shorter than under the default policy. However, because the most-available policy scattered the files of Client2 and Client3 across workers, they lost memory locality and their completion times were much longer than with the default policy. Although the most-available policy allows a client to use the memory of remote machines, the loss of memory locality is a critical issue for clients with small working sets.
Figure 8: Runtime of multiple clients reading files with
Alluxio, Alluxio with most available policy, and Power-
Alluxio.
PowerAlluxio always stores a client's latest files on the local machine; when out of space, the PowerAlluxio worker transfers older data according to SLRU. In this way, PowerAlluxio preserves memory locality while allowing higher memory utilization. As a result, PowerAlluxio dramatically shortened the completion time of Client1 without much affecting Client2 and Client3. PowerAlluxio reduced the average task finish time by 59.87%, while the most-available policy reduced it by only 24.11%.
5.4 Scenario 3: Files Exceeding Space
A client read large files from persistent storage, and the files together exceeded the total worker space. In this setting, we had a master and three workers. All machines were configured with two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge), 256 GB of memory, two 1 TB 7200 rpm disks, and a dual-port Intel 10 Gbps NIC. Each worker was configured with 2 GB of memory space.

The client performed 1000 read operations on 40 files ranging from 150 MB to 250 MB: 20 very frequently accessed files, 10 frequently accessed files, and 10 rarely accessed files. The total file size was 8087 MB, which could not be fully cached on the three machines. The same workload also ran on PowerAlluxio with SLRU disabled; we expected SLRU to improve client performance.
Figure 9: Runtime to read very large files with different NDR.

Figure 9 shows the runtime for reading the large files with different NDR values in Alluxio, PowerAlluxio with SLRU disabled, and PowerAlluxio. Figure 10 shows the speed-up of PowerAlluxio and PowerAlluxio (SLRU disabled) over Alluxio with different NDR. Enabling SLRU boosts the performance of PowerAlluxio by 24.76%. However, compared with scenario 1, the achieved speed-up is much lower and saturates when the NDR reaches 16. This is caused by extra cache misses due to the total file size exceeding the workers' space. PowerAlluxio without SLRU reduced the number of cache misses from 670 to 180, and SLRU reduced them by another 30. As shown in §5.2, any disk I/O severely damages the speed-up, so the upper bound on speed-up in this scenario is as low as 670/150 ≈ 4.46.
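Under the assumption that runtime in this scenario is dominated by the disk I/O of cache misses, this bound follows directly from the miss counts:

\[
\text{speed-up} \;\lesssim\; \frac{\text{cache misses (Alluxio)}}{\text{cache misses (PowerAlluxio with SLRU)}} = \frac{670}{180 - 30} = \frac{670}{150} \approx 4.46
\]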
Figure 11 shows that the optimal active time threshold (see §3.2.2) scales linearly with the NDR, which means that the threshold is highly predictable. This observation is useful when a cluster administrator upgrades hardware: the optimal active time threshold can easily be predicted and set according to the new hardware specifications.
6 Discussion
6.1 Challenges Encountered
RDMA integration: Since there is no unified software solution across hardware, integrating RDMA into PowerAlluxio does not scale. We tried the open-source library JXIO, which supports the Mellanox FDR CX3 NICs, but JXIO is very unstable and often crashes unexpectedly on our hardware, so we decided not to integrate it into PowerAlluxio.
Experiment environment: Our experiments were conducted on the CloudLab cluster. However, CloudLab provides limited disk space per user (12 GB per machine), which means we cannot run workloads operating on files whose total size exceeds 12 GB. Furthermore, since CloudLab provides virtual machines rather than physical machines, sharing disks with other users may result in unstable disk bandwidth.

Figure 10: Speed-up of PowerAlluxio and PowerAlluxio (SLRU disabled) over Alluxio with different NDR.

Figure 11: Optimum active time threshold with different NDR.
Time synchronization across workers: SLRU involves comparing timestamps across machines, and the more precisely the clocks are synchronized, the more reliable SLRU's decisions become. Currently no extra synchronization is implemented beyond using the local time of each machine; the clock difference between machines is within a few seconds, which is acceptable in our experiments.
6.2 Future Work
Benchmark coverage: From the experiments, we learned that the speed-up of PowerAlluxio over Alluxio degrades when not all files can be cached. How to efficiently and accurately predict this performance degradation for a given workload remains an interesting question. Application trace data from well-known companies would also make a great benchmark for future evaluation.
Better ways to limit disk bandwidth: In our experiments, disk bandwidth was reduced at the application level. We believe that the measurements obtained in this setting still give a sense of how PowerAlluxio performs with a high-speed network. The next step is to reduce disk bandwidth at the kernel level for more accurate control of this variable.
User-level framework overhead vs. kernel modification: We noticed that user-level framework overhead became the bottleneck of PowerAlluxio even with a 10 Gbps Ethernet connection. For example, it took PowerAlluxio as much as 1200 ms to read a 250 MB file from a remote worker, while at full bandwidth the file should be transmitted within 200 ms. We cannot fully exploit the high-speed network because PowerAlluxio is a user-level framework that suffers from unnecessary kernel crossings. However, even though implementing PowerAlluxio at the kernel level could greatly improve speed, the system would be too complicated and the development cost unbearable.
7 Conclusion
Machine memory utilization is often unbalanced in current distributed in-memory file systems. We proposed PowerAlluxio, an in-memory file system that allows clients to exploit the memory of all machines in the cluster. In addition, as the price of high-speed networks such as InfiniBand FDR continues to drop, high-speed networking in the datacenter becomes an affordable option. Combining a high-speed network with PowerAlluxio, reading files from the memory of a remote machine is much faster than reading from local disk.

Our experiments show that when all files can be cached in memory, PowerAlluxio improves cluster average task completion time by 14.11× over the original Alluxio. When dealing with large datasets that cannot be fully cached in cluster memory, our proposed Smart LRU (SLRU) eviction policy helps PowerAlluxio reduce completion time by a further 24.76%.
The takeaway from this project is that as high-speed networks become widely available, utilizing the memory of remote machines becomes feasible. However, memory locality emerges as a significant factor that can greatly influence system performance, and application-level framework overhead becomes a non-negligible factor that limits the speed-up obtainable from a high-speed network.
8 Acknowledgement
We thank Mosharaf Chowdhury for inspiring us to start
this project and providing many great suggestions. Spe-
cial thanks to Yupeng, Cheng, and Gene Pang from Al-
luxio Inc. for answering questions regarding Alluxio
code base. We also thank our friends En-Shuo Hsu,
Shang-En Huang, and Yayun Lo for sharing industrial
experiences as well as excellent feedback. Last but not
least, we thank CloudLab for providing hardware for us
to conduct experiments.
References
[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29–43, New York, NY, USA, 2003. ACM.

[2] Alluxio Inc., http://alluxio.org/.

[3] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 6:1–6:15, New York, NY, USA, 2014. ACM.

[4] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. Flat datacenter storage. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 1–15, Hollywood, CA, 2012. USENIX.

[5] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Disk-locality in datacenter computing considered irrelevant.

[6] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[7] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 7:1–7:13, New York, NY, USA, 2012. ACM.

More Related Content

What's hot

Cache performance considerations
Cache performance considerationsCache performance considerations
Cache performance considerationsSlideshare
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...IJSRD
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma
 
Exploiting Multi Core Architectures for Process Speed Up
Exploiting Multi Core Architectures for Process Speed UpExploiting Multi Core Architectures for Process Speed Up
Exploiting Multi Core Architectures for Process Speed UpIJERD Editor
 
Erasure codes fast 2012
Erasure codes fast 2012Erasure codes fast 2012
Erasure codes fast 2012Accenture
 
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORYMAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORYcaijjournal
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Ceph Community
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSahil Kaw
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0Sahil Kaw
 
Hiding data in hard drive’s service areas
Hiding data in hard drive’s service areasHiding data in hard drive’s service areas
Hiding data in hard drive’s service areasYury Chemerkin
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerEricsson
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorialcybercbm
 

What's hot (20)

Modern processors
Modern processorsModern processors
Modern processors
 
Cache performance considerations
Cache performance considerationsCache performance considerations
Cache performance considerations
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
cache
cachecache
cache
 
Hadoop
HadoopHadoop
Hadoop
 
Exploiting Multi Core Architectures for Process Speed Up
Exploiting Multi Core Architectures for Process Speed UpExploiting Multi Core Architectures for Process Speed Up
Exploiting Multi Core Architectures for Process Speed Up
 
Erasure codes fast 2012
Erasure codes fast 2012Erasure codes fast 2012
Erasure codes fast 2012
 
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORYMAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
MAC: A NOVEL SYSTEMATICALLY MULTILEVEL CACHE REPLACEMENT POLICY FOR PCM MEMORY
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning Algorithm
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
 
Hiding data in hard drive’s service areas
Hiding data in hard drive’s service areasHiding data in hard drive’s service areas
Hiding data in hard drive’s service areas
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorial
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Cache design
Cache design Cache design
Cache design
 

Viewers also liked

Party.pptx (1)
Party.pptx (1)Party.pptx (1)
Party.pptx (1)14judit
 
رحلتي إلى أمريكا
رحلتي إلى أمريكارحلتي إلى أمريكا
رحلتي إلى أمريكاAmr Ahmed
 
Media Studies A2 - Evaluation Question 4
Media Studies A2 - Evaluation Question 4Media Studies A2 - Evaluation Question 4
Media Studies A2 - Evaluation Question 4LeahSimmons97
 
The Signature Designer-Makeups & Candid Photography
The Signature Designer-Makeups & Candid PhotographyThe Signature Designer-Makeups & Candid Photography
The Signature Designer-Makeups & Candid PhotographyThe Signature Designer
 
Las plantas son seres vivos
Las plantas son seres vivosLas plantas son seres vivos
Las plantas son seres vivosjhennyna
 
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...CEKinney
 

Viewers also liked (7)

Party.pptx (1)
Party.pptx (1)Party.pptx (1)
Party.pptx (1)
 
رحلتي إلى أمريكا
رحلتي إلى أمريكارحلتي إلى أمريكا
رحلتي إلى أمريكا
 
Media Studies A2 - Evaluation Question 4
Media Studies A2 - Evaluation Question 4Media Studies A2 - Evaluation Question 4
Media Studies A2 - Evaluation Question 4
 
The Signature Designer-Makeups & Candid Photography
The Signature Designer-Makeups & Candid PhotographyThe Signature Designer-Makeups & Candid Photography
The Signature Designer-Makeups & Candid Photography
 
Las plantas son seres vivos
Las plantas son seres vivosLas plantas son seres vivos
Las plantas son seres vivos
 
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...
Serving the Wounded Warrior - Veterans with Disabilities and College Disabili...
 
Presentation of Douglas Avelar
Presentation of Douglas AvelarPresentation of Douglas Avelar
Presentation of Douglas Avelar
 

Similar to PowerAlluxio

Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)DataCore APAC
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageDataCore Software
 
Acunu Whitepaper v1
Acunu Whitepaper v1Acunu Whitepaper v1
Acunu Whitepaper v1Acunu
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computersshopnil786
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaOpenNebula Project
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageredpel dot com
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage ReductionPerforce
 
Difference Between San And Nas
Difference Between San And NasDifference Between San And Nas
Difference Between San And NasJill Lyons
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfanil0878
 
Operating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementOperating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementAjit Nayak
 
Big Data Glossary of terms
Big Data Glossary of termsBig Data Glossary of terms
Big Data Glossary of termsKognitio
 
Chapter 9 OS
Chapter 9 OSChapter 9 OS
Chapter 9 OSC.U
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersRyousei Takano
 

Similar to PowerAlluxio (20)

Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
 
Acunu Whitepaper v1
Acunu Whitepaper v1Acunu Whitepaper v1
Acunu Whitepaper v1
 
Opetating System Memory management
Opetating System Memory managementOpetating System Memory management
Opetating System Memory management
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computers
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
 
os
osos
os
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
 
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
[IJET-V1I6P11] Authors: A.Stenila, M. Kavitha, S.Alonshia
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
 
Challenges in Managing IT Infrastructure
Challenges in Managing IT InfrastructureChallenges in Managing IT Infrastructure
Challenges in Managing IT Infrastructure
 
Difference Between San And Nas
Difference Between San And NasDifference Between San And Nas
Difference Between San And Nas
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 
Operating system
Operating systemOperating system
Operating system
 
Operating Systems Part III-Memory Management
Operating Systems Part III-Memory ManagementOperating Systems Part III-Memory Management
Operating Systems Part III-Memory Management
 
Big Data Glossary of terms
Big Data Glossary of termsBig Data Glossary of terms
Big Data Glossary of terms
 
Chapter 9 OS
Chapter 9 OSChapter 9 OS
Chapter 9 OS
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 

PowerAlluxio

  • 1. PowerAlluxio: Fast Storage System with Balanced Memory Utilization Chi-Fan Chu cfchu@umich.edu University of Michigan, Ann Arbor Chao-Han Tsai chatsai@umich.edu University of Michigan, Ann Arbor Abstract With high speed network at reasonable price, sharing memory across machines became a feasible architec- ture for cluster computing. In addition to disk, under utilized cluster memory provides stressful machines an- other choice for data placement. With high network-to- disk-bandwidth ratio (NDR), fetching files from memory on remote machine is even faster than local disk read. We developed PowerAlluxio, an in-memory file sys- tem based on Alluxio, that provide shared memory ab- straction. PowerAlluxio improves cluster average task completion time by 14.11× when data can be fully cached in all machines. To further improve efficiency, we proposed a new data eviction policy Smart LRU (SLRU) which helps PowerAlluxio reduce elapse time by 24.76% when dealing with large data set. Another observation was made that even though the traditional disk local- ity problem became less important, the memory locality turns out to be a critical factor that dominates application performance. 1 Introduction The need for distributed file system arises as industry needs fault-tolerant and scalable distributed storage so- lutions. Large-scale computing applications often in- volve reading and writing high volumes of data, and will produce huge amount of intermediate files. Therefore, an efficient and scalable storage solution is required to store these files. Google file system [1] and Hadoop distributed file system are proposed for large distributed data-intensive applications. The bottleneck of these distributed file system is that they are I/O-bounded. Although read performance can be improved by caching data in memory, write perfor- mance cannot be improved as data are required to be replicated across nodes for fault tolerance. Current approach to resolve the disk I/O bottleneck is either through in-memory file system or network RAIDs. Alluxio [2, 3], previously Tachyon, is an in-memory file system leveraging the concept of lineage to improve the performance of writes without compromising fault toler- ance. Flat datacenter storage [4] is proposed to divide files into multiple partitions and store each partitions on different machines. With full bisection network band- width, clients are able to exploit the full disk bandwidth within the cluster to reduce single disk I/O bottleneck. High-speed network becomes affordable and avail- able. High-performance Remote Direct Memory Access (RDMA) capable networks such as InfiniBand FDR can provide 5 GB/s bandwidth with the cost to of NIC con- tinuing to drop. These becomes new options for server inter-connection inside the datacenter. With InfiniBand network connection, read from the memory on remote machine is faster than read from lo- cal disk. The disk bandwidth on a SATA 7200 rpm hard drive is 130 MB/s. The read bandwidth via InfiniBand FDR is 2.9 GB/s. Therefore, fully utilize the memory on all machines to reduce disk I/Os can greatly improve the throughput. Also, memory locality becomes a significant factor that should be considered in the file system design. We implemented our ideas based on Alluxio, an ex- isting in-memory file system project. We call it Power- Alluxio, which uses shared-memory scheme to exploit the memory on all machines and increase the memory utilization in the cluster without sacrificing memory lo- cality. 
Clients can efficiently use the memory on remote machines to avoid potential disk I/Os. A set of experi- ments are carried out to evaluate the performance boost. According to experiments, running PowerAlluxio in a simulated high-speed network environment with 40.93 network-to-disk-bandwidth ratio (NDR), we can achieve 14.11× speed up as opposed to original Alluxio. This paper is organized as follows: §2 provides the background on RDMA and Alluxio. §3 gives an overview on how PowerAlluxio works. §4 highlights some implementation details about PowerAlluxio. Ex-
  • 2. periment results and analysis of PowerAlluxio and Al- luxio appear in §5. §6 discusses the challenges we en- countered and the future direction of this project. Finally, we summarize our conclusions in §7. 2 Background 2.1 RDMA Remote Direct Memory Access (RDMA) enables direct access of memory on remote machines. Messages are sent to the NIC and handled by remote NICs without in- volving CPUs. It allows zero-copy transfers to reduce kernel crossing overhead. RDMA NICs provide reliable and efficient transmission by applying hardware-level re- transmission for lost packets and kernel by-pass for every communication. However, RDMA was not widely used in data centers in the past due to expensive cost. InfiniBand is a communication standard with very high throughput and very low latency. It supports RDMA operations and it is widely used for server interconnect in the high performance computing (HPC) community. 2.2 Alluxio Alluxio [2] is an in-memory file system that leverage the lineage to avoid disk write without compromising fault- tolerance. Alluxio supports major distributed file sys- tem such as Hadoop file system (HDFS) or Amazon S3 as under persistent layer. On write operations, Alluxio first writes files to memory storage, and writes the cor- responding lineage to persistent layer. The files are then asynchronously checkpointed to the persistent layer. To illustrate how lineage works, consider the follow- ing example. Assume that we have a program P that reads input files A1,A2,...,An, and writes output files B1,B2,...,Bm. Before Alluxio writes its output, it will first record the corresponding lineage information L to persistent layer. Lineage will contains all the informa- tion required to run the program, such as the location of input files, the configuration, and the command. Assume that Bi is lost, we can re-compute the missing file Bi with the lineage information L. Alluxio consists of a centralized master and workers on each machine, which is similar to the architecture of Google file system and HDFS. Alluxio master is respon- sible for managing the metadata of files, periodically check-pointing the files to persistent layers and allocate resource for re-computations. Worker is responsible for data storage. Files are split to multiple blocks and placed on workers. Alluxio master keeps tracking location of each block. Every worker has a daemon that manages local resource and reports the memory usage to master by periodic heartbeat. 3 Approach Overview 3.1 Problems of Alluxio The in-memory file systems dramatically shorten appli- cation run time by caching part of the file system in- side the memory, such that the number of low-bandwidth disk read/write operations can be reduced. Alluxio, as the leading project of distributed in-memory file system, have been deployed by many companies and demon- strates its strength for cluster computing. However, we found that the current design of Alluxio fails to use cluster memory efficiently. Space intensive applications manipulate very large files which cannot fit into the memory on a single machine. Once a client runs out of space on an Alluxio worker, with the default con- figuration (local first policy), the worker starts evicting blocks from its memory storage even though other work- ers have free memory space. 
The client accessing the block that has been evicted triggers cache miss in Al- luxio, which then either read the file from disk (if the file has been checkpointed), or recompute the file with lineage information (if the file is not persisted). Either way slows down the client. With default storage policy, a client can only utilize the memory space of a single worker. Since memory usage of machines is typically unbalanced within cluster, other workers often have free memory space for extra data storage. With the help of high-speed network, it is preferred caching data on the memory of any remote machine than storing the data on local disk. Accessing data through network is faster than reading data from disk. Although Alluxio can be configured with advanced settings such as selecting a worker with round robin fash- ion or selecting a worker with the most available space among all workers so that the client is not restricted to a single worker capacity. However, these approaches lose memory locality. Even though the cluster-wide memory utilization increases, the latest data, which tends to be the ”hottest data”, is possible to be placed on remote ma- chines rather than the local machine. Recent research [5] shows disk-locality is irrelevant, but memory locality remains a critical issue. After all, local memory still out- performs existing network in terms of bandwidth. The system should cache the latest data as close to the client as possible. 3.2 PowerAlluxio Based on Alluxio codebase, we implemented our ideas as new features. To differentiate the projects, we call our system PowerAlluxio. The primary goal of PowerAl- luxio is to increase cluster-wide memory utilization with- out sacrificing memory locality. To further boost the per- 2
3.2 PowerAlluxio

Based on the Alluxio codebase, we implemented our ideas as new features. To differentiate the two projects, we call our system PowerAlluxio. The primary goal of PowerAlluxio is to increase cluster-wide memory utilization without sacrificing memory locality. To further boost performance, we propose a new data eviction mechanism, Smart LRU (SLRU). Figure 1 shows the complete workflow.

Figure 1: Flow of remote worker data block transfer.

3.2.1 Shared Memory Storage Space

With PowerAlluxio, a client is not restricted to the storage space of a single worker. From the client's perspective, PowerAlluxio provides a shared memory storage abstraction: lightly loaded workers assist heavily loaded workers by caching extra blocks, which increases cluster-wide memory utilization. Furthermore, to exploit memory locality, a client always writes files to its local worker. If the local worker runs out of space, instead of directly evicting blocks it handles the following two cases (see the sketch at the end of this subsection):

1. The evicted data is persisted on disk: the local worker immediately removes the blocks from its memory storage to create space, which avoids blocking the client. A background process then fetches the evicted data from the persistent storage layer and transfers it to another worker, where it is cached. Once this finishes, the data blocks can be accessed from remote memory over the network.

2. The evicted data is NOT persisted on disk: the local worker blocks the client and transfers the blocks to another worker before evicting them, because the cost of recomputation is expected to be higher than the cost of blocking the client while the data is transferred.

A data transfer may cause the receiving worker to run out of space in turn, so that it keeps pushing data to other machines and causes a chain effect. To prevent this, we add an argument to the RPC interface that lets workers differentiate incoming write streams from clients and from workers. If a stream comes from a worker, no more data is forwarded to other machines and the chain stops. Since the total amount of data cached in cluster memory increases while memory locality is preserved, we expect PowerAlluxio to improve the cluster's average task completion time.
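The following is a minimal sketch of the eviction decision in the two cases above, in Java. The BlockStore and TransferService interfaces and all method names are assumptions made for illustration, not PowerAlluxio's actual code.

    // Hypothetical sketch of how a worker handles a block it must evict to create space.
    interface BlockStore {
        long pickLruVictim();                 // choose the least recently used block
        boolean isPersisted(long blockId);    // already checkpointed to the persistent layer?
        void removeFromMemory(long blockId);  // drop the in-memory copy
    }

    interface TransferService {
        void transferInBackground(long blockId); // re-read from the persistent layer, then push
        void transferBlocking(long blockId);     // push the in-memory copy before eviction
    }

    public class EvictionHandler {
        private final BlockStore store;
        private final TransferService transfer;

        public EvictionHandler(BlockStore store, TransferService transfer) {
            this.store = store;
            this.transfer = transfer;
        }

        public void makeSpace() {
            long victim = store.pickLruVictim();
            if (store.isPersisted(victim)) {
                // Case 1: drop the in-memory copy right away so the client is not blocked,
                // then re-read it from the persistent layer and push it to a remote worker.
                store.removeFromMemory(victim);
                transfer.transferInBackground(victim);
            } else {
                // Case 2: the only copy lives in memory; block the client while the block
                // is shipped to a remote worker, which is cheaper than recomputing it later.
                transfer.transferBlocking(victim);
                store.removeFromMemory(victim);
            }
        }
    }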
3.2.2 Data Eviction Policy

When it runs out of space, a worker needs to decide which data block to send, whom to send it to, or whether it should simply evict the block because it will not be accessed again. According to prior analyses [3, 6, 7], data-intensive applications have the following characteristics:

• Access Frequency: file access often follows a Zipf-like distribution.

• Access Temporal Locality: 75% of re-accesses take place within 6 hours.

Based on these characteristics, Alluxio uses LRU as its default eviction policy. Every worker maintains a local LRU queue over the blocks it holds, and the queue is updated whenever an operation is performed at that worker. In PowerAlluxio, however, all memory is shared, and a local LRU cannot provide the information needed for block transfer. For example, the worker holding the oldest blocks is the ideal transfer destination, but it cannot be identified without a cluster-wide LRU. An easy solution is to maintain a global LRU at the master. We did not choose this approach because the master could become a scaling bottleneck, as every operation would need to go through it. Instead, PowerAlluxio uses an approximate global LRU that we call Smart LRU (SLRU). The details are as follows:

1. Blocks carry last-access-time information: a local LRU does not need timestamps; on every operation it simply moves the accessed block to the end of the queue, so the front of the queue is always the oldest block. To compare the age of blocks from different workers, we add an extra attribute to the block metadata that records the last access time. At the transfer destination, the recorded time determines the position of the transferred block in the local LRU queue, so PowerAlluxio does not disturb the client's data access history.

2. The master tracks the age of each worker: the age of a worker is defined as the average age of a number (tunable according to the cluster workload) of its oldest blocks. Each worker reports its age to the master in the periodic heartbeat and asks the master which worker is currently the oldest when choosing a transfer destination.

3. Do not move stale blocks: if an evicted block has not been accessed for longer than a configurable period, called the active time threshold, it is unlikely to be accessed again; transferring such stale blocks only consumes network bandwidth with no benefit.

Although the age information kept at the master might not be fresh because of the relatively low heartbeat frequency (once per second by default), we consider this acceptable: the blocks at the worker that was oldest as of the last heartbeat are still comparatively old blocks in the cluster. We found that SLRU works surprisingly well in practice, further reducing PowerAlluxio's average task finish time by up to 24.76%.
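The sketch below illustrates, in Java, how the SLRU destination choice described above could look: stale blocks are simply dropped, and otherwise the oldest worker with enough free space is returned. The class and field names are hypothetical and not the actual PowerAlluxio implementation.

    // Hypothetical sketch of SLRU destination selection, assuming the per-worker ages
    // reported in heartbeats are available.
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class WorkerAge {
        final String workerAddress;
        final long avgOldestBlockAgeMs; // average age of the N oldest blocks, from heartbeat
        final long freeBytes;
        WorkerAge(String addr, long ageMs, long freeBytes) {
            this.workerAddress = addr;
            this.avgOldestBlockAgeMs = ageMs;
            this.freeBytes = freeBytes;
        }
    }

    class SlruPlanner {
        private final long activeTimeThresholdMs; // blocks idle longer than this are stale

        SlruPlanner(long activeTimeThresholdMs) {
            this.activeTimeThresholdMs = activeTimeThresholdMs;
        }

        // Returns the worker that should receive the evicted block, or empty if the
        // block is stale and should be dropped rather than transferred.
        Optional<String> chooseDestination(long blockLastAccessMs, long blockSizeBytes,
                                           List<WorkerAge> workers, long nowMs) {
            if (nowMs - blockLastAccessMs > activeTimeThresholdMs) {
                return Optional.empty(); // stale block: do not waste network bandwidth
            }
            return workers.stream()
                    .filter(w -> w.freeBytes >= blockSizeBytes)                 // must have room
                    .max(Comparator.comparingLong(w -> w.avgOldestBlockAgeMs))  // the "oldest" worker
                    .map(w -> w.workerAddress);
        }
    }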
4 Implementation Highlights

This section describes some key parts of PowerAlluxio. First, we introduce the new RPC interfaces that facilitate inter-worker communication. Second, we explain several design challenges behind the advanced features of PowerAlluxio. Last, we describe how we simulate a high-speed network in order to conduct experiments.

4.1 New RPC Interfaces

In the original Alluxio, all commands and messages between clients, the master, and workers are issued through remote procedure calls (RPC). There are three major types of RPC interfaces:

1. Client-side library to master, for file-metadata operations such as registering new files at the master and querying the workers associated with a specific file.

2. Client-side library to worker, for direct data read/write streams from clients to workers.

3. Worker to master, for worker registration and data block commits.

The original Alluxio architecture is centralized, with no direct communication between workers. Workers report runtime statistics to the master through periodic heartbeats, and the master passes instructions back to workers in the heartbeat responses. To realize PowerAlluxio, we added new RPC interfaces:

1. Worker to master: getFileInfoWithBlock() lets a worker look up file metadata given a block ID. When data is transferred in the background (the target data block is immediately removed from worker memory storage so as not to block the client, see §3.2.1), the worker reads the evicted data from disk and needs to know which file, and which offset within it, the evicted block belongs to (a file may be divided into several blocks). Therefore, a reverse mapping from block ID to file ID is maintained at the PowerAlluxio master. getWorkerInfoList() lets workers query the availability and age of all workers to choose a data transfer destination; the master returns the address, capacity, used bytes, and age of every worker in the response.

2. Worker to worker: the Netty transport-layer module from the client-side library is reused and adapted to facilitate data transfer between workers. Because the transferred data size is known, the buffer size can be set accordingly to improve efficiency. An identifier is added to the interface so that stream receivers can distinguish client senders from worker senders. Different operations are performed when a worker sender commits a data block, such as restoring the last access time and inserting the block at the correct position in the LRU queue.
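For illustration, the two new worker-to-master calls could have signatures along the following lines. The method names getFileInfoWithBlock() and getWorkerInfoList() come from our implementation, but the parameter and return types shown here are simplified assumptions rather than the exact RPC definitions.

    // Hypothetical Java signatures for the new worker-to-master RPCs described above.
    import java.util.List;

    interface PowerAlluxioMasterService {
        // Look up the file (and the offset within it) that a given block belongs to,
        // using the reverse block-ID-to-file-ID mapping kept at the master.
        FileInfoWithBlock getFileInfoWithBlock(long blockId);

        // Return the address, capacity, used bytes, and SLRU age of every worker so a
        // worker can pick a transfer destination.
        List<WorkerDescriptor> getWorkerInfoList();
    }

    class FileInfoWithBlock {
        long fileId;
        String filePath;
        long offsetInFile;   // where this block starts within the file
        long blockSizeBytes;
    }

    class WorkerDescriptor {
        String address;
        long capacityBytes;
        long usedBytes;
        long ageMs;          // worker age as defined by SLRU
    }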
4.2 BlockMover Module

Every worker has a BlockMover module. When space runs out, the eviction planner triggers BlockMover with a list of block IDs. To avoid thread creation and termination overhead, BlockMover maintains a thread pool that serves background block transfers.

Once a worker evicts a block from its storage, it uses the heartbeat to ask the master to remove it from the block's holder list; each block is associated with a holder list that records the workers storing that block. Because of the fixed heartbeat frequency, the state at the worker and at the master may be temporarily inconsistent. In the original Alluxio this is acceptable: a client directed to this worker for the file would simply trigger a cache miss. In PowerAlluxio, however, this cache miss may be unnecessary, since the block might already have been transferred and cached at another worker. We want the master to update its state faster to avoid this inconsistency. Increasing the heartbeat frequency would solve the problem but introduces significant overhead at the master; instead, workers send an immediate heartbeat right after transferring blocks.

During implementation, we found a concurrency bug in the original Alluxio that occurs when two clients simultaneously access a file that was previously evicted. Both trigger a cache miss and both start caching data under the same block ID (it is the same file and the same offset). The one that commits later is thrown a BlockAlreadyExistsException, which crashes the client program while Alluxio keeps running. This bug, however, crashes PowerAlluxio in a similar scenario in which a client and a worker cache the same file at the same place and the client commits first, and the frequency of such failures is not negligible. We finally solved the problem by changing the exception-handling logic of the Netty transport layer to perform appropriate garbage collection at the remote worker.

4.3 Simulating a High-Speed Network

Integrating RDMA into PowerAlluxio is extremely difficult because of limited hardware resources and the lack of reliable third-party software drivers. Since releasing deployable software is not our primary goal, we decided to run PowerAlluxio on a simulated high-speed network over an Ethernet setup. Relative network speed can be characterized by the network-to-disk bandwidth ratio (NDR), defined as

NDR = network bandwidth / disk bandwidth

By fixing the network bandwidth and reducing the disk bandwidth, we can raise the NDR just as if we were using a faster network.

Limiting disk bandwidth in the kernel is no easy task either. To simulate a reduced-bandwidth disk, our first attempt was to read the same file extra times; however, since the file may be cached by the kernel, those reads may never reach the disk. As an alternative, we follow every read operation with multiple write-and-flush operations of a dummy file of the same size, based on the assumption that the sequential read and sequential write speeds of an HDD are of the same order of magnitude. By changing the number of dummy writes per read, we create various effective disk bandwidths for the experiments.
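The following is a minimal sketch of this application-level throttling in Java: every read is followed by k write-and-flush operations of a same-sized dummy file, lowering the effective disk bandwidth roughly by a factor of k + 1 under the assumption above. The class is a simplified illustration, not the code used in our experiments.

    // Hypothetical sketch of application-level disk throttling: each real read is
    // penalized with k dummy writes of the same size, flushed to the physical device.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ThrottledDiskReader {
        private final int dummyWritesPerRead; // k: tunes the simulated NDR
        private final Path dummyFile;

        public ThrottledDiskReader(int dummyWritesPerRead, Path dummyFile) {
            this.dummyWritesPerRead = dummyWritesPerRead;
            this.dummyFile = dummyFile;
        }

        public byte[] read(Path file) throws IOException {
            byte[] data = Files.readAllBytes(file);
            for (int i = 0; i < dummyWritesPerRead; i++) {
                try (FileChannel ch = FileChannel.open(dummyFile,
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING)) {
                    ch.write(ByteBuffer.wrap(data));
                    ch.force(true); // flush so the kernel page cache cannot hide the cost
                }
            }
            return data;
        }
    }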
5 Evaluation

In this section, we first measure the performance of InfiniBand FDR. Based on the measurement results, we set up a simulated high-speed network in an Ethernet environment and conducted three experiments to compare the performance of PowerAlluxio and Alluxio.

5.1 InfiniBand

We measured the transmission bandwidth over InfiniBand FDR between a server and a client. Each machine had two Xeon E5-2650v2 processors (8 cores each, 2.6 GHz), 64 GB of memory, two 1 TB 7200 rpm disks, and a Mellanox FDR CX3 NIC supporting 40 Gbps InfiniBand FDR.

To find how file size influences the transmission bandwidth, we transferred a file in a single session and repeated the process 100 times to obtain the average bandwidth. For file sizes from 4 KB to 1 MB, we transferred the file in a single message. For file sizes from 4 MB to 2 GB, we divided the file into multiple 1 MB partitions and transferred each partition in a single message. Figure 2 shows that the transmission bandwidth continues to grow with file size and saturates at the GB level.

Figure 2: Bandwidth grows linearly with file size and saturates when the file size reaches the GB level.

To discover how message size affects bandwidth, we transferred a 512 MB file (the default block size of Alluxio) with messages of different sizes. The peak bandwidth occurred when the message size was 64 KB, as shown in Figure 3.

Figure 3: Transferring a 512 MB file with different message sizes.

5.2 Scenario 1: All Files Can Be Cached

A client read multiple files on PowerAlluxio, and all files could be cached in the memory of either the local machine or remote machines. In this setting, we had a master and three workers, and each worker was configured with 2 GB of memory for data storage. All machines were configured with two Xeon E5-2650v2 processors (8 cores each, 2.6 GHz), 64 GB of memory, two 1 TB 7200 rpm disks, and an Intel X520 PCIe dual-port 10 Gbps Ethernet NIC.

The client first read 30 files from persistent storage. The file sizes ranged uniformly from 150 MB to 220 MB. The client then randomly performed 1000 read operations on these files.
The total file size was 5633 MB, which could be fully cached in the memory of the three machines (6 GB in total).

Figure 4 shows the runtime of Alluxio and PowerAlluxio with different network-to-disk bandwidth ratios, and Figure 5 shows the speed-up of PowerAlluxio over Alluxio with different NDR. PowerAlluxio achieves a 14.11× speed-up when the NDR is 40.93.

Figure 4: Runtime of Alluxio and PowerAlluxio with different network-to-disk bandwidth ratios (NDR) when all files can be cached.

Figure 5: Speed-up of PowerAlluxio over Alluxio with different NDR when all files can be cached.

The major factor limiting further speed-up was the first disk read of the persistent files, which unfortunately is inevitable in practice. To show this, we repeated the experiment but omitted the time for the first disk read. Figure 6 shows the runtime of Alluxio and PowerAlluxio with different NDR, and Figure 7 shows the corresponding speed-up. The speed-up is 34.57× when the NDR is 40.93.

Figure 6: Runtime of Alluxio and PowerAlluxio with different NDR when all files can be cached; the time for the first read from disk is omitted.

Figure 7: Speed-up of PowerAlluxio over Alluxio with different NDR when all files can be cached; the time for the first read from disk is omitted.

5.3 Scenario 2: Multiple Clients

Three clients on three different machines read and write files on PowerAlluxio. Client1 performs 1000 read operations on 30 files of 100 MB to 160 MB each. Client2 performs 3000 read operations on 20 files of 40 MB to 55 MB each. Client3 performs 1200 operations on 10 files of 80 MB to 120 MB each. We intentionally kept the working sets of Client2 and Client3 smaller than a single worker's space. The hardware specification was the same as in the previous experiment.

Figure 8 shows the completion times under different configurations. The default storage policy of the original Alluxio makes each client use only the memory of its local machine. Client1 performed operations on files with a total size of 3968 MB, which could not be fully cached in the 2 GB of memory on its local machine, so its completion time was long because of the many disk I/Os caused by cache misses. Client2 and Client3 performed operations on files with total sizes of 1007 MB and 1008 MB respectively, which could be fully cached in the memory of their local machines, so their completion times were much shorter.

Another storage policy of the original Alluxio, the most-available policy, assigns the worker with the most free memory to the client. Since Client1 could exploit memory on remote machines under this policy, its completion time was shorter than under the default policy. For Client2 and Client3, however, the most-available policy scattered their files across workers and they lost memory locality, so their completion times were much longer than under the default policy. Although the most-available policy lets a client use remote memory, the loss of memory locality is a critical issue for clients with small working sets.

Figure 8: Runtime of multiple clients reading files with Alluxio, Alluxio with the most-available policy, and PowerAlluxio.

PowerAlluxio always stores a client's latest files on the local machine; when out of space, the PowerAlluxio worker transfers older data according to SLRU. In this way, PowerAlluxio guarantees memory locality and allows higher memory utilization. As a result, PowerAlluxio dramatically shortened Client1's completion time without affecting Client2 and Client3 much: it reduced the average task finish time by 59.87%, while the most-available policy reduced it by only 24.11%.

5.4 Scenario 3: Files Exceeding Space

A client read large files from persistent storage, and all files together exceeded the total worker space. In this setting, we had a master and three workers.
All machines were configured with two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge), 256 GB of memory, two 1 TB 7200 rpm disks, and a dual-port Intel 10 Gbps NIC. Each Alluxio worker was configured with 2 GB of memory space.

A client performed 1000 read operations on 40 files. The file sizes ranged from 150 MB to 250 MB: 20 files were accessed very frequently, 10 frequently, and 10 rarely. The total file size was 8087 MB, which could not be fully cached across the three machines. The same workload was also run on PowerAlluxio with SLRU disabled; we expected SLRU to improve client performance.

Figure 9 shows the runtime to read these large files with different NDR in Alluxio, PowerAlluxio (SLRU disabled), and PowerAlluxio. Figure 10 shows the speed-up of PowerAlluxio and PowerAlluxio (SLRU disabled) over Alluxio with different NDR. Enabling SLRU boosts the performance of PowerAlluxio by 24.76%.
Figure 9: Runtime to read very large files with different NDR.

However, compared to Scenario 1, the achieved speed-up is much lower and saturates once the NDR reaches 16. This is caused by extra cache misses, since the total file size exceeds the workers' space. PowerAlluxio without SLRU reduced the number of cache misses from 670 to 180, and SLRU reduced cache misses by another 30. As shown in §5.2, any disk I/O severely hurts the speed-up; the upper bound on the speed-up in this scenario is therefore as low as 670/150 ≈ 4.46.

Figure 10: Speed-up of PowerAlluxio and PowerAlluxio (SLRU disabled) over Alluxio with different NDR.

Figure 11 shows that the optimum active time threshold (see §3.2.2) scales linearly with the NDR, which means the active time threshold is highly predictable. This observation is useful when a cluster administrator upgrades hardware: the optimum active time threshold can easily be predicted and set according to the new hardware specifications.

Figure 11: Optimum active time threshold with different NDR.

6 Discussion

6.1 Challenges Encountered

RDMA integration: since there is no unified software solution for all hardware, integrating RDMA into PowerAlluxio does not generalize across hardware platforms. We tried the open-source library JXIO, which supports the Mellanox FDR CX3 NICs, but JXIO was very unstable and often crashed unexpectedly on our hardware, so we decided not to integrate it into PowerAlluxio.

Experiment environment: our experiments were conducted on a CloudLab cluster. However, CloudLab provides limited disk space per user (12 GB per machine), which means we cannot run workloads whose total file size exceeds 12 GB. Furthermore, since CloudLab provides virtual machines rather than physical machines, sharing disks with other users may result in unstable disk bandwidth.

Time synchronization across workers: SLRU involves comparing timestamps across machines, and the more precisely the clocks are synchronized, the more reliable SLRU's decisions become. Currently no extra synchronization is implemented beyond using each machine's local clock. The time difference between machines is within a few seconds, which is acceptable in our experiments.

6.2 Future Work

Benchmark coverage: from our experiments we learned that the speed-up of PowerAlluxio over Alluxio degrades when not all files can be cached. How to efficiently and accurately predict this performance degradation for a given
workload remains an interesting question. In addition, application trace data from well-known companies would make a great benchmark for future evaluation.

Better ways to limit disk bandwidth: in our experiments, disk bandwidth was reduced at the application level. We believe the measurement results obtained in this setting give some sense of how PowerAlluxio performs with a high-speed network; the next step is to reduce disk bandwidth at the kernel level for more accurate control of the variables.

User-level framework overhead vs. kernel modification: we noticed that user-level framework overhead became the bottleneck of PowerAlluxio even with a 10 Gbps Ethernet connection. For example, it took PowerAlluxio as much as 1200 ms to read a 250 MB file from a remote worker, while at full bandwidth the file should be transmitted within 200 ms. We cannot fully exploit the high-speed network because PowerAlluxio is a user-level framework that suffers from unnecessary kernel crossings. However, even though implementing PowerAlluxio at the kernel level could greatly improve its speed, the system would become too complicated and the development cost would be unbearable.

7 Conclusion

Machine memory utilization is often unbalanced in current distributed in-memory file systems. We proposed PowerAlluxio, an in-memory file system that allows a client to exploit the memory of all machines within the cluster. In addition, as the price of high-speed networks such as InfiniBand FDR continues to drop, high-speed networking in the datacenter becomes an affordable option. Combining a high-speed network with PowerAlluxio, reading files from the memory of a remote machine is much faster than reading from the local disk.

Our experiment results show that when all files can be cached in memory, PowerAlluxio improves the cluster's average task completion time by 14.11× over the original Alluxio. When dealing with large datasets that cannot be fully cached in cluster memory, our proposed Smart LRU (SLRU) eviction policy helps PowerAlluxio reduce completion time by 24.76%.

The takeaway from this project is that as high-speed networks become widely available, utilizing memory on remote machines becomes feasible. However, memory locality emerges as a significant factor that can greatly influence system performance, and application-level framework overhead becomes a non-negligible factor that limits the speed-up obtainable from a high-speed network.

8 Acknowledgements

We thank Mosharaf Chowdhury for inspiring us to start this project and providing many great suggestions. Special thanks to Yupeng, Cheng, and Gene Pang from Alluxio Inc. for answering questions regarding the Alluxio code base. We also thank our friends En-Shuo Hsu, Shang-En Huang, and Yayun Lo for sharing industrial experiences as well as excellent feedback. Last but not least, we thank CloudLab for providing the hardware for our experiments.

References

[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29–43, New York, NY, USA, 2003. ACM.

[2] Alluxio Inc. http://alluxio.org/.

[3] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 6:1–6:15, New York, NY, USA, 2014. ACM.

[4] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. Flat datacenter storage.
In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 1–15, Hollywood, CA, 2012. USENIX.

[5] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Disk-locality in datacenter computing considered irrelevant. In Proceedings of the 13th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIII), 2011. USENIX.

[6] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[7] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 7:1–7:13, New York, NY, USA, 2012. ACM.