Gather-Arrange-Scatter: Node-Level Request Reordering for Parallel File Systems on Multi-Core Clusters

Kazuki Ohta1, Hiroya Matsuba2, and Yutaka Ishikawa1,2
1 Graduate School of Information Science and Technology, The University of Tokyo
2 Information Technology Center, The University of Tokyo
{kzk@il.is.s, matsuba@cc, ishikawa@is.s}.u-tokyo.ac.jp
Abstract—Multiple processors and multi-core CPUs are now common, and the number of processes running concurrently in a cluster is increasing. Each process issues contiguous I/O requests individually, but they can be interrupted by the requests of other processes if all the processes enter the I/O phase together. The I/O nodes then handle these requests as non-contiguous. This increases the disk seek time and causes performance degradation.

To overcome this problem, a node-level request reordering architecture, called the Gather-Arrange-Scatter (GAS) architecture, is proposed. In GAS, the I/O requests issued on the same node are gathered and buffered locally. They are then arranged and combined to reduce the I/O cost at the I/O nodes, and finally scattered to the remote I/O nodes in parallel. A prototype is implemented and evaluated using the BTIO benchmark. The system reduces the number of lseek() calls at the I/O nodes by up to 84.3% and the number of requests by up to 93.6%. This results in up to a 12.7% performance improvement compared to the non-arranged case.

I. INTRODUCTION

A growing number of scientific applications, such as experimental physics, computational biology, astrophysics, and genome analysis, need to handle terabytes of data. But the bandwidth of a single disk is far too low for data of such a size, so much prior research has addressed how to effectively utilize multiple disks as a single file system [1]–[5].

Such scientific applications usually run on high-performance commodity clusters with fast processors and high-bandwidth, low-latency interconnects. Such a cluster often consists of compute nodes and I/O nodes. Usually, compute nodes have low-bandwidth, low-capacity commodity disks, while I/O nodes have high-bandwidth, high-capacity, high-end disks. In addition, it has recently become common for one node to have multiple processors or processor cores. In practice, node-level parallelism is exploited by multi-threading or multi-processing, so it is usual to run multiple processes of the same application, or even of different applications, on the same node. Each process tends to issue contiguous I/O requests individually, but if multiple processes issue many I/O requests simultaneously, the requests are handled by the local file system in a non-contiguous way. This is because each process's contiguous requests can be interrupted by the requests of other processes. The number of disk seeks then increases, and the I/O bandwidth falls.

This performance degradation is even more critical in parallel file systems. Some recent supercomputers have more than 100,000 CPU cores, and most scientific applications run for a long time, alternating between computation and disk I/O [6]. If more than 100,000 compute processes enter the I/O phase all together and issue I/O requests simultaneously, one I/O node receives the I/O requests from all of the processes.

To solve this problem, we propose the Gather-Arrange-Scatter (GAS) node-level I/O request reordering architecture for parallel file systems. The main idea is that the I/O requests issued on the same node are gathered and buffered locally; the buffered requests are then arranged in a better order, which reduces the I/O cost at the I/O nodes, and finally scattered to the I/O nodes in parallel.

To gather the requests, the requests are handled asynchronously; that is, write() returns immediately after the request is gathered. This file system approach works for all existing applications that use the POSIX I/O interface. We have designed and implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves the parallel write performance for some I/O-intensive benchmarks.

The rest of this paper is organized as follows. Section II describes the design of the Gather-Arrange-Scatter architecture and the implementation of the PGAS file system. Section III presents the experimental results and compares the PGAS file system with other file systems. Section IV reviews previous efforts at developing distributed/parallel file systems and other techniques for improving parallel I/O performance. Section V concludes this paper.

II. DESIGN AND IMPLEMENTATION

A. Issues

Recent cluster nodes commonly have multiple processors or CPU cores. Usually, multi-threading or multi-processing is used to exploit node-level parallelism, so it is common

Fig. 1. The difference between a) Existing Parallel File Systems and b) the Gather-Arrange-Scatter I/O Architecture

Fig. 2. Detailed View of the Gather-Arrange-Scatter (GAS) Architecture
that processes of the same application, or even of different applications, are running on the same node. Now consider that multiple processes issue I/O requests at the same time.

For example, suppose process A requests write(filename="file1", offset=100, count=40) and write("file1", 140, 40), while process B simultaneously requests write("file1", 180, 40). The file system may handle these requests in the order write("file1", 100, 40), write("file1", 180, 40), write("file1", 140, 40). In this order, twice the number of seeks is required compared to the ideal order.

Usually, each application process tends to issue contiguous requests. But in a multi-core environment, contiguous requests may conflict with each other and be handled in a non-contiguous way. The file system then needs to move the disk head for roughly every request. Because a disk seek is slow compared to memory accesses or CPU cycles, this greatly decreases the I/O bandwidth and thus degrades the application performance. Some recent CPUs have more than four cores, and the number of cores is increasing, so the number of processes running on the same node is growing, and this type of collision is going to happen more often in the near future.

This performance degradation due to disk seek time is even more crucial in parallel file systems. Usually, parallel or distributed file systems are deployed on the I/O nodes, and the I/O requests of each application are sent directly to the I/O nodes through a high-speed interconnect network. This is acceptable if the total number of application processes is small. But in the recent multi-core environment, the I/O nodes may receive many I/O requests at the same time from a large number of processes, and the same disk seek problem occurs more frequently.

B. Gather-Arrange-Scatter

To overcome the problem described in Section II-A, we propose an I/O architecture that achieves node-level request reordering, called the Gather-Arrange-Scatter (GAS) architecture. Figure 1 shows the difference between the existing parallel file system approach and the Gather-Arrange-Scatter architecture. The idea is that the I/O requests issued on the same node are gathered and buffered temporarily. The buffered requests are then arranged in the ideal order, which reduces the number of disk seeks. Finally, they are scattered to the remote disks in parallel.

Figure 2 illustrates a more detailed view of the GAS architecture. There are two important servers, called the dispatcher and the I/O server. A dispatcher is launched on every compute node to take charge of the Gather-Arrange-Scatter phases. An I/O server is launched on every I/O node to handle the I/O requests scattered from the dispatchers.

The specific steps of Gather-Arrange-Scatter are described in the following subsections.

1) Gathering: To gather the I/O requests, the system call behavior is changed to achieve asynchronous I/O. In asynchronous I/O, system calls return immediately after the request is issued, so the application can return immediately to its computation. The queued requests are sent to the dispatcher in the background.

2) Arranging: First, I/O requests are accumulated into an in-memory local buffer that is divided into sub-buffers. The number of sub-buffers equals the number of I/O nodes, and each sub-buffer corresponds to a destination remote disk. The target sub-buffer is decided using information from the request, such as the file name or the offset, depending on the implementation.
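The gather step and the sub-buffer selection can be sketched as follows. This is an illustrative Python model, not the actual PGAS code: the request tuples, the 64KB block size, and the block-index-modulo placement rule are assumptions for the sketch.

```python
from collections import defaultdict

BLOCK_SIZE = 64 * 1024  # assumed stripe block size (64KB, as in the experiments)
NUM_IO_NODES = 4

class Dispatcher:
    """Toy model of a per-node dispatcher: write() only queues the
    request (the Gather step); nothing touches the disk here."""

    def __init__(self):
        # One sub-buffer per I/O node, each holding pending requests.
        self.sub_buffers = defaultdict(list)

    def target_io_node(self, offset):
        # Pick the destination disk from the index of the striped
        # block that contains the requested region.
        return (offset // BLOCK_SIZE) % NUM_IO_NODES

    def write(self, path, offset, data):
        # Asynchronous write: return immediately after queueing.
        self.sub_buffers[self.target_io_node(offset)].append((path, offset, data))

d = Dispatcher()
d.write("file1", 100, b"a" * 40)             # block 0 -> I/O node 0
d.write("file1", 140, b"b" * 40)             # block 0 -> I/O node 0
d.write("file1", 3 * BLOCK_SIZE, b"c" * 40)  # block 3 -> I/O node 3
```

Because write() only appends to an in-memory sub-buffer, the calling process returns to its computation immediately; the dispatcher drains the sub-buffers in the background.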
Normally, a file is striped into equally sized blocks in parallel file systems, and the target sub-buffer may be decided by the index of the block which contains the requested region. Once a sub-buffer is full, or a certain period passes, the buffered I/O requests are arranged in a better order, which minimizes the number of disk seeks. Additionally, multiple contiguous requests are merged into one request in this phase, because it is also critical to issue as few requests as possible to the file system for low I/O latency [7].

Figure 2 also illustrates the buffering operation. In this figure, there are four compute nodes and four I/O nodes. Each dispatcher has four sub-buffers, and each sub-buffer corresponds to a remote I/O server. In the figure, the sub-buffer for node1 is full on node2, so its buffered I/O requests are arranged and sent to node1. The I/O server on node1 receives the requests and performs the actual I/O operations on the local disk.

3) Scattering: The arranged I/O requests are scattered to the corresponding I/O servers in parallel. These requests are handled synchronously by each I/O server, one after another.

Fig. 3. Cluster Network Configuration

TABLE I: THE SPECIFICATIONS OF THE CLUSTER

8 Compute Nodes
  CPU            Opteron 2212 HE (dual-core, 2.0 GHz) x2
  Memory         6 GB
  HDD            Serial ATA disk (50.28 to 53.21 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O Scheduler  CFQ I/O Scheduler

1 Control Node
  CPU            Opteron 2218 HE (dual-core, 2.6 GHz) x2
  Memory         16 GB
  HDD            Serial ATA disk (50.61 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O Scheduler  CFQ I/O Scheduler

C. Implementation

We have implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves the I/O performance in multi-core clusters. The PGAS file system is implemented on the Linux operating system as a C/C++ program of about 6000 lines.
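The arrange rule described above, sort the buffered requests and merge contiguous ones, can be sketched as follows. The tuple representation of a request is an assumption for illustration, not the PGAS data structure.

```python
def arrange(requests):
    """Arrange phase: sort write requests by (path, offset) and merge
    runs of contiguous requests into single larger requests."""
    merged = []
    for path, offset, data in sorted(requests, key=lambda r: (r[0], r[1])):
        if merged:
            p, o, d = merged[-1]
            if p == path and o + len(d) == offset:
                # Contiguous with the previous request: extend it.
                merged[-1] = (p, o, d + data)
                continue
        merged.append((path, offset, data))
    return merged

# The interleaved writes from Section II-A, in arrival order:
reqs = [("file1", 100, b"a" * 40),
        ("file1", 180, b"c" * 40),
        ("file1", 140, b"b" * 40)]
arranged = arrange(reqs)
# After sorting, the three 40-byte writes are contiguous and collapse
# into a single 120-byte request starting at offset 100.
```

One sequential disk pass now suffices where the arrival order would have forced a seek back from offset 180 to offset 140.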
Currently, read requests are handled synchronously in the PGAS file system prototype, and only write requests are gathered by the dispatchers. The PGAS file system also does not have a metadata-handling server at present; it requires a backend NFS server for this purpose. These aspects will be enhanced in the future. A system call hooking library is used to hook the system calls related to file system operations, which enables the implementation of a file system completely in user space.

1) Consistency Model: Next, we discuss the consistency model of the PGAS file system. Most scientific applications call the open() system call in write-only mode. Therefore, the PGAS file system relaxes the POSIX I/O atomicity semantics and concentrates on achieving high performance in write-only mode. The PGAS file system implementation gathers the I/O requests of files opened in write-only mode, and ensures that the data is written to the disk immediately after the application close()s the file descriptor. To enforce this constraint, the application is blocked when it calls the close() system call until all of its I/O requests have been handled by the I/O servers. Because the I/O requests from the dispatcher to the I/O server are handled synchronously, the application only asks the local dispatcher whether all of its requests have been sent. To check this condition, the dispatcher keeps an unhandled-request count for each process.

III. EXPERIMENTAL RESULTS

This section describes our experiments and the results we obtained to evaluate the performance of the PGAS file system. We run an I/O-intensive benchmark on the PGAS file system and on other existing file systems.

A. Cluster Environment

For the experiment, we have eight compute nodes and one control node. The network configuration of the experiment cluster is shown in Figure 3, and the specifications of each node are shown in Table I. All compute nodes are connected with a 10Gbps Myrinet [8] network, and also with a 1Gbps Ethernet network. On the PGAS file system, the communication between the dispatchers and the I/O servers goes through the Ethernet network. The disk bandwidth is measured with hdparm [9].

1) Software Environment: We use MPICH2 [10] version 1.0.5 as the default MPI library. All software, including MPICH2 and the benchmarks, is compiled with the Intel C Compiler, version 10.0. The other file systems and dependent libraries are compiled with GCC, version 4.1.1. MPICH2 is configured to use the Myrinet-10G network.

B. Other File Systems for Comparison

We run benchmarks on three other file systems: local disks, the Network File System (NFS) in asynchronous mode [11], [12], and the Parallel Virtual File System 2 (PVFS2) [13], [14]. Next, we describe the configuration of each file system.

1) Local Disks: We use local disks to measure the I/O throughput when each process arbitrarily issues requests to the operating system. There is no metadata operation in this case.
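The unhandled-request counter behind the close() guarantee of Section II-C1 can be sketched as follows. The class and method names are assumptions, and the real dispatcher tracks one such count per client process over IPC; this single-process model only shows the blocking rule.

```python
import threading

class RequestTracker:
    """close() blocks until every gathered request has been handled."""

    def __init__(self):
        self._cond = threading.Condition()
        self._unhandled = 0

    def on_gather(self):
        # A write() was queued by the dispatcher.
        with self._cond:
            self._unhandled += 1

    def on_handled(self):
        # An I/O server acknowledged one request.
        with self._cond:
            self._unhandled -= 1
            if self._unhandled == 0:
                self._cond.notify_all()

    def wait_close(self):
        # Called from close(): block until the count drains to zero.
        with self._cond:
            while self._unhandled:
                self._cond.wait()

t = RequestTracker()
t.on_gather(); t.on_gather()
t.on_handled(); t.on_handled()
t.wait_close()  # returns immediately once both requests are acknowledged
```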
TABLE II: BTIO BENCHMARK RESULT (8 NODES / 16 PROCESSES)

Total Execution Time (sec)
Class      Local    NFS       PVFS2     PGAS FS
A (0.4GB)  57.82    686.41    385.20    65.22
B (1.6GB)  114.48   1858.21   1404.58   241.74
C (6.8GB)  422.16   N/A       5358.19   1123.54

Write Bandwidth (MB/sec)
Class      Local    NFS       PVFS2     PGAS FS
A (0.4GB)  25.46    0.65      1.18      341.31
B (1.6GB)  38.57    1.22      1.30      321.24
C (6.8GB)  39.39    N/A       1.36      338.19

TABLE III: BTIO (8N/16P), THE IMPACT OF REQUEST ARRANGEMENT

Number of lseek() calls at I/O servers (times)
Class      NotArranged   Arranged    Reduced
A (0.4GB)  2287540       666259      70.9%
B (1.6GB)  8512074       1713626     79.9%
C (6.8GB)  27801033      4374196     84.3%

Number of I/O Requests (times)
Class      NotArranged   Arranged    Reduced
A (0.4GB)  10490880      666575      93.6%
B (1.6GB)  42468984      3057766     92.8%
C (6.8GB)  170142519     11229406    93.4%

Total Execution Time (sec)
Class      NotArranged   Arranged    Reduced
A (0.4GB)  74.71         65.22       12.7%
B (1.6GB)  260.47        241.74      7.2%
C (6.8GB)  1281.92       1123.54     12.4%

2) NFS: NFS cannot export multiple disks as a single file system. Therefore, in the case of NFS, we use only one disk of the control node; that is, the total disk bandwidth is lower than that of the other file systems. We configure NFS in asynchronous mode to achieve high performance, but we found that the results are sometimes inconsistent.

3) PVFS2: We use PVFS2 (version 2.7.0) with the default configuration generated by pvfs2-genconfig. The benchmarks access PVFS2 through the PVFS2 VFS module. The I/O servers run on compute nodes that are also in use for the computation. The metadata server always runs on the control node, not on a compute node, to avoid a CPU conflict between the application and the metadata server.

C. BTIO Benchmark

A benchmark for evaluating the writing speed of parallel I/O is provided by the Block-Tridiagonal (BT) NAS Parallel Benchmark (NPB), version 3.3, known as BTIO [15]. Each processor is responsible for multiple Cartesian subsets of the entire data set, whose number increases as the square root of the number of processors participating in the computation. BTIO provides options for using MPI-IO with collective buffering, MPI-IO without collective buffering, or Fortran I/O file operations. We choose Fortran I/O to show that the file system layer can meet the demands of existing applications that do not use MPI-IO. In this case, the application issues a tremendous number of small (40-byte) write requests, so the system call overhead can also be significant.

In addition, BTIO is provided with different input problem sizes (classes A–D). We used classes A to C, whose aggregate write amounts for a complete run are 0.4 GB, 1.6 GB, and 6.8 GB, respectively. Because the aggregate write amount is fixed, the amount written by each process decreases as the number of processes increases. To get precise results, we cleared the buffer cache of all nodes before each benchmark execution. The workload of BTIO is a typical I/O pattern in scientific applications, that is, the compute phase and the I/O phase alternate.

D. Comparison with Other File Systems

We compared the performance of the PGAS file system with the other file systems. Table II shows the results of the BTIO benchmark on eight compute nodes with 16 processes. The table shows the total execution time and write bandwidth of BTIO classes A to C on the various file systems. The block size of the PGAS file system is 64KB, and the size of the sub-buffers is also 64KB. Local Disks, PVFS2, and the PGAS file system use the disks of the compute nodes.

The PGAS file system shows a performance improvement over the other networked file systems. NFS and PVFS2 are relatively slow for this benchmark, because the cost of calling write() on these file systems seems very high. With class C, the NFS server crashed for some reason. In the PVFS2 case, the clients and I/O servers consume a lot of CPU power throughout the run.

The write throughput of the PGAS file system does not correspond to the actual I/O speed, because the measured write is actually a memory copy. The PGAS file system is not faster than the local disks, although the computation and I/O can be overlapped. This shows that there are some bottlenecks in the GAS phase, specifically in the dispatcher. We need to investigate this in more detail.

E. The Effect of the Arrange Phase

Next, we measured the effectiveness of the arrange phase by investigating the following three points in the arranged and non-arranged cases: the number of lseek() calls at all I/O servers, the number of I/O requests, and the total execution time. We ran the class A–C benchmarks on eight compute nodes with sixteen processes. The block size is 64KB, and the size of the sub-buffers is also 64KB. As mentioned in Section II-B, the PGAS file system sorts the requests by path and offset in ascending order, and then merges the contiguous requests in the arrange phase.

Table III shows the results. About 80% of the lseek() calls on the eight I/O servers are successfully eliminated. Additionally, about 93% of the requests are eliminated; that is, roughly 15 requests are merged into one request, on average.
As a consequence, about 7.2% of the total execution time is reduced.

IV. RELATED WORK

A. Distributed/Parallel File Systems

There are several studies of network-shared file systems used for clusters or grid environments.

1) PVFS: PVFS [13], [14] (Parallel Virtual File System) is designed to provide a high-performance file system for parallel applications which handle large I/O and many file accesses. PVFS consists of metadata servers and I/O servers. The I/O servers store the data, striped across multiple servers in round-robin fashion. The metadata servers handle metadata information for the files, such as permissions, timestamps, and size. PVFS provides a kernel VFS module, which allows existing UNIX applications (e.g., cp, cat) to run on the PVFS file system without any code modification. PVFS has the capability to support several different network types through an abstraction known as the Buffered Messaging Interface (BMI), and implementations exist for TCP/IP, InfiniBand, and Myrinet. Its storage relies on UNIX files to store the file data, and on a Berkeley DB database to hold metadata. PVFS relaxes the POSIX atomicity semantics to improve stability and performance.

In PVFS, each application may issue multiple requests to the file system simultaneously, so the problem described in Section II-A may occur. Because PVFS also assumes a compute/storage cluster, the GAS architecture could easily be incorporated into PVFS to improve its I/O performance.

B. MPI-IO

The Message Passing Interface (MPI), version 2, includes an interface for file I/O operations called MPI-IO, which focuses on accessing shared files concurrently. MPI-IO has a collective I/O feature, where all processes are explicitly synchronized to exchange access information and generate a better I/O strategy. There are also some approaches that incorporate asynchronous I/O into MPI-IO. The following subsections describe such techniques in MPI-IO. Because MPI-IO has an explicit synchronization point, it can achieve a more optimal I/O ordering than a file system, which does not have a synchronization point. But many existing applications still use the POSIX I/O interface, so our file system approach has an advantage over MPI-IO in that all existing applications can gain the performance benefit.

1) Two-Phase I/O: Two-Phase I/O [16] is one of the optimization techniques with an explicit synchronization point. Consider the case of reading shared files from different processes. First, designated I/O aggregators collect the access information over an aggregate access region which covers the I/O requests from all the MPI processes. Next, in the I/O phase, the I/O aggregators issue large I/O requests which are calculated based on the collected request information. Finally, in the communication phase, the read data is distributed through the network to the proper processes. Since the network is very much faster than I/O calls, it is possible to obtain high performance with this approach.

2) Data Sieving: To achieve low I/O latency, it is crucial to issue as few requests as possible to the file system. Data Sieving [7] is a technique to make a few large contiguous requests to the file system, even if the I/O requests from the application consist of small and non-contiguous requests. The disadvantage of Data Sieving is that it must read more data than needed, and it requires extra memory for this. The application can control the behavior by setting a runtime parameter.

3) Two-Phase Write-Behind Buffering: Two-Phase Write-Behind Buffering [17] is a technique to improve the parallel write performance of scientific applications. This technique is used for the efficient distribution of requests to remote processes. The I/O requests are issued from the application in an asynchronous fashion. Then, they are collected into sub-buffers, each corresponding to a remote process. Once a sub-buffer is full, the buffered requests are flushed to the remote process. With this two-stage buffering scheme, an MPI program can achieve high write bandwidth on some scientific applications.

4) MTIO: More et al. implemented a multi-threaded MPI-based I/O library called MTIO [18]. MTIO provides a thread-based asynchronous I/O capability to improve the collective I/O performance of scientific applications. The paper reported that MTIO can successfully overlap up to 80% of the I/O phase with the computing phase on an IBM Scalable POWERparallel System (SP).

V. CONCLUSION

To achieve high I/O performance in parallel file systems, it is critical for I/O nodes to handle incoming requests with few disk seeks and to reduce the number of requests. In order to meet those requirements, a node-level request reordering architecture, called the Gather-Arrange-Scatter (GAS) architecture, has been proposed in this paper. In GAS, the I/O requests from applications are gathered once at each compute node, then arranged and combined to minimize the I/O cost, and finally scattered to the remote I/O nodes. By this scheme, both the number of disk seeks at the I/O nodes and the number of requests are reduced.

A prototype of the GAS architecture, the Parallel Gather-Arrange-Scatter (PGAS) file system, has been implemented on the Linux operating system. The BTIO NAS Parallel Benchmark is used to evaluate the performance of the PGAS file system. It is confirmed that the GAS architecture reduces up to 84.3% of the lseek() calls and up to 93.6% of the number of requests at the I/O nodes. This results in up to a 12.7% performance improvement compared to the non-arranged case.

Gather-Arrange-Scatter achieves compute-node-side request reordering, but I/O-server-side reordering is also needed to avoid further disk seeks. In the future, we will develop a global I/O scheduler scheme for parallel file systems by incorporating client-side and server-side request scheduling. The experiments and evaluation should also be extended to a wider range of I/O benchmarks.
REFERENCES
[1] D. Teaff, D. Watson, and B. Coyne, "The architecture of the High Performance Storage System (HPSS)," in Proceedings of the 1995 Goddard Conference on Mass Storage Systems and Technologies, 1995. [Online]. Available: citeseer.ist.psu.edu/teaff95architecture.html
[2] F. Schmuck and R. Haskin, “GPFS: A shared-disk file system for
large computing clusters,” in Proc. of the First Conference on File
and Storage Technologies (FAST), Jan. 2002, pp. 231–244. [Online].
Available: citeseer.ist.psu.edu/schmuck02gpfs.html
[3] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, “Grid
datafarm architecture for petascale data intensive computing,” in Pro-
ceedings of the 2nd IEEE/ACM International Symposium on Cluster
Computing and the Grid (CCGrid 2002), 2002.
[4] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi, “Gfarm
v2: A grid file system that supports high-performance distributed and
parallel data computing,” in Proceedings of the 2004 Computing in High
Energy and Nuclear Physics (CHEP04), 2004.
[5] “Lustre file system,” http://lustre.org/.
[6] E. Smirni and D. A. Reed, “Lessons from characterizing the input/output
behavior of parallel scientific applications,” Perform. Eval., vol. 33,
no. 1, pp. 27–44, 1998.
[7] R. Thakur, W. Gropp, and E. Lusk, “Optimizing noncontiguous
accesses in MPI-IO,” Parallel Computing, vol. 28, no. 1, pp. 83–105,
2002. [Online]. Available: citeseer.ist.psu.edu/thakur02optimizing.html
[8] “Myrinet,” http://www.myri.com/.
[9] “hdparm,” http://sourceforge.net/projects/hdparm/.
[10] Argonne National Laboratory, "MPICH2: High-performance and widely portable MPI," http://www.mcs.anl.gov/research/projects/mpich2/.
[11] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon,
“Design and implementation of the Sun Network Filesystem,” in Proc.
Summer 1985 USENIX Conf., Portland OR (USA), 1985, pp. 119–130.
[Online]. Available: citeseer.ist.psu.edu/sandberg85design.html
[12] B. Pawlowski, S. Shepler, C. Beame, B. Callaghan, M. Eisler,
D. Noveck, D. Robinson, and R. Thurlow, “The nfs version 4 protocol,”
2000. [Online]. Available: citeseer.ist.psu.edu/shepler00nfs.html
[13] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta. Berkeley, CA, USA: USENIX Association, 2000.
[14] The PVFS2 Community, "Parallel Virtual File System, version 2," http://www.pvfs.org/.
[15] P. Wong and R. F. Van der Wijngaart, "NAS Parallel Benchmarks I/O version 2.4," NAS Technical Report NAS-03-002, NASA Ames Research Center, Moffett Field, CA, 2003.
[16] J. M. del Rosario, R. Bordawekar, and A. Choudhary, “Improved parallel
i/o via a two-phase run-time access strategy,” SIGARCH Comput. Archit.
News, vol. 21, no. 5, pp. 31–38, 1993.
[17] W.-k. Liao, A. Ching, K. Coloma, A. Nisar, A. Choudhary, J. Chen, R. Sankaran, and S. Klasky, "Using MPI file caching to improve parallel write performance for large-scale scientific applications," in Proceedings of SC '07, 2007.
[18] S. More, A. N. Choudhary, I. T. Foster, and M. Q. Xu, "MTIO: A multi-threaded parallel I/O system," in IPPS '97: Proceedings of the 11th International Symposium on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 1997, pp. 368–373.