Gather-Arrange-Scatter: Node-Level Request Reordering for Parallel File Systems on Multi-Core Clusters

Kazuki Ohta (1), Hiroya Matsuba (2), and Yutaka Ishikawa (1,2)

(1) Graduate School of Information Science and Technology, The University of Tokyo
(2) Information Technology Center, The University of Tokyo

{kzk@il.is.s, matsuba@cc, ishikawa@is.s}.u-tokyo.ac.jp

Abstract—Multiple processors or multi-core CPUs are now common, and the number of processes running concurrently in a cluster is increasing. Each process issues contiguous I/O requests individually, but they can be interrupted by the requests of other processes if all the processes enter the I/O phase together. The I/O nodes then handle these requests as non-contiguous. This increases the disk seek time and causes performance degradation.

To overcome this problem, a node-level request reordering architecture, called the Gather-Arrange-Scatter (GAS) architecture, is proposed. In GAS, the I/O requests issued in the same node are gathered and buffered locally. They are then arranged and combined to reduce the I/O cost at the I/O nodes, and finally scattered to the remote I/O nodes in parallel.

A prototype is implemented and evaluated using the BTIO benchmark. The system reduces the number of lseek() calls at the I/O nodes by up to 84.3% and the number of requests by up to 93.6%. This results in up to a 12.7% performance improvement compared to the non-arranged case.

I. INTRODUCTION

A growing number of scientific applications, such as experimental physics, computational biology, astrophysics, and genome analysis, need to handle terabytes of data. Because the bandwidth of a single disk is too low to handle data of such a size, much prior research has addressed how to effectively utilize multiple disks as a single file system [1]-[5].

Such scientific applications usually run on high-performance commodity clusters with fast processors and high-bandwidth, low-latency interconnects. Such a cluster often consists of compute nodes and I/O nodes. Usually, compute nodes have low-bandwidth, low-capacity commodity disks, and I/O nodes have high-bandwidth, high-capacity, high-end disks. In addition, it has recently become common to have multiple processors or processor cores in one node. In practice, node-level parallelism is exploited by multi-threading or multi-processing, so running multiple processes of the same application, or even of different applications, on the same node is usual. Each application tends to issue contiguous I/O requests individually, but if multiple processes issue many I/O requests simultaneously, the requests are handled by the local file system in a non-contiguous way. This is because the contiguous requests of each process can be interrupted by the requests of other processes. The number of disk seeks then increases, and the I/O bandwidth falls.

This performance degradation is more critical in parallel file systems. Some recent super-computers have more than 100,000 CPU cores, and most scientific applications run long and alternate between computation and disk I/O [6]. If more than 100,000 compute processes enter the I/O phase together and issue I/O requests simultaneously, one I/O node receives I/O requests from all of these processes.

To solve this problem, we propose the Gather-Arrange-Scatter (GAS) node-level I/O request reordering architecture for parallel file systems. The main idea is that I/O requests issued from the same node are gathered and buffered locally, then the buffered requests are arranged in a better order which reduces the I/O cost at the I/O nodes, and finally they are scattered to the I/O nodes in parallel.

To gather the requests, the requests are handled asynchronously, that is, write() returns immediately after the request is gathered. This file system approach works for all existing applications that use the POSIX I/O interface. We have designed and implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves parallel write performance for some I/O-intensive benchmarks.

The rest of this paper is organized as follows. Section II describes the design of the Gather-Arrange-Scatter architecture in detail and the implementation of the PGAS file system. Section III compares and analyzes the performance of the PGAS file system against other file systems. Section IV reviews previous efforts at developing distributed/parallel file systems and other techniques for improving parallel I/O performance. Section V concludes this paper.
II. DESIGN AND IMPLEMENTATION

A. Issues

Recent cluster nodes commonly have multiple processors or CPU cores. Usually, multi-threading or multi-processing is used to exploit node-level parallelism, so it is common for processes of the same application, or even of different applications, to run on the same node. Now consider that these multiple processes issue I/O requests at the same time.

For example, suppose process A requests write(filename="file1", offset=100, count=40) and write("file1", 140, 40), and process B simultaneously requests write("file1", 180, 40). The file system may handle these requests in the order write("file1", 100, 40), write("file1", 180, 40), write("file1", 140, 40). In this order, twice the number of seeks are required compared to the ideal order.

Usually, each application process tends to issue contiguous requests. But in a multi-core environment, contiguous requests may conflict with each other and be handled in a non-contiguous way. The file system therefore needs to move the disk head for roughly every request. Because a disk seek is slow compared to memory accesses or CPU cycles, this greatly decreases the I/O bandwidth, that is, it degrades application performance. Some recent CPUs have more than four cores, and the number of cores is increasing. So the number of applications running on the same node is increasing, and this type of collision is going to happen more often in the near future.

This performance degradation due to disk seek time is even more crucial in parallel file systems. Usually, parallel or distributed file systems are deployed on the I/O nodes, and the I/O requests of each application are sent directly to the I/O nodes through a high-speed interconnect network. This is acceptable if the total number of application processes is small. But in the recent multi-core environment, I/O nodes may receive many I/O requests at the same time from a large number of processes, so the same disk seek problem occurs more frequently.

Fig. 1. The difference between a) existing parallel file systems and b) the Gather-Arrange-Scatter I/O architecture.

B. Gather-Arrange-Scatter

To overcome the problem described in Section II-A, we propose an I/O architecture to achieve node-level request reordering, called the Gather-Arrange-Scatter (GAS) architecture. Figure 1 shows the difference between the existing parallel file system approach and the Gather-Arrange-Scatter architecture. The idea is that I/O requests issued from the same node are gathered and buffered temporarily. The buffered requests are then arranged in the ideal order, which reduces the disk seek times. Finally, they are scattered to the remote disks in parallel.

Fig. 2. Detailed view of the Gather-Arrange-Scatter (GAS) architecture.

Figure 2 illustrates a more detailed view of the GAS architecture. In the GAS architecture, there are two important servers, called the dispatcher and the I/O server. A dispatcher is launched on every compute node to take charge of the Gather-Arrange-Scatter phases. An I/O server is launched on every I/O node to handle the I/O requests scattered from the dispatchers.

The specific process of Gather-Arrange-Scatter is described in the following subsections.

1) Gathering: To gather the I/O requests, the system call behavior is changed to achieve asynchronous I/O. In asynchronous I/O, system calls return immediately after the request is issued, so the application can return immediately to its computation. The queued requests are sent to the dispatcher in the background.
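As a rough illustration of this gathering step (a sketch with hypothetical names, not the PGAS code), the intercepted write() below copies the request into an in-memory queue and returns at once, while a background thread forwards queued requests to the node-local dispatcher.

```cpp
// Minimal sketch of the gathering step. GatheredRequest, RequestQueue, and
// send_to_dispatcher() are illustrative names, not the PGAS implementation.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <sys/types.h>
#include <thread>
#include <vector>

struct GatheredRequest {
    int fd;                   // file descriptor the application wrote to
    uint64_t offset;          // file offset of the write (tracked by the hook)
    std::vector<char> data;   // copied payload, so the caller's buffer can be reused
};

class RequestQueue {
public:
    // Called from the hooked write(): enqueue and return immediately.
    void push(GatheredRequest req) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            queue_.push_back(std::move(req));
        }
        cv_.notify_one();
    }
    // Called by the background thread: block until a request is available.
    GatheredRequest pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !queue_.empty(); });
        GatheredRequest req = std::move(queue_.front());
        queue_.pop_front();
        return req;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<GatheredRequest> queue_;
};

RequestQueue g_queue;

// Placeholder for the hand-off to the node-local dispatcher process.
void send_to_dispatcher(const GatheredRequest&) { /* e.g. over a local socket */ }

// Asynchronous write path: copy the data, enqueue, return at once.
ssize_t gathered_write(int fd, const void* buf, size_t count, uint64_t offset) {
    GatheredRequest req;
    req.fd = fd;
    req.offset = offset;
    req.data.assign(static_cast<const char*>(buf),
                    static_cast<const char*>(buf) + count);
    g_queue.push(std::move(req));
    return static_cast<ssize_t>(count);  // reported as complete immediately
}

// Background thread forwarding gathered requests to the dispatcher.
void gather_loop() {
    for (;;) {
        send_to_dispatcher(g_queue.pop());
    }
}

// Started once per process, e.g. from the hooking library's initializer.
void start_gather_thread() {
    std::thread(gather_loop).detach();
}
```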
2) Arranging: I/O requests are first accumulated into an in-memory local buffer that is separated into sub-buffers. The number of sub-buffers is the same as the number of I/O nodes, and each sub-buffer corresponds to a destination remote disk. The target sub-buffer is decided using information from the request, such as the file name or offset, depending on the implementation. Normally, a file is striped into equally sized blocks in parallel file systems, so the target sub-buffer may be decided by the index of the block that contains the requested region. Once a sub-buffer is full or a certain period passes, the buffered I/O requests are arranged in a better order, which minimizes the disk seek times. Additionally, multiple contiguous requests are merged into one request in this phase, because it is also critical to issue as few requests as possible to the file system for low I/O latency [7]. A sketch of this selection, sorting, and merging logic is given after this subsection.

Figure 2 illustrates the buffering operation. In the figure, there are four compute nodes and four I/O nodes. Each dispatcher has four sub-buffers, and each sub-buffer corresponds to a remote I/O server. In the figure, the sub-buffer for node1 is full on node2, so the buffered I/O requests are arranged and sent to node1. The I/O server on node1 receives the requests and performs the actual I/O operations on its local disk.
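Below is a minimal sketch of the arrange phase, under the assumption of round-robin block striping: the target sub-buffer is chosen from the index of the 64 KB block containing the request, and a full sub-buffer is sorted by (path, offset) and contiguous requests are merged, as described in Section III-E. The type and function names are illustrative, not the PGAS implementation.

```cpp
// Sketch of sub-buffer selection plus the sort-and-merge arrangement,
// assuming round-robin striping of fixed-size blocks across I/O nodes.
// BLOCK_SIZE, WriteRequest, and arrange() are hypothetical names.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint64_t BLOCK_SIZE = 64 * 1024;  // 64 KB, the size used in the evaluation

struct WriteRequest {
    std::string path;
    uint64_t offset;
    std::vector<char> data;
};

// Pick the sub-buffer (one per I/O node) from the block index of the request.
size_t target_sub_buffer(const WriteRequest& req, size_t num_io_nodes) {
    uint64_t block_index = req.offset / BLOCK_SIZE;
    return static_cast<size_t>(block_index % num_io_nodes);  // round-robin striping assumed
}

// Arrange a full sub-buffer: sort by (path, offset) and merge contiguous writes,
// so the I/O server can service them with fewer seeks and fewer requests.
std::vector<WriteRequest> arrange(std::vector<WriteRequest> reqs) {
    std::sort(reqs.begin(), reqs.end(),
              [](const WriteRequest& a, const WriteRequest& b) {
                  return a.path != b.path ? a.path < b.path : a.offset < b.offset;
              });

    std::vector<WriteRequest> merged;
    for (auto& r : reqs) {
        if (!merged.empty() &&
            merged.back().path == r.path &&
            merged.back().offset + merged.back().data.size() == r.offset) {
            // The request starts exactly where the previous one ends: merge them.
            merged.back().data.insert(merged.back().data.end(),
                                      r.data.begin(), r.data.end());
        } else {
            merged.push_back(std::move(r));
        }
    }
    return merged;
}
```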
3) Scattering: The arranged I/O requests are scattered to the corresponding I/O servers in parallel. The requests are handled synchronously by each I/O server, one after another.
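One possible shape of the scatter step is sketched below: each arranged sub-buffer is flushed to its I/O server by its own sender thread, so transfers to different servers proceed in parallel while each server applies its requests one after another. The wire protocol is left abstract and the names are hypothetical.

```cpp
// Illustrative scatter step: each arranged sub-buffer is flushed to its
// I/O server by a separate thread, so transfers to different servers overlap.
// send_requests() stands in for whatever protocol the dispatcher actually uses.
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct ArrangedBuffer { /* arranged, merged requests for one I/O server */ };

void send_requests(size_t io_server_id, const ArrangedBuffer& buf) {
    // e.g. write the requests to the connection for this I/O server and wait
    // for acknowledgements; the server applies them one after another.
    (void)io_server_id; (void)buf;
}

void scatter(const std::vector<ArrangedBuffer>& sub_buffers) {
    std::vector<std::thread> senders;
    for (size_t id = 0; id < sub_buffers.size(); ++id) {
        senders.emplace_back(send_requests, id, std::cref(sub_buffers[id]));
    }
    for (auto& t : senders) {
        t.join();  // wait until every sub-buffer has been delivered
    }
}
```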
C. Implementation

We have implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves I/O performance on multi-core clusters. The PGAS file system is implemented on the Linux operating system in about 6,000 lines of C/C++.

Currently, read requests are handled synchronously in the PGAS file system prototype; only write requests are gathered by the dispatchers. The PGAS file system also does not have a metadata-handling server at present and requires a backend NFS server for this purpose. These aspects will be enhanced in the future. A system call hooking library is used to intercept the system calls related to file system operations, which enables the file system to be implemented completely in user space.
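The paper does not show the hooking library itself; the sketch below illustrates the common Linux technique such a user-space implementation could rely on: an LD_PRELOAD shared object that wraps write() via dlsym(RTLD_NEXT, ...) and forwards the call either to the gathering path or to the real system call. The gathering-layer functions here are hypothetical stubs.

```cpp
// Minimal sketch of user-space system call interception with LD_PRELOAD.
// Build as a shared object (e.g. g++ -shared -fPIC hook.cpp -ldl -o libhook.so)
// and run the application with LD_PRELOAD=./libhook.so.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   // RTLD_NEXT requires _GNU_SOURCE (g++ defines it by default)
#endif
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

// Stubs standing in for the gathering layer; in a real library these would
// consult the dispatcher state (hypothetical names, not the PGAS library).
static bool is_gathered_fd(int fd) { (void)fd; return false; }
static ssize_t enqueue_for_dispatcher(int fd, const void* buf, size_t count) {
    (void)fd; (void)buf;
    return static_cast<ssize_t>(count);
}

extern "C" ssize_t write(int fd, const void* buf, size_t count) {
    // Resolve the real write() once, the first time the hook is called.
    static ssize_t (*real_write)(int, const void*, size_t) =
        reinterpret_cast<ssize_t (*)(int, const void*, size_t)>(
            dlsym(RTLD_NEXT, "write"));

    if (is_gathered_fd(fd)) {
        // Asynchronous path: buffer the request locally and return immediately.
        return enqueue_for_dispatcher(fd, buf, count);
    }
    // Everything else goes straight to the real system call.
    return real_write(fd, buf, count);
}
```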
1) Consistency Model: Next we discuss the consistency model of the PGAS file system. Most scientific applications call open() in write-only mode. Therefore, the PGAS file system relaxes the POSIX I/O atomicity semantics and concentrates on achieving high performance for write-only access. The PGAS file system implementation gathers the I/O requests of files opened in write-only mode, and ensures that the data has been written to disk by the time the application's close() on the file descriptor returns. To enforce this constraint, the application is blocked when it calls close() until all of its I/O requests have been handled by the I/O servers. Because the I/O requests from the dispatcher to the I/O server are handled synchronously, the application only has to ask the local dispatcher whether all of its requests have been sent. To check this condition, the dispatcher keeps an unhandled-request count for each process.
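A minimal sketch of this bookkeeping, with hypothetical names: the dispatcher keeps a per-process counter of outstanding requests, and the hooked close() waits until the counter drains to zero before returning to the application.

```cpp
// Sketch of the close()-time flush guarantee: a per-process pending-request
// counter that the hooked close() waits on. PendingCounter and the method
// names are illustrative, not the PGAS implementation.
#include <condition_variable>
#include <mutex>

class PendingCounter {
public:
    // Called when a gathered request is handed to the dispatcher.
    void request_issued() {
        std::lock_guard<std::mutex> lk(mu_);
        ++pending_;
    }
    // Called when the dispatcher learns the I/O server has handled a request.
    void request_completed() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            --pending_;
        }
        cv_.notify_all();
    }
    // Called from the hooked close(): block until nothing is outstanding,
    // so the data is on disk by the time close() returns to the application.
    void wait_until_drained() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    long pending_ = 0;
};
```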
III. EXPERIMENTAL RESULTS

This section describes our experiments and the results we obtained to evaluate the performance of the PGAS file system. We run an I/O-intensive benchmark on the PGAS file system and on other existing file systems.

A. Cluster Environment

For the experiments, we have eight compute nodes and one control node. The network configuration of the experiment cluster is shown in Figure 3, and the specifications of the nodes are shown in Table I. All compute nodes are connected with a 10 Gbps Myrinet [8] network and also with a 1 Gbps Ethernet network. On the PGAS file system, the communication between the dispatchers and the I/O servers is done through the Ethernet network. The disk bandwidth is measured by hdparm [9].

Fig. 3. Cluster network configuration.

TABLE I
THE SPECIFICATIONS OF THE CLUSTER

8 Compute Nodes
  CPU            Opteron 2212 HE (dual-core, 2.0 GHz) x 2
  Memory         6 GB
  HDD            Serial ATA disk (50.28 to 53.21 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O Scheduler  CFQ I/O scheduler

1 Control Node
  CPU            Opteron 2218 HE (dual-core, 2.6 GHz) x 2
  Memory         16 GB
  HDD            Serial ATA disk (50.61 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O Scheduler  CFQ I/O scheduler

1) Software Environment: We use MPICH2 [10] version 1.0.5 as the default MPI library. All software, including MPICH2 and the benchmarks, is compiled with the Intel C Compiler, version 10.0. The other file systems and dependent libraries are compiled with GCC, version 4.1.1. MPICH2 is configured to use the Myrinet-10G network.

B. Other File Systems for Comparison

We run the benchmarks on three other file systems: local disks, the Network File System (NFS) in asynchronous mode [11], [12], and the Parallel Virtual File System 2 (PVFS2) [13], [14]. Next, we describe the configuration of each file system.

1) Local Disks: We use local disks to measure the I/O throughput when each process issues requests directly to the operating system. There is no metadata operation in this case.

2) NFS: NFS cannot export multiple disks as a single file system. Therefore, in the case of NFS, we use only one disk of the control node; that is, the total disk bandwidth is lower than that of the other file systems. We configure NFS in asynchronous mode to achieve high performance, but we found that the results are sometimes inconsistent.

3) PVFS2: We use PVFS2 (version 2.7.0) with the default configuration generated by pvfs2-genconfig. The benchmarks use PVFS2 through the PVFS2 VFS module. The I/O servers run on the compute nodes, which are also in use for the computation. The metadata server always runs on the control node, not on the compute nodes, to avoid a CPU conflict between the application and the metadata server.

C. BTIO Benchmark

A benchmark for evaluating the write speed of parallel I/O is provided by the Block-Tridiagonal (BT) NAS Parallel Benchmark (NPB), version 3.3, known as BTIO [15]. Each processor is responsible for multiple Cartesian subsets of the entire data set, whose number increases as the square root of the number of processors participating in the computation.

BTIO provides options for using MPI I/O with collective buffering, MPI I/O without collective buffering, or Fortran I/O file operations. We choose Fortran I/O to show that the file system layer can satisfy the demands of existing applications that do not use MPI-IO. In this case, the applications issue a tremendous number of small (40-byte) write requests, so the overhead of the system call can also be important.

In addition, BTIO is provided with different input problem sizes (A-D). We used classes A to C, with aggregate write amounts for a complete run of 0.4 GB, 1.6 GB, and 6.8 GB, respectively. Because the aggregate write amount is fixed, the amount written by each process decreases as the number of processes increases. To get precise results, we cleared the buffer cache of all nodes before each benchmark execution. The workload of BTIO is a typical I/O pattern in scientific applications, that is, the compute phase and the I/O phase alternate.

D. Comparison with Other File Systems

We compared the performance of the PGAS file system with the other file systems. Table II shows the results of the BTIO benchmark on eight compute nodes with 16 processes. The table shows the total execution time and write bandwidth of BTIO classes A to C on the various file systems. The block size of the PGAS file system is 64 KB, and the size of the sub-buffers is also 64 KB. Local disks, PVFS2, and the PGAS file system use the disks of the compute nodes.

TABLE II
BTIO BENCHMARK RESULT (8 NODES / 16 PROCESSES)

Total Execution Time (sec)
  Class       Local     NFS       PVFS2     PGAS FS
  A (0.4 GB)  57.82     686.41    385.20    65.22
  B (1.6 GB)  114.48    1858.21   1404.58   241.74
  C (6.8 GB)  422.16    N/A       5358.19   1123.54

Write Bandwidth (MB/sec)
  Class       Local     NFS       PVFS2     PGAS FS
  A (0.4 GB)  25.46     0.65      1.18      341.31
  B (1.6 GB)  38.57     1.22      1.30      321.24
  C (6.8 GB)  39.39     N/A       1.36      338.19

The PGAS file system shows a performance improvement over NFS and PVFS2. NFS and PVFS2 are relatively slow for this benchmark, because the cost of calling write() on these file systems appears to be very high. With class C, the NFS server crashed for an unknown reason. In the PVFS2 case, the clients and I/O servers consume a lot of CPU power throughout the run.

The write throughput of the PGAS file system does not correspond to the actual I/O speed, because the measured write is actually a memory copy. The PGAS file system is not faster than the local disks, although computation and I/O can be overlapped. This shows that there are some bottlenecks in the GAS phases, specifically in the dispatcher. We need to investigate this in more detail.

E. The Effect of the Arrange Phase

Next, we measured the effectiveness of the arrange phase by investigating the following three points with and without arrangement: the number of lseek() calls on all I/O servers, the number of I/O requests, and the total execution time. We ran class A-C benchmarks on eight compute nodes with sixteen processes. The block size is 64 KB, and the size of the sub-buffers is also 64 KB. As mentioned in Section II-B, the PGAS file system sorts the requests by path and offset in ascending order, then merges the contiguous requests in the arrange phase.

TABLE III
BTIO (8 NODES / 16 PROCESSES): THE IMPACT OF REQUEST ARRANGEMENT

Number of lseek() calls at I/O servers (times)
  Class       NotArranged   Arranged    Reduced
  A (0.4 GB)  2287540       666259      70.9%
  B (1.6 GB)  8512074       1713626     79.9%
  C (6.8 GB)  27801033      4374196     84.3%

Number of I/O Requests (times)
  Class       NotArranged   Arranged    Reduced
  A (0.4 GB)  10490880      666575      93.6%
  B (1.6 GB)  42468984      3057766     92.8%
  C (6.8 GB)  170142519     11229406    93.4%

Total Execution Time (sec)
  Class       NotArranged   Arranged    Reduced
  A (0.4 GB)  74.71         65.22       12.7%
  B (1.6 GB)  260.47        241.74      7.2%
  C (6.8 GB)  1281.92       1123.54     12.4%

Table III shows the actual results. About 80% of the lseek() calls on the eight I/O servers are successfully eliminated. Additionally, about 93% of the requests are eliminated, that is, about 20 requests are merged into one request on average. As a consequence, 7.2% to 12.7% of the total execution time is reduced.
IV. RELATED WORK

A. Distributed/Parallel File System

There are several studies of network-shared file systems used in cluster or grid environments.

1) PVFS: PVFS (Parallel Virtual File System) [13], [14] is designed to provide a high-performance file system for parallel applications that handle large I/O and many file accesses. PVFS consists of metadata servers and I/O servers. The I/O servers store the data, striped across multiple servers in round-robin fashion. The metadata servers handle the metadata of the files, such as permissions, timestamps, and size. PVFS provides a kernel VFS module, which allows existing UNIX applications (e.g. cp, cat) to run on the PVFS file system without any code modification. PVFS supports several different network types through an abstraction known as the Buffered Messaging Interface (BMI), and implementations exist for TCP/IP, InfiniBand, and Myrinet. Its storage relies on UNIX files to store the file data and a Berkeley DB database to hold the metadata. PVFS relaxes the POSIX atomicity semantics to improve stability and performance.

In PVFS, each application may issue multiple requests to the file system simultaneously, so the problem described in Section II-A may occur. Because PVFS also assumes a compute/storage cluster, the GAS architecture could easily be incorporated into PVFS to improve its I/O performance.

B. MPI-IO

The Message Passing Interface (MPI), version 2, includes an interface for file I/O operations called MPI-IO, which focuses on accessing shared files concurrently. MPI-IO has a collective I/O feature, where all processes are explicitly synchronized to exchange access information and generate a better I/O strategy. There are also some approaches that incorporate asynchronous I/O into MPI-IO. The following subsections describe such techniques in MPI-IO. Because MPI-IO has an explicit synchronization point, it can produce a more optimal I/O ordering than a file system, which does not have a synchronization point. But many existing applications still use the POSIX I/O interface, so our file system approach has an advantage over MPI-IO in that all existing applications can gain the performance benefit.

1) Two-Phase I/O: Two-Phase I/O [16] is one of the optimization techniques with an explicit synchronization point. Consider the case of reading a shared file from different processes. First, designated I/O aggregators collect an aggregate access region which covers the I/O requests from all the MPI processes. Next, in the I/O phase, the I/O aggregators issue large I/O requests calculated from the collected request information. Finally, in the communication phase, the read data is distributed through the network to the proper processes. Since the network is very much faster than I/O calls, it is possible to obtain high performance with this approach.

2) Data Sieving: To achieve low I/O latency, it is crucial to issue as few requests as possible to the file system. Data Sieving [7] is a technique that makes a few large contiguous requests to the file system, even if the I/O requests from the application consist of many small and non-contiguous requests. The disadvantage of Data Sieving is that it must read more data than needed, and it requires extra memory for this. The application can control the behavior by setting a runtime parameter.
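As a rough illustration of the data sieving idea (not the actual MPI-IO implementation), the sketch below services several small non-contiguous writes by reading the single extent that covers them, patching it in memory, and writing it back as one contiguous request; the extra read and the extra buffer are exactly the costs mentioned above.

```cpp
// Illustrative data-sieving write: one large read-modify-write covering
// several small, non-contiguous writes. Uses POSIX pread/pwrite; the
// SmallWrite type and sieve_write() are hypothetical names.
#include <algorithm>
#include <cstring>
#include <unistd.h>
#include <vector>

struct SmallWrite {
    off_t offset;
    std::vector<char> data;
};

bool sieve_write(int fd, const std::vector<SmallWrite>& writes) {
    if (writes.empty()) return true;

    // Find the single extent [lo, hi) covering every small write.
    off_t lo = writes.front().offset;
    off_t hi = lo;
    for (const auto& w : writes) {
        lo = std::min(lo, w.offset);
        hi = std::max<off_t>(hi, w.offset + static_cast<off_t>(w.data.size()));
    }

    // Read the whole extent (this is the extra I/O and memory the text mentions).
    std::vector<char> extent(static_cast<size_t>(hi - lo));
    if (pread(fd, extent.data(), extent.size(), lo) < 0) return false;

    // Patch the requested regions in memory.
    for (const auto& w : writes) {
        std::memcpy(extent.data() + (w.offset - lo), w.data.data(), w.data.size());
    }

    // Write the extent back as one contiguous request.
    return pwrite(fd, extent.data(), extent.size(), lo) ==
           static_cast<ssize_t>(extent.size());
}
```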
3) Two-Phase Write Behind Buffering: Two-Phase Write Behind Buffering [17] is a technique to improve the parallel write performance of scientific applications. The technique is used for the efficient distribution of requests to remote processes. The I/O requests are issued by the application in an asynchronous fashion. They are then collected into sub-buffers, each corresponding to a remote process. Once a sub-buffer is full, the buffered requests are flushed to the remote process. With this two-stage buffering scheme, an MPI program can achieve high write bandwidth for some scientific applications.

4) MTIO: More et al. implemented a multi-threaded MPI-based I/O library called MTIO [18]. MTIO provides a thread-based asynchronous I/O capability to improve the collective I/O performance of scientific applications. The paper reported that MTIO can successfully overlap up to 80% of the I/O phase with the computing phase on an IBM Scalable POWERparallel System (SP).

V. CONCLUSION

To achieve high I/O performance in parallel file systems, it is critical for the I/O nodes to handle incoming requests with low disk seek times and to reduce the number of requests. In order to meet those requirements, a node-level request reordering architecture, called the Gather-Arrange-Scatter (GAS) architecture, has been proposed in this paper. In GAS, I/O requests from applications are first gathered at each compute node, then arranged and combined to minimize the I/O cost, and finally scattered to the remote I/O nodes. With this scheme, both the disk seek times at the I/O nodes and the number of requests are reduced.

A prototype of the GAS architecture, the Parallel Gather-Arrange-Scatter (PGAS) file system, has been implemented on the Linux operating system. The BTIO NAS Parallel Benchmark is used to evaluate the performance of the PGAS file system. It is confirmed that the GAS architecture reduces the number of lseek() calls at the I/O nodes by up to 84.3% and the number of requests by up to 93.6%. This results in up to a 12.7% performance improvement compared to the non-arranged case.

Gather-Arrange-Scatter achieves compute-node-side request reordering, but I/O-server-side reordering is also needed to avoid further disk seeks. In the future, we will develop a global I/O scheduling scheme for parallel file systems by incorporating both client-side and server-side request scheduling. The experiments and evaluation should also be done on a wider range of I/O benchmarks.
                               REFERENCES
 [1] D. Teaff, D. Watson, and B. Coyne, “The architecture of the
     high performance storage system (hpss),” in Proceedings of the
     1995 Goddard Conference on Mass Storage and Technologies, 1995.
     [Online]. Available: citeseer.ist.psu.edu/teaff95architecture.html
 [2] F. Schmuck and R. Haskin, “GPFS: A shared-disk file system for
     large computing clusters,” in Proc. of the First Conference on File
     and Storage Technologies (FAST), Jan. 2002, pp. 231–244. [Online].
     Available: citeseer.ist.psu.edu/schmuck02gpfs.html
 [3] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, “Grid
     datafarm architecture for petascale data intensive computing,” in Pro-
     ceedings of the 2nd IEEE/ACM International Symposium on Cluster
     Computing and the Grid (CCGrid 2002), 2002.
 [4] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi, “Gfarm
     v2: A grid file system that supports high-performance distributed and
     parallel data computing,” in Proceedings of the 2004 Computing in High
     Energy and Nuclear Physics (CHEP04), 2004.
 [5] “Lustre file system,” http://lustre.org/.
 [6] E. Smirni and D. A. Reed, “Lessons from characterizing the input/output
     behavior of parallel scientific applications,” Perform. Eval., vol. 33,
     no. 1, pp. 27–44, 1998.
 [7] R. Thakur, W. Gropp, and E. Lusk, “Optimizing noncontiguous
     accesses in MPI-IO,” Parallel Computing, vol. 28, no. 1, pp. 83–105,
     2002. [Online]. Available: citeseer.ist.psu.edu/thakur02optimizing.html
 [8] “Myrinet,” http://www.myri.com/.
 [9] “hdparm,” http://sourceforge.net/projects/hdparm/.
[10] Argonne National Laboratory, "MPICH2: High-performance and widely
     portable MPI," http://www.mcs.anl.gov/research/projects/mpich2/.
[11] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon,
     “Design and implementation of the Sun Network Filesystem,” in Proc.
     Summer 1985 USENIX Conf., Portland OR (USA), 1985, pp. 119–130.
     [Online]. Available: citeseer.ist.psu.edu/sandberg85design.html
[12] B. Pawlowski, S. Shepler, C. Beame, B. Callaghan, M. Eisler,
     D. Noveck, D. Robinson, and R. Thurlow, “The nfs version 4 protocol,”
     2000. [Online]. Available: citeseer.ist.psu.edu/shepler00nfs.html
[13] P. H. Carns, W. B. L. III, R. B. Ross, and R. Thakur, “Pvfs: a parallel file
     system for linux clusters,” in ALS’00: Proceedings of the 4th conference
     on 4th Annual Linux Showcase & Conference, Atlanta. Berkeley, CA,
     USA: USENIX Association, 2000, pp. 28–28.
[14] T. P. Community, “Parallel virtual file system, version 2,”
     http://www.pvfs.org/.
[15] P. Wong and R. F. Van der Wijngaart, "NAS Parallel Benchmarks I/O version
     2.4," NAS Technical Report NAS-03-002, NASA Ames Research Center, Moffett
     Field, CA, 2003.
[16] J. M. del Rosario, R. Bordawekar, and A. Choudhary, “Improved parallel
     i/o via a two-phase run-time access strategy,” SIGARCH Comput. Archit.
     News, vol. 21, no. 5, pp. 31–38, 1993.
[17] W. keng Liao, A. Ching, K. Coloma, A. Nisar, A. Choudhary, J. Chen,
     R. Sankaran, and S. Klasky, “Using mpi file caching to improve parallel
     write performance for large-scale scientific applications,” SC, 2007.
[18] S. More, A. N. Choudhary, I. T. Foster, and M. Q. Xu, “Mtio - a multi-
     threaded parallel i/o system,” in IPPS ’97: Proceedings of the 11th
     International Symposium on Parallel Processing. Washington, DC,
     USA: IEEE Computer Society, 1997, pp. 368–373.

SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Recently uploaded

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 

cluster08

chronously; that is, write() returns immediately after the request is gathered. This file-system-level approach works for all existing applications that use the POSIX I/O interface. We have designed and implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves parallel write performance on I/O-intensive benchmarks.

A growing number of scientific applications, such as experimental physics, computational biology, astrophysics, and genome analysis, need to handle terabytes of data. The bandwidth of a single disk is far too low for data of this size, so much prior research has addressed how to utilize multiple disks effectively as a single file system [1]–[5]. Such applications usually run on high-performance commodity clusters with fast processors and high-bandwidth, low-latency interconnects. These clusters often consist of compute nodes and I/O nodes: compute nodes usually have low-bandwidth, low-capacity commodity disks, while I/O nodes have high-bandwidth, high-capacity, high-end disks. In addition, it has recently become common for a single node to have multiple processors or processor cores, and node-level parallelism is exploited in practice with multiple threads or processes. Running several processes of the same application, or even of different applications, on the same node is therefore usual. Each application tends to issue contiguous I/O requests individually, but if multiple processes issue many I/O requests simultaneously, the local file system ends up handling them in a non-contiguous order.

The rest of this paper is organized as follows. Section II describes the design of the Gather-Arrange-Scatter architecture in detail and the implementation of the PGAS file system. Section III compares and analyzes the performance of the PGAS file system with other file systems. Section IV reviews previous efforts at developing distributed and parallel file systems, as well as other techniques for improving parallel I/O performance. Section V concludes this paper.
II. DESIGN AND IMPLEMENTATION

A. Issues

Recent cluster nodes commonly have multiple processors or CPU cores, and node-level parallelism is usually exploited with multiple threads or processes. It is therefore common for processes of the same application, or even of different applications, to run on the same node. Now consider what happens when these processes issue I/O requests at the same time.

For example, suppose process A requests write(filename="file1", offset=100, count=40) followed by write("file1", 140, 40), while process B simultaneously requests write("file1", 180, 40). The file system may receive these requests in the order write("file1", 100, 40), write("file1", 180, 40), write("file1", 140, 40). In this order, extra disk seeks are required compared with the ideal, offset-sorted order.

Each application process tends to issue contiguous requests on its own, but in a multi-core environment the contiguous streams of different processes collide and reach the file system in a non-contiguous order. The file system then has to move the disk head for roughly every request. Because a disk seek is slow compared to memory accesses or CPU cycles, this greatly decreases the I/O bandwidth and degrades application performance. Some recent CPUs already have more than four cores, and the core count keeps increasing; the number of processes running on the same node grows with it, so this type of collision will happen even more often in the near future.

This performance degradation due to disk seek time is even more serious in parallel file systems. Parallel or distributed file systems are usually deployed on the I/O nodes, and the I/O requests of each application process are sent directly to the I/O nodes through a high-speed interconnect. This is acceptable if the total number of application processes is small, but in a multi-core environment an I/O node may receive many I/O requests at the same time from a large number of processes, and the same disk seek problem occurs even more frequently. The short sketch below counts the seeks for the example above.
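As an illustration only (this is not code from the PGAS prototype), the following minimal C sketch counts the head movements needed for the arrival order of the three writes above versus the offset-sorted order. The seek-counting rule, namely that a seek is charged whenever the next request does not start exactly where the previous one ended, is an assumption made for the illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* One write request from the example above: a byte range within "file1". */
    struct req { long offset; long count; };

    /* A seek is counted whenever the next request does not start exactly where
     * the previous one ended (the very first request always costs one seek). */
    static int count_seeks(const struct req *r, int n)
    {
        int seeks = 0;
        long head = -1;                       /* unknown initial head position */
        for (int i = 0; i < n; i++) {
            if (r[i].offset != head)
                seeks++;
            head = r[i].offset + r[i].count;
        }
        return seeks;
    }

    static int by_offset(const void *a, const void *b)
    {
        const struct req *x = a, *y = b;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    int main(void)
    {
        /* Order in which the local file system receives the example requests. */
        struct req arrival[] = { {100, 40}, {180, 40}, {140, 40} };
        struct req sorted[3];
        for (int i = 0; i < 3; i++)
            sorted[i] = arrival[i];
        qsort(sorted, 3, sizeof(sorted[0]), by_offset);

        printf("seeks in arrival order: %d\n", count_seeks(arrival, 3)); /* 3 */
        printf("seeks in sorted order : %d\n", count_seeks(sorted, 3));  /* 1 */
        return 0;
    }

Under this counting rule the interleaved arrival order costs three head positionings against one for the sorted order, which is exactly the effect the arrange phase described below is designed to remove.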
[Fig. 1. The difference between (a) existing parallel file systems and (b) the Gather-Arrange-Scatter I/O architecture.]

[Fig. 2. Detailed view of the Gather-Arrange-Scatter (GAS) architecture.]

B. Gather-Arrange-Scatter

To overcome the problem described in Section II-A, we propose an I/O architecture that achieves node-level request reordering, called the Gather-Arrange-Scatter (GAS) architecture. Figure 1 shows the difference between the existing parallel file system approach and the Gather-Arrange-Scatter architecture. The idea is that I/O requests issued from the same node are gathered and buffered temporarily. The buffered requests are then arranged in an order that reduces the disk seek time. Finally, they are scattered to the remote disks in parallel.

Figure 2 illustrates a more detailed view of the GAS architecture. There are two important servers, called the dispatcher and the I/O server. The dispatcher should be launched on every compute node to take charge of the Gather-Arrange-Scatter phases. The I/O server should be launched on every I/O node to handle the I/O requests scattered from the dispatchers. The three phases of Gather-Arrange-Scatter are described in the following subsections.

1) Gathering: To gather the I/O requests, the system call behavior is changed to achieve asynchronous I/O: system calls return immediately after the request is queued, so the application can return immediately to its computation. The queued requests are sent to the dispatcher in the background. A sketch of such an intercepted write() is given after this subsection.
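The paper does not list the hooking code; as a rough sketch only, the C fragment below shows the general shape of a gathered write(): the call is interposed (for example via LD_PRELOAD), the request is copied into a local queue destined for the dispatcher, and the call returns at once. gas_is_gathered(), gas_enqueue(), and the queue itself are hypothetical names, not the PGAS API, and the use of lseek() to track the file position is a simplification.

    /* Hypothetical sketch of the gather step: write() on a file descriptor that
     * belongs to the gathered file system is queued locally and acknowledged
     * immediately; a background thread later forwards the queue to the local
     * dispatcher.  This is an illustration, not the PGAS implementation. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    typedef ssize_t (*write_fn)(int, const void *, size_t);
    static write_fn real_write;

    /* Assumed helpers provided elsewhere in this sketch. */
    extern int  gas_is_gathered(int fd);
    extern void gas_enqueue(int fd, off_t offset, void *buf, size_t len);

    ssize_t write(int fd, const void *buf, size_t count)
    {
        if (!real_write)
            real_write = (write_fn)dlsym(RTLD_NEXT, "write");

        if (!gas_is_gathered(fd))
            return real_write(fd, buf, count);      /* not ours: pass through */

        /* Remember the current offset, copy the payload, return immediately. */
        off_t off  = lseek(fd, 0, SEEK_CUR);
        void *copy = malloc(count);
        if (!copy)
            return real_write(fd, buf, count);      /* fall back on failure */
        memcpy(copy, buf, count);
        gas_enqueue(fd, off, copy, count);
        lseek(fd, (off_t)count, SEEK_CUR);          /* advance the file position */
        return (ssize_t)count;                      /* report success at once */
    }

Compiled into a shared object and preloaded with LD_PRELOAD (linking with -ldl on Linux), a library of this shape lets unmodified POSIX applications participate in the gather phase.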
2) Arranging: I/O requests are first accumulated into an in-memory local buffer that is divided into sub-buffers. The number of sub-buffers equals the number of I/O nodes, and each sub-buffer corresponds to one destination remote disk. The target sub-buffer is decided using information from the request, such as the file name or offset, depending on the implementation. Normally, a file is striped into equally sized blocks in a parallel file system, so the target sub-buffer may be decided by the index of the block that contains the requested region.

Once a sub-buffer is full, or a certain period has passed, the buffered I/O requests are arranged in a better order, one that minimizes the disk seek time. Additionally, multiple contiguous requests are merged into one request in this phase, because it is also critical to issue as few requests as possible to the file system for low I/O latency [7].

Figure 2 also illustrates the buffering operation. In the figure there are four compute nodes and four I/O nodes, so each dispatcher has four sub-buffers, one per remote I/O server. The sub-buffer for node1 becomes full on node2, so its buffered I/O requests are arranged and sent to node1; the I/O server on node1 receives them and performs the actual I/O operations on its local disk. A sketch of the sort-and-merge step is given below.
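The PGAS prototype sorts the buffered requests by path and offset and then merges contiguous ones (see Section III-E). The following C sketch shows one way such an arrange step could look; struct gas_req and the in-place merge are assumptions for illustration, not the PGAS data structures, and error handling is omitted for brevity.

    /* Sketch of the arrange step for one sub-buffer: sort the buffered write
     * requests by (path, offset), then merge requests that cover adjacent byte
     * ranges of the same file into a single larger request. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    struct gas_req {
        char   path[256];   /* target file */
        off_t  offset;      /* starting byte offset */
        size_t count;       /* payload length */
        char  *data;        /* payload of length count */
    };

    static int cmp_path_offset(const void *a, const void *b)
    {
        const struct gas_req *x = a, *y = b;
        int c = strcmp(x->path, y->path);
        if (c) return c;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    /* Arranges n requests in place and returns the new, possibly smaller count. */
    size_t gas_arrange(struct gas_req *reqs, size_t n)
    {
        if (n == 0)
            return 0;
        qsort(reqs, n, sizeof(*reqs), cmp_path_offset);

        size_t out = 0;
        for (size_t i = 1; i < n; i++) {
            struct gas_req *prev = &reqs[out], *cur = &reqs[i];
            if (strcmp(prev->path, cur->path) == 0 &&
                prev->offset + (off_t)prev->count == cur->offset) {
                /* Contiguous with the previous request: append the payload. */
                prev->data = realloc(prev->data, prev->count + cur->count);
                memcpy(prev->data + prev->count, cur->data, cur->count);
                prev->count += cur->count;
                free(cur->data);
            } else {
                reqs[++out] = *cur;           /* keep as a separate request */
            }
        }
        return out + 1;                       /* merged request count */
    }

This is the step that produces the request-count reductions reported later in Table III.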
3) Scattering: The arranged I/O requests are scattered to the corresponding I/O servers in parallel. Each I/O server handles the requests it receives synchronously, one after another.

C. Implementation

We have implemented the Parallel Gather-Arrange-Scatter (PGAS) file system to confirm that the GAS architecture improves I/O performance on multi-core clusters. The PGAS file system is implemented on the Linux operating system as a C/C++ program of about 6,000 lines.

Currently, read requests are served synchronously in the PGAS prototype; only write requests are gathered by the dispatchers. The PGAS file system also has no metadata server at present and requires a backend NFS server for that purpose. These aspects will be enhanced in the future. A system call hooking library is used to intercept the system calls related to file system operations, which enables the file system to be implemented entirely in user space.

1) Consistency Model: Next we discuss the consistency model of the PGAS file system. Most scientific applications open files in write-only mode, so the PGAS file system relaxes the POSIX I/O atomicity semantics and concentrates on achieving high performance in write-only mode. The implementation therefore gathers the I/O requests of files opened in write-only mode and ensures that the data has been written to disk by the time close() on the file descriptor returns. To enforce this, the application is blocked in the close() system call until all of its I/O requests have been handled by the I/O servers. Because the requests from a dispatcher to an I/O server are sent synchronously, the application only has to ask the local dispatcher whether all of its requests have been sent; to answer this, the dispatcher keeps an unhandled request count for each process. A sketch of this counter is given below.
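A minimal sketch of that per-process bookkeeping, assuming a pthreads-based dispatcher link; the counter type and function names are illustrative, not the PGAS code.

    /* Sketch of the relaxed consistency rule: close() on a gathered descriptor
     * blocks until the local dispatcher reports that every request this process
     * issued has been handled by the I/O servers. */
    #include <pthread.h>

    struct gas_counter {
        pthread_mutex_t lock;
        pthread_cond_t  drained;
        long            unhandled;      /* requests queued but not yet acked */
    };

    #define GAS_COUNTER_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

    void gas_request_queued(struct gas_counter *c)
    {
        pthread_mutex_lock(&c->lock);
        c->unhandled++;
        pthread_mutex_unlock(&c->lock);
    }

    void gas_request_acked(struct gas_counter *c)   /* called by dispatcher link */
    {
        pthread_mutex_lock(&c->lock);
        if (--c->unhandled == 0)
            pthread_cond_broadcast(&c->drained);
        pthread_mutex_unlock(&c->lock);
    }

    /* Called from the hooked close() before the real close() is issued. */
    void gas_wait_until_drained(struct gas_counter *c)
    {
        pthread_mutex_lock(&c->lock);
        while (c->unhandled > 0)
            pthread_cond_wait(&c->drained, &c->lock);
        pthread_mutex_unlock(&c->lock);
    }

In this sketch the hooked close() calls gas_wait_until_drained() before issuing the real close(), which is what makes the relaxed, write-only consistency model safe for the applications considered here.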
III. EXPERIMENTAL RESULTS

This section describes our experiments and the results obtained to evaluate the performance of the PGAS file system. We run an I/O-intensive benchmark on the PGAS file system and on other existing file systems.

A. Cluster Environment

For the experiments we use eight compute nodes and one control node. The network configuration of the experimental cluster is shown in Figure 3, and the specifications of the nodes are shown in Table I. All compute nodes are connected with a 10 Gbps Myrinet [8] network and also with a 1 Gbps Ethernet network. On the PGAS file system, the communication between the dispatchers and the I/O servers goes through the Ethernet network. The disk bandwidth was measured with hdparm [9].

[Fig. 3. Cluster network configuration.]

TABLE I
SPECIFICATIONS OF THE CLUSTER

8 Compute Nodes
  CPU            Opteron 2212 HE (dual-core, 2.0 GHz) x 2
  Memory         6 GB
  HDD            Serial ATA disk (50.28 to 53.21 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O scheduler  CFQ

1 Control Node
  CPU            Opteron 2218 HE (dual-core, 2.6 GHz) x 2
  Memory         16 GB
  HDD            Serial ATA disk (50.61 MB/sec)
  OS             Linux 2.6.18-8.1.8.el5 SMP
  I/O scheduler  CFQ

1) Software Environment: We use MPICH2 [10] version 1.0.5 as the MPI library. All software, including MPICH2 and the benchmarks, is compiled with the Intel C Compiler, version 10.0. The other file systems and dependent libraries are compiled with GCC, version 4.1.1. MPICH2 is configured to use the Myrinet-10G network.

B. Other File Systems for Comparison

We run the benchmarks on three other file systems: local disks, the Network File System (NFS) in asynchronous mode [11], [12], and the Parallel Virtual File System 2 (PVFS2) [13], [14]. Next, we describe the configuration of each file system.

1) Local Disks: We use local disks to measure the I/O throughput when each process issues requests directly to the operating system. There is no metadata operation in this case.
2) NFS: NFS cannot export multiple disks as a single file system, so in the NFS case we use only the single disk of the control node; the total disk bandwidth is therefore lower than for the other file systems. We configure NFS in asynchronous mode to achieve high performance, but we found that the results were sometimes inconsistent.

3) PVFS2: We use PVFS2 (version 2.7.0) with the default configuration generated by pvfs2-genconfig. The benchmarks access PVFS2 through its VFS kernel module. The I/O servers run on the compute nodes, which are also used for computation. The metadata server always runs on the control node, not on a compute node, to avoid CPU contention between the application and the metadata server.

C. BTIO Benchmark

A benchmark for evaluating parallel write performance is provided by the Block-Tridiagonal (BT) NAS Parallel Benchmark (NPB), version 3.3, known as BTIO [15]. Each processor is responsible for multiple Cartesian subsets of the entire data set, whose number increases as the square root of the number of processors participating in the computation. BTIO offers a choice of MPI-IO with collective buffering, MPI-IO without collective buffering, or Fortran I/O. We choose Fortran I/O to show that the file system layer alone can meet the demands of existing applications that do not use MPI-IO. In this mode the application issues a tremendous number of small (40-byte) write requests, so the system call overhead also matters.

BTIO is provided with different problem sizes (A-D). We use classes A to C, whose aggregate write amounts for a complete run are 0.4 GB, 1.6 GB, and 6.8 GB, respectively. Because the aggregate write amount is fixed, the amount written by each process decreases as the number of processes increases. To obtain precise results, we cleared the buffer cache of all nodes before each benchmark execution. The BTIO workload is a typical I/O pattern of scientific applications: compute phases and I/O phases alternate.

D. Comparison with Other File Systems

We compared the performance of the PGAS file system with the other file systems. Table II shows the results of the BTIO benchmark on eight compute nodes with 16 processes: the total execution time and the write bandwidth of BTIO classes A to C on each file system. The block size of the PGAS file system is 64 KB, and the sub-buffer size is also 64 KB. Local disks, PVFS2, and the PGAS file system use the disks of the compute nodes.

TABLE II
BTIO BENCHMARK RESULTS (8 NODES / 16 PROCESSES)

Total Execution Time (sec)
  Class       Local    NFS      PVFS2    PGAS FS
  A (0.4 GB)  57.82    686.41   385.20   65.22
  B (1.6 GB)  114.48   1858.21  1404.58  241.74
  C (6.8 GB)  422.16   N/A      5358.19  1123.54

Write Bandwidth (MB/sec)
  Class       Local  NFS   PVFS2  PGAS FS
  A (0.4 GB)  25.46  0.65  1.18   341.31
  B (1.6 GB)  38.57  1.22  1.30   321.24
  C (6.8 GB)  39.39  N/A   1.36   338.19

The PGAS file system shows a performance improvement over the other file systems. NFS and PVFS2 are relatively slow for this benchmark because the cost of calling write() on these file systems appears to be very high. With class C, the NFS server crashed for an undetermined reason. In the PVFS2 case, the client-side daemons and the I/O servers consume a large amount of CPU time throughout the run.

The write throughput of the PGAS file system does not correspond to the actual I/O speed, because a gathered write is effectively a memory copy. The PGAS file system is nevertheless not faster than the local disks, even though computation and I/O can be overlapped; this indicates a bottleneck in the GAS phases, specifically in the dispatcher, which we need to investigate in more detail.

E. The Effect of the Arrange Phase

Next, we measured the effectiveness of the arrange phase by comparing three quantities with and without arrangement: the number of lseek() calls across all I/O servers, the number of I/O requests, and the total execution time. We ran the class A-C benchmarks on eight compute nodes with sixteen processes. The block size is 64 KB, and the sub-buffer size is also 64 KB. As mentioned in Section II-B, the PGAS file system sorts the requests by path and offset in ascending order and then merges the contiguous requests in the arrange phase.

TABLE III
THE IMPACT OF REQUEST ARRANGEMENT, BTIO (8 NODES / 16 PROCESSES)

Number of lseek() calls at I/O servers
  Class       Not Arranged  Arranged   Reduced
  A (0.4 GB)  2,287,540     666,259    70.9%
  B (1.6 GB)  8,512,074     1,713,626  79.9%
  C (6.8 GB)  27,801,033    4,374,196  84.3%

Number of I/O Requests
  Class       Not Arranged  Arranged    Reduced
  A (0.4 GB)  10,490,880    666,575     93.6%
  B (1.6 GB)  42,468,984    3,057,766   92.8%
  C (6.8 GB)  170,142,519   11,229,406  93.4%

Total Execution Time (sec)
  Class       Not Arranged  Arranged  Reduced
  A (0.4 GB)  74.71         65.22     12.7%
  B (1.6 GB)  260.47        241.74    7.2%
  C (6.8 GB)  1281.92       1123.54   12.4%

Table III shows the results. Between 70.9% and 84.3% of the lseek() calls on the eight I/O servers are eliminated. Additionally, about 93% of the requests are eliminated; that is, roughly 14 to 16 incoming requests are merged into one request on average (the short calculation below recovers this factor from Table III). As a consequence, between 7.2% and 12.7% of the total execution time is saved.
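For completeness, this small C program recomputes the average merge factor directly from the request counts in Table III; it is a check on the numbers, not part of the benchmark harness.

    /* Recompute the average merge factor from Table III: how many incoming
     * requests end up in one arranged request for each BTIO class. */
    #include <stdio.h>

    int main(void)
    {
        const char   *cls[]    = { "A", "B", "C" };
        const double  before[] = { 10490880.0, 42468984.0, 170142519.0 };
        const double  after[]  = { 666575.0, 3057766.0, 11229406.0 };

        for (int i = 0; i < 3; i++)
            printf("class %s: %.1f requests merged into one on average\n",
                   cls[i], before[i] / after[i]);   /* 15.7, 13.9, 15.2 */
        return 0;
    }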
IV. RELATED WORK

A. Distributed/Parallel File Systems

There are several studies of network-shared file systems for cluster or grid environments.

1) PVFS: PVFS [13], [14] (Parallel Virtual File System) is designed to provide a high-performance file system for parallel applications that perform large I/O and many file accesses. PVFS consists of metadata servers and I/O servers. The I/O servers store the data, striped across multiple servers in round-robin fashion. The metadata servers handle metadata for the files, such as permissions, timestamps, and size. PVFS provides a kernel VFS module, which allows existing UNIX applications (e.g., cp, cat) to run on the PVFS file system without any code modification. PVFS supports several network types through an abstraction known as the Buffered Messaging Interface (BMI), with implementations for TCP/IP, InfiniBand, and Myrinet. Its storage relies on UNIX files for the file data and a Berkeley DB database for the metadata. PVFS relaxes the POSIX atomicity semantics to improve stability and performance.

In PVFS, each application process may issue multiple requests to the file system simultaneously, so the problem described in Section II-A can occur. Because PVFS also assumes a compute/storage cluster, the GAS architecture could easily be incorporated into PVFS to improve its I/O performance.

B. MPI-IO

The Message Passing Interface (MPI), version 2, includes an interface for file I/O operations called MPI-IO, which focuses on concurrent access to shared files. MPI-IO has a collective I/O feature, in which all processes are explicitly synchronized to exchange access information and generate a better I/O strategy. There are also approaches that incorporate asynchronous I/O into MPI-IO; the following subsections describe such techniques. Because MPI-IO has an explicit synchronization point, it can order I/O more optimally than a file system, which has no such synchronization point. However, many existing applications still use the POSIX I/O interface, so our file system approach has the advantage over MPI-IO that all existing applications can gain the performance benefit.

1) Two-Phase I/O: Two-Phase I/O [16] is an optimization technique that exploits an explicit synchronization point. Consider the case of reading a shared file from different processes. First, designated I/O aggregators collect an aggregate access region that covers the I/O requests from all the MPI processes. Next, in the I/O phase, the I/O aggregators issue large I/O requests calculated from the collected request information. Finally, in the communication phase, the read data is distributed through the network to the proper processes. Since the network is much faster than I/O calls, this approach can obtain high performance.

2) Data Sieving: To achieve low I/O latency, it is crucial to issue as few requests as possible to the file system. Data Sieving [7] is a technique that issues a few large contiguous requests to the file system even when the I/O requests from the application consist of many small, non-contiguous requests. Its disadvantage is that it must read more data than needed and requires extra memory; the application can control this behavior through a runtime parameter. A minimal read-side sketch is given below.
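As a rough illustration of the idea (not ROMIO's actual implementation), the following C sketch performs read-side data sieving: one large pread() covers a set of small, sorted, non-contiguous pieces, which are then copied out of the temporary buffer. The piece list and function names are hypothetical, and short reads and error handling are ignored for brevity.

    /* Read-side data sieving: issue one large contiguous read that covers all
     * requested pieces, then scatter the wanted bytes into the user buffers.
     * The extra memory for tmp is the cost of the technique. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    struct piece { off_t offset; size_t len; void *dst; };

    /* pieces[] must be sorted by offset and non-overlapping. */
    int sieved_read(int fd, const struct piece *pieces, int n)
    {
        if (n == 0)
            return 0;

        off_t  lo   = pieces[0].offset;
        off_t  hi   = pieces[n - 1].offset + (off_t)pieces[n - 1].len;
        size_t span = (size_t)(hi - lo);

        char *tmp = malloc(span);                 /* extra memory requirement */
        if (!tmp)
            return -1;
        if (pread(fd, tmp, span, lo) < 0) {       /* one large request */
            free(tmp);
            return -1;
        }
        for (int i = 0; i < n; i++)               /* copy out the wanted pieces */
            memcpy(pieces[i].dst, tmp + (pieces[i].offset - lo), pieces[i].len);

        free(tmp);
        return 0;
    }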
3) Two-Phase Write-Behind Buffering: Two-phase write-behind buffering [17] is a technique to improve the parallel write performance of scientific applications. It is used for the efficient distribution of requests to remote processes: I/O requests are issued by the application asynchronously and collected into sub-buffers, each corresponding to a remote process. Once a sub-buffer is full, the buffered requests are flushed to the remote process. With this two-stage buffering scheme, an MPI program can achieve high write bandwidth on some scientific applications.

4) MTIO: More et al. implemented a multi-threaded MPI-based I/O library called MTIO [18]. MTIO provides thread-based asynchronous I/O to improve the collective I/O performance of scientific applications. The paper reported that MTIO can overlap up to 80% of the I/O phase with the computation phase on an IBM Scalable POWERparallel System (SP).

V. CONCLUSION

To achieve high I/O performance in parallel file systems, it is critical that I/O nodes handle incoming requests with few disk seeks and that the number of requests be reduced. To meet those requirements, a node-level request reordering architecture, called the Gather-Arrange-Scatter (GAS) architecture, has been proposed in this paper. In GAS, I/O requests from applications are gathered once at each compute node, then arranged and combined to minimize the I/O cost, and finally scattered to the remote I/O nodes. With this scheme, both the disk seek time at the I/O nodes and the number of requests are reduced.

A prototype of the GAS architecture, the Parallel Gather-Arrange-Scatter (PGAS) file system, has been implemented on the Linux operating system, and the BTIO NAS Parallel Benchmark was used to evaluate its performance. The GAS architecture reduces up to 84.3% of the lseek() calls and up to 93.6% of the requests at the I/O nodes, which results in up to a 12.7% performance improvement compared to the non-arranged case.

Gather-Arrange-Scatter achieves compute-node-side request reordering, but I/O-server-side reordering is also needed to avoid further disk seeks. In the future, we will develop a global I/O scheduling scheme for parallel file systems by combining client-side and server-side request scheduling.
The experiments and evaluation should also be extended to a wider range of I/O benchmarks.

REFERENCES

[1] D. Teaff, D. Watson, and B. Coyne, "The architecture of the High Performance Storage System (HPSS)," in Proceedings of the 1995 Goddard Conference on Mass Storage Systems and Technologies, 1995.
[2] F. Schmuck and R. Haskin, "GPFS: A shared-disk file system for large computing clusters," in Proc. of the First Conference on File and Storage Technologies (FAST), Jan. 2002, pp. 231-244.
[3] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm architecture for petascale data intensive computing," in Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), 2002.
[4] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi, "Gfarm v2: A grid file system that supports high-performance distributed and parallel data computing," in Proceedings of the 2004 Computing in High Energy and Nuclear Physics (CHEP04), 2004.
[5] "Lustre file system," http://lustre.org/.
[6] E. Smirni and D. A. Reed, "Lessons from characterizing the input/output behavior of parallel scientific applications," Performance Evaluation, vol. 33, no. 1, pp. 27-44, 1998.
[7] R. Thakur, W. Gropp, and E. Lusk, "Optimizing noncontiguous accesses in MPI-IO," Parallel Computing, vol. 28, no. 1, pp. 83-105, 2002.
[8] "Myrinet," http://www.myri.com/.
[9] "hdparm," http://sourceforge.net/projects/hdparm/.
[10] Argonne National Laboratory, "MPICH2: High-performance and widely portable MPI," http://www.mcs.anl.gov/research/projects/mpich2/.
[11] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and implementation of the Sun Network Filesystem," in Proc. Summer 1985 USENIX Conf., Portland, OR, 1985, pp. 119-130.
[12] B. Pawlowski, S. Shepler, C. Beame, B. Callaghan, M. Eisler, D. Noveck, D. Robinson, and R. Thurlow, "The NFS version 4 protocol," 2000.
[13] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta. Berkeley, CA, USA: USENIX Association, 2000, pp. 28-28.
[14] The PVFS2 Community, "Parallel Virtual File System, version 2," http://www.pvfs.org/.
[15] P. Wong and R. F. Van der Wijngaart, "NAS Parallel Benchmarks I/O version 2.4," NAS Technical Report NAS-03-002, NASA Ames Research Center, Moffett Field, CA 94035-1000.
[16] J. M. del Rosario, R. Bordawekar, and A. Choudhary, "Improved parallel I/O via a two-phase run-time access strategy," SIGARCH Comput. Archit. News, vol. 21, no. 5, pp. 31-38, 1993.
[17] W.-k. Liao, A. Ching, K. Coloma, A. Nisar, A. Choudhary, J. Chen, R. Sankaran, and S. Klasky, "Using MPI file caching to improve parallel write performance for large-scale scientific applications," in Proceedings of SC '07, 2007.
[18] S. More, A. N. Choudhary, I. T. Foster, and M. Q. Xu, "MTIO - a multi-threaded parallel I/O system," in IPPS '97: Proceedings of the 11th International Symposium on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 1997, pp. 368-373.