Optimization Techniques at the I/O Forwarding Layer
Presentation at Cluster 2010

1. Cluster 2010 Presentation
   Optimization Techniques at the I/O Forwarding Layer
   Kazuki Ohta (presenter): Preferred Infrastructure, Inc., University of Tokyo
   Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross: Argonne National Laboratory
   Yutaka Ishikawa: University of Tokyo
   Contact: kazuki.ohta@gmail.com
2. Background: Compute and Storage Imbalance
   • Leadership-class computational scale:
     • 100,000+ processes
     • Advanced multi-core architectures, compute-node OSs
   • Leadership-class storage scale:
     • 100+ servers
     • Commercial storage hardware, cluster file systems
   • Current leadership-class machines supply only 1 GB/s of storage throughput for every 10 TF of compute performance, and this gap has grown by a factor of 10 in recent years.
   • Bridging this imbalance between compute and storage is a critical problem for large-scale computation.
3. Previous Studies: Current I/O Software Stack
   • Parallel/Serial Applications
   • High-Level I/O Libraries (HDF5, NetCDF, ADIOS): storage abstraction, data portability
   • POSIX I/O / MPI-IO (ROMIO), VFS, FUSE: organizing accesses from many clients
   • Parallel File Systems (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.): logical file system over many storage devices
   • Storage Devices
4. Challenge: Millions of Concurrent Clients
   • 1,000,000+ concurrent clients present a challenge to the current I/O stack
     • e.g., metadata performance, locking, the network incast problem, etc.
   • An I/O Forwarding Layer is introduced:
     • All I/O requests are delegated to a dedicated I/O forwarder process.
     • The I/O forwarder reduces the number of clients seen by the file system for all applications, without requiring collective I/O.
   [Figure: I/O path from many compute processes through I/O forwarders to the PVFS2 parallel file system servers]
5. I/O Software Stack with I/O Forwarding
   • Parallel/Serial Applications
   • High-Level I/O Libraries
   • POSIX I/O / MPI-IO, VFS, FUSE
   • I/O Forwarding (IBM ciod, Cray DVS, IOFSL): bridge between compute processes and the storage system
   • Parallel File Systems
   • Storage Devices
6. Example I/O System: Blue Gene/P Architecture
   [Figure: Blue Gene/P system architecture]
7. I/O Forwarding Challenges
   • Large requests
     • Latency of the forwarding
     • Memory limit of the I/O node
     • Variety of backend file system performance
   • Small requests
     • The current I/O forwarding mechanism reduces the number of clients, but does not reduce the number of requests.
     • Request processing overheads at the file systems
   • We propose two optimization techniques for the I/O forwarding layer:
     • Out-Of-Order I/O Pipelining, for large requests
     • I/O Request Scheduler, for small requests
8. Out-Of-Order I/O Pipelining
   • Split large I/O requests into small fixed-size chunks (see the sketch after this slide).
   • These chunks are forwarded out of order by the IOFSL threads.
   • Good points:
     • Reduces forwarding latency by overlapping the I/O requests and the network transfer.
     • I/O sizes are not limited by the memory size of the forwarding node.
     • Little effect from the slowest file system node.
   [Figure: message timelines between client, IOFSL (threads), and file system, comparing no pipelining with out-of-order pipelining]
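To make the pipelining idea concrete, here is a minimal sketch, not the IOFSL implementation: a large write is cut into fixed-size chunks and each chunk is handled by a worker thread, so chunks can complete in any order. The chunk size, thread count, and the POSIX pwrite() backend are illustrative assumptions.

    /*
     * Minimal sketch of out-of-order I/O pipelining (not the IOFSL code):
     * a large write is split into fixed-size chunks and each chunk is
     * handed to a worker thread, so receipt of one chunk can overlap the
     * file system write of another.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define CHUNK_SIZE (512 * 1024)   /* fixed-size pipeline chunk */
    #define NUM_THREADS 4

    struct chunk {
        int fd;
        const char *buf;
        size_t len;
        off_t offset;
    };

    static void *write_chunk(void *arg)
    {
        struct chunk *c = arg;
        ssize_t done = 0;
        /* pwrite() carries its own offset, so chunks may finish in any order */
        while ((size_t)done < c->len) {
            ssize_t n = pwrite(c->fd, c->buf + done, c->len - done, c->offset + done);
            if (n < 0) { perror("pwrite"); break; }
            done += n;
        }
        return NULL;
    }

    /* Forward one large request as a pipeline of independent chunks. */
    static void forward_large_write(int fd, const char *buf, size_t len, off_t off)
    {
        size_t nchunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
        pthread_t tids[NUM_THREADS];
        struct chunk chunks[NUM_THREADS];

        for (size_t i = 0; i < nchunks; i += NUM_THREADS) {
            size_t batch = (nchunks - i < NUM_THREADS) ? nchunks - i : NUM_THREADS;
            for (size_t j = 0; j < batch; j++) {
                size_t start = (i + j) * CHUNK_SIZE;
                chunks[j].fd = fd;
                chunks[j].buf = buf + start;
                chunks[j].len = (len - start < CHUNK_SIZE) ? len - start : CHUNK_SIZE;
                chunks[j].offset = off + start;
                pthread_create(&tids[j], NULL, write_chunk, &chunks[j]);
            }
            for (size_t j = 0; j < batch; j++)
                pthread_join(tids[j], NULL);
        }
    }

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;
        char *buf = malloc(len);
        memset(buf, 'x', len);
        int fd = open("pipeline_demo.out", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        forward_large_write(fd, buf, len, 0);
        close(fd);
        free(buf);
        return 0;
    }

Because every chunk carries its own file offset, no ordering between chunks is required, which is what lets the forwarder overlap network transfer of one chunk with the file system write of another.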
9. I/O Request Scheduler
   • Schedules and merges small requests at the forwarder
     • Reduces the number of seeks
     • Reduces the number of requests the file systems actually see
   • Scheduling overhead must be minimal
     • Handle-Based Round-Robin algorithm for fairness between files
     • Ranges are managed by an interval tree
     • Contiguous requests are merged (see the sketch after this slide)
   [Figure: per-handle request queues; the scheduler picks N requests round-robin across handles and issues the I/O]
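The merging step can be illustrated with a small sketch, again not the IOFSL code: pending requests for one file handle are sorted by offset and coalesced when they touch, so the file system sees fewer, larger, contiguous requests. IOFSL itself keeps the ranges in an interval tree and picks requests from the per-handle queues round-robin; the array-plus-qsort() version below is only for illustration.

    /*
     * Minimal sketch of request merging: sort pending requests for one
     * handle by offset and coalesce adjacent or overlapping ranges.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct req { long offset; long length; };

    static int cmp_offset(const void *a, const void *b)
    {
        const struct req *x = a, *y = b;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    /* Coalesce sorted requests in place; returns the merged count. */
    static int merge_requests(struct req *reqs, int n)
    {
        if (n == 0) return 0;
        qsort(reqs, n, sizeof(*reqs), cmp_offset);
        int m = 0;                                /* index of the last merged range */
        for (int i = 1; i < n; i++) {
            long end = reqs[m].offset + reqs[m].length;
            if (reqs[i].offset <= end) {          /* contiguous or overlapping */
                long new_end = reqs[i].offset + reqs[i].length;
                if (new_end > end)
                    reqs[m].length = new_end - reqs[m].offset;
            } else {
                reqs[++m] = reqs[i];              /* gap: start a new range */
            }
        }
        return m + 1;
    }

    int main(void)
    {
        struct req pending[] = {
            { 4096, 4096 }, { 0, 4096 }, { 8192, 4096 }, { 65536, 4096 },
        };
        int n = merge_requests(pending, 4);
        for (int i = 0; i < n; i++)
            printf("issue I/O: offset=%ld length=%ld\n",
                   pending[i].offset, pending[i].length);
        return 0;   /* prints two ranges: [0,12288) and [65536,69632) */
    }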
10. I/O Forwarding and Scalability Layer (IOFSL)
   • IOFSL Project [Nawab 2009]
     • Open-source I/O forwarding implementation
     • http://www.iofsl.org/
     • Portable to most HPC environments
   • Network independent
     • All network communication is done through BMI [Carns 2005]
     • TCP/IP, InfiniBand, Myrinet, Blue Gene/P tree network, Portals, etc.
   • File system independent
   • MPI-IO (ROMIO) / FUSE clients
11. IOFSL Software Stack
   [Figure: Application → POSIX interface (FUSE) or MPI-IO interface (ROMIO-ZOIDFS) → ZOIDFS client API → BMI (ZOIDFS protocol over TCP, InfiniBand, Myrinet, etc.) → IOFSL server with file system dispatcher → PVFS2 or libsysio]
   • Out-of-order I/O pipelining and the I/O request scheduler have been implemented in IOFSL and evaluated in two environments:
     • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P)
12. Evaluation on T2K: Spec
   • T2K Open Supercomputer, Tokyo site (32-node research cluster)
     • http://www.open-supercomputer.org/
     • 16 cores per node: 4 × quad-core 2.3 GHz Opterons
     • 32 GB memory
     • 10 Gbps Myrinet network
     • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
   • One IOFSL server, four PVFS2 servers, 128 MPI processes
   • Software
     • MPICH2 1.1.1p1
     • PVFS2 CVS (close to 2.8.2)
     • Both configured to use Myrinet
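A rough bound derived from the numbers above, assuming each PVFS2 server writes to one local SATA disk: four servers can sustain at most about 4 × 39.76 ≈ 159 MB/s of aggregate write bandwidth, which is the ceiling against which the IOR results on the next slides should be read.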
13. Evaluation on T2K: IOR Benchmark
   • Each process issues the same total amount of I/O.
   • The message size is gradually increased to observe the change in bandwidth (see the sketch after this slide).
   • Note: IOR was modified to call fsync() for MPI-IO.
   [Figure: access pattern with each process writing blocks of the given message size]
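The access pattern can be sketched as a small MPI-IO program, assuming a fixed per-process volume, a doubling message size, and the fsync() (MPI_File_sync) the slide mentions; it is a stand-in for the modified IOR, not the benchmark itself, and the sizes and file name are arbitrary.

    /*
     * IOR-like measurement loop: every process writes the same total
     * amount of data, the per-request message size doubles each round,
     * the file is synced, and aggregate bandwidth is reported.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TOTAL_PER_PROC (64L * 1024 * 1024)   /* fixed I/O volume per process */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (long msgsize = 4 * 1024; msgsize <= 16L * 1024 * 1024; msgsize *= 2) {
            char *buf = malloc(msgsize);
            memset(buf, 'x', msgsize);

            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "ior_like.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();

            long nreqs = TOTAL_PER_PROC / msgsize;
            for (long i = 0; i < nreqs; i++) {
                /* segmented layout: each process owns a disjoint file region */
                MPI_Offset off = (MPI_Offset)rank * TOTAL_PER_PROC + i * msgsize;
                MPI_File_write_at(fh, off, buf, (int)msgsize, MPI_CHAR,
                                  MPI_STATUS_IGNORE);
            }
            MPI_File_sync(fh);              /* the fsync() the slide mentions */
            MPI_Barrier(MPI_COMM_WORLD);
            double t1 = MPI_Wtime();

            MPI_File_close(&fh);
            free(buf);

            if (rank == 0)
                printf("msgsize %8ld KB  bandwidth %.1f MB/s\n", msgsize / 1024,
                       (double)TOTAL_PER_PROC * nprocs / (t1 - t0) / 1e6);
        }

        MPI_Finalize();
        return 0;
    }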
14-16. Evaluation on T2K: IOR Benchmark, 128 processes (chart shown in three builds)
   [Figure: bandwidth (MB/s) vs. message size (4 KB to 65536 KB) for ROMIO PVFS2, IOFSL none, and IOFSL hbrr]
   • Out-of-order pipelining improvement: 29.5%
   • I/O scheduler improvement: 40.0%
17. Evaluation on Blue Gene/P: Spec
   • Argonne National Laboratory BG/P “Surveyor”
     • Blue Gene/P platform for research and development
     • 1024 nodes, 4096 cores
     • Four PVFS2 servers
     • DataDirect Networks S2A9550 SAN
   • 256 compute nodes with 4 I/O nodes were used.
   [Figure: Blue Gene/P packaging hierarchy (node card: 4 cores; node board: 128 cores; rack: 4096 cores)]
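A note derived from the figures above: 256 compute nodes served by 4 I/O nodes means each I/O forwarder aggregates traffic from 64 compute nodes, i.e. up to 256 client processes when all four cores per node are used, which is the fan-in the forwarder optimizations have to handle.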
18. Evaluation on BG/P: BMI PingPong
   [Figure: bandwidth (MB/s) vs. buffer size (1 KB to 4096 KB) for BMI over TCP/IP and BMI over ZOID (CNK, BG/P tree network)]
19-21. Evaluation on BG/P: IOR Benchmark, 256 nodes (chart shown in three builds)
   [Figure: bandwidth (MiB/s) vs. message size (1 KB to 65536 KB) for CIOD and IOFSL FIFO (32 threads)]
   • Performance improvement: 42.0%
   • Performance drop: -38.5%
22-23. Evaluation on BG/P: Thread Count Effect (chart shown in two builds)
   [Figure: bandwidth (MiB/s) vs. message size (1 KB to 65536 KB) for IOFSL FIFO with 16 threads and with 32 threads]
   • 16 threads outperform 32 threads.
24. Related Work
   • Computational Plant project @ Sandia National Laboratory
     • First introduced an I/O forwarding layer
   • IBM Blue Gene/L, Blue Gene/P
     • All I/O requests are forwarded to I/O nodes
     • The compute-node OS can be stripped down to minimal functionality, reducing OS noise
   • ZOID: I/O forwarding project [Kamil 2008]
     • Blue Gene only
   • Lustre Network Request Scheduler (NRS) [Qian 2009]
     • Request scheduler at the parallel file system nodes
     • Simulation results only
25. Future Work
   • Event-driven server architecture
     • Reduced thread contention
   • Collaborative caching at the I/O forwarding layer
     • Multiple I/O forwarders work collaboratively to cache data and metadata
   • Hints from MPI-IO (see the sketch after this slide)
     • Better cooperation with collective I/O
   • Evaluation on other leadership-scale machines
     • ORNL Jaguar, Cray XT4/XT5 systems
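As a pointer to what "hints from MPI-IO" means at the application level, the sketch below passes an MPI_Info object with ROMIO's standard collective-buffering hints to MPI_File_open; how an I/O forwarding layer would consume such hints is exactly the future work above, so this only illustrates the existing passing mechanism, and the file name and hint values are arbitrary.

    /*
     * Minimal sketch of the MPI-IO hint mechanism: the application attaches
     * key/value hints to the file handle at open time, and lower layers
     * (ROMIO today, possibly an I/O forwarder in the future) may honor them.
     */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* collective buffering */
        MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB buffer */
        MPI_Info_set(info, "cb_nodes", "4");               /* aggregator count */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "hinted_file.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... I/O using the hinted file handle ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }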
26. Conclusions
   • Implemented and evaluated two optimization techniques at the I/O forwarding layer:
     • I/O pipelining that overlaps the file system requests and the network communication.
     • An I/O scheduler that reduces the number of independent, non-contiguous file system accesses.
   • Demonstrated a portable I/O forwarding layer and compared its performance with existing HPC I/O software stacks.
   • Two environments:
     • T2K Tokyo Linux cluster
     • ANL Blue Gene/P Surveyor
   • First I/O forwarding evaluations on a Linux cluster and Blue Gene/P
   • First comparison of the IBM Blue Gene/P I/O stack with an open-source I/O stack
27. Thanks!
   Kazuki Ohta (presenter): Preferred Infrastructure, Inc., University of Tokyo
   Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross: Argonne National Laboratory
   Yutaka Ishikawa: University of Tokyo
   Contact: kazuki.ohta@gmail.com
