Efficient Support for MPI-I/O Atomicity
              Based on Versioning

Viet-Trung Tran1, Bogdan Nicolae2, Gabriel Antoniu2, Luc Bougé1
                   KerData Research Team

                 1 ENS Cachan, IRISA, France
                 2 INRIA, IRISA, Rennes, France




                                                                  1
Context: Data-Intensive Large-Scale HPC Simulations

 Large-scale simulations of natural phenomena
 Highly parallel platforms
 I/O challenges
    High I/O performance
    Huge data sizes (~PB)
    High concurrency




                                                 2
Data Access Pattern

 Spatial splitting for parallelization
    Ghost cells
 Application data model vs. storage model (a flat sequence of bytes); mapping sketched below
 Concurrent overlapping non-contiguous I/O
    Requires atomicity guarantees
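A minimal sketch of that mapping, with purely illustrative array and subdomain sizes (none of these numbers come from the talk): each local row of a process's rectangular subdomain becomes one contiguous segment in the flat, row-major file, so a single subdomain write turns into a list of (offset, size) pairs.

/* Minimal sketch (illustrative sizes): map one process's 2D subdomain of a
 * row-major global array onto non-contiguous byte ranges of a flat file,
 * one segment per local row. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t NX   = 1024;              /* global row length (elements)  */
    const size_t elem = sizeof(double);    /* element size                  */
    /* hypothetical subdomain owned by one process, ghost cells excluded:
     * rows [256, 512), columns [0, 256) */
    const size_t row0 = 256, rows = 256, col0 = 0, cols = 256;

    size_t *offsets = malloc(rows * sizeof *offsets);
    size_t *sizes   = malloc(rows * sizeof *sizes);

    for (size_t r = 0; r < rows; r++) {    /* one contiguous run per row    */
        offsets[r] = ((row0 + r) * NX + col0) * elem;
        sizes[r]   = cols * elem;
    }
    printf("%zu segments, first at offset %zu, each %zu bytes\n",
           rows, offsets[0], sizes[0]);

    free(offsets);
    free(sizes);
    return 0;
}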


                                              3
Goal:

High throughput non-contiguous I/O
    under atomicity guarantees




                                     4
State of the Art

 Locking-based approaches to ensure atomicity
 3 levels of implementation
    Application
    MPI-I/O
    Storage

          [Figure: the parallel I/O stack]
           Application (Visit, Tornado simulation)
           Data model (HDF5, NetCDF)
           MPI-IO middleware
           Parallel file systems (PVFS, GPFS, Lustre)



                                                                       5
Our Approach

 Dedicated interface for atomic non-contiguous I/O
    Provides atomicity guarantees at the storage level
    No need to translate MPI consistency semantics into the storage consistency model

 Shadowing as the key to efficient data access under concurrency
    No locking
    Concurrent overlapped writes are allowed
    Atomicity guarantees

 Data striping




                                                                         6
Building Block: BlobSeer

 A KerData project (blobseer.gforge.inria.fr)
    Data striping
    Versioning-based concurrency control
    Distributed metadata management




                                                 7
Building Block: BlobSeer (continued)

 Distributed metadata management
    Organized as a segment tree
    Distributed over a DHT (node layout sketched below)

 Two-phase I/O
    Data access
    Metadata access

          [Figure: metadata segment trees over the blob; each node is labeled
           with the (offset, size) range it covers, from the root (0, 8) down
           to unit-size leaves such as (0, 1), (1, 1), ..., (4, 1)]
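A hedged sketch of what a versioned tree node might look like when stored in the DHT; the field names and sizes are illustrative assumptions, not BlobSeer's actual data structures.

/* Hedged sketch of a versioned segment-tree node as it might be stored in
 * the metadata DHT; illustrative only. */
#include <stdint.h>

typedef struct {
    uint64_t version;      /* snapshot this node belongs to                */
    uint64_t offset;       /* start of the byte range the node covers      */
    uint64_t size;         /* length of that range                         */
} node_key_t;              /* (version, offset, size) acts as the DHT key  */

typedef struct {
    node_key_t left;       /* key of the child covering the first half     */
    node_key_t right;      /* key of the child covering the second half    */
    char chunk_id[32];     /* leaves instead reference a stored data chunk */
} node_value_t;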



                                                                                                              8
Proposal for a Non-contiguous,
Versioning-Oriented Access Interface

 Non-contiguous Write
    vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])

 Non-contiguous Read (usage sketched below)
    NONCONT_READ(id, v, buffers[], offsets[], sizes[])

 Challenges
    Non-contiguous I/O must be atomic
    Efficiency under concurrency
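A hedged usage sketch of the two primitives. The prototypes are inferred from the signatures above; the extra count argument and the concrete C types are assumptions made for illustration only.

/* Hedged usage sketch of the proposed primitives; prototypes are assumed. */
#include <stddef.h>
#include <stdint.h>

uint64_t NONCONT_WRITE(int id, void *buffers[], size_t offsets[],
                       size_t sizes[], size_t count);            /* assumed */
void     NONCONT_READ (int id, uint64_t v, void *buffers[],
                       size_t offsets[], size_t sizes[], size_t count);

void example(int blob_id)
{
    static char a[4096], b[4096];
    void  *bufs[2]    = { a, b };
    size_t offsets[2] = { 0, 1 << 20 };   /* two disjoint file regions      */
    size_t sizes[2]   = { sizeof a, sizeof b };

    /* the whole non-contiguous write lands atomically in one new snapshot;
     * vw names that snapshot */
    uint64_t vw = NONCONT_WRITE(blob_id, bufs, offsets, sizes, 2);

    /* readers ask for an explicit version and never see a partial write */
    NONCONT_READ(blob_id, vw, bufs, offsets, sizes, 2);
}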




                                                             9
1st challenge: Non-contiguous I/O Must Be Atomic

 Shadowing technique
 Isolates each non-contiguous update in one single consistent snapshot
    Done at the metadata level (toy illustration below)
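A toy, self-contained illustration of the shadowing idea (not BlobSeer's metadata code): an update allocates new tree nodes only along the paths it touches and re-links the untouched subtrees of the previous snapshot, so the new root exposes a complete, consistent snapshot the moment it is published.

/* Toy shadowing on a binary tree: copy-on-write along the update path,
 * structural sharing everywhere else. */
#include <stdlib.h>

typedef struct node {
    struct node *left, *right;
    int payload;                        /* stands in for chunk references */
} node_t;

/* return the root of a new snapshot where leaf 'pos' (0..size-1) holds 'v';
 * every node off the update path is shared with the old snapshot */
static node_t *shadow_update(const node_t *old, long pos, long size, int v)
{
    node_t *n = malloc(sizeof *n);
    if (size == 1) {                    /* leaf: fresh content            */
        n->left = n->right = NULL;
        n->payload = v;
        return n;
    }
    long half = size / 2;
    if (pos < half) {                   /* descend left, share the right  */
        n->left  = shadow_update(old ? old->left  : NULL, pos, half, v);
        n->right = old ? old->right : NULL;
    } else {                            /* descend right, share the left  */
        n->left  = old ? old->left  : NULL;
        n->right = shadow_update(old ? old->right : NULL, pos - half, half, v);
    }
    n->payload = 0;
    return n;
}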




                                                                      10
2nd challenge: Efficiency Under Concurrent Accesses


    Advantages of shadowing
       Parallel data I/O phases
       Parallel metadata I/O phases?

                              Our approach    Locking-based approach
     Overlapping data I/O     Parallel        No




                                                                       11
Minimize Ordering Overhead

 Ordering is done at the metadata level
 Concurrent writes may take effect in any arbitrary order




                                       12
Avoid Synchronization for Concurrent Segment Tree
Generation
 Delegate the generation of the shadow tree to the client side
 Shadow trees are generated in parallel thanks to predictable metadata
  node IDs (key computation sketched below)
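A hedged sketch of why the node IDs are predictable: a tree node is fully determined by its (version, offset, size) triple, so every client can compute the same DHT key locally and publish its nodes without coordinating with other writers. The FNV-1a hash below is illustrative, not the function BlobSeer actually uses.

/* Deterministic node keys: every client computes the same ID for the same
 * (version, offset, size) triple, so no synchronization is needed. */
#include <stddef.h>
#include <stdint.h>

static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 1469598103934665603ULL;        /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                  /* FNV prime        */
    }
    return h;
}

static uint64_t node_key(uint64_t version, uint64_t offset, uint64_t size)
{
    uint64_t triple[3] = { version, offset, size };
    return fnv1a(triple, sizeof triple);        /* key = hash(triple) */
}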




                                                                  13
Lazy Evaluation During Border Node Calculation

 The metadata tree is built in a bottom-up fashion
 Optimized for the non-contiguous access pattern (toy sketch below)
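A toy sketch of the lazy part, under the assumption that a border node is a node whose range is only partially covered by the written segments and therefore needs a link into the previous snapshot's subtree; that link is recorded and resolved later rather than fetched while the shadow tree is built. All names below are illustrative.

/* Toy lazy border-node handling during bottom-up shadow-tree construction. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t offset, size; } range_t;

/* does range r lie entirely inside one of the newly written segments?
 * (simplification: containment in the union is what really matters) */
static bool covered(range_t r, const range_t *written, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (written[i].offset <= r.offset &&
            r.offset + r.size <= written[i].offset + written[i].size)
            return true;
    return false;
}

typedef struct {
    range_t  range;
    bool     needs_old_subtree;   /* border node: link resolved lazily  */
    uint64_t old_version;         /* snapshot to consult when resolving */
} pending_node_t;

static pending_node_t make_node(range_t r, const range_t *written,
                                size_t n, uint64_t prev_version)
{
    pending_node_t node = { r, false, 0 };
    if (!covered(r, written, n)) {        /* only partially overwritten  */
        node.needs_old_subtree = true;    /* defer the metadata lookup   */
        node.old_version = prev_version;
    }
    return node;
}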




                                                 14
Summary: Overlapping Non-contiguous I/O

                      Our approach                            Locking-based approaches
Data I/O phases       Parallel                                Serialized
Metadata I/O phases   Close to parallel, thanks to:           Serialized
                      1. Arbitrary ordering
                      2. Ordering done at the metadata level
                      3. Client-side shadowing in parallel
                      4. Lazy evaluation




                                                                             15
Leveraging Our Versioning-Oriented Interface in
Parallel I/O Stack


              Application (Visit, Tornado simulation)


                   Data model (HDF5, NetCDF)


                        MPI-IO middleware


               Storage optimized for atomic MPI-I/O


    Integrating BlobSeer into the MPI-I/O middleware is straightforward (see the sketch below)
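A hedged sketch of that integration point: an ADIO-style backend that, given an already flattened file view (lists of offsets and sizes), forwards the whole request as a single atomic NONCONT_WRITE instead of taking a file lock. The function name blob_write_noncontig and the NONCONT_WRITE prototype are assumptions, not actual ROMIO or BlobSeer symbols.

/* Hedged integration sketch: forward a flattened non-contiguous request as
 * one atomic call to the storage layer. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t NONCONT_WRITE(int id, void *buffers[], size_t offsets[],
                       size_t sizes[], size_t count);            /* assumed */

/* offsets[]/sizes[] describe the file regions selected by the MPI file view;
 * buf is the user buffer, assumed contiguous here for brevity */
int blob_write_noncontig(int blob_id, char *buf,
                         size_t offsets[], size_t sizes[], size_t count)
{
    void **bufs = malloc(count * sizeof *bufs);
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {     /* slice the user buffer       */
        bufs[i] = buf + pos;
        pos += sizes[i];
    }
    /* one call = one snapshot, so MPI_MODE_ATOMIC semantics need no locking */
    uint64_t version = NONCONT_WRITE(blob_id, bufs, offsets, sizes, count);
    free(bufs);
    return version ? 0 : -1;                 /* illustrative error handling */
}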




                                                                    16
Experimental Evaluation

• Our testbed: a reservation on the Grid'5000 platform
   – 80 nodes
   – Pentium 4 CPU @ 2.6 GHz, 4 GB RAM, Gigabit Ethernet
   – Measured bandwidth: 117.5 MB/s for MTU = 1500 B
• 3 sets of experiments:
   – Scalability of non-contiguous I/O
   – Scalability under concurrency
   – MPI-tile-I/O




                                                       17
Scalability of Non-contiguous I/O




                                    18
Scalability Under Concurrency




                                19
MPI-tile-I/O: 128 KB Chunk Size




                                  20
MPI-tile-IO: 1MB Chunk Size




                              21
Conclusion

• Experiments show promising results
   • We outperform locking-based approaches
   • Key features: shadowing, dedicated API for atomic non-contiguous I/O
   • Comparison to Lustre file system

• High throughput non-contiguous I/O under atomicity guarantees
• Future work
   • Exposing the versioning interface to MPI-I/O applications
   • Potential improvements for producer-consumer workflows
   • Pyramid: A large-scale array-oriented active storage system




                                                                            22
Context




                    Application (Visit, Tornado simulation)

                         Data model (HDF5, NetCDF)

                              MPI-IO middleware

                             Parallel file systems



• Parallel file systems do not provide an atomic non-contiguous I/O interface




                                                                        23
2nd challenge: Efficiency under concurrent
accesses
 Minimize ordering overhead
    Ordering is done at metadata level
    Arbitrary order

 Avoid synchronization for concurrent segment tree generation
    Delegate the generation of the shadow tree to the client side
    Shadow trees are generated in parallel

 Lazy evaluation during border node calculation




                                                                 24
