Context
Contribution 1: Efficient Support for MPI-I/O Atomicity
Contribution 2: A large-scale array-oriented storage system
Contribution 3: A document-oriented store
Conclusions




Scalable Data Management Systems For Big Data

Viet-Trung Tran

KerData team
PhD Advisors: Gabriel Antoniu and Luc Bougé

January 21st, 2013


Big Data Explosion






Big Data in Data-intensive HPC

    Data-intensive HPC relies on supercomputers to process, analyze,
    and/or visualize massive amounts of data
        Some numbers
            Large Hadron Collider Grid: 25 PB per year, I/O rates of 300 GB/s
            Blue Waters: peak I/O rates measured at 1 TB/s
        Data come from a variety of sources: observations, simulations,
        experimental systems, etc.


Definition of Big Data

    According to M. Stonebraker, Big Data has at least one of the
    following characteristics:
        Big Volume: large datasets (TB and more)
        Big Velocity: data is moving very fast
        Big Variety: data exists in a large number of formats


Big Data Challenges




      Objective of this thesis
           Building scalable data management systems for Big Data


Dealing With Scalability

    Scalability is defined as the ability of a system, network, or
    process to handle a growing amount of work in a capable manner,
    or its ability to be enlarged to accommodate that growth.

    Two methods for scaling:
        Scale horizontally (scale out)
        Scale vertically (scale up)


Trend 1: From Centralized to Distributed Approaches



              Centralized storage servers to distributed parallel file systems
                      Centralized file servers ⇒ Cluster ⇒ Grid, Cloud
              Centralized to distributed metadata management
              Example: PVFSv1 [Blumer 1994] ⇒ PVFSv2 [Ross 2003]






Trend 2: From One-size-fits-all Storage to Specialized Storage

    NoSQL movement: key-value stores, document stores, etc.
        Remove unneeded complexity: ACID
        High scalability
    Array-oriented storage for the array data model
    Examples: Dynamo, Membase, CouchDB, etc.


Trend 3: From Disks to Main-Memory Storage

    Memory is the new disk
        Median analytic job sizes are less than 14 GB [Microsoft]
        1 TB RAM is feasible
        DRAM is at least 100 times faster than disks
    Excellent for Big Velocity
    Examples: Hyper [Kemper 2011], HANA [SAP], H-Store [Kallman 2008]


Targeted Environments



              Data-intensive High Performance Computing (HPC)
                      Big Volume, Big Variety
              Geographically distributed environments
                      Big Volume
              Big data analytics in a multicore, big memory server
                      Big Velocity






Contributions of This Thesis: Building Scalable Data
Management Systems for Big Data

    Contribution                                     Big Volume  Big Velocity  Big Variety
    Building a scalable storage system to provide        √            —            —
    efficient support for MPI-I/O atomicity
    Pyramid: a large-scale array-oriented                √            —            √
    storage system
    Towards a globally distributed file system:          √            —            —
    adapting BlobSeer to WAN scale
    DStore: an in-memory document-oriented store         —            √            √

    (√ = addressed, — = not addressed)
Context
Contribution 1: Efficient Support for MPI-I/O Atomicity
    The need of atomic non-contiguous I/O
    Design & implementation
    Evaluation
    Summary
Contribution 2: A large-scale array-oriented storage system
Contribution 3: A document-oriented store
Conclusions


Context: Big Data in Data-intensive HPC

    Contribution 1
        Building a scalable storage system to provide efficient support
        for MPI-I/O atomicity


Problem Description

    Spatial splitting in parallelization: ghost cells
    [Figure: domain decomposition with ghost cells shared between
    neighboring processes]

    Application data model vs. storage data model
    [Figure: P1's non-contiguous access pattern mapped onto the file
    data, a contiguous sequence of bytes]

    Concurrent overlapping non-contiguous I/O requires atomicity
    guarantees
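The need for atomicity can be seen with a toy interleaving. The sketch below (plain Python, not MPI code) lets two writers each issue one non-contiguous write made of two extents over a shared buffer; one possible interleaving of the per-extent updates yields a file state that matches neither writer having run first:

```python
# Toy model (not MPI code): a shared 12-byte "file" and two writers,
# each issuing one non-contiguous write made of two extents.
shared = bytearray(b"." * 12)

writer_a = [(0, b"AAAA"), (8, b"AAAA")]  # extents [0:4) and [8:12)
writer_b = [(0, b"BBBB"), (8, b"BBBB")]  # same extents, other data

# Without atomicity, the per-extent updates may interleave freely.
schedule = [writer_a[0], writer_b[0], writer_b[1], writer_a[1]]
for offset, data in schedule:
    shared[offset:offset + len(data)] = data

# Result mixes both writers: neither "A then B" (BBBB....BBBB)
# nor "B then A" (AAAA....AAAA) -- exactly what MPI-I/O atomic
# mode must rule out.
print(shared.decode())  # BBBB....AAAA
```

Under atomic semantics, each writer's full non-contiguous request must appear to execute in one indivisible step, so only the two serial outcomes are legal.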


State of The Art

    Locking-based approaches to ensure atomicity
    Done at 3 levels:
        Applications: each process dumps output to a single file;
        too many files
        MPI-I/O: the whole file is locked
        Storage: byte-range locking based on POSIX locks
    [Figure: the parallel I/O stack]

    Poor scalability
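At the storage level, "byte-range locking based on POSIX locks" typically means taking an fcntl lock over each extent before writing. A minimal POSIX-only sketch (the function name and layout are illustrative, not taken from any particular file system):

```python
# Minimal POSIX-only sketch of byte-range locking for a non-contiguous
# write: lock every extent, write, unlock. Concurrent writers sharing
# any byte serialize on these locks, which is the scalability limit
# the locking-based approaches run into.
import fcntl
import os
import tempfile

def locked_write(fd, extents):
    extents = sorted(extents)          # fixed acquisition order avoids deadlock
    for offset, data in extents:
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset)
    for offset, data in extents:
        os.pwrite(fd, data, offset)
    for offset, data in extents:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset)

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 12)                   # 12 zero bytes
locked_write(fd, [(0, b"AAAA"), (8, b"AAAA")])
result = os.pread(fd, 12, 0)           # the gap keeps its zero bytes
print(result)
os.close(fd)
os.remove(path)
```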


Goal

    High-throughput non-contiguous I/O under atomicity guarantees


Our Approach

    Dedicated interface for atomic non-contiguous I/O
        Provide atomicity guarantees at the storage level
        No need to map MPI consistency to the storage consistency model
    Shadowing rather than locking
        Concurrent overlapping writes are allowed
        Atomicity guarantees
    Data striping
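A rough illustration of the shadowing idea (hypothetical chunked layout, not BlobSeer's actual interface): each write publishes a new version of the chunk map and replaces only the affected chunks, so readers of older versions are never disturbed and no locks are needed:

```python
# Hedged sketch of shadowing: writes never modify data in place.
# Each write copies the latest chunk map, installs new chunks for the
# touched range, and appends the map as a new version (snapshot).
CHUNK = 4

def write_shadowed(versions, offset, data):
    chunks = dict(versions[-1])        # shadow copy of the latest map
    for i in range(offset // CHUNK, (offset + len(data) - 1) // CHUNK + 1):
        lo = max(offset, i * CHUNK)
        hi = min(offset + len(data), (i + 1) * CHUNK)
        old = bytearray(chunks.get(i, b"\0" * CHUNK))
        old[lo - i * CHUNK:hi - i * CHUNK] = data[lo - offset:hi - offset]
        chunks[i] = bytes(old)         # new chunk; the old one is untouched
    versions.append(chunks)            # publish the new snapshot
    return len(versions) - 1

def read(versions, v, offset, size):
    out = bytearray()
    for i in range(offset // CHUNK, (offset + size - 1) // CHUNK + 1):
        out += versions[v].get(i, b"\0" * CHUNK)
    return bytes(out[offset % CHUNK:][:size])

versions = [{}]                        # version 0: empty file
v1 = write_shadowed(versions, 0, b"AAAAAAAA")
v2 = write_shadowed(versions, 4, b"BBBB")
print(read(versions, v1, 0, 8))        # v1 is unaffected by v2's write
print(read(versions, v2, 0, 8))
```

Because a version is only published once all its chunks exist, a reader sees either the whole write or none of it, which is how shadowing yields atomicity without blocking concurrent writers.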


Building Block: BlobSeer Data Management Service

    A KerData project (started with the thesis of Bogdan Nicolae)
    Design
        Data striping
        Distributed metadata management
        Versioning
    [Figure: BlobSeer architecture]


Building Block: BlobSeer (cont'd)

    Two-phase I/O
        Data access
        Metadata access
    Access interface only for contiguous I/O
        Create, Read, Write, Clone
    Distributed metadata management
        Organized as a segment tree
        Distributed over a DHT

    Segment tree over (offset, size) ranges:

                              (0,8)
                  (0,4)                   (4,4)
             (0,2)     (2,2)         (4,2)     (6,2)
          (0,1) (1,1) (2,1) (3,1) (4,1) (5,1) (6,1) (7,1)
Context
                                                                       The need of atomic non-contiguous I/O
  Contribution 1: Efficient Support for MPI-I/O Atomicity
                                                                       Design & implementation
Contribution 2: A large-scale array-oriented storage system
                                                                       Evaluation
                Contribution 3: A document-oriented store
                                                                       Summary
                                                Conclusions


Building Block: BlobSeer (con’t)
              Two phases I/O
                      Data access
                      Metadata access
              Access interface only for contiguous I/O
                      Create, Read, Write, Clone.
              Distributed metadata management
                      Organized as a segment tree
                      Distributed over a DHT
                                                                 0,8



                                               0,4                                        4,4



                                   0,2               2,2                      4,2                6,2



                    0,1           1,1    2,1               3,1         4,1          5,1         6,1     7,1




                     1st Writer




                                                                                                               18/54
Context
                                                                              The need of atomic non-contiguous I/O
  Contribution 1: Efficient Support for MPI-I/O Atomicity
                                                                              Design & implementation
Contribution 2: A large-scale array-oriented storage system
                                                                              Evaluation
                Contribution 3: A document-oriented store
                                                                              Summary
                                                Conclusions


Building Block: BlobSeer (con’t)
              Two phases I/O
                      Data access
                      Metadata access
              Access interface only for contiguous I/O
                      Create, Read, Write, Clone.
              Distributed metadata management
                      Organized as a segment tree
                      Distributed over a DHT
                                                                        0,8



                                                     0,4                                         4,4



                                   0,2                      2,2                      4,2                6,2



                    0,1           1,1    1,1   2,1    2,1         3,1         4,1          5,1         6,1     7,1




                     1st Writer




                                                                                                                      18/54
Context
                                                                                  The need of atomic non-contiguous I/O
  Contribution 1: Efficient Support for MPI-I/O Atomicity
                                                                                  Design & implementation
Contribution 2: A large-scale array-oriented storage system
                                                                                  Evaluation
                Contribution 3: A document-oriented store
                                                                                  Summary
                                                Conclusions


Building Block: BlobSeer (con’t)
              Two phases I/O
                      Data access
                      Metadata access
              Access interface only for contiguous I/O
                      Create, Read, Write, Clone.
              Distributed metadata management
                      Organized as a segment tree
                      Distributed over a DHT
                                                                            0,8



                                                       0,4                                           4,4



                                   0,2     0,2                2,2   2,2                  4,2                6,2



                    0,1           1,1    1,1     2,1    2,1           3,1         4,1          5,1         6,1     7,1




                     1st Writer




                                                                                                                          18/54
Context
                                                                                   The need of atomic non-contiguous I/O
  Contribution 1: Efficient Support for MPI-I/O Atomicity
                                                                                   Design & implementation
Contribution 2: A large-scale array-oriented storage system
                                                                                   Evaluation
                Contribution 3: A document-oriented store
                                                                                   Summary
                                                Conclusions




Building Block: BlobSeer (cont'd)
              Two-phase I/O
                      Data access
                      Metadata access
              Access interface for contiguous I/O only
                      Create, Read, Write, Clone
              Distributed metadata management
                      Organized as a segment tree
                      Distributed over a DHT

       [Figure: the blob's metadata as a binary segment tree; nodes are labeled (offset, size), from the root (0, 8) through (0, 4) and (4, 4) down to the unit leaves (0, 1) ... (7, 1); the nodes created by the 1st writer are highlighted]

                                                                                 18/54
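The (offset, size) labels in the figure hint at how a contiguous read resolves against the metadata tree. Below is a minimal sketch of the idea for a binary segment tree over 8 unit segments; it is illustrative only (a plain dict stands in for the DHT, and all names are hypothetical, not BlobSeer's actual API):

```python
def node_key(offset, size):
    # A tree node is identified by the byte range it covers.
    return (offset, size)

def build_tree(dht, offset, size, chunks):
    """Publish the metadata nodes covering [offset, offset + size) into the 'DHT'."""
    if size == 1:
        dht[node_key(offset, size)] = ("leaf", chunks[offset])
        return
    half = size // 2
    dht[node_key(offset, size)] = ("inner",
                                   node_key(offset, half),
                                   node_key(offset + half, half))
    build_tree(dht, offset, half, chunks)
    build_tree(dht, offset + half, half, chunks)

def read_range(dht, node, lo, hi, out):
    """Collect the data chunks of all leaves intersecting [lo, hi)."""
    offset, size = node
    if hi <= offset or lo >= offset + size:
        return                      # subtree entirely outside the requested range
    entry = dht[node]
    if entry[0] == "leaf":
        out.append(entry[1])
    else:
        read_range(dht, entry[1], lo, hi, out)
        read_range(dht, entry[2], lo, hi, out)

dht = {}                             # stand-in for the distributed hash table
build_tree(dht, 0, 8, ["c%d" % i for i in range(8)])
out = []
read_range(dht, (0, 8), 2, 5, out)   # read segments 2, 3 and 4
```

The two-phase structure of the slide shows up here as well: the recursion over `dht` is the metadata access, and fetching the chunks named in `out` would be the separate data access.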


Zoom on BlobSeer Metadata Generation
              To create a new version, the version manager returns:
                      A version number
                      The list of border nodes

       [Figure: segment trees of the 1st and 2nd writers; the border nodes (highlighted) are untouched subtrees of the previous version that are linked into the new version's tree]




              Border-node calculation is performed on the version manager side
                                                                                                                                         19/54
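The notion of border nodes can be made concrete with a small sketch. Assuming the same binary (offset, size) tree as before, the border nodes of a write are the largest old subtrees lying entirely outside the written range; the new version's tree links to them instead of rebuilding them. This is hypothetical code, not the thesis implementation:

```python
def border_nodes(node, lo, hi, out):
    """Largest subtrees of the previous version entirely outside the written
    range [lo, hi): the new version links to these 'border nodes' rather than
    copying them."""
    offset, size = node
    if hi <= offset or lo >= offset + size:
        out.append(node)            # whole subtree untouched: a border node
        return
    if size == 1:
        return                      # overwritten leaf: rebuilt, not a border node
    half = size // 2
    border_nodes((offset, half), lo, hi, out)
    border_nodes((offset + half, half), lo, hi, out)

out = []
border_nodes((0, 8), 3, 5, out)     # a write covering segments 3 and 4
```

For this write, the untouched left part is covered by (0, 2) and (2, 1) and the untouched right part by (5, 1) and (6, 2); moving exactly this computation from the version manager to the clients is one of the optimizations discussed on a later slide.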


Proposal for a Non-contiguous, Versioning-oriented Access Interface

              Non-contiguous write
                      vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])
              Non-contiguous read
                      NONCONT_READ(id, v, buffers[], offsets[], sizes[])
              Requirements
                      Non-contiguous I/O must be atomic
                      Efficient under concurrency




                                                                                                      20/54
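A hedged, in-memory model of the proposed calls may help fix the semantics (the call names come from the slide; everything else here is an assumption): a non-contiguous write applies all of its extents as one atomic update and returns a new version number, and a read addresses an explicit version, so readers can never observe a half-applied update.

```python
class VersionedBlob:
    """Toy model of the versioning-oriented interface (not the real system)."""

    def __init__(self, size):
        self.versions = [bytearray(size)]           # version 0: all zeros

    def noncont_write(self, buffers, offsets, sizes):
        snap = bytearray(self.versions[-1])         # shadow the latest snapshot
        for buf, off, sz in zip(buffers, offsets, sizes):
            snap[off:off + sz] = buf[:sz]           # apply every extent
        self.versions.append(snap)                  # publish all of them at once
        return len(self.versions) - 1               # the new version number vw

    def noncont_read(self, v, offsets, sizes):
        snap = self.versions[v]                     # reads target one version
        return [bytes(snap[off:off + sz]) for off, sz in zip(offsets, sizes)]

blob = VersionedBlob(8)
vw = blob.noncont_write([b"AA", b"BB"], [1, 5], [2, 2])
parts = blob.noncont_read(vw, [1, 5], [2, 2])
```

Because older versions remain readable, a reader holding version 0 still sees all zeros after the write completes.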
Non-contiguous I/O Must Be Atomic

              Leveraging a shadowing mechanism
                      The non-contiguous update is isolated into one single consistent snapshot
                      Done at the metadata level

       [Figure: the 1st writer shadows the segment tree; new (offset, size) nodes are created along the paths to the written leaves, while the untouched subtrees of the previous version are shared]




                                                                                                                                       21/54
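The shadowing idea can also be sketched at the chunk level (an assumption-laden illustration, not BlobSeer's code): a version is an immutable tuple of chunk references, and a write produces a new tuple that shares every untouched chunk with the previous version, so each version remains a consistent snapshot.

```python
CHUNK = 4   # chunk size in bytes (illustrative)

def write_chunks(prev_version, updates):
    """Shadowed write: updates maps chunk index -> new chunk bytes.
    Returns a new immutable version that shares all untouched chunks."""
    new = list(prev_version)
    for idx, chunk in updates.items():
        new[idx] = chunk
    return tuple(new)

v0 = tuple(b"\x00" * CHUNK for _ in range(4))
# A non-contiguous update touching chunks 1 and 2 yields one new snapshot.
v1 = write_chunks(v0, {1: b"AAAA", 2: b"BBBB"})
```

Untouched chunks are shared by reference, not copied, which is what keeps the cost of a snapshot proportional to the size of the update rather than to the size of the blob.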


Efficient under Concurrency

              Three important optimizations are proposed
                      Minimizing ordering overhead
                      Moving border-node computation from the version manager to the clients
                      Lazy evaluation during border-node calculation

       [Figure: segment trees of two concurrent writers; each writer builds its own new nodes and links to the border nodes of the other version]




                                                                                                                                                                    22/54


Leveraging Our Versioning-oriented Interface in the Parallel I/O Stack

           Integrating BlobSeer into the MPI-I/O middleware requires a new ADIO driver

                                                                                                      23/54


Experimental Evaluation

              Testbed: the Grid'5000 platform
                      Up to 80 nodes
                      Pentium 4 CPU @ 2.26 GHz, 4 GB RAM, Gigabit Ethernet
                      Measured bandwidth: 117.5 MB/s for MTU = 1500 B
              Three sets of experiments
                      Scalability of non-contiguous I/O
                      Scalability under concurrency
                      MPI-tile-I/O




                                                                                                      24/54


Results of the Experiments: Our Approach vs. Locking-based

       [Figure: aggregated throughput (MB/s), BlobSeer vs. Lustre, for 4, 9, 16, 25, and 36 concurrent clients. Left: subdomains arranged in a row. Right: MPI-tile-I/O with a 1024 × 1024 × 1024 tile size]

                                                                                                                                                                                     25/54


Contribution 1 - Summary

              A versioning-based mechanism to support atomic MPI-I/O efficiently
              The optimization moving border-node computation to the clients has been integrated back into BlobSeer
              Our approach outperforms locking-based approaches (aggregated throughput is 3.5 to 10 times higher)

      Publication:
      Efficient support for MPI-IO atomicity based on versioning. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), 514-523, Newport Beach, USA, May 2011.



                                                                                                      26/54
Context / Contribution 1: Efficient Support for MPI-I/O Atomicity / Contribution 2: A large-scale array-oriented storage system / Contribution 3: A document-oriented store / Conclusions
Contribution 2: The need of specialized storage for array data model / Design & implementation of Pyramid / Evaluation / Summary


Context: Big Data in Data-intensive HPC




       Contribution 2
          Pyramid: a scalable storage system for the array-oriented data model




                                                                                                                     27/54


Reconsidering the Mismatch Between Storage Model and Application Data Model

              Application data model
                      Multidimensional typed arrays, images, etc.
              Storage data model
                      Parallel file systems: a simple, flat I/O data model
                      Mostly a contiguous I/O interface: READ, WRITE(offset, size)

       Additional layers are needed to translate the application data model into the storage data model



                                                                                                                     28/54
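To see the mismatch concretely, consider a rectangular subarray of a row-major 2D array stored in a flat file: the translation layer must turn one logical array access into many small (offset, size) requests, one per touched row. A hypothetical sketch (the layout and names are illustrative assumptions):

```python
def subarray_extents(ncols, itemsize, r0, c0, nrows, width):
    """Flat-file (offset, size) pairs for the subarray of `width` columns
    starting at (r0, c0) and spanning `nrows` rows, assuming row-major layout."""
    return [((r * ncols + c0) * itemsize, width * itemsize)
            for r in range(r0, r0 + nrows)]

# A tiny 4x4 subarray of a 1024x1024 array of 8-byte cells already needs
# four separate non-contiguous requests, one per row.
extents = subarray_extents(1024, 8, 10, 20, 4, 4)
```

Each extent is only 32 bytes, while consecutive extents are a full row apart in the file, which is exactly the kind of scattered access pattern parallel file systems handle poorly.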


M. Stonebraker: "One Storage Fits All Needs" Has Reached Its Limits

              Trading non-contiguous I/O performance against I/O atomicity
              Losing data locality

       The I/O stack needs to be specialized to match the requirements of applications: array-oriented storage for the array data model
                                                                                                                     29/54


Our Approach: The Array-oriented Data Model Needs Array-oriented Storage

              Multi-dimension aware chunking
              Lock-free, distributed chunk indexing
              Array versioning




                                                                                                                     30/54


Multi-dimensional Aware Chunking



              [Figure: process P1's non-contiguous access pattern over file
              data stored as a contiguous sequence of bytes]

              Split the array into equal-sized multidimensional chunks,
              distributed over storage elements
                      Simplifies load balancing among storage elements
                      Keeps the neighbors of a cell in the same chunk
                      Eliminates most non-contiguous I/O accesses
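The mapping above can be sketched in a few lines. This is an illustrative toy, not Pyramid's actual code: all names (`chunk_of`, `server_of`) are hypothetical, and a simple hash stands in for the real chunk-placement policy.

```python
# Hypothetical sketch: split an n-D array into equal-sized chunks and
# spread the chunks over storage servers.

def chunk_of(cell, chunk_sizes):
    """Map a global cell coordinate to (chunk coordinate, offset in chunk)."""
    chunk = tuple(c // s for c, s in zip(cell, chunk_sizes))
    local = tuple(c % s for c, s in zip(cell, chunk_sizes))
    return chunk, local

def server_of(chunk, num_servers):
    """Place chunks on storage servers with a simple hash (toy policy)."""
    return hash(chunk) % num_servers

# A 2D array chunked into 4x4 tiles: neighboring cells land in the same
# chunk, so a small sub-array access touches one server contiguously.
chunk_a, _ = chunk_of((5, 6), (4, 4))
chunk_b, _ = chunk_of((5, 7), (4, 4))
assert chunk_a == chunk_b == (1, 1)
```

Because neighbors share a chunk, a client dicing out a small sub-array issues one contiguous request per chunk instead of one scattered request per row.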

Distributed Quadtree-like Structures

              Common index structures for multidimensional data
                      R-tree, XD-tree, etc.
                      All are designed and optimized for centralized management
                      Poor scalability under high concurrency
              Our approach
                      Port quadtree-like structures to distributed environments




Array Versioning

              Scientific applications need array versioning [VLDB 2009]
                      Checkpointing
                      Cloning
                      Provenance
              Our approach
                      Keep data and metadata immutable
                      Updates are handled at the metadata level using a shadowing
                      mechanism
              A versioning array-oriented interface
                      id = CREATE(n, sizes[], defval)
                      READ(id, v, offsets[], sizes[], buffer)
                      w = WRITE(id, offsets[], sizes[], buffer)
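The interface above can be sketched as follows. This is a toy, single-process illustration of the versioning semantics only: the registry, the per-cell dicts, and the full-dict copy in `WRITE` are assumptions for brevity (a real implementation shares unchanged metadata between versions instead of copying).

```python
# Toy sketch of the versioning array interface from the slide:
#   id = CREATE(n, sizes[], defval); READ(id, v, ...); w = WRITE(id, ...)
from itertools import product

_arrays = []   # array id -> list of versions, each a {cell: value} dict
_meta = []     # array id -> (sizes, defval)

def _cells(offsets, sizes):
    """Enumerate the cell coordinates of a rectangular subdomain."""
    return product(*(range(o, o + s) for o, s in zip(offsets, sizes)))

def CREATE(n, sizes, defval):
    """Create an n-dimensional array; version 0 holds defval everywhere."""
    _arrays.append([{}])
    _meta.append((list(sizes), defval))
    return len(_arrays) - 1

def WRITE(id, offsets, sizes, buffer):
    """Updates never modify old versions: a new version is derived from the
    latest one and only the written cells differ. Returns the new version."""
    versions = _arrays[id]
    snap = dict(versions[-1])          # toy shadowing: copy, then overwrite
    data = iter(buffer)
    for cell in _cells(offsets, sizes):
        snap[cell] = next(data)
    versions.append(snap)
    return len(versions) - 1

def READ(id, v, offsets, sizes):
    """Read a subdomain of version v into a flat (row-major) list."""
    _, defval = _meta[id]
    snap = _arrays[id][v]
    return [snap.get(cell, defval) for cell in _cells(offsets, sizes)]
```

Since every version stays readable, checkpointing, cloning, and provenance queries reduce to reads at older version numbers.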


Pyramid Architecture


      Pyramid is based on BlobSeer [Nicolae - JPDC 2011]



          Version managers
          Metadata managers
          Storage manager
          Storage servers
          Clients
                                                                  [Figure: Pyramid architecture]




Lock-free, Distributed Chunk Indexing
              BlobSeer: a distributed segment tree
              Pyramid: generalizes BlobSeer's metadata organization to
              quadtree-like structures
                      Quadtree for 2D arrays
                      Octree for 3D arrays
              Tree nodes are immutable, uniquely identified by the version number
              and the subdomain they cover
              A DHT distributes tree nodes over the metadata managers
              Shadowing to reflect updates

                   [Figure: distributed quadtree — version 2 shadows only the
                   nodes covering the updated leaves and shares the remaining
                   subtrees with version 1]
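A minimal sketch of this indexing scheme, under stated assumptions: a plain dict stands in for the DHT that spreads nodes over metadata managers, domains are `(x, y, size)` squares, and all function names are illustrative, not Pyramid's API.

```python
# Immutable quadtree nodes, keyed by (version, domain) in a DHT stand-in.
# Writing a new version shadows only the path to the updated leaf; every
# other subtree is shared by reference with the parent version, so no
# locking is needed and old versions remain readable.

dht = {}   # (version, domain) -> child keys, or a chunk id at the leaves

def quadrants(dom):
    x, y, s = dom
    h = s // 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def covers(dom, x, y):
    dx, dy, s = dom
    return dx <= x < dx + s and dy <= y < dy + s

def create(version, dom, chunk_id):
    """Bootstrap a full tree whose leaves all point to chunk_id."""
    if dom[2] == 1:
        dht[(version, dom)] = chunk_id
        return
    dht[(version, dom)] = [(version, q) for q in quadrants(dom)]
    for q in quadrants(dom):
        create(version, q, chunk_id)

def write(version, parent, dom, x, y, chunk_id):
    """Publish `version` by updating cell (x, y) on top of `parent`."""
    if dom[2] == 1:
        dht[(version, dom)] = chunk_id
        return (version, dom)
    children = []
    for q in quadrants(dom):
        if covers(q, x, y):
            children.append(write(version, parent, q, x, y, chunk_id))
        else:
            children.append((parent, q))   # shadowing: share the subtree
    dht[(version, dom)] = children
    return (version, dom)

def read(version, dom, x, y):
    """Walk the quadtree of `version` down to the chunk holding (x, y)."""
    key = (version, dom)
    while key[1][2] > 1:
        key = next(k for k in dht[key] if covers(k[1], x, y))
    return dht[key]
```

Because nodes are immutable and uniquely keyed by version and subdomain, concurrent writers never contend on a tree node: each publishes its own path and the version manager only has to order the version numbers.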
Efficient Parallel Updating




                               [Figure: total ordering of two concurrent updates]
Experimental Evaluation


              Up to 140 nodes of the Graphene cluster in the
              Grid’5000 testbed
                      1 Gbps Ethernet interconnect
                      Pyramid and the competing system PVFS are deployed on
                      76 nodes
                      64 nodes are reserved for clients
              Simulate a common access pattern exhibited by scientific
              applications: array dicing
                      Each client accesses a dedicated sub-array
                      Concurrent reads and writes
                      Measure the aggregated throughput



Aggregated Throughput Achieved under Concurrency


   [Figure: aggregated throughput (MB/s) of Pyramid vs. PVFS2 under
   concurrent reading and writing]

   Weak scalability: fixed subdomain size, increasing number of client
   processes (1 to 49 clients)
   Strong scalability: fixed total domain size, increasing number of client
   processes (1 to 64 clients)

Contribution 2 - Summary


              Pyramid is an array-oriented storage system
                      Offers parallel array processing for both read and write
                      workloads
                      Built with a distributed metadata management system
                      Relies on shadowing to reflect updates
              Preliminary evaluation shows promising scalability
      Publication:
      Towards scalable array-oriented active storage: the Pyramid approach. Tran V.-T.,
      Nicolae B., Antoniu G. In the ACM SIGOPS Operating Systems Review 46(1):19-25.
      2012.
      Pyramid: A large-scale array-oriented active storage system. Tran V.-T., Nicolae B.,
      Antoniu G., Bougé L. In the 5th Workshop on Large Scale Distributed Systems and
      Middleware (LADIS 2011), Seattle, USA, September 2011.



Context: Big Data in a Multi-core, Big Memory Server




      Contribution 3
                DStore: a document-oriented store in main memory




Recall The Context: NoSQL Movement & In-memory
Design

              NoSQL movement
                      Simplified data models: key-value, documents, graphs, etc.
                      Document-oriented stores offer rich functionality
              Trend towards in-memory design
                      90% of Facebook jobs process < 100 GB [Facebook]
                      1 TB of DRAM is feasible
                      Memory accesses are at least 100 times faster than disk

         Goal: efficient support for fast, atomic, complex transactions
                      and high-throughput read queries


Observation


              Example
                      T1 updates {A, B, C}
                      T2 updates {C, D, E}
                      The more complex the transactions, the higher the chance
                      that they are dependent
              Concurrent transaction processing
                      Requires concurrent data structures
                      Locking & latching account for 30% of the overhead [VLDB 2007]
                      Serialization is unavoidable for dependent transactions
              Synchronous index generation
                      The more indexes, the slower the transaction processing



#1: Target Fast, Atomic Complex Transactions

                   [Figure: the master thread appends individual updates to a
                   delta buffer; a background slave thread bulk-updates the
                   index data structure]

              Single-threaded execution model
              Delta indexing & background index generation deliver a fast
              processing rate
              Bulk updating ensures atomicity
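The execution model above can be sketched as follows. All names are illustrative, not DStore's API, and a dict stands in for the index; the lock guards only the buffer hand-off in this toy, whereas DStore's single-threaded master avoids locking on the index entirely.

```python
# Delta indexing with background bulk updating: the master thread only
# appends whole transactions to a delta buffer (fast, never blocked by the
# index); a background slave thread detaches the buffer and applies it to
# the index in bulk, so each transaction takes effect atomically.
import threading

class DeltaStore:
    def __init__(self):
        self.index = {}               # the index data structure
        self.delta = []               # transactions not yet merged
        self._lock = threading.Lock() # guards only the buffer hand-off

    def update(self, txn):
        """Master thread: txn maps keys to values; appended as one unit."""
        with self._lock:
            self.delta.append(txn)

    def merge(self):
        """Slave thread: detach the delta buffer and bulk-apply it."""
        with self._lock:
            batch, self.delta = self.delta, []
        new_index = dict(self.index)
        for txn in batch:             # a transaction is applied as a whole
            new_index.update(txn)
        self.index = new_index        # publish with one reference swap

    def read(self, key):
        """Stale read against the state of the last merge."""
        return self.index.get(key)
```

Readers never observe half of a transaction: either a merge has published it entirely, or not at all.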
#2: Target High-throughput Read Queries



                   [Figure: the master thread feeds updates through the delta
                   buffer into the index data structure, which the background
                   slave thread bulk-updates; reader threads serve fresh reads
                   from the delta buffer and stale reads from index snapshots]

              Multiple reader threads
              Stale reads for performance
              Versioning concurrency control: one new snapshot per entire
              delta buffer

Service Model

                   [Figure: DStore service model — the master thread routes
                   update queries into per-index delta buffers; one slave
                   thread per index merges its delta buffer into a B-tree
                   index, producing a new snapshot; fresh reads go through
                   the delta buffers, stale reads against the B-tree
                   snapshots]
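The versioning concurrency control of this service model can be sketched as follows. The names are illustrative and a dict replaces the B-tree index; the point is the read-side contract: one immutable snapshot per merged delta buffer, with fresh reads consulting the pending buffer on top.

```python
# One new snapshot per entire delta buffer: snapshots are immutable once
# published, so reader threads need no locks. A stale read uses the latest
# snapshot; a fresh read additionally scans the pending delta buffer.

class SnapshotIndex:
    def __init__(self):
        self.snapshots = [{}]      # immutable index snapshots, oldest first
        self.delta = []            # updates since the last merge

    def update(self, key, value):
        """Master thread: buffer the update; the index is untouched."""
        self.delta.append((key, value))

    def merge(self):
        """Slave thread: one new snapshot for the entire delta buffer."""
        snap = dict(self.snapshots[-1])
        snap.update(self.delta)
        self.snapshots.append(snap)   # older snapshots stay readable
        self.delta = []

    def stale_read(self, key):
        return self.snapshots[-1].get(key)

    def fresh_read(self, key):
        for k, v in reversed(self.delta):   # newest pending update wins
            if k == key:
                return v
        return self.stale_read(key)
```

A long-running analytical query can keep using the snapshot it started on, while updates keep flowing into the delta buffer undisturbed.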
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data

Scalable Data Management Systems for Big Data

  • 1. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Contribution 2: A large-scale array-oriented storage system Contribution 3: A document-oriented store Conclusions Scalable Data Management Systems For Big Data Viet-Trung Tran KerData team PhD Advisors: Gabriel Antoniu and Luc Bougé January 21st, 2013 1/54
  • 2. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Big Data Explosion 2/54
  • 3. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Big Data in Data-intensive HPC Data-intensive HPC relies on supercomputers to process, analyze, and/or visualize massive amounts of data Some numbers Large Hadron Collider Grid 25 PB per year I/O rates of 300 GB/s Blue Waters peak I/O rates measured at 1 TB/s Data come from a variety of sources: observations, simulations, experimental systems, etc. 3/54
  • 7. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Definition of Big Data According to M. Stonebraker, Big Data has at least one of the following characteristics: Big Volume Large datasets (TB and more) Big Velocity Data is moving very fast Big Variety Data exists in a large number of formats. 4/54
  • 10. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Big Data Challenges Objective of this thesis Building scalable data management systems for Big Data 5/54
  • 12. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Dealing With Scalability Scalability is defined as the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth. Two methods for scaling: Scale horizontally (scale out) Scale vertically (scale up) 6/54
  • 13. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Trend 1: From Centralized to Distributed Approaches Centralized storage servers to distributed parallel file systems Centralized file servers ⇒ Cluster ⇒ Grid, Cloud Centralized to distributed metadata management Example: PVFSv1 [Blumer 1994] ⇒ PVFSv2 [Ross 2003] 7/54
  • 14. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Trend 2: From One-size-fits-all-needs Storage to Specialized Storage NoSQL movement: Key-value stores, Document stores, etc. Remove unneeded complexity: ACID High scalability Array-oriented storage for array data model Example: Dynamo, Membase, CouchDB, etc. 8/54
  • 15. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Trend 3: From Disks to Main Memory Storage Memory is the new disk Median analytic job sizes are less than 14 GB [Microsoft] 1 TB RAM is feasible DRAM is at least 100 times faster than disks Excellent for Big Velocity Example: Hyper [Kemper 2011], HANA [SAP], H-Store [Kallman 2008] 9/54
  • 16. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Targeted Environments Data-intensive High Performance Computing (HPC) Big Volume, Big Variety Geographically distributed environments Big Volume Big data analytics in a multicore, big memory server Big Velocity 10/54
  • 17. Context Contribution 1: Efficient Support for MPI-I/O Atomicity Big Data explosion Contribution 2: A large-scale array-oriented storage system Building scalable data management systems Contribution 3: A document-oriented store Contributions of the thesis Conclusions Contributions of This Thesis: Building Scalable Data Management Systems for Big Data: Building a scalable storage system to provide efficient support for MPI-I/O atomicity (Big Volume); Pyramid: a large-scale array-oriented storage system (Big Volume, Big Variety); Towards a globally distributed file system: adapting BlobSeer to WAN scale (Big Volume); DStore: an in-memory document-oriented store (Big Velocity, Big Variety) 11/54
• 19. Context: Big Data in Data-intensive HPC. Contribution 1: Building a scalable storage system to provide efficient support for MPI-I/O atomicity.
• 20. Problem Description. Spatial splitting in parallelization introduces ghost cells. [Figure: the application data model vs. the storage data model; each process has a non-contiguous access pattern over file data seen as a contiguous sequence of bytes.] Concurrent overlapping non-contiguous I/O requires atomicity guarantees.
• 22. State of the Art. Locking-based approaches ensure atomicity at three levels of the parallel I/O stack: at the application level, each process dumps its output to a single file (resulting in too many files); at the MPI-I/O level, the whole file is locked; at the storage level, byte-range locking is based on POSIX locks. All exhibit poor scalability.
• 23. Goal: high-throughput non-contiguous I/O under atomicity guarantees.
• 24. Our Approach. A dedicated interface for atomic non-contiguous I/O: atomicity guarantees are provided at the storage level, so there is no need to map MPI consistency onto the storage consistency model. Shadowing rather than locking: concurrent overlapped writes are allowed under atomicity guarantees. Data striping.
• 25. Building Block: the BlobSeer Data Management Service. A KerData project (started with the thesis of Bogdan Nicolae). Design: data striping, distributed metadata management, versioning. [Figure: BlobSeer architecture.]
• 26. Building Block: BlobSeer (cont'd). Two-phase I/O: data access, then metadata access. The access interface supports only contiguous I/O: Create, Read, Write, Clone. Distributed metadata management: metadata is organized as a segment tree and distributed over a DHT. [Figure: segment tree with nodes labeled (offset, size), from the root (0,8) down to the leaves (0,1) through (7,1).]
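The (offset, size) node labels on the slide can be sketched in a few lines of Python; this is an illustrative enumeration of a segment tree over an 8-chunk blob, not BlobSeer code.

```python
def tree_nodes(offset, size, leaf=1):
    """Enumerate the (offset, size) nodes covering [offset, offset+size)."""
    yield (offset, size)
    if size > leaf:                      # inner node: recurse into both halves
        half = size // 2
        yield from tree_nodes(offset, half, leaf)
        yield from tree_nodes(offset + half, half, leaf)

# Root (0,8), children (0,4)/(4,4), ... down to leaves (0,1) ... (7,1).
nodes = list(tree_nodes(0, 8))
```

Each node covers a contiguous byte range, so a reader descends from the root only into the subtrees intersecting its request.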
• 32. Zoom on BlobSeer Metadata Generation. For each new version, the version manager returns a version number and the list of border nodes. [Figure: segment trees of two concurrent writers sharing border nodes.] Border-node calculation is performed on the version manager side.
• 33. Proposal for a Non-contiguous, Versioning-oriented Access Interface. Non-contiguous write: vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[]). Non-contiguous read: NONCONT_READ(id, v, buffers[], offsets[], sizes[]). Requirements: non-contiguous I/O must be atomic, and efficient under concurrency.
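The intended semantics of this interface can be illustrated with a minimal in-memory mock; the names follow the slide, but the dict of snapshots is a stand-in for BlobSeer, and the id argument is ignored here (single blob for brevity).

```python
snapshots = {0: bytearray(16)}           # version -> blob contents

def noncont_write(blob_id, buffers, offsets):
    """Apply all segments, then publish a single new consistent version."""
    base = max(snapshots)
    snap = bytearray(snapshots[base])    # shadow copy: old versions untouched
    for buf, off in zip(buffers, offsets):
        snap[off:off + len(buf)] = buf
    snapshots[base + 1] = snap           # one publication step = atomicity
    return base + 1

def noncont_read(blob_id, v, offsets, sizes):
    snap = snapshots[v]
    return [bytes(snap[o:o + s]) for o, s in zip(offsets, sizes)]

v = noncont_write(0, [b"ab", b"cd"], [0, 8])
```

Readers of version 0 are never affected by the write: the update becomes visible only as the new version v.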
• 34. Non-contiguous I/O Must Be Atomic. We leverage a shadowing mechanism that isolates a non-contiguous update into one single consistent snapshot; this is done at the metadata level. [Figure: a writer's segments are woven into a new segment-tree snapshot that shares unmodified subtrees with the previous version.]
• 40. Efficient under Concurrency. We propose three important optimizations: minimizing the ordering overhead, moving the border-node computation from the version manager to the clients, and lazy evaluation during border-node calculation. [Figure: segment trees of two concurrent writers.]
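Client-side border-node computation reduces to interval arithmetic: for a write covering chunks [lo, hi) of a tree over [0, total), any subtree the write does not touch is a border node reused from the previous snapshot. The formulation below is an assumed sketch, not BlobSeer's implementation.

```python
def border_nodes(offset, size, lo, hi):
    """Maximal subtrees fully outside the written chunk range [lo, hi)."""
    if offset + size <= lo or hi <= offset:
        return [(offset, size)]          # untouched: reuse the old subtree
    if size == 1:
        return []                        # leaf inside the write
    half = size // 2
    return (border_nodes(offset, half, lo, hi)
            + border_nodes(offset + half, half, lo, hi))

# Writing chunks [2, 6) of an 8-chunk blob reuses subtrees (0,2) and (6,2).
borders = border_nodes(0, 8, 2, 6)
```

Since each client can run this locally, the version manager no longer sits on the critical path of every write.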
• 43. Leveraging Our Versioning-oriented Interface in the Parallel I/O Stack. Integrating BlobSeer into the MPI-I/O middleware requires a new ADIO driver.
• 44. Experimental Evaluation. Testbed: the Grid'5000 platform, up to 80 nodes (Pentium 4 CPU at 2.26 GHz, 4 GB RAM, Gigabit Ethernet); measured bandwidth: 117.5 MB/s for MTU = 1500 B. Three sets of experiments: scalability of non-contiguous I/O, scalability under concurrency, and MPI-tile-I/O.
• 45. Results of the Experiments: Our Approach vs. the Locking-based One. [Figure: aggregated throughput (MB/s) of BlobSeer vs. Lustre for 4 to 36 concurrent clients.] MPI-tile-I/O with a 1024 * 1024 * 1024 tile size; subdomains are arranged in a row.
• 46. Contribution 1 - Summary. A versioning-based mechanism to support atomic MPI-I/O efficiently; the optimization of moving border-node computation to the clients was integrated back into BlobSeer; our approach outperforms locking-based approaches (aggregated throughput is 3.5 to 10 times higher). Publication: Efficient support for MPI-IO atomicity based on versioning. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), 514-523, Newport Beach, USA, May 2011.
• 47. Context: Big Data in Data-intensive HPC. Contribution 2: Pyramid, a scalable storage system for the array-oriented data model.
• 48. Reconsidering the Mismatch Between the Storage Model and the Application Data Model. Application data model: multidimensional typed arrays, images, etc. Storage data model: parallel file systems expose a simple, flat I/O data model with a mostly contiguous I/O interface (READ, WRITE(offset, size)). Additional layers are needed to translate the application data model into the storage data model.
• 50. M. Stonebraker: "One Storage Fits All Needs" Has Reached Its Limits. Performance of non-contiguous I/O vs. I/O atomicity; losing data locality. The I/O stack needs to be specialized to match the requirements of applications: array-oriented storage for the array data model.
• 52. Our Approach: the Array-oriented Data Model Needs Array-oriented Storage. Multi-dimension-aware chunking; lock-free, distributed chunk indexing; array versioning.
• 53. Multi-dimension-aware Chunking. [Figure: a non-contiguous access pattern over file data seen as a contiguous sequence of bytes.] The array is split into equal multidimensional chunks distributed over the storage elements. This simplifies load balancing among storage elements, keeps the neighbors of a cell in the same chunk, and eliminates most non-contiguous I/O accesses.
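The chunking idea can be sketched in a few lines (an assumed layout for illustration, not Pyramid's actual code): cell coordinates map to the multidimensional chunk that keeps their neighborhood together, instead of to byte offsets in a flat file.

```python
CHUNK = (4, 4)                           # chunk shape, chosen for the example

def chunk_of(coords):
    """Index of the chunk holding the cell at the given (row, col)."""
    return tuple(c // s for c, s in zip(coords, CHUNK))

# Neighboring cells land in the same chunk ...
same = chunk_of((1, 1)) == chunk_of((2, 1)) == (0, 0)
# ... and chunks tile the array evenly, easing load balancing.
far = chunk_of((5, 9))
```

A subarray request then touches only the chunks its bounding box intersects, each served with one contiguous access.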
• 56. Distributed Quadtree-like Structures. Common index structures for multidimensional data (R-tree, XD-tree, etc.) are designed and optimized for centralized management and scale poorly under high concurrency. Our approach: porting quadtree-like structures to distributed environments.
• 57. Array Versioning. Scientific applications need array versioning [VLDB 2009]: checkpointing, cloning, provenance. Our approach: keep data and metadata immutable; updates are handled at the metadata level using a shadowing mechanism. A versioning array-oriented interface: id = CREATE(n, sizes[], defval); READ(id, v, offsets[], sizes[], buffer); w = WRITE(id, offsets[], sizes[], buffer).
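The semantics of this interface can be mocked in memory; in this sketch a per-version dict of written cells over a default value stands in for Pyramid's distributed chunks, and the helper `_cells` is hypothetical.

```python
import itertools

arrays, _ids = {}, itertools.count()

def create(n, sizes, defval):
    """n-dimensional array with the given sizes, initially all defval."""
    aid = next(_ids)
    arrays[aid] = {"defval": defval, "versions": {0: {}}}
    return aid

def write(aid, offsets, sizes, buffer):
    a = arrays[aid]
    base = max(a["versions"])
    snap = dict(a["versions"][base])     # shadowing: old versions stay readable
    values = iter(buffer)
    for cell in _cells(offsets, sizes):
        snap[cell] = next(values)
    a["versions"][base + 1] = snap       # publish the new version in one step
    return base + 1

def read(aid, v, offsets, sizes):
    a = arrays[aid]
    snap = a["versions"][v]
    return [snap.get(c, a["defval"]) for c in _cells(offsets, sizes)]

def _cells(offsets, sizes):              # row-major cells of a subarray
    if not offsets:
        yield ()
        return
    for i in range(offsets[0], offsets[0] + sizes[0]):
        for rest in _cells(offsets[1:], sizes[1:]):
            yield (i,) + rest

aid = create(2, [4, 4], 0)
v = write(aid, [1, 1], [2, 2], [1, 2, 3, 4])
```

A write produces a new version while version 0 remains readable unchanged, which is what makes checkpointing, cloning, and provenance cheap.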
• 58. Pyramid Architecture. Pyramid is based on BlobSeer [Nicolae - JPDC 2011]: version managers, metadata managers, a storage manager, storage servers, and clients. [Figure: Pyramid architecture.]
• 59. Lock-free, Distributed Chunk Indexing. Pyramid generalizes BlobSeer's distributed segment-tree metadata organization into quadtree-like structures: a quadtree for 2D arrays, an octree for 3D arrays. Tree nodes are immutable and uniquely identified by their version number and the subdomain they cover; a DHT distributes tree nodes over the metadata managers, and shadowing reflects updates. [Figure: a distributed quadtree over subdomains 1-16; versions 1 and 2 share unmodified subtrees.]
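The lock-free property follows from the node identification scheme; the key and placement functions below are an assumed illustration of the idea, not Pyramid's actual code. Because a node is immutable and named by (version, subdomain), a new version publishes new keys instead of mutating shared state, so DHT lookups need no locks.

```python
import hashlib

def node_key(version, offsets, sizes):
    """Stable DHT key for the immutable node covering one subdomain."""
    return hashlib.sha1(f"v{version}:{offsets}:{sizes}".encode()).hexdigest()

def responsible_manager(key, managers):
    """Place a node on a metadata manager by hashing its key."""
    return managers[int(key, 16) % len(managers)]

k1 = node_key(1, (0, 0), (16, 16))
k2 = node_key(2, (0, 0), (16, 16))   # same subdomain, new version -> new key
owner = responsible_manager(k1, ["m0", "m1", "m2"])
```

Hashing also spreads the tree nodes of one array over all metadata managers, avoiding a single metadata hotspot.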
• 60. Efficient Parallel Updating. [Figure: total ordering of two concurrent updates.]
• 61. Experimental Evaluation. We use at most 140 nodes of the Graphene cluster in the Grid'5000 testbed (1 Gbps Ethernet interconnect). Pyramid and the competitor system, PVFS, are deployed on 76 nodes; 64 nodes are reserved for clients. We simulate a common access pattern exhibited by scientific applications, array dicing: each client accesses a dedicated sub-array under concurrent reads and writes, and we measure the aggregated throughput.
• 62. Aggregated Throughput Achieved under Concurrency. [Figure: aggregated throughput (MB/s) of Pyramid vs. PVFS2 for reads and writes. Left: weak scalability (fixed subdomain size, 1 to 49 client processes). Right: strong scalability (fixed total size, 1 to 64 client processes).]
• 63. Contribution 2 - Summary. Pyramid is an array-oriented storage system: it offers parallel array processing for both read and write workloads, is built with a distributed metadata management system, and relies on shadowing to reflect updates. Preliminary evaluation shows promising scalability. Publications: Towards scalable array-oriented active storage: the Pyramid approach. Tran V.-T., Nicolae B., Antoniu G. In ACM SIGOPS Operating Systems Review 46(1):19-25, 2012. Pyramid: a large-scale array-oriented active storage system. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In the 5th Workshop on Large Scale Distributed Systems and Middleware (LADIS 2011), Seattle, USA, September 2011.
• 64. Context: Big Data in a Multicore, Big-Memory Server. Contribution 3: DStore, a document-oriented store in main memory.
• 65. Recall the Context: the NoSQL Movement and In-memory Design. NoSQL movement: simplified data models (key-value, documents, graphs, etc.); document-oriented stores offer rich functionality. Trend towards in-memory design: 90% of Facebook jobs process less than 100 GB [Facebook], 1 TB of DRAM is feasible, and memory accesses are at least 100 times faster than disks. Goal: efficient support for fast, atomic, complex transactions and high-throughput read queries.
• 67. Observation. Example: T1 updates {A, B, C}, T2 updates {C, D, E}. The more complex the transactions, the higher the possibility that they are dependent. Concurrent transaction processing requires concurrent data structures, yet locking and latching account for 30% of the overhead [VLDB 2007], and serialization is unavoidable for dependent transactions. Synchronous index generation: the more indexes, the slower the transaction processing.
• 68. #1: Target Fast, Atomic, Complex Transactions. [Figure: the master thread applies individual updates to a delta buffer; a slave thread bulk-updates the index data structure as a background process.] A single-threaded execution model; delta indexing and background index generation deliver a fast processing rate; bulk updating ensures atomicity.
• 69. #2: Target High-throughput Read Queries. [Figure: updates go through the master thread into a delta buffer; a slave thread bulk-updates the index data structure in the background; multiple reader threads serve fresh and stale reads against different snapshots.] Stale READs for performance; versioning concurrency control: one new snapshot per entire delta buffer.
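The interplay of delta buffer, bulk merge, and stale vs. fresh reads can be sketched as a toy single-writer model; the structure is assumed for illustration and omits DStore's threading.

```python
snapshot, delta = {}, []                 # index snapshot + pending updates

def update(doc_id, doc):                 # master thread: O(1) append, no locks
    delta.append((doc_id, doc))

def merge():                             # slave thread: bulk-apply the buffer
    global snapshot, delta
    new = dict(snapshot)                 # shadow copy; old snapshot stays valid
    for doc_id, doc in delta:
        new[doc_id] = doc
    snapshot, delta = new, []            # publish one snapshot per delta buffer

def read(doc_id, fresh=False):
    if fresh:                            # fresh READ: also scan the delta
        for did, doc in reversed(delta):
            if did == doc_id:
                return doc
    return snapshot.get(doc_id)          # stale READ: snapshot only

update("d1", {"x": 1})
stale_before = read("d1")                # not merged yet
fresh_before = read("d1", fresh=True)
merge()
```

Stale readers never touch the delta buffer, so they run against an immutable snapshot without any synchronization; only fresh readers pay for recency.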
• 73. Service Model. [Figure: the DStore service model. Update queries go through the master thread into per-index delta buffers; for each B-tree index, a slave thread merges its delta buffer into a new B-tree snapshot; readers perform fresh reads (delta buffer plus index) or stale reads (index snapshot only).]