Abstract:
Big Data can be characterized by three V's.
Big Volume refers to the unprecedented growth in the amount of data.
Big Velocity refers to the growth in the speed at which data moves in and out of data management systems.
Big Variety refers to the growth in the number of different data formats.
Managing Big Data requires fundamental changes in the architecture of data management systems. Storage systems must keep evolving to accommodate this growth: they need to be scalable while sustaining high-performance data access.
This thesis focuses on building scalable data management systems for Big Data.
Our first and second contributions address the challenge of providing efficient support for Big Volume in data-intensive high performance computing (HPC) environments. In particular, we address the shortcomings of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need to perform expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems to achieve scalable performance. Pyramid favors multidimensional-aware data chunking, which closely matches the access patterns generated by applications. Pyramid also favors distributed metadata management and versioning-based concurrency control to eliminate synchronization under concurrent accesses.
Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data management service, and we propose BlobSeer-WAN, an extension of BlobSeer optimized for such geographically distributed environments. BlobSeer-WAN takes the latency hierarchy into account by favoring local metadata accesses. It features asynchronous metadata replication and a vector-clock implementation for collision resolution.
To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast, atomic processing of complex write transactions while sustaining high-throughput read access. It follows a single-threaded execution model to execute update transactions sequentially, while relying on versioning-based concurrency control to enable a large number of simultaneous readers.
Scalable Data Management Systems for Big Data
Viet-Trung Tran
KerData team
PhD Advisors: Gabriel Antoniu and Luc Bougé
January 21st, 2013

Outline
Context
Contribution 1: Efficient Support for MPI-I/O Atomicity
Contribution 2: A large-scale array-oriented storage system
Contribution 3: A document-oriented store
Conclusions
Big Data Explosion
Big Data in Data-intensive HPC
Data-intensive HPC relies on supercomputers to process, analyze, and/or visualize massive amounts of data.
Some numbers:
Large Hadron Collider Grid: 25 PB per year, I/O rates of 300 GB/s
Blue Waters: peak I/O rates measured at 1 TB/s
Data come from a variety of sources: observations, simulations, experimental systems, etc.
Definition of Big Data
According to M. Stonebraker, Big Data has at least one of the following characteristics:
Big Volume: large datasets (TB and more)
Big Velocity: data is moving very fast
Big Variety: data exist in a large number of formats
Big Data Challenges
Objective of this thesis: building scalable data management systems for Big Data.
Dealing With Scalability
Scalability is defined as the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth.
Two methods for scaling:
Scale horizontally (scale out)
Scale vertically (scale up)
Trend 1: From Centralized to Distributed Approaches
Centralized storage servers to distributed parallel file systems
Centralized file servers ⇒ Cluster ⇒ Grid, Cloud
Centralized to distributed metadata management
Example: PVFSv1 [Blumer 1994] ⇒ PVFSv2 [Ross 2003]
Trend 2: From One-size-fits-all-needs Storage to Specialized Storage
NoSQL movement: key-value stores, document stores, etc.
Remove unneeded complexity: ACID
High scalability
Array-oriented storage for the array data model
Examples: Dynamo, Membase, CouchDB, etc.
Trend 3: From Disks to Main Memory Storage
Memory is the new disk
Median analytic job sizes are less than 14 GB [Microsoft]
1 TB RAM is feasible
DRAM is at least 100 times faster than disks
Excellent for Big Velocity
Examples: Hyper [Kemper 2011], HANA [SAP], H-Store [Kallman 2008]
Targeted Environments
Data-intensive High Performance Computing (HPC): Big Volume, Big Variety
Geographically distributed environments: Big Volume
Big Data analytics in a multicore, big-memory server: Big Velocity
Contributions of This Thesis: Building Scalable Data Management Systems for Big Data

Contribution                                             Big Volume   Big Velocity   Big Variety
Building a scalable storage system to provide
efficient support for MPI-I/O atomicity                      √             —              —
Pyramid: a large-scale array-oriented storage system         √             —              √
Towards a globally distributed file system:
adapting BlobSeer to WAN scale                               √             —              —
DStore: an in-memory document-oriented store                 —             √              √
(√ = addressed, — = not addressed)
Context: Big Data in Data-intensive HPC
Contribution 1
Building a scalable storage system to provide efficient support for MPI-I/O atomicity
Problem Description
Spatial splitting in parallelization: ghost cells
[Figure: a spatial domain split among processes, with ghost cells at subdomain borders; each process's subdomain maps to a non-contiguous access pattern over the file]
Application data model vs. storage data model
File data: contiguous sequence of bytes
Concurrent overlapping non-contiguous I/O requires atomicity guarantees
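To make the mismatch concrete, here is a minimal sketch (illustrative values, not from the thesis) of why a rectangular subdomain of a row-major 2D array turns into one separate byte extent per row, i.e. a non-contiguous access pattern:

```python
# Minimal sketch: a 2D tile of a row-major array maps to one byte
# extent per row of the tile, hence non-contiguous file I/O.
# All sizes below are illustrative, not taken from the thesis.

CELL = 8  # bytes per array cell (e.g., a double)

def tile_extents(array_cols, row0, col0, rows, cols):
    """Return (offset, length) byte extents for a rows x cols tile
    whose top-left cell is (row0, col0)."""
    extents = []
    for r in range(row0, row0 + rows):
        offset = (r * array_cols + col0) * CELL
        extents.append((offset, cols * CELL))
    return extents

# A 4x4 tile of a 1024-column array: 4 separate extents, one per row.
for off, length in tile_extents(1024, 10, 20, 4, 4):
    print(off, length)
```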
State of the Art
Locking-based approaches to ensure atomicity, applied at three levels of the parallel I/O stack:
Applications: each process dumps its output to a single file (too many files)
MPI-I/O: the whole file is locked
Storage: byte-range locking based on POSIX locks
Poor scalability
Goal
High-throughput non-contiguous I/O under atomicity guarantees
Our Approach
Dedicated interface for atomic non-contiguous I/O
Provide atomicity guarantees at the storage level
No need to map MPI consistency to the storage consistency model
Shadowing rather than locking
Concurrent overlapped writes are allowed
Atomicity guarantees
Data striping
Building Block: BlobSeer Data Management Service
A KerData project (started with the thesis of Bogdan Nicolae)
Design:
Data striping
Distributed metadata management
Versioning
[Figure: BlobSeer architecture]
Building Block: BlobSeer (cont'd)
Two-phase I/O: data access, then metadata access
Access interface only for contiguous I/O: CREATE, READ, WRITE, CLONE
Distributed metadata management:
Organized as a segment tree
Distributed over a DHT
[Figure: a segment tree over the range (0,8); nodes are labeled (offset, size): root (0,8), children (0,4) and (4,4), then (0,2), (2,2), (4,2), (6,2), and leaves (0,1) through (7,1). Successive builds show a first writer shadowing the leaves (1,1) and (2,1) and their ancestors (0,2), (2,2), (0,4), up to a new root (0,8).]
Zoom on BlobSeer Metadata Generation
Return from the version manager for creating a new version:
A version number
A list of border nodes
[Figure: segment tree after two concurrent writers; border nodes are the existing nodes from earlier versions that a writer's newly created tree nodes must link to.]
Border node calculation is on the version manager side
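The following is a minimal sketch, assuming a plain dict as a stand-in for the metadata DHT, of the shadowing idea on a segment tree: nodes are immutable and keyed by (version, offset, size), an update creates new nodes only along the paths that cover the written range, and untouched subtrees are reused through border nodes of the previous version. None of the names below come from BlobSeer's actual code.

```python
# Illustrative sketch (not BlobSeer's implementation) of
# versioning-based shadowing on a segment tree.

SIZE = 8      # total range covered by the tree
nodes = {}    # (version, offset, size) -> children keys; stands in for the DHT

def build(version, offset=0, size=SIZE):
    """Create the initial tree for `version` over [offset, offset+size)."""
    if size == 1:
        nodes[(version, offset, size)] = None          # leaf
        return (version, offset, size)
    half = size // 2
    left = build(version, offset, half)
    right = build(version, offset + half, half)
    nodes[(version, offset, size)] = (left, right)     # children keys
    return (version, offset, size)

def write(new_v, old_v, woff, wlen, offset=0, size=SIZE):
    """Shadow every node whose range intersects the written extent."""
    if woff + wlen <= offset or woff >= offset + size:
        return (old_v, offset, size)       # border node: reuse the old version
    if size == 1:
        nodes[(new_v, offset, size)] = None
        return (new_v, offset, size)
    half = size // 2
    left = write(new_v, old_v, woff, wlen, offset, half)
    right = write(new_v, old_v, woff, wlen, offset + half, half)
    nodes[(new_v, offset, size)] = (left, right)
    return (new_v, offset, size)

root1 = build(1)
root2 = write(2, 1, woff=1, wlen=2)        # a writer updates cells [1, 3)
print(sorted(k for k in nodes if k[0] == 2))
# -> [(2, 0, 2), (2, 0, 4), (2, 0, 8), (2, 1, 1), (2, 2, 1), (2, 2, 2)]
```

Publishing `root2` is the only visible step, so a reader either sees the whole update or none of it; everything outside the written range is shared with version 1 through border nodes.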
Proposal for a Non-contiguous, Versioning-oriented Access Interface
Non-contiguous write: vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])
Non-contiguous read: NONCONT_READ(id, v, buffers[], offsets[], sizes[])
Requirements:
Non-contiguous I/O must be atomic
Efficient under concurrency
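As a toy model of the intended semantics (a runnable simplification, not the real client API; the blob id parameter is elided), each NONCONT_WRITE publishes all extents as one version, so a reader of any version never observes a partially applied update:

```python
# Runnable toy model of the proposed interface semantics: one
# NONCONT_WRITE call applies all extents and yields ONE new version.

snapshots = {0: bytes(8)}     # version -> blob contents (single toy blob)
latest = 0

def NONCONT_WRITE(buffers, offsets, sizes):
    global latest
    data = bytearray(snapshots[latest])
    for buf, off, size in zip(buffers, offsets, sizes):
        data[off:off + size] = buf[:size]
    latest += 1
    snapshots[latest] = bytes(data)        # publish all extents atomically
    return latest

def NONCONT_READ(v, offsets, sizes):
    return [snapshots[v][off:off + size] for off, size in zip(offsets, sizes)]

vw = NONCONT_WRITE([b"aa", b"bb"], [1, 4], [2, 2])
print(vw, NONCONT_READ(vw, [1, 4], [2, 2]))    # 1 [b'aa', b'bb']
```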
Non-contiguous I/O Must Be Atomic
Leveraging a shadowing mechanism
Isolating a non-contiguous update into one single consistent snapshot
Done at the metadata level
[Figure: successive builds of the segment tree; a writer updating the non-contiguous cells 1, 2, 4, and 5 shadows the corresponding leaves and their ancestors, and publishes them under a single new root (0,8).]
Efficient under Concurrency
Three proposed optimizations:
Minimizing ordering overhead
Moving border node computation from the version manager to the clients
Lazy evaluation during border node calculation
[Figure: two concurrent writers shadow the segment tree; lazy evaluation defers deciding whether a neighboring node is a border node of the left or of the right subtree until the needed version is known.]
Leveraging Our Versioning-oriented Interface in the Parallel I/O Stack
Integration of BlobSeer into the MPI-I/O middleware requires a new ADIO driver
Experimental Evaluation
Our machines: the Grid'5000 platform
Up to 80 nodes
Pentium 4 CPU @ 2.26 GHz, 4 GB RAM, Gigabit Ethernet
Measured bandwidth: 117.5 MB/s for MTU = 1500 B
Three sets of experiments:
Scalability of non-contiguous I/O
Scalability under concurrency
MPI-tile-I/O
Results of the Experiments: Our Approach vs. Locking-based
[Figure: two plots of aggregated throughput (MB/s) vs. number of concurrent clients (4, 9, 16, 25, 36), comparing BlobSeer against Lustre]
MPI-tile-I/O: 1024 × 1024 × 1024 tile size; subdomains are arranged in a row
Contribution 1 - Summary
A versioning-based mechanism to support atomic MPI-I/O efficiently
The optimization of moving border node computation to the clients has been integrated back into BlobSeer
Our approach outperforms locking-based approaches (aggregated throughput is 3.5 to 10 times higher)
Publication:
Efficient support for MPI-IO atomicity based on versioning. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), 514-523, Newport Beach, USA, May 2011.
Context: Big Data in Data-intensive HPC
Contribution 2
Pyramid: a scalable storage system for the array-oriented data model
Reconsidering the Mismatch Between Storage Model and Application Data Model
Application data model: multidimensional typed arrays, images, etc.
Storage data model:
Parallel file systems: simple and flat I/O data model
Mostly contiguous I/O interface: READ, WRITE(offset, size)
Additional layers are needed to translate the application data model into the storage data model
M. Stonebraker: One-storage-fits-all-needs Has Reached Its Limits
Performance of non-contiguous I/O vs. I/O atomicity
Losing data locality
Need to specialize the I/O stack to match the requirements of applications: array-oriented storage for the array data model
Our Approach: An Array-oriented Data Model Needs Array-oriented Storage
Multidimensional-aware chunking
Lock-free, distributed chunk indexing
Array versioning
Multidimensional-aware Chunking
[Figure: a process's multidimensional access pattern vs. the file's contiguous sequence of bytes]
Split the array into equal multidimensional chunks and distribute them over storage elements:
Simplifies load balancing among storage elements
Keeps the neighbors of a cell in the same chunk
Eliminates most non-contiguous I/O accesses
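As a rough illustration of multidimensional-aware chunking (array and chunk shapes below are made-up example values, not from Pyramid), mapping a rectangular access to the set of chunks it overlaps is a simple per-dimension range computation, so neighboring cells stay together and an access touches only a few chunks:

```python
# Illustrative sketch: map an N-dimensional subarray access to the
# set of chunks it overlaps. Shapes are made-up example values.

from itertools import product

def chunks_touched(offsets, sizes, chunk):
    """Chunk coordinates overlapped by the box [offsets, offsets+sizes)."""
    ranges = [
        range(off // c, (off + size - 1) // c + 1)
        for off, size, c in zip(offsets, sizes, chunk)
    ]
    return list(product(*ranges))

# A 2D array split into 64x64 chunks; a 96x50 access at (96, 60)
# overlaps only 4 chunks, one contiguous request per chunk.
print(chunks_touched(offsets=(96, 60), sizes=(96, 50), chunk=(64, 64)))
# -> [(1, 0), (1, 1), (2, 0), (2, 1)]
```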
Distributed Quadtree-like Structures
Common index structures for multidimensional data: R-tree, X-tree, etc.
All are designed and optimized for centralized management
Poor scalability under high concurrency
Our approach: porting quadtree-like structures to distributed environments
Array Versioning
Scientific applications need array versioning [VLDB 2009]:
Checkpointing
Cloning
Provenance
Our approach:
Keep data and metadata immutable
Updates are handled at the metadata level using a shadowing mechanism
A versioning array-oriented interface:
id = CREATE(n, sizes[], defval)
READ(id, v, offsets[], sizes[], buffer)
w = WRITE(id, offsets[], sizes[], buffer)
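A runnable toy model of these semantics (1D only, full copies instead of shadowed metadata; not Pyramid's implementation, and READ returns data instead of filling a caller-supplied buffer):

```python
# Toy model of the versioning array interface: WRITE never mutates
# an older version, it publishes a new one.

import copy

arrays = {}   # id -> list of versions, each a full toy copy

def CREATE(n, sizes, defval):
    aid = len(arrays)
    arrays[aid] = [[defval] * sizes[0]]    # toy: 1D only (n == 1)
    return aid

def WRITE(aid, offsets, sizes, buffer):
    snap = copy.deepcopy(arrays[aid][-1])
    snap[offsets[0]:offsets[0] + sizes[0]] = buffer
    arrays[aid].append(snap)               # publish a new version
    return len(arrays[aid]) - 1

def READ(aid, v, offsets, sizes):
    return arrays[aid][v][offsets[0]:offsets[0] + sizes[0]]

aid = CREATE(1, [8], 0)
w = WRITE(aid, [2], [3], [7, 7, 7])
print(READ(aid, w, [0], [8]))   # [0, 0, 7, 7, 7, 0, 0, 0]
print(READ(aid, 0, [0], [8]))   # version 0 is untouched: [0]*8
```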
Pyramid Architecture
Pyramid is based on BlobSeer [Nicolae - JPDC 2011]:
Version managers
Metadata managers
Storage manager
Storage servers
Clients
[Figure: Pyramid architecture]
Lock-free, Distributed Chunk Indexing
BlobSeer: a distributed segment tree
Pyramid: generalizes BlobSeer's metadata organization to quadtree-like structures
Quadtree for 2D arrays
Octree for 3D arrays
Tree nodes are immutable, uniquely identified by the version number and the subdomain they cover
A DHT distributes tree nodes over the metadata managers
Shadowing to reflect updates
[Figure: a distributed quadtree over a 4x4 chunked array (chunks 1 to 16); version 2 shadows only the nodes covering the updated chunks 15 and 16.]
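A small sketch of this keying scheme, with a plain dict standing in for the DHT; since nodes are immutable and identified by (version, subdomain), an update shadows only the path covering the modified chunk and never locks anything (illustrative code, not Pyramid's):

```python
# Illustrative sketch: immutable quadtree nodes keyed by
# (version, x, y, size) in a dict that stands in for the DHT.

dht = {}
N = 4   # the array is N x N chunks

def build(v, x=0, y=0, size=N):
    """Create the initial quadtree for version `v`."""
    if size > 1:
        h = size // 2
        for dx, dy in ((0, 0), (h, 0), (0, h), (h, h)):
            build(v, x + dx, y + dy, h)
    dht[(v, x, y, size)] = v

def write(new_v, old_v, rx, ry, x=0, y=0, size=N):
    """Shadow the nodes whose subdomain contains the modified chunk (rx, ry)."""
    if not (x <= rx < x + size and y <= ry < y + size):
        return                   # untouched subtree: reused as a border node
    if size > 1:
        h = size // 2
        for dx, dy in ((0, 0), (h, 0), (0, h), (h, h)):
            write(new_v, old_v, rx, ry, x + dx, y + dy, h)
    dht[(new_v, x, y, size)] = new_v

build(1)
write(2, 1, rx=3, ry=3)          # modify only the bottom-right chunk
print(sorted(k for k in dht if k[0] == 2))
# -> [(2, 0, 0, 4), (2, 2, 2, 2), (2, 3, 3, 1)]  (root, quadrant, leaf)
```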
Efficient Parallel Updating
[Figure: total ordering of two concurrent updates]
Experimental Evaluation
Up to 140 nodes of the Graphene cluster in the Grid'5000 testbed
1 Gbps Ethernet interconnect
Pyramid and the competitor system PVFS are deployed on 76 nodes
64 nodes are reserved for clients
Simulate a common access pattern exhibited by scientific applications: array dicing
Each client accesses a dedicated sub-array
Concurrent read/write
Measure the aggregated throughput
Aggregated Throughput Achieved under Concurrency
[Figure: two plots of aggregated throughput (MB/s) vs. number of concurrent clients, comparing Pyramid and PVFS2 for reading and writing. Left: weak scalability (fixed subdomain size, 1 to 49 client processes). Right: strong scalability (fixed total domain size, 1 to 64 client processes).]
Contribution 2 - Summary
Pyramid is an array-oriented storage system:
Offers parallel array processing for both read and write workloads
Built with a distributed metadata management system
Relies on shadowing to reflect updates
Preliminary evaluation shows promising scalability
Publications:
Towards scalable array-oriented active storage: the Pyramid approach. Tran V.-T., Nicolae B., Antoniu G. In ACM SIGOPS Operating Systems Review 46(1):19-25, 2012.
Pyramid: A large-scale array-oriented active storage system. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In The 5th Workshop on Large Scale Distributed Systems and Middleware (LADIS 2011), Seattle, USA, September 2011.
Context: Big Data in a Multicore, Big-memory Server
Contribution 3
DStore: a document-oriented store in main memory
Recall the Context: NoSQL Movement & In-memory Design
NoSQL movement:
Simplified data models: key-value, documents, graphs, etc.
Document-oriented stores offer rich functionality
Trending towards in-memory design:
90% of Facebook jobs process less than 100 GB [Facebook]
1 TB of DRAM is feasible
Memory accesses are at least 100 times faster than disks
Goal: efficient support for fast, atomic, complex transactions and high-throughput read queries
Observation
Example: T1 updates {A, B, C}; T2 updates {C, D, E}
The more complex the transactions, the higher the chance that they are dependent
Concurrent transaction processing:
Requires concurrent data structures
Locking & latching account for 30% overhead [VLDB 2007]
Serialization is unavoidable for dependent transactions
Synchronous index generation: the more indexes, the slower transaction processing
#1: Target Fast, Atomic Complex Transactions
[Figure: the master thread appends individual updates to a delta buffer; a slave thread bulk-merges the delta buffer into the index data structure as a background process]
Single-threaded execution model
Delta indexing & background index generation to deliver a fast processing rate
Bulk updating to ensure atomicity
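A condensed, illustrative sketch of this write path (not DStore's code; a small lock stands in for the master/slave buffer handoff): the master appends whole transactions to a delta buffer, and a background slave merges an entire buffer into a shadow copy of the index, published in a single step:

```python
# Illustrative sketch of the DStore-style write path: a single master
# thread appends transactions to a delta buffer; a slave thread merges
# whole buffers into a new index snapshot in the background.

import threading

index = {}                 # current published snapshot (read-only)
delta = []                 # updates not yet merged
lock = threading.Lock()    # only guards the tiny buffer swap

def master_apply(txn):
    """Master thread: append one atomic transaction (a dict of updates)."""
    with lock:
        delta.append(txn)

def slave_merge():
    """Slave thread: merge one entire delta buffer into a new snapshot."""
    global index, delta
    with lock:
        batch, delta = delta, []       # grab the whole buffer
    snapshot = dict(index)             # shadow copy; readers keep the old one
    for txn in batch:
        snapshot.update(txn)           # a txn is applied all-or-nothing
    index = snapshot                   # single-step publication

master_apply({"A": 1, "B": 2, "C": 3})   # T1
master_apply({"C": 9, "D": 4, "E": 5})   # T2 (dependent on T1 via C)
slave_merge()
print(index)   # {'A': 1, 'B': 2, 'C': 9, 'D': 4, 'E': 5}
```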
#2: Target High-throughput Read Queries
[Figure: multiple reader threads query the store; a fresh READ consults both the delta buffer and the index, while a stale READ is served from an older index snapshot]
Stale READ for performance
Versioning concurrency control: one new snapshot per entire delta buffer
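Continuing the sketch above with the same illustrative names, the read path pins whatever snapshot is current: a stale READ serves the latest published snapshot, while a fresh READ also scans the not-yet-merged delta buffer:

```python
# Continuation of the sketch: versioning concurrency control on the
# read path. Readers pin a snapshot; publishing swaps one reference.

snapshots = [{"A": 1}]        # list of published index snapshots

def publish(new_snapshot):
    snapshots.append(new_snapshot)     # writers add, never mutate

def stale_read(key):
    snap = snapshots[-1]               # pin the current snapshot
    return snap.get(key)               # later publishes don't affect us

def fresh_read(key, delta):
    """Fresh READ also scans the not-yet-merged delta buffer."""
    for txn in reversed(delta):        # the newest pending update wins
        if key in txn:
            return txn[key]
    return stale_read(key)

publish({"A": 1, "B": 2})
print(stale_read("B"))                      # 2, from the latest snapshot
print(fresh_read("B", delta=[{"B": 7}]))    # 7, pending update is visible
```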
Service Model
[Figure: DStore service model. The master thread dispatches update queries to several B-tree indexes; each index has its own delta buffer and a slave thread that periodically generates a new B-tree snapshot. Fresh reads consult a delta buffer plus the current B-tree index; stale reads are served directly from a snapshot.]