Abstract:
Big Data can be characterized by three V's.
Big Volume refers to the unprecedented growth in the amount of data.
Big Velocity refers to the growth in the speed at which data moves in and out of data management systems.
Big Variety refers to the growth in the number of different data formats.
Managing Big Data requires fundamental changes in the architecture of data management systems. Storage systems must keep evolving to accommodate this growth: they need to be scalable while sustaining high-performance data access.
This thesis focuses on building scalable data management systems for Big Data.
Our first and second contributions address the challenge of providing efficient support for Big Volume in data-intensive high performance computing (HPC) environments. In particular, we address the shortcomings of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need to perform expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems to achieve scalable performance. Pyramid favors multidimensional-aware data chunking, which closely matches the access patterns generated by applications. Pyramid also favors distributed metadata management and versioning-based concurrency control to eliminate synchronization under concurrent accesses.
Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data management service, and we propose BlobSeer-WAN, an extension of BlobSeer optimized for such geographically distributed environments. BlobSeer-WAN takes the latency hierarchy into account by favoring local metadata accesses. It features asynchronous metadata replication and a vector-clock implementation for collision resolution.
To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast, atomic processing of complex write transactions while sustaining high-throughput read access. It follows a single-threaded execution model to execute update transactions sequentially, while relying on versioning-based concurrency control to enable a large number of simultaneous readers.
Scalable Data Management Systems for Big Data
Viet-Trung Tran
KerData team
PhD Advisors: Gabriel Antoniu and Luc Bougé
January 21st, 2013

Outline
Context
Contribution 1: Efficient Support for MPI-I/O Atomicity
Contribution 2: A large-scale array-oriented storage system
Contribution 3: A document-oriented store
Conclusions
Big Data Explosion
Big Data in Data-intensive HPC
Data-intensive HPC relies on supercomputers to process, analyze, and/or visualize massive amounts of data.
Some numbers:
Large Hadron Collider Grid: 25 PB per year, I/O rates of 300 GB/s
Blue Waters: peak I/O rates measured at 1 TB/s
Data come from a variety of sources: observations, simulations, experimental systems, etc.
Definition of Big Data
According to M. Stonebraker, Big Data has at least one of the following characteristics:
Big Volume: large datasets (TB and more)
Big Velocity: data is moving very fast
Big Variety: data exist in a large number of formats
Big Data Challenges
Objective of this thesis: building scalable data management systems for Big Data.
Dealing With Scalability
Scalability is defined as the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth.
Two methods for scaling:
Scale horizontally (scale out)
Scale vertically (scale up)
Trend 1: From Centralized to Distributed Approaches
Centralized storage servers to distributed parallel file systems
Centralized file servers ⇒ Cluster ⇒ Grid, Cloud
Centralized to distributed metadata management
Example: PVFSv1 [Blumer 1994] ⇒ PVFSv2 [Ross 2003]
Trend 2: From One-size-fits-all-needs Storage to Specialized Storage
NoSQL movement: key-value stores, document stores, etc.
Remove unneeded complexity: ACID
High scalability
Array-oriented storage for the array data model
Examples: Dynamo, Membase, CouchDB, etc.
Trend 3: From Disks to Main Memory Storage
Memory is the new disk
Median analytic job sizes are less than 14 GB [Microsoft]
1 TB RAM is feasible
DRAM is at least 100 times faster than disks
Excellent for Big Velocity
Examples: Hyper [Kemper 2011], HANA [SAP], H-Store [Kallman 2008]
Targeted Environments
Data-intensive High Performance Computing (HPC): Big Volume, Big Variety
Geographically distributed environments: Big Volume
Big Data analytics in a multicore, big-memory server: Big Velocity
Contributions of This Thesis: Building Scalable Data Management Systems for Big Data

Contribution                                             Big Volume   Big Velocity   Big Variety
Building a scalable storage system to provide
efficient support for MPI-I/O atomicity                      √             —              —
Pyramid: a large-scale array-oriented storage system         √             —              √
Towards a globally distributed file system:
adapting BlobSeer to WAN scale                               √             —              —
DStore: an in-memory document-oriented store                 —             √              √
(√ = addressed, — = not addressed)
Context: Big Data in Data-intensive HPC
Contribution 1
Building a scalable storage system to provide efficient support for MPI-I/O atomicity
Problem Description
Spatial splitting in parallelization: ghost cells
[Figure: a spatial domain split among processes, with ghost cells at subdomain borders; each process's subdomain maps to a non-contiguous access pattern over the file]
Application data model vs. storage data model
File data: contiguous sequence of bytes
Concurrent overlapping non-contiguous I/O requires atomicity guarantees
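To make the mismatch concrete, here is a minimal sketch (illustrative values, not from the thesis) of why a rectangular subdomain of a row-major 2D array turns into one separate byte extent per row, i.e. a non-contiguous access pattern:

```python
# Minimal sketch: a 2D tile of a row-major array maps to one byte
# extent per row of the tile, hence non-contiguous file I/O.
# All sizes below are illustrative, not taken from the thesis.

CELL = 8  # bytes per array cell (e.g., a double)

def tile_extents(array_cols, row0, col0, rows, cols):
    """Return (offset, length) byte extents for a rows x cols tile
    whose top-left cell is (row0, col0)."""
    extents = []
    for r in range(row0, row0 + rows):
        offset = (r * array_cols + col0) * CELL
        extents.append((offset, cols * CELL))
    return extents

# A 4x4 tile of a 1024-column array: 4 separate extents, one per row.
for off, length in tile_extents(1024, 10, 20, 4, 4):
    print(off, length)
```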
State of the Art
Locking-based approaches to ensure atomicity, applied at three levels of the parallel I/O stack:
Applications: each process dumps its output to a single file (too many files)
MPI-I/O: the whole file is locked
Storage: byte-range locking based on POSIX locks
Poor scalability
Goal
High-throughput non-contiguous I/O under atomicity guarantees
Our Approach
Dedicated interface for atomic non-contiguous I/O
Provide atomicity guarantees at the storage level
No need to map MPI consistency to the storage consistency model
Shadowing rather than locking
Concurrent overlapped writes are allowed
Atomicity guarantees
Data striping
Building Block: BlobSeer Data Management Service
A KerData project (started with the thesis of Bogdan Nicolae)
Design:
Data striping
Distributed metadata management
Versioning
[Figure: BlobSeer architecture]
Building Block: BlobSeer (cont'd)
Two-phase I/O: data access, then metadata access
Access interface only for contiguous I/O: CREATE, READ, WRITE, CLONE
Distributed metadata management:
Organized as a segment tree
Distributed over a DHT
[Figure: a segment tree over the range (0,8); nodes are labeled (offset, size): root (0,8), children (0,4) and (4,4), then (0,2), (2,2), (4,2), (6,2), and leaves (0,1) through (7,1). Successive builds show a first writer shadowing the leaves (1,1) and (2,1) and their ancestors (0,2), (2,2), (0,4), up to a new root (0,8).]
Zoom on BlobSeer Metadata Generation
Return from the version manager for creating a new version:
A version number
A list of border nodes
[Figure: segment tree after two concurrent writers; border nodes are the existing nodes from earlier versions that a writer's newly created tree nodes must link to.]
Border node calculation is on the version manager side
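The following is a minimal sketch, assuming a plain dict as a stand-in for the metadata DHT, of the shadowing idea on a segment tree: nodes are immutable and keyed by (version, offset, size), an update creates new nodes only along the paths that cover the written range, and untouched subtrees are reused through border nodes of the previous version. None of the names below come from BlobSeer's actual code.

```python
# Illustrative sketch (not BlobSeer's implementation) of
# versioning-based shadowing on a segment tree.

SIZE = 8      # total range covered by the tree
nodes = {}    # (version, offset, size) -> children keys; stands in for the DHT

def build(version, offset=0, size=SIZE):
    """Create the initial tree for `version` over [offset, offset+size)."""
    if size == 1:
        nodes[(version, offset, size)] = None          # leaf
        return (version, offset, size)
    half = size // 2
    left = build(version, offset, half)
    right = build(version, offset + half, half)
    nodes[(version, offset, size)] = (left, right)     # children keys
    return (version, offset, size)

def write(new_v, old_v, woff, wlen, offset=0, size=SIZE):
    """Shadow every node whose range intersects the written extent."""
    if woff + wlen <= offset or woff >= offset + size:
        return (old_v, offset, size)       # border node: reuse the old version
    if size == 1:
        nodes[(new_v, offset, size)] = None
        return (new_v, offset, size)
    half = size // 2
    left = write(new_v, old_v, woff, wlen, offset, half)
    right = write(new_v, old_v, woff, wlen, offset + half, half)
    nodes[(new_v, offset, size)] = (left, right)
    return (new_v, offset, size)

root1 = build(1)
root2 = write(2, 1, woff=1, wlen=2)        # a writer updates cells [1, 3)
print(sorted(k for k in nodes if k[0] == 2))
# -> [(2, 0, 2), (2, 0, 4), (2, 0, 8), (2, 1, 1), (2, 2, 1), (2, 2, 2)]
```

Publishing `root2` is the only visible step, so a reader either sees the whole update or none of it; everything outside the written range is shared with version 1 through border nodes.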
Proposal for a Non-contiguous, Versioning-oriented Access Interface
Non-contiguous write: vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])
Non-contiguous read: NONCONT_READ(id, v, buffers[], offsets[], sizes[])
Requirements:
Non-contiguous I/O must be atomic
Efficient under concurrency
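As a toy model of the intended semantics (a runnable simplification, not the real client API; the blob id parameter is elided), each NONCONT_WRITE publishes all extents as one version, so a reader of any version never observes a partially applied update:

```python
# Runnable toy model of the proposed interface semantics: one
# NONCONT_WRITE call applies all extents and yields ONE new version.

snapshots = {0: bytes(8)}     # version -> blob contents (single toy blob)
latest = 0

def NONCONT_WRITE(buffers, offsets, sizes):
    global latest
    data = bytearray(snapshots[latest])
    for buf, off, size in zip(buffers, offsets, sizes):
        data[off:off + size] = buf[:size]
    latest += 1
    snapshots[latest] = bytes(data)        # publish all extents atomically
    return latest

def NONCONT_READ(v, offsets, sizes):
    return [snapshots[v][off:off + size] for off, size in zip(offsets, sizes)]

vw = NONCONT_WRITE([b"aa", b"bb"], [1, 4], [2, 2])
print(vw, NONCONT_READ(vw, [1, 4], [2, 2]))    # 1 [b'aa', b'bb']
```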
Non-contiguous I/O Must Be Atomic
Leveraging a shadowing mechanism
Isolating a non-contiguous update into one single consistent snapshot
Done at the metadata level
[Figure: successive builds of the segment tree; a writer updating the non-contiguous cells 1, 2, 4, and 5 shadows the corresponding leaves and their ancestors, and publishes them under a single new root (0,8).]
Efficient under Concurrency
Three proposed optimizations:
Minimizing ordering overhead
Moving border node computation from the version manager to the clients
Lazy evaluation during border node calculation
[Figure: two concurrent writers shadow the segment tree; lazy evaluation defers deciding whether a neighboring node is a border node of the left or of the right subtree until the needed version is known.]
Leveraging Our Versioning-oriented Interface in the Parallel I/O Stack
Integration of BlobSeer into the MPI-I/O middleware requires a new ADIO driver
Experimental Evaluation
Our machines: the Grid'5000 platform
Up to 80 nodes
Pentium 4 CPU @ 2.26 GHz, 4 GB RAM, Gigabit Ethernet
Measured bandwidth: 117.5 MB/s for MTU = 1500 B
Three sets of experiments:
Scalability of non-contiguous I/O
Scalability under concurrency
MPI-tile-I/O
Results of the Experiments: Our Approach vs. Locking-based
[Figure: two plots of aggregated throughput (MB/s) vs. number of concurrent clients (4, 9, 16, 25, 36), comparing BlobSeer against Lustre]
MPI-tile-I/O: 1024 × 1024 × 1024 tile size; subdomains are arranged in a row
Contribution 1 - Summary
A versioning-based mechanism to support atomic MPI-I/O efficiently
The optimization of moving border node computation to the clients has been integrated back into BlobSeer
Our approach outperforms locking-based approaches (aggregated throughput is 3.5 to 10 times higher)
Publication:
Efficient support for MPI-IO atomicity based on versioning. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), 514-523, Newport Beach, USA, May 2011.
Context: Big Data in Data-intensive HPC
Contribution 2
Pyramid: a scalable storage system for the array-oriented data model
Reconsidering the Mismatch Between Storage Model and Application Data Model
Application data model: multidimensional typed arrays, images, etc.
Storage data model:
Parallel file systems: simple and flat I/O data model
Mostly contiguous I/O interface: READ, WRITE(offset, size)
Additional layers are needed to translate the application data model into the storage data model
M. Stonebraker: One-storage-fits-all-needs Has Reached Its Limits
Performance of non-contiguous I/O vs. I/O atomicity
Losing data locality
Need to specialize the I/O stack to match the requirements of applications: array-oriented storage for the array data model
Our Approach: An Array-oriented Data Model Needs Array-oriented Storage
Multidimensional-aware chunking
Lock-free, distributed chunk indexing
Array versioning
Multidimensional-aware Chunking
[Figure: a process's multidimensional access pattern vs. the file's contiguous sequence of bytes]
Split the array into equal multidimensional chunks and distribute them over storage elements:
Simplifies load balancing among storage elements
Keeps the neighbors of a cell in the same chunk
Eliminates most non-contiguous I/O accesses
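As a rough illustration of multidimensional-aware chunking (array and chunk shapes below are made-up example values, not from Pyramid), mapping a rectangular access to the set of chunks it overlaps is a simple per-dimension range computation, so neighboring cells stay together and an access touches only a few chunks:

```python
# Illustrative sketch: map an N-dimensional subarray access to the
# set of chunks it overlaps. Shapes are made-up example values.

from itertools import product

def chunks_touched(offsets, sizes, chunk):
    """Chunk coordinates overlapped by the box [offsets, offsets+sizes)."""
    ranges = [
        range(off // c, (off + size - 1) // c + 1)
        for off, size, c in zip(offsets, sizes, chunk)
    ]
    return list(product(*ranges))

# A 2D array split into 64x64 chunks; a 96x50 access at (96, 60)
# overlaps only 4 chunks, one contiguous request per chunk.
print(chunks_touched(offsets=(96, 60), sizes=(96, 50), chunk=(64, 64)))
# -> [(1, 0), (1, 1), (2, 0), (2, 1)]
```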
Distributed Quadtree-like Structures
Common index structures for multidimensional data: R-tree, X-tree, etc.
All are designed and optimized for centralized management
Poor scalability under high concurrency
Our approach: porting quadtree-like structures to distributed environments
Array Versioning
Scientific applications need array versioning [VLDB 2009]:
Checkpointing
Cloning
Provenance
Our approach:
Keep data and metadata immutable
Updates are handled at the metadata level using a shadowing mechanism
A versioning array-oriented interface:
id = CREATE(n, sizes[], defval)
READ(id, v, offsets[], sizes[], buffer)
w = WRITE(id, offsets[], sizes[], buffer)
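A runnable toy model of these semantics (1D only, full copies instead of shadowed metadata; not Pyramid's implementation, and READ returns data instead of filling a caller-supplied buffer):

```python
# Toy model of the versioning array interface: WRITE never mutates
# an older version, it publishes a new one.

import copy

arrays = {}   # id -> list of versions, each a full toy copy

def CREATE(n, sizes, defval):
    aid = len(arrays)
    arrays[aid] = [[defval] * sizes[0]]    # toy: 1D only (n == 1)
    return aid

def WRITE(aid, offsets, sizes, buffer):
    snap = copy.deepcopy(arrays[aid][-1])
    snap[offsets[0]:offsets[0] + sizes[0]] = buffer
    arrays[aid].append(snap)               # publish a new version
    return len(arrays[aid]) - 1

def READ(aid, v, offsets, sizes):
    return arrays[aid][v][offsets[0]:offsets[0] + sizes[0]]

aid = CREATE(1, [8], 0)
w = WRITE(aid, [2], [3], [7, 7, 7])
print(READ(aid, w, [0], [8]))   # [0, 0, 7, 7, 7, 0, 0, 0]
print(READ(aid, 0, [0], [8]))   # version 0 is untouched: [0]*8
```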
Pyramid Architecture
Pyramid is based on BlobSeer [Nicolae - JPDC 2011]:
Version managers
Metadata managers
Storage manager
Storage servers
Clients
[Figure: Pyramid architecture]
Lock-free, Distributed Chunk Indexing
BlobSeer: a distributed segment tree
Pyramid: generalizes BlobSeer's metadata organization to quadtree-like structures
Quadtree for 2D arrays
Octree for 3D arrays
Tree nodes are immutable, uniquely identified by the version number and the subdomain they cover
A DHT distributes tree nodes over the metadata managers
Shadowing to reflect updates
[Figure: a distributed quadtree over a 4x4 chunked array (chunks 1 to 16); version 2 shadows only the nodes covering the updated chunks 15 and 16.]
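A small sketch of this keying scheme, with a plain dict standing in for the DHT; since nodes are immutable and identified by (version, subdomain), an update shadows only the path covering the modified chunk and never locks anything (illustrative code, not Pyramid's):

```python
# Illustrative sketch: immutable quadtree nodes keyed by
# (version, x, y, size) in a dict that stands in for the DHT.

dht = {}
N = 4   # the array is N x N chunks

def build(v, x=0, y=0, size=N):
    """Create the initial quadtree for version `v`."""
    if size > 1:
        h = size // 2
        for dx, dy in ((0, 0), (h, 0), (0, h), (h, h)):
            build(v, x + dx, y + dy, h)
    dht[(v, x, y, size)] = v

def write(new_v, old_v, rx, ry, x=0, y=0, size=N):
    """Shadow the nodes whose subdomain contains the modified chunk (rx, ry)."""
    if not (x <= rx < x + size and y <= ry < y + size):
        return                   # untouched subtree: reused as a border node
    if size > 1:
        h = size // 2
        for dx, dy in ((0, 0), (h, 0), (0, h), (h, h)):
            write(new_v, old_v, rx, ry, x + dx, y + dy, h)
    dht[(new_v, x, y, size)] = new_v

build(1)
write(2, 1, rx=3, ry=3)          # modify only the bottom-right chunk
print(sorted(k for k in dht if k[0] == 2))
# -> [(2, 0, 0, 4), (2, 2, 2, 2), (2, 3, 3, 1)]  (root, quadrant, leaf)
```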
Efficient Parallel Updating
[Figure: total ordering of two concurrent updates]
Experimental Evaluation
Up to 140 nodes of the Graphene cluster in the Grid'5000 testbed
1 Gbps Ethernet interconnect
Pyramid and the competitor system PVFS are deployed on 76 nodes
64 nodes are reserved for clients
Simulate a common access pattern exhibited by scientific applications: array dicing
Each client accesses a dedicated sub-array
Concurrent read/write
Measure the aggregated throughput
Aggregated Throughput Achieved under Concurrency
[Figure: two plots of aggregated throughput (MB/s) vs. number of concurrent clients, comparing Pyramid and PVFS2 for reading and writing. Left: weak scalability (fixed subdomain size, 1 to 49 client processes). Right: strong scalability (fixed total domain size, 1 to 64 client processes).]
Contribution 2 - Summary
Pyramid is an array-oriented storage system:
Offers parallel array processing for both read and write workloads
Built with a distributed metadata management system
Relies on shadowing to reflect updates
Preliminary evaluation shows promising scalability
Publications:
Towards scalable array-oriented active storage: the Pyramid approach. Tran V.-T., Nicolae B., Antoniu G. In ACM SIGOPS Operating Systems Review 46(1):19-25, 2012.
Pyramid: A large-scale array-oriented active storage system. Tran V.-T., Nicolae B., Antoniu G., Bougé L. In The 5th Workshop on Large Scale Distributed Systems and Middleware (LADIS 2011), Seattle, USA, September 2011.
Context: Big Data in a Multicore, Big-memory Server
Contribution 3
DStore: a document-oriented store in main memory
Recall the Context: NoSQL Movement & In-memory Design
NoSQL movement:
Simplified data models: key-value, documents, graphs, etc.
Document-oriented stores offer rich functionality
Trending towards in-memory design:
90% of Facebook jobs process less than 100 GB [Facebook]
1 TB of DRAM is feasible
Memory accesses are at least 100 times faster than disks
Goal: efficient support for fast, atomic, complex transactions and high-throughput read queries
Observation
Example: T1 updates {A, B, C}; T2 updates {C, D, E}
The more complex the transactions, the higher the chance that they are dependent
Concurrent transaction processing:
Requires concurrent data structures
Locking & latching account for 30% overhead [VLDB 2007]
Serialization is unavoidable for dependent transactions
Synchronous index generation: the more indexes, the slower transaction processing
#1: Target Fast, Atomic Complex Transactions
[Figure: the master thread appends individual updates to a delta buffer; a slave thread bulk-merges the delta buffer into the index data structure as a background process]
Single-threaded execution model
Delta indexing & background index generation to deliver a fast processing rate
Bulk updating to ensure atomicity
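A condensed, illustrative sketch of this write path (not DStore's code; a small lock stands in for the master/slave buffer handoff): the master appends whole transactions to a delta buffer, and a background slave merges an entire buffer into a shadow copy of the index, published in a single step:

```python
# Illustrative sketch of the DStore-style write path: a single master
# thread appends transactions to a delta buffer; a slave thread merges
# whole buffers into a new index snapshot in the background.

import threading

index = {}                 # current published snapshot (read-only)
delta = []                 # updates not yet merged
lock = threading.Lock()    # only guards the tiny buffer swap

def master_apply(txn):
    """Master thread: append one atomic transaction (a dict of updates)."""
    with lock:
        delta.append(txn)

def slave_merge():
    """Slave thread: merge one entire delta buffer into a new snapshot."""
    global index, delta
    with lock:
        batch, delta = delta, []       # grab the whole buffer
    snapshot = dict(index)             # shadow copy; readers keep the old one
    for txn in batch:
        snapshot.update(txn)           # a txn is applied all-or-nothing
    index = snapshot                   # single-step publication

master_apply({"A": 1, "B": 2, "C": 3})   # T1
master_apply({"C": 9, "D": 4, "E": 5})   # T2 (dependent on T1 via C)
slave_merge()
print(index)   # {'A': 1, 'B': 2, 'C': 9, 'D': 4, 'E': 5}
```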
#2: Target High-throughput Read Queries
[Figure: multiple reader threads query the store; a fresh READ consults both the delta buffer and the index, while a stale READ is served from an older index snapshot]
Stale READ for performance
Versioning concurrency control: one new snapshot per entire delta buffer
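Continuing the sketch above with the same illustrative names, the read path pins whatever snapshot is current: a stale READ serves the latest published snapshot, while a fresh READ also scans the not-yet-merged delta buffer:

```python
# Continuation of the sketch: versioning concurrency control on the
# read path. Readers pin a snapshot; publishing swaps one reference.

snapshots = [{"A": 1}]        # list of published index snapshots

def publish(new_snapshot):
    snapshots.append(new_snapshot)     # writers add, never mutate

def stale_read(key):
    snap = snapshots[-1]               # pin the current snapshot
    return snap.get(key)               # later publishes don't affect us

def fresh_read(key, delta):
    """Fresh READ also scans the not-yet-merged delta buffer."""
    for txn in reversed(delta):        # the newest pending update wins
        if key in txn:
            return txn[key]
    return stale_read(key)

publish({"A": 1, "B": 2})
print(stale_read("B"))                      # 2, from the latest snapshot
print(fresh_read("B", delta=[{"B": 7}]))    # 7, pending update is visible
```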
Service Model
[Figure: DStore service model. The master thread dispatches update queries to several B-tree indexes; each index has its own delta buffer and a slave thread that periodically generates a new B-tree snapshot. Fresh reads consult a delta buffer plus the current B-tree index; stale reads are served directly from a snapshot.]