Pyramid: A large-scale array-oriented active storage system

1,774 views
1,620 views

Published on

The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.


  1. Pyramid: A large-scale array-oriented active storage system
     Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
     KerData Team, Inria, Rennes, France
     02 09 2011
  2. Outline
     1. Motivation
     2. Architecture
     3. Preliminary evaluation
     4. Conclusion
  3. Section 1: Motivation. Why array-oriented storage?
  4. Context: data-intensive, large-scale HPC simulations
     • The scalability of data management is becoming a critical issue
     • Mismatch between the storage model and the application data model
     • Application data model
       - Multidimensional typed arrays, images, etc.
     • Storage model
       - Parallel file systems: a simple, flat I/O model (a sequence of bytes)
       - Relational model: ill-suited for scientific data
     • Additional layers are needed to map the application model onto the storage model
  5. [M. Stonebraker] The "one storage fits all needs" approach has reached its limits
     • Parallel I/O stack
       - Performance of non-contiguous I/O vs. data atomicity
     • Relational data model
       - Simulating arrays on top of tables is poor in performance
       - Scalability issues for join queries
     • Need to specialize the I/O stack to match the applications' requirements
       - Array-oriented storage for the array data model
     • Example: SciDB with ArrayStore
     [Figure: the parallel I/O stack, from Application (VisIt, Tornado simulation) through the data model (HDF5, NetCDF) and the MPI-IO middleware down to parallel file systems]
  6. Our approach
     • Multi-dimensional-aware chunking
     • Lock-free, distributed chunk indexing
     • Array versioning
     • Active storage support
     • A versioning array-oriented access interface
  7. Multi-dimensional-aware chunking
     • Split the array into equal-sized chunks and distribute them over storage elements
       - Simplifies load balancing among storage elements
       - Keeps the neighbors of a cell in the same chunk
     • Shared-nothing architecture
       - Easier to handle data consistency
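A minimal Python sketch of the chunking idea on the slide above: each cell coordinate maps to the chunk that covers it, and chunks are spread over storage servers. The names chunk_shape and num_servers, and the hash-based placement, are illustrative assumptions, not Pyramid's actual scheme.

    def chunk_of(cell, chunk_shape):
        """Return the coordinate of the chunk covering a given cell coordinate."""
        return tuple(c // s for c, s in zip(cell, chunk_shape))

    def server_of(chunk_coord, num_servers):
        """Spread chunks over storage servers (hash-based placement, assumed here)."""
        return hash(chunk_coord) % num_servers

    # Example: a 2-D array split into 1024x1024 chunks, stored on 49 servers.
    print(chunk_of((2048, 513), (1024, 1024)))   # -> (2, 0)
    print(server_of((2, 0), 49))                  # some server index in [0, 49)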
  8. Lock-free, distributed chunk indexing
     • Indexing multi-dimensional information
       - R-tree, XD-tree, quad-tree, etc.
       - Designed and optimized for centralized management
     • A centralized metadata management scheme may not scale
       - Bottleneck under high concurrency
     • Our approach
       - Port quad-tree-like structures to a distributed environment
       - Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
  9. Array versioning
     • Scientific applications need array versioning (VLDB 2009)
       - Checkpointing
       - Cloning
       - Provenance
     • Keep data and metadata immutable
       - Updating a chunk is handled at the metadata level using a shadowing technique
  10. Active storage support
      • Move computation on the data to the storage elements
        - Conserves bandwidth
        - Better workload parallelization
      • Users can send user-defined handlers to the storage servers
  11. Versioning array-oriented access interface
      • Basic primitives
        - id = CREATE(n, sizes[], defval)
        - READ(id, v, offsets[], sizes[], buffer)
        - w = WRITE(id, offsets[], sizes[], buffer)
        - w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
      • Other primitives, such as cloning and filtering, can mostly be implemented on top of these basic primitives
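To make these primitives concrete, here is a minimal, purely illustrative Python sketch backed by an in-memory dict. Only the primitive names and argument lists come from the slide; the snapshot model, the module-level _store, and the helper _cells are assumptions for illustration.

    from itertools import count, product

    _ids = count()
    _store = {}   # array id -> {"sizes": ..., "default": ..., "snaps": [snapshot dicts]}

    def _cells(offsets, sizes):
        # Enumerate every cell coordinate of the selected sub-array.
        return product(*(range(o, o + s) for o, s in zip(offsets, sizes)))

    def CREATE(n, sizes, defval):
        # id = CREATE(n, sizes[], defval): a new n-dimensional array filled with defval.
        aid = next(_ids)
        _store[aid] = {"sizes": sizes, "default": defval, "snaps": [{}]}
        return aid

    def READ(aid, v, offsets, sizes, buffer):
        # READ(id, v, offsets[], sizes[], buffer): read a sub-array of snapshot v.
        a = _store[aid]
        for cell in _cells(offsets, sizes):
            buffer[cell] = a["snaps"][v].get(cell, a["default"])

    def WRITE(aid, offsets, sizes, buffer):
        # w = WRITE(id, offsets[], sizes[], buffer): shadow the latest snapshot.
        a = _store[aid]
        snap = dict(a["snaps"][-1])
        for cell in _cells(offsets, sizes):
            snap[cell] = buffer[cell]
        a["snaps"].append(snap)
        return len(a["snaps"]) - 1          # version number of the new snapshot

    def SEND_COMPUTATION(aid, v, offsets, sizes, f):
        # w = SEND_COMPUTATION(id, v, offsets[], sizes[], f): apply f "near the data".
        buf = {}
        READ(aid, v, offsets, sizes, buf)
        return WRITE(aid, offsets, sizes, {c: f(x) for c, x in buf.items()})

    # Example: create a 2-D array, write a 2x2 block, read it back from the new version.
    a = CREATE(2, [4, 4], 0)
    v = WRITE(a, [0, 0], [2, 2], {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4})
    out = {}
    READ(a, v, [0, 0], [2, 2], out)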
  12. Section 2: Pyramid architecture
  13. Architecture
      • Pyramid is inspired by our previous work, BlobSeer [JPDC 2011]
      • Version managers
        - Ensure concurrency control
      • Metadata managers
        - Store the index tree nodes
      • Storage manager
        - Monitors the storage servers
        - Ensures a load-balancing strategy for chunks among storage servers
      • Active storage servers
        - Store chunks and run handlers on chunks
      • Clients
        - Perform I/O accesses
  14. Read protocol
      [Sequence diagram: client, storage servers, metadata managers, version managers]
      • I: optionally ask the version manager for the latest published version
      • II: fetch the corresponding metadata from the metadata managers
      • III: contact the storage servers in parallel and fetch the chunks into the local buffer
  15. Write protocol
      [Sequence diagram: client, storage servers, metadata managers, storage manager, version manager]
      • I: get from the storage manager a list of storage servers able to store the chunks, one per chunk
      • II: contact the storage servers in parallel and write the chunks to the corresponding providers
      • III: get a version number for the update
      • IV: add new metadata to consolidate the new version
      • V: report that the new version is ready for publication
  16. Lock-free, distributed chunk indexing
      • Organized as a quad-tree to index 2D arrays
      • Each tree node has at most 4 children, each covering one of the four quadrants
      • The root covers the whole array
      • Each leaf corresponds to a chunk and holds information about its location
      • Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover
      • A DHT is used to distribute the tree nodes over the metadata managers
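Because tree nodes are immutable and identified by (version, sub-domain), their DHT key can be computed by any client without coordination. The concrete hashing and tuple layout below are assumptions for illustration, not Pyramid's actual format.

    import hashlib

    def node_key(version, origin, size):
        # Stable DHT key for the tree node of snapshot `version` covering the
        # square sub-domain of side `size` whose lower corner is `origin` = (row, col).
        raw = f"{version}:{origin[0]}:{origin[1]}:{size}".encode()
        return hashlib.sha1(raw).hexdigest()

    def quadrants(origin, size):
        # A node's four children each cover one quadrant of its sub-domain.
        h, (r, c) = size // 2, origin
        return [((r, c), h), ((r, c + h), h), ((r + h, c), h), ((r + h, c + h), h)]

    # Example: the root of a 4096x4096 array in snapshot 3, then its four quadrants.
    print(node_key(3, (0, 0), 4096))
    print(quadrants((0, 0), 4096))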
  17. Tree shadowing on update
      • Write the newly created chunks to the storage servers
      • Build the quad-tree associated with the new snapshot in a bottom-up fashion
        - Write the leaves to the DHT
        - Inner nodes may point to nodes of previous snapshots (which would imply synchronizing the quad-tree generation)
        - Synchronization is avoided by feeding writers additional information about the other concurrent updaters (thanks to the computable IDs of tree nodes)
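A compact sketch of the bottom-up, shadowed metadata build described above: only the sub-trees touched by the write get new nodes, while untouched quadrants are referenced by their previous-version keys. The data layout and names are illustrative, not Pyramid's actual metadata format.

    def build_snapshot(prev_version, version, new_leaves, origin, size):
        # new_leaves: {(leaf_origin, leaf_size): chunk_location} for the chunks
        # written by this update. Returns the nodes created for `version`;
        # untouched quadrants are referenced by previous-version keys (shadowing).
        written = {}

        def overlaps(o, s):
            return any(o[0] < lo[0] + ls and lo[0] < o[0] + s and
                       o[1] < lo[1] + ls and lo[1] < o[1] + s
                       for (lo, ls) in new_leaves)

        def build(o, s):
            if not overlaps(o, s):
                return (prev_version, o, s)        # shadow: reuse the old sub-tree as-is
            key = (version, o, s)
            if (o, s) in new_leaves:               # leaf: points at the newly written chunk
                written[key] = new_leaves[(o, s)]
            else:                                  # inner node: mix of new and old children
                h = s // 2
                written[key] = [build((o[0] + dr, o[1] + dc), h)
                                for dr in (0, h) for dc in (0, h)]
            return key

        build(origin, size)
        return written

    # Example: a 4x4 array with 1x1 leaf chunks; this update rewrote only chunk (0, 0).
    print(build_snapshot(1, 2, {((0, 0), 1): "server-7/chunk-42"}, (0, 0), 4))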
  18. Efficient parallel updating
      [Sequence diagram: clients #1 and #2, storage servers, metadata managers, version manager]
      • Chunks are written concurrently
      • Versions are assigned in the order the clients finish writing their chunks
      • Clients get additional information about the other concurrent writers
      • Tree nodes are written in a lock-free manner
      • Versions are published in the order they were assigned
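A small sketch of the publication-ordering rule on this slide: version numbers are handed out as writers finish writing their chunks, but snapshots only become visible in the order the numbers were assigned. The class and method names are illustrative.

    class VersionManagerSketch:
        def __init__(self):
            self.next_assign = 1     # next version number to hand out
            self.next_publish = 1    # lowest version not yet published
            self.ready = set()       # versions whose metadata is consolidated

        def assign(self):
            v, self.next_assign = self.next_assign, self.next_assign + 1
            return v

        def report_ready(self, v):
            # Publish in assignment order, even if metadata finishes out of order.
            self.ready.add(v)
            published = []
            while self.next_publish in self.ready:
                self.ready.remove(self.next_publish)
                published.append(self.next_publish)
                self.next_publish += 1
            return published

    # Example: the writer holding version 2 consolidates its metadata first.
    vm = VersionManagerSketch()
    v1, v2 = vm.assign(), vm.assign()
    print(vm.report_ready(v2))   # -> []      (must wait for version 1)
    print(vm.report_ready(v1))   # -> [1, 2]  (both published, in order)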
  19. Some more I/O primitives
      • Easily implemented thanks to immutable data and metadata blocks
      • Cheap I/O operators
      • Cloning a sub-domain
        - Follow the metadata tree of a specific snapshot
        - Create a new metadata tree and publish it as a newly created array
      • Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
  20. Section 3: Preliminary evaluation. Experiments on Grid'5000 (www.grid5000.fr)
  21. Experimental setup
      • Simulated a common access pattern exhibited by scientific applications: array dicing
      • Used at most 130 nodes of the Graphene cluster on Grid'5000
        - 1 Gbps Ethernet interconnect
        - 49 nodes used to deploy Pyramid and the competing system, PVFS
      • Array dicing
        - Each client accesses a dedicated sub-array
        - 1 GB per client, consisting of 32x32 chunks (chunk size: 1024x1024 bytes)
        - Concurrent reading/writing
      • Measured performance and scalability
  22. Aggregated throughput achieved under concurrency
      • PVFS suffers from the non-contiguous access pattern, due to serialization into a flat file
      • Pyramid
        - Throughput increases steadily
        - Promising scalability of both the data and the metadata organization
  23. Section 4: Conclusion
  24. Conclusion
      • Pyramid is an array-oriented active storage system
      • We proposed a system offering support for
        - Parallel array processing for both read and write workloads
        - Data versioning
        - Distributed metadata management, with shadowing to reflect updates
      • The preliminary evaluation shows promising scalability
      • Future work
        - Planned integration with HDF5
        - Pyramid as a storage engine for SciDB?
        - Investigate keeping data at quad-tree inner nodes, which could be used to store the array at different resolutions (map applications)
  25. Thank you
      INRIA – KerData Research Team
      www.irisa.fr/kerdata
