
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats


  1. SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
     Speaker: LIN Qian, http://www.comp.nus.edu.sg/~linqian
  2. Scientific data analysis today
     • Increasingly data-intensive
       – Volume approximately doubles each year
     • Stored in certain specialized formats
       – NetCDF, HDF5, ADIOS, ...
     • Popularity of MapReduce and its variants
       – Free accessibility
       – Easy programmability
       – Good scalability
       – Built-in fault tolerance
  3. NetCDF
     • Network Common Data Form
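For readers unfamiliar with the format, here is a minimal sketch of reading a variable from a NetCDF file with the Python netCDF4 library; the file name and variable name are hypothetical, chosen only for illustration.

```python
# Minimal sketch: inspect and read a NetCDF file with the netCDF4 library.
# "climate.nc" and the variable "sst" are hypothetical examples.
from netCDF4 import Dataset

with Dataset("climate.nc", "r") as ds:
    print(ds.dimensions.keys())        # named dimensions (e.g. time, lat, lon)
    print(ds.variables.keys())         # multidimensional variables
    sst = ds.variables["sst"][:]       # read the whole variable into an array
    print(sst.shape, sst.dtype)
```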
  4. HDF5
     • Hierarchical Data Format
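Similarly, a minimal h5py sketch shows how HDF5 organizes data into groups and datasets, and that a subset of a dataset can be read without loading the whole file; the file and dataset names are hypothetical.

```python
# Minimal sketch: inspect and read an HDF5 file with h5py.
# "simulation.h5" and "/fields/temperature" are hypothetical examples.
import h5py

with h5py.File("simulation.h5", "r") as f:
    f.visit(print)                          # list all groups and datasets
    temp = f["/fields/temperature"][:]      # read an entire dataset
    sub = f["/fields/temperature"][0:10]    # or read only a small slab
    print(temp.shape, sub.shape)
```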
  5. Scientific data analysis today (cont.)
     • “Store-first-analyze-after”
       – Reload data into another file system, e.g. load data from PVFS into HDFS
       – Reload data into another data format, e.g. convert NetCDF/HDF5 data into a format the analysis system expects
     • Problems
       – Long data migration/transformation time
       – Stresses network and disks
  6. SciMATE
     • In-situ scientific data analysis
       – MapReduce with AlternaTE API
       – Supporting NetCDF, HDF5, and flat files
         o No data reloading!
       – Transparent to app developers
     • Optimized for
       – Access strategies
       – Access patterns
  7. System overview
  8. Scientific Data Processing Module and Runtime System (diagram)
  9. Integrating a new data format
     • Data adaptation layer is customizable
       – Third-party adapter
       – Open for extension but closed for modification
     • A new adapter has to implement the generic block loader interface
       – Partitioning function and auxiliary functions
       – Data access functions
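The slide lists what the interface requires but not its signature, so the following is only a hypothetical sketch of what a generic block loader adapter could look like; the class and method names are illustrative and are not SciMATE's actual API.

```python
# Hypothetical sketch of a per-format "block loader" adapter.
# Names and signatures are illustrative, not SciMATE's real interface.
from abc import ABC, abstractmethod
from typing import Any, List, Tuple

class BlockLoader(ABC):
    """One adapter per scientific data format (NetCDF, HDF5, flat file, ...)."""

    @abstractmethod
    def partition(self, dataset: str, num_splits: int) -> List[Tuple[int, int]]:
        """Partitioning function: divide the dataset into (offset, length) splits."""

    @abstractmethod
    def full_read(self, dataset: str) -> Any:
        """Data access function: load the entire dataset."""

    @abstractmethod
    def partial_read(self, dataset: str, split: Tuple[int, int]) -> Any:
        """Data access function: load only the requested split."""
```

A third-party format can then be plugged in by subclassing the adapter, keeping the layer open for extension but closed for modification.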
  10. Data access strategies and patterns
      • full_read()
        – Too expensive for reading small data subsets
      • partial_read()
        – Strided pattern
          o partial_read_by_block()
        – Column pattern
          o partial_read_by_column()
        – Discrete point pattern
          o partial_read_by_list()
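To make the three partial-read patterns concrete, the sketch below mimics their semantics with plain h5py slicing on a hypothetical 2-D dataset; it only illustrates the access shapes, not the actual SciMATE API calls.

```python
# Illustration of the three partial-read patterns using h5py slicing.
# "simulation.h5" and "/fields/temperature" are hypothetical; assume a
# 2-D dataset with enough rows and columns for the selections below.
import h5py

with h5py.File("simulation.h5", "r") as f:
    dset = f["/fields/temperature"]

    # Strided pattern (~ partial_read_by_block): blocks at a regular stride.
    strided = dset[::100, :]            # every 100th row

    # Column pattern (~ partial_read_by_column): selected columns only.
    columns = dset[:, 0:4]              # the first four columns

    # Discrete point pattern (~ partial_read_by_list): an explicit index list.
    points = dset[[3, 17, 42], :]       # rows given by an increasing list
```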
  11. Access Pattern Optimization
      • Strided pattern
        – Directly supported by API
      • Discrete point pattern
        – No optimization
      • Column pattern
        – Fixed-size column read
        – Contiguous column read
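A rough way to picture the two column-read strategies is many small fixed-size requests versus one large contiguous request for the same columns. The sketch below imitates that contrast with h5py; it uses hypothetical names and is an analogy, not SciMATE's implementation.

```python
# Analogy for the two column-read strategies, using hypothetical names.
import h5py
import numpy as np

def fixed_size_column_read(dset, cols, chunk_rows=1024):
    """Read the requested columns in fixed-size row chunks (many small reads)."""
    parts = []
    for start in range(0, dset.shape[0], chunk_rows):
        parts.append(dset[start:start + chunk_rows, cols])
    return np.vstack(parts)

def contiguous_column_read(dset, first_col, last_col):
    """Read a contiguous range of columns in one large request."""
    return dset[:, first_col:last_col + 1]

with h5py.File("simulation.h5", "r") as f:
    dset = f["/fields/temperature"]
    a = fixed_size_column_read(dset, [0, 1, 2, 3])
    b = contiguous_column_read(dset, 0, 3)
    print(np.array_equal(a, b))   # both strategies return the same data
```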
  12. Evaluation
      • System functionality and scalability
        – 16 GB datasets
        – Data processing times
          o k-means, PCA, kNN
          o Thread scalability, node scalability
        – Data loading times
          o k-means, PCA
          o Node scalability
      • Partial read vs. full read
      • Fixed-size column read vs. contiguous column read
  13. Thread scalability
  14. Node scalability (data processing)
  15. Node scalability (data loading)
  16. Fixed-size column read vs. contiguous column read (NetCDF and HDF5)
  17. Contiguous column read
      • NetCDF shows better column non-contiguity tolerance than HDF5.
  18. Conclusion and Future Work
      • Conclusion
        – Avoid bulk data transfers and extensive data transformation
        – Provide a customizable data format adaptation API
        – Support optimized reads via access strategies & patterns
      • Future Work
        – Compare with SciHadoop
