Cloud-Optimized HDF5 Files
2023 ESIP Summer Meeting

Aleksandar Jelenak
NASA EED-3/HDF Group
ajelenak@hdfgroup.org

This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001. This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
Cloud Optimized HDF5* File
An HDF5 file whose internal structures are arranged for more efficient data access when the file is in a cloud object store.
*Hierarchical Data Format Version 5
Cloud Optimized Means…
• One file = one cloud store object.
• Read-only data access.
• Data reading is based on HTTP* range GET requests with specific file offsets and byte counts.
*Hypertext Transfer Protocol
Why Cloud-Optimized HDF5 Files?
• The least amount of reformatting when moving data from archive tapes to cloud object stores.
• The HDF5 library can be used instead of custom HDF5 file format readers with limited capabilities.
• Fast content discovery once files are in an object store.
• Efficient data access for both cloud-native and conventional applications.
What Makes a Cloud-Optimized HDF5 File?
• Large dataset chunk sizes (1–4 MiB*).
• Minimal use of variable-length datatypes.
• Consolidated internal file metadata.
*mebibyte (1,048,576 bytes)
Large Dataset Chunk Sizes
• Chunk size is the product of the dataset’s datatype size in bytes and the chunk’s shape (number of elements in each dimension).
• An AWS* best-practice document recommends 8–16 MiB per HTTP range GET request.
*Amazon Web Services
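The chunk-size arithmetic above can be sketched in a few lines; the datatype and chunk shape here are made up for illustration, not a recommendation:

```python
from math import prod

# Hypothetical chunked dataset: 4-byte (float32) elements,
# chunk shape of 256 x 1024 elements.
dtype_size = 4                 # bytes per element
chunk_shape = (256, 1024)      # elements in each dimension

# Chunk size in bytes = datatype size x product of the chunk shape.
chunk_nbytes = dtype_size * prod(chunk_shape)
chunk_mib = chunk_nbytes / 2**20

print(f"{chunk_nbytes} bytes = {chunk_mib} MiB")  # 1048576 bytes = 1.0 MiB
```

This chunk lands at the low end of the 1–4 MiB target; growing either chunk dimension scales the byte size proportionally.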
Large Dataset Chunk Sizes (cont.)
• Chunk shape still needs to account for likely data access patterns.
• The default HDF5 library dataset chunk cache is only 1 MiB, but it is configurable per dataset.
• Chunk cache size can have a significant impact on I/O performance.
• There is a trade-off between chunk size and compression/decompression speed.
How to Set Chunk Cache Size?
• HDF5 library
– H5Pset_cache() for all datasets in a file
– H5Pset_chunk_cache() for individual datasets
• h5py
– h5py.File class for all datasets in a file
– h5py.Group.create_dataset() or .require_dataset() method for individual datasets
• netCDF* library
– nc_set_chunk_cache() for all variables in a file
– nc_set_var_chunk_cache() for individual variables
*Network Common Data Form
Variable-Length Datatypes
• The current implementation of variable-length (vlen) data in HDF5 files prevents easy retrieval using HTTP range GET requests.
• Alternative access methods require either duplicating vlen data outside its file or a custom HDF5 file format reader.
• Minimize use of these datatypes in HDF5 datasets if not using the HDF5 library for cloud data access.
Consolidated Internal File Metadata
• Only one data read is needed to learn about a file’s content (what is in it, and where it is).
• Important for all use cases that require knowledge of the file’s content prior to reading any of the file’s data.
• By default, the HDF5 library spreads internal file metadata in small blocks throughout the file.
How to Consolidate Internal File Metadata?
1. Create files with the Paged Aggregation file space management strategy.
2. Create files with an increased file metadata block size.
3. Store the file’s content information in its User Block.
HDF5 Paged Aggregation
• One of the available file space management strategies; not the default.
• Can only be set at file creation.
• Best suited when file content is added once and never modified.
• The HDF5 library will read and write data in file pages.
HDF5 Paged Aggregation (cont’d)
• Internal file metadata and raw data are organized in separate pages of a specified size.
• Setting an appropriate page size can place all internal file metadata in just one page.
• Only page-aggregated files can use the library’s page buffer cache, which can significantly reduce subsequent data access requests.
How to Create Paged Aggregated Files
• HDF5 library:
H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, …)
H5Pset_file_space_page_size(fcpl, page_size)
(fcpl: file creation property list)
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
How to Apply Paged Aggregation to Existing Files
$ h5repack -S PAGE -G PAGE_SIZE_BYTES in.h5 out.h5
File Metadata Block Size
• Applies to non-page-aggregated files.
• Internal metadata is stored in metadata blocks in a file; the default block size is 2048 bytes.
• Setting a bigger block size at file creation can combine internal metadata into a single contiguous block.
• Recommendation: make the block size big enough to hold the entire internal file metadata.
How to Create Files with Larger Metadata Block
• HDF5 library:
H5Pset_meta_block_size(fapl, block_size)
(fapl: file access property list)
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
How to Find Internal File Metadata Size?
$ h5stat -S ATL03_20190928165055_00270510_004_01.h5
Filename: ATL03_20190928165055_00270510_004_01.h5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 7713028 bytes
  Raw data: 2458294886 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 91670 bytes
  Total space: 2466099584 bytes
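Given the h5stat report above, a page size large enough to hold all internal file metadata in one page can be derived directly; rounding up to a power of two is an assumption here (a convenient convention, not an HDF5 requirement):

```python
# File metadata size reported by h5stat for the example file above.
file_metadata_bytes = 7_713_028

# Smallest power-of-two page that holds all internal file metadata.
page_size = 1
while page_size < file_metadata_bytes:
    page_size *= 2

print(page_size, page_size / 2**20)  # 8388608 -> 8.0 MiB
```

So repacking this file with an 8 MiB page size would let a single range GET fetch the entire internal file metadata.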
File Content Info in User Block
• The user block is a block of bytes at the beginning of an HDF5 file that the library skips, so any user application content can be stored there.
• Extracted file content information can be stored in the user block so it is readily available later with a single data read request.
• This new info stays with the file, so it is still one cloud store object.
How to Create File with User Block
• HDF5 library
– H5Pset_userblock()
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
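A sketch of the h5py route (`userblock_size` maps to H5Pset_userblock; file name and user block content are illustrative):

```python
import h5py

# Create a file with a 512-byte user block; the size must be a power of
# two and at least 512 bytes.
f = h5py.File("with_ublock.h5", "w", userblock_size=512)
f.create_dataset("x", data=range(10))
f.close()

# The user block is ordinary bytes at the start of the file, writable
# without the HDF5 library (a DMR++ document could go here).
with open("with_ublock.h5", "r+b") as raw:
    raw.write(b"file content description goes here")

# HDF5 access still works: the library skips the user block.
with h5py.File("with_ublock.h5", "r") as g:
    ub_size = g.userblock_size
    last = g["x"][-1]
```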
How to Add User Block to Existing Files
$ h5repack --block=SIZE_BYTES --ublock=user_block.file in.h5 out.h5
or
$ h5jam -i in.h5 -u user_block.file -o out.h5
IMPORTANT: If dataset chunk file locations are of interest, the user block content must be generated after the user block has been added. The current HDF5 functions for chunk file locations have a bug: the reported offsets do not account for the user block, so the user block size must be added to them.
Wrap-Up
• The HDF5 library can create cloud-optimized HDF5 files if instructed.
• Larger dataset chunk sizes are the most important data access optimization.
• Consolidating internal metadata is required for discovering file content when files are already in an object store.
• The two optimizations are independent of each other.
Wrap-Up (cont’d)
• Page-aggregated files are recommended if the HDF5 library will also be used for cloud data access.
• DMR++ is recommended for the file content description stored in the file’s user block.
• Cloud-optimize HDF5 files prior to transfer to an object store. Best is to create the files cloud-optimized from the start and avoid any post-optimization.
Thank you!