Cloud-Optimized HDF5 Files
2023 ESIP Summer Meeting
This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001.
This document does not contain technology or Technical Data controlled under either the U.S. International Traffic
in Arms Regulations or the U.S. Export Administration Regulations.
Aleksandar Jelenak
NASA EED-3/HDF Group
ajelenak@hdfgroup.org
2
An HDF5 file whose internal structures are
arranged for more efficient data access
when the file resides in a cloud object store.
Cloud Optimized HDF5* File
*Hierarchical Data Format Version 5
3
• One file = one cloud store object.
• Read-only data access.
• Data reading is based on HTTP* range GET
requests with specific file offsets and
byte counts.
Cloud Optimized Means…
*Hypertext Transfer Protocol
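As a sketch of what such a request looks like, this hypothetical helper builds the Range header a reader would send for one read; the offset and size values are made up for illustration:

```python
def range_header(offset, nbytes):
    """Build an HTTP Range header for `nbytes` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": f"bytes={offset}-{offset + nbytes - 1}"}

# e.g. fetch a 1 MiB chunk that starts 2048 bytes into the object:
print(range_header(2048, 1024 * 1024)["Range"])  # bytes=2048-1050623
```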
4
• Least amount of reformatting from
archive tapes to cloud object stores.
• HDF5 library instead of custom HDF5 file
format readers with limited capabilities.
• Fast content discovery when files are in
object store.
• Efficient data access for both cloud-
native and conventional applications.
Why Cloud-Optimized HDF5 Files?
5
• Large dataset chunk size (1-4 MiB*).
• Minimal use of variable-length datatypes.
• Consolidated internal file metadata.
What Makes a Cloud-Optimized
HDF5 File?
*mebibyte (1,048,576 bytes)
6
• Chunk size is a product of the dataset’s
datatype size in bytes and the chunk’s
shape (number of elements for each
dimension).
• An AWS* best-practice document
recommends that one HTTP range GET
request fetch 8–16 MiB.
Large Dataset Chunk Sizes
*Amazon Web Services
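The chunk-size arithmetic above can be checked with a few lines of plain Python; the datatype and chunk shape here are illustrative:

```python
from math import prod

def chunk_nbytes(itemsize, chunk_shape):
    """Chunk size in bytes = datatype size × number of elements per chunk."""
    return itemsize * prod(chunk_shape)

# A 4-byte (float32) dataset chunked as 512 × 512 elements gives 1 MiB
# chunks, at the low end of the 1–4 MiB target:
print(chunk_nbytes(4, (512, 512)) / 2**20)  # 1.0
```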
7
• Chunk shape still needs to account for
likely data access patterns.
• The default HDF5 library dataset chunk
cache is only 1 MiB but is configurable
per dataset.
• Chunk cache size can have significant
impact on I/O performance.
• Trade-off between chunk size and
compression/decompression speed.
Large Dataset Chunk Sizes (cont.)
8
• HDF5 library
– H5Pset_cache() for all datasets in a file
– H5Pset_chunk_cache() for individual datasets
• h5py
– h5py.File class for all datasets in a file
– h5py.Group.create_dataset() or
.require_dataset() method for individual datasets
• netCDF* library
– nc_set_chunk_cache() for all variables in a file
– nc_set_var_chunk_cache() for individual variables
How to Set Chunk Cache Size?
*Network Common Data Form
9
• Current implementation of variable-length
(vlen) data in HDF5 files prevents easy
retrieval using HTTP range GET requests.
• Alternative access methods require
duplicating vlen data outside the file or
using a custom HDF5 file format reader.
• Minimize use of these datatypes in HDF5
datasets if not using the HDF5 library for
cloud data access.
Variable-Length Datatypes
10
• Only one data read is needed to learn a
file’s content (what’s in it, and where it is).
• Important for all use cases that require
knowledge of the file’s content prior to
reading any of the file’s data.
• The HDF5 library by default spreads
internal file metadata in small blocks
throughout the file.
Consolidated Internal File Metadata
11
1. Create files with Paged Aggregation file
space management strategy.
2. Create files with increased file metadata
block.
3. Store the file’s content information in its
user block.
How to Consolidate Internal File
Metadata?
12
• One of the available file space management
strategies; not the default.
• Can only be set at file creation.
• Best suited when file content is added once
and never modified.
• The HDF5 library reads and writes data in
whole file pages.
HDF5 Paged Aggregation
13
• Internal file metadata and raw data are
organized in separate pages of specified
size.
• Setting an appropriate page size can fit
all internal file metadata in just one page.
• Only page-aggregated files can use the library’s
page buffer cache, which can significantly
reduce subsequent data access time.
HDF5 Paged Aggregation (cont’d)
14
• HDF5 library:
H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE,…)
H5Pset_file_space_page_size(fcpl, page_size)
fcpl: file creation property list
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
How to Create Paged Aggregated
Files
15
$ h5repack -S PAGE -G PAGE_SIZE_BYTES in.h5 out.h5
How to Apply Paged Aggregation to
Existing Files
16
• Applies to non-page aggregated files.
• Internal metadata stored in metadata
blocks in a file. Default size is 2048 bytes.
• Setting a bigger block at file creation can
combine internal metadata into a single
contiguous block.
• Make the block size big enough to hold
the entire internal file metadata.
File Metadata Block Size
17
• HDF5 library:
H5Pset_meta_block_size(fapl, block_size,…)
fapl: file access property list
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
How to Create Files with Larger
Metadata Block
18
$ h5repack --metadata_block_size SIZE_BYTES in.h5 out.h5
How to Increase Metadata Block for
Existing Files
19
How to Find Internal File Metadata
Size?
$ h5stat -S ATL03_20190928165055_00270510_004_01.h5
Filename: ATL03_20190928165055_00270510_004_01.h5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 7713028 bytes
Raw data: 2458294886 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 91670 bytes
Total space: 2466099584 bytes
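For the file above, a suitable page size can be derived from the reported metadata total (7,713,028 bytes), here rounded up to a power of two, which is one common choice:

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

# File metadata is 7,713,028 bytes, so an 8 MiB page would hold all of it:
print(next_pow2(7713028))  # 8388608
```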
20
• The user block is a block of bytes at the
beginning of an HDF5 file that the library
skips, so it can hold any user application
content.
• Extracted file content information can be
stored in the user block to be readily
available later with a single data read
request.
• This new info stays with the file – still one
cloud store object.
File Content Info in User Block
21
• HDF5 library
– H5Pset_userblock()
• h5py
– h5py.File class
• netCDF library
– Not supported yet.
How to Create File with User Block
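A minimal h5py sketch (the `userblock_size` keyword is per the h5py docs and must be a power of two, at least 512 bytes; the file name and user block content are illustrative):

```python
import h5py

# Create a file with a 1 KiB user block at its start.
with h5py.File("ub.h5", "w", userblock_size=1024) as f:
    f.create_dataset("data", data=list(range(10)))

# The library skips the first 1024 bytes, so ordinary file I/O can store
# a file-content description (e.g. DMR++) there:
with open("ub.h5", "r+b") as f:
    f.write(b"file content description goes here")
```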
22
$ h5repack --block=SIZE_BYTES --ublock=user_block.file in.h5 out.h5
or
$ h5jam -i in.h5 -u user_block.file -o out.h5
IMPORTANT: If dataset chunk file locations are
wanted in the user block content, that content
must be generated after the user block has been
added. Current HDF5 functions for chunk file
locations have a bug and do not account for the
user block, so its size must be added to the
reported offsets.
How to Add User Block to
Existing Files
23
• HDF5 library can create cloud-optimized
HDF5 files if instructed.
• Larger dataset chunk sizes are the most
important data access optimization.
• Combining internal metadata is required
if discovering file content when files are
already in object store.
• These two optimizations are independent of
each other.
Wrap-Up
24
• Page aggregated files are recommended if
HDF5 library will also be used for cloud data
access.
• DMR++ is recommended for file content
description stored in file user block.
• Cloud-optimize HDF5 files before transferring
them to an object store. Best of all, create the
files cloud-optimized and avoid any
post-hoc optimization.
Wrap-Up (cont’d)
25
This work was supported by NASA/GSFC under
Raytheon Technologies contract number
80GSFC21CA001.
Thank you!
