In this tutorial we discuss different storage methods for HDF5 files (split files, families of files, multi-files) and for datasets (compressed, external, compact), along with the related filters and properties. The tutorial introduces advanced features of HDF5, including:
o Property lists
o Compound datatypes
o Hyperslab selections
o Point selections
o References to objects and regions
o Extendable datasets
o Mounting files
o Group iterations
1. HDF5 Advanced Topics
Object’s Properties
Storage Methods and Filters
Datatypes
HDF and HDF-EOS Workshop VIII
October 26, 2004
2. Topics
General Introduction to HDF5 properties
HDF5 Dataset properties
I/O and Storage Properties (filters)
HDF5 File properties
I/O and Storage Properties (drivers)
Datatypes
Compound
Variable Length
Reference to object and dataset region
4. Properties
Definition
• Mechanism to control different features of the
HDF5 objects
– Implemented via H5P Interface (‘Property lists’)
– HDF5 Library sets objects’ default features
– HDF5 ‘Property lists’ modify default features
• At object creation time (creation properties)
• At object access time (access or transfer properties)
5. Properties
Definitions
• A property list is a list of name-value pairs
– Values may be of any datatype
• A property list is passed as an optional parameter
to the HDF5 APIs
• Property lists are used/ignored by all the layers of
the library, as needed
6. Type of Properties
• Predefined and user-defined property lists
• Predefined:
– File creation
– File access
– Dataset creation
– Dataset access
• Will cover each of these
7. Properties (Example)
HDF5 File
• H5Fcreate(…,creation_prop_id,…)
• Creation properties (how file is created?)
– Library’s defaults
• no user’s block
• predefined sizes of offsets and addresses of the objects in the
file (64-bit for DEC Alpha, 32-bit on Windows)
– User’s settings
• User’s block
• 32-bit sizes on 64-bit platform
• Control over B-trees for chunking storage (split factor)
8. Properties (Example)
HDF5 File
• H5Fcreate(…,access_prop_id)
• Access properties or drivers (How is file
accessed? What is the physical layout on the
disk?)
– Library defaults
• STDIO Library (UNIX fwrite, fread)
– User defined
• MPI I/O for parallel access
• Family of files (a 100 GB HDF5 file represented by fifty 2 GB UNIX
files)
• Size of the chunk cache
9. Properties (Example)
HDF5 Dataset
• H5Dcreate(…,creation_prop_id)
• Creation properties (how dataset is created)
– Library’s defaults
• Storage: Contiguous
• Compression: None
• Space is allocated when data is first written
• No fill value is written
– User’s settings
• Storage: Compact, or chunked, or external
• Compression
• Fill value
• Control over space allocation in the file for raw data
– at creation time
– at write time
10. Properties (Example)
HDF5 Dataset
• H5Dwrite / H5Dread(…,access_prop_id)
• Access (transfer) properties
– Library defaults
• 1MB conversion buffer
• Error detection on read (if it was set during write)
• MPI independent I/O for parallel access
– User defined
• MPI collective I/O for parallel access
• Size of the datatype conversion buffer
• Control over partial I/O to improve performance
11. Properties
Programming model
• Use predefined property type
– H5P_FILE_CREATE
– H5P_FILE_ACCESS
– H5P_DATASET_CREATE
– H5P_DATASET_ACCESS
• Create new property instance
– H5Pcreate
– H5Pcopy
– H5*get_access_plist; H5*get_create_plist
• Modify property (see H5P APIs)
• Use property to modify object feature
• Close property when done
– H5Pclose
12. Properties
Programming model
• General model of usage: get plist, set values, pass
to library
hid_t plist = H5Pcreate(copy)(predefined_plist);
OR
hid_t plist = H5Xget_create(access)_plist(…);
H5Pset_foo( plist, vals);
H5Xdo_something( Xid, …, plist);
H5Pclose(plist);
14. Dataset Creation Properties
• Storage
– Contiguous (default)
– Compact
– Chunked
– External
• Filters applied to raw data
– Compression
– Checksum
• Fill value
• Space allocation for raw data in the file
15. Dataset Creation Properties
Storage Layouts
• Storage layout is important for I/O performance
and size of the HDF5 files
• Contiguous (default)
– Used when data will be written/read at once
– H5Dcreate(…,H5P_DEFAULT)
• Compact
– Used for small datasets (on the order of bytes) for better I/O
– Raw data is written/read at the time when the dataset is opened
– File is less fragmented
– To create a compact dataset follow the ‘Properties
programming model’
16. Creating Compact Dataset
• Create a dataset creation property list
• Set property list to use compact storage layout
• Create dataset with the above property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_layout(plist, H5D_COMPACT);
dset_id = H5Dcreate (…, "Compact",…, plist);
H5Pclose(plist);
17. Creating chunked Dataset
• Chunked layout is needed for
– Extendible datasets
– Compression and other filters
– Better partial I/O for big datasets
[Figure: a chunked dataset gives better subsetting access time and is extendible; for the selection shown, only two chunks will be written/read]
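The point that a partial read or write only touches the chunks it overlaps can be sketched in plain C. This is an illustrative calculation, not part of the HDF5 API: the function names `chunks_touched_1d` and `chunks_touched` are hypothetical, and the 8x8 dataset with 4x4 chunks in the usage note is an assumed example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks that the interval [start, start + count) overlaps along
 * one dimension when that dimension is tiled with chunks of chunk_len. */
size_t chunks_touched_1d(size_t start, size_t count, size_t chunk_len)
{
    assert(count > 0 && chunk_len > 0);
    size_t first = start / chunk_len;               /* index of first chunk */
    size_t last = (start + count - 1) / chunk_len;  /* index of last chunk  */
    return last - first + 1;
}

/* Chunks overlapped by a rank-dimensional hyperslab: the product of the
 * per-dimension counts. */
size_t chunks_touched(int rank, const size_t *start, const size_t *count,
                      const size_t *chunk_dims)
{
    size_t total = 1;
    for (int i = 0; i < rank; i++)
        total *= chunks_touched_1d(start[i], count[i], chunk_dims[i]);
    return total;
}
```

For an assumed 8x8 dataset stored in 4x4 chunks, selecting one full row (start {2, 0}, count {1, 8}) overlaps two chunks, so only those two would need to be transferred.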
18. Creating Chunked Dataset
• Create a dataset creation property list
• Set property list to use chunked storage layout
• Create dataset with the above property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, rank, ch_dims);
dset_id = H5Dcreate (…, "Chunked",…, plist);
H5Pclose(plist);
19. Dataset Creation Properties
Compression and other I/O Pipeline Filters
• HDF5 provides a mechanism (“I/O filters”) to
manipulate data while transferring it between
memory and disk
• H5Z and H5P interfaces
• HDF5 predefined filters (H5P interface)
– Compression (gzip, szip)
– Shuffling and checksum filters
• User defined filters (H5Z and H5P interfaces)
– Example: Bzip2 compression
http://hdf.ncsa.uiuc.edu/HDF5/papers/bzip2
20. Compression and other I/O Pipeline Filters
(continued)
• Currently used only with chunked datasets
• Filters can be combined together
– GZIP + shuffle + checksum filters
– Checksum filter + user-defined encryption filter
• Filters are called in the order they are defined on
writing and in the reverse order on reading
• User is responsible for “filters pipeline sanity”
– GZIP + SZIP + shuffle doesn’t make sense
– Shuffle + SZIP does
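The ordering rule (filters run in definition order on write and in reverse order on read) can be illustrated with a toy pipeline in plain C. This is a sketch only: `filter_fn`, `add_one`, `xor_mask`, `pipeline_write`, and `pipeline_read` are hypothetical names, and the two byte transforms stand in for real filters such as shuffle or deflate.

```c
#include <stddef.h>

/* A toy filter transforms buf in place. forward != 0 is the write direction;
 * forward == 0 applies the inverse transform for reading. */
typedef void (*filter_fn)(unsigned char *buf, size_t n, int forward);

/* Toy filter 1: add 1 to every byte on write, subtract 1 on read. */
void add_one(unsigned char *buf, size_t n, int forward)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(forward ? buf[i] + 1 : buf[i] - 1);
}

/* Toy filter 2: XOR every byte with a mask; XOR is its own inverse. */
void xor_mask(unsigned char *buf, size_t n, int forward)
{
    (void)forward;
    for (size_t i = 0; i < n; i++)
        buf[i] ^= 0x5A;
}

/* On write, apply the filters in the order they were defined. */
void pipeline_write(const filter_fn *filters, size_t nfilters,
                    unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < nfilters; i++)
        filters[i](buf, n, 1);
}

/* On read, undo the filters in the reverse order. */
void pipeline_read(const filter_fn *filters, size_t nfilters,
                   unsigned char *buf, size_t n)
{
    for (size_t i = nfilters; i > 0; i--)
        filters[i - 1](buf, n, 0);
}
```

Because the two toy transforms do not commute, undoing them in the wrong order would not restore the original bytes, which is why the read path walks the pipeline backwards.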
21. Creating compressed Dataset
• Compression
– Improves transmission speed
– Improves storage efficiency
– Requires chunking
– May increase CPU time needed for compression
[Figure: data in Memory is written Compressed to the File]
22. Creating compressed datasets
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Set compression method
• Create dataset with the above property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk (plist, ndims, chkdims);
H5Pset_deflate (plist, level);                 /* GZIP */
OR
H5Pset_szip (plist, options_mask, numpixels);  /* SZIP */
dset_id = H5Dcreate (file_id, "comp-data",
                     H5T_NATIVE_FLOAT, space_id, plist);
23. Creating external Dataset
• Dataset’s raw data is stored in an external file
• Easy to include existing data into HDF5 file
• Easy to export raw data if application needs it
• Disadvantage: user has to keep track of additional files
to preserve integrity of the HDF5 file
[Figure: the HDF5 file holds the metadata for dataset “A”; the raw data for “A” can be stored in an external file]
24. Creating External Dataset
• Create a dataset creation property list
• Set property list to use external storage layout
• Create dataset with the above property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_external(plist, "raw_data.ext", offset, size);
dset_id = H5Dcreate (…, "External",…, plist);
H5Pclose(plist);
25. Example of External Files
This example shows how a contiguous, one-dimensional
dataset is partitioned into three parts and each of those
parts is stored in a segment of an external file.
plist = H5Pcreate (H5P_DATASET_CREATE);
H5Pset_external (plist, "raw.data", 3000, 1000);
H5Pset_external (plist, "raw.data", 0, 2500);
H5Pset_external (plist, "raw.data", 4500, 1500);
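To see how the three segments above line up with the dataset's raw data, the following plain C sketch maps a byte position in the dataset to a segment and a byte offset in the external file. It assumes the segments are consumed in the order they were added; `ext_segment` and `ext_locate` are hypothetical names, not HDF5 API.

```c
#include <stddef.h>

/* One external segment: `size` bytes starting at byte `offset` of the
 * external file (the offset/size pair given in each call above). */
typedef struct {
    long offset;
    unsigned long size;
} ext_segment;

/* Map byte position `pos` of the dataset's raw data to a segment index and
 * a byte offset in the external file. Returns 0 on success, or -1 if `pos`
 * lies beyond the combined size of all segments. */
int ext_locate(const ext_segment *segs, size_t nsegs, unsigned long pos,
               size_t *seg_index, long *file_offset)
{
    for (size_t i = 0; i < nsegs; i++) {
        if (pos < segs[i].size) {
            *seg_index = i;
            *file_offset = segs[i].offset + (long)pos;
            return 0;
        }
        pos -= segs[i].size;  /* skip this segment and continue */
    }
    return -1;
}
```

With the segments from the example, dataset bytes 0 to 999 map to file offsets 3000 to 3999, bytes 1000 to 3499 map to offsets 0 to 2499, and bytes 3500 to 4999 map to offsets 4500 to 5999.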
26. Checksum Filter
• HDF5 includes the Fletcher32 checksum algorithm
for error detection.
• It is automatically included in HDF5
• To use this filter you must add it to the filter pipeline
with H5Pset_filter.
[Figure: data in Memory is stored together with its Checksum value]
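For readers unfamiliar with the algorithm, here is a textbook Fletcher-32 in plain C. It is a sketch of the general technique under one common convention (16-bit little-endian words, odd trailing byte padded with zero, sums taken modulo 65535); it is not taken from the HDF5 source and may not produce byte-identical values to the library's filter. The name `fletcher32` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Textbook Fletcher-32: two running sums over 16-bit words, each reduced
 * modulo 65535. Bytes are paired little-endian; an odd final byte is
 * treated as if followed by a zero byte. */
uint32_t fletcher32(const unsigned char *data, size_t len)
{
    uint32_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i += 2) {
        uint32_t word = data[i];
        if (i + 1 < len)
            word |= (uint32_t)data[i + 1] << 8;
        sum1 = (sum1 + word) % 65535u;  /* simple sum of the words     */
        sum2 = (sum2 + sum1) % 65535u;  /* sum of the running sums     */
    }
    return (sum2 << 16) | sum1;
}
```

A reader recomputes the checksum over the data it received and compares it with the stored value; a mismatch signals corruption.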
27. Enabling Checksum Filter
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Add the filter to the pipeline
• Create your dataset specifying this property list
• Close property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk (plist, ndims, chkdims);
H5Pset_filter (plist, H5Z_FILTER_FLETCHER32, 0, 0, NULL);
H5Dcreate (…, "Checksum",…, plist);
H5Pclose(plist);
28. Shuffling filter
• Predefined HDF5 filter
• Not a compression; change of byte order in a
stream of data
• Example
– 1 23 43
• Hexadecimal form
– 0x01 0x17 0x2B
• Big-endian machine
– 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x2B
• Shuffling
– 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x17 0x2B
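The byte rearrangement in this example can be reproduced with a few lines of plain C. This is a sketch of the general shuffle idea (gather byte 0 of every element, then byte 1, and so on), not the HDF5 filter's source; `shuffle_bytes` and `unshuffle_bytes` are hypothetical names.

```c
#include <stddef.h>

/* Shuffle: write byte 0 of every element, then byte 1 of every element,
 * and so on. src and dst each hold nelem * elsize bytes and must not
 * overlap. */
void shuffle_bytes(const unsigned char *src, unsigned char *dst,
                   size_t nelem, size_t elsize)
{
    for (size_t b = 0; b < elsize; b++)
        for (size_t e = 0; e < nelem; e++)
            dst[b * nelem + e] = src[e * elsize + b];
}

/* Inverse of shuffle_bytes: restore the original element order. */
void unshuffle_bytes(const unsigned char *src, unsigned char *dst,
                     size_t nelem, size_t elsize)
{
    for (size_t b = 0; b < elsize; b++)
        for (size_t e = 0; e < nelem; e++)
            dst[e * elsize + b] = src[b * nelem + e];
}
```

Applied to the three big-endian 4-byte integers 1, 23, 43, the shuffle groups the nine zero bytes together, which is what lets a later compression filter work more effectively.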
30. Enabling Shuffling Filter
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Add the filter to the pipeline
• Define compression filter
• Create your dataset specifying this property list
• Close property list
plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk (plist, ndims, chkdims);
H5Pset_shuffle(plist);
H5Pset_deflate(plist, level);
H5Dcreate (…, "BetterComp",…, plist);
H5Pclose(plist);
31. Effect of data shuffling (H5Pset_shuffle + H5Pset_deflate)
• Write 4-byte integer dataset 256x256x1024 (256MB)
• Using chunks of 256x16x1024 (16MB)
• Values: random integers between 0 and 255
             File size   Total time   Write time
No Shuffle   102.9MB     671.049      629.45
Shuffle      67.34MB     83.353       78.268
Compression combined with shuffling provides
• Better compression ratio
• Better I/O performance
33. Dataset Access/Transfer Properties
• Improve performance
• H5Pset_buffer
– Sets the size of the datatype conversion buffer during
I/O
– Size should be large enough to hold the slice along the
slowest changing dimension
– Example: Hyperslab 100x200x300, buffer 200x300
• H5Pset_hyper_vector_size
– Sets the number of hyperslab offset and length pairs
– Improves performance for partial I/O
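The sizing rule for the conversion buffer (large enough for one slice along the slowest-changing dimension) is simple arithmetic, sketched below in plain C. The helper name `slice_buffer_bytes` is hypothetical, and the 4-byte element size in the usage note is an assumption, since the slide gives only the dimensions.

```c
#include <stddef.h>

/* Bytes needed to hold one slice along the slowest-changing (first)
 * dimension: the product of all remaining dimensions times the element
 * size. */
size_t slice_buffer_bytes(int rank, const size_t *dims, size_t elem_size)
{
    size_t bytes = elem_size;
    for (int i = 1; i < rank; i++)
        bytes *= dims[i];
    return bytes;
}
```

For the 100x200x300 hyperslab in the example and an assumed 4-byte element, one 200x300 slice needs 240000 bytes; that value is the kind of size one would pass to H5Pset_buffer.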
34. Dataset Access/Transfer Properties
• H5Pset_edc_check
– For datasets created with error detection filter enabled
– Enables error checking during read operation
– H5Z_ENABLE_EDC (default)
– H5Z_DISABLE_EDC
• H5Pset_dxpl_mpio
– Sets data transfer mode for parallel I/O
– H5FD_MPIO_INDEPENDENT (default)
– H5FD_MPIO_COLLECTIVE
36. Standard Interface for User-defined Filters
• H5Zregister : Register filter so that HDF5
knows about it
• H5Zunregister: Unregister a filter
• H5Pset_filter: Adds a filter to the filter pipeline
• H5Pget_filter: Returns information about a filter
in the pipeline
• H5Zfilter_avail: Check if filter is available
38. File Creation Properties
• H5Pset_userblock
– User block stores user-defined information (e.g., ASCII text to
describe a file) at the beginning of the file
– cat my.txt hdf5.h5 > myhdf5.h5
– Sets the size of the user block
– 512 bytes, 1024 bytes, 2^N
• H5Pset_sizes
– Sets the byte size of the offsets and lengths used to address objects
in the file
• H5Pset_sym_k
– Controls the rank of B-trees for groups
– Default is 16
• H5Pset_istore_k
– Controls the rank of B-trees for chunked datasets
– Default is 32
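The user block size rule listed under H5Pset_userblock (512 bytes, 1024 bytes, 2^N) can be checked before calling the library. The sketch below is plain C with a hypothetical helper name; treating 0 (no user block) as valid is an assumption based on the library default of having no user block.

```c
/* Returns 1 if `size` is an acceptable user block size under the rule
 * above: 0 (no user block, assumed valid) or a power of two that is at
 * least 512. Returns 0 otherwise. */
int is_valid_userblock_size(unsigned long long size)
{
    if (size == 0)
        return 1;
    if (size < 512)
        return 0;
    return (size & (size - 1)) == 0;  /* exactly one bit set */
}
```

A text header prepared for the `cat` approach shown above would need to be padded to one of these sizes so that the HDF5 data begins on the expected boundary.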
40. File Access Properties (Performance)
• H5Pset_cache
– Sets metadata cache and raw data chunk cache parameters
– Improper size will degrade performance
• H5Pset_meta_block_size
– Reduces the number of small objects in the file
– Block of metadata is written in a single I/O operation (default 2K)
– VFL driver has to set H5FD_AGGREGATE_METADATA
• H5Pset_sieve_buffer
– Improves partial I/O
• VFL layer: file drivers
41. File Access Properties (Physical storage and Usage of Low-level I/O Libraries)
• VFL layer: file drivers
• Define physical storage of the HDF5 file
– Memory driver (HDF5 file in the application’s memory)
– Stream driver (HDF5 file written to a socket)
– Split (multi) files driver
– Family driver
• Define low level I/O library
– MPI I/O driver for parallel access
– STDIO vs. SEC2
42. Files needn’t be files - Virtual File Layer
VFL: A public API for writing I/O drivers
[Figure: an hid_t “file” handle sits on top of the VFL (Virtual File I/O Layer); the I/O drivers (stdio, mpio, split, family, SRB, memory, network) connect it to “storage”: files, an SRB repository, memory, or the network]
43. Split Files
• Allows you to split metadata and data into separate files
• The two files may reside on different file systems for better I/O
• Disadvantage: User has to keep track of the files
[Figure: an HDF5 file split into a metadata file (Dataset “A”, Dataset “B”) and a raw data file (Data A, Data B)]
44. Creating Split Files
• Create a file access property list
• Set up file access property list to use split files
• Create the file with this property list
• Close the property list
plist = H5Pcreate (H5P_FILE_ACCESS);
H5Pset_fapl_split (plist, ".met", H5P_DEFAULT, ".dat", H5P_DEFAULT);
file = H5Fcreate (H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist);
H5Pclose(plist);
45. File Families
• Allows you to access files larger than 2GB on
file systems that don't support large files
• Any HDF5 file can be split into a family of files
and vice versa
• A family member size must be a power of two
46. Creating a File Family
• Create a file access property list
• Set up file access property list to use file
family
• Create the file with this property list
plist = H5Pcreate (H5P_FILE_ACCESS);
H5Pset_fapl_family (plist, family_size, H5P_DEFAULT);
file = H5Fcreate (H5FILE_NAME, H5F_ACC_TRUNC,
H5P_DEFAULT,
plist);
H5Pclose(plist);
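How a logical file address is spread across family members is plain integer arithmetic, sketched here in C. The type and function names (`family_pos`, `family_locate`, `family_member_count`) are hypothetical and only illustrate the idea; they are not HDF5 API.

```c
/* Position of a logical file address inside a file family. */
typedef struct {
    unsigned long long member;  /* index of the member file             */
    unsigned long long offset;  /* byte offset inside that member file  */
} family_pos;

/* Split a logical address into (member index, offset within member). */
family_pos family_locate(unsigned long long addr,
                         unsigned long long member_size)
{
    family_pos pos;
    pos.member = addr / member_size;
    pos.offset = addr % member_size;
    return pos;
}

/* Number of member files needed to hold total_size bytes (rounded up). */
unsigned long long family_member_count(unsigned long long total_size,
                                       unsigned long long member_size)
{
    return (total_size + member_size - 1) / member_size;
}
```

With a member size of 2 GB, a 100 GB logical file needs 50 members, matching the family-of-files example given earlier for file access properties.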
48. Datatypes
• A datatype is
– A classification specifying the interpretation of
a data element
– Specifies for a given data element
• the set of possible values it can have
• the operations that can be performed
• how the values of that type are stored
– May be shared between different datasets in
one file
50. General Operations on HDF5 Datatypes
• Create
– H5Tcreate creates a datatype of the H5T_COMPOUND, H5T_OPAQUE,
and H5T_ENUM classes
• Copy
– H5Tcopy creates another instance of the datatype; can be applied to any
datatype
• Commit
– H5Tcommit creates a Datatype Object in the HDF5 file; a committed
datatype can be shared between different datasets
• Open
– H5Topen opens the datatypes stored in the file
• Close
– H5Tclose closes datatype object
51. Programming model for HDF5 Datatypes
• Use predefined HDF5 types
– No need to close
• OR
– Create
• Create a datatype (by copying an existing one or by creating one of the
H5T_COMPOUND (ENUM, OPAQUE) classes)
• Create a datatype by querying the datatype of a dataset
– Open committed datatype from the file
• (Optional) Discover datatype properties (size, precision,
members, etc.)
• Use datatype to create a dataset/attribute, to write/read
dataset/attribute, to set fill value
• (Optional) Save datatype in the file
• Close
52. HDF5 Compound Datatypes
• Compound types
– Comparable to C structs
– Members can be atomic or compound types
– Members can be multidimensional
– Can be written/read by a field or set of fields
– Not all data filters can be applied (shuffling, SZIP)
– H5Tcreate(H5T_COMPOUND), H5Tinsert calls to
create a compound datatype
– See H5Tget_member* functions for discovering
properties of the HDF5 compound datatype
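Since compound types are comparable to C structs, the information each H5Tinsert call needs (a member name and its byte offset in the struct) can be gathered with standard C. The sketch below uses only `offsetof` and `sizeof`; the struct `sensor_t`, its fields, and the `member_info` table are hypothetical examples, not taken from the slides.

```c
#include <stddef.h>

/* An example record type with an atomic int, an atomic double, and a
 * small array member. */
typedef struct {
    int id;
    double temperature;
    float position[3];
} sensor_t;

/* Name, byte offset, and byte size of one struct member. */
typedef struct {
    const char *name;
    size_t offset;
    size_t size;
} member_info;

/* One entry per member; the offsets come from the compiler, so any padding
 * it inserts between members is accounted for. */
const member_info sensor_members[] = {
    {"id", offsetof(sensor_t, id), sizeof(int)},
    {"temperature", offsetof(sensor_t, temperature), sizeof(double)},
    {"position", offsetof(sensor_t, position), sizeof(float[3])},
};

const size_t sensor_member_count =
    sizeof sensor_members / sizeof sensor_members[0];
```

In an HDF5 program, `sizeof(sensor_t)` would be the size given to H5Tcreate(H5T_COMPOUND, …), and each table entry would correspond to one H5Tinsert call with a matching member datatype.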
53. HDF5 Fixed and Variable length array storage
[Figure: fixed-length and variable-length rows of Data elements along a Time axis]
54. HDF5 Variable Length Datatypes
Programming issues
• Each element is represented by a C struct
typedef struct {
    size_t len;   /* number of base-type elements */
    void *p;      /* pointer to the element data */
} hvl_t;
• Base type can be any HDF5 type
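A ragged array built from length-plus-pointer elements can be sketched in plain C. To stay self-contained, the code defines its own struct `vl_elem` with the same two-field layout as the struct above rather than including the HDF5 header; `vl_fill` and `vl_free` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Local stand-in with the same layout idea as the struct above:
 * an element count and a pointer to the data. */
typedef struct {
    size_t len;
    void *p;
} vl_elem;

/* Release the data of the first n rows and reset them. */
void vl_free(vl_elem *rows, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        free(rows[i].p);
        rows[i].p = NULL;
        rows[i].len = 0;
    }
}

/* Fill rows[0..n-1] so that row i holds the i + 1 ints 0, 1, ..., i.
 * Returns 0 on success; on allocation failure frees the rows filled so
 * far and returns -1. */
int vl_fill(vl_elem *rows, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int *data = malloc((i + 1) * sizeof *data);
        if (data == NULL) {
            vl_free(rows, i);
            return -1;
        }
        for (size_t j = 0; j <= i; j++)
            data[j] = (int)j;
        rows[i].len = i + 1;
        rows[i].p = data;
    }
    return 0;
}
```

An array of such elements is the in-memory shape a program would hand to the library when writing a dataset whose datatype was made with H5Tvlen_create over an integer base type.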
55. HDF5 Variable Length Datatypes
[Figure: a dataset with a variable length datatype; its raw data is stored in the global heap]
56. HDF Information
• HDF Information Center
– http://hdf.ncsa.uiuc.edu/
• HDF Help email address
– hdfhelp@ncsa.uiuc.edu
• HDF users mailing list
– hdfnews@ncsa.uiuc.edu
Editor's Notes
Format and software for scientific data. HDF5 is a different format from earlier versions of HDF, as is the library.
Stores images, multidimensional arrays, tables, etc. That is, you can construct all of these different kinds of structures and store them in HDF5. You can also mix and match them in HDF5 files according to your needs.
Emphasis on storage and I/O efficiency. Both the library and the format are designed to address this.
Free and commercial software support. As far as HDF5 goes, this is just a goal now. There is commercial support for HDF4, but little if any for HDF5 at this time. We are working with vendors to change this.
Emphasis on standards. You can store data in HDF5 in a variety of ways, so we try to work with users to encourage them to organize HDF5 files in standard ways.
Users from many engineering and scientific fields