This tutorial is designed for HDF5 users with some prior HDF5 experience. It covers properties of HDF5 objects that affect I/O performance and file size. The following HDF5 features are discussed: partial I/O, chunking and compression, and complex HDF5 datatypes such as strings, variable-length arrays, and compound datatypes.
We also discuss references to objects and dataset regions, and how they can be used for indexing. Participants work with the tutorial examples and exercises during the hands-on sessions.
4. What is a Selection?
• A selection describes the elements of a dataset that participate in partial I/O
• Hyperslab selection
• Point selection
• Results of set operations (union, difference, …) on hyperslab or point selections
• Used by both sequential and parallel HDF5
02/18/14
HDF and HDF-EOS Workshop X, Landover, MD
5. Example of single hyperslab selection
[Figure: a single 7 x 11 hyperslab selection within a 10 x 16 dataspace]
7. Example of irregular hyperslab selection
[Figure: an irregular hyperslab selection within a 10 x 16 dataspace]
8. Example of hyperslab selection
[Figure: a hyperslab selection within a 10 x 16 dataspace]
9. Example of point selection
10. Example of irregular selection
11. Hyperslab Description
• Offset - starting location of a hyperslab (1,1)
• Stride - number of elements that separate each block (3,2)
• Count - number of blocks (2,6)
• Block - block size (2,1)
• Everything is “measured” in number of elements
12. H5Sselect_hyperslab
space_id  Identifier of the dataspace
op        Selection operator to use:
          H5S_SELECT_SET replaces the existing selection with the parameters from this call
          H5S_SELECT_OR forms the union with the previous selection
offset    Array with the starting coordinates of the hyperslab
stride    Array specifying which positions along each dimension to select
count     Array specifying how many blocks to select from the dataspace in each dimension
block     Array specifying the size of an element block (NULL indicates a block of a single element in each dimension)
13. Reading/Writing Selections
• Open the file
• Open the dataset
• Get the file dataspace
• Create a memory dataspace (describes the data buffer)
• Make the selection(s)
• Read from or write to the dataset
• Close the dataset, file dataspace, memory dataspace, and file
18. HDF5 Chunking
• Chunked layout is needed for:
  • Extendible datasets
  • Compression and other filters
  • Improving partial I/O for big datasets
[Figure: a chunked dataset gives better subsetting access time and is extendible; for the selection shown, only two chunks are written/read]
19. Creating Chunked Dataset
• Create a dataset creation property list
• Set the property list to use chunked storage layout
• Create the dataset with the above property list
• Select part of, or all, the data for writing or reading

plist   = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, rank, ch_dims);
dset_id = H5Dcreate(…, "Chunked", …, plist);
H5Pclose(plist);
20. Writing to or reading from a chunked dataset
• Use the same set of operations as for a contiguous dataset
• Selections do not need to coincide precisely with the chunks
• The chunking mechanism is transparent to the application (not the same as in the HDF4 library)
• Chunking and compression parameters can affect performance! (Covered in the next presentation)
H5Dopen(…);
…………
H5Sselect_hyperslab (…);
…………
H5Dread(…);
21. H5zlib.c example
Creates a compressed 1000 x 20 integer dataset in the file zip.h5
h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
DATASET "Compressed_Data" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 1000, 20 )………
STORAGE_LAYOUT {
CHUNKED ( 20, 20 )
SIZE 5316
}
23. Chunking basics to remember
• Chunking creates storage overhead in the file
• Performance is affected by:
  • Chunking and compression parameters
  • Chunk cache size (H5Pset_cache call)
• Some hints for getting better performance:
  • Use a chunk size no smaller than the block size (4 KB) on your system
  • Use a compression method appropriate for your data
  • Avoid selections that do not coincide with chunk boundaries
24. Chunking and selections
• Great performance: selection coincides with a chunk
• Poor performance: selection spans all chunks
26. Datatypes
• A datatype is a classification specifying the interpretation of a data element
• For a given data element it specifies:
  • the set of possible values it can have
  • the operations that can be performed on it
  • how values of that type are stored
• A datatype may be shared between different datasets in one file
27. Hierarchy of the HDF5 datatype classes
[Figure: the HDF5 datatype class hierarchy]
28. General Operations on HDF5 Datatypes
• Create
• Derived and compound datatypes only
• Copy
• All datatypes
• Commit (save in a file to share between different datasets)
• All datatypes
• Open
• Committed datatypes only
• Discover properties (size, number of members, base type)
• Close
30. Basic Atomic Datatypes
• Atomic type classes:
  • integers and floats
  • strings (fixed and variable size)
  • pointers - references to objects/dataset regions
  • opaque
  • bitfield
• An element of an atomic datatype is the smallest possible unit for an HDF5 I/O operation
  • You cannot write or read just the mantissa or exponent field of a float, or the sign field of an integer
31. HDF5 Predefined Datatypes
• The HDF5 library provides predefined datatypes (symbols) for all basic atomic classes except opaque
• Naming scheme: H5T_<arch>_<base>
• Examples:
  • H5T_IEEE_F64LE
  • H5T_STD_I32BE
  • H5T_C_S1
  • H5T_STD_B32LE
  • H5T_STD_REF_OBJ, H5T_STD_REF_DSETREG
  • H5T_NATIVE_INT
• Predefined datatypes do not have constant values; they are initialized when the library is initialized
32. When to use HDF5 Predefined Datatypes?
• In dataset and attribute creation operations
  • Argument to H5Dcreate or H5Acreate
  • c-crtdat.c example:
    H5Dcreate(file_id, "/dset", H5T_STD_I32BE, dataspace_id, H5P_DEFAULT);
• In dataset and attribute read/write operations
  • Argument to H5Dwrite/H5Dread and H5Awrite/H5Aread
  • Always use H5T_NATIVE_* types to describe data in memory
• To create user-defined types
  • Fixed- and variable-length strings
  • User-defined integers and floats (e.g. a 13-bit integer or a non-standard floating-point type)
  • In composite type definitions
• Do not use them for declaring variables
33. Reference Datatype
• Reference to an HDF5 object
  • Pointer to a group, dataset, or named datatype in a file
  • Predefined datatype H5T_STD_REF_OBJ
  • Created with H5Rcreate, resolved with H5Rdereference
• Reference to a dataset region (selection)
  • Pointer to a dataspace selection
  • Predefined datatype H5T_STD_REF_DSETREG
  • Created with H5Rcreate, resolved with H5Rdereference
36. Reference to Object
• Create a reference to a group object:
  H5Rcreate(&ref[1], fileid, "/GROUP1/GROUP2", H5R_OBJECT, -1);
• Write the references to a dataset:
  H5Dwrite(dsetr_id, H5T_STD_REF_OBJ, H5S_ALL, H5S_ALL, H5P_DEFAULT, ref);
• Read a reference back with H5Dread and find the object it points to:
  type_id = H5Rdereference(dsetr_id, H5R_OBJECT, &ref[3]);
  name_size = H5Rget_name(dsetr_id, H5R_OBJECT, &ref_out[3], (char*)buf, 10);
• buf will contain "/MYTYPE"; name_size will be 8 (accommodating the terminating '\0')
39. Reference to Dataset Region
• Create a reference to a dataset region:
  H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start, NULL, count, NULL);
  H5Rcreate(&ref[0], file_id, "MATRIX", H5R_DATASET_REGION, space_id);
• Write the references to a dataset:
  H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ref);
40. Reference to Dataset Region
• Read a reference back with H5Dread and find the region it points to:
  dsetv_id = H5Rdereference(dsetr_id, H5R_DATASET_REGION, &ref_out[0]);
  space_id = H5Rget_region(dsetr_id, H5R_DATASET_REGION, &ref_out[0]);
• Read the selection:
  H5Dread(dsetv_id, H5T_NATIVE_INT, H5S_ALL, space_id, H5P_DEFAULT, data_out);
41. Storing strings in HDF5
• Array of characters
  • Access to each character
  • Extra work to access and interpret each string
• Fixed length
    string_id = H5Tcopy(H5T_C_S1);
    H5Tset_size(string_id, size);
  • Overhead for short strings
  • Can be compressed
• Variable length
    string_id = H5Tcopy(H5T_C_S1);
    H5Tset_size(string_id, H5T_VARIABLE);
  • Overhead as for all VL datatypes (discussed later)
  • Compression will not be applied to the actual data
42. Bitfield Datatype
• Models the C bitfield
  • Bitfield - a sequence of bits packed in some integer type
• Examples of predefined datatypes:
  • H5T_NATIVE_B64 - native 8-byte bitfield
  • H5T_STD_B32LE - standard little-endian 4-byte bitfield
• Created by copying a predefined bitfield type and setting its precision, offset, and padding
• Use the N-bit filter to store only the significant bits
46. HDF5 Compound Datatypes
• Compound types
  • Comparable to C structs
  • Members can be atomic or compound types
  • Members can be multidimensional
  • Can be written/read by a field or a set of fields
  • Not all data filters can be applied (shuffling, SZIP)
47. HDF5 Compound Datatypes
• Which APIs to use?
• H5TB APIs
  • Create, read, get info, and merge tables
  • Add, delete, and append records
  • Insert and delete fields
  • Limited control over a table's properties (i.e. only GZIP compression at level 6, default allocation time for the table, extendible, etc.)
• PyTables (http://www.pytables.org)
  • Based on H5TB
  • Python interface
  • Indexing capabilities
• HDF5 APIs
  • H5Tcreate(H5T_COMPOUND) and H5Tinsert calls to create a compound datatype
  • H5Dcreate, etc.
  • See the H5Tget_member* functions for discovering properties of an HDF5 compound datatype
48. Creating and writing compound dataset
From the h5_compound.c example:

typedef struct s1_t {
    int    a;
    float  b;
    double c;
} s1_t;

s1_t s1[LENGTH];
49. Creating and writing compound dataset
/* Create datatype in memory. */
s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);
Note:
• Use the HOFFSET macro instead of calculating offsets by hand
• The order of the H5Tinsert calls is not important when HOFFSET is used
50. Creating and writing compound dataset
/* Create dataset and write data */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, H5P_DEFAULT);
status  = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note:
• In this example the memory and file datatypes are the same
• The type is not packed
• Use H5Tpack to save space in the file (H5Tpack modifies the type in place and returns a status, so pack a copy):
s2_tid = H5Tcopy(s1_tid);
H5Tpack(s2_tid);
dataset = H5Dcreate(file, DATASETNAME, s2_tid, space, H5P_DEFAULT);
52. Reading compound dataset
/* Open the dataset, discover its type, and read the data. */
dataset = H5Dopen(file, DATASETNAME);
s2_tid  = H5Dget_type(dataset);
mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_DEFAULT);
status  = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note:
• We could construct the memory type as we did in the writing example
• A general application needs to discover the type in the file to determine the structure to read into
55. Acknowledgement
This report is based upon work supported in part by a Cooperative Agreement with NASA under
NASA NNG05GC60A. Any opinions, findings, and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics
and Space Administration.
Editor's Notes
H5Sselect_hyperslab:
1: dataspace identifier
2: selection operator determining how the new selection is combined with the already existing selection for the dataspace. Currently only the H5S_SELECT_SET operator is supported, which replaces the existing selection with the parameters from this call. Overlapping blocks are not supported with H5S_SELECT_SET.
3: start - starting coordinates (0 = beginning)
4: stride - how many elements to move in each dimension. A stride of zero is not supported; NULL is equivalent to 1.
5: count - how many blocks to select in each dimension
6: block - size of the element block; if NULL, defaults to a single element in each dimension.
H5Sselect_elements:
1: dataspace identifier
2: H5S_SELECT_SET (see above)
3: NUMP - number of elements selected
4: coord - 2-D array, (dataspace rank) by (number of elements) in size. The order of the element coordinates in the coord array also specifies the order in which the array elements are iterated when I/O is performed.
Format and software for scientific data. HDF5 is a different format from earlier versions of HDF, as is the library.
Stores images, multidimensional arrays, tables, etc. That is, you can construct all of these different kinds of structures and store them in HDF5. You can also mix and match them in HDF5 files according to your needs.
Emphasis on storage and I/O efficiency. Both the library and the format are designed to address this.
Free and commercial software support. As far as HDF5 goes, this is just a goal now. There is commercial support for HDF4, but little if any for HDF5 at this time. We are working with vendors to change this.
Emphasis on standards. You can store data in HDF5 in a variety of ways, so we try to work with users to encourage them to organize HDF5 files in standard ways.
Users from many engineering and scientific fields.