Update on HDF5 1.8
The HDF Group
HDF and HDF-EOS Workshop X
November 28, 2006

Why HDF5 1.8?
… as we know, there are known knowns;
there are things we know we know.
We also know there are known unknowns;
that is to say we know there are some
things we do not know.
But there are also unknown unknowns – the ones we don't know we don't know.
Donald Rumsfeld
Some things we knew we knew
• Need high-level APIs – image, etc.
• Need more datatypes – packed n-bit, etc.
• Need external and other links
• Tools needed – h5pack, etc.
• Caching embellishments
• Eventually, multithreading
Things we knew we did not know
• New requirements from EOS and ASCI
• New applications that would use HDF5
• How HDF5 would really perform in parallel
• What new tools, features, and options would be needed
• New APIs and API features
Things we didn’t know we didn’t know
• Completely unanticipated applications
• New data types and structures – e.g. DNA sequences
• New operations – e.g. write many real-time streams simultaneously
HDF5 1.8 topics
• Dataset and datatype improvements
• Group improvements
• Link revisions
• Shared object header messages
• Metadata cache improvements
• Other improvements
• Platform-specific changes
• High-level APIs
• Parallel HDF5
• Tool improvements
Dataset and Datatype Improvements
Text-based data type descriptions
• Why:
• Simplify datatype creation
• Make datatype creation code more readable
• Facilitate debugging by printing the text
description of a data type

• What:
• New routines to convert between a datatype and its text description:
H5LTtext_to_dtype (create a datatype from text) and
H5LTdtype_to_text (render a datatype as text)
Text datatype description – Example
• Create a compound datatype from its text description:

/* Create the data type from its text description */
dtype = H5LTtext_to_dtype("typedef struct foo {int a; float b;} foo_t;",
                          H5LT_C);

/* Convert the data type back to text (tsize receives the needed length) */
H5LTdtype_to_text(dtype, NULL, H5LT_C, &tsize);
Serialized datatypes and dataspaces
• Why:
• Allow datatype and dataspace info to be
transmitted between processes
• Allow datatype/dataspace to be stored in non-HDF5 files

• What:
• A new set of routines to serialize/deserialize
HDF5 datatypes and dataspaces.
Int-to-float conversion during I/O
• Why:
Applications need integer data converted to floating point as it is
read or written
• What:
Int-to-float conversion now supported during I/O
Revised conversion exception handling
• Why:
Give apps greater control over exceptions
(range errors, etc.) during datatype conversion.
• What:
Revised conversion exception handling
Revised conversion exception handling
• To handle exceptions during conversions, register handling
function through H5Pset_type_conv_cb().
• Cases of exception:
• H5T_CONV_EXCEPT_RANGE_HI
• H5T_CONV_EXCEPT_RANGE_LOW
• H5T_CONV_EXCEPT_TRUNCATE
• H5T_CONV_EXCEPT_PRECISION
• H5T_CONV_EXCEPT_PINF
• H5T_CONV_EXCEPT_NINF
• H5T_CONV_EXCEPT_NAN
• Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
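
A minimal sketch (not from the slides) of registering such a handler on a
dataset transfer property list; the handler name and its clamp-to-127
behavior are illustrative and assume an int-to-signed-char conversion:

/* Hypothetical handler: clamp values that overflow the destination type */
static H5T_conv_ret_t
clamp_overflow(H5T_conv_except_t except_type, hid_t src_id, hid_t dst_id,
               void *src_buf, void *dst_buf, void *user_data)
{
    if (except_type == H5T_CONV_EXCEPT_RANGE_HI) {
        *(signed char *)dst_buf = 127;  /* assumes 8-bit signed destination */
        return H5T_CONV_HANDLED;
    }
    return H5T_CONV_UNHANDLED;          /* let the library handle the rest */
}

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_type_conv_cb(dxpl, clamp_overflow, NULL);
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, dxpl, buf);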
Compression filter for n-bit data
• Why:
Compact storage for user-defined datatypes

• What:
• When data stored on disk, padding bits chopped
off and only significant bits stored
• Supports most datatypes
• Works with compound datatypes
N-bit compression example
• In memory, one value of N-Bit datatype is stored like this:
| byte 3 | byte 2 | byte 1 | byte 0 |
|????????|????SPPP|PPPPPPPP|PPPP????|
S = sign bit, P = significant bit, ? = padding bit

• After passing through the N-bit filter, all padding bits are chopped
off, and the bits are stored on disk like this:
|    1st value    |    2nd value    |
|SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...

• Opposite (decompress) when going from disk to memory
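
A sketch of enabling the filter (file_id and space_id are assumed to exist;
H5Dcreate is shown in the 1.6-style form used elsewhere in these slides):

/* Datatype with 12 significant bits starting at bit offset 4 of 32 */
hid_t dtype = H5Tcopy(H5T_STD_I32LE);
H5Tset_precision(dtype, 12);
H5Tset_offset(dtype, 4);

/* The N-bit filter requires chunked storage */
hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk[1] = {1024};
H5Pset_chunk(dcpl, 1, chunk);
H5Pset_nbit(dcpl);  /* store only the significant bits on disk */

hid_t dset = H5Dcreate(file_id, "nbit_data", dtype, space_id, dcpl);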
Scale+offset storage filter
• Why:
Use less storage when less precision needed
• What:
• Performs scale/offset operation on each value
• Truncates result to fewer bits before storing
• Currently supports integers and floats

• Example:
H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
H5Dcreate(…, dcr);
H5Dwrite(…);
Example with floating-point type
• Data: {104.561, 99.459, 100.545, 105.644}
• Choose scaling factor: decimal precision to keep
E.g. scale factor D = 2
1. Find minimum value (offset): 99.459
2. Subtract minimum value from each element
Result: {5.102, 0, 1.086, 6.185}
3. Scale the data by multiplying by 10^D = 100
Result: {510.2, 0, 108.6, 618.5}
4. Round the data to integer
Result: {510 , 0, 109, 619}
5. Pack and store using min number of bits
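
A sketch of applying the same D-scaling (scale factor D = 2) through the
filter; the dataset and variable names are illustrative:

hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk[1] = {1024};
H5Pset_chunk(dcpl, 1, chunk);                      /* filter needs chunking */
H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 2);  /* keep 2 decimal places */

hid_t dset = H5Dcreate(file_id, "so_data", H5T_NATIVE_FLOAT, space_id, dcpl);
H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);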
“NULL” Dataspace
• Why:
• Allow datasets with no elements to be described
• NetCDF 4 needed a “place holder” for attributes

• What:
• A dataset with no dimensions, no data
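
A sketch of creating a placeholder attribute with a NULL dataspace (the
attribute name is illustrative):

hid_t space = H5Screate(H5S_NULL);   /* dataspace describing no elements */
hid_t attr  = H5Acreate(dset_id, "placeholder", H5T_NATIVE_INT, space,
                        H5P_DEFAULT);
/* nothing to write: the attribute has no data */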
Group improvements
Access links by creation-time order
• Why:
• Allow iteration & lookup of group’s links
(children) by creation order as well as by name order
• Support netCDF access model for netCDF 4

• What:
Option to access objects in group according to
relative creation time
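
A sketch of tracking and indexing link creation order, then iterating in
that order (link_cb is a hypothetical application callback; H5Gcreate2 and
H5Literate are the names in the released 1.8 API):

hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
H5Pset_link_creation_order(gcpl,
        H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
hid_t gid = H5Gcreate2(file_id, "g1", H5P_DEFAULT, gcpl, H5P_DEFAULT);

/* visit the group's links in creation order rather than name order */
H5Literate(gid, H5_INDEX_CRT_ORDER, H5_ITER_INC, NULL, link_cb, NULL);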
“Compact groups”
• Why:
• Save space and access time for small groups
• If groups small, don’t need B-tree overhead

• What:
• Alternate storage for groups with few links

• Example:
• File with 11,600 groups
• With original group structure, file size ~ 20 MB
• With compact groups, file size ~ 12 MB
• Total savings: 8 MB (40%)
• Average savings/group: ~700 bytes
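
A sketch of tuning when a group switches from compact to dense (B-tree)
storage; the thresholds are illustrative:

hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
/* stay compact up to 16 links; fall back to compact below 12 */
H5Pset_link_phase_change(gcpl, 16, 12);
hid_t gid = H5Gcreate2(file_id, "small_group", H5P_DEFAULT, gcpl,
                       H5P_DEFAULT);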
Better large group storage
• Why:
Faster, more scalable storage and access for
large groups
• What:
New format and method for storing groups
with many links
Intermediate group creation
• Why:
• Simplify creation of a series of connected groups
• Avoid having to create each intermediate group
separately, one by one

• What:
• Intermediate groups can be created when creating
an object in a file, with one function call
Example: add intermediate groups
• Want to create “/A/B/C/dset1”
• “A” exists, but “B/C/dset1” do not
[Diagram: the file initially contains only group "A"; after
H5Dcreate(file_id, "/A/B/C/dset1", …), one call has created groups "B"
and "C", then created "dset1"]
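
A sketch of the one-call form (H5Dcreate2 is the released-1.8 name that
accepts a link creation property list):

hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
H5Pset_create_intermediate_group(lcpl, 1);  /* create "B" and "C" as needed */
hid_t dset = H5Dcreate2(file_id, "/A/B/C/dset1", H5T_NATIVE_INT, space_id,
                        lcpl, H5P_DEFAULT, H5P_DEFAULT);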
Link Revisions
What are links?
Links connect groups to their members
“Hard” links point to a target by address
“Soft” links store the path to a target
[Diagram: a root group holds a hard link, which stores the target's
<address>, and a soft link, which stores the path "/target dataset";
both refer to the same dataset]
New: external Links
• Why: Access objects by file & path within file
• What:
• Store location of file and path within that file
• Can link across files
[Diagram: file1.h5's root group holds an external link "dataset EL"
storing the pair ("file2.h5", "target dataset"); file2.h5's root group
holds a hard link "target dataset" whose <address> points to the dataset]
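
A sketch of creating and following the external link shown above:

/* in file1.h5: "dataset EL" points into file2.h5 */
H5Lcreate_external("file2.h5", "/target dataset",
                   file1_id, "dataset EL", H5P_DEFAULT, H5P_DEFAULT);

/* opening the link transparently opens the object in file2.h5 */
hid_t dset = H5Dopen(file1_id, "dataset EL");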
New: User-defined Links
• Why:
• Allow applications to create their own kinds of links and
link operations, such as
• Create “hard” external link that finds an object by address
• Create link that accesses a URL
• Keep track of how often a link is accessed, or other behavior

• What:
• App can create new kinds of links by supplying custom
callback functions
• Can do anything HDF5 hard, soft, or external links do
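
A minimal sketch of registering a user-defined link class; only the
traversal callback is supplied, and its body (treating the stored link
data as an object path) is purely illustrative:

static hid_t
my_traverse(const char *link_name, hid_t cur_group,
            const void *lnkdata, size_t lnkdata_size, hid_t lapl_id)
{
    /* illustrative: interpret the stored link data as an object path */
    return H5Oopen(cur_group, (const char *)lnkdata, lapl_id);
}

const H5L_class_t my_link_class = {
    H5L_LINK_CLASS_T_VERS,        /* interface version                */
    (H5L_type_t)H5L_TYPE_UD_MIN,  /* link type id                     */
    "my link",                    /* comment                          */
    NULL, NULL, NULL,             /* create / move / copy callbacks   */
    my_traverse,                  /* traversal callback               */
    NULL, NULL                    /* delete / query callbacks         */
};

H5Lregister(&my_link_class);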
Shared Object Header Messages
Shared object header messages
• Why: metadata duplicated many times, wasting space
• Example:
• You create a file with 10,000 datasets
• All use the same datatype and dataspace
• HDF5 needs to write this information 10,000 times!
[Diagram: Dataset 1, Dataset 2, and Dataset 3 each store their own
identical datatype and dataspace messages alongside data 1, data 2,
and data 3]
Shared object header messages
What:
• Enable messages to be shared automatically
• HDF5 shares duplicated messages on its own!
[Diagram: Dataset 1 and Dataset 2 now share a single datatype message and
a single dataspace message; only data 1 and data 2 are stored separately]
Shared Messages
• Happens automatically
• Works with datatypes, dataspaces, attributes, fill
values, and filter pipelines
• Saves space if these objects are relatively large
• May be faster if HDF5 can cache shared messages
• Drawbacks
• Usually slower than non-shared messages
• Adds overhead to the file
• Index for storing shared datatypes
• 25 bytes per instance

• Older library versions can’t read files with shared messages
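
A sketch of enabling sharing at file creation time (the 40-byte minimum
message size is illustrative):

hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
H5Pset_shared_mesg_nindexes(fcpl, 1);
/* one index sharing datatype and dataspace messages of 40+ bytes */
H5Pset_shared_mesg_index(fcpl, 0,
        H5O_SHMESG_DTYPE_FLAG | H5O_SHMESG_SDSPACE_FLAG, 40);
hid_t file_id = H5Fcreate("shared.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);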
Two informal tests
• File with 24 datasets, all with same big datatype
• 26,000 bytes normally
• 17,000 bytes with shared messages enabled
• Saves 375 bytes per dataset

• But, make a bad decision: invoke shared messages
but only create one dataset…
• 9,000 bytes normally
• 12,000 bytes with shared messages enabled
• Probably slower when reading and writing, too.

• Moral: shared messages can be a big help, but only in
the right situation!
Metadata cache
improvements
Metadata Cache improvements
• Why:
• Improve I/O performance and memory usage
when accessing many objects

• What:
• New metadata cache APIs
• control cache size
• monitor actual cache size and current hit rate

• Under the hood: adaptive cache resizing
• Automatically detects the current working size
• Sets max cache size to the working set size
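
A sketch of monitoring the cache through the new APIs:

double hit_rate;
size_t max_size, min_clean, cur_size;
int    nentries;

H5Fget_mdc_hit_rate(file_id, &hit_rate);
H5Fget_mdc_size(file_id, &max_size, &min_clean, &cur_size, &nentries);
printf("hit rate %.2f, cache %zu of %zu bytes, %d entries\n",
       hit_rate, cur_size, max_size, nentries);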
Metadata cache improvements
• Note: most applications do not need to worry
about the cache
• See “Advanced topics” for details
• And if you do see unusual memory growth or
poor performance, please contact us. We
want to help you.
Other improvements
New extendible error-handling API
• Why: Enable app to integrate error reporting with
HDF5 library error stack
• What: New error handling API
• H5Epush – push major and minor error IDs onto specified error stack
• H5Eprint – print specified stack
• H5Ewalk – walk through specified stack
• H5Eclear – clear specified stack
• H5Eset_auto – turn error printing on/off for specified stack
• H5Eget_auto – return traversal settings for specified stack
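
A sketch of pushing an application error onto the default stack; the class
and message names are illustrative, and the versioned names (H5Epush2,
H5Eprint2) are those of the released 1.8 API:

hid_t cls = H5Eregister_class("MyApp", "myapp", "1.0");
hid_t maj = H5Ecreate_msg(cls, H5E_MAJOR, "I/O layer");
hid_t min = H5Ecreate_msg(cls, H5E_MINOR, "Checksum mismatch");

H5Epush2(H5E_DEFAULT, __FILE__, __func__, __LINE__,
         cls, maj, min, "bad checksum in block %d", 42);
H5Eprint2(H5E_DEFAULT, stderr);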
Attribute improvements
• Why:
• Use less storage when large numbers of attributes
attached to a single object
• Iterate over or look up attributes by creation order

• What:
• Property to create index on the order in which the
attributes are created
• Improved attribute storage
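
A sketch of both properties on a dataset creation property list (the
thresholds are illustrative):

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_attr_creation_order(dcpl,
        H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
/* compact storage up to 8 attributes, dense storage beyond that */
H5Pset_attr_phase_change(dcpl, 8, 6);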
Support for Unicode Character Set
• Why:
• So apps can create names using Unicode
• netCDF 4 needed this

• What
• UTF-8 Unicode encoding now supported
• For string datatypes, names of links and attributes

• Example:
H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8)
H5Llink(file_id, "UTF-8 name", …, lcpl_id, …);
Efficient copying of HDF5 objects
• Why:
• Enable apps to copy objects efficiently

• What
• New routines to copy an object in an HDF5 file
within the current file or to another file
• Done at a low-level in the HDF5 file, allowing
• Entire group hierarchies to be copied quickly
• Compressed datasets to be copied without going
through a decompression/compression cycle
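
A sketch of the copy call; in the released 1.8 API the routine is H5Ocopy
(the tools slide later refers to it by its alpha-era name, H5Gcopy):

/* copy an entire group hierarchy, compressed data included, across files */
H5Ocopy(src_file_id, "/experiment1",   /* source                */
        dst_file_id, "/experiment1",   /* destination           */
        H5P_DEFAULT,                   /* object copy options   */
        H5P_DEFAULT);                  /* link creation options */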
Performance of object copy routines
Relative time for new h5repack (using object copy routines) vs. old h5repack:
[Bar chart: 0.1%, 0.3%, 20.0%, 35.8%, 58.7%, and 88.1% of the old
h5repack's run time across six test files; the garbled bar labels mention
10,000 attributes, 10,000 groups, chunked and chunked+compressed
16K x 16K float arrays, a 16K x 16K int array, and an 80M-element
compound-datatype array]
Data transformation filter
• Why:
• Apply arithmetic operations to data during I/O

• What:
• Data transformation filter
• Transform expressed by algebraic formula
• Only +, -, *, and / supported

• Example:
• Expression parameter set, such as x*(x-5)
• When dataset read/written, x*(x-5) applied per element
• When reading, values in file are unchanged
• When writing, transformed data written to file
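
A sketch of applying the transform on read (the expression is from the
slide; the dataset and buffer names are illustrative):

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_data_transform(dxpl, "x*(x-5)");  /* applied per element */
H5Dread(dset_id, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, dxpl, buf);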
Stackable Virtual File Drivers
• What is Virtual File Driver (VFD)?
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
• Specify objects and transformation properties
• Invoke data movement operations and data transformations
• Library internals
• Performs data transformations and other prep for I/O
• Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
• Perform byte-stream I/O operations (open/close, read/write, seek)
• User-implementable I/O (stdio, network, memory, etc.)
Stackable VFD
• The HDF5 VFD layer allows:
• Storing data using different physical file layouts.
E.g., family VFD (writes a file as a “family of files”)
• Doing different types of I/O.
E.g., stdio (standard I/O); MPI-I/O (for parallel I/O)
Stackable VFD
• Why “stackable”:
• Before now, only one VFD could be used at a time
• VFDs could not interoperate
• What is “stackable”:
• A non-terminal VFD may stack on top of compatible non-terminal VFDs,
with a terminal VFD at the bottom of the stack

• Two kinds of VFD
• Non-terminal (e.g. Family)
• Terminal (e.g. stdio; MPI-I/O)
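
A sketch of stacking the split (non-terminal) VFD over the default sec2
terminal driver; the file names are illustrative:

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
/* metadata goes to data-m.h5, raw data to data-r.h5 */
H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);
hid_t file_id = H5Fcreate("data", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);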
Stackable VFD
[Diagram: the application calls the HDF5 API; the default I/O path goes
directly to the sec2 terminal VFD; alternatively, non-terminal VFDs such
as family and split (which separates metadata from raw data) stack on top
of terminal VFDs such as stdio and mpiio, which write the HDF5 files]
Platform-specific changes
Platform-specific changes
• Why: Better UNIX/Linux Portability

• What:
• 1.8 uses latest GNU “auto” tools (autoconf,
automake, libtool)
• improves portability between many machine and OS
configurations

• Build can now be done in parallel
• with gmake “–j” flag
• speeds up build, test and install processes

• Build infrastructure includes many other
improvements as well
Platforms to be dropped
• Operating systems
• HPUX 11.00
• MAC OS 10.3
• AIX 5.1 and 5.2
• SGI IRIX64-6.5
• Linux 2.4
• Solaris 2.8 and 2.9
• Compilers
• GNU C compilers older than 3.4 (Linux)
• Intel 8.*
• PGI V. 5.*, 6.0
• MPICH 1.2.5

http://www.hdfgroup.org/HDF5/release/alpha/obtain518.html
Platforms to be added
• Systems
• Alpha Open VMS
• MAC OSX 10.4 (Intel)
• Solaris 2.* on Intel (?)
• Cray XT3
• Windows 64-bit (32-bit binaries)
• Linux 2.6
• BG/L
• Compilers
• g95
• PGI V. 6.1
• Intel 9.*
• MPICH 1.2.7
• MPICH2
High level APIs
High-Level Fortran APIs
• Fortran APIs have been added for H5Lite,
H5Image and H5Table.
Dimension scales
• Similar to
• Dimension scales in HDF4
• Coordinate variables in netCDF

• What is a dimension scale ?
• An HDF5 dataset with additional metadata that
identifies the dataset as a “Dimension Scale”
• Associated with dimensions of HDF5 datasets
• Meaning of the association is left to applications

• A Dimension scale can be shared by two or
more dataset dimensions
Dimension scales example
[HDF Explorer images illustrating dimension scales]
Sample dimension scale functions
• H5DSset_scale: convert a dataset to a dimension scale
• H5DSattach_scale: attach a scale to a dimension
• H5DSdetach_scale: detach a scale from a dimension
• H5DSis_attached: verify whether a scale is attached to a dataset
• H5DSget_scale_name: read the name of a scale
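
A sketch of the two most common calls (from hdf5_hl.h; the dataset names
are illustrative):

hid_t time_ds = H5Dopen(file_id, "time");
H5DSset_scale(time_ds, "time scale");   /* make "time" a dimension scale */

hid_t dset = H5Dopen(file_id, "pressure");
H5DSattach_scale(dset, time_ds, 0);     /* attach it to dimension 0 */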
HDF5 Packet
• Why:
• High performance table writing
• For data acquisition, when there are many sources
of data
• E.g. flight test

• What:
• Each row is a “packet”: a collection of fields, fixed
or variable length
• Append only
• Indexed retrieval
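
A sketch of a fixed-length packet table (hdf5_hl.h; the dataset name,
packet type, and chunk size are illustrative):

hid_t table = H5PTcreate_fl(file_id, "telemetry", packet_type_id,
                            512 /* chunk size */, -1 /* no compression */);
H5PTappend(table, 1, &packet);   /* append one packet per call */
H5PTclose(table);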
Packets in HDF5
[Diagram: fixed-length data records, each holding a Time and a Data field,
contrasted with variable-length records, each holding a Time and a varying
number of Data fields]
Parallel HDF5
Collective I/O improvements
• Why
• Collective I/O not available for chunked data
• Collective I/O not available for complex
selections
• Collective I/O is key to improving performance
for parallel HDF5

• What
• Collective I/O works for chunked storage
• Works for irregular selections for both chunked
and contiguous storage
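
A sketch of requesting collective transfer (assumes the file was opened
with the MPI-I/O driver; dataset and selection names are illustrative):

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file_id = H5Fopen("parallel.h5", H5F_ACC_RDWR, fapl);

/* collective I/O now works for chunked datasets and irregular selections */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset_id, H5T_NATIVE_INT, mem_space, file_space, dxpl, buf);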
Parallel h5diff (ph5diff)
• Compares two files in an MPI parallel
environment.
• Compares multiple datasets simultaneously
Windows MPICH support
• Windows MPICH support: prototype
Tool improvements
New features for old tools
• h5dump
• Dump data in binary format
• Faster for files with large numbers of objects

• h5diff
• Can now compare dataset regions
• Parallel ph5diff now available

• h5repack
• Efficient data copy using H5Gcopy()
• Able to handle big datasets
New HDF5 Tools
• h5copy
• Copies a group, dataset or named datatype from one location to
another
• Copies within a file or across files

• h5repart
• Partition file into a family of files

• h5import
• Import binary/ascii data into an HDF5 file

• h5check
• Verifies an HDF5 file against the defined HDF5 File Format
Specification

• h5stat
• Reports statistics about a file and objects in a file
Thank You
Questions/comments?
For more information
• Go to http://www.hdfgroup.org/HDF5/
• Click on “Obtain HDF5 1.8.0 Alpha”
• Look at table “Information”
Acknowledgement
This report is based upon work supported in part by a
Cooperative Agreement with NASA under NASA
NNG05GC60A. Any opinions, findings, and
conclusions or recommendations expressed in this
material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.
