Update on HDF5 1.8
The HDF Group
HDF and HDF-EOS Workshop X
November 28, 2006

Why HDF5 1.8?
… as we know, there are known knowns;
there are things we know we know.
We also know there are known unknowns;
that is to say we know there are some
things we do not know.
But there are also unknown unknowns – the ones we don't know we don't know.
Donald Rumsfeld
Some things we knew we knew
• Need high-level APIs – image, etc.
• Need more datatypes – packed n-bit, etc.
• Need external and other links
• Tools needed – h5pack, etc.
• Caching embellishments
• Eventually, multithreading
Things we knew we did not know
• New requirements from EOS and ASCI
• New applications that would use HDF5
• How HDF5 would really perform in parallel
• What new tools, features, and options would be needed
• New APIs and API features
Things we didn’t know we didn’t know
• Completely unanticipated applications
• New data types and structures – e.g. DNA sequences
• New operations – e.g. write many real-time streams simultaneously
HDF5 1.8 topics
• Dataset and datatype improvements
• Group improvements
• Link revisions
• Shared object header messages
• Metadata cache improvements
• Other improvements
• Platform-specific changes
• High-level APIs
• Parallel HDF5
• Tool improvements
Dataset and Datatype Improvements
Text-based data type descriptions
• Why:
• Simplify datatype creation
• Make datatype creation code more readable
• Facilitate debugging by printing the text
description of a data type

• What:
• New routines to convert between a datatype and its text description:
H5LTtext_to_dtype (create a datatype from text) and
H5LTdtype_to_text (render a datatype as text)
Text datatype description – Example
• Create a compound datatype from its text description:

/* Create the data type from its text description */
dtype = H5LTtext_to_dtype("typedef struct foo {int a; float b;} foo_t;",
                          H5LT_C);

/* Convert the data type back to text (tsize receives the needed length) */
H5LTdtype_to_text(dtype, NULL, H5LT_C, &tsize);
Serialized datatypes and dataspaces
• Why:
• Allow datatype and dataspace info to be
transmitted between processes
• Allow datatype/dataspace to be stored in non-HDF5 files

• What:
• A new set of routines to serialize/deserialize
HDF5 datatypes and dataspaces.
Int-to-float conversion during I/O
• Why:
Applications need integer data converted to floating point as it is
read or written
• What:
Int-to-float conversion now supported during I/O
Revised conversion exception handling
• Why:
Give apps greater control over exceptions
(range errors, etc.) during datatype conversion.
• What:
Revised conversion exception handling
Revised conversion exception handling
• To handle exceptions during conversions, register handling
function through H5Pset_type_conv_cb().
• Cases of exception:
• H5T_CONV_EXCEPT_RANGE_HI
• H5T_CONV_EXCEPT_RANGE_LOW
• H5T_CONV_EXCEPT_TRUNCATE
• H5T_CONV_EXCEPT_PRECISION
• H5T_CONV_EXCEPT_PINF
• H5T_CONV_EXCEPT_NINF
• H5T_CONV_EXCEPT_NAN
• Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
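
A minimal sketch (not from the slides) of registering such a handler on a
dataset transfer property list; the handler name and its clamp-to-127
behavior are illustrative and assume an int-to-signed-char conversion:

/* Hypothetical handler: clamp values that overflow the destination type */
static H5T_conv_ret_t
clamp_overflow(H5T_conv_except_t except_type, hid_t src_id, hid_t dst_id,
               void *src_buf, void *dst_buf, void *user_data)
{
    if (except_type == H5T_CONV_EXCEPT_RANGE_HI) {
        *(signed char *)dst_buf = 127;  /* assumes 8-bit signed destination */
        return H5T_CONV_HANDLED;
    }
    return H5T_CONV_UNHANDLED;          /* let the library handle the rest */
}

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_type_conv_cb(dxpl, clamp_overflow, NULL);
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, dxpl, buf);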
Compression filter for n-bit data
• Why:
Compact storage for user-defined datatypes

• What:
• When data stored on disk, padding bits chopped
off and only significant bits stored
• Supports most datatypes
• Works with compound datatypes
N-bit compression example
• In memory, one value of N-Bit datatype is stored like this:
| byte 3 | byte 2 | byte 1 | byte 0 |
|????????|????SPPP|PPPPPPPP|PPPP????|
S = sign bit, P = significant bit, ? = padding bit

• After passing through the N-bit filter, all padding bits are chopped
off, and the bits are stored on disk like this:
|    1st value    |    2nd value    |
|SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...

• Opposite (decompress) when going from disk to memory
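
A sketch of enabling the filter (file_id and space_id are assumed to exist;
H5Dcreate is shown in the 1.6-style form used elsewhere in these slides):

/* Datatype with 12 significant bits starting at bit offset 4 of 32 */
hid_t dtype = H5Tcopy(H5T_STD_I32LE);
H5Tset_precision(dtype, 12);
H5Tset_offset(dtype, 4);

/* The N-bit filter requires chunked storage */
hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk[1] = {1024};
H5Pset_chunk(dcpl, 1, chunk);
H5Pset_nbit(dcpl);  /* store only the significant bits on disk */

hid_t dset = H5Dcreate(file_id, "nbit_data", dtype, space_id, dcpl);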
Scale+offset storage filter
• Why:
Use less storage when less precision needed
• What:
• Performs scale/offset operation on each value
• Truncates result to fewer bits before storing
• Currently supports integers and floats

• Example:
H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
H5Dcreate(…, dcr);
H5Dwrite(…);
Example with floating-point type
• Data: {104.561, 99.459, 100.545, 105.644}
• Choose scaling factor: decimal precision to keep
E.g. scale factor D = 2
1. Find minimum value (offset): 99.459
2. Subtract minimum value from each element
Result: {5.102, 0, 1.086, 6.185}
3. Scale the data by multiplying by 10^D = 100
Result: {510.2, 0, 108.6, 618.5}
4. Round the data to integer
Result: {510 , 0, 109, 619}
5. Pack and store using min number of bits
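
A sketch of applying the same D-scaling (scale factor D = 2) through the
filter; the dataset and variable names are illustrative:

hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk[1] = {1024};
H5Pset_chunk(dcpl, 1, chunk);                      /* filter needs chunking */
H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 2);  /* keep 2 decimal places */

hid_t dset = H5Dcreate(file_id, "so_data", H5T_NATIVE_FLOAT, space_id, dcpl);
H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);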
“NULL” Dataspace
• Why:
• Allow datasets with no elements to be described
• NetCDF 4 needed a “place holder” for attributes

• What:
• A dataset with no dimensions, no data
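
A sketch of creating a placeholder attribute with a NULL dataspace (the
attribute name is illustrative):

hid_t space = H5Screate(H5S_NULL);   /* dataspace describing no elements */
hid_t attr  = H5Acreate(dset_id, "placeholder", H5T_NATIVE_INT, space,
                        H5P_DEFAULT);
/* nothing to write: the attribute has no data */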
Group improvements
Access links by creation-time order
• Why:
• Allow iteration & lookup of group’s links
(children) by creation order as well as by name order
• Support netCDF access model for netCDF 4

• What:
Option to access objects in group according to
relative creation time
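
A sketch of tracking and indexing link creation order, then iterating in
that order (link_cb is a hypothetical application callback; H5Gcreate2 and
H5Literate are the names in the released 1.8 API):

hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
H5Pset_link_creation_order(gcpl,
        H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
hid_t gid = H5Gcreate2(file_id, "g1", H5P_DEFAULT, gcpl, H5P_DEFAULT);

/* visit the group's links in creation order rather than name order */
H5Literate(gid, H5_INDEX_CRT_ORDER, H5_ITER_INC, NULL, link_cb, NULL);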
“Compact groups”
• Why:
• Save space and access time for small groups
• If groups small, don’t need B-tree overhead

• What:
• Alternate storage for groups with few links

• Example:
• File with 11,600 groups
• With original group structure, file size ~ 20 MB
• With compact groups, file size ~ 12 MB
• Total savings: 8 MB (40%)
• Average savings/group: ~700 bytes
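
A sketch of tuning when a group switches from compact to dense (B-tree)
storage; the thresholds are illustrative:

hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
/* stay compact up to 16 links; fall back to compact below 12 */
H5Pset_link_phase_change(gcpl, 16, 12);
hid_t gid = H5Gcreate2(file_id, "small_group", H5P_DEFAULT, gcpl,
                       H5P_DEFAULT);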
Better large group storage
• Why:
Faster, more scalable storage and access for
large groups
• What:
New format and method for storing groups
with many links
Intermediate group creation
• Why:
• Simplify creation of a series of connected groups
• Avoid having to create each intermediate group
separately, one by one

• What:
• Intermediate groups can be created when creating
an object in a file, with one function call
Example: add intermediate groups
• Want to create “/A/B/C/dset1”
• “A” exists, but “B/C/dset1” do not
[Diagram: the file initially contains only group "A"; after
H5Dcreate(file_id, "/A/B/C/dset1", …), one call has created groups "B"
and "C", then created "dset1"]
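
A sketch of the one-call form (H5Dcreate2 is the released-1.8 name that
accepts a link creation property list):

hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
H5Pset_create_intermediate_group(lcpl, 1);  /* create "B" and "C" as needed */
hid_t dset = H5Dcreate2(file_id, "/A/B/C/dset1", H5T_NATIVE_INT, space_id,
                        lcpl, H5P_DEFAULT, H5P_DEFAULT);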
Link Revisions
What are links?
Links connect groups to their members
“Hard” links point to a target by address
“Soft” links store the path to a target
[Diagram: a root group holds a hard link, which stores the target's
<address>, and a soft link, which stores the path "/target dataset";
both refer to the same dataset]
New: external Links
• Why: Access objects by file & path within file
• What:
• Store location of file and path within that file
• Can link across files
[Diagram: file1.h5's root group holds an external link "dataset EL"
storing the pair ("file2.h5", "target dataset"); file2.h5's root group
holds a hard link "target dataset" whose <address> points to the dataset]
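
A sketch of creating and following the external link shown above:

/* in file1.h5: "dataset EL" points into file2.h5 */
H5Lcreate_external("file2.h5", "/target dataset",
                   file1_id, "dataset EL", H5P_DEFAULT, H5P_DEFAULT);

/* opening the link transparently opens the object in file2.h5 */
hid_t dset = H5Dopen(file1_id, "dataset EL");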
New: User-defined Links
• Why:
• Allow applications to create their own kinds of links and
link operations, such as
• Create “hard” external link that finds an object by address
• Create link that accesses a URL
• Keep track of how often a link is accessed, or other behavior

• What:
• App can create new kinds of links by supplying custom
callback functions
• Can do anything HDF5 hard, soft, or external links do
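
A minimal sketch of registering a user-defined link class; only the
traversal callback is supplied, and its body (treating the stored link
data as an object path) is purely illustrative:

static hid_t
my_traverse(const char *link_name, hid_t cur_group,
            const void *lnkdata, size_t lnkdata_size, hid_t lapl_id)
{
    /* illustrative: interpret the stored link data as an object path */
    return H5Oopen(cur_group, (const char *)lnkdata, lapl_id);
}

const H5L_class_t my_link_class = {
    H5L_LINK_CLASS_T_VERS,        /* interface version                */
    (H5L_type_t)H5L_TYPE_UD_MIN,  /* link type id                     */
    "my link",                    /* comment                          */
    NULL, NULL, NULL,             /* create / move / copy callbacks   */
    my_traverse,                  /* traversal callback               */
    NULL, NULL                    /* delete / query callbacks         */
};

H5Lregister(&my_link_class);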
Shared Object Header Messages
Shared object header messages
• Why: metadata duplicated many times, wasting space
• Example:
• You create a file with 10,000 datasets
• All use the same datatype and dataspace
• HDF5 needs to write this information 10,000 times!
[Diagram: Dataset 1, Dataset 2, and Dataset 3 each store their own
identical datatype and dataspace messages alongside data 1, data 2,
and data 3]
Shared object header messages
What:
• Enable messages to be shared automatically
• HDF5 shares duplicated messages on its own!
[Diagram: Dataset 1 and Dataset 2 now share a single datatype message and
a single dataspace message; only data 1 and data 2 are stored separately]
Shared Messages
• Happens automatically
• Works with datatypes, dataspaces, attributes, fill
values, and filter pipelines
• Saves space if these objects are relatively large
• May be faster if HDF5 can cache shared messages
• Drawbacks
• Usually slower than non-shared messages
• Adds overhead to the file
• Index for storing shared datatypes
• 25 bytes per instance

• Older library versions can’t read files with shared messages
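
A sketch of enabling sharing at file creation time (the 40-byte minimum
message size is illustrative):

hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
H5Pset_shared_mesg_nindexes(fcpl, 1);
/* one index sharing datatype and dataspace messages of 40+ bytes */
H5Pset_shared_mesg_index(fcpl, 0,
        H5O_SHMESG_DTYPE_FLAG | H5O_SHMESG_SDSPACE_FLAG, 40);
hid_t file_id = H5Fcreate("shared.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);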
Two informal tests
• File with 24 datasets, all with same big datatype
• 26,000 bytes normally
• 17,000 bytes with shared messages enabled
• Saves 375 bytes per dataset

• But, make a bad decision: invoke shared messages
but only create one dataset…
• 9,000 bytes normally
• 12,000 bytes with shared messages enabled
• Probably slower when reading and writing, too.

• Moral: shared messages can be a big help, but only in
the right situation!
Metadata cache
improvements
Metadata Cache improvements
• Why:
• Improve I/O performance and memory usage
when accessing many objects

• What:
• New metadata cache APIs
• control cache size
• monitor actual cache size and current hit rate

• Under the hood: adaptive cache resizing
• Automatically detects the current working size
• Sets max cache size to the working set size
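
A sketch of monitoring the cache through the new APIs:

double hit_rate;
size_t max_size, min_clean, cur_size;
int    nentries;

H5Fget_mdc_hit_rate(file_id, &hit_rate);
H5Fget_mdc_size(file_id, &max_size, &min_clean, &cur_size, &nentries);
printf("hit rate %.2f, cache %zu of %zu bytes, %d entries\n",
       hit_rate, cur_size, max_size, nentries);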
Metadata cache improvements
• Note: most applications do not need to worry
about the cache
• See “Advanced topics” for details
• And if you do see unusual memory growth or
poor performance, please contact us. We
want to help you.
Other improvements
New extendible error-handling API
• Why: Enable app to integrate error reporting with
HDF5 library error stack
• What: New error handling API
• H5Epush – push major and minor error IDs onto specified error stack
• H5Eprint – print specified stack
• H5Ewalk – walk through specified stack
• H5Eclear – clear specified stack
• H5Eset_auto – turn error printing on/off for specified stack
• H5Eget_auto – return traversal settings for specified stack
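
A sketch of pushing an application error onto the default stack; the class
and message names are illustrative, and the versioned names (H5Epush2,
H5Eprint2) are those of the released 1.8 API:

hid_t cls = H5Eregister_class("MyApp", "myapp", "1.0");
hid_t maj = H5Ecreate_msg(cls, H5E_MAJOR, "I/O layer");
hid_t min = H5Ecreate_msg(cls, H5E_MINOR, "Checksum mismatch");

H5Epush2(H5E_DEFAULT, __FILE__, __func__, __LINE__,
         cls, maj, min, "bad checksum in block %d", 42);
H5Eprint2(H5E_DEFAULT, stderr);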
Attribute improvements
• Why:
• Use less storage when large numbers of attributes
attached to a single object
• Iterate over or look up attributes by creation order

• What:
• Property to create index on the order in which the
attributes are created
• Improved attribute storage
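
A sketch of both properties on a dataset creation property list (the
thresholds are illustrative):

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_attr_creation_order(dcpl,
        H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
/* compact storage up to 8 attributes, dense storage beyond that */
H5Pset_attr_phase_change(dcpl, 8, 6);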
Support for Unicode Character Set
• Why:
• So apps can create names using Unicode
• netCDF 4 needed this

• What
• UTF-8 Unicode encoding now supported
• For string datatypes, names of links and attributes

• Example:
H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8)
H5Llink(file_id, "UTF-8 name", …, lcpl_id, …);
Efficient copying of HDF5 objects
• Why:
• Enable apps to copy objects efficiently

• What
• New routines to copy an object in an HDF5 file
within the current file or to another file
• Done at a low-level in the HDF5 file, allowing
• Entire group hierarchies to be copied quickly
• Compressed datasets to be copied without going
through a decompression/compression cycle
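
A sketch of the copy call; in the released 1.8 API the routine is H5Ocopy
(the tools slide later refers to it by its alpha-era name, H5Gcopy):

/* copy an entire group hierarchy, compressed data included, across files */
H5Ocopy(src_file_id, "/experiment1",   /* source                */
        dst_file_id, "/experiment1",   /* destination           */
        H5P_DEFAULT,                   /* object copy options   */
        H5P_DEFAULT);                  /* link creation options */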
Performance of object copy routines
Relative time for new h5repack (using object copy routines) vs. old h5repack:
[Bar chart: 0.1%, 0.3%, 20.0%, 35.8%, 58.7%, and 88.1% of the old
h5repack's run time across six test files; the garbled bar labels mention
10,000 attributes, 10,000 groups, chunked and chunked+compressed
16K x 16K float arrays, a 16K x 16K int array, and an 80M-element
compound-datatype array]
Data transformation filter
• Why:
• Apply arithmetic operations to data during I/O

• What:
• Data transformation filter
• Transform expressed by algebraic formula
• Only +, -, *, and / supported

• Example:
• Expression parameter set, such as x*(x-5)
• When dataset read/written, x*(x-5) applied per element
• When reading, values in file are unchanged
• When writing, transformed data written to file
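
A sketch of applying the transform on read (the expression is from the
slide; the dataset and buffer names are illustrative):

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_data_transform(dxpl, "x*(x-5)");  /* applied per element */
H5Dread(dset_id, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, dxpl, buf);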
Stackable Virtual File Drivers
• What is Virtual File Driver (VFD)?
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
• Specify objects and transformation properties
• Invoke data movement operations and data transformations
• Library internals
• Performs data transformations and other prep for I/O
• Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
• Perform byte-stream I/O operations (open/close, read/write, seek)
• User-implementable I/O (stdio, network, memory, etc.)
Stackable VFD
• The HDF5 VFD layer allows:
• Storing data using different physical file layouts.
E.g., family VFD (writes a file as a “family of files”)
• Doing different types of I/O.
E.g., stdio (standard I/O); MPI-I/O (for parallel I/O)
Stackable VFD
• Why “stackable”:
• Before now, only one VFD could be used at a time
• VFDs could not interoperate
• What is “stackable”:
• A non-terminal VFD may stack on top of compatible non-terminal VFDs,
with a terminal VFD at the bottom of the stack

• Two kinds of VFD
• Non-terminal (e.g. Family)
• Terminal (e.g. stdio; MPI-I/O)
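
A sketch of stacking the split (non-terminal) VFD over the default sec2
terminal driver; the file names are illustrative:

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
/* metadata goes to data-m.h5, raw data to data-r.h5 */
H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);
hid_t file_id = H5Fcreate("data", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);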
Stackable VFD
[Diagram: the application calls the HDF5 API; the default I/O path goes
directly to the sec2 terminal VFD; alternatively, non-terminal VFDs such
as family and split (which separates metadata from raw data) stack on top
of terminal VFDs such as stdio and mpiio, which write the HDF5 files]
Platform-specific changes
Platform-specific changes
• Why: Better UNIX/Linux Portability

• What:
• 1.8 uses latest GNU “auto” tools (autoconf,
automake, libtool)
• improves portability between many machine and OS
configurations

• Build can now be done in parallel
• with gmake “–j” flag
• speeds up build, test and install processes

• Build infrastructure includes many other
improvements as well
Platforms to be dropped
• Operating systems
• HPUX 11.00
• MAC OS 10.3
• AIX 5.1 and 5.2
• SGI IRIX64-6.5
• Linux 2.4
• Solaris 2.8 and 2.9
• Compilers
• GNU C compilers older than 3.4 (Linux)
• Intel 8.*
• PGI V. 5.*, 6.0
• MPICH 1.2.5

http://www.hdfgroup.org/HDF5/release/alpha/obtain518.html
Platforms to be added
• Systems
• Alpha Open VMS
• MAC OSX 10.4 (Intel)
• Solaris 2.* on Intel (?)
• Cray XT3
• Windows 64-bit (32-bit binaries)
• Linux 2.6
• BG/L
• Compilers
• g95
• PGI V. 6.1
• Intel 9.*
• MPICH 1.2.7
• MPICH2
High level APIs
High-Level Fortran APIs
• Fortran APIs have been added for H5Lite,
H5Image and H5Table.
Dimension scales
• Similar to
• Dimension scales in HDF4
• Coordinate variables in netCDF

• What is a dimension scale ?
• An HDF5 dataset with additional metadata that
identifies the dataset as a “Dimension Scale”
• Associated with dimensions of HDF5 datasets
• Meaning of the association is left to applications

• A Dimension scale can be shared by two or
more dataset dimensions
Dimension scales example
[HDF Explorer images illustrating dimension scales]
Sample dimension scale functions
• H5DSset_scale: convert a dataset to a dimension scale
• H5DSattach_scale: attach a scale to a dimension
• H5DSdetach_scale: detach a scale from a dimension
• H5DSis_attached: verify whether a scale is attached to a dataset
• H5DSget_scale_name: read the name of a scale
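
A sketch of the two most common calls (from hdf5_hl.h; the dataset names
are illustrative):

hid_t time_ds = H5Dopen(file_id, "time");
H5DSset_scale(time_ds, "time scale");   /* make "time" a dimension scale */

hid_t dset = H5Dopen(file_id, "pressure");
H5DSattach_scale(dset, time_ds, 0);     /* attach it to dimension 0 */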
HDF5 Packet
• Why:
• High performance table writing
• For data acquisition, when there are many sources
of data
• E.g. flight test

• What:
• Each row is a “packet”: a collection of fields, fixed
or variable length
• Append only
• Indexed retrieval
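
A sketch of a fixed-length packet table (hdf5_hl.h; the dataset name,
packet type, and chunk size are illustrative):

hid_t table = H5PTcreate_fl(file_id, "telemetry", packet_type_id,
                            512 /* chunk size */, -1 /* no compression */);
H5PTappend(table, 1, &packet);   /* append one packet per call */
H5PTclose(table);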
Packets in HDF5
[Diagram: fixed-length data records, each holding a Time and a Data field,
contrasted with variable-length records, each holding a Time and a varying
number of Data fields]
Parallel HDF5
Collective I/O improvements
• Why
• Collective I/O not available for chunked data
• Collective I/O not available for complex
selections
• Collective I/O is key to improving performance
for parallel HDF5

• What
• Collective I/O works for chunked storage
• Works for irregular selections for both chunked
and contiguous storage
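
A sketch of requesting collective transfer (assumes the file was opened
with the MPI-I/O driver; dataset and selection names are illustrative):

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file_id = H5Fopen("parallel.h5", H5F_ACC_RDWR, fapl);

/* collective I/O now works for chunked datasets and irregular selections */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset_id, H5T_NATIVE_INT, mem_space, file_space, dxpl, buf);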
Parallel h5diff (ph5diff)
• Compares two files in an MPI parallel
environment.
• Compares multiple datasets simultaneously
Windows MPICH support
• Windows MPICH support: prototype
Tool improvements
New features for old tools
• h5dump
• Dump data in binary format
• Faster for files with large numbers of objects

• h5diff
• Can now compare dataset regions
• Parallel ph5diff now available

• h5repack
• Efficient data copy using H5Gcopy()
• Able to handle big datasets
New HDF5 Tools
• h5copy
• Copies a group, dataset or named datatype from one location to
another
• Copies within a file or across files

• h5repart
• Partition file into a family of files

• h5import
• Import binary/ascii data into an HDF5 file

• h5check
• Verifies an HDF5 file against the defined HDF5 File Format
Specification

• h5stat
• Reports statistics about a file and objects in a file
Thank You
Questions/comments?
For more information
• Go to http://www.hdfgroup.org/HDF5/
• Click on “Obtain HDF5 1.8.0 Alpha”
• Look at table “Information”
Acknowledgement
This report is based upon work supported in part by a
Cooperative Agreement with NASA under NASA
NNG05GC60A. Any opinions, findings, and
conclusions or recommendations expressed in this
material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.
