The second statement is what we mean by "rich"
High-level view; point out that the file format is NOT "HDF5" (mention VOL). Gerd is a little unhappy with "structured", but it should be OK for this audience.
HDF5 has the characteristics of other formats that are out there. It's hard to store metadata in a binary flat file, and doing so is not scalable.
Gerd points out that a library is properly a part of the self-describing representation
High performance can have many meanings
Again, note that links are named, not objects
Much more low-level than, say, an RDBMS, though the ease of use of a database can come at a performance cost. "Easy" access via Python, Gerd's PowerShell snap-in, etc. Can write your own data access API to create queries, etc.
Need to reword this! "These are called dataspaces" = bad.
Add resource links to this slide
Why should you listen to my talk?
Note that links are named, not objects! Gerd thinks of names as NAVIGATORS.
Wide variety of integer and floating point types, enum types, etc. Need to point out that variable-length strings have compression issues (fixable, with $$$).
Might mention sparsity for chunks here. Mike suggests not mentioning chunks, so perhaps that could be replaced with a note about sparse data.
D Robinson - Using HDF5 to work with large quantities of rich biological data
Using HDF5 To Work With Large Quantities of Rich Biological Data
Dana Robinson (derobins@hdfgroup.org)
The HDF Group
July 13, 2012, BOSC 2012
Today's Goal
Is that you walk away from this talk with a basic understanding of the HDF5 technology stack.
What is HDF5?
HDF5 is a highly scalable way to organize and store heterogeneous, multidimensional data of user-defined types.
HDF5 also allows data relationships and context to be stored using annotation and linking.
HDF5
The HDF5 technology suite includes:
• A structured binary file format
• An abstract data model for describing your data
• A data access library, written in C (w/ bindings for C++, Fortran 95/2003, and Java)
HDF5 has characteristics of …
• Directories and Files: hierarchical; collections of related information
• PDF: standard exchange format; heterogeneous information
• Databases: subsetting; random access
• XML: self-describing; extensible types; rich metadata
• Binary Flat File: high-performance
Advantages of HDF5
• Platform and architecture-independent
• Scalable in space and time
  • File size only limited by OS and filesystem
  • Data access time (esp. parallel) scales well
• Flexible (user-defined types and organization)
• Files are self-describing
Advantages of HDF5 (2)
• High-performance
• Parallel I/O via MPI-IO
• Supports compression and other filters
• Open source (BSD license)
• THG committed to provide long-term support
HDF5 Data Objects
• Groups
• Datasets
• Datatypes
• Metadata (Attributes)
Example: LCMS Data
[Diagram labels: sample name, chromatography parameters, ms parameters, ms/ms parameters]
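The object kinds listed on the "HDF5 Data Objects" slide can be pictured with a small in-memory model. This is a plain-Python sketch of the data model only, not the HDF5 library or its file format. The class names, the `/run_001` path, and every attribute value are made up for illustration. It also shows that names belong to links rather than to objects: the same dataset is reachable under two names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """A typed array of values plus attributes (small metadata items)."""
    dtype: str                      # stands in for an HDF5 datatype
    data: List[float]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """A container whose named links point at other groups or datasets."""
    links: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)


def resolve(root: Group, path: str):
    """Follow link names from the root, e.g. '/run_001/ms/intensity'."""
    node = root
    for name in filter(None, path.split("/")):
        node = node.links[name]
    return node


# Hypothetical LC-MS layout: one group per run, annotated with its parameters.
intensity = Dataset("float64", [10.5, 250.0, 31.2], {"units": "counts"})
ms = Group({"intensity": intensity}, {"polarity": "positive"})
run = Group({"ms": ms}, {"sample name": "sample A", "column": "C18"})
root = Group({"run_001": run})

# A second link to the same dataset: the object has no name of its own.
root.links["latest_intensity"] = intensity

same = resolve(root, "/run_001/ms/intensity") is resolve(root, "/latest_intensity")
print(same)  # prints True
```

Keeping the parameters as attributes on the group they describe is one possible layout; whether a given parameter belongs on the run, the group, or the dataset is a design decision for the person organizing the file.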
HDF5 Data Access
Unlike many data storage systems, HDF5 has no built-in query engine or indexes.
You will have to write your own data access code, usually using the HDF5 API.
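To give a feel for what "write your own data access code" can mean, here is a minimal sketch of a hand-written query: walk the hierarchy and keep the objects whose attributes match. It uses nested dictionaries as a stand-in for a file rather than the HDF5 API, and the `_attrs` key, the run names, and the attribute values are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical hierarchy: nested dicts play the role of groups, and the
# reserved "_attrs" key holds each node's attributes.
tree = {
    "_attrs": {},
    "run_001": {"_attrs": {"instrument": "orbitrap", "ms_level": 1}},
    "run_002": {"_attrs": {"instrument": "qtof", "ms_level": 2}},
    "archive": {
        "_attrs": {},
        "run_000": {"_attrs": {"instrument": "orbitrap", "ms_level": 2}},
    },
}


def walk(node: Dict, path: str = "") -> Iterator[Tuple[str, Dict]]:
    """Yield (path, attributes) for every node, depth first."""
    yield path or "/", node["_attrs"]
    for name, child in node.items():
        if name != "_attrs":
            yield from walk(child, f"{path}/{name}")


def query(root: Dict, predicate: Callable[[Dict], bool]) -> List[str]:
    """A hand-written 'query': scan everything, keep paths whose attributes match."""
    return [path for path, attrs in walk(root) if predicate(attrs)]


matches = query(tree, lambda a: a.get("instrument") == "orbitrap")
print(matches)  # ['/run_001', '/archive/run_000']
```

Every call scans the whole hierarchy. Anything faster, such as a lookup table from instrument to paths, would have to be built and stored by the application itself, which is the trade-off the slide describes.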
Dataspaces
HDF5 has a rich set of data subsetting functionality.
Example: displaying a thumbnail of a high-resolution image.
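The thumbnail example amounts to a regular, strided selection: take every n-th pixel in each dimension. The sketch below shows that idea on a nested list in plain Python. It is not the HDF5 API, although the `start`, `stride`, and `count` parameter names are chosen to echo the arguments of the C function `H5Sselect_hyperslab`. The 8 x 8 "image" is made-up data.

```python
from typing import List, Tuple


def strided_select(image: List[List[int]],
                   start: Tuple[int, int],
                   stride: Tuple[int, int],
                   count: Tuple[int, int]) -> List[List[int]]:
    """Pick count[0] x count[1] elements, beginning at start and stepping by stride."""
    rows = range(start[0], start[0] + stride[0] * count[0], stride[0])
    cols = range(start[1], start[1] + stride[1] * count[1], stride[1])
    return [[image[r][c] for c in cols] for r in rows]


# Stand-in for a high-resolution image: 8 x 8 pixels, value = 10*row + column.
image = [[10 * r + c for c in range(8)] for r in range(8)]

# Every 4th pixel in each direction gives a 2 x 2 thumbnail.
thumbnail = strided_select(image, start=(0, 0), stride=(4, 4), count=(2, 2))
print(thumbnail)  # [[0, 4], [40, 44]]
```

The difference in practice is where the selection happens: here the whole image already sits in memory, whereas a selection applied to a dataset in a file describes which elements to transfer, so the full-resolution data need not be read just to build the thumbnail.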
Filters and Compression
HDF5 supports data filters, including compression, which transform data as it enters or leaves the file.
[Diagram: compressed data in the file ↔ compression filter ↔ uncompressed data in user's buffer]
Note that HDF5 data objects are filtered individually, not the entire file!
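The point about filtering objects individually can be illustrated with Python's standard `zlib` module, which implements the same deflate algorithm used by HDF5's gzip filter. This is only an analogy under stated assumptions: the dictionary standing in for a file, the dataset names, and the compression level are all invented for the example, and no HDF5 code is involved.

```python
import zlib
from typing import Dict

# Two hypothetical datasets: one large and repetitive, one small.
datasets: Dict[str, bytes] = {
    "spectra": bytes(100_000),            # 100,000 zero bytes, compresses well
    "sample_ids": b"A1,A2,A3,B1,B2,B3",
}

# "File": each object passes through the filter separately on the way in.
stored = {name: zlib.compress(raw, 6) for name, raw in datasets.items()}


def read(name: str) -> bytes:
    """Reverse the filter for one object only; the others stay compressed."""
    return zlib.decompress(stored[name])


# Reading the small object never touches the large one.
assert read("sample_ids") == datasets["sample_ids"]
print(len(datasets["spectra"]), "->", len(stored["spectra"]), "bytes")
```

Because each object carries its own filter, one dataset can be compressed while another is left raw, and reading a single object costs only that object's decompression.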
Higher Language Bindings
C++, Fortran (95 & 2003), Java, .NET, Python
• C++ & Fortran distributed with library
• Java distributed separately
• .NET distributed separately, not supported by THG (as-is)
• Python (PyTables, h5py) not distributed by THG
NOTE: HDF5 bindings are thin wrappers over the C API.
• There is no object-oriented interface to HDF5
• Not pure Java, .NET, etc.