Indexing HDF5: A Survey

Joel Plutchak
The HDF Group
Champaign, Illinois, USA
This work was supported by NASA/GSFC under
Raytheon Co. contract number NNG10HP02C

The Technology
The HDF5 hierarchical data file format and API are flexible: they support
self-describing, portable, and compact storage, as well as efficient I/O.
HDF5 is a well-described and well-supported format that is used in a
wide variety of disciplines.
July 14, 2015

The Problem
The HDF5 API does not include mechanisms to
efficiently find and access data based on data values,
the way one would query a relational database.
Members of the HDF community have developed
this capability so that their applications can
rapidly search for and select interesting portions of
data based on ad hoc search criteria.

A Solution
Solutions to this problem are collectively called indexing.
An indexing layer is added between the application and
the HDF5 API. It builds an index on one or more
parameters and saves enough information in that index
to find and retrieve specific parts of one or more
datasets in an HDF5 file more efficiently.
[Diagram labels: Application, Query, Index, HDF5 API, HDF5 File]
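As a rough illustration of the idea, the Python sketch below builds a tiny in-memory index over one variable and uses it to answer a range query without scanning every value. It is not taken from any of the packages surveyed here: the class name SortedValueIndex, the sample values, and the choice of a sorted-key index are all assumptions made for this example.

```python
import bisect


class SortedValueIndex:
    """Toy index (hypothetical): values sorted once, paired with row positions."""

    def __init__(self, values):
        pairs = sorted((v, row) for row, v in enumerate(values))
        self._keys = [v for v, _ in pairs]      # sorted data values
        self._rows = [row for _, row in pairs]  # original position of each key

    def range_query(self, low, high):
        """Return the row positions whose value v satisfies low < v < high."""
        start = bisect.bisect_right(self._keys, low)   # first key > low
        stop = bisect.bisect_left(self._keys, high)    # first key >= high
        return sorted(self._rows[start:stop])


# Stand-in for one variable read from a dataset (made-up values).
temperatures = [41.0, 38.5, 39.1, 42.6, 40.2, 44.9, 39.7]
index = SortedValueIndex(temperatures)
hits = index.range_query(39.1, 42.6)
```

The index costs extra storage and build time, but each query then needs two binary searches instead of a full scan. The returned row positions are what an application would use to read only the matching elements.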

Implementations
Implementations exist for adding indexed
access to HDF5 files. A few of them are:
• PyTables
• FastQuery / FastBit
• Alacrity
• HDF5 (prototype)
• Other experimental work in progress

PyTables
• Uses the Python programming language
• Built on top of the HDF5 library and the
NumPy package
• Uses Optimized Partially Sorted Index
(OPSI) technology designed for fast access
to very large (>100M rows) tables

PyTables
• Example
– Create a table:
    table = h5file.create_table(group, 'readout', Particle,
                                "Readout example")
– Query a table:
    condition = '(name == "Particle: 5") | (name == "Particle: 7")'
    for record in table.where(condition):
        pass  # do something with "record"

PyTables
Limitations
• No support for relationships between datasets
Future work:
• No specifics; a continuing effort that welcomes
additional developers, testers, and users
• Future maintenance and extended
development proposals underway
• The HDF Group is very interested in taking a
significant role in this work as it moves
forward.

Alacrity
• Analytics-Driven Lossless Data
Compression for Rapid In-Situ Indexing,
Storing, and Querying
• Exploits the representation of floating-point
values by binning on the most significant bits, using
an inverted index to map each bin to the positions
of the values it contains
• The software is a research vehicle for a
group at North Carolina State University
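The binning idea can be sketched in a few lines of Python. This is only an illustration of binning on high-order bits with an inverted index, not the ALACRITY code: the function names, the choice of 16 high-order bits, and the restriction to non-negative values are assumptions, and the lossless compression of the remaining low-order bits is left out entirely.

```python
import struct
from collections import defaultdict

HIGH_BITS = 16  # assumption: sign, exponent, and top 4 mantissa bits of a float64


def bin_key(value):
    """Return the top HIGH_BITS bits of the IEEE 754 double representation."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return bits >> (64 - HIGH_BITS)


def build_inverted_index(values):
    """Map each bin key to the list of positions whose value falls in that bin."""
    index = defaultdict(list)
    for pos, v in enumerate(values):
        index[bin_key(v)].append(pos)
    return index


def range_query(values, index, low, high):
    """Positions with low <= v <= high.

    Assumes all data values and both bounds are non-negative, so that
    bin keys increase with the values they cover.
    """
    lo_key, hi_key = bin_key(low), bin_key(high)
    hits = []
    for key, positions in index.items():
        if lo_key < key < hi_key:
            hits.extend(positions)  # bin lies entirely inside the range
        elif key == lo_key or key == hi_key:
            # boundary bin: check the actual values
            hits.extend(p for p in positions if low <= values[p] <= high)
    return sorted(hits)


samples = [0.5, 1.25, 3.0, 3.1, 7.75, 100.0, 2.0]  # made-up data
inverted = build_inverted_index(samples)
hits = range_query(samples, inverted, 1.0, 4.0)
```

Only the two boundary bins require looking at individual values; every bin strictly between them is accepted wholesale from the inverted index.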

FastQuery / FastBit
• FastQuery is an extension to HDF5 from the
Visualization Group at Lawrence Berkeley National
Laboratory (LBNL)
• Based on LBNL’s FastBit, an efficient searching
technology that uses bitmap indexing for processing
complex, multi-dimensional ad hoc queries on read-only
numeric data
• Extends HDF5’s hyperslab selection mechanism to
allow arbitrary range conditions on the data values
contained in the datasets
• Compound queries can span multiple datasets
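To make the bitmap approach concrete, here is a minimal Python sketch of a binned bitmap index answering a compound range query over two variables. It does not use the FastBit or FastQuery APIs: the class BinnedBitmapIndex, the helper rows_of, the bin edges, and the sample data are all hypothetical, and plain Python integers stand in for the compressed bitmaps a production bitmap index would use.

```python
import bisect


def rows_of(bitmap):
    """Yield the row numbers whose bit is set in an integer bitmap."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


class BinnedBitmapIndex:
    """Toy index: one bitmap per bin, where bin k holds edges[k] <= v < edges[k+1]."""

    def __init__(self, values, edges):
        self.values = values
        self.edges = edges
        self.bitmaps = [0] * (len(edges) - 1)
        for row, v in enumerate(values):
            k = bisect.bisect_right(edges, v) - 1
            if not 0 <= k < len(self.bitmaps):
                raise ValueError(f"value {v!r} is outside the bin edges")
            self.bitmaps[k] |= 1 << row

    def between(self, low, high):
        """Return a bitmap of the rows with low <= v < high."""
        result = 0
        for k, bitmap in enumerate(self.bitmaps):
            lo_edge, hi_edge = self.edges[k], self.edges[k + 1]
            if low <= lo_edge and hi_edge <= high:
                result |= bitmap  # bin entirely inside the range
            elif lo_edge < high and low < hi_edge:
                # partially covered bin: check candidate rows against the data
                for row in rows_of(bitmap):
                    if low <= self.values[row] < high:
                        result |= 1 << row
        return result


# Two made-up variables with one value per row.
temperature = [41.0, 38.5, 39.1, 42.6, 40.2, 44.9, 39.7]
pressure = [1012.0, 1001.0, 995.0, 1018.0, 1025.0, 1009.0, 1015.0]

temp_index = BinnedBitmapIndex(temperature, [35.0, 40.0, 45.0, 50.0])
pres_index = BinnedBitmapIndex(pressure, [990.0, 1000.0, 1010.0, 1020.0, 1030.0])

# Compound query across both variables: a bitwise AND of two bitmaps.
hits = temp_index.between(39.1, 42.6) & pres_index.between(1000.0, 1020.0)
matching_rows = list(rows_of(hits))
```

Because each condition produces a bitmap, combining conditions on different variables reduces to bitwise AND and OR operations, which is what makes this style of index attractive for ad hoc multi-variable queries.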

FastQuery / FastBit
Assumptions
• Data is block-structured, with 0 to 3 dimensions
• Datatypes are limited to float, double, int32, int64, byte
• Two-level hierarchical organization: TimeStep,
VariableName
Future work:
• Arbitrary nesting
• More data schemas (unstructured, AMR, etc.)

HDF5 Data Analysis Extensions
The HDF Group is developing support for indexing and
querying to enable application developers to create
complex and high-performance queries on both
metadata and data elements within an HDF5 container.
These are in the form of objects and associated APIs:
– Query Objects: The H5Q API is used to define a
query and apply it to an HDF5 container
– View Objects: The H5V API is used to generate a
selection from a query
– Index Objects: The H5X API is used to build an
index and attach it to data; it is plug-in based to
leverage multiple indexing technologies
Note: These extensions were developed under Intel’s subcontract with Lawrence Livermore
National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.

HDF5 Data Analysis Extensions Example
Add index to existing dataset:
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    /* Add indexing information */
    H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
    H5Dclose(dataset);
Create and apply query:
    float query_lb = 39.1f, query_ub = 42.6f;
    hid_t query, query1, query2, dataspace;
    /* Create a simple query: 39.1 < x */
    query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN, H5T_NATIVE_FLOAT, &query_lb);
    /* Create a second simple query: x < 42.6 */
    query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN, H5T_NATIVE_FLOAT, &query_ub);
    /* Combine the queries: 39.1 < x < 42.6 */
    query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);
    /* Use query to get selection */
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    H5Dquery(dataset, query, &dataspace);
    /* Read data here using dataspace */
    H5Dclose(dataset);
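The create, combine, and apply pattern in the example above can be mimicked in plain Python to show what the resulting selection contains. This is a conceptual sketch only: the Query class and the select function are hypothetical stand-ins, not bindings to the H5Q or H5D prototype APIs, and the selection is computed here by a linear scan rather than through an index.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a query object: a predicate on one element."""
    test: Callable[[float], bool]

    @staticmethod
    def compare(op: Callable[[float, float], bool], bound: float) -> "Query":
        """Simple query such as x > bound or x < bound."""
        return Query(lambda x: op(x, bound))

    def __and__(self, other: "Query") -> "Query":
        """Combine two queries; both must match."""
        return Query(lambda x: self.test(x) and other.test(x))


def select(data: Sequence[float], query: Query) -> List[int]:
    """Return the positions of the elements that satisfy the query."""
    return [i for i, x in enumerate(data) if query.test(x)]


query1 = Query.compare(operator.gt, 39.1)   # 39.1 < x
query2 = Query.compare(operator.lt, 42.6)   # x < 42.6
query = query1 & query2                     # 39.1 < x < 42.6

data = [41.0, 38.5, 39.1, 42.6, 40.2, 44.9, 39.7]  # made-up dataset contents
selection = select(data, query)
```

Keeping the query separate from the way it is evaluated is what lets an index be swapped in behind the same interface: the query states what to find, and an attached index decides how to find it quickly.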

HDF5 Data Analysis Extensions Status
Phase I status (2014):
• Prototype implementations for H5Q, H5V, H5X APIs
• H5X API plugins for Alacrity and FastBit technologies
• Incremental update of data is not supported by indexing
packages
Current work (started July 1):
• Views generated from queries to abstract selection results on
multiple objects
• Support for indexing on chunked datasets
• Support for compound types
• Support for parallel indexing
• Query optimization
• Additional indexing plugins

Summary
• A variety of indexing methods exist that can be
used to speed targeted access to data in
HDF5 files.
• Capabilities and underlying technologies
differ, so choose the best fit for your application.
• Work is ongoing… let developers know of
your needs and experiences!

References & Sources
PyTables
• http://www.pytables.org/index.html
Alacrity
• J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N.
Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova,
“ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ
Indexing, Storing, and Querying,” Transactions on Large-Scale Data- and
Knowledge-Centered Systems, Vol 10 (2013).
FastQuery / FastBit
• http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/
• K. Wu, “FastBit: an efficient indexing technology for accelerating data-intensive
science,” Journal of Physics: Conference Series, vol. 16, no. 1 (2005)
• “HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval,
Indexing and Querying,” Report Number LBNL/PUB-958 (2006)
HDF5 Data Analysis Extensions
• J. Soumagne, Q. Koziol, RFC: Data Analysis Extensions, RFC THG 2014-07-17.v4;
The HDF Group (2014)

