DM_PPT_NP_v01SESIP_0715_JP
Indexing HDF5: A Survey
Joel Plutchak
The HDF Group
Champaign Illinois USA
This work was supported by NASA/GSFC under
Raytheon Co. contract number NNG10HP02C
The Technology
The HDF5 hierarchical data file format and
API are flexible: they support self-describing,
portable, and compact storage, as well as
efficient I/O.
July 14, 2015
It is a well-described
and well-supported
format that is used in a
wide variety of
disciplines.
The Problem
The HDF5 API does not include mechanisms to
efficiently find and access data based on data values,
the way one would query a relational database.
Members of the HDF
community have developed
this capability so that their
applications can quickly
access targeted pieces of
data, rapidly searching for and
selecting interesting portions
based on ad hoc search
criteria.
A Solution
Indexing is one solution to this problem.
A layer added between the HDF5 API and an
application builds an index on one or more
parameters, saving enough information in the
index to find and retrieve specific parts of one
or more datasets in an HDF5 file more
efficiently.
[Diagram: Application, Query, Index, HDF5 API, HDF5 File]
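The idea of an index layer can be illustrated with a minimal sketch in plain Python. It is not taken from any of the packages surveyed here: it assumes the data fits in memory as a list, and the names `build_index` and `range_query` are hypothetical. The index is a sorted copy of the values together with their original positions, so a range query becomes two binary searches instead of a full scan.

```python
import bisect

def build_index(values):
    """Return (sorted_values, positions): the values in sorted order
    and, for each one, its position in the original data."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order

def range_query(index, low, high):
    """Return the original positions of all values v with low < v < high."""
    sorted_values, positions = index
    start = bisect.bisect_right(sorted_values, low)  # first value > low
    stop = bisect.bisect_left(sorted_values, high)   # first value >= high
    return sorted(positions[start:stop])

# Hypothetical sample data standing in for one HDF5 dataset.
temperatures = [41.0, 12.5, 39.1, 40.2, 55.0, 42.6, 39.5]
index = build_index(temperatures)
hits = range_query(index, 39.1, 42.6)
```

The positions in `hits` play the role of a selection: the application would use them to read only the matching elements from the file.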
Implementations
Implementations exist for adding indexed
access to HDF5 files. A few of them are:
• PyTables
• FastQuery / FastBit
• Alacrity
• HDF5 (prototype)
• Other experimental work in progress
PyTables
• Uses the Python programming language
• Built on top of the HDF5 library and the
NumPy package
• Uses Optimized Partially Sorted Index
(OPSI) technology designed for fast access
to very large (>100M rows) tables
PyTables
• Example
– Create a table:
table = h5file.create_table(group, 'readout', Particle,
                            "Readout example")
– Query a table:
condition = '(name == "Particle: 5") | (name == "Particle: 7")'
for record in table.where(condition):
    pass  # do something with "record"
PyTables
Limitations
• No support for relationships between datasets
Future work:
• No specifics; a continuing effort that welcomes
additional developers, testers, and users
• Future maintenance and extended
development proposals underway
• The HDF Group is very interested in taking a
significant role in this work as it moves
forward.
Alacrity
• Analytics-Driven Lossless Data
Compression for Rapid In-Situ Indexing,
Storing, and Querying
• Exploits the representation of floating-point
values by binning on significant bits, using
an inverted index to map each bin
• The software is a research vehicle for a
group at North Carolina State University
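The binning idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the ALACRITY implementation: it assumes non-negative, non-NaN doubles, uses the top two bytes of the IEEE 754 representation as the bin key, and does not model ALACRITY's compression of the remaining bytes. The names `bin_key`, `build_inverted_index`, and `range_query` are hypothetical.

```python
import struct
from collections import defaultdict

def bin_key(value):
    """Top two bytes of the big-endian IEEE 754 double:
    sign bit, 11 exponent bits, and the 4 leading mantissa bits."""
    return struct.pack(">d", value)[:2]

def build_inverted_index(values):
    """Map each bin key to the list of rows whose value falls in that bin."""
    index = defaultdict(list)
    for row, value in enumerate(values):
        index[bin_key(value)].append(row)
    return index

def range_query(values, index, low, high):
    """Rows with low <= value <= high. Assumes 0 <= low <= high, so that
    byte order of the keys matches numeric order of the values."""
    key_lo, key_hi = bin_key(low), bin_key(high)
    rows = []
    for key, members in index.items():
        if key_lo < key < key_hi:
            rows.extend(members)  # interior bin: every member matches
        elif key == key_lo or key == key_hi:
            # boundary bin: fall back to checking the actual values
            rows.extend(r for r in members if low <= values[r] <= high)
    return sorted(rows)

# Hypothetical sample data.
values = [0.5, 1.25, 3.0, 3.5, 7.75, 8.0, 100.0, 2.0]
index = build_inverted_index(values)
hits = range_query(values, index, 1.0, 8.0)
```

Only the two boundary bins require looking at the data itself; rows in interior bins are accepted from the index alone.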
FastQuery / FastBit
• FastQuery is an extension to HDF5 from the
Visualization Group at Lawrence Berkeley National
Laboratory (LBNL)
• Based on LBNL’s FastBit, an efficient searching
technology that uses bitmap indexing for processing
complex, multi-dimensional ad hoc queries on read-only
numeric data
• Extends HDF5’s hyperslab selection mechanism to
allow arbitrary range conditions on the data values
contained in the datasets
• Compound queries can span multiple datasets
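A binned bitmap index, and a compound query over two variables, can be sketched as follows. This is a simplified model in plain Python, not FastBit or the FastQuery API: each bitmap is an uncompressed Python integer (FastBit's bitmap compression is not modeled), and the function names, bin edges, and sample data are all hypothetical.

```python
import bisect

def build_bitmap_index(values, edges):
    """One bitmap per bin; bit r is set when row r falls in that bin.
    Bin i covers edges[i] <= value < edges[i + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        if not edges[0] <= value < edges[-1]:
            raise ValueError(f"value {value} is outside the binned range")
        bitmaps[bisect.bisect_right(edges, value) - 1] |= 1 << row
    return bitmaps

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

def range_bitmap(values, edges, bitmaps, low, high):
    """Bitmap of the rows with low <= value < high."""
    result = 0
    for i, bitmap in enumerate(bitmaps):
        left, right = edges[i], edges[i + 1]
        if low <= left and right <= high:
            result |= bitmap  # bin lies entirely inside the range
        elif left < high and low < right:
            # bin straddles a range boundary: check the raw values
            for row in rows_of(bitmap):
                if low <= values[row] < high:
                    result |= 1 << row
    return result

# Two hypothetical variables stored as separate datasets of equal length.
temperature = [35.0, 39.5, 41.0, 44.0, 40.2, 38.9]
pressure = [1.2, 0.8, 1.5, 1.1, 0.4, 1.9]
t_edges = [30.0, 35.0, 40.0, 45.0, 50.0]
p_edges = [0.0, 0.5, 1.0, 1.5, 2.0]
t_index = build_bitmap_index(temperature, t_edges)
p_index = build_bitmap_index(pressure, p_edges)

# Compound query: 39.1 <= temperature < 42.6 AND 1.0 <= pressure < 2.0
hits = (range_bitmap(temperature, t_edges, t_index, 39.1, 42.6)
        & range_bitmap(pressure, p_edges, p_index, 1.0, 2.0))
matching_rows = rows_of(hits)
```

Combining conditions on different datasets reduces to a bitwise AND of their result bitmaps, which is why bitmap indexes suit multi-variable ad hoc queries.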
FastQuery / FastBit
Assumptions
• Data is:
– 0-3 dimensional block-structured
– Limited datatypes: float, double, int32, int64, byte
• Two-level hierarchical organization: TimeStep,
VariableName
Future work:
• Arbitrary nesting
• More data schemas (unstructured, AMR, etc.)
HDF5 Data Analysis Extensions
The HDF Group is developing support for indexing and
querying to enable application developers to create
complex and high-performance queries on both
metadata and data elements within an HDF5 container.
These are in the form of objects and associated APIs:
– Query Objects: The H5Q API is used to define a
query and apply it to an HDF5 container
– View Objects: The H5V API is used to generate a
selection from a query
– Index Objects: The H5X API is used to attach /
build an index to data; it is plug-in based to
leverage multiple technologies
Note: These extensions were developed under Intel’s subcontract with Lawrence Livermore
National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.
HDF5 Data Analysis Extensions Example
Add index to existing dataset:
dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
/* Add indexing information */
H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
H5Dclose(dataset);
Create and apply query:
float query_lb = 39.1f, query_ub = 42.6f;
hid_t query, query1, query2, dataspace;
/* Create a simple query: 39.1 < x */
query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN, H5T_NATIVE_FLOAT, &query_lb);
/* Create a second simple query: x < 42.6 */
query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN, H5T_NATIVE_FLOAT, &query_ub);
/* Combine queries: 39.1 < x < 42.6 */
query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);
/* Use query to get selection */
dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
H5Dquery(dataset, query, &dataspace);
/* Read data here using dataspace */
H5Dclose(dataset);
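The create, combine, and apply pattern in the example above can be modeled in plain Python to show what the resulting selection contains. This is a hypothetical sketch, not a binding to the H5Q API: the `Query` class and the helper functions are invented for illustration, and the query is applied by a full scan rather than through an index.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    """A predicate over a single data element (hypothetical stand-in
    for a query object)."""
    matches: Callable[[float], bool]

    def __and__(self, other):
        # Combine two queries; an element must satisfy both.
        return Query(lambda x: self.matches(x) and other.matches(x))

def greater_than(bound):
    return Query(lambda x: x > bound)

def less_than(bound):
    return Query(lambda x: x < bound)

def apply_query(data, query):
    """Return the selection: indices of the elements that match."""
    return [i for i, x in enumerate(data) if query.matches(x)]

# Hypothetical dataset contents.
data = [38.0, 39.1, 40.5, 42.0, 42.6, 45.3]
query = greater_than(39.1) & less_than(42.6)  # 39.1 < x < 42.6
selection = apply_query(data, query)
```

An attached index would let the library produce the same selection without reading every element.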
HDF5 Data Analysis Extensions Status
Phase I status (2014):
• Prototype implementations for H5Q, H5V, H5X APIs
• H5X API plugins for Alacrity and FastBit technologies
• Incremental update of data is not supported by indexing
packages
Current work (started July 1):
• Views generated from queries to abstract selection results on
multiple objects
• Support for indexing on chunked datasets
• Support for compound types
• Support for parallel indexing
• Query optimization
• Additional indexing plugins
Summary
• A variety of index methods exist that can be
used to speed targeted access to data in
HDF5 files.
• Capabilities and underlying technologies
differ, so choose the best fit for your application.
• Work is ongoing… let developers know of
your needs and experiences!
References & Sources
PyTables
• http://www.pytables.org/index.html
Alacrity
• J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N.
Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova,
“ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ
Indexing, Storing, and Querying,” Transactions on Large-Scale Data- and
Knowledge-Centered Systems, vol. 10 (2013)
FastQuery / FastBit
• http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/
• K. Wu, “FastBit: an efficient indexing technology for accelerating data-intensive
science,” Journal of Physics: Conference Series, vol. 16, no. 1 (2005)
• “HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval,
Indexing and Querying,” Report Number LBNL/PUB-958 (2006)
HDF Data Analysis Extensions
• J. Soumagne, Q. Koziol, RFC: Data Analysis Extensions, RFC THG 2014-07-17.v4;
The HDF Group (2014)


Editor's Notes

  • #5 This talk will provide an overview of a few existing methods for indexing data in HDF5 format.
  • #6 {For each, give overview/description, strengths/weaknesses, example(s)?} {Probably won’t mention more experimental work, e.g., HDF5_SQL from Ohio State: http://web.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf …
  • #7 PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. It is built on top of the HDF5 library, using the Python language and the NumPy package. It is designed to be a fast yet extremely easy to use tool to interactively browse, process and search very large amounts of table and array data organized hierarchically. PyTables takes advantage of the object orientation and introspection capabilities offered by Python, the powerful data management features of HDF5, and NumPy’s flexibility and Numexpr’s high-performance manipulation of large sets of objects organized in a grid-like fashion.
  • #8 At the PyTables BOF at SciPy 2015, we discussed integrating the two main Python packages for HDF: PyTables and h5py. The first step will be to layer PyTables on top of h5py to create a more maintainable set of packages.
  • #10 ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying. For scientific data analysis in particular, methods based on generating heavyweight access acceleration structures, e.g. indexes, are becoming less feasible for ever-increasing dataset sizes. We present ALACRITY, demonstrating the effectiveness of a fused data and index encoding of scientific, floating-point data in generating lightweight data structures amenable to common types of queries used in scientific data analysis. We exploit the representation of floating-point values by extracting significant bytes, using the resulting unique values to bin the remaining data along fixed-precision boundaries. To optimize query processing, we use an inverted index, mapping each generated bin to a list of records contained within, allowing us to optimize query processing with attribute range constraints.
  • #11 The Visualization Group at Lawrence Berkeley National Laboratory (LBNL) is working on an extension to HDF5 called FastQuery. It is based on LBNL’s FastBit, an efficient searching technology that uses bitmap indexing for processing complex, multi-dimensional ad hoc queries on read-only numeric data. HDF5 supports complex selections based on multidimensional data coordinates (e.g. hyperslab selection). HDF5-FastQuery extends this mechanism to allow arbitrary range conditions on the data values contained in the datasets, using the bitmap indices to accelerate the query. The FastQuery technology can efficiently support compound queries that span multiple datasets. The initial implementation uses a wrapper API that is designed to facilitate storage of time series of multi-variable block-structured datasets, which are common in the sciences. In the future, the storage organization can be expanded to accommodate more complex data schemas such as unstructured meshes, chemistry, and particle datasets. Status: FastBit 2.0.2 and HDF5-FastQuery 0.8.4 released March 2015
  • #12 Indices are stored in the HDF5 file. AMR: Adaptive Mesh Refinement.
  • #13 The HDF Group is developing support for indexing and querying by defining components that can enable application developers to create complex and high-performance queries on both metadata and data elements within an HDF5 container and retrieve the results of applying those query operations. These components are in the form of several objects and accompanying APIs: Query Objects: The H5Q API is used to define a query and apply it to an HDF5 container View Objects: The H5V API is used to generate a selection from a query Index Objects: The H5X API is used to attach/ build an index to data; plug-in based; plugins for Alacrity and FastBit technologies currently exist.
  • #15 No incremental update support: Entire data has to be read to rebuild index
  • #17 http://www.pytables.org/index.html