2. Overview
• HDF storage schema for the cloud
• HDF Server features
• What’s new
• What’s next
• Demo
3. What is HDF5?
Depends on your point of view:
• a C API
• a data model
• a file format
Let’s imagine keeping the API and data model, but with a different (cloud-friendly) storage format.
4. HDF Sharded Schema
Big idea: map individual HDF5 objects (datasets, groups, chunks) to object storage objects.
Why a sharded data format?
• Limits the maximum size of any object
• Supports parallelism for reads/writes
• Only data that is modified needs to be updated
• Multiple clients can read/update the same “file”
• No need to manage free space
Legend:
• The dataset is partitioned into chunks
• Each chunk is stored as an object (file)
• Dataset metadata (type, shape, attributes, etc.) is stored in a separate object, as JSON text (illustrated below)
[Figure: each chunk (heavy outlines) is persisted as a separate object]
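To make the metadata-as-JSON idea concrete, here is a minimal sketch in Python of what a dataset’s metadata object might contain. The key names and values are illustrative placeholders, not the exact HSDS schema (the real schema is documented in the HSDS repo):

```python
import json

# Illustrative only: key names approximate the sharded schema, not quote it.
dataset_meta = {
    "id": "d-12345678-example",  # hypothetical dataset object id
    "type": {"class": "H5T_FLOAT", "base": "H5T_IEEE_F32LE"},
    "shape": {"class": "H5S_SIMPLE", "dims": [400, 600]},
    "layout": {"class": "H5D_CHUNKED", "dims": [100, 100]},
    "attributes": {"units": "degC"},
}

# The metadata object is stored as JSON text; each chunk would be stored
# separately as its own binary object, keyed by its chunk coordinates.
print(json.dumps(dataset_meta, indent=2))
```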
6. Implementations of the sharded schema
A storage format specification is nice, but it would be useful to have software that can actually read and write the format…
As it happens, we’ve created a software service that uses the schema: HSDS (Highly Scalable Data Service).
The software is available at:
https://github.com/HDFGroup/hsds
Note: HSDS was originally developed as a NASA ACCESS 2015 project:
https://earthdata.nasa.gov/esds/competitive-programs/access/hsds
7. Server Features
• Simple + familiar API
  • Clients can interact with the service using a REST API (see the sketch after this list)
  • SDKs provide language-specific interfaces (e.g. h5pyd for Python)
  • Clients can read/write just the data they need (as opposed to transferring entire files)
• Support for compression
• Container based
  • Runs in Docker, Kubernetes, or DC/OS
• Scalable performance
  • Recently accessed data can be cached in RAM
  • Requests can be parallelized across multiple nodes
  • More nodes → better performance
• Cluster based – any number of machines can be used to constitute the server
• Multiple clients can read/write to the same data source
• No limit to the amount of data that can be stored by the service
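As a sketch of what the REST interface looks like, the request below reads a slice of a dataset over plain HTTP. The endpoint shape and the select/domain query parameters follow the HDF REST API, but the server URL, domain path, and dataset UUID here are made-up placeholders:

```python
import requests

# Placeholder endpoint and ids; only the selected slice crosses the wire.
server = "http://my-hsds-endpoint.example.com"
dset_id = "d-00000000-0000-0000-0000-000000000000"  # placeholder dataset UUID

resp = requests.get(
    f"{server}/datasets/{dset_id}/value",
    params={"select": "[0:10,0:10]", "domain": "/shared/sample.h5"},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json()["value"])  # 10x10 slice returned as nested JSON lists
```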
8. Architecture
Legend:
• Client – any user of the service
• Load balancer – distributes requests to service nodes
• Service nodes – process requests from clients (with help from data nodes)
• Data nodes – each responsible for a partition of the object store
• Object store – the base storage service (e.g. AWS S3)
9. HDF API Compatibility
The sharded storage schema captures the HDF data model, and the REST service interface is nice, but it would be great if existing HDF-based applications and libraries could use the new storage format without requiring a bunch of code changes…
Two related projects provide a solution:
• h5pyd – an h5py-compatible package for Python
• REST VOL – an HDF5 library plugin for C/C++
10. h5pyd – Python client
• h5py is a popular Python package that provides a Pythonic interface to the HDF5 library
• h5pyd (for “h5py distributed”) provides an h5py-compatible interface for accessing the server (example below)
• Pure Python – uses the requests package to make HTTP calls to the server
• Includes several extensions to h5py:
  • List contents of folders
  • Get/set ACLs (access control lists)
  • PyTables-like query interface
• The h5netcdf and xarray packages will use h5pyd when http:// is prepended to the file path
• Installable from PyPI: $ pip install h5pyd
• Source code: https://github.com/HDFGroup/h5pyd
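A minimal sketch of the drop-in usage, assuming the server endpoint and credentials have already been configured (e.g. in ~/.hscfg) and using a hypothetical domain path:

```python
import h5pyd  # swap "import h5py" for "import h5pyd"; the calls stay the same

# Hypothetical domain path; endpoint/credentials come from ~/.hscfg
with h5pyd.File("/home/myuser/sample.h5", "r") as f:
    dset = f["/grid/temperature"]
    print(dset.shape, dset.dtype)
    # Only the selected hyperslab is transferred over HTTP
    subset = dset[0:10, 0:10]
```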
11. Supporting the Python Analytics Stack
Many Python users don’t use h5py directly, but tools higher up the stack: h5netcdf, xarray, pandas, etc.
Since h5pyd is compatible with h5py, we should be able to support the same stack for HDF Cloud.
[Diagram: the local stack (xarray → h5netcdf → h5py → HDF5 lib → disk) alongside the cloud stack (xarray → h5netcdf → h5pyd → HDF Server)]
Applications can switch between local and cloud access just by changing the file path (see the sketch below).
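A sketch of that idea: select the backend from the path alone, so everything downstream is unchanged. The paths and the helper function name are placeholders:

```python
import h5py
import h5pyd

def open_hdf(path, mode="r"):
    """Open with h5pyd for server paths, h5py for local files."""
    if path.startswith(("http://", "https://")):
        return h5pyd.File(path, mode)  # routed through HDF Server
    return h5py.File(path, mode)       # routed through the HDF5 library

# Same downstream code either way:
# f = open_hdf("/tmp/sample.h5")                         # local file
# f = open_hdf("http://myserver/home/myuser/sample.h5")  # server domain
```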
12. REST VOL Plugin
• The HDF5 VOL architecture is a plugin layer for HDF5
• The public API stays the same, but different back ends can be implemented
• The REST VOL substitutes REST API requests for file I/O actions
• C/Fortran applications should be able to run as is
• Some features are not implemented yet:
  • VLEN support
  • Large read/write support (selections >100 MB)
• Downloadable from: https://github.com/HDFGroup/vol-rest
13. Command Line Interface (CLI)
• Accessing HDF via a service means the usual shell commands (ls, rm, chmod, etc.) can’t be used
• The command line tools are a set of simple apps to use instead:
  • hsinfo – display server version and connection info
  • hsls – list contents of a folder or file
  • hstouch – create a folder or file
  • hsdel – delete a file
  • hsload – upload an HDF5 file
  • hsget – download content from the server to an HDF5 file
  • hsacl – create/list/update ACLs (access control lists)
  • hsdiff – compare an HDF5 file with its sharded representation
• Implemented in Python, using h5pyd
• Note: data is round-trippable:
  • HDF5 file → hsload → HSDS store → hsget → HDF5 file
14. Supporting traditional HDF5 files
We’ve discussed three aspects of HDF: the data model, the API, and the file format. With HSDS we’ve kept the data model and API, but the file format is radically different. But maybe you have a PB or two of HDF5 files you’d like to use…
• If you already have HDF5 files stored in the cloud, they can be accessed by HDF Server
• Rather than converting the entire file to the HDF schema, only the metadata needs to be imported (typically <1% of the file)
• Dataset reads are converted to S3 range GETs on the stored file (see the sketch below)
• The hsload CLI tool has an option (--link) for loading file metadata
• It is also possible to construct a server file that aggregates multiple stored files (similar to how the HDF5 library’s VDS feature works)
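To illustrate what an S3 range GET looks like under the hood, here is a hedged boto3 sketch. The bucket, key, and byte offsets are made up; in practice the server computes the real offsets from the linked chunk layout in the imported metadata:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key/offsets: fetch only the bytes for one chunk
resp = s3.get_object(
    Bucket="my-hdf-bucket",
    Key="data/sample.h5",
    Range="bytes=20480-24575",  # chunk offset/length from the file metadata
)
chunk_bytes = resp["Body"].read()
print(len(chunk_bytes))  # 4096 bytes: one chunk, not the whole file
```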
15. New HSDS features
HSDS version 0.6 is coming soon…
What’s new:
• POSIX support – store content on regular disk drives
• Azure
  • Azure Blob support – support for Azure’s object storage format
  • AKS (Azure Kubernetes Service) – run in Azure’s managed Kubernetes
  • Active Directory authentication – authenticate via AD
• AWS
  • Added support for AWS Lambda
• DC/OS – support for the DC/OS (Apache Mesos) distributed system
• Domain checksums – verify when any content changes
• Role-based access control (RBAC) – manage ACLs for user groups
The complete list is here: https://github.com/HDFGroup/hsds/issues/47
17. AWS Lambda Functions
• HSDS can parallelize requests across all the available backend (“DN”) nodes on the server
• AWS Lambda is an AWS service that runs requests “serverless”
  • Pay only for the CPU-seconds the function runs
• By incorporating Lambda, some HDF Server requests can be parallelized across 1,000 Lambda functions (equivalent to a 1,000-container server)
• This will dramatically speed up time-series selections
18. Kita Lab
• Kita Lab is a JupyterLab and HDF Server environment hosted by the HDF Group on AWS
• Kita Lab users can create Python notebooks that use h5pyd to connect to HDF Server
• Each user gets the equivalent of a 2-core Xeon server and 10 GB of local storage
• Users can use up to 100 GB of data on HDF Server
• Sign up here: https://www.hdfgroup.org/hdfkitalab/
[Diagram: the user logs into JupyterHub, which spawns a new container (with an EBS volume) for the user at login; notebooks connect to HSDS running on Kubernetes, backed by an S3 bucket]
19. Futures
• Sometimes you’d rather do without a server and talk to the storage system directly:
  • You don’t want to deal with setting up a service
  • You don’t want to worry about scaling the service up and down with client load
  • You don’t need the synchronization (e.g. managing multiple clients writing to the same dataset) that a service provides
• HS Direct Access will be a new VOL connector that enables this for the HDF5 library
  • Will take advantage of multiple cores
  • Uses the same schema as HSDS (and can be used in conjunction with HSDS)
The design doc is here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md