
HDF Cloud: HDF5 at Scale


Published at: HDF and HDF-EOS Workshop XXI (2018)

Published in: Software


  1. Proprietary and Confidential. Copyright 2018, The HDF Group. HDF Cloud: HDF5 at Scale
  2. What is HDF5? It depends on your point of view:
     • a C API
     • a file format
     • a data model
     The file format is just a container for the data. Dropping the file-centric view of HDF5 allows us to create a cloud version of HDF5 more flexibly.
  3. Why HDF in the Cloud?
     • It can provide a cost-effective infrastructure
       • Pay for what you use vs. pay for what you may need
       • Lower overhead: no hardware setup, network configuration, etc.
     • Benefit from cloud-based technologies:
       • Elastic compute: scale compute resources dynamically
       • Object-based storage: low cost, built-in redundancy
     • Community platform
       • Enables interested users to bring their applications to the data
       • Share data among many users
  4. Cost Factors. Most public clouds bill per usage. For HDF in the cloud, there are three big cost drivers:
     • Storage: which storage system will be used? (S3 vs. EBS vs. EFS)
     • Compute: elastic compute on demand beats fixed cost (scale compute to usage, not to the size of the data)
     • Data egress: ingress is free, but getting data out (egress) will cost you ($0.09/GB). Enabling users to fetch only the data they need lowers egress charges.
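A back-of-the-envelope sketch of the egress point above, using the $0.09/GB figure from the slide (real AWS pricing is tiered and changes over time; the file sizes are illustrative):

```python
# Egress pricing sketch; the $0.09/GB rate comes from the slide,
# actual AWS rates are tiered and region-dependent.
EGRESS_PER_GB = 0.09  # USD; ingress is free

def egress_cost(gb_downloaded: float) -> float:
    # Cost of moving the given number of gigabytes out of the cloud
    return gb_downloaded * EGRESS_PER_GB

# Downloading a whole 500 GB file vs. reading just the 2 GB you need:
whole_file = egress_cost(500)   # ~45 USD
subset     = egress_cost(2)     # ~0.18 USD
```

This is the core economic argument for a service that lets clients select subsets server-side rather than downloading entire files.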
  5. HDF Cloud Overview
     • RESTful interface to HDF5 using object storage
       • Storage using AWS S3 (portable to most other object storage systems)
       • Built-in redundancy
       • Cost effective
       • Scalable throughput
     • Runs as a cluster of Docker containers
       • Elastically scale compute with usage
     • Feature compatible with the HDF5 library
     • Implemented in Python using asyncio
       • Task-oriented parallelism
  6. Object Storage Challenges for HDF
     • Not POSIX!
     • High latency (>0.1 s) per request
     • Not read-after-write consistent
     • High throughput needs some tricks (use many asynchronous requests)
     • Request charges can add up (public cloud)
     Using the HDF5 library directly on an object storage system is a non-starter; an alternative solution is needed.
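The "many async requests" trick can be sketched with Python's asyncio, the framework the service itself is built on. `fetch_chunk` here is a stand-in for a real S3 GET (e.g. via aiobotocore); the sleep simulates the >0.1 s per-request latency:

```python
import asyncio

# Hypothetical sketch: hide per-request object-storage latency by
# issuing many GETs concurrently instead of one at a time.
async def fetch_chunk(chunk_id: int) -> bytes:
    await asyncio.sleep(0.1)        # simulate ~0.1 s object-storage latency
    return b"data-%d" % chunk_id    # stand-in for the chunk bytes

async def fetch_all(chunk_ids):
    # One task per chunk: total wall time is roughly one round trip,
    # not len(chunk_ids) round trips.
    return await asyncio.gather(*(fetch_chunk(c) for c in chunk_ids))

chunks = asyncio.run(fetch_all(range(16)))
```

Sixteen sequential requests would take ~1.6 s; issued concurrently they complete in ~0.1 s, which is why throughput scales with request concurrency rather than per-request latency.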
  7. HDF Cloud Schema. Big idea: map individual HDF5 objects (datasets, groups, chunks) to object storage objects.
     • Limits maximum storage object size
     • Supports parallelism for read/write
     • Only data that is modified needs to be updated
     • Multiple clients can read/update the same “file”
     [Diagram: how HDF5 content is stored in S3]
     • The dataset is partitioned into chunks
     • Each chunk gets persisted as a separate S3 object
     • Dataset metadata (type, shape, attributes, etc.) is stored in a separate object as JSON text
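A minimal sketch of the mapping the schema implies. The key format and field names below are illustrative only, not the actual HDF Cloud S3 schema:

```python
import json

# Illustrative only: each chunk becomes its own object, keyed by the
# dataset id plus the chunk's index in the chunk grid, and dataset
# metadata lives in a separate JSON object.
def chunk_key(dset_id: str, chunk_index: tuple) -> str:
    # e.g. ("d-1234", (2, 5)) -> "d-1234/2_5"
    return dset_id + "/" + "_".join(str(i) for i in chunk_index)

def meta_object(dtype: str, shape: tuple) -> str:
    # Dataset metadata (type, shape) serialized as JSON text
    return json.dumps({"type": dtype, "shape": list(shape)})
```

Because each chunk has its own key, a write touches only the objects for the chunks it modifies, and independent clients can read or update different chunks of the same "file" in parallel.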
  8. Architecture
     [Diagram legend:]
     • Client: any user of the service
     • Load balancer: distributes requests to service nodes
     • Service nodes: process requests from clients (with help from data nodes)
     • Data nodes: each responsible for a partition of the object store
     • Object store: base storage service (e.g. AWS S3)
  9. Architecture: Data Access Options
     Client SDKs for Python and C are drop-in replacements for the libraries used with local files, so no significant code change is needed to access local or cloud-based data. Clients do not know the details of the data or the storage system.
     [Diagram: browsers and web applications, Python applications (h5pyd), C/Fortran applications (HDF5 library with the REST Virtual Object Layer or the S3 Virtual File Driver), and command line tools all reach the HDF services through the REST API, alongside community conventions]
  10. Supporting the Python Analytics Stack
      Many Python users don’t use h5py directly, but tools higher up the stack: h5netcdf, xarray, pandas, etc. Since h5pyd is compatible with h5py, the same stack can be supported for HDF Cloud.
      [Diagram: local stack — xarray → h5netcdf → h5py → HDF5 library → disk; cloud stack — xarray → h5netcdf → h5pyd → HDF Server]
      Applications can switch between local and cloud access just by changing the file path.
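The path-based switch can be sketched as below. The `hdf5://` prefix is a made-up convention for this example; real h5pyd configuration pairs a server endpoint with a domain path, so treat this purely as an illustration of the idea:

```python
# Illustrative sketch of switching backends on file path alone.
# "hdf5://" is a hypothetical prefix, not actual h5pyd syntax.
def pick_backend(path: str) -> str:
    # An application would do `import h5pyd as h5py` for cloud paths
    # and plain `import h5py` for local ones; the rest of the code
    # (File, Dataset, slicing) is unchanged.
    return "h5pyd" if path.startswith("hdf5://") else "h5py"

cloud_lib = pick_backend("hdf5://shared/tall.h5")   # -> "h5pyd"
local_lib = pick_backend("/tmp/tall.h5")            # -> "h5py"
```

Because h5pyd mirrors the h5py API, libraries layered on top (h5netcdf, xarray) need no changes of their own.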
  11. HDF Cloud Features
      • Simple, familiar API
        • Clients can interact with the service using the REST API
        • SDKs provide a language-specific interface (e.g. h5pyd for Python)
        • Clients can read/write just the data they need (as opposed to transferring entire files)
      • Support for compression
      • Scalable performance:
        • Can cache recently accessed data in RAM
        • Can parallelize requests across multiple nodes
        • More nodes → better performance
      • Multiple clients can read/write the same data source
      • No limit to the amount of data the service can store
  12. h5pyd: Python Client
      • h5py is a popular Python package that provides a Pythonic interface to the HDF5 library
      • h5pyd (h5py distributed) provides an h5py-compatible interface for accessing the server
      • Pure Python: uses the requests package to make HTTP calls to the server
      • Includes several extensions to h5py:
        • List the contents of folders
        • Get/set ACLs (access control lists)
        • PyTables-like query interface
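To illustrate what a PyTables-like query does, here is a tiny pure-Python stand-in: select rows of a table that satisfy a condition expressed over field names. The function name and call shape are illustrative; consult the h5pyd documentation for its real query API:

```python
# Toy model of a PyTables-style query: `rows` stands in for a table
# dataset, and `condition` is an expression over its field names.
# This is NOT the h5pyd API, just an illustration of the concept.
def read_where(rows, condition):
    # Evaluate the condition with each row's fields as local variables
    return [r for r in rows if eval(condition, {}, r)]

rows = [{"temp": 30}, {"temp": 35}, {"temp": 40}]
hot = read_where(rows, "temp > 32")   # rows with temp above 32
```

The point of a server-side query interface is that this filtering happens in the service, so only matching rows cross the network.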
  13. REST VOL
      • The HDF5 VOL (Virtual Object Layer) architecture is a plugin layer for HDF5
      • The public API stays the same, but different back ends can be implemented
      • The REST VOL substitutes REST API requests for file I/O actions
      • C/Fortran applications should be able to run as is
  14. Command Line Interface (CLI)
      • Accessing HDF via a service means the usual shell commands (ls, rm, chmod, etc.) can’t be used
      • The command line tools are a set of simple apps to use instead:
        • hsinfo: display server version and connection info
        • hsls: list the contents of a folder or file
        • hstouch: create a folder or file
        • hsdel: delete a file
        • hsload: upload an HDF5 file
        • hsget: download content from the server to an HDF5 file
        • hsacl: create/list/update ACLs (access control lists)
      • Implemented in Python using h5pyd
  15. Getting Access to HDF Server
      • Option 1: HDF Kita Lab (JupyterLab environment)
        • Easy to use
        • Low cost
        • Shared HDF Server/S3 bucket
      • Option 2: HDF Kita Server
        • Run your own instance
        • Launch from AWS Marketplace (coming soon)
        • Pay your own AWS costs
      • Option 3: HDF Kita Server on premises
        • Roll your own: public or private cloud
        • Supported with OpenStack and Ceph
        • Talk to us about other technologies
  16. Futures: Supporting Traditional HDF5 Files
      • A downside of the HDF S3 schema is that data needs to be transmogrified
      • Since the bulk of the data is usually chunk data, it makes sense to combine the ideas of the S3 schema and the S3 VFD:
        • Convert just the metadata of the source HDF5 file to the S3 schema
        • Store the source file as a single S3 object
        • For data reads, the metadata provides the offset and length into the HDF5 file
        • An S3 range GET returns just the needed data
      • This approach can be used either directly or with HDF Server
      • Compared with the pure S3 VFD approach, it reduces the number of S3 requests needed
      • Work on supporting this is planned for later this year
      • Bonus round: access to GeoTIFF files
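The range-read step above can be sketched as follows. Building the HTTP Range header is standard (RFC 7233); the bucket/key names and the commented boto3 call are illustrative:

```python
# Sketch of the range-read idea: the converted metadata records where
# each chunk lives inside the original HDF5 file, so a single S3
# range GET fetches just that chunk's bytes.
def range_header(offset: int, length: int) -> str:
    # HTTP Range header for bytes [offset, offset + length - 1]
    return "bytes=%d-%d" % (offset, offset + length - 1)

hdr = range_header(1024, 4096)   # "bytes=1024-5119"

# Illustrative use with boto3 (requires AWS credentials and a real bucket):
# s3.get_object(Bucket="mybucket", Key="source.h5", Range=hdr)
```

One range GET per chunk read is far fewer requests than walking the file's internal B-trees object by object, which is where the savings over the pure S3 VFD come from.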
  17. Futures: Lambda Functions
      • HDF Server can parallelize requests across all the available back-end (“DN”) nodes on the server
      • AWS Lambda is a service that enables you to run requests “serverless”
        • Pay for just the CPU-seconds the function runs
      • By incorporating Lambda, some HDF Server requests could be parallelized across 1,000 Lambda functions (equivalent to a 1,000-container server)
      • Will dramatically speed up time-series selections
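Fanning a selection out to many workers reduces to partitioning it into roughly equal ranges, one per invocation. A minimal sketch (the partitioning scheme is an assumption, not HDF Server internals):

```python
# Hypothetical sketch: split a large time-series selection into
# (start, stop) ranges, one per Lambda invocation.
def partition(n_slices: int, n_workers: int):
    # Ceiling division so all slices are covered even when the
    # count doesn't divide evenly
    step = -(-n_slices // n_workers)
    return [(i, min(i + step, n_slices)) for i in range(0, n_slices, step)]

jobs = partition(100_000, 1000)   # 1000 ranges of 100 slices each
```

Each worker reads only its range, so a selection that would take one node thousands of sequential chunk reads completes in roughly the time of one worker's share.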
  18. Use Case
      NREL (the National Renewable Energy Laboratory) uses HDF Cloud to make 50 TB of wind simulation data accessible to the public. Datasets are three-dimensional, covering the continental US:
      • Time (one slice per hour)
      • Longitude (~2k resolution)
      • Latitude (~2k resolution)
      The data covers seven years (61,318 slices). It was delivered as 84 files of ~500 GB each, but was aggregated on load into one 50 TB “file”. As a result, rather than downloading terabytes of files, interested users can now use the HDF Cloud client libraries to explore this valuable data source.
  19. References
      • HDF Schema
      • SciPy 2017 talk
      • AWS Big Data Blog article: …from-wind-open-data-on-aws/
      • AWS S3 performance guidelines: …considerations.html