SlideShare a Scribd company logo
Proprietary and Confidential. Copyright 2018, The HDF Group.
HDF Cloud:
HDF5 at scale
2What is HDF5?
Depends on your point of view:
• a C-API
• a File Format
• a data model
The File format is just a container for
The data. Dropping this view of HDF
allows us to more flexibly create a cloud
version of HDF.
3Why HDF in the Cloud
• It can provide a cost-effective infrastructure
• Pay for what you use vs pay for what you may need
• Lower overhead: no hardware setup/network configuration, etc.
• Benefit from cloud-based technologies:
• Elastic compute – scale compute resources dynamically
• Object based storage – low cost/built in redundancy
• Community platform
• Enables interested users to bring their applications to the data
• Share data among many users
4Cost Factors
Most public clouds bill per usage
For HDF in the cloud, there are three big cost drivers:
• Storage: what storage system will be used? (S3 vs. EBS vs. EFS)
• Compute: elastic compute on demand better than fixed cost (scale compute to usage
not size of data)
• Data Egress:
• Ingress is free but getting data (egress) out will cost you ($0.09/GB)
• Enabling users to get only the data they need will lower egress charges
5HDF Cloud Overview
• RESTful interface to HDF5 using object storage
• Storage using AWS S3 (portable to most other object storage systems)
• Built in redundancy
• Cost effective
• Scalable throughput
• Runs as a cluster of Docker containers
• Elastically scale compute with usage
• Feature compatible with HDF5 library
• Implemented in Python using asyncio
• Task oriented parallelism
6Object Storage Challenges for HDF
• Not POSIX!
• High latency (>0.1s) per request
• Not write/read consistent
• High throughput needs some tricks (use many async requests)
• Request charges can add up (public cloud)
For HDF5, using the HDF5 library
directly on an object storage
system is a non-starter. Will need
an alternative solution…
7HDF Cloud Schema
Big Idea: Map individual
HDF5 objects (datasets,
groups, chunks) as Object
Storage Objects
• Limit maximum storage object size
• Support parallelism for read/write
• Only data that is modified needs to be
updated
• Multiple clients can be reading/updating
the same “file”
Legend:
• Dataset is partitioned into
chunks
• Each chunk stored as an S3
object
• Dataset meta data (type, shape,
attributes, etc.) stored in a
separate object (as JSON text)
How to store HDF5 content in S3?
Each chunk (heavy outlines) get
persisted as a separate object
8Architecture
Legend:
• Client: Any user of the service
• Load balancer – distributes requests to Service nodes
• Service Nodes – processes requests from clients (with help from Data Nodes)
• Data Nodes – responsible for partition of Object Store
• Object Store: Base storage service (e.g. AWS S3)
9
h5py
Client SDKs for Python
and C are drop-in
replacements for libraries
used with local files.
No significant code change to
access local and cloud based
data.
C/Fortran
Applications
Community
Conventions
REST Virtual
Object Layer
Web
Applications
Browser
HDF5 Lib
Python
Applications
Command
Line Tools
REST
API
h5pyd
S3 Virtual
File Driver
HDF Services
Clients do not
know the details
of the data or the
storage system
Data Access Options
Architecture
10Supporting the Python Analytics Stack
Many Python users
don’t use h5py, but
tools higher up the
stack: h5netcdf,
xarray, pandas, etc.
HDF5Lib
H5PY
H5NETCDF
Xarray
Since h5pyd is
compatible with h5py,
we should be able to
support the same stack
for HDF Cloud
HDF5Lib
H5PY
H5NETCDF
Xarray
H5PYD
HDFServer
Disk
Applications can
switch between
local and cloud
access just by
changing file path.
11HDF Cloud Features
• Simple + familiar API
• Clients can interact with service using REST API
• SDKs provide language specific interface (e.g. h5pyd for Python)
• Can read/write just the data they need (as opposed to transferring entire files)
• Support for compression
• Scalable performance:
• Can cache recently accessed data in RAM
• Can parallelize requests across multiple nodes
• More nodes  better performance
• Multiple clients can read/write to same data source
• No limit to the amount of data that can be stored by the service
12H5pyd – Python client
• H5py is a popular Python package that provide a Pythonic interface to the HDF5 library
• H5pyd (for h5py distributed) provides a h5py compatible h5py for accessing the server
• Pure Python – uses requests package to make http calls to server
• Include several extensions to h5py:
• List content in folders
• Get/Set ACLs (access control list)
• Pytables-like query interface
13REST VOL
• The HDF5 VOL architecture is a plugin layer for HDF5
• Public API stays the same, but different back ends can be implemented
• REST VOL substitutes REST API requests for file i/o actions
• C/Fortran applications should be able to run as is
14Command Line Interface (CLI)
• Accessing HDF via a service means one can’t utilize usual shell commands: ls, rm, chmod, etc.
• Command line tools are a set of simple apps to use instead:
• hsinfo: display server version, connect info
• hsls: list content of folder or file
• hstouch: create folder or file
• hsdel: delete a file
• hsload: upload an HDF5 file
• hsget: download content from server to an HDF5 file
• hsacl: create/list/update ACLs (Access Control Lists)
• Implemented in Python & uses h5pyd
15Getting Access to HDF Server
• Option 1: HDF Kita Lab (JupyterLab environment)
• Easy to use
• Low cost
• Shared HDF Server/S3 Bucket
• Option 2: HDF Kita Server
• Run your own instance
• Launch from AWS Marketplace (coming soon)
• Pay your own AWS costs
• Option 3: HDF Kita Server On Premise
• Roll your own: public or private cloud
• Supported with OpenStack & Ceph
• Talk to us for other technologies
16Futures: Supporting traditional HDF5 files 1
6
• Downside of the HDF S3 Schema is that data needs be transmogrified
• Since the bulk of the data is usually the chunk data it makes sense to combine
the ideas of the S3 Schema and S3VFD:
• Convert just the metadata of the source HDF5 file to the S3 Schema
• Store the source file as a S3 object
• For data reads, metadata provides offset and length into the HDF5 file
• S3 Range GET returns needed data
• This approach can be used either directly or with HDF Server
• Compared with the pure S3VFD approach, you reduce the number of S3
requests needed
• Work on supporting this is planned for later this year
• BONUS Round – Access to GeoTiff files
17Futures: Lambda Functions
• HDF Server can parallize requests across all the
available backend (“DN”) nodes on the server
• AWS Lambda is a new service that enables you to
run requests ”serverless”
• Pay for just cpu-seconds the function runs
• By incorporating Lambda, some HDF Server
requests can parallelize across a 1000 Lambda
functions (equivalent to a 1000 container server)
• Will dramatically speed up time-series selections
18Use Case
NREL (National Renewable Energy Laboratory) uses HDF
Cloud to make 50TB of wind simulation data accessible to the
public.
Datasets are three-dimensional covering the continental US:
• Time (one slice/hour)
• Lon (~2k resolution)
• Lat (~2k resolution)
Data covers seven year (61318 slices). Data was delivered as
84 ~500GB files, but was aggregated on load to one 50TB “file”.
Result is that rather than downloading TB’s of files, interested
users can now use the HDF Cloud client libraries to explore this
valuable data source.
19References 1
9
• HDF Schema:
https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf
• SciPy2017 talk:
https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf
• AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-
from-wind-open-data-on-aws/
• AWS S3 Performance guidelines:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-
considerations.html

More Related Content

What's hot

H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
The HDF-EOS Tools and Information Center
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
The HDF-EOS Tools and Information Center
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
The HDF-EOS Tools and Information Center
 
HDF5 <-> Zarr
HDF5 <-> ZarrHDF5 <-> Zarr
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
The HDF-EOS Tools and Information Center
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
The HDF-EOS Tools and Information Center
 
Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5
The HDF-EOS Tools and Information Center
 
HDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve InteroperabilityHDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve Interoperability
The HDF-EOS Tools and Information Center
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
Open-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDFOpen-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDF
The HDF-EOS Tools and Information Center
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
DataWorks Summit
 
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Alluxio, Inc.
 
Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4
The HDF-EOS Tools and Information Center
 
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
The HDF-EOS Tools and Information Center
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
The HDF-EOS Tools and Information Center
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Dinesh Chitlangia
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
Avinash Ramineni
 
Hadoop 2 cluster architecture
Hadoop 2 cluster architectureHadoop 2 cluster architecture
Hadoop 2 cluster architecture
Sandeep Patil
 

What's hot (20)

H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 
HDF5 <-> Zarr
HDF5 <-> ZarrHDF5 <-> Zarr
HDF5 <-> Zarr
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5
 
HDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve InteroperabilityHDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve Interoperability
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Open-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDFOpen-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDF
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
 
Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4
 
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Hadoop 2 cluster architecture
Hadoop 2 cluster architectureHadoop 2 cluster architecture
Hadoop 2 cluster architecture
 

Similar to HDF Cloud: HDF5 at Scale

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
The HDF-EOS Tools and Information Center
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
The HDF-EOS Tools and Information Center
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
The HDF-EOS Tools and Information Center
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
The HDF-EOS Tools and Information Center
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
The HDF-EOS Tools and Information Center
 
Integrating HDF5 with SRB
Integrating HDF5 with SRBIntegrating HDF5 with SRB
Integrating HDF5 with SRB
The HDF-EOS Tools and Information Center
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
Data Con LA
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallel
mfolk
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
The HDF-EOS Tools and Information Center
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Cloudian
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop framework
keval dalasaniya
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
DataWorks Summit
 
HDF Updae
HDF UpdaeHDF Updae
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
SUSE Italy
 

Similar to HDF Cloud: HDF5 at Scale (20)

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
Integrating HDF5 with SRB
Integrating HDF5 with SRBIntegrating HDF5 with SRB
Integrating HDF5 with SRB
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallel
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop framework
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
 
HDF Updae
HDF UpdaeHDF Updae
HDF Updae
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 

More from The HDF-EOS Tools and Information Center

The State of HDF
The State of HDFThe State of HDF
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
The HDF-EOS Tools and Information Center
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
The HDF-EOS Tools and Information Center
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
The HDF-EOS Tools and Information Center
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
The HDF-EOS Tools and Information Center
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
The HDF-EOS Tools and Information Center
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
The HDF-EOS Tools and Information Center
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
The HDF-EOS Tools and Information Center
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
The HDF-EOS Tools and Information Center
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
The HDF-EOS Tools and Information Center
 
HDF Status Update
HDF Status UpdateHDF Status Update
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
S3 VFD
S3 VFDS3 VFD

More from The HDF-EOS Tools and Information Center (14)

The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
 
HDF Status Update
HDF Status UpdateHDF Status Update
HDF Status Update
 
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
NASA Terra Data Fusion
 
S3 VFD
S3 VFDS3 VFD
S3 VFD
 

Recently uploaded

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Hivelance Technology
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
NaapbooksPrivateLimi
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
MayankTawar1
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 

Recently uploaded (20)

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 

HDF Cloud: HDF5 at Scale

  • 1. Proprietary and Confidential. Copyright 2018, The HDF Group. HDF Cloud: HDF5 at scale
  • 2. 2What is HDF5? Depends on your point of view: • a C-API • a File Format • a data model The File format is just a container for The data. Dropping this view of HDF allows us to more flexibly create a cloud version of HDF.
  • 3. 3Why HDF in the Cloud • It can provide a cost-effective infrastructure • Pay for what you use vs pay for what you may need • Lower overhead: no hardware setup/network configuration, etc. • Benefit from cloud-based technologies: • Elastic compute – scale compute resources dynamically • Object based storage – low cost/built in redundancy • Community platform • Enables interested users to bring their applications to the data • Share data among many users
  • 4. 4Cost Factors Most public clouds bill per usage For HDF in the cloud, there are three big cost drivers: • Storage: what storage system will be used? (S3 vs. EBS vs. EFS) • Compute: elastic compute on demand better than fixed cost (scale compute to usage not size of data) • Data Egress: • Ingress is free but getting data (egress) out will cost you ($0.09/GB) • Enabling users to get only the data they need will lower egress charges
  • 5. 5HDF Cloud Overview • RESTful interface to HDF5 using object storage • Storage using AWS S3 (portable to most other object storage systems) • Built in redundancy • Cost effective • Scalable throughput • Runs as a cluster of Docker containers • Elastically scale compute with usage • Feature compatible with HDF5 library • Implemented in Python using asyncio • Task oriented parallelism
  • 6. 6Object Storage Challenges for HDF • Not POSIX! • High latency (>0.1s) per request • Not write/read consistent • High throughput needs some tricks (use many async requests) • Request charges can add up (public cloud) For HDF5, using the HDF5 library directly on an object storage system is a non-starter. Will need an alternative solution…
  • 7. 7HDF Cloud Schema Big Idea: Map individual HDF5 objects (datasets, groups, chunks) as Object Storage Objects • Limit maximum storage object size • Support parallelism for read/write • Only data that is modified needs to be updated • Multiple clients can be reading/updating the same “file” Legend: • Dataset is partitioned into chunks • Each chunk stored as an S3 object • Dataset meta data (type, shape, attributes, etc.) stored in a separate object (as JSON text) How to store HDF5 content in S3? Each chunk (heavy outlines) get persisted as a separate object
  • 8. 8Architecture Legend: • Client: Any user of the service • Load balancer – distributes requests to Service nodes • Service Nodes – processes requests from clients (with help from Data Nodes) • Data Nodes – responsible for partition of Object Store • Object Store: Base storage service (e.g. AWS S3)
  • 9. 9 h5py Client SDKs for Python and C are drop-in replacements for libraries used with local files. No significant code change to access local and cloud based data. C/Fortran Applications Community Conventions REST Virtual Object Layer Web Applications Browser HDF5 Lib Python Applications Command Line Tools REST API h5pyd S3 Virtual File Driver HDF Services Clients do not know the details of the data or the storage system Data Access Options Architecture
  • 10. 10Supporting the Python Analytics Stack Many Python users don’t use h5py, but tools higher up the stack: h5netcdf, xarray, pandas, etc. HDF5Lib H5PY H5NETCDF Xarray Since h5pyd is compatible with h5py, we should be able to support the same stack for HDF Cloud HDF5Lib H5PY H5NETCDF Xarray H5PYD HDFServer Disk Applications can switch between local and cloud access just by changing file path.
  • 11. 11HDF Cloud Features • Simple + familiar API • Clients can interact with service using REST API • SDKs provide language specific interface (e.g. h5pyd for Python) • Can read/write just the data they need (as opposed to transferring entire files) • Support for compression • Scalable performance: • Can cache recently accessed data in RAM • Can parallelize requests across multiple nodes • More nodes  better performance • Multiple clients can read/write to same data source • No limit to the amount of data that can be stored by the service
  • 12. 12H5pyd – Python client • H5py is a popular Python package that provide a Pythonic interface to the HDF5 library • H5pyd (for h5py distributed) provides a h5py compatible h5py for accessing the server • Pure Python – uses requests package to make http calls to server • Include several extensions to h5py: • List content in folders • Get/Set ACLs (access control list) • Pytables-like query interface
  • 13. 13REST VOL • The HDF5 VOL architecture is a plugin layer for HDF5 • Public API stays the same, but different back ends can be implemented • REST VOL substitutes REST API requests for file i/o actions • C/Fortran applications should be able to run as is
  • 14. 14Command Line Interface (CLI) • Accessing HDF via a service means one can’t utilize usual shell commands: ls, rm, chmod, etc. • Command line tools are a set of simple apps to use instead: • hsinfo: display server version, connect info • hsls: list content of folder or file • hstouch: create folder or file • hsdel: delete a file • hsload: upload an HDF5 file • hsget: download content from server to an HDF5 file • hsacl: create/list/update ACLs (Access Control Lists) • Implemented in Python & uses h5pyd
  • 15. 15Getting Access to HDF Server • Option 1: HDF Kita Lab (JupyterLab environment) • Easy to use • Low cost • Shared HDF Server/S3 Bucket • Option 2: HDF Kita Server • Run your own instance • Launch from AWS Marketplace (coming soon) • Pay your own AWS costs • Option 3: HDF Kita Server On Premise • Roll your own: public or private cloud • Supported with OpenStack & Ceph • Talk to us for other technologies
  • 16. 16Futures: Supporting traditional HDF5 files 1 6 • Downside of the HDF S3 Schema is that data needs be transmogrified • Since the bulk of the data is usually the chunk data it makes sense to combine the ideas of the S3 Schema and S3VFD: • Convert just the metadata of the source HDF5 file to the S3 Schema • Store the source file as a S3 object • For data reads, metadata provides offset and length into the HDF5 file • S3 Range GET returns needed data • This approach can be used either directly or with HDF Server • Compared with the pure S3VFD approach, you reduce the number of S3 requests needed • Work on supporting this is planned for later this year • BONUS Round – Access to GeoTiff files
  • 17. 17Futures: Lambda Functions • HDF Server can parallize requests across all the available backend (“DN”) nodes on the server • AWS Lambda is a new service that enables you to run requests ”serverless” • Pay for just cpu-seconds the function runs • By incorporating Lambda, some HDF Server requests can parallelize across a 1000 Lambda functions (equivalent to a 1000 container server) • Will dramatically speed up time-series selections
  • 18. 18Use Case NREL (National Renewable Energy Laboratory) uses HDF Cloud to make 50TB of wind simulation data accessible to the public. Datasets are three-dimensional covering the continental US: • Time (one slice/hour) • Lon (~2k resolution) • Lat (~2k resolution) Data covers seven year (61318 slices). Data was delivered as 84 ~500GB files, but was aggregated on load to one 50TB “file”. Result is that rather than downloading TB’s of files, interested users can now use the HDF Cloud client libraries to explore this valuable data source.
  • 19. 19References 1 9 • HDF Schema: https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf • SciPy2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf • AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power- from-wind-open-data-on-aws/ • AWS S3 Performance guidelines: https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf- considerations.html

Editor's Notes

  1. Mention Andrew’s talk
  2. Each node is implemented as a docker container