HDF Data in the Cloud
The HDF Team
Enabling collaboration while protecting data producers and users from disruption as data move to the cloud
The Landsat Experience
[Figure: processing time per image (seconds), 2014-2016, before and after Landsat moved to Amazon Web Services. Graph by Drew Bollinger (@drewbo19) at Development Seed.]
The U.S. Geological Survey migrated their archive of Landsat data to Amazon Web Services. The plot shows the processing time per image before and after the migration. The average time to process an image decreased from 375 seconds to 75 seconds because only the 3 bands needed were downloaded instead of 11 or more. Across roughly 72,000 images, this saved 21,600,000 seconds, or about 250 days.
Flexible Data Structures / Stable Access
[Diagram: existing analysis and visualization applications sit on the HDF5 library (C, Fortran, Java, Python), which reaches the data through either the HDF5 Virtual File Driver or the Highly Scalable Data Service. The same lat/lon/time data and metadata can be organized as maps, chunks, or rods in the cloud, so data can migrate and evolve (including toward new cloud-native applications) while access for existing applications stays stable.]
Flexible Data Location and Storage
[Diagram: the same application / HDF5 library stack, with the HDF5 Virtual File Driver and the Highly Scalable Data Service providing access to data and metadata held in local files, a private cloud, or a public cloud, again preserving stability for existing applications while data locations migrate and evolve toward new cloud-native applications.]
Python alternatives for the netCDF API
[Diagram: alternative Python stacks for reaching HDF5 data through the netCDF API. One stack runs xarray over netcdf4-python, netCDF-C, and the HDF5 C library against HDF5 data directly; another runs xarray over h5netcdf and h5py; a third runs xarray over h5pyd, an h5py-compatible package that uses the HDF REST API to talk to the Highly Scalable Data Server.]
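As a rough sketch of swapping between these stacks in practice (the file name is hypothetical; the engine names are the standard xarray ones, assuming the netCDF4 and h5netcdf packages are installed):

```python
import xarray as xr

# Open a local HDF5/netCDF-4 file through the netcdf4-python / netCDF-C stack.
ds_netcdf4 = xr.open_dataset("example_wind.nc", engine="netcdf4")

# Open the same file through the pure-Python h5netcdf / h5py stack.
ds_h5netcdf = xr.open_dataset("example_wind.nc", engine="h5netcdf")

# Either way, the xarray code on top is unchanged.
print(ds_netcdf4.data_vars)
print(ds_h5netcdf.data_vars)
```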
Client/Server Architecture
Client SDKs for Python and C are drop-in replacements for the libraries used with local files, so no significant code change is needed to access local or cloud-based data (see the sketch below).
[Diagram: C/Fortran applications using community conventions and the HDF5 library (with the REST Virtual Object Layer or the S3 Virtual File Driver), Python applications and command-line tools using h5pyd or h5py, and web applications running in a browser all reach the HDF services through the REST API. Clients do not know the details of the data structures or the storage system.]
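A minimal sketch of the drop-in idea for the Python SDK; the dataset name, domain path, and HSDS endpoint below are purely illustrative:

```python
# Local file with h5py.
import h5py
with h5py.File("windspeed.h5", "r") as f:
    block = f["/windspeed_80m"][0:24, 0:100, 0:100]

# The same access pattern against a cloud-hosted domain with h5pyd;
# only the import and the "file" name (plus the service endpoint) change.
import h5pyd
with h5pyd.File("/shared/nrel/windspeed.h5", "r",
                endpoint="http://hsds.example.org:5101") as f:
    block = f["/windspeed_80m"][0:24, 0:100, 0:100]
```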
Data Access Options
Protecting data producers and users from disruption as data move to the cloud.
[Diagram: collaboration across programs, projects, teams, and individuals, with access cases labeled A-D in which users read single chunks or combine chunks from one or more cloud-hosted files.]
Cloud Optimized HDF
A Cloud Optimized HDF file is a regular HDF file, intended to be hosted on an HTTP file server, with an internal organization that enables efficient access patterns for the expected use cases in the cloud.
Cloud Optimized HDF leverages the ability of clients to read just the data in a file they need, and it localizes metadata to decrease the time it takes to understand the file structure.
HDF Cloud enables range GETs for files or data collections with hundreds of parameters, including geolocation information.
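As an illustration of what a range GET looks like at the HTTP level (the URL and byte offsets are hypothetical; in practice they would come from the file's chunk index):

```python
import requests

url = "https://example-bucket.s3.amazonaws.com/data/cloud_optimized.h5"

# Fetch just the bytes of one chunk instead of downloading the whole file.
headers = {"Range": "bytes=1048576-2097151"}   # hypothetical chunk extent
resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()

chunk_bytes = resp.content   # raw (possibly compressed) chunk, ready to decode
print(f"fetched {len(chunk_bytes)} bytes, status {resp.status_code}")
```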
Metadata and Data Options
[Diagram: four layouts (A-D) of the same metadata (grey) and data: (A) a local HDF5 file with metadata distributed through it, (B) the same, unchanged file stored in the cloud, (C) the file reorganized so metadata is separated or centralized and can be read in a single access, and (D) the file split into separate metadata and data objects served by the Highly Scalable Data Service.]
Sustainable Open Source Projects
"We should hold ourselves accountable to the goal of building sustainable open projects, and lay out a realistic and hard-nosed framework by which the investors with money (the tech companies and academic communities that depend on our toolkits) can help foster that sustainability. To be clear, in my view it should not be the job of (e.g.) Google to figure out how to contribute to our sustainability; it should be our job to tell them how they can help us, and then follow through when they work with us."
Titus Brown, "A framework for thinking about Open Source Sustainability?"
http://ivory.idyll.org/blog/2018-oss-framework-cpr.html
[Chart: developer effort]
More HDF Cloud Information
[Slide of links: National Renewable Energy Lab Wind Data · Amazon Web Services Blog · Interactive Wind Data from HDF Cloud]
Architecture for Highly Scalable Data Service
Legend:
• Client: any user of the service
• Load balancer: distributes requests to service nodes
• Service nodes: process requests from clients (with help from data nodes)
• Data nodes: each responsible for a partition of the object store
• Object store: base storage service (e.g. AWS S3)
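To make the request flow concrete, here is a rough sketch of a client talking to the service through the load balancer; the endpoint, domain name, and route names are assumptions for illustration and should be checked against the HDF REST API documentation:

```python
import requests

endpoint = "http://hsds.example.org:5101"   # load balancer address (hypothetical)
domain = "/shared/sample/tall.h5"           # domain ("file") name (hypothetical)

# Ask a service node for the domain's root group (route and parameter assumed).
info = requests.get(f"{endpoint}/", params={"domain": domain}).json()
print(info.get("root"))

# List the datasets in the domain (route assumed); the service node fans the
# work out to the data nodes that own the relevant object-store partitions.
datasets = requests.get(f"{endpoint}/datasets", params={"domain": domain}).json()
print(datasets.get("datasets"))
```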
Cloud Optimized HDF
• HDF5 (requires v1.10?)
• Use chunking for datasets larger than 1 MB (see the sketch after this list)
• Use "brick-style" chunk layouts (enable slicing along any dimension)
• Use readily available compression filters
• Pack metadata at the front of the file (optimal for the S3 VFD)
• Provide sizes and locations of chunks in the file
• Compressed variable-length data is supported
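A minimal h5py sketch of the chunking and compression recommendations; the dataset name, shape, and chunk sizes are illustrative, and packing metadata at the front of the file would normally be handled as a separate repacking/optimization step not shown here:

```python
import numpy as np
import h5py

# Illustrative shape: one year of hourly values on a 500 x 500 grid.
shape = (8760, 500, 500)

with h5py.File("wind_cloud_optimized.h5", "w", libver="latest") as f:
    dset = f.create_dataset(
        "windspeed_80m",
        shape=shape,
        dtype="f4",
        # "Brick-style" chunks: bounded extent in every dimension, so slices
        # along time, y, or x each touch a limited number of chunks.
        chunks=(168, 100, 100),        # ~6.7 MB per chunk before compression
        compression="gzip",            # a readily available filter
        compression_opts=4,
    )
    dset.attrs["units"] = "m s-1"
    # Write one week of data for one tile as an example.
    dset[0:168, 0:100, 0:100] = np.random.rand(168, 100, 100).astype("f4")
```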
Why HDF in the Cloud
• Cost-effective infrastructure
  • Pay for what you use vs. pay for what you may need
  • Lower overhead: no hardware setup, network configuration, etc.
• Benefit from cloud-based technologies:
  • Elastic compute – scale compute resources dynamically
  • Object-based storage – low cost, built-in redundancy
• Community platform
  • Enables interested users to bring their applications to the data
  • Share data among many users
More Information:
• h5serv: https://github.com/HDFGroup/h5serv
• Documentation: http://h5serv.readthedocs.io/
• h5pyd: https://github.com/HDFGroup/h5pyd
• RESTful HDF5 White Paper:
https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf
• Blogs:
• https://hdfgroup.org/wp/2015/04/hdf5-for-the-web-hdf-server/
• https://hdfgroup.org/wp/2015/12/serve-protect-web-security-hdf5/
• https://www.hdfgroup.org/2017/04/the-gfed-analysis-tool-an-hdf-server-implementation/
HDF5 Community Support
• Documentation, Tutorials, FAQs, examples
• https://portal.hdfgroup.org/display/support
• HDF-Forum – mailing list and archive
• Great for specific questions
• Helpdesk Email – help@hdfgroup.org
• Issues with software and documentation
• https://portal.hdfgroup.org/display/support/Community

Editor's Notes

  1. Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage differ fundamentally from those of traditional data storage systems because 1) the data are accessed over HTTP and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  2. The most significant U.S. experience with satellite data in the cloud comes from the U.S. Geological Survey, which migrated its archive of Landsat data to Amazon Web Services during 2014. The average time to process an image decreased from 375 seconds to 75 seconds because only 3 bands were being downloaded instead of 11+. This saved 21,600,000 seconds, or 250 days, in the total time required to process 72,000 images. High-performance subsetting has been a cornerstone of the HDF5 experience for decades: HDF5 supports extraction of only the metadata and data users need, whether that is selected bands or subsets along up to 32 dimensions (space, time, band, ...). Our goal is to continue this tradition with high-performance, large-scale analysis from the desktop, the organizational data center, or the cloud.
     Old queries: 18,000; new queries: 72,000
     Old time per image: 375 s; new time: 75 s; difference: 300 s
     Time saved: 72,000 images x 300 s = 21,600,000 s (about 250 days)
  3. The HDF5 library, shown as the box in the upper left of this slide, supports existing commercial and open source analysis and visualization applications written in many languages. The HDF Group directly supports C, C++, Fortran, and Java while other communities support Python, Julia, R, and many other languages.   The data in HDF5 files can be organized in many ways to improve performance for expected use cases. This slide shows two end-member organizations (maps: single lat/lon slices for each time and rods: single pixels for all times) for supporting mapping and timeseries studies, and compromise 3D chunks that work well for supporting ad-hoc subsets.   Current HDF5 users do not need to know the specifics of the data organization to access data. The library allows users to access data organized in any way with the same application code, although performance will vary.   Our goal is to keep the analysis and visualization applications the same as the data in any organization move to the cloud. We will accomplish this goal using virtual file drivers (VFD) that plug in to the library to support different storage architectures. This approach has been used in HDF5 to support specialized file systems in high performance computing for many years. We are now applying that experience to support access to data in object stores. We are also developing new tools, like the Highly Scalable Data Service, and new interfaces, like the RESTful API (not shown here), to support access to data that are distributed across object stores using on-demand processing.   Our approach protects existing investments in code and tools, the expensive parts of user systems, while allowing data to migrate and evolve. We are also working to support new cloud native applications and tools.
  4. The HDF Group is developing library plug-ins and tools for accessing cloud data organized to support any analysis need or use case. Some data providers prefer to store entire files in native organizations as single objects in the cloud and to access the data from those files. Other data providers prefer to split the file into smaller pieces, typically datasets or chunks, and to access the data from those smaller chunks. We expect that, in the end, most data providers will use a mix of these two strategies to support diverse users and use cases.   Current HDF5 users do not need to know the specifics of the data organization to access data. Our cloud strategy will allow users to access data organized in any way and stored in any storage system with the same application code, although performance will vary.   Our approach protects existing investments in code and tools, the expensive parts of user systems, while allowing data to migrate and evolve. We are also working to support new cloud native applications and tools.
  5. HDF5 data providers and users write and access data in HDF5 using many different programming languages and in many different architectures. The same diversity will continue as data move to the cloud. We are currently supporting a number of access options. C and Fortran applications will continue to use community conventions (e.g. HDF-EOS, netCDF, NEXUS, BAG, ASDF, ...) and the HDF5 library to access data. Two library plug-ins, the REST Virtual Object Layer (VOL) and the S3 Virtual File Driver (VFD), are available to support these users in different ways depending on the details of their needs. Our growing community of Python users access HDF5 data using the open source h5py package. They can replace that package with h5pyd, which has identical function calls, and access data in the cloud using the new REST API. The REST API can also support users who prefer accessing data through a web browser. These different access options all hide the details of the data storage from the users, supporting our goal of data access that is independent of data organization and storage architecture.
  6. HDF Cloud will enable users to access and analyze data they need to answer new questions that require distributed datasets from many sources.   The research group on the right is accessing many different chunks from the same original file in one case (A), and combining data from one file with chunks from another in case B.   The group on the left is accessing a single chunk from an original file (C) to answer a local question or develop a model, and then applying that model to multiple chunks from separate datasets (D).
  7. Kita enables many options for organizing data and metadata. (A) shows a single user accessing an existing HDF5 file on their desktop; in this case the metadata (grey) are distributed through the file (not necessarily as neatly as they look here). (B) shows access to the same file, unchanged, in the cloud; the change in location is handled in the HDF5 library by the S3 Virtual File Driver (VFD) (a sketch of this kind of access appears after these notes). (C) shows the same data with the metadata separated and/or centralized in the file; in either case the goal is to let the metadata be read in a single access, and in some cases the metadata may be stored or cached on the processing machine. This option typically requires an optimization step during the migration of the file to the cloud. Note that the data in the cloud can be accessed by the individual or by others on the team (or by users external to the team, if appropriate). (D) shows the file split into metadata (grey) and data (white); in this case the original file no longer exists, and access is done using the Highly Scalable Data Server, h5pyd, or the RESTful HDF5 API.
  8. The HDF Group, the U.S. National Renewable Energy Lab (NREL), and the Amazon Web Services open data team have worked together to test HDF Cloud with a large collection of wind data from a mesoscale weather forecast model (WRF). These data were restructured to improve access and migrated to the cloud, and an interactive web visualization tool was built by an intern at NREL. Click the National Renewable Energy Lab Wind Data link to see the web application, and the other links to find out more about HDF Cloud.
  9. Distributing computing over a collection of processors that can grow and shrink as needed is one of the principal benefits of moving data access systems to the cloud. The Highly Scalable Data Service was developed by The HDF Group to help users take advantage of this critical benefit. Data files are split into datasets and chunks and distributed throughout the data store in a number of "buckets", each of which is managed by a specific data node. When requests arrive, they are balanced across a number of service nodes, each of which accesses part of the original datasets. This approach can take advantage of large numbers of nodes when necessary to do large-scale analytics. As shown in the previous slide, the HSDS can be accessed in many ways. The best developed and tested is the Python package h5pyd, an extension of h5py that is optimized to use the new RESTful API for HDF, implemented specifically for data in the cloud. As data move to the cloud, users replace the h5py package with h5pyd and the data are accessed without any changes to the application. Users can also create web applications built directly on top of the REST API.
  10. Many communities optimize HDF5 by creating specialized data models specific to their needs and conventions for writing data using those models. As cloud usage increases and The HDF Group continues to explore cloud access options, we are identifying approaches to writing HDF5 files that improve performance in the cloud. If data providers are writing files that they plan to access from the cloud, they can take advantage of what we have learned to optimize data access for their users.
  11. This slide summarizes some of the important reasons for migrating HDF5 data archives to the cloud.
  12. Please click these links for more details.
  13. Please click these links for more details.
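As referenced in note 7, a hedged sketch of opening an unchanged HDF5 file directly from object storage via the read-only S3 virtual file driver (exposed in h5py as driver="ros3"); this assumes an HDF5/h5py build with ros3 enabled, anonymous access to the bucket, a hypothetical URL, and an illustrative dataset name:

```python
import h5py

# Unchanged HDF5 file hosted in an S3 bucket (URL is hypothetical).
url = "https://example-bucket.s3.amazonaws.com/data/landsat_scene.h5"

# Requires HDF5/h5py built with the read-only S3 (ros3) driver.
with h5py.File(url, "r", driver="ros3") as f:
    print(list(f.keys()))             # browse the file layout
    band = f["/B4"][0:512, 0:512]     # read a subset; only the needed
                                      # byte ranges are fetched from S3
```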