SlideShare a Scribd company logo
1 of 20
Proprietary and Confidential. Copyright 2018, The HDF Group.
HDF for the Cloud:
New HDF Server Features
John Readey
2
• HDF storage schema for the cloud
• HDF Server features
• What’s new
• What’s next
• Demo
Overview
3What is HDF5?
Depends on your point of view:
• a C-API
• a data model
• a file format
Let’s imagine keeping the API and
Data model, but with a different (cloud friendly)
Storage format
4HDF Sharded Schema
Big Idea: Map individual
HDF5 objects (datasets,
groups, chunks) as Object
Storage Objects
• Limit maximum size of any object
• Support parallelism for read/write
• Only data that is modified needs to be
updated
• Multiple clients can be reading/updating
the same “file”
• Don’t need to manage free space
Legend:
• Dataset is partitioned into chunks
• Each chunk stored as an object (file)
• Dataset meta data (type, shape,
attributes, etc.) stored in a separate
object (as JSON text)
Why a sharded data format?
Each chunk (heavy outlines) get
persisted as a separate object
5Sharded format example
root_obj_id/
group.json
obj1_id/
group.json
obj2_id/
dataset.json
0_0
0_1
obj3_id/
dataset.json
0_0_2
0_0_3
Observations:
• Metadata is stored as JSON
• Chunk data stored as binary blobs
• Self-explanatory
• One HDF5 file can translate to lots of
objects
• Flat hierarchy – supports HDF5
multilinking
• Can limit maximum size of an object
• Can be used with Posix or object
storage
Schema is documented here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md
6Implementations of the sharded schema
A storage format specification is nice, but it would
be useful to have some software that can actually
write and read to the format…
As it happens, we’ve created a software service that uses the
schema: HSDS (Highly Scalable Data Service)
Software is available at:
https://github.com/HDFGroup/hsds
Note: HSDS was originally developed as a NASA ACCESS 2015 project:
https://earthdata.nasa.gov/esds/competitive-programs/access/hsds
7Server Features
• Simple + familiar API
• Clients can interact with service using REST API
• SDKs provide language specific interface (e.g. h5pyd for Python)
• Can read/write just the data they need (as opposed to transferring entire files)
• Support for compression
• Container based
• Run in Docker or Kubernetes or DC/OS
• Scalable performance:
• Can cache recently accessed data in RAM
• Can parallelize requests across multiple nodes
• More nodes  better performance
• Cluster based – any number of machines can be used to constitute the server
• Multiple clients can read/write to same data source
• No limit to the amount of data that can be stored by the service
8Architecture
Legend:
• Client: Any user of the service
• Load balancer – distributes requests to Service nodes
• Service Nodes – processes requests from clients (with help from Data Nodes)
• Data Nodes – responsible for partition of Object Store
• Object Store: Base storage service (e.g. AWS S3)
9HDF API Compatibility
The sharded storage schema captures the
HDF data model, and REST service interface
is nice, but it would be great if the existing
HDF based applications and libraries could
use the new storage format without requiring a
bunch of code changes…
Two related projects provide a solution:
• H5pyd – h5py compatible package for Python
• REST VOL – HDF5 library plugin for C/C++
10H5pyd – Python client
• H5py is a popular Python package that provide a Pythonic interface to the HDF5 library
• H5pyd (for h5py distributed) provides a h5py compatible h5py for accessing the server
• Pure Python – uses requests package to make http calls to server
• Include several extensions to h5py:
• List content in folders
• Get/Set ACLs (access control list)
• Pytables-like query interface
• H5netcdf and xarray packages will use h5pyd when http:// is prepended to the file path
• Installable from PyPI: $ pip install h5pyd
• Source code: https://github.com/HDFGroup/h5pyd
11Supporting the Python Analytics Stack
Many Python users
don’t use h5py, but
tools higher up the
stack: h5netcdf,
xarray, pandas, etc.
HDF5Lib
H5PY
H5NETCDF
Xarray
Since h5pyd is
compatible with h5py,
we should be able to
support the same stack
for HDF Cloud
HDF5Lib
H5PY
H5NETCDF
Xarray
H5PYD
HDFServer
Disk
Applications can
switch between
local and cloud
access just by
changing file path.
12REST VOL Plugin
• The HDF5 VOL architecture is a plugin layer for HDF5
• Public API stays the same, but different back ends can be implemented
• REST VOL substitutes REST API requests for file i/o actions
• C/Fortran applications should be able to run as is
• Some features not implemented yet:
• VLEN support
• Large read/write support (selections >100mb)
• Downloadable from: https://github.com/HDFGroup/vol-rest
13Command Line Interface (CLI)
• Accessing HDF via a service means one can’t utilize usual shell commands: ls, rm, chmod, etc.
• Command line tools are a set of simple apps to use instead:
• hsinfo: display server version, connect info
• hsls: list content of folder or file
• hstouch: create folder or file
• hsdel: delete a file
• hsload: upload an HDF5 file
• hsget: download content from server to an HDF5 file
• hsacl: create/list/update ACLs (Access Control Lists)
• Hsdiff: compare HDF5 file with sharded representation
• Implemented in Python & uses h5pyd
• Note: data is round-tripable:
• HDF5 File hsload  HSDS store  hsget  HDF5 file
14Supporting traditional HDF5 files
• If you have HDF5 files already stored in the cloud, they can be
accessed by HDF Server
• Rather than converting the entire file to the HDF Schema, just the
metadata needs to be imported (typically <1% of the file)
• Dataset reads are converted to S3 Range Gets on the stored file
• The hsload CLI tool has an option (--link ) for loading file metadata
• It is also possible to construct a server file that aggregates multiple
stored files (similar to how the HDF5 library VDS feature works)
We’ve discussed three aspects of HDF: the data model, API, and file
format. With HSDS we’ve kept the data model and API, but the file
format is radically different. But maybe you have a PB or two of HDF5
files you’d like to use…
15New HSDS features
HSDS version 0.6 is coming soon…
What’s new:
• POSIX Support – Store content on regular disk drives
• Azure
• Azure Blob support – Support for Azure’s object storage format
• AKS (Azure Kubernetes) – Run in Azure’s managed Kubernetes
• Active Directory authentication – Authenticate via AD
• AWS
• Added support for AWS Lambda
• DC/OS – support for DC/OS (Apache Mesos) distributed system
• Domain checksums – verify when any content changes
• Role Based Access Control (RBAC) – manage ACLs for user groups
Complete list is here: https://github.com/HDFGroup/hsds/issues/47
16HSDS Platforms
POSIX
Filesystem
HSDS can be run on most container management systems:
Using different supported storage systems:
17AWS Lambda Functions
• HSDS can parallelize requests across all the
available backend (“DN”) nodes on the server
• AWS Lambda is a new service that enables you to
run requests ”serverless”
• Pay for just cpu-seconds the function runs
• By incorporating Lambda, some HDF Server
requests can parallelize across a 1000 Lambda
functions (equivalent to a 1000 container server)
• Will dramatically speed up time-series selections
18Kita Lab
• Kita Lab is a JupyterLab and HDF server environment hosted by the HDF Group on AWS
• Kita Lab users can create Python notebooks that use h5pyd to connect to HDF Server
• Each user gets equivalent to 2-core Xeon Server and 10GB local storage
• Users can use up to 100GB of data on HDF Server
• Sign up here: https://www.hdfgroup.org/hdfkitalab/
User’s container
and EBS volume
User
User logs into
Jupyter Hub
JupyterHub spawns
new container at
login
HSDS on Kubernetes
S3 Bucket
19Futures
• Sometimes you’d rather do without a server and talk to the storage system directly:
• Don’t want to deal with setting up service
• Don’t want to worry about scaling service up and down with client load
• You don’t need the synchronization (e.g. managing multiple clients writing to the
same dataset) that a service provides
• HS Direct Access will be a new VOL connector that enables this for the HDF5 library
• Will take advantage of multiple cores
• Uses same schema as HSDS (and can be used in conjunction with HSDS)
Design doc is here:
https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md
20Questions?

More Related Content

What's hot

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

What's hot (20)

HDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at ScaleHDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at Scale
 
HDF for the Cloud
HDF for the CloudHDF for the Cloud
HDF for the Cloud
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
HDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF ConverterHDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF Converter
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 
HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
NetCDF and HDF5
NetCDF and HDF5NetCDF and HDF5
NetCDF and HDF5
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
HDF Project Update
HDF Project UpdateHDF Project Update
HDF Project Update
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Introduction to NetCDF-4
Introduction to NetCDF-4Introduction to NetCDF-4
Introduction to NetCDF-4
 
Easy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAPEasy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAP
 
Status of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and ToolsStatus of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and Tools
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Putting some Spark into HDF5
Putting some Spark into HDF5Putting some Spark into HDF5
Putting some Spark into HDF5
 

Similar to HDF for the Cloud - New HDF Server Features

hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsDataWorks Summit
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdfManoel Ribeiro
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallelmfolk
 

Similar to HDF for the Cloud - New HDF Server Features (20)

HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
 
Update on HDF5 1.8
Update on HDF5 1.8Update on HDF5 1.8
Update on HDF5 1.8
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallel
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 

More from The HDF-EOS Tools and Information Center

More from The HDF-EOS Tools and Information Center (14)

The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
 
HDF Status Update
HDF Status UpdateHDF Status Update
HDF Status Update
 
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
NASA Terra Data Fusion
 
S3 VFD
S3 VFDS3 VFD
S3 VFD
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 

Recently uploaded

Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 

Recently uploaded (20)

2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 

HDF for the Cloud - New HDF Server Features

  • 1. Proprietary and Confidential. Copyright 2018, The HDF Group. HDF for the Cloud: New HDF Server Features John Readey
  • 2. 2 • HDF storage schema for the cloud • HDF Server features • What’s new • What’s next • Demo Overview
  • 3. 3What is HDF5? Depends on your point of view: • a C-API • a data model • a file format Let’s imagine keeping the API and Data model, but with a different (cloud friendly) Storage format
  • 4. 4HDF Sharded Schema Big Idea: Map individual HDF5 objects (datasets, groups, chunks) as Object Storage Objects • Limit maximum size of any object • Support parallelism for read/write • Only data that is modified needs to be updated • Multiple clients can be reading/updating the same “file” • Don’t need to manage free space Legend: • Dataset is partitioned into chunks • Each chunk stored as an object (file) • Dataset meta data (type, shape, attributes, etc.) stored in a separate object (as JSON text) Why a sharded data format? Each chunk (heavy outlines) get persisted as a separate object
  • 5. 5Sharded format example root_obj_id/ group.json obj1_id/ group.json obj2_id/ dataset.json 0_0 0_1 obj3_id/ dataset.json 0_0_2 0_0_3 Observations: • Metadata is stored as JSON • Chunk data stored as binary blobs • Self-explanatory • One HDF5 file can translate to lots of objects • Flat hierarchy – supports HDF5 multilinking • Can limit maximum size of an object • Can be used with Posix or object storage Schema is documented here: https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md
  • 6. 6Implementations of the sharded schema A storage format specification is nice, but it would be useful to have some software that can actually write and read to the format… As it happens, we’ve created a software service that uses the schema: HSDS (Highly Scalable Data Service) Software is available at: https://github.com/HDFGroup/hsds Note: HSDS was originally developed as a NASA ACCESS 2015 project: https://earthdata.nasa.gov/esds/competitive-programs/access/hsds
  • 7. 7Server Features • Simple + familiar API • Clients can interact with service using REST API • SDKs provide language specific interface (e.g. h5pyd for Python) • Can read/write just the data they need (as opposed to transferring entire files) • Support for compression • Container based • Run in Docker or Kubernetes or DC/OS • Scalable performance: • Can cache recently accessed data in RAM • Can parallelize requests across multiple nodes • More nodes  better performance • Cluster based – any number of machines can be used to constitute the server • Multiple clients can read/write to same data source • No limit to the amount of data that can be stored by the service
  • 8. 8Architecture Legend: • Client: Any user of the service • Load balancer – distributes requests to Service nodes • Service Nodes – processes requests from clients (with help from Data Nodes) • Data Nodes – responsible for partition of Object Store • Object Store: Base storage service (e.g. AWS S3)
  • 9. 9HDF API Compatibility The sharded storage schema captures the HDF data model, and REST service interface is nice, but it would be great if the existing HDF based applications and libraries could use the new storage format without requiring a bunch of code changes… Two related projects provide a solution: • H5pyd – h5py compatible package for Python • REST VOL – HDF5 library plugin for C/C++
  • 10. 10H5pyd – Python client • H5py is a popular Python package that provide a Pythonic interface to the HDF5 library • H5pyd (for h5py distributed) provides a h5py compatible h5py for accessing the server • Pure Python – uses requests package to make http calls to server • Include several extensions to h5py: • List content in folders • Get/Set ACLs (access control list) • Pytables-like query interface • H5netcdf and xarray packages will use h5pyd when http:// is prepended to the file path • Installable from PyPI: $ pip install h5pyd • Source code: https://github.com/HDFGroup/h5pyd
  • 11. 11Supporting the Python Analytics Stack Many Python users don’t use h5py, but tools higher up the stack: h5netcdf, xarray, pandas, etc. HDF5Lib H5PY H5NETCDF Xarray Since h5pyd is compatible with h5py, we should be able to support the same stack for HDF Cloud HDF5Lib H5PY H5NETCDF Xarray H5PYD HDFServer Disk Applications can switch between local and cloud access just by changing file path.
  • 12. 12REST VOL Plugin • The HDF5 VOL architecture is a plugin layer for HDF5 • Public API stays the same, but different back ends can be implemented • REST VOL substitutes REST API requests for file i/o actions • C/Fortran applications should be able to run as is • Some features not implemented yet: • VLEN support • Large read/write support (selections >100mb) • Downloadable from: https://github.com/HDFGroup/vol-rest
  • 13. 13Command Line Interface (CLI) • Accessing HDF via a service means one can’t utilize usual shell commands: ls, rm, chmod, etc. • Command line tools are a set of simple apps to use instead: • hsinfo: display server version, connect info • hsls: list content of folder or file • hstouch: create folder or file • hsdel: delete a file • hsload: upload an HDF5 file • hsget: download content from server to an HDF5 file • hsacl: create/list/update ACLs (Access Control Lists) • Hsdiff: compare HDF5 file with sharded representation • Implemented in Python & uses h5pyd • Note: data is round-tripable: • HDF5 File hsload  HSDS store  hsget  HDF5 file
  • 14. 14Supporting traditional HDF5 files • If you have HDF5 files already stored in the cloud, they can be accessed by HDF Server • Rather than converting the entire file to the HDF Schema, just the metadata needs to be imported (typically <1% of the file) • Dataset reads are converted to S3 Range Gets on the stored file • The hsload CLI tool has an option (--link ) for loading file metadata • It is also possible to construct a server file that aggregates multiple stored files (similar to how the HDF5 library VDS feature works) We’ve discussed three aspects of HDF: the data model, API, and file format. With HSDS we’ve kept the data model and API, but the file format is radically different. But maybe you have a PB or two of HDF5 files you’d like to use…
  • 15. 15New HSDS features HSDS version 0.6 is coming soon… What’s new: • POSIX Support – Store content on regular disk drives • Azure • Azure Blob support – Support for Azure’s object storage format • AKS (Azure Kubernetes) – Run in Azure’s managed Kubernetes • Active Directory authentication – Authenticate via AD • AWS • Added support for AWS Lambda • DC/OS – support for DC/OS (Apache Mesos) distributed system • Domain checksums – verify when any content changes • Role Based Access Control (RBAC) – manage ACLs for user groups Complete list is here: https://github.com/HDFGroup/hsds/issues/47
  • 16. 16HSDS Platforms POSIX Filesystem HSDS can be run on most container management systems: Using different supported storage systems:
  • 17. 17AWS Lambda Functions • HSDS can parallelize requests across all the available backend (“DN”) nodes on the server • AWS Lambda is a new service that enables you to run requests ”serverless” • Pay for just cpu-seconds the function runs • By incorporating Lambda, some HDF Server requests can parallelize across a 1000 Lambda functions (equivalent to a 1000 container server) • Will dramatically speed up time-series selections
  • 18. 18Kita Lab • Kita Lab is a JupyterLab and HDF server environment hosted by the HDF Group on AWS • Kita Lab users can create Python notebooks that use h5pyd to connect to HDF Server • Each user gets equivalent to 2-core Xeon Server and 10GB local storage • Users can use up to 100GB of data on HDF Server • Sign up here: https://www.hdfgroup.org/hdfkitalab/ User’s container and EBS volume User User logs into Jupyter Hub JupyterHub spawns new container at login HSDS on Kubernetes S3 Bucket
  • 19. 19Futures • Sometimes you’d rather do without a server and talk to the storage system directly: • Don’t want to deal with setting up service • Don’t want to worry about scaling service up and down with client load • You don’t need the synchronization (e.g. managing multiple clients writing to the same dataset) that a service provides • HS Direct Access will be a new VOL connector that enables this for the HDF5 library • Will take advantage of multiple cores • Uses same schema as HSDS (and can be used in conjunction with HSDS) Design doc is here: https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md

Editor's Notes

  1. Each node is implemented as a docker container