1. Utilizing HDF4 File Content Maps for Cloud Computing
Hyokyung Joe Lee, The HDF Group
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C
2. HDF File Format is for Data.
• PDF for Document, HDF for Data
• Why PDF over MS Word DOC?
– Free, Portable, Sharing & Archiving
• Why HDF over MS Excel XLS(X)?
– Free, Portable, Sharing & Archiving
• HDF: HDF4 & HDF5
3. HDF4 is an “old” format.
• Old = a large volume of data archived over a long time
• Old = limitations (32-bit)
• Old = more difficult to sustain
4. HDF4 is old. So what?
• Convert it to HDF5.
5. Any alternative?
Cloudification!
6. Cloudification (Wiktionary): the conversion and/or migration of data and application programs in order to make use of cloud computing.
7. Why Cloud? AI + Big Data + Cloud = amazing things (e.g., IBM Watson, Google AlphaGo).
8. ABC Example: El Niño Detection
9. Cloudification is cool, but how? Use the HDF4 File Content Map.
(An HDF4 file contains objects such as Groups, Arrays, Tables, Attributes, and Palettes.)
10. What is an HDF4 map?
An XML (ASCII) file that maps the content of a binary file.
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
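Since the map is plain XML, any XML parser can enumerate the chunk addresses. Below is a minimal Python sketch using only the standard library; the map file name is a hypothetical placeholder, and tags are matched on their local name so the h4 namespace URI need not be hard-coded.

import xml.etree.ElementTree as ET

def list_chunks(map_file):
    """Yield (offset, nbytes, position) for every byteStream in the map."""
    tree = ET.parse(map_file)
    for elem in tree.iter():
        # Match on the local tag name, ignoring the XML namespace prefix.
        if elem.tag.split('}')[-1] == 'byteStream':
            yield (int(elem.get('offset')),
                   int(elem.get('nBytes')),
                   elem.get('chunkPositionInArray'))

for offset, nbytes, pos in list_chunks('example.hdf.map'):
    print(f'{pos}: {nbytes} bytes at offset {offset}')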
11. It is a map with addresses.
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
…
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
(Callouts: the chunkPositionInArray values are addresses in the data; the offset/nBytes values are addresses in the file.)
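Because offset and nBytes are fully disclosed, a chunk can be read with plain file I/O and no HDF4 library. A minimal sketch, using the 2,468-byte chunk from the listing above; the HDF4 file name is a placeholder, and note that a chunk the map marks as compressed still needs the decompression step the map describes before numeric decoding.

def read_chunk(hdf4_file, offset, nbytes):
    """Read one chunk straight out of the binary file."""
    with open(hdf4_file, 'rb') as f:
        f.seek(offset)         # jump to the address given by the map
        return f.read(nbytes)  # read exactly nBytes

raw = read_chunk('example.hdf', 70798703, 2468)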
12. Byte size in the map is quite useful.
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
…
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
Bigger chunks may have more information.
(Callouts: the fill-value chunk has nothing interesting; the large 2,468-byte chunk may have useful information; the two 32-byte chunks may have the same information.)
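The size heuristic above can be applied mechanically. Here is a small sketch that tallies chunk sizes using list_chunks() from the earlier example; the repeated-size rule of thumb is an illustrative assumption, not part of the map specification.

from collections import Counter

sizes = Counter(nbytes for _, nbytes, _ in list_chunks('example.hdf.map'))
for nbytes, count in sizes.most_common():
    # Many chunks with an identical stored size are candidates for holding
    # the same repeated data; checksums (next slide) confirm it.
    print(f'{count} chunk(s) of {nbytes} bytes')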
13. Run data analytics on maps.
Compute a checksum for each chunk and use Elasticsearch & Kibana.
(Figure: frequency distribution of checksums.)
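A minimal sketch of that pipeline, assuming the official elasticsearch Python client and a local cluster; the index name and document fields are illustrative, and it reuses list_chunks() and read_chunk() from the earlier sketches.

import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

for offset, nbytes, pos in list_chunks('example.hdf.map'):
    raw = read_chunk('example.hdf', offset, nbytes)
    es.index(index='hdf4-chunks', document={
        'file': 'example.hdf',
        'position': pos,
        'nbytes': nbytes,
        # Chunks with the same MD5 checksum hold the same bytes.
        'md5': hashlib.md5(raw).hexdigest(),
    })

Kibana can then plot the frequency distribution of the md5 field directly from this index.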
14. Some chunks are repeated.
A single HDF4 file has 160+ chunks of the same data.
(Chunks with the same checksum have the same data.)
15. At the collection level, it scales up.
Hundreds of HDF4 files contain the same 16K chunks of data.
16. Elasticsearch with maps can help users locate the HDF4 file of interest.
(Figure: the smallest dataset visualizes as nothing interesting; the largest as the most interesting.)
17. Store chunks as cloud objects
• Reduce storage cost (e.g., S3) by avoiding redundancy (see the sketch below).
• Make each chunk searchable through a search engine.
• Run cloud computing on chunks of interest.
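One hedged sketch of checksum-keyed chunk storage with boto3; the bucket name is a hypothetical placeholder. Keying each object by its MD5 means an identical chunk is stored exactly once, however many files or positions it appears in.

import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-hdf4-chunks'  # hypothetical bucket name

def store_chunk(raw):
    """Upload a chunk keyed by its checksum; duplicates are stored once."""
    key = hashlib.md5(raw).hexdigest()
    try:
        s3.head_object(Bucket=BUCKET, Key=key)  # already stored: skip upload
    except ClientError:
        s3.put_object(Bucket=BUCKET, Key=key, Body=raw)
    return key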
18. Shallow Web is not Enough
• NASA Earthdata search is too shallow.
• Index HDF4 data using maps and build a deep web.
• Provide a search interface for the deep web (example query below).
• Frequently searched data can be cached as cloud objects.
• Users can run cloud computing on cached objects in real time.
• Verify results against the HDF4 archives at NASA data centers.
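As one sketch of what a deep web query could look like, here is a search against the illustrative chunk index built earlier: return the largest chunks of a given file, which the byte-size heuristic flags as the most informative.

hits = es.search(
    index='hdf4-chunks',
    query={'match': {'file': 'example.hdf'}},
    sort=[{'nbytes': 'desc'}],  # biggest chunks first
    size=10,
)
for hit in hits['hits']['hits']:
    src = hit['_source']
    print(src['file'], src['position'], src['nbytes'])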
19. HDF: An Antifragile Solution for BACC
(BACC = Big Data Analytics in Cloud Computing)
1. Use the HDF archive as is. Create maps for the HDF files.
2. Maps can be indexed and searched.
3. ELT (Extract, Load, Transform) only the relevant data from HDF into the cloud.
4. Offset/length-based file I/O is universal: all existing BACC solutions will work, with no dependency on HDF APIs (see the byte-range sketch below).
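Point 4 is easy to demonstrate: the speaker notes mention reading data through an Apache server that supports byte ranges, and the same offset/length pair works over HTTP from any language. A minimal Python sketch with requests; the URL is a placeholder.

import requests

def read_chunk_http(url, offset, nbytes):
    """Fetch one chunk via an HTTP Range request; no HDF API involved."""
    headers = {'Range': f'bytes={offset}-{offset + nbytes - 1}'}  # inclusive range
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # a range-capable server answers 206 Partial Content
    return r.content

raw = read_chunk_http('https://example.org/data/example.hdf', 70798703, 2468)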
20. Future Work
1. An HDF5 Mapping Project?
2. Use HDF Product Designer for archiving cloud objects and analytics results in HDF5.
3. Re-map: to metadata is human, to data is divine. For the same binary object, a user can easily re-define the meaning of the data, re-index it, search it, and analyze it (e.g., serve the same binary data in Chinese, Spanish, Russian, etc.).
21. This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C.

Editor's Notes

1. Good morning, everyone! My name is Joe Lee and I'm a software engineer at The HDF Group. Although I have attended past ESIP meetings regularly, I could not travel this summer. The ESIP meeting is a great place to learn and share new ideas and technologies through face-to-face conversation, so I apologize for presenting my new idea over telecon.
2. Although you may have heard about the Hierarchical Data Format, let me start my presentation with a very short introduction to HDF. HDF is similar to PDF in many ways as a free and portable binary format, although HDF's brand power is much weaker than PDF's. Everybody knows that PDF is for publishing documents. HDF is for publishing data of any size, big or small. For example, NASA has used HDF for several decades to archive big data such as Earth observations because it is good for sharing and archiving. HDF has two incompatible formats, called HDF4 and HDF5. As the numbers indicate, HDF4 is the old format and HDF5 is the relatively new one.
3. The idea that I am going to present today is mainly about HDF4, because HDF4 is old. I cannot tell exactly how old HDF4 is because I don't want to discriminate against any file format based on its age. Old can mean many different things, both good and bad. For example, old means a large volume of Earth data has been archived in HDF4. Old also means that HDF4 has some limits that today's technology has already overcome. As technology advances very fast, you'll see fewer and fewer tools that support HDF4. I put an image of a CD player here because HDF4 reminds me of the CD player in my 20-year-old car. In 1995, I had to pay extra money for it as a premium car audio option.
4. Last November, my 20-year-old car finally broke down after racking up 250 thousand miles, so I went shopping for a new car. I was surprised to find that new cars do not have CD players any more. Instead, they have USB or SD memory card slots and accept the MP3 format. I'm telling this story because the modernization of HDF4 data is necessary before it gets too old to sustain. Since HDF5 is not backward-compatible with HDF4, HDF5 users need to convert HDF4 to HDF5 if their tools do not support HDF4. The HDF Group already provides the h4toh5 conversion tool. This is a good solution, as long as you are willing to convert millions of HDF4 files into HDF5 files.
5. Thinking about a future alternative, like a Tesla that can stream music from the cloud, I think Earth data streaming from the cloud is the way to go. So converting HDF4 to HDF5 is an OK solution, but I think there should be an alternative if we'd like to modernize old HDF4 data in the cloud age.
6. I found the word "Cloudification" and I like it a lot. Wiktionary defines it as "the conversion and/or migration of data and application programs in order to make use of cloud computing."
7. Why does cloud computing matter? I don't think I have to explain it any more, thanks to IBM Watson and Google AlphaGo. When combined with AI and big data, cloud computing can do amazing things, like beating human experts.
8. For another example: last winter, I was involved in a project called the data container study. I ran a machine learning experiment on 20 years of NASA sea surface temperature data near Peru, from 1987 to 2008, using the Open Science Data Cloud, and I could detect an anomaly in a few seconds. The result matched nicely with the 1998 El Niño. The Open Science Data Cloud was very convenient and fast.
9. What I also learned from the data container study is that efficient I/O is the key. OSDC provides 200 terabytes of public data in HDF4 format. However, that data is not directly usable for me because OSDC does not provide any search interface to the collection similar to NASA Earthdata search. OSDC only provides a list of the HDF4 file names available, and all I can do is transfer a collection of HDF4 files "as is" from cloud storage to the computing nodes. This is horribly inefficient because I need a way to search and filter only the relevant data to speed up my data analytics at the collection level. Thus, I came up with the idea of using the HDF4 file content map to maximize the utilization of cloud computing. A single binary HDF4 file can have multiple data objects represented as arrays, groups, tables, attributes, and so on. Each object can be precisely located using the offset from the beginning of the file and the number of bytes to read, both given by the HDF4 file content map. The rationale is that if only the relevant objects can be searched for and loaded into the data analytics engine, you can reduce the amount of I/O and get the result much faster. Without shredding thousands of HDF4 files into objects with HDF4 maps, you must load 200 TB of data into computing nodes, process it, and throw it away, and you must repeat this for every analytics job. You wait days for I/O operations while the actual data analytics takes only a few seconds.
10. So what is this HDF4 file content map that I'm talking about? It is an XML file that maps the content of the HDF4 binary file. Unless you're a hacker working for the NSA, it's hard to know what's inside the HDF4 binary file shown in the slide. An HDF4 binary file is a long stream of bytes, and the HDF4 map file tells you how to decode the stream correctly.
11. Interpreting the binary data is possible because the file content map is full of addresses. In HDF, a dataset can be organized into chunks for efficient I/O, and the HDF4 map tells you where to find each chunk of data. The chunk position in the array is a good indication of where the data is located on Earth if the dataset is a grid. Because the map fully discloses the offset and the number of bytes to read from the binary file, you don't need the HDF4 libraries to access a chunk of data.
12. If you read the file content map carefully, you can find some interesting patterns in the byte size of each chunk. The fillValues XML tag indicates that there is nothing to analyze in that chunk. A small chunk size indicates that the chunk contains a lot of repeated information, so it compresses well. A big chunk size indicates that the chunk has more information than the other, better-compressed chunks.
13. To find useful data in a huge collection of HDF4 files on OSDC, I computed the MD5 checksum of each chunk, indexed the chunks with Elasticsearch, and visually inspected the frequency distribution of the checksums with Kibana. MD5 checksums on individual chunks are not provided by h4mapwriter yet, so I created a separate script in Python.
14. Running some analytics on an HDF4 file using the HDF4 map was a lot of fun. It revealed that the same chunk of data is repeated within a single HDF4 file.
15. At the collection level, it scales up nicely. Hundreds of HDF4 files contain the same 16K chunks of data. This makes sense because some observations of the Earth stay the same for a long period of time.
16. Once the index is built with Elasticsearch, I can easily run queries to find a dataset that I'm interested in using the byte size information. For example, I could sort the datasets by size from smallest to largest across hundreds of HDF4 files. As expected, the smallest dataset showed almost nothing when visualized with HDFView. The largest dataset returned a colorful image.
17. Based on the HDF4 map information, I learned that it is possible to re-organize an entire collection of HDF4 data to optimize the use of cloud storage. If you optimize the data organization, ELT time for cloud computing will be shortened and the cost of storage will also be reduced. If you can build a search engine on top of those objects, advanced HDF4 users can run cloud computing directly on the HDF chunks of interest after filtering out irrelevant data based on the search results. Users can always transform HDF chunks into another format, such as Apache Parquet or JSON, to meet their cloud computing needs.
18. From the Elasticsearch experiment with HDF4 maps, I now have a new wish list for NASA Earthdata search. Although I like the new and improved NASA Earthdata search, I still think it's too shallow because it does not index what's inside the granules. If Earthdata search could index HDF4 maps and provide a search interface, collections at the chunk level could be returned for a user's query. I'd like to call such a search service "deep web search." For a chunk collection that deep web search returns, the user can stream the chunks to the user's own cloud storage. Here, the key is to deliver the chunk collection to the user's cloud service provider; downloading the entire HDF4 dataset does not make sense in this workflow. Then, the user can run an analytics job using cloud computing on the streamed chunks. If necessary, users can go back to the original HDF4 archives and run the same analytics using the traditional off-cloud method.
19. In summary, data archived in HDF is ready for big data analytics for the access patterns that the data producers prescribed. The prescribed patterns may not match exactly what users need. For such a scenario, HDF maps can be indexed and searched to identify the relevant pieces of HDF files. I call it an antifragile solution because any big data analytics solution, in any computer language, in any cloud computing environment, will work. For example, I could read data over the network in PHP using an Apache web server that supports byte ranges, and it worked pretty nicely. I picked PHP because no PHP binding exists for HDF. Relying on a single monolithic library to access data is too fragile.
20. You may wonder if the same solution can be applied to HDF5 or netCDF-4. Unfortunately, there is no HDF5 mapper tool yet. And how can a user easily save a collection of chunks in HDF5 for future use? I think HDF Product Designer is a good candidate for creating a new HDF5 file from chunk objects in the cloud. It can play the role of the h4toh5 conversion tool with on-demand, collection-level subset and aggregation capability. Finally, the HDF4 map idea has great potential as a flexible metadata solution. While binary is forever, metadata doesn't have to be. If you re-map the same binary data with a different dialect, you can serve a wider community that understands that dialect. One example is rewriting the HDF4 file content map in different languages; then international users can discover and access Earthdata more easily. Thank you for listening, and I hope you can use HDF4 maps wisely in your next cloud computing project. Do you have any questions?