This document discusses using HDF4 file content maps to enable cloud computing capabilities for HDF4 files. HDF4 files contain scientific data but their large size and legacy format pose challenges. The document proposes creating XML maps that describe HDF4 file structure and contents, including chunk locations and sizes. These maps could then be indexed and searched to locate relevant data chunks. Only those chunks would need to be extracted to the cloud, avoiding unnecessary data transfers. This would allow HDF4 files to be queried and analyzed using cloud-based tools while reducing storage costs.
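To make the idea concrete, here is a minimal, hypothetical sketch of the workflow described above: given (offset, length) entries taken from a content map, only those byte ranges are read from the HDF4 file and uploaded as individual cloud objects. The file name, bucket, and map entries are placeholders, not from the original document.

```python
# Hypothetical sketch: extract only the chunks named in a content map and
# upload them as individual cloud objects.
import boto3

# (offset, length) pairs as they might be recorded in an HDF4 content map
relevant_chunks = [(2948, 16384), (19332, 16384)]

s3 = boto3.client("s3")
with open("MOD021KM.A2000055.hdf", "rb") as f:          # hypothetical granule
    for i, (offset, length) in enumerate(relevant_chunks):
        f.seek(offset)                                   # jump straight to the chunk
        chunk_bytes = f.read(length)                     # read only what the map says
        s3.put_object(Bucket="my-hdf4-chunks",           # hypothetical bucket
                      Key=f"MOD021KM/chunk-{i:04d}",
                      Body=chunk_bytes)
```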
These slides demonstrate how to use visualization and analysis tools such as IDV and GrADS to access HDF data via OPeNDAP.
To see animation in some slides, please visit:
http://hdfeos.org/workshops/ws13/presentations/day1/jxl_opendap_tutorial.ppt
This tutorial is designed for HDF5 users with some HDF5 experience.
It will cover advanced features of the HDF5 library for achieving better I/O performance and efficient storage. The following HDF5 features will be discussed: partial I/O, chunked storage layout, compression, and other filters, including the new n-bit and scale-offset filters. Significant time will be devoted to the discussion of complex HDF5 datatypes such as strings, variable-length datatypes, and array and compound datatypes.
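As an illustration of the chunking, compression, and partial-I/O features listed above, here is a small hedged sketch using the h5py Python wrapper; the dataset name, shape, and chunk sizes are placeholders, not material from the tutorial itself.

```python
# Illustrative sketch: chunked storage, gzip compression, and partial I/O with h5py.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("temperature",
                            shape=(3600, 7200), dtype="f4",
                            chunks=(360, 720),           # chunked storage layout
                            compression="gzip",          # deflate filter
                            compression_opts=4)
    dset[0:360, 0:720] = np.random.rand(360, 720)        # write one chunk's worth

with h5py.File("example.h5", "r") as f:
    tile = f["temperature"][0:100, 0:100]                # partial I/O: read a hyperslab
    print(tile.mean())
```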
This tutorial is designed for new HDF5 users. We will go over a brief history of HDF and HDF5 software, and will cover basic HDF5 Data Model objects and their properties; we will give an overview of the HDF5 Libraries and APIs, and discuss the HDF5 programming model. Simple C and Fortran examples, and Java tool HDFView will be used to illustrate HDF5 concepts.
This tutorial is designed for anyone who needs to work with data stored in HDF5 files. The tutorial will cover functionality and useful features of the HDF5 utilities h5dump, h5diff, h5repack, h5stat, h5copy, h5check and h5repart. We will also introduce a prototype of the new h52jpeg conversion tool and recently released h5perf_serial tool used for performance studies. We will briefly introduce HDFView. Details of the HDFView and HDF-Java will be discussed in a separate tutorial.
In this talk we will discuss what happens to data when it is written from the HDF5 application to an HDF5 file. This knowledge will help developers to write more efficient applications and to avoid performance bottlenecks.
Fast partial access to objects from very large files in the SDSC Storage Resource Broker (SRB5) can be extremely challenging, even when those objects are small. The HDF-SRB project integrates the SRB and the NCSA Hierarchical Data Format (HDF5) to create an access mechanism within the SRB that can be orders of magnitude more efficient than current methods for accessing object-based file formats.
The project provides interactive and efficient access to datasets, or subsets of datasets, in large files without bringing entire files to local machines. A new set of data structures and APIs has been implemented in the SRB to support such object-level data access. A working prototype of the HDF5-SRB data system has been developed and tested. SRB support is implemented in HDFView as a client application.
A preponderance of data from NASA's Earth Observing System (EOS) is archived in the HDF Version 4 (HDF4) format. The long-term preservation of these data is critical for climate and other scientific studies going many decades into the future. HDF4 is very effective for working with the large and complex collection of EOS data products. Unfortunately, because of the complex internal byte layout of HDF4 files, future readability of HDF4 data depends on preserving a complex software library that can interpret that layout. Having a way to access HDF4 data independent of a library could improve its viability as an archive format, and consequently give confidence that HDF4 data will be readily accessible forever, even if the HDF4 library is gone.
To address the need to simplify long-term access to EOS data stored in HDF4, a collaborative project between The HDF Group and NASA Earth Science Data Centers is implementing an approach to accessing data in HDF4 files based on the use of independent maps that describe the data in HDF4 files and tools that can use these maps to recover data from those files. With this approach, relatively simple programs will be able to extract the data from an HDF4 file, bypassing the need for the HDF4 library.
A demonstration project has shown that this approach is feasible. This involved an assessment of NASA's HDF4 data holdings, and development of a prototype XML-based layout mapping language and tools to read layout maps and read HDF4 files using layout maps. Future plans call for a second phase of the project, in which the mapping tools and XML schema are made production quality, the mapping schema are integrated with existing XML metadata files in several data centers, and outreach activities are carried out to encourage and facilitate acceptance of the technology.
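A hedged sketch of what such a "relatively simple program" might look like follows. The XML element and attribute names are assumptions chosen for illustration, not the exact prototype mapping schema, and the file name is a placeholder; the point is that the map alone supplies enough addressing information to recover the data without the HDF4 library.

```python
# Hypothetical sketch of map-driven access, bypassing the HDF4 library.
import xml.etree.ElementTree as ET
import numpy as np

map_xml = """
<dataset name="sst" dtype="int16" shape="180 360">
  <byteStream offset="2948" nBytes="129600"/>
</dataset>
"""

node = ET.fromstring(map_xml)
stream = node.find("byteStream")
offset, nbytes = int(stream.get("offset")), int(stream.get("nBytes"))
shape = tuple(int(n) for n in node.get("shape").split())

# Read the raw dataset bytes directly -- no HDF4 library involved.
with open("legacy_granule.hdf", "rb") as f:              # hypothetical file name
    f.seek(offset)
    raw = f.read(nbytes)

sst = np.frombuffer(raw, dtype=">i2").reshape(shape)     # big-endian int16 (assumed)
```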
Cloudian releases CLOUDIAN HyperStore 5.1, enabling big data to be used as "smart data" ahead of the full-scale adoption of IoT/M2M ~ CLOUDIAN HyperStore 5.1 software and appliances officially certified for Hadoop and the Hortonworks Data Platform, enabling petabyte-scale analytics ~
http://cloudian.jp/news/pressrelease_detail/press-release-34.html
Cloudian HyperStore Ushers in Era of Smart Data With Efficient, Scalable Storage for Internet of Things ~ With Hadoop and Hortonworks Data Platform Qualified on HyperStore 5.1 Software and Appliances, Customers Can Perform In-Place Data Analysis at Petabyte-Scale; Cloudian Becomes Hortonworks Certified Technology Partner ~
http://www.cloudian.com/news/press-releases/cloudian-hyperstore-5.1-ushers-in-era-of-smart-data.php
http://hortonworks.com/partner/cloudian/
http://hortonworks.com/wp-content/uploads/2014/08/Cloudian-Hortonworks-Solutions-Brief.pdf
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Safe Software
Once in a while, there really is something new under the sun. The rise of cloud-hosted data has fueled innovation in spatial data storage, enabling a brand new serverless architectural approach to spatial data sharing. Join us in our upcoming webinar to learn all about these new ways to organize your data, and leverage data shared by others. Explore the potential of Cloud Native Geospatial Formats in your workflows with FME, as we introduce six new formats: COG, COPC, FlatGeobuf, GeoParquet, STAC, and Zarr.
Learn from industry experts Michelle Roby from Radiant Earth and Chris Holmes from Planet about these cloud-native geospatial data formats and how they can make data easier to manage, share, and analyze. To get us started, they’ll explain the goals of the Cloud-Native Geospatial Foundation and provide overviews of cloud-native technologies including the Cloud-Optimized GeoTIFF (COG), SpatioTemporal Asset Catalogs (STAC), and GeoParquet.
Following this, our seasoned FME team will guide you through practical demonstrations, showcasing how to leverage each format to its fullest potential. Learn strategic approaches for seamless integration and transition, along with valuable tips to enhance performance using these formats in FME.
Discover how these formats are reshaping geospatial data handling and how you can seamlessly integrate them into your FME workflows and harness the explosion of cloud-hosted data.
In this talk we will examine how to tune HDF5 performance to improve I/O speed. The talk will focus on chunk and metadata caches, how they affect performance, and which HDF5 APIs that can be used for performance tuning.
Examples of different chunking strategies will be given. We will also discuss how to reduce file overhead by using special properties of the HDF5 groups, datasets and datatypes.
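As a concrete illustration of the chunk-cache tuning mentioned above, here is a minimal sketch assuming the h5py wrapper is used; the file name, dataset name, and cache sizes are placeholder assumptions rather than values from the talk.

```python
# Hedged sketch of chunk-cache tuning via h5py's file-access parameters.
import h5py

# Enlarge the raw data chunk cache so repeatedly accessed chunks stay resident:
# 64 MiB cache, a large prime-ish number of hash slots, and full preemption of
# fully-read chunks (rdcc_w0=1.0) for write-once/read-many access patterns.
with h5py.File("example.h5", "r",
               rdcc_nbytes=64 * 1024 * 1024,
               rdcc_nslots=1_000_003,
               rdcc_w0=1.0) as f:
    dset = f["temperature"]
    col = dset[:, 0]          # strided reads benefit from a larger chunk cache
```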
A brief introduction to the Hadoop Distributed File System (HDFS). How a file is broken into blocks, written, and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS 1.x.
HDFS tiered storage: mounting object stores in HDFS
DataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to HDFS that allows external namespaces, both object stores and other HDFS clusters, to be mounted. Building on this functionality, remote namespaces can now be synchronized with HDFS, enabling asynchronous writes to the remote storage and the ability to synchronously and transparently read data back into a local application that needs file data stored remotely. This talk, which corresponds to the work in progress under HDFS-12090, will present how a Hadoop admin can manage storage tiering between clusters and how that is handled inside HDFS through the snapshotting mechanism and asynchronous satisfaction of the storage policy.
Speaker
Thomas Demoor, Object Storage Architect, Western Digital
Ewan Higgs, Software Engineer, Western Digital
Similar to Utilizing HDF4 File Content Maps for the Cloud Computing (20)
GraphRAG is All You Need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Key Trends Shaping the Future of Infrastructure
Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster: a step-by-step guide to success!
KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
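As mentioned above, PowSyBl offers a Python binding for Python developers. Here is a minimal hedged sketch of what that binding's workflow might look like, assuming the pypowsybl package; the function names should be checked against the current PowSyBl documentation.

```python
# Hedged sketch: load a bundled test network and run an AC power flow with pypowsybl.
import pypowsybl as pp

network = pp.network.create_ieee14()       # bundled IEEE 14-bus test network
results = pp.loadflow.run_ac(network)      # run an AC power flow
print(results[0].status)                   # convergence status of the main component
print(network.get_buses().head())          # bus data as a pandas DataFrame
```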
Utilizing HDF4 File Content Maps for the Cloud Computing
1. Utilizing HDF4 File Content Maps for the Cloud Computing
Hyokyung Joe Lee
The HDF Group
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C.
2. HDF File Format is for Data.
• PDF for Document, HDF for Data
• Why PDF over MS Word DOC?
– Free, Portable, Sharing & Archiving
• Why HDF over MS Excel XLS(X)?
– Free, Portable, Sharing & Archiving
• HDF: HDF4 & HDF5
3. HDF4 is “old” format.
• Old = Large volume over long time
• Old = Limitation (32-bit)
• Old = More difficult to sustain
17. Store chunks as cloud objects
• Reduce storage cost (e.g., S3) by avoiding redundancy.
• Make each chunk searchable through a search engine.
• Run cloud computing on chunks of interest.
18. Shallow Web is not Enough
• NASA Earthdata search is too shallow.
• Index HDF4 data using maps and make a deep web.
• Provide a search interface for the deep web.
• Frequently searched data can be cached as cloud objects.
• Users can run cloud computing on cached objects in real time.
• Verify results with HDF4 archives from NASA data centers.
19. HDF: Antifragile Solution for BACC
(BACC = Big Data Analytics in Cloud Computing)
1. Use the HDF archive as is. Create maps for HDF.
2. Maps can be indexed and searched.
3. ELT (Extract, Load, Transform) only relevant data into the cloud from HDF.
4. Offset/length-based file I/O is universal - all existing BACC solutions will work. No dependency on HDF APIs.
20. Future Work
1. HDF5 Mapping Project?
2. Use HDF Product Designer for archiving cloud objects and analytics results in HDF5.
3. Re-map: To metadata is human, to data is divine. For the same binary object, a user can easily re-define the meaning of data, re-index it, search it, and analyze it (e.g., serve the same binary data in Chinese, Spanish, Russian, etc.).
Good morning, everyone! My name is Joe Lee and I'm a software engineer at The HDF Group. Although I have attended past ESIP meetings regularly, I could not travel this summer. The ESIP meeting is a great place to learn and share new ideas and technologies through face-to-face conversation, so I apologize for presenting my new idea over telecon.
Although you may have heard about Hierarchical Data Format, let me start my presentation by giving a very short introduction to HDF.
HDF is similar to PDF in many ways as a free & portable binary format although the brand power of HDF is much weaker than the brand power of PDF.
Everybody knows that PDF is for publishing documents. HDF is for publishing data of any size – big or small.
For example, NASA has used HDF for several decades to archive big data, such as Earth observation data, because it is good for sharing and archiving.
HDF has two incompatible formats called HDF4 and HDF5.
As the numbers indicate, HDF4 is the older format and HDF5 is the relatively new one.
The idea that I am going to present today is mainly about HDF4 because HDF4 is old.
I cannot tell you exactly how old HDF4 is because I don't want to discriminate against any file format based on its age.
Old can mean many different things – both good and bad.
For example, old means a large volume of earth data has been archived in HDF4.
Old also means that HDF4 has some limits that are already overcome by today’s technology.
As the technology advances very fast, you’ll see fewer tools that support HDF4.
I put an image of a CD player here because HDF4 reminds me of the CD player in my 20-year-old car.
In 1995, I had to pay extra money for it as a premium car audio option.
Last November, my 20-year-old car finally broke down after racking up 250 thousand miles, so I went shopping for a new car.
I was surprised to learn that new cars do not have CD players any more.
Instead, they have USB or SD memory card slots and they accept MP3 formats.
I’m telling this story because the modernization of HDF4 data is necessary before it gets too old to sustain.
Since HDF5 is not backward-compatible with HDF4, HDF5 users need to convert HDF4 files to HDF5 if their tools do not support HDF4.
The HDF Group already provides h4toh5 conversion tool.
This is a good solution as long as you are willing to convert millions of HDF4 files into HDF5 files.
Thinking about a future alternative, like a Tesla that can stream music from the cloud, I think streaming Earth data from the cloud is the way to go.
So, converting HDF4 to HDF5 is an OK solution, but I think there should be an alternative if we want to modernize old HDF4 data in the cloud age.
I found the word “Cloudification” and I like it a lot. Wiktionary defines it as “The conversion ….”
Why does cloud computing matter? I think I don’t have to explain it any more thanks to IBM Watson and Google AlphaGo. When combined with AI and big data, cloud computing can do amazing things like beating human experts.
For another example, last winter I was involved in a project called the data container study. I ran a machine learning experiment with 20 years of NASA sea surface temperature data near Peru, from 1987 to 2008, using the Open Science Data Cloud, and I could detect an anomaly in a few seconds. The result matched nicely with the 1998 El Niño. The Open Science Data Cloud was very convenient and fast.
What I also learned from the data container study is that efficient I/O is the key.
OSDC provides 200 terabytes of public data in HDF4 format.
However, they are not directly usable for me because OSDC does not provide any search interface to the collection that is similar to NASA Earthdata search.
OSDC only provides a list of HDF4 file names available and all I can do is to transfer a collection of HDF4 files “as is” from cloud storage to the computing nodes.
This is horribly inefficient because I need a way to search and filter only the relevant data to speed up my data analytics at the collection level.
Thus, I came up with an idea to use the HDF4 file content map to maximize the utilization of cloud computing. A single binary HDF4 file can hold multiple data objects represented as arrays, groups, tables, attributes, and so on. Using the HDF4 file content map, each object can be precisely located with the offset from the beginning of the file and the number of bytes to read. The rationale is that if only the relevant objects are searched and loaded into the data analytics engine, you can reduce the amount of I/O and thus get the result much faster. Without shredding thousands of HDF4 files into objects with HDF4 maps, you must load 200 TB of data into computing nodes, process them, and throw them away. You must repeat this for every analytics job. You wait days for I/O while the actual data analytics takes only a few seconds.
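A hedged sketch of that selective-I/O idea applied to cloud storage: pull a single chunk with an HTTP byte-range request instead of downloading the whole granule. The bucket, key, and byte addresses are placeholders standing in for values a map entry would supply.

```python
# Hedged sketch: read one chunk from object storage by byte range.
import boto3

s3 = boto3.client("s3")
offset, length = 2948, 16384                          # from the HDF4 content map
resp = s3.get_object(Bucket="osdc-public-hdf4",       # hypothetical bucket name
                     Key="sst/1998/granule_001.hdf",  # hypothetical object key
                     Range=f"bytes={offset}-{offset + length - 1}")
chunk_bytes = resp["Body"].read()                     # only ~16 KB crosses the network
```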
So what is the HDF4 file content map that I'm talking about? It is an XML file that maps the content of an HDF4 binary file.
Unless you're a hacker working for the NSA, it's hard to know what's inside the HDF4 binary file, as shown in the slide.
An HDF4 binary file is a long stream of bytes, and the HDF4 map file tells you how to decode the stream correctly.
Interpreting the binary data is possible because the file content map is full of addresses.
In HDF, a dataset can be organized into chunks for efficient I/O, and the HDF4 map can tell you where to find a chunk of data.
The chunk position in the array is a good indication of where the data is located on Earth if the dataset is a grid.
By fully disclosing the offset and the number of bytes to read from the binary file, the map lets you access a chunk of data without the HDF4 library.
If you read the file content map carefully, you can find some interesting patterns in the byte size of each chunk.
The fillValues XML tag indicates that there's nothing to be analyzed in the chunk.
A small chunk size indicates that the chunk contains a lot of repeated information, so it compresses well.
A large chunk size indicates that the chunk carries more information than the other compressed chunks.
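A minimal sketch of this kind of pattern hunting, assuming the chunk byte sizes can be pulled from a map file; the map file name and XML element/attribute names are assumptions for illustration.

```python
# Hedged sketch: tally chunk byte sizes recorded in a content map.
from collections import Counter
import xml.etree.ElementTree as ET

tree = ET.parse("granule_001.hdf.map")                # hypothetical map file
sizes = Counter(int(b.get("nBytes"))
                for b in tree.iter("byteStream"))     # assumed element/attribute names

for nbytes, count in sizes.most_common(10):
    print(f"{count:6d} chunks of {nbytes:8d} bytes")
# Very small sizes usually mean highly compressible (often fill-value) chunks;
# unusually large sizes mean chunks that carry more real information.
```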
To find useful data in a huge collection of HDF4 files in OSDC, I indexed the chunks with Elasticsearch and visually inspected the frequency distribution of checksums with Kibana after computing the MD5 checksum of each chunk.
MD5 checksums on individual chunks are not provided by h4mapwriter yet, so I created a separate script in Python.
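A sketch of what such a script might look like, assuming the (offset, length) pairs have already been parsed from the map; the file name and chunk addresses are placeholders.

```python
# Hedged sketch: compute an MD5 digest for every chunk listed in a map.
import hashlib

def chunk_md5(hdf4_path, offset, length):
    with open(hdf4_path, "rb") as f:
        f.seek(offset)
        return hashlib.md5(f.read(length)).hexdigest()

chunks = [(2948, 16384), (19332, 16384)]              # parsed from the map file
digests = [chunk_md5("granule_001.hdf", off, ln) for off, ln in chunks]
print(digests)   # identical digests reveal duplicated chunks within and across files
```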
Running some analytics on HDF4 files using the HDF4 map was a lot of fun.
It revealed that the same chunk of data is repeated within an HDF4 file.
At the collection level, it scales up nicely.
Hundreds of HDF4 files contain the same 16 KB chunk of data.
This makes sense because some observations of the Earth will stay the same for a long period of time.
Once the index is built with Elasticsearch, I can easily run a query to find a dataset that I'm interested in using the byte size information.
For example, I could sort datasets by size from the smallest to the largest over hundreds of HDF4 files.
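A hedged sketch of such a query, assuming chunk documents were indexed into Elasticsearch with fields like "nBytes", "variable", and "file"; the index name and field names are placeholder assumptions.

```python
# Hedged sketch: ask Elasticsearch for the smallest datasets first.
import requests

query = {
    "size": 5,
    "sort": [{"nBytes": {"order": "asc"}}],           # smallest datasets first
    "query": {"match": {"variable": "sea_surface_temperature"}},
}
resp = requests.get("http://localhost:9200/hdf4-chunks/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["file"], hit["_source"]["nBytes"])
```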
As expected, the dataset with the smallest byte size showed almost nothing when visualized with HDFView.
The largest byte size dataset returned a colorful image.
Based on the HDF4 map information, I learned that it is possible to re-organize the entire collection of HDF4 data to optimize the use of cloud storage.
If you optimize the data organization, ETL time for cloud computing will be shortened and the cost of storage will also be reduced.
If you can build a search engine on top of those objects, advanced HDF4 users can run cloud computing directly on the HDF chunks they are interested in, after filtering out irrelevant data based on the search results.
Users can always transform HDF chunks into other formats such as Apache Parquet or JSON to meet their cloud computing needs.
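A minimal sketch of that transformation step, assuming a chunk has already been stored as a cloud object and its dtype is known from the map; the dtype, variable name, and file names are placeholder assumptions.

```python
# Hedged sketch: decode a raw chunk into an array and write it as Parquet.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

raw = open("chunk-0001.bin", "rb").read()              # a chunk stored as a cloud object
sst = np.frombuffer(raw, dtype=">i2").astype("int16")  # big-endian int16, per the map (assumed)
table = pa.table({"sst": sst})
pq.write_table(table, "chunk-0001.parquet")            # ready for Spark, Presto, etc.
```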
From the Elasticsearch experiment with the HDF4 map, I now have a new wish list for NASA Earthdata search.
Although I like the new and improved NASA Earthdata search, I still think it's too shallow because it does not index what's inside the granules.
If Earthdata search could index HDF4 maps and provide a search interface, chunk-level collections could be returned for a user's query.
I'd like to call such a search service deep web search.
For a chunk collection that the deep web search returns, the user can stream the chunks to their own cloud storage.
Here, the key is to deliver the chunk collection to the user's cloud service provider.
Downloading the entire HDF4 data does not make sense in this workflow.
Then, the user can run their analytics job using cloud computing on the streamed chunks.
If necessary, users can always go back to the original HDF4 archives and run the same analytics using the traditional off-cloud method.
In summary, data archived in HDF is ready for big data analytics for the certain access patterns that the data producers prescribed.
The prescribed pattern may not match exactly what users need. For such a use-case scenario, HDF maps can be indexed and searched to identify the relevant pieces of HDF.
I call it an antifragile solution because any big data analytics solution, in any computer language, in any cloud computing environment will work. For example, I could read data over the network in PHP using an Apache web server that supports byte-range requests, and it worked pretty nicely. I picked PHP because no PHP binding exists for HDF. Relying on a single monolithic library to access data is too fragile.
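The PHP code itself is not included in these notes; as a rough analogue, here is a minimal Python sketch of the same byte-range idea against a plain web server. The URL and byte addresses are placeholders standing in for values from a map entry.

```python
# Hedged sketch: any HTTP client that can send a Range header can fetch one chunk.
import requests

offset, length = 2948, 16384                          # from the HDF4 content map
resp = requests.get("https://example.org/archive/granule_001.hdf",
                    headers={"Range": f"bytes={offset}-{offset + length - 1}"})
assert resp.status_code == 206                        # 206 Partial Content
chunk_bytes = resp.content
```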
You may wonder if the same solution can be applied to HDF5 or netCDF4.
Unfortunately, there is no HDF5 mapper tool yet.
How can a user easily save a collection of chunks in HDF5 for future use? I think HDF Product Designer is a good candidate for creating a new HDF5 file from chunk objects in the cloud. It can play the role of the h4toh5 conversion tool, with on-demand, collection-level subset/aggregation capability.
Finally, the HDF4 map idea has great potential as a flexible metadata solution. While binary is forever, metadata doesn't have to be. If you re-map the same binary data with a different dialect, you can serve a wider community that understands that dialect. One example is rewriting the HDF4 file content map in different languages. Then, international users can discover and access Earthdata more easily.
Thank you for listening and I hope that you can use HDF4 map wisely in your next cloud computing project.
Do you have any question?