This document discusses using HDF4 file content maps to enable cloud computing capabilities for HDF4 files. HDF4 files contain scientific data but their large size and legacy format pose challenges. The document proposes creating XML maps that describe HDF4 file structure and contents, including chunk locations and sizes. These maps could then be indexed and searched to locate relevant data chunks. Only those chunks would need to be extracted to the cloud, avoiding unnecessary data transfers. This would allow HDF4 files to be queried and analyzed using cloud-based tools while reducing storage costs.
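To make the idea concrete, here is a minimal, hypothetical sketch of the workflow described above: given (offset, length) entries taken from a content map, only those byte ranges are read from the HDF4 file and uploaded as individual cloud objects. The file name, bucket, and map entries are placeholders, not from the original document.

```python
# Hypothetical sketch: extract only the chunks named in a content map and
# upload them as individual cloud objects.
import boto3

# (offset, length) pairs as they might be recorded in an HDF4 content map
relevant_chunks = [(2948, 16384), (19332, 16384)]

s3 = boto3.client("s3")
with open("MOD021KM.A2000055.hdf", "rb") as f:          # hypothetical granule
    for i, (offset, length) in enumerate(relevant_chunks):
        f.seek(offset)                                   # jump straight to the chunk
        chunk_bytes = f.read(length)                     # read only what the map says
        s3.put_object(Bucket="my-hdf4-chunks",           # hypothetical bucket
                      Key=f"MOD021KM/chunk-{i:04d}",
                      Body=chunk_bytes)
```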
These slides demonstrate how to use visualization and analysis tools such as IDV and GrADS to access HDF data via OPeNDAP.
To see animation in some slides, please visit:
http://hdfeos.org/workshops/ws13/presentations/day1/jxl_opendap_tutorial.ppt
This tutorial is designed for HDF5 users with some HDF5 experience.
It will cover advanced features of the HDF5 library for achieving better I/O performance and efficient storage. The following HDF5 features will be discussed: partial I/O, chunked storage layout, compression, and other filters, including the new n-bit and scale-offset filters. Significant time will be devoted to the discussion of complex HDF5 datatypes such as strings, variable-length datatypes, and array and compound datatypes.
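As an illustration of the chunking, compression, and partial-I/O features listed above, here is a small hedged sketch using the h5py Python wrapper; the dataset name, shape, and chunk sizes are placeholders, not material from the tutorial itself.

```python
# Illustrative sketch: chunked storage, gzip compression, and partial I/O with h5py.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("temperature",
                            shape=(3600, 7200), dtype="f4",
                            chunks=(360, 720),           # chunked storage layout
                            compression="gzip",          # deflate filter
                            compression_opts=4)
    dset[0:360, 0:720] = np.random.rand(360, 720)        # write one chunk's worth

with h5py.File("example.h5", "r") as f:
    tile = f["temperature"][0:100, 0:100]                # partial I/O: read a hyperslab
    print(tile.mean())
```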
This tutorial is designed for new HDF5 users. We will go over a brief history of HDF and HDF5 software, and will cover basic HDF5 Data Model objects and their properties; we will give an overview of the HDF5 Libraries and APIs, and discuss the HDF5 programming model. Simple C and Fortran examples, and Java tool HDFView will be used to illustrate HDF5 concepts.
This tutorial is designed for anyone who needs to work with data stored in HDF5 files. The tutorial will cover functionality and useful features of the HDF5 utilities h5dump, h5diff, h5repack, h5stat, h5copy, h5check and h5repart. We will also introduce a prototype of the new h52jpeg conversion tool and recently released h5perf_serial tool used for performance studies. We will briefly introduce HDFView. Details of the HDFView and HDF-Java will be discussed in a separate tutorial.
In this talk we will discuss what happens to data when it is written from the HDF5 application to an HDF5 file. This knowledge will help developers to write more efficient applications and to avoid performance bottlenecks.
Fast partial access to objects from very large files in the SDSC Storage Resource Broker (SRB5) can be extremely challenging, even when those objects are small. The HDF-SRB project integrates the SRB and the NCSA Hierarchical Data Format (HDF5) to create an access mechanism within the SRB that can be orders of magnitude more efficient than current methods for accessing object-based file formats.
The project provides interactive and efficient access to datasets, or subsets of datasets, in large files without bringing entire files to local machines. A new set of data structures and APIs has been implemented in the SRB to support such object-level data access. A working prototype of the HDF5-SRB data system has been developed and tested. SRB support is implemented in HDFView as a client application.
A preponderance of data from NASA's Earth Observing System (EOS) is archived in the HDF Version 4 (HDF4) format. The long-term preservation of these data is critical for climate and other scientific studies going many decades into the future. HDF4 is very effective for working with the large and complex collection of EOS data products. Unfortunately, because of the complex internal byte layout of HDF4 files, future readability of HDF4 data depends on preserving a complex software library that can interpret that layout. Having a way to access HDF4 data independent of a library could improve its viability as an archive format, and consequently give confidence that HDF4 data will be readily accessible forever, even if the HDF4 library is gone.
To address the need to simplify long-term access to EOS data stored in HDF4, a collaborative project between The HDF Group and NASA Earth Science Data Centers is implementing an approach to accessing data in HDF4 files based on the use of independent maps that describe the data in HDF4 files and tools that can use these maps to recover data from those files. With this approach, relatively simple programs will be able to extract the data from an HDF4 file, bypassing the need for the HDF4 library.
A demonstration project has shown that this approach is feasible. This involved an assessment of NASA's HDF4 data holdings, and development of a prototype XML-based layout mapping language and tools to read layout maps and read HDF4 files using layout maps. Future plans call for a second phase of the project, in which the mapping tools and XML schema are made production quality, the mapping schema are integrated with existing XML metadata files in several data centers, and outreach activities are carried out to encourage and facilitate acceptance of the technology.
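A hedged sketch of what such a "relatively simple program" might look like follows. The XML element and attribute names are assumptions chosen for illustration, not the exact prototype mapping schema, and the file name is a placeholder; the point is that the map alone supplies enough addressing information to recover the data without the HDF4 library.

```python
# Hypothetical sketch of map-driven access, bypassing the HDF4 library.
import xml.etree.ElementTree as ET
import numpy as np

map_xml = """
<dataset name="sst" dtype="int16" shape="180 360">
  <byteStream offset="2948" nBytes="129600"/>
</dataset>
"""

node = ET.fromstring(map_xml)
stream = node.find("byteStream")
offset, nbytes = int(stream.get("offset")), int(stream.get("nBytes"))
shape = tuple(int(n) for n in node.get("shape").split())

# Read the raw dataset bytes directly -- no HDF4 library involved.
with open("legacy_granule.hdf", "rb") as f:              # hypothetical file name
    f.seek(offset)
    raw = f.read(nbytes)

sst = np.frombuffer(raw, dtype=">i2").reshape(shape)     # big-endian int16 (assumed)
```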
Cloudian releases CLOUDIAN HyperStore 5.1, enabling big data to be used as "smart data" ahead of the full-scale adoption of IoT/M2M ~ CLOUDIAN HyperStore 5.1 software and appliances officially certified for Hadoop and the Hortonworks Data Platform, enabling petabyte-scale analytics ~
http://cloudian.jp/news/pressrelease_detail/press-release-34.html
Cloudian HyperStore Ushers in Era of Smart Data With Efficient, Scalable Storage for Internet of Things ~ With Hadoop and Hortonworks Data Platform Qualified on HyperStore 5.1 Software and Appliances, Customers Can Perform In-Place Data Analysis at Petabyte-Scale; Cloudian Becomes Hortonworks Certified Technology Partner ~
http://www.cloudian.com/news/press-releases/cloudian-hyperstore-5.1-ushers-in-era-of-smart-data.php
http://hortonworks.com/partner/cloudian/
http://hortonworks.com/wp-content/uploads/2014/08/Cloudian-Hortonworks-Solutions-Brief.pdf
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Safe Software
Once in a while, there really is something new under the sun. The rise of cloud-hosted data has fueled innovation in spatial data storage, enabling a brand new serverless architectural approach to spatial data sharing. Join us in our upcoming webinar to learn all about these new ways to organize your data, and leverage data shared by others. Explore the potential of Cloud Native Geospatial Formats in your workflows with FME, as we introduce six new formats: COG, COPC, FlatGeobuf, GeoParquet, STAC, and Zarr.
Learn from industry experts Michelle Roby from Radiant Earth and Chris Holmes from Planet about these cloud-native geospatial data formats and how they can make data easier to manage, share, and analyze. To get us started, they’ll explain the goals of the Cloud-Native Geospatial Foundation and provide overviews of cloud-native technologies including the Cloud-Optimized GeoTIFF (COG), SpatioTemporal Asset Catalogs (STAC), and GeoParquet.
Following this, our seasoned FME team will guide you through practical demonstrations, showcasing how to leverage each format to its fullest potential. Learn strategic approaches for seamless integration and transition, along with valuable tips to enhance performance using these formats in FME.
Discover how these formats are reshaping geospatial data handling and how you can seamlessly integrate them into your FME workflows and harness the explosion of cloud-hosted data.
In this talk we will examine how to tune HDF5 performance to improve I/O speed. The talk will focus on chunk and metadata caches, how they affect performance, and which HDF5 APIs that can be used for performance tuning.
Examples of different chunking strategies will be given. We will also discuss how to reduce file overhead by using special properties of the HDF5 groups, datasets and datatypes.
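As a concrete illustration of the chunk-cache tuning mentioned above, here is a minimal sketch assuming the h5py wrapper is used; the file name, dataset name, and cache sizes are placeholder assumptions rather than values from the talk.

```python
# Hedged sketch of chunk-cache tuning via h5py's file-access parameters.
import h5py

# Enlarge the raw data chunk cache so repeatedly accessed chunks stay resident:
# 64 MiB cache, a large prime-ish number of hash slots, and full preemption of
# fully-read chunks (rdcc_w0=1.0) for write-once/read-many access patterns.
with h5py.File("example.h5", "r",
               rdcc_nbytes=64 * 1024 * 1024,
               rdcc_nslots=1_000_003,
               rdcc_w0=1.0) as f:
    dset = f["temperature"]
    col = dset[:, 0]          # strided reads benefit from a larger chunk cache
```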
A brief introduction to the Hadoop Distributed File System (HDFS). How a file is broken into blocks, written, and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS 1.x.
HDFS tiered storage: mounting object stores in HDFS
DataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to HDFS that allows external namespaces, both object stores and other HDFS clusters, to be mounted. Building on this functionality, remote namespaces can now be synchronized with HDFS, enabling asynchronous writes to the remote storage and the ability to synchronously and transparently read data back into a local application that needs file data stored remotely. This talk, which corresponds to the work in progress under HDFS-12090, will present how a Hadoop admin can manage storage tiering between clusters and how that is handled inside HDFS through the snapshotting mechanism and asynchronous satisfaction of the storage policy.
Speaker
Thomas Demoor, Object Storage Architect, Western Digital
Ewan Higgs, Software Engineer, Western Digital
Similar to Utilizing HDF4 File Content Maps for the Cloud Computing (20)
GraphRAG is All You Need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Key Trends Shaping the Future of Infrastructure
Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster: a step-by-step guide to success!
KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
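As mentioned above, PowSyBl offers a Python binding for Python developers. Here is a minimal hedged sketch of what that binding's workflow might look like, assuming the pypowsybl package; the function names should be checked against the current PowSyBl documentation.

```python
# Hedged sketch: load a bundled test network and run an AC power flow with pypowsybl.
import pypowsybl as pp

network = pp.network.create_ieee14()       # bundled IEEE 14-bus test network
results = pp.loadflow.run_ac(network)      # run an AC power flow
print(results[0].status)                   # convergence status of the main component
print(network.get_buses().head())          # bus data as a pandas DataFrame
```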
Utilizing HDF4 File Content Maps for the Cloud Computing
1. Utilizing HDF4 File Content Maps for the Cloud Computing
Hyokyung Joe Lee
The HDF Group
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C.
2. HDF File Format is for Data.
• PDF for Document, HDF for Data
• Why PDF over MS Word DOC?
– Free, Portable, Sharing & Archiving
• Why HDF over MS Excel XLS(X)?
– Free, Portable, Sharing & Archiving
• HDF: HDF4 & HDF5
3. HDF4 is “old” format.
• Old = Large volume over long time
• Old = Limitation (32-bit)
• Old = More difficult to sustain
17. Store chunks as cloud objects
• Reduce storage cost (e.g., S3) by avoiding redundancy.
• Make each chunk searchable through a search engine.
• Run cloud computing on chunks of interest.
18. Shallow Web is not Enough
• NASA Earthdata search is too shallow.
• Index HDF4 data using maps and make a deep web.
• Provide a search interface for the deep web.
• Frequently searched data can be cached as cloud objects.
• Users can run cloud computing on cached objects in real time.
• Verify results with HDF4 archives from NASA data centers.
19. HDF: Antifragile Solution for BACC
(BACC = Big Data Analytics in Cloud Computing)
1. Use the HDF archive as is. Create maps for HDF.
2. Maps can be indexed and searched.
3. ELT (Extract, Load, Transform) only relevant data into the cloud from HDF.
4. Offset/length-based file I/O is universal - all existing BACC solutions will work. No dependency on HDF APIs.
20. Future Work
1. HDF5 Mapping Project?
2. Use HDF Product Designer for archiving cloud objects and analytics results in HDF5.
3. Re-map: To metadata is human, to data is divine. For the same binary object, a user can easily re-define the meaning of data, re-index it, search it, and analyze it (e.g., serve the same binary data in Chinese, Spanish, Russian, etc.).
Good morning, everyone! My name is Joe Lee and I'm a software engineer at The HDF Group. Although I have attended past ESIP meetings regularly, I could not travel this summer. The ESIP meeting is a great place to learn and share new ideas and technologies through face-to-face conversation, so I apologize for presenting my new idea over telecon.
Although you may have heard about Hierarchical Data Format, let me start my presentation by giving a very short introduction to HDF.
HDF is similar to PDF in many ways as a free & portable binary format although the brand power of HDF is much weaker than the brand power of PDF.
Everybody knows that PDF is for publishing documents. HDF is for publishing data of any size – big or small.
For example, NASA has used HDF for several decades to archive big data, such as Earth observation data, because it is good for sharing and archiving.
HDF has two incompatible formats called HDF4 and HDF5.
As the numbers indicate, HDF4 is the older format and HDF5 is the relatively new one.
The idea that I am going to present today is mainly about HDF4 because HDF4 is old.
I cannot tell you exactly how old HDF4 is because I don't want to discriminate against any file format based on its age.
Old can mean many different things – both good and bad.
For example, old means a large volume of earth data has been archived in HDF4.
Old also means that HDF4 has some limits that are already overcome by today’s technology.
As the technology advances very fast, you’ll see fewer tools that support HDF4.
I put an image of a CD player here because HDF4 reminds me of the CD player in my 20-year-old car.
In 1995, I had to pay extra money for it as a premium car audio option.
Last November, my 20-year-old car finally broke down after racking up 250 thousand miles, so I went shopping for a new car.
I was surprised to learn that new cars do not have CD players any more.
Instead, they have USB or SD memory card slots and they accept MP3 formats.
I’m telling this story because the modernization of HDF4 data is necessary before it gets too old to sustain.
Since HDF5 is not backward-compatible with HDF4, HDF5 users need to convert HDF4 files to HDF5 if their tools do not support HDF4.
The HDF Group already provides h4toh5 conversion tool.
This is a good solution as long as you are willing to convert millions of HDF4 files into HDF5 files.
Thinking about a future alternative, like a Tesla that can stream music from the cloud, I think streaming Earth data from the cloud is the way to go.
So, converting HDF4 to HDF5 is an OK solution, but I think there should be an alternative if we want to modernize old HDF4 data in the cloud age.
I found the word “Cloudification” and I like it a lot. Wiktionary defines it as “The conversion ….”
Why does cloud computing matter? I think I don’t have to explain it any more thanks to IBM Watson and Google AlphaGo. When combined with AI and big data, cloud computing can do amazing things like beating human experts.
For another example, last winter I was involved in a project called the data container study. I ran a machine learning experiment with 20 years of NASA sea surface temperature data near Peru, from 1987 to 2008, using the Open Science Data Cloud, and I could detect an anomaly in a few seconds. The result matched nicely with the 1998 El Niño. The Open Science Data Cloud was very convenient and fast.
What I also learned from the data container study is that efficient I/O is the key.
OSDC provides 200 terabytes of public data in HDF4 format.
However, they are not directly usable for me because OSDC does not provide any search interface to the collection that is similar to NASA Earthdata search.
OSDC only provides a list of HDF4 file names available and all I can do is to transfer a collection of HDF4 files “as is” from cloud storage to the computing nodes.
This is horribly inefficient because I need a way to search and filter only the relevant data to speed up my data analytics at the collection level.
Thus, I came up with an idea to use the HDF4 file content map to maximize the utilization of cloud computing. A single binary HDF4 file can hold multiple data objects represented as arrays, groups, tables, attributes, and so on. Using the HDF4 file content map, each object can be precisely located with the offset from the beginning of the file and the number of bytes to read. The rationale is that if only the relevant objects are searched and loaded into the data analytics engine, you can reduce the amount of I/O and thus get the result much faster. Without shredding thousands of HDF4 files into objects with HDF4 maps, you must load 200 TB of data into computing nodes, process them, and throw them away. You must repeat this for every analytics job. You wait days for I/O while the actual data analytics takes only a few seconds.
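A hedged sketch of that selective-I/O idea applied to cloud storage: pull a single chunk with an HTTP byte-range request instead of downloading the whole granule. The bucket, key, and byte addresses are placeholders standing in for values a map entry would supply.

```python
# Hedged sketch: read one chunk from object storage by byte range.
import boto3

s3 = boto3.client("s3")
offset, length = 2948, 16384                          # from the HDF4 content map
resp = s3.get_object(Bucket="osdc-public-hdf4",       # hypothetical bucket name
                     Key="sst/1998/granule_001.hdf",  # hypothetical object key
                     Range=f"bytes={offset}-{offset + length - 1}")
chunk_bytes = resp["Body"].read()                     # only ~16 KB crosses the network
```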
So what is the HDF4 file content map that I'm talking about? It is an XML file that maps the content of an HDF4 binary file.
Unless you're a hacker working for the NSA, it's hard to know what's inside the HDF4 binary file, as shown in the slide.
An HDF4 binary file is a long stream of bytes, and the HDF4 map file tells you how to decode the stream correctly.
Interpreting the binary data is possible because the file content map is full of addresses.
In HDF, a dataset can be organized into chunks for efficient I/O, and the HDF4 map can tell you where to find a chunk of data.
The chunk position in the array is a good indication of where the data is located on Earth if the dataset is a grid.
By fully disclosing the offset and the number of bytes to read from the binary file, the map lets you access a chunk of data without the HDF4 library.
If you read the file content map carefully, you can find some interesting patterns in the byte size of each chunk.
The fillValues XML tag indicates that there's nothing to be analyzed in the chunk.
A small chunk size indicates that the chunk contains a lot of repeated information, so it compresses well.
A large chunk size indicates that the chunk carries more information than the other compressed chunks.
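A minimal sketch of this kind of pattern hunting, assuming the chunk byte sizes can be pulled from a map file; the map file name and XML element/attribute names are assumptions for illustration.

```python
# Hedged sketch: tally chunk byte sizes recorded in a content map.
from collections import Counter
import xml.etree.ElementTree as ET

tree = ET.parse("granule_001.hdf.map")                # hypothetical map file
sizes = Counter(int(b.get("nBytes"))
                for b in tree.iter("byteStream"))     # assumed element/attribute names

for nbytes, count in sizes.most_common(10):
    print(f"{count:6d} chunks of {nbytes:8d} bytes")
# Very small sizes usually mean highly compressible (often fill-value) chunks;
# unusually large sizes mean chunks that carry more real information.
```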
To find useful data in a huge collection of HDF4 files in OSDC, I indexed the chunks with Elasticsearch and visually inspected the frequency distribution of checksums with Kibana after computing the MD5 checksum of each chunk.
MD5 checksums on individual chunks are not provided by h4mapwriter yet, so I created a separate script in Python.
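A sketch of what such a script might look like, assuming the (offset, length) pairs have already been parsed from the map; the file name and chunk addresses are placeholders.

```python
# Hedged sketch: compute an MD5 digest for every chunk listed in a map.
import hashlib

def chunk_md5(hdf4_path, offset, length):
    with open(hdf4_path, "rb") as f:
        f.seek(offset)
        return hashlib.md5(f.read(length)).hexdigest()

chunks = [(2948, 16384), (19332, 16384)]              # parsed from the map file
digests = [chunk_md5("granule_001.hdf", off, ln) for off, ln in chunks]
print(digests)   # identical digests reveal duplicated chunks within and across files
```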
Running some analytics on HDF4 files using the HDF4 map was a lot of fun.
It revealed that the same chunk of data is repeated within an HDF4 file.
At the collection level, it scales up nicely.
Hundreds of HDF4 files contain the same 16 KB chunk of data.
This makes sense because some observations of the Earth will stay the same for a long period of time.
Once the index is built with Elasticsearch, I can easily run a query to find a dataset that I'm interested in using the byte size information.
For example, I could sort datasets by size from the smallest to the largest over hundreds of HDF4 files.
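A hedged sketch of such a query, assuming chunk documents were indexed into Elasticsearch with fields like "nBytes", "variable", and "file"; the index name and field names are placeholder assumptions.

```python
# Hedged sketch: ask Elasticsearch for the smallest datasets first.
import requests

query = {
    "size": 5,
    "sort": [{"nBytes": {"order": "asc"}}],           # smallest datasets first
    "query": {"match": {"variable": "sea_surface_temperature"}},
}
resp = requests.get("http://localhost:9200/hdf4-chunks/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["file"], hit["_source"]["nBytes"])
```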
As expected, the dataset with the smallest byte size showed almost nothing when visualized with HDFView.
The largest byte size dataset returned a colorful image.
Based on the HDF4 map information, I learned that it is possible to re-organize the entire collection of HDF4 data to optimize the use of cloud storage.
If you optimize the data organization, ETL time for cloud computing will be shortened and the cost of storage will also be reduced.
If you can build a search engine on top of those objects, advanced HDF4 users can run cloud computing directly on the HDF chunks they are interested in, after filtering out irrelevant data based on the search results.
Users can always transform HDF chunks into other formats such as Apache Parquet or JSON to meet their cloud computing needs.
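A minimal sketch of that transformation step, assuming a chunk has already been stored as a cloud object and its dtype is known from the map; the dtype, variable name, and file names are placeholder assumptions.

```python
# Hedged sketch: decode a raw chunk into an array and write it as Parquet.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

raw = open("chunk-0001.bin", "rb").read()              # a chunk stored as a cloud object
sst = np.frombuffer(raw, dtype=">i2").astype("int16")  # big-endian int16, per the map (assumed)
table = pa.table({"sst": sst})
pq.write_table(table, "chunk-0001.parquet")            # ready for Spark, Presto, etc.
```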
From the Elasticsearch experiment with the HDF4 map, I now have a new wish list for NASA Earthdata search.
Although I like the new and improved NASA Earthdata search, I still think it's too shallow because it does not index what's inside the granules.
If Earthdata search could index HDF4 maps and provide a search interface, chunk-level collections could be returned for a user's query.
I'd like to call such a search service deep web search.
For a chunk collection that the deep web search returns, the user can stream the chunks to their own cloud storage.
Here, the key is to deliver the chunk collection to the user's cloud service provider.
Downloading the entire HDF4 data does not make sense in this workflow.
Then, the user can run their analytics job using cloud computing on the streamed chunks.
If necessary, users can always go back to the original HDF4 archives and run the same analytics using the traditional off-cloud method.
In summary, data archived in HDF is ready for big data analytics for the certain access patterns that the data producers prescribed.
The prescribed pattern may not match exactly what users need. For such a use-case scenario, HDF maps can be indexed and searched to identify the relevant pieces of HDF.
I call it an antifragile solution because any big data analytics solution, in any computer language, in any cloud computing environment will work. For example, I could read data over the network in PHP using an Apache web server that supports byte-range requests, and it worked pretty nicely. I picked PHP because no PHP binding exists for HDF. Relying on a single monolithic library to access data is too fragile.
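The PHP code itself is not included in these notes; as a rough analogue, here is a minimal Python sketch of the same byte-range idea against a plain web server. The URL and byte addresses are placeholders standing in for values from a map entry.

```python
# Hedged sketch: any HTTP client that can send a Range header can fetch one chunk.
import requests

offset, length = 2948, 16384                          # from the HDF4 content map
resp = requests.get("https://example.org/archive/granule_001.hdf",
                    headers={"Range": f"bytes={offset}-{offset + length - 1}"})
assert resp.status_code == 206                        # 206 Partial Content
chunk_bytes = resp.content
```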
You may wonder if the same solution can be applied to HDF5 or netCDF4.
Unfortunately, there is no HDF5 mapper tool yet.
How can a user easily save a collection of chunks in HDF5 for future use? I think HDF Product Designer is a good candidate for creating a new HDF5 file from chunk objects in the cloud. It can play the role of the h4toh5 conversion tool, with on-demand, collection-level subset/aggregation capability.
Finally, the HDF4 map idea has great potential as a flexible metadata solution. While binary is forever, metadata doesn't have to be. If you re-map the same binary data with a different dialect, you can serve a wider community that understands that dialect. One example is rewriting the HDF4 file content map in different languages. Then, international users can discover and access Earthdata more easily.
Thank you for listening and I hope that you can use HDF4 map wisely in your next cloud computing project.
Do you have any question?