The Hierarchical Data Format (HDF) has been a data format standard in NASA's Earth Observing System Data and Information System (EOSDIS) since the 1990s. Its rich structure, platform independence, full-featured Application Programming Interface (API), and internal compression make it well suited to archiving science data and to working with those data using a rich set of software tools. However, a key drawback for long-term archiving is the complex internal byte layout of HDF files, which requires use of the API to access HDF data. This makes the long-term readability of HDF data for a given version dependent on long-term allocation of resources to support that version.
The majority of the data from NASA's Earth Observing System (EOS) have been archived in HDF Version 4 (HDF4) format. To address the long-term archival issues for these data, a collaborative study between The HDF Group and NASA's EOSDIS data centers is underway. One of the first activities undertaken has been an assessment of the range of HDF4-formatted data held by NASA, to determine which capabilities of the HDF format have been used in practice. Based on the results of this assessment, methods for producing a map of the layout of the HDF4 files held by NASA will be prototyped using a markup-language-based HDF tool. The resulting maps should allow a separate program to read the file without recourse to the HDF API. To verify this, two independent tools based solely on the map files will be developed and tested with a variety of data products archived by NASA.
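To make the data-map idea concrete, here is a minimal sketch of how a map could let a reader bypass the HDF API entirely. The XML schema, element names, and file layout below are invented for illustration; they are not the project's actual map format, which is still being drafted.

```python
# Sketch of the data-map concept: an XML "map" records where each dataset's
# bytes live in the file, so a reader needs only the map and plain file I/O.
# All element/attribute names here (hdfMap, dataset, block, ...) are assumptions.
import struct
import xml.etree.ElementTree as ET

# Build a stand-in "HDF" file: a 16-byte pretend header, then four
# big-endian 32-bit integers starting at byte offset 16.
payload = struct.pack(">4i", 10, 20, 30, 40)
with open("sample.dat", "wb") as f:
    f.write(b"\x00" * 16)  # pretend header
    f.write(payload)

# A minimal, made-up map describing that layout.
MAP = """
<hdfMap file="sample.dat">
  <dataset name="Temperature" type="int32" byteOrder="big">
    <block offset="16" length="16"/>
  </dataset>
</hdfMap>
"""

def read_dataset(map_xml, name):
    """Read a dataset using only the map: seek to each block and unpack."""
    root = ET.fromstring(map_xml)
    ds = next(d for d in root.iter("dataset") if d.get("name") == name)
    code = {"int32": "i"}[ds.get("type")]
    endian = ">" if ds.get("byteOrder") == "big" else "<"
    values = []
    with open(root.get("file"), "rb") as f:
        for blk in ds.iter("block"):
            f.seek(int(blk.get("offset")))
            raw = f.read(int(blk.get("length")))
            values.extend(struct.unpack(endian + code * (len(raw) // 4), raw))
    return values

print(read_dataset(MAP, "Temperature"))  # [10, 20, 30, 40]
```

The key property is that `read_dataset` never links against an HDF library; as long as the map and the raw bytes survive, the data remain readable.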
Towards Long-Term Archiving of NASA HDF-EOS and HDF Data - Data Maps and the Use of Mark-Up Language
1. Towards Long-Term Archiving of
NASA HDF-EOS and HDF Data
Data Maps and the Use of Mark-Up Language
Ruth Duerr, Mike Folk, Muqun Yang, Chris Lynnes, Peter Cao
2. Outline
• Background
• Data Mapping Project Description
• Plans and Early Results
Presented at the HDF and HDF-EOS Workshop
4. A Concern
• The majority of the data from NASA’s
Earth Observing System (EOS) have
been archived in HDF Version 4 (HDF4)
or HDF-EOS 2 format.
• HDF files have a complex internal byte
layout, requiring one to use the API to
access HDF data
• Long-term readability of HDF data
depends on long-term allocation of
resources to support the API
5. A Proposal from the Workshop Last Year
• Chris Lynnes noted that:
What was needed was a map to the contents of an HDF file
The output of the HDF4 tools (e.g., hdfls, hdp, etc.) already provides much of the information needed
Extending these tools to create a map to the contents of the file might be feasible
6. Outline
• Background
• Data Mapping Project Description
• Plans and Early Results
7. Data Mapping Project Description
• Assess and categorize NASA holdings of
HDF4 data
• Investigate methods of mapping HDF4 files
• Develop requirements for tools to create
maps of HDF4 files
• Create a prototype tool to create maps
• Test the utility of these maps by developing two
independent tools that use the maps to read
real data
8. Data Mapping Project Description (continued)
• Assess the utility of this approach
• Document our findings
• Present results and options for
proceeding to the user community
• Evaluate the effort required for a full
solution that meets community needs
• Submit a proposal for that effort
9. Outline
• Background
• Data Mapping Project Description
• Plans and Early Results
10. Assess and Categorize NASA Holdings
• NASA provided a starter list of data sets held
• NASA data centers were requested to provide a list at a project briefing
• Results from each DAAC are being compared to an ECHO assessment of data sets using a .hdf extension
While the volume of NASA data stored in HDF4/HDF-EOS2 format is measured in petabytes, the fraction of the total number of NASA data sets archived in HDF4/HDF-EOS2 is “small”
11. Assess and Categorize NASA Holdings (continued)
• Examples of each of the HDF4 data sets have been obtained and examined*
• Information kept is summarized below:
Product id/name
Data Center
Product Version
Multi-file product?
HDF-EOS info (if any): HDF-EOS version, Point info, Swath info, Grid info
HDF info: Version, Raster image info, Palette, SDS info, Vdata info, Annotation
* For the most part
12. Assess and Categorize NASA Holdings (continued)
• Very preliminary findings:
Roughly 50/50 split between HDF-EOS and plain HDF
Point data is relatively rare and, when found, is not accompanied by swath or grid data
No indexes yet
While a few products use the image types, there are no palettes yet
13. Investigate Methods of Mapping HDF4 Files
• NSIDC and GES-DISC have provided THG sample data files
• Preliminary priorities for capabilities to tackle:
Contiguous SDS
Contiguous SDS with unlimited dimension
Chunked SDS
Compressed SDS
Chunked and compressed SDS
SDS and attributes
Vdata and attributes
Annotation
Vgroup
Raster image and attributes
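Chunked and compressed SDSs are among the trickiest entries on the list above, because the data are scattered through the file in separately compressed pieces. The sketch below shows how a map that records per-chunk offsets could drive reassembly; the chunk layout and map fields are assumptions for illustration, not the real HDF4 on-disk format (HDF4 locates chunks through internal chunk tables, and deflate is one of several compression options).

```python
# Sketch: reassembling a chunked, compressed SDS from map-recorded offsets.
# The "map" here is a plain list of dicts; a real map would carry the same
# information in XML. All field names are illustrative assumptions.
import struct
import zlib

# Build a stand-in file holding two deflate-compressed chunks of int32 data.
chunk_a = zlib.compress(struct.pack(">3i", 1, 2, 3))
chunk_b = zlib.compress(struct.pack(">3i", 4, 5, 6))
with open("chunked.dat", "wb") as f:
    off_a = f.tell()
    f.write(chunk_a)
    off_b = f.tell()
    f.write(chunk_b)

# What the map would record for each chunk: offset, stored length, compression.
chunks = [
    {"offset": off_a, "length": len(chunk_a), "compression": "deflate"},
    {"offset": off_b, "length": len(chunk_b), "compression": "deflate"},
]

def read_chunked(path, chunks):
    """Seek to each chunk, decompress it if needed, and concatenate values."""
    values = []
    with open(path, "rb") as f:
        for c in chunks:
            f.seek(c["offset"])
            raw = f.read(c["length"])
            if c["compression"] == "deflate":
                raw = zlib.decompress(raw)
            values.extend(struct.unpack(">" + "i" * (len(raw) // 4), raw))
    return values

print(read_chunked("chunked.dat", chunks))  # [1, 2, 3, 4, 5, 6]
```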
15. Develop Requirements for Tools to Create Maps
• Maps will be XML-based
• A draft of a map format specification
has been started
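For illustration only, a map entry for a simple contiguous SDS might look something like the fragment below. Every element and attribute name here is an assumption, since the actual specification is still in draft.

```xml
<!-- Hypothetical map fragment; element and attribute names are invented -->
<hdfMap file="example.hdf">
  <dataset name="Latitude" type="float32" byteOrder="big">
    <dimensions rank="2" sizes="2030 1354"/>
    <attribute name="units" value="degrees_north"/>
    <block offset="2048" length="10994480"/>
  </dataset>
</hdfMap>
```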
16. Create a Prototype Tool to Create Maps
• An iterative process is being used to
create the prototype
• Each iteration adds the next capability
from the prioritized list shown earlier
• At this point, the tool just creates a text
description
17. Communications Plan
• Bi-weekly telecons with our sponsors (may move to
monthly)
• Briefing to NASA Data Center managers held, expect
to provide periodic updates
• Brief community at the HDF-Workshop and other
relevant meetings (e.g., AGU)
• Submit a paper to the special issue of IEEE
Transactions on Geoscience and Remote Sensing
devoted to Data Archiving and Distribution
• Public wiki established but not yet populated
18. Summary
• We’ve started a project to assess and
prototype the ability to create maps to
the contents of HDF4 files that allow
programmers to develop code to read
data without using the HDF APIs
• We welcome community involvement
Editor's Notes
Collaborative project participants:
Ruth, data manager at NSIDC and Data Stewardship program lead
Folks at THG
Chris Lynnes at the GES-DISC
This project was born out of concerns for the long-term accessibility of HDF4 and HDF-EOS2 data that folks from the HDF Group and I have had for many years.
One of the options for dealing with this is to do what Rich Ullman suggested yesterday, that is, to retire the format and migrate all of the data to HDF5/HDF-EOS5. This is likely to be rather expensive and, worse yet, some day when the next new format comes along you will have to do the whole migration over again, potentially every few years for the rest of time.
Mike Folk and I thought that was a great idea, discussed the concept with Chris at the meeting, and decided to see if NASA would be interested in supporting a pilot study. They were willing, and thus was born the HDF mapping project, which started up just this last August.
The HDF4 APIs, like the HDF5 APIs we’ve been hearing about all week, are quite complex, and it is not at all clear that producing a map to the data will work in all cases. The purpose of assessing and categorizing NASA’s holdings of HDF4 products is to determine which sets of capabilities have actually been used and the frequency of use of each. The idea is to guide work on this project towards implementing capabilities with high impact. I’m a fan of the Pareto principle: if 20% of the effort will solve 80% of the problem, then perhaps that’s the effort that should be recommended. As a byproduct of this step, NASA will gain a catalog of their HDF4 data holdings and information about the implementation of each.
Investigating methods of mapping HDF4 files is primarily a THG task. Peter Cao is taking the lead on this. We’ll talk about his status later in this presentation.
Developing requirements for tools to create these maps is a joint responsibility. I also think that this is an area where the user community, people like you all out there in the audience, could have input.
THG will also be responsible for creating a prototype tool to create maps from real data. Towards that end both NSIDC and GES-DISC have provided them with sample data.
NSIDC and GES-DISC will undertake separate implementations of read software that will take a map and use it to read real data files. Towards that end, I’ve hired a student, a naïve user if you will, to both work on the assessment as well as to implement the read software. GES-DISC will assign one of their employees to do likewise. These will be independent implementations - very likely in different languages and using different data.
The previous steps collectively should provide us a very good idea of how feasible and useful this idea is.
We intend to document the results of each step in this process. For example, we will document the results of the assessment and categorization of the data and NASA will be provided the catalog of NASA HDF4 data holdings that is developed. We will also document the requirements developed and the results of the independent implementation tests.
Our intention is that this would be an open process, that communication with our stakeholder community is important, that community input on things like requirements and options for proceeding is important. My presentation here is a part of our plans in this area.
As one of the last steps in this project, we will attempt to provide an evaluation of the effort needed to do a full-up solution. We haven’t really started talking about exactly what that will consist of yet; but, my own preference would be to include information about what it would require for each of the NASA data centers to actually implement a full-up solution.
And then, assuming that the results of the project warrant it, we will submit a proposal to do that full-up effort.
This step turns out to be a bit more difficult than you might expect, simply because NASA does not have an up-to-date, definitive list of all of the data sets that are archived by its data centers. NASA did provide a starter list of data sets from the EDGRS metrics gathering system. This list indicates what data sets are (or were, in some cases) held by which data centers. At the DAAC management briefing, each participant was given a list of data that theoretically were held by them and asked to indicate which, if any, were in HDF4 format. All of the attendees provided their lists within a few days. Email was sent to the other centers, so far unsuccessfully; we will shortly follow up with those folks.
As a crude sanity check of the results, we obtained a list from NASA’s ECHO system of all of the data sets that use the .hdf extension.
One thing I should note is that where ever you see the word “info” that indicates that there are several items being kept under that general category. For example, under swath info we are keeping track of how many swaths there are in the file, how many dimensions the swaths have, how they are organized (for example by time, space, both), and whether dimension maps are used. For SDS, we keep track of how many SDS’s there are, what the maximum dimensionality of an SDS is, whether there are attributes or annotation, whether dimension scales are used, whether chunking is used, and what kind (if any) of compression is used. In other words, we keep track of which portions of the APIs are being used.
I seriously debated whether to show any results of our findings at this point, since we haven’t gathered all of the data and the sample that we have examined is in no way statistically representative; but, decided that I should at least give some idea of what kind of information will come out of this study. The point is that we should have enough information to determine what constructs are used most frequently and in what combinations.
These priorities were generated before we really had any data. They were based on our combined gut feel for what was out there. We now have some data and probably should revisit the list.
The green highlighting indicates roughly how far down the list Peter has had a chance to test his ability to develop a map.
I expect we will use some combination of the THG and NASA EOS email lists to let folks know when there are new materials on the wiki that they might be interested in.