Towards Long-Term Archiving of NASA HDF-EOS and HDF Data - Data Maps and the Use of Mark-Up Language


The Hierarchical Data Format (HDF) has been a data format standard in NASA's Earth Observing System Data and Information System (EOSDIS) since the 1990s. Its rich structure, platform independence, full-featured Application Programming Interface (API), and internal compression make it very useful for archiving science data and utilizing them with a rich set of software tools. However, a key drawback for long-term archiving is the complex internal byte layout of HDF files, requiring one to use the API to access HDF data. This makes the long-term readability of HDF data for a given version dependent on long-term allocation of resources to support that version.

The majority of the data from NASA's Earth Observing System (EOS) have been archived in HDF Version 4 (HDF4) format. To address the long-term archival issues for these data, a collaborative study between The HDF Group and NASA's EOSDIS data centers is underway. One of the first activities undertaken has been an assessment of the range of HDF4-formatted data held by NASA, to determine which of the capabilities inherent in the HDF format have actually been used in practice. Based on the results of this assessment, methods for producing a map of the layout of the HDF4 files held by NASA will be prototyped using a markup-language-based HDF tool. The resulting maps should allow a separate program to read the file without recourse to the HDF API. To verify this, two independent tools based solely on the map files will be developed and tested with a variety of data products archived by NASA.
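To make the idea concrete, here is a minimal sketch of what a map and a map-driven reader could look like. This is illustrative only: the XML element and attribute names (`hdf4map`, `dataset`, `offset`, `count`, etc.) are invented for this example, and the byte layout is a toy stand-in for a real HDF4 file; the project's actual map format specification was only a draft at the time of this talk.

```python
# Toy demonstration of the "data map" concept: a map records where raw
# values live in a file, so a reader needs no HDF4 API, only the map.
# All names below (hdf4map, dataset, offset, count) are hypothetical.
import struct
import xml.etree.ElementTree as ET

# --- Producer side: write raw bytes plus a map describing where they live ---
values = [1.5, 2.5, 3.5, 4.5]
payload = struct.pack("<4f", *values)      # little-endian float32 array
header = b"FAKEHDR!"                       # stand-in for HDF4 header bytes
with open("sample.bin", "wb") as f:
    f.write(header + payload)

map_root = ET.Element("hdf4map")           # invented map vocabulary
ET.SubElement(map_root, "dataset", name="Temperature",
              dtype="float32", byteorder="little",
              offset=str(len(header)), count=str(len(values)))
ET.ElementTree(map_root).write("sample.map.xml")

# --- Consumer side: read the data using ONLY the map, no HDF4 library ---
d = ET.parse("sample.map.xml").getroot().find("dataset")
offset, count = int(d.get("offset")), int(d.get("count"))
fmt = ("<" if d.get("byteorder") == "little" else ">") + str(count) + "f"
with open("sample.bin", "rb") as f:
    f.seek(offset)
    recovered = list(struct.unpack(fmt, f.read(struct.calcsize(fmt))))

print(recovered)   # [1.5, 2.5, 3.5, 4.5]
```

The consumer half is the part the project's two independent read tools would implement; note that it touches only `seek`, `read`, and the map, which is exactly the long-term-archiving property being tested.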


Speaker notes
  • Collaborative project participants
    Ruth
    Data manager at NSIDC and Data Stewardship program lead
    Folks at THG
    Chris Lynnes at the GES-DISC
  • This project was born out of concerns for the long-term accessibility of HDF4 and HDF-EOS2 data that folks from the HDF Group and I have had for many years.
    One of the options for dealing with this is to do what Rich Ullman suggested yesterday, that is, to retire the format and migrate all of the data to HDF5/HDF-EOS5. This is likely to be rather expensive; worse yet, when the next new format comes along you will have to do the whole migration over again, potentially every few years in perpetuity.
  • Mike Folk and I thought that was a great idea, discussed the concept with Chris at the meeting, and decided to see if NASA would be interested in supporting a pilot study. They were willing, so thus was born the HDF mapping project which started up just this last August.
  • The HDF4 APIs, like the HDF5 APIs we’ve been hearing about all week, are quite complex and it is not at all clear that producing a map to the data will work in all cases. The purpose of assessing and categorizing NASA’s holdings of HDF4 products, is to determine which sets of capabilities have actually been used and the frequency of use of each. The idea is to guide work on this project towards implementing capabilities with high impact. I’m a fan of the Pareto principle. If 20% of the effort will solve 80% of the problem, then perhaps that’s the effort that should be recommended. As a byproduct of this step, NASA will gain a catalog of their HDF4 data holdings and information about the implementation of each.
    Investigating methods of mapping HDF4 files is primarily a THG task. Peter Cao is taking the lead on this. We’ll talk about his status later in this presentation.
    Developing requirements for tools to create these maps is a joint responsibility. I also think that this is an area where the user community, people like you all out there in the audience, could have input.
    THG will also be responsible for creating a prototype tool to create maps from real data. Towards that end both NSIDC and GES-DISC have provided them with sample data.
    NSIDC and GES-DISC will undertake separate implementations of read software that will take a map and use it to read real data files. Towards that end, I’ve hired a student, a naïve user if you will, to work on both the assessment and the implementation of the read software. GES-DISC will assign one of their employees to do likewise. These will be independent implementations, very likely in different languages and using different data.
  • The previous steps collectively should provide us a very good idea of how feasible and useful this idea is.
    We intend to document the results of each step in this process. For example, we will document the results of the assessment and categorization of the data and NASA will be provided the catalog of NASA HDF4 data holdings that is developed. We will also document the requirements developed and the results of the independent implementation tests.
    Our intention is that this would be an open process, that communication with our stakeholder community is important, that community input on things like requirements and options for proceeding is important. My presentation here is a part of our plans in this area.
    As one of the last steps in this project, we will attempt to provide an evaluation of the effort needed to do a full-up solution. We haven’t really started talking about exactly what that will consist of yet, but my own preference would be to include information about what it would take for each of the NASA data centers to actually implement a full-up solution.
    And then, assuming that the results of the project warrant it, we will submit a proposal to do that full-up effort.
  • This step turns out to be a bit more difficult than you might expect, simply because NASA does not have an up-to-date, definitive list of all of the data sets that are archived by its data centers. NASA did provide a starter list of data sets from the EDGRS metrics-gathering system. This list indicates which data sets are (or in some cases were) held by which data centers. At the DAAC management briefing, each participant was given a list of data sets that theoretically were held by them and asked to indicate which, if any, were in HDF4 format. All of the attendees provided their lists within a few days. Email was sent to the other centers, so far without response; we will shortly follow up with those folks.
    As a crude sanity check of the results, we obtained a list from NASA’s ECHO system of all of the data sets that use the .hdf extension.
  • One thing I should note is that wherever you see the word “info”, it indicates that several items are being kept under that general category. For example, under swath info we keep track of how many swaths there are in the file, how many dimensions the swaths have, how they are organized (for example by time, space, or both), and whether dimension maps are used. For SDSs, we keep track of how many SDSs there are, what the maximum dimensionality of an SDS is, whether there are attributes or annotations, whether dimension scales are used, whether chunking is used, and what kind (if any) of compression is used. In other words, we keep track of which portions of the APIs are being used.
  • I seriously debated whether to show any results of our findings at this point, since we haven’t gathered all of the data and the sample that we have examined is in no way statistically representative; but I decided that I should at least give some idea of what kind of information will come out of this study. The point is that we should have enough information to determine which constructs are used most frequently and in what combinations.
  • These priorities were generated before we really had any data. They were based on our combined gut feel for what was out there. We now have some data and probably should revisit the list.
  • The green highlighting (on the repeated priorities slide) indicates roughly how far down the list Peter has had a chance to test his ability to develop a map.
  • I expect we will use some combination of the THG and NASA EOS email lists to let folks know when there are new materials on the wiki that they might be interested in.
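The per-product bookkeeping described in the notes above could be captured in a small record type. The sketch below is a guess at how such a catalog entry might be structured; the field names and the sample values are my own, not the project's actual schema.

```python
# Illustrative catalog record for the HDF4 holdings assessment.
# Field names mirror the categories the notes mention (SDS info, swath
# info, compression, etc.) but are assumptions, not the real schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SDSInfo:
    count: int = 0                   # how many SDSs in the file
    max_rank: int = 0                # maximum dimensionality of any SDS
    has_attributes: bool = False
    uses_dim_scales: bool = False
    uses_chunking: bool = False
    compression: Optional[str] = None  # e.g. "deflate", or None

@dataclass
class ProductRecord:
    product_id: str                  # product id/name
    data_center: str
    version: str
    multi_file: bool                 # multi-file product?
    hdfeos_version: Optional[str] = None  # None for plain-HDF products
    swath_count: int = 0
    sds: SDSInfo = field(default_factory=SDSInfo)

# A hypothetical entry, purely for illustration:
rec = ProductRecord("EXAMPLE-PRODUCT", "NSIDC", "5", False,
                    hdfeos_version="2.x",
                    sds=SDSInfo(count=3, max_rank=2, uses_chunking=True,
                                compression="deflate"))
print(rec.sds.compression)   # "deflate"
```

Aggregating records like this over all holdings is what would let the project rank API capabilities by frequency of use, per the Pareto-style prioritization the notes describe.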
Slide transcript

    1. Towards Long-Term Archiving of NASA HDF-EOS and HDF Data - Data Maps and the Use of Mark-Up Language. Ruth Duerr, Mike Folk, Muqun Yang, Chris Lynnes, Peter Cao
    2. Outline • Background • Data Mapping Project Description • Plans and Early Results. Presented at the HDF and HDF-EOS Workshop
    3. Outline • Background • Data Mapping Project Description • Plans and Early Results
    4. A Concern • The majority of the data from NASA’s Earth Observing System (EOS) have been archived in HDF Version 4 (HDF4) or HDF-EOS 2 format • HDF files have a complex internal byte layout, requiring one to use the API to access HDF data • Long-term readability of HDF data depends on long-term allocation of resources to support the API
    5. A Proposal from the Workshop Last Year • Chris Lynnes noted that  What was needed was a map to the contents of an HDF file  The output of the HDF4 tools (e.g., hdfls, hdp, etc.) already provides much of the information needed  Extending these tools to create a map to the contents of the file might be feasible
    6. Outline • Background • Data Mapping Project Description • Plans and Early Results
    7. Data Mapping Project Description • Assess and categorize NASA holdings of HDF4 data • Investigate methods of mapping HDF4 files • Develop requirements for tools to create maps of HDF4 files • Create a prototype tool to create maps • Test the utility of these maps by developing 2 independent tools that use the maps to read real data
    8. Data Mapping Project Description (continued) • Assess the utility of this approach • Document our findings • Present results and options for proceeding to the user community • Evaluate the effort required for a full solution that meets community needs • Submit a proposal for that effort
    9. Outline • Background • Data Mapping Project Description • Plans and Early Results
    10. Assess and Categorize NASA Holdings • NASA provided a starter list of data sets held • NASA data centers were requested to provide a list at a project briefing • Results from each DAAC are being compared to an ECHO assessment of data sets using a .hdf extension • While the volume of NASA data stored in HDF4/HDF-EOS2 format is measured in PB, the fraction of the total number of NASA data sets archived in HDF4/HDF-EOS2 is “small”
    11. Assess and Categorize NASA Holdings (continued) • Examples of each of the HDF4 data sets have been obtained and examined* • Information kept is summarized below: • Product id/name • Data Center • Product Version • Multi-file product? • HDF/EOS info (if any):  HDF/EOS version  Point info  Swath info  Grid info • HDF info:  Version  Raster image info  Palette  SDS info  Vdata info  Annotation (* For the most part)
    12. Assess and Categorize NASA Holdings (continued) • Very preliminary findings  Roughly 50/50 split between HDF-EOS and plain HDF  Point data is relatively rare and when found is not accompanied by swath or grid data  No indexes yet  While a few products use the image types, there are no palettes yet
    13. Investigate Methods of Mapping HDF4 Files • NSIDC and GES-DISC have provided THG sample data files • Preliminary priorities for capabilities to tackle:  Contiguous SDS  Contiguous SDS with unlimited dimension  Chunked SDS  Compressed SDS  Chunked and compressed SDS  SDS and attributes  Vdata and attributes  Annotation  Vgroup  Raster image and attributes
    14. Investigate Methods of Mapping HDF4 Files • NSIDC and GES-DISC have provided THG sample data files • Preliminary priorities for capabilities to tackle:  Contiguous SDS  Contiguous SDS with unlimited dimension  Chunked SDS  Compressed SDS  Chunked and compressed SDS  SDS and attributes  Vdata and attributes  Annotation  Vgroup  Raster image and attributes
    15. Develop Requirements for Tools to Create Maps • Maps will be XML-based • A draft of a map format specification has been started
    16. Create a Prototype Tool to Create Maps • An iterative process is being used to create the prototype • Each iteration adds the next capability from the prioritized list shown earlier • At this point, the tool just creates a text description
    17. Communications Plan • Bi-weekly telecons with our sponsors (may move to monthly) • Briefing to NASA Data Center managers held; expect to provide periodic updates • Brief community at the HDF Workshop and other relevant meetings (e.g., AGU) • Submit a paper to the special issue of IEEE Transactions on Geoscience and Remote Sensing devoted to Data Archiving and Distribution • Public wiki established but not yet populated
    18. Summary • We’ve started a project to assess and prototype the ability to create maps to the contents of HDF4 files that allow programmers to develop code to read data without using the HDF APIs • We welcome community involvement
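The "chunked and compressed SDS" entries on the priority list (slides 13-14) are the interesting case for mapping: each chunk is an independently compressed run of bytes at its own file offset, so a map must record a per-chunk table rather than a single offset and length. The sketch below illustrates that idea with an assumed layout and zlib as a stand-in for HDF4's deflate filter; it is not the project's actual map or chunk format.

```python
# Sketch of map-driven reading of a chunked, compressed dataset.
# Layout and chunk-table structure are assumptions for illustration.
import struct
import zlib

# --- Writer: two chunks, each compressed independently ---
chunks = [struct.pack("<2f", 1.0, 2.0), struct.pack("<2f", 3.0, 4.0)]
entries = []                           # (offset, compressed_length) per chunk
with open("chunked.bin", "wb") as f:
    for raw in chunks:
        comp = zlib.compress(raw)      # stand-in for HDF4's deflate filter
        entries.append((f.tell(), len(comp)))
        f.write(comp)

# --- Reader: walk the chunk table, inflating each run at its offset ---
# In the real project this table would come from the XML map file.
out = []
with open("chunked.bin", "rb") as f:
    for offset, length in entries:
        f.seek(offset)
        raw = zlib.decompress(f.read(length))
        out.extend(struct.unpack("<2f", raw))

print(out)   # [1.0, 2.0, 3.0, 4.0]
```

This is why the priority list distinguishes contiguous, chunked, compressed, and chunked-plus-compressed SDSs: each adds another layer of indirection the map has to describe.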