Communities use many different dialects to document their data. We need to be able to translate between these dialects and to understand how much is lost in translation.
1. Translation Proofing – Quantitative Tools for Connecting Metadata Dialects
Ted Habermann
Director of Earth Science
The HDF Group
thabermann@hdfgroup.org
2. Metadata in Multiple Dialects
[Diagram: Documentation Repository, with these dialects:]
ISO 19115, 19115-2, 19119 and extensions
THREDDS, HDF, netCDF (NcML)
FGDC, Data.Gov
SensorML
WCS, WMS, WFS, SOS
Open Provenance Model, PROV
DIF, ECS, ECHO
KML
3. Translation Lossiness
Documentation dialects generally have significant overlap because the
concepts that are being documented (who, where, what, when, and why?)
are shared across many communities and dialects.
At the same time, there are differences…
[Figure: dialects A and B and their overlap AB, shown for a more lossy and a less lossy translation]
We are familiar with the idea of lossiness from data compression. How can we
quantify the lossiness of a translation?
4. Characterizing the Source
The distribution of elements in any metadata collection reflects the requirements
of the data providers and users. Some elements are more common (important?)
than others.
This heterogeneity needs to be considered when evaluating the translation.
Example collection: 448 CSDGM records
161,151 elements and attributes
Place Keywords: 10,713 occurrences
/metadata/USGSErp/MetadataNotes: 1 occurrence
264 elements occur fewer than 100 times
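
A distribution like the one summarized above could be tallied along the following lines. This is only a minimal sketch, not the tool used for the slide: the helper names `element_paths` and `count_paths` are hypothetical, and the two tiny records are made-up examples that borrow CSDGM-style tag names.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_paths(root):
    """Yield a slash-separated path for every element and attribute under root."""
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        yield path
        for name in node.attrib:
            yield f"{path}/@{name}"
        for child in node:
            stack.append((child, f"{path}/{child.tag}"))


def count_paths(xml_records):
    """Count how often each element or attribute path occurs across all records."""
    counts = Counter()
    for text in xml_records:
        counts.update(element_paths(ET.fromstring(text)))
    return counts


# Two made-up records (hypothetical content, for illustration only).
records = [
    "<metadata><idinfo><keywords><place>"
    "<placekey>Alaska</placekey><placekey>Arctic</placekey>"
    "</place></keywords></idinfo></metadata>",
    "<metadata><idinfo><keywords><place>"
    "<placekey>Hawaii</placekey>"
    "</place></keywords></idinfo></metadata>",
]

counts = count_paths(records)
total = sum(counts.values())

# Print the most common paths with their share of all occurrences.
for path, n in counts.most_common(3):
    print(f"{n:4d}  {n / total:6.1%}  {path}")
```

Counting full paths rather than bare tag names keeps elements that share a name but sit in different parts of the record separate, which matters when the crosswalk maps them to different destinations.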
5. Lossiness = Distribution + Crosswalk
[Figure: Actual Distribution (collection & community) + Reference Crosswalk (Source to Destination)]
To calculate the lossiness of a translation we need the actual distribution
of elements in the source and a reference crosswalk that gives the destinations
that the source elements are mapped to.
6. Three Examples
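January 8-10, 2014, ESIP Winter 2014

Percentages are rounded to whole numbers. "In crosswalk?" is 1 if the element is translated and 0 if it is not.

Example 1: every element is in the crosswalk
Element   Count   % of source   In crosswalk?   % translated
A           134       66%             1              66%
B            50       25%             1              25%
C            20       10%             1              10%
Total       204      100%                           100%

Element A occurs 134 times and makes up 66% of the source
Element B occurs 50 times and makes up 25% of the source
Element C occurs 20 times and makes up 10% of the source

Example 2: element B is not in the crosswalk
Element   Count   % of source   In crosswalk?   % translated
A           134       66%             1              66%
B            50       25%             0               0%
C            20       10%             1              10%
Total       204      100%                            75%

Example 3: element C is not in the crosswalk
Element   Count   % of source   In crosswalk?   % translated
A           134       66%             1              66%
B            50       25%             1              25%
C            20       10%             0               0%
Total       204      100%                            90%

100% of elements translated: lossiness = 0%
75% of elements translated: lossiness = 25%
90% of elements translated: lossiness = 10%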
7. Calculating Lossiness
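[Figure: Actual Distribution (collection & community) + Reference Crosswalk (Source to Destination)]

$$\text{Lossiness} = 1 - \sum_{n=1}^{N} \frac{\text{number of occurrences}_n}{\text{total number of elements}} \times c_n$$

where N is the number of elements, and c_n is 1 if element n is in the crosswalk and 0 if not.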
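
The calculation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the author's implementation: the function name `lossiness` is hypothetical, the distribution is assumed to be a mapping from element name to occurrence count, and the crosswalk is reduced to the set of source elements that have a destination. The counts for A, B, and C are the ones from the three examples.

```python
def lossiness(counts, crosswalk):
    """Return 1 minus the share of source occurrences whose element is in the crosswalk."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty source distribution")
    translated = sum(n for element, n in counts.items() if element in crosswalk)
    return 1 - translated / total


# Source distribution from the three examples.
counts = {"A": 134, "B": 50, "C": 20}

# Each crosswalk is modeled here as the set of source elements it maps.
examples = {
    "all mapped": {"A", "B", "C"},
    "B missing": {"A", "C"},
    "C missing": {"A", "B"},
}

for label, crosswalk in examples.items():
    print(f"{label}: lossiness = {lossiness(counts, crosswalk):.1%}")
```

With unrounded arithmetic the three cases come out at 0.0%, 24.5% (50/204), and 9.8% (20/204); the whole-number figures in the three examples approximate these values.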
9. Acknowledgements
This work was partially supported by contract number NNG10HP02C from NASA.
Any opinions, findings, conclusions, or recommendations expressed in this material are
those of the author and do not necessarily reflect the views of NASA or The HDF Group.