NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

Data Equivalence
Mark A. Parsons
and the ESIP Preservation and Stewardship Committee

NISO Forum:
Tracking it Back to the Source: Managing and Citing Research Data
Denver, Colorado, USA
24 September 2012

The National Snow and Ice Data Center…

• Manages and distributes scientific data
• Performs scientific and informatics research
• Supports data users
• Creates tools for data access
• Educates the public about the cryosphere

http://nsidc.org

Minimum Arctic Sea Ice Extent.
Image courtesy NASA/Goddard Scientific Visualization Studio

21 Sept. 2012

Minimum, 16 Sept.: 3,412,196 km²*

Derived from the NSIDC Sea Ice Index (Fetterer et al., 2009)
*5-day running mean of daily values
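
The footnoted smoothing can be sketched as below. This is only an illustration: the daily values are invented (they are not NSIDC data), and a trailing window is an assumption, since the slide does not say whether the 5-day window is trailing or centered.

```python
def running_mean(values, window=5):
    """Trailing running mean: average each day with the (window - 1) days before it."""
    if window < 1:
        raise ValueError("window must be >= 1")
    means = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        means.append(sum(chunk) / window)
    return means


# Hypothetical daily extents in million km^2 (illustrative only, not real data).
daily = [3.50, 3.46, 3.44, 3.40, 3.38, 3.41, 3.45]
smoothed = running_mean(daily)

# The reported "minimum" would then be the smallest smoothed value,
# which need not fall on the day of the smallest raw daily value.
minimum = min(smoothed)
```

Two groups applying different windows, or the same window to different underlying algorithms, can report different minima for the same season.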


From the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)
http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm
*5-day running mean of daily values

3.8% greater than NSIDC’s value

Example of differences in SSM/I derived ice concentration
values calculated with three different passive microwave
algorithms (Meier et al. 2001)
Designating User Communities by Parsons and Duerr; CODATA, Berlin, 8 Nov. 2004

Metaphor is for most people a device
of the poetic imagination and the
rhetorical ﬂourish— a matter of
extraordinary rather than ordinary
language. Moreover, metaphor is
typically viewed as characteristic of
language alone, a matter of words
rather than thought or action. For this
reason, most people think they can get
along perfectly well without metaphor.

We have found, on the contrary, that
metaphor is pervasive in everyday life,
not just in language but in thought and
action. Our ordinary conceptual
system, in terms of which we both
think and act, is fundamentally
metaphorical in nature.

“Is Data Publication the Right Metaphor?”
Parsons and Fox. (in press). Data Science Journal

Preprint at http://mp-datamatters.blogspot.com

Purpose of Data Citation

• Aid scientific reproducibility through direct, unambiguous
connection to the precise data used
• Credit for data authors and stewards
• Accountability for creators and stewards
• Track impact of data set
• Help identify data use (e.g., trackbacks)
• Data authors can verify how their data are being used.
• Users can better understand the application of the data.

• A locator/reference mechanism, not a discovery mechanism per se

“Bridging Data Lifecycles: Tracking Data Use
via Data Citations” UCAR Workshop Report

• Recommendation 1:
Identify what you want to achieve via data citations
• Recommendation 2: Understand the options for
actionable identifier schemes
• Recommendation 3: Engage stakeholders
• Recommendation 4: Start with well-bounded cases
• Recommendation 5: Plan for long-term implications

• http://library.ucar.edu/data_workshop/

How data citation is currently done

• Citation of traditional publication that actually contains the data,
e.g. a parameterization value.
• Not mentioned, just used, e.g., in tables or figures
• Reference to name or source of data in text
• URL in text (with variable degrees of specificity)
• Citation of related paper (e.g. CRU Temp. records recommend
citing two old journal articles which do not contain the actual
data or full description of methods)
• Citation of actual data set typically using recommended citation
given by data center
• Citation of data set including a persistent identifier/locator,
typically a DOI

“MODIS Snow Cover Data” in Google Scholar
[Bar chart by year; legend: Formal Citation, Total Entries; axis 0 to 600. Percentage labels by year:]
2002: 1.3%
2003: 1.0%
2004: 0.7%
2005: 0.7%
2006: 1.3%
2007: 0.9%
2008: 1.3%
2009: 1.7%

Data Citation Guidelines

• Federation of Earth Science Information Partners. 2012. http://bit.ly/data_citation
and related guidelines for the Group on Earth Observations (GEO)
• Best available for Earth system science. Not yet widely adopted but growing.
• Digital Curation Center. 2011. http://www.dcc.ac.uk/resources/how-guides/cite-datasets
• Best overall guide. Not yet widely adopted but growing.
• DataCite—a well-recognized consortium of libraries and related organizations
working to define a citation approach around DOIs and working to get data
citations included in citation indices.
• DataVerse Network Project—a standard from the social science community using
a Handle locator and “Universal Numerical Fingerprint” as a unique identifier.
• New CODATA Task Group in collaboration with ICSTI. Report due soon.
• NASA DAACs, NCAR, some NOAA centers adopting ESIP-based approaches.
• More consistency is emerging but there is still great variation in recommended
approach. They range from specific data citation, to general acknowledgement,
to recommending citing a journal article, or even a presentation.

Basic data citation form and content

Per DataCite:
Creator. PublicationYear. Title. [Version]. Publisher.
[ResourceType]. Identifier.

Per ESIP:
Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].
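
A minimal sketch of assembling a citation in the ESIP element order shown above. The class name, field names, and the "Edited by" / "Accessed" / "Subset:" wording are assumptions made for illustration; they are not part of the ESIP or DataCite guidelines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical container for the ESIP citation elements."""
    authors: str
    release_date: str
    title: str
    archive: str
    locator: str
    version: Optional[str] = None
    editors: Optional[str] = None
    accessed: Optional[str] = None
    subset: Optional[str] = None

    def format(self) -> str:
        # Title and version form one element: "Title, version."
        title = self.title if self.version is None else f"{self.title}, {self.version}"
        parts = [self.authors, self.release_date, title]
        if self.editors:
            parts.append(f"Edited by {self.editors}")
        parts.append(self.archive)
        parts.append(self.locator)
        if self.accessed:
            parts.append(f"Accessed {self.accessed}")
        if self.subset:
            parts.append(f"Subset: {self.subset}")
        # Each element ends with exactly one period.
        return " ".join(p.rstrip(".") + "." for p in parts)


# Values taken from the CLPX example citation in this presentation.
citation = DataCitation(
    authors="Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston",
    release_date="2002, Updated 2003",
    title="CLPX-Ground: ISA snow depth transects and related measurements",
    version="ver. 2.0",
    editors="M. Parsons and M. J. Brodzik",
    archive="Boulder, CO: National Snow and Ice Data Center",
    locator="http://dx.doi.org/10.5060/D4H41PBP",
    accessed="2008-05-14",
).format()
```

Keeping the elements as separate fields, rather than one free-text string, makes it easier to emit the same citation in the DataCite order as well.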


An Example Citation

Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.

Version: ver. 2.0

Locator: http://nsidc.org/data/nsidc-0175.html

Locator (as a DOI): http://dx.doi.org/10.5060/D4H41PBP.

Identifier vs. Locator

• Human ID: Mark Alan Parsons (son of Robert A. and Ann M., etc.)
• every term defined independently and only unique in context/provenance
(remember that).
• Alternative like a social security number requires a very well
controlled central authority.
• Human Locator: 1540 30th St., Room 201, Boulder CO 80303.
• every term has a naming authority

• Data Set IDs: data set title, filename, database key, object id code (e.g.
UUID), etc.
• Data set Locators: URL, directory structure, catalog number, registered
locator (e.g. DOI), etc.

An assessment of identification schemes for
digital Earth science data

[Table: each ID scheme is rated Good, Fair, or Poor against each criterion, separately for the Data Set level and the Item level. The ratings were color-coded on the slide.]
Criteria (columns): Unique Identifier, Unique Locator, Citable Locator, Scientifically Unique ID
ID schemes (rows): URL/N/I, PURL, XRI, Handle, DOI, ARK, LSID, OID, UUID

Adapted from Duerr, R. E., et al. 2011. On the utility of identification schemes for digital Earth science data: An
assessment and recommendations. Earth Science Informatics. 4:139-160.
http://dx.doi.org/10.1007/s12145-011-0083-6

[The same table, with the schemes toward the top labeled “Locators” and those toward the bottom labeled “Identifiers”.]

Why the DOI?

• Not perfect but well understood by publishers
• DataCite working with Thomson Reuters to get data citations in their
index.

But...
• What is the citable unit?
• How do we handle different versions?
• What about “retired” data?
• When is a DOI assigned?


Issues largely resolved by...

• A defined versioning scheme
• Good tracking and documentation of the versions
• Due diligence in archive and release practices


When to assign a DOI?

• First principle: Data should be reference-able as soon as they are
available for use by anyone other than the original authors.
• But...
• Most people (falsely) believe that a DOI assures permanence so
how do we cite transient data?
• Some believe that a DOI should not be assigned until the data have
undergone some level of review (e.g. Lawrence et al. 2011). So
how do we cite data used before the review?
• Data are often used by friends and collaborators in a raw,
“unpublished” state. Should this use be cited with a DOI?
• Near real time or preliminary data may only be available for a short,
uncurated period, and there may not be a good match between
the submission package and the distribution package. What gets
the DOI? When?

Versioning approach recommended by DCC

• “As DOIs are used to cite data as evidence, the dataset to which a
DOI points should also remain unchanged, with any new version
receiving a new DOI.”
• “There are two possible approaches the data repository can take: time
slices and snapshots.”


Versioning and locators:
some suggestions from NSIDC

• major version.minor version.[archive version]
• Individual stewards need to determine which are major vs. minor versions and describe
the nature and file/record range of every version.
• Assign DOIs to major versions.
• Old DOIs should be maintained and point to some appropriate page that explains what
happened to the old data if they were not archived.
• A new major version leads to the creation of a new collection-level metadata record that
is distributed to appropriate registries. The older metadata record should remain with a
pointer to the new version and with explanation of the status of the older version data.
• Major and minor version (after the first version) should be exposed with the data set title
and recommended citation.
• Minor versions should be explained in documentation, ideally in file-level metadata.
• Applying UUIDs to individual files upon ingest aids in tracking minor versions and
historical citations.
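
The suggestions above can be sketched roughly as follows. This is a sketch under assumptions, not NSIDC's implementation: the function names are hypothetical, versions are assumed to be dot-separated integers, and random (version 4) UUIDs are an arbitrary choice.

```python
import uuid
from typing import Dict, Iterable, NamedTuple, Optional


class DataVersion(NamedTuple):
    """major.minor.[archive], with the archive part optional."""
    major: int
    minor: int
    archive: Optional[int] = None


def parse_version(text: str) -> DataVersion:
    """Parse '2.0' or '2.1.3' into its components."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) == 2:
        return DataVersion(parts[0], parts[1])
    if len(parts) == 3:
        return DataVersion(*parts)
    raise ValueError(f"expected major.minor[.archive], got {text!r}")


def needs_new_doi(old: DataVersion, new: DataVersion) -> bool:
    """DOIs are assigned to major versions only, so only a major bump needs one."""
    return new.major > old.major


def assign_file_ids(filenames: Iterable[str]) -> Dict[str, str]:
    """Give every file a UUID on ingest so minor versions can be tracked per file."""
    return {name: str(uuid.uuid4()) for name in filenames}


# Hypothetical usage.
minor_bump = needs_new_doi(parse_version("2.0"), parse_version("2.1"))
major_bump = needs_new_doi(parse_version("2.1.3"), parse_version("3.0"))
file_ids = assign_file_ids(["transects_a.txt", "transects_b.txt"])
```

Deciding what counts as a major versus a minor change stays with the individual steward; the code only records and acts on that decision.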



Author(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].

The best solution may be to have unique identifiers or query IDs
for subsets, but that won’t be available for most data sets for a
long time, so we need alternative solutions...


February 8, 2011, 4:45 PM

Page Numbers for Kindle Books an Imperfect Solution
Neither solution is perfect—‘locations’ or
page numbers—because the problem is
unsolvable. The best we can hope for is a
choice...

Amazon’s Kindle will have
page numbers that
correspond to real books
and locations by passage.

http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/

Chapter and Verse

• Bible

• Koran

• Bhagavad-Gita and Ramayana

• other sacred texts

• A “structural index”

The “Archival Information Unit”

“An Archival Information Package whose Content Information is not
further broken down into other Content Information components, each
of which has its own complete Preservation Description Information. It
can be viewed as an ‘atomic’ AIP”
“From an Access viewpoint, new subsetting and manipulation
capabilities are beginning to blur the distinction between AICs and AIUs.
Content objects which used to be viewed as atomic can now be viewed
as containing a large variation of contents based on the subsetting
parameters chosen. In a more extreme example, the Content
Information of an AIU may not exist as a physical entity. The Content
Information could consist of several input files (or pointers to the AIPs
containing these data files) and an algorithm which uses these files to
create the data object of interest.”
• CCSDS. 2002. Reference Model for An Open Archival Information System (OAIS)
CCSDS 650.0-B-1 Issue 1. Washington, DC: CCSDS Secretariat. p. 1-8, 4-38.

Citation scenarios and production patterns

• What kind of “atomic” item is being cited, i.e. the “Archival Information
Unit (AIU)” (e.g., a data file, a data element within a file, a relational (or
other) database, a job “residue”)?
• How many AIU items are in a typical citation for the scenario being
considered?
• What other digital or physical objects need to be available to make the
unit usable—the “Preservation Description Information (PDI)”?

Key Question:
• What structure or structures can we use to organize data collections
that might be common across Earth sciences?


An example production pattern for Cline et al. (2003).

[Diagram, built up over successive slides]
Analog to digital w/ QC: field notebook, Excel v1, printout, Excel v2, ascii files, shapefiles
Born digital: camera, interim jpgs, collated and named jpgs
Distributed data set: tarball + HTML Doc.
File counts shown: 100s, 100s, 1000s

Excerpt of an ascii file shown on the slides (truncated as displayed):
TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY,TEMP,SURVEYOR ,QC,COMMENTS
, , , , , ,cm , , , , deg-F, , ,
FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n,-999,"Fitzgerald, Matous, Dundas ","QC(000) ",""
FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n,
...
FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n,-999,"Fitzgerald, Matous, Dundas ","QC(000)","Couldn't find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!

Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)


[Diagram, built up over successive slides]
MODAPs Processing
Archives: GSFC, EDC, NSIDC
1 file/day/tile (grid cell): 1,000,000s
Each file contains metadata describing previous inputs and detailed versioning

Doing it as best we can...?

• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007,
updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid
V005.3, Oct. 2007- Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder,
Colorado USA: National Snow and Ice Data Center. Data set accessed
2008-11-01 at http://dx.doi.org/10.1234/xxx.
• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007,
updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid
V005.3, Oct. 2007- Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1).
Boulder, Colorado USA: National Snow and Ice Data Center. Data set
accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.
• Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002,
Updated 2003. CLPX-Ground: ISA snow depth transects and related
measurements, Version 2.0, shapefiles. Edited by M. Parsons and M.
J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set
accessed 2008-05-14 at http://dx.doi.org/10.5060/D4H41PBP.

Sea Ice Index (Fetterer et al. 2009)

“Near Real Time”: Maslanik and Stroeve 1999
“Preliminary”: Meier et al. 2006
“Final”: Cavalieri et al. 1996

Sea Ice Production – Data Workflow
[Flow diagram. Last updated: D Scott 12/2011]

Color key: Source (data); Value added product; Near-Real-Time product; Preliminary Product; Final Product; Not part of discussion
Notes on the diagram: “Deaccession to begin 1/2012”; “Data not yet distributed”

Sources and producers shown:
• Remote Sensing Systems TBs (Wentz)
• CLASS F17 TBs
• Goddard Space Flight Center (GSFC), NASA Team production line
• Fowler, University of CO
• Anderson, University of NE
• Robinson, Rutgers: Pseudo-weekly Snow Cover

Data sets and products shown:
• NSIDC-0001: SSM/I Polar Stereo Tbs
• NSIDC-0002: SSM/I Daily and Monthly Polar Gridded Bootstrap
• NSIDC-0046: NH EASE-Grid Weekly Snow Cover and Sea Ice Extent V3
• NSIDC-0051: Preliminary Sea Ice Concentrations from SMMR and SSM/I Passive Microwave Data
• NSIDC-0051: Sea Ice Concentrations from SMMR and SSM/I Passive Microwave Data
• NSIDC-0066: AVHRR Polar Pathfinder Twice-Daily 5 km EASE-Grid Composites
• NSIDC-0079: Preliminary Bootstrap Sea Ice Concentrations from Nimbus-7 SMMR and DMSP SSM/I
• NSIDC-0079: Bootstrap Sea Ice Concentrations from SMMR and SSM/I
• NSIDC-0080: Near-Real-Time DMSP SSM/I Daily Polar Gridded Tbs
• NSIDC-0081: Near-Real-Time DMSP SSM/I Daily Polar Gridded Sea Ice Concentrations
• NSIDC-0105: Snow Melt Onset Over Arctic Sea Ice from SMMR and SSM/I Tbs
• NSIDC-0116: Polar Pathfinder Daily 25 km EASE-Grid Sea Ice Motion Vectors
• NSIDC-0192: Sea Ice Trends and Climatologies from SMMR and SSM/I
• G00791: IABP Drifting Buoy Pressure, Temperature, Position, and Interpolated Ice Velocity
• G02135: Sea Ice Index
• G02202: NOAA/NSIDC CDR
• NISE; NISE on NEO
• World Wide Web: Arctic Sea Ice News and Analysis

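[Diagram of data production across sensors, processing levels, and products]
Sensors: MODIS, SeaWiFS, ...
Processing levels: level 0, level 1a, level 1b, level 2
Also shown: GSM, algorithms, software, calibration; reprocess, revalidate
Products: chlorophyll, phytoplankton, particulates, …

slide courtesy Greg Janee, UCSB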

Hypothesis: ~80% of citation scenarios for 80% of
Earth system science data can be addressed with
basic citations (Author(s). ReleaseYear. Title, Version.
[editor(s)]. Archive. Locator. [date/time accessed]. [subset used].),
a solid methods section, and reasonable due
diligence by archives.

Future directions on scientific equivalence

Content equivalence and provenance equivalence
serve as a proxy for scientific equivalence.

Content Equivalence: Is there an algorithm that can consider the
content of a file and come up with a unique identifier that will be the
same for objects in the same “content equivalence class”?
• Exact content equivalence can use digital signature or
cryptographic techniques (MD5, SHA-1, etc.).
• Universal Numerical Fingerprint (UNF) takes a digital signature of a
“canonical” representation of the information, but defining what is
canonical is a problem.
• How can we define loose or scientific content equivalence? Are
the shapefiles and text files in Cline et al. (2003) equivalent?
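
The difference between exact and canonical content equivalence can be sketched as below. The canonicalization rule used here (trim whitespace in each field, drop blank rows, normalize line endings) is an assumption chosen for illustration; it is not the UNF algorithm, and SHA-256 is swapped in for the MD5/SHA-1 examples named above.

```python
import csv
import hashlib
import io


def exact_digest(data: bytes) -> str:
    """Exact equivalence: any byte-level change produces a different digest."""
    return hashlib.sha256(data).hexdigest()


def canonical_digest(csv_text: str) -> str:
    """Digest of an assumed canonical form of a CSV file.

    Canonical form (illustrative only): fields stripped of surrounding
    whitespace, blank rows dropped, tab-separated fields, '\n' line endings.
    """
    canonical_rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        canonical_rows.append("\t".join(field.strip() for field in row))
    canonical = "\n".join(canonical_rows).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Two hypothetical renderings of the same records: padded fields with
# Windows line endings versus compact fields with Unix line endings.
file_a = "FAA01.1 ,iop4,2003-03-25, 104\r\nFAA01.2 ,iop4,2003-03-25, 13\r\n"
file_b = "FAA01.1,iop4,2003-03-25,104\nFAA01.2,iop4,2003-03-25,13\n"

same_bytes = exact_digest(file_a.encode("utf-8")) == exact_digest(file_b.encode("utf-8"))
same_content = canonical_digest(file_a) == canonical_digest(file_b)
```

The two files differ byte for byte but share a canonical digest. Whether that rule captures scientific equivalence is exactly the open question: comparing a shapefile with a text file would need a far richer canonical form.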


Provenance Equivalence: A process is reproducible when there are
sufficient creation provenance details for someone else to make an
equivalent file. If someone follows those provenance details to re-create
the object, the resulting object will be equivalent to the original.
• If we can enumerate/list sufficient or “essential” creation
provenance details to make an equivalent file, then we can
describe an algorithm to produce an identifier that will be the
same for files that match those provenance details.
• The algorithm can be similar to the content equivalence algorithm
described earlier: take a digital signature of a canonical representation of the
information, in this case a canonical serialization of processes.
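
One possible sketch of such a provenance identifier follows. Everything specific here is an assumption: the step records and their keys are hypothetical, sorted-key compact JSON stands in for the “canonical serialization of processes”, and SHA-256 is an arbitrary choice of digital signature.

```python
import hashlib
import json
from typing import Dict, List


def provenance_id(steps: List[Dict[str, str]]) -> str:
    """Digest of a canonical serialization of an ordered list of process steps.

    Key order inside a step is normalized away; the order of the steps
    themselves is kept, because running processes in a different order
    is treated as different provenance.
    """
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical "essential" provenance, loosely modeled on the field-notebook example.
run_a = [
    {"process": "transcribe", "input": "field notebook", "tool": "Excel"},
    {"process": "quality_control", "input": "Excel v1", "reference": "printout"},
]
# Same steps recorded with keys in a different order.
run_b = [
    {"tool": "Excel", "input": "field notebook", "process": "transcribe"},
    {"reference": "printout", "process": "quality_control", "input": "Excel v1"},
]
# A different process: the quality-control step is missing.
run_c = run_a[:1]

equivalent = provenance_id(run_a) == provenance_id(run_b)
different = provenance_id(run_a) != provenance_id(run_c)
```

The hard part is not the hashing but agreeing on which provenance details are “essential” and how to name them consistently across archives.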


Thank You
parsonsm@nsidc.org

photo courtesy NOAA

More examples
• Glacier Photos: chemical creation of image, digitization, multi-source, no
versioning
• Hurricane Ike: direct digital creation, embedded software, single-source, no
versioning
• Rock or Ice Cores: physical specimens with later experimental operations to
create data collections – digital results may be stored in databases with complex
provenance
• SeaDataNet: Many individual Conductivity, Temperature, Depth measurements
from many individual cruises compiled into one database.
• Global Historical Climatology Network: In situ measurements of Temp and Humidity,
recorded as time series and appended to files (with some quality control) –
multi-source, some versioning
• Radiosonde Network: Balloon-borne Temp and Humidity from about 20,000 sites
every 12 or 6 hours assimilated into weather forecasts – archives maintained at
NCAR, GSFC, and elsewhere – reanalysis can create new versions
• Large-scale Satellite Data Production: complex, large-scale data flows from
multiple sources, systematic approaches to versioning where the structure of one
version of a data product resembles the next


Mandatory:
• Author(s)
• Release Date
• Title
• Version
• Archive and/or Distributor
• Locator or Persistent Location service
• Time, date accessed

Optional:
• Editor(s)
• Archive or Distributor Place
• Other Institutional Role
• Indication of subset used

Serving, Citing and Publishing Data
Citation forms an important part of the scientific record.

We draw a clear distinction between:
publishing = making available for consumption (e.g. on the web),
and
Publishing = publishing after some formal process which adds value for the consumer:
• e.g. PLoS ONE type review, or
• EGU journal type public review, or
• More traditional peer review.
AND
• provides commitment to persistence

2. Publication of data sets
This involves the peer-review of data sets, and gives “stamp of approval” associated with traditional journal
publications. Can’t be done without effective linking/citing of the data sets.

1. Data set Citation
This is our first step for this project: formulate and formalise a way of citing data sets. Will provide benefits
to our users, and a carrot to get them to provide data to us!

0. Serving of data sets
This is what data centres do as our day job: take in data supplied by scientists and make it available to
other interested parties.

[Labels on the graphic: Doi:10232/123ro, Doi:10232/123]

slide courtesy S. Callaghan, BADC
VO Sandpit, November 2009
