NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

3,241 views

Published on

Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center

Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several, respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility--the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed, references necessary. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.

Published in: Education
  • Be the first to comment

  • Be the first to like this

NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

  1. 1. Data EquivalenceMark A. Parsonsand the ESIP Preservation and Stewardship CommitteeNISO Forum: Tracking it Back to the Source: Managing and Citing Research DataDenver, Colorado, USA24 September 2012
  2. 2. The National Snow and Ice Data Center…Manages and distributes Performs scientific andscientific data informatics research Supports data usersCreates tools for Educates the public data access about the cryosphere http://nsidc.org
  3. 3. Minimum Arctic Sea Ice Extent.Image courtesy NASA/Goddard Scientific Visualization Studio
  4. 4. 21 Sept. 2012 Minimum, 16 Sept. 3,412,196 km2*Derived from the NSIDC Sea Ice Index (Fetterer et al., 2009) *5-day running mean of daily values
  5. 5. Minimum, 16 Sept. 3,541,406 km2*From the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
  6. 6. Minimum, 16 Sept. 3,541,406 km2* 3.8% greater than NSIDC’s valueFrom the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
  7. 7. Example of differences in SSM/I derived ice concentration values calculated with three different passive microwave algorithms (Meier et al. 2001)Designating User Communities by Parsons and Duerr; CODATA, Berlin, 8 Nov. 2004
  8. 8. Metaphor is for most people a deviceof the poetic imagination and therhetorical flourish— a matter ofextraordinary rather than ordinarylanguage. Moreover, metaphor istypically viewed as characteristic oflanguage alone, a matter of wordsrather than thought or action. For thisreason, most people think they can getalong perfectly well without metaphor.We have found, on the contrary, thatmetaphor is pervasive in everyday life,not just in language but in thought andaction. Our ordinary conceptualsystem, in terms of which we boththink and act, is fundamentallymetaphorical in nature.
  9. 9. “Is Data Publication the Right Metaphor?” Parsons and Fox. (in press). Data Science Journal Preprint at http://mp-datamatters.blogspot.com
  10. 10. Purpose of Data Citation• Aid scientific reproducibility through direct, unambiguous connection to the precise data used• Credit for data authors and stewards• Accountability for creators and stewards• Track impact of data set• Help identify data use (e.g., trackbacks) • Data authors can verify how their data are being used. • Users can better understand the application of the data.• A locator/reference mechanism not a discovery mechanism per se 9
  11. 11. “Bridging Data Lifecycles: Tracking Data Usevia Data Citations” UCAR Workshop Report• Recommendation 1: Identify what you want to achieve via data citations• Recommendation 2: Understand the options for actionable identifier schemes• Recommendation 3: Engage stakeholders• Recommendation 4: Start with well-bounded cases• Recommendation 5: Plan for long-term implications• http://library.ucar.edu/data_workshop/ 10
  12. 12. How data citation is currently done• Citation of traditional publication that actually contains the data, e.g. a parameterization value.• Not mentioned, just used, e.g., in tables or figures• Reference to name or source of data in text• URL in text (with variable degrees of specificity)• Citation of related paper (e.g. CRU Temp. records recommend citing two old journal articles which do not contain the actual data or full description of methods)• Citation of actual data set typically using recommended citation given by data center• Citation of data set including a persistent identifier/locator, typically a DOI 11
  13. 13. 2009 1.7% 2008 1.3% 2007 0.9% 2006 1.3% 2005 0.7% 2004 0.7% 2003 1.0% Formal Citation Total Entries 2002 1.3% 0 100 200 300 400 500 600“MODIS Snow Cover Data” in Google Scholar
  14. 14. Data Citation Guidelines• Federation of Earth Science Information Partners. 2012. http://bit.ly/data_citation and related guidelines for the Group on Earth Observations (GEO) • Best available for Earth system science. Not yet widely adopted but growing.• Digital Curation Center. 2011. http://www.dcc.ac.uk/resources/how-guides/cite- datasets • Best overall guide. Not yet widely adopted but growing.• DataCite—a well-recognized consortium of libraries and related organizations working to define a citation approach around DOIs and working to get data citations included in citation indices.• DataVerse Network Project—a standard from the social science community using a Handle locator and “Universal Numerical Fingerprint” as a unique identifier.• New CODATA Task Group in collaboration with ICSTI. Report due soon.• NASA DAACs, NCAR, some NOAA centers adopting ESIP-based approaches.• More consistency is emerging but there is still great variation in recommended approach. They range from specific data citation, to general acknowledgement, to recommending citing a journal article, or even a presentation. 13
  15. 15. Basic data citation form and contentPer DataCite:Creator. PublicationYear. Title. [Version]. Publisher.[ResourceType]. Identifier.Per ESIP:Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/orDistributor. Locator. [date/time accessed]. [subset used]. 14
  16. 16. An Example CitationCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
  17. 17. VersionCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
  18. 18. LocatorCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
  19. 19. LocatorCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2012-09-22 at http://dx.doi.org/10.5060/D4H41PBP.
  20. 20. Identifier vs. Locator• Human ID: Mark Alan Parsons (son of Robert A. and Ann M., etc.) • every term defined independently and only unique in context/ provenance (remember that). • Alternative like a social security number requires a very well controlled central authority.• Human Locator: 1540 30th St., Room 201, Boulder CO 80303. • every term has a naming authority• Data Set IDs: data set title, filename, database key, object id code (e.g. UUID), etc.• Data set Locators: URL, directory structure, catalog number, registered locator (e.g. DOI), etc. 19
  21. 21. An assessment of identification schemes for digital Earth science data Unique Unique Citable Scientifically Identifier Locator Locator Unique ID ID Scheme Data Set Item Data Set Item Data Set Item Data Set Item URL/N/I PURL XRI Handle DOI ARK LSID Good OID Fair Poor UUIDAdapted from Duerr, R. E., et al.. 2011. On the utility of identification schemes for digital Earth science data: Anassessment and recommendations. Earth Science Informatics. 4:139-160. 20http://dx.doi.org/10.1007/s12145-011-0083-6
  22. 22. An assessment of identification schemes for digital Earth science data Unique Unique Citable Scientifically Identifier Locator Locator Unique ID ID Scheme Data Set Item Data Set Item Data Set Item Data Set Item s URL/N/I or at c PURLLo XRI Handle DOI ARK ers LSID Good ntifi OID FairIde Poor UUID Adapted from Duerr, R. E., et al.. 2011. On the utility of identification schemes for digital Earth science data: An assessment and recommendations. Earth Science Informatics. 4:139-160. 20 http://dx.doi.org/10.1007/s12145-011-0083-6
  23. 23. Why the DOI?• Not perfect but well understood by publishers• DataCite working with Thomson Reuters to get data citations in their index. But... • What is the citable unit? • How do we handle different versions? • What about “retired” data? • When is a DOI assigned? 21
  24. 24. Issues largely resolved by...• A defined versioning scheme• Good tracking and documentation of the versions• Due diligence in archive and release practices 22
  25. 25. When to assign a DOI?• First principle: Data should be reference-able as soon as they are available for use by anyone other than the original authors.• But... • Most people (falsely) believe that a DOI assures permanence so how do we cite transient data? • Some believe that a DOI should not be assigned until the data has undergone some level of review (e.g. Lawrence et al. 2011). So how do we cite data used before the review? • Data are often used by friends and collaborators in a raw, “unpublished” state. Should this use be cited with a DOI? • Near real time or preliminary data may only be available for a short uncurated, period, and there may not be a good match between the submission package and the distribution package. What gets the DOI? When? 23
  26. 26. Versioning approach recommended by DCC• “As DOIs are used to cite data as evidence, the dataset to which a DOI points should also remain unchanged, with any new version receiving a new DOI.”• “There are two possible approaches the data repository can take: time slices and snapshots.” 24
  27. 27. Versioning and locators:some suggestions from NSIDC• major version.minor version.[archive version]• Individual stewards need to determine which are major vs. minor versions and describe the nature and file/record range of every version.• Assign DOIs to major versions.• Old DOIs should be maintained and point to some appropriate page that explains what happened to the old data if they were not archived.• A new major version leads to the creation of a new collection-level metadata record that is distributed to appropriate registries. The older metadata record should remain with a pointer to the new version and with explanation of the status of the older version data.• Major and minor version (after the first version) should be exposed with the data set title and recommended citation.• Minor versions should be explained in documentation, ideally in file-level metadata.• Applying UUIDs to individual files upon ingest aids in tracking minor versions and historical citations. 25
  28. 28. Basic data citation form and contentAuthor(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/orDistributor. Locator. [date/time accessed]. [subset used].The best solution may be to have unique identifiers or query IDsfor subsets, but that won’t be available for most data sets for along time, so we need alternative solutions... 26
  29. 29. February 8, 2011, 4:45 PM Page Numbers for Kindle Books an Imperfect Solution Neither solution is perfect—‘locations’ or page numbers—because the problem is unsolvable. The best we can hope for is a choice... Amazon’s Kindle will have page numbers that correspond to real books and locations by passage.http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/
  30. 30. Chapter and Verse• Bible• Koran• Bhagavad-Gita and Ramayana• other sacred texts• A “structural index”
  31. 31. The “Archive Information Unit”“An Archival Information Package whose Content Information is notfurther broken down into other Content Information components, eachof which has its own complete Preservation Description Information. Itcan be viewed as an ‘atomic’ AIP”“From an Access viewpoint, new subsetting and manipulationcapabilities are beginning to blur the distinction between AICs and AIUs.Content objects which used to be viewed as atomic can now be viewedas containing a large variation of contents based on the subsettingparameters chosen. In a more extreme example, the ContentInformation of an AIU may not exist as a physical entity. The ContentInformation could consist of several input files (or pointers to the AIPscontaining these data files) and an algorithm which uses these files tocreate the data object of interest.”• CCSDS. 2002. Reference Model for An Open Archival Information System (OAIS) CCSDS 650.0-B-1 Issue 1. Washington, DC: CCSDS Secretariat. p. 1-8, 4-38. 29
  32. 32. Citation scenarios and production patterns• What kind of “atomic” item is being cited—the “Archive Information Unit (AIU)” (e.g., a data file, a data element within a file, a relational (or other) database, a job “residue”)?• How many AIUs items are in a typical citation for the scenario being considered?• What other digital or physical objects need to be available to make the unit usable—the “Preservation Description Information (PDI)”? Key Question:• What structure or structures can we use to organize data collections that might be common across Earth sciences? 30
  33. 33. An example production pattern for Cline et al. (2003).
  34. 34. 32
  35. 35. 33
  36. 36. 34
  37. 37. 35
  38. 38. 36
  39. 39. 37
  40. 40. 38
  41. 41. 39
  42. 42. 40
  43. 43. 41
  44. 44. 42
  45. 45. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC ,COMMENTS , , , , , ,cm , , , , deg- F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
  46. 46. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC ,COMMENTS , , , , , ,cm , , , , deg- F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
  47. 47. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML F, , , , , , ,cm FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, , , 104, , d, , y, , n, , deg- Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
  48. 48. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
  49. 49. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC tarball shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
  50. 50. A production pattern for Cline et al., 2003 field notebook Excel v1 Couldnt find post, v2 printout Excel distributed data set used GPS 5940 1197; FAA4.4 and TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR FAA4.5 unsafe, ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s avalanche area! 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC tarball shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
  51. 51. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007)
  52. 52. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC MODAPs Processing
  53. 53. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC 1 file/day/tile (grid cell) Each file contains metadata describing previous inputs and detailed versioning MODAPs Processing
  54. 54. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC 1,000,000s 1 file/day/tile (grid cell) Each file contains metadata describing previous inputs and detailed versioning MODAPs Processing
  55. 55. Doing it as best we can...?• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007- Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007- Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1). Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.• Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, Version 2.0, shapefiles. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://dx.doi.org/10.5060/D4H41PBP. 45
  56. 56. Sea Ice Index Fetterer et al. 2009 “Near Real Time” “Preliminary”Maslanik and Stroeve Meir et al. 2006 1999 “Final” Cavalieri et al 1996
  57. 57. Remote Sensing SystemsSea Ice Production – Data Workflow TBs (Wentz) Color Key Source (data) Value added product Near-Real-Time product NSIDC-0002 Preliminary Product SSM/I Daily and NSIDC-0001 SSM/I Polar Stereo Tbs Final Product Monthly Polar Gridded Bootstrap Not part of discussionDeaccession to begin 1/2012 NSIDC-0051 Goddard Space NSIDC-0079 Fowler Anderson Flight Center Preliminary Bootstrap Sea Preliminary Sea Ice University of CO University of NE (GSFC) Ice Concentrations from Concentrations From SMMR and SSM/I Nimbus-7 SMMR Passive Microwave Data and DMSP SSM/I NISDC-0116 NSIDC-0105 Snow Melt Onset Data not yet Polar Pathfinder distributed Daily 25 km Over Arctic Sea Ice EASE-Grid Sea Ice from SMMR and NSIDC-0051 NSIDC-0079 Motion Vectors SSM/I Tbs Sea Ice Concentrations Bootstrap Sea Ice from SMMR and SSM/I Concentrations from Passive Microwave Data SMMR and SSM/I NASA Team G00791 production line IABP Drifting NSIDC-0066 AVHRR Polar Buoy Pressure, Pathfinder Twice-Daily G02202 Temperature, NOAA/NSIDC CDR 5 km EASE- Position, and Interpolated Grid Composites Ice Velocity NSIDC-0081 NSIDC-0192 NSIDC-0046 Near-Real-Time DMSP Sea Ice Trends and Robinson NH EASE-Grid SSM/I Daily Polar Gridded Climatologies from SMMR Psuedo-weekly Weekly Sea Ice Concentrations and SSM/I Snow Cover Snow Cover and Sea Rutgers Ice Extent V3 NSIDC-0080 Near-Real-Time DMSP NISE on NEO World Wide Web G02135 SSM/I Daily Polar Gridded NISE Arctic Sea Ice Sea Ice Index Tbs News and Analysis CLASS F17 TBs Last updated: D Scott 12/2011
  58. 58. MODIS chlorophylllevel 0 SeaWiFS GSM phytoplankton …level 1a algorithms ... particulates … softwarelevel 1b calibration ...level 2 reprocess, … revalidate 48 slide courtesy Greg Janee, UCSB
  59. 59. Hypothesis: ~80% of citation scenarios for 80% ofEarth system science data can be addressed withbasic citations (Author(s). ReleaseYear. Title, Version.[editor(s)]. Archive. Locator. [date/time accessed]. [subset used].),a solid methods section, and reasonable duediligence by archives.
  60. 60. Future directions on scientific equivalence
  61. 61. Content equivalence and provenance equivalenceserve as a proxy for scientific equivalence.Content Equivalence: Is there an algorithm that can consider thecontent of a file and come up with a unique identifier that will be thesame for objects in the same “content equivalence class”? • Exact content equivalence can use digital signature or cryptographic techniques MD5, SHA-1, etc.   • Universal Numeric Footprint (UNF) takes a digital signature of a “canonical” representation of the information, but canonical is a problem. • How can we define loose or scientific content equivalence? Are the shapefiles and text files in Cline et al. (2003) equivalent? 51
  62. 62. Content equivalence and provenance equivalenceserve as a proxy for scientific equivalence.Provenance Equivalence:  A process is reproducible when there issufficient creation provenance details for someone else to make anequivalent file. If someone follows those provenance details to re-createthe object, the resulting object will be equivalent to the original.    • If we can enumerate/list sufficient or “essential” creation provenance details to  make an equivalent file, then we can describe an algorithm to  produce an identifier that will be the same for files that match those provenance details. • Algorithm can be similar to the content equivalence algorithm before: take a digital signature of a canonical representation of the information in this case a canonical serialization of processes. 52
  63. 63. Thank Youparsonsm@nsidc.org photo courtesy NOAA
  64. 64. Backup slides
  65. 65. More examples• Glacier Photos: chemical creation of image, digitization, multi-source, no versioning• Hurricane Ike: direct digital creation, embedded software, single-source, no versioning• Rock or Ice Cores: physical specimens with later experimental operations to create data collections – digital results may be stored in databases with complex provenance• SeaDataNet: Many individual Conductivity, Temperature, Depth measurements from many individual cruises compiled into one database.• Global Historical Climate Network: In situ measurements of Temp and Humidity, recorded as time series and appended to files (with some quality control) – multi- source, some versioning• Radiosonde Network: Balloon-borne Temp and Humidity from about 20,000 sites every 12 or 6 hours assimilated into weather forecasts – archives maintained at NCAR, GSFC, and elsewhere – reanalysis can create new versions• Large-scale Satellite Data Production: complex, large-scale data flows from multiple sources, systematic approaches to versioning where the structure of one version of a data product resembles the next 55
  66. 66. Basic data citation form and contentMandatory: Optional: Author(s) Editor(s) Release Date Archive or Distributor Place Title Other Institutional Role Version Indication of subset used Archive and/or Distributor Locator or Persistent Location service Time, date accessed 56
  67. 67. Serving, Citing and Publishing DataCitation forms an important part of the scientific record. Doi:10232/123ro This involves the peer-review of dataWe draw a clear distinction between: 2. sets, and gives “stamp of approval” Publication of data sets associated with traditional journalpublishing = making available for publications. Can’t be done without consumption (e.g. on the web), effective linking/citing of the data sets. and Doi:10232/123Publishing = publishing after some 1. This is our first step for this project – formal process which adds value Data set Citation formulate and formalise a way of for the consumer: citing data sets. Will provide benefits• e.g. PloS ONE type review, or to our users – and a carrot to get them to provide data to us!• EGU journal type public review, or• More traditional peer review. 0.AND Serving of data sets This is what data centres do as our• provides commitment to day job – take in data supplied by scientists and make it available to persistence other interested parties. slide courtesy S. Callaghan, BADC VO Sandpit, November 2009

×