NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

3,213 views

Published on

Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center

Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several, respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility--the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed, references necessary. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
3,213
On SlideShare
0
From Embeds
0
Number of Embeds
237
Actions
Shares
0
Downloads
12
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Props to ESIP \n - Consortium of ~120 Earth Science-Related Partners\n- Started in 1998\n- Support from NASA, NOAA, and EPA With growing engagement from: USGS, NSF, DOE and USDA\n - Neutral forum for community networking, collaboration & problem solving\n - Limited international participation (strength and weakness)\n- P&S committee a few years old and very active. Please join. Google ESIP preservation\n\nA grand title for what might be better called “How do I specifically reference the precise data I used? And not the data that are very similar but different” \n\nData Equivalence by Mark A. Parsons and the ESIP Preservation and Stewardship Cluster is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/\n<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/"><img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by-sa/3.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">How to Cite an Earth Science Data Set</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://nsidc.org/cgi-bin/publications/pub_list.pl" property="cc:attributionName" rel="cc:attributionURL">Mark A. Parsons and the ESIP Preservation and Stewardship Cluster</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.\n\nAbstract:\nData citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several, respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility--the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed, references necessary. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I proposes some interim solutions and suggest research strategies for the future.\n
  • We manage data, educate the public, and do research on the cryosphere--the frozen parts of the word from kryos the greek for butt cold. We are getting more involved with other data including life and social science data and even Indigenous knowledge. I’d argue we are becoming an important center of interdisciplinary informatics, but we are best known for keeping an eye on that stuff in the background image-- sea ice \n\nImages:\nTop left: Near Real-Time SSM/I EASE-Grid Daily Global Ice Concentration and Snow Extent, 3 February 2008\nTop right: Pancake ice seen off the bow of the research vessel Aurora Australis, in the Southern Ocean near Antarctica. Photo courtesy Ted Scambos.\nBottom right: Data from the Sea Ice Index, shown on Google Earth and distributed from our Web site in the form of a Quick Time movie. The left image is an animated time series of sea ice extent from 1979 to 2006; the static image on the right compares the extent in 2007.\nBottom left: Global Monthly SWE Climatology Browse, from http://nsidc.org/data/nsidc-0271.html\nCenter: A section of the online data set documentation from Sea Ice Charts of the Russian Arctic in Gridded Format, 1933-2006 (http://nsidc.org/data/g02176.html).\n
  • In case you missed it, Arctic sea ice is rapidly declining. This years summer minimum sea ice extent about a week ago shattered previous records in an ongoing collapse.\nThis image shows this years minimum extent relative to to the 30-yr average minimum extent (yellow line). Note this average is in a period of rapid decline.\n\n“Satellite data reveal how the new record low Arctic sea ice extent, from Sept. 16, 2012, compares to the average minimum extent over the past 30 years (in yellow). Sea ice extent maps are derived from data captured by the Scanning Multichannel Microwave Radiometer aboard NASA's Nimbus-7 satellite and the Special Sensor Microwave Imager on multiple satellites from the Defense Meteorological Satellite Program. Credit: NASA/Goddard Scientific Visualization Studio “ http://www.nasa.gov/topics/earth/features/2012-seaicemin.html\n
  • So let’s look at this a little more closely.\n\nFetterer, F., K. Knowles, W. Meier, and M. Savoie. 2002, updated 2009. Sea Ice Index. Boulder, Colorado USA: National Snow and Ice Data Center. http://nsidc.org/data/g02135.html\n
  • \n
  • This highlights the issue. In science you have to be very exact in what you are pointing to.\n\nMeier W.N., VanWoert M.L., & Bertoia C. (2001) Evaluation of operational SSM/I algorithms. Annals of Glaciology. 33:102-08. \n\n
  • \n“researchers from Emory University reported in Brain & Language that when subjects in their laboratory read a metaphor involving texture, the sensory cortex, responsible for perceiving texture through touch, became active. Metaphors like “The singer had a velvet voice” and “He had leathery hands” roused the sensory cortex, while phrases matched for meaning, like “The singer had a pleasing voice” and “He had strong hands,” did not.” ANNIE MURPHY PAUL, NY Times 17 Mar 2012\n\nmetaphors help us create the complex narratives we use to understand our physical and conceptual experience. These complex narratives are made up of smaller, very simple narratives called “frames” or “scripts”. Framing and frame analysis are often used in knowledge representation, social theory, media studies, and psychology with much of the work stemming from Erving Goffman (1974). \n\nThese frames present a set of roles and relationships between them like characters in a play. They also help us define our terms and make sense of language, because words are defined relative to a conceptual frame. The word “sell” does not make sense without some understanding of a commercial transaction and some of the other roles and terms involved like “buyer,” “money,” and “cost”. Furthermore, by mentioning only one of these concepts like “buy” or “sell”, the whole commercial transaction scenario is evoked or “activated” in the mind (Fillmore, 1976). \n\nLakoff (2008) further argues that framing is critical to human cognition. The neural circuitry to create a frame is relatively simple and our brain essentially uses framing as a sort of cognitive processing shortcut. If things are understood in the context of a frame, much is already unconsciously understood and need not be consciously processed. We know what to expect. \n\nHe shows how language, metaphor, and framing play critical roles in any social enterprise. \n\n“Language is at once a surface phenomenon and a source of power. It is a means of expressing, communicating, accessing, and even shaping thought. ... Language gets its power because it is defined relative to frames, prototypes, metaphor, narratives, images and emotions. Part of its power comes from unconscious aspects: we are not consciously aware of all that it evokes in us, but it is there, hidden, always at work. If we hear the same language over and over, we will think more and more in terms of the frames and metaphors activated by that language.” (Lakoff 2008, p14)\nSo with that in mind Peter Fox and I wrote a critical essay, asking\n\nLakoff, G. (2008) The Political Mind: Why You Can't Understand 21st-Century Politics With An 18th-Century Brain. New York: Penguin Group. \n\n
  • \n
  • \n
  • \n
  • \n
  • The National Snow and Ice Data Center distributes a variety of different snow cover products derived from the Moderate Resolution Imaging Spectroradiometer (MODIS). The results of a quick analysis of how many scientific papers mention use of "MODIS Snow Cover Data" (according to Google Scholar) and how often the data sets themselves are formally cited show a huge disparity, illustrating  the infrequency of proper data citation in practice. Moreover, the lack of data citation standards introduces the possibility that informal references to data do not point to the exact data set actually used.\n\nThis was a crude estimate to make a point. Gene Major, Heather Piwowar, others have done manual studies showing it’s not being done and it hasn’t improved in a decade.\n\nSo the approach to date is to try and develop guidelines and promote their use and acceptance of the practice\n\n
  • So it’s pretty clear from all this that the author is not being fairly credited.\n\n
  • Digital Mapping Techniques '00 -- Workshop ProceedingsU.S. Geological Survey Open-File Report 00-325\nProposal for Authorship and Citation Guidelines for Geologic Data Sets and Map Images in the Era of Digital Publication\nBy Stephen M. Richard\n\n\nWhen I presented this only a year ago, it was much more diverse. We are converging on the approach, now we need to implment it!\n
  • regarding, tracking the impact...\n\nChen, R. S. and Downs, R. R. (2010). Evaluating the Use and Impacts of Scientific Data. National Federation of Advanced Information Services (NFAIS) Workshop, Assessing the Usage and Value of Scholarly and Scientific Output:  An Overview of Traditional and Emerging Approaches. Philadelphia, PA, November 10, 2010. http://info.nfais.org/info/ChenDownsNov10.pdf\n
  • \n
  • \n
  • These databases (a) do not currently support tracking citations to data sets, (b) have expensive subscription fees, and (c) only reveal impact demonstrated through traditional citations. \n
  • \n
  • \n
  • resource type is a short controlled list of “The general type of a resource.”\nmany other optional fields possible\n\n
  • \n
  • Authors are those who put the intellectual effort into creating the data set. Study designers, algorithm developers, instrument designers, field team leaders, etc.\n
  • \n
  • \n
  • Editors and other roles help with credit and accountability\n\n
  • Editors and other roles help with credit and accountability\n\n
  • Publidhe. Could be multiple, but usually there’s an authority.\n
  • \n
  • \n
  • They’re different, but sometimes locator can be used as an ID (The person working in this position at this address). Hence the general use of the term “identifier “such as in DOI, which is better described as a locator\n\nChun Li and Choi introduce name and address as complements to Id and locator. I’m not sure I entirely sure I understand the distinction, but a key poin they make is that while the concepts of identifier and locator are distinct tey can and do intersect in different context. Sometimes , you can id with a locator and locate with an identifier.\n\nWhat we really need is what Jon Kunze calls an “actionable locator”.\n\nKunze J. (2003) Towards electronic persistence using ARK identifier. Retrieved September 22, 2012 from the World Wide Web: http://pid.ndk.cz/dokumenty/zakladni-literatura/arkcdl.pdf\n\nChun, W, TH Lee, and T Choi. 2011. YANAIL: yet another definition on names, addresses, identifiers, and locators. Proceedings of the 6th International Conference on Future Internet Technologies. pp. 8-12\n
  • \n
  • Debate in LSID community weakens it. - - an LSID is a locator; but also the ObjectID part of it is an Identifier and most people use a UUID for the ObjectID part of it\nOID problem\nARK is a bit better than the rest of the locators because it has additional trust value ... maybe the color should have been more orange than yellow but I didn’t want to add more colors.  \n\nFrom ARK defn: transparency – no identifier can guarantee stability, and ARK inflections help users make informed judgements\n\n
  • What is the citable unit with a DOI? A file? A collection of files? How many? Further, it is important to note that data products can be purged from an archive; such deleted information still needs to be able to be referenced. Even if the products themselves are not preserved, the raw data must be preserved along with detailed documentation describing how the product was created, and that documentation must be citable.\n\nSome have expressed concern that DOIs are too closely associated with publishers and may restrict open access.\n
  • \n
  • This i where the publication metaphor gets us in trouble.\nCan we track data packages flowing through an ecosystem instead and recreate\n
  • \n
  • still need date accessed with DOI\nCan we auto track uuids?\n\n
  • “Micro-citation” The hardest piece of all and the one most critical to sci. repeatability.\nOne might think of this as citing a passage in a publication-- a page #\n
  • page number requires a canonical format.\n
  • a “structural index (chapter-verse)” vs. “internal location id (page number)” \nstructural index useful for when many different versions but with fairly consistent structure.\n\nThe difficulty with the Unique Numerical Fingerprint approach is that it assumes that there is an immutable order to the data elements. This is not always the case therefore c-v needs to refer to “equivalence classes” not canonical versions. But you can’t deny the human readability of the c-v approach\n\nWe probably need both approaches. We need the chapter and verse that makes sense to people and is easily conceived and communicated between people, but then we still need the precise location and identity of that rather mutable verse represented in a way that computers can readily understand and be precise about, i.e.  the identifier.\nAnd then we can't forget the fact that we have billions if not trillions of "verses" or "granules" that we're dealing with. Our human approach needs to make sense at a high level of aggregation, while the computer approach needs to handle the volumes and precision.\n\nanother option is to capture query and date--this relies on really good provenance tracking.\n
  • note the outdated reference and page numbers-- a problem with costly ISO standards.\n
  • It would probably be wise to distinguish between cases where the file\ncollections are open, but adding relatively stable items, from ones where\nthe files themselves are mutable\n\n\n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
  • the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
  • the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
  • the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
  • the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
  • L3 files records interim files not L2.\nHall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 21 July 2011 at http://nsidc.org/data/myd10a1.html.\n
  • L3 files records interim files not L2.\nHall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 21 July 2011 at http://nsidc.org/data/myd10a1.html.\n
  • L3 files records interim files not L2.\nHall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 21 July 2011 at http://nsidc.org/data/myd10a1.html.\n
  • \n
  • \n
  • \n
  • “To preserve data is to preserve ability to process data!”\nTo reference data is to track data\n\n
  • Include engineers?\nData managers?\nSponsors?\nCorrect URL? This is where a DOI helps.\n
  • Include engineers?\nData managers?\nCorrect URL? This is where a DOI helps.\n
  • I’m have modified this hypothesis and I’m having second thoughts as to whether it stands. From a limited scholarly perspective of citing, well-controlled data sets, it’s probably true. It is “citable” in some of the credit and location senses, but that is insufficient for full scientific tracking and referencing of data streams.\n
  • \n
  • \n
  • W3C and ProvO are doing some good work on this first point.\n\nTake away. Figure out what problem you are trying to solve with referencing and then examine the problem from many perspectives not just the classic scholarly publication model.\n
  • \n
  • \n
  • \n
  • \nmany other optional fields possible\n\n
  • \n
  • NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

    1. 1. Data EquivalenceMark A. Parsonsand the ESIP Preservation and Stewardship CommitteeNISO Forum: Tracking it Back to the Source: Managing and Citing Research DataDenver, Colorado, USA24 September 2012
    2. 2. The National Snow and Ice Data Center…Manages and distributes Performs scientific andscientific data informatics research Supports data usersCreates tools for Educates the public data access about the cryosphere http://nsidc.org
    3. 3. Minimum Arctic Sea Ice Extent.Image courtesy NASA/Goddard Scientific Visualization Studio
    4. 4. 21 Sept. 2012 Minimum, 16 Sept. 3,412,196 km2*Derived from the NSIDC Sea Ice Index (Fetterer et al., 2009) *5-day running mean of daily values
    5. 5. Minimum, 16 Sept. 3,541,406 km2*From the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
    6. 6. Minimum, 16 Sept. 3,541,406 km2* 3.8% greater than NSIDC’s valueFrom the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
    7. 7. Example of differences in SSM/I derived ice concentration values calculated with three different passive microwave algorithms (Meier et al. 2001)Designating User Communities by Parsons and Duerr; CODATA, Berlin, 8 Nov. 2004
    8. 8. Metaphor is for most people a deviceof the poetic imagination and therhetorical flourish— a matter ofextraordinary rather than ordinarylanguage. Moreover, metaphor istypically viewed as characteristic oflanguage alone, a matter of wordsrather than thought or action. For thisreason, most people think they can getalong perfectly well without metaphor.We have found, on the contrary, thatmetaphor is pervasive in everyday life,not just in language but in thought andaction. Our ordinary conceptualsystem, in terms of which we boththink and act, is fundamentallymetaphorical in nature.
    9. 9. “Is Data Publication the Right Metaphor?” Parsons and Fox. (in press). Data Science Journal Preprint at http://mp-datamatters.blogspot.com
    10. 10. Purpose of Data Citation• Aid scientific reproducibility through direct, unambiguous connection to the precise data used• Credit for data authors and stewards• Accountability for creators and stewards• Track impact of data set• Help identify data use (e.g., trackbacks) • Data authors can verify how their data are being used. • Users can better understand the application of the data.• A locator/reference mechanism not a discovery mechanism per se 9
    11. 11. “Bridging Data Lifecycles: Tracking Data Usevia Data Citations” UCAR Workshop Report• Recommendation 1: Identify what you want to achieve via data citations• Recommendation 2: Understand the options for actionable identifier schemes• Recommendation 3: Engage stakeholders• Recommendation 4: Start with well-bounded cases• Recommendation 5: Plan for long-term implications• http://library.ucar.edu/data_workshop/ 10
    12. 12. How data citation is currently done• Citation of traditional publication that actually contains the data, e.g. a parameterization value.• Not mentioned, just used, e.g., in tables or figures• Reference to name or source of data in text• URL in text (with variable degrees of specificity)• Citation of related paper (e.g. CRU Temp. records recommend citing two old journal articles which do not contain the actual data or full description of methods)• Citation of actual data set typically using recommended citation given by data center• Citation of data set including a persistent identifier/locator, typically a DOI 11
    13. 13. 2009 1.7% 2008 1.3% 2007 0.9% 2006 1.3% 2005 0.7% 2004 0.7% 2003 1.0% Formal Citation Total Entries 2002 1.3% 0 100 200 300 400 500 600“MODIS Snow Cover Data” in Google Scholar
    14. 14. Data Citation Guidelines• Federation of Earth Science Information Partners. 2012. http://bit.ly/data_citation and related guidelines for the Group on Earth Observations (GEO) • Best available for Earth system science. Not yet widely adopted but growing.• Digital Curation Center. 2011. http://www.dcc.ac.uk/resources/how-guides/cite- datasets • Best overall guide. Not yet widely adopted but growing.• DataCite—a well-recognized consortium of libraries and related organizations working to define a citation approach around DOIs and working to get data citations included in citation indices.• DataVerse Network Project—a standard from the social science community using a Handle locator and “Universal Numerical Fingerprint” as a unique identifier.• New CODATA Task Group in collaboration with ICSTI. Report due soon.• NASA DAACs, NCAR, some NOAA centers adopting ESIP-based approaches.• More consistency is emerging but there is still great variation in recommended approach. They range from specific data citation, to general acknowledgement, to recommending citing a journal article, or even a presentation. 13
    15. 15. Basic data citation form and contentPer DataCite:Creator. PublicationYear. Title. [Version]. Publisher.[ResourceType]. Identifier.Per ESIP:Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/orDistributor. Locator. [date/time accessed]. [subset used]. 14
    16. 16. An Example CitationCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
    17. 17. VersionCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
    18. 18. LocatorCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://nsidc.org/data/nsidc-0175.html.
    19. 19. LocatorCline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2012-09-22 at http://dx.doi.org/10.5060/D4H41PBP.
    20. 20. Identifier vs. Locator• Human ID: Mark Alan Parsons (son of Robert A. and Ann M., etc.) • every term defined independently and only unique in context/ provenance (remember that). • Alternative like a social security number requires a very well controlled central authority.• Human Locator: 1540 30th St., Room 201, Boulder CO 80303. • every term has a naming authority• Data Set IDs: data set title, filename, database key, object id code (e.g. UUID), etc.• Data set Locators: URL, directory structure, catalog number, registered locator (e.g. DOI), etc. 19
    21. 21. An assessment of identification schemes for digital Earth science data Unique Unique Citable Scientifically Identifier Locator Locator Unique ID ID Scheme Data Set Item Data Set Item Data Set Item Data Set Item URL/N/I PURL XRI Handle DOI ARK LSID Good OID Fair Poor UUIDAdapted from Duerr, R. E., et al.. 2011. On the utility of identification schemes for digital Earth science data: Anassessment and recommendations. Earth Science Informatics. 4:139-160. 20http://dx.doi.org/10.1007/s12145-011-0083-6
    22. 22. An assessment of identification schemes for digital Earth science data Unique Unique Citable Scientifically Identifier Locator Locator Unique ID ID Scheme Data Set Item Data Set Item Data Set Item Data Set Item s URL/N/I or at c PURLLo XRI Handle DOI ARK ers LSID Good ntifi OID FairIde Poor UUID Adapted from Duerr, R. E., et al.. 2011. On the utility of identification schemes for digital Earth science data: An assessment and recommendations. Earth Science Informatics. 4:139-160. 20 http://dx.doi.org/10.1007/s12145-011-0083-6
    23. 23. Why the DOI?• Not perfect but well understood by publishers• DataCite working with Thomson Reuters to get data citations in their index. But... • What is the citable unit? • How do we handle different versions? • What about “retired” data? • When is a DOI assigned? 21
    24. 24. Issues largely resolved by...• A defined versioning scheme• Good tracking and documentation of the versions• Due diligence in archive and release practices 22
    25. 25. When to assign a DOI?• First principle: Data should be reference-able as soon as they are available for use by anyone other than the original authors.• But... • Most people (falsely) believe that a DOI assures permanence so how do we cite transient data? • Some believe that a DOI should not be assigned until the data has undergone some level of review (e.g. Lawrence et al. 2011). So how do we cite data used before the review? • Data are often used by friends and collaborators in a raw, “unpublished” state. Should this use be cited with a DOI? • Near real time or preliminary data may only be available for a short uncurated, period, and there may not be a good match between the submission package and the distribution package. What gets the DOI? When? 23
    26. 26. Versioning approach recommended by DCC• “As DOIs are used to cite data as evidence, the dataset to which a DOI points should also remain unchanged, with any new version receiving a new DOI.”• “There are two possible approaches the data repository can take: time slices and snapshots.” 24
    27. 27. Versioning and locators:some suggestions from NSIDC• major version.minor version.[archive version]• Individual stewards need to determine which are major vs. minor versions and describe the nature and file/record range of every version.• Assign DOIs to major versions.• Old DOIs should be maintained and point to some appropriate page that explains what happened to the old data if they were not archived.• A new major version leads to the creation of a new collection-level metadata record that is distributed to appropriate registries. The older metadata record should remain with a pointer to the new version and with explanation of the status of the older version data.• Major and minor version (after the first version) should be exposed with the data set title and recommended citation.• Minor versions should be explained in documentation, ideally in file-level metadata.• Applying UUIDs to individual files upon ingest aids in tracking minor versions and historical citations. 25
    28. 28. Basic data citation form and contentAuthor(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/orDistributor. Locator. [date/time accessed]. [subset used].The best solution may be to have unique identifiers or query IDsfor subsets, but that won’t be available for most data sets for along time, so we need alternative solutions... 26
    29. 29. February 8, 2011, 4:45 PM Page Numbers for Kindle Books an Imperfect Solution Neither solution is perfect—‘locations’ or page numbers—because the problem is unsolvable. The best we can hope for is a choice... Amazon’s Kindle will have page numbers that correspond to real books and locations by passage.http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/
    30. 30. Chapter and Verse• Bible• Koran• Bhagavad-Gita and Ramayana• other sacred texts• A “structural index”
    31. 31. The “Archive Information Unit”“An Archival Information Package whose Content Information is notfurther broken down into other Content Information components, eachof which has its own complete Preservation Description Information. Itcan be viewed as an ‘atomic’ AIP”“From an Access viewpoint, new subsetting and manipulationcapabilities are beginning to blur the distinction between AICs and AIUs.Content objects which used to be viewed as atomic can now be viewedas containing a large variation of contents based on the subsettingparameters chosen. In a more extreme example, the ContentInformation of an AIU may not exist as a physical entity. The ContentInformation could consist of several input files (or pointers to the AIPscontaining these data files) and an algorithm which uses these files tocreate the data object of interest.”• CCSDS. 2002. Reference Model for An Open Archival Information System (OAIS) CCSDS 650.0-B-1 Issue 1. Washington, DC: CCSDS Secretariat. p. 1-8, 4-38. 29
    32. 32. Citation scenarios and production patterns• What kind of “atomic” item is being cited—the “Archive Information Unit (AIU)” (e.g., a data file, a data element within a file, a relational (or other) database, a job “residue”)?• How many AIUs items are in a typical citation for the scenario being considered?• What other digital or physical objects need to be available to make the unit usable—the “Preservation Description Information (PDI)”? Key Question:• What structure or structures can we use to organize data collections that might be common across Earth sciences? 30
    33. 33. An example production pattern for Cline et al. (2003).
    34. 34. 32
    35. 35. 33
    36. 36. 34
    37. 37. 35
    38. 38. 36
    39. 39. 37
    40. 40. 38
    41. 41. 39
    42. 42. 40
    43. 43. 41
    44. 44. 42
    45. 45. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC ,COMMENTS , , , , , ,cm , , , , deg- F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
    46. 46. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC ,COMMENTS , , , , , ,cm , , , , deg- F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
    47. 47. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML F, , , , , , ,cm FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, , , 104, , d, , y, , n, , deg- Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital camera interim jpgs collated and named jpgs
    48. 48. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
    49. 49. A production pattern for Cline et al., 2003 field notebook Excel v1 printout Excel v2 distributed data set TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC tarball shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
    50. 50. A production pattern for Cline et al., 2003 field notebook Excel v1 Couldnt find post, v2 printout Excel distributed data set used GPS 5940 1197; FAA4.4 and TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY, TEMP,SURVEYOR FAA4.5 unsafe, ,QC + ,COMMENTS HTML , , , , , ,cm , , , , deg- 100s avalanche area! 100s F, , , FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, Doc. -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," " ... FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldnt find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!analog to ascii filesdigital w/ QC tarball shapefilesborndigital 1000s camera interim jpgs collated and named jpgs
    51. 51. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007)
    52. 52. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC MODAPs Processing
    53. 53. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC 1 file/day/tile (grid cell) Each file contains metadata describing previous inputs and detailed versioning MODAPs Processing
    54. 54. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005 (Hall et al., 2007) Archives GSFC EDC NSIDC 1,000,000s 1 file/day/tile (grid cell) Each file contains metadata describing previous inputs and detailed versioning MODAPs Processing
    55. 55. Doing it as best we can...?• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007- Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007- Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1). Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.• Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, Version 2.0, shapefiles. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://dx.doi.org/10.5060/D4H41PBP. 45
    56. 56. Sea Ice Index Fetterer et al. 2009 “Near Real Time” “Preliminary”Maslanik and Stroeve Meir et al. 2006 1999 “Final” Cavalieri et al 1996
    57. 57. Remote Sensing SystemsSea Ice Production – Data Workflow TBs (Wentz) Color Key Source (data) Value added product Near-Real-Time product NSIDC-0002 Preliminary Product SSM/I Daily and NSIDC-0001 SSM/I Polar Stereo Tbs Final Product Monthly Polar Gridded Bootstrap Not part of discussionDeaccession to begin 1/2012 NSIDC-0051 Goddard Space NSIDC-0079 Fowler Anderson Flight Center Preliminary Bootstrap Sea Preliminary Sea Ice University of CO University of NE (GSFC) Ice Concentrations from Concentrations From SMMR and SSM/I Nimbus-7 SMMR Passive Microwave Data and DMSP SSM/I NISDC-0116 NSIDC-0105 Snow Melt Onset Data not yet Polar Pathfinder distributed Daily 25 km Over Arctic Sea Ice EASE-Grid Sea Ice from SMMR and NSIDC-0051 NSIDC-0079 Motion Vectors SSM/I Tbs Sea Ice Concentrations Bootstrap Sea Ice from SMMR and SSM/I Concentrations from Passive Microwave Data SMMR and SSM/I NASA Team G00791 production line IABP Drifting NSIDC-0066 AVHRR Polar Buoy Pressure, Pathfinder Twice-Daily G02202 Temperature, NOAA/NSIDC CDR 5 km EASE- Position, and Interpolated Grid Composites Ice Velocity NSIDC-0081 NSIDC-0192 NSIDC-0046 Near-Real-Time DMSP Sea Ice Trends and Robinson NH EASE-Grid SSM/I Daily Polar Gridded Climatologies from SMMR Psuedo-weekly Weekly Sea Ice Concentrations and SSM/I Snow Cover Snow Cover and Sea Rutgers Ice Extent V3 NSIDC-0080 Near-Real-Time DMSP NISE on NEO World Wide Web G02135 SSM/I Daily Polar Gridded NISE Arctic Sea Ice Sea Ice Index Tbs News and Analysis CLASS F17 TBs Last updated: D Scott 12/2011
    58. 58. MODIS chlorophylllevel 0 SeaWiFS GSM phytoplankton …level 1a algorithms ... particulates … softwarelevel 1b calibration ...level 2 reprocess, … revalidate 48 slide courtesy Greg Janee, UCSB
    59. 59. Hypothesis: ~80% of citation scenarios for 80% ofEarth system science data can be addressed withbasic citations (Author(s). ReleaseYear. Title, Version.[editor(s)]. Archive. Locator. [date/time accessed]. [subset used].),a solid methods section, and reasonable duediligence by archives.
    60. 60. Future directions on scientific equivalence
    61. 61. Content equivalence and provenance equivalenceserve as a proxy for scientific equivalence.Content Equivalence: Is there an algorithm that can consider thecontent of a file and come up with a unique identifier that will be thesame for objects in the same “content equivalence class”? • Exact content equivalence can use digital signature or cryptographic techniques MD5, SHA-1, etc.   • Universal Numeric Footprint (UNF) takes a digital signature of a “canonical” representation of the information, but canonical is a problem. • How can we define loose or scientific content equivalence? Are the shapefiles and text files in Cline et al. (2003) equivalent? 51
    62. 62. Content equivalence and provenance equivalenceserve as a proxy for scientific equivalence.Provenance Equivalence:  A process is reproducible when there issufficient creation provenance details for someone else to make anequivalent file. If someone follows those provenance details to re-createthe object, the resulting object will be equivalent to the original.    • If we can enumerate/list sufficient or “essential” creation provenance details to  make an equivalent file, then we can describe an algorithm to  produce an identifier that will be the same for files that match those provenance details. • Algorithm can be similar to the content equivalence algorithm before: take a digital signature of a canonical representation of the information in this case a canonical serialization of processes. 52
    63. 63. Thank Youparsonsm@nsidc.org photo courtesy NOAA
    64. 64. Backup slides
    65. 65. More examples• Glacier Photos: chemical creation of image, digitization, multi-source, no versioning• Hurricane Ike: direct digital creation, embedded software, single-source, no versioning• Rock or Ice Cores: physical specimens with later experimental operations to create data collections – digital results may be stored in databases with complex provenance• SeaDataNet: Many individual Conductivity, Temperature, Depth measurements from many individual cruises compiled into one database.• Global Historical Climate Network: In situ measurements of Temp and Humidity, recorded as time series and appended to files (with some quality control) – multi- source, some versioning• Radiosonde Network: Balloon-borne Temp and Humidity from about 20,000 sites every 12 or 6 hours assimilated into weather forecasts – archives maintained at NCAR, GSFC, and elsewhere – reanalysis can create new versions• Large-scale Satellite Data Production: complex, large-scale data flows from multiple sources, systematic approaches to versioning where the structure of one version of a data product resembles the next 55
    66. 66. Basic data citation form and contentMandatory: Optional: Author(s) Editor(s) Release Date Archive or Distributor Place Title Other Institutional Role Version Indication of subset used Archive and/or Distributor Locator or Persistent Location service Time, date accessed 56
    67. 67. Serving, Citing and Publishing DataCitation forms an important part of the scientific record. Doi:10232/123ro This involves the peer-review of dataWe draw a clear distinction between: 2. sets, and gives “stamp of approval” Publication of data sets associated with traditional journalpublishing = making available for publications. Can’t be done without consumption (e.g. on the web), effective linking/citing of the data sets. and Doi:10232/123Publishing = publishing after some 1. This is our first step for this project – formal process which adds value Data set Citation formulate and formalise a way of for the consumer: citing data sets. Will provide benefits• e.g. PloS ONE type review, or to our users – and a carrot to get them to provide data to us!• EGU journal type public review, or• More traditional peer review. 0.AND Serving of data sets This is what data centres do as our• provides commitment to day job – take in data supplied by scientists and make it available to persistence other interested parties. slide courtesy S. Callaghan, BADC VO Sandpit, November 2009

    ×