Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The guidelines are largely congruent: they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the subtler aspects of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility, the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data-intensive science, presents current solutions, and lays out a roadmap to a future in which new information technologies significantly increase the pace of scientific discovery and innovation.
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough and, for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed and used by diverse communities, often with widely varying methods, tools, and purposes. To accomplish this, our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation, and in particular problems involving provenance, identifiers, and data citation, cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. This talk describes some recent work toward this end that is being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today’s research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
DataCite and Campus Data Services
Paul Bracke, Associate Dean for Digital Programs and Information Services, Purdue University
Research libraries are increasingly interested in developing data services for their campuses. There are many perspectives, however, on how to develop services that are responsive to the many needs of scientists; sensitive to the concerns of scientists who are not always accustomed to sharing their data; and that are attractive to campus administrators. This presentation will discuss the development of campus-based data services programs, the centrality of data citation to these efforts, and the ways in which engagement with DataCite can enhance local programs.
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
1. Data Equivalence
Mark A. Parsons
and the ESIP Preservation and Stewardship Committee
NISO Forum:
Tracking it Back to the Source: Managing and Citing Research Data
Denver, Colorado, USA
24 September 2012
2. The National Snow and Ice Data Center…
• Manages and distributes scientific data
• Performs scientific and informatics research
• Supports data users
• Creates tools for data access
• Educates the public about the cryosphere
http://nsidc.org
4. 21 Sept. 2012
Minimum, 16 Sept. 3,412,196 km2*
Derived from the NSIDC Sea Ice Index (Fetterer et al., 2009)
*5-day running mean of daily values
5. Minimum, 16 Sept. 3,541,406 km2*
From the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)
http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
6. Minimum, 16 Sept. 3,541,406 km2*
3.8% greater than NSIDC’s value
From the Arctic Sea Ice Monitor at the IARC-JAXA Information System (IJIS)
http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm *5-day running mean of daily values
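The 3.8% figure on this slide follows directly from the two minimum extents quoted on slides 4 to 6. A minimal sketch of that arithmetic is given below; the variable and function names are illustrative choices, not taken from the talk.

```python
# Minimum Arctic sea ice extents for 16 Sept. 2012 as quoted on the slides
# (both are 5-day running means of daily values, in square kilometres).
nsidc_min_km2 = 3_412_196  # NSIDC Sea Ice Index
ijis_min_km2 = 3_541_406   # IJIS Arctic Sea Ice Monitor


def percent_greater(value: float, reference: float) -> float:
    """Return how much larger `value` is than `reference`, in percent."""
    return (value - reference) / reference * 100.0


difference = percent_greater(ijis_min_km2, nsidc_min_km2)
print(f"IJIS minimum is {difference:.1f}% greater than the NSIDC value")
```

Two products that nominally measure the same quantity on the same day differ by a few percent, which is why a citation has to say precisely which data were used.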
7. Example of differences in SSM/I derived ice concentration
values calculated with three different passive microwave
algorithms (Meier et al. 2001)
Designating User Communities by Parsons and Duerr; CODATA, Berlin, 8 Nov. 2004
8. Metaphor is for most people a device
of the poetic imagination and the
rhetorical flourish— a matter of
extraordinary rather than ordinary
language. Moreover, metaphor is
typically viewed as characteristic of
language alone, a matter of words
rather than thought or action. For this
reason, most people think they can get
along perfectly well without metaphor.
We have found, on the contrary, that
metaphor is pervasive in everyday life,
not just in language but in thought and
action. Our ordinary conceptual
system, in terms of which we both
think and act, is fundamentally
metaphorical in nature.
9. “Is Data Publication the Right Metaphor?”
Parsons and Fox. (in press). Data Science Journal
Preprint at http://mp-datamatters.blogspot.com
10. Purpose of Data Citation
• Aid scientific reproducibility through direct, unambiguous
connection to the precise data used
• Credit for data authors and stewards
• Accountability for creators and stewards
• Track impact of data set
• Help identify data use (e.g., trackbacks)
• Data authors can verify how their data are being used.
• Users can better understand the application of the data.
• A locator/reference mechanism not a discovery mechanism per se
11. “Bridging Data Lifecycles: Tracking Data Use
via Data Citations” UCAR Workshop Report
• Recommendation 1: Identify what you want to achieve via data citations
• Recommendation 2: Understand the options for actionable identifier schemes
• Recommendation 3: Engage stakeholders
• Recommendation 4: Start with well-bounded cases
• Recommendation 5: Plan for long-term implications
• http://library.ucar.edu/data_workshop/
12. How data citation is currently done
• Citation of traditional publication that actually contains the data,
e.g. a parameterization value.
• Not mentioned, just used, e.g., in tables or figures
• Reference to name or source of data in text
• URL in text (with variable degrees of specificity)
• Citation of related paper (e.g. CRU Temp. records recommend
citing two old journal articles which do not contain the actual
data or full description of methods)
• Citation of actual data set typically using recommended citation
given by data center
• Citation of data set including a persistent identifier/locator,
typically a DOI
14. Data Citation Guidelines
• Federation of Earth Science Information Partners. 2012. http://bit.ly/data_citation
and related guidelines for the Group on Earth Observations (GEO)
• Best available for Earth system science. Not yet widely adopted but growing.
• Digital Curation Centre. 2011. http://www.dcc.ac.uk/resources/how-guides/cite-datasets
• Best overall guide. Not yet widely adopted but growing.
• DataCite—a well-recognized consortium of libraries and related organizations
working to define a citation approach around DOIs and working to get data
citations included in citation indices.
• DataVerse Network Project—a standard from the social science community using
a Handle locator and “Universal Numerical Fingerprint” as a unique identifier.
• New CODATA Task Group in collaboration with ICSTI. Report due soon.
• NASA DAACs, NCAR, some NOAA centers adopting ESIP-based approaches.
• More consistency is emerging but there is still great variation in recommended
approaches. They range from specific data citation, to general acknowledgement,
to recommending citing a journal article, or even a presentation.
15. Basic data citation form and content
Per DataCite:
Creator. PublicationYear. Title. [Version]. Publisher.
[ResourceType]. Identifier.
Per ESIP:
Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].
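A minimal sketch of how the ESIP template above could be assembled in code. The class name `DataCitation`, its field names, and the `render` method are hypothetical, chosen only for illustration; the wording "Edited by" and "Data set accessed ... at ..." follows the example citation on the next slides rather than any formal specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical container for the ESIP citation elements.

    Bracketed (optional) elements of the template default to None.
    """
    authors: str
    release_date: str
    title: str
    archive: str
    locator: str
    version: Optional[str] = None
    editors: Optional[str] = None
    accessed: Optional[str] = None
    subset: Optional[str] = None

    def render(self) -> str:
        # Title and version share one element: "Title, version."
        title = self.title if self.version is None else f"{self.title}, {self.version}"
        parts = [self.authors, self.release_date, title]
        if self.editors:
            parts.append(f"Edited by {self.editors}")
        parts.append(self.archive)
        if self.accessed:
            parts.append(f"Data set accessed {self.accessed} at {self.locator}")
        else:
            parts.append(self.locator)
        if self.subset:
            parts.append(f"Subset used: {self.subset}")
        # Each element ends with exactly one period.
        return " ".join(p.rstrip(".") + "." for p in parts)


citation = DataCitation(
    authors="Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston",
    release_date="2002, Updated 2003",
    title="CLPX-Ground: ISA snow depth transects and related measurements",
    version="ver. 2.0",
    editors="M. Parsons and M. J. Brodzik",
    archive="Boulder, CO: National Snow and Ice Data Center",
    accessed="2008-05-14",
    locator="http://nsidc.org/data/nsidc-0175.html",
)
print(citation.render())
```

Keeping the elements as separate fields, rather than one free-text string, is what would let an archive emit the same citation in either the DataCite or the ESIP ordering.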
16. An Example Citation
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
17. Version
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
18. Locator
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
19. Locator
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2012-09-22 at
http://dx.doi.org/10.5060/D4H41PBP.
20. Identifier vs. Locator
• Human ID: Mark Alan Parsons (son of Robert A. and Ann M., etc.)
• every term defined independently and only unique in context/
provenance (remember that).
• Alternative like a social security number requires a very well
controlled central authority.
• Human Locator: 1540 30th St., Room 201, Boulder CO 80303.
• every term has a naming authority
• Data Set IDs: data set title, filename, database key, object id code (e.g.
UUID), etc.
• Data set Locators: URL, directory structure, catalog number, registered
locator (e.g. DOI), etc.
21. An assessment of identification schemes for
digital Earth science data
[Table: each ID scheme (URL/N/I, PURL, XRI, Handle, DOI, ARK, LSID, OID, UUID) is rated as a Unique Identifier, Unique Locator, Citable Locator, and Scientifically Unique ID, separately for Data Set and Item. Ratings are color coded Good, Fair, or Poor; the individual ratings did not survive in this transcript.]
Adapted from Duerr, R. E., et al. 2011. On the utility of identification schemes for digital Earth science data: An
assessment and recommendations. Earth Science Informatics. 4:139-160.
http://dx.doi.org/10.1007/s12145-011-0083-6
22. An assessment of identification schemes for
digital Earth science data
[Same table as the previous slide, with the schemes grouped into Locators (URL/N/I, PURL, XRI, Handle, DOI, ARK) and Identifiers (LSID, OID, UUID). Ratings are color coded Good, Fair, or Poor; the individual ratings did not survive in this transcript.]
Adapted from Duerr, R. E., et al. 2011. On the utility of identification schemes for digital Earth science data: An
assessment and recommendations. Earth Science Informatics. 4:139-160.
http://dx.doi.org/10.1007/s12145-011-0083-6
23. Why the DOI?
• Not perfect but well understood by publishers
• DataCite working with Thomson Reuters to get data citations in their
index.
But...
• What is the citable unit?
• How do we handle different versions?
• What about “retired” data?
• When is a DOI assigned?
24. Issues largely resolved by...
• A defined versioning scheme
• Good tracking and documentation of the versions
• Due diligence in archive and release practices
25. When to assign a DOI?
• First principle: Data should be reference-able as soon as they are
available for use by anyone other than the original authors.
• But...
• Most people (falsely) believe that a DOI assures permanence so
how do we cite transient data?
• Some believe that a DOI should not be assigned until the data has
undergone some level of review (e.g. Lawrence et al. 2011). So
how do we cite data used before the review?
• Data are often used by friends and collaborators in a raw,
“unpublished” state. Should this use be cited with a DOI?
• Near real time or preliminary data may only be available for a short,
uncurated period, and there may not be a good match between
the submission package and the distribution package. What gets
the DOI? When?
26. Versioning approach recommended by DCC
• “As DOIs are used to cite data as evidence, the dataset to which a
DOI points should also remain unchanged, with any new version
receiving a new DOI.”
• “There are two possible approaches the data repository can take: time
slices and snapshots.”
27. Versioning and locators:
some suggestions from NSIDC
• major version.minor version.[archive version]
• Individual stewards need to determine which are major vs. minor versions and describe
the nature and file/record range of every version.
• Assign DOIs to major versions.
• Old DOIs should be maintained and point to some appropriate page that explains what
happened to the old data if they were not archived.
• A new major version leads to the creation of a new collection-level metadata record that
is distributed to appropriate registries. The older metadata record should remain with a
pointer to the new version and with explanation of the status of the older version data.
• Major and minor versions (after the first version) should be exposed with the data set title
and recommended citation.
• Minor versions should be explained in documentation, ideally in file-level metadata.
• Applying UUIDs to individual files upon ingest aids in tracking minor versions and
historical citations.
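The versioning suggestions above can be sketched in a few lines. This is an illustration under stated assumptions, not NSIDC's implementation: the names `DataVersion`, `parse_version`, and `needs_new_doi` are hypothetical, and the only rule encoded is the slide's suggestion that DOIs are assigned to major versions.

```python
from typing import NamedTuple, Optional


class DataVersion(NamedTuple):
    """major.minor.[archive] as suggested on the slide."""
    major: int
    minor: int
    archive: Optional[int] = None  # optional archive version


def parse_version(text: str) -> DataVersion:
    """Parse 'major.minor' or 'major.minor.archive' into a DataVersion."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.archive], got {text!r}")
    return DataVersion(*parts)


def needs_new_doi(old: str, new: str) -> bool:
    """Assumed rule: only a change of major version gets a new DOI."""
    return parse_version(new).major != parse_version(old).major


print(needs_new_doi("2.0", "2.1"))    # minor change
print(needs_new_doi("2.1.4", "3.0"))  # major change
```

Under this rule a minor change keeps the existing DOI, so the version exposed in the title and recommended citation, together with file-level documentation, has to carry the finer distinction.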
28. Basic data citation form and content
Author(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].
The best solution may be to have unique identifiers or query IDs
for subsets, but that won’t be available for most data sets for a
long time, so we need alternative solutions...
29. February 8, 2011, 4:45 PM
Page Numbers for Kindle Books an Imperfect Solution
Neither solution is perfect—‘locations’ or
page numbers—because the problem is
unsolvable. The best we can hope for is a
choice...
Amazon’s Kindle will have
page numbers that
correspond to real books
and locations by passage.
http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/
30. Chapter and Verse
• Bible
• Koran
• Bhagavad-Gita and
Ramayana
• other sacred texts
• A “structural index”
31. The “Archive Information Unit”
“An Archival Information Package whose Content Information is not
further broken down into other Content Information components, each
of which has its own complete Preservation Description Information. It
can be viewed as an ‘atomic’ AIP”
“From an Access viewpoint, new subsetting and manipulation
capabilities are beginning to blur the distinction between AICs and AIUs.
Content objects which used to be viewed as atomic can now be viewed
as containing a large variation of contents based on the subsetting
parameters chosen. In a more extreme example, the Content
Information of an AIU may not exist as a physical entity. The Content
Information could consist of several input files (or pointers to the AIPs
containing these data files) and an algorithm which uses these files to
create the data object of interest.”
• CCSDS. 2002. Reference Model for An Open Archival Information System (OAIS)
CCSDS 650.0-B-1 Issue 1. Washington, DC: CCSDS Secretariat. p. 1-8, 4-38.
32. Citation scenarios and production patterns
• What kind of “atomic” item is being cited—the “Archive Information
Unit (AIU)” (e.g., a data file, a data element within a file, a relational (or
other) database, a job “residue”)?
• How many AIU items are in a typical citation for the scenario being
considered?
• What other digital or physical objects need to be available to make the
unit usable—the “Preservation Description Information (PDI)”?
Key Question:
• What structure or structures can we use to organize data collections
that might be common across Earth sciences?
45. A production pattern for Cline et al., 2003
[Diagram. Analog-to-digital path: field notebook -> Excel v1 -> printout -> Excel v2 -> ascii files w/ QC -> shapefiles. Born-digital path: camera -> interim jpgs -> collated and named jpgs.]
Sample of the ascii data:
TRANSECT,IOP ,DATE ,TIME,UTME ,UTMN ,DEPTH ,SWET,SRUF,CNPY,TEMP,SURVEYOR ,QC ,COMMENTS
, , , , , ,cm , , , , deg-F, , ,
FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," "
FAA01.2 ,iop4,2003-03-25,1017,425956,4410860, 13, d, n, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) "," "
...
FAA04.1 ,iop4,2003-03-25,1221,425938,4411193, 325, d, y, n, -999,"Fitzgerald, Matous, Dundas ","QC(000) ","Couldn't find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!
46. A production pattern for Cline et al., 2003
[Same diagram and data sample as slide 45, now with a "distributed data set" label.]
47. A production pattern for Cline et al., 2003
[Same diagram and data sample as slide 45, with the "distributed data set" label and an added "+ HTML Doc." label.]
48. A production pattern for Cline et al., 2003
[Same diagram and data sample as slide 45, with the "distributed data set" and "+ HTML Doc." labels and added counts: 100s, 100s, and 1000s.]
49. A production pattern for Cline et al., 2003
[Same diagram and data sample as slide 45, with the "distributed data set" and "+ HTML Doc." labels, the counts 100s, 100s, and 1000s, and an added "tarball" label.]
50. A production pattern for Cline et al., 2003
[Same diagram and data sample as slide 45, with the "distributed data set", "+ HTML Doc.", and "tarball" labels and the counts 100s, 100s, and 1000s, plus a callout of one comment field: "Couldn't find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!"]
51. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)
52. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)
Archives
GSFC EDC NSIDC
MODAPs Processing
53. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)
Archives
GSFC EDC NSIDC
1 file/day/tile (grid cell)
Each file contains
metadata describing
previous inputs and
detailed versioning
MODAPs Processing
54. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)
Archives
GSFC EDC NSIDC
1,000,000s
1 file/day/tile (grid cell)
Each file contains
metadata describing
previous inputs and
detailed versioning
MODAPs Processing
55. Doing it as best we can...?
• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007,
updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid
V005.3, Oct. 2007- Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder,
Colorado USA: National Snow and Ice Data Center. Data set accessed
2008-11-01 at http://dx.doi.org/10.1234/xxx.
• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007,
updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid
V005.3, Oct. 2007- Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1).
Boulder, Colorado USA: National Snow and Ice Data Center. Data set
accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.
• Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002,
Updated 2003. CLPX-Ground: ISA snow depth transects and related
measurements, Version 2.0, shapefiles. Edited by M. Parsons and M.
J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set
accessed 2008-05-14 at http://dx.doi.org/10.5060/D4H41PBP.
56. Sea Ice Index
Fetterer et al. 2009
“Near Real Time”: Maslanik and Stroeve 1999
“Preliminary”: Meier et al. 2006
“Final”: Cavalieri et al. 1996
57. Sea Ice Production – Data Workflow
[Workflow diagram; only the box labels survive in this transcript. Last updated: D Scott 12/2011]
Color key: Source (data); Value added product; Near-Real-Time product; Preliminary Product; Final Product; Not part of discussion
Sources:
• Remote Sensing Systems TBs (Wentz)
• Fowler, University of CO
• Anderson, University of NE
• Goddard Space Flight Center (GSFC), NASA Team production line
• Robinson, Rutgers: Pseudo-weekly Snow Cover
• CLASS F17 TBs
Data sets:
• NSIDC-0001 SSM/I Polar Stereo Tbs
• NSIDC-0002 SSM/I Daily and Monthly Polar Gridded Bootstrap
• NSIDC-0051 Preliminary Sea Ice Concentrations from SMMR and SSM/I Passive Microwave Data
• NSIDC-0051 Sea Ice Concentrations from SMMR and SSM/I Passive Microwave Data
• NSIDC-0079 Preliminary Bootstrap Sea Ice Concentrations from Nimbus-7 SMMR and DMSP SSM/I
• NSIDC-0079 Bootstrap Sea Ice Concentrations from SMMR and SSM/I
• NSIDC-0116 Polar Pathfinder Daily 25 km EASE-Grid Sea Ice Motion Vectors
• NSIDC-0105 Snow Melt Onset Over Arctic Sea Ice from SMMR and SSM/I Tbs
• G00791 IABP Drifting Buoy Pressure, Temperature, Position, and Interpolated Ice Velocity
• NSIDC-0066 AVHRR Polar Pathfinder Twice-Daily 5 km EASE-Grid Composites
• G02202 NOAA/NSIDC CDR
• NSIDC-0081 Near-Real-Time DMSP SSM/I Daily Polar Gridded Sea Ice Concentrations
• NSIDC-0080 Near-Real-Time DMSP SSM/I Daily Polar Gridded Tbs
• NSIDC-0192 Sea Ice Trends and Climatologies from SMMR and SSM/I
• NSIDC-0046 NH EASE-Grid Weekly Snow Cover and Sea Ice Extent V3
• G02135 Sea Ice Index
Other labels: Deaccession to begin 1/2012; Data not yet distributed; NISE; NISE on NEO; World Wide Web: Arctic Sea Ice News and Analysis
59. Hypothesis: ~80% of citation scenarios for 80% of
Earth system science data can be addressed with
basic citations (Author(s). ReleaseYear. Title, Version.
[editor(s)]. Archive. Locator. [date/time accessed]. [subset used].),
a solid methods section, and reasonable due
diligence by archives.
61. Content equivalence and provenance equivalence
serve as a proxy for scientific equivalence.
Content Equivalence: Is there an algorithm that can consider the
content of a file and come up with a unique identifier that will be the
same for objects in the same “content equivalence class”?
• Exact content equivalence can use digital signature or
cryptographic techniques (MD5, SHA-1, etc.).
• Universal Numerical Fingerprint (UNF) takes a digital signature of a
“canonical” representation of the information, but canonical is a
problem.
• How can we define loose or scientific content equivalence? Are
the shapefiles and text files in Cline et al. (2003) equivalent?
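To make the contrast between exact and loose content equivalence concrete, here is a minimal sketch. It is not the actual UNF algorithm: `exact_id` and `canonical_id` are hypothetical names, and the canonical form used here (trimmed fields, numbers rounded to a fixed number of significant digits) is an assumption made only for illustration.

```python
import csv
import hashlib
import io


def exact_id(data: bytes) -> str:
    """Exact content equivalence: any byte difference changes the ID."""
    return hashlib.sha1(data).hexdigest()


def canonical_id(csv_text: str, digits: int = 7) -> str:
    """Loose equivalence: hash a canonical form instead of the raw bytes.

    Assumed canonical form: whitespace-trimmed fields, numeric fields
    rounded to `digits` significant digits, rows joined by newlines.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = []
        for cell in row:
            cell = cell.strip()
            try:
                cell = f"{float(cell):.{digits - 1}e}"
            except ValueError:
                pass  # not a number; keep the trimmed text
            cells.append(cell)
        rows.append("\x1f".join(cells))  # unit separator avoids comma clashes
    return hashlib.sha1("\n".join(rows).encode("utf-8")).hexdigest()


# Two renderings of one record from the Cline et al. sample: padding,
# number formatting, and line endings differ, the measurements do not.
a = "FAA01.1 ,iop4,2003-03-25,1017,425941,4410860, 104\n"
b = "FAA01.1,iop4,2003-03-25,1017,425941,4410860,104.0\r\n"
print(exact_id(a.encode()) == exact_id(b.encode()))
print(canonical_id(a) == canonical_id(b))
```

The sketch also shows why "canonical is a problem": it treats the time field 1017 as a number, and it offers no way to decide whether a shapefile and a text file hold the same observations.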
62. Content equivalence and provenance equivalence
serve as a proxy for scientific equivalence.
Provenance Equivalence: A process is reproducible when there is
sufficient creation provenance details for someone else to make an
equivalent file. If someone follows those provenance details to re-create
the object, the resulting object will be equivalent to the original.
• If we can enumerate/list sufficient or “essential” creation
provenance details to make an equivalent file, then we can
describe an algorithm to produce an identifier that will be the
same for files that match those provenance details.
• The algorithm can be similar to the content equivalence algorithm
before: take a digital signature of a canonical representation of the
information, in this case a canonical serialization of processes.
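A provenance identifier along these lines might be sketched as follows. The step structure (a list of dictionaries with `process`, `inputs`, and `params` keys) and the function name `provenance_id` are hypothetical; the canonical serialization assumed here is simply JSON with sorted keys and fixed separators.

```python
import hashlib
import json
from typing import Any, Dict, List

Step = Dict[str, Any]


def provenance_id(steps: List[Step]) -> str:
    """Hash a canonical serialization of the 'essential' creation steps.

    Key order inside a step is normalized; the order of the steps is
    kept, because the sequence of processes is part of the provenance.
    """
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Hypothetical description of part of the Cline et al. production pattern.
original = [
    {"process": "transcribe", "inputs": ["field notebook"], "params": {"tool": "Excel"}},
    {"process": "quality control", "inputs": ["Excel v2"], "params": {"flags": "QC(000)"}},
]
# The same steps written with a different key order.
reordered = [
    {"params": {"tool": "Excel"}, "inputs": ["field notebook"], "process": "transcribe"},
    {"inputs": ["Excel v2"], "process": "quality control", "params": {"flags": "QC(000)"}},
]
print(provenance_id(original) == provenance_id(reordered))
```

The hard part remains deciding which provenance details are "essential"; the hashing is straightforward once that list is fixed.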
65. More examples
• Glacier Photos: chemical creation of image, digitization, multi-source, no
versioning
• Hurricane Ike: direct digital creation, embedded software, single-source, no
versioning
• Rock or Ice Cores: physical specimens with later experimental operations to
create data collections – digital results may be stored in databases with complex
provenance
• SeaDataNet: Many individual Conductivity, Temperature, Depth measurements
from many individual cruises compiled into one database.
• Global Historical Climate Network: In situ measurements of Temp and Humidity,
recorded as time series and appended to files (with some quality control) – multi-
source, some versioning
• Radiosonde Network: Balloon-borne Temp and Humidity from about 20,000 sites
every 12 or 6 hours assimilated into weather forecasts – archives maintained at
NCAR, GSFC, and elsewhere – reanalysis can create new versions
• Large-scale Satellite Data Production: complex, large-scale data flows from
multiple sources, systematic approaches to versioning where the structure of one
version of a data product resembles the next
66. Basic data citation form and content
Mandatory:
• Author(s)
• Release Date
• Title
• Version
• Archive and/or Distributor
• Locator or Persistent Location service
• Time, date accessed
Optional:
• Editor(s)
• Archive or Distributor Place
• Other Institutional Role
• Indication of subset used
67. Serving, Citing and Publishing Data
Citation forms an important part of the scientific record.
We draw a clear distinction between:
publishing = making available for consumption (e.g. on the web),
and
Publishing = publishing after some formal process which adds value for the consumer:
• e.g. PLoS ONE type review, or
• EGU journal type public review, or
• More traditional peer review.
AND
• provides commitment to persistence
2. Publication of data sets
This involves the peer-review of data sets, and gives “stamp of approval” associated with traditional journal publications. Can’t be done without effective linking/citing of the data sets.
1. Data set Citation
This is our first step for this project – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!
0. Serving of data sets
This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties.
Identifier labels shown on the figure: Doi:10232/123ro, Doi:10232/123
slide courtesy S. Callaghan, BADC
VO Sandpit, November 2009
Editor's Notes
Props to ESIP
- Consortium of ~120 Earth Science-Related Partners
- Started in 1998
- Support from NASA, NOAA, and EPA, with growing engagement from USGS, NSF, DOE, and USDA
- Neutral forum for community networking, collaboration & problem solving
- Limited international participation (strength and weakness)
- P&S committee a few years old and very active. Please join. Google ESIP preservation

A grand title for what might be better called “How do I specifically reference the precise data I used? And not the data that are very similar but different”

Data Equivalence by Mark A. Parsons and the ESIP Preservation and Stewardship Cluster is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/

We manage data, educate the public, and do research on the cryosphere--the frozen parts of the world, from kryos, the Greek for butt cold. We are getting more involved with other data including life and social science data and even Indigenous knowledge. I’d argue we are becoming an important center of interdisciplinary informatics, but we are best known for keeping an eye on that stuff in the background image: sea ice.

Images:
Top left: Near Real-Time SSM/I EASE-Grid Daily Global Ice Concentration and Snow Extent, 3 February 2008
Top right: Pancake ice seen off the bow of the research vessel Aurora Australis, in the Southern Ocean near Antarctica. Photo courtesy Ted Scambos.
Bottom right: Data from the Sea Ice Index, shown on Google Earth and distributed from our Web site in the form of a Quick Time movie. The left image is an animated time series of sea ice extent from 1979 to 2006; the static image on the right compares the extent in 2007.
Bottom left: Global Monthly SWE Climatology Browse, from http://nsidc.org/data/nsidc-0271.html
Center: A section of the online data set documentation from Sea Ice Charts of the Russian Arctic in Gridded Format, 1933-2006 (http://nsidc.org/data/g02176.html).

In case you missed it, Arctic sea ice is rapidly declining. This year's summer minimum sea ice extent, about a week ago, shattered previous records in an ongoing collapse.
This image shows this year's minimum extent relative to the 30-yr average minimum extent (yellow line). Note this average is in a period of rapid decline.

“Satellite data reveal how the new record low Arctic sea ice extent, from Sept. 16, 2012, compares to the average minimum extent over the past 30 years (in yellow). Sea ice extent maps are derived from data captured by the Scanning Multichannel Microwave Radiometer aboard NASA's Nimbus-7 satellite and the Special Sensor Microwave Imager on multiple satellites from the Defense Meteorological Satellite Program. Credit: NASA/Goddard Scientific Visualization Studio” http://www.nasa.gov/topics/earth/features/2012-seaicemin.html

So let’s look at this a little more closely.

Fetterer, F., K. Knowles, W. Meier, and M. Savoie. 2002, updated 2009. Sea Ice Index. Boulder, Colorado USA: National Snow and Ice Data Center. http://nsidc.org/data/g02135.html

This highlights the issue. In science you have to be very exact in what you are pointing to.

Meier W.N., VanWoert M.L., & Bertoia C. (2001) Evaluation of operational SSM/I algorithms. Annals of Glaciology. 33:102-08.

“researchers from Emory University reported in Brain & Language that when subjects in their laboratory read a metaphor involving texture, the sensory cortex, responsible for perceiving texture through touch, became active. Metaphors like “The singer had a velvet voice” and “He had leathery hands” roused the sensory cortex, while phrases matched for meaning, like “The singer had a pleasing voice” and “He had strong hands,” did not.” Annie Murphy Paul, NY Times 17 Mar 2012

Metaphors help us create the complex narratives we use to understand our physical and conceptual experience. These complex narratives are made up of smaller, very simple narratives called “frames” or “scripts”. Framing and frame analysis are often used in knowledge representation, social theory, media studies, and psychology with much of the work stemming from Erving Goffman (1974).

These frames present a set of roles and relationships between them like characters in a play. They also help us define our terms and make sense of language, because words are defined relative to a conceptual frame. The word “sell” does not make sense without some understanding of a commercial transaction and some of the other roles and terms involved like “buyer,” “money,” and “cost”. Furthermore, by mentioning only one of these concepts like “buy” or “sell”, the whole commercial transaction scenario is evoked or “activated” in the mind (Fillmore, 1976).

Lakoff (2008) further argues that framing is critical to human cognition. The neural circuitry to create a frame is relatively simple and our brain essentially uses framing as a sort of cognitive processing shortcut. If things are understood in the context of a frame, much is already unconsciously understood and need not be consciously processed. We know what to expect.

He shows how language, metaphor, and framing play critical roles in any social enterprise.

“Language is at once a surface phenomenon and a source of power. It is a means of expressing, communicating, accessing, and even shaping thought. ... Language gets its power because it is defined relative to frames, prototypes, metaphor, narratives, images and emotions. Part of its power comes from unconscious aspects: we are not consciously aware of all that it evokes in us, but it is there, hidden, always at work. If we hear the same language over and over, we will think more and more in terms of the frames and metaphors activated by that language.” (Lakoff 2008, p14)
So with that in mind Peter Fox and I wrote a critical essay, asking

Lakoff, G. (2008) The Political Mind: Why You Can't Understand 21st-Century Politics With An 18th-Century Brain. New York: Penguin Group.
The National Snow and Ice Data Center distributes a variety of different snow cover products derived from the Moderate Resolution Imaging Spectroradiometer (MODIS). The results of a quick analysis of how many scientific papers mention use of "MODIS Snow Cover Data" (according to Google Scholar) and how often the data sets themselves are formally cited show a huge disparity, illustrating&#xA0; the infrequency of proper data citation in practice. Moreover, the lack of data citation standards introduces the possibility that informal references to data do not point to the exact data set actually used.\n\nThis was a crude estimate to make a point. Gene Major, Heather Piwowar, others have done manual studies showing it&#x2019;s not being done and it hasn&#x2019;t improved in a decade.\n\nSo the approach to date is to try and develop guidelines and promote their use and acceptance of the practice\n\n
So it&#x2019;s pretty clear from all this that the author is not being fairly credited.\n\n
Digital Mapping Techniques '00 -- Workshop ProceedingsU.S. Geological Survey Open-File Report 00-325\nProposal for Authorship and Citation Guidelines for Geologic Data Sets and Map Images in the Era of Digital Publication\nBy Stephen M. Richard\n\n\nWhen I presented this only a year ago, it was much more diverse. We are converging on the approach, now we need to implment it!\n
regarding, tracking the impact...\n\nChen, R. S. and Downs, R. R. (2010). Evaluating the Use and Impacts of Scientific Data. National Federation of Advanced Information Services (NFAIS) Workshop, Assessing the Usage and Value of Scholarly and Scientific Output: &#xA0;An Overview of Traditional and Emerging Approaches. Philadelphia, PA, November 10, 2010. http://info.nfais.org/info/ChenDownsNov10.pdf\n
\n
\n
These databases (a) do not currently support tracking citations to data sets, (b) have expensive subscription fees, and (c) only reveal impact demonstrated through traditional citations. \n
\n
\n
Resource type is a short controlled list of “The general type of a resource.”\nMany other optional fields are possible.\n\n
\n
Authors are those who put the intellectual effort into creating the data set. Study designers, algorithm developers, instrument designers, field team leaders, etc.\n
\n
\n
Editors and other roles help with credit and accountability\n\n
Publisher. Could be multiple, but usually there’s an authority.\n
\n
\n
They’re different, but sometimes a locator can be used as an ID (the person working in this position at this address). Hence the general use of the term “identifier”, such as in DOI, for something that is better described as a locator.\n\nChun, Lee, and Choi introduce name and address as complements to identifier and locator. I’m not sure I entirely understand the distinction, but a key point they make is that while the concepts of identifier and locator are distinct, they can and do intersect in different contexts. Sometimes you can identify with a locator and locate with an identifier.\n\nWhat we really need is what John Kunze calls an “actionable locator”.\n\nKunze, J. (2003) Towards electronic persistence using ARK identifier. Retrieved September 22, 2012 from the World Wide Web: http://pid.ndk.cz/dokumenty/zakladni-literatura/arkcdl.pdf\n\nChun, W., T. H. Lee, and T. Choi. (2011) YANAIL: yet another definition on names, addresses, identifiers, and locators. Proceedings of the 6th International Conference on Future Internet Technologies, pp. 8-12.\n
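The identifier/locator distinction above can be pictured with a toy resolver. This is a minimal sketch, not how DOI or ARK resolution is actually implemented; the `Resolver` class, the identifier string, and the example.org URLs are all made up for illustration.

```python
class Resolver:
    """Toy registry mapping a persistent identifier to its current locator.

    Hypothetical helper for illustration; real resolvers (DOI, ARK) are
    networked services with their own rules.
    """

    def __init__(self):
        self._locations = {}

    def register(self, identifier, locator):
        # Re-registering an identifier replaces its locator.
        self._locations[identifier] = locator

    def resolve(self, identifier):
        return self._locations[identifier]


resolver = Resolver()
data_id = "example-id-0001"  # made-up identifier, not a real DOI or ARK

resolver.register(data_id, "http://archive.example.org/v1/snow_cover")
first = resolver.resolve(data_id)

# The archive reorganizes: the locator changes, the identifier does not.
resolver.register(data_id, "http://archive.example.org/data/snow_cover")
second = resolver.resolve(data_id)
```

The identifier only becomes "actionable" because something maintains the mapping; if nobody updates the registry, the identifier persists but stops locating anything.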
\n
Debate in the LSID community weakens it: an LSID is a locator, but the ObjectID part of it is also an identifier, and most people use a UUID for the ObjectID part of it.\nOID problem\nARK is a bit better than the rest of the locators because it has additional trust value ... maybe the color should have been more orange than yellow but I didn’t want to add more colors.\n\nFrom the ARK definition: transparency. No identifier can guarantee stability, and ARK inflections help users make informed judgements.\n\n
What is the citable unit with a DOI? A file? A collection of files? How many? Further, it is important to note that data products can be purged from an archive; such deleted information still needs to be able to be referenced. Even if the products themselves are not preserved, the raw data must be preserved along with detailed documentation describing how the product was created, and that documentation must be citable.\n\nSome have expressed concern that DOIs are too closely associated with publishers and may restrict open access.\n
\n
This is where the publication metaphor gets us in trouble.\nCan we track data packages flowing through an ecosystem instead and recreate\n
\n
Still need date accessed with a DOI.\nCan we auto-track UUIDs?\n\n
“Micro-citation”: the hardest piece of all and the one most critical to scientific repeatability.\nOne might think of this as citing a passage in a publication, e.g. by page number.\n
page number requires a canonical format.\n
A “structural index (chapter-verse)” vs. an “internal location id (page number)”.\nA structural index is useful when there are many different versions but with fairly consistent structure.\n\nThe difficulty with the Unique Numerical Fingerprint approach is that it assumes that there is an immutable order to the data elements. This is not always the case, therefore chapter-verse needs to refer to “equivalence classes”, not canonical versions. But you can’t deny the human readability of the chapter-verse approach.\n\nWe probably need both approaches. We need the chapter and verse that makes sense to people and is easily conceived and communicated between people, but then we still need the precise location and identity of that rather mutable verse represented in a way that computers can readily understand and be precise about, i.e. the identifier.\nAnd then we can't forget the fact that we have billions if not trillions of "verses" or "granules" that we're dealing with. Our human approach needs to make sense at a high level of aggregation, while the computer approach needs to handle the volumes and precision.\n\nAnother option is to capture query and date. This relies on really good provenance tracking.\n
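The ordering problem noted above can be illustrated with a small sketch. This is not the Unique Numerical Fingerprint algorithm; it is an assumed alternative that hashes each record on its own and then hashes the sorted record hashes, so two copies of a subset with the same content in a different order fall into the same "equivalence class". The function name `subset_fingerprint`, the record fields, and the values are hypothetical.

```python
import hashlib
import json


def subset_fingerprint(records):
    """Return a hex digest that ignores the order of the records.

    Each record (a dict of field -> value) is serialized with sorted keys
    and hashed separately; the sorted per-record hashes are then hashed
    together. Hypothetical helper for illustration only.
    """
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(rec, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for rec in records
    )
    return hashlib.sha256(",".join(record_hashes).encode("utf-8")).hexdigest()


# Made-up granule records, not real MODIS values.
granules_a = [
    {"tile": "h09v04", "date": "2011-07-21", "snow_fraction": 0.42},
    {"tile": "h10v04", "date": "2011-07-21", "snow_fraction": 0.17},
]
granules_b = list(reversed(granules_a))  # same content, different order

# A copy in which one value was reprocessed.
granules_c = [dict(granules_a[0], snow_fraction=0.43), granules_a[1]]

same_class = subset_fingerprint(granules_a) == subset_fingerprint(granules_b)
changed = subset_fingerprint(granules_a) != subset_fingerprint(granules_c)
```

A fingerprint like this answers "are these the same data?" for a machine, while the chapter-verse label stays the thing people read and exchange.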
note the outdated reference and page numbers-- a problem with costly ISO standards.\n
It would probably be wise to distinguish between cases where the file collections are open, but adding relatively stable items, from ones where the files themselves are mutable.\n\n\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
the shape file does not hold all the comments that are in the text file.\n\nGo take a photo of the actual fieldbook with a pencil and use that.\n
L3 files records interim files not L2.\nHall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 21 July 2011 at http://nsidc.org/data/myd10a1.html.\n
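The citation above can be assembled from its separate elements, which makes it easier to see which piece carries which job (credit, version, access date, location). A minimal sketch: the `DataCitation` class and its field names are assumptions for illustration, not part of any guideline; the values come from the citation above.

```python
from dataclasses import dataclass
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


@dataclass
class DataCitation:
    """Hypothetical container for the elements of a data set citation."""
    authors: str    # already formatted author list
    release: str    # year plus update cadence, e.g. "2007, updated daily"
    title: str      # includes the version, e.g. "... V005"
    place: str
    publisher: str
    accessed: date  # needed because the data set keeps changing
    locator: str    # a URL here; a DOI could take its place

    def format(self) -> str:
        d = self.accessed
        accessed = f"{d.day} {MONTHS[d.month - 1]} {d.year}"
        return (f"{self.authors}. {self.release}. {self.title}. "
                f"{self.place}: {self.publisher}. "
                f"Data set accessed {accessed} at {self.locator}.")


citation = DataCitation(
    authors="Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson",
    release="2007, updated daily",
    title="MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005",
    place="Boulder, Colorado USA",
    publisher="National Snow and Ice Data Center",
    accessed=date(2011, 7, 21),
    locator="http://nsidc.org/data/myd10a1.html",
).format()
print(citation)
```

Keeping the access date as its own field reflects the point that a data set "updated daily" is not pinned down by title and version alone.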
\n
\n
\n
“To preserve data is to preserve ability to process data!”\nTo reference data is to track data\n\n
Include engineers?\nData managers?\nSponsors?\nCorrect URL? This is where a DOI helps.\n
Include engineers?\nData managers?\nCorrect URL? This is where a DOI helps.\n
I have modified this hypothesis and I’m having second thoughts as to whether it stands. From a limited scholarly perspective of citing well-controlled data sets, it’s probably true. It is “citable” in some of the credit and location senses, but that is insufficient for full scientific tracking and referencing of data streams.\n
\n
\n
W3C and PROV-O are doing some good work on this first point.\n\nTakeaway: figure out what problem you are trying to solve with referencing, and then examine the problem from many perspectives, not just the classic scholarly publication model.\n