Reuse for Research
Curating Astrophysical Datasets
for Future Researchers
Practice Paper, IDCC17
Anders Conrad, Royal Danish Library
Michael Svendsen, Royal Danish Library
Rasmus Handberg, Aarhus University
The NASA Kepler/K2 Mission
Read about the mission at https://kepler.nasa.gov/Mission/QuickGuide/
The Kepler Photometer
From Space to Aarhus…
Spacecraft
Deep Space Network
NASA MAST archive
KASOC archive, Aarhus
KASC scientists/working groups
KASOC website (kasoc.phys.au.dk)
The challenge - Where Next?
• Data will remain valuable for active research
for at least 50 years!
• Who will take care of the data when the current
research organisation (Kepler Asteroseismic Science
Consortium, KASC) no longer exists?
• How can data be kept accessible for continued
active research?
KASC requirements for a Living Archive
• Available for 50 years
• Always freely available on-line
• Continue to be used for active research
• Extendable: New information can be added
• Formats must be readable by both humans and
computers
• Understandable and useful for future
researchers – no matter the science case
Future workshops - Reuse for Research
• For which research questions might future
researchers find this data useful?
• How would they most likely want to see data
packaged?
• What documentation is needed to understand
data outside the current context?
• What search criteria would most likely be used
to discover data?
The 50 Years Issue
• Institutionally:
Who can offer more than 5-10 years of storage
and preservation?
• Financially:
Who will pay?
• Technically:
How will data remain readable and
understandable?
• Scientifically:
How will data remain useful and trustworthy?
From "Who" and "How" to…
• How to best
• Structure datasets in a way that is most useful for
research
• Use formats that are suitable for long-term
preservation
• Secure sufficient contextual and specific
documentation for scientific reuse
• Facilitate cross-institutional collaboration, to
provide a sustainable service
• Secure access and discoverability according to
scientific needs
• Secure possibility for continued deposit
Dataset Structure
• One self-contained dataset for each star
• 5 different types of data products
• Dataset-specific documentation
• TOC file (machine and human readable)
• References to publications (bibcodes)
• One generic documentation package
• E.g. NASA and KASC release notes
One BagIt Archive for Each Star
Kepler_10.zip
│ bag-info.txt
│ bagit.txt
│ fetch.txt
│ manifest-sha1.txt
└───data
│ bundle.xml
│ readme.txt
├───datafiles
│ └───...
├───additional_files
│ └───...
├───documentation
│ └───...
└───stellar_models
└───...
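The fixity information in such a bag can be checked with a few lines of code. The following is a minimal sketch, assuming the archive has already been unzipped and that manifest-sha1.txt uses the usual BagIt layout of one "checksum path" pair per line. The function names are illustrative and not part of the KASOC tooling. Files listed only in fetch.txt and not yet downloaded would be reported as failures.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_root: Path) -> list[str]:
    """Return the payload paths that are missing or fail their checksum.

    Assumes each manifest line is '<sha1> <relative path>'.
    """
    failures = []
    manifest = bag_root / "manifest-sha1.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag_root / rel_path
        if not target.is_file() or sha1_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling verify_bag(Path("Kepler_10")) on an unzipped bag would return an empty list when every listed file is present and unchanged.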
Documentation for Each Dataset
<star kic="12345678">
  <numax value="3100" error="20" unit="uHz" />
  <mass value="1.0" error="0.01" unit="solar" />
  <radius value="1.0" error="0.01" unit="solar" />
  <datafiles>
    <datafile uid="1" path="datafiles/original/kplr12345678_llc.fits" />
    <datafile uid="2" path="datafiles/kasoc.ts/kplr12345678_kasoc.ts.fits">
      <dependency datafile="1" />
    </datafile>
    …
  </datafiles>
  <model path="stellar_models/kic12345678/" />
</star>
● The bundle.xml file
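Because bundle.xml is meant to be machine readable, a future user could load it with a standard XML parser. The sketch below is an illustration only: it assumes the element and attribute names shown on the slide, uses a trimmed copy of that example as sample input (the elided entries are left out), and the function name read_bundle is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide example; the elided entries are omitted.
SAMPLE = """
<star kic="12345678">
  <numax value="3100" error="20" unit="uHz" />
  <mass value="1.0" error="0.01" unit="solar" />
  <radius value="1.0" error="0.01" unit="solar" />
  <datafiles>
    <datafile uid="1" path="datafiles/original/kplr12345678_llc.fits" />
    <datafile uid="2" path="datafiles/kasoc.ts/kplr12345678_kasoc.ts.fits">
      <dependency datafile="1" />
    </datafile>
  </datafiles>
  <model path="stellar_models/kic12345678/" />
</star>
"""


def read_bundle(xml_text: str) -> dict:
    """Collect stellar parameters and data file dependencies from a bundle."""
    star = ET.fromstring(xml_text)

    # Direct children carrying a 'value' attribute are treated as parameters.
    parameters = {}
    for child in star:
        if "value" in child.attrib:
            parameters[child.tag] = (
                float(child.get("value")),
                float(child.get("error")),
                child.get("unit"),
            )

    # Each data file records its path and the uids it was derived from.
    datafiles = {}
    for datafile in star.iter("datafile"):
        datafiles[datafile.get("uid")] = {
            "path": datafile.get("path"),
            "depends_on": [
                dep.get("datafile") for dep in datafile.findall("dependency")
            ],
        }

    return {"kic": star.get("kic"), "parameters": parameters, "datafiles": datafiles}


bundle = read_bundle(SAMPLE)
```

The dependency list lets a reader trace a derived product, such as the KASOC time series, back to the original file it was computed from.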
Proof-of-concept - Repository Setup
• Using Dataverse repository software
• Support for astrophysics metadata
• Discoverability and citability (DataCite DOIs)
• APIs for automatic ingest workflow
• Versioning – allowing redeposit of extended
versions of datasets
• Issues:
• Missing numeric fields for celestial coordinates (for
discovery)
• Limited options for mapping to external storage (we
use erda.dk)
Institutional Collaboration
Conclusions – as of February 2017
• Data packages designed in a way that can
outlive repository software
• Caveat: may imply limitations in the use of
repository features
• Preservation actions should remain possible,
even ones we have not planned for
• We are still working on establishing funding and a
sustainable business model
• We need to establish a production
environment for the repository
Reuse for Research
Contact: Michael Svendsen, @tullemich, Royal Danish Library
