Preserving Social Science Research Data Using Fedora. Bryan Beecher, Inter-university Consortium for Political and Social Research (ICPSR). CNI Fall 2010 Membership Meeting.
ICPSR: World’s largest social science research data archive. Lots of files (millions). Small files (6TB total). Long track record of success – 50 yrs. Trust us. Enormous legacy burden.
ICPSR: Survey data are our core. Low volume of new content compared to natural sciences. We curate each item extensively (disclosure, quality, format, usability). Strong access orientation. Talk like an archive. Walk like an archive?
Walking the walk: Good storage container for content and its metadata. OAIS-compliant. Generate SIPs and AIPs (and DIPs). But…
What should we do?
Where to begin? Focus areas: Preservation; Going forward; Reusable. Do not try to include: Access; Everything we have.
A Solution: Fedora objects: container for stuff we ingest and preserve. Fedora services: to generate AIPs and SIPs. Tool to generate FOs (Fedora objects) from existing content and metadata.
Ingest: The Motivated Depositor. Eager to describe the research data in great detail. Uploads complete, machine-readable metadata.
Ingest (continued): The Unmotivated Depositor. Uploads a variety of proprietary file formats for documentation and data. Leaves the baby on the doorstep.
Ingest – Nov 2010 deposits
Ingest (continued): Typical deposit. Research data in one of the common stat packages (SAS, SPSS, etc.). Technical documentation in a proprietary format (Word, PDF). A proto-SIP in quasi-OAIS terms. Minimal level of metadata regarding how the survey was conducted.
Ingest container – file level: Vanilla Fedora Object. Will never know what sort of content format to expect. Use the RELS-EXT to connect related files.
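To make the RELS-EXT idea concrete, here is a minimal sketch of the RDF/XML that could link a file-level object to its deposit-level object. It assumes the standard Fedora 3 relationship namespace and its isPartOf predicate; the PIDs and the function name build_rels_ext are hypothetical and are not taken from ICPSR's actual implementation.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid: str, deposit_pid: str) -> str:
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf deposit_pid.

    Hypothetical helper: a sketch of one way to express the link,
    not the tool described in the talk.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the file-level object itself.
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    # The object of the statement is the deposit-level object.
    ET.SubElement(
        description,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{deposit_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical PIDs, for illustration only.
    print(build_rels_ext("icpsr:deposit-1-file-1", "icpsr:deposit-1"))
```

Because the file-level object makes no assumption about content format, the relationship is the only thing tying it to its siblings in the same deposit.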
Ingest container – deposit: Another plain Fedora Object. Points to all of the files stored in the file-level objects. Relatively little metadata stored for this level of object.
Ingest container – example
Ingest container – example
Ingest and the OAIS PDI: Reference – unique Fedora PID. Fixity – Fedora-generated checksum. Provenance – identity of depositor recorded in the DC Datastream. Context – original file name captured in the content Datastream. Access Rights – terms of deposit.
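The PDI mapping above can be pictured as a small record with one field per OAIS category. The sketch below is only an illustration: the class and function names are hypothetical, and the checksum is computed locally with SHA-256 as a stand-in for the checksum Fedora generates itself.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PDI:
    """One field per OAIS PDI category named on the slide."""
    reference: str      # unique Fedora PID
    fixity: str         # checksum (Fedora generates its own; this is a stand-in)
    provenance: str     # identity of the depositor (kept in the DC datastream)
    context: str        # original file name (kept with the content datastream)
    access_rights: str  # terms of deposit


def describe_file(pid: str, content: bytes, depositor: str,
                  original_name: str, terms: str) -> PDI:
    """Hypothetical helper that assembles PDI for one deposited file."""
    digest = hashlib.sha256(content).hexdigest()
    return PDI(
        reference=pid,
        fixity=f"sha256:{digest}",
        provenance=depositor,
        context=original_name,
        access_rights=terms,
    )


if __name__ == "__main__":
    # All values below are made up for illustration.
    pdi = describe_file("icpsr:deposit-1-file-1", b"example bytes",
                        "A. Depositor", "survey.sav", "standard terms of deposit")
    print(pdi)
```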
Generating OAIS SIPs: Original content (normalized version too, if applicable; what’s normalization in this context?). Preservation Description Information (PDI), as described previously. Delivered via SDef/SDep combo.
Ingest – continued: Data: disclosure analysis; recoding. Documentation: corrections; clarifications. Normalized formats.
Ingest – finale: Packaged into a “study”: data, doc questionnaire, user guide, etc. Normalized formats for preservation. Convenient formats for access.
Ingest – finale [Diagram: Fedora objects and their datastreams. Objects shown include icpsr:release-28748-file-1, icpsr:release-28748-file-2, and icpsr:release-28748-file-3. Content datastreams shown: REPORT (text/plain), QUESTIONNAIRE (application/pdf), CODEBOOK (application/pdf), STATA-DICT, DATA, SAS-SETUPS, SPSS-SETUPS, and STATA-SETUPS (text/plain), and DDI (text/xml). Each object also carries objectProperties, DC, RELS-EXT, and AUDIT; three of the RELS-EXT datastreams record isPartOf: release-15868.]
Generating OAIS AIPs: For each object (file): Everything from the SIP, plus: Preservation events (description of the transformation used); Preservation commitment; Its post-processed version. Delivered via SDef/SDep combo.
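One way to read the AIP recipe is as "the SIP, extended". The sketch below models that reading with plain data classes. Every name in it is hypothetical, and it does not attempt to reproduce the SDef/SDep service mechanism that actually delivers the packages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SIP:
    pid: str
    original: bytes
    pdi: Dict[str, str]                 # reference, fixity, provenance, ...
    normalized: Optional[bytes] = None  # present only if normalization applied


@dataclass
class PreservationEvent:
    description: str  # description of the transformation used


@dataclass
class AIP:
    sip: SIP                      # everything from the SIP
    commitment: str               # preservation commitment
    post_processed: bytes         # the curated (post-processed) version
    events: List[PreservationEvent] = field(default_factory=list)


def build_aip(sip: SIP, post_processed: bytes,
              transformation: str, commitment: str) -> AIP:
    """Hypothetical helper: wrap a SIP and record how it was transformed."""
    event = PreservationEvent(description=transformation)
    return AIP(sip=sip, commitment=commitment,
               post_processed=post_processed, events=[event])


if __name__ == "__main__":
    # Made-up values for illustration.
    sip = SIP(pid="icpsr:deposit-1-file-1", original=b"raw",
              pdi={"reference": "icpsr:deposit-1-file-1"})
    aip = build_aip(sip, b"curated", "recoded for disclosure risk", "full preservation")
    print(aip.events[0].description)
```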
Example AIP [Diagram: Fedora objects and their datastreams. Objects shown include icpsr:release-28748-file-1, icpsr:release-28748-file-2, and icpsr:release-28748-file-3. Content datastreams shown: REPORT (text/plain), QUESTIONNAIRE (application/pdf), CODEBOOK (application/pdf), STATA-DICT, DATA, SAS-SETUPS, SPSS-SETUPS, and STATA-SETUPS (text/plain), and DDI (text/xml). Each object also carries objectProperties, DC, RELS-EXT, and AUDIT; three of the RELS-EXT datastreams record isPartOf: release-15868. One further object (PID) carries only objectProperties, DC, RELS-EXT, and AUDIT.]
Questions we faced: Datastreams or relationships? What about our XML? AIPs or DIPs? How to build FOXML?
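Datastreams / relationships? [Diagram: two separate objects, one holding CONTENT X and one holding CONTENT Y, each with objectProperties, DC, RELS-EXT, and AUDIT, shown alongside a single object holding both CONTENT Y and CONTENT X together with objectProperties, DC, RELS-EXT, and AUDIT.]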
Our XML: DDI v2. Contains lots of the information one might expect to find in the DC. Strategy: Duplicate it.
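As an illustration of the "duplicate it" strategy, the sketch below copies a few citation fields from a DDI codebook into a Dublin Core record. The element paths (titl, AuthEnty, IDNo) follow the DDI Codebook 2.x citation structure, but the sample document is simplified and namespace-free, and the choice of which fields map to which DC elements is an assumption, not ICPSR's documented crosswalk.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Assumed crosswalk: (path under codeBook, Dublin Core element name).
DDI_TO_DC = [
    ("stdyDscr/citation/titlStmt/titl", "title"),
    ("stdyDscr/citation/rspStmt/AuthEnty", "creator"),
    ("stdyDscr/citation/titlStmt/IDNo", "identifier"),
]

# Simplified, made-up DDI fragment for illustration.
SAMPLE_DDI = """
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Study</titl><IDNo>0000</IDNo></titlStmt>
      <rspStmt><AuthEnty>Doe, Jane</AuthEnty></rspStmt>
    </citation>
  </stdyDscr>
</codeBook>
"""


def ddi_to_dc(ddi_xml: str) -> ET.Element:
    """Copy selected DDI citation fields into an oai_dc:dc element."""
    code_book = ET.fromstring(ddi_xml)
    dc = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for path, dc_name in DDI_TO_DC:
        for node in code_book.findall(path):
            text = (node.text or "").strip()
            if text:  # skip empty DDI elements rather than emit empty DC
                ET.SubElement(dc, f"{{{DC_NS}}}{dc_name}").text = text
    return dc


if __name__ == "__main__":
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    print(ET.tostring(ddi_to_dc(SAMPLE_DDI), encoding="unicode"))
```

The DDI stays the richer, authoritative record; the DC copy simply lets generic Fedora tooling find the basics without parsing DDI.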
AIPs or DIPs: Lots of copies. Destination: Archival Storage; remote location; Repository for ingest.
Building FOXML: Source: Database; DDI XML. Re-usable tool.
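A reusable FOXML builder might look roughly like the sketch below, which turns one record (standing in for a database row) into a minimal FOXML 1.1 document with object properties and an inline DC datastream. The function name, the record keys, and the sample values are hypothetical, and the output has not been validated against the FOXML schema or ingested into Fedora.

```python
import xml.etree.ElementTree as ET
from typing import Dict

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL = "info:fedora/fedora-system:def/model#"
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def build_foxml(record: Dict[str, str]) -> str:
    """Build a minimal FOXML 1.1 object from a record with pid, label, title."""
    for prefix, uri in (("foxml", FOXML_NS), ("dc", DC_NS), ("oai_dc", OAI_DC_NS)):
        ET.register_namespace(prefix, uri)

    obj = ET.Element(f"{{{FOXML_NS}}}digitalObject",
                     {"VERSION": "1.1", "PID": record["pid"]})

    # Object-level properties: state and label.
    props = ET.SubElement(obj, f"{{{FOXML_NS}}}objectProperties")
    ET.SubElement(props, f"{{{FOXML_NS}}}property",
                  {"NAME": MODEL + "state", "VALUE": "Active"})
    ET.SubElement(props, f"{{{FOXML_NS}}}property",
                  {"NAME": MODEL + "label", "VALUE": record["label"]})

    # Inline XML (control group X) Dublin Core datastream.
    ds = ET.SubElement(obj, f"{{{FOXML_NS}}}datastream",
                       {"ID": "DC", "STATE": "A", "CONTROL_GROUP": "X"})
    version = ET.SubElement(ds, f"{{{FOXML_NS}}}datastreamVersion",
                            {"ID": "DC.0", "MIMETYPE": "text/xml",
                             "LABEL": "Dublin Core"})
    content = ET.SubElement(version, f"{{{FOXML_NS}}}xmlContent")
    dc = ET.SubElement(content, f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(dc, f"{{{DC_NS}}}title").text = record["title"]
    ET.SubElement(dc, f"{{{DC_NS}}}identifier").text = record["pid"]

    return ET.tostring(obj, encoding="unicode")


if __name__ == "__main__":
    # Made-up record standing in for a database row.
    print(build_foxml({"pid": "icpsr:example-1",
                       "label": "Example file object",
                       "title": "Example Study codebook"}))
```

Keeping the builder a pure function of its input record is what makes it reusable: the same code can be fed from the database or from fields parsed out of DDI XML.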
Special Thanks: The Team: Peggy Overcashier, Nathan Adams, Nancy McGovern, Mary Vardigan. The Funder: National Science Foundation, Award 0958382, INTEROP EAGER program.
