Beecher cni fall 2010 v4

Preserving Social Science Research Data Using Fedora
Bryan Beecher, Inter-university Consortium for Political and Social Research (ICPSR)
CNI Fall 2010 Membership Meeting

ICPSR World’s largest social science research data archive Lots of files (millions) Small files (6TB total) Long track record of success – 50 yrs Trust us Enormous legacy burden

ICPSR Survey data are our core Low volume of new content compared to natural sciences We curate each item extensively (disclosure, quality, format, usability) Strong access orientation Talk like an archive Walk like an archive?

Walking the walk Good storage container for content and its metadata OAIS-compliant Generate SIPs and AIPs (and DIPs) But…

Where to begin?
Focus areas: Preservation; Going forward; Reusable
Do not try to include: Access; Everything we have

A Solution Fedora objects Container for stuff we ingest and preserve Fedora services To generate AIPs and SIPs Tool to generate FOs from existing content and metadata

Ingest The Motivated Depositor Eager to describe the research data in great detail Uploads complete, machine-readable metadata

Ingest (continued) The Unmotivated Depositor Upload a variety of proprietary file formats for documentation and data Leaves the baby on the doorstep

Ingest (continued) Typical deposit Research data in one of the common stat packages (SAS, SPSS, etc) Technical documentation in a proprietary format (Word, PDF) A proto-SIP in quasi-OAIS terms Minimal level of metadata regarding how the survey was conducted

Ingest container – file level Vanilla Fedora Object Will never know what sort of content format to expect Use the RELS-EXT to connect related files

Ingest container – deposit Another plain Fedora Object Points to all of the files stored in the file-level objects Relatively little metadata stored for this level of object
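As a rough illustration of how a file-level object might point at its deposit-level object through RELS-EXT, the sketch below builds a small RDF/XML fragment using the Fedora 3 external-relations namespace. The PIDs and the function name `build_rels_ext` are hypothetical and are not taken from the ICPSR implementation.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid: str, deposit_pid: str) -> str:
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf deposit_pid."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the file-level object itself.
    desc = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    # The object of the statement is the deposit-level object.
    ET.SubElement(
        desc,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{deposit_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")


# Hypothetical PIDs, for illustration only.
rels_ext_xml = build_rels_ext("icpsr:deposit-1-file-1", "icpsr:deposit-1")
print(rels_ext_xml)
```

The same relationship could be recorded in the other direction (the deposit object listing its parts); which side holds the pointer is a design choice this sketch does not settle.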

Ingest and the OAIS PDI
Reference – unique Fedora PID
Fixity – Fedora-generated checksum
Provenance – identity of depositor recorded in the DC Datastream
Context – original file name captured in the content Datastream
Access Rights – terms of deposit
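The PDI mapping above can be pictured as a small record with one field per OAIS category. This is a minimal sketch, not the ICPSR code: the `PDI` class, the `describe` helper, and all sample values are hypothetical, and the checksum is computed locally with SHA-1 only as a stand-in for the checksum Fedora generates.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PDI:
    reference: str      # unique Fedora PID
    fixity: str         # checksum (generated by Fedora in the real system)
    provenance: str     # depositor identity, recorded in the DC datastream
    context: str        # original file name, kept with the content datastream
    access_rights: str  # terms of deposit


def describe(pid: str, content: bytes, depositor: str,
             file_name: str, terms: str) -> PDI:
    """Assemble the five PDI categories for one deposited file."""
    # Assumption: SHA-1 stands in for whatever algorithm the repository uses.
    digest = hashlib.sha1(content).hexdigest()
    return PDI(pid, f"SHA-1:{digest}", depositor, file_name, terms)


# Hypothetical values, for illustration only.
pdi = describe(
    pid="icpsr:deposit-1-file-1",
    content=b"example survey data",
    depositor="A. Depositor",
    file_name="survey.sav",
    terms="standard terms of deposit",
)
print(pdi)
```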

Generating OAIS SIPs Original content Normalized version too, if applicable What’s normalization in this context? Preservation Description Information (PDI) As described previously Delivered via SDef/SDep combo

Ingest – continued Data Disclosure analysis Recoding Documentation Corrections Clarifications Normalized formats

Ingest – finale Packaged into a “study” Data and documentation (questionnaire, user guide, etc.) Normalized formats for preservation Convenient formats for access

Ingest – finale PID REPORT (text/plain) objectProperties DC RELS-EXT AUDIT icpsr:release-28748-file-3 QUESTIONNAIRE (application/pdf) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT icpsr:release-28748-file-1 STATA-DICT (text/plain) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT DATA (text/plain) DDI (text/xml) SAS-SETUPS (text/plain) SPSS-SETUPS (text/plain) STATA-SETUPS (text/plain) icpsr:release-28748-file-2 CODEBOOK (application/pdf) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT

Generating OAIS AIPs For each object (file) Everything from the SIP plus Preservation events Description of the transformation used Preservation commitment Its post-processed version Delivered via SDef/SDep combo

Example AIP PID REPORT (text/plain) objectProperties DC RELS-EXT AUDIT icpsr:release-28748-file-3 QUESTIONNAIRE (application/pdf) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT icpsr:release-28748-file-1 STATA-DICT (text/plain) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT DATA (text/plain) DDI (text/xml) SAS-SETUPS (text/plain) SPSS-SETUPS (text/plain) STATA-SETUPS (text/plain) icpsr:release-28748-file-2 CODEBOOK (application/pdf) objectProperties DC RELS-EXT isPartOf: release-15868 AUDIT PID objectProperties DC RELS-EXT AUDIT

Questions we faced Datastreams or relationships? What about our XML? AIPs or DIPs? How to build FOXML?

Datastreams /relationships? PID CONTENT X objectProperties DC RELS-EXT AUDIT PID CONTENT Y objectProperties DC RELS-EXT AUDIT PID CONTENT Y objectProperties DC RELS-EXT AUDIT CONTENT X

Our XML DDI v2 Contains lots of the information one might expect to find in the DC Strategy Duplicate it
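One way to picture the "duplicate it" strategy is a small crosswalk that copies selected DDI fields into a Dublin Core record. The sketch below is an assumption-laden illustration: the mapping table `DDI_TO_DC`, the function `ddi_to_dc`, and the sample document are hypothetical, and the sample omits the XML namespace that real DDI 2.x files normally declare.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Hypothetical crosswalk: DDI path (relative to codeBook) -> DC element name.
DDI_TO_DC = {
    "stdyDscr/citation/titlStmt/titl": "title",
    "stdyDscr/citation/titlStmt/IDNo": "identifier",
    "stdyDscr/citation/rspStmt/AuthEnty": "creator",
}

# Toy, un-namespaced DDI fragment with made-up values.
SAMPLE_DDI = """<codeBook><stdyDscr><citation>
<titlStmt><titl>Example Survey</titl><IDNo>00000</IDNo></titlStmt>
<rspStmt><AuthEnty>Example, Ann</AuthEnty></rspStmt>
</citation></stdyDscr></codeBook>"""


def ddi_to_dc(ddi_xml: str) -> ET.Element:
    """Copy the mapped DDI fields into an oai_dc:dc element."""
    codebook = ET.fromstring(ddi_xml)
    dc = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for path, dc_name in DDI_TO_DC.items():
        for node in codebook.findall(path):
            text = (node.text or "").strip()
            if text:  # skip empty source elements
                ET.SubElement(dc, f"{{{DC_NS}}}{dc_name}").text = text
    return dc


dc_record = ddi_to_dc(SAMPLE_DDI)
print(ET.tostring(dc_record, encoding="unicode"))
```

Duplicating fields this way leaves the DDI as the fuller record while giving generic Fedora tooling a DC datastream it can read.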

AIPs or DIPs Lots of copies Destination Archival Storage remote location Repository for ingest

Building FOXML Source Database DDI XML Re-usable tool
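A reusable FOXML builder might look roughly like the following, which wraps an already-built XML metadata element in a minimal digital object. This is a sketch under stated assumptions, not the tool described in the talk: the function `build_foxml`, the PID, and the label are hypothetical, the element and attribute names follow FOXML 1.1 as I understand it, and the output has not been validated against the FOXML schema.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL_NS = "info:fedora/fedora-system:def/model#"


def build_foxml(pid: str, label: str, dc_xml: ET.Element) -> str:
    """Wrap a metadata element in a minimal FOXML 1.1 digital object."""
    ET.register_namespace("foxml", FOXML_NS)
    obj = ET.Element(
        f"{{{FOXML_NS}}}digitalObject", {"VERSION": "1.1", "PID": pid}
    )
    # Object-level properties: state and a human-readable label.
    props = ET.SubElement(obj, f"{{{FOXML_NS}}}objectProperties")
    ET.SubElement(
        props, f"{{{FOXML_NS}}}property",
        {"NAME": MODEL_NS + "state", "VALUE": "Active"},
    )
    ET.SubElement(
        props, f"{{{FOXML_NS}}}property",
        {"NAME": MODEL_NS + "label", "VALUE": label},
    )
    # One inline-XML datastream (control group X) holding the metadata.
    ds = ET.SubElement(
        obj, f"{{{FOXML_NS}}}datastream",
        {"ID": "DC", "STATE": "A", "CONTROL_GROUP": "X"},
    )
    version = ET.SubElement(
        ds, f"{{{FOXML_NS}}}datastreamVersion",
        {"ID": "DC1.0", "MIMETYPE": "text/xml", "LABEL": "Dublin Core"},
    )
    ET.SubElement(version, f"{{{FOXML_NS}}}xmlContent").append(dc_xml)
    return ET.tostring(obj, encoding="unicode")


# Hypothetical input; in practice the metadata would come from the
# database or the DDI XML named on the slide.
placeholder_dc = ET.Element("{http://www.openarchives.org/OAI/2.0/oai_dc/}dc")
foxml_doc = build_foxml("icpsr:example-1", "Example study file", placeholder_dc)
print(foxml_doc)
```

Keeping the builder independent of where the metadata comes from is what would let the same function serve both sources.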

Special Thanks The Team Peggy Overcashier Nathan Adams Nancy McGovern Mary Vardigan The Funder National Science Foundation Award 0958382 INTEROP EAGER program
