At Yale University, we have been working on a reaccessioning project that has helped us develop our thinking about how accessioning of electronic records could best be realized for us going forward. Two repositories, Manuscripts and Archives and the Beinecke Rare Book and Manuscript Library, have worked in collaboration to implement software, hardware, and procedures that can be shared to support accessioning. In our reaccessioning project, we are working to establish better control over previously transferred accessions that contain electronic records on media such as floppy disks and CD-ROMs. These pieces of media were often received as part of a hybrid accession that also contained paper records, but in some cases we have received accessions of boxes containing only media.
The goals of our reaccessioning project are fairly straightforward and relate to the three types of control discussed previously. First, we seek to establish administrative control of the media by identifying what it is, documenting its physical and logical characteristics, and assigning a unique identifier to each piece. Second, we are working towards gaining physical control of the media, which will allow us to mitigate the risks of media deterioration and obsolescence. Finally, we are trying to establish a basic level of intellectual control by extracting metadata about the filesystems and files contained on the media, such as file names, directory structures, and creation, access, and modification dates.
Our reaccessioning workflow roughly looks like the following. We begin by retrieving the media and bringing it to the electronic records workstation, documenting its change in location within the Archivists’ Toolkit. We then assign unique identifiers to each of the media. We establish the best means by which to write-protect the media for imaging and record its identifying characteristics in a media log. We then put the media in the appropriate drive and create a forensic bit-level disk image, which includes all the files, the filesystem metadata, unused space – in other words, the entirety of the data on the media. We verify the image against the raw contents of the media and extract metadata from the disk image. Finally, we package the images and metadata and transfer the package into storage and complete the rest of the documentation.
To acquire the data off media, we are using a forensic imaging process that extracts the entirety of the data off the media at the lowest level possible. To ensure that we do not intentionally or accidentally manipulate any of the data on the original media, we write-protect the media or reader. For floppy disks, we can use physical write-protect tabs. For USB flash media, hard drives, and the like, we connect the drive or reader to a write-blocker, which is a piece of hardware that sits between the drive and the computer and blocks low-level write signals from the computer. We use a variety of software to acquire the images, such as FTK Imager. The imaging software extracts the data from the media and calculates a cryptographic hash of the data on the media and of the data within the image file. If the two hashes match, the imaging is viewed as successful.
This is a screenshot of FTK Imager, which we use to image media and to inspect disk images. You can see that the file listing includes regular files, slack or unused space on the disk, and deleted files, as denoted by the red X on the file icons.
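The hash comparison described above can be illustrated with a minimal sketch. It is not how FTK Imager is implemented; it assumes a raw (dd-style) image, which is a byte-for-byte copy of the source, and the function names `file_digest` and `image_matches_source` are hypothetical. Container formats that wrap the data with their own headers would need the hash computed over the contained data instead of over the image file.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file or device in chunks so large images need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def image_matches_source(source_path, image_path, algorithm="md5"):
    """Return True when the source and a raw image of it hash identically."""
    return file_digest(source_path, algorithm) == file_digest(image_path, algorithm)
```

Reading in fixed-size chunks keeps memory use flat regardless of the size of the media being compared.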
Our media log is a SharePoint list that contains identifying characteristics and physical and logical information about the media, such as the type of media, when it was imaged, the text of a label or writing on the media, and the type of filesystem or filesystems it contains. We assign each piece of media a unique identifier, which is a combination of the accession number and an incremental number. The media log also contains the workflow status of the accessioning process for each piece of media and whether processes succeeded or failed.
The first screenshot is an overview for several pieces of media. You can see the unique media identifiers, the media format, and the workflow status.
This expanded view shows all the fields, including further documentation about the disk image, the filesystem contained, and additional notes.
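As a rough illustration of the kind of record the media log holds, the sketch below builds an identifier from an accession number and an incremental number and stores it alongside a few of the fields mentioned above. The separator, the zero-padding, the field names, the status value, and the sample accession number are all assumptions made for the example; they are not the actual SharePoint column definitions or Yale's identifier format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def media_identifier(accession_number: str, sequence: int, width: int = 4) -> str:
    """Join an accession number and a zero-padded incremental number (assumed format)."""
    if sequence < 1:
        raise ValueError("sequence numbers start at 1")
    return f"{accession_number}-{sequence:0{width}d}"


@dataclass
class MediaLogEntry:
    """One row of a hypothetical media log."""
    identifier: str
    media_type: str
    label_text: str = ""
    filesystems: List[str] = field(default_factory=list)
    imaged_on: Optional[str] = None   # date of imaging, if it has happened yet
    status: str = "registered"        # workflow status (assumed vocabulary)
    notes: str = ""


entry = MediaLogEntry(
    identifier=media_identifier("2010-M-001", 3),  # hypothetical accession number
    media_type='3.5" floppy disk',
    label_text="Drafts, spring 1994",
    filesystems=["FAT12"],
)
print(entry.identifier)
```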
If imaging is successful, we then extract metadata from the filesystem and files within the image. This is a software-based process that provides metadata such as file names, directory structures, creation and modification times, and approximate categorization of the types of files. This metadata can be repurposed in a variety of ways and provides a basic level of intellectual control that is comparable to a box list or other type of inventory for paper records. We are using open source software such as Sleuthkit and fiwalk to perform this extraction, but occasionally we need to rely on other tools for older or less common types of file systems.
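To show how this extracted metadata might be turned into something like a box list, here is a small sketch that reads fiwalk-style XML output and lists each file with its size and modification time. The sample document is a trimmed, illustrative stand-in and not real output; actual fiwalk output carries many more elements and may use XML namespaces, which is why the sketch compares local tag names only. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative, heavily trimmed stand-in for fiwalk output (not real data).
SAMPLE = """<dfxml>
  <volume>
    <fileobject>
      <filename>LETTERS/DRAFT1.DOC</filename>
      <filesize>14336</filesize>
      <mtime>1994-03-02T15:21:08Z</mtime>
    </fileobject>
    <fileobject>
      <filename>README.TXT</filename>
      <filesize>512</filesize>
      <mtime>1993-11-20T09:00:00Z</mtime>
    </fileobject>
  </volume>
</dfxml>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def list_files(xml_text):
    """Return one dict per fileobject element with name, size, and mtime."""
    root = ET.fromstring(xml_text)
    files = []
    for elem in root.iter():
        if local_name(elem.tag) != "fileobject":
            continue
        values = {local_name(child.tag): (child.text or "").strip() for child in elem}
        files.append({
            "filename": values.get("filename", ""),
            "filesize": int(values["filesize"]) if values.get("filesize") else None,
            "mtime": values.get("mtime"),
        })
    return files


for record in list_files(SAMPLE):
    print(record["filename"], record["filesize"], record["mtime"])
```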
Finally, we create a transfer package using the BagIt specification as developed by the Library of Congress and the California Digital Library. To create the packages, we are using the Library of Congress-developed Bagger application. These packages contain the disk images, extracted metadata, and logs generated by the disk imaging software during the acquisition process. The BagIt packages also contain high-level information about the accession. For the time being, we are creating roughly one bag per accession, but we realize we may need to modify this depending on the size of the accessions.
This is an overview of a sample bag, showing the structure and high-level metadata. Once packaged, we transfer the package to storage and verify the success of the transfer using procedures for the BagIt specification which compare the contents of the package against its manifest. If successful, we complete the rest of the documentation and record the success in the media log. We also record the storage location of the transferred package within the Archivists’ Toolkit and add the date of completion.
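The comparison of a package against its manifest can be sketched as follows. This is a simplified illustration, not the Bagger application's own validation: it assumes a payload manifest named `manifest-md5.txt` in the bag's top-level directory, with one checksum and one relative file path per line, and it checks only that each listed file exists and hashes to the recorded value. A complete BagIt validation also covers the bag declaration, tag manifests, and payload files missing from the manifest. The function name `verify_manifest` is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, algorithm="md5"):
    """Return a list of (path, problem) pairs; an empty list means all files matched."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    problems = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <relative path>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
            continue
        h = hashlib.new(algorithm)
        with open(target, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        if h.hexdigest() != expected.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

Returning a list of problems instead of a single pass or fail value makes it easier to record in a log which files failed and why.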
AIMS Workshop Case Study 2: Re-accessioning at Yale
Case Study: Re-Accessioning at Yale<br />Mark A. Matienzo<br />Yale University<br />
Overview<br />Collaborative capacity building across two repositories<br />Manuscripts and Archives<br />Beinecke Rare Book and Manuscript Library<br />Addressing previously received accessions containing electronic records on media<br />Still in testing phase, but working towards implementing in production<br />
Types of Records and Media<br />Wide variety of records creators<br />Literary authors<br />University faculty<br />University offices<br />Architectural firms<br />Common types of media<br />Floppy disks: 5.25” and 3.5”<br />Optical media: CD-ROM, CD-R, DVD-R, etc.<br />Zip disks<br />USB flash drives<br />
Goals of Re-Accessioning<br />Identify, document, and register media<br />Mitigate risk of media deterioration and obsolescence<br />Extract basic metadata from filesystems on media and files contained on filesystems<br />
Disk Imaging<br />Using “forensic” (bit-level) imaging process<br />Ensure data on media is not manipulated using write-protection<br />Uses software to acquire images<br />Includes hash-based verification process<br />
Media Log<br />Using SharePoint list<br />Contains unique identifier of media<br />Records physical/logical characteristics of media<br />Documents success, failure, or status of various processes and additional notes<br />
Metadata Extraction<br />Can be repurposed for descriptive, administrative, and technical metadata<br />Uses command-line tools (Sleuthkit, fiwalk)<br />Outputs XML document<br />
Packaging and Transfer<br />Using BagIt packages/Bagger application<br />Packages contain disk images, extracted metadata, imaging logs, and high-level accession information<br />Transfer to storage is verified by comparison against manifest<br />