Accessioning-Based Metadata Extraction and Iterative Processing: Notes From the Field
Mark A. Matienzo, Yale University Library
CurateGear: Enabling the Curation of Digital Collections
January 6, 2012
mark@matienzo.org http://matienzo.org/ @anarchivist
Accessioning Workflow
[Workflow diagram: start accessioning process → retrieve media → record identifying characteristics of media as metadata → write-protect media → create disk image → verify image → extract filesystem- and file-level metadata → assign identifiers to image and media → package images and metadata for transfer → ingest transfer package → document accessioning process → end. Outputs include disk images, media metadata, and filesystem-/file-level metadata.]
Metadata Extraction
•Desire to repurpose existing information as
archival description and reports to other staff
•Ideal output is XML; can be packaged with disk images
going into medium- or long-term storage
•Tools: Fiwalk/Sleuthkit; FTK Imager; testing others
Sample DFXML Output
<?xml version='1.0' encoding='UTF-8'?>
<dfxml version='1.0'>
<metadata
xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:type>Disk Image</dc:type>
</metadata>
<creator version='1.0'>
<!-- provenance information re: extraction - software used; operating system -->
</creator>
<source>
<image_filename>2004-M-088.0018.dd</image_filename>
</source>
<volume offset='0'><!-- partitions within each disk image -->
<fileobject><!-- files within each partition --></fileobject>
</volume>
<runstats><!-- performance and other statistics --></runstats>
</dfxml>
Current Advantages
•Faster (and more forensically sound) to extract metadata
once rather than having to keep processing an image
•Develop better assessments during accessioning process
(directory structure significant? timestamps accurate?)
•Building supplemental tools takes less time
Gumshoe
•Prototype based on Blacklight (Ruby on Rails + Solr)
•Indexing code works with fiwalk output or directly from a
disk image
•Populates Solr index with all file-level metadata from
fiwalk and, optionally, text strings extracted from files
•Provides searching, sorting and faceting based on
metadata extracted from filesystems and files
•Code at http://github.com/anarchivist/gumshoe
Current Limitations
•Use of fiwalk limited to specific types of filesystems
•Additional software requires additional integration and
data normalization
•DFXML is not (currently) a metadata format common
within domains of archives/libraries
•Extracted metadata may be harder to repurpose for
descriptive purposes based on level of granularity
Let me provide some background on the state of affairs at Yale University. There are two main special collections units within the Yale University Library that hold the majority of its born-digital records: Manuscripts and Archives, where I work, and the Beinecke Rare Book and Manuscript Library. In addition to these units, other special collections units, such as those within the Haas Arts Library and the Library at the School of Divinity, have some digital records within their holdings. Manuscripts and Archives and the Beinecke both participated in the Andrew W. Mellon Foundation-funded "AIMS" project, Born-Digital Collections: An Inter-Institutional Model for Stewardship, along with the University of Virginia, Stanford University, and the University of Hull. Much of the work that the Yale team contributed to the AIMS project dealt specifically with developing workflows and procedures for accessioning digital records. In their own distinct ways, Manuscripts and Archives and the Beinecke have both implemented iterative processing models based on Dennis Meissner and Mark Greene's "More Product, Less Process" model. In particular, Manuscripts and Archives has focused heavily on an "accessioning as processing" model. As we found, however, implementing this for digital records was significantly more complicated.
Because many of the digital records in our custody were underdescribed even within the accession records, and because the records in these accessions have not been transferred off their original media, we have been focusing our efforts on "reaccessioning" our previous accessions. Our accessions contain many relatively common forms of digital media: 3.5" and 5.25" floppy disks from a variety of systems, optical media such as CD-ROMs and DVDs, flash media, hard drives, and, occasionally, full computer systems. The primary goals of reaccessioning have been to identify, document, and register media within previous accessions, to mitigate the risk of media deterioration and obsolescence, and to extract basic metadata from the filesystems on the media and the files contained on those media.

This is a diagram of the prototype workflow that we developed at Yale based on our work on the AIMS project. Our workflow contains a set of steps that allow us to achieve the goals I have just mentioned, including the process of acquiring a forensic, low-level disk image of the media. The forensic imaging process uses a combination of hardware and software to prevent accidental modification of the "original" data on the media we received. While I am happy to talk with you during the breakout session about the rest of our process, the remainder of my presentation will focus on our metadata extraction process, wherein we gather information about the filesystems and files contained within those disk images.
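For those curious about the "verify image" step in the workflow, the sketch below shows the general idea in Python: hash the acquired disk image and compare it against a hash recorded at acquisition time. This is an illustration only, not our production tooling; the .md5 sidecar file name and the choice of MD5 are assumptions.

import hashlib

def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Hash recorded at acquisition time, stored in a hypothetical sidecar file
    recorded = open("2004-M-088.0018.dd.md5").read().split()[0]
    computed = md5_of("2004-M-088.0018.dd")
    print("verified" if computed == recorded else "MISMATCH")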
At Yale, we hoped that the metadata extraction process would yield descriptive, structural, and technical metadata that could easily be repurposed for minimal archival description, to support intellectual and perhaps contextual control, for basic preservation purposes, and for reports to other staff. For our purposes, we have been aiming to get XML-based metadata out of the extraction process because it seemed easiest to repurpose; many archivists at Yale already have some experience working with XML because of our use of Encoded Archival Description.

To extract the metadata from disk images, we have primarily used Simson Garfinkel's fiwalk, which generates Digital Forensics XML, or DFXML, metadata. DFXML is an open standard currently under development within the open source digital forensics community, and fiwalk depends on the open source Sleuthkit library. We have decided to include DFXML for the time being as part of the disk image packages that we are placing in storage. A major motivation for including the extracted metadata with the stored disk image is that it is significantly easier - and arguably much safer - to parse and process the metadata than it is to parse the disk image.

In addition to fiwalk, we have also used AccessData's FTK Imager, a commercial application that can be downloaded at no charge, which optionally extracts metadata in CSV form. We have also been testing a few other applications, some of which I will mention later.
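As a rough illustration of how the extraction step fits together, the sketch below wraps fiwalk so that the DFXML lands next to the disk image it describes, ready to be packaged with it. It assumes fiwalk's -X option (write the DFXML report to a named file) and an illustrative naming convention; it is not our actual packaging code.

import pathlib
import subprocess

def extract_dfxml(image_path):
    """Run fiwalk against a disk image and write DFXML alongside it."""
    image = pathlib.Path(image_path)
    dfxml = image.with_suffix(".xml")  # e.g. 2004-M-088.0018.xml
    subprocess.run(["fiwalk", "-X", str(dfxml), str(image)], check=True)
    return dfxml

if __name__ == "__main__":
    print(extract_dfxml("2004-M-088.0018.dd"))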
This is an overview of the structure of a DFXML file as created by fiwalk. Within this output there are five main sections: metadata about the DFXML instance itself; provenance information about how the instance was created; information about the source of the metadata; metadata extracted from the contents of the disk image (that is, the partitions and files); and runtime statistics.
This is a sample portion of output from fiwalk for a disk image from our collections. Specifically, this section relates to a single file within a disk image of a floppy disk. You can see information such as the file name, its size, the fact that the file was deleted, its modification, access, and creation times, and hash values for the file. There is also a basic assessment of the file type - in this case, the file is a WordPerfect file.
There are a number of tangible advantages that we've seen in using fiwalk and generating DFXML at Yale. First, it is considerably faster to extract the metadata once and reuse it than to re-extract the metadata every time it is used. Once extracted, processing the metadata from an extremely large disk image takes considerably fewer computational resources than working from the disk image over and over.

It also allows us to make better assessments, such as whether the files or their metadata within the image have been modified, or getting a better sense of the structure of the entire accession in terms of directories.

Finally, it is significantly easier to create additional tools that work with the extracted metadata than to build on the underlying stack provided by Sleuthkit, since it only requires parsing XML data.
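To give a sense of how little code a supplemental tool needs once the metadata is in DFXML, here is a small sketch that summarizes file counts and total bytes per partition using only the Python standard library. The element names (volume, fileobject, filename, filesize) follow the fiwalk output shown earlier; treat them as assumptions if your fiwalk version differs.

import sys
import xml.etree.ElementTree as ET

def summarize(dfxml_path):
    """Print a one-line summary for each partition in a fiwalk DFXML file."""
    tree = ET.parse(dfxml_path)
    for volume in tree.iter("volume"):
        files = volume.findall("fileobject")
        total = sum(int(f.findtext("filesize") or 0) for f in files)
        print("partition at offset %s: %d files, %d bytes"
              % (volume.get("offset"), len(files), total))

if __name__ == "__main__":
    summarize(sys.argv[1])  # e.g. python summarize.py 2004-M-088.0018.xml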
One such tool is a prototype application called Gumshoe, which is a Ruby on Rails-based application using Blacklight. The application indexes metadata extracted from a disk image, stores it in a Solr index, and provides searching, sorting, and faceting of the extracted data. For example, you can sort based on modification times of files or by file name. You can also optionally index any text found within each file. The code for this prototype has been released as open source under the Apache 2.0 license and is available from the URL shown above.
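Gumshoe itself is written in Ruby, so the following is not its indexing code; it is only a Python sketch of the general pattern of pushing file-level metadata from DFXML into Solr. The Solr URL, core name, and dynamic field names are assumptions for illustration.

import json
import urllib.request
import xml.etree.ElementTree as ET

SOLR_UPDATE_URL = "http://localhost:8983/solr/gumshoe/update?commit=true"  # assumed

def docs_from_dfxml(dfxml_path, image_name):
    """Turn each fileobject in a DFXML file into a flat Solr document."""
    tree = ET.parse(dfxml_path)
    for i, fobj in enumerate(tree.iter("fileobject")):
        yield {
            "id": "%s/%d" % (image_name, i),
            "image_s": image_name,
            "filename_s": fobj.findtext("filename"),
            "filesize_l": int(fobj.findtext("filesize") or 0),
            "mtime_s": fobj.findtext("mtime"),
        }

def index(dfxml_path, image_name):
    payload = json.dumps(list(docs_from_dfxml(dfxml_path, image_name)))
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.reason)

if __name__ == "__main__":
    index("2004-M-088.0018.xml", "2004-M-088.0018.dd")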
This is a sample screenshot of Gumshoe browsing the metadata of all the files from two distinct disk images - one containing over 1,200 files and the other containing 39 files. The files within the results are sorted by size. On the left-hand side, you can see the various facets available for filtering the search results.
Beyond the benefits that I previously mentioned, there are a number of limitations that we have experienced. Arguably, the one we struggle with the most at Yale is the set of limitations of fiwalk and Sleuthkit, the underlying framework it uses to process disk images and extract their contents. Specifically, Sleuthkit provides fiwalk with the ability to interpret the various types of filesystems. However, even with only a moderate number of accessions of electronic records, we have run into serious barriers with certain filesystems being unsupported. In particular, we have had issues interpreting original HFS filesystems - the filesystems used on much earlier versions of Mac OS - and much older media, such as Kaypro CP/M disks. It is worth noting that many of these are not uncommon formats.

While we have tested other applications to extract metadata, such as FTK Imager for HFS volumes and CPMTools for CP/M disks, the data output by these programs is not in DFXML form. It therefore requires additional processing to get it into a common format that is parseable by the tools we hope to use to process, repurpose, and interpret the extracted metadata. A sketch of that kind of normalization step follows below.

DFXML also shows quite a lot of promise in terms of extensibility and its continued development. However, it is not currently used widely within the library and archives world, and any repurposing of this metadata will require considerable development and planning time to create appropriate crosswalks.

Finally, we have found that the metadata extracted in this process is not always appropriate for archivists wanting to repurpose it as descriptive information. In some cases, the extracted metadata does not reflect an appropriate level of granularity. While clearly named files may be useful, especially when directory structures indicate a pre-existing arrangement, these concepts do not always map neatly onto archival concepts such as levels of aggregation. In many cases, metadata about individual files may be too granular for the level of description created during preliminary processing at or soon after accessioning.
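As mentioned above, output from other tools needs to be normalized before our DFXML-based tools can use it. The sketch below shows one possible approach: converting a CSV file listing (for example, one exported from FTK Imager) into minimal DFXML-style fileobject records. The column names ("Filename", "Size", "Modified") are hypothetical; map them to whatever your export actually contains.

import csv
import sys
import xml.etree.ElementTree as ET

def csv_to_dfxml(csv_path, xml_path):
    """Write a minimal DFXML-style document built from a CSV file listing."""
    root = ET.Element("dfxml", version="1.0")
    volume = ET.SubElement(root, "volume")
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            fobj = ET.SubElement(volume, "fileobject")
            ET.SubElement(fobj, "filename").text = row["Filename"]
            ET.SubElement(fobj, "filesize").text = row["Size"]
            ET.SubElement(fobj, "mtime").text = row["Modified"]
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    csv_to_dfxml(sys.argv[1], sys.argv[2])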
Thank you for your time. Please feel free to speak with me during the session later in the afternoon if you have further questions.