1. Maximizing Description to Enhance Access to Born-Digital Archival Collections
Seeley G. Mudd Manuscript Library
Princeton University Library
Rossy Mendez, Public Services Project Archivist
Jarrett M. Drake, Digital Archivist
CURATEcamp, Brooklyn Historical Society
April 23, 2015
2.
3. “How we describe the collections in our care influences the ability of people to discover, access, use and interpret them” (Trends in Practice: Archival Arrangement and Description, pg. 17)
[Slides 4–10: finding aid examples illustrating the <extent>, <scopecontent>, and <unittitle> elements]
11. Multi-level Description of Digital Records
Reality
For born-digital records, the Archive’s existing descriptive
workflows failed to provide sufficient context and precision for
<did> elements, including <unitdate>, <unittitle>, & <extent>.
Challenge
For multi-level records, how does one create these
elements programmatically?
14. Revised Workflow
Question
What are the key metadata points we should extract from
born-digital records and later represent in EAD?
Answer
1. Names of each folder → <unittitle>
2. Modified dates of the oldest and newest files → <unitdate>
3. Number of folders and number of files → <extent>
15. Current Description Workflow
Complete digital records processing documentation for Mudd Library can be found at:
http://rbsc.princeton.edu/policies/guidance-recommended-file-formats
Workflow file formats: .txt → .csv → .xls → .xml
16. Current Description Workflow: Extract
Shell script to extract <unittitle>, <unitdate>, and <extent> values (run with -maxdepth 1 to describe top-level folders only)
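The script itself appears on the slide only as a screenshot, so here is a minimal sketch of the approach described in the speaker notes below; the $root argument, the two-pass find calls, and the tab-separated output format are assumptions for illustration, not the actual Mudd script.

    #!/usr/bin/env bash
    # Minimal sketch (not the actual Mudd script): for each top-level folder
    # in an accession, emit the folder name, the modified dates of the oldest
    # and newest files, and the counts of folders and files, tab-separated.
    root="$1"  # hypothetical: path to the accession directory

    find "$root" -mindepth 1 -maxdepth 1 -type d | sort | while read -r dir; do
        name=$(basename "$dir")
        # ISO-formatted modified dates of the oldest and newest files
        oldest=$(find "$dir" -type f -printf '%TY-%Tm-%Td\n' | sort | head -n 1)
        newest=$(find "$dir" -type f -printf '%TY-%Tm-%Td\n' | sort | tail -n 1)
        # counts of folders and files beneath this top-level folder
        folders=$(find "$dir" -mindepth 1 -type d | wc -l)
        files=$(find "$dir" -type f | wc -l)
        printf '%s\t%s\t%s\t%s\t%s\n' "$name" "$oldest" "$newest" "$folders" "$files"
    done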
17. Current Description Workflow: Transform
Output of shell script as .txt file
Output of shell script transformed into EAD
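To make the transform step concrete, here is a hedged illustration of how one tab-separated row of the script's output could become a component-level <did>; Mudd's actual pipeline runs through LibreOffice Calc and an XSLT stylesheet in oXygen, so the awk stand-in below, the output.txt filename, and the level="file" attribute are assumptions for illustration only.

    # Illustration only: an awk stand-in for the XSLT step, mapping one row
    # (title, oldest date, newest date, folder count, file count) to EAD.
    awk -F'\t' '{
        printf "<c02 level=\"file\">\n  <did>\n"
        printf "    <unittitle>%s</unittitle>\n", $1
        printf "    <unitdate normal=\"%s/%s\">%s to %s</unitdate>\n", $2, $3, $2, $3
        printf "    <extent>%s folders and %s files</extent>\n", $4, $5
        printf "  </did>\n</c02>\n"
    }' output.txt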
18. Description Workflow Enhancements
Eliminate string values for <extent> elements and minimize post-processing of data
Use topic modeling for textual data (fondz or another program) and write scripts for basic textual analysis (e.g., automated page counts for PDFs; see the sketch below)
Index all names of directories and files and represent their structure through a file browser embedded in the finding aid and/or the repository (Hydra)
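As one possible shape for the automated PDF page count mentioned above (an assumption, not an existing Mudd script), pdfinfo from poppler-utils can total pages across an accession:

    # Hypothetical sketch: total the page counts of every PDF under $root.
    total=0
    while IFS= read -r -d '' pdf; do
        pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
        total=$(( total + ${pages:-0} ))
    done < <(find "$root" -type f -iname '*.pdf' -print0)
    echo "Total PDF pages: $total"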
Editor's Notes
Hello, I am Rossy Mendez, the Public Services Project Archivist at the Seeley G. Mudd Manuscript Library, and my colleague is Jarrett Drake, the Digital Archivist at Mudd. We are here to talk to you about our process of maximizing description to enhance access to born-digital archival collections.
First I wanted to talk a little bit about where we work. The Mudd Manuscript Library is part of the Rare Books and Special Collections Division at Princeton University. Our library houses and provides access to the university archives and public policy collections. We have over 30,000 linear feet of records of diverse media as well as several collections that contain born-digital material.
One rather unique thing about Mudd is that there is not a hard line between technical services and public services. With the exception of the records manager, everyone on the team participates in reference duties by taking on a number of reference shifts that entail assisting on-site and remote patrons and paging as needed. The benefit of this approach is that the work of technical services is informed by how patrons use the collections and the resources they use to find them.
Why be concerned with description? [quote]
At Mudd the description in our finding aids is driven by three principles:
A user should be able to quickly gather what and how much born digital content exists.
A user should be able to know where the digital content lives within the finding aid and have easy access to that content.
And last but not least a user should be able to understand the context in which these records were created.
Because users approach records with different research questions and arrive from different information points, we strive to provide description at different levels.
One of the things that we instituted early was providing access to born-digital content through the finding aid. By clicking the View Content button, users were able to go to the file in our Webspace file system. The problem with this was that there was no distinction between digitized content and that which was born digital, which was particularly problematic in collections that contained both types of content. Early attempts at description, such as the <scopecontent> note in this finding aid, provided minimal description but no specific information about the type of content or where it could be found. Beyond this, neither the extent field nor the series header indicated to the user that born-digital content was included.
Over the last year and a half at Mudd we made some significant changes to the description of born digital materials.
Perhaps the most significant change we have made to the description of born-digital records is including the amount of born-digital material. At first we thought of the <extent> field as the amount of physical space that the material occupied. But this information excluded any sense of depth and arrangement and therefore was not a good reflection of reality. We decided instead to use the <extent> field to convey the quantity of materials and provide additional arrangement information.
Another issue was the use of the word "electronic" versus the word "digital." The change to "digital" is a more accurate portrayal of the nature of the records, since today we use mostly computers and not other electronic devices.
The <unittitle> field plays an important role in differentiating between digitized and born-digital content. Without this designation, on the user's end there is no quick way to tell what material is digitized versus born digital, because the access path is the same. With this in mind, we ultimately made the transition to using the word "digital" in the <unittitle> field, which is used to describe series/subseries.
The <scopecontent> EAD tag, which maps to "Description," is perhaps the most beneficial to the patron: first, it lets the end user know right away that digital material is included, and second, it allows for a listing of the record formats contained within a digital series.
Recently the <unitdate> element has also undergone some revision so that it more accurately reflects the creation dates of these records. In his presentation Jarrett will address some of the work being done in this area. I will now turn it over to him so that he can explain some of our workflows and the practical components of these applications.
And so as Rossy showed, our description for born-digital records has been up and down, lots of downs
And the problem, as you see stated here, is that our description lacked critical context and critical precision…that was just the reality
The challenge [click] posed by this reality: how does one generate that context and precision programmatically?
And by multi-level, I am drawing a distinction from flat digital records with no hierarchy, which you typically find in oral history collections or other communication or publication record types
And meeting that challenge is something that our previous workflow wasn’t able to handle
Pictured here is our digital accessioning overview from 2012…this was a huge step forward from previous practice, and I’m thankful to my predecessors for their work
In the fall of 2013 when I started, our digital archives workstation ran Windows, which I didn't know I hated then but know now. We used FTK Imager for disk imaging, Karen's Directory Printer for directory printing, and Bagger for creating fixity information and AIPs.
My first multi-level, complex digital collection was a set of records from the University’s first woman president, an accession that contained more than 20,000 digital files and roughly 75 top-level folders.
To create a <unittitle>, I opened the FTK Imager .csv output, sorted it alphabetically by full path, and cut and pasted top-level folder paths into an AT resource record.
To create a <unitdate>, I eyeballed the earliest and latest four-digit years in the Modified date column and manually typed them into an AT resource record.
To create an <extent>, I opened Windows Explorer, right-clicked to open Properties, and manually entered the file count and size directly into the exported EAD.
I hopefully don't have to explain to everyone here how problematic this was. It's not that it took a terrible amount of time; given that I only did this for 75 folders, I probably had all of this information in AT after a couple of days.
BUT. Those things that we can do quickly in a manual fashion will not suffice when the orders of magnitude increase. More importantly, this way of generating descriptive elements said nothing of what materials lived below this level, and actually didn’t indicate that things lived below at all. So in many ways this description I did 18 months ago failed in both context and precision.
And so archivists at Mudd stepped back and said: we know that the relationships we wanted to represent already existed in the filesystem. So our next question became: how do we extract them directly, reliably, and without human intervention?
In summer of 2014, we started using BitCurator on our FRED and ended our complicated relationship with Windows and Windows-related products.
Between our digital initiative analyst, Rossy, and myself, we listed in plain English the types of questions we wanted to ask of our multi-level digital records: We said for each directory we wanted [click]:
The name of the directory (not files!)
The modified dates of the oldest and newest files
The number of folders and files
With a clear idea of the metadata we needed to extract from born-digital records, I broke down the creation of the component-level <did> elements into four small steps: extract (bash), prepare (LibreOffice Calc), import (oXygen), and transform (oXygen).
Outside the focus of this talk: you can see that I’ve written a similar step for creating <scopecontent> notes. You can find that complete workflow along with the rest of our digital records procedures linked at the bottom of this webpage, but for now I’ll explain and show images of our data extraction and transformation for the component-level <did> elements.
And so because we transitioned our workstation to BitCurator, we were now working in an Ubuntu OS environment, so we turned to the default shell in Linux, which is bash, to extract these data points that were already embedded in the filesystem and could be easily extracted without too much effort.
We wrote a simple for loop in bash that stitched together different iterations of the find command, and it took many drafts to get this script to function the way we needed it to…Rossy can recall our frequent Thursday setbacks and near misses.
Initially this script populated all folder titles…so, if the accession had 800 folders, you would feasibly have 800 multilevel components…but, again, given the depth of some accessions in University Archives and the fact that simply revealing the metadata of some files (such as a <unittitle> that read Discipline/Humanities/John Doe) would be an unlawful disclosure of sensitive and legally-protected information, we added the -maxdepth option on the loop to only grab <did> info for top-level folders. We can, and likely will, simply amend this part of the script depending on a collection's need and access restrictions.
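A rough sketch of what that amendment might look like, with a hypothetical depth parameter in place of the hard-coded value:

    # Hypothetical: parameterize how deep the loop walks, so a collection
    # without access restrictions can expose deeper folder names.
    depth="${2:-1}"  # default to top-level folders only
    find "$root" -mindepth 1 -maxdepth "$depth" -type d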
After the script finishes running, we take this original text file and concatenate a few fields in OpenOffice, before we import the .xls into oXygen and transform that .xml into EAD with an XSLT stylesheet, after which we normalize the EAD in the same way that we normalize all of our finding aids.
Even though we still currently have to put the raw text file through a series of transformations, we've been able to eliminate all rekeying and copying/pasting and produce a computer-generated description in a matter of seconds with very little manual intervention. Archivists do any folder name cleanup (e.g., expanding abbreviations) directly in the EAD.
This computer-generated description is much richer in terms of its context and much more precise in its metadata [highlight the transition from simple 4 digit <unitdates> to ISO-formatted <unitdates>], allowing our archivists to assume intellectual control of born-digital records much more programmatically, reliably, and efficiently.
In ascending order of difficulty, I think these are the next steps for improving our descriptive practice and serving born-digital records to researchers more contextually and precisely.