Maximizing Description to Enhance Access to Born-Digital Archival Collections
Seeley G. Mudd Manuscript Library
Princeton University Library
Rossy Mendez, Public Services Project Archivist
Jarrett M. Drake, Digital Archivist
CURATEcamp, Brooklyn Historical Society
April 23, 2015
“How we describe
the collections in our
care influences the
ability of people to
discover, access, use
and interpret them”
Trends in Practice:
Archival Arrangement and
Description, pg 17.
<extent>
<scopecontent>
<unittitle>
The beginnings…
<extent>
1. Physical Space → Quantity/Arrangement
2. Electronic → Digital
<unittitle>
Office of President
Records, Shirley Tilghman
Subgroup (AC379)
<scopecontent>
Series Level
“One third of the digital files are a
mixture of PDF’s and Excel
Spreadsheets”
<phystech>
<unitdate>
Multi-level Description of Digital Records
Reality
For born-digital records, the Archive’s existing descriptive
workflows failed to provide sufficient context and precision for
<did> elements, including <unitdate>, <unittitle>, & <extent>.
Challenge
For multi-level records, how does one create these
elements programmatically?
Previous Workflow
Create disk image
Previous Workflow
CSV output from FTK Imager
AT Resource Record
Windows Explorer
EAD <did> element
Revised Workflow
Question
What are the key metadata points we should extract from
born-digital records and later represent in EAD?
Answer
1. Names of each folder → <unittitle>
2. Modified dates of the oldest and newest files → <unitdate>
3. Numbers of folders and numbers of files → <extent>
Current Description Workflow
Complete digital records processing for Mudd Library can be found at:
http://rbsc.princeton.edu/policies/guidance-recommended-file-formats
.txt
.csv
.xls
.xml
Current Description Workflow: Extract
Shell script to extract <unittitle>, <unitdate>, and <extent> values
-maxdepth 1
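The extract step could look roughly like the sketch below. It is not the script used at Mudd, only an illustration of the approach: a loop over top-level folders that runs `find` several times to collect the folder name, the oldest and newest modified dates, and the folder and file counts. The function name `summarize_accession` and the tab-separated output layout are assumptions made for this example, and the `-printf` option requires GNU find.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the Mudd Library script.
# Assumes GNU find (for -printf), as shipped with Ubuntu.
# Folder names containing newlines are not handled.

summarize_accession() (
  accession_dir=$1
  # -maxdepth 1 limits the loop to top-level folders, so the names of
  # nested folders (which may be sensitive) never reach the output.
  find "$accession_dir" -mindepth 1 -maxdepth 1 -type d | sort |
  while IFS= read -r dir; do
    name=$(basename "$dir")
    # ISO-formatted modified dates of every file below this folder, ascending.
    dates=$(find "$dir" -type f -printf '%TY-%Tm-%Td\n' | sort)
    oldest=$(printf '%s\n' "$dates" | sed -n '1p')
    newest=$(printf '%s\n' "$dates" | sed -n '$p')
    # Arithmetic expansion strips any padding that wc adds.
    folders=$(( $(find "$dir" -mindepth 1 -type d | wc -l) ))
    files=$(( $(find "$dir" -type f | wc -l) ))
    # One tab-separated line per top-level folder.
    printf '%s\t%s\t%s\t%s\t%s\n' "$name" "$oldest" "$newest" "$folders" "$files"
  done
)
```

Calling `summarize_accession /path/to/accession > accession.txt` (a placeholder path) would write one line per top-level folder. Raising the `-maxdepth` value on the outer `find` would describe deeper levels where access restrictions allow it.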
Current Description Workflow: Transform
Output of shell script as .txt file
Output of shell script transformed into EAD
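As a rough illustration of the transform step, the sketch below turns tab-separated folder summaries into EAD `<c>` components. The workflow described in this talk uses a spreadsheet, oXygen, and an XSLT stylesheet; this example swaps in `awk` purely to keep the illustration in shell. The function name `summaries_to_ead`, the input column order (name, oldest date, newest date, folder count, file count), the displayed year range, and the wording of the `<extent>` string are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: awk stands in for the spreadsheet and XSLT steps.
# Reads tab-separated lines on stdin: name, oldest, newest, folders, files.

summaries_to_ead() {
  awk -F '\t' '
    # Escape the characters that are not allowed in XML text.
    function xml_escape(s) {
      gsub(/&/, "\\&amp;", s)
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
    }
    # Format a count with a singular or plural noun.
    function count(n, noun,    suffix) {
      suffix = (n + 0 == 1) ? "" : "s"
      return (n + 0) " " noun suffix
    }
    {
      printf "<c>\n  <did>\n"
      printf "    <unittitle>%s</unittitle>\n", xml_escape($1)
      # Skip the date element for folders that contain no files.
      if ($2 != "") {
        normal = ($2 == $3) ? $2 : ($2 "/" $3)
        y1 = substr($2, 1, 4)
        y2 = substr($3, 1, 4)
        shown = (y1 == y2) ? y1 : (y1 "-" y2)
        printf "    <unitdate normal=\"%s\">%s</unitdate>\n", normal, shown
      }
      printf "    <physdesc><extent>%s and %s</extent></physdesc>\n", count($4, "folder"), count($5, "file")
      printf "  </did>\n</c>\n"
    }
  '
}
```

The ISO-formatted dates go in the `normal` attribute while the element text stays a readable year range. The output would still need the same normalization applied to any other finding aid.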
Description Workflow Enhancements
• Eliminate string values for <extent> elements and minimize post-processing of data
• Use topic modeling for textual data (fondz or another program) and write scripts for basic textual analysis (e.g., automated page count for PDFs)
• Index all names of directories and files and represent their structure through a file browser embedded in the finding aid and/or the repository (Hydra)


Editor's Notes

  1. Hello, I am Rossy Mendez and I am the Public Services Project Archivist at the Seeley G. Mudd Manuscript Library and my colleague is Jarrett Drake who is the Digital Archivist at Mudd and we are here to talk to you about our process in maximizing description to enhance access to born-digital archival collections.
  2. First I wanted to talk a little bit about where we work. The Mudd Manuscript Library is part of the Rare Books and Special Collections Division at Princeton University. Our library houses and provides access to the university archives and public policy collections. We have over 30,000 linear feet of records of diverse media as well as several collections that contain born-digital material. One rather unique thing about Mudd is that there is not a hard-lined division between technical services and public services. With the exception of the records manager, everyone on the team participates in reference duties by taking on a number of reference shifts that entail assisting on-site and remote patrons and paging as needed. The benefit of this approach is that the work of technical services is informed by how patrons use the collections and the resources used to find them.
  3. Why be concerned with description? [quote] At Mudd the description in our finding aids is driven by three principles: A user should be able to quickly gather what and how much born digital content exists. A user should be able to know where the digital content lives within the finding aid and have easy access to that content. And last but not least a user should be able to understand the context in which these records were created. Because users approach records with different research questions and arrive from different information points, we strive to provide description at different levels.
  4. One of the things that we instituted early was providing access to the born-digital content through the finding aid. By clicking the view content button, users were able to go to the file in our Webspace file system. The problem with this was that there was no distinction between digitized content and that which was born digital, which was particularly problematic in collections that contained both types of content. Early attempts at description, such as the <scopecontent> note in this finding aid, provided minimal description but no specific information about the type of content or where it could be found. Other than this, neither the extent field nor the series header indicated to the user that born-digital content was included. Over the last year and a half at Mudd we made some significant changes to the description of born-digital materials.
  5. Perhaps the most significant change we have made to the description of born-digital records is including the amount of born-digital materials. At first we thought of the <extent> field as the amount of physical space that the material occupied. But this information excluded a sense of depth and arrangement and therefore was not a good reflection of reality. Therefore, we decided to focus instead on using the <extent> field to echo the quantity of materials and provide additional arrangement information. Another issue was the use of the word electronic vs. the word digital. The change is a more accurate portrayal of the nature of the records, since today we use mostly computers and not other electronic devices.
  6. The <unittitle> field plays an important role in differentiating between digitized and born-digital content. Without this designation on the user end there is no quick way to tell what material is digitized versus born digital because the access path is the same. Again, we ultimately made the transition to using the word digital in the <unittitle> field, which is used to describe series/subseries.
  7. The <scopecontent> EAD tag which maps out to “Description” is perhaps the most beneficial to the patron because it lets the end user know right away that there is digital material included and secondly it allows for a listing of record formats contained within a digital series.
  8. 9
  9. Recently the unitdate element has also undergone some revision so that it reflects more accurately the creation of these records. In his presentation Jarrett will address some of the work that is being done in this area. I will now turn it over to him so that he can explain some of our workflows and the practical components of these applications.
  10. And so as Rossy showed, our description for born-digital records has been up and down, lots of downs   And the problem, as you see stated here, is that our description lacked critical context and critical precision…that was just the reality   The challenge [click] posed by this reality: how does one generate that context and precision programmatically?   And by multi-level, I am drawing a distinction between flat digital records with no hierarchy, which you typically find in oral history collections or other types of communication or publication record types   And meeting that challenge is something that our previous workflow wasn’t able to handle
  11. Pictured here is our digital accessioning overview from 2012…this was a huge step forward from previous practice, and I’m thankful to my predecessors for their work   In the fall of 2013 when I started, our digital archives workstation ran Windows, which I didn’t know I hated then but know now, and we used FTK Imager for disk imaging, Karen’s Directory Printer for directory printing, and Bagger for creating fixity information and AIP’s.
  12. My first multi-level, complex digital collection was a set of records from the University’s first woman president, an accession that contained more than 20,000 digital files and roughly 75 top-level folders. To create a <unittitle>, I opened the FTK Imager CSV output, sorted it alphabetically by full path, and cut and pasted top-level folder paths into an AT resource record. To create a <unitdate>, I eyeballed the earliest and latest four-digit years among the Modified dates and manually typed them into an AT resource record. To create an <extent>, I opened Windows Explorer, right-clicked to open Properties, and manually entered the file count and size directly into the exported EAD. I hopefully don’t have to explain to everyone here how problematic this was. It’s not that it took a terrible amount of time; given that I only did this for 75 folders, I probably had all of this information in AT after a couple of days. BUT. Those things that we can do quickly in a manual fashion will not suffice when the orders of magnitude increase. More importantly, this way of generating descriptive elements said nothing of what materials lived below this level, and actually didn’t indicate that things lived below at all. So in many ways this description I did 18 months ago failed in both context and precision.
  13. And so archivists at Mudd stepped back and said: we know that the relationships we wanted to represent already existed in the filesystem, so our next question became: how do we extract it directly, reliably, and without human intervention?   In summer of 2014, we started using BitCurator on our FRED and ended our complicated relationship with Windows and Windows-related products.   Between our digital initiative analyst, Rossy, and myself, we listed in plain English the types of questions we wanted to ask of our multi-level digital records: We said for each directory we wanted [click]: The name of the directory (not files!) The modified dates of the oldest and newest files The number of folders and files
  14. With a clear idea of the metadata we needed to extract from born-digital records, I broke down the creation of the component-level <did> elements into four small steps: extract (bash), prepare (LibreOffice Calc), import (oXygen), and transform (oXygen).   Outside the focus of this talk: you can see that I’ve written a similar step for creating <scopecontent> notes. You can find that complete workflow along with the rest of our digital records procedures linked at the bottom of this webpage, but for now I’ll explain and show images of our data extraction and transformation for the component-level <did> elements.
  15. And so because we transitioned our workstation to BitCurator, we were now working in an Ubuntu OS environment, so we turned to the default shell in Linux, which is bash, to extract these data points that were already embedded in the filesystem and could be extracted without too much effort. We wrote a simple for loop in bash that stitched together different iterations of the find command, and it took many drafts to get this script to function the way we needed it to…Rossy can recall our frequent Thursday enough setbacks and near misses. Initially this script populated all folder titles…so, if the accession had 800 folders, you would feasibly have 800 multilevel components…but, again, given the depth of some accessions in University Archives and the fact that simply revealing the metadata of some files (such as a <unittitle> that read Discipline/Humanities/John Doe) would be an unlawful disclosure of sensitive and legally protected information, we added the -maxdepth option on the loop to only grab <did> info for top-level folders. We can, and likely will, simply amend this part of the script depending on a collection’s need and access restriction.
  16. After the script finishes running, we take this original text file and concatenate a few fields in OpenOffice, before we import the .xls into oXygen and transform that .xml into EAD with an XSLT stylesheet, after which we normalize the EAD in the same way that we normalize all of our finding aids.   Even though we still currently have to put the raw text file through a series of transformations, we’ve been able to eliminate all rekeying and copying/pasting and produce a computer-generated description in a matter of seconds with very little manual intervention. Archivists do any folder name cleanup (i.e., expanding abbreviations) directly in the EAD.   This computer-generated description is much richer in terms of its context and much more precise in its metadata [highlight the transition from simple 4 digit <unitdates> to ISO-formatted <unitdates>], allowing our archivists to assume intellectual control of born-digital records much more programmatically, reliably, and efficiently.
  17. In ascending order of difficulty, I think these are the next steps for improving our descriptive practice and serving born-digital records to researchers more contextually and precisely.