SlideShare a Scribd company logo
1 of 11
Download to read offline
Accessioning-Based Metadata
Extraction and Iterative Processing:
       Notes From the Field
       Mark A. Matienzo, Yale University Library
 CurateGear: Enabling the Curation of Digital Collections
                   January 6, 2012

 mark@matienzo.org http://matienzo.org/ @anarchivist
Digital Archives at Yale
Accessioning Workflow
   Start
accessioning                              Write-protect media               Verify image
  process            Media




                                          Record identifying              Extract filesystem-     Disk     Meta-
                 Retrieve media            characteristics of               and file-level      images     data

                                          media as metadata                   metadata         Transfer package




                                                                           Package images
               Assign identifiers to                                                             Ingest transfer
                                            Create image                  and metadata for
                     media                                                                         package
                                                                               ingest




                                  Media                         FS/File                           Document
                                                Disk
                                   MD                            MD                              accessioning
                                               image
                                                                                                   process




                                                                                                     End
                                                                                                 accessioning
                                                                                                   process
Metadata Extraction
•Desire to repurpose existing information as
  archival description and reports to other staff

•Ideal output is XML; can be packaged with disk images
  going into medium- or long-term storage

•Tools: Fiwalk/Sleuthkit; FTK Imager; testing others
Sample DFXML Output
<?xml version='1.0' encoding='UTF-8'?>
<dfxml version='1.0'>
  <metadata
  xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML'
  xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <dc:type>Disk Image</dc:type>
  </metadata>
  <creator version='1.0'>
    <!-- provenance information re: extraction - software used; operating system -->
  </creator>
  <source>
    <image_filename>2004-M-088.0018.dd</image_filename>
  </source>
  <volume offset='0'><!-- partitions within each disk image -->
      <fileobject><!-- files within each partition --></fileobject>
  </volume>
  <runstats><!-- performance and other statistics --></runstats>
</dfxml>
Sample DFXML Output
<fileobject>
  <filename>_ublist1.wpd</filename>
  <partition>1</partition>
  <id>1</id>
  <name_type>r</name_type>
  <filesize>202152</filesize>
  <unalloc>1</unalloc>
  <used>1</used>
  <inode>3</inode>
  <meta_type>1</meta_type>
  <mode>511</mode>
  <nlink>0</nlink>
  <uid>0</uid>
  <gid>0</gid>
  <mtime>2001-02-22T22:30:52Z</mtime>
  <atime>2001-02-22T05:00:00Z</atime>
  <crtime>2001-02-22T22:31:54Z</crtime>
  <libmagic>(Corel/WP)</libmagic>
  <byte_runs>
   <byte_run file_offset='0' fs_offset='16896' img_offset='16896' len='512'/>
  </byte_runs>
  <hashdigest type='md5'>d7bc22242c0a88fd8b68712980d5ab28</hashdigest>
  <hashdigest type='sha1'>64bf2bdf82e33fcda50158804483ac611e753db5</hashdigest>
</fileobject>
Current Advantages
•Faster (and more forensically sound) to extract metadata
  once rather than having to keep processing an image

•Develop better assessments during accessioning process
  (directory structure significant? timestamps accurate?)

•Building supplemental tools takes less time
Gumshoe
•Prototype based on Blacklight (Ruby on Rails + Solr)
•Indexing code works with fiwalk output or directly from a
  disk image

•Populates Solr index with all file-level metadata from
  fiwalk and, optionally, text strings extracted from files

•Provides searching, sorting and faceting based on
  metadata extracted from filesystems and files

•Code at http://github.com/anarchivist/gumshoe
Current Limitations
•Use of fiwalk limited to specific types of filesystems
•Additional software requires additional integration and
  data normalization

•DFXML is not (currently) a metadata format common
  within domains of archives/libraries

•Extracted metadata maybe harder to repurpose for
  descriptive purposes based on level of granularity
Thank You
mark@matienzo.org
http://matienzo.org/
twitter: @anarchivist

More Related Content

Viewers also liked

Open Source Forensics
Open Source ForensicsOpen Source Forensics
Open Source Forensics
CTIN
 
Windows 7 forensics -overview-r3
Windows 7 forensics -overview-r3Windows 7 forensics -overview-r3
Windows 7 forensics -overview-r3
CTIN
 
Raidprep
RaidprepRaidprep
Raidprep
CTIN
 
Translating Geek To Attorneys It Security
Translating Geek To Attorneys It SecurityTranslating Geek To Attorneys It Security
Translating Geek To Attorneys It Security
CTIN
 
Vista Forensics
Vista ForensicsVista Forensics
Vista Forensics
CTIN
 
July132000
July132000July132000
July132000
CTIN
 
Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3
CTIN
 
Edrm
EdrmEdrm
Edrm
CTIN
 

Viewers also liked (20)

Open Source Forensics
Open Source ForensicsOpen Source Forensics
Open Source Forensics
 
Windows 7 forensics -overview-r3
Windows 7 forensics -overview-r3Windows 7 forensics -overview-r3
Windows 7 forensics -overview-r3
 
NTFS file system
NTFS file systemNTFS file system
NTFS file system
 
Raidprep
RaidprepRaidprep
Raidprep
 
Windows 8 Forensics & Anti Forensics
Windows 8 Forensics & Anti ForensicsWindows 8 Forensics & Anti Forensics
Windows 8 Forensics & Anti Forensics
 
Translating Geek To Attorneys It Security
Translating Geek To Attorneys It SecurityTranslating Geek To Attorneys It Security
Translating Geek To Attorneys It Security
 
Social Media Forensics for Investigators
Social Media Forensics for InvestigatorsSocial Media Forensics for Investigators
Social Media Forensics for Investigators
 
Cheatsheet of msdos
Cheatsheet of msdosCheatsheet of msdos
Cheatsheet of msdos
 
Vista Forensics
Vista ForensicsVista Forensics
Vista Forensics
 
Netcat cheat sheet
Netcat cheat sheetNetcat cheat sheet
Netcat cheat sheet
 
Web and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsWeb and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News Professionals
 
Windows 10 Forensics: OS Evidentiary Artefacts
Windows 10 Forensics: OS Evidentiary ArtefactsWindows 10 Forensics: OS Evidentiary Artefacts
Windows 10 Forensics: OS Evidentiary Artefacts
 
2010 2013 sandro suffert memory forensics introdutory work shop - public
2010 2013 sandro suffert memory forensics introdutory work shop - public2010 2013 sandro suffert memory forensics introdutory work shop - public
2010 2013 sandro suffert memory forensics introdutory work shop - public
 
Windows 8.x Forensics 1.0
Windows 8.x Forensics 1.0Windows 8.x Forensics 1.0
Windows 8.x Forensics 1.0
 
Windows Forensics
Windows ForensicsWindows Forensics
Windows Forensics
 
July132000
July132000July132000
July132000
 
Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3
 
Windows logging cheat sheet
Windows logging cheat sheetWindows logging cheat sheet
Windows logging cheat sheet
 
File Management Presentation
File Management PresentationFile Management Presentation
File Management Presentation
 
Edrm
EdrmEdrm
Edrm
 

Similar to Accessioning-Based Metadata Extraction and Iterative Processing: Notes From the Field

Workshop 1 revised
Workshop 1 revisedWorkshop 1 revised
Workshop 1 revised
peterchanws
 
Lect 21 components_of_database_management_system
Lect 21 components_of_database_management_systemLect 21 components_of_database_management_system
Lect 21 components_of_database_management_system
nadine016
 
Avid Smpte Nyc June 6 Dh
Avid Smpte Nyc June 6 DhAvid Smpte Nyc June 6 Dh
Avid Smpte Nyc June 6 Dh
ShaunPatrick
 

Similar to Accessioning-Based Metadata Extraction and Iterative Processing: Notes From the Field (20)

Profile based Video segmentation system to support E-learning
Profile based Video segmentation system to support E-learningProfile based Video segmentation system to support E-learning
Profile based Video segmentation system to support E-learning
 
Systems, processes & how we stop the wheels falling off
Systems, processes & how we stop the wheels falling offSystems, processes & how we stop the wheels falling off
Systems, processes & how we stop the wheels falling off
 
CIP Genebank Images Management 2022
CIP Genebank Images Management 2022CIP Genebank Images Management 2022
CIP Genebank Images Management 2022
 
Modular applications with montage components
Modular applications with montage componentsModular applications with montage components
Modular applications with montage components
 
Introduction
IntroductionIntroduction
Introduction
 
Version Control & Git
Version Control & GitVersion Control & Git
Version Control & Git
 
Imagically Image Forensic Tool
Imagically Image Forensic ToolImagically Image Forensic Tool
Imagically Image Forensic Tool
 
Workshop 1 revised
Workshop 1 revisedWorkshop 1 revised
Workshop 1 revised
 
Lect 21 components_of_database_management_system
Lect 21 components_of_database_management_systemLect 21 components_of_database_management_system
Lect 21 components_of_database_management_system
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Metadata For Preservation Delos
Metadata For Preservation DelosMetadata For Preservation Delos
Metadata For Preservation Delos
 
Module 02 ftk imager
Module 02 ftk imagerModule 02 ftk imager
Module 02 ftk imager
 
Avid Smpte Nyc June 6 Dh
Avid Smpte Nyc June 6 DhAvid Smpte Nyc June 6 Dh
Avid Smpte Nyc June 6 Dh
 
Unity scene file collaboration
Unity scene file collaborationUnity scene file collaboration
Unity scene file collaboration
 
Embedded Metadata: Share, Deliver, Preserve
Embedded Metadata: Share, Deliver, PreserveEmbedded Metadata: Share, Deliver, Preserve
Embedded Metadata: Share, Deliver, Preserve
 
E-Fraud Prevention based on self-authentication of e-documents
E-Fraud Prevention based on self-authentication of e-documentsE-Fraud Prevention based on self-authentication of e-documents
E-Fraud Prevention based on self-authentication of e-documents
 
A little simple explanation abut Digital imaging
A little simple explanation abut Digital imagingA little simple explanation abut Digital imaging
A little simple explanation abut Digital imaging
 
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
 
What is Multimedia?.pptx
What is Multimedia?.pptxWhat is Multimedia?.pptx
What is Multimedia?.pptx
 
PLM Data Migration
PLM Data MigrationPLM Data Migration
PLM Data Migration
 

More from Mark Matienzo

Findability in the Flow: Discovery through Linking
Findability in the Flow: Discovery through LinkingFindability in the Flow: Discovery through Linking
Findability in the Flow: Discovery through Linking
Mark Matienzo
 
EAD and MARC sitting in a tree: D-R-U-P-A-L
EAD and MARC sitting in a tree: D-R-U-P-A-LEAD and MARC sitting in a tree: D-R-U-P-A-L
EAD and MARC sitting in a tree: D-R-U-P-A-L
Mark Matienzo
 
How I failed to present on using DVCS to control archival metadata
How I failed to present on using DVCS to control archival metadataHow I failed to present on using DVCS to control archival metadata
How I failed to present on using DVCS to control archival metadata
Mark Matienzo
 

More from Mark Matienzo (11)

To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to Name
 
Linked Data and the Semantic Web in the Archival Context
Linked Data and the Semantic Web in the Archival ContextLinked Data and the Semantic Web in the Archival Context
Linked Data and the Semantic Web in the Archival Context
 
Archival Sensemaking: Personal Digital Archiving as an Iteration
Archival Sensemaking: Personal Digital Archiving as an IterationArchival Sensemaking: Personal Digital Archiving as an Iteration
Archival Sensemaking: Personal Digital Archiving as an Iteration
 
Findability in the Flow: Discovery through Linking
Findability in the Flow: Discovery through LinkingFindability in the Flow: Discovery through Linking
Findability in the Flow: Discovery through Linking
 
Learning to Take, Learning to Give: Linking as Repurposing Metadata
Learning to Take, Learning to Give: Linking as Repurposing MetadataLearning to Take, Learning to Give: Linking as Repurposing Metadata
Learning to Take, Learning to Give: Linking as Repurposing Metadata
 
EAD and MARC sitting in a tree: D-R-U-P-A-L
EAD and MARC sitting in a tree: D-R-U-P-A-LEAD and MARC sitting in a tree: D-R-U-P-A-L
EAD and MARC sitting in a tree: D-R-U-P-A-L
 
Online Presence and Participation
Online Presence and ParticipationOnline Presence and Participation
Online Presence and Participation
 
Linked Data and Archival Description: Confluences, Contingencies, and Conflicts
Linked Data and Archival Description: Confluences, Contingencies, and ConflictsLinked Data and Archival Description: Confluences, Contingencies, and Conflicts
Linked Data and Archival Description: Confluences, Contingencies, and Conflicts
 
Cheeseburgers With Everything: Context, Content, and Connections in Archival ...
Cheeseburgers With Everything: Context, Content, and Connections in Archival ...Cheeseburgers With Everything: Context, Content, and Connections in Archival ...
Cheeseburgers With Everything: Context, Content, and Connections in Archival ...
 
Archives & the Semantic Web
Archives & the Semantic WebArchives & the Semantic Web
Archives & the Semantic Web
 
How I failed to present on using DVCS to control archival metadata
How I failed to present on using DVCS to control archival metadataHow I failed to present on using DVCS to control archival metadata
How I failed to present on using DVCS to control archival metadata
 

Recently uploaded

MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
MysoreMuleSoftMeetup
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 

Recently uploaded (20)

How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of Play
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
MuleSoft Integration with AWS Textract | Calling AWS Textract API |AWS - Clou...
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...
 
Observing-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxObserving-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptx
 
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUMDEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
PUBLIC FINANCE AND TAXATION COURSE-1-4.pdf
PUBLIC FINANCE AND TAXATION COURSE-1-4.pdfPUBLIC FINANCE AND TAXATION COURSE-1-4.pdf
PUBLIC FINANCE AND TAXATION COURSE-1-4.pdf
 
How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
 
Rich Dad Poor Dad ( PDFDrive.com )--.pdf
Rich Dad Poor Dad ( PDFDrive.com )--.pdfRich Dad Poor Dad ( PDFDrive.com )--.pdf
Rich Dad Poor Dad ( PDFDrive.com )--.pdf
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 

Accessioning-Based Metadata Extraction and Iterative Processing: Notes From the Field

  • 1. Accessioning-Based Metadata Extraction and Iterative Processing: Notes From the Field Mark A. Matienzo, Yale University Library CurateGear: Enabling the Curation of Digital Collections January 6, 2012 mark@matienzo.org http://matienzo.org/ @anarchivist
  • 3. Accessioning Workflow Start accessioning Write-protect media Verify image process Media Record identifying Extract filesystem- Disk Meta- Retrieve media characteristics of and file-level images data media as metadata metadata Transfer package Package images Assign identifiers to Ingest transfer Create image and metadata for media package ingest Media FS/File Document Disk MD MD accessioning image process End accessioning process
  • 4. Metadata Extraction •Desire to repurpose existing information as archival description and reports to other staff •Ideal output is XML; can be packaged with disk images going into medium- or long-term storage •Tools: Fiwalk/Sleuthkit; FTK Imager; testing others
  • 5. Sample DFXML Output <?xml version='1.0' encoding='UTF-8'?> <dfxml version='1.0'> <metadata xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:dc='http://purl.org/dc/elements/1.1/'> <dc:type>Disk Image</dc:type> </metadata> <creator version='1.0'> <!-- provenance information re: extraction - software used; operating system --> </creator> <source> <image_filename>2004-M-088.0018.dd</image_filename> </source> <volume offset='0'><!-- partitions within each disk image --> <fileobject><!-- files within each partition --></fileobject> </volume> <runstats><!-- performance and other statistics --></runstats> </dfxml>
  • 6. Sample DFXML Output <fileobject> <filename>_ublist1.wpd</filename> <partition>1</partition> <id>1</id> <name_type>r</name_type> <filesize>202152</filesize> <unalloc>1</unalloc> <used>1</used> <inode>3</inode> <meta_type>1</meta_type> <mode>511</mode> <nlink>0</nlink> <uid>0</uid> <gid>0</gid> <mtime>2001-02-22T22:30:52Z</mtime> <atime>2001-02-22T05:00:00Z</atime> <crtime>2001-02-22T22:31:54Z</crtime> <libmagic>(Corel/WP)</libmagic> <byte_runs> <byte_run file_offset='0' fs_offset='16896' img_offset='16896' len='512'/> </byte_runs> <hashdigest type='md5'>d7bc22242c0a88fd8b68712980d5ab28</hashdigest> <hashdigest type='sha1'>64bf2bdf82e33fcda50158804483ac611e753db5</hashdigest> </fileobject>
  • 7. Current Advantages •Faster (and more forensically sound) to extract metadata once rather than having to keep processing an image •Develop better assessments during accessioning process (directory structure significant? timestamps accurate?) •Building supplemental tools takes less time
  • 8. Gumshoe •Prototype based on Blacklight (Ruby on Rails + Solr) •Indexing code works with fiwalk output or directly from a disk image •Populates Solr index with all file-level metadata from fiwalk and, optionally, text strings extracted from files •Provides searching, sorting and faceting based on metadata extracted from filesystems and files •Code at http://github.com/anarchivist/gumshoe
  • 9.
  • 10. Current Limitations •Use of fiwalk limited to specific types of filesystems •Additional software requires additional integration and data normalization •DFXML is not (currently) a metadata format common within domains of archives/libraries •Extracted metadata maybe harder to repurpose for descriptive purposes based on level of granularity

Editor's Notes

  1. \n
  2. Let me provide some background on the state of affairs at Yale University. There are two main special collections units within the Yale University Library with the majority of born-digital records: Manuscripts and Archives, where I work, and the Beinecke Rare Book and Manuscript Library. In addition to these units, other special collections units, such as those within the Haas Arts Library and the Library at the School of Divinity, have some digital records within their holdings. Manuscripts and Archives and the Beinecke both participated in the Andrew W. Mellon Foundation-funded &amp;#x201C;AIMS&amp;#x201D; project, Born-Digital Collections: An Inter-Institutional Model for Stewardship, along with University of Virginia, Stanford University, and the University of Hull. Much of the work that the Yale team contributed to on the AIMS project dealt specifically with developing workflows and procedures for accessioning digital records. In their own distinct ways, Manuscripts and Archives and Beinecke have both implemented iterative processing models based on Dennis Meissner and Mark Greene&amp;#x2019;s &amp;#x201C;More Product Less Process&amp;#x201D; model. In particular, Manuscripts and Archives has focused heavily on an &amp;#x201C;accessioning as processing&amp;#x201D; model. As we found, however, implementing this for digital records was significantly more complicated. \n
  3. Because many of the digital records in our custody were underdescribed even within the accession records and since the records in these accessions have not been transferred off their original media, we have been focusing our efforts on &amp;#x201C;reaccessioning&amp;#x201D; our previous accessions. Our accessions contain many relatively common forms of digital media: 3.5&amp;#x201D; and 5.25&amp;#x201D; floppy disks from a variety of systems, optical media such as CD-ROMs and DVDs, flash media, hard drives, and occasionally, full computer systems. The primary goals of reaccessioning have been to identify, document, and register media within previous accessions, to mitigate the risk of media deterioration and obsolescence, and to extract basic metadata from filesystems on the media and the files contained on those media.\n\nThis is a diagram of the prototype workflow that we developed at Yale based on our work on the AIMS project. Our workflow contains a set of several steps that allow us to achieve the goals I have just mentioned, and includes the process of acquiring a forensic, low-level disk image of the media. The forensic imaging process uses a combination of hardware and software to prevent accidental modification of the &amp;#x201C;original&amp;#x201D; data on the media we received. While I am happy to talk with you during the breakout session regarding the rest of our process, the rest of my presentation will focus on our metadata extraction process, wherein we gather information about the filesystems and files contained within those disk images.\n
  4. At Yale, we hoped that the metadata extraction process would yield descriptive, structural, and technical metadata that would be easily repurposeable for minimal archival description, to support intellectual and perhaps contextual control, for basic preservation purposes, and for reports to other staff. For our purposes, we have been aiming to get XML-based metadata out of the extraction process because it seemed easiest to repurpose. Many archivists at Yale already have some experience working with XML because of our use of Encoded Archival Description.\n\nTo extract the metadata from disk images, we have used primarily Simson Garfinkel&amp;#x2019;s Fiwalk which generates Digital Forensics XML, or DFXML, metadata. DFXML is an open standard currently under development within the open source digital forensics community. Fiwalk depends on the open source Sleuthkit library. We have decided to include DFXML for the time being as part of disk image packages that we are placing in storage. A major motivation for including the extracted metadata with the stored disk image is that is significantly easier - and arguably much safer - to parse and process the metadata than it is to parse the disk image. \n\nIn addition to fiwalk, we have also used AccessData&amp;#x2019;s FTK Imager, a commercial application that can be downloaded at no charge, which optionally extracts metadata in CSV form. We have also been testing a few other applications, some of which I will mention later. \n
  5. This is the overview of the structure of a DFXML file as created by fiwalk. Within this output, there are four main sections - metadata about the DFXML instance itself; provenance information about how the DFXML instance was created, information about the source of the metadata, metadata extracted from the contents of the disk image (that is, the partitions and files), and runtime statistics.\n
  6. This is a sample portion of a output from fiwalk for a given disk image from our collections. Specifically, this section relates to a specific file within a disk image of a floppy disk. You can see information such as the file name, its size, that the file was deleted, modification, access, and creation times, and hash values for the file. There is also a basic assessment of the filetype - in this case, the file is a WordPerfect file.\n
  7. There are a number of tangible advantages that we&amp;#x2019;ve seen in using Fiwalk and generating DFXML at Yale. First, it is considerably faster to extract the metadata once to allow further reuse rather than having to continually extract the metadata every time its used. Once extracted, processing the metadata from an extremely large disk image takes considerably fewer computational resources than working from the disk image over and over.\n\nIt also allows us to make better assessments, such as if the files or their metadata within the image have been modified, or getting a better sense of the entirety of the accession&amp;#x2019;s structure in terms of directories.\n\nFinally, it is significantly easier to create additional tools to work with the extracted metadata than using the underlying stack provided by Sleuthkit since it just requires one to parse XML data.\n
  8. One such tool is a prototype application called Gumshoe, which is a Ruby on Rails-based application using Blacklight. The application indexes metadata extracted from a disk image and stores it in a Solr index, and provides searching, sorting, and faceting of the extracted data. For example, you can sort based on modification times of files or by file name. You can also optionally index any text found within each file. The code for this prototype has been released as open source under the Apache 2.0 license and is available from the URL shown above.\n
  9. This is a sample screenshot of Gumshoe browsing through the metadata of all the files from two distinct disk images - one with over 1,200 files, and the other from 39 files. The files within the results are sorted by size. On the left hand side, you can see the various facets available for filtering the search results. \n
  10. Beyond the benefits that I previously mentioned, there are a number of limitations that we have experienced. Arguably, the one with which we struggle the most with at Yale are the limitations of fiwalk, and Sleuthkit, the underlying framework it uses to process and extract the disk images. Specifically, Sleuthkit provides fiwalk with the ability to interpret the various types of filesystems. However, even with only a moderate amount of accessions of electronic records, we have run into serious barriers with certain filesystems being unsupported. In particular, we have had issues with interpreting original HFS filesystems - the filesystems used on much earlier versions of MacOS - and with much older media, such as Kaypro CP/M disks. It is worth noting first that many of these are not uncommon formats.\n\nWhile we have tested other applications to extract metadata, such as using FTK Imager for HFS volumes and CPMTools for CP/M disks, the data output by these programs is not in DFXML form. It therefore requires additional processing to get it into a common format that is parseable by the tools that we hope to use to process, repurpose, and interpret the extracted metadata.\n\nDFXML also shows quite a lot of promise in terms of extensibility and its continued development. However, it is not currently used widely within the library and archives world. Any repurposing of this metadata will require considerable development and planning time for the creation of appropriate crosswalks.\n\nFinally, we have found that the metadata extracted in this process is not always appropriate for archivists wanting to repurpose it as descriptive information. In some cases, the extracted metadata does not always reflect an appropriate level of granularity. While files titled with clear filenames may be useful especially when directory structures may indicate a pre-existing arrangement, these concepts do not always map nicely to archival concepts such as levels of aggregation. In many cases, metadata about individual files may be too granular for the level of description created within preliminary processing during or soon following accessioning.\n\n
  11. Thank you for your time. Please feel free to speak with me during the session later in the afternoon if you have further questions.\n