RBMS 2011_Edwards


  • [INTRO] A little over two years ago, a few elements converged and a core group of us at Stanford began to get more serious about developing a viable method for processing the born-digital “papers” in our collections. Most of my talk is centered around our first trials, but I’d first like to describe the context and pressures at SUL that put us on this path …
  • The major pressure was the growing quantity of legacy media in our “backlog” … With Stanford situated in Silicon Valley, it’s no big surprise that we have a lot of computer collections that contain old legacy media. Hence our acquisition in the late 1990s of the records of Apple Computer Inc., the papers of Douglas Engelbart, and a really large collection of computer games and software. [images: mouse/engelbart, box of Atari games/Cabrinety]
  • Because of those very acquisitions, in 1997 the Manuscripts Division began tracking the incoming quantities (just an overall count) of legacy computer media contained in new accessions. By the end of the decade we had recorded over 7,000 “items”. Increasingly our b-d material comes from faculty, artists, writers, and organizations. Today, we have over 26,000 items of legacy media recorded in our backlog. [Univ. Archives has ~700 listed]
  • The other element was an event in February 2009. A staff member on our digital team (Michael Olson), who had previously worked in Manuscripts, attended the Digital Lives Project’s first conference at the British Library. Two things occurred: he heard about a study* done at the B.L. on data loss in legacy computer media (3% per year), and he saw that the B.L. was exploring the use of forensic tools for capturing data from media. Based on this, coupled with the weight of our growing backlog of media, we decided on two courses of action. *McLeod, Rory, “Risk Assessment; using a risk based approach to prioritise handheld digital information,” 2008
  • First we purchased forensic hardware and software to enable us to capture and view legacy media and files. Hardware: FRED, from Digital Intelligence. Software: we purchased and tested both FTK and EnCase forensic software. This formed the nucleus of our digital lab … And yet, most forensic equipment is geared toward current/modern media. So, we searched eBay for old floppy disk drives to use with FRED.
  • Next, we partnered with 3 other institutions (U. Va., Yale and Hull) as part of the AIMS Project, funded by the Andrew W. Mellon Foundation. (AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship) The goals of the project were to process b-d material from 13 (mostly legacy) collections and to deliver the b-d material in some fashion by the end of the grant.
  • Each repository hired a Digital Archivist. Peter Chan was hired at SUL in January 2010 and began actual work on imaging the disks for our 4 collections and trying out various methods for “processing” the data. We chose collections that contained different types of media and content. They are: Robert Creeley papers (poet: mostly email, some writing); Stephen Jay Gould papers (paleontologist, author: writing and some data sets); Peter Rutledge Koch papers (fine press printer: a mix of files, including email, text, image, and design files like Adobe’s InDesign; this was the only collection with files transferred directly from a donor’s current computer); Xanadu Project records (early hypertext project: software program on 6 hard drives).
  • It is now 1.5 years later and we have created a viable (although under constant development) workflow for accessioning and processing b-d materials using forensic tools. This is the more detailed workflow for collections that would be “fully” processed … We are also working on a minimal processing workflow and one that would fully “accession” the data (i.e., remove it from physical storage media) and store it for later processing.
  • One of the collections that has informed our development of b-d practice is that of Stephen Jay Gould, which contains both paper (analog), over 500 linear feet, and 3 cartons of digital material: 98 3.5-inch floppy disks; 61 5.25-inch floppy disks; 4 sets of punch cards; 3 computer tapes. In total, over 550 linear feet have been received in 8 accessions. His papers and the audio/video are being processed concurrently by archivist Jenny Johnson and will be done this August. This month, the processing team discovered another 5 cartons of punch cards in the 2008 accession (21 sets). [This recent find won’t be resolved by end of grant]
  • Using the two different capture stations (FRED and the floppy/zip station) we created disk images of all the disks. 8 sets of punch cards were successfully read by our neighbors at the Computer History Museum; 1 set was unreadable, as it had no sorting key. We also began tracking our own loss statistics (“success or failure” of captures) in a spreadsheet, which we link to our accession records in Archivists’ Toolkit. The loss rate for floppies in Gould is 5%; loss in other collections was higher. Creeley: 6% loss (1 out of 12 CDs unreadable; 3 out of 53 floppies unreadable). [1987-2004?] Xanadu: 4 of 6 hard drives inoperable, or 67% damaged. [PC’s report: There were mechanical or electrical problems with other drives (one didn't spin after it was powered up and one gave a "dong" sound after it was powered up). We are not sure what the problems with the remaining two drives are – they do spin after power up but we cannot access the data.] Cost to recover: ~$10,000 (2.5K / drive)
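The loss figures quoted in this note can be recomputed from simple per-media tallies. What follows is a minimal Python sketch, not the spreadsheet described in the talk: the `CaptureTally` record and the `loss_rate` helper are hypothetical names introduced for illustration, and the counts are the ones given in the note.

```python
from dataclasses import dataclass


@dataclass
class CaptureTally:
    """One row of a hypothetical capture log: attempts and failures per media type."""
    collection: str
    media_type: str
    attempted: int
    failed: int


def loss_rate(tallies):
    """Return failed / attempted across all rows (0.0 if nothing was attempted)."""
    attempted = sum(t.attempted for t in tallies)
    failed = sum(t.failed for t in tallies)
    return failed / attempted if attempted else 0.0


# Counts taken from the note above.
creeley = [
    CaptureTally("Creeley", "CD", attempted=12, failed=1),
    CaptureTally("Creeley", "floppy", attempted=53, failed=3),
]
xanadu = [CaptureTally("Xanadu", "hard drive", attempted=6, failed=4)]

print(f"Creeley: {loss_rate(creeley):.0%}")  # 4 of 65 items -> 6%
print(f"Xanadu: {loss_rate(xanadu):.0%}")    # 4 of 6 drives -> 67%

# Recovery estimate at the quoted "2.5K / drive" for the inoperable drives.
RECOVERY_COST_PER_DRIVE = 2500
recovery_estimate = sum(t.failed for t in xanadu) * RECOVERY_COST_PER_DRIVE
print(f"Estimated recovery cost: ${recovery_estimate:,}")  # $10,000
```

Keeping attempts and failures as separate counts, rather than storing only a percentage, lets the rate be recalculated as more media are captured.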
  • To process the materials during our initial trial, we used Windows Explorer. Folders were created that mirrored “series” and “titles” in EAD, and files were moved from the original media folder into the appropriate place. This, however, changed data associated with the files, such as the original path. At this point, Peter Chan attended a week-long session on the use of forensic software at Digital Intelligence, focusing on FTK. While FTK is much more robust than we needed for archival work, he decided that many of its tools could easily be adapted for archival processing. We discovered that this practice mirrored work beginning at both the BL and Oxford.
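One way to keep the original context when files are rearranged is to record it before anything is moved. The following is a minimal Python sketch under stated assumptions: the disk contents have already been extracted to a working copy, and the function names (`build_manifest`, `write_manifest`) and CSV columns are hypothetical. It is not how FTK stores its metadata.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

FIELDS = ["original_path", "size_bytes", "modified_utc", "sha256"]


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Walk `root` and record each file's original relative path, size, mtime and checksum."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                "original_path": os.path.relpath(full, root),
                "size_bytes": info.st_size,
                "modified_utc": modified.isoformat(),
                "sha256": sha256_of(full),
            })
    return rows


def write_manifest(rows, out_path):
    """Save the manifest as CSV so it can travel with the accession record."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Reading a file can itself update its last-accessed time on some file systems, so a script like this should be run against a copy or a read-only mount rather than the original media.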
  • Technical metadata for the disk images are displayed here. They are arranged by floppy disk and display file format (where identifiable), file size, checksum, creation dates, etc. One can change the view to add additional columns, such as duplicate or primary file, etc.
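A “duplicate or primary file” column can be understood as a by-product of checksums: files with the same checksum are, for practical purposes, byte-for-byte copies. The sketch below assumes that the first file encountered with a given checksum is treated as the primary; the source does not say how FTK chooses, and the `flag_duplicates` helper and sample file names are hypothetical.

```python
import hashlib


def flag_duplicates(files):
    """Label each (path, content) pair as 'primary' or 'duplicate' by checksum.

    The first file seen with a given checksum is called the primary
    (an assumption made for this sketch).
    """
    first_seen = {}
    labelled = []
    for path, data in files:
        checksum = hashlib.sha256(data).hexdigest()
        if checksum in first_seen:
            status = "duplicate"
        else:
            first_seen[checksum] = path
            status = "primary"
        labelled.append((path, checksum, status))
    return labelled


# Hypothetical file names and contents, for illustration only.
sample = [
    ("disk01/ESSAY.DOC", b"draft text"),
    ("disk02/ESSAY.BAK", b"draft text"),
    ("disk02/DATA.WK1", b"1,2,3"),
]
for path, checksum, status in flag_duplicates(sample):
    print(f"{path}\t{checksum[:8]}\t{status}")
```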
  • The embedded viewer in FTK – from the same company that does Quick View Plus – allows you to quickly see the contents of many of the files
  • Here are two quick screen shots showing archival HIERARCHY using FTK’s “bookmark” feature. Series or Subseries can be added as metadata to individual or groups of files by highlighting or checking the boxes of the files in the lower panel.
  • Description for the three different formats in Gould (paper, audio/video and born-digital files) will be merged at the end of the summer or early fall, but the level of description will be different. Gould’s papers are processed to the folder level for most of the collection. The audio and video are listed at the item level to facilitate any future digitization. The born-digital material will have Series level description with notes about original media, capture and processing methods, loss/damaged media, and delivery methods.
  • Here is a partial view of our working draft for processing notes for Gould b-d “series”
  • We encountered different issues in our other “AIMS” collections; the main one I will mention is the Robert Creeley collection … His papers originally contained 53 floppies, 5 zip disks, and 3 CDs. Initially the computer media was segregated into a separate collection, but it will need to be merged into the main collection record and finding aid in the fall. Processing the disk images with FTK identified 50K emails, 8 files related to health records, and 69 files with SS#. A recent addendum complicates the processing of Creeley’s born-digital material: received in May 2011, it contains b-d media that will need to be processed, and it may allow us to have a more complete set of emails, drafts, etc. It includes: 7 computers; 3 zip drives; 121 optical discs; 422 3.5-inch floppy diskettes; 1 Zip 250 USB Drive; 1 Olympus C-4000 Camedia Digital Camera & flash cards; 1 20-gigabyte iPod. We have yet to analyze the data in the new accession and compare it to the original data, but two issues cropped up: (1) how to process and deliver multiple computers over a creator’s life cycle; (2) data was captured from various CDs and computers to create an overview of the b-d material before transfer to SUL, so what got changed in the process?! [image from wikipedia taken by Elsa Dorman]
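The kind of pattern search that surfaces files containing Social Security numbers can be illustrated with a regular expression. This is a minimal Python sketch, not FTK’s search engine: the pattern, the `files_with_ssn_like_text` helper, and the sample texts are all assumptions made for illustration, and a simple pattern like this will produce false positives that still need human review.

```python
import re

# Assumed pattern: three digits, two digits, four digits, separated by hyphens.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def files_with_ssn_like_text(texts):
    """Return the sorted names of files whose extracted text matches the pattern.

    `texts` maps a file name to the text already extracted from that file.
    """
    return sorted(name for name, text in texts.items() if SSN_LIKE.search(text))


# Made-up sample texts.
sample = {
    "letter.txt": "My number is 123-45-6789, please keep it private.",
    "poem.txt": "Call 650-555-0100 after 3.",
    "notes.txt": "Order 1234-56-78901 shipped.",
}
print(files_with_ssn_like_text(sample))  # ['letter.txt']
```

The word-boundary anchors keep the pattern from matching inside longer digit runs such as phone or order numbers, which is why only the first sample file is flagged.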
  • In processing initial computer media, PC used folder titles on the disks as keywords for files
  • Using Creeley’s initial text data, we have worked with two individuals. One, Elijah Meeks, who works in the Digital Humanities, took the header info from the 50K emails and created a network graph: header information from Robert Creeley’s 50,000+ emails, emphasizing the connection between the poet and Gerard Malanga.
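The general idea of turning email headers into a network can be sketched with Python’s standard library. This is not the method used for the graph described in the note, which the source does not detail; it assumes each message is available as raw RFC 822 text, the `correspondence_edges` helper is a hypothetical name, and the addresses are placeholders.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def correspondence_edges(raw_messages):
    """Count (sender, recipient) pairs from the From, To and Cc headers."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower()
                   for _name, addr in getaddresses(msg.get_all("From", []))
                   if addr]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower()
                      for _name, addr in getaddresses(recipient_headers)
                      if addr]
        for sender in senders:
            for recipient in recipients:
                edges[(sender, recipient)] += 1
    return edges


# Placeholder messages; the addresses are made up.
samples = [
    "From: Poet <poet@example.org>\nTo: friend@example.org\nSubject: draft\n\nbody",
    "From: poet@example.org\nTo: Friend <friend@example.org>, editor@example.org\n\nbody",
]
edges = correspondence_edges(samples)
for (sender, recipient), count in edges.most_common():
    print(f"{sender} -> {recipient}: {count}")
```

The resulting weighted edge list is the usual input for a graph visualization tool; the heaviest edges correspond to the most frequent correspondents.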
  • To wrap up: Donors & users expect us to acquire, organize, preserve and provide access to b-d collections. Special Collections staff capture, appraise, arrange and describe b-d materials AND contribute to requirements for both access and delivery as well as arrangement and description tools. Our digital group will preserve in our preservation repository (SDR) and provide public access and invite participation – Hypatia (under development).
  • RBMS 2011_Edwards

    1. Processing Born-Digital “Papers” @ Stanford
       Glynn Edwards, RBMS, Baton Rouge, LA - 2011
    2. Collections in the late 1990s
       Apple Computer Inc. records
       Douglas Engelbart papers
       Stephen Cabrinety collection
       By 2000, over 7,000 items of legacy computer media received as part of hybrid collections
       Now over 26,000 items recorded during accessioning process
    3. Tracking Computer Media (then)
    4. First Digital Lives Research Conference: Personal Digital Archives for the 21st Century
    5. FRED (Forensic Recovery Evidence Device: Digital Intelligence)
       Software: FTK suite (AccessData) - EnCase
    6. AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship
       University of Virginia
       Yale University
       Hull University
       Stanford University
       Funded by the Andrew W. Mellon Foundation
    7. Robert Creeley papers
       Stephen Jay Gould papers
       Keith Henson papers re: Project Xanadu
       Peter Rutledge Koch papers
    8.
    9. Stephen Jay Gould
       Influential American paleontologist, evolutionary biologist and historian of science, Gould began his career at Harvard University in 1967 and worked until his death in 2002.
       98 3 ½” floppy diskettes
       61 5 ¼” floppy diskettes
       4 sets of punch cards
       3 computer tapes
    10. Dear Peter, Unfortunately we do not manufacture any motherboards now a days which can support the 5.25 floppy. The interface are different than 3.5 and they are becoming obsolete and are no longer available on the newer motherboards.
    11.
    12. Capture: success & failure
    13. Trial One – processing using Explorer
    14. Trial Two – processing using FTK
       • Embedded viewer that reads over 200 file formats
       • http://accessdata.com/downloads/media/Recognized%20File%20Types%20FTK1%207-24-08.pdf
       • Viewing the files will NOT change last accessed dates
       • Easy to use user interface for creating “bookmarks” for hierarchical information (series, subseries)
       • Using “labels” on groups of files for descriptive metadata
       • Pattern & full-text searches – e.g. looking for restricted content such as credit cards, ss#, student grades, etc.
       • Output to XML
    15. FTK: Technical Metadata
       Technical metadata associated with files
    16. View Files in “obsolete” file format
    17. Create Hierarchy using “bookmarks”
    18. Administrative & descriptive metadata
    19. Series 6: Gould’s Born-Digital Material
    20. Series 6: Processing Note
    21. Other collections, other issues
       Robert Creeley’s original media, processed via FTK:
       • 50,000+ emails
       • Identified 8 files related to health records and
       • 69 files with SS#
       More born-digital material received - May 2011 addenda:
       • 7 computers
       • 3 zip drives
       • 121 optical discs
       • 422 3.5-inch floppy diskettes
       • 1 Zip 250 USB Drive
       • 1 Olympus Camedia CF/SmartMedia Reader
       • 1 Olympus C-4000 Camedia Digital Camera & flash cards
       • 1 20-gigabyte iPod
    22. FTK – Creeley’s Email and Keywords
    23. Creeley’s Email Network
    24. Email Mining on Peter Koch’s Emails
    25. Email Mining on Peter Koch’s Emails
       http://suif.stanford.edu/~hangal/muse/
    26. What are our Roles?
       Donors & users expect us to acquire, organize, preserve and provide access to b-d collections
       Special Collections staff capture, appraise, arrange and describe b-d materials AND contribute to requirements for both access and delivery as well as arrangement and description tools
       Our digital group will preserve in our preservation repository (SDR) and provide public access and invite participation – Hypatia (under development)
    27. Challenges
       Read contents from storage media (punch cards, tapes, 8/5.25/3.5-inch floppy diskettes, Zip disks, etc.)
       View contents with different formats (WordPerfect, Lotus 1-2-3, Quark files, etc.)
       Organize “large” collections (420,000 files or multiple computers in 1 collection)
       Long term preservation (hardware failure, obsolete file formats, unknown future, etc.)
       Wide scope of knowledge needed (computer hardware, operating systems, application, repository and virtualization software, archival processing, security (authentication, encryption), Web 3.0, digital preservation, natural language processing, etc.)
       Descriptive standards are in flux
       Accessioning procedures under development
       Delivery options for different formats
    28. Stanford Approach
       • End-to-End consideration
       • Not limited to open source software (total costs)
       • Aim for production, not proof of concept, tools
       • Learn from other projects (Paradigm, SALT, etc.)
       • Integrate with existing infrastructure (AT, Fedora, Solr, Blacklight, SDR, DOR, etc.)
       • Continue testing and build out of forensic lab hardware/software
       • Develop training on capture and processing for staff and interns
       • Work with developers and digital group on functional requirements for delivery platform, etc.
       • Explore alternative processing (PhotoMechanic for IPTC data in photography collections)
       • Explore enhanced curation (high res photos, creator interviews, etc.)
    29. Participants
       Inter-departmental and -institutional team for born-digital materials:
       Stanford’s Special Collections: Peter Chan, Glynn Edwards & University Archives
       Stanford’s Digital Libraries Systems & Services (DLSS): Michael Olson, Tom Cramer, Lynn McRae, Bess Sadler, Naomi Dushay, etc.
       AIMS partners: Univ. of Virginia, Hull University (U.K.), Yale University