More Related Content


Saa Session 502 Born Digital Archives in Collecting Repositories

  1. one-off collection (eg project) or likely to be subsequent accruals?
  2. collection type; differs for personal papers & organisational records 
  3. same personnel work on paper and born-digital components?  
  4. drag'n'drop to create the intellectual arrangement
  5. ability to return to original order of the material
  6. view some file types, add descriptive metadata etc
  7. find out about material
  8. understand whether it is available for consultation and if so, how
  9. access material.
  10. To apply appropriate access restrictions in order to protect private and sensitive information as well as intellectual property.

Editor's Notes

  1. Hello and welcome to session 502: Born-Digital Archives in Collecting Repository: Turning Challenges into Byte-Size OpportunitiesMy name is Gretchen Gueguen and I’m Digital Archivist at the University of Virginia. This morning, along with my colleagues Mark Matienzo from Yale, Simon Wilson from the University of Hull, and Peter Chan from Stanford, I’m going to talk with you about the AIMS project.
  2. AIMS is the short title for a Mellon-funded grant project entitled Born-Digital Collections: An Inter-Institutional Model for Stewardship. This two-year project set out to create a framework for stewardship of born-digital archival records in the collecting repositories.
  3. As I’ve mentioned, the grant partners include UVA, Stanford, Hull and Yale and Virginia serves as the PI
  4. The grant set out to achieve it’s goal through 4 different areas of activity. The first was the processing of several hybrid collections which you are going to hear about later this morning. The Digital archivists at each institution, the four of us here this morning, were funded by the grant to carry out this processing.To facilitate this stewardship, the partners also sought to develop some software solutions. You won’t hear as much about these this morning, but they include Rubymatica, a ruby-based reworking of Archivematica for the creation of Submission Information Packages, and Functional Requirement for a software tool to facilitate arrangement, description and access to born-digital archival materials. These requirements led to work on developing Hypatia, which is what is known as a “Hydra Head” or a module for the Fedora/Solr/Blacklight Hydra stack, for access to born-digital materials.The partners also hosted several events to garner feedback and to encourage communication among the archival community, including a workshop that took place here in Chicago earlier this week.The final project deliverables will include a White Paper synthesizes the research done during the project and a project report to the Mellon Foundation.
  5. A large part of the White Paper focuses on what we are currently referring to as the AIMS framwork: “A framework for collecting and delivering the born-digital materials that are quickly beginning to constitute the collections of contemporary scholarly, literary, and political figures and organizations.”This is really a high-level look at the tools, strategies, methodologies, and practices needed to effectively manage b-d content
  6. The framework is characterized by four main functions of stewardship:Collection DevelopmentAccessioningArrangement and DescriptionDiscovery and AccessYou’ll notice that we do not include “preservation” as an explicit function here. That is an intentional omission because we believe that preservation is implicit in all of these functions. In addition aspects such developing a preservation repository or undertaking preservation activities are outside of this scope because they are larger institutional initiatives. They are mentioned as prerequisites to being able to do work in many steps, but since there are many guidelines out there we didn’t feel the need to reiterate them here.We are going to focus the rest of our presentation this morning on these four areas and share with you some of the work we have done.If you are interested in more on the background of the project, I will encourage you to check out our project blog, called born Digital Archives and I’ll put a URL up for the blog at the end of the presentation
  7. We are starting our model with activities related to Collection Development. These are the activities undertaken in order to bring material in to the institution.  These include activities we may be very familiar with like prioritizing, developing relationships with creators, doing assessments and negotiating agreements.Within the concept of the AIMS model, which is primarily a hybrid collection environment, this work will be necessary to develop sound capturing and processing activities later.
  8. We’ve defined collection development as having five distinct stages which I’m going to go over with you this morning:PrerequisitesEstablish relationship with donorAnalyze FeasibilityNegotiate AgreementsPrepare for Accessioning
  9. The first step is going through some prerequisites like having an appraisal processes: how will you assess or evaluate materials? How will you be able to determine value? Also you need to evaluate your storage capacity: Do you have enough space to keep this material in both the short- and long-term? What about future transfers? Do you have a sound data preservation strategy or methodology?One of the most important prerequisites is establishing Collection policies.Defining what it is that we want to collect takes on a couple of different questions.The first might be what types of material are we interested in, in the traditional collecting sense: prominent people, organizational records, etc.Next, we need to consider what part of those figures lives we are collecting. We use our digital devices for private activities, as well as more public ones…which are we interested in collecting?The next logical step then is to think about where this information might be on digital devices: stored files probably yes, but do we also need software, operating systems, hardware, internet activity or cloud material?All of these factors, and more come together in a collection development policy, and it can be very difficult to write, especially when you are just starting and don’t know
  10. Assuming that you have the needed prerequisites in place or have the capacity to work on them, you can move on to the actual work of collection development:The first step is establishing a relationship with the donor. In many ways this is parallel to existing analog work, but when dealing with born-digital materials you should start thinking early about how digital archive staff need to be involved? This is potentially going to be very different from access to physical materials and now is the time to discuss options. Now is also the time to discuss the creation of the data with the donor and capture any documentation that will help with later processing and access. But, how comfortable is your donor with digital concepts and access to digital materials? As an example of the difficulty that this can cause, I’d like to show the example of some work that the AIMS project did in this regard. This is a digital donor survey that the AIMS project created based on one created for the PARADIGM workbook. The original intention was that a donor could fill this out before accessioning.This is the first page…and this is the second…and the third…and the fourth….and this is part two!We quickly realized that this would be overwhelming to potential donors, especially ones who hadn’t really thought much about things like their online persona or email preservation. We changed tactics and now recommend that this survey be used as prompt sheet for the archivist in an interview.
  11. Such an interview may be part of a program of enhanced curation, something Jeremy Leighton John at the British Library describes as not only collect[ing] the original archive but add[ing] value to it.“enhanced curation” techniques include things like documenting the creator’s workspace with high-resolution digital photography, creating a digital film of an oral history interviewwith donors about their computers and their computing habits, perhaps capturing video of screencasts of the donor describing the organization on their computer. This type of information can be invaluable as materials are accessioned and processed as the level of abstraction or unfamiliarity with a new system can make it difficult to gain intellectual control.
  12. Okay, so you are ready to move on to considering whether or not you even *can* acquire this material or more likely whether it is worth the costs. What is the cost analysis and risk analysis? Try a test capture…how does it work? Do you have the needed infrastructure and policies or can you create them? Can you even view files in order to appraise them? Do you need these guys to accomplish this? Or maybe these guys?It’s very easy to say “analyze costs” or “evaluate your home institution infrastructure” but if you’ve never encountered a particular software or hardware it’s difficult to be prepared for them. This is where having technologists or digital archivists involved early in the process can help. If possible during a test capture they can do a triage to determine if there are serious preservation concerns, if any forensic processing might be needed to recover damaged or deleted files. Etc.
  13. Moving on then, the next step is negotiating agreements. One of the big problems here is that there is a lack of models for agreements and appraisals. Many elements of standard agreements remain applicable in the hybrid or born-digital archive, but have different implications. It’s not the same to provide unrestricted access to paper documents in a reading room and unrestricted access to digital materials online. Furthermore, you have a much larger potential for capturing and inadvertently exposing sensitive electronic information like financial and health information, passwords and other personal data.The legal agreement with the donor needs to specify:An Agreement about copyright – either transferred to repository/institution or remain with creator/heirsUnderstanding that collecting repository will be “sole” repository of b-d material Understanding of capabilities/limits for capturing b-d material (currently)Understanding of preservation strategies and capabilitiesUnderstanding of delivery capabilities and limits (current)Understanding of what/how files will be restricted or deleted & how this will be confirmed Understanding of capabilities/limits of appraisal, viewing, description/processing of b-d materialUnderstanding of the creative process and relationship with b-d materials, computers, hand-held devices, cloud computing, etc.
  14. The final step in collection development is to prepare for processing. This may seem a little odd in a traditional sense, but what we are alluding to here is making sure that all of your technical steps for transfer, which may not be in the agreement, are planned ahead of time. Specifically, Scope and extent determinedMethod and time determinedPre-acquisition appraisal performedTest capture if neededDevelopment of new methodologies undertaken as needed/possibleEnhanced curation carried outCoordination with acquisition of analog materialThis is really the “action” step where many of the activities you have been planning prior are carried out. Overall, the steps in Collection Development help to set up later activities. By the end of the collection development step, the institution should be ready to take legal and physical custody of material. Doing this in a forward-thinking, planfull manner will help later processes go much smoother. You’ve made it to the finish line of collection development, but now we need to move on to Accessioning.
  15. Accessioning is generally understood as the set of processes wherein a repository takes physical and legal custody of records from a donor and formally documents, or "registers." the transfer. The processes have clear links to both collection development and arrangement and description, and in some cases, institutions may view them as part of those processes. However, we have situated accessioning as a primary function within the AIMS framework.Within our framework, accessioning serves a vital role to allow a collecting repository to establish physical, administrative, and intellectual control over records that have been transferred. The accessioning processes allow archivists to gather a wide variety of information that will inform and prioritize other processes, such as arrangement and description, further appraisal, and requirements for access. Accessioning also provides an environment in which archivists can document their actions and ultimately transfer the accessioned records into an environment for their storage and maintenance.The goals of accessioning therefore reflect the need to establish control over and ensure the authenticity and reliability of transferred records. Archivists must therefore be diligent during accessioning and understand that they understand the potential impact of the actions they take during these processes. If a collecting repository is unable to establish an adequate level of control over transferred electronic records, then it is likely that it has not successfully accessioned them. Accordingly, archivists with "legacy" accessions of electronic records, such as those containing computer media, may want to consider "reaccessioning" those transfers to establish a suitable level of control.
  16. The prerequisites, like the other areas of the AIMS model, broadly fall into several categories; in this case, they are policies, procedures, and infrastructure. There are many policies required to support accessioning properly. These may range from departmental preferences to requirements set at the institutional level. Procedures may account for a number of different options, such as minimal processing, accessioning of born-digital materials with paper records, deferment of digital accessioning, accessioning as resources allow, and retrospective accessioning of previously received electronic records. Infrastructure to support accessioning includes a wide variety of software and hardware, and expertise. This infrastructure will take resources to build, and archivists are urged to consider collaborative partnerships to allow for the better sharing of knowledge.  The transfer and administrative control processes in the AIMS framework are very similar to those for other formats of records. Archivists working with electronic records should be familiar with the various types for transfers and their implications. Types of transfers can include receipt of retired media formerly in use by a creator, records copied to media only used for transfer (such as external hard drives, CDs or DVDs), or a direct transfer using disk imaging software or by copying files across a network.Once the under administrative control, archivists should focus their efforts to gain physical control over records and media. Much of this work concerns identifying and potentially addressing threats preservation issues in the records, such as viruses, unknown file formats, and the physical condition of media if appropriate.Archivists next need to establish intellectual control and gather documentation that will enable further work necessary to process, maintain, or use the records. For some transfers, a listing of directories or files may be repurposed for archival description if the existing arrangement appears to be of value.Finally, the archivist should prepare the records to be maintained over time. This may include actions such as normalizing to preservation formats. Ultimately, the records should also be transferred to a secure storage location that can be monitored by the collecting repository.
  17. At Yale University, we have worked on a reaccessioning project that has allowed us to develop our thinking of how this accessioning of electronic records could best be realized for us going forward. Two repositories, Manuscripts and Archives and the Beinecke Rare Book and Manuscript Library, have worked in collaboration to implement software, hardware, and procedures that can be shared to support accessioning. In our reaccessioning project, we are working to establish better control over previously transferred accessions that contain electronic records on media such as floppy disks and CD-ROMs. These pieces of media were often received as part of a hybrid accession that also contained paper records, but in some cases we have received accessions of boxes containing only media.
  18. The goals of our reaccessioning project are fairly straightforward and relate to the three types of control discussed previously. First, we seek to establish administrative control of the media by identifying what it is and documenting its physical and logical characteristics and by assigning a unique identifier to each piece. Secondly, we are working towards gaining physical control of the media, which will allow us to mitigate the risks of media deterioration and obsolescence. Finally, we are trying to establish a basic level of intellectual control by extracting metadata about the filesystems and files contained on the media, such as file names, directory structures, and creation, access, and modification dates.
  19. Our reaccessioning workflow roughly looks like the following. We begin by retrieving the media and bringing it to the electronic records workstation, documenting its change in location within the Archivists’ Toolkit. We then assign unique identifiers to each of the media. We establish the best means by which to write-protect the media for imaging and record its identifying characteristics in a media log. We then put the media in the appropriate drive and create a forensic bit-level disk image, which includes all the files, the filesystem metadata, unused space – in other words, the entirety of the data on the media. We verify the image against the raw contents of the media and extract metadata from the disk image. Finally, we package the images and metadata and transfer the package into storage and complete the rest of the documentation.
  20. To acquire the data off media, we are using a forensic imaging process that extracts the entirety of the data off the media at the lowest level possible. To ensure that we do not intentionally or accidentally manipulate any of the data on the original media, we write-protect the media or reader. For floppy disks, we can use physical write protect tabs. For USB flash media, hard drives, and the like, we connect the drive or reader to a write-blocker, which is a piece of hardware connected to the computer that blocks low-level write signals from a computer. We use a variety of software to acquire the images, such as FTK Imager. The imaging software extracts the data from the media and calculates a cryptographic hash of the data on the media and the data within the image file. If the checksums match, the imaging is viewed as successful. [ADD FTK Imager SCREENSHOT? WRITEBLOCKER PHOTO?]
  21. This is a screenshot of FTK Imager, which we use to image media and to inspect disk images. You can see that the file listing includes regular files, slack or unused space on the disk, and deleted files, as denoted by the red X on the file icons.
  22. Our media log is a SharePoint list that contains identifying characteristics and physical and logical information about the media, such as the type of media, when it was imaged, the text of a label or writing on the media, and the type of filesystem or filesystems it contains. We assign each piece of media a unique identifier, which is a combination of theaccession number and incremental number. The media log also contains the workflow status of the accessioning process for each piece of media and whether processes succeeded or failed.
  23. The first screenshot is an overview for several pieces of media. You can see the unique media identifiers, the media format, and the workflow status.
  24. This expanded view shows all the fields, including further documentation about the disk image, the filesystem contained, and additional notes.
  25. If imaging is successful, we then extract metadata from the filesystem and files within the image. This is a software-based process that provides metadata such as file names, directory structures, creation and modification times, and approximate categorization of the types of files. This metadata can be repurposed in a variety of ways and provides a basic level of intellectual control that is comparable to a box list or other type of inventory for paper records. We are using open source software such as Sleuthkit and fiwalk to perform this extraction, but occasionally we need to rely on other tools for older or less common types of file systems.
  26. Finally, we create a transfer package using the BagIt specification as developed by the Library of Congress and the California Digital Library. To create the packages, we are using the Library of Congress-developed Bagger application. These packages contain the disk images, extracted metadata, and logs generated by the disk imaging software during the acquisition process. The BagIt packages also contain high-level information about the accession. For the time being, we are making a rough connection of one bag per accession, but we realize we may need to modify depending on the size of the accessions.
  27. This an overview of a sample bag, showing the structure and high-level metadata. Once packaged, we transfer the package to storage and verify the success of the transfer using procedures for the BagIt specification which compare the contents of the package against its manifest. If successful, we complete the rest of the documentation and record the success in the media log. We also record the storage location of the transferred package within the Archivists’ Toolkit and add the date of completion.
  28. SAA definition for description puts emphasis on minimizing the amount of handling needs to be updated to consider preservation actions due to file format obsolescence etc
  29. - reasonable return for our effort for us to describe the ‘project’ and indicative content that we held
  30. What is sensitive will vary from collection to collection information (social security; personal e-mail address/mobile no etc) - Could also be discussion behind a decision (Larkin 25 funding)
  31. As a result of experiences to tackle arrangement and description, the AIMS digital archivists' defined the requirements for a new tool - designed to work with technical and professional standards- use drag'n'drop to create intellectual arrangement, changes a relationship between digital assets (asset doesn’t move) using Fedora "sets“  - rights & permissions to single file, a discrete series or the entire collection
  32. “the systems and workflows that make processed or unprocessedmaterial and the metadata that support it available to users.”Discovery and access is also not possible without completion of many of the prior steps described in this model. The outcomes of those steps have a significant impact on what is either appropriate or achievable in terms of discovery and access. Given the impact of these prior steps on discovery and access it is crucial to consider the desired outcomes for discovery and access as early as possible — ideally during the Collection Development phase — and to continue to update and revise these plans are work on the collection progresses.
  33. Overall though, we have three major goals in discovery and access.The first is to make material available to user communities. This includes ensuring that the users can find the material, understand if it’s available, and get access to it if possibleHowever, that access must follow guidelines for access restrictions related to privacy, and intellectual property.An overarching goal of all three is to ensure that the significant properties of the material are inherent in whichever form the delivery takes.
  34. We plan to delivery Stephen Jay Gould papers in the Hypatia platform. Hypatia is a fedora platform In Hypatia, we have one EAD for the hybrid collectionSeries 6 – for born digital material. We provide a link for people to go to an interface where they can browse and perform full text search on the born digital material of the papers.
  35. Facets:SubjectsTeaching materialsBooksArticlesNSF reports
  36. Convert the files in obsolete file format such as WordPerfect to html. If not, people have to download the files and find a viewer to view the file or create an emulated environment to view the file.
  37. Discovery and access is also not possible without completion of many of the prior steps described in this model. Some institution accept certain file formats only.
  38. Researchers may also need to bookmark or label the files they found.
  39. In additional to Hypatia mentioned above. Stanford also try to use FTK the software we use for processing) to delivery born digital materials.One of the features of the FTK, which I believe will be interested by researchers, is the ability to generate Fuzzy hash.Files with the same hash are the same in contents. What about similar files?Fuzzy hash provide you the information how close files are Full text searchHow many characters mis-speltFuzzy hashing is a tool which provides the ability to compare two different files and determine a fundamental level of similarity. This similarity is expressed as score from 0-100. The higher the score reported the more similar the two pieces of data. A score of 100 would indicate that the files are close to identical. Alternatively a score of 0 would indicate no meaningful common sequence of data between the two files.
  40. I mentioned before that the goal of D&A is to ensure that the significant properties of the material are inherent in whichever form the delivery takes.For design file I believe VM is the appropriate platform.I have built a virtual machine containing some design files with the associated fonts.People want to know the exact fonts, font spacing, etc. used. They don’t have the fonts – so even they download the file, they cannot recreate the appearance of the file,Virtual machine created using Parallels Desktop.
  41. How to delivery 50,000 emails? I worked with colleague at Stanford to produce network graph of 50,000 emails. Name of the network software: Gephi is an open-source software for visualizing and analyzing large networks graphs.
  42. I am very lucky to meet Computer Science candidate at Stanford. SudheendraHangalEmail visualization tool for sentiment analysis.Psychology literature to define what words constitute happiness, love, etc. Topic analysis using software
  43. Annotation, see individual email