Preserving email: The PeDALS approach



NDIIPP state projects breakout session presentation about a tool designed by the Persistent Digital Archives and Library System (PeDALS) project to preserve email records from Microsoft Outlook PST files.

Published in: Technology


  1. 1. The PeDALS approach
  2. 2. <ul><li>Pete Watters </li></ul><ul><ul><li>Arizona State Library, project coordinator </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><li>Richard Pearce-Moses </li></ul><ul><ul><li>Clayton State University, Georgia, principal investigator </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><li>Brian Schnackel </li></ul><ul><ul><li>Arizona State Library, lead developer </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul>
  3. 3. <ul><li>PeDALS strives for OAIS compliance </li></ul><ul><li>Archivists focus on process, not individual records </li></ul><ul><li>Business rules… </li></ul><ul><ul><li>generate normalized metadata </li></ul></ul><ul><ul><li>transform SIPs into standardized AIPs </li></ul></ul><ul><ul><li>create DIPs for each record </li></ul></ul>
  4. 4. <ul><li>Suited to the PeDALS methodology </li></ul><ul><ul><li>Born digital </li></ul></ul><ul><ul><li>Potential for historical value </li></ul></ul><ul><ul><li>Message transmission information provides a rich source of metadata </li></ul></ul><ul><li>All partners had Outlook PST files </li></ul>
  5. 5. <ul><li>Atomize individual messages </li></ul><ul><ul><li>To store as individual AIPs </li></ul></ul><ul><ul><li>To disseminate as browser-friendly DIPs </li></ul></ul><ul><li>Create a database of rich metadata </li></ul><ul><ul><li>From the process: to support administration </li></ul></ul><ul><ul><li>From the email headers: to support discovery </li></ul></ul><ul><ul><li>From BagIt, New Zealand Metadata Extractor, other sources: to support preservation </li></ul></ul>
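The "from the email headers" step on this slide can be illustrated with a minimal sketch. It assumes a message has already been exported from the PST as RFC 5322 text, which the slides do not state. The function name `discovery_metadata`, the field names, and the sample message are hypothetical and are not taken from the PeDALS tool.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; the addresses use a reserved example domain.
SAMPLE = """\
From: Jane Doe <jane.doe@example.org>
To: John Roe <john.roe@example.org>, records@example.org
Subject: Budget meeting
Date: Mon, 01 Mar 2010 09:30:00 -0700
Message-ID: <abc123@example.org>

See you at ten.
"""


def discovery_metadata(raw):
    """Collect a few header fields into a flat record for a metadata database."""
    msg = message_from_string(raw)
    date_header = msg.get("Date")
    return {
        "message_id": msg.get("Message-ID", ""),
        # getaddresses returns (display name, address) pairs.
        "sender": getaddresses(msg.get_all("From", [])),
        "recipients": getaddresses(msg.get_all("To", []) + msg.get_all("Cc", [])),
        "subject": msg.get("Subject", ""),
        # Store the date in ISO 8601 form so it sorts and compares cleanly.
        "sent": parsedate_to_datetime(date_header).isoformat() if date_header else None,
    }


if __name__ == "__main__":
    for field, value in discovery_metadata(SAMPLE).items():
        print(field, value)
```

Keeping display names and addresses as separate values leaves room to reconcile name variants later in the database.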
  6. 6. <ul><li>PeDALS is intended for permanent records </li></ul><ul><li>PeDALS is not a records management system </li></ul><ul><li>Deleting files is difficult at best </li></ul>
  7. 7. <ul><li>When negotiating with the originating office, archivists encourage weeding PSTs of non-permanent records </li></ul><ul><li>Archivists work with rules rather than records – they don’t have time to weed the collections </li></ul><ul><li>If you give us junk, we’ll archive junk. </li></ul><ul><li>PSTs plucked from hard drives can work, but are more likely to generate errors during processing. </li></ul>
  8. 8. <ul><li>Metadata taken from headers was surprisingly messy </li></ul><ul><li>One response is to learn to cope with a complete lack of authority control </li></ul><ul><li>Or possibly correct by “data wrangling” from within the database </li></ul>
  9. 9. <ul><li>Senders and recipients can be an email address or a display name from one or more contact lists </li></ul><ul><ul><li>“Janet Napolitano” or “” or “” or “Napolitano, Janet” or “Janet” or “J Napolitano”? </li></ul></ul><ul><li>Subject line is not a reliable source for titles or abstracts – often blank, repetitive, or a remnant from an unrelated message </li></ul>
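One possible form of the "data wrangling" for sender and recipient names is sketched below. It is an illustration only, not the PeDALS approach. It assumes each value has already been split into a display name and an address, and it groups name variants under the address when one exists. The helper names `normalize_name` and `group_by_address` and the sample values are hypothetical.

```python
from collections import defaultdict


def normalize_name(display_name):
    """Reduce a display name to a lowercase 'first last' comparison key."""
    name = display_name.strip().strip("\"'“”").strip()
    if "," in name:
        # Turn "Last, First" into "First Last".
        last, _, first = name.partition(",")
        name = f"{first.strip()} {last.strip()}"
    # Collapse runs of whitespace and ignore case.
    return " ".join(name.split()).casefold()


def group_by_address(pairs):
    """Group display-name variants by address, falling back to the normalized name."""
    groups = defaultdict(set)
    for display_name, address in pairs:
        key = address.strip().casefold() or normalize_name(display_name)
        groups[key].add(display_name.strip())
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("Doe, Jane", "JDoe@example.org"),
        ("Jane Doe", "jdoe@example.org"),
        ("J Doe", ""),
    ]
    print(group_by_address(sample))
```

A value such as "J Doe" with no address stays in its own group. Deciding whether it refers to the same person still needs human judgment or an authority file.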
  10. 10. <ul><li>Email (and other records) may be open to the public by statute, but some content may be sensitive </li></ul><ul><ul><li>Personally identifying information </li></ul></ul><ul><ul><li>Private information (intimate, of no public interest) </li></ul></ul><ul><li>Repositories must develop procedures and policies for aggregates that may have some records with sensitive information </li></ul>
  11. 11. <ul><li>Boucher/Stearns draft legislation for online privacy would require “notice to and consent of an individual prior to the collection and disclosure of certain personal information” such as street and email addresses, phone numbers, aliases, and other common information. </li></ul><ul><li>Excludes government agencies, but may include academic libraries. </li></ul><ul><li>Possible chilling effect on archives: Keeping such information confidential would effectively block access to email and many other records </li></ul>
  12. 12. <ul><li>PST file structure was proprietary </li></ul><ul><li>Considered third-party Outlook plug-ins </li></ul><ul><ul><li>Smithsonian Institution had done research </li></ul></ul><ul><li>Adopted open-source PST export utility </li></ul><ul><ul><li>No longer supported </li></ul></ul><ul><ul><li>Written in Visual Basic </li></ul></ul>
  13. 13. <ul><li>Could generate human-readable XML of email messages </li></ul><ul><li>Was based on code open to public </li></ul><ul><li>Did not require understanding of PST structure </li></ul>
  14. 14. <ul><li>It’s more than just email </li></ul><ul><li>What to do with tasks, calendar items, contacts? </li></ul><ul><ul><li>Need to give the archivist the ability to decide what to keep </li></ul></ul><ul><li>What about viruses, corrupt attachments? </li></ul>
  15. 15. <ul><li>What is the record? </li></ul><ul><li>What are we authenticating? </li></ul><ul><li>PST as database; messages are constructs of fields in tables tied together by keys and other tables </li></ul><ul><li>XML is best way to preserve these relations and dependencies </li></ul>
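The idea that XML can keep a message's parts and their relationships together can be shown with a small sketch. The element and attribute names here are hypothetical and are not the PeDALS schema. The input dictionary stands in for fields already read from the PST.

```python
import xml.etree.ElementTree as ET


def message_to_xml(message):
    """Serialize one message, its headers, and its attachment references as XML."""
    root = ET.Element("message", id=message["id"])
    headers = ET.SubElement(root, "headers")
    for name, value in message["headers"].items():
        ET.SubElement(headers, "header", name=name).text = value
    ET.SubElement(root, "body").text = message["body"]
    # Nesting attachments under the message keeps the parent-child link explicit.
    attachments = ET.SubElement(root, "attachments")
    for att in message["attachments"]:
        ET.SubElement(attachments, "attachment",
                      filename=att["filename"], sha256=att["sha256"])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example = {
        "id": "msg-0001",
        "headers": {"From": "jane.doe@example.org", "Subject": "Budget meeting"},
        "body": "See you at ten.",
        "attachments": [{"filename": "agenda.doc", "sha256": "0" * 64}],
    }
    print(message_to_xml(example))
```

Nesting makes the dependencies visible in the document itself, so a reader needs neither Outlook nor the original table keys to see which attachment belongs to which message.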
  16. 16. <ul><li>Did not use the full record </li></ul><ul><li>Had almost no way to handle errors </li></ul><ul><li>Tended to break when dealing with large PST files that had not been curated </li></ul><ul><li>Required a copy of Outlook </li></ul><ul><li>Ran very slowly </li></ul>
  17. 17. <ul><li>In late February, Microsoft released the PST specification </li></ul><ul><li>203 pages of techspeak with some errors and inaccuracies </li></ul><ul><li>Based on the spec, we’ve been developing a file-based tool that doesn’t require Outlook. </li></ul>
  18. 18. <ul><li>Generates XML from the entire PST file </li></ul><ul><li>Much improved exception handling </li></ul><ul><li>Does not require Outlook </li></ul><ul><li>Runs much more quickly </li></ul>
  19. 19. <ul><li>File-based processor was slow to develop because of some errors in Microsoft’s documentation. </li></ul><ul><li>Test on as many PST samples as possible. Don’t rely on small curated samples. </li></ul><ul><li>Discovered differences between Unicode PST files and earlier ANSI-encoded files. </li></ul>
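A file-based tool has to tell ANSI and Unicode PST files apart before it reads anything else. The sketch below shows one way to do that from the file header. It assumes the layout as commonly described for the published format: a 4-byte magic value `!BDN` at offset 0 and a 2-byte little-endian version field at offset 10, with 14 or 15 indicating ANSI and 23 indicating Unicode. These offsets and values should be checked against the specification before use. The function name `pst_encoding` is hypothetical.

```python
import struct

PST_MAGIC = b"!BDN"  # assumed magic bytes at the start of a PST file


def pst_encoding(header):
    """Classify a PST file as 'ansi', 'unicode', or 'unknown' from its first 12 bytes."""
    if len(header) < 12 or header[:4] != PST_MAGIC:
        raise ValueError("not a PST header")
    # Assumed: 2-byte little-endian version field at offset 10.
    (version,) = struct.unpack_from("<H", header, 10)
    if version in (14, 15):
        return "ansi"
    if version == 23:
        return "unicode"
    return "unknown"


if __name__ == "__main__":
    # Synthetic header for demonstration; a real tool would read the first 12 bytes of a file.
    fake_header = PST_MAGIC + b"\x00" * 4 + b"SM" + struct.pack("<H", 23)
    print(pst_encoding(fake_header))
```

Returning "unknown" for other version values, rather than guessing, fits the slide's advice to test on many samples instead of a few curated ones.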
  20. 20. <ul><li>PSTs are not created automatically in Outlook 2010 </li></ul><ul><li>But they can be generated manually and can remain part of a scheduled retention routine </li></ul>