Glynn Edwards
SAA – August 22, 2015
Director, ePADD Project
Archival Stewardship of Email using ePADD Software
Developed and funded by:
Lifecycle Tools for Archival Email Stewardship
ePADD program

Lifecycle phases: Accessioning, Archival Processing, Preservation, Access

Functions compared (columns): Collection Development; Pre-Acquisition Appraisal; Capture; Normalization; Item-level processing; Bulk processing; Intellectual Arrangement; Search Capability; Personal/Sensitive Information Processing; Packaging; Repository; Online Discovery; Access

Legend: Full support / Not Supported / Unknown

Tools compared (rows), with the cell text shown on the slide, in slide order:
CERP Parser: Email message | Email message
DArcMail: Email message | Email message | Fielded
EMCAP Server Version: Email message | Email message | Server version only
Archivematica: Message + attachments | Message + attachments
PeDALS: Email message | Email message | Other: not declared
ePADD: Message + attachments | Message + attachments | NLP; fielded; full-text; lexicon | Identification (Reg. Ex.)
EAS: Message + attachments | Message + attachments | fielded; full-text | Identification (Reg. Ex.)
eMailchemy
MailStore Server: Message + attachments | Message + attachments | Full-text
AccessData FTK: Message + attachments | Message + attachments | Full-text | Identification (Reg. Ex.)
ZL Unified Archive: Message + attachments | Message + attachments | Full-text
Preservica Standard: Message + attachments | Message + attachments | Other: not declared
Paraben Email Examiner: Message + attachments | Message + attachments | Other: not declared
Aid4Mail Professional: Other: not declared
Appraisal Module
ePADD Technical Information
ePADD is written in Java and JavaScript and powered by Apache Tomcat (v7.0) using the Java EE Servlet API
(v3.x) and JavaMail (v1.4.2). Text and metadata extraction, indexing and retrieval are performed by Apache
Lucene (v4.7) and Apache Tika (v1.8). Charting and visualization are supported using the D3-based
reusable chart library (v0.4.10). Oracle's Java Application Bundler and Launch4J are used for packaging
on Mac and Windows platforms respectively. Other Java libraries from Apache (Lang, Commons, CLI, IO,
Logging, etc.) are also used. JSON formatting is performed with the libraries org.json and Gson.
ePADD has implemented its own natural language processing (NLP) toolkit, which is used for
named entity extraction, disambiguation and other tasks. This toolkit supplants the Apache
OpenNLP library used in earlier beta versions of the ePADD software: OpenNLP proved
insufficient for our needs (at least for name recognition), and after various rounds of
customization we built our own named entity recognizer. We continue to use Muse as an
internal library within ePADD. The toolkit uses external datasets such as
Wikipedia/DBpedia, Freebase, Geonames, OCLC FAST and LC Subject Headings/LC
Name Authority File.
The project is developed with IDEs such as IntelliJ IDEA and Eclipse, built with Apache Maven, Ant, and
custom shell scripts, and managed with Git for source control and issue tracking. The ePADD software
client is browser-based and compatible with Chrome and Firefox. It is optimized for Windows 7 and
OS X 10.9/10.10 machines, using Java 7 or 8.
Correspondents: Resolving multiple accounts into single entry
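To make the idea of resolving several accounts into one correspondent entry concrete, here is a minimal sketch in Python. It is not ePADD's algorithm, which analyzes headers in ways not described in these slides. It only assumes that addresses sharing the same normalized display name belong to one person. The function names and sample addresses are hypothetical.

```python
from collections import defaultdict
from email.utils import getaddresses


def normalize_name(name: str) -> str:
    """Lower-case a display name and reorder 'Last, First' to 'first last'."""
    name = name.strip().strip('"').lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.split())


def resolve_correspondents(header_values):
    """Group email addresses that share a normalized display name.

    header_values: iterable of raw From/To/Cc header strings.
    Returns {normalized name: sorted list of addresses}.
    """
    groups = defaultdict(set)
    for name, address in getaddresses(list(header_values)):
        if not address:
            continue
        # Fall back to the address itself when no display name is present.
        key = normalize_name(name) or address.lower()
        groups[key].add(address.lower())
    return {name: sorted(addresses) for name, addresses in groups.items()}


# Hypothetical sample headers.
headers = [
    "Jane Doe <jane@example.org>",
    '"Doe, Jane" <jdoe@example.com>',
    "jane doe <JANE@example.org>",
    "Sam Roe <sam@example.net>",
]
resolved = resolve_correspondents(headers)
```

A simple name match like this will merge different people who share a name, which is why a step where the user confirms or corrects the merged identities matters.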
Actions: do not transfer – restrict - reviewed
Processing Module
Disambiguation of names
Discovery & Delivery (Access)
Query generator
1.1 release - August 2015
New features
• Upload of CSV files of email addresses for matching with existing archive
• Search by Date and Date Range
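The CSV upload feature can be pictured with a short Python sketch. This is an illustration only: the two-column layout (email address, action), the message dictionaries and the function names are all assumptions, not ePADD's actual file format or API. The two actions come from the slides (restrict, do not transfer).

```python
import csv
import io

ALLOWED_ACTIONS = {"restrict", "do not transfer"}


def load_flagged_addresses(csv_text):
    """Read rows of (email address, action) into a dict keyed by address."""
    flagged = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        address, action = row[0].strip().lower(), row[1].strip().lower()
        if action in ALLOWED_ACTIONS:
            flagged[address] = action
    return flagged


def apply_bulk_actions(messages, flagged):
    """Label each message whose sender or recipients match a flagged address."""
    for message in messages:
        participants = {message["from"].lower(), *(a.lower() for a in message["to"])}
        actions = {flagged[a] for a in participants if a in flagged}
        # Assumption: "do not transfer" wins when both actions apply.
        if "do not transfer" in actions:
            message["action"] = "do not transfer"
        elif "restrict" in actions:
            message["action"] = "restrict"
    return messages


# Hypothetical CSV content and archive.
csv_text = "a@example.org,restrict\nb@example.org,do not transfer\n"
archive = [
    {"id": 1, "from": "a@example.org", "to": ["owner@example.org"]},
    {"id": 2, "from": "owner@example.org", "to": ["b@example.org", "a@example.org"]},
    {"id": 3, "from": "c@example.org", "to": ["owner@example.org"]},
]
labeled = apply_bulk_actions(archive, load_flagged_addresses(csv_text))
```

Messages with no flagged participant are left without an action, so they can still be reviewed individually.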
Future Roadmap
• Enhance Natural Language Processing Capability
• Enhance the Processing Module Features
• Enhance the Discovery/Delivery Module Features
• Recommend and Test Preservation Strategy
• Collaboration with other Platforms & Services
• Explore Sustainability Model
• Add Restriction Management/Annotation Functions
• Enhance the Error Handling Capability
https://library.stanford.edu/projects/epadd
https://epadd.nimeyo.com/
@e_padd
epadd_project@stanford.edu
Glynn Edwards
gedwards@Stanford.edu
Peter Chan
pchan3@Stanford.edu
Josh Schneider
josh.Schneider@Stanford.edu
http://epadd.stanford.edu/epadd/collections

Editor's Notes

  1. Since we held a demo two days ago, I won’t go into a lot of detail today. Instead I’ll talk a little about our motivation and commitment to the project as well as describe a little bit about our process.
  2. The ePADD project sprang out of a real problem we were facing 5 years ago during our tenure on the AIMS Project. The one email collection we tackled during that project was quickly followed by several others. As archivists, we once again focused on the part of the lifecycle that we knew best: pre-accessioning, processing, access and discovery. We have actually been involved with ePADD design for the past 3 years; the initial year was spent on creating functional specifications, interviewing different stakeholders, and building the pilot site for Discovery. One caveat: planning and management of the ePADD project is done by three staff in our department: myself, Peter Chan and Josh Schneider. This is in addition to our other work and is not covered by any grant funding. It often requires a significant amount of our time, so it helps a great deal to be invested and to have strong departmental support.
  3. Input from colleagues in different disciplines within our institutions, as well as external colleagues – particularly our collaborators – sparked in-depth discussions that shaped our initial development and planning. We applied for and received a two-year NHPRC-funded grant, which we supplemented with other internal SUL funds (40K). The latter were used for specific needs, such as developing a pilot online discovery site and designing a new UI towards the end of the development cycle.
  4. The overall design of the program grew out of use cases at Stanford and SUL policies. Our overriding goals from the very beginning were to make email archives discoverable and accessible, and to keep the software open source. Separate modules were created for specific functional activities: Pre-Accessioning (or collection development), Accessioning, Processing, and Discovery and Delivery (Access). Preservation was outside the scope of the project.
  5. I’d like to describe one early “use case”: In the early design stages, I met with a donor who, in the course of their work, corresponded with 10 or more whistleblowers at various companies and government agencies. All but one of those correspondents demanded that all of their messages be deleted before the archive was transferred to SUL, while one wanted a 20-year restriction on access. This is one of the reasons why a separate Accessioning Module was designed with much of the same functionality as the Processing Module. It’s not that we expect creators will use it a great deal – but it gives an institution the capability of working with a creator on the initial review if they are willing or it is necessary.
  6. One thing to note during initial acquisition is that all or specific folders can be selected during the appraisal phase. So, if a group or individual only wanted to send you specific emails, they might create an “archive” folder and send it to you periodically for their archives. The user can also add other email accounts. ePADD performs many automated processes during ingestion: it de-duplicates messages, extracts entities, performs regular expression searches, and resolves names of correspondents, merging multiple email addresses into one.
  7. I’d like to point out that the ePADD programming team – Vihari and Sudheendra – developed a custom NLP toolkit used for entity extraction and disambiguation in the archives, as Apache OpenNLP proved insufficient for our work.
  8. Name Resolution: ePADD automatically merges identities for a single correspondent by intelligently analyzing headers. In order to improve the functioning of the actions that depend upon this behavior, ePADD allows the user to confirm or correct the identities of correspondents that ePADD has resolved through its analysis.
  9. Actions can be performed in the Accessioning or Processing modules against individual messages or sets of messages based on search results or a facet, such as a correspondent.
  10. In the Processing Module (as in other modules), messages can be reviewed individually or in bulk by any browse or search terms. Searching covers the full text of the archive and attachments, or uses lexicons.
  11. In the Processing Module (from Accessioning Module) the results/analysis of the archive are displayed again in an overview page.
  12. Regular expressions are automatically searched against the archive… the file is also editable if you have other ID numbers that need to be searched.
  13. The user can select the Sensitive messages under the Browse Menu screen to view the results of this search.
  14. ePADD allows the user to choose from default or user-generated lexicons, which can be used for a variety of purposes, including searching for personal or confidential information, or formulating complex searches by categories.
  15. In this way, the archivist can add categories to create thematic access to the corpus – similar to creating “series” in a finding aid.
  16. ePADD uses algorithms to help the archivist or researcher understand context while reading a message. In this example, the first name Ellie is underlined in red (this is taken from the discovery module – note that the full text is not available!). ePADD analyzes the occurrences of Ellie throughout the archive with respect to the accompanying text and headers of this message. The colored bar underneath each name indicates the likelihood of that association based on this analysis – a relevance ranking. Here Ellie Dorfman is the top choice. The envelope signifies that there is correspondence from those individuals in the archive. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive's contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
  17. It is SUL’s policy not to deliver the full text of email archives online at this time, primarily because of third-party privacy issues. We also do not have an online registration form or “virtual” reading room at this time. I’m sure this will change in the future.
  18. So our Discovery Environment needed to allow access only to extracted entities – people, organizations, places – and partial headers. For access to full-text of unrestricted emails, patrons would need to visit our physical reading room. There they have access to full text and all attachments.
  19. One last bit of functionality for now: The query generator is a powerful tool for anyone – creators, archivists or researchers. It allows you to input a large set of text that will be searched against the archive. ePADD performs a bulk search of entities in your text and compares them to those in the email corpus. Results are highlighted in yellow. When you hover over a hit, the pop-up window displays a short list of results which you can click on to go to the original messages.
  20. UPDATE/CURRENT STATUS The NHPRC grant has been completed and Version 1.0 was released (beginning of July) when our project site went live.
  21. In late August or September, Release 1.1 will come out with a few add-ons in addition to fixes. It is currently being tested by the ePADD team. I have one last use case for you today: just before we went live with the software and discovery platform, one of our donors decided to send us a list of 300+ names that he wanted to flag for either restriction or removal. This led our team to create an add-on (part of release 1.1): the ability to upload a CSV file of correspondents to take bulk actions (restrict/do not transfer), speeding up this activity tremendously.
  22. Special Collections @ SUL has applied for another grant to continue the development of ePADD, and as part of that application we have drafted a roadmap of future enhancements. If the grant is awarded, we would welcome input and suggestions on this documentation. Some specific examples here would be the ability to: redact data that is highlighted, such as SS#, CC# etc.; allow cross-collection searching and browsing; offer full-text discovery when policy allows; and export header information for social network analysis.
  23. As you will have noticed - one area ePADD does not address currently is preservation. This is part of the future roadmap.
  24. We would like to work with other open-source projects on the preservation aspect.
  25. If you have any questions, please visit the project website or contact us.
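
As an illustration of the regular-expression search for sensitive information described in note 12, here is a minimal Python sketch. The two patterns (Social Security and credit card numbers, the examples named in note 22) and the sample messages are hypothetical. ePADD's own editable pattern file is not reproduced here.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "US Social Security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit card number": re.compile(r"\b(?:\d{4}[ -]){3}\d{4}\b"),
}


def find_sensitive(messages):
    """Return (message id, pattern label) pairs for every pattern that matches."""
    hits = []
    for message_id, body in messages.items():
        for label, pattern in PATTERNS.items():
            if pattern.search(body):
                hits.append((message_id, label))
    return hits


# Hypothetical message bodies keyed by message id.
sample = {
    "msg-1": "My SSN is 123-45-6789, please keep it private.",
    "msg-2": "Lunch on Friday?",
    "msg-3": "Card: 4111 1111 1111 1111 exp 01/20",
}
hits = find_sensitive(sample)
```

Adding another identifier type means adding one more entry to the pattern table, which mirrors the idea of an editable pattern file.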