Archival Stewardship of Email using ePADD Software


Overview of and update on ePADD software development from the Special Collections Department of Stanford University Libraries (version 1.0 released July 2015), with some notes about future development.

Glynn Edwards, Director, ePADD Project
SAA, August 22, 2015

ePADD program
Lifecycle Tools for Archival Email Stewardship

Lifecycle stages (column groups): Accessioning, Archival Processing, Preservation, Access
Columns: Collection Development, Pre-Acquisition Appraisal, Capture, Normalization, Item-level processing, Bulk processing, Intellectual Arrangement, Search Capability, Personal/Sensitive Information Processing, Packaging, Repository, Online Discovery, Access
Legend (cell shading on the slide, not preserved here): Full support, Not Supported, Unknown

Text entries per tool, in slide order:
CERP Parser: Email message | Email message
DArcMail: Email message | Email message | Fielded
EMCAP Server Version: Email message | Email message | Server version only
Archivematica: Message + attachments | Message + attachments
PeDALS: Email message | Email message | Other: not declared
ePADD: Message + attachments | Message + attachments | NLP; fielded; full-text; lexicon | Identification (Reg. Ex.)
EAS: Message + attachments | Message + attachments | fielded; full-text | Identification (Reg. Ex.)
eMailchemy: (no text entries)
MailStore Server: Message + attachments | Message + attachments | Full-text
AccessData FTK: Message + attachments | Message + attachments | Full-text | Identification (Reg. Ex.)
ZL Unified Archive: Message + attachments | Message + attachments | Full-text
Preservica Standard: Message + attachments | Message + attachments | Other: not declared
Paraben Email Examiner: Message + attachments | Message + attachments | Other: not declared
Aid4Mail Professional: Other: not declared

ePADD Technical Information

ePADD is written in Java and JavaScript and powered by Apache Tomcat (v7.0) using the Java EE Servlet API (v3.x) and JavaMail (v1.4.2). Text and metadata extraction, indexing and retrieval are performed by Apache Lucene (v4.7) and Apache Tika (v1.8). Charting and visualization are supported by the D3-based reusable chart library (v0.4.10). Oracle's Java Application Bundler and Launch4J are used for packaging on Mac and Windows platforms respectively. Other Java libraries from Apache (Lang, Commons, CLI, IO, Logging, etc.) are also used. JSON formatting is performed with the org.json and Gson libraries.

ePADD has implemented its own natural language processing (NLP) toolkit, which is used for named entity extraction, disambiguation and other tasks. This toolkit supplants the Apache OpenNLP library used in earlier beta versions of the ePADD software: OpenNLP proved insufficient for our needs (at least for name recognition), and after various rounds of customization we built our own named entity recognizer. The toolkit uses external datasets such as Wikipedia/DBpedia, Freebase, Geonames, OCLC FAST and LC Subject Headings/LC Name Authority File. We continue to use Muse as an internal library within ePADD.

The project is developed with IDEs like IntelliJ IDEA and Eclipse, built with Apache Maven, Ant, and custom shell scripts, and uses Git for source control and issue tracking. The ePADD software client is browser-based and compatible with Chrome and Firefox. It is optimized for Windows 7 and OS X 10.9/10.10 machines, using Java 7 or 8.

Correspondents: Resolving multiple accounts into a single entry

Actions: do not transfer, restrict, reviewed

New features in the 1.1 release (August 2015)
• Upload of CSV files of email addresses for matching with existing archive
• Search by date and date range

Future Roadmap
• Enhance Natural Language Processing Capability
• Enhance the Processing Module Features
• Enhance the Discovery/Delivery Module Features
• Recommend and Test Preservation Strategy
• Collaboration with other Platforms & Services
• Explore Sustainability Model
• Add Restriction Management/Annotation Functions
• Enhance the Error Handling Capability

https://library.stanford.edu/projects/epadd
https://epadd.nimeyo.com/
@e_padd
epadd_project@stanford.edu
Glynn Edwards: gedwards@Stanford.edu
Peter Chan: pchan3@Stanford.edu
Josh Schneider: josh.Schneider@Stanford.edu
http://epadd.stanford.edu/epadd/collections

3. Developed and funded by:

5. Appraisal Module

10. Processing Module

16. Disambiguation of names

17. Discovery & Delivery (Access)

19. Query generator

Editor's Notes

Since we held a demo two days ago, I won’t go into a lot of detail today. Instead I’ll talk a little about our motivation and commitment to the project as well as describe a little bit about our process.
The ePADD project sprang out of a real problem we were facing 5 years ago during our tenure on the AIMS Project. The one email collection we tackled during that project was quickly followed by several others. As archivists, we once again focused on the parts of the lifecycle that we knew best: pre-accessioning, processing, access and discovery. We have actually been involved with ePADD design for the past 3 years; the initial year was spent on creating functional specifications, interviewing different stakeholders, and building the pilot site for Discovery. One caveat: planning and management of the ePADD project is done by three staff in our Department: myself, Peter Chan and Josh Schneider. This is in addition to our other work and is not covered by any grant funding. It often requires a significant amount of our time, so it helps a great deal to be invested and to have strong departmental support.
Input from colleagues in different disciplines within our institutions, as well as from external colleagues (particularly our collaborators), sparked in-depth discussions that shaped our initial development and planning. We applied for and received a two-year NHPRC grant, which we supplemented with internal SUL funds (40K). The latter were used for specific needs, such as developing a pilot online discovery site and designing a new UI towards the end of the development cycle.
The overall design of the program grew out of use cases at Stanford and SUL policies. Our overriding goals from the very beginning were to make email archives discoverable and accessible, and to keep the software open source. Separate modules were created for specific functional activities, such as Pre-Accessioning (or collection development), Accessioning, Processing, Discovery and Delivery (Access). Preservation was outside the scope of the project.
I’d like to describe one early “use case”: In the early design stages, I met with a donor who, in the course of their work, corresponded with 10 or more whistleblowers at various companies and government agencies. All but one of those correspondents demanded that all of their messages be deleted before the archive was transferred to SUL, while one wanted a 20-year restriction on access. This is one of the reasons why a separate Accessioning Module was designed with much of the same functionality as the Processing Module. It’s not that we expect creators will use it a great deal, but it gives an institution the capability of working with a creator on the initial review if they are willing or if it is necessary.
One thing to note during initial acquisition is that all folders or only specific folders can be selected during the appraisal phase. So, if a group or individual only wanted to send you specific emails, they might create an “archive” folder and send it to you periodically for their archives. The user can also add other email accounts. ePADD performs many automated processes during ingestion: it de-dupes messages, extracts entities, performs regular expression searches, and resolves names of correspondents, merging multiple email addresses into one.
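The de-duplication step can be pictured with a minimal sketch. This is an illustration in Python using the standard `email` module, not ePADD's own Java code. The rule for what counts as a duplicate (same Message-ID, otherwise a hash of a few headers plus the body) is an assumption, and `dedupe_key` and `dedupe` are hypothetical names.

```python
import hashlib
from email import message_from_string
from email.message import Message


def dedupe_key(msg: Message) -> str:
    """Return a key that is equal for messages treated as duplicates.

    Assumption: a Message-ID, when present, identifies a message; otherwise
    a hash of a few headers plus the body is used. ePADD's rule may differ.
    """
    message_id = msg.get("Message-ID")
    if message_id:
        return "id:" + message_id.strip()
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    basis = "\n".join([msg.get("From", ""), msg.get("Date", ""),
                       msg.get("Subject", ""), body])
    return "sha:" + hashlib.sha256(basis.encode("utf-8")).hexdigest()


def dedupe(raw_messages):
    """Keep the first copy of each message, preserving input order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        key = dedupe_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Keeping the first copy and preserving order means the result does not depend on how many times a folder was exported, which is the usual reason duplicates appear when several accounts or folders are ingested together.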
I’d like to point out that the ePADD programming team, Vihari and Sudheendra, developed a custom NLP toolkit used for entity extraction and disambiguation in the archives, as Apache OpenNLP proved insufficient for our work.
Name Resolution: ePADD automatically merges identities for a single correspondent by intelligently analyzing headers. In order to improve the functioning of the actions that depend upon this behavior, ePADD allows the user to confirm or correct the identities of correspondents that ePADD has resolved through its analysis.
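One way to picture merging several addresses into a single correspondent is a union-find over the names and addresses seen in headers. The sketch below is Python, not ePADD's code, and its merge rule (two addresses belong together when they were seen with the same display name) is a deliberately simple assumption; `resolve_correspondents` is a hypothetical name and the sample addresses are made up.

```python
from collections import defaultdict
from email.utils import getaddresses


def resolve_correspondents(header_values):
    """Group email addresses that appear to belong to one correspondent.

    Assumption: addresses are merged when they share a display name
    (case-insensitive). Real header analysis would use more evidence.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for name, address in getaddresses(header_values):
        address = address.lower()
        if not address:
            continue
        find(("addr", address))  # register the address even without a name
        if name.strip():
            normalized = " ".join(name.lower().split())
            union(("name", normalized), ("addr", address))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "addr":
            groups[find(node)].add(node[1])
    return sorted(sorted(group) for group in groups.values())
```

Because automatic merging of this kind can be wrong (two people may share a name), a review step in which the user confirms or corrects the groups remains necessary.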
Actions can be performed in the Accessioning or Processing modules against individual messages or sets of messages, based on search results or a facet such as a correspondent.
In the Processing Module (as in other modules), messages can be reviewed individually or in bulk by any browse or search terms. Searching is done against the full text of the archive and attachments, or by lexicon.
In the Processing Module, the results and analysis of the archive (carried over from the Accessioning Module) are displayed again in an overview page.
Regular expressions are automatically searched against the archive. The file of expressions is also editable if you have other ID numbers that need to be searched.
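A minimal sketch of a regular expression search for ID numbers follows. It is written in Python for illustration; the two patterns are hypothetical examples and are not taken from ePADD's editable file. The Luhn checksum is an added assumption, used here only to discard digit runs that cannot be card numbers.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str):
    """Return (label, matched_text) pairs for each pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(0)
            if label == "card_number" and not luhn_ok(value):
                continue
            hits.append((label, value))
    return hits
```

Adding another kind of ID number then amounts to adding one more entry to `PATTERNS`, which mirrors the idea of an editable expression file.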
The user can select the Sensitive messages under the Browse Menu screen to view the results of this search.
ePADD allows the user to choose from default or user-generated lexicons, which can be used for a variety of purposes, including searching for personal or confidential information, or formulating complex searches by categories.
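A lexicon can be thought of as a mapping from category names to terms. The Python sketch below assumes that representation and whole-word, case-insensitive matching; the categories, the terms, and the names `compile_lexicon` and `categorize` are hypothetical and do not reproduce any lexicon shipped with ePADD.

```python
import re

# Hypothetical lexicon: category name -> list of terms.
LEXICON = {
    "health": ["diagnosis", "surgery", "prescription"],
    "finance": ["invoice", "bank account", "tax return"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word pattern per category."""
    compiled = {}
    for category, terms in lexicon.items():
        # Longest terms first so multi-word terms win over their prefixes.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        compiled[category] = re.compile(
            r"\b(?:" + alternatives + r")\b", re.IGNORECASE
        )
    return compiled


def categorize(messages, lexicon):
    """Map each category to the ids of messages containing any of its terms.

    `messages` maps a message id to its text.
    """
    compiled = compile_lexicon(lexicon)
    result = {category: [] for category in lexicon}
    for message_id, text in messages.items():
        for category, pattern in compiled.items():
            if pattern.search(text):
                result[category].append(message_id)
    return result
```

The same structure serves both uses described above: a lexicon of sensitive terms flags messages for review, while a thematic lexicon groups messages by subject.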
In this way, the archivist can add categories to create thematic access to the corpus – similar to creating “series” in a finding aid.
ePADD uses algorithms to help the archivist or researcher understand context while reading a message. In this example, the first name Ellie is underlined in red (this is taken from the discovery module; note that the full text is not available). ePADD analyzes the occurrences of Ellie throughout the archive with respect to the accompanying text and headers of this message. The colored bar underneath each name indicates the likelihood of that association based on this analysis, as a relevance ranking. Here Ellie Dorfman is the top choice. The envelope signifies that there is correspondence from those individuals in the archive. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive's contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
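The idea of a relevance ranking can be illustrated with a toy scoring rule. The sketch below is Python and is not ePADD's algorithm: it assumes that each occurrence of a full name in the archive is recorded together with the addresses on that message, and it scores a candidate by how many participants its occurrences share with the message being read. `rank_candidates` and the sample data are hypothetical.

```python
from collections import Counter


def rank_candidates(mention, message_participants, entity_mentions):
    """Rank full names that could expand a short mention such as a first name.

    entity_mentions: list of (full_name, participants) pairs, one per place
    the full name occurs in the archive; participants is a set of addresses.
    Assumed score: 1 per occurrence, plus the number of participants that
    occurrence shares with the current message. Scores are normalized to
    sum to 1 so they can be drawn as bars.
    """
    scores = Counter()
    for full_name, participants in entity_mentions:
        parts = full_name.lower().split()
        if parts and parts[0] == mention.lower():
            scores[full_name] += 1 + len(participants & message_participants)
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(name, score / total) for name, score in ranked]
```

A candidate who repeatedly appears alongside the same correspondents as the current message rises to the top, which is the kind of contextual judgment a human reader makes without noticing.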
It is SUL’s policy not to deliver the full text of email archives online at this time, primarily because of third-party privacy issues. We also do not have an online registration form or “virtual” reading room at this time. I’m sure this will change in the future.
So our Discovery Environment needed to allow access only to extracted entities – people, organizations, places – and partial headers. For access to full-text of unrestricted emails, patrons would need to visit our physical reading room. There they have access to full text and all attachments.
One last bit of functionality for now: The query generator is a powerful tool for anyone, whether creators, archivists or researchers. It allows you to input a large body of text that will be searched against the archive. ePADD performs a bulk search of entities in your text and compares them to those in the email corpus. Results are highlighted in yellow. When you hover over a hit, the pop-up window displays a short list of results, which you can click on to go to the original messages.
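The bulk comparison can be sketched as matching the input text against an index of entities already extracted from the corpus. The Python below is an illustration under that assumption, not ePADD's implementation; `entity_index` (entity name to message ids), `build_matcher` and `query_text` are hypothetical.

```python
import re


def build_matcher(entity_index):
    """Compile one pattern that matches any known entity, longest first."""
    names = sorted(entity_index, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def query_text(text, entity_index):
    """Find corpus entities mentioned in `text`.

    Returns (start, end, entity, message_ids) tuples; the offsets say where
    to highlight, and message_ids are what a pop-up list could link to.
    """
    if not entity_index:
        return []
    lookup = {name.lower(): name for name in entity_index}
    hits = []
    for match in build_matcher(entity_index).finditer(text):
        name = lookup[match.group(0).lower()]
        hits.append((match.start(), match.end(), name, entity_index[name]))
    return hits
```

Trying longer names first means "Stanford University" is reported as one entity when both it and "Stanford" are in the index.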
UPDATE/CURRENT STATUS: The NHPRC grant has been completed, and Version 1.0 was released at the beginning of July, when our project site went live.
In late August or September, Release 1.1 will come out with a few add-ons in addition to fixes. It is currently being tested by the ePADD team. I have one last use case for you today: just before we went live with the software and discovery platform, one of our donors decided to send us a list of 300+ names that he wanted to flag for either restriction or removal. This led our team to create an add-on (part of release 1.1): the ability to upload a CSV file of correspondents and take bulk actions (restrict/do not transfer), which speeds up this activity tremendously.
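A bulk action driven by a CSV file can be sketched as follows. This is a Python illustration, not the ePADD feature itself: the CSV layout (a header row `email,action`), the reduction of each message to its sender address, and the name `apply_bulk_actions` are all assumptions.

```python
import csv
import io

ALLOWED_ACTIONS = {"restrict", "do not transfer"}


def apply_bulk_actions(csv_text, messages):
    """Flag every message whose sender appears in the CSV.

    Assumed CSV layout (hypothetical): header row "email,action".
    `messages` maps a message id to its sender address.
    Returns a dict of message id -> action for the flagged messages.
    """
    actions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        address = (row.get("email") or "").strip().lower()
        action = (row.get("action") or "").strip().lower()
        if address and action in ALLOWED_ACTIONS:
            actions[address] = action
    return {
        message_id: actions[sender.lower()]
        for message_id, sender in messages.items()
        if sender.lower() in actions
    }
```

Rows with an unrecognized action are skipped rather than guessed at, so a typo in the donor's list cannot silently restrict or remove messages.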
Special Collections @ SUL has applied for another grant to continue the development of ePADD and, as part of that application, has drafted a roadmap of future enhancements. If the grant is awarded, we would welcome input and suggestions on this documentation. Some specific examples would be the ability to redact data that is highlighted, such as SS#, CC#, etc.; to allow cross-collection searching and browsing; to offer full-text discovery when policy allows; and to export header information for social network analysis.
As you will have noticed - one area ePADD does not address currently is preservation. This is part of the future roadmap.
We would like to work with other open-source projects on the preservation aspect.
If you have any questions, please visit the project website or contact us.
