Overview and update on ePADD software development from Stanford University Library's Special Collections Dept. (released July 2015) with some notes about future development.
7. ePADD Technical Information
ePADD is written in Java and JavaScript and powered by Apache Tomcat (v7.0) using the Java EE Servlet API
(v3.x) and JavaMail (v1.4.2). Text and metadata extraction, indexing and retrieval are performed by Apache
Lucene (v4.7) and Apache Tika (v1.8). Charting and visualization are supported by the D3-based
reusable chart library (v0.4.10). Oracle's Java Application Bundler and Launch4j are used for packaging
on Mac and Windows platforms respectively. Other Java libraries from Apache (Lang, Commons, CLI, IO,
Logging, etc.) are also used. JSON formatting is performed with the libraries org.json and Gson.
ePADD has implemented its own natural language processing (NLP) toolkit, which is used for
named entity extraction, disambiguation and other tasks. This toolkit supplants the Apache
OpenNLP library used in earlier beta versions of the ePADD software: OpenNLP proved
insufficient for our needs (at least for name recognition), and after various rounds of
customization, we built our own named entity recognizer. We continue to use Muse as
an internal library within ePADD. The toolkit uses external datasets such as
Wikipedia/DBpedia, Freebase, GeoNames, OCLC FAST and LC Subject Headings/LC
Name Authority File.
The project is developed with IDEs like IntelliJ IDEA and Eclipse, built with Apache Maven, Ant, and custom
shell scripts, and managed with Git for source control and issue tracking. The ePADD software client is
browser-based and compatible with Chrome and Firefox. It is optimized for Windows 7 and OS X
10.9/10.10 machines, using Java 7 or 8.
21. 1.1 release - August 2015
New features
• Upload of CSV files of email addresses for matching with existing archive
• Search by Date and Date Range
22. Future Roadmap
• Enhance Natural Language Processing Capability
• Enhance the Processing Module Features
• Enhance the Discovery/Delivery Module Features
• Recommend and Test Preservation Strategy
• Collaboration with other Platforms & Services
• Explore Sustainability Model
• Add Restriction Management/Annotation Functions
• Enhance the Error Handling Capability
Since we held a demo two days ago, I won’t go into a lot of detail today. Instead I’ll talk a little about our motivation and commitment to the project as well as describe a little bit about our process.
The ePADD project sprang out of a real problem we were facing five years ago during our tenure on the AIMS Project. The one email collection we tackled during that project was quickly followed by several others.
As archivists, we once again focused on the parts of the lifecycle that we knew best: pre-accessioning, processing, access and discovery.
We have actually been involved with ePADD design for the past 3 years – the initial year was spent on creation of functional specifications, interviews with different stakeholders, and building the pilot site for Discovery.
One caveat: Planning and management of the ePADD project is done by three staff in our Department – myself, Peter Chan and Josh Schneider. This is in addition to our other work and is not covered by any grant funding. It often requires a significant amount of our time – it helps a great deal to be invested and have strong departmental support.
Input from colleagues in different disciplines within our institutions, as well as from external colleagues (particularly our collaborators), sparked in-depth discussions that shaped our initial development and planning.
We applied for and received a two-year NHPRC grant, which we supplemented with other internal SUL funds (40K).
The latter were used for specific needs, such as:
developing a pilot online discovery site
and designing a new UI towards the end of the development cycle
The overall design of the program grew out of use cases at Stanford and SUL policies.
Our overriding goals from the very beginning were to:
make email archives discoverable and accessible
keep the software open source
Separate modules were created for specific functional activities, such as: Pre-Accessioning (or collection development), Accessioning, Processing, Discovery and Delivery (Access).
Preservation was outside the scope of the project.
I’d like to describe one early “use case”:
In the early design stages, I met with a donor who, in the course of their work, corresponded with 10 or more whistleblowers at various companies and government agencies. All but one of those correspondents demanded that all of their messages be deleted before the archive was transferred to SUL; the remaining one wanted a 20-year restriction on access.
This is one of the reasons why a separate Accessioning Module was designed with much of the same functionality as the Processing Module. It’s not that we expect creators will use it a great deal – but it gives an institution the capability of working with a creator on the initial review if they are willing or it is necessary.
One thing to note during initial acquisition is that all or specific folders can be selected during the appraisal phase. So, if a group or individual only wanted to send you specific emails, they might create an “archive” folder – and send it to you periodically for their archives.
The user also has the possibility of adding other email accounts.
ePADD performs many automated processes during ingestion: it de-dupes messages, extracts entities, performs regular expression searches, and resolves the names of correspondents, merging multiple email addresses into one.
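To make the de-duplication step concrete, here is a minimal Python sketch. It is not ePADD's code (ePADD is written in Java), and the matching rule is an assumption made for illustration: two messages count as duplicates if they share a Message-ID header or, when that header is missing, if their sender, date, subject and body hash to the same value. The function names `fingerprint` and `dedupe` are hypothetical.

```python
import hashlib
from email import message_from_string


def fingerprint(msg):
    """Return a key that is equal for messages we treat as duplicates.

    Assumption for this sketch: prefer the Message-ID header; otherwise
    hash the sender, date, subject and body together.
    """
    message_id = msg.get("Message-ID")
    if message_id:
        return "id:" + str(message_id).strip()
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: fall back to the full serialized form.
        body = msg.as_string()
    fields = [str(msg.get(h, "")) for h in ("From", "Date", "Subject")]
    fields.append(body)
    joined = "\x1f".join(fields).encode("utf-8", errors="replace")
    return "sha256:" + hashlib.sha256(joined).hexdigest()


def dedupe(messages):
    """Keep the first copy of each message, preserving the input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

A call such as `dedupe([message_from_string(raw) for raw in raw_messages])` would return one copy of each distinct message, where `raw_messages` is a hypothetical list of RFC 822 strings.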
I’d like to point out that the ePADD programming team, Vihari and Sudheendra, developed a custom NLP toolkit used for entity extraction and disambiguation in the archives, as Apache OpenNLP proved insufficient for our work.
Name Resolution:
ePADD automatically merges identities for a single correspondent by intelligently analyzing headers.
In order to improve the functioning of the actions that depend upon this behavior, ePADD allows the user to confirm or correct the identities of correspondents that ePADD has resolved through its analysis.
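One simple way to picture this kind of merging is a union-find over the names and addresses seen in headers. The Python sketch below is an illustration under stated assumptions, not ePADD's algorithm: it treats two addresses as the same correspondent only when they appear with the same display name (after lowercasing and flipping "Last, First"). The function name `resolve_identities` and the sample addresses are hypothetical.

```python
from email.utils import parseaddr


def normalize_name(name):
    """Lowercase a display name and turn 'Last, First' into 'first last'."""
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    return " ".join(name.lower().split())


def resolve_identities(header_values):
    """Group email addresses that share a display name (sketch only)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for value in header_values:
        name, addr = parseaddr(value)
        addr = addr.lower()
        if not addr:
            continue
        find(("addr", addr))  # register the address even without a name
        if name:
            union(("addr", addr), ("name", normalize_name(name)))

    groups = {}
    for node in list(parent):
        if node[0] == "addr":
            groups.setdefault(find(node), set()).add(node[1])
    return sorted(sorted(group) for group in groups.values())
```

Because automatic grouping like this can be wrong in both directions (two people with the same name, one person with two spellings), a manual confirm-or-correct step of the kind described above remains important.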
Actions can be performed in the Accessioning or Processing modules against individual messages or sets of messages, based on search results or on a facet such as a correspondent.
In the Processing Module (as in other modules), messages can be reviewed individually or in bulk by any browse or search terms.
Searches run against the full text of the archive and its attachments, or by lexicon.
In the Processing Module, the results of the analysis carried over from the Accessioning Module are displayed again on an overview page.
Regular expressions are automatically run against the archive. The file that defines them is also editable if you have other ID numbers that need to be searched.
The user can select the Sensitive messages under the Browse Menu screen to view the results of this search.
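As an illustration of this kind of pattern-based search, the Python sketch below scans text for strings shaped like US Social Security and credit card numbers. The two patterns are assumptions written for this example; they are not the expressions shipped in ePADD's editable file, and a production pattern set would need to be stricter (for instance, adding a checksum test for card numbers).

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}


def find_sensitive(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

Adding another ID format is then a matter of adding one more entry to `PATTERNS`, which mirrors the idea of an editable pattern file.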
ePADD allows the user to choose from default or user-generated lexicons, which can be used for a variety of purposes, including searching for personal or confidential information, or formulating complex searches by categories.
In this way, the archivist can add categories to create thematic access to the corpus – similar to creating “series” in a finding aid.
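A lexicon can be thought of as a mapping from category names to lists of terms. The Python sketch below shows that idea under simple assumptions (case-insensitive, whole-word matching); the category names, terms, and the functions `compile_lexicon` and `categorize` are hypothetical and do not reflect ePADD's lexicon file format.

```python
import re


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word regex per non-empty category."""
    compiled = {}
    for category, terms in lexicon.items():
        if terms:
            alternation = "|".join(re.escape(term) for term in terms)
            compiled[category] = re.compile(
                r"\b(?:" + alternation + r")\b", re.IGNORECASE
            )
    return compiled


def categorize(messages, lexicon):
    """Map each category to the ids of the messages that mention its terms."""
    compiled = compile_lexicon(lexicon)
    result = {category: [] for category in lexicon}
    for message_id, text in messages.items():
        for category, pattern in compiled.items():
            if pattern.search(text):
                result[category].append(message_id)
    return result
```

Each category then acts as a browsable grouping of messages, which is what makes the comparison to "series" in a finding aid apt.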
ePADD uses algorithms to help the archivist or researcher understand context while reading a message.
In this example, the first name Ellie is underlined in red (this is taken from the Discovery Module; note that the full text is not available there).
ePADD analyzes the occurrences of Ellie throughout the archive with respect to accompanying text and headers of this message.
The colored bar underneath each name indicates the likelihood of that association based on this analysis – a relevance ranking. Here Ellie Dorfman is the top choice.
The envelope signifies that there is correspondence from those individuals in the archive.
This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive's contents better.
If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
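To give a feel for context-based ranking, here is a deliberately simple Python sketch. It is not ePADD's algorithm, whose details are not described here. The assumption is that a candidate full name scores higher the more often it appears elsewhere in the archive alongside the people attached to the current message. The function `rank_candidates` and all names other than Ellie Dorfman are hypothetical.

```python
from collections import Counter


def rank_candidates(first_name, message_people, archive):
    """Rank full names matching first_name by shared context.

    message_people: set of full names attached to the current message.
    archive: list of sets, one per message, of the full names it contains.
    Returns (name, share_of_total_score) pairs, best candidate first.
    """
    target = first_name.lower()
    scores = Counter()
    for people in archive:
        for person in people:
            if person.lower().split()[0] == target:
                # One point for the mention itself, plus one point for each
                # person it shares with the current message.
                scores[person] += 1 + len(message_people & people)
    total = sum(scores.values())
    if total == 0:
        return []
    return [(name, score / total) for name, score in scores.most_common()]
```

The normalized score plays the role of the colored bar: it is a relative ranking among candidates, not a probability that the match is correct.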
It is SUL’s policy not to deliver the full text of email archives online at this time, primarily because of third-party privacy issues.
And, we also do not have an online registration form or “virtual” reading room at this time. I’m sure this will change in the future.
So our Discovery Environment needed to allow access only to extracted entities – people, organizations, places – and partial headers.
For access to full-text of unrestricted emails, patrons would need to visit our physical reading room. There they have access to full text and all attachments.
One last bit of functionality for now:
The query generator is a powerful tool for anyone – creators, archivists or researchers. It allows you to input a large block of text that will be searched against the archive. ePADD performs a bulk search of the entities in your text and compares them to those in the email corpus.
Results are highlighted in yellow. When you hover over a hit, a pop-up window displays a short list of results, which you can click on to go to the original messages.
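The bulk comparison can be sketched in Python as follows. This is a simplification, not ePADD's method: instead of extracting entities from the pasted text, it looks up each entity already known from the archive in that text. The index structure (entity name mapped to message ids), the function `match_entities`, and the sample ids are assumptions made for the example.

```python
import re


def match_entities(text, entity_index):
    """Return the archive entities mentioned in text with their message ids.

    entity_index: hypothetical mapping of entity name -> list of message ids.
    """
    hits = {}
    for entity, message_ids in entity_index.items():
        pattern = r"\b" + re.escape(entity) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            hits[entity] = message_ids
    return hits
```

Each key in the result corresponds to a term that would be highlighted, and its list of ids corresponds to the messages offered in the pop-up.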
UPDATE/CURRENT STATUS
The NHPRC grant has been completed, and Version 1.0 was released at the beginning of July, when our project site went live.
In late August or Sept., Release 1.1 will come out with a few add-ons in addition to fixes. It is currently being tested by the ePADD team.
I have one last use case for you today:
Just before we went live with the software and discovery platform, one of our donors decided to send us a list of 300+ names that he wanted to flag for either restriction or removal.
This led our team to create an add-on (part of release 1.1) – which is the ability to upload a CSV file of correspondents to take bulk actions (restrict/do not transfer) – speeding up this activity tremendously.
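The idea behind the bulk action can be sketched in a few lines of Python. This is an illustration only: the one-address-per-row CSV layout, the message dictionaries with `id`, `from` and `to` keys, and the functions `load_addresses` and `flag_messages` are all assumptions, not ePADD's actual file format or data model.

```python
import csv
import io


def load_addresses(csv_text):
    """Read the first column of each CSV row as a lowercased email address."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip().lower() for row in reader if row and row[0].strip()}


def flag_messages(messages, addresses, action):
    """Apply action (e.g. 'restrict') to messages involving any listed address.

    Returns the ids of the messages that were flagged.
    """
    flagged = []
    for message in messages:
        participants = {message["from"].lower()}
        participants.update(address.lower() for address in message["to"])
        if participants & addresses:
            message["action"] = action
            flagged.append(message["id"])
    return flagged
```

With a list of 300+ names, a single pass like this replaces hundreds of individual searches, which is where the time saving comes from.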
Special Collections @ SUL has applied for another grant to continue the development of ePADD, and as part of that application has drafted a roadmap of future enhancements. If the grant is awarded, we would welcome input and suggestions on this documentation.
Some specific examples here would be the ability to:
redact data that is highlighted, such as SS#, CC# etc.
allow cross-collection searching and browsing
allow full-text discovery when policy allows
allow export of header information for social network analysis
As you will have noticed, one area ePADD does not currently address is preservation.
This is part of the future roadmap.
We would like to work with other open-source projects on the preservation aspect.
If you have any questions, please visit the project website or contact us.