Archival Stewardship of Email using ePADD Software

Glynn Edwards, 25 slides, Aug 24, 2015

About This Presentation

An overview of and update on the ePADD software (released July 2015), developed by Stanford University Library's Special Collections Dept., with some notes about future development.


Slide Content

Archival Stewardship of Email using ePADD Software
Glynn Edwards, Director, ePADD Project
SAA – August 22, 2015

Developed and funded by:

ePADD program

Appraisal Module

ePADD Technical Information

ePADD is written in Java and JavaScript and powered by Apache Tomcat (v7.0), using the Java EE Servlet API (v3.x) and JavaMail (v1.4.2). Text and metadata extraction, indexing, and retrieval are performed by Apache Lucene (v4.7) and Apache Tika (v1.8). Charting and visualization are supported by the D3-based reusable chart library (v0.4.10). Oracle's Java Application Bundler and Launch4J are used for packaging on Mac and Windows platforms, respectively. Other Java libraries from Apache (Lang, Commons, CLI, IO, Logging, etc.) are also used. JSON formatting is performed with the org.json and Gson libraries.

ePADD has implemented its own natural language processing (NLP) toolkit, which is used for named entity extraction, disambiguation, and other tasks. This toolkit supplants the Apache OpenNLP library used in earlier beta versions of the ePADD software. We continue to use Muse as an internal library within ePADD. However, Apache OpenNLP proved insufficient for our needs (at least for name recognition), and after various rounds of customization, we built our own named entity recognizer. This toolkit uses external datasets such as Wikipedia/DBpedia, Freebase, GeoNames, OCLC FAST, and LC Subject Headings/LC Name Authority File.

The project is developed with IDEs such as IntelliJ IDEA and Eclipse, built with Apache Maven, Ant, and custom shell scripts, and tracked using Git for source control and issue tracking. The ePADD software client is browser-based and compatible with Chrome and Firefox. It is optimized for Windows 7 and OS X 10.9/10.10 machines, using Java 7 or 8.
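To make the "text and metadata extraction" step concrete, here is a minimal sketch of pulling header metadata out of a single raw message. It uses Python's standard email package as a stand-in. It is not ePADD's JavaMail/Tika code, and the function name extract_metadata and the sample addresses are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_metadata(raw_message):
    """Return sender, recipients, date, and subject from one raw message.

    Illustrative only: a stand-in for the header extraction that ePADD
    performs with JavaMail, not a port of its code.
    """
    msg = message_from_string(raw_message)
    # getaddresses splits header values into (display name, address) pairs.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    sender = getaddresses([msg.get("From", "")])
    date_header = msg.get("Date")
    return {
        "from": sender[0] if sender else ("", ""),
        "to": recipients,
        "date": parsedate_to_datetime(date_header) if date_header else None,
        "subject": msg.get("Subject", ""),
    }


if __name__ == "__main__":
    raw = (
        "From: Jane Doe <jane@example.org>\n"
        "To: a@example.org, Bob <bob@example.org>\n"
        "Date: Sat, 22 Aug 2015 10:00:00 +0000\n"
        "Subject: Hello\n"
        "\n"
        "Body text\n"
    )
    print(extract_metadata(raw))
```

Fields extracted this way (names, addresses, dates) are the kind of metadata a search index would then store alongside the message text.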

Correspondents: Resolving multiple accounts into single entry
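One way to picture this resolution step is as a grouping problem: any two (name, address) pairs that share an address or a normalized display name are merged into one correspondent entry. The sketch below uses a union-find structure for that. It is an assumption about how such merging could work, not ePADD's actual algorithm, and the names normalize_name and resolve_correspondents are hypothetical.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase, turn 'Last, First' into 'first last', collapse whitespace."""
    name = name.strip().strip('"')
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.lower().split())


def resolve_correspondents(pairs):
    """Group (display_name, email) pairs into one entry per correspondent.

    Two pairs end up in the same entry when they share an email address
    or a normalized display name (union-find over both kinds of key).
    """
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a, b):
        parent[find(a)] = find(b)

    for name, email in pairs:
        email_key = ("email", email.strip().lower())
        find(email_key)  # register the address even if the name is empty
        norm = normalize_name(name)
        if norm:
            union(("name", norm), email_key)

    entries = defaultdict(lambda: {"names": set(), "emails": set()})
    for key in list(parent):
        kind, value = key
        field = "names" if kind == "name" else "emails"
        entries[find(key)][field].add(value)
    return list(entries.values())


if __name__ == "__main__":
    sample = [
        ("Jane Doe", "jd@example.org"),
        ("Doe, Jane", "jane@example.com"),
        ("", "jd@example.org"),
        ("John Roe", "jr@example.org"),
    ]
    for entry in resolve_correspondents(sample):
        print(sorted(entry["names"]), sorted(entry["emails"]))
```

Matching on exact normalized names is deliberately naive: two different people with the same name would be merged, which is one reason a manual review step remains useful.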

Actions: do not transfer, restrict, reviewed

Processing Module

Disambiguation of names

Discovery & Delivery (Access)

Query generator

1.1 release - August 2015: New features
Upload of CSV files of email addresses for matching with existing archive
Search by Date and Date Range
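The two features can be illustrated with a short sketch: read addresses from an uploaded CSV, intersect them with the addresses already in an archive, and filter messages by an inclusive date range. The CSV layout (one address in the first column), the message dictionaries, and all function names are assumptions made for illustration, not ePADD's file format or API.

```python
import csv
import io
from datetime import date


def load_addresses(csv_text):
    """Read email addresses from CSV text, taking the first column of each row.

    Rows whose first cell has no '@' (for example a header row) are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip().lower() for row in reader if row and "@" in row[0]}


def match_addresses(uploaded, archive_addresses):
    """Return the uploaded addresses that also occur in the archive, sorted."""
    archive = {a.strip().lower() for a in archive_addresses}
    return sorted(uploaded & archive)


def in_date_range(messages, start=None, end=None):
    """Keep messages whose 'date' lies in [start, end]; a bound may be None."""
    return [
        m for m in messages
        if (start is None or m["date"] >= start)
        and (end is None or m["date"] <= end)
    ]


if __name__ == "__main__":
    uploaded = load_addresses("email\nA@Example.org\nb@example.org\n")
    print(match_addresses(uploaded, ["a@example.org", "c@example.org"]))
    messages = [
        {"id": 1, "date": date(2014, 5, 1)},
        {"id": 2, "date": date(2015, 8, 22)},
    ]
    print(in_date_range(messages, start=date(2015, 1, 1)))
```

Lowercasing both sides before comparison keeps the match case-insensitive, which is the usual expectation for email addresses in practice.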

Future Roadmap
Enhance Natural Language Processing Capability
Enhance the Processing Module Features
Enhance the Discovery/Delivery Module Features
Recommend and Test Preservation Strategy
Collaboration with other Platforms & Services
Explore Sustainability Model
Add Restriction Management/Annotation Functions
Enhance the Error Handling Capability

https://library.stanford.edu/projects/epadd
https://epadd.nimeyo.com/
@e_padd
[email protected]
Glynn Edwards [email protected]
Peter Chan [email protected]
Josh Schneider [email protected]
http://epadd.stanford.edu/epadd/collections