Tthornton code4lib

EAD without XSLT
a practical approach to archival finding aids

Trevor Thornton
Senior Applications Developer, NYPL Labs
The New York Public Library

Project goals
• Enable multiple presentations of
the same data

• Support dynamic web applications

• Cross-collection search with
component-level specificity in
results, and faceting on common
access points

System overview
Ruby on Rails
+ MySQL
+ Solr

Key functionality:
Data Import
Search index
API

Collection model

Each collection:
• must have one description
• may have one or more components
• may be associated with one or more access terms

Component model
Each component:
• must belong to one collection
• must have one description
• may have one parent component
• may have one or more child components
• may be associated with one or more access terms

Component hierarchy attributes
• collection_id (id of root collection)
• parent_id (id of parent component)
• sib_seq (sibling sequence)
• level_num (numeric level within hierarchy)
• level_text (series, sub-series, file, etc.)

• has_children
• max_levels
• top_component_id

The last three are computed after initial data import; provided as a
convenience for finding aid UIs and to streamline formulation of API responses
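
The computed attributes could be derived in a single pass once all components are imported. The following is a minimal sketch in plain Ruby, not the project's actual code: it assumes components are available as hashes with `:id`, `:parent_id` and `:level_num`, that `top_component_id` means the top-level ancestor (a top-level component points at itself), and that `max_levels` means the deepest `level_num` in the collection. The method name is hypothetical.

```ruby
# Hypothetical post-import pass over one collection's components.
# Each component is a Hash with :id, :parent_id (nil at top level), :level_num.
def compute_derived_attributes(components)
  by_id = components.to_h { |c| [c[:id], c] }
  parent_ids = components.map { |c| c[:parent_id] }.compact.uniq

  components.each do |c|
    # A component has children if any other component names it as parent.
    c[:has_children] = parent_ids.include?(c[:id])

    # Walk up the parent chain to find the top-level ancestor.
    top = c
    top = by_id.fetch(top[:parent_id]) while top[:parent_id]
    c[:top_component_id] = top[:id]
  end

  # Assumed meaning of max_levels: deepest level_num in the collection.
  components.map { |c| c[:level_num] }.max || 0
end

components = [
  { id: 1, parent_id: nil, level_num: 1 },  # series
  { id: 2, parent_id: 1,   level_num: 2 },  # file within the series
  { id: 3, parent_id: nil, level_num: 1 }   # series with no children
]
max_levels = compute_derived_attributes(components)
```

Storing these values up front means a finding aid UI can decide whether to render an expand control without issuing an extra query per component.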

Description model
Elements of description organized
(roughly) based on ISAD(G):
• Descriptive identity
ISAD(G) 3.1

• Context
ISAD(G) 3.2.1 - 3.2.3

• Acquisition & processing
ISAD(G) 3.2.4, 3.3.2-3.3.3

• Content and structure
ISAD(G) 3.3.1, 3.3.4

• Access and use
ISAD(G) 3.4

• Related material
ISAD(G) 3.5

• Notes
ISAD(G) 3.6

Description model: basic EAD mapping

Description model: JSON format
{
"unitid": [
{ "value": "3283", "type": "local_mss" }
],
"unittitle": [
{ "value": "David Ames Wells papers" }
],
"unitdate": [
{ "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" }
],
"physdesc_extent":[
{ "value": ".5 linear feet", "unit":"linear feet" },
{ "value": "2 boxes", "unit":"containers" }
],
"abstract": [
{ "value": "David Ames Wells was an engineer, economist, textbook author, and advocate for lower tariff rates. This collection contains correspondence with Gordon L. Ford, Worthington C. Ford, and others; clippings; a manuscript draft of Protection: The Poor Man's Friend; and a lecture Wells delivered on free trade in 1882"}
],
"prefercite": [
{ "value": "<p>David Ames Wells papers, Manuscripts and Archives Division, The New York Public Library</p>" }
]
}

EAD as a guide for data storage
• EAD elements that allow only CDATA are stored as
plain strings
• EAD elements that require content to be structured in
<p> or other block elements stored as HTML
• Rules established for converting EAD to HTML
when necessary
• HTML conversion designed to support re-conversion
back to EAD

Special handling for dates
• Dates are hard
o Inclusive dates and bulk dates
o Multiple date formats
o Ranges, lists and both

• Special data structure for dates:
o date_statement (original text)
o inclusive_start / inclusive_end
o bulk_start / bulk_end
o keydate (for ordering query response – earliest inclusive date
or earliest bulk date when present)
o index_dates (for search faceting – every year included in range/list)
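
As an illustration of the date structure above, here is a hedged sketch that builds it from normalized year ranges such as the `normal` value `"1847/1895"` shown earlier. The method names are hypothetical, only simple `start/end` ranges are handled (not lists), and `keydate` follows one reading of the slide: the earliest bulk date when bulk dates are present, otherwise the earliest inclusive date.

```ruby
# Hypothetical helper: turn "1847/1895" (or "1847") into [start, end] years.
def year_range(normal)
  return nil if normal.nil?
  first, last = normal.split("/").map { |part| part[0, 4].to_i }
  [first, last || first]
end

# Hypothetical builder for the special date structure.
def build_date(date_statement, inclusive_normal, bulk_normal = nil)
  inc_start, inc_end = year_range(inclusive_normal)
  bulk_start, bulk_end = year_range(bulk_normal)
  {
    date_statement: date_statement,          # original text, kept verbatim
    inclusive_start: inc_start,
    inclusive_end: inc_end,
    bulk_start: bulk_start,
    bulk_end: bulk_end,
    keydate: bulk_start || inc_start,        # assumed precedence, see above
    index_dates: (inc_start..inc_end).to_a   # every year in the range
  }
end

date = build_date("1847-1895", "1847/1895")
bulk = build_date("1847-1895 (bulk 1860-1870)", "1847/1895", "1860/1870")
```

Indexing every year individually is what lets a facet on a single year or decade match a collection whose range merely spans it.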

Refinement of Access Term/
Access Term Association models

Data import
• It’s messy business
• Bulk of work has focused on EAD;
Nokogiri used extensively for parsing XML
• Basic process for EAD import:
1. Create collection record
2. Extract collection-level data,
create/save description
3. Extract access terms, and for each
a. Save if it doesn’t already exist
b. Save collection/term association
4. Extract top-level components, and for each:
a. Create component record
b. Extract component-level data,
create/save description
c. Extract/save access terms & associations
d. Extract child components and repeat for each

Integration with NYPL digital repository
• Fedora repository
+ custom metadata creation/digitization workflow system
+ API to query repository data
• All records in repository identified with UUID

• UUID of digital object associated with a given component
is stored locally in archives data system
• Best case scenario: common identifiers appear in
archival description and in Fedora

Apache Solr
• Inter- and intra-collection search

• Collocation via faceting and filter queries

• Using RSolr to facilitate interaction with Solr
(for both search and index)
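
A hedged sketch of how a faceted, filtered search request might be assembled. The Solr parameter names (`q`, `fq`, `facet`, `facet.field`, `rows`) are standard; the index field names (`collection_id`, `subject_facet`, `index_dates`) and the method name are assumptions, and filter values are not escaped here.

```ruby
# Hypothetical Solr request parameters for a component-level search.
# Passing collection_id narrows the search to a single collection;
# omitting it searches across all collections.
def search_params(query, collection_id: nil, filters: {})
  # Each selected facet value becomes a filter query, e.g. subject_facet:"Tariff"
  fq = filters.map { |field, value| %(#{field}:"#{value}") }
  fq << "collection_id:#{collection_id}" if collection_id
  {
    q: query,
    fq: fq,
    facet: true,
    "facet.field" => %w[subject_facet index_dates],
    rows: 20
  }
end

params = search_params("free trade", filters: { subject_facet: "Tariff" })
# With RSolr, a hash like this would typically be sent with something like
#   RSolr.connect(url: solr_url).get("select", params: params)
# where solr_url is a placeholder for the Solr core's address.
```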

API
• API development is proceeding in step with finding aid
development – available requests added as needed
• Basic requests:
o Collection-level data
o Components of a collection,
or sub-components of a component
o Includes all component-level descriptive data
o Max. depth can be specified
o Digital assets associated with
a component
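
The "components of a collection" request with an optional maximum depth could be served by nesting flat component records, roughly as below. This is a sketch only: the response shape, key names and method name are assumptions, and depth 1 is taken to mean top-level components.

```ruby
require "json"

# Hypothetical builder for a nested components response.
# records: flat hashes with :id, :parent_id, :sib_seq, :title.
def component_tree(records, parent_id: nil, max_depth: nil, depth: 1)
  # Stop descending once the requested depth has been exceeded.
  return [] if max_depth && depth > max_depth

  children = records.select { |r| r[:parent_id] == parent_id }
                    .sort_by { |r| r[:sib_seq] }
  children.map do |r|
    {
      id: r[:id],
      title: r[:title],
      components: component_tree(records, parent_id: r[:id],
                                 max_depth: max_depth, depth: depth + 1)
    }
  end
end

# Made-up sample records.
records = [
  { id: 1, parent_id: nil, sib_seq: 1, title: "Correspondence" },
  { id: 2, parent_id: 1,   sib_seq: 2, title: "Worthington C. Ford" },
  { id: 3, parent_id: 1,   sib_seq: 1, title: "Gordon L. Ford" },
  { id: 4, parent_id: nil, sib_seq: 2, title: "Clippings" }
]
full_json    = JSON.generate(component_tree(records))
shallow_json = JSON.generate(component_tree(records, max_depth: 1))
```

Passing a component's id as `parent_id` returns its sub-components with the same code path.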

Considerations for future development
• Separate API from data management?
o Data management app to handle all create/update/destroy
operations, while API (Sinatra?) is read-only
o Open API to public? Security/load considerations…

• ArchivesSpace
o NYPL is considering it as a possible replacement for
our existing ‘home-grown’ system
o How would this system integrate with ArchivesSpace API?

• Upcoming EAD revision

some code to look at and/or borrow from:
github.com/nypl/archives_data_public

finding aid prototype:
archives.nypl.org

me:
trevorthornton@nypl.org

NYPL Labs:
nypl.org/labs
