Tthornton code4lib
  • 1. EAD without XSLT: a practical approach to archival finding aids
      Trevor Thornton, Senior Applications Developer, NYPL Labs, The New York Public Library
  • 2. Project goals
      • Enable multiple presentations of the same data
      • Support dynamic web applications
      • Cross-collection search with component-level specificity in results, and faceting on common access points
  • 3. System overview
      Ruby on Rails + MySQL + Solr
      Key functionality:
      • Data import
      • Search index
      • API
  • 4. Core models
  • 5. Collection model
      Each collection:
      • must have one description
      • may have one or more components
      • may be associated with one or more access terms
  • 6. Component model
      Each component:
      • must belong to one collection
      • must have one description
      • may have one parent component
      • may have one or more child components
      • may be associated with one or more access terms
  • 7. Component hierarchy attributes
      • collection_id (id of root collection)
      • parent_id (id of parent component)
      • sib_seq (sibling sequence)
      • level_num (numeric level within hierarchy)
      • level_text (series, sub-series, file, etc.)
      • has_children
      • max_levels
      • top_component_id
      The last three (has_children, max_levels, top_component_id) are computed after initial data import and provided as a convenience for finding aid UIs and to streamline formulation of API responses.
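The post-import computation described on this slide could look roughly like the sketch below. It is not the NYPL code: the `Component` struct and the `compute_hierarchy!` method are hypothetical stand-ins for the MySQL-backed models, and `max_levels` is interpreted here as the deepest level found in the collection, which is a guess at what the slide means.

```ruby
# Hypothetical in-memory stand-in for component rows (the real system uses MySQL).
Component = Struct.new(:id, :parent_id, :sib_seq, :level_num,
                       :has_children, :top_component_id,
                       keyword_init: true)

# Fill in the convenience attributes once every row has been imported.
# Returns the deepest level found, used here as an assumed reading of max_levels.
def compute_hierarchy!(components)
  by_id = components.to_h { |c| [c.id, c] }
  children = components.group_by(&:parent_id)

  components.each do |c|
    # Walk up the parent chain to find the top-level ancestor and the depth.
    depth = 1
    top = c
    while top.parent_id
      top = by_id.fetch(top.parent_id)
      depth += 1
    end
    c.level_num = depth
    c.top_component_id = top.id
    c.has_children = children.key?(c.id)
  end

  components.map(&:level_num).max
end
```

Storing these values rather than deriving them per request trades a one-time pass after import for cheaper reads when a finding aid UI asks for part of the tree.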
  • 8. Description model
      Elements of description organized (roughly) based on ISAD(G):
      • Descriptive identity: ISAD(G) 3.1
      • Context: ISAD(G) 3.2.1 - 3.2.3
      • Acquisition & processing: ISAD(G) 3.2.4, 3.3.2 - 3.3.3
      • Content and structure: ISAD(G) 3.3.1, 3.3.4
      • Access and use: ISAD(G) 3.4
      • Related material: ISAD(G) 3.5
      • Notes: ISAD(G) 3.6
  • 9. Description model: basic EAD mapping
  • 10. Description model: JSON format
      {
        "unitid": [
          { "value": "3283", "type": "local_mss" }
        ],
        "unittitle": [
          { "value": "David Ames Wells papers" }
        ],
        "unitdate": [
          { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" }
        ],
        "physdesc_extent": [
          { "value": ".5 linear feet", "unit": "linear feet" },
          { "value": "2 boxes", "unit": "containers" }
        ],
        "abstract": [
          { "value": "David Ames Wells was an engineer, economist, textbook author, and advocate for lower tariff rates. This collection contains correspondence with Gordon L. Ford, Worthington C. Ford, and others; clippings; a manuscript draft of Protection: The Poor Mans Friend; and a lecture Wells delivered on free trade in 1882" }
        ],
        "prefercite": [
          { "value": "<p>David Ames Wells papers, Manuscripts and Archives Division, The New York Public Library</p>" }
        ]
      }
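Because every element in this format is an array of objects carrying a "value" plus optional attributes, reading it takes very little code. The sketch below parses an abbreviated copy of the slide's record; `values_for` and `find_entry` are hypothetical helper names, not part of the NYPL system.

```ruby
require 'json'

# Abbreviated copy of the description shown on the slide.
DESCRIPTION = JSON.parse(<<~EOS)
  {
    "unittitle": [ { "value": "David Ames Wells papers" } ],
    "unitdate": [ { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" } ],
    "physdesc_extent": [
      { "value": ".5 linear feet", "unit": "linear feet" },
      { "value": "2 boxes", "unit": "containers" }
    ]
  }
EOS

# Hypothetical helper: all display values for one element, in stored order.
# Elements that are absent simply yield an empty list.
def values_for(description, element)
  description.fetch(element, []).map { |entry| entry["value"] }
end

# Hypothetical helper: the first entry of an element whose attribute matches.
def find_entry(description, element, attribute, wanted)
  description.fetch(element, []).find { |entry| entry[attribute] == wanted }
end

puts values_for(DESCRIPTION, "physdesc_extent").join("; ")
puts find_entry(DESCRIPTION, "unitdate", "type", "inclusive")["normal"]
```

Keeping repeatable elements as arrays even when only one value is present means a caller never has to check whether it received a single object or a list.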
  • 11. EAD as a guide for data storage
      • EAD elements that allow only CDATA are stored as plain strings
      • EAD elements that require content to be structured in <p> or other block elements are stored as HTML
      • Rules established for converting EAD to HTML when necessary
      • HTML conversion designed to support re-conversion back to EAD
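The slides do not list the actual conversion rules, so the following is only a toy illustration of the idea that a conversion can be designed to round-trip. The tag mapping is an assumption made for this example, it handles only attribute-less tags, and a real converter would work on a parsed XML tree (the deck mentions Nokogiri) rather than on strings.

```ruby
# Assumed one-to-one tag mapping; because no two EAD tags share an HTML tag,
# the mapping can be inverted to convert back.
EAD_TO_HTML = { "list" => "ul", "item" => "li", "emph" => "em" }.freeze
HTML_TO_EAD = EAD_TO_HTML.invert.freeze

# Rename attribute-less opening and closing tags according to the mapping.
# Tags that are not in the mapping (such as <p>) pass through unchanged.
def rename_tags(markup, mapping)
  markup.gsub(%r{<(/?)(\w+)>}) do
    slash = Regexp.last_match(1)
    name = Regexp.last_match(2)
    "<#{slash}#{mapping.fetch(name, name)}>"
  end
end

ead = "<p>See <emph>Protection</emph></p><list><item>Box 1</item></list>"
html = rename_tags(ead, EAD_TO_HTML)
puts html
puts rename_tags(html, HTML_TO_EAD) == ead
```

The round trip only works while the mapping stays one-to-one; as soon as two EAD elements map to the same HTML element, the original element name has to be preserved some other way, for example in an attribute.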
  • 12. Special handling for dates
      • Dates are hard
        o Inclusive dates and bulk dates
        o Multiple date formats
        o Ranges, lists and both
      • Special data structure for dates:
        o date_statement (original text)
        o inclusive_start / inclusive_end
        o bulk_start / bulk_end
        o keydate (for ordering query response: earliest inclusive date or earliest bulk date when present)
        o index_dates (for search faceting: every year included in range/list)
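A minimal sketch of how the date structure might be filled in once the years have been parsed out of the original statement. `build_date_fields` is a hypothetical name, years are assumed to be plain integers, and the keydate rule is one reading of the slide: use the earliest bulk year when bulk dates exist, otherwise the earliest inclusive year.

```ruby
# inclusive and bulk are optional [start_year, end_year] pairs.
def build_date_fields(date_statement, inclusive: nil, bulk: nil)
  ranges = [inclusive, bulk].compact
  {
    date_statement: date_statement,
    inclusive_start: inclusive&.first,
    inclusive_end: inclusive&.last,
    bulk_start: bulk&.first,
    bulk_end: bulk&.last,
    # Assumed rule: bulk start wins when bulk dates are present.
    keydate: (bulk || inclusive)&.first,
    # Every year covered by any range, for faceting.
    index_dates: ranges.flat_map { |from, to| (from..to).to_a }.uniq.sort
  }
end

# The bulk range below is invented for illustration; only 1847-1895 is from the slides.
fields = build_date_fields("1847-1895 (bulk 1860-1870)",
                           inclusive: [1847, 1895], bulk: [1860, 1870])
puts fields[:keydate]
puts fields[:index_dates].length
```

Expanding a range into individual years makes the index larger, but it lets a year facet match a collection whose range merely spans that year.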
  • 13. Access Term model
  • 14. Refinement of Access Term/Access Term Association models
  • 15. Data import
      • It’s messy business
      • Bulk of work has focused on EAD; Nokogiri used extensively for parsing XML
      • Basic process for EAD import:
        1. Create collection record
        2. Extract collection-level data, create/save description
        3. Extract access terms, and for each:
           a. Save if it doesn’t already exist
           b. Save collection/term association
        4. Extract top-level components, and for each:
           a. Create component record
           b. Extract component-level data, create/save description
           c. Extract/save access terms & associations
           d. Extract child components and repeat for each
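The import steps above are naturally recursive, which the sketch below tries to show. To stay self-contained it does not parse XML: a nested hash stands in for the output of the Nokogiri parsing step, and the `Store` class is a hypothetical in-memory substitute for the MySQL-backed models. The access terms in the sample are made up.

```ruby
# Hypothetical in-memory store standing in for the database-backed models.
class Store
  attr_reader :collections, :components, :terms, :associations

  def initialize
    @collections = []
    @components = []
    @terms = {}         # term text => term id, so each term is saved only once
    @associations = []  # [record_type, record_id, term_id]
  end

  # Steps 3a/3b and 4c: save unseen terms, then record the association.
  def save_terms(record_type, record_id, names)
    names.each do |name|
      term_id = (@terms[name] ||= @terms.size + 1)
      @associations << [record_type, record_id, term_id]
    end
  end
end

# Steps 1 to 4 for a whole collection.
def import_collection(store, source)
  collection = { id: store.collections.size + 1, description: source[:description] }
  store.collections << collection
  store.save_terms(:collection, collection[:id], source.fetch(:terms, []))
  source.fetch(:components, []).each_with_index do |child, i|
    import_component(store, child, collection[:id], nil, i + 1, 1)
  end
  collection
end

# Steps 4a to 4d for one component, recursing into its children.
def import_component(store, source, collection_id, parent_id, sib_seq, level_num)
  component = { id: store.components.size + 1, collection_id: collection_id,
                parent_id: parent_id, sib_seq: sib_seq, level_num: level_num,
                description: source[:description] }
  store.components << component
  store.save_terms(:component, component[:id], source.fetch(:terms, []))
  source.fetch(:components, []).each_with_index do |child, i|
    import_component(store, child, collection_id, component[:id], i + 1, level_num + 1)
  end
end

# Invented sample input; a real run would build this from parsed EAD.
SAMPLE_EAD = {
  description: { "unittitle" => [{ "value" => "David Ames Wells papers" }] },
  terms: ["Free trade", "Tariff"],
  components: [
    { description: { "unittitle" => [{ "value" => "Correspondence" }] },
      terms: ["Free trade"],
      components: [{ description: { "unittitle" => [{ "value" => "Folder 1" }] } }] },
    { description: { "unittitle" => [{ "value" => "Clippings" }] } }
  ]
}.freeze
```

Passing `sib_seq` and `level_num` down during the walk means those two hierarchy attributes are known at insert time, with no second pass needed for them.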
  • 16. Integration with NYPL digital repository
      • Fedora repository + custom metadata creation/digitization workflow system + API to query repository data
      • All records in repository identified with UUID
      • UUID of digital object associated with a given component is stored locally in archives data system
      • Best case scenario: common identifiers appear in archival description and in Fedora
  • 17. Apache Solr
      • Inter- and intra-collection search
      • Collocation via faceting and filter queries
      • Using RSolr to facilitate interaction with Solr (for both search and index)
  • 18. API
      • API development is proceeding in step with finding aid development; available requests added as needed
      • Basic requests:
        o Collection-level data
        o Components of a collection, or sub-components of a component
        o Includes all component-level descriptive data
        o Max. depth can be specified
        o Digital assets associated with a component
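As a rough illustration of collocation via faceting and filter queries, the sketch below builds a hash of standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`). It only constructs the parameters and does not call RSolr; `search_params` and every field name (`collection_id`, `subject_facet`, `index_dates`) are assumptions, since the slides do not show the Solr schema.

```ruby
# Build Solr request parameters for a search, optionally limited to one
# collection and narrowed by exact-match filters on facet fields.
def search_params(query, collection_id: nil, filters: {}, facet_fields: [])
  fq = filters.map do |field, value|
    # Quote the value as a phrase and escape any embedded double quotes.
    escaped = value.to_s.gsub('"') { '\"' }
    %(#{field}:"#{escaped}")
  end
  fq << "collection_id:#{collection_id}" if collection_id

  { q: query, fq: fq, facet: true, "facet.field" => facet_fields }
end

params = search_params("tariff",
                       collection_id: 1,
                       filters: { "subject_facet" => "Free trade" },
                       facet_fields: ["subject_facet", "index_dates"])
puts params.inspect
```

Leaving `collection_id` out gives a cross-collection search, while supplying it restricts the same query to a single finding aid, which is one way the two search modes can share code.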
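One way a components request with a maximum depth could be assembled from flat rows is sketched here. `component_tree` is a hypothetical method, rows are plain hashes rather than model objects, and the sample titles are invented.

```ruby
require 'json'

# Nest flat component rows under their parents, ordered by sib_seq.
# When max_depth is given, levels deeper than that are left out.
def component_tree(rows, parent_id: nil, max_depth: nil, depth: 1)
  return [] if max_depth && depth > max_depth

  rows.select { |row| row[:parent_id] == parent_id }
      .sort_by { |row| row[:sib_seq] }
      .map do |row|
        row.merge(children: component_tree(rows, parent_id: row[:id],
                                           max_depth: max_depth, depth: depth + 1))
      end
end

# Invented sample rows.
ROWS = [
  { id: 1, parent_id: nil, sib_seq: 2, title: "Clippings" },
  { id: 2, parent_id: nil, sib_seq: 1, title: "Correspondence" },
  { id: 3, parent_id: 2, sib_seq: 1, title: "Folder 1" }
].freeze

puts JSON.pretty_generate(component_tree(ROWS, max_depth: 1))
```

Passing `parent_id:` of an existing component instead of `nil` returns that component's sub-components, which covers the second form of the request listed on the slide.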
  • 19. Finding aid prototype
  • 20. Finding aid prototype
  • 21. Front-end system overview
  • 22. Considerations for future development
      • Separate API from data management?
        o Data management app to handle all create/update/destroy operations, while API (Sinatra?) is read-only
        o Open API to public? Security/load considerations…
      • ArchivesSpace
        o NYPL is considering it as a possible replacement for our existing ‘home-grown’ system
        o How would this system integrate with ArchivesSpace API?
      • Upcoming EAD revision
  • 23. Some code to look at and/or borrow
      • Finding aid prototype: archives.nypl.org
      • Me: trevorthornton@nypl.org
      • NYPL