Linking historical ship records to a newspaper archive

Talk at Histoinformatics, 10 November 2014, Barcelona

Published in: Technology

  1. Linking historical ship records to a newspaper archive
     Andrea Bravo Balado, Victor de Boer, Guus Schreiber
     VU University Amsterdam
  2. Context: dutchshipsandsailors.nl/
  3. Dutch Ships and Sailors (DSS) datasets
  4. Results published as Linked Data
  5. Data visualizations
  6. This study
     • An increasing number of historical databases are being digitized
     • Finding matching occurrences of the same object in different datasets is both relevant (for historical research) and non-trivial
       – “Instance mapping”
     • This paper: case study of linking ship instances in two maritime datasets
  7. Focus on methodology
     • This study is not about developing new techniques
     • This study is about methodology:
       – What combination of existing techniques gets the “best” result?
       – What the “best” result is depends on context (i.e., goal of the historical research)
     • This is a case study, so be wary of generalization
  8. Data
     • Muster rolls (Northern Dutch Maritime Museum)
       – Period: 1803-1937
       – 77,043 records of 34,552 seamen
       – 17,098 mentions of 4,935 ships
     • Newspaper archive (Dutch National Library)
       – Period: 1618-1995
       – 7K newspapers, 9M pages (coverage: 10%)
       – Text generated via OCR
  9. Timeline newspapers in the archive
  10. Example muster roll record (in Dutch)
  11. Example newspaper article (in Dutch)
  12. Approach
      • Generate candidate set of links
      • Apply two types of filters to the candidate set
        – Domain-specific filtering
          • Using domain heuristics about ship identification
        – Text classification of newspaper articles
          • Determine whether the article is about a ship
      • Combine filters
  13. Baseline generation
      • Find all ship instances in the muster rolls
      • Query the newspaper archive for the first 100 hits on each ship name
        – API: http://www.delpher.nl/
      • Result set is expected to have high recall but low precision
  14. Evaluation
      • No gold standard
      • Manual assessment of all links is infeasible
      • Sampling method for evaluating candidates
        – 50 candidates per technique
        – 3 assessors (domain expert plus two authors)
        – Inter-observer agreement: Cohen’s kappa = 0.65
      • Recall: approximation, based on the estimated number of correct links (using the baseline)
  15. Domain-specific filtering
      • Heuristic 1: co-occurrence of name of ship captain
        – Common practice in historical maritime documentation
      • Heuristic 2: date of newspaper article is within ship lifetime (as indicated by muster roll)
        – Average life span of ship is 30 years
  16. Text classification
      • Task: decide whether a newspaper article is about a ship
      • Two techniques used
        – Naive Bayes and Support Vector Machine (SVM) with Sequential Minimal Optimisation (SMO)
        – WEKA implementation
        – Training set: 200 samples (121 positive, 79 negative)
  17. Configuration
      • Filter 1a: captain name
      • Filter 1b: time restriction
      • Filter 2: combine filters 1a + 1b
      • Filter 2 + text classification
  18. Results
  19. Analysis
      • Captain’s name turns out to be a strong heuristic
      • Time restriction is much less useful
      • When combined, precision becomes very high, at the cost of (approximate) recall
      • Text classification has high precision (no false positives)
      • Text classification combined with heuristic filtering has a negative effect
  20. Discussion
      • Interestingly, the historian preferred very high precision at the cost of recall
      • Consequently, 16K links published as Linked Data (precision 0.96; approximate recall 0.13)
      • Links are to departure/arrival listings, but also to shipwrecks and sales
      • In case of good heuristics the contribution of generic techniques is at best minimal
      • Absence of gold standard is realistic
  21. Limitations
      • Evaluation
        – 50 samples
        – Choice of assessors
        – Approximation of recall
      • Data
        – OCR quality of newspaper articles
        – Digitized newspaper archive covers only 10%
  22. Acknowledgements
      • Jurjen Leinenga, domain expert
      • CLARIN-NL http://www.clarin.nl
      • BiographyNet, Netherlands eScience Center http://esciencecenter.nl
      • Online appendix with details of results at http://dx.doi.org/10.6084/m9.figshare.1189228
  23. QUESTION TIME
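
The domain-specific filters listed on the "Domain-specific filtering" and "Configuration" slides can be sketched roughly as below. This is only an illustration, not the authors' implementation: the `Candidate` record, its field names, the case-insensitive substring match for the captain's name, and the way the ship's lifetime is represented as a first and last date are all assumptions, and the two sample records are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    """Hypothetical candidate link between a muster-roll ship and an article."""
    ship_name: str
    captain_name: str
    first_seen: date    # assumed start of the ship's lifetime per the muster rolls
    last_seen: date     # assumed end of the ship's lifetime per the muster rolls
    article_date: date
    article_text: str


def captain_filter(c: Candidate) -> bool:
    """Filter 1a: the captain's name co-occurs in the article text."""
    return c.captain_name.lower() in c.article_text.lower()


def time_filter(c: Candidate) -> bool:
    """Filter 1b: the article date falls within the ship's lifetime."""
    return c.first_seen <= c.article_date <= c.last_seen


def combined_filter(c: Candidate) -> bool:
    """Filter 2: keep a candidate only if filters 1a and 1b both accept it."""
    return captain_filter(c) and time_filter(c)


# Invented sample data, for illustration only.
in_period = Candidate(
    ship_name="Anna Maria",
    captain_name="Jansen",
    first_seen=date(1850, 1, 1),
    last_seen=date(1880, 12, 31),
    article_date=date(1862, 5, 3),
    article_text="Arrived: Anna Maria, capt. JANSEN, from Riga.",
)
out_of_period = Candidate(
    ship_name="Anna Maria",
    captain_name="Jansen",
    first_seen=date(1850, 1, 1),
    last_seen=date(1880, 12, 31),
    article_date=date(1910, 5, 3),
    article_text="Arrived: Anna Maria, capt. Jansen, from Riga.",
)

kept = [c for c in (in_period, out_of_period) if combined_filter(c)]
```

With exact substring matching, OCR errors in the captain's name would cause a correct link to be rejected, which is one plausible reason for the trade-off between precision and recall noted in the analysis.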

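The inter-observer agreement on the "Evaluation" slide is reported as Cohen's kappa, which compares observed agreement with the agreement expected by chance. The sketch below shows the standard two-rater computation; the slides do not say how the three assessors were paired or combined, so this is only the basic formula, and the function name and sample labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label proportions, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1:
        # Both raters used a single identical label; treated as full agreement here
        # (a convention chosen for this sketch).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented judgements (1 = correct link, 0 = incorrect link).
assessor_1 = [1, 1, 0, 0]
assessor_2 = [1, 0, 0, 0]
kappa = cohen_kappa(assessor_1, assessor_2)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.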