Linking historical ship records to a newspaper archive
Andrea Bravo Balado 
Victor de Boer, Guus Schreiber 
VU University Amsterdam
Context: dutchshipsandsailors.nl/ 
Dutch Ships and Sailors (DSS) datasets 
Results published as Linked Data 
Data visualizations 
This study 
• An increasing number of historical databases are being digitized
• Finding matching occurrences of the same 
object in different datasets is both relevant 
(for historical research) and non-trivial 
– “Instance mapping” 
• This paper: case study of linking ship instances 
in two maritime datasets 
Focus on methodology 
• This study is not about developing new 
techniques 
• This study is about methodology: 
– What combination of existing techniques gets the 
“best” result? 
– What the “best” result is depends on context (i.e., 
goal of the historical research) 
• This is a case study, so be wary of 
generalization 
Data 
• Muster rolls (Northern Dutch Maritime 
Museum) 
– Period: 1803-1937 
– 77,043 records of 34,552 seamen
– 17,098 mentions of 4,935 ships 
• Newspaper archive (Dutch National Library) 
– Period: 1618-1995 
– 7K newspapers, 9M pages (coverage: 10%) 
– Text generated via OCR 
Timeline of newspapers in the archive
Example muster roll record (in Dutch) 
Example newspaper article (in Dutch) 
Approach 
• Generate candidate set of links 
• Apply two types of filters to the candidate set 
– Domain-specific filtering 
• Using domain heuristics about ship identification 
– Text classification of newspaper articles 
• Determine whether the article is about a ship 
• Combine filters 
Baseline generation 
• Find all ship instances in the muster rolls 
• For each ship name, query the newspaper archive and keep the first 100 hits
– API: http://www.delpher.nl/ 
• Result set is expected to have high recall but 
low precision 
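The baseline step above can be sketched as follows. This is a minimal sketch, not the authors' code: `search` is a hypothetical callable standing in for the newspaper archive query (the slides do not show the actual API calls), and the ship ids, names and article ids are invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: search(name, limit) returns a list of article ids.
SearchFn = Callable[[str, int], List[str]]


def generate_candidates(
    ships: Iterable[Tuple[str, str]],  # (ship_id, ship_name) pairs
    search: SearchFn,
    max_hits: int = 100,
) -> List[Tuple[str, str]]:
    """Link every ship instance to the first max_hits articles found by name."""
    candidates = []
    for ship_id, name in ships:
        # Slice defensively in case the search returns more than requested.
        for article_id in search(name, max_hits)[:max_hits]:
            candidates.append((ship_id, article_id))
    return candidates


def fake_search(name: str, limit: int) -> List[str]:
    """Toy stand-in for the archive, with invented article ids."""
    archive = {"Anna": ["a1", "a2", "a3"], "Maria": ["m1"]}
    return archive.get(name, [])[:limit]
```

Two ships that share a name receive the same articles as candidates, which is one reason the baseline is expected to have low precision.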
Evaluation 
• No gold standard 
• Manual assessment of all links is infeasible 
• Sampling method for evaluating candidates 
– 50 candidates per technique 
– 3 assessors (domain expert plus two authors) 
– Inter-observer agreement: Cohen’s kappa = 0.65 
• Recall is an approximation, based on the estimated number of correct links in the baseline
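The evaluation measures above can be sketched as follows. This is a minimal sketch under stated assumptions: the kappa function handles one pair of assessors (the slides do not say how the three assessors were combined), and `approximate_recall` is one plausible reading of "estimated number of correct links (using the baseline)", not necessarily the authors' exact formula. All function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two assessors judging the same sampled links."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty judgement lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(a) | set(b)
    )
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1 - expected)


def sample_precision(judgements: Sequence[bool]) -> float:
    """Fraction of sampled links judged correct."""
    return sum(judgements) / len(judgements)


def approximate_recall(
    filtered_size: int,
    filtered_precision: float,
    baseline_size: int,
    baseline_precision: float,
) -> float:
    """Assumed definition: estimated correct links kept / estimated correct in baseline."""
    return (filtered_size * filtered_precision) / (baseline_size * baseline_precision)
```

With samples of 50 links per technique, each precision estimate carries considerable sampling uncertainty, and that uncertainty carries over into the approximate recall.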
Domain-specific filtering 
• Heuristic 1: the ship captain’s name co-occurs with the ship name in the article
– Naming the captain alongside the ship was common practice in historical maritime documentation
• Heuristic 2: the date of the newspaper article falls within the ship’s lifetime (as indicated by the muster roll)
– The average life span of a ship is 30 years
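The two heuristics above can be sketched as simple predicates. This is a minimal sketch, assuming each ship record carries a captain name and the first and last years in which the muster rolls mention the ship. The record layout, field names and sample data are hypothetical. Exact substring matching is a simplification and will miss names distorted by OCR.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ShipRecord:
    ship_name: str
    captain_name: str
    first_year: int  # first muster roll mention (assumed field)
    last_year: int   # last muster roll mention (assumed field)


@dataclass(frozen=True)
class Article:
    text: str
    published: date


def mentions_captain(ship: ShipRecord, article: Article) -> bool:
    """Heuristic 1: the captain's name occurs in the article text."""
    return ship.captain_name.lower() in article.text.lower()


def within_lifetime(ship: ShipRecord, article: Article, margin_years: int = 0) -> bool:
    """Heuristic 2: the article date falls within the ship's known lifetime."""
    year = article.published.year
    return ship.first_year - margin_years <= year <= ship.last_year + margin_years
```

The optional `margin_years` parameter is an addition for illustration: muster rolls may not cover a ship's first or last voyages, so a small margin can trade precision for recall.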
Text classification 
• Task: decide whether a newspaper article is 
about a ship 
• Two techniques used 
– Naive Bayes and Support Vector Machine (SVM) 
with Sequential Minimal Optimisation (SMO) 
– WEKA implementation 
– Training set: 200 samples (121 positive, 79 
negative) 
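To illustrate the classification task, here is a toy multinomial Naive Bayes classifier with add-one smoothing. This is a from-scratch sketch for illustration only. It is not the WEKA implementation used in the study, and the tokenization, class names and training sentences are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


class NaiveBayes:
    """Binary multinomial Naive Bayes; label True means 'article is about a ship'."""

    def fit(self, samples: List[Tuple[str, bool]]) -> "NaiveBayes":
        # Both classes must be present in the training samples.
        self.doc_counts = Counter(label for _, label in samples)
        self.word_counts = {True: Counter(), False: Counter()}
        for text, label in samples:
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens: List[str], label: bool) -> float:
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text: str) -> bool:
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)
```

A usage example with invented training sentences:

```python
train = [
    ("ship arrived captain harbour", True),
    ("captain sailed ship cargo", True),
    ("council meeting tax", False),
    ("market prices grain tax", False),
]
clf = NaiveBayes().fit(train)
clf.predict("the ship sailed")   # classified as about a ship
clf.predict("tax meeting")       # classified as not about a ship
```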
Configuration 
• Filter 1a: captain name 
• Filter 1b: time restriction 
• Filter 2: combine filters 1a + 1b 
• Filter 2 + text classification 
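The configurations above can be sketched as compositions of predicates. This is a minimal sketch that assumes "combine" means a candidate link must pass every filter (logical AND), which the slides do not state explicitly. The candidate layout with precomputed boolean flags is hypothetical.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, bool]          # hypothetical: flags computed per candidate link
Filter = Callable[[Candidate], bool]


def combine(*filters: Filter) -> Filter:
    """Return a filter that keeps a candidate only if all given filters keep it."""
    def combined(candidate: Candidate) -> bool:
        return all(f(candidate) for f in filters)
    return combined


def apply_filter(candidates: List[Candidate], keep: Filter) -> List[Candidate]:
    return [c for c in candidates if keep(c)]


# The four configurations, expressed over the assumed flags.
filter_1a: Filter = lambda c: c["captain"]        # captain name
filter_1b: Filter = lambda c: c["in_lifetime"]    # time restriction
filter_2: Filter = combine(filter_1a, filter_1b)  # 1a + 1b
filter_2_text: Filter = combine(filter_2, lambda c: c["about_ship"])
```

Under this reading, each added filter can only shrink the candidate set, so combined configurations cannot gain recall and may lose it.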
Results 
Analysis 
• Captain’s name turns out to be a strong 
heuristic 
• Time restriction is much less useful
• When combined, precision becomes very 
high, at the cost of (approximate) recall 
• Text classification has high precision (no false 
positives) 
• Text classification combined with heuristic filtering has a negative effect
Discussion 
• Interestingly, the historian preferred very high 
precision at the cost of recall 
• Consequently, 16K links published as Linked 
Data (precision 0.96; approximate recall 0.13) 
• Links are to departure/arrival listings, but also to shipwrecks and sales
• When good heuristics are available, the contribution of generic techniques is at best minimal
• Absence of gold standard is realistic 
Limitations 
• Evaluation 
– 50 samples 
– Choice of assessors 
– Approximation of recall 
• Data 
– OCR quality of newspaper articles 
– Digitized newspaper archive covers only 10% 
Acknowledgements 
• Jurjen Leinenga, domain expert 
• CLARIN-NL 
http://www.clarin.nl 
• BiographyNet, Netherlands eScience Center 
http://esciencecenter.nl 
• Online appendix with details of results at 
http://dx.doi.org/10.6084/m9.figshare.1189228 
QUESTION TIME 
