Summary
• The document describes a project to reconstruct over 12 million lives in the Dutch civil registry, 1812–1967, by linking records and making the data available as linked open data.
• It aims to develop a three-part system: (1) a civil registry dataset with basic information and links between individuals, (2) a linking service to connect external datasets, and (3) the ability to publish datasets as linked data, creating an interconnected web of historical information.
• Publishing the data as linked open data will allow researchers to easily integrate different datasets, expand their analyses, and replicate and build upon each other's work.
ESDG seminar 2019: reconstructing a country
1. Reconstructing a country: Linking
over 12 million lives in the Dutch
civil registry, 1812-1967
Ruben Schalk & Rick Mourits, UU
Auke Rijpma + Albert Merono + Richard Zijdeman + Joe Raad + Kees Mandemakers
4-12-2019
2. ClariahPlus WP4: Background
• New digital techniques allow for new research questions and new
answers to old questions
• To make use of these possibilities interlinkage and curation of ESH
data is needed
• As well as tools to query over these datasets (Clariah)
3. ClariahPlus WP4: Aims
• Building on Clariah, we want to develop a 3-part system around the Dutch civil
registry:
1. Civil registry dataset with basic information for all individuals in the
Netherlands c. 1815 – 1940/70 (births, marriages, deaths), with links between a
large number of individuals (child-parent relations + all derivatives).
2. Linking service for external datasets to ease the difficult linking process (NB: the
name of one person alone will usually not suffice; multiple persons, a time window,
and locations are required).
3. Ability to automatically add these and other datasets to the hub as Linked Data,
creating an ever-growing web of historical data on 19th- and early-20th-century
individuals.
4. Importance: Backbone for data
integration
• Central database for 19th and early 20th century, so that
new data can be linked to a central hub.
• Optimal data archiving through Linked Data: standardized
sets of variable names, new ways to estimate quality of
matches, and intuitive storage of linking quality.
• Framework to organize and store information on inequality
(and hopefully more in the future)
5. Importance: New research
• Multigenerational studies: social mobility, heritability.
• Add deep family relations to topics such as asset ownership,
strikes, business, mortality, fertility, anthropometrics etc.
• Conventional research might require larger N than current
micro datasets can provide, for example longevity, birth
spacing, or sex-specific effects.
• Large geographic scope: migration and environmental effects.
6. Linked Data…?
‘method of publishing structured data so that it can be interlinked and become
more useful through semantic queries’
• Direct, browseable, queryable online access and tooling for visualisation and analysis
• Interlinkage between datasets
• Expand your research (add variables/observations/encoding)
• Easy replication of results by sharing queries/results
• Keeps datasets separate yet linked; you remain responsible for your own data (and results)
• Explicitly defined relations between variables
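At its simplest, linked data is a collection of subject–predicate–object triples; datasets that reuse the same identifiers can be joined with a pattern query. A minimal toy sketch in Python (the identifiers and predicates are invented for illustration, not the project's actual vocabulary):

```python
# Two "datasets" as sets of (subject, predicate, object) triples.
births = {
    ("person:123", "schema:name", "Jan Jansen"),
    ("person:123", "civ:birthYear", 1850),
}
heights = {
    ("person:123", "ex:heightCm", 172),  # external dataset reusing the same URI
}

def query(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Both datasets stay separate, but a union query joins them
# through the shared identifier "person:123".
combined = births | heights
name = query(combined, s="person:123", p="schema:name")[0][2]
height = query(combined, s="person:123", p="ex:heightCm")[0][2]
```

This is the core idea behind "keeps datasets separate yet linked": neither dataset is modified, yet shared identifiers make them queryable together (real linked data uses RDF and SPARQL for this).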
8. Why should I use Linked Data?
• Connect datasets while keeping original data as is
• Enrich your own dataset, e.g. find info on specific persons (LINKS)
• Automatically recode variables (HISCO, HISCLASS, georeferencing)
• Contextualize your data: connect to micro/macro data like Clio-Infra, MicroHeights, HDNG,
Gemeentegeschiedenis
• Reusable data and research activities:
• Replication of results by using queries of other researchers on your data
• Easy collaboration across datasets
• Meet guidelines by ERC/NWO about data publication and archiving.
• Graph data model suited to heterogeneous or sparse data
9. Example: what if we combined datasets on historical
stature as Linked Data?
• Initiated by Joerg Baten (University of Tuebingen)
• Shows the added value of combining various small to large N datasets
centered on the same topic
• Possibilities:
• Link to Clio-Infra to get correlation between avg. height and GDP querying all
32 datasets at once (380k observations)
• Average stature around the world visualized
• Available at: https://druid.datalegend.net/dataLegend/microHeights
12. How to use the datahub?
• Use premade queries available at dataset pages on Druid and project
pages on Github
• Adapt queries to your liking and save the output as CSV
• Join our workshops to get acquainted with SPARQL and RDF (TBA).
• Or just ask us
13. Key dataset: Civil Register/LINKS
• Reconstruct life courses and family relations from the Dutch civil
registry
• Fragmented observations: birth, marriage, and death certificates
• Scanned by regional archives, entered by volunteers.
• Aggregated by CBG/wiewaswie.nl and Coret
Genealogie/openarch.nl
• Cleaned and processed at IISH
14. Data: progress
• Comparing to known birth/death totals.
• Noord-Holland (Amsterdam!) and Zuid-Holland are the biggest gaps in the data, but they are under way.
• Amsterdam archives interested in completing their civil registries.

Province       Birth   Death
Drenthe        100.0%  114.5%
Friesland      101.9%  114.5%
Gelderland     103.9%  120.0%
Groningen      100.5%  115.3%
Limburg        105.1%  116.3%
Noord-Brabant  114.3%  149.4%
Noord-Holland   82.2%   61.8%
Overijssel      61.2%  113.5%
Utrecht        111.9%  126.5%
Zeeland        113.9%  121.9%
Zuid-Holland    74.2%   80.0%
15. Approach: record linkage I
• Rule-based approach:
- Levenshtein distances
- Time frames
• Leverage multiple individuals on a certificate: if each name occurs with
frequency 1/1,000, requiring two names to match squares this to 1/1,000,000.
• Seems to work well on birth and marriage certificates because the civil
registry is a very accurate source.
• Time frames (date minus age) provide further information to make
the links.
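The rule-based approach above can be sketched as follows. The edit-distance threshold, time window, and the 1/1,000 name frequency are illustrative values, not the project's actual parameters:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def plausible_link(cert_a, cert_b, max_dist=2, max_year_gap=50):
    """Rule-based check: the implied time window must be consistent,
    and the names of several individuals must all be close matches."""
    if not (0 <= cert_b["year"] - cert_a["year"] <= max_year_gap):
        return False
    return all(levenshtein(na, nb) <= max_dist
               for na, nb in zip(cert_a["names"], cert_b["names"]))

# Multiple individuals per certificate: with name frequency 1/1,000,
# requiring two independent name matches squares the false-match rate.
false_match_prob = (1 / 1000) ** 2  # 1/1,000,000
```

Requiring several names (e.g. child plus both parents) to match simultaneously is what makes the rules precise despite common surnames.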
16. Approach: record linkage II
• Scalability
• Important, because naively:
• Zeeland: 700k x 200k comparisons for births -> marriages = 1.4e11
comparisons, a 230 GB matrix of integers for one string feature.
• Netherlands: 10m x 5m for births -> marriages = 5e13 comparisons, a 100 TB
matrix.
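The figures above follow directly from the naive pairwise-comparison count; a quick back-of-the-envelope check (the 2 bytes/entry is an assumption to show the order of magnitude, not the project's actual storage format):

```python
# Naive all-pairs comparison counts
zeeland = 700_000 * 200_000            # births x marriages = 1.4e11 pairs
netherlands = 10_000_000 * 5_000_000   # births x marriages = 5e13 pairs

# One integer similarity score per pair, assuming 2 bytes per entry
zeeland_bytes = zeeland * 2            # ~280 GB, same order as the slide's 230 GB
netherlands_bytes = netherlands * 2    # 1e14 bytes = ~100 TB
```

Materializing even a single string-similarity feature for all pairs is infeasible, which is why the blocking and caching strategies on the next slide matter.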
17. Approach: record linkage III
• Current scalability solutions:
• Concatenate all names per certificate to cut the number of comparisons by a factor of 3–6.
• Use directed acyclic word graphs for string comparisons
• Store names in dictionaries to avoid effort duplication
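The dictionary idea in the last bullet is essentially memoization: the stock of historical names is limited, so each unique name pair only needs to be compared once. A minimal sketch (the caching layout is an assumption, not the project's implementation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_distance(a: str, b: str) -> int:
    """Edit distance, computed once per unique (a, b) pair and then reused."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Millions of records, but far fewer unique names: repeated pairs are free.
names = ["jansen", "janssen", "jansen", "de vries", "jansen"]
for a in names:
    for b in names:
        cached_distance(a, b)
print(cached_distance.cache_info().misses)  # 9 unique pairs computed, not 25
```

With only 3 unique names, the 25 calls trigger just 9 actual distance computations; on real data the savings grow with every repeated name.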
19. Conclusions
• Exciting project that will provide backbone for individual-level research in
coming decades.
• Substantial challenges remain due to scale of data.
• Optimistically: small-scale private releases in 2020/2021, public releases in
2022.
• Early stages, so input very welcome.
20. Useful links
• Team page: http://www.datalegend.net/
• Datasets: https://druid.datalegend.net/
• CSV to LOD conversion: http://cattle.datalegend.net/
• Online SPARQL course for historians:
https://programminghistorian.org/en/lessons/intro-to-linked-data