Summary
• The document describes a project to reconstruct over 12 million lives in the Dutch civil registry, 1812–1967, by linking records and making the data available as linked open data.
• It aims to develop a three-part system: (1) a civil registry dataset with basic information and links between individuals, (2) a linking service to connect external datasets, and (3) the ability to publish datasets as linked data, creating an interconnected web of historical information.
• Publishing the data as linked open data will allow researchers to easily integrate different datasets, expand their analyses, and replicate and build upon each other's work.
ESDG seminar 2019: reconstructing a country
1. Reconstructing a country: Linking
over 12 million lives in the Dutch
civil registry, 1812-1967
Ruben Schalk & Rick Mourits, UU
Auke Rijpma + Albert Merono + Richard Zijdeman + Joe Raad + Kees Mandemakers
4-12-2019
2. ClariahPlus WP4: Background
• New digital techniques allow for new research questions and new
answers to old questions
• To make use of these possibilities interlinkage and curation of ESH
data is needed
• As well as tools to query over these datasets (Clariah)
3. ClariahPlus WP4: Aims
• Building on Clariah, we want to develop a 3-part system around the Dutch civil
registry:
1. Civil registry dataset with basic information for all individuals in the
Netherlands c. 1815 – 1940/70 (births, marriages, deaths), with links between a
large number of individuals (child-parent relations + all derivatives).
2. Linking service for external datasets to ease the difficult linking process (NB: the
name of one person alone will usually not suffice; multiple persons, a time window,
and locations are required).
3. Ability to automatically add these and other datasets to the hub as Linked Data,
creating an ever-growing web of historical data on 19th- and early-20th-century
individuals.
4. Importance: Backbone for data
integration
• Central database for 19th and early 20th century, so that
new data can be linked to a central hub.
• Optimal data archiving through Linked Data: standardized
sets of variable names, new ways to estimate quality of
matches, and intuitive storage of linking quality.
• Framework to organize and store information on inequality
(and hopefully more in the future)
5. Importance: New research
• Multigenerational studies: social mobility, heritability.
• Add deep family relations to topics such as asset ownership,
strikes, business, mortality, fertility, anthropometrics etc.
• Conventional research might require larger N than current
micro datasets can provide, for example longevity, birth
spacing, or sex-specific effects.
• Large geographic scope: migration and environmental effects.
6. Linked Data…?
‘method of publishing structured data so that it can be interlinked and become
more useful through semantic queries’
• Direct, browseable, queryable online access and tooling for visualisation and analysis
• Interlinkage between datasets
• Expand your research (add variables/observations/encoding)
• Easy replication of results by sharing queries/results
• Keeps datasets separate yet linked; you remain responsible for your own data (and results)
• Explicitly defined relations between variables
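At its simplest, linked data is a collection of subject–predicate–object triples; datasets that reuse the same identifiers can be joined with a pattern query. A minimal toy sketch in Python (the identifiers and predicates are invented for illustration, not the project's actual vocabulary):

```python
# Two "datasets" as sets of (subject, predicate, object) triples.
births = {
    ("person:123", "schema:name", "Jan Jansen"),
    ("person:123", "civ:birthYear", 1850),
}
heights = {
    ("person:123", "ex:heightCm", 172),  # external dataset reusing the same URI
}

def query(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Both datasets stay separate, but a union query joins them
# through the shared identifier "person:123".
combined = births | heights
name = query(combined, s="person:123", p="schema:name")[0][2]
height = query(combined, s="person:123", p="ex:heightCm")[0][2]
```

This is the core idea behind "keeps datasets separate yet linked": neither dataset is modified, yet shared identifiers make them queryable together (real linked data uses RDF and SPARQL for this).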
8. Why should I use Linked Data?
• Connect datasets while keeping original data as is
• Enrich your own dataset, e.g. find info on specific persons (LINKS)
• Automatically recode variables (HISCO, HISCLASS, georeferencing)
• Contextualize your data: connect to micro/macro data like Clio-Infra, MicroHeights, HDNG,
Gemeentegeschiedenis
• Reusable data and research activities:
• Replication of results by using queries of other researchers on your data
• Easy collaboration across datasets
• Meet guidelines by ERC/NWO about data publication and archiving.
• Graph data model suited to heterogeneous or sparse data
9. Example: what if we combined datasets on historical
stature as Linked Data?
• Initiated by Joerg Baten (University of Tuebingen)
• Shows the added value of combining various small to large N datasets
centered on the same topic
• Possibilities:
• Link to Clio-Infra to get correlation between avg. height and GDP querying all
32 datasets at once (380k observations)
• Average stature around the world visualized
• Available at: https://druid.datalegend.net/dataLegend/microHeights
12. How to use the datahub?
• Use premade queries available at dataset pages on Druid and project
pages on Github
• Adapt queries to your liking and save the output as CSV
• Join our workshops to get acquainted with SPARQL and RDF (TBA).
• Or just ask us
13. Key dataset: Civil Register/LINKS
• Reconstruct life courses and family relations from the Dutch civil
registry
• Fragmented observations: birth, marriage, and death certificates
• Scanned by regional archives, entered by volunteers.
• Aggregated by CBG/wiewaswie.nl and Coret
Genealogie/openarch.nl
• Cleaned and processed at IISH
14. Data: progress
• Comparing to known birth/death totals.
• Noord-Holland (Amsterdam!) and Zuid-Holland are the biggest gaps in the data, but they are under way.
• Amsterdam archives interested in completing their civil registries.

Province       Birth   Death
Drenthe        100.0%  114.5%
Friesland      101.9%  114.5%
Gelderland     103.9%  120.0%
Groningen      100.5%  115.3%
Limburg        105.1%  116.3%
Noord-Brabant  114.3%  149.4%
Noord-Holland   82.2%   61.8%
Overijssel      61.2%  113.5%
Utrecht        111.9%  126.5%
Zeeland        113.9%  121.9%
Zuid-Holland    74.2%   80.0%
15. Approach: record linkage I
• Rule-based approach:
- Levenshtein distances
- Time frames
• Leverage multiple individuals on a certificate: if each name occurs with
frequency 1/1,000, requiring two names to match squares this to 1/1,000,000.
• Seems to work well on birth and marriage certificates because the civil
registry is a very accurate source.
• Time frames (date minus age) provide further information to make
the links.
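The rule-based approach above can be sketched as follows. The edit-distance threshold, time window, and the 1/1,000 name frequency are illustrative values, not the project's actual parameters:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def plausible_link(cert_a, cert_b, max_dist=2, max_year_gap=50):
    """Rule-based check: the implied time window must be consistent,
    and the names of several individuals must all be close matches."""
    if not (0 <= cert_b["year"] - cert_a["year"] <= max_year_gap):
        return False
    return all(levenshtein(na, nb) <= max_dist
               for na, nb in zip(cert_a["names"], cert_b["names"]))

# Multiple individuals per certificate: with name frequency 1/1,000,
# requiring two independent name matches squares the false-match rate.
false_match_prob = (1 / 1000) ** 2  # 1/1,000,000
```

Requiring several names (e.g. child plus both parents) to match simultaneously is what makes the rules precise despite common surnames.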
16. Approach: record linkage II
• Scalability
• Important, because naively:
• Zeeland: 700k x 200k comparisons for births -> marriages = 1.4e11
comparisons, a 230 GB matrix of integers for one string feature.
• Netherlands: 10m x 5m for births -> marriages = 5e13 comparisons, a 100 TB
matrix.
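The figures above follow directly from the naive pairwise-comparison count; a quick back-of-the-envelope check (the 2 bytes/entry is an assumption to show the order of magnitude, not the project's actual storage format):

```python
# Naive all-pairs comparison counts
zeeland = 700_000 * 200_000            # births x marriages = 1.4e11 pairs
netherlands = 10_000_000 * 5_000_000   # births x marriages = 5e13 pairs

# One integer similarity score per pair, assuming 2 bytes per entry
zeeland_bytes = zeeland * 2            # ~280 GB, same order as the slide's 230 GB
netherlands_bytes = netherlands * 2    # 1e14 bytes = ~100 TB
```

Materializing even a single string-similarity feature for all pairs is infeasible, which is why the blocking and caching strategies on the next slide matter.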
17. Approach: record linkage III
• Current scalability solutions:
• Concatenate all names per certificate to cut the number of comparisons by a factor of 3–6.
• Use directed acyclic word graphs for string comparisons
• Store names in dictionaries to avoid effort duplication
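The dictionary idea in the last bullet is essentially memoization: the stock of historical names is limited, so each unique name pair only needs to be compared once. A minimal sketch (the caching layout is an assumption, not the project's implementation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_distance(a: str, b: str) -> int:
    """Edit distance, computed once per unique (a, b) pair and then reused."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Millions of records, but far fewer unique names: repeated pairs are free.
names = ["jansen", "janssen", "jansen", "de vries", "jansen"]
for a in names:
    for b in names:
        cached_distance(a, b)
print(cached_distance.cache_info().misses)  # 9 unique pairs computed, not 25
```

With only 3 unique names, the 25 calls trigger just 9 actual distance computations; on real data the savings grow with every repeated name.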
19. Conclusions
• Exciting project that will provide backbone for individual-level research in
coming decades.
• Substantial challenges remain due to scale of data.
• Optimistically: small-scale private releases in 2020/2021, public releases in
2022.
• Early stages, so input very welcome.
20. Useful links
• Team page: http://www.datalegend.net/
• Datasets: https://druid.datalegend.net/
• CSV to LOD conversion: http://cattle.datalegend.net/
• Online SPARQL course for historians:
https://programminghistorian.org/en/lessons/intro-to-linked-data