Reconstructing a country: Linking
over 12 million lives in the Dutch
civil registry, 1812-1967
Ruben Schalk & Rick Mourits, UU
Auke Rijpma + Albert Merono + Richard Zijdeman + Joe Raad + Kees Mandemakers
4-12-2019
ClariahPlus WP4: Background
• New digital techniques allow for new research questions and new
answers to old questions
• To make use of these possibilities, interlinkage and curation of ESH
data are needed
• Tools to query these datasets are needed as well (Clariah)
ClariahPlus WP4: Aims
• Building on Clariah, we want to develop a 3-part system around the Dutch civil
registry:
1. Civil registry dataset with basic information for all individuals in the
Netherlands c. 1815 – 1940/70 (births, marriages, deaths), with links between a
large number of individuals (child-parent relations + all derivatives).
2. Linking service for external datasets to ease the difficult linking process (N.B. the name of
a single person will usually not suffice: multiple persons, a time window, and
locations are required).
3. Ability to automatically add these and other datasets on hub as Linked Data,
creating ever-growing web of historical data on 19th- and early-20th-century
individuals.
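The "child-parent relations + all derivatives" in aim 1 can be pictured with a small sketch. It assumes the links are available as a plain mapping from a child to its parents; the identifiers and function names below are hypothetical and not taken from the project.

```python
from collections import defaultdict

# Hypothetical child -> parents links, e.g. derived from linked birth certificates.
PARENTS = {
    "child_1": {"mother_1", "father_1"},
    "child_2": {"mother_1", "father_1"},
    "mother_1": {"grandmother_1", "grandfather_1"},
}

def grandparents(person, parents=PARENTS):
    """Return the parents of the person's parents."""
    result = set()
    for parent in parents.get(person, set()):
        result |= parents.get(parent, set())
    return result

def siblings(person, parents=PARENTS):
    """Return everyone else who shares at least one parent with the person."""
    children_of = defaultdict(set)
    for child, known_parents in parents.items():
        for parent in known_parents:
            children_of[parent].add(child)
    result = set()
    for parent in parents.get(person, set()):
        result |= children_of[parent]
    result.discard(person)
    return result
```

Once the child-parent links exist, relations such as siblings or grandparents follow from simple lookups, which is why the child-parent link is the basic building block.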
Importance: Backbone for data
integration
• Central database for 19th and early 20th century, so that
new data can be linked to a central hub.
• Optimal data archiving through Linked Data: standardized
sets of variable names, new ways to estimate quality of
matches, and intuitive storage of linking quality.
• Framework to organize and store information on inequality
(and hopefully more in the future)
Importance: New research
• Multigenerational studies: social mobility, heritability.
• Add deep family relations to topics such as asset ownership,
strikes, business, mortality, fertility, anthropometrics etc.
• Conventional research might require larger N than current
micro datasets can provide, for example longevity, birth
spacing, or sex-specific effects.
• Large geographic scope: migration and environmental effects.
Linked Data…?
‘method of publishing structured data so that it can be interlinked and become
more useful through semantic queries’
• Direct, browseable, queryable online access and tooling for visualisation and analysis
• Interlinkage between datasets
• Expand your research (add variables/observations/encoding)
• Easy replication of results by sharing queries/results
• Keeps datasets separate yet linked; you remain responsible for your own data (and results)
• Explicitly defined relations between variables
Why should I use Linked Data?
• Connect datasets while keeping original data as is
• Enrich your own dataset, e.g. find info on specific persons (LINKS)
• Automatically recode variables (HISCO, HISCLASS, georeferencing)
• Contextualize your data: connect to micro/macro data like Clio-Infra, MicroHeights, HDNG,
Gemeentegeschiedenis
• Reusable data and research activities:
• Replication of results by using queries of other researchers on your data
• Easy collaboration across datasets
• Meet guidelines by ERC/NWO about data publication and archiving.
• Graph data model suited to heterogeneous or sparse data
Example: what if we combined datasets on historical
stature as Linked Data?
• Initiated by Joerg Baten (University of Tuebingen)
• Shows added value of combining various small to large N datasets
centering around the same topic
• Possibilities:
• Link to Clio-Infra to get correlation between avg. height and GDP querying all
32 datasets at once (380k observations)
• Average stature around the world visualized
• Available at: https://druid.datalegend.net/dataLegend/microHeights
How to use the datahub?
• Use premade queries available at dataset pages on Druid and project
pages on Github
• Adapt queries to liking and save output as csv
• Join our workshops to get acquainted with SPARQL and RDF (TBA).
• Or just ask us
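As an illustration of "save output as csv": SPARQL endpoints can return results in the standard SPARQL 1.1 JSON results format, which is easy to flatten. The sketch below assumes such a JSON document has already been fetched and parsed into a dict; the function name and the example rows are made up and do not come from Druid.

```python
import csv
import io

def sparql_json_to_csv(results):
    """Convert a parsed SPARQL 1.1 JSON results document to CSV text."""
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # A variable may be unbound in a given row; write an empty cell then.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return buffer.getvalue()

# Made-up response, for illustration only.
example = {
    "head": {"vars": ["person", "birthYear"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://example.org/person/1"},
         "birthYear": {"type": "literal", "value": "1850"}},
        {"person": {"type": "uri", "value": "http://example.org/person/2"}},
    ]},
}
csv_text = sparql_json_to_csv(example)
```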
Key dataset: Civil Register/LINKS
• Reconstruct life courses and family relations from the Dutch civil
registry
• Fragmented observations: birth, marriage, and death certificates
• Scanned by regional archives, entered by volunteers.
• Aggregated by CBG/wiewaswie.nl and Coret
Genealogie/openarch.nl
• Cleaned and processed at IISH
Data: progress
• Comparing to known birth/death totals.
• Noord-Holland (Amsterdam!) and Zuid-Holland are the biggest gaps
in the data, but they are under way.
• Amsterdam archives interested in completing their civil registries.
Province        Birth    Death
Drenthe         100.0%   114.5%
Friesland       101.9%   114.5%
Gelderland      103.9%   120.0%
Groningen       100.5%   115.3%
Limburg         105.1%   116.3%
Noord-Brabant   114.3%   149.4%
Noord-Holland    82.2%    61.8%
Overijssel       61.2%   113.5%
Utrecht         111.9%   126.5%
Zeeland         113.9%   121.9%
Zuid-Holland     74.2%    80.0%
Approach: record linkage I
• Rule-based approach:
- Levenshtein distances
- Time frames
• Leverage multiple individuals on a certificate: if one name has a frequency
of 1/1,000, two names together have (1/1,000)^2 = 1/1,000,000 (assuming independence).
• Seems to work well on birth and marriage certificates because the civil
registry is a very accurate source.
• Time frames (date minus age) provide further information to make
the links.
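The rules above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the record layout, the field names, and the thresholds (`max_distance`, `year_tolerance`) are all assumptions.

```python
from fractions import Fraction

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_candidate_link(birth, marriage, max_distance=1, year_tolerance=2):
    """Accept a birth -> marriage link only if all names are close and the
    birth year implied by the marriage (year minus age) fits the time frame."""
    names_match = all(
        levenshtein(birth[key], marriage[key]) <= max_distance
        for key in ("child", "mother", "father")  # several persons per certificate
    )
    implied_birth_year = marriage["year"] - marriage["age"]
    in_window = abs(implied_birth_year - birth["year"]) <= year_tolerance
    return names_match and in_window

# Hypothetical records with one spelling variant in the child's surname.
birth = {"child": "jan jansen", "mother": "maria de vries",
         "father": "piet jansen", "year": 1850}
marriage = {"child": "jan janssen", "mother": "maria de vries",
            "father": "piet jansen", "year": 1875, "age": 25}

# Two independent names, each with frequency 1/1,000.
joint_frequency = Fraction(1, 1000) ** 2
```

Requiring several names to match at once is what makes a simple rule-based approach workable: a chance match on one name is plausible, a chance match on all of them within the time frame is far less so.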
Approach: record linkage II
• Scalability
• Important, because naively:
• Zeeland: 700k x 200k comparisons for births -> marriages = 1.4e11
comparisons, 230 G matrix of integers for one string feature.
• Netherlands: 10m x 5m for births -> marriages = 5e13 comparisons, 100 TB
matrix.
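The comparison counts follow directly from multiplying the two record counts. The sketch below redoes that arithmetic; the 2 bytes per matrix cell is an assumption (the slides do not state the integer width), chosen because it reproduces the 100 TB figure for the Netherlands.

```python
def naive_comparisons(n_left, n_right):
    """Number of pairwise comparisons when every record meets every record."""
    return n_left * n_right

def matrix_bytes(n_left, n_right, bytes_per_cell):
    """Size of a dense matrix holding one integer per comparison."""
    return naive_comparisons(n_left, n_right) * bytes_per_cell

zeeland = naive_comparisons(700_000, 200_000)            # births x marriages
netherlands = naive_comparisons(10_000_000, 5_000_000)   # births x marriages

# Assumed cell width of 2 bytes; 1 TB taken as 1e12 bytes.
netherlands_tb = matrix_bytes(10_000_000, 5_000_000, bytes_per_cell=2) / 1e12
```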
Approach: record linkage III
• Current scalability solutions:
• Concatenate all names to cut comparisons in 3/6.
• Use directed acyclic word graphs for string comparisons
• Store names in dictionaries to avoid effort duplication
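The dictionary idea can be illustrated as follows: names repeat often, so distances only need to be computed once per distinct pair of names and can then be looked up. This is a sketch under that assumption; the function names are hypothetical and the project's own data structures (such as the word graphs mentioned above) are not shown.

```python
def levenshtein(a, b):
    """Edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def distances_between_unique_names(left_names, right_names):
    """Compute the distance once per distinct (left, right) name pair."""
    cache = {}
    for a in set(left_names):
        for b in set(right_names):
            cache[(a, b)] = levenshtein(a, b)
    return cache

# Hypothetical name columns with many repeats.
births = ["jan", "jan", "piet", "jan"]
marriages = ["jan", "jaan", "jan"]

cache = distances_between_unique_names(births, marriages)
naive_count = len(births) * len(marriages)  # comparisons without deduplication
```

Here the naive approach needs 12 distance computations, while the dictionary holds only the 4 distinct pairs; every record pair is then resolved by a lookup such as `cache[("jan", "jaan")]`.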
Conclusions
• Exciting project that will provide backbone for individual-level research in
coming decades.
• Substantial challenges remain due to scale of data.
• Optimistically: small-scale private releases in 2020/2021, public releases in
2022.
• Early stages, so input very welcome.
Useful links
• Team page: http://www.datalegend.net/
• Datasets: https://druid.datalegend.net/
• CSV to LOD conversion: http://cattle.datalegend.net/
• Online SPARQL course for historians:
https://programminghistorian.org/en/lessons/intro-to-linked-data

ESDG seminar 2019: reconstructing a country
