Reconstructing a country: Linking
over 12 million lives in the Dutch
civil registry, 1812-1967
Ruben Schalk & Rick Mourits, UU
Auke Rijpma + Albert Merono + Richard Zijdeman + Joe Raad + Kees Mandemakers
4-12-2019
ClariahPlus WP4: Background
• New digital techniques allow for new research questions and new
answers to old questions
• To make use of these possibilities, interlinkage and curation of ESH
data are needed
• As well as tools to query over these datasets (Clariah)
ClariahPlus WP4: Aims
• Building on Clariah, we want to develop a 3-part system around the Dutch civil
registry:
1. Civil registry dataset with basic information for all individuals in the
Netherlands c. 1815 – 1940/70 (births, marriages, deaths), with links between a
large number of individuals (child-parent relations + all derivatives).
2. Linking service for external datasets to ease a difficult process (n.b. the name of
one person alone will usually not suffice: multiple persons, a time window, and
locations are required).
3. Ability to automatically add these and other datasets on hub as Linked Data,
creating ever-growing web of historical data on 19th- and early-20th-century
individuals.
Importance: Backbone for data
integration
• Central database for 19th and early 20th century, so that
new data can be linked to a central hub.
• Optimal data archiving through Linked Data: standardized
sets of variable names, new ways to estimate quality of
matches, and intuitive storage of linking quality.
• Framework to organize and store information on inequality
(and hopefully more in the future)
Importance: New research
• Multigenerational studies: social mobility, heritability.
• Add deep family relations to topics such as asset ownership,
strikes, business, mortality, fertility, anthropometrics etc.
• Conventional research might require larger N than current
micro datasets can provide, for example longevity, birth
spacing, or sex-specific effects.
• Large geographic scope: migration and environmental effects.
Linked Data…?
‘method of publishing structured data so that it can be interlinked and become
more useful through semantic queries’
• Direct, browseable, queryable online access and tooling for visualisation and analysis
• Interlinkage between datasets
• Expand your research (add variables/observations/encoding)
• Easy replication of results by sharing queries/results
• Keeps datasets separate yet linked; you remain responsible for your own data (and results)
• Explicitly defined relations between variables
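To make "separate yet linked" concrete, here is a minimal sketch that models statements as (subject, predicate, object) triples in plain Python. All identifiers, predicates, and values (person:1, hasFather, heightCm, and so on) are made up for illustration; the actual hub uses RDF and SPARQL, which this sketch only imitates.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and values below are hypothetical examples.
registry = {
    ("person:1", "hasName", "Jan Jansen"),
    ("person:2", "hasName", "Piet Jansen"),
    ("person:1", "hasFather", "person:2"),
}
heights = {
    ("conscript:17", "heightCm", 168),
}
links = {
    # A link set kept apart from both datasets it connects.
    ("conscript:17", "sameAs", "person:1"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def height_with_father_name():
    """Follow links from a height record to the father's name in the registry."""
    graph = registry | heights | links  # union for querying; the sources stay unchanged
    results = []
    for conscript, _, height in match(graph, p="heightCm"):
        for _, _, person in match(graph, s=conscript, p="sameAs"):
            for _, _, father in match(graph, s=person, p="hasFather"):
                for _, _, name in match(graph, s=father, p="hasName"):
                    results.append((conscript, height, name))
    return results


print(height_with_father_name())
```

The three sets are never merged on disk: the union exists only for the duration of the query, which is the sense in which each dataset stays the responsibility of its owner.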
Why should I use Linked Data?
• Connect datasets while keeping original data as is
• Enrich your own dataset, e.g. find info on specific persons (LINKS)
• Automatically recode variables (HISCO, HISCLASS, georeferencing)
• Contextualize your data: connect to micro/macro data like Clio-Infra, MicroHeights, HDNG,
Gemeentegeschiedenis
• Reusable data and research activities:
• Replication of results by using queries of other researchers on your data
• Easy collaboration across datasets
• Meet guidelines by ERC/NWO about data publication and archiving.
• Graph data model suited to heterogeneous or sparse data
Example: what if we combined datasets on historical
stature as Linked Data?
• Initiated by Joerg Baten (University of Tuebingen)
• Shows added value of combining various small to large N datasets
centering around the same topic
• Possibilities:
• Link to Clio-Infra to get correlation between avg. height and GDP querying all
32 datasets at once (380k observations)
• Average stature around the world visualized
• Available at: https://druid.datalegend.net/dataLegend/microHeights
How to use the datahub?
• Use premade queries available at dataset pages on Druid and project
pages on Github
• Adapt queries to your liking and save the output as CSV
• Join our workshops to get acquainted with SPARQL and RDF (TBA).
• Or just ask us
Key dataset: Civil Register/LINKS
• Reconstruct life courses and family relations from the Dutch civil
registry
• Fragmented observations: birth, marriage, and death certificates
• Scanned by regional archives, entered by volunteers.
• Aggregated by CBG/wiewaswie.nl and Coret
Genealogie/openarch.nl
• Cleaned and processed at IISH
Data: progress
• Comparing to known
birth/death totals.
• Noord-Holland
(Amsterdam!) and Zuid-
Holland are the biggest gaps
in the data, but they are
under way.
• Amsterdam archives
interested in completing
their civil registries.
Province        Birth    Death
Drenthe         100.0%   114.5%
Friesland       101.9%   114.5%
Gelderland      103.9%   120.0%
Groningen       100.5%   115.3%
Limburg         105.1%   116.3%
Noord-Brabant   114.3%   149.4%
Noord-Holland    82.2%    61.8%
Overijssel       61.2%   113.5%
Utrecht         111.9%   126.5%
Zeeland         113.9%   121.9%
Zuid-Holland     74.2%    80.0%
Approach: record linkage I
• Rule-based approach:
- Levenshtein distances
- Time frames
• Leverage multiple individuals on a certificate: if one name has a frequency of
1/1,000, two names together have (1/1,000)^2 = 1/1,000,000 (assuming independence).
• Seems to work well on birth and marriage certificates because the civil
registry is a very accurate source.
• Time frames (date minus age) provide further information to make
the links.
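The rule-based idea above can be sketched as follows. This is a minimal illustration, not the project's actual linking code: the field names, the thresholds (max_dist, year_slack), and the sample records are all assumptions made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_candidate_link(birth, marriage, max_dist=1, year_slack=2):
    """Hypothetical rule: child, father, and mother names must all be close,
    and the birth year must fit the marriage year minus the stated age."""
    for key in ("child", "father", "mother"):
        if levenshtein(birth[key], marriage[key]) > max_dist:
            return False
    implied_birth_year = marriage["year"] - marriage["age"]
    return abs(birth["year"] - implied_birth_year) <= year_slack


# Made-up records for illustration only.
birth = {"child": "Jan Jansen", "father": "Piet Jansen",
         "mother": "Maria de Vries", "year": 1850}
marriage = {"child": "Jan Janssen", "father": "Piet Jansen",
            "mother": "Maria de Vries", "year": 1875, "age": 25}

print(is_candidate_link(birth, marriage))
```

Requiring all three names to agree is what makes the rule selective: a spelling variant in one name is tolerated, but a chance match on all three within the time window is unlikely.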
Approach: record linkage II
• Scalability
• Important, because naively:
• Zeeland: 700k x 200k comparisons for births -> marriages = 1.4e11
comparisons, 230 GB matrix of integers for one string feature.
• Netherlands: 10m x 5m for births -> marriages = 5e13 comparisons, 100 TB
matrix.
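The arithmetic behind these figures can be reproduced in a few lines. The cell width is an assumption: the slides do not state it, and 2 bytes per integer is chosen here because it reproduces the 100 TB national figure; for Zeeland it gives the same order of magnitude as the figure quoted above rather than the exact number.

```python
def naive_cost(n_left: int, n_right: int, bytes_per_cell: int = 2):
    """Return (comparisons, bytes) for a full pairwise comparison matrix.

    bytes_per_cell=2 is an assumed integer width, not taken from the slides.
    """
    comparisons = n_left * n_right
    return comparisons, comparisons * bytes_per_cell


for label, n_births, n_marriages in [
    ("Zeeland", 700_000, 200_000),
    ("Netherlands", 10_000_000, 5_000_000),
]:
    comparisons, size = naive_cost(n_births, n_marriages)
    print(f"{label}: {comparisons:.1e} comparisons, {size / 1e12:.2f} TB")
```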
Approach: record linkage III
• Current scalability solutions:
• Concatenate all names to cut comparisons in 3/6.
• Use directed acyclic word graphs for string comparisons
• Store names in dictionaries to avoid effort duplication
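The dictionary idea can be sketched as below: names recur many times, so each distinct pair only needs to be compared once and the result stored. This is a simplified stand-in under stated assumptions; the function names, the max_dist threshold, and the sample names are hypothetical, and the directed acyclic word graph mentioned above is not implemented here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def close_name_pairs(names_a, names_b, max_dist=1):
    """Compare each distinct name pair once; keep close pairs in a dictionary."""
    unique_a, unique_b = set(names_a), set(names_b)
    close = {}
    for a in unique_a:
        for b in unique_b:
            # Cheap filter: the edit distance is at least the length difference.
            if abs(len(a) - len(b)) > max_dist:
                continue
            distance = levenshtein(a, b)
            if distance <= max_dist:
                close[(a, b)] = distance
    return close


# Made-up surnames; duplicates mimic how often names repeat across certificates.
births = ["jansen", "jansen", "de vries", "bakker"]
marriages = ["janssen", "jansen", "bakker", "bakker"]

pairs = close_name_pairs(births, marriages)
print(len(set(births)) * len(set(marriages)), "distinct comparisons instead of",
      len(births) * len(marriages))
print(pairs)
```

Record-level matching can then look up each certificate's names in the stored dictionary instead of recomputing string distances for every record pair.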
Conclusions
• Exciting project that will provide backbone for individual-level research in
coming decades.
• Substantial challenges remain due to scale of data.
• Optimistically: small-scale private releases in 2020/2021, public releases in
2022.
• Early stages, so input very welcome.
Useful links
• Team page: http://www.datalegend.net/
• Datasets: https://druid.datalegend.net/
• CSV to LOD conversion: http://cattle.datalegend.net/
• Online SPARQL course for historians:
https://programminghistorian.org/en/lessons/intro-to-linked-data

ESDG seminar 2019: reconstructing a country

  • 1. Reconstructing a country: Linking over 12 million lives in the Dutch civil registry, 1812-1967 Ruben Schalk & Rick Mourits, UU Auke Rijpma + Albert Merono + Richard Zijdeman + Joe Raad + Kees Mandemakers 4-12-2019
  • 2. ClariahPlus WP4: Background • New digital techniques allow for new research questions and new answers to old questions • To make use of these possibilities interlinkage and curation of ESH data is needed • As well as tools to query over these datasets (Clariah)
  • 3. ClariahPlus WP4: Aims • Building on Clariah, we want to develop a 3-part system around the Dutch civil registry: 1. Civil registry dataset with basic information for all individuals in the Netherlands c. 1815 – 1940/70 (births, marriages, deaths), with links between a large number of individuals (child-parent relations + all derivatives). 2. Linking service for external datasets to ease the difficult linking process (NB: the name of a single person will usually not suffice; multiple persons, a time window, and locations are required). 3. Ability to automatically add these and other datasets to the hub as Linked Data, creating an ever-growing web of historical data on 19th- and early-20th-century individuals.
  • 4. Importance: Backbone for data integration • Central database for 19th and early 20th century, so that new data can be linked to a central hub. • Optimal data archiving through Linked Data: standardized sets of variable names, new ways to estimate quality of matches, and intuitive storage of linking quality. • Framework to organize and store information on inequality (and hopefully more in the future)
  • 5. Importance: New research • Multigenerational studies: social mobility, heritability. • Add deep family relations to topics such as asset ownership, strikes, business, mortality, fertility, anthropometrics etc. • Conventional research might require larger N than current micro datasets can provide, for example longevity, birth spacing, or sex-specific effects. • Large geographic scope: migration and environmental effects.
  • 6. Linked Data…? ‘method of publishing structured data so that it can be interlinked and become more useful through semantic queries’ • Direct, browseable, queryable online access and tooling for visualisation and analysis • Interlinkage between datasets • Expand your research (add variables/observations/encoding) • Easy replication of results by sharing queries/results • Keeps datasets separate yet linked; you remain responsible for your own data (and results) • Explicitly defined relations between variables
  • 8. Why should I use Linked Data? • Connect datasets while keeping original data as is • Enrich your own dataset, e.g. find info on specific persons (LINKS) • Automatically recode variables (HISCO, HISCLASS, georeferencing) • Contextualize your data: connect to micro/macro data like Clio-Infra, MicroHeights, HDNG, Gemeentegeschiedenis • Reusable data and research activities: • Replication of results by using queries of other researchers on your data • Easy collaboration across datasets • Meet ERC/NWO guidelines on data publication and archiving. • Graph data model suited to heterogeneous or sparse data
  • 9. Example: what if we combined datasets on historical stature as Linked Data? • Initiated by Joerg Baten (University of Tuebingen) • Shows added value of combining various small to large N datasets centering around the same topic • Possibilities: • Link to Clio-Infra to get correlation between avg. height and GDP querying all 32 datasets at once (380k observations) • Average stature around the world visualized • Available at: https://druid.datalegend.net/dataLegend/microHeights
  • 12. How to use the datahub? • Use premade queries available at dataset pages on Druid and project pages on Github • Adapt queries to liking and save output as csv • Join our workshops to get acquainted with SPARQL and RDF (TBA). • Or just ask us
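The "save output as csv" step can be sketched offline. The example below assumes the query result arrives in the standard SPARQL 1.1 JSON results layout (`head.vars` plus `results.bindings`); the function name `sparql_json_to_csv` and the sample payload are hypothetical placeholders, not output from Druid.

```python
import csv
import io


def sparql_json_to_csv(result: dict) -> str:
    """Flatten a SPARQL 1.1 JSON result into CSV text (one row per binding)."""
    variables = result["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding; write them as empty cells.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return buffer.getvalue()


# Hypothetical payload with placeholder values, only to show the shape of the data.
sample = {
    "head": {"vars": ["name", "year"]},
    "results": {
        "bindings": [
            {"name": {"type": "literal", "value": "A"},
             "year": {"type": "literal", "value": "1850"}},
            {"name": {"type": "literal", "value": "B"}},
        ]
    },
}

csv_text = sparql_json_to_csv(sample)
print(csv_text)
```

Keeping the conversion separate from the query itself means the same helper works on any saved result file, whichever endpoint produced it.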
  • 13. Key dataset: Civil Register/LINKS • Reconstruct life courses and family relations from the Dutch civil registry • Fragmented observations: birth, marriage, and death certificates • Scanned by regional archives, entered by volunteers. • Aggregated by CBG/wiewaswie.nl and Coret Genealogie/openarch.nl • Cleaned and processed at IISH
  • 14. Data: progress • Comparing to known birth/death totals. • Noord-Holland (Amsterdam!) and Zuid-Holland are the biggest gaps in the data, but they are under way. • Amsterdam archives interested in completing their civil registries.
      Province         Birth     Death
      Drenthe          100.0%    114.5%
      Friesland        101.9%    114.5%
      Gelderland       103.9%    120.0%
      Groningen        100.5%    115.3%
      Limburg          105.1%    116.3%
      Noord-Brabant    114.3%    149.4%
      Noord-Holland     82.2%     61.8%
      Overijssel        61.2%    113.5%
      Utrecht          111.9%    126.5%
      Zeeland          113.9%    121.9%
      Zuid-Holland      74.2%     80.0%
  • 15. Approach: record linkage I • Rule-based approach: - Levenshtein distances - Time frames • Leverage multiple individuals on a certificate: if one name has a frequency of 1/1,000, requiring two names to match squares that to 1/1,000,000. • Seems to work well on birth and marriage certificates because the civil registry is a very accurate source. • Time frames (certificate date minus stated age) provide further information to make the links.
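A minimal sketch of the rule-based idea on this slide, not the project's actual code: a birth certificate and a marriage certificate are linked when the child and both parents match within a small Levenshtein distance, and the marriage date minus the stated age falls close to the birth year. The record classes, field names, and thresholds (`max_dist`, `year_slack`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


@dataclass
class Birth:  # hypothetical record layout
    child: str
    father: str
    mother: str
    year: int


@dataclass
class Marriage:  # hypothetical record layout
    groom: str
    groom_father: str
    groom_mother: str
    year: int
    groom_age: int


def is_match(birth: Birth, marriage: Marriage,
             max_dist: int = 1, year_slack: int = 1) -> bool:
    """Rule: all three names agree closely AND the time frame is consistent."""
    name_pairs = [(birth.child, marriage.groom),
                  (birth.father, marriage.groom_father),
                  (birth.mother, marriage.groom_mother)]
    names_ok = all(levenshtein(x, y) <= max_dist for x, y in name_pairs)
    implied_birth_year = marriage.year - marriage.groom_age  # date minus age
    return names_ok and abs(implied_birth_year - birth.year) <= year_slack


birth = Birth("jan jansen", "piet jansen", "maria de vries", 1850)
good = Marriage("jan janssen", "piet jansen", "maria de vries", 1875, 25)
too_late = Marriage("jan janssen", "piet jansen", "maria de vries", 1890, 25)
print(is_match(birth, good), is_match(birth, too_late))
```

Requiring the parents to match as well is what makes a common name discriminating: each extra name multiplies the chance of an accidental match by its own frequency.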
  • 16. Approach: record linkage II • Scalability • Important, because a naive all-against-all comparison gives: • Zeeland: 700k x 200k for births -> marriages = 1.4e11 comparisons, 230 G matrix of integers for one string feature. • Netherlands: 10m x 5m for births -> marriages = 5e13 comparisons, 100 TB matrix.
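The comparison counts on this slide follow directly from multiplying the two record counts. The sketch below reproduces them; the storage estimate assumes 2 bytes per matrix cell, which the slide does not state, so the byte totals only indicate the order of magnitude.

```python
def naive_comparisons(n_left: int, n_right: int) -> int:
    """Every record on the left is compared with every record on the right."""
    return n_left * n_right


def matrix_bytes(n_left: int, n_right: int, bytes_per_cell: int = 2) -> int:
    """Size of a dense distance matrix; the cell width is an assumption."""
    return naive_comparisons(n_left, n_right) * bytes_per_cell


zeeland = naive_comparisons(700_000, 200_000)          # births x marriages
netherlands = naive_comparisons(10_000_000, 5_000_000)

print(f"Zeeland:     {zeeland:.1e} comparisons")
print(f"Netherlands: {netherlands:.1e} comparisons")
print(f"Netherlands matrix at 2 bytes/cell: "
      f"{matrix_bytes(10_000_000, 5_000_000) / 1e12:.0f} TB")
```

Because the cost grows with the product of the two file sizes, going from one province to the whole country multiplies the work by several hundred, which is why the naive approach is ruled out.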
  • 17. Approach: record linkage III • Current scalability solutions: • Concatenate all names to cut comparisons in 3/6. • Use directed acyclic word graphs for string comparisons • Store names in dictionaries to avoid effort duplication
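The third bullet, storing names in dictionaries, can be illustrated as follows. This is a sketch of the general idea only, assuming plain name strings as input: records are grouped by name so that each distinct pair of names is compared once, and the result is then expanded back to record positions. It does not implement the directed acyclic word graph mentioned on the slide, and the function names are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache


@lru_cache(maxsize=None)  # remember distances already computed
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def group_by_name(names):
    """Dictionary from each distinct name to the record positions that carry it."""
    groups = defaultdict(list)
    for position, name in enumerate(names):
        groups[name].append(position)
    return groups


def link(left, right, max_dist=1):
    """Compare distinct names only, then expand matches to record pairs."""
    left_groups = group_by_name(left)
    right_groups = group_by_name(right)
    pairs = []
    for left_name, left_positions in left_groups.items():
        for right_name, right_positions in right_groups.items():
            if levenshtein(left_name, right_name) <= max_dist:
                pairs.extend((i, j) for i in left_positions
                             for j in right_positions)
    return sorted(pairs), len(left_groups) * len(right_groups)


left = ["jan", "jan", "piet"]
right = ["jan", "jaan", "klaas"]
pairs, n_compared = link(left, right)
print(pairs, n_compared)
```

In this toy input the naive approach would run 9 string comparisons, while grouping by name needs 6; on real registry data, where common names repeat many thousands of times, the saving is far larger.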
  • 19. Conclusions • Exciting project that will provide backbone for individual-level research in coming decades. • Substantial challenges remain due to scale of data. • Optimistically: small-scale private releases in 2020/2021, public releases in 2022. • Early stages, so input very welcome.
  • 20. Useful links • Team page: http://www.datalegend.net/ • Datasets: https://druid.datalegend.net/ • CSV to LOD conversion: http://cattle.datalegend.net/ • Online SPARQL course for historians: https://programminghistorian.org/en/lessons/intro-to-linked-data