Data Archiving and Networked Services



Linked Census Data
Semantics for Knowledge Discovery of the
Past


Albert Meroño-Peñuela

01/03/2013




DANS is an institute of KNAW and NWO
Main goal: cross queries

               ?
Main goal: requirements
• Schema flexibility: do not commit to a specific
  schema
• Linkage
  – Internally (e.g. between tables), to make relations explicit
  – Externally
      • Harmonization datasets (e.g. HISCO, AC)
      • Enriching datasets (e.g. labour strikes, book publications)
• Inference of new knowledge (e.g.
  ink_manufacturer(X) & ink_manufacturer ⊑ chemical |=
  chemical(X))
• Publication: as open data for researchers on the
  Web (through Service Architectures)
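The inference requirement can be illustrated with a minimal sketch in plain Python. It assumes a toy set of triples and a hypothetical ex: namespace; neither comes from the CEDAR data, and a real deployment would more likely rely on an RDFS/OWL reasoner than on hand-written code.

```python
# Toy RDFS-style subclass inference; all ex: names are hypothetical.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def infer_types(triples):
    """Close a triple set under: (x type C) and (C subClassOf D) => (x type D)."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in closed if p == SUBCLASS]
        for s, p, o in list(closed):
            if p != TYPE:
                continue
            for sub, sup in subclass_pairs:
                if o == sub and (s, TYPE, sup) not in closed:
                    closed.add((s, TYPE, sup))
                    changed = True
    return closed

facts = {
    ("ex:person1", TYPE, "ex:ink_manufacturer"),
    ("ex:ink_manufacturer", SUBCLASS, "ex:chemical"),
}
inferred = infer_types(facts) - facts
print(inferred)
```

The only new statement derived here is that ex:person1 is also of type ex:chemical, which mirrors the ink manufacturer example in the requirement.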
Main goal: RDF datamodel
CEDAR development cycle,
iteration 1



• Gathering: only one file
• Conversion: TabLinker, small table size
• Querying: simple, ad-hoc SPARQL + trivial visualization
Iteration 1: conversion



 •   Supervised Excel to RDF conversion
 •   Python feat. xlutils, xlrd, rdflib libs
 •   Intended for complex layouts that cannot be handled with
     automatic csv2rdf scripts
 •   Maps workbooks to the RDF Data Cube vocabulary
 •   Layout needs to be manually annotated

 https://github.com/Data2Semantics/TabLinker
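The idea of supervised conversion can be sketched as follows. This is not TabLinker's implementation or output vocabulary: the tiny table, its numbers, and the layout dictionary are made up for illustration, and the sketch only shows how a manual mark-up of header and data areas lets every data cell become one observation with its dimensions.

```python
# Sketch of supervised table-to-observations conversion.
# The table values and the layout marks are hypothetical.
table = [
    ["",       "M", "V"],
    ["Kom",    120, 130],
    ["Buiten",  80,  75],
]

# Manual annotation: which cells are headers and which hold data.
layout = {
    "col_header_row": 0,
    "row_header_col": 0,
    "data_rows": range(1, 3),
    "data_cols": range(1, 3),
}

def to_observations(table, layout):
    """Turn each data cell into one observation with its two dimensions."""
    observations = []
    for r in layout["data_rows"]:
        for c in layout["data_cols"]:
            observations.append({
                "row_dimension": table[r][layout["row_header_col"]],
                "col_dimension": table[layout["col_header_row"]][c],
                "value": table[r][c],
            })
    return observations

observations = to_observations(table, layout)
for obs in observations:
    print(obs)
```

Complex layouts (nested headers, merged cells) are exactly where such a mark-up has to be supplied by hand, which is why a generic csv2rdf script is not enough.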
Iteration 1: querying
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size
WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
}
ORDER BY DESC(?size)
Iteration 1: querying




http://cedar-project.nl/visualizing-sparql-query-results-on-the-census/
Iteration 1: outcome
CEDAR development cycle,
iteration 2



• Gathering: arbitrary number of files
   • But, what do we have?
• Conversion: arbitrary table size, annotations
• Querying: SPARQL with mappings, top level ontologies
Iteration 2: gathering
                               Hey, what’s there?

Inventory of the dataset
•How many files do we have?
•How many tables/sheets?
•How many variables?
•How many annotations?


   TabExtractor (Python feat. xlrd, Levenshtein libs)

https://github.com/CEDAR-project/TabExtractor
Iteration 2: gathering




https://github.com/CEDAR-project/TabExtractor
https://www.dropbox.com/s/ah7lgmji2ofat3w/Census%20summary.xls
Iteration 2: gathering




https://github.com/CEDAR-project/TabExtractor
https://www.dropbox.com/s/vw1rf4pp8g8sxn3/annotations-dump-translation.csv
Iteration 2: gathering
      Year          File            Table    Row     Col        Author
       1899 VT_1899_06_H5.xls     Utrecht      155      3 Vreugdenhil
       1899 VT_1899_06_H5.xls     Utrecht      805      3 Vreugdenhil
       1930 WT_1930_04_A-T2.xls   Tabel 2a       0      0 Helpdesk
       1930 WT_1930_04_A-T2.xls   Tabel 2b       0      0 Th. Vreugdenhil
       1909 VT_1909_01_T.xls      Tabel 1    10058     13 DFS 7
       1909 VT_1909_01_T.xls      Tabel 1     3321     15 ServiceProfs 001
       1909 VT_1909_01_T.xls      Tabel 1    11909     13 DFS 7
       1909 VT_1909_01_T.xls      Tabel 1    12596     11 DFS 8
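The author column shows spelling variants of what may be the same annotator ("Vreugdenhil", "Th. Vreugdenhil"). The slides do not say what TabExtractor uses its Levenshtein library for; one plausible use, assumed here, is spotting such near-duplicate names. The sketch below implements the edit distance with the standard library only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Author strings taken from the table above.
print(levenshtein("Vreugdenhil", "Th. Vreugdenhil"))  # 4
print(levenshtein("DFS 7", "DFS 8"))                  # 1
```

A small distance alone does not prove two names are the same author: "DFS 7" and "DFS 8" differ by one character yet look like distinct identifiers, so any threshold would need manual review.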
Iteration 2: gathering
• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
  – 10.95% numerical corrections
  – 89.05% textual descriptions / anomalies


But TabExtractor ain’t a sexy thing…
• Bring metadata together
• Publish on the Web? Archive?
Iteration 2: gathering
Subset of the dataset
•Miniproject 1
   –   1889
   –   Occupational census
   –   Province Noord-Brabant
   –   1 table
•Miniproject 2
   –   1859, 1869, 1879, 1889
   –   Population census
   –   Province Noord-Brabant
   –   4 tables
Iteration 2: conversion
• Iteration 1 converted only the Excel cells to RDF
• Some cells have annotations attached
   – Value corrections: 5 → 8
   – Explanations, descriptions: Number includes 2 people of
     unknown age
   – Inconsistencies: Sum does not add up
• Iteration 2 produces proper named graphs for
  annotations
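A rough sketch of what keeping annotations in their own named graph could look like, written as N-Quads lines. All example.org URIs and property names are hypothetical placeholders; the actual annotations data model used in the project is not reproduced here.

```python
# Sketch: cell data and annotation statements live in separate named graphs.
# Every URI and property name below is a hypothetical placeholder.
BASE = "http://example.org/"

def quad(s, p, o, g):
    """Format one N-Quads line; o is either a URI or a plain literal."""
    if o.startswith("http"):
        obj = "<" + o + ">"
    else:
        obj = '"' + o.replace('"', '\\"') + '"'
    return f"<{s}> <{p}> {obj} <{g}> ."

cell = BASE + "cell/A1"
annotation = BASE + "annotation/1"
data_graph = BASE + "graph/data"
annotation_graph = BASE + "graph/annotations"

quads = [
    quad(cell, BASE + "populationSize", "5", data_graph),
    quad(annotation, BASE + "annotates", cell, annotation_graph),
    quad(annotation, BASE + "correctedValue", "8", annotation_graph),
]
print("\n".join(quads))
```

Separating the graphs means the original cell value (5) stays untouched, while the correction (8) can be queried, attributed, or ignored independently.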
Iteration 2: conversion
         Annotations data model
Iteration 2: data quality
• Annotations can improve data quality
• Model has to be extended with actions
  – If sum doesn’t add up → Retrieve numbers from other
    tables/sources
  – Appropriate vocabularies
Iteration 2: data quality
• Measure of data quality? Benford’s Law




   – Data distributions in censuses meet Benford’s Law
   – Demo available!
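Benford's Law states that the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). A minimal sketch of checking a list of counts against it follows. The sample values are made up, and the mean absolute deviation is only one possible measure of fit, assumed here; the slides do not say which measure the demo used.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_distribution(values):
    """Observed share of each leading digit among the positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def mean_absolute_deviation(values):
    """Average absolute gap between observed and expected digit shares."""
    expected = benford_expected()
    observed = first_digit_distribution(values)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9

# Made-up counts, only to exercise the functions.
sample = [1, 12, 130, 19, 2, 25, 3, 31, 4, 5, 6, 7, 8, 9, 11, 14, 170, 21]
print(round(benford_expected()[1], 3))  # 0.301
print(mean_absolute_deviation(sample))
```

A deviation close to zero suggests the table behaves like a natural count distribution; a large one flags a table worth inspecting for transcription errors.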
Iteration 2: querying
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size
WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
}
ORDER BY DESC(?size)
Iteration 2: querying
# Query against the 1889 table (VT_1889_12_H1_marked)
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size
WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
}
ORDER BY DESC(?size)

# Same question against the 1879 table (VT_1879_10_H1_marked)
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/>
PREFIX ns2: <http://www.data2semantics.org/core/Kom-buiten-de-kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size
WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M ;
                            ns2:Kom_Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "Totaal in de gemeente"@nl .
}
ORDER BY DESC(?size)
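The 1889 and 1879 queries ask the same question but differ in namespaces, property names, and labels. One way to make those year-specific differences explicit is a query template, sketched below. The parameter values are copied from the two queries; the template helper and its parameter names are assumptions for illustration, not part of the CEDAR tooling.

```python
from string import Template

# Shared query shape; the $-placeholders are the parts that vary per year.
QUERY_TEMPLATE = Template("""\
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <$data_ns>
PREFIX ns2: <$place_ns>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size
WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:$male ;
                            ns2:$place_property ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "$total_label"@nl .
}
ORDER BY DESC(?size)
""")

YEAR_PARAMETERS = {
    1889: {
        "data_ns": "http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/",
        "place_ns": "http://www.data2semantics.org/core/Eerste_gedeelte/Kom/",
        "male": "M_",
        "place_property": "Buiten_de_kom",
        "total_label": "TOT",
    },
    1879: {
        "data_ns": "http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/",
        "place_ns": "http://www.data2semantics.org/core/Kom-buiten-de-kom/",
        "male": "M",
        "place_property": "Kom_Buiten_de_kom",
        "total_label": "Totaal in de gemeente",
    },
}

def build_query(year):
    """Fill the shared template with the vocabulary of one census year."""
    return QUERY_TEMPLATE.substitute(YEAR_PARAMETERS[year])

print(build_query(1879))
```

Every entry in YEAR_PARAMETERS is a piece of knowledge the user currently has to carry in their head; harmonized mappings aim to move that knowledge into the data so a single query works across years.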
Iteration 2: querying
• Things to be mapped
  –   Occupations (HISCO)
  –   Municipalities (Amsterdamse Code)
  –   Housing types
  –   Religions
  –   Etc.
• Converted the HISCO and AC mappings to RDF
  (https://github.com/CEDAR-project/Harmonize)
  – Linked to the tables RDF
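As a rough illustration of what linking to a harmonization dataset involves, the sketch below maps census occupation labels to HISCO codes through a lookup table and reports what stays unmapped. The namespaces, the lookup table, and the label "Inktfabrikant" are hypothetical; only code 21110 (General Manager) is mentioned in these slides, and this is not the code in the Harmonize repository.

```python
# Sketch: link census occupation labels to HISCO codes through a lookup table.
# Namespaces and the mapping table are hypothetical stand-ins.
HISCO_BASE = "http://example.org/hisco/"
CENSUS_BASE = "http://example.org/census/occupation/"

hisco_mapping = {"general manager": "21110"}

def link_occupations(labels, mapping):
    """Return (link triples, labels without a mapping)."""
    linked, unmapped = [], []
    for label in labels:
        key = label.strip().lower()
        code = mapping.get(key)
        if code is None:
            unmapped.append(label)
        else:
            subject = CENSUS_BASE + key.replace(" ", "_")
            linked.append((subject, "skos:closeMatch", HISCO_BASE + code))
    return linked, unmapped

linked, unmapped = link_occupations(["General Manager", "Inktfabrikant"], hisco_mapping)
print(linked)
print(unmapped)
```

Tracking the unmapped labels matters as much as the links themselves, since they show where a top-down classification does not cover the historical source.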
Iteration 2: linking HISCO
Iteration 2: linking AC
Iteration 2: linking
•   Issue: HISCO is too generic (top-down approach)
     –   Class 21110 too abstract: General Manager
     –   Visualization of SPARQL HISCO mappings
•   Issue: AC works at the municipality level
     –   Other geographical harmonizations?

•   Need for year-level ontologies
     –   Classification systems are different
•   R script to do bottom-up approach → Classification
    extractor (https://github.com/albertmeronyo/OccupationOntology)
     –   Automated removal of non-related cols and rows
     –   Introduction of redundancy (‘Id.’ values)
     –   Removal of totals
     –   Work in progress: ontology merging
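Two of the cleaning steps listed above can be sketched in a few lines. The original tool is an R script; this Python version is only an illustration. It assumes that "Id." is a ditto mark meaning "same as the cell above" and that total rows start with the word "Totaal"; the sample rows are made up.

```python
def clean_occupation_rows(rows, ditto="Id.", total_prefix="Totaal"):
    """Replace ditto marks with the value above and drop total rows."""
    cleaned = []
    previous = None
    for row in rows:
        row = list(row)
        if row[0].startswith(total_prefix):
            continue  # totals are aggregates, not classes
        if previous is not None:
            # Copy the value from the previous kept row wherever a ditto mark appears.
            row = [prev if cell == ditto else cell
                   for cell, prev in zip(row, previous)]
        cleaned.append(row)
        previous = row
    return cleaned

# Hypothetical excerpt of a class/subclass header block.
rows = [
    ["Class A", "Subclass 1"],
    ["Id.",     "Subclass 2"],
    ["Totaal",  ""],
    ["Class B", "Subclass 3"],
]
cleaned = clean_occupation_rows(rows)
for row in cleaned:
    print(row)
```

After this step each row carries its full class path, so a hierarchy can be built row by row without looking back at earlier rows.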
Iteration 2: linking
[Diagram: upper ontologies (HISCO, AC) and year-dependent ontologies; how the two layers connect is marked as an open question (?)]
Concept drift
[Diagram: versions of a model at times t1, t2, ..., tn; the changes between versions are marked as unknown (?)]

• Models drift over time
• Classes merge, split, change their properties
  (beroepenklassen)
• Although, some core meaning remains
  (shoemakers)
• Can we automatically identify and align drifted
  models?
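One simple way to approach the alignment question is to compare which members a class has at two points in time. The sketch below uses Jaccard similarity of class extensions; this is an assumed technique chosen for illustration, not a method proposed in the slides, and the class names and members are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 means identical, 0.0 disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def align(classes_t1, classes_t2, threshold=0.5):
    """Pair each class at t1 with its most similar class at t2, if similar enough."""
    alignment = {}
    for name1, members1 in classes_t1.items():
        best = max(classes_t2,
                   key=lambda n: jaccard(members1, classes_t2[n]),
                   default=None)
        if best is not None and jaccard(members1, classes_t2[best]) >= threshold:
            alignment[name1] = best
    return alignment

# Hypothetical class extensions (sets of occupation labels) at two censuses.
t1 = {"leather": {"shoemaker", "saddler", "tanner"}}
t2 = {"footwear": {"shoemaker", "clog maker"}, "hides": {"tanner", "saddler"}}
print(align(t1, t2))  # {'leather': 'hides'}
```

Here the "leather" class has split: most of its members moved to "hides", while "shoemaker" moved to "footwear". A one-to-one alignment hides that split, which is part of why drift detection is harder than plain matching.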
Conclusion: milestones
•   Complete inventory of the dataset (w/ metadata
    generation)
•   Translation to RDF
    – Raw data
    – Annotations
    – Harmonization/linking
•   Successful data quality experiments (Benford’s Law)
•   Useful software
    –   TabLinker (Excel/CSV to RDF)
    –   TabExtractor (Excel/CSV metadata collector)
    –   Harmonize (HISCO/AC to Census linker)
    –   OccupationOntology (bottom-up occupation ontology extractor)
Conclusion: future work
•   Better software
    –   TabLinker: automate mark-up process
    –   TabExtractor: improve and publish inventory output
    –   Harmonize: improve HISCO/AC datamodels
    –   OccupationOntology: extend to housing types, religions, etc.
•   Concept drift literature on drifting models (Kuukkanen
    2008, Gonçalves et al. 2009, Shenghui et al. 2010)
•   Semantic Web literature on modeling geographical
    change (Kauppinen 2010)
    – Integrate with AC dataset?
•   Link meaningful datasets with the census
    –   Labour strikes
    –   Book publications
    –   More?
Thank you

                                http://www.cedar-project.nl

                               albert.merono@dans.knaw.nl


Data Archiving and Networked Services (DANS)
Anna van Saksenlaan 10 | 2593 HT Den Haag
Postbus 93067 | 2509 AB Den Haag
070 3446 484 | info@dans.knaw.nl | www.dans.knaw.nl
KVK 54667089 | DANS is an institute of KNAW and NWO

 
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptxLorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
Lorenzo D'Emidio_Lavoro sullaNorth Korea .pptx
 
29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdf29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdf
 
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
 
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
 
TDP As the Party of Hope For AP Youth Under N Chandrababu Naidu’s Leadership
TDP As the Party of Hope For AP Youth Under N Chandrababu Naidu’s LeadershipTDP As the Party of Hope For AP Youth Under N Chandrababu Naidu’s Leadership
TDP As the Party of Hope For AP Youth Under N Chandrababu Naidu’s Leadership
 
26042024_First India Newspaper Jaipur.pdf
26042024_First India Newspaper Jaipur.pdf26042024_First India Newspaper Jaipur.pdf
26042024_First India Newspaper Jaipur.pdf
 
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
 
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
 
2024 03 13 AZ GOP LD4 Gen Meeting Minutes_FINAL.docx
2024 03 13 AZ GOP LD4 Gen Meeting Minutes_FINAL.docx2024 03 13 AZ GOP LD4 Gen Meeting Minutes_FINAL.docx
2024 03 13 AZ GOP LD4 Gen Meeting Minutes_FINAL.docx
 
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
 
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost LoverPowerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
 
BDSM⚡Call Girls in Indirapuram Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Indirapuram Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Indirapuram Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Indirapuram Escorts >༒8448380779 Escort Service
 
Enjoy Night⚡Call Girls Rajokri Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Rajokri Delhi >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Rajokri Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Rajokri Delhi >༒8448380779 Escort Service
 
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's DevelopmentNara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
 

Linked Census Data

  • 1. Data Archiving and Networked Services (DANS). Linked Census Data: Semantics for Knowledge Discovery of the Past. Albert Meroño-Peñuela, 01/03/2013. DANS is an institute of KNAW and NWO.
  • 2. Main goal: cross queries ?
  • 3. Main goal: requirements • Schema flexibility: do not commit to a specific schema • Linkage – Internally (e.g. between tables), to make relations explicit – Externally • Harmonization datasets (e.g. HISCO, AC) • Enriching datasets (e.g. labour strikes, book publications) • Inference: of new knowledge (e.g. ink_manufacturer(X) ∧ ink_manufacturer ⊑ chemical ⊨ chemical(X)) • Publication: as open data for researchers on the Web (through Service Architectures)
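The inference requirement can be illustrated with a minimal sketch in plain Python. The class names come from the slide; the `SUBCLASS_OF` table and the `infer_types` helper are hypothetical, and a real setup would more likely rely on an RDFS/OWL reasoner over the RDF data than on hand-written code.

```python
# Hypothetical subclass axioms: child class -> parent class.
SUBCLASS_OF = {"ink_manufacturer": "chemical"}


def infer_types(asserted, subclass_of):
    """Return the asserted classes plus every superclass reachable from them."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        cls = frontier.pop()
        parent = subclass_of.get(cls)
        if parent is not None and parent not in inferred:
            inferred.add(parent)
            frontier.append(parent)
    return inferred


# An individual asserted to be an ink manufacturer only.
types_of_x = infer_types({"ink_manufacturer"}, SUBCLASS_OF)
print(sorted(types_of_x))
```

The membership in `chemical` is never stated for the individual; it follows from the subclass axiom alone, which is the kind of new knowledge the requirement refers to.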
  • 4. Main goal: RDF datamodel
  • 5. CEDAR development cycle, iteration 1 • Gathering: only one file • Conversion: TabLinker, small table size • Querying: simple, ad-hoc SPARQL + trivial visualization
  • 6. Iteration 1: conversion • Supervised Excel to RDF conversion • Python feat. xlutils, xlrd, rdflib libs • Intended for complex layouts that cannot be handled with automatic csv2rdf scripts • Maps workbooks to the RDF Data Cube vocabulary • Layout needs to be manually annotated https://github.com/Data2Semantics/TabLinker
  • 9. Iteration 1: querying
      PREFIX d2s: <http://www.data2semantics.org/core/>
      PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
      PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

      SELECT ?place ?size
      WHERE {
        ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal;
                                  d2s:dimension d2sdata:M_;
                                  ns2:Buiten_de_kom ?place;
                                  d2s:populationSize ?size ] .
        ?place skos:prefLabel "TOT"@nl .
      }
      ORDER BY DESC(?size)
  • 12. CEDAR development cycle, iteration 2 • Gathering: arbitrary number of files • But, what do we have? • Conversion: arbitrary table size, annotations • Querying: SPARQL with mappings, top level ontologies
  • 13. Iteration 2: gathering Hey, what’s there? Inventory of the dataset •How many files do we have? •How many tables/sheets? •How many variables? •How many annotations? TabExtractor (Python feat. xlrd, Levenshtein libs) https://github.com/CEDAR-project/TabExtractor
  • 16. Iteration 2: gathering
      Year  File                 Table     Row    Col  Author
      1899  VT_1899_06_H5.xls    Utrecht   155    3    Vreugdenhil
      1899  VT_1899_06_H5.xls    Utrecht   805    3    Vreugdenhil
      1930  WT_1930_04_A-T2.xls  Tabel 2a  0      0    Helpdesk
      1930  WT_1930_04_A-T2.xls  Tabel 2b  0      0    Th. Vreugdenhil
      1909  VT_1909_01_T.xls     Tabel 1   10058  13   DFS 7
      1909  VT_1909_01_T.xls     Tabel 1   3321   15   ServiceProfs 001
      1909  VT_1909_01_T.xls     Tabel 1   11909  13   DFS 7
      1909  VT_1909_01_T.xls     Tabel 1   12596  11   DFS 8
  • 17. Iteration 2: gathering • 507 Excel files • 2,288 tables • 33,283 annotated cells – 10.95% numerical corrections – 89.05% textual descriptions / anomalies But TabExtractor ain’t a sexy thing… • Bring metadata together • Publish on the Web? Archive?
  • 18. Iteration 2: gathering Subset of the dataset •Miniproject 1 – 1889 – Occupational census – Province Noord-Brabant – 1 table •Miniproject 2 – 1859, 1869, 1879, 1889 – Population census – Province Noord-Brabant – 4 tables
  • 19. Iteration 2: conversion • Iteration 1 converted only the Excel cells to RDF • Some cells have annotations attached – Value corrections: 5 → 8 – Explanations, descriptions: Number includes 2 people of unknown age – Inconsistencies: Sum does not add up • Iteration 2 produces proper named graphs for annotations
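A rough sketch of why keeping annotations in their own named graph is convenient, written in plain Python with quads as tuples. Every identifier below (the `ex:` IRIs, the graph names, `effective_value`) is a hypothetical placeholder, not the vocabulary or model used by TabLinker or CEDAR; only the 5 → 8 correction is taken from the slide.

```python
# A quad is (graph, subject, predicate, object).
# All IRIs are made-up placeholders for illustration.
DATA_GRAPH = "ex:graph/data"
ANNOTATION_GRAPH = "ex:graph/annotations"

quads = [
    (DATA_GRAPH, "ex:cell/B12", "ex:populationSize", 5),
    (DATA_GRAPH, "ex:cell/B13", "ex:populationSize", 40),
    (ANNOTATION_GRAPH, "ex:annotation/1", "ex:annotates", "ex:cell/B12"),
    (ANNOTATION_GRAPH, "ex:annotation/1", "ex:correctedValue", 8),
]


def effective_value(quads, cell):
    """Return the corrected value for a cell if an annotation supplies one,
    otherwise the value recorded in the data graph."""
    original = None
    annotations = set()
    for graph, s, p, o in quads:
        if graph == DATA_GRAPH and s == cell and p == "ex:populationSize":
            original = o
        if graph == ANNOTATION_GRAPH and p == "ex:annotates" and o == cell:
            annotations.add(s)
    for graph, s, p, o in quads:
        if graph == ANNOTATION_GRAPH and s in annotations and p == "ex:correctedValue":
            return o
    return original


print(effective_value(quads, "ex:cell/B12"))
print(effective_value(quads, "ex:cell/B13"))
```

Because the raw cell values and the annotations live in different graphs, a consumer can query the data as transcribed, the corrections alone, or the combination of both.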
  • 20. Iteration 2: conversion Annotations data model
  • 23. Iteration 2: data quality • Annotations can improve data quality • Model has to be extended with actions – If sum doesn’t add up → Retrieve numbers from other tables/sources – Appropriate vocabularies
  • 24. Iteration 2: data quality • Measure of data quality? Benford’s Law – Data distributions in censuses meet Benford’s Law – Demo available!
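A minimal sketch of a Benford's Law check in Python, for readers unfamiliar with it. The expected share of leading digit d is log10(1 + 1/d); the helper names and the sample data (powers of two, not census counts) are assumptions made for illustration, and the demo mentioned on the slide may work differently.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequencies(values):
    """Observed relative frequency of leading digits 1-9 among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(observed, expected):
    """Average absolute gap between observed and expected digit shares."""
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Stand-in data: powers of two, a sequence known to follow Benford's Law closely.
sample = [2 ** n for n in range(1, 200)]
observed = first_digit_frequencies(sample)
expected = benford_expected()
print(round(mean_absolute_deviation(observed, expected), 4))
```

A small deviation suggests the figures are distributed as Benford's Law predicts; a large one flags a table worth inspecting, although it does not by itself prove an error.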
  • 25. Iteration 2: querying
      PREFIX d2s: <http://www.data2semantics.org/core/>
      PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
      PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

      SELECT ?place ?size
      WHERE {
        ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal;
                                  d2s:dimension d2sdata:M_;
                                  ns2:Buiten_de_kom ?place;
                                  d2s:populationSize ?size ] .
        ?place skos:prefLabel "TOT"@nl .
      }
      ORDER BY DESC(?size)
  • 26. Iteration 2: querying (the same query against two census tables, shown side by side on the slide)

      Query for the 1889 table:
      PREFIX d2s: <http://www.data2semantics.org/core/>
      PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
      PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

      SELECT ?place ?size
      WHERE {
        ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal;
                                  d2s:dimension d2sdata:M_;
                                  ns2:Buiten_de_kom ?place;
                                  d2s:populationSize ?size ] .
        ?place skos:prefLabel "TOT"@nl .
      }
      ORDER BY DESC(?size)

      Query for the 1879 table:
      PREFIX d2s: <http://www.data2semantics.org/core/>
      PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/>
      PREFIX ns2: <http://www.data2semantics.org/core/Kom-buiten-de-kom/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

      SELECT ?place ?size
      WHERE {
        ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal;
                                  d2s:dimension d2sdata:M;
                                  ns2:Kom_Buiten_de_kom ?place;
                                  d2s:populationSize ?size ] .
        ?place skos:prefLabel "Totaal in de gemeente"@nl .
      }
      ORDER BY DESC(?size)
  • 28. Iteration 2: querying • Things to be mapped – Occupations (HISCO) – Municipalities (Amsterdamse Code) – Housing types – Religions – Etc. • Converted the HISCO and AC mappings to RDF (https://github.com/CEDAR-project/Harmonize) – Linked to the tables RDF
  • 32. Iteration 2: linking • Issue: HISCO is too generic (top-down approach) – Class 21110 too abstract: General Manager – Visualization of SPARQL HISCO mappings • Issue: AC works at the municipality level – Other geographical harmonizations? • Need for year-level ontologies – Classification systems are different • R script for a bottom-up approach → Classification extractor (https://github.com/albertmeronyo/OccupationOntology) – Automated removal of non-related cols and rows – Introduction of redundancy (‘Id.’ values) – Removal of totals – Work in progress: ontology merging
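Two of the clean-up steps listed for the classification extractor can be sketched as follows. The original is an R script; this is a Python stand-in, not a port of it. It assumes that ‘Id.’ is a ditto mark meaning "same as the row above", and the function names and sample rows are made up.

```python
def expand_ditto(labels, marker="Id."):
    """Replace each ditto marker with the most recent non-marker label."""
    expanded = []
    previous = None
    for label in labels:
        if label == marker:
            if previous is None:
                raise ValueError("ditto marker with nothing to repeat")
            expanded.append(previous)
        else:
            expanded.append(label)
            previous = label
    return expanded


def drop_totals(rows, total_labels=("Totaal", "TOT")):
    """Keep only rows whose first field is not a totals label."""
    return [row for row in rows if row[0] not in total_labels]


# Made-up rows: (occupation class, occupation, count).
classes = expand_ditto(["Schoenmakers", "Id.", "Id.", "Totaal"])
rows = list(zip(classes, ["a", "b", "c", "-"], [3, 1, 2, 6]))
print(drop_totals(rows))
```

Expanding the ditto marks makes every row self-describing, so rows can later be filtered or regrouped without losing their class label.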
  • 33. Iteration 2: linking • Upper ontologies (HISCO, AC) • Year-dependent ontologies
  • 35. Iteration 2: linking • Upper ontologies (HISCO, AC) • Year-dependent ontologies ? ?
  • 36. Concept drift [Figure: timeline t1, t2, tn, with question marks] • Models drift over time • Classes merge, split, or change their properties (beroepenklassen, occupational classes) • Some core meaning remains, however (shoemakers) • Can we automatically identify and align drifted models?
  • 37. Conclusion: milestones • Complete inventory of the dataset (w/ metadata generation) • Translation to RDF – Raw data – Annotations – Harmonization/linking • Successful data quality experiments (Benford’s Law) • Useful software – TabLinker (Excel/CSV to RDF) – TabExtractor (Excel/CSV metadata collector) – Harmonize (HISCO/AC to Census linker) – OccupationOntology (bottom-up occupation ontology extractor)
  • 38. Conclusion: future work • Better software – TabLinker: automate mark-up process – TabExtractor: improve and publish inventory output – Harmonize: improve HISCO/AC datamodels – OccupationOntology: extend to housing types, religions, etc. • Concept drift literature on drifting models (Kuukkanen 2008, Gonçalves et al. 2009, Shenghui et al. 2010) • Semantic Web literature on modeling geographical change (Kauppinen 2010) – Integrate with AC dataset? • Link meaningful datasets with the census – Labour strikes – Book publications – More?
  • 39. Thank you http://www.cedar-project.nl albert.merono@dans.knaw.nl Data Archiving and Networked Services (DANS) Anna van Saksenlaan 10 | 2593 HT Den Haag Postbus 93067 | 2509 AB Den Haag 070 3446 484 | info@dans.knaw.nl | www.dans.knaw.nl KVK 54667089 | DANS is an institute of KNAW and NWO