Linked Census Data


Transcript of "Linked Census Data"

  1. Data Archiving and Networked Services
     Linked Census Data: Semantics for Knowledge Discovery of the Past
     Albert Meroño-Peñuela, 01/03/2013
     DANS is an institute of KNAW and NWO
  2. Main goal: cross queries?
  3. Main goal: requirements
     • Schema flexibility: do not commit to a specific schema
     • Linkage
       – Internally (e.g. between tables), to make relations explicit
       – Externally
         • Harmonization datasets (e.g. HISCO, AC)
         • Enriching datasets (e.g. labour strikes, book publications)
     • Inference of new knowledge (e.g. ink_manufacturer(X) & ink_manufacturer ⊑ chemical |= chemical(X))
     • Publication: as open data for researchers on the Web (through Service Architectures)
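The inference requirement on slide 3 can be illustrated with a minimal sketch. It reads the slide's example as a subclass relation (ink_manufacturer is a kind of chemical), which is an assumption; the dictionary `SUBCLASS_OF` and the function `inferred_classes` are hypothetical names, and a real deployment would more likely rely on an RDFS/OWL reasoner.

```python
# Hypothetical class hierarchy: each class maps to its direct superclass.
# The single entry mirrors the slide's example (assumed to be a subclass relation).
SUBCLASS_OF = {"ink_manufacturer": "chemical"}


def inferred_classes(asserted, subclass_of):
    """Return the asserted class followed by every superclass reachable from it."""
    classes = []
    seen = set()
    current = asserted
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles in the hierarchy
        classes.append(current)
        current = subclass_of.get(current)
    return classes


# X is asserted to be an ink manufacturer; membership in "chemical" follows.
x_classes = inferred_classes("ink_manufacturer", SUBCLASS_OF)
print(x_classes)
```

The point is only that membership in the more general class is derived, not stored, which is what makes cross queries over differently classified tables possible.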
  4. Main goal: RDF datamodel
  5. CEDAR development cycle, iteration 1
     • Gathering: only one file
     • Conversion: TabLinker, small table size
     • Querying: simple, ad-hoc SPARQL + trivial visualization
  6. Iteration 1: conversion
     • Supervised Excel to RDF conversion
     • Python feat. xlutils, xlrd, rdflib libs
     • Intended for complex layouts that cannot be handled with automatic csv2rdf scripts
     • Maps workbooks to the RDF Data Cube vocabulary
     • Layout needs to be manually annotated
     https://github.com/Data2Semantics/TabLinker
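To make the cell-to-RDF step on slide 6 concrete, here is a rough sketch that serializes one data cell as N-Triples using only the standard library. It is not TabLinker's code. The property names `isObservation`, `dimension` and `populationSize` are taken from the SPARQL queries in this deck; the `qb:Observation` typing, the cell URI under `example.org` and the value 1234 are assumptions made for illustration.

```python
QB = "http://purl.org/linked-data/cube#"          # RDF Data Cube namespace
D2S = "http://www.data2semantics.org/core/"       # namespace used in the deck's queries
D2SDATA = "http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def cell_to_ntriples(cell_uri, dimension_uris, value, node_id="obs1"):
    """Serialize one data cell and its dimensions as a list of N-Triples lines."""
    obs = f"_:{node_id}"  # blank node standing for the observation
    lines = [
        f"<{cell_uri}> <{D2S}isObservation> {obs} .",
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",  # assumed typing
    ]
    for dim in dimension_uris:
        lines.append(f"{obs} <{D2S}dimension> <{dim}> .")
    lines.append(f'{obs} <{D2S}populationSize> "{int(value)}"^^<{XSD_INT}> .')
    return lines


# Hypothetical cell: URI and value are made up for the example.
lines = cell_to_ntriples(
    "http://example.org/cell/C12",
    [D2SDATA + "Totaal", D2SDATA + "M_"],
    1234,
)
print("\n".join(lines))
```

The manual mark-up mentioned on the slide is what would tell such a converter which cells are data and which header cells supply the dimensions.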
  7. Iteration 1: conversion
  8. Iteration 1: conversion
  9. Iteration 1: querying
       PREFIX d2s: <http://www.data2semantics.org/core/>
       PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
       PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
       PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT ?place ?size
       WHERE {
         ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                                   d2s:dimension d2sdata:M_ ;
                                   ns2:Buiten_de_kom ?place ;
                                   d2s:populationSize ?size ] .
         ?place skos:prefLabel "TOT"@nl .
       }
       ORDER BY DESC(?size)
  10. Iteration 1: querying
      http://cedar-project.nl/visualizing-sparql-query-results-on-the-census/
  11. Iteration 1: outcome
  12. CEDAR development cycle, iteration 2
      • Gathering: arbitrary number of files
        – But, what do we have?
      • Conversion: arbitrary table size, annotations
      • Querying: SPARQL with mappings, top level ontologies
  13. Iteration 2: gathering
      Hey, what’s there? Inventory of the dataset
      • How many files do we have?
      • How many tables/sheets?
      • How many variables?
      • How many annotations?
      TabExtractor (Python feat. xlrd, Levenshtein libs)
      https://github.com/CEDAR-project/TabExtractor
  14. Iteration 2: gathering
      https://github.com/CEDAR-project/TabExtractor
      https://www.dropbox.com/s/ah7lgmji2ofat3w/Census%20summary.xls
  15. Iteration 2: gathering
      https://github.com/CEDAR-project/TabExtractor
      https://www.dropbox.com/s/vw1rf4pp8g8sxn3/annotations-dump-translation.csv
  16. Iteration 2: gathering
      Year  File                 Table     Row    Col  Author
      1899  VT_1899_06_H5.xls    Utrecht   155    3    Vreugdenhil
      1899  VT_1899_06_H5.xls    Utrecht   805    3    Vreugdenhil
      1930  WT_1930_04_A-T2.xls  Tabel 2a  0      0    Helpdesk
      1930  WT_1930_04_A-T2.xls  Tabel 2b  0      0    Th. Vreugdenhil
      1909  VT_1909_01_T.xls     Tabel 1   10058  13   DFS 7
      1909  VT_1909_01_T.xls     Tabel 1   3321   15   ServiceProfs 001
      1909  VT_1909_01_T.xls     Tabel 1   11909  13   DFS 7
      1909  VT_1909_01_T.xls     Tabel 1   12596  11   DFS 8
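The inventory questions on slide 13 (how many files, tables, annotations) boil down to counting over records like those on slide 16. The sketch below does this for the eight annotation rows shown there; it is an illustration of the idea and not TabExtractor's implementation, and the variable names are hypothetical.

```python
from collections import Counter

# The eight annotation records from slide 16:
# (year, file, table, row, col, author)
ROWS = [
    (1899, "VT_1899_06_H5.xls", "Utrecht", 155, 3, "Vreugdenhil"),
    (1899, "VT_1899_06_H5.xls", "Utrecht", 805, 3, "Vreugdenhil"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2a", 0, 0, "Helpdesk"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2b", 0, 0, "Th. Vreugdenhil"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 10058, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 3321, 15, "ServiceProfs 001"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 11909, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 12596, 11, "DFS 8"),
]

# Annotations per file, and per (file, table) pair.
annotations_per_file = Counter(row[1] for row in ROWS)
annotations_per_table = Counter((row[1], row[2]) for row in ROWS)

print("files:", len(annotations_per_file))
print("tables:", len(annotations_per_table))
print("annotations:", len(ROWS))
for file_name, count in annotations_per_file.most_common():
    print(f"  {file_name}: {count}")
```

Run over the whole dataset, the same counting would yield the totals reported on the next slide.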
  17. Iteration 2: gathering
      • 507 Excel files
      • 2,288 tables
      • 33,283 annotated cells
        – 10.95% numerical corrections
        – 89.05% textual descriptions / anomalies
      But TabExtractor ain’t a sexy thing…
      • Bring metadata together
      • Publish on the Web? Archive?
  18. Iteration 2: gathering
      Subset of the dataset
      • Miniproject 1
        – 1889
        – Occupational census
        – Province Noord-Brabant
        – 1 table
      • Miniproject 2
        – 1859, 1869, 1879, 1889
        – Population census
        – Province Noord-Brabant
        – 4 tables
  19. Iteration 2: conversion
      • Iteration 1 converted only Excel cells to RDF
      • Some cells have annotations attached
        – Value corrections: 5 → 8
        – Explanations, descriptions: Number includes 2 people of unknown age
        – Inconsistencies: Sum does not add up
      • Iteration 2 produces proper named graphs for annotations
  20. Iteration 2: conversion
      Annotations data model
  21. Iteration 2: conversion
      Annotations data model
  22. Iteration 2: conversion
  23. Iteration 2: data quality
      • Annotations can improve data quality
      • Model has to be extended with actions
        – If sum doesn’t add up → Retrieve numbers from other tables/sources
        – Appropriate vocabularies
  24. Iteration 2: data quality
      • Measure of data quality? Benford’s Law
        – Data distributions in censuses meet Benford’s Law
        – Demo available!
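A Benford's Law check of the kind mentioned on slide 24 compares the observed frequency of leading digits with the expected frequency log10(1 + 1/d). The sketch below is one simple way to do that, using the mean absolute deviation as the score; the scoring choice and the sample data (powers of two, a textbook Benford-conforming sequence) are assumptions, and this is not the project's demo.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequencies(values):
    """Observed share of each leading digit among the non-zero integer values."""
    digits = [int(str(abs(int(v)))[0]) for v in values if int(v) != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(values):
    """Average gap between observed and expected leading-digit shares."""
    observed = first_digit_frequencies(values)
    expected = benford_expected()
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Powers of two are known to follow Benford's Law closely;
# the digits 1..9 taken once each do not.
powers_of_two = [2 ** n for n in range(1, 200)]
uniform_digits = list(range(1, 10))

print("powers of two:", round(mean_absolute_deviation(powers_of_two), 4))
print("uniform digits:", round(mean_absolute_deviation(uniform_digits), 4))
```

A low score for a census table's cell values would be consistent with the slide's claim; a high score could flag transcription problems worth inspecting.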
  25. Iteration 2: querying
       PREFIX d2s: <http://www.data2semantics.org/core/>
       PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
       PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
       PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT ?place ?size
       WHERE {
         ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                                   d2s:dimension d2sdata:M_ ;
                                   ns2:Buiten_de_kom ?place ;
                                   d2s:populationSize ?size ] .
         ?place skos:prefLabel "TOT"@nl .
       }
       ORDER BY DESC(?size)
  26. Iteration 2: querying
      Query on VT_1889_12_H1_marked:
       PREFIX d2s: <http://www.data2semantics.org/core/>
       PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
       PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
       PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT ?place ?size
       WHERE {
         ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                                   d2s:dimension d2sdata:M_ ;
                                   ns2:Buiten_de_kom ?place ;
                                   d2s:populationSize ?size ] .
         ?place skos:prefLabel "TOT"@nl .
       }
       ORDER BY DESC(?size)
      Query on VT_1879_10_H1_marked:
       PREFIX d2s: <http://www.data2semantics.org/core/>
       PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/>
       PREFIX ns2: <http://www.data2semantics.org/core/Kom-buiten-de-kom/>
       PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT ?place ?size
       WHERE {
         ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                                   d2s:dimension d2sdata:M ;
                                   ns2:Kom_Buiten_de_kom ?place ;
                                   d2s:populationSize ?size ] .
         ?place skos:prefLabel "Totaal in de gemeente"@nl .
       }
       ORDER BY DESC(?size)
  27. Iteration 2: querying (same pair of queries as slide 26)
  28. Iteration 2: querying
      • Things to be mapped
        – Occupations (HISCO)
        – Municipalities (Amsterdamse Code)
        – Housing types
        – Religions
        – Etc.
      • Converted the HISCO and AC mappings to RDF (https://github.com/CEDAR-project/Harmonize)
        – Linked to the tables RDF
  29. Iteration 2: linking HISCO
  30. Iteration 2: linking AC
  31. Iteration 2: linking
  32. Iteration 2: linking
      • Issue: HISCO is too generic (top-down approach)
        – Class 21110 too abstract: General Manager
        – Visualization of SPARQL HISCO mappings
      • Issue: AC works at the municipality level
        – Other geographical harmonizations?
      • Need for year-level ontologies
        – Classification systems are different
      • R script to do bottom-up approach → Classification extractor (https://github.com/albertmeronyo/OccupationOntology)
        – Automated removal of non-related cols and rows
        – Introduction of redundancy (‘Id.’ values)
        – Removal of totals
        – Work in progress: ontology merging
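Two of the clean-up steps listed on slide 32 can be sketched briefly. The original tool is an R script; the Python below is only an illustration, and it assumes that an 'Id.' cell means "same label as the row above" and that total rows are labelled "Totaal" or "TOT" (labels that appear in the deck's queries). The sample rows and the function name `clean_classification` are hypothetical.

```python
def clean_classification(rows, ditto="Id.", total_markers=("totaal", "tot")):
    """Drop total rows and replace ditto labels with the last real label seen.

    rows is a list of (label, count) pairs in table order.
    """
    cleaned = []
    previous = None
    for label, count in rows:
        text = label.strip()
        if text.lower() in total_markers:
            continue  # removal of totals
        if text == ditto and previous is not None:
            text = previous  # introduce redundancy: repeat the label above
        else:
            previous = text
        cleaned.append((text, count))
    return cleaned


# Hypothetical occupation rows, made up for the example.
sample = [
    ("Schoenmakers", 12),
    ("Id.", 3),
    ("Smeden", 7),
    ("Totaal", 22),
]
cleaned = clean_classification(sample)
print(cleaned)
```

After this step every row carries an explicit label, so the class hierarchy of a given census year can be extracted row by row.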
  33. Iteration 2: linking
      Upper ontologies (HISCO, AC)
      Year-dependent ontologies
  34. Iteration 2: linking
      Upper ontologies (HISCO, AC)
      Year-dependent ontologies
  35. Iteration 2: linking
      Upper ontologies (HISCO, AC)
      Year-dependent ontologies ? ?
  36. Concept drift
      [Figure: models at times t1, t2, …, tn, with question marks between them]
      • Models drift over time
      • Classes merge, split, change their properties (beroepenklassen, i.e. occupational classes)
      • Some core meaning remains, though (shoemakers)
      • Can we automatically identify and align drifted models?
  37. Conclusion: milestones
      • Complete inventory of the dataset (w/ metadata generation)
      • Translation to RDF
        – Raw data
        – Annotations
        – Harmonization/linking
      • Successful data quality experiments (Benford’s Law)
      • Useful software
        – TabLinker (Excel/CSV to RDF)
        – TabExtractor (Excel/CSV metadata collector)
        – Harmonize (HISCO/AC to Census linker)
        – OccupationOntology (bottom-up occupation ontology extractor)
  38. Conclusion: future work
      • Better software
        – TabLinker: automate mark-up process
        – TabExtractor: improve and publish inventory output
        – Harmonize: improve HISCO/AC datamodels
        – OccupationOntology: extend to housing types, religions, etc.
      • Concept drift literature on drifting models (Kuukkanen 2008, Gonçalves et al. 2009, Shenghui et al. 2010)
      • Semantic Web literature on modeling geographical change (Kauppinen 2010)
        – Integrate with AC dataset?
      • Link meaningful datasets with the census
        – Labour strikes
        – Book publications
        – More?
  39. Thank you
      http://www.cedar-project.nl
      albert.merono@dans.knaw.nl
      Data Archiving and Networked Services (DANS)
      Anna van Saksenlaan 10 | 2593 HT Den Haag
      Postbus 93067 | 2509 AB Den Haag
      070 3446 484 | info@dans.knaw.nl | www.dans.knaw.nl
      KVK 54667089 | DANS is an institute of KNAW and NWO