Linked Census Data
1. Data Archiving and Networked Services
Semantics for Knowledge Discovery of the
Past
Albert Meroño-Peñuela
01/03/2013
DANS is an institute of KNAW and NWO
3. Main goal: requirements
• Schema flexibility: do not commit to a specific
schema
• Linkage
– Internally (e.g. between tables), to make relations explicit
– Externally
• Harmonization datasets (e.g. HISCO, AC)
• Enriching datasets (e.g. labour strikes, book publications)
• Inference: of new knowledge (e.g.
ink_manufacturer(X) ∧ ink_manufacturer ⊑ chemical ⊨
chemical(X))
• Publication: as open data for researchers on the
Web (through Service Architectures)
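The inference requirement can be illustrated with a minimal sketch, assuming a plain dictionary as the class hierarchy (the project itself targets RDF, so this is only a stand-in). The names `SUBCLASS_OF` and `infer_types` are hypothetical. Given that X is an ink manufacturer and that ink manufacturers are a subclass of chemical occupations, the sketch derives that X also belongs to the chemical class.

```python
# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS_OF = {"ink_manufacturer": "chemical"}


def infer_types(asserted, subclass_of):
    """Return the asserted classes plus every superclass reachable from them."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        cls = frontier.pop()
        parent = subclass_of.get(cls)
        if parent is not None and parent not in inferred:
            inferred.add(parent)
            frontier.append(parent)
    return inferred


# X is asserted to be an ink manufacturer; "chemical" is derived.
types_of_x = infer_types({"ink_manufacturer"}, SUBCLASS_OF)
print(sorted(types_of_x))
```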
5. CEDAR development cycle, iteration 1
• Gathering: only one file
• Conversion: TabLinker, small table size
• Querying: simple, ad-hoc SPARQL + trivial visualization
6. Iteration 1: conversion
• Supervised Excel to RDF conversion
• Python feat. xlutils, xlrd, rdflib libs
• Intended for complex layouts that cannot be handled with
automatic csv2rdf scripts
• Maps workbooks to the RDF Data Cube vocabulary
• Layout needs to be manually annotated
https://github.com/Data2Semantics/TabLinker
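The idea of mapping an annotated table to the RDF Data Cube vocabulary can be sketched as follows. This is not TabLinker's code: it is a simplified illustration using only the standard library, with triples as plain tuples. The `EX` namespace, the property names under it, the `sheet` layout and all numbers are made up for the example; only `qb:Observation` and `qb:dataSet` come from the Data Cube vocabulary.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/census/"  # hypothetical namespace

# A tiny "annotated" sheet: the mark-up says which cells are headers
# and which are data. Labels and numbers are invented.
sheet = {
    "col_headers": ["men", "women"],
    "row_headers": ["occupation A", "occupation B"],
    "data": [[12, 3], [7, 0]],
}


def slug(text):
    """Turn a header label into a URI-friendly fragment."""
    return text.strip().lower().replace(" ", "_")


def sheet_to_triples(sheet, dataset="table1"):
    """Emit one qb:Observation per data cell, linked to its row and column header."""
    triples = []
    for r, row_label in enumerate(sheet["row_headers"]):
        for c, col_label in enumerate(sheet["col_headers"]):
            obs = f"{EX}{dataset}/obs/r{r}c{c}"
            triples.append((obs, RDF_TYPE, QB + "Observation"))
            triples.append((obs, QB + "dataSet", EX + dataset))
            triples.append((obs, EX + "rowDimension", EX + slug(row_label)))
            triples.append((obs, EX + "colDimension", EX + slug(col_label)))
            triples.append((obs, EX + "value", sheet["data"][r][c]))
    return triples


triples = sheet_to_triples(sheet)
for triple in triples[:5]:
    print(triple)
```

The manual annotation step mentioned above corresponds to filling in `col_headers`, `row_headers` and `data` by hand for layouts that an automatic script cannot guess.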
12. CEDAR development cycle, iteration 2
• Gathering: arbitrary number of files
• But, what do we have?
• Conversion: arbitrary table size, annotations
• Querying: SPARQL with mappings, top level ontologies
13. Iteration 2: gathering
Hey, what’s there?
Inventory of the dataset
• How many files do we have?
• How many tables/sheets?
• How many variables?
• How many annotations?
TabExtractor (Python feat. xlrd, Levenshtein libs)
https://github.com/CEDAR-project/TabExtractor
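One plausible use of Levenshtein distance in such an inventory is to avoid counting spelling variants of the same column header as separate variables. The sketch below is an assumption about that use, not TabExtractor's actual logic; it implements the edit distance in pure Python instead of calling the Levenshtein library, and the header strings are invented.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_variables(headers, max_distance=1):
    """Greedily group headers that are within max_distance of a group's first member."""
    groups = []
    for header in headers:
        key = header.strip().lower()
        for group in groups:
            if levenshtein(key, group[0]) <= max_distance:
                group.append(key)
                break
        else:
            groups.append([key])
    return groups


# Invented header strings with spelling variants.
headers = ["Mannen", "mannen", "Vrouwen", "Vrouwen.", "Totaal"]
groups = group_variables(headers)
print(len(groups), "distinct variables:", [g[0] for g in groups])
```

The greedy grouping depends on input order and on the `max_distance` threshold, so it is a rough count rather than a definitive one.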
17. Iteration 2: gathering
• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
– 10.95% numerical corrections
– 89.05% textual descriptions / anomalies
But TabExtractor ain’t a sexy thing…
• Bring metadata together
• Publish on the Web? Archive?
18. Iteration 2: gathering
Subset of the dataset
• Miniproject 1
– 1889
– Occupational census
– Province Noord-Brabant
– 1 table
• Miniproject 2
– 1859, 1869, 1879, 1889
– Population census
– Province Noord-Brabant
– 4 tables
19. Iteration 2: conversion
• Iteration 1 converted only the Excel cell values to RDF
• Some cells have annotations attached
– Value corrections: 5 → 8
– Explanations, descriptions: "Number includes 2 people of
unknown age"
– Inconsistencies: Sum does not add up
• Iteration 2 produces proper named graphs for
annotations
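Keeping annotations in their own named graph can be sketched with quads (subject, predicate, object, graph). This is a minimal illustration, assuming one graph for cell values and one for annotations; the actual graph layout of iteration 2 is not specified in the slides. The `EX` namespace, the property names and the cell values other than the 5/8 correction are hypothetical.

```python
from collections import defaultdict

EX = "http://example.org/census/"  # hypothetical namespace
DATA_GRAPH = EX + "graph/data"
ANNOTATION_GRAPH = EX + "graph/annotations"


def cell_to_quads(cell_id, value, annotation=None):
    """Put the cell value in the data graph and any annotation in a separate graph."""
    cell = EX + "cell/" + cell_id
    quads = [(cell, EX + "value", value, DATA_GRAPH)]
    if annotation is not None:
        note = cell + "/annotation"
        quads.append((note, EX + "annotates", cell, ANNOTATION_GRAPH))
        quads.append((note, EX + "text", annotation, ANNOTATION_GRAPH))
    return quads


# Cell values A2 and A3 are invented; A1 mirrors the 5 -> 8 correction.
cells = [
    ("A1", 5, "8"),
    ("A2", 40, "Number includes 2 people of unknown age"),
    ("A3", 17, None),
]

store = defaultdict(set)  # graph URI -> set of triples
for cell_id, value, annotation in cells:
    for s, p, o, g in cell_to_quads(cell_id, value, annotation):
        store[g].add((s, p, o))

print(len(store[DATA_GRAPH]), "data triples,",
      len(store[ANNOTATION_GRAPH]), "annotation triples")
```

Because the raw values and the annotations live in different graphs, a query can use the source numbers as published or opt in to the corrections.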
23. Iteration 2: data quality
• Annotations can improve data quality
• Model has to be extended with actions
– If sum doesn’t add up → retrieve numbers from other
tables/sources
– Appropriate vocabularies
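An annotation-driven action of this kind can be sketched as a consistency check followed by a lookup in other sources. The function names, the structure of `fallback_sources` and all numbers below are hypothetical; the sketch only shows the control flow, not the project's model.

```python
def check_total(parts, reported_total):
    """Compare a reported total with the sum of its component cells."""
    computed = sum(parts)
    return {
        "reported": reported_total,
        "computed": computed,
        "consistent": computed == reported_total,
    }


def resolve(parts, reported_total, fallback_sources):
    """If the sum does not add up, look for another source that settles the total."""
    result = check_total(parts, reported_total)
    if result["consistent"]:
        result["action"] = "none"
        return result
    for name, total in fallback_sources.items():
        if total == result["computed"]:
            result["action"] = f"computed sum confirmed by {name}"
            return result
    result["action"] = "flag for manual review"
    return result


# Invented numbers: the table reports 100, but the parts add up to 101.
outcome = resolve([60, 41], 100, {"provincial summary table": 101})
print(outcome["action"])
```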
24. Iteration 2: data quality
• Measure of data quality? Benford’s Law
– Data distributions in censuses meet Benford’s Law
– Demo available!
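Benford's Law predicts that the leading digit d of naturally occurring counts appears with probability log10(1 + 1/d). A minimal sketch of such a check follows; it is not the demo mentioned above. The deviation measure (mean absolute deviation) is one possible choice among several, and the sample uses powers of 2, a sequence known to follow the law closely, instead of census figures.

```python
import math
from collections import Counter


def benford_expected():
    """Expected frequency of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_distribution(values):
    """Observed frequency of each leading digit among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(observed, expected):
    """Average absolute gap between observed and expected digit frequencies."""
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


expected = benford_expected()
sample = [2 ** k for k in range(1, 200)]  # stand-in for census counts
observed = first_digit_distribution(sample)
deviation = mean_absolute_deviation(observed, expected)
print(f"mean absolute deviation from Benford: {deviation:.4f}")
```

A table whose digit distribution deviates strongly from the expected one is a candidate for closer inspection, not proof of an error.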
28. Iteration 2: querying
• Things to be mapped
– Occupations (HISCO)
– Municipalities (Amsterdamse Code)
– Housing types
– Religions
– Etc.
• Converted the HISCO and AC mappings to RDF
(https://github.com/CEDAR-project/Harmonize)
– Linked to the tables RDF
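Linking table cells to a harmonization scheme can be sketched as a label lookup that emits one link per matched cell and records the cells left unmapped. This is an illustration only and not the Harmonize code: the `EX` namespace, the `hiscoCode` property and the cell identifiers are hypothetical, and the lookup table holds a single entry, the HISCO class 21110 (General Manager) named in these slides.

```python
EX = "http://example.org/"  # hypothetical namespace

# Only code 21110 (General Manager) is taken from the slides.
HISCO_BY_LABEL = {"general manager": "21110"}


def normalise(label):
    """Lower-case a label and collapse repeated whitespace."""
    return " ".join(label.lower().split())


def link_occupations(cells):
    """Return (links, unmapped): triples for matched labels, cell ids for the rest."""
    links, unmapped = [], []
    for cell_id, label in cells:
        code = HISCO_BY_LABEL.get(normalise(label))
        if code is None:
            unmapped.append(cell_id)
        else:
            links.append((EX + "cell/" + cell_id,
                          EX + "hiscoCode",
                          EX + "hisco/" + code))
    return links, unmapped


links, unmapped = link_occupations([("B4", "General  Manager"), ("B5", "Shoemaker")])
print(links)
print("unmapped cells:", unmapped)
```

Keeping the list of unmapped cells makes the coverage of the mapping visible.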
32. Iteration 2: linking
• Issue: HISCO is too generic (top-down approach)
– Class 21110 too abstract: General Manager
– Visualization of SPARQL HISCO mappings
• Issue: AC works at the municipality level
– Other geographical harmonizations?
• Need for year-level ontologies
– Classification systems are different
• R script for a bottom-up approach: classification
extractor (https://github.com/albertmeronyo/OccupationOntology)
– Automated removal of non-related cols and rows
– Introduction of redundancy (‘Id.’ values)
– Removal of totals
– Work in progress: ontology merging
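Two of the clean-up steps above can be sketched as follows, in Python instead of the R used by the actual script. The sketch assumes that 'Id.' (idem) means "same label as the row above" and that total rows can be recognised by their label; both are assumptions, and the rows and numbers are invented.

```python
def clean_rows(rows, total_markers=("totaal", "total")):
    """Expand 'Id.' labels to the previous label and drop total rows."""
    cleaned = []
    previous = None
    for label, *counts in rows:
        if label.strip().lower().rstrip(".") == "id":
            if previous is None:
                raise ValueError("'Id.' row has no preceding label")
            label = previous  # introduce redundancy: repeat the label
        if label.strip().lower().startswith(total_markers):
            continue  # totals can be recomputed, so they are removed
        previous = label
        cleaned.append((label, *counts))
    return cleaned


# Invented rows in the style of an occupational table.
rows = [
    ("Schoenmakers", 12),
    ("Id.", 3),
    ("Totaal", 15),
    ("Bakkers", 7),
]
cleaned = clean_rows(rows)
print(cleaned)
```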
36. Concept drift
[Figure: a model at times t1, t2, ..., tn, with question marks between successive versions]
• Models drift over time
• Classes merge, split, change their properties
(beroepenklassen)
• Although, some core meaning remains
(shoemakers)
• Can we automatically identify and align drifted
models?
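One simple way to approach the alignment question is to compare the members of each class at two points in time and keep the best overlap. The sketch below uses Jaccard similarity for this; it is a baseline under stated assumptions, not a method proposed in the slides, and the class names, members and threshold are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def align(classes_t1, classes_t2, threshold=0.5):
    """For each class at t1, return (best matching class at t2 or None, score)."""
    alignment = {}
    for name1, members1 in classes_t1.items():
        best = max(classes_t2,
                   key=lambda name2: jaccard(members1, classes_t2[name2]),
                   default=None)
        score = jaccard(members1, classes_t2[best]) if best is not None else 0.0
        alignment[name1] = (best if score >= threshold else None, score)
    return alignment


# Hypothetical example of a class that splits between t1 and t2.
classes_t1 = {"A": {"shoemakers", "saddlers", "tanners"}}
classes_t2 = {"B": {"shoemakers", "saddlers"}, "C": {"tanners"}}
alignment = align(classes_t1, classes_t2)
print(alignment)
```

Here class A is aligned with B because two of its three members survive there. A best-match approach like this reports only one target per class, so a split is detected only indirectly through the lowered score.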
37. Conclusion: milestones
• Complete inventory of the dataset (w/ metadata
generation)
• Translation to RDF
– Raw data
– Annotations
– Harmonization/linking
• Successful data quality experiments (Benford’s Law)
• Useful software
– TabLinker (Excel/CSV to RDF)
– TabExtractor (Excel/CSV metadata collector)
– Harmonize (HISCO/AC to Census linker)
– OccupationOntology (bottom-up occupation ontology extractor)
38. Conclusion: future work
• Better software
– TabLinker: automate mark-up process
– TabExtractor: improve and publish inventory output
– Harmonize: improve HISCO/AC datamodels
– OccupationOntology: extend to housing types, religions, etc.
• Concept drift literature on drifting models (Kuukkanen
2008, Gonçalves et al. 2009, Shenghui et al. 2010)
• Semantic Web literature on modeling geographical
change (Kauppinen 2010)
– Integrate with AC dataset?
• Link meaningful datasets with the census
– Labour strikes
– Book publications
– More?
39. Thank you
http://www.cedar-project.nl
albert.merono@dans.knaw.nl
Data Archiving and Networked Services (DANS)
Anna van Saksenlaan 10 | 2593 HT Den Haag
Postbus 93067 | 2509 AB Den Haag
070 3446 484 | info@dans.knaw.nl | www.dans.knaw.nl
KVK 54667089 | DANS is an institute of KNAW and NWO