CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Albert Meroño-Peñuela
@albertmeronyo
PRELIDA consolidation workshop @ ISWC, 17-10-2014

CEDAR: Harmonizing Historical Census Data in the Semantic Web

CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971)
[Public Historical Statistical Data]

CEDAR goal: cross queries
[Figure: timeline of census years 1795, 1830, 1889, 1930, 1971; queries across years, through ~3K tables]

Towards 5-star Census Data
[Figure: status more than 1 year ago vs. 1 year ago]

• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other datasets

Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity level (cell level)
• Statistical comparability by leveraging semantic descriptions
• Provenance
• Harmonization through linkage to other datasets (the 5th star)

RDF Data Cube
“There are many situations where it would be useful to
be able to publish multi-dimensional data, such as
statistics, on the web in such a way that they can be
linked to related data sets and concepts.”

RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)
• Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
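The dimension/measure/attribute split can be sketched in a few lines of Python. This is only an illustration of the data model, not the QB vocabulary itself: the class name `Observation`, the helper `lookup`, and the property names (`refPeriod`, `refArea`, `lifeExpectancy`, and so on) are assumptions chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One cell of a statistical cube, in the spirit of qb:Observation."""
    dimensions: dict   # dimension property -> value (identifies the cell)
    measure: tuple     # (measure property, observed value)
    attributes: dict = field(default_factory=dict)  # qualifiers such as units

    def key(self):
        """Dimensions identify an observation; sort them for a stable key."""
        return tuple(sorted(self.dimensions.items()))


# The life-expectancy example from the slide, with illustrative property names.
obs = Observation(
    dimensions={"refPeriod": "2004-2006", "refArea": "Newport", "sex": "male"},
    measure=("lifeExpectancy", 76.7),
    attributes={"unitMeasure": "years", "obsStatus": "measured"},
)

# A cube is then a set of observations, indexed here by their dimension values.
cube = {obs.key(): obs}


def lookup(cube, **dims):
    """Return the observation with exactly these dimension values, or None."""
    return cube.get(tuple(sorted(dims.items())))


found = lookup(cube, sex="male", refArea="Newport", refPeriod="2004-2006")
print(found.measure[1], found.attributes["unitMeasure"])
```

Keying the cube on dimension values only reflects the idea that measures and attributes describe a cell, while dimensions locate it.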

CEDAR Integrator
https://github.com/CEDAR-project/Integrator

Raw data
cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
    rdfs:label "K17" ;
    tablink:value "12.0" ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
    tablink:sheet cedar:BRT_1889_08_T1-S0 .

Harmonization Rules as Open Annotations
cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
    oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
    oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
    oa:serializedAt "2014-09-24"^^xsd:date ;
    oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .
cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
    sdmx-dimension:sex sdmx-code:sex-F .
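A minimal sketch of how such rules could be applied, assuming each raw data cell lists the header cells that describe it and each rule maps one header cell to a (dimension, code) pair. This is not the Integrator's actual code: the function `harmonize`, the dictionary layout, and the wiring between cell K17 and headers K4, A8 and B8 are hypothetical, used here only to show the mechanism.

```python
# Raw cells point at the header cells that describe them (cf. tablink:dimension).
# The header list below is hypothetical wiring for illustration.
raw_cells = {
    "K17": {"value": "12.0", "headers": ["K4", "A8", "B8"]},
}

# Harmonization rules: header cell -> (dimension, code), like an annotation body.
# Only the K4 rule mirrors the slide; the A8 rule is a hypothetical example.
rules = {
    "K4": ("sdmx-dimension:sex", "sdmx-code:sex-F"),
    "A8": ("sdmx-dimension:refArea", "gg:11150"),
}


def harmonize(cell_id, raw_cells, rules):
    """Build a harmonized observation and report headers without a rule."""
    cell = raw_cells[cell_id]
    observation = {
        "cedar:population": float(cell["value"]),
        "prov:wasDerivedFrom": cell_id,  # keep a pointer back to the raw cell
    }
    unmapped = []
    for header in cell["headers"]:
        if header in rules:
            dimension, code = rules[header]
            observation[dimension] = code
        else:
            unmapped.append(header)  # a dimension still missing harmonization
    return observation, unmapped


observation, unmapped = harmonize("K17", raw_cells, rules)
print(observation)
print("headers without a rule:", unmapped)
```

Returning the unmapped headers separately makes it easy to see which dimensions still lack a harmonization rule.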

Harmonized RDF Data Cube
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:decimal ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
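Once observations share dimensions and codes, a cross-year query reduces to filtering and grouping. The sketch below shows the idea on plain Python dictionaries. The `year` key, the function `total_by`, and all population figures other than the 12 from the slide are made up for illustration and are not census data.

```python
from collections import defaultdict

# Toy harmonized observations; values are illustrative, not real census figures.
observations = [
    {"year": 1889, "sex": "sex-F", "occupation": "hisco:88030", "population": 12},
    {"year": 1889, "sex": "sex-M", "occupation": "hisco:88030", "population": 30},
    {"year": 1899, "sex": "sex-F", "occupation": "hisco:88030", "population": 15},
]


def total_by(observations, group_key, **filters):
    """Sum population per value of group_key, keeping rows that match filters."""
    totals = defaultdict(int)
    for obs in observations:
        if all(obs.get(key) == value for key, value in filters.items()):
            totals[obs[group_key]] += obs["population"]
    return dict(totals)


# Women in one occupation, compared across census years.
per_year = total_by(observations, "year", sex="sex-F", occupation="hisco:88030")
print(per_year)
```

The same question against the published data would be a SPARQL query over qb:Observation resources; the dictionary version only shows why shared codes make the comparison possible.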

Classification Systems and Concept Schemes
• Some missing harmonized dimensions!
• Encode all variables and their values using concept schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manual and expert knowledge based
– Can we do it automatically? Or assist the process?

[Figure: linked datasets: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]

Existing dimensions
• HISCO
http://historyofwork.iisg.nl/

Existing dimensions
• Gemeentegeschiedenis.nl

Existing LSD (Linked Statistical Data) dimensions
• P1: Discoverability? How to discover dimensions created by others?
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?
• P3: Relevance? What’s the size of LSD?

LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps

Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others? Answer: LSD Dimensions
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? Answers: logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD? Answer: ~7.9% of the LOD cloud
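How often a dimension is reused can be measured by counting the distinct datasets that use it. The sketch below assumes the input is a list of (dataset, dimension) pairs, such as one might extract from a dump. The dataset names, `ex:localDimension`, and the function `reuse_counts` are hypothetical.

```python
from collections import Counter

# (dataset, dimension) pairs; duplicates may occur in a real dump.
usage = [
    ("dataset-a", "sdmx-dimension:refArea"),
    ("dataset-a", "sdmx-dimension:sex"),
    ("dataset-b", "sdmx-dimension:refArea"),
    ("dataset-c", "sdmx-dimension:refArea"),
    ("dataset-c", "ex:localDimension"),
    ("dataset-c", "ex:localDimension"),  # repeated pair, counted once
]


def reuse_counts(usage):
    """Count, per dimension, the number of distinct datasets that use it."""
    return Counter(dimension for _, dimension in set(usage))


counts = reuse_counts(usage)
reused = sorted(dim for dim, n in counts.items() if n > 1)
print(counts.most_common(1))
print("used by more than one dataset:", reused)
```

Plotting these counts in rank order is one way to inspect how reuse is distributed across dimensions.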

Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e., religions in the Netherlands in the 18th-20th centuries)
– Historical occupations (same place and period)
– Historical building types (same place and period)

https://github.com/CEDAR-project/TabCluster

TabCluster
Leverages
● Lexical properties
○ Hierarchical clustering in Python scipy
○ String distances
● Semantic properties (LOD tagging)
○ skos:Concept of most frequent cluster-term
○ Closest common skos:broader skos:Concept of all
cluster-terms
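The lexical step can be sketched as follows. TabCluster uses hierarchical clustering from scipy; to stay dependency-free, this sketch swaps in a small single-linkage implementation and uses `difflib.SequenceMatcher` from the standard library as the string distance. The threshold of 0.3 and the sample labels are assumptions and are not taken from the census tables. The semantic (LOD tagging) step is not shown.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(terms, threshold):
    """Single-linkage agglomerative clustering, cut at a distance threshold."""
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i].extend(clusters.pop(j))
    return clusters


# Illustrative spelling variants of two denominations.
terms = [
    "Nederduits Hervormd",
    "Rooms Katholiek",
    "Nederduitsch Hervormden",
    "Roomsch Katholieken",
]
result = cluster(terms, threshold=0.3)
for group in result:
    print(group)
```

Single linkage is chosen here because spelling variants tend to form chains of small edits; other linkage criteria would be equally easy to plug into the inner `min`.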

Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats), ISWC 2014.

Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
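One simple way to make drift between two such classifications visible is to compare, group by group, which occupations (leaves) each year lists. The sketch below uses Jaccard overlap for this. It is an illustrative measure, not the method used in CEDAR, and the group names, occupations, and the function `drift_report` are invented for the example.

```python
def jaccard(a, b):
    """Overlap of two sets in [0, 1]; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Depth 1: occupation groups; leaves: occupations. Contents are invented.
classification_1859 = {
    "agriculture": {"farmer", "shepherd"},
    "trade": {"merchant"},
}
classification_1889 = {
    "agriculture": {"farmer"},
    "trade": {"merchant", "shopkeeper"},
    "industry": {"weaver"},
}


def drift_report(old, new):
    """Per-group leaf overlap; a group absent in one year scores 0.0."""
    groups = sorted(set(old) | set(new))
    return {g: jaccard(old.get(g, ()), new.get(g, ())) for g in groups}


report = drift_report(classification_1859, classification_1889)
for group, overlap in report.items():
    print(f"{group}: {overlap:.2f}")
```

Low overlap for a group flags a place where the meaning of the category may have shifted between censuses and needs expert attention.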

Concept Drift
Census classification of occupations as of 1889

Concept Drift
Census classification of occupations as of 1899

Concept Drift
[Figure: upper ontologies (HISCO, AC) above year-dependent ontologies for 1859, 1869, 1879]

Concept Drift
[Figure: upper ontologies (HISCO, AC) and year-dependent ontologies, with two question marks]

Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical analysis
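Preparing a Turtle snapshot for archiving can be sketched as compressing it and recording a version label and checksum. This is an assumption about what a deposit step might look like, not a description of the DANS-EASY workflow: the function `archive_snapshot`, the record layout, the version label, and the generated sample triples are all hypothetical.

```python
import gzip
import hashlib


def archive_snapshot(turtle_text, version):
    """Compress a Turtle snapshot and record its size and checksum."""
    raw = turtle_text.encode("utf-8")
    packed = gzip.compress(raw)
    return {
        "version": version,
        "sha256": hashlib.sha256(raw).hexdigest(),  # checksum of the original
        "raw_bytes": len(raw),
        "compressed_bytes": len(packed),
        "payload": packed,
    }


# Repetitive sample data; RDF serializations tend to compress well.
sample = "\n".join(
    f'cedar:cell-{i} a tablink:DataCell ; tablink:value "12.0" .'
    for i in range(1000)
)
record = archive_snapshot(sample, version="hypothetical-v1")
print(record["raw_bytes"], "->", record["compressed_bytes"], "bytes")

# Restoring the snapshot, e.g. before loading it into an on-demand endpoint.
restored = gzip.decompress(record["payload"]).decode("utf-8")
```

Storing the checksum of the uncompressed text lets a later user confirm that a restored snapshot matches the archived release.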

Thank you
Questions, suggestions, comments most welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/

Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW 2014)
