CEDAR & PRELIDA
Preservation of Linked Socio-Historical Data
Albert Meroño-Peñuela
@albertmeronyo
PRELIDA consolidation workshop @ ISWC, 17-10-2014
CEDAR: Harmonizing Historical Census
Data in the Semantic Web
CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971)
[Public Historical Statistical Data]
From scans to spreadsheets
CEDAR goal: cross queries
?
1795 1830 1889 1930 1971
(through ~3K tables)
Towards 5-star Census Data
>1 year ago
1 year ago
• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other
datasets
Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity level (cell level)
• Statistical comparability by leveraging
semantic descriptions
• Provenance
• Harmonization through linkage to other
datasets (the 5th star)
RDF Data Cube
“There are many situations where it would be useful to
be able to publish multi-dimensional data, such as
statistics, on the web in such a way that they can be
linked to related data sets and concepts.”
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of
dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status =
measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in
Newport in the period 2004-2006 is 76.7 years”
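As a minimal sketch of the structure described above, the Newport observation can be modelled in Python as a record made of dimensions, one measure, and attributes. The class name, field names, and the `slice_cube` helper are illustrative assumptions and are not part of the QB vocabulary. The second observation carries no value and is only there to make the slice non-trivial.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """One cell of a statistical cube, loosely following qb:Observation."""
    dimensions: dict                 # e.g. time period, region, sex
    measure_name: str                # e.g. life expectancy
    measure_value: Optional[float]   # None when no value is recorded
    attributes: dict = field(default_factory=dict)  # e.g. unit, status


def slice_cube(observations, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return [
        obs for obs in observations
        if all(obs.dimensions.get(dim) == val for dim, val in fixed.items())
    ]


# The observation quoted on the slide.
newport_males = Observation(
    dimensions={"period": "2004-2006", "region": "Newport", "sex": "male"},
    measure_name="life expectancy",
    measure_value=76.7,
    attributes={"unit": "years", "status": "measured"},
)

# Hypothetical placeholder observation (no real data).
placeholder = Observation(
    dimensions={"period": "2004-2006", "region": "Newport", "sex": "female"},
    measure_name="life expectancy",
    measure_value=None,
)

cube = [newport_males, placeholder]
males = slice_cube(cube, sex="male")
```

Fixing one or more dimension values, as `slice_cube` does, is the basic operation that makes a cube queryable across regions, periods, or groups.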
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
Raw data
cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
    rdfs:label "K17" ;
    tablink:value "12.0" ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
    tablink:sheet cedar:BRT_1889_08_T1-S0 .
Harmonization Rules as Open
Annotations
cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
    oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
    oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
    oa:serializedAt "2014-09-24"^^xsd:date ;
    oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .

cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
    sdmx-dimension:sex sdmx-code:sex-F .
Harmonized RDF Data Cube
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:decimal ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
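The step from a raw data cell to a harmonized observation can be sketched in Python as follows. This is an illustration under assumptions, not the Integrator's actual code: the `harmonize` function and the dictionary layout are hypothetical, and only the one mapping rule shown in the slides (header cell K4 to sex-F) is included, so the other harmonized dimensions are left out.

```python
from decimal import Decimal

# Raw cell as shown in the "Raw data" slide (dimension list abbreviated
# to the header cells, all taken from the slide).
RAW_CELL = {
    "id": "cedar:BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": [
        "cedar:BRT_1889_08_T1-S0-A8",
        "cedar:BRT_1889_08_T1-S0-K6",
        "cedar:BRT_1889_08_T1-S0-J3",
        "cedar:BRT_1889_08_T1-S0-K4",
        "cedar:BRT_1889_08_T1-S0-K5",
    ],
}

# Harmonization rules: annotation target -> (dimension property, code).
# Only the K4 rule comes from the slides.
RULES = {
    "cedar:BRT_1889_08_T1-S0-K4": ("sdmx-dimension:sex", "sdmx-code:sex-F"),
}


def harmonize(cell, rules):
    """Build a harmonized observation from a raw cell and mapping rules."""
    observation = {
        "id": cell["id"] + "-h",
        "cedar:population": Decimal(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    # Every header cell that is the target of a rule contributes one
    # harmonized dimension; unmapped header cells are skipped.
    for header in cell["dimensions"]:
        if header in rules:
            prop, code = rules[header]
            observation[prop] = code
    return observation


observation = harmonize(RAW_CELL, RULES)
```

Keeping the rules separate from the raw cells, as the annotations do, means a rule can be corrected and the cube regenerated without touching the source tables.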
Classification Systems and
Concept Schemes
• Some missing harmonized dimensions!
• Encode all variables and their values using concept
schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manual and expert knowledge based
– Can we do it automatically? Or assist the process?
[Diagram of linked datasets and concept schemes: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]
Existing dimensions
• HISCO
http://historyofwork.iisg.nl/
Existing dimensions
• Gemeentegeschiedenis.nl
Existing LSD dimensions
• P1: Discoverability? How to discover
dimensions created by others?
• P2: Reusability? How often are dimensions
reused? Can we reuse dimensions created by
others?
• P3: Relevance? What’s the size of LSD?
LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps
http://lsd-dimensions.org/
Existing LSD dimensions
• P1: Discoverability? How to discover
dimensions created by others? => LSD Dimensions
• P2: Reusability? How often are dimensions
reused? Can we reuse dimensions created by
others? => Logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD? => ~7.9%
of the LOD cloud
Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e. religions in
the Netherlands in the 18th-20th centuries)
– Historical occupations (same place and period)
– Historical building types (same place and period)
https://github.com/CEDAR-project/TabCluster
TabCluster
Leverages
● Lexical properties
○ Hierarchical clustering in Python scipy
○ String distances
● Semantic properties (LOD tagging)
○ skos:Concept of most frequent cluster-term
○ Closest common skos:broader skos:Concept of all
cluster-terms
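The lexical part of this approach can be sketched as below. The sketch is a pure-Python stand-in, not TabCluster's code: it swaps scipy's hierarchical clustering for a small single-linkage loop and uses `difflib.SequenceMatcher` as the string distance, both of which are assumptions. The example terms and the 0.3 threshold are illustrative only, and the semantic (LOD tagging) step is not covered.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(terms, threshold):
    """Single-linkage agglomerative clustering, cut at `threshold`."""
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical occupation labels with spelling variants.
terms = ["bakker", "bakkers", "timmerman", "timmermans"]
groups = cluster(terms, threshold=0.3)
```

Each resulting group is a candidate concept; picking its label (for example the most frequent term) would be the next step.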
Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity
and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics
(SemStats) ISWC 2014.
Concept Drift
Census classification of
occupations as of
1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of
occupations as of
1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of
occupations as of
1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
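One way to make drift between two census years concrete is to compare their classification trees directly. The sketch below assumes each tree is stored as a mapping from depth-1 occupation group to its leaf occupations (the void root is implicit). The `drift` function and all group and occupation labels are hypothetical placeholders, not census data.

```python
def leaves(tree):
    """Map each leaf occupation to its depth-1 group."""
    return {occ: group for group, occs in tree.items() for occ in occs}


def drift(old_tree, new_tree):
    """List occupations added, removed, or moved to another group."""
    old, new = leaves(old_tree), leaves(new_tree)
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "moved": sorted(occ for occ in shared if old[occ] != new[occ]),
    }


# Placeholder classifications for two census years.
earlier = {
    "group A": {"occupation 1", "occupation 2"},
    "group B": {"occupation 3"},
}
later = {
    "group A": {"occupation 1"},
    "group B": {"occupation 2", "occupation 4"},
}

changes = drift(earlier, later)
```

Here "occupation 2" keeps its label but changes group, which is the kind of change a year-independent upper ontology has to absorb.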
Concept Drift
Upper ontologies
(HISCO, AC)
Year-dependent ontologies
1859 1869 1879
Concept Drift
Upper ontologies
(HISCO, AC)
Year-dependent ontologies
? ?
Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per
snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical
analysis
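Preparing a Turtle snapshot for deposit could look roughly like the sketch below, which compresses the text and records its sizes. The `archive_snapshot` function, the SHA-256 checksum, and the miniature snapshot with `cedar:cell-…` identifiers are assumptions for illustration. They do not describe the DANS-EASY workflow, and the compression ratio of real snapshots depends on the data.

```python
import gzip
import hashlib


def archive_snapshot(turtle_text):
    """Compress a Turtle snapshot and record size and checksum metadata."""
    raw = turtle_text.encode("utf-8")
    packed = gzip.compress(raw)
    return {
        "raw_bytes": len(raw),
        "packed_bytes": len(packed),
        "ratio": len(raw) / len(packed),
        # Checksum of the uncompressed content, so a restored copy
        # can be verified later (an assumption, not from the slides).
        "sha256": hashlib.sha256(raw).hexdigest(),
        "payload": packed,
    }


# Hypothetical miniature snapshot made of repetitive triples.
snapshot = "".join(
    f'cedar:cell-{i} tablink:value "{i}.0" .\n' for i in range(1000)
)

record = archive_snapshot(snapshot)
restored = gzip.decompress(record["payload"]).decode("utf-8")
```

Turtle output of this kind is highly repetitive, which is consistent with the large gap between the uncompressed and compressed snapshot sizes quoted above.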
Thank you
Questions, suggestions, comments most
welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/
Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS,
and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical
Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and
dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW
2014)

 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

  • 1. CEDAR & PRELIDA: Preservation of Linked Socio-Historical Data. Albert Meroño-Peñuela, @albertmeronyo. PRELIDA consolidation workshop @ ISWC, 17-10-2014
  • 2. CEDAR: Harmonizing Historical Census Data in the Semantic Web
  • 3. CEDAR: Source Historical Data Dutch Historical Censuses (1795-1971) [Public Historical Statistical Data]
  • 4. From scans to spreadsheets
  • 5. CEDAR goal: cross queries across census years (1795, 1830, 1889, 1930, 1971; through ~3K tables)
  • 7. Towards 5-star Census Data >1 year ago 1 year ago
  • 9. • Web publishable • Machine processable • Dynamic schema • Easily link with other datasets
  • 10. Why with semantic technology? • Web publishable, human & machine readable • Finer granularity level (cell level) • Statistical comparability by leveraging semantic descriptions • Provenance • Harmonization through linkage to other datasets (the 5th star)
  • 11. RDF Data Cube “There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”
  • 14. RDF Data Cube vocabulary (QB) • SDMX compatible • Defines cubes as a set of observations that consist of dimensions, measures and attributes • Dimensions: time period, region, sex (qb:DimensionProperty) • Measure: population life expectancy (qb:MeasureProperty) • Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty) Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
  • 16. Raw data
        cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
            rdfs:label "K17" ;
            tablink:value "12.0" ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
            tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
            tablink:sheet cedar:BRT_1889_08_T1-S0 .
  • 17. Harmonization Rules as Open Annotations
        cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
            oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
            oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
            oa:serializedAt "2014-09-24"^^xsd:date ;
            oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
            prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .

        cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
            sdmx-dimension:sex sdmx-code:sex-F .
  • 18. Harmonized RDF Data Cube
        cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
            cedar:population "12"^^xml:decimal ;
            maritalstatus:maritalStatus maritalstatus:single ;
            cedarterms:occupationPosition cedarterms:job-D ;
            sdmx-dimension:sex sdmx-code:sex-F ;
            cedarterms:occupation hisco:88030 ;
            sdmx-dimension:refArea gg:11150 ;
            prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
            prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
  • 19. Classification Systems and Concept Schemes • Some missing harmonized dimensions! • Encode all variables and their values using concept schemes • Some already exist – Which ones? How many of them? – Where? – By whom? – Are they used at all? Can I reuse them? • Some need to be created – Manual and expert knowledge based – Can we do it automatically? Or assist the process?
  • 20. Dutch Historical Censuses (CEDAR); Dutch Ships and Sailors; Gemeentegeschiedenis.nl; HISCO; ICONCLASS; Dutch Historical Religions; Dutch Historical House Types
  • 23. Existing LSD dimensions • P1: Discoverability? How to discover dimensions created by others? • P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? • P3: Relevance? What’s the size of LSD?
  • 29. Existing LSD dimensions • P1: Discoverability? How to discover dimensions created by others? → LSD Dimensions • P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? → Logarithmic law / probably yes • P3: Relevance? What’s the size of LSD? → ~7.9% of the LOD cloud
  • 30. Creating new LSD Dimensions • CEDAR needs concept schemes for – Historical religious denominations (i.e. religions in the NL in 18th-20th c.) – Historical occupations (id.) – Historical building types (id.)
  • 32. TabCluster Leverages ● Lexical properties ○ Hierarchical clustering in Python scipy ○ String distances ● Semantic properties (LOD tagging) ○ skos:Concept of most frequent cluster-term ○ Closest common skos:broader skos:Concept of all cluster-terms
  • 33. Compatibility? Remixability? Reusability? Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.
  • 34. Concept Drift: census classification of occupations as of 1859 • Root node is void • Depth 1: occupation groups • Leaves: actual occupations
  • 35. Concept Drift: census classification of occupations as of 1889 • Root node is void • Depth 1: occupation groups • Leaves: actual occupations
  • 36. Concept Drift: census classification of occupations as of 1899 • Root node is void • Depth 1: occupation groups • Leaves: actual occupations
  • 37. Concept Drift: upper ontologies (HISCO, AC) • Year-dependent ontologies (1859, 1869, 1879)
  • 38. Concept Drift: upper ontologies (HISCO, AC) • Year-dependent ontologies
  • 39. Concept Drift: upper ontologies (HISCO, AC) • Year-dependent ontologies (?)
  • 41. Preserving CEDAR • DANS-EASY as backend (http://easy.dans.knaw.nl/) • Archived objects: Turtle snapshots – 20 GB uncompressed, 200 MB compressed (per snapshot) – Versioning (stats on current release) • Users still need to – Query the data with SPARQL => bring up the endpoint on demand – Run analytics on the data => outsource statistical analysis
  • 42. Thank you Questions, suggestions, comments most welcome @albertmeronyo http://www.cedar-project.nl http://krr.cs.vu.nl/ http://easy.dans.knaw.nl/ http://lsd-dimensions.org/
  • 43. Me in 6 tweets http://www.albertmeronyo.org • Background: Computer Science, Web hacker, AI & Law • PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW) • Topic: Semantic Web for the Humanities • CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web • Problem: statistical data publishing, concept drift and dynamics of meaning • Last paper: What is Linked Historical Data? (EKAW 2014)
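
As a rough illustration of how slides 16 to 18 fit together, the sketch below applies harmonization rules to one raw data cell and produces a harmonized observation. It is a minimal sketch with plain Python dictionaries, not the CEDAR Integrator itself. The dictionary layout, the `harmonize` function and the `-h` suffix convention are assumptions made for illustration. Only the K4 rule (sex = female) comes from slide 17, and only two of the cell's header cells are listed for brevity.

```python
from decimal import Decimal

# Hypothetical in-memory stand-in for the tablink:DataCell on slide 16
# (only two of its tablink:dimension header cells are listed here).
raw_cell = {
    "id": "cedar:BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": [
        "cedar:BRT_1889_08_T1-S0-A8",
        "cedar:BRT_1889_08_T1-S0-K4",
    ],
}

# Each rule maps a header cell (the oa:hasTarget of an annotation) to the
# property/value pairs of the annotation body (oa:hasBody).
# Only the K4 rule is taken from slide 17.
RULES = {
    "cedar:BRT_1889_08_T1-S0-K4": {"sdmx-dimension:sex": "sdmx-code:sex-F"},
}


def harmonize(cell, rules):
    """Return (observation, unmapped_headers) for one raw data cell."""
    observation = {
        "@id": cell["id"] + "-h",  # assumed naming convention
        "rdf:type": "qb:Observation",
        "cedar:population": Decimal(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for header in cell["dimensions"]:
        body = rules.get(header)
        if body is None:
            # No rule yet: a "missing harmonized dimension" (slide 19).
            unmapped.append(header)
        else:
            observation.update(body)
    return observation, unmapped


observation, unmapped = harmonize(raw_cell, RULES)
print(observation)
print("Header cells without a rule:", unmapped)
```

Keeping the rules separate from the raw cells mirrors the slides' design, where mappings are stand-alone annotations that can be revised without touching the archived source data.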

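The lexical half of TabCluster (slide 32) groups table labels by string distance using hierarchical clustering. The sketch below shows that idea under stated assumptions and is not the TabCluster code. It swaps scipy's hierarchical clustering for a small pure-Python single-linkage loop so that it runs without dependencies, and it uses normalized Levenshtein distance as one possible string distance. The labels and the 0.3 threshold are invented for illustration. The semantic step (tagging clusters with a skos:Concept) is not covered.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cluster(terms, threshold):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the
    closest pair is further apart than `threshold`.
    """
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            d = min(normalized_distance(x, y) for x in ci for y in cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Hypothetical header labels; spelling variants of two denominations.
labels = [
    "Roomsch-Katholiek",
    "Roomsch Katholieken",
    "Nederduitsch Hervormd",
    "Nederduitsch Hervormden",
]
for group in cluster(labels, threshold=0.3):
    print(group)
```

With scipy available, the same grouping could presumably be obtained by feeding the pairwise distances to `scipy.cluster.hierarchy.linkage` and cutting the tree with `fcluster`. The loop above only makes the merge logic explicit.
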
Editor's Notes

  1. Good afternoon everybody. I’m Albert Meroño. It’s a great pleasure to be here today; thanks to the organisers for the invitation. Today I’m going to talk a bit about the preservation of linked socio-historical data, and about the work we’ve been doing at the CEDAR project to publish socio-historical data on the Semantic Web. We also study the pros and cons of using semantic technologies to enhance the research methodologies of historians and social scientists. The interesting thing about preservation and CEDAR is the double angle: we re-publish PRESERVED data (from the 18th c.), and at the same time we think about how to PRESERVE that re-publication (preserve the Linked Data).
  2. These things are in the archive
  3. The things in the archive change. The availability of new technology forces us to open the archive, take the data out of it, do something to it, and store the new version.
  4. 2 problems: layout interpretation, and semantic alignment
  5. We like 5 star datasets. Historians also like 5 star datasets. HOWEVER, they still want their non-standard formats for data diving. Data diving guides their research and suggests new research questions.
  6. This is super cool. NOW, how do we connect with the archive to produce it?....
  7. From the ARCHIVE to RDF Data Cube TURTLE
  8. Work in progress on
  9. Interesting – they explain change explicitly, linking together metadata from different periods of time and map shapes.
  10. To what extent can we build these classifications automatically?
  11. … BUT ALL DONE?
  12. Archiving the serialization of such semantic-statistic relationships?
  13. CHANGE OVER TIME