Linked Data has enabled integrated and uniform access to various disparate socio-historical data sources in the Netherlands. However, preparing data for analysis is still a cumbersome task, taking up to 60% of an analyst's time. In this presentation I describe some novel tools based on Semantic Web technology that help automate the still closed, unshared, and non-repeatable process of data preparation.
4. Data Preparation
• Many interesting datasets are messy, incomplete, and incorrect
• Data analysis requires clean data
• Cleaning data involves careful interpretation and study
• Values and variables in the data are replaced with (more) standard terms (coding)
• Cross-dataset analyses require a further data harmonization step
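The coding step described above can be sketched in a few lines of Python. The mapping and the example values below are made up for illustration; real codings map raw values to standard classification systems.

```python
# Minimal sketch of the "coding" step: raw values from a messy dataset
# are replaced with (more) standard terms. The mapping below is invented
# for illustration only.
CODING = {
    "farm hand": "farm worker",
    "farmhand": "farm worker",
    "laborer": "labourer",
}

def code_value(raw):
    """Return the standard term for a raw value, or the raw value unchanged."""
    return CODING.get(raw.strip().lower(), raw)

records = ["Farmhand", "laborer", "teacher"]
coded = [code_value(r) for r in records]
print(coded)  # ['farm worker', 'labourer', 'teacher']
```

Harmonization for cross-dataset analysis works similarly, except that the mappings of several datasets must be brought to one shared target vocabulary.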
7. Linking Social History Data
• Linked Open Data – a machine-readable Web graph with over 100 billion statements [1]
• Sharing (socio-historical) knowledge for reusability
• Solves the integration problem
[1] http://lodlaundromat.org/
8. • TabLinker: conversion of Excel spreadsheets to RDF
• Integrator: attaches harmonization rules to the raw RDF
• QBer: crowd-based, interactive coding and harmonization
• LSD Dimensions: an index of statistical variables on the Web
13. SCRY
Web-standards-compatible statistical functions in SPARQL:

PREFIX :       <http://scry.rocks/example/>
PREFIX scry:   <http://scry.rocks/>
PREFIX impute: <http://scry.rocks/math/impute?>
PREFIX mean:   <http://scry.rocks/math/mean?>
PREFIX sd:     <http://scry.rocks/math/stdev?>
PREFIX qb:     <http://purl.org/linked-data/cube#>

SELECT ?obs ?dim ?imputed_val WHERE {
  ?obs a qb:Observation .
  { ?dim a qb:DimensionProperty } UNION { ?dim a qb:MeasureProperty }
  FILTER NOT EXISTS { ?obs ?dim ?val . }
  ?other_obs ?dim ?other_val .
  SERVICE <http://sparql.scry.rocks/> {
    SELECT ?other_val ?imputed_val {
      GRAPH ?g1 {
        impute:v scry:input  ?other_val ;
                 scry:output ?imputed_val .
      }
    }
  }
}

Delegation of the non-standard function to a remote SCRY orb.
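What the delegated impute function computes can be sketched in plain Python. Simple mean imputation is assumed here (suggested by the mean: prefix above); the actual SCRY service may implement a more sophisticated strategy.

```python
# Sketch of mean imputation, the kind of statistical function the query
# above delegates to a remote SCRY service: missing observation values
# are filled in with the mean of the observed values for the same variable.
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([4.0, None, 8.0]))  # [4.0, 6.0, 8.0]
```

The point of SCRY is precisely that functions like this are not expressible in standard SPARQL, so they are computed service-side and joined back into the query results.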
14. Don’t like SPARQL? Neither do we!
https://github.com/CEDAR-project/Queries
http://grlc.clariah-sdh.eculture.labs.vu.nl/CEDAR-project/Queries/api-docs
15. Conclusion
• Data preparation: an expensive task (up to 60% of the time)
• Linked Data is good for (socio-historical) data integration on the Web
• But data quality issues remain:
  – Linked Edit Rules: rule hub and data quality assessment
  – SCRY: Linked Data-compatible statistical functionality
  – grlc: you don’t need to know Linked Data to use Linked Data
Tools to facilitate data integration in social history – make the life of the social historian working with semi-structured data easier
1. Explain the process stages.
2. Preparation = artisan work.
3. A critical step in knowledge discovery is data integration: interrogating various datasets in a combined way.
Currently, datasets are brought together by hand.
Computer science is keen on researching the later stages, but the problem is somewhere else!
So our goal is to provide automatic methods to avoid this, and to make preparation *reusable*.
(Avoid jargon; describe the tools.)
See https://github.com/CEDAR-project
See https://github.com/CLARIAH
The good things about data integration come next…
Bringing databases together.
However: I actually don’t want to talk about how good Linked Data is for integrating historical data, but about the problems that such integrated datasets might still have.
This aligns with the general problem of data on the Web: Web data is of varied quality (meaning you’ll stumble upon very crappy data).
Keep the example on missing data.
Linked Edit Rules: a hub of interconnected constraints on statistical datasets.
(Screenshot)
QBsistent: a tool that uses Linked Edit Rules to validate your data against this hub.
(Link to repo)
(Problem: we need to implement statistical functions in SPARQL…)
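The kind of validation QBsistent performs can be sketched as a rule check over individual records. The rules and the record below are invented for illustration; Linked Edit Rules express such constraints in RDF so they can be shared and interlinked on the Web.

```python
# Sketch of edit-rule validation: each rule is a named predicate over a
# record, and a record is consistent only if every rule holds. The rules
# below are made up for illustration.
RULES = [
    ("age is non-negative", lambda r: r["age"] >= 0),
    ("married implies age >= 15", lambda r: not r["married"] or r["age"] >= 15),
]

def violated_rules(record):
    """Return the names of all edit rules the record violates."""
    return [name for name, check in RULES if not check(record)]

print(violated_rules({"age": 12, "married": True}))
# ['married implies age >= 15']
```

The parenthetical problem above is where SCRY comes in: some edit rules need statistical quantities (means, standard deviations) that standard SPARQL cannot compute.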
Query examples.
As a novice, edit a query and just get a spreadsheet by pushing a button.
You don’t need Linked Data knowledge.
These and many more are fully described in this recently published book… which is actually my doctoral dissertation.