Detecting and Reporting Extensional Concept
Drift in Statistical Linked Data
Albert Mero˜o-Pe˜uela1,2 Christophe Gu´ret2
n...
Motivation
Stability of Meaning of Concepts

As the world changes continuously, concepts also change their
meaning over ti...
Motivation
Stability of Meaning of Concepts in SLD

Concepts are also present in SLD
Variable meaning/semantics (what the ...
Motivation
Stability of Meaning of Concepts in SLD

To what extent stability of meaning of these concepts is
guaranteed in...
Concept Drift
Types of change of meaning

The meaning of a concept C can change in several ways1 .
Intension drift occurs ...
Key ideas
If you remember nothing else, remember this

In this talk we present two key ideas
A method to detect extensiona...
Extensional drift detection
Using statistics

We define the extension function ext(C ) of a concept C as
the number of indi...
Extensional drift reporting
Using SPARQL UPDATEs

Linked Data Cloud

SPARQL endpoint

Typical analysis workflow. Users SELE...
Extensional drift reporting
Using SPARQL UPDATEs

Linked Data Cloud

SPARQL endpoint

Proposal. After running analyses, UP...
Case-Study
Publishing Linked Census Data

The Dutch historical censuses (1795 - 1971)

Lots of statistical data (2,288 tab...
Case-Study
Publishing Linked Census Data

Towards 5-star historical census data
Mero˜o-Pe˜uela, A. et al.
n
n

Extensional...
Case-Study
Publishing Linked Census Data

Currently

1 year ago

> 1 year ago

TabLinker3 : Supervised XLS2RDF converter
3...
Case-Study
Querying

SPARQL template for unfolding RDF Data Cubes
Mero˜o-Pe˜uela, A. et al.
n
n

Extensional Concept Drift
Case-Study
Querying

Census contains very heterogeneous data
Subset selection: occupation census of 1889 and 1899
Age rang...
Case-Study
Data preprocessing

1889 HISCO mappings

1899 HISCO mappings

HISCO normalization
Mero˜o-Pe˜uela, A. et al.
n
n...
Case-Study
Extensional drift detection in R

Is there extensional concept drift between two variants of the same
occupatio...
Case-Study
Extensional drift detection in R

Are the occupations in the two censuses comparable?
217 common HISCO codes
72...
Case-Study
Extensional drift detection in R

For all h common HISCO codes
Query for population distributions of h in 1889
...
Case-Study
Results

Late industralization of the Netherlands (late 19th century)
19.35% of occupations show radical extens...
Case-Study
Extensional drift reporting

Closing the pull-push cycle4
4

https://github.com/albertmeronyo/ConceptDrift/
Mer...
Conclusions & Future work
Concept drift: concepts change over time
It affects model reusability and backwards querying
Exte...
Thank you
Questions, suggestions?

@albertmeronyo
http://www.cedar-project.nl
http://www.data2semantics.org

Mero˜o-Pe˜uel...
Upcoming SlideShare
Loading in …5
×

Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

578 views

Published on

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
578
On SlideShare
0
From Embeds
0
Number of Embeds
26
Actions
Shares
0
Downloads
9
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

  1. 1. Detecting and Reporting Extensional Concept Drift in Statistical Linked Data Albert Mero˜o-Pe˜uela1,2 Christophe Gu´ret2 n n e Rinke Hoekstra1,3 Stefan Schlobach1 1 Department 2 Data 3 Leibniz of Computer Science, VU University Amsterdam, NL Archiving and Networked Services, KNAW, NL Center for Law, Faculty of Law, University of Amsterdam, NL October 22th, 2013 First International Workshop on Semantic Statistics ISWC 2013 Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  2. 2. Motivation Stability of Meaning of Concepts As the world changes continuously, concepts also change their meaning over time We call this concept drift Smooth transitions Radical transitions (concept shift) Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  3. 3. Motivation Stability of Meaning of Concepts in SLD Concepts are also present in SLD Variable meaning/semantics (what the variable is supposed to represent?) Variable values/factors (RomschKatholik, RomsKatholic, KatholicChristelijk) Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  4. 4. Motivation Stability of Meaning of Concepts in SLD To what extent stability of meaning of these concepts is guaranteed in data collected on very long time ranges? If meaning is stable Old models are reusable Backwards comparisons always make sense Else New models may be necessary Backwards comparisons may be incorrect Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  5. 5. Concept Drift Types of change of meaning The meaning of a concept C can change in several ways1 . Intension drift occurs when there is a difference in the properties or attributes of two variants of the same concept (simint (C , C ) = 1). Extension drift occurs when there is a difference in the individuals that belong to two variants of the same concept (simext (C , C ) = 1). Label drift occurs when there is a difference in the labels of two variants of the same concept (simlabel (C , C ) = 1). 1 S. Wang, S. Schlobach, M. Klein, What Is Concept Drift and How to Measure It?, 2010 Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  6. 6. Key ideas If you remember nothing else, remember this In this talk we present two key ideas A method to detect extensionally drifted concepts A recommendation to annotate such drifts in SLD Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  7. 7. Extensional drift detection Using statistics We define the extension function ext(C ) of a concept C as the number of individuals that belong to C ext(C ) = |{a : C }| We define the extension similarity function simext (C , C ) between two variants C , C of a concept C as the function that returns the probability that C and C have identical populations2 simext (C , C ) = wilcox.test(ext(C ), ext(C )) Intuitive idea: concept variants with significantly different populations (p < 0.05) suffer radical transitions 2 F. Wilcoxon, Individual comparisons by ranking methods, 1945 Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  8. 8. Extensional drift reporting Using SPARQL UPDATEs Linked Data Cloud SPARQL endpoint Typical analysis workflow. Users SELECT data from SLD, but analyses run offline and results are not pushed back Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  9. 9. Extensional drift reporting Using SPARQL UPDATEs Linked Data Cloud SPARQL endpoint Proposal. After running analyses, UPDATE results (e.g. detected extensional drifts) back to the endpoint Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  10. 10. Case-Study Publishing Linked Census Data The Dutch historical censuses (1795 - 1971) Lots of statistical data (2,288 tables) Lots of concepts Big time span (176 years) May contain lots of drifted concepts! Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  11. 11. Case-Study Publishing Linked Census Data Towards 5-star historical census data Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  12. 12. Case-Study Publishing Linked Census Data Currently 1 year ago > 1 year ago TabLinker3 : Supervised XLS2RDF converter 3 https://github.com/Data2Semantics/TabLinker Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  13. 13. Case-Study Querying SPARQL template for unfolding RDF Data Cubes Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  14. 14. Case-Study Querying Census contains very heterogeneous data Subset selection: occupation census of 1889 and 1899 Age range Gender Marital status Municipality 36-50 M G Velsen 51-60 V O Zaandam 23-35 VROUWEN O. Haarlem 36-50 MANNEN G. Weesp Class Subclass Occupation Position Population II b Aanemers A 3 IV a Agenten B 1 XXI d Ambtenaren C 2 en beambten I h Afwerken A 5 van huizen Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  15. 15. Case-Study Data preprocessing 1889 HISCO mappings 1899 HISCO mappings HISCO normalization Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  16. 16. Case-Study Extensional drift detection in R Is there extensional concept drift between two variants of the same occupation (i.e. same HISCO code)? E.g. is there concept drift between ship loaders of 1889 and 1899? Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  17. 17. Case-Study Extensional drift detection in R Are the occupations in the two censuses comparable? 217 common HISCO codes 72.2% of all 1889 HISCO codes 71.1% of all 1899 HISCO codes Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  18. 18. Case-Study Extensional drift detection in R For all h common HISCO codes Query for population distributions of h in 1889 Query for population distributions of h in 1899 Do wilcox.test between the two and get p-values Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  19. 19. Case-Study Results Late industralization of the Netherlands (late 19th century) 19.35% of occupations show radical extensional drifts Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  20. 20. Case-Study Extensional drift reporting Closing the pull-push cycle4 4 https://github.com/albertmeronyo/ConceptDrift/ Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  21. 21. Conclusions & Future work Concept drift: concepts change over time It affects model reusability and backwards querying Extensional drifts can be detected with wilcox.test We SPARQL UPDATE drifts to let others know Scale up variables, tables Parametrization of the wilcox.test depending on the time gap Integration of intension and label drift measures RDF HISCO will be released soon HISCO mappings will be published as separate named graphs (so that you can link yours) Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift
  22. 22. Thank you Questions, suggestions? @albertmeronyo http://www.cedar-project.nl http://www.data2semantics.org Mero˜o-Pe˜uela, A. et al. n n Extensional Concept Drift

×