The Linked Open Data (LOD) cloud is expanding continuously. Entities appear, change, and disappear over time. However, relatively little is known about the dynamics of the entities, i.e., the characteristics of their temporal evolution. In this paper, we employ clustering techniques over the dynamics of entities to determine common temporal patterns. We define an entity as an RDF resource together with its attached RDF types and properties. The quality of the clusterings is evaluated using entity features such as the entities' properties, RDF types, and pay-level domain. In addition, we investigate to what extent entities that share a feature value change together over time. As a dataset, we use weekly LOD snapshots over a period of more than three years provided by the Dynamic Linked Data Observatory. Insights into the dynamics of entities on the LOD cloud have strong practical implications for any application requiring fresh caches of LOD. Applications range from determining crawling strategies for LOD and caching SPARQL queries to programming against LOD and recommending vocabularies for reuse.
Information-theoretic analysis of entity dynamics on the Linked Open Data cloud
1. www.moving-project.eu
TraininG towards a society of data-saVvy inforMation prOfessionals
to enable open leadership INnovation
Chifumi Nishioka and Ansgar Scherp
ZBW -- Leibniz Information Centre for Economics and Kiel University, Germany
Information-theoretic Analysis of Entity
Dynamics on the Linked Open Data cloud
Motivation
• Understanding the dynamics of the LOD cloud is
important for many applications
• e.g., SPARQL query caching, crawling strategies, term
recommendations
• Related work
• Evolution of LOD documents [Käfer et al. 13]
• Dynamics of LOD sources [Dividino et al. 14]
• Entities on the LOD cloud
• Used by a lot of applications
• Knowledge graph in search engines
• Document modeling [Schuhmacher and Ponzetto 14]
Chifumi Nishioka (chni@informatik.uni-kiel.de)
Come to the presentation of
“TermPicker” by Johann Schaible at
14:30 on 1st June (Wednesday)
We conduct an analysis focusing on entities
Research Goals
• Measure the changes in entities between two points in time
• Represent the temporal dynamics of entities as time-series
• Time-series clustering
• Periodicity detection
• Evaluate four different features of entities
Goal 1: Represent the temporal dynamics of entities
Goal 2: Find out the representative temporal patterns of
entity dynamics
Goal 3: Find out which entity features are more likely to
determine the temporal dynamics of entities
Formalization
• $X_t$: snapshot of LOD documents at point in time $t$
• A snapshot is a collection of triples $x$
• $x$: triple
• $x = (s, p, o)$: subject, predicate, and object
Entity and Entity Representations
• Entities are represented by a set of triples
• Entity Representation: Out
• Set of triples with common subject URI
• e.g., db:John_Brown is defined by two triples
• Entity Representation: InOut
• Set of triples with common subject URI or object URI
• e.g., db:John_Brown is defined by three triples
db:Anne_Smith db:spouseOf db:John_Brown
db:John_Brown db:birthplace db:Los_Angels
db:John_Brown db:works db:Green_University
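The two entity representations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Triple` type and the function names `entity_out` and `entity_inout` are hypothetical and do not come from the slides; the three triples are the ones shown above.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object

def entity_out(snapshot, uri):
    """Out representation: triples whose subject is the entity URI."""
    return {x for x in snapshot if x.s == uri}

def entity_inout(snapshot, uri):
    """InOut representation: triples whose subject or object is the entity URI."""
    return {x for x in snapshot if x.s == uri or x.o == uri}

# The three example triples from the slide.
snapshot = {
    Triple("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    Triple("db:John_Brown", "db:birthplace", "db:Los_Angels"),
    Triple("db:John_Brown", "db:works", "db:Green_University"),
}

print(len(entity_out(snapshot, "db:John_Brown")))    # 2
print(len(entity_inout(snapshot, "db:John_Brown")))  # 3
```

With Out, db:John_Brown is described by the two triples in which it is the subject; InOut additionally picks up the spouseOf triple in which it is the object.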
Triple Weighting
• Example: Barack Obama
• <Barack_Obama, dbp:vicePresident, Joe_Biden> is more
important than <Barack_Obama, rdf:type, foaf:Person>
• Baseline
• All triples have the same weight: $w_{baseline}(x) = 1$
• Combined Information Content (combIC)
[Schuhmacher and Ponzetto 14]
• $IC(v) = -\log(P(v))$
• $pred(x)$ and $obj(x)$ return the predicate and the object of a
triple $x$, respectively
• $w_{combIC}(x) = IC(pred(x)) + IC(obj(x))$
Triples differ in their importance for an entity
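A minimal sketch of the combIC weighting follows. The slides do not say how P(v) is estimated or which logarithm base is used, so two assumptions are made here: P(v) is the relative frequency of a predicate (or object) among the given triples, and the natural logarithm is used. The function names and the toy triples are hypothetical.

```python
import math
from collections import Counter

def ic_table(values):
    """IC(v) = -log(P(v)); P(v) is estimated as relative frequency (assumption)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: -math.log(c / total) for v, c in counts.items()}

def comb_ic_weights(triples):
    """w_combIC(x) = IC(pred(x)) + IC(obj(x)) for each (s, p, o) triple x."""
    triples = list(triples)
    ic_pred = ic_table(p for _, p, _ in triples)
    ic_obj = ic_table(o for _, _, o in triples)
    return {x: ic_pred[x[1]] + ic_obj[x[2]] for x in triples}

# Toy data for illustration only (not taken from the DyLDO dataset).
triples = [
    ("Barack_Obama", "dbp:vicePresident", "Joe_Biden"),
    ("Barack_Obama", "rdf:type", "foaf:Person"),
    ("Joe_Biden", "rdf:type", "foaf:Person"),
    ("Michelle_Obama", "rdf:type", "foaf:Person"),
]
weights = comb_ic_weights(triples)

# The rare vicePresident triple gets a higher weight than the common type triple.
print(weights[triples[0]] > weights[triples[1]])  # True
```

Frequent predicates and objects such as rdf:type and foaf:Person carry little information and therefore contribute little weight, which matches the intuition of the Barack Obama example.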
Measuring Entity Dynamics
Goal 1: Represent the temporal dynamics of entities
1. Measure the amount of changes in entities between two
successive snapshots by one of two distance measures
• Cosine distance
$\delta_{cosd}(E_{t_1}, E_{t_2}) = 1 - \frac{E_{t_1} \cdot E_{t_2}}{\|E_{t_1}\| \cdot \|E_{t_2}\|}$
• Euclidean distance
$\delta_{euc}(E_{t_1}, E_{t_2}) = \sqrt{\sum_{i=1}^{m} (E_{t_1,i} - E_{t_2,i})^2}$, where $m$ is the number of vector dimensions
Vector Representation of Entities
• Represent an entity 𝐸 by one-hot encoding
• Extract all unique triples from different snapshots
• Fix order of triples
• e.g., db:Anne_Smith at $t_1$ is (1,1,1,0,0) and at $t_2$ is
(1,0,1,1,1)
• Cosine distance: $\delta_{cosd}(E_{t_1}, E_{t_2}) = 1 - \frac{2}{\sqrt{3} \cdot \sqrt{4}} = 0.42$
• Euclidean distance: $\delta_{euc}(E_{t_1}, E_{t_2}) = \sqrt{3}$
Triples of db:Anne_Smith (leading number: position of the triple in the fixed order)
At $t_1$:
1 db:Anne_Smith db:birthplace db:New_York
2 db:Anne_Smith db:works db:Green_University
3 db:Anne_Smith db:spouseOf db:John_Brown
At $t_2$:
1 db:Anne_Smith db:birthplace db:New_York
4 db:Anne_Smith db:works db:Royal_University
3 db:Anne_Smith db:spouseOf db:John_Brown
5 db:Anne_Smith db:degree db:Master_of_Science
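The db:Anne_Smith example can be reproduced with a short Python sketch. It uses the baseline weighting (every triple counts as 1) and assumes that both vectors are non-zero, since the cosine distance is undefined otherwise. The helper names are hypothetical.

```python
import math

# Fixed order of all unique triples across both snapshots (triples 1 to 5).
vocabulary = [
    ("db:Anne_Smith", "db:birthplace", "db:New_York"),
    ("db:Anne_Smith", "db:works", "db:Green_University"),
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    ("db:Anne_Smith", "db:works", "db:Royal_University"),
    ("db:Anne_Smith", "db:degree", "db:Master_of_Science"),
]
triples_t1 = {vocabulary[0], vocabulary[1], vocabulary[2]}
triples_t2 = {vocabulary[0], vocabulary[3], vocabulary[2], vocabulary[4]}

def to_vector(triples):
    """Binary vector: 1 if the triple is present in the snapshot, else 0."""
    return [1.0 if x in triples else 0.0 for x in vocabulary]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms  # assumes neither vector is all zeros

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

e_t1 = to_vector(triples_t1)  # [1, 1, 1, 0, 0]
e_t2 = to_vector(triples_t2)  # [1, 0, 1, 1, 1]

print(round(cosine_distance(e_t1, e_t2), 2))     # 0.42
print(round(euclidean_distance(e_t1, e_t2), 2))  # 1.73, i.e. the square root of 3
```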
Temporal Dynamics of Entities
• Temporal dynamics of an entity $E$
$\Delta(E) = (\delta(E_{t_1}, E_{t_2}), \delta(E_{t_2}, E_{t_3}), \dots, \delta(E_{t_{n-1}}, E_{t_n}))$
• $n$: the number of snapshots
2. Represent the temporal dynamics of an entity as a time-series of
the amount of change in the entity between each pair of successive
snapshots
Subsequently, we mine the resulting time-series to find
patterns in the temporal dynamics of entities
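As a rough sketch, the time-series of an entity can be built by vectorizing its triples at every snapshot and taking the distance between neighbouring vectors. The snippet assumes the baseline weighting and the Euclidean distance; the function names and the single-letter triple labels are hypothetical placeholders.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def temporal_dynamics(snapshots, distance):
    """Distances between the entity's vectors at successive snapshots.

    snapshots: list of sets, one set of triples of the entity per snapshot.
    Returns a list of length n - 1 for n snapshots.
    """
    # Fix one order over all unique triples seen in any snapshot.
    vocabulary = sorted(set().union(*snapshots))
    vectors = [[1.0 if x in s else 0.0 for x in vocabulary] for s in snapshots]
    return [distance(a, b) for a, b in zip(vectors, vectors[1:])]

# Four hypothetical snapshots of one entity; letters stand in for triples.
snapshots = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c", "d"}]
delta = temporal_dynamics(snapshots, euclidean_distance)
print(len(delta))  # 3, one value per pair of successive snapshots
```

The first value is 0 because nothing changed between the first two snapshots; later values grow with the number of triples added or removed.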
Time-series Clustering
• Clustering algorithm: k-means++ [Arthur and
Vassilvitskii 07]
• Introduces an improved initial seeding for k-means
• Distance measure: Euclidean distance
• The most efficient measure for distance between
time-series with a reasonably high accuracy [Wang et
al. 13]
• Choice of the number of clusters:
average silhouette
Goal 2: Find out the representative temporal patterns of
entity dynamics
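The clustering step can be illustrated with a small pure-Python k-means that uses k-means++ seeding and the squared Euclidean distance. This is a sketch, not the implementation used for the experiments: the function names, the fixed iteration count, and the toy series are assumptions, the input must contain at least k distinct series, and the selection of k by average silhouette is left out.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(series, k, rng):
    """k-means++ seeding: each new seed is drawn with probability proportional
    to its squared distance from the nearest seed chosen so far."""
    seeds = [list(rng.choice(series))]
    while len(seeds) < k:
        d2 = [min(sq_dist(s, c) for c in seeds) for s in series]
        seeds.append(list(rng.choices(series, weights=d2, k=1)[0]))
    return seeds

def assign(series, centroids):
    """Index of the nearest centroid for every series."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(s, centroids[j]))
            for s in series]

def kmeans(series, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = kmeanspp_seeds(series, k, rng)
    for _ in range(iterations):
        labels = assign(series, centroids)
        for j in range(k):
            members = [s for s, l in zip(series, labels) if l == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(series, centroids), centroids

# Hypothetical dynamics: three nearly static entities, three alternating ones.
series = [
    [0.0, 0.0, 0.1, 0.0], [0.0, 0.1, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 1.0, 0.0], [1.0, 0.0, 0.9, 0.1],
]
labels, centroids = kmeans(series, k=2)
print(labels)
```

The returned centroids are themselves time-series, which is what allows them to be read as representative patterns of entity dynamics.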
Periodicity Detection
• Periodicity Detection
• A task of detecting periodicity from time-series
• Example 1: (1, 3, 2, 1, 3, 2) -> periodicity of three
• Example 2: (1, 2, 1, 2, 1, 2) -> periodicity of two
• Employ a convolution-based algorithm [Elfeky et al.
05]
We assume that the amount of change in entities exhibits
some periodicity
We regard the centroids of the resulting clusters as patterns of
entity dynamics
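The two examples can be checked with a deliberately simple exact-match test. Note that this is not the convolution-based algorithm of [Elfeky et al. 05], which is designed for large and noisy series; it is a swapped-in toy technique that only illustrates what a periodicity of p means. The function name is hypothetical.

```python
def smallest_period(series):
    """Smallest p with series[i] == series[i + p] for all valid i, else None.

    Only periods up to half the series length are considered, so that the
    pattern is observed at least twice.
    """
    n = len(series)
    for p in range(1, n // 2 + 1):
        if all(series[i] == series[i + p] for i in range(n - p)):
            return p
    return None

print(smallest_period((1, 3, 2, 1, 3, 2)))  # 3
print(smallest_period((1, 2, 1, 2, 1, 2)))  # 2
print(smallest_period((1, 2, 3, 4)))        # None
```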
Dataset
• Dynamic Linked Data Observatory (DyLDO)
dataset [Käfer et al. 12]
• Weekly snapshots of the fixed set of LOD documents
• 165 snapshots over three years (05/2012 to 07/2015)
• Entities in the DyLDO dataset
• Almost 75% of entities appear in only one snapshot
• Focus on entities that appear in more than 70% of snapshots
Entity representation | # of unique entities in 165 snapshots | # of entities that appear in >70% of snapshots
Out | 27,788,902 | 2,909,700
InOut | 29,097,929 | 2,950,533
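The restriction to entities that appear in more than 70% of the snapshots could be implemented along the following lines. The function name, the integer percentage parameter, and the toy snapshots are assumptions for illustration; the comparison uses integers to avoid floating-point surprises at the threshold.

```python
from collections import Counter

def frequent_entities(snapshots, min_percent=70):
    """Entities that appear in more than min_percent percent of the snapshots.

    snapshots: list of collections of entity URIs, one per snapshot.
    """
    counts = Counter(uri for entities in snapshots for uri in set(entities))
    n = len(snapshots)
    return {uri for uri, c in counts.items() if c * 100 > min_percent * n}

# Ten hypothetical snapshots: A appears 10 times, B 8, C exactly 7, D once.
snapshots = (
    [{"db:A", "db:B", "db:C"}] * 7
    + [{"db:A", "db:B"}]
    + [{"db:A", "db:D"}]
    + [{"db:A"}]
)
print(sorted(frequent_entities(snapshots)))  # ['db:A', 'db:B']
```

db:C appears in exactly 70% of the snapshots and is therefore excluded, because the condition on the slide is strictly greater than 70%.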
Patterns of Entity Dynamics (1/3)
• Analysis with respect to eight conditions
• The conditions combine two entity representations,
two distance measures, and two triple weighting methods (2 × 2 × 2 = 8)
• Result of clustering
• The number of clusters is smaller when using combIC
Patterns of Entity Dynamics (2/3)
[Figure: four panels titled Out Cosine Baseline, Out Cosine CombIC, Out Euclidean Baseline, and Out Euclidean CombIC]
Periodicity of Entity Dynamics
We observe periodicities in temporal dynamics of entities
• e.g., a “periodicity of 56” indicates that the amount of
entity change varies in a roughly one-year cycle (snapshots are weekly)
• Different patterns have different periodicities
Features for Entity Dynamics (1/2)
• Four features of entities
• RDF Type (𝑓1)
• Property (𝑓2)
• Union of RDF types and properties (𝑓3)
• Pay-level domain (PLD) of entity URI (𝑓4)
• e.g., http://dbpedia.org/resource/The_Beatles -> dbpedia.org
• Evaluate the four features with the Rand index
• Rand index: a measure of agreement between two clusterings
• Compare the clustering induced by a feature with the
clustering by entity dynamics (i.e., time-series vectors)
Goal 3: Find out which entity features are more likely to
determine the temporal dynamics of entities
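The feature evaluation can be sketched as follows: group entities by a feature value, group them by their dynamics cluster, and compute the Rand index between the two groupings. Both helpers are illustrative assumptions. In particular, `naive_pld` simply keeps the last two labels of the host name, which is wrong for suffixes such as .co.uk; a real implementation would consult the Public Suffix List. The example.org URIs and the cluster labels are made up.

```python
from itertools import combinations
from urllib.parse import urlparse

def naive_pld(uri):
    """Rough pay-level domain: the last two labels of the host name (assumption)."""
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree, i.e. both put
    the pair in the same cluster or both put it in different clusters."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

uris = [
    "http://dbpedia.org/resource/The_Beatles",
    "http://dbpedia.org/resource/Kiel",
    "http://example.org/a",
    "http://example.org/b",
]
by_pld = [naive_pld(u) for u in uris]  # clustering induced by feature f4
by_dynamics = [0, 0, 1, 0]             # hypothetical time-series cluster labels

print(by_pld[0])                        # dbpedia.org
print(rand_index(by_pld, by_dynamics))  # 0.5
```

A Rand index of 1.0 would mean that grouping entities by the feature yields exactly the same partition as clustering them by their dynamics.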
Features for Entity Dynamics (2/2)
• Entities that share a common PLD are more likely
to have similar temporal dynamics
when the baseline is used for triple weighting
• When using combIC, entities that have a common
RDF type or ECS are more likely to belong to the same
cluster
Thank you for your
attention!
Project consortium and funding agency
MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092
Conclusion
• Temporal dynamics of entities on the LOD cloud
• Represent the temporal dynamics of entities as time-series
• Find out the representative temporal patterns of
entity dynamics
• Find out which entity features are more likely to determine
the temporal dynamics of entities
• Future work
• e.g., SPARQL query caching
Goal 3: Find out which entity features are more likely to
determine the temporal dynamics of entities
References
• [Arthur and Vassilvitskii 07] D. Arthur and S. Vassilvitskii. k-means++: The advantages of
careful seeding. SODA, 2007.
• [Elfeky et al. 05] M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid. Periodicity detection in
time series databases. IEEE TKDE, 2005.
• [Käfer et al. 12] T. Käfer, J. Umbrich, A. Hogan, and A. Polleres. Towards a dynamic
linked data observatory. LDOW, 2012.
• [Käfer et al. 13] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan.
Observing linked data dynamics. ESWC, 2013.
• [Neumann and Moerkotte 11] T. Neumann and G. Moerkotte. Characteristic sets:
Accurate cardinality estimation for RDF queries with multiple joins. ICDE, 2011.
• [Schuhmacher and Ponzetto 14] M. Schuhmacher and S.P. Ponzetto. Knowledge-based
graph document modeling. WSDM, 2014.
• [Wang et al. 13] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E.
Keogh. Experimental comparison of representation methods and distance measures for
time series data. Data Mining and Knowledge Discovery, 2013.
• [Yang and Leskovec 11] J. Yang and J. Leskovec. Patterns of temporal variation in online
media. WSDM, 2011.
Entities in the DyLDO dataset
• Distribution of the number of snapshots (out of 165) in
which each entity appears
[Figure: two panels titled Entity representation Out and Entity representation InOut]