Dedalo: looking for Clusters' Explanations in a Labyrinth of Linked Data
Ilaria Tiddi, Mathieu d'Aquin, Enrico Motta 
Knowledge Media Institute, The Open University 
May 28, 2014
The Knowledge Discovery process 
 Explaining patterns requires background knowledge. 
 Background knowledge is usually provided by experts. 
 Background knowledge comes from different domains. 
 Experts might not be aware of some background knowledge.
Explaining clusters: an example 
Authors clustered according to the papers they wrote together. 
How to explain those clusters?
Explaining clusters - the easy solution 
Use an expert 
each cluster represents a research group in KMi  
Can one trust those experts?
Explaining clusters - the nice solution 
Use Inductive Logic Programming (ILP) 
E+ (positive examples), E− (negative examples) 
attendsESWC(M.dAquin). 
attendsESWC(E.Motta). 
attendsESWC(V.Lopez). 
B: knowledge about E = E+ ∪ E− 
submitted(M.dAquin). submitted(V.Lopez). 
submitted(E.Motta). 
accepted(V.Lopez). accepted(M.dAquin). 
Learn a complete (B ∪ H ⊨ E+) and consistent (B ∪ H ⊭ E−) 
explanation for the relation attendsESWC(X). 
attendsESWC(X) ← submitted(X) ∧ accepted(X)
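To make the ILP step concrete, here is a minimal Python sketch (not part of the original slides; the facts are extended for illustration so that the rule covers every positive example) of checking a candidate rule for completeness and consistency against the background knowledge B:

```python
# Minimal sketch of the ILP completeness/consistency check.
# The facts below are illustrative only.

background = {
    "submitted": {"M.dAquin", "V.Lopez", "E.Motta"},
    "accepted": {"M.dAquin", "V.Lopez", "E.Motta"},
}
positives = {"M.dAquin", "E.Motta", "V.Lopez"}  # attendsESWC(X) holds
negatives = set()                               # no negative examples listed here

def covers(rule, person):
    """A rule is a conjunction of unary predicates that must all hold for `person`."""
    return all(person in background[pred] for pred in rule)

def complete(rule):
    """B ∪ H ⊨ E+ : every positive example is covered."""
    return all(covers(rule, p) for p in positives)

def consistent(rule):
    """B ∪ H ⊭ E− : no negative example is covered."""
    return not any(covers(rule, n) for n in negatives)

rule = ["submitted", "accepted"]  # attendsESWC(X) ← submitted(X) ∧ accepted(X)
print(complete(rule), consistent(rule))  # True True
```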
Explaining clusters - still the nice solution 
E+ (positive examples), E− (negative examples) 
inMyCluster(M.dAquin). 
inMyCluster(M.Fernandez). 
inMyCluster(V.Lopez). 
inMyCluster(H.Saif). 
inMyCluster(M.Sabou). 
inMyCluster(C.Pedrinaci). 
inMyCluster(J.Domingue). 
B: knowledge about E = E+ ∪ E− 
B? 
inMyCluster(X) ← ?
Explaining clusters - the cool solution 
Integrate ILP with Linked Data
E+ (positive examples), E− (negative examples) 
inMyCluster(M.dAquin). 
inMyCluster(M.Fernandez). 
inMyCluster(V.Lopez). 
inMyCluster(H.Saif). 
inMyCluster(M.Sabou). 
inMyCluster(C.Pedrinaci). 
inMyCluster(J.Domingue). 
B: knowledge about E = E+ ∪ E− 
topic(M.dAquin, SemanticWeb). topic(M.Sabou, SemanticWeb). 
topic(V.Lopez, SemanticWeb). topic(H.Saif, SocialWeb). 
topic(C.Pedrinaci, SemanticWebServices). 
topic(J.Domingue, SemanticWebServices). 
topic(M.Fernandez, SocialWeb). 
inMyCluster(X) ← topic(X, SemanticWeb) 
Is this enough?
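Reusing the check sketched earlier, the topic facts listed on this slide already suggest why this first rule is not enough: it covers only three of the seven cluster members (illustrative Python, not part of the slides):

```python
# topic facts taken from the slide; the coverage computation is illustrative.
topic = {
    "M.dAquin": "SemanticWeb", "M.Sabou": "SemanticWeb", "V.Lopez": "SemanticWeb",
    "H.Saif": "SocialWeb", "M.Fernandez": "SocialWeb",
    "C.Pedrinaci": "SemanticWebServices", "J.Domingue": "SemanticWebServices",
}
cluster = set(topic)  # E+: everyone listed is in my cluster

covered = {x for x in cluster if topic[x] == "SemanticWeb"}
print(len(covered), "of", len(cluster), "members covered")  # 3 of 7
```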
Producing Linked Data Explanations 
People working in the same place / on similar topics / on the same project are likely to write papers together.
Producing Linked Data Explanations 
People working under the same person / with the same partner are likely to write papers together.
Producing Linked Data Explanations 
People working under people interested in the same thing write papers together.
Integrating ILP and Linked Data 
Add to B each Linked Data explanation hi = ⟨pk⟩.⟨vk⟩*, 
where: 
 pk (path): a chain of RDF properties 
pk = {prop0 → prop1 → ... → propn} 
 vk (value): a final instance 
 roots(hi): the elements ∈ Ci having hi in common 
roots(hi) = {ou:M.dAquin, ou:V.Lopez, ou:M.Sabou} 
*spread across different datasets 
hi = ⟨ou:project → ou:ledBy → foaf:topic⟩pk . ⟨edu:SemanticWeb⟩vk 
Building each hi: 
- how? 
- which chains of properties? 
- where to find the good ones?
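The sketch below (an assumption of mine, using rdflib and placeholder URIs, since the slides show no code) illustrates one way a single hypothesis hi = ⟨pk⟩.⟨vk⟩ could be represented, and how roots(hi) could be computed by following the property chain pk from each element of a cluster:

```python
# Sketch of representing a hypothesis h_i = <p_k>.<v_k> and computing roots(h_i)
# by traversing an RDF graph. URIs and the loaded data are placeholders.
from rdflib import Graph, URIRef

def follow_path(graph, start, path):
    """Return all entities reached from `start` by the chain of properties `path`."""
    frontier = {start}
    for prop in path:
        frontier = {o for s in frontier for o in graph.objects(s, prop)}
    return frontier

def roots(graph, cluster, path, value):
    """roots(h_i): the cluster elements from which the path p_k leads to v_k."""
    return {e for e in cluster if value in follow_path(graph, e, path)}

g = Graph()  # g.parse("kmi-people.ttl") would load the Linked Data here
path = [URIRef("http://example.org/ou/project"),
        URIRef("http://example.org/ou/ledBy"),
        URIRef("http://xmlns.com/foaf/0.1/topic")]
value = URIRef("http://example.org/edu/SemanticWeb")
cluster = {URIRef("http://example.org/ou/M.dAquin"),
           URIRef("http://example.org/ou/V.Lopez"),
           URIRef("http://example.org/ou/M.Sabou")}
print(roots(g, cluster, path, value))  # empty on an empty graph
```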
Dedalo - An iterative Linked Data traversal 
Scoring hypotheses: 
WRacc¹(hi) = |roots(hi)| / |R| × ( |roots(hi) ∩ Ci| / |roots(hi)| − |Ci| / |R| ) 
¹ Geng et al. (2006). Interestingness measures for data mining: A survey.
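As a sanity check, the WRacc score above can be transcribed directly into a small Python function; the sets below are invented purely to exercise it and this is not Dedalo's actual implementation:

```python
# Direct transcription of the WRacc formula; inputs are plain Python sets.
def wracc(roots_hi, cluster_ci, all_entities):
    R = len(all_entities)
    coverage = len(roots_hi) / R
    precision = len(roots_hi & cluster_ci) / len(roots_hi) if roots_hi else 0.0
    prior = len(cluster_ci) / R
    return coverage * (precision - prior)

# Example with made-up sets:
R = {f"e{i}" for i in range(10)}
Ci = {"e0", "e1", "e2", "e3"}
hi_roots = {"e0", "e1", "e2", "e5"}
print(round(wracc(hi_roots, Ci, R), 3))  # 0.14
```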
Dedalo - An iterative Linked Data traversal 
How to define the interestingness of a path pk? 
How to reach the best hi in the shortest time?
Dedalo - Comparing Heuristics 
 We chose to compare different strategies. 
 We want to find the path pk leading to the best hi in the shortest time. 
 We want to save time and computational complexity. 
Path Length: length of pk in number of properties composing it 
Path Frequency: frequency of the paths in the graph 
Adapted PMI: joint and individual distribution of pk and Ci 
Adapted TF-IDF: how important pk (term) is in Ci (doc) 
Delta: |vals(pk)| − |C| 
Entropy²: distribution of |vals(pk)| 
Conditional Entropy: distribution of |vals(pk)| w.r.t. Ci 
² Shannon, C. (1948). A Mathematical Theory of Communication.
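The entropy-based measures can be sketched as follows, assuming vals(pk) is collected as a plain list of values reached through pk; the data is made up, and how Dedalo exactly weights and combines these scores is not shown on the slides:

```python
# Rough sketch of the entropy and conditional entropy heuristics for a path p_k.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of the distribution of values reached through p_k."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def conditional_entropy(values_by_membership):
    """Entropy of the values given membership in C_i (weighted over the groups)."""
    total = sum(len(v) for v in values_by_membership.values())
    return sum(len(v) / total * entropy(v)
               for v in values_by_membership.values() if v)

# Made-up value assignments for a path p_k:
vals = ["SemanticWeb", "SemanticWeb", "SocialWeb", "SemanticWebServices"]
print(entropy(vals))
print(conditional_entropy({"in_Ci": ["SemanticWeb", "SemanticWeb"],
                           "out_Ci": ["SocialWeb", "SemanticWebServices"]}))
```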
Dedalo's Heuristics 
Ci = {ou:M.dAquin, ou:V.Lopez, ou:M.Sabou} 
Path Frequency: top(pk) = ⟨foaf:topic⟩ 
Adapted TF-IDF: top(pk) = ⟨ou:exMember⟩ 
Entropy: top(pk) = ⟨ou:project → ou:ledBy → foaf:topic⟩
Experiments - KMi co-authorship 
 Authors clustered according to their co-authorships. 
 Network Partitioning clustering, |R| = 92, |C| = 6 
[Plot: WRacc vs. search cycles for the Semantic Web authors and Learning Analytics authors clusters, comparing the Len, Fq, D, Ent, C.Ent, TF-IDF and PMI heuristics]
|Ci|  hi  WRacc 
22  ⟨org:hasMembership → ox:hasPrincipalInvestigator → org:hasMembership⟩p . ⟨ou:SmartProducts⟩v1  0.128 
23  ⟨org:hasMembership → ox:hasPrincipalInvestigator → org:hasMembership⟩p . ⟨ou:SocialLearn⟩v2  0.127
Experiments - KMi Publications 
 Papers clustered according to their keywords. 
 XK-Means clustering, |R| = 865, |C| = 6 
[Plot: WRacc vs. search cycles for the Learning Analytics papers and Semantic Web papers clusters, comparing the Len, Fq, D, Ent, C.Ent, TF-IDF and PMI heuristics]
|Ci|  hi  WRacc 
601  ⟨dc:creator → ntag:isRelatedTo⟩p . ⟨ou:LearningAnalytics⟩v1  0.042 
220  ⟨dc:creator → ntag:isRelatedTo⟩p . ⟨ou:SemanticWeb⟩v2  0.073
Experiments - Huddersfield's dataset 

 Books clustered according to the students' Faculties. 
 K-Means clustering, |R| = 6969, |C| = 14 
[Plot: WRacc vs. search cycles for the Music students' borrowings and Theatre students' borrowings clusters, comparing the Len, Fq, D, Ent, C.Ent, TF-IDF and PMI heuristics]
|Ci|  hi  WRacc 
335  ⟨dc:subject → skos:broader⟩p1 . ⟨lcsh:PhysicalScience⟩v  0.005 
919  ⟨dc:creator → bl:hasCreated → dc:subject⟩p2 . ⟨bl:EnglishDrama⟩v  0.013
Experiments - Comparing heuristics 
Heuristics speed comparison (in seconds): 
                      KMiA1   KMiA2   KMiP1   KMiP2   Hud1    Hud2 
Len                   1.64    4.15    8.95    9.01    69.13   135.5 
Freq                  2.57    4.35    7.5     9.29    180     180 
PMI                   2.05    3.88    11.28   18.42   180     180 
TF-IDF                1.69    3.18    10.61   17.19   180     180 
Delta                 2.02    3.92    180     180     180     180 
Entropy               4.19    3.27    7.1     7.3     41.15   105.09 
Conditional Entropy   2.64    3.89    7.48    7.55    70.91   40.89 
- Len, Freq: fast but inaccurate baselines 
- Entropy / Conditional Entropy: outperforming measures, reducing redundancy (following wrong paths) and time efforts 
- PMI, TF-IDF, Delta: they might work on less homogeneous clusters
Conclusions 
Linked Data - automatically explaining clusters 
Dedalo - traversing Linked Data to reveal explanations 
Entropy - driving the search in the Linked Data cloud 
Beyond Dedalo: Dedalo works as long as the domain is limited; new use-cases require its extension.
Future work: the OU students enrolment dataset 
 Add sameAs linking 
 Use of literals 
 Aggregation of atomic rules 
 Explore new hypothesis evaluation measures
Thanks for your attention!³ 
ilaria.tiddi@open.ac.uk 
mathieu.daquin@open.ac.uk 
enrico.motta@open.ac.uk 
Questions? Better asking the robot than the experts. 
³ Special thanks to the KMi (happy) faces.