Finding the Achilles Heel of the Web of Data

Presentation given at the 9th International Semantic Web Conference in Shanghai, China (2010).

Transcript

  • 1. Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Christophe Guéret (@cgueret), Paul Groth, Frank van Harmelen, Stefan Schlobach {cgueret,pgroth,Frank.van.Harmelen,schlobac}@few.vu.nl, VU University Amsterdam (VUA). ISWC - November 11, 2010. http://latc-project.eu/
  • 2. The next 25+5 minutes. Topics: the Web of Data, Complex Systems, robustness and road networks. Contributions from the paper: two Complex System views of the WoD; application of network metrics for robustness; increasing robustness as an optimisation problem. Questions to be answered: What are these Achilles Heels and where are they? What can we do about them?
  • 3. Walking on the WoD roads. Credit: http://www.flickr.com/photos/neuwieser/4828178404/
  • 4. Resource chains and information harvesting. The Web of Data is a network of labelled "roads". It is possible to walk on the WoD from resource to resource. Example: find a location by de-referencing chains (Freebase, DBPedia, Geonames). 50% of the LOD cloud data sets provide at most 2 connections to other data sets [1]. [1] http://lod-cloud.net/state/
  • 5. What can go wrong. If a path is broken: some data sets become isolated; information is lost. This can happen when: namespaces or concepts are changed (e.g. sioc:User → sioc:UserAccount); servers are offline for some reason (data center flooded, server overloaded, etc.). These are two different types of failure (semantic / structural). Use network analysis tools to identify the nodes at risk and to monitor the impact of changes in topology.
  • 6. Robustness of the network. Robustness is assessed by the level of damage caused when a node is removed. Different measures: diameter of a graph (low ⇒ highly connected); degree distribution (scale-free ⇒ robust against random failure); centrality (central nodes are weak spots); ... Centrality enables per-node analysis.
  • 8. Centrality of nodes in a graph. [Figure: example graph with nodes numbered 1 to 10.] Different notions of centrality: high degree, close to other nodes, on the way between other nodes. In the example graph: degree centrality → 4; closeness centrality → 3 and 7; betweenness centrality → 8.
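The three notions of centrality can be sketched in a few lines of plain Python. The edges of the slide's example graph are not recoverable from the transcript, so this sketch assumes a different, well-known 10-node example (the Krackhardt "kite" graph, nodes labelled 0 to 9), in which the three measures also single out different nodes. The function names are illustrative, and the betweenness routine follows Brandes' algorithm for unweighted graphs; it is not taken from the authors' software.

```python
from collections import deque

# Assumption: Krackhardt "kite" graph, NOT the graph drawn on the slide.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 5), (1, 3), (1, 4), (1, 6), (2, 3),
         (2, 5), (3, 4), (3, 5), (3, 6), (4, 6), (5, 6), (5, 7), (6, 7),
         (7, 8), (8, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets built from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Number of neighbours of each node."""
    return {n: len(neigh) for n, neigh in adj.items()}


def closeness_centrality(adj):
    """(n - 1) / sum of shortest-path lengths; assumes a connected graph."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = (len(adj) - 1) / sum(dist.values())
    return result


def betweenness_centrality(adj):
    """Brandes' algorithm: counts shortest paths passing through each node."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        order, dist = [], {s: 0}
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {n: 0.0 for n in adj}
        for v in reversed(order):  # accumulate dependencies back to s
            for u in preds[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                score[v] += delta[v]
    # Each unordered pair was counted from both endpoints.
    return {n: c / 2 for n, c in score.items()}
```

On this assumed graph the highest degree belongs to node 3, the highest closeness is shared by nodes 5 and 6, and the highest betweenness belongs to node 7, the only bridge to the tail of the kite.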
  • 9. So, where are the Achilles Heels? Credit: http://www.flickr.com/photos/robbie1/1725308/
  • 10. The WoD as a Complex System. The Web of Data is a multi-dimensional network with labelled edges. We need to abstract the WoD into simple networks to study it [2]. Networks are created using a representative subset of the WoD triples. Two networks are used to analyse the two types of risk: (1) a structural network (nodes = hostnames); (2) a semantic network (nodes = namespaces). [2] C. Guéret, S. Wang, S. Schlobach, "The Web of Data is a Complex System - first insight into its multi-scale network properties" (ECCS2010)
  • 11. Data sets. Take all the resource-resource triples from the BTC2010. Group them by hostnames (structural network) and by namespaces (semantic network).
      Network name    Number of nodes    Number of edges
      Hostnames       558k               656k
      Namespaces      198                936
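The grouping step can be sketched as follows. The slides do not say how a namespace is derived from a URI, so this sketch assumes the common convention of cutting after the last "#", or after the last "/" when there is no "#". The function names, the sample URIs under example.org and the choice to drop links that stay inside one host or namespace are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def hostname(uri):
    """Host part of a URI, e.g. 'dbpedia.org' (None if there is no host)."""
    return urlsplit(uri).hostname


def namespace(uri):
    """Assumed convention: everything up to the last '#', else the last '/'."""
    cut = uri.rfind("#")
    if cut == -1:
        cut = uri.rfind("/")
    return uri[: cut + 1]


def build_networks(triples):
    """Count host-to-host and namespace-to-namespace links.

    `triples` is an iterable of (subject, predicate, object) URI strings
    where subject and object are both resources.
    """
    structural = Counter()  # edges between hostnames
    semantic = Counter()    # edges between namespaces
    for subject, _predicate, obj in triples:
        host_s, host_o = hostname(subject), hostname(obj)
        if host_s and host_o and host_s != host_o:
            structural[(host_s, host_o)] += 1
        ns_s, ns_o = namespace(subject), namespace(obj)
        if ns_s != ns_o:
            semantic[(ns_s, ns_o)] += 1
    return structural, semantic


if __name__ == "__main__":
    sample = [
        ("http://dbpedia.org/resource/Amsterdam",
         "http://www.w3.org/2002/07/owl#sameAs",
         "http://example.org/places/amsterdam"),  # hypothetical target
    ]
    structural, semantic = build_networks(sample)
    print(dict(structural))
    print(dict(semantic))
```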
  • 12. Top 10 visited nodes - structural network
      Hostname            B(n)
      xmlns.com           5 693 379 049
      dbpedia.org         5 432 125 038
      purl.org            2 163 504 423
      www.kanzaki.com     532 149 372
      www.w3.org          470 113 796
      dbtune.org          323 796 691
      identi.ca           318 896 524
      www.twine.com       299 237 555
      semanticweb.org     277 374 029
      dblp.l3s.de         225 602 575
    If you see your machine(s) here, invest in big servers asap!
  • 14. Top 10 visited nodes - semantic network
      Namespace                                               B(n)
      www.w3.org/1999/02/22-rdf-syntax-ns#                    8783
      example.org/                                            7191
      dbpedia.org/resource/                                   5428
      xmlns.com/foaf/0.1/                                     5030
      www.w3.org/2002/07/owl#                                 3926
      sw.opencyc.org/concept/                                 1764
      www.w3.org/2007/uwa/context/deliverycontext.owl#        1737
      www.w3.org/2003/01/geo/wgs84_pos#                       1609
      www.semanticdesktop.org/ontologies/2007/11/01/pimo#     1300
      ontologies.ezweb.morfeo-project.org/eztag/ns#           1225
    If you see your namespace(s) here, don't change them - ever! Yes, even if there is a version number in it! (sorry Dan...)
  • 15. Improving the robustness. Credit: http://www.flickr.com/photos/thundershead/3713965526/
  • 17. Prevent node failure. First, basic answer: it's easy! Infrastructure (hostname) network: the Web of Data is based on standard Web technologies (HTTP, etc.), and it is known how to scale them (mirrors, round-robin, ...). Semantic (namespace) network: just use cool URIs, they don't change (thus, no more problem). Second answer: find a way to decrease the importance of the nodes in the top 10.
  • 18. How to decrease the betweenness centrality of the nodes? Add alternate paths to divert the traffic when needed. [Figure: Freebase, DBPedia, Geonames.]
  • 20. But adding new links... may not be possible (e.g. mapping Bio2RDF data to Geonames data); has a creation cost plus a maintenance cost, estimated as the inverse of the similarity between the vocabularies used by the nodes. Optimisation problem: decrease the variance of the betweenness centrality; minimize the total cost; minimize the number of new links.
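The three objectives can be written down as a small evaluation function. The slides do not give the exact cost formula, so this sketch makes two assumptions: vocabulary similarity is measured with the Jaccard index over the sets of terms each node uses, and the cost of a link is 1 minus that similarity (which keeps costs between 0 and 1). All names are illustrative, and the betweenness values are taken as an input rather than recomputed here.

```python
from statistics import pvariance


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two collections of vocabulary terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def link_cost(vocab_u, vocab_v):
    """Assumed cost model: dissimilar vocabularies make a link expensive."""
    return 1.0 - jaccard(vocab_u, vocab_v)


def objectives(betweenness, new_links, vocabularies):
    """Return the three quantities to minimise for a candidate set of links.

    betweenness  : dict node -> betweenness measured AFTER adding new_links
    new_links    : list of (u, v) pairs proposed for insertion
    vocabularies : dict node -> set of vocabulary terms used by that node
    """
    total_cost = sum(
        link_cost(vocabularies[u], vocabularies[v]) for u, v in new_links
    )
    variance = pvariance(betweenness.values())
    return variance, total_cost, len(new_links)
```

Returning a tuple keeps the three goals separate, so a caller can either compare candidates lexicographically or combine the values into a single weighted score.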
  • 22. Optimisation algorithms for adding links. Different strategies geared towards particular goals. Greedy strategies (exhaustive): (1) add all the possible edges, starting with the cheapest, to increase connectivity among topic-oriented clusters; (2) add all the possible edges, starting with the most expensive, to bridge topic-oriented clusters. Selective strategies (set based): (1) add a random set of edges, a rapid and hopefully efficient way to create a set; (2) use a genetic algorithm to construct an optimal set of edges and insert the best combination of edges.
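The two greedy orderings and the random selection can be sketched as below. Candidate edges are assumed to be (source, target, cost) tuples, and `measure` stands for any function that scores a set of edges; `max_degree` is only a cheap stand-in so that the sketch runs on its own, where a real run would plug in a betweenness-based measure. None of these names come from the authors' software.

```python
import random


def greedy_order(candidates, cheapest_first=True):
    """Sort candidate (u, v, cost) edges by increasing or decreasing cost."""
    return sorted(candidates, key=lambda edge: edge[2],
                  reverse=not cheapest_first)


def random_set(candidates, size, seed=None):
    """Selective strategy 1: pick `size` candidate edges at random."""
    return random.Random(seed).sample(candidates, size)


def add_incrementally(edges, ordered_candidates, measure):
    """Add candidates one by one and report measure / initial measure."""
    current = set(edges)
    baseline = measure(current)
    ratios = []
    for u, v, _cost in ordered_candidates:
        current.add((u, v))
        ratios.append(measure(current) / baseline)
    return ratios


def max_degree(edges):
    """Stand-in measure (an assumption): the largest node degree."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return max(degree.values())


if __name__ == "__main__":
    existing = {("a", "b"), ("a", "c"), ("a", "d")}
    candidates = [("b", "c", 0.2), ("c", "d", 0.5), ("b", "d", 0.9)]
    print(add_incrementally(existing, greedy_order(candidates), max_degree))
```

Reporting the value relative to the initial one mirrors the way the slides present their results, as a ratio rather than an absolute centrality.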
  • 23. Greedy strategies - namespaces network. [Plot: centrality ratio against the number of edges added to the graph (1 to 25000); series: target, increasing cost, decreasing cost.] The actual centrality value is not meaningful; we report it relative to the initial one.
  • 24. Optimal set construction with the genetic algorithm. Principles: iterative trial and error; several sets evaluated at the same time; improvement of candidate solutions. Steps: create several random sets; evaluate and rank them all; alter the best ones to get new sets.
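The loop described on this slide (create random sets, rank them, alter the best ones) can be sketched as a very small evolutionary search. This is a generic illustration, not the authors' algorithm: the population size, the number of generations, the single-swap mutation and the rule of keeping the better half are all assumptions, and `fitness` is any user-supplied function where lower is better.

```python
import random


def mutate(edge_set, candidates, rng):
    """Alter a set by swapping one of its edges for one outside the set."""
    outside = [c for c in candidates if c not in edge_set]
    if not outside:
        return edge_set
    removed = rng.choice(sorted(edge_set))  # sorted for reproducibility
    added = rng.choice(outside)
    return (edge_set - {removed}) | {added}


def evolve(candidates, set_size, fitness,
           population_size=20, generations=50, seed=0):
    """Search for a low-fitness set of `set_size` edges among `candidates`."""
    rng = random.Random(seed)
    # Step 1: create several random sets.
    population = [frozenset(rng.sample(candidates, set_size))
                  for _ in range(population_size)]
    for _ in range(generations):
        # Step 2: evaluate and rank them all (lower fitness is better).
        ranked = sorted(population, key=fitness)
        survivors = ranked[: max(1, population_size // 2)]
        # Step 3: alter the best ones to get new sets.
        children = [mutate(parent, candidates, rng) for parent in survivors]
        population = survivors + children
    return min(population, key=fitness)
```

Because the best set of each generation is always kept among the survivors, the returned set is never worse than the best set of the initial random population. In the setting of the talk, `fitness` would combine the centrality measure with the cost of the selected edges.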
  • 25. Selective strategies - namespaces network. [Plot: centrality ratio against the size of the set of edges added (1 to 25000); series: target, random choice, evolutionary algorithm.] If you want to add only a few edges, select them carefully.
  • 26. One possible solution
      From namespace                                   To namespace                                         Cost
      http://purl.org/vocab/lifecycle/schema#          http://rdf.freebase.com/ns/                          0.99
      http://annotation.semanticweb.org/2004/iswc#     http://www.w3.org/2007/uwa/context/location.owl#     0.89
      http://openean.kaufkauf.net/id/                  http://www.w3.org/2008/05/skos-xl#                   1.00
      http://purl.org/dc/dcmitype/                     http://sw.opencyc.org/concept/                       1.00
    This set of 4 new edges brings the centrality down to 70% of its original value.
  • 27. Conclusion. What's next? (1) Extend and generalise this work: analyse a different, and bigger, set of crawled data (proposals are welcome!); investigate other network measures. (2) Increase the application range of our analysis: turn our batch processes into a stream-oriented analysis; make a service for personalised linking recommendations. Data and software available at http://linkeddata.few.vu.nl/wod_analysis
  • 28. Take home message. Network analysis provides meaningful insights, by telling which nodes are central and thus weak. The Web of Data contains weak points, which can be identified and ranked. The Web of Data can be optimized, by carefully choosing the new connections to create. Slides available on SlideShare: http://www.slideshare.net/cgueret/cgueret-iswc2010
