ECCS 2010


  1. 1. The Web of Data as a Complex System - First insight into its multi-scale network properties
     Christophe Guéret, Shenghui Wang, and Stefan Schlobach
     Department of Computer Science, Network Institute, Vrije Universiteit Amsterdam
  2. 2. Outline
     • What is the Web of Data?
     • How complex is the Web of Data?
     • A new way of seeing the Web of Data
     • What have we found?
     • What are the challenges?
  3. 3. What is the Web of Data?
     • "The Semantic Web is a web of data"
     • "Linked Data is a sub-topic of the Semantic Web. The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web."
     • "Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods."
  4. 4. Four principles of Linked Data
     • Use URIs to identify things.
     • Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
     • Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
     • Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
     -- Tim Berners-Lee
  5. 5. An example of linked data
     • Nodes are shared across statements
     • The links have some meaning
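The two points on this slide can be sketched with a handful of statements written as (subject, predicate, object) tuples. The `ex:` names below are hypothetical placeholders, not identifiers taken from the slides; the sketch only shows how a node that appears in several statements ties them together.

```python
from collections import Counter

# Hypothetical statements; the "ex:" names are illustrative only.
triples = [
    ("ex:Christophe", "ex:workIn", "ex:VU_Amsterdam"),
    ("ex:VU_Amsterdam", "ex:isLocatedIn", "ex:Amsterdam"),
    ("ex:Amsterdam", "ex:isLocatedIn", "ex:The_Netherlands"),
]


def shared_nodes(statements):
    """Return the nodes that take part in more than one statement."""
    counts = Counter()
    for subject, _predicate, obj in statements:
        # Count each node once per statement, even if it is both ends.
        counts.update({subject, obj})
    return {node for node, seen in counts.items() if seen > 1}


print(shared_nodes(triples))
```

Here `ex:VU_Amsterdam` and `ex:Amsterdam` are the shared nodes: each one is the object of one statement and the subject of the next.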
  6. 6. Since 2006, people have been creating linked data
  7. 7. October 2007
  8. 8. July 2009
  9. 9. Evolution of the Web of Data
  10. 10. The WoD is a complex system!
     • More than 260 extremely heterogeneous datasets
         • general-purpose datasets, such as DBpedia
         • domain-oriented datasets, such as Bio2RDF
         • government data, music data, geological data, social network data, etc.
     • Nearly 50 billion RDF triples
         • Nearly 50 billion links within the datasets
         • More than 800 million links between the datasets
     • Rich semantics embedded in the data
         • data points are typed
         • links are typed
         • links are what make the statements useful
  11. 11. The links have explicit semantics, which yield implicit links deduced by the reasoning process
     [Figure: graph over the nodes Christophe, VU Amsterdam, Amsterdam and The Netherlands, with edges labelled workIn and isLocatedIn]
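How implicit links can follow from explicit ones may be sketched with a tiny forward-chaining loop. The three explicit statements and the two rules below are assumptions made for illustration (the slide's figure does not survive in this transcript): `isLocatedIn` is treated as transitive, and working in a place implies working in whatever that place is located in.

```python
# Assumed explicit statements, loosely based on the node and edge labels
# that survive from the figure.
explicit = {
    ("Christophe", "workIn", "VU Amsterdam"),
    ("VU Amsterdam", "isLocatedIn", "Amsterdam"),
    ("Amsterdam", "isLocatedIn", "The Netherlands"),
}


def infer(statements):
    """Return the implicit statements derivable under the two assumed rules."""
    facts = set(statements)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Both rules chain "x p y" with "y isLocatedIn z" into "x p z".
                if o == s2 and p2 == "isLocatedIn" and p in ("workIn", "isLocatedIn"):
                    new.add((s, p, o2))
        if new <= facts:  # fixed point: nothing new was derived
            return facts - set(statements)
        facts |= new


for triple in sorted(infer(explicit)):
    print(triple)
```

Under these assumptions the loop derives three implicit links, for example that Christophe works in The Netherlands. A real reasoner would take its rules from the vocabulary's semantics rather than from hard-coded predicate names.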
  12. 12. People are trying to use the WoD
     • Billion Triple Challenges since 2008
     • "The specific goal of the Billion Triples Track is to demonstrate the scalability of applications as well as to encourage the development of applications that can deal with Web data. We stress that the goal of this is not to be a benchmarking effort between triple stores, but rather to demonstrate applications that can scale to a Web scale using realistic Web-quality data."
  13. 13. The WoD itself should be robust
     • Are there central hubs whose failure would break connectivity?
     • The WoD is designed for automated agents, which are less able to recover from connectivity failures.
     • The robustness of the WoD should be ensured
     • Until now, the WoD could be studied, searched and maintained like a classical database
  14. 14. Network analysis
     • A new way of seeing the WoD
     • What network analysis tells us
  15. 15. A new way of seeing the WoD
     Consider the WoD as a network
  16. 16. Applying network analysis over the WoD
     • Average path length
     • Degree distribution
     • Strongly connected components
     • Degree centrality
     • Betweenness centrality
     • Closeness centrality
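Two of the listed measures, average path length and degree distribution, can be sketched with the standard library alone. The adjacency dictionary below is a made-up toy graph, and averaging only over ordered pairs that are actually reachable is an assumption of this sketch; the slides do not say how disconnected pairs were handled.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over ordered pairs (s, t) with t reachable from s."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    total = pairs = 0
    for s in nodes:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def degree_distribution(adj):
    """Map each total degree (in + out) to the number of nodes having it."""
    degree = Counter()
    for u, targets in adj.items():
        for v in targets:
            degree[u] += 1
            degree[v] += 1
    return Counter(degree.values())


# Toy directed graph: a -> b -> c
toy = {"a": ["b"], "b": ["c"], "c": []}
print(average_path_length(toy))
print(degree_distribution(toy))
```

For graphs of the size discussed later in the talk, a dedicated graph library would be a more practical choice than this quadratic all-pairs loop.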
  17. 17. Scales of observation of the WoD
     1. Graph scale
  18. 18. Graph-scale WoD network
     • Each dataset is a node
     • Edges are weighted, directed connections between the datasets
         • if there is at least one triple having a subject within dataset 1 and an object within dataset 2, then there is an edge between these two datasets.
         • the number of such triples is the weight of the edge.
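The edge definition above translates almost directly into code. In this sketch, `dataset_of` is a hypothetical lookup from a resource to the dataset that hosts it, and the resource and dataset names are invented. Skipping triples whose subject and object sit in the same dataset is an assumption, made because the slide only describes edges between two datasets.

```python
from collections import Counter


def dataset_graph(triples, dataset_of):
    """Build weighted directed edges (source dataset, target dataset) -> triple count."""
    weights = Counter()
    for subject, _predicate, obj in triples:
        src = dataset_of.get(subject)
        dst = dataset_of.get(obj)
        # Ignore nodes of unknown origin (e.g. literals) and intra-dataset links.
        if src is None or dst is None or src == dst:
            continue
        weights[(src, dst)] += 1
    return weights


# Hypothetical resources and datasets, for illustration only.
dataset_of = {"d1:a": "dataset1", "d1:b": "dataset1", "d2:x": "dataset2"}
triples = [
    ("d1:a", "ex:linksTo", "d2:x"),
    ("d1:b", "ex:linksTo", "d2:x"),
    ("d1:a", "ex:linksTo", "d1:b"),   # same dataset: not an inter-dataset edge
    ("d2:x", "ex:label", "some text"),  # literal object: no dataset
]
print(dataset_graph(triples, dataset_of))
```

The result holds a single edge from `dataset1` to `dataset2` with weight 2, since two triples cross between them in that direction.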
  19. 19.
     • 110 nodes with 350 edges
     • Average path length is 2.16
     • 50 components
  20. 20. A degree of 7 is the critical point beyond which the network is no longer scale-free.
  21. 21. Top central nodes
     Every centrality has a specific meaning...

     Betweenness centrality
       Node            Value
       DBpedia         0.332
       DBLP Berlin     0.108
       DBLP (RKB)      0.100
       DBLP Hannover   0.097
       FOAF profiles   0.075

     Closeness centrality
       Node            Value
       DBpedia         0.762
       Geonames        0.614
       Drug Bank       0.576
       Linked MDB      0.544
       Flickr wrappr   0.526

     Degree centrality
       Node            Value
       DBpedia         0.505
       UniProt         0.266
       DBLP (RKB)      0.266
       ACM (RKB)       0.229
       GeneID          0.211
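To make the meaning of two of these measures concrete, here is a minimal sketch of degree and closeness centrality on an undirected toy graph. The normalisations used (degree divided by n - 1, and reachable nodes divided by the sum of distances) are the common textbook ones and are an assumption here; the slides do not state which variants were used.

```python
from collections import deque


def degree_centrality(neighbours):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(neighbours)
    if n < 2:
        return {u: 0.0 for u in neighbours}
    return {u: len(vs) / (n - 1) for u, vs in neighbours.items()}


def closeness_centrality(neighbours):
    """Number of reachable nodes divided by the sum of distances to them."""
    result = {}
    for source in neighbours:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Toy star graph: one hub linked to three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(degree_centrality(star))
print(closeness_centrality(star))
```

In the star graph the hub scores 1.0 on both measures, while each leaf scores 1/3 for degree and 0.6 for closeness. Under the same degree normalisation, a value of 0.505 in a 110-node network would correspond to roughly 55 direct neighbours (55/109 ≈ 0.505).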
  22. 22. Scales of observation of the WoD
     2. Triple scale
  23. 23. Triple-scale WoD network
     • We took 10 million triples from the dataset crawled from the WoD and provided by the Billion Triple Challenge 2009
     • This "BTC" network is defined as G=(V, (E, L)), where
         • V is a set of nodes, and each node is a URI or a literal
         • E is a set of edges
         • L is a set of labels, each label characterising a relation between nodes
     • We applied a few strategies to aggregate data for comparison.
  24. 24. Triple-scale network and its aggregations
     • BTC aggregated: triples are aggregated by the domain names
     • BTC aggregated + filter: only domain names shared with the graph-scale network

       Network                   Nodes   Edges   Average path length   Components
       BTC                       605K    860K    2.15                  602K
       BTC aggregated            14K     31K     2.80                  7K
       BTC aggregated + filter   37      91      1.88                  17
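The two aggregation strategies can be sketched as follows. This is only one plausible reading of "aggregated by the domain names": each URI is reduced to its host name with `urllib.parse.urlparse`, literals (which have no host) are dropped, and links inside a single domain are ignored. The `example.org` URIs and the `keep` parameter are hypothetical.

```python
from urllib.parse import urlparse


def aggregate_by_domain(triples, keep=None):
    """Collapse triples into directed edges between domain names.

    keep: optional set of domain names; when given, only edges whose two
    ends are both in the set survive (the "+ filter" variant).
    """
    edges = set()
    for subject, _predicate, obj in triples:
        src = urlparse(subject).netloc
        dst = urlparse(obj).netloc
        if not src or not dst or src == dst:
            continue  # literal, malformed URI, or intra-domain link
        if keep is not None and (src not in keep or dst not in keep):
            continue
        edges.add((src, dst))
    return edges


# Hypothetical triples for illustration.
triples = [
    ("http://a.example.org/x", "http://vocab.example.org/p", "http://b.example.org/y"),
    ("http://a.example.org/z", "http://vocab.example.org/p", "http://b.example.org/y"),
    ("http://a.example.org/x", "http://vocab.example.org/label", "plain literal"),
    ("http://b.example.org/y", "http://vocab.example.org/p", "http://c.example.org/w"),
]
print(aggregate_by_domain(triples))
print(aggregate_by_domain(triples, keep={"a.example.org", "b.example.org"}))
```

The first call yields two domain-level edges; the second keeps only the edge between the two whitelisted domains.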
  25. 25. Degree distribution
     [Figure: degree distributions of the BTC and BTC aggregated networks; power-law distribution]
  26. 26. Top central nodes:
  27. 27. The next steps
     • Open challenges
     • Ongoing research activities at VUA
  28. 28. Challenges:
     • Existence of implicit links
     • “Semantic virus”
     [Figure: graph over the nodes Christophe, VU Amsterdam, Amsterdam, The Netherlands and Asia, with edges labelled workIn and isLocatedIn]
  29. 29. Challenges:
     • Multi-relation links
         • FOAF (social networks + personal information)
         • SIOC (relations characterising blogs)
         • SWRC (describing research work)
         • …
         • Different filters produce different networks
         • The centrality status of nodes changes w.r.t. the network
     • Dynamics
         • Data will be continuously added and linked.
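The point that different filters produce different networks can be illustrated by selecting edges according to the vocabulary of their predicate. The FOAF and SIOC namespace URIs below are the published ones; the `example.org` resources are hypothetical, and matching on the namespace prefix is just one simple filtering strategy.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
SIOC = "http://rdfs.org/sioc/ns#"


def filter_by_vocabulary(triples, namespace):
    """Keep only the edges whose predicate belongs to the given namespace."""
    return {(s, o) for s, p, o in triples if p.startswith(namespace)}


# Hypothetical resources; only the predicate URIs are real vocabulary terms.
triples = [
    ("http://example.org/alice", FOAF + "knows", "http://example.org/bob"),
    ("http://example.org/bob", FOAF + "knows", "http://example.org/carol"),
    ("http://example.org/post1", SIOC + "has_creator", "http://example.org/alice"),
]

social = filter_by_vocabulary(triples, FOAF)
blogs = filter_by_vocabulary(triples, SIOC)
print(len(social), len(blogs))
```

The same set of triples yields a two-edge social network under the FOAF filter and a one-edge authorship network under the SIOC filter, so any centrality computed afterwards depends on which filter was applied.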
  30. 30. “sameAs” networks
  31. 31. Monitoring and Improving the WoD
     • Linked data is meant to be browsed, jumping from one resource to another
     • The presence of hubs is critical for the paths
     • Create alternate paths to be used in case of failure
     Guéret, Groth, van Harmelen, Schlobach, "Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation", ISWC 2010, to appear
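Why hubs are critical, and how an alternate path helps, can be seen by removing a node and counting the pieces that remain. This is a simple connectivity check on an undirected toy graph, not the link-recommendation method of the cited paper; the graph and node names are invented.

```python
def components_without(neighbours, removed):
    """Count connected components of an undirected graph after removing one node."""
    seen = {removed}
    count = 0
    for start in neighbours:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count


# Toy star graph: every path between leaves goes through the hub.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}

# Same graph with one alternate path added between leaves a and b.
patched = {"hub": {"a", "b", "c"}, "a": {"hub", "b"}, "b": {"hub", "a"}, "c": {"hub"}}

print(components_without(star, "hub"))
print(components_without(patched, "hub"))
```

Losing the hub splits the star into three isolated leaves, whereas the patched graph falls into only two pieces because `a` and `b` stay connected through the added link.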
  32. 32. {cgueret, swang, schlobac} We need to study more!