ECCS 2010


Published on

Published in: Career
1 Like
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • All the previous version of the picture are on
  • ECCS 2010

    1. 1. The Web of Data as a Complex System - First insight into its multi-scale network properties Christophe Guéret, Shenghui Wang , and Stefan Schlobach   Department of Computer Science, Network Institute Vrije Universiteit Amsterdam
    2. 2. Outline <ul><ul><li>What is the Web of Data? </li></ul></ul><ul><li>  </li></ul><ul><ul><li>How complex is the Web of Data? </li></ul></ul><ul><li>  </li></ul><ul><ul><li>A new way of seeing the Web of Data </li></ul></ul><ul><li>  </li></ul><ul><ul><li>What have we found? </li></ul></ul><ul><li>  </li></ul><ul><ul><li>What are the challenges? </li></ul></ul>
    3. 3. What is the Web of Data? <ul><li>The Semantic Web is a web of data </li></ul><ul><li>                                -- </li></ul><ul><li>  </li></ul><ul><li>Linked Data is a sub-topic of the Semantic Web . The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web . </li></ul><ul><li>--   </li></ul><ul><li>  </li></ul><ul><li>Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods. </li></ul><ul><li>--   </li></ul><ul><li>  </li></ul>
    4. 4. Four principles of Linked Data <ul><ul><ul><li>Use URIs to identify things. </li></ul></ul></ul><ul><ul><ul><li>Use HTTP URIs so that these things can be referred to and looked up (&quot; dereferenced &quot;) by people and user agents . </li></ul></ul></ul><ul><ul><ul><li>Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML . </li></ul></ul></ul><ul><ul><ul><li>Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web. </li></ul></ul></ul><ul><ul><li>-- Tim Berners-Lee </li></ul></ul>
    5. 5. type An example of linked data <ul><li>Nodes are shared across statements </li></ul><ul><li>The links have some meaning </li></ul>
    6. 6. Since 2006, people are creating linked data
    7. 7. October 2007
    8. 8. July 2009
    9. 9. Evolution of the Web of Data
    10. 10. The WoD is a complex system! <ul><ul><li>More than 260 extremely heterogeneous datasets </li></ul></ul><ul><ul><ul><li>general-purposed datasets, such as DBpedia </li></ul></ul></ul><ul><ul><ul><li>domain-oriented datasets, such as Bio2RDF </li></ul></ul></ul><ul><ul><ul><li>government data, music data, geological data, social network data, etc. </li></ul></ul></ul><ul><li>  </li></ul><ul><ul><li>Nearly 50 billion RDF triples </li></ul></ul><ul><ul><ul><li>Nearly 50 billion links within the datasets </li></ul></ul></ul><ul><ul><ul><li>More than 800 million links between the datasets </li></ul></ul></ul><ul><li>  </li></ul><ul><ul><li>Embedded rich semantics in the data </li></ul></ul><ul><ul><ul><li>data points are typed </li></ul></ul></ul><ul><ul><ul><li>links are typed </li></ul></ul></ul><ul><ul><ul><li>links is what makes the statements useful </li></ul></ul></ul>
    11. 11. Amsterdam The Netherlands isLocatedIn Christophe VU Amsterdam workIn isLocatedIn workIn workIn The links have explicit semantics, which brings implicit links deduced after the reasoning process
    12. 12. People are trying to use the WoD <ul><li>Billion triple challenges since 2008 </li></ul><ul><li>  </li></ul><ul><li>    &quot;The specific goal of the Billion Triples Track is to demonstrate the scalability of applications as well as to encourage the development of applications that can deal with Web data. We stress that the goal of this is not to be a benchmarking effort between triple stores, but rather to demonstrate applications that can scale to a Web scale using realistic Web-quality data . &quot; </li></ul>
    13. 13. The WoD itself should be robust <ul><ul><li>Is there central hubs whose failure would lead to lack of connectivity? </li></ul></ul><ul><li>  </li></ul><ul><ul><li>The WoD is designed for automated agents that have less capability to recover from the failure of the connectivity. </li></ul></ul><ul><li>  </li></ul><ul><ul><li>The robustness of the WoD should be ensured </li></ul></ul><ul><li>  </li></ul><ul><ul><li>Up till now, the WoD could be studied, searched and maintained like a classical database </li></ul></ul>
    14. 14. Network analysis A new way of seeing the WoD   What network analysis tells us
    15. 15. A new way of seeing the WoD Consider the WoD as network
    16. 16. Applying network analysis over the WoD <ul><ul><li>Average path length </li></ul></ul><ul><ul><li>Degree distribution </li></ul></ul><ul><ul><li>Strongly connected components </li></ul></ul><ul><ul><li>Degree centrality </li></ul></ul><ul><ul><li>Between centrality </li></ul></ul><ul><ul><li>Closeness centrality </li></ul></ul>
    17. 17. Scales of observation of the WoD <ul><li>  1. Graphs scale </li></ul>
    18. 18. Graph-scale WoD network <ul><ul><li>Each dataset is a node </li></ul></ul><ul><li>  </li></ul><ul><ul><li>Edges are weighted, directed connections between the datasets </li></ul></ul><ul><ul><ul><li>if there is at least one triple having a subject within dataset 1 and an object within dataset 2, then there is an edge between these two datasets.  </li></ul></ul></ul><ul><ul><ul><li>the number of such triples is the weight of the edge. </li></ul></ul></ul><ul><li>  </li></ul><ul><li>     </li></ul>
    19. 19. <ul><ul><li>110 nodes with 350 edges </li></ul></ul><ul><ul><li>Average path length is 2.16 </li></ul></ul><ul><ul><li>50 components </li></ul></ul>
    20. 20. The degree of 7 is critical point after which the network is not scale-free any more.
    21. 21. Top central nodes Betweenness centrality Closeness centrality Degree centrality Every centrality has a specific meaning... Node Value DBpedia 0.332 DBLP Berlin 0.108 DBLP (RKB) 0.100 DBLP Hannover 0.097 FOAF profiles 0.075 Node Value DBpedia 0.762 Geonames 0.614 Drug Bank 0.576 Linked MDB 0.544 Flickr wrappr 0.526 Node Value DBpedia 0.505 UniProt 0.266 DBLP (RKB) 0.266 ACM (RKB) 0.229 GeneID 0.211
    22. 22. Scales of observation of the WoD <ul><li>2. Triple scale </li></ul>
    23. 23. Triple-scale WoD network <ul><ul><li>We took the 10 million triples from the dataset crawled from the WoD, provided by the billion triple challenge 2009  </li></ul></ul><ul><li>  </li></ul><ul><ul><li>This &quot;BTC&quot; network is defined as G=(V, (E, L)), where </li></ul></ul><ul><ul><ul><li>V is a set of nodes, and each node is a URI or a literal </li></ul></ul></ul><ul><ul><ul><li>E is a set of edges </li></ul></ul></ul><ul><ul><ul><li>L is a set of labels, each label characterising a relation between nodes </li></ul></ul></ul><ul><li>  </li></ul><ul><ul><li>We applied a few strategies to aggregate data for comparison.  </li></ul></ul>
    24. 24. <ul><li>Triple-scale network and its aggregations </li></ul><ul><ul><li>BTC aggregated: triples are aggregated by the domain names </li></ul></ul><ul><ul><li>BTC aggregated + filter: only domain names shared with the graph-scale network </li></ul></ul>Network Nodes Eges Average path length Components BTC 605K 860K 2.15 602K BTC aggregated 14K 31K 2.80 7K BTC aggregated + filter 37 91 1.88 17
    25. 25. Degree distribution BTC BTC aggregated Power-law distribution
    26. 26. Top central nodes:
    27. 27. The next steps   Open challenges Ongoing research activities at VUA
    28. 28. Challenges: <ul><ul><li>Existence of implicit links </li></ul></ul><ul><ul><li>“ Semantic virus” </li></ul></ul>Amsterdam The Netherlands isLocatedIn Christophe VU Amsterdam workIn isLocatedIn workIn workIn Asia isLocatedIn
    29. 29. Challenges: <ul><ul><li>Multi-relations links </li></ul></ul><ul><ul><ul><li>FOAF (social networks + personal information) </li></ul></ul></ul><ul><ul><ul><li>SIOC (relations characterising blogs) </li></ul></ul></ul><ul><ul><ul><li>SWRC (describing research work) </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul><ul><ul><ul><li>Different filtering produce different networks </li></ul></ul></ul><ul><ul><ul><li>Centrality status of nodes changes w.r.t the networks </li></ul></ul></ul><ul><ul><li>Dynamics </li></ul></ul><ul><ul><ul><li>Data will be continuously added and linked. </li></ul></ul></ul>
    30. 30. “ sameAs” networks
    31. 31. Monitoring and Improving the WoD <ul><ul><li>Linked data is meant to be browsed, jumping from one ressource to another </li></ul></ul><ul><ul><li>The presence of Hubs is critical for the paths </li></ul></ul><ul><ul><li>Create alternate paths to be used in case of failure </li></ul></ul><ul><li>  </li></ul>Guéret, Groth, van Harmelen, Schlobach, &quot; Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation &quot;, ISWC2010 - To appear
    32. 32. {cgueret, swang, schlobac} We need to study more!
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.