Exploring Linked Data content through network analysis

Presentation given at a seminar at Yahoo.

  1. Exploring Linked Data content through network analysis
     Christophe Guéret (@cgueret), Free University Amsterdam
     Co-explorers: Stefan Schlobach, Shenghui Wang, Paul Groth, Frank van Harmelen
     http://latc-project.eu http://www.vu.nl
  2. Outline of the talk
     What is Linked Data?
     What is there to be analysed?
     Do we miss something?
     New research directions and first results
     November 23, 2011 Analysis of Linked Data 2/35
  3. Linked Data (aka Semantic Web)
     Image credit: http://www.flickr.com/photos/erikcharlton/3337465138
  4. What is the problem?
     Frank and Christophe publish some open data
     Roi wants to combine and enrich it
     Frank's table (WWW):
       Kennissen    Staad
       Christophe   Amsterdam
       Peter        Barcelona
       David        Parijs
     Christophe's table (WWW):
       Ville        Pays
       Barcelone    Espagne
       Paris        France
       Amsterdam    Pays-Bas
     Marvel icons: mermer, DeviantArt
  5. What is the problem?
       Kennissen: Christophe, Peter, David
       Staad: Amsterdam, Barcelona, Parijs
       +
       Ville: Barcelone, Paris, Amsterdam
       Pays: Espagne, France, Pays-Bas
       = ?
     Data integration issue
       “Kennissen”, “Staad”, “Ville”, “Pays”?
       “Paris” = “Parijs”?
       “Amsterdam” = “Amsterdam”?
     A lot of work, and it must be redone on every update
  6. A solution
     Do data integration at the data level
       Use, and re-use, unambiguous identifiers
       Use meta-level descriptions of the identifiers
     Proposal: use the Web as a platform
       Identifiers = URIs
       Descriptions = de-referenced documents
  7. Frank publishes his data
       Kennissen    Staad
       Christophe   Amsterdam
       Peter        Barcelona
       David        Parijs
     Each edge of the graph is a “triple”:
       ex:Christophe  rdf:type    ex:Acquaintance
       ex:Peter       rdf:type    ex:Acquaintance
       ex:David       rdf:type    ex:Acquaintance
       ex:Christophe  ex:worksIn  dbpedia:Amsterdam
       ex:Peter       ex:worksIn  dbpedia:Barcelona
       ex:David       ex:worksIn  dbpedia:Paris
     Use of compact URIs
       dbpedia = http://dbpedia.org/resource/
       ex = http://example.org/
       rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#
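The compact URIs on this slide can be expanded mechanically. The following is a minimal Python sketch, assuming triples are held as plain tuples of strings; the names `PREFIXES`, `expand`, and `expand_triple` are illustrative and do not come from the talk.

```python
# Prefix table copied from the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Frank's table rewritten as (subject, predicate, object) triples.
FRANK_TRIPLES = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("ex:David", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "ex:worksIn", "dbpedia:Barcelona"),
    ("ex:David", "ex:worksIn", "dbpedia:Paris"),
]


def expand(curie):
    """Expand a compact URI such as 'ex:Peter' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


def expand_triple(triple):
    """Expand every term of a (subject, predicate, object) tuple."""
    return tuple(expand(term) for term in triple)


if __name__ == "__main__":
    # Print the triples in an N-Triples-like form.
    for triple in FRANK_TRIPLES:
        print(" ".join(f"<{uri}>" for uri in expand_triple(triple)) + " .")
```

Because "Amsterdam" becomes the single identifier `http://dbpedia.org/resource/Amsterdam`, anyone who re-uses that URI refers to the same resource without a separate matching step.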
  8. Christophe re-uses part of Frank's data to publish his data
       Ville        Pays
       Barcelone    Espagne
       Paris        France
       Amsterdam    Pays-Bas
     Resulting graph:
       ex:Christophe      rdf:type    ex:Acquaintance
       ex:Peter           rdf:type    ex:Acquaintance
       ex:David           rdf:type    ex:Acquaintance
       ex:Christophe      ex:worksIn  dbpedia:Amsterdam
       ex:Peter           ex:worksIn  dbpedia:Barcelona
       ex:David           ex:worksIn  dbpedia:Paris
       dbpedia:Amsterdam  ex:isIn     dbpedia:Netherlands
       dbpedia:Barcelona  ex:isIn     dbpedia:Spain
       dbpedia:Paris      ex:isIn     dbpedia:France
  9. Roi adds some more information
       ex:Acquaintance      rdf:label   “Conocido”@es
       ex:Christophe        rdf:type    ex:Acquaintance
       ex:Peter             rdf:type    ex:Acquaintance
       ex:David             rdf:type    ex:Acquaintance
       ex:Christophe        ex:worksIn  dbpedia:Amsterdam
       ex:Peter             ex:worksIn  dbpedia:Barcelona
       ex:David             ex:worksIn  dbpedia:Paris
       dbpedia:Amsterdam    ex:isIn     dbpedia:Netherlands
       dbpedia:Barcelona    ex:isIn     dbpedia:Spain
       dbpedia:Paris        ex:isIn     dbpedia:France
       dbpedia:Netherlands  ex:isIn     dbpedia:Europe
       dbpedia:Spain        ex:isIn     dbpedia:Europe
       dbpedia:France       ex:isIn     dbpedia:Europe
  10. dbpedia:Amsterdam
  11. Reasoning with Semantics
      Bonus!
          dbpedia:Amsterdam    ex:isIn   dbpedia:Netherlands
          dbpedia:Netherlands  ex:isIn   dbpedia:Europe
        + ex:isIn              rdf:type  owl:TransitiveProperty
        = dbpedia:Amsterdam    ex:isIn   dbpedia:Europe
      Example usage
        Materialize implicit information
        Check for consistency
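The "materialize implicit information" use case can be sketched as a small fixed-point computation. This is only an illustration in Python, assuming triples are plain tuples; the function name `materialize_transitive` is hypothetical, and a real reasoner covers far more than this one rule.

```python
def materialize_transitive(triples, predicate):
    """Return the input triples plus every triple implied by treating
    `predicate` as transitive (what owl:TransitiveProperty licenses)."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        edges = [(s, o) for s, p, o in closed if p == predicate]
        for s, middle in edges:
            for start, o in edges:
                if middle == start and (s, predicate, o) not in closed:
                    closed.add((s, predicate, o))
                    changed = True
    return closed


if __name__ == "__main__":
    facts = {
        ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
        ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
    }
    for triple in sorted(materialize_transitive(facts, "ex:isIn") - facts):
        print("inferred:", triple)
```

The loop is quadratic in the number of matching edges per pass, which is fine for a slide-sized example but not for a crawl of the whole cloud.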
  12. Rough estimate of size
      295 data sets, 31B facts in LOD Cloud
      Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  13. Lots of Data to analyze! :-)
      Image credit: http://www.flickr.com/photos/argonne/3323018571
  14. But analyzing what exactly?
      Table of facts published at different locations
      A distributed Knowledge Base
        Subject              Predicate   Object
        ex:Christophe        rdf:type    ex:Acquaintance
        ex:Christophe        ex:worksIn  dbpedia:Amsterdam
        ex:Peter             rdf:type    ex:Acquaintance
        ...                  ...         ...

        Subject              Predicate   Object
        dbpedia:Amsterdam    ex:isIn     dbpedia:Netherlands
        dbpedia:Netherlands  ex:isIn     dbpedia:Europe
        ...                  ...         ...

        Subject              Predicate   Object
        ex:Acquaintance      rdf:label   “Conocido”@es
        ...                  ...         ...
  15. Analysis workflow
      1. Gather a snapshot of triples
      2. Compute descriptive statistics
           Top resources (subject, predicate, object)
           Frequency of cross-link types (SP, SO, PO, ...)
           Connected components
           Path frequency
           …
      => Tricky enough, the data is really big!
      => We should be able to get more out of the data
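Two of the descriptive statistics in this workflow, top resources per position and connected components, can be sketched in a few lines of Python. The sample rows are the ones visible on slide 14; the function names are illustrative, and treating every triple as an undirected subject-object link is an assumption made for this sketch.

```python
from collections import Counter, defaultdict

# Sample rows taken from the tables on slide 14.
TRIPLES = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
    ("ex:Acquaintance", "rdf:label", '"Conocido"@es'),
]


def top_resources(triples, n=3):
    """Most frequent terms in the subject, predicate and object positions."""
    counts = {"subject": Counter(), "predicate": Counter(), "object": Counter()}
    for s, p, o in triples:
        counts["subject"][s] += 1
        counts["predicate"][p] += 1
        counts["object"][o] += 1
    return {position: c.most_common(n) for position, c in counts.items()}


def connected_components(triples):
    """Components of the undirected graph linking each subject to its object."""
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            component.add(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return components


if __name__ == "__main__":
    print(top_resources(TRIPLES, n=1))
    print(len(connected_components(TRIPLES)), "component(s)")
```

Both functions keep everything in memory, which is exactly what stops working at the scale the slide warns about.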
  16. Can we explain that?
      Suggestions
        Started the graph
        General knowledge
        Very well known
  17. or that?
      Suggestions
        All published by Bio2RDF
        Well aware of each other
        Overlapping domain
  18. Could we predict the impact of ...
        DBpedia being down for a while?
        SIOC renaming “User” into “UserAccount”?
        creating a dataset that turns out to be popular?
      Analysing a set of triples is not enough
  19. Are we overlooking something?
  20. It's not only about the resources
      Several entities related to the data
        Data publishers/consumers
        Resources
        Web servers
      Interactions between all of them
      (Diagram labels: ex:something, WWW)
  21. There are different scales
      Triples level versus Resource groups level
      Different data complexity at each scale
        ex:Acquaintance      rdf:label   “Conocido”@es
        ex:Christophe        rdf:type    ex:Acquaintance
        ex:Peter             rdf:type    ex:Acquaintance
        ex:David             rdf:type    ex:Acquaintance
        ex:Christophe        ex:worksIn  dbpedia:Amsterdam
        ex:Peter             ex:worksIn  dbpedia:Barcelona
        ex:David             ex:worksIn  dbpedia:Paris
        dbpedia:Amsterdam    ex:isIn     dbpedia:Netherlands
        dbpedia:Barcelona    ex:isIn     dbpedia:Spain
        dbpedia:Paris        ex:isIn     dbpedia:France
        dbpedia:Netherlands  ex:isIn     dbpedia:Europe
        dbpedia:Spain        ex:isIn     dbpedia:Europe
        dbpedia:France       ex:isIn     dbpedia:Europe
  22. It is not a static network
      Size and topology evolve over time
      Snapshots: 2007, 2008, 2010
  23. Linked Data is a Complex System
      Multiple scales of observation
      Emergence of properties
      The whole is more than the sum of the parts
      => Interactions/relations are important to understand the system behavior
      => We can benefit from a large body of research results in Complex Systems study
  24. Initial findings and future work
      Image credit: Ya3hs3/2531493704 on Flickr
  25. New analysis workflow
      1. Gather a snapshot of triples
      2. Gather information about other types of interactions
      3. Create specific networks related to the research questions at hand
      4. Run metrics, interpret results
  26. The LOD is not what we think it is
      LOD Cloud 2009/2010 vs BTC 2009 crawl
      Crawled sample differs from the community-based view
      LOD Cloud has lumpy structure
      Evolution of LOD Cloud centrality changes
      Increased density and connectivity
      Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011). Multi-scale Analysis of the Web Of Data: A Challenge to the Complex Systems Community. Advances in Complex Systems 14 (04).
  28. The tools we need don't exist
      We need to flatten the networks to study them
      Some specific aspects of the system
        Existence of implicit links
        Multi-relational and dynamic
        Distributed
        Hypergraph of relations
      Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011). Multi-scale Analysis of the Web Of Data: A Challenge to the Complex Systems Community. Advances in Complex Systems 14 (04).
  29. Influence content<->social networks
      Generate and bind two networks (diagram nodes: ex:a, ex:b, ex:c)
      Measure evolution of degree, betweenness, clustering over time
      Predict evolution
      Shenghui Wang, Paul Groth (2010). Measuring the dynamic bi-directional influence between content and social networks. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
  30. Result for conferences
      Shenghui Wang, Paul Groth (2010). Measuring the dynamic bi-directional influence between content and social networks. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
  31. Centrality to measure robustness
      Map the BTC2010 to two networks
        Semantic network based on namespaces
        Host network based on hostnames
      Measure robustness as the variance in betweenness centrality
      Find weak spots
      Optimize networks to increase robustness
      Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
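The measurement on this slide can be approximated with the Python standard library. The sketch below is not the implementation from the cited paper. It assumes that two hostnames are linked whenever a triple's subject and object live on different hosts, computes unnormalized betweenness with Brandes' algorithm on that unweighted, undirected graph, and uses the population variance; all function names are hypothetical.

```python
from collections import defaultdict, deque
from statistics import pvariance
from urllib.parse import urlparse


def host_network(triples):
    """Undirected graph of hostnames (assumed linking rule: subject host
    and object host of a triple are connected when they differ)."""
    graph = defaultdict(set)
    for s, _, o in triples:
        hs, ho = urlparse(s).hostname, urlparse(o).hostname
        if hs and ho and hs != ho:  # skips literals and same-host links
            graph[hs].add(ho)
            graph[ho].add(hs)
    return dict(graph)


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order, queue = [], deque([source])
        preds = {v: [] for v in graph}
        paths = dict.fromkeys(graph, 0)   # number of shortest paths
        dist = dict.fromkeys(graph, -1)
        paths[source], dist[source] = 1, 0
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):  # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: c / 2 for v, c in score.items()}


def betweenness_variance(graph):
    """Variance of betweenness over all hosts (graph must be non-empty)."""
    return pvariance(list(betweenness(graph).values()))


if __name__ == "__main__":
    star = [("http://hub.example/a", "http://example.org/p",
             f"http://leaf{i}.example/x") for i in range(3)]
    print(betweenness_variance(host_network(star)))
```

In a star-shaped host network every shortest path runs through the hub, so the variance is high and the hub is the obvious weak spot; in a ring all hosts carry the same load and the variance is zero.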
  32. Results on hostnames
      Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
  33. Results on namespaces
      Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
  34. Improving the network
      Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010). Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation. Proceedings of the 9th International Semantic Web Conference (ISWC2010).
  35. Conclusion
      Take home message
        Linked Data is not a simple knowledge base
        Network analysis tools give new insights on the data
        Results can be used to improve the network
      Future work
        Make resource-centric analysis rather than graph-centric analysis (big bottleneck now)
        Tackle the time aspect of the data
        Find more analyses to perform and what they tell us
