
The Web of Data: do we actually understand what we built?



Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.

Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information?  Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained, by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?

To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.

In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.

Published in: Science


  1. 1. Don’t ask “how”, Ask “why”! (with illustrations from the Web of Data) Frank van Harmelen Dept. of “Computer Science” Creative Commons License: allowed to share & remix, but must attribute & non-commercial The Web of Data: do we actually understand what we built? (pssst: our theory has fallen way behind our technology, we know a lot of “how” but we don’t know much “why”)
  2. 2. Some expectation management • Speculation • Questions • Hypotheses If we knew what we were talking about, it wouldn’t be called research
  3. 3. Health Warning: pretentious philosophical introduction coming up
  4. 4. Computer Science should be like a natural science: studying objects in the information universe, and the laws that govern them. And yes, I believe that the information universe exists and can be studied
  5. 5. Fortunately, I’m in good company "Computer science is no more about computers than astronomy is about telescopes” -- Edsger W. Dijkstra "we have to think of computation as a principle and computers (only) as the tool” -- Peter Denning "Professor Shih-Fu Chang will receive a doctorate for his many groundbreaking contributions to our understanding of the digital universe“ -- Arnold Smeulders
  6. 6. Methodological Manifesto Computer Science often: given desired properties, design an object which has those properties. In this talk: given a (very large & complex) object, explain its observed properties. Not: “solving a problem” But: “answering a question”
  7. 7. “The computer is not our object of study, It’s our observational instrument”
  8. 8. Our object of study & What to measure
  9. 9. Semantic Web in 4 principles 1. Give all things a name 2. Make a graph of relations between the things at this point we have (only) a Giant Graph 3. Make sure all names are URIs at this point we have (only) a Giant Global Graph 4. Add semantics (= predictable inference) This gives us a Giant Global Knowledge Graph
  10. 10. P3. Make sure all names are URIs x T [<x> IsOfType <T>] different owners & locations < analgesic >
  11. 11. P4: Add semantics Frank Lynda married-to • Frank is male • married-to relates males to females • married-to relates 1 male to 1 female • Lynda = Hazel lowerbound upperbound Hazel
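A minimal sketch of what "predictable inference" could look like for the married-to example on this slide. The triple encoding, the `type` and `sameAs` labels, and the `infer` function are illustrative assumptions, not the talk's implementation; the two rules mirror the slide's statements that married-to relates males to females, and one male to one female.

```python
def infer(triples):
    """Apply two hand-written rules to (subject, predicate, object) triples.

    Assumed rules (paraphrasing the slide):
      1. married-to relates males to females (domain/range typing)
      2. married-to relates 1 male to 1 female (two wives of one man are equal)
    """
    facts = set(triples)
    spouse = {}  # subject -> first married-to object seen
    for s, p, o in sorted(triples):
        if p != "married-to":
            continue
        # Rule 1: typing that every consumer of the data must derive.
        facts.add((s, "type", "Male"))
        facts.add((o, "type", "Female"))
        # Rule 2: a second, different object must denote the same individual.
        if s in spouse and spouse[s] != o:
            facts.add((spouse[s], "sameAs", o))
        else:
            spouse.setdefault(s, o)
    return facts


data = [
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
]
closure = infer(data)
```

Under these assumptions the closure contains that Lynda is female and that Hazel and Lynda are the same individual, which is the kind of conclusion any two parties sharing the semantics would both reach.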
  12. 12. Did we get anywhere? • Google = meaningful search • NXP = data integration • BBC = content re-use • BestBuy = SEO (RDF-a) • = data-publishing Oracle DB, IBM DB2 Reuters, New York Times, Guardian Sears, Kmart, OverStock, Volkswagen, Renault GoodRelations ontology, Yahoo, Bing
  13. 13. 1 triple How big is the Semantic Web?
  14. 14. 10^7 triples: Suez Canal (Denny Vrandečić – AIFB, Universität Karlsruhe (TH))
  15. 15. Subsecond querying. 10^8 triples: Moon
  16. 16. ~10^9 triples: Earth
  17. 17. Size of the current Semantic Web: ~10^10 triples: Jupiter ≈ 1 triple per web page
  18. 18. Observing at different scales
  19. 19. Observing at different scales
  20. 20. Distances weighted by number of links
  21. 21. What is this picture telling us? • single connected component • dense clusters with sparse interconnections • connectivity depends on a few nodes • the degree distribution is highly skewed • its structure varies between aggregation levels
  22. 22. What is this picture telling us? • Does the meaning of a node depend on the cluster it appears in? • Does path-length correlate with semantic distance? • Are highly connected nodes more certain? • Mutual influence of low-level and high-level structure? Logic?
  23. 23. Measuring what? • degree distribution: P(d(v)=n) or P(d(v)>n) • degree centrality: relative size of neighbourhood; intuitive notion of local connectivity • betweenness centrality: fraction of all shortest paths that pass through a node; how essential the node is for global connectivity; likelihood of being visited on a graph walk • closeness centrality: 1/average distance to all other nodes; where to start for a graph walk • average shortest path length: helps to tune the upper bound on graph walks • number of (strongly) connected components: measure of coherence
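A rough sketch of how several of the measures on this slide could be computed on a small graph, using only the Python standard library. It assumes an undirected, unweighted view of the graph (so components are weakly connected ones) and omits betweenness centrality to stay short; the function names and the toy edge list are illustrative, not taken from the talk.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_measures(edges):
    """Compute a few structural measures for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    degree = {v: len(nb) for v, nb in adj.items()}

    # Degree distribution P(d(v) = k).
    degree_distribution = {k: c / n for k, c in Counter(degree.values()).items()}
    # Degree centrality: neighbourhood size relative to all other nodes.
    degree_centrality = {v: d / max(n - 1, 1) for v, d in degree.items()}

    closeness, seen = {}, set()
    components, total, pairs = 0, 0, 0
    for v in adj:
        dist = bfs_distances(adj, v)
        if v not in seen:  # first node visited in a new component
            components += 1
            seen.update(dist)
        others = [d for node, d in dist.items() if node != v]
        # Closeness: 1 / average distance to the reachable nodes.
        closeness[v] = len(others) / sum(others) if others else 0.0
        total += sum(others)
        pairs += len(others)

    return {
        "degree_distribution": degree_distribution,
        "degree_centrality": degree_centrality,
        "closeness": closeness,
        "avg_shortest_path": total / pairs if pairs else 0.0,
        "components": components,
    }


measures = graph_measures([("a", "b"), ("b", "c"), ("d", "e")])
```

Running one breadth-first search per node is quadratic in the worst case, so on graphs of Web-of-Data size these measures would have to be sampled or approximated; the sketch only fixes the definitions.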
  24. 24. Measuring when? 2009 vs. 2014: real phenomenon or measurement artefact?
  25. 25. Some first measurements & their difficulties Christophe Gueret (European Conference on Complex Systems 2011,
  26. 26. OK, let’s measure • Billion Triple Challenge 2009 • WoD 2009 • WoD 2010 • BTC aggregated • SameAs aggregated Non trivial decisions
  27. 27. OK, let’s measure. Degree distribution: BTC, BTC aggregated. This suggests a power-law distribution at different scales
  28. 28. OK, let’s measure • Comparing WoD 2009 & 2010: increasing powerlaw behaviour. • top 5 by degree centrality in sameAs-aggregated Preferential attachment? Dataset SameAs Degree centrality 0.039 0.037 0.027 0.019 0.017 This guy owns 4 out of these 5! Interesting socio-technical questions
  29. 29. But what should we measure? • Treat sameAs nodes as a single node? (semantically yes, pragmatically no?) • Is (undirected) connectedness meaningful, instead of (directed) strongly connected? (semantically no, pragmatically yes?) ???????
  30. 30. And what are “good” values? • Degree distribution should be powerlaw? (robust against random decay) • Local clustering coefficient should be high? (strongly connected “topics”) • Betweenness impact of a sameAs-link should be high? (adds much extra information) ???????
  31. 31. And here’s another one: usage of DBPedia types (Gangemi et al, ISWC2011)
  32. 32. impact on mapping? impact on reasoning? impact on storage? So what? These observations have impact on design!
  33. 33. LODLaundromat: a new observatory for the Web of Data Wouter Beek Laurens Rietveld (ISWC 2014)
  34. 34. LOD Laundromat: clean your dirty triples • crawl – from registries (CKAN), – by chasing URLs, – users can submit URLs, – users can submit files (DropBox plugin) • read multiple formats • clean syntax errors, remove duplicates • compute meta-data information • publish triples as JSON API & (meta-data) as SPARQL • harvest 1B triples/day
  35. 35. LOD Laundromat: • 600.000 RDF files • 3,345,904,218 unique URLs • 5,319,790,836 literals (not counting 6,699,148,542 integers, dates, etc) • 328Gb of zip’ed RDF
  36. 36.
  37. 37. LOTUS: Text search on LODLaundromat • Filip Ilievski (ISWC 2016) • Search 5 billion(!) text strings in Linked Open Data (0.5Tb) • From words to linked data • Fuzzy matching (or precise, or substring, or …)
  38. 38. Graph structure as a proxy for semantics Laurens Rietveld (ISWC 2014)
  39. 39. Hotspots in Knowledge Graphs • Observation: realistic queries only hit a small part of the data (< 2%) (DBPedia would need 500k queries to hit < 1%) • Non-trivial to obtain these numbers (YASGUI dataset, SWJ2015)
      Dataset            Size    #queries   Coverage
      DBPedia 3.9        459M    1640       0.003%
      Linked Geo Data    289M    81         1.917%
      MetaLex            204M    4933       0.016%
      Open-BioMed        79M     931        3.100%
      Bio2RDF/KEGG       50M     1297       2.013%
      SW Dog Food        240K    193        39.438%
  40. 40. Experiment • Can we predict the popular part of the graph without knowing the queries? • Use graph-measures as selection thresholds – indegree (easy) – outdegree (easy) – pagerank (doable, iterative) – betweenness centrality (hard) Evaluate Queries
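One way to picture the experiment on this slide, as a hedged sketch: rank triples by a structural measure and keep only the top fraction, then check how many of the triples a query workload needs survive. Scoring a triple by the in-degree of its object, the `fraction` parameter, and the `recall` helper are assumptions made for illustration; the talk does not specify how the thresholds were applied.

```python
from collections import Counter


def sample_by_indegree(triples, fraction):
    """Keep the top `fraction` of triples, ranked by in-degree of the object.

    Assumption: a triple is as 'popular' as the node it points to.
    """
    indegree = Counter(o for _, _, o in triples)
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(triples, key=lambda t: indegree[t[2]], reverse=True)
    keep = max(1, round(len(triples) * fraction))
    return ranked[:keep]


def recall(sample, needed):
    """Fraction of the triples needed by the queries that the sample retains."""
    kept = set(sample)
    return sum(t in kept for t in needed) / len(needed)


triples = [
    ("a", "knows", "hub"),
    ("b", "knows", "hub"),
    ("c", "knows", "hub"),
    ("hub", "type", "Person"),
    ("d", "knows", "e"),
]
sample = sample_by_indegree(triples, 0.6)
score = recall(sample, [("a", "knows", "hub"), ("d", "knows", "e")])
```

The same skeleton works for out-degree or PageRank by swapping the scoring function; only the cost of computing the score changes, which is the easy/doable/hard ordering on the slide.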
  41. 41. Structural sampling: results Why does this work so unreasonably well? Which methods work on which types of graphs? Logic?
  42. 42. It’s not only about the graph structure: Exploiting the choice of URLs to deal with inconsistency Zhisheng Huang (ISWC 2008)
  43. 43. General Idea: a sequence of growing selections s(T,φ,0), s(T,φ,1), s(T,φ,2), … Definition: φ is soft-implied by T if it is implied by a consistent subset of T
  44. 44. Which selection function s(T,φ,n)? Google distance:
      $$\mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
      where f(x) is the number of Google hits for x, f(x,y) is the number of Google hits for the tuple of search items x and y, and M is the number of web pages indexed by Google
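A small sketch of the Google distance on this slide as a function, so the behaviour of the formula is easier to see. The hit counts in the example are made-up numbers, not real search results, and the handling of zero counts (returning infinity) is an assumption of this sketch.

```python
import math


def ngd(fx, fy, fxy, m):
    """Normalised Google Distance.

    NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y))
                / (log M - min{log f(x), log f(y)})

    fx, fy : hit counts for x and for y
    fxy    : hit count for x and y together
    m      : number of indexed pages (assumed larger than fx and fy)
    """
    if min(fx, fy, fxy) <= 0:
        # Assumption: terms that never (co-)occur are maximally distant.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(m) - min(lx, ly))


# Hypothetical counts: two terms that always occur together vs. rarely together.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(100, 100, 1, 10**4)
```

Terms that always co-occur get distance 0, and the distance grows as the joint count shrinks relative to the individual counts, which is what makes it usable as a relevance ordering for a selection function.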
  45. 45. Compute Google distance between URI’s for numbers and colors (note: we’re abusing URI’s as words!)
  46. 46. Evaluation: ask queries over inconsistent datasets. Conclusion: “Graph-growing” using Google Distance gives a high-quality sound approximation
      Ontology          #queries   Unexpected   Intended
      MadCow+           2594       0            93%
      Communication     6576       0            96%
      Transportation    6258       0            99%
      Why does this work so unreasonably well?
  47. 47. Google distance This isn’t supposed to work!
  48. 48. URIs are supposed to be meaningless..
  49. 49. Information content of URI’s? Steven de Rooij (ISWC 2016) Unexplained performance prompts more experiments
  50. 50. Do URL’s encode meaning? Fraction of datasets with redundancy for types/predicates at significance level > 0.99 BTW, this is 600.000 datapoints (RDF docs) Properties Types We need a semantics that accounts for this!
  51. 51. Inference as a measure for information content Nobody can predict these numbers
  52. 52. Exploiting the graph structure for inference Kathrin Dentler (SSWS2009)
  53. 53. Inference by walking the graph • Swarm of micro-reasoners • One rule per micro-reasoner • Walk the graph, applying rules when possible • Deduced facts disappear after some time Every author of a paper is a person Every person is also an agent
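The idea on this slide can be sketched as a toy simulation: each micro-reasoner carries one rule, walks the graph at random, applies its rule to the facts at its current node, and derived facts expire after a while. Everything concrete here (the predicate names `authorOf` and `type`, the `ttl` parameter, the random-walk policy, the function names) is an assumption for illustration and not the system presented in the talk.

```python
import random


def author_is_person(triple):
    """Rule: every author of a paper is a person."""
    s, p, _ = triple
    return (s, "type", "Person") if p == "authorOf" else None


def person_is_agent(triple):
    """Rule: every person is also an agent."""
    s, p, o = triple
    return (s, "type", "Agent") if (p, o) == ("type", "Person") else None


def swarm_reason(triples, rules, steps, ttl, seed=0):
    """Let one micro-reasoner per rule walk the graph for `steps` steps.

    Derived facts are forgotten `ttl` steps after they were last derived.
    Returns the derived facts still alive at the end.
    """
    rng = random.Random(seed)
    base = set(triples)
    nodes = sorted({s for s, _, _ in base} | {o for _, _, o in base})
    positions = [rng.choice(nodes) for _ in rules]
    derived = {}  # fact -> step at which it was last derived
    for step in range(steps):
        # Deduced facts disappear after some time.
        derived = {f: t for f, t in derived.items() if step - t < ttl}
        for i, rule in enumerate(rules):
            here = positions[i]
            local = [f for f in base | set(derived) if f[0] == here]
            for fact in local:
                new = rule(fact)
                if new is not None and new not in base:
                    derived[new] = step
            # Follow a random outgoing edge, or jump if there is none.
            out = sorted(o for s, _, o in base if s == here)
            positions[i] = rng.choice(out) if out else rng.choice(nodes)
    return set(derived)


graph = [
    ("frank", "authorOf", "paper1"),
    ("lynda", "authorOf", "paper1"),
    ("paper1", "cites", "paper2"),
]
alive = swarm_reason(graph, [author_is_person, person_is_agent], steps=50, ttl=10)
```

Which facts are alive at the end depends on the walk, so the result is sound (only facts the rules justify) but neither deterministic across seeds nor guaranteed complete.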
  54. 54. Some early results • most of the derivations are produced • Lost: determinism, completeness • Gained: anytime, coherent, prioritised For which graphs does this work well or not?
  55. 55. Closing: A call to all Semantic Web researchers
  56. 56. A gazillion new open questions don’t just try to build things, also try to understand things don’t just ask how, also ask why