Don’t ask “how”,
Ask “why”!
(with illustrations from the Web of Data)
Frank van Harmelen
Dept. of “Computer Science”
Creative Commons License:
allowed to share & remix,
but must attribute & non-commercial
The Web of Data:
do we actually understand
what we built?
(pssst: our theory has fallen way behind our technology,
we know a lot of “how”
but we don’t know much “why”)
Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we
were talking about, it
wouldn’t be called
research
Health Warning:
pretentious
philosophical
introduction
coming up
Computer Science should be like a natural science:
studying objects in the information universe,
and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
Fortunately, I’m in good company
"Computer science is no more about computers
than astronomy is about telescopes”
-- Edsger W. Dijkstra
"we have to think of computation as a principle
and computers (only) as the tool”
-- Peter Denning
"Professor Shih-Fu Chang will receive a doctorate
for his many groundbreaking contributions to our
understanding of the digital universe“
-- Arnold Smeulders
Methodological Manifesto
Computer Science often:
given desired properties,
design an object which has those properties.
In this talk:
given a (very large & complex) object,
explain its observed properties.
Not: “solving a problem”
But: “answering a question”
“The computer is not our object of study,
It’s our observational instrument”
Our object
of study
&
What to
measure
Semantic Web in 4 principles
1. Give all things a name
2. Make a graph of relations between the things
at this point we have (only) a Giant Graph
3. Make sure all names are URIs
at this point we have (only) a Giant Global Graph
4. Add semantics (= predictable inference)
This gives us a Giant Global Knowledge Graph
http://www.youtube.com/watch?v=tBSdYi4EY3s
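The four principles can be illustrated with a minimal sketch. Everything below (the namespace, the triple contents, the tiny subClassOf rule) is invented for illustration; real Semantic Web stacks use RDF libraries and full RDFS/OWL reasoners.

```python
# P1 & P3: give all things a name, and make the names URIs
# (plain strings here, for illustration)
EX = "http://example.org/"  # hypothetical namespace

# P2: a graph = a set of (subject, predicate, object) triples
graph = {
    (EX + "aspirin", EX + "isOfType", EX + "analgesic"),
    (EX + "analgesic", EX + "subClassOf", EX + "drug"),
}

# P4: semantics = predictable inference; a tiny RDFS-style rule that
# propagates types upward along subClassOf until nothing new appears
def infer_types(graph):
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p1, c) in list(inferred):
            for (c2, p2, d) in list(inferred):
                if p1.endswith("isOfType") and p2.endswith("subClassOf") and c == c2:
                    t = (s, p1, d)
                    if t not in inferred:
                        inferred.add(t)
                        changed = True
    return inferred

closed = infer_types(graph)
print((EX + "aspirin", EX + "isOfType", EX + "drug") in closed)  # True
```

The point of P4 is exactly this predictability: anyone with the same graph and the same semantics derives the same extra triples.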
P3. Make sure all names are URIs
[<x> IsOfType <T>]
— the names <x> and <T> can have
different owners & locations
(e.g. < analgesic >)
P4: Add semantics
Frank ─married-to→ Lynda
Frank ─married-to→ Hazel
• Frank is male
• married-to relates
males to females
• married-to relates
1 male to 1 female
(lower bound = upper bound = 1)
⇒ Lynda = Hazel
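The "predictable inference" in this example is a functional-property argument: if married-to relates 1 male to at most 1 female, two married-to edges from the same person force an equality. A toy sketch (all names and the helper below are hypothetical, not any OWL reasoner's API):

```python
# facts: Frank is married-to both Lynda and Hazel
facts = {("Frank", "married-to", "Lynda"),
         ("Frank", "married-to", "Hazel")}

def functional_equalities(facts, prop):
    """If prop has upper bound 1, two objects for the same subject
    must denote the same individual; collect the forced equalities."""
    objects = {}
    equalities = set()
    for (s, p, o) in sorted(facts):
        if p == prop:
            if s in objects and objects[s] != o:
                equalities.add(frozenset({objects[s], o}))
            objects.setdefault(s, o)
    return equalities

print(functional_equalities(facts, "married-to"))
# the single forced equality: Lynda = Hazel
```

This is the style of reasoning OWL calls a functional property plus `owl:sameAs` derivation.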
Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDFa)
• data.gov = data-publishing
Also: Oracle DB, IBM DB2; Reuters, New York Times, Guardian; Sears, Kmart, OverStock, Volkswagen, Renault; GoodRelations ontology, schema.org; Yahoo, Bing
1 triple
How big is the Semantic Web?
10^7 triples — Suez Canal
Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS
sub-second querying
10^8 triples — Moon
~10^9 triples — Earth
Size of the current Semantic Web
~10^10 triples — Jupiter
≈ 1 triple per web-page
Observing at different scales
Observing at different scales
Distances weighted by
number of links
What is this picture telling us?
• single connected component
• Dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution
is highly skewed,
• its structure varies
between aggregation levels.
What is this picture telling us?
• Does the meaning of a node
depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of
low-level and high-level
structure?
Logic?
Measuring what?
• degree distribution, P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of neighbourhood,
intuitive notion of local connectivity
• betweenness centrality:
fraction of all shortest paths that pass through a node,
how essential the node is for global connectivity,
likelihood of being visited on a graph walk
• closeness centrality:
1 / average distance to all other nodes,
where to start a graph walk
• average shortest path length:
helps to tune the upper bound on graph walks
• number of (strongly) connected components:
measure of coherence
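These measures can all be computed from plain breadth-first search; a self-contained sketch on a toy graph (the graph itself is invented: two dense clusters joined by a bridge node):

```python
from collections import deque, Counter

# toy undirected graph: two triangles joined through bridge node 6
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 6), (6, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

n = len(adj)

# degree distribution P(d(v) = n)
degree_dist = {d: c / n for d, c in Counter(len(nb) for nb in adj.values()).items()}

def distances(src):
    """BFS distances from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

# closeness centrality: 1 / average distance to all other nodes
closeness = {v: (n - 1) / sum(distances(v).values()) for v in adj}

print(degree_dist)                        # most nodes have degree 2
print(max(closeness, key=closeness.get))  # the bridge node 6 is most central
```

On real Web of Data crawls one would use a graph library (e.g. networkx or a distributed equivalent) rather than hand-rolled BFS, but the definitions are the same.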
Measuring When?
2009 → 2014
Real phenomenon or
measurement artefact?
Some first
measurements
&
their difficulties
Christophe Guéret
(European Conference on Complex Systems 2011)
OK, let’s measure
• Billion Triple Challenge 2009
• WoD 2009
• WoD 2010
• BTC aggregated
• sameAs aggregated
Non-trivial decisions
OK, let’s measure
Degree distribution: BTC vs. BTC aggregated
This suggests a power-law distribution
at different scales
OK, let’s measure
• Comparing WoD 2009 & 2010:
increasingly power-law behaviour.
• top 5 by degree centrality in sameAs-aggregated
Preferential attachment?
Dataset               sameAs degree centrality
Revyu.com             0.039
Semanticweb.org       0.037
Dbpedia.org           0.027
Data.semanticweb.org  0.019
www.deri.ie           0.017
This guy owns 4 out of these 5!
Interesting socio-technical questions
But what should we measure?
• Treat sameAs nodes as a single node?
(semantically yes, pragmatically no?)
• Is (undirected) connectedness meaningful,
instead of (directed) strongly connected?
(semantically no, pragmatically yes?)
???????
And what are “good” values?
• Degree distribution should be power-law?
(robust against random decay)
• Local clustering coefficient should be high?
(strongly connected “topics”)
• Betweenness impact of a sameAs-link
should be high?
(adds much extra information)
???????
And here’s another one:
usage of DBPedia types
(Gangemi et al., ISWC 2011)
impact on
mapping?
impact on
reasoning?
impact on
storage?
So what?
These observations have impact on design!
LODLaundromat:
a new observatory
for the Web of Data
Wouter Beek Laurens Rietveld
(ISWC 2014)
LOD Laundromat:
clean your dirty triples
• crawl
– from registries (CKAN),
– by chasing URLs,
– users can submit URLs
– users can submit files (Dropbox plugin)
• read multiple formats
• clean syntax errors, remove duplicates
• compute meta-data information
• publish triples as JSON API & (meta-data) as SPARQL
• harvest 1B triples/day
LOD Laundromat:
• 600,000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals
(not counting 6,699,148,542 integers, dates, etc.)
• 328 GB of zipped RDF
http://lodlaundromat.org
https://www.youtube.com/watch?v=nU2Yh8RXeow
LOTUS:
Text search on LODLaundromat
• Filip Ilievski (ISWC 2016)
• Search 5 billion(!) text strings in
Linked Open Data (0.5 TB)
• From words to linked data
• Fuzzy matching (or precise, or substring, or …)
• http://lotus.lodlaundromat.org
Graph structure
as a proxy
for semantics
Laurens Rietveld
(ISWC 2014)
Hotspots in Knowledge Graphs
• Observation:
realistic queries only hit a small part of the data (< 2%)
(DBPedia would need 500k queries to hit < 1%)
• Non-trivial to obtain these numbers
(YASGUI dataset, SWJ2015)
Dataset          Size  #queries  Coverage
DBPedia 3.9      459M  1640      0.003%
Linked Geo Data  289M  81        1.917%
MetaLex          204M  4933      0.016%
Open-BioMed      79M   931       3.100%
Bio2RDF/KEGG     50M   1297      2.013%
SW Dog Food      240K  193       39.438%
Experiment
• Can we predict the popular part of the graph
without knowing the queries?
• Use graph-measures as selection thresholds
– indegree (easy)
– outdegree (easy)
– pagerank (doable, iterative)
– betweenness centrality (hard)
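The structural-sampling idea can be sketched in a few lines: rank nodes by one of the easy measures (indegree here) and keep only the triples that point into the top fraction of the ranking. The toy triples and the `sample` helper below are invented for illustration; the real experiment ran over full datasets and real query logs.

```python
from collections import Counter

# toy triples; "Person" is the hub that realistic queries would hit
triples = [
    ("a", "type", "Person"), ("b", "type", "Person"),
    ("c", "type", "Person"), ("a", "knows", "b"),
    ("b", "knows", "c"), ("d", "about", "Person"),
]

# indegree: how many triples point at each node (easy to compute)
indegree = Counter(o for (_, _, o) in triples)

def sample(triples, top_k):
    """Keep only triples whose object is among the top_k nodes by indegree."""
    hot = {node for node, _ in indegree.most_common(top_k)}
    return [t for t in triples if t[2] in hot]

print(sample(triples, 1))  # the 4 triples pointing at the hub "Person"
```

Indegree and outdegree need one pass over the data; pagerank needs iteration; betweenness needs all-pairs shortest paths, which is why it is marked "hard" above.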
Evaluate
Queries
Structural sampling: results
Why does this
work so
unreasonably
well?
Which
methods
work on
which types
of graphs?
Logic?
It’s not only about the
graph structure:
Exploiting
the choice of URLs
to deal with inconsistency
Zhisheng Huang
(ISWC 2008)
General Idea
s(T,,0)s(T,,1)s(T,,2)
=def
 is soft-implied by T if it is implied by a consistent subset of T
T
 
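A toy sketch of soft implication: grow an ever-larger selection s(T,φ,0) ⊆ s(T,φ,1) ⊆ … and answer from the first consistent subset that implies the query. The consistency and implication checks below are deliberately trivial stand-ins (facts are subject/value pairs; "implies" is just membership), not a real DL reasoner:

```python
def consistent(facts):
    """A fact set is inconsistent if some subject has two different values."""
    seen = {}
    for s, v in facts:
        if seen.setdefault(s, v) != v:
            return False
    return True

def soft_implied(T, query, select):
    """query is soft-implied by T if some consistent selection implies it."""
    for n in range(len(T) + 1):
        s_n = select(T, query, n)   # s(T, query, 0) ⊆ s(T, query, 1) ⊆ ...
        if consistent(s_n) and query in s_n:
            return True
    return False

def select(T, query, n):
    """A selection function: take the n facts 'closest' to the query
    (here: the query itself first, then the rest in a fixed order)."""
    ordered = sorted(T, key=lambda f: f != query)
    return set(ordered[:n])

# an inconsistent knowledge base: the sky has two colours
T = {("sky", "blue"), ("sky", "green"), ("grass", "green")}
print(soft_implied(T, ("sky", "blue"), select))  # True
```

Everything hinges on the *order* in which the selection function pulls facts in, which is exactly where Google distance comes in next.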
Which selection function s(T,,n)?
Google distance

NGD(x,y) = (max{log f(x), log f(y)} − log f(x,y)) / (log M − min{log f(x), log f(y)})

where
f(x) is the number of Google hits for x
f(x,y) is the number of Google hits for
the tuple of search items x and y
M is the number of web pages indexed by Google
Compute Google distance between URIs for numbers and colors
(note: we're abusing URIs as words!)
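The NGD formula itself is easy to check on mocked hit counts (the counts below are invented; a real implementation would query a search API for f(x), f(y), f(x,y)):

```python
from math import log

def ngd(fx, fy, fxy, M):
    """Normalized Google distance from hit counts.

    fx, fy : hits for x and for y;  fxy : hits for x AND y together;
    M : number of pages in the index."""
    return ((max(log(fx), log(fy)) - log(fxy)) /
            (log(M) - min(log(fx), log(fy))))

# terms that always co-occur are "close"; rare co-occurrence is "far"
print(ngd(100, 100, 100, 10**6))  # 0.0: x and y always appear together
print(ngd(100, 100, 10, 10**6))   # 0.25: co-occurrence 10x rarer
```

Intuitively: the numerator measures how much rarer the pair is than its more frequent member, and the denominator normalises by how specific the rarer term already is.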
Evaluation:
ask queries over inconsistent datasets
Conclusion:
“Graph-growing” using Google Distance
gives a high-quality, sound approximation
Ontology #queries Unexpected Intended
MadCow+ 2594 0 93%
Communication 6576 0 96%
Transportation 6258 0 99%
Why does this
work so
unreasonably
well?
Google distance
This isn’t
supposed to
work!
URIs are supposed to be meaningless…
Information content
of URIs?
Steven de Rooij
(ISWC 2016)
Unexplained performance
prompts more experiments
ISWC 2016
Do URLs encode meaning?
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600,000 data points (RDF docs)
Properties
Types
We need a
semantics
that accounts
for this!
Inference as a measure
for information content
Nobody can
predict these
numbers
Exploiting
the
graph structure
for inference
Kathrin Dentler
(SSWS2009)
Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Example rules:
“Every author of a paper is a person”
“Every person is also an agent”
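A toy sketch of the swarm idea: each micro-reasoner carries exactly one rule and applies it wherever it lands on the graph, and deduced facts fade away unless they keep being re-derived. The rules, the TTL value, and the synchronous walk below are illustrative simplifications of the actual swarm:

```python
# base facts never expire; deduced facts do
graph = {("frank", "authorOf", "paperX")}

def author_rule(s, p, o):
    # "every author of a paper is a person"
    return (s, "type", "Person") if p == "authorOf" else None

def person_rule(s, p, o):
    # "every person is also an agent"
    return (s, "type", "Agent") if (p, o) == ("type", "Person") else None

TTL = 3            # deduced facts disappear after TTL steps
derived = {}       # deduced fact -> remaining lifetime

for step in range(5):
    # each micro-reasoner walks over the currently visible facts
    # and fires its single rule where it matches
    for rule in (author_rule, person_rule):
        for fact in list(graph | set(derived)):
            new = rule(*fact)
            if new is not None:
                derived[new] = TTL          # (re)deduced: reset lifetime
    # decay: deduced facts that are not re-derived fade away
    derived = {f: t - 1 for f, t in derived.items() if t > 1}

print(sorted(derived))  # both deduced facts are kept alive by re-derivation
```

This is where the anytime/coherent trade-off on the next slide comes from: facts that stop being derivable silently expire instead of requiring truth maintenance.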
Some early results
• most of the
derivations are
produced
• Lost:
determinism,
completeness
• Gained:
anytime,
coherent,
prioritised
For which
graphs does
this work well
or not?
Closing:
A call to all
Semantic Web
researchers
A gazillion new open questions
don’t just try to build things,
also try to understand things
don’t just ask how,
also ask why