Don’t ask “how”,
Ask “why”!
(with illustrations from the Web of Data)
Frank van Harmelen
Dept. of “Computer Science”
Creative Commons License:
allowed to share & remix,
but must attribute & non-commercial
The Web of Data:
do we actually understand
what we built?
(pssst: our theory has fallen way behind our technology,
we know a lot of “how”
but we don’t know much “why”)
Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we
were talking about, it
wouldn’t be called
research
Health Warning:
pretentious
philosophical
introduction
coming up
Computer Science should be like a natural science:
studying objects in the information universe,
and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
Fortunately, I’m in good company
"Computer science is no more about computers
than astronomy is about telescopes”
-- Edsger W. Dijkstra
"we have to think of computation as a principle
and computers (only) as the tool”
-- Peter Denning
"Professor Shih-Fu Chang will receive a doctorate
for his many groundbreaking contributions to our
understanding of the digital universe“
-- Arnold Smeulders
Methodological Manifesto
Computer Science often:
given desired properties,
design an object which has those properties.
In this talk:
given a (very large & complex) object,
explain its observed properties.
Not: “solving a problem”
But: “answering a question”
“The computer is not our object of study,
It’s our observational instrument”
Our object
of study
&
What to
measure
Semantic Web in 4 principles
1. Give all things a name
2. Make a graph of relations between the things
at this point we have (only) a Giant Graph
3. Make sure all names are URIs
at this point we have (only) a Giant Global Graph
4. Add semantics (= predictable inference)
This gives us a Giant Global Knowledge Graph
http://www.youtube.com/watch?v=tBSdYi4EY3s
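The four principles can be illustrated with a minimal sketch. Everything below (the namespace, the triple contents, the tiny subClassOf rule) is invented for illustration; real Semantic Web stacks use RDF libraries and full RDFS/OWL reasoners.

```python
# P1 & P3: give all things a name, and make the names URIs
# (plain strings here, for illustration)
EX = "http://example.org/"  # hypothetical namespace

# P2: a graph = a set of (subject, predicate, object) triples
graph = {
    (EX + "aspirin", EX + "isOfType", EX + "analgesic"),
    (EX + "analgesic", EX + "subClassOf", EX + "drug"),
}

# P4: semantics = predictable inference; a tiny RDFS-style rule that
# propagates types upward along subClassOf until nothing new appears
def infer_types(graph):
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p1, c) in list(inferred):
            for (c2, p2, d) in list(inferred):
                if p1.endswith("isOfType") and p2.endswith("subClassOf") and c == c2:
                    t = (s, p1, d)
                    if t not in inferred:
                        inferred.add(t)
                        changed = True
    return inferred

closed = infer_types(graph)
print((EX + "aspirin", EX + "isOfType", EX + "drug") in closed)  # True
```

The point of P4 is exactly this predictability: anyone with the same graph and the same semantics derives the same extra triples.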
P3. Make sure all names are URIs
[<x> IsOfType <T>]
— the names <x> and <T> can have
different owners & locations
(e.g. < analgesic >)
P4: Add semantics
Frank ─married-to→ Lynda
Frank ─married-to→ Hazel
• Frank is male
• married-to relates
males to females
• married-to relates
1 male to 1 female
(lower bound = upper bound = 1)
⇒ Lynda = Hazel
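The "predictable inference" in this example is a functional-property argument: if married-to relates 1 male to at most 1 female, two married-to edges from the same person force an equality. A toy sketch (all names and the helper below are hypothetical, not any OWL reasoner's API):

```python
# facts: Frank is married-to both Lynda and Hazel
facts = {("Frank", "married-to", "Lynda"),
         ("Frank", "married-to", "Hazel")}

def functional_equalities(facts, prop):
    """If prop has upper bound 1, two objects for the same subject
    must denote the same individual; collect the forced equalities."""
    objects = {}
    equalities = set()
    for (s, p, o) in sorted(facts):
        if p == prop:
            if s in objects and objects[s] != o:
                equalities.add(frozenset({objects[s], o}))
            objects.setdefault(s, o)
    return equalities

print(functional_equalities(facts, "married-to"))
# the single forced equality: Lynda = Hazel
```

This is the style of reasoning OWL calls a functional property plus `owl:sameAs` derivation.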
Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDFa)
• data.gov = data-publishing
Also: Oracle DB, IBM DB2; Reuters, New York Times, Guardian; Sears, Kmart, OverStock, Volkswagen, Renault; GoodRelations ontology, schema.org; Yahoo, Bing
1 triple
How big is the Semantic Web?
10^7 triples — Suez Canal
Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS
sub-second querying
10^8 triples — Moon
~10^9 triples — Earth
Size of the current Semantic Web
~10^10 triples — Jupiter
≈ 1 triple per web-page
Observing at different scales
Observing at different scales
Distances weighted by
number of links
What is this picture telling us?
• single connected component
• Dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution
is highly skewed,
• its structure varies
between aggregation levels.
What is this picture telling us?
• Does the meaning of a node
depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of
low-level and high-level
structure?
Logic?
Measuring what?
• degree distribution, P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of neighbourhood,
intuitive notion of local connectivity
• betweenness centrality:
fraction of all shortest paths that pass through a node,
how essential the node is for global connectivity,
likelihood of being visited on a graph walk
• closeness centrality:
1 / average distance to all other nodes,
where to start a graph walk
• average shortest path length:
helps to tune the upper bound on graph walks
• number of (strongly) connected components:
measure of coherence
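These measures can all be computed from plain breadth-first search; a self-contained sketch on a toy graph (the graph itself is invented: two dense clusters joined by a bridge node):

```python
from collections import deque, Counter

# toy undirected graph: two triangles joined through bridge node 6
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 6), (6, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

n = len(adj)

# degree distribution P(d(v) = n)
degree_dist = {d: c / n for d, c in Counter(len(nb) for nb in adj.values()).items()}

def distances(src):
    """BFS distances from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

# closeness centrality: 1 / average distance to all other nodes
closeness = {v: (n - 1) / sum(distances(v).values()) for v in adj}

print(degree_dist)                        # most nodes have degree 2
print(max(closeness, key=closeness.get))  # the bridge node 6 is most central
```

On real Web of Data crawls one would use a graph library (e.g. networkx or a distributed equivalent) rather than hand-rolled BFS, but the definitions are the same.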
Measuring When?
2009 → 2014
Real phenomenon or
measurement artefact?
Some first
measurements
&
their difficulties
Christophe Guéret
(European Conference on Complex Systems 2011)
OK, let’s measure
• Billion Triple Challenge 2009
• WoD 2009
• WoD 2010
• BTC aggregated
• sameAs aggregated
Non-trivial decisions
OK, let’s measure
Degree distribution: BTC vs. BTC aggregated
This suggests a power-law distribution
at different scales
OK, let’s measure
• Comparing WoD 2009 & 2010:
increasingly power-law behaviour.
• top 5 by degree centrality in sameAs-aggregated
Preferential attachment?
Dataset               sameAs degree centrality
Revyu.com             0.039
Semanticweb.org       0.037
Dbpedia.org           0.027
Data.semanticweb.org  0.019
www.deri.ie           0.017
This guy owns 4 out of these 5!
Interesting socio-technical questions
But what should we measure?
• Treat sameAs nodes as a single node?
(semantically yes, pragmatically no?)
• Is (undirected) connectedness meaningful,
instead of (directed) strongly connected?
(semantically no, pragmatically yes?)
???????
And what are “good” values?
• Degree distribution should be power-law?
(robust against random decay)
• Local clustering coefficient should be high?
(strongly connected “topics”)
• Betweenness impact of a sameAs-link
should be high?
(adds much extra information)
???????
And here’s another one:
usage of DBPedia types
(Gangemi et al., ISWC 2011)
impact on
mapping?
impact on
reasoning?
impact on
storage?
So what?
These observations have impact on design!
LODLaundromat:
a new observatory
for the Web of Data
Wouter Beek Laurens Rietveld
(ISWC 2014)
LOD Laundromat:
clean your dirty triples
• crawl
– from registries (CKAN),
– by chasing URLs,
– users can submit URLs
– users can submit files (Dropbox plugin)
• read multiple formats
• clean syntax errors, remove duplicates
• compute meta-data information
• publish triples as JSON API & (meta-data) as SPARQL
• harvest 1B triples/day
LOD Laundromat:
• 600,000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals
(not counting 6,699,148,542 integers, dates, etc.)
• 328 GB of zipped RDF
http://lodlaundromat.org
https://www.youtube.com/watch?v=nU2Yh8RXeow
LOTUS:
Text search on LODLaundromat
• Filip Ilievski (ISWC 2016)
• Search 5 billion(!) text strings in
Linked Open Data (0.5 TB)
• From words to linked data
• Fuzzy matching (or precise, or substring, or …)
• http://lotus.lodlaundromat.org
Graph structure
as a proxy
for semantics
Laurens Rietveld
(ISWC 2014)
Hotspots in Knowledge Graphs
• Observation:
realistic queries only hit a small part of the data (< 2%)
(DBPedia would need 500k queries to hit < 1%)
• Non-trivial to obtain these numbers
(YASGUI dataset, SWJ2015)
Dataset          Size  #queries  Coverage
DBPedia 3.9      459M  1640      0.003%
Linked Geo Data  289M  81        1.917%
MetaLex          204M  4933      0.016%
Open-BioMed      79M   931       3.100%
Bio2RDF/KEGG     50M   1297      2.013%
SW Dog Food      240K  193       39.438%
Experiment
• Can we predict the popular part of the graph
without knowing the queries?
• Use graph-measures as selection thresholds
– indegree (easy)
– outdegree (easy)
– pagerank (doable, iterative)
– betweenness centrality (hard)
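The structural-sampling idea can be sketched in a few lines: rank nodes by one of the easy measures (indegree here) and keep only the triples that point into the top fraction of the ranking. The toy triples and the `sample` helper below are invented for illustration; the real experiment ran over full datasets and real query logs.

```python
from collections import Counter

# toy triples; "Person" is the hub that realistic queries would hit
triples = [
    ("a", "type", "Person"), ("b", "type", "Person"),
    ("c", "type", "Person"), ("a", "knows", "b"),
    ("b", "knows", "c"), ("d", "about", "Person"),
]

# indegree: how many triples point at each node (easy to compute)
indegree = Counter(o for (_, _, o) in triples)

def sample(triples, top_k):
    """Keep only triples whose object is among the top_k nodes by indegree."""
    hot = {node for node, _ in indegree.most_common(top_k)}
    return [t for t in triples if t[2] in hot]

print(sample(triples, 1))  # the 4 triples pointing at the hub "Person"
```

Indegree and outdegree need one pass over the data; pagerank needs iteration; betweenness needs all-pairs shortest paths, which is why it is marked "hard" above.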
Evaluate
Queries
Structural sampling: results
Why does this
work so
unreasonably
well?
Which
methods
work on
which types
of graphs?
Logic?
It’s not only about the
graph structure:
Exploiting
the choice of URLs
to deal with inconsistency
Zhisheng Huang
(ISWC 2008)
General Idea
s(T,,0)s(T,,1)s(T,,2)
=def
 is soft-implied by T if it is implied by a consistent subset of T
T
 
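A toy sketch of soft implication: grow an ever-larger selection s(T,φ,0) ⊆ s(T,φ,1) ⊆ … and answer from the first consistent subset that implies the query. The consistency and implication checks below are deliberately trivial stand-ins (facts are subject/value pairs; "implies" is just membership), not a real DL reasoner:

```python
def consistent(facts):
    """A fact set is inconsistent if some subject has two different values."""
    seen = {}
    for s, v in facts:
        if seen.setdefault(s, v) != v:
            return False
    return True

def soft_implied(T, query, select):
    """query is soft-implied by T if some consistent selection implies it."""
    for n in range(len(T) + 1):
        s_n = select(T, query, n)   # s(T, query, 0) ⊆ s(T, query, 1) ⊆ ...
        if consistent(s_n) and query in s_n:
            return True
    return False

def select(T, query, n):
    """A selection function: take the n facts 'closest' to the query
    (here: the query itself first, then the rest in a fixed order)."""
    ordered = sorted(T, key=lambda f: f != query)
    return set(ordered[:n])

# an inconsistent knowledge base: the sky has two colours
T = {("sky", "blue"), ("sky", "green"), ("grass", "green")}
print(soft_implied(T, ("sky", "blue"), select))  # True
```

Everything hinges on the *order* in which the selection function pulls facts in, which is exactly where Google distance comes in next.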
Which selection function s(T,,n)?
Google distance

NGD(x,y) = (max{log f(x), log f(y)} − log f(x,y)) / (log M − min{log f(x), log f(y)})

where
f(x) is the number of Google hits for x
f(x,y) is the number of Google hits for
the tuple of search items x and y
M is the number of web pages indexed by Google
Compute Google distance between URIs for numbers and colors
(note: we're abusing URIs as words!)
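The NGD formula itself is easy to check on mocked hit counts (the counts below are invented; a real implementation would query a search API for f(x), f(y), f(x,y)):

```python
from math import log

def ngd(fx, fy, fxy, M):
    """Normalized Google distance from hit counts.

    fx, fy : hits for x and for y;  fxy : hits for x AND y together;
    M : number of pages in the index."""
    return ((max(log(fx), log(fy)) - log(fxy)) /
            (log(M) - min(log(fx), log(fy))))

# terms that always co-occur are "close"; rare co-occurrence is "far"
print(ngd(100, 100, 100, 10**6))  # 0.0: x and y always appear together
print(ngd(100, 100, 10, 10**6))   # 0.25: co-occurrence 10x rarer
```

Intuitively: the numerator measures how much rarer the pair is than its more frequent member, and the denominator normalises by how specific the rarer term already is.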
Evaluation:
ask queries over inconsistent datasets
Conclusion:
“Graph-growing” using Google Distance
gives a high-quality, sound approximation
Ontology #queries Unexpected Intended
MadCow+ 2594 0 93%
Communication 6576 0 96%
Transportation 6258 0 99%
Why does this
work so
unreasonably
well?
Google distance
This isn’t
supposed to
work!
URIs are supposed to be meaningless…
Information content
of URIs?
Steven de Rooij
(ISWC 2016)
Unexplained performance
prompts more experiments
ISWC 2016
Do URLs encode meaning?
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600,000 data points (RDF docs)
Properties
Types
We need a
semantics
that accounts
for this!
Inference as a measure
for information content
Nobody can
predict these
numbers
Exploiting
the
graph structure
for inference
Kathrin Dentler
(SSWS2009)
Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Example rules:
“Every author of a paper is a person”
“Every person is also an agent”
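A toy sketch of the swarm idea: each micro-reasoner carries exactly one rule and applies it wherever it lands on the graph, and deduced facts fade away unless they keep being re-derived. The rules, the TTL value, and the synchronous walk below are illustrative simplifications of the actual swarm:

```python
# base facts never expire; deduced facts do
graph = {("frank", "authorOf", "paperX")}

def author_rule(s, p, o):
    # "every author of a paper is a person"
    return (s, "type", "Person") if p == "authorOf" else None

def person_rule(s, p, o):
    # "every person is also an agent"
    return (s, "type", "Agent") if (p, o) == ("type", "Person") else None

TTL = 3            # deduced facts disappear after TTL steps
derived = {}       # deduced fact -> remaining lifetime

for step in range(5):
    # each micro-reasoner walks over the currently visible facts
    # and fires its single rule where it matches
    for rule in (author_rule, person_rule):
        for fact in list(graph | set(derived)):
            new = rule(*fact)
            if new is not None:
                derived[new] = TTL          # (re)deduced: reset lifetime
    # decay: deduced facts that are not re-derived fade away
    derived = {f: t - 1 for f, t in derived.items() if t > 1}

print(sorted(derived))  # both deduced facts are kept alive by re-derivation
```

This is where the anytime/coherent trade-off on the next slide comes from: facts that stop being derivable silently expire instead of requiring truth maintenance.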
Some early results
• most of the
derivations are
produced
• Lost:
determinism,
completeness
• Gained:
anytime,
coherent,
prioritised
For which
graphs does
this work well
or not?
Closing:
A call to all
Semantic Web
researchers
A gazillion new open questions
don’t just try to build things,
also try to understand things
don’t just ask how,
also ask why