The document discusses handling broken links in the Web of Data. It introduces DSNotify, a framework for detecting and correcting broken links. DSNotify uses a notification strategy in which data sources notify clients of changes that could cause links to break, and it detects moved resources by measuring the similarity of their representations. The core algorithm detects events such as deletions, updates, moves, and creations by comparing resource representations over time. DSNotify was evaluated on changes observed between snapshots of DBpedia.
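The move-detection idea described above can be illustrated as a similarity comparison between two snapshots. The following is a toy sketch, not DSNotify's actual heuristics; the property-set representation, Jaccard similarity, and the 0.7 threshold are assumptions for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity of two property-value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_events(old, new, threshold=0.7):
    """Compare two snapshots {uri: property-value set} and classify
    changes as creations, deletions, or moves (a deletion whose
    content closely matches a newly created resource)."""
    removed = {u: v for u, v in old.items() if u not in new}
    added = {u: v for u, v in new.items() if u not in old}
    events = []
    for gone_uri, gone_props in removed.items():
        best_uri, best_sim = None, 0.0
        for new_uri, new_props in added.items():
            sim = jaccard(gone_props, new_props)
            if sim > best_sim:
                best_uri, best_sim = new_uri, sim
        if best_uri is not None and best_sim >= threshold:
            # a deleted resource reappeared elsewhere: treat as a move
            events.append(("move", gone_uri, best_uri))
            del added[best_uri]
        else:
            events.append(("delete", gone_uri, None))
    for new_uri in added:
        events.append(("create", new_uri, None))
    return events
```

With `old = {"ex:Berlin": {"label=Berlin", "country=DE"}}` and `new = {"ex:Berlin_Germany": {"label=Berlin", "country=DE"}}`, the deletion and creation are collapsed into a single move event rather than reported as independent changes.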
The document summarizes the results of a pilot survey on open data and crowdsourcing among Swiss GLAM institutions. Key findings include:
- Around 60% of responding institutions make metadata and digital copies available online, but 40% do not. Exchange of metadata occurs for 61% of respondents.
- Over 50% see a need to improve metadata quality and interoperability.
- Only 7-1% make digital objects freely available online. Most allow access with restrictions.
- Crowdsourcing is seen as more risky than beneficial by over 90% of respondents.
Evolution Towards Web 3.0: The Semantic Web - Lee Feigenbaum
This was a lecture I presented at Professor Stuart Madnick's class, "Evolution Towards Web 3.0" at the MIT Sloan School of Management on April 21, 2011. Please follow along with the speaker notes which add significant commentary to the slides.
Callimachus is a framework for rapidly developing semantic web applications using linked data principles. It allows web developers to create data-driven applications using templates in hours. Callimachus uses a wiki interface to collaboratively edit and manage structured linked data, providing features like access control, change tracking, and visualization of data in charts and maps. The tool aims to address limitations of traditional content management systems and wikis in handling linked data for real-world use cases.
Update on the progress of two Linked Data projects: one from the US EPA and another from a Virginia-based regional healthcare company using anonymized EMR data and Linked Data for personalized healthcare.
The document discusses strategies for modeling and publishing open government data as linked data. It outlines a process that includes identifying data, modeling exemplar records, naming resources with URIs, describing resources with vocabularies, converting data to RDF, and publishing and maintaining the data. The key steps are to focus on modeling real-world objects without consideration for specific applications, take an iterative approach, and be forgiving of imperfect initial models. Content management systems and wiki systems are not optimal for structured linked data, so a linked data management system like Callimachus is recommended.
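The naming, describing, and converting steps outlined above can be sketched as a minimal pure-Python conversion of one exemplar record into N-Triples. The base URI and the vocabulary choices (rdfs:label, foaf:homepage) are illustrative assumptions, not the document's prescribed scheme:

```python
def to_ntriples(record, base="http://example.gov/id/agency/"):
    """Convert one flat record into N-Triples lines: mint an HTTP URI
    for the real-world thing, then describe it with well-known
    vocabulary terms (rdfs:label, foaf:homepage)."""
    RDFS = "http://www.w3.org/2000/01/rdf-schema#"
    FOAF = "http://xmlns.com/foaf/0.1/"
    subject = f"<{base}{record['id']}>"
    triples = [
        f'{subject} <{RDFS}label> "{record["name"]}" .',
        f'{subject} <{FOAF}homepage> <{record["homepage"]}> .',
    ]
    return "\n".join(triples)

record = {"id": "epa", "name": "Environmental Protection Agency",
          "homepage": "http://www.epa.gov/"}
print(to_ntriples(record))
```

Starting from one exemplar record like this, refining the model iteratively, matches the document's advice to be forgiving of imperfect initial models.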
The document discusses the Social Semantic Web and related technologies. It provides an overview of the growth of social networks and user-generated content online. It then discusses how semantic technologies can help connect isolated social communities and their data by adding machine-readable metadata. Key topics covered include the Semantic Web stack, linked data, ontologies for modeling social data like FOAF and SIOC, and applications like distributed identity and social recommendations.
Linked Data Approach for Integration of Human Health & Environmental Data - 3 Round Stones
Best practices and platforms for access and reuse of scientific data and models. We explore a Linked Data approach for data integration, modeling and interoperability.
Delivered by Bernadette Hyland at the EPA & Society of Toxicology scientific workshop titled "Building for Better Decisions: Multi-scale Integration of Human Health and Environmental Data."
Delivered 8-May-2012 at EPA Research Triangle Park, NC USA.
The document provides an overview of social semantics and the social semantic web. It discusses how social data on platforms like Facebook and Twitter can be represented semantically using ontologies and vocabularies. This includes representing people with FOAF, relationships with Schema.org, content with SIOC, and behavior with OUBO. Representing social data semantically allows it to be queried, linked across platforms, and analyzed with semantic web technologies. The social semantic web aims to overcome the siloed nature of social data and enable portability of social information.
The document summarizes an update from the W3C Government Linked Data Working Group. Started in June 2011, the working group is chartered to develop standards-track documents that help governments share data as high-quality linked data. It has 39 participants from 25 organizations and aims to transform how governments serve their citizens in the 21st century.
Open Data is defined as data that anyone is free to use, reuse, and redistribute subject only to attribution. The document discusses definitions of open data and why transparency and open government initiatives promote opening up public sector data. It provides guidance on how to publish open data by making it available online in open, structured formats and linking it to other open data. Open licenses like Creative Commons are recommended to ensure open data can be freely used and shared. Examples of public agencies that have adopted open data policies are provided.
Data.dcs: Converting Legacy Data into Linked Data - Matthew Rowe
This document discusses converting legacy data from the Department of Computer Science (DCS) at the University of Sheffield into linked data. It describes extracting data from websites and publications databases, converting it to RDF triples, resolving duplicate entities, and linking the data to external datasets like DBLP. The goal is to make DCS data about people, publications, and research groups machine-readable and queryable while integrating it into the larger web of linked open data.
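The duplicate-resolution step mentioned above is often approximated with string similarity over extracted entity names. A minimal sketch follows; it is not the Data.dcs pipeline's actual method, and the threshold is an assumption:

```python
from difflib import SequenceMatcher

def resolve_duplicates(names, threshold=0.9):
    """Group near-identical entity names (e.g. author names extracted
    from different sources) so each group can share a single URI."""
    groups = []
    for name in names:
        for group in groups:
            # compare against the group's first member, case-folded
            if SequenceMatcher(None, name.lower(), group[0].lower()).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

groups = resolve_duplicates(["Matthew Rowe", "matthew rowe", "Fabio Ciravegna"])
```

Here the two spellings of the same author collapse into one group, so the conversion step can mint one URI for that person instead of two.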
The document summarizes research in semantic search and its applications. It discusses the evolution of semantic search from early work on the semantic web to current applications using knowledge graphs. It outlines key challenges in semantic search like query understanding and how mobile search is driving new areas like conversational agents and task completion. The use of semantic representations and knowledge bases is helping to improve search quality and enable new interactive applications.
This document discusses opportunities and challenges of Linked Data. It begins with an overview of Linked Data principles like using URIs to identify things and linking related things. It then discusses enabling technologies like HTTP URIs and SPARQL queries. Opportunities mentioned include using the LOD cloud as a test bed and benefiting from linked context in applications. Challenges include large-scale processing of Linked Data and quality of links. The document concludes by emphasizing the potential of Linked Data to make data more valuable.
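As an example of the SPARQL queries mentioned among the enabling technologies, the following builds a query one might run against a Linked Data endpoint such as DBpedia's. The resource and property choices are illustrative:

```python
def label_query(resource_uri, lang="en"):
    """Build a SPARQL query fetching the rdfs:label of a resource
    in a given language."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {{
  <{resource_uri}> rdfs:label ?label .
  FILTER (lang(?label) = "{lang}")
}}"""

q = label_query("http://dbpedia.org/resource/Berlin")
print(q)
```

Because the subject is an HTTP URI, the same identifier works both for dereferencing (linked context in applications) and for querying, which is exactly the dual use the Linked Data principles enable.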
The document summarizes a presentation on Named Data Networking (NDN) given by Mostafa Rezazad. It discusses the motivation for NDN, which is to make data and services rather than locations the primary objects on the network. This allows for benefits like redundancy elimination, easier mobility, and more inherent security. An overview is provided of NDN's packet types, node structure, name structure, and routing approach.
Interlinking Online Communities and Enriching Social Software with the Semant... - John Breslin
This document summarizes a presentation about interlinking online communities using Semantic Web technologies. It discusses:
1. The SIOC (Semantically-Interlinked Online Communities) project which aims to semantically connect online discussion sites through a common data model.
2. How SIOC represents the structure and content of communities using RDF properties and classes. Communities can then exchange and query data using common semantics.
3. Tools that export community data into RDF using SIOC, including for WordPress, vBulletin, and phpBB. This allows interlinking users, content, and activities across sites.
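As a rough illustration of the SIOC data model described above, a single forum post can be represented as a handful of triples tying it to its container, creator, and creation date. The URIs below are hypothetical; the predicates (sioc:has_container, sioc:has_creator, dcterms:created) are real SIOC/Dublin Core terms:

```python
SIOC = "http://rdfs.org/sioc/ns#"
DCT = "http://purl.org/dc/terms/"

def describe_post(post_uri, forum_uri, author_uri, created):
    """Describe one community post with core SIOC and Dublin Core
    properties: the forum it belongs to, its author, and its date."""
    return [
        (post_uri, f"{SIOC}has_container", forum_uri),
        (post_uri, f"{SIOC}has_creator", author_uri),
        (post_uri, f"{DCT}created", created),
    ]

triples = describe_post(
    "http://forum.example.org/post/42",
    "http://forum.example.org/forum/linked-data",
    "http://forum.example.org/user/alice",
    "2011-04-21",
)
```

Because every exporter (WordPress, vBulletin, phpBB) emits the same predicates, posts and users from different sites can be merged and queried with shared semantics.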
Methods for Intrinsic Evaluation of Links in the Web of Data - Cristina Sarasua
The current Web of Data contains a large amount of interlinked data. However, there is still a limited understanding about the quality of the links connecting entities of different and distributed data sets. Our goal is to provide a collection of indicators that help assess existing interlinking. In this paper, we present a framework for the intrinsic evaluation of RDF links, based on core principles of Web data integration and foundations of Information Retrieval. We measure the extent to which links facilitate the discovery of an extended description of entities, and the discovery of other entities in other data sets. We also measure the use of different vocabularies. We analysed links extracted from a set of data sets from the Linked Data Crawl 2014 using these measures.
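Two of the indicator families mentioned, vocabulary usage and the discovery of entities in other data sets, can be approximated by simple counts over a set of RDF links. The paper's actual measures are more elaborate; this is an illustrative simplification:

```python
from urllib.parse import urlparse

def namespace(predicate):
    """Namespace of a predicate URI: everything up to the last '#' or '/'."""
    if "#" in predicate:
        return predicate.rsplit("#", 1)[0] + "#"
    return predicate.rsplit("/", 1)[0] + "/"

def link_indicators(links):
    """links: (subject, predicate, object) URI triples between data sets.
    Returns (number of distinct predicate vocabularies, fraction of
    links whose object lives on a different host than the subject)."""
    if not links:
        return 0, 0.0
    vocabularies = {namespace(p) for _, p, _ in links}
    cross = sum(urlparse(s).netloc != urlparse(o).netloc for s, _, o in links)
    return len(vocabularies), cross / len(links)
```

A data set whose outgoing links all stay on its own host scores 0.0 on the second indicator: its links add little to the discovery of entities elsewhere, which is the kind of signal such a framework surfaces.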
The document discusses using open data and linked data on the web. It begins by defining open government data and its benefits like transparency and participation. It then explains how the semantic web uses linked data to connect related data across the web. Examples are given of government and other datasets that are available as linked open data. The presentation concludes by proposing future interdisciplinary collaboration to further develop applications using open and linked data.
An overview of the Network Overview Discovery and Exploration add-in for Excel 2007 (NodeXL), a social network analysis add-in for the familiar spreadsheet application. Visualize Twitter, Flickr, Facebook, and email networks with just a few mouse clicks.
Making the Web Searchable - Keynote ICWE 2015 - Peter Mika
This document discusses making the web more searchable through semantic technologies. It begins with an overview of how web search currently works and its limitations, and then discusses how the semantic web aims to address these issues by adding explicit meaning and relationships between data on the web. It describes early skepticism of the semantic web from the information retrieval community and how it has become more practical over time. It also outlines research into semantic search done at Yahoo, including developing a knowledge graph and using semantic information to enhance search results. Finally, it discusses how semantic technologies are now being adopted more widely through efforts like schema.org.
Open Government Data, Linked Data, and the Missing Blocks in Korea - Haklae Kim
This presentation discusses open government data and linked data. It provides examples of how open data initiatives from different governments have increased transparency and civic participation. Linked data practices are presented as a way to interconnect disparate datasets using semantic web standards. While Korea has strong e-government infrastructure, the presentation argues more can be done to implement open data and linked data practices. Participatory approaches are advocated to help design open data policies and solutions.
Social Semantic Web (Social Activity and Facebook) - Myungjin Lee
The document discusses the concept of a Social Semantic Web (SSW). It describes how social networks like Facebook have begun to incorporate semantic data through initiatives like Open Graph that allow objects and actions to be defined and shared. This lays the foundation to map social graph data to semantic vocabularies and ontologies, thereby linking decentralized social data on the web. The integration of social interactions with semantic representations enables new, semantically-aware applications and services to leverage collective human contributions on the social web.
RDAP 15: You're in good company: Unifying campus research data services - ASIS&T
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23
Cynthia Hudson-Vitale, Digital Data Outreach Librarian, Washington University
Brianna Marshall, Digital Curation Coordinator, University of Wisconsin-Madison
Amy Nurnberger, Research Data Manager, Columbia University
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution - PhiloWeb
The document discusses approaches for semantically enriching folksonomies by structuring user-generated tags. It proposes a life-cycle approach involving initial automatic processing of tags to identify relationships, followed by user-centric structuring, conflict detection, and global structuring by a referent user. The goal is to turn flat folksonomies into structured folksonomies integrated with Semantic Web models while capturing diverse user perspectives.
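The initial automatic processing step, identifying candidate relationships between tags, is commonly approximated with co-occurrence statistics. A minimal sketch follows; it is not the paper's exact method, and the threshold is an assumption:

```python
from collections import Counter
from itertools import combinations

def candidate_relations(taggings, min_count=2):
    """taggings: a list of tag sets, one per tagged resource.
    Returns tag pairs that co-occur at least min_count times --
    candidates for users to refine into broader/narrower or
    related links in a structured folksonomy."""
    pairs = Counter()
    for tags in taggings:
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return [p for p, n in pairs.items() if n >= min_count]

relations = candidate_relations([
    {"semanticweb", "rdf"},
    {"semanticweb", "rdf", "sparql"},
    {"rdf", "sparql"},
])
```

The output is deliberately only a candidate list: in the life-cycle described above, the actual semantics of each pair is decided by users and a referent user, not by the statistics.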
The Comisión Federal de Telecomunicaciones warned Telmex that it may not provide television service indirectly through an alliance with MVS Multivisión and EchoStar-Dish without authorization from Cofetel or the Secretaría de Comunicaciones y Transportes. Cofetel stated that Telmex intends to provide television service indirectly to end users, which violates its concession. Telmex's concession expressly prohibits it from providing television services.
Critical literary analysis of a poem by David Auris Villegas - Miguel Ramos
This document presents a critical analysis of the poem "Voces del silencio" by the Peruvian poet David Auris Villegas. The analysis includes an introduction to the poem and its author, an examination of the poem's structure and components, and a conclusion. It describes the invasion and exploitation of Africa by American conquerors and denounces the abuses committed against the African population as depicted in the poem.
Austral Consulting is a legal and financial consultancy founded in 1998 that offers comprehensive services to SMEs and large companies in six areas: economics, legal affairs, consulting, auditing, entrepreneurship, and internationalization. It currently seeks to consolidate its market position and lay the groundwork for expansion. It has a team of highly qualified professionals.
Dr. Hugo Arcaro Neto, who practices tax, criminal, business, and environmental law, analyzes tax installment plans in an article through a series of important observations.
Kochi - Malayali Rich List, June 4, 2015 - Shenoy Karun
The document is a newspaper article that discusses the top 10 richest Malayalis (people of Kerala origin) based on their net worth. It provides brief profiles of each billionaire, including their core businesses and headquarters locations. Many made their fortunes in the Middle East in construction, retail, education and other industries. The top three are Ravi Pillai (net worth $2.8 billion), M.A. Yusuffali ($2.5 billion), and PNC Menon ($2 billion). The article notes the diversity of backgrounds and that most are self-made, coming from small trader families in Kerala.
The Appeal-Blocking Precedent (Súmula Impeditiva de Recursos) and the Guarantee of Access to Justice - Fabiano Desidério
The document discusses the súmula impeditiva de recursos (appeal-blocking precedent) as a mechanism to speed up judicial proceedings and guarantee access to justice. It allows judges to refuse appeals when the ruling is consistent with the settled positions of the STJ or STF. Although it speeds up proceedings, the institute is criticized for limiting the adversarial principle and the right to a second level of review. Speed must be balanced against the fundamental guarantees of litigants.
This document contains quotes and sayings about mothers and motherhood. It suggests that motherhood is among the highest-paid jobs because it is paid in love. It also notes that a mother understands her child in a way others cannot, and that mothers are instinctive philosophers. The document wishes everyone a great Mother's Day.
This document discusses the design and development of a disease surveillance system, covering:
1. The institutional structure of surveillance from the local to the national level for collecting and processing data in an integrated way.
2. The role of the central unit in providing technical and administrative support as well as policy recommendations.
3. Surveillance design steps, from determining the sample and collecting data through ...
This document outlines the composition and decision-making processes of the Project Steering Committee for the BIOEUPARKS project. It states that the Committee is composed of one representative from each party, with each representative having one vote. Decisions are made by simple majority, except for certain issues which require a 2/3 qualified majority. The document also describes the roles and responsibilities of Work Package leaders in managing their tasks and reporting progress. It outlines the obligations of each party, including prompt reporting and notification of delays. Finally, it addresses payments from the European Union, which will be transferred directly to each party except in cases of default, when payment may be withheld until issues are resolved.
1) Alberta invested $2B to support 3-5 commercial scale carbon capture and storage (CCS) projects by 2015 and inject 5Mt of CO2.
2) Key lessons included ensuring transparency in project selection, using third party expertise, and addressing any regulatory gaps to provide certainty for investors.
3) Contract negotiations for CCS projects are complex with many details to consider around risk sharing, amendments, knowledge sharing, and defining commercial operations.
Creative Commons licenses provide a way for creators to retain copyright while allowing others to share, reuse, and remix their work. The document introduces Creative Commons, explains the problems with strict copyright, and shows how Creative Commons licenses address them. Specifically, it notes that teachers often do not hold copyright to their teaching resources, and that an institutional Creative Commons policy applying an Attribution license can address this and realize the potential of openly sharing and collaborating on educational resources online.
This document provides a summary of an internship where the individual was tasked with restructuring the ePortfolio program at Westminster College. Through surveys and interviews, the intern found that students did not find the current ePortfolio program helpful. The intern proposes splitting the ePortfolio into two parts - an initial course requirement and a later career-focused portfolio. The intern also recommends allowing more flexibility in portfolio format and incorporating the career center. The goal is to create portfolios that better meet student and employer needs.
This document discusses criteria for evaluating websites and provides information about website domains and the information cycle in sciences. It outlines factors to consider when evaluating a website such as the authority, content, documentation, and currency. It also defines different types of website domains and provides examples. Additionally, it depicts the information cycle and differences between peer-reviewed and popular articles. It notes how to find peer-reviewed articles through a library's database or research databases like EBSCO.
Why online advertising is not a dirty word - Echelon 2014 | e27
Online advertising as a business model is typically shunned in Southeast Asia, mainly due to investors’ lack of faith in its potential as a serious revenue opportunity. Joe is here to share with you that the notion of startups not being able to monetise via online advertising might not hold true, especially looking at data from Southeast Asia’s online trends. Joe will compare similar mid-sized markets like the Nordics, Latin America, Brazil, India and Russia and do a deep dive into the online trends of Singapore, Malaysia, Indonesia, Thailand, the Philippines and Vietnam. Delegates can expect to get a better understanding of why online advertising is not a dirty word, where Southeast Asia stands on online advertising as a serious potential business model, and where the region is headed.
Stay up to date on Asia's tech scene:
Read the latest news: http://e27.co
Sign up for our Weekly Digest that curates the Top news in Asia: http://bit.ly/subscribe-to-e27
1) The document discusses the problem of broken links in the Web of Data (also known as the Linked Data cloud). As resources on the web change over time, links between them can become broken when the target resource is removed, moved, or changed.
2) It defines two types of broken links: structurally and semantically broken. A structurally broken link occurs when the representations of the target resource can no longer be retrieved. A semantically broken link occurs when the target resource has changed meaning.
3) The analysis of changes between two versions of DBpedia data showed many resources were moved, removed, or created, demonstrating the broken links problem. Redirect links in DBpedia help trace moved resources.
This document discusses standardizing data on the web. It notes that data exists in many formats, from informal to curated, and machine to human readable. W3C has focused on integrating data at web scale using standards like RDF, SPARQL, and Linked Data principles. However, converting all data to RDF has challenges. Much data exists as CSV, JSON, XML and does not need full integration. The reality is data on the web is messy with many formats. Developers see converting data as too complex. The document discusses providing tools to publish Linked Data easily, or focusing on raw data without RDF. It notes different approaches can coexist and discusses a workshop on open data formats.
Linked Data: Opportunities for Entrepreneurs | 3 Round Stones
Multidisciplinary engineer and entrepreneur David Wood discusses the reasons, approaches and success stories for structured data on the World Wide Web. Linked Data is placed in context with the rest of the Web and that context is used to suggest some areas ripe for entrepreneurial innovation.
Sustainable operability: Keeping complex linguistic resources alive | Menzo Windhouwer
The document discusses the challenge of ensuring sustainable operability for complex linguistic resources like typological databases. It proposes mapping database contents and metadata to a standardized, self-describing format called IDDF when archiving. Generic software can then provide access to archived resources. The Typological Database System integrates multiple databases and maps their proprietary data models to IDDF to allow long-term usability. The TDS Curator project aims to establish this approach for typological databases within the CLARIN infrastructure.
Linked data for Libraries, Archives, Museums | ljsmart
Linked data provides a method for publishing structured data on the semantic web so that it can be interlinked and made more useful. It builds upon standard web technologies like HTTP and URIs. The benefits of creating and using linked data include making data sharable, extensible, reusable, and improving discoverability. The process of creating linked data involves identifying data to expose, representing it in RDF/XML with URIs, and making that data available via HTTP URIs so others can discover and link to it.
Semantic Search: We're Living in a Golden Age for Information | 3 Round Stones
This talk outlines semantic search and shows how we're living in a Golden Age for Information. The focus is on how government agencies can most effectively leverage the architecture of the Web to improve the publication and consumption of high-value open government data sets.
The document discusses analyzing the Web of Data (WoD) as a complex network at multiple scales. At the graph scale, the WoD contains over 100 nodes (datasets) connected by 350 edges. At the triple scale, a network of over 600,000 nodes and 800,000 edges was analyzed. Network analysis found the WoD exhibits properties like short average path lengths, power law degree distributions, and a few highly central nodes like DBpedia. Ongoing challenges include implicit links, multi-relations, and dynamics as data is continuously added.
This document provides a summary of a talk given by Tope Omitola on using linked data for world sense-making. The talk discussed EnAKTing, a project focused on building ontologies from large-scale user participation and querying linked data. It also covered publishing and consuming public sector datasets as linked data, including challenges around data integration, normalization and alignment. The talk concluded with a discussion of linked data services and applications developed by the project to enhance findability, search, and visualization of linked data.
Apache Kafka and the Data Mesh | Michael Noll, Confluent | HostedbyConfluent
Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold. A kind of “microservices” for the data-centric world. While the data mesh is not technology-specific as a pattern, the building of systems that adopt and implement data mesh principles have a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover topics such as the differences between working with event streams versus centralized approaches and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, modelling the communication within the mesh, how to deal with changes to your domain’s “public” data, give examples of global standards for governance, and discuss the importance of taking a product-centric view on data sources and the data sets they share.
This document discusses the Solid ecosystem and how it aims to decentralize data storage and control. It explains how Solid uses Linked Data to connect decentralized data across different sources while allowing individuals and applications to store and control access to data. The document argues that current systems overly centralize data in aggregators and that the Solid ecosystem seeks to create network flows both to and from aggregators to make individual data producers the source of truth for their data.
The document discusses linking building data from multiple systems using linked data principles. Building information is currently scattered across different systems for energy usage, maintenance, finance, occupancy, and more. However, effectively managing a building requires a holistic view across all this data. Linked data provides a method to expose, share, and connect building data from different systems and technologies by identifying objects with URIs and linking information with relationships. This can provide new insights by linking domains like energy and resource utilization. The challenges of data interoperability, information granularity, interpreting data, and empowering actions are also discussed. A case study of applying these approaches to link operational and sensor data from a research building is also presented.
Linked Data for the Masses: The approach and the Software | IMC Technologies
Title: Linked Data for the Masses: The approach and the Software
@ EELLAK (GFOSS) Conference 2010
Athens, Greece
15/05/2010
Creator: George Anadiotis (R&D Director)
Linked Data Generation for the University Data From Legacy Database | dannyijwest
The Web was developed to share information among users over the Internet as hyperlinked documents. If someone wants to collect data from the web, he has to search and crawl through the documents to fulfil his needs. The concept of Linked Data creates a breakthrough at this stage by enabling links within data. So, besides the web of connected documents, a new web has developed for both humans and machines, i.e., the web of connected data, simply known as the Linked Data Web. Since it is a very new domain, still very few works have been done, especially on the publication of legacy data within a University domain as Linked Data.
1) The document discusses Linked Data and semantic application development. It provides examples of semantic technologies like the Google Knowledge Graph, Freebase, and DBpedia.
2) It explains key concepts of Linked Data including URIs, HTTP URIs, RDF, SPARQL, ontologies, and the semantic web stack. The four Linked Data principles are also summarized.
3) Architectures for Linked Data applications are covered including crawling, on-the-fly dereferencing, and federated query patterns. Components like wrappers, mediators, caches and triple stores are also discussed.
Lecture Notes by Mustafa Jarrar at Birzeit University, Palestine.
See the course webpage at: http://jarrar-courses.blogspot.com/2014/01/introduction-to-data-integration.html
and http://www.jarrar.info
you may also watch this lecture at: http://www.youtube.com/watch?v=TEgHq2J1OMo
The lecture covers:
- Web of Data
- Classical Web
- Web APIs and Mashups
- Beyond Web APIs and Mashups: The Data Web and Linked Data
- How to create linked-data?
- Properties of the Web of Linked Data
Environmental Linked Data - Semtech Biz London | Alex Coley
This SPARQL update query reconstructs links between sample assessment records after publishing or removing data. It first deletes existing "replaces" and "isReplacedBy" links. It then inserts new links, finding an update record and predecessor based on replacement/withdrawal status and date. It ensures there is no sample between the update and predecessor.
1. DSNotify: Handling Broken Links in the Web of Data
Niko Popitsch, University of Vienna / Austria
niko.popitsch@univie.ac.at
Joint work with Bernhard Haslhofer
bernhard.haslhofer@univie.ac.at
April 30, 2010
WWW 2010 Conference, Raleigh, North Carolina, USA
2. Outline
Introduction and problem definition
Related work and solution strategies
DSNotify: usage scenarios and design
Core algorithm
Evaluation
Summary & Discussion
References
image: www.freeimages.co.uk
3. Linked Data Principles (short version):
(1) use HTTP URIs to identify resources,
(2) deliver meaningful representations (e.g., RDF, XHTML) when these are dereferenced,
(3) link to other resources
image by TBL / Hans Rosling
7. Ignore broken links? Not a good idea!
Broken links on the Web are annoying for humans, but alternative paths may be used: search engines, URL manipulation, alternative information providers, etc.
Much harder for machines in a Web of Data!
Reduced data accessibility, data inconsistencies.
8. Avoid broken links? Great! But hard to achieve in the Web environment…
Solution strategies that solve the problem only partially: relative references, embedded links, redundancy.
Solution strategies that are not commonly applicable: versioned/static collections, regular (predictable) updates, dynamic links, indirection services (PURLs, DOIs).
9. Solve the problem (1/2): Notification
Notification strategy:
The data source "knows" about the events that are taking place and notifies clients.
Clients may then check their links and fix the broken ones.
Current activities:
WOD-LMP [Volz et al. 2009]
Triplify Linked Data Update Log [Auer et al. 2009]
PubSubHubbub / sparqlPuSH
http://groups.google.com/group/dataset-dynamics
…
10. Solve the problem (2/2): Detect and correct
Detect and correct strategy:
If notification is not applicable, clients detect broken links and try to fix them.
Current activities:
Robust hyperlinks [Phelps & Wilensky 2000] – Web documents
PageChaser [Morishima et al. 2009] – Web documents
DSNotify – aims at becoming a general framework for fixing broken links
…
12. Events that potentially lead to broken links
Broken links due to deletion events:
A deletion event takes place at time t when a resource had (dereferencable) representations at t-Δ but has none at time t.
Vice versa: create event.
Easy to detect.
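The deletion and create events defined above reduce to a set difference between the resources dereferencable at t-Δ and those dereferencable at t. A minimal sketch (function and variable names are illustrative, not DSNotify's API):

```python
def detect_create_delete_events(reprs_before, reprs_after):
    """Compare the dereferencable resources at t-delta and at t.

    Both arguments map resource URIs to their representations. A URI
    that had a representation before but has none now yields a deletion
    event; the reverse yields a create event.
    """
    before, after = set(reprs_before), set(reprs_after)
    deletions = sorted(before - after)
    creations = sorted(after - before)
    return deletions, creations

# Resource a disappeared between the snapshots, resource b appeared.
deletions, creations = detect_create_delete_events(
    {"http://example.org/a": "<rdf:RDF>...</rdf:RDF>"},
    {"http://example.org/b": "<rdf:RDF>...</rdf:RDF>"},
)
```

As the slide notes, these events are easy to detect because only the presence of representations matters, not their content.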
13. Events that potentially lead to broken links
Broken links due to update events:
An update event takes place at time t when a resource had different representations at t-Δ compared to the ones at time t.
Resource updates resulting in representations with different meaning (semantic drift) may lead to semantically broken links.
Hard to detect, open problem.
15. Events that potentially lead to broken links
What about move events?
A move event from a to b takes place at time t when:
There were no representations of b at time t-Δ.
There are no representations of a at time t.
The representations of a at t-Δ are more similar to the ones of b at t than to the ones of any other considered resource at time t.
The calculated similarity between them exceeds a threshold.
Instance matching problem!
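Once a similarity function over representations is fixed, these conditions can be checked mechanically. A sketch under assumed names (the helper below is not DSNotify's published interface, and difflib's ratio stands in for the weighted feature comparison the system actually uses):

```python
import difflib

def similarity(x, y):
    """Stand-in similarity measure over string representations."""
    return difflib.SequenceMatcher(None, x, y).ratio()

def is_move_event(a, b, reprs_before, reprs_after, threshold=0.8):
    """Check the four conditions for a move event from a to b at time t."""
    if b in reprs_before:   # (1) b had no representations at t-delta
        return False
    if a in reprs_after:    # (2) a has no representations at t
        return False
    old = reprs_before[a]
    sim_to_b = similarity(old, reprs_after[b])
    # (3) a's old representation must be more similar to b's than to any
    # other resource considered at time t ...
    best_other = max(
        (similarity(old, r) for uri, r in reprs_after.items() if uri != b),
        default=0.0,
    )
    # (4) ... and the similarity must exceed a configured threshold.
    return sim_to_b > best_other and sim_to_b > threshold

moved = is_move_event(
    "http://example.org/a", "http://example.org/b",
    {"http://example.org/a": "Niko Popitsch, University of Vienna"},
    {"http://example.org/b": "Niko Popitsch, University of Vienna",
     "http://example.org/c": "an unrelated resource"},
)
```

Here a's old representation matches b's new one exactly and only weakly resembles c's, so the predicate reports a move.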
16. Events that potentially lead to broken links
The core algorithm of DSNotify detects move events based on resource similarities.
17. Changes in DBpedia

Class         Snapshot 3.2   Snapshot 3.3   Moved   Removed   Created
Person        213,016        244,621        2,841   20,561    49,325
Place         247,508        318,017        2,209   2,430     70,730
Organisation  76,343         105,827        2,020   1,242     28,706
Work          189,725        213,231        4,097   6,558     25,967

Resources that were moved/removed/created between the DBpedia snapshots 3.2 (October 2008) and 3.3 (May 2009).
20. Usage Scenario
An application that consumes various LD sources and may update a "source dataset".
21. Usage Scenario
DSNotify is an add-on for applications that want to preserve high link integrity in their data.
22. Usage Scenario
Other actors (applications) might also be interested in these events.
23. General Approach
Periodically access linked data sources.
Extract features from resource representations and combine them into comparable feature vectors (FVs).
Store them in 3 indices:
1st index represents the current state of the monitored data.
2nd index stores items that became recently unavailable.
3rd index stores archived feature vectors.
Periodically access indices 1+2 and log detected events.
Periodically update indices 1-3.
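A toy sketch of this three-index layout and one housekeeping pass; class, method, and index representations are invented for illustration, and the real system's storage details differ:

```python
class HousekeepingIndices:
    """Three indices as described in the general approach (sketch only)."""

    def __init__(self):
        self.current = {}   # 1st index: current state of the monitored data
        self.removed = {}   # 2nd index: items that recently became unavailable
        self.archive = {}   # 3rd index: archived feature vectors
        self.event_log = []

    def housekeeping(self, observed):
        """One pass; `observed` maps URIs to freshly extracted feature vectors."""
        # Items that vanished since the last pass move to the removed index.
        for uri in list(self.current):
            if uri not in observed:
                self.removed[uri] = self.current.pop(uri)
                self.event_log.append(("remove", uri))
        # New and changed items update the current index.
        for uri, fv in observed.items():
            if uri not in self.current:
                self.event_log.append(("create", uri))
            elif self.current[uri] != fv:
                self.event_log.append(("update", uri))
            self.current[uri] = fv
        # Removed entries are eventually archived (here: immediately).
        self.archive.update(self.removed)

idx = HousekeepingIndices()
idx.housekeeping({"http://example.org/a": ("Alice", 1.0)})
idx.housekeeping({"http://example.org/b": ("Alice", 1.0)})
```

After the second pass, a sits in the removed index and b among the recently created items with an identical feature vector: exactly the pairing the move detection operates on.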
24. From Resource to Feature Vector
Both datatype and object properties are supported.
Feature influence is weighted.
Some features are used in plausibility checks.
RDFHash over all features.
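A sketch of a weighted feature vector and a hash over all features; the property names, weights, and the SHA-1 choice are assumptions for illustration, not DSNotify's actual configuration:

```python
import hashlib

# Invented weights: how strongly each property influences similarity.
FEATURE_WEIGHTS = {"rdfs:label": 1.0, "foaf:name": 0.8, "dbo:birthDate": 0.3}

def feature_vector(properties):
    """Pair each property value (datatype or object property) with its weight."""
    return {p: (v, FEATURE_WEIGHTS.get(p, 0.1)) for p, v in properties.items()}

def rdf_hash(properties):
    """Hash over all features: any change to any feature changes the hash,
    so unchanged resources can be skipped cheaply."""
    canonical = "|".join(f"{p}={v}" for p, v in sorted(properties.items()))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

h1 = rdf_hash({"rdfs:label": "Vienna", "dbo:country": "Austria"})
h2 = rdf_hash({"dbo:country": "Austria", "rdfs:label": "Vienna"})  # same content
h3 = rdf_hash({"rdfs:label": "Wien", "dbo:country": "Austria"})    # changed label
```

Sorting the properties makes the hash independent of property order, so only genuine content changes alter it.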
25. Move Event Detection
Pairwise comparison using a vector space model; feature comparison e.g. using Levenshtein similarity.
It is sufficient to compare recently added and recently removed feature vectors!
Two thresholds for comparing the similarity between FVs representing created and removed items:
Lower threshold: select predecessor candidates, i.e., consider the URI of an added FV as a possible new URI of the resource represented by a removed FV.
Upper threshold: decidable by DSNotify? Decide whether such a candidate can be automatically selected or whether a human user has to be asked for assistance.
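The two-threshold rule amounts to a three-way decision per removed/created FV pair; the threshold values and labels below are illustrative:

```python
def classify_candidate(sim, lower=0.5, upper=0.9):
    """Apply the two thresholds to a similarity score between a removed
    FV and a recently created FV.

    Below `lower` the created item is not a predecessor candidate at all;
    at or above `upper` the move is accepted automatically; in between,
    the system cannot decide and asks a human user for assistance.
    """
    if sim < lower:
        return "not-a-candidate"
    if sim >= upper:
        return "accept-move"
    return "ask-user"

decisions = [classify_candidate(s) for s in (0.3, 0.7, 0.95)]
```

The middle band is what feeds the "event choices" log described on the data-structures slide.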
26. Core Housekeeping Algorithm
Ci, Ri and Mi,j denote create, remove and move events of items i and j. mx and hx denote monitoring and housekeeping operations respectively.
27. Resulting Data Structures
DSNotify constructs three data structures:
An event log containing all events detected by the system,
a log containing all "event choices" DSNotify cannot decide on,
and a linked structure of feature vectors constituting a history of the respective items.
Accessible via:
Linked data interface
Java interface
XML-RPC
image: www.freeimages.co.uk
28. Evaluation
Core questions:
Does DSNotify work with real data?
How does housekeeping frequency affect its effectiveness?
Used data:
Data from DBpedia (8380 events) and IIMB (10 x 222 events) were used.
Hand-picked features based on coverage and entropy in the data sets.
Results:
Housekeeping frequency and data source dynamics determine the number of FV pairs that have to be compared (scalability).
The number of FV comparisons as well as the coverage and entropy of indexed features influence the accuracy of the method.
29. Evaluation - Results
Influence of data source agility and housekeeping frequency on the accuracy of the DSNotify algorithm.
30. Discussion
Broken links are a considerable problem in a Web of Data.
The broken link problem is partly a special case of the instance matching problem.
DSNotify is an event-based approach to this problem: DSNotify can be used as an add-on for data sources that want to preserve link integrity in their data.
We cannot "cure" the Web of Data of broken links (but at least alleviate the pain a bit :)
31. Current and Future Work
Scalability issues, evaluation
Automatic feature selection (parameter estimation)
Event algebra: high-level composite events
Evaluation with other data sources (e.g., file system)
Dataset dynamics: vocabularies, protocols, formats
…
Thank You!
niko.popitsch@univie.ac.at
http://www.dsnotify.org
images: NASA / NSSDC
32. References and Related Work
H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3), 2000.
F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84–104, 1995.
A. Morishima, A. Nakamizo, T. Iida, S. Sugimoto, and H. Kitagawa. Bringing your dead links back to life: a comprehensive approach and lessons learned. In HT '09: Proceedings of the 20th ACM conference on Hypertext and hypermedia, pages 15–24, 2009.
T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical Report UCB/CSD-00-1091, EECS Department, University of California, Berkeley, 2000.
J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In 8th International Semantic Web Conference, 2009.
A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web data graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.
A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a benchmark for instance matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 2009.
S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify: light-weight linked data publication from relational databases. In WWW '09, New York, NY, USA, 2009. ACM.
W. Y. Arms. Uniform resource names: handles, purls, and digital object identifiers. Commun. ACM, 44(5):68, 2001.