Preserving Linked Data: Challenges and Opportunities
Vassilis Christophides
University of Crete & FORTH-ICS
Heraklion, Crete
A bit of History: from Web 1.0 & 2.0 …
•  Many Web sites containing unstructured, textual content
•  Few large Web sites are specialized on specific content types
•  Semi-structured/XML content floating around e-services
A bit of History: … to Web 3.0
•  Data as a Service (DaaS): syndicating arbitrarily semi-structured content across Web sites using higher-level data abstractions (entities)
•  Data themselves become “infrastructure”: a valuable asset on which science, technology, the economy and society can advance
The Emerging Web of Data
•  Global data space connecting data from diverse domains and sources
•  Primary objects: “things” (or description of things)
•  Links between “things”
•  Granularity of information: from entire datasets to atomic
Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, available at http://linkeddata.org/guides-and-tutorials
Web of Data: Conceptual Representation
[Figure: entities identified by URIs and connected by typed links; the entities represent data published as spreadsheets, HTML, XML and RDFa (semi-structured data, triples, statistical data). Caption: a Web of things in the world, described by data on the Web.]
Entities: an Invaluable Asset
•  “Entities” is what a large part of our knowledge is
Web Data of Increasing Standardization
Not all linked data is open and not all open data is linked!
★ Available on the web (whatever format) but with an open license, to be Open Data
★★ Available as machine-readable structured data (e.g. Excel vs. image scan of a table)
★★★ As (2), plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ As (3), plus using open standards from W3C (RDF and SPARQL) to identify things through dereferenceable HTTP URIs, to ensure effective access
★★★★★ As all the above, plus establishing links between data of different sources
File format    Recommendations (on a scale of 0-5)
csv            ★★★
xls            ★
pdf            ★
doc            ★
xml            ★★★★
rdf            ★★★★★
shp            ★★★
ods            ★★
tiff           ★
jpeg           ★
json           ★★★
txt            ★
html           ★★
The LOD Cloud
[Figure: the LOD cloud diagram; legend: Media, Government, Geo, Publications, User-generated, Life sciences, Cross-domain]
Basic Terminology
•  A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider
•  A linkset is a collection of RDF links between two datasets, i.e. triples whose subject & object are described in different datasets
•  But what do we really know about the production and curation processes of the sources publishing in RDF?
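The linkset definition above can be illustrated with a minimal sketch. It assumes a toy registry `described_in` that says which dataset describes each resource; that registry, the `Triple` type and the `linkset` function are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def linkset(triples, described_in, source, target):
    """Return the triples whose subject is described in `source`
    and whose object is described in `target`."""
    return [
        t for t in triples
        if described_in.get(t.subject) == source
        and described_in.get(t.obj) == target
    ]


# Hypothetical registry: which dataset describes which resource.
described_in = {
    "dbpedia:Statue_of_Liberty": "DBpedia",
    "dbpedia:Liberty_Island": "DBpedia",
    "yago:Statue_of_Liberty": "YAGO",
}

triples = [
    Triple("dbpedia:Statue_of_Liberty", "dbpprop:location", "dbpedia:Liberty_Island"),
    Triple("dbpedia:Statue_of_Liberty", "owl:sameAs", "yago:Statue_of_Liberty"),
    Triple("dbpedia:Statue_of_Liberty", "rdfs:label", "Statue of Liberty"),
]

# Only the owl:sameAs triple crosses from DBpedia to YAGO.
links = linkset(triples, described_in, "DBpedia", "YAGO")
print(links)
```

The first triple stays inside DBpedia and the third has a literal object, so neither belongs to the DBpedia-to-YAGO linkset.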
  
What is Digital Preservation (DP)?
•  Ensure accessibility and usability of digital objects over time and across domains, and protect them from media failure, physical loss, & hardware/software obsolescence
•  Traditionally, the objective is to preserve in the long run the authenticity of a digital object as originally recorded against any technological change
   – extensive annotation of digital objects with information related to their significant properties
      •  content format
      •  context of production
      •  structural metadata
      •  current behavior, …
Linked Data vs Digital Objects
•  DP techniques proposed for memory institutions and data centers concern fixed digital objects featuring mostly unstructured data
   – raw data sets held in files, scholarly data held in papers…
•  Linked Data are digitally-born objects which
   – are graph-structured, optionally satisfying integrity constraints (expressed in higher logic formalisms)
   – exhibit complex interdependencies across sources as well as varying data quality (curated knowledge bases vs extracted from text or Web 2.0 sources)
   – change without notification at different granularity levels to keep them fit for contemporary purposes, and be available for discovery and re-use in the future
Linked Data: Behind the scenes!
Different Descriptions of the same Entity

URI: dbpedia:Statue_of_Liberty
Property names            Property values
rdfs:label                Statue of Liberty, Freiheitsstatue, …
dbpprop:location          New York City, New York, U.S., dbpedia:Liberty_Island
dbpprop:sculptor          dbpedia:Frédéric_Auguste_Bartholdi
dcterms:subject           dbpedia_category:1886_sculptures, …
foaf:isPrimaryTopicOf     http://en.wikipedia.org/wiki/Statue_of_Liberty
dbpprop:beginningDate     1886-10-28 (xsd:date)
dbpprop:restored          19381984 (xsd:integer)
dbpprop:visitationNum     3200000 (xsd:integer)
dbpprop:visitationYear    2009 (xsd:integer)
http://www.w3.org/ns/prov#wasDerivedFrom    http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

URI: fb:m.072p8
fb:art_form               fb:m.06msq (Sculpture)
fb:media                  fb:m.025rsfk (Copper)
fb:architect              fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt)
fb:height_meters          93
fb:opened                 1886-10-28

URI: yago:Statue_of_Liberty
skos:prefLabel            Statue of Liberty
rdf:type                  yago:History_museums_in_NY, yago:GeoEntity
yago:hasHeight            46.0248
yago:wasCreatedOnDate     1886-##-##
yago:isLocatedIn          yago:Manhattan, yago:Liberty_Island,
yago:hasWikipediaUrl      http://en.wikipedia.org/wiki/Statue_of_Liberty
Linked Datasets Depend on Vocabularies
Linked Datasets Have Varying Quality
Linked Datasets Evolve Over Time

Current version of DBpedia
URI: dbpedia:Statue_of_Liberty
rdfs:label                Statue of Liberty, Freiheitsstatue, …
dbpprop:location          New York City, New York, U.S., dbpedia:Liberty_Island
dbpprop:sculptor          dbpedia:Frédéric_Auguste_Bartholdi
dcterms:subject           dbpedia_category:1886_sculptures, …
foaf:isPrimaryTopicOf     http://en.wikipedia.org/wiki/Statue_of_Liberty
dbpprop:beginningDate     1886-10-28 (xsd:date)
dbpprop:restored          19381984 (xsd:integer)
dbpprop:visitationNum     3200000 (xsd:integer)
dbpprop:visitationYear    2009 (xsd:integer)
http://www.w3.org/ns/     http://en.wikipedia.org/wiki/Statue_of_Liberty?

Previous version of DBpedia
URI: dbpedia:Statue_of_Liberty
rdfs:label                Statue of Liberty, Freiheitsstatue, …
dbpprop:location          New York City, New York, U.S., dbpedia:Liberty_Island
dbpprop:sculptor          dbpedia:Frédéric_Auguste_Bartholdi
dcterms:subject           dbpedia_category:1886_sculptures, …
foaf:isPrimaryTopicOf     http://en.wikipedia.org/wiki/Statue_of_Liberty
dbpprop:built             1886-10-28 (xsd:date)
dbpprop:restored          19381984 (xsd:integer)
dbpprop:hasHeight         151 (xsd:integer)
http://www.w3.org/ns/prov#wasDerivedFrom    http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330
DP Challenges for Linked Data
•  The ‘publish-first-refine-later’ philosophy of the Linked Data movement, complemented by the open, decentralized nature of the Web, results in data
   – Incompleteness: real-world entities are usually partially described in data sources
   – Redundancy: the same real-world entities are represented in multiple data sources
   – Inconsistency: various forms of inter- and intra-source data conflicts
   – Incorrectness: errors can be propagated from one source to the other due to copying
•  Mastering the varying data quality is a prerequisite for trusting preserved data originating from various sources
DP Challenges for Linked Data
•  Still, data publishing and preservation are two largely separated processes, addressed by the SW and DP communities
   – Shouldn’t the “publishers” worry about preserving all their hard work?
   – Shouldn’t the “preservers” be concerned about the way they organize, link, and annotate their linked data?
•  Need to break down the traditional boundaries between the data creators & publishers and the data archivists & brokers
   – Integrate curation [when there is high current/ongoing interest] and preservation [when interest falls off] activities
   – Distribute preservation costs over the life-cycle of linked datasets among data stewards (pay-as-you-go data preservation)
Vision 2030: High-Level Group on Scientific Data
“Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted”
•  How do we support future users in trusting data?
   – By assessing their quality w.r.t. the entities they describe
   – By recording from which sources they originate
   – By understanding how their identity and integrity have evolved over time
   – By ensuring that they have been preserved properly
Frame Linked Data Preservation as a Research Problem
•  How can we identify that different resource descriptions within or across datasets refer to the same real-world entity (the entity resolution problem), in order to convey various aspects of the quality of the harvested datasets (e.g., redundancy, completeness, freshness)?
•  How can we record dependencies of datasets (the provenance problem), and how can they be smoothly represented along with other (temporal, spatial, thematic) metadata (the annotation problem)?
Frame Linked Data Preservation as a Research Problem
•  How can we monitor changes of third-party datasets (the evolution tracking problem), or how can local/remote data imperfections (e.g., due to change propagation) be repaired (the curation problem)?
•  How can we appraise which versions of linked datasets should be preserved for future use (the multi-version archive consistency problem), and how will we be able to ask a query not only about any past state of the dataset but also about the evolution of some part of it (the longitudinal querying problem)?
ER Example
[Figure: RDF graph around three descriptions of the same entity. dbpedia:Statue_of_Liberty carries dbprop:location (dbpedia:Liberty_Island), dbprop:sculptor (dbpedia:Frédéric_Auguste_Bartholdi), dcterms:subject (dbpedia_category:1886_sculptures) and dbprop:visitationNum (3200000); fb:m.072p8 carries fb:architect (fb:m.0jph6) and fb:art_form (fb:m.06msq); yago:Statue_of_Liberty carries skos:prefLabel (Statue of Liberty), rdf:type (yago:History_museums_in_NY, yago:GeoEntity), yago:isLocatedIn (yago:Liberty_Island) and yago:wasCreatedOnDate (1886-##-##).]
Entity resolution: the problem of identifying descriptions of the same entity within one or across multiple data sources w.r.t. a match function
•  No longer just matching of entity names
[Figure: dbpedia:Statue_of_Liberty owl:sameAs yago:Statue_of_Liberty]
An entity resolution is a partition of a set of entity descriptions, such that:
1. Matching entity descriptions are placed in the same subset
2. All the descriptions of the same subset match
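The two conditions above can be checked mechanically for a candidate partition. The sketch below is illustrative only: the `real_entity` table stands in for a ground-truth match function and is an assumption, as are the function names.

```python
from itertools import combinations


def is_entity_resolution(partition, match):
    """Check both conditions: two descriptions share a subset
    if and only if the match function says they match."""
    block_of = {d: i for i, block in enumerate(partition) for d in block}
    return all(
        match(a, b) == (block_of[a] == block_of[b])
        for a, b in combinations(block_of, 2)
    )


# Hypothetical ground truth used as the match function.
real_entity = {
    "dbpedia:Statue_of_Liberty": "statue",
    "yago:Statue_of_Liberty": "statue",
    "fb:m.072p8": "statue",
    "dbpedia:Liberty_Island": "island",
    "yago:Liberty_Island": "island",
}


def match(a, b):
    return real_entity[a] == real_entity[b]


good = [
    {"dbpedia:Statue_of_Liberty", "yago:Statue_of_Liberty", "fb:m.072p8"},
    {"dbpedia:Liberty_Island", "yago:Liberty_Island"},
]
# fb:m.072p8 is misplaced, so both conditions are violated.
bad = [
    {"dbpedia:Statue_of_Liberty", "yago:Statue_of_Liberty"},
    {"fb:m.072p8", "dbpedia:Liberty_Island", "yago:Liberty_Island"},
]

print(is_entity_resolution(good, match), is_entity_resolution(bad, match))
```

A real match function is rarely transitive, so in practice a partition satisfying both conditions exactly may not exist; the check is meant only to make the definition concrete.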
  
ER Example
[Figure: entity descriptions grouped into Persons (dbpedia:Frédéric_Auguste_Bartholdi, fb:m.0jph6), Places (dbpedia:Liberty_Island, yago:Liberty_Island) and Artifacts (fb:m.072p8), with owl:sameAs links between descriptions and fb:architect and yago:isLocatedIn links across the groups]
=> Need to infer also other kinds of relationships than “equivalence”
What Makes ER Difficult for Linked Data
•  Linked Data are inherently semi-structured
   – several semantic types (see rdf:type properties in Yago) could be simultaneously employed, resulting in entity descriptions, even of the same type (persons, places, …), with quite different structures
•  Linked Data heavily rely on heterogeneous vocabularies
   – DBpedia 3.4: 50,000 properties
   – Google Base: 100,000 schemata and 10,000 entity types
•  Linked Data are Big Data
   – The LOD cloud consists of 32 billion RDF triples (last update: 2011)
   – DBpedia 3.4: 36.5 million triples, 2.1 million entity descriptions
   – BTC09: 1.15 billion triples, 182 million entity descriptions
=> Deal with loosely structured entities
=> Calls for efficient parallel techniques
=> Need for cross-domain techniques
The Role of Similarity Functions
[Figure: within the set of all pairs of entity descriptions, the matching pairs compared with the pairs of entity descriptions satisfying a strict similarity function]
The Role of Similarity Functions
[Figure: within the set of all pairs of entity descriptions, the matching pairs compared with the pairs of entity descriptions satisfying a loose similarity function]
Entity Collections and ER Types
•  2 kinds of entity collections given as input to an ER task:
   – Clean, which are duplicate-free (e.g., DBpedia, Freebase, DBLP)
   – Dirty, which contain duplicate entity descriptions in themselves (e.g., Google Scholar, Citeseer)
•  An ER task that receives as input two entity collections can be of the following types:
   – Clean-Clean ER: Given two clean, but overlapping entity collections, identify the common entity descriptions (a.k.a. the Record Linkage problem in databases)
   – Dirty-Clean ER:
   – Dirty-Dirty ER: Identify unique entity descriptions contained in the union of the input entity collections (a.k.a. the Deduplication problem in databases)
Scaling ER to the Web of Data
•  Blocking to reduce the number of comparisons:
   – Split entity descriptions into blocks
   – Compare each description to the descriptions within the same block
•  Desiderata
   – Similar entity descriptions in the same block
   – Dissimilar entity descriptions in different blocks
•  Blocking approaches are distinguished between:
   – Partitioning, where each description is placed in exactly one block: fewer comparisons
   – Overlapping, where each description can be placed in more than one block: more identified matches
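To make the trade-off concrete, the following sketch counts the candidate pairs produced without blocking, with a partitioning and with an overlapping blocking. The six descriptions and the two block layouts are invented for illustration and do not come from the slides.

```python
from itertools import combinations


def pairs_without_blocking(descriptions):
    """Every description is compared with every other one."""
    return set(combinations(sorted(descriptions), 2))


def pairs_with_blocking(blocks):
    """Only descriptions sharing at least one block are compared;
    a pair that co-occurs in several blocks is counted once."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


descriptions = ["e1", "e2", "e3", "e4", "e5", "e6"]

# Hypothetical block layouts.
partitioning = [{"e1", "e2", "e3"}, {"e4", "e5"}, {"e6"}]
overlapping = [{"e1", "e2", "e3"}, {"e3", "e4", "e5"}, {"e5", "e6"}]

print(len(pairs_without_blocking(descriptions)))  # all pairs
print(len(pairs_with_blocking(partitioning)))     # fewest comparisons
print(len(pairs_with_blocking(overlapping)))      # more candidate pairs
```

With these layouts the overlapping scheme keeps pairs such as (e3, e4) that the partitioning scheme never compares, at the price of more comparisons.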
  
Blocking Techniques for Linked Data
•  Multi-relational and cross-domain entity resolution
   – Token blocking
   – Property clustering
   – Prefix-Infix(-Suffix)
•  Large-scale entity resolution
   – Choose a computationally inexpensive similarity function
   – Process partitions of the entity graph in parallel in Map/Reduce nodes
Token Blocking [Papadakis et al. 2011]
•  Ignores (semantic or structured) types of entities
   – String similarity of tokens of property literal values
•  Each distinct token of each property value of each entity description corresponds to a block
   – Each block contains all entities with the corresponding token
•  High recall at the cost of low precision and low efficiency:
   – Most true matches are placed in the same block
   – Many non-matches are also placed in the same block
   – The same pair of descriptions is contained in many blocks
String Similarity Measures
Token Blocking - Example
Entity descriptions:
e1= {(name, Eiffel Tower), (architect, Sauvestre), (year, 1889), (location, Paris)}
e2= {(name, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}
e3= {(about, Lady Liberty), (architect, Eiffel), (location, NY)}
e4= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}
e5= {(name, White Tower), (year-constructed, 1450), (location, Thessaloniki)}
Generated blocks:
Block (token)    Entities
Eiffel           e1, e2, e3, e4
Tower            e1, e4, e5
Statue           e2
Liberty          e2, e3
White            e5
1889             e1, e4
Bartholdi        e2
NY               e2, e3
Sauvestre        e1, e4
Paris            e1, e4
1886             e2
1450             e5
Lady             e3
Thessaloniki     e5
The pair (e1, e4) is contained in 5 different blocks!
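The example above can be reproduced with a short sketch. It is a simplified reading of token blocking, not the authors' implementation: values are split on whitespace, and the token "of" is dropped as a stop word because the slide shows no block for it (both choices are assumptions).

```python
from collections import defaultdict

STOP_WORDS = {"of"}  # assumption: the slide lists no block for "of"

entities = {
    "e1": [("name", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("location", "Paris")],
    "e2": [("name", "Statue of Liberty"), ("architect", "Bartholdi Eiffel"),
           ("year", "1886"), ("located", "NY")],
    "e3": [("about", "Lady Liberty"), ("architect", "Eiffel"),
           ("location", "NY")],
    "e4": [("about", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("located", "Paris")],
    "e5": [("name", "White Tower"), ("year-constructed", "1450"),
           ("location", "Thessaloniki")],
}


def token_blocking(entities, stop_words=frozenset()):
    """Create one block per distinct token, ignoring property names."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for _property, value in description:
            for token in value.split():
                if token not in stop_words:
                    blocks[token].add(entity_id)
    return dict(blocks)


blocks = token_blocking(entities, STOP_WORDS)

# Blocks in which the pair (e1, e4) co-occurs.
shared = sorted(t for t, ids in blocks.items() if {"e1", "e4"} <= ids)
print(len(blocks), shared)
```

The redundancy is visible in `shared`: the same pair would be compared once per shared block unless repeated comparisons are filtered out.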
  
Property Clustering [Papadakis et al. 2013]
•  Assuming two duplicate-free datasets
   – Recognize similarity of properties based on the string similarity of their literal values occurring in entity descriptions
•  Two main blocking steps:
   1. Similar properties are placed together in non-overlapping clusters
   2. Token blocking is performed on the descriptions of each cluster
Clustering Entity Properties
1. For each property of dataset D1:
   – Find the most similar property of dataset D2
2. For each property of dataset D2:
   – Find the most similar property of dataset D1
3. Compute the transitive closure of the generated pairs of similar properties
4. Similar properties form clusters
5. All singleton clusters are merged into a common one
Clustering Entity Properties: Example
D1:
e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}
e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}
e3= {(about, Auguste Bartholdi), (born, 1834)}
e4= {(about, Joan Tower), (born, 1938)}
D2:
e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}
e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}
e7= {(work, Bartholdi Fountain), (year-constructed, 1876), (location, Washington D.C.)}
Clustering Entity Properties: Example
Finding the property of D2 that is the most similar to the property “about” of D1:
values of about: {Eiffel, Tower, Statue, Liberty, Auguste, Bartholdi, Joan, Tower}
compared to (with Jaccard similarity):
values of work: {Lady, Liberty, Eiffel, Tower, Bartholdi, Fountain} → Jaccard = 4/10
values of artist: {Bartholdi} → Jaccard = 1/8
values of location: {NY, Paris, Washington, D.C.} → Jaccard = 0
values of year-constructed: {1889, 1876} → Jaccard = 0
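The Jaccard comparison above can be sketched as follows. The sketch treats the values of a property as a set of whitespace-separated tokens; keeping the token "of" from "Statue of Liberty" is an assumption, chosen because it reproduces the 4/10 and 1/8 shown on the slide. The function names are illustrative.

```python
from fractions import Fraction


def tokens(values):
    """Set of whitespace-separated tokens over all values of a property."""
    return {tok for value in values for tok in value.split()}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| as an exact fraction."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


# Values of the D1 property "about" (e1 to e4).
about = tokens(["Eiffel Tower", "Statue of Liberty",
                "Auguste Bartholdi", "Joan Tower"])

# Values of the D2 properties (e5 to e7).
d2_properties = {
    "work": tokens(["Lady Liberty", "Eiffel Tower", "Bartholdi Fountain"]),
    "artist": tokens(["Bartholdi"]),
    "location": tokens(["NY", "Paris", "Washington D.C."]),
    "year-constructed": tokens(["1889", "1876"]),
}

scores = {name: jaccard(about, vals) for name, vals in d2_properties.items()}
most_similar = max(scores, key=scores.get)
print(scores, most_similar)
```

Under these assumptions "work" is the D2 property most similar to "about", which yields the pair (about, work).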
  
Clustering Entity Properties: Example
•  Similarly for the rest of the properties…
[Figure: the properties of D1 (about, architect, year, born, located) and of D2 (work, artist, year-constructed, location), with each property linked in turn to its most similar property in the other dataset]
Clustering Entity Properties: Example
•  Compute the transitive closure of the generated property name pairs
   – Connected properties form clusters
Pairs: (about, work), (work, about), (artist, architect), (architect, work)
Transitive closure: C1 = {about, work, architect, artist}
Pairs: (year, year-constructed), (year-constructed, year), (year-constructed, born)
Transitive closure: C2 = {year, year-constructed, born}
Pairs: (located, location), (location, located)
Transitive closure: C3 = {location, located}
•  Generated property clusters: C1, C2, C3
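The transitive-closure step can be sketched with a small union-find structure. The property pairs are copied from the slides rather than recomputed; the function name is illustrative, and the final step of merging singleton clusters is left out because this example has none.

```python
def cluster_properties(pairs):
    """Group properties connected through any chain of similar pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for prop in list(parent):
        clusters.setdefault(find(prop), set()).add(prop)
    return list(clusters.values())


# Pairs of most similar properties, as listed on the slides.
pairs = [
    ("about", "work"), ("work", "about"),
    ("artist", "architect"), ("architect", "work"),
    ("year", "year-constructed"), ("year-constructed", "year"),
    ("year-constructed", "born"),
    ("located", "location"), ("location", "located"),
]

clusters = cluster_properties(pairs)
print(clusters)
```

The pair direction does not matter for the closure, so (about, work) and (work, about) contribute the same connection.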
  
Token Blocking for Each Cluster
Property clusters:
C1 = {about, work, architect, artist}
C2 = {year, year-constructed, born}
C3 = {location, located}
Entity descriptions:
e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}
e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}
e3= {(about, Auguste Bartholdi), (born, 1834)}
e4= {(about, Joan Tower), (born, 1938)}
e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}
e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}
e7= {(work, Bartholdi Fountain), (year-constructed, 1876), (location, Washington D.C.)}
Some of the generated blocks:
C3.NY: e2, e5
C1.Tower: e1, e4, e6
C1.Bartholdi: e2, e3, e5, e7
compare Lady Liberty to Auguste Bartholdi
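A minimal sketch of the second blocking step, assuming the three property clusters are already known: tokens are keyed by the cluster of the property they occur in, so the same token in two different clusters ends up in two different blocks. The function name and the (cluster, token) key layout are illustrative choices.

```python
from collections import defaultdict

clusters = {
    "C1": {"about", "work", "architect", "artist"},
    "C2": {"year", "year-constructed", "born"},
    "C3": {"location", "located"},
}
cluster_of = {prop: cid for cid, props in clusters.items() for prop in props}

entities = {
    "e1": [("about", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("located", "Paris")],
    "e2": [("about", "Statue of Liberty"), ("architect", "Bartholdi Eiffel"),
           ("year", "1886"), ("located", "NY")],
    "e3": [("about", "Auguste Bartholdi"), ("born", "1834")],
    "e4": [("about", "Joan Tower"), ("born", "1938")],
    "e5": [("work", "Lady Liberty"), ("artist", "Bartholdi"),
           ("location", "NY")],
    "e6": [("work", "Eiffel Tower"), ("year-constructed", "1889"),
           ("location", "Paris")],
    "e7": [("work", "Bartholdi Fountain"), ("year-constructed", "1876"),
           ("location", "Washington D.C.")],
}


def clustered_token_blocking(entities, cluster_of):
    """Token blocking where each block key is (property cluster, token)."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for prop, value in description:
            for token in value.split():
                blocks[(cluster_of[prop], token)].add(entity_id)
    return dict(blocks)


blocks = clustered_token_blocking(entities, cluster_of)
print(blocks[("C3", "NY")], blocks[("C1", "Tower")], blocks[("C1", "Bartholdi")])
```

Block C1.Bartholdi still places e5 (Lady Liberty) next to e3 (Auguste Bartholdi), so some non-matching comparisons remain even after property clustering.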
  
Prefix-Infix(-Suffix) [Papadakis et al. 2012]
•  How can we exploit the semantics of URIs to better match entity descriptions?
  – E.g. 66% of the 182 million URIs of BTC09 (km.aifb.kit.edu/projects/btc-2009) follow the scheme: Prefix-Infix(-Suffix)
    • Prefix describes the source, i.e. domain, of the URI
    • Infix is a local identifier
    • The optional Suffix contains details about the format, e.g. .rdf and .nt, or a named anchor
•  Token blocking on the Infixes appearing in the resource values of properties (as subject or object)
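A rough way to split a URI along this scheme is shown below. It is only a heuristic sketch, not the procedure of Papadakis et al.: the function name `split_uri`, the rule "the Infix is the last path segment", and the `example.org` URIs are assumptions; the two format suffixes are the ones named on the slide.

```python
# Format suffixes named on the slide; a real list would be longer.
FORMAT_SUFFIXES = (".rdf", ".nt")


def split_uri(uri):
    """Split a URI into (prefix, infix, suffix) using simple string rules."""
    # A named anchor (#...) counts as part of the suffix.
    rest, sep, fragment = uri.partition("#")
    suffix = sep + fragment
    # A known format extension also counts as part of the suffix.
    for ext in FORMAT_SUFFIXES:
        if rest.endswith(ext):
            rest, suffix = rest[: -len(ext)], ext + suffix
            break
    # Heuristic: the infix is the last path segment, the prefix is the rest.
    rest = rest.rstrip("/")
    cut = rest.rfind("/") + 1
    return rest[:cut], rest[cut:], suffix


print(split_uri("http://dbpedia.org/resource/Liberty_Island"))
print(split_uri("http://example.org/data/Liberty_Island.rdf"))
print(split_uri("http://example.org/data/Liberty_Island#about"))
```

Only the Infix is used as a blocking key, so descriptions published by different sources can share a block when they use the same local identifier.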
  
Prefix-Infix(-Suffix) [Papadakis et al. 2012]
E.g. (Infix-profile):
e1= {(skos:prefLabel, Statue of Liberty), (yago:isLocatedIn, yago:Liberty_Island)}
e2= {(rdfs:label, Statue of Liberty), (dbpprop:location, dbpedia:Liberty_Island)}
e3= {(freeb:official_name, Statue of Liberty), (freeb:containedby, freeb:m.026kp2)}
e4= {(geonames:name, Statue of Liberty), (geonames:nearby, geonames:5124330)}
Generated blocks:
Liberty_Island: e1, e2
m.026kp2: e3
5124330: e4
Note: The effectiveness of the approach relies on the good naming practices of the data publishers
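The blocks of this example can be derived as follows. This is a minimal sketch, assuming that resource values are written as prefixed names (`prefix:local`) and that the local part plays the role of the Infix; the names `PROFILES`, `infix_of` and `infix_blocking` are illustrative.

```python
from collections import defaultdict

# Descriptions e1..e4 from the slide, as (property, value) pairs.
PROFILES = {
    "e1": [("skos:prefLabel", "Statue of Liberty"),
           ("yago:isLocatedIn", "yago:Liberty_Island")],
    "e2": [("rdfs:label", "Statue of Liberty"),
           ("dbpprop:location", "dbpedia:Liberty_Island")],
    "e3": [("freeb:official_name", "Statue of Liberty"),
           ("freeb:containedby", "freeb:m.026kp2")],
    "e4": [("geonames:name", "Statue of Liberty"),
           ("geonames:nearby", "geonames:5124330")],
}


def infix_of(value):
    """Return the local part of a prefixed name, or None for a literal."""
    _, sep, local = value.partition(":")
    if sep and " " not in value:
        return local
    return None


def infix_blocking(profiles):
    """Place each description in one block per Infix found in its values."""
    blocks = defaultdict(set)
    for entity_id, pairs in profiles.items():
        for _, value in pairs:
            infix = infix_of(value)
            if infix is not None:
                blocks[infix].add(entity_id)
    return dict(blocks)


INFIX_BLOCKS = infix_blocking(PROFILES)
for infix, ids in INFIX_BLOCKS.items():
    print(infix, sorted(ids))
```

Only e1 and e2 share a block, because YAGO and DBpedia both use a readable local name; the opaque Freebase and GeoNames identifiers leave e3 and e4 isolated, which is the dependence on naming practices noted on the slide.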
  
LINDA	
  [Böhm	
  et	
  al.	
  2012]	
  
•  Works	
  on	
  an	
  en8ty	
  graph	
  constructed	
  from	
  RDF	
  
triples	
  by	
  considering	
  the	
  URIs	
  appearing	
  in	
  their	
  
subject,	
  predicate	
  and	
  object	
  posi8ons	
  
•  Matches	
  are	
  iden8fied	
  using	
  two	
  kinds	
  of	
  similari8es:	
  
– Descrip8ons	
  are	
  similar	
  wrt.	
  a	
  string	
  similarity	
  of	
  
their	
  literal	
  values:	
  Checked	
  once	
  
– Descrip8ons	
  have	
  similar	
  neighbours	
  in	
  the	
  en8ty	
  
graph:	
  Checked	
  itera&vely	
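The two ingredients, an entity graph and a literal-based similarity, can be sketched as follows. This is an illustration under assumptions, not LINDA's code: literals are marked by surrounding double quotes, the string similarity is a token-level Jaccard coefficient (the slide does not name LINDA's actual measure), and `build_entity_graph`, `literal_similarity` and `TRIPLES` are hypothetical names.

```python
from collections import defaultdict


def build_entity_graph(triples):
    """Return (neighbours, literals): URI-to-URI edges labelled with the
    predicate, and the literal values attached to each subject."""
    neighbours = defaultdict(set)
    literals = defaultdict(set)
    for subj, pred, obj in triples:
        if obj.startswith('"'):
            literals[subj].add(obj.strip('"'))
        else:
            neighbours[subj].add((pred, obj))
            neighbours[obj].add((pred, subj))
    return neighbours, literals


def literal_similarity(values_a, values_b):
    """Jaccard similarity over the lower-cased tokens of two literal sets."""
    tokens_a = {t for v in values_a for t in v.lower().split()}
    tokens_b = {t for v in values_b for t in v.lower().split()}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Illustrative triples built from values used elsewhere in the slides.
TRIPLES = [
    ("yago:Statue_of_Liberty", "yago:isLocatedIn", "yago:Liberty_Island"),
    ("yago:Statue_of_Liberty", "skos:prefLabel", '"Statue of Liberty"'),
    ("dbpedia:Statue_of_Liberty", "dbprop:location", "dbpedia:Liberty_Island"),
    ("dbpedia:Statue_of_Liberty", "rdfs:label", '"Statue of Liberty"'),
]

NEIGHBOURS, LITERALS = build_entity_graph(TRIPLES)
SIM = literal_similarity(LITERALS["yago:Statue_of_Liberty"],
                         LITERALS["dbpedia:Statue_of_Liberty"])
print(SIM)
```

The literal similarity is computed once per pair, while the neighbour sets feed the iterative, graph-based part of the matching.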
  
LINDA [Böhm et al. 2012]
•  Scalability: Entity graph partitions are processed in parallel
•  Each Map/Reduce node holds:
  – A partition of the graph along with the similarities of the entity description pairs in this partition
    • description pairs are stored in a priority queue in descending order wrt. their similarity
  – Fast merge-join-like access
•  Effectiveness: Messages from mappers to reducers, only for the pairs of descriptions that need similarity re-computation
LINDA ER Algorithm
•  Two matrices are used:
  – X captures the identified matches (binary matrix)
  – Y captures the pair-wise similarities (real values)
•  Initialization: common neighbors and string similarity of literals
•  Updates: Use the identified matches of X
•  Until the priority queue (extracted from Y) becomes empty:
  – Get the pair (ei, ej) with the highest similarity
    • (ei, ej) match by default
      – Update X: matches of ei are also matches of ej
  – Update the queue wrt. the new matches
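The loop above can be approximated in a single process with a binary heap. This is a simplified sketch, not the LINDA algorithm itself: similarities are given as a dictionary instead of matrix Y, the transitive update of X is omitted, conflicting pairs are skipped when popped rather than removed eagerly, and a neighbour pair of a new match is enqueued with a fixed, hypothetical score `context_sim` instead of a recomputed similarity. All scores in `INITIAL` are invented for illustration.

```python
import heapq


def source_of(entity):
    """Data source of a prefixed name, e.g. 'dbpedia' for 'dbpedia:X'."""
    return entity.split(":", 1)[0]


def greedy_match(initial_sims, neighbours, context_sim=0.5):
    # heapq is a min-heap, so similarities are negated.
    heap = [(-sim, a, b) for (a, b), sim in initial_sims.items()]
    heapq.heapify(heap)
    queued = set(initial_sims)
    partner = {}   # entity -> {source: matched entity of that source}
    matches = []   # accepted pairs (the role of matrix X)
    while heap:
        _, a, b = heapq.heappop(heap)
        # An entity is mapped to at most one entity per data source.
        if source_of(b) in partner.get(a, {}) or source_of(a) in partner.get(b, {}):
            continue
        matches.append((a, b))
        partner.setdefault(a, {})[source_of(b)] = b
        partner.setdefault(b, {})[source_of(a)] = a
        # Neighbour pairs are in the context of the new match: enqueue them.
        for na in neighbours.get(a, ()):
            for nb in neighbours.get(b, ()):
                if (na, nb) not in queued:
                    queued.add((na, nb))
                    heapq.heappush(heap, (-context_sim, na, nb))
    return matches


INITIAL = {
    ("dbpedia:Liberty_Island", "yago:Liberty_Island"): 0.9,
    ("dbpedia:Statue_of_Liberty", "yago:Liberty_Island"): 0.4,
}
GRAPH = {
    "dbpedia:Liberty_Island": ["dbpedia:Statue_of_Liberty"],
    "dbpedia:Statue_of_Liberty": ["dbpedia:Liberty_Island"],
    "yago:Liberty_Island": ["yago:Statue_of_Liberty"],
    "yago:Statue_of_Liberty": ["yago:Liberty_Island"],
}

MATCHES = greedy_match(INITIAL, GRAPH)
print(MATCHES)
```

With these inputs the Liberty Island pair is accepted first, the Statue of Liberty pair is then enqueued and accepted, and the pair that would map dbpedia:Statue_of_Liberty to yago:Liberty_Island is discarded by the one-match-per-source rule.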
  	
  
[Figure: entity graph. Nodes: dbpedia:Statue_of_Liberty, dbpedia:Liberty_Island, dbpedia:Frédéric_Auguste_Bartholdi, yago:Statue_of_Liberty, yago:Liberty_Island, fb:m.072p8, fb:m.0jph6. Edge labels: dbprop:location, dbprop:sculptor, yago:isLocatedIn, fb:architect]
Priority Queue:
(dbpedia:Liberty_Island, yago:Liberty_Island)
(dbpedia:Statue_of_Liberty, yago:Liberty_Island)
(fb:m.072p8, dbpedia:Liberty_Island)
[Figure: the same entity graph, with one match marked]
Priority Queue:
(dbpedia:Liberty_Island, yago:Liberty_Island)
(dbpedia:Statue_of_Liberty, yago:Liberty_Island)
(fb:m.072p8, dbpedia:Liberty_Island)
[Figure: the same entity graph, with one match marked]
Priority Queue:
(dbpedia:Liberty_Island, yago:Liberty_Island)
(dbpedia:Statue_of_Liberty, yago:Liberty_Island)
(fb:m.072p8, dbpedia:Liberty_Island)
dequeue these pairs, as each entity can be mapped to at most one entity per data source
[Figure: the same entity graph, with one match marked]
Priority Queue:
(dbpedia:Liberty_Island, yago:Liberty_Island)
(dbpedia:Statue_of_Liberty, yago:Statue_of_Liberty)
enqueue this pair, as it is in the context of the newly-found match
[Figure: the same entity graph, with two matches marked]
Priority Queue:
(dbpedia:Statue_of_Liberty, yago:Statue_of_Liberty)
LINDA
Distribute the input entity graph across a cluster
•  A node i holds a portion Qi of the priority queue and the respective part Gi of the graph
Map phase
•  Mapper i reads Qi and forwards messages to reducers for similarities re-computations
  – Matrix X of identified matches is updated
Reduce phase
•  Similarities re-computations (Matrix Y)
•  Updates on priority queues
Frame Linked Data Preservation as a Sustainable Economic Activity
•  Economic activity: deliberate allocation of resources
  – Cost of losing datasets
•  Sustainable: ongoing resource allocation over long periods of time
  – Involved data subjects
•  Articulate the problem/provide recommendations & guidelines
  – Economic and societal benefits
[Figure labels: Technical, Social, Economic]
Sustainability Conditions
•  Who benefits from use of the preserved data?
•  Who selects what data to preserve?
•  Who owns the data?
•  Who preserves the data?
•  Who pays both for data and preservation services?

•  recognition of the benefits of preservation by decision makers
•  selection of datasets with long-term value
•  incentives for decision makers to act in the public interest or to elaborate new business models
•  appropriate governance of preservation activities
•  ongoing and efficient allocation of resources to preservation
•  timely actions to ensure long-term data access and usability
Conclusions
•  We need new abstractions bridging closer data creation, processing, publication and processing
  – Diachronic Data: Data annotated with temporal and provenance information self-describing their evolution history
  – Preserve (semi-)structured, interrelated, evolving data by keeping them constantly accessible & reusable from an open framework such as the Data Web
•  We need new business models for spreading data publication and archiving costs among data stakeholders
  – Pay-as-you-go data preservation as data products are re-used through complex value making chains (both memory institutions and data market places)
Acknowledgements
•  EU IP Project No 601043
  – http://www.diachron-fp7.eu
•  EU CSA Project No 600663
  – http://www.prelida.eu
Collaborators
•  Vassilis Efthimiou (University of Crete)
•  Kostas Stefanidis (FORTH/ICS)
•  Grigoris Karvounarakis (LogicBlox)
•  Giorgos Flouris (FORTH/ICS)
Questions
References
•  Böhm, C., de Melo, G., Naumann, F., Weikum, G.: LINDA: distributed web-of-data-scale entity matching. In CIKM 2012.
•  Geerts, F., Karvounarakis, G., Christophides, V., Fundulaki, I.: Algebraic structures for capturing the provenance of SPARQL queries. In ICDT 2013.
•  Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In WSDM 2012.
•  Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C., Nejdl, W.: A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. Knowl. Data Eng. (2013), to appear.
•  Papavasileiou, V., Flouris, G., Fundulaki, I., Kotzinos, D., Christophides, V.: High-level change detection in RDF(S) KBs. ACM Trans. Database Syst. 38, 1, April 2013.
•  High-Level Group on Scientific Data: "Riding the Wave: how Europe can gain from the rising tide of scientific data". © European Union, 2010. cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
•  Blue Ribbon Task Force on Sustainable Digital Preservation and Access: Final report, 2010. brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
Data Science: A Multidisciplinary Challenge
Source: www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
  
Data Science Research Agenda
Acquisition, Storage, and Management of "Big Data"
•  Data representation, storage, and retrieval
•  New parallel data architectures, including clouds
•  Data management policies, including privacy and secure access
•  Communication and storage devices with extreme capacities
•  Sustainable economic models for access and preservation
Data Analytics
•  Computational, mathematical, statistical, and algorithmic techniques for modeling high dimensional data
•  Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets
•  Data mining to enable automated hypothesis generation, event correlation, and anomaly detection
•  Information infusion of multiple data sources
Data Sharing and Collaboration
•  Tools for distant data sharing, real time visualization, and software reuse of complex data sets
•  Cross disciplinary model, information and knowledge sharing
•  Remote operation and real time access to distant data sources and instruments
Source: Big Data R&D Initiative, Howard Wactlar, NIST Big Data Meeting, June 2012
Towards Data Accountability
  

ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Syrtaki - ESWC SSchool 14 - Student project
 
Keep fit (a bit) - ESWC SSchool 14 - Student project
Keep fit (a bit)  - ESWC SSchool 14 - Student projectKeep fit (a bit)  - ESWC SSchool 14 - Student project
Keep fit (a bit) - ESWC SSchool 14 - Student project
 
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student projectArabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student project
 
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student projectFIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
 
Personal Tours at the British Museum - ESWC SSchool 14 - Student project
Personal Tours at the British Museum  - ESWC SSchool 14 - Student projectPersonal Tours at the British Museum  - ESWC SSchool 14 - Student project
Personal Tours at the British Museum - ESWC SSchool 14 - Student project
 
Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...
 
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
 
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
 
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
 
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014 Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
 
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
 
Mon norton tut_publishing01
Mon norton tut_publishing01Mon norton tut_publishing01
Mon norton tut_publishing01
 
Mon domingue introduction to the school
Mon domingue introduction to the schoolMon domingue introduction to the school
Mon domingue introduction to the school
 
Mon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage dataMon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage data
 
Tue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddataTue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddata
 
Thu bernstein key_warp_speed
Thu bernstein key_warp_speedThu bernstein key_warp_speed
Thu bernstein key_warp_speed
 
Fri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineeringFri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineering
 
Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02
 
Mon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked dataMon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked data
 

Recently uploaded

AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxsqpmdrvczh
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 

Recently uploaded (20)

AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 

ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

  • 1. Preserving Linked Data: Challenges and Opportunities. Vassilis Christophides, University of Crete & FORTH-ICS, Heraklion, Crete
  • 2. A bit of History: from Web 1.0 & 2.0 … Many Web sites containing unstructured, textual content. Few large Web sites are specialized on specific content types • Semi-structured/XML content floating around e-services
  • 3. A bit of History: …to Web 3.0. Data as a Service (DaaS). Syndicating arbitrarily semi-structured content across Web sites using higher-level data abstractions (entities). Data themselves become “infrastructure” • a valuable asset, on which science, technology, the economy and society can advance
  • 4. The Emerging Web of Data • Global data space connecting data from diverse domains and sources • Primary objects: “things” (or description of things) • Links between “things” • Granularity of information: from entire datasets to atomic. Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, available at http://linkeddata.org/guides-and-tutorials. [Figure: Web of Data, conceptual representation: entities connected by typed links and identified by URIs, represented by spreadsheets, HTML, XML and RDFa sources (semi-structured, triples, statistical)] A Web of things in the world, described by data on the Web
  • 5. Entities: an Invaluable Asset • “Entities” is what a large part of our knowledge is
  • 6. Web Data of Increasing Standardization. Not all linked data is open and not all open data is linked! ★ Available on the web (whatever format) but with an open license, to be Open Data ★★ Available as machine-readable structured data (e.g. excel vs. image scan of a table) ★★★ as (2) plus non-proprietary format (e.g. CSV instead of excel) ★★★★ as (3), plus using open standards from W3C (RDF and SPARQL) to identify things through dereferenceable HTTP URIs, to ensure effective access ★★★★★ as all the above plus establishing links between data of different sources. File format recommendations (on a scale of 0-5): csv ★★★; xls ★; pdf ★; doc ★; xml ★★★★; rdf ★★★★★; shp ★★★; ods ★★; tiff ★; jpeg ★; json ★★★; txt ★; html ★★
  • 7. The LOD Cloud. [Figure: LOD cloud diagram; legend: Media, Government, Geo, Publications, User-generated, Life sciences, Cross-domain]
  • 8. Basic Terminology • A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider • A linkset is a collection of RDF links between two datasets, i.e. triples whose subject & object are described in different datasets • But what do we really know about the production and curation processes of the sources publishing in RDF?
  • 9. What is Digital Preservation (DP)? • Ensure accessibility and usability of digital objects over time and across domains, and protect them from media failure, physical loss, & hardware/software obsolescence • Traditionally, the objective is to preserve in the long run the authenticity of a digital object as originally recorded against any technological change – extensive annotation of digital objects with information related to their significant properties • content format • context of production • structural meta-data • current behavior, …
  • 10. Linked Data vs Digital Objects • DP techniques proposed for memory institutions and data centers concern fixed digital objects featuring mostly unstructured data – raw data sets held in files, scholarly data held in papers… • Linked Data are digitally-born objects which – are graph-structured, optionally satisfying integrity constraints (expressed in higher logic formalisms) – exhibit complex interdependencies across sources as well as varying data quality (curated knowledge bases vs extracted from text or Web 2.0 sources) – change without notification at different granularity levels to keep them fit for contemporary purposes, and be available for discovery and re-use in the future
  • 11. Linked Data: Behind the scenes! [Figure: entity description shown as a table of property names and property values]
  • 12. Different Descriptions of the same Entity. URI dbpedia:Statue_of_Liberty: rdfs:label = Statue of Liberty, Freiheitsstatue, …; dbpprop:location = New York City, New York, U.S., dbpedia:Liberty_Island; dbpprop:sculptor = dbpedia:Frédéric_Auguste_Bartholdi; dcterms:subject = dbpedia_category:1886_sculptures, …; foaf:isPrimaryTopicOf = http://en.wikipedia.org/wiki/Statue_of_Liberty; dbpprop:beginningDate = 1886-10-28 (xsd:date); dbpprop:restored = 19381984 (xsd:integer); dbpprop:visitationNum = 3200000 (xsd:integer); dbpprop:visitationYear = 2009 (xsd:integer); http://www.w3.org/ns/prov#wasDerivedFrom = http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330. URI fb:m.072p8: fb:art_form = fb:m.06msq (Sculpture); fb:media = fb:m.025rsfk (Copper); fb:architect = fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt); fb:height_meters = 93; fb:opened = 1886-10-28. URI yago:Statue_of_Liberty: skos:prefLabel = Statue of Liberty; rdf:type = yago:History_museums_in_NY, yago:GeoEntity; yago:hasHeight = 46.0248; yago:wasCreatedOnDate = 1886-##-##; yago:isLocatedIn = yago:Manhattan, yago:Liberty_Island; yago:hasWikipediaUrl = http://en.wikipedia.org/wiki/Statue_of_Liberty
  • 13. Linked Datasets Depend on Vocabularies. URI dbpedia:Statue_of_Liberty: rdfs:label = Statue of Liberty, Freiheitsstatue, …; dbpprop:location = New York City, New York, U.S., dbpedia:Liberty_Island; dbpprop:sculptor = dbpedia:Frédéric_Auguste_Bartholdi; dcterms:subject = dbpedia_category:1886_sculptures, …; foaf:isPrimaryTopicOf = http://en.wikipedia.org/wiki/Statue_of_Liberty; dbpprop:beginningDate = 1886-10-28 (xsd:date); dbpprop:restored = 19381984 (xsd:integer); dbpprop:visitationNum = 3200000 (xsd:integer); dbpprop:visitationYear = 2009 (xsd:integer); http://www.w3.org/ns/prov#wasDerivedFrom = http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330. URI fb:m.072p8: fb:art_form = fb:m.06msq (Sculpture); fb:media = fb:m.025rsfk (Copper); fb:architect = fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt); fb:height_meters = 93; fb:opened = 1886-10-28. URI yago:Statue_of_Liberty: skos:prefLabel = Statue of Liberty; rdf:type = yago:History_museums_in_NY, yago:GeoEntity; yago:hasHeight = 46.0248; yago:wasCreatedOnDate = 1886-##-##; yago:isLocatedIn = yago:Manhattan, yago:Liberty_Island; yago:hasWikipediaUrl = http://en.wikipedia.org/wiki/Statue_of_Liberty
  • 14. Linked Datasets Have Varying Quality. URI dbpedia:Statue_of_Liberty: rdfs:label = Statue of Liberty, Freiheitsstatue, …; dbpprop:location = New York City, New York, U.S., dbpedia:Liberty_Island; dbpprop:sculptor = dbpedia:Frédéric_Auguste_Bartholdi; dcterms:subject = dbpedia_category:1886_sculptures, …; foaf:isPrimaryTopicOf = http://en.wikipedia.org/wiki/Statue_of_Liberty; dbpprop:beginningDate = 1886-10-28 (xsd:date); dbpprop:restored = 19381984 (xsd:integer); dbpprop:visitationNum = 3200000 (xsd:integer); dbpprop:visitationYear = 2009 (xsd:integer); http://www.w3.org/ns/prov#wasDerivedFrom = http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330. URI fb:m.072p8: fb:art_form = fb:m.06msq (Sculpture); fb:media = fb:m.025rsfk (Copper); fb:architect = fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt); fb:height_meters = 93; fb:opened = 1886-10-28. URI yago:Statue_of_Liberty: skos:prefLabel = Statue of Liberty; rdf:type = yago:History_museums_in_NY, yago:GeoEntity; yago:hasHeight = 46.0248; yago:wasCreatedOnDate = 1886-##-##; yago:isLocatedIn = yago:Manhattan, yago:Liberty_Island; yago:hasWikipediaUrl = http://en.wikipedia.org/wiki/Statue_of_Liberty
  • 15. Linked Datasets Evolve Over Time. Current version of DBpedia; previous version of DBpedia. URI dbpedia:Statue_of_Liberty: rdfs:label = Statue of Liberty, Freiheitsstatue, …; dbpprop:location = New York City, New York, U.S., dbpedia:Liberty_Island; dbpprop:sculptor = dbpedia:Frédéric_Auguste_Bartholdi; dcterms:subject = dbpedia_category:1886_sculptures, …; foaf:isPrimaryTopicOf = http://en.wikipedia.org/wiki/Statue_of_Liberty; dbpprop:beginningDate = 1886-10-28 (xsd:date); dbpprop:restored = 19381984 (xsd:integer); dbpprop:visitationNum = 3200000 (xsd:integer); dbpprop:visitationYear = 2009 (xsd:integer); http://www.w3.org/ns/… = http://en.wikipedia.org/wiki/Statue_of_Liberty?… URI dbpedia:Statue_of_Liberty: rdfs:label = Statue of Liberty, Freiheitsstatue, …; dbpprop:location = New York City, New York, U.S., dbpedia:Liberty_Island; dbpprop:sculptor = dbpedia:Frédéric_Auguste_Bartholdi; dcterms:subject = dbpedia_category:1886_sculptures, …; foaf:isPrimaryTopicOf = http://en.wikipedia.org/wiki/Statue_of_Liberty; dbpprop:built = 1886-10-28 (xsd:date); dbpprop:restored = 19381984 (xsd:integer); dbpprop:hasHeight = 151 (xsd:integer); http://www.w3.org/ns/prov#wasDerivedFrom = http://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330
  • 16. DP Challenges for Linked Data • The ‘publish-first-refine-later’ philosophy of the Linked Data movement, complemented by the open, decentralized nature of the Web, results in Data – Incompleteness: real world entities are usually partially described in data sources – Redundancy: the same real world entities are represented in multiple data sources – Inconsistency: various forms of inter- and intra-source data conflicts – Incorrectness: errors can be propagated from one source to the other due to copying • Mastering the varying data quality is a prerequisite for trusting preserved data originating from various sources
  • 17. DP Challenges for Linked Data • Still, data publishing and preservation are two largely separate processes, addressed by the SW and DP communities respectively. – Shouldn’t the “publishers” worry about preserving all their hard work? – Shouldn’t the “preservers” be concerned about the way they organize, link, and annotate their linked data? • Need to break down the traditional boundaries between the data creators & publishers and the data archivists & brokers – Integrate curation [when there is high current/ongoing interest] and preservation [when interest falls off] activities – Distribute preservation costs over the life-cycle of linked datasets among data stewards (pay-as-you-go data preservation)
  • 18. Vision 2030: High-Level Group on Scientific Data. “Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted” • How do we support future users in Trusting Data? – By assessing their quality w.r.t. the entities they describe – By recording from which sources they originate – By understanding how their identity and integrity have evolved over time – By ensuring that they have been preserved properly
  • 19. Frame Linked Data Preservation as a Research Problem • How can we identify that different resource descriptions within or across datasets refer to the same real-world entity (the entity resolution problem) to convey various aspects of the quality of the harvested datasets (e.g., redundancy, completeness, freshness)? • How can we record dependencies of datasets (the provenance problem) and how can they be smoothly represented along with other (temporal, spatial, thematic) metadata (the annotation problem)?
  • 20. Frame Linked Data Preservation as a Research Problem • How can we monitor changes of third-party datasets (the evolution tracking problem) or how can local/remote data imperfections (e.g., due to change propagation) be repaired (the curation problem)? • How can we appraise which versions of linked datasets should be preserved for future use (the multi-version archive consistency problem) and how will we be able to ask a query not only about any past state of the dataset but also about the evolution of some part of it (the longitudinal querying problem)?
  • 21. ER Example. [Figure: RDF graph of the descriptions dbpedia:Statue_of_Liberty, fb:m.072p8 and yago:Statue_of_Liberty with their properties (e.g. dbprop:location, dbprop:sculptor, dbprop:visitationNum, dcterms:subject, fb:architect, fb:art_form, skos:prefLabel, rdf:type, yago:isLocatedIn, yago:wasCreatedOnDate) and values] Entity resolution: The problem of identifying descriptions of the same entity within one or across multiple data sources w.r.t. a match function • No longer just matching of entity names
  • 22. ER Example. [Figure: owl:sameAs links connecting matching descriptions (e.g. dbpedia:Statue_of_Liberty and yago:Statue_of_Liberty), with entities grouped into Persons, Places and Artifacts and related through edges such as fb:architect and yago:isLocatedIn] An entity resolution is a partition of a set of entity descriptions, such that: 1. Matching entity descriptions are placed in the same subset 2. All the descriptions of the same subset match => Need to infer also other kinds of relationships than “equivalence”
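The partition view of entity resolution can be sketched in Python with a union-find structure: pairs reported by a match function are merged, and each resulting group stands for one resolved entity. This is a minimal sketch, not the method of the talk. The `resolve` helper and the list of match pairs are hypothetical, and the sketch assumes the match relation is treated as transitive, which is what makes condition 2 hold for every group.

```python
from collections import defaultdict


def resolve(descriptions, match_pairs):
    """Partition descriptions into groups connected by match pairs (union-find)."""
    parent = {d: d for d in descriptions}

    def find(x):
        # Walk up to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in descriptions:
        groups[find(d)].add(d)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical input: identifiers follow the example, the match pairs are assumed.
descriptions = [
    "dbpedia:Statue_of_Liberty", "yago:Statue_of_Liberty", "fb:m.072p8",
    "dbpedia:Liberty_Island", "yago:Liberty_Island",
    "dbpedia:Frédéric_Auguste_Bartholdi", "fb:m.0jph6",
]
match_pairs = [
    ("dbpedia:Statue_of_Liberty", "yago:Statue_of_Liberty"),
    ("yago:Statue_of_Liberty", "fb:m.072p8"),
    ("dbpedia:Liberty_Island", "yago:Liberty_Island"),
    ("dbpedia:Frédéric_Auguste_Bartholdi", "fb:m.0jph6"),
]

partition = resolve(descriptions, match_pairs)
for group in partition:
    print(sorted(group))
```

With these assumed pairs the seven descriptions fall into three groups, one each for the artifact, the place and the person.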
  • 23. What Makes ER Difficult for Linked Data • Linked Data are inherently semi-structured – several semantic types (see rdf:type properties in Yago) could be simultaneously employed, resulting in entity descriptions even of the same type (persons, places, …) with quite different structures => Deal with loosely structured entities • Linked Data heavily rely on heterogeneous vocabularies – DBPedia 3.4: 50,000 properties – Google Base: 100,000 schemata and 10,000 entity types => Need for cross-domain techniques • Linked Data are Big Data – The LOD cloud consists of 32 billion RDF triples (last update: 2011) – DBPedia 3.4: 36.5 million triples, 2.1 million entity descriptions – BTC09: 1.15 billion triples, 182 million entity descriptions => Calls for efficient parallel techniques
  • 24. The Role of Similarity Functions. [Figure: nested sets] Set of all pairs of entity descriptions; matching pairs of entity descriptions; pairs of entity descriptions satisfying a strict similarity function
  • 25. The Role of Similarity Functions. [Figure: nested sets] Set of all pairs of entity descriptions; matching pairs of entity descriptions; pairs of entity descriptions satisfying a loose similarity function
  • 26. Entity Collections and ER Types • 2 kinds of entity collections given as input to an ER task: – Clean, which are duplicate-free (e.g., DBPedia, Freebase, DBLP) – Dirty, which contain duplicate entity descriptions in themselves (e.g., Google Scholar, Citeseer) • An ER task that receives as input two entity collections can be of the following types: – Clean-Clean ER: Given two clean, but overlapping entity collections, identify the common entity descriptions (a.k.a. Record Linkage in databases) – Dirty-Clean ER: – Dirty-Dirty ER: Identify unique entity descriptions contained in the union of the input entity collections (a.k.a. the Deduplication problem in databases)
  • 27. Scaling ER to the Web of Data • Blocking to reduce the number of comparisons: – Split entity descriptions into blocks – Compare each description to the descriptions within the same block • Desiderata – Similar entity descriptions in the same block – Dissimilar entity descriptions in different blocks • Blocking approaches are distinguished between: – Partitioning, where each description is placed in exactly one block: Fewer comparisons – Overlapping, where each description can be placed in more than one block: More identified matches
  • 28. Blocking techniques for Linked Data • Multi-relational and cross-domain entity resolution – Token blocking – Property clustering – Prefix-Infix(-Suffix) • Large-scale entity resolution – Choose a computationally inexpensive similarity function – Process partitions of the entity graph in parallel on Map/Reduce nodes
  • 29. Token Blocking [Papadakis et al. 2011] • Ignores (semantic or structured) types of entities – String similarity of tokens of property literal values • Each distinct token of each property value of each entity description corresponds to a block – Each block contains all entities with the corresponding token • High recall at the cost of low precision and low efficiency: – Most true matches are placed in the same block – Many non-matches are also placed in the same block – The same pair of descriptions is contained in many blocks
  • 31. Token Blocking - Example. Entity descriptions: e1 = {(name, Eiffel Tower), (architect, Sauvestre), (year, 1889), (location, Paris)}; e2 = {(name, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}; e3 = {(about, Lady Liberty), (architect, Eiffel), (location, NY)}; e4 = {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}; e5 = {(name, White Tower), (year-constructed, 1450), (location, Thessaloniki)}. Generated blocks: Eiffel: e1, e2, e3, e4; Tower: e1, e4, e5; Statue: e2; Liberty: e2, e3; White: e5; 1889: e1, e4; Bartholdi: e2; NY: e2, e3; Sauvestre: e1, e4; Paris: e1, e4; 1886: e2; 1450: e5; Lady: e3; Thessaloniki: e5. The pair (e1, e4) is contained in 5 different blocks!
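The example can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation of Papadakis et al.: tokens are obtained by whitespace splitting, the word "of" is dropped through a hypothetical `STOP_WORDS` set because the example lists no block for it, and the `token_blocking` helper is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations

STOP_WORDS = {"of"}  # assumption: the example's blocks contain no "of" block


def token_blocking(entities):
    """Map each distinct token of any property value to the entities containing it."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for _prop, value in description:  # property names are ignored
            for token in value.split():
                if token not in STOP_WORDS:
                    blocks[token].add(entity_id)
    return dict(blocks)


entities = {
    "e1": [("name", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("location", "Paris")],
    "e2": [("name", "Statue of Liberty"), ("architect", "Bartholdi Eiffel"),
           ("year", "1886"), ("located", "NY")],
    "e3": [("about", "Lady Liberty"), ("architect", "Eiffel"), ("location", "NY")],
    "e4": [("about", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("located", "Paris")],
    "e5": [("name", "White Tower"), ("year-constructed", "1450"),
           ("location", "Thessaloniki")],
}

blocks = token_blocking(entities)

# Blocks that contain both e1 and e4.
shared = [t for t, members in blocks.items() if {"e1", "e4"} <= members]

# Comparisons suggested by the blocks: summed per block, and after deduplication.
per_block = sum(len(m) * (len(m) - 1) // 2 for m in blocks.values())
distinct_pairs = {pair for m in blocks.values()
                  for pair in combinations(sorted(m), 2)}

print(sorted(shared))
print(per_block, len(distinct_pairs))
```

On these five descriptions the sketch builds 14 blocks, and e1 and e4 share five of them (Eiffel, Tower, 1889, Sauvestre, Paris). Comparing within every block costs 14 comparisons for only 8 distinct pairs, which illustrates the redundancy that token blocking trades for recall.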
  • 32. Property clustering [Papadakis et al. 2013] • Assuming two duplicate-free datasets – Recognize similarity of properties based on the string similarity of their literal values occurring in entity descriptions • Two main blocking steps: 1. Similar properties are placed together in non-overlapping clusters 2. Token blocking is performed on the descriptions of each cluster
  • 33. Clustering Entity Properties 1. For each property of dataset D1: – Find the most similar property of dataset D2 2. For each property of dataset D2: – Find the most similar property of dataset D1 3. Compute the transitive closure of the generated pairs of similar properties 4. Similar properties form clusters 5. All singleton clusters are merged into a common one
  • 34. Clustering Entity Properties: Example D1: e1 = {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)} e2 = {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)} e3 = {(about, Auguste Bartholdi), (born, 1834)} e4 = {(about, Joan Tower), (born, 1938)} D2: e5 = {(work, Lady Liberty), (artist, Bartholdi), (location, NY)} e6 = {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)} e7 = {(work, Bartholdi Fountain), (year-constructed, 1876), (location, Washington D.C.)}
  • 35. Clustering Entity Properties: Example Finding the property of D2 that is the most similar to the property "about" of D1: values of about: {Eiffel, Tower, Statue, Liberty, Auguste, Bartholdi, Joan, Tower} compared (with Jaccard similarity) to: values of work: {Lady, Liberty, Eiffel, Tower, Bartholdi, Fountain} → Jaccard = 4/10; values of artist: {Bartholdi} → Jaccard = 1/8; values of location: {NY, Paris, Washington, D.C.} → Jaccard = 0; values of year-constructed: {1889, 1876} → Jaccard = 0
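The comparison above can be checked with a short Python sketch. It assumes whitespace tokenization and, as an assumption needed to reproduce the slide's figures, counts "of" as a token of "about": with it, "about" has 8 distinct tokens, which yields exactly 4/10 for "work" and 1/8 for "artist". The helper names `property_tokens` and `jaccard` are illustrative.

```python
from fractions import Fraction

# D1 and D2 as given on the slides.
d1 = {
    "e1": [("about", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("located", "Paris")],
    "e2": [("about", "Statue of Liberty"), ("architect", "Bartholdi Eiffel"),
           ("year", "1886"), ("located", "NY")],
    "e3": [("about", "Auguste Bartholdi"), ("born", "1834")],
    "e4": [("about", "Joan Tower"), ("born", "1938")],
}
d2 = {
    "e5": [("work", "Lady Liberty"), ("artist", "Bartholdi"),
           ("location", "NY")],
    "e6": [("work", "Eiffel Tower"), ("year-constructed", "1889"),
           ("location", "Paris")],
    "e7": [("work", "Bartholdi Fountain"), ("year-constructed", "1876"),
           ("location", "Washington D.C.")],
}


def property_tokens(dataset):
    """Map each property name to the set of tokens of all its values."""
    tokens = {}
    for description in dataset.values():
        for prop, value in description:
            tokens.setdefault(prop, set()).update(value.split())
    return tokens


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| as an exact fraction."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


tokens_d1, tokens_d2 = property_tokens(d1), property_tokens(d2)
scores = {p: jaccard(tokens_d1["about"], toks) for p, toks in tokens_d2.items()}
best = max(scores, key=scores.get)  # most similar D2 property for "about"
print(best, scores)
```

Running the same loop for every property of D1 against D2, and then for every property of D2 against D1, gives the pairs used in the transitive closure step.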
  • 36. Clustering Entity Properties: Example D1 properties: about, architect, year, born, located; D2 properties: work, artist, year-constructed, location
  • 37. Clustering Entity Properties: Example • Similarly for the rest of the properties… D1 properties: about, architect, year, born, located; D2 properties: work, artist, year-constructed, location
  • 46. Clustering Entity Properties: Example • Compute the transitive closure of the generated property name pairs – Connected properties form clusters Pairs: (about, work), (work, about), (artist, architect), (architect, work) Transitive closure: C1 = {about, work, architect, artist}
  • 47. Clustering Entity Properties: Example • Compute the transitive closure of the generated property name pairs – Connected properties form clusters Pairs: (year, year-constructed), (year-constructed, year), (year-constructed, born) Transitive closure: C2 = {year, year-constructed, born}
  • 48. Clustering Entity Properties: Example • Compute the transitive closure of the generated property name pairs – Connected properties form clusters Pairs: (located, location), (location, located) Transitive closure: C3 = {location, located}
  • 49. Clustering Entity Properties: Example • Compute the transitive closure of the generated property name pairs – Connected properties form clusters • Generated property clusters: C1 = {about, work, architect, artist}; C2 = {year, year-constructed, born}; C3 = {location, located}
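The transitive closure step can be sketched with a union-find structure over the property pairs listed on the slides. This is one possible way to compute connected properties, not necessarily the authors' method; `cluster_properties` is a hypothetical helper, and property names are assumed to be unique across D1 and D2 (otherwise they would need a dataset qualifier).

```python
def cluster_properties(pairs, all_properties=()):
    """Group properties connected by 'most similar' pairs (steps 3 to 5)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 3: transitive closure, by merging the two sides of every pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    for prop in all_properties:  # register properties that have no pair
        find(prop)

    # Step 4: connected properties form clusters.
    groups = {}
    for prop in list(parent):
        groups.setdefault(find(prop), set()).add(prop)
    clusters = [g for g in groups.values() if len(g) > 1]

    # Step 5: all singleton clusters are merged into a common one.
    glue = {p for g in groups.values() if len(g) == 1 for p in g}
    if glue:
        clusters.append(glue)
    return clusters


# Pairs exactly as listed on the slides.
pairs = [
    ("about", "work"), ("work", "about"),
    ("artist", "architect"), ("architect", "work"),
    ("year", "year-constructed"), ("year-constructed", "year"),
    ("year-constructed", "born"),
    ("located", "location"), ("location", "located"),
]
clusters = cluster_properties(pairs)
print(clusters)
```

In this example every property takes part in at least one pair, so no singleton cluster arises and step 5 has no effect.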
  • 50. Token Blocking for Each Cluster Property clusters: C1 = {about, work, architect, artist}; C2 = {year, year-constructed, born}; C3 = {location, located}. Some of the generated blocks: C3.NY: e2, e5; C1.Tower: e1, e4, e6; C1.Bartholdi: e2, e3, e5, e7 (callout: compare Lady Liberty to Auguste Bartholdi)
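Token blocking per property cluster only changes the block key: a token is qualified by the cluster of the property it occurs in. The sketch below hard-codes the clusters C1 to C3 from the slides; the function name `clustered_token_blocking` and the `Cx.token` key format (borrowed from the slide's labels) are illustrative choices.

```python
from collections import defaultdict

# Property clusters as generated on the previous slides.
cluster_of = {
    "about": "C1", "work": "C1", "architect": "C1", "artist": "C1",
    "year": "C2", "year-constructed": "C2", "born": "C2",
    "located": "C3", "location": "C3",
}

# Entity descriptions of D1 (e1 to e4) and D2 (e5 to e7) from the slides.
entities = {
    "e1": [("about", "Eiffel Tower"), ("architect", "Sauvestre"),
           ("year", "1889"), ("located", "Paris")],
    "e2": [("about", "Statue of Liberty"), ("architect", "Bartholdi Eiffel"),
           ("year", "1886"), ("located", "NY")],
    "e3": [("about", "Auguste Bartholdi"), ("born", "1834")],
    "e4": [("about", "Joan Tower"), ("born", "1938")],
    "e5": [("work", "Lady Liberty"), ("artist", "Bartholdi"),
           ("location", "NY")],
    "e6": [("work", "Eiffel Tower"), ("year-constructed", "1889"),
           ("location", "Paris")],
    "e7": [("work", "Bartholdi Fountain"), ("year-constructed", "1876"),
           ("location", "Washington D.C.")],
}


def clustered_token_blocking(entities, cluster_of):
    """One block per (property cluster, token) combination."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for prop, value in description:
            for token in value.split():
                blocks[f"{cluster_of[prop]}.{token}"].add(entity_id)
    return dict(blocks)


blocks = clustered_token_blocking(entities, cluster_of)
for key in ("C3.NY", "C1.Tower", "C1.Bartholdi"):
    print(key, sorted(blocks[key]))
```

The block C1.Bartholdi still groups e5 (Lady Liberty) with e3 (Auguste Bartholdi), which is the unnecessary comparison the slide calls out.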
  • 51. Prefix-Infix(-Suffix) [Papadakis et al. 2012] • How can we exploit the semantics of URIs to better match entity descriptions? – E.g. 66% of the 182 million URIs of BTC09 (km.aifb.kit.edu/projects/btc-2009) follow the scheme: Prefix-Infix(-Suffix) • Prefix describes the source, i.e. domain, of the URI • Infix is a local identifier • The optional Suffix contains details about the format, e.g. .rdf and .nt, or a named anchor • Token blocking on the Infixes appearing in the resource values of properties (as subject or object)
  • 52. Prefix-Infix(-Suffix) [Papadakis et al. 2012] E.g. (Infix-profile): e1 = {(skos:prefLabel, Statue of Liberty), (yago:isLocatedIn, yago:Liberty_Island)} e2 = {(rdfs:label, Statue of Liberty), (dbpprop:location, dbpedia:Liberty_Island)} e3 = {(freeb:official_name, Statue of Liberty), (freeb:containedby, freeb:m.026kp2)} e4 = {(geonames:name, Statue of Liberty), (geonames:nearby, geonames:5124330)} Generated blocks: Liberty_Island: e1, e2; m.026kp2: e3; 5124330: e4. Note: The effectiveness of the approach relies on the good naming practices of the data publishers
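A rough idea of infix-based blocking is sketched below. The splitting rule is a deliberately simple heuristic and an assumption, not the procedure of Papadakis et al.: the Infix is taken to be the last path segment of the URI, a named anchor is ignored, and only the two format suffixes named on the slide are stripped. The URIs are hypothetical `example.org` stand-ins for the prefixed names on the slide.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Format suffixes named on the slide; a real system would know more.
FORMAT_SUFFIXES = (".rdf", ".nt")


def infix(uri):
    """Heuristic Infix: last path segment without anchor or format suffix."""
    path = urlsplit(uri).path.rstrip("/")  # .path excludes any '#anchor'
    local = path.rsplit("/", 1)[-1]
    for suffix in FORMAT_SUFFIXES:
        if local.endswith(suffix):
            return local[: -len(suffix)]
    return local


# Hypothetical full URIs standing in for the resource values of e1..e4.
resource_values = {
    "e1": ["http://yago.example.org/resource/Liberty_Island"],
    "e2": ["http://dbpedia.example.org/resource/Liberty_Island.rdf"],
    "e3": ["http://freebase.example.org/ns/m.026kp2"],
    "e4": ["http://geonames.example.org/5124330/"],
}


def infix_blocking(resource_values):
    """Token blocking where each Infix is the blocking key."""
    blocks = defaultdict(set)
    for entity_id, uris in resource_values.items():
        for uri in uris:
            blocks[infix(uri)].add(entity_id)
    return dict(blocks)


blocks = infix_blocking(resource_values)
print(blocks)
```

As on the slide, only e1 and e2 end up in a common block, because the other two sources use opaque identifiers for Liberty Island.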
  • 53. LINDA [Böhm et al. 2012] • Works on an entity graph constructed from RDF triples by considering the URIs appearing in their subject, predicate and object positions • Matches are identified using two kinds of similarities: – Descriptions are similar wrt. a string similarity of their literal values: checked once – Descriptions have similar neighbours in the entity graph: checked iteratively
  • 54. LINDA [Böhm et al. 2012] • Scalability: Entity graph partitions are processed in parallel • Each Map/Reduce node holds: – A partition of the graph along with the similarities of the entity description pairs in this partition • description pairs are stored in a priority queue in descending order wrt. their similarity – Fast merge-join-like access • Effectiveness: Messages from mappers to reducers, only for the pairs of descriptions that need similarity re-computation
  • 55. LINDA ER Algorithm • Two matrices are used: – X captures the identified matches (binary matrix) – Y captures the pair-wise similarities (real values) • Initialization: common neighbors and string similarity of literals • Updates: Use the identified matches of X • Until the priority queue (extracted from Y) becomes empty: – Get the pair (ei, ej) with the highest similarity • (ei, ej) match by default – Update X: matches of ei are also matches of ej – Update the queue wrt. the new matches
  • 56. [Figure: LINDA example, entity graph and priority queue] Nodes: dbpedia:Statue_of_Liberty, dbpedia:Liberty_Island, dbpedia:Frédéric_Auguste_Bartholdi, yago:Statue_of_Liberty, yago:Liberty_Island, fb:m.072p8, fb:m.0jph6. Edge labels: dbprop:location, dbprop:sculptor, yago:isLocatedIn, fb:architect. Priority queue: (dbpedia:Liberty_Island, yago:Liberty_Island), (dbpedia:Statue_of_Liberty, yago:Liberty_Island), (fb:m.072p8, dbpedia:Liberty_Island)
  • 57. [Figure: same entity graph as slide 56] Priority queue: (dbpedia:Liberty_Island, yago:Liberty_Island), (dbpedia:Statue_of_Liberty, yago:Liberty_Island), (fb:m.072p8, dbpedia:Liberty_Island). Annotation: match
  • 58. [Figure: same entity graph as slide 56] Priority queue: (dbpedia:Liberty_Island, yago:Liberty_Island), (dbpedia:Statue_of_Liberty, yago:Liberty_Island), (fb:m.072p8, dbpedia:Liberty_Island). Annotations: match; dequeue these pairs, as each entity can be mapped to at most one entity per data source
  • 59. [Figure: same entity graph as slide 56] Priority queue: (dbpedia:Liberty_Island, yago:Liberty_Island), (dbpedia:Statue_of_Liberty, yago:Statue_of_Liberty). Annotations: match; enqueue this pair, as it is in the context of the newly-found match
  • 60. [Figure: same entity graph as slide 56] Priority queue: (dbpedia:Statue_of_Liberty, yago:Statue_of_Liberty). Annotations: match, match
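The queue-driven loop walked through above can be approximated on a single machine as follows. This is a heavily simplified sketch and not LINDA itself: the similarity values and the `bonus` added to pairs in the context of a new match are invented for illustration, only the DBpedia and YAGO nodes of the example are modelled, and the matrices X and Y are replaced by a list of matches and a heap. The function name `greedy_matching` is hypothetical.

```python
import heapq


def source(uri):
    """Data source of a prefixed name, e.g. 'dbpedia:Liberty_Island'."""
    return uri.split(":", 1)[0]


def greedy_matching(initial_sims, neighbours, bonus=0.5):
    """Pop the most similar pair, accept it, then update the queue."""
    # heapq is a min-heap, so similarities are negated.
    heap = [(-sim, a, b) for (a, b), sim in initial_sims.items()]
    heapq.heapify(heap)
    partner = {}  # (entity, other source) -> matched entity
    matches = []
    while heap:
        _neg_sim, a, b = heapq.heappop(heap)
        # Each entity is mapped to at most one entity per data source,
        # so pairs that conflict with an earlier match are dropped.
        if (a, source(b)) in partner or (b, source(a)) in partner:
            continue
        matches.append((a, b))
        partner[(a, source(b))] = b
        partner[(b, source(a))] = a
        # Pairs of neighbours are in the context of the new match:
        # re-score them (assumed: base similarity plus bonus) and enqueue.
        for na in neighbours.get(a, ()):
            for nb in neighbours.get(b, ()):
                if source(na) != source(nb):
                    sim = initial_sims.get((na, nb), 0.0) + bonus
                    heapq.heappush(heap, (-sim, na, nb))
    return matches


# Invented similarity values for two of the pairs shown in the queue.
initial_sims = {
    ("dbpedia:Liberty_Island", "yago:Liberty_Island"): 0.9,
    ("dbpedia:Statue_of_Liberty", "yago:Liberty_Island"): 0.4,
}
neighbours = {
    "dbpedia:Liberty_Island": {"dbpedia:Statue_of_Liberty"},
    "yago:Liberty_Island": {"yago:Statue_of_Liberty"},
}
matches = greedy_matching(initial_sims, neighbours)
print(matches)
```

With these inputs the island pair is accepted first, the statue pair is enqueued because its members neighbour the new match, and the conflicting pair (dbpedia:Statue_of_Liberty, yago:Liberty_Island) is discarded when it is popped.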
  • 61. LINDA Distribute the input entity graph across a cluster • A node i holds a portion Qi of the priority queue and the respective part Gi of the graph Map phase • Mapper i reads Qi and forwards messages to reducers for similarity re-computations – Matrix X of identified matches is updated Reduce phase • Similarity re-computations (Matrix Y) • Updates on priority queues
  • 62. Frame Linked Data Preservation as a Sustainable Economic Activity • Economic activity: deliberate allocation of resources – Cost of losing datasets • Sustainable: ongoing resource allocation over long periods of time – Involved data subjects • Articulate the problem/provide recommendations & guidelines – Economic and societal benefits [Figure labels: Technical, Social, Economic]
  • 63. Sustainability Conditions • Who benefits from use of the preserved data? • Who selects what data to preserve? • Who owns the data? • Who preserves the data? • Who pays both for data and preservation services? • recognition of the benefits of preservation by decision makers • selection of datasets with long-term value • incentives for decision makers to act in the public interest or to elaborate new business models • appropriate governance of preservation activities • ongoing and efficient allocation of resources to preservation • timely actions to ensure long-term data access and usability
  • 64. Conclusions • We need new abstractions bridging closer data creation, processing, publication and processing – Diachronic Data: Data annotated with temporal and provenance information self-describing their evolution history – Preserve (semi-)structured, interrelated, evolving data by keeping them constantly accessible & reusable from an open framework such as the Data Web • We need new business models for spreading data publication and archiving costs among data stakeholders – Pay-as-you-go data preservation as data products are re-used through complex value making chains (both memory institutions and data market places)
  • 65. Acknowledgements • EU IP Project No 601043 – http://www.diachron-fp7.eu • EU CSA Project No 600663 – http://www.prelida.eu
  • 66. Collaborators • Vassilis Efthimiou (University of Crete) • Kostas Stefanidis (FORTH/ICS) • Grigoris Karvounarakis (LogicBlox) • Giorgos Flouris (FORTH/ICS)
  • 68. References • Böhm, C., de Melo, G., Naumann, F., Weikum, G.: LINDA: distributed web-of-data-scale entity matching. In CIKM 2012. • Geerts, F., Karvounarakis, G., Christophides, V., Fundulaki, I.: Algebraic structures for capturing the provenance of SPARQL queries. In ICDT 2013. • Papadakis, G., Ioannou, E., Niederee, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In WSDM 2012. • Papadakis, G., Ioannou, E., Palpanas, T., Niederee, C., Nejdl, W.: A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. Knowl. Data Eng. (2013), to appear. • Papavasileiou, V., Flouris, G., Fundulaki, I., Kotzinos, D., Christophides, V.: High-level change detection in RDF(S) KBs. ACM Trans. Database Syst. 38(1), April 2013. • High-Level Group on Scientific Data: "Riding the Wave: How Europe can gain from the rising tide of scientific data". © European Union, 2010. cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf • Blue Ribbon Task Force on Sustainable Digital Preservation and Access: Final report, 2010. brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
  • 69. Data Science: A Multidisciplinary Challenge Source: www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
  • 70. Data Science Research Agenda Acquisition, Storage, and Management of "Big Data": • Data representation, storage, and retrieval • New parallel data architectures, including clouds • Data management policies, including privacy and secure access • Communication and storage devices with extreme capacities • Sustainable economic models for access and preservation Data Analytics: • Computational, mathematical, statistical, and algorithmic techniques for modeling high dimensional data • Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets • Data mining to enable automated hypothesis generation, event correlation, and anomaly detection • Information infusion of multiple data sources Data Sharing and Collaboration: • Tools for distant data sharing, real time visualization, and software reuse of complex data sets • Cross disciplinary model, information and knowledge sharing • Remote operation and real time access to distant data sources and instruments Source: Big Data R&D Initiative, Howard Wactlar, NIST Big Data Meeting, June 2012