DSNotify presentation at WWW 2010


Presentation of DSNotify (http://dsnotify.org) at the WWW 2010 conference in Raleigh/NC/USA


  1. DSNotify: Handling Broken Links in the Web of Data
     Niko Popitsch, University of Vienna / Austria, niko.popitsch@univie.ac.at
     Joint work with Bernhard Haslhofer, bernhard.haslhofer@univie.ac.at
     April 30, 2010, WWW 2010 Conference, Raleigh, North Carolina, USA
  2. Outline
     • Introduction and problem definition
     • Related work and solution strategies
     • DSNotify
     • Usage scenarios and design
     • Core algorithm
     • Evaluation
     • Summary & Discussion
     • References
  3. Linked Data Principles (short version):
     (1) use HTTP URIs to identify resources,
     (2) deliver meaningful representations (e.g., RDF, XHTML) when these are dereferenced,
     (3) link to other resources
  4. A Linked Data Example
  5. $ curl -H "Accept: application/rdf+xml" http://www.bbc.co.uk/music/artists/084308bd-1654-436f-ba03-df6697104e19
     Links within the data source:
       [...]
       <mo:member rdf:resource="/music/artists/5d06fe54-485a-4a07-b506-5f6f719448cb#artist" />
       <mo:member rdf:resource="/music/artists/f332a312-e95b-4413-b6cc-1762a5a6a083#artist" />
       <mo:member rdf:resource="/music/artists/0dcee02c-5d2c-4f5c-9d60-d58a4df32d9e#artist" />
       [...]
     RDF links between data sources:
       [...]
       <owl:sameAs rdf:resource="http://dbpedia.org/resource/Green_Day" />
       <mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/084308bd-1654-436f-ba03-df6697104e19.html" />
       [...]
     [...]
       <mo:MusicArtist rdf:about="/music/artists/084308bd-1654-436f-ba03-df6697104e19#artist">
       <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup" />
       <foaf:name>Green Day</foaf:name>
       [...]
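The distinction on slide 5 between links inside a data source and RDF links to other data sources can be illustrated with a minimal Python sketch. It assumes the fragments shown on the slide are nested inside a single `rdf:RDF` document (the slide only shows excerpts, so `SAMPLE` is a hypothetical reassembly), and the helper name `split_links` is an illustration, not part of DSNotify.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# URI that was dereferenced on the slide; relative links are resolved against it.
BASE = "http://www.bbc.co.uk/music/artists/084308bd-1654-436f-ba03-df6697104e19"

# Hypothetical reassembly of the excerpts shown on the slide (assumption).
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:mo="http://purl.org/ontology/mo/">
  <mo:MusicArtist rdf:about="/music/artists/084308bd-1654-436f-ba03-df6697104e19#artist">
    <mo:member rdf:resource="/music/artists/5d06fe54-485a-4a07-b506-5f6f719448cb#artist" />
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/Green_Day" />
    <mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/084308bd-1654-436f-ba03-df6697104e19.html" />
  </mo:MusicArtist>
</rdf:RDF>"""


def split_links(rdf_xml, base):
    """Return (internal, external) link targets, judged by host name."""
    host = urlparse(base).netloc
    internal, external = [], []
    for element in ET.fromstring(rdf_xml).iter():
        target = element.get(f"{{{RDF_NS}}}resource")
        if target is None:
            continue  # element carries no rdf:resource link
        uri = urljoin(base, target)  # resolve relative references
        (internal if urlparse(uri).netloc == host else external).append(uri)
    return internal, external


internal, external = split_links(SAMPLE, BASE)
```

Both kinds of links can break; the external ones are the harder case because the linking application does not control the target data source.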
  6. Problem: links can break
  7. Ignore broken links? Not a good idea!
     • Broken links on the Web are annoying for humans, but alternative paths may be used: search engines, URL manipulation, alternative information providers, etc.
     • Much harder for machines in a Web of Data!
       • reduced data accessibility
       • data inconsistencies
  8. Avoid broken links? Great! But hard to achieve in the Web environment…
     • Solution strategies that solve the problem only partially:
       • relative references
       • embedded links
       • redundancy
     • Solution strategies that are not commonly applicable:
       • versioned/static collections
       • regular (predictable) updates
       • dynamic links
       • indirection services (PURLs, DOIs)
  9. Solve the problem (1/2): Notification
     • Notification strategy:
       • The data source “knows” about the events that are taking place
       • It notifies clients
       • Clients may then check their links and fix the broken ones
     • Current activities:
       • WOD-LMP [Volz et al. 2009]
       • Triplify Linked Data Update Log [Auer et al. 2009]
       • PubSubHubbub / sparqlPuSH
       • http://groups.google.com/group/dataset-dynamics
       • …
  10. Solve the problem (2/2): Detect and correct
     • Detect and correct:
       • If notification is not applicable
       • Clients detect broken links and try to fix them
     • Current activities:
       • Robust hyperlinks [Phelps & Wilensky 2000]: Web documents
       • PageChaser [Morishima et al. 2009]: Web documents
       • DSNotify: aims at becoming a general framework for fixing broken links
       • …
  11. What events cause the problem?
  12. Events that potentially lead to broken links
     • Broken links due to deletion events
       • A deletion event takes place at time t when a resource had (dereferenceable) representations at t-Δ but has none at time t
       • Vice versa: create event
       • Easy to detect
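The deletion and create events defined on slide 12 can be sketched as a set comparison between two observations of a data source. This is a minimal illustration, assuming each observation is simply the set of URIs that had dereferenceable representations at that time; the function name and the `example.org` URIs are hypothetical.

```python
def detect_create_delete(previous, current):
    """Compare the URIs that were dereferenceable at t-Δ and at t."""
    previous, current = set(previous), set(current)
    deleted = previous - current  # had representations at t-Δ, none at t
    created = current - previous  # no representations at t-Δ, some at t
    return sorted(created), sorted(deleted)


before = {"http://example.org/a", "http://example.org/b"}
after = {"http://example.org/b", "http://example.org/c"}
created, deleted = detect_create_delete(before, after)
```

A resource that was moved shows up here as one deletion plus one creation, which is why move events need the similarity-based treatment described on the later slides.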
  13. Events that potentially lead to broken links
     • Broken links due to update events
       • An update event takes place at time t when a resource had different representations at t-Δ compared to the ones at time t
       • Resource updates resulting in representations with different meaning (semantic drift) may lead to semantically broken links
       • Hard to detect, open problem
  14. Events that potentially lead to broken links
     • What about move events?
  15. Events that potentially lead to broken links
     • What about move events?
     • A move event from a to b takes place at time t when
       • there were no representations of b at time t-Δ,
       • there are no representations of a at time t,
       • the representations of a at time t-Δ are more similar to the ones of b at time t than to the ones of any other considered resource at time t, and
       • the calculated similarity between them is greater than a threshold
     • Instance matching problem!
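The four conditions on slide 15 can be written as a single predicate. The notation is an assumption introduced here for illustration and does not come from the slides: rep_t(r) is the set of representations of resource r at time t, R_t is the set of resources considered at time t, sim is the similarity measure, and θ is the threshold.

```latex
\[
\begin{aligned}
\mathrm{move}_t(a, b) \iff{}
  & \mathrm{rep}_{t-\Delta}(b) = \emptyset \\
  & \land\; \mathrm{rep}_{t}(a) = \emptyset \\
  & \land\; \forall r \in R_t \setminus \{b\}:\;
      \mathrm{sim}\bigl(\mathrm{rep}_{t-\Delta}(a), \mathrm{rep}_{t}(b)\bigr)
      > \mathrm{sim}\bigl(\mathrm{rep}_{t-\Delta}(a), \mathrm{rep}_{t}(r)\bigr) \\
  & \land\; \mathrm{sim}\bigl(\mathrm{rep}_{t-\Delta}(a), \mathrm{rep}_{t}(b)\bigr) > \theta
\end{aligned}
\]
```

The first two conjuncts are a create event for b and a deletion event for a; the last two select b as the best match among the considered resources and require the match to be strong enough.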
  16. Events that potentially lead to broken links
     • The core algorithm of DSNotify detects move events based on resource similarities
  17. Changes in DBpedia
     Class         Snapshot 3.2  Snapshot 3.3  Moved  Removed  Created
     Person             213,016       244,621  2,841   20,561   49,325
     Place              247,508       318,017  2,209    2,430   70,730
     Organisation        76,343       105,827  2,020    1,242   28,706
     Work               189,725       213,231  4,097    6,558   25,967
     Resources that were moved/removed/created between the DBpedia snapshots 3.2 (October 2008) and 3.3 (May 2009)
  18. PART 3: DSNotify
  19. Usage Scenario
  20. Usage Scenario
     • Application that consumes various LD sources and may update a “source dataset”
  21. Usage Scenario
     • DSNotify is an add-on for applications that want to preserve high link integrity in their data
  22. Usage Scenario
     • Other actors (applications) might also be interested in these events
  23. General Approach
     • Periodically access linked data sources
     • Extract features from resource representations
     • Combine them to comparable feature vectors (FV)
     • Store them in 3 indices
       • 1st index represents the current state of the monitored data
       • 2nd index stores items that became recently unavailable
       • 3rd index stores archived feature vectors
     • Periodically access indices 1 and 2 and log detected events
     • Periodically update indices 1-3
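The three indices on slide 23 can be sketched as a small Python class. This is only an illustration under simplifying assumptions: feature vectors are plain dictionaries, move detection is left out entirely, and the housekeeping step shown here just logs remove events and archives whatever is in the second index. The class and method names are hypothetical and are not DSNotify's API.

```python
from dataclasses import dataclass, field


@dataclass
class Indices:
    current: dict = field(default_factory=dict)  # 1st index: URI -> FV of monitored data
    removed: dict = field(default_factory=dict)  # 2nd index: items recently unavailable
    archive: list = field(default_factory=list)  # 3rd index: archived feature vectors

    def monitor(self, snapshot):
        """Update indices 1 and 2 from a fresh snapshot; return newly added URIs."""
        added = [uri for uri in snapshot if uri not in self.current]
        for uri in list(self.current):
            if uri not in snapshot:
                # The item disappeared: park its FV in the 2nd index.
                self.removed[uri] = self.current.pop(uri)
        self.current.update(snapshot)
        return added

    def housekeeping(self):
        """Simplified: log remove events and move the 2nd index into the archive."""
        events = [("removed", uri) for uri in sorted(self.removed)]
        self.archive.extend(self.removed.values())
        self.removed.clear()
        return events


idx = Indices()
idx.monitor({"u1": {"name": "A"}, "u2": {"name": "B"}})
added = idx.monitor({"u2": {"name": "B"}, "u3": {"name": "C"}})
events = idx.housekeeping()
```

Keeping recently removed items in their own index is what makes it possible to compare them against newly added items before they are declared removed and archived.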
  24. From Resource to Feature Vector
     • Both datatype and object properties are supported
     • Feature influence is weighted
     • Some features are used in plausibility checks
     • RDFHash over all features
  25. Move Event Detection
     • Pairwise comparison using a vector space model
     • Feature comparison, e.g., using Levenshtein similarity
     • It is sufficient to compare recently added and recently removed feature vectors!
     • Two thresholds for comparing the similarity between FVs representing created and removed items:
       • lower threshold: select predecessor candidates
         • consider URI of added FV as possible new URI of resource represented by removed FV
       • upper threshold: decidable by DSNotify?
         • decide whether such a candidate can be automatically selected or whether a human user has to be asked for assistance
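The two-threshold scheme on slide 25 can be sketched in Python as follows. This is a minimal illustration, not the DSNotify implementation: feature vectors are assumed to be dictionaries of strings, the similarity is a weighted average of normalized Levenshtein similarities, and the threshold values 0.5 and 0.9 as well as all function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def string_similarity(a, b):
    """Levenshtein distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fv_similarity(fv_a, fv_b, weights):
    """Weighted average of per-feature similarities."""
    score = sum(w * string_similarity(fv_a.get(f, ""), fv_b.get(f, ""))
                for f, w in weights.items())
    return score / sum(weights.values())


def classify_moves(removed, added, weights, lower=0.5, upper=0.9):
    """Compare removed FVs with added FVs only; return (decided, undecided)."""
    decided, undecided = [], []
    for old_uri, old_fv in removed.items():
        scored = [(fv_similarity(old_fv, new_fv, weights), new_uri)
                  for new_uri, new_fv in added.items()]
        # Lower threshold: keep only plausible new URIs for the removed item.
        candidates = sorted((c for c in scored if c[0] >= lower), reverse=True)
        if not candidates:
            continue
        best_score, best_uri = candidates[0]
        if best_score >= upper:
            decided.append((old_uri, best_uri))  # selected automatically
        else:
            undecided.append((old_uri, [uri for _, uri in candidates]))  # ask a human
    return decided, undecided


weights = {"name": 2.0, "type": 1.0}
removed = {"http://example.org/old/green_day": {"name": "Green Day", "type": "MusicGroup"}}
added = {
    "http://example.org/new/green_day": {"name": "Green Day", "type": "MusicGroup"},
    "http://example.org/new/other": {"name": "Blue Night", "type": "MusicGroup"},
}
decided, undecided = classify_moves(removed, added, weights)
```

Restricting the comparison to recently removed and recently added items keeps the number of pairwise comparisons small, and the gap between the two thresholds is where a human decision is requested.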
  26. Core Housekeeping Algorithm
     C_i, R_i and M_{i,j} denote create, remove and move events of items i and j. m_x and h_x denote monitoring and housekeeping operations, respectively.
  27. Resulting Data Structures
     • DSNotify constructs three data structures:
       • An event log containing all events detected by the system
       • A log containing all “event choices” DSNotify cannot decide on
       • A linked structure of feature vectors constituting a history of the respective items
     • Accessible via
       • Linked data interface
       • Java interface
       • XML-RPC
  28. Evaluation
     • Core questions:
       • Does DSNotify work with real data?
       • How does housekeeping frequency affect its effectiveness?
     • Used data:
       • Data from DBpedia (8380 events) and IIMB (10 x 222 events) were used
       • Hand-picked features based on coverage and entropy in the data sets
     • Results:
       • Housekeeping frequency and data source dynamics determine the number of FV pairs that have to be compared (scalability)
       • Number of FV comparisons as well as coverage and entropy of indexed features influence accuracy of method
  29. Evaluation: Results
     • Influence of data source agility and housekeeping frequency on the accuracy of the DSNotify algorithm
  30. Discussion
     • Broken links are a considerable problem in a Web of Data
     • The broken link problem is partly a special case of the instance matching problem
     • DSNotify is an event-based approach to this problem:
       • DSNotify can be used as an add-on for data sources that want to preserve link integrity in their data
     • We cannot “cure” the Web of Data of broken links (but we can at least alleviate the pain a bit :)
  31. Current and Future Work
     • Scalability issues, evaluation
     • Automatic feature selection (parameter estimation)
     • Event algebra: high-level composite events
     • Evaluation with other data sources (e.g., file system)
     • Dataset dynamics:
       • vocabularies, protocols, formats
     • …
     Thank You!
     niko.popitsch@univie.ac.at
     http://www.dsnotify.org
  32. References and Related Work
     • H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3), 2000.
     • F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84–104, 1995.
     • A. Morishima, A. Nakamizo, T. Iida, S. Sugimoto, and H. Kitagawa. Bringing your dead links back to life: a comprehensive approach and lessons learned. In HT ’09: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, pages 15–24, 2009.
     • T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical Report UCB/CSD-00-1091, EECS Department, University of California, Berkeley, 2000.
     • J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In 8th International Semantic Web Conference, 2009.
     • A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web data graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.
     • A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a benchmark for instance matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
     • C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 2009.
     • S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify: light-weight linked data publication from relational databases. In WWW ’09, New York, NY, USA, 2009. ACM.
     • W. Y. Arms. Uniform resource names: handles, purls, and digital object identifiers. Commun. ACM, 44(5):68, 2001.