Standardizing for Open Data

Plans of W3C in the area of standard activities on Data on the Web

Published in: Technology, Education

  1. Standardizing for Open Data
     Ivan Herman, W3C
     Open Data Week, Marseille, France, June 26 2013
     Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/
  2. Data is everywhere on the Web!
     • Public, private, behind enterprise firewalls
     • Ranges from informal to highly curated
     • Ranges from machine readable to human readable
         • HTML tables, twitter feeds, local vocabularies, spreadsheets, …
     • Expressed in diverse models
         • tree, graph, table, …
     • Serialized in many ways
         • XML, CSV, RDF, PDF, HTML Tables, microdata, …
  8. W3C’s standardization focus was, traditionally, on Web scale integration of data
     • Some basic principles:
         • use of URIs everywhere (to uniquely identify things)
         • relate resources among one another (to connect things on the Web)
         • discover new relationships through inferences
     • This is what the Semantic Web technologies are all about
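The three principles on this slide can be sketched in a few lines of plain Python. This is only a toy illustration, not an RDF library: tuples stand in for triples, every `example.org` URI and the data itself are made up, and the single inference rule shown (type propagation along `rdfs:subClassOf`) is just one example of what "inference" can mean.

```python
# Toy illustration only: plain tuples stand in for RDF triples.
# All example.org URIs and the data are invented for this sketch.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Principles 1 and 2: things are named by URIs and related to one another.
triples = {
    (EX + "Marseille", RDF_TYPE, EX + "City"),
    (EX + "City", RDFS_SUBCLASS, EX + "Place"),
    (EX + "Place", RDFS_SUBCLASS, EX + "Thing"),
}


def infer_types(graph):
    """Add (x type D) whenever (x type C) and (C subClassOf D) are present."""
    inferred = set(graph)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, q, d in list(inferred):
                if q == RDFS_SUBCLASS and c == o:
                    new = (s, RDF_TYPE, d)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred


# Principle 3: relationships that were never stated explicitly.
closure = infer_types(triples)
new_facts = sorted(closure - triples)
```

Here `new_facts` holds the statements that Marseille is a `Place` and a `Thing`, neither of which appears in the input set.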
  9. We have a number of standards
     [Diagram: RDF 1.1, SPARQL 1.1, URI, JSON-LD, Turtle, RDFa, RDF/XML]
     • RDF: data model, links, basic assertions; different serializations
     • SPARQL: querying data
     A fairly stable set of technologies by now!
  10. We have a number of standards
      [Diagram: RDB2RDF, RDF 1.1, RDFS 1.1, SPARQL 1.1, OWL 2, URI, JSON-LD, Turtle, RDFa, RDF/XML]
      • RDF: data model, links, basic assertions; different serializations
      • SPARQL: querying data
      • RDFS: simple vocabularies
      • OWL: complex vocabularies, ontologies
      • RDB2RDF: databases to RDF
      A fairly stable set of technologies by now!
  11. We have Linked Data principles
  12. Integration is done in different ways
      • Very roughly:
          • data is accessed directly as RDF and turned into something useful
              • relies on data being “preprocessed” and published as RDF
          • data is collected from different sources, integrated internally
              • using, say, a triple store
  14. However…
      • There is a price to pay: a relatively heavy ecosystem
          • many developers shy away from using RDF and related tools
      • Not all applications need this!
          • data may be used directly, no need for integration concerns
          • the emphasis may be on easy production and manipulation of data with simple tools
  15. Typical situation on the Web
      • Data published in CSV, JSON, XML
      • An application uses only 1-2 datasets, integration done by direct programming is straightforward
          • e.g., in a Web Application
      • Data is often very large, direct manipulation is more efficient
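A minimal sketch of what "integration done by direct programming" might look like when an application combines just two datasets, one in CSV and one in JSON. The sample data, the column names and the `join_datasets` helper are all invented for illustration; only the Python standard library is used.

```python
import csv
import io
import json

# Made-up sample data standing in for two published datasets.
CSV_TEXT = """region_code,prescriptions
R1,1200
R2,450
"""
JSON_TEXT = '[{"code": "R1", "name": "North"}, {"code": "R2", "name": "South"}]'


def join_datasets(csv_text, json_text):
    """Attach region names (from the JSON) to per-region counts (from the CSV)."""
    # Build a lookup table from the JSON dataset: code -> human-readable name.
    names = {item["code"]: item["name"] for item in json.loads(json_text)}
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "region": names.get(row["region_code"], "unknown"),
            "prescriptions": int(row["prescriptions"]),
        })
    return rows


result = join_datasets(CSV_TEXT, JSON_TEXT)
```

With only two sources, the join key is hard-coded in the program, which is exactly why no shared data model is needed in this situation.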
  16. Non-RDF Data
      • In some settings the data can be converted into RDF
      • But, in many cases, it is not done
          • e.g., CSV data is way too big
          • RDF tooling may not be adequate for the task at hand
          • integration is not a major issue
  18. What that application does…
      • Gets the data published by NHS
      • Processes the data (e.g., through Hadoop)
      • Integrates the result of the analysis with geographical data
      I.e., the raw data is used without integration
  19. The reality of data on the Web…
      • It is still a fairly messy space out there :-(
          • many different formats are used
          • data is difficult to find
          • published data are messy, erroneous
          • tools are complex, unfinished…
  20. How do developers perceive this?
      “When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as "unachievable" and "disruptive."”
      -- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation
  21. One may look at the problem through different goggles
      • Two alternatives come to the fore:
          1. provide tools, environments, etc., to help outsiders to publish Linked Data (in RDF) easily
              • a typical example is the Datalift project
          2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead
  22. But religions and cultures can coexist… :-)
  23. Open Data on the Web Workshop
      • Had a successful workshop in London, in April:
          • around 100 participants
          • coming from different horizons: publishers and users of Linked Data, CSV, PDF, …
  24. We also talked to our “stakeholders”
      • Member organizations and companies
      • Open Data Institute, Open Knowledge Foundation, Schema.org
      • …
  25. Some takeaways
      • The Semantic Web community needs stability of the technology
          • do not add yet another technology block :-)
          • existing technologies should be maintained
  26. Some takeaways
      • Look at the more general space, too
          • importance of metadata
          • deal with non-RDF data formats
          • best practices are necessary to raise the quality of published data
  27. We need to meet app developers where they are!
  28. Metadata is of major importance
      • Metadata describes the characteristics of the dataset
          • structure, datatypes used
          • access rights, licenses
          • provenance, authorship
          • etc.
      • Vocabularies are also key for Linked Data
  29. Vocabulary Management Action
      • Standard vocabularies are necessary to describe data
          • there are already some initiatives: W3C’s data cube, data catalog, PROV, schema.org, DCMI, …
      • At the moment, it is a fairly chaotic world…
          • many, possibly overlapping vocabularies
          • difficult to locate the one that is needed
          • vocabularies may not be properly managed, maintained, versioned, or kept persistent…
  30. W3C’s plan:
      • Provide a space whereby
          • communities can develop
          • host vocabularies at W3C if requested
          • annotate vocabularies with a proper set of metadata terms
          • establish a vocabulary directory
      • The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/
  31. CSV on the Web
      • Planned work areas:
          • metadata vocabulary to describe CSV data
              • structure, reference to access rights, annotations, etc.
          • methods to find the metadata
              • part of an HTTP header, special rows and columns, packaging formats…
          • mapping content to RDF, JSON, XML
      • Possibly at a later phase:
          • API standards to access CSV data
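To make the first and third work areas more concrete, here is a hedged sketch of how a metadata description could drive a CSV-to-JSON mapping. The metadata layout and its key names (`columns`, `datatype`, `license`) are purely hypothetical and do not come from any W3C specification; the sample row is made up as well.

```python
import csv
import io
import json

# Hypothetical metadata document: the key names are invented for this
# sketch and are NOT taken from any W3C vocabulary.
METADATA = {
    "license": "http://example.org/licence",
    "columns": [
        {"name": "city", "datatype": "string"},
        {"name": "population", "datatype": "integer"},
        {"name": "area_km2", "datatype": "float"},
    ],
}

# How each declared datatype is turned into a Python (and hence JSON) value.
CONVERTERS = {"string": str, "integer": int, "float": float}


def csv_to_json(csv_text, metadata):
    """Use the column descriptions to turn untyped CSV cells into typed JSON."""
    casts = {c["name"]: CONVERTERS[c["datatype"]] for c in metadata["columns"]}
    records = [
        {name: cast(row[name]) for name, cast in casts.items()}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    return json.dumps(records)


# Made-up sample data.
SAMPLE = "city,population,area_km2\nMarseille,850000,240.6\n"
output = csv_to_json(SAMPLE, METADATA)
```

Without the metadata every cell would come out as a string; the description is what tells a consumer that `population` is a number.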
  32. Open Data Best Practices
      • Document best practices for data publishers
          • management of persistence, versioning, URI design
          • use of core vocabularies (provenance, access control, ownership, annotations, …)
          • business models
      • Specialized Metadata vocabularies
          • quality description (quality of the data, update frequencies, correction policies, etc.)
          • description of data access APIs
          • …
  33. Summary
      • Data on the Web has many different facets
      • We have concentrated on the integration aspects in the past years
      • We have to take a more general view, look at other types of data published on the Web
  34. In future…
      • We should look at other formats, not only CSV
          • MARC, GIS, ABIF, …
      • Better outreach to data publishing communities and organizations
          • WF, RDA, ODI, OKFN, …
  35. Enjoy the event!
