Discovering Related Data Sources in Data Portals


Slides from my presentation at the 1st International Workshop on Semantic Statistics Sydney, Oct 22, 2013



  1. Discovering Related Data Sources in Data Portals
     Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm
     1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013
  2. Potential of Open (Statistics) Data [image text: WORLD BANK]
  3. fluidOps Open Data Portal
     • Data collection
       – Integration of major open data catalogs
       – Automated provisioning of tens of thousands of data sets
     • Portal for search and exploration of data sets
       – Rich metadata based on open standards
       – Both descriptive and structural metadata
     • Integrated querying across interlinked data sets
       – Easy-to-use queries against multiple data sets
       – Using federation technologies
     • Self-service UI
       – Custom queries and visualizations
       – Widgets, dashboarding, etc.
  4. Finding Related Data Sets
     • Many information needs require analysis of multiple data sets
       – Example: Compare and correlate GDP, population and public debt of countries over time
     • Task of finding related data sets
       – Identify data sets that are similar, but complementary
       – To support queries across multiple data sets, e.g. in the form of joins and unions
     • Inspiration: Finding related tables
       – Entity complement: same attributes, complementing entities
       – Schema complement: same entities, complementing attributes
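The two kinds of complement on this slide map onto the two query forms it names: an entity complement can be combined by a union, a schema complement by a join. A minimal sketch, assuming tables are plain lists of dicts; the function names, the country codes "A" and "B", and all values are invented for illustration and are not taken from the portal:

```python
def union_tables(a, b):
    """Entity complement: same attributes, different entities, so rows are stacked."""
    if a and b and set(a[0]) != set(b[0]):
        raise ValueError("union needs identical attribute sets")
    return a + b


def join_tables(a, b, keys):
    """Schema complement: same entities, different attributes, so rows are joined on shared keys."""
    index = {tuple(row[k] for k in keys): row for row in b}
    joined = []
    for row in a:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


# Toy tables with placeholder values (not real statistics).
gdp_a = [{"country": "A", "year": 2010, "gdp": 1.0}]
gdp_b = [{"country": "B", "year": 2010, "gdp": 2.0}]
population_a = [{"country": "A", "year": 2010, "population": 3.0}]

stacked = union_tables(gdp_a, gdp_b)                                    # more entities
widened = join_tables(gdp_a, population_a, keys=("country", "year"))   # more attributes
```

Here `stacked` covers both countries with the same attributes, while `widened` keeps one country and gains the `population` attribute.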
  5. Finding Related Data Sources via Related Entities
     • Data Model: Data source is a set of multiple RDF graphs
     • Intuition: if data sources contain similar entities, they are somehow related
     • Approach:
       1. Entity Extraction
       2. Entity Similarity
       3. Entity Clustering
     [Figure: entities from Source 1, Source 2 and Source 3 grouped into Cluster 1 and Cluster 2; "Related?!"]
  6. Related Entities (2)
     1. Entity Extraction
       – Sample over entities in data graphs in D
       – For each entity crawl its surrounding sub-graph [1]
     2. Entity Similarity
       – Define dissimilarity measure between two entities based on kernel functions
       – Compare entity structure and literals via different kernels [2,3]
     3. Entity Clustering
       – Apply k-means clustering to discover similar entities [4]
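The three steps above can be sketched end to end on a toy graph. This is only an illustration under strong assumptions: the triples and the `wb:`/`es:` prefixes are invented, the "sub-graph" is just each entity's outgoing edges, a simple set-intersection kernel stands in for the graph kernels of [2], and the farthest-first seeding is a choice made here, not something stated on the slides.

```python
from collections import defaultdict
from itertools import combinations

# Toy (subject, predicate, object) triples; the prefix names the (hypothetical) source.
TRIPLES = [
    ("wb:DE", "rdf:type", "Country"), ("wb:DE", "label", "Germany"),
    ("es:DE", "rdf:type", "Country"), ("es:DE", "label", "Germany"),
    ("wb:GDP", "rdf:type", "Indicator"), ("wb:GDP", "label", "GDP (current US$)"),
    ("es:GDP", "rdf:type", "Indicator"), ("es:GDP", "label", "GDP at market prices"),
]


def extract_entities(triples):
    """Step 1: describe each entity by its outgoing (predicate, object) pairs."""
    features = defaultdict(set)
    for s, p, o in triples:
        features[s].add((p, o))
    return dict(features)


def kernel(a, b):
    """Step 2: intersection kernel, counting shared (predicate, object) pairs."""
    return len(a & b)


def dist2(K, i, j):
    """Squared kernel-induced distance between entities i and j."""
    return K[i][i] - 2 * K[i][j] + K[j][j]


def farthest_first(K, k):
    """Pick k seeds, each as far as possible from the seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(K)), key=lambda i: min(dist2(K, i, s) for s in seeds)))
    return seeds


def kernel_kmeans(K, k, max_iter=20):
    """Step 3: k-means in the kernel's feature space, using only the kernel matrix K."""
    n = len(K)
    seeds = farthest_first(K, k)
    assign = [min(range(k), key=lambda c: dist2(K, i, seeds[c])) for i in range(n)]
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if assign[i] == c] for c in range(k)]

        def centroid_dist(i, members):
            m = len(members)
            cross = sum(K[i][j] for j in members) / m
            within = sum(K[j][l] for j in members for l in members) / (m * m)
            return K[i][i] - 2 * cross + within

        new = [min((c for c in range(k) if clusters[c]),
                   key=lambda c: centroid_dist(i, clusters[c])) for i in range(n)]
        if new == assign:
            break
        assign = new
    return assign


def related_sources(names, assign):
    """Sources whose entities share a cluster are reported as related pairs."""
    by_cluster = defaultdict(set)
    for name, c in zip(names, assign):
        by_cluster[c].add(name.split(":")[0])
    return {pair for srcs in by_cluster.values() for pair in combinations(sorted(srcs), 2)}


entities = extract_entities(TRIPLES)
names = list(entities)
K = [[kernel(entities[a], entities[b]) for b in names] for a in names]
assign = kernel_kmeans(K, k=2)
related = related_sources(names, assign)
```

On this toy input the two country entities end up in one cluster and the two indicator entities in the other, so the two sources are reported as related.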
  7. Contextualisation Score
     • Contextualisation score for data source D'' given D': ec(D''|D') and sc(D''|D')
       – ec: Entity complement score
       – sc: Schema complement score
  8. Search for Gross Domestic Product
  9. Querying the Data Set
  10. Visualizing the Results
  11. Queries Across Related Data Sets
     • Query for GDP of Germany
     • Union of results from
       – Worldbank: GDP (current US$) (up to 2010)
       – Eurostat: GDP at Market Prices (including projected values until 2014)
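A union of this kind can keep track of which source each value came from, so that overlapping years stay distinguishable. A minimal sketch, assuming each source returns (year, value) pairs; the `Observation` type, the function name, and all years and values are placeholders, not actual Worldbank or Eurostat figures:

```python
from typing import NamedTuple


class Observation(NamedTuple):
    source: str
    indicator: str
    year: int
    value: float


def union_observations(*result_sets):
    """Merge per-source results into one series, ordered by year and then source."""
    merged = [obs for results in result_sets for obs in results]
    return sorted(merged, key=lambda o: (o.year, o.source))


# Placeholder results for one country (values are not real statistics).
worldbank = [Observation("Worldbank", "GDP (current US$)", year, value)
             for year, value in [(2009, 1.0), (2010, 2.0)]]
eurostat = [Observation("Eurostat", "GDP at Market Prices", year, value)
            for year, value in [(2010, 3.0), (2014, 4.0)]]

combined = union_observations(worldbank, eurostat)
years = sorted({o.year for o in combined})
```

Because the two indicators are defined differently, the sketch keeps the indicator label on every row instead of merging values for the same year.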
  12. Queries Across Related Data Sets
     [Chart labels: Data from Worldbank; Data from Eurostat]
  13. Summary and Outlook
     • Techniques for finding related data sets
       – Based on finding related entities
     • Implementation available in open data portal
     • Outlook
       – Finding relevant related data sources for a given information need
       – End user interfaces for formulating queries across data sets (see Optique project)
       – Operators for combining data cubes
       – Interactive visualization and exploration of combined data cubes (see OpenCube project)
  14. References
     [1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
     [2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
     [3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
     [4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.
