Discovering Related Data Sources in Data Portals

Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm

1st International Workshop on Semantic Statistics
Sydney, Oct 22, 2013
  

	
  
Potential of Open (Statistics) Data

[Image: World Bank]

fluidOps Open Data Portal

•  Data collection
   •  Integration of major open data catalogs
   •  Automated provisioning of 10,000s of data sets
•  Portal for search and exploration of data sets
   •  Rich metadata based on open standards
   •  Both descriptive and structural metadata
•  Integrated querying across interlinked data sets
   •  Easy to use queries against multiple data sets
   •  Using federation technologies
•  Self-service UI
   •  Custom queries and visualizations
   •  Widgets, dashboarding, etc.
  

Finding Related Data Sets

•  Many information needs require analysis of multiple data sets
   •  Example: Compare and correlate GDP, population and public debt of countries over time
•  Task of finding related data sets
   •  Identify data sets that are similar, but complementary
   •  To support queries across multiple data sets, e.g. in the form of joins and unions
•  Inspiration: Finding related tables
   •  Entity complement: same attributes, complementing entities
   •  Schema complement: same entities, complementing attributes
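The two notions of complement line up with the two query shapes named on this slide: entity-complementing data sets can be combined by a union, schema-complementing ones by a join. The following Python sketch illustrates this on toy tables. The table contents, the (country, year) key, and the helper names `union_rows` and `join_rows` are assumptions made for illustration and are not taken from the talk; all values are placeholders, not real statistics.

```python
# Toy tables keyed by (country, year). All values are placeholders, not real statistics.
gdp_a = {("DE", 2009): 1.0, ("DE", 2010): 2.0}           # hypothetical source A: GDP
gdp_b = {("FR", 2009): 3.0, ("FR", 2010): 4.0}           # hypothetical source B: GDP, other entities
population_a = {("DE", 2009): 10.0, ("DE", 2010): 11.0}  # hypothetical source C: other attribute


def union_rows(left, right):
    """Entity complement: same attribute, different entities, so stack the rows."""
    merged = dict(left)
    merged.update(right)
    return merged


def join_rows(left, right):
    """Schema complement: same entities, different attributes, so join on the key."""
    return {key: (left[key], right[key]) for key in left if key in right}


all_gdp = union_rows(gdp_a, gdp_b)
gdp_and_population = join_rows(gdp_a, population_a)
print(sorted(all_gdp))
print(gdp_and_population)
```

Joining the two GDP tables with `join_rows` yields an empty result because they share no keys, which is one simple signal that they complement each other on entities rather than on schema.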
  
Finding Related Data Sources via Related Entities

•  Data Model: Data source is a set of multiple RDF graphs
•  Intuition: if data sources contain similar entities, they are somehow related
•  Approach:
   1.  Entity Extraction
   2.  Entity Similarity
   3.  Entity Clustering

[Figure: entities from Source 1, Source 2, and Source 3 grouped into Cluster 1 and Cluster 2; are the sources related?]
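The intuition on this slide can be sketched in a few lines of Python: once entities have been clustered, two sources are candidates for being related if they contribute entities to the same cluster. The function name `related_source_pairs`, the entity identifiers, and the cluster assignment below are hypothetical; the exact arrangement in the slide's figure is not recoverable, so the example only loosely mirrors it.

```python
from collections import defaultdict
from itertools import combinations


def related_source_pairs(cluster_of, source_of):
    """Count, for each pair of sources, the clusters in which both have entities.

    cluster_of: entity id -> cluster id
    source_of:  entity id -> data source id
    """
    sources_in_cluster = defaultdict(set)
    for entity, cluster in cluster_of.items():
        sources_in_cluster[cluster].add(source_of[entity])

    shared = defaultdict(int)
    for sources in sources_in_cluster.values():
        for pair in combinations(sorted(sources), 2):
            shared[pair] += 1
    return dict(shared)


# Hypothetical assignment: three sources, two clusters.
cluster_of = {"e1": 1, "e2": 1, "e3": 2, "e4": 2, "e5": 2}
source_of = {
    "e1": "Source 1",
    "e2": "Source 2",
    "e3": "Source 2",
    "e4": "Source 3",
    "e5": "Source 3",
}
pairs = related_source_pairs(cluster_of, source_of)
print(pairs)
```

In this toy assignment, Source 1 and Source 3 never meet in a cluster, so they do not appear as a candidate pair.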
  
Related Entities (2)

1.  Entity Extraction
    –  Sample over entities in data graphs in D
    –  For each entity crawl its surrounding sub-graph [1]
2.  Entity Similarity
    –  Define dissimilarity measure between two entities based on kernel functions
    –  Compare entity structure and literals via different kernels [2,3]
3.  Entity Clustering
    –  Apply k-means clustering to discover similar entities [4]
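Steps 1 and 2 can be illustrated with a small Python sketch. It crawls the sub-graph around an entity by breadth-first search over outgoing edges and then derives a dissimilarity from a kernel via the standard kernel-induced distance sqrt(k(a,a) - 2 k(a,b) + k(b,b)). The kernel used here is a deliberately simple set-intersection kernel over edges; it stands in for the structure and literal kernels of [2,3] and is not the kernel used by the authors. The `ex:` identifiers, the `ROOT` placeholder, the depth limit, and all function names are assumptions for illustration. The clustering step is omitted.

```python
import math
from collections import deque


def extract_subgraph(triples, entity, max_depth=2):
    """Collect triples reachable from `entity` within `max_depth` hops (outgoing edges only)."""
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    subgraph, seen = set(), {entity}
    queue = deque([(entity, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for p, o in outgoing.get(node, []):
            subgraph.add((node, p, o))
            if o not in seen:
                seen.add(o)
                queue.append((o, depth + 1))
    return subgraph


def edge_signature(subgraph, entity):
    """Replace the root entity by a placeholder so that two entities can share edges."""
    return {
        ("ROOT" if s == entity else s, p, "ROOT" if o == entity else o)
        for s, p, o in subgraph
    }


def intersection_kernel(edges_a, edges_b):
    """Toy kernel: number of common edges (inner product of indicator vectors)."""
    return len(edges_a & edges_b)


def kernel_dissimilarity(edges_a, edges_b):
    """Distance induced by the kernel: sqrt(k(a,a) - 2 k(a,b) + k(b,b))."""
    squared = (
        intersection_kernel(edges_a, edges_a)
        - 2 * intersection_kernel(edges_a, edges_b)
        + intersection_kernel(edges_b, edges_b)
    )
    return math.sqrt(max(squared, 0))


# Hypothetical data graph with two observation entities.
triples = [
    ("ex:gdp_de_2010", "ex:refArea", "ex:Germany"),
    ("ex:gdp_de_2010", "ex:refPeriod", "2010"),
    ("ex:gdp_fr_2010", "ex:refArea", "ex:France"),
    ("ex:gdp_fr_2010", "ex:refPeriod", "2010"),
    ("ex:Germany", "ex:partOf", "ex:Europe"),
    ("ex:France", "ex:partOf", "ex:Europe"),
]
de = edge_signature(extract_subgraph(triples, "ex:gdp_de_2010"), "ex:gdp_de_2010")
fr = edge_signature(extract_subgraph(triples, "ex:gdp_fr_2010"), "ex:gdp_fr_2010")
print(kernel_dissimilarity(de, fr))
```

Because this toy kernel only counts exactly matching edges, the two observations share just the reference-period edge. A kernel that also compares literals or generalizes over node labels would treat them as closer.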
  
Contextualization Score

•  Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)
•  Entity complement score
•  Schema complement score
  
Search for Gross Domestic Product

Querying the Data Set

Visualizing the Results
  
Queries Across Related Data Sets

•  Query for GDP of Germany
•  Union of results from
   •  Worldbank: GDP (current US$) (up to 2010)
   •  Eurostat: GDP at Market Prices (including projected values until 2014)
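One way such a union could be phrased against federated endpoints is a SPARQL 1.1 query with one `SERVICE` block per source, combined with `UNION`. The Python sketch below only builds the query text. The endpoint URLs, the `ex:` vocabulary, and the way observations are modeled are hypothetical placeholders, not the portal's actual schema or query; the function does not escape quotes in its inputs.

```python
def union_query(sources, country):
    """Build a SPARQL 1.1 query that unions one SERVICE block per source.

    sources: list of (display name, endpoint URL, indicator label) tuples
    """
    blocks = []
    for name, endpoint, indicator in sources:
        blocks.append(
            "  {\n"
            f"    SERVICE <{endpoint}> {{\n"
            f'      ?obs ex:country "{country}" ; ex:indicator "{indicator}" ;\n'
            "           ex:year ?year ; ex:value ?value .\n"
            "    }\n"
            f'    BIND("{name}" AS ?source)\n'
            "  }"
        )
    body = "\n  UNION\n".join(blocks)
    return (
        "PREFIX ex: <http://example.org/vocab#>\n"
        "SELECT ?source ?year ?value WHERE {\n"
        f"{body}\n"
        "}\n"
        "ORDER BY ?year"
    )


# Hypothetical endpoints; only the indicator labels come from the slide.
sources = [
    ("Worldbank", "http://example.org/worldbank/sparql", "GDP (current US$)"),
    ("Eurostat", "http://example.org/eurostat/sparql", "GDP at Market Prices"),
]
query = union_query(sources, "Germany")
print(query)
```

Binding a `?source` value in each branch keeps the provenance of every row visible, which matters here because the two indicators are not directly comparable.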
  
Queries Across Related Data Sets

[Figure labels: Data from Worldbank; Data from Eurostat]
  
Summary and Outlook

•  Techniques for finding related data sets
   –  Based on finding related entities
•  Implementation available in open data portal
•  Outlook
   –  Finding relevant related data sources for a given information need
   –  End user interfaces for formulating queries across data sets (see Optique project)
   –  Operators for combining data cubes
   –  Interactive visualization and exploration of combined data cubes (see OpenCube project)
  
References

[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.
