Web smatch wod2012


  1. WebSmatch: a platform for data and metadata integration. Remi Coletta, Emmanuel Castanier, Patrick Valduriez, Christian Frisch, DuyHoa Ngo, Zohra Bellahsene
  2. Motivations. Context: open data in France. Problems: a high number of data sources, heterogeneous formats, and poorly structured data. Example (DataPublica): a web crawl for French open data sources found 148,509 Excel files and only 369 RDF files. Needs: integrate and visualize data sources to yield high-value information.
  3. www.data-publica.com. Business: marketplace for open data. Functions: crawl, classify, document and reference data sources in a search engine. The data is extracted and structured in a database so that it can be visualized and accessed through APIs. Problem: scaling to a high number of heterogeneous, poorly structured sources.
  4. DataPublica Workflow. DataPublica provides more than 10,000 XLS files (from several sources such as INSEE and various public organizations). WebSmatch is integrated into their workflow.
  5. Example of input (URL: http://www.data-publica.com/publication/4736). Problem: where are the data and the metadata? The file has incomplete lines and unnamed attributes. Existing tools such as OpenII or Google Refine work only on clean files.
  6. Example of input (URL: http://www.data-publica.com/publication/4736). Find the data table. Remove blank lines or columns.
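The clean-up step named on this slide (removing blank lines or columns) can be illustrated with a minimal sketch. It assumes the sheet has already been read into a list of rows; the function names and the sample values are hypothetical and are not taken from WebSmatch.

```python
def is_blank(cell):
    """A cell counts as blank when it is None or only whitespace."""
    return cell is None or str(cell).strip() == ""


def drop_blank_rows_and_columns(grid):
    """Return a copy of grid without fully blank rows or columns."""
    rows = [row for row in grid if not all(is_blank(c) for c in row)]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad ragged rows so that every column index is valid.
    padded = [list(row) + [None] * (width - len(row)) for row in rows]
    keep = [j for j in range(width)
            if not all(is_blank(r[j]) for r in padded)]
    return [[r[j] for j in keep] for r in padded]


# Illustrative sheet: one blank row on top, one in the middle, one blank column.
sheet = [
    ["", "", "", ""],
    ["Region", "", "2010", "2011"],
    ["", "", "", ""],
    ["Herault", "", 12, 15],
]
cleaned = drop_blank_rows_and_columns(sheet)
```

Here `cleaned` keeps only the header row and the data row, each with three columns.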
  7. Example of input (URL: http://www.data-publica.com/publication/4736). Find metadata such as titles. Identify collections for two-dimensional tables.
  8. WebSmatch workflow. Focus on the metadata extraction service. This service is not used if the input is already in a structured format (such as RDF, RDFS or OWL).
  9. Metadata Extraction: XLS example. First step: table detection using vision algorithms (dilate/erode).
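The slide names only the dilate/erode operators, so the following is a sketch under stated assumptions rather than the WebSmatch implementation. It treats the sheet as a binary mask (1 for a non-empty cell), applies a morphological closing (dilate, then erode) to fill small holes, and reports the bounding box of each connected region as a candidate table. The 4-neighbor structuring element and all names are assumptions.

```python
NEIGHBORS = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))


def dilate(mask):
    """Set a cell to 1 if it or any in-bounds 4-neighbor is 1."""
    h, w = len(mask), len(mask[0])
    return [[int(any(0 <= i + di < h and 0 <= j + dj < w and mask[i + di][j + dj]
                     for di, dj in NEIGHBORS))
             for j in range(w)] for i in range(h)]


def erode(mask):
    """Keep a cell only if it and all its in-bounds 4-neighbors are 1."""
    h, w = len(mask), len(mask[0])
    return [[int(all(mask[i + di][j + dj]
                     for di, dj in NEIGHBORS
                     if 0 <= i + di < h and 0 <= j + dj < w))
             for j in range(w)] for i in range(h)]


def bounding_boxes(mask):
    """Return (top, left, bottom, right) for each 4-connected region of 1s."""
    h, w = len(mask), len(mask[0])
    seen, boxes = set(), []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or (i, j) in seen:
                continue
            stack, box = [(i, j)], [i, j, i, j]
            seen.add((i, j))
            while stack:
                r, c = stack.pop()
                box = [min(box[0], r), min(box[1], c),
                       max(box[2], r), max(box[3], c)]
                for dr, dc in NEIGHBORS[1:]:
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and mask[nr][nc] and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            boxes.append(tuple(box))
    return boxes


# Illustrative occupancy mask: a 3x3 table with one empty cell, and a 2x2 table.
occupied = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
closed = erode(dilate(occupied))
tables = bounding_boxes(closed)
```

The closing fills the empty cell inside the first table without merging the two tables, so each one ends up with its own bounding box.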
  10. Metadata Extraction: XLS example. Second step: attribute detection using machine learning on cell content and neighborhood.
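The slide does not say which features or which learning algorithm are used. As a purely illustrative sketch, the features below describe a cell by its own content type and by the content types of its four neighbors, which is one plausible reading of "cell content and neighborhood". Such feature dictionaries could then be fed to any classifier; the feature names and the sample grid are hypothetical.

```python
def cell_kind(cell):
    """Classify a raw cell as 'empty', 'number' or 'text'."""
    if cell is None or str(cell).strip() == "":
        return "empty"
    try:
        # Accept a decimal comma, which is common in French spreadsheets.
        float(str(cell).replace(",", "."))
        return "number"
    except ValueError:
        return "text"


def cell_features(grid, i, j):
    """Describe cell (i, j) by its own kind and the kinds of its neighbors."""
    def kind_at(r, c):
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            return cell_kind(grid[r][c])
        return "outside"

    return {
        "kind": kind_at(i, j),
        "above": kind_at(i - 1, j),
        "below": kind_at(i + 1, j),
        "left": kind_at(i, j - 1),
        "right": kind_at(i, j + 1),
    }


# Illustrative grid: a header row followed by one data row.
grid = [
    ["Region", "Population"],
    ["Herault", 100],
]
header_features = cell_features(grid, 0, 1)
```

A text cell with nothing above it and a number below it, like `header_features` here, is the kind of pattern a classifier could learn to label as an attribute name.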
  11. Metadata Extraction: XLS example. Third step: automatic detection of concepts using YAM++ (14 matching techniques such as string matching, instance-based matching and WordNet). YAM++ came 1st and 2nd at OAEI 2011: http://oaei.ontologymatching.org/2011/results/
  12. WebSmatch Workflow. Focus on the matching service. It relies on YAM++, which combines different metrics (string, WordNet, instance-based).
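To illustrate what combining metrics can look like, here is a minimal sketch that mixes two string-level similarities with fixed weights. It is not YAM++: the two metrics, the equal weights and the function names are assumptions, and the WordNet and instance-based metrics mentioned on the slide are left out.

```python
import re
from difflib import SequenceMatcher


def tokens(label):
    """Split a label into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", label.lower())


def edit_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a, b):
    """Jaccard overlap of the token sets of the two labels."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def combined_score(a, b, weights=(0.5, 0.5)):
    """Weighted average of the two metrics (the weights are an assumption)."""
    w_edit, w_token = weights
    return w_edit * edit_similarity(a, b) + w_token * token_similarity(a, b)


def best_match(label, candidates):
    """Return the candidate concept with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(label, c))


match = best_match("postal code", ["Code_Postal", "population", "year"])
```

The token metric makes the match robust to word order and separators, while the character metric still rewards near-identical spellings.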
  13. Data Visualization. Structured export format that is easy for third parties to use: DSPL, the DataSet Publishing Language from Google Inc. (see https://developers.google.com/public-data/). For two-dimensional tables, we need to denormalize, as DSPL uses flat CSV files for data.
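The denormalization mentioned on this slide can be sketched as follows, assuming a cross-tab whose first column holds row labels and whose header holds column labels. The column names `region`, `year` and `value` and the sample numbers are illustrative; the sketch covers only the flat CSV data and none of the other parts of a DSPL dataset.

```python
import csv
import io


def denormalize(header, rows, row_key="region", column_key="year", value="value"):
    """Turn a cross-tab into one record per (row label, column label) pair."""
    column_labels = header[1:]
    records = []
    for row in rows:
        for label, cell in zip(column_labels, row[1:]):
            records.append({row_key: row[0], column_key: label, value: cell})
    return records


def to_csv(records, fieldnames):
    """Serialize the records as flat CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative two-dimensional table: regions as rows, years as columns.
header = ["Region", "2010", "2011"]
rows = [["Herault", 12, 15], ["Gard", 7, 9]]

records = denormalize(header, rows)
flat_csv = to_csv(records, ["region", "year", "value"])
```

Each cell of the original table becomes one CSV line, so a table with two rows and two year columns yields four records.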
  14. Exporting the Results: integrated metadata. How to make richer datasets: aggregation or intersection, either using generic concepts such as time or location, or finding a specific concept using the matching.
  15. Visualizing the Results
  16. Visualizing the Results. http://api.data-publica.com/…/content.json?limit=10&filter={revenue_fiscal_par_foyer:{$gt:25000}} Features: multi-format (JSON, XML, spreadsheet, CSV), geolocalized queries, mashups.
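The query string in the URL above can be built programmatically, as in this minimal sketch. Only the `limit` and `filter` parameters and the `revenue_fiscal_par_foyer` field come from the slide. The elided path is left out on purpose, and whether the service accepts a JSON-encoded filter with quoted keys is an assumption.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query(filter_spec, limit=10):
    """Encode a filter dictionary and a row limit as a URL query string."""
    # Assumption: the filter is sent as compact JSON in a single parameter.
    encoded_filter = json.dumps(filter_spec, separators=(",", ":"))
    return urlencode({"limit": limit, "filter": encoded_filter})


# Same condition as on the slide: field value greater than 25000.
query = build_query({"revenue_fiscal_par_foyer": {"$gt": 25000}})

# Decoding the query string recovers the original parameters.
decoded = parse_qs(query)
```

Percent-encoding the braces and the `$` sign keeps the filter intact when the query is appended to the endpoint URL.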
  17. Perspectives. (1) Automating large-volume extraction: confidence / machine learning. (2) Clustering documents (on specific concepts and concept instances). Integration with other tools: Google Refine, RDF export.
  18. Conclusion. WebSmatch is a flexible environment for open data integration. End-to-end process: importing, cleansing and integrating data sources. DSPL export format for visualization. Real-world validation with DataPublica data sources.
