WebSmatch WOD2012


Transcript

  • 1. WebSmatch: a platform for data and metadata integration
    Remi Coletta, Emmanuel Castanier, Patrick Valduriez, Christian Frisch, DuyHoa Ngo, Zohra Bellahsene
  • 2. Motivations
    Context: open data in France
    Problems: high number of data sources, heterogeneous formats, poorly structured sources
    Example (DataPublica): the web crawl for French open data sources found 148509 Excel files and only 369 RDF files
    Needs: integrate and visualize data sources to yield high-value information
  • 3. www.data-publica.com
    Business: marketplace for open data
    Functions: crawl, classify, document and reference data sources in a search engine
    The data is extracted and structured in a database so that it can be visualized and accessed through APIs
    Problem: scale to high numbers of heterogeneous, poorly structured sources
  • 4. DataPublica Workflow
    DataPublica provides more than 10 000 XLS files (from several sources such as INSEE, various public organizations...)
    WebSmatch is integrated in their workflow
  • 5. Example of input
    URL: http://www.data-publica.com/publication/4736
    Problem: where are the data and metadata? Incomplete lines, unnamed attributes
    Existing tools such as OpenII or Google Refine work only on clean files
  • 6. Example of input
    URL: http://www.data-publica.com/publication/4736
    Find the data table
    Remove blank lines or columns
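The cleanup step on this slide (removing blank lines or columns) can be illustrated with a minimal sketch. It assumes the sheet has already been read into a list of rows; the function name `strip_blank_rows_and_columns` is hypothetical and is not part of WebSmatch.

```python
def strip_blank_rows_and_columns(grid):
    """Drop rows and columns whose cells are all empty (None or whitespace)."""
    def is_blank(cell):
        return cell is None or str(cell).strip() == ""

    # Keep only rows that contain at least one non-blank cell.
    rows = [row for row in grid if not all(is_blank(c) for c in row)]
    if not rows:
        return []

    # Pad ragged rows so that every column index is defined for every row.
    width = max(len(row) for row in rows)
    padded = [list(row) + [None] * (width - len(row)) for row in rows]

    # Keep only columns that contain at least one non-blank cell.
    keep = [j for j in range(width) if not all(is_blank(r[j]) for r in padded)]
    return [[r[j] for j in keep] for r in padded]


sheet = [
    ["", "", ""],
    ["Region", "", "2010"],
    [None, None, None],
    ["Paris", " ", 12],
]
cleaned = strip_blank_rows_and_columns(sheet)
```

Here `cleaned` keeps the two non-empty rows and drops the empty middle column.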
  • 7. Example of input
    URL: http://www.data-publica.com/publication/4736
    Find metadata such as titles
    Identify collections for bidimensional tables
  • 8. WebSmatch workflow
    Focus on the metadata extraction service
    This service is not used if the input is in a structured format (such as RDF, RDFS, OWL...)
  • 9. Metadata Extraction: XLS example
    First step: table detection using vision algorithms (dilate/erode)
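The slide names dilate/erode but gives no further detail, so the following is only a sketch of one plausible reading: treat the sheet as a binary image of non-empty cells, apply a morphological closing (dilation followed by erosion) with a 3x3 window to fill small holes, then take the bounding box of each connected region as a candidate table. All function names are hypothetical, and the grid is assumed to be rectangular.

```python
def _window(mask, i, j):
    """Values in the 3x3 window around (i, j); cells outside the mask count as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[y][x] if 0 <= y < h and 0 <= x < w else 0
        for y in range(i - 1, i + 2)
        for x in range(j - 1, j + 2)
    ]


def dilate(mask):
    """A cell becomes 1 if any cell in its 3x3 window is 1."""
    return [[int(any(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def erode(mask):
    """A cell stays 1 only if every cell in its 3x3 window is 1."""
    return [[int(all(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def close_mask(mask):
    """Dilate then erode; a one-cell border of zeros keeps edge cells intact."""
    w = len(mask[0])
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in mask] + [[0] * (w + 2)]
    closed = erode(dilate(padded))
    return [row[1:-1] for row in closed[1:-1]]


def bounding_boxes(mask):
    """Return (top, left, bottom, right) for each 4-connected region of 1s."""
    h, w = len(mask), len(mask[0])
    seen, boxes = set(), []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or (i, j) in seen:
                continue
            stack, top, left, bottom, right = [(i, j)], i, j, i, j
            seen.add((i, j))
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def detect_tables(grid):
    """Candidate table regions of a rectangular grid of cell values."""
    mask = [[int(c is not None and str(c).strip() != "") for c in row] for row in grid]
    return bounding_boxes(close_mask(mask))
```

With a 3x3 window, regions separated by a single blank row or column are merged into one candidate table; a real system would have to tune the window size to the layout of its sheets.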
  • 10. Metadata Extraction: XLS example
    Second step: attribute detection using machine learning on cell content and neighborhood
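The slides do not say which features or which learning algorithm are used. As a hedged illustration of what "cell content and neighborhood" could mean, the sketch below computes a few features for one cell: some from its own content, some from the cells below it and to its right. The feature set and the names `is_number` and `cell_features` are assumptions made for this example; the resulting dictionaries would be fed to whatever classifier is chosen.

```python
def is_number(cell):
    """True if the cell parses as a number (a decimal comma is accepted)."""
    try:
        float(str(cell).replace(",", "."))
        return True
    except ValueError:
        return False


def cell_features(grid, i, j):
    """Features of grid[i][j] based on its content and its neighborhood."""
    cell = grid[i][j]
    text = "" if cell is None else str(cell).strip()

    def neighbour(di, dj):
        # Neighbors outside the sheet are treated as empty cells.
        y, x = i + di, j + dj
        if 0 <= y < len(grid) and 0 <= x < len(grid[y]):
            return grid[y][x]
        return None

    below = [neighbour(d, 0) for d in (1, 2, 3)]
    right = [neighbour(0, d) for d in (1, 2, 3)]
    return {
        "is_empty": text == "",
        "is_numeric": is_number(text),
        "length": len(text),
        # A text cell sitting above numeric cells is a plausible attribute name.
        "numeric_below": sum(is_number(c) for c in below),
        "numeric_right": sum(is_number(c) for c in right),
    }


sheet = [
    ["Year", "Population"],
    [2010, 100],
    [2011, 110],
]
header_features = cell_features(sheet, 0, 0)
value_features = cell_features(sheet, 1, 0)
```

In this toy sheet the cell "Year" is non-numeric with two numeric cells below it, which is the kind of signal a classifier could learn to associate with attribute names.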
  • 11. Metadata Extraction: XLS example
    Third step: automatic detection of concepts using YAM++ (14 matching techniques such as string matching, instance-based, WordNet...)
    YAM++ came 1st and 2nd at OAEI 2011: http://oaei.ontologymatching.org/2011/results/
  • 12. WebSmatch Workflow
    Focus on the matching service
    Relies on YAM++, combining different metrics (string, WordNet, instance-based)
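To make the idea of combining metrics concrete, here is a minimal sketch that scores a pair of attributes with a weighted average of three similarities: a character-level string similarity, a token overlap, and an overlap of instance values. This is not the YAM++ algorithm; the metrics, the weights and all function names are assumptions chosen for illustration, and the WordNet-based metric is left out because it needs an external resource.

```python
import re
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-level similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(set_a, set_b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def token_similarity(a, b):
    """Jaccard overlap of the word tokens in two attribute names."""
    def tokens(s):
        return {t for t in re.split(r"[\W_]+", s.lower()) if t}
    return jaccard(tokens(a), tokens(b))


def combined_score(name_a, values_a, name_b, values_b, weights=(0.4, 0.2, 0.4)):
    """Weighted average of name, token and instance similarities."""
    scores = (
        string_similarity(name_a, name_b),
        token_similarity(name_a, name_b),
        jaccard(set(values_a), set(values_b)),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def best_match(name, values, candidates):
    """Pick the candidate (name, values) pair with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(name, values, c[0], c[1]))


candidates = [
    ("city_name", ["Paris", "Lyon", "Nice"]),
    ("year", ["2010", "2011"]),
]
match = best_match("name_of_city", ["Paris", "Lyon"], candidates)
```

The instance-based term is what lets two differently named columns match when they hold the same values.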
  • 13. Data Visualization
    Structured export format that is easy for third parties to use: DSPL
    DSPL: DataSet Publishing Language from Google Inc., see https://developers.google.com/public-data/
    For bidimensional tables, we need to denormalize, as DSPL uses flat CSV files for data
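Denormalizing a bidimensional table means turning each (row label, column label) cell into its own flat record. The sketch below shows that transformation and writes the result as CSV text. The column names and the sample figures are made up for the example, and the DSPL XML description that accompanies the CSV files is not covered.

```python
import csv
import io


def denormalize(row_header, column_labels, rows, column_name, value_name):
    """Turn a cross table into one flat record per (row, column) pair.

    Each entry of `rows` is [row_label, value_1, value_2, ...], with one
    value per entry of `column_labels`.
    """
    flat = [[row_header, column_name, value_name]]
    for label, *values in rows:
        for column_label, value in zip(column_labels, values):
            flat.append([label, column_label, value])
    return flat


def to_csv(records):
    """Serialize records as CSV text with Unix line endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


# Hypothetical cross table: cities as rows, years as columns.
cross_table = [
    ["Paris", 10, 12],
    ["Lyon", 5, 6],
]
flat = denormalize("city", ["2010", "2011"], cross_table, "year", "population")
csv_text = to_csv(flat)
```

The two-row, two-column cross table becomes four flat records, one per city and year.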
  • 14. Exporting the Results: integrated metadata
    How to make richer datasets: aggregation or intersection
    using generic concepts such as time or location
    finding a specific concept using the matching
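One way to read "intersection using generic concepts such as time or location" is an inner join of two flat datasets on the columns that were matched to the same concepts. The sketch below assumes both datasets are lists of dictionaries and that the shared columns already carry the same names; `intersect_on` and the sample records are hypothetical.

```python
def intersect_on(left, right, keys):
    """Inner join of two lists of dict records on the shared concept columns."""
    # Index the right-hand records by their key values for fast lookup.
    index = {}
    for record in right:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    joined = []
    for record in left:
        for match in index.get(tuple(record[k] for k in keys), []):
            # Merge both records; on a name clash the left-hand value wins.
            joined.append({**match, **record})
    return joined


population = [
    {"city": "Paris", "year": 2010, "population": 10},
    {"city": "Lyon", "year": 2010, "population": 5},
]
income = [
    {"city": "Paris", "year": 2010, "income": 30},
]
richer = intersect_on(population, income, keys=("city", "year"))
```

Only the Paris record appears in both inputs, so `richer` contains a single record carrying both `population` and `income`.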
  • 15. Visualizing the Results
  • 16. Visualizing the Results
    http://api.data-publica.com/…/content.json?limit=10&filter={revenue_fiscal_par_foyer:{$gt:25000}}
    Multi-format (JSON, XML, spreadsheet, CSV)
    Geolocalized queries
    Mashups
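The query on this slide combines a `limit` parameter with a `filter` object that uses a `$gt` (greater than) operator. As a sketch only, the code below builds such a query string with proper URL encoding. It assumes the `filter` parameter accepts a JSON-encoded object, which the slide does not confirm, and it uses a placeholder endpoint because the real path is elided in the slide.

```python
import json
from urllib.parse import urlencode


def build_query_url(endpoint, filter_spec, limit=10):
    """Append URL-encoded `limit` and `filter` parameters to an endpoint."""
    query = urlencode({
        "limit": limit,
        # Compact JSON keeps the encoded URL short.
        "filter": json.dumps(filter_spec, separators=(",", ":")),
    })
    return f"{endpoint}?{query}"


# Placeholder endpoint: the actual DataPublica path is not given in full.
url = build_query_url(
    "https://example.org/content.json",
    {"revenue_fiscal_par_foyer": {"$gt": 25000}},
)
```

Encoding the filter through `urlencode` avoids sending raw braces and dollar signs, which some HTTP clients and servers reject.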
  • 17. Perspectives
    1. Automating large volume extraction: confidence / machine learning
    2. Clustering documents (on specific concepts & concept instances)
    Integration with other tools: Google Refine, RDF export
  • 18. Conclusion
    WebSmatch is a flexible environment for Open Data integration
    End-to-end process: importing, data cleansing and integrating data sources
    DSPL export format for visualization
    Real validation with DataPublica data sources