Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues


This tutorial on Open Government Data was a four-hour session at the Conferencia Latinoamericana en Informatica (CLEI 2013), http://clei2013.org.ve/, divided into 5 parts:

1 - Introduction
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-1-introduction

2 - Issues
https://www.slideshare.net/jpane/02-issues-v4slideshare

3 - Real Experience
http://www.slideshare.net/jpane/open-government-data-tutorial-03-real-experience

4 - Applications
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-4-applications

5 - Semantic Issues
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-5-semantic-issues

This is part 5 - Semantic Issues

Published in: Technology, Education
Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues

  1. OGD: Part 5 – Semantic Issues. Juan Pane: jpane@pol.una.py, Lorenzino Vaccari: lorenzino.vaccari@gmail.com. Juan Pane, Lorenzino Vaccari, http://dati.trentino.it/, 08/10/2013
  2. Outline
     • Overview
     • Issues of opening data
     • Entity-centric semantic layer
     • Importing pipeline
     • Importing tool
  3. Linked Open Data: Available, Structured, Open formats, Dereferenceable, Linked. "The best data is open data" vs. "All data must be perfect".
  4. Lack of explicit semantics: the real meaning of the data was kept in the developer's mind when creating the data. http://goo.gl/npEHKr (Thanks to Moaz Reyad)
  5. Lack of explicit semantics can lead to things like: http://goo.gl/npEHKr (Thanks to Moaz Reyad)
  6. Semantic heterogeneity: difference in the meaning of local data.
  7. Issues when Opening Trentino Data
     • Each department has authority over only part of the data.
     • Datasets originally created for internal use only.
     • Datasets created for a specific need.
     • Datasets created with custom formats:
       • For structure (some exceptions)
       • For data
     • Lack of reuse -> duplication.
     • Lack of programmers.
     • We cannot (always) TELL them what to do or how to do it.
     • Data changes.
  8. Data Catalog: Available, Structured, Open formats, Dereferenceable, Linked. Entity Centric Semantic Layer.
  9. Entity centric: Added value
     • Aggregated data
     • Accurate data, manually curated
     • Unique identifiers, distributed perspectives
     • Re-think identifiers
     • Semantified values
     [Figure: two entity descriptions, E1 and E2, with the attributes name, nationality, born in, lives in, date of birth, and affiliation; values shown include Juan Pane, Ignacio P. F., italian, Paraguay, Trento, 1980, Univ. Trento, and PF-UNA.]
  10. Entities
      • Real world: something that has a distinct, separate existence, although it need not be a material (physical) existence. It has a set of properties, which evolve over time.
      • Mental: a personal (local) model, created and maintained by a person, that references and describes a real-world entity.
      • Digital: captures the semantics of real-world entities, as provided by people.
  11. Entity Centric Semantic Layer: addresses the integration problems due to semantic heterogeneity:
      • Different formats
      • Different identifiers
      • Implicit semantics
      • Homonyms, synonyms, aliases
      • Partial knowledge
      • Knowledge evolution
      http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/
  12. Entity-based Integration
      • Focus on entities as first-class citizens.
      • Entities are objects important enough in our everyday life to be referred to by name.
      • Each entity has its own metadata (e.g. name, latitude, longitude, …).
      • Each entity is related to many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany).
      • There are relatively "few" commonsense entity types (person, …, event).
      • There are many domain-specific entities (bus stops, cycling paths, ...).
      • All components have explicit semantics: schema, entities, attributes, values.
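The entity-based view on slide 12 (typed entities, each with its own metadata and relations to other entities) can be illustrated with a minimal sketch. The `Entity` class, the identifiers `e1` to `e3`, the relation names, and the in-memory `store` are hypothetical and are not taken from the tutorial's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A digital entity: a typed record with attributes and links to other entities."""
    entity_id: str                                   # hypothetical identifier scheme
    entity_type: str                                 # e.g. "person", "city", "country"
    attributes: dict = field(default_factory=dict)   # plain metadata values
    relations: dict = field(default_factory=dict)    # relation name -> target entity_id


# Example entities taken from the slide text (Einstein, Ulm, Germany).
germany = Entity("e3", "country", {"name": "Germany"})
ulm = Entity("e2", "city", {"name": "Ulm"}, {"part_of": "e3"})
einstein = Entity("e1", "person", {"name": "Einstein"}, {"born_in": "e2"})

# Assumed in-memory entity store, keyed by identifier.
store = {e.entity_id: e for e in (germany, ulm, einstein)}


def follow(entity, relation):
    """Resolve a relation to the entity it points to, or None if it is absent."""
    target_id = entity.relations.get(relation)
    return store.get(target_id)


birthplace = follow(einstein, "born_in")
print(birthplace.attributes["name"])
```

Because relations hold identifiers rather than strings, "born in Ulm" points at the Ulm entity itself, which is what makes the values "semantified" rather than free text.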
  13. Importing pipeline, macro steps
      • Step 1, Domain analysis: study the needed entity types and adapt the knowledge base accordingly.
      • Step 2, Import entities: semi-automatic tool.
      Notes on the slide:
      • First-time bootstrapping.
      • Domain experts are expensive.
      • Human attention is a scarce resource.
      • Incremental enrichment and aggregation of entities.
  14. Open Data Peculiarities
      • All data comes from a CKAN repository (DCAT).
      • Process one data file at a time.
      • Each data file can be represented as a table.
      • Each row in the table represents a (partial) entity.
      • The format of the values might not be enforced in the data files.
      • Not all data is relevant.
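The peculiarities on slide 14 (one file at a time, each file a table, each row a partial entity, not all data relevant) can be sketched as follows. The sample CSV text and the function name `rows_as_partial_entities` are assumptions made for illustration; a real data file would be fetched from the catalog instead.

```python
import csv
import io

# Hypothetical sample standing in for one data file downloaded from the catalog.
SAMPLE = """nome,provincia,funivie
Andalo,Provincia di Trento,3
Canazei,,2
"""


def rows_as_partial_entities(text, relevant_columns):
    """Yield one dict per table row, keeping only relevant, non-empty values."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {k: v for k, v in row.items() if k in relevant_columns and v}


# Only "nome" and "provincia" are treated as relevant here; "funivie" is dropped.
entities = list(rows_as_partial_entities(SAMPLE, {"nome", "provincia"}))
print(entities)
```

The second row has no province, so it yields an entity with fewer attributes, which is what "partial" means in this setting.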
  15. Data Catalog: Available, Structured, Open formats, Dereferenceable, Linked. Entity-centric importing tool.
  16. Importing tool process
  17. Step 1, Source Selection: import one data file at a time.
  18. Step 2, Schema Matching: select a target entity type, then establish correspondences between the input columns and the output attributes.
      Input table LocalitaTuristica:
      nome | provincia | descrizione | lat | long | funivie
      Andalo (1047) | Provincia di Trento | Sorge su un'ampia sella prativa al centro... | 654463 | 712857 | 3
      Canazei (1450) | Trento Prov. | Situato all'estremità settentrionale della... | 511504 | 147444 | 2
      Target attributes: Nome, Provincia, Quota, Coordinate, Descrizione, popolazione
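A minimal sketch of the schema matching step, assuming plain name similarity as the matching criterion. This is a stand-in technique chosen for illustration, not the schema matching service used by the authors; `match_schema` and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def match_schema(columns, attributes, threshold=0.8):
    """Suggest, for each input column, the most similar target attribute.

    Columns whose best similarity falls below the threshold map to None,
    meaning a person has to decide (the tool is semi-automatic).
    """
    matches = {}
    for col in columns:
        best, best_score = None, 0.0
        for attr in attributes:
            score = SequenceMatcher(None, col.lower(), attr.lower()).ratio()
            if score > best_score:
                best, best_score = attr, score
        matches[col] = best if best_score >= threshold else None
    return matches


columns = ["nome", "provincia", "descrizione", "lat", "long", "funivie"]
attributes = ["Nome", "Provincia", "Quota", "Coordinate", "Descrizione", "popolazione"]
suggested = match_schema(columns, attributes)
print(suggested)
```

Name similarity finds the three obvious correspondences but cannot tell that `lat` and `long` together make up `Coordinate`; correspondences of that kind need either a semantic matcher or confirmation by a user.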
  19. Step 3, Data Validation: applies format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.
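The validation step can be sketched as a set of per-column converters that either transform a raw value into the expected format or report an error. The converter names, the `EXPECTED` mapping, and the sample row are assumptions for illustration only.

```python
def non_empty_text(value):
    """Collapse repeated whitespace; reject empty values."""
    text = " ".join(value.split())
    if not text:
        raise ValueError("empty value")
    return text


def whole_number(value):
    """Transform a textual number into an int; int() raises ValueError otherwise."""
    return int(value.strip())


def validate_row(row, expected):
    """Return (clean_row, errors) for one input row.

    `expected` maps a column name to a converter that raises ValueError
    when the raw value cannot be brought into the expected format.
    """
    clean, errors = {}, {}
    for column, convert in expected.items():
        raw = row.get(column, "")
        try:
            clean[column] = convert(raw)
        except ValueError as exc:
            errors[column] = str(exc)
    return clean, errors


# Hypothetical expectations for three of the input columns.
EXPECTED = {"nome": non_empty_text, "lat": whole_number, "funivie": whole_number}
clean, errors = validate_row(
    {"nome": "  Andalo   (1047)", "lat": "654463", "funivie": "three"}, EXPECTED
)
print(clean, errors)
```

Returning the errors alongside the cleaned values, instead of stopping at the first problem, lets a user review all issues in a row at once.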
  20. Step 4, Semantic Enrichment (1/2). Entity disambiguation: transform text references into links to existing entities.
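A minimal sketch of entity disambiguation, assuming a simple alias lookup after normalization. The entity identifier `province/trento` and the alias list are hypothetical; the two spellings "Provincia di Trento" and "Trento Prov." come from the sample table on the schema matching slide.

```python
import re

# Hypothetical entity store: entity id -> known names and aliases.
KNOWN_ENTITIES = {
    "province/trento": ["Provincia di Trento", "Trento Prov."],
}


def normalize(text):
    """Lowercase and drop punctuation so spelling variants compare equal."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


# Index every normalized alias to the entity it denotes.
ALIAS_INDEX = {
    normalize(alias): entity_id
    for entity_id, aliases in KNOWN_ENTITIES.items()
    for alias in aliases
}


def disambiguate(reference):
    """Return the id of the entity a text reference denotes, or None if unknown."""
    return ALIAS_INDEX.get(normalize(reference))


print(disambiguate("Trento Prov."), disambiguate("Bolzano"))
```

After this step both rows of the sample table would carry the same link in their province column instead of two different strings.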
  21. Step 4, Semantic Enrichment (2/2). Natural Language Processing: extract concepts and entity references from free text.
  22. Step 5, Reconciliation: run Identity Management Algorithms to identify each row as a new or existing entity.
      Result:
      • No Match
      • Match
      • Multiple Matches
      Action:
      • Use ID
      • New ID
      • Ignore Row
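The reconciliation outcomes can be sketched as below. The slide lists the results and the actions but does not state which action goes with which result, so the pairing used here (no match: new ID, match: use ID, multiple matches: ignore row) is an assumption, as are the function name, the exact-name comparison, and the use of UUIDs for new identifiers.

```python
import uuid


def reconcile(row, store, key="nome"):
    """Classify a row against existing entities and pick an action.

    Returns (result, action, entity_id). Matching is a case-insensitive
    comparison on a single attribute, which is far simpler than a real
    identity management algorithm.
    """
    wanted = row.get(key, "").lower()
    candidates = [eid for eid, entity in store.items()
                  if entity.get(key, "").lower() == wanted]
    if not candidates:
        return "no match", "new id", str(uuid.uuid4())
    if len(candidates) == 1:
        return "match", "use id", candidates[0]
    return "multiple matches", "ignore row", None


# Hypothetical existing entities; two of them share a name.
existing = {
    "e1": {"nome": "Andalo"},
    "e2": {"nome": "Trento"},
    "e3": {"nome": "trento"},
}
print(reconcile({"nome": "andalo"}, existing))
```

Ignoring ambiguous rows is the conservative choice: it avoids merging a row into the wrong entity, at the cost of leaving that row for manual review.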
  23. Step 6, Exporting. At this point:
      • We know what to export.
      • All values for target attributes conform to the expected format.
      • All text has been semantified (NLP).
      • All textual references to entities are converted to links.
      • Each row has an identifier.
      [Figure labels: v0, i, i+1]
  24. Step 7, Publishing: put the semantified entities back into CKAN so that the entities can be Open Data and can be found in the same catalog as the original data.
      • Developers can find the data files of the cleaned, aggregated entities.
      • They can also interact with the entities via the Entitypedia APIs.
      Step 8, Visualization: search and navigation.
  25. Semantic Layer: Services. Tool for aiding the "semantification" of the datasets in the catalog, based on:
      • Schema matching services
      • Identity Management services
      • Entity Matching services
      • Global Unique Identifier services
      • Semantic search and indexing services
      • Natural Language Processing
      • Entity store
  26. Our Goal [Figure labels: TN, UK, ES, BE]
  27. [Image: http://www.shabra.com/wp-content/uploads/2011/03/lets-work-together.jpg]
  28. Gracias! Grazie! Merci! Thanks! Kiitos! Dank u! Gràcies! Gratias! Danke! ευχαριστώ. We thank in particular CLEI 2013, the Autonomous Province of Trento, the TrentoRise association, Universidad Nacional de Asuncion, and the University of Trento.
