OGD: Part 5 – Semantic Issues
Juan Pane: jpane@pol.una.py
Lorenzino Vaccari: lorenzino.vaccari@gmail.com


Juan Pane, Lorenzino Vaccari

http://dati.trentino.it/

08/10/2013
Outline
• Overview
• Issues of opening data
• Entity centric Semantic layer
• Importing pipeline

• Importing tool

Linked Open Data
[Figure: five-level open data scale: Available, Structured, Open formats, Dereferenceable, Linked]

"The best data is open data" vs. "All data must be perfect"

Lack of explicit semantics
The real meaning of the data stayed in the developers' minds when the data was created.

http://goo.gl/npEHKr (Thanks to Moaz Reyad)

Lack of explicit semantics
Can lead to things like:

http://goo.gl/npEHKr (Thanks to Moaz Reyad)

Semantic heterogeneity
Difference in the meaning of local data

Issues when Opening Trentino Data
• Each department has authority over only part of the data.
• Datasets were originally created for internal use only.
• Datasets were created for a specific need.
• Datasets were created with custom formats:
  • for structure (with some exceptions)
  • for data
• Lack of reuse leads to duplication.
• Lack of programmers.
• We cannot (always) tell them what to do or how to do it.
• Data changes.

[Figure: five-level open data scale (Available, Structured, Open formats, Dereferenceable, Linked), annotated with "Data Catalog" and "Entity Centric Semantic Layer"]
Entity centric: Added value
• Aggregated data
• Accurate data, manually curated
• Unique identifiers, distributed perspectives
• Re-think identifiers
• Semantified values
[Figure: two entity descriptions, E1 and E2, with attributes name, nationality, born in, lives in, date of birth and affiliation, and values Ignacio P. F., Juan Pane, italian, Paraguay, Trento, 1980, Univ. Trento and PF-UNA]
Entities
• Real world: something that has a distinct, separate existence, although it need not be a material (physical) one. It has a set of properties, which evolve over time.
  [Figure: example of a real-world entity]
• Mental: a personal (local) model, created and maintained by a person, that references and describes a real-world entity.
• Digital: captures the semantics of real-world entities, as provided by people.
Entity Centric Semantic Layer
• Addresses the integration problems caused by semantic heterogeneity:
  • Different formats
  • Different identifiers
  • Implicit semantics
  • Homonyms, synonyms, aliases
  • Partial knowledge
  • Knowledge evolution
http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/

Entity-based Integration
• Focus on entities as first-class citizens
• Entities are objects so important in our everyday life that we refer to them by name
• Each entity has its own metadata (e.g. name, latitude, longitude, …)
• Each entity is related to many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany)
• There are relatively “few” commonsense entity types (person, …, event)
• There are many domain-specific entity types (bus stops, cycling paths, …)
• All components have explicit semantics: schema, entities, attributes, values
Importing pipeline, Macro Steps
1. Domain analysis
   • Study the needed entity types and adapt the knowledge base accordingly.
   • First-time bootstrapping.
2. Import entities
   • Semi-automatic tool.
   • Domain experts are expensive.
   • Human attention is a scarce resource.
   • Incremental enrichment and aggregation of entities.
Open Data Peculiarities
• All data comes from a CKAN repository (DCAT).
• Process one data file at a time.
• Each data file can be represented as a table.
• Each row in the table represents a (partial) entity.
• The format of the values might not be enforced in the data files.
• Not all data is relevant.
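
A minimal sketch of the "one file, one table, one row per partial entity" view, assuming the data file is a CSV. The inline sample stands in for a file downloaded from the catalog; its values and the function name `rows_as_partial_entities` are illustrative only.

```python
import csv
import io

# Inline stand-in for a data file from the catalog (illustrative values).
SAMPLE = """nome,provincia,funivie
Andalo (1047),Provincia di Trento,3
Canazei (1450),Trento Prov.,
"""


def rows_as_partial_entities(text, relevant_columns):
    """Yield one dict per row, keeping only relevant, non-empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {
            column: value.strip()
            for column, value in row.items()
            if column in relevant_columns and value and value.strip()
        }


# Not all data is relevant: here only two of the three columns are kept.
entities = list(rows_as_partial_entities(SAMPLE, {"nome", "funivie"}))
```

The second row has no value for `funivie`, so the resulting entity is partial: it carries only the attributes the file actually provides.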

[Figure: five-level open data scale (Available, Structured, Open formats, Dereferenceable, Linked), annotated with "Data Catalog" and "Entity centric Importing tool"]
Importing tool process

1. Source Selection
Import one data file at a time

2. Schema Matching
Select a target entity type, then define the correspondences between the input columns and the output attributes.

LocalitaTuristica

nome           | provincia           | descrizione                                   | lat    | long   | funivie
Andalo (1047)  | Provincia di Trento | Sorge su un'ampia sella prativa al centro...  | 654463 | 712857 | 3
Canazei (1450) | Trento Prov.        | Situato all'estremità settentrionale della... | 511504 | 147444 | 2

Output attributes:
• Nome
• Provincia
• Quota
• Coordinate
• Descrizione
• popolazione
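
A schema-matching result can be pictured as a set of correspondences applied to each input row. The sketch below is an assumption about what such correspondences might look like for the LocalitaTuristica example; the slide does not state which column maps to which attribute, so the mapping and the function name `apply_matching` are hypothetical.

```python
# Hypothetical correspondences: output attribute -> input column(s).
CORRESPONDENCES = {
    "Nome": ["nome"],
    "Provincia": ["provincia"],
    "Descrizione": ["descrizione"],
    "Coordinate": ["lat", "long"],
}


def apply_matching(row, correspondences):
    """Project an input row onto the output attributes."""
    result = {}
    for target, columns in correspondences.items():
        values = [row[c] for c in columns if c in row]
        if len(values) != len(columns):
            continue  # an input column is missing: leave the attribute unset
        result[target] = values[0] if len(values) == 1 else tuple(values)
    return result


row = {
    "nome": "Andalo (1047)",
    "provincia": "Provincia di Trento",
    "descrizione": "Sorge su un'ampia sella prativa al centro...",
    "lat": "654463",
    "long": "712857",
    "funivie": "3",
}
matched = apply_matching(row, CORRESPONDENCES)
```

Columns without a correspondence (here `funivie`) are dropped, and output attributes without a source column (here Quota and popolazione) stay unset until a later step fills them.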
3. Data Validation
Applies format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.
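
As a rough illustration of this step, the sketch below validates one row and applies one automatic transformation. Two things are assumptions: that the number in parentheses after the place name is its elevation ("Quota"), and that `lat` and `long` are expected to be numeric. The function name `validate_row` is hypothetical.

```python
import re

# Matches values such as "Andalo (1047)".
NAME_WITH_NUMBER = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<number>\d+)\)\s*$")


def validate_row(row):
    """Return (clean_row, errors) for one input row."""
    clean, errors = dict(row), []

    # Automatic transformation: split "Andalo (1047)" into a name and a number.
    # Reading the number as the elevation is an assumption.
    match = NAME_WITH_NUMBER.match(row.get("nome") or "")
    if match:
        clean["nome"] = match.group("name")
        clean["quota"] = int(match.group("number"))

    # Format validation: coordinates must parse as numbers.
    for column in ("lat", "long"):
        try:
            clean[column] = float(row[column])
        except (KeyError, TypeError, ValueError):
            errors.append(f"{column}: expected a number, got {row.get(column)!r}")

    return clean, errors


clean, errors = validate_row({"nome": "Andalo (1047)", "lat": "654463", "long": "n/a"})
```

Collecting errors instead of raising lets a semi-automatic tool show every problem in a row to the user at once.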

4. Semantic Enrichment (1/2)
Entity disambiguation: Transform text references into links to existing entities.
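
A deliberately simple sketch of turning text references into links: an alias lookup over normalized names. Real disambiguation would also use context to choose between candidates; this stand-in does not. The entity identifier and the alias list are hypothetical.

```python
import re

# Hypothetical entity store: identifier -> known names and aliases.
KNOWN_ENTITIES = {
    "e:province-of-trento": {"Provincia di Trento", "Trento Prov."},
}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).lower().split())


ALIAS_INDEX = {
    normalize(alias): entity_id
    for entity_id, aliases in KNOWN_ENTITIES.items()
    for alias in aliases
}


def disambiguate(reference):
    """Return the identifier of the referenced entity, or None if unknown."""
    return ALIAS_INDEX.get(normalize(reference))


links = [disambiguate(value) for value in ("Provincia di Trento", "Trento Prov.")]
```

Both spellings from the LocalitaTuristica example resolve to the same identifier, which is what makes later aggregation possible.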

4. Semantic Enrichment (2/2)
Natural Language Processing: extract concepts and entity references from free text.
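
The simplest possible stand-in for this step is a gazetteer lookup: scan the free text for names that are already known. This is not a full NLP pipeline, and the gazetteer entries, identifiers, and sample sentence below are illustrative assumptions.

```python
import re

# Hypothetical gazetteer: surface form -> entity identifier.
GAZETTEER = {
    "Trento": "e:trento",
    "Val di Fassa": "e:val-di-fassa",
}


def extract_references(text, gazetteer):
    """Return sorted (start, end, entity_id) spans for every known name in text."""
    found = []
    for name, entity_id in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.end(), entity_id))
    return sorted(found)


text = "Canazei lies in Val di Fassa, north-east of Trento."
references = extract_references(text, GAZETTEER)
```

Keeping character offsets alongside the identifiers lets the tool replace each mention with a link while leaving the rest of the description untouched.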

5. Reconciliation
Run identity management algorithms to identify each row as a new or an existing entity.
Result:
• No match
• Match
• Multiple matches

Action:
• Use ID
• New ID
• Ignore row
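
The slide lists results and actions without pairing them. The sketch below assumes one plausible pairing (no match gives a new ID, one match reuses the ID, several matches skip the row); both the pairing and the use of random UUIDs as identifiers are assumptions.

```python
import uuid


def reconcile(candidates):
    """Choose an action from the candidate entity ids a row matched.

    Returns (action, entity_id). The result-to-action pairing is assumed.
    """
    if not candidates:
        return "new_id", str(uuid.uuid4())   # no match: mint a new identifier
    if len(candidates) == 1:
        return "use_id", candidates[0]       # single match: reuse the identifier
    return "ignore_row", None                # multiple matches: skip the row


decisions = [reconcile(c) for c in ([], ["e:andalo"], ["e:trento-city", "e:trento-prov"])]
```

In a semi-automatic tool the last case would more likely be handed to a person to resolve than dropped silently.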

6. Exporting
At this point:
• We know what to export.
• All values for target attributes conform to the expected format.
• All text has been semantified (NLP).
• All textual references to entities have been converted to links.
• Each row has an identifier.
[Figure: labels v0, i, i+1]
7. Publishing
Put the semantified entities back into CKAN so that they become Open Data and can be found in the same catalog as the original data.
• Developers can find the data files of the cleaned, aggregated entities.
• They can also interact with the entities via the Entitypedia APIs.
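
Publishing back to the catalog could go through CKAN's Action API, whose `package_create` action is called with an HTTP POST. The sketch below only builds the request and does not send it. The dataset name, title, notes, and API key are placeholders, and a given CKAN instance may require further fields.

```python
import json
import urllib.request


def build_package_create_request(site_url, api_key, name, title, notes):
    """Build (but do not send) a request for CKAN's package_create action."""
    payload = {"name": name, "title": title, "notes": notes}
    return urllib.request.Request(
        url=site_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_package_create_request(
    "http://dati.trentino.it/",
    api_key="YOUR-API-KEY",                 # placeholder
    name="localita-turistiche-entities",    # hypothetical dataset name
    title="Tourist locations (entities)",
    notes="Cleaned, aggregated entities produced by the importing tool.",
)
# Sending is left out on purpose: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the publishing step easy to inspect before anything is written to the live catalog.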

8. Visualization
Search and Navigation
Semantic Layer: Services
Tool for aiding the “semantification” of the datasets in the catalog, based on:
• Schema matching services
• Identity Management services
• Entity Matching services
• Global Unique Identifier services
• Semantic search and indexing services
• Natural Language Processing
• Entity store

Our Goal
[Figure: labels TN, UK, ES, BE]
http://www.shabra.com/wp-content/uploads/2011/03/lets-work-together.jpg
Gracias! Grazie! Merci! Thanks! Kiitos! Dank u! Gràcies! Gratias! Danke! ευχαριστώ

We thank in particular CLEI 2013, Autonomous Province of Trento, TrentoRise association,
Universidad Nacional de Asuncion, and University of Trento


Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues
