Methodological Guidelines for Publishing Linked Data

Boris Villazón-Terrazas, Oscar Corcho
Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net
{bvillazon,ocorcho}@fi.upm.es
Phone: 34 91 3366605 Fax: 34 91 3524819
Slides available at: http://www.slideshare.net/boricles/

Acknowledgements: Asunción Gómez-Pérez, Luis M. Vilches,
Victor Saquicela, Alexander de León, and many others that we
may have omitted.
Work distributed under the license Creative Commons
Attribution-Noncommercial-Share Alike 3.0

Main References

Wood, David (Ed) Linking Government Data - 2011

Methodological Guidelines for Publishing Government Linked Data

Boris Villazón-Terrazas, Luis M. Vilches, Oscar Corcho, Asunción Gómez-Pérez

Best Practices for Publishing Linked Data

W3C Editor’s Draft – Government Linked Data Working Group

Michael Hausenblas, Bernadette Hyland, Boris Villazón-Terrazas

https://dvcs.w3.org/hg/gld/raw-file/bcb72f87b5cc/bp/index.html

Cookbook for Open Government Linked Data

W3C Editor’s Draft – Government Linked Data Working Group

Bernadette Hyland, Boris Villazón-Terrazas, Sarven Capadisli

http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

Guidelines for Publishing Linked Data

• The process of publishing Linked Data has an
iterative incremental life cycle model.

• The guidelines are based on our experience producing Linked
Data in several governmental contexts, and they have been
applied in real case scenarios.


Specification
• Identification and analysis of the data
sources

• URI design

• Definition of the license


Specification
Identification and analysis of the data sources

We have to distinguish

• Open and publish data that government agencies have
not yet opened up and published
• Task that may require contacting specific government data
owners to get access to their legacy data

• Reuse and leverage data already opened up and
published by government agencies
• Task to look for these data in public government catalogs
• Open Government Data
• datacatalogs.org
• Open Government Catalog

Specification
Identification and analysis of the data sources

After we have identified and selected the government data
sources

• Search and compile all the available data and
documentation about those resources

• Identify the schema of those resources, including
conceptual components and their relationships

• Identify the items in the domain, i.e., things whose
properties and relations are described in the data
sources


Specification
GeoLinkedData – Identification of the data sources

Agreement with the IGN
IGN
National Geographic Institute of Spain

Oracle & MySQL

Data sources available
in a public data catalog
INE
National Statistic Institute of Spain


Specification
GeoLinkedData – Analysis of the data sources

[Table: Industry Production Index by Province and Year]


Specification
URI Design

• Use meaningful URIs, instead of opaque URIs, when
possible

• Separate TBox (ontology model) from ABox
(instances) URIs.
• Base URI
http://data.gov.bo/
http://health.data.gov.bo/
• TBox URIs
http://data.gov.bo/ontology/{class|property}
• ABox URIs
http://data.gov.bo/resource/
http://data.gov.bo/resource/province/Tiraque

Specification
GeoLinkedData - URI design

• Base URI
http://linkeddata.es/
http://geo.linkeddata.es/

• TBox URIs
http://geo.linkeddata.es/ontology/{concept|property}
http://geo.linkeddata.es/ontology/Provincia

• ABox URIs
http://geo.linkeddata.es/resource/{resource type}/{resource name}
http://geo.linkeddata.es/resource/Provincia/Madrid
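
The URI patterns above can be sketched as a pair of helper functions. This is only a minimal illustration, assuming the GeoLinkedData base URI shown on the slide; the function names `tbox_uri` and `abox_uri` and the rule that spaces become underscores are hypothetical and are not part of the original guidelines.

```python
from urllib.parse import quote

BASE = "http://geo.linkeddata.es/"  # base URI taken from the slide


def tbox_uri(term: str) -> str:
    """URI for an ontology term (concept or property)."""
    return f"{BASE}ontology/{quote(term, safe='')}"


def abox_uri(resource_type: str, name: str) -> str:
    """URI for an instance: resource/{resource type}/{resource name}."""
    # Assumption: spaces become underscores, other unsafe characters
    # are percent-encoded so the URI stays readable and valid.
    slug = quote(name.strip().replace(" ", "_"), safe="")
    return f"{BASE}resource/{quote(resource_type, safe='')}/{slug}"


print(tbox_uri("Provincia"))
print(abox_uri("Provincia", "Madrid"))
```

Keeping the two functions separate mirrors the TBox/ABox split: ontology terms and instances never share a path prefix.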


Specification
Definition of the license

• Several possibilities

• The UK Open Government License

• Open Database License

• Public Domain Dedication and License

• Open Data Commons Attribution License

• The Creative Commons Licenses

It is also possible to reuse and apply an existing license
of the government data sources.

Specification
GeoLinkedData - Definition of the license

• Reusing the original license of the government data
sources. IGN and INE data sources have their own
license, similar to the Attribution-Share Alike 2.5 Generic
License

http://creativecommons.org/licenses/by-sa/2.5/


Modelling
Ontology

• An ontology is an engineering artifact, which provides:
• A set of terms
• A set of explicit assumptions regarding the intended meaning of the terms.
• Almost always including concepts and their classification
• Almost always including properties between concepts

• Shared understanding of a domain of interest

• Ontologies expressed in OWL or RDF(S), both based on RDF


Modelling
Reuse available vocabularies

Search for suitable vocabularies, e.g., in Linked Open Vocabularies

Are there suitable vocabularies?
• Yes: build the vocabulary by reusing available vocabularies
• No: …

Modelling
Reuse available non-ontological resources

Search for suitable non-ontological resources
• Highly reliable Web sites
• Domain-related sites
• Government catalogs

Are there suitable resources?
• Yes: build the vocabulary by transforming available resources
• No: build the vocabulary from scratch

Modelling
GeoLinkedData
Vocabularies reused
• WGS84 Geo Positioning: an RDF vocabulary
• scv:Dimension, scv:Item, scv:Dataset
• Hydrographical phenomena (rivers, lakes, etc.)
• Vocabulary for instants, intervals, durations, etc.
• Names and international code systems for territories and groups
• Ontology for OGC Geography Markup Language

Classes: 33
Object Properties: 44
Data Properties: 318
http://neon-toolkit.org/

Modelling
GeoLinkedData


Generation
• Transformation

• Data cleansing

• Linking


Generation
Transformation

• Take the data sources selected in the specification
activity and transform them to RDF according to the
vocabulary created in the modelling activity

• Some tools
• CSV and spreadsheets
• RDF extension of Google Refine, XLWrap, RDF123, NOR2O
• RDB
• D2R Server, ODEMapster, W3C RDB2RDF WG – R2RML
• XML
• GRDDL, ReDeFer
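
As a rough illustration of the transformation step for tabular sources, the sketch below turns CSV rows into N-Triples using only the Python standard library. It is not one of the tools listed above. The column names (`province`, `year`, `index`), the property URIs under `/ontology/`, the observation URI pattern and the sample row are all assumptions made for the example.

```python
import csv
import io
from urllib.parse import quote

RESOURCE = "http://geo.linkeddata.es/resource/"
ONTOLOGY = "http://geo.linkeddata.es/ontology/"  # property names below are hypothetical
XSD = "http://www.w3.org/2001/XMLSchema#"


def rows_to_ntriples(csv_text: str) -> list:
    """Turn each CSV row into three N-Triples lines describing one observation."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        province = quote(row["province"].strip().replace(" ", "_"), safe="")
        year = row["year"].strip()
        value = row["index"].strip()
        # One observation resource per (province, year) pair.
        obs = f"<{RESOURCE}IndustryProductionIndex/{province}_{year}>"
        triples.append(f"{obs} <{ONTOLOGY}province> <{RESOURCE}Provincia/{province}> .")
        triples.append(f'{obs} <{ONTOLOGY}year> "{year}"^^<{XSD}gYear> .')
        triples.append(f'{obs} <{ONTOLOGY}value> "{value}"^^<{XSD}decimal> .')
    return triples


# Made-up sample row, for illustration only.
sample = "province,year,index\nMadrid,2010,98.5\n"
for line in rows_to_ntriples(sample):
    print(line)
```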


Generation
GeoLinkedData - Transformation

• INE: NOR2O
• IGN: ODEMapster
• IGN (geospatial column): Geometry2RDF

Generation
[Figure: NOR2O applied to the Industry Production Index data, by Province and Year]

Generation
• R2O is an extensible, fully declarative language to describe
mappings between relational database schemas and ontologies.
• The ODEMapster processor generates RDF instances from
relational instances based on the mapping description
expressed in the R2O document

www.oeg-upm.net/index.php/en/downloads/9-r2o-odempaster
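
The idea of a declarative mapping that drives the generation of RDF instances from relational rows can be illustrated with a small sketch. It does not use R2O syntax or the ODEMapster processor: the `MAPPING` dictionary, the `provincia` table, its `nombre` column and the `apply_mapping` function are all hypothetical, and an in-memory SQLite database stands in for the real source.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical declarative mapping: table -> class, columns -> properties.
MAPPING = {
    "table": "provincia",
    "class": "http://geo.linkeddata.es/ontology/Provincia",
    "subject": ("http://geo.linkeddata.es/resource/Provincia/{}", "nombre"),
    "properties": {"nombre": "http://www.w3.org/2000/01/rdf-schema#label"},
}


def apply_mapping(conn, mapping):
    """Generate N-Triples lines from rows, driven only by the mapping."""
    conn.row_factory = sqlite3.Row
    table = mapping["table"]
    template, key_column = mapping["subject"]
    triples = []
    for row in conn.execute(f'SELECT * FROM "{table}"'):
        subject = "<" + template.format(row[key_column]) + ">"
        triples.append(f"{subject} <{RDF_TYPE}> <{mapping['class']}> .")
        for column, prop in mapping["properties"].items():
            # Assumption: values contain no quotes or backslashes needing escapes.
            triples.append(f'{subject} <{prop}> "{row[column]}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE provincia (id INTEGER PRIMARY KEY, nombre TEXT)")
conn.execute("INSERT INTO provincia (nombre) VALUES ('Madrid')")
for line in apply_mapping(conn, MAPPING):
    print(line)
```

Because the mapping is plain data, it can be changed or extended without touching the code that walks the rows, which is the property the declarative approach aims for.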

Generation
• Creation of the R2O Mappings


Generation

Excerpt of the R2O document


Generation

• Tool for generating RDF from geometrical information

• The geometry could be available in GML or WKT

• The RDF generated follows our Geometry Model

http://www.oeg-upm.net/index.php/en/downloads/151-geometry2rdf
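
To give a feel for what generating RDF from geometry involves, the sketch below parses a WKT POINT and emits triples with the WGS84 Geo Positioning vocabulary. It is not Geometry2RDF and does not follow the authors' Geometry Model. The function name, the subject URI and the coordinates are hypothetical, and the code assumes the WKT x value is the longitude and the y value is the latitude.

```python
import re

WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Matches e.g. "POINT (-3.7 40.4)"; other geometry types are out of scope here.
POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def wkt_point_to_ntriples(subject_uri: str, wkt: str) -> list:
    """Return geo:lat and geo:long triples for a WKT POINT."""
    match = POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT POINT: {wkt!r}")
    lon, lat = match.groups()  # assumption: x is longitude, y is latitude
    return [
        f'<{subject_uri}> <{WGS84}lat> "{lat}" .',
        f'<{subject_uri}> <{WGS84}long> "{lon}" .',
    ]


# Illustrative coordinates only.
for line in wkt_point_to_ntriples("http://example.org/resource/point1", "POINT (-3.7 40.4)"):
    print(line)
```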


Generation

Oracle SDO_UTIL package

SELECT TO_CHAR(SDO_UTIL.TO_GML311GEOMETRY(geometry))
AS Gml311Geometry
FROM "BCN200"."BCN200_0301L_RIO" c
WHERE c.Etiqueta='Arroyo'


Generation
Data Cleansing

• To find possible errors, identified by Hogan et al.
• HTTP-level issues, such as accessibility and dereferenceability,
e.g., HTTP URIs return 40x/50x errors
• Reasoning issues, such as namespace without vocabulary,
e.g., rss:item term invented
• Malformed/incompatible datatypes, e.g., “true” as xsd:int

• To fix the identified errors
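
One of the checks above, detecting malformed datatypes, can be sketched as a lexical test on typed literals. This is only a minimal illustration: the `LEXICAL` table covers a handful of XSD datatypes, ignores value ranges (for instance the 32-bit limit of xsd:int), and the function name `is_well_formed` is hypothetical.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Lexical patterns for a few XSD datatypes (deliberately incomplete).
LEXICAL = {
    XSD + "int": re.compile(r"[+-]?\d+"),
    XSD + "integer": re.compile(r"[+-]?\d+"),
    XSD + "decimal": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)"),
    XSD + "boolean": re.compile(r"true|false|0|1"),
}


def is_well_formed(lexical: str, datatype: str) -> bool:
    """Check a literal's lexical form against its declared datatype."""
    pattern = LEXICAL.get(datatype)
    if pattern is None:
        return True  # datatype not covered by this sketch: not checked
    return pattern.fullmatch(lexical) is not None


print(is_well_formed("true", XSD + "int"))      # the error case from the slide
print(is_well_formed("true", XSD + "boolean"))
```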


Generation
GeoLinkedData – Data Cleansing

• Errors
• Some resources with the same name were mixed up. For
example, Granada municipality belongs to Granada
province, and La Granada municipality belongs to Barcelona
province.

• Autonomous communities that have only one province, e.g.,
Murcia Region, were missing some municipalities, but their
corresponding provinces, e.g., Murcia Province, have the
correct number of municipalities.

• Some hydrographical resources were missing some parts of their
geometrical information.


Generation
Linking

Identify suitable data sets as linking targets
• http://ckan.net

Discover relationships between data items
• LIMES: http://aksw.org/Projects/limes
• Silk Framework: http://www4.wiwiss.fu-berlin.de/bizer/silk/

Validate the relationships discovered
• sameAs Validator: http://oegdev.dia.fi.upm.es:8080/sameAs/
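
The discovery step can be approximated, very roughly, by comparing labels across two datasets and proposing owl:sameAs candidates above a similarity threshold. The sketch below uses the standard-library `difflib` module and is not how LIMES or Silk work internally. The function name, the 0.9 threshold and the example.org URIs are assumptions, and every candidate would still need the validation step described above.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def candidate_links(source: dict, target: dict, threshold: float = 0.9) -> list:
    """Propose sameAs candidates between two {label: URI} dictionaries."""
    links = []
    for s_label, s_uri in source.items():
        for t_label, t_uri in target.items():
            # Case-insensitive string similarity in the range [0, 1].
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri, round(score, 2)))
    return links


# Hypothetical datasets, for illustration only.
local = {"Madrid": "http://example.org/a/Madrid", "Granada": "http://example.org/a/Granada"}
remote = {"Madrid": "http://example.org/b/Madrid", "La Granada": "http://example.org/b/La_Granada"}
for link in candidate_links(local, remote):
    print(link)
```

A strict threshold keeps similar but distinct names, such as Granada and La Granada, from being linked on label similarity alone.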


Generation
GeoLinkedData - Linking

GeoLinkedData resources linked to DBpedia and GeoNames, e.g., for Madrid:
• DBpedia: http://dbpedia.org/resource/Madrid
• GeoLinkedData: http://geo.linkeddata.es/.../Madrid
• GeoNames: http://sws.geonames.org/6355233/

Generation
GeoLinkedData - Linking

http://oegdev.dia.fi.upm.es:8080/sameAs/

Publication
• Dataset publication

• Metadata publication

• Dataset discovery


Publication
Dataset Publication

• Tools for storing RDF
• Virtuoso Universal Server, Jena, Sesame, 4Store, YARS,
OWLIM

• SPARQL endpoint and Linked Data frontend
• Pubby, Talis Platform, Fuseki


Publication
Metadata Publication

• VoID allows expressing metadata about RDF
datasets

• Open Provenance Model
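
A VoID description is itself a small RDF document. The sketch below builds one in Turtle with plain string formatting, using terms that exist in the VoID and Dublin Core vocabularies (`void:Dataset`, `void:sparqlEndpoint`, `dcterms:title`, `dcterms:license`). The function name and the dataset and endpoint URIs are hypothetical, and the title is assumed to contain no characters that need escaping.

```python
def void_description(dataset_uri: str, title: str,
                     sparql_endpoint: str, license_uri: str) -> str:
    """Return a minimal VoID description of a dataset in Turtle."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    dcterms:license <{license_uri}> ;",
        f"    void:sparqlEndpoint <{sparql_endpoint}> .",
    ])


print(void_description(
    "http://example.org/dataset",      # hypothetical dataset URI
    "Example dataset",
    "http://example.org/sparql",       # hypothetical endpoint
    "http://creativecommons.org/licenses/by-sa/2.5/",
))
```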


Publication
Dataset discovery

• Register the dataset in the CKAN Registry

• Generate sitemap files for your dataset, by using
sitemap4rdf

• Submit the sitemap location to Google and Sindice

http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
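
For the sitemap step, the sketch below writes a plain sitemaps.org `urlset` listing resource URIs with the standard-library `xml.etree.ElementTree` module. It does not reproduce the output of sitemap4rdf, which may add further elements; the function name `build_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(resource_uris) -> str:
    """Return a sitemap XML string with one <url><loc> entry per URI."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for uri in resource_uris:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(["http://geo.linkeddata.es/resource/Provincia/Madrid"]))
```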


Publication
GeoLinkedData – Dataset publication

• Access: HTML, Linked Data, SPARQL
• Pubby 0.3, including provenance support: http://www4.wiwiss.fu-berlin.de/pubby/
• Virtuoso 6.1.0

Publication
GeoLinkedData – Dataset discovery


Exploitation

Streaming resources

Exploitation
GeoLinkedData

http://oegdev.dia.fi.upm.es/projects/map4rdf/

map4rdf:
• Google Maps viewer of RDF resources
• Resources with spatial information
• Extensible with Google plugins
• Used in other applications, like Aemet and Goodrelations

[Figure: map4rdf, SPARQL, Triplestore]

DEMO
http://geo.linkeddata.es/browser


Provinces – Industry Production Index

