20110728 datalift-rpi-troy

The Datalift Project
Ontologies, Datasets, Tools and Methodologies
to Publish and Interlink ★★★★★ Datasets

François Scharffe
University of Montpellier,
LIRMM, INRIA
francois.scharffe@lirmm.fr
@lechatpito

With the help of the Datalift team
And the support of the French National Research Agency

RPI 28/07/2011

State of government open data

(September 2010…)

You’re here

State of government open data

(June 2011)

[Figure: growth of the Linking Open Data cloud: May 2007, April 2008, September 2008, March 2009, September 2010]
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

principles
§ Use the RDF format
§ Use URIs to name things
§ Use HTTP URIs (URLs) so that one can look up those names
§ Give information (HTML, RDF) when those URIs are dereferenced
§ Include in this information other URIs pointing to other data to enable discovery
Tim Berners-Lee, http://www.w3.org/DesignIssues/LinkedData.html

goal of datalift
from raw published data
to interconnected semantic data

phase 1: opening the data
develop a platform easing publication

Welcome aboard the data lift
Published and interlinked data on the Web
Applications

Interlinking

Publication infrastructure

Data conversion

Vocabulary selection

Raw data

Example publication process
Environmental, weather, geological datasets

SPARQL

Content negotiation

URI de-referencing

Oil industry
Geography
equipment

1st floor - Selection
SemWebPro 18/01/2011

Vocabularies of my friends...

Ø What is a (good) vocabulary for linked data?
§ Usability criteria: simplicity, visibility, sustainability, integration, coherence …

Ø Different types of vocabularies
§ Metadata, reference, domain, generalist …
§ The pillars of Linked Data: Dublin Core, FOAF, SKOS
Ø Good and less good practices
§ E.g. BBC Programmes vs legislation.gov.uk
§ Vocabulary of a Friend: networked vocabularies
Ø Linguistic problems
§ 99% of existing vocabularies are in English
§ Terminological approach: which vocabularies for « Event », « Organization »?

Did you say « vocabulary »

… And why not « ontology »?
§ « schema » or « metadata schema »?
§ Or « model » (data ? World ?)
Ø All these terms are used and justifiable
They are all « vocabularies »
§ They define types of objects (or classes)
and the properties (or attributes) attached to these objects.
§ Types and attributes are logically defined
and named using natural language
§ A (semantic) vocabulary
is an explicit formalization
of concepts existing in natural language


Vocabularies for linked data

Ø Are meant to describe resources in RDF
Ø Are based on one of the standard W3C languages
§ RDF Schema (RDFS)
• For vocabularies without too much logical complexity
§ OWL
• For more complex ontological constructs
§ These two languages are (almost) compatible
Ø They can be composed « ad libitum »
§ One can reuse a few elements of a vocabulary
§ The original semantics have to be followed

What makes a good vocabulary ?

Ø A good vocabulary is a used vocabulary
§ Data published on CKAN give an idea of vocabulary usage
§ Example:
list of datasets using FOAF http://xmlns.com/foaf/0.1/
Ø Other usability criteria
§ Simplicity and readability in natural language
§ Documentation of elements (definitions in natural language)
§ Visibility and sustainability of the publication
§ Flexibility and extensibility
§ Semantic integration (with other vocabularies)
§ Social integration (with the user community)

A vocabulary is also a community

Ø Bad (but common) practice
§ Building a lonely vocabulary
• For example as a research project
• Without basing it on any existing vocabulary
§ Publishing it (or not) and then forgetting about it
§ Not caring about its users
Ø A good vocabulary has an organic life
§ Users and use cases
§ Revisions and extensions
§ Like a « natural » vocabulary

Types of vocabularies

Ø Metadata vocabularies
§ Used to annotate other vocabularies
• Dublin Core, VANN, cc REL, Status, VoID
Ø Reference vocabularies
§ Provide « common » classes and properties
• FOAF, Event, Time, Org Ontology
Ø Domain vocabularies
§ Specific to a domain of knowledge
• Geonames, Music Ontology, WildLife Ontology
Ø « general » vocabularies
§ Describe « everything » at an arbitrary detail level
• DBpedia Ontology, Cyc Ontology, SUMO

Vocabulary of a Friend

Ø http://www.mondeca.com/foaf/voaf
Ø A simple vocabulary...
Ø To represent interconnections between vocabularies
Ø A unique entry point to the vocabularies and datasets of the Linked Data Cloud
Ø Ongoing work in Datalift

2nd floor - Conversion

Reference datasets, URI design

● Providing reference datasets for the French
ecosystem: geographical, topological, statistical,
political
● Providing URI design guidelines
● Opaque or transparent URIs ?
● Usage of accents in URIs
● Distinction between
Resources: http://dbpedia.org/resource/Paris
Documents: http://dbpedia.org/page/Paris
Data: http://dbpedia.org/data/Paris
… All served with content negotiation
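
The resource / document / data distinction above can be sketched in a few lines of Python. This is only an illustration, assuming the DBpedia-style path layout shown on the slide (/resource/, /page/, /data/); the function names, the set of media types and the default-to-HTML rule are assumptions made for this sketch, not a description of how DBpedia or Datalift actually implement content negotiation.

```python
"""Sketch: choose the document a /resource/ URI should redirect to (HTTP 303)."""

# Media types treated as "the client wants data" or "the client wants a page".
# These sets are assumptions chosen for the illustration.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return the media types of an Accept header, most preferred first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default q-value when none is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def redirect_target(resource_uri, accept_header):
    """Map a /resource/ URI to the /data/ or /page/ document URI."""
    base, separator, name = resource_uri.partition("/resource/")
    if not separator:
        raise ValueError("not a resource URI: " + resource_uri)
    for media_type in parse_accept(accept_header):
        if media_type in RDF_TYPES:
            return base + "/data/" + name
        if media_type in HTML_TYPES or media_type == "*/*":
            return base + "/page/" + name
    return base + "/page/" + name  # assumed default: serve the HTML page


if __name__ == "__main__":
    print(redirect_target("http://dbpedia.org/resource/Paris", "text/html"))
    print(redirect_target("http://dbpedia.org/resource/Paris", "application/rdf+xml"))
```

A browser sending `Accept: text/html` is pointed at the page, while an RDF client is pointed at the data document; the resource URI itself never returns a document directly.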

Many tools exist !

csv2rdf4lod

Direct Mapping from relational database to RDF

Define a standard transformation from a relational
database to RDF
The relational schema is used:
• Cells of a tuple produce triples with a common subject
• Each cell produces an object
• Different tables of the same database are thus linked together

Standard automatic translation of any relational schema to RDF,
based on the database dump

Then we can use SPARQL CONSTRUCT queries to adapt vocabularies and
URIs.
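
The row-to-triples idea can be sketched as follows. This is a simplified illustration in Python, not the W3C Direct Mapping algorithm itself: the IRI shapes (`<Table/Key=value#_>` for rows, `<Table#Column>` for columns) are copied from the book example in this deck, and the function name `direct_map` and its `foreign_keys` parameter are hypothetical.

```python
"""Sketch: turn one relational row into triples, direct-mapping style."""


def _literal(value):
    """Render a cell value as a quoted string literal (minimal escaping)."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % text


def direct_map(table, key_column, row, foreign_keys=None):
    """Return (subject, predicate, object) triples for one row.

    foreign_keys maps a column name to (target_table, target_key_column);
    such cells become IRIs pointing at the referenced row instead of literals.
    """
    foreign_keys = foreign_keys or {}
    # Every cell of the tuple shares this subject.
    subject = "<%s/%s=%s#_>" % (table, key_column, row[key_column])
    triples = [(subject, "a", "<%s>" % table)]
    for column, value in row.items():
        if value is None:
            continue  # a NULL cell produces no triple
        predicate = "<%s#%s>" % (table, column)
        if column in foreign_keys:
            target_table, target_key = foreign_keys[column]
            obj = "<%s/%s=%s#_>" % (target_table, target_key, value)
        else:
            obj = _literal(value)
        triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    author = {"ID": "id_xyz", "Name": "Ghosh, Amitav"}
    for triple in direct_map("Author", "ID", author):
        print(" ".join(triple), ".")
```

Foreign-key cells are the only ones that yield IRIs here, which is why tables of the same database end up linked while everything else stays a literal.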

Example

Credits: Ivan Herman, http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/

Example

@base <http://book.example/> .
<Book/ID=0006511409X#_> a <Book> ;
    <Book#ISBN> "0006511409X" ;
    <Book#Title> "The Glass Palace" ;
    <Book#Year> "2000" ;
    <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
    <Author#ID> "id_xyz" ;
    <Author#Name> "Ghosh, Amitav" ;
    <Author#Homepage> "http://www.amitavghosh.com" .

Simple result, but not satisfying:
● we want to use different vocabulary terms (like a:name)
● the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another
● the resulting graph should use a blank node for the author, which is not the case in the generated graph
Credits: Ivan Herman, http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/

Example
Solution: use SPARQL 1.1 CONSTRUCT queries
# The a: prefix must be declared with PREFIX; the slide leaves it out.
BASE <http://book.example/>
CONSTRUCT {
  ?id a:title ?title ;
      a:year ?year ;
      a:author _:x .
  _:x a:name ?name ;
      a:homepage ?hp .
}
WHERE {
  # The subquery mints the book IRI from the ISBN
  # and turns the homepage literal into an IRI.
  SELECT (IRI(CONCAT("http://...", ?isbn)) AS ?id)
         ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
          <Book#ISBN> ?isbn ;
          <Book#Title> ?title ;
          <Book#Year> ?year ;
          <Book#Author> ?author .
    ?author a <Author> ;
            <Author#Name> ?name ;
            <Author#Homepage> ?homepage .
  }
}

3rd floor - Publication

Datalift Platform

V1 to be released in September, with expected features:
- Modular architecture
- Raw conversion module: relational DB (Direct Mapping approach), CSV,
XML (based on a user-specified XSLT transformation)
- Selection module: LOV repository, automatic candidate vocabulary
proposal using ontology matching from the raw data schema, vocabulary
navigation tool, vocabulary usage metrics, sample data for each vocabulary
- Conversion (according to the schema): RDF2RDF conversion module
based on SPARQL CONSTRUCT (manual editing), vocabulary mapping
facility (textual)
- Interlinking and alignment: a Silk interface, integration of the
Alignment API
- Publication: Sesame API, management of information vs non-information
resources

4th floor - Interlinking

Web of data and links
- Without links there is no Web, only data silos
- Many types of links: the edges of the Web of data graph are labeled
- Some links are built during the selection phase: reference datasets
- We study here a particular type of link: equivalence links.


owl:sameAs
- States a logical identity between two resources
- The quality of the available links is not always optimal
- Other types of links: owl:differentFrom, rdfs:seeAlso


How to link data ?


Example Silk link specification
<Silk>
  <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
  <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
  <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

  <DataSource id="dbpedia">
    <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
    <Graph>http://dbpedia.org</Graph>
  </DataSource>

  <DataSource id="geonames">
    <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
    <Graph>http://sws.geonames.org/</Graph>
  </DataSource>

  <Interlink id="cities">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="dbpedia" var="a">
      <RestrictTo>?a rdf:type dbpedia:City</RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="geonames" var="b">
      <RestrictTo>?b rdf:type gn:P</RestrictTo>
    </TargetDataset>
    <LinkCondition>
      <AVG>
        <Compare metric="jaroSimilarity">
          <Param name="str1" path="?a/rdfs:label" />
          <Param name="str2" path="?b/gn:name" />
        </Compare>
        <Compare metric="numSimilarity">
          <Param name="num1" path="?a/dbpedia:populationTotal" />
          <Param name="num2" path="?b/gn:population" />
        </Compare>
      </AVG>
    </LinkCondition>
    <Thresholds accept="0.9" verify="0.7" />
    <Output acceptedLinks="accepted_links.n3"
            verifyLinks="verify_links.n3"
            mode="truncate" />
  </Interlink>
</Silk>
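
To make the link condition concrete, here is a small Python sketch of what the specification computes for one candidate pair: the average of a Jaro similarity on the labels and a numeric similarity on the populations, compared against the accept (0.9) and verify (0.7) thresholds. It is an approximation for illustration only. The Jaro function follows the standard textbook definition; `num_similarity` is an assumed relative-difference measure and may differ from Silk's own numSimilarity metric; treating the thresholds as inclusive is also an assumption.

```python
"""Sketch: score one candidate link the way the Silk example combines metrics."""


def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(len(s2), i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def num_similarity(a, b):
    """Assumed numeric similarity: 1 minus the relative difference."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return 1.0 - abs(a - b) / largest


def link_score(label_a, label_b, population_a, population_b):
    """Average of the two comparisons, mirroring the <AVG> element."""
    return (jaro(label_a, label_b) + num_similarity(population_a, population_b)) / 2


def classify(score, accept=0.9, verify=0.7):
    """Sort a score into accepted links, links to verify, or rejected pairs."""
    if score >= accept:
        return "accept"
    if score >= verify:
        return "verify"
    return "reject"
```

For instance, `classify(link_score(label_a, label_b, pop_a, pop_b))` returns "accept" only when both the names and the population figures agree closely, which is the behaviour the accept threshold is meant to capture.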

Where to find links ?


Towards automatic interlinking
We have seen that some of the Silk specification fields could be avoided:
- Using alignments between ontologies
- Detecting discriminating properties
- Indicating comparison methods by attaching metadata to ontologies
-> … ongoing work in Datalift
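
One simple way to think about "discriminating properties" is the ratio of distinct values to described instances: a property whose values are almost all different is a good candidate for comparison in a link specification. The Python sketch below illustrates that heuristic only; it is an assumption made for this example, not the method used in Datalift, and the function name and sample records are hypothetical.

```python
"""Sketch: rank properties by how well they tell instances apart."""


def discriminability(records, prop):
    """Distinct values divided by instances that have the property (0 to 1)."""
    values = [record[prop] for record in records if record.get(prop) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def rank_properties(records):
    """Return property names, most discriminating first."""
    names = sorted({name for record in records for name in record})
    return sorted(names, key=lambda name: -discriminability(records, name))


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    cities = [
        {"label": "Paris", "country": "FR"},
        {"label": "Lyon", "country": "FR"},
        {"label": "Troy", "country": "US"},
    ]
    print(rank_properties(cities))
```

In this toy sample the label separates every city while the country does not, so the label would be the better property to compare.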


5th floor - Applications

phase 2: publishing datasets
validate the platform with real data

Research objectives
§ Methods and metrics for selecting schemas
§ Tradeoff between specific and generic vocabularies
§ Data conversion and URI design patterns
§ Automatic data interlinking
§ Provenance and rights management
§ Integration, architecture and scalability

http://labs.mondeca.com/dataset/lov/index.html

http://labs.mondeca.com/vocab/voaf/

The wider French landscape

● Regards Citoyens
● Direction de l’information légale et administrative
● Fédération des parcs naturels régionaux de France
● Eurostat
● Cities of Montpellier, Bordeaux, Rennes, …
● Data Publica
● Etalab

LIRMM D2R Server
http://data.lirmm.fr/nosdeputes/

DATALIFT

next floor: « the web of data »

Credits

This presentation was made possible by the work of the Datalift team.
It can be freely distributed under the Creative Commons BY-NC-SA 3.0 licence.

