The document describes a reference architecture for a linguistic linked data ecosystem. It proposes standards and best practices for publishing, linking, and accessing multilingual data as linked open data. The key components of the architecture include publishing and hosting linguistic linked data, metadata standards, vocabularies for describing different resource types, linking of open and closed data, discovery layers, and semantic web service composition. The architecture supports decentralization, interoperability, and the development of language technologies and analytics services over linked data.
Lider Reference Model ld4lt session March, 3rd, 2015
1. 20/11/2014
The LIDER Reference Architecture
Philipp Cimiano (representing the LIDER Project)
LD4LT Teleconference, March 5th, 2015
2. 16/01/2015
Philipp Cimiano
Goal
• Goal: Develop a Reference model that supports an ecosystem of linguistic
linked data and the development of content analytics services on top of
this ecosystem.
• Key features:
– Linked Data: connected ecosystem of data and services,
interoperability, supporting access by both humans and machines
– Semantic Technologies: open web standards (OWL, RDF) for data
description, SPARQL and HTTP as Web APIs
– De-centralization: Web architecture, no central point of failure, no
vendor lock-in, open standards
5. Reference Architecture
[Diagram labels: Metadata, Licensing, Provenance; Multilingual Data]
• Metadata: providing basic information about the dataset (author, language, structure), etc.
• Licensing: specifying the terms and conditions of use
• Provenance: describing the origin and processing history of data
6. Reference Architecture
[Diagram labels: LLD Publishing; Metadata, Licensing, Provenance; Vocabularies, Hosting; Multilingual Data]
Linguistic Linked Data (LLD): best practices, standards and tools for publication and hosting of LLD, and vocabularies for description and transformation of different types of resources (lexica, corpora, terminologies, lexico-semantic resources) into RDF/LLD
7. Reference Architecture
[Diagram labels: LLD Publishing; LLD-aware Services; Metadata, Licensing, Provenance; Vocabularies, Hosting, Scalability, Streaming, Interoperability; Multilingual Data]
• Scalability: caching and non-centralized processing
• Streaming: process data in a stream fashion, thus reducing the overhead of creating and closing connections
• Interoperability: common vocabulary to describe inputs and outputs of services
8. Reference Architecture
[Diagram labels: LLD Linking; LLD Publishing; Service Composition; LLD-aware Services; Metadata, Licensing, Provenance; Vocabularies, Hosting, Scalability, Streaming, Interoperability; Multilingual Data]
• Best practices to support linking of resources and the combination of data with different terms and conditions of use, in particular open and closed data
• Support composition of services into complex workflows
9. Reference Architecture
[Diagram labels: Discovery; LLD Linking; LLD Publishing; Service Composition; LLD-aware Services; Metadata, Licensing, Provenance; Vocabularies, Hosting, Scalability, Streaming, Interoperability; Multilingual Data]
Discovery layer implemented by a number of independent indexing and aggregation services that support querying (SPARQL) and browsing data (Linked Data)
10. Reference Architecture
[Diagram labels: Benchmarking & Validation; Discovery; LLD Linking; LLD Publishing; Service Composition; LLD-aware Services; Metadata, Licensing, Provenance; Vocabularies, Hosting, Scalability, Streaming, Interoperability; Multilingual Data]
Benchmarking & Validation: tools supporting comparison of datasets and services
13. Metadata
• Metadata: DataID for the description of datasets (see Reference Card for DataID), as well as Dublin Core, DCAT and a METASHARE ontology currently in development (see other threads)
• Licensing: The recommendation of the LIDER project is to use ODRL for the description of terms and conditions
• Provenance: The recommendation of the LIDER project is to use the PROV-O vocabulary to describe the provenance of linguistic data resources
• Data Publishing: The LIDER project recommends using DataHub for publishing metadata
• Data Linking: The LIDER project has implemented services that link data across sources as a proof-of-concept implementation.
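The recommendations above can be illustrated with a small dataset description. What follows is a minimal sketch and not the LIDER project's own format: it builds a JSON-LD document using only the Python standard library and combines DCAT and Dublin Core metadata, an ODRL policy and a PROV-O provenance link. All example.org IRIs, the function name describe_dataset, the "de" language literal and the choice to attach the policy through dct:license are assumptions made for illustration. DataID-specific terms are left out.

```python
import json

# Namespace prefixes for the vocabularies named on the slide.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "prov": "http://www.w3.org/ns/prov#",
}


def describe_dataset(dataset_iri, title, language, source_iri, policy_iri):
    """Build a JSON-LD description of one dataset (hypothetical helper)."""
    return {
        "@context": CONTEXT,
        "@id": dataset_iri,
        "@type": "dcat:Dataset",
        # Metadata: basic Dublin Core properties.
        "dct:title": title,
        "dct:language": language,
        # Licensing: an ODRL policy granting the "use" action on the dataset.
        # Attaching it via dct:license is an assumption of this sketch.
        "dct:license": {
            "@id": policy_iri,
            "@type": "odrl:Policy",
            "odrl:permission": {
                "odrl:target": {"@id": dataset_iri},
                "odrl:action": {"@id": "odrl:use"},
            },
        },
        # Provenance: PROV-O link to the resource this dataset was derived from.
        "prov:wasDerivedFrom": {"@id": source_iri},
    }


# All IRIs below are placeholders, not real resources.
description = describe_dataset(
    dataset_iri="http://example.org/dataset/lexicon-de",
    title="Example German lexicon",
    language="de",
    source_iri="http://example.org/source/lexicon-de.xml",
    policy_iri="http://example.org/policy/1",
)
print(json.dumps(description, indent=2))
```

A description of this shape could then be registered with a catalogue such as DataHub; the registration step itself is not shown here.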
14. Discovery Layer
• Reference implementation is LingHub: http://linghub.lider-project.eu/
• Indexes metadata from METASHARE, CLARIN, LRE Map, DataHub
• Integration and harmonization of data by mapping to DCAT, Dublin Core
• Exposes DataID metadata descriptions
• Provides SPARQL endpoint
• Browsable by humans and machines (Linked Data)
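To make the querying side of the discovery layer concrete, here is a hedged sketch of how a client might look for datasets in a given language through a SPARQL endpoint. The endpoint path (/sparql), the assumption that the language is stored as a plain literal under dct:language, and the helper names build_query, request_url and parse_bindings are all illustrative; the slides only state that LingHub provides a SPARQL endpoint and maps metadata to DCAT and Dublin Core. The sample response is hand-written to show the SPARQL JSON results layout and is not real LingHub output. No network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint path; the slides give only the LingHub base URL.
ENDPOINT = "http://linghub.lider-project.eu/sparql"


def build_query(language, limit=10):
    """Return a SPARQL SELECT for datasets with the given language literal."""
    # Guard against query injection through the language argument.
    if not language.replace("-", "").isalnum():
        raise ValueError("unexpected language tag: %r" % language)
    return (
        "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n"
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?dataset ?title WHERE {\n"
        "  ?dataset a dcat:Dataset ;\n"
        "           dct:title ?title ;\n"
        f'           dct:language "{language}" .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(endpoint, query):
    """SPARQL Protocol: a query may be sent via GET in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    return [
        {v: row[v]["value"] for v in variables if v in row}
        for row in doc["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results format (not real data).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["dataset", "title"]},
    "results": {"bindings": [{
        "dataset": {"type": "uri", "value": "http://example.org/dataset/1"},
        "title": {"type": "literal", "value": "Example sentiment corpus"},
    }]},
})

url = request_url(ENDPOINT, build_query("de"))
rows = parse_bindings(SAMPLE_RESPONSE)
print(url)
print(rows)
```

A real client would fetch the URL with an Accept header of application/sparql-results+json and pass the response body to parse_bindings.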
15. Services
Reference implementation of NLP services that:
• Use web sockets to process data in a streaming fashion
• Use NIF-grounded RDF/JSON-LD as input and output
• Can be composed together by merging output (RDF merge)
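Composition by RDF merge can be sketched as follows. This is an illustration under stated assumptions, not the LIDER reference implementation: the two services (tokenizer_service and linker_service) are hypothetical stand-ins, triples are modelled as plain Python tuples rather than parsed JSON-LD, and literals and IRIs are both kept as strings for brevity. Because both services name a text span with the same NIF-style offset IRI, their outputs combine by simple set union. Graphs containing blank nodes would need those nodes renamed apart before the union.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
DOC = "http://example.org/doc1#"  # hypothetical document IRI

TEXT = "Berlin is in Germany"


def span_iri(begin, end):
    """Offset-based IRI for a text span, shared by both services."""
    return f"{DOC}char={begin},{end}"


def tokenizer_service(text):
    """Hypothetical service 1: one NIF string annotation per whitespace token."""
    triples = set()
    offset = 0
    for token in text.split():
        begin = text.index(token, offset)
        end = begin + len(token)
        node = span_iri(begin, end)
        triples.add((node, NIF + "anchorOf", token))
        triples.add((node, NIF + "beginIndex", str(begin)))
        triples.add((node, NIF + "endIndex", str(end)))
        offset = end
    return triples


def linker_service(text):
    """Hypothetical service 2: links known names via itsrdf:taIdentRef."""
    known = {
        "Berlin": "http://dbpedia.org/resource/Berlin",
        "Germany": "http://dbpedia.org/resource/Germany",
    }
    triples = set()
    for name, iri in known.items():
        begin = text.find(name)
        if begin >= 0:
            node = span_iri(begin, begin + len(name))
            triples.add((node, ITSRDF + "taIdentRef", iri))
    return triples


def rdf_merge(*graphs):
    """Union of triple sets; valid here because no blank nodes are used."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


merged = rdf_merge(tokenizer_service(TEXT), linker_service(TEXT))
berlin = span_iri(0, 6)
berlin_properties = sorted(p for s, p, o in merged if s == berlin)
print(len(merged))
print(berlin_properties)
```

After the merge, the span for "Berlin" carries both the tokenizer's offsets and the linker's entity reference, which is what allows independently developed services to be chained without a shared API beyond the common vocabulary.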
16. Standardization
Involvement in Community Groups:
• Ontolex (Ontology-Lexicon Models, CG)
• BPMLOD (Best Practices for Multilingual Linked Open Data, CG)
• LD4LT (Linked Data and Language Technologies, CG)
17. Use Cases
• An IT company is active in the brand reputation market and offers a product based on sentiment analysis for three languages (English, Spanish, Portuguese); it needs to find sentiment-annotated data for German.
• A terminology management company wants to exploit LLD to support the process of creating a corporate terminology. They want to provide seed terms and exploit LLOD to get further candidates for terms.
• A machine translation company wants to exploit LLOD for training a machine translation system and to ease the adaptation to a new domain; it searches for parallel data on a certain language pair.
• An IT company develops information extraction techniques for competitor analysis. It needs to develop an application that works on Twitter data. The company needs to find POS-annotated Twitter data to adapt their POS tagger to the Twitter domain.
• A researcher wants to publish a dataset on the Web as Linguistic Linked Data and needs support in this. Part of the dataset will be offered for free and part will be offered in exchange for money.
18. Discussion
Thanks for your attention!
Any comments, questions, …?