M. Durco, M. Windhouwer. Semantic Mapping in CLARIN Component Metadata. In E. Garoufallou and J. Greenberg (eds.), Metadata and Semantics Research (MTSR 2013; mtsr2013.teithe.gr), CCIS Vol. 390, Springer, Thessaloniki, Greece, November 20-22, 2013.
Semantic Mapping in
CLARIN Component Metadata
Matej Durco
Institute for Corpus Linguistics and Text Technology
matej.durco@assoc.oeaw.ac.at
Menzo Windhouwer
The Language Archive - DANS
menzo.windhouwer@dans.knaw.nl
MTSR 2013
Thessaloniki, Greece
Outline
CLARIN: a European infrastructure for language resources
Component Metadata Infrastructure (CMDI)
Semantic Mapping in CMDI
Semantic mapping in the CLARIN joint metadata domain
Conclusions and future work
CLARIN
CLARIN = Common Language Resources and Technology
Infrastructure = a European ESFRI infrastructure project
Aims at providing easy and sustainable access for scholars
in the humanities and social sciences to digital language
data (in written, spoken, video or multimodal form) and
advanced tools to discover, explore, exploit, annotate,
analyze or combine them, independent of where they are
located.
Building a networked federation of European data
repositories, service centers and centers of expertise.
One pillar of this infrastructure is a joint metadata domain
http://www.clarin.eu/
Component Metadata Infrastructure
Rationale for CMDI
Limitations of existing metadata schemas (OLAC/DCMI, IMDI,
TEI header)
Inflexible: too many (IMDI) or too few (OLAC) metadata elements
Limited interoperability (both semantic and syntactic)
Problematic (unfamiliar) terminology for some sub-communities.
Limited support for LT tool & services descriptions
CMDI addresses this by:
Explicitly defined schema & semantics
User/project/community defined components
http://www.clarin.eu/cmdi/
CMDI - example
Let's describe a speech recording
Technical Metadata: Sample frequency, Format, Size, …
CMDI - example
Let's describe a speech recording
Language: Name, Id, …
Technical Metadata
CMDI - example
Let's describe a speech recording
Actor: Name, Age, Sex, Language, …
Language
Technical Metadata
CMDI - example
Let's describe a speech recording
Metadata Profile: Project, Location, Actor, Language, Technical Metadata
Metadata schema (W3C XML Schema)
Metadata description (XML document)
CMDI - workflow
[Workflow diagram]
Roles: metadata modeler, metadata creator, metadata curator, metadata user
Components: ISOcat, component registry & editor, Relation Registry, metadata editor, metadata catalogue, search & semantic mapping, DATA
Repositories: Local metadata repository (OAI-PMH Data provider), Joint metadata repository (OAI-PMH Service provider)
Semantic Mapping in CMDI
A CMD component, element or value should be linked to a ‘concept’,
i.e., a URI that points to a semantic description
‘Concepts’ can be shared, indicating shared semantics
Current components use mainly:
Dublin Core elements or terms
ISOcat Data Categories
ISOcat (www.isocat.org) is an ISO 12620:2009 compliant Data
Category Registry
allows elaborate specifications, e.g., a definition, (alternative)
names, examples, explanations, value domains (all in various
languages)
can be freely used by anyone, including the creation of new data
categories
the Athens Core group has created many metadata data categories
inspired by OLAC, TEI Header and IMDI
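The linking of elements to concepts can be sketched in a few lines of Python. This is a minimal illustration, not the CMDI toolchain: the XML fragment is a simplified, hypothetical component specification, the attribute name ConceptLink is assumed to carry the concept URI, and the element "Note" is invented to show an element without a concept link. The two ISOcat URIs are the language Data Categories cited later in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified component specification (illustration only).
# "Note" is an invented element that has no concept link.
SPEC = """
<CMD_Component name="Language">
  <CMD_Element name="Name" ConceptLink="http://www.isocat.org/datcat/DC-2484"/>
  <CMD_Element name="Id" ConceptLink="http://www.isocat.org/datcat/DC-2482"/>
  <CMD_Element name="Note"/>
</CMD_Component>
"""


def concept_links(spec_xml):
    """Map each element name to its concept URI (None if it is not linked)."""
    root = ET.fromstring(spec_xml.strip())
    return {el.get("name"): el.get("ConceptLink") for el in root.iter("CMD_Element")}


links = concept_links(SPEC)
# Elements without a concept link carry no explicit semantics.
unlinked = [name for name, uri in links.items() if uri is None]
```

Two elements in different profiles that point to the same URI can be treated as sharing semantics, regardless of what they are called locally.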
Semantic Mapping in CMDI
Language: Name, Id, …
Dictionary: Language, Author, …
Semantic Registry
Language Name : A human understandable name of the language that ...
Language ID : Identifier of the language as defined by ISO 639 that …
Semantic Mapping in CMDI
Due to the use of multiple ‘concept’ registries and the open
nature of some of them, (almost) same-as relationships
have to be specified
RELcat (under development) is a Relation Registry which
allows these relationships to be stored in sets, which may be
user or community specific
language ID (isocat:DC-2482) and language name (isocat:DC-2484): related to dc:language
time coverage (isocat:DC-1502): relcat:subClassOf dc:coverage
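How a relation set could be used to expand one concept into all related concepts can be sketched as follows. This is an assumption-laden illustration, not RELcat's API: the triple representation, the function name expand, and the placeholder relation type "related" are hypothetical (the slide does not name the relation type for the language concepts). Following relcat:subClassOf in both directions is a deliberate simplification aimed at recall.

```python
# Hypothetical relation set built from the relations on this slide.
# "related" is a placeholder; only relcat:subClassOf is named in the source.
RELATIONS = [
    ("isocat:DC-2482", "related", "dc:language"),
    ("isocat:DC-2484", "related", "dc:language"),
    ("isocat:DC-1502", "relcat:subClassOf", "dc:coverage"),
]


def expand(concept, relations):
    """Return the concept plus every concept reachable through the relations.

    Relations are followed in both directions and transitively, which
    favours recall over precision.
    """
    seen = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, _relation, obj in relations:
            for a, b in ((subject, obj), (obj, subject)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


language_concepts = expand("dc:language", RELATIONS)
coverage_concepts = expand("isocat:DC-1502", RELATIONS)
```

A search for one language concept can then be run against every concept in the expanded set, and a different relation set yields a different expansion without touching the metadata records.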
CMDI in CLARIN
                                2011-01  2012-06  2013-01  2013-06
Profiles                             40       53       87      124
Components                          164      298      542      828
Elements                            511      893     1505     2399
Distinct Data Categories (DCs)      203      266      436      499
Metadata DCs                        277      712      774      791
% Elements w/o DCs                24.7%    17.6%    21.5%    26.5%
CMD profiles for existing metadata schemas like OLAC/DCMI, TEI Header and
META-SHARE have been created
Profiles differ a lot in structure:
Small and flat profiles with 5 – 10 elements
Large and complex profiles of up to 10 component levels with hundreds of elements
Around half a million CMD records are harvested from around 70 providers
http://catalog.clarin.eu/vlo/
CMD Semantic Mapping in CLARIN
791 metadata Data Categories
222 from Athens Core (recommended)
2 showcases (of very common concepts):
Language
Name
SMC (Semantic Mapping Component) Browser
http://clarin.aac.ac.at/smc-browser
Allows the metadata modeller to explore the semantic overlap
between profiles, components and elements in an interactive
graph
CMD Semantic Mapping in CLARIN
Language
LanguageID (http://www.isocat.org/datcat/DC-2482)
languageName (http://www.isocat.org/datcat/DC-2484)
Linked in the RelationRegistry with the Dublin Core term
language
http://lux13.mpi.nl/relcat/set/cmdi (graph)
Together these ‘concepts’ are linked with 80 profiles
Other related language Data Categories could be
considered
sourceLanguage, languageMother
The Relation Registry allows them to be included to maximize
recall when searching for a specific language
CMD Semantic Mapping in CLARIN
Name
Is a more ambiguous term used by 72 CMD elements
12 different Data Categories are used by these elements
resourceName (http://www.isocat.org/datcat/DC-2544)
resourceTitle (http://www.isocat.org/datcat/DC-2545)
author (http://www.isocat.org/datcat/DC-4115)
contact full name (http://www.isocat.org/datcat/DC-2454)
dcterms:Contributor
...
A naive search on ‘name’ would yield semantically very
heterogeneous results; instead, use:
The ‘concept’ links
Context, i.e., the enclosing components of an element
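The difference between a naive name search and a search by concept link or context can be sketched as below. The Element class, the helper functions, and the component paths are hypothetical and chosen only for illustration; the Data Category URIs are those listed on these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    path: tuple   # enclosing components followed by the element name
    concept: str  # URI of the linked concept


# Hypothetical elements that are all called "Name" but mean different things.
ELEMENTS = [
    Element(("Resource", "Name"), "http://www.isocat.org/datcat/DC-2544"),
    Element(("Contact", "Name"), "http://www.isocat.org/datcat/DC-2454"),
    Element(("Language", "Name"), "http://www.isocat.org/datcat/DC-2484"),
]


def by_name(elements, name):
    """Naive search: match on the element name only."""
    return [e for e in elements if e.path[-1] == name]


def by_concept(elements, concept):
    """Semantic search: match on the concept link."""
    return [e for e in elements if e.concept == concept]


def by_context(elements, component, name):
    """Contextual search: the name must occur inside the given component."""
    return [e for e in elements if e.path[-1] == name and component in e.path[:-1]]


naive = by_name(ELEMENTS, "Name")
semantic = by_concept(ELEMENTS, "http://www.isocat.org/datcat/DC-2484")
contextual = by_context(ELEMENTS, "Contact", "Name")
```

The naive search returns all three elements, while the concept link or the enclosing component narrows the result to the one intended meaning.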
Conclusion & future work
The CMD Infrastructure is very flexible with regard to metadata
structures, but also provides an integrated semantic layer to achieve
semantic interoperability
All the proper registries are in place and have proven useful, e.g., in the
central CLARIN catalogue
Users can search and navigate the metadata based on semantics
and are not directly confronted with the structural diversity
Future work: sometimes more context is needed for disambiguation
However, for metadata modellers the perceived proliferation of reusable
profiles and components can be a burden
The SMC browser already gives insight into (semantic) overlap and
differences
Future work: statistics based on the instance data will also help to
select among profiles and components