Semantic Mapping in
CLARIN Component Metadata
Matej Durco
Institute for Corpus Linguistics and Text Technology
matej.durco@assoc.oeaw.ac.at

Menzo Windhouwer
The Language Archive - DANS
menzo.windhouwer@dans.knaw.nl
MTSR 2013
Thessaloniki, Greece
Outline
• CLARIN: a European infrastructure for language resources
• Component Metadata Infrastructure (CMDI)
• Semantic mapping in CMDI
• Semantic mapping in the CLARIN joint metadata domain
• Conclusions and future work
CLARIN
• CLARIN = Common Language Resources and Technology Infrastructure, a European ESFRI infrastructure project
• Aims to give scholars in the humanities and social sciences easy and sustainable access to digital language data (in written, spoken, video or multimodal form) and to advanced tools to discover, explore, exploit, annotate, analyze or combine them, independent of where they are located
• Is building a networked federation of European data repositories, service centers and centers of expertise
• One pillar of this infrastructure is a joint metadata domain
http://www.clarin.eu/
Component Metadata Infrastructure
Rationale for CMDI
• Limitations of existing metadata schemas (OLAC/DCMI, IMDI, TEI header):
  - Inflexible: too many (IMDI) or too few (OLAC) metadata elements
  - Limited interoperability (both semantic and syntactic)
  - Problematic (unfamiliar) terminology for some sub-communities
  - Limited support for descriptions of LT tools & services
• CMDI addresses this by:
  - Explicitly defined schema & semantics
  - User/project/community-defined components
http://www.clarin.eu/cmdi/
CMDI - example
Let's describe a speech recording. The description is built up step by step from reusable components:
• Technical Metadata: Sample frequency, Format, Size, …
• Language: Name, Id, …
• Actor: Name, Age, Sex, Language, …
• Location: Continent, Country, Address, …
• Project: Name, Contact, …
The components Project, Location, Actor, Language and Technical Metadata are combined into a Metadata Profile.
[Diagram: Metadata Profile; Metadata schema (W3C XML Schema); Metadata description (XML document)]
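
The build-up in this example can be sketched in code: a profile is a tree of reusable components, and every element can be addressed by its path through the enclosing components. This is only a minimal illustration. The names `Component`, `element_paths` and `SpeechRecording` are hypothetical and not part of any CMDI tooling, and treating Language as a component nested inside Actor is an assumption based on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A reusable component: a name, its elements and nested components."""
    name: str
    elements: list[str] = field(default_factory=list)
    components: list["Component"] = field(default_factory=list)


def element_paths(component: Component, prefix: str = "") -> list[str]:
    """Return the dotted path of every element at or below `component`."""
    here = f"{prefix}.{component.name}" if prefix else component.name
    paths = [f"{here}.{element}" for element in component.elements]
    for child in component.components:
        paths.extend(element_paths(child, here))
    return paths


# Components taken from the speech-recording example.
language = Component("Language", ["Name", "Id"])
actor = Component("Actor", ["Name", "Age", "Sex"], [language])  # nesting assumed
technical = Component("TechnicalMetadata", ["SampleFrequency", "Format", "Size"])

# The profile is the top-level component that combines the others.
profile = Component("SpeechRecording", components=[actor, language, technical])

paths = element_paths(profile)
```

The same `language` component is reused twice, so `Name` appears under several paths. This is the reason the later slides rely on concept links and context rather than element names alone.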
CMDI - workflow
[Workflow diagram. Systems shown: metadata catalogue; component registry & editor; ISOcat; Relation Registry; search & semantic mapping; metadata editor; local metadata repository; OAI-PMH data provider; OAI-PMH service provider; joint metadata repository; DATA. Roles shown: metadata modeler, metadata creator, metadata curator, metadata user.]
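
The workflow includes an OAI-PMH data provider next to the local metadata repository and an OAI-PMH service provider next to the joint metadata repository. As a rough sketch of what a harvest request looks like, the function below builds a `ListRecords` URL. The verb and the `metadataPrefix` and `resumptionToken` arguments come from the OAI-PMH protocol; the endpoint `https://repository.example.org/oai` and the prefix value `cmdi` are assumptions used only for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # A resumption token is an exclusive argument: it is sent
        # with the verb only, without metadataPrefix.
        query = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint and prefix.
first_page = list_records_url("https://repository.example.org/oai", "cmdi")
next_page = list_records_url("https://repository.example.org/oai", "cmdi", "token123")
```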
Semantic Mapping in CMDI
• A CMD component, element or value should be linked to a ‘concept’, i.e., a URI that points to a semantic description
  - ‘Concepts’ can be shared, which indicates shared semantics
• Current components mainly use:
  - Dublin Core elements or terms
  - ISOcat Data Categories
• ISOcat (www.isocat.org) is an ISO 12620:2009 compliant Data Category Registry
  - allows elaborate specifications, e.g., a definition, (alternative) names, examples, explanations, value domains (all in various languages)
  - can be freely used by anyone, including the creation of new data categories
  - the Athens Core group has created many metadata data categories inspired by OLAC, TEI Header and IMDI
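Semantic Mapping in CMDI
[Diagram: elements of different components are linked to entries in a Semantic Registry]
• Component Language: Name, Id, …
• Component Dictionary: Language, Author, …
• Semantic Registry:
  - Language Name : A human understandable name of the language that ...
  - Language ID : Identifier of the language as defined by ISO 639 that …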
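
A small sketch of the idea on this slide: each element carries a link to a concept URI, and elements from different components that point to the same URI share their semantics. The two ISOcat identifiers are the ones given elsewhere in these slides for language ID and language name. The dotted element paths and the link from `Dictionary.Language` to the language name concept are assumptions made for the example.

```python
from collections import defaultdict

ISOCAT = "http://www.isocat.org/datcat/"

# Element path -> concept URI.
concept_links = {
    "Language.Name": ISOCAT + "DC-2484",        # language name
    "Language.Id": ISOCAT + "DC-2482",          # language ID
    "Dictionary.Language": ISOCAT + "DC-2484",  # assumed link, for illustration
}


def elements_by_concept(links: dict[str, str]) -> dict[str, list[str]]:
    """Group element paths by the concept URI they are linked to."""
    grouped = defaultdict(list)
    for path, uri in links.items():
        grouped[uri].append(path)
    return dict(grouped)


shared = elements_by_concept(concept_links)
```

In this toy data, `Language.Name` and `Dictionary.Language` end up in the same group although their element names differ.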
Semantic Mapping in CMDI
• Because multiple ‘concept’ registries are used, and some of them are open, (almost) same-as relationships have to be specified
• RELcat (under development) is a Relation Registry in which these relationships can be stored in sets, which may be user or community specific
• Examples from the diagram:
  - language ID (isocat:DC-2482) and language name (isocat:DC-2484) are linked to dc:language
  - time coverage (isocat:DC-1502) relcat:subClassOf dc:coverage
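
One way to picture such a registry is as named sets of (subject, relation, object) triples. The sketch below is not RELcat's actual data model or API. The set name `cmdi` is borrowed from the RELcat URL shown later in the slides, `relcat:subClassOf` is the relation named on this slide, and `related` is a placeholder label because the slide text does not name the relation used for the two language concepts.

```python
# Relation sets: set name -> list of (subject, relation, object) triples.
relation_sets = {
    "cmdi": [
        ("isocat:DC-2482", "related", "dc:language"),            # language ID
        ("isocat:DC-2484", "related", "dc:language"),            # language name
        ("isocat:DC-1502", "relcat:subClassOf", "dc:coverage"),  # time coverage
    ],
}


def related_concepts(concept: str, set_name: str, sets=relation_sets) -> set[str]:
    """Return `concept` plus every concept directly linked to it in one set."""
    found = {concept}
    for subject, _relation, obj in sets.get(set_name, []):
        if subject == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subject)
    return found


language_concepts = related_concepts("dc:language", "cmdi")
coverage_concepts = related_concepts("isocat:DC-1502", "cmdi")
```

Because relations live in sets, a user or community can choose a different set and obtain a different notion of equivalence without touching the concept registries themselves.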
CMDI in CLARIN
                                  2011-01  2012-06  2013-01  2013-06
Profiles                               40       53       87      124
Components                            164      298      542      828
Elements                              511      893     1505     2399
Distinct Data Categories (DCs)        203      266      436      499
Metadata DCs                          277      712      774      791
% Elements w/o DCs                  24.7%    17.6%    21.5%    26.5%
• CMD profiles for existing metadata schemas like OLAC/DCMI, TEI Header and META-SHARE have been created
• Profiles differ a lot in structure:
  - Small and flat profiles with 5 to 10 elements
  - Large and complex profiles of up to 10 component levels with hundreds of elements
• Around half a million CMD records are harvested from around 70 providers
http://catalog.clarin.eu/vlo/
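
The percentage row can be turned into approximate absolute numbers with a few lines of arithmetic. The figures are copied from the table; the resulting counts are estimates only, because the percentages are rounded to one decimal.

```python
snapshots = ["2011-01", "2012-06", "2013-01", "2013-06"]
elements = [511, 893, 1505, 2399]
share_without_dc = [0.247, 0.176, 0.215, 0.265]

# Approximate number of elements that are not linked to a Data Category.
unlinked = [round(n * share) for n, share in zip(elements, share_without_dc)]

# Growth factor of the number of elements over the whole period.
growth = elements[-1] / elements[0]

for snapshot, count in zip(snapshots, unlinked):
    print(f"{snapshot}: about {count} elements without a DC link")
print(f"elements grew by a factor of about {growth:.1f}")
```

Read this way, the share of unlinked elements stays in a similar range while their estimated absolute number grows with the total number of elements.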
CMD Semantic Mapping in CLARIN
• 791 metadata Data Categories
  - 222 from Athens Core (recommended)
• 2 showcases (of very common concepts):
  - Language
  - Name
• SMC (Semantic Mapping Component) Browser
  - http://clarin.aac.ac.at/smc-browser
  - Allows the metadata modeller to explore the semantic overlap between profiles, components and elements in an interactive graph
CMD Semantic Mapping in CLARIN
• Language
  - LanguageID (http://www.isocat.org/datcat/DC-2482)
  - languageName (http://www.isocat.org/datcat/DC-2484)
  - Linked in the Relation Registry with the Dublin Core term language
    http://lux13.mpi.nl/relcat/set/cmdi (graph)
• Together these ‘concepts’ are linked with 80 profiles
• Other related language Data Categories could be considered
  - sourceLanguage, languageMother
• The Relation Registry makes it possible to include them to maximize the recall for a specific language
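
The effect on recall can be illustrated with a toy search in which every value in a record is tagged with the concept of its element. The records, their identifiers and the language codes below are invented for the example. `sourceLanguage` is used as a plain label because the slides give no URI for it.

```python
# Hypothetical harvested records: (concept, value) pairs per record.
records = [
    {"id": "rec1", "values": [("isocat:DC-2482", "deu")]},
    {"id": "rec2", "values": [("dc:language", "deu")]},
    {"id": "rec3", "values": [("sourceLanguage", "deu")]},
    {"id": "rec4", "values": [("isocat:DC-2482", "nld")]},
]


def search(records: list[dict], concepts: set[str], value: str) -> list[str]:
    """Return ids of records holding `value` under any of the given concepts."""
    return [
        record["id"]
        for record in records
        if any(c in concepts and v == value for c, v in record["values"])
    ]


# Only the language ID concept.
narrow = search(records, {"isocat:DC-2482"}, "deu")
# Concepts linked in the relation set.
linked = search(records, {"isocat:DC-2482", "isocat:DC-2484", "dc:language"}, "deu")
# Additionally include a related Data Category.
broad = search(
    records,
    {"isocat:DC-2482", "isocat:DC-2484", "dc:language", "sourceLanguage"},
    "deu",
)
```

Each widening of the concept set returns more records for the same language, which is the recall gain the slide refers to. Whether the extra hits are wanted depends on the search task.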
CMD Semantic Mapping in CLARIN
• Name
  - Is a more ambiguous term, used by 72 CMD elements
  - 12 different Data Categories are used by these elements:
    resourceName (http://www.isocat.org/datcat/DC-2544)
    resourceTitle (http://www.isocat.org/datcat/DC-2545)
    author (http://www.isocat.org/datcat/DC-4115)
    contact full name (http://www.isocat.org/datcat/DC-2454)
    dcterms:Contributor
    ...
• A naive search on ‘name’ would yield semantically very heterogeneous results; instead use:
  - The ‘concept’ links
  - Context, i.e., the enclosing components of an element
Conclusion & future work
• The CMD Infrastructure is very flexible with regard to metadata structures, but also provides an integrated semantic layer to achieve semantic interoperability
  - All the necessary registries are in place and prove to be useful, e.g., in the central CLARIN catalogue
  - Users can search and navigate the metadata based on semantics and are not directly confronted with the structural diversity
  - Future work: sometimes more context is needed for disambiguation
• However, for metadata modellers the perceived proliferation of reusable profiles and components can be a burden
  - The SMC browser already gives insight into (semantic) overlap and differences
  - Future work: statistics based on the instance data will also help to select among profiles and components
