2. 2
• Create an interoperable domain of Language
Resources (LR)
– Interoperable formats for LR content
– Persistent identification (and citation) of LRs
– Use of SAML-based AAI for access to LRs
– Use of the Component Metadata Infrastructure (CMDI) for describing
LRs
3. 3
• Created as a response to a fragmented situation of LR metadata
• Flexible
– Not a single schema, but support for different metadata schemas
– Different schemas for different situations
– Semantic Interoperability via linking to semantic registries
• Community driven
– Communities can model their own metadata schemas
– They know their data and can create the right schema
– They know the right terminology
• Sharing
– Concepts, Terminology, Vocabularies
• CLARIN Concept Registry for linguistic concepts,
• ISO 639 and other relevant vocabularies
• CLAVAS for organisation names
– Components & profiles via the CLARIN metadata component registry
4. 4
• A Component groups together metadata
Elements that naturally belong together
to describe a property of a resource
– The Location where a SpeechRecording took place
– The Location of an Actor
– A Location is described by an address and/or region and/or
country and/or continent
• Components can be nested
– The Language a specific Actor speaks
– An Actor who takes part in a SpeechRecording for a
specific Project
• A Profile is a specific collection of
Components for a specific type of
resources, e.g., speech recordings
[Figure: example profile SpeechRecording (P) built from the components (C) Actor, Location (with the elements (E) address, region, country and continent), Project, Language and TechnicalMetadata; Location and Language each appear twice]
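The component model on this slide can be sketched in a few lines of Python. This is only an illustration of nesting and reuse, not the CMDI component specification format. The class names, the `paths` helper and the `name` element of Language are assumptions made for the example. The nesting of Location and Language inside Actor follows the bullets above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A single metadata field, e.g. 'country'."""
    name: str


@dataclass
class Component:
    """A reusable group of elements and nested components."""
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)


@dataclass
class Profile(Component):
    """A top-level collection of components for one resource type."""


def make_location() -> Component:
    # One Location definition, reused wherever a location is needed.
    names = ("address", "region", "country", "continent")
    return Component("Location", elements=[Element(n) for n in names])


def paths(component: Component, prefix: str = "") -> List[str]:
    """List every element as a slash-separated path through the nesting."""
    here = f"{prefix}/{component.name}"
    found = [f"{here}/{e.name}" for e in component.elements]
    for child in component.components:
        found.extend(paths(child, here))
    return found


# 'name' is a hypothetical element, used only to give Language some content.
language = Component("Language", elements=[Element("name")])
actor = Component("Actor", components=[make_location(), language])
profile = Profile("SpeechRecording", components=[actor, make_location()])

for p in paths(profile):
    print(p)
```

The same Location component shows up both directly under the profile and inside Actor, which is the kind of reuse the component registry is meant to support.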
6. 6
• Started in 2010, version 1.2 released in 2016 supporting
remote vocabularies
• Actively supported by CLARIN ERIC and several national CLARIN
consortia
• Many supporting tools:
– VLO, COMEDI, ARBIL, CMDI maker, Virtual Collection Registry …
• Link to the Linked (open) Data world: CMDI2RDF
[Figure: CMDI → CMDI2RDF → LOD]
7. 7
• Started as a 2014 CLARIN NL project by TLA/MPI and DANS
• Now a service supported by CLARIAH WP2 (X11.400)
• Also links to other ‘linguistic’ LOD information sources:
– WALS for linguistic typology information
– CLAVAS organization names
– DBpedia (currently only used as glue)
• Automatic synchronization with the CMDI metadata
• Simplification of the RDFS model for CMDI
8. 8
• CMD is classic XML, constrained by a W3C XML Schema
• To map a CMD record to RDF we need
– A mapping for the basic component model to RDFS
• Basic classes and properties to represent profiles, components,
elements, attributes and their relationships and values
– A mapping for a specific profile or component to RDFS
• A specific subclass or subproperty of the basic component model
– A mapping for specific metadata records to RDF instances of RDFS
• Instances of profile or component
– Additionally, there is a generic CMD envelope that is mapped using
common LOD vocabularies
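The three mapping levels listed above can be illustrated with plain triples. This is a minimal sketch, not the actual CMDI2RDF vocabulary: the prefixes `cmdm:` and `ex:` and all names under them are hypothetical, and triples are modelled as Python tuples so that no RDF library is needed.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Level 1: basic component model (hypothetical class names).
basic_model = [
    ("cmdm:Component", RDF_TYPE, "rdfs:Class"),
    ("cmdm:Element", RDF_TYPE, "rdfs:Class"),
]

# Level 2: a specific component and element as subclasses of the basic model.
profile_level = [
    ("ex:Actor", SUBCLASS, "cmdm:Component"),
    ("ex:Age", SUBCLASS, "cmdm:Element"),
]

# Level 3: a metadata record as an instance of the specific component.
record_level = [
    ("ex:actor1", RDF_TYPE, "ex:Actor"),
]

graph = basic_model + profile_level + record_level


def classes_of(node, triples):
    """Return the direct types of a node plus all their superclasses."""
    found = set()
    todo = [o for s, p, o in triples if s == node and p == RDF_TYPE]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(o for s, p, o in triples if s == cls and p == SUBCLASS)
    return found


print(sorted(classes_of("ex:actor1", graph)))
# → ['cmdm:Component', 'ex:Actor']
```

Because the specific classes hang under the basic model, a query written against the generic level also finds instances of any specific profile or component.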
9. 9
• Basic CMD model is described by ISO/DIS 24622-1
– First part of the ISO TC 37/SC 4 family of three CMD standards
• Natural mapping to RDF would be:
– Profiles/components to RDF Classes
– Elements to RDF Properties
• Complication
– CLARIN’s CMDI allows attributes on both Components and Elements
– So elements have to be RDF Classes as well
10. 10
• Nevertheless, this introduces extra hierarchy
• CMDI is already a hierarchical metadata schema
• Human readability decreases
• Other solutions welcome!
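Simplified example
Element without attributes:
<Description URI= …. >
<Age>14</Age>
…
</Description>
[Figure: R → Age → 14]
Element with an attribute:
<Description…. >
<Age status='U'>14</Age>
…
</Description>
[Figure: R → Age → node with value 14 and status U]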
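The two graph shapes in the simplified example can be reproduced with a short sketch. It is not the CMDI2RDF implementation (which the later slides describe as Java and XSLT). The function name, the subject label `R`, the blank-node naming and the `value` property are assumptions made for illustration. An element without attributes becomes a single triple. An element with attributes gets an intermediate node that carries both the value and the attributes.

```python
import xml.etree.ElementTree as ET
from itertools import count


def record_to_triples(xml_text, subject="R"):
    """Turn the child elements of a record into (s, p, o) tuples."""
    root = ET.fromstring(xml_text)
    blank = count(1)
    triples = []
    for child in root:
        if child.attrib:
            # Attributes need something to attach to, so the element
            # becomes a node of its own instead of a plain property.
            node = f"_:b{next(blank)}"
            triples.append((subject, child.tag, node))
            triples.append((node, "value", child.text))
            for name, val in child.attrib.items():
                triples.append((node, name, val))
        else:
            triples.append((subject, child.tag, child.text))
    return triples


plain = "<Description><Age>14</Age></Description>"
with_attr = "<Description><Age status='U'>14</Age></Description>"

print(record_to_triples(plain))
# → [('R', 'Age', '14')]
print(record_to_triples(with_attr))
# → [('R', 'Age', '_:b1'), ('_:b1', 'value', '14'), ('_:b1', 'status', 'U')]
```

The second result has one more level than the first, which is the extra hierarchy noted on this slide.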
13. 13
• Offers LoD for different LR
metadata infrastructures
– LRE Map (LREC)
– META-SHARE
– CLARIN
– DataHub (linguistic part)
• However
– With respect to CLARIN, only data with DC
profiles
• Just a small part of CLARIN
– Seems partly based on old, static
data dumps
14. 14
• Goals:
– Find metadata-like information about LRs in LD format
– Translate that into a ‘suitable’ CMDI profile based metadata record
• Is there such LD that is not already directly available in another
format (OLAC, CLARIN, DC, META-SHARE)?
– If so, useful to have this metadata in the CLARIN VLO metadata catalogue
– Humanities data archives will mostly have DC (inventories are available from
different projects, e.g. DASISH) and frequently offer LD
– Easier ways exist to translate DC into CMDI (e.g. the CMDI DC profile)
– But LD can be a pivot set for many such translations
• Still in exploratory phase
– Would like to use a general strategy
– It is very labor-intensive to craft specific transformations for every LD set
15. 15
• Useful for CLARIN?
– Enriching existing CMDI metadata and
recycling them
– Relations to sources already known as:
• WALS, DBpedia, CLAVAS, Glottolog, …
• Relations to CLARIAH LD sources ?
– Enable the VLO (or an alternative browser)
for visualizing this information
– Increasing metadata quality:
• Use CLAVAS to repair errors
• Include preferred labels
– Some CMDI adaptations required
• Foreign namespace support in CMDI
payload
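The idea of using a vocabulary such as CLAVAS to repair organisation names and add preferred labels can be sketched as a simple lookup. This does not call the CLAVAS service. The vocabulary content, the variant spellings and the function names are made up for illustration, and a real vocabulary would be fetched rather than hard-coded.

```python
from typing import Dict, List, Optional


def normalise(label: str) -> str:
    """Collapse whitespace and ignore case so near-identical spellings match."""
    return " ".join(label.split()).casefold()


def build_lookup(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known spelling (preferred or variant) to the preferred label."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        for label in [preferred, *variants]:
            lookup[normalise(label)] = preferred
    return lookup


def repair(name: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the preferred label, or None when the name is unknown."""
    return lookup.get(normalise(name))


# Illustrative entry only: preferred label followed by variant spellings.
vocabulary = {
    "Max Planck Institute for Psycholinguistics": [
        "MPI for Psycholinguistics",
        "Max-Planck-Institut für Psycholinguistik",
    ],
}
lookup = build_lookup(vocabulary)

print(repair("  mpi   for psycholinguistics ", lookup))
print(repair("Unknown Org", lookup))
```

Names that do not match any label come back as `None`, so they can be listed for manual curation instead of being silently changed.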
[Figure: workflow with steps A, B and C linking the CLARIN centres (and CLARIAH?), CMDI records, an RDF store, DBpedia and Glottolog, RDF2CMD, enriched CMDI and the VLO]
• Virtuoso as a triplestore
• Tomcat as application server
• Elda as browser
• Conversion pipeline in Java, core transforms in XSLT
• All in a Docker package
• All code on GitHub: