Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry
www.isocat.org
Menzo Windhouwer
The Language Archive – DANS
tla.mpi.nl
menzo.windhouwer@dans.knaw.nl
28 March 2013 eHg - New Trends in e-Humanities 1
The Language Archive
• Founded in September 2011
• Supported by MPG, BBAW and KNAW (DANS)
• Grown out of the Technical Group at the MPI for
Psycholinguistics
• Since the 1990s: challenge of archiving digital data
• 2000 – 2016 Volkswagen Foundation DOBES
project on Endangered Languages
• Active in many European infrastructure projects:
CLARIN, EUDAT, DASISH, …
Language Archiving Technology
• Full lifecycle support
– Core: resources
– Key: metadata
– ‘New’: CMDI, ISOcat, AV recognition,
…
• Archive size:
– 70 TB of resources
– 22,000 hours of AV recordings
– 75,000 sessions (metadata)
– 5 million annotated segments
– 50 lexica
• My focus: Knowledge Systems
– LEXUS, an online lexicon tool
– ISOcat and companions
Typological Database Nijmegen
TOP NOTION tds:Noun GROUPS {
  NOTION tdn:GrammaticalDistinctions
  LABEL "Grammatical distinctions for nouns."
  GROUPS {
    NOTION tdn:AgentNouns
    LABEL "Agent nouns."
    DESCRIPTION "Nouns can function as the agent of a clause."
    LINK TO CONCEPT agentRole
    GROUPS {
      NOTION tdn:v098_plusAffix
      LABEL "Agent nouns formed by verb stem plus affix."
      LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix)
      DESCRIPTION
        <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p>
      NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX"
      IS FIELD v098;
      ...
Notes: TDN is not archived in TLA, but curated in TDS, a previous project I worked on, and now archived at DANS; also, this is not a TDN punchcard.
DOBES corpora
Oxford English Dictionary
Source: http://www.oxford-royale.co.uk/news/2010/12/04/new-online-edition-of-oxford-english-dictionary.html
Terminology Community of Practice
• Community started out on paper (A5 fiches),
just like OED
• 1980s – 1990s: projects to standardize the names of
data categories, i.e., the ‘fields’ on the fiches, in the
files, or in database records
• ISO 12620:1999 Data Categories: a companion
standard to ISO 12200, Machine-readable
terminology interchange format (MARTIF)
ISO 12620:1999
Towards a Data Category Registry
• Problems with ISO 12620:1999, a hardcoded list of data categories:
– Not easily extensible
– Ordering heavily debated
– Outdated and limited in range at the moment of release
• Developments
– In the SALT project an interchange model (TBX) based on MARTIF/data
categories was created, which was widely adopted
– ISO 11179 Metadata Registries was released, which describes the
standardization of data element concepts for metadata
– ISO released Annex ST Standards as databases, which describes an ISO
procedure to standardize registry entries
– In the LIRICS project a pilot Data Category Registry, SYNTAX, was
created
ISO 12620:2009
• Terminology and other content and language resources — Specification of
data categories and management of a Data Category Registry for language
resources
– A data model for data category specifications inspired by ISO 11179
– A procedure to standardize data category specifications, compliant with
Annex ST
– Each data category gets a unique Persistent Identifier (PID)
– The Max Planck Institute for Psycholinguistics is appointed as the
Registration Authority of the ISO/TC 37 DCR
• In use by a growing number of ISO TC 37 standards
– Lexical Markup Framework (LMF)
– Linguistic Annotation Framework (LAF)
– Morpho-syntactic Annotation Framework (MAF)
– …
– could be more, e.g., Feature System Declarations (FSD)
Example Data Category specification
• Data category: /Grammatical gender/
– Administrative part:
• Identifier: grammaticalGender
• PID: http://www.isocat.org/datcat/DC-1297
– Descriptive part:
• English definition: Category based on (depending on languages)
the natural distinction between sex and formal criteria.
• French definition: Catégorie fondée (selon la langue) sur la
distinction naturelle entre les sexes ou d'autres critères formels.
– Linguistic part:
• Morphosyntax conceptual domain: /masculine/, /feminine/,
/neuter/
• French conceptual domain: /masculine/, /feminine/
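The three parts of the specification above can be pictured as a small record. The following is only a sketch: the class name `DataCategory`, its field names, and the `allowed_values` helper are assumptions made for illustration and are not the ISOcat data model or API. The values are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Hypothetical in-memory view of a data category specification."""

    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by language code
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or language
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)

    def allowed_values(self, scope: str) -> List[str]:
        """Return the value domain for a profile or language, or [] if none is given."""
        return self.conceptual_domains.get(scope, [])


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

if __name__ == "__main__":
    print(grammatical_gender.allowed_values("French"))
```

Keying the conceptual domains by profile or language mirrors the slide, where the French domain is narrower than the general Morphosyntax one.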
Standardization procedure
[Figure: standardization workflow. Parties: Submission group, Thematic Domain Group, Data Category Registry Board, Stewardship group; Decision Group. Steps: Evaluation (may end in "rejected"), Validation (may end in "rejected"), Publication.]
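One plausible reading of the workflow diagram is a linear pipeline in which a submission can be rejected at evaluation or at validation. The sketch below encodes only that reading. The stage order, the names `Stage`, `NEXT_STAGE`, `CAN_REJECT`, and `advance`, and the omission of the parties involved are all assumptions, not part of ISO 12620:2009.

```python
from enum import Enum


class Stage(Enum):
    SUBMISSION = "submission"
    EVALUATION = "evaluation"
    VALIDATION = "validation"
    PUBLICATION = "publication"
    REJECTED = "rejected"


# Assumed linear order, read off the labels of the diagram.
NEXT_STAGE = {
    Stage.SUBMISSION: Stage.EVALUATION,
    Stage.EVALUATION: Stage.VALIDATION,
    Stage.VALIDATION: Stage.PUBLICATION,
}

# Only these two stages show a "rejected" exit in the diagram.
CAN_REJECT = {Stage.EVALUATION, Stage.VALIDATION}


def advance(stage: Stage, accepted: bool = True) -> Stage:
    """Return the stage that follows the decision taken at `stage`."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.value} is a final stage")
    if not accepted:
        if stage not in CAN_REJECT:
            raise ValueError(f"no rejection is possible at {stage.value}")
        return Stage.REJECTED
    return NEXT_STAGE[stage]


if __name__ == "__main__":
    stage = Stage.SUBMISSION
    while stage in NEXT_STAGE:
        stage = advance(stage)
        print(stage.value)
```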
Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
1. Translation
2. (Sign language)
• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of members assigned by SC P members
• A number of expert members invited by the chair (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by a SC
ISOcat - the ISO TC 37/DCR
• A (coherent) set of Data Categories, in our case for
linguistic resources
• A system to manage this set:
– Create and edit Data Categories
– Share Data Categories, e.g., resolve PID references
– Standardize Data Categories
• An API for tools to access the DCR
• Grass roots approach
– Anyone can access the DCR and use or
create the data categories (s)he needs
Referring to ISOcat data categories
• PIDs of data categories can easily be embedded in XML documents
<lmf:LexicalEntry>
  <tei:f
    name="partOfSpeech"
    dcr:datcat="http://www.isocat.org/datcat/DC-1345"
    fVal="commonNoun"
    dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f
      name="writtenForm"
      dcr:datcat="http://www.isocat.org/datcat/DC-1836"
      fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
• Also embedding in other formats is possible, e.g., via comments
• Preferably annotate schemas, so a whole range of resources is annotated
in one go
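A tool that consumes such a document can collect the embedded PIDs with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. The slide's excerpt omits namespace declarations, so the URIs used here are assumptions: `urn:example:lmf` is a made-up placeholder, and the `dcr` and `tei` URIs are assumed values. The function name `collect_datcat_refs` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dcr: prefix; not given on the slide.
DCR_NS = "http://www.isocat.org/ns/dcr"

# The slide's example, with assumed namespace declarations added so it parses.
SAMPLE = """\
<lmf:LexicalEntry xmlns:lmf="urn:example:lmf"
                  xmlns:tei="http://www.tei-c.org/ns/1.0"
                  xmlns:dcr="http://www.isocat.org/ns/dcr">
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
"""


def collect_datcat_refs(xml_text):
    """Return (name, datcat PID, valueDatcat PID or None) per annotated element."""
    root = ET.fromstring(xml_text)
    refs = []
    for elem in root.iter():  # document order, root included
        datcat = elem.get(f"{{{DCR_NS}}}datcat")
        if datcat is not None:
            value_datcat = elem.get(f"{{{DCR_NS}}}valueDatcat")
            refs.append((elem.get("name"), datcat, value_datcat))
    return refs


if __name__ == "__main__":
    for name, datcat, value_datcat in collect_datcat_refs(SAMPLE):
        print(name, datcat, value_datcat)
```

Because the references are plain attributes, the same loop works on an annotated schema as well as on an instance document.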
A glimpse of ISOcat
Collaboration in ISOcat
• Registered users can contact each other via
mediated email
– Ask the owner if a data category can be adapted a
little to your needs
• Registered users can start up a group and invite
other users to join
– Work together on a set of data categories
– Interact via a public and/or private forum
• A group can submit data categories for ISO
standardization
Component MetaData Infrastructure
• CMDI is developed by CLARIN and on its way to
standardization by ISO TC 37
– Limitations of existing metadata schemas: DC/OLAC,
IMDI, TEI header
• Inflexible: too many (IMDI) or too few (OLAC) metadata
elements
• Limited interoperability (both semantic and syntactic)
• Problematic (unfamiliar) terminology for some sub-
communities.
• Limited support for LT tool & services descriptions
– The idea is to address this by:
• Explicitly defined schema & semantics
• User/project/community defined components
CMDI architecture
[Figure: CMDI architecture. Labels: ISOcat; Relation Registry; component registry & editor; metadata modeler; metadata catalogue; search & semantic mapping; metadata user; metadata editor; metadata creator; joint metadata repository; local metadata repository; metadata curator (twice); OAI-PMH service provider; OAI-PMH data provider; data.]
Athens Core
• Bootstrapped the selection of Metadata data
categories in ISOcat
– Based on existing metadata standards, e.g., DC,
OLAC, IMDI, TEI
– Many translations into European languages
• Users add the data categories they need to
the Metadata profile and use them in CMDI
CMDI architecture
[Figure: CMDI architecture, the same diagram as on the earlier CMDI architecture slide.]
CMDI architecture
[Figure: CMDI architecture, as before, with the metadata catalogue labelled "metadata catalogues (VLO, MI)".]
CMDI (intermediate) results
• Diverse metadata profiles
– Centers or projects create specific ones, but reuse components where
possible
• Shared and explicit semantics help to overcome
– Terminological differences
– Differences in structure
• Future
– Get more context sensitive
• e.g. documentation language vs. speaker language
– Crosswalks
• equivalent metadata data categories are easily introduced due to the open nature
of ISOcat
– User specific relationships
• e.g. theory specific differences can be more important to one user than another
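How shared semantics bridges terminological differences can be sketched as a crosswalk: two profiles name their elements differently but point to the same data category PIDs. Everything in the sketch is hypothetical. The profile contents, the element names, and the `DC-EXAMPLE-n` PIDs are placeholders, not entries in ISOcat, and `crosswalk` is an illustrative helper rather than a CMDI component.

```python
from collections import defaultdict

# Hypothetical profiles: element name -> data category PID (placeholder PIDs).
PROFILE_A = {
    "Title": "http://www.isocat.org/datcat/DC-EXAMPLE-1",
    "Language": "http://www.isocat.org/datcat/DC-EXAMPLE-2",
}
PROFILE_B = {
    "ResourceName": "http://www.isocat.org/datcat/DC-EXAMPLE-1",
    "SubjectLanguage": "http://www.isocat.org/datcat/DC-EXAMPLE-2",
    "Creator": "http://www.isocat.org/datcat/DC-EXAMPLE-3",
}


def crosswalk(source, target):
    """Map each source element to the target elements that share its PID."""
    by_pid = defaultdict(list)
    for element, pid in target.items():
        by_pid[pid].append(element)
    # Elements whose PID does not occur in the target are left out.
    return {
        element: by_pid[pid]
        for element, pid in source.items()
        if pid in by_pid
    }


if __name__ == "__main__":
    print(crosswalk(PROFILE_A, PROFILE_B))
```

A search front end could use such a mapping to treat a query on `Title` as also matching `ResourceName`. Relations weaker than identity (the crosswalks and user specific relationships listed under Future) would need an additional relation table.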
Metadata TDG
• Standardization efforts of the Metadata TDG stalled
– Large overlap with the work/people at the Athens-Core meetings
• Community level agreement is maybe enough
– Motivation for activity should not depend on only one person, the TDG chair
• The need for explicit and shared semantics is not clear enough yet … more evangelization
needed
– Unfamiliarity with the work
• Terminologists are more used to this kind of review work
• Online review vs. old ISO ‘paper’ process
– Members have little time, it is difficult to sync schedules
• TDG experts tend to be senior scientists
• Continuous process vs. sporadic bursts of activity
– Unpaid work
• Project funding vs. wide acceptance in the community
• However, a project might bootstrap a thematic domain
• The same problems hold for other TDGs
– Current tendency to tie data category (selection) standardization to a
new/revised standard, e.g., MAF and TBX
– Redesign of the standardization process is coming up
• ISO is not actively supporting Annex ST Standards as Databases anymore
Community efforts
• LMF-related: UBY, RELISH/GOLD
• Sign Language
• CLARIN
– CMDI, Athens Core
– CLARIN-NL/VL
• Call 1 – 4 projects created CMDI and annotated
resources/schemas
• ISOcat content coordinator: Ineke Schuurman
– Tutorials, guidelines (do’s and don’ts) and feedback
• Better community support in ISOcat
– Views, e.g., CLARIN-NL/VL
– Recommended by, e.g., DC-4949
–…
Conclusions and future work
• Communities can already create a coherent view on ISOcat
– the CMDI use case shows potential
– funder support may be needed to bootstrap specific domains
• The standardized core will take (a long) time
– like all standardization work
• Next to metadata also content
– explicit semantics would be profitable even when not shared and/or used for
resource discovery
– tools that support ISOcat make it easier to create resources with
explicit semantics
• Companion registries:
– relations between data categories (RELcat)
– annotated schemas for language resources (SCHEMAcat)
– interaction with the CLARIN vocabulary service (CLAVAS)
• Data categories vs. concepts
Detour: ISOcat and LOD/Semantic Web
• Archives and infrastructures look at the resources as
they are, i.e., in general no conversions to triples
• However, ISOcat data categories can easily be used in
RDF resources
:partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ;
    rdfs:label "part of speech"@en ;
    rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .
• The Relation Registry, which is a triple store, will in
general support lightweight, semi-formal ontologies
M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.
Thank you for your attention!
Visit
www.isocat.org
Questions?
www.isocat.org/forum/
isocat@mpi.nl
Acknowledgements
Thanks to everyone at TLA, Sue Ellen Wright, Ineke Schuurman, Marc Kemps-Snijders, CLARIN-NL, CLARIN, ISO TC 37
A whole litter of cats!
[Figure: overview of the registries. Labels: Linguistic resource (schema); Linguistic knowledge base; Data categories; Containers; Concepts; Relation; Schema Registry (SCHEMAcat); Data Category Registry (ISOcat); Concept Registry; Relation Registry (RELcat).]
ISO 11179: concepts vs. data elements/categories
ISO 12620 Data Categories
Editor's Notes
PWMNLP chapter: “The most well-known early example of a structured community-based effort to create a major language resource was perhaps the Oxford English Dictionary (OED): begun in 1857, the ‘community’ in question grew from a relatively small group of dictionary-aficionados to include hundreds of men and women scholars scattered across the English-speaking world documenting words and word forms using quasi-uniform ‘slips’ designed primarily to document usage and provenance as identified in significant works of English literature. Here the designation of types of information (main forms, part of speech, etymology, etc.) in word-oriented lexical entries is achieved by the now-famous Oxford entry layout, which uses font variation to represent the different kinds of information contained in a lexicographical entry.”