Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

New Trends in e-Humanities, KNAW e-Humanities Group, Amsterdam, March 28, 2013

  1. Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry
     Menzo Windhouwer, The Language Archive – DANS (tla.mpi.nl, menzo.windhouwer@dans.knaw.nl)
     www.isocat.org
     28 March 2013, eHg - New Trends in e-Humanities
  2. The Language Archive
     • Founded in September 2011
     • Supported by MPG, BBAW and KNAW (DANS)
     • Grown out of the Technical Group at the MPI for Psycholinguistics
     • Since the 1990s: challenge of archiving digital data
     • 2000 – 2016: Volkswagen Foundation DOBES project on Endangered Languages
     • Active in many European infrastructure projects: CLARIN, EUDAT, DASISH, …
  3. Language Archiving Technology
     • Full lifecycle support
       – Core: resources
       – Key: metadata
       – ‘New’: CMDI, ISOcat, AV recognition, …
     • Archive size:
       – 70 TB of resources
       – 22,000 hours of AV recordings
       – 75,000 sessions (metadata)
       – 5 million annotated segments
       – 50 lexica
     • My focus: Knowledge Systems
       – LEXUS, an online lexicon tool
       – ISOcat and companions
  4. Typological Database Nijmegen
       TOP NOTION tds:Noun GROUPS{
         NOTION tdn:GrammaticalDistinctions
         LABEL "Grammatical distinctions for nouns."
         GROUPS {
           NOTION tdn:AgentNouns
           LABEL "Agent nouns."
           DESCRIPTION "Nouns can function as the agent of a clause."
           LINK TO CONCEPT agentRole
           GROUPS {
             NOTION tdn:v098_plusAffix
             LABEL "Agent nouns formed by verb stem plus affix."
             LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix)
             DESCRIPTION <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p>
             NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX"
             IS FIELD v098;
             ...
     Notes: TDN is not archived in TLA, but curated in TDS, a previous project I worked on, and now archived at DANS; also, this is not a TDN punchcard.
  5. DOBES corpora
  6. Oxford English Dictionary
     Source: http://www.oxford-royale.co.uk/news/2010/12/04/new-online-edition-of-oxford-english-dictionary.html
  7. Terminology Community of Practice
     • Community started out on paper (A5 fiches), just like the OED
     • 1980s – 1990s: projects to standardize data category names, i.e., the names of the ‘fields’ on the fiches or in the files/database records
     • ISO 12620:1999 Data Categories: a companion standard to ISO 12200 Machine-readable terminology interchange format (MARTIF)
  8. ISO 12620:1999
  9. Towards a Data Category Registry
     • Problems with ISO 12620:1999, a hardcoded list of data categories
       – Not easily extensible
       – Ordering heavily debated
       – Outdated and limited in range at the moment of release
     • Developments
       – In the SALT project an interchange model (TBX) based on MARTIF/data categories was created, which was widely adopted
       – ISO 11179 Metadata Registries was released, which describes the standardization of data element concepts for metadata
       – ISO released Annex ST Standards as databases, which describes an ISO procedure to standardize registry entries
       – In the LIRICS project a pilot Data Category Registry, SYNTAX, was created
  10. ISO 12620:2009
      • Terminology and other content and language resources - Specification of data categories and management of a Data Category Registry for language resources
        – A data model for data category specifications inspired by ISO 11179
        – A procedure to standardize data category specifications, compliant with Annex ST
        – Each data category gets a unique Persistent Identifier (PID)
        – The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR
      • In use by a growing number of ISO TC 37 standards
        – Lexical Markup Framework (LMF)
        – Linguistic Annotation Framework (LAF)
        – Morpho-syntactic Annotation Framework (MAF)
        – …
        – could be more, e.g., Feature System Declarations (FSD)
  11. Example Data Category specification
      • Data category: /Grammatical gender/
        – Administrative part:
          • Identifier: grammaticalGender
          • PID: http://www.isocat.org/datcat/DC-1297
        – Descriptive part:
          • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
          • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
        – Linguistic part:
          • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
          • French conceptual domain: /masculine/, /feminine/
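The three parts of the /Grammatical gender/ specification can be pictured as a small record. The following is only a minimal sketch, not the ISOcat data model itself: the class name `DataCategory`, its field names and the `accepts` helper are assumptions made for illustration, while the identifier, PID, definitions and value lists are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Hypothetical, simplified view of a data category specification."""
    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: language name -> definition text
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domain name -> allowed values
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)

    def allowed_values(self, domain: str) -> List[str]:
        """Return the values listed for a conceptual domain (empty if unknown)."""
        return self.conceptual_domains.get(domain, [])

    def accepts(self, domain: str, value: str) -> bool:
        """Check whether a value is listed in the given conceptual domain."""
        return value in self.allowed_values(domain)


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "English": "Category based on (depending on languages) the natural "
                   "distinction between sex and formal criteria.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

print(grammatical_gender.accepts("Morphosyntax", "neuter"))
print(grammatical_gender.accepts("French", "neuter"))
```

The point of the sketch is that one data category can carry several conceptual domains, so the same value may be valid in the general Morphosyntax domain and absent from the French one.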
  12. Standardization procedure
      [Flow diagram. Group labels: Submission group, Stewardship group, Thematic Domain Group, Data Category Registry Board, Decision Group. Steps: Evaluation and Validation (each with a "rejected" exit), then Publication.]
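The step labels on the standardization diagram (Submission, Evaluation, Validation, Publication, with two "rejected" exits) suggest a simple linear workflow. The sketch below is one hedged reading of that diagram, not the procedure as defined in ISO 12620:2009: the stage order, the assumption that rejection is possible only at Evaluation and Validation, and the names `Stage` and `advance` are all illustrative.

```python
from enum import Enum


class Stage(Enum):
    SUBMISSION = "submission"
    EVALUATION = "evaluation"
    VALIDATION = "validation"
    PUBLICATION = "publication"
    REJECTED = "rejected"


# Assumed order of the steps, read off the slide's labels.
NEXT_STAGE = {
    Stage.SUBMISSION: Stage.EVALUATION,
    Stage.EVALUATION: Stage.VALIDATION,
    Stage.VALIDATION: Stage.PUBLICATION,
}

# Assumption: the two "rejected" exits belong to Evaluation and Validation.
CAN_REJECT = {Stage.EVALUATION, Stage.VALIDATION}


def advance(stage: Stage, approved: bool = True) -> Stage:
    """Move a submitted data category to its next stage."""
    if stage not in NEXT_STAGE:
        raise ValueError("no transition out of %s" % stage.value)
    if not approved:
        if stage not in CAN_REJECT:
            raise ValueError("cannot reject at %s" % stage.value)
        return Stage.REJECTED
    return NEXT_STAGE[stage]


stage = Stage.SUBMISSION
while stage not in (Stage.PUBLICATION, Stage.REJECTED):
    stage = advance(stage)
print(stage.value)
```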
  13. Thematic Domain Groups
      TDG 1: Metadata
      TDG 2: Morphosyntax
      TDG 3: Semantic Content Representation
      TDG 4: Syntax
      TDG 6: Language Resource Ontology
      TDG 7: Lexicography
      TDG 8: Language Codes
      TDG 9: Terminology
      TDG 11: Multilingual Information Management
      TDG 12: Lexical Resources
      TDG 13: Lexical Semantics
      1. Translation
      2. (Sign language)
      • TDGs are the owners and guardians of a coherent subset of the DCR
      • TDGs own one or more profiles
      • Each TDG has a chair
      • A number of members assigned by SC P members
      • A number of expert members invited by the chair (up to 50%)
      • TDGs are constituted at the TC37/SC plenary
      • New TDGs need to be proposed by an SC
  14. ISOcat - the ISO TC 37/DCR
      • A (coherent) set of Data Categories, in our case for linguistic resources
      • A system to manage this set:
        – Create and edit Data Categories
        – Share Data Categories, e.g., resolve PID references
        – Standardize Data Categories
      • An API for tools to access the DCR
      • Grass roots approach
        – Anyone can access the DCR and use or create the data categories (s)he needs
  15. Referring to ISOcat data categories
      • PIDs of data categories can easily be embedded in XML documents
          <lmf:LexicalEntry>
            <tei:f name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-1345" fVal="commonNoun" dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
            <lmf:Lemma type="Form">
              <tei:f name="writtenForm" dcr:datcat="http://www.isocat.org/datcat/DC-1836" fVal="clergyman"/>
            </lmf:Lemma>
          </lmf:LexicalEntry>
      • Embedding in other formats is also possible, e.g., via comments
      • Preferably annotate schemas, so a whole range of resources is annotated in one go
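A tool that consumes such a document can collect the embedded PIDs with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. The slide omits the namespace declarations, so the namespace URIs used here are assumptions (the `lmf` one is purely hypothetical), and the function name `collect_datcat_refs` is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions: the slide's snippet does not declare them.
NS = {
    "lmf": "http://example.org/ns/lmf",    # hypothetical placeholder
    "tei": "http://www.tei-c.org/ns/1.0",
    "dcr": "http://www.isocat.org/ns/dcr",
}

SAMPLE = """\
<lmf:LexicalEntry xmlns:lmf="http://example.org/ns/lmf"
                  xmlns:tei="http://www.tei-c.org/ns/1.0"
                  xmlns:dcr="http://www.isocat.org/ns/dcr">
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
"""


def collect_datcat_refs(xml_text):
    """Return one dict per tei:f element with its data category references."""
    root = ET.fromstring(xml_text)
    datcat_attr = "{%s}datcat" % NS["dcr"]
    value_attr = "{%s}valueDatcat" % NS["dcr"]
    refs = []
    for feature in root.iter("{%s}f" % NS["tei"]):
        refs.append({
            "name": feature.get("name"),
            "value": feature.get("fVal"),
            "datcat": feature.get(datcat_attr),        # PID of the feature
            "value_datcat": feature.get(value_attr),   # PID of the value, if any
        })
    return refs


refs = collect_datcat_refs(SAMPLE)
for ref in refs:
    print(ref["name"], "->", ref["datcat"])
```

Because the PIDs are plain attribute values, the same extraction works for any schema that carries `dcr:datcat` annotations, which is what makes annotating the schema rather than each resource attractive.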
  16. A glimpse of ISOcat
  17. Collaboration in ISOcat
      • Registered users can contact each other via mediated email
        – Ask the owner if a data category can be adapted a little to your needs
      • Registered users can start up a group and invite other users to join
        – Work together on a set of data categories
        – Interact via a public and/or private forum
      • A group can submit data categories for ISO standardization
  18. Component MetaData Infrastructure
      • CMDI is developed by CLARIN and on its way to standardization by ISO TC 37
        – Limitations of existing metadata schemas: DC/OLAC, IMDI, TEI header
          • Inflexible: too many (IMDI) or too few (OLAC) metadata elements
          • Limited interoperability (both semantic and syntactic)
          • Problematic (unfamiliar) terminology for some sub-communities
          • Limited support for LT tool & services descriptions
        – The idea is to address this by:
          • Explicitly defined schema & semantics
          • User/project/community defined components
  19. CMDI architecture
      [Architecture diagram. Labels: ISOcat; component registry & editor; metadata modeler; metadata catalogue; metadata user; search & semantic mapping; Relation Registry; metadata editor; metadata creator; joint metadata repository; local metadata repository; metadata curator (twice); OAI-PMH (twice); service provider; data provider; DATA.]
  20. Athens Core
      • Bootstrapped the Metadata data categories selection in ISOcat
        – Based on existing metadata standards, e.g., DC, OLAC, IMDI, TEI
        – Many translations in European languages
      • Users add the data categories they need to the Metadata profile and use them in CMDI
  21. CMDI architecture
      [Architecture diagram, same labels as on slide 19.]
  22. CMDI architecture
      [Architecture diagram, as on slide 19, with the catalogue box labelled "metadata catalogues (VLO, MI)".]
  23. CMDI (intermediate) results
      • Diverse metadata profiles
        – Centers or projects create specific ones, but reuse components where possible
      • Shared and explicit semantics help to overcome
        – Terminological differences
        – Differences in structure
      • Future
        – Get more context sensitive
          • e.g. documentation language vs. speaker language
        – Crosswalks
          • equivalent metadata data categories are easily introduced due to the open nature of ISOcat
        – User specific relationships
          • e.g. theory specific differences can be more important to one user than another
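The crosswalk idea can be illustrated with two metadata profiles whose elements point at data categories. The following is a rough sketch under stated assumptions, not how CMDI search or the Relation Registry is implemented: both profiles, all element names, the `example.org` PIDs and the equivalence pair are invented, and equivalence is treated as a plain symmetric, transitive relation.

```python
from collections import defaultdict

# Hypothetical profiles: metadata element name -> data category PID.
PROFILE_A = {
    "Title": "http://example.org/datcat/DC-1",
    "Language": "http://example.org/datcat/DC-2",
}
PROFILE_B = {
    "ResourceName": "http://example.org/datcat/DC-1",
    "SubjectLanguage": "http://example.org/datcat/DC-3",
}

# Hypothetical equivalence between two data categories, as might be
# recorded when near-duplicate categories exist in an open registry.
EQUIVALENT = [
    ("http://example.org/datcat/DC-2", "http://example.org/datcat/DC-3"),
]


def build_canonical(pairs):
    """Return a function mapping each PID to one representative PID."""
    parent = {}

    def find(pid):
        parent.setdefault(pid, pid)
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]  # path halving
            pid = parent[pid]
        return pid

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left
    return find


def crosswalk(profile_a, profile_b, pairs):
    """Map each element of profile_a to the equivalent elements of profile_b."""
    find = build_canonical(pairs)
    by_concept = defaultdict(list)
    for name, pid in profile_b.items():
        by_concept[find(pid)].append(name)
    return {
        name: sorted(by_concept.get(find(pid), []))
        for name, pid in profile_a.items()
    }


mapping = crosswalk(PROFILE_A, PROFILE_B, EQUIVALENT)
print(mapping)
```

Elements that share a PID match directly; the equivalence pairs only come into play when different groups registered separate data categories for the same notion.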
  24. Metadata TDG
      • Standardization efforts of the Metadata TDG stalled
        – Large overlap with the work/people at the Athens-Core meetings
          • Community level agreement is maybe enough
        – Activity motivation should not depend on one person, the TDG chair, only
          • The need for explicit and shared semantics is not clear enough yet … more evangelization needed
        – Unfamiliarity with the work
          • Terminologists are more used to this kind of review work
          • Online review vs. old ISO ‘paper’ process
        – Members have little time, it is difficult to sync schedules
          • TDG experts tend to be senior scientists
          • Continuous process vs. sporadic bursts of activity
        – Unpaid work
          • Project funding vs. wide acceptance in the community
          • However, a project might bootstrap a thematic domain
      • The same problems hold for other TDGs
        – Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX
        – Redesign of the standardization process is coming up
          • ISO is not actively supporting Annex ST Standards as Databases anymore
  25. Community efforts
      • LMF-related: UBY, RELISH/GOLD
      • Sign Language
      • CLARIN
        – CMDI, Athens Core
        – CLARIN-NL/VL
          • Call 1 – 4 projects created CMDI and annotated resources/schemas
          • ISOcat content coordinator: Ineke Schuurman
            – Tutorials, guidelines (do’s and don’ts) and feedback
      • Better community support in ISOcat
        – Views, e.g., CLARIN-NL/VL
        – Recommended by, e.g., DC-4949
        – …
  26. Conclusions and future work
      • Communities can already create a coherent view on ISOcat
        – the CMDI use case shows potential
        – maybe funder support needed to bootstrap specific domains
      • The standardized core will take (a long) time
        – like all standardization work
      • Next to metadata also content
        – explicit semantics would be profitable even when not shared and/or used for resource discovery
        – tools that support ISOcat will make it easier to create such resources
      • Companion registries:
        – relations between data categories (RELcat)
        – annotated schemas for language resources (SCHEMAcat)
        – interaction with the CLARIN vocabulary service (CLAVAS)
      • Data categories vs. concepts
  27. Detour: ISOcat and LOD/Semantic Web
      • Archives and infrastructures look at the resources as they are, i.e., in general no conversions to triples
      • However, ISOcat data categories can easily be used in RDF resources
          :partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ;
                        rdfs:label "part of speech"@en ;
                        rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .
      • The Relation Registry, which is a triple store, will in general support lightweight, semi-formal ontologies
      M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.
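Annotations like the RDF snippet on this slide can be generated mechanically once the PID, label and comment of a data category are known. The sketch below builds the Turtle text with plain string formatting. It is illustrative only: the helper names are invented, and it assumes the `:`, `dcr:` and `rdfs:` prefixes are declared elsewhere in the target document (the slide does not show the declarations).

```python
def turtle_literal(text, lang):
    """Quote a string as a language-tagged Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"@%s' % (escaped, lang)


def annotate(local_name, pid, label, comment, lang="en"):
    """Render one resource annotated with a data category PID.

    Assumes the ':', 'dcr:' and 'rdfs:' prefixes are declared by the caller.
    """
    lines = [
        ":%s dcr:datcat <%s> ;" % (local_name, pid),
        "    rdfs:label %s ;" % turtle_literal(label, lang),
        "    rdfs:comment %s ." % turtle_literal(comment, lang),
    ]
    return "\n".join(lines)


snippet = annotate(
    "partOfSpeech",
    "http://www.isocat.org/datcat/DC-396",
    "part of speech",
    "A category assigned to a word based on its grammatical and semantic properties.",
)
print(snippet)
```

For anything beyond a sketch, an RDF library would be a safer way to serialize triples than hand-built strings.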
  28. Thank you for your attention!
      Visit www.isocat.org
      Questions? www.isocat.org/forum/ isocat@mpi.nl
      Acknowledgements: thanks to everyone at TLA, Sue Ellen Wright, Ineke Schuurman, Marc Kemps-Snijders, CLARIN-NL, CLARIN, ISO TC 37
  29. A whole litter of cats!
      [Diagram. Labels: Linguistic resource (schema); Linguistic knowledge base; Data categories; Containers; Concepts; Relation; Schema Registry - SCHEMAcat; Data Category Registry - ISOcat; Concept Registry; Relation Registry - RELcat.]
  30. ISO 11179: concepts vs. data elements/categories
      [Diagram. Label: ISO 12620 Data Categories.]
