ISOcat: an ISO 12620:2009 Data Category Registry


  1. ISOcat: An ISO 12620:2009 Data Category Registry
     Marc Kemps-Snijders (a), Menzo Windhouwer (a), Sue Ellen Wright (b)
     (a) Max Planck Institute for Psycholinguistics, (b) Kent State University
     marc.kemps-snijders@mpi.nl, menzo.windhouwer@mpi.nl, sellenwright@gmail.com
     13/7/2010, CLARA 2010 Summer School
  2. Outline
     ISO 12620:2009
     What are Data Categories?
     How can you use Data Categories?
     What is a Data Category Registry?
     How can you use a Data Category Registry?
     ISOcat
     Demonstration/Tutorial
     Future work
     Hands-on session
  3. ISO 12620:2009
     Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
     An ISO TC 37/SC 3 standard (see [1])
     Successor to ISO 12620:1999, which contained a hard-coded list of Data Categories
  4. What is a Data Category?
     The result of the specification of a given data field
     A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
     Specification consists of 3 main parts:
       Administrative part: administration and identification
       Descriptive part: documentation in various working languages
       Linguistic part: conceptual domain(s) for various object languages
  5. Data category example
     Data category: /Grammatical gender/
     Administrative part:
       Identifier: grammaticalGender
       PID: http://www.isocat.org/datcat/DC-1297
     Descriptive part:
       English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
       French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
     Linguistic part:
       Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
       French conceptual domain: /masculine/, /feminine/
  6. Data Category specification – Administrative part
  7. Data Category specification – Descriptive part
  8. Data Category specification – Linguistic part
  9. Mandatory parts of the specification
     For each data category:
       a mnemonic identifier
       an English definition
       an English name
     For complex data categories:
       a conceptual domain
     For standardization candidates:
       a profile
       a justification
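The checklist above can be sketched as a small completeness check. This is only an illustration: the `DataCategory` record and its field names are hypothetical and do not follow the ISOcat data model or DCIF, and the English name "grammatical gender" is assumed from the example slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical in-memory record; field names are illustrative only."""
    identifier: str = ""
    names: dict = field(default_factory=dict)        # language code -> name
    definitions: dict = field(default_factory=dict)  # language code -> definition
    is_complex: bool = False
    conceptual_domain: list = field(default_factory=list)
    profiles: list = field(default_factory=list)
    justification: str = ""


def missing_parts(dc, for_standardization=False):
    """Return the mandatory parts that are still empty, in checklist order."""
    missing = []
    if not dc.identifier:
        missing.append("identifier")
    if not dc.definitions.get("en"):
        missing.append("English definition")
    if not dc.names.get("en"):
        missing.append("English name")
    if dc.is_complex and not dc.conceptual_domain:
        missing.append("conceptual domain")
    if for_standardization:
        if not dc.profiles:
            missing.append("profile")
        if not dc.justification:
            missing.append("justification")
    return missing


gender = DataCategory(
    identifier="grammaticalGender",
    names={"en": "grammatical gender"},
    definitions={"en": "Category based on (depending on languages) the "
                       "natural distinction between sex and formal criteria."},
    is_complex=True,
    conceptual_domain=["masculine", "feminine", "neuter"],
)

print(missing_parts(gender))
print(missing_parts(gender, for_standardization=True))
```

In this sketch the record is complete for everyday use, but a standardization candidate would still need a profile and a justification.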
  10. Guidelines for the specification (see [2])
      Identifier:
        camel case and XML-valid element name (without a namespace)
        valid: partOfSpeech
        invalid: my:POS, 123POS
      Data Element Name:
        language-independent name for the data category used in a specific application domain (specified in the source)
        PoS in TBX
        NN in myTagset or N in yourTagset (if widely used)
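A minimal sketch of the identifier guideline, assuming a deliberately narrow reading: ASCII letters and digits only, lower camel case, no namespace prefix. The real XML name rules allow more characters, so the pattern and the function name `is_valid_identifier` are illustrative assumptions rather than the check ISOcat itself performs.

```python
import re

# Simplified approximation: starts with a lowercase letter, continues with
# letters/digits, and marks each further word with one uppercase letter.
# A colon (namespace prefix) or a leading digit cannot match.
CAMEL_CASE_NAME = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")


def is_valid_identifier(identifier):
    """Return True if the whole string fits the simplified pattern."""
    return CAMEL_CASE_NAME.fullmatch(identifier) is not None


for candidate in ["partOfSpeech", "my:POS", "123POS"]:
    print(candidate, is_valid_identifier(candidate))
```

The three candidates are the ones from the slide: the first passes, while the namespace prefix and the leading digit make the other two fail.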
  11. More guidelines
      Name Section in a Language Section:
        legible name
        ‘part of speech’ in the English language section
        ‘partie du discours’ in the French language section
      Definition:
        intensional definitions (ISO 704)
        should consist of a single sentence fragment
      Source:
        add a source for any quoted material
  12. More guidelines
      Justification:
        a simple statement justifying the relevance of the data category to the field of language resources
        especially needed for standardization
  13. Data Category types
      complex:
        open: writtenForm (string)
        constrained: email (string; constraint: .+@.+)
        closed: grammaticalGender (string; values: neuter, feminine, masculine)
      simple:
        neuter, feminine, masculine
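The three kinds of complex data category can be read as three kinds of value check. The sketch below assumes that reading; the helper names (`make_open`, `make_constrained`, `make_closed`) are hypothetical, and treating the constraint `.+@.+` as a regular expression that must match the whole value is also an assumption.

```python
import re


def make_open():
    """Open: any string value is acceptable."""
    def check(value):
        return isinstance(value, str)
    return check


def make_constrained(pattern):
    """Constrained: a string that must match the given pattern in full."""
    regex = re.compile(pattern)

    def check(value):
        return isinstance(value, str) and regex.fullmatch(value) is not None
    return check


def make_closed(simple_data_categories):
    """Closed: the value must be one of the listed simple data categories."""
    allowed = frozenset(simple_data_categories)

    def check(value):
        return value in allowed
    return check


written_form = make_open()
email = make_constrained(r".+@.+")
grammatical_gender = make_closed(["neuter", "feminine", "masculine"])

print(written_form("nihongo"))
print(email("isocat@mpi.nl"), email("isocat"))
print(grammatical_gender("neuter"), grammatical_gender("common"))
```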
  14. Data Category relationships
      Value domain membership
      Subsumption relationships between simple data categories
      Relationships between complex data categories are not stored in the DCR
      [Diagram: partOfSpeech (string) with the simple data categories pronoun and personal pronoun]
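A rough sketch of the two relationship kinds, using the slide's example. The identifiers `personalPronoun` and `pronoun`, the dictionaries, and the assumption that both values sit in the value domain of partOfSpeech are illustrative; the flattened diagram does not settle that detail.

```python
# Subsumption between simple data categories: more specific -> more general.
IS_A = {"personalPronoun": "pronoun"}

# Value domain membership: complex data category -> its simple data categories.
VALUE_DOMAINS = {"partOfSpeech": {"pronoun", "personalPronoun"}}


def is_subsumed_by(specific, general):
    """Return True if `general` is reachable from `specific` via IS_A links."""
    seen = set()
    current = specific
    while current in IS_A and current not in seen:
        seen.add(current)  # guards against accidental cycles
        current = IS_A[current]
        if current == general:
            return True
    return False


def values_matching(complex_dc, general):
    """Values of a complex data category that equal `general` or fall under it."""
    domain = VALUE_DOMAINS.get(complex_dc, set())
    return sorted(v for v in domain if v == general or is_subsumed_by(v, general))


print(is_subsumed_by("personalPronoun", "pronoun"))
print(values_matching("partOfSpeech", "pronoun"))
```

With the subsumption link in place, a query for pronouns can also pick up resources tagged with the more specific personal pronoun.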
  15. How can you use Data Categories?
      [Diagram: two schemas side by side. An LMF (ISO 24613:2008) compliant (schema for a) lexicon with Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form (multiplicities 1..* and 0..*), and a (schema for a) typological database. Both are annotated with data categories: partOfSpeech, writtenForm, grammaticalGender, lexicalType, wordOrder.]
      Shared semantics!
  16. How?
      <lmf:lexicon xml:lang="jp" alphabet="ipa">
        <lmf:entry>
          <lmf:lemma>
            <lmf:writtenForm>nihongo</…>
            …
          </…>
          …
        </…>
        …
      </…>
  17. Referencing Data Categories
      Each Data Category should be uniquely identifiable
      Ambiguity: different domains use the same term but mean different ‘things’
      Semantic rot: even in the same domain the meaning of a term changes over time
      Persistence: for archived resources, Data Category references should still be resolvable and point to the specification as it was at, or close to, the time of creation
      ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications
  18. Data Categories Persistent IDentifiers
      persistent identifier (PID):
        “unique Uniform Resource Identifier (URI) that ensures permanent access for a digital object by providing access to it independently of its physical location or current ownership” (see [1])
      For Data Categories this digital object is a specific version of a Data Category specification, i.e., each version of a Data Category has its own PID
  19. Where do you put these references?
      Preferably in a schema:
      <rng:attribute name="alphabet" dcr:datcat="http://www.isocat.org/datcat/…">
        <rng:value dcr:datcat="http://www.isocat.org/datcat/…">
          ipa
        </…>
        …
      </…>
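Once references sit in a schema, a tool can collect them. The sketch below is one possible way to do that with Python's standard library. The schema fragment is hypothetical (an attribute named `gender` pointing at the PID from the grammatical gender example), and the namespace URI bound to the `dcr` prefix is an assumption that should be checked against the DC Reference vocabulary (see [4]).

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dcr: prefix; verify against [4].
DCR_NS = "http://www.isocat.org/ns/dcr"

# Hypothetical Relax NG fragment; only the DC-1297 PID comes from the slides.
SCHEMA = """\
<rng:element name="form"
    xmlns:rng="http://relaxng.org/ns/structure/1.0"
    xmlns:dcr="http://www.isocat.org/ns/dcr">
  <rng:attribute name="gender"
      dcr:datcat="http://www.isocat.org/datcat/DC-1297">
    <rng:choice>
      <rng:value>masculine</rng:value>
      <rng:value>feminine</rng:value>
      <rng:value>neuter</rng:value>
    </rng:choice>
  </rng:attribute>
  <rng:attribute name="note"><rng:text/></rng:attribute>
</rng:element>
"""


def datcat_references(schema_xml):
    """Map each annotated schema node (by name or text) to its datcat PID."""
    root = ET.fromstring(schema_xml)
    references = {}
    for node in root.iter():
        pid = node.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            key = node.get("name") or (node.text or "").strip()
            references[key] = pid
    return references


print(datcat_references(SCHEMA))
```

Nodes without a `dcr:datcat` attribute, such as the `note` attribute here, are simply skipped.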
  20. ISO TC 37 standards using Data Categories
      Terminological Markup Framework (TMF; ISO 16642)
      Lexical Markup Framework (LMF; ISO 24613)
      TermBase eXchange (TBX; ISO 30042)
      Morpho-syntactic Annotation Framework (MAF; ISO 24611)
      Linguistic Annotation Framework (LAF; ISO 24612)
      These are meta models that can be instantiated into a specific model with data categories
      However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3])
  21. Other uses of Data Categories
      CLARIN Component Metadata Infrastructure (CMDI)
      ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents
        Including: XML Schema, Relax NG, TEI/ISO feature structures, …
      The references can be used in URI-based ‘mappings’:
        Including: ODD, RDF-based vocabularies (OWL, SKOS), …
  22. What is a Data Category Registry?
      A (coherent) set of Data Categories, in our case for linguistic resources
      A system to manage this set:
        Create and edit Data Categories
        Share Data Categories, e.g., resolve PID references
        Standardize Data Categories
  23. Standardize Data Categories
      [Diagram: standardization workflow. Groups involved: Submission group, Thematic Domain Group, Data Category Registry Board, Decision Group, Stewardship group. Stages: Validation, Evaluation, Publication; a submission can be rejected at two points.]
  24. Thematic Domain Groups
      TDG 1: Metadata
      TDG 2: Morphosyntax
      TDG 3: Semantic Content Representation
      TDG 4: Syntax
      TDG 5: Machine Readable Dictionary
      TDG 6: Language Resource Ontology
      TDG 7: Lexicography
      TDG 8: Language Codes
      TDG 9: Terminology
      TDG 11: Multilingual Information Management
      TDG 12: Lexical Resources
      TDG 13: Lexical Semantics
      TDG 14: Source Identification
      Translation
      Sign language
      Audio
      TDGs are the owners and guardians of a coherent subset of the DCR
      TDGs own one or more profiles
      Each TDG has a chair
      A number of judges (assigned by SC P-members)
      A number of expert members (up to 50%)
      TDGs are constituted at the TC 37/SC plenary
      New TDGs need to be proposed by an SC
  25. How can you use a Data Category Registry?
      You can:
        Find Data Categories relevant for your resources and embed references to them so the semantics of (parts of) your resources are made explicit
          This can be supported by tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor directly interact with ISOcat
        Interact with Data Category owners to improve (the coverage of) their Data Categories
        Create (together with others) new Data Categories needed for your resources and share those
        Submit (your) Data Categories for standardization
      Free of charge
  26. ISOcat
      Reference implementation of ISO 12620:2009
      The TC 37 Data Category Registry
  27. A glimpse of ISOcat
  28. Data Category Interchange Format (DCIF)
      Simplified XML serialization of the data model (see [4])
  29. RESTful Web Services
      read-only programming interface to the DCR (see [5])
      allows tools to interact with ISOcat to help a user embed PIDs in their resources
      mainly based on DCIF
      uses authentication to access private/shared Data Categories
      currently used by:
        LEXUS: populate an LMF model
        ELAN: create controlled vocabularies
        CMDI Component Editor: create concept links for component elements
  30. Persistent IDentifiers
      ISOcat uses ‘cool URIs’ as PIDs (see [6])
      these URIs will never change, but resolve to the current location in the current implementation, e.g., in ISOcat they resolve to a RESTful Web Service call
      the isocat.org domain is bound to ISO 12620:2009 and the Registration Authority, currently the MPI, is obliged to keep the PIDs associated with this domain resolvable
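Because the PID is a stable URI, a tool can treat it as an opaque key and only inspect it when needed. The sketch below assumes that ISOcat PIDs all follow the shape of the single example given earlier (`http://www.isocat.org/datcat/DC-1297`); that pattern and the function name `datcat_key` are assumptions, and version-specific PIDs may look different.

```python
import re

# Pattern inferred from one example PID only; treat it as an assumption.
PID_PATTERN = re.compile(r"http://www\.isocat\.org/datcat/(DC-\d+)")


def datcat_key(pid):
    """Return the trailing 'DC-<number>' part of a PID, or None if it does not fit."""
    match = PID_PATTERN.fullmatch(pid)
    return match.group(1) if match else None


print(datcat_key("http://www.isocat.org/datcat/DC-1297"))
print(datcat_key("http://example.org/datcat/DC-1297"))
```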
  31. Future work
      Finish first complete version of ISOcat:
        Standardization process
      Clean up the current set of Data Categories:
        TDGs clean up their profiles
        Standardize first sets of Data Categories
      Interaction with other TC 37 standards:
        Migration from ISO 12620:1999
        Full support for all types of Data Categories
  32. More future work
      Additional Data Category types:
        Container Data Categories
          Complex and Simple only cover ‘leaves’ and their values
        Data Category Concepts
          Basic building blocks for knowledge bases
      Relation Registries:
        Store (your) (semantic) relationships between Data Categories
  33. Registry network
      [Diagram: a network in three layers. Relation registries: Typological Database System RR, MPI RR. Data category registries: MPI DCR, ISO DCR. Linguistic resources: TDS database, resource, MPI archive.]
  34. Hands-on session
      Register with ISOcat
        http://www.isocat.org/
      Can you find Data Categories relevant to your type of resources?
        Use the search (options), the explorer, …
        Create and save a Data Category Selection
      Are any Data Categories missing?
        Create a new Data Category
      Do you want to share Data Categories/selections?
        Create a group together with some other students
        Share your selection with this group
      Is any functionality missing?
        Let us know
  35. Thank you for your attention!
      Visit: www.isocat.org
      Questions? www.isocat.org/forum/ or isocat@mpi.nl
  36. References
      [1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.
      [2] http://www.isocat.org/manual/DCRGuidelines.pdf
      [3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In Proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.
      [4] http://www.isocat.org/12620/
      [5] http://www.isocat.org/rest/help.html
      [6] Tim Berners-Lee, Cool URIs don't change, 1998.
