ISOcat: an ISO 12620:2009 Data Category Registry

  1. ISOcat: An ISO 12620:2009 Data Category Registry
     Marc Kemps-Snijders (a), Menzo Windhouwer (a), Sue Ellen Wright (b)
     (a) Max Planck Institute for Psycholinguistics, (b) Kent State University
     marc.kemps-snijders@mpi.nl, menzo.windhouwer@mpi.nl, sellenwright@gmail.com
     13/7/2010, CLARA 2010 Summer School
  2. Outline
     • ISO 12620:2009
     • What are Data Categories?
     • How can you use Data Categories?
     • What is a Data Category Registry?
     • How can you use a Data Category Registry?
     • ISOcat
     • Demonstration/Tutorial
     • Future work
     • Hands-on session
  3. ISO 12620:2009
     • Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
     • An ISO TC 37/SC 3 standard (see [1])
     • Successor to ISO 12620:1999, which contained a hard-coded list of Data Categories
  4. What is a Data Category?
     • The result of the specification of a given data field
     • A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
     • The specification consists of three main parts:
       • Administrative part: administration and identification
       • Descriptive part: documentation in various working languages
       • Linguistic part: conceptual domain(s) for various object languages
  5. Data category example
     Data category: /Grammatical gender/
     • Administrative part:
       • Identifier: grammaticalGender
       • PID: http://www.isocat.org/datcat/DC-1297
     • Descriptive part:
       • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
       • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
     • Linguistic part:
       • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
       • French conceptual domain: /masculine/, /feminine/
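The three-part structure of the /Grammatical gender/ example can be sketched as a small record type. This is only an illustration: the class name `DataCategory`, its field names, the language keys `en`/`fr`, and the `allowed_values` helper are assumptions made for this sketch and are not taken from ISO 12620:2009 or from ISOcat. The value /masculine/ follows the grammaticalGender value list given on the Data Category types slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Hypothetical in-memory model of a data category specification."""

    # Administrative part: administration and identification
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by working language
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or object language
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)

    def allowed_values(self, domain: str) -> List[str]:
        """Return the values of one conceptual domain, or [] if it is unknown."""
        return self.conceptual_domains.get(domain, [])


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)
```

Keeping the conceptual domains in a mapping makes it easy to ask which values apply to one object language, e.g. `grammatical_gender.allowed_values("French")`.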
  6. Data Category specification – Administrative part
  7. Data Category specification – Descriptive part
  8. Data Category specification – Linguistic part
  9. Mandatory parts of the specification
     • For each data category:
       • a mnemonic identifier
       • an English definition
       • an English name
     • For complex data categories:
       • a conceptual domain
     • For standardization candidates:
       • a profile
       • a justification
  10. Guidelines for the specification (see [2])
      • Identifier:
        • camel case and XML-valid element name (without a namespace)
        • valid: partOfSpeech
        • invalid: my:POS, 123POS
      • Data Element Name:
        • language-independent name for the data category used in a specific application domain (specified in the source)
        • PoS in TBX
        • NN in myTagset or N in yourTagset (if widely used)
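The identifier guideline (camel case, XML-valid element name, no namespace) can be checked mechanically. The sketch below is an approximation under stated assumptions: it only considers ASCII characters, although XML names allow many more, and it reads "camel case" as lower camel case, as in partOfSpeech. The function name `is_valid_identifier` is hypothetical.

```python
import re

# ASCII-only approximation of an XML name without a namespace prefix:
# a letter or underscore, followed by letters, digits, '.', '-' or '_'.
# A colon is not allowed, which rules out prefixed names such as my:POS.
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

# Lower camel case (assumed reading of the guideline): a lowercase letter
# followed by letters and digits only.
_CAMEL_CASE = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(name: str) -> bool:
    """Return True if name satisfies both approximated rules."""
    # The camel-case pattern is stricter than the XML-name pattern; both are
    # kept so that each rule from the guideline stays visible in the code.
    return (
        _XML_NAME.fullmatch(name) is not None
        and _CAMEL_CASE.fullmatch(name) is not None
    )


for candidate in ("partOfSpeech", "my:POS", "123POS"):
    print(candidate, is_valid_identifier(candidate))
```

Under these assumptions partOfSpeech passes, my:POS fails because of the namespace prefix, and 123POS fails because an XML name cannot start with a digit.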
  11. More guidelines
      • Name Section in a Language Section:
        • legible name
        • ‘part of speech’ in the English language section
        • ‘partie du discours’ in the French language section
      • Definition:
        • intensional definitions (ISO 704)
        • should consist of a single sentence fragment
      • Source:
        • add a source for any quoted material
  12. More guidelines
      • Justification:
        • a simple statement justifying the relevance of the data category to the field of language resources
        • especially needed for standardization
  13. Data Category types
      • complex:
        • open: writtenForm (string)
        • constrained: email (string; constraint: .+@.+)
        • closed: grammaticalGender (string; values: neuter, feminine, masculine)
      • simple: neuter, feminine, masculine (the values of a closed data category)
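How the three complex types differ shows up most clearly when validating a value. The following is a minimal sketch, not ISOcat's implementation: the class `ComplexDataCategory`, its `kind` field, and the `accepts` method are hypothetical names, and the constraint `.+@.+` is assumed to be a regular expression that must match the whole value.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ComplexDataCategory:
    """Hypothetical model of an open, constrained, or closed data category."""

    identifier: str
    kind: str                       # "open", "constrained" or "closed"
    constraint: Optional[str] = None  # regular expression, constrained only
    values: Tuple[str, ...] = ()      # simple data categories, closed only

    def accepts(self, value: str) -> bool:
        """Return True if value is allowed for this data category."""
        if self.kind == "open":
            # Open: any string is allowed.
            return True
        if self.kind == "constrained":
            # Constrained: the whole value must match the constraint.
            return re.fullmatch(self.constraint, value) is not None
        if self.kind == "closed":
            # Closed: the value must be one of the enumerated simple
            # data categories.
            return value in self.values
        raise ValueError(f"unknown kind: {self.kind}")


written_form = ComplexDataCategory("writtenForm", "open")
email = ComplexDataCategory("email", "constrained", constraint=r".+@.+")
gender = ComplexDataCategory(
    "grammaticalGender", "closed", values=("neuter", "feminine", "masculine")
)
```

For example, `email.accepts("isocat@mpi.nl")` holds while `email.accepts("isocat")` does not, and `gender.accepts("neuter")` holds because /neuter/ is in the closed value domain.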
  14. Data Category relationships
      • Value domain membership
      • Subsumption relationships between simple data categories
      • Relationships between complex data categories are not stored in the DCR
      • Example: partOfSpeech (string) has /pronoun/ in its value domain; /personal pronoun/ is subsumed by /pronoun/
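The two kinds of relationship named on this slide, value domain membership and subsumption, can be sketched with two small lookup tables built from the partOfSpeech, pronoun and personal pronoun example. The identifiers `personalPronoun`, `IS_A`, `VALUE_DOMAIN` and both functions are hypothetical. Treating a subsumed value as acceptable wherever its parent is in the value domain is an assumption of this sketch, not something the slide states.

```python
# Subsumption links between simple data categories: child -> parent.
IS_A = {"personalPronoun": "pronoun"}

# Value domain membership: complex data category -> simple data categories.
VALUE_DOMAIN = {"partOfSpeech": {"pronoun"}}


def ancestors(simple: str) -> list:
    """Return the chain of simple data categories that subsume `simple`."""
    chain = []
    while simple in IS_A:
        simple = IS_A[simple]
        chain.append(simple)
    return chain


def in_value_domain(complex_dc: str, simple: str) -> bool:
    """Check membership directly or via a subsuming simple data category.

    Accepting subsumed values is an assumption made for this sketch.
    """
    domain = VALUE_DOMAIN.get(complex_dc, set())
    return simple in domain or any(a in domain for a in ancestors(simple))
```

With these tables, `ancestors("personalPronoun")` yields the single parent `pronoun`, and `in_value_domain("partOfSpeech", "personalPronoun")` succeeds through that parent.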
  15. How can you use Data Categories?
      [Diagram: two resource schemas whose fields refer to Data Categories]
      • An LMF (ISO 24613:2008) compliant (schema for a) lexicon, with the classes Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form and the multiplicities 1..* and 0..*
      • A (schema for a) typological database
      • Data Categories shown: partOfSpeech, writtenForm, grammaticalGender, lexicalType, wordOrder
      • Shared semantics!
  16. How?
      <lmf:lexicon xml:lang="jp" alphabet="ipa">
        <lmf:entry>
          <lmf:lemma>
            <lmf:writtenForm>nihongo</…>
            …
          </…>
          …
        </…>
        …
      </…>
  17. Referencing Data Categories
      • Each Data Category should be uniquely identifiable
      • Ambiguity: different domains use the same term but mean different ‘things’
      • Semantic rot: even in the same domain the meaning of a term changes over time
      • Persistence: for archived resources, Data Category references should still be resolvable and point to the specification as it was at, or close to, the time of creation
      • ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications
  18. Data Categories Persistent IDentifiers
      • persistent identifier (PID): “unique Uniform Resource Identifier (URI) that ensures permanent access for a digital object by providing access to it independently of its physical location or current ownership” (see [1])
      • For Data Categories this digital object is a specific version of a Data Category specification, i.e., each version of a Data Category has its own PID
  19. Where do you put these references?
      Preferably in a schema:
      <rng:attribute name="alphabet" dcr:datcat="http://www.isocat.org/datcat/…">
        <rng:value dcr:datcat="http://www.isocat.org/datcat/…">
          ipa
        </…>
        …
      </…>
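Once references are embedded in a schema, a tool can collect them with any namespace-aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`. Several things are assumptions: the `dcr` namespace URI `http://www.isocat.org/ns/dcr` is given from memory and should be checked against [4], the attribute name `gender` and the function `datcat_references` are hypothetical, and the sample schema is a made-up fragment that reuses the grammaticalGender PID from the earlier example rather than the elided PIDs on this slide.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
# Assumed namespace URI for the DC Reference vocabulary; verify against [4].
DCR_NS = "http://www.isocat.org/ns/dcr"

# Hypothetical Relax NG fragment with one embedded Data Category reference.
SCHEMA = f"""
<rng:attribute xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}"
               name="gender"
               dcr:datcat="http://www.isocat.org/datcat/DC-1297">
  <rng:text/>
</rng:attribute>
"""


def datcat_references(schema_xml: str) -> dict:
    """Map each annotated schema node to the PID in its dcr:datcat attribute."""
    root = ET.fromstring(schema_xml)
    refs = {}
    for elem in root.iter():
        # ElementTree exposes namespaced attributes as "{namespace}local".
        pid = elem.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            # Use the name attribute if present, else the element's text
            # (as for an rng:value element).
            label = elem.get("name") or (elem.text or "").strip()
            refs[label] = pid
    return refs


print(datcat_references(SCHEMA))
```

Putting the reference in the schema rather than in every instance document means each field is annotated once and all documents that follow the schema share the same semantics.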
  20. ISO TC 37 standards using Data Categories
      • Terminological Markup Framework (TMF; ISO 16642)
      • Lexical Markup Framework (LMF; ISO 24613)
      • TermBase eXchange (TBX; ISO 30042)
      • Morpho-syntactic Annotation Framework (MAF; ISO 24611)
      • Linguistic Annotation Framework (LAF; ISO 24612)
      • Meta models which can be instantiated into a specific model with data categories
      • However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3])
  21. Other uses of Data Categories
      • CLARIN Component Metadata Infrastructure (CMDI)
      • ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents
        • Including: XML Schema, Relax NG, TEI/ISO feature structures, …
      • The references can be used in URI based ‘mappings’:
        • Including: ODD, RDF-based vocabularies (OWL, SKOS), …
  22. What is a Data Category Registry?
      • A (coherent) set of Data Categories, in our case for linguistic resources
      • A system to manage this set:
        • Create and edit Data Categories
        • Share Data Categories, e.g., resolve PID references
        • Standardize Data Categories
  23. Standardize Data Categories
      [Workflow diagram]
      • Groups: Submission group, Stewardship group, Thematic Domain Group, Decision Group, Data Category Registry Board
      • Stages: Validation, Evaluation, Publication, with two ‘rejected’ exits
  24. Thematic Domain Groups
      • TDG 1: Metadata
      • TDG 2: Morphosyntax
      • TDG 3: Semantic Content Representation
      • TDG 4: Syntax
      • TDG 5: Machine Readable Dictionary
      • TDG 6: Language Resource Ontology
      • TDG 7: Lexicography
      • TDG 8: Language Codes
      • TDG 9: Terminology
      • TDG 11: Multilingual Information Management
      • TDG 12: Lexical Resources
      • TDG 13: Lexical Semantics
      • TDG 14: Source Identification
      • Translation
      • Sign language
      • Audio
      About TDGs:
      • TDGs are the owners and guardians of a coherent subset of the DCR
      • TDGs own one or more profiles
      • Each TDG has a chair
      • A number of judges (assigned by SC P-members)
      • A number of expert members (up to 50%)
      • TDGs are constituted at the TC37/SC plenary
      • New TDGs need to be proposed by an SC
  25. How can you use a Data Category Registry?
      You can:
      • Find Data Categories relevant for your resources and embed references to them so the semantics of (parts of) your resources are made explicit
        • This can be supported by the tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor interact directly with ISOcat
      • Interact with Data Category owners to improve (the coverage of) their Data Categories
      • Create (together with others) new Data Categories needed for your resources and share those
      • Submit (your) Data Categories for standardization
      Free of charge
  26. ISOcat
      • Reference implementation of ISO 12620:2009
      • The TC 37 Data Category Registry
  27. A glimpse of ISOcat
  28. Data Category Interchange Format (DCIF)
      • Simplified XML serialization of the data model (see [4])
  29. RESTful Web Services
      • read-only programming interface to the DCR (see [5])
      • allows tools to interact with ISOcat to help a user embed PIDs in their resources
      • mainly based on DCIF
      • uses authentication to access private/shared Data Categories
      • currently used by:
        • LEXUS: populate an LMF model
        • ELAN: create controlled vocabularies
        • CMDI Component Editor: create concept links for component elements
  30. Persistent IDentifiers
      • ISOcat uses ‘cool URIs’ as PIDs (see [6])
      • these URIs will never change, but resolve to the current location in the current implementation, e.g., in ISOcat they resolve to a RESTful Web Service call
      • the isocat.org domain is bound to ISO 12620:2009 and the Registration Authority, currently the MPI, is obliged to keep the PIDs associated with this domain resolvable
  31. Future work
      • Finish first complete version of ISOcat:
        • Standardization process
      • Cleanup of the current set of Data Categories:
        • TDGs clean up their profiles
        • Standardize first sets of Data Categories
      • Interaction with other TC 37 standards:
        • Migration from ISO 12620:1999
        • Full support for all types of Data Categories
  32. More future work
      • Additional Data Category types:
        • Container Data Categories
          • Complex and Simple only cover ‘leaves’ and their values
        • Data Category Concepts
          • Basic building blocks for knowledge bases
      • Relation Registries
        • Store (your) (semantic) relationships between Data Categories
  33. Registry network
      [Diagram with three layers]
      • Relation registries: Typological Database System RR, MPI RR
      • Data category registries: MPI DCR, ISO DCR
      • Linguistic resources: TDS database, resource, MPI archive
  34. Hands-on session
      • Register with ISOcat
        • http://www.isocat.org/
      • Can you find Data Categories relevant to your type of resources?
        • Use the search (options), the explorer, …
        • Create and save a Data Category Selection
      • Are any Data Categories missing?
        • Create a new Data Category
      • Do you want to share Data Categories/selections?
        • Create a group together with some other students
        • Share your selection with this group
      • Is any functionality missing?
        • Let us know
  35. Thank you for your attention!
      • Visit: www.isocat.org
      • Questions? www.isocat.org/forum/ or isocat@mpi.nl
  36. References
      [1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.
      [2] http://www.isocat.org/manual/DCRGuidelines.pdf
      [3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In Proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.
      [4] http://www.isocat.org/12620/
      [5] http://www.isocat.org/rest/help.html
      [6] Tim Berners-Lee, Cool URIs don't change, 1998.
