ISOcat to LMF to TEI

Tightening the representation of lexical data, a TEI perspective (TEI 2011 workshop), 12 October 2011, Würzburg, Germany

  1. ISOcat -> LMF -> TEI (Dictionaries)
     Menzo Windhouwer
     The Language Archive – MPI-PL
     Menzo.Windhouwer@mpi.nl
     12 October 2011, TEI Lexical workshop - Würzburg, Germany
  2. Outline
     • Introduction to ISOcat, an ISO 12620:2009 compliant Data Category Registry (DCR)
     • ISOcat and the Lexical Markup Framework (LMF; ISO 24613:2008)
     • ISOcat and TEI (Dictionaries)
  3. ISO 12620:2009
     • Terminology and other content and language resources - Specification of data categories and management of a Data Category Registry for language resources
     • An ISO TC 37/SC 3 standard
     • Replaces ISO 12620:1999, a hardcoded list of Data Categories, with a registry for (standardized) Data Categories
  4. What is a Data Category?
     • The result of the specification of a given data field
     • A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
     • Specification consists of 3 main parts:
       • Administrative part
         • Administration and identification
       • Descriptive part
         • Documentation in various working languages
       • Linguistic part
         • Conceptual domain(s) for various object languages
  5. Data category example
     • Data category: /grammatical gender/
     • Administrative part:
       • Identifier: grammaticalGender
       • PID: http://www.isocat.org/datcat/DC-1297
     • Descriptive part:
       • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
       • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
     • Linguistic part:
       • Morphosyntax conceptual domain: /male/, /feminine/, /neuter/
       • French conceptual domain: /male/, /feminine/
  6. What is a Data Category Registry? (www.isocat.org)
     • A (coherent) set of Data Categories, in our case for linguistic resources
     • A system to manage this set:
       • Create and edit Data Categories
       • Share Data Categories, e.g., resolve PID references
       • Standardize Data Categories
     • Grass roots approach
  7. ISOcat and LMF
     • §4.4 ISO 12620 Data Category Registry (DCR)
       "The designers of an LMF conformant lexicon shall use data categories from the ISO 12620 Data Category Registry (DCR) located at www.isocat.org."
     • §5.4 LMF data category selection procedures
       • Create a Data Category Selection
       • Add Data Categories to ISOcat if needed
     • Missing: how to refer to ISOcat Data Categories?
  8. Data Category identifiers are ambiguous
       …
       <LexicalEntry>
         <feat att="partOfSpeech" val="commonNoun"/>
         …
     ISOcat contains two exact matches for "commonNoun" and one close match.
  9. Why are identifiers ambiguous?
     • Several thematic domains can use the same name for a (slightly) different Data Category
     • This was already true in the predecessor of ISOcat, SYNTAX (legacy)
     • There may be multiple versions of the same Data Category
     • Due to semantic drift or rot, the name cannot simply point to the latest version
     • Users can also create Data Categories with the same name
     • In the future they may even copy a Data Category to extend its conceptual domain
     • The identifier should have been renamed, e.g., to mnemonic
  10. ISOcat Data Category PIDs are unique
      • Each ISOcat Data Category (version) has a unique PID
        • http://www.isocat.org/datcat/DC-1256
        • /common noun/ by Gil Francopoulo
      • ISO 12620:2009 Annex A provides a small vocabulary to annotate an XML document with Data Category PID references:
          <feat
            att="partOfSpeech"
            dcr:datcat="http://www.isocat.org/datcat/DC-1345"
            val="commonNoun"
            dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"
          />
      • Preferably annotate the schema of the resource
  11. TEI feature structures
        <tei:f
          name="partOfSpeech"
          dcr:datcat="http://www.isocat.org/datcat/DC-1345"
          fVal="commonNoun"
          dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"
        />
  12. TEI feature structure declarations
        <tei:fDecl
          name="partOfSpeech"
          dcr:datcat="http://www.isocat.org/datcat/DC-1345">
          <tei:vRange>
            <tei:vAlt>
              <tei:symbol
                value="commonNoun"
                dcr:datcat="http://www.isocat.org/datcat/DC-1256"/>
              …
  13. TEI and ISOcat Data Category PIDs
      • Is TEI open to attributes from foreign namespaces?
        • dcr:* attributes can already be used
      • Or can the dcr:* attributes be part of the global attribute list?
        • This would make it possible to annotate any TEI element, incl. Dictionary elements, with a Data Category reference
        • The DCR data model now also includes container Data Categories and can thus also cover inner nodes
        • Could also (partially?) be done by <equiv/> statements in the ODD files
        • Scripts to do this (semi-)automatically have already been created
      • Or can at least the TEI/ISO feature structure part accept dcr:* attributes?
        • Add a DCR specific attribute list?
        • Would make the ISO TC 37 standards ISO 24610-1, ISO 24613:2008 and ISO 12620:2009 consistent
      • Could also be another TEI attribute that expresses equivalence with an external (URI) specification (like <equiv/> in ODD) and that is not as tightly bound to ISOcat as the dcr:* attributes imply
  14. Thank you for your attention!
      • Visit www.isocat.org
      • Questions? Menzo.Windhouwer@mpi.nl
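
The PID annotation shown on the "ISOcat Data Category PIDs are unique" slide can be read back mechanically, which is what makes the references unambiguous where the bare identifiers are not. The following is a minimal sketch in Python, not part of the original talk. It assumes the `dcr` prefix is bound to `http://www.isocat.org/ns/dcr`; the slides never show the namespace declaration, so that URI should be verified against ISO 12620:2009 Annex A. The helper name `read_datcat_refs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the slides do not show the xmlns:dcr declaration. This URI is
# believed to be the one used by ISO 12620:2009 Annex A; verify before reuse.
DCR_NS = "http://www.isocat.org/ns/dcr"

# The <feat> element from the slides, wrapped in a LexicalEntry, with straight
# quotes and an explicit namespace declaration added so that it parses.
SAMPLE = f"""
<LexicalEntry xmlns:dcr="{DCR_NS}">
  <feat att="partOfSpeech"
        dcr:datcat="http://www.isocat.org/datcat/DC-1345"
        val="commonNoun"
        dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
</LexicalEntry>
"""


def read_datcat_refs(xml_text):
    """Return one dict per <feat>, pairing each identifier with its PID.

    A PID entry is None when the corresponding dcr:* attribute is absent,
    i.e. when only the ambiguous identifier is available.
    """
    root = ET.fromstring(xml_text)
    refs = []
    for feat in root.iter("feat"):
        refs.append({
            "att": feat.get("att"),
            "att_pid": feat.get(f"{{{DCR_NS}}}datcat"),
            "val": feat.get("val"),
            "val_pid": feat.get(f"{{{DCR_NS}}}valueDatcat"),
        })
    return refs


if __name__ == "__main__":
    for ref in read_datcat_refs(SAMPLE):
        print(ref["att"], "->", ref["att_pid"])
        print(ref["val"], "->", ref["val_pid"])
```

An unannotated `<feat att="partOfSpeech" val="commonNoun"/>` yields `None` for both PIDs, leaving a consumer with only the identifier, which is the ambiguity the slides describe.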
