ISOcat: An ISO 12620:2009 Data Category Registry
Marc Kemps-Snijders (a), Menzo Windhouwer (a), Sue Ellen Wright (b)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
marc.kemps-snijders@mpi.nl, menzo.windhouwer@mpi.nl, sellenwright@gmail.com
CLARA 2010 Summer School, 13/7/2010

Outline
- ISO 12620:2009
- What are Data Categories?
- How can you use Data Categories?
- What is a Data Category Registry?
- How can you use a Data Category Registry?
- ISOcat
- Demonstration/Tutorial
- Future work
- Hands-on session

ISO 12620:2009
- Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
- An ISO TC 37/SC 3 standard (see [1])
- Successor to ISO 12620:1999, which contained a hardcoded list of Data Categories

What is a Data Category?
- The result of the specification of a given data field
- A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
- The specification consists of 3 main parts:
  - Administrative part: administration and identification
  - Descriptive part: documentation in various working languages
  - Linguistic part: conceptual domain(s) for various object languages

Data category example
Data category: /Grammatical gender/
- Administrative part:
  - Identifier: grammaticalGender
  - PID: http://www.isocat.org/datcat/DC-1297
- Descriptive part:
  - English definition: Category based (depending on the language) on the natural distinction between the sexes or on formal criteria.
  - French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
- Linguistic part:
  - Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
  - French conceptual domain: /masculine/, /feminine/

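The three-part structure of this example can be sketched as a small data structure. This is only an illustration: the class and field names are assumptions of this sketch, not the ISOcat data model, and the value names follow the closed domain shown on the "Data Category types" slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical container mirroring the three parts of a specification."""
    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: working language -> definition
    definitions: dict[str, str] = field(default_factory=dict)
    # Linguistic part: profile or object language -> permitted values
    conceptual_domains: dict[str, list[str]] = field(default_factory=dict)


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based (depending on the language) on the natural "
              "distinction between the sexes or on formal criteria.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

# The French conceptual domain is narrower than the profile-wide one.
print(set(grammatical_gender.conceptual_domains["French"])
      <= set(grammatical_gender.conceptual_domains["Morphosyntax"]))
```
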
Data Category specification – Administrative part
Data Category specification – Descriptive part
Data Category specification – Linguistic part

Mandatory parts of the specification
- For each data category:
  - a mnemonic identifier
  - an English definition
  - an English name
- For complex data categories:
  - a conceptual domain
- For standardization candidates:
  - a profile
  - a justification

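As a rough sketch, the mandatory parts listed above could be checked like this. The dictionary keys and the function name are hypothetical and chosen only for this illustration; ISOcat itself may validate specifications differently.

```python
def missing_mandatory_parts(dc: dict, *, is_complex: bool = False,
                            is_candidate: bool = False) -> list[str]:
    """Return the mandatory parts that are absent or empty in `dc`."""
    required = ["identifier", "english_definition", "english_name"]
    if is_complex:
        # Complex data categories also need a conceptual domain.
        required.append("conceptual_domain")
    if is_candidate:
        # Standardization candidates also need a profile and a justification.
        required += ["profile", "justification"]
    return [key for key in required if not dc.get(key)]


draft = {"identifier": "partOfSpeech", "english_name": "part of speech"}
print(missing_mandatory_parts(draft, is_complex=True))
```
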
Guidelines for the specification (see [2])
- Identifier: camel case and an XML-valid element name (without a namespace)
  - valid: partOfSpeech
  - not valid: my:POS, 123POS
- Data Element Name: language-independent name for the data category as used in a specific application domain (specified in the source)
  - PoS in TBX
  - NN in myTagset or N in yourTagset (if widely used)

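The identifier guideline can be approximated with a regular expression. This is a minimal sketch: the pattern is an ASCII-only, lower-camel-case approximation and is deliberately stricter than a full XML name check (which also allows characters such as `_`, `-`, `.` and non-ASCII letters).

```python
import re

# Starts with a lower-case letter, then only ASCII letters and digits:
# no colon (so no namespace prefix) and no leading digit.
_IDENTIFIER = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(name: str) -> bool:
    """Check a candidate identifier against the approximated guideline."""
    return _IDENTIFIER.fullmatch(name) is not None


for candidate in ("partOfSpeech", "my:POS", "123POS"):
    print(candidate, is_valid_identifier(candidate))
```
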
More guidelines
- Name Section in a Language Section: a legible name
  - ‘part of speech’ in the English language section
  - ‘partie du discours’ in the French language section
- Definition: intensional definitions (ISO 704) should consist of a single sentence fragment
- Source: add a source for any quoted material

More guidelines
- Justification: a simple statement justifying the relevance of the data category to the field of language resources
  - especially needed for standardization

Data Category types
- complex:
  - open: writtenForm (string)
  - constrained: email (string; constraint: .+@.+)
  - closed: grammaticalGender (string; values: neuter, feminine, masculine)
- simple: neuter, feminine, masculine

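The difference between the three kinds of complex data category can be illustrated by how each would accept a value. The dictionary layout below is an assumption of this sketch; only the names, the constraint `.+@.+` and the gender values come from the slide.

```python
import re

# Hypothetical in-memory description of the three complex data categories.
DOMAINS = {
    "writtenForm": {"type": "open"},
    "email": {"type": "constrained", "constraint": r".+@.+"},
    "grammaticalGender": {"type": "closed",
                          "values": {"neuter", "feminine", "masculine"}},
}


def accepts(data_category: str, value: str) -> bool:
    """Return True if `value` fits the conceptual domain of `data_category`."""
    domain = DOMAINS[data_category]
    if domain["type"] == "open":
        return True  # any string is allowed
    if domain["type"] == "constrained":
        return re.fullmatch(domain["constraint"], value) is not None
    if domain["type"] == "closed":
        return value in domain["values"]  # must be one of the simple DCs
    raise ValueError(f"unknown domain type: {domain['type']}")


print(accepts("email", "isocat@mpi.nl"), accepts("grammaticalGender", "common"))
```
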
Data Category relationships
- Value domain membership
- Subsumption relationships between simple data categories
- Relationships between complex data categories are not stored in the DCR
- Example (figure): partOfSpeech (string) with the value pronoun; personal pronoun is subsumed by pronoun

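A subsumption relationship between simple data categories can be pictured as a chain of "broader" links. The sketch below assumes a single broader category per entry and uses hypothetical identifiers (`personalPronoun`, `pronoun`) based on the names in the figure.

```python
# narrower simple data category -> broader simple data category (hypothetical)
BROADER = {"personalPronoun": "pronoun"}


def is_subsumed_by(narrower: str, broader: str) -> bool:
    """Follow the broader links upwards and report whether `broader` is reached."""
    current = narrower
    while current in BROADER:
        current = BROADER[current]
        if current == broader:
            return True
    return False


print(is_subsumed_by("personalPronoun", "pronoun"))
```
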
How can you use Data Categories?
- An LMF (ISO 24613:2008) compliant (schema for a) lexicon
- A (schema for a) typological database
- Shared semantics!
[Figure: UML class diagram of the lexicon (Lexicon, Lexical Entry, Form, Sense, Lemma, Word Form) annotated with partOfSpeech, writtenForm and grammaticalGender, next to the typological database annotated with lexicalType, grammaticalGender and wordOrder]

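The "shared semantics" idea can be sketched as two resources whose local field names differ but point at the same Data Category PID. The field names and the `None` placeholders (fields without a reference in this sketch) are hypothetical; only the PID for grammatical gender is taken from the slides.

```python
GENDER_PID = "http://www.isocat.org/datcat/DC-1297"

# Local field name -> Data Category PID (None: no reference in this sketch).
lexicon_fields = {"grammaticalGender": GENDER_PID, "writtenForm": None}
typology_fields = {"gender": GENDER_PID, "wordOrder": None}


def shared_semantics(a: dict, b: dict) -> list[tuple[str, str, str]]:
    """Return (field_in_a, field_in_b, pid) for fields referencing the same PID."""
    return [
        (field_a, field_b, pid_a)
        for field_a, pid_a in a.items()
        for field_b, pid_b in b.items()
        if pid_a is not None and pid_a == pid_b
    ]


print(shared_semantics(lexicon_fields, typology_fields))
```
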
How?
<lmf:lexicon xml:lang="jp" alphabet="ipa">
	<lmf:entry>
		<lmf:lemma>
			<lmf:writtenForm>nihongo</…>
			…
		</…>
		…
	</…>
	…
</…>

Referencing Data Categories
- Each Data Category should be uniquely identifiable
- Ambiguity: different domains use the same term but mean different ‘things’
- Semantic rot: even in the same domain the meaning of a term changes over time
- Persistence: for archived resources, Data Category references should still be resolvable and point to the specification as it was at (or close to) the time of creation
- ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications

Data Categories Persistent IDentifiers
- persistent identifier (PID): “unique Uniform Resource Identifier (URI) that ensures permanent access for a digital object by providing access to it independently of its physical location or current ownership” (see [1])
- For Data Categories this digital object is a specific version of a Data Category specification, i.e., each version of a Data Category has its own PID

Where do you put these references?
Preferably in a schema:
<rng:attribute name="alphabet"
	dcr:datcat="http://www.isocat.org/datcat/…">
	<rng:value dcr:datcat="http://www.isocat.org/datcat/…">
		ipa
	</…>
	…
</…>

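A tool could collect such references from a schema roughly as follows. This is a sketch under stated assumptions: the `dcr` namespace URI used here is assumed, the schema fragment is made up for the illustration, and the PID is reused from the grammatical gender example rather than being the one elided on the slide.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dcr:datcat attribute (not given on the slide).
DCR_NS = "http://www.isocat.org/ns/dcr"

# Hypothetical Relax NG fragment written for this sketch.
SCHEMA = """
<rng:attribute xmlns:rng="http://relaxng.org/ns/structure/1.0"
               xmlns:dcr="http://www.isocat.org/ns/dcr"
               name="grammaticalGender"
               dcr:datcat="http://www.isocat.org/datcat/DC-1297">
  <rng:choice>
    <rng:value>feminine</rng:value>
    <rng:value>neuter</rng:value>
  </rng:choice>
</rng:attribute>
"""


def datcat_references(xml_text: str) -> list[tuple[str, str]]:
    """Return (name-or-text, PID) for every element carrying dcr:datcat."""
    root = ET.fromstring(xml_text)
    references = []
    for element in root.iter():
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            label = element.get("name") or (element.text or "").strip()
            references.append((label, pid))
    return references


print(datcat_references(SCHEMA))
```
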
ISO TC 37 standards using Data Categories
- Terminological Markup Framework (TMF; ISO 16642)
- Lexical Markup Framework (LMF; ISO 24613)
- TermBase eXchange (TBX; ISO 30042)
- Morpho-syntactic Annotation Framework (MAF; ISO 24611)
- Linguistic Annotation Framework (LAF; ISO 24612)
- These are meta models which can be instantiated into a specific model with data categories
- However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3])

Other uses of Data Categories
- CLARIN Component Metadata Infrastructure (CMDI)
- ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), with elements and attributes to embed Data Category references in arbitrary XML documents
  - Including: XML Schema, Relax NG, TEI/ISO feature structures, …
- The references can be used in URI-based ‘mappings’
  - Including: ODD, RDF-based vocabularies (OWL, SKOS), …

What is a Data Category Registry?
- A (coherent) set of Data Categories, in our case for linguistic resources
- A system to manage this set:
  - Create and edit Data Categories
  - Share Data Categories, e.g., resolve PID references
  - Standardize Data Categories

Standardize Data Categories
[Figure: standardization workflow involving the Submission group, the Data Category Registry Board, the Thematic Domain Group, the Stewardship group and the Decision Group, with the stages Validation and Evaluation (each of which can end in rejection) and Publication]

Thematic Domain Groups
- TDG 1: Metadata
- TDG 2: Morphosyntax
- TDG 3: Semantic Content Representation
- TDG 4: Syntax
- TDG 5: Machine Readable Dictionary
- TDG 6: Language Resource Ontology
- TDG 7: Lexicography
- TDG 8: Language Codes
- TDG 9: Terminology
- TDG 11: Multilingual Information Management
- TDG 12: Lexical Resources
- TDG 13: Lexical Semantics
- TDG 14: Source Identification
- Translation, Sign language, Audio
- TDGs own one or more profiles
- Each TDG has a chair
- A number of judges (assigned by SC P members)
- A number of expert members (up to 50%)
- TDGs are constituted at the TC37/SC plenary
- New TDGs need to be proposed by an SC

How can you use a Data Category Registry?
You can:
- Find Data Categories relevant for your resources and embed references to them, so the semantics of (parts of) your resources are made explicit
  - The tools you use can support this: e.g., ELAN, LEXUS and the CMDI Component Editor interact directly with ISOcat
- Interact with Data Category owners to improve (the coverage of) their Data Categories
- Create (together with others) new Data Categories needed for your resources and share those
- Submit (your) Data Categories for standardization
Free of charge

ISOcat
- Reference implementation of ISO 12620:2009
- The TC 37 Data Category Registry

A glimpse of ISOcat

Data Category Interchange Format (DCIF)
- Simplified XML serialization of the data model (see [4])

RESTful Web Services
- read-only programming interface to the DCR (see [5])
- allows tools to interact with ISOcat to help a user embed PIDs in their resources
- mainly based on DCIF
- uses authentication to access private/shared Data Categories
- currently used by:
  - LEXUS: populate an LMF model
  - ELAN: create controlled vocabularies
  - CMDI Component Editor: create concept links for component elements

Persistent IDentifiers
- ISOcat uses ‘cool URIs’ as PIDs (see [6])
- these URIs will never change, but resolve to the current location in the current implementation, e.g., in ISOcat they resolve to a RESTful Web Service call
- the isocat.org domain is bound to ISO 12620:2009, and the Registration Authority, currently the MPI, is obliged to keep the PIDs associated with this domain resolvable

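Because the PID is a stable URI, a tool can treat it as an opaque key or, if needed, take it apart. The sketch below assumes that every PID follows the shape of the single example in these slides (`http://www.isocat.org/datcat/DC-<number>`); that pattern is an inference of this sketch, not a documented rule.

```python
import re

# Pattern inferred from the one PID shown in the slides (an assumption).
_PID = re.compile(r"http://www\.isocat\.org/datcat/DC-(\d+)")


def data_category_number(pid: str) -> int:
    """Extract the numeric part of an ISOcat-style PID, or raise ValueError."""
    match = _PID.fullmatch(pid)
    if match is None:
        raise ValueError(f"not an ISOcat-style PID: {pid!r}")
    return int(match.group(1))


print(data_category_number("http://www.isocat.org/datcat/DC-1297"))
```
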

Future work
- Finish the first complete version of ISOcat:
  - Standardization process
- Clean up the current set of Data Categories
  - TDGs clean up their profiles
  - Standardize first sets of Data Categories
- Interaction with other TC 37 standards:
  - Migration from ISO 12620:1999
  - Full support for all types of Data Categories

More future work
- Additional Data Category types
  - Container Data Categories: Complex and Simple only cover ‘leaves’ and their values
  - Data Category Concepts: basic building blocks for knowledge bases
- Relation Registries
  - Store (your) (semantic) relationships between Data Categories

Registry network
[Figure: three layers. Relation registries: Typological Database System RR, MPI RR. Data category registries: MPI DCR, ISO DCR. Linguistic resources: TDS database resource, MPI archive]

Hands-on session
- Register with ISOcat: http://www.isocat.org/
- Can you find Data Categories relevant to your type of resources?
  - Use the search (options), the explorer, …
  - Create and save a Data Category Selection
- Do you miss Data Categories?
  - Create a new Data Category
- Do you want to share Data Categories/selections?
  - Create a group together with some students
  - Share your selection with this group
- Do you miss functionality?
  - Let us know

Thank you for your attention!
Visit www.isocat.org
Questions? www.isocat.org/forum/ or isocat@mpi.nl

References
[1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.
[2] http://www.isocat.org/manual/DCRGuidelines.pdf
[3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In Proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.
[4] http://www.isocat.org/12620/
[5] http://www.isocat.org/rest/help.html
[6] Tim Berners-Lee, Cool URIs don't change, 1998.