AIMS
Is ISO 639 enough for a multilingual thesaurus? The AGROVOC case
Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran, Johannes Keizer
Food and Agriculture Organization of the UN
AOS 2012, Sept 4, 2012 - Kuching (MY)
Background
• AGROVOC is published in 21 languages, with others under development
• Multilinguality has always been an issue
• Since the beginning, multilinguality was interpreted as “translation”:
  – One hierarchy of terms (one structure), with translations in various languages
• This organization remained with the move from a term-centered to a concept-centered resource
9/5/2012 2
AGROVOC as an object-centered resource…
• Being mainly a resource for document indexing in the area of agriculture, it contains a large number of words referring to plants, animals, and food in general
[Figure: bar chart, "# of concepts below top concepts". Horizontal axis: number of concepts, 0 to 25000. Top concepts shown: organism, substances, entities, phenomena, activities, products, methods, properties, features, objects, resources, subjects, systems, locations, groups, measures, state, stages, technology, processes, factors, time, events, site, strategies.]
Requirements for rendering multilinguality in AGROVOC
1. Unambiguously express the geographic area where a given word is used
  – Specification of the area of use of a given word should be optional.
2. No limitations on the type of area allowed
  – Countries, groups of countries, geographical or administrative regions should be equally available for specification.
9/5/2012 KISAF, Rome 10
AGROVOC as a SKOS resource
• skos:Concept indicates a group of words in various languages, to be considered translations of one another
• URIs are kept “abstract” to emphasize the independence of the concept from language
  – E.g. http://aims.fao.org/aos/agrovoc/c_12332
• The grouped words are then labels of the given concept
SKOS properties to express terms
• skos:prefLabel, skos:altLabel
  – Take plain literals as values
  – And an optional language tag, expressed by the XML attribute xml:lang
• skosxl:prefLabel, skosxl:altLabel
  – Take entities with URIs, so extra information can be attached to labels
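The contrast between the two families of properties can be sketched with plain Python tuples standing in for RDF triples. This is only an illustration under assumptions: the example.org URIs, the "kiwicha" literal and the `ex:note` property are hypothetical placeholders, not AGROVOC data.

```python
# Plain tuples stand in for RDF triples: (subject, predicate, object).
# All example.org URIs and the "kiwicha" literal are hypothetical placeholders.
CONCEPT = "http://example.org/concept/c_0001"
LABEL = "http://example.org/label/xl_es_0001"

# SKOS core: the object is a plain literal with an optional language tag,
# so no further statement can be made about the label itself.
skos_triples = [
    (CONCEPT, "skos:prefLabel", ("kiwicha", "es")),
]

# SKOS-XL: the object is a resource with its own URI; the text moves to
# skosxl:literalForm and extra statements can be attached to the label.
skosxl_triples = [
    (CONCEPT, "skosxl:prefLabel", LABEL),
    (LABEL, "rdf:type", "skosxl:Label"),
    (LABEL, "skosxl:literalForm", ("kiwicha", "es")),
    (LABEL, "ex:note", "extra information about this label"),  # hypothetical property
]


def statements_about(subject, triples):
    """Return every (predicate, object) pair whose subject matches."""
    return [(p, o) for s, p, o in triples if s == subject]
```

With plain literals, `statements_about(LABEL, skos_triples)` is empty because the label has no identity of its own; with SKOS-XL the label is a subject in its own right.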
AGROVOC uses ISO 639 two-letter codes to tag languages in xml:lang
• ISO 639 provides codes for languages independently of
  – The country where they are spoken:
    • Spanish, Basque (same country, both official languages)
    • Dutch, Flemish (different countries, similar enough languages…)
  – And their status: French and Breton (same country, Breton has no official status)
• Only one code for English, Spanish…
• The previous examples show the resulting limitations
Are ISO 639 three-letter codes an option?
• More languages are included
  – More contemporary languages
    • Bemba language
  – “Old” languages (no longer spoken)
    • Old French (ca. 842-1400)
  – Groups of languages
    • Caucasian languages
  – Artificial languages
• Same approach as the two-letter version
Is IETF an option?
• Internet Engineering Task Force (IETF)
• IETF RFC 5646, Tags for Identifying Languages
  – Basis is ISO for languages (639)
  – Subtags from ISO for countries (3166), ISO for scripts (15924)
• Examples:
  – tr-CY = Turkish as used in Cyprus
  – zh-Hant-HK = Chinese in Traditional Chinese script, as used in Hong Kong
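How such tags decompose into subtags can be sketched as follows. This is a simplified reading that handles only the `language[-script][-region]` shape; RFC 5646 also defines extended-language, variant, extension and private-use subtags, which this sketch deliberately ignores. The names `LanguageTag` and `parse_tag` are made up for the example.

```python
import re
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str            # ISO 639 code
    script: Optional[str]    # ISO 15924 code, four letters
    region: Optional[str]    # ISO 3166 country code or three-digit area code


# Simplified pattern: language, then an optional script, then an optional region.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language tag into its language, script and region subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LanguageTag(
        language=match.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )
```

For the two tags on the slide, `parse_tag("tr-CY")` yields language `tr` with region `CY`, and `parse_tag("zh-Hant-HK")` yields language `zh`, script `Hant` and region `HK`. The region slot only takes a country code or a three-digit area code, which is worth weighing against the requirement that any type of area be allowed.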
Is a relational approach an option?
• Keep the tagging approach to mark the language
  – Use ISO 639 or IETF
• And introduce a relational notion of “where a given word is used”
• Link together a concept representing a geographic area, and the object to name
  – E.g., Kiwicha isNameUsedInRegion Cusco
• Aim at “standard” relations…
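A minimal sketch of the relational idea, again with tuples standing in for triples. Everything here is an assumption made for illustration: the example.org URIs, the `ex:` prefix, the second label, and the property name, which is simply spelled as in the example above and is not an established vocabulary term.

```python
# Hypothetical identifiers: label and place URIs live under example.org.
KIWICHA = "http://example.org/label/kiwicha"
OTHER = "http://example.org/label/other"
CUSCO = "http://example.org/place/Cusco"

USED_IN = "ex:isNameUsedInRegion"  # property name taken from the slide example

triples = {
    # The language is still marked by a tag on the literal.
    (KIWICHA, "skosxl:literalForm", ("kiwicha", "es")),
    (OTHER, "skosxl:literalForm", ("otro nombre", "es")),
    # The area of use is a separate, optional relation to a place resource.
    (KIWICHA, USED_IN, CUSCO),
}


def regions_of_use(label, triples):
    """Return the places a label is linked to, or an empty list if none is stated."""
    return sorted(o for s, p, o in triples if s == label and p == USED_IN)
```

Because the place is an ordinary resource, it can be a country, a group of countries or an administrative region alike, and a label with no stated area simply has no such triple: `regions_of_use(OTHER, triples)` returns an empty list.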
Conclusions?
• This is work in progress
• We continue working out use cases, especially from Spanish and Portuguese
• Assess alternatives