    1. AIMS
       Is ISO 639 enough for a multilingual thesaurus? The AGROVOC case
       Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran, Johannes Keizer
       Food and Agriculture Organization of the UN
       AOS 2012, Sept. 4, 2012, Kuching (MY)
    2. Background
       • AGROVOC is published in 21 languages, plus others under development
       • Multilinguality has always been an issue
       • From the beginning, multilinguality was interpreted as “translation”:
          – One hierarchy of terms (one structure), with translations in various languages
       • This organization was kept when AGROVOC moved from a term-centered to a concept-centered resource
       9/5/2012
    3. AGROVOC as object-centered resource…
       • Being mainly a resource for document indexing in the area of agriculture, it contains a large number of words referring to plants, animals, and food in general
    4. # of concepts below top concepts
       [Bar chart, horizontal axis from 0 to 25000. Top concepts shown: organism, substances, entities, phenomena, activities, products, methods, properties, features, objects, resources, subjects, systems, locations, groups, measures, state, stages, technology, processes, factors, time, events, site, strategies]
    5. Differentiating languages
       • Salmon (en)
       • Salmón (es)
       • лососи (ru)
    6. But distribution of languages may be wide…
    7. … and names of food tend to vary…
       • Aguacate
       • Palta
    8. … and names of food tend to vary…
       • Ataco morado, sangorache, sergorache, hawarcha
       • Achis, Coyos (Cajamarca), Achita (Ayacucho), Kiwicha (Cusco)
       • Coime, coimi, cuimi, millmi
    9. Not only food names vary
    10. Requirements for rendering multilinguality in AGROVOC
       1. Unambiguously express the geographic area where a given word is used
          – Specifying the area of use of a given word should be optional.
       2. No limitations on the type of area allowed
          – Countries, groups of countries, and geographical or administrative regions should be equally available for specification.
       KISAF, Rome
    11. AGROVOC as a SKOS resource
       • A skos:Concept indicates a group of words in various languages that are considered translations of one another
       • URIs are kept “abstract” to emphasize the independence of the concept from language
          – E.g. http://aims.fao.org/aos/agrovoc/c_12332
       • The grouped words are then labels of the given concept
    12. SKOS properties to express terms
       • skos:prefLabel, skos:altLabel
          – take plain literals as values
          – and an optional language tag, expressed by the XML attribute xml:lang
       • skosxl:prefLabel, skosxl:altLabel
          – take entities with URIs as values, so extra information can be attached to labels
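To make the tagging mechanism concrete, here is a minimal sketch of how language-tagged skos:prefLabel values might be read from RDF/XML using only the Python standard library. The concept URI `http://example.org/concept/salmon` and the helper `pref_labels` are hypothetical, introduced for illustration only; the three labels are the ones from the "Differentiating languages" slide. This is not AGROVOC's actual serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML fragment: one concept with one preferred label per language.
# The example.org URI is a placeholder, not a real AGROVOC concept URI.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/salmon">
    <skos:prefLabel xml:lang="en">Salmon</skos:prefLabel>
    <skos:prefLabel xml:lang="es">Salmón</skos:prefLabel>
    <skos:prefLabel xml:lang="ru">лососи</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pref_labels(rdf_xml):
    """Return a dict mapping each language tag to its skos:prefLabel text."""
    root = ET.fromstring(rdf_xml.strip())
    labels = {}
    for element in root.iter("{%s}prefLabel" % SKOS_NS):
        labels[element.get(XML_LANG)] = element.text
    return labels


LABELS = pref_labels(RDF_XML)
```

Because the language tag is the only metadata a plain literal carries, anything beyond the language itself (such as the region of use) has no place in this model.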
    13. AGROVOC uses ISO 639 two-letter codes to tag languages in xml:lang
       • ISO 639 provides codes for languages independently of
          – the country where they are spoken:
             • Spanish, Basque (same country, both official languages)
             • Dutch, Flemish (different countries, similar enough languages…)
          – and their status: French and Breton (same country, Breton has no official status)
       • Only one code for English, Spanish…
       • The previous examples show the limitations
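The limitation can be illustrated with a small sketch: when labels carry only a language code, the two Spanish names from the earlier food slide collapse into the same group, and nothing records where each one is used. The list `TAGGED_LABELS` and the function `group_by_language` are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Two Spanish names for the same fruit, both tagged with the single code "es".
TAGGED_LABELS = [("Aguacate", "es"), ("Palta", "es")]


def group_by_language(tagged_labels):
    """Group label texts by their language tag, preserving input order."""
    groups = defaultdict(list)
    for text, lang in tagged_labels:
        groups[lang].append(text)
    return dict(groups)


GROUPS = group_by_language(TAGGED_LABELS)
```

Both names end up under `"es"`, so a consumer of the data cannot tell from the tag alone which name is appropriate for which audience.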
    14. Multilinguality
       • ISO 639
       • Language codes
    15. Is ISO 639 with three-letter codes an option?
       • More languages are included
          – More contemporary languages
             • Bemba language
          – “Old” languages (no longer spoken)
             • Old French (ca. 842-1400)
          – Groups of languages
             • Caucasian languages
          – Artificial languages
       • Same approach as the two-letter version
    16. Is IETF an option?
       • Internet Engineering Task Force (IETF)
       • IETF RFC 5646, “Tags for Identifying Languages”
          – Basis is ISO for languages (639)
          – Subtags from ISO for countries (3166) and ISO for scripts (15924)
       • Examples:
          – tr-CY = Turkish from Cyprus
          – zh-Hant-HK = Chinese in traditional Chinese script, as used in Hong Kong
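The structure of such tags can be sketched with a deliberately simplified parser. It handles only the language, script, and region subtags used in the two examples above; it is not a full RFC 5646 implementation (extended language subtags, variants, extensions, and private-use subtags are ignored). The names `LanguageTag` and `parse_tag` are hypothetical.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str                  # ISO 639 code, e.g. "tr"
    script: Optional[str] = None   # ISO 15924 code, e.g. "Hant"
    region: Optional[str] = None   # ISO 3166 code, e.g. "CY"


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag into language, script, and region (simplified sketch)."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            # A script subtag, if present, comes before the region.
            script = part.title()
        elif is_region and region is None:
            region = part.upper()
    return LanguageTag(language, script, region)


TURKISH_CYPRUS = parse_tag("tr-CY")
CHINESE_HK = parse_tag("zh-Hant-HK")
```

Note that the region slot holds country codes (or numeric area codes), so a sub-national region such as a province or city has no direct representation in this scheme.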
    17. Is a relational approach an option?
       • Keep the tagging approach to mark the language
          – Use ISO 639 or IETF
       • And introduce a relational notion of “where a given word is used”
       • Link together a concept representing a geographic area and the object to name
          – E.g., Kiwicha isNameUsedInRegion Cusco
       • Aim at “standard” relations…
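A minimal sketch of the relational idea follows, assuming each label keeps its language tag and a separate statement links it to a region. The class names, the function `names_used_in`, and the choice of `"es"` as the language tag are assumptions made for this example; the name and region pairs come from the earlier slide on regional food names. Plain strings stand in for what would be region concepts with their own URIs.

```python
from typing import List, NamedTuple


class Label(NamedTuple):
    text: str
    lang: str  # language tag, kept as in the tagging approach (assumed "es" here)


class UsedInRegion(NamedTuple):
    """One 'isNameUsedInRegion' statement linking a label to a region."""
    label: Label
    region: str  # stands in for a concept representing a geographic area


STATEMENTS = [
    UsedInRegion(Label("Kiwicha", "es"), "Cusco"),
    UsedInRegion(Label("Achita", "es"), "Ayacucho"),
    UsedInRegion(Label("Coyos", "es"), "Cajamarca"),
]


def names_used_in(region: str, statements: List[UsedInRegion]) -> List[str]:
    """Return the label texts stated to be used in the given region."""
    return [s.label.text for s in statements if s.region == region]


CUSCO_NAMES = names_used_in("Cusco", STATEMENTS)
```

Since the region is an ordinary value rather than part of the language tag, it can be a city, a province, a country, or a group of countries, and labels without any statement simply carry no regional information, which matches the two requirements stated earlier.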
    18. Conclusions?
       • This is work in progress
       • We continue working out use cases, especially from Spanish and Portuguese
       • Assess alternatives