
    1. AIMS
       Is ISO 639 enough for a multilingual thesaurus? The AGROVOC case
       Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran, Johannes Keizer
       Food and Agriculture Organization of the UN
       AOS 2012, Sept 4, 2012, Kuching (MY)
    2. Background
       • AGROVOC is published in 21 languages, with others under development
       • Multilinguality has always been an issue
       • From the beginning, multilinguality was interpreted as “translation”:
         – One hierarchy of terms (one structure), with translations in various languages
       • This organization remained in place through the move from a term-centered to a concept-centered resource
       9/5/2012
    3. AGROVOC as object-centered resource…
       • Being mainly a resource for document indexing in the area of agriculture, it contains a large number of words referring to plants, animals, and food in general
    4. # of concepts below top concepts
       [Bar chart: number of concepts (axis 0 to 25000) below each top concept: organism, substances, entities, phenomena, activities, products, methods, properties, features, objects, resources, subjects, systems, locations, groups, measures, state, stages, technology, processes, factors, time, events, site, strategies]
    5. Differentiating languages
       • Salmon (en)
       • Salmón (es)
       • лососи (ru)
    6. But distribution of languages may be wide…
    7. … and names of food tend to vary…
       • Aguacate
       • Palta
    8. … and names of food tend to vary…
       • Ataco morado, sangorache, sergorache, hawarcha
       • Achis, Coyos (Cajamarca), Achita (Ayacucho), Kiwicha (Cusco)
       • Coime, coimi, cuimi, millmi
    9. Not only food names vary
    10. Requirements for rendering multilinguality in AGROVOC
        1. Unambiguously express the geographic area where a given word is used
           – Specifying the area of use of a given word should be optional.
        2. No limitations on the type of area allowed
           – Countries, groups of countries, and geographical or administrative regions should be equally available for specification.
        KISAF, Rome
    11. AGROVOC as a SKOS resource
        • A skos:Concept indicates a group of words in various languages that are considered translations of one another
        • URIs are kept “abstract” to emphasize the independence of the concept from language
          – E.g. http://aims.fao.org/aos/agrovoc/c_12332
        • The grouped words are then labels of the given concept
    12. SKOS properties to express terms
        • skos:prefLabel, skos:altLabel
          – take plain literals as values
          – and an optional language tag, expressed by the XML attribute xml:lang
        • skosxl:prefLabel, skosxl:altLabel
          – take entities with URIs, so extra information can be attached to labels
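The difference between the two families of label properties can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `PlainLabel` and `XLLabel`, the label URI under `example.org`, and the `editorialNote` key are hypothetical and are not part of AGROVOC or of the SKOS vocabularies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PlainLabel:
    """Value of skos:prefLabel / skos:altLabel: a literal plus an optional language tag."""
    text: str
    lang: Optional[str] = None  # the xml:lang value, if any


@dataclass
class XLLabel:
    """Value of skosxl:prefLabel / skosxl:altLabel: a resource with its own URI."""
    uri: str
    literal_form: PlainLabel
    # Because the label has a URI, further statements can be attached to it.
    extra: dict = field(default_factory=dict)


plain = PlainLabel("Salmón", "es")

xl = XLLabel(
    uri="http://example.org/labels/xl_es_salmon",  # hypothetical label URI
    literal_form=PlainLabel("Salmón", "es"),
)
xl.extra["editorialNote"] = "illustrative only"  # hypothetical extra statement
```

A plain label has nowhere to hang additional statements, whereas the label entity can carry as many as needed. That is what makes the label-as-entity form a candidate for recording where a word is used.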
    13. AGROVOC uses 2-letter ISO 639 codes to tag languages in xml:lang
        • ISO 639 provides codes for languages independently of
          – the country where they are spoken:
            • Spanish, Basque (same country, both official languages)
            • Dutch, Flemish (different countries, similar enough languages…)
          – and their status: French and Breton (same country, Breton has no official status)
        • Only one code for English, Spanish…
        • Limitations are shown by the previous examples
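The limitation can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that both avocado names from the earlier slide are tagged with the single Spanish code `es`; the `labels` list is made-up sample data, not an extract from AGROVOC.

```python
from collections import defaultdict

# (label text, xml:lang value). Tagging both names as "es" is an assumption
# made for this example.
labels = [
    ("Aguacate", "es"),
    ("Palta", "es"),
    ("Salmon", "en"),
]

# Group labels by language tag, which is all a bare language code allows.
by_lang = defaultdict(list)
for text, lang in labels:
    by_lang[lang].append(text)

# Both Spanish names end up in the same bucket: the tag alone carries no
# information about the area where each name is used.
spanish_names = by_lang["es"]
```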
    14. [Diagram with the labels: Multilinguality, ISO 639, Language codes]
    15. Is 3-letter ISO 639 an option?
        • More languages are included
          – More contemporary languages
            • Bemba language
          – “Old” languages (no longer spoken)
            • Old French (ca. 842-1400)
          – Groups of languages
            • Caucasian languages
          – Artificial languages
        • Same approach as the 2-letter version
    16. Is IETF an option?
        • Internet Engineering Task Force (IETF)
        • IETF RFC 5646, “Tags for Identifying Languages”
          – Basis is ISO for languages (639)
          – Subtags from ISO for countries (3166) and ISO for scripts (15924)
        • Examples:
          – tr-CY = Turkish from Cyprus
          – zh-Hant-HK = Chinese in traditional Chinese script, as used in Hong Kong
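The structure of such tags can be illustrated with a deliberately simplified Python parser. It handles only the language, script, and region subtags shown in the examples above; extended language subtags, variants, extensions, private-use and grandfathered tags are out of scope, and it does not check subtags against any registry. The names `LangTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str           # ISO 639 code, 2 or 3 letters
    script: Optional[str]   # ISO 15924 code, 4 letters
    region: Optional[str]   # ISO 3166 code (2 letters) or a 3-digit area code


# Simplified shape: language[-script][-region]
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_tag(tag: str) -> LangTag:
    """Split a tag into its subtags, normalising case by convention."""
    m = _TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = m.group("script")
    region = m.group("region")
    return LangTag(
        language=m.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )
```

For example, `parse_tag("tr-CY")` yields the language `tr` with region `CY`, and `parse_tag("zh-Hant-HK")` yields language `zh`, script `Hant`, and region `HK`. The region subtag can only name a country or a numeric area code, so it cannot express sub-national areas such as Cusco or Cajamarca.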
    17. Is a relational approach an option?
        • Keep the tagging approach to mark the language
          – Use ISO 639 or IETF
        • And introduce a relational notion of “where a given word is used”
        • Link together a concept representing a geographic area and the object to name
          – E.g., Kiwicha isNameUsedInRegion Cusco
        • Aim at “standard” relations…
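A minimal Python sketch of the relational idea follows, using the `isNameUsedInRegion` relation from the slide and name/region pairs taken from the earlier amaranth example. The in-memory triple list and the helper functions `names_used_in` and `regions_of` are hypothetical; a real implementation would state these relations in RDF between label and region resources.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str    # the name (label)
    predicate: str  # the relation
    obj: str        # the geographic area


USED_IN = "isNameUsedInRegion"

triples = [
    Triple("Kiwicha", USED_IN, "Cusco"),
    Triple("Achita", USED_IN, "Ayacucho"),
]


def names_used_in(region: str, store: List[Triple]) -> List[str]:
    """Return every name stated to be used in the given region."""
    return [t.subject for t in store if t.predicate == USED_IN and t.obj == region]


def regions_of(name: str, store: List[Triple]) -> List[str]:
    """Return the regions stated for a name; empty if none were specified."""
    return [t.obj for t in store if t.predicate == USED_IN and t.subject == name]
```

Because the area is expressed as a separate statement, a name with no stated region simply yields an empty list, which matches the requirement that specifying the area of use be optional. The object of the relation can be any kind of area, not only a country.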
    18. Conclusions?
        • This is work in progress
        • We continue working out use cases, especially from Spanish and Portuguese
        • Assess alternatives