AIMS
Is ISO 639 enough for a multilingual
            thesaurus?
              The AGROVOC case

 Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran,
                   Johannes Keizer
    Food and Agriculture Organization of the UN
                       AOS 2012
              Sept 4, 2012 - Kuching (MY)
Background
• AGROVOC is published in 21 languages, with others
  under development
• Multilinguality has always been an issue
• Since the beginning, multilinguality was
  interpreted as “translation”:
      – One hierarchy of terms (one structure),
        translations in various languages
• This organization remained with the move
  from a term-centered to a concept-centered
  resource
9/19/2012                                         2
AGROVOC as object-centered
                  resource…
• Being mainly a resource for document
  indexing in the area of agriculture, it contains
  a large number of words referring to plants,
  animals, and food in general




# of concepts below top concepts
[Bar chart: number of concepts under each top concept, on a horizontal axis from 0 to 25,000. Top concepts shown: organism, substances, entities, phenomena, activities, products, methods, properties, features, objects, resources, subjects, systems, locations, groups, measures, state, stages, technology, processes, factors, time, events, site, strategies.]
Differentiating languages
• Salmon (en)
• Salmón (es)
• лососи (ru)




But distribution of languages may
              be wide…




… and names of food tend to vary…


Aguacate




             Palta




… and names of food tend to vary…
Ataco morado, sangorache, sergorache, hawarcha
Achis, Coyos (Cajamarca), Achita (Ayacucho), Kiwicha (Cusco)
Coime, coimi, cuimi, millmi
Not only food names vary




Requirements for rendering
            multilinguality in AGROVOC
1. Unambiguously express the geographic area
   where a given word is used
      – specification of the area of use of a given word
        should be optional.
2. No limitations on the type of area allowed
      – Countries, groups of countries, geographical or
        administrative regions should be equally available
        for specification.


9/19/2012                   KISAF, Rome                    10
AGROVOC as a SKOS resource
• skos:Concept is used to indicate a group of words in
  various languages, to be considered translations of
  one another
• URIs are kept “abstract” to emphasize independence
  of the concept from language
      – E.g. http://aims.fao.org/aos/agrovoc/c_12332
• The words grouped are then labels of the given
  concept




SKOS properties to express terms
• skos:prefLabel, skos:altLabel
      – take plain literals as values
      – and an optional language tag expressed by XML
        attribute xml:lang
• skosxl:prefLabel, skosxl:altLabel
      – take entities with URIs, so extra information can be
        attached to labels
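
To make the contrast between the two families of properties concrete, here is a minimal sketch in plain Python. It is not AGROVOC's implementation: the class names, the label URI, the "used_in" key and the choice of the tag "es" for Kiwicha are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PlainLabel:
    """A skos:prefLabel / skos:altLabel style value: text plus an optional tag."""
    text: str
    lang: Optional[str] = None  # value of xml:lang; "es" below is an assumption


@dataclass
class XLabel:
    """A SKOS-XL style label: an entity with its own URI, so that
    extra information can be attached to it."""
    uri: str
    literal_form: PlainLabel
    extra: dict = field(default_factory=dict)  # hypothetical extra information


# Plain label: nothing more can be said about the word itself.
plain = PlainLabel("Kiwicha", "es")

# SKOS-XL style label: the URI and the "used_in" key are invented here.
xl = XLabel("http://example.org/label/kiwicha", PlainLabel("Kiwicha", "es"))
xl.extra["used_in"] = "Cusco"

print(plain)
print(xl.uri, xl.extra)
```

The point of the second form is only that the label becomes something statements can be made about, such as where the word is used.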




AGROVOC uses ISO 639 two-letter codes
       to tag languages in xml:lang
• ISO 639 provides codes for languages
  independently of
      – the country where they are spoken:
            • Spanish, Basque (same country, both official languages)
            • Dutch, Flemish (different countries, similar enough
              languages…)
      – and their status: French and Breton (same
        country, Breton has no official status)
• Only one code each for English, Spanish…
• Limitations shown by the previous examples
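
The limitation can be shown with a small Python sketch. The two Spanish words are the ones from the earlier slide; the English word "Avocado" and the list layout are additions made for this illustration only.

```python
from collections import defaultdict

# (word, language tag) pairs for one concept. Both Spanish words can only
# receive the tag "es"; "Avocado" is added here for contrast.
labels = [("Aguacate", "es"), ("Palta", "es"), ("Avocado", "en")]

by_tag = defaultdict(list)
for word, tag in labels:
    by_tag[tag].append(word)

# A tag shared by several words cannot tell those words apart.
ambiguous = {tag: words for tag, words in by_tag.items() if len(words) > 1}
print(ambiguous)
```

With a language-only tag, nothing in the data says which of the two Spanish words belongs to which area.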
Multilinguality
ISO 639
Language
codes




Is ISO 639 with three-letter codes an option?
• More languages are included
      – More contemporary languages
            • Bemba language
      – “Old” languages (no longer spoken)
            • Old French (ca. 842-1400)
      – Groups of languages
            • Caucasian languages
      – Artificial languages
• Same approach as the two-letter version
Is IETF an option?
• Internet Engineering Task Force (IETF)
• IETF RFC 5646, Tags for Identifying Languages
      – Basis is ISO for languages (639)
      – Subtags from ISO for countries (3166), ISO for
        scripts (15924)
• Examples:
      – tr-CY = Turkish from Cyprus
      – zh-Hant-HK = Chinese in traditional Chinese script, as used in Hong Kong
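
A minimal sketch of how such tags break down into subtags, written in Python. The pattern is a deliberate simplification that only covers the shape language[-script][-region]; full RFC 5646 tags allow further subtags that are not handled here, and the function name is made up for this example.

```python
import re

# Simplified pattern: language[-Script][-REGION]. Variants, extensions and
# private-use subtags allowed by the RFC are intentionally not supported.
TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Return the language, script and region subtags of a simple tag."""
    match = TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    return match.groupdict()


print(split_tag("tr-CY"))
print(split_tag("zh-Hant-HK"))
print(split_tag("es"))
```

Note that the region subtag comes from the country code list, so a tag of this shape can point to a country but not directly to an administrative region such as Cusco.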

Is a relational approach an option?
• Keep tagging approach to mark the language
      – Use ISO 639 or IETF
• And introduce a relational notion of “where a
  given word is used”
• Link together a concept representing a
  geographic area, and the object to name
      – E.g., Kiwicha isNameUsedInRegion Cusco
• Aim at “standard” relations…
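
One way to picture the relational idea is as a set of (subject, relation, object) statements, sketched below in Python. The relation name is the one from the example above and the name/region pairs come from the earlier slide; the use of plain strings instead of URIs and the helper function names are assumptions made for illustration.

```python
# Each statement is a (subject, relation, object) triple. Plain strings
# stand in for what would be URIs of labels and of geographic concepts.
triples = {
    ("Kiwicha", "isNameUsedInRegion", "Cusco"),
    ("Achita", "isNameUsedInRegion", "Ayacucho"),
    ("Coyos", "isNameUsedInRegion", "Cajamarca"),
}


def names_used_in(region):
    """Return, sorted, every name stated to be used in the given region."""
    return sorted(s for s, rel, o in triples
                  if rel == "isNameUsedInRegion" and o == region)


def regions_of(name):
    """Return, sorted, every region in which the given name is stated to be used."""
    return sorted(o for s, rel, o in triples
                  if rel == "isNameUsedInRegion" and s == name)


print(names_used_in("Cusco"))
print(regions_of("Achita"))
```

Because the area is just another concept, it can be a country, a group of countries or an administrative region, and a word with no such statement simply has no area specified.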
Conclusions?
• This is work in progress
• We continue working out use cases, especially
  from Spanish and Portuguese
• Assess alternatives




