Human Language Technologies for Ethiopian
Languages: Challenges and Future Directions


         Solomon Teferra Abate, Binyam Ephrem,
 Enchalew Yifru, Kassa Tilahun, Lemlem Hagos,
      Mohammed-hussen Abubeker and Taye Girma


           LIG, Université Joseph Fourier (UJF)
         ITPhD Program, Addis Ababa University
              solomon_teferra_7@yahoo.com


                  AGIS'11 Conference, Addis Ababa
Outline


●   Ethiopian Languages
●   Human Language Technology (HLT)
      –   Role in Development
      –   HLT in the World
●   HLT for Ethiopian Languages
      –   Language and Technology Coverage
      –   Challenges and limitations
      –   Future Directions and Strategies

Ethiopian Languages


●   There are about 90 languages
●   Most belong to the Afro-Asiatic language family
●   Amharic, Afan-Oromo and Tigrigna are the 3 most spoken
●   Amharic is the federal working language
      –   Regions have their own working language
      –   The language policy states that everyone has the right to learn in
           his/her mother tongue
      –   More than 20 languages are the medium of instruction (MOI) in
           primary (I&II) school
Human Language Technology

●   Is an interdisciplinary field that encompasses most sub-disciplines
    of linguistics, Computational Linguistics, Natural Language
    Processing, computer science, Artificial Intelligence,
    psychology, philosophy, mathematics and statistics
●   Covers areas like:
      ✔   ASR,
      ✔   TTS,
      ✔   OCR,
      ✔   Stemming,
      ✔   MT,
      ✔   POS tagging,
      ✔   Parsing,
      ✔   Morphological analysis/synthesis,
      ✔   Information Extraction,
      ✔   Text/document categorization,
      ✔   Spelling and Grammar checking,
      ✔   etc.
Human Language Technology - Role

●   Enables ICT products to have knowledge of human language
      ●   Increases the acceptance of the technology and the
            productivity of its users in the information age
●   Helps people collaborate, conduct business, share knowledge
    and participate in social and political debates regardless of
    language barriers or computer skills
●   Relevant for the disadvantaged to have access to information:
      ✔   the illiterate,
      ✔   the rural poor,
      ✔   the physically impaired population

HLT in the World

●   Well developed for a few languages of the world like English
●   IBM Watson Computer
    ●   Passed its first test by winning a QA competition with a $1 M prize
    ●   The goal of its design is an intelligent computer that can
        interact in natural language
           ✔   Understand any question asked in natural speech
           ✔   Answer questions as humans do
    ●   Uses a number of HLT modules such as: ASR, QA, TTS
    ✗   Requires a lot of expensive servers (about $1 billion in total)
HLT in the World

●   Siri is a simple iPhone-based system that:
      ●   Receives commands in natural speech to:
             ●   Send messages
             ●   Schedule meetings
             ●   Place phone calls
●   Siri has been claimed to:
      ●   understand what you say
      ●   know what you mean
      ●   speak back in natural speech
HLT in the World: Europe

●   Europe is a continent united into a single multilingual
    economic union with 23 official languages
●   To enable the European languages, the European Union:
      ✔   Invested over €130 M to promote language technologies
            and language resource infrastructures in 2009-2011
      ✔   Allocated €35 M for SME action on Digital Content and
           Languages and €50 M for Language Technologies in its
           Work Program 2011-2012
      ✔   Proposed a simple platform that enables availability of any
            online content and services in all European languages
HLT in the World: South Africa

●   The South African government has identified HLT as a priority area
    to enable (technologically) its 11 official languages
➢   Various R&D projects and initiatives have been funded by the
    government through:
      ●   Department of Arts and Culture (DAC),
      ●   Department of Science and Technology (DST), and
      ●   National Research Foundation (NRF)
●   The key challenge is fragmentation of R&D activities in HLT
      ●   Addressed by the South African HLT Audit (SAHLTA)
HLT for Ethiopian Languages


●   Research on HLT for Ethiopian languages started in the 1990s
✔   There are now many (>200) encouraging and valuable works on:
    ➢   ASR,
    ➢   MT,
    ➢   Text-to-speech,
    ➢   OCR,
    ➢   Stemming,
    ➢   Parsing,
    ➢   POS tagging,
    ➢   Spell checking,
    ➢   Thesaurus construction,
    ➢   Text classification,
    ➢   Text categorization,
    ➢   Morphological analysis,
    ➢   Information Extraction
✗   Most of them are based on language resources (LRs) developed only
    for the experiment at hand
HLT for Ethiopian Languages

✗   HLT research covers a limited number of Ethiopian languages
    [Figure: bar chart "HLT for Ethiopian Languages (Masters theses)".
     Horizontal axis: Languages (Amharic, Afan Oromo, Tigrigna, Welayta,
     Ge'ez, Sidama). Vertical axis: Research Areas (0 to 25).
     Series: NLP, Speech Processing, OCR, CSE.]

Challenges and Limitations

●   Challenges that hinder Ethiopian HLT include:
      –   Lack of language resources: speech and text corpora
      –   Lack of standardized evaluation corpora and platforms
      –   Lack of expertise in both language and technology
      –   Time shortage
           ●   Research is done only for academic achievement in the given time
      –   Absence of a national HLT research plan (HLT road-map)
           ●   Research is based only on individuals' interests
      –   Lack of sustainable and coordinated research funding
Challenges and Limitations

➔   Existing research works have limitations:
     –   Use of insufficient and low-quality language resources
          ➢   Research results are not conclusive
     –   Research results are not well evaluated, analyzed and
           documented
          ➢   Their achievements and gaps are vague
     –   Research attempts in HLT are fragmented
          ➢   Lack of integration, consolidation and continuity
               ●   Tokenizer -> POS -> Parser -> LA -> ASR/MT
Future Directions and Strategies


●   Is there any other way to escape the cost of the language barrier,
    or to cover it, without HLT in the information age? NO!!!
●   Are we rich enough to continue spending only on academic
    exercises? NO!!!
      –   6 months of at least 10 research students doing their theses on
            one of the HLT areas every year, and of their supervisors
      –   3 years of at least 6 PhD research students (admitted every year)
            and of their research supervisors
      –   The time of academic researchers doing research for publication
           purposes (for academic promotion)
Future Directions and Strategies

●   Give emphasis and recognition to R&D activities in HLT
●   Develop a national HLT road-map (HLT Audit)
      –   Shows research priorities
      –   Avoids duplication (even across languages)
      –   Reduces R&D cost
      –   Provides a means of evaluation/assessment
      –   Enforces consolidation, integration and continuity
      –   Inspires researchers and developers
      –   Shows the benefit areas for the HLT industry
Future Directions and Strategies


●   Establish Institutional/National R&D units
      –   Fund, coordinate and evaluate R&D projects
      –   Store, maintain and distribute language resources and R&D
            outputs
      –   Promote the utility of R&D outputs
      –   Coordinate and support private industries
      –   Coordinate cooperation between academia and industry
      –   Promote and attract international investment in HLT industries


Conclusion


●   We have 85 living languages
●   All have speakers who need information and have the right
    to get it in a language and in a way they understand
              –   HLT is the way to realize it
●   We need to have a strategy to put it in place
      –   Cooperation across:
          ●   Time: past->present->future
          ●   Language
          ●   Research area
          ●   Sector: academic<->industry

We can
           make it
             BY



