Human Language Technologies for Ethiopian Languages: Challenges and Future Directions


Solomon Teferra Abate, Binyam Ephrem, Enchalew Yifru, Kassa Tilahun, Lemlem Hagos, Mohammed-hussen Abubeker and Taye Girma

  1. Human Language Technologies for Ethiopian Languages: Challenges and Future Directions
     Solomon Teferra Abate, Binyam Ephrem, Enchalew Yifru, Kassa Tilahun, Lemlem Hagos, Mohammed-hussen Abubeker and Taye Girma
     LIG, Université Joseph Fourier (UJF)
     IT PhD Program, Addis Ababa University
     AGIS11 Conference, Addis Ababa
  2. Outline
     ● Ethiopian Languages
     ● Human Language Technology (HLT)
       – Role in Development
       – HLT in the World
     ● HLT for Ethiopian Languages
       – Language and Technology Coverage
       – Challenges and Limitations
       – Future Directions and Strategies
  3. Ethiopian Languages
     ● There are about 90 languages
     ● Most belong to the Afro-Asiatic language family
     ● Amharic, Afan Oromo and Tigrinya are the 3 most widely spoken
     ● Amharic is the federal working language
       – Regions have their own working languages
       – The language policy states that everyone has the right to learn in his/her mother tongue
       – More than 20 languages are media of instruction (MOI) in primary school (cycles I and II)
  4. Human Language Technology
     ● Is an interdisciplinary field that encompasses most subdisciplines of linguistics, computational linguistics, natural language processing, computer science, artificial intelligence, psychology, philosophy, mathematics and statistics
     ● Covers areas like:
       ✔ ASR
       ✔ MT
       ✔ TTS
       ✔ OCR
       ✔ POS tagging
       ✔ Parsing
       ✔ Morphological analysis/synthesis
       ✔ Stemming
       ✔ Information Extraction
       ✔ Text/document categorization
       ✔ Spelling and Grammar checking
       ✔ etc.
  5. Human Language Technology - Role
     ● Enables ICT products to have knowledge of human language
       ● Increases the acceptance of the technology and the productivity of its users in the information age
     ● Helps people collaborate, conduct business, share knowledge and participate in social and political debates regardless of language barriers or computer skills
     ● Relevant for the disadvantaged to have access to information:
       ✔ the illiterate,
       ✔ the physically impaired population,
       ✔ the rural poor
  6. HLT in the World
     ● Well developed for a few languages of the world, like English
     ● IBM Watson Computer
       ● Passed its first test by winning a question answering (QA) competition worth $1 M
       ● The goal of its design is an intelligent computer that can interact in a natural language
         ✔ Understands any question asked in natural speech
         ✔ Answers questions as humans do
       ● Uses a number of HLT modules such as ASR, QA and TTS
       ✗ Requires a lot of expensive servers (about $1 billion in total)
  7. HLT in the World
     ● Siri is a simple iPhone-based system that:
       ● Receives commands in natural speech
       ● Sends messages
       ● Schedules meetings
       ● Places phone calls
     ● Siri has been claimed to:
       ● understand what you say
       ● know what you mean
       ● speak back in natural speech
  8. HLT in the World: Europe
     ● Europe is a continent united into a single multilingual economic union with 23 official languages
     ● To enable the European languages technologically, the European Union:
       ✔ Invested over €130 M to promote language technologies and language resource infrastructures in 2009-2011
       ✔ Allocated €35 M for SME action on Digital Content and Languages and €50 M for Language Technologies in its Work Program 2011-2012
       ✔ Proposed a simple platform that makes any online content and service available in all European languages
  9. HLT in the World: South Africa
     ● The South African government has identified HLT as a priority area to enable (technologically) its 11 official languages
       ➢ Various R&D projects and initiatives have been funded by the government through:
         ● Department of Arts and Culture (DAC),
         ● Department of Science and Technology (DST), and
         ● National Research Foundation (NRF)
     ● The key challenge is fragmentation of R&D activities in HLT
       ● Addressed by the South African HLT Audit (SAHLTA)
  10. HLT for Ethiopian Languages
     ● Research on HLT for Ethiopian languages started in the 1990s
     ✔ There are now many (>200) encouraging and valuable works on:
       ➢ Thesaurus construction
       ➢ ASR
       ➢ Stemming
       ➢ Text classification
       ➢ MT
       ➢ Parsing
       ➢ Text categorization
       ➢ Text-to-speech
       ➢ POS tagging
       ➢ Morphological analysis
       ➢ OCR
       ➢ Spell checking
       ➢ Information Extraction
     ✗ Most of them are based on language resources (LRs) developed for the experiment
  11. HLT for Ethiopian Languages
     ✗ HLT research covers a limited number of Ethiopian languages
     [Figure: bar chart "HLT for Ethiopian Languages (Masters theses)". Vertical axis: 0 to 25. Horizontal axis: Languages (Amharic, Afan Oromo, Tigrinya, Welayta, Geez, Sidama). Legend, "CSE Research Areas": NLP, Speech Processing, OCR.]
  12. Challenges and Limitations
     ● Challenges that hinder Ethiopian HLT include:
       – lack of language resources: speech and text corpora
       – lack of standardized evaluation corpora and platform
       – lack of expertise in both language and technology
       – time shortage
         ● research is done only for academic achievement in the given time
       – absence of a national HLT research plan (HLT road-map)
         ● research is based only on individuals' interests
       – lack of sustainable and coordinated research funding
  13. Challenges and Limitations
     ➔ Existing research works have limitations:
       – use of insufficient and low-quality language resources
         ➢ research results are not conclusive
       – research results are not well evaluated, analyzed and documented
         ➢ their achievements and gaps are vague
       – research attempts in HLT are fragmented
         ➢ lack of integration, consolidation and continuity
           ● Tokenizer, POS, Parser, LA, ASR/MT
  14. Future Directions and Strategies
     ● Is there any other way to escape the cost of the language barrier, or to cover it without HLT, in the information age? NO!!!
     ● Are we rich enough to continue spending on academic exercises only? NO!!!
       – 6 months of at least 10 research students doing their theses in any one of the HLT areas every year, and their supervisors
       – 3 years of at least 6 PhD research students (admitted every year) and their research supervisors
       – The time of academic researchers doing research for publication purposes (for academic promotion)
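The spending listed on slide 14 can be put into rough person-year terms. The sketch below is only an illustration under stated assumptions: it takes the slide's minimum figures at face value, treats every student as full-time, assumes a steady state in which each PhD cohort stays for the full 3 years, and leaves out supervisors and other academic researchers because the slide gives no numbers for them. The function and constant names are hypothetical.

```python
# Minimum figures quoted on slide 14. Supervisor and other researcher time
# is excluded because the slide does not quantify it.
MASTERS_STUDENTS_PER_YEAR = 10  # "at least 10 research students"
MASTERS_THESIS_YEARS = 0.5      # "6 months"
PHD_ADMITTED_PER_YEAR = 6       # "at least 6 PhD research students (admitted every year)"
PHD_DURATION_YEARS = 3          # "3 years"


def annual_student_person_years(masters_per_year, masters_years,
                                phd_per_year, phd_years):
    """Lower bound on student research effort spent per calendar year.

    Assumption (not from the slides): a steady state in which
    phd_per_year * phd_years PhD students are enrolled at any time,
    each contributing one person-year per calendar year.
    """
    masters_effort = masters_per_year * masters_years
    phd_enrolled = phd_per_year * phd_years
    return masters_effort + phd_enrolled


if __name__ == "__main__":
    total = annual_student_person_years(
        MASTERS_STUDENTS_PER_YEAR, MASTERS_THESIS_YEARS,
        PHD_ADMITTED_PER_YEAR, PHD_DURATION_YEARS,
    )
    print(f"At least {total} person-years of student research per year")
```

Under these assumptions the student effort alone comes to at least 5 person-years from thesis students plus 18 from PhD students each year, which is the scale of investment the slide argues should not be spent on isolated academic exercises.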
  15. Future Directions and Strategies
     ● Give emphasis and recognition to R&D activities in HLT
     ● Develop a national HLT road-map (HLT Audit)
       – Shows research priorities
       – Avoids duplication (even across languages)
       – Reduces R&D cost
       – Provides a means of evaluation/assessment
       – Enforces consolidation, integration and continuity
       – Inspires researchers and developers
       – Shows the benefit areas for the HLT industry
  16. Future Directions and Strategies
     ● Establish Institutional/National R&D units
       – Fund, coordinate and evaluate R&D projects
       – Store, maintain and distribute language resources and R&D outputs
       – Promote the utility of R&D outputs
       – Coordinate and support private industries
       – Coordinate the cooperation of academia and industry
       – Promote/attract international investments in HLT industries
  17. Conclusion
     ● We have 85 living languages
     ● All have speakers who need information and have the right to get it in a language and a way they understand
       – HLT is the way to realize it
     ● We need to have a strategy to put it in place
       – Cooperation across:
         ● Time: past -> present -> future
         ● Language
         ● Research area
         ● Sector: academic <-> industry
