Natural Language Processing for Amazigh Language

© Fadoua Ataa Allah and Siham Boulaknadel


  1. Natural Language Processing for Amazigh Language: Challenges and Future Directions
       Fadoua Ataa Allah, Siham Boulaknadel
       CEISIC, IRCAM
       {ataaallah, boulaknadel}@ircam.ma
  2. Outline (LREC-2012: SALTMIL-AfLaT Workshop)
       • Amazigh Language
       • Amazigh Complexity in NLP
       • State of the Technology on Amazigh
       • Future Directions
  3. Amazigh language: Sociolinguistic Context
       • North African autochthonous language.
       • Spoken by millions of people, in the form of dialects.
  4. Amazigh language: Sociolinguistic Context (10/07/2012)
       • Languages of Morocco:
           • Classical Arabic is an official language.
           • Amazigh has been an official language since 2011.
           • Moroccan Arabic (Darija) stands in a diglossic relationship with Classical Arabic.
           • French is the first foreign language.
           • Spanish is used in the north of Morocco.
           • English is becoming the second foreign language.
  5. Amazigh language: History
       • Amazigh abjad:
           • Tifinagh has been attested for 25 centuries.
           • Its written form has kept changing, from the traditional Tuareg writing to Tifinaghe-IRCAM.
       [Figure: Tinzouline Inscriptions (Zagora, Morocco)]
  6. Amazigh language: History, writing direction
       [Figure: Plate 9, Anou Elias, Mammanet Valley (Niger). Source: Henri Lhote, Oued Mammanet gravures. Les Nouvelles Editions Africaines, 1979.]
  7. Moroccan Amazigh characteristics: Amazigh writing system
       • Direction: horizontal, from left to right.
       • Alphabet:
           • 27 consonants: ⴱ, ⴳ, ⴳⵯ, ⴷ, ⴹ, ⴼ, ⴽ, ⴽⵯ, ⵀ, ⵃ, ⵄ, ⵅ, ⵇ, ⵊ, ⵍ, ⵎ, ⵏ, ⵔ, ⵕ, ⵖ, ⵙ, ⵚ, ⵛ, ⵜ, ⵟ, ⵣ, ⵥ;
           • 2 semi-consonants: ⵢ and ⵡ;
           • 4 vowels: ⴰ, ⵉ, ⵓ, ⴻ.
       • Punctuation marks: conventional signs, including “ ” (space), “.”, “,”, “;”, “:”, “?”, “!”, “…”, etc.
       • Numerals: Hindu-Arabic numerals [0-9].
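
The letter inventory on this slide can be expressed as a small lookup table. The following is a minimal Python sketch, not a tool from the presentation: the names `split_letters` and `classify` are hypothetical, and it assumes that the labialized consonants ⴳⵯ and ⴽⵯ are encoded as a base letter followed by the labialization mark (U+2D6F).

```python
# Tifinaghe-IRCAM inventory as listed on the slide.
CONSONANTS = (
    "ⴱ ⴳ ⴳⵯ ⴷ ⴹ ⴼ ⴽ ⴽⵯ ⵀ ⵃ ⵄ ⵅ ⵇ ⵊ ⵍ ⵎ ⵏ ⵔ ⵕ ⵖ ⵙ ⵚ ⵛ ⵜ ⵟ ⵣ ⵥ"
).split()
SEMI_CONSONANTS = ["ⵢ", "ⵡ"]
VOWELS = ["ⴰ", "ⵉ", "ⵓ", "ⴻ"]

# Assumption: labialization is a separate code point (U+2D6F)
# that attaches to the preceding letter.
LABIALIZATION = "ⵯ"


def split_letters(word):
    """Split a word into letters, keeping labialized consonants whole."""
    letters = []
    for ch in word:
        if ch == LABIALIZATION and letters:
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def classify(letter):
    """Return the class of a single letter from the inventory."""
    if letter in CONSONANTS:
        return "consonant"
    if letter in SEMI_CONSONANTS:
        return "semi-consonant"
    if letter in VOWELS:
        return "vowel"
    return "other"


if __name__ == "__main__":
    for letter in split_letters("ⵜⴰⵏⵎⵉⵔⵜ"):
        print(letter, classify(letter))
```

Keeping the labialized consonants as single units means that letter counts match the 27 + 2 + 4 inventory rather than the number of code points.
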
  8. Amazigh Complexity in NLP
       • Different writing forms
       • Complex phonological and phonetic systems
       • Rich morphology
  9. Amazigh Complexity in NLP: Amazigh script
       • Converting existing written forms into ‘Tifinaghe – Unicode’ is confronted with:
           • Spelling variation related to regional varieties ([tfucht] vs. [tafukt] (sun)),
           • Spelling variation based on the use or the elimination of spaces within or between words ([tadartino] vs. [tadart ino] (my house)),
           • Arabic or Latin transcription systems.
  10. Amazigh Complexity in NLP: Phonology and phonetics
       • The main problem in Amazigh phonology and phonetics lies in allophones, e.g. /ll/, which is realized as [dj] in the North.
  11. Amazigh Complexity in NLP: Morphology
       • Highly inflected language.
       • Word structure: Prefix + Stem + Suffix.
       • Affix set: prefixes, infixes, and suffixes.
       • The base form varies with paradigms: qqim → svim (make sit).
  12. State of the Amazigh technology
       • Tifinaghe Encoding
       • Optical character recognition
       • Fundamental processing tools
       • Language resources
  13. State of the Amazigh technology: Tifinaghe Encoding
       • ANSI
       • Unicode
  14. State of the Amazigh technology: OCR
       • Amazigh OCR systems:
           • System focused on isolated printed characters, based on a syntactic approach using finite automata.
           • Global approach based on Hidden Markov Models for recognizing handwritten characters.
           • Method using invariant moments for recognizing printed script.
           • System based on an artificial neural network to recognize printed characters.
  15. State of the Amazigh technology: Fundamental processing
       • Transliterator
       • Tagging assistance tool
       • Light stemmer
       • Search engine
       • Concordancer
  16. State of the Amazigh technology: Fundamental processing
       • Transliterator
       [Diagram labels: Arabic script; Tifinaghe; Latin script; Converter; Unicode Tifinaghe; Latin; Transliterator]
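
A transliterator of this kind can be approximated with a character mapping. The sketch below is illustrative only and is not the IRCAM tool: `LATIN_TO_TIFINAGH` covers just a handful of letters, the function names are hypothetical, and characters outside the table are passed through unchanged. A complete table would also need multi-character entries (for example the labialized consonants), which a plain one-to-one mapping does not handle.

```python
# Partial, illustrative Latin -> Tifinaghe table (assumption: a small subset
# chosen for the example, not an official transliteration standard).
LATIN_TO_TIFINAGH = {
    "a": "ⴰ", "b": "ⴱ", "d": "ⴷ", "f": "ⴼ", "i": "ⵉ",
    "k": "ⴽ", "l": "ⵍ", "m": "ⵎ", "n": "ⵏ", "r": "ⵔ",
    "s": "ⵙ", "t": "ⵜ", "u": "ⵓ",
}
# Reverse table, valid because the mapping above is one-to-one.
TIFINAGH_TO_LATIN = {tif: lat for lat, tif in LATIN_TO_TIFINAGH.items()}


def latin_to_tifinagh(text):
    """Map each known Latin letter to Tifinaghe; keep everything else."""
    return "".join(LATIN_TO_TIFINAGH.get(ch, ch) for ch in text)


def tifinagh_to_latin(text):
    """Map each known Tifinaghe letter to Latin; keep everything else."""
    return "".join(TIFINAGH_TO_LATIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = latin_to_tifinagh("tanmirt")
    print(word, tifinagh_to_latin(word))
```
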
  17. State of the Amazigh technology: Fundamental processing
       • Tagging assistance tool
       [Diagram labels: Amazigh raw corpora; Tokenization; Manual POS; Tag set; Manual Stemming; Stem list; Tagged corpus; Validation; Standard output]
  18. State of the Amazigh technology: Fundamental processing
       • Light stemmer
       [Flowchart: Begin → Prefix + Stem + Suffix → find the largest prefix → Stem + Suffix → find the largest suffix → Stem → End]
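
The two steps of the flowchart (strip the largest matching prefix, then the largest matching suffix) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' stemmer: the affix lists are hypothetical placeholders in Latin transliteration, and the `min_stem` guard (which stops an affix from consuming the whole word) is an addition that does not appear on the slide.

```python
def strip_longest(word, affixes, at_start, min_stem=2):
    """Remove the longest matching affix, leaving at least min_stem characters."""
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if at_start else word.endswith(affix)
        if affix and matches and len(word) - len(affix) >= min_stem:
            return word[len(affix):] if at_start else word[:-len(affix)]
    return word


def light_stem(word, prefixes, suffixes, min_stem=2):
    """Prefix + Stem + Suffix -> Stem + Suffix -> Stem."""
    without_prefix = strip_longest(word, prefixes, True, min_stem)
    return strip_longest(without_prefix, suffixes, False, min_stem)


if __name__ == "__main__":
    # Hypothetical affix lists, for illustration only.
    prefixes = ["t", "ta"]
    suffixes = ["t", "in"]
    print(light_stem("tadart", prefixes, suffixes))
```

Sorting the affixes by length before matching is what makes the removal "largest first": with the lists above, `ta` is tried before `t`.
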
  19. State of the Amazigh technology: Fundamental processing
       • Search engine
       [Architecture diagram labels: Web Crawler; Data Crawling; Repository; Indexer; Data Indexing; Natural Language Processing Tools; Index; Query Engine; Data Searching; Natural Language Processing Tools; User Interface]
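
The indexing and searching stages of such an architecture are commonly built around an inverted index. The sketch below is a generic Python illustration, not the system described in the talk: `build_index` and `search` are hypothetical names, tokenization is a plain whitespace split, and the `normalize` parameter stands in for whatever language processing (for example stemming) is applied at both indexing and query time.

```python
from collections import defaultdict


def tokenize(text):
    """Whitespace tokenization (assumption: good enough for a sketch)."""
    return text.lower().split()


def build_index(documents, normalize=lambda token: token):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize=lambda token: token):
    """Return ids of documents that contain every query term."""
    terms = [normalize(token) for token in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {1: "tadart ino", 2: "tafukt", 3: "tadart tafukt"}
    index = build_index(docs)
    print(search(index, "tadart"))
    print(search(index, "tadart tafukt"))
```

Applying the same `normalize` function on both sides is the design point: a query and a document match only if they are reduced to the same index terms.
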
  20. State of the Amazigh technology: Fundamental processing
       • Concordancer
       [Diagram labels: input field; .txt, .doc, .pdf, .zip; Tokenization; List of the text words; Word / expression and their frequency; Context display]
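
The concordancer stages named on this slide (tokenization, word frequencies, context display) can be sketched in a few lines. This is a minimal Python illustration with hypothetical function names, assuming plain-text input and whitespace tokenization; reading .doc, .pdf, or .zip files is outside the scope of the sketch.

```python
from collections import Counter


def tokenize(text):
    """Whitespace tokenization (assumption: input is already plain text)."""
    return text.split()


def frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize("tadart ino tafukt tadart ino")
    print(frequencies(tokens).most_common())
    for left, word, right in concordance(tokens, "tadart"):
        print(f"{left:>20} [{word}] {right}")
```
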
  21. State of the Amazigh technology: Language resources
       • Corpora
       • Dictionary
       • Terminology database
  22. State of the Amazigh technology: Language resources
       • Corpora:
           • General corpus,
           • POS corpus.
  23. State of the Amazigh technology: Language resources
       • Dictionary:
           • Definition,
           • Arabic equivalent words,
           • French equivalent words,
           • English equivalent words,
           • Synonyms,
           • Classification by domains,
           • Derivational families.
  24. State of the Amazigh technology: Language resources
       • Terminology database:
           • Media vocabulary
           • Grammatical vocabulary
  25. Future Directions
       • Building large and representative Amazigh corpora.
       • Developing a machine translation system.
       • Creating a pool of competent human resources.
  26. Thank you for your attention. ⵜⴰⵏⵎⵉⵔⵜ
