Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and Solutions - Nizar Habash

Arabic Natural Language Processing:
Challenges and Solutions
التحليل الآلي للغة العربية: تحديات وحلول
Grammarly Invited Talk
March 26, 2019
Prof. Nizar Habash
New York University Abu Dhabi
nizar.habash@nyu.edu
NYUAD
CAMeLLab

New York University
The Global Network University

• http://nyuad.nyu.edu/en/

• Students from all over the world
– 1300 students, 120 nationalities
– 15% UAE, 15% American, 70% everywhere else

• Liberal Arts University
– Four Divisions: Science, Engineering, Social
Science, Arts and Humanities
– 20 majors and many minors
– Interdisciplinarity strongly encouraged
• Computer Science
– Undergraduate and PhD programs
– PhD through NYU New York

CAMeL Lab
• Computational Approaches to Modeling Language
• http://camel-lab.com
• Research Areas
– Arabic Artificial Intelligence
– Core Natural Language Processing
• Orthography, morphology, syntax, and semantics
– Dialectal modeling
– Machine translation
– Pedagogical applications
– Dialogue systems

The CAMeLeers
• Nasser Zalmout, PhD Student, NYU
• Dima Taji, PhD Student, NYU
• Alberto Chierici, PhD Student, NYU
• Alex Erdmann, PhD Student, Ohio State
• Salam Khalifa, Research Assistant
• Fadhl Eryani, Research Assistant
• Ossama Obeid, Research Assistant
• Mai Oudah, Postdoc

Natural Language Processing
• Also known as
– Computational Linguistics
– Language Technologies
– (Language) Artificial Intelligence
• Language Technology is an interdisciplinary field
– Computer science, Linguistics, Cognitive science,
psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest
applications of computer science
– Cryptography
– Machine Translation

• Applications
– Information retrieval
– Machine translation
– Automatic speech recognition & speech synthesis
– Sentiment and emotion analysis
– Dialogue systems & chatting agents
– Optical character recognition
– Automatic Summarization, etc.
• Enabling technologies
– Tokenization
– Part-of-speech tagging
– Syntactic parsing
– Lemmatization
– Word sense disambiguation, etc.

Paradigms for Natural Language Processing
• Rule-based (Intuition-based) Approaches
– Linguists write rules that are applied by the
machines
• Machine Learning Approaches
– Corpus-based, Statistical Approaches
– Machines learn the “rules” from training data
• Machine learning approaches are dominant in
the field
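
The contrast between the two paradigms can be sketched with a toy part-of-speech tagger. This is only an illustration under stated assumptions: the suffix rules, the tiny tagged corpus, and the function names are all hypothetical, and the "learning" is just a most-frequent-tag lookup.

```python
from collections import Counter, defaultdict

def rule_based_tag(word):
    """Rule-based paradigm: hand-written suffix rules (hypothetical)."""
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("s"):
        return "VBZ"
    return "VB"

def train_most_frequent_tag(tagged_corpus):
    """Machine learning paradigm: learn the most frequent tag of each word."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Toy annotated corpus (made up for illustration).
corpus = [("go", "VB"), ("went", "VBD"), ("going", "VBG"),
          ("gone", "VBN"), ("goes", "VBZ"), ("go", "VBP"), ("go", "VB")]
model = train_most_frequent_tag(corpus)

print(rule_based_tag("went"))  # the hand-written rules miss irregular forms
print(model["went"])           # the learned model saw "went" in the data
```

The rules need no data but miss anything their author did not anticipate; the learned model covers only what its annotated data contains.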

What do we need
to help machines learn?
• Data, data and more data!
• Specifically annotated data
Application | Annotated Data Example
Machine Translation | Parallel corpus in two languages, e.g., the UN corpus with English, Arabic, Chinese, Spanish, Russian, French
Sentiment Analysis | A corpus of tweets with tags indicating positive, negative, or neutral
Speech Recognition | A corpus of audio files with their corresponding transcripts
Optical Character Recognition | A corpus of scanned book page images and their corresponding transcripts
Part-of-Speech Tagging | An English corpus with the part-of-speech indicated for each word

Machine Learning vs. Human Learning
• Humans: predisposed for acquiring language
• Machines: not so!
• Developing robust algorithms with appropriate learning
bias for computational linguistics tasks is essential!

Challenges for
Machine Learning Language Technologies
• Size of training data
– More is better!
• Domain and genre sensitivity
– Systems trained on news do not do well on novels
• Quality of annotations
– Why expect good performance if humans do not
agree with each other on the task?
• Developing robust algorithms for machine
learning is essential

Roadmap
• Natural Language Processing
Applications & Paradigms
• (Why) is Arabic hard for NLP?
• Some Arabic NLP solutions
– NYUAD CAMeL Lab

Arabic Script
• A consonantal alphabet
• Written right-to-left
• Letters have contextual variants
• Used to write many languages
besides Arabic: Persian, Kurdish, Urdu,
Pashto, etc.
الخَطُّ العَرَبِيُّ (the Arabic script)

Arabic Script
• Arabic script uses a set of optional diacritics
– Only 1.5% of written words have at least one diacritic
• Undiacritized Standard Arabic words are
ambiguous out of context
Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
Gemination: بّ /bb/
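
Because the diacritics are separate combining characters, removing them (which yields the usual undiacritized written form) is a simple character filter. A minimal sketch, assuming the eight marks listed above correspond to the Unicode range U+064B to U+0652; the function name is illustrative.

```python
import re

# Fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove the optional Arabic diacritics, keeping the base letters."""
    return DIACRITICS.sub("", text)

# The letter beh (U+0628) followed by a fatha (U+064E) reduces to the bare letter.
print(strip_diacritics("\u0628\u064E") == "\u0628")
```

Going the other way (restoring the diacritics) is the hard direction, because the undiacritized form is ambiguous out of context.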

اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب
مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.
Translation: Spain denies freezing the aid granted to Morocco. Madrid 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not stopped the aid it provides to Morocco, contrary to what Moroccan Minister of Foreign Affairs and Cooperation Mohamed Benaissa affirmed yesterday, Wednesday, before the Moroccan House of Representatives. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco has never stopped and has not been frozen.
[The same passage, fully diacritized]

Orthographic Ambiguity
• Arabic words can be very ambiguous due to optional
diacritics
• But how ambiguous?
• Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels
– Not exactly true
• Long vowels are always written
• Initial vowels are represented by an ‫ا‬ ‘Alif’
• Some final short vowels are deterministically inferable
ths is wht an Arbc txt lks lik wth no vwls
• For a computer …
– A word on average has 12.3 analyses, 6.8 diacritizations,
and 2.7 lemmas (core meanings)
• Not all of this ambiguity is due to orthography! More on this later.
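
The English analogy above can be reproduced with a few lines of code. This is only a sketch of the analogy, not of Arabic orthography: it treats a, e, i, o, u as the "optional" characters, and the `keep_initial` flag models just one of the three caveats (initial vowels are written).

```python
VOWELS = set("aeiou")

def drop_vowels(sentence: str, keep_initial: bool = False) -> str:
    """Drop vowels from each word, optionally keeping a word-initial vowel."""
    words = []
    for word in sentence.split():
        kept = [ch for i, ch in enumerate(word)
                if ch.lower() not in VOWELS or (keep_initial and i == 0)]
        words.append("".join(kept))
    return " ".join(words)

text = "this is what an Arabic text looks like with no vowels"
print(drop_vowels(text))                     # ths s wht n rbc txt lks lk wth n vwls
print(drop_vowels(text, keep_initial=True))  # ths is wht an Arbc txt lks lk wth n vwls
```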

Spelling Errors
• The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very
high proportion (30%) of words in unedited Standard Arabic comments on
Aljazeera contain errors.
– 2 million words were manually corrected to create training data.
• Arabic spelling errors are a big challenge to language technologies
– GIGO: Garbage In, Garbage Out
– Errors in Standard Arabic
– Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to allow for correcting errors
or functioning well despite them.

Morphological Complexity
• Arabic is morphologically rich
– A core word has many inflected forms
– Example: Arabic Verbs have 5,400 forms
Gender(2), Number(3), Person(3), Aspect(3), Tense particle (2),
Mood(3), Voice(2), Pronominal clitic(12), Conjunction clitic(3)
‫وسنقولها‬
/wasanaqūluhā/
‫و‬+‫س‬+‫ن‬+‫قول‬+‫ها‬
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it
،َ‫قالوا،قلت‬ ،‫قاال‬ ،‫قالت‬ ،‫قال‬
‫قلتن‬ ،‫قلتم‬ ،‫قلتما‬ ،‫ي‬‫ت‬‫قل‬،
‫تقول‬ ،‫يقل‬ ،َ‫ل‬‫يقو‬ ،‫يقول‬،َ‫ل‬‫تقو‬ ،
‫تقولي‬ ،‫تقولين‬ ،‫تقل‬،
...‫فقاال‬ ،‫فقالت‬ ،‫فقال‬...
...،‫وسأقولها‬‫وسنقولها‬،...

Morphological Complexity
• English is not morphologically rich.
– The number of inflected forms is small
– The verb paradigm is limited to 6
– The complete English part-of-speech tag set
has 48 tags
– The complete Arabic part-of-speech tag set
has 22,400 tags
VB VBD VBG VBN VBP VBZ
go went going gone go goes

Morphological Ambiguity
• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
– Optional diacritics
– Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
– Example: وبادلتها
• وَ+بِ+أَدِلَّة+ها: and with her pieces of evidence
• وَ+بادَلت+ها: and I exchanged with her
• Derivational ambiguity and homonymy
– العَين: the eye, the water spring, Al-Ain city, the notable
– المُحتَلّ: occupier, occupied (العدو المحتل / الوطن المحتل / الدول المحتلة)

Morphological Annotation
وقد كاتبته فتحية لمدة سنتين.
Fathia corresponded with her for two years.
Word | Lemma | Segmentation | POS | Features | Gloss
وقد | قَد | و+قد | Noun | Masc Sg | size/physique
 | قَد | و+قد | Particle | | may/might/has/have
 | وَقد | وقد | Noun | Masc Sg | fuel/burning
 | وَقَّد | وقد | Verb | Perf 3rd Masc Sg | kindle/ignite
 | وَقَد | وقد | Verb | Perf 3rd Masc Sg | ignite/burn
كاتبته | كاتِب | كاتبة+ه | Noun | Fem Sg | author/writer/clerk
 | كاتَب | كاتبت+ه | Verb | Perf 3rd Fem Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 1st Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 2nd Fem Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 2nd Masc Sg | correspond_with
فتحية | تَحِيَّة | ف+تحية | Noun | Fem Sg | greeting/salute
 | فَتحِيَّة | فتحية | Proper | Fem Sg | Fathia
لمدة | مُدَّة | ل+مدة | Noun | Fem Sg | interval/period
سنتين | سِنت | سنتين | Noun | Masc Du | cent
 | سَنَة | سنتين | Noun | Fem Du | year
. | . | . | Punc | | .
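
Per-word ambiguity figures such as "12.3 analyses and 2.7 lemmas per word" are averages over analysis tables like the one above. A minimal sketch of that computation on this one sentence: the Latin transliterations of the words and lemmas are illustrative labels, and counting a lemma as a distinct (lemma, POS) pair is an assumption.

```python
# Candidate analyses per word, as (lemma, POS) pairs taken from the table above.
ANALYSES = {
    "wqd": [("qad", "noun"), ("qad", "particle"), ("waqd", "noun"),
            ("waqqad", "verb"), ("waqad", "verb")],
    "kAtbth": [("kAtib", "noun"), ("kAtab", "verb"), ("kAtab", "verb"),
               ("kAtab", "verb"), ("kAtab", "verb")],
    "ftHyp": [("taHiyyap", "noun"), ("fatHiyyap", "proper")],
    "lmdp": [("muddap", "noun")],
    "sntyn": [("sint", "noun"), ("sanap", "noun")],
}

def ambiguity_stats(analyses):
    """Return (average analyses per word, average distinct lemmas per word)."""
    n_words = len(analyses)
    n_analyses = sum(len(cands) for cands in analyses.values())
    n_lemmas = sum(len(set(cands)) for cands in analyses.values())
    return n_analyses / n_words, n_lemmas / n_words

print(ambiguity_stats(ANALYSES))  # (3.0, 2.4) for this five-word sentence
```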

Arabic and its Dialects
• Arabic has ~360M speakers
• Forms of Arabic
– Classical Arabic (CA)
• Classic historical and liturgical texts
– Modern Standard Arabic (MSA)
• News media & formal speeches and settings
• Only written standard
– Dialectal Arabic (DA)
• Predominantly spoken vernaculars
• No written standards
• Very common on social media
• Diglossia
– Two forms of the language (MSA & DA) exist side by side

Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
No one’s native language
• Regional Dialects
– Egyptian Arabic (EGY)
– Levantine Arabic (LEV)
– Gulf Arabic (GLF)
– North African Arabic (NOR): Moroccan, Algerian, Tunisian
– Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
– City, Rural, Bedouin

Phonological Variations
• Major variants
Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /ð/ | /ð/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /ʒ/

Spelling Inconsistency
Egyptian Arabic word
مابيقولهاش
/mabiʔulhāʃ/
“he does not say it”
If there is no standard, can a word be misspelled?

Lexical and Phonological Variation
You say to-MAY-to, I say to-MAH-to!

Words for “tomato” across Arab cities
Arabic spellings: بندورة، توماطيش، طماط، طماطة، طماطم، طماطمة، طماطيس، قوطة، مطيشة، طوماطيش
Pronunciation | Cities
b a n a d oo r a | ALE, DAM
b a n a d uu r a | BEI, AMM, JER
b a n d oo r a | AMM, JER, SAL
t uu m aa t. ii sh | FES
t. a m aa t. | SAN, MUS
t. e m aa t. | DOH
t a m aa t a | BAG, BAS, MOS
t. a m aa t. i m | JED, RIY, SAN, KHA, ALX, ASW
t. m aa t. i m | SFA, TUN, BEN, TRI
g uu t. a | JED
2 uu t. a | CAI
t. o m a t. ii sh | ALG
t. a m aa t. ii s | SAN
m a t. ii sh a | FES, RAB
t. a m aa t. m a | MUS

Morphological Variation
• Some aspects of words are simplified in the dialects
– Loss of case marking
kitaabu, kitaaba, kitaabi, kitaabun, kitaaban, kitaabin → kitaab
كتابُ، كتابَ، كتابِ، كتابٌ، كتاباً، كتابٍ → كتاب
– Consolidation of masculine and feminine plurals
yaktubuun, yaktubuu, yaktubna → yiktibu || yikitbuun
يكتبون، يكتبوا، يكتبن → يكتبوا || يكتبون
• Other aspects increase in complexity!

Morphological Variation
Verb Morphology
[Diagram labels: conj, neg, tense, verb, subj, object, IOBJ, neg]
MSA
ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him
EGY
وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not
And you didn’t write it for him
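
The two segmented lines can be paired morpheme by morpheme with their glosses, which makes the difference visible: Egyptian packs into one word what MSA spreads over three. A small sketch using the "+"-delimited notation of the slide; the `align` helper is illustrative.

```python
def align(morphs: str, glosses: str):
    """Pair each '+'-separated morpheme with its gloss, word by word."""
    pairs = []
    for word, gloss in zip(morphs.split(), glosses.split()):
        m_parts, g_parts = word.split("+"), gloss.split("+")
        if len(m_parts) != len(g_parts):
            raise ValueError(f"cannot align {word!r} with {gloss!r}")
        pairs.extend(zip(m_parts, g_parts))
    return pairs

msa = align("wa+lam taktubū+hā la+hu", "and+not_past write_you+it for+him")
egy = align("wi+ma+katab+tu+ha+lū+ʃ", "and+not+wrote+you+it+for_him+not")

print(len(msa), "morphemes in 3 words (MSA)")
print(len(egy), "morphemes in 1 word (EGY)")
for morph, gloss in egy:
    print(f"{morph}\t{gloss}")
```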

Challenges to Arabic NLP
Challenge | Arabic | English
Orthographic ambiguity | More | Less
Orthographic inconsistency | More | Less
Morphological complexity | More | Less
Dialectal variation | More | Less
Example: وبعقدنا
• وبِعَقِّدنا: and he stresses us out
• وبِعَقدِنا: and with our contract
• وبِعِقدِنا: and with our necklace
• وبِعُقَدِنا: and with our psychoses

Comparing Performance
• SOTA Part-of-Speech Tagging and Syntax Parsing
Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
– Large gap between English and Arabic, and between
Standard Arabic and Arabic dialects
– More resources and more research efforts for English
compared to Arabic.
Task | English | Standard Arabic | Egyptian Arabic
Full Part-of-Speech | 97.6% | 85.4% | 75.5%
Core Part-of-Speech | – | 96.1% | 91.1%
Dependency Syntax | 92.2% | 86.2% | –

Comparing Performance
• Machine Translation
– Quality of machine translation from MSA is much better than
in the dialects
– The main reason is availability of parallel corpora
• 150 million words of parallel Standard Arabic-English text compared
to 1.5 million words of Dialect-English text (Zbib et al., 2012)
Variety | Arabic Source Text | Google Translate (Oct 17, 2018)
MSA | من فضلك لا تكلمني | Please do not talk to me
EGY | انت متكلمنيش خالص | You are pure Mtkmlnish
MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
LEV | شكلو مفيش كهربا، ليش هيك؟ | Shaku Mfish electrified, why not heck?
IRQ | شو ماكو كهرباء، خير؟ | Xu Mako electricity, okay?

Roadmap
• Natural Language Processing
Applications & Paradigms
• (Why) is Arabic hard for NLP?
• Some Arabic NLP solutions
– NYUAD CAMeL Lab

MADAMIRA
http://camel.abudhabi.nyu.edu/madamira/
• State-of-the-art Arabic and Arabic Dialect Processing
tool (Pasha et al., 2014)
– Full Morphological disambiguation
– Hybrid
• Rule-based analyzer dictionaries
• Machine learning disambiguation
• Current release: Standard Arabic and Egyptian Arabic
• Under construction: Palestinian, Syrian, Moroccan,
Yemeni, Gulf
• Neural Extensions (Zalmout et al. 2017; 2018)

[Figure: MADAMIRA processing pipeline. Each word W0 is processed within its context window (W-4 … W4).]
• Morphological analyzer
– Rule-based
– Human-created
• Morphological classifiers
– Multiple independent classifiers
– Corpus-trained
• Ranker
– Heuristic or corpus-trained
– Orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th)
(Habash & Rambow, 2005; Roth et al., 2008; Pasha et al., 2014; Zalmout & Habash, 2017, 2018)
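
The analyze-then-rank idea can be sketched in a few lines. This is a toy illustration, not MADAMIRA's code or API: the one-entry analyzer dictionary, the feature names, and the scoring (count of features that agree with the classifier predictions) are all assumptions made for the example.

```python
from typing import Dict, List

Analysis = Dict[str, str]

# Hypothetical rule-based analyzer: surface form -> out-of-context candidates.
ANALYZER: Dict[str, List[Analysis]] = {
    "wkAtbh": [
        {"diac": "wakAtabahu", "pos": "verb", "gloss": "and he corresponded with him"},
        {"diac": "wakAtibuhu", "pos": "noun", "gloss": "and his writer"},
    ],
}

def analyze(word: str) -> List[Analysis]:
    """Return every candidate analysis the analyzer knows for the word."""
    return list(ANALYZER.get(word, []))

def rank(candidates: List[Analysis], predicted: Dict[str, str]) -> List[Analysis]:
    """Order candidates by how many classifier-predicted features they match."""
    def score(analysis: Analysis) -> int:
        return sum(1 for feat, value in predicted.items()
                   if analysis.get(feat) == value)
    return sorted(candidates, key=score, reverse=True)

# Stand-in for the corpus-trained classifiers: here the prediction is given.
predicted_in_context = {"pos": "noun"}
best = rank(analyze("wkAtbh"), predicted_in_context)[0]
print(best["diac"], "-", best["gloss"])
```

The split mirrors the hybrid design: the analyzer proposes only linguistically valid analyses, and the learned components choose among them in context.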

MADAMIRA
Demo: http://camel.abudhabi.nyu.edu/madamira/

MADAMIRA
Morphological Disambiguation
Task | System: MSA, Test: MSA | System: MSA, Test: EGY | System: EGY, Test: EGY
Full Analysis | 84.3% | 27.0% | 75.4%
Diacritization | 86.4% | 32.2% | 83.2%
Lemmatization | 96.1% | 67.1% | 86.3%
Base POS-tagging | 96.1% | 82.1% | 91.1%
Segmentation | 99.1% | 90.5% | 97.4%
Example analysis: وكاتبه (wkAtbh), “and his writer”
Diacritized form: wakAtibuhu
Lemma: kAtib_1
Features: pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms
Segmentation: w+ kAtb +h
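
The feature string in the example analysis is a sequence of `name:value` pairs, so it can be loaded into a dictionary for downstream use. A minimal sketch that assumes only the format visible on the slide; the function name is illustrative.

```python
def parse_features(line: str) -> dict:
    """Split a whitespace-separated list of name:value pairs into a dict."""
    return dict(item.split(":", 1) for item in line.split())

features = parse_features(
    "pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na "
    "vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)
print(features["pos"], features["prc2"], features["enc0"])
```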

Neural MADAMIRA
• Zalmout et al. (EMNLP 2017, NAACL 2018)
– Neural implementation for MADAMIRA
• 4.4% absolute increase over the state-of-the-art in full
morphological analysis accuracy on all words
• 10.6% absolute increase for out-of-vocabulary words

Automatic Arabic Spelling Correction
• Neural models for Arabic spelling correction gave state-of-the-art results
– QALB shared task data 2014, 2015
– 1 million words of training data
– Using word and character narrow embeddings (+/-2) in a seq-to-seq model did best.
(Watson, Zalmout and Habash, 2018)

CODA
A Conventional Orthography
for Dialectal Arabic
• Developed for computational processing purposes
(Habash et al, 2012)
• Objectives
– CODA covers all Arabic dialects in principle
– CODA minimizes differences in choices
– CODA is easy to learn and produce consistently
– CODA is intuitive to readers unfamiliar with it
– CODA uses Arabic script
• Started with manuals for Egyptian, Tunisian, Levantine,
Algerian, and Gulf
• CODA* : CODA for 28 different city dialects (LREC 2018)
• http://coda.camel-lab.com/

CODA Examples
CODA: ماشفتش صحابي الفترة اللي قبل الامتحانات
Gloss: I did not see | my friends | the period | which | before | the exams
Spelling variants, per word:
• ماشفتش: ما شفتش، مشفتش، ما شوفتش، ماشوفتش، مشوفتش، mashoftish
• صحابي: صحابى، صوحابي، صوحابى، Su7abi, sohaby
• الفترة: الفتره، الفطرة، الفطره، ilftra
• اللي: اللى، إللي، إللى، الي، الى، إلي، إلى، illi
• قبل: أبل، ابل، abl, qbl, qabl
• الامتحانات: الإمتحانات، المتحانات، الامتحنات، الإمتحنات، المتحنات، ilimti7anat, limtihanaat
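
One simple way to use such variant lists is a lookup table that maps each attested spelling to its CODA form. The sketch below is a toy illustration with a handful of the variants above; it is not how CODA conversion is actually done, and a plain lookup cannot handle unseen spellings or context-dependent choices.

```python
# CODA form -> some of the spelling variants listed above (subset, illustrative).
VARIANTS = {
    "ماشفتش": ["مشفتش", "ماشوفتش", "مشوفتش", "mashoftish"],
    "صحابي": ["صحابى", "صوحابي", "Su7abi", "sohaby"],
    "الفترة": ["الفتره", "الفطرة", "ilftra"],
    "اللي": ["اللى", "إللي", "illi"],
    "قبل": ["أبل", "ابل", "abl", "qbl", "qabl"],
    "الامتحانات": ["الإمتحانات", "ilimti7anat", "limtihanaat"],
}

# Reverse index: variant spelling -> CODA form.
TO_CODA = {variant: coda for coda, forms in VARIANTS.items() for variant in forms}

def to_coda(text: str) -> str:
    """Replace each known variant by its CODA form; leave other tokens as-is."""
    return " ".join(TO_CODA.get(token, token) for token in text.split())

print(to_coda("mashoftish sohaby ilftra illi abl limtihanaat"))
```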

SAMER Project
• Simplification of Arabic Masterpieces for Extensive
Reading
– Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
– NYUAD Research Enhancement Fund
– Collaboration with the UAE Ministry of Education
• Objectives
– Create a standard for the simplification of modern fiction in
Arabic for school-age learners.
– Develop a tool for automating readability scale grading for
Arabic
– Simplify a number of Arabic fiction masterpieces

SAMER Readability Prediction
• Large L1 corpus AND L2 corpus compared to previous Arabic Readability studies
• Sweeping and systematic feature engineering and comparison
• State-of-the-art tools tailored for Modern Standard Arabic
• Exploring L1 and L2 performance within the same consistent feature framework
• Leveraging L1 resources for L2 performance improvement

Full Feature Breakdown (146 feats)

[Figure: feature breakdown table. Callouts: more detailed features obtained (clitics, person, gender, number, aspect (V), case (N)); dependency parse; tree depth]

SAMER Simplification Interface

MADAR Project
• Multi-Arabic Dialect Applications and
Resources
• Collaboration among CMUQ, NYUAD and
Columbia
– Nizar Habash, Houda Bouamor, Kemal Oflazer and
Owen Rambow
• Modeling 25 Arabic city dialects
– Lexical resources, parallel data, dialect
identification, and dialect machine translation
• http://madar.camel-lab.com
• http://adida.abudhabi.nyu.edu

Fine Grained Dialect Identification
• Salameh, Bouamor and Habash, 2018 (COLING)
• Best results (Accuracy)
• Demo: http://adida.abudhabi.nyu.edu
System | 6-Label Test | 26-Label Test
Baseline: Character 5-gram language model | 92.7% | 64.7%
Multinomial Naïve Bayes, Character/Word 5-gram language model | 93.6% | 67.5%
Multinomial Naïve Bayes, Character/Word 5-gram language model + Corpus-6 Classifier Probability | – | 67.9%
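
The general technique (Multinomial Naïve Bayes over character n-grams) can be sketched in pure Python. This is a simplified stand-in, not the system of Salameh et al.: it uses character trigram counts with add-one smoothing instead of 5-gram language model features, and its "training set" is just the five example sentences from the machine translation slide.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Character n-grams of the text, padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(examples):
    """Collect per-label n-gram counts, document counts, and the vocabulary."""
    counts, docs = defaultdict(Counter), Counter()
    for label, text in examples:
        counts[label].update(char_ngrams(text))
        docs[label] += 1
    vocab = {gram for label_counts in counts.values() for gram in label_counts}
    return counts, docs, vocab

def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        score = math.log(docs[label] / total_docs)  # log prior
        for gram in char_ngrams(text):
            score += math.log((label_counts[gram] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (punctuation removed); far too small for real use.
TRAIN = [
    ("MSA", "من فضلك لا تكلمني"),
    ("MSA", "لا يوجد كهرباء ماذا حدث"),
    ("EGY", "انت متكلمنيش خالص"),
    ("LEV", "شكلو مفيش كهربا ليش هيك"),
    ("IRQ", "شو ماكو كهرباء خير"),
]
MODEL = train(TRAIN)
print(classify("ماكو كهرباء", MODEL))
```

Character n-grams suit this task because dialect cues are often sub-word (affixes and function-word fragments) and dialect spelling is inconsistent.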

Can we do better? Yes, with more input!
• How many words are needed to guarantee an optimal
classification into a certain dialect?
• ~90% with 16 words!
• Almost 2 sentences
in Corpus-26
• ~100% with 51 words!
• Almost 7 sentences
• We are currently preparing a competition on dialect ID of
Twitter users.

Summary
• Arabic poses many challenges to AI/NLP
– Orthographic ambiguity
– Morphological complexity
– Enormous variety
– Annotated resource poverty
• There has been a lot of work on Arabic and
Arabic dialect technologies.
– But more is needed still…

Future Directions
• More Arabic varieties
– We plan to continue working on new dialects and new
Arabic domains
– New data sets
– New algorithms for supporting low resource languages
• More tools
– We are developing an open source suite called CamelTools
to support Arabic processing
• More interdisciplinary collaborations
– We are proposing a center on Human-centered AI at
NYUAD that brings together researchers from computer
science, digital humanities, language pedagogy, history,
and sociology, as well as industrial partners.

• http://nyuad.nyu.edu/en/
Thank You!
Questions?
