Arabic Natural Language Processing: Challenges and Solutions
التحليل الآلي للغة العربية: تحديات وحلول
Grammarly Invited Talk
March 26, 2019
Prof. Nizar Habash
New York University Abu Dhabi
nizar.habash@nyu.edu
New York University
The Global Network University
New York University Abu Dhabi
• http://nyuad.nyu.edu/en/
New York University Abu Dhabi
• Students from all over the world
– 1300 students, 120 nationalities
– 15% UAE, 15% American, 70% everywhere else
New York University Abu Dhabi
• Liberal Arts University
– Four Divisions: Science, Engineering, Social
Science, Arts and Humanities
– 20 majors and many minors
– Interdisciplinarity strongly encouraged
• Computer Science
– Undergraduate and PhD programs
– PhD through NYU New York
CAMeL Lab
• Computational Approaches to Modeling Language
• http://camel-lab.com
• Research Areas
– Arabic Artificial Intelligence
– Core Natural Language Processing
• Orthography, morphology, syntax, and semantics
– Dialectal modeling
– Machine translation
– Pedagogical applications
– Dialogue systems
The CAMeLeers
• Nasser Zalmout, PhD Student, NYU
• Dima Taji, PhD Student, NYU
• Alberto Chierici, PhD Student, NYU
• Alex Erdmann, PhD Student, Ohio State
• Salam Khalifa, Research Assistant
• Fadhl Eryani, Research Assistant
• Ossama Obeid, Research Assistant
• Mai Oudah, Postdoc
OK... back to the talk!
Natural Language Processing
• Also known as
– Computational Linguistics
– Language Technologies
– (Language) Artificial Intelligence
• Language technology is an interdisciplinary field
– Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest
applications of computer science
– Cryptography
– Machine Translation
Natural Language Processing
• Applications
– Information retrieval
– Machine translation
– Automatic speech recognition & speech synthesis
– Sentiment and emotion analysis
– Dialogue systems & chatting agents
– Optical character recognition
– Automatic Summarization, etc.
• Enabling technologies
– Tokenization
– Part-of-speech tagging
– Syntactic parsing
– Lemmatization
– Word sense disambiguation, etc.
Paradigms for
Natural Language Processing
• Rule-based (Intuition-based) Approaches
– Linguists write rules that are applied by the
machines
• Machine Learning Approaches
– Corpus-based, Statistical Approaches
– Machines learn the “rules” from training data
• Machine learning approaches are dominant in
the field
What do we need to help machines learn?
• Data, data and more data!
• Specifically, annotated data
Application: annotated data example
– Machine Translation: a parallel corpus in two languages, e.g., the UN corpus with English, Arabic, Chinese, Spanish, Russian, French
– Sentiment Analysis: a corpus of tweets with tags indicating positive, negative, or neutral
– Speech Recognition: a corpus of audio files with their corresponding transcripts
– Optical Character Recognition: a corpus of scanned book page images and their corresponding transcripts
– Part-of-Speech: an English corpus with the part of speech indicated for each word
Machine Learning vs. Human Learning
• Humans: predisposed for acquiring language
• Machines: not so!
• Developing robust algorithms with appropriate learning bias for computational linguistics tasks is essential!
Challenges for
Machine Learning Language Technologies
• Size of training data
– More is better!
• Domain and genre sensitivity
– Systems trained on news do not do well on novels
• Quality of annotations
– Why expect good performance if humans do not agree with each other on the task?
• Developing robust algorithms for machine
learning is essential
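One common way to quantify the "quality of annotations" point above is an inter-annotator agreement score. The sketch below computes Cohen's kappa for two annotators; the sentiment labels and the function name `cohens_kappa` are illustrative assumptions, not something taken from the talk.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment tags from two annotators on six tweets.
annotator_1 = ["pos", "pos", "neg", "neu", "neg", "pos"]
annotator_2 = ["pos", "neg", "neg", "neu", "neg", "neu"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A kappa well below 1 on a task signals that a model trained on those labels is being asked to reproduce decisions that humans themselves make inconsistently.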
Roadmap
• Natural Language Processing
Applications & Paradigms
• (Why) is Arabic hard for NLP?
• Some Arabic NLP solutions
– NYUAD CAMeL Lab
Arabic Script
• A consonantal alphabet
• Written right-to-left
• Letters have contextual variants
• Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.
الخَطُّ العَرَبِيُّ ("the Arabic script")
Arabic Script
• Arabic script uses a set of optional diacritics
– Only 1.5% of written words have at least one diacritic
• Undiacritized Standard Arabic words are
ambiguous out of context
Diacritic types:
– Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/ (no vowel)
– Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
– Gemination: بّ /bb/
Example news text as normally written, without diacritics:
اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب
مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.
Translation: Spain denies freezing the aid granted to Morocco. Madrid, 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not halted the aid it provides to Morocco, contrary to what the Moroccan Minister of Foreign Affairs and Cooperation, Mohamed Benaissa, stated yesterday, Wednesday, before the Moroccan House of Representatives. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco never stopped and was not frozen.
(The slide then shows the same text with full diacritics.)
Orthographic Ambiguity
• Arabic words can be very ambiguous due to optional
diacritics
• But how ambiguous?
• Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels
– Not exactly true
• Long vowels are always written
• Initial vowels are represented by an ‫ا‬ ‘Alif’
• Some final short vowels are deterministically inferable
ths is wht an Arbc txt lks lik wth no vwls
• For a computer …
– A word on average has 12.3 analyses, 6.8 diacritizations,
and 2.7 lemmas (core meanings)
• Not all of this ambiguity is due to orthography! More on this later.
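To make the orthographic ambiguity concrete, here is a minimal sketch that strips diacritics from three diacritized words and shows that they collapse into the same written form. The three readings of كتب are a standard textbook illustration chosen for this sketch (they do not appear on the slides), and the helper name `dediacritize` is an assumption.

```python
import re

# Arabic diacritics: fathatan through sukun (U+064B-U+0652) plus dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def dediacritize(text):
    """Remove optional diacritics, leaving only the letters."""
    return DIACRITICS.sub("", text)

# Three diacritized words (illustrative) and their glosses.
readings = {
    "\u0643\u064E\u062A\u064E\u0628\u064E": "kataba 'he wrote'",
    "\u0643\u064F\u062A\u0650\u0628\u064E": "kutiba 'it was written'",
    "\u0643\u064F\u062A\u064F\u0628": "kutub 'books'",
}

surface_forms = {dediacritize(word) for word in readings}
print(surface_forms)  # a single undiacritized string shared by all three readings
```

Going in the other direction, from the bare form back to the intended reading, is the hard part and requires context.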
Spelling Errors
• The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high share of words (30%) contain errors in unedited Standard Arabic comments on Aljazeera.
– 2 million words were manually corrected to create training data.
• Arabic spelling errors are a big challenge to language technologies
– GIGO: Garbage In, Garbage Out
– Errors in Standard Arabic
– Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to correct errors or to function well despite them.
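One simple way to "function well despite" common spelling variation is to normalize frequently confused letters before matching or lookup. The sketch below is an assumption-laden illustration rather than the method used in QALB or any CAMeL Lab tool: it collapses the Alif variants, Ta Marbuta/Ha, and Alif Maqsura/Ya that appear as variants elsewhere in this talk.

```python
# Map commonly confused letters onto one canonical letter each (illustrative choice).
NORMALIZATION_TABLE = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064A",  # alif maqsura -> ya
})

def normalize(text):
    """Return text with the spelling variants above collapsed."""
    return text.translate(NORMALIZATION_TABLE)

# Two spellings of "the period": with ta marbuta and with ha.
with_ta_marbuta = "\u0627\u0644\u0641\u062A\u0631\u0629"
with_ha = "\u0627\u0644\u0641\u062A\u0631\u0647"
print(normalize(with_ta_marbuta) == normalize(with_ha))
```

Normalization trades precision for recall: it makes variant spellings match, but it also adds ambiguity by merging words that differ only in these letters.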
Morphological Complexity
• Arabic is morphologically rich
– A core word has many inflected forms
– Example: Arabic Verbs have 5,400 forms
Gender(2), Number(3), Person(3), Aspect(3), Tense particle (2),
Mood(3), Voice(2), Pronominal clitic(12), Conjunction clitic(3)
‫وسنقولها‬
/wasanaqūluhā/
‫و‬+‫س‬+‫ن‬+‫قول‬+‫ها‬
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it
Other forms of the same verb:
قال، قالت، قالا، قالوا، قلتَ، قلتِ، قلتما، قلتم، قلتن،
يقول، يقولَ، يقل، تقول، تقولَ، تقل، تقولين، تقولي،
... فقال، فقالت، فقالا ...
... وسأقولها، وسنقولها ...
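The clitic structure of وسنقولها can be sketched as a small search over candidate segmentations. This is a toy illustration with hypothetical, deliberately tiny clitic lists; it only separates clitics (leaving the inflected verb نقول intact) and it overgenerates, which is why real systems pair an analyzer with a lexicon and a disambiguation step.

```python
# Toy clitic inventories (hypothetical and far from complete).
WA, FA, SA, BI, LI = "و", "ف", "س", "ب", "ل"
HA, HU, HUM, NA = "ها", "ه", "هم", "نا"
PROCLITICS = [WA, FA, SA, BI, LI]
ENCLITICS = [HA, HU, HUM, NA]

def candidate_segmentations(word, min_stem=2):
    """Enumerate ways to peel proclitics and at most one enclitic off a word."""
    def peel_prefixes(rest, prefixes):
        # Yield the current state, then every state reachable by one more proclitic.
        yield prefixes, rest
        for p in PROCLITICS:
            if rest.startswith(p) and len(rest) - len(p) >= min_stem:
                yield from peel_prefixes(rest[len(p):], prefixes + [p])

    results = []
    for prefixes, rest in peel_prefixes(word, []):
        results.append(prefixes + [rest])
        for e in ENCLITICS:
            if rest.endswith(e) and len(rest) - len(e) >= min_stem:
                results.append(prefixes + [rest[:-len(e)], e])
    return results

stem = "نقول"  # inflected verb "we say"
word = WA + SA + stem + HA
candidates = candidate_segmentations(word)
for segmentation in candidates:
    print("+".join(segmentation))
```

Only one of the printed candidates is the intended and+will+we-say+it reading; choosing it requires knowing which stems are real words and what the context calls for.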
Morphological Complexity
• English is not morphologically rich.
– The number of inflected forms is small
– The verb paradigm is limited to 6 forms:
VB VBD VBG VBN VBP VBZ
go went going gone go goes
– The complete English part-of-speech tag set has 48 tags
– The complete Arabic part-of-speech tag set has 22,400 tags
Morphological Ambiguity
• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
– Optional diacritics
– Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
– Example: وبادلتها
وَ+بِ+أَدِلَّة+ها: and with her pieces of evidence
وَ+بادَلتُ+ها: and I exchanged with her
• Derivational ambiguity and homonymy
– العَين: the eye, the water spring, Al-Ain city, the notable
– المُحتل: occupier, occupied (العدو المحتل "the occupying enemy" / الوطن المحتل "the occupied homeland" / الدول المحتلة "the occupied countries")
Morphological Annotation
وقد كاتبته فتحية لمدة سنتين.
Fathia corresponded with him for two years.
Word | Lemma | Segmentation | POS | Features | Gloss
وقد | قَد | و+قد | Noun | Masc Sg | size/physique
وقد | قَد | و+قد | Particle | | may/might/has/have
وقد | وَقد | وقد | Noun | Masc Sg | fuel/burning
وقد | وَقَّد | وقد | Verb | Perf 3rd Masc Sg | kindle/ignite
وقد | وَقَد | وقد | Verb | Perf 3rd Masc Sg | ignite/burn
كاتبته | كاتِب | كاتبة+ه | Noun | Fem Sg | author/writer/clerk
كاتبته | كاتَب | كاتبت+ه | Verb | Perf 3rd Fem Sg | correspond_with
كاتبته | كاتَب | كاتبت+ه | Verb | Perf 1st Sg | correspond_with
كاتبته | كاتَب | كاتبت+ه | Verb | Perf 2nd Fem Sg | correspond_with
كاتبته | كاتَب | كاتبت+ه | Verb | Perf 2nd Masc Sg | correspond_with
فتحية | تَحِيَّة | ف+تحية | Noun | Fem Sg | greeting/salute
فتحية | فَتحِيَّة | فتحية | Proper | Fem Sg | Fathia
لمدة | مُدَّة | ل+مدة | Noun | Fem Sg | interval/period
سنتين | سِنت | سنتين | Noun | Masc Du | cent
سنتين | سَنَة | سنتين | Noun | Fem Du | year
. | . | . | Punc | | .
Arabic and its Dialects
• Arabic has ~360M speakers
• Forms of Arabic
– Classical Arabic (CA)
• Classic historical and liturgical texts
– Modern Standard Arabic (MSA)
• News media & formal speeches and settings
• Only written standard
– Dialectal Arabic (DA)
• Predominantly spoken vernaculars
• No written standards
• Very common on social media
• Diglossia
– Two forms of the language (MSA & DA) exist side by side
Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
No one’s native language
• Regional Dialects
– Egyptian Arabic (EGY)
– Levantine Arabic (LEV)
– Gulf Arabic (GLF)
– North African Arabic (NOR): Moroccan, Algerian, Tunisian
– Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
– City, Rural, Bedouin
Phonological Variations
• Major variants
Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /ð/ | /ð/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /ʒ/
Spelling Inconsistency
Egyptian Arabic word
مابيقولهاش
/mabiʔulhāʃ/
"he does not say it"
If there is no standard, can a word be misspelled?
Lexical and Phonological Variation
You say to-MAY-to, I say to-MAH-to!
Lexical and Phonological Variation
Words for "tomato" across Arabic city dialects (word: pronunciation, city codes):
بندورة: b a n a d oo r a (ALE, DAM); b a n a d uu r a (BEI, AMM, JER); b a n d oo r a (AMM, JER, SAL)
توماطيش: t uu m aa t. ii sh (FES)
طماط: t. a m aa t. (SAN, MUS); t. e m aa t. (DOH)
طماطة: t a m aa t a (BAG, BAS, MOS)
طماطم: t. a m aa t. i m (JED, RIY, SAN, KHA, ALX, ASW); t. m aa t. i m (SFA, TUN, BEN, TRI)
طماطمة: t. a m aa t. m a (MUS)
طماطيس: t. a m aa t. ii s (SAN)
قوطة: g uu t. a (JED); 2 uu t. a (CAI)
مطيشة: m a t. ii sh a (FES, RAB)
طوماطيش: t. o m a t. ii sh (ALG)
Morphological Variation
• Some aspects of words are simplified in the dialects
– Loss of case marking
kitaabu, kitaaba, kitaabi, kitaabun, kitaaban, kitaabin → kitaab
كتابُ، كتابَ، كتابِ، كتابٌ، كتاباً، كتابٍ → كتاب
– Consolidation of masculine and feminine plurals
yaktubuun, yaktubuu, yaktubna → yiktibu || yikitbuun
يكتبون، يكتبوا، يكتبن → يكتبوا، يكتبون
• Other aspects increase in complexity!
Morphological Variation
Verb Morphology
Morpheme labels: conj, neg, tense, subj, verb, object, IOBJ
MSA
ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him
EGY
وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not
"And you didn't write it for him"
Challenges to Arabic NLP
Challenge | Arabic | English
Orthographic ambiguity | More | Less
Orthographic inconsistency | More | Less
Morphological complexity | More | Less
Dialectal variation | More | Less
Example: وبعقدنا (one written form, four diacritized readings)
and he stresses us out | and with our (contract | necklace | psychoses)
Comparing Performance
• SOTA part-of-speech tagging and syntactic parsing
– Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
– Large gap between English and Arabic, and between Standard Arabic and Arabic dialects
– More resources and more research efforts for English compared to Arabic
Task | English | Standard Arabic | Egyptian Arabic
Full Part-of-Speech | 97.6% | 85.4% | 75.5%
Core Part-of-Speech | - | 96.1% | 91.1%
Dependency Syntax | 92.2% | 86.2% | -
Comparing Performance
• Machine Translation
– Quality of machine translation from MSA is much better than from the dialects
– The main reason is availability of parallel corpora
• 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)
Variety | Arabic Source Text | Google Translate (Oct 17, 2018)
MSA | من فضلك لا تكلمني | Please do not talk to me
EGY | انت متكلمنيش خالص | You are pure Mtkmlnish
MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
LEV | شكلو مفيش كهربا، ليش هيك؟ | Shaku Mfish electrified, why not heck?
IRQ | شو ماكو كهرباء، خير؟ | Xu Mako electricity, okay?
MADAMIRA
http://camel.abudhabi.nyu.edu/madamira/
• State-of-the-art Arabic and Arabic Dialect Processing
tool (Pasha et al., 2014)
– Full Morphological disambiguation
– Hybrid
• Rule-based analyzer dictionaries
• Machine learning disambiguation
• Current release: Standard Arabic and Egyptian Arabic
• Under construction: Palestinian, Syrian, Moroccan,
Yemeni, Gulf
• Neural Extensions (Zalmout et al. 2017; 2018)
MADAMIRA architecture (Habash & Rambow, 2005; Roth et al., 2008; Pasha et al., 2014; Zalmout & Habash, 2017, 2018)
• Input: a word in its sentence context (W-4 ... W0 ... W4)
• Morphological analyzer
– Rule-based
– Human-created
• Morphological classifiers
– Multiple independent classifiers
– Corpus-trained
• Ranker
– Heuristic or corpus-trained
– Orders the candidate analyses (1st, 2nd, 3rd, ...)
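The analyze-then-rank design described above can be sketched in a few lines. Everything in this block is a toy stand-in: the `ANALYZER` table, the ad hoc romanized lemmas, the `rank` function, and the scoring scheme are assumptions made for illustration and are not MADAMIRA's actual data structures or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str
    gloss: str

# Toy stand-in for a rule-based analyzer: every out-of-context reading of one
# word (ad hoc romanization of the five analyses of "wqd" listed in this talk).
ANALYZER = {
    "wqd": [
        Analysis("qad~", "noun", "size/physique"),
        Analysis("qad", "part", "may/might/has/have"),
        Analysis("waqod", "noun", "fuel/burning"),
        Analysis("waq~ad", "verb", "kindle/ignite"),
        Analysis("waqad", "verb", "ignite/burn"),
    ],
}

def rank(word, predictions, weights=None):
    """Order the analyzer's candidates by agreement with classifier predictions.

    `predictions` maps a feature name to the value that an independent,
    corpus-trained classifier predicted for this word in context.
    """
    weights = weights or {}

    def score(analysis):
        return sum(
            weights.get(feature, 1.0)
            for feature, value in predictions.items()
            if getattr(analysis, feature, None) == value
        )

    return sorted(ANALYZER.get(word, []), key=score, reverse=True)

# Pretend two classifiers looked at the sentence context and predicted these values.
ranked = rank("wqd", {"pos": "part", "lemma": "qad"})
print(ranked[0])
```

The appeal of this split is that the analyzer guarantees linguistically valid candidates, while the classifiers only need to predict individual features rather than whole analyses.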
MADAMIRA
Demo: http://camel.abudhabi.nyu.edu/madamira/
MADAMIRA Morphological Disambiguation
System / Test | MSA / MSA | MSA / EGY | EGY / EGY
Full Analysis | 84.3% | 27.0% | 75.4%
Diacritization | 86.4% | 32.2% | 83.2%
Lemmatization | 96.1% | 67.1% | 86.3%
Base POS-tagging | 96.1% | 82.1% | 91.1%
Segmentation | 99.1% | 90.5% | 97.4%
Example analysis: وكاتبه (wkAtbh), "and his writer"
Diacritized form: wakAtibuhu
Lemma: kAtib_1
Features: pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms
Segmentation: w+ kAtb +h
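The feature string in the example above is a flat list of `key:value` pairs, which is easy to consume programmatically. A minimal sketch, assuming whitespace-separated pairs exactly as shown on the slide (the helper name `parse_features` is hypothetical):

```python
def parse_features(line):
    """Turn 'key:value key:value ...' into a dict, skipping malformed tokens."""
    features = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:
            features[key] = value
    return features

raw = (
    "pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na "
    "vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)
features = parse_features(raw)

# Keep only the clitic slots that are actually filled ("0" means empty).
clitics = {
    key: value
    for key, value in features.items()
    if key.startswith(("prc", "enc")) and value != "0"
}
print(features["pos"], clitics)
```

Here the filled clitic slots correspond to the conjunction w+ and the pronoun +h in the segmentation.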
Neural MADAMIRA
• Zalmout et al. (EMNLP 2017, NAACL 2018)
– Neural implementation for MADAMIRA
• 4.4% absolute increase over the state of the art in full morphological analysis accuracy on all words
• 10.6% absolute increase for out-of-vocabulary words
Automatic Arabic Spelling Correction
• Neural models for Arabic spelling correction gave state-of-the-art results (Watson, Zalmout and Habash, 2018)
– QALB shared task data 2014, 2015
– 1 million word training data
– Using word and character narrow embeddings (+/-2) in a seq-to-seq model did best.
CODA
A Conventional Orthography
for Dialectal Arabic
• Developed for computational processing purposes
(Habash et al, 2012)
• Objectives
– CODA covers all Arabic dialects in principle
– CODA minimizes differences in choices
– CODA is easy to learn and produce consistently
– CODA is intuitive to readers unfamiliar with it
– CODA uses Arabic script
• Started with manuals for Egyptian, Tunisian, Levantine,
Algerian, and Gulf
• CODA* : CODA for 28 different city dialects (LREC 2018)
• http://coda.camel-lab.com/
CODA Examples
CODA: ما شفتش صحابي الفترة اللي قبل الامتحانات
Gloss: I did not see | my friends | the period | which | before | the exams
Spelling variants (CODA form: attested variants):
ما شفتش: ماشفتش, مشفتش, ما شوفتش, ماشوفتش, مشوفتش, mashoftish
صحابي: صحابى, صوحابي, صوحابى, Su7abi, sohaby
الفترة: الفتره, الفطرة, الفطره, ilftra
اللي: اللى, إللي, إللى, الي, الى, إلي, إلى, illi
قبل: أبل, ابل, abl, qbl, qabl
الامتحانات: الإمتحانات, المتحانات, الامتحنات, الإمتحنات, المتحنات, ilimti7anat, limtihanaat
SAMER Project
• Simplification of Arabic Masterpieces for Extensive
Reading
– Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
– NYUAD Research Enhancement Fund
– Collaboration with the UAE Ministry of Education
• Objectives
– Create a standard for the simplification of modern Arabic fiction for school-age learners.
– Develop a tool for automating readability scale grading for
Arabic
– Simplify a number of Arabic fiction masterpieces
SAMER Readability Prediction
• Large L1 corpus AND L2 corpus compared to previous Arabic Readability studies
• Sweeping and systematic feature engineering and comparison
• State-of-the-art tools tailored for Modern Standard Arabic
• Exploring L1 and L2 performance within the same consistent feature framework
• Leveraging L1 resources for L2 performance improvement
Full Feature Breakdown (146 feats)
• More detailed morphological features: clitics, person, gender, number, aspect (verbs), case (nouns)
• Dependency parse: tree depth
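As an illustration of the syntactic end of this feature set, the sketch below computes the depth of a dependency tree from a list of head indices. The slides name "tree depth" without defining it, so the encoding (1-based heads, 0 for the root) and the definition (number of tokens on the longest path from the root to a token) are assumptions.

```python
def tree_depth(heads):
    """Depth of a dependency tree.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    Depth counts tokens on the longest root-to-token path.
    """
    def depth_of(token):
        steps, seen = 0, set()
        while token != 0:
            if token in seen:
                raise ValueError("cycle in head indices")
            seen.add(token)
            token = heads[token - 1]
            steps += 1
        return steps

    return max((depth_of(t) for t in range(1, len(heads) + 1)), default=0)

# Hypothetical 5-token sentence: token 2 is the root, and token 4 hangs off
# token 5, which hangs off token 3, which hangs off the root.
example_heads = [2, 0, 2, 5, 3]
print(tree_depth(example_heads))
```

Deeper trees tend to indicate more embedded syntax, which is one plausible reason such a feature helps separate easier texts from harder ones.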
SAMER Simplification Interface
MADAR Project
• Multi-Arabic Dialect Applications and
Resources
• Collaboration among CMUQ, NYUAD and
Columbia
– Nizar Habash, Houda Bouamor, Kemal Oflazer and
Owen Rambow
• Modeling 25 Arabic city dialects
– Lexical resources, parallel data, dialect
identification, and dialect machine translation
• http://madar.camel-lab.com
• http://adida.abudhabi.nyu.edu
The MADAR Corpus: example
Fine Grained Dialect Identification
• Salameh, Bouamor and Habash, 2018 (COLING)
• Best results (accuracy)
System | 6-Label Test | 26-Label Test
Baseline: Character 5-gram language model | 92.7% | 64.7%
Multinomial Naïve Bayes, Character/Word 5-gram language model | 93.6% | 67.5%
Multinomial Naïve Bayes, Character/Word 5-gram language model + Corpus-6 Classifier Probability | - | 67.9%
• Demo: http://adida.abudhabi.nyu.edu
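To give a feel for the Multinomial Naïve Bayes family of systems in the table, here is a small self-contained sketch. It is not the published system: it uses only character 1- to 3-grams (no word features or language model scores), add-one smoothing, and four romanized toy training words with coarse hypothetical labels.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n_max=3):
    """All character n-grams of length 1..n_max, with boundary padding."""
    padded = f" {text} "
    return [
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]

class NaiveBayesDialectID:
    """Multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of training samples
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            denominator = sum(gram_counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for gram in char_ngrams(text):
                score += math.log((gram_counts[gram] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data: romanized "tomato" words with coarse, hypothetical labels.
model = NaiveBayesDialectID().fit([
    ("banadoora", "LEV"),
    ("banaduura", "LEV"),
    ("tamaatim", "EGY"),
    ("2uuta", "EGY"),
])
print(model.predict("bandoora"), model.predict("tamaata"))
```

With only a handful of characters of input, the classifier leans entirely on a few distinctive n-grams, which is why accuracy in the fine-grained setting depends so heavily on how much text is available.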
Can we do better? Yes, with more input!
• How many words are needed to guarantee an optimal classification into a certain dialect (in Corpus-26)?
– ~90% with 16 words (almost 2 sentences)
– ~100% with 51 words (almost 7 sentences)
• We are currently preparing a competition on dialect ID of Twitter users.
Summary
• Arabic poses many challenges to AI/NLP
– Orthographic ambiguity
– Morphological complexity
– Enormous variety
– Annotated resource poverty
• There has been a lot of work on Arabic and
Arabic dialect technologies.
– But more is needed still…
Future Directions
• More Arabic varieties
– We plan to continue working on new dialects and new Arabic domains
– New data sets
– New algorithms for supporting low resource languages
• More tools
– We are developing an open source suite called CamelTools
to support Arabic processing
• More interdisciplinary collaborations
– We are proposing a center on Human-centered AI at
NYUAD that brings together researchers from computer
science, digital humanities, language pedagogy, history,
and sociology, as well as industrial partners.
• http://nyuad.nyu.edu/en/
Thank You!
Questions?

Processing short-message communications in low-resource languages
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Design and Development of an Educational Arabic Sign Language Mobile App
Design and Development of an Educational Arabic Sign Language Mobile AppDesign and Development of an Educational Arabic Sign Language Mobile App
Design and Development of an Educational Arabic Sign Language Mobile App
 
Applied linguistics
Applied linguisticsApplied linguistics
Applied linguistics
 
Didactique de l'Anglais de Spécialité (GT GERAS)
Didactique de l'Anglais de Spécialité (GT GERAS)Didactique de l'Anglais de Spécialité (GT GERAS)
Didactique de l'Anglais de Spécialité (GT GERAS)
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
 
MoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingMoM2010: Arabic natural language processing
MoM2010: Arabic natural language processing
 

More from Grammarly

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Grammarly
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly
 
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly
 
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonGrammarly
 

More from Grammarly (14)

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
 
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
 
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
 

Recently uploaded

SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comsaastr
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationNathan Young
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@vikas rana
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.KathleenAnnCordero2
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxJohnree4
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSebastiano Panichella
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxnoorehahmad
 
James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !risocarla2016
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxCarrieButtitta
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Escort Service
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxaryanv1753
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxFamilyWorshipCenterD
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism Presentation
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptx
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation Track
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
 
James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptx
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptx
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 

Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and Solutions - Nizar Habash

  • 1. Arabic Natural Language Processing: Challenges and Solutions (التحليل الآلي للغة العربية: تحديات وحلول). Grammarly Invited Talk, March 26, 2019. Prof. Nizar Habash, New York University Abu Dhabi, nizar.habash@nyu.edu. NYUAD CAMeL Lab
  • 2. New York University: The Global Network University
  • 4. New York University Abu Dhabi
  • 6. New York University Abu Dhabi
      • Students from all over the world
          – 1300 students, 120 nationalities
          – 15% UAE, 15% American, 70% everywhere else
  • 7. New York University Abu Dhabi
      • Liberal Arts University
          – Four Divisions: Science, Engineering, Social Science, Arts and Humanities
          – 20 majors and many minors
          – Interdisciplinarity strongly encouraged
      • Computer Science
          – Undergraduate and PhD programs
          – PhD through NYU New York
  • 8. CAMeL Lab
      • Computational Approaches to Modeling Language
      • http://camel-lab.com
      • Research Areas
          – Arabic Artificial Intelligence
          – Core Natural Language Processing
              • Orthography, morphology, syntax, and semantics
          – Dialectal modeling
          – Machine translation
          – Pedagogical applications
          – Dialogue systems
  • 9. The CAMeLeers
      – Nasser Zalmout, PhD Student, NYU
      – Dima Taji, PhD Student, NYU
      – Alberto Chiercchi, PhD Student, NYU
      – Alex Erdmann, PhD Student, Ohio State
      – Salam Khalifa, Research Assistant
      – Fadhl Eryani, Research Assistant
      – Ossama Obeid, Research Assistant
      – Mai Oudah, Postdoc
  • 10. OK... Back to the talk!
  • 13. Natural Language Processing
      • Also known as
          – Computational Linguistics
          – Language Technologies
          – (Language) Artificial Intelligence
      • Language Technology is an interdisciplinary field
          – Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
      • Language technologies were some of the earliest applications of computer science
          – Cryptography
          – Machine Translation
  • 14. Natural Language Processing
      • Applications
          – Information retrieval
          – Machine translation
          – Automatic speech recognition & speech synthesis
          – Sentiment and emotion analysis
          – Dialogue systems & chatting agents
          – Optical character recognition
          – Automatic summarization, etc.
      • Enabling technologies
          – Tokenization
          – Part-of-speech tagging
          – Syntactic parsing
          – Lemmatization
          – Word sense disambiguation, etc.
  • 15. Paradigms for Natural Language Processing
      • Rule-based (Intuition-based) Approaches
          – Linguists write rules that are applied by the machines
      • Machine Learning Approaches
          – Corpus-based, Statistical Approaches
          – Machines learn the “rules” from training data
      • Machine learning approaches are dominant in the field
  • 16. What do we need to help machines learn?
      • Data, data and more data!
      • Specifically annotated data
      • Application: Annotated Data Example
          – Machine Translation: Parallel corpus in two languages: UN corpus with English, Arabic, Chinese, Spanish, Russian, French
          – Sentiment Analysis: A corpus of tweets with tags indicating: positive, negative, neutral
          – Speech Recognition: A corpus of audio files with their corresponding transcripts
          – Optical Character Recognition: A corpus of scanned book page images and their corresponding transcripts
          – Part-of-Speech: An English corpus with Part-of-Speech indicated for each word
  • 17. Machine Learning vs. Human Learning
      • Humans are predisposed for acquiring language; machines, not so!
      • Developing robust algorithms with appropriate learning bias for computational linguistics tasks is essential!
  • 18. Challenges for Machine Learning Language Technologies
      • Size of training data
          – More is better!
      • Domain and genre sensitivity
          – Systems trained on news do not do well on novels
      • Quality of annotations
          – Why expect good performance if humans do not agree with each other on the task?
      • Developing robust algorithms for machine learning is essential
  • 19. Roadmap
      • Natural Language Processing Applications & Paradigms
      • (Why) is Arabic hard for NLP?
      • Some Arabic NLP solutions
          – NYUAD CAMeL Lab
  • 20. Arabic Script (الخَطُّ العَرَبِيُّ)
      • A consonantal alphabet
      • Written right-to-left
      • Letters have contextual variants
      • Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.
  • 21. Arabic Script
      • Arabic script uses a set of optional diacritics
          – Only 1.5% of written words have at least one diacritic
      • Undiacritized Standard Arabic words are ambiguous out of context
      • Diacritic types
          – Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/
          – Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
          – Gemination: بّ /bb/
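Since the diacritics on this slide are optional, a common preprocessing step is to remove them so that diacritized and undiacritized spellings of a word compare equal. The following is a minimal sketch, not the speaker's code: the function names are hypothetical, and the character class assumes the Unicode range U+064B to U+0652 (nunation, short vowels, gemination, sukun) plus the dagger Alif U+0670.

```python
import re

# Assumed diacritic set: U+064B..U+0652 covers nunation, the short vowels,
# shadda (gemination) and sukun; U+0670 is the dagger Alif.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the optional diacritics removed."""
    return DIACRITICS.sub("", text)


def has_diacritic(word: str) -> bool:
    """True if the word carries at least one diacritic."""
    return DIACRITICS.search(word) is not None


def diacritized_share(text: str) -> float:
    """Fraction of whitespace-separated words with at least one diacritic."""
    words = text.split()
    if not words:
        return 0.0
    return sum(has_diacritic(w) for w in words) / len(words)


if __name__ == "__main__":
    ba = "\u0628\u064E"  # Ba + Fatha, read /ba/
    print(strip_diacritics(ba) == "\u0628")        # True: only the bare letter remains
    print(diacritized_share(ba + " \u0628"))       # 0.5: one of two words is diacritized
```

Run over a corpus, `diacritized_share` is the kind of statistic behind a figure such as the 1.5% quoted on the slide, although the exact counting method used there is not stated.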
  • 22. Example news text without diacritics:
      اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب
      مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.
      English: Spain denies freezing the aid granted to Morocco. Madrid 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not stopped the aid it provides to Morocco, contrary to what Moroccan Minister of Foreign Affairs and Cooperation Mohamed Benaissa stated yesterday, Wednesday, before the Moroccan House of Representatives. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco has never stopped and has not been frozen.
      [The slide then repeats the same passage fully diacritized.]
  • 23. Orthographic Ambiguity
      • Arabic words can be very ambiguous due to optional diacritics
      • But how ambiguous?
      • Classic example
          ths s wht n rbc txt lks lk wth n vwls
          this is what an Arabic text looks like with no vowels
          – Not exactly true
              • Long vowels are always written
              • Initial vowels are represented by an ا ‘Alif’
              • Some final short vowels are deterministically inferable
          ths is wht an Arbc txt lks lik wth no vwls
      • For a computer …
          – A word on average has 12.3 analyses, 6.8 diacritizations, and 2.7 lemmas (core meanings)
      • Not all of this ambiguity is due to orthography! More on this later.
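The "classic example" on this slide can be reproduced mechanically. The sketch below is an illustration with hypothetical function names: `drop_all_vowels` yields the first vowel-less line exactly, while `keep_initial_vowels` only approximates the refined line, because it models the "initial vowels are written" rule but not long vowels or inferable final vowels (so it gives "lk" and "n" where the slide has "lik" and "no").

```python
VOWELS = set("aeiouAEIOU")


def drop_all_vowels(text: str) -> str:
    """Remove every vowel letter from every word."""
    return " ".join(
        "".join(c for c in word if c not in VOWELS) for word in text.split()
    )


def keep_initial_vowels(text: str) -> str:
    """Remove vowels except when they are the first letter of a word."""
    return " ".join(
        word[0] + "".join(c for c in word[1:] if c not in VOWELS)
        for word in text.split()
    )


if __name__ == "__main__":
    sentence = "this is what an Arabic text looks like with no vowels"
    print(drop_all_vowels(sentence))
    # ths s wht n rbc txt lks lk wth n vwls
    print(keep_initial_vowels(sentence))
    # ths is wht an Arbc txt lks lk wth n vwls
```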
  • 24. Spelling Errors
      • The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high share (30%) of words contain errors in unedited Standard Arabic comments on Aljazeera.
          – 2 million words were manually corrected to create training data.
      • Arabic spelling errors are a big challenge for language technologies
          – GIGO: Garbage In, Garbage Out
          – Errors in Standard Arabic
          – Inconsistencies in Dialectal Arabic (no official standard)
      • Robust systems need additional functionality to correct errors or to function well despite them.
  • 25. Morphological Complexity
      • Arabic is morphologically rich
          – A core word has many inflected forms
          – Example: Arabic verbs have 5,400 forms
            Gender (2), Number (3), Person (3), Aspect (3), Tense particle (2), Mood (3), Voice (2), Pronominal clitic (12), Conjunction clitic (3)
      • Example word: وسنقولها /wasanaqūluhā/
          و+س+ن+قول+ها
          wa+sa+na+qūl+u+hā
          and+will+we+say+it
          And we will say it
      • Some forms of the same verb: قال، قالت، قالا، قالوا، قلتَ، قلتِ، قلتما، قلتم، قلتن، يقول، يقولَ، يقل، تقول، تقولَ، تقل، تقولين، تقولي، ... فقال، فقالت، فقالا، ... وسأقولها، وسنقولها، ...
  • 26. Morphological Complexity
      • English is not morphologically rich.
          – The number of inflected forms is small
          – The verb paradigm is limited to 6 forms
            VB go, VBD went, VBG going, VBN gone, VBP go, VBZ goes
          – The complete English part-of-speech tag set has 48 tags
          – The complete Arabic part-of-speech tag set has 22,400 tags
  • 27. Morphological Ambiguity
      • 12.3 analyses and 2.7 lemmas per word
      • Spelling ambiguity
          – Optional diacritics
          – Suboptimal spelling, e.g., (أ، إ → ا) or (ة → ه)
          – Example: وبادلتها
              • وَ+بِ+أَدِلَّة+ها: and with her pieces of evidence
              • وَ+بادَلت+ها: and I exchanged with her
      • Derivational ambiguity and homonymy
          – العَين: the eye, the water spring, Al-Ain city, the notable
          – المُحتَل: occupier, occupied (العدو المحتل / الوطن المحتل / الدول المحتلة)
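The "suboptimal spelling" cases on this slide are often handled by orthographic normalization before lookup. The sketch below assumes the two mappings read off the slide (hamzated Alif to bare Alif, Ta Marbuta to Ha); the table and function names are hypothetical, and a real system may normalize more or fewer letters.

```python
# Assumed normalization table, based on the two spelling variations named on the slide.
NORMALIZATION = str.maketrans({
    "\u0623": "\u0627",  # Alif with hamza above -> bare Alif
    "\u0625": "\u0627",  # Alif with hamza below -> bare Alif
    "\u0629": "\u0647",  # Ta Marbuta -> Ha
})


def normalize(word: str) -> str:
    """Map the spelling variants of a word to one canonical form."""
    return word.translate(NORMALIZATION)


def same_after_normalization(a: str, b: str) -> bool:
    """True if two spellings collapse to the same normalized string."""
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    careful = "\u0623\u062F\u0644\u0629"  # "pieces of evidence" with hamza and Ta Marbuta
    sloppy = "\u0627\u062F\u0644\u0647"   # same word with bare Alif and Ha
    print(same_after_normalization(careful, sloppy))  # True
```

The design trade-off is that normalization makes matching robust to inconsistent spelling, but it also merges words that were distinct before, which adds to the ambiguity an analyzer has to resolve.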
  • 28. Morphological Annotation
      وقد كاتبته فتحية لمدة سنتين.
      Fathia corresponded with him for two years.
      Word | Lemma | Segmentation | POS | Features | Gloss
      وقد | قَد | و+قد | Noun | Masc Sg | size/physique
      وقد | قَد | و+قد | Particle | - | may/might/has/have
      وقد | وَقد | وقد | Noun | Masc Sg | fuel/burning
      وقد | وَقَّد | وقد | Verb | Perf 3rd Masc Sg | kindle/ignite
      وقد | وَقَد | وقد | Verb | Perf 3rd Masc Sg | ignite/burn
      كاتبته | كاتِب | كاتبة+ه | Noun | Fem Sg | author/writer/clerk
      كاتبته | كاتَب | كاتبت+ه | Verb | Perf 3rd Fem Sg | correspond_with
      كاتبته | كاتَب | كاتبت+ه | Verb | Perf 1st Sg | correspond_with
      كاتبته | كاتَب | كاتبت+ه | Verb | Perf 2nd Fem Sg | correspond_with
      كاتبته | كاتَب | كاتبت+ه | Verb | Perf 2nd Masc Sg | correspond_with
      فتحية | تَحِيَّة | ف+تحية | Noun | Fem Sg | greeting/salute
      فتحية | فَتحِيَّة | فتحية | Proper | Fem Sg | Fathia
      لمدة | مُدَّة | ل+مدة | Noun | Fem Sg | interval/period
      سنتين | سِنت | سنتين | Noun | Masc Du | cent
      سنتين | سَنَة | سنتين | Noun | Fem Du | year
      . | . | . | Punc | - | .
  • 30. Arabic and its Dialects • Arabic has ~360M speakers • Forms of Arabic – Classical Arabic (CA) • Classic historical and liturgical texts – Modern Standard Arabic (MSA) • News media & formal speeches and settings • Only written standard – Dialectal Arabic (DA) • Predominantly spoken vernaculars • No written standards • Very common on social media • Diglossia – Two forms of the language (MSA & DA) exist side by side
  • 31. Arabic and its Dialects • Official language: Modern Standard Arabic (MSA) – No one’s native language • Regional Dialects – Egyptian Arabic (EGY) – Levantine Arabic (LEV) – Gulf Arabic (GLF) – North African Arabic (NOR): Moroccan, Algerian, Tunisian – Iraqi, Yemeni, Sudanese • Dialects and sub-dialects… – City, Rural, Bedouin
  • 32. Phonological Variations • Major variants (MSA → Dialects): ق /q/ → /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/; ث /θ/ → /θ/, /t/, /s/; ذ /ð/ → /ð/, /d/, /z/; ج /ʤ/ → /ʤ/, /g/, /ʒ/
  • 33. Spelling Inconsistency • Egyptian Arabic word مابيقولهاش /mabiʔulhāʃ/ “he does not say it” • If there is no standard, can a word be misspelled?
  • 34. Lexical and Phonological Variation You say to-MAY-to, I say to-MAH-to!
  • 35. Lexical and Phonological Variation ‫بندورة‬ ‫توماطيش‬ ‫طماط‬ ‫طماطة‬ ‫طماطم‬ ‫طماطمة‬ ‫طماطيس‬ ‫قوطة‬ ‫مطيشة‬ b a n a d oo r a ALE, DAM b a n a d uu r a BEI, AMM, JER b a n d oo r a AMM, JER, SAL t uu m aa t. ii sh FES t. a m aa t. SAN, MUS t. e m aa t. DOH t a m aa t a BAG, BAS, MOS t. a m aa t. i m JED, RIY, SAN, KHA, ALX, ASW t. m aa t. i m SFA, TUN, BEN, TRI g uu t. a JED 2 uu t. a CAI t. o m a t. ii sh ALG t. a m aa t. ii s SAN m a t. ii sh a FES, RAB t. a m aa t. m a MUS ‫طوماطيش‬
  • 38. Morphological Variation • Some aspects of words are simplified in the dialects – Loss of case marking: kitaabu, kitaaba, kitaabi, kitaabun, kitaaban, kitaabin → kitaab – Consolidation of masculine and feminine plurals: yaktubuun, yaktubuu, yaktubna → yiktibu || yikitbuun • Other aspects increase in complexity! ‫كتاب‬‫كتاب‬ ،‫ي‬‫ب‬‫كتا‬ ،‫كتابا‬ ،َ‫كتاب‬ ،‫كتاب‬ ،‫كتاب‬ ‫يكتبون‬ ،‫يكتبوا‬‫يكتبن‬ ،‫يكتبون‬ ،‫يكتبوا‬
  • 39. Morphological Variation • Verb Morphology (components: conj, verb, object, subj, tense, IOBJ, neg) • MSA: ولم تكتبوها له /walam taktubūhā lahu/ = /wa+lam taktubū+hā la+hu/ ‘and+not_past write_you+it for+him’ • EGY: وماكتبتوهالوش /wimakatabtuhalūʃ/ = /wi+ma+katab+tu+ha+lū+ʃ/ ‘and+not+wrote+you+it+for_him+not’ • Both mean: “And you didn’t write it for him”
  • 40. Challenges to Arabic NLP Arabic English Orthographic ambiguity More Less Orthographic inconsistency More Less Morphological complexity More Less Dialectal variation More Less ‫وبعقدنا‬ ‫َا‬‫ن‬‫ي‬‫د‬َ‫ق‬‫ع‬‫ي‬‫ب‬ َ‫و‬ ‫َا‬‫ن‬‫ي‬‫د‬‫ق‬‫ي‬‫ع‬‫ي‬‫ب‬ َ‫و‬ ‫َا‬‫ن‬‫ي‬‫د‬‫ق‬َ‫ع‬‫ي‬‫ب‬ َ‫و‬ َ‫ن‬‫د‬‫ي‬‫ق‬َ‫ع‬‫ي‬‫ب‬‫و‬‫ا‬ and he stresses us out | and with our (contract | necklace | psychoses)
  • 41. Comparing Performance • SOTA Part-of-Speech Tagging and Syntax Parsing Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016) – Large gap between English and Arabic, and between Standard Arabic and Arabic dialects – More resources and more research effort for English than for Arabic • English Standard Arabic Egyptian Arabic Full Part-of-Speech 97.6% 85.4% 75.5% Core POS Part-of-Speech 96.1% 91.1% Dependency Syntax 92.2% 86.2%
  • 42. Comparing Performance • Machine Translation – Quality of machine translation from MSA is much better than from the dialects – The main reason is availability of parallel corpora • 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012) • Arabic Source Text → Google Translate (Oct 17, 2018): MSA من فضلك لا تكلمني → Please do not talk to me; EGY انت متكلمنيش خالص → You are pure Mtkmlnish; MSA لا يوجد كهرباء، ماذا حدث؟ → No electricity, what happened?; LEV شكلو مفيش كهربا، ليش هيك؟ → Shaku Mfish electrified, why not heck?; IRQ شو ماكو كهرباء، خير؟ → Xu Mako electricity, okay?
  • 44. Roadmap • Natural Language Processing Applications & Paradigms • (Why) is Arabic hard for NLP? • Some Arabic NLP solutions – NYUAD CAMeL Lab
  • 45. MADAMIRA http://camel.abudhabi.nyu.edu/madamira/ • State-of-the-art Arabic and Arabic Dialect Processing tool (Pasha et al., 2014) – Full Morphological disambiguation – Hybrid • Rule-based analyzer dictionaries • Machine learning disambiguation • Current release: Standard Arabic and Egyptian Arabic • Under construction: Palestinian, Syrian, Moroccan, Yemeni, Gulf • Neural Extensions (Zalmout et al. 2017; 2018)
  • 46. [Architecture diagram] A window of words (W-4 … W0 … W4) is processed by three components • MORPHOLOGICAL ANALYZER – Rule-based – Human-created • MORPHOLOGICAL CLASSIFIERS – Multiple independent classifiers – Corpus-trained • RANKER – Heuristic or corpus-trained – Orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th) (Habash & Rambow 2005; Roth et al. 2008; Pasha et al., 2014; Zalmout & Habash 2017, 2018)
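One way to read the analyzer, classifier, and ranker components on this slide is as a reranking step: the analyzer proposes candidate analyses, the classifiers predict individual features from context, and the ranker prefers the candidate that agrees best with those predictions. The sketch below illustrates that idea only. The function `rank_analyses`, the feature weights, and the toy analyses are all hypothetical and are not MADAMIRA's actual implementation or API.

```python
from typing import Dict, List

Analysis = Dict[str, str]


def rank_analyses(candidates: List[Analysis],
                  predicted: Dict[str, str],
                  weights: Dict[str, float]) -> List[Analysis]:
    """Order candidates by weighted agreement with classifier predictions."""
    def score(analysis: Analysis) -> float:
        # Add a feature's weight whenever the analysis matches the prediction.
        return sum(weight for feat, weight in weights.items()
                   if feat in predicted and analysis.get(feat) == predicted[feat])

    # sorted() is stable, so tied candidates keep the analyzer's order.
    return sorted(candidates, key=score, reverse=True)


# Toy candidate analyses for one ambiguous word (illustrative values only).
candidates = [
    {"lemma": "kAtibap", "pos": "noun", "per": "na", "gen": "f"},
    {"lemma": "kAtab", "pos": "verb", "per": "3", "gen": "f"},
    {"lemma": "kAtab", "pos": "verb", "per": "1", "gen": "na"},
]
# Hypothetical outputs of independent per-feature classifiers.
predicted = {"pos": "verb", "per": "3", "gen": "f"}
# Hypothetical weights; in practice these could be tuned on a corpus.
weights = {"pos": 2.0, "per": 1.0, "gen": 1.0}

ranked = rank_analyses(candidates, predicted, weights)
best = ranked[0]
```

Keeping the rule-based analyzer as the source of candidates means the system can only ever output a linguistically well-formed analysis, while the corpus-trained parts decide which one fits the context.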
  • 48. MADAMIRA Morphological Disambiguation • Accuracy by (system → test set), listed as MSA→MSA | MSA→EGY | EGY→EGY: Full Analysis 84.3% | 27.0% | 75.4%; Diacriticization 86.4% | 32.2% | 83.2%; Lemmatization 96.1% | 67.1% | 86.3%; Base POS-tagging 96.1% | 82.1% | 91.1%; Segmentation 99.1% | 90.5% | 97.4% • Example analysis of وكاتبه (wkAtbh) “and his writer”: wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms; segmentation w+ kAtb +h
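The example analysis on this slide is a flat string of `key:value` pairs (pos, proclitics prc0 to prc3, person, aspect, and so on). As a small illustration of how such a line can be consumed downstream, the sketch below splits it into a dictionary. The helper `parse_analysis` is hypothetical and assumes only the space-separated `key:value` layout visible on the slide, not any official output format specification.

```python
def parse_analysis(feature_string: str) -> dict:
    """Split space-separated key:value pairs into a dictionary."""
    features = {}
    for token in feature_string.split():
        key, sep, value = token.partition(":")
        if sep:  # ignore tokens that carry no colon
            features[key] = value
    return features


# Feature string copied from the slide's example for the word wkAtbh.
line = ("pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na "
        "mod:na gen:m num:s stt:c cas:n enc0:pron3ms")
analysis = parse_analysis(line)

# The conjunction proclitic and the pronominal enclitic are now easy to read off.
clitics = (analysis["prc2"], analysis["enc0"])
```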
  • 49. Neural MADAMIRA • Zalmout et al. (EMNLP 2017, NAACL 2018) – Neural implementation for MADAMIRA • 4.4% absolute increase over the state-of-the-art in full morphological analysis accuracy on all words • 10.6% absolute increase for out-of-vocabulary words
  • 50. Automatic Arabic Spelling Correction • Neural models for Arabic spelling correction gave state-of-the-art results (Watson, Zalmout and Habash, 2018) – QALB shared task data 2014, 2015 – 1 million words of training data – Using word and character narrow embeddings (+/-2) in a seq-to-seq model did best
  • 51. CODA: A Conventional Orthography for Dialectal Arabic • Developed for computational processing purposes (Habash et al., 2012) • Objectives – CODA covers all Arabic dialects in principle – CODA minimizes differences in choices – CODA is easy to learn and produce consistently – CODA is intuitive to readers unfamiliar with it – CODA uses Arabic script • Started with manuals for Egyptian, Tunisian, Levantine, Algerian, and Gulf • CODA*: CODA for 28 different city dialects (LREC 2018) • http://coda.camel-lab.com/
  • 52. CODA Examples CODA ‫االمتحانات‬ ‫قبل‬ ‫اللي‬ ‫الفترة‬ ‫صحابي‬ ‫ماَّشفتش‬ gloss the exams before which the period my friends I did not see Spelling variants ‫ا‬‫إل‬‫متحانات‬ ‫أ‬‫بل‬ ‫اللـ‬‫ـى‬ ‫الفتر‬‫ه‬ ‫صحابـ‬‫ـى‬ ‫ما‬‫شفتش‬ ‫ا‬‫لـ‬‫ـمتحانات‬ ‫ا‬‫بل‬ ‫إ‬‫للي‬ ‫الفـ‬‫طـ‬‫ر‬‫ة‬ ‫صـ‬‫و‬‫حابي‬ ‫مـ‬‫شفتش‬ ‫االمتـ‬‫ـحـ‬‫نات‬ abl ‫إ‬‫للـ‬‫ـى‬ ‫الفـ‬‫طـ‬‫ر‬‫ه‬ ‫صـ‬‫و‬‫حابـ‬‫ـى‬ ‫شـ‬ ‫ما‬‫و‬‫فتش‬ ‫ا‬‫إل‬‫متـ‬‫ـحـ‬‫نات‬ qbl ‫ا‬‫لـ‬‫ـي‬ ilftra Su7abi ‫ما‬‫شـ‬‫و‬‫فتش‬ ‫ا‬‫لـ‬‫ـمتـ‬‫ـحـ‬‫نات‬ qabl ‫ا‬‫لى‬ sohaby ‫مـ‬‫شـ‬‫و‬‫فتش‬ ilimti7anat ‫إلـ‬‫ـي‬ mashoftish limtihanaat ‫إلى‬ illi
  • 53. SAMER Project • Simplification of Arabic Masterpieces for Extensive Reading – Muhamed Al Khalil, Nizar Habash and Dris Sulaimani – NYUAD Research Enhancement Fund – Collaboration with the UAE Ministry of Education • Objectives – Create a standard for the simplification of modern fiction in Arabic for school-age learners – Develop a tool for automating readability scale grading for Arabic – Simplify a number of Arabic fiction masterpieces
  • 54. SAMER Readability Prediction • Large L1 corpus AND L2 corpus compared to previous Arabic Readability studies • Sweeping and systematic feature engineering and comparison • State-of-the-art tools tailored for Modern Standard Arabic • Exploring L1 and L2 performance within the same consistent feature framework • Leveraging L1 resources for L2 performance improvement
  • 55. Full Feature Breakdown (146 feats)
  • 58. MADAR Project • Multi-Arabic Dialect Applications and Resources • Collaboration among CMUQ, NYUAD and Columbia – Nizar Habash, Houda Bouamor, Kemal Oflazer and Owen Rambow • Modeling 25 Arabic city dialects – Lexical resources, parallel data, dialect identification, and dialect machine translation • http://madar.camel-lab.com • http://adida.abudhabi.nyu.edu
  • 59. The MADAR Corpus: example
  • 60. Fine Grained Dialect Identification • Salameh, Bouamor and Habash, 2018 (COLING) • Demo: http://adida.abudhabi.nyu.edu • Best results (accuracy), listed as 6-Label Test | 26-Label Test: Baseline, Character 5-gram language model 92.7% | 64.7%; Multinomial Naïve Bayes, Character/Word 5-gram language model 93.6% | 67.5%; Multinomial Naïve Bayes, Character/Word 5-gram language model + Corpus-6 Classifier Probability (26-Label Test only) 67.9%
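The systems compared on this slide are built on character and word n-gram features with a Multinomial Naïve Bayes classifier. The sketch below is a heavily simplified illustration of that family of models: it uses character n-grams only (lengths 1 to 5), add-one smoothing, and three toy training items taken from the transliterated "tomato" forms earlier in the talk. The class name `CharNgramNB` and all data are assumptions for illustration; this is not the authors' system and omits word n-grams and language-model features.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=5):
    """All character n-grams of length 1..n_max, with space padding."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            total = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: one transliterated word per city label (illustrative only).
model = CharNgramNB().fit(
    ["2uuta", "banadoora", "tamaata"],
    ["CAI", "DAM", "BAG"],
)
prediction = model.predict("banadoora")
```

A real system would train on full sentences per city, which is consistent with the talk's later point that accuracy rises sharply as more words of input become available.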
  • 61. Can we do better? Yes, with more input! • How many words are needed to guarantee an optimal classification into a certain dialect? • ~90% with 16 words (almost 2 sentences in Corpus-26) • ~100% with 51 words (almost 7 sentences) • We are currently preparing a competition on dialect ID of Twitter users.
  • 62. Summary • Arabic poses many challenges to AI/NLP – Orthographic ambiguity – Morphological complexity – Enormous variety – Annotated resource poverty • There has been a lot of work on Arabic and Arabic dialect technologies. – But more is needed still…
  • 63. Future Directions • More Arabic varieties – We plan to continue working on new dialects and new Arabic domains – New data sets – New algorithms for supporting low-resource languages • More tools – We are developing an open source suite called CamelTools to support Arabic processing • More interdisciplinary collaborations – We are proposing a center on Human-centered AI at NYUAD that brings together researchers from computer science, digital humanities, language pedagogy, history, and sociology, as well as industrial partners.