Arabic Natural Language Processing:
Challenges and Solutions
التحليل الآلي للغة العربية: تحديات وحلول
Grammarly Invited Talk
March 26, 2019
Prof. Nizar Habash
New York University Abu Dhabi
nizar.habash@nyu.edu
NYUAD
CAMeLLab
New York University
The Global Network University
New York University Abu Dhabi
• http://nyuad.nyu.edu/en/
• Students from all over the world
– 1300 students, 120 nationalities
– 15% UAE, 15% American, 70% everywhere else
• Liberal Arts University
– Four Divisions: Science, Engineering, Social
Science, Arts and Humanities
– 20 majors and many minors
– Interdisciplinarity strongly encouraged
• Computer Science
– Undergraduate and PhD programs
– PhD through NYU New York
CAMeL Lab
• Computational Approaches to Modeling Language
• http://camel-lab.com
• Research Areas
– Arabic Artificial Intelligence
– Core Natural Language Processing
• Orthography, morphology, syntax, and semantics
– Dialectal modeling
– Machine translation
– Pedagogical applications
– Dialogue systems
The CAMeLeers
• Nasser Zalmout, PhD Student, NYU
• Dima Taji, PhD Student, NYU
• Alberto Chierici, PhD Student, NYU
• Alex Erdmann, PhD Student, Ohio State
• Salam Khalifa, Research Assistant
• Fadhl Eryani, Research Assistant
• Ossama Obeid, Research Assistant
• Mai Oudah, Postdoc
Ok…. Back to the talk!
Natural Language Processing
• Also known as
– Computational Linguistics
– Language Technologies
– (Language) Artificial Intelligence
• Language Technology is an interdisciplinary field
– Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest
applications of computer science
– Cryptography
– Machine Translation
Natural Language Processing
• Applications
– Information retrieval
– Machine translation
– Automatic speech recognition & speech synthesis
– Sentiment and emotion analysis
– Dialogue systems & chatting agents
– Optical character recognition
– Automatic Summarization, etc.
• Enabling technologies
– Tokenization
– Part-of-speech tagging
– Syntactic parsing
– Lemmatization
– Word sense disambiguation, etc.
Paradigms for
Natural Language Processing
• Rule-based (Intuition-based) Approaches
– Linguists write rules that are applied by the
machines
• Machine Learning Approaches
– Corpus-based, Statistical Approaches
– Machines learn the “rules” from training data
• Machine learning approaches are dominant in
the field
What do we need
to help machines learn?
• Data, data and more data!
• Specifically, annotated data
Application | Annotated Data Example
Machine Translation | Parallel corpus in two languages, e.g., the UN corpus with English, Arabic, Chinese, Spanish, Russian, French
Sentiment Analysis | A corpus of tweets with tags indicating positive, negative, or neutral
Speech Recognition | A corpus of audio files with their corresponding transcripts
Optical Character Recognition | A corpus of scanned book page images and their corresponding transcripts
Part-of-Speech Tagging | An English corpus with the part of speech indicated for each word
Machine Learning
vs. Human Learning
• Humans: predisposed for acquiring language
• Machines: not so!
• Developing robust algorithms with appropriate learning
bias for computational linguistics tasks is essential!
Challenges for
Machine Learning Language Technologies
• Size of training data
– More is better!
• Domain and genre sensitivity
– Systems trained on news do not do well on novels
• Quality of annotations
– Why expect good performance if humans do not
agree with each other on the task
• Developing robust algorithms for machine
learning is essential
Roadmap
• Natural Language Processing
Applications & Paradigms
• (Why) is Arabic hard for NLP?
• Some Arabic NLP solutions
– NYUAD CAMeL Lab
Arabic Script
• A consonantal alphabet
• Written right-to-left
• Letters have contextual variants
• Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.
الخَطُّ العَرَبِيُّ
Arabic Script
• Arabic script uses a set of optional diacritics
– Only 1.5% of written words have at least one diacritic
• Undiacritized Standard Arabic words are
ambiguous out of context
Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/ (no vowel)
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
Gemination: بّ /bb/
Example (undiacritized news text):
اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب
مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.
[The slide then repeats the same text fully diacritized.]
Orthographic Ambiguity
• Arabic words can be very ambiguous due to optional
diacritics
• But how ambiguous?
• Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels
– Not exactly true
• Long vowels are always written
• Initial vowels are represented by an ا ‘Alif’
• Some final short vowels are deterministically inferable
ths is wht an Arbc txt lks lik wth no vwls
• For a computer …
– A word on average has 12.3 analyses, 6.8 diacritizations,
and 2.7 lemmas (core meanings)
• Not all of this ambiguity is due to orthography! More on this later.
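The effect of optional diacritics can be illustrated with a minimal sketch. It assumes the diacritics of interest are the Unicode combining marks U+064B to U+0652 plus the dagger alif U+0670; the function name `strip_diacritics` is hypothetical, and the three example readings are taken from the lemmas listed for وقد in the Morphological Annotation slide.

```python
import re

# Arabic diacritics: nunation, short vowels, shadda, sukun (U+064B..U+0652)
# plus the dagger alif (U+0670). This range is an assumption of the sketch.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove optional diacritics, leaving the bare letter skeleton."""
    return DIACRITICS.sub("", text)

# Three diacritized readings that share one undiacritized spelling (w-q-d):
waqad = "\u0648\u064E\u0642\u064E\u062F"          # waqad, "ignite/burn"
waqqad = "\u0648\u064E\u0642\u0651\u064E\u062F"   # waqqad, "kindle/ignite"
wa_qad = "\u0648\u064E\u0642\u064E\u062F\u0652"   # wa+qad, "and" + particle

# All three collapse to the same written form once diacritics are dropped.
bare_forms = {strip_diacritics(w) for w in (waqad, waqqad, wa_qad)}
print(len(bare_forms))
```

Going in this direction is trivial; the hard problem the talk describes is the reverse one, choosing the right diacritization in context.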
Spelling Errors
• The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high proportion (30%) of words contain errors in unedited Standard Arabic comments on Aljazeera.
– 2 million words were manually corrected to create training data.
• Arabic spelling errors are a big challenge for language technologies
– GIGO: Garbage In, Garbage Out
– Errors in Standard Arabic
– Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to correct errors or to function well despite them.
Morphological Complexity
• Arabic is morphologically rich
– A core word has many inflected forms
– Example: Arabic Verbs have 5,400 forms
Gender(2), Number(3), Person(3), Aspect(3), Tense particle (2),
Mood(3), Voice(2), Pronominal clitic(12), Conjunction clitic(3)
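As a rough check on the feature inventory above, the sketch below multiplies the listed value counts. The product is only a naive upper bound; that it exceeds the 5,400 figure is presumably because many combinations never occur (for example, mood applies only to some aspects), which is an assumption of this note rather than something stated on the slide.

```python
from math import prod

# Value counts exactly as listed on the slide.
FEATURES = {
    "gender": 2,
    "number": 3,
    "person": 3,
    "aspect": 3,
    "tense particle": 2,
    "mood": 3,
    "voice": 2,
    "pronominal clitic": 12,
    "conjunction clitic": 3,
}

# Naive cross-product: treats every feature as freely combinable.
naive_upper_bound = prod(FEATURES.values())
print(naive_upper_bound)  # 23328
```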
وسنقولها
/wasanaqūluhā/
و+س+ن+قول+ها
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it
قال، قالت، قالا، قالوا، قلتَ، قلتِ، قلتما، قلتم، قلتن،
يقول، يقولَ، يقل، تقول، تقولَ، تقل، تقولين، تقولي، ...
فقال، فقالت، فقالا، ...
وسأقولها، وسنقولها، ...
Morphological Complexity
• English is not morphologically rich.
– The number of inflected forms is small
– The verb paradigm is limited to 6 forms
VB VBD VBG VBN VBP VBZ
go went going gone go goes
– The complete English part-of-speech tag set has 48 tags
– The complete Arabic part-of-speech tag set has 22,400 tags
Morphological Ambiguity
• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
– Optional diacritics
– Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
– Example: وبادلتها
وَ+بِ+أَدِلَّة+ها and with her pieces of evidence
وَ+بادَلتُ+ها and I exchanged with her
• Derivational ambiguity and homonymy
العَين the eye, the water spring, Al-Ain city, the notable
المُحتَلّ occupier, occupied
(العدو المحتل / الوطن المحتل / الدول المحتلة)
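The spelling conflations named above (hamzated Alif written as bare Alif, Ta Marbuta written as Ha) are often neutralized before further processing. The following is a minimal sketch of such a normalization, not the method of any particular tool; the function name `normalize_spelling` and the choice to map everything to the reduced form are assumptions.

```python
# Map the "full" letters to the reduced forms writers commonly substitute.
REDUCTIONS = str.maketrans({
    "\u0623": "\u0627",  # Alif with hamza above  -> bare Alif
    "\u0625": "\u0627",  # Alif with hamza below  -> bare Alif
    "\u0629": "\u0647",  # Ta Marbuta             -> Ha
})

def normalize_spelling(text: str) -> str:
    """Collapse spelling variants so that both spellings compare equal."""
    return text.translate(REDUCTIONS)

careful = "\u0625\u0633\u0628\u0627\u0646\u064a\u0627"  # Spain, with hamza
casual = "\u0627\u0633\u0628\u0627\u0646\u064a\u0627"   # Spain, bare Alif
print(normalize_spelling(careful) == normalize_spelling(casual))  # True
```

The trade-off is that normalization removes inconsistency by adding ambiguity: after the mapping, words that differ only in these letters become indistinguishable.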
Morphological Annotation
وقد كاتبته فتحية لمدة سنتين.
Fathia corresponded with him for two years.
Word | Lemma | Segmentation | POS | Features | Gloss
وقد | قَد | و+قد | Noun | Masc Sg | size/physique
 | قَد | و+قد | Particle | | may/might/has/have
 | وَقد | وقد | Noun | Masc Sg | fuel/burning
 | وَقَّد | وقد | Verb | Perf 3rd Masc Sg | kindle/ignite
 | وَقَد | وقد | Verb | Perf 3rd Masc Sg | ignite/burn
كاتبته | كاتِب | كاتبة+ه | Noun | Fem Sg | author/writer/clerk
 | كاتَب | كاتبت+ه | Verb | Perf 3rd Fem Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 1st Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 2nd Fem Sg | correspond_with
 | كاتَب | كاتبت+ه | Verb | Perf 2nd Masc Sg | correspond_with
فتحية | تَحِيَّة | ف+تحية | Noun | Fem Sg | greeting/salute
 | فَتحِيَّة | فتحية | Proper | Fem Sg | Fathia
لمدة | مُدَّة | ل+مدة | Noun | Fem Sg | interval/period
سنتين | سِنت | سنتين | Noun | Masc Du | cent
 | سَنَة | سنتين | Noun | Fem Du | year
. | . | . | Punc | | .
Arabic and its Dialects
• Arabic has ~360M speakers
• Forms of Arabic
– Classical Arabic (CA)
• Classic historical and liturgical texts
– Modern Standard Arabic (MSA)
• News media & formal speeches and settings
• Only written standard
– Dialectal Arabic (DA)
• Predominantly spoken vernaculars
• No written standards
• Very common on social media
• Diglossia
– Two forms of the language (MSA & DA) exist side by side
Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
– No one’s native language
• Regional Dialects
– Egyptian Arabic (EGY)
– Levantine Arabic (LEV)
– Gulf Arabic (GLF)
– North African Arabic (NOR): Moroccan, Algerian, Tunisian
– Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
– City, Rural, Bedouin
Phonological Variations
• Major variants
MSA | Dialects
ق /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث /θ/ | /θ/, /t/, /s/
ذ /ð/ | /ð/, /d/, /z/
ج /ʤ/ | /ʤ/, /g/, /ʒ/
Spelling Inconsistency
Egyptian Arabic word
مابيقولهاش
/mabiʔulhāʃ/
“he does not say it”
If there is no standard, can a word be misspelled?
Lexical and Phonological Variation
You say to-MAY-to, I say to-MAH-to!
Spelling | Pronunciation | Cities
بندورة | b a n a d oo r a | ALE, DAM
بندورة | b a n a d uu r a | BEI, AMM, JER
بندورة | b a n d oo r a | AMM, JER, SAL
توماطيش | t uu m aa t. ii sh | FES
طماط | t. a m aa t. | SAN, MUS
طماط | t. e m aa t. | DOH
طماطة | t a m aa t a | BAG, BAS, MOS
طماطم | t. a m aa t. i m | JED, RIY, SAN, KHA, ALX, ASW
طماطم | t. m aa t. i m | SFA, TUN, BEN, TRI
قوطة | g uu t. a | JED
قوطة | 2 uu t. a | CAI
طوماطيش | t. o m a t. ii sh | ALG
طماطيس | t. a m aa t. ii s | SAN
مطيشة | m a t. ii sh a | FES, RAB
طماطمة | t. a m aa t. m a | MUS
Morphological Variation
• Some aspects of words are simplified in the dialects
– Loss of case marking
kitaabu, kitaaba, kitaabi, kitaabun, kitaaban, kitaabin → kitaab
كتابُ، كتابَ، كتابِ، كتابٌ، كتاباً، كتابٍ → كتاب
– Consolidation of masculine and feminine plurals
yaktubuun, yaktubuu, yaktubna → yiktibu or yikitbuun
يكتبون، يكتبوا، يكتبن → يكتبوا، يكتبون
• Other aspects increase in complexity!
Morphological Variation
Verb Morphology
[Diagram labels: conj, verb, object, subj, tense, IOBJ, neg, neg]
MSA
ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him
EGY
وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not
And you didn’t write it for him
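The two segmentations above can be lined up with their glosses mechanically. This is a small illustrative sketch, assuming "+" and whitespace as the only separators; the helper name `align_gloss` is hypothetical. It shows that the single Egyptian word carries seven morphemes, while MSA spreads six over three words.

```python
import re

def align_gloss(segmented: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each morpheme with its gloss, splitting on '+' and spaces."""
    morphemes = re.split(r"[+\s]+", segmented.strip())
    glosses = re.split(r"[+\s]+", gloss.strip())
    if len(morphemes) != len(glosses):
        raise ValueError("segmentation and gloss have different lengths")
    return list(zip(morphemes, glosses))

msa = align_gloss("wa+lam taktubū+hā la+hu",
                  "and+not_past write_you+it for+him")
egy = align_gloss("wi+ma+katab+tu+ha+lū+ʃ",
                  "and+not+wrote+you+it+for_him+not")

for morpheme, meaning in egy:
    print(f"{morpheme:6} {meaning}")
```

The Egyptian form marks negation twice (ma ... ʃ), which is visible as two "not" entries in the aligned output.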
Challenges to Arabic NLP
Arabic English
Orthographic ambiguity More Less
Orthographic inconsistency More Less
Morphological complexity More Less
Dialectal variation More Less
Example: وبعقدنا
وبِعَقِدنا | وَبِعَقدِنا | وَبِعِقدِنا | وَبِعُقَدِنا
and he stresses us out | and with our (contract | necklace | psychoses)
Comparing Performance
• SOTA Part-of-Speech Tagging and Syntax Parsing
Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
– Large gap between English and Arabic, and between Standard Arabic and Arabic dialects
– More resources and more research efforts for English compared to Arabic.
Task | English | Standard Arabic | Egyptian Arabic
Full Part-of-Speech | 97.6% | 85.4% | 75.5%
Core Part-of-Speech | – | 96.1% | 91.1%
Dependency Syntax | 92.2% | 86.2% | –
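One way to read the Full Part-of-Speech row is in terms of error rates rather than accuracies. The arithmetic below uses only the three figures from that row; the framing as an error-rate ratio is an addition of this note, not a claim from the talk.

```python
# Full part-of-speech tagging accuracy (%) as reported in the table.
accuracy = {"English": 97.6, "Standard Arabic": 85.4, "Egyptian Arabic": 75.5}

# Error rate is simply 100 minus accuracy.
error = {lang: round(100 - acc, 1) for lang, acc in accuracy.items()}

# How many times more errors than English, per variety.
ratio = {lang: round(err / error["English"], 1) for lang, err in error.items()}

for lang in accuracy:
    print(f"{lang}: error {error[lang]}%, {ratio[lang]}x English")
```

Under this reading, Standard Arabic has roughly six times and Egyptian Arabic roughly ten times the English error rate on the full tag set.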
Comparing Performance
• Machine Translation
– Quality of machine translation from MSA is much better than from the dialects
– The main reason is availability of parallel corpora
• 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)
Variety | Arabic Source Text | Google Translate (Oct 17, 2018)
MSA | من فضلك لا تكلمني | Please do not talk to me
EGY | انت متكلمنيش خالص | You are pure Mtkmlnish
MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
LEV | شكلو مفيش كهربا، ليش هيك؟ | Shaku Mfish electrified, why not heck?
IRQ | شو ماكو كهرباء، خير؟ | Xu Mako electricity, okay?
Roadmap
• Natural Language Processing
Applications & Paradigms
• (Why) is Arabic hard for NLP?
• Some Arabic NLP solutions
– NYUAD CAMeL Lab
MADAMIRA
http://camel.abudhabi.nyu.edu/madamira/
• State-of-the-art Arabic and Arabic Dialect Processing
tool (Pasha et al., 2014)
– Full Morphological disambiguation
– Hybrid
• Rule-based analyzer dictionaries
• Machine learning disambiguation
• Current release: Standard Arabic and Egyptian Arabic
• Under construction: Palestinian, Syrian, Moroccan,
Yemeni, Gulf
• Neural Extensions (Zalmout et al. 2017; 2018)
Pipeline (from the slide diagram), applied to each word W0 in its context window W-4 … W4:
• Morphological analyzer
– Rule-based
– Human-created
• Morphological classifiers
– Multiple independent classifiers
– Corpus-trained
• Ranker
– Heuristic or corpus-trained
– Orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th)
(Habash&Rambow 2005; Roth et al. 2008; Pasha et al., 2014; Zalmout&Habash 2017, 2018)
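The analyze-then-rank idea can be sketched in a few lines. This is a toy illustration, not MADAMIRA's implementation: the candidate list, the feature names, the `rank` function, and the agreement-count scoring are all assumptions made for the example. The candidates loosely follow the analyses of وقد shown earlier in the talk.

```python
from typing import Dict, List, Optional

Analysis = Dict[str, str]

# Hypothetical analyzer output: every out-of-context analysis of one word.
CANDIDATES: List[Analysis] = [
    {"lemma": "qad", "pos": "noun", "gloss": "size/physique"},
    {"lemma": "qad", "pos": "part", "gloss": "may/might/has/have"},
    {"lemma": "waqad", "pos": "verb", "gloss": "ignite/burn"},
]

def rank(candidates: List[Analysis],
         predictions: Dict[str, str],
         weights: Optional[Dict[str, float]] = None) -> List[Analysis]:
    """Order analyses by weighted agreement with per-feature predictions."""
    weights = weights or {}

    def score(analysis: Analysis) -> float:
        return sum(weights.get(feature, 1.0)
                   for feature, value in predictions.items()
                   if analysis.get(feature) == value)

    return sorted(candidates, key=score, reverse=True)

# Hypothetical output of independent in-context classifiers, one per feature.
predictions = {"pos": "part", "lemma": "qad"}
best = rank(CANDIDATES, predictions)[0]
print(best["gloss"])
```

The design point is the division of labor: the rule-based analyzer guarantees that only linguistically valid analyses are considered, while the corpus-trained classifiers supply the contextual evidence used to choose among them.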
MADAMIRA
Demo: http://camel.abudhabi.nyu.edu/madamira/
MADAMIRA
Morphological Disambiguation
System: | MSA | MSA | EGY
Test: | MSA | EGY | EGY
Full Analysis | 84.3% | 27.0% | 75.4%
Diacritization | 86.4% | 32.2% | 83.2%
Lemmatization | 96.1% | 67.1% | 86.3%
Base POS-tagging | 96.1% | 82.1% | 91.1%
Segmentation | 99.1% | 90.5% | 97.4%
Example analysis
وكاتبه wkAtbh
and his writer
Diacritized form: wakAtibuhu
Lemma: kAtib_1
Features: pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms
Tokenization: w+ kAtb +h
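The feature string in the example analysis is a flat list of `name:value` pairs, so it is easy to turn into a lookup table. A minimal sketch, assuming whitespace-separated pairs with a single colon separator; the helper name `parse_features` is hypothetical and not part of any released tool.

```python
def parse_features(feature_string: str) -> dict[str, str]:
    """Split 'name:value' pairs separated by whitespace into a dict."""
    return dict(pair.split(":", 1) for pair in feature_string.split())

features = parse_features(
    "pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na "
    "vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)

# Proclitic and enclitic slots that are actually filled for this word.
clitics = {name: value for name, value in features.items()
           if name.startswith(("prc", "enc")) and value != "0"}
print(clitics)
```

For this word the filled slots are the conjunction proclitic and the pronominal enclitic, matching the tokenization w+ kAtb +h.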
Neural MADAMIRA
• Zalmout et al. (EMNLP 2017, NAACL 2018)
– Neural implementation of MADAMIRA
• 4.4% absolute increase over the state of the art in full morphological analysis accuracy on all words
• 10.6% absolute increase for out-of-vocabulary words
Automatic
Arabic Spelling Correction
• Neural models for Arabic spelling correction gave state-of-the-art results (Watson, Zalmout and Habash, 2018)
– QALB shared task data 2014, 2015
– 1 million words of training data
– Using narrow (+/-2) word and character embeddings in a seq-to-seq model did best.
CODA
A Conventional Orthography
for Dialectal Arabic
• Developed for computational processing purposes
(Habash et al, 2012)
• Objectives
– CODA covers all Arabic dialects in principle
– CODA minimizes differences in choices
– CODA is easy to learn and produce consistently
– CODA is intuitive to readers unfamiliar with it
– CODA uses Arabic script
• Started with manuals for Egyptian, Tunisian, Levantine,
Algerian, and Gulf
• CODA* : CODA for 28 different city dialects (LREC 2018)
• http://coda.camel-lab.com/
CODA Examples
CODA word | Gloss | Spelling variants
ماشفتش | I did not see | ما شفتش, مشفتش, ما شوفتش, ماشوفتش, مشوفتش, mashoftish
صحابي | my friends | صحابى, صوحابي, صوحابى, Su7abi, sohaby
الفترة | the period | الفتره, الفطرة, الفطره, ilftra
اللي | which | اللى, إللي, إللى, الي, الى, إلي, إلى, illi
قبل | before | أبل, ابل, abl, qbl, qabl
الامتحانات | the exams | الإمتحانات, المتحانات, الامتحنات, الإمتحنات, المتحنات, ilimti7anat, limtihanaat
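The simplest way to picture what a conventional orthography buys is a lookup that sends every observed variant to one agreed form. The sketch below covers only a few of the romanized variants listed above; the table and the function name `to_coda` are illustrative assumptions, and real CODA conversion depends on context and cannot be done by a word list alone.

```python
# A handful of romanized variants from the examples, mapped to one form each.
CODA_LOOKUP = {
    "illi": "اللي",
    "qabl": "قبل",
    "abl": "قبل",
    "qbl": "قبل",
    "sohaby": "صحابي",
    "su7abi": "صحابي",
}

def to_coda(token: str) -> str:
    """Return the conventional spelling if the variant is known."""
    return CODA_LOOKUP.get(token.lower(), token)

tokens = ["Su7abi", "illi", "abl", "unknown"]
print([to_coda(t) for t in tokens])
```

Once variants share a spelling, downstream counts and models see one word instead of many rare ones.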
SAMER Project
• Simplification of Arabic Masterpieces for Extensive
Reading
– Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
– NYUAD Research Enhancement Fund
– Collaboration with the UAE Ministry of Education
• Objectives
– Create a standard for the simplification of modern fiction in
Arabic for school-age learners.
– Develop a tool for automating readability scale grading for
Arabic
– Simplify a number of Arabic fiction masterpieces
SAMER Readability Prediction
• Large L1 corpus AND L2 corpus compared to previous Arabic Readability studies
• Sweeping and systematic feature engineering and comparison
• State-of-the-art tools tailored for Modern Standard Arabic
• Exploring L1 and L2 performance within the same consistent feature framework
• Leveraging L1 resources for L2 performance improvement
Full Feature Breakdown (146 feats)
[Figure: full feature breakdown, plus more detailed features obtained: clitics, person, gender, number, aspect (V), case (N); dependency parse: tree depth]
SAMER Simplification Interface
MADAR Project
• Multi-Arabic Dialect Applications and
Resources
• Collaboration among CMUQ, NYUAD and
Columbia
– Nizar Habash, Houda Bouamor, Kemal Oflazer and
Owen Rambow
• Modeling 25 Arabic city dialects
– Lexical resources, parallel data, dialect
identification, and dialect machine translation
• http://madar.camel-lab.com
• http://adida.abudhabi.nyu.edu
The MADAR Corpus: example
Fine Grained Dialect Identification
• Salameh, Bouamor and Habash, 2018 (COLING)
• Best results (Accuracy)
• Demo: http://adida.abudhabi.nyu.edu
System | 6-Label Test | 26-Label Test
Baseline: Character 5-gram language model | 92.7% | 64.7%
Multinomial Naïve Bayes, Character/Word 5-gram language model | 93.6% | 67.5%
Multinomial Naïve Bayes, Character/Word 5-gram language model + Corpus-6 Classifier Probability | – | 67.9%
Can we do better? Yes, with more input!
• How many words are needed to guarantee an optimal classification into a certain dialect?
– ~90% with 16 words (almost 2 sentences in Corpus-26)
– ~100% with 51 words (almost 7 sentences)
• We are currently preparing a competition on dialect ID of Twitter users.
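To give a feel for the kind of model named in the results, here is a toy Multinomial Naïve Bayes classifier over character n-grams with add-one smoothing. It is a sketch under heavy assumptions, not the Salameh et al. system: the class name, the three one-word "training corpora" (romanized tomato words from the earlier slide), and the trigram limit are all chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n_max=3):
    """All character 1..n_max-grams of the space-padded text."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]

class CharNgramNaiveBayes:
    def __init__(self, n_max=3):
        self.n_max = n_max
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()            # label -> training items
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n_max)
            self.gram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_score(self, text, label):
        """Log prior plus add-one smoothed log likelihood of the n-grams."""
        total = sum(self.gram_counts[label].values())
        score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for gram in char_ngrams(text, self.n_max):
            count = self.gram_counts[label][gram]
            score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.label_counts, key=lambda label: self.log_score(text, label))

# Toy data: one romanized word per city label.
model = CharNgramNaiveBayes().fit(
    ["banadoora", "2uuta", "matiisha"], ["DAM", "CAI", "FES"])
print(model.predict("banadoora"), model.predict("bandoora"))
```

With so little input, the near-variant "bandoora" is also assigned to DAM even though the slide attributes that pronunciation to other cities, which is a small-scale picture of why accuracy rises as more words become available.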
Summary
• Arabic poses many challenges to AI/NLP
– Orthographic ambiguity
– Morphological complexity
– Enormous variety
– Annotated resource poverty
• There has been a lot of work on Arabic and
Arabic dialect technologies.
– But more is needed still…
Future Directions
• More Arabic varieties
– We plan to continue working on new dialects and new
Arabic domains
– New data sets
– New algorithms for supporting low resource languages
• More tools
– We are developing an open source suite called CamelTools
to support Arabic processing
• More interdisciplinary collaborations
– We are proposing a center on Human-centered AI at
NYUAD that brings together researchers from computer
science, digital humanities, language pedagogy, history,
and sociology, as well as industrial partners.
• http://nyuad.nyu.edu/en/
Thank You!
Questions?

Didactique de l'Anglais de Spécialité (GT GERAS)Didactique de l'Anglais de Spécialité (GT GERAS)
Didactique de l'Anglais de Spécialité (GT GERAS)
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
 
An exploratory corpus study of the AP Spanish
An exploratory corpus study of the AP SpanishAn exploratory corpus study of the AP Spanish
An exploratory corpus study of the AP Spanish
 
Intro
IntroIntro
Intro
 

More from Grammarly

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
Grammarly
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly
 
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly
 
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
Grammarly
 

More from Grammarly (14)

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
 
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
 
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
 

Recently uploaded

somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
Howard Spence
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
Access Innovations, Inc.
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Sebastiano Panichella
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Orkestra
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
Faculty of Medicine And Health Sciences
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Matjaž Lipuš
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Sebastiano Panichella
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
Vladimir Samoylov
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
OWASP Beja
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
khadija278284
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
Sebastiano Panichella
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
OECD Directorate for Financial and Enterprise Affairs
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
IP ServerOne
 

Recently uploaded (13)

somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
 

Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and Solutions - Nizar Habash

  • 1. Arabic Natural Language Processing: Challenges and Solutions (التحليل الآلي للغة العربية: تحديات وحلول) • Grammarly Invited Talk, March 26, 2019 • Prof. Nizar Habash, New York University Abu Dhabi, nizar.habash@nyu.edu • NYUAD CAMeL Lab
  • 2. New York University: The Global Network University
  • 4. New York University Abu Dhabi
  • 6. New York University Abu Dhabi • Students from all over the world – 1300 students, 120 nationalities – 15% UAE, 15% American, 70% everywhere else
  • 7. New York University Abu Dhabi • Liberal Arts University – Four Divisions: Science, Engineering, Social Science, Arts and Humanities – 20 majors and many minors – Interdisciplinarity strongly encouraged • Computer Science – Undergraduate and PhD programs – PhD through NYU New York
  • 8. CAMeL Lab • Computational Approaches to Modeling Language • http://camel-lab.com • Research Areas – Arabic Artificial Intelligence – Core Natural Language Processing (orthography, morphology, syntax, and semantics) – Dialectal modeling – Machine translation – Pedagogical applications – Dialogue systems
  • 9. The CAMeLeers • Nasser Zalmout (PhD Student, NYU) • Dima Taji (PhD Student, NYU) • Alberto Chiercchi (PhD Student, NYU) • Alex Erdmann (PhD Student, Ohio State) • Salam Khalifa (Research Assistant) • Fadhl Eryani (Research Assistant) • Ossama Obeid (Research Assistant) • Mai Oudah (Postdoc)
  • 10. Ok…. Back to the talk!
  • 13. Natural Language Processing • Also known as – Computational Linguistics – Language Technologies – (Language) Artificial Intelligence • Language Technology is an interdisciplinary field – Computer science, Linguistics, Cognitive science, psychology, pedagogy, mathematics, etc. • Language technologies were some of the earliest applications of computer science – Cryptography – Machine Translation
  • 14. Natural Language Processing • Applications – Information retrieval – Machine translation – Automatic speech recognition & speech synthesis – Sentiment and emotion analysis – Dialogue systems & chatting agents – Optical character recognition – Automatic Summarization, etc. • Enabling technologies – Tokenization – Part-of-speech tagging – Syntactic parsing – Lemmatization – Word sense disambiguation, etc.
  • 15. Paradigms for Natural Language Processing • Rule-based (Intuition-based) Approaches – Linguists write rules that are applied by the machines • Machine Learning Approaches – Corpus-based, Statistical Approaches – Machines learn the “rules” from training data • Machine learning approaches are dominant in the field
  • 16. What do we need to help machines learn? • Data, data and more data! • Specifically, annotated data • Application: annotated data example – Machine Translation: a parallel corpus in two languages, e.g., the UN corpus with English, Arabic, Chinese, Spanish, Russian, French – Sentiment Analysis: a corpus of tweets with tags indicating positive, negative, or neutral – Speech Recognition: a corpus of audio files with their corresponding transcripts – Optical Character Recognition: a corpus of scanned book page images and their corresponding transcripts – Part-of-Speech Tagging: an English corpus with the part of speech indicated for each word
  • 17. Machine Learning vs. Human Learning • Humans are predisposed for acquiring language; machines are not so! • Developing robust algorithms with appropriate learning bias for computational linguistics tasks is essential!
  • 18. Challenges for Machine Learning Language Technologies • Size of training data – More is better! • Domain and genre sensitivity – Systems trained on news do not do well on novels • Quality of annotations – Why expect good performance if humans do not agree with each other on the task? • Developing robust algorithms for machine learning is essential
  • 19. Roadmap • Natural Language Processing Applications & Paradigms • (Why) is Arabic hard for NLP? • Some Arabic NLP solutions – NYUAD CAMeL Lab
  • 20. Arabic Script (الخط العربي) • A consonantal alphabet • Written right-to-left • Letters have contextual variants • Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.
  • 21. Arabic Script • Arabic script uses a set of optional diacritics – Only 1.5% of written words have at least one diacritic • Undiacritized Standard Arabic words are ambiguous out of context • Diacritics shown on the letter ب – Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/ – Nunation: بً /ban/, بٌ /bun/, بٍ /bin/ – Gemination: بّ /bb/
  • 22. ‫للمغرب‬ ‫الممنوحة‬ ‫المساعدة‬ ‫تجميد‬ ‫تنفي‬ ‫اسبانيا‬ ‫مدريد‬1-11(‫ب‬ ‫اف‬)-‫ماريا‬ ‫خوسيه‬ ‫االسبانية‬ ‫الحكومة‬ ‫رئيس‬ ‫اكد‬ ‫لل‬ ‫تقدمها‬ ‫التي‬ ‫المساعدة‬ ‫توقف‬ ‫لم‬ ‫اسبانيا‬ ‫ان‬ ‫الخميس‬ ‫اليوم‬ ‫اثنار‬‫خالفا‬ ‫مغرب‬ ‫محم‬ ‫المغربي‬ ‫والتعاون‬ ‫الخارجية‬ ‫الشؤون‬ ‫وزير‬ ‫االربعاء‬ ‫امس‬ ‫اكده‬ ‫لما‬‫بن‬ ‫د‬ ‫المغربي‬ ‫النواب‬ ‫مجلس‬ ‫امام‬ ‫عيسى‬.‫ف‬ ‫االسبانية‬ ‫الحكومة‬ ‫رئيس‬ ‫وقال‬‫ي‬ ‫و‬ ‫ابدا‬ ‫يتوقف‬ ‫لم‬ ‫والمغرب‬ ‫اسبانيا‬ ‫بين‬ ‫التعاون‬ ‫ان‬ ‫صحافي‬ ‫مؤتمر‬‫يجمد‬ ‫لم‬. ‫يا‬‫ي‬‫ن‬‫با‬‫س‬‫ي‬‫ا‬‫ي‬‫ي‬‫ف‬‫ن‬َ‫ت‬َ‫د‬‫ي‬‫ي‬‫م‬‫ج‬َ‫ت‬َ‫ة‬َ‫د‬َ‫ع‬‫سا‬‫الم‬َ‫ح‬‫و‬‫ن‬‫م‬َ‫م‬‫ال‬َ‫ة‬‫ي‬‫ب‬ ‫ي‬‫ر‬‫غ‬َ‫م‬‫ل‬‫ي‬‫ل‬ ‫يد‬ ‫ي‬‫ر‬‫د‬َ‫م‬1 - 11 (‫ف‬‫ي‬‫ا‬‫ب‬)-َ‫د‬َّ‫ك‬َ‫ا‬‫يس‬‫ي‬‫ئ‬ َ‫ر‬‫ي‬‫ة‬َ‫م‬‫و‬‫ك‬‫الح‬‫ي‬‫ن‬‫با‬‫س‬‫ي‬‫اال‬‫َّة‬‫ي‬‫يه‬‫ي‬‫س‬‫و‬‫خ‬‫يا‬ ‫ي‬‫مار‬‫اثنار‬ َ‫م‬‫و‬َ‫ي‬‫ال‬َ‫يس‬‫ي‬‫َم‬‫خ‬‫ال‬َّ‫ن‬َ‫ا‬‫يا‬‫ي‬‫ن‬‫با‬‫س‬‫ي‬‫ا‬َ‫م‬‫ي‬‫ل‬َ‫ف‬َّ‫ق‬ َ‫و‬َ‫ت‬‫الم‬‫ة‬َ‫د‬َ‫ع‬‫سا‬‫ي‬‫ي‬‫ت‬َّ‫ال‬‫ها‬‫م‬‫ي‬‫د‬َ‫ق‬‫ت‬‫ي‬‫ر‬‫غ‬َ‫م‬‫ل‬‫ي‬‫ل‬‫ي‬‫ب‬‫الفا‬ ‫ي‬‫خ‬‫ما‬‫ي‬‫ل‬ ‫ه‬َ‫د‬َّ‫ك‬َ‫ا‬‫ي‬‫س‬‫م‬َ‫ا‬َ‫ء‬‫عا‬‫ي‬‫ب‬‫ر‬َ‫ال‬‫ا‬َ‫ير‬ ‫ي‬‫ز‬ َ‫و‬‫ي‬‫ون‬‫ؤ‬‫الش‬‫ي‬‫ج‬ ‫ي‬‫الخار‬‫ي‬‫ة‬َّ‫ي‬‫ي‬‫ن‬‫عاو‬َ‫ت‬‫ال‬ َ‫و‬‫ي‬‫ي‬‫ي‬‫ب‬ ‫ي‬‫ر‬‫غ‬َ‫م‬‫ال‬‫م‬‫د‬َّ‫م‬َ‫ح‬‫ن‬‫ي‬‫ب‬ ‫ى‬َ‫س‬‫ي‬‫ي‬‫ع‬َ‫مام‬َ‫ا‬‫ي‬‫س‬‫ي‬‫ل‬‫ج‬َ‫م‬‫ي‬‫ب‬‫ا‬‫و‬‫الن‬‫ي‬‫ي‬‫ي‬‫ب‬ ‫ي‬‫ر‬‫غ‬َ‫م‬‫ال‬.َ‫ل‬‫قا‬ َ‫و‬‫يس‬‫ي‬‫ئ‬ َ‫ر‬‫ي‬‫ة‬َ‫م‬‫و‬‫ك‬‫الح‬‫س‬‫ي‬‫اال‬‫ي‬‫ة‬َّ‫ي‬‫ي‬‫ن‬‫با‬‫ي‬‫ي‬‫ف‬ ‫ر‬َ‫م‬َ‫ت‬‫ؤ‬‫م‬‫ي‬‫ي‬‫ف‬‫حا‬َ‫ص‬َّ‫ن‬َ‫ا‬َ‫ن‬‫عاو‬َ‫ت‬‫ال‬َ‫ن‬‫ي‬َ‫ب‬‫با‬‫س‬‫ي‬‫ا‬‫يا‬‫ي‬‫ن‬‫ي‬‫ب‬ ‫ي‬‫ر‬‫غ‬َ‫م‬‫ال‬ َ‫و‬َ‫م‬‫ي‬‫ل‬‫ف‬َّ‫ق‬ َ‫و‬َ‫ت‬َ‫ي‬َ‫ا‬‫دا‬َ‫ب‬َ‫م‬‫ي‬‫ل‬ َ‫و‬‫د‬‫ي‬‫م‬َ‫ج‬‫ي‬.
  • 23. Orthographic Ambiguity • Arabic words can be very ambiguous due to optional diacritics • But how ambiguous? • Classic example: “ths s wht n rbc txt lks lk wth n vwls” (this is what an Arabic text looks like with no vowels) – Not exactly true • Long vowels are always written • Initial vowels are represented by an ا (Alif) • Some final short vowels are deterministically inferable – A closer analogy: “ths is wht an Arbc txt lks lik wth no vwls” • For a computer … – A word on average has 12.3 analyses, 6.8 diacritizations, and 2.7 lemmas (core meanings) • Not all of this ambiguity is due to orthography! More on this later.
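The vowel-dropping analogy on this slide can be reproduced mechanically. The following is a minimal Python sketch, assuming that the optional diacritics are the Unicode combining marks U+064B through U+0652 (fathatan through sukun); the function names are illustrative and do not come from the talk.

```python
# Arabic diacritics (assumed range): fathatan U+064B through sukun U+0652.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def dediacritize(text: str) -> str:
    """Drop optional Arabic diacritics, keeping the letters (long vowels included)."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def strip_english_vowels(text: str) -> str:
    """English analogy from the slide: remove every vowel letter."""
    return "".join(ch for ch in text if ch.lower() not in "aeiou")


if __name__ == "__main__":
    # kataba ("he wrote") written with three fatha marks; only the letters k-t-b remain.
    print(dediacritize("\u0643\u064E\u062A\u064E\u0628\u064E"))
    print(strip_english_vowels("this is what an Arabic text looks like with no vowels"))
```

Because Arabic long vowels are written as letters rather than diacritics, `dediacritize` leaves them in place, which is why the all-vowels-removed English string overstates the ambiguity.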
  • 24. Spelling Errors • The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high share (30%) of words in unedited Standard Arabic comments on Aljazeera contain errors. – 2 million words were manually corrected to create training data. • Arabic spelling errors are a big challenge to language technologies – GIGO: Garbage In, Garbage Out – Errors in Standard Arabic – Inconsistencies in Dialectal Arabic (no official standard) • Robust systems need additional functionality to correct errors or to function well despite them.
  • 25. Morphological Complexity • Arabic is morphologically rich – A core word has many inflected forms – Example: Arabic verbs have 5,400 forms: Gender (2), Number (3), Person (3), Aspect (3), Tense particle (2), Mood (3), Voice (2), Pronominal clitic (12), Conjunction clitic (3) • Example: وسنقولها /wasanaqūluhā/ – و+س+ن+قول+ها – wa+sa+na+qūl+u+hā – and+will+we+say+it – “And we will say it” • Some forms of the same verb: قال، قالت، قالا، قالوا، قلتَ، قلتِ، قلتما، قلتم، قلتن، يقول، يقولَ، يقل، تقول، تقولَ، تقل، تقولين، تقولي، ... فقال، فقالت، فقالا ... وسأقولها، وسنقولها، ...
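The feature inventory on this slide can be multiplied out to see how quickly the space of verb forms grows. The sketch below is a back-of-the-envelope calculation, not the method behind the slide's figure: the naive product is larger than the 5,400 forms quoted, and one plausible reading (an assumption, not something the slide states) is that many feature combinations never co-occur.

```python
from math import prod

# Feature inventory as listed on the slide: feature name -> number of values.
VERB_FEATURES = {
    "gender": 2,
    "number": 3,
    "person": 3,
    "aspect": 3,
    "tense_particle": 2,
    "mood": 3,
    "voice": 2,
    "pronominal_clitic": 12,
    "conjunction_clitic": 3,
}


def naive_form_count(features: dict[str, int]) -> int:
    """Upper bound that treats every feature as freely combinable."""
    return prod(features.values())


if __name__ == "__main__":
    upper_bound = naive_form_count(VERB_FEATURES)
    print(f"naive upper bound: {upper_bound}")
    print(f"ratio to the 5,400 forms quoted on the slide: {upper_bound / 5400:.2f}")
```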
  • 26. Morphological Complexity • English is not morphologically rich. – The number of inflected forms is small – The verb paradigm is limited to 6 forms: VB go, VBD went, VBG going, VBN gone, VBP go, VBZ goes – The complete English part-of-speech tag set has 48 tags – The complete Arabic part-of-speech tag set has 22,400 tags
  • 27. Morphological Ambiguity • 12.3 analyses and 2.7 lemmas per word • Spelling ambiguity – Optional diacritics – Suboptimal spelling, e.g., (أ, إ → ا) or (ة / ه) – Example: وبادلتها • وَ+بِ+أَدِلَّةِ+ها “and with her pieces of evidence” • وَ+بادَلتُ+ها “and I exchanged with her” • Derivational ambiguity and homonymy – العَين: the eye, the water spring, Al-Ain city, the notable – المحتل: occupier, occupied (العدو المحتل / الوطن المحتل / الدول المحتلة)
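Systems that must cope with the suboptimal spellings named on this slide often collapse the confusable letters before lookup. The following is a minimal Python sketch of that idea; the mapping table, including the choice to fold ة into ه, is an assumption made for illustration and is not taken from the talk.

```python
# Assumed normalization table: hamzated Alif forms collapse to bare Alif,
# and Ta-marbuta collapses to Ha.
NORMALIZATION = str.maketrans({
    "\u0623": "\u0627",  # Alif with hamza above -> bare Alif
    "\u0625": "\u0627",  # Alif with hamza below -> bare Alif
    "\u0629": "\u0647",  # Ta-marbuta -> Ha
})


def normalize(word: str) -> str:
    """Map confusable letters to one canonical letter so spelling variants match."""
    return word.translate(NORMALIZATION)


if __name__ == "__main__":
    with_hamza = "\u0625\u0633\u0628\u0627\u0646\u064A\u0627"     # "Spain" with hamza below
    without_hamza = "\u0627\u0633\u0628\u0627\u0646\u064A\u0627"  # "Spain" with bare Alif
    print(normalize(with_hamza) == normalize(without_hamza))
```

The trade-off is that normalization makes matching robust to spelling variation at the cost of merging words that were distinct, which adds to the ambiguity a later analysis step has to resolve.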
  • 28. Morphological Annotation • Sentence: وقد كاتبته فتحية لمدة سنتين. “Fathia corresponded with her for two years.” • Candidate analyses per word (lemma; segmentation; POS; features; gloss) – وقد: قَد; و+قد; Noun; Masc Sg; size/physique – وقد: قَد; و+قد; Particle; (none); may/might/has/have – وقد: وَقد; وقد; Noun; Masc Sg; fuel/burning – وقد: وَقَّد; وقد; Verb; Perf 3rd Masc Sg; kindle/ignite – وقد: وَقَد; وقد; Verb; Perf 3rd Masc Sg; ignite/burn – كاتبته: كاتِب; كاتبة+ه; Noun; Fem Sg; author/writer/clerk – كاتبته: كاتَب; كاتبت+ه; Verb; Perf 3rd Fem Sg; correspond_with – كاتبته: كاتَب; كاتبت+ه; Verb; Perf 1st Sg; correspond_with – كاتبته: كاتَب; كاتبت+ه; Verb; Perf 2nd Fem Sg; correspond_with – كاتبته: كاتَب; كاتبت+ه; Verb; Perf 2nd Masc Sg; correspond_with – فتحية: تَحِيَّة; ف+تحية; Noun; Fem Sg; greeting/salute – فتحية: فَتحِيَّة; فتحية; Proper; Fem Sg; Fathia – لمدة: مُدَّة; ل+مدة; Noun; Fem Sg; interval/period – سنتين: سِنت; سنتين; Noun; Masc Du; cent – سنتين: سَنَة; سنتين; Noun; Fem Du; year – Final period: Punc
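The annotation table can be read as a mapping from each surface word to a list of candidate analyses, from which a disambiguator must pick one. Below is a minimal Python sketch of that data structure for the word كاتبته, using the POS, features, and glosses copied from the slide; the gloss is used as a stand-in for the lemma, which is an assumption made here to keep the example readable.

```python
from collections import Counter

# (POS, features, gloss) analyses for one surface word, copied from the slide.
ANALYSES = [
    ("Noun", "Fem Sg", "author/writer/clerk"),
    ("Verb", "Perf 3rd Fem Sg", "correspond_with"),
    ("Verb", "Perf 1st Sg", "correspond_with"),
    ("Verb", "Perf 2nd Fem Sg", "correspond_with"),
    ("Verb", "Perf 2nd Masc Sg", "correspond_with"),
]


def summarize(analyses):
    """Return (number of analyses, number of distinct glosses, POS histogram)."""
    pos_counts = Counter(pos for pos, _, _ in analyses)
    distinct_glosses = {gloss for _, _, gloss in analyses}
    return len(analyses), len(distinct_glosses), pos_counts


if __name__ == "__main__":
    total, lemma_proxies, pos_counts = summarize(ANALYSES)
    print(total, lemma_proxies, dict(pos_counts))
```

The gap between the number of analyses and the number of distinct glosses mirrors the slide's distinction between analyses per word and lemmas per word.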
  • 30. Arabic and its Dialects • Arabic has ~360M speakers • Forms of Arabic – Classical Arabic (CA) • Classic historical and liturgical texts – Modern Standard Arabic (MSA) • News media & formal speeches and settings • Only written standard – Dialectal Arabic (DA) • Predominantly spoken vernaculars • No written standards • Very common on social media • Diglossia – Two forms of the language (MSA & DA) exist side by side
  • 31. Arabic and its Dialects • Official language: Modern Standard Arabic (MSA) – No one’s native language • Regional Dialects – Egyptian Arabic (EGY) – Levantine Arabic (LEV) – Gulf Arabic (GLF) – North African Arabic (NOR): Moroccan, Algerian, Tunisian – Iraqi, Yemenite, Sudanese • Dialects and sub-dialects… – City, Rural, Bedouin
  • 32. Phonological Variations • Major variants (MSA → dialects) – ق /q/ → /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/ – ث /θ/ → /θ/, /t/, /s/ – ذ /ð/ → /ð/, /d/, /z/ – ج /ʤ/ → /ʤ/, /g/, /ʒ/
  • 33. Spelling Inconsistency • Egyptian Arabic word مابيقولهاش /mabiʔulhāʃ/ “he does not say it” • If there is no standard, can a word be misspelled?
  • 34. Lexical and Phonological Variation You say to-MAY-to, I say to-MAH-to!
  • 35. Lexical and Phonological Variation • Spellings: بندورة، توماطيش، طوماطيش، طماط، طماطة، طماطم، طماطمة، طماطيس، قوطة، مطيشة • Pronunciations by city – b a n a d oo r a (ALE, DAM) – b a n a d uu r a (BEI, AMM, JER) – b a n d oo r a (AMM, JER, SAL) – t uu m aa t. ii sh (FES) – t. a m aa t. (SAN, MUS) – t. e m aa t. (DOH) – t a m aa t a (BAG, BAS, MOS) – t. a m aa t. i m (JED, RIY, SAN, KHA, ALX, ASW) – t. m aa t. i m (SFA, TUN, BEN, TRI) – g uu t. a (JED) – 2 uu t. a (CAI) – t. o m a t. ii sh (ALG) – t. a m aa t. ii s (SAN) – m a t. ii sh a (FES, RAB) – t. a m aa t. m a (MUS)
  • 38. Morphological Variation • Some aspects of words are simplified in the dialects – Loss of case marking: kitaabu, kitaaba, kitaabi, kitaabun, kitaaban, kitaabin → kitaab (كتابُ، كتابَ، كتابِ، كتابٌ، كتاباً، كتابٍ → كتاب) – Consolidation of masculine and feminine plurals: yaktubuun, yaktubuu, yaktubna → yiktibu || yikitbuun (يكتبون، يكتبوا، يكتبن → يكتبوا || يكتبون) • Other aspects increase in complexity!
  • 39. Morphological Variation: Verb Morphology • Morpheme types: conj, verb, object, subj, tense, IOBJ, neg • MSA: ولم تكتبوها له /walam taktubūhā lahu/ = /wa+lam taktubū+hā la+hu/ (and+not_past write_you+it for+him) • EGY: وماكتبتوهالوش /wimakatabtuhalūʃ/ = /wi+ma+katab+tu+ha+lū+ʃ/ (and+not+wrote+you+it+for_him+not) • Both mean “And you didn’t write it for him”
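To make the contrast on this slide concrete, the two segmentations can be paired with their glosses in a few lines of Python. This is only an illustrative sketch: the helper name `align` is hypothetical, and the morpheme and gloss strings are copied from the slide.

```python
# Morpheme segmentations and glosses copied from the slide.
MSA = ("wa+lam taktubū+hā la+hu", "and+not_past write_you+it for+him")
EGY = ("wi+ma+katab+tu+ha+lū+ʃ", "and+not+wrote+you+it+for_him+not")


def align(segmented, glossed):
    """Pair each morpheme with its gloss, word by word (hypothetical helper)."""
    pairs = []
    for word, gloss in zip(segmented.split(), glossed.split()):
        morphemes, glosses = word.split("+"), gloss.split("+")
        if len(morphemes) != len(glosses):
            raise ValueError(f"cannot align {word!r} with {gloss!r}")
        pairs.append(list(zip(morphemes, glosses)))
    return pairs


for name, (seg, gloss) in (("MSA", MSA), ("EGY", EGY)):
    words = align(seg, gloss)
    n_morphemes = sum(len(w) for w in words)
    print(f"{name}: {len(words)} word(s), {n_morphemes} morphemes")
```

Run as written, it reports three words carrying six morphemes for MSA and a single word carrying seven morphemes for Egyptian, which is the point of the slide: the dialect packs the whole clause into one orthographic word.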
  • 40. Challenges to Arabic NLP
      Challenge | Arabic | English
      Orthographic ambiguity | More | Less
      Orthographic inconsistency | More | Less
      Morphological complexity | More | Less
      Dialectal variation | More | Less
    • Example: the undiacritized word وبعقدنا can be read as وبِعَقِّدنَا “and he stresses us out”, or as “and with our …”: وَبِعَقْدِنَا (contract) | وَبِعِقْدِنَا (necklace) | وَبِعُقَدِنَا (psychoses)
  • 41. Comparing Performance • SOTA Part-of-Speech Tagging and Syntax Parsing Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016) – Large gap between English and Arabic, and between Standard Arabic and Arabic dialects – More resources and more research effort for English than for Arabic
      Task | English | Standard Arabic | Egyptian Arabic
      Full Part-of-Speech | 97.6% | 85.4% | 75.5%
      Core Part-of-Speech | | 96.1% | 91.1%
      Dependency Syntax | 92.2% | 86.2% |
  • 42. Comparing Performance • Machine Translation – Quality of machine translation from MSA is much better than from the dialects – The main reason is the availability of parallel corpora • 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)
      Variety | Arabic Source Text | Google Translate (Oct 17, 2018)
      MSA | من فضلك لا تكلمني | Please do not talk to me
      EGY | انت متكلمنيش خالص | You are pure Mtkmlnish
      MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
      LEV | شكلو مفيش كهربا، ليش هيك؟ | Shaku Mfish electrified, why not heck?
      IRQ | شو ماكو كهرباء، خير؟ | Xu Mako electricity, okay?
  • 44. Roadmap • Natural Language Processing Applications & Paradigms • (Why) is Arabic hard for NLP? • Some Arabic NLP solutions – NYUAD CAMeL Lab
  • 45. MADAMIRA http://camel.abudhabi.nyu.edu/madamira/ • State-of-the-art Arabic and Arabic Dialect Processing tool (Pasha et al., 2014) – Full Morphological disambiguation – Hybrid • Rule-based analyzer dictionaries • Machine learning disambiguation • Current release: Standard Arabic and Egyptian Arabic • Under construction: Palestinian, Syrian, Moroccan, Yemeni, Gulf • Neural Extensions (Zalmout et al. 2017; 2018)
  • 46. MADAMIRA architecture, applied to each word W0 in its context (W-4 … W4) • Morphological Analyzer – Rule-based – Human-created • Morphological Classifiers – Multiple independent classifiers – Corpus-trained • Ranker – Heuristic or corpus-trained – Orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th) • (Habash & Rambow 2005; Roth et al. 2008; Pasha et al., 2014; Zalmout & Habash 2017, 2018)
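The analyzer, classifiers, and ranker named on this slide can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not MADAMIRA's implementation: the toy lexicon, the fixed classifier predictions, the function names, and the agreement-count scoring are all hypothetical stand-ins chosen for illustration.

```python
# Toy stand-in for the rule-based analyzer: surface form -> candidate analyses.
# Both entries are illustrative, not actual MADAMIRA output.
TOY_LEXICON = {
    "wkAtbh": [
        {"diac": "wakAtabahu", "pos": "verb", "asp": "p",
         "gloss": "and he corresponded with him"},
        {"diac": "wakAtibuhu", "pos": "noun", "asp": "na",
         "gloss": "and his writer"},
    ],
}


def analyze(word):
    """Return every out-of-context analysis of a word (empty if unknown)."""
    return TOY_LEXICON.get(word, [])


def classify(words, index):
    """Stand-in for the independent corpus-trained classifiers.

    A real system would predict each feature from the context window;
    here the predictions are fixed so the example stays self-contained.
    """
    return {"pos": "noun", "asp": "na"}


def rank(analyses, predicted, weights):
    """Order analyses by weighted agreement with the predicted features."""
    def score(analysis):
        return sum(weights.get(feat, 1.0)
                   for feat, value in predicted.items()
                   if analysis.get(feat) == value)
    return sorted(analyses, key=score, reverse=True)


def disambiguate(words, weights=None):
    """Pick the top-ranked analysis for each word, or None if there is none."""
    weights = weights or {}
    chosen = []
    for i, word in enumerate(words):
        ranked = rank(analyze(word), classify(words, i), weights)
        chosen.append(ranked[0] if ranked else None)
    return chosen


print(disambiguate(["wkAtbh"])[0]["gloss"])
```

The verb reading is listed first in the toy lexicon on purpose, so the noun reading only comes out on top because the ranker prefers the analysis that agrees with the predicted features.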
  • 48. MADAMIRA Morphological Disambiguation
      Task | System: MSA, Test: MSA | System: MSA, Test: EGY | System: EGY, Test: EGY
      Full Analysis | 84.3% | 27.0% | 75.4%
      Diacriticization | 86.4% | 32.2% | 83.2%
      Lemmatization | 96.1% | 67.1% | 86.3%
      Base POS-tagging | 96.1% | 82.1% | 91.1%
      Segmentation | 99.1% | 90.5% | 97.4%
    • Example analysis of وكاتبه (wkAtbh, “and his writer”): wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms; segmentation: w+ kAtb +h
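The `name:value` feature string in the example analysis is easy to turn into a dictionary for downstream use. The sketch below parses the string exactly as it appears on the slide; the function name `parse_features` is hypothetical, and treating `0` and `na` as "no clitic in this slot" is an assumption made for the illustration.

```python
# Feature string copied from the slide's example analysis of "wkAtbh".
RAW = ("pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na "
       "mod:na gen:m num:s stt:c cas:n enc0:pron3ms")


def parse_features(raw):
    """Turn whitespace-separated 'name:value' pairs into a dict."""
    features = {}
    for field in raw.split():
        name, sep, value = field.partition(":")
        if sep:  # skip tokens that are not name:value pairs
            features[name] = value
    return features


features = parse_features(RAW)

# Assumption: a proclitic (prc*) or enclitic (enc*) slot is filled
# when its value is neither "0" nor "na".
clitics = {name: value for name, value in features.items()
           if name.startswith(("prc", "enc")) and value not in ("0", "na")}

print(features["pos"], clitics)
```

For the slide's example this yields a noun with two filled clitic slots, the conjunction `wa_conj` and the pronoun `pron3ms`, matching the segmentation w+ kAtb +h.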
  • 49. Neural MADAMIRA • Zalmout et al. (EMNLP 2017, NAACL 2018) – Neural implementation for MADAMIRA • 4.4% absolute increase over the state-of-the-art in full morphological analysis accuracy on all words • 10.6% absolute increase for out-of-vocabulary words
  • 50. Automatic Arabic Spelling Correction • Neural models for Arabic spelling correction gave state-of-the-art results (Watson, Zalmout and Habash, 2018) – QALB shared task data 2014, 2015 – 1 million word training data – Using word and character narrow embeddings (+/-2) in a seq-to-seq model did best
  • 51. CODA: A Conventional Orthography for Dialectal Arabic • Developed for computational processing purposes (Habash et al., 2012) • Objectives – CODA covers all Arabic dialects in principle – CODA minimizes differences in choices – CODA is easy to learn and produce consistently – CODA is intuitive to readers unfamiliar with it – CODA uses Arabic script • Started with manuals for Egyptian, Tunisian, Levantine, Algerian, and Gulf • CODA*: CODA for 28 different city dialects (LREC 2018) • http://coda.camel-lab.com/
  • 52. CODA Examples • CODA: ما شفتش صحابي الفترة اللي قبل الامتحانات • Gloss: I did not see | my friends | the period | which | before | the exams • Spelling variants, by word:
      ما شفتش: ماشفتش, مشفتش, ما شوفتش, ماشوفتش, مشوفتش, mashoftish
      صحابي: صحابى, صوحابي, صوحابى, Su7abi, sohaby
      الفترة: الفتره, الفطرة, الفطره, ilftra
      اللي: اللى, إللي, إللى, الي, الى, إلي, إلى, illi
      قبل: أبل, ابل, abl, qbl, qabl
      الامتحانات: الإمتحانات, المتحانات, الامتحنات, الإمتحنات, المتحنات, ilimti7anat, limtihanaat
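The idea of mapping many observed spellings onto one conventional form can be illustrated with a plain lookup table. This is a deliberately naive sketch, not how CODA conversion is actually done: the table holds only a handful of the variants from this slide, the names `CODA_FORMS` and `to_coda` are hypothetical, and a real converter would need context (for example, الى can also be the preposition "to" rather than a spelling of اللي).

```python
# CODA form -> a few of the spelling variants shown on the slide.
CODA_FORMS = {
    "قبل": ["أبل", "ابل", "abl", "qbl", "qabl"],
    "اللي": ["اللى", "إللي", "الي", "illi"],
    "الفترة": ["الفتره", "الفطرة", "ilftra"],
    "صحابي": ["صحابى", "صوحابي", "Su7abi", "sohaby"],
}

# Invert the table once: variant spelling -> CODA form.
VARIANT_TO_CODA = {variant: coda
                   for coda, variants in CODA_FORMS.items()
                   for variant in variants}


def to_coda(tokens):
    """Replace known variants by their CODA form; leave other tokens as-is."""
    return [VARIANT_TO_CODA.get(token, token) for token in tokens]


print(to_coda(["sohaby", "الفتره", "illi", "abl"]))
```

A token-level table like this ignores ambiguity and multi-word forms such as ما شفتش, which is one reason the CODA guidelines are specified as conventions rather than as a fixed word list.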
  • 53. SAMER Project • Simplification of Arabic Masterpieces for Extensive Reading – Muhamed Al Khalil, Nizar Habash and Dris Sulaimani – NYUAD Research Enhancement Fund – Collaboration with the UAE Ministry of Education • Objectives – Create a standard for the simplification of modern fiction in Arabic for school-age learners – Develop a tool for automating readability scale grading for Arabic – Simplify a number of Arabic fiction masterpieces
  • 54. SAMER Readability Prediction • Large L1 corpus AND L2 corpus compared to previous Arabic Readability studies • Sweeping and systematic feature engineering and comparison • State-of-the-art tools tailored for Modern Standard Arabic • Exploring L1 and L2 performance within the same consistent feature framework • Leveraging L1 resources for L2 performance improvement
  • 55. Full Feature Breakdown (146 feats)
  • 58. MADAR Project • Multi-Arabic Dialect Applications and Resources • Collaboration among CMUQ, NYUAD and Columbia – Nizar Habash, Houda Bouamor, Kemal Oflazer and Owen Rambow • Modeling 25 Arabic city dialects – Lexical resources, parallel data, dialect identification, and dialect machine translation • http://madar.camel-lab.com • http://adida.abudhabi.nyu.edu
  • 59. The MADAR Corpus: example
  • 60. Fine Grained Dialect Identification • Salameh, Bouamor and Habash, 2018 (COLING) • Demo: http://adida.abudhabi.nyu.edu • Best results (accuracy):
      System | 6-Label Test | 26-Label Test
      Baseline: Character 5-gram language model | 92.7% | 64.7%
      Multinomial Naïve Bayes, Character/Word 5-gram language model | 93.6% | 67.5%
      Multinomial Naïve Bayes, Character/Word 5-gram language model + Corpus-6 Classifier Probability | | 67.9%
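To give a feel for the kind of model named in the results, here is a minimal multinomial Naive Bayes classifier over character n-grams, written in plain Python. It is a sketch under loud assumptions and not the system from the paper: it uses raw n-gram counts with add-one smoothing instead of language-model features, the class name `CharNgramNB` is hypothetical, and the four training sentences are examples reused from earlier slides rather than MADAR data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, max_n=5):
    """Yield all character n-grams (n = 1..max_n) of a space-padded string."""
    padded = f" {text.strip()} "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing)."""

    def __init__(self, max_n=5):
        self.max_n = max_n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram -> count
        self.doc_counts = Counter()               # label -> number of sentences
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            self.doc_counts[label] += 1
            for gram in char_ngrams(text, self.max_n):
                self.ngram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        # n-grams never seen in training carry no information and are skipped.
        grams = [g for g in char_ngrams(text, self.max_n) if g in self.vocab]
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / n_docs)  # class prior
            score += sum(math.log((counts[g] + 1) / total) for g in grams)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training set: sentences reused from earlier slides, not MADAR data.
TOY_DATA = [
    ("من فضلك لا تكلمني", "MSA"),
    ("لا يوجد كهرباء ماذا حدث", "MSA"),
    ("انت متكلمنيش خالص", "EGY"),
    ("ما شفتش صحابي الفترة اللي قبل الامتحانات", "EGY"),
]
model = CharNgramNB().fit(TOY_DATA)
print(model.predict("انت متكلمنيش خالص"))
```

With four sentences the model can only memorize its training data, so it says nothing about the accuracies in the table; it only shows the mechanics of scoring a sentence against per-dialect character n-gram statistics.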
  • 61. Can we do better? Yes, with more input! • How many words are needed to guarantee an optimal classification into a certain dialect? • ~90% with 16 words (almost 2 sentences in Corpus-26) • ~100% with 51 words (almost 7 sentences) • We are currently preparing a competition on dialect ID of Twitter users.
  • 62. Summary • Arabic poses many challenges to AI/NLP – Orthographic ambiguity – Morphological complexity – Enormous variety – Annotated resource poverty • There has been a lot of work on Arabic and Arabic dialect technologies. – But more is needed still…
  • 63. Future Directions • More Arabic varieties – We plan to continue working on new dialects and new Arabic domains – New data sets – New algorithms for supporting low resource languages • More tools – We are developing an open source suite called CamelTools to support Arabic processing • More interdisciplinary collaborations – We are proposing a center on Human-centered AI at NYUAD that brings together researchers from computer science, digital humanities, language pedagogy, history, and sociology, as well as industrial partners.