SlideShare a Scribd company logo
Workshop on Arabic Natural Language Processing 
EMNLP 2014, Doha 
Building a Corpus for Palestinian Arabic 
- a Preliminary Study 
Mustafa Jarrar, *Nizar Habash, Diyam Akra, Nasser Zalmout 
Birzeit University, Palestine *New York University Abu Dhabi
Reference: 
Mustafa Jarrar, Nizar Habash, Diyam Akra, Nasser Zalmout: Building A Corpus For Palestinian Arabic: A 
Preliminary Study. In proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing. 
Association for Computational Linguistics (ACL), Pages (18-27). October 25, 2014, Doha, Qatar. ISBN: 978-1- 
937284-96-1 
Download Article 
http://www.jarrar.info/publications/JHAZ14.pdf 
Watch the Presentation 
https://www.youtube.com/watch?v=Kw3R3DQVc8E
Acknowledgment 
• This work is part of the ongoing project called Curras, funded 
by the Research Council - Palestinian Ministry of Higher 
Education. 
• Team: 
– Mustafa Jarrar (project investigator) 
– Nizar Habash 
– Mahdi Arar 
– Diyam Fuad 
– Faeq Alrimawi 
• Collaborators: Owen Rambow, Faisal Al-Shargi, and Ramy 
Eskander 
http://sina.birzeit.edu/projects/
Agenda 
• Introduction 
• Distinguishing features of Palestinian Arabic 
• Related work 
• Building the Curras Corpus 
– Corpus Collection 
– Annotation process 
• Challenges 
– A Conventional Orthography for Palestinian Arabic 
– Morphological Annotation Process challenges 
• Conclusion and Future Work
Introduction 
• Most Arabic NLP tools and resources were developed to serve 
Modern Standard Arabic (MSA) 
– Dialectal phonological and morphological variations from MSA 
– No standard orthography for Arabic Dialects 
• Exponential growth of socially generated dialectal content 
• Our goal is to build a high-coverage and well-annotated corpus 
of the Palestinian Arabic dialect (PAL) 
– Curras Corpus of Palestinian Arabic 
– Collect words from different resources (43K words) 
– Annotate using a morphological analyzer tool (MADAMIRA) 
– Manually correct annotations 
• First step toward developing NLP applications for PAL 
• Insights and extensions to other Arabic dialects
Distinguishing Features of PAL 
• Palestinian Arabic is a Southern Levantine dialect of 
Arabic 
– The points presented here focus on difference from MSA 
– Some are shared with other dialects 
• Phonology 
– Pronunciation of the /q/ phoneme (MSA (ق 
• Urban /’/, rural /k/, Bedouin /g/, and Druze /q/ 
– MSA diphthongs /ay/ and /aw/ generally become /e:/ and 
/o:/ 
– PAL elides many short vowels that appear in the MSA 
cognates 
• e.g. MSA جبال /jiba:l/ ‘mountains’ becomes /jba:l/ 
– Insertion of epenthetic vowels (Herzallah, 1990) 
• e.g. /kalb/ and /kalib/ ( كلب ‘dog’)
Distinguishing Features of PAL 
• Morphology 
– PAL in most of its sub-dialects collapses the feminine 
and masculine plurals and duals in verbs and most 
nouns, as well as other forms: 
• e.g. حبيت /Habbe:t/ ‘I (or you m.s.) loved’ 
– PAL has many clitics that do not exist in MSA 
• The progressive particle /b+/ (as in /b+tuktub/ ‘you write’) 
• The demonstrative particle /ha+/ (as in /ha+l+be:t/ ‘this 
house’) 
• The negation cirmcumclitic /ma+ +š/ (as in /ma+katab+$/ 
‘he did not write’) 
• Indirect object clitic (as in /ma+katab+l+o:+š/ ‘he did not 
write to him’)
Distinguishing Features of PAL 
• Lexicon 
– Some common PAL words are portmanteaus of 
MSA words 
• /biddi/ ‘I want’ is from MSA /bi+widd+i/ ‘in my desire’ 
• /le:š/ ‘why’ corresponds to MSA /li+’ayyi šay’in/ ‘for 
what thing’ 
– Borrowed words from other languages: 
• E كندرة /kundara/ ‘shoe’ (Turkish) 
• E محسوم /maHsu:m/ ‘checkpoint’ (Hebrew)
Related Work 
• Dialectal Corpus Collection 
– Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany, et al., 2002) 
– Maamouri et al. (2006) developed a pilot Levantine Arabic Treebank 
(LATB) 
– LDC BOLT resources for Egyptian Arabic (ATB-ARZ) 
– Others: YADAC (Al-Sabbagh and Girju, 2012), COLABA project (Diab 
et al., 2010) 
• Dialectal Orthography 
– Habash et al (2012a) proposed the so-called CODA (conventional 
orthography of dialectal Arabic) for the purpose of developing 
computational models of Arabic dialects in general 
• Dialectal Morphological Annotation 
– CALIMA-Egyptian (Habash et al., 2012b) 
– MADAMIRA (MSA/Egyptian) (Pasha et al., 2014)
Agenda 
• Introduction 
• Distinguishing features of Palestinian Arabic 
• Related work 
• Building the Curras Corpus 
– Corpus Collection 
– Annotation process 
• Challenges 
– A Conventional Orthography for Palestinian Arabic 
– Morphological Annotation Process challenges 
• Conclusion and Future Work
Building the Curras Corpus: 
Corpus Collection 
• Challenges 
– Scarce resources in terms of written literature 
• Traditional and informal contexts, such as conversations in 
TV series, movies, or on social media platforms 
– Orthographic inconsistency 
• Mixing of phonetic spelling and MSA-cognate-based spelling 
• Arabizi spelling 
• In this stage we focus on precision and variety 
rather than mere size 
– We tried not only to manually select and review the 
content of the corpus, but also to assure a variety of 
topics and contexts, localities and sub-dialects.
Manually Collected Resources 
The Curras Corpus Statistics 
Document Type 
Word 
Tokens 
Word 
Types 
Documents 
Facebook 3,120 1,985 35 threads 
Twitter 3,541 2,133 38 threads 
Blogs 8,748 4,454 37 threads 
Forums 1,092 798 33 threads 
Palestinian Stories 2,407 1,422 6 stories 
Palestinian Terms 759 556 1 doc 
TV Show: وطن ع وتر 
Watan Aa Watar 
23,423 8,459 41 episodes 
Curras Total 43,090 19,807 191
Building the Curras Corpus: 
Corpus Annotation 
• We define the annotation of a word as a tuple 
<w, wB c, cB, l, pB, g, i> 
W Raw (Unicode) The raw input word 
wB 
Raw (Buckwalter) The same raw input word in Buckwalter 
transliteration 
C 
CODA (Unicode) The Conventional Orthography (Habash et al., 2012) 
version of the input word 
cB 
CODA (Buckwalter) The Buckwalter transliteration of the CODA form 
l Lemma The lemma of the word in Buckwalter transliteration 
pB 
The Full Buckwalter POS Tag 
g Gloss (English) 
i 
Analysis Source of annotation, e.g., ANNO is a human annotator, and 
MADA is the MADAMIRA system with some minor or no automatic 
post-processing
Building the Curras Corpus: 
Corpus Annotation 
• We exploit existing tools to speed up the annotation process. 
We specifically use the MADAMIRA tool (Pasha et al., 2014) 
for morphological analysis and disambiguation of MSA and 
EGY. 
• A manual step is then needed to verify every annotation, to 
correct errors and fill in gaps. 
• We made one major simplification to the annotations to 
minimize the load on the human annotator. 
– We do not produce diacritized morphological analyses in the 
Buckwalter POS tag.
Agenda 
• Introduction 
• Distinguishing features of Palestinian Arabic 
• Related work 
• Building the Curras Corpus 
– Corpus Collection 
– Annotation process 
• Challenges 
– A Conventional Orthography for Palestinian Arabic 
– Morphological Annotation Process challenges 
• Two pilot studies 
• Conclusion and Future Work
Challenges: 
Conventional Orthography for PAL 
• Inconsistent spellings 
– e.g. ‘heart’ (MSA قلب qalb) four spellings: قلب qalb /qalb/, ألب >alb /’alb/, كلب kalb 
/kalb/, and جلب galb /galb/ 
• Shortening of long vowels 
– E قانون qAnuwn ‘law’ (MSA /qa:nu:n/), shortened first vowel قنون qanuwn 
/qanu:n/ 
• PAL has some clitics that do not exist in MSA 
– e.g. the PAL future particle ح/Ha/ 
• Morphemes with additional forms or pronunciations 
– e.g. ال Al /il/ which has a non-MSA/non-EGY allomorph /li/ 
– E لبلاد /libla:d/ ‘the homeland/countries’ can be spelled to reflect the morphology 
as البلاد AlblAd or the phonology لبلاد lblAd, with the latter being ambiguous with 
‘for countries’ 
• Words in PAL that have no cognate in MSA 
– e.g. the word /barDo/ ‘additionally’ is spontaneously written as برضو brDw, برضه 
brDh and برضة brDp.
CODA: A Conventional Orthography for 
Dialectal Arabic 
• Habash et al., (2012) developed CODA for 
computational processing purposes primarily 
• Objectives 
– CODA covers all DAs, minimizing differences in 
choices 
– CODA is easy to learn and produce consistently 
– CODA is intuitive to readers unfamiliar with it 
– CODA uses Arabic script 
• Inspired by previous efforts from the LDC and 
linguistic studies 
• Guidelines created for Egyptian and Tunisian so far 
17
CODA Examples 
18 
Phenomenon Original CODA 
Spelling Errors 
Typos 
Speech effects 
Merges 
Splits 
الاجابه 
شبب 
كبيييييييير 
اليومبريستيج 
المع روف 
الإجابة 
سبب 
كبير 
اليوم بريستيج 
المعروف 
MSA Root Cognate قلب آلب، كلب 
Dialectal Clitic 
Guidelines 
علبيت 
مشفناش 
عالبيت 
ما شافناش 
Unique Dialect Words برضه بردو، برضو
Challenges: 
Conventional Orthography for PAL 
• Our challenge was to take CODA guidelines (specifically the EGY 
version) (Habash et al., 2012) and extend them. 
– In terms of phonology-orthography 
• We added the letter كk to the list of root letters to be spelled in the MSA 
cognate to cover the PAL rural sub-dialects that pronounces it as /t$/. 
– In terms of morphology 
• We added the non-EGY demonstrative proclitic هha and the conjunction 
proclitic تta ‘so as to’ to the list of clitics, e.g., بهالبيت bhAlbyt ‘in this 
house’ and تيشوف ty$wf ‘so that he can see’. 
– We extended the list of exceptional words to cover problematic 
PAL words.
Pilot study (I) 
• We considered 1,000 words from 77 tweets in Curras. 
The CODA version of each word was created in context. 
Results: 
– 15.9% of all words had a different CODA form from the 
input raw word form. 
• 42% of these changes involve consonants. 
• 20% vowels and the hamzated/bare forms of the letter Alif اA. 
• 29% word changes involve the spelling of specific morpheme. 
• 18% of the changed words experience a split or a merge (with 
splits happening five time more than merges). 
• Only about 8% of the changed words were PAL specific terms. 
• less than 7% involved a typo or speech effect elongation.
Challenges: 
Morphological Annotation Process 
• Manually annotating words can take a lot of time. 
• To make the process faster and easier we decided 
to use an existing morphological analyzer for 
MSA or EGY to create PAL annotations. 
• To validate the use of the morphological analyzer 
we conducted a pilot study
Pilot study (II) 
• We ran the words from a randomly selected episode of 
the PAL TV show “Watan Aa Watar” (460 words) 
through both MADAMIRA-MSA and MADAMIRA-EGY. 
• We analyzed the output from both systems to 
determine its usability for PAL annotations. 
• The results of this experiment are summarized below: 
Accuracy of automatic annotation of PAL text 
Statistics MADAMIRA MSA MADAMIRA EGY 
No Analysis 17.78% 7.24% 
Wrongly Analyzed 18.43% 14.75% 
Correctly Analyzed 63.79% 78.01%
RAW CODA Lemma BuckWalter POS (Diacritized) 
Glos 
s 
Status 
MADAMIRA 
MSA EGY 
ابوكوا 1 
Abwkw 
A 
ابوكو Abwkw Abuw 
Abw/NOUN+ 
kw/POSS_PRON_3MS 
father MADA NA 
Correc 
t 
الاكل 2 AlAkl الأكل Al>kl >akl Al/DET+>kl/NOUN eating MADA Usable 
Correc 
t 
لبنوك 3 lbnwk البنوك Albnwk bank Al/DET+bnwk/NOUN bank ANNO Wrong Wrong 
التاني 4 AltAny الثاني AlvAny vAniy Al/DET+vAny/ADJ_NUM second ANNO Wrong Usable 
5 
الحما 
ر 
AlHmAr الحمار AlHmAr HmAr Al/DET+HmAr/NOUN donkey MADA UsableUsable 
6 
للرات 
ب 
llrAtb للراتب llrAtb rAtib l/PREP+Al/DET+rAtb/NOUN salary MADA Usable 
Correc 
t 
ايوة 7 Aywp أيوه >ywh >ayowah >ywh/INTERJ yes MADA NA 
Correc 
t 
بدها 8 bdhA بدها bdhA bid~ 
bd/NOUN+ 
hA/POSS_PRON_3FS 
want ANNO Wrong Wrong 
9 
بنردل 
ك 
bnrdlk 
بنرد_ل 
ك 
bnrd_lk rad~ 
b/PROG_PART+n/IV1P 
+rd/IV+l/PREP+k/PRON_2MS 
answer MADA NA Usable 
1 
0 
هدول hdwl هذول h*wl ha*A h*wl/DEM_PRON these ANNO NA NA
Conclusion and Future Work 
• We presented preliminary results on the potential, and 
limitations, of using existing resources, specially MADAMIRA 
EGY, to semi-automate and speed up the annotation process 
of a Palestinian Arabic corpus. 
• Several issues need to be considered and researched further: 
– The development of Palestinian-specific morphological annotation and 
CODA guidelines. 
– The development of a Palestinian lexicon. 
– The extension of MADAMIRA to analyze Palestinian text. 
• Corpus will be further extended to include more text, and all 
lexical annotations will be linked with an Arabic ontology 
• The annotated corpus will be used as part of developing a 
Palestinian Arabic morphological analyzer 
• We plan to extend the work to other dialects
Questions?
References 
• H. Abo Bakr, K. Shaalan, and I. Ziedan. A Hybrid Approach for Converting Written 
Egyptian Colloquial Dialect into Diacritized Arabic. In The 6th International 
Conference on Informatics and Systems, INFOS2008. Cairo University. 
• R. Al-Sabbagh and R. Girju. YADAC: Yet another dialectal Arabic corpus. In Proc. of 
the Language Resources and Evaluation Conference (LREC), pages 2882–2889, 
Istanbul, 2012. 
• T. Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0. LDC 
catalog number LDC2004L02, ISBN 1-58563-324-0 
• M. Diab, N. Habash, O. Rambow, M. Altantawy, and Y. Benajiba. COLABA: Arabic 
Dialect Annotation and Processing. LREC Workshop on Semitic Language 
Processing, Malta, 2010 
• H. Gadalla, H. Kilany, H. Arram, A. Yacoub, A. El-Habashi, A. Shalaby, K. Karins, E. 
Rowson, R. MacIntyre, P. Kingsbury, D. Graff, and C. McLemore. 1997. CALLHOME 
Egyptian Arabic Transcripts. Linguistic Data Consortium, Catalog No.: LDC97T19. 
• D. Graff, M. Maamouri, B. Bouziri, S. Krouna, S. Kulick, and T. Buckwalter. 2009. 
Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data 
Consortium LDC2009E73. 
• N. Habash, M. Diab, and O. Rabmow. (2012a) Conventional Orthography for 
Dialectal Arabic. In Proc. of LREC, Istanbul, Turkey, 2012.
• N. Habash, R. Eskander, and A. Hawwari. (2012b) A Morphological Analyzer for 
Egyptian Arabic. In Proc. of the Special Interest Group on Computational 
Morphology and Phonology, Montréal, Canada, 2012. 
• N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological 
Analysis and Disambiguation for Dialectal Arabic. In Proc. of NAACL, Atlanta, 
Georgia, 2013. 
• R. Herzallah. (1990). Aspects of Palestinian Arabic Phonology: A Nonlinear 
Approach. Ph.D. thesis, Cornell University. Distributed as Working Papers of the 
Cornell Phonetics Laboratory No. 4. 
• H. Kilany, H. Gadalla, H. Arram, A. Yacoub, A. El-Habashi, and C. McLemore. 
Egyptian Colloquial Arabic Lexicon. Linguistic Data Consortium, Catalog No.: 
LDC99L22. 
• M. Maamouri, A. Bies, T. Buckwalter, M. Diab, N. Habash, O. Rambow, and 
D. Tabessi. Developing and using a pilot dialectal Arabic treebank. In Proc. of LREC, 
Genoa, Italy, 2006. 
• A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. 
Pooleery, O. Rambow and R. M. Roth. MADAMIRA: A Fast, Comprehensive Tool for 
Morphological Analysis and Disambiguation of Arabic. In Proc. of LREC, Reykjavik, 
Iceland, 2014. 
• W. Salloum and N. Habash. Dialectal to Standard Arabic Paraphrasing to Improve 
Arabic-English Statistical Machine Translation. In Proc. of the First Workshop on 
Algorithms and Resources for Modeling of Dialects and Language Varieties, 
Edinburgh, Scotland, 2011. 
• I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze Khmekhem, L. Hadrich Belguith, 
and N. Habash. A Conventional Orthography for Tunisian Arabic. In Proc. of LREC, 
Reykjavik, Iceland, 2014.

More Related Content

Similar to Presentation curras paper-emnlp2014-final

Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira PackerTeaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
spacke
 
Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"
Pascual Pérez-Paredes
 
2021-0509_JAECS2021_Spring
2021-0509_JAECS2021_Spring2021-0509_JAECS2021_Spring
2021-0509_JAECS2021_Spring
Mizumoto Atsushi
 
Tawasol symbols project overview
Tawasol symbols project overviewTawasol symbols project overview
Tawasol symbols project overview
E.A. Draffan
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
iwan_rg
 
Corpus based translation Studies
Corpus based translation StudiesCorpus based translation Studies
Corpus based translation Studies
Habib Ali
 
The Common European Framework of Reference for Arabic Language Teaching & Lea...
The Common European Framework of Reference for Arabic Language Teaching & Lea...The Common European Framework of Reference for Arabic Language Teaching & Lea...
The Common European Framework of Reference for Arabic Language Teaching & Lea...
mlang-events
 
Arabic words stemming approach using arabic wordnet
Arabic words stemming approach using arabic wordnetArabic words stemming approach using arabic wordnet
Arabic words stemming approach using arabic wordnet
IJDKP
 
Corpora analysis bruno natalia sarah
Corpora analysis   bruno natalia sarahCorpora analysis   bruno natalia sarah
Corpora analysis bruno natalia sarah
Bruno Bruno Bruno Bruno
 
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
SignWriting For Sign Languages
 
Not just for reference: Dictionaries and corpora as language acquisition tools
Not just for reference: Dictionaries and corpora as language acquisition toolsNot just for reference: Dictionaries and corpora as language acquisition tools
Not just for reference: Dictionaries and corpora as language acquisition tools
Michael Brown
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
Hemantha Kulathilake
 
Exploring the effects of stemming on
Exploring the effects of stemming onExploring the effects of stemming on
Exploring the effects of stemming on
ijaia
 
telling tails for developing our materials
telling tails for developing our materialstelling tails for developing our materials
telling tails for developing our materials
NezrinMemmedzade1
 
Chapter one for presentation( Course curriculum development in Language Teach...
Chapter one for presentation( Course curriculum development in Language Teach...Chapter one for presentation( Course curriculum development in Language Teach...
Chapter one for presentation( Course curriculum development in Language Teach...
louth sran
 
169242324 hausa-language-study-nysc
169242324 hausa-language-study-nysc169242324 hausa-language-study-nysc
169242324 hausa-language-study-nysc
homeworkping8
 
MoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingMoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingHend Al-Khalifa
 
A New Approach to Romanize Arabic Words
A New Approach to Romanize Arabic WordsA New Approach to Romanize Arabic Words
A New Approach to Romanize Arabic Words
IJERA Editor
 
Arabic language presentation 01
Arabic language presentation 01Arabic language presentation 01
Arabic language presentation 01Mohammed Attia
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
Alex Curtis
 

Similar to Presentation curras paper-emnlp2014-final (20)

Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira PackerTeaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packer
 
Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"
 
2021-0509_JAECS2021_Spring
2021-0509_JAECS2021_Spring2021-0509_JAECS2021_Spring
2021-0509_JAECS2021_Spring
 
Tawasol symbols project overview
Tawasol symbols project overviewTawasol symbols project overview
Tawasol symbols project overview
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
 
Corpus based translation Studies
Corpus based translation StudiesCorpus based translation Studies
Corpus based translation Studies
 
The Common European Framework of Reference for Arabic Language Teaching & Lea...
The Common European Framework of Reference for Arabic Language Teaching & Lea...The Common European Framework of Reference for Arabic Language Teaching & Lea...
The Common European Framework of Reference for Arabic Language Teaching & Lea...
 
Arabic words stemming approach using arabic wordnet
Arabic words stemming approach using arabic wordnetArabic words stemming approach using arabic wordnet
Arabic words stemming approach using arabic wordnet
 
Corpora analysis bruno natalia sarah
Corpora analysis   bruno natalia sarahCorpora analysis   bruno natalia sarah
Corpora analysis bruno natalia sarah
 
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvia...
 
Not just for reference: Dictionaries and corpora as language acquisition tools
Not just for reference: Dictionaries and corpora as language acquisition toolsNot just for reference: Dictionaries and corpora as language acquisition tools
Not just for reference: Dictionaries and corpora as language acquisition tools
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
 
Exploring the effects of stemming on
Exploring the effects of stemming onExploring the effects of stemming on
Exploring the effects of stemming on
 
telling tails for developing our materials
telling tails for developing our materialstelling tails for developing our materials
telling tails for developing our materials
 
Chapter one for presentation( Course curriculum development in Language Teach...
Chapter one for presentation( Course curriculum development in Language Teach...Chapter one for presentation( Course curriculum development in Language Teach...
Chapter one for presentation( Course curriculum development in Language Teach...
 
169242324 hausa-language-study-nysc
169242324 hausa-language-study-nysc169242324 hausa-language-study-nysc
169242324 hausa-language-study-nysc
 
MoM2010: Arabic natural language processing
MoM2010: Arabic natural language processingMoM2010: Arabic natural language processing
MoM2010: Arabic natural language processing
 
A New Approach to Romanize Arabic Words
A New Approach to Romanize Arabic WordsA New Approach to Romanize Arabic Words
A New Approach to Romanize Arabic Words
 
Arabic language presentation 01
Arabic language presentation 01Arabic language presentation 01
Arabic language presentation 01
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
 

More from Mustafa Jarrar

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment Analysis
Mustafa Jarrar
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal Ontology
Mustafa Jarrar
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course Outline
Mustafa Jarrar
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process Implementation
Mustafa Jarrar
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineering
Mustafa Jarrar
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical Constructs
Mustafa Jarrar
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs
Mustafa Jarrar
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process Management
Mustafa Jarrar
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology
Mustafa Jarrar
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion Rules
Mustafa Jarrar
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORM
Mustafa Jarrar
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in Palestine
Mustafa Jarrar
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online Courses
Mustafa Jarrar
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 Calls
Mustafa Jarrar
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingMustafa Jarrar
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Mustafa Jarrar
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Mustafa Jarrar
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql ProjectMustafa Jarrar
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
Mustafa Jarrar
 

More from Mustafa Jarrar (20)

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment Analysis
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal Ontology
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course Outline
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process Implementation
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineering
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical Constructs
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process Management
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion Rules
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORM
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in Palestine
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online Courses
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 Calls
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language Processing
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

Presentation curras paper-emnlp2014-final

  • 1. Workshop on Arabic Natural Language Processing EMNLP 2014, Doha Building a Corpus for Palestinian Arabic - a Preliminary Study Mustafa Jarrar, *Nizar Habash, Diyam Akra, Nasser Zalmout Birzeit University, Palestine *New York University Abu Dhabi
  • 2. Reference: Mustafa Jarrar, Nizar Habash, Diyam Akra, Nasser Zalmout: Building A Corpus For Palestinian Arabic: A Preliminary Study. In proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing. Association for Computational Linguistics (ACL), Pages (18-27). October 25, 2014, Doha, Qatar. ISBN: 978-1- 937284-96-1 Download Article http://www.jarrar.info/publications/JHAZ14.pdf Watch the Presentation https://www.youtube.com/watch?v=Kw3R3DQVc8E
  • 3. Acknowledgment • This work is part of the ongoing project called Curras, funded by the Research Council - Palestinian Ministry of Higher Education. • Team: – Mustafa Jarrar (project investigator) – Nizar Habash – Mahdi Arar – Diyam Fuad – Faeq Alrimawi • Collaborators: Owen Rambow, Faisal Al-Shargi, and Ramy Eskander http://sina.birzeit.edu/projects/
  • 4. Agenda • Introduction • Distinguishing features of Palestinian Arabic • Related work • Building the Curras Corpus – Corpus Collection – Annotation process • Challenges – A Conventional Orthography for Palestinian Arabic – Morphological Annotation Process challenges • Conclusion and Future Work
  • 5. Introduction • Most Arabic NLP tools and resources were developed to serve Modern Standard Arabic (MSA) – Dialectal phonological and morphological variations from MSA – No standard orthography for Arabic Dialects • Exponential growth of socially generated dialectal content • Our goal is to build a high-coverage and well-annotated corpus of the Palestinian Arabic dialect (PAL) – Curras Corpus of Palestinian Arabic – Collect words from different resources (43K words) – Annotate using a morphological analyzer tool (MADAMIRA) – Manually correct annotations • First step toward developing NLP applications for PAL • Insights and extensions to other Arabic dialects
  • 6. Distinguishing Features of PAL • Palestinian Arabic is a Southern Levantine dialect of Arabic – The points presented here focus on difference from MSA – Some are shared with other dialects • Phonology – Pronunciation of the /q/ phoneme (MSA (ق • Urban /’/, rural /k/, Bedouin /g/, and Druze /q/ – MSA diphthongs /ay/ and /aw/ generally become /e:/ and /o:/ – PAL elides many short vowels that appear in the MSA cognates • e.g. MSA جبال /jiba:l/ ‘mountains’ becomes /jba:l/ – Insertion of epenthetic vowels (Herzallah, 1990) • e.g. /kalb/ and /kalib/ ( كلب ‘dog’)
  • 7. Distinguishing Features of PAL • Morphology – PAL in most of its sub-dialects collapses the feminine and masculine plurals and duals in verbs and most nouns, as well as other forms: • e.g. حبيت /Habbe:t/ ‘I (or you m.s.) loved’ – PAL has many clitics that do not exist in MSA • The progressive particle /b+/ (as in /b+tuktub/ ‘you write’) • The demonstrative particle /ha+/ (as in /ha+l+be:t/ ‘this house’) • The negation cirmcumclitic /ma+ +š/ (as in /ma+katab+$/ ‘he did not write’) • Indirect object clitic (as in /ma+katab+l+o:+š/ ‘he did not write to him’)
  • 8. Distinguishing Features of PAL • Lexicon – Some common PAL words are portmanteaus of MSA words • /biddi/ ‘I want’ is from MSA /bi+widd+i/ ‘in my desire’ • /le:š/ ‘why’ corresponds to MSA /li+’ayyi šay’in/ ‘for what thing’ – Borrowed words from other languages: • E كندرة /kundara/ ‘shoe’ (Turkish) • E محسوم /maHsu:m/ ‘checkpoint’ (Hebrew)
  • 9. Related Work • Dialectal Corpus Collection – Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany, et al., 2002) – Maamouri et al. (2006) developed a pilot Levantine Arabic Treebank (LATB) – LDC BOLT resources for Egyptian Arabic (ATB-ARZ) – Others: YADAC (Al-Sabbagh and Girju, 2012), COLABA project (Diab et al., 2010) • Dialectal Orthography – Habash et al (2012a) proposed the so-called CODA (conventional orthography of dialectal Arabic) for the purpose of developing computational models of Arabic dialects in general • Dialectal Morphological Annotation – CALIMA-Egyptian (Habash et al., 2012b) – MADAMIRA (MSA/Egyptian) (Pasha et al., 2014)
  • 10. Agenda • Introduction • Distinguishing features of Palestinian Arabic • Related work • Building the Curras Corpus – Corpus Collection – Annotation process • Challenges – A Conventional Orthography for Palestinian Arabic – Morphological Annotation Process challenges • Conclusion and Future Work
  • 11. Building the Curras Corpus: Corpus Collection • Challenges – Scarce resources in terms of written literature • Traditional and informal contexts, such as conversations in TV series, movies, or on social media platforms – Orthographic inconsistency • Mixing of phonetic spelling and MSA-cognate-based spelling • Arabizi spelling • In this stage we focus on precision and variety rather than mere size – We tried not only to manually select and review the content of the corpus, but also to assure a variety of topics and contexts, localities and sub-dialects.
  • 12. Manually Collected Resources The Curras Corpus Statistics Document Type Word Tokens Word Types Documents Facebook 3,120 1,985 35 threads Twitter 3,541 2,133 38 threads Blogs 8,748 4,454 37 threads Forums 1,092 798 33 threads Palestinian Stories 2,407 1,422 6 stories Palestinian Terms 759 556 1 doc TV Show: وطن ع وتر Watan Aa Watar 23,423 8,459 41 episodes Curras Total 43,090 19,807 191
  • 13. Building the Curras Corpus: Corpus Annotation • We define the annotation of a word as a tuple <w, wB c, cB, l, pB, g, i> W Raw (Unicode) The raw input word wB Raw (Buckwalter) The same raw input word in Buckwalter transliteration C CODA (Unicode) The Conventional Orthography (Habash et al., 2012) version of the input word cB CODA (Buckwalter) The Buckwalter transliteration of the CODA form l Lemma The lemma of the word in Buckwalter transliteration pB The Full Buckwalter POS Tag g Gloss (English) i Analysis Source of annotation, e.g., ANNO is a human annotator, and MADA is the MADAMIRA system with some minor or no automatic post-processing
  • 14. Building the Curras Corpus: Corpus Annotation • We exploit existing tools to speed up the annotation process. We specifically use the MADAMIRA tool (Pasha et al., 2014) for morphological analysis and disambiguation of MSA and EGY. • A manual step is then needed to verify every annotation, to correct errors and fill in gaps. • We made one major simplification to the annotations to minimize the load on the human annotator. – We do not produce diacritized morphological analyses in the Buckwalter POS tag.
  • 15. Agenda • Introduction • Distinguishing features of Palestinian Arabic • Related work • Building the Curras Corpus – Corpus Collection – Annotation process • Challenges – A Conventional Orthography for Palestinian Arabic – Morphological Annotation Process challenges • Two pilot studies • Conclusion and Future Work
  • 16. Challenges: Conventional Orthography for PAL • Inconsistent spellings – e.g. ‘heart’ (MSA قلب qalb) four spellings: قلب qalb /qalb/, ألب >alb /’alb/, كلب kalb /kalb/, and جلب galb /galb/ • Shortening of long vowels – E قانون qAnuwn ‘law’ (MSA /qa:nu:n/), shortened first vowel قنون qanuwn /qanu:n/ • PAL has some clitics that do not exist in MSA – e.g. the PAL future particle ح/Ha/ • Morphemes with additional forms or pronunciations – e.g. ال Al /il/ which has a non-MSA/non-EGY allomorph /li/ – E لبلاد /libla:d/ ‘the homeland/countries’ can be spelled to reflect the morphology as البلاد AlblAd or the phonology لبلاد lblAd, with the latter being ambiguous with ‘for countries’ • Words in PAL that have no cognate in MSA – e.g. the word /barDo/ ‘additionally’ is spontaneously written as برضو brDw, برضه brDh and برضة brDp.
  • 17. CODA: A Conventional Orthography for Dialectal Arabic • Habash et al., (2012) developed CODA for computational processing purposes primarily • Objectives – CODA covers all DAs, minimizing differences in choices – CODA is easy to learn and produce consistently – CODA is intuitive to readers unfamiliar with it – CODA uses Arabic script • Inspired by previous efforts from the LDC and linguistic studies • Guidelines created for Egyptian and Tunisian so far 17
  • 18. CODA Examples 18 Phenomenon Original CODA Spelling Errors Typos Speech effects Merges Splits الاجابه شبب كبيييييييير اليومبريستيج المع روف الإجابة سبب كبير اليوم بريستيج المعروف MSA Root Cognate قلب آلب، كلب Dialectal Clitic Guidelines علبيت مشفناش عالبيت ما شافناش Unique Dialect Words برضه بردو، برضو
  • 19. Challenges: Conventional Orthography for PAL • Our challenge was to take CODA guidelines (specifically the EGY version) (Habash et al., 2012) and extend them. – In terms of phonology-orthography • We added the letter كk to the list of root letters to be spelled in the MSA cognate to cover the PAL rural sub-dialects that pronounces it as /t$/. – In terms of morphology • We added the non-EGY demonstrative proclitic هha and the conjunction proclitic تta ‘so as to’ to the list of clitics, e.g., بهالبيت bhAlbyt ‘in this house’ and تيشوف ty$wf ‘so that he can see’. – We extended the list of exceptional words to cover problematic PAL words.
  • 20. Pilot study (I) • We considered 1,000 words from 77 tweets in Curras. The CODA version of each word was created in context. Results: – 15.9% of all words had a different CODA form from the input raw word form. • 42% of these changes involve consonants. • 20% vowels and the hamzated/bare forms of the letter Alif اA. • 29% word changes involve the spelling of specific morpheme. • 18% of the changed words experience a split or a merge (with splits happening five time more than merges). • Only about 8% of the changed words were PAL specific terms. • less than 7% involved a typo or speech effect elongation.
  • 21. Challenges: Morphological Annotation Process • Manually annotating words can take a lot of time. • To make the process faster and easier we decided to use an existing morphological analyzer for MSA or EGY to create PAL annotations. • To validate the use of the morphological analyzer we conducted a pilot study
  • 22. Pilot study (II) • We ran the words from a randomly selected episode of the PAL TV show “Watan Aa Watar” (460 words) through both MADAMIRA-MSA and MADAMIRA-EGY. • We analyzed the output from both systems to determine its usability for PAL annotations. • The results of this experiment are summarized below: Accuracy of automatic annotation of PAL text Statistics MADAMIRA MSA MADAMIRA EGY No Analysis 17.78% 7.24% Wrongly Analyzed 18.43% 14.75% Correctly Analyzed 63.79% 78.01%
  • 23. RAW CODA Lemma BuckWalter POS (Diacritized) Glos s Status MADAMIRA MSA EGY ابوكوا 1 Abwkw A ابوكو Abwkw Abuw Abw/NOUN+ kw/POSS_PRON_3MS father MADA NA Correc t الاكل 2 AlAkl الأكل Al>kl >akl Al/DET+>kl/NOUN eating MADA Usable Correc t لبنوك 3 lbnwk البنوك Albnwk bank Al/DET+bnwk/NOUN bank ANNO Wrong Wrong التاني 4 AltAny الثاني AlvAny vAniy Al/DET+vAny/ADJ_NUM second ANNO Wrong Usable 5 الحما ر AlHmAr الحمار AlHmAr HmAr Al/DET+HmAr/NOUN donkey MADA UsableUsable 6 للرات ب llrAtb للراتب llrAtb rAtib l/PREP+Al/DET+rAtb/NOUN salary MADA Usable Correc t ايوة 7 Aywp أيوه >ywh >ayowah >ywh/INTERJ yes MADA NA Correc t بدها 8 bdhA بدها bdhA bid~ bd/NOUN+ hA/POSS_PRON_3FS want ANNO Wrong Wrong 9 بنردل ك bnrdlk بنرد_ل ك bnrd_lk rad~ b/PROG_PART+n/IV1P +rd/IV+l/PREP+k/PRON_2MS answer MADA NA Usable 1 0 هدول hdwl هذول h*wl ha*A h*wl/DEM_PRON these ANNO NA NA
  • 24. Conclusion and Future Work • We presented preliminary results on the potential, and limitations, of using existing resources, specially MADAMIRA EGY, to semi-automate and speed up the annotation process of a Palestinian Arabic corpus. • Several issues need to be considered and researched further: – The development of Palestinian-specific morphological annotation and CODA guidelines. – The development of a Palestinian lexicon. – The extension of MADAMIRA to analyze Palestinian text. • Corpus will be further extended to include more text, and all lexical annotations will be linked with an Arabic ontology • The annotated corpus will be used as part of developing a Palestinian Arabic morphological analyzer • We plan to extend the work to other dialects
  • 26. References • H. Abo Bakr, K. Shaalan, and I. Ziedan. A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic. In The 6th International Conference on Informatics and Systems, INFOS2008. Cairo University. • R. Al-Sabbagh and R. Girju. YADAC: Yet another dialectal Arabic corpus. In Proc. of the Language Resources and Evaluation Conference (LREC), pages 2882–2889, Istanbul, 2012. • T. Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0 • M. Diab, N. Habash, O. Rambow, M. Altantawy, and Y. Benajiba. COLABA: Arabic Dialect Annotation and Processing. LREC Workshop on Semitic Language Processing, Malta, 2010 • H. Gadalla, H. Kilany, H. Arram, A. Yacoub, A. El-Habashi, A. Shalaby, K. Karins, E. Rowson, R. MacIntyre, P. Kingsbury, D. Graff, and C. McLemore. 1997. CALLHOME Egyptian Arabic Transcripts. Linguistic Data Consortium, Catalog No.: LDC97T19. • D. Graff, M. Maamouri, B. Bouziri, S. Krouna, S. Kulick, and T. Buckwalter. 2009. Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data Consortium LDC2009E73. • N. Habash, M. Diab, and O. Rabmow. (2012a) Conventional Orthography for Dialectal Arabic. In Proc. of LREC, Istanbul, Turkey, 2012.
  • 27. • N. Habash, R. Eskander, and A. Hawwari. (2012b) A Morphological Analyzer for Egyptian Arabic. In Proc. of the Special Interest Group on Computational Morphology and Phonology, Montréal, Canada, 2012. • N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological Analysis and Disambiguation for Dialectal Arabic. In Proc. of NAACL, Atlanta, Georgia, 2013. • R. Herzallah. (1990). Aspects of Palestinian Arabic Phonology: A Nonlinear Approach. Ph.D. thesis, Cornell University. Distributed as Working Papers of the Cornell Phonetics Laboratory No. 4. • H. Kilany, H. Gadalla, H. Arram, A. Yacoub, A. El-Habashi, and C. McLemore. Egyptian Colloquial Arabic Lexicon. Linguistic Data Consortium, Catalog No.: LDC99L22. • M. Maamouri, A. Bies, T. Buckwalter, M. Diab, N. Habash, O. Rambow, and D. Tabessi. Developing and using a pilot dialectal Arabic treebank. In Proc. of LREC, Genoa, Italy, 2006. • A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow and R. M. Roth. MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proc. of LREC, Reykjavik, Iceland, 2014. • W. Salloum and N. Habash. Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation. In Proc. of the First Workshop on Algorithms and Resources for Modeling of Dialects and Language Varieties, Edinburgh, Scotland, 2011. • I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze Khmekhem, L. Hadrich Belguith, and N. Habash. A Conventional Orthography for Tunisian Arabic. In Proc. of LREC, Reykjavik, Iceland, 2014.

Editor's Notes

  1. What?1 Why?2-3 For what?4 How?5
  2. Example heart: qalb
  3. Example heart: qalb
  4. Green: referenced in slides