Mustafa Jarrar, Nizar Habash, Diyam Akra, Nasser Zalmout: Building A Corpus For Palestinian Arabic: A Preliminary Study. In proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing. Association for Computational Linguistics (ACL), Pages (18-27). October 25, 2014, Doha, Qatar. ISBN: 978-1-937284-96-1
An engaging workshop intended to showcase community efforts to implement LGR Procedure for current and potential Generation Panel members. The workshop will also discuss how Generation Panels of related scripts should coordinate with each other going forward.
Core vocabularies for Bilingual Language Learning and Literacy Skill building...E.A. Draffan
How using symbols in the Arabic and English language not only need to take into account personalisation and localisation issues but also linguistic issues and the core vocabularies needed to encourage literacy skills for those with communication, reading and writing difficulties and language learning.
An engaging workshop intended to showcase community efforts to implement LGR Procedure for current and potential Generation Panel members. The workshop will also discuss how Generation Panels of related scripts should coordinate with each other going forward.
Core vocabularies for Bilingual Language Learning and Literacy Skill building...E.A. Draffan
How using symbols in the Arabic and English language not only need to take into account personalisation and localisation issues but also linguistic issues and the core vocabularies needed to encourage literacy skills for those with communication, reading and writing difficulties and language learning.
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packerspacke
Teaching English to Arabic Speakers: Cultural and Linguistic Considerations
In the past few years, more Arabic speakers have come to Canada to learn English than ever before. The workshop aims to present cultural and linguistic information that is useful to English teachers of native Arabic-speaking learners. Participants will learn how to anticipate challenges with regards to teaching grammar, pronunciation, literacy, and critical thinking skills to native Arabic speakers.
JAECS 2021 Spring Symposium
Corpus Tools and Statistical Methods (TASM) SIG
Revisiting What Counts as a Word: The development of New Word Level Checker
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...iwan_rg
By:
Nizar Habash
Abstract
The Arabic language consists of a number of variants among which Modern Standard Arabic (MSA) has a special status as the formal, mostly written, standard of the media, culture and education across the Arab World. The other variants are informal, mostly spoken, dialects that are the languages of communication of daily life. Most of the natural language processing resources and research in Arabic have focused on MSA. However, recently, more and more research is targeting Arabic dialects. In this talk, we present the main challenges of processing Arabic dialects, and discuss common solution paradigms, current advances, and future directions.
CBTS is considered a new paradigm in the discipline of Translation Studies. it is also considered a new methodology , which based is on Corpus linguistics and DTS......
Arabic words stemming approach using arabic wordnetIJDKP
The big growth of the Arabic internet content in the last years has raised up the need for an effective
stemming techniques for Arabic language. Arabic stemming algorithms can be ranked, according to three
category, as root-based approach (ex. Khoja); stem-based approach (ex. Larkey); and statistical approach
(ex. N-Garm). However, no stemming of this language is perfect: The existing stemmers have a low
efficiency. In this paper, we introduce a new stemming technique for Arabic words that also solve the
problem of the plural form of irregular nouns in Arabic language, which called broken plural. The
proposed stem extractor provides very accurate results in comparisons with other algorithms.
Consequently the search effectiveness improved.
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvian Sign Language (LSP) Dictionary" by Miguel Rodríguez Mondoñedo. This talk will discuss the possibilities of using SW as a tool to represent lexical entries in a Peruvian Sign Language Dictionary (Diccionario Anotado de Lengua de Señas Peruana, LSP). Currently, no published LSP dictionary exists; in fact, there is almost no research on LSP, and very few systematically gathered data on the language. We plan to elaborate the first annotated dictionary on LSP (DALSP). The DALSP will have the following characteristics for each lexical entry: (i) A recorded video of the sign in citation form (provided by native speakers), (ii) a gloss in Spanish, (iii) a translation to Spanish, (iv) a morphosyntactic description, (v) a phonological description, (vi) a recorded video with a sample sentence in LSP containing the sign (provided by native speakers), (vii) gloss and translation for the LSP sample sentence, (viii) a SignWriting transcription for the sign and the sample sentence. It will be published online. http://www.signwriting.org/symposium/presentation0063.html
Stemming is the process of reducing words to their stems or roots. Due to the morphological richness and
complexity of the Arabic language, stemming is an essential part of most Natural Language Processing
(NLP) tasks for this language. In this paper, we study the impact of different stemming approaches on the
Named Entity Recognition (NER) task for Arabic and explore the merits, limitations and differences
between light stemming and root-extraction methods. Our experiments are evaluated on the standard
ANERCorp dataset as well as the AQMAR Arabic Wikipedia Named Entity Corpus.
A New Approach to Romanize Arabic WordsIJERA Editor
Romanization of Arabic words has been acquired the interest of the researchers due to its importance in many
fields such as security and terrorism fighting, translation, religious purposes, etc.
In this paper, a proposed method was presented to solve the drawbacks of available methods such as lack of
reverse recognition, using of extra letters and punctuation characters, and neglecting the correlation of the letters
in a word.
This method was implemented and tested using a sample of 100 undergraduate Iraqi students and 150 Arabic
words which romanized using five well-known methods in addition to the proposed one. The test showed that
the proposed method dominants the rest method from the recognition and reverse recognition process in
considerable ratio.
Clustering Arabic Tweets for Sentiment AnalysisMustafa Jarrar
Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar: Clustering Arabic Tweets for Sentiment Analysis. Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. IEEE Computer Society. DOI 10.1109/AICCSA.2017.162
Classifying Processes and Basic Formal OntologyMustafa Jarrar
pdf http://www.jarrar.info/publications/JC17.pdf
Mustafa Jarrar and Werner Ceusters
ABSTRACT
Unlike what is the case for physical entities and other types of continuants, few process ontologies exist. This is not only because processes received less attention in the research community, but also because classifying them is challenging. Moreover, upper level categories or classification criteria to help in modelling and integrating lower level process ontologies have thus far not been developed or widely adopted. This paper proposes a basis for further classifying processes in the Basic Formal Ontology. The work is inspired by the aspectual characteristics of verbs such as homeomericity, cumulativity, telicity, atomicity, instantaneity and durativity. But whereas these characteristics have been proposed by linguists and philosophers of language from a linguistic perspective with a focus on how matters are described, our focus is on what is the case in reality thus providing an ontological perspective. This was achieved by first investigating the applicability of these characteristics to the top-level processes in the Gene Ontology, and then, where possible, deriving from the linguistic perspective relationships that are faithful to the ontological principles adhered to by the Basic Formal Ontology.
More Related Content
Similar to Presentation curras paper-emnlp2014-final
Teaching Arabic Speakers: Linguistic and Cultural Considerations, Shira Packerspacke
Teaching English to Arabic Speakers: Cultural and Linguistic Considerations
In the past few years, more Arabic speakers have come to Canada to learn English than ever before. The workshop aims to present cultural and linguistic information that is useful to English teachers of native Arabic-speaking learners. Participants will learn how to anticipate challenges with regards to teaching grammar, pronunciation, literacy, and critical thinking skills to native Arabic speakers.
JAECS 2021 Spring Symposium
Corpus Tools and Statistical Methods (TASM) SIG
Revisiting What Counts as a Word: The development of New Word Level Checker
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...iwan_rg
By:
Nizar Habash
Abstract
The Arabic language consists of a number of variants among which Modern Standard Arabic (MSA) has a special status as the formal, mostly written, standard of the media, culture and education across the Arab World. The other variants are informal, mostly spoken, dialects that are the languages of communication of daily life. Most of the natural language processing resources and research in Arabic have focused on MSA. However, recently, more and more research is targeting Arabic dialects. In this talk, we present the main challenges of processing Arabic dialects, and discuss common solution paradigms, current advances, and future directions.
CBTS is considered a new paradigm in the discipline of Translation Studies. it is also considered a new methodology , which based is on Corpus linguistics and DTS......
Arabic words stemming approach using arabic wordnetIJDKP
The big growth of the Arabic internet content in the last years has raised up the need for an effective
stemming techniques for Arabic language. Arabic stemming algorithms can be ranked, according to three
category, as root-based approach (ex. Khoja); stem-based approach (ex. Larkey); and statistical approach
(ex. N-Garm). However, no stemming of this language is perfect: The existing stemmers have a low
efficiency. In this paper, we introduce a new stemming technique for Arabic words that also solve the
problem of the plural form of irregular nouns in Arabic language, which called broken plural. The
proposed stem extractor provides very accurate results in comparisons with other algorithms.
Consequently the search effectiveness improved.
SIGNWRITING SYMPOSIUM 2016 PRESENTATION 63 "Using SignWriting for the Peruvian Sign Language (LSP) Dictionary" by Miguel Rodríguez Mondoñedo. This talk will discuss the possibilities of using SW as a tool to represent lexical entries in a Peruvian Sign Language Dictionary (Diccionario Anotado de Lengua de Señas Peruana, LSP). Currently, no published LSP dictionary exists; in fact, there is almost no research on LSP, and very few systematically gathered data on the language. We plan to elaborate the first annotated dictionary on LSP (DALSP). The DALSP will have the following characteristics for each lexical entry: (i) A recorded video of the sign in citation form (provided by native speakers), (ii) a gloss in Spanish, (iii) a translation to Spanish, (iv) a morphosyntactic description, (v) a phonological description, (vi) a recorded video with a sample sentence in LSP containing the sign (provided by native speakers), (vii) gloss and translation for the LSP sample sentence, (viii) a SignWriting transcription for the sign and the sample sentence. It will be published online. http://www.signwriting.org/symposium/presentation0063.html
Stemming is the process of reducing words to their stems or roots. Due to the morphological richness and
complexity of the Arabic language, stemming is an essential part of most Natural Language Processing
(NLP) tasks for this language. In this paper, we study the impact of different stemming approaches on the
Named Entity Recognition (NER) task for Arabic and explore the merits, limitations and differences
between light stemming and root-extraction methods. Our experiments are evaluated on the standard
ANERCorp dataset as well as the AQMAR Arabic Wikipedia Named Entity Corpus.
A New Approach to Romanize Arabic WordsIJERA Editor
Romanization of Arabic words has been acquired the interest of the researchers due to its importance in many
fields such as security and terrorism fighting, translation, religious purposes, etc.
In this paper, a proposed method was presented to solve the drawbacks of available methods such as lack of
reverse recognition, using of extra letters and punctuation characters, and neglecting the correlation of the letters
in a word.
This method was implemented and tested using a sample of 100 undergraduate Iraqi students and 150 Arabic
words which romanized using five well-known methods in addition to the proposed one. The test showed that
the proposed method dominants the rest method from the recognition and reverse recognition process in
considerable ratio.
Clustering Arabic Tweets for Sentiment AnalysisMustafa Jarrar
Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar: Clustering Arabic Tweets for Sentiment Analysis. Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. IEEE Computer Society. DOI 10.1109/AICCSA.2017.162
Classifying Processes and Basic Formal OntologyMustafa Jarrar
pdf http://www.jarrar.info/publications/JC17.pdf
Mustafa Jarrar and Werner Ceusters
ABSTRACT
Unlike what is the case for physical entities and other types of continuants, few process ontologies exist. This is not only because processes received less attention in the research community, but also because classifying them is challenging. Moreover, upper level categories or classification criteria to help in modelling and integrating lower level process ontologies have thus far not been developed or widely adopted. This paper proposes a basis for further classifying processes in the Basic Formal Ontology. The work is inspired by the aspectual characteristics of verbs such as homeomericity, cumulativity, telicity, atomicity, instantaneity and durativity. But whereas these characteristics have been proposed by linguists and philosophers of language from a linguistic perspective with a focus on how matters are described, our focus is on what is the case in reality thus providing an ontological perspective. This was achieved by first investigating the applicability of these characteristics to the top-level processes in the Gene Ontology, and then, where possible, deriving from the linguistic perspective relationships that are faithful to the ontological principles adhered to by the Basic Formal Ontology.
The goal of this course is to introduce students to ideas and techniques from discrete mathematics that are widely used in computer science. Ultimately, students are expected to understand and use (abstract) discrete structures that are the backbones of computer science. In particular, this class is meant to introduce logic, proofs, sets, functions, relations, counting, graphs and trees and with an emphasis on applications in computer science.
Lecture slides by Mustafa Jarrar at Birzeit University, Palestine.
Course Title: Data and Business Process Modeling
See the course webpage and video lectures at: http://jarrar-courses.blogspot.com/2015/01/data-and-business-process-modelling.html
and http://www.jarrar.info
Business Process Design and Re-engineeringMustafa Jarrar
Lecture slides by Mustafa Jarrar at Birzeit University, Palestine.
Course Title: Data and Business Process Modeling
See the course webpage and video lectures at: http://jarrar-courses.blogspot.com/2015/01/data-and-business-process-modelling.html
and http://www.jarrar.info
Lecture slides by Mustafa Jarrar at Birzeit University, Palestine.
Course Title: Data and Business Process Modeling
See the course webpage and video lectures at: http://jarrar-courses.blogspot.com/2015/01/data-and-business-process-modelling.html
and http://www.jarrar.info
See the course webpage and video lectures at: http://jarrar-courses.blogspot.com/2015/01/data-and-business-process-modelling.html
and http://www.jarrar.info
Introduction to Business Process ManagementMustafa Jarrar
Lecture slides by Mustafa Jarrar at Birzeit University, Palestine.
Course Title: Data and Business Process Modeling
See the course webpage and video lectures at: http://jarrar-courses.blogspot.com/2011/12/arabic-ontology.html
and http://www.jarrar.info
Lecture video by Mustafa Jarrar at Birzeit University, Palestine.
See the course webpage at: http://jarrar-courses.blogspot.com/2011/09/knowledgeengineering-fall2011.html
and http://www.jarrar.info
and on Youtube:
https://www.youtube.com/watch?v=GYmI37-0b5k&index=7&list=PLDEA50C29F3D28257
Lecture video by Mustafa Jarrar at Birzeit University, Palestine.
See the course webpage at: http://jarrar-courses.blogspot.com/2011/09/knowledgeengineering-fall2011.html
and http://www.jarrar.info
and on Youtube:
https://www.youtube.com/watch?v=GYmI37-0b5k&index=7&list=PLDEA50C29F3D28257
On Computer Science Trends and Priorities in PalestineMustafa Jarrar
On Computer Science Trends and Priorities in Palestine,
by Mustafa Jarrar
Computer Science
Birzeit University, Palestine
Personal Page: http://www.jarrar.info
At Workshop on ّIT Research Trends and Priorities
Islamic University of Gaza, Palestine
28 March, 2015
Bouquet: SIERA Workshop on The Pillars of Horizon2020Mustafa Jarrar
Prof Paolo Bouquet
University of Trento Italy.
Workshop on Proposal Writing and International Fundraising
Sina Institute at Birzeit University
April 2, 2014.
Jarrar: Logical Foundation of Ontology EngineeringMustafa Jarrar
Lecture slides by Mustafa Jarrar at Birzeit University, Palestine.
See the course webpage at: http://jarrar-courses.blogspot.com/2012/04/aai-spring-jan-may-2012.html
and http://www.jarrar.info
and on Youtube:
http://www.youtube.com/watch?v=aNpLekq6-oA&list=PL44443F36733EF123
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Presentation curras paper-emnlp2014-final
1. Workshop on Arabic Natural Language Processing
EMNLP 2014, Doha
Building a Corpus for Palestinian Arabic
- a Preliminary Study
Mustafa Jarrar, *Nizar Habash, Diyam Akra, Nasser Zalmout
Birzeit University, Palestine *New York University Abu Dhabi
2. Reference:
Mustafa Jarrar, Nizar Habash, Diyam Akra, Nasser Zalmout: Building A Corpus For Palestinian Arabic: A
Preliminary Study. In proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing.
Association for Computational Linguistics (ACL), Pages (18-27). October 25, 2014, Doha, Qatar. ISBN: 978-1-
937284-96-1
Download Article
http://www.jarrar.info/publications/JHAZ14.pdf
Watch the Presentation
https://www.youtube.com/watch?v=Kw3R3DQVc8E
3. Acknowledgment
• This work is part of the ongoing project called Curras, funded
by the Research Council - Palestinian Ministry of Higher
Education.
• Team:
– Mustafa Jarrar (project investigator)
– Nizar Habash
– Mahdi Arar
– Diyam Fuad
– Faeq Alrimawi
• Collaborators: Owen Rambow, Faisal Al-Shargi, and Ramy
Eskander
http://sina.birzeit.edu/projects/
4. Agenda
• Introduction
• Distinguishing features of Palestinian Arabic
• Related work
• Building the Curras Corpus
– Corpus Collection
– Annotation process
• Challenges
– A Conventional Orthography for Palestinian Arabic
– Morphological Annotation Process challenges
• Conclusion and Future Work
5. Introduction
• Most Arabic NLP tools and resources were developed to serve
Modern Standard Arabic (MSA)
– Dialectal phonological and morphological variations from MSA
– No standard orthography for Arabic Dialects
• Exponential growth of socially generated dialectal content
• Our goal is to build a high-coverage and well-annotated corpus
of the Palestinian Arabic dialect (PAL)
– Curras Corpus of Palestinian Arabic
– Collect words from different resources (43K words)
– Annotate using a morphological analyzer tool (MADAMIRA)
– Manually correct annotations
• First step toward developing NLP applications for PAL
• Insights and extensions to other Arabic dialects
6. Distinguishing Features of PAL
• Palestinian Arabic is a Southern Levantine dialect of
Arabic
– The points presented here focus on difference from MSA
– Some are shared with other dialects
• Phonology
– Pronunciation of the /q/ phoneme (MSA (ق
• Urban /’/, rural /k/, Bedouin /g/, and Druze /q/
– MSA diphthongs /ay/ and /aw/ generally become /e:/ and
/o:/
– PAL elides many short vowels that appear in the MSA
cognates
• e.g. MSA جبال /jiba:l/ ‘mountains’ becomes /jba:l/
– Insertion of epenthetic vowels (Herzallah, 1990)
• e.g. /kalb/ and /kalib/ ( كلب ‘dog’)
7. Distinguishing Features of PAL
• Morphology
– PAL in most of its sub-dialects collapses the feminine
and masculine plurals and duals in verbs and most
nouns, as well as other forms:
• e.g. حبيت /Habbe:t/ ‘I (or you m.s.) loved’
– PAL has many clitics that do not exist in MSA
• The progressive particle /b+/ (as in /b+tuktub/ ‘you write’)
• The demonstrative particle /ha+/ (as in /ha+l+be:t/ ‘this
house’)
• The negation cirmcumclitic /ma+ +š/ (as in /ma+katab+$/
‘he did not write’)
• Indirect object clitic (as in /ma+katab+l+o:+š/ ‘he did not
write to him’)
8. Distinguishing Features of PAL
• Lexicon
– Some common PAL words are portmanteaus of
MSA words
• /biddi/ ‘I want’ is from MSA /bi+widd+i/ ‘in my desire’
• /le:š/ ‘why’ corresponds to MSA /li+’ayyi šay’in/ ‘for
what thing’
– Borrowed words from other languages:
• E كندرة /kundara/ ‘shoe’ (Turkish)
• E محسوم /maHsu:m/ ‘checkpoint’ (Hebrew)
9. Related Work
• Dialectal Corpus Collection
– Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany, et al., 2002)
– Maamouri et al. (2006) developed a pilot Levantine Arabic Treebank
(LATB)
– LDC BOLT resources for Egyptian Arabic (ATB-ARZ)
– Others: YADAC (Al-Sabbagh and Girju, 2012), COLABA project (Diab
et al., 2010)
• Dialectal Orthography
– Habash et al (2012a) proposed the so-called CODA (conventional
orthography of dialectal Arabic) for the purpose of developing
computational models of Arabic dialects in general
• Dialectal Morphological Annotation
– CALIMA-Egyptian (Habash et al., 2012b)
– MADAMIRA (MSA/Egyptian) (Pasha et al., 2014)
10. Agenda
• Introduction
• Distinguishing features of Palestinian Arabic
• Related work
• Building the Curras Corpus
– Corpus Collection
– Annotation process
• Challenges
– A Conventional Orthography for Palestinian Arabic
– Morphological Annotation Process challenges
• Conclusion and Future Work
11. Building the Curras Corpus:
Corpus Collection
• Challenges
– Scarce resources in terms of written literature
• Traditional and informal contexts, such as conversations in
TV series, movies, or on social media platforms
– Orthographic inconsistency
• Mixing of phonetic spelling and MSA-cognate-based spelling
• Arabizi spelling
• In this stage we focus on precision and variety
rather than mere size
– We tried not only to manually select and review the
content of the corpus, but also to assure a variety of
topics and contexts, localities and sub-dialects.
12. Manually Collected Resources
The Curras Corpus Statistics
Document Type
Word
Tokens
Word
Types
Documents
Facebook 3,120 1,985 35 threads
Twitter 3,541 2,133 38 threads
Blogs 8,748 4,454 37 threads
Forums 1,092 798 33 threads
Palestinian Stories 2,407 1,422 6 stories
Palestinian Terms 759 556 1 doc
TV Show: وطن ع وتر
Watan Aa Watar
23,423 8,459 41 episodes
Curras Total 43,090 19,807 191
13. Building the Curras Corpus:
Corpus Annotation
• We define the annotation of a word as a tuple
<w, wB c, cB, l, pB, g, i>
W Raw (Unicode) The raw input word
wB
Raw (Buckwalter) The same raw input word in Buckwalter
transliteration
C
CODA (Unicode) The Conventional Orthography (Habash et al., 2012)
version of the input word
cB
CODA (Buckwalter) The Buckwalter transliteration of the CODA form
l Lemma The lemma of the word in Buckwalter transliteration
pB
The Full Buckwalter POS Tag
g Gloss (English)
i
Analysis Source of annotation, e.g., ANNO is a human annotator, and
MADA is the MADAMIRA system with some minor or no automatic
post-processing
14. Building the Curras Corpus:
Corpus Annotation
• We exploit existing tools to speed up the annotation process.
We specifically use the MADAMIRA tool (Pasha et al., 2014)
for morphological analysis and disambiguation of MSA and
EGY.
• A manual step is then needed to verify every annotation, to
correct errors and fill in gaps.
• We made one major simplification to the annotations to
minimize the load on the human annotator.
– We do not produce diacritized morphological analyses in the
Buckwalter POS tag.
15. Agenda
• Introduction
• Distinguishing features of Palestinian Arabic
• Related work
• Building the Curras Corpus
– Corpus Collection
– Annotation process
• Challenges
– A Conventional Orthography for Palestinian Arabic
– Morphological Annotation Process challenges
• Two pilot studies
• Conclusion and Future Work
16. Challenges:
Conventional Orthography for PAL
• Inconsistent spellings
– e.g. ‘heart’ (MSA قلب qalb) four spellings: قلب qalb /qalb/, ألب >alb /’alb/, كلب kalb
/kalb/, and جلب galb /galb/
• Shortening of long vowels
– E قانون qAnuwn ‘law’ (MSA /qa:nu:n/), shortened first vowel قنون qanuwn
/qanu:n/
• PAL has some clitics that do not exist in MSA
– e.g. the PAL future particle ح/Ha/
• Morphemes with additional forms or pronunciations
– e.g. ال Al /il/ which has a non-MSA/non-EGY allomorph /li/
– E لبلاد /libla:d/ ‘the homeland/countries’ can be spelled to reflect the morphology
as البلاد AlblAd or the phonology لبلاد lblAd, with the latter being ambiguous with
‘for countries’
• Words in PAL that have no cognate in MSA
– e.g. the word /barDo/ ‘additionally’ is spontaneously written as برضو brDw, برضه
brDh and برضة brDp.
17. CODA: A Conventional Orthography for
Dialectal Arabic
• Habash et al., (2012) developed CODA for
computational processing purposes primarily
• Objectives
– CODA covers all DAs, minimizing differences in
choices
– CODA is easy to learn and produce consistently
– CODA is intuitive to readers unfamiliar with it
– CODA uses Arabic script
• Inspired by previous efforts from the LDC and
linguistic studies
• Guidelines created for Egyptian and Tunisian so far
17
18. CODA Examples
18
Phenomenon Original CODA
Spelling Errors
Typos
Speech effects
Merges
Splits
الاجابه
شبب
كبيييييييير
اليومبريستيج
المع روف
الإجابة
سبب
كبير
اليوم بريستيج
المعروف
MSA Root Cognate قلب آلب، كلب
Dialectal Clitic
Guidelines
علبيت
مشفناش
عالبيت
ما شافناش
Unique Dialect Words برضه بردو، برضو
19. Challenges:
Conventional Orthography for PAL
• Our challenge was to take CODA guidelines (specifically the EGY
version) (Habash et al., 2012) and extend them.
– In terms of phonology-orthography
• We added the letter كk to the list of root letters to be spelled in the MSA
cognate to cover the PAL rural sub-dialects that pronounces it as /t$/.
– In terms of morphology
• We added the non-EGY demonstrative proclitic هha and the conjunction
proclitic تta ‘so as to’ to the list of clitics, e.g., بهالبيت bhAlbyt ‘in this
house’ and تيشوف ty$wf ‘so that he can see’.
– We extended the list of exceptional words to cover problematic
PAL words.
20. Pilot study (I)
• We considered 1,000 words from 77 tweets in Curras.
The CODA version of each word was created in context.
Results:
– 15.9% of all words had a different CODA form from the
input raw word form.
• 42% of these changes involve consonants.
• 20% vowels and the hamzated/bare forms of the letter Alif اA.
• 29% word changes involve the spelling of specific morpheme.
• 18% of the changed words experience a split or a merge (with
splits happening five time more than merges).
• Only about 8% of the changed words were PAL specific terms.
• less than 7% involved a typo or speech effect elongation.
21. Challenges:
Morphological Annotation Process
• Manually annotating words can take a lot of time.
• To make the process faster and easier we decided
to use an existing morphological analyzer for
MSA or EGY to create PAL annotations.
• To validate the use of the morphological analyzer
we conducted a pilot study
22. Pilot study (II)
• We ran the words from a randomly selected episode of
the PAL TV show “Watan Aa Watar” (460 words)
through both MADAMIRA-MSA and MADAMIRA-EGY.
• We analyzed the output from both systems to
determine its usability for PAL annotations.
• The results of this experiment are summarized below:
Accuracy of automatic annotation of PAL text
Statistics MADAMIRA MSA MADAMIRA EGY
No Analysis 17.78% 7.24%
Wrongly Analyzed 18.43% 14.75%
Correctly Analyzed 63.79% 78.01%
23. RAW CODA Lemma BuckWalter POS (Diacritized)
Glos
s
Status
MADAMIRA
MSA EGY
ابوكوا 1
Abwkw
A
ابوكو Abwkw Abuw
Abw/NOUN+
kw/POSS_PRON_3MS
father MADA NA
Correc
t
الاكل 2 AlAkl الأكل Al>kl >akl Al/DET+>kl/NOUN eating MADA Usable
Correc
t
لبنوك 3 lbnwk البنوك Albnwk bank Al/DET+bnwk/NOUN bank ANNO Wrong Wrong
التاني 4 AltAny الثاني AlvAny vAniy Al/DET+vAny/ADJ_NUM second ANNO Wrong Usable
5
الحما
ر
AlHmAr الحمار AlHmAr HmAr Al/DET+HmAr/NOUN donkey MADA UsableUsable
6
للرات
ب
llrAtb للراتب llrAtb rAtib l/PREP+Al/DET+rAtb/NOUN salary MADA Usable
Correc
t
ايوة 7 Aywp أيوه >ywh >ayowah >ywh/INTERJ yes MADA NA
Correc
t
بدها 8 bdhA بدها bdhA bid~
bd/NOUN+
hA/POSS_PRON_3FS
want ANNO Wrong Wrong
9
بنردل
ك
bnrdlk
بنرد_ل
ك
bnrd_lk rad~
b/PROG_PART+n/IV1P
+rd/IV+l/PREP+k/PRON_2MS
answer MADA NA Usable
1
0
هدول hdwl هذول h*wl ha*A h*wl/DEM_PRON these ANNO NA NA
24. Conclusion and Future Work
• We presented preliminary results on the potential, and
limitations, of using existing resources, specially MADAMIRA
EGY, to semi-automate and speed up the annotation process
of a Palestinian Arabic corpus.
• Several issues need to be considered and researched further:
– The development of Palestinian-specific morphological annotation and
CODA guidelines.
– The development of a Palestinian lexicon.
– The extension of MADAMIRA to analyze Palestinian text.
• Corpus will be further extended to include more text, and all
lexical annotations will be linked with an Arabic ontology
• The annotated corpus will be used as part of developing a
Palestinian Arabic morphological analyzer
• We plan to extend the work to other dialects
26. References
• H. Abo Bakr, K. Shaalan, and I. Ziedan. A Hybrid Approach for Converting Written
Egyptian Colloquial Dialect into Diacritized Arabic. In The 6th International
Conference on Informatics and Systems, INFOS2008. Cairo University.
• R. Al-Sabbagh and R. Girju. YADAC: Yet another dialectal Arabic corpus. In Proc. of
the Language Resources and Evaluation Conference (LREC), pages 2882–2889,
Istanbul, 2012.
• T. Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0. LDC
catalog number LDC2004L02, ISBN 1-58563-324-0
• M. Diab, N. Habash, O. Rambow, M. Altantawy, and Y. Benajiba. COLABA: Arabic
Dialect Annotation and Processing. LREC Workshop on Semitic Language
Processing, Malta, 2010
• H. Gadalla, H. Kilany, H. Arram, A. Yacoub, A. El-Habashi, A. Shalaby, K. Karins, E.
Rowson, R. MacIntyre, P. Kingsbury, D. Graff, and C. McLemore. 1997. CALLHOME
Egyptian Arabic Transcripts. Linguistic Data Consortium, Catalog No.: LDC97T19.
• D. Graff, M. Maamouri, B. Bouziri, S. Krouna, S. Kulick, and T. Buckwalter. 2009.
Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data
Consortium LDC2009E73.
• N. Habash, M. Diab, and O. Rabmow. (2012a) Conventional Orthography for
Dialectal Arabic. In Proc. of LREC, Istanbul, Turkey, 2012.
27. • N. Habash, R. Eskander, and A. Hawwari. (2012b) A Morphological Analyzer for
Egyptian Arabic. In Proc. of the Special Interest Group on Computational
Morphology and Phonology, Montréal, Canada, 2012.
• N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological
Analysis and Disambiguation for Dialectal Arabic. In Proc. of NAACL, Atlanta,
Georgia, 2013.
• R. Herzallah. (1990). Aspects of Palestinian Arabic Phonology: A Nonlinear
Approach. Ph.D. thesis, Cornell University. Distributed as Working Papers of the
Cornell Phonetics Laboratory No. 4.
• H. Kilany, H. Gadalla, H. Arram, A. Yacoub, A. El-Habashi, and C. McLemore.
Egyptian Colloquial Arabic Lexicon. Linguistic Data Consortium, Catalog No.:
LDC99L22.
• M. Maamouri, A. Bies, T. Buckwalter, M. Diab, N. Habash, O. Rambow, and
D. Tabessi. Developing and using a pilot dialectal Arabic treebank. In Proc. of LREC,
Genoa, Italy, 2006.
• A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M.
Pooleery, O. Rambow and R. M. Roth. MADAMIRA: A Fast, Comprehensive Tool for
Morphological Analysis and Disambiguation of Arabic. In Proc. of LREC, Reykjavik,
Iceland, 2014.
• W. Salloum and N. Habash. Dialectal to Standard Arabic Paraphrasing to Improve
Arabic-English Statistical Machine Translation. In Proc. of the First Workshop on
Algorithms and Resources for Modeling of Dialects and Language Varieties,
Edinburgh, Scotland, 2011.
• I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze Khmekhem, L. Hadrich Belguith,
and N. Habash. A Conventional Orthography for Tunisian Arabic. In Proc. of LREC,
Reykjavik, Iceland, 2014.