SenZi, a new sentiment lexicon for Arabizi, presented at RANLP in September 2019.
We present our work on developing a sentiment lexicon for Lebanese-dialect Arabizi under severely low-resource conditions. Since Arabizi lacks a standard orthography, we then expand the lexicon using word embeddings to cover its orthographic variants. Link to video below:
https://youtu.be/RtoRyqEq0sA
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
1. SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic, Arabizi
TAHA TOBAILI, MIRIAM FERNANDEZ, HARITH ALANI,
SANAA SHARAFEDDINE, HAZEM HAJJ, GORAN GLAVAS
KNOWLEDGE MEDIA INSTITUTE, THE OPEN UNIVERSITY, UK
2. Arabizi Background
Term: A Mixture of Araby and Englizi
A transcription of the spoken dialectal Arabic (DA) in Latin script, using numbers and symbols as well, e.g. 7abibi
Bilingual Arabs, mainly the youth, started texting DA using the Latin keyboard.
@TAHATOBAILI: SENTIMENT ANALYSIS FOR ARABIZI
3. Arabic Background
Ranked 5th in the World: Spoken by 400+ M People, of which 300 M are Native Speakers
4. Arabic Background
1. Modern Standard Arabic (MSA): Formal, Structured, Rich in Grammar, Linguistic Rules, Poetry, Part of Speech, Thesauri, Corpora (spoken and written)
2. Dialectal Arabic (DA): Spoken Natural Mother Tongue, Esoteric within each Region: Word Choice, Coining Terms, Tempo, Pronunciation, Influenced by other languages (Broken Arabic).
Levantine: Syrian, Lebanese, Palestinian, Jordanian (Turkish and French)
3. Arabizi: A reflection of the spoken DA in Latin script.
5. Arabizi Background
1. Guttural Phonemes: ح ع خ غ and glottal stop ء, e.g. lou2a, lou2lou2a
Different transcription standards: Egyptian Arabizi خ = 7’, kh
Lebanese Arabizi خ = 5, kh
2. Heavy and Light phonemes: t (ت, ط), s (س, ص), d (د, ض), k (ك, ق), th (ذ, ظ)
Dialectal: z (ز, ذ, ظ), s (س, ص, ث), e.g. laziz
3. Short Vowels (diacritics) and long vowels: َكَتَب katab or ktb
Inconsistent Orthography!
6. Challenges
1. Inconsistent Orthography:
Word Ambiguity: dareb (ض or د): route or hit?
Lexical Sparsity: habibi: hbb, 7bb, 7abibi, 7abeebi, 7abb, hbb
2. Rich in Morphology: b7ebak (I-love-you-masculine), bet7ibine (you-love-me-feminine), etc.
100 inflectional forms
x 100 orthographic variants
3. Codeswitching: hi, kifak, cava?
keep a 3aj2a kit ma3ak matra7 el first aid kit in case 3le2et bi 3aj2a
8. Sentiment Analysis for Arabizi
Steps: Data Collection, Sentiment Lexicon, Evaluation
Treat Arabizi as a Low-Resourced Language Independent of Arabic
9. Lebanon
177K Tweets
Split Arabic / Latin script
97K Latin script Tweets
Preprocessing: 60K
30K for Annotation
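The split between Arabic-script and Latin-script tweets could be sketched as below. This is a minimal illustration, not the authors' actual tooling: the function name `script_of` and the majority-vote rule are assumptions, and it relies only on Python's standard `unicodedata` module.

```python
import unicodedata

def script_of(text: str) -> str:
    """Label a text 'arabic', 'latin', or 'other' by the majority of its letters.

    Assumption: a simple majority vote over alphabetic characters is enough
    to separate the two scripts; the slides do not describe the real rule.
    """
    arabic = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # digits such as 7 or 3 say nothing about the script
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic > latin else "latin"

tweets = ["7abibi kifak", "مرحبا كيفك", "hi, kifak, cava?", "12345 !!!"]
latin_tweets = [t for t in tweets if script_of(t) == "latin"]
print(latin_tweets)
```

Note that keeping the Latin-script tweets still leaves English, French, and other languages mixed in with Arabizi, which is why a separate Arabizi identification step is needed afterwards.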
11. Annotation Results
Arabizi identification annotation:
Tweets   Arabizi   Not Arabizi   I Don’t Know   Kappa
30K      4.3K      27.6K         641            0.74
Sentiment annotation of the 3.4K tweets where two answers match for Arabizi-yes:
Tweets   Positive   Negative   Neutral   I Don’t Know   Kappa
3.4K     1.2K       1.4K       2.1K      172            0.33
Dataset:
1. Sentiment Analysis (SA): 1.6K Tweets (800 positive, 800 negative), where two answers match.
2. Arabizi Identification (AI): 4.4K Tweets (2.2K Arabizi, 2.2K Not Arabizi), where three answers match.
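The slides report kappa values without naming the variant. As an illustration only, the sketch below computes Fleiss' kappa, a common choice when every item is labelled by the same number of annotators; whether the authors used this variant is an assumption, and the toy labels are made up.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Every item must carry the same number of labels (one per annotator).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        return 1.0  # only one label was ever used; agreement is trivially perfect
    return (p_observed - p_expected) / (1 - p_expected)

# Toy annotations (hypothetical), three annotators per tweet.
toy = [
    ["arabizi", "arabizi", "arabizi"],
    ["not_arabizi", "not_arabizi", "not_arabizi"],
    ["arabizi", "not_arabizi", "not_arabizi"],
]
print(round(fleiss_kappa(toy), 3))
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance, which matches the gap between the two annotation tasks reported here (0.74 versus 0.33).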
12. Sentiment Lexicon
Hu and Liu: 6.8K Words
MPQA: 7.6K Words
Living Arabic: 7K Words
bab.la
Other counts on the slide, in order: 7.8K Words, 9.4K MSA Words, 1.5K Words, 732 Words, 2K Words
Result: 2K Arabizi Words (600 pos, 1.4K neg)
13. Evaluation SenZi
Lexicon Based Approach: Match the Positive and Negative Words with the Lexicon
Two-Class classification: Positive or Negative
SA Dataset: 800 positive, 800 negative (2-annotator agreement)
Tweets with no matched sentiment were assigned a sentiment class at random.
Score: -3
Negative
Recall   Precision   F1-Score   Accuracy
0.56     0.59        0.57       0.58
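The lexicon-based approach described on this slide can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the lexicon entries are placeholders (the `negword` tokens are hypothetical, not SenZi content), tokenisation is a plain whitespace split, and a score of zero is treated the same as no match and resolved at random.

```python
import random

# Placeholder lexicon for illustration only; not the actual SenZi entries.
POSITIVE = {"mabrouk", "naje7"}
NEGATIVE = {"negword1", "negword2"}  # hypothetical tokens

def sentiment_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Add 1 for every positive match and subtract 1 for every negative match."""
    return sum((tok in positive) - (tok in negative) for tok in tokens)

def classify(text, rng, positive=POSITIVE, negative=NEGATIVE):
    """Two-class decision; a zero score falls back to a random class."""
    score = sentiment_score(text.lower().split(), positive, negative)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return rng.choice(["positive", "negative"])

rng = random.Random(0)
print(classify("mabrouk 3layki", rng))              # positive
print(sentiment_score("negword1 negword1 negword2".split()))  # -3
print(sentiment_score("mabrok 3layki".split()))     # 0: the variant spelling is missed
```

The last call shows why exact matching struggles with Arabizi: a spelling variant that is not in the lexicon contributes nothing to the score.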
14. Evaluation
Error Analysis
Tweet: btestehali naja7at al deni l2nek btenshri al fara7 wen ma ken mabrok 3layki inti wa m7ebenk kl l naja7at
Translation: you-deserve the success of the world because you spread the happiness everywhere congrats to you and your-loved-ones for all the successes
Words in the tweet:
naja7at: successes (inflected)
mabrok: congrats (written differently)
Entries in SenZi:
naje7: success
mabrouk: congrats
16. Expanding SenZi
Arabizi Corpus
49 Lebanese Facebook public pages
• Extract all comments and replies to comments
• Skip Arabic, keep the Latin script
• Preprocessing
2.2 Million Comments
Latin Script Languages: English, French, Arabizi, Latinised Far Eastern Languages e.g. Hindi and Filipino
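A corpus like this is what word embeddings would be trained on in order to expand the lexicon with orthographic variants. The sketch below illustrates one plausible way to do such an expansion: each unseen word inherits the polarity of its most similar seed word when the cosine similarity passes a threshold. The hand-made three-dimensional vectors, the threshold of 0.9, and the function names are all assumptions made for illustration; the slides do not give the authors' embedding model or expansion rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand(seeds, vectors, threshold=0.9):
    """Return a copy of `seeds` extended with words close to a seed word.

    seeds:   dict mapping word -> polarity label
    vectors: dict mapping word -> embedding (list of floats)
    """
    expanded = dict(seeds)
    for word, vec in vectors.items():
        if word in seeds:
            continue
        best_seed, best_sim = None, threshold
        for seed in seeds:
            if seed not in vectors:
                continue
            sim = cosine(vec, vectors[seed])
            if sim >= best_sim:
                best_seed, best_sim = seed, sim
        if best_seed is not None:
            expanded[word] = seeds[best_seed]
    return expanded

# Toy vectors (made up); real ones would come from a model trained on the corpus.
vectors = {
    "mabrouk": [0.90, 0.10, 0.00],
    "mabrok":  [0.88, 0.12, 0.01],
    "naje7":   [0.10, 0.90, 0.00],
    "naja7at": [0.15, 0.85, 0.05],
    "oxygen":  [0.00, 0.10, 0.90],
}
seeds = {"mabrouk": "positive", "naje7": "positive"}
print(expand(seeds, vectors))
```

In this toy setup the variants mabrok and naja7at are pulled into the lexicon, while an unrelated word such as oxygen stays out because it is far from every seed.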
22. Evaluation SenZi
Lexicon Based Approach: Match the Positive and Negative Words with the Lexicon
Two-Class classification: Positive or Negative
SA Dataset: 800 positive, 800 negative (2-annotator agreement)
Tweets with no matched sentiment were assigned a sentiment class at random.
Lexicon               Recall   Precision   F1-Score   Accuracy
SenZi Original        0.56     0.59        0.57       0.58
SenZi 1st Expansion   0.74     0.64        0.69       0.67
SenZi 2nd Expansion   0.79     0.66        0.72       0.69
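As a quick consistency check, the F1 scores in the table can be recomputed from the reported recall and precision using the standard harmonic-mean formula. The snippet below is only a worked check on the published numbers; the dictionary layout is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported F1) as given in the table above.
reported = {
    "SenZi Original":      (0.56, 0.59, 0.57),
    "SenZi 1st Expansion": (0.74, 0.64, 0.69),
    "SenZi 2nd Expansion": (0.79, 0.66, 0.72),
}

for name, (recall, precision, f1_reported) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(f"{name}: recomputed F1 = {recomputed}, reported F1 = {f1_reported}")
```

All three recomputed values agree with the reported F1 scores to two decimal places.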
23. Error Analysis
To the best of our knowledge there are no other Arabizi Sentiment Lexicons (Especially Lebanese).
Analysing Errors for the best version of SenZi
Rows: Actual class. Columns: Classified as.
           Positive   Negative   Neutral
Positive   55%        3%         42%
Negative   13%        39%        48%
24. Error Analysis
Error categories: Form not in the Lexicon, English Sentiment Words, Neutral Word Classified, MWE and Sarcasm, No Sentiment Word, Word not in Lexicon, Word Ambiguity, Wrong Negation
Shares: 20% 12.5% 14% 10% 8% 6% 5% 3%
Wrongly Classified Tweets: 100 positive, 100 negative
Naja7 (in lexicon) vs. naja7at (in tweet): successes (inflected)
Mabrouk (in lexicon) vs. mabrok (in tweet): congrats (written differently)
happy birthday, miss you, lovely, good luck
el me3ze betsou2 a7san (the goat drives better)
maba2 fi oxygen (there is no more oxygen)
25. SenZi: Sentiment Analysis Lexicon for Arabizi
Free Resources
https://project-rbz.kmi.open.ac.uk/
Acknowledgments
Annotators, System Developers, and Anonymous Reviewers
Thank You 7abibi!
@tahatobaili
taha.tobaili@open.ac.uk
KNOWLEDGE MEDIA INSTITUTE