SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic, Arabizi
TAHA TOBAILI, MIRIAM FERNANDEZ, HARITH ALANI, SANAA SHARAFEDDINE, HAZEM HAJJ, GORAN GLAVAS
KNOWLEDGE MEDIA INSTITUTE, THE OPEN UNIVERSITY, UK
Arabizi Background
Term: A Mixture of Araby and Englizi
A transcription of the spoken dialectal Arabic (DA) in Latin script, also using numbers and symbols, e.g. 7abibi
Bilingual Arabs, mainly the youth, started texting DA using the Latin keyboard.
@TAHATOBAILI: SENTIMENT ANALYSIS FOR ARABIZI
Arabic Background
Ranked 5th in the world: spoken by 400+ million people, of whom 300 million are native speakers
Arabic Background
1. Modern Standard Arabic (MSA): Formal, Structured, Rich in Grammar, Linguistic Rules, Poetry, Part of Speech, Thesauri, Corpora (spoken and written)
2. Dialectal Arabic (DA): Spoken Natural Mother Tongue, Esoteric within each Region: Word Choice, Coining Terms, Tempo, Pronunciation, Influenced by other languages (Broken Arabic).
Levantine: Syrian, Lebanese, Palestinian, Jordanian (influenced by Turkish and French).
3. Arabizi: A reflection of the spoken DA in Latin script.
Arabizi Background
1. Guttural Phonemes: ح ع خ غ and the glottal stop ء, e.g. lou2a, lou2lou2a
Different transcription standards: Egyptian Arabizi خ = 7', kh; Lebanese Arabizi خ = 5, kh
2. Heavy and Light phonemes: t (ط ت), s (ص س), d (د ض), k (ك ق), th (ذ ظ)
Dialectal: z (ز ذ ظ), s (ص س ث), e.g. laziz
3. Short Vowels (diacritics) and long vowels: كَتَبَ katab or ktb
Inconsistent Orthography!
Challenges
1. Inconsistent Orthography:
Word Ambiguity: dareb (written with ض or د): route or hit?
Lexical Sparsity: habibi: hbb, 7bb, 7abibi, 7abeebi, 7abb
2. Rich in Morphology: b7ebak (I-love-you-masculine), bet7ibine (you-love-me-feminine), etc.
100 inflectional forms
100 x orthographic variances
3. Codeswitching: hi, kifak, cava?
keep a 3aj2a kit ma3ak matra7 el first aid kit in case 3le2et bi 3aj2a
Transliteration
Dialect, Inconsistent Orthography, Word Ambiguity
Da5l jamelik w hadamtik, Oh your-beauty and your-sense-of-humour!
Sentiment Analysis for Arabizi
Data Collection → Sentiment Lexicon → Evaluation
Treat Arabizi as a Low-Resourced Language Independent of Arabic
Lebanon
177K Tweets
Split Arabic / Latin script
97K Latin script Tweets
Preprocessing: 60K
30K for Annotation
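The Arabic / Latin script split can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function name `dominant_script` and the majority-vote rule over letters are assumptions made for the example.

```python
import unicodedata


def dominant_script(text: str) -> str:
    """Return 'arabic', 'latin', or 'other' by majority vote over letters.

    Hypothetical helper: the slides only say the tweets were split by script.
    """
    counts = {"arabic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # digits, emoji and punctuation do not vote
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            counts["arabic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["arabic"] == 0 and counts["latin"] == 0:
        return "other"
    return "arabic" if counts["arabic"] > counts["latin"] else "latin"


tweets = ["7abibi kifak", "حبيبي", "123 !!"]
latin_tweets = [t for t in tweets if dominant_script(t) == "latin"]
print(latin_tweets)  # ['7abibi kifak']
```

Digits such as 7 or 3 are skipped on purpose, so Arabizi words that mix numbers and letters are still counted as Latin script.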
Annotation Results
Arabizi annotation:
Tweets   Arabizi   Not Arabizi   I Don’t Know   Kappa
30K      4.3K      27.6K         641            0.74

Sentiment annotation (3.4K tweets where two answers match for Arabizi-yes):
Tweets   Positive   Negative   Neutral   I Don’t Know   Kappa
3.4K     1.2K       1.4K       2.1K      172            0.33

Dataset:
1. Sentiment Analysis (SA): 1.6K Tweets (800 positive, 800 negative), two answers match.
2. Arabizi Identification (AI): 4.4K Tweets (2.2K Arabizi, 2.2K Not Arabizi), three answers match.
Sentiment Lexicon
Hu and Liu: 6.8K Words 7.8K Words bab.la 1.5K Words 732 Words 2K Words 2K Arabizi
MPQA: 7.6K Words 9.4K MSA Words 600 pos, 1.4K neg
Living Arabic: 7K Words
Evaluation SenZi
Lexicon Based Approach: Match the Positive and Negative Words with the Lexicon
Two-Class classification: Positive or Negative
SA Dataset: 800 positive, 800 negative (2-annotator agreement)
Tweets with no matched sentiment words were assigned a sentiment class at random.
Example: Score: -3 → Negative
Recall   Precision   F1-Score   Accuracy
0.56     0.59        0.57       0.58
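The lexicon-based approach above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: the scoring rule (positive matches minus negative matches), whitespace tokenisation, and the tiny lexicon are illustrative. The positive entries are taken from words shown on the slides; the negative entry `mesh_mni7` is a hypothetical placeholder.

```python
import random


def classify(tweet, positive, negative, rng=random):
    """Return (label, score), where score = positive matches - negative matches."""
    tokens = tweet.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    # No sentiment evidence: assign one of the two classes at random.
    return rng.choice(["positive", "negative"]), score


# Toy lexicon for illustration only; it is not the content of SenZi.
positive = {"mabrouk", "naje7", "tayeb"}
negative = {"mesh_mni7"}  # hypothetical placeholder entry

print(classify("mabrouk ya naje7", positive, negative))  # ('positive', 2)
print(classify("mabrok 3layki", positive, negative)[1])  # 0
```

The second call scores 0 because `mabrok` is spelled differently from the lexicon entry `mabrouk`, so the tweet falls through to the random assignment.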
Evaluation
Error Analysis
Tweet: btestehali naja7at al deni l2nek btenshri al fara7 wen ma ken mabrok 3layki inti wa m7ebenk kl l naja7at
Translation: you-deserve the success of the world because you spread the happiness everywhere congrats to you and your-loved-ones for all the successes
Words in the tweet: naja7at, successes (inflected); mabrok, congrats (written differently)
Words in SenZi: naje7, success; mabrouk, congrats
Expanding SenZi
Word Embeddings
Arabizi Corpus
Nearest Word Neighbors
SenZi Words
Expanding SenZi
Arabizi Corpus
49 Lebanese Facebook public pages
• Extract all comments and replies to comments
• Skip Arabic, keep the Latin script
• Preprocessing
2.2 Million Comments
Latin Script Languages: English, French, Arabizi, Latinised Far Eastern Languages e.g. Hindi and Filipino
Expanding SenZi
Train an Arabizi Identifier
AI Dataset: 4.4K Tweets (2.2K Arabizi, 2.2K Not Arabizi) three answers match.
SVM Classifier, 10-fold cross validation.
Recall   Precision   F1-Score   Accuracy
0.93     0.95        0.97       0.97
2.1 Million Latin Script Comments → 1 Million Arabizi Comments
Expanding SenZi
Word Embeddings: FastText
Arabizi Corpus: 1M Arabizi Comments
50 Nearest Word Neighbors
SenZi Word
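The nearest-neighbour lookup can be illustrated with plain cosine similarity. This is a sketch under assumptions: the slides use FastText embeddings trained on the Arabizi corpus, whereas the two-dimensional vectors below are made-up toy values and `nearest_neighbours` is a hypothetical helper, not a FastText API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k=50):
    """Return up to k other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
vectors = {
    "tayeb": [1.0, 0.1],
    "tayyeb": [0.9, 0.2],
    "taybe": [0.8, 0.3],
    "za3len": [-0.5, 1.0],
}
print(nearest_neighbours("tayeb", vectors, k=2))  # ['tayyeb', 'taybe']
```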
Expanding SenZi
tayeb (tasty/cute): wtayeb tayebb tayeeb ltayeb taye2 tayebbb taybé tayem tayb tayef taybo tayob katayeb tayer ayeb tayeh tab tayeje Tayeb tayib taybee tay tayba taybeh tay2o tayab tayben tayyeb 7abayeb taybii tayech tay2a tayibb 8ayeb jayeb tay2ino tayybe habeyeb taybe Tyeb taybine Sahten tayyb taym taybi Sahtan sayeb taybeee sahten za3len
Inflectional and Orthographic forms
Match consonant-letter-sequence:
tayeb: tyb → atyab, atyabak, atyabo, 2tyab, tayoub, taybe, tayoubi, taybeee, taybin, and tayoubin.
tayeb: wtayeb tayebb tayeeb ltayeb tayebbb tayb taybo tayob katayeb tayeb tayib taybee tayba taybeh tayab tayben tayyeb taybii tayibb tayybe taybe tyeb taybine tayyb taybi taybeee
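One plausible reading of the consonant-letter-sequence match is sketched below. The exact rule is not spelled out on the slide, so the details are assumptions: vowels are taken to be a, e, i, o, u, repeated letters are collapsed, and a neighbour is kept when the seed's consonant sequence occurs inside its own. The result is not guaranteed to reproduce the filtered list above exactly.

```python
from itertools import groupby

VOWELS = set("aeiou")


def skeleton(word: str) -> str:
    """Drop Latin vowels and collapse repeated letters: 'tayyeb' -> 'tyb'."""
    consonants = [ch for ch in word.lower() if ch not in VOWELS]
    return "".join(ch for ch, _ in groupby(consonants))


def same_root(seed: str, candidates):
    """Keep candidates whose consonant skeleton contains the seed's skeleton."""
    key = skeleton(seed)
    return [c for c in candidates if key in skeleton(c)]


neighbours = ["tayyeb", "wtayeb", "taye2", "7abayeb", "sahten", "tab", "katayeb"]
print(skeleton("tayeb"))                 # tyb
print(same_root("tayeb", neighbours))    # ['tayyeb', 'wtayeb', 'katayeb']
```

Digits such as 2 and 7 are treated as consonants here, since in Arabizi they stand for Arabic consonant letters.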
Expanding SenZi
Expand SenZi Twice (Recursively)
tayeb: wtayeb tayebb tayeeb ltayeb tayebbb tayb taybo tayob katayeb tayeb tayib taybee tayba taybeh tayab tayben tayyeb taybii tayibb tayybe taybe tyeb taybine tayyb taybi taybeee
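The two-round expansion could be organised as in the sketch below. It is an assumption-laden illustration rather than the authors' procedure: it assumes that the second round expands only the words added in the first round, that new forms inherit the polarity of their seed, and that `neighbours` is any callable returning nearest neighbours (here a hypothetical toy lookup table).

```python
from itertools import groupby


def skeleton(word):
    """Drop vowels (a, e, i, o, u) and collapse repeated letters."""
    consonants = [ch for ch in word.lower() if ch not in "aeiou"]
    return "".join(ch for ch, _ in groupby(consonants))


def expand(lexicon, neighbours, rounds=2):
    """lexicon: dict word -> polarity; neighbours: callable word -> list of words."""
    expanded = dict(lexicon)
    frontier = list(lexicon)
    for _ in range(rounds):
        added = []
        for seed in frontier:
            for cand in neighbours(seed):
                # Keep only new forms that share the seed's consonant sequence.
                if cand not in expanded and skeleton(seed) in skeleton(cand):
                    expanded[cand] = expanded[seed]
                    added.append(cand)
        frontier = added  # the next round starts from the newly added forms
    return expanded


# Hypothetical neighbour table standing in for the embedding lookup.
toy = {"tayeb": ["tayyeb", "sahten"], "tayyeb": ["taybine", "tayeb"]}
result = expand({"tayeb": "positive"}, lambda w: toy.get(w, []), rounds=2)
print(sorted(result))  # ['taybine', 'tayeb', 'tayyeb']
```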
Evaluation SenZi
Lexicon Based Approach: Match the Positive and Negative Words with the Lexicon
Two-Class classification: Positive or Negative
SA Dataset: 800 positive, 800 negative (2-annotator agreement)
Tweets with no matched sentiment words were assigned a sentiment class at random.
Lexicon               Recall   Precision   F1-Score   Accuracy
SenZi Original        0.56     0.59        0.57       0.58
SenZi 1st Expansion   0.74     0.64        0.69       0.67
SenZi 2nd Expansion   0.79     0.66        0.72       0.69
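As a quick check on how the columns relate, the F1-Score is the harmonic mean of precision and recall. The short sketch below recomputes it from the recall and precision values reported in the table; the function name `f1` is only illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the table above.
rows = {
    "SenZi Original": (0.56, 0.59),
    "SenZi 1st Expansion": (0.74, 0.64),
    "SenZi 2nd Expansion": (0.79, 0.66),
}
for name, (recall, precision) in rows.items():
    print(name, round(f1(precision, recall), 2))
# SenZi Original 0.57
# SenZi 1st Expansion 0.69
# SenZi 2nd Expansion 0.72
```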
Error Analysis
To the best of our knowledge there are no other Arabizi Sentiment Lexicons (especially Lebanese).
Analysing Errors for the best version of SenZi
Rows: actual class; columns: classified as
Actual     Positive   Negative   Neutral
Positive   55%        3%         42%
Negative   13%        39%        48%
Error Analysis
Wrongly Classified Tweets: 100 positive, 100 negative
Error type                Share
Form not in the Lexicon   20%
English Sentiment Words   12.5%
Neutral Word Classified   14%
MWE and Sarcasm           10%
No Sentiment Word         8%
Word not in Lexicon       6%
Word Ambiguity            5%
Wrong Negation            3%
Examples:
Naja7 / naja7at: successes (inflected)
Mabrouk / mabrok: congrats (written differently)
happy birthday, miss you, lovely, good luck
el me3ze betsou2 a7san (the goat drives better)
maba2 fi oxygen (there is no more oxygen)
SenZi: Sentiment Analysis Lexicon for Arabizi
Free Resources
https://project-rbz.kmi.open.ac.uk/
Acknowledgments
Annotators, System Developers, and Anonymous Reviewers
Thank You 7abibi!
@tahatobaili
taha.tobaili@open.ac.uk
KNOWLEDGE MEDIA INSTITUTE
