SlideShare a Scribd company logo
1 of 16
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 1
Clustering Arabic Tweets
for Sentiment Analysis
Diab Abuaiadah
Diab.Abuaiadah@Wintec.ac.nz
Waikato Institute of Technology
New Zealand
Dileep Rajendran
Dileep.Rajendran@Wintec.ac.nz
Waikato Institute of Technology
New Zealand
Mustafa Jarrar
mjarrar@birzeit.edu
Birzeit University
Palestine
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 2
This talk is based on:
Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar:
Clustering Arabic Tweets for Sentiment Analysis.
Proceedings of the 2017 IEEE/ACS 14th International
Conference on Computer Systems and Applications.
IEEE Computer Society. DOI
10.1109/AICCSA.2017.162
http://www.jarrar.info/publications/ARJ17.pdf
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 3
Characteristics and challenges of Arabic Tweets
 Short text: “Less statistical data” implies poor result, thus less
effective information retrieval algorithms.
 Dialectal language: vast geographical area which includes the
Middle east and north Africa.
 Informal language: does not follow language grammar and
rules.
 Different regions have different dialects.
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 4
Importance of extracting knowledge
from Arabic tweets
 Social and political sentiments: measures the popularity of
political and/or social policies.
 Marketing products : extracting positive and negative opinion,
discovering recent trends and detecting product popularity.
 Sarcastic and non-Sarcastic tweets: also important for
detecting sentiments.
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 5
What preprocessing steps are needed?
 Removing stop words!
 In many applications removing stop words may improve the performance
 In sentiment analysis removing stop might result in loosing valuable
information. For example: keeping negative terms such as “no” can be
useful to indicate negative sentiments
 Lemmatization!
 More challenging with informal and short text
 May produce high errors ratio
 Stemming!
 Also has high error ratio for informal short text
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 6
Similarity Functions
 Most information retrieval algorithms employ a similarity (or
distance) function to measure the similarity between :
 two documents (Tweets)
 or a document (Tweet) and a group of documents (Tweets)
 There are five commonly used similarity functions in natural
language processing applications: Euclidian, Cosine, Pearson,
Jaccard and averaged Kullback-Leibler divergence (KLD)
The goal of this paper is to evaluate and compare these
functions in clustering Arabic tweets for sentiment analysis
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 7
The similarity functions
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 8
Our Datasets and Preprocessing
 An open dataset with 2000 tweets (Jordan dialect)
N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis:
Corpus-based and lexicon- based," in In Proceedings of The IEEE conference on Applied
Electrical Engineering and Computing Technologies (AEECT)., 2013.
 Other datasets were available but the labeling was
immature at the time of writing the paper.
 Preprocessing includes removing stop words, Light10
and Khoja stemmers.
 Clustering using the K-Means (and Bisect K-Means)
algorithms.
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 9
Implementation
 Khoja’s stemmer: the Java code was taken from the author
home page. http://zeus.cs.pacificu.edu/shereen/research.htm
 Light10 and removing stop words: in-house Java code. The
implementation is straightforward as it simply strip off a
predefined fixed set of prefixes and suffixes.
 Clustering and Bisect clustering: in-house Java code using
Hashtable to represent TF-IDF of each document.
(TF: Term Frequency, IDF: Inverse Document Frequency)
 Purities and entropies: in-house Java code.
To measure the quality of the algorithm’s results
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 10
Testing the code
 The code for the similarity functions, K-Means clustering,
purity and entropy measures were tested by applying it to a
well-known dataset
20 Newsgroup dataset: http://csmining.org/index.php/id-20-newsgroups.html
 The purities and entropies results (when applying the K-
Means) is comparable to that of Huang (2008):
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 11
Purities: K-means algorithm
Raw: no pre-processing
NoSW: Only removed stop words (the known list)
Light10: Removed stop words + Light10
Root: Removed stop words + Khoja’s
- Each test was repeated 50 times, and the average is used.
- The Euclidian was bad
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 12
Purities for K-Means
 KLD clearly outperformed the Cosine similarity function
 KLD outperformed the Cosine similarity function
 This is of interest as the Cosine similarity function is commonly used in
many products and published papers.
 The Cosine function outperforms KLD for normal-sized texts, but for
tweets, as our experiment is showing.
 Kohja's stemmer (Root) clearly outperforms the Light10
stemmer
 However, Light10 stemmer outperforms Kohja’s stemmer for normal-
sized texts
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 13
Purities: optimized K-Means
 The optimized K-Means (Bisect K-Means) improves the results
compared to that of K-Means (68% to 76%).
 KLD and Kohja’s stemmer retained their superiority compared to
other similarity functions and other processing.
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 14
Future Research
 Lemmatization: could be more effective than stemmers
 Are other clustering algorithms more effective than K-
Means? Hierarchical algorithms produce better clustering
results but have higher running time
 N-gram preprocessing is an interesting approach that could
lead to higher accuracies
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 15
Thank you
Questions?
Questions?
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 16
References
[1] S. Khoja, 2016. [Online]. Available: http://zeus.cs.pacificu.edu/shereen/research.htm. [Accessed 01 August 2016].
[2] R. M. Duwairi, R. Marji, N. Sha'ban and S. Rushaidat, "Sentiment Analysis in Arabic Tweets,”. Proceedings of ICICS 2014.
[3] E. Zangerle, G. Wolfgang and G. Specht, "On the impact of text similarity functions on hashtag recommendations in microblogging environments," Social
Network Analysis and Mining, vol. 3, no. 4, pp. 889-898, 2013.
[4] S. R. El-Beltagy and A. Ali, "Open Issues in the Sentiment Analysis of Arabic," in IIT, 2013.
[5] H. Froud, R. Benslimane, A. Lachkar and S. A. & Ouatik, "Stemming and similarity measures for Arabic Documents Clustering," in I/V IEE ISVC, 2010.
[6] H. Froud, A. Lachkar and S. Ouatik, "Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering," International
Journal of Data Mining & Knowledge Management Process (IJDKP), 2013.
[7] D. Abuaidah, "Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents," ACM Transactions on Asian and Low-Resource Language
Information Processing, vol. 15, no. 3, p. 17, 2016.
[8] Q. Bsoul and M. Mohd, "Effect of ISRI stemming on similarity measure for Arabic document clustering," in Asia Information Retrieval Symposium, 2011.
[9] O. Ghanem and W. M. Ashour, "Stemming Effectiveness in Clustering of Arabic Documents," Journal of Computer Applications, vol. 5, no. 49, 2012.
[10] L. Larkey, L. Ballesteros and M. Connell, "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis," ACM, 2002.
[11] N. Sandhya, Y. S. Lalitha, V. Sowmya, A. D. K.. and D. A. Govardhan, "Analysis of Stemming Algorithm for Text Clustering," vol. 8, no. 5, 2011.
[12] A. Shoukry and A. Rafea, "Sentence-Level Arabic Sentiment Analysis," in International Conference on Collaboration Technologies and Systems, 2012.
[13] B. Pang and L. Lillian, "Opinion mining and sentiment analysis," Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.
[14] M. Althobaiti, U. Kruschwitz and M. Poesio, "Combining Minimally-supervised Methods for Arabic Named Entity Recognition," Transactions of the
Association for Computational Linguistics, vol. 3, pp. 243-255, 2015.
[15] M. Efron, "Information search and retrieval in microblogs," Information search and retrieval in microblogs." Journal of the American Society for Information
Science and Technology, vol. 62, no. 6, pp. 996-1008, 2011.
[16] J. Akshay, X. Song, T. Finin and B. Tseng, "Why we twitter: understanding microblogging usage and communities.," in In Proceedings of the 9th WebKDD
and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007.
[17] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5. 1988.
[18] G. Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Boston: Addison-Wesley Longman Co., 1989.
[19] A. Go, R. Bhayani and L. Huang, "Twitter Sentiment Classification using Distant Supervision," CS224N Project Report, Stanford, 2009.
[20] A. K. Jose, N. Bhatia and S. Krishna, "TWITTER SENTIMENT ANALYSIS," National Institute of Technology, Calicut, Monsoon, 2010.
[21] A. Huang, "Similarity measures for text document clustering," in Proceedings of the sixth new zealand computer science research student conference
(NZCSRSC2008), Christchurch, New Zealand, 2008.
[22] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," in KDD Workshop on Text Mining, 2000.
[23] S. Khoja and R. Garside, "Stemming Arabic text," Lancaster University, Lancaster, 1999.
[24] N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis: Corpus-based and lexicon-based," in In Proceedings of The IEEE
conference on Applied Electrical Engineering and Computing Technologies (AEECT)., 2013.
[25] K. Darwish, W. Magdy and A. Mourad, "Language processing for arabic microblog retrieval," in Proceedings of the 21st ACM international conference on
Information and knowledge management, 2012.
[26] A. Newsri, "Effective Retrieval Techniques for Arabic Text," Melbourne, Australia, 2008.
[27] E. Al-Shammari and J. Lin, "Towards an Error-Free Arabic Stemming," in Proceedings of the 2nd ACM workshop on Improving non english web searching
(iNEWS '08), New York, NY, USA, 2008A.
[28] E. Al-Shammari and J. Lin, "A novel Arabic lemmatization algorithm," in Proceedings of the second workshop on Analytics for noisy unstructured text data.
[29] A. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, p. 651–666, 2010.
[30] R. Kashef and M. S. Kamel, "Enhanced bisecting k-means clustering using intermediate cooperation," Pattern Recognition, vol. 42, no. 11, 2009.
[31] C. Manning, P. Raghavan and H. Schütze, Introduction to information retrieval, vol. 1, Cambridge: Cambridge university press, 2008.

More Related Content

Similar to Clustering Arabic Tweets for Sentiment Analysis

Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...aciijournal
 
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017kevig
 
TOP 10 Cited Computer Science & Information Technology Research Articles From...
TOP 10 Cited Computer Science & Information Technology Research Articles From...TOP 10 Cited Computer Science & Information Technology Research Articles From...
TOP 10 Cited Computer Science & Information Technology Research Articles From...AIRCC Publishing Corporation
 
New Research Articles 2020 September Issue International Journal of Software ...
New Research Articles 2020 September Issue International Journal of Software ...New Research Articles 2020 September Issue International Journal of Software ...
New Research Articles 2020 September Issue International Journal of Software ...ijseajournal
 
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management Process
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management ProcessMay 2022: Top 10 Read Articles in Data Mining & Knowledge Management Process
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management ProcessIJDKP
 
Information Technology in Industry(ITII) - November Issue 2018
Information Technology in Industry(ITII) - November Issue 2018Information Technology in Industry(ITII) - November Issue 2018
Information Technology in Industry(ITII) - November Issue 2018ITIIIndustries
 
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...Top 10 Natural Language Processing Trends in 2020 - International Journal on ...
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...kevig
 
New Research Articles 2022 January Issue International Journal of Software En...
New Research Articles 2022 January Issue International Journal of Software En...New Research Articles 2022 January Issue International Journal of Software En...
New Research Articles 2022 January Issue International Journal of Software En...ijseajournal
 
The International Journal of Network Security & Its Applications (IJNSA) -- ...
The International Journal of Network Security & Its Applications (IJNSA) --  ...The International Journal of Network Security & Its Applications (IJNSA) --  ...
The International Journal of Network Security & Its Applications (IJNSA) -- ...IJNSA Journal
 
Automatic Code Completion Exploting Semantic Similarity
Automatic Code Completion Exploting Semantic SimilarityAutomatic Code Completion Exploting Semantic Similarity
Automatic Code Completion Exploting Semantic SimilarityMasud Rahman
 
TOP READ NATURAL LANGUAGE COMPUTING ARTICLE 2020
TOP READ NATURAL LANGUAGE  COMPUTING ARTICLE 2020TOP READ NATURAL LANGUAGE  COMPUTING ARTICLE 2020
TOP READ NATURAL LANGUAGE COMPUTING ARTICLE 2020kevig
 
Trends of machine learning in 2020 - International Journal of Artificial Inte...
Trends of machine learning in 2020 - International Journal of Artificial Inte...Trends of machine learning in 2020 - International Journal of Artificial Inte...
Trends of machine learning in 2020 - International Journal of Artificial Inte...gerogepatton
 
Thirteen Years of SysML: A Systematic Mapping Study
Thirteen Years of SysML: A Systematic Mapping StudyThirteen Years of SysML: A Systematic Mapping Study
Thirteen Years of SysML: A Systematic Mapping Studyswolny
 
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfFederated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfKundjanasith Thonglek
 
Advances of neural networks in 2020
Advances of neural networks in 2020Advances of neural networks in 2020
Advances of neural networks in 2020kevig
 
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...kevig
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...Tao Xie
 

Similar to Clustering Arabic Tweets for Sentiment Analysis (20)

Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
 
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017
MOST CITED NATURAL LANGUAGECOMPUTING ARTICLESIN 2017
 
TOP 10 Cited Computer Science & Information Technology Research Articles From...
TOP 10 Cited Computer Science & Information Technology Research Articles From...TOP 10 Cited Computer Science & Information Technology Research Articles From...
TOP 10 Cited Computer Science & Information Technology Research Articles From...
 
New Research Articles 2020 September Issue International Journal of Software ...
New Research Articles 2020 September Issue International Journal of Software ...New Research Articles 2020 September Issue International Journal of Software ...
New Research Articles 2020 September Issue International Journal of Software ...
 
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management Process
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management ProcessMay 2022: Top 10 Read Articles in Data Mining & Knowledge Management Process
May 2022: Top 10 Read Articles in Data Mining & Knowledge Management Process
 
Information Technology in Industry(ITII) - November Issue 2018
Information Technology in Industry(ITII) - November Issue 2018Information Technology in Industry(ITII) - November Issue 2018
Information Technology in Industry(ITII) - November Issue 2018
 
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...Top 10 Natural Language Processing Trends in 2020 - International Journal on ...
Top 10 Natural Language Processing Trends in 2020 - International Journal on ...
 
New Research Articles 2022 January Issue International Journal of Software En...
New Research Articles 2022 January Issue International Journal of Software En...New Research Articles 2022 January Issue International Journal of Software En...
New Research Articles 2022 January Issue International Journal of Software En...
 
The International Journal of Network Security & Its Applications (IJNSA) -- ...
The International Journal of Network Security & Its Applications (IJNSA) --  ...The International Journal of Network Security & Its Applications (IJNSA) --  ...
The International Journal of Network Security & Its Applications (IJNSA) -- ...
 
Automatic Code Completion Exploting Semantic Similarity
Automatic Code Completion Exploting Semantic SimilarityAutomatic Code Completion Exploting Semantic Similarity
Automatic Code Completion Exploting Semantic Similarity
 
TOP READ NATURAL LANGUAGE COMPUTING ARTICLE 2020
TOP READ NATURAL LANGUAGE  COMPUTING ARTICLE 2020TOP READ NATURAL LANGUAGE  COMPUTING ARTICLE 2020
TOP READ NATURAL LANGUAGE COMPUTING ARTICLE 2020
 
Trends of machine learning in 2020 - International Journal of Artificial Inte...
Trends of machine learning in 2020 - International Journal of Artificial Inte...Trends of machine learning in 2020 - International Journal of Artificial Inte...
Trends of machine learning in 2020 - International Journal of Artificial Inte...
 
CV of Md. Nasfikur R. Khan
CV of Md. Nasfikur R. KhanCV of Md. Nasfikur R. Khan
CV of Md. Nasfikur R. Khan
 
2015_FIT_Talk.pptx
2015_FIT_Talk.pptx2015_FIT_Talk.pptx
2015_FIT_Talk.pptx
 
Thirteen Years of SysML: A Systematic Mapping Study
Thirteen Years of SysML: A Systematic Mapping StudyThirteen Years of SysML: A Systematic Mapping Study
Thirteen Years of SysML: A Systematic Mapping Study
 
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfFederated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
 
Observlets
Observlets Observlets
Observlets
 
Advances of neural networks in 2020
Advances of neural networks in 2020Advances of neural networks in 2020
Advances of neural networks in 2020
 
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
 

More from Mustafa Jarrar

Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal OntologyMustafa Jarrar
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process ImplementationMustafa Jarrar
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineeringMustafa Jarrar
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsMustafa Jarrar
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs Mustafa Jarrar
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process ManagementMustafa Jarrar
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology Mustafa Jarrar
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesMustafa Jarrar
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORMMustafa Jarrar
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineMustafa Jarrar
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesMustafa Jarrar
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalMustafa Jarrar
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsMustafa Jarrar
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingMustafa Jarrar
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsMustafa Jarrar
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Mustafa Jarrar
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql ProjectMustafa Jarrar
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringMustafa Jarrar
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesMustafa Jarrar
 

More from Mustafa Jarrar (20)

Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal Ontology
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process Implementation
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineering
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical Constructs
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process Management
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion Rules
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORM
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in Palestine
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online Courses
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-final
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 Calls
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language Processing
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing Ontologies
 

Recently uploaded

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 

Recently uploaded (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 

Clustering Arabic Tweets for Sentiment Analysis

  • 1. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 1 Clustering Arabic Tweets for Sentiment Analysis Diab Abuaiadah Diab.Abuaiadah@Wintec.ac.nz Waikato Institute of Technology New Zealand Dileep Rajendran Dileep.Rajendran@Wintec.ac.nz Waikato Institute of Technology New Zealand Mustafa Jarrar mjarrar@birzeit.edu Birzeit University Palestine
  • 2. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 2 This talk is based on: Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar: Clustering Arabic Tweets for Sentiment Analysis. Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. IEEE Computer Society. DOI 10.1109/AICCSA.2017.162 http://www.jarrar.info/publications/ARJ17.pdf
  • 3. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 3 Characteristics and challenges of Arabic Tweets  Short text: “Less statistical data” implies poor result, thus less effective information retrieval algorithms.  Dialectal language: vast geographical area which includes the Middle east and north Africa.  Informal language: does not follow language grammar and rules.  Different regions have different dialects.
  • 4. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 4 Importance of extracting knowledge from Arabic tweets  Social and political sentiments: measures the popularity of political and/or social policies.  Marketing products : extracting positive and negative opinion, discovering recent trends and detecting product popularity.  Sarcastic and non-Sarcastic tweets: also important for detecting sentiments.
  • 5. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 5 What preprocessing steps are needed?  Removing stop words!  In many applications removing stop words may improve the performance  In sentiment analysis removing stop might result in loosing valuable information. For example: keeping negative terms such as “no” can be useful to indicate negative sentiments  Lemmatization!  More challenging with informal and short text  May produce high errors ratio  Stemming!  Also has high error ratio for informal short text
  • 6. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 6 Similarity Functions  Most information retrieval algorithms employ a similarity (or distance) function to measure the similarity between :  two documents (Tweets)  or a document (Tweet) and a group of documents (Tweets)  There are five commonly used similarity functions in natural language processing applications: Euclidian, Cosine, Pearson, Jaccard and averaged Kullback-Leibler divergence (KLD) The goal of this paper is to evaluate and compare these functions in clustering Arabic tweets for sentiment analysis
  • 7. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 7 The similarity functions
  • 8. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 8 Our Datasets and Preprocessing  An open dataset with 2000 tweets (Jordan dialect) N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis: Corpus-based and lexicon- based," in In Proceedings of The IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT)., 2013.  Other datasets were available but the labeling was immature at the time of writing the paper.  Preprocessing includes removing stop words, Light10 and Khoja stemmers.  Clustering using the K-Means (and Bisect K-Means) algorithms.
  • 9. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 9 Implementation  Khoja’s stemmer: the Java code was taken from the author home page. http://zeus.cs.pacificu.edu/shereen/research.htm  Light10 and removing stop words: in-house Java code. The implementation is straightforward as it simply strip off a predefined fixed set of prefixes and suffixes.  Clustering and Bisect clustering: in-house Java code using Hashtable to represent TF-IDF of each document. (TF: Term Frequency, IDF: Inverse Document Frequency)  Purities and entropies: in-house Java code. To measure the quality of the algorithm’s results
  • 10. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 10 Testing the code  The code for the similarity functions, K-Means clustering, purity and entropy measures were tested by applying it to a well-known dataset 20 Newsgroup dataset: http://csmining.org/index.php/id-20-newsgroups.html  The purities and entropies results (when applying the K- Means) is comparable to that of Huang (2008): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf
  • 11. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 11 Purities: K-means algorithm Raw: no pre-processing NoSW: Only removed stop words (the known list) Light10: Removed stop words + Light10 Root: Removed stop words + Khoja’s - Each test was repeated 50 times, and the average is used. - The Euclidian was bad
  • 12. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 12 Purities for K-Means  KLD clearly outperformed the Cosine similarity function  KLD outperformed the Cosine similarity function  This is of interest as the Cosine similarity function is commonly used in many products and published papers.  The Cosine function outperforms KLD for normal-sized texts, but for tweets, as our experiment is showing.  Kohja's stemmer (Root) clearly outperforms the Light10 stemmer  However, Light10 stemmer outperforms Kohja’s stemmer for normal- sized texts
  • 13. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 13 Purities: optimized K-Means  The optimized K-Means (Bisect K-Means) improves the results compared to that of K-Means (68% to 76%).  KLD and Kohja’s stemmer retained their superiority compared to other similarity functions and other processing.
  • 14. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 14 Future Research  Lemmatization: could be more effective than stemmers  Are other clustering algorithms more effective than K- Means? Hierarchical algorithms produce better clustering results but have higher running time  N-gram preprocessing is an interesting approach that could lead to higher accuracies
  • 15. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 15 Thank you Questions? Questions?
  • 16. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 16 References [1] S. Khoja, 2016. [Online]. Available: http://zeus.cs.pacificu.edu/shereen/research.htm. [Accessed 01 August 2016]. [2] R. M. Duwairi, R. Marji, N. Sha'ban and S. Rushaidat, "Sentiment Analysis in Arabic Tweets,”. Proceedings of ICICS 2014. [3] E. Zangerle, G. Wolfgang and G. Specht, "On the impact of text similarity functions on hashtag recommendations in microblogging environments," Social Network Analysis and Mining, vol. 3, no. 4, pp. 889-898, 2013. [4] S. R. El-Beltagy and A. Ali, "Open Issues in the Sentiment Analysis of Arabic," in IIT, 2013. [5] H. Froud, R. Benslimane, A. Lachkar and S. A. & Ouatik, "Stemming and similarity measures for Arabic Documents Clustering," in I/V IEE ISVC, 2010. [6] H. Froud, A. Lachkar and S. Ouatik, "Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering," International Journal of Data Mining & Knowledge Management Process (IJDKP), 2013. [7] D. Abuaidah, "Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents," ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 15, no. 3, p. 17, 2016. [8] Q. Bsoul and M. Mohd, "Effect of ISRI stemming on similarity measure for Arabic document clustering," in Asia Information Retrieval Symposium, 2011. [9] O. Ghanem and W. M. Ashour, "Stemming Effectiveness in Clustering of Arabic Documents," Journal of Computer Applications, vol. 5, no. 49, 2012. [10] L. Larkey, L. Ballesteros and M. Connell, "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis," ACM, 2002. [11] N. Sandhya, Y. S. Lalitha, V. Sowmya, A. D. K.. and D. A. Govardhan, "Analysis of Stemming Algorithm for Text Clustering," vol. 8, no. 5, 2011. [12] A. Shoukry and A. Rafea, "Sentence-Level Arabic Sentiment Analysis," in International Conference on Collaboration Technologies and Systems, 2012. [13] B. Pang and L. Lillian, "Opinion mining and sentiment analysis," Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1-135, 2008. [14] M. Althobaiti, U. Kruschwitz and M. Poesio, "Combining Minimally-supervised Methods for Arabic Named Entity Recognition," Transactions of the Association for Computational Linguistics, vol. 3, pp. 243-255, 2015. [15] M. Efron, "Information search and retrieval in microblogs," Information search and retrieval in microblogs." Journal of the American Society for Information Science and Technology, vol. 62, no. 6, pp. 996-1008, 2011. [16] J. Akshay, X. Song, T. Finin and B. Tseng, "Why we twitter: understanding microblogging usage and communities.," in In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007. [17] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5. 1988. [18] G. Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Boston: Addison-Wesley Longman Co., 1989. [19] A. Go, R. Bhayani and L. Huang, "Twitter Sentiment Classification using Distant Supervision," CS224N Project Report, Stanford, 2009. [20] A. K. Jose, N. Bhatia and S. Krishna, "TWITTER SENTIMENT ANALYSIS," National Institute of Technology, Calicut, Monsoon, 2010. [21] A. Huang, "Similarity measures for text document clustering," in Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, 2008. [22] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," in KDD Workshop on Text Mining, 2000. [23] S. Khoja and R. Garside, "Stemming Arabic text," Lancaster University, Lancaster, 1999. [24] N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis: Corpus-based and lexicon-based," in In Proceedings of The IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT)., 2013. [25] K. Darwish, W. Magdy and A. Mourad, "Language processing for arabic microblog retrieval," in Proceedings of the 21st ACM international conference on Information and knowledge management, 2012. [26] A. Newsri, "Effective Retrieval Techniques for Arabic Text," Melbourne, Australia, 2008. [27] E. Al-Shammari and J. Lin, "Towards an Error-Free Arabic Stemming," in Proceedings of the 2nd ACM workshop on Improving non english web searching (iNEWS '08), New York, NY, USA, 2008A. [28] E. Al-Shammari and J. Lin, "A novel Arabic lemmatization algorithm," in Proceedings of the second workshop on Analytics for noisy unstructured text data. [29] A. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, p. 651–666, 2010. [30] R. Kashef and M. S. Kamel, "Enhanced bisecting k-means clustering using intermediate cooperation," Pattern Recognition, vol. 42, no. 11, 2009. [31] C. Manning, P. Raghavan and H. Schütze, Introduction to information retrieval, vol. 1, Cambridge: Cambridge university press, 2008.

Editor's Notes

  1. Sarcastic = ساخر
  2. Clustering into two classes (positive , Negative) .
  3. Purities and entropies How much
  4. Each test
  5. Optimized K-Means = Bisect K-Means