Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 1
Clustering Arabic Tweets
...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 2
This talk is based on:
Di...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 3
Characteristics and chall...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 4
Importance of extracting ...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 5
What preprocessing steps ...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 6
Similarity Functions
 Mo...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 7
The similarity functions
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 8
Our Datasets and Preproce...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 9
Implementation
 Khoja’s ...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 10
Testing the code
 The c...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 11
Purities: K-means algori...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 12
Purities for K-Means
 K...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 13
Purities: optimized K-Me...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 14
Future Research
 Lemmat...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 15
Thank you
Questions?
Que...
14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 16
References
[1] S. Khoja,...
Upcoming SlideShare
Loading in …5
×

Clustering Arabic Tweets for Sentiment Analysis

343 views

Published on

Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar: Clustering Arabic Tweets for Sentiment Analysis. Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. IEEE Computer Society. DOI 10.1109/AICCSA.2017.162

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Clustering Arabic Tweets for Sentiment Analysis

  1. 1. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 1 Clustering Arabic Tweets for Sentiment Analysis Diab Abuaiadah Diab.Abuaiadah@Wintec.ac.nz Waikato Institute of Technology New Zealand Dileep Rajendran Dileep.Rajendran@Wintec.ac.nz Waikato Institute of Technology New Zealand Mustafa Jarrar mjarrar@birzeit.edu Birzeit University Palestine
  2. 2. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 2 This talk is based on: Diab Abuaiadah, Dileep Rajendran, Mustafa Jarrar: Clustering Arabic Tweets for Sentiment Analysis. Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. IEEE Computer Society. DOI 10.1109/AICCSA.2017.162 http://www.jarrar.info/publications/ARJ17.pdf
  3. 3. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 3 Characteristics and challenges of Arabic Tweets  Short text: “Less statistical data” implies poor result, thus less effective information retrieval algorithms.  Dialectal language: vast geographical area which includes the Middle east and north Africa.  Informal language: does not follow language grammar and rules.  Different regions have different dialects.
  4. 4. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 4 Importance of extracting knowledge from Arabic tweets  Social and political sentiments: measures the popularity of political and/or social policies.  Marketing products : extracting positive and negative opinion, discovering recent trends and detecting product popularity.  Sarcastic and non-Sarcastic tweets: also important for detecting sentiments.
  5. 5. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 5 What preprocessing steps are needed?  Removing stop words!  In many applications removing stop words may improve the performance  In sentiment analysis removing stop might result in loosing valuable information. For example: keeping negative terms such as “no” can be useful to indicate negative sentiments  Lemmatization!  More challenging with informal and short text  May produce high errors ratio  Stemming!  Also has high error ratio for informal short text
  6. 6. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 6 Similarity Functions  Most information retrieval algorithms employ a similarity (or distance) function to measure the similarity between :  two documents (Tweets)  or a document (Tweet) and a group of documents (Tweets)  There are five commonly used similarity functions in natural language processing applications: Euclidian, Cosine, Pearson, Jaccard and averaged Kullback-Leibler divergence (KLD) The goal of this paper is to evaluate and compare these functions in clustering Arabic tweets for sentiment analysis
  7. 7. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 7 The similarity functions
  8. 8. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 8 Our Datasets and Preprocessing  An open dataset with 2000 tweets (Jordan dialect) N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis: Corpus-based and lexicon- based," in In Proceedings of The IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT)., 2013.  Other datasets were available but the labeling was immature at the time of writing the paper.  Preprocessing includes removing stop words, Light10 and Khoja stemmers.  Clustering using the K-Means (and Bisect K-Means) algorithms.
  9. 9. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 9 Implementation  Khoja’s stemmer: the Java code was taken from the author home page. http://zeus.cs.pacificu.edu/shereen/research.htm  Light10 and removing stop words: in-house Java code. The implementation is straightforward as it simply strip off a predefined fixed set of prefixes and suffixes.  Clustering and Bisect clustering: in-house Java code using Hashtable to represent TF-IDF of each document. (TF: Term Frequency, IDF: Inverse Document Frequency)  Purities and entropies: in-house Java code. To measure the quality of the algorithm’s results
  10. 10. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 10 Testing the code  The code for the similarity functions, K-Means clustering, purity and entropy measures were tested by applying it to a well-known dataset 20 Newsgroup dataset: http://csmining.org/index.php/id-20-newsgroups.html  The purities and entropies results (when applying the K- Means) is comparable to that of Huang (2008): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf
  11. 11. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 11 Purities: K-means algorithm Raw: no pre-processing NoSW: Only removed stop words (the known list) Light10: Removed stop words + Light10 Root: Removed stop words + Khoja’s - Each test was repeated 50 times, and the average is used. - The Euclidian was bad
  12. 12. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 12 Purities for K-Means  KLD clearly outperformed the Cosine similarity function  KLD outperformed the Cosine similarity function  This is of interest as the Cosine similarity function is commonly used in many products and published papers.  The Cosine function outperforms KLD for normal-sized texts, but for tweets, as our experiment is showing.  Kohja's stemmer (Root) clearly outperforms the Light10 stemmer  However, Light10 stemmer outperforms Kohja’s stemmer for normal- sized texts
  13. 13. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 13 Purities: optimized K-Means  The optimized K-Means (Bisect K-Means) improves the results compared to that of K-Means (68% to 76%).  KLD and Kohja’s stemmer retained their superiority compared to other similarity functions and other processing.
  14. 14. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 14 Future Research  Lemmatization: could be more effective than stemmers  Are other clustering algorithms more effective than K- Means? Hierarchical algorithms produce better clustering results but have higher running time  N-gram preprocessing is an interesting approach that could lead to higher accuracies
  15. 15. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 15 Thank you Questions? Questions?
  16. 16. 14th International Conference on Computer Systems and Application (AICCSA’17), 30 October 2017 16 References [1] S. Khoja, 2016. [Online]. Available: http://zeus.cs.pacificu.edu/shereen/research.htm. [Accessed 01 August 2016]. [2] R. M. Duwairi, R. Marji, N. Sha'ban and S. Rushaidat, "Sentiment Analysis in Arabic Tweets,”. Proceedings of ICICS 2014. [3] E. Zangerle, G. Wolfgang and G. Specht, "On the impact of text similarity functions on hashtag recommendations in microblogging environments," Social Network Analysis and Mining, vol. 3, no. 4, pp. 889-898, 2013. [4] S. R. El-Beltagy and A. Ali, "Open Issues in the Sentiment Analysis of Arabic," in IIT, 2013. [5] H. Froud, R. Benslimane, A. Lachkar and S. A. & Ouatik, "Stemming and similarity measures for Arabic Documents Clustering," in I/V IEE ISVC, 2010. [6] H. Froud, A. Lachkar and S. Ouatik, "Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering," International Journal of Data Mining & Knowledge Management Process (IJDKP), 2013. [7] D. Abuaidah, "Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents," ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 15, no. 3, p. 17, 2016. [8] Q. Bsoul and M. Mohd, "Effect of ISRI stemming on similarity measure for Arabic document clustering," in Asia Information Retrieval Symposium, 2011. [9] O. Ghanem and W. M. Ashour, "Stemming Effectiveness in Clustering of Arabic Documents," Journal of Computer Applications, vol. 5, no. 49, 2012. [10] L. Larkey, L. Ballesteros and M. Connell, "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis," ACM, 2002. [11] N. Sandhya, Y. S. Lalitha, V. Sowmya, A. D. K.. and D. A. Govardhan, "Analysis of Stemming Algorithm for Text Clustering," vol. 8, no. 5, 2011. [12] A. Shoukry and A. Rafea, "Sentence-Level Arabic Sentiment Analysis," in International Conference on Collaboration Technologies and Systems, 2012. [13] B. Pang and L. Lillian, "Opinion mining and sentiment analysis," Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1-135, 2008. [14] M. Althobaiti, U. Kruschwitz and M. Poesio, "Combining Minimally-supervised Methods for Arabic Named Entity Recognition," Transactions of the Association for Computational Linguistics, vol. 3, pp. 243-255, 2015. [15] M. Efron, "Information search and retrieval in microblogs," Information search and retrieval in microblogs." Journal of the American Society for Information Science and Technology, vol. 62, no. 6, pp. 996-1008, 2011. [16] J. Akshay, X. Song, T. Finin and B. Tseng, "Why we twitter: understanding microblogging usage and communities.," in In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007. [17] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5. 1988. [18] G. Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Boston: Addison-Wesley Longman Co., 1989. [19] A. Go, R. Bhayani and L. Huang, "Twitter Sentiment Classification using Distant Supervision," CS224N Project Report, Stanford, 2009. [20] A. K. Jose, N. Bhatia and S. Krishna, "TWITTER SENTIMENT ANALYSIS," National Institute of Technology, Calicut, Monsoon, 2010. [21] A. Huang, "Similarity measures for text document clustering," in Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, 2008. [22] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," in KDD Workshop on Text Mining, 2000. [23] S. Khoja and R. Garside, "Stemming Arabic text," Lancaster University, Lancaster, 1999. [24] N. Abdulla, N. Mahyoub, M. Shehab and N. Al-Ayyoub, "Arabic sentiment analysis: Corpus-based and lexicon-based," in In Proceedings of The IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT)., 2013. [25] K. Darwish, W. Magdy and A. Mourad, "Language processing for arabic microblog retrieval," in Proceedings of the 21st ACM international conference on Information and knowledge management, 2012. [26] A. Newsri, "Effective Retrieval Techniques for Arabic Text," Melbourne, Australia, 2008. [27] E. Al-Shammari and J. Lin, "Towards an Error-Free Arabic Stemming," in Proceedings of the 2nd ACM workshop on Improving non english web searching (iNEWS '08), New York, NY, USA, 2008A. [28] E. Al-Shammari and J. Lin, "A novel Arabic lemmatization algorithm," in Proceedings of the second workshop on Analytics for noisy unstructured text data. [29] A. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, p. 651–666, 2010. [30] R. Kashef and M. S. Kamel, "Enhanced bisecting k-means clustering using intermediate cooperation," Pattern Recognition, vol. 42, no. 11, 2009. [31] C. Manning, P. Raghavan and H. Schütze, Introduction to information retrieval, vol. 1, Cambridge: Cambridge university press, 2008.

×