Alleviating Data Sparsity for Twitter Sentiment Analysis



Twitter has recently attracted much attention as a research topic in sentiment analysis. Training sentiment classifiers on tweet data often runs into the data sparsity problem, partly because the 140-character limit produces a large variety of short and irregular word forms. In this work we propose two different sets of features to alleviate data sparsity. The first is the semantic feature set, where we extract semantically hidden concepts from tweets and incorporate them into classifier training through interpolation. The second is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets, then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result. Using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.

Published in: Technology, Education

  1. 1. Alleviating Data Sparsity for Twitter Sentiment Analysis. Hassan Saif, Yulan He & Harith Alani, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. Making Sense of Microblogs, WWW2012 Conference, Lyon, France
  2. 2. Outline • Hello World • Motivation • Related Work • Semantic Features • Topic-Sentiment Features • Evaluation • Demos • The Future
  3. 3. Sentiment Analysis: “Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text.” Examples: “The main dish was delicious” (Opinion); “It is a Syrian dish” (Fact); “The main dish was salty and horrible” (Opinion).
  4. 4. Microblogging • Service which allows subscribers to post short updates online and broadcast them • Answers the question: What are you doing now? • Twitter, Plurk, sfeed, Yammer, BlueTwi, etc.
  5. 5. Together!
  6. 6. Sense?
  7. 7. Sense? [Slide lists related work: Agarwal et al.; Barbosa and Feng; Bifet and Frank; Diakopoulos and Shamma; Go et al.; He & Saif; Pak & Paroubek; and many others. Chart: number of tweets in the UK General Elections Corpus by category (Objective Tweets, Subjective Tweets, Others); bar values recoverable from the slide: 1,143,562 and 713,178.]
  8. 8. Why? • Private Sectors • Because It Is Critical • Keep In Touch • Public Sectors
  9. 9. Related Work
  10. 10. Sentiment Analysis • Lexical-Based Approach: Word Polarity, Building Better Dictionary • Machine Learning Approach: Text Classification Problem, Right Features
  11. 11. Twitter Sentiment Analysis: Challenges – The short length of status updates – Language variations – Open social environment
  12. 12. Twitter Sentiment Analysis: Related Work • Distant Supervision – Supervised classifiers trained from noisy labels – Tweets are labeled using emoticons – Data filtering process. Go et al. (2009); Barbosa and Feng (2010); Pak and Paroubek (2010)
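
The emoticon-based labeling described on this slide can be sketched roughly as follows. This is a minimal illustration, not the procedure of any cited paper: the emoticon sets, the function name `emoticon_label`, and the rule of discarding tweets with mixed or missing emoticons are all assumptions made for the example.

```python
# Hypothetical emoticon sets; the slides do not say which emoticons were used.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", "=("}


def emoticon_label(tweet):
    """Return (label, cleaned_text), or None when the emoticon signal is missing or mixed."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all or both polarities: filter the tweet out.
        return None
    label = "positive" if has_pos else "negative"
    # Strip the emoticons so the classifier cannot simply memorize the noisy label.
    cleaned = " ".join(t for t in tokens if t not in POSITIVE and t not in NEGATIVE)
    return label, cleaned


tweets = ["Loving the new phone :)", "Missed my train :(", "Heading to work"]
labeled = [result for result in map(emoticon_label, tweets) if result is not None]
```

Removing the emoticons after labeling is a design choice in this sketch: it keeps the label source out of the feature space.
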
  13. 13. Twitter Sentiment Analysis: Related Work • Follower Graph & Label Propagation – Twitter follower graph (users, tweets, unigrams and hashtags) – Start with a small number of labeled tweets – Apply a label propagation method throughout the graph. Speriosu et al. (2011)
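
To make the idea of label propagation concrete, here is a simplified sketch over a toy graph. It is not the algorithm from the cited work: the averaging update, the clamped seeds, the function name `propagate_labels`, and the toy nodes are assumptions chosen for illustration.

```python
def propagate_labels(edges, seeds, iterations=10):
    """Spread sentiment scores in [-1, 1] from seed nodes to their neighbours.

    edges: dict mapping every node to a list of neighbouring nodes.
    seeds: dict mapping the few labeled nodes to a fixed score.
    """
    scores = {node: seeds.get(node, 0.0) for node in edges}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in edges.items():
            if node in seeds:
                updated[node] = seeds[node]  # seed labels stay clamped
            elif neighbours:
                updated[node] = sum(scores[n] for n in neighbours) / len(neighbours)
            else:
                updated[node] = scores[node]
        scores = updated
    return scores


# Toy graph linking tweets to a hashtag and to users (all names hypothetical).
edges = {
    "tweet1": ["#win", "user_a"],
    "tweet2": ["#win"],
    "tweet3": ["user_b"],
    "tweet4": ["user_b"],
    "#win": ["tweet1", "tweet2"],
    "user_a": ["tweet1"],
    "user_b": ["tweet3", "tweet4"],
}
seeds = {"tweet1": 1.0, "tweet3": -1.0}
scores = propagate_labels(edges, seeds)
```

In this toy run the unlabeled `tweet2` drifts positive through the shared hashtag and `tweet4` drifts negative through the shared user.
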
  14. 14. Twitter Sentiment Analysis: Related Work • Feature Engineering – Unigrams, bigrams, POS – Microblogging features • Hashtags • Emoticons • Abbreviations & Intensifiers. Agarwal et al. (2011); Kouloumpis et al. (2011)
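
A rough sketch of how such features could be extracted from one tweet is shown below. The regular expressions, the feature naming scheme, and the use of elongated words as an intensifier cue are assumptions for this example; POS tags are left out because they would require an external tagger.

```python
import re

# Simple illustrative patterns, not a complete inventory.
EMOTICON = re.compile(r"[:;=]-?[()DPp]")
HASHTAG = re.compile(r"#\w+")
ELONGATED = re.compile(r"(\w)\1{2,}")  # e.g. "soooo", treated here as an intensifier cue


def extract_features(tweet):
    """Map a tweet to a sparse dict of binary features."""
    words = re.findall(r"[a-z0-9']+", tweet.lower())
    features = {}
    for word in words:
        features["uni=" + word] = 1
    for first, second in zip(words, words[1:]):
        features["bi=" + first + "_" + second] = 1
    for tag in HASHTAG.findall(tweet.lower()):
        features["hashtag=" + tag] = 1
    for emoticon in EMOTICON.findall(tweet):
        features["emoticon=" + emoticon] = 1
    if ELONGATED.search(tweet):
        features["has_elongation"] = 1
    return features


features = extract_features("Soooo happy #win :)")
```
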
  15. 15. So?
  16. 16. What Does sparsity mean? Training data contains many infrequent terms
  17. 17. What Does sparsity mean? Word frequency statistics
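
One simple way to quantify this kind of sparsity is the share of vocabulary terms that occur only once or a handful of times. The sketch below assumes whitespace tokenization and a hypothetical helper name `rare_term_share`; neither comes from the slides.

```python
from collections import Counter


def rare_term_share(tweets, max_count=1):
    """Fraction of distinct terms that occur at most max_count times in the corpus."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    if not counts:
        return 0.0
    rare = sum(1 for count in counts.values() if count <= max_count)
    return rare / len(counts)


# Irregular spellings ("nite", "gud", "2day") each appear once and inflate the vocabulary.
toy_corpus = ["good day", "good nite", "gud day 2day"]
share = rare_term_share(toy_corpus)
```

In the toy corpus, three of the five distinct terms occur once, so the share is 0.6.
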
  18. 18. How! • Semantic Features: extract semantically hidden concepts from tweet data and then incorporate them into supervised classifier training by interpolation • Sentiment-Topic Features: extract latent topics and the associated topic sentiment from the tweet data, which are subsequently added into the original feature space for supervised classifier training
  19. 19. Semantic Features: Shallow Semantic Method. Examples: “@Stace_meister Ya, I have Rugby in an hour” (Sport); “Sushi time for fabulous Jesses last day on dragons den” (Person); “Dear eBay, if I win I owe you a total 580.63 bye paycheck” (Company)
  20. 20. Semantic Features: Interpolation Method
  21. 21. Topic-Sentiment Features: Joint Sentiment-Topic (JST) Model. JST is a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text. The only supervision is word prior polarity information, which can be obtained from the MPQA subjectivity lexicon (Lin & He, 2009).
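
The evaluation slides name "semantic replacement" and "semantic augmentation" without defining them. The sketch below shows one plausible reading, assuming replacement swaps an entity for its concept and augmentation keeps the entity and appends the concept. The `CONCEPTS` lookup is a toy stand-in for whatever entity extraction method was actually used.

```python
# Hypothetical entity-to-concept lookup built from the examples on this slide.
CONCEPTS = {"rugby": "SPORT", "ebay": "COMPANY"}


def semantic_replacement(tokens):
    """Replace each recognized entity with its concept label."""
    return [CONCEPTS.get(token.lower(), token) for token in tokens]


def semantic_augmentation(tokens):
    """Keep the original tokens and append the concept labels as extra features."""
    extra = [CONCEPTS[token.lower()] for token in tokens if token.lower() in CONCEPTS]
    return tokens + extra


tokens = ["I", "have", "Rugby", "in", "an", "hour"]
replaced = semantic_replacement(tokens)
augmented = semantic_augmentation(tokens)
```

Mapping many rare entity names onto a few concept labels is what reduces sparsity here: tweets about different sports or companies end up sharing a feature.
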
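
The slide does not give the interpolation formula. One plausible form, stated here as an assumption, mixes the unigram estimate P(w|c) with a concept-based estimate, the sum over concepts s of P(w|s)·P(s|c), using a weight `alpha`. The function and argument names below are hypothetical.

```python
def interpolate(p_word_given_class, p_word_given_concept, p_concept_given_class, alpha=0.5):
    """Blend a unigram probability with a concept-based probability.

    p_word_given_class: unigram estimate P(w | c) for one word and one class.
    p_word_given_concept: dict concept -> P(w | s) for the same word.
    p_concept_given_class: dict concept -> P(s | c) for the same class.
    alpha: assumed interpolation weight in [0, 1] given to the unigram estimate.
    """
    semantic = sum(
        p_word_given_concept.get(concept, 0.0) * p_concept
        for concept, p_concept in p_concept_given_class.items()
    )
    return alpha * p_word_given_class + (1 - alpha) * semantic


# A word never seen with this class still receives probability mass via its concept.
smoothed = interpolate(
    p_word_given_class=0.0,
    p_word_given_concept={"SPORT": 0.2},
    p_concept_given_class={"SPORT": 0.5, "COMPANY": 0.5},
    alpha=0.6,
)
```

With these toy numbers the concept-based term is 0.2 × 0.5 = 0.1, so the smoothed probability is 0.4 × 0.1 = 0.04 rather than zero.
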
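
Word prior polarity can be encoded in several ways. The sketch below shows one common option, assumed for illustration rather than taken from the slides: a per-word mask over sentiment labels that restricts lexicon words to their prior polarity and leaves all other words unconstrained. `LEXICON` is a two-word toy stand-in for MPQA entries.

```python
SENTIMENTS = ["positive", "negative"]
# Toy stand-in for lexicon entries; a real lexicon would be loaded from file.
LEXICON = {"delicious": "positive", "horrible": "negative"}


def prior_mask(vocab, lexicon=LEXICON, sentiments=SENTIMENTS):
    """Return word -> list of weights, one per sentiment label.

    Lexicon words get weight 1.0 only for their prior polarity;
    words outside the lexicon get 1.0 for every label.
    """
    mask = {}
    for word in vocab:
        polarity = lexicon.get(word)
        if polarity is None:
            mask[word] = [1.0] * len(sentiments)
        else:
            mask[word] = [1.0 if label == polarity else 0.0 for label in sentiments]
    return mask


mask = prior_mask(["delicious", "dish", "horrible"])
```

A mask like this could be multiplied into the word priors of a topic model so that "delicious" is never assigned to the negative sentiment label.
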
  22. 22. Twitter Sentiment Corpus (Stanford University) • Collected: between the 6th of April and the 25th of June 2009 • Training Set: 1.6 million tweets (balanced) • Testing Set: 177 negative tweets & 182 positive tweets
  23. 23. Our Sentiment Corpus • Training Set: 60K tweets • Testing Set: 1000 tweets • Annotated additional 640 tweets using Tweenator
  24. 24. Evaluation: Extended Test Set. Sentiment classification results on the 1000-tweet test set:
      Method                      Accuracy
      Unigrams                    80.7%
      Semantic replacement        76.3%
      Semantic interpolation      84.0%
      Sentiment-topic features    82.3%
  25. 25. Evaluation: Stanford Dataset. Sentiment classification results on the original Stanford Twitter Sentiment test set:
      Method                      Accuracy
      Unigrams                    81.0%
      Semantic replacement        77.3%
      Semantic augmentation       80.45%
      Semantic interpolation      84.1%
      Sentiment-topic features    86.3%
      (Go et al., 2009)           83%
      (Speriosu et al., 2011)     84.7%
  26. 26. Evaluation: Semantic Features vs. Sentiment-Topic Features
  27. 27. Tweenator
  28. 28. Conclusion • Twitter sentiment analysis faces a data sparsity problem due to some special characteristics of Twitter • Semantic and topic-sentiment features reduce the sparsity problem and increase performance significantly • Sentiment-topic features should be preferred over semantic features for the sentiment classification task since they give much better results with far fewer features.
  29. 29. The Future
  30. 30. Future Work • Sentiment-Topic Model • Enhance Entity Extraction Methods • Attaching Weight to Extracted Features • Semantic Smoothing Model • Statistical Replacement
  31. 31. References
      [1] AGARWAL, A., XIE, B., VOVSHA, I., RAMBOW, O., AND PASSONNEAU, R. Sentiment analysis of twitter data. In Proceedings of the ACL 2011 Workshop on Languages in Social Media (2011), pp. 30–38.
      [2] BARBOSA, L., AND FENG, J. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of COLING (2010), pp. 36–44.
      [3] BIFET, A., AND FRANK, E. Sentiment knowledge discovery in twitter streaming data. In Discovery Science (2010), Springer, pp. 1–15.
      [4] GO, A., BHAYANI, R., AND HUANG, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009).
      [5] KOULOUMPIS, E., WILSON, T., AND MOORE, J. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the ICWSM (2011).
      [6] LIN, C., AND HE, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009), ACM, pp. 375–384.
      [7] PAK, A., AND PAROUBEK, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010 (2010).
      [8] SAIF, H., HE, Y., AND ALANI, H. Semantic Smoothing for Twitter Sentiment Analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC) (2011).
  32. 32. Thank You