Alleviating Data Sparsity for Twitter Sentiment Analysis


Twitter has attracted much attention recently as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweets data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced to tweets because of the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparseness problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets, then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result. Using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.

  • Can we extract sentiment from microblogs? Several studies report that users express more opinions on microblogs, and that customers are less controlled in their emotions when interacting in a social environment. So yes, we can do sentiment analysis over microblogs.
  • Now, why do we need to do it? Millions of status updates and tweet messages are shared every day among users. For the public sector: social data can be used to predict public political opinion, which makes it easier for policy makers to shape political strategies. For the private sector: companies can track public opinion about their products and services.
  • Two main approaches. Lexicon-based approach: building a better dictionary; focuses on words and phrases as bearers of semantic orientation. Machine learning approach: finding the right features; yet another form of text classification.
  • To overcome the noisy label data
  • Previous work tried to overcome the sparsity problem partially and indirectly, either by applying some data filtering or by depending on different sets of microblog features. However, none of them studied the effect of data sparsity on classification performance.
  • Why: many words and terms in the testing corpus do not appear in the training corpus.
  • Compares the word frequency statistics of the tweets data we used in our experiments with the movie review data. The x-axis shows the word frequency interval, e.g., words occurring up to 10 times (1-10), more than 10 but up to 20 times (10-20), etc. The y-axis shows the percentage of words falling within each interval. It can be observed that the tweets data are sparser than the movie review data, since the former contain more infrequent words, with 93% of the words in the tweets data occurring fewer than 10 times (cf. 78% in the movie review data).

    1. Alleviating Data Sparsity For Twitter Sentiment Analysis. Hassan Saif, Yulan He & Harith Alani. Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. Making Sense of Microblogs – WWW2012 Conference, Lyon, France
    2. Outline • Hello World • Motivation • Related Work • Semantic Features • Topic-Sentiment Features • Evaluation • Demos • The Future
    3. Sentiment Analysis. "Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text." Examples: "The main dish was delicious" (Opinion) – "It is a Syrian dish" (Fact) – "The main dish was salty and horrible" (Opinion)
    4. Microblogging • A service which allows subscribers to post short updates online and broadcast them • Answers the question: What are you doing now? • Twitter, Plurk, sfeed, Yammer, BlueTwi, etc.
    5. Together!
    6. Sense?
    7. Sense? [Bar chart comparing corpus sizes (objective vs. subjective tweets) used across prior work: Agarwal et al.; Barbosa & Feng; Bifet & Frank; Diakopoulos & Shamma; Go et al.; He & Saif; Pak & Paroubek; and others, including the UK General Elections Corpus]
    8. Why? For private sectors, because it is critical to keep in touch with public opinion; likewise for public sectors.
    9. Related Work
    10. Sentiment Analysis • Lexicon-based approach: word polarity, building a better dictionary • Machine learning approach: a text classification problem, finding the right features
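The lexicon-based branch of this slide can be sketched as a tiny polarity-summing scorer. The dictionary entries below are hand-made for illustration, not drawn from any real lexicon:

```python
# Minimal sketch of the lexicon-based approach: sum prior word polarities
# from a small hand-made dictionary (entries are illustrative only).
POLARITY = {"delicious": 1, "fabulous": 1, "horrible": -1, "salty": -1}

def lexicon_score(text):
    """Label a text by the sign of its summed word polarities."""
    score = sum(POLARITY.get(w, 0) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

For example, `lexicon_score("The main dish was delicious")` returns "positive", while a text with no dictionary hits falls back to "neutral".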
    11. Twitter Sentiment Analysis – Challenges • The short length of status updates • Language variations • Open social environment
    12. Twitter Sentiment Analysis – Related Work • Distant Supervision: supervised classifiers trained from noisy labels; tweet messages are labelled using emoticons; a data filtering process. Go et al. (2009) – Barbosa and Feng (2010) – Pak and Paroubek (2010)
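A minimal sketch of the distant-supervision idea: use emoticons as noisy labels, then strip them so the classifier cannot simply memorise them. The emoticon lists and filtering rules below are simplified assumptions, not the exact ones used by Go et al.:

```python
# Distant supervision sketch: emoticons act as noisy sentiment labels and
# are removed from the text afterwards (lists are illustrative only).
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")

def noisy_label(tweet):
    """Return (cleaned_tweet, label), or None when the tweet has no
    emoticon or has conflicting emoticons (such tweets are filtered out)."""
    label = None
    if any(e in tweet for e in POSITIVE):
        label = "positive"
    if any(e in tweet for e in NEGATIVE):
        if label:
            return None  # mixed emoticons -> unreliable label, discard
        label = "negative"
    if label is None:
        return None
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return tweet.strip(), label
```

Applied to a stream of raw tweets, this yields a labelled training set with no manual annotation, at the cost of label noise.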
    13. Twitter Sentiment Analysis – Related Work • Followers Graph & Label Propagation: a Twitter follower graph (users, tweets, unigrams and hashtags); start with a small number of labelled tweets; apply a label propagation method throughout the graph. Speriosu et al. (2011)
    14. Twitter Sentiment Analysis – Related Work • Feature Engineering: unigrams, bigrams, POS; microblogging features (hashtags, emoticons, abbreviations & intensifiers). Agarwal et al. (2011) – Kouloumpis et al. (2011)
    15. So?
    16. What does sparsity mean? Training data contains many infrequent terms
    17. What does sparsity mean? Word frequency statistics
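The word-frequency statistic on this slide can be reproduced with a short sketch. The corpus and bin edges here are illustrative; the idea is simply to bin distinct words by how often they occur:

```python
from collections import Counter

def frequency_profile(corpus, bins=(10, 20, 30)):
    """Percentage of distinct words whose corpus frequency falls in each
    interval (1-10, 11-20, ...), as on the word-frequency slide."""
    counts = Counter(w for doc in corpus for w in doc.lower().split())
    total = len(counts)
    profile = {}
    lo = 0
    for hi in bins:
        n = sum(1 for c in counts.values() if lo < c <= hi)
        profile[f"{lo + 1}-{hi}"] = 100.0 * n / total
        lo = hi
    return profile
```

On the datasets compared in the talk, the tweets corpus puts 93% of its words in the lowest bin versus 78% for the movie review data, which is what "sparser" means here.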
    18. How! • Semantic Features: extract semantically hidden concepts from the tweets data and then incorporate them into supervised classifier training by interpolation • Sentiment-Topic Features: extract latent topics and the associated topic sentiment from the tweets data, which are subsequently added into the original feature space for supervised classifier training
    19. Semantic Features – Shallow Semantic Method • "@Stace_meister Ya, I have Rugby in an hour" → Sport • "Sushi time for fabulous Jesses last day on dragons den" → Person • "Dear eBay, if I win I owe you a total 580.63 bye paycheck" → Company
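The shallow semantic method can be sketched as follows: entities found in a tweet are mapped to their hidden concepts, and the concepts join the feature space. The entity-to-concept table below is hand-made for illustration; a real pipeline would use an entity/concept extraction service:

```python
# Sketch of the shallow semantic method: recognised entities contribute
# their concept as an extra feature (mapping is illustrative only).
CONCEPT = {"rugby": "SPORT", "ebay": "COMPANY", "jesses": "PERSON"}

def add_concepts(tokens):
    """Return the token list augmented with the concept of each entity."""
    return tokens + [CONCEPT[t.lower()] for t in tokens if t.lower() in CONCEPT]
```

Because many distinct, rare entity mentions collapse onto a few shared concepts, the classifier can generalise from tweets about one entity to unseen tweets about another entity of the same concept.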
    20. Semantic Features – Interpolation Method
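A minimal sketch of what interpolation means here, assuming a Naive Bayes setting: the class-conditional word probability mixes the unigram estimate with the estimate for the word's concept. The `alpha` weight and the fallback behaviour are simplifying assumptions of this sketch, not the paper's tuned formulation:

```python
def interpolated_prob(word, concept_of, p_word, p_concept, alpha=0.7):
    """P(w|class) = alpha * P_unigram(w|class)
                  + (1 - alpha) * P(concept(w)|class).
    Words without a known concept keep only the alpha-weighted unigram
    term (a simplifying assumption of this sketch)."""
    concept = concept_of.get(word)
    backoff = p_concept.get(concept, 0.0) if concept else 0.0
    return alpha * p_word.get(word, 0.0) + (1 - alpha) * backoff
```

The concept term acts as a smoothing back-off: a rare word still receives probability mass through its (much more frequent) concept.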
    21. Topic-Sentiment Features – Joint Sentiment Topic Model. JST (Lin & He, 2009) is a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text. The only supervision is word prior polarity information, which can be obtained from the MPQA subjectivity lexicon.
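Once JST has assigned each word a sentiment-topic pair, augmenting the feature space is mechanical; a sketch (the "NEG-topic3"-style labels are made up for illustration, standing in for JST's inferred assignments):

```python
def augment_with_topics(tokens, jst_assignment):
    """Append the JST sentiment-topic label of every assigned word to the
    original unigram features (labels here are illustrative)."""
    return tokens + [jst_assignment[w] for w in tokens if w in jst_assignment]
```

As with semantic concepts, many sparse surface words share a single sentiment-topic label, so the augmented features are far denser than the unigrams alone.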
    22. Twitter Sentiment Corpus (Stanford University) • Collected: the 6th of April to the 25th of June 2009 • Training set: 1.6 million tweets (balanced) • Test set: 177 negative tweets & 182 positive tweets
    23. Our Sentiment Corpus • Training set: 60K tweets • Test set: 1000 tweets • Annotated an additional 640 tweets using Tweenator
    24. Evaluation – Extended Test Set. Method / Accuracy: Unigrams 80.7%; Semantic replacement 76.3%; Semantic interpolation 84.0%; Sentiment-topic features 82.3%. Sentiment classification results on the 1000-tweets test set.
    25. Evaluation – Stanford Dataset. Method / Accuracy: Unigrams 81.0%; Semantic replacement 77.3%; Semantic augmentation 80.45%; Semantic interpolation 84.1%; Sentiment-topic features 86.3%; (Go et al., 2009) 83%; (Speriosu et al., 2011) 84.7%. Sentiment classification results on the original Stanford Twitter Sentiment test set.
    26. Evaluation – Semantic Features vs. Sentiment-Topic Features
    27. Tweenator
    28. Conclusion • Twitter sentiment analysis faces a data sparsity problem due to some special characteristics of Twitter • Semantic & topic-sentiment features reduce the sparsity problem and increase performance significantly • Sentiment-topic features should be preferred over semantic features for the sentiment classification task, since they give much better results with far fewer features
    29. The Future
    30. Future Work • Sentiment-Topic Model: attaching weight to extracted features • Semantic Smoothing Model: enhance entity extraction methods, statistical replacement
    31. References
    [1] AGARWAL, A., XIE, B., VOVSHA, I., RAMBOW, O., AND PASSONNEAU, R. Sentiment analysis of Twitter data. In Proceedings of the ACL 2011 Workshop on Languages in Social Media (2011), pp. 30–38.
    [2] BARBOSA, L., AND FENG, J. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING (2010), pp. 36–44.
    [3] BIFET, A., AND FRANK, E. Sentiment knowledge discovery in Twitter streaming data. In Discovery Science (2010), Springer, pp. 1–15.
    [4] GO, A., BHAYANI, R., AND HUANG, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009).
    [5] KOULOUMPIS, E., WILSON, T., AND MOORE, J. Twitter sentiment analysis: The good the bad and the OMG! In Proceedings of the ICWSM (2011).
    [6] LIN, C., AND HE, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009), ACM, pp. 375–384.
    [7] PAK, A., AND PAROUBEK, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010 (2010).
    [8] SAIF, H., HE, Y., AND ALANI, H. Semantic smoothing for Twitter sentiment analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC) (2011).
    32. Thank You