A Survey of Sentiment Analysis



Sentiment analysis refers to a set of natural language processing techniques used to extract subjective information from a body of text. While sentiment analysis offers significant insight into public opinion, current implementations still leave considerable room for improvement, making it a nascent field of research.

This survey provides a brief overview of the technologies commonly used to approach problems in sentiment analysis, paying particular attention to the challenges posed by user-generated content in social media. It seeks to show which technologies are promising in the field in general and for user-generated content in particular.



  1. A Survey of Sentiment Analysis. Blockseminar “Intelligente Softwaresysteme” 2013/14, TU Berlin, 7 Feb 2014. Moritz Platt
  2. Agenda: Introduction ▼ Algorithms ▼ Benchmarks ▼ Outlook (Intelligente Softwaresysteme 2013/14)
  3. Sentiment Analysis is an NLP Task
     • Sentiment Analysis = Opinion Mining = Subjectivity Analysis
     • Extract opinions on objects from text
     • Works on natural language corpora
     • Research problem with a lot of applications
     • Relatively new research area, rapidly developing field
     • Related fields: Natural Language Processing, Social Media Analysis, Text Mining, Data Mining
  4. Accessing Opinions, Now and Then
     Pre Dot-Com Era
     • Extensive measures: surveys, opinion polls, focus groups
     Dot-Com Era and Beyond
     • Huge stream of opinionated text
     • 1.2 million daily blog posts [Zabin2008]
     • 45 million daily “status updates” on Facebook [Thomas2010]
     • Often featuring opinions towards products or persons
  5. Where are today’s opinionated texts coming from?
     • Social Networks
     • Reviews
     • Blogs
  6. The Relationship Between Opinion Holders and Objects
     • Consider a set of product reviews for a particular model of a cellular phone
     • Opinion holders: John, Jack, James
     • Opinionated text and sentiment values:
       “Voice quality is wonderful.” (positive)
       “Voice sounds terrible.” (negative)
       “Speech quality is average.” (neutral)
     • Objects and features: a particular model of a cellular phone; the voice quality of that phone
     • Edges between opinion holders and features represent opinions
     • The time aspect is usually omitted
  7. The Aspects of Opinions
     Structure of an opinion as defined by Liu [Liu2010]: (o_j, f_jk, so_ijkl, h_i, t_l)
     • Object o_j: the target of an opinion (e.g. product, person, event, organisation, topic)
     • Feature f_jk: components/attributes of an object (e.g. battery life, camera resolution)
     • Sentiment Value so_ijkl: the orientation of an opinion from a set of possible choices (e.g. positive, negative, neutral)
     • Opinion Holder h_i: the person expressing the opinion
     • Time t_l: the time at which the opinion is expressed
  8. Algorithms
  9. Approaching Sentiments Algorithmically
     Unsupervised Methods
     • Point-wise Mutual Information
     • No training data needed
     • Cross-domain applications
     Supervised Methods
     • Naïve Bayes Classification
     • Maximum Entropy Classification
     • Support Vector Machines
     • Require manually labelled training data
     • Usually superior to unsupervised approaches
  10. PMI-IR
      • PMI: point-wise mutual information; IR: information retrieval
      • Introduced in 2002 as an unsupervised learning algorithm for classifying reviews [Turney2002]
      • Based on the concept of PMI [Church1990]:
        PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) · p(word2)) )
      • Measures how much more often two words co-occur than their individual probabilities would predict if they were independent
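The PMI formula can be sketched in a few lines of code. The probabilities below are made-up toy values for illustration, not taken from any real corpus:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Point-wise mutual information of two words, given their
    co-occurrence probability and their individual probabilities."""
    return math.log2(p_xy / (p_x * p_y))

# Toy example: under independence the words would co-occur with
# probability 0.1 * 0.1 = 0.01; observing 0.02 means they co-occur
# twice as often as chance, so PMI = log2(2) = 1.
score = pmi(p_xy=0.02, p_x=0.1, p_y=0.1)
```

A PMI of 0 means the words are statistically independent; positive values indicate association.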
  11. PMI-IR
      • Turney used the words poor and excellent as seeds for the algorithm:
        SO(phrase) = PMI(phrase, “excellent”) − PMI(phrase, “poor”)
      • SO is the sentiment orientation value
      • Positive SO value for phrases more associated with excellent
      • Negative SO value for phrases more associated with poor
      • Improvement of results through the IR component: Turney used AltaVista and its NEAR operator, with h(query) denoting the number of hits returned for a query:
        SO(phrase) = log2( h(phrase NEAR “excellent”) · h(“poor”) / ( h(phrase NEAR “poor”) · h(“excellent”) ) )
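The hit-count form of Turney's SO estimate is easy to sketch. The hit counts below are hypothetical, standing in for what a search engine with a NEAR operator would return:

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor):
    """Turney-style sentiment orientation from search-engine hit counts.
    Positive values lean towards 'excellent', negative towards 'poor'."""
    return math.log2((hits_near_excellent * hits_poor)
                     / (hits_near_poor * hits_excellent))

# Hypothetical counts for a phrase that appears near "excellent" far
# more often than near "poor": log2((1600*500)/(100*2000)) = log2(4) = 2
so = semantic_orientation(hits_near_excellent=1600, hits_near_poor=100,
                          hits_excellent=2000, hits_poor=500)
```

In practice a small smoothing constant is usually added to each count so that zero hits do not cause a division by zero.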
  12. Naïve Bayes Classification
      • Based on Bayes’ rule [Bayes1763]
      • Simple to train, probabilistic, effective
      • Works on the “bag of words” of an input document d
      • Fixed set of classes C, e.g. C = {positive, negative}
      • d can be reduced by omitting irrelevant words
      • Example [Jurafsky2013]: in the movie review “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.”, only the opinionated words (love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again) need to be kept
  13. Naïve Bayes at work [Potts2011]
      1. Estimate P(c) of each class c by dividing the number of words in documents in c by the total number of words in the corpus
      2. Estimate P(w|c) for all words w and classes c
      3. The score for a document d to be in class c is P(c) · Π_{w in d} P(w|c), usually computed as a sum of logarithms
      4. The most likely class for a document is the one with the highest score
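The four steps above can be sketched directly in code. This is a minimal illustration, not the classifier from [Potts2011]: the training documents are invented, and add-one smoothing is used for the word likelihoods as is common practice:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label) pairs. Returns word-count class
    priors (step 1) and add-one-smoothed likelihoods P(w|c) (step 2)."""
    class_words = {}
    for words, label in docs:
        class_words.setdefault(label, []).extend(words)
    total = sum(len(ws) for ws in class_words.values())
    vocab = {w for ws in class_words.values() for w in ws}
    priors = {c: len(ws) / total for c, ws in class_words.items()}
    likelihoods = {}
    for c, ws in class_words.items():
        counts = Counter(ws)
        denom = len(ws) + len(vocab)
        likelihoods[c] = {w: (counts[w] + 1) / denom for w in vocab}
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Steps 3 and 4: score each class with log P(c) + sum of
    log P(w|c), return the highest-scoring class."""
    def score(c):
        s = math.log(priors[c])
        for w in words:
            if w in vocab:            # words unseen in training are skipped
                s += math.log(likelihoods[c][w])
        return s
    return max(priors, key=score)

training = [
    (["love", "great", "fun"], "positive"),
    (["sweet", "happy", "great"], "positive"),
    (["terrible", "boring"], "negative"),
    (["awful", "boring", "dull"], "negative"),
]
priors, likelihoods, vocab = train_nb(training)
label = classify(["great", "fun"], priors, likelihoods, vocab)
```

The logarithms avoid numerical underflow when multiplying many small probabilities.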
  14. Maximum Entropy Classification
      “Ignorance is preferable to error, and he is less remote from the truth who believes nothing than he who believes what is wrong.” (Thomas Jefferson)
      • Find weights for the features that maximize the likelihood of the training data
      • Add constraints based on training data
      • More constraints = less entropy = distribution is closer to data
      • More difficult to implement than Naïve Bayes [Potts2011]
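For the binary case, maximum entropy classification coincides with logistic regression, so the idea of fitting feature weights to maximize the training-data likelihood can be sketched with plain gradient ascent. The two-dimensional feature vectors (counts of positive and negative words per review) and all numbers are invented for illustration:

```python
import math

def train_maxent(X, y, lr=0.5, epochs=200):
    """Binary maximum-entropy (logistic regression) model trained by
    stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, features)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # P(label = 1 | features)
            err = label - p                  # gradient of the log-likelihood
            w = [wi + lr * err * xi for wi, xi in zip(w, features)]
            b += lr * err
    return w, b

def predict(features, w, b):
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1 if z > 0 else 0

# Toy features: (count of positive words, count of negative words)
X = [[3, 0], [2, 1], [0, 2], [1, 3]]
y = [1, 1, 0, 0]
w, b = train_maxent(X, y)
```

Unlike Naïve Bayes, the weights are fit jointly, so correlated features are not double-counted; this is one reason the model is harder to implement but often more accurate.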
  15. Support Vector Machines
      • Most intuitive for two-class, separable training data sets
      • Find a vector separating the data sets while maximizing the margin (class A vs class B)
      • The margin is limited by the support vectors
      • Applicable to more complicated problems too:
        n-class spaces
        inseparable training data, through transformation into higher dimensions
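The margin concept can be made concrete with a small sketch: given a separating hyperplane w·x + b = 0, the margin is the distance to the closest training points, which are the support vectors. The hyperplane and the 2-D points below are toy values chosen for illustration:

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w·x + b = 0;
    the points attaining this minimum are the support vectors."""
    def distance(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.hypot(*w)
    return min(distance(p) for p in points)

# Two toy classes in the plane, separable by the line x - y = 0.
class_a = [(2.0, 0.0), (3.0, 1.0)]
class_b = [(0.0, 2.0), (1.0, 3.0)]
m = margin([1.0, -1.0], 0.0, class_a + class_b)
```

An SVM would search over all separating (w, b) for the pair maximizing this quantity; here we only evaluate the margin of one fixed hyperplane.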
  16. Benchmarks
  17. Benchmarking Sentiment Analysis
      • Benchmarking NB and ME with in-domain testing
      • Binary classification
      • 6,000 restaurant reviews [Potts2011]
  18. Benchmarking Sentiment Analysis
      • Benchmarking NB and ME with testing on a different domain
      • Binary classification
      • Trained on 6,000 restaurant reviews
      • Tested on 6,000 product reviews [Potts2011]
  19. Outlook
  20. Opinionated Data in the Wild
      • Sentiment analysis works well under laboratory conditions: proper spelling, highly opinionated text, a pre-defined object
      • Common NLP problems still remain: named entity recognition, context-specific meaning, language ambiguity
      • Benchmarking corpora do not reflect real-world data quality
  21. Opinionated Data in the Wild
      • Social media data: highly relevant, huge corpus, constantly growing
      • Very noisy: questionable text quality (spelling, grammar), spam, unclear context, figurative speech, slang, irony
      • Authentic status updates from https://www.facebook.com/pages/Canon-Cameras:
        Cole J.: “Got a canon gl1 I love it, but a little fuzzy”
        Warren Scott M.: “your mxf format is a joke. DO NOT BUY CANON”
        Leon H.: “Why battery 6L in my Canon sx280 have pretty low life”
        Phil D.: “Youse guys did a solid on my wife's TI3- warranty expired last month, but did the job good! Thanks Canon”
  22. Conclusions / Future Work
      • Development of algorithms is on the right track
      • Evolution beyond binary classification
      • Algorithms will become more robust on less homogeneous sources
      • Industry aims to apply algorithms to noisy data
  23. Appendix
  24. References
      [Bayes1763] Bayes, T. An Essay Towards Solving a Problem in the Doctrine of Chances. Phil. Trans. of the Royal Soc. of London, Vol. 53, pp. 370–418, 1763.
      [Church1990] Church, K. W. & Hanks, P. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, Vol. 16(1), pp. 22–29, MIT Press, 1990.
      [Jurafsky2013] Jurafsky, D. Naïve Bayes and Text Classification. 2013.
      [Liu2010] Liu, B. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, Second Edition, Taylor and Francis Group, Boca Raton, 2010.
      [Potts2011] Potts, C. Sentiment Symposium Tutorial: Classifiers. http://sentiment.christopherpotts.net/classifiers.html, 2011.
  25. References (continued)
      [Thomas2010] Thomas, A. & Applegate, J. Pay Attention!: How to Listen, Respond, and Profit from Customer Feedback. Wiley, 2010.
      [Turney2002] Turney, P. D. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of the ACL, pp. 417–424, 2002.
      [Zabin2008] Zabin, J. & Jefferies, A. Social Media Monitoring and Analysis: Generating Consumer Insights from Online Conversation. Aberdeen Group Benchmark Report, 2008.
  26. Picture Credits
      Icons
      • Page 8: Arrow by Jamison Wieser from The Noun Project
      Photography
      • Page 1: “Thumbs up on diving down” by James Huckaby, licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic License. Based on a work at http://www.flickr.com/photos/raveller/1117899371/. License: http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode
      • Page 3: “Coventry Solihull Warwickshire Sub-Regional Planning Study Questionnaire” by The JR James Archive, licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License. Based on a work at http://www.flickr.com/photos/jrjamesarchive/9371523446/. License: http://creativecommons.org/licenses/by-nc/2.0/legalcode
      • Page 14: “Svm intro.svg” by Fabian Bürger, licensed under a Creative Commons Attribution 3.0 License. Based on a work at http://commons.wikimedia.org/wiki/File:Svm_intro.svg. License: http://creativecommons.org/licenses/by/3.0/legalcode