Sentiment analysis


Published in: Technology

  1. 1. S.V. Giri
  2. 2. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document. ~ Wikipedia [1]
      Levels [2] at which sentiments can be expressed:
      - Phrase
      - Sentence
      - Paragraph
      - Document
      - About a subject
  3. 3. User’s Opinions
      Bob: It’s a great movie. (Positive sentiment)
      Alice: Nah!! I didn’t like it at all. (Negative sentiment)
      Bob: I am not so sure about the movie. You may like it, or maybe not! (Neutral!! Confused!!)
  4. 4. Understanding public opinion on products, movies, etc.
      Ex: There is 67% negative opinion on the color of Amazon’s new version of Kindle.
      Using this knowledge to:
      - Make predictions about market trends, results of election polls, etc.
      - Make decisions! Ex: Changing the color in subsequent versions.
      - Personalization! Ex: Recommending products depending on what your friends feel.
  5. 5. Binary
      - Positive
      - Negative
      Ordinal values
      Ex: Rating from 1 to 5
      Complex polarity
      Detect the source, target and attitude
      Ex: Obama offers comfort after Colorado shooting.
      Subject: Obama, Target: People, Attitude: comfort
  6. 6. NLP
      - Uses semantics to understand the language
      - Uses lexicons, dictionaries, ontologies
      Ex: I feel great today. (Understands that the user’s feeling is great.)
      Machine Learning
      - Doesn’t have to understand the meaning
      - Uses classifiers such as Naïve Bayes, SVM, MaxEnt, etc.
      Ex: I feel great today. (Doesn’t have to understand what the user is feeling. Knowing whether the word “great” appears in the positive or the negative training set is enough to classify the sentence as positive or negative.)
  7. 7. Apple iPod Review
      Alice: Apple iPod is a great music player. It’s better than any other product I have bought.
      Great – Positive
      Better – Positive
      Total Positives = 2
      Total Negatives = 0
      Net Score = 2 - 0 = 2
      Hence the review is Positive.
  8. 8. Apple iPod Review
      Alice: Apple iPod is not bad at all. You can buy it.
      Not – Negative
      Bad – Negative
      Total Positives = 0
      Total Negatives = 2
      Net Score = 0 - 2 = -2
      Hence the review is classified as Negative, even though the reviewer recommends the product.
      Note: This can be solved by a preprocessing stage, such as converting “not bad” to “good”. But preprocessing for NLP is complex.
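
The word-counting approach from the two iPod slides can be sketched as follows. This is only an illustration: the lexicon holds just the four words scored on the slides, and the function name `net_score` is made up for this example.

```python
import re

# Tiny illustrative lexicon: only the words scored on the slides.
POSITIVE = {"great", "better"}
NEGATIVE = {"not", "bad"}

def net_score(review):
    """Count positive and negative lexicon words and return the difference."""
    tokens = re.findall(r"[a-z]+", review.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

review_1 = "Apple iPod is a great music player. It's better than any other product I have bought"
review_2 = "Apple iPod is not bad at all. You can buy it."

print(net_score(review_1))  # net score of the first review
print(net_score(review_2))  # net score of the second review
```

The second review scores negative because "not" and "bad" are counted independently, which is exactly the weakness the slide points out.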
  9. 9. Requires a good classifier.
      Requires a training set for each class.
      In our case: 2 classes, Positive and Negative.
      Requires a pre-classified training set for both these classes.
  10. 10. Training data for Movie Domain
      Positive class
      - Sleepy Hollow is an awesome movie. Everyone should watch it.
      - Christopher Nolan is such a great director that he can convert any script into a blockbuster.
      - Great actors, great direction and a great movie.
      Negative class
      - Nothing can make this movie better. It can win the stupidest movie of the year award, if there is such a thing.
  11. 11. Advantages
      - Don’t have to create a sentiment lexicon (great is 80% positive, bad is 75% negative, etc.)
      - Categorization of proper nouns as well (Ex: Cameron Diaz)
      - Generic and can be applied to various domains
      - Language-independent models (Ex: J’aime le film "Amélie")
      Disadvantage
      - Requires large sets of training data
  12. 12. [Workflow diagram with the stages: Data Collection (Yelp, CityGrid), Pre-processing, Preparing Training Set, Train Classifier, Preparing Test Set, Test Classifier]
  13. 13. CityGrid Media
      CityGrid Media is an online media company that connects web and mobile publishers with local businesses by linking them through CityGrid.
      - Provides a RESTful API
      - Ratings (0-10)
      - Reviews
      Domain: Restaurant
  14. 14. Pre-processing
      - Tokenization
      - Case conversion
      - Word conversion to full forms (“don’t” to “do not”, “I’ll” to “I will”)
      - Removal of punctuation
      - Stop word filter using Lucene
      - Length filter, to remove words with fewer than 3 characters
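
The pre-processing steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the contraction table and the stop-word set are small hand-written stand-ins (the slides use Lucene's stop-word filter, which is not reproduced here), and `preprocess` is a hypothetical helper name.

```python
import re

# Hypothetical stand-ins: a real pipeline would use a fuller contraction
# table and Lucene's stop-word list instead of these small samples.
CONTRACTIONS = {"don't": "do not", "i'll": "i will", "isn't": "is not"}
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "it", "of",
              "the", "this", "to", "was", "with"}

def preprocess(text):
    """Lowercase, tokenize, expand contractions, drop punctuation,
    stop words and words shorter than 3 characters."""
    text = text.lower().replace("\u2019", "'")   # case conversion, normalize apostrophes
    raw_tokens = re.findall(r"[a-z']+", text)    # tokenization, drops other punctuation
    words = []
    for token in raw_tokens:
        expanded = CONTRACTIONS.get(token, token)           # "don't" -> "do not"
        words.extend(expanded.replace("'", "").split())
    return [w for w in words if w not in STOP_WORDS and len(w) >= 3]

print(preprocess("Don't go there, I'll try the crab!"))
```

Note that the length filter also removes short words such as "do" and "go", so the order of the filters matters less than the choice of stop words.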
  15. 15. Reviews with ratings > 8: Positive class
      Reviews with ratings < 3: Negative class
      Training set
      - Positive reviews: 20,000
      - Negative reviews: 20,000
      Both classes are the same size, to avoid bias.
      Test set
      - Positive reviews: 1,000
      - Negative reviews: 1,000
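
The labeling rule on this slide can be written as a small function. The thresholds come from the slide; leaving out the ratings in between is an assumption here, since the slide does not say what happens to them, and `label_for_rating` is a made-up name.

```python
def label_for_rating(rating):
    """Map a 0-10 CityGrid rating to a class label.

    Ratings above 8 are positive, ratings below 3 are negative.
    Assumption: everything in between is left out of both sets.
    """
    if rating > 8:
        return "positive"
    if rating < 3:
        return "negative"
    return None

print([label_for_rating(r) for r in (10, 9, 8, 5, 3, 2, 0)])
```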
  16. 16. Tokenization
      Splitting the sentences into words.
      Vectorization
      A vector for each review in the vector space model.
      Training and Test Sets
      Store the files corresponding to the training and test sets on HDFS.
      Train the classifier
          ./bin/mahout trainclassifier -i /restaurants/bayes-train-input -o /restaurants/bayes-model -type bayes -ng 1 -source hdfs
  17. 17. Unigram
      Considers only one token at a time.
      Ex: It is a good movie.
      {It, is, a, good, movie}
      Bigram
      Considers two consecutive tokens.
      Ex: It is not bad movie
      {It is, is not, not bad, bad movie}
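
The unigram and bigram examples can be reproduced with a short helper. This is a sketch only; `ngrams` is a hypothetical name and the tokens are obtained with a plain whitespace split.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("It is a good movie".split(), 1))
print(ngrams("It is not bad movie".split(), 2))
```

With bigrams, "not bad" becomes a single feature, which is one way a classifier can pick up simple negation.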
  18. 18. Reviews for sea food restaurants
      This restaurant makes good crab dishes. Crab is a kind of sea food, isn’t it? This is a good sea food restaurant. Nay!! Don’t go there if you want sea food. Try going to Marina or some other restaurant.
      Reviews for breakfast
      The English breakfast is very good in this restaurant. Crepes are yummy. Eww! I hate sea food. I can survive the entire day on my breakfast.
  19. 19. Considering the case of unigrams
      Word frequency in each class:

      Word        Sea food   Breakfast
      seafood     3          1
      crabs       1          0
      breakfast   0          1
      crepes      0          1

      Compute prior probabilities according to this table.
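  20. 20. Which place should I go to order crepes? Seafood or breakfast place?
      Naïve Bayes formula: p(c|w) = [p(w|c) p(c)] / p(w)
      Solution
      “Crepes” is the important word extracted from the query (all other words being unimportant), so classify on it.
      From the word frequency table: p(crepes|sea food) = 0/4, p(crepes|breakfast) = 1/3, p(sea food) = 4/7, p(breakfast) = 3/7, p(crepes) = 1/7.
      Probability
      For sea food = [0 * (4/7)] / (1/7) = 0
      For breakfast = [(1/3) * (3/7)] / (1/7) = 1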
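
The crepes calculation can be checked with exact fractions. The counts below are copied from the word frequency table on slide 19; the dictionary layout, class keys and the `posterior` function are assumptions made for this sketch. No smoothing is applied, matching the slide, so an unseen word gives a class a probability of exactly zero.

```python
from fractions import Fraction

# Word counts per class, as given in the slide's frequency table.
counts = {
    "seafood_place": {"seafood": 3, "crabs": 1, "breakfast": 0, "crepes": 0},
    "breakfast_place": {"seafood": 1, "crabs": 0, "breakfast": 1, "crepes": 1},
}

def posterior(word, cls):
    """p(c|w) = p(w|c) * p(c) / p(w), estimated from raw counts."""
    class_total = sum(counts[cls].values())
    grand_total = sum(sum(c.values()) for c in counts.values())
    likelihood = Fraction(counts[cls][word], class_total)                      # p(w|c)
    prior = Fraction(class_total, grand_total)                                 # p(c)
    evidence = Fraction(sum(c[word] for c in counts.values()), grand_total)    # p(w)
    return likelihood * prior / evidence

for cls in counts:
    print(cls, posterior("crepes", cls))
```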
  21. 21. N-gram 1
      Confusion Matrix

      a      b      <-- Classified as
      964    36     | 1000   a = Positive
      82     918    | 1000   b = Negative
  22. 22. N-gram 2
      Confusion Matrix

      a      b      c    <-- Classified as
      969    31     0    | 1000   a = Positive
      62     938    0    | 1000   b = Negative
  23. 23. Precision = True Positives / (True Positives + False Positives)
      Recall = True Positives / (True Positives + False Negatives)
      F-score = 2*P*R / (P + R)
      The results show that the bigram model does better than the unigram model.
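
Applying these formulas to the two confusion matrices gives the numbers behind that conclusion. The sketch below treats Positive as the class of interest and reads each matrix row as the actual class (the row totals of 1000 match the test set sizes); the function name is made up for this example.

```python
def precision_recall_f(tp, fn, fp):
    """Precision, recall and F-score for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Row a: actual positives; the 'a' column of row b holds the false positives.
unigram = precision_recall_f(tp=964, fn=36, fp=82)
bigram = precision_recall_f(tp=969, fn=31, fp=62)

for name, (p, r, f) in (("unigram", unigram), ("bigram", bigram)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

The bigram model comes out slightly ahead on all three measures.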
  24. 24. Two example sentences:
      - Dark Knight Rises is a good movie
      - Dark Knight Rises is an awesome movie
      Both are positive, but the second expresses more positiveness.
      NLP is better than machine learning for measuring this:
      - Machine learning cannot understand the semantics
      - Need of a lexicon
      Also to differentiate between:
      - I like the food
      - The food is awesome and it’s worth every penny of your money. The staff is very friendly and we received a very warm welcome.
      (Twitter restricts tweets to 140 characters, while many review sites allow users to enter as many words as they like. This intensity calculation is useful in such cases.)
  25. 25. Intensity Models
      Review Level Intensity
      The intensity calculated according to the number/type of senti-words in the review.
      Corpus Level Intensity
      The intensity of the review with respect to the entire corpus of reviews. This depends on the corpus distribution.
  26. 26. Uniform weightage model
      Each positive emotion word is given a positive score of 1 and each negative emotion word is given a negative score of 1.
      Net Score = ∑ Positive Score – ∑ Negative Score
      Using a lexicon
      Weighted Net Score = ∑ Weighted Positive Score – ∑ Weighted Negative Score
      The intensity values are obtained from SentiWordNet [5].
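
The difference between the two scoring models can be sketched as follows. The weights are illustrative only: "great" and "bad" reuse the 80% and 75% figures mentioned on slide 11, while the values for "good" and "awesome" are invented for this example and are not taken from SentiWordNet.

```python
# Illustrative weights, NOT SentiWordNet values. Negative words carry a minus sign.
LEXICON = {"great": 0.80, "bad": -0.75, "good": 0.60, "awesome": 0.90}

def uniform_net_score(tokens):
    """Each positive word counts +1, each negative word counts -1."""
    return sum(1 if LEXICON[t] > 0 else -1 for t in tokens if t in LEXICON)

def weighted_net_score(tokens):
    """Each sentiment word contributes its signed lexicon weight."""
    return sum(LEXICON[t] for t in tokens if t in LEXICON)

good = "dark knight rises is a good movie".split()
awesome = "dark knight rises is an awesome movie".split()

print(uniform_net_score(good), uniform_net_score(awesome))
print(weighted_net_score(good), weighted_net_score(awesome))
```

Under the uniform model both sentences get the same score; only the weighted model separates "good" from "awesome".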
  27. 27. Applying a Gaussian distribution over the entire corpus of reviews.
      Note: The raw frequencies do not follow a Gaussian distribution, but the log frequencies do.
  28. 28. Positive reviews
      - Average positive words/review: 4.1
      - Average negative words/review: 1.1
      Negative reviews
      - Average positive words/review: 1.7
      - Average negative words/review: 4.2
      Note: We use the property of the Gaussian distribution that a 1-sigma deviation from the mean corresponds to 68% of the density, and a 2-sigma deviation corresponds to 95% of the density.
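
The 68% and 95% figures can be checked with Python's standard library. This sketch only verifies the Gaussian property itself; how the slides turn it into an intensity value is not specified, so nothing here should be read as that method. `mass_within` is a hypothetical helper name.

```python
from statistics import NormalDist

def mass_within(k_sigma, dist=NormalDist()):
    """Probability mass within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k_sigma * dist.stdev)
    lower = dist.cdf(dist.mean - k_sigma * dist.stdev)
    return upper - lower

print(f"1 sigma: {mass_within(1):.4f}")
print(f"2 sigma: {mass_within(2):.4f}")
```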
  29. 29. Corpus Level Intensities
      The greater the number of positive senti-words in a review, the higher its positive intensity. Similarly, the greater the number of negative senti-words in a review, the higher its negative intensity.
  30. 30. Total Intensity = (Review Level Intensity + Corpus Level Intensity) / 2
      I like the food
      Sentiments: (food)
      Score = (100 + 1)/2 = 50.5
      The food is awesome and it’s worth every penny of your money. The staff is very friendly and we received a very warm welcome.
      Sentiments: (awesome, worth, friendly, warm)
      Score = (100 + 80)/2 = 90
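
The averaging step is simple enough to state directly in code. The input values are the ones used in the two examples on the slide; the function name `total_intensity` is made up for this sketch.

```python
def total_intensity(review_level, corpus_level):
    """Average of the review-level and corpus-level intensities."""
    return (review_level + corpus_level) / 2

print(total_intensity(100, 1))   # "I like the food"
print(total_intensity(100, 80))  # the longer review
```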
  31. 31. Aspects [6] are the features which define a product, item, etc.
      Samsung Galaxy Prevail Android Smartphone (Boost Mobile) -- Amazon
      Features of a smartphone:
      - Design
      - Size
      - Speed
      - Sound
      - Music player
      - Camera/cam
      - Battery
  32. 32. Aspects can be extracted with the help of a POS tagger.
      Stanford POS Tagger [7]:
      This restaurant has good ambiance
      Parse tree
      (ROOT (S (NP (DT This) (NN restaurant)) (VP (VBZ has) (NP (JJ good) (NN ambiance)))))
      NP - Noun Phrase, JJ - Adjective, NN - Noun
  33. 33. Extracting adjective-noun pairs from reviews (for the previous product)
      This would enable us to identify the aspects and their corresponding sentiments.
      Reviews
      - Attractive design & compact size
      - Good speed, not the slowest nor the fastest
      - Clear sound for phone calls & decent music player
      - Fixed focus low res cam (2MP), no LED
      - Battery, this is an issue with all smart phones
      Aspects: {Design (attractive), Size (compact), Speed (good), Sound (clear), Music player (decent), Cam (low resolution), Battery (negative)}
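
Once a sentence has been tagged, pulling out adjective-noun pairs is a short scan over neighbouring tokens. The sketch below does not call the Stanford tagger: the first sentence uses the tags from the parse tree on slide 32, and the tags for the second are assigned by hand as an assumption. `adjective_noun_pairs` is a hypothetical helper.

```python
def adjective_noun_pairs(tagged):
    """Return (noun, adjective) for every adjective directly followed by a noun."""
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            pairs.append((next_word.lower(), word.lower()))
    return pairs

# Tags taken from the parse tree on the POS tagger slide.
restaurant = [("This", "DT"), ("restaurant", "NN"), ("has", "VBZ"),
              ("good", "JJ"), ("ambiance", "NN")]
# Hand-assigned tags (assumption, not tagger output).
phone = [("Attractive", "JJ"), ("design", "NN"), ("&", "CC"),
         ("compact", "JJ"), ("size", "NN")]

print(adjective_noun_pairs(restaurant))
print(adjective_noun_pairs(phone))
```

Only directly adjacent pairs are caught, so a review like "the cam is low res" would need the parse tree rather than this simple scan.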
  34. 34. Used the Stanford POS tagger to extract adjective-noun pairs from the corpus of all the restaurant reviews.
      Restaurant domain
      I: 2548
      We: 1342
      They: 955
      It: 911
      Food: 347
      Services: 291
      Place: 248
      Foods: 229
      Service: 210
      experiences: 131
      Waitress: 122
      …
      pizza: 51
      Problem: Apart from the aspects/features of restaurants such as food, place and service, there is a high number of pronouns. These pronouns can represent anything.
  35. 35. The high frequency counts of pronouns show that we need to de-reference them and extract the corresponding nouns.
      This restaurant has good ambiance, but it is not as good as described by my friends
      Replace all the “it”s in this sentence with “ambiance”, and “This” with “restaurant”.
      Note: The Stanford NLP toolkit has a de-referencing (coreference resolution) API.
  36. 36. IS-A Relationship
      Another problem faced: sentiments are attached to sub-categories rather than to the main categories.
      Ex: The pizza in this restaurant is good.
      - “Good” is attached to pizza
      - Pizza is a type of food
      - Hence all the sentiments about pizza should be pointed to food
      These kinds of relationships are given by Freebase, a graph database of entity relationships.
  37. 37. Algorithm
      - Use the POS tagger to extract nouns attached to adjectives
      - Dereference the personal pronouns
      - Remove the remaining pronouns
      - Use the Freebase dump to find IS-A relations
      - Merge frequencies of plural and singular words and use the singulars
      - Find the adjectives associated with the nouns. This would give an indication of the sentiment
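
The counting part of this algorithm can be sketched on the frequencies from slide 34. Several pieces are assumptions: the IS-A table is a one-entry stand-in for a Freebase lookup, the singularization rule is deliberately naive, and pronouns are simply dropped rather than dereferenced, so the totals will not match the final counts reported in the slides.

```python
from collections import Counter

PRONOUNS = {"i", "we", "they", "it"}
IS_A = {"pizza": "food"}  # hypothetical stand-in for a Freebase IS-A lookup

def singular(noun):
    """Naive singularization: strip a trailing 's' unless the word ends in 'ss'."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun

def aspect_counts(raw_counts):
    """Drop pronouns, merge plural/singular forms and roll sub-categories up."""
    merged = Counter()
    for noun, count in raw_counts.items():
        noun = noun.lower()
        if noun in PRONOUNS:
            continue
        noun = singular(noun)
        merged[IS_A.get(noun, noun)] += count
    return merged

raw = {"I": 2548, "We": 1342, "They": 955, "It": 911, "Food": 347,
       "Services": 291, "Place": 248, "Foods": 229, "Service": 210,
       "experiences": 131, "Waitress": 122, "pizza": 51}

print(aspect_counts(raw).most_common())
```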
  38. 38. Restaurant: 816
      Food: 719
      Service: 613
      Experience: 219
      Waitress: 122 (A relationship between waitress and service still has to be established. This needs an ontology for each domain, or WordNet can be used to find the distance between waitress and service.)
      Review: 91
      Drink: 64
  39. 39. [1]
      [2] R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar, “Structured models for fine-to-coarse sentiment analysis,” in Proceedings of the Association for Computational Linguistics (ACL), pp. 432–439, Prague, Czech Republic, June 2007.
      [3] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), pp. 347–354, Vancouver, Canada, 2005.
      [4]
      [5]
      [6]
      [7]
      [8] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86, 2002.
  40. 40. Thank You