Language-Independent Twitter Sentiment Analysis

We describe a language-independent approach to sentiment analysis (positive or negative emotions) in tweets. We also present our evaluation dataset of human-annotated sentiments in tweets, collected using Amazon Mechanical Turk.

This is the presentation I gave at KDML, LWA 2012, Dortmund, Germany.

Visit http://irml.dai-labor.de/ for more information.

  • Remarkable. Didn't quite catch it from the slides, but do you consider emoticons as sentiment mentions? I've found that they are often sentiment-null or even sarcastic (reverse polarity). Thanks!

Language-Independent Twitter Sentiment Analysis Presentation Transcript

  • 1. Language-Independent Twitter Sentiment Analysis
    Sascha Narr, Michael Hülfenhaus, Sahin Albayrak
    Sascha Narr
    Competence Center Information Retrieval & Machine Learning
    KDML 2012, LWA, Dortmund, Germany
  • 2. Overview
    ► 1. Sentiment analysis on social media
    ► 2. Creation of a multilingual evaluation dataset of tweets
    ► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
    ► 4. Experiments on the multilingual dataset
    18. September 2012 Language-Independent Twitter Sentiment Analysis 2
  • 4. 1. Sentiment Analysis on Social Media
    ► Why Sentiment Analysis?
      • People’s opinions and sentiments about products and events, available in large numbers, are invaluable for market research, product feedback and more
      • Sentiment analysis makes it possible to collect such data automatically
    ► Why Twitter?
      • 400 million tweets are posted each day [1]
      • Shorter text lengths encourage people to “just write” what they think
      • Tweets are often informal and contain lots of opinions
    [1] http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/
  • 5. 1. Methods for Sentiment Classification
    ► Sentiment classification goals:
      • Subjectivity: “Does the tweet contain an opinion?”
      • Polarity: “Is the expressed opinion positive or negative?”
    ► Classifiers used:
      • Naive Bayes, Maximum Entropy, Support Vector Machines
    ► Features used:
      • n-grams, WordNet semantics, part-of-speech information
    ► Tweet texts have unique properties:
      • Informal, contain slang, emoticons, misspellings
  • 6. 1. Multilingual Sentiment Analysis
    ► Fewer than 40% of tweets are in English [1]
    ► Natural language processing methods are often designed specifically for one language
    ► Increase the coverage of sentiment analysis by using a language-independent approach:
      • No extra effort for additional languages
      • Is the approach really effective for all languages?
    [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter
  • 8. 2. Creation of a Multilingual Evaluation Dataset
    ► We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
      • 4 languages: English, German, French, Portuguese
    ► Used the Amazon Mechanical Turk platform for annotation
    ► Each tweet was annotated by 3 different workers:
      • Labels: “positive”, “neutral”, “negative”
      • Added validation tweets to try to ensure the quality of the annotations
  • 9. 2. Our Multilingual Evaluation Dataset
    ► Observed a low inter-annotator agreement in our dataset
      • Sentiment classification is a hard task, even for humans
      • Tweets that humans disagree on are harder to classify as well
    ► The dataset is publicly available for research purposes
    Table 1: Tweet counts for the complete annotated dataset
  • 11. 3. A Language-Independent Heuristic
    ► To train a sentiment classifier, a large amount of labeled training data is needed
      • Can be obtained without human effort using a previously proposed heuristic
    ► The heuristic uses emoticons in tweets as noisy labels
    ► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative).
    ► Examples of emoticons we used:
      • Positive: :) :-) =) ;) :] :D ^-^ ^_^
      • Negative: :( :-( :(( -.- >:-( D: :/
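The labeling heuristic on slide 11 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the emoticon sets are the examples listed on the slide, while the function names, the whitespace tokenization, the exact-token matching, and the removal of emoticons from the training text are choices made here for the example.

```python
# Emoticon examples copied from slide 11; the real lists may be longer.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None if the tweet gets no label.

    Assumption: emoticons are matched as whole whitespace-separated tokens,
    so the ':/' inside 'http://...' is not mistaken for an emoticon.
    """
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon at all, or both polarities present


def build_training_set(tweets):
    """Keep only tweets the heuristic can label, as (text, label) pairs.

    Assumption: emoticons are removed from the text so that a classifier
    trained on it cannot simply memorize the label source.
    """
    labeled = []
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is not None:
            words = [t for t in tweet.split()
                     if t not in POSITIVE and t not in NEGATIVE]
            labeled.append((" ".join(words), label))
    return labeled
```

Because nothing in the function depends on the vocabulary of a particular language, the same code applies unchanged to English, German, French, or Portuguese tweets.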
  • 12. 3. Heuristic for Semi-Supervised Learning
    ► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
    ► The number of tweets with emoticons differs among languages
      • Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets
    Table 2: Number of tweets containing emoticons for each language
  • 14. 4. Experiments – Sentiment Classification
    ► Data:
      • Training: from ~800M random tweets of mixed languages:
        • Filter for languages: English, German, French, Portuguese
        • Use emoticon heuristic to select and label training data
      • Evaluation: 12,597 hand-annotated tweets (4 languages)
    ► Setup:
      • Classification: sentiment polarity only
      • Classifier: Naive Bayes
      • Features: 1-grams alone, and 1-grams together with 2-grams
      • Trained 4 classifiers (en, de, fr, pt) and 1 classifier for the combined en+de+fr+pt data
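The setup on slide 14 names a Naive Bayes classifier over 1-gram or 1-gram plus 2-gram features. The sketch below shows one common way to build such a classifier using only the Python standard library. The slides do not specify the Naive Bayes variant, the tokenization, or the smoothing, so the multinomial model, the lowercased whitespace tokens, and the add-one smoothing used here are assumptions, and the names `ngrams` and `NaiveBayes` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=2):
    """Lowercased whitespace tokens as 1-grams, plus n-grams up to max_n."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, max_n=1):
        self.max_n = max_n                        # 1 = 1-grams, 2 = 1- and 2-grams
        self.class_counts = Counter()             # training tweets per label
        self.feature_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()

    def fit(self, examples):
        """examples: iterable of (text, label) pairs."""
        for text, label in examples:
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:             # skip unseen features
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Under these assumptions, a per-language classifier would be fitted on the noisily labeled tweets of one language, and the combined classifier on the union of all four, for example `NaiveBayes(max_n=2).fit(pairs)` where `pairs` is a list of `(text, label)` tuples.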
  • 15. 4. Experiments: Evaluation Dataset
    ► 2 variations of our evaluation set for the experiments:
      • agree-3: tweets for which all 3 annotators agreed on a sentiment
      • agree-2: tweets for which at least 2 annotators agreed on a sentiment
    ► Baseline: always guess “positive” (more positive tweets than negative)
    Table 3: Tweet counts for the evaluation datasets
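The agree-3 and agree-2 evaluation sets and the always-positive baseline from slide 15 could be derived from the raw annotations along these lines. This is a sketch only: the input layout (a list of `(tweet, [label, label, label])` pairs), the function names, and the decision to drop tweets whose agreed label is "neutral" (motivated by the polarity-only setup) are assumptions, not details given in the slides.

```python
from collections import Counter


def agreed_label(annotations, min_agree):
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Build agree-3 (min_agree=3) or agree-2 (min_agree=2) from annotations.

    Assumption: only tweets with an agreed polarity label are kept,
    so tweets agreed to be 'neutral' are dropped.
    """
    result = []
    for tweet, labels in annotated:
        label = agreed_label(labels, min_agree)
        if label in ("positive", "negative"):
            result.append((tweet, label))
    return result


def baseline_accuracy(eval_set, guess="positive"):
    """Accuracy of a classifier that always predicts `guess`."""
    hits = sum(1 for _, label in eval_set if label == guess)
    return hits / len(eval_set)
```

By construction every agree-3 tweet also satisfies the agree-2 condition, so agree-2 is the larger set and additionally contains the tweets on which annotators partly disagreed.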
  • 16. 4. Results – English Classifier
    ► Best results: English classifier using 1-grams, on the agree-3 set
      • 81.3% accuracy (trained on 500k tweets)
    ► Performance on the agree-2 set is consistently lower than on the agree-3 set
    [Figure: results for the English (en) classifier]
  • 17. 4. Results – All Languages
    [Figure: result panels for en, de, fr, pt]
  • 18. 4. Evaluation – All Languages Compared
    ► Strong differences between languages
    ► Differences do not correlate with the number of emoticons in each language
    ► The emoticon heuristic is a better fit for some languages; this may depend on the style of expressing sentiment in the language
    ► Example: “muito engraçado kkkkkkkk” (Portuguese: “very funny”, followed by written laughter)
    [Figure: result panels for en, de, fr, pt]
    Table 3: Tweet counts containing emoticons for each language
  • 19. 4. Evaluation – Multi-language Classifier
    ► Tested on the combined 4-language evaluation set
    ► Highest performance: 71.5% accuracy
      • Slightly lower than using 4 individual classifiers (73.9% accuracy)
    ► The usefulness of a combined classifier can outweigh the performance degradation
    [Figure: results for the combined en+de+fr+pt classifier]
  • 20. Conclusions
    ► We presented and evaluated a language-independent sentiment classification approach on 4 languages
      • A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
      • Good performance across languages, though it varies for each
      • Classifiers need a very large number of tweets for training
      • Mixed-language classifiers are viable
    ► Future work:
      • Currently we only classify sentiment polarity
      • Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge
  • 21. Language-Independent Twitter Sentiment Analysis
    Thanks for your attention! Questions?
  • 22. Contact
    Sascha Narr, Dipl.-Inform.
    Competence Center Information Retrieval & Machine Learning
    sascha.narr@dai-labor.de
    Phone +49 (0) 30 / 314 – 74 138
    Fax +49 (0) 30 / 314 – 74 003
    DAI-Labor, Technische Universität Berlin
    Fakultät IV – Elektrotechnik & Informatik
    Sekretariat TEL 14, Ernst Reuter Platz 7, 10587 Berlin
    www.dai-labor.de