Language-Independent Twitter Sentiment Analysis
 

    Presentation Transcript

    • Language-Independent Twitter Sentiment Analysis
      Sascha Narr, Michael Hülfenhaus, Sahin Albayrak
      Sascha Narr, Competence Center Information Retrieval & Machine Learning
      KDML 2012, LWA, Dortmund, Germany
    • Overview
      ► 1. Sentiment analysis on social media
      ► 2. Creation of a multilingual evaluation dataset of tweets
      ► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
      ► 4. Experiments on the multilingual dataset
      18. September 2012 Language-Independent Twitter Sentiment Analysis
    • 1. Sentiment Analysis on Social Media
      ► Why Sentiment Analysis?
        - People’s opinions and sentiments about products and events, collected in large numbers, are invaluable for market research, product feedback and more
        - Sentiment analysis makes it possible to collect such data automatically
      ► Why Twitter?
        - 400 million tweets are posted each day [1]
        - Short text lengths encourage people to “just write” what they think
        - Tweets are often informal and contain many opinions
      [1] http://news.cnet.com/8301-1023 3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/
    • 1. Methods for Sentiment Classification
      ► Sentiment classification goals:
        - Subjectivity: “Does the tweet contain an opinion?”
        - Polarity: “Is the expressed opinion positive or negative?”
      ► Classifiers used:
        - Naive Bayes, Maximum Entropy, Support Vector Machines
      ► Features used:
        - n-grams, WordNet semantics, part-of-speech information
      ► Tweet texts have unique properties:
        - Informal, contain slang, emoticons, misspellings
    • 1. Multilingual Sentiment Analysis
      ► Fewer than 40% of tweets are in English [1]
      ► Natural language processing methods are often designed specifically for one language
      ► Increase the coverage of sentiment analysis by using a language-independent approach:
        - No extra effort for additional languages
        - Is the approach really effective for all languages?
      [1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter
    • 2. Creation of a Multilingual Evaluation Dataset
      ► We created a hand-annotated sentiment evaluation dataset of over 12000 tweets
        - 4 languages: English, German, French, Portuguese
      ► Used the Amazon Mechanical Turk platform for annotation
      ► Each tweet was annotated by 3 different workers:
        - Labels: “positive”, “neutral”, “negative”
        - Added validation tweets to help ensure the quality of the annotations
    • 2. Our Multilingual Evaluation Dataset
      ► Observed low inter-annotator agreement in our dataset
        - Sentiment classification is a hard task, even for humans
        - Tweets that humans disagree on are harder to classify as well
      ► The dataset is publicly available for research purposes
      [Table 1: Tweet counts for the complete annotated dataset]
    • 3. A Language-Independent Heuristic
      ► To train a sentiment classifier, a large amount of labeled training data is needed
        - Can be obtained without human effort using a previously proposed heuristic
      ► The heuristic uses emoticons in tweets as noisy labels
      ► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative).
      ► Examples of emoticons we used:
        - Positive: :) :-) =) ;) :] :D ^-^ ^_^
        - Negative: :( :-( :(( -.- >:-( D: :/
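The labeling rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `label_tweet` is hypothetical, the emoticon sets contain only the examples listed on the slide (with the caret forms written in plain ASCII), and matching on whitespace-separated tokens is an assumption made here so that `:/` inside a URL such as `http://…` is not counted as a negative emoticon.

```python
# Emoticon examples taken from the slide; the full lists used in the
# original work are not given, so these sets are incomplete by design.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text):
    """Return 'positive', 'negative', or None for a single tweet.

    A tweet gets a label only if it contains emoticons of exactly one
    polarity. Tweets with no emoticons, or with both polarities, return
    None and would not be used as training data.
    """
    # Assumption: emoticons appear as separate whitespace-delimited tokens.
    tokens = text.split()
    has_pos = any(tok in POSITIVE for tok in tokens)
    has_neg = any(tok in NEGATIVE for tok in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None


print(label_tweet("great day :)"))        # positive
print(label_tweet("missed my train :("))  # negative
print(label_tweet("ugh :( but fine :)"))  # None
```

Because the rule never inspects the words themselves, the same function applies unchanged to tweets in any language, which is the point the slide makes.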
    • 3. Heuristic for Semi-Supervised Learning
      ► The heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
      ► The number of tweets with emoticons differs among languages
        - Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets
      [Table 2: Number of tweets containing emoticons for each language]
    • 4. Experiments – Sentiment Classification
      ► Data:
        - Training: from ~800M random tweets of mixed languages:
          - Filter for languages: English, German, French, Portuguese
          - Use emoticon heuristic to select and label training data
        - Evaluation: 12597 hand-annotated tweets (4 languages)
      ► Setup:
        - Classification: sentiment polarity only
        - Classifier: Naive Bayes
        - Features: 1-grams only, and 1-grams combined with 2-grams
        - Trained 4 classifiers for en, de, fr, pt, and 1 classifier for combined en+de+fr+pt
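The setup above (Naive Bayes over word n-grams) can be illustrated with a small, dependency-free sketch. The slides do not say which Naive Bayes variant, tokenizer, or smoothing was used, so the multinomial model, lowercase whitespace tokenization, and add-one smoothing below are assumptions, and the names `ngrams` and `NaiveBayes` are hypothetical. The toy training pairs stand in for emoticon-labeled tweets.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=1):
    """Return all word n-grams of length 1..max_n (lowercased, whitespace split)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, max_n=1):
        self.max_n = max_n                      # 1 = 1-grams, 2 = 1-grams + 2-grams
        self.class_counts = Counter()           # number of training tweets per label
        self.feat_counts = defaultdict(Counter) # n-gram counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total)    # log prior
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:          # ignore n-grams never seen in training
                    score += math.log((self.feat_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train = [("i love this", "positive"), ("so happy today", "positive"),
         ("i hate this", "negative"), ("so sad today", "negative")]
texts, labels = zip(*train)
clf = NaiveBayes(max_n=2).fit(texts, labels)
print(clf.predict("love happy"))  # positive
print(clf.predict("hate sad"))    # negative
```

Training one such model per language and one on the pooled data would mirror the "4 classifiers plus 1 combined classifier" arrangement described on the slide.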
    • 4. Experiments: Evaluation Dataset
      ► 2 variations of our evaluation set for the experiments:
        - agree-3: tweets for which all 3 annotators agreed on a sentiment
        - agree-2: tweets for which at least 2 annotators agreed
      ► Baseline: always guess “positive” (more positive tweets than negative)
      [Table 3: Tweet counts for the evaluation datasets]
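A short sketch of how the agree-3 and agree-2 subsets and the always-positive baseline could be derived from three annotations per tweet. The function names and the toy data are hypothetical. Dropping tweets whose consensus label is “neutral” is an assumption made here because the experiments classify polarity only; the slides do not spell out how neutral tweets were handled.

```python
from collections import Counter


def consensus(labels, min_agree):
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep (text, label) pairs with sufficient agreement on a polar label.

    Assumption: tweets whose consensus label is 'neutral' are dropped,
    since only sentiment polarity is classified.
    """
    result = []
    for text, labels in annotated:
        label = consensus(labels, min_agree)
        if label in ("positive", "negative"):
            result.append((text, label))
    return result


def baseline_accuracy(eval_set):
    """Accuracy of a classifier that always answers 'positive'."""
    return sum(1 for _, label in eval_set if label == "positive") / len(eval_set)


# Hypothetical annotations: three worker labels per tweet.
annotated = [
    ("tweet a", ["positive", "positive", "positive"]),
    ("tweet b", ["positive", "positive", "negative"]),
    ("tweet c", ["negative", "negative", "negative"]),
    ("tweet d", ["positive", "neutral", "negative"]),
    ("tweet e", ["neutral", "neutral", "neutral"]),
]

agree_3 = build_eval_set(annotated, min_agree=3)
agree_2 = build_eval_set(annotated, min_agree=2)
print(len(agree_3), len(agree_2))  # 2 3
print(baseline_accuracy(agree_3))  # 0.5
```

By construction agree-3 is a subset of agree-2, so agree-2 additionally contains the tweets on which one annotator dissented.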
    • 4. Results – English Classifier
      ► Best results: English classifier using 1-grams, on the agree-3 set
        - 81.3% accuracy (trained on 500k tweets)
      ► Performance on the agree-2 set is consistently lower than on the agree-3 set
      [Figure: en]
    • 4. Results – All Languages
      [Figure panels: en, de, fr, pt]
    • 4. Evaluation – All Languages Compared
      ► Strong differences between languages
      ► Differences do not correlate with the number of emoticons in each language
      ► The emoticon heuristic is a better fit for some languages; this may depend on how sentiment is typically expressed in a language
      ► Example: “muito engraçado kkkkkkkk” (Portuguese: “very funny”, with “kkkkkkkk” expressing laughter)
      [Figure panels: en, de, fr, pt]
      [Table 3: Tweet counts containing emoticons for each language]
    • 4. Evaluation – Multi-language Classifier
      ► Tested on the combined 4-language evaluation set
      ► Highest performance: 71.5% accuracy
        - Slightly lower than using 4 individual classifiers (73.9% accuracy)
      ► The usefulness of a combined classifier can outweigh the performance degradation
      [Figure: en+de+fr+pt]
    • Conclusions
      ► We presented and evaluated a language-independent sentiment classification approach on 4 languages
        - A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
        - Good performance across languages, though it varies by language
        - Classifiers need a very large number of tweets for training
        - Mixed-language classifiers are viable
      ► Future work:
        - Currently we only classify sentiment polarity
        - Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge
    • Thanks for your attention! Questions?
    • Contact
      Sascha Narr, Dipl.-Inform.
      Competence Center Information Retrieval & Machine Learning
      sascha.narr@dai-labor.de
      Phone +49 (0) 30 / 314 – 74 138
      Fax +49 (0) 30 / 314 – 74 003
      DAI-Labor, Technische Universität Berlin
      Fakultät IV – Elektrotechnik & Informatik
      Sekretariat TEL 14, Ernst Reuter Platz 7, 10587 Berlin
      www.dai-labor.de