Evaluation Datasets for Twitter Sentiment Analysis
A survey and a new dataset, the STS-Gold

Hassan Saif, Miriam Fernandez...

Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public's feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets has been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at the target (entity) level, is the lack of distinctive sentiment annotations among the tweets and the entities contained in them. For example, the tweet "I love iPhone, but I hate iPad" can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and therefore may present different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size and sparsity. We also investigate the pair-wise correlation among these dimensions as well as their correlations to the sentiment classification performance on different datasets.

  1. Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold. Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani. Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. 1st Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and perspectives from AI.
  2. Outline
     • Definition & Background
     • Evaluation Datasets for Twitter Sentiment Analysis
     • STS-Gold
     • Comparative Study
     • Conclusion
  3. Sentiment Analysis – Definition
     "Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text"
     • "The main dish was delicious": Positive
     • "It is a Syrian dish": Neutral
     • "The main dish was salty and horrible": Negative
  4. Twitter Sentiment Analysis (Background)
     • Sentiment Approaches: Supervised, Unsupervised, Hybrid
     • Sentiment Levels: Tweet-level, Phrase-level, Entity-level
     • Sentiment Tasks: Subjectivity, Polarity, Sentiment Strength, Emotion/Mood
  5. Evaluation Datasets for Twitter Sentiment Analysis
     Each dataset is described along the following dimensions:
     • SA Level
     • SA Task
     • No. of Tweets
     • Construction & Annotation
     • Vocabulary Size
     • Class Distribution
     • Sparsity
  6. Evaluation Datasets – Overview
     Each entry gives SA Level; SA Task; Annotation/Agreement.
     • Stanford Twitter Corpus (STS): Tweet; Subjectivity; Manual/UD
     • Health Care Reform (HCR): Tweet/Target; Subjectivity; Manual/UD
     • Obama-McCain Debate (OMD): Tweet; Polarity*; Manual/α=0.655
     • Sentiment Strength Twitter Dataset (SS-Tweet): Tweet; Strength/Subjectivity**; Manual/α≈0.56
     • Sanders Twitter Dataset: Tweet; Subjectivity; Manual/UD
     • Dialogue Earth Twitter Corpus (WAB, GASP): Tweet/Target; Subjectivity; Manual/UD
     • SemEval-2013 Dataset: Tweet/Expression; Subjectivity; Manual/UD
  7. What is Missing?
     • Details about the annotation methodology (STS, HCR, Sanders)
     • Entity-level sentiment evaluation:
       – Most works focus on assessing the performance of sentiment classifiers at the tweet level (STS, OMD, SS-Tweet, Sanders).
       – Datasets that allow sentiment evaluation at the entity level assign similar sentiment labels to the tweet and the entities within it (HCR, WAB, GASP).
  8. • Enables the evaluation at both the entity and tweet levels
     • Tweets and entities are annotated independently
     • Contains 58 Entities & 3000 Tweets
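
To make independent tweet and entity annotation concrete, here is a minimal sketch of one annotated record. The class name `AnnotatedTweet` and its fields are hypothetical and are not the STS-Gold file format. The example tweet and the positive label for iPhone come from the abstract; the negative label for iPad is an assumption read off the tweet text.

```python
from dataclasses import dataclass, field

# Label set as listed on the Data Annotation slide.
SENTIMENT_CLASSES = {"positive", "negative", "neutral", "mixed", "other"}


@dataclass
class AnnotatedTweet:
    """Hypothetical record: one tweet-level label plus independent entity labels."""
    text: str
    tweet_label: str
    entity_labels: dict = field(default_factory=dict)  # entity name -> label

    def __post_init__(self):
        # Reject labels outside the annotation scheme.
        for label in [self.tweet_label, *self.entity_labels.values()]:
            if label not in SENTIMENT_CLASSES:
                raise ValueError(f"unknown sentiment class: {label}")


example = AnnotatedTweet(
    text="I love iPhone, but I hate iPad",
    tweet_label="mixed",
    # The iPad label is an assumption for illustration.
    entity_labels={"iPhone": "positive", "iPad": "negative"},
)
```

Keeping the entity labels in a separate mapping lets a tweet-level classifier and an entity-level classifier be scored against the same record without either label constraining the other.
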
  9. Data Collection
     • STS Corpus: 180K Tweets
     • Entity-Extraction (Alchemy API) → Identify Frequent Concepts → Top & Mid Frequent Entities
     • Select 28 Entities
     • Select 100 Tweet/Entity → 2800 Tweets
     • +200 tweets → 3000 Tweets
     • Entity-Extraction → 147 Entities
     • Result: STS-Gold
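
The "Select 100 Tweet/Entity" step can be sketched as below. This is only an illustration under assumptions: the slide does not say how the 100 tweets per entity were chosen, so seeded random sampling is a stand-in, and the function name, input shape and toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_tweets_per_entity(tweets, entities, per_entity=100, seed=0):
    """Pick up to `per_entity` tweets mentioning each selected entity.

    `tweets` is a list of (text, mentioned_entities) pairs, where
    `mentioned_entities` is a set of entity names (hypothetical input shape).
    """
    rng = random.Random(seed)
    by_entity = defaultdict(list)
    for text, mentioned in tweets:
        for entity in mentioned:
            if entity in entities:
                by_entity[entity].append(text)
    selected = {}
    for entity in entities:
        pool = by_entity[entity]
        # Random sampling is an assumption; take the whole pool if it is small.
        selected[entity] = rng.sample(pool, min(per_entity, len(pool)))
    return selected


# Toy data, invented for illustration.
toy_tweets = [
    ("obama speaks tonight", {"Obama"}),
    ("i love london", {"London"}),
    ("obama visits london", {"Obama", "London"}),
]
picked = sample_tweets_per_entity(toy_tweets, ["Obama", "London"], per_entity=1)
```

With 28 entities and 100 tweets each, a selection of this kind yields the 2800 tweets shown on the slide.
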
  10. STS-Gold: example entities by type
      • Person: Obama, Taylor Swift, LeBron, Oprah
      • City: Vegas, London, Seattle, Sydney
      • Company: YouTube, Facebook, McDonalds, Starbucks
      • Technology: iPod, iPhone, Xbox, PSP
      • Organization: Lakers, Cavas, NASA, UN
      • Country: England, US, Brazil
      • Health Condition: Headache, Flu, Cancer, Fever
  11. Data Annotation
      • Input: 3000 Tweets, 147 Entities
      • Annotation tool: Tweenator.com
      • Sentiment Classes: Positive, Negative, Neutral, Mixed, Other
      • Inter-annotation Agreement:
        – Tweet: α=0.765
        – Entity: α1=0.416, α2=0.964
      • Filtering
      • Result (STS-Gold): 2205 Tweets, 58 Entities
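
The agreement figures above are reported as α. The slide does not name the coefficient; the sketch below assumes it is Krippendorff's alpha for nominal labels, which is a common choice for this kind of annotation. The function name and the toy label lists are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that
    different annotators gave to one item (tweet or entity).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy annotations, invented for illustration: two annotators, three items.
toy = [["positive", "positive"], ["positive", "negative"], ["negative", "negative"]]
alpha = krippendorff_alpha_nominal(toy)
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance, which gives a way to read the gap between α1=0.416 and α2=0.964 for the entity annotations.
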
  12. Comparative Study
      • Vocabulary Size
      • Number of Tweets
      • Data Sparsity
      • Classification Performance
        – Polarity Classification
        – Naïve Bayes & Maximum Entropy
  13. Comparative Study.1: Vocabulary Size vs. No. of Tweets
      - There is a high correlation between the vocabulary size and the number of tweets (ρ = 0.95).
      - However, increasing the number of tweets does not always increase the vocabulary size (OMD).
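
A correlation of this kind can be computed as sketched below. This assumes ρ denotes Pearson's correlation coefficient, which the slide does not state, and uses whitespace tokenization as a simple stand-in for whatever tokenizer the study used. The function names are hypothetical.

```python
import math


def vocabulary_size(tweets):
    """Number of distinct lowercase whitespace-separated tokens in a dataset."""
    return len({token for tweet in tweets for token in tweet.lower().split()})


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

To use it, collect one `(number of tweets, vocabulary_size(tweets))` pair per dataset and pass the two columns to `pearson`.
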
  14. Comparative Study.2: Data Sparsity
      - Data sparsity is an important factor that affects the overall performance of machine learning classifiers [17].
      - Twitter datasets are generally very sparse.
      - Increasing either the number of tweets or the vocabulary size increases the sparsity degree of the dataset:
        – ρ_no_of_tweets = 0.71
        – ρ_vocabulary_size = 0.77
      - The sparsity degree of a given dataset is calculated with the following formula [13]:

        $$S_d = 1 - \frac{\sum_{i}^{n} N_i}{n \times |V|}$$

        where N_i is the number of distinct words in tweet i, n is the number of tweets in the dataset and |V| the vocabulary size.
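
The sparsity degree, S_d = 1 − (Σ N_i) / (n × |V|), can be computed as in the following sketch. Whitespace tokenization with lowercasing is an assumption made for illustration and will not reproduce the figures of a study that used a Twitter-specific tokenizer; the function name is hypothetical.

```python
def sparsity_degree(tweets):
    """Sparsity degree S_d = 1 - sum(N_i) / (n * |V|) of a list of tweets.

    N_i is the number of distinct words in tweet i, n the number of tweets
    and |V| the vocabulary size of the whole dataset.
    """
    # Assumption: lowercase whitespace tokens stand in for a real tokenizer.
    tokenized = [set(tweet.lower().split()) for tweet in tweets]
    vocabulary = set().union(*tokenized)
    n = len(tokenized)
    if n == 0 or not vocabulary:
        raise ValueError("need at least one non-empty tweet")
    distinct_total = sum(len(tokens) for tokens in tokenized)
    return 1.0 - distinct_total / (n * len(vocabulary))
```

For two tweets with no words in common, such as "a b" and "c d", the sum of N_i is 4, n is 2 and |V| is 4, so S_d = 1 − 4/8 = 0.5. Two identical tweets give S_d = 0.
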
  15. Comparative Study.3: Classification Performance vs. Dataset Sparsity (1)
      According to Makrehchi et al (2008) and Saif et al (2012), in a given dataset the classification performance and the sparsity degree are negatively correlated, i.e., increasing the dataset sparsity hinders the classification performance.
      [Figure, from M. Makrehchi and M.S. Kamel, p. 228: Fig. 2. Classifier performance as a function of sparsity: (a) Rocchio, and (b) SVM. x-axis: Average Sparsity; y-axis: Average Classifier Performance; series: Industry Sectors, 20 newsgroups, Reuters.]
  16. Comparative Study.3: Classification Performance vs. Dataset Sparsity (2)
      - No correlation between the classification performance and the sparsity degree across the datasets (ρ_acc = −0.06, ρ_f1 = 0.23).
      - The sparsity-performance correlation is intrinsic, meaning that it might exist within a dataset itself, but not necessarily across datasets.
  17. Conclusion
      • Current datasets to evaluate Twitter sentiment classifiers:
        – Focus on the tweet level.
        – Assign similar sentiment labels to the tweets and the entities within them.
      • STS-Gold allows for sentiment evaluation at both the tweet and the entity levels.
      • A correlation between the vocabulary size and the number of tweets does not always exist.
      • The sparsity-performance correlation is intrinsic, i.e., it only exists within the dataset itself, but not across the different datasets.
  18. Thank You
      Email: hassan.saif@open.ac.uk
      Twitter: hrsaif
      Website: tweenator.com