Evaluation Datasets for Twitter Sentiment Analysis
A survey and a new dataset, the STS-Gold

Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom

1st Workshop on Emotion and Sentiment in Social and
Expressive Media: Approaches and perspectives from AI

Outline
• Definition & Background
• Evaluation Datasets for Twitter Sentiment Analysis
• STS-Gold
• Comparative Study
• Conclusion

Sentiment Analysis – Definition
“Sentiment analysis is the task of identifying
positive and negative opinions, emotions and
evaluations in text”

• Positive: “The main dish was delicious”
• Neutral: “It is a Syrian dish”
• Negative: “The main dish was salty and horrible”

Twitter Sentiment Analysis (Background)
• Sentiment Approaches: Supervised, Unsupervised, Hybrid
• Sentiment Levels: Tweet-level, Phrase-level, Entity-level
• Sentiment Tasks: Subjectivity, Polarity, Sentiment Strength, Emotion/Mood

Evaluation Datasets for Twitter Sentiment Analysis
Each dataset is characterised by:
• SA Level
• SA Task
• No. of Tweets
• Construction & Annotation
• Vocabulary Size
• Class Distribution
• Sparsity

Evaluation Datasets – Overview

Dataset                                         SA Level          SA Task                   Annotation/Agreement
Stanford Twitter Corpus (STS)                   Tweet             Subjectivity              Manual/UD
Health Care Reform (HCR)                        Tweet/Target      Subjectivity              Manual/UD
Obama-McCain Debate (OMD)                       Tweet             Polarity*                 Manual/α=0.655
Sentiment Strength Twitter Dataset (SS-Tweet)   Tweet             Strength/Subjectivity**   Manual/α≈0.56
Sanders Twitter Dataset                         Tweet             Subjectivity              Manual/UD
Dialogue Earth Twitter Corpus (WAB, GASP)       Tweet/Target      Subjectivity              Manual/UD
SemEval-2013 Dataset                            Tweet/Expression  Subjectivity              Manual/UD

What is Missing?
• Details about the annotation methodology (STS, HCR, Sanders)
• Entity-level Sentiment Evaluation:
  – Most works focus on assessing the performance of sentiment classifiers at the tweet level (STS, OMD, SS-Tweet, Sanders)
  – Datasets that allow for sentiment evaluation at the entity level assign similar sentiment labels to the tweet and the entities within it (HCR, WAB, GASP)

STS-Gold
• Enables the evaluation at both the entity and tweet levels
• Tweets and entities are annotated independently
• Contains 58 Entities & 3000 Tweets
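
To make the idea of independently annotated tweets and entities concrete, one record of such a dataset could be modelled as below. This is only an illustrative sketch: the class name `AnnotatedTweet`, its field names and the example tweet are hypothetical and do not describe the actual STS-Gold file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Label set taken from the sentiment classes named in the slides.
LABELS = {"positive", "negative", "neutral", "mixed", "other"}


@dataclass
class AnnotatedTweet:
    """Hypothetical record: one tweet label plus one label per entity."""
    text: str
    tweet_label: str
    entity_labels: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject labels outside the known set early.
        for label in [self.tweet_label, *self.entity_labels.values()]:
            if label not in LABELS:
                raise ValueError(f"unknown label: {label}")

    def disagreeing_entities(self) -> List[str]:
        """Entities whose label differs from the tweet-level label."""
        return [e for e, lab in self.entity_labels.items()
                if lab != self.tweet_label]


# Invented example tweet, used only to exercise the structure.
example = AnnotatedTweet(
    text="Love my iPhone but this headache is killing me",
    tweet_label="mixed",
    entity_labels={"iPhone": "positive", "headache": "negative"},
)
print(example.disagreeing_entities())
```

Because entity labels are stored separately from the tweet label, a classifier can be scored at either level without one label being copied from the other.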
Data Collection
• STS Corpus (180K Tweets) → Entity-Extraction (Alchemy API)
• Identify Frequent Concepts
• Select 28 Entities (Top & Mid Frequent Entities)
• Select 100 Tweet/Entity → 2800 Tweets
• +200 tweets → 3000 Tweets
• Entity-Extraction → 147 Entities
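
The entity selection and per-entity sampling steps can be sketched roughly as follows. The sketch assumes entity extraction has already been done (the slides use the Alchemy API for that; here a precomputed mapping stands in for it). The function names, the way "mid frequent" is interpreted (entities around the middle of the frequency ranking) and the placeholder data are all assumptions made for illustration.

```python
import random
from collections import Counter


def pick_top_and_mid(entity_counts, k=2):
    """Return the k most frequent entities plus k from the middle of the ranking.

    Taking the "mid" entities from around the median rank is an assumption.
    """
    ranked = [entity for entity, _ in entity_counts.most_common()]
    top = ranked[:k]
    mid_start = max(k, len(ranked) // 2 - k // 2)
    mid = ranked[mid_start:mid_start + k]
    return top + mid


def sample_tweets_per_entity(tweets_by_entity, entities, per_entity=100, seed=0):
    """Randomly sample up to `per_entity` tweets for each selected entity."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = {}
    for entity in entities:
        pool = tweets_by_entity.get(entity, [])
        sample[entity] = rng.sample(pool, min(per_entity, len(pool)))
    return sample


# Placeholder frequencies and tweets, not real corpus statistics.
counts = Counter({"entity_a": 50, "entity_b": 40, "entity_c": 30,
                  "entity_d": 20, "entity_e": 10, "entity_f": 5})
selected = pick_top_and_mid(counts, k=2)
tweets = {"entity_a": [f"tweet {i}" for i in range(150)],
          "entity_b": ["tweet x", "tweet y"]}
sampled = sample_tweets_per_entity(tweets, ["entity_a", "entity_b"])
print(selected, {e: len(t) for e, t in sampled.items()})
```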
STS-Gold
• Entities: Obama, Taylor Swift, Oprah, LeBron, YouTube, Facebook, McDonalds, Starbucks, London, Vegas, Seattle, Sydney, England, US, Brazil, Lakers, Cavas, NASA, UN, iPod, iPhone, Xbox, PSP, Headache, Flu, Cancer, Fever
• Concept types: Person, Company, City, Country, Organization, Technology, Health Condition

Data Annotation (STS-Gold)
• Input: 3000 Tweets, 147 Entities
• Annotation tool: Tweenator.com
• Sentiment Classes: Positive, Negative, Neutral, Mixed, Other
• Inter-annotator Agreement: Tweet α=0.765; Entity α1=0.416, α2=0.964
• Filtering: 2205 Tweets, 58 Entities
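
The slides report agreement only as α. Assuming this denotes Krippendorff's alpha for nominal labels (the slides do not say so), a minimal sketch of the computation is given below. The function name and the toy annotations are hypothetical, and the numbers it produces are unrelated to the agreement values reported above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that the
    annotators assigned to one item (missing annotations are simply omitted).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Toy annotations from two hypothetical annotators on four items.
toy_units = [["positive", "positive"], ["positive", "positive"],
             ["negative", "negative"], ["positive", "negative"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))
```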
Comparative Study
• Vocabulary Size
• Number of Tweets
• Data Sparsity
• Classification Performance
  – Polarity Classification
  – Naïve Bayes & Maximum Entropy
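
For readers unfamiliar with the first of the two classifiers, a multinomial Naïve Bayes polarity classifier can be sketched in a few lines. This is a generic textbook version with add-one smoothing and whitespace tokenisation, not the configuration used in the study. The class name is hypothetical, and the two training sentences are simply the positive and negative examples from the definition slide.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesPolarity:
    """Minimal multinomial Naive Bayes over whitespace tokens."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + self.smoothing)
                                  / (total_words + self.smoothing * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesPolarity().fit(
    ["The main dish was delicious", "The main dish was salty and horrible"],
    ["positive", "negative"],
)
print(clf.predict("the dessert was delicious"))
print(clf.predict("horrible and salty"))
```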
Comparative Study.1
Vocabulary Size vs. No. of Tweets
- There is a high correlation between the vocabulary size and the number of
tweets (ρ = 0.95)
- However, increasing the number of tweets does not always increase the
vocabulary size (OMD)
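
A correlation of this kind can be computed along the following lines. The sketch assumes ρ is a Pearson correlation coefficient and that words are obtained by lowercasing and splitting on whitespace; neither detail is stated on the slide. The two toy datasets are invented and only show that a dataset with more tweets can have a smaller vocabulary.

```python
import math


def vocabulary_size(tweets):
    """Number of distinct lowercased whitespace tokens in a list of tweets."""
    return len({word for tweet in tweets for word in tweet.lower().split()})


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Invented toy datasets: the second has more tweets but fewer distinct words.
toy_small = ["good day", "bad day"]
toy_large = ["good good", "good day", "day good"]
print(len(toy_small), vocabulary_size(toy_small))
print(len(toy_large), vocabulary_size(toy_large))
```

Given per-dataset lists of tweet counts and vocabulary sizes, `pearson(tweet_counts, vocab_sizes)` would give the kind of coefficient reported on the slide.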
Comparative Study.2
Data Sparsity
- Dataset sparsity is an important factor that affects the overall performance of machine learning classifiers [17].
- According to Saif et al., tweets are sparser than other types of data (e.g., movie review data).
- Twitter datasets are generally very sparse.
- Increasing either the number of tweets or the vocabulary size increases the sparsity degree of the dataset:
  - ρno_of_tweets = 0.71
  - ρvocabulary_size = 0.77
- The sparsity degree of a given dataset is calculated as [13]:
  $S_d = 1 - \frac{\sum_{i}^{n} N_i}{n \times |V|}$
  where $N_i$ is the number of distinct words in tweet $i$, $n$ is the number of tweets in the dataset and $|V|$ the vocabulary size.
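
As a worked illustration of the sparsity degree on this slide (one minus the total number of distinct words per tweet, divided by the number of tweets times the vocabulary size), the sketch below computes it for a toy dataset. Lowercasing and whitespace splitting stand in for a proper tweet tokenizer, which is an assumption, and the function name and example tweets are hypothetical.

```python
def sparsity_degree(tweets):
    """Sparsity degree: 1 - (sum of distinct words per tweet) / (n * |V|)."""
    # Distinct words of each tweet (simple whitespace tokenisation).
    tokenized = [set(tweet.lower().split()) for tweet in tweets]
    vocabulary = set().union(*tokenized)
    n = len(tokenized)
    if n == 0 or not vocabulary:
        raise ValueError("need at least one non-empty tweet")
    distinct_total = sum(len(words) for words in tokenized)
    return 1.0 - distinct_total / (n * len(vocabulary))


# Toy example: 3 tweets, 4 vocabulary words, 2 distinct words per tweet,
# so the sparsity degree is 1 - 6 / (3 * 4) = 0.5.
toy_tweets = ["good day", "bad day", "good good movie"]
print(sparsity_degree(toy_tweets))
```

A value near 1 means that each tweet uses only a tiny fraction of the vocabulary, which is the typical situation for Twitter data.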
Comparative Study.3
Classification Performance vs. Dataset Sparsity (1)
According to Makrehchi et al (2008) and Saif et al (2012): in a given dataset the
classification performance and the sparsity degree are negatively correlated, i.e.,
increasing the dataset sparsity hinders the classification performance.

[Figure, from M. Makrehchi and M.S. Kamel: "Fig. 2. Classifier performance as a function of sparsity: (a) Rocchio, and (b) SVM". x-axis: Average Sparsity; y-axis: Average Classifier Performance; series: Industry Sectors, 20 newsgroups, Reuters.]
Comparative Study.3
Classification Performance vs. Dataset Sparsity (2)
- There is no correlation between the classification performance and the sparsity degree
across the datasets (ρacc = −0.06, ρf1 = 0.23)
- The sparsity-performance correlation is intrinsic, meaning that it might exist within
a dataset itself, but not necessarily across datasets.
Conclusion
• Current datasets to evaluate Twitter sentiment classifiers:
  – Focus on the tweet level.
  – Assign similar sentiment labels to the tweets and the entities within them.
• STS-Gold allows for sentiment evaluation at both the tweet and the entity levels.
• A correlation between the vocabulary size and the number of tweets does not always exist.
• The sparsity-performance correlation is intrinsic, i.e., it may exist within a dataset itself, but not necessarily across different datasets.

Thank You
Email: hassan.saif@open.ac.uk
Twitter: hrsaif
Website: tweenator.com
