A replication study of the top performing systems in SemEval twitter sentiment analysis

A Replication Study of the Top
Performing Systems in SemEval
Twitter Sentiment Analysis
Efstratios Sygkounas, Giuseppe Rizzo,
Raphaël Troncy
<raphael.troncy@eurecom.fr>
@rtroncy

Replications Study1
 Replicability
Repeating a previous result under the original conditions
(e.g. same system configuration and datasets)
 Reproducibility
Reproducing a previous result under different, but
comparable conditions
 Generalizability
Applying an existing, empirically validated technique to a
different task/domain than the original one
21/10/2016 - 15th International Semantic Web Conference (ISWC), Kobe, Japan - 2
1 Hasibi, F., Balog, K., Bratsberg, S.E. On the Reproducibility of the TAGME Entity Linking System.
38th European Conference on Information Retrieval (ECIR), 2016

SemEval 2013-20152
Task: Sentiment analysis in Twitter
2013
Task 2
2014
Task 9
2015
Task 10
Subtask A
Contextual
Polarity
disambiguation
Subtask B
Message
Polarity
Classification
Subtask C
Topic-Based
Message
Polarity
Classification
Subtask D
Detecting
Trends
Towards a
Topic
Subtask E
Determining
strength of
association of
Twitter terms
with positive
sentiment
2 Rosenthal, S., Nakov, P., Kiritchenko, S., Mohammad, S., Ritter, A., Stoyanovm, V. SemEval-2015 Task 10:
Sentiment Analysis in Twitter. 9th International Workshop on Semantic Evaluation (SemEval), 2015
http://alt.qcri.org/semeval2015

SemEval Subtask B (started in 2013)
 Annotations performed by Amazon Mechanical
Turkers
 Tweets are classified in 3 classes
Positive Neutral Negative

SemEval Subtask B
Tweet ID
Gold
Standard ID
Gold
Standard
Tweet
5228553018
76580353
T15111159 positive I've been watching Gilmore Girls
for the past 3 hours. Oops,
happy Thursday!
5230874482
64671233
T15111142 neutral My Friday consists of Netflix and
hot tea allllllllll day long.
5229601206
83429889
T15111318 negative Kobe Bryant smiling as he re-
enters the game with the Lakers
losing 91-63 in the 4th quarter.
Probably insanity settling in.
- 521/10/2016 - 15th International Semantic Web Conference (ISWC), Kobe, Japan

SemEval Subtask B
Systems scored according to the F1 measure
2015: ~40 systems competing
Webis
Team
SemEval
2015
Hagen, M., Potthast, M., Buchner, M., Stein, B.: Webis: An Ensemble for Twitter Sentiment
Detection. International Workshop on Semantic Evaluation (SemEval), 2015

Ensemble Learning that combines different
classifiers with different settings

Webis System
Webis’s system is an ensemble of 4 classifiers
NRC-
CANADA
GU-MLT-LT
KLUE
TeamX
SemEval
2013
SemEval
2014

Webis System
System Classifier Features
NRC-
Canada
Support Vector
Machine(SVM)
n-grams, alcaps, POS, polarity
dictionaries, punctuation marks,
emoticons, word lengthening, clusters
and negation
GU-MLT-LT
Linear regression normalized uni-grams, stems,
clustering and negation
KLUE
Maximum
Entropy
unigrams, bigrams, and an extended
unigram model that includes a simple
treatment of negation
TeamX
Logistic
Regression
(LIBLINEAR)
word n-grams, character n-grams,
clusters and word senses

Webis System
System Language resources
NRC-
Canada
NRC Emotion, MPQA, Bing Liu’s Opinion Lexicon, NRC
Hashtag Sentiment and the Sentiment140
GU-MLT-LT Polarity Dictionary and SentiWordNet
KLUE
SentiStrength, extended version of AFINN-111, large-
vocabulary distributional semantic models (DSM) from
English Wikipedia and Google Web 1T 5-Grams
databases
TeamX
Formal: MPQA Subjectivity Lexicon, General Inquirer
and SentiWordNet
Informal: AFINN-111, Bing Liu’s Opinion Lexicon, NRC
Hashtag, Sentiment Lexicon and Sentiment140 Lexicon

Replicability
Download Webis’s already trained models and code
https://github.com/webis-de/ECIR-2015-and-SEMEVAL-
2015
Download SemEval’s datasets via the Twitter API
(some tweets not available anymore)

Replicability
 Versioning is an important aspect to be considered in
any replication study
 We replaced the Stanford NLP Core old libraries with
the newest ones
Dataset
Claimed in
paper
Webis’s
models
Replicate Webis system on test 2013 68.49 69.62

Reproducibility
Dataset Claimed in
paper
Webis’s
models
Our
models
Replicate Webis system on test 2013 68.49 69.62 70.06
Replicate Webis system - TeamX on test 2013 N/A 69.04 70.34
 Our models have better performance in general
 SentiME without TeamX performs worst for 2014’s and
2015’s but not for 2013’s dataset

Generalization
 SentiME
Consisted by 4 classifiers
+ Stanford Sentiment System
We train our models using bagging in order to boost
the training of the ensemble
 We noticed a lot of commonalities in TeamX’s
and Stanford’s Sentiment System features, so
we decided to perform test with/without TeamX
in order to assess the classifier's contribution

Stanford Sentiment System
 Stanford Sentiment System is a recursive neural
tensor network parsed by the Stanford Tree Bank
 Stanford Sentiment System can capture the meaning
of compositional phrases which is hard to be achieved
by the normal bag of words approaches
● Classifies a sentence in 5
classes (very positive, positive,
neutral, negative and very
negative)
● We use the pre-trained models
Stanford team provides

Bagging
Due to the fact that bagging introduces some
randomness into the training process, and that
the size of the bootstrap samples are not fixed,
we decide to perform multiple experiments with
different sizes ranging from 33% to 175%
We observed that doing bagging with 150% of the
initial dataset size leads to the best performance
in terms of F1 score

SentiME System

Generalization
1. Webis replicate system: this is the replicate of the
Webis system using re-trained models
2. SentiME system: the system we propose
3. Webis replicate system without TeamX
4. SentiME system without TeamX
We performed four different experiments to
evaluate the performance of SentiME compare
to our previous replicate of the Webis system

Generalization
System SemEval2014-
test
SemEval2014-
sarcasm
SemEval2015-
test
SemEval2015-
sarcasm
Webis Replicate system 69.31 60.00 66.57 54.19
SentiME system 68.27 62.57 67.39 60.92
Webis replicate system
without TeamX
68.56 62.04 66.19 56.86
SentiME system without
TeamX
69.27 62.04 66.38 58.92
Webis 70.86 49.33 64.84 53.59
• SentiME outperforms Webis Replicate system on all datasets except SemEval2014-test
• SentiME improves the F score by respectively 2,5% and 6,5% on SemEval2014-sarcasm
and SemEval2015-sarcasm datasets
• On the SemEval2014-sarcasm dataset there is a significant difference of performance
between the original Webis system (49.33%) and our replicate (60%).

Summary
 Stanford Sentiment System is heavily skew towards
negative classification
 We manage to improve the Webis system by 1% in the
general case by introducing a fifth sub-classifier (the
Stanford Sentiment System) and by boosting the
training with bagging 150%
 The SentiME system also outperforms the Webis
system by 6,5% on the particular and more difficult
sarcasm dataset (thanks to Stanford classifier)

Some Lessons Learned
 Availability of source code AND models significantly
helps to perform reproducibility study
 Pre-trained models provided by Webis are not exactly
the same than the re-trained models we have created
from the data at disposal
You have to archive data … and software libraries !
 It is possible that Webis’s authors did not detail the full
set of features they have used
https://github.com/MultimediaSemantics/sentime

A replication study of the top performing systems in SemEval twitter sentiment analysis

More Related Content

Viewers also liked

Similar to A replication study of the top performing systems in SemEval twitter sentiment analysis

More from Raphael Troncy

Recently uploaded

A replication study of the top performing systems in SemEval twitter sentiment analysis