I. Introduction to Sentiment Analysis and its applications.
II. How to approach Sentiment Analysis?
III. 2015 Elections in Poland on Twitter.com & Onet.pl.
2. Agenda
➔ Introduction to Sentiment Analysis and its applications
➔ How to approach Sentiment Analysis?
➔ 2015 Elections in Poland on Twitter.com & Onet.pl
➔ Questions & Answers
4. Sentiment vs Emotion Analysis
➔ Author’s sentiment
◆ in the whole text
◆ in particular sentences
◆ towards an entity (or entities)
◆ towards specific aspect(s)
➔ Author’s emotions
◆ basic vs. other emotions
➔ When to automate?
◆ many texts
◆ texts are untagged
◆ gaps in annotation
5. Sentiment Analysis or Opinion Mining
➔ Owner of the opinion
➔ Entity vs aspect
➔ Type of opinion(s) expressed
◆ like / dislike
◆ optimism / pessimism
◆ evaluation, confidence, familiarity
◆ polarization
➔ Value of opinion
➔ Event of sentiment expression
Examples:
“He recorded white Russian trucks crossing the border and progressing fast towards Białystok.”
“The candidate looked confident, well prepared and was even very well received by the audience, but lost the debate after answering that question.”
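The components listed above (owner, entity vs. aspect, type, value, event) can be captured in a single record, similar to the opinion quintuple common in opinion-mining literature. A minimal sketch; the field names and the example values below are my own illustration, not taken from the talk:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    """One opinion found in text: who says what about which aspect of which entity, and when."""
    holder: str                 # owner of the opinion
    entity: str                 # the target entity
    aspect: Optional[str]       # a specific aspect, or None for the entity as a whole
    polarity: str               # e.g. "positive" / "negative"
    event: Optional[str] = None # event of sentiment expression

# The debate sentence above carries several opinions about one entity:
opinions = [
    Opinion("author", "candidate", "appearance", "positive", "debate"),
    Opinion("audience", "candidate", None, "positive", "debate"),
    Opinion("author", "candidate", "debate result", "negative", "debate"),
]
print(len(opinions))  # 3
```

Modelling opinions this way makes it explicit that one sentence can mix positive and negative judgments about different aspects of the same entity.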
6. Sample Applications
➔ Politics - forecasting sentiment towards a candidate, political party, reform, or social issue
➔ Reviews - deciding whether a given review is positive or negative
➔ Market monitoring - clients’ comments about our company or the competition
➔ Product - what people think about the new...
9. Approaches
➔ Dictionary methods
◆ Manually built or (semi-)automatically generated
➔ Statistical methods
◆ Training sets
◆ Various descriptive features
● Words
● Co-occurrence of words
● Punctuation
● Syntax
● Emoticons
● Language-specific analysis
○ http://nlp.stanford.edu/sentiment/
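The dictionary approach can be sketched in a few lines: count positive tokens minus negative tokens. The tiny lexicons below are illustrative placeholders, not the real lexicons used in the study:

```python
# Minimal dictionary-based sentiment scorer: positive hits minus negative hits.
POSITIVE = {"good", "confident", "prepared", "well"}
NEGATIVE = {"lost", "bad", "pessimistic"}

def score(text: str) -> int:
    """Return (number of positive tokens) - (number of negative tokens)."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(score("the candidate looked confident and well prepared"))  # 3
print(score("lost the debate"))                                   # -1
```

Statistical methods replace these hand-picked word lists with features learned from a labelled training set.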
10. Data Gathering
➔ Social Networks
◆ Many social networks provide APIs
➔ Thematic websites
◆ Need to create customized scrapers
Important: before gathering data, we need to correctly select the media we will cover!
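For a thematic website without an API, a customized scraper pulls comment text out of the page markup. A minimal sketch using only the standard library; the `div class="comment"` structure is a made-up example, and a real scraper must match the target site's actual markup (and handle nesting, pagination, and rate limits):

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collect the text of every <div class="comment"> (flat, non-nested)."""
    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_comment = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "comment") in attrs:
            self._in_comment = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_comment = False

    def handle_data(self, data):
        if self._in_comment and data.strip():
            self.comments.append(data.strip())

page = '<html><div class="comment">First comment</div><p>nav</p><div class="comment">Second</div></html>'
parser = CommentExtractor()
parser.feed(page)
print(parser.comments)  # ['First comment', 'Second']
```

In practice a framework such as Scrapy (used in this project) handles crawling, retries, and selectors; the parsing idea stays the same.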
11. Manually Tagging Sentiment
➔ Manually tagged sentiment data can be used to
create sentiment dictionaries and reference
(training) sentiment data for experiments
➔ Data should be tagged by several taggers because, especially in politics, Positive and Negative are very subjective
➔ Data should also be tagged in a way that counteracts the effects of Negative Bias
➔ Negative Bias is the effect whereby negative feelings and events are treated as more important by the human brain
“ta d... wykrzykiwała codzień pod krzyżem”
(“that b... was shouting under the cross every day”)
tagged as neutral
“nie dostal sie, na szczescie, do tego palacu, wiec wystaje poden co chwila :)) niedoczekanie twoje, kaczynski!”
(“luckily he didn’t get into that palace, so he hangs around outside it all the time :)) in your dreams, kaczynski!”)
tagged as negative
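When several taggers label the same subjective data, their agreement should be measured before the labels are trusted. A common choice is Cohen's kappa for a pair of taggers; a minimal sketch with made-up labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two taggers, corrected for chance."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lbl] * cb[lbl] for lbl in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

tagger1 = ["neg", "neg", "pos", "neu", "neg", "pos"]
tagger2 = ["neg", "pos", "pos", "neu", "neg", "neg"]
print(round(cohens_kappa(tagger1, tagger2), 2))  # 0.45
```

A kappa near 0 means the taggers agree no more than chance would predict; disputed items can then be resolved by majority vote or discussion.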
13. Twitter.com and Onet.pl (+ Wp.pl)
Twitter.com
Over 11,000 tweets related to 10 profiles of
candidates on May 10th
Onet.pl and Wp.pl
Over 2000 articles with at least one election-related tag (Andrzej Duda, Bronisław Komorowski, Paweł Kukiz, Ewa Kopacz, Beata Szydło) between 20.05 and 20.08 (from the second round of the election to the presidential swearing-in).
Over 1.5 million comments written by the users.
An additional 12,000+ comments from the years 2009-2011 for evaluation purposes.
15. Implementation
Technology
➔ R 3.2.0 running on i686-pc-linux-gnu, RStudio
0.98.1103
➔ twitteR, dplyr, stringi, ggplot2, tm, e1071, RTextTools
Data
➔ 11,744 tweets related to 10 profiles of candidates from May 10th; simple sentiment scoring algorithm
➔ Lexicon of 2000 positive & 3693 negative Polish words; 18 positive and 22 negative emoticons
➔ 6040 tweets (all neutral ones excluded) used to evaluate Naive Bayes, Maximum Entropy, Support Vector Machines, and Tree sentiment classifiers using a 70/30 train/test split
Results
➔ Naive Bayes and Maximum Entropy achieved the best accuracy (71.76% and 77.32% respectively)
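The 70/30 evaluation protocol mentioned above can be sketched as: shuffle the labelled tweets, train on 70%, and report accuracy on the held-out 30%. The classifier below is only a majority-class stub standing in for the real NB/MaxEnt/SVM/tree models, and the two-tweet dataset is invented:

```python
import random
from collections import Counter

def split_70_30(data, seed=0):
    """Shuffle a copy of the data and split it 70% train / 30% test."""
    data = data[:]
    random.Random(seed).shuffle(data)
    cut = int(0.7 * len(data))
    return data[:cut], data[cut:]

def majority_baseline(train):
    """Stub classifier: always predict the most frequent training label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda text: label

data = [("dobry kandydat :)", "pos"), ("fatalna debata", "neg")] * 10
train, test = split_70_30(data)
clf = majority_baseline(train)
accuracy = sum(clf(text) == lbl for text, lbl in test) / len(test)
print(len(train), len(test), accuracy)
```

Any real classifier can be dropped in place of `majority_baseline`; the split and the accuracy computation stay the same.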
17. Step I
Technology
➔ .NET
➔ Python 2.7 + Scrapy
➔ Java 8
Data
➔ 1,533,035 comments in 2057 articles total
◆ 923 manually sentiment annotated comments
➔ 5850 comments from 2011 from the site Gazeta.pl used for lexicon generation (TRAIN-POL)
➔ 31,095 tweets used for lexicon generation (TRAIN-TWIT)
➔ Evaluated methods: Naive Bayes (NB) and three dictionary-based methods
Results
➔ Two approaches: Naive Bayes and Simple Dictionary
Addition achieved the best accuracy (76% and 78%
respectively)
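The NB method evaluated here is, at its core, a multinomial Naive Bayes over word features. A bare-bones sketch with Laplace smoothing; the four training comments are invented stand-ins for the manually annotated Polish data:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features with add-one smoothing."""
    def fit(self, docs):
        self.class_counts = Counter(lbl for _, lbl in docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, lbl in docs:
            for w in text.lower().split():
                self.word_counts[lbl][w] += 1
                self.vocab.add(w)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, text):
        words = text.lower().split()
        best, best_lp = None, float("-inf")
        for lbl, count in self.class_counts.items():
            lp = math.log(count / self.total)  # class prior
            denom = sum(self.word_counts[lbl].values()) + len(self.vocab)
            for w in words:
                lp += math.log((self.word_counts[lbl][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = lbl, lp
        return best

nb = NaiveBayes().fit([
    ("swietny kandydat", "pos"),
    ("dobra debata", "pos"),
    ("fatalny kandydat", "neg"),
    ("zla debata", "neg"),
])
print(nb.predict("swietny"))  # pos
```

The dictionary-based competitors skip training entirely and score texts directly against a lexicon, which is why cheap methods like Simple Dictionary Addition can stay competitive with NB.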
19. Step II
Technology
➔ .NET
➔ Python 2.7 + Scrapy
➔ Java 8
Data
➔ 1,533,035 comments in 2057 articles total
◆ 923 manually sentiment annotated comments
➔ 6448 comments from 2011, site Gazeta.pl
➔ 4592 comments from 2011, site Wyborcza.pl
➔ 7177 comments from 2010, site Gazeta.pl
➔ Lexicons generated from comments
➔ Datasets cross-tested using dictionaries created
with different datasets
Results
➔ Old datasets can be used to annotate new texts despite the 4-year gap
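Generating a lexicon from annotated comments, as done here, can be sketched as a frequency rule: a word enters the positive (negative) lexicon if it appears mostly in positive (negative) comments. The threshold and the three toy comments below are illustrative, not the project's actual parameters:

```python
from collections import Counter

def build_lexicon(docs, min_ratio=0.75):
    """Split vocabulary into positive/negative word lists by document-frequency ratio."""
    pos_cnt, neg_cnt = Counter(), Counter()
    for text, lbl in docs:
        target = pos_cnt if lbl == "pos" else neg_cnt
        for w in set(text.lower().split()):  # count each word once per comment
            target[w] += 1
    positive, negative = set(), set()
    for w in set(pos_cnt) | set(neg_cnt):
        total = pos_cnt[w] + neg_cnt[w]
        if pos_cnt[w] / total >= min_ratio:
            positive.add(w)
        elif neg_cnt[w] / total >= min_ratio:
            negative.add(w)
    return positive, negative

docs = [("dobry wybor", "pos"), ("dobry kandydat", "pos"), ("zly wybor", "neg")]
pos, neg = build_lexicon(docs)
print(sorted(pos), sorted(neg))  # ['dobry', 'kandydat'] ['zly']
```

Because the rule depends only on relative frequencies, a lexicon built from one dataset can be applied to another, which is exactly what the cross-testing in Step II exploits.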
20. Step III
Technology
➔ Java 8
Data
➔ 1,533,035 comments in 2057 articles total
➔ Lexicon generated using manually sentiment tagged
data (from Gazeta.pl 2010, 2011 and Wyborcza.pl
2011)
➔ Algorithm threshold tuned to achieve good accuracy in both binary and 3-category classification
Results
➔ Most comments are negative!
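The threshold tuning above amounts to a 3-way decision on a lexicon score: above +t is positive, below -t is negative, and everything in between is neutral (setting t = 0 with a strict inequality collapses neutrality to exact ties). The tiny lexicons and the threshold value here are illustrative only:

```python
POSITIVE = {"dobry", "swietny"}
NEGATIVE = {"zly", "fatalny"}

def classify(text, t=0):
    """3-category classification: score > t -> positive, score < -t -> negative, else neutral."""
    tokens = text.lower().split()
    s = sum(w in POSITIVE for w in tokens) - sum(w in NEGATIVE for w in tokens)
    if s > t:
        return "positive"
    if s < -t:
        return "negative"
    return "neutral"

print(classify("fatalny i zly kandydat", t=1))  # negative
print(classify("dobry ale zly", t=1))           # neutral
```

Raising t widens the neutral band, trading 3-category precision against binary recall, which is the balance this step tuned for.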