Automatic term extraction of dynamically updated text collections for sentiment classification into three classes

Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems
(IIS)

Applied problems which can be solved
with sentiment classification
• analysis of consumer reviews of commercial products for
businesses;
• recommender systems;
• human-machine interfaces that adapt a computer system's
behavior to the current emotional state of the person;
• psychological and medical diagnosis;
• safety control by analyzing the behavior of mass
gatherings;
• assistance in carrying out investigative measures.

Most common sentiment analysis approaches
• Supervised machine learning
• Dictionaries and rules
• Combined method

Existing corpora
• Review corpora that contain user ratings
• Each belongs to a single subject domain (movie reviews,
book reviews, gadget reviews)
• News corpora (few emotional texts)

Filtration
• Texts containing both positive and negative emotions;
• Uninformative tweets (fewer than 40 characters long);
• Copied texts and retweets.
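The three filters above can be sketched as a single predicate. This is a minimal illustration, not the author's pipeline: the slide does not say how emotions, copies or retweets were detected, so the emoticon marker sets, the "RT " prefix test and the exact-match duplicate check are all assumptions made for the example.

```python
def keep_tweet(text, positive_markers, negative_markers, seen_texts, min_length=40):
    """Return True if a tweet passes the three filters listed on the slide.

    positive_markers / negative_markers are hypothetical marker sets
    (emoticons here); seen_texts is a set of already accepted texts.
    """
    normalized = " ".join(text.lower().split())
    # Filter 1: uninformative tweets (fewer than min_length characters).
    if len(normalized) < min_length:
        return False
    # Filter 2: retweets (assumed "RT " prefix) and exact copies.
    if normalized.startswith("rt ") or normalized in seen_texts:
        return False
    # Filter 3: texts carrying both positive and negative markers.
    has_pos = any(m in normalized for m in positive_markers)
    has_neg = any(m in normalized for m in negative_markers)
    if has_pos and has_neg:
        return False
    seen_texts.add(normalized)
    return True


# Made-up tweets, for illustration only.
tweets = [
    "What a wonderful morning, the sun is finally out after a week of rain :)",
    "RT What a wonderful morning, the sun is finally out after a week of rain :)",
    "Too short :(",
    "Loved the first half of the film :) but the ending was a real letdown :(",
    "What a wonderful morning, the sun is finally out after a week of rain :)",
]
seen = set()
kept = [t for t in tweets if keep_tweet(t, {":)"}, {":("}, seen)]
```

Only the first tweet survives: the others are a retweet, a short text, a mixed-emotion text and an exact copy.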

Corpus of short texts consists of
114 991 – positive texts
111 923 – negative texts
107 990 – neutral texts

Corpus of short texts
Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239

Distribution of unique terms depending on the number of tweets
[Chart; axes: number of texts, number of unique terms]

Uniformity of the collections used
Word frequency distribution

Most common approaches to
N-gram extraction
• Manually, using a thesaurus.
• Term extraction based on the significance of the term
for a collection

Data set characteristics
• The entire data set is known
• The entire data set is available
• The entire data set is static (cannot change during calculation)
When a new document is added, the document frequency of many
terms must be updated and all previously generated term weights
need recalibration. For N documents in a data stream, the
computational complexity is O(N²).
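The quadratic cost can be seen by counting how many stored documents have to be re-weighted when every arrival triggers a full recalculation. The sketch below uses a plain log(N / df) inverse document frequency as a stand-in weighting; the function names and the toy stream are made up for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Recompute IDF for every term from scratch over all stored documents."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return {term: math.log(n / count) for term, count in df.items()}


def stream_with_full_recalc(stream):
    """Naive streaming: after each arrival, re-weight every stored document."""
    docs, reweighted, idf = [], 0, {}
    for doc in stream:
        docs.append(doc)
        idf = idf_table(docs)
        reweighted += len(docs)  # every stored document needs new weights
    return idf, reweighted


stream = [["good", "movie"], ["bad", "movie"], ["good", "day"], ["bad", "day"]]
idf, reweighted = stream_with_full_recalc(stream)
```

For these four documents the loop re-weights 1 + 2 + 3 + 4 = 10 documents, that is N(N + 1) / 2, which grows as O(N²).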

Human speech is constantly changing => emotional dictionaries
need to be updated

Change in vocabulary and topics discussed
Percentage of references to the Olympic theme among all posts
February: 12%
August: 0.50%

Percentage of references to the vacation theme among all posts
February: 0.06%
August: 0.12%

Percentage of posts using the term “Sebyashka” (Russian for “selfie”) among all posts
February: 0.00%
August: 0.02%
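A share like the ones above can be computed by counting the posts that mention a theme and dividing by the number of posts in the period. The sketch below assumes simple case-insensitive substring matching against a hypothetical keyword list; the posts are made up and do not come from the corpus.

```python
def term_share(posts, keywords):
    """Percentage of posts that mention at least one of the keywords."""
    if not posts:
        return 0.0
    keywords = [k.lower() for k in keywords]
    hits = sum(any(k in post.lower() for k in keywords) for post in posts)
    return 100.0 * hits / len(posts)


# Toy data, for illustration only.
posts_by_month = {
    "February": ["watching the olympics tonight", "Olympic hockey final!",
                 "need coffee", "so tired today"],
    "August": ["vacation starts tomorrow", "took a selfie at the beach",
               "need coffee", "rainy day"],
}
shares = {month: term_share(posts, ["olympic"])
          for month, posts in posts_by_month.items()}
```

Running the same function per month with a different keyword list gives the kind of February versus August comparison shown on the slides.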

Filtration
• Punctuation: commas, colons, quotation marks
(exclamation marks, question marks and ellipses were
retained);
• References to significant personalities and events;
• Proper names;
• Numerals;
• All links were replaced with the word "Link" and treated
as a single term;
• Runs of dots were replaced with an ellipsis.
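Part of this preprocessing can be sketched with regular expressions. The patterns below are assumptions chosen for the example, not the author's rules. Removing proper names and references to personalities and events would need a name list or a tagger, so that step is left out.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # assumed link pattern
DOTS_RE = re.compile(r"\.{2,}")                  # two or more dots in a row
NUMERAL_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")  # integers and decimals
PUNCT_RE = re.compile(r"[,:\"«»“”]")             # commas, colons, quotation marks


def normalize(text):
    """Apply the link, ellipsis, numeral and punctuation rules in order."""
    text = URL_RE.sub("Link", text)   # links first: they contain ':' and '.'
    text = DOTS_RE.sub("…", text)     # runs of dots become one ellipsis
    text = NUMERAL_RE.sub(" ", text)  # drop numerals
    text = PUNCT_RE.sub(" ", text)    # drop listed punctuation, keep ! ? …
    return " ".join(text.split())     # collapse leftover whitespace


cleaned = normalize('Check this: http://example.com/a?b=1 , "wow".... 2 times!')
```

Links are replaced before punctuation is stripped, otherwise the colon and dots inside a URL would be removed first and the link would no longer match.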

TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
TF-IDF
tf – the frequency of the term's occurrence in the collection (positive or
negative tweets),
T – the total number of messages in the collections,
T(t_i) – the number of messages in the positive and negative
collections that contain the term
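With these definitions, one plausible reading of the weight is tf · log(T / T(t_i)). The slide's formula is not preserved, so this form is an assumption, and the function name and toy messages below are hypothetical.

```python
import math
from collections import Counter


def tf_idf(collections):
    """collections: dict mapping collection name -> list of tokenized messages.

    Uses the ASSUMED form tf * log(T / T(t_i)): tf is counted inside one
    collection, T and T(t_i) over all collections together.
    """
    all_messages = [msg for msgs in collections.values() for msg in msgs]
    T = len(all_messages)
    # T(t_i): number of messages that contain the term at least once.
    T_t = Counter(term for msg in all_messages for term in set(msg))
    weights = {}
    for name, msgs in collections.items():
        tf = Counter(term for msg in msgs for term in msg)
        weights[name] = {t: f * math.log(T / T_t[t]) for t, f in tf.items()}
    return weights


weights = tf_idf({
    "positive": [["good", "movie"], ["good", "day"]],
    "negative": [["bad", "movie"], ["bad", "day"]],
})
```

Here T = 4 and every term occurs in two messages, so "good" in the positive collection gets 2 · log(2) and "movie" gets log(2).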

Corpus of News texts consists of
46 339 – positive news
46 337 – negative news
46 340 – neutral news

ROMIP mixed collection consists of
reviews of books, movies, or digital cameras from
blogs
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts

Short text collection
              TF-IDF        TF-ICF
Accuracy, %   95.5981       95.0664
Precision     0.958092631   0.953112184
Recall        0.955204837   0.94984672
F-Measure     0.956646554   0.95147665

News collection
              TF-IDF        TF-ICF
Accuracy, %   69.8619       58.1397
Precision     0.709246342   0.61278022
Recall        0.698624505   0.581402868
F-Measure     0.703895355   0.596679322

ROMIP collection
              TF-IDF        TF-ICF
Accuracy, %   53.9773       57.9545
Precision     0.561341047   0.558902611
Recall        0.5311636     0.535790598
F-Measure     0.545835539   0.547102625
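The reported F-measure values are consistent with the harmonic mean of precision and recall (F1). Assuming that is the definition used, the figures can be reproduced from the precision and recall columns; the helper name below is made up for the example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the TF-IDF column of the tables above.
short_tfidf = f_measure(0.958092631, 0.955204837)
news_tfidf = f_measure(0.709246342, 0.698624505)
```

Both results agree with the tabulated F-measure values for the short text and news collections to within rounding.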

Experimental results in terms of F-measure, %
          Short texts   News    ROMIP
TF-IDF    95.66         70.39   54.58
TF-ICF    95.15         59.68   54.71

The program module makes it possible to
• dynamically update the unigram dictionary and
recalculate term weights depending on the collection
a term belongs to;
• take into account lexical changes in speech over time;
• investigate new terms entering the active
vocabulary.
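One way such a module could be organized is to store only raw counts and derive weights on demand, so that a new message never forces a pass over earlier ones. The class below is a hypothetical sketch of that idea, not the author's module; it uses an assumed tf · log(T / T(t)) weight, and all names are made up for the example.

```python
import math
from collections import Counter, defaultdict


class DynamicDictionary:
    """Stores raw counts only; weights are computed lazily on request."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # collection -> term -> tf
        self.message_counts = Counter()          # term -> messages containing it
        self.total_messages = 0

    def add_message(self, collection, tokens):
        """Update the counts for one new message; cost depends on its length only."""
        self.total_messages += 1
        self.term_counts[collection].update(tokens)
        self.message_counts.update(set(tokens))

    def weight(self, collection, term):
        """Assumed weight tf * log(T / T(t)), computed from the current counts."""
        containing = self.message_counts[term]
        if containing == 0:
            return 0.0
        tf = self.term_counts[collection][term]
        return tf * math.log(self.total_messages / containing)


d = DynamicDictionary()
d.add_message("positive", ["good", "movie"])
d.add_message("negative", ["bad", "movie"])
before = d.weight("positive", "good")
d.add_message("positive", ["good", "selfie"])
after = d.weight("positive", "good")
new_term = d.weight("positive", "selfie")
```

After the third message the weight of "good" changes from log(2) to 2 · log(3/2), and the previously unseen term "selfie" receives a weight without any stored message being reprocessed.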

Thank you!
Presentation: http://www.slideshare.net/mokoron
Yuliya Rubtsova
yu.rubtsova@gmail.com
study.mokoron.com
