
Automatic term extraction of dynamically updated text collections for sentiment classification into three classes

This presentation describes an automatic term extraction approach for building a vocabulary that is constantly updated. The resulting dictionary is used for sentiment classification into three classes (positive, neutral, negative). The results of sentiment classification are reported, and the accuracy of methods based on different weighting schemes is compared. The work also discusses the computational complexity of generating representations for N dynamic documents, depending on the weighting scheme used.


  1. Automatic term extraction of dynamically updated text collections for sentiment classification into three classes
     Yuliya Rubtsova, The A.P. Ershov Institute of Informatics Systems (IIS)
  2. Applied problems which can be solved with sentiment classification
     • analysis of consumer reviews of commercial products for businesses;
  3. Applied problems which can be solved with sentiment classification
     • analysis of consumer reviews of commercial products for businesses;
     • recommender systems;
  4. Applied problems which can be solved with sentiment classification
     • analysis of consumer reviews of commercial products for businesses;
     • recommender systems;
     • human-machine interface of a computer system which is responsible for adapting the system's behavior to the current emotional state of the person.
  5. Human-machine interface of a computer system which is responsible for adapting the system's behavior to the current emotional state of the person
     • psychological and medical diagnosis;
     • safety control by analyzing the behavior of mass gatherings;
     • assistance in carrying out investigative measures.
  6. Most common sentiment analysis approaches
     • supervised machine learning;
     • dictionaries and rules;
     • combined method.
  7. Existing corpora
     • corpora of reviews which contain user ratings;
     • each belongs to one subject domain (movie reviews, book reviews, gadget reviews);
     • corpora of news (few emotional texts).
  8. Filtration
     • texts containing both positive and negative emotions;
     • uninformative tweets (less than 40 characters long);
     • copied texts and retweets.
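
The filtration rules on slide 8 can be sketched as a small filter, assuming the listed items are what gets removed from the corpus. The slides do not say how mixed emotions or retweets are detected, so the emoticon sets, the leading "RT" check, and the function names below are illustrative assumptions, not the author's implementation.

```python
import re

MIN_LENGTH = 40  # slide 8: tweets shorter than 40 characters are uninformative

# Assumption: emoticons serve as polarity markers; these sets are illustrative.
POSITIVE_MARKERS = (":)", ":-)", ":D")
NEGATIVE_MARKERS = (":(", ":-(")


def has_mixed_emotions(text):
    """True if the text carries both positive and negative markers."""
    has_pos = any(marker in text for marker in POSITIVE_MARKERS)
    has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
    return has_pos and has_neg


def is_retweet(text):
    """Assumption: a retweet starts with the token 'RT'."""
    return re.match(r"\s*RT\b", text) is not None


def filter_tweets(tweets):
    """Drop short texts, retweets, copied texts, and mixed-emotion texts."""
    seen = set()
    kept = []
    for text in tweets:
        # Case- and whitespace-insensitive key used to detect copied texts.
        normalized = " ".join(text.lower().split())
        if len(text) < MIN_LENGTH:
            continue
        if is_retweet(text) or normalized in seen:
            continue
        if has_mixed_emotions(text):
            continue
        seen.add(normalized)
        kept.append(text)
    return kept
```

Only the first occurrence of a copied text is kept in this sketch; a stricter variant could drop every copy.
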
  9. Corpus of short texts consists of
     • 114 991 positive texts;
     • 111 923 negative texts;
     • 107 990 neutral texts.
  10. Corpus of short texts
      Collection type     Number of words   Number of unique words
      Positive messages   1 559 176         150 720
      Negative messages   1 445 517         191 677
      Neutral messages    1 852 995         105 239
  11. Distribution of unique terms depending on the number of tweets
      [Figure: number of unique terms plotted against number of texts]
  12. Uniformity of the collections used
      Word frequency distribution
  13. Most common approaches to N-gram extraction
      • manually, using a thesaurus;
      • term extraction based on the significance of the term for a collection.
  14. Data set characteristics
      • the entire data set is known;
      • the entire data set is available;
      • the entire data set is static (cannot change during calculation).
      When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N²).
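
The quadratic cost on slide 14 comes from re-weighting every earlier document each time a new one arrives. A minimal sketch of one way around this, assuming document frequencies are kept as running counts and weights are computed only on demand; the class and function names are hypothetical and this is not the author's program module.

```python
import math
from collections import Counter


class StreamingTfIdf:
    """Keeps document frequencies up to date as documents arrive."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.docs = []             # per-document term counts

    def add_document(self, tokens):
        """Update the statistics for one new document; no re-weighting here."""
        counts = Counter(tokens)
        self.docs.append(counts)
        self.doc_count += 1
        self.doc_freq.update(counts.keys())

    def weights(self, doc_index):
        """TF-IDF weights of one document against the current statistics."""
        counts = self.docs[doc_index]
        return {
            term: tf * math.log(self.doc_count / self.doc_freq[term])
            for term, tf in counts.items()
        }


def naive_reweighting_cost(n_docs):
    """Documents re-weighted if every arrival triggers a full recomputation."""
    return sum(range(1, n_docs + 1))  # 1 + 2 + ... + N = N(N + 1) / 2
```

With this layout, adding a document touches only its own terms, while `naive_reweighting_cost` shows how the total work grows quadratically when all weights are refreshed after each arrival.
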
  15. Human speech is constantly changing, so emotional dictionaries need to be updated
  16. Change in vocabulary and topics discussed
      Percentage of references to the Olympic theme among all posts.
      [Figure: bar chart; values in order (February, August): 12%, 0.50%]
  17. Change in vocabulary and topics discussed
      Percentage of references to the vacation theme among all posts.
      [Figure: bar chart; values in order (February, August): 0.06%, 0.12%]
  18. Change in vocabulary and topics discussed
      Percentage of posts using the term "Sebyashka" (Russian for "selfie") among all posts.
      [Figure: bar chart; values in order (February, August): 0.00%, 0.02%]
  19. Filtration
      • punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);
      • references to significant personalities and events;
      • proper names;
      • numerals;
      • all links were replaced with the word "Link" and treated as a single token;
      • runs of dots were replaced with an ellipsis.
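
The token-level rules on slide 19 can be sketched with a few regular expressions. The patterns and the `normalize` name are assumptions made for illustration; removing proper names and references to personalities or events would need an external name list or a named-entity tagger and is left out here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")     # assumed link pattern
DOTS_RE = re.compile(r"\.{2,}")                   # two or more dots in a row
NUMERAL_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")   # integers and decimals
DROPPED_PUNCT_RE = re.compile(r'[,:"«»“”]')       # commas, colons, quotation marks


def normalize(text):
    """Apply the slide-19 filtration rules that need no external resources."""
    text = URL_RE.sub("Link", text)         # links become the single token "Link"
    text = DOTS_RE.sub("…", text)           # runs of dots become an ellipsis
    text = NUMERAL_RE.sub(" ", text)        # numerals are dropped
    text = DROPPED_PUNCT_RE.sub(" ", text)  # '!', '?' and '…' are retained
    return " ".join(text.split())           # collapse leftover whitespace
```

Links are replaced first because a URL itself contains colons and dots that the later rules would otherwise break apart.
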
  20. TF-ICF
      C is the number of categories;
      cf is the number of categories in which the weighted term is found.
  21. TF-IDF
      tf is the frequency of the term's occurrence in the collection (positive or negative tweets);
      T is the total number of messages in the collections;
      T(t_i) is the number of messages in the positive and negative collections that contain the term.
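
The formulas themselves did not survive in the transcript of slides 20 and 21, only the variable definitions. The sketch below assumes the common logarithmic forms tf · log(T / T(t_i)) and tf · log(C / cf); the author's exact formulas may differ (for example in smoothing), and the toy collections are hypothetical data.

```python
import math


def tf_idf(tf, total_messages, messages_with_term):
    """tf * log(T / T(t_i)); the logarithmic form is an assumption."""
    return tf * math.log(total_messages / messages_with_term)


def tf_icf(tf, num_categories, categories_with_term):
    """tf * log(C / cf); the logarithmic form is an assumption."""
    return tf * math.log(num_categories / categories_with_term)


# Hypothetical toy data: category -> tokenized messages.
collections = {
    "positive": [["great", "phone"], ["great", "day"]],
    "negative": [["awful", "phone"], ["awful", "day", "today"]],
}


def weigh(term, category):
    """Return (TF-IDF, TF-ICF) for a term within one category."""
    tf = sum(msg.count(term) for msg in collections[category])
    total = sum(len(msgs) for msgs in collections.values())
    with_term = sum(term in msg for msgs in collections.values() for msg in msgs)
    cf = sum(any(term in msg for msg in msgs) for msgs in collections.values())
    return tf_idf(tf, total, with_term), tf_icf(tf, len(collections), cf)
```

Under these assumed forms, the ICF factor depends only on category counts, so it changes only when a term first appears in a new category, whereas the IDF factor shifts with every added message.
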
  22. Experiments
  23. Corpus of news texts consists of
      • 46 339 positive news items;
      • 46 337 negative news items;
      • 46 340 neutral news items.
  24. ROMIP mixed collection consists of reviews of books, movies, or digital cameras from blogs
      • 543 positive blog texts;
      • 236 negative blog texts;
      • 103 neutral blog texts.
  25. Short text collection
                      TF-IDF        TF-ICF
      Accuracy (%)    95.5981       95.0664
      Precision       0.958092631   0.953112184
      Recall          0.955204837   0.94984672
      F-measure       0.956646554   0.95147665
      News collection
                      TF-IDF        TF-ICF
      Accuracy (%)    69.8619       58.1397
      Precision       0.709246342   0.61278022
      Recall          0.698624505   0.581402868
      F-measure       0.703895355   0.596679322
      ROMIP collection
                      TF-IDF        TF-ICF
      Accuracy (%)    53.9773       57.9545
      Precision       0.561341047   0.558902611
      Recall          0.5311636     0.535790598
      F-measure       0.545835539   0.547102625
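
The F-measure values on slide 25 appear consistent with the harmonic mean of the reported precision and recall (the balanced F1 score). A minimal check, assuming that definition; the slides do not state which F-measure variant was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from slide 25, TF-IDF column.
reported = {
    "short texts": (0.958092631, 0.955204837, 0.956646554),
    "news": (0.709246342, 0.698624505, 0.703895355),
    "ROMIP": (0.561341047, 0.5311636, 0.545835539),
}

for name, (precision, recall, f_reported) in reported.items():
    difference = abs(f_measure(precision, recall) - f_reported)
    print(f"{name}: recomputed F differs from reported by {difference:.2e}")
```
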
  26. Results
  27. Experimental results in terms of F-measure (%)
                  Short texts   News    ROMIP
      TF-IDF      95.66         70.39   54.58
      TF-ICF      95.15         59.68   54.71
  28. The program module makes it possible to
      • dynamically update the unigram dictionary and recalculate term weights depending on which collection they belong to;
      • take into account lexical changes in speech over time;
      • investigate new terms entering the active vocabulary.
  29. Thank you!
      Presentation: http://www.slideshare.net/mokoron
      Yuliya Rubtsova
      yu.rubtsova@gmail.com
      study.mokoron.com
