Automatic term extraction of dynamically 
updated text collections for sentiment 
classification into three classes 
Yuliya Rubtsova 
The A.P. Ershov Institute of Informatics Systems 
(IIS)
Applied problems which can be solved
with sentiment classification
• analysis of consumer reviews of commercial products for businesses;
• recommender systems;
• human-machine interfaces that adapt a system's behavior to the current emotional state of the person;
• psychological and medical diagnosis;
• safety control by analyzing the behavior of mass gatherings;
• assistance in carrying out investigative measures.
Most common sentiment analysis approaches
• Supervised machine learning
• Dictionaries and rules
• Combined method
Existing corpora
• Corpora of reviews that contain user ratings
• Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
• News corpora (few emotional texts)
Filtration
• Texts containing both positive and negative emotions;
• Uninformative tweets (shorter than 40 characters);
• Copied texts and retweets.
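
The filtering rules above can be sketched as a small generator. This is only an illustration, not the author's actual pipeline: the name `filter_tweets`, the "RT " prefix check for retweets, and exact-duplicate detection after whitespace and case normalization are assumptions. Only the 40-character threshold comes from the slide. How texts with mixed emotions were detected is not stated, so the sketch takes that as a caller-supplied flag.

```python
def filter_tweets(tweets, min_length=40):
    """Yield tweets that pass the three filters listed on the slide.

    `tweets` is an iterable of (text, has_mixed_emotions) pairs; the
    boolean flag is a hypothetical input, since the slides do not say
    how mixed-emotion texts were detected.
    """
    seen = set()
    for text, has_mixed_emotions in tweets:
        normalized = " ".join(text.split()).lower()
        if has_mixed_emotions:
            continue  # both positive and negative emotions
        if len(normalized) < min_length:
            continue  # uninformative: shorter than 40 characters
        if normalized.startswith("rt "):
            continue  # retweet (assumed "RT " prefix convention)
        if normalized in seen:
            continue  # copied text already in the corpus
        seen.add(normalized)
        yield text
```

Keeping the filter as a generator means it can run over a stream of incoming tweets without holding the whole collection in memory; only the set of normalized texts already seen is stored.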
Corpus of short texts consists of 
114 991 – positive texts 
111 923 – negative texts 
107 990 – neutral texts
Corpus of short texts
Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239
Distribution of unique terms depending on the number of tweets
[Figure: number of unique terms plotted against the number of texts]
Uniformity of the collections used
Word frequency distribution
Most common approaches to N-gram extraction
• Manually, using a thesaurus.
• Term extraction based on the significance of the term for a collection.
Data set characteristics
• The entire data set is known
• The entire data set is available
• The entire data set is static (it cannot change during calculation)
When a new document is added, the document frequency of many terms must be updated, and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N²).
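
To see why a static collection is assumed, consider the inverse document frequency in its common form log(N / df). The sketch below is illustrative only (the function name and the toy documents are assumptions). It shows that adding a single document changes N, and with it the weight of every term, including terms the new document does not contain.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Recompute IDF = log(N / df) for every term from scratch."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = ["good movie", "bad movie", "good book"]
before = idf_weights(docs)

docs.append("nice weather")  # the new document shares no terms with the old ones
after = idf_weights(docs)

# Collect the previously weighted terms whose weight has changed.
changed = [t for t in before if not math.isclose(before[t], after[t])]
print(sorted(changed))
```

Repeating such a full recomputation after each of N arriving documents is what leads to the quadratic cost quoted above.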
Human speech is constantly 
changing => there is a need to 
update emotional dictionaries
Change in vocabulary and topics discussed
Percentage of references to the Olympic theme among all posts
February: 12%
August: 0.50%
Change in vocabulary and topics discussed
Percentage of references to the vacation theme among all posts
February: 0.06%
August: 0.12%
Change in vocabulary and topics discussed
Percentage of posts using the term “Sebyashka” (Russian for “selfie”) among all posts
February: 0.00%
August: 0.02%
Filtration
• Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);
• References to significant personalities and events;
• Proper names;
• Numerals;
• All links were replaced with the word "Link" and treated as a single term;
• Runs of dots were replaced with an ellipsis.
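
Part of this preprocessing can be sketched with regular expressions. The sketch is a rough approximation under stated assumptions: the function name `preprocess`, the URL pattern, and the exact set of quotation characters are hypothetical choices. Removing proper names and references to personalities and events would need a named-entity step that the slides do not describe, so it is not attempted here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed link pattern
DOTS_RE = re.compile(r"\.{2,}")                 # two or more dots in a row
REMOVED_PUNCT_RE = re.compile('[,:"«»“”]')      # commas, colons, quotation marks
NUMERAL_RE = re.compile(r"\b\d+\b")             # standalone numerals

def preprocess(text):
    """Normalize one tweet; '!', '?' and ellipses are kept."""
    text = URL_RE.sub("Link", text)        # links first, before ':' and '.' are touched
    text = DOTS_RE.sub("…", text)          # runs of dots become a single ellipsis
    text = REMOVED_PUNCT_RE.sub("", text)
    text = NUMERAL_RE.sub("", text)
    return " ".join(text.split())          # collapse leftover whitespace

cleaned = preprocess('Wow, look: "great" goal!!! 3 times... http://t.co/abc')
```

Links are replaced before punctuation is stripped, so the colon and dots inside a URL never reach the later rules.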
TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
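
The formula itself is not preserved in this transcript, so the following is a sketch under an assumption: it uses the form tf · log(C / cf), built from the two symbols defined above, which may differ from the exact variant on the slide. The function name `tf_icf` and the toy data are hypothetical.

```python
import math
from collections import Counter

def tf_icf(category_tokens):
    """Weight terms per category as tf * log(C / cf) (assumed form).

    `category_tokens` maps a category name to its list of tokens.
    """
    n_categories = len(category_tokens)  # C
    term_freq = {c: Counter(toks) for c, toks in category_tokens.items()}
    category_freq = Counter()            # cf: number of categories containing a term
    for counts in term_freq.values():
        category_freq.update(counts.keys())
    return {
        c: {t: tf * math.log(n_categories / category_freq[t])
            for t, tf in counts.items()}
        for c, counts in term_freq.items()
    }

icf_weights = tf_icf({
    "positive": ["good", "good", "film"],
    "negative": ["bad", "film"],
    "neutral": ["film", "today"],
})
```

Under this form a term found in every category gets weight zero. The statistics involved are per-category counts, which stay small no matter how many texts the collections contain.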
TF-IDF
tf – the frequency of term occurrence in the collection (positive or negative tweets),
T – the total number of messages in the collections,
T(t_i) – the number of messages in the positive and negative collections that contain the term
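
The formula is missing from this transcript. A minimal sketch, assuming the common form tf · log(T / T(t_i)) built from the symbols defined above, could look like this. The function name `tf_idf` and the toy messages are hypothetical, and the slide's exact variant may differ.

```python
import math
from collections import Counter

def tf_idf(collections):
    """Weight terms per collection as tf * log(T / T(t_i)) (assumed form).

    `collections` maps a collection name (e.g. "positive") to a list of
    messages, each message being a list of tokens.
    """
    total_messages = sum(len(msgs) for msgs in collections.values())  # T
    messages_with_term = Counter()                                    # T(t_i)
    for msgs in collections.values():
        for tokens in msgs:
            messages_with_term.update(set(tokens))  # once per message
    weights = {}
    for name, msgs in collections.items():
        tf = Counter(tok for tokens in msgs for tok in tokens)
        weights[name] = {
            t: freq * math.log(total_messages / messages_with_term[t])
            for t, freq in tf.items()
        }
    return weights

idf_based = tf_idf({
    "positive": [["good", "film"], ["good", "day"]],
    "negative": [["bad", "film"], ["bad", "day"]],
})
```

Here T and T(t_i) depend on every message in the collections, so each newly added message can shift the weights of many terms.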
Experiments
Corpus of News texts consists of 
46 339 – positive news 
46 337 – negative news 
46 340 – neutral news
ROMIP mixed collection consists of
reviews of books, movies, or digital cameras from blogs
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
Short text collection
             TF-IDF        TF-ICF
Accuracy, %  95.5981       95.0664
Precision    0.958092631   0.953112184
Recall       0.955204837   0.94984672
F-Measure    0.956646554   0.95147665

News collection
             TF-IDF        TF-ICF
Accuracy, %  69.8619       58.1397
Precision    0.709246342   0.61278022
Recall       0.698624505   0.581402868
F-Measure    0.703895355   0.596679322

ROMIP collection
             TF-IDF        TF-ICF
Accuracy, %  53.9773       57.9545
Precision    0.561341047   0.558902611
Recall       0.5311636     0.535790598
F-Measure    0.545835539   0.547102625
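
The F-measure values in these tables are consistent with the balanced F-measure, the harmonic mean of precision and recall. A short check (the function name is illustrative) using the TF-IDF figures for the short text collection:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for TF-IDF on the short text collection.
short_text_f = f_measure(0.958092631, 0.955204837)
print(round(short_text_f, 4))
```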
Results
Experimental results in terms of F-measure (%)
          Short texts   News    ROMIP
TF-IDF    95.66         70.39   54.58
TF-ICF    95.15         59.68   54.71
The program module makes it possible to
• dynamically update the unigram dictionary and recalculate term weights depending on which collection a term belongs to;
• take into account lexical changes in speech over time;
• investigate new terms entering the active vocabulary.
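
One way such a module could be organized is sketched below. This is a hypothetical structure, not the author's implementation: the class name `UnigramDictionary`, its methods, and the three collection names are assumptions. It keeps per-collection unigram counts, updates them one text at a time, and reports terms that have not been seen before.

```python
from collections import Counter

class UnigramDictionary:
    """Per-collection unigram counts that can be updated incrementally."""

    def __init__(self, collections=("positive", "negative", "neutral")):
        self.counts = {name: Counter() for name in collections}
        self.vocab = set()  # every term seen so far in any collection

    def add_text(self, collection, tokens):
        """Add one text; return the terms not seen in any collection before."""
        new_terms = set(tokens) - self.vocab
        self.counts[collection].update(tokens)
        self.vocab |= new_terms
        return new_terms

dictionary = UnigramDictionary()
dictionary.add_text("positive", ["nice", "photo"])
new_terms = dictionary.add_text("neutral", ["sebyashka", "photo"])
```

Term weights can then be recomputed on demand from `counts`, and the returned sets of new terms give a simple log of vocabulary that enters use over time.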
Thank you! 
Presentation: http://www.slideshare.net/mokoron 
Yuliya Rubtsova 
yu.rubtsova@gmail.com 
study.mokoron.com


Editor's Notes

  1. When the document set is small, the unique term count continues to climb as the number of documents increases. However, this growth slows sharply as the number of documents becomes very large. This observation indicates that if the document collection is sufficiently large, we can expect to see very few new words when more documents are added.
  2. References to significant personalities and events: the attitude towards them may vary over time, but a classifier trained on "old texts" will not be able to adapt quickly.