Opinion mining for social media and news items in Romanian


  1. Opinion Mining for Social Media and News Items in Romanian
     University Politehnica of Bucharest
     Authors: Claudia Cârdei, Filip Manișor, Traian Rebedea (traian.rebedea@cs.pub.ro)
  2. Overview
     • Introduction
     • Previous Work
       – English
       – Romanian
     • Proposed Solutions
     • Opinionated Corpus
     • Results and Comparisons
     • Conclusions
     (22.09.13, Bachelor's Thesis Session - July 2012)
  3. Introduction
     • Sentiment analysis and opinion mining research has mainly concentrated on English and other major languages (Spanish, Chinese, etc.)
       – Various commercial and open-source solutions exist, mainly for English
       – Corpora of opinionated texts and databases of affective words (general or domain-specific) also exist for these languages
     • Objective: develop an opinion mining solution for Romanian texts gathered from a wide range of online sources (mostly social media and news items)
     (22.09.13, ICSCS 2013, K-TEAMS 2013 Workshop)
  4. Introduction
     • A popular research area in recent years
     • Sentiment, subjectivity, opinion, publicity
       – Related, but somewhat different
     • Sentiment or subjectivity in a text:
       – Positive, negative or neutral
       – Subjective or objective
     • Opinionated text
       – Opinion author
       – Opinion target (subject)
       – Opinion (affective) words
       – Opinion polarity
     Example: President Obama declared that the US immigration system is broken.
  5. Previous Work - English
  6. Previous Work - English
     • Many studies and corpora across different domains
     • The movie reviews dataset is very popular
     • Initial results using BoW, punctuation, etc.
       – Accuracy ≈ 80%
     • Improvement: finding relations/dependencies between opinion targets and affective words
       – Accuracy ≈ 84%
     • Mining frequent dependency subtrees for positive and negative reviews and using an SVM with these subtrees as features
       – Accuracy ≈ 88%
  7. Previous Work - Romanian
     • Use machine translation to generate English texts, then apply opinion mining
     • Translate affective word databases into Romanian (e.g. WordNet Affect)
     • Develop new affective word lists
     • Train and evaluate on specific corpora in Romanian
     • Problems with NER, dependency parsing, affective word scores
  8. Proposed Solutions
     • Supervised solution trained for several different opinion subjects (entities)
     • Three approaches
       – Bag of words
       – Affective words and dependency parsing
       – N-gram probabilities
  9. Bag of Words
     • Bag of words model:
       – Tokenization, diacritics restoration, lemmatization
       – Distinct lemmas selected as features
       – Improvements: POS filter, word n-grams filter
       – Used both binary features and TF-IDF
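The two feature types named on this slide (binary and TF-IDF) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy tokenizer, the two example sentences, and the smoothed IDF formula are assumptions, and diacritics restoration and lemmatization are skipped because they require Romanian-specific tools.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Toy tokenizer (assumption): lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())

def build_vocabulary(docs):
    # Distinct tokens across the corpus become the features
    # (the slide uses distinct lemmas; lemmatization is omitted here).
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def binary_features(doc, vocab):
    # 1 if the term occurs in the document, 0 otherwise.
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]

def tfidf_features(doc, docs, vocab):
    # Term frequency times a smoothed inverse document frequency
    # (one common variant; the slides do not say which was used).
    counts = Counter(tokenize(doc))
    total = sum(counts.values()) or 1
    token_sets = [set(tokenize(d)) for d in docs]
    vector = []
    for term in vocab:
        df = sum(1 for s in token_sets if term in s)
        idf = math.log((1 + len(docs)) / (1 + df)) + 1.0
        vector.append(counts[term] / total * idf)
    return vector

# Hypothetical two-document corpus, for illustration only.
docs = ["Serviciul este foarte bun", "Serviciul este foarte foarte slab"]
vocab = build_vocabulary(docs)
binary = binary_features(docs[1], vocab)
tfidf = tfidf_features(docs[1], docs, vocab)
```

In this toy corpus, "slab" occurs in only one document, so its TF-IDF weight is higher than that of "este", which occurs equally often in the text but appears in both documents. The binary vector cannot express that difference.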
  10. Affective Scores & Dependency Parsing
      • Compute affective word scores in Romanian:
        – Translate all the adjectives and adverbs from the English WordNet into Romanian using Google Translate
        – Use the probability of each translation pair
      • Several affective score databases have been translated: SentiWordNet, SenticNet 2 and ANEW
      • Used the UAIC Romanian FDG parser to identify dependencies between the subject entity and adjectives or adverbs
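The slide does not say how the translation-pair probabilities enter the Romanian scores. One plausible reading, sketched below purely as an assumption, is a probability-weighted average of the English scores of every word that translates to a given Romanian word. The function name, the example words, the scores, and the probabilities are all hypothetical.

```python
def transfer_scores(english_scores, translations):
    """Project English affective scores onto Romanian words.

    english_scores: {english_word: score}
    translations:   {english_word: [(romanian_word, probability), ...]}
    Returns {romanian_word: probability-weighted average score}.
    This weighting scheme is an assumption, not the authors' documented method.
    """
    weighted_sum = {}
    weight_total = {}
    for en_word, candidates in translations.items():
        if en_word not in english_scores:
            continue  # no affective score available for this English word
        for ro_word, prob in candidates:
            weighted_sum[ro_word] = weighted_sum.get(ro_word, 0.0) + prob * english_scores[en_word]
            weight_total[ro_word] = weight_total.get(ro_word, 0.0) + prob
    return {w: weighted_sum[w] / weight_total[w]
            for w in weighted_sum if weight_total[w] > 0}

# Hypothetical scores and translation probabilities, for illustration only.
english_scores = {"good": 0.8, "fine": 0.4, "bad": -0.7}
translations = {
    "good": [("bun", 0.9), ("cuminte", 0.1)],
    "fine": [("bun", 0.3), ("fin", 0.7)],
    "bad": [("rău", 1.0)],
}
romanian_scores = transfer_scores(english_scores, translations)
```

Here "bun" receives a score between those of "good" and "fine", pulled toward "good" because that translation pair has the higher probability.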
  11. N-grams Probabilities
      • Compute the conditional probability of each n-gram in the corpus, given that the document is either positive or negative
      • Then compute a score for each n-gram (feature f) from these probabilities (the formula is missing from this transcript)
      • The score of a new text is the sum of the scores of the n-grams that occur in it
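Because the scoring formula did not survive in the transcript, the sketch below substitutes an assumed one: the log-ratio of the smoothed class-conditional probabilities of an n-gram. It only illustrates the overall procedure (estimate probabilities per class, score each n-gram, sum over the n-grams of a new text). The whitespace tokenizer, the add-one smoothing, and the example sentences are likewise assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train(pos_docs, neg_docs, n=2):
    # Count, per class, how many documents contain each n-gram.
    pos_counts, neg_counts = Counter(), Counter()
    for doc in pos_docs:
        pos_counts.update(set(ngrams(doc.lower().split(), n)))
    for doc in neg_docs:
        neg_counts.update(set(ngrams(doc.lower().split(), n)))
    scores = {}
    for f in set(pos_counts) | set(neg_counts):
        # Add-one smoothed estimates of P(f | positive) and P(f | negative).
        p_pos = (pos_counts[f] + 1) / (len(pos_docs) + 2)
        p_neg = (neg_counts[f] + 1) / (len(neg_docs) + 2)
        # ASSUMED score: log-ratio. The original formula is not available.
        scores[f] = math.log(p_pos / p_neg)
    return scores

def score_text(text, scores, n=2):
    # Sum the scores of the distinct n-grams in the text; unseen n-grams add 0.
    return sum(scores.get(f, 0.0) for f in set(ngrams(text.lower().split(), n)))

# Hypothetical training texts, for illustration only.
pos_docs = ["foarte bun serviciu", "un produs foarte bun"]
neg_docs = ["foarte slab serviciu", "un produs foarte slab"]
scores = train(pos_docs, neg_docs, n=2)
positive_score = score_text("produs foarte bun", scores)
negative_score = score_text("produs foarte slab", scores)
```

Under this assumed score, a text is labeled positive when the sum is above zero and negative when it is below. N-grams that are equally frequent in both classes, such as "un produs" here, contribute nothing.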
  12. Opinionated Corpus
      • Corpus manually annotated by analysts for their customers (created by Treeworks for their product ZeList, www.zelist.ro)
      • ZeList indexes most of the texts published in Romanian on the most popular social networks, blogs, online forums, news websites, etc.
      • Used data for seven different entities (companies or brands), ranging from banks and beer brands to web publishers and media corporations
      • The names of the entities have been anonymized
  13. Opinionated Corpus
      • Problems:
        – The texts are very noisy and heterogeneous, coming from a wide range of sources with different writing styles (e.g. Twitter vs. news items)
        – Some of them might also express positive or negative publicity rather than opinions
  14. Opinionated Corpus
      • Data about the first version of the corpus
      • The data collection period ranged from a couple of months to a couple of years, depending on the entity
      • The second version contained a larger export of data for each entity

      Entity  Total items  Neutral  Opinionated  Positive  Negative
      Ent1           6055     5853          202        29       173
      Ent2           2240     1961          279       222        57
      Ent3            343      260           83        64        19
      Ent4           1168      876          292       120       172
      Ent5            539      520           19        17         2
      Ent6           1025      570          455       330       125
      Ent7           3787     3016          771       593       178
  15. Results - Outline
      • Results obtained for the first version of the corpus, for all entities
      • The positive-negative accuracy should be the more relevant measure
      • Good results for entities with more data, poor results for those with a small number of opinionated texts

      Entity  Total items  Neutral  Opinionated  Accuracy (opinion-neutral)  Accuracy (positive-negative)
      Ent1           6055     5853          202                      97.01%                        92.07%
      Ent2           2240     1961          279                      91.79%                        87.81%
      Ent3            343      260           83                      84.84%                        89.15%
      Ent4           1168      876          292                      86.22%                        82.19%
      Ent5            539      520           19                      97.40%                        57.89%
      Ent6           1025      570          455                      76.20%                        84.17%
      Ent7           3787     3016          771                      81.75%                        83.65%
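One way to put these accuracies in context is to compare them with a majority-class baseline, that is, the accuracy of always predicting the larger class. This comparison is an addition made here, not something reported on the slides. It uses the reported accuracies together with the neutral, positive and negative counts given for the first version of the corpus, and it says nothing about how the authors' own evaluation was set up.

```python
# (entity, neutral, positive, negative,
#  reported accuracy opinion-neutral, reported accuracy positive-negative)
rows = [
    ("Ent1", 5853, 29, 173, 97.01, 92.07),
    ("Ent2", 1961, 222, 57, 91.79, 87.81),
    ("Ent3", 260, 64, 19, 84.84, 89.15),
    ("Ent4", 876, 120, 172, 86.22, 82.19),
    ("Ent5", 520, 17, 2, 97.40, 57.89),
    ("Ent6", 570, 330, 125, 76.20, 84.17),
    ("Ent7", 3016, 593, 178, 81.75, 83.65),
]

def majority_baseline(*class_counts):
    # Accuracy (in percent) of always predicting the most frequent class.
    return 100.0 * max(class_counts) / sum(class_counts)

baselines = {}
for name, neutral, pos, neg, acc_on, acc_pn in rows:
    base_on = majority_baseline(neutral, pos + neg)  # opinionated vs. neutral
    base_pn = majority_baseline(pos, neg)            # positive vs. negative
    baselines[name] = (base_on, base_pn)
    print(f"{name}: opinion-neutral {acc_on:.2f}% (baseline {base_on:.2f}%), "
          f"positive-negative {acc_pn:.2f}% (baseline {base_pn:.2f}%)")
```

For Ent1, for instance, always answering "neutral" would already score about 96.7%, close to the reported 97.01%. This suggests why the opinion-neutral figure says little when neutral items dominate. For Ent5, with only 19 opinionated texts, the positive-negative baseline is above the reported 57.89%.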
  16. Results - Comparison
      • Comparison of the solutions presented above, using the second (larger) version of the corpus
      • Only for one entity, by extracting a balanced dataset with 700 positive and 700 negative opinionated texts

      Method                                           Accuracy
      BoW + POS filter                                   81.31%
      BoW only adj.                                      70.89%
      BoW only adj. & adv.                               76.60%
      Frequent bigrams                                   80.88%
      Frequent trigrams                                  76.60%
      Affective scores + dependency parsing              52.18%
      Affective scores (comparison with 0 decision)      55.35%
      Trigrams probabilities                             88.44%
      Bigrams probabilities                              72.54%
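Extracting a balanced dataset such as the one used for this comparison can be sketched as follows. The slides do not describe how the 700 + 700 texts were selected, so the random sampling, the fixed seed, the function name, and the synthetic corpus below are all assumptions.

```python
import random

def balanced_sample(items, per_class, seed=0):
    """Draw the same number of positive and negative items.

    items: list of (text, label) pairs, label in {"positive", "negative"}.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for label in ("positive", "negative"):
        pool = [item for item in items if item[1] == label]
        if len(pool) < per_class:
            raise ValueError(f"only {len(pool)} {label} items, need {per_class}")
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)  # avoid having all positives first
    return sample

# Synthetic stand-in for one entity's opinionated texts (hypothetical data).
corpus = ([(f"text {i}", "positive") for i in range(1000)]
          + [(f"text {i}", "negative") for i in range(1000, 1800)])
balanced = balanced_sample(corpus, per_class=700)
```

With balanced classes, always guessing one label scores exactly 50%, which is the natural reference point for the accuracies in the table.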
  17. Conclusions
      • Several alternatives for determining the opinion polarity have been evaluated on a corpus manually annotated for different Romanian entities
      • Best results obtained at this moment: BoW plus a POS filter, or a frequent bigrams approach, with an SVM classifier
      • The Romanian FDG parser does not provide good accuracy for the dependency parsing task, especially for texts from social media
        – Texts are written somewhat freely, with little regard for the usual form or structure
        – Improvements to this method and to the affective words database are still possible
  18. Thank you!
      • Questions?
      • Discussions
      (22.09.13, CSCS 2013 – Bucharest, Romania)
