This document presents research on cross-domain sentiment analysis of the Romanian language. The author compiled a multi-domain Romanian corpus of 38,310 reviews and evaluated popular sentiment analysis models: decision trees, logistic regression, support vector machines, naive Bayes, recurrent neural networks, and BERT. BERT achieved the best performance with 93% validation accuracy, and most models outperformed approaches that rely on translated text. A robustness check with the Romanian-pretrained RoBERT model yielded comparable results. The research aims to advance sentiment analysis for the under-resourced Romanian language.
Cross-domain sentiment analysis of the natural Romanian language
1. Ștefana Cioban
Statistics-Forecasts-Mathematics Department, Faculty of Economics and
Business Administration & Interdisciplinary Centre for Data Science,
Babeș-Bolyai University, Cluj-Napoca, Romania
4. Related work
• Scarce literature on Romanian SA
• To benefit from the techniques developed for English, researchers prefer translation [1], [2]
• SA methodological applications:
  • Lexicons [4], [5], [6]
  • ML: SVM, NB, DT [7], [8], RNN [9], Transformers (Google's BERT) [10]
• Cross-domain SA [3]
5. Methodology
Multi-domain Romanian corpus from 38,310 reviews: LaRoSeDa [11] & a compilation of product and movie reviews [12]
English translation of the document | Label
this director must have been sick when he directed this film | 0
a piece of junk that doesn't have a proper wiring diagram | 0
it is a quality product, and the delivery of the order was made in a short time | 1
very satisfied with a small and powerful phone | 1
Statistic   Label   Word count
Count       38310   38310
Mean        0.5     434.52
Std         0.5     335.04
Min         0       2
25%         0       119
50%         0.5     368.5
75%         1       745
Max         1       6158
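The descriptive statistics above can be reproduced directly from the corpus; a minimal Python sketch, using the four example documents from the table as toy stand-ins for the 38,310 labelled reviews:

```python
from statistics import mean, stdev

# Toy stand-ins for the labelled reviews (0 = negative, 1 = positive)
reviews = [
    ("this director must have been sick when he directed this film", 0),
    ("a piece of junk that doesn't have a proper wiring diagram", 0),
    ("it is a quality product, and the delivery of the order was made in a short time", 1),
    ("very satisfied with a small and powerful phone", 1),
]

word_counts = [len(text.split()) for text, _ in reviews]
labels = [label for _, label in reviews]

print("count:", len(reviews))
print("label mean:", mean(labels))  # 0.5, i.e. perfectly balanced classes
print("word count mean:", mean(word_counts))
print("word count std:", stdev(word_counts))
print("word count min/max:", min(word_counts), max(word_counts))
```

On the full corpus the same computation yields the table's figures, including the 0.5 label mean that signals a balanced dataset.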
6. Methodology
• Text preprocessing
• Training and testing with the most popular models: DT, LR, SVM, NBC, RNN, Transformer (BERT)
• Evaluation: F1, precision, recall, loss, and accuracy stability
7. Findings

             Precision   Recall   F1-score   Support
0 (negative) 0.92        0.94     0.93       3834
1 (positive) 0.94        0.92     0.93       3828
Accuracy                          0.93       7662
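The reported metrics follow the standard definitions; a small sketch showing how precision, recall, and F1 derive from confusion-matrix counts (the counts here are illustrative, not the paper's):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 90 true positives, 10 false positives, 6 false negatives
p, r, f = prf(tp=90, fp=10, fn=6)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```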
• BERT: best performance, with 0.1 loss after 5 epochs, 98% training accuracy, and 93% validation accuracy
• LR and SVC achieve competitive accuracy with fewer resources and faster training
• Most models outperform those trained on translated text [1], [2]
• A direction towards colloquial language in cross-disciplinary domains
9. Findings
• Robustness check:
  • Training and testing using RoBERT [13]
  • Same learning rate and batch size
  • Comparable results: ~93% validation accuracy
• Confirms the fitness of using variations of pretrained transformers for cross-domain Romanian SA
10. Conclusions
• Compilation of a free-speech dataset to serve machine learning applications and the validation of models for sentiment classification in Romanian
• Comparison between ML methods: Google's BERT as best performing
• Further research directions:
  • More annotated documents from other domains
  • Other vectorization techniques besides BOW
  • Comparison with English translations of the texts
11. References
[1] Marcu, D., Danubianu, M.: Sentiment Analysis from Students' Feedback: A Romanian High School Case Study. In: 15th International Conference on Development and Application Systems (DAS), pp. 204-209. IEEE, Suceava, Romania (2020).
[2] Russu, R.M., Vlad, O.L., Dinsoreanu, M., Potolea, R.: An Opinion Mining Approach for Romanian Language. In: 2014 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 43-46. IEEE, Cluj-Napoca, Romania (2014).
[3] Deriu, J.M., Weilenmann, M., von Grunigen, D., Cieliebak, M.: Potential and Limitations of Cross-Domain Sentiment Classification. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp. 14-24. Association for Computational Linguistics, Valencia, Spain (2017).
[4] Bobicev, V., Maxim, V., Prodan, T., Burciu, N., Anghelus, V.: Emotions in Words: Developing a Multilingual WordNet-Affect. In: Gelbukh, A. (ed.) 11th International Conference on Intelligent Text Processing and Computational Linguistics, vol. 6008, pp. 375-384. Springer, Iasi, Romania (2010).
[5] Lupea, M., Briciu, A.: Studying emotions in Romanian words using Formal Concept Analysis. Computer Speech and Language 57, 128-145 (2019).
[6] Gifu, D., Cioca, M.: Detecting Emotions in Comments on Forums. International Journal of Computers Communications & Control 9(6), 694-702 (2014).
12. References
[7] Sun, S.L., Luo, C., Chen, J.Y.: A review of natural language processing techniques for opinion mining systems. Information Fusion 36, 10-25 (2017).
[8] Nassirtoussi, A.K., Aghabozorgi, S., Teh, Y.W., Ngo, D.C.L.: Text mining for market prediction: A systematic review. Expert Systems with Applications 41(16), 7653-7670 (2014).
[9] Schuszter, I.C.: Integrating Deep Learning for NLP in Romanian Psychology. In: 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2018), pp. 237-244. IEEE, Timisoara, Romania (2018).
[10] Google AI Blog: Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing, last accessed 2021/04/14 (2018).
[11] Tache, A.M., Gaman, M., Ionescu, R.T.: Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa – A Large Romanian Sentiment Data Set. arXiv preprint arXiv:2101.04197 (2021).
[12] Katakonst: Sentiment Analysis with Tensorflow, https://github.com/katakonst/sentiment-analysis-tensorflow, last accessed 2021/04/11.
[13] Masala, M., Ruseti, S., Dascalu, M.: RoBERT – A Romanian BERT Model. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6626-6637. Barcelona, Spain (2020).