3. Multi-emotion sentiment classification is a challenging problem in
natural language processing (NLP) with various real-world
applications. In this study, the authors demonstrate the effectiveness
of combining large-scale unsupervised language modeling with fine-tuning
to address this task, even on difficult datasets with label class
imbalance and domain-specific context.
The authors train an attention-based Transformer network on a substantial
amount of text data (specifically, Amazon reviews) and then fine-tune the model on the
training set. Their approach achieves a competitive F1 score of 0.69 on the
SemEval Task 1:E-c multidimensional emotion classification problem, which is
based on the Plutchik wheel of emotions. Notably, the model performs well
on challenging emotion categories such as Fear (0.73), Disgust (0.77), and Anger
(0.78), as well as on rare categories like Anticipation (0.42) and Surprise (0.37).
4. Language models, such as mLSTM and Transformer networks, have shown
impressive performance on academic datasets, surpassing previous
state-of-the-art approaches. However, their performance on practical text classification
tasks using real-world data remains uncertain. In this work, we train mLSTM
and Transformer language models on a large 40GB text dataset and apply
them to binary sentiment analysis and multidimensional emotion classification
tasks. We evaluate our models on both academic datasets and an original
social media dataset, showcasing their performance against state-of-the-art
approaches and commercially available APIs.
Our models achieve state-of-the-art performance on academic datasets
without domain-specific training or excessive hyper-parameter tuning. On the
social media dataset, they outperform commercially available APIs, even
when the APIs are re-calibrated on the test set. The Transformer model generally
outperforms mLSTM, especially in fine-tuning for multidimensional emotion
classification. Fine-tuning significantly improves performance on emotion
tasks for both models. We demonstrate that unsupervised language modeling
combined with fine-tuning offers practical solutions for text classification
problems, including those with large class imbalance and human label
disagreement.
Text classification across domains is challenging due to unknown words,
specialized context, and colloquial language. Models trained on a diverse
text corpus learn to adapt to different contexts and select relevant
features for emotion classification. Our work shows the effectiveness of
language models in specialized text classification problems and unlocks
possibilities for real-world applications.
5. PLUTCHIK'S WHEEL OF EMOTIONS
Our multidimensional emotion classification
focuses on Plutchik's wheel of emotions, which
has been in use since 1979. This taxonomy aims
to classify human emotions based on four
dualities: Joy - Sadness, Anger - Fear, Trust -
Disgust, and Surprise - Anticipation. According
to the basic emotion model (Ekman 2013),
although humans experience numerous
emotions, some emotions are considered more
fundamental than others.
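The four dualities can be written down as a small lookup table. The following is only an illustrative sketch: the names PLUTCHIK_DUALITIES, OPPOSITE, and opposite_of are hypothetical and do not come from the original work.

```python
# The four dualities of Plutchik's wheel, as listed in the text.
PLUTCHIK_DUALITIES = [
    ("Joy", "Sadness"),
    ("Anger", "Fear"),
    ("Trust", "Disgust"),
    ("Surprise", "Anticipation"),
]

# Build a symmetric lookup so each emotion maps to its opposite.
OPPOSITE = {}
for first, second in PLUTCHIK_DUALITIES:
    OPPOSITE[first] = second
    OPPOSITE[second] = first


def opposite_of(emotion: str) -> str:
    """Return the emotion paired with `emotion` in the duality list."""
    return OPPOSITE[emotion]


# The eight primary emotions, in alphabetical order.
PRIMARY_EMOTIONS = sorted(OPPOSITE)

if __name__ == "__main__":
    print(opposite_of("Joy"))
    print(PRIMARY_EMOTIONS)
```

A table like this makes it easy to check that a label set covers all eight primary emotions before training or evaluation.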
In our comparison with IBM's Watson, a
commercial general-purpose emotion
classification API, we evaluate classification
scores for the emotions of Joy, Sadness, Fear,
Disgust, and Anger. These emotions align with
the categories present in Plutchik's wheel
(shown in Figure 1).
6. We used a larger batch size and shorter sequence length (global batch of 512, sequence length of 64 tokens) to train the models
efficiently on tweets, which are short text snippets.
The language models were trained on the Amazon Reviews dataset, chosen for its rich emotional context.
We compared two models: mLSTM and Transformer, both known for their state-of-the-art performance in academic NLP
benchmarks.
Unsupervised pretraining was performed using an encoder-decoder framework to maximize the likelihood of token sequences.
The models were fine-tuned for emotion classification tasks.
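To make the batch configuration above concrete, the sketch below shows how many tokens one global batch holds and how a tweet could be forced to a fixed length. The batch size and sequence length come from the text. The padding id and the pad-or-truncate strategy are assumptions made for illustration, not details confirmed by the source.

```python
GLOBAL_BATCH_SIZE = 512  # from the text
SEQUENCE_LENGTH = 64     # tokens per example, from the text
PAD_ID = 0               # hypothetical padding token id (assumption)


def pad_or_truncate(token_ids, length=SEQUENCE_LENGTH, pad_id=PAD_ID):
    """Force a list of token ids to exactly `length` entries."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))


def tokens_per_batch(batch_size=GLOBAL_BATCH_SIZE, length=SEQUENCE_LENGTH):
    """Number of token positions processed in one global batch."""
    return batch_size * length


if __name__ == "__main__":
    short_tweet = [17, 4, 99]          # made-up token ids
    padded = pad_or_truncate(short_tweet)
    print(len(padded), tokens_per_batch())
```

With these settings each global batch covers 512 x 64 = 32,768 token positions.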
Example formula: The pretraining objective maximizes the log-likelihood of a token sequence, which factorizes by the chain rule as:
\[ \log p(x_0, \ldots, x_n) = \sum_{t=0}^{n} \log p(x_t \mid x_{t-1}, \ldots, x_0) \]
This expresses the joint probability of a sequence as a product of next-token probabilities, so the model is trained to predict each
token from the tokens that precede it.
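The chain-rule view of the sequence likelihood can be checked numerically. The sketch below sums conditional log-probabilities over a sequence. The function names and the toy uniform model are hypothetical stand-ins for a trained language model.

```python
import math


def sequence_log_likelihood(tokens, cond_prob):
    """Sum of log p(x_t | x_{t-1}, ..., x_0) over the sequence."""
    total = 0.0
    for t, token in enumerate(tokens):
        history = tokens[:t]  # all tokens before position t
        total += math.log(cond_prob(token, history))
    return total


def uniform_model(token, history, vocab_size=4):
    """Toy model (assumption): every token is equally likely."""
    return 1.0 / vocab_size


if __name__ == "__main__":
    tokens = [0, 1, 2]
    print(sequence_log_likelihood(tokens, uniform_model))
```

Training maximizes this quantity, which is the same as minimizing its negative, the usual cross-entropy loss.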
In summary, our methodology included efficient training, leveraging emotional context, comparing different models, and
utilizing unsupervised pretraining to achieve effective text classification.
7. RESULTS
Binary Sentiment Tweets:
On the academic SST dataset, the Transformer model performs close to the state-of-the-art but doesn't
exceed it.
On the company tweets dataset, the Transformer model outperforms the mLSTM and ELMo baselines,
as well as both Watson and Google Sentiment APIs, even after optimal calibration of the API results on
the test set.
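One way to picture calibrating an API's output on the test set is to search for the score threshold that gives the highest accuracy on that same set. The source does not spell out the calibration procedure, so the sketch below is an assumption, and the scores and labels are made-up toy data.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Accuracy when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Pick the observed score that maximizes accuracy as the threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at_threshold(scores, labels, t))


if __name__ == "__main__":
    # Toy sentiment scores in [-1, 1] and binary gold labels.
    scores = [-0.8, -0.1, 0.1, 0.3, 0.9]
    labels = [0, 0, 0, 1, 1]
    t = best_threshold(scores, labels)
    print(t, accuracy_at_threshold(scores, labels, t))
```

Because the threshold is tuned on the test set itself, this gives the API a best-case score, which makes it a generous baseline.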
Multi-Label Emotion Tweets:
Comparing our models to Watson on the SemEval dataset and company tweets, our models outperform
Watson on every emotion category, including Anger, Disgust, Fear, Joy, and Sadness.
SemEval Tweets:
Our finetuned Transformer model achieved the top macro-averaged F1 score among all submissions in
the SemEval Task 1:E-c challenge.
- Our model's micro-averaged F1 score and Jaccard index accuracy are slightly lower than those of the
top submissions, which indicates that its advantage comes from relatively higher performance on rare
and difficult categories. Across all models, the most common categories (Joy, Anger, Disgust, and
Optimism) receive relatively higher F1 scores.
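The difference between the three metrics is easier to see on a toy example. The sketch below computes macro-averaged F1, micro-averaged F1, and Jaccard index accuracy for multi-label predictions represented as sets. The data is invented, and scoring an example with empty gold and predicted sets as 1.0 is an assumption.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(gold, pred, categories):
    """Return (macro F1, micro F1, Jaccard accuracy) for label sets."""
    counts = {c: [0, 0, 0] for c in categories}  # tp, fp, fn per category
    for g, p in zip(gold, pred):
        for c in categories:
            if c in g and c in p:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-category F1 scores, so rare ones count equally.
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts first, so frequent categories dominate.
    tp, fp, fn = (sum(counts[c][i] for c in categories) for i in range(3))
    micro = f1(tp, fp, fn)
    # Jaccard: per-example overlap of gold and predicted label sets.
    jaccard = sum(
        len(g & p) / len(g | p) if (g | p) else 1.0 for g, p in zip(gold, pred)
    ) / len(gold)
    return macro, micro, jaccard


if __name__ == "__main__":
    categories = ["Joy", "Surprise"]
    gold = [{"Joy"}, {"Joy"}, {"Joy"}, {"Surprise"}]
    pred = [{"Joy"}, {"Joy"}, {"Joy"}, set()]
    print(multilabel_scores(gold, pred, categories))
```

In this toy case the single missed Surprise label halves the macro F1 while the micro F1 and Jaccard accuracy stay much higher, which is why a model that does well on rare categories can lead on macro F1 without leading on the other two.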
In summary, our models demonstrate strong performance across various sentiment and emotion
classification tasks, outperforming baselines and commercial sentiment analysis APIs in multiple
scenarios.
8. ANALYSIS
1. Classification Performance by Dataset Size:
The experiment showed that the macro average F1 score is more sensitive to dataset size and
falls more quickly than the micro average F1 score.
Categories with worse class imbalance benefit more from having a larger training dataset size,
suggesting that more data can substantially improve results for harder categories.
The difference between single and multihead decoders becomes more pronounced for more
difficult categories and smaller dataset sizes.
2. Dataset Quality and Human Rater Agreement:
The SemEval dataset and the company tweets dataset labeled using a similar technique
showed reasonably good results, validating the labeling approach.
Plutchik category labels have large rater disagreement, even among vetted raters, which can
be attributed to the tendency to label "No Emotion" when unsure about a category.
Datasets with more emotions tend to have higher Plutchik disagreement, possibly due to the
uncertainty of raters when assigning emotions.
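Rater disagreement on a single tweet and category can be quantified in several ways. The sketch below uses simple pairwise agreement and a majority vote. Both are illustrative choices, not necessarily the measures used in the original work, and the votes are made-up data.

```python
from itertools import combinations


def pairwise_agreement(votes):
    """Fraction of rater pairs that gave the same binary label."""
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def majority_label(votes):
    """Label chosen by a strict majority of raters (ties give 0)."""
    return 1 if sum(votes) * 2 > len(votes) else 0


if __name__ == "__main__":
    # Three raters judging whether one tweet expresses a given emotion.
    votes = [1, 1, 0]
    print(pairwise_agreement(votes), majority_label(votes))
```

If unsure raters tend to fall back on "No Emotion", votes like [1, 1, 0] become common and pairwise agreement drops even when the majority label is reasonable.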
3. Difficult Tweets and Challenging Contexts:
General-purpose APIs may not work well on the company tweets dataset due to the
context-specific nature of emotion classification.
Examples of disagreements between human raters and the Watson API in the video game
context highlight the challenge of ascribing negative sentiment to terms that are not inherently
negative in that context.
Training a large unsupervised model and finetuning with a small amount of labeled data
specific to the dataset's context may yield better results.
4. Multiple Softmax Outputs and Transformer Features:
Training models with multiple softmax outputs can improve performance on language
modeling by capturing a larger number of distinct contexts in the text.
The Transformer model may also capture features relevant to a wide range of contexts, and
finetuning helps select the most significant features for a specific setting, while disregarding
irrelevant features.
In summary, the analysis highlights the importance of dataset size, rater agreement,
context-specific challenges, and the potential benefits of using unsupervised models with finetuning to
improve classification performance in challenging text classification tasks.
9. CONCLUSION:
This work demonstrates the effectiveness of unsupervised pretraining and finetuning in
tackling difficult text classification tasks. The Transformer network, in particular, showed
remarkable performance when adapting to downstream tasks with noisy labels and specialized
context.
The framework presented here offers flexibility and ease of customization for text classification
on niche tasks. Unsupervised language modeling on general text datasets without labels allows
for pretraining, while finetuning with a small amount of domain-specific labeled data proves to
be effective in transferring to downstream tasks.
This approach holds great potential for various practical text classification problems. Similar to
the successes of language modeling and transfer in academic text understanding tasks on the
GLUE Benchmark, it is anticipated that this framework can be applied to a wide range of
real-world text classification challenges.
By leveraging the power of unsupervised pretraining and fine-tuning, this framework opens
doors for academics and small organizations to tackle text classification problems with limited
labeled data, enabling advancements in diverse domains of natural language processing.
10. Al-Rfou, R.; Choe, D.; Constant, N.; Guo, M.; and Jones, L. 2018.
Character-level language modeling with deeper self-attention. CoRR
abs/1808.04444.
Baziotis, C.; Athanasiou, N.; Chronopoulou, A.; Kolovou, A.;
Paraskevopoulos, G.; Ellinas, N.; Narayanan, S.; and Potamianos,
A. 2018. NTUA-SLP at SemEval-2018 Task 1: Predicting affective
content in tweets with deep attentive RNNs and transfer learning.
CoRR abs/1804.06658.
Dai, A. M., and Le, Q. V. 2015. Semi-supervised sequence learning.
CoRR abs/1511.01432.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT:
Pre-training of deep bidirectional transformers for language
understanding.
Ekman, P. 2013. An argument for basic emotions.
Gray, S.; Radford, A.; and Kingma, D. P. 2017. GPU kernels for
block-sparse weights.
Howard, J., and Ruder, S. 2018. Fine-tuned language models for
text classification. CoRR abs/1801.06146.
Khetan, A.; Lipton, Z. C.; and Anandkumar, A. 2017. Learning from
noisy singly-labeled data. CoRR abs/1712.04577.
Krause, B.; Lu, L.; Murray, I.; and Renals, S. 2016. Multiplicative
LSTM for sequence modelling. CoRR abs/1609.07959.