Kaggle Tweet Sentiment Extraction: 1st place solution

Tweet
Sentiment
Extraction
1st place solution
by Artsem Zhyvalkouski
from Dark of the Moon

Agenda
1. Background
2. Competition overview
3. Solution summary
4. The “Magic”
5. Other solutions
6. Conclusion

Self-introduction ● Artsem Zhyvalkouski
● From: Minsk, Belarus 🇧🇾
● CS student @ Tokyo City University 🇯🇵
● Did R&D in CV / NLP at a startup
● Love reading NLP papers
● Enjoy learning languages
○ Fluent in 🇧🇾🇷🇺🇬🇧🇯🇵
○ Interested in 🇫🇷🇰🇷

Motivation Why did I start working on this competition?
● To familiarize myself with current SOTA in NLP:
Transformer-based models
● To learn PyTorch🔥 and HuggingFace🤗 libraries
● The dataset is rather small so no need in powerful
machines: Colab Pro was enough
● The task seemed fun / unusual and I had time for it

Teammates ● Théo
○ Student in applied Mathematics & Machine Learning
○ 10th in Google Quest Q&A
● Anton
○ Works in IT
○ 10th in Google Quest Q&A
● Hikkiiii
○ NLP / QA background
○ 10th in TensorFlow 2.0 Question Answering

Task & data ● Task: for a given tweet, predict what word or phrase best
supports the labeled sentiment
● Application example: some businesses may want to know
exactly why people think something about their product
● Data
○ Train: 27k tweets
○ Public / private test: 4k / 8k tweets
text (given) sentiment (given) selected_text (target)
I really really like the
song Love Story by
Taylor Swift
positive like
i need to get my
computer fixed
neutral i need to get my computer
fixed
Sooo SAD I will miss you
here in San Diego!!!
negative Sooo SAD

Evaluation ● Metric: word-level Jaccard score

Problem with
labels
● Some labels are noisy
text (given) sentiment (given) selected_text (target)
hey mia! totally adore
your music. when will
your cd be out?
positive y adore
I know It was worth a
shot, though!
positive as wort
the exact one i was
thinking of the bestttt.
positive e bestttt

Transformers ● Transformers like BERT, RoBERTa, BART etc. have
become default in SOTA NLP
● Pretrained on a huge amount of texts
● Somewhat heavy and long to train
● Can be used in either NER or QA setup for this task
● QA worked best

Leveraging the
QA setup
● Question: sentiment
● Answer: support phrase

My models:
summary
● RoBERTa-base-squad2, RoBERTa-large-squad2,
DistilRoBERTa-base, XLNet-base-cased
● Pretraining on SQuAD 2.0
○ Task pretraining works
● Avg / Max of layers w/o embedding layer
● Multi Sample Dropout
● AdamW with linear warmup schedule
● Custom loss: Jaccard-based Soft Labels
● Best single model: RoBERTa-base-squad2, 5 fold
stratified CV: 0.715

My models:
architecture
Layer 1
Layer n
...
Embeddings
Transformer
Head
MSD + Dense
Sentiment Sentence Tokens<s> </s> </s> </s>
Probabilities of the token being the start of the selected text
Probabilities of the token being the end of the selected text
MaxPoolAvgPool

Multi Sample
Dropout
● Multi-Sample Dropout for Accelerated Training and
Better Generalization
(https://arxiv.org/pdf/1905.09788.pdf)
*Image from Jigsaw Unintended Bias in Toxicity Classification 8th place solution

Optimizer and
schedule
● AdamW optimizer: Decoupled Weight Decay
Regularization
● Linear warmup schedule

My models:
custom loss
● Jaccard-based Soft Labels
● Modified version

Théo’s models ● Transformers
○ BERT-base-uncased
○ BERT-large-uncased-wwm
○ DistilBERT
○ ALBERT-large-v2
● Architecture
○ MSD on the concatenation of the last 8 hidden states
● Training
○ Smoothed categorical cross-entropy
○ Discriminative learning rate
○ Sequence bucketing to speed up the training

Anton’s models
● Transformers
○ RoBERTa-base
○ BERTweet: A pre-trained language model for English
Tweets (https://arxiv.org/pdf/2005.10200.pdf)
● Architecture
○ Same as Théo’s
● Training
○ Smoothed categorical cross-entropy
○ Discriminative learning rate
○ Custom merges.txt file for RoBERTa

Hikkiiii’s models
● Transformers
○ RoBERTa-base
○ RoBERTa-large
● Architecture
○ Append sentiment token to the end of the text
○ CNN + Linear layer on the concatenation of the last
3 hidden states
● Training
○ Standard cross-entropy loss

● Are we done? 🤗
○ Transformers are token-level, hence we
can’t capture the noisy pattern
○ No obvious way to make transformers
character-level
○ Character-level RNNs are not even
nearly as powerful as transformers
○ We can’t simply blend models with
different tokenizations
Problems

● Solution: stacking to the rescue!
○ Convert token probabilities from
transformers to char-level by assigning
each char the probability of its token
○ Feed OOF char-level probabilities from
several transformers into a char-level
NN using stacking
Stacking

Stacking
token level start & end proba
Transformer
target :
Start / end tokens
<sos> Sentiment <sep> Tokens <sep>
char level start & end proba
(token based)
Char NN
target :
Start / end chars
using offsets
char level start / end proba
Characters
…
n models
Concatenate
token level start & end proba
Transformer
target :
Start / end tokens
<sos> Sentiment <sep> Tokens <sep>
char level start & end proba
(token based)
Start & end featuresSentiment
using offsets

Char-level NN:
RNN
Start & end probas
Bidirectional LSTM Embedding
Characters
Embedding
Sentiment
Bidirectional LSTM x2 with skip connection
MSD + Linear
Softmax
Start & end probas

Char-level NN:
CNN
Start & end probas
Conv1D +
BatchNorm
Embedding
Characters
Embedding
Sentiment
Conv1D +
BatchNorm
x4
MSD + Linear
Softmax
Start & end probas

Char-level NN:
WaveNet
Start & end probas
Conv1D +
BatchNorm
Embedding
Characters
Embedding
Sentiment
WaveBlock +
BatchNorm
x3
MSD + Linear
Softmax
Start & end probas

Char-level NNs:
details ● Adam optimizer
● Linear learning rate decay without warmup
● Smoothed Cross Entropy Loss
● Stochastic Weighted Average: Averaging Weights Leads
to Wider Optima and Better Generalization
● Select the whole text if predicted start_idx > end_idx

An obvious step ● So now we have a lot of different 1st level models
and different 2nd level architectures
● If you participated in a tabular data competition, an
obvious next step is...

1st submission
Public LB : 0.734 (#3)
Private LB : 0.736 (#1)
RoBERTa-large
CV 0.715
XLNet-base-cased
CV 0.707
RoBERTa-base
CV 0.715
DistilRoBERTa-base
CV 0.713
DistilBERT-base-uncased
CV 0.705
BERT-base-uncased
CV 0.710
BERT-large-uncased-wwm
CV 0.710
ALBERT-large-v2
CV 0.711
BERTweet
CV 0.711
RoBERTa-base
CV 0.715
RoBERTa-base
CV 0.712
RoBERTa-large
CV 0.714
CNN
CV 0.7342
CNN
CV 0.7335
WaveNet
CV 0.7347
WaveNet
CV 0.7330
Average
CV
0.7363
1st level
Transformers
2nd level
Char-level NNs

2nd submission
Public LB : 0.734 (#3)
Private LB : 0.735 (#1)
RoBERTa-large
CV 0.715
XLNet-base-cased
CV 0.707
RoBERTa-base
CV 0.715
DistilRoBERTa-base
CV 0.713
DistilBERT-base-uncased
CV 0.705
BERT-base-uncased
CV 0.710
BERT-large-uncased-wwm
CV 0.710
ALBERT-large-v2
CV 0.711
BERTweet
CV 0.711
RoBERTa-base
CV 0.715
RoBERTa-base
CV 0.712
RoBERTa-large
CV 0.714
CNN
CV 0.7342
WaveNet
CV 0.7337
RNN
CV 0.7343
WaveNet
CV 0.7335
Average
CV
0.7365
1st level
Transformers
2nd level
Char-level NNs

Pseudo-labeling ● We used one of our CV 0.7354 blends to pseudo-label the public
test data
● Approach from
the Google Quest Q&A 1st place
solution: “leakless” pseudo-labels
● Confidence score:
(start_probas.max() + end_probas.max()) / 2
● Threshold=0.35 to cut off
low-confidence samples
● This gave a pretty robust boost
of 0.001-0.002 for many models
*Image from https://datawhatnow.com/pseudo-labeling-semi-supervised-
learning/

Things that
didn’t work
● NER approach
● BCE + SoftIOU loss like in image segmentation
● Sample weighting
● T5, BART, GPT-2, ELECTRA, XLM-RoBERTa
● Char-level transformer
● Pre-trained embeddings: FastText, Flair
● Statistical features: num_words, num_spaces etc.
● XGBoost as 2nd level
● 3rd level models
● Pre/post-processing

Finding the
“Magic”
● The noise in the label comes from consecutive spaces
Selected text :
onna
Original text :
is _ back _ home _ now _ _ _ _ _ _ gonna _ miss _ every _ one
● We assumed they were removed during annotation
Text with spaces cleaned :
is _ back _ home _ now _ gonna _ miss _ every _ one
Annotation, on the cleaned text :
➔ Stores the start and end indices (?)
● Which results in problems when retrieving the labels on the original text
Retrieved label, on the original text :
➔ 5 removed spaces offsets the label by 5 characters

Using the
“Magic”
● Then, we can post-process our predictions to retrieve the noise
Selected text:
onna
Original text :
● We use a transformer to get the start/end token
We use it on the cleaned text :
Assuming the model perfectly predicts “miss”, perfect start and end predictions would look this way :
0 … 0 1 1 1 1 0 ... 0
0 … 0 1 1 1 1 0 ... 0
Because transformer work at token level, the whole word is gonna be selected
● Finally, we can align those predictions with the original text
Prediction on the noisy data :
0 … 0 1 1 1 1 0 ... 0
0 … 0 1 1 1 1 0 ... 0
● Which matches the noisy label !

Why we didn’t
use the “Magic” ● We found the pattern pretty late and didn’t have
enough time to leverage it directly
● Eventually, our 2nd level models learnt the pattern
even better then simple pre/post-processing

2nd place
solution
● Pre/post-process following the “Magic”
● Sample tweets equally according to their sentiment to
mitigate the imbalance within batches
● Reranking model
1. Store top n candidates from the base model and
assign a “step_1_score” accordingly
2. Train a RoBERTa to predict Jaccard for the
candidates: “step_2_score”
3. Choose the best one by:
step_2_score + step_1_score * 0.5

3rd place
solution: 1st
model
Training
Inference
● Results in k*k combinations
● Argmax over multiplications

Conclusion
● Transformers are perfect for QA
● PyTorch🔥 & HuggingFace🤗 are awesome
● Transformers can be extended to char-level
● Diversity rules
● Annotation process is crucial
● Teaming up is rewarding
● Kaggle is good for learning but impractical sometimes
● Kaggle community is amazing

My links
● Follow me on
○ LinkedIn: https://www.linkedin.com/in/zhyvalkouski
○ Kaggle: https://www.kaggle.com/aruchomu
○ Twitter: https://twitter.com/artem_aruchomu
○ GitHub: https://github.com/heartkilla
● I’m open to new opportunities!

Thanks for listening, stay safe and happy kaggling!

Kaggle Tweet Sentiment Extraction: 1st place solution

More Related Content

What's hot

Similar to Kaggle Tweet Sentiment Extraction: 1st place solution

Recently uploaded

Kaggle Tweet Sentiment Extraction: 1st place solution