Toxicity Detection for Indic Multilingual Social
Media Content
Manan Jhaveri
NMIMS University
Devanshu Ramaiya
NMIMS University
Harveen Singh Chadha
Thoughtworks
Abstract—Toxic content is one of the most critical issues for social media platforms today. India alone had 518 million social media users in 2020. In order to provide a good experience to content creators and their audience, it is crucial to flag toxic comments and the users who post them. A major challenge is identifying toxicity in low-resource Indic languages because the same text can appear in multiple representations. Moreover, posts and comments on social media do not adhere to a particular format, grammar or sentence structure, which makes the task of abuse detection even more challenging for multilingual social media platforms. This paper describes the system proposed by team 'Moj Masti' using the data provided by ShareChat/Moj in the IIIT-D Multilingual Abusive Comment Identification challenge. We focus on how multilingual transformer-based pre-trained and fine-tuned models can be leveraged for code-mixed/code-switched classification tasks. Our best-performing system was an ensemble of XLM-RoBERTa and MuRIL, which achieved a Mean F1-score of 0.9 on the test data/leaderboard. We also observed an increase in performance from adding transliterated data. Furthermore, using weak metadata, ensembling and some post-processing techniques boosted the performance of our system, placing us 1st on the leaderboard.
I. INTRODUCTION
Toxic comments are rude, insulting and offensive comments that can severely affect a person's mental health; severe cases can also be classified as cyber-bullying. Such comments often instill insecurities in young people, leading them to develop low self-esteem and suicidal thoughts. This abusive environment also dissuades people from expressing their opinions in the comment section, which is supposed to be a safe and supportive space for productive discussions. Young children learn the wrong idea that adopting profane language will help them seek attention and become more socially acceptable. Hence, flagging inappropriate content on social media is extremely important for creating a healthy social space.
In this paper, we discuss how we leveraged AI to solve this task. The task is particularly challenging because of the lack of useful data for training powerful models. We used state-of-the-art transformer models pretrained on the MLM task and fine-tuned with different architectures. Furthermore, we discuss some creative post-processing techniques that helped enhance the scores.
II. TASK OVERVIEW
A. Problem Formulation
The IIIT-D Multilingual Abusive Comment Identification challenge is aimed at combating abusive comments, in multiple regional Indian languages, on Moj, one of India's largest social media platforms. This paper addresses a novel research problem focused on improving the social space for members of the social media community: abusive comment identification on social media in low-resource Indian languages.
There are multiple challenges to combat when dealing with multilingual text data. The main challenges in this task include:
• There is a lack of literature and grammar resources despite millions of native speakers using these languages. Building NLP algorithms with such limited basic lexical resources is highly challenging.
• Not all Indic languages fall into the same linguistic family: (1) the Indo-Aryan family includes Hindi, Marathi, Gujarati, Bengali, etc.; (2) the Dravidian family consists of Tamil, Telugu, Kannada and Malayalam; (3) tribes of Central India speak Austric languages; and (4) Sino-Tibetan languages are spoken by tribes of North-East India.
• Posts and comments on social media do not adhere to a particular format, grammar or sentence structure. They are short, incomplete, and filled with slang, emoticons and abbreviations.
B. Data Description
In this challenge, the data provided was split into two parts: training data with 665k+ samples and test data with 74.3k+ samples. Key novelties of this dataset include:
• Natural-language comment text in 13 Indic languages, labeled as Abusive (312k samples) or Not Abusive (352k samples). Fig. 1 shows that Hindi is the most common language.
• The data is human-annotated.
• Metadata-based explicit feedback: post identification number, report count of the comment, report count of the post, count of likes on the comment, and count of likes on the post.
III. MODEL BUILDING APPROACH AND EVALUATION
Following the research on the attention mechanism [10] and the groundbreaking BERT model [4], the NLP space has been revolutionized, and state-of-the-art Transformers have become the go-to option for almost all NLP tasks. To tackle this multilingual task, we chose the following models:
Fig. 1. Language distribution in the train dataset
• XLM-RoBERTa - a transformer-based masked language model trained on 100 languages, using more than two terabytes of filtered CommonCrawl data [3].
• MuRIL - a multilingual language model built specifically for Indic languages, trained on significantly large amounts of Indian text corpora with both transliterated and translated document pairs that serve as supervised cross-lingual signals during training [7].
• mBERT - Multilingual BERT is a transformer-based language model trained on raw Wikipedia text in 104 languages. The model is contextual, and its training requires no supervision: no alignment between the languages is performed [5].
• IndicBERT - a model trained on IndicCorp and evaluated on IndicGLUE. Similar to mBERT, a single model is trained on all Indian languages in the hope of exploiting the relatedness among Indian languages [6].
• RemBERT - a model pretrained on 110 languages with a Masked Language Modeling (MLM) objective. Its main difference from mBERT is that the input and output embeddings are not tied; instead, RemBERT uses small input embeddings and large output embeddings [2].
The evaluation metric for this task is the Mean F1-score.
A. Pre-training
Pre-training with a Masked Language Modeling (MLM) objective [9] has been one of the most popular and successful approaches for downstream tasks, largely due to its ability to model higher-order word co-occurrence statistics [8].
Since MuRIL is already a BERT encoder pre-trained on Indic languages with an MLM objective, we performed MLM pre-training only on the XLM-RoBERTa models (3 epochs for the Base variant and 2 epochs for the Large variant). We set aside 10% of the test data to evaluate the pre-training step. Table I summarizes this step, and a minimal sketch of the procedure follows the table. On average, the test F1-score on the downstream task was approximately 0.87 without MLM pre-training and 0.88 with it, using the same settings to train the fine-tuned model in both cases. This led us to conclude that MLM on the given data certainly helps the downstream tasks.
TABLE I
VALIDATION LOSS FOR MLM PRE-TRAINING

Info              XLM-RoBERTa Base   XLM-RoBERTa Large
Accelerator       Tesla P100 16GB    Tesla T4 16GB
Time Taken (hr)   11.26              35.3
Epoch 1           3.3819             3.0549
Epoch 2           3.3222             2.8466
Epoch 3           3.2946             -
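To make the MLM pre-training step concrete, the following is a minimal sketch using the Hugging Face Transformers Trainer. The file names, the commentText column, the 15% masking probability and the batch size are illustrative assumptions, not the authors' published configuration.

```python
# Hedged sketch of domain MLM pre-training for XLM-RoBERTa.
# Assumes train.csv/test.csv with a "commentText" column (illustrative names).
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

train_texts = pd.read_csv("train.csv")["commentText"].astype(str).tolist()
test_texts = pd.read_csv("test.csv")["commentText"].astype(str).tolist()
# Hold out 10% of the test data to evaluate the pre-training step.
split = int(0.9 * len(test_texts))
mlm_train = Dataset.from_dict({"text": train_texts + test_texts[:split]})
mlm_eval = Dataset.from_dict({"text": test_texts[split:]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

mlm_train = mlm_train.map(tokenize, batched=True, remove_columns=["text"])
mlm_eval = mlm_eval.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens and builds the MLM labels on the fly.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-mlm",
                           num_train_epochs=3,   # 3 for Base, 2 for Large
                           per_device_train_batch_size=32),
    train_dataset=mlm_train,
    eval_dataset=mlm_eval,
    data_collator=collator,
)
trainer.train()
print(trainer.evaluate())  # validation loss, as tracked in Table I
```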
B. Data Augmentation
We performed data augmentation by adding transliterated data. We first removed emojis from the text and then used uroman (https://github.com/isi-nlp/uroman) to generate 219,114 additional transliterated samples. We also transliterated the test dataset and made the original plus transliterated dataset available on Kaggle (https://www.kaggle.com/harveenchadha/iitdtransliterated).
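A minimal sketch of this augmentation step follows, assuming uroman is cloned locally, one comment per line, and a commentText column; the script path, file name and emoji ranges are illustrative.

```python
# Hedged sketch: strip emojis, then romanize the text by piping it through uroman.
import re
import subprocess
import pandas as pd

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # symbols, pictographs, emoticons
    "\U00002700-\U000027BF"   # dingbats
    "\U0001F1E6-\U0001F1FF"   # regional-indicator (flag) pairs
    "]+"
)

def strip_emojis(text: str) -> str:
    return EMOJI_PATTERN.sub(" ", text).strip()

def transliterate(texts):
    """Romanize a list of strings with uroman (one line in, one line out)."""
    joined = "\n".join(strip_emojis(t) for t in texts)
    result = subprocess.run(["perl", "uroman/bin/uroman.pl"],  # assumed path
                            input=joined, capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()

train = pd.read_csv("train.csv")                   # illustrative file name
augmented = train.copy()
augmented["commentText"] = transliterate(train["commentText"].astype(str).tolist())
train_plus_translit = pd.concat([train, augmented], ignore_index=True)
```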
C. Model training
We used two main model architectures: in the first, the original architecture, we took the transformer output and computed the probabilities directly; in the second, we added a custom attention head over the transformer output before computing the probabilities (a sketch follows below).
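The paper does not spell out the head's exact design, so the following is a hedged PyTorch sketch of one common variant: a small scorer assigns a weight to each token, and the classifier acts on the weighted average of token embeddings instead of the [CLS] vector alone.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class AttentionHeadClassifier(nn.Module):
    """Transformer encoder + attention-pooling head (illustrative design)."""
    def __init__(self, model_name="xlm-roberta-large"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        # One scalar relevance score per token position.
        self.attention = nn.Sequential(nn.Linear(hidden, 512),
                                       nn.Tanh(),
                                       nn.Linear(512, 1))
        self.classifier = nn.Linear(hidden, 1)  # abusive / not-abusive logit

    def forward(self, input_ids, attention_mask):
        states = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        scores = self.attention(states)                       # (B, T, 1)
        # Padding tokens get a large negative score so softmax ignores them.
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e4)
        weights = torch.softmax(scores, dim=1)                # sums to 1 over T
        pooled = (weights * states).sum(dim=1)                # (B, H)
        return self.classifier(pooled)
```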
We also experimented with several truncation lengths for the input text. We tried lengths of 64, 96, 128 and 256, and finally settled on 128.
From Table II, we found that MLM pre-training and transliterated data had a positive impact on performance. Moreover, a custom attention head boosted the score as well. XLM-RoBERTa was the best performer, followed by MuRIL. We also observed a slight difference between GPU and TPU accelerators: models trained on GPU tended to perform better. But since experimentation time on TPU was lower, we used it for most experiments. We used wandb [1] to track most of our experiments, and we are also releasing our experiment logs (https://wandb.ai/harveenchadha/iitd-private).
As mentioned above, we also experimented with other transformer models but did not achieve satisfactory results. RemBERT has outperformed XLM-RoBERTa on various tasks but underperformed here. mBERT also gave meagre results. Surprisingly, IndicBERT was the worst performer for this task.
Note: we transliterated the test data as well. So, whenever a model was trained with transliterated data, inference was run on both the original and the transliterated text, and the two probability sets were combined in a 7:3 ratio to form the final prediction.
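The 7:3 combination then reduces to a weighted average over the two per-sample probability arrays; in the sketch below, dummy values stand in for the two inference runs.

```python
import numpy as np

# Probabilities from the same model run on the two test views (dummy values).
prob_original = np.array([0.92, 0.10, 0.55])
prob_transliterated = np.array([0.88, 0.20, 0.40])
final_prob = 0.7 * prob_original + 0.3 * prob_transliterated   # 7:3 blend
```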
TABLE II
MODEL EXPERIMENTATION SUMMARY
Id Model Data Accelerator Validation Strategy CV Test Score
1 XLM-RoBERTa Base Original TPU v3-8 10% split 0.854 0.87674
2 XLM-RoBERTa Large Original TPU v3-8 10% split 0.8601 0.8819
3 XLM-RoBERTa Base Original + Transliterated TPU v3-8 10% split 0.86 0.87989
4 XLM-RoBERTa Large Original + Transliterated TPU v3-8 10% split 0.8626 0.88238
5 XLM-RoBERTa Large + MLM Original TPU v3-8 10% split 0.8631 0.88291
6 XLM-RoBERTa Large + MLM Original + transliterated TPU v3-8 10% split 0.863 0.88316
7 XLM-RoBERTa Base + MLM + Attention head Original Tesla P100 - 16GB 4-Fold 0.866075 0.8814
8 XLM-RoBERTa Large + MLM + Attention head Original + transliterated RTX5000 24GB 10% split 0.8669 0.88378
9 MuRIL Base Original TPU v3-8 10% split 0.8520 0.87539
10 MuRIL Large Original TPU v3-8 10% split 0.8546 0.87821
11 XLM-R Base + MLM + last 5 hidden states average Original Tesla P100 - 16GB 4-Fold 0.8662 0.88149
12 XLM-RoBERTa + MLM Original + transliterated TPU v3-8 4-Fold 0.86825 0.8872
13 XLM-RoBERTa Large Original TPU v3-8 10-Fold 0.85759 0.8829
14 RemBERT (Truncation Length = 64) Original Tesla P100 10% split 0.8529 0.877
15 RemBERT Original TPU v3-8 10% split 0.8397 0.8678
16 mBERT Original TPU v3-8 10% split 0.8474 0.8724
17 mBERT (Truncation Length = 64) Original Tesla P100 4-Fold 0.8435 0.86327
18 IndicBERT Original TPU v3-8 10% split 0.8306 0.8456
D. Ensembling
We wanted to select diverse models so that each model makes different mistakes; by combining the learnings of diverse models, we can get a better result. Hence, after a few experiments, we went forward with the following Ids from Table II: 2, 6, 8, 9, 12, 13. We gave equal weight to all the models, and our test score was 0.88756. After raising the probability inference threshold to 0.55, the score increased to 0.88827.
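As a sketch, with each selected model's test probabilities stacked column-wise (random values stand in for the real predictions):

```python
import numpy as np

# Columns = test probabilities of the six selected models (Ids 2, 6, 8, 9, 12, 13);
# random values stand in for the real per-model predictions.
rng = np.random.default_rng(0)
probs = rng.random((74300, 6))

ensemble_prob = probs.mean(axis=1)                 # equal weights
prediction = (ensemble_prob >= 0.55).astype(int)   # raised inference threshold
```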
E. Using Metadata
As mentioned previously, we were also provided with weak metadata, which we utilized in the following way.
1) Feature Engineering: Due to variation in capture timestamps, the same post carried different report and like counts across samples.
For example, consider a post with post index = 1 that has one comment. When this comment was captured for the dataset, the post had 5 likes and 10 reports. A few minutes later, when a new comment was uploaded and captured, the counts had changed to 10 likes and 20 reports. There are now two samples with post index = 1 in the dataset, but the report and like counts recorded for them differ because of the timestamp variation.
To deal with this, we created mean-aggregated and maximum-aggregated features for the report count and like count of posts as well as comments.
We also added the character length of the comment, the token length of the comment, and the probabilities from the ensemble as features. We used this data to train boosting classifiers, along the lines of the sketch below.
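The following is a hedged sketch of this feature engineering, assuming illustrative column names for the metadata fields and XGBoost as one of the boosting models; the comment-level counts would be aggregated the same way.

```python
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")   # illustrative file and column names throughout

# Mean/max aggregates per post smooth out the timestamp-dependent counts.
post_agg = (df.groupby("post_index")[["post_report_count", "post_like_count"]]
              .agg(["mean", "max"]))
post_agg.columns = ["_".join(col) for col in post_agg.columns]
df = df.merge(post_agg, left_on="post_index", right_index=True, how="left")

df["char_len"] = df["commentText"].str.len()
df["token_len"] = df["commentText"].str.split().str.len()
df["ensemble_prob"] = 0.5   # placeholder: probabilities from the ensemble above

features = [c for c in df.columns
            if c not in ("commentText", "label", "post_index")]
clf = XGBClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(df[features], df["label"])
```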
2) Post-processing: As per our findings, raising the threshold gave us a boost, so we decided to tune the threshold for each language. After experimenting with several thresholds, we found that the values in Table III gave the best score.
TABLE III
LANGUAGE-WISE INFERENCE THRESHOLDS
Language Threshold
Marathi 0.6
Malayalam 0.5
Hindi 0.54
Telugu 0.6
Tamil 0.5
Odia 0.5
Gujarati 0.4
Bhojpuri 0.6
Haryanvi 0.5
Assamese 0.45
Kannada 0.6
Rajasthani 0.5
Bengali 0.6
Nearly all the threshold values we tested were multiples of 0.05, so edge cases close to a threshold might still be misclassified; to account for this, we increased the predicted probabilities by 0.01, as in the sketch below.
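Concretely, the post-processing reduces to a per-language threshold lookup; the sketch assumes a language column with the labels used in Table III, and the dummy rows show the 0.01 nudge flipping a borderline Hindi prediction.

```python
import pandas as pd

# Language-wise thresholds from Table III.
THRESHOLDS = {
    "Marathi": 0.6, "Malayalam": 0.5, "Hindi": 0.54, "Telugu": 0.6,
    "Tamil": 0.5, "Odia": 0.5, "Gujarati": 0.4, "Bhojpuri": 0.6,
    "Haryanvi": 0.5, "Assamese": 0.45, "Kannada": 0.6,
    "Rajasthani": 0.5, "Bengali": 0.6,
}

df = pd.DataFrame({"language": ["Hindi", "Tamil"], "prob": [0.53, 0.48]})
shifted = df["prob"] + 0.01                        # edge-case correction
cutoff = df["language"].map(THRESHOLDS)
df["prediction"] = (shifted >= cutoff).astype(int)
```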
Table IV shows the performance of the different boosting models. XGBoost and CatBoost gave the highest scores. We combined the probabilities from these two models in an XGBoost-to-CatBoost ratio of 6:4 and achieved our final best score of 0.90005.
TABLE IV
POST-PROCESSING MODELS SUMMARY

Model     CV Average   Test Score
LightGBM  0.9051       0.89933
XGBoost   0.9051       0.89988
CatBoost  0.9052       0.89980

IV. CONCLUSION
In recent times, social media has become a hub for information exchange and entertainment. The results of this paper can be used to build systems that flag toxicity and provide users with a healthier experience. Surprisingly, MuRIL, which is trained specifically on Indian text, did not perform well individually (in both its Base and Large variants), but it gave us an edge in the ensemble because it added to model diversity. For the same reason, we also added models without the additional MLM pre-training to the ensemble. The metadata by itself was very weak, but combining it with the transformer output probabilities improved our predictions substantially. Even small tweaks, such as increasing the probabilities by 0.01 and using language-wise inference thresholds, had a notable impact on the final system.
ACKNOWLEDGMENT
We would like to thank Moj and ShareChat for providing
the data and organizing the Multilingual Abusive Comment
Identification Challenge along with IIIT-Delhi. We would
also like to thank Kaggle for providing access to TPUs and
jarvislabs.ai for providing credits to train the final model.
REFERENCES
[1] Lukas Biewald. Experiment tracking with Weights and Biases, 2020. Software available from wandb.com.
[2] Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking embedding coupling in pre-trained language models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[3] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale, 2020.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[5] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. Cross-lingual ability of multilingual BERT: An empirical study, 2020.
[6] Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online, November 2020. Association for Computational Linguistics.
[7] Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, and Partha Talukdar. MuRIL: Multilingual representations for Indian languages, 2021.
[8] Koustuv Sinha, Robin Jia, Dieuwke Hupkes, Joelle Pineau, Adina Williams, and Douwe Kiela. Masked language modeling and the distributional hypothesis: Order word matters pre-training for little, 2021.
[9] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation, 2019.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
