[DSC Croatia 22] Experience in collaboration between academia and industry: NLP solutions for infodemic management - Ana Mestrovic & Mladen Fernezir

Experience in collaboration between
academia and industry:
NLP solutions for infodemic management
Ana Meštrović, Faculty of Informatics and Digital Technologies, University of Rijeka
Mladen Fernežir, Velebit AI

Overview
• InfoCoV project
• Project results
• Implementation

InfoCoV project
InfoCoV: Multilayer Framework for the Information Spreading
Characterization in Social Media during the COVID-19 Crisis
• Croatian Science Foundation - HRZZ
• 15 June 2020 – 14 January 2022
• Collaboration with Velebit AI
• Information monitoring
• COVID-19 texts in social media
• Research: NLP & SNA

Can AI help us in infodemic management?
• AI → analysis of large amounts of text
• Machine learning, neural networks, ...
• NLP tasks
• Keyword extraction
• Named entity recognition (NER)
• Topic modelling
• Text classification
• Sentiment analysis
• Fake news detection
• Multilayer framework
• Social network analysis
• Dynamics and spreading

Data
Dataset | Description | Size
Cro-CoV-texts | Texts collected from online portals | > 186,738 articles
Cro-CoV-comm | Users' comments on COVID-19 articles in online portals | > 503,325 comments
Cro-CoV-Tweets | COVID-19 related tweets posted from users registered in Croatia | > 1 million tweets; > 200,000 COVID-19 tweets
Senti-Cro-CoV-Tweets | Tweets annotated with the sentiment polarity (positive, negative, neutral) | 10,000 annotated tweets
Cro-CoV-netTW | Network of Twitter users | > 40,000 users
Cro-CoV-multilayerTW | Multilayer network of Twitter | 6 layers (multilayer network)
Cro-CoV-Reddit | Posts and comments from Croatian subreddit | 1,654 posts; 6,466 comments
Cro-CoV-Forum | COVID-19 posts from the Croatian forum | 3,479 posts (* students)
Cro-CoV-YT | COVID-19 posts from YouTube | 4,530 comments (* students)

Language and classification models
cro-CoV-cseBERT, cro-CoV-BERTić – language models
sent-cro-CoV-cseBERT – sentiment classification
multi-cro-CoV-cseBERT – retweet classification

Keyword extraction
Keyword groups (figure labels): symptoms and hygiene; medicaments and drugs; vaccine; general terms
Online news portals, Cro-CoV-Texts
• 190,000 COVID-19 related articles
• Croatian language
• First 13 months of the pandemic (2 waves)

Topic modelling
Figures: distribution of topics over time; topic spreading via retweeting

Sentiment analysis
Twitter, Cro-CoV-Tweets
• 206,196 COVID-19 related tweets
• Croatian language
• 1 January 2020 – 31 May 2021 (3 waves)

Clustering of Tweets
# | Topic
0 | Informative facts about COVID-19
1 | Education and implementation of the COVID-19 policies
2 | Coping with the pandemic
3 | Revolt against the COVID-19 policies and behaviour of citizens
4 | Public discussion regarding anti-pandemic policies and vaccines
5 | Impact of COVID-19 policies on economy and education
6 | Public comments on statements of the politicians and scientists
7 | Information about new daily COVID-19 cases
8 | Ironic comments on COVID-19
9 | Short generic messages related to COVID-19

Clustering of Tweets
• Negative attitudes: "Public discussion regarding anti-pandemic policies and vaccines"
• Non-negative attitudes: informative messages and "Coping with the pandemic"

InfoCoV team
• Laboratory for Semantic Technologies

NLP classification techniques
● Classifying text is a basic NLP problem, but still often challenging in practice
● Helpful: large language models pre-trained on large amounts of data
● Regardless of the exact domain, the typical approach is the same:
○ Pick a pre-trained language model close to your specific problem
○ Optionally, tune the language model with your unlabeled data
○ Fine-tune the language model to your labeled data (your specific categories to predict)

Language model tuning
● Available base language models for Croatian:
○ CroSloEngual BERT, https://huggingface.co/EMBEDDIA/crosloengual-bert
○ BERTić* [bert-ich] /bɜrtitʃ/ - A transformer language model for Bosnian, Croatian, Montenegrin and Serbian, https://huggingface.co/classla/bcms-bertic
● Self-supervised tuning to COVID-specific Croatian data
○ Useful to prepare data similar to the final classification task (e.g. oversampling user comments data)
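A minimal sketch of how this setup might look with Hugging Face Transformers. The model ids are the ones listed above; the helper names (`build_tuning_corpus`, `load_for_masked_lm`, `load_for_classification`) and the oversampling factor are hypothetical, chosen for illustration and not taken from the project. The Transformers imports sit inside the loader functions so the corpus helper can be tried without the library installed.

```python
import random

# Model ids as listed on the slide.
BASE_MODELS = {
    "cseBERT": "EMBEDDIA/crosloengual-bert",
    "BERTic": "classla/bcms-bertic",
}


def build_tuning_corpus(articles, comments, comment_factor=3, seed=0):
    """Mix unlabeled texts, repeating user comments `comment_factor` times.

    Oversampling comments makes the tuning data look more like the short,
    informal texts of the final classification task. The factor is an
    illustrative assumption.
    """
    corpus = list(articles) + list(comments) * comment_factor
    random.Random(seed).shuffle(corpus)
    return corpus


def load_for_masked_lm(model_id):
    """Load tokenizer and model for self-supervised (masked LM) tuning."""
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def load_for_classification(model_path, num_labels):
    """Load a (tuned) checkpoint with a fresh classification head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path, num_labels=num_labels
    )
    return tokenizer, model
```

For example, `load_for_masked_lm(BASE_MODELS["cseBERT"])` would fetch the CroSloEngual BERT checkpoint for tuning on the output of `build_tuning_corpus`. The masked LM loader is shown with a BERT-style model in mind; other architectures may need a different self-supervised objective.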

BERTić model self-supervised tuning

Sentiment classification
● Croatian Tweets related to COVID
● Classification problem into 3 sentiment classes:
○ Neutral: 4914
○ Negative: 3730
○ Positive: 475
● Difficulties:
○ Small amount of labeled data
○ Class imbalance
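To make the imbalance concrete, the sketch below turns the label counts above into inverse-frequency class weights using the common "balanced" heuristic, weight = N / (K · n_c). This heuristic is an assumption for illustration; the slides do not say which weighting scheme was used.

```python
# Label counts from the slide above.
CLASS_COUNTS = {"neutral": 4914, "negative": 3730, "positive": 475}


def balanced_class_weights(counts):
    """Return weight_c = N / (K * n_c): rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())


weights = balanced_class_weights(CLASS_COUNTS)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
print(f"imbalance ratio: {imbalance_ratio(CLASS_COUNTS):.1f}")
```

With these counts the positive class ends up with roughly ten times the weight of the neutral class, which is the same factor by which it is under-represented.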

Typical problem: overfitting

Options to prevent overfitting
● Minority class oversampling
● Different class loss weights
● Dropout
● L2 regularization
● Freezing some model parameters
● Early stopping
● NLP data augmentation
Loss weights can depend on specific output combinations:
             | true: 0 | true: 1 | true: 2
predicted: 0 | W00     | W01     | W02
predicted: 1 | W10     | W11     | W12
predicted: 2 | W20     | W21     | W22
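One way to read the loss-weight matrix on this slide is as a per-example scaling of the cross-entropy, chosen by the (predicted, true) pair. The sketch below is an assumption about how such weights could be applied, not the project's actual loss; the weight values and the class numbering (0 = neutral, 1 = negative, 2 = positive) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_weighted_cross_entropy(logits, true_label, weights):
    """Cross-entropy for one example, scaled by weights[predicted][true].

    `weights` is indexed like the matrix on the slide: row = predicted
    class, column = true class.
    """
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return -weights[predicted][true_label] * math.log(probs[true_label])


# Hypothetical weights: missing a true class 2 example costs three times more.
W = [
    [1.0, 1.0, 3.0],
    [1.0, 1.0, 3.0],
    [1.0, 1.0, 1.0],
]

# The model prefers class 0, but the true class is 2.
loss = cost_weighted_cross_entropy([2.0, 0.0, 0.0], true_label=2, weights=W)
print(f"weighted loss: {loss:.3f}")
```

In a gradient-based framework the selected weight would normally be treated as a constant, since the argmax that picks it is not differentiable.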

Retweet category classification
A subset of Croatian tweets labeled into two categories
● 0: retweeted only once
● 1: retweeted more than once
Types of features and variants of training
a. Content features extracted from a transformer language model
b. Tabular features representing Twitter users and their interactions (categorical and numerical)
c. All features joined
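A small sketch of the labeling rule and of variant (c), where a content vector is concatenated with numeric and one-hot encoded categorical features. The feature names (`verified`, follower and mention counts) and the two-dimensional "embedding" are hypothetical placeholders; the slides do not list the actual features.

```python
def retweet_label(retweet_count):
    """0 = retweeted only once, 1 = retweeted more than once."""
    return 1 if retweet_count > 1 else 0


def one_hot(value, vocabulary):
    """Encode one categorical value as a 0/1 vector over `vocabulary`."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def join_features(content_embedding, numeric, categorical, vocabularies):
    """Variant (c): content vector + numeric + one-hot categorical features."""
    row = list(content_embedding)
    row.extend(float(x) for x in numeric)
    # Iterate over the vocabularies so the column order is always the same.
    for name, vocabulary in vocabularies.items():
        row.extend(one_hot(categorical[name], vocabulary))
    return row


# Hypothetical example: a 2-dimensional embedding, two numeric user
# features (followers, mentions) and one categorical feature.
row = join_features(
    content_embedding=[0.1, 0.2],
    numeric=[150, 3],
    categorical={"verified": "yes"},
    vocabularies={"verified": ["no", "yes"]},
)
print(row, retweet_label(5))
```

Variants (a) and (b) correspond to passing only the embedding or only the tabular part to the classifier.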

Investigating different algorithms
Classification algorithms:
• MLP
• Random Forest
• LightGBM
• NODE
• TabNet
• Category Embeddings & MLP
GitHub: https://github.com/InfoCoV/Multi-Cro-CoV-cseBERT
References:
• Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data, https://arxiv.org/abs/1909.06312
• TabNet: Attentive Interpretable Tabular Learning, https://arxiv.org/abs/1908.07442

Optuna hyper-parameter search

Optuna hyper-parameter importance
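The search and importance plots on these slides are not reproduced here; as a rough sketch, an Optuna study could be set up as below. The parameter names, ranges and the toy `validation_score` function are made up for illustration; in practice the objective would train a model and return its validation metric. Optuna is imported inside `run_search` so the objective can be inspected without the library installed.

```python
import math


def validation_score(learning_rate, max_depth):
    """Toy stand-in for 'train a model and return its validation metric'.

    Made up for illustration: best at learning_rate=0.05 and max_depth=6.
    """
    lr_penalty = (math.log10(learning_rate) - math.log10(0.05)) ** 2
    depth_penalty = 0.01 * (max_depth - 6) ** 2
    return -(lr_penalty + depth_penalty)


def objective(trial):
    """Ask the trial for hyper-parameter values and score them."""
    learning_rate = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
    max_depth = trial.suggest_int("max_depth", 2, 12)
    return validation_score(learning_rate, max_depth)


def run_search(n_trials=50):
    """Run the search and return best parameters and their importances."""
    import optuna

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    importances = optuna.importance.get_param_importances(study)
    return study.best_params, importances
```

Calling `run_search()` returns the best parameter set found and a mapping from parameter name to estimated importance, which is the kind of information the importance plot summarizes.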

Useful Python libraries
● Hugging Face Transformers, https://huggingface.co/docs/transformers/index
● Simple Transformers, https://simpletransformers.ai/
● Sentence Transformers, https://www.sbert.net/
● PyTorch, https://pytorch.org/
● PyTorch Ignite, https://pytorch-ignite.ai/
● PyTorch Tabular, https://github.com/manujosephv/pytorch_tabular/
● LightGBM, https://github.com/microsoft/LightGBM
● CLASSLA, https://github.com/clarinsi/classla
