Drawing on her background in academia, Ana Meštrović will describe her experience of collaborating with Mladen Fernežir from Velebit AI.
In her talk, Ana will present the research ideas and results of the project “Multilayer Framework for the Information Spreading Characterization in Social Media during the COVID-19 Crisis - InfoCoV”, funded by the Croatian Science Foundation. One aspect of the InfoCoV project was information monitoring using NLP methods, which enables a better understanding of the infodemic. Ana will explain how they apply NLP methods to the task of information monitoring, and Mladen will share the technical details of the implementation.
[DSC Croatia 22] Experience in collaboration between academia and industry: NLP solutions for infodemic management - Ana Mestrovic & Mladen Fernezir
1. Experience in collaboration between academia and industry: NLP solutions for infodemic management
Ana Meštrović, Faculty of Informatics and Digital Technologies, University of Rijeka
Mladen Fernežir, Velebit AI
4. InfoCoV project
InfoCoV: Multilayer Framework for the Information Spreading Characterization in Social Media during the COVID-19 Crisis
• Croatian Science Foundation - HRZZ
• 15 June 2020 – 14 January 2022
• Collaboration with Velebit AI
• Information monitoring
• COVID-19 texts in social media
• Research: NLP & SNA
5. Can AI help us in infodemic management?
• AI → analysis of large amounts of text
• Machine learning, neural networks, ...
• NLP tasks
  ○ Keyword extraction
  ○ Named entity recognition (NER)
  ○ Topic modelling
  ○ Text classification
  ○ Sentiment analysis
  ○ Fake news detection
• Multilayer framework
  ○ Social network analysis
  ○ Dynamics and spreading
6. Data
Dataset
Dataset | Description | Size
Cro-CoV-texts | Texts collected from online portals | > 186,738 articles
Cro-CoV-comm | Users’ comments on COVID-19 articles in online portals | > 503,325 comments
Cro-CoV-Tweets | COVID-19 related tweets posted by users registered in Croatia | > 1 million tweets; > 200,000 COVID-19 tweets
Senti-Cro-CoV-Tweets | Tweets annotated with sentiment polarity (positive, negative, neutral) | 10,000 annotated tweets
Cro-CoV-netTW | Network of Twitter users | > 40,000 users
Cro-CoV-multilayerTW | Multilayer network of Twitter | 6 layers (multilayer network)
Cro-CoV-Reddit | Posts and comments from the Croatian subreddit | 1,654 posts; 6,466 comments
Cro-CoV-Forum | COVID-19 posts from the Croatian forum | 3,479 posts (* students)
Cro-CoV-YT | COVID-19 posts from YouTube | 4,530 comments (* students)
7. Language and classification models
cro-CoV-cseBERT, cro-CoV-BERTić – language models
sent-cro-CoV-cseBERT – sentiment classification
multi-cro-CoV-cseBERT – retweet classification
13. Clustering of Tweets
# | Topic
0 | Informative facts about COVID-19
1 | Education and implementation of the COVID-19 policies
2 | Coping with the pandemic
3 | Revolt against the COVID-19 policies and behaviour of citizens
4 | Public discussion regarding anti-pandemic policies and vaccines
5 | Impact of COVID-19 policies on economy and education
6 | Public comments on statements of the politicians and scientists
7 | Information about new daily COVID-19 cases
8 | Ironic comments on COVID-19
9 | Short generic messages related to COVID-19
14. Clustering of Tweets
• Negative attitudes: “Public discussion regarding anti-pandemic policies and vaccines”
• Non-negative attitudes: informative messages and “Coping with the pandemic”
17. NLP classification techniques
● Classifying text is a basic NLP problem, but it is still often challenging in practice
● Helpful: large language models pre-trained on large amounts of data
● Regardless of the exact domain, the typical approach is the same:
○ Pick a pre-trained language model close to your specific problem
○ Optionally, tune the language model on your unlabeled data
○ Fine-tune the language model on your labeled data (your specific categories to predict)
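The steps above can be sketched with the Hugging Face `transformers` library. This is a minimal sketch under stated assumptions, not the project's training code: the label order, the helper names `encode_labels` and `build_classifier`, and the choice of three sentiment classes are hypothetical; the default model name is the CroSloEngual BERT checkpoint that the slides list as a base model for Croatian.

```python
# Hypothetical label set, mirroring the three sentiment polarities used for annotation.
LABELS = ["negative", "neutral", "positive"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def encode_labels(names):
    """Map label names to the integer ids a classification head expects."""
    return [LABEL_TO_ID[name] for name in names]


def build_classifier(model_name="EMBEDDIA/crosloengual-bert"):
    """Load a pre-trained language model and attach a fresh classification head.

    The import is lazy so the label helpers above work even when
    `transformers` is not installed.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    )
    return tokenizer, model


# Labels of two annotated examples, converted to ids for training.
label_ids = encode_labels(["positive", "negative"])
```

Fine-tuning would then train the returned model on tokenized texts paired with these ids; the optional middle step (tuning on unlabeled data) would happen before the classification head is attached.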
18. Language model tuning
● Available base language models for Croatian:
○ CroSloEngual BERT, https://huggingface.co/EMBEDDIA/crosloengual-bert
○ BERTić* [bert-ich] /bɜrtitʃ/ - a transformer language model for Bosnian, Croatian, Montenegrin and Serbian, https://huggingface.co/classla/bcms-bertic
● Self-supervised tuning on COVID-specific Croatian data
○ Useful to prepare data similar to the final classification task (e.g. oversampling user comments data)
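One simple way to oversample user comments when assembling the unlabeled tuning corpus is to repeat them before shuffling. The sketch below is only an illustration: the function name `build_tuning_corpus`, the repetition factor of 3, and the toy texts are assumptions, not values from the project.

```python
import random


def build_tuning_corpus(articles, comments, comment_factor=3, seed=0):
    """Return a shuffled corpus in which each comment appears `comment_factor` times."""
    corpus = list(articles) + list(comments) * comment_factor
    random.Random(seed).shuffle(corpus)  # fixed seed keeps the order reproducible
    return corpus


# Toy data: two portal articles and one user comment.
corpus = build_tuning_corpus(
    articles=["article text A", "article text B"],
    comments=["user comment X"],
)
```

With this mix the language model sees comment-style text more often during self-supervised tuning, which brings the tuning data closer to a downstream task that classifies comments or tweets.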
22. Options to prevent overfitting
● Minority class oversampling
● Different class loss weights
● Dropout
● L2 regularization
● Freezing some model parameters
● Early stopping
● NLP data augmentation
Loss weights can depend on specific output combinations:
             | true: 0 | true: 1 | true: 2
predicted: 0 | W00 | W01 | W02
predicted: 1 | W10 | W11 | W12
predicted: 2 | W20 | W21 | W22
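The idea of loss weights that depend on the specific output combination can be sketched in plain Python. This is one possible reading, not the authors' implementation: the cross-entropy of each example is scaled by the weight indexed as `W[predicted][true]`, where the predicted class is taken as the most probable one. The weight values and the function name `weighted_cross_entropy` are hypothetical.

```python
import math

# Hypothetical weights W[predicted][true]; larger off-diagonal entries
# penalise specific confusions more heavily.
W = [
    [1.0, 2.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.0, 1.0, 1.0],
]


def weighted_cross_entropy(probs, true_label, weights=W):
    """Cross-entropy of one example, scaled by the weight of its (predicted, true) pair."""
    predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return weights[predicted][true_label] * -math.log(probs[true_label])


# Same true class 0: once predicted as class 2 (weight 3.0), once predicted correctly.
loss_wrong = weighted_cross_entropy([0.1, 0.2, 0.7], true_label=0)
loss_right = weighted_cross_entropy([0.7, 0.2, 0.1], true_label=0)
```

Plain per-class weights are the special case in which every row of the matrix is identical, so the weight depends only on the true label.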
23. Retweet category classification
A subset of Croatian tweets labeled into two categories
● 0: tweets retweeted only once
● 1: tweets retweeted more than once
Types of features and variants of training
a. Content features extracted from a transformer language model
b. Tabular features representing Twitter users and their interactions (categorical and numerical)
c. All features joined
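The labeling rule and the joined-features variant can be illustrated with a small sketch. The function names, the toy feature values, and the error raised for tweets that were never retweeted are assumptions made for illustration; the actual feature set is in the project repository.

```python
def retweet_category(retweet_count):
    """Return 0 for a tweet retweeted only once, 1 for more than once."""
    if retweet_count < 1:
        # Assumption: the labeled subset contains only tweets with at least one retweet.
        raise ValueError("expected a tweet that was retweeted at least once")
    return 0 if retweet_count == 1 else 1


def join_features(content_embedding, tabular_features):
    """Variant (c): concatenate content and tabular features into one vector."""
    return list(content_embedding) + list(tabular_features)


# Toy values: a 3-dimensional "embedding" and two numerical user features.
x = join_features([0.12, -0.40, 0.88], [1520.0, 37.0])
y = retweet_category(4)
```

Categorical tabular features would need an encoding step (for example one-hot vectors or learned embeddings) before they can be concatenated in this way.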
24. Investigating different algorithms
Classification algorithms:
• MLP
• Random Forest
• LightGBM
• NODE
• TabNet
• Category Embeddings & MLP
GitHub: https://github.com/InfoCoV/Multi-Cro-CoV-cseBERT
References:
Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data, https://arxiv.org/abs/1909.06312
TabNet: Attentive Interpretable Tabular Learning, https://arxiv.org/abs/1908.07442