Unsupervised Topic Modeling with BERTopic for Coarse and Fine-Grained News Classification

Transformer models have achieved state-of-the-art results for news classification tasks, but remain difficult to modify to yield the desired class probabilities in a multi-class setting. Using a neural topic model to create dense topic clusters helps with generating these class probabilities. The presented work uses the BERTopic clustered embeddings model as a preprocessor to eliminate documents that do not belong to any distinct cluster or topic. By combining the resulting embeddings with a Sentence Transformer fine-tuned with SetFit, we obtain a prompt-free framework that demonstrates competitive performance even with few-shot labeled data. Our findings show that incorporating BERTopic in the preprocessing stage leads to a notable improvement in the classification accuracy of news documents. Furthermore, our method outperforms hybrid approaches that combine text and images for news document classification.

Mohamad Al Sayed
Adrian M.P. Brașoveanu
Lyndon J.B. Nixon
Arno Scharl
Unsupervised Topic Modeling with BERTopic
for Coarse and Fine-Grained News
Classification

Use Case: IPTC
● IPTC classification
● 17 domains with subcategories
● Initial approach: Silver Standard
● Assignments based on URLs
● Multiple categories possible
● Needed for news media projects
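
A minimal sketch of URL-based silver-standard labeling, assuming that category hints appear as path segments in article URLs. The segment-to-topic mapping and the function name silver_labels are hypothetical illustrations, not the mapping used in the presented work; only a few of the IPTC top-level topics are shown.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL path segments to IPTC top-level topics.
# The segment names are illustrative; a real site map would differ.
SEGMENT_TO_TOPIC = {
    "sport": "sport",
    "sports": "sport",
    "politics": "politics",
    "business": "economy, business and finance",
    "economy": "economy, business and finance",
    "health": "health",
    "science": "science and technology",
    "tech": "science and technology",
}


def silver_labels(url: str) -> list[str]:
    """Return every topic whose segment appears in the URL path (multi-label)."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    labels = []
    for segment in segments:
        topic = SEGMENT_TO_TOPIC.get(segment)
        if topic is not None and topic not in labels:
            labels.append(topic)
    return labels
```

A URL can match several segments, so one article may receive multiple categories; a URL with no known segment receives an empty list and would stay unlabeled.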

Challenges with Current Text Classification Pipelines
● Data Quality
● Scalability
● Interpretability
● Imbalanced Classes
● Computational Costs
● Model Overfitting

Neural Topic Modeling: BERTopic
● Text Preprocessing
● Document Embedding
● Dimensionality Reduction
● Clustering
● Topic Representation
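
The last step, topic representation, can be illustrated with class-based TF-IDF (c-TF-IDF), the weighting BERTopic uses to pick descriptive terms per cluster. The sketch below is a simplified pure-Python version that assumes whitespace tokenization and that topic assignments are already available from the clustering step; the function names c_tf_idf and top_terms are illustrative, not part of the BERTopic API.

```python
import math
from collections import Counter


def c_tf_idf(docs, topics):
    """Score terms per topic with class-based TF-IDF.

    docs:   list of strings (tokenized on whitespace here, an assumption)
    topics: list of topic ids, one per document
    """
    # Concatenate the documents of each topic into one bag of words.
    term_counts = {}
    for doc, topic in zip(docs, topics):
        term_counts.setdefault(topic, Counter()).update(doc.lower().split())

    # f_x: frequency of each term across all topics.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(term_counts)

    # W(x, c) = tf(x, c) * log(1 + A / f_x)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(scores, topic, n=3):
    """Return the n highest-scoring terms for a topic."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]
```

Terms that are frequent inside one topic but rare across all topics score highest, which is why they serve as that topic's label.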

Our Solution - Method
● Extract a coherent topic representation using BERTopic.
● Use the documents to train any classifier (RoBERTa, SetFit, etc.)
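
A minimal sketch of the preprocessing idea, assuming topic assignments have already been computed. BERTopic's default HDBSCAN clustering marks documents that fit no cluster with the topic id -1; in a real pipeline the topics list would come from BERTopic().fit_transform(docs). The helper name drop_outliers is hypothetical.

```python
OUTLIER_TOPIC = -1  # id that BERTopic (via HDBSCAN) gives to unclustered documents


def drop_outliers(docs, labels, topics):
    """Keep only documents that were assigned to a topic cluster."""
    kept = [
        (doc, label)
        for doc, label, topic in zip(docs, labels, topics)
        if topic != OUTLIER_TOPIC
    ]
    kept_docs = [doc for doc, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_docs, kept_labels
```

The filtered documents and labels can then be passed to whichever classifier is trained next.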

Experiment - Datasets

Experiment - Models
● RoBERTa (Robustly Optimized BERT Pretraining Approach)
● SetFit (Few-shot)
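
SetFit first fine-tunes a Sentence Transformer contrastively on pairs of labeled examples and then trains a classification head on the resulting embeddings. The sketch below illustrates only the data side of the first stage: drawing a few examples per class and building same-class and different-class pairs. It is a simplified illustration that enumerates all pairs; the function names are hypothetical and the SetFit library's own sampling strategy may differ.

```python
import random
from itertools import combinations


def sample_few_shot(texts, labels, k, seed=0):
    """Pick at most k examples per class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    return {
        label: rng.sample(items, min(k, len(items)))
        for label, items in by_label.items()
    }


def contrastive_pairs(few_shot):
    """Build (text_a, text_b, target) triples: 1.0 same class, 0.0 different."""
    tagged = [(t, label) for label, items in few_shot.items() for t in items]
    return [
        (a, b, 1.0 if la == lb else 0.0)
        for (a, la), (b, lb) in combinations(tagged, 2)
    ]
```

Even a handful of labeled documents yields many training pairs, which is what makes the few-shot setting workable.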

Experiment - Embeddings Evaluation
● RoBERTa embeddings performed best

Experiment - Model Evaluation (with and without BERTopic)

Portal Integration

Conclusion
● Improved Content Understanding
● Customization
● Improved Analytics
● Mitigation of Information Overload
● Improvements on Downstream Tasks

Thank you for your attention!
Acknowledgments
GENTIO project funded by the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility and Technology (BMK) via the ICT of the Future Program (GA No. 873992).
DWBI project funded by Vienna Science and Technology Fund (WWTF) [10.47379/ICT20096].
Images
Copyright © IPTC, HuggingFace, Mohamad Al Sayed

