Mohamad Al Sayed
Adrian M.P. Brașoveanu
Lyndon J.B. Nixon
Arno Scharl
Unsupervised Topic Modeling with BERTopic for Coarse and Fine-Grained News Classification
Use Case: IPTC
● IPTC classification
● 17 domains with subcategories
● Initial approach: a silver-standard dataset
● Labels assigned from URL patterns
● Multiple categories possible
● Needed for news media projects
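As a rough illustration of the URL-based silver-standard assignment, the sketch below maps URL path segments to IPTC top-level labels. The patterns and label names are hypothetical stand-ins, not the project's actual rules.

```python
import re

# Hypothetical URL-pattern rules for silver-standard IPTC labels;
# the project's real mapping is not shown on the slides.
URL_RULES = {
    r"/sports?/": "sport",
    r"/politi(cs|k)/": "politics",
    r"/(business|economy)/": "economy, business and finance",
}

def silver_labels(url: str) -> list[str]:
    """Assign zero or more IPTC top-level labels from URL path segments."""
    return [label for pattern, label in URL_RULES.items()
            if re.search(pattern, url)]

print(silver_labels("https://example.com/sports/football/report"))
# ['sport'] -- an article may match several patterns, i.e. multiple categories
```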
Challenges with Current Text Classification Pipelines
● Data Quality
● Scalability
● Interpretability
● Imbalanced Classes
● Computational Costs
● Model Overfitting
Neural Topic Modeling: BERTopic
● Text Preprocessing
● Document Embedding
● Dimensionality Reduction
● Clustering
● Topic Representation
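A minimal sketch of this pipeline with the bertopic library, assuming its standard components (sentence-transformers embeddings, UMAP, HDBSCAN, and the built-in c-TF-IDF topic representation); the model names and parameters are illustrative, not the configuration used in the slides.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = [...]  # preprocessed news articles (placeholder)

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")           # document embedding
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")  # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)  # clustering

# Topic representation (c-TF-IDF) is computed inside BERTopic.
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```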
Our Solution - Method
● Extract a coherent topic representation using BERTopic.
● Use the topic-assigned documents to train any classifier (RoBERTa, SetFit, etc.).
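Read as code, the method uses BERTopic's clusters as pseudo-labels for supervised training. In this sketch the topic-to-IPTC mapping is invented, and a logistic regression over sentence embeddings stands in for RoBERTa/SetFit fine-tuning.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

docs = [...]  # unlabeled news articles (placeholder)

# 1) Extract a coherent topic representation (unsupervised).
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# 2) Map topics to IPTC categories by inspecting their keywords.
#    Hypothetical mapping; in practice a manual or rule-based step.
topic_to_iptc = {0: "sport", 1: "politics", 2: "economy, business and finance"}
pairs = [(d, topic_to_iptc[t]) for d, t in zip(docs, topics) if t in topic_to_iptc]
texts, labels = map(list, zip(*pairs))

# 3) Train any classifier on the pseudo-labeled documents.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
clf = LogisticRegression(max_iter=1000).fit(encoder.encode(texts), labels)
```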
Experiment - Datasets
Experiment - Models
● RoBERTa (Robustly Optimized BERT Pretraining Approach)
● SetFit (Few-shot)
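For SetFit, a minimal few-shot sketch using the setfit library's SetFitTrainer interface (setfit < 1.0); the training examples and label encoding are invented for illustration.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# A handful of labeled examples per class (toy data; 0 = sport, 1 = politics)
train_ds = Dataset.from_dict({
    "text": ["The striker scored twice in the final.",
             "Parliament passed the new budget bill."],
    "label": [0, 1],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

print(model.predict(["Stocks rallied after the central bank's decision."]))
```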
Experiment - Embeddings Evaluation
● RoBERTa embeddings perform best
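One way to run such a comparison is to swap embedding backends into an otherwise identical BERTopic pipeline; the model names below are assumptions ("all-roberta-large-v1" is a RoBERTa-based sentence-transformers model), and the scoring step is left as a placeholder.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = [...]  # evaluation corpus (placeholder)

for name in ["all-MiniLM-L6-v2", "all-roberta-large-v1"]:
    topic_model = BERTopic(embedding_model=SentenceTransformer(name))
    topics, _ = topic_model.fit_transform(docs)
    # ... score each backend here, e.g. topic coherence or downstream F1
    print(name, "->", len(topic_model.get_topic_info()), "topics")
```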
Experiment - Model Evaluation (with and without BERTopic)
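A plausible shape for this comparison, assuming gold IPTC labels on a held-out test set: score the same classifier trained with and without BERTopic-derived labels (all variable names below are placeholders).

```python
from sklearn.metrics import f1_score

# y_true: gold labels; y_pred_base / y_pred_bertopic: predictions from the
# classifier trained without / with BERTopic pseudo-labels (placeholders).
print("baseline   macro-F1:", f1_score(y_true, y_pred_base, average="macro"))
print("+ BERTopic macro-F1:", f1_score(y_true, y_pred_bertopic, average="macro"))
```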
Portal Integration
Conclusion
● Improved Content Understanding
● Customization
● Improved Analytics
● Mitigation of Information Overload
● Improvements on Downstream Tasks
Thank you for your attention!
Acknowledgments
GENTIO project funded by the Austrian Federal Ministry for Climate Action,
Environment, Energy, Mobility, Innovation and Technology (BMK) via the ICT of
the Future Program (GA No. 873992).
DWBI project funded by the Vienna Science and Technology Fund (WWTF)
[10.47379/ICT20096].
Images
Copyright © IPTC, HuggingFace, Mohamad Al Sayed
