Mohamad Al Sayed
Adrian M.P. Brașoveanu
Lyndon J.B. Nixon
Arno Scharl
Unsupervised Topic Modeling with BERTopic for Coarse and Fine-Grained News Classification
Use Case: IPTC
● IPTC classification
● 17 domains with subcategories
● Initial approach: Silver Standard
● Assignments based on URLs
● Multiple categories possible
● Needed for news media projects
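A minimal sketch of how URL-based silver-standard assignment could work, assuming that a section name in the URL path hints at the category. The slides do not give the actual rules, so the `SEGMENT_TO_IPTC` mapping, the `silver_labels` helper, and the example URLs are all hypothetical; only the label names are taken from the IPTC top-level domains. The function returns a list because multiple categories are possible.

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical mapping from URL path segments to IPTC top-level labels.
# The segment names are illustrative; the slides do not list the real rules.
SEGMENT_TO_IPTC = {
    "sport": "sport",
    "sports": "sport",
    "politics": "politics",
    "business": "economy, business and finance",
    "economy": "economy, business and finance",
    "health": "health",
    "science": "science and technology",
    "tech": "science and technology",
}


def silver_labels(url: str) -> List[str]:
    """Return every IPTC label whose segment appears in the URL path."""
    path = urlparse(url).path.lower()
    segments = [s for s in path.split("/") if s]
    labels: List[str] = []
    for segment in segments:
        label = SEGMENT_TO_IPTC.get(segment)
        # Keep first-seen order and avoid duplicate labels.
        if label is not None and label not in labels:
            labels.append(label)
    return labels


examples = [
    "https://news.example.org/sport/football/match-report",
    "https://news.example.org/business/tech/chip-shortage",
    "https://news.example.org/opinion/editorial",
]
for url in examples:
    print(url, "->", silver_labels(url))
```

URLs without a recognized segment get an empty label list, which is one reason such labels are only a silver standard and not a gold one.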
Challenges with Current Text Classification Pipelines
● Data Quality
● Scalability
● Interpretability
● Imbalanced Classes
● Computational Costs
● Model Overfitting
Neural Topic Modeling: BERTopic
● Text Preprocessing
● Document Embedding
● Dimensionality Reduction
● Clustering
● Topic Representation
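The last step, topic representation, can be illustrated with a small sketch of class-based TF-IDF (c-TF-IDF), the weighting described in the BERTopic paper: W(t, c) = tf(t, c) · log(1 + A / tf(t)), where A is the average number of words per cluster. This is a simplified, standard-library version under stated assumptions: the cluster assignments are given as input (embedding, dimensionality reduction, and clustering are not shown), the stopword list is a tiny stand-in for the preprocessing step, and the helper names `tokenize`, `c_tf_idf`, and `top_terms` are hypothetical. The library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list standing in for the preprocessing step.
STOPWORDS = {"the", "in", "as"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def c_tf_idf(docs_by_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score each term per topic with tf(t, c) * log(1 + A / tf(t))."""
    # Treat all documents of a cluster as one "class document".
    term_counts = {
        topic: Counter(tok for doc in docs for tok in tokenize(doc))
        for topic, docs in docs_by_topic.items()
    }
    # tf(t): frequency of each term across all clusters.
    total_per_term: Counter = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_per_term.values()) / len(term_counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_per_term[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Return the n highest-scoring terms of one topic."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Cluster assignments are assumed to come from the clustering step.
docs_by_topic = {
    0: ["the team won the match", "the striker scored in the match"],
    1: ["the bank raised interest rates", "markets fell as rates rose"],
}
scores = c_tf_idf(docs_by_topic)
for topic, topic_scores in scores.items():
    print(topic, top_terms(topic_scores))
```

Terms that are frequent inside one cluster but rare across the others score highest, which is what makes the resulting keyword lists readable as topic descriptions.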
Our Solution - Method
● Extract a coherent topic representation using BERTopic.
● Use the documents to train any classifier (RoBERTa, SetFit, etc.)
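One possible reading of these two steps, sketched under assumptions: each document inherits the label of the topic it was clustered into, and the resulting (text, label) pairs become training data for a downstream classifier. The slides do not spell out how labels are derived, so `build_training_set`, the `topic_names` mapping, and the `min_docs` threshold are hypothetical. The only library convention relied on is that BERTopic marks unclustered documents with topic `-1`.

```python
from collections import Counter
from typing import Dict, List, Tuple

OUTLIER_TOPIC = -1  # BERTopic's id for documents that were not clustered


def build_training_set(
    docs: List[str],
    topic_ids: List[int],
    topic_names: Dict[int, str],
    min_docs: int = 2,
) -> List[Tuple[str, str]]:
    """Turn topic assignments into (text, label) pairs for a classifier.

    Outliers and topics with fewer than `min_docs` documents are dropped;
    `min_docs` is an illustrative threshold, not one taken from the slides.
    """
    counts = Counter(t for t in topic_ids if t != OUTLIER_TOPIC)
    examples: List[Tuple[str, str]] = []
    for doc, topic in zip(docs, topic_ids):
        if topic == OUTLIER_TOPIC or counts[topic] < min_docs:
            continue
        examples.append((doc, topic_names[topic]))
    return examples


docs = [
    "The home side won the cup final.",
    "A late goal decided the derby.",
    "Parliament passed the budget bill.",
    "The minister resigned after the vote.",
    "Miscellaneous notice with no clear theme.",
    "New vaccine trial results published.",
]
topic_ids = [0, 0, 1, 1, -1, 2]  # assumed output of the topic model
topic_names = {0: "sport", 1: "politics", 2: "health"}  # hypothetical names

train = build_training_set(docs, topic_ids, topic_names)
for text, label in train:
    print(label, "|", text)
```

The resulting pairs could then be passed to any classifier, such as the RoBERTa or SetFit models named above; that training step is omitted here.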
Experiment - Datasets
Experiment - Models
● RoBERTa (Robustly Optimized BERT Pretraining Approach)
● SetFit (Few-shot)
Experiment - Embeddings Evaluation
● RoBERTa Embeddings win
Experiment - Model Evaluation (with and without BERTopic)
Portal Integration
Conclusion
● Improved Content Understanding
● Customization
● Improved Analytics
● Mitigation of Information Overload
● Improvements on Downstream Tasks
Thank you for your attention!
Acknowledgments
GENTIO project funded by the Austrian Federal Ministry for Climate Action,
Environment, Energy, Mobility and Technology (BMK) via the ICT of the Future
Program (GA No. 873992).
DWBI project funded by Vienna Science and Technology Fund (WWTF)
[10.47379/ICT20096].
Images
Copyright © IPTC, HuggingFace, Mohamad Al Sayed
