Hierarchical Topic Detection and Representation
Yash Vadalia (201001015)
Raj Mehta (201305504)
Lalit M (201101189)
Ashutosh Borkar (201101002)
Introduction
● Huge volume of news/information.
● Automatic processing of information to keep up with latest updates.
● Documents with similar stories are clustered together.
● Topics extracted from these clusters.
● Applications: search, topic-based document suggestion.
Approach
Parsing
● Corpus: Real news dataset (link).
● Unstructured data makes information extraction difficult.
● The data contains a large amount of noise:
○ HTML tags
○ non-printable characters
...continued
● Process the raw data and remove noise (HTML tags, comments, etc).
● Segment each document into sentences and further into words/tokens.
● Remove stop words and stem the remaining tokens.
● Tag each token with its part of speech (POS).
● Store the tag and frequency of all nouns and verbs (the document vector).
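The cleaning and tokenising steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the stop-word list is a tiny hypothetical one, the sentence splitter is naive, and stemming and POS tagging are left out because they need an external tagger or stemmer. As a result the vector here counts all kept tokens rather than only nouns and verbs.

```python
import html
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "are", "was", "for"}

def clean(raw: str) -> str:
    """Strip HTML comments, tags, entities and non-printable characters."""
    text = re.sub(r"<!--.*?-->", " ", raw, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                     # HTML tags
    text = html.unescape(text)                               # e.g. &amp; becomes &
    return "".join(ch if ch.isprintable() else " " for ch in text)

def sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence: str) -> list[str]:
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def document_vector(raw: str) -> Counter:
    """Term-frequency vector over all kept tokens in the document."""
    vec = Counter()
    for s in sentences(clean(raw)):
        vec.update(tokens(s))
    return vec

vec = document_vector("<p>The markets fell.</p><!-- ad --> Markets recovered &amp; rallied.\x00")
```

Restricting the vector to nouns and verbs, as the slide describes, would add a filter on the POS tag inside `tokens`.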
Document Similarity
● Document similarity: Cosine similarity of document vectors.
● The higher the similarity, the more likely two documents share a topic.
● W_t represents the weight of a word t (the defining formula was an image and is not preserved in this text).
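Cosine similarity between two document vectors can be sketched as below. Since the weight formula from the slide is not available here, this example assumes the weight of a word is simply its raw frequency; any other weighting could be stored in the same `Counter` without changing the function.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical document vectors (raw frequencies assumed as weights).
doc1 = Counter({"market": 2, "fall": 1})
doc2 = Counter({"market": 1, "rally": 1})
doc3 = Counter({"election": 3})

print(cosine_similarity(doc1, doc2))  # shares "market", so greater than 0
print(cosine_similarity(doc1, doc3))  # no shared terms, so 0.0
```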
Cluster Similarity
● Various linkage criteria are available for finding similarity between
clusters:
○ Single Linkage
○ Complete Linkage
○ Mean Linkage
○ Centroid Linkage
○ Minimum Energy, etc.
● Mean Linkage is preferred over the others since it reduces the effect of chaining.
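To make the difference between the first three criteria concrete, here is a small sketch expressed in terms of similarity rather than distance. Single linkage takes the best-matching pair across two clusters, complete linkage the worst, and mean linkage the average over all pairs. The similarity function `sim` and the one-dimensional sample points are assumptions chosen purely for illustration.

```python
from itertools import product
from typing import Callable, Sequence

Sim = Callable[[float, float], float]

def _pairwise(c1: Sequence[float], c2: Sequence[float], sim: Sim) -> list[float]:
    """Similarities of every cross-cluster pair."""
    return [sim(x, y) for x, y in product(c1, c2)]

def single_linkage(c1, c2, sim: Sim) -> float:
    return max(_pairwise(c1, c2, sim))   # most similar pair decides

def complete_linkage(c1, c2, sim: Sim) -> float:
    return min(_pairwise(c1, c2, sim))   # least similar pair decides

def mean_linkage(c1, c2, sim: Sim) -> float:
    scores = _pairwise(c1, c2, sim)
    return sum(scores) / len(scores)     # average over all pairs

# Illustrative similarity on numbers: 1 when equal, decaying with distance.
def sim(x: float, y: float) -> float:
    return 1.0 / (1.0 + abs(x - y))

a, b = [0, 1], [2, 5]
single = single_linkage(a, b, sim)
complete = complete_linkage(a, b, sim)
mean = mean_linkage(a, b, sim)
```

Because single linkage lets one close pair justify a merge, clusters can grow one neighbour at a time into long chains; averaging over all pairs dampens that effect.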
Clustering
● Agglomerative hierarchical clustering
○ Treat each document as a single cluster.
○ Find the most similar pair of clusters.
○ Merge that pair into a single cluster.
○ Repeat.
● Each iteration reduces the number of clusters by one.
● Termination (either condition):
○ The maximum similarity falls below a threshold.
○ The requisite number of clusters has been formed.
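The loop described above can be sketched as a naive implementation with both termination conditions. This is an illustrative version under stated assumptions, not the authors' code: it uses mean linkage, rescans every cluster pair on each iteration, and the parameter names `threshold` and `target_clusters` are hypothetical.

```python
from itertools import combinations

def mean_linkage(c1, c2, sim):
    """Average similarity over all cross-cluster pairs."""
    return sum(sim(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def agglomerate(items, sim, threshold=0.0, target_clusters=1):
    """Naive agglomerative clustering; returns the final list of clusters."""
    target_clusters = max(1, target_clusters)
    clusters = [[x] for x in items]              # every item starts alone
    while len(clusters) > target_clusters:       # stop: requisite count reached
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), mean_linkage(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:                     # stop: nothing similar enough
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Illustrative similarity on numbers standing in for documents.
def sim(x, y):
    return 1.0 / (1.0 + abs(x - y))

points = [0, 1, 10, 11, 50]
by_threshold = agglomerate(points, sim, threshold=0.2)
by_count = agglomerate(points, sim, target_clusters=2)
```

With the threshold, the two tight pairs merge and the outlier stays alone; with a target of two clusters, merging continues until only two remain.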
Topic Extraction
● TF-IDF and a parsimonious model are used to weight terms and pick the most
relevant topics.
● Parsimonious Model
...continued
● Words with low weight are ignored, and those with the highest weight are
taken as the topics for that cluster.
● Processing only specific parts of speech, rather than all kinds of words, yields more
relevant topics.
● Proper nouns and verbs represent entities and events, respectively, in a
document.
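The TF-IDF part of this step might look like the sketch below, which treats each cluster's aggregated term counts as one "document" and keeps the top-weighted terms as that cluster's topics. The parsimonious model is not implemented here, and the exact TF-IDF variant (relative frequency times log of inverse cluster frequency) is an assumption, since the slides do not specify one.

```python
import math
from collections import Counter

def top_terms(cluster_counts: list[Counter], k: int = 3) -> list[list[str]]:
    """Return the k highest TF-IDF terms for each cluster."""
    n = len(cluster_counts)
    df = Counter()                       # number of clusters containing each term
    for counts in cluster_counts:
        df.update(counts.keys())
    topics = []
    for counts in cluster_counts:
        total = sum(counts.values()) or 1
        scores = {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics.append(ranked[:k])
    return topics

# Hypothetical aggregated term counts for two clusters.
clusters = [
    Counter({"election": 4, "vote": 3, "said": 5}),
    Counter({"market": 4, "stock": 2, "said": 5}),
]
topics = top_terms(clusters, k=2)
```

The term "said" is frequent in both clusters, so its inverse-frequency factor is zero and it drops out of both topic lists.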
Results
● The output is a binary tree in which clusters are combined at each level.
● Each non-leaf node in the tree is a cluster.
● Each leaf node is a document.
● The tree is not well balanced and suffers somewhat from chaining when almost all
documents are on the same topic.
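A possible representation of such a tree, with a height function that makes the imbalance caused by chaining visible, is sketched below. The `Node` class and the document ids are hypothetical and only illustrate the structure described above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    doc_id: Optional[str] = None      # set only on leaves (documents)
    left: Optional[Node] = None       # children set only on clusters
    right: Optional[Node] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def merge(a: Node, b: Node) -> Node:
    """A cluster node whose children are the two merged clusters."""
    return Node(left=a, right=b)

def height(node: Node) -> int:
    """Number of edges on the longest path from node down to a leaf."""
    if node.is_leaf():
        return 0
    return 1 + max(height(node.left), height(node.right))

def leaves(node: Node) -> list[str]:
    """Document ids under a node, left to right."""
    if node.is_leaf():
        return [node.doc_id]
    return leaves(node.left) + leaves(node.right)

d1, d2, d3, d4 = (Node(doc_id=f"d{i}") for i in range(1, 5))

# Chaining: each merge adds a single document to one growing cluster.
chained = merge(merge(merge(d1, d2), d3), d4)
# A balanced tree over the same four documents.
balanced = merge(merge(d1, d2), merge(d3, d4))
```

Both trees hold the same documents, but the chained one is deeper, which is the kind of imbalance the last bullet refers to.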
Demo Screenshot
Conclusion
● Hierarchical topic detection (HTD) is a newer variant of topic detection.
● It provides multiple levels of granularity.
● The major issue with the statistical approach we followed is scaling.
○ Cubic complexity of processing (document similarity matrix,
clustering).
● Relevance between documents can be improved by moving from whole documents towards
the events they describe.
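One way to see where a cubic cost can come from, assuming a naive implementation that rescans every cluster pair before each merge (the slides do not spell out the derivation), is the following count for n documents:

```latex
\begin{align*}
\text{initial similarity matrix:}\quad
  & \binom{n}{2} = \frac{n(n-1)}{2} \text{ document pairs} \\
\text{pairs scanned over all merges:}\quad
  & \sum_{m=2}^{n} \binom{m}{2} = \binom{n+1}{3}
    = \frac{(n+1)\,n\,(n-1)}{6} = O(n^3) \\
\text{for } n = 1000:\quad
  & \frac{1001 \cdot 1000 \cdot 999}{6} = 166{,}666{,}500 \text{ cluster pairs}
\end{align*}
```

Here m is the number of clusters remaining before a merge, so the sum runs over all n - 1 merges needed to build the full tree.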
