2. Each word comes from different topics (bag of words: ignore order)
[Figure: each topic k has a mixture weight, and a multinomial distribution over ALL words for topic k]
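One way to write down what this figure encodes (θ and φ are my notation, not from the slide): the probability of word w in document d mixes the K topic multinomials with document-specific weights.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K}
  \underbrace{\theta_{d,k}}_{\text{mixture weight for topic } k}\,
  \underbrace{\phi_{k,w}}_{\text{multinomial over all words, topic } k}
```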
3. It is a mixture model
[Figure: K topic-word multinomials, Topic 1 ... Topic K, each a distribution over the vocabulary: data, love, date, life, computer, java]
4. It is a mixture model
[Figure: generating the documents "Big Data" and "Machine Learning" from Topic 1 ... Topic K. For each word: 1) pick a topic, 2) pick a word from that topic's distribution over data, love, date, life, computer, java. A sketch of this two-step process follows below.]
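A minimal sketch of this two-step generative process in Python (the topic-word probabilities and mixture weights below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["data", "love", "date", "life", "computer", "java"]
K = 2  # number of topics

# Topic-word multinomials (each row sums to 1); values are illustrative only.
phi = np.array([
    [0.4, 0.05, 0.05, 0.1, 0.2, 0.2],        # Topic 1: "tech" words dominate
    [0.05, 0.35, 0.3, 0.25, 0.025, 0.025],   # Topic K: "romance" words dominate
])

def generate_document(theta, n_words):
    """theta: per-document mixture weights over the K topics."""
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)          # 1) pick a topic
        w = rng.choice(vocab, p=phi[z])     # 2) pick a word from that topic
        words.append(w)
    return words

print(generate_document(theta=np.array([0.9, 0.1]), n_words=8))
```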
5. It is a mixture model
[Figure: the same mixture, now with the latent variable made explicit: the chosen topic Z selects which multinomial (Topic 1 ... Topic K) generates a word from data, love, date, life, computer, java]
6. It is a mixture model
[Figure: for the documents "Big Data" and "Machine Learning", each word has its own chosen topic Z, which selects the topic (Topic 1 ... Topic K) it is drawn from: data, love, date, life, computer, java]
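With the latent topic Z made explicit (θ and φ as in the earlier formula, again my notation):

```latex
p(w \mid d) = \sum_{z=1}^{K} p(Z = z \mid d)\, p(w \mid Z = z),
\quad\text{where } p(Z = z \mid d) = \theta_{d,z}, \;\; p(w \mid Z = z) = \phi_{z,w}.
```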
16. The Dirichlet hyperparameters (α for the document-topic weights, β for the topic-word weights) control the "sparsity" of the weights for the multinomials; a small demo follows the list below.
Implications: a priori we assume
- Topics have few key words
- Documents only have a small subset of topics
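A minimal demo of this sparsity effect (the concentration values here are arbitrary): small Dirichlet parameters concentrate mass on a few components, large ones spread it out.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

# Small concentration -> sparse weights (a few topics dominate).
sparse = rng.dirichlet(alpha=np.full(K, 0.1))
# Large concentration -> near-uniform weights.
dense = rng.dirichlet(alpha=np.full(K, 10.0))

print(np.round(sparse, 2))  # most entries near 0, one or two large
print(np.round(dense, 2))   # all entries near 1/K
```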
21. The numerator can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure. However, the number of possible topic structures grows exponentially with the document length, so this brute-force method is not feasible; the sum is written out below.
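Concretely (my notation, assuming N words and K topics): marginalizing over every topic assignment z = (z_1, ..., z_N) requires K^N terms.

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \sum_{\mathbf{z}} p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
  = \sum_{z_1=1}^{K} \cdots \sum_{z_N=1}^{K} p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
\qquad (K^N \text{ terms})
```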
22. Two main ways to get the posterior (a minimal example of each follows the list):
- Sampling methods
  - Asymptotically correct
  - Time consuming
- Variational methods
  - Faster
  - Need math skills
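A minimal sketch of the sampling route: a collapsed Gibbs sampler for LDA (hyperparameter values and iteration count are arbitrary choices for illustration, not tuned).

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns count matrices
    from which the doc-topic and topic-word distributions are estimated."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))  # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(z_i = k | z_-i, w) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

For the variational route, off-the-shelf packages exist; gensim's LdaModel, for example, implements online variational Bayes.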
23. Summary:
- An intuitively appealing Bayesian unsupervised learning model
- Training is difficult
  - Lots of packages exist; the main issue is scalability
- Validation is difficult
  - Usually cast into a supervised learning framework (a sketch follows this slide)
- Presentation is difficult
  - Visualization for the Bayesian model is hard
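A minimal sketch of that supervised validation idea (the corpus and labels are toy placeholders): fit LDA, use per-document topic proportions as features, and score a downstream classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy corpus and labels, purely illustrative.
docs = ["data computer java data", "love date life love",
        "java computer data code", "life love date romance"] * 5
labels = [0, 1, 0, 1] * 5

# Topic proportions become features for a classifier; classification
# accuracy then serves as an indirect validation of the learned topics.
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
print(cross_val_score(pipeline, docs, labels, cv=5).mean())
```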
24. Reference:
Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (January 2003), 993–1022.