Jeonghun Yoon
Probabilistic Topic Models
Natural Language Processing
Natural language(์ž์—ฐ์–ด) : ์ผ์ƒ ์ƒํ™œ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์–ธ์–ด
Natural language processing(์ž์—ฐ์–ด ์ฒ˜๋ฆฌ) : ์ž์—ฐ์–ด์˜ ์˜๋ฏธ๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์ปดํ“จํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์ผ๋“ค์„(tasks) ์ฒ˜
๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ
Easy
- Spell checking, Keyword search, Finding synonyms
Medium
- Parsing information from websites, documents, etc.
Hard
- Machine translation
- Semantic analysis
- Coherence
- Question answering
CS 224D : Deep Learning for NLP
Semantic analysis
Semantic analysis in linguistics
- One of the techniques for understanding natural language: interpreting a sentence based on its meaning (semantics)
- The counterpart of syntax analysis (lexicon, syntax analysis)
Semantic analysis in machine learning
- Building a structure that can estimate the latent meanings, concepts, subjects, topics, etc. in the large collection of documents in a corpus
- Representative semantic analysis techniques
  - Latent Semantic Analysis (LSA or LSI)
  - PLSI
  - Latent Dirichlet Allocation (LDA)
  - Hierarchical Dirichlet Process (HDP)
Semantic analysis
Representation of documents
Axes of a spatial representation
- Defined in a Euclidean space
- Hard to interpret
Probabilistic topics
- Probability distributions defined over words
- Interpretable
1. Axes of a spatial representation
- LSA
2. Probabilistic topics
- LDA
3. Bayesian nonparametrics
- HDP
LSA (Latent Semantic Analysis)
- LSA (LSI) is a technique for discovering the hidden concepts in document data.
- LSA represents each document and each word as a vector. Each element of these vectors corresponds to a hidden concept.
LSA (Latent Semantic Analysis)
- Document 3 will rank first for the query.
- Documents 2 and 4 will come next.
- What about documents 1 and 5?
  → To a human reader, document 1 matches the given query better than document 5.
Can a computer carry out this kind of reasoning? That is, can it find the hidden meaning?
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : "Live free or die", that's the motto of New-Hampshire
d5 : Did you know, New-Hampshire is in New-England
Query : dies and dagger
LSA (Latent Semantic Analysis)
matrix A (the 5 × 8 term-document matrix; each entry counts a word's occurrences in a document):

      romeo  juliet  happy  dagger  live  die  free  new-hampshire
d1      1      1       0      0      0     0     0        0
d2      0      1       1      1      0     0     0        0
d3      1      0       0      1      0     1     0        0
d4      0      0       0      0      1     1     1        1
d5      0      0       0      0      0     0     0        1
doc-doc matrix
matrix A : 5 × 8    matrix A^T : 8 × 5    matrix A A^T (B) : 5 × 5
Document 1 contains romeo and juliet; document 2 contains juliet, happy, and dagger.
They overlap in exactly one word, so B(1,2) = B(2,1) = 1.
matrix B = A A^T is the doc-doc matrix:
if documents i and j share p common words, then B(i, j) = p.
word-word matrix
matrix A^T : 8 × 5    matrix A : 5 × 8    matrix A^T A (C) : 8 × 8
juliet appears in documents 1 and 2, and dagger appears in documents 2 and 3.
They co-occur in exactly one document, so C(2,4) = C(4,2) = 1.
matrix C = A^T A is the word-word matrix:
if words i and j co-occur in p documents, then C(i, j) = p.
(A numpy sketch of A, B, and C follows below.)
LSA (Latent Semantic Analysis)
Use the SVD!
A = U Σ V^T, where the columns of U are the eigenvectors of B = A A^T, the columns of V are the eigenvectors of C = A^T A, and the diagonal entries of Σ are the singular values.
LSA (Latent Semantic Analysis)
Use the reduced (truncated) SVD!
A_k = U_k Σ_k V_k^T : instead of using all singular values, the small ones are dropped.
Only the k largest singular values are kept; that is, only k "hidden concepts" remain.
(See the sketch below.)
LSA (Latent Semantic Analysis)
Σ_2 V_2^T and V_2^T give the word vectors: each word is represented in the 2-dimensional concept space (the matrix values are shown in the original figure).
LSA (Latent Semantic Analysis)
Scatter plot of the word vectors
LSA (Latent Semantic Analysis)
U_2 Σ_2 and U_2 give the document vectors: each row d1 … d5 is a document represented in the 2-dimensional concept space (the matrix values are shown in the original figure).
LSA (Latent Semantic Analysis)
Scatter plot of the document vectors
LSA (Latent Semantic Analysis)
Scatter plot of the word and document vectors
LSA (Latent Semantic Analysis)
cosine similarity = (d_i · q) / (‖d_i‖ ‖q‖),   q = (q_1 + q_2) / 2

query : dagger, die
result : (the document ranking is shown in the original figure; see the sketch below)
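A minimal sketch of the query ranking, reusing A, doc_vectors, and word_vectors from the sketches above; the query vector is the average of the dagger and die word vectors:

```python
import numpy as np

q = (word_vectors[3] + word_vectors[5]) / 2   # indices 3, 5 = dagger, die

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

scores = [cosine(d, q) for d in doc_vectors]
ranking = np.argsort(scores)[::-1]            # best-matching documents first
print(ranking + 1)                            # 1-based document numbers
```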
LSA (Latent Semantic Analysis)
Scatter plot of the word, document, and query vectors
1. Axes of a spatial representation
- LSA
2. Probabilistic topics
- LDA
3. Bayesian nonparametrics
- HDP
Topic models
The basic idea of topic models
- A document is a mixture of topics, and each topic is a probability distribution over words:
  Document → Topic i, Topic j, Topic k → Words
Probabilistic topic models. Steyvers, M. & Griffiths, T. (2006)
Topic models
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, โ€ฆ
- Topic B: 20% cats, 20% cute, 15% dogs, 15% hamster, โ€ฆ
Doc 1 : I like to eat broccoli and bananas.
Doc 2 : I ate a banana and tomato smoothie for breakfast.
Doc 3 : Dogs and cats are cute.
Doc 4 : My sister adopted a cat yesterday.
Doc 5 : Look at this cute hamster munching on a piece of broccoli.
Example)
- Doc 1 and 2 : 100% topic A
- Doc 3 and 4 : 100% topic B
- Doc 5 : 60% topic A, 40% topic B
Topic models
Introduction to Probabilistic Topic Models. David M. Blei (2012)
Topic models
Introduction to Probabilistic Topic Models. David M. Blei (2012)
(Left) Topic proportions in the document
(Right) For the topics with the largest proportions in the document, the words of each topic that occur most frequently in the document
Probabilistic Topic Models์˜ ๊ตฌ์กฐ
๋ชจ๋ธ์˜ ์ •์˜์— ์•ž์„œ, ํ•„์š”ํ•œ ๋‹จ์–ด๋“ค์˜ ์ˆ˜ํ•™์  ํ‘œ๊ธฐ
๏ฌ Word : 1, โ€ฆ , ๐‘‰ ๋ฅผ ์ธ๋ฑ์Šค๋กœ ๊ฐ€์ง€๋Š” vocaburary ์ƒ์˜ items
๏ฌ Document : ๐‘ word์˜ sequence
๏ฌ ๐•จ = ๐‘ค1, ๐‘ค2, โ€ฆ , ๐‘ค ๐‘ , ๐‘ค ๐‘› : word์˜ sequence๋‚ด์—์„œ ๐‘›๋ฒˆ์งธ์— ์žˆ๋Š” word
๏ฌ Corpus : ๐ท documents์˜ collection
๏ฌ ๐ถ = ๐•จ1, ๐•จ2, โ€ฆ , ๐•จ ๐ท
Probabilistic Topic Models์˜ ๊ตฌ์กฐ
๋ฌธ์„œ ๐‘‘์˜ ๋‹จ์–ด ๐‘ค๐‘– ๋Œ€ํ•œ ๋ถ„ํฌ :
๐‘ƒ ๐‘ค๐‘– =
๐‘˜=1
๐พ
๐‘ƒ ๐‘ค๐‘–|๐‘ง๐‘– = ๐‘˜ ๐‘ƒ ๐‘ง๐‘– = ๐‘˜
๏ฌ ๐‘ƒ ๐‘ค๐‘–|๐‘ง๐‘– = ๐‘˜ : ํ† ํ”ฝ ๐‘˜์—์„œ, ๋‹จ์–ด ๐‘ค๐‘–์˜ probability
๏ฌ ๊ฐ ํ† ํ”ฝ์—์„œ ์–ด๋–ค ๋‹จ์–ด๋“ค์ด ์ค‘์š”ํ• ๊นŒ?
๏ฌ ๐‘ƒ ๐‘ง๐‘– = ๐‘˜ : ๐‘–๋ฒˆ์งธ ๋‹จ์–ด์— ํ† ํ”ฝ ๐‘˜๊ฐ€ ํ• ๋‹น๋˜๋Š” probability (์ฆ‰, ํ† ํ”ฝ ๐‘—๊ฐ€ ๐‘–๋ฒˆ์งธ ๋‹จ์–ด๋ฅผ ์œ„ํ•ด ์ƒ˜ํ”Œ๋ง ๋  ํ™•๋ฅ )
๐›ฝ ๐‘˜ = ๐‘ƒ ๐‘ค|๐‘ง = ๐‘˜ : ํ† ํ”ฝ ๐‘˜์—์„œ, ๋‹จ์–ด๋“ค์˜ multinomial distribution
๐œƒ ๐‘‘ = ๐‘ƒ ๐‘ง : ๋ฌธ์„œ ๐‘‘์—์„œ, ํ† ํ”ฝ๋“ค์˜ multinomial distribution
Latent Dirichlet Allocation์˜ ๋“ฑ์žฅ
๋ฌธ์„œ ๐‘‘์˜ ๋‹จ์–ด ๐‘ค๐‘– ๋Œ€ํ•œ ๋ถ„ํฌ :
๐‘ƒ ๐‘ค๐‘– =
๐‘˜=1
๐พ
๐‘ƒ ๐‘ค๐‘–|๐‘ง๐‘– = ๐‘˜ ๐‘ƒ ๐‘ง๐‘– = ๐‘˜
๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ(Dirichlet distribution)์€ multinomial distribution์˜ ์ผค๋ ˆ ์‚ฌ์ „ ๋ถ„ํฌ๋กœ(conjugate prior) ์‚ฌ์šฉ
๋‹คํ•ญ ๋ถ„ํฌ(Multinomial distribution) ๐‘ = ๐‘1, โ€ฆ , ๐‘ ๐พ ์— ๋Œ€ํ•œ Dirichlet distribution :
๐ท๐‘–๐‘Ÿ ๐›ผ1, โ€ฆ , ๐›ผ ๐พ =
ฮ“ ๐‘˜ ๐›ผ ๐‘˜
๐‘˜ ฮ“ ๐›ผ ๐‘˜ ๐‘˜=1
๐พ
๐‘ ๐‘˜
๐›ผ ๐‘˜โˆ’1
๏ฌ Hyperparameter ๐›ผ๐‘— : ๋ฌธ์„œ ๐‘‘์—์„œ ํ† ํ”ฝ ๐‘—๊ฐ€ ์ƒ˜ํ”Œ๋ง ๋œ ํšŸ์ˆ˜์— ๋Œ€ํ•œ ์‚ฌ์ „ ๊ด€์ฐฐ count (๋ฌธ์„œ๋กœ๋ถ€ํ„ฐ ๋‹จ์–ด๊ฐ€ ์‹ค์ œ๋กœ ๊ด€
์ฐฐ๋˜๊ธฐ ์ด์ „์˜ ๊ฐ’)
LDA๋Š” Dirichlet distribution์„ ๐›‰์˜ prior๋กœ ์‚ฌ์šฉ
(Blei et. Al, 2003)
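A minimal sketch, assuming numpy, of how the hyperparameter α shapes the multinomials the Dirichlet distribution generates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small alpha -> sparse proportions (most mass on few components);
# large alpha -> near-uniform proportions.
print(rng.dirichlet([0.1, 0.1, 0.1]))   # e.g. almost all mass on one component
print(rng.dirichlet([10., 10., 10.]))   # e.g. close to (1/3, 1/3, 1/3)
```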
Latent Dirichlet Allocation์˜ ๋“ฑ์žฅ
Latent Dirichlet Allocation. Blei et. Al (2003)
LDA :
Dirichlet parameter
Variant LDA์˜ ๋“ฑ์žฅ
๋ฌธ์„œ ๐‘‘์˜ ๋‹จ์–ด ๐‘ค๐‘– ๋Œ€ํ•œ ๋ถ„ํฌ :
๐‘ƒ ๐‘ค๐‘– =
๐‘˜=1
๐พ
๐‘ƒ ๐‘ค๐‘–|๐‘ง๐‘– = ๐‘˜ ๐‘ƒ ๐‘ง๐‘– = ๐‘˜
Hyperparameter ๐œ‚ : Corpus์˜ ๋‹จ์–ด๊ฐ€ ๊ด€์ฐฐ๋˜๊ธฐ ์ด์ „์—, ํ† ํ”ฝ์—์„œ ๋‹จ์–ด๊ฐ€ ์ƒ˜ํ”Œ๋ง ๋œ ํšŸ์ˆ˜์— ๋Œ€ํ•œ ์‚ฌ์ „ ๊ด€์ฐฐ
count
Varian LDA๋Š” symmetric Dirichlet distribution(๐œผ)์„ ๐œท์˜ prior๋กœ ์‚ฌ์šฉ
(Griffiths and Steyvers, 2004)
Variant LDA์˜ ๋“ฑ์žฅ Variant LDA :
Dirichlet parameter
Introduction to Probabilistic Topic Models. David M. Blei (2012)
๐›ผ Dirichlet parameter
๐œƒ ๐‘‘ ๋ฌธ์„œ ๐‘‘์—์„œ ํ† ํ”ฝ ๋น„์œจ(proportion) ๐œƒ ๐‘‘,๐‘˜ ๋ฌธ์„œ ๐‘‘์—์„œ ํŠน์ • ํ† ํ”ฝ ๐‘˜์˜ proportion
๐‘ ๐‘‘ ๋ฌธ์„œ ๐‘‘์—์„œ ํ† ํ”ฝ ํ• ๋‹น(assignment) ๐‘ ๐‘‘,๐‘› ๋ฌธ์„œ ๐‘‘์—์„œ ๐‘›-th ๋‹จ์–ด์— ๋Œ€ํ•œ ํ† ํ”ฝ ํ• ๋‹น
๐‘Š๐‘‘ ๋ฌธ์„œ ๐‘‘์—์„œ ๊ด€์ฐฐ๋œ ๋‹จ์–ด๋“ค ๐‘Š๐‘‘,๐‘› ๋ฌธ์„œ ๐‘‘์—์„œ ๐‘›-th ๋‹จ์–ด
๐›ฝ ๐‘˜
ํ† ํ”ฝ ๐‘˜์˜ vocaburary์—์„œ์˜ ๋ถ„ํฌ
(๋‹จ์–ด ์ „์ฒด ์…‹์—์„œ ์ •์˜๋œ ํ† ํ”ฝ ๐‘˜์˜ ๋ถ„ํฌ)
๐œ‚ Dirichlet parameter
The plate surrounding ๐œƒ ๐‘‘ ๊ฐ ๋ฌธ์„œ ๐‘‘์— ๋Œ€ํ•˜์—ฌ, ํ† ํ”ฝ ๋ถ„ํฌ์˜ sampling (์ด ๐ท๊ฐœ์˜ ๋ฌธ์„œ)
The plate surrounding ๐›ฝ ๐‘˜ ๊ฐ topic ๐‘˜์— ๋Œ€ํ•˜์—ฌ, ๋‹จ์–ด ๋ถ„ํฌ์˜ sampling (์ด ๐พ๊ฐœ์˜ ํ† ํ”ฝ)
LDA ๋ชจ๋ธ์˜ ๋ณ€์ˆ˜
๐œƒ ๐‘‘
โ€ฒ
๐‘  :
๐›ฝ ๐‘˜
โ€ฒ
๐‘  :
Document Topic 1 Topic 2 Topic 3 โ€ฆ Topic ๐พ
Document 1 ๐œƒ1 0.2 0.4 0.0 โ€ฆ 0.1
Document 2 ๐œƒ2 0.8 0.1 0.0 โ€ฆ 0.0
โ€ฆ โ€ฆ โ€ฆ โ€ฆ โ€ฆ โ€ฆ
Document ๐‘€ ๐œƒ ๐‘€ 0.5 0.4 0.1 โ€ฆ 0.0
Terms Topic 1 ๐›ฝ1 Topic 2 ๐›ฝ2 Topic 3 ๐›ฝ3 โ€ฆ Topic ๐พ ๐›ฝ ๐พ
Word 1 0.02 0.09 0.00 โ€ฆ 0.00
Word 2 0.08 0.52 0.37 โ€ฆ 0.03
โ€ฆ โ€ฆ โ€ฆ โ€ฆ โ€ฆ โ€ฆ
Wordt ๐‘‰ 0.05 0.12 0.01 โ€ฆ 0.45
ํ•ฉ : 1
ํ•ฉ : 1
- Variant LDA version์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์œผ๋ฏ€๋กœ, ์ด version์„ LDA ๋ผ๊ณ  ์ง€์นญ ํ•˜๊ฒ ์Œ
LDA์˜ Generative process
LDA๋Š” generative model์ด๋‹ค.
1. ๋ฌธ์„œ์˜ ๋‹จ์–ด์˜ ๊ฐฏ์ˆ˜ ๐‘์„ Poisson ๋ถ„ํฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์„ ํƒํ•œ๋‹ค. ๐‘~๐‘ƒ๐‘œ๐‘–๐‘ ๐‘ ๐‘œ๐‘›(๐œ‰)
2. ๋ฌธ์„œ์˜ ํ† ํ”ฝ ๋ถ„ํฌ(proportion) ๐œƒ ๐‘‘๋ฅผ Dirichlet(๐›ผ) ๋ถ„ํฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์„ ํƒํ•œ๋‹ค. ๐œƒ ๐‘‘~๐ท๐‘–๐‘Ÿ๐‘–๐‘โ„Ž๐‘™๐‘’๐‘ก ๐›ผ
3. ๋ฌธ์„œ์˜ ๋‹จ์–ด ๊ฐ๊ฐ์— ๋Œ€ํ•˜์—ฌ
a. ํ† ํ”ฝ ๋ถ„ํฌ ๐œƒ ๐‘‘๋ฅผ ์ด์šฉํ•˜์—ฌ, ๋‹จ์–ด์— ํ† ํ”ฝ์„ ํ• ๋‹นํ•œ๋‹ค. ๐‘ ๐‘‘,๐‘›~๐‘€๐‘ข๐‘™๐‘ก๐‘–๐‘›๐‘œ๐‘š๐‘–๐‘Ž๐‘™ ๐œƒ
b. ๐‘ ๐‘Š๐‘‘,๐‘›|๐‘ ๐‘‘,๐‘›, ๐›ฝ ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์„ ํƒํ•œ๋‹ค. ์ด ํ™•๋ฅ ๋ถ„ํฌ๋Š” ๋‹คํ•ญ๋ถ„ํฌ์ด๋‹ค.
LDA์˜ Generative process
์˜ˆ์ œ)
1. ์ƒˆ๋กœ์šด ๋ฌธ์„œ ๐ท์˜ ๊ธธ์ด๋ฅผ 5๋กœ ์„ ํƒํ•œ๋‹ค. ์ฆ‰, ๐ท = ๐‘ค1, ๐‘ค2, ๐‘ค3, ๐‘ค4, ๐‘ค5
2. ๋ฌธ์„œ ๐ท์˜ ํ† ํ”ฝ ๋ถ„ํฌ๋ฅผ 50%๋Š” ์Œ์‹(food), 50%๋Š” ๋™๋ฌผ(animal)๋กœ ์„ ํƒํ•œ๋‹ค.
3. ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•˜์—ฌ,
1. ์ฒซ๋ฒˆ์งธ ๋‹จ์–ด ๐‘ค1์— food topic์„ ํ• ๋‹นํ•œ๋‹ค. Food topic์—์„œ broccoli๋ฅผ ๐‘ค1์œผ๋กœ ์„ ํƒํ•œ๋‹ค.
2. ๋‘๋ฒˆ์งธ ๋‹จ์–ด ๐‘ค2์— animal topic์„ ํ• ๋‹นํ•œ๋‹ค. Animal topic์—์„œ panda๋ฅผ ๐‘ค2์œผ๋กœ ์„ ํƒํ•œ๋‹ค.
3. ์„ธ๋ฒˆ์งธ ๋‹จ์–ด ๐‘ค3์— animal topic์„ ํ• ๋‹นํ•œ๋‹ค. Animal topic์—์„œ adorable ๋ฅผ ๐‘ค3์œผ๋กœ ์„ ํƒํ•œ๋‹ค.
4. ๋„ค๋ฒˆ์งธ ๋‹จ์–ด ๐‘ค4์— food topic์„ ํ• ๋‹นํ•œ๋‹ค. Food topic์—์„œ cherries ๋ฅผ ๐‘ค4์œผ๋กœ ์„ ํƒํ•œ๋‹ค.
5. ๋‹ค์„ฏ๋ฒˆ์งธ ๋‹จ์–ด ๐‘ค5์— food topic์„ ํ• ๋‹นํ•œ๋‹ค. Food topic์—์„œ eating ๋ฅผ ๐‘ค5์œผ๋กœ ์„ ํƒํ•œ๋‹ค.
๐ท : broccoli panda adorable cherries eating
LDA ๋ชจ๋ธ์˜ inference
๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ฌธ์„œ ๋‚ด ๋‹จ์–ด ๐‘Š๐‘‘,๐‘›๋ฅผ ์ด์šฉํ•˜์—ฌ, LDA ๋ชจ๋ธ์˜ ์ž ์žฌ ๋ณ€์ˆ˜(hidden variable)์ธ ๋ฌธ์„œ์˜ ํ† ํ”ฝ๋ถ„ํฌ ๐œƒ ๐‘‘
์™€ ํ† ํ”ฝ์˜ ๋‹จ์–ด๋ถ„ํฌ ๐›ฝ ๐‘˜๋ฅผ ์ถ”์ •ํ•˜๋Š” ๊ณผ์ •์ด inference์ด๋‹ค.
Generative probabilistic modeling์—์„œ๋Š”, data๋Š” ์ž ์žฌ ๋ณ€์ˆ˜(hidden variable)๋ฅผ ํฌํ•จํ•˜๋Š” generative
process์—์„œ๋ถ€ํ„ฐ ๋ฐœ์ƒํ•˜๋Š”๊ฒƒ์œผ๋กœ ๋‹ค๋ฃฌ๋‹ค. ๋”ฐ๋ผ์„œ, generative process๋Š” observed random variable๊ณผ
hidden random variable์˜ ๊ฒฐํ•ฉ ํ™•๋ฅ ๋ฐ€๋„(joint probability distribution)๋ฅผ ์ •์˜ํ•œ๋‹ค.
๏ฌ Observed variables : ๋ฌธ์„œ๋‚ด์˜ ๋‹จ์–ด๋“ค
๏ฌ Hidden variables : ๋ฌธ์„œ์˜ ํ† ํ”ฝ ๋ถ„ํฌ, ํ† ํ”ฝ์˜ ๋‹จ์–ด ๋ถ„ํฌ (topic structure)
๊ฒฐํ•ฉ ํ™•๋ฅ ๋ฐ€๋„ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ observed variable์ด ์ฃผ์–ด์กŒ์„ ๋•Œ hidden variable์˜ ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ๊ตฌํ•œ๋‹ค.
์ด ๋ถ„ํฌ๋Š” ์‚ฌํ›„ ํ™•๋ฅ ๋ถ„ํฌ(posterior distribution)์ด๋‹ค.
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท , ๐‘ค1:๐ท =
๐‘–=1
๐พ
๐‘ ๐›ฝ๐‘–
๐‘‘=1
๐ท
๐‘ ๐œƒ ๐‘‘
๐‘›=1
๐‘
๐‘ ๐‘ง ๐‘‘,๐‘›|๐œƒ ๐‘‘ ๐‘ ๐‘ค ๐‘‘,๐‘›|๐›ฝ1:๐พ, ๐‘ง ๐‘‘,๐‘›
๊ด€์ฐฐ ๊ฐ€๋Šฅ ๋ฐ์ดํ„ฐ ๐‘ค1:๐ท๋ฅผ ํ†ตํ•ด์„œ inferenceํ•ด์•ผ ํ•  ๋ณ€์ˆ˜ : ๐›ฝ1:๐ท, ๐œƒ1:๐ท, ๐‘ง1:๐ท
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท|๐‘ค1:๐ท =
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท, ๐‘ค1:๐ท
๐‘ ๐‘ค1:๐ท
LDA ๋ชจ๋ธ์˜ inference
๐›ฝ1:๐พ ํ† ํ”ฝ 1~๐พ์˜ vocabulary์—์„œ์˜ ๋ถ„ํฌ
๐œƒ1:๐ท ๋ฌธ์„œ 1~๐ท์—์„œ์˜ ํ† ํ”ฝ ๋น„์œจ
๐‘ง1:๐ท ๋ฌธ์„œ 1~๐ท์—์„œ์˜ ํ† ํ”ฝ ํ• ๋‹น
๐‘ค1:๐ท ๋ฌธ์„œ 1~๐ท์—์„œ ๊ด€์ฐฐ๋œ ๋‹จ์–ด๋“ค
Posterior dist.
LDA ๋ชจ๋ธ์˜ inference
Posterior distribution์„ ๊ตฌํ•˜๋Š” ๊ฒƒ์€ ์‰ฌ์šด๊ฒƒ์ธ๊ฐ€?
๋ถ„์ž์˜ ๊ฒฝ์šฐ๋ฅผ ๋จผ์ € ์‚ดํŽด๋ณด์ž.
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท|๐‘ค1:๐ท =
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท, ๐‘ค1:๐ท
๐‘ ๐‘ค1:๐ท
๋ชจ๋“  random variable์˜ ๊ฒฐํ•ฉ ํ™•๋ฅ ๋ฐ€๋„ ํ•จ์ˆ˜๋Š”,
hidden variable์ด ์ž„์˜๋กœ ์…‹ํŒ…๋œ๋‹ค๋ฉด ์‰ฝ๊ฒŒ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ
LDA ๋ชจ๋ธ์˜ inference
๋ถ„๋ชจ์˜ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณด์ž.
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท|๐‘ค1:๐ท =
๐‘ ๐›ฝ1:๐พ, ๐œƒ1:๐ท, ๐‘ง1:๐ท, ๐‘ค1:๐ท
๐‘ ๐‘ค1:๐ท
Observed variable์˜ ์ฃผ๋ณ€ ๋ฐ€๋„ํ•จ์ˆ˜(marginal probability)
- ์ž„์˜์˜ topic model์—์„œ, observed corpus๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์„ ํ™•๋ฅ ์„ ๊ตฌํ•˜๋Š” ๊ฒƒ
- ๋ชจ๋“  hidden topic structure์˜ ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ(instantiation)๋ฅผ ๊ตฌํ•˜๊ณ , ๊ฒฐํ•ฉ ํ™•๋ฅ ๋ฐ€๋„ ํ•จ์ˆ˜๋ฅผ summation
๊ฐ€๋Šฅํ•œ hidden topic sturcture๋Š” ์ง€์ˆ˜์ ์œผ๋กœ ๋งŽ๋‹ค. ๋”ฐ๋ผ์„œ ํ•ด๋‹น ๋ฐ€๋„ํ•จ์ˆ˜๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ์–ด๋ ต๋‹ค.
Modern probabilistic models, Bayesian statistics์—์„œ๋Š” ๋ถ„๋ชจ์˜ ๋ถ„ํฌ ๋•Œ๋ฌธ์— posterior๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด
์–ด๋ ต๋‹ค. ๋”ฐ๋ผ์„œ posterior๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ถ”์ •ํ•˜๋Š” ๊ธฐ๋ฒ•์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ๋งŽ์ด ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค.
๋”ฐ๋ผ์„œ, topic modeling algorithms์—์„œ๋„ posterior distribution์„ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•œ๋‹ค.
- sampling based method
- variational method
Difficulty of deriving the marginal probability p(𝕨_{1:D})
- Joint distribution of a topic mixture θ (parameters : α, β)

p(θ, 𝕫, 𝕨 | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

- Marginal distribution of a document

p(𝕨 | α, β) = ∫ p(θ | α) ( Π_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

- Probability of the corpus

p(D | α, β) = Π_{d=1}^{M} ∫ p(θ_d | α) ( Π_{n=1}^{N_d} Σ_{z_{dn}} p(z_{dn} | θ_d) p(w_{dn} | z_{dn}, β) ) dθ_d

The notation here follows Latent Dirichlet Allocation, Blei et al. (2003).
Algorithm for extracting topics
Gibbs sampling

P(z_i = j | 𝕫_{-i}, w_i, d_i, ·) ∝ (C^{VK}_{w_i j} + η) / (Σ_{w=1}^{V} C^{VK}_{wj} + Vη) · (C^{DK}_{d_i j} + α) / (Σ_{t=1}^{K} C^{DK}_{d_i t} + Kα)

Factors that influence the probability that the topic of w, the i-th word in the corpus, is j:
- Factor 1 : the larger the share of this word among all words assigned to topic j, the higher the probability that its topic is j.
- Factor 2 : the more of the other words in the document containing w_i are assigned to topic j, the higher the probability that its topic is j.
(A sampling sketch follows below.)

z_i = j : topic j is assigned to w, the i-th word in the corpus
𝕫_{-i} : the topic assignments of all words except the i-th
w_i : word index
d_i : document index
· : all other observed information
C^{VK}_{wj} : the number of times word w is assigned to topic j (excluding the current position i)
C^{DK}_{dj} : the number of times a word of document d is assigned to topic j (excluding the current position i)
η : Dirichlet parameter used to generate the topic-word distributions
α : Dirichlet parameter used to generate the document-topic distributions
The η and α terms act as smoothing.
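A minimal sketch, assuming numpy, of collapsed Gibbs sampling with the full conditional above; docs is assumed to be a list of word-index lists, and the document-side denominator is dropped because it does not depend on the topic:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    C_vk = np.zeros((V, K))                  # word-topic counts  (C^{VK})
    C_dk = np.zeros((len(docs), K))          # document-topic counts (C^{DK})
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]   # 1. random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            C_vk[w, z[d][n]] += 1
            C_dk[d, z[d][n]] += 1
    for _ in range(iters):                   # 2.-5. sweep until convergence
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                C_vk[w, k] -= 1; C_dk[d, k] -= 1      # exclude the current word
                p = ((C_vk[w] + eta) / (C_vk.sum(axis=0) + V * eta)
                     * (C_dk[d] + alpha))             # unnormalized full conditional
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][n] = k
                C_vk[w, k] += 1; C_dk[d, k] += 1
    return C_vk, C_dk, z
```

From the final counts, θ_d and β_k are recovered by normalizing the smoothed rows of C^{DK} and columns of C^{VK}.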
Algorithm for extracting topics
Example)
Doc 0 : (z_{0,0}, z_{0,1}, z_{0,2}, z_{0,3})
Doc 1 : (z_{1,0}, z_{1,1}, z_{1,2})
Doc 2 : (z_{2,0}, z_{2,1}, z_{2,2}, z_{2,3})
Doc 3 : (z_{3,0}, z_{3,1}, z_{3,2}, z_{3,3}, z_{3,4})
Doc 4 : (z_{4,0}, z_{4,1}, z_{4,2}, z_{4,3}, z_{4,4})
Doc 5 : (z_{5,0}, z_{5,1}, z_{5,2}, z_{5,3}, z_{5,4}, z_{5,5})
z_{i,j} : the random variable for the topic assigned to the j-th word of document i
1. Randomly assign a topic to every variable.
2. Update the value of z_{0,0} based on all values except z_{0,0}.
3. Update the value of z_{0,1} based on all values except z_{0,1}.
…
4. Update the value of z_{5,5} based on all values except z_{5,5}.
5. Repeat until the variables converge.
Algorithm for extracting topics
C^{VK} is the V × K word-topic count matrix; entry C_{vk} is the number of times word v is assigned to topic k:

C^{VK} = [ C_{11} … C_{1K} ; C_{21} … C_{2K} ; … ; C_{V1} … C_{VK} ]

C^{DK} is the D × K document-topic count matrix; entry C_{dk} is the number of times a word of document d is assigned to topic k:

C^{DK} = [ C_{11} … C_{1K} ; C_{21} … C_{2K} ; … ; C_{D1} … C_{DK} ]
Generative model vs. Statistical inference
์ตœ์ ์˜ ํ† ํ”ฝ ์ˆ˜
Perplexity
๏ฌ Language modeling์—์„œ ์ฃผ๋กœ ์ปจ๋ฒค์…˜์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.
๏ฌ ์ผ๋ฐ˜์ ์œผ๋กœ perplexity๋Š” exp ๐ป ๐‘ ๋กœ ํ‘œํ˜„๋œ๋‹ค. ๐ป ๐‘ ๋Š” ๐‘์˜ ์—”ํŠธ๋กœํ”ผ๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
๏ฌ LDA์—์„œ ์ถ”์ •๋œ ํ† ํ”ฝ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ๋ฐœ์ƒ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜์˜€์„ ๋•Œ, ํ™•๋ฅ ๊ฐ’์ด ๋†’์„์ˆ˜๋ก generative
process๋ฅผ ์ œ๋Œ€๋กœ ์„ค๋ช…ํ•œ๋‹ค๊ณ  ๋ณธ๋‹ค.
๐‘ƒ๐‘’๐‘Ÿ๐‘๐‘™๐‘’๐‘ฅ๐‘–๐‘ก๐‘ฆ ๐ถ = exp โˆ’
๐‘‘=1
๐ท
log ๐‘ ๐•จ ๐‘‘
๐‘‘=1
๐ท
๐‘๐‘‘
๏ฌ ๐‘ ๐•จ ๐‘‘ : ํ† ํ”ฝ์˜ ๋‹จ์–ด๋ถ„ํฌ ์ •๋ณด์™€ ๋ฌธ์„œ๋‚ด ํ† ํ”ฝ์˜ ๋น„์ค‘ ์ •๋ณด์˜ ๊ณฑ์„ ์ด์šฉํ•˜์—ฌ ๊ณ„์‚ฐ
๏ฌ ๐‘ ๐•จ ๐‘‘ ๋Š” ํด์ˆ˜๋ก ์ข‹์œผ๋ฏ€๋กœ, perplexity๋Š” ์ž‘์„์ˆ˜๋ก ์ข‹๋‹ค.
์ตœ์ ์˜ ํ† ํ”ฝ ์ˆ˜
Topic coherence
๏ฌ ์‹ค์ œ๋กœ ์‚ฌ๋žŒ์ด ํ•ด์„ํ•˜๊ธฐ์—(interpretability) ์ ํ•ฉํ•œ ํ‰๊ฐ€ ์ฒ™๋„๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์ œ์‹œ๋œ ์—ฌ๋Ÿฌ ์ฒ™๋„๋“ค ์ค‘ ํ•˜๋‚˜
๏ฌ Newman์€ ๋‰ด์Šค์™€ ์ฑ… ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜์—ฌ ํ† ํ”ฝ ๋ชจ๋ธ๋ง์„ ์‹ค์‹œ. ๊ทธ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ ํ† ํ”ฝ๋“ค์ด ์œ ์˜๋ฏธํ•œ์ง€ ์ˆ˜์ž‘์—…์œผ๋กœ
์ ์ˆ˜ํ™”. ๊ทธ๋ฆฌ๊ณ  ์ด๋ ‡๊ฒŒ ๋งค๊ฒจ์ง„ ์ ์ˆ˜์™€ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ๋Š” ์ฒ™๋„๋ฅผ ์ œ์‹œ.
๏ฌ ํ† ํ”ฝ ๋ชจ๋ธ๋ง ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ ๊ฐ๊ฐ์˜ ํ† ํ”ฝ์—์„œ ์ƒ์œ„ ๐‘๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์„ ํƒํ•œ ํ›„, ์ƒ์œ„ ๋‹จ์–ด ๊ฐ„์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ, ์‹ค
์ œ๋กœ ํ•ด๋‹น ํ† ํ”ฝ์ด ์˜๋ฏธ์ ์œผ๋กœ ์ผ์น˜ํ•˜๋Š” ๋‹จ์–ด๋“ค๋ผ๋ฆฌ ๋ชจ์—ฌ์žˆ๋Š”์ง€ ํŒ๋‹จ ๊ฐ€๋Šฅ
๏ฌ ๋‹ค์–‘ํ•œ ๋ฒ„์ „
๏ฌ NPMI
๏ฌ UMass
๏ฌ UCI
๏ฌ c_v
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human Language Technologies
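A hedged sketch, assuming the gensim library, of choosing the number of topics by the c_v coherence score; texts is assumed to be a list of token lists:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def best_k(texts, candidates=(2, 5, 10, 20)):
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    scores = {}
    for k in candidates:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=0)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence='c_v')
        scores[k] = cm.get_coherence()       # higher c_v = more coherent topics
    return max(scores, key=scores.get), scores
```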
์ตœ์ ์˜ ํ† ํ”ฝ ์ˆ˜
Topic coherence : c_v version
M Rรถder (2015) Exploring the Space of Topic Coherence Measures
1. Axes of a spatial representation
- LSA
2. Probabilistic topics
- LDA
3. Bayesian nonparametrics
- HDP
Dirichlet Process
LDA requires the number of topics k. For a given data set, it is hard to know in advance how many topics it contains; this is one of LDA's weaknesses.
However, we can find an appropriate number of topics from the data itself, and this can be done using the Dirichlet Process.
The Dirichlet distribution can be viewed as a distribution that generates multinomial distributions according to the given hyperparameters. Hence, with a Dirichlet prior, the posterior over a multinomial is easy to obtain (the Dirichlet distribution is the conjugate prior of the multinomial distribution).
Dirichlet Process
Suppose X follows a Dirichlet distribution, say X ~ Dir(1,2,1), where k = 3. Samples drawn from this distribution then consist of exactly 3 components whose elements sum to 1:
x_1 = (0.2, 0.5, 0.3)
x_2 = (0.1, 0.6, 0.3)
…
Extending the Dirichlet distribution so that different values of k can arise gives the Dirichlet Process:

X ~ DP(α, H)

α is the concentration parameter and H is the base distribution. As α grows, draws of X resemble the base distribution H more and more closely; as α approaches 0, the probability mass of a draw concentrates on just a few atoms. (A sampling sketch follows below.)
Y. W. Teh et al. (2005). Hierarchical Dirichlet Processes.
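A minimal sketch, assuming numpy, of drawing one sample from DP(α, H) via the stick-breaking construction, with H = N(0, 1) as an assumed base distribution:

```python
import numpy as np

def dp_sample(alpha, n_atoms=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)               # stick-breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas[:-1])])
    weights = betas * remaining                              # bar heights of the sample
    atoms = rng.standard_normal(n_atoms)                     # atom locations drawn from H
    return atoms, weights   # truncated to n_atoms terms, so weights sum to slightly less than 1

atoms, weights = dp_sample(alpha=10.0)
# Small alpha puts almost all mass on a few atoms; large alpha spreads it out,
# matching the figure on the next slide.
```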
Dirichlet Process
α = 1
α = 10
α = 100
α = 1000
https://en.wikipedia.org/wiki/Dirichlet_process
Samples drawn from the distribution of X for different values of the concentration parameter α.
The number of bars k contained in a sample differs even for the same α.
Chinese Restaurant Process
There is a Chinese restaurant with infinitely many tables, and each table has infinitely many seats, so any number of customers can sit at it.
When a customer sits down at a table, the following rules apply:
- A customer chooses which table to sit at based on the popularity of the tables.
- The dish served at a table is decided when its first customer sits down, by drawing one dish from the base distribution H.
- The popularity of a table is proportional to the number of people sitting at it, and a customer may also choose an empty table. The popularity of an empty table is proportional to the concentration parameter α.
The first customer sits at an empty table. The second customer either sits at the first table or at a new, empty table. If α is large, the probability of choosing an empty table is higher (even when the occupied tables are unpopular).
As customers keep arriving indefinitely, the set of occupied tables settles (it remains countably infinite) and the popularity ratios of the tables converge to fixed values. The resulting vector of popularity ratios is a sample drawn from a Dirichlet process with base distribution H and concentration parameter α. (A simulation sketch follows below.)
Hierarchical Dirichlet Process
A difficulty with the DP: suppose there is not one Chinese restaurant but several, and we want to find the tables serving a particular dish to check how popular that dish is in each restaurant. The problem arises here. We do not know how many tables each restaurant has, nor which dish is on each table. There is no guarantee that a given dish even appears in two different restaurants, so dishes cannot be compared across restaurants.
So researchers proposed the Hierarchical Dirichlet Process:

G_0 ~ DP(γ, H)
G_j ~ DP(α, G_0)

A top-level Dirichlet process is drawn from the base distribution, and several lower-level Dirichlet processes are drawn using this Dirichlet process as their base distribution. In this way the lower-level Dirichlet processes are tied together.
Hierarchical Dirichlet Process
Y. W. Teh et al. (2005). Hierarchical Dirichlet Processes.
CRP using HDP
There is a chain of Chinese restaurants. Each branch has infinitely many tables, and each table has infinitely many seats. At each branch, customers steadily arrive one at a time and choose a table to sit at. There is a headquarters that distributes dishes to the branches. Whenever a customer at a branch sits at an empty table, an employee goes to headquarters and picks a dish to put on that table. Headquarters has infinitely many very large platters of dishes, and each platter is large enough to serve its dish infinitely many times.
CRP using HDP
[G_0 ~ DP(γ, H)]
- An employee decides which platter to serve from based on the popularity of the platters.
- The dish on a platter is decided when the first employee serves from it, by drawing one dish from the base distribution H.
- The popularity of a platter is proportional to the number of employees serving from it, and an employee may also choose an empty platter. The popularity of an empty platter is governed by the concentration parameter γ.
[G_j ~ DP(α, G_0)]
- A customer chooses which table to sit at based on the popularity of the tables.
- The dish served at a table is decided by sending an employee to headquarters when the table's first customer sits down; the distribution of that dish follows G_0, the distribution by which employees serve dishes from the platters. (Each table serves exactly one dish.)
- The popularity of a table is proportional to the number of people sitting at it, and depending on the concentration parameter α, a customer may also choose an empty table.
CRP using HDP
Y. W. Teh et al. (2005). Hierarchical Dirichlet Processes.
HDP for infinite topic models
- Chinese restaurant branches → Documents
- Dishes assigned by headquarters → Topics
- Individual customers → Words in a document
(A usage sketch follows below.)
https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture20.pdf
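A hedged sketch, assuming the gensim library, of HDP-based topic modeling, where the number of topics is inferred from the data rather than fixed in advance; the toy texts are illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

texts = [["broccoli", "bananas", "breakfast"],
         ["cats", "cute", "dogs", "hamster"]]   # toy token lists
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

hdp = HdpModel(corpus, id2word=dictionary)      # no num_topics argument needed
print(hdp.show_topics(num_topics=5, num_words=4))
```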
