Pystan for nlp

Toward natural language processing with PyStan

2013/12/22 BUGS/Stan study group #2
@xiangze750

Agenda
● The appeal of Python
● What PyStan can do
● NLP (natural language processing)
– Topic models
– Libraries
– Mixture models
– LDA
– Dirichlet process, Chinese restaurant process
– Hierarchical Dirichlet process
● Neutral theory in ecology

The appeal of Python

● A rich set of libraries
● Computer vision (PIL, OpenCV)
● Symbolic math (sympy)
● Audio processing (wave, Audiolab) and music analysis (music21)
● Visualization (matplotlib, networkx)

What PyStan can do

● Data (the model's variables) are passed in as arrays

http://pystan.readthedocs.org/en/latest/getting_started.html
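
A minimal sketch of passing data as arrays, assuming the PyStan 2.x interface described at the URL above (`pystan.stan` with `model_code`, `data`, `iter`, `chains`). The coin-flip model, the observations, and the helper names `make_data` and `fit` are made up for illustration and are not from the slides. `pystan` is imported lazily, so the data-building part runs without it.

```python
# Toy Stan program (illustrative assumption): estimate the bias of a coin.
MODEL_CODE = """
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}
"""


def make_data(observations):
    """Pack Python values into the dict that is handed to Stan as `data`."""
    return {"N": len(observations), "y": list(observations)}


def fit(observations, iterations=1000, chains=4):
    """Compile the model and sample from it (needs PyStan 2.x installed)."""
    import pystan  # assumption: PyStan 2.x; imported only when fit() is called

    return pystan.stan(model_code=MODEL_CODE, data=make_data(observations),
                       iter=iterations, chains=chains)


# Keys of the dict match the names declared in the Stan data block.
data = make_data([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])
```

Calling `fit([...])` would then return the fit object, from which posterior draws of `theta` can be read.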

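Working with NLP libraries

● NLTK (Natural Language Toolkit)
– Provides a variety of corpora (collections of documents and words)
– N-gram extraction, frequency distributions, and more
● Gensim
– Topic model implementations (discussed later)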
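
To make "N-grams and frequency distributions" concrete, here is a small sketch using only the Python standard library. NLTK ships its own helpers for both tasks; the function `ngrams` below and the sample sentence are illustrative assumptions, not NLTK's API.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence, tokenized by whitespace.
tokens = "the cat sat on the mat and the cat slept".split()

bigrams = ngrams(tokens, 2)
word_freq = Counter(tokens)      # frequency distribution over words
bigram_freq = Counter(bigrams)   # frequency distribution over bigrams
```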


● Bag of words
– Discards all information about where words sit relative to one another

http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png
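
A short sketch of the bag-of-words idea: two documents with the same words in a different order map to the same count vector. The documents and helper names are made up for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count each word; the order of the tokens is thrown away."""
    return Counter(tokens)


def to_count_vector(bag, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [bag[word] for word in vocabulary]


doc_a = "stan fits the model and stan samples".split()
doc_b = "the model samples stan and fits stan".split()

vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = to_count_vector(bag_of_words(doc_a), vocabulary)
vec_b = to_count_vector(bag_of_words(doc_b), vocabulary)
```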


● Topic model
– Classifies documents into topics
– Treats the topic as a random variable

Working with NLP libraries

● shuyo's Python implementation of topic models
– https://github.com/shuyo/iir/blob/master/lda
– Reads corpora through NLTK
– Converts documents to bag-of-words form
– Also implements a hierarchical Dirichlet model (discussed later)

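Mixture models

● Multinomial mixture model
– Models the choice of words within each topic with a multinomial (categorical) distribution
– A multinomial distribution is a "loaded die"

Overview of topic models
http://sugiyama-www.cs.titech.ac.jp/~sugi/2007/Canon-MachineLearning27-jp.pdf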
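
The "loaded die" picture can be sketched as follows: pick one topic for the whole document, then roll that topic's die once per word. The topics, words, and probabilities below are invented for illustration.

```python
import random


def roll(die, rng):
    """Draw one face from a loaded die given as {face: probability}."""
    faces = list(die)
    weights = [die[face] for face in faces]
    return rng.choices(faces, weights=weights, k=1)[0]


def sample_document(topic_weights, topic_dice, length, rng):
    """Multinomial mixture: one topic per document, one roll per word."""
    topic = roll(topic_weights, rng)
    words = [roll(topic_dice[topic], rng) for _ in range(length)]
    return topic, words


# Illustrative numbers only.
TOPIC_WEIGHTS = {"stats": 0.5, "ecology": 0.5}
TOPIC_DICE = {
    "stats": {"prior": 0.5, "sampler": 0.4, "forest": 0.1},
    "ecology": {"forest": 0.6, "species": 0.3, "prior": 0.1},
}

rng = random.Random(0)
topic, words = sample_document(TOPIC_WEIGHTS, TOPIC_DICE, 8, rng)
```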

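Mixture models

● Polya mixture model
– Uses a Dirichlet distribution as the prior on the topic distributions
– The Dirichlet distribution is the conjugate prior of the categorical and multinomial distributions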
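
Conjugacy can be written out in a few lines. This is a sketch under the assumption of a single multinomial with parameter θ over V words and observed word counts n_v; the symbols θ and n_v are introduced here for illustration.

```latex
\begin{aligned}
p(\theta \mid \alpha) &\propto \prod_{v=1}^{V} \theta_v^{\alpha_v - 1}
  && \text{(Dirichlet prior)} \\
p(n \mid \theta) &\propto \prod_{v=1}^{V} \theta_v^{n_v}
  && \text{(multinomial likelihood)} \\
p(\theta \mid n, \alpha) &\propto \prod_{v=1}^{V} \theta_v^{\alpha_v + n_v - 1}
  && \text{(posterior)} \\
\theta \mid n, \alpha &\sim \mathrm{Dir}(\alpha + n)
\end{aligned}
```

The posterior is again a Dirichlet distribution, with the observed counts added to the prior parameters.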

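Dirichlet distribution

● Conjugate prior of the categorical and multinomial distributions
● Returns values on the simplex
● In Stan:
vector<lower=0>[V] alpha;
simplex[V] x;
x ~ dirichlet(alpha);

A capsule-toy machine (gachagacha) that dispenses loaded dice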
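
A sketch of the "machine that dispenses loaded dice", using the standard construction of a Dirichlet draw as normalized independent gamma draws. The concentration values are arbitrary choices for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a point on the simplex by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)
# One loaded six-sided die; alpha below 1 tends to give uneven faces.
die = sample_dirichlet([0.5] * 6, rng)
```

Every call returns a different die, and each die is a valid probability vector.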

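LDA (latent Dirichlet allocation)

● Each word w_{m,n} has its own topic z_{m,n}.
● Each topic z_{m,n} has its own word distribution, so the words of a document follow a mixture.

[Figure: graphical model with nodes labeled "topic distribution (per document)", "word distribution (per topic)", "topic", and "word"]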
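
The generative story can be sketched in a few lines of Python: draw a word distribution per topic and a topic distribution per document, then draw a topic and a word for every position. Function names, corpus sizes, and the symmetric hyperparameters are assumptions for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a point on the simplex by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index in range(len(probs)) with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs, doc_length, n_topics, vocab_size, alpha, beta, rng):
    """Sample theta, phi, topics z and words w from the LDA generative model."""
    # phi[k]: word distribution of topic k
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    # theta[m]: topic distribution of document m
    theta = [sample_dirichlet([alpha] * n_topics, rng) for _ in range(n_docs)]
    topics, words = [], []
    for m in range(n_docs):
        z_m = [sample_categorical(theta[m], rng) for _ in range(doc_length)]
        w_m = [sample_categorical(phi[z], rng) for z in z_m]
        topics.append(z_m)
        words.append(w_m)
    return theta, phi, topics, words


rng = random.Random(0)
theta, phi, topics, words = generate_corpus(3, 10, 2, 5, 1.0, 0.5, rng)
```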

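LDA (latent Dirichlet allocation)

● Stan code (manual, page 128)
data {
  // declarations are omitted on the slide; reconstructed from the names used below
  int<lower=2> K;               // number of topics
  int<lower=2> V;               // vocabulary size
  int<lower=1> M;               // number of documents
  int<lower=1> N;               // total number of word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // document ID of word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];  // topic dist for doc m
  simplex[V] phi[K];    // word dist for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] <- log(theta[doc[n],k]) + log(phi[k,w[n]]);
    increment_log_prob(log_sum_exp(gamma));  // likelihood with z summed out
  }
}

● A categorical distribution on the latent variable z cannot be used directly, so z is summed out of the likelihood
(http://xiangze.hatenablog.com/entry/2013/12/19/013557)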
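
What the Stan loop accumulates can be checked by hand in Python. The sketch below mirrors the `gamma` / `log_sum_exp` step with 1-based `w` and `doc` arrays as in Stan; the small `theta` and `phi` tables are invented for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def lda_log_likelihood(theta, phi, w, doc):
    """Log-likelihood of the words with the topic of each word summed out."""
    n_topics = len(phi)
    total = 0.0
    for word, d in zip(w, doc):
        # gamma[k] = log p(z = k | doc d) + log p(word | z = k)
        gamma = [math.log(theta[d - 1][k]) + math.log(phi[k][word - 1])
                 for k in range(n_topics)]
        total += log_sum_exp(gamma)
    return total


# Illustrative values: 2 documents, 2 topics, 3 vocabulary words.
theta = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
w = [1, 3, 3, 2]      # word IDs, 1-based as in Stan
doc = [1, 1, 2, 2]    # document ID of each word, 1-based

log_lik = lda_log_likelihood(theta, phi, w, doc)
```

Summing over topics inside the log is what lets the model avoid sampling the discrete z.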

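Dirichlet process

● We want the number of topics to be variable (nonparametric)
– A Dirichlet distribution over infinitely many variables
– A probability distribution over probability distributions (Dirichlet distributions)
– The distribution does not change when variables are exchanged (cf. de Finetti's theorem)

● If, for any partition A = (A_0, A_1, ..., A_n) of Θ,
$(G(A_0), G(A_1), \dots, G(A_n)) \sim \mathrm{Dir}(\alpha H(A_0), \alpha H(A_1), \dots, \alpha H(A_n))$
then G is a Dirichlet process with base distribution H (α is the concentration parameter)

[Figure: the space Θ under G, split into regions whose areas are G(A_0), G(A_1), ..., G(A_n)]

Dirichlet Processes (Teh 2010)
http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf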

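Chinese restaurant process

● We want to represent infinitely many variables with a finite process
– The observed variables are finite
– Random variables are drawn one after another (invariant under exchange of variables)
– Customers tend to go to tables that already have many people (words)
– Customer: word; dish: topic; table: the correspondence between them

Dirichlet Processes (Teh 2010)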

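Chinese restaurant process

● The (n+1)-th customer
– Probability of sitting at a new table: α / (n + α)
– Probability of sitting at existing table k: n_k / (n + α), where n_k is the number of customers already at table k
– Tables with more seated customers are more likely to be chosen

Hierarchical Dirichlet Processes (Teh 2006)
http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/jasa2006.pdf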
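
A sketch of the seating rule as a simulation: each arriving customer picks an existing table with weight equal to its current occupancy, or a new table with weight alpha. The function name and parameter values are assumptions for illustration.

```python
import random


def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one by one; return each customer's table and table sizes."""
    tables = []    # tables[k] = number of customers seated at table k
    seating = []   # seating[i] = table chosen by customer i
    for _ in range(n_customers):
        # Existing table k has weight tables[k]; a new table has weight alpha.
        # rng.choices normalizes, so the divisor (n + alpha) is implicit.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an existing table
        seating.append(k)
    return seating, tables


rng = random.Random(0)
seating, tables = chinese_restaurant_process(50, 1.0, rng)
```

The number of occupied tables is not fixed in advance; it grows slowly with the number of customers.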

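Hierarchical Dirichlet process

● A Dirichlet process on top of a Dirichlet process: the base distribution of each Dirichlet process is itself drawn from a Dirichlet process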
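
Written as equations, in a sketch that assumes M documents and uses γ and α for the two concentration parameters (this notation is an assumption, chosen for illustration):

```latex
\begin{aligned}
G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H) \\
G_j \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0) \qquad j = 1, \dots, M
\end{aligned}
```

Because every G_j is drawn around the same discrete G_0, all documents share one set of topics while weighting them differently.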

CRP and the hierarchical Dirichlet process

● Chinese restaurant franchise
– Every restaurant serves dishes from the same menu
– Customer: word; dish: topic; table: the correspondence between them

CRP and the hierarchical Dirichlet process

● Chinese restaurant franchise
– Variables for the word distribution
– Variables for the topic distribution

Chinese restaurant franchise

● Difficulty in implementing it (Stan)
– Assignments to variables cannot be written inside the model block

“Infinite LDA” – Implementing the HDP with minimum code complexity
http://www.arbylon.net/publications/ilda.pdf

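Chinese restaurant franchise

● Implementation (JAGS), work in progress
# Unfinished sketch in JAGS-like pseudocode; it is not a runnable model as written
model {
  x[1] ~ dnorm(0.0, 1.0E-4)
  k <- 1
  # Franchise level: reuse an existing dish or create a new one
  for (j in 1:M) {
    for (i in 2:N) {
      q ~ dunif(0,1)
      totm <- sum(m)
      if (q > gamma/(totm+gamma)) {
        # existing dish, chosen in proportion to the counts m
        ind ~ dmulti(m/sum(m), 1)
        th[i] <- th[ind]
        m[ind] <- m[ind]+1
      } else {
        # new dish
        th[i] ~ dunif(0,1)
        k <- k+1
        m[k] <- 1
      }
    }
  }
  # Restaurant level: join an existing table or open a new one
  for (j in 1:M) {
    for (i in 2:N) {
      q ~ dunif(0,1)
      if (q > alpha/(i-1+alpha)) {
        # existing table, chosen in proportion to the counts n
        ind ~ dmulti(n/sum(n), 1)
        phi[i] ~ th[j][ind]
        nj[i] <- nj[i]+1
      } else {
        # new table
        phi[i] ~ th[docid[i]*N+]
        kn <- kn + 1
        n[kn] <- 1
      }
    }
  }
}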
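
For comparison, the franchise metaphor can be simulated directly in Python, where reassigning counters is not a problem. This is a sketch of the prior seating process only (no likelihood and no inference); the function name, data layout, and parameter values are assumptions for illustration.

```python
import random


def chinese_restaurant_franchise(customers_per_restaurant, alpha, gamma, rng):
    """Simulate dish (topic) assignments for every customer (word)."""
    dish_tables = []   # dish_tables[k] = tables serving dish k, all restaurants
    franchise = []     # franchise[j][i] = dish eaten by customer i in restaurant j
    for n_customers in customers_per_restaurant:
        table_sizes = []   # customers at each table of this restaurant
        table_dish = []    # dish served at each table of this restaurant
        dishes = []
        for _ in range(n_customers):
            # Restaurant level: existing table (weight = size) or new (alpha).
            weights = table_sizes + [alpha]
            t = rng.choices(range(len(weights)), weights=weights, k=1)[0]
            if t == len(table_sizes):
                # Franchise level: the new table orders an existing dish
                # (weight = tables already serving it) or a new one (gamma).
                dish_weights = dish_tables + [gamma]
                k = rng.choices(range(len(dish_weights)),
                                weights=dish_weights, k=1)[0]
                if k == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_sizes.append(1)
                table_dish.append(k)
            else:
                table_sizes[t] += 1
            dishes.append(table_dish[t])
        franchise.append(dishes)
    return franchise, dish_tables


rng = random.Random(0)
franchise, dish_tables = chinese_restaurant_franchise([20, 30, 25], 1.0, 1.0, rng)
```

Dishes are shared across restaurants through `dish_tables`, which is what ties the per-document topic sets together.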

The original motivation

● "There is no general-purpose 'package' for nonparametric Bayes" (Nonparametric Bayes for Non-Bayesians)
● Stochastic processes on various data structures
– Infinite Stochastic Tree
– Mondrian Process

[Figure: Mondrian Process, labeled "what I wanted to implement"]

“Nonparametric Bayes for Non-Bayesians”
http://www.ism.ac.jp/~daichi/paper/ibis2008-npbayes-tutorial.pdf

Stick breaking process

● An alternative representation of the (hierarchical) Dirichlet process

[Figure: a stick broken into pieces π0, π1, π2, π3]

Truncated stick breaking process...?
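
A sketch of a truncated stick-breaking construction: break off a Beta(1, alpha) fraction of what remains of the stick at each step, and give the final leftover to the last component so that the weights sum to one. The truncation level and alpha are arbitrary choices for illustration.

```python
import random


def truncated_stick_breaking(alpha, truncation, rng):
    """Return `truncation` mixture weights that sum to one."""
    weights = []
    remaining = 1.0   # length of the stick not yet broken off
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)   # leftover goes to the last component
    return weights


rng = random.Random(0)
pi = truncated_stick_breaking(2.0, 10, rng)
```

With a fixed truncation level the number of components is finite, which is the property that could make this form easier to express in a tool without dynamic allocation.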

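Aside: neutral theory in ecology

● Neutrality
– The abundance distribution of species that share the same ecological niche follows a fixed function
● Ewens distribution
– The distribution of each species' abundance within a limited niche
– A special case of the Chinese restaurant process

[Photo: Prof. Stephen P. Hubbell]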

Aside: neutral theory in ecology

● Ewens distribution

A unified theory of biogeography and relative species abundance and its application to tropical rain forests and coral reefs
http://www3.botany.ubc.ca/vellend/COM_ECOL/Hubbell_CoralReefs97.pdf
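
The Ewens sampling formula gives the probability that a sample of n individuals contains a_1 species with one individual, a_2 species with two, and so on: P(a) = n! / (θ(θ+1)...(θ+n-1)) × Π_j θ^{a_j} / (j^{a_j} a_j!). A sketch of evaluating it, with an illustrative function name and test values:

```python
from math import factorial


def ewens_probability(a, theta):
    """Ewens sampling formula; a[j-1] = number of species with j individuals."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1.0   # theta * (theta + 1) * ... * (theta + n - 1)
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= theta ** a_j / (j ** a_j * factorial(a_j))
    return prob


# All ways to split n = 3 individuals into species:
# three singletons, one singleton plus one pair, or one species of three.
partitions_of_3 = [(3, 0, 0), (1, 1, 0), (0, 0, 1)]
probs = [ewens_probability(a, 2.5) for a in partitions_of_3]
```

The probabilities over all partitions of n sum to one, and larger θ shifts mass toward samples with many rare species.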

Aside: neutral theory in ecology

● R's untb package
– http://cran.r-project.org/web/packages/untb
# example
demo(untb)
# rank-abundance distribution of the Saunders dataset and the estimated θ

Summary

● With PyStan, Stan's LDA is relatively easy to use.
● Because of its restrictions, Stan 2.0 cannot implement nonparametric LDA. It might be possible in JAGS.
● Ecology is amazing.

Reference

● shuyo's Python implementations of LDA and HDP-LDA (using nltk)
– https://github.com/shuyo/iir/blob/master/lda
● ノンパラベイズの入門の入門 (An introduction to the introduction to nonparametric Bayes)
– http://www.slideshare.net/shuyo/ss-15098006
● “Infinite LDA” – Implementing the HDP with minimum code complexity
– http://www.arbylon.net/publications/ilda.pdf
● Dirichlet Processes (Teh 2010)
– http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf
● ディリクレ過程混合モデルへの変分推論適用について (On applying variational inference to Dirichlet process mixture models)
– http://breakbee.hatenablog.jp/entry/2013/11/30/222553
● Introduction to Nonparametric Bayesian Models (Ueda and Yamada 2007)
– http://www.kecl.ntt.co.jp/as/members/yamada/dpm_ueda_yamada2007.pdf
● Mi manca qualche giovedi`? 階層ディリクレ過程を実装してみる (1) HDP-LDA と LDA のモデルを比較 (Implementing the hierarchical Dirichlet process (1): comparing the HDP-LDA and LDA models)
– http://d.hatena.ne.jp/n_shuyo/20110608/hdplda
● A unified theory of biogeography and relative species abundance and its application to tropical rain forests and coral reefs
– http://www3.botany.ubc.ca/vellend/COM_ECOL/Hubbell_CoralReefs97.pdf
