Gensim

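Dimensionality reduction
The document-term matrix becomes a huge sparse matrix that is hard to handle!
We want to treat "ねこ" and "にゃんこ" (synonyms, both meaning "cat") as the same word!
We want to treat "田中" (Tanaka) the personal name and "田中" the place name (a polysemous word) as different things!
⇨ Use dimensionality reduction (e.g. clustering, topic models)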
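To make the "huge sparse matrix" problem concrete, the following is a minimal sketch in plain Python that builds a document-term matrix for a toy corpus and measures how many of its cells are zero. The corpus and all variable names are hypothetical and chosen only for illustration; real corpora have far larger vocabularies and are correspondingly sparser.

```python
# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["human", "machine", "interface"],
    ["graph", "minors", "survey"],
    ["user", "interface", "system"],
]

# The vocabulary defines the columns of the document-term matrix.
vocab = sorted({word for doc in docs for word in doc})

# Dense document-term matrix of raw term counts (one row per document).
matrix = [[doc.count(word) for word in vocab] for doc in docs]

total_cells = len(docs) * len(vocab)
nonzero_cells = sum(1 for row in matrix for count in row if count > 0)
sparsity = 1 - nonzero_cells / total_cells

print(len(vocab), nonzero_cells, total_cells)
print(f"fraction of zero cells: {sparsity:.3f}")
```

Even with three short documents, most cells are zero. Dimensionality reduction replaces the vocabulary-sized columns with a small number of dense dimensions such as topics.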

Gensim
A library that makes it easy to use topic models (LSA, LDA) and deep learning (word2vec) [2][3]
The tutorial on the official site is somewhat hard to follow
For usage, [4] and [1] go into detail
Figure: Mentioned by the author :)

Figure: An example of a processing pipeline using Gensim
(Flow shown in the figure)
documents → morphological analysis → texts (e.g. ['system', 'and', 'human'])
texts → dic = corpora.Dictionary() → dictionary (e.g. {'and': 19, 'minors': 37, ...}); saved with dic.save() as dict.dic
texts → dic.doc2bow() (maps dictionary ids to tf values) → corpus (e.g. [(10, 2), (19, 1), (3, 1), ...]); saved with MmCorpus.serialize() as corpus.mm
corpus → models (tf-idf, LSA, LDA, HDP, RP, log entropy, word2vec) → model; saved with model.save() as lda.model
dic.load(), MmCorpus(), model.load() → similarities (document similarity) and topic extraction with model.show_topics()

Step0. documents
Prepare the original documents as a list
# Original documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Step1. Morphological analysis
def parse(doc):
    # For Japanese text, run morphological analysis here
    # Remove stop words
    stoplist = set('for a of the and to in'.split())
    text = [word for word in doc.lower().split() if word not in stoplist]
    return text

texts = [parse(doc) for doc in documents]
print(texts)
# [
#   ['human', 'machine', 'interface', ...],
#   ['survey', 'user', 'opinion', ...],
#   ...]

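Step2. Create the dictionary
from gensim import corpora

dic = corpora.Dictionary(texts)
# Building the dictionary takes time on huge data sets, so save it.
dic.save('dict.dic')
# Load it again with: dic = corpora.Dictionary.load('dict.dic')

print(dic.token2id)
# e.g. {'and': 19, 'minors': 37, 'generation': 28, ...}
# (the exact ids depend on the gensim version and on which tokens are kept in Step1)
print(dic[19])
# Prints the token with id 19 ('and' in the sample output above)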

Step3. Create the corpus
from gensim import corpora, similarities

# Convert a document using the dictionary created above
new_doc = "Human computer interaction"
new_vec = dic.doc2bow(parse(new_doc))
print(new_vec)
# "interaction" is not in the dictionary, so it is ignored
# e.g. [(2, 1), (4, 1)]

# In the same way, build the corpus (document-term matrix) for the original document set
# Here the document-term matrix simply holds raw tf values
corpus = [dic.doc2bow(text) for text in texts]
print(corpus)
# Save the corpus in Matrix Market format. Other formats work as well.
corpora.MmCorpus.serialize('corpus.mm', corpus)
# To load the saved corpus:
# corpus = corpora.MmCorpus('corpus.mm')

# Measure similarity with the corpus just created
index = similarities.docsim.SparseMatrixSimilarity(corpus, num_features=len(dic))
# Express the query as a feature vector
query = [(0, 1), (4, 1)]
# Print the 10 documents most similar to the query
print(sorted(enumerate(index[query]), reverse=True, key=lambda x: x[1])[:10])
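To show what the bag-of-words conversion and the similarity index compute, here is a plain-Python sketch of both ideas: counting known tokens into `(id, count)` pairs, and cosine similarity between two such sparse vectors. The helpers `to_bow` and `cosine` and the toy `token2id` mapping are hypothetical stand-ins written for this illustration; they are not gensim functions.

```python
import math
from collections import Counter


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse (id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(term_id, 0.0) for term_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy vocabulary; a real one would come from corpora.Dictionary.
token2id = {"human": 0, "computer": 1, "system": 2, "graph": 3}

doc_a = to_bow(["human", "computer", "system"], token2id)
doc_b = to_bow(["graph", "system", "system"], token2id)
# "interaction" is not in the vocabulary, so it is dropped.
query = to_bow(["human", "computer", "interaction"], token2id)

print(query)
print(cosine(query, doc_a), cosine(query, doc_b))
```

The query shares two terms with `doc_a` and none with `doc_b`, so only `doc_a` gets a non-zero score.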

Step4. Apply a model (tf-idf)
from gensim import models

m = models.TfidfModel(corpus)
# Build a document-term matrix of tf-idf values
# m[corpus[0]] is the feature vector of document 0
corpus = m[corpus]
# m[corpus] can be used as a corpus again
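As a rough illustration of the weighting, the sketch below computes tf-idf by hand in plain Python. It assumes the scheme tf × log2(N / df) followed by scaling each document vector to unit length, which is my understanding of the `TfidfModel` defaults; check the gensim documentation before relying on exact values. The corpus `bow_corpus` and the helper `tfidf` are hypothetical.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (token_id, tf) pairs.
bow_corpus = [
    [(0, 1), (1, 1)],
    [(0, 2), (2, 1)],
    [(2, 1), (3, 1)],
]

num_docs = len(bow_corpus)

# Document frequency: the number of documents that contain each token id.
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf(doc):
    """Weight each term by tf * log2(N / df), then scale the vector to unit length."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else weights


for doc in bow_corpus:
    print(tfidf(doc))
```

In the first document, token 1 appears in only one document while token 0 appears in two, so token 1 receives the larger weight.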
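Step5. Apply a topic model
# Is a topic count of around 200-500 typical?
m = models.LdaModel(corpus, id2word=dic, num_topics=3)
m.save('lda.model')
# Each tuple in m[corpus[i]] gives the probability P(t_j | d_i) that document i belongs to topic j

# Show the topics obtained and their components
for n in range(m.num_topics):
    # show_topic(n) returns the top terms of topic n with their weights;
    # m.show_topics(formatted=True) instead returns each topic as a linear-combination string
    print(n, m.show_topic(n))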

Output topics
topic1 = 0.097 ∗ system + 0.068 ∗ eps + 0.055 ∗ human + 0.054 ∗ interface
+ 0.040 ∗ trees + 0.040 ∗ user + 0.039 ∗ engineering
+ 0.039 ∗ management + 0.039 ∗ testing + 0.039 ∗ binary
topic2 = 0.077 ∗ graph + 0.074 ∗ trees + 0.046 ∗ minors + 0.043 ∗ response
+ 0.043 ∗ ordering + 0.043 ∗ well + 0.043 ∗ iv + 0.043 ∗ quasi
+ 0.043 ∗ widths + 0.042 ∗ user
topic3 = 0.081 ∗ computer + 0.060 ∗ user + 0.060 ∗ system + 0.060 ∗ survey
+ 0.059 ∗ time + 0.058 ∗ response + 0.058 ∗ opinion + 0.038 ∗ lab
+ 0.037 ∗ abc + 0.037 ∗ machine

References
[1] Python 用のトピックモデルのライブラリ gensim の使い方 (主に日本語のテキストの読み込み) - 唯物是真 @Scaled_Wurm [How to use gensim, a topic model library for Python (mainly loading Japanese text)]. url: http://sucrose.hatenablog.com/entry/2013/10/29/001041.
[2] Radim Řehůřek. gensim: Topic modelling for humans. url: http://radimrehurek.com/gensim.
[3] Radim Řehůřek. "Software Framework for Topic Modelling with Large Corpora". In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. 2010, pp. 45–50. url: http://www.muni.cz/research/publications/884893.
[4] 高橋侑久. LSI や LDA を手軽に試せる Gensim を使った自然言語処理入門 - SELECT * FROM life; [An introduction to natural language processing with Gensim, which makes it easy to try LSI and LDA]. url: http://yuku-tech.hatenablog.com/entry/20110623/1308810518.
