This material focuses mainly on BERT; compared with BERT, the XLNet and RoBERTa content is not covered in as much detail.
Also, note that my own figures flow from top to bottom, while the figures taken from other sources flow from bottom to top.
If you find any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread the English of the RoBERTa paper. Sorry for the excuse.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
CluBERT: A Cluster-Based Approach for Learning Sense Distributions in Multiple Languages (presented by 禎晃 山崎)
Word Sense Disambiguation, BERT, clustering
So I read this paper.
Correction for p. 7: it should say that solid is a hypernym of glass and glassware is a hyponym of glass...
Get To The Point: Summarization with Pointer-Generator Networks (ACL'17, paper introduction by Masayoshi Kondo)
A research paper on the neural text summarization task, accepted as an ACL'17 long paper; joint work between a PhD student in Christopher Manning's lab at Stanford and Google Brain. For long (multi-sentence) inputs, it introduces a mechanism into the model that avoids repetition during generation, making summarization of long documents possible. Slides for a paper introduction at our seminar. Paper URL: https://arxiv.org/abs/1704.04368
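The repetition-avoiding mechanism mentioned above is the paper's coverage mechanism; as a brief sketch of the idea (notation follows See et al.; this summary itself is not from the slides):

```latex
% Coverage mechanism from "Get To The Point" (See et al., ACL'17):
% the coverage vector accumulates the attention distributions of all
% previous decoder steps,
\[ c^{t} = \sum_{t'=0}^{t-1} a^{t'} \]
% and a coverage loss penalizes attending again to source tokens that
% have already been covered; it is added, with a weight, to the NLL loss.
\[ \mathrm{covloss}_{t} = \sum_{i} \min\bigl(a_{i}^{t}, \, c_{i}^{t}\bigr) \]
```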
Soft Actor-Critic is an off-policy maximum entropy deep reinforcement learning algorithm with a stochastic actor. It was presented in a 2018 ICML paper by researchers from UC Berkeley. Soft Actor-Critic extends the actor-critic framework by adding an entropy term to the objective to encourage exploration. This allows the agent to learn stochastic policies that operate effectively in environments with complex, sparse rewards. The algorithm was shown to learn robust policies on continuous control tasks, using deep neural networks to approximate the policy and the action-value functions.
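For reference, the entropy-regularized objective that Soft Actor-Critic maximizes can be written as follows (this is the standard maximum-entropy RL formulation, not taken from the text above; alpha is the temperature weighting the entropy bonus):

```latex
% Maximum-entropy objective optimized by Soft Actor-Critic:
% expected return plus an entropy bonus at every visited state,
% weighted by the temperature parameter \alpha.
\[
  J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}
           \Bigl[ r(s_t, a_t) + \alpha \, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr]
\]
```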
2. Overall outline
Starting from knowing nothing about natural language processing, we trace how word2vec works and the developments that followed it.
Linguistic Regularities in Continuous Space Word Representations
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
(word2vec Parameter Learning Explained)
(word2vec Explained: Deriving Mikolov et al.'s Negative Sampling Word-Embedding Method)
5. Linguistic Regularities in Continuous Space Word Representations
6. Linguistic Regularities in Continuous Space Word Representations
These distributed representations nicely capture the syntactic and semantic structure of language.
Syntactic structure: apple − apples ≃ car − cars
Semantic structure: woman − man ≃ queen − king
Figure 3: Distributed representations (taken from https://blog.acolyer.org)
7. Linguistic Regularities in Continuous Space Word Representations
To test for syntactic and semantic structure, prepare a test set of analogies a : b, c : d where d is the word to be predicted; the word whose vector is closest in cosine distance to b − a + c is taken as the answer, and accuracy is measured.
Figure 4: Syntactic test set
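As an illustration of this evaluation, here is a minimal sketch in Python/NumPy (the vocabulary and embedding matrix are made-up placeholders; any trained word2vec vectors could be substituted):

```python
import numpy as np

# Toy vocabulary and embedding matrix (placeholders; in practice these come
# from trained word2vec vectors). Rows are L2-normalized word vectors.
vocab = ["man", "woman", "king", "queen", "apple", "apples", "car", "cars"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(vocab)}

def analogy(a: str, b: str, c: str) -> str:
    """Answer 'a : b = c : ?' by finding the word whose vector is closest
    (by cosine similarity) to b - a + c, excluding the query words."""
    target = emb[index[b]] - emb[index[a]] + emb[index[c]]
    target /= np.linalg.norm(target)
    sims = emb @ target  # cosine similarity, since rows are unit-norm
    for w in (a, b, c):
        sims[index[w]] = -np.inf  # never return the query words themselves
    return vocab[int(np.argmax(sims))]

# With real vectors this should return "queen"; with the random toy
# embeddings above the answer is meaningless and only shows the mechanics.
print(analogy("man", "woman", "king"))
```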
8. Linguistic Regularities in Continuous Space Word Representations
9. Efficient Estimation of Word Representations in Vector Space
19. Efficient Estimation of Word Representations in Vector Space
"We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."
Figure 10: Results for CBOW and Skip-gram
20. Efficient Estimation of Word Representations in Vector Space
"We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."
Figure 11: Computation time
21. Efficient Estimation of Word Representations in Vector Space
"The training speed is significantly higher than reported earlier in this paper, i.e. it is in the order of billions of words per hour for typical hyperparameter choices. We also published more than 1.4 million vectors that represent named entities, trained on more than 100 billion words. Some of our follow-up work will be published in an upcoming NIPS 2013 paper."
22. Distributed Representations of Words and Phrases and their Compositionality
Negative Sampling
Subsampling
Learning Phrases
24. Distributed Representations of Words and Phrases and their Compositionality
Subsampling
Frequent words (stop words such as "in", "the", "a", etc.) carry little information, so before word2vec is trained, each occurrence of a word $w_i$ in the corpus is discarded with probability
\[ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \]
where $f(w_i)$ is the word's frequency. The threshold $t$ here is chosen heuristically (around $10^{-5}$ is typical).
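A minimal sketch of this subsampling step in Python (the corpus, the seed, and the value of t below are placeholders; f(w) is the relative frequency of w in the corpus):

```python
import random
from collections import Counter

def subsample(corpus: list[str], t: float = 1e-5, seed: int = 0) -> list[str]:
    """Drop each occurrence of word w with probability P(w) = 1 - sqrt(t / f(w)),
    where f(w) is w's relative frequency in the corpus."""
    rng = random.Random(seed)
    counts = Counter(corpus)
    total = len(corpus)
    kept = []
    for w in corpus:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)  # rare words are never discarded
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Toy usage: frequent words like "the" are aggressively thinned out.
toy_corpus = ["the", "cat", "the", "dog", "the", "the", "bird"] * 100
print(len(toy_corpus), len(subsample(toy_corpus, t=1e-3)))
```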
25. Distributed Representations of Words and Phrases and their Compositionality
Learning Phrases
Using the counts of single words (unigrams) and of two consecutive words (bigrams), compute the score below; word pairs whose score exceeds a threshold are added to the vocabulary as new words. This is repeated for several passes while lowering the threshold.
\[ \mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)} \]
Figure 12: Result of learning phrases
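A minimal sketch of one pass of this phrase-scoring step in Python (the delta, threshold, and merging strategy below are placeholder choices for illustration):

```python
from collections import Counter

def find_phrases(tokens: list[str], delta: float = 5.0, threshold: float = 1e-4) -> list[str]:
    """One pass of phrase learning: score adjacent word pairs with
    score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))
    and merge pairs whose score exceeds the threshold into a single token."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def score(wi: str, wj: str) -> float:
        return (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])

    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and score(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + "_" + tokens[i + 1])  # e.g. "new_york"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Repeating this pass a few times while lowering the threshold lets longer
# phrases (e.g. "new_york_times") form from already-merged tokens.
```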
26. Distributed Representations of Words and Phrases and their Compositionality
The results of this paper were released as open source at https://code.google.com/p/word2vec, and that project's name is word2vec¹.
¹ Which is where the title of this talk comes from.
27. +𝛼
How the tree for Hierarchical Softmax is built (A Scalable Hierarchical Distributed Language Model)
Poincare Embeddings (Poincaré Embeddings for Learning Hierarchical Representations)
doc2vec (Distributed Representations of Sentences and Documents)
28. References
The amazing power of word vectors | the morning paper
Hierarchical Softmax – Building Babylon
How does sub-sampling of frequent words work in the context of Word2Vec? - Quora
Approximating the Softmax for Learning Word Embeddings
A gentle introduction to Doc2Vec – ScaleAbout – Medium
異空間への埋め込み!Poincare Embeddingsが拓く表現学習の新展開 (Embedding into a different kind of space! New developments in representation learning opened up by Poincaré Embeddings) - ABEJA Arts Blog
Neural Network Methods for Natural Language Processing
mt_caret (kml輪講), word2vec + 𝛼, 2018-05-25