This material focuses mainly on BERT; compared with BERT, the XLNet and RoBERTa content is not covered in as much detail.
Also, note that my own figures flow from top to bottom, while the figures taken from other sources flow from bottom to top.
If you find any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread the English of the RoBERTa paper. Sorry for the excuse.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
CluBERT: A Cluster-Based Approach for Learning Sense Distributions in Multiple Languages (presented by 禎晃 山崎)
Word Sense Disambiguation, BERT, clustering
So I read this paper.
Correction for p. 7: it should say that solid is a hypernym of glass and glassware is a hyponym of glass...
Get To The Point: Summarization with Pointer-Generator Networks (ACL'17, paper introduction by Masayoshi Kondo)
A research paper on the neural text summarization task, accepted as an ACL'17 long paper; joint work between a PhD student in Christopher Manning's lab at Stanford and Google Brain. For long (multi-sentence) inputs, it introduces a mechanism into the model that avoids repetition during generation, making summarization of long documents possible. Slides for a paper introduction at our seminar. Paper URL: https://arxiv.org/abs/1704.04368
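The repetition-avoiding mechanism mentioned above is the paper's coverage mechanism; as a brief sketch of the idea (notation follows See et al.; this summary itself is not from the slides):

```latex
% Coverage mechanism from "Get To The Point" (See et al., ACL'17):
% the coverage vector accumulates the attention distributions of all
% previous decoder steps,
\[ c^{t} = \sum_{t'=0}^{t-1} a^{t'} \]
% and a coverage loss penalizes attending again to source tokens that
% have already been covered; it is added, with a weight, to the NLL loss.
\[ \mathrm{covloss}_{t} = \sum_{i} \min\bigl(a_{i}^{t}, \, c_{i}^{t}\bigr) \]
```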
Soft Actor-Critic is an off-policy maximum entropy deep reinforcement learning algorithm with a stochastic actor. It was presented in a 2018 ICML paper by researchers from UC Berkeley. Soft Actor-Critic extends the actor-critic framework by adding an entropy term to the objective to encourage exploration. This allows the agent to learn stochastic policies that operate effectively in environments with complex, sparse rewards. The algorithm was shown to learn robust policies on continuous control tasks, using deep neural networks to approximate the policy and the action-value functions.
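For reference, the entropy-regularized objective that Soft Actor-Critic maximizes can be written as follows (this is the standard maximum-entropy RL formulation, not taken from the text above; alpha is the temperature weighting the entropy bonus):

```latex
% Maximum-entropy objective optimized by Soft Actor-Critic:
% expected return plus an entropy bonus at every visited state,
% weighted by the temperature parameter \alpha.
\[
  J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}
           \Bigl[ r(s_t, a_t) + \alpha \, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr]
\]
```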
2. Overall outline
Starting from knowing nothing about natural language processing, we trace how word2vec works and the developments that followed it.
Linguistic Regularities in Continuous Space Word Representations
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
(word2vec Parameter Learning Explained)
(word2vec Explained: Deriving Mikolov et al.'s Negative Sampling Word-Embedding Method)
5. Linguistic Regularities in Continuous Space Word Representations
6. Linguistic Regularities in Continuous Space Word Representations
These distributed representations nicely capture the syntactic and semantic structure of language.
Syntactic structure: apple − apples ≃ car − cars
Semantic structure: woman − man ≃ queen − king
Figure 3: Distributed representations (taken from https://blog.acolyer.org)
7. Linguistic Regularities in Continuous Space Word Representations
To test for syntactic and semantic structure, prepare a test set of analogies a : b, c : d where d is the word to be predicted; the word whose vector is closest in cosine distance to b − a + c is taken as the answer, and accuracy is measured.
Figure 4: Syntactic test set
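As an illustration of this evaluation, here is a minimal sketch in Python/NumPy (the vocabulary and embedding matrix are made-up placeholders; any trained word2vec vectors could be substituted):

```python
import numpy as np

# Toy vocabulary and embedding matrix (placeholders; in practice these come
# from trained word2vec vectors). Rows are L2-normalized word vectors.
vocab = ["man", "woman", "king", "queen", "apple", "apples", "car", "cars"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(vocab)}

def analogy(a: str, b: str, c: str) -> str:
    """Answer 'a : b = c : ?' by finding the word whose vector is closest
    (by cosine similarity) to b - a + c, excluding the query words."""
    target = emb[index[b]] - emb[index[a]] + emb[index[c]]
    target /= np.linalg.norm(target)
    sims = emb @ target  # cosine similarity, since rows are unit-norm
    for w in (a, b, c):
        sims[index[w]] = -np.inf  # never return the query words themselves
    return vocab[int(np.argmax(sims))]

# With real vectors this should return "queen"; with the random toy
# embeddings above the answer is meaningless and only shows the mechanics.
print(analogy("man", "woman", "king"))
```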
8. Linguistic Regularities in Continuous Space Word Representations
9. Efficient Estimation of Word Representations in Vector Space
19. Efficient Estimation of Word Representations in Vector Space
"We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."
Figure 10: Results for CBOW and Skip-gram
20. Efficient Estimation of Word Representations in Vector Space
"We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."
Figure 11: Computation time
21. Efficient Estimation of Word Representations in Vector Space
"The training speed is significantly higher than reported earlier in this paper, i.e. it is in the order of billions of words per hour for typical hyperparameter choices. We also published more than 1.4 million vectors that represent named entities, trained on more than 100 billion words. Some of our follow-up work will be published in an upcoming NIPS 2013 paper."
22. Distributed Representations of Words and Phrases and their Compositionality
Negative Sampling
Subsampling
Learning Phrases
24. Distributed Representations of Words and Phrases and their Compositionality
Subsampling
Frequent words (stop words such as "in", "the", "a", etc.) carry little information, so before word2vec is trained, each occurrence of a word $w_i$ in the corpus is discarded with probability
\[ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \]
where $f(w_i)$ is the word's frequency. The threshold $t$ here is chosen heuristically (around $10^{-5}$ is typical).
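A minimal sketch of this subsampling step in Python (the corpus, the seed, and the value of t below are placeholders; f(w) is the relative frequency of w in the corpus):

```python
import random
from collections import Counter

def subsample(corpus: list[str], t: float = 1e-5, seed: int = 0) -> list[str]:
    """Drop each occurrence of word w with probability P(w) = 1 - sqrt(t / f(w)),
    where f(w) is w's relative frequency in the corpus."""
    rng = random.Random(seed)
    counts = Counter(corpus)
    total = len(corpus)
    kept = []
    for w in corpus:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)  # rare words are never discarded
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Toy usage: frequent words like "the" are aggressively thinned out.
toy_corpus = ["the", "cat", "the", "dog", "the", "the", "bird"] * 100
print(len(toy_corpus), len(subsample(toy_corpus, t=1e-3)))
```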
25. Distributed Representations of Words and Phrases and their Compositionality
Learning Phrases
Using the counts of single words (unigrams) and of two consecutive words (bigrams), compute the score below; word pairs whose score exceeds a threshold are added to the vocabulary as new words. This is repeated for several passes while lowering the threshold.
\[ \mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)} \]
Figure 12: Result of learning phrases
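A minimal sketch of one pass of this phrase-scoring step in Python (the delta, threshold, and merging strategy below are placeholder choices for illustration):

```python
from collections import Counter

def find_phrases(tokens: list[str], delta: float = 5.0, threshold: float = 1e-4) -> list[str]:
    """One pass of phrase learning: score adjacent word pairs with
    score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))
    and merge pairs whose score exceeds the threshold into a single token."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def score(wi: str, wj: str) -> float:
        return (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])

    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and score(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + "_" + tokens[i + 1])  # e.g. "new_york"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Repeating this pass a few times while lowering the threshold lets longer
# phrases (e.g. "new_york_times") form from already-merged tokens.
```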
26. Distributed Representations of Words and Phrases and their Compositionality
The results of this paper were released as open source at https://code.google.com/p/word2vec, and that project's name is word2vec¹.
¹ Which is where the title of this talk comes from.
27. +𝛼
How the tree for Hierarchical Softmax is built (A Scalable Hierarchical Distributed Language Model)
Poincare Embeddings (Poincaré Embeddings for Learning Hierarchical Representations)
doc2vec (Distributed Representations of Sentences and Documents)
28. References
The amazing power of word vectors | the morning paper
Hierarchical Softmax – Building Babylon
How does sub-sampling of frequent words work in the context of Word2Vec? - Quora
Approximating the Softmax for Learning Word Embeddings
A gentle introduction to Doc2Vec – ScaleAbout – Medium
異空間への埋め込み!Poincare Embeddingsが拓く表現学習の新展開 (Embedding into a different kind of space! New developments in representation learning opened up by Poincaré Embeddings) - ABEJA Arts Blog
Neural Network Methods for Natural Language Processing
mt_caret (kml輪講), word2vec + 𝛼, 2018-05-25