[ACL 2018 Reading Group Slides] Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

Sharp Nearby, Fuzzy Far Away:
How Neural Language Models Use Context
Urvashi Khandelwal, He He, Peng Qi, Dan Jurafsky
(Stanford University)
Hayahide Yamagishi (second-year master's student) @ ACL 2018 Reading Group

Introduction
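● Compared with n-gram language models, neural language models (NLMs) are said to be able to use long-range context
● The paper runs ablation tests to check whether long-range context is actually captured
● It also investigates how the Neural Cache Model affects the LM
Why I read this paper
● I wanted insights about context
● I was tired of “We propose a novel architecture …”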

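Language model review and an example input
● Compute the probability of each word given the words before it
● Compute the negative log-likelihood
● Evaluate with perplexity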
... the company reported a loss after
taxation and minority interests of NUM
million irish borrowings under the
short-term parts of a credit agreement
</s> berlitz which is based in
princeton n.j. provides language
instruction and translation services
through more than NUM language centers
in NUM countries </s> in the past five
years more sim has set a fresh target
of $ NUM a share by the end of </s>
reaching that goal says robert t. UNK
applied 's chief financial officer than
NUM NUM of its sales have been outside
the u.s. </s> macmillan has owned
berlitz since NUM </s> in the first six
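The three quantities reviewed on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's code: the per-token probabilities are hypothetical values standing in for the output of a language model, and the function names are my own.

```python
import math

def negative_log_likelihood(token_probs):
    """Sum of -log P(w_t | w_1 .. w_{t-1}) over the target tokens."""
    return -sum(math.log(p) for p in token_probs)

def perplexity(token_probs):
    """exp of the average per-token negative log-likelihood."""
    return math.exp(negative_log_likelihood(token_probs) / len(token_probs))

# Hypothetical probabilities a model assigned to four target tokens.
probs = [0.25, 0.5, 0.125, 0.25]

# The geometric mean of 1/p over these tokens is 4, so this is approximately 4.
print(perplexity(probs))
```

Lower perplexity means the model assigned higher probability to the observed words, which is why the later slides report changes in loss or perplexity.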

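Experimental setup
● Corpora: Penn Treebank (PTB) and WikiText-2
● The model is an ordinary NLM
○ Dropout is also applied along the time dimension
○ Three models are trained with different random seeds
→ the average is reported
● During training, all text preceding the target sentence is used as context
● Evaluation is on the dev set (the authors apparently felt uneasy about probing the characteristics of the test set)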

How much context is used?
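● Experiment 1: how many words can an LSTM remember?
○ δ-function: specifies how the test data is modified
● Investigate the effective context size
○ The context length at which perplexity converges (to within about 1% of the perplexity with the full context)
● Evaluation is by the rate of change in loss or perplexity (the same applies to everything that follows)
○ When n words are deleted, the loss is measured over the remaining (context length − n) words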
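As a rough sketch of how an effective context size could be read off from such measurements, the code below takes perplexities measured at several context lengths and returns the shortest length that is within 1% of the full-context perplexity. The helper names, the truncation-style perturbation, and the numbers are all hypothetical; they are assumptions for illustration, not values or code from the paper.

```python
def truncate_context(context, n):
    """Assumed delta-style perturbation: keep only the n most recent tokens."""
    return context[-n:] if n > 0 else []

def effective_context_size(ppl_by_length, tolerance=0.01):
    """Smallest context length whose perplexity is within `tolerance`
    (relative) of the perplexity obtained with the longest context."""
    full_length = max(ppl_by_length)
    full_ppl = ppl_by_length[full_length]
    for length in sorted(ppl_by_length):
        if ppl_by_length[length] <= full_ppl * (1 + tolerance):
            return length
    return full_length

# Hypothetical dev-set perplexities keyed by context length in words.
measurements = {5: 120.0, 20: 95.0, 50: 85.0, 150: 80.5, 300: 80.0}
print(effective_context_size(measurements))
```

With these made-up numbers the threshold is 80.0 × 1.01 = 80.8, so the first length that qualifies is 150.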

Result 1: context length and hyperparameters (right: PTB)
● The limit is around 150 words on PTB and 250 words on Wiki
● Hyperparameters affect performance but are unrelated to memory capacity

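Result 2: loss by word class (right: Wiki)
● Infrequent words (800 or fewer occurrences in the training set) need long-range context
● Function words (prepositions and articles) need only the surrounding words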

Nearby vs. long-range context
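● An LSTM can remember roughly 200 words
→ does the behavior depend on position within the context?
● Perturb a region in the middle of the context (the region is specified by span = (s1, s2])
○ ρ is either shuffle or reverse
● The context length is fixed at 300 words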
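The span perturbation can be sketched as follows. I am assuming that s1 and s2 are distances counted backwards from the target word, so that the span (s1, s2] covers the tokens that are more than s1 and at most s2 positions before the target; the function name `perturb_span` is hypothetical.

```python
import random

def perturb_span(context, s1, s2, mode, rng=None):
    """Shuffle or reverse the context tokens whose distance d from the
    target satisfies s1 < d <= s2. context[-1] is at distance 1."""
    n = len(context)
    if not 0 <= s1 < s2 <= n:
        raise ValueError("need 0 <= s1 < s2 <= len(context)")
    start, end = n - s2, n - s1
    segment = context[start:end]
    if mode == "reverse":
        segment = segment[::-1]
    elif mode == "shuffle":
        segment = segment[:]
        (rng or random).shuffle(segment)
    else:
        raise ValueError("mode must be 'shuffle' or 'reverse'")
    return context[:start] + segment + context[end:]

tokens = list("abcdefgh")
# Distances 3..5 from the target correspond to the tokens d, e, f.
print(perturb_span(tokens, 2, 5, "reverse"))
```

Everything outside the span is left untouched, so any change in loss can be attributed to word order inside that region.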

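Result 3 (right: Wiki)
a. When s2 = s1 + 20: word order matters for nearby context
b. When s2 = n: for distant context, what matters is that the words appeared;
replacing it with a different word sequence (with well-formed word order) hurts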

Types of words and the region of context
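● If “the fact that a word appeared” is what matters, are function words unnecessary?
● fPOS(y, span): removes the words in span whose POS is y
● An experiment deleting the same number of words at random was also run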
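A POS-based deletion and its random baseline might look like the sketch below. The context is assumed to be a list of (word, tag) pairs, the span is again assumed to be measured as distance from the target, and the tag names are illustrative Penn Treebank style labels; none of these details are taken from the paper's code.

```python
import random

def drop_pos(tagged_context, pos_tags, s1, s2):
    """Remove tokens whose tag is in pos_tags and whose distance d from
    the target satisfies s1 < d <= s2."""
    n = len(tagged_context)
    return [
        (word, tag)
        for i, (word, tag) in enumerate(tagged_context)
        if not (s1 < n - i <= s2 and tag in pos_tags)
    ]

def drop_random(tagged_context, count, s1, s2, rng):
    """Baseline: remove `count` randomly chosen tokens from the same span."""
    n = len(tagged_context)
    in_span = [i for i in range(n) if s1 < n - i <= s2]
    removed = set(rng.sample(in_span, count))
    return [tok for i, tok in enumerate(tagged_context) if i not in removed]

tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VB"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
# Drop articles and prepositions among the 4 tokens nearest the target.
print(drop_pos(tagged, {"DT", "IN"}, 0, 4))
```

Removing the same number of random tokens from the same span separates the effect of the word class from the effect of simply shortening the context.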

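Result 4: removing function words / content words (left: PTB, right: Wiki)
● Nearby content words are absolutely necessary
● Beyond roughly 20 words away, function words have little effect
● Are distant words remembered only as a rough sense of their meaning?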

Can LSTMs copy words without caches?
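● Neural language generation uses copy mechanisms
○ Attention, CopyNet, caches, and so on
● “If it can remember 200 words, isn't a copy mechanism unnecessary?”
The experiments are split into the following cases
● Context distance: “nearby” ≤ 50 < “long-range”
● Where the word that would be the answer if copied is located → this occurrence is deleted
○ Cnear: it is in “nearby”
○ Cfar: it is in “long-range”
○ Cnone: it is nowhere in the context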
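The case split above can be sketched as a small classifier over target words. I am assuming that a word appearing in both regions counts as Cnear, and that "deleting" means dropping every occurrence of the target from the context; both are my reading of the slide, and the function names are hypothetical.

```python
def copy_class(context, target, nearby=50):
    """Classify a target word by where a copyable occurrence sits.
    `nearby` must be a positive number of tokens."""
    near = context[-nearby:]   # the `nearby` most recent tokens
    far = context[:-nearby]    # everything older than that
    if target in near:
        return "Cnear"
    if target in far:
        return "Cfar"
    return "Cnone"

def remove_target(context, target):
    """Drop all occurrences of the target word from the context."""
    return [w for w in context if w != target]

# Toy context: "berlitz" is 72 tokens back, "macmillan" is 11 tokens back.
context = ["berlitz"] + ["x"] * 60 + ["macmillan"] + ["y"] * 10
print(copy_class(context, "macmillan"), copy_class(context, "berlitz"))
```

Comparing the loss before and after `remove_target` within each class is what shows whether the model was relying on the earlier occurrence.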

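Result 5: removing C (left: PTB, right: Wiki)
● Removing Cfar does not hurt that much → is a rough meaning being learned?
● Cnear must not be removed → does the model have the ability to copy nearby words?
● Removing long-range context hurts performance on Cnone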

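Result 6: replacing instead of removing (left: PTB, right: Wiki)
● “Similar”: a word with comparable frequency and the same POS
● Nearby, having the same surface form matters
● Far away, even with Cfar removed, can the word be predicted through something like the distributional hypothesis?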

How does the cache help?
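● Neural Cache Model [Grave+, ICLR2017]
○ hi are the hidden states up to that point
○ Pcache is computed for each word, and an interpolation of Plm and Pcache is used as the generation probability
● More than 300 words of context are used (the average document length)
○ PTB: 500 words
○ Wiki: 3875 words
● Evaluated by the rate of increase in the NLM's PPL relative to the cache model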
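A minimal sketch of the cache idea follows, assuming the formulation I recall from the Neural Cache Model paper: each stored hidden state votes for the word that followed it, weighted by exp(θ · h_t·h_i), and the result is linearly interpolated with the LM distribution using a weight λ. The vectors, θ, and λ below are toy values chosen for illustration.

```python
import math

def cache_distribution(hidden_states, next_words, query, theta):
    """P_cache(w) proportional to the sum over stored positions i with
    next_words[i] == w of exp(theta * dot(query, hidden_states[i]))."""
    scores = {}
    for h_i, word in zip(hidden_states, next_words):
        dot = sum(a * b for a, b in zip(query, h_i))
        scores[word] = scores.get(word, 0.0) + math.exp(theta * dot)
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

def interpolate(p_lm, p_cache, lam):
    """(1 - lam) * P_lm + lam * P_cache over the union of both vocabularies."""
    vocab = set(p_lm) | set(p_cache)
    return {w: (1 - lam) * p_lm.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in vocab}

# Toy cache: two past hidden states and the words that followed them.
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
next_words = ["berlitz", "macmillan"]
p_cache = cache_distribution(hidden_states, next_words, [1.0, 0.0], theta=1.0)
p_lm = {"berlitz": 0.1, "macmillan": 0.1, "the": 0.8}
mixed = interpolate(p_lm, p_cache, lam=0.5)
print(mixed)
```

Because the cache can only place mass on words that were stored, it raises the probability of words already seen in the context and takes mass away from words that were not.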

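Result 7: effect of the cache (left: PTB, right: Wiki)
● Words that appeared in the context seem to be copied
● It is not suited to producing words that did not appear in the context
● The LSTM and the cache are good at different things → do they complement each other?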

If a word has already appeared, it can be generated

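Summary
● The paper examined the behavior of LSTM-based neural language models
● Findings
○ An LSTM can remember roughly 200 words
○ Hyperparameters change performance but do not affect memory capacity
○ For nearby words, order matters; for distant words, presence matters
○ With a cache, distant words become usable
● “These results may be data-driven, so they need to be replicated”
○ “We did try to get some diversity in the data by using both PTB and Wiki”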

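Impressions
● Easy to read, modest in tone, and backed by earnest experiments; it leaves a good impression
● Results would probably differ for languages with free word order
● The training setup is a little unusual
○ Normally, models are not trained in a setting where unlimited context is available
○ Results would probably differ between a setting where long-range context appears and one where only a single sentence arrives
● Is it just that there is no need to remember more than 200 words, or that the model cannot?
○ The average sentence length is 20 words, so that is about 10 sentences
○ In principle an LSTM should be able to remember everything...
● I would have liked data such as how many words were deleted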
