Hybrid computing using a neural network with dynamic external memory

Deep neural networks
• Recently, deep neural networks have made a lot of progress in
complex pattern matching of various tasks such as the
computer vision and the natural language processing
• However, they are limited in their ability to represent data
structures such as graphs and trees, to use variables and to
handle representations over long sequences
2

Recurrent neural networks (RNNs)
• RNNs can capture the long dependency of sequential data and are
also known as Turing-Complete (Siegelmann and Sontag, 1995)
and therefore are capable of simulating arbitrary procedures, if
properly wired
• RNNs struggled with the vanishing gradients problem, however
(Bengio et al., 1994)
3

Long short-term memory (LSTM)
• The LSTM architecture solved the vanishing gradients problem by
taking gating mechanisms into RNN architectures and calculating
the gradients by element-wise multiplication with the gate value at
every time-step (Hochreiter and Schmidhuber, 1997)
• LSTM became quite successful and helped to outperform
traditional models because of the sequence to sequence model
(Sutskever et al., 2014) and attention mechanisms (Bahdanau et
al., 2014; Luong et al., 2015)
• Yet LSTM based models have not reached a real solution to the
problems mentioned as the limitations of deep neural networks
4

The von Neumann architecture
• In the von Neumann architecture (Neumann, 1945),
programs can be run by three fundamental mechanisms:
• a processing unit that performs simple arithmetic operations and logic ones
• a control unit that takes control flow instructions such as sequential
execution, conditional branch and loop
• a memory unit that data and instructions are written to and read from during
computation
• This architecture can represent complex data structures and learn
to perform algorithmic tasks
• It separates computation by a processor (i.e. a processing
unit and a control one) and memory, whereas neural networks
mix computation and memory into the network weights 5
Memory
Input Output
Control
Unit
Processing
Unit
Processor

Neural Turing Machine (NTM)
• To improve the performance of standard neural networks,
Graves et al. (2014) proposed a neural network model
with an addressable external memory called
Neural Turing Machine (NTM)
• The whole architecture of NTM is differentiable,
therefore can be trained end-to-end including
how to access to the memory
6
Controller
（RNN）
Input Output
Memory
Read
Head
Write
Head

Differentiable Neural Computer (DNC)
• Further, Graves et al. (2016) improved the memory access
mechanism and proposed DNC
• It solved algorithmic tasks over structured data such as
traversal and shortest-path tasks of a route map and an
inference task of a family tree
• In the experiment on question answering with premises, input
sequences were written to the memory and necessary
information to infer the answer was read from the memory,
hence representing variables
• DNC was also able to learn long sequences by the dynamic
memory access mechanism 7

DNCの全体図
(1) コントローラ(RNN)は入力データ 𝑥𝑡と前ステップ
でのメモリからの出力 𝑟𝑡−1 を合わせた
𝜒𝑡 = [𝑥𝑡, 𝑟𝑡−1 ]を入力として受け取り、隠れ
ベクトルℎ 𝑡 を出力
(2) ℎ 𝑡 を線形変換し、出力 𝜐𝑡 = 𝑊𝑦ℎ 𝑡 とメモリを
制御するためのパラメータを格納した “interface
vector” 𝜉𝑡 = 𝑊𝜉ℎ 𝑡 を得る（𝑊は重み行列）
(3) 𝜉𝑡をもとにメモリへのデータの書き込みが
行われ、メモリの状態を更新
(4) 最後に、メモリからの読み出しが行われ、“read
vector” 𝑟𝑡 により出力 𝑦𝑡 = 𝜐𝑡 + 𝑊𝑟 𝑟𝑡 を計算する
一方、𝑟𝑡を次ステップのRNNへの入力に追加
(1)~(4) が1ステップ分 8

メモリの構造
• メモリはN×Wの行列
• Nはメモリスロットの総数（アドレスの総数）、Wはスロット
の長さ（格納する数値ベクトルの次数）
• N, Wは固定
• メモリ状態は刻々と更新されていき、タイムステップ t での
メモリを表す行列を 𝑀𝑡と表記
9

コントローラ
• 各タイムステップ t につき，コントローラはデータセットから
入力 𝑥 𝑡 ∈ ℝ 𝑋 と前タイムステップのメモリ𝑀𝑡−1 ∈ ℝ 𝑁×𝑊から
読み出された R 個のベクトル𝑟𝑡−1
1
, … , 𝑟𝑡−1
𝑅
を受け取り， 𝑦𝑡 ∈ ℝ 𝑌
を出力
• そして現タイムステップのメモリの制御方法を定義した
interface vector 𝜉𝑡 を出力
• 表記を簡略化するために，入力と読み出しベクトルを結合した
ものを 𝜒𝑡 = [𝑥 𝑡; 𝑟𝑡−1
1
; … ; 𝑟𝑡−1
𝑅
] とする
10

コントローラ
• LSTM の変数は以下
• 𝑙はレイヤのインデックス
• 𝜎 𝑥 =
1
1+exp(−𝑥)
はロジスティックシグモイド関数
• すべての t に対してℎ 𝑡
0
= 𝟎 ，すべての𝑙についてℎ0
𝑙
= 𝑠 𝑜
𝑙 = 𝟎
• Wは重み行列，bはバイアス
11

コントローラ
• 各タイムステップにおいて、コントローラは以下のように定義される output vector 𝜐𝑡
と interface vector 𝜉𝑡 ∈ ℝ 𝑊×𝑅 +3𝑊+5𝑅+3 を出力
• コントローラが再帰的だとすると、その出力は現タイムステップまでの入力の完全な
履歴 (𝜒1, … , 𝜒𝑡)の関数であり、コントローラが行う操作は以下のようにまとめられる
• 𝜃 はネットワークの重み
• 最後に、read vector を結合したものと 𝑅𝑊 × 𝑌の重み行列 𝑊𝑟を掛け合わせたベクト
ルに 𝜐𝑡 を足し，output vector 𝑦𝑡 を得る
• このようにすることで直前に読み出された情報を用いて出力を調整することが可能12

Interface パラメータ
• メモリに対して制御を行う前に、interface vector 𝜉𝑡 は以下のように細かく分割
• 個々の要素は正しい定義域内に含まれるよう様々な関数により処理
• ロジスティックシグモイド関数は [0, 1] ，oneplus 関数は [1, ∞)に制約をかける
• Softmax 関数はベクトルを 𝑆 𝑁 に制限
• ^ 記号はさらにスケール変換をかけることを意味
• Rはメモリからの読み出し回数
• 1 ステップでメモリからの読み出しは複数回、書き込みは1回のみ行う 13

メモリの読み書き
• メモリの読み書きを行うには、読み書き対象となるメモリ
スロットのアドレスを指定する必要がある
• この操作を微分可能な計算で表現するには、どのアドレスを
重点的に読み出す/書き込むかの重み付けを行えば良い
• これらの重み、read/write weighting は要素の総和がほぼ 1 で
ある非負の数のベクトル
• read/write weighting をどのように求めるかが重要
• これらを求めてしまえば読み出し/書き込みの演算自体は単純 15

読み出し操作
• R個の read weighting {𝑤𝑡
𝑟,1
, … , 𝑤𝑡
𝑟,𝑅
∈ Δ 𝑁}によってアドレスの
内容の重み付き和を計算し， "read vector" {𝑟𝑡
1
, … , 𝑟𝑡
𝑅
}を以下の
ように定義
• read vector は次時刻のコントローラへの入力に追加される
16

書き込み操作
• write vector 𝑣 𝑡 ∈ ℝ 𝑊：書き込みたいデータのベクトル
• erase vector 𝑒𝑡 ∈ [0,1] 𝑊：スロット内のデータをどのような
パターンで消去するかを表したベクトル
• write weighting 𝑤𝑡
𝑤
∈ Δ 𝑁により， 𝑣 𝑡と𝑒𝑡を接続し，以下のように
メモリ行列 𝑀𝑡−1 に対して消去・書き込みを行い更新
∘ は要素ごとの積，E はすべて 1 の N×W 行列
17

メモリの番地付け
① write weighting の更新
② read weighting の更新
18

① write weightingの更新
書き込み先のアドレスの選択には2つの方法があり、
write weighting は2つの要素から構成される：
1. Content-based addressing: interface vector を通して入力された
“key” をもとに書き込み先スロットを選択
2. Dynamic memory allocation: 前タイムステップまでの読み出し
状況をもとに使用済みのスロットを選択
19

1. Content-based addressing
• write key 𝑘 𝑡
𝑤
∈ ℝ 𝑊
とメモリM ∈ ℝ 𝑁×𝑊
の各スロットに格納されているデータを
照合し類似度を計算
• “write strength” 𝛽𝑡
𝑤
∈ [1, ∞)はweighting のピークの鋭さ調整パラメータ
• β→∞の極限で𝐶は鋭いピークがひとつだけ立つ
• 𝐷はコサイン類似度
• weighting 𝐶(𝑀, 𝑘, 𝛽) ∈ 𝑆 𝑁はメモリアドレスに対する正規化確率分布を定義
• この方法による write content weighting は
20

2.1. retention vector 𝜓 𝑡の構成
• コントローラが必要に応じてメモリを開放・確保できるように使用可能なメモリ
スロットのリストを構築
• 前タイムステップで読み込みを行ったメモリスロットを開放して良いかどうかのフラグ
（実際には0~1の連続値なので重み）が "free gate" 𝑓𝑡
𝑖
(i=1,...,𝑅)
• 「 𝑓𝑡
𝑖
𝑤𝑡−1
𝑟,𝑖
≃1 ⇔ 𝑓𝑡
𝑖
≃1 かつ 𝑤𝑡−1
𝑟,𝑖
≃1」ならば読み込み済みかつ開放可なので上書き可
• 「 𝑓𝑡
𝑖
𝑤𝑡−1
𝑟,𝑖
≃0 ⇔ 𝑓𝑡
𝑖
≃0 または 𝑤𝑡−1
𝑟,𝑖
≃0」ならば前タイムステップで読み込まれていない、
または、読み込み済みだったとしても開放フラグが立っていないので上書き不可
• R回分の使用中（上書き不可）フラグである “memory retention vector” 𝜓 𝑡 ∈ [0,1] 𝑁 を
以下で定義：
• R回のうち1度でも上書き可になったメモリスロットは上書き可 𝜓 𝑡 ≃0となる
• 一方、使用中（上書き不可） 𝜓 𝑡 ≃1 となるのは、R回全てについて上書き不可だったとき
21

2.2. usage vector 𝑢 𝑡の構成
• 各メモリスロットの使用中度合いを表す重み “memory usage vector”
𝑢 𝑡 ∈ [0,1] 𝑁を以下の更新式で定義
∘は要素ごとの積
(·) 内第 3 項は𝑢 𝑡の各要素が 1 を超えないように調整するための補正項
• 𝑢 𝑡 = 1 に達すると書き込みによる重みの更新は停止し、もし 𝑢 𝑡 > 1 となっても
更新により値を減少させる
• 直感的には、すでに使用中若しくは書き込まれたばかりのどちらかであり、
かつ free gate により保持(𝜓 𝑡 𝑖 ≈ 1)されていればスロットは使用中となる
• スロットへのあらゆる書き込みはその使用度を上げ(最大で 1)、free gate が
用いられたときのみ使用度は下がる(最小で 0 ) 22

2.3. allocation weighting の構成
• 𝑢 𝑡の要素を値が小さい順に並べた時のメモリアドレスのインデックスのベクトル
を 𝜙 𝑡 ∈ ℤ 𝑁
とする（つまり、𝜙 𝑡[1]は最も使用度が低いアドレスのインデックス
となる）
• ここで， Δ 𝑁に含まれる重みベクトルは総和が 1 以下であり，つまり重みがない
ということがどのアドレスにもアクセスしない null 操作を明示的に担う
• 書き込みを行うための新しいアドレスを与える “allocation weighting” 𝑎 𝑡 ∈ Δ 𝑁を
以下で定義
• 1−𝑢 𝑡 なので使用済み（書き込み可）の度合いを表す重み
• すべての使用度が 1 であると， 𝑎 𝑡 = 0 となり，コントローラはスロットの開放が
行われない限りメモリを確保することができない 23

1.+2. Write weighting
• コントローラは新しく確保されたスロット、または内容によって選ばれた
スロットに書き込み、あるいは書き込みはまったく行わないということも可能
• allocation weighting 𝑎 𝑡 と write content weighting 𝑐𝑡
𝑊
∈ 𝑆 𝑁 を組み合わせて write
weighting 𝑤𝑡
𝑊
∈ Δ 𝑁は以下のように定義される
• allocation gate 𝑔𝑡
𝑎
∈ [0,1]: 使用済みフラグあるいはkeyのどちらに基づいて書き込み
を行うかを制御するゲート
• write gate 𝑔𝑡
𝑤
∈ [0,1]: そもそも書き込みを行うかを制御するゲート
• メモリを不必要な上書きから保護するのに用いられる
24

② read weightingの更新
読み出し先のアドレスの選択にも2つの方法がある：
1. Content-based addressing: interface vector を通して入力された
“key” をもとに読み込み先スロットを選択
2. Temporal memory linkage: 書き込み順に従ってスロットを選択
25

1. Content-based addressing
• write weighting のときと同様に、各読み出しヘッド i に対して
“read key” {𝑘 𝑡
𝑟,𝑖
}𝑖=1,2,..,𝑅 と “read strength” {𝛽𝑡
𝑟,𝑖
}𝑖=1,2,..,𝑅を用いて
類似度を計算
• この方法による read content weighting は以下
26

2.1. precedence weighting の構成
• 前述のメモリを確保する方法では、メモリスロットに対して
書き込みが行われた順番に関する情報は一切保存されていない
• しかし、このような情報を保持しておくことが有用な場合も
多くある
• 例えば、命令のシーケンスが順番通りに記録され、引き出されなけれ
ばならないときなど
• したがって、メモリスロットが連続的に上書きされていくのを
追跡するために temporal link matrix 𝐿 𝑡 ∈ [0,1] 𝑁×𝑁
を用いる
27

2.1. precedence weighting の構成
• 𝐿 𝑡 を定義するために、成分 𝑝𝑡[𝑖] がスロット i が最後に書き込まれたという
程度を表す precedence weighting 𝑝𝑡 ∈ Δ 𝑁を以下の更新式で定義：
• 𝑤𝑡
𝑤
は①で更新したwrite weighting
• “強く”書き込みが行われた場合は、Σ𝑤𝑡
𝑤
≃1 となるので前タイムステップの情
報 𝑝𝑡−1 は消去され、現タイムステップでどこに書き込まれたかが保存される
• 書き込みが行われなかった場合は、最後に行われた書き込みの重みが保持
され続ける
28

2.2. temporal link matrix の構成
• メモリスロットへの書き込み順を表す N×N 行列 𝐿 𝑡 を構成
• 𝐿 𝑡[𝑖, 𝑗]はメモリ 𝑀𝑡においてスロット i はスロット j の後に書き込まれたという程度を表す
• スロットが上書きされる度に，リンク行列はそのスロットに出入りする古いリンクを消去
するよう更新され、最後に書き込みが行われたスロットからの新しいリンクが追加される
• 𝐿 𝑡 の更新式を以下で与える
• i は現タイムステップ，j は前タイムステップ
• あるスロットからそれ自身への遷移をどのように表現するかは明確ではないため，自分自身
へのリンクは除外 (行列の対角成分は常に 0 )
• 𝐿 𝑡の行と列はそれぞれ特定のメモリスロットへの，及び特定のメモリスロットからの一時的
なリンクの重みを表現 29

2.3. forward/backward weighting の構成
• メモリへの書き込み順を考慮した forward/backward
weighting {𝑓𝑡
𝑖
∈ Δ 𝑁}𝑖=1,2,..,𝑅 / {𝑏𝑡
𝑖
∈ Δ 𝑁}𝑖=1,2,..,𝑅を以下で構成
• 𝑤𝑡−1
𝑟,𝑖
は前タイムステップからの i 番目の read weighting
30

1.+2. Read weighting
• read mode vector 𝜋 𝑡
𝑖
∈ 𝑆3により、backward weighting 𝑏𝑡
𝑖
と
forward weighting 𝑓𝑡
𝑖
、そして read content weighting 𝑐𝑡
𝑟,𝑖
を合わ
せて、read weighting 𝑤𝑡
𝑟,𝑖
∈ 𝑆3を以下のように定義：
• 読み出しモードにおいて 𝜋 𝑡
𝑖
[2] が強ければ， 𝑘 𝑡
𝑟,𝑖
を用いた内容
による検索に戻る
• 𝜋 𝑡
𝑖
[3]であれば，read key は無視し，読み出しヘッドは書き込ま
れた順にメモリスロットを指していく
• 𝜋 𝑡
𝑖
[1] であれば，読み出しヘッドは逆順で動いていく 31

参考文献
• Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine
translation by jointly learning toalign and translate.
• Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with
gradient descent is difficult.Trans. Neur. Netw., 5(2):157–166, March.
• Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. cite
arxiv:1410.5401.
• Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
Grabska-Barwi ́nska, Ser-gio G ́omez Colmenarejo, Edward Grefenstette, Tiago Ramalho,
John Agapiou, Adri`a Puigdom`enech Badia,Karl Moritz Hermann, Yori Zwols, Georg
Ostrovski, Adam Cain, Helen King, Christopher Summerfield, PhilBlunsom, Koray
Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a neural network
withdynamic external memory.Nature, 538(7626):471–476, October.
• Sepp Hochreiter and J ̈urgen Schmidhuber. 1997. Long short-term memory.Neural
computation, 9(8):1735–1780.
32

参考文献
• Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to
attention-based neu-ral machine translation. InProceedings of the 2015 Conference on
Empirical Methods in Natural LanguageProcessing, pages 1412–1421, Lisbon, Portugal,
September. Association for Computational Linguistics.
• John von Neumann. 1945. First draft of a report on the edvac. Technical report.
• H.T. Siegelmann and E.D. Sontag. 1995. On the computational power of neural nets.J.
Comput. Syst. Sci.,50(1):132–150, February.
• Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with
neural networks.CoRR,abs/1409.3215.
• 外部メモリー付きのニューラルネット"Differentiable Neural Computing (DNC)"について
解説するよ
https://qiita.com/deaikei/items/e06a1a122b7e19f6a343#%E3%83%AD%E3%82%B8%E3%
83%83%E3%82%AF%E8%A7%A3%E8%AA%AC
2017/03/28
33

Hybrid computing using a neural network with dynamic external memory

Recommended

Recommended

More Related Content

What's hot

What's hot (8)

Similar to Hybrid computing using a neural network with dynamic external memory

Similar to Hybrid computing using a neural network with dynamic external memory (20)

Hybrid computing using a neural network with dynamic external memory

Editor's Notes