1. Tensorizing Neural Networks
NIPS 2015
11 citations
Alexander Novikov
Skolkovo Institute of Science and
Technology, Moscow, Russia
Dmitry Podoprikhin
INRIA, SIERRA project-team,
Paris, France
Anton Osokin
National Research University
Higher School of Economics,
Moscow, Russia
Dmitry Vetrov
Institute of Numerical Mathematics
of the Russian Academy of Sciences,
Moscow, Russia
5. Background: Training Neural Networks
● Back-propagation
– an efficient way to compute the gradient of the objective
function w.r.t. all the parameters
● Gradient descent:
● Objective:
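The two bullets refer to the standard formulas; since the slide's equations did not survive, here they are in common notation (an assumption about what the slide showed):

L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big),
\qquad
\theta \leftarrow \theta - \eta \,\nabla_\theta L(\theta)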
21. Motivation
● What if a neural network is too large to fit into
memory?
● Approaches
– distributed neural network
● distribute parameters
● challenge: training
[Elastic Averaging SGD (NIPS'15)]
– Model compression
● reduce required space
23. Problem Formulation
● Given: the weight matrix of a fully-
connected layer
– compress it while keeping back-propagation training
● Requirement
● Goal
– reduce space complexity
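Concretely (an assumed reading, using the fully-connected layer form y = Wx + b that appears later in the talk, with W of size M×N): storing W directly costs M·N numbers, and the goal is a representation of W with far fewer parameters that can still be trained by back-propagation.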
26. Naïve Method: Low-Rank SVD
● Reducing #parameters
[Figure: W (M×N) ≈ U[:, 1:R] (M×R) · Σ[1:R, 1:R] (R×R) · V[:, 1:R]^T (R×N)]
Take the R largest singular values and the
corresponding singular vectors
40. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: the second factor B (r×N) is decomposed again and again, and gets thinner and thinner]
41. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: a balanced matrix vs. a thin matrix with the same number of elements]
Given 2 matrices with the same #elements,
which one can be compressed more?
42. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: a balanced 10×12 matrix vs. a thin 2×60 matrix]
Given 2 matrices with the same #elements,
which one can be compressed more?
43. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: the thin 2×60 matrix reshaped into a 4×6×5 tensor]
How about reshaping into a higher-dimensional tensor?
71. Experiment
● Large: ImageNet ILSVRC-2012
– 1000 classes with 1.2 million (train) and 50,000 (valid) images
● Compress the FC layers of a large CNN (VGG-16, 13 conv-layers):
– 4096 x 1000
– 4096 x 4096
– 4096 x 25088
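For scale (standard VGG-16 numbers, not from the slide): these three weight matrices alone hold about 4096·25088 ≈ 102.8M, 4096·4096 ≈ 16.8M and 4096·1000 ≈ 4.1M parameters, i.e. roughly 124M of VGG-16's ~138M parameters, which is why the FC layers are the natural compression target.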
75. Conclusion
● An interesting approach to model compression,
and it works very well
● Applying the reshape adaptively in the recursive low-rank
SVD might work even better?
76. Tensorizing Neural Networks
NIPS 2015
11 citations
Alexander Novikov
Skolkovo Institute of Science and
Technology, Moscow, Russia
Dmitry Podoprikhin
INRIA, SIERRA project-team,
Paris, France
Anton Osokin
National Research University
Higher School of Economics,
Moscow, Russia
Dmitry Vetrov
Institute of Numerical Mathematics
of the Russian Academy of Sciences,
Moscow, Russia
The paper I am presenting today is called Tensorizing Neural Networks; it was published at NIPS 2015 and has 11 citations so far.
95. Motivation
● What if a neural network is too large to fit into
memory?
What if the neural network has so many parameters that they no longer fit into the memory of a single machine?
96. Motivation
● What if a neural network is too large to fit into
memory?
● Approaches
– distributed neural network
● distribute parameters
● challenge: training
[Elastic Averaging SGD (NIPS'15)]
– Model compression
● reduce required space
There are two directions, and they are actually complementary, so they can be used together.
If there are too many parameters to fit on one machine, we can distribute them across different machines; the difficulty of that approach is that training must also be distributed, and quite a few works study distributed stochastic gradient descent, for example Elastic Averaging SGD, which 康軍 will present later.
This paper belongs to the other approach, model compression: reducing the parameter storage a neural network needs.
100. Naïve Method: Low-Rank SVD
● Reducing #parameters
[Figure: SVD decomposition W (M×N) = U (M×M) · Σ (M×N) · V^T (N×N)]
First, use SVD to factor the M-by-N matrix W into three terms.
101. Naïve Method: Low-Rank SVD
● Reducing #parameters
[Figure: W (M×N) ≈ U[:, 1:R] (M×R) · Σ[1:R, 1:R] (R×R) · V[:, 1:R]^T (R×N)]
Take the R largest singular values and the
corresponding singular vectors
Each singular value has a corresponding pair of singular vectors; keep only the components belonging to the R largest singular values.
103. Naïve Method: Low-Rank SVD
● Reducing #parameters
[Figure: W (M×N) ≈ A · B^T, where A = U[:, 1:R] Σ[1:R, 1:R] is M×R and B^T = V[:, 1:R]^T is R×N]
So W becomes a product of just two matrices.
104. Naïve Method: Low-Rank SVD
● Reducing #parameters: space complexity
[Figure: the same decomposition W ≈ A · B^T with A of size M×R and B^T of size R×N]
The original W needs M*N storage; if we only keep the components A and B, the required storage is just R(M+N), so the space is reduced.
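A minimal NumPy sketch of this step (the sizes M, N and the rank R are made up for illustration, not taken from the paper):

```python
import numpy as np

M, N, R = 1000, 800, 20                  # illustrative sizes only
W = np.random.randn(M, N)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :R] * S[:R]                     # M x R   (U[:, 1:R] * Sigma[1:R, 1:R])
B = Vt[:R, :]                            # R x N   (V[:, 1:R]^T)

W_approx = A @ B                         # best rank-R approximation of W
print(W.size, A.size + B.size)           # M*N = 800000  vs  R*(M+N) = 36000
```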
105. Naïve Method: Low-Rank SVD
● Instead of updating W, update the components
obtained by low-rank SVD
How do we integrate this with training? Gradient descent originally updates W, but W is no longer stored, so we need the gradient of the objective function with respect to the components.
107. Naïve Method: Low-Rank SVD
● To integrate with back-propagation
– we have to calculate 3 gradients:
● objective w.r.t. the parameters
● output w.r.t. the input
● output w.r.t. the parameters
(the first two don't change)
The first two items stay the same as before.
108. Naïve Method: Low-Rank SVD
● To integrate with back-propagation
– we have to calculate 3 gradients:
● objective w.r.t. the parameters
● output w.r.t. the input
● output w.r.t. the parameters
Only the third item has to be re-derived: the gradient of the output with respect to the components A and B. My derivation looks like this.
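As a sanity check, here is that third gradient in standard matrix-calculus notation (my own notation, not copied from the slide): write the factored layer as y = A(Bx) and let δ = ∂L/∂y be the gradient coming from the layer above; then

\frac{\partial L}{\partial A} = \delta\,(Bx)^{\top},
\qquad
\frac{\partial L}{\partial B} = A^{\top}\delta\, x^{\top},
\qquad
\frac{\partial L}{\partial x} = B^{\top} A^{\top} \delta .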
109. Naïve Method: Low-Rank SVD
It works,
but the weight matrix might still be very large
→ want to do more compression
So low-rank SVD already achieves compression: it reduces the number of parameters while preserving the layer's function. This paper aims at an even more efficient compression that saves more space.
112. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: W (M×N) = A (M×r) · B (r×N)]
Originally W needs M*N storage; after decomposing it into two matrices we only need (M+N) times the rank r. If we further decompose A or B, we can save even more space.
114. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: the second factor B (r×N) is itself decomposed again by low-rank SVD]
Then we apply low-rank SVD to the second factor as well, and keep repeating this.
115. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: after each step the remaining factor gets thinner and thinner]
You will notice that the second factor keeps getting thinner and thinner. So what happens when it gets very thin?
116. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: a balanced matrix vs. a thin matrix with the same number of elements]
Given 2 matrices with the same #elements,
which one can be compressed more?
A question for the audience: take two matrices with the same number of elements, one balanced (height and width roughly equal, like the left one) and one very thin (like the right one). Applying low-rank SVD with the same rank, which one saves more space?
117. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: a balanced 10×12 matrix vs. a thin 2×60 matrix]
Given 2 matrices with the same #elements,
which one can be compressed more?
For example, take a 10*12 matrix and a 2*60 matrix and keep rank = 1 in both: the left, balanced one needs far fewer entries. This motivates reshaping whenever the matrix becomes too thin during the recursive low-rank SVD. So intuitively, do we agree that the balanced one can be compressed more?
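To make the slide's numbers explicit: both matrices have 120 entries. A rank-1 factorization of the 10×12 matrix stores 10 + 12 = 22 numbers, while the 2×60 one stores 2 + 60 = 62, so at the same rank the balanced shape compresses far more.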
118. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
[Figure: the thin 2×60 matrix reshaped into a 4×6×5 tensor]
How about reshaping into a higher-dimensional tensor?
One more question: if the reshape is not restricted to 2-D matrices and we allow reshaping into tensors with 3 or more dimensions, wouldn't the result be even more balanced, so that we can compress and save even more space?
119. Main Idea
recursively applying low-rank SVD
1) W ≈ A·B by low-rank SVD
2) if the matrix is too thin => reshape
=> Tensor-Train Decomposition
So, as we will see, this paper first reshapes the matrix into a tensor and then applies a method called Tensor-Train decomposition, which is basically this recursive low-rank SVD.
120. Tensor-Train Decomposition
If we want to TT-decompose a matrix,
first, we need to reshape it into a tensor.
[Figure: an 8×10 matrix W]
To decompose a matrix, first reshape it into a higher-dimensional tensor: for example, an 8*10 matrix is reshaped into a 5-dimensional 2*2*2*2*5 tensor. Then we can apply the Tensor-Train decomposition.
121. Tensor-Train Decomposition
unfold by the 1st dimension and apply low-rank SVD
[Figure: the 2×2×2×2×5 tensor unfolded along its first mode into a 2 × 40 matrix]
The Tensor-Train decomposition first unfolds W along the 1st mode into a matrix and applies low-rank SVD to that matrix; a rough code sketch of the whole procedure follows below.
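A rough NumPy sketch of this recursive unfold-and-SVD procedure (usually called TT-SVD); the function name, the single max_rank and the 2×2×2×2×5 example are only illustrative:

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Unfold along the leading dimension, apply truncated SVD, repeat.
    Returns d cores; core k has shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    cores, rank_prev = [], 1
    mat = tensor.reshape(rank_prev * dims[0], -1)      # unfold by the 1st dimension
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))                      # truncate to the chosen TT-rank
        cores.append(U[:, :r].reshape(rank_prev, dims[k], r))
        rank_prev = r
        # carry the remainder forward and unfold by the next dimension
        mat = (S[:r, None] * Vt[:r, :]).reshape(rank_prev * dims[k + 1], -1)
    cores.append(mat.reshape(rank_prev, dims[-1], 1))
    return cores

# e.g. an 8x10 matrix W reshaped into a 2x2x2x2x5 tensor, as in the talk
W = np.random.randn(8, 10)
cores = tt_svd(W.reshape(2, 2, 2, 2, 5), max_rank=4)
print([c.shape for c in cores])
```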
129. Tensor-Train Decomposition
Tensor-Train format
TT-rank = the ranks chosen in the low-rank SVDs
This is the Tensor-Train format: every element of the tensor can be written as a product of d matrices. Note in particular that the TT-rank is defined as the set of ranks r_k chosen at each low-rank SVD step.
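In symbols (standard TT notation, not copied from the slide): each entry of the d-way tensor is a product of d small matrices taken from the cores,

W(i_1, i_2, \ldots, i_d) = G_1[i_1]\, G_2[i_2] \cdots G_d[i_d],
\qquad G_k[i_k] \in \mathbb{R}^{\,r_{k-1} \times r_k},
\quad r_0 = r_d = 1 .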
130. Tensor-Train Decomposition
Tensor-Train format
[Oseledets et al., 2010] shows that for an arbitrary tensor
a TT-format exists, but it is not unique
A natural question is whether we can always find a Tensor-Train format for a given tensor. Oseledets et al., who proposed the Tensor-Train decomposition, proved that one always exists, but that it is not unique.
131. Tensor-Train Decomposition
Space complexity
original: n^d        Tensor-Train format: n·d·r^2
Now the space complexity. If each of the d modes of the tensor has dimension n, the original storage is the product of all mode dimensions, i.e. n^d. In TT-format we only store the core tensors: each of the d cores needs about n·r^2 entries, so the d cores together need n·d·r^2. The space therefore drops from n^d to n·d·r^2.
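To get a feel for the gap (illustrative numbers only): with d = 10 modes of size n = 4 and TT-rank r = 3, dense storage is 4^10 ≈ 1.05 million entries, while the TT cores hold only about 10 · 4 · 3^2 = 360 numbers.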
132. Tensor-Train Decomposition
Tensor-Train format
compared to other tensor decomposition methods:
          canonical (CP)         Tucker                   TT-format
Space     n·d·r                  n·d·r + r^d              n·d·r^2
robust    no (not SVD-based)     yes (SVD-based)          yes (SVD-based)
Compared with other common tensor decomposition methods: the canonical (CP) decomposition splits the tensor into d vectors per term, one vector per mode, whose outer product gives a rank-1 tensor, and sums r such terms, so it needs only n·d·r entries, even fewer than the TT-format. But CP is not SVD-based and is not robust: for a target rank r, the best rank-r approximation may not be attainable, so it does not suit this paper.
Tucker, like TT, is SVD-based and its form is similar to CP, but it adds a d-mode core tensor, which costs an extra r^d entries; when d is large that space grows exponentially. That is why this paper does not use Tucker even though it is SVD-based.
133. Tensor-Train Layer
represent W, x and b as d-dimensional tensors
by a simple MATLAB reshape
Now to what this paper actually does. To compress a fully-connected layer y = Wx + b, first reshape W, x and b into d-mode tensors, which is just the MATLAB reshape command; a small numeric sketch of the resulting TT layer follows below.
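A tiny NumPy illustration of the matrix-by-vector product in a TT fully-connected layer. The mode sizes m = (4,4,4,4), n = (3,3,3,3) and TT-ranks (1,3,3,3,1) are hypothetical, and a real TT layer contracts core by core rather than through one big einsum; this only shows that the cores reproduce W with far fewer parameters:

```python
import numpy as np

# hypothetical factorization of a 256 x 81 weight matrix W
m, n, r = (4, 4, 4, 4), (3, 3, 3, 3), (1, 3, 3, 3, 1)        # output/input modes, TT-ranks
cores = [np.random.randn(r[k], m[k], n[k], r[k + 1]) * 0.1    # core G_k: (r_{k-1}, m_k, n_k, r_k)
         for k in range(4)]
x = np.random.randn(int(np.prod(n)))

# y(j1..j4) = sum over i1..i4 of  G_1[j1,i1] G_2[j2,i2] G_3[j3,i3] G_4[j4,i4] x(i1..i4)
y = np.einsum('waiy,ybjz,zcku,udlv,ijkl->abcd',
              *cores, x.reshape(n)).reshape(-1)

# sanity check against the dense W rebuilt from the cores
W = np.einsum('waiy,ybjz,zcku,udlv->abcdijkl', *cores).reshape(np.prod(m), np.prod(n))
print(np.allclose(y, W @ x))                                  # True
print(W.size, sum(c.size for c in cores))                     # 20736 dense entries vs 288 in TT-format
```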
150. Conclusion
● An interesting approach to model compression,
and it works very well
● Applying the reshape adaptively in the recursive low-rank
SVD might work even better?
We saw an interesting model compression method. Perhaps, when applying low-rank SVD recursively, choosing the reshape adaptively instead of with a fixed scheme could compress even more.