The document discusses using Tensor Train (TT) decomposition to represent tensors efficiently and applying it to machine learning models. Some key points:
- TT decomposition provides a compact representation of tensors that allows efficient linear algebra operations.
- It has been used to compress the weight matrices of neural networks without loss of accuracy.
- Exponential machines model all feature interactions using a TT-formatted weight tensor, controlling model complexity with the TT-rank. They outperform other models on classification tasks where feature interactions matter.
3. Tensor Train summary
Tensor Train (TT) decomposition [Oseledets 2011]:
A compact representation for tensors (= multidimensional arrays);
Allows for efficient application of linear algebra operations.
4. Low-rank decomposition
An example of computing one element of a matrix: for $i_1 = 2$, $i_2 = 3$,
$$A_{23} = G_1[2]\, G_2[3].$$
In general,
$$A_{i_1 i_2} = \underbrace{G_1[i_1]}_{1 \times r}\, \underbrace{G_2[i_2]}_{r \times 1}, \qquad A = G_1 G_2,$$
where $G_1$ is a collection of rows and $G_2$ is a collection of columns.
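A minimal NumPy sketch of this picture (the sizes n1, n2, r are illustrative assumptions, not from the slides): build a rank-r factorization A = G1 G2 and check that one element is a row-times-column product.

```python
import numpy as np

n1, n2, r = 4, 5, 2                 # illustrative sizes
G1 = np.random.randn(n1, r)         # collection of rows
G2 = np.random.randn(r, n2)         # collection of columns
A = G1 @ G2                         # the full matrix

i1, i2 = 1, 2                       # 0-based indices of one element
element = G1[i1, :] @ G2[:, i2]     # (1 x r) times (r x 1)
assert np.isclose(element, A[i1, i2])
```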
5. Tensor Train decomposition
An example of computing one element of a 4-dimensional tensor: for $i_1 = 2$, $i_2 = 4$, $i_3 = 2$, $i_4 = 3$,
$$A_{2423} = G_1[2]\, G_2[4]\, G_3[2]\, G_4[3].$$
In general,
$$A_{i_1 \dots i_d} = \underbrace{G_1[i_1]}_{1 \times r}\, \underbrace{G_2[i_2]}_{r \times r} \cdots \underbrace{G_d[i_d]}_{r \times 1}.$$
6. Tensor Train decomposition Cont’d
Tensor $A$ is said to be in the TT-format if
$$A_{i_1,\dots,i_d} = G_1[i_1]\, G_2[i_2] \cdots G_d[i_d], \qquad i_k \in \{1,\dots,n\},$$
where $G_k[i_k]$ is a matrix of size $r_{k-1} \times r_k$, $r_0 = r_d = 1$.
Notation & terminology:
$G_k$ — TT-cores;
$r_k$ — TT-ranks;
$r = \max_{k=0,\dots,d} r_k$ — the maximal TT-rank.
The TT-format uses $O(n d r^2)$ memory to store $n^d$ elements. Efficient only if the TT-rank is small.
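A sketch of computing one element in this format, assuming each core $G_k$ is stored as a NumPy array of shape (r_{k-1}, n, r_k) so that cores[k][:, i_k, :] is the matrix $G_k[i_k]$ (this storage layout is an assumption, not part of the slides):

```python
import numpy as np

def tt_element(cores, index):
    """One element A[i_1, ..., i_d] of a tensor given by its TT-cores.
    cores[k] has shape (r_{k-1}, n, r_k); index is a tuple (i_1, ..., i_d)."""
    result = np.ones((1, 1))
    for g, ik in zip(cores, index):
        result = result @ g[:, ik, :]   # chain of 1xr, rxr, ..., rx1 matrices
    return float(result[0, 0])
```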
11. Sum of tensors
Tensors $A$ and $B$ are in the TT-format:
$$A_{i_1 \dots i_d} = G^A_1[i_1] \cdots G^A_d[i_d], \qquad B_{i_1 \dots i_d} = G^B_1[i_1] \cdots G^B_d[i_d].$$
Find the TT-format of $C = A + B$:
$$C_{i_1 \dots i_d} = A_{i_1 \dots i_d} + B_{i_1 \dots i_d}.$$
TT-cores of the result:
$$G^C_k[i_k] = \begin{pmatrix} G^A_k[i_k] & 0 \\ 0 & G^B_k[i_k] \end{pmatrix}, \qquad k = 2,\dots,d-1,$$
$$G^C_1[i_1] = \begin{pmatrix} G^A_1[i_1] & G^B_1[i_1] \end{pmatrix}, \qquad G^C_d[i_d] = \begin{pmatrix} G^A_d[i_d] \\ G^B_d[i_d] \end{pmatrix}.$$
The TT-ranks of the result are the sums of the TT-ranks.
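A short sketch of building the cores of C = A + B exactly as above, under the same assumed (r_prev, n, r_next) core layout:

```python
import numpy as np

def tt_add(cores_a, cores_b):
    """TT-cores of C = A + B: first cores are concatenated side by side,
    last cores stacked vertically, middle cores placed block-diagonally."""
    d = len(cores_a)
    cores_c = []
    for k, (ga, gb) in enumerate(zip(cores_a, cores_b)):
        if k == 0:
            gc = np.concatenate([ga, gb], axis=2)      # 1 x n x (rA + rB)
        elif k == d - 1:
            gc = np.concatenate([ga, gb], axis=0)      # (rA + rB) x n x 1
        else:
            ra0, n, ra1 = ga.shape
            rb0, _, rb1 = gb.shape
            gc = np.zeros((ra0 + rb0, n, ra1 + rb1))
            gc[:ra0, :, :ra1] = ga                     # block-diagonal slice per i_k
            gc[ra0:, :, ra1:] = gb
        cores_c.append(gc)
    return cores_c
```

Every rank of the result is $r^A_k + r^B_k$, which is why rounding after arithmetic (next slide) matters.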
12. TT-rounding
Given a tensor $A$ in the TT-format with rank $r$, the TT-rounding procedure [Oseledets, 2011]
$$\widehat{A} = \text{tt-round}(A, \varepsilon), \qquad \varepsilon > 0,$$
finds a tensor $\widehat{A}$ such that
1. $\|A - \widehat{A}\|_F \le \varepsilon \|A\|_F$;
2. the TT-rank of $\widehat{A}$ is minimal among all $B$ with $\|A - B\|_F \le \frac{\varepsilon}{\sqrt{d-1}} \|A\|_F$,
where $\|A\|_F = \sqrt{\sum_{i_1,\dots,i_d} A^2_{i_1,\dots,i_d}}$.
13. How to find TT-decomposition of a given tensor
Analytical formulas for special cases;
An exact algorithm based on the SVD for medium-sized tensors. E.g., for a $5^8 \approx 400\,000$-element tensor it takes 8 ms on my laptop;
For large tensors (e.g. $2^{50}$ elements), approximate algorithms that look at only a fraction of the tensor elements: DMRG-cross [Savostyanov and Oseledets, 2011], AMEn-cross [Dolgov and Savostyanov, 2013].
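A minimal sketch of the exact SVD-based algorithm mentioned above (TT-SVD); the relative truncation rule used here is a simplification of the one in [Oseledets, 2011]:

```python
import numpy as np

def tt_svd(a, eps=1e-10):
    """Sweep over the modes of the full tensor `a`, take an SVD of the current
    unfolding, keep the left factor as a TT-core, and carry the rest forward."""
    dims, d = a.shape, a.ndim
    cores, r_prev = [], 1
    c = a.reshape(dims[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))              # truncated rank
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        c = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(c.reshape(r_prev, dims[-1], 1))
    return cores
```

For example, `tt_svd(np.random.rand(*[5] * 8))` decomposes a $5^8$-element tensor like the one mentioned above.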
14. TT-format operations
Operation                          Rank of the result
C = c · A                          r(C) = r(A)
C = A + c                          r(C) = r(A) + 1
C = A + B                          r(C) ≤ r(A) + r(B)
C = A ⊙ B (element-wise product)   r(C) ≤ r(A) · r(B)
C = round(A, ε)                    r(C) ≤ r(A)
sum of all elements of A           –
‖A‖_F                              –
(Ask me about differential equations)
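As an illustration of one row of the table, a hedged sketch of "sum A" (the sum of all elements), which reduces to a chain of small matrix products, one per core (same assumed core layout as before):

```python
import numpy as np

def tt_sum_all(cores):
    """Sum of all elements of a TT tensor: collapse each core over its mode
    index, then multiply the resulting small matrices along the rank chain."""
    v = np.ones((1, 1))
    for g in cores:                 # g has shape (r_prev, n, r_next)
        v = v @ g.sum(axis=1)
    return float(v[0, 0])
```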
15. Example application: TensorNet
1. Neural networks use fully-connected layers: y = f(Wx + b).
2. The matrix W has millions of parameters.
3. Let's store and train the matrix W in the TT-format.
This cannot work for arbitrary matrices, but for the VGG-16 net we compressed a 4096 × 4096 matrix down to 320 parameters without loss of accuracy.
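A TT (TensorNet) layer stores W as a TT-matrix: d cores of shape (r_{k-1}, m_k, n_k, r_k), with prod(m_k) output and prod(n_k) input dimensions. Below is a hedged sketch of multiplying such a matrix by a dense vector core by core; it illustrates the idea, not the actual TensorNet implementation.

```python
import numpy as np

def tt_matvec(cores, x):
    """y = W @ x for a TT-matrix W. cores[k] has shape (r_{k-1}, m_k, n_k, r_k);
    x has length prod(n_k), treated as a tensor with row-major mode order."""
    z = x.reshape(1, 1, -1)                     # (processed output modes, rank, remaining input modes)
    for g in cores:
        r_prev, mk, nk, r_next = g.shape
        p, _, rest = z.shape
        z = z.reshape(p, r_prev, nk, rest // nk)
        z = np.einsum('paiq,abij->pbjq', z, g)  # contract the rank bond and the current input mode
        z = z.reshape(p * mk, r_next, -1)
    return z.reshape(-1)
```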
16. Linear model
Model:
$$y(x) = w^\top x + b, \qquad b \in \mathbb{R},\ w \in \mathbb{R}^d.$$
Loss function:
$$\sum_{k=1}^{N} \ell\bigl(w^\top x^{(k)} + b,\ y^{(k)}\bigr).$$
Linear regression
Logistic regression
Linear SVM
...
17. Need for interactions
Linear models give everyone the same recommendations: without interactions they cannot capture user–item dependencies.
The same issue arises, e.g., in bag-of-words text tasks.
Use interactions (products of features)!
18. Models with interactions
$$y(x) = b + w^\top x + \sum_{i,j} P_{ij}\, x_i x_j, \qquad b \in \mathbb{R},\ w \in \mathbb{R}^d,\ P \in \mathbb{R}^{d \times d}$$
For d features there are $d^2$ parameters: overfitting on sparse data
Prediction complexity is also $O(d^2)$
For recommender systems d is in the millions
An SVM with a polynomial kernel has the same drawbacks
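For reference, the naive model above in NumPy; the quadratic cost in d is visible directly (a sketch with assumed shapes):

```python
import numpy as np

def poly2_predict(x, b, w, P):
    """Full second-order model: one weight P[i, j] per feature pair,
    i.e. d^2 parameters and O(d^2) work per prediction."""
    return b + w @ x + x @ (P @ x)
```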
19. Factorization machines
$$y(x) = b + w^\top x + \sum_{i,j} P_{ij}\, x_i x_j$$
Factorization machines [Rendle, 2010] use a rank-$r$ factorization of $P$:
$$y(x) = b + w^\top x + \sum_{i,j} \sum_{f=1}^{r} V_{if} V_{jf}\, x_i x_j, \qquad b \in \mathbb{R},\ w \in \mathbb{R}^d,\ V \in \mathbb{R}^{d \times r}$$
The matrix $P = V V^\top$ is not sparse, but structured (low-rank)
Control the number of parameters with r
Can represent almost any matrix with a large enough r
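A sketch of second-order FM prediction as written above (the double sum runs over all pairs i, j; Rendle's original formulation restricts it to i < j). V has shape (d, r), so P = V Vᵀ is never materialized:

```python
import numpy as np

def fm_predict(x, b, w, V):
    """y(x) = b + w^T x + sum_{i,j,f} V_if V_jf x_i x_j, computed in O(d r)."""
    z = V.T @ x                  # r per-factor projections
    return b + w @ x + z @ z     # the sum over all pairs equals ||V^T x||^2
```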
20. High order analysis
Factorization machines model (3rd order):
$$y(x) = b + w^\top x + \sum_{i,j} \sum_{f=1}^{r} V_{if} V_{jf}\, x_i x_j + \sum_{i,j,k} \sum_{f=1}^{r} U_{if} U_{jf} U_{kf}\, x_i x_j x_k.$$
In fact, factorization machines just use the CP-decomposition for the weight tensor $P_{ijk}$:
$$P_{ijk} = \sum_{f=1}^{r} U_{if} U_{jf} U_{kf}.$$
But:
CP-based models converge poorly for high orders
The complexity of inference and learning grows with the order
22. Exponential machines
Let's encode each interaction by a binary code: every bit indicates whether the corresponding feature is included in the interaction.
Exponential machines example (d = 3):
$$y(x) = W_{000} + W_{100}\, x_1 + W_{010}\, x_2 + W_{001}\, x_3 + W_{110}\, x_1 x_2 + W_{101}\, x_1 x_3 + W_{011}\, x_2 x_3 + W_{111}\, x_1 x_2 x_3.$$
In general:
$$y(x) = \sum_{i_1=0}^{1} \dots \sum_{i_d=0}^{1} W_{i_1,\dots,i_d}\, x_1^{i_1} \cdots x_d^{i_d}, \qquad W \in \mathbb{R}^{2 \times \dots \times 2} \text{ with TT-rank } r.$$
Captures all $2^d$ interactions
Control the number of parameters with the TT-rank r
Can represent any polynomial function with a large enough r
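A brute-force sketch of the model for tiny d, looping over all 2^d binary index tuples exactly as in the sum above (here W is the full 2 × ... × 2 weight tensor, feasible only for small d):

```python
import itertools
import numpy as np

def exm_predict_naive(x, W):
    """y(x) = sum over (i_1,...,i_d) in {0,1}^d of W[i_1,...,i_d] * prod_k x_k^{i_k}."""
    y = 0.0
    for idx in itertools.product((0, 1), repeat=len(x)):
        monomial = np.prod([xk ** ik for xk, ik in zip(x, idx)])
        y += W[idx] * monomial
    return float(y)
```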
23. Exponential machines inference
Inference is linear in $d$: $O(r^2 d)$ operations.
$$y(x) = \sum_{i_1,\dots,i_d} G_1[i_1] \cdots G_d[i_d] \prod_{k=1}^{d} x_k^{i_k}
= \sum_{i_1,\dots,i_d} \bigl(x_1^{i_1} G_1[i_1]\bigr) \cdots \bigl(x_d^{i_d} G_d[i_d]\bigr)
= \Bigl(\sum_{i_1=0}^{1} x_1^{i_1} G_1[i_1]\Bigr) \cdots \Bigl(\sum_{i_d=0}^{1} x_d^{i_d} G_d[i_d]\Bigr)
= \underbrace{A_1}_{1 \times r}\, \underbrace{A_2}_{r \times r} \cdots \underbrace{A_d}_{r \times 1}.$$
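The same chain in NumPy, a sketch assuming the cores are stored as arrays of shape (r_{k-1}, 2, r_k); each factor A_k = G_k[0] + x_k G_k[1] costs O(r^2), so the whole product is O(r^2 d):

```python
import numpy as np

def exm_predict(x, cores):
    """y(x) = A_1 A_2 ... A_d with A_k = sum_{i_k} x_k^{i_k} G_k[i_k]."""
    v = np.ones((1, 1))
    for xk, g in zip(x, cores):
        v = v @ (g[:, 0, :] + xk * g[:, 1, :])
    return float(v[0, 0])
```

For small d this can be checked against the brute-force enumeration from the previous slide.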
24. Exponential machines learning
$$\underset{W}{\text{minimize}} \ \sum_{k=1}^{N} \ell\bigl(\langle W, X^{(k)} \rangle,\ y^{(k)}\bigr)
\quad \text{subject to} \quad \text{TT-rank}(W) = r_0.$$
1. Autodiff to compute gradients with respect to the TT-cores $G_k$;
2. OR Riemannian optimization.
Theorem [Holtz et al., 2012]
The set of all d-dimensional tensors with fixed TT-rank r,
$$\mathcal{M}_r = \{W \in \mathbb{R}^{2 \times \dots \times 2} : \text{TT-rank}(W) = r\},$$
forms a Riemannian manifold.
25. Riemannian optimization
[Diagram: one step of Riemannian optimization. From the current point $W_t$ on the manifold $\mathcal{M}_r$, the anti-gradient $-\partial L / \partial W_t$ is projected onto the tangent space $T_W \mathcal{M}_r$, a step is taken along this projection, and the result is mapped back onto $\mathcal{M}_r$ by TT-rounding to obtain $W_{t+1}$.]
26. Riemannian optimization Cont’d
Loss function:
$$L(W) = \sum_{k=1}^{N} \ell\bigl(\langle W, X^{(k)} \rangle,\ y^{(k)}\bigr)$$
Gradient:
$$\frac{\partial L}{\partial W} = \sum_{k=1}^{N} \frac{\partial \ell}{\partial y}\, X^{(k)},$$
where $X^{(k)}$ is of TT-rank 1!
$$X_{i_1 \dots i_d} = \prod_{k=1}^{d} x_k^{i_k}.$$
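Building the rank-1 TT-cores of the data tensor X takes one line per feature; a sketch under the same assumed core layout:

```python
import numpy as np

def data_tensor_cores(x):
    """TT-rank-1 cores of X_{i1...id} = prod_k x_k^{i_k}:
    core k stores the pair (x_k^0, x_k^1) = (1, x_k)."""
    return [np.array([1.0, xk]).reshape(1, 2, 1) for xk in x]
```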
28. Experiments: classification
1. We generated $10^5$ train and $10^5$ test objects with d = 30 features.
2. $X_{ij} \sim U\{-1, +1\}$.
3. The ground truth is a sum of interactions; e.g., for 3 interactions of order 2: $y(x) = \varepsilon_1 x_1 x_5 + \varepsilon_2 x_3 x_8 + \varepsilon_3 x_4 x_5$, with $\varepsilon_1, \varepsilon_2, \varepsilon_3 \sim U(-1, 1)$.
4. In the experiment below we used 20 interactions of order 6.
Method Test AUC Training time (s) Inference time (s)
Log. reg. 0.50 ± 0.0 0.4 0.0
RF 0.55 ± 0.0 21.4 1.3
SVM RBF 0.50 ± 0.0 2262.6 1076.1
SVM poly. 2 0.50 ± 0.0 1152.6 852.0
SVM poly. 6 0.56 ± 0.0 4090.9 754.8
2-nd order FM 0.50 ± 0.0 638.2 0.1
6-th order FM 0.57 ± 0.05 1412.0 0.2
ExM rank 2 0.54 ± 0.05 198.4 0.1
ExM rank 4 0.69 ± 0.02 443.0 0.1
ExM rank 8 0.75 ± 0.02 998.3 0.2
29. Conclusion
The Tensor Train decomposition compactly represents tensors.
Machine learning models can be parametrized with TT-tensors:
e.g. the weights of a neural network,
or modeling all $2^d$ interactions (products of features).
Control the number of underlying parameters via the TT-rank.
Riemannian optimization sometimes outperforms SGD for learning.
There is Python code for everything: TT, TensorNet, and Exponential Machines.