Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
Insu Han¹, Haim Avron², Jinwoo Shin¹
¹ KAIST
² Tel Aviv University
Problem: Element-wise Matrix Function
Example:
Given a matrix and a scalar function, the element-wise matrix function is obtained by applying the function to each entry of the matrix.
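As a quick illustration, here is a minimal NumPy sketch (our own example, not code from the paper; the choice f = exp and the sizes are arbitrary) of applying a function element-wise to a low-rank Gram matrix:

```python
import numpy as np

# Minimal sketch: the element-wise matrix function applies f to every entry.
# Here f = exp is a hypothetical choice; n, d and all names are illustrative.
n, d = 5, 3
U = np.random.randn(n, d)     # low-rank factor, so the Gram matrix is U @ U.T
A = U @ U.T

f_A = np.exp(A)               # [f(A)]_{ij} = f(A_{ij}): exp applied entry-wise
assert f_A.shape == (n, n)
```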
Problem: Element-wise Matrix Function
Applications in machine learning
• Non-linear activation functions in deep learning (e.g., MLP, Transformer)
• Kernel methods (e.g., Gaussian process, Determinantal Point Process)
Given a matrix and a scalar function, the element-wise matrix function is obtained by applying the function to each entry of the matrix.
Problem: Element-wise Matrix Function
Element-wise matrix function ≠ algebraic matrix function
• For an algebraic matrix function, applying the function to a symmetric matrix acts only on its eigenvalues, i.e., the eigenvectors are preserved.
• But the element-wise matrix function does not preserve the eigenvectors. E.g.,
Given a matrix and a scalar function, the element-wise matrix function is obtained by applying the function to each entry of the matrix.
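To make the distinction concrete, here is a small sketch (our own example with f = exp; SciPy is assumed available) comparing the algebraic matrix exponential, which preserves eigenvectors, with the element-wise exponential, which does not:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eigh(A)                    # eigendecomposition A = V diag(lam) V^T

algebraic = V @ np.diag(np.exp(lam)) @ V.T    # algebraic matrix function: acts on eigenvalues
elementwise = np.exp(A)                       # element-wise matrix function: acts on entries

print(np.allclose(algebraic, expm(A)))        # True: same eigenvectors as A
print(np.allclose(algebraic, elementwise))    # False: the two notions differ
```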
Many applications require multiplication with vectors, i.e.,
We focus on a low-rank matrix for
Problem: Element-wise Matrix Function
Given a matrix and a scalar function, the element-wise matrix function is obtained by applying the function to each entry of the matrix.
Many applications require multiplication with vectors, i.e.,
We focus on a low-rank matrix for
Problem: Element-wise Matrix Function
Issue. Computing it exactly requires time quadratic in the matrix dimension, which is too expensive for large matrices.
Goal. Design a fast (linear-time) algorithm for a low-rank approximation of the element-wise matrix function,
i.e., without computing all of its entries.
Given a matrix and a scalar function, the element-wise matrix function is obtained by applying the function to each entry of the matrix.
Main Result: Polynomial Tensor Sketch
Our key ideas:
1. Tensor sketch [PP13] can approximate element-wise matrix monomial
function in time, i.e.,
Goal. Given , we aim to approximate in time
Main Result: Polynomial Tensor Sketch
Our key ideas:
1. Tensor sketch [PP13] can approximate element-wise matrix monomial
function in time, i.e.,
2. Polynomial approximation on
Goal. Given , we aim to approximate in time
Main Result: Polynomial Tensor Sketch
Our key ideas:
1. Tensor sketch [PP13] can approximate element-wise matrix monomial
function in time.
2. Polynomial approximation on
Applying Tensor sketch with , the low-rank approximation on
can be done in time.
target dimension for Tensor sketch
polynomial degree
Goal. Given , we aim to approximate in time
Main Result: Polynomial Tensor Sketch
Our key ideas:
1. Tensor sketch [PP13] can approximate element-wise matrix monomial
function in time.
2. Polynomial approximation on
Applying Tensor sketch with , the low-rank approximation on
can be done in time.
Q. How can we find good coefficients?
target dimension for Tensor sketch
polynomial degree
Goal. Given , we aim to approximate in time
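Below is a hedged sketch of the first ingredient, the TensorSketch of Pham and Pagh [PP13], written as we understand it rather than as the authors' code: for a fixed degree it builds a random feature map whose inner products approximate the corresponding entry-wise monomial of the Gram matrix. The variable names (m, degree) are ours.

```python
import numpy as np

def tensor_sketch(U, degree, m, rng):
    """Rows of the output T satisfy E[T @ T.T] ā‰ˆ (U @ U.T) ** degree (entry-wise)."""
    n, d = U.shape
    # `degree` independent CountSketches: a bucket h and a sign s per input coordinate.
    h = rng.integers(0, m, size=(degree, d))
    s = rng.choice([-1.0, 1.0], size=(degree, d))
    prod = np.ones((n, m), dtype=complex)
    for t in range(degree):
        cs = np.zeros((n, m))
        for k in range(d):                      # CountSketch every row of U
            cs[:, h[t, k]] += s[t, k] * U[:, k]
        prod *= np.fft.fft(cs, axis=1)          # convolution theorem: multiply in Fourier domain
    return np.real(np.fft.ifft(prod, axis=1))

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 10))
T = tensor_sketch(U, degree=3, m=64, rng=rng)
approx = T @ T.T                                # ā‰ˆ (U @ U.T) ** 3, without forming the n x n matrix
```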
Main Result: Coreset Coefficients
Our key ideas:
A natural choice of coefficients is polynomial expansion, e.g., Taylor series.
Goal. Given , we aim to approximate in time
Main Result: Coreset Coefficients
Our key ideas:
A natural choice of coefficients is polynomial expansion, e.g., Taylor series.
However, this ignores the error of the tensor sketch, which can grow exponentially with the polynomial degree.
Proposition. Suppose , then the error bound is
Goal. Given , we aim to approximate in time
Main Result: Coreset Coefficients
Our key ideas:
A natural choice of coefficients is polynomial expansion, e.g., Taylor series.
However, this ignores the error of the tensor sketch, which can grow exponentially with the polynomial degree.
Q. How do we balance the errors of the two approximations?
Goal. Given , we aim to approximate in time
Proposition. Suppose , then the error bound is
Main Result: Coreset Coefficients
Observation. For some and diagonal matrix
which depend on the input, it holds that
for any coefficient
Goal. Given , we aim to approximate in time
Main Result: Coreset Coefficients
• -th column of is the vectorization of for all
• is the vectorization of
• is the diagonal matrix such that
Goal. Given , we aim to approximate in time
Observation. For some and diagonal matrix
which depend on the input, it holds that
for any coefficient
Main Result: Coreset Coefficients
To minimize the error bound, we choose
(closed-form solution)
Goal. Given , we aim to approximate in time
Observation. For some and diagonal matrix
which depend on the input, it holds that
for any coefficient
Main Result: Coreset Coefficients
Observation. For some and diagonal matrix
which depend on the input, it holds that
for any coefficient
To minimize the error bound, we choose
Goal. Given , we aim to approximate in time
(closed-form solution)
Theorem (informal). There exists a choice of parameters such that the proposed algorithm, using these coefficients, satisfies the stated error bound.
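The closed-form step above is a weighted least-squares problem. The sketch below is our own illustration of the generic closed form we believe is being used; the exact construction of the design matrix, target, and weights follows the paper and is not reproduced here.

```python
import numpy as np

def weighted_ls_coefficients(Phi, y, w):
    """Minimize || W^{1/2} (Phi c - y) ||_2 with W = diag(w); returns the closed-form c."""
    WPhi = Phi * w[:, None]
    return np.linalg.solve(Phi.T @ WPhi, Phi.T @ (w * y))

# Toy usage: fit a degree-3 polynomial to f = exp on sampled points with uniform weights.
x = np.linspace(-1.0, 1.0, 50)
Phi = np.vander(x, 4, increasing=True)   # columns: 1, x, x^2, x^3
c = weighted_ls_coefficients(Phi, np.exp(x), np.ones_like(x))
```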
Main Result: Coreset Coefficients
Optimal Coefficients. For and diagonal matrix ,
However, computing it requires even more operations than the exact computation of the element-wise matrix function itself.
Goal. Given , we aim to approximate in time
Main Result: Coreset Coefficients
To solve this, we use coreset-based regression, i.e., a regression problem
using a subset of rows in .
Goal. Given , we aim to approximate in time
Optimal Coefficients. For and diagonal matrix ,
However, computing it requires even more operations than the exact computation of the element-wise matrix function itself.
Main Result: Coreset Coefficients
Approximating Coefficients via Coreset.
(1) Select a subset of rows in (i.e., coreset) with size
(2) Construct and based on selected entries in
where each weight is the number of rows in the input whose nearest row in the coreset is the corresponding one.
Goal. Given , we aim to approximate in time
Main Result: Coreset Coefficients
Goal. Given , we aim to approximate in time
Approximating Coefficients via Coreset.
(1) Select a subset of rows in (i.e., coreset) with size
(2) Construct and based on selected entries in
where each weight is the number of rows in the input whose nearest row in the coreset is the corresponding one.
Theorem (informal). If a coreset of satisfies that
, then for some and under mild assumptions,
Main Result: Coreset Coefficients
Approximating Coefficients via Coreset.
(1) Select a subset of rows in (i.e., coreset) with size
• It gives a 2-approximation to the k-center problem.
• It takes time to run.
Goal. Given , we aim to approximate in time
Greedy -center algorithm
Main Result: Coreset Coefficients
Approximating Coefficients via Coreset.
(1) Select a subset of rows in (i.e., coreset) via the greedy -center
(2) Construct and based on selected entries in
where each weight is the number of rows in the input whose nearest row in the coreset is the corresponding one.
Goal. Given , we aim to approximate in time
Complexity.
The coefficients via the greedy -center algorithm can be computed in
time.
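For reference, a minimal sketch of the greedy k-center selection (Gonzalez-style farthest-point traversal) that we assume underlies the coreset step; the coreset size k and all names are illustrative:

```python
import numpy as np

def greedy_k_center(U, k, seed=0):
    """Pick k row indices of U: start anywhere, then repeatedly add the farthest row."""
    rng = np.random.default_rng(seed)
    n = U.shape[0]
    centers = [int(rng.integers(n))]
    dist = np.linalg.norm(U - U[centers[0]], axis=1)   # distance to the nearest chosen row
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                     # farthest row from the current coreset
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(U - U[nxt], axis=1))
    return centers

coreset = greedy_k_center(np.random.randn(1000, 10), k=20)   # O(n * k * d) time overall
```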
Main Result: Summary
Algorithm.
(1) Compute the coefficients via the greedy -center
(2) Run Tensor Sketch of and to obtain
(3) Construct the approximation
Complexity. For multiplication with a vector, the constructed approximation requires far fewer operations than the exact computation. The overall complexity, combining the coreset-based coefficient step and the poly-tensor sketch step, is linear in the number of rows.
Goal. Given , we aim to approximate in time
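Putting the pieces together, here is a hedged end-to-end sketch of how the approximation is applied to a vector. It reuses the `tensor_sketch` helper from the earlier sketch, and the coefficients are treated as given (in the paper they come from the coreset regression step).

```python
import numpy as np

def poly_tensor_sketch_apply(U, coeffs, m, x, rng):
    """Approximate f(U U^T) @ x as sum_j coeffs[j] * T_j (T_j^T x), never forming an n x n matrix."""
    n = U.shape[0]
    out = coeffs[0] * np.full(n, x.sum())              # degree-0 term: c_0 * (all-ones matrix) @ x
    for j, c in enumerate(coeffs[1:], start=1):
        T = tensor_sketch(U, degree=j, m=m, rng=rng)   # from the earlier sketch
        out += c * (T @ (T.T @ x))                     # each term costs O(n * m)
    return out
```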
Experiments: Kernel Approximation
RBF Kernel Approximation.
Our method (Coreset-TS) outperforms the other competitors: random Fourier features (RFF) and the same algorithm with coefficients from the Taylor series (Taylor-TS) and the Chebyshev series (Chebyshev-TS).
Experiments: Kernel SVM
Classification with RBF Kernel SVM.
We evaluate the test error on classification tasks with real-world datasets.
Our method (Coreset-TS) performs similarly to the exact RBF kernel SVM while running up to 49 times faster. Compared to a method based on random Fourier features (RFF), ours achieves better test error.
Experiments: Sinkhorn Algorithm
Sinkhorn Algorithm for Computing Optimal Transport Distance.
Given vectors and a low-rank matrix, the Sinkhorn algorithm repeatedly requires computing
We evaluate the speed-up and approximation ratio of random Fourier features (RFF), the Nystrƶm method, and our algorithm (Coreset-TS) on image color transformation. Ours shows the smallest variance in the approximation ratio.
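For context, a minimal Sinkhorn sketch (our own assumed setup, not the paper's code): the per-iteration bottleneck is the pair of matrix-vector products with the entry-wise kernel matrix, which is exactly where a low-rank factor from the polynomial tensor sketch can be plugged in.

```python
import numpy as np

def sinkhorn(K_mv, Kt_mv, a, b, iters=100):
    """Sinkhorn scaling given black-box products K_mv(v) ā‰ˆ K v and Kt_mv(u) ā‰ˆ K^T u."""
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / K_mv(v)
        v = b / Kt_mv(u)
    return u, v

# Usage with a rank-m factor T such that T @ T.T approximates the (symmetric) kernel matrix:
#   u, v = sinkhorn(lambda z: T @ (T.T @ z), lambda z: T @ (T.T @ z), a, b)
```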
Experiments: Neural Network Linearization
Linearization of Neural Network.
Given a 2-layer fully-connected network, a low-rank approximation of the non-linear operation can linearize the network:
• (+) Inference time can be reduced once the parameters are pre-computed.
• (āˆ’) The inputs must still be passed through a non-linear transform, i.e.,
Results on the final 2 FC layers of AlexNet trained on the CIFAR-10 dataset.
Conclusion
• We propose a fast low-rank approximation for element-wise matrix functions by combining the tensor sketch with polynomial approximation.
• To obtain a linear-time algorithm, we utilize the greedy k-center algorithm for computing the coefficients and provide an error analysis.
• Empirically, we apply our algorithm to practical ML applications such as kernel SVM, optimal transport, and neural network linearization.
• We believe our contribution will be of broader interest for other ML applications.
Thanks for your attention
