Coordinate descent methods for the matrix exponential on large networks
David F. Gleich, Purdue University
Joint work with Kyle Kloster @ Purdue
Supported by NSF CAREER award CCF-1149756
Code: www.cs.purdue.edu/homes/dgleich/codes/nexpokit
ICME
David Gleich · Purdue
1
This talk: x = exp(P) ec
x is the solution: localized.
P is the matrix: large, sparse, stochastic.
ec is the c-th column of the identity.
Localized solutions
[Figure: plot(x) with nnz(x) = 513,969, and a log-log plot of error vs. nonzeros retained]
x = exp(P) ec, length(x) = 513,969
Our mission
Find the solution with work roughly proportional to the localization, not the matrix.
Our algorithm
www.cs.purdue.edu/homes/dgleich/codes/nexpokit
[Figure: error vs. nonzeros for our algorithm, on the same log-log axes as before]
Outline
1.  Motivation and setup
2.  Converting x = exp(P) ec into a linear system
3.  Coordinate descent methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Models and algorithms for high-performance matrix and network computations
[Figure residue from a simulation-data-analysis image: predicted vs. measured standard deviation of bubble locations, panels (b) Std, s = 0.39 cm and (d) Std, s = 1.95 cm]
Tensor eigenvalues and a power method

[Figure: previous work from the PI tackled network alignment with matrix methods for edge overlap; this proposal matches triangles using tensor methods]

maximize sum_{ijk} T_ijk x_i x_j x_k  subject to ‖x‖_2 = 1

[x^(next)]_i = ρ · ( sum_{jk} T_ijk x_j x_k + x_i ),  where ρ ensures the 2-norm constraint

SSHOPM method due to Kolda and Mayo
Simulation data analysis: SIMAX '09, SISC '11, MapReduce '11, ICASSP '12
Network alignment: ICDM '09, SC '11, TKDE '13
Fast & scalable network centrality: SC '05, WAW '07, SISC '10, WWW '10, ...
Data clustering: WSDM '12, KDD '12, CIKM '13, ...

Ax = b,  min ‖Ax - b‖,  Ax = λx
Massive matrix computations on multi-threaded and distributed architectures
Matrix exponentials
exp(A) is defined as sum_{k=0}^∞ (1/k!) A^k. The series always converges; exp(A) is a special case of a function of a matrix.

dx/dt = Ax(t)  ⟺  x(t) = exp(tA) x(0): the evolution operator for an ODE.

A is n × n, real.
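The series definition can be checked directly. A minimal sketch in Python/SciPy (the deck's own code is MATLAB; the small test matrix here is an arbitrary assumption for illustration):

```python
import numpy as np
from scipy.linalg import expm

def taylor_expm(A, N=30):
    """Truncated Taylor series sum_{k=0}^N A^k / k!, built term by term."""
    term = np.eye(A.shape[0])     # k = 0 term: A^0 / 0! = I
    total = term.copy()
    for k in range(1, N + 1):
        term = term @ A / k       # multiply previous term by A/k to get A^k / k!
        total = total + term
    return total

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
err = np.abs(taylor_expm(A) - expm(A)).max()
```

For a matrix of modest norm, 30 terms already agree with SciPy's Padé-based `expm` to machine precision.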
Moler and Van Loan, "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later," SIAM Review 45(1), pp. 3-49, 2003.
Matrix exponentials on large networks
exp(A) = sum_{k=0}^∞ (1/k!) A^k
If A is the adjacency matrix, then A^k counts the number of length-k paths between node pairs. [Estrada 2000; Farahat et al. 2002, 2006]
Large entries denote important nodes or edges. Used for link prediction and centrality.

exp(P) = sum_{k=0}^∞ (1/k!) P^k
If P is a transition matrix, then P^k gives the probability of a length-k walk between node pairs. [Kondor & Lafferty 2002; Kunegis & Lommatzsch 2009; Chung 2007]
Used for link prediction, kernels, and clustering or community detection.
Another useful matrix exponential
P column stochastic, e.g. P = A^T D^{-1}, where A is the adjacency matrix and D the diagonal degree matrix.
If A is symmetric:
exp(P^T) = exp(D^{-1} A) = D^{-1} exp(A D^{-1}) D = D^{-1} exp(P) D
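This similarity identity is easy to verify numerically. A sketch in Python/SciPy, assuming D is the diagonal degree matrix and using a 3-node path graph as an illustrative example:

```python
import numpy as np
from scipy.linalg import expm

# Small symmetric adjacency matrix (path graph on 3 nodes)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(A.sum(axis=0))          # diagonal degree matrix
Dinv = np.linalg.inv(D)
P = A.T @ Dinv                      # column-stochastic transition matrix

lhs = expm(P.T)                     # exp(P^T) = exp(D^{-1} A)
rhs = Dinv @ expm(A @ Dinv) @ D     # D^{-1} exp(A D^{-1}) D = D^{-1} exp(P) D
err = np.abs(lhs - rhs).max()
```

The identity holds exactly in exact arithmetic because D^{-1} A = D^{-1} (A D^{-1}) D, and similarity commutes with the exponential.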
Another useful matrix exponential
P column stochastic, e.g. P = A^T D^{-1}, where A is the adjacency matrix and D the diagonal degree matrix.
If A is symmetric, then for the negative normalized Laplacian -L = D^{-1/2} A D^{-1/2} - I:
exp(-L) = exp(D^{-1/2} A D^{-1/2} - I)
        = (1/e) exp(D^{-1/2} A D^{-1/2})
        = (1/e) D^{-1/2} exp(A D^{-1}) D^{1/2}
        = (1/e) D^{-1/2} exp(P) D^{1/2}
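The Laplacian identity can be checked the same way; a Python/SciPy sketch on an illustrative path graph (the graph and variable names are assumptions):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                   # symmetric path graph
d = A.sum(axis=0)                              # degrees [1, 2, 1]
Dhalf = np.diag(np.sqrt(d))
Dhalf_inv = np.diag(1.0 / np.sqrt(d))
Dinv = np.diag(1.0 / d)

L = np.eye(3) - Dhalf_inv @ A @ Dhalf_inv      # normalized Laplacian
P = A @ Dinv                                   # column stochastic (A symmetric)

lhs = expm(-L)
rhs = (1 / np.e) * Dhalf_inv @ expm(P) @ Dhalf
err = np.abs(lhs - rhs).max()
```

The key step is that D^{-1/2} A D^{-1/2} = D^{-1/2} P D^{1/2}, so the normalized adjacency and the transition matrix are similar.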
Matrix exponentials on large networks
Is a single column interesting? Yes!
exp(P) ec = sum_{k=0}^∞ (1/k!) P^k ec gives link prediction scores for node c, or a community relative to node c.
But modern networks are large (~10^9 nodes), sparse (~10^11 edges), and constantly changing, so we'd like speed over accuracy.
The issue with existing methods
We want good results in less than one matvec. Our graphs have small diameter and fast fill-in.

Krylov methods: a few matvecs, quick loss of sparsity due to orthogonality.

Direct expansion: a few matvecs, quick loss of sparsity due to fill-in.
Krylov (EXPOKIT): exp(P) ec ≈ ρ V exp(H) e1   [Sidje 1998]
Direct expansion: exp(P) ec ≈ sum_{k=0}^N (1/k!) P^k ec
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system
3.  Coordinate descent methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Our underlying method
Direct expansion: a few matvecs, quick loss of sparsity due to fill-in.
This method is stable for stochastic P: no cancellation, no unbounded norms, etc.
x = exp(P) ec ≈ sum_{k=0}^N (1/k!) P^k ec = xN

Lemma: ‖x - xN‖_1 ≤ 1/(N! N)
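The truncation lemma can be sanity-checked numerically. A Python/NumPy sketch (the random column-stochastic matrix is an assumption for illustration):

```python
import numpy as np
from scipy.linalg import expm
from math import factorial

rng = np.random.default_rng(0)
n, N = 8, 6
A = rng.random((n, n))
P = A / A.sum(axis=0)            # column stochastic
ec = np.zeros(n); ec[0] = 1.0

xN = ec.copy()                   # k = 0 term
term = ec.copy()
for k in range(1, N + 1):
    term = P @ term / k          # P^k ec / k!
    xN += term

x = expm(P) @ ec                 # dense reference answer
err = np.abs(x - xN).sum()       # 1-norm error
bound = 1.0 / (factorial(N) * N)
```

Because ‖P^k ec‖_1 = 1 for column-stochastic P, the 1-norm of the tail is exactly sum_{k>N} 1/k!, which the bound 1/(N! N) dominates.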
Our underlying method as a linear system
Direct expansion, rewritten as one big linear system:
x = exp(P) ec ≈ sum_{k=0}^N (1/k!) P^k ec = xN

[  I                  ] [ v0 ]   [ ec ]
[ -P/1   I            ] [ v1 ]   [ 0  ]
[       -P/2   I      ] [ v2 ] = [ 0  ]
[              ...    ] [ .. ]   [ .. ]
[            -P/N   I ] [ vN ]   [ 0  ]

xN = sum_{i=0}^N vi

Compactly, (I_{N+1} ⊗ I_n - SN ⊗ P) v = e1 ⊗ ec, where SN is the (N+1) × (N+1) shift matrix with (SN)_{k+1,k} = 1/k.

Lemma: we approximate xN well if we approximate v well.
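A small Python/NumPy sketch of this construction, with SN as described above (the dense Kronecker build is for illustration only; the actual algorithm never forms this matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 8
A = rng.random((n, n))
P = A / A.sum(axis=0)                        # column stochastic
ec = np.zeros(n); ec[0] = 1.0

# SN: (N+1)x(N+1) shift-and-scale matrix with (SN)_{k+1,k} = 1/k
S = np.zeros((N + 1, N + 1))
for k in range(1, N + 1):
    S[k, k - 1] = 1.0 / k

M = np.eye((N + 1) * n) - np.kron(S, P)      # I ⊗ I - SN ⊗ P
rhs = np.kron(np.eye(N + 1)[0], ec)          # e1 ⊗ ec
v = np.linalg.solve(M, rhs)
xN_from_v = v.reshape(N + 1, n).sum(axis=0)  # xN = sum_i v_i

# The same xN computed by the truncated Taylor recursion directly
xN = ec.copy(); term = ec.copy()
for k in range(1, N + 1):
    term = P @ term / k
    xN += term

err = np.abs(xN_from_v - xN).max()
```

Block row k of the system reads v_k - P v_{k-1}/k = 0, so v_k = P^k ec / k! and the block sum recovers xN.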
Our mission (2)
Approximately solve Ax = b when A, b are sparse and x is localized.
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Coordinate descent methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Coordinate descent, Gauss-Southwell,
Gauss-Seidel, relaxation & “push” methods
Algebraically / Procedurally

Solve(A,b)
  x = sparse(size(A,1),1)
  r = b
  While (1)
    Pick j where r(j) != 0
    z = r(j)
    x(j) = x(j) + z
    For i where A(i,j) != 0
      r(i) = r(i) - z*A(i,j)

Ax = b
r(k) = b - A x(k)
x(k+1) = x(k) + ej ej^T r(k)
r(k+1) = r(k) - r(k)_j A ej
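The Solve(A,b) loop above can be sketched as runnable Python/NumPy with the Gauss-Southwell pivot choice, assuming diag(A) = I as the deck states later (the test system is an illustrative assumption):

```python
import numpy as np

def gauss_southwell(A, b, tol=1e-10, maxit=10000):
    """Coordinate relaxation for Ax = b with unit diagonal: repeatedly pick
    the largest residual entry j, push it into x(j), and update r = b - Ax."""
    n = len(b)
    x = np.zeros(n)
    r = b.astype(float).copy()
    for _ in range(maxit):
        j = int(np.argmax(np.abs(r)))
        if abs(r[j]) < tol:
            break
        z = r[j]
        x[j] += z
        r -= z * A[:, j]          # r(i) -= z*A(i,j); zeroes r(j) since A(j,j) = 1
    return x

# Diagonally dominant test system with unit diagonal
rng = np.random.default_rng(2)
B = 0.1 * rng.random((6, 6))
A = np.eye(6) + B - np.diag(np.diag(B))
b = rng.random(6)
x = gauss_southwell(A, b)
err = np.abs(A @ x - b).max()
```

Each step zeroes one residual entry at the cost of touching only the nonzeros of column j, which is the whole point for sparse A.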
It’s called the “push” method
because of PageRank
(I - αP) x = v
r(k) = v - (I - αP) x(k)
x(k+1) = x(k) + ej ej^T r(k)
"r(k+1) = r(k) - r(k)_j A ej"

r(k+1)_i =  0                          if i = j
            r(k)_i + α P_i,j r(k)_j    if P_i,j != 0
            r(k)_i                     otherwise

PageRankPush(links,v,alpha)
  x = sparse(size(A,1),1)
  r = v
  While (1)
    Pick j where r(j) != 0
    z = r(j)
    x(j) = x(j) + z
    r(j) = 0
    z = alpha * z / deg(j)
    For i where "j links to i"
      r(i) = r(i) + z
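The PageRankPush pseudocode can be made runnable. A Python sketch, assuming no dangling nodes; the 3-node cycle graph and names are illustrative assumptions:

```python
import numpy as np

def pagerank_push(adj, v, alpha=0.85, tol=1e-8):
    """Push method for (I - alpha*P) x = v, with P column stochastic built
    from out-degrees. adj[j] lists the nodes j links to (no dangling nodes)."""
    n = len(adj)
    x = np.zeros(n)
    r = v.astype(float).copy()
    while True:
        j = int(np.argmax(r))          # Gauss-Southwell choice; any r(j) > 0 works
        if r[j] < tol:
            break
        z = r[j]
        x[j] += z
        r[j] = 0.0
        z = alpha * z / len(adj[j])
        for i in adj[j]:               # "for i where j links to i"
            r[i] += z
    return x

adj = [[1], [2], [0]]                  # 3-node cycle: 0 -> 1 -> 2 -> 0
v = np.array([1.0, 0.0, 0.0])
x = pagerank_push(adj, v)

# Check against the direct solve of (I - alpha*P) x = v
P = np.zeros((3, 3))
for j, outs in enumerate(adj):
    for i in outs:
        P[i, j] = 1.0 / len(outs)
x_true = np.linalg.solve(np.eye(3) - 0.85 * P, v)
err = np.abs(x - x_true).max()
```

Each push removes (1 - alpha) of the pushed residual mass from the system, so the residual sum decays geometrically.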
It’s called the “push” method
because of PageRank
Demo
Justification of terminology
This method is frequently "rediscovered" (3 times for PageRank!)

Let Ax = b with diag(A) = I.
It's Gauss-Seidel if j is chosen cyclically.
It's Gauss-Southwell if j is the largest entry in the residual.
It's coordinate descent if A is symmetric, positive definite.
It's a relaxation step for any A.

Works great for other problems too! [Bonchi, Gleich, et al., J. Internet Math. 2012]
Back to the exponential
(I_{N+1} ⊗ I_n - SN ⊗ P) v = e1 ⊗ ec:

[  I                  ] [ v0 ]   [ ec ]
[ -P/1   I            ] [ v1 ]   [ 0  ]
[       -P/2   I      ] [ v2 ] = [ 0  ]
[              ...    ] [ .. ]   [ .. ]
[            -P/N   I ] [ vN ]   [ 0  ]

xN = sum_{i=0}^N vi

Solve this system via the same push method.
Optimization 1: build the system implicitly.
Optimization 2: don't store the vi, just store the running sum xN.
Code (inefficient, but working) for Gauss-Southwell to solve the exponential system:

function x = nexpm(P,c,tol)
n = size(P,1); N = 11; sumr = 1;
r = zeros(n,N+1); r(c,1) = 1;        % the residual
x = zeros(n,1);                      % and the solution
while sumr >= tol                    % use a max-iteration cap too
    [ml,q] = max(r(:));              % use a heap in practice for the max
    i = mod(q-1,n)+1; k = ceil(q/n);
    r(q) = 0; x(i) = x(i)+ml;        % zero the residual, add to solution
    sumr = sumr-ml;
    [nset,~,vals] = find(P(:,i));    % look up the neighbors of node i
    ml = ml/k;
    for j = 1:numel(nset)            % for all neighbors
        if k == N                    % last term: add straight to the solution
            x(nset(j)) = x(nset(j)) + vals(j)*ml;
        else                         % otherwise push into the next residual
            r(nset(j),k+1) = r(nset(j),k+1) + vals(j)*ml;
            sumr = sumr + vals(j)*ml;
        end
    end
end

Todo: use a dictionary for x, r and a heap or queue for the residual.
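For readers without MATLAB, a rough Python transcription of the same loop (a sketch, not the released NEXPOKIT code; dense arrays replace the sparse structures, heap, and queue of the real implementation):

```python
import numpy as np
from scipy.linalg import expm

def nexpm_push(P, c, tol=1e-6, N=11, maxit=100000):
    """Gauss-Southwell-style push on the Taylor linear system for exp(P) ec.
    r[:, k] holds the residual for Taylor term v_k."""
    n = P.shape[0]
    r = np.zeros((n, N))
    r[c, 0] = 1.0
    x = np.zeros(n)
    sumr = 1.0
    it = 0
    while sumr >= tol and it < maxit:
        it += 1
        i, k = np.unravel_index(np.argmax(r), r.shape)   # heap in practice
        ml = r[i, k]
        r[i, k] = 0.0
        x[i] += ml
        sumr -= ml
        ml /= (k + 1)                    # next term divides by k+1
        if k == N - 1:
            x += P[:, i] * ml            # last kept term: add straight to x
        else:
            r[:, k + 1] += P[:, i] * ml  # push into the next term's residual
            sumr += ml                   # columns of P sum to 1
    return x

rng = np.random.default_rng(3)
n = 10
A = rng.random((n, n))
P = A / A.sum(axis=0)                    # column stochastic
x = nexpm_push(P, c=0)
err = np.abs(x - expm(P)[:, 0]).sum()
```

With N = 11 and tol = 1e-6, the combined truncation and residual error stays far below 1e-4 for a column-stochastic P.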
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Coordinate descent methods for linear systems from large networks ✓
4.  Error analysis
5.  Experiments
Error analysis for Gauss-Southwell
Theorem. Assume P is column-stochastic and v(0) = 0, for the system
(I_{N+1} ⊗ I_n - SN ⊗ P) v = e1 ⊗ ec.

(Nonnegativity, the "easy" part) Iterates and residuals are nonnegative: v(l) ≥ 0 and r(l) ≥ 0.

(Convergence, the "annoying" part) The residual goes to 0:
‖r(l)‖_1 ≤ prod_{k=1}^{l} (1 - 1/(2dk)) ≤ l^{-1/(2d)}
where d is the largest degree.

Proof sketch
Gauss-Southwell picks the largest residual entry.
⇒ Bound the update by the average number of nonzeros in the residual (sloppy).
⇒ Algebraic convergence with a slow rate, but each update is REALLY fast: O(d_max log n).
If d is log log n, then our method runs in sub-linear time (but so does just about anything).
Overall error analysis
Components
•  Truncation to N terms
•  Residual-to-error conversion
•  The approximate solve

Theorem. After ℓ steps of Gauss-Southwell,
‖xN(ℓ) - x‖_1 ≤ 1/(N! N) + (1/e) · ℓ^{-1/(2d)}
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Coordinate descent methods for linear systems from large networks ✓
4.  Error analysis ✓
5.  Experiments
Our implementations
A C++ mex implementation with a heap to implement Gauss-Southwell.
A C++ mex implementation with a queue that stores all residual entries ≥ tol/(nN), which guarantees the residual norm is ≤ tol at completion.
We use the queue except for the runtime comparison.
Accuracy vs. tolerance
[Figure: boxplots of precision at 100 vs. log10 of residual tolerance (-2 through -7), pgp-cc]
For the pgp social graph (10k vertices), we study the precision in finding the 100 largest nodes as we vary the tolerance. This set of 100 does not include the node's immediate neighbors. (Boxplot over 50 trials.)
Accuracy vs. work
For the dblp collaboration graph (225k vertices), we study the precision in finding the 100 largest nodes as we vary the work. This set of 100 does not include the node's immediate neighbors. (One column, but representative.)
[Figure: precision at 10/25/100/1000 vs. effective matrix-vector products on dblp-cc, for tol = 10^-4 and 10^-5]
Runtime
[Figure: runtime (secs) vs. |E| + |V| for TSGS, TSGSQ, EXPV, MEXPV, and TAYLOR on the Flickr social network, 500k nodes, 5M edges]
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Coordinate descent methods for linear systems from large networks ✓
4.  Error analysis ✓
5.  Experiments ✓
References and ongoing work
Kloster and Gleich, Workshop on Algorithms for the
Web-graph, 2013 (forthcoming).
www.cs.purdue.edu/homes/dgleich/codes/nexpokit
•  Error analysis using the queue
•  Better linear systems for faster convergence
•  Asynchronous coordinate descent methods
•  Scaling up to billion node graphs
•  More explicit localization in algorithms
