My talk at TMA 2016 (The workshop on Tensors, Matrices, and their Applications) on the relationship between a spacey random walk process and tensor eigenvectors
Higher-order organization of complex networks (David Gleich)
A talk I gave at the Park City Mathematics Institute about our recent work on using motifs to analyze and cluster networks. This involves a higher-order Cheeger inequality in terms of motifs.
Spectral clustering with motifs and higher-order structures (David Gleich)
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
Spacey random walks and higher order Markov chains (David Gleich)
My talk at the SIAM NetSci workshop (2015) on our new spacey random walk and spacey random surfer models and how we derived them. There are many potential extensions and opportunities to use this for analyzing big data as tensors.
Localized methods in graph mining exploit the local structure of a graph instead of attempting to find global structure. These methods are widely successful at problems including community detection and label propagation, among others.
Anti-differentiating approximation algorithms: A case study with min-cuts, sp... (David Gleich)
This talk covers anti-differentiating approximation algorithms, an idea to explain the success of widely used heuristic procedures. Formally, this involves finding an optimization problem solved exactly by an approximation algorithm or heuristic.
Localized methods for diffusions in large graphs (David Gleich)
I describe a few ongoing research projects on diffusions in large graphs and how we can use efficient matrix computations to evaluate them.
Correlation clustering and community detection in graphs and networks (David Gleich)
We show a new relationship between various community detection objectives and a correlation clustering framework. This enables us to detect communities with good bounds on the solution.
PageRank Centrality of dynamic graph structures (David Gleich)
A talk I gave at the SIAM Annual Meeting mini-symposium on the mathematics of the power grid, organized by Mahantesh Halappanavar. I discuss a few ideas on how our dynamic centrality could help analyze such situations.
Anti-differentiating Approximation Algorithms: PageRank and MinCut (David Gleich)
We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain the details of how the "push method" for computing PageRank helps to accelerate it. This has implications for semi-supervised learning and machine learning, as well as social network analysis.
Big data matrix factorizations and overlapping community detection in graphs (David Gleich)
In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Fast relaxation methods for the matrix exponential (David Gleich)
The matrix exponential is a matrix computing primitive used in link prediction and community detection. We describe a fast method to compute it using relaxation on a large linear system of equations. This enables us to compute a column of the matrix exponential in sublinear time, or under a second on a standard desktop computer.
A copy of my slides from the SILO Seminar at UW Madison on our recent developments for the NEO-K-Means methods including new optimization routines and results.
Gaps between the theory and practice of large-scale matrix-based network comp... (David Gleich)
I discuss some runtimes for the personalized PageRank vector and how they relate to open questions in how we should tackle these network-based measures via matrix computations.
Using Local Spectral Methods to Robustify Graph-Based Learning (David Gleich)
This is my KDD2015 talk on robustness in semi-supervised learning. The paper is already on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf See the KDD paper for all the details, which this talk is a bit light on.
Relaxation methods for the matrix exponential on large networks (David Gleich)
My talk from the Stanford ICME seminar series on doing network analysis and link prediction using a fast algorithm for the matrix exponential on graph problems.
Exact Matrix Completion via Convex Optimization Slide (PPT) (Joonyoung Yi)
Slides for the paper "Exact Matrix Completion via Convex Optimization" by Emmanuel J. Candès and Benjamin Recht. We presented these slides in the KAIST CS592 class, April 2018.
- Code: https://github.com/JoonyoungYi/MCCO-numpy
- Abstract of the paper: We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe m entries selected uniformly at random from a matrix M. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfectly recover most low-rank matrices from what appears to be an incomplete set of entries. We prove that if the number m of sampled entries obeys
$m \ge C\,n^{1.2}\,r \log n$
for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered by solving a simple convex optimization program. This program finds the matrix with minimum nuclear norm that fits the data. The condition above assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, then the result holds for all values of the rank. Similar results hold for arbitrary rectangular matrices as well. Our results are connected with the recent literature on compressed sensing, and show that objects other than signals and images can be perfectly reconstructed from very limited information.
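As a concrete illustration of the program described in the abstract, here is a minimal sketch using numpy and cvxpy (an assumption of this write-up, not the code in the repository linked above):

```python
# A minimal sketch of the convex program from the abstract: minimize the
# nuclear norm subject to matching the observed entries.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target
mask = (rng.random((n, n)) < 0.5).astype(float)                # ~half observed

X = cp.Variable((n, n))
problem = cp.Problem(cp.Minimize(cp.normNuc(X)),
                     [cp.multiply(mask, X) == cp.multiply(mask, M)])
problem.solve()
print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```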
Tensor Train (TT) decomposition [3] is a generalization of the SVD from matrices to tensors (i.e., multidimensional arrays).
It represents a tensor compactly in terms of factors and allows one to work with the tensor via its factors without materializing the tensor itself.
For example, we can find the elementwise product of two TT-tensors of size 2^100 and get the result in the TT-format as well.
In the talk, we will show how Tensor Train decomposition can be used to represent parameters of neural networks [1] and polynomial models [2].
This parametrization allows exponentially many 'virtual' parameters while working only with small factors of the TT-format.
To train the model, i.e. optimize the objective subject to the constraint that the parameters are in the TT-format, [2] uses stochastic Riemannian optimization.
[1] Novikov, A., Podoprikhin, D., Osokin, A., & Vetrov, D. P. (2015). Tensorizing neural networks. In Advances in Neural Information Processing Systems.
[2] Novikov, A., Trofimov, M., & Oseledets, I. (2016). Tensor Train polynomial models via Riemannian optimization. arXiv:1605.03795.
[3] Oseledets, I. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing.
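For a sense of how the factors are computed, here is a minimal TT-SVD sketch in numpy, a simplified version of the algorithm in [3] (the truncation tolerance and rank cap are arbitrary choices here):

```python
# A minimal TT-SVD sketch: sequential SVDs of unfoldings produce cores
# G[k] of shape (r_{k-1}, n_k, r_k).
import numpy as np

def tt_svd(A, max_rank):
    dims, d = A.shape, A.ndim
    cores, r, C = [], 1, A.reshape(1, -1)
    for k in range(d - 1):
        C = C.reshape(r * dims[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        rk = min(max_rank, int((s > 1e-12 * s[0]).sum()))  # truncated TT-rank
        cores.append(U[:, :rk].reshape(r, dims[k], rk))
        C = s[:rk, None] * Vt[:rk]                         # carry the remainder
        r = rk
    cores.append(C.reshape(r, dims[-1], 1))
    return cores

def tt_full(cores):
    out = cores[0]
    for G in cores[1:]:
        out = np.tensordot(out, G, axes=(out.ndim - 1, 0))  # contract TT ranks
    return out.reshape([c.shape[1] for c in cores])

A = np.random.rand(4, 5, 6)
print(np.allclose(tt_full(tt_svd(A, max_rank=30)), A))      # exact at full rank
```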
In this talk we consider the question of how to use QMC with an empirical dataset, such as a set of points generated by MCMC. Using ideas from partitioning for parallel computing, we apply recursive bisection to reorder the points, and then interleave the bits of the QMC coordinates to select the appropriate point from the dataset. Numerical tests show that in the case of known distributions this is almost as effective as applying QMC directly to the original distribution. The same recursive bisection can also be used to thin the dataset, by recursively bisecting down to many small subsets of points, and then randomly selecting one point from each subset. This makes it possible to reduce the size of the dataset greatly without significantly increasing the overall error. Co-author: Fei Xie
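A hedged sketch of the thinning step (not the authors' code), under the assumption that splitting at the median of the widest coordinate is an acceptable stand-in for the recursive bisection described above:

```python
# Recursively bisect the point set at the median of its widest coordinate,
# then keep one randomly chosen point per leaf subset.
import numpy as np

def bisection_thin(points, leaf_size, rng):
    if len(points) <= leaf_size:
        return points[rng.integers(len(points))][None, :]   # one point per leaf
    widest = np.ptp(points, axis=0).argmax()                # largest-spread axis
    order = np.argsort(points[:, widest])
    half = len(points) // 2
    return np.vstack([bisection_thin(points[order[:half]], leaf_size, rng),
                      bisection_thin(points[order[half:]], leaf_size, rng)])

rng = np.random.default_rng(1)
sample = rng.standard_normal((10_000, 2))        # stand-in for MCMC output
thinned = bisection_thin(sample, leaf_size=16, rng=rng)
print(sample.shape, "->", thinned.shape)
```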
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon... (MLconf)
Anima Anandkumar has been a faculty member in the EECS Dept. at U.C. Irvine since August 2010. Her research interests are in the area of large-scale machine learning and high-dimensional statistics. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a visiting faculty member at Microsoft Research New England in 2012 and a postdoctoral researcher in the Stochastic Systems Group at MIT from 2009 to 2010. She is the recipient of the Microsoft Faculty Fellowship, the ARO Young Investigator Award, the NSF CAREER Award, and the IBM Fran Allen PhD Fellowship.
A history of PageRank from the numerical computing perspective (David Gleich)
We'll survey some of the underlying ideas from Google's PageRank algorithm along the lines of Massimo Franceschet's CACM history.
There are some slight liberties I've taken to make it more accessible.
This talk is a new update based on some of our recent results on doing Tall and Skinny QRs in MapReduce. In particular, the "fast" iterative refinement approximation based on a sample is new.
MapReduce Tall-and-skinny QR and applications (David Gleich)
A talk at the Simons Institute workshop on Parallel and Distributed Algorithms for Inference and Optimization on how to do tall-and-skinny QR factorizations on MapReduce using a communication-avoiding algorithm.
How does Google Google: A journey into the wondrous mathematics behind your f... (David Gleich)
A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/)
Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)
Vertex neighborhoods, low conductance cuts, and good seeds for local communit... (David Gleich)
My talk from KDD2012 about vertex neighborhoods and low conductance cuts. See the paper here: http://arxiv.org/abs/1112.0031 and http://dl.acm.org/citation.cfm?id=2339628
Recommendation and graph algorithms in Hadoop and SQL (David Gleich)
A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
Fast matrix primitives for ranking, link-prediction and more (David Gleich)
I gave this talk at Netflix about some of the recent work I've been doing on fast matrix primitives for link prediction and also some non-standard uses of the nuclear norm for ranking.
Characterization of Subsurface Heterogeneity: Integration of Soft and Hard In... (Amro Elfeki)
Park, E., Elfeki, A. M. M., Dekking, F. M. (2003). Characterization of subsurface heterogeneity: Integration of soft and hard information using multi-dimensional coupled Markov chain approach. Underground Injection Science and Technology Symposium, Lawrence Berkeley National Lab., October 22-25, 2003. p. 49. Eds. Tsang, Chin-Fu and Apps, John A.
http://www.lbl.gov/Conferences/UIST/index.html#topics
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
Dynamical Systems Methods in Early-Universe Cosmologies (Ikjyot Singh Kohli)
Talk I gave at The Southern Ontario Numerical Analysis Day (SONAD): http://www.math.yorku.ca/sonad2014/ on General Relativity, Dynamical Systems, and Early-Universe Cosmologies.
These are slides for my tutorial talk on network dynamics. (The colors are fine in the downloaded version, though there seem to be color issues if you view the slides directly in slideshare.)
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on... (Shu Tanaka)
Our paper entitled “Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on Square Lattice" was published in Journal of the Physical Society of Japan. This work was done in collaboration with Dr. Ryo Tamura (NIMS).
http://journals.jps.jp/doi/abs/10.7566/JPSJ.82.053002
Complex systems are characterized by constituents -- from neurons in the brain to individuals in a social network -- which exhibit special structural organization and nonlinear dynamics. As a consequence, a complex system cannot be understood by studying its units separately because their interactions lead to unexpected emerging phenomena, from collective behavior to phase transitions.
Recently, we have discovered that a new level of complexity characterizes a variety of natural and artificial systems, where units interact, simultaneously, in distinct ways. For instance, this is the case of multimodal transportation systems (e.g., metro, bus, and train networks) or of biological molecules, whose interactions might be of different type (e.g., physical, chemical, genetic) or functionality (e.g., regulatory, inhibitory, etc.). The unprecedented newfound wealth of multivariate data allows us to categorize a system's interdependencies by defining distinct "layers", each one encoding a different network representation of the system. The result is a multilayer network model.
Analyzing data from different domains -- including molecular biology, neuroscience, urban transport, and telecommunications -- we will show that neglecting or disregarding multivariate information might lead to poor results. Conversely, multilayer models provide a suitable framework for complex data analytics, allowing us to quantify the resilience of a system to perturbations (e.g., localized failures or targeted attacks) and improving the forecasting of spreading processes and the accuracy of classification problems.
Spacey random walks and higher-order data analysis
1. Spacey random walks for higher-order data analysis
David F. Gleich, Purdue University
May 20, 2016, TMA 2016
Joint work with Austin Benson, Lek-Heng Lim, and Tao Wu; supported by NSF CAREER CCF-1149756, IIS-1422918, and DARPA SIMPLEX.
Papers: arXiv:1602.02102, arXiv:1603.00395
2. Markov chains, matrices, and eigenvectors have a long relationship.
Kemeny and Snell, 1976 (Finite Mathematics, Chapter V, Section 8): "In the Land of Oz they never have two nice days in a row. If they have a nice day, they are just as likely to have snow as rain the next day. If they have snow or rain, they have an even chance of having the same the next day. If there is a change from snow or rain, only half of the time is this change to a nice day." We form a three-state Markov chain with states R, N, and S for rain, nice, and snow, respectively. The transition matrix (column-stochastic in my talk) is then
$P = \begin{pmatrix} 1/2 & 1/2 & 1/4 \\ 1/4 & 0 & 1/4 \\ 1/4 & 1/2 & 1/2 \end{pmatrix}$ with rows and columns ordered R, N, S.
The stationary distribution x satisfies $x_i = \sum_j P(i,j)\,x_j$, with $x_i \ge 0$ and $\sum_i x_i = 1$; here $x = (2/5,\ 1/5,\ 2/5)$.
x is an eigenvector.
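A quick numerical check of the example (a sketch assuming numpy; not part of the original slides):

```python
# The stationary distribution is the eigenvector of the column-stochastic P
# for eigenvalue 1, normalized to sum to one.
import numpy as np

P = np.array([[1/2, 1/2, 1/4],   # rows/columns ordered R, N, S
              [1/4, 0.0, 1/4],
              [1/4, 1/2, 1/2]])
vals, vecs = np.linalg.eig(P)
x = np.real(vecs[:, np.argmax(np.real(vals))])
x /= x.sum()
print(x)                         # [0.4 0.2 0.4], i.e., (2/5, 1/5, 2/5)
```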
3. Markov chains, matrices, and eigenvectors have a long relationship.
1. Start with a Markov chain $X_1, X_2, \dots, X_t, X_{t+1}, \dots$
2. Inquire about the stationary distribution.
3. This question gives rise to an eigenvector problem on the transition matrix.
$x_i = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} \mathrm{Ind}[X_t = i]$ is the limiting fraction of time the chain spends in state i.
In general, $X_t$ will be a stochastic process in this talk.
4. Higher-order Markov chains are more useful for modern data problems.
Higher-order means more history!
Rosvall et al. (Nature Comm. 2014) found
• Higher-order Markov chains were critical to finding multidisciplinary journals in citation data and patterns in air traffic networks.
Chierichetti et al. (WWW 2012) found
• Higher-order Markov models capture browsing behavior more accurately than first-order models.
(and more!)
[Figure from Rosvall et al. 2014, "From pathway data to networks with and without memory": (a) itineraries weighted by passenger number; (b) aggregated bigrams for links between physical nodes; (c) aggregated trigrams for links between memory nodes; (d) network without memory; (e) network with memory. The example contrasts first- and second-order Markov dynamics for air traffic among Atlanta, Chicago, New York, San Francisco, and Seattle; e.g., a memory node represents passengers who come to Chicago from New York.]
5. Stationary dist. of higher-order Markov chains are still matrix eqns.
$P[X_{t+1} = i \mid X_t = j, X_{t-1} = k] = P(i,j,k)$, the probability of state i given history j, k.
Convert into a first-order Markov chain on pairs of states:
$X_{i,j} = \sum_k P(i,j,k)\,X_{j,k}$, with $X_{i,j} \ge 0$ and $\sum_{i,j} X_{i,j} = 1$.
$x_i = \sum_j X(i,j)$ is the marginal for the stationary distribution.
Last state          |     1       |      2       |      3
Current state       |  1   2   3  |  1    2   3  |  1    2    3
P[next state = 1]   |  0   0   0  | 1/4   0   0  | 1/4   0   3/4
P[next state = 2]   | 3/5 2/3  0  | 1/2   0  1/2 |  0   1/2   0
P[next state = 3]   | 2/5 1/3  1  | 1/4   1  1/2 | 3/4  1/2  1/4
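A sketch of this reduction, assuming numpy; the tensor entries below encode the table above:

```python
# Encode P[i, j, k] = P[next = i | current = j, last = k], build the
# first-order chain on pairs (current, last), and marginalize its
# stationary distribution.
import numpy as np

P = np.zeros((3, 3, 3))
P[:, 0, 0] = [0, 3/5, 2/5]; P[:, 1, 0] = [0, 2/3, 1/3]; P[:, 2, 0] = [0, 0, 1]
P[:, 0, 1] = [1/4, 1/2, 1/4]; P[:, 1, 1] = [0, 0, 1]; P[:, 2, 1] = [0, 1/2, 1/2]
P[:, 0, 2] = [1/4, 0, 3/4]; P[:, 1, 2] = [0, 1/2, 1/2]; P[:, 2, 2] = [3/4, 0, 1/4]

n = 3
M = np.zeros((n * n, n * n))                 # pair (i, j) has index i*n + j
for i in range(n):
    for j in range(n):
        for k in range(n):
            M[i * n + j, j * n + k] = P[i, j, k]   # transition (j,k) -> (i,j)

vals, vecs = np.linalg.eig(M)
X = np.real(vecs[:, np.argmax(np.real(vals))]); X /= X.sum()
x = X.reshape(n, n).sum(axis=1)              # marginal: x_i = sum_j X_{i,j}
print(x)
```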
6. Stationary dist. of higher-order Markov chains are still matrix eqns.
The implicit Markov chain: $P[X_{t+1} = i \mid X_t = j, X_{t-1} = k] = P(i,j,k)$, the probability of state i given history j, k.
[Figure: the implicit first-order chain on the pairs (1,1), (2,1), (3,1), (1,2), (2,2), (3,2), with transition probabilities taken from the table on the previous slide, and example trajectories such as 1, 1, 3, 3, 1, ...; 2, 3, 3, ...; and 1, 2, 3, 2, ...]
(Same transition table as on the previous slide.)
7. Hypermatrices, tensors, and tensor eigenvectors have been used too
For a tensor $A : n \times n \times n$, a tensor eigenvector x satisfies
$\sum_{j,k} A(i,j,k)\,x_j x_k = x_i$, i.e., $Ax^2 = x$.
Z-eigenvectors of this type were proposed by Lim (2005) and Qi (2005). There are many references to using tensors for data analysis (1970+).
Anandkumar et al. 2014
• Tensor eigenvector decompositions are optimal to recover latent variable models based on higher-order moments.
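A minimal sketch, assuming numpy, of the basic power-type iteration for this equation; as later slides discuss, this simple iteration need not converge in general:

```python
# Normalize A x^2 each step; check the z-eigenpair residual at the end.
import numpy as np

def tensor_apply(A, x):
    return np.einsum('ijk,j,k->i', A, x, x)  # (A x^2)_i = sum_{j,k} A_ijk x_j x_k

rng = np.random.default_rng(0)
A = rng.random((4, 4, 4))
x = np.ones(4) / np.linalg.norm(np.ones(4))
for _ in range(500):
    y = tensor_apply(A, x)
    x = y / np.linalg.norm(y)
lam = x @ tensor_apply(A, x)                 # Rayleigh-type eigenvalue estimate
print("residual:", np.linalg.norm(tensor_apply(A, x) - lam * x))
```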
8. But there were few results connecting hypermatrices, tensors, and higher-order Markov chains.
9. Li and Ng proposed a link between tensors and high-order MC (Li and Ng 2014)
1. Start with a higher-order Markov chain.
2. Look at the stationary distribution: $X_{i,j} = \sum_k P(i,j,k)\,X_{j,k}$, $X_{i,j} \ge 0$, $\sum_{i,j} X_{i,j} = 1$.
3. Assume/approximate it as rank 1: $X_{i,j} = x_i x_j$.
4. ... and we have a tensor eigenvector: $x_i = \sum_{j,k} P(i,j,k)\,x_j x_k$.
10. Li and Ng proposed an algebraic link between tensors and high-order MC
The Li and Ng stationary distribution (Li and Ng 2014) satisfies $x_i = \sum_{j,k} P(i,j,k)\,x_j x_k$, i.e., $Px^2 = x$. It
• is a tensor z-eigenvector;
• is non-negative and sums to one;
• can sometimes be computed [Li and Ng, 14; Chu and Wu, 14; Gleich, Lim, Yu 15];
• may or may not be unique;
• almost always exists.
Our question: Is there a stochastic process underlying this tensor eigenvector?
11. Intro
Markov chain → matrix equation: $X_1, X_2, \dots$ → $Px = x$
Markov chain → matrix equation → approximation (Li & Ng, Multilinear PageRank): $X_1, X_2, \dots$ → "$PX = X$" → $Px^2 = x$
Desired: stochastic process → approx. equations: $X_1, X_2, \dots$ → $Px^2 = x$
Our question: Is there a stochastic process underlying this tensor eigenvector?
12. The spacey random walk
Consider a higher-order Markov chain: $P[X_{t+1} = i \mid \text{history}] = P[X_{t+1} = i \mid X_t = j, X_{t-1} = k]$.
If we were perfect, we'd figure out the stationary distribution of that. But we are spacey!
• On arriving at state j, we promptly "space out" and forget we came from k.
• But we still believe we are "higher-order".
• So we invent a state k by drawing a random state from our history.
走神 ("to space out"), according to my students.
Benson, Gleich, Lim arXiv:2016
13. The Spacey Random Walk
Higher-order Markov: $P[X_{t+1} = i \mid X_t = j, X_{t-1} = k] = P(i,j,k)$
Spacey random walk: $P[X_{t+1} = i \mid X_t = j, Y_t = g] = P(i,j,g)$, where $Y_t$ is the invented history state.
[Figure: a sample trajectory of states illustrating $X_{t-1}$, $X_t$, and the spaced-out history state $Y_t$.]
Key insight: the limiting distributions of this process are tensor eigenvectors.
Benson, Gleich, Lim arXiv:2016
14. The spacey random walk process
$P(X_{t+1} = i \mid \mathcal{F}_t) = \sum_k P_{i,X_t,k}\,C_k(t)/(t+n)$
Let $C_k(t) = 1 + \sum_{s=1}^{t} \mathrm{Ind}\{X_s = k\}$ count how often we've visited state k in the past, and let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the history $\{X_s : 0 \le s \le t\}$.
This is a reinforced stochastic process, or a (generalized) vertex-reinforced random walk! Diaconis; Pemantle, 1992; Benaïm, 1997; Pemantle, 2007.
Benson, Gleich, Lim arXiv:2016
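A simulation sketch of this process, assuming numpy and an arbitrary random column-stochastic table:

```python
# Guess the past state Y_t from the occupation counts C_k(t), then
# transition with the column P(:, X_t, Y_t).
import numpy as np

def spacey_walk(P, steps, rng):
    n = P.shape[0]
    counts = np.ones(n)                              # C_k(t) = 1 + #visits to k
    state = rng.integers(n)
    for _ in range(steps):
        y = rng.choice(n, p=counts / counts.sum())   # spaced-out history state
        state = rng.choice(n, p=P[:, state, y])      # higher-order transition
        counts[state] += 1
    return counts / counts.sum()                     # empirical occupation c(t)

rng = np.random.default_rng(0)
n = 3
P = rng.random((n, n, n)); P /= P.sum(axis=0)        # random column-stochastic P
print(spacey_walk(P, 50_000, rng))
```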
15. Generalized vertex-reinforced random walks (VRRW)
A vertex-reinforced random walk at time t transitions according to a Markov matrix M given the observed frequencies of visiting each state:
$P(X_{t+1} = i \mid \mathcal{F}_t) = [M(c(t))]_{i,X_t}$, where $c(t)$ is the vector of observed visit frequencies.
The map $c \mapsto M(c)$ from the simplex of probability distributions to Markov chains is key to VRRWs: how often we've been where determines where we are going next.
M. Benaïm 1997
16. Stationary distributions of VRRWs correspond to ODEs
THEOREM [Benaïm, 1997], paraphrased: The sequence of empirical observation probabilities $c(t)$ is an asymptotic pseudo-trajectory for the dynamical system
$\frac{dx}{dt} = \pi[M(x)] - x$, where $\pi(M(x))$ is the map to the stationary distribution.
Thus, convergence of the ODE to a fixed point is equivalent to stationary distributions of the VRRW.
• M must always have a unique stationary distribution!
• The map to M must be very continuous.
• Asymptotic pseudo-trajectories satisfy $\lim_{t\to\infty}\,\lVert c(t+T) - x(t+T)\rVert = 0$, where $x(\cdot)$ solves the ODE with $x(t) = c(t)$.
17. The Markov matrix for Spacey Random Walks
$M(c) = \sum_k P(:,:,k)\,c_k$
This is the transition probability associated with guessing the last state based on history!
A necessary condition for a stationary distribution of $\frac{dx}{dt} = \pi[M(x)] - x$ (otherwise it makes no sense):
Property B. Let $P$ be an order-m, n-dimensional probability table. Then $P$ has property B if there is a unique stationary distribution associated with all stochastic combinations of the last $m-2$ modes. That is, $M = \sum_{k,\ell,\dots} P(:,:,k,\ell,\dots)\,\gamma_{k,\ell,\dots}$ defines a Markov chain with a unique Perron root when all the weights $\gamma$ are positive and sum to one.
Benson, Gleich, Lim arXiv:2016
18. Stationary points of the ODE for the Spacey Random Walk are tensor evecs
With $M(c) = \sum_k P(:,:,k)\,c_k$ and $\frac{dx}{dt} = \pi[M(x)] - x$:
$\frac{dx}{dt} = 0 \iff \pi(M(x)) = x \iff M(x)\,x = x \iff \sum_{j,k} P(i,j,k)\,x_j x_k = x_i$
But not all tensor eigenvectors are stationary points!
Benson, Gleich, Lim arXiv:2016
19. Some results on spacey random walk models
1. If you give it a Markov chain hidden in a hypermatrix, then it works like a Markov chain.
2. All 2 x 2 x 2 x ... x 2 problems have a stationary distribution (with a few corner cases).
3. This shows that an "exotic" class of Pólya urns always converges.
4. Spacey random surfer models have unique stationary distributions in some regimes.
5. Spacey random walks model Hardy-Weinberg laws in population genetics.
6. Spacey random walks are a plausible model of taxicab behavior.
Benson, Gleich, Lim arXiv:2016
20. All 2-state spacey random walk models have a stationary distribution
Key idea: reduce to a one-dimensional ODE.
If we unfold $P(i,j,k)$ for a 2 x 2 x 2 problem, then
$R = \begin{bmatrix} a & b & c & d \\ 1-a & 1-b & 1-c & 1-d \end{bmatrix}$
$M(x) = R(x \otimes I) = \begin{bmatrix} c - x_1(c-a) & d - x_1(d-b) \\ 1-c+x_1(c-a) & 1-d+x_1(d-b) \end{bmatrix}$
$\pi\left(\begin{bmatrix} p & 1-q \\ 1-p & q \end{bmatrix}\right)_1 = \frac{1-q}{2-p-q}$
Benson, Gleich, Lim arXiv:2016
21. The one-dimensional ODE has a really simple structure
[Figure: a plot of dx1/dt versus x1 with stable fixed points near the ends and an unstable fixed point between them.]
In general, dx1/dt(0) ≥ 0 and dx1/dt(1) ≤ 0, so there must be a stable point by continuity.
Benson, Gleich, Lim arXiv:2016
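A small sketch, assuming numpy and arbitrary probabilities a, b, c, d, that evaluates dx1/dt from the formulas on slide 20 and exhibits the sign pattern above:

```python
# dx1/dt = pi(M(x))_1 - x1, with p = M(x)[0,0] and q = M(x)[1,1].
import numpy as np

a, b, c, d = 0.9, 0.1, 0.3, 0.6

def dx1dt(x1):
    p = c - x1 * (c - a)            # M(x)[0, 0]
    q = 1 - (d - x1 * (d - b))      # M(x)[1, 1]
    return (1 - q) / (2 - p - q) - x1

grid = np.linspace(0, 1, 11)
print(np.round([dx1dt(t) for t in grid], 3))   # sign change locates a fixed point
print("dx1/dt(0) =", dx1dt(0.0), " dx1/dt(1) =", dx1dt(1.0))
```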
22. With multiple states, the situation is more complicated
If $P$ is irreducible, there always exists a fixed point of the algebraic equation $Px^2 = x$, by Li and Ng 2013 using Brouwer's theorem.
State-of-the-art computation:
• Power method [Li and Ng]; more analysis in [Chu & Wu; Gleich, Lim, Yu] and more today.
• Shifted iteration, Newton iteration [Gleich, Lim, Yu].
New idea:
• Integrate the ODE $\frac{dx}{dt} = \pi[M(x)] - x$ with $M(c) = \sum_k P(:,:,k)\,c_k$.
Benson, Gleich, Lim arXiv:2016
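A sketch of the "integrate the ODE" idea, assuming numpy and scipy (not the talk's Matlab ode45 setup):

```python
# Follow dx/dt = pi[M(x)] - x, where pi extracts the stationary
# distribution of the column-stochastic matrix M(x) = sum_k P(:, :, k) x_k.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n = 4
P = rng.random((n, n, n)); P /= P.sum(axis=0)   # column-stochastic tensor

def stat_dist(M):
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def rhs(t, x):
    return stat_dist(np.einsum('ijk,k->ij', P, x)) - x

sol = solve_ivp(rhs, (0.0, 100.0), np.ones(n) / n, rtol=1e-8, atol=1e-10)
x = sol.y[:, -1]
print("residual:", np.linalg.norm(np.einsum('ijk,j,k->i', P, x, x) - x))
```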
23. Spacey random surfers are a refined model with some structure
Akin to the PageRank modification of a Markov chain:
1. With probability α, follow the spacey random walk.
2. With probability 1−α, teleport based on a distribution v.
The solution of $x = \alpha P x^2 + (1-\alpha)v$ is unique if α < 0.5. [Gleich, Lim, Yu, SIMAX 2015]
THEOREM (Benson, Gleich, Lim): The spacey random surfer model always has a stationary dist. if α < 0.5. In other words, the ODE $\frac{dx}{dt} = (1-\alpha)[I - \alpha R(x \otimes I)]^{-1} v - x$ always converges to a stable point. [Benson, Gleich, Lim, arXiv:2016]
Yongyang Yu, Purdue
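A sketch, assuming numpy, of the simple fixed-point iteration for the spacey random surfer equation; α < 0.5 is the regime where the solution is unique:

```python
# Iterate x <- alpha * P x^2 + (1 - alpha) * v until the change is tiny.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 5, 0.45
P = rng.random((n, n, n)); P /= P.sum(axis=0)   # column-stochastic tensor
v = np.ones(n) / n                               # teleportation distribution

x = v.copy()
for _ in range(1000):
    x_new = alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v
    if np.linalg.norm(x_new - x, 1) < 1e-13:
        break
    x = x_new
print(x, x.sum())                                # stationary dist.; sums to 1
```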
24. Some nice open problems in this model
• For all the problems we have, Matlab's ode45 has never failed to converge to an eigenvector. (Even when all other algorithms will not converge.)
• Can we show that if the power method converges to a fixed point, then the ODE converges? (The converse is false.)
• There is also a family of models (e.g., pick the "second" state based on history instead of the "third"); how can we use this fact?
25. Here's what we are using spacey random walks to do!
1. Model the behavior of taxicabs in a large city. Involves fitting transition probabilities to data. Benson, Gleich, Lim arXiv:2016
2. Cluster higher-order data in a type of "generalized" spectral clustering. Involves a useful asymptotic property of spacey random walks. Benson, Gleich, Leskovec SDM 2015; Wu, Benson, Gleich arXiv:2016
26. Taxicabs are a plausible spacey random walk model
Model people by locations. Example location trajectories:
1,2,2,1,5,4,4,...
1,2,3,2,2,5,5,...
2,2,3,3,3,3,2,...
5,4,5,5,3,3,1,...
1. A passenger with location k is drawn at random.
2. The taxi picks up the passenger at location j.
3. The taxi drives the passenger to location i with probability P(i,j,k).
Approximating locations by history gives a spacey random walk.
(Beijing taxi image from Yu Zheng, Urban Computing, Microsoft Asia; image from nyc.gov)
Benson, Gleich, Lim arXiv:2016
27. NYC taxi data support the spacey random walk hypothesis
One year of 1000 taxi trajectories in NYC. States are neighborhoods in Manhattan. P(i,j,k) = probability of the taxi going from j to i when the passenger is from location k.
Evaluation (RMSE):
First-order Markov    0.846
Second-order Markov   0.835
Spacey                0.835
Benson, Gleich, Lim arXiv:2016
28. A property of spacey random walks makes the connection to clustering
Spacey random walks (with stationary distributions) are asymptotically Markov chains:
• once the occupation vector c converges, future transitions are according to the Markov chain M(c).
This makes a connection to clustering:
• spectral clustering methods can be derived by looking for partitions of reversible Markov chains (and there is research on non-reversible ones too).
We had an initial paper on using this idea for "motif-based clustering" of a graph, but there is a much better technique we have now.
Benson, Leskovec, Gleich. SDM 2015; Wu, Benson, Gleich. arXiv:2016
Jure Leskovec, Stanford
29. Given data bricks, we can cluster them using these ideas, with one more step
If the data is a symmetric cube, indexed by $[i_1, i_2, \dots, i_n]$ in all three modes, we can normalize it to get a transition tensor.
If the data is a brick, indexed by $[i_1, \dots, i_{n_1}] \times [j_1, \dots, j_{n_2}] \times [k_1, \dots, k_{n_3}]$, we symmetrize using Ragnarsson and Van Loan's idea, a generalization of
$A \to \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$
Wu, Benson, Gleich arXiv:2016
30. The clustering methodology
1. Symmetrize the brick (if necessary).
2. Normalize to be a column-stochastic tensor.
3. Estimate the stationary distribution of the spacey random walk (spacey random surfer), or a generalization (the super-spacey RW).
4. Form the asymptotic Markov model.
5. Bisect using eigenvectors or properties of that asymptotic Markov model; then recurse.
A schematic sketch of these steps follows.
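A schematic sketch of this pipeline, assuming numpy; the spectral bisection in step 5 here is a crude stand-in of my own choosing, and the real method (super-spacey RW, sweep cuts, recursion) lives in the repositories on the final slide:

```python
# One level of the pipeline: normalize, find the surfer's stationary
# distribution, form the asymptotic chain M(x), and split on an eigenvector.
import numpy as np

def cluster_once(T, alpha=0.45):
    n = T.shape[0]
    P = T / np.maximum(T.sum(axis=0), 1e-12)        # 2. column-stochastic tensor
    v = np.ones(n) / n
    x = v.copy()
    for _ in range(2000):                            # 3. stationary distribution
        x = alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v
    M = np.einsum('ijk,k->ij', P, x)                 # 4. asymptotic Markov model
    vals, vecs = np.linalg.eigh(M + M.T)             # 5. crude spectral bisection
    f = vecs[:, -2]                                  #    second-largest eigenvector
    return f > np.median(f)

T = np.random.rand(6, 6, 6)                          # stand-in data brick
print(cluster_once(T))
```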
31. Clustering airport-airport-airline networks
[Figure: the airport-airport-airline network, unclustered (no structure apparent) and clustered (diagonal structure evident).]
Name             Airports  Airlines  Notes
World Hubs       250       77        Beijing, JFK
Europe           184       32        Europe, Morocco
United States    137       9         U.S. and Cancún
China/Taiwan     170       33        China, Taiwan, Thailand
Oceania/SE Asia  302       52        Canadian airlines too
Mexico/Americas  399       68
32. Clusters in symmetrized three-gram and four-gram data
Data: 3- and 4-gram data from COCA (ngrams.info).
"Best clusters": pronouns & articles (the, we, he, ...); prepositions & linking verbs (in, of, as, to, ...).
Fun 3-gram clusters:
{cheese, cream, sour, low-fat, frosting, nonfat, fat-free}
{bag, plastic, garbage, grocery, trash, freezer}
{church, bishop, catholic, priest, greek, orthodox, methodist, roman, priests, episcopal, churches, bishops}
Fun 4-gram clusters:
{german, chancellor, angela, merkel, gerhard, schroeder, helmut, kohl}
33. Clusters in 3-gram Chinese text
社会 – society
经济 – economy
发展 – develop
主义 – "ism"
国家 – nation
政府 – government
We also get stop words in the Chinese text (highly occurring words). But then we also get some strange words. Reason: Google's Chinese corpus has a bias in its books.
34. One more problem
Previous work from the PI tackled network alignment with matrix methods for edge overlap; this proposal is for matching triangles using tensor methods.
[Figure: edge overlap matches edges (i,i') and (j,j') across networks A and B through the alignment graph L; triangle overlap matches a triangle i, j, k in A to a triangle i', j', k' in B.]
If $x_i$, $x_j$, and $x_k$ are indicators associated with the edges $(i,i')$, $(j,j')$, and $(k,k')$, then $\sum_{i \in L}\sum_{j \in L}\sum_{k \in L} x_i x_j x_k T_{i,j,k}$ is the triangle overlap term.
Triangular Alignment (TAME): A Tensor-based Approach for Higher-order Network Alignment. Joint with Shahin Mohammadi, Ananth Grama, and Tamara Kolda. http://arxiv.org/abs/1510.06482
Edge version: $\max\ x^T(A \otimes B)x$ s.t. $\lVert x \rVert = 1$, where A, B are edge adjacency matrices.
Triangle version: $\max\ (A \otimes B)x^3$ s.t. $\lVert x \rVert = 1$, where A, B are triangle hypergraph adjacencies.
"Solved" with x of dimension 86 million; $A \otimes B$ has 5 trillion non-zeros.
35. www.cs.purdue.edu/homes/dgleich
Summary
Spacey random walks are a new type of stochastic process that provides a direct interpretation of tensor eigenvectors of higher-order Markov chain probability tables.
We are excited!
• Many potential new applications of the spacey random walk process.
• Many open theoretical questions for us (and others) to follow up on.
Code
https://github.com/dgleich/mlpagerank
https://github.com/arbenson/tensor-sc
https://github.com/arbenson/spacey-random-walks
https://github.com/wutao27/GtensorSC
Papers
Gleich, Lim, Yu. Multilinear PageRank. SIMAX 2015.
Benson, Gleich, Leskovec. Tensor spectral clustering. SDM 2015.
Benson, Gleich, Lim. Spacey random walks. arXiv:1602.02102.
Wu, Benson, Gleich. Tensor spectral co-clustering. arXiv:1603.00395.