Spacey random walks and higher-order data analysis

My talk at TMA 2016 (The workshop on Tensors, Matrices, and their Applications) on the relationship between a spacey random walk process and tensor eigenvectors


1. Spacey random walks for higher-order data analysis. David F. Gleich, Purdue University. May 20, 2016. Joint work with Austin Benson, Lek-Heng Lim, and Tao Wu; supported by NSF CAREER CCF-1149756, IIS-1422918, and DARPA SIMPLEX. Papers: arXiv:1602.02102, arXiv:1603.00395. TMA 2016 · David Gleich · Purdue.
2. Markov chains, matrices, and eigenvectors have a long relationship.
Kemeny and Snell, 1976: "In the Land of Oz they never have two nice days in a row. If they have a nice day, they are just as likely to have snow as rain the next day. If they have snow or rain, they have an even chance of having the same the next day. If there is a change from snow or rain, only half of the time is this change to a nice day."
We form a three-state Markov chain with states R, N, and S for rain, nice, and snow. The transition matrix (column-stochastic in this talk) is

         R    N    S
    R   1/2  1/2  1/4
    N   1/4   0   1/4
    S   1/4  1/2  1/2

The stationary distribution x satisfies x_i = Σ_j P(i, j) x_j, with x_i ≥ 0 and Σ_i x_i = 1; here x = [2/5, 1/5, 2/5]. x is an eigenvector.
3. Markov chains, matrices, and eigenvectors have a long relationship.
1. Start with a Markov chain X_1, X_2, …, X_t, X_{t+1}, …
2. Inquire about the stationary distribution x_i = lim_{N→∞} (1/N) Σ_{t=1}^N Ind[X_t = i], the limiting fraction of time the chain spends in state i.
3. This question gives rise to an eigenvector problem on the transition matrix.
In general, X_t will be a stochastic process in this talk.
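The limiting-fraction-of-time view can be checked numerically for the Land of Oz chain on the previous slide. This is a minimal sketch (mine, not from the talk): simulate the column-stochastic P and compare empirical occupation frequencies with the stationary vector x = [2/5, 1/5, 2/5].

```python
import numpy as np

# Column-stochastic Land of Oz transition matrix (states R, N, S).
P = np.array([[1/2, 1/2, 1/4],
              [1/4, 0.0, 1/4],
              [1/4, 1/2, 1/2]])

rng = np.random.default_rng(0)
n_steps = 200_000
state = 0
counts = np.zeros(3)
for _ in range(n_steps):
    # Column j of P holds P(next = i | current = j).
    state = rng.choice(3, p=P[:, state])
    counts[state] += 1

empirical = counts / n_steps            # fraction of time spent in each state
exact = np.array([2/5, 1/5, 2/5])       # stationary eigenvector from the slide
```

The empirical frequencies settle near the eigenvector, which is exactly the equivalence the slide describes.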
4. Higher-order Markov chains are more useful for modern data problems. Higher-order means more history!
• Rosvall et al. (Nature Comm. 2014) found higher-order Markov chains were critical to finding multidisciplinary journals in citation data and patterns in air traffic networks.
• Chierichetti et al. (WWW 2012) found higher-order Markov models capture browsing behavior more accurately than first-order models. (And more!)
[Figure from Rosvall et al. 2014, "From pathway data to networks with and without memory": (a) itineraries weighted by passenger number; (b) aggregated bigrams for links between physical nodes; (c) aggregated trigrams for links between memory nodes; (d) network without memory; (e) network with memory.]
5. Stationary distributions of higher-order Markov chains are still matrix equations.
A second-order chain has P[X_{t+1} = i | X_t = j, X_{t-1} = k] = P(i, j, k), the probability of state i given history j, k. Convert it into a first-order Markov chain on pairs of states:
X_{i,j} = Σ_k P(i, j, k) X_{j,k},  with X_{i,j} ≥ 0 and Σ_{i,j} X_{i,j} = 1.
The marginal x_i = Σ_j X_{i,j} gives the stationary distribution. Example transition probabilities:

    Last state     |  1              |  2              |  3
    Current state  |  1    2    3    |  1    2    3    |  1    2    3
    P[next = 1]    |  0    0    0    |  1/4  0    0    |  1/4  0    3/4
    P[next = 2]    |  3/5  2/3  0    |  1/2  0    1/2  |  0    1/2  0
    P[next = 3]    |  2/5  1/3  1    |  1/4  1    1/2  |  3/4  1/2  1/4
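The pairs construction can be sketched directly (an illustrative sketch, not the authors' code): build the first-order chain that sends pair (j, k) to (i, j) with probability P(i, j, k) from the table above, find its stationary distribution by power iteration, and marginalize.

```python
import numpy as np

# P[i, j, k] = Prob(next = i | current = j, last = k), from the slide's table.
P = np.zeros((3, 3, 3))
P[:, :, 0] = [[0, 0, 0], [3/5, 2/3, 0], [2/5, 1/3, 1]]       # last state = 1
P[:, :, 1] = [[1/4, 0, 0], [1/2, 0, 1/2], [1/4, 1, 1/2]]     # last state = 2
P[:, :, 2] = [[1/4, 0, 3/4], [0, 1/2, 0], [3/4, 1/2, 1/4]]   # last state = 3

n = 3
# First-order chain on pairs: state (j, k) moves to (i, j) with prob. P[i, j, k].
T = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[i * n + j, j * n + k] = P[i, j, k]

X = np.full(n * n, 1 / (n * n))
for _ in range(5000):               # power iteration for the stationary pair dist.
    X = T @ X
x = X.reshape(n, n).sum(axis=1)     # marginal: x_i = sum_j X_{i,j}
```

The resulting marginal x is the stationary distribution of the second-order chain.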
6. Stationary distributions of higher-order Markov chains are still matrix equations.
The implicit Markov chain runs on pair states (1,1), (2,1), (3,1), (1,2), (2,2), (3,2), …: the pair (j, k) moves to (i, j) with probability P(i, j, k), where P[X_{t+1} = i | X_t = j, X_{t-1} = k] = P(i, j, k). [Slide shows the pair-state transition diagram with sample trajectories; the transition table is the same as on the previous slide.]
7. Hypermatrices, tensors, and tensor eigenvectors have been used too.
A tensor A here is an n × n × n array. A tensor z-eigenvector satisfies
Σ_{j,k} A(i, j, k) x_j x_k = λ x_i,  i.e.,  A x² = λ x.
Z-eigenvectors were proposed by Lim (2005) and Qi (2005). There are many references to using tensors for data analysis (1970+).
• Anandkumar et al. 2014: tensor eigenvector decompositions are optimal for recovering latent variable models based on higher-order moments.
8. But there were few results connecting hypermatrices, tensors, and higher-order Markov chains.
9. Li and Ng proposed a link between tensors and higher-order Markov chains (Li and Ng 2014).
1. Start with a higher-order Markov chain: X_{i,j} = Σ_k P(i, j, k) X_{j,k}, X_{i,j} ≥ 0, Σ_{i,j} X_{i,j} = 1.
2. Look at the stationary distribution.
3. Assume/approximate it as rank 1: X_{i,j} = x_i x_j.
4. … and we have a tensor eigenvector: x_i = Σ_{j,k} P(i, j, k) x_j x_k.
10. Li and Ng proposed an algebraic link between tensors and higher-order Markov chains.
The Li and Ng stationary distribution x_i = Σ_{j,k} P(i, j, k) x_j x_k, i.e., Px² = x (Li and Ng 2014):
• is a tensor z-eigenvector;
• is non-negative and sums to one;
• can sometimes be computed [Li and Ng 14; Chu and Wu 14; Gleich, Lim, Yu 15];
• may or may not be unique;
• almost always exists.
Our question: is there a stochastic process underlying this tensor eigenvector?
11. Our question: is there a stochastic process underlying this tensor eigenvector?
• Intro: Markov chain → matrix equation. X_1, X_2, … → Px = x.
• Li & Ng, Multilinear PageRank: Markov chain → matrix equation → approximation. X_1, X_2, … → Px² = x.
• Desired: stochastic process → approximate equations. X_1, X_2, … → "PX = X" → Px² = x.
12. The spacey random walk.
Consider a higher-order Markov chain, P[X_{t+1} | history] = P[X_{t+1} = i | X_t = j, X_{t-1} = k]. If we were perfect, we'd figure out the stationary distribution of that. But we are spacey!
• On arriving at state j, we promptly "space out" and forget we came from k.
• But we still believe we are "higher-order."
• So we invent a state k by drawing a random state from our history.
("Spacey" = 走神, "to space out," according to my students.) Benson, Gleich, Lim, arXiv 2016.
13. The spacey random walk.
Higher-order Markov: P[X_{t+1} = i | X_t = j, X_{t-1} = k] = P(i, j, k).
Spacey random walk: P[X_{t+1} = i | X_t = j, Y_t = g] = P(i, j, g), where Y_t is drawn from the walk's own history.
Key insight: the limiting distributions of this process are tensor eigenvectors. [Slide shows a sample trajectory with the states X_{t-1}, X_t and the invented state Y_t.] Benson, Gleich, Lim, arXiv 2016.
14. The spacey random walk process.
Let C_k(t) = 1 + Σ_{s=1}^t Ind{X_s = k} count how often we've visited state k in the past, and let F_t be the σ-algebra generated by the history {X_s : 0 ≤ s ≤ t}. Then
P(X_{t+1} = i | F_t) = Σ_k P(i, X_t, k) · C_k(t) / (t + n).
This is a reinforced stochastic process, or a (generalized) vertex-reinforced random walk: Diaconis; Pemantle, 1992; Benaïm, 1997; Pemantle, 2007. Benson, Gleich, Lim, arXiv 2016.
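The process itself is easy to simulate. The following sketch (my own, under the definitions above) draws the invented state Y_t from the occupation counts C_k(t) and then transitions by P(·, X_t, Y_t), using the talk's 3 × 3 × 3 example tensor.

```python
import numpy as np

P = np.zeros((3, 3, 3))
P[:, :, 0] = [[0, 0, 0], [3/5, 2/3, 0], [2/5, 1/3, 1]]
P[:, :, 1] = [[1/4, 0, 0], [1/2, 0, 1/2], [1/4, 1, 1/2]]
P[:, :, 2] = [[1/4, 0, 3/4], [0, 1/2, 0], [3/4, 1/2, 1/4]]

rng = np.random.default_rng(1)
n = 3
counts = np.ones(n)       # C_k(t) = 1 + (visits to k), as on the slide
state = 0
visits = np.zeros(n)
steps = 50_000
for _ in range(steps):
    # "Space out": invent the last state Y_t from the occupation history.
    y = rng.choice(n, p=counts / counts.sum())
    # Transition according to P(:, X_t, Y_t).
    state = rng.choice(n, p=P[:, state, y])
    counts[state] += 1
    visits[state] += 1

c = visits / steps        # empirical occupation vector
```

When the process has a limiting distribution, this occupation vector c is what converges to a tensor eigenvector.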
15. Generalized vertex-reinforced random walks (VRRW).
A vertex-reinforced random walk at time t transitions according to a Markov matrix M determined by the observed frequencies of visiting each state:
P(X_{t+1} = i | F_t) = [M(t)]_{i, X_t} = [M(c(t))]_{i, X_t},
where c(t) is the vector of observed visit frequencies (how often we've been where) and M(c) determines where we go next. The map c ↦ M(c), from the simplex of probability distributions to Markov chains, is key to VRRWs. Benaïm, 1997.
16. Stationary distributions of VRRWs correspond to ODEs.
THEOREM [Benaïm, 1997], paraphrased: the sequence of empirical observation probabilities c(t) is an asymptotic pseudo-trajectory for the dynamical system
dx/dt = π[M(x)] − x,
where π(M(x)) is the map to the stationary distribution of M(x). Thus convergence of the ODE to a fixed point is equivalent to stationary distributions of the VRRW.
• M must always have a unique stationary distribution!
• The map to M must be suitably continuous.
• Asymptotic pseudo-trajectories satisfy lim_{t→∞} ‖c(t + T) − x(t + T)‖ = 0, where x solves the ODE with initial condition x(t) = c(t).
17. The Markov matrix for spacey random walks.
The transition matrix given the occupation vector is M(c) = Σ_k P(:, :, k) c_k — the transition probability associated with guessing the last state based on history — and the dynamics are dx/dt = π[M(x)] − x.
A necessary condition for a stationary distribution (otherwise π makes no sense):
Property B. Let P be an order-m, n-dimensional probability table. Then P has property B if there is a unique stationary distribution associated with all stochastic combinations of the last m − 2 modes. That is, M = Σ_{k,ℓ,…} P(:, :, k, ℓ, …) α_{k,ℓ,…} defines a Markov chain with a unique Perron root whenever the α's are positive and sum to one. Benson, Gleich, Lim, arXiv 2016.
18. Stationary points of the ODE for the spacey random walk are tensor eigenvectors.
With M(c) = Σ_k P(:, :, k) c_k and dx/dt = π[M(x)] − x:
dx/dt = 0 ⟺ π(M(x)) = x ⟺ M(x) x = x ⟺ Σ_{j,k} P(i, j, k) x_j x_k = x_i.
But not all tensor eigenvectors are stationary points! Benson, Gleich, Lim, arXiv 2016.
19. Some results on spacey random walk models.
1. If you give it a Markov chain hidden in a hypermatrix, it works like a Markov chain.
2. All 2 × 2 × 2 × … × 2 problems have a stationary distribution (with a few corner cases).
3. This shows that an "exotic" class of Pólya urns always converges.
4. Spacey random surfer models have unique stationary distributions in some regimes.
5. Spacey random walks model Hardy–Weinberg laws in population genetics.
6. Spacey random walks are a plausible model of taxicab behavior.
Benson, Gleich, Lim, arXiv 2016.
20. All 2-state spacey random walk models have a stationary distribution.
Key idea: reduce to a one-dimensional ODE. If we unfold P(i, j, k) for a 2 × 2 × 2 tensor as

    R = [  a     b     c     d
          1−a   1−b   1−c   1−d ]

then

    M(x) = R(x ⊗ I) = [  c − x₁(c−a)      d − x₁(d−b)
                         1−c + x₁(c−a)    1−d + x₁(d−b) ]

and the stationary-distribution map satisfies π([ p  1−q ; 1−p  q ]) having first component (1−q)/(2−p−q), so the ODE reduces to one dimension in x₁. Benson, Gleich, Lim, arXiv 2016.
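The closed forms on this slide can be checked numerically. This sketch uses made-up values of a, b, c, d (hypothetical, purely for illustration) and verifies both the formula for M(x) and the stationary-distribution formula for a 2 × 2 column-stochastic matrix.

```python
import numpy as np

# Hypothetical probabilities for the 2x2x2 unfolding R = [P(:,:,1)  P(:,:,2)].
a, b, c, d = 0.3, 0.7, 0.6, 0.2
P1 = np.array([[a, b], [1 - a, 1 - b]])      # P(:, :, 1)
P2 = np.array([[c, d], [1 - c, 1 - d]])      # P(:, :, 2)

def M(x1):
    # M(x) = x1 * P(:,:,1) + (1 - x1) * P(:,:,2) = R (x ⊗ I)
    return x1 * P1 + (1 - x1) * P2

def pi_first(p, q):
    # First component of the stationary dist. of [[p, 1-q], [1-p, q]].
    return (1 - q) / (2 - p - q)

x1 = 0.4
Mx = M(x1)
# The reduced one-dimensional ODE's right-hand side at x1:
dx1 = pi_first(Mx[0, 0], Mx[1, 1]) - x1
```

Both identities hold term by term: M(x) matches the slide's unfolded expression, and π returns a fixed point of M(x).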
21. The one-dimensional ODE has a really simple structure. [Slide plots dx₁/dt against x₁, with stable and unstable crossings.] In general, dx₁/dt(0) ≥ 0 and dx₁/dt(1) ≤ 0, so by continuity there must be a stable point. Benson, Gleich, Lim, arXiv 2016.
22. With multiple states, the situation is more complicated.
If P is irreducible, a fixed point of the algebraic equation Px² = x always exists, by Li and Ng 2013 using Brouwer's theorem.
State of the art computation:
• Power method [Li and Ng]; more analysis in [Chu & Wu; Gleich, Lim, Yu] and more today.
• Shifted iteration, Newton iteration [Gleich, Lim, Yu].
New idea: integrate the ODE dx/dt = π[M(x)] − x with M(c) = Σ_k P(:, :, k) c_k. Benson, Gleich, Lim, arXiv 2016.
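The "integrate the ODE" idea can be sketched with a crude forward-Euler stand-in for ode45 (my sketch, not the authors' implementation). At each step, π is evaluated by an eigensolve on the column-stochastic matrix M(x); the Euler update is a convex combination, so the iterate stays on the probability simplex.

```python
import numpy as np

P = np.zeros((3, 3, 3))
P[:, :, 0] = [[0, 0, 0], [3/5, 2/3, 0], [2/5, 1/3, 1]]
P[:, :, 1] = [[1/4, 0, 0], [1/2, 0, 1/2], [1/4, 1, 1/2]]
P[:, :, 2] = [[1/4, 0, 3/4], [0, 1/2, 0], [3/4, 1/2, 1/4]]

def M_of(P, x):
    # M(x) = sum_k P(:, :, k) x_k
    return np.einsum('ijk,k->ij', P, x)

def pi_of(M):
    # Stationary distribution of a column-stochastic matrix (Perron eigenvector).
    vals, vecs = np.linalg.eig(M)
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()

x = np.full(3, 1/3)
h = 0.05
for _ in range(2000):
    # Euler step for dx/dt = pi[M(x)] - x; a convex combination, so x stays a
    # probability vector.
    x = (1 - h) * x + h * pi_of(M_of(P, x))

# Residual of the tensor eigenvector equation P x^2 = x at the final iterate.
residual = np.linalg.norm(np.einsum('ijk,j,k->i', P, x, x) - x)
```

A production version would use an adaptive integrator such as ode45, as the slides do; this fixed-step sketch only illustrates the dynamics.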
23. Spacey random surfers are a refined model with some structure.
Akin to the PageRank modification of a Markov chain:
1. With probability α, follow the spacey random walk.
2. With probability 1 − α, teleport based on a distribution v.
The solution of x = α P x² + (1 − α) v is unique if α < 0.5 (Gleich, Lim, Yu, SIMAX 2015).
THEOREM (Benson, Gleich, Lim): the spacey random surfer model always has a stationary distribution if α < 0.5. In other words, the ODE
dx/dt = (1 − α)[I − α R(x ⊗ I)]⁻¹ v − x
always converges to a stable point. Benson, Gleich, Lim, arXiv 2016. (With Yongyang Yu, Purdue.)
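For α < 0.5 the fixed-point iteration x ← αPx² + (1 − α)v is a contraction on the simplex (the quadratic term is 2α-Lipschitz in the 1-norm there), so a simple sketch converges quickly; here with α = 0.45 and uniform teleportation v, on the talk's example tensor.

```python
import numpy as np

P = np.zeros((3, 3, 3))
P[:, :, 0] = [[0, 0, 0], [3/5, 2/3, 0], [2/5, 1/3, 1]]
P[:, :, 1] = [[1/4, 0, 0], [1/2, 0, 1/2], [1/4, 1, 1/2]]
P[:, :, 2] = [[1/4, 0, 3/4], [0, 1/2, 0], [3/4, 1/2, 1/4]]

alpha = 0.45                  # alpha < 0.5, so the solution is unique
v = np.full(3, 1/3)           # teleportation distribution

x = v.copy()
for _ in range(500):
    # Spacey random surfer fixed-point iteration: x <- alpha P x^2 + (1-alpha) v
    x = alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v

residual = np.linalg.norm(
    alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v - x)
```

Each iterate sums to one by construction, and the residual contracts by a factor of about 2α per step.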
24. Some nice open problems in this model.
• For all the problems we have, Matlab's ode45 has never failed to converge to an eigenvector (even when all other algorithms will not converge).
• Can we show that if the power method converges to a fixed point, then the ODE converges? (The converse is false.)
• There is also a family of models (e.g., pick the "second" state based on history instead of the "third"); how can we use this fact?
25. Here's what we are using spacey random walks to do:
1. Model the behavior of taxicabs in a large city. Involves fitting transition probabilities to data. Benson, Gleich, Lim, arXiv 2016.
2. Cluster higher-order data in a type of "generalized" spectral clustering. Involves a useful asymptotic property of spacey random walks. Benson, Gleich, Leskovec, SDM 2016; Wu, Benson, Gleich, arXiv 2016.
26. Taxicabs are a plausible spacey random walk model.
Model people by locations. Example passenger trajectories: 1,2,2,1,5,4,4,…; 1,2,3,2,2,5,5,…; 2,2,3,3,3,3,2,…; 5,4,5,5,3,3,1,…
1. A passenger with location k is drawn at random.
2. The taxi picks up the passenger at location j.
3. The taxi drives the passenger to location i with probability P(i, j, k).
Approximating passenger locations by the taxi's own history gives a spacey random walk. (Beijing taxi image from Yu Zheng, Urban Computing, Microsoft Asia; image from nyc.gov.) Benson, Gleich, Lim, arXiv 2016.
27. NYC taxi data support the spacey random walk hypothesis.
One year of 1000 taxi trajectories in NYC; states are neighborhoods in Manhattan. P(i, j, k) = probability of the taxi going from j to i when the passenger is from location k.
Evaluation (RMSE):

    First-order Markov    0.846
    Second-order Markov   0.835
    Spacey                0.835

Benson, Gleich, Lim, arXiv 2016.
28. A property of spacey random walks makes the connection to clustering.
Spacey random walks (with stationary distributions) are asymptotically Markov chains: once the occupation vector c converges, future transitions follow the Markov chain M(c).
This makes a connection to clustering: spectral clustering methods can be derived by looking for partitions of reversible Markov chains (and there is research on non-reversible ones too).
We had an initial paper on using this idea for "motif-based clustering" of a graph, but we now have a much better technique. Benson, Leskovec, Gleich, SDM 2015; Wu, Benson, Gleich, arXiv 2016. (With Jure Leskovec, Stanford.)
29. Given data bricks, we can cluster them using these ideas, with one more step.
If the data is a symmetric cube [i₁, i₂, …, i_n]³, we can normalize it to get a transition tensor. If the data is a brick [i₁, …, i_{n₁}] × [j₁, …, j_{n₂}] × [k₁, …, k_{n₃}], we symmetrize it using Ragnarsson and Van Loan's idea, a generalization of A → [ 0 A ; Aᵀ 0 ]. Wu, Benson, Gleich, arXiv 2016.
30. The clustering methodology.
1. Symmetrize the brick (if necessary).
2. Normalize to be a column-stochastic tensor.
3. Estimate the stationary distribution of the spacey random walk (spacey random surfer), or a generalization (super-spacey RW).
4. Form the asymptotic Markov model.
5. Bisect using eigenvectors or properties of that asymptotic Markov model; then recurse.
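Steps 1–2 of the methodology can be sketched on a random cube (an illustrative sketch with synthetic data; the actual pipeline uses the Ragnarsson–Van Loan construction to symmetrize bricks). Symmetrizing averages over all index permutations; normalizing makes each column fiber a probability distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
T = rng.random((n, n, n))          # synthetic nonnegative data cube

# Step 1 (symmetrize): average over all permutations of the three indices.
T = sum(np.transpose(T, p) for p in
        [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / 6

# Step 2 (normalize): make each column fiber T[:, j, k] sum to one, giving a
# column-stochastic transition tensor for the spacey random walk.
P = T / T.sum(axis=0)
```

From here, step 3 would estimate the stationary distribution of the spacey random walk on P, exactly as in the earlier slides.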
31. Clustering airport–airport–airline networks.
Unclustered, the airport–airport–airline network shows no apparent structure; clustered, diagonal structure is evident.

    Name              Airports  Airlines  Notes
    World hubs        250       77        Beijing, JFK
    Europe            184       32        Europe, Morocco
    United States     137       9         U.S. and Cancún
    China/Taiwan      170       33        China, Taiwan, Thailand
    Oceania/SE Asia   302       52        Canadian airlines too
    Mexico/Americas   399       68
32. Clusters in symmetrized three-gram and four-gram data.
Data: 3- and 4-gram data from COCA (ngrams.info). The "best clusters" are pronouns & articles (the, we, he, …) and prepositions & linking verbs (in, of, as, to, …).
Fun 3-gram clusters:
• {cheese, cream, sour, low-fat, frosting, nonfat, fat-free}
• {bag, plastic, garbage, grocery, trash, freezer}
• {church, bishop, catholic, priest, greek, orthodox, methodist, roman, priests, episcopal, churches, bishops}
Fun 4-gram clusters:
• {german, chancellor, angela, merkel, gerhard, schroeder, helmut, kohl}
33. Clusters in 3-gram Chinese text.
Example clusters include words around 社会 (society) — economy, develop, "-ism" — as well as 国家 (nation) and 政府 (government).
We also get stop words in the Chinese text (highly occurring words). But then we also get some strange words. Reason: Google's Chinese corpus has a bias in its books.
34. One more problem.
Triangular Alignment (TAME): a tensor-based approach for higher-order network alignment. Joint with Shahin Mohammadi, Ananth Grama, and Tamara Kolda. http://arxiv.org/abs/1510.06482
Previous work from the PI tackled network alignment with matrix methods for edge overlap: max xᵀ(A ⊗ B)x s.t. ‖x‖ = 1, where A, B are edge adjacency matrices. This proposal matches triangles using tensor methods: max (A ⊗ B)x³ s.t. ‖x‖ = 1, where A, B are triangle hypergraph adjacencies. If x_i, x_j, x_k are indicators associated with the edges (i, i′), (j, j′), (k, k′), then Σ_{i,j,k ∈ L} x_i x_j x_k T_{i,j,k} is the triangle-overlap term, suggesting a heuristic based on the tensor T. "Solved" with x of dimension 86 million; A ⊗ B has 5 trillion non-zeros.
35. Summary. www.cs.purdue.edu/homes/dgleich
Spacey random walks are a new type of stochastic process that provides a direct interpretation of tensor eigenvectors of higher-order Markov chain probability tables. We are excited!
• Many potential new applications of the spacey random walk process.
• Many open theoretical questions for us (and others) to follow up on.
Code:
• https://github.com/dgleich/mlpagerank
• https://github.com/arbenson/tensor-sc
• https://github.com/arbenson/spacey-random-walks
• https://github.com/wutao27/GtensorSC
Papers:
• Gleich, Lim, Yu. Multilinear PageRank. SIMAX 2015.
• Benson, Gleich, Leskovec. Tensor spectral clustering. SDM 2015.
• Benson, Gleich, Lim. Spacey random walks. arXiv:1602.02102.
• Wu, Benson, Gleich. Tensor spectral co-clustering. arXiv:1603.00395.