Iterative methods with special structures

In a talk at the Institute for Physics and Computational Mathematics in Beijing, I discuss a few different types of structure in iterative methods.


  1. Iterative methods with special structures. David F. Gleich, Purdue University.
  2. Two projects. (1) Circulant structure in tensors and a linear system solver. (2) Localized structure in the matrix exponential and relaxation methods.
  3. Circulant algebra. Kilmer, Martin, and Perrone (2008) presented a circulant algebra: a set of operations that generalizes matrix algebra to three-way data, and provided an SVD. The essence of this approach amounts to viewing three-dimensional objects as two-dimensional arrays (i.e., matrices) of one-dimensional arrays (i.e., vectors). Braman (2010) developed spectral and other decompositions. We have extended this algebra with the ingredients required for iterative methods such as the power method and the Arnoldi method, and have characterized the behavior of these algorithms. Joint work with Chen Greif and James Varah at UBC.
  4. Three-way arrays. Given an m × n × k table of data, we view this data as an m × n matrix where each "scalar" is a vector of length k: $A \in K_k^{m \times n}$. We denote the space of length-k scalars by $K_k$. These scalars interact like circulant matrices.
  5. The circ operation. Circulant matrices are a commutative class that is closed under the standard matrix operations; we'll see more of their properties shortly. We denote the space of length-k scalars by $K_k$. For $\alpha = \{\alpha_1\ \alpha_2\ \ldots\ \alpha_k\} \in K_k$,

$$\operatorname{circ}(\alpha) \equiv \begin{bmatrix} \alpha_1 & \alpha_k & \cdots & \alpha_2 \\ \alpha_2 & \alpha_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \alpha_k \\ \alpha_k & \cdots & \alpha_2 & \alpha_1 \end{bmatrix}.$$

Addition and multiplication act through this representation: $\alpha + \beta \leftrightarrow \operatorname{circ}(\alpha) + \operatorname{circ}(\beta)$ and $\alpha\beta \leftrightarrow \operatorname{circ}(\alpha)\operatorname{circ}(\beta)$; the additive and multiplicative identities are $0 = \{0\ 0\ \ldots\ 0\}$ and $1 = \{1\ 0\ \ldots\ 0\}$.
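To make the scalar arithmetic concrete, here is a minimal MATLAB sketch (ours, not from the slides) that builds circ(α) for a length-3 scalar and checks that the scalar product commutes:

    % Build circ(a) for a scalar a in K_3; the first column is a itself.
    a = [2 3 1]';                 % alpha = {2 3 1}
    k = numel(a);
    C = zeros(k);
    for j = 1:k
        C(:,j) = circshift(a, j-1);   % each column is a cyclic shift
    end
    b = [3 1 1]';                 % beta = {3 1 1}
    D = zeros(k);
    for j = 1:k
        D(:,j) = circshift(b, j-1);
    end
    norm(C*D - D*C)               % 0: circulants, and hence scalars, commute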
  6. cft and icft; scalars to matrix-vector products. We define the "circulant Fourier transform" $\operatorname{cft}: K_k \to \mathbb{C}^{k \times k}$ and its inverse $\operatorname{icft}: \mathbb{C}^{k \times k} \to K_k$ as follows:

$$\operatorname{cft}(\alpha) \equiv \operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k] = F^* \operatorname{circ}(\alpha)\, F, \qquad \operatorname{icft}(\operatorname{diag}[\hat\alpha_1, \ldots, \hat\alpha_k]) \equiv \alpha \leftrightarrow F\, \operatorname{cft}(\alpha)\, F^*,$$

where the $\hat\alpha_j$ are the eigenvalues of $\operatorname{circ}(\alpha)$ as produced by the Fourier transform. These transformations satisfy $\operatorname{icft}(\operatorname{cft}(\alpha)) = \alpha$ and provide a convenient way of moving between operations in $K_k$ and the more familiar environment of diagonal matrices in $\mathbb{C}^{k \times k}$. More operations are simplified in Fourier space too:

$$\operatorname{abs}(\alpha) = \operatorname{icft}(\operatorname{diag}[|\hat\alpha_1|, \ldots, |\hat\alpha_k|]), \qquad \bar\alpha = \operatorname{icft}(\operatorname{diag}[\bar{\hat\alpha}_1, \ldots, \bar{\hat\alpha}_k]) = \operatorname{icft}(\operatorname{cft}(\alpha)^*), \qquad \operatorname{angle}(\alpha) = \operatorname{icft}(\operatorname{diag}[\hat\alpha_1/|\hat\alpha_1|, \ldots, \hat\alpha_k/|\hat\alpha_k|]).$$

The circ operation extends to matrices and matrix-vector products: for $A \in K_k^{m \times n}$ and $x \in K_k^n$,

$$Ax = \begin{bmatrix} \sum_{j=1}^n A_{1,j}\, x_j \\ \vdots \\ \sum_{j=1}^n A_{m,j}\, x_j \end{bmatrix} \leftrightarrow \operatorname{circ}(A)\,\operatorname{vec}(x), \qquad \operatorname{circ}(A) \equiv \begin{bmatrix} \operatorname{circ}(A_{1,1}) & \cdots & \operatorname{circ}(A_{1,n}) \\ \vdots & \ddots & \vdots \\ \operatorname{circ}(A_{m,1}) & \cdots & \operatorname{circ}(A_{m,n}) \end{bmatrix}.$$
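Because the eigenvalues of a circulant with first column α are given by the FFT of α, the cft and icft can be realized with fft/ifft. A small sketch under that assumption (our code, not the authors'):

    % cft/icft via the FFT: the eigenvalues of circ(a) are fft(a).
    a = [2 3 1]';
    ahat = fft(a);                       % cft(a) = diag(ahat)
    a_back = ifft(ahat);                 % icft(cft(a)) = a
    abs_a = ifft(abs(ahat));             % abs(a), computed in Fourier space
    angle_a = ifft(ahat ./ abs(ahat));   % angle(a)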
  7. The special structure. This circulant structure is our special structure for this first problem. We look at two types of iterative methods: (1) the power method and (2) the Arnoldi method.
  8. A perplexing result! Example: run the power method on

$$A = \begin{bmatrix} \{2\ 3\ 1\} & \{0\ 0\ 0\} \\ \{0\ 0\ 0\} & \{3\ 1\ 1\} \end{bmatrix}.$$

The result is $\lambda = (1/3)\{10\ 4\ 4\}$.
  9. Some understanding through decoupling.
  10. Some understanding through decoupling. Example: let

$$A = \begin{bmatrix} \{2\ 3\ 1\} & \{8\ {-2}\ 0\} \\ \{{-2}\ 0\ 2\} & \{3\ 1\ 1\} \end{bmatrix}.$$

The results of the circ and cft operations are

$$\operatorname{circ}(A) = \begin{bmatrix} 2 & 1 & 3 & 8 & 0 & -2 \\ 3 & 2 & 1 & -2 & 8 & 0 \\ 1 & 3 & 2 & 0 & -2 & 8 \\ -2 & 2 & 0 & 3 & 1 & 1 \\ 0 & -2 & 2 & 1 & 3 & 1 \\ 2 & 0 & -2 & 1 & 1 & 3 \end{bmatrix},$$

and transforming by the permuted Fourier transform, $(\Pi \otimes F)^* \operatorname{circ}(A)\,(\Pi \otimes F)$, gives a block-diagonal matrix; $\operatorname{cft}(A)$ collects the 2 × 2 blocks

$$\hat A_1 = \begin{bmatrix} 6 & 6 \\ 0 & 5 \end{bmatrix}, \qquad \hat A_2 = \begin{bmatrix} -\sqrt{3}\,i & 9 + \sqrt{3}\,i \\ -3 + \sqrt{3}\,i & 2 \end{bmatrix}, \qquad \hat A_3 = \begin{bmatrix} \sqrt{3}\,i & 9 - \sqrt{3}\,i \\ -3 - \sqrt{3}\,i & 2 \end{bmatrix}.$$
  11. Some understanding through decoupling. For the example

$$A = \begin{bmatrix} \{2\ 3\ 1\} & \{0\ 0\ 0\} \\ \{0\ 0\ 0\} & \{3\ 1\ 1\} \end{bmatrix},$$

the Fourier-space blocks are

$$\hat A_1 = \begin{bmatrix} 6 & 0 \\ 0 & 5 \end{bmatrix}, \qquad \hat A_2 = \begin{bmatrix} -\sqrt{3}\,i & 0 \\ 0 & 2 \end{bmatrix}, \qquad \hat A_3 = \begin{bmatrix} \sqrt{3}\,i & 0 \\ 0 & 2 \end{bmatrix},$$

and the eigenvalues are

$$\lambda_1 = \operatorname{icft}(\operatorname{diag}[6,\ 2,\ 2]) = (1/3)\{10\ 4\ 4\}, \qquad \lambda_2 = \operatorname{icft}(\operatorname{diag}[5,\ -\sqrt{3}\,i,\ \sqrt{3}\,i]) = (1/3)\{5\ 8\ 2\},$$
$$\lambda_3 = \operatorname{icft}(\operatorname{diag}[6,\ -\sqrt{3}\,i,\ \sqrt{3}\,i]) = \{2\ 3\ 1\}, \qquad \lambda_4 = \operatorname{icft}(\operatorname{diag}[5,\ 2,\ 2]) = \{3\ 1\ 1\}.$$

The corresponding eigenvectors are

$$x_1 = \begin{bmatrix} \{1/3\ 1/3\ 1/3\} \\ \{2/3\ {-1/3}\ {-1/3}\} \end{bmatrix}, \quad x_2 = \begin{bmatrix} \{2/3\ {-1/3}\ {-1/3}\} \\ \{1/3\ 1/3\ 1/3\} \end{bmatrix}, \quad x_3 = \begin{bmatrix} \{1\ 0\ 0\} \\ \{0\ 0\ 0\} \end{bmatrix}, \quad x_4 = \begin{bmatrix} \{0\ 0\ 0\} \\ \{1\ 0\ 0\} \end{bmatrix}.$$
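The perplexing result from slide 8 can be reproduced numerically from this decoupling; a sketch (our code), storing the matrix of $K_3$ scalars as a 2 × 2 × 3 array and using fft/ifft for the cft/icft:

    % Reproducing the power-method result via decoupling.
    A = zeros(2,2,3);                 % 2x2 matrix of K_3 scalars
    A(1,1,:) = [2 3 1];  A(2,2,:) = [3 1 1];
    Ahat = fft(A, [], 3);             % cft: one 2x2 complex block per frequency
    lamhat = zeros(1,3);
    for j = 1:3
        e = eig(Ahat(:,:,j));
        [~, i] = max(abs(e));         % the power method finds the dominant
        lamhat(j) = e(i);             % eigenvalue of each block independently
    end
    ifft(lamhat)                      % = [10 4 4]/3, matching lambda_1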
  12. Convergence of the power method is in terms of the individual blocks. Theorem: let $A \in K_k^{n \times n}$ have a canonical set of eigenvalues $\lambda_1, \ldots, \lambda_n$ where $|\lambda_1| > |\lambda_2|$; then the power method in the circulant algebra converges to an eigenvector $x_1$ with eigenvalue $\lambda_1$. Here we use the ordering $\alpha < \beta \leftrightarrow \operatorname{cft}(\alpha) < \operatorname{cft}(\beta)$ elementwise.

Definition. A canonical set of eigenvalues and eigenvectors is a set of minimum size, ordered such that $\operatorname{abs}(\lambda_1) \ge \operatorname{abs}(\lambda_2) \ge \cdots \ge \operatorname{abs}(\lambda_k)$, which contains the information to reproduce any eigenvalue or eigenvector of $A$. For the example above, other combinations of the block eigenvalues are also eigenvalues, e.g.

$$\lambda_5 = \operatorname{icft}(\operatorname{diag}[6,\ -\sqrt{3}\,i,\ 2]), \quad \lambda_6 = \operatorname{icft}(\operatorname{diag}[6,\ 2,\ \sqrt{3}\,i]), \quad \lambda_7 = \operatorname{icft}(\operatorname{diag}[5,\ -\sqrt{3}\,i,\ 2]), \quad \lambda_8 = \operatorname{icft}(\operatorname{diag}[5,\ 2,\ \sqrt{3}\,i]),$$

altogether more eigenvalues than the dimension of the matrix. In this case, the only canonical set is $\{(\lambda_1, x_1), (\lambda_2, x_2)\}$: we need two pairs, and we have $\operatorname{abs}(\lambda_1) \ge \operatorname{abs}(\lambda_2)$.
  13. The Arnoldi process. Let $A$ be an n × n matrix with real-valued entries. The Arnoldi method is a technique to build an orthogonal basis for the Krylov subspace $\mathcal{K}_t(A, v) = \operatorname{span}\{v, Av, \ldots, A^{t-1}v\}$, where $v$ is an initial vector. We obtain the decomposition $A Q_t = Q_{t+1} H_{t+1,t}$, where $Q_t$ is an n × t matrix and $H_{t+1,t}$ is a (t+1) × t upper Hessenberg matrix. Using our repertoire of operations, the Arnoldi method in the circulant algebra is equivalent to individual Arnoldi processes on each matrix $\hat A_j$, i.e., to a block Arnoldi process. Using the cft and icft operations, we produce an Arnoldi factorization $A\, Q_t = Q_{t+1}\, H_{t+1,t}$ in the circulant algebra.

Figure: the convergence behavior of the power method in the circulant algebra; the gray lines show the error in each eigenvalue in Fourier space, and the curves track the predictions made from the eigenvalues as discussed in the text. Figure: the convergence behavior of a GMRES procedure using the circulant Arnoldi process; the gray lines show the error in each Fourier component, and the red line shows the magnitude of the residual. We observe poor convergence in one component.
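For reference, a minimal sketch of the standard Arnoldi process that would run on each Fourier block $\hat A_j$ (our code; the circulant version applies this per frequency):

    function [Q, H] = arnoldi(A, v, t)
    % Build an orthonormal basis Q for the Krylov subspace K_t(A, v) with
    % A*Q(:,1:t) = Q(:,1:t+1)*H, where H is upper Hessenberg.
    n = numel(v);
    Q = zeros(n, t+1);  H = zeros(t+1, t);
    Q(:,1) = v / norm(v);
    for j = 1:t
        w = A*Q(:,j);
        for i = 1:j                      % modified Gram-Schmidt
            H(i,j) = Q(:,i)'*w;
            w = w - H(i,j)*Q(:,i);
        end
        H(j+1,j) = norm(w);
        Q(:,j+1) = w / H(j+1,j);         % assumes no breakdown
    end
    end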
  14. A number of interesting mathematical results come from this algebra. (1) A case study of how "decoupled" block iterations arise and are meaningful for an application. (2) It's a beautiful algebra: operations such as abs, conjugation, and angle are simple in Fourier space (slide 6), and proofs are simple too, e.g. $\operatorname{angle}(\alpha)\,\overline{\operatorname{angle}(\alpha)} = 1$. Live Matlab demo.
  15. Conclusion to the circulant algebra. Paper: The power and Arnoldi method in an algebra of circulants, Gleich, Greif, and Varah, Numerical Linear Algebra with Applications, 2013, available from https://www.cs.purdue.edu/homes/dgleich/. Code available from https://www.cs.purdue.edu/homes/dgleich/codes/camat
  16. Project 2: fast relaxation methods to estimate a column of the matrix exponential. With Kyle Kloster.
  17. Matrix exponentials. $\exp(A)$ is defined as $\sum_{k=0}^{\infty} \frac{1}{k!} A^k$, which always converges; here $A$ is n × n and real. It is the evolution operator for an ODE: $\frac{dx}{dt} = Ax(t) \iff x(t) = \exp(tA)\,x(0)$. The exponential is a special case of a function of a matrix $f(A)$; others are $f(x) = 1/x$, $f(x) = \sinh(x)$, and so on.
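A quick numerical sanity check of the series definition (our sketch; MATLAB's expm is the reference):

    A = randn(4);  A = A / norm(A);   % a small, well-scaled test matrix
    X = eye(4);  term = eye(4);
    for k = 1:20
        term = term * A / k;          % term = A^k / k!
        X = X + term;
    end
    norm(X - expm(A))                 % ~ machine precision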
  18. Matrix exponentials on large networks. If $A$ is the adjacency matrix, then $A^k$ counts the number of length-k paths between node pairs, so large entries of $\exp(A) = \sum_{k=0}^\infty \frac{1}{k!} A^k$ denote important nodes or edges [Estrada 2000; Farahat et al. 2002, 2006]; this is used for link prediction and centrality. If $P$ is a transition matrix (column stochastic), then $P^k$ gives the probability of a length-k walk between node pairs, and $\exp(P) = \sum_{k=0}^\infty \frac{1}{k!} P^k$ is used for link prediction, kernels, and clustering or community detection [Kondor & Lafferty 2002; Kunegis & Lommatzsch 2009; Chung 2007].
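As a toy illustration (ours, not from the slides), the diagonal of exp(A) gives Estrada-style subgraph centralities and a column gives scores relative to one node:

    A = [0 1 1 0;                    % adjacency matrix of a small 4-node graph
         1 0 1 0;
         1 1 0 1;
         0 0 1 0];
    E = expm(A);
    diag(E)                          % subgraph centrality of each node
    E(:,4)                           % scores relative to node 4 (a column of exp(A))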
  19. This talk: a column of the matrix exponential, $x = \exp(P)\, e_c$, where $x$ is the solution, $P$ the matrix, and $e_c$ the column indicator.
  20. This talk: a column of the matrix exponential, $x = \exp(P)\, e_c$, where the solution $x$ is localized and the matrix $P$ is large, sparse, and stochastic.
  21. Uniformly localized solutions in livejournal: $x = \exp(P)\, e_c$ with nnz(x) = 4,815,948. Figure: the magnitudes of the entries of x, and the 1-norm error as the largest non-zeros are retained; the error decays rapidly even though the solution is technically dense. Gleich & Kloster, arXiv:1310.3423.
  22. Our mission! Exploit the structure. Find the solution with work roughly proportional to the localization, not the matrix.
  23. Our algorithms for uniform localization (www.cs.purdue.edu/homes/dgleich/codes/nexpokit): gexpm, gexpmq, and expmimv. Figure: 1-norm error against the number of non-zeros used by each method. The analysis gives

$$\text{work} = O\!\left(\log(\tfrac{1}{\varepsilon})\,(\tfrac{1}{\varepsilon})^{3/2}\, d^2 (\log d)^2\right), \qquad \text{nnz} = O\!\left(\log(\tfrac{1}{\varepsilon})\,(\tfrac{1}{\varepsilon})^{3/2}\, d \log d\right).$$
  24. Matrix exponentials on large networks. Is a single column interesting? Yes! $\exp(P)\, e_c = \sum_{k=0}^\infty \frac{1}{k!} P^k e_c$ gives link-prediction scores for node c and a community relative to node c. But modern networks are large (~O(10^9) nodes), sparse (~O(10^11) edges), and constantly changing, so we'd like speed over accuracy.
  25. Cleve Moler and Charles Van Loan, "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later," SIAM Review, Vol. 45, No. 1, pp. 3–49, 2003.
  26. Our underlying method: direct expansion, $x = \exp(P)\, e_c \approx \sum_{k=0}^N \frac{1}{k!} P^k e_c = x_N$. A few matvecs, but quick loss of sparsity due to fill-in. This method is stable for stochastic P: none of the usual pitfalls (cancellation, unbounded norms, etc.).
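A direct sketch of this truncated expansion with sparse matvecs (our code; the toy P and column index c are our own illustration, not the experimental data):

    n = 100;  c = 1;                          % toy size and target column
    G = sprand(n, n, 0.05) + speye(n);        % random nonnegative sparse matrix
    P = G ./ sum(G, 1);                       % column-stochastic normalization
    v = sparse(c, 1, 1, n, 1);                % e_c
    x = v;
    for k = 1:11                              % N = 11 Taylor terms
        v = (P*v) / k;                        % v = P^k e_c / k!
        x = x + v;                            % accumulate x_N; v fills in as k grows
    end
    norm(x - expm(full(P)) * full(sparse(c,1,1,n,1)), 1)   % ~1e-9 truncation error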
  27. Our underlying method as a linear system. The direct expansion $x = \exp(P)\, e_c \approx \sum_{k=0}^N \frac{1}{k!} P^k e_c = x_N$ is equivalent to the block system

$$\begin{bmatrix} I & & & \\ -P/1 & I & & \\ & \ddots & \ddots & \\ & & -P/N & I \end{bmatrix} \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_N \end{bmatrix} = \begin{bmatrix} e_c \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad x_N = \sum_{i=0}^N v_i,$$

written compactly as $(I \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c$, where $S_N$ is the (N+1) × (N+1) matrix with $1/j$ in entry (j+1, j). Lemma: we approximate $x_N$ well if we approximate $v$ well.
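For small N one can form this Kronecker system explicitly and check it against the matvec expansion; a sketch reusing the toy P, c, and x from the previous sketch:

    N = 11;  n = size(P, 1);
    S = spdiags(1 ./ (1:N+1)', -1, N+1, N+1);   % subdiagonal: S(j+1, j) = 1/j
    M = speye((N+1)*n) - kron(S, P);            % I (x) I - S_N (x) P
    rhs = kron(sparse(1,1,1,N+1,1), sparse(c,1,1,n,1));  % e_1 (x) e_c
    v = M \ rhs;
    xN = sum(reshape(v, n, N+1), 2);            % x_N = sum_i v_i
    norm(xN - x, 1)                             % ~0: matches the truncated expansion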
  28. Our mission (2)! Approximately solve Ax = b when A and b are sparse and x is localized.
  29. Coordinate descent, Gauss-Southwell, Gauss-Seidel, relaxation, and "push" methods. Algebraically, for Ax = b:

$$r^{(k)} = b - A x^{(k)}, \qquad x^{(k+1)} = x^{(k)} + e_j\, (e_j^T r^{(k)}), \qquad r^{(k+1)} = r^{(k)} - r_j^{(k)}\, A e_j.$$

Procedurally:

    % Solve(A, b)
    x = sparse(size(A,1), 1)
    r = b
    while (1)
        pick j where r(j) != 0
        z = r(j)
        x(j) = x(j) + z
        for i where A(i,j) != 0
            r(i) = r(i) - z*A(i,j)
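A runnable version of this relaxation sweep (our sketch; it assumes A has a unit diagonal, as the exponential system above does, so relaxing coordinate j zeroes r(j)):

    function x = relax_solve(A, b, tol)
    % Gauss-Southwell: repeatedly relax the largest-magnitude residual entry.
    x = sparse(size(A,1), 1);
    r = b;
    while norm(r, 1) > tol
        [~, j] = max(abs(r));         % Gauss-Southwell choice of coordinate
        z = r(j);
        x(j) = x(j) + z;
        r = r - z * A(:, j);          % sparse update: touches only nnz of A(:,j)
    end
    end

For instance, relax_solve(M, rhs, 1e-6) applies to the Kronecker system built above, whose diagonal is the identity.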
  30. Back to the exponential:

$$(I \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c, \qquad x_N = \sum_{i=0}^N v_i.$$

Solve this system via the same method, with two optimizations: (1) build the system implicitly; (2) don't store the $v_i$, just store the sum $x_N$.
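Combining the two optimizations gives a compact implicit solver; a sketch in our own notation (the authors' reference implementations, gexpm and gexpmq, are in the nexpokit package):

    function x = gexpm_sketch(P, c, N, tol)
    % Relaxation on (I (x) I - S_N (x) P) v = e_1 (x) e_c without forming the
    % system; only the sum x = sum_k v_k is stored, not the blocks v_k.
    n = size(P, 1);
    x = zeros(n, 1);
    R = sparse(n, N+1);             % residual block r_k lives in column k+1
    R(c, 1) = 1;                    % initial residual e_1 (x) e_c
    while sum(abs(nonzeros(R))) > tol
        [i, kk] = find(R, 1);       % pick a nonzero residual entry (queue-like)
        z = R(i, kk);
        R(i, kk) = 0;
        x(i) = x(i) + z;            % accumulate v_{kk-1}(i) directly into x
        if kk <= N                  % push the mass to the next Taylor term
            R(:, kk+1) = R(:, kk+1) + (z / kk) * P(:, i);
        end
    end
    end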
  31. Error analysis for Gauss-Southwell. Theorem: assume P is column-stochastic and $v^{(0)} = 0$ in the system $(I \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c$. (Nonnegativity, the "easy" part) the iterates and residuals are nonnegative: $v^{(l)} \ge 0$ and $r^{(l)} \ge 0$. (Convergence, the "annoying" part) the residual goes to 0:

$$\|r^{(l)}\|_1 \le \prod_{k=1}^{l} \left(1 - \frac{1}{2dk}\right) \le l^{-1/(2d)},$$

where d is the largest degree.
  32. Proof sketch. Gauss-Southwell picks the largest residual entry, so we can bound the update by the average number of nonzeros in the residual (a sloppy bound). This gives algebraic convergence with a slow rate, but each update is REALLY fast: $O(d_{\max} \log n)$. If $d_{\max}$ is $O(\log \log n)$, then our method runs in sub-linear time (but so does just about anything).
  33. Overall error analysis. Components: truncation to N terms, residual to error, and the approximate solve. Theorem: after ℓ steps of Gauss-Southwell,

$$\|x_N^{(\ell)} - x\|_1 \le \frac{1}{N!\,N} + e \cdot \ell^{-1/(2d)}.$$
  34. More recent error analysis. Theorem (Gleich and Kloster, 2013, arXiv:1310.3423): consider computing the matrix exponential using the Gauss-Southwell relaxation method on a graph with a Zipf-law degree distribution with exponent p = 1 and max degree d; then the work involved in getting a solution with 1-norm error ε is

$$\text{work} = O\!\left(\log(\tfrac{1}{\varepsilon})\,(\tfrac{1}{\varepsilon})^{3/2}\, d^2 (\log d)^2\right).$$
  35. Problem size and runtimes. Figure: the median runtime of our methods for the seven graphs over 100 trials, plotted against |V| + nnz(P), for expmv (with the 'half' option), gexpmq, gexpm, and expmimv.

Table 3 – The real-world datasets we use in our experiments span three orders of magnitude in size.

    Graph           |V|            nnz(P)          nnz(P)/|V|
    itdk0304        190,914        1,215,220        6.37
    dblp-2010       226,413        1,432,920        6.33
    flickr-scc      527,476        9,357,071       17.74
    ljournal-2008   5,363,260      77,991,514      14.54
    webbase-2001    118,142,155    1,019,903,190    8.63
    twitter-2010    33,479,734     1,394,440,635   41.65
    friendster      65,608,366     3,612,134,270   55.06

Real-world networks. The datasets used are summarized in Table 3. They include a version of the flickr graph from [Bonchi et al., 2012] containing just the largest strongly-connected component of the original graph; dblp-2010 from [Boldi et al., 2011]; itdk0304 from [The Cooperative Association for Internet Data Analysis, 2005]; ljournal-2008 from [Boldi et al., 2011; Chierichetti et al., 2009]; twitter-2010 from [Kwak et al., 2010]; webbase-2001 from [Hirai et al., 2000; Boldi and Vigna, 2005]; and the friendster graph from [Yang and Leskovec, 2012].

Implementation details. All experiments were performed on either a dual-processor Xeon e5-2670 system with 16 cores (total) and 256 GB of RAM, or a single-processor Intel i7-990X, 3.47 GHz CPU with 24 GB of RAM. Our algorithms were implemented in C++ using the Matlab MEX interface. All data structures used are memory-efficient: the solution and residual are stored as hash tables using Google's sparsehash package. The precise code for the algorithms and the experiments below is available via https://www.cs.purdue.edu/homes/dgleich/codes/nexpokit/.

Comparison. We compare our implementation with a state-of-the-art Matlab function for computing the exponential of a matrix times a vector, expmv [Al-Mohy and Higham, 2011]. We customized this method with the knowledge that ‖P‖₁ = 1; this single change results in a great improvement to the runtime of their code. In each experiment, we use as the "true solution" the result of a call to expmv using the 'single' option, which guarantees a 1-norm error bounded by 2⁻²⁴, or, for smaller problems, a Taylor approximation with the number of terms predicted by Lemma 12.
  36. References and ongoing work. Kloster and Gleich, Workshop on Algorithms for the Web-graph, 2013; also see the journal version on arXiv. Code: www.cs.purdue.edu/homes/dgleich/codes/nexpokit
  • Error analysis using the queue (almost done)
  • Better linear systems for faster convergence
  • Asynchronous coordinate descent methods
  • Scaling up to billion-node graphs (done)
Supported by NSF CAREER 1149756-CCF. www.cs.purdue.edu/homes/dgleich
