Numerical Linear Algebra for Data and Link Analysis.
Talk at Google about spectral graph partitioning and distributed PageRank computation using linear systems.

Transcript

  • 1. Numerical Linear Algebra for Data and Link Analysis. Leonid Zhukov, June 9, 2005
  • 2. Abstract: Numerical Linear Algebra for Data and Link Analysis.
    Modern information retrieval and data mining systems must operate on extremely large datasets and require efficient, robust and scalable algorithms. Numerical linear algebra provides a solid foundation for the development of such algorithms and analysis of their behavior.
    In this talk I will discuss several linear algebra based methods and their practical applications:
    i) Spectral graph partitioning. I will describe a recursive spectral algorithm for bi-partite graph partitioning and its application to simultaneous clustering of bidded terms and advertisers in pay-for-performance market data. I will also present a new local refinement strategy that allows us to improve cluster quality.
    ii) Web graph link analysis. I will discuss a linear system formulation of the PageRank algorithm and the use of Krylov subspace methods for an efficient solution. I will also describe our scalable parallel implementation and present results of numerical experiments for the convergence of iterative methods on multiple graphs with various parameter settings.
    In conclusion I will outline some difficulties encountered while developing these applications and address possible solutions and future research directions.
  • 3. Outline
    • Introduction
      – Computational science and information retrieval
    • Spectral clustering and graph partitioning
      – Spectral clustering
      – Flow refinement
      – Bi-partite spectral and advertiser-term clustering
    • Web graph link analysis
      – PageRank as linear system
      – Krylov subspace methods
      – Numerical experiments
    • Parallel implementation
      – Distributed matrices
      – MPI, PETSc, etc.
    • Conclusion and future work
  • 4. 1. Introduction. 1.1. Computational science for information retrieval
    • Multiple applications of numerical methods, no specialized algorithms
    • Large scale problems
    • Practical applications

    Scientific Computing                                     Information Retrieval
    Problem in continuum, governed by PDE                    Discrete data is given
    Discretization for numerical solution,                   No control over problem size
      control over resolution
    2D or 3D geometry                                        High dimensional spaces
    Uniform distribution of node degrees                     Power-law degree distribution
  • 5. 1.2. Scientific Computing vs Information Retrieval graphs (figures: FEM mesh for CFD simulations; artist-artist similarity graph)
  • 6. 2. Spectral Graph Partitioning
  • 7. 2.1. Graph partitioning
    • Bisecting the graph: edge separator
    • Good and balanced cut
    • Balanced partition
    • “Natural” boundaries
    • partition = clustering
  • 8. 2.2. Metrics - good cut
    • Partitioning: cut(V_1, V_2) = \sum_{i \in V_1, j \in V_2} e_{ij};  assoc(V_1, V) = \sum_{i \in V_1} d(v_i)
    • Objective functions:
      – Minimal cut: MCut(V_1, V_2) = cut(V_1, V_2)
      – Normalized cut: NCut(V_1, V_2) = cut(V_1, V_2)/assoc(V_1, V) + cut(V_1, V_2)/assoc(V_2, V)
      – Quotient cut: QCut(V_1, V_2) = cut(V_1, V_2) / min(assoc(V_1, V), assoc(V_2, V))
  • 9. 2.3. Graph cuts
    • Let G = (V, E) be a graph, A(G) its adjacency matrix
    • Let V = V^+ \cup V^- be a partitioning of the nodes
    • Let v = \{+1, -1, +1, \ldots, -1, +1\}^T be the indicator vector: v(i) = +1 if node i \in V^+, v(i) = -1 if node i \in V^-
    • Number of edges connecting V^+ and V^-: cut(V^+, V^-) = (1/4) \sum_{e(i,j)} (v(i) - v(j))^2 = (1/4) v^T L v
    • L = D - A
    • Minimal cut partitioning: smallest number of edges to remove
    • Exact solution is NP-hard!
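    For illustration, a minimal C sketch (on a small hypothetical graph, not code from the talk) checking the identity above: counting crossing edges directly agrees with (1/4) v^T L v, since each cut edge contributes (v(i) - v(j))^2 = 4.

    /* Minimal sketch (hypothetical 5-node graph): verify that
       cut(V+,V-) = (1/4) v^T L v for an indicator vector v in {+1,-1}. */
    #include <stdio.h>

    #define N 5
    #define M 6   /* number of undirected edges */

    int main(void) {
        /* edge list of a small example graph */
        int edges[M][2] = {{0,1},{0,2},{1,2},{2,3},{3,4},{2,4}};
        int v[N] = {+1, +1, +1, -1, -1};   /* partition indicator */

        /* direct count: edges crossing the partition */
        int cut = 0;
        for (int e = 0; e < M; e++)
            if (v[edges[e][0]] != v[edges[e][1]]) cut++;

        /* quadratic form: (1/4) sum_{(i,j) in E} (v_i - v_j)^2 */
        int quad = 0;
        for (int e = 0; e < M; e++) {
            int d = v[edges[e][0]] - v[edges[e][1]];
            quad += d * d;
        }
        quad /= 4;

        printf("cut = %d, (1/4) v^T L v = %d\n", cut, quad);
        return 0;
    }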
  • 10. 2.4. Spectral method - motivation (from Physics)
    • Linear graph - 5 nodes: 1 2 3 4 5
    • Energy of the system: E = (1/2) m \sum_i \dot{x}(i)^2 + (1/2) k \sum_{i,j} (x(i) - x(j))^2
    • Equations of motion: M \frac{d^2 x}{dt^2} = -k L x
    • Laplacian matrix 5x5:
        L = [  1  -1   0   0   0
              -1   2  -1   0   0
               0  -1   2  -1   0
               0   0  -1   2  -1
               0   0   0  -1   1 ]
  • 11. 2.5. Spectral method - motivation (from Physics)
    • Eigenproblem: Lx = \lambda x
    • The second lowest mode, \lambda_2 = \omega_2^2, bisects the string into two equal sized components
  • 12. 2.6. Spectral method - relaxation
    • Discrete problem → continuous problem
    • Discrete problem: find min (1/4) v^T L v subject to v(i) = \pm 1 and \sum_i v(i) = 0
    • Relaxation - continuous problem: find min (1/4) x^T L x subject to \sum_i x(i)^2 = N and \sum_i x(i) = 0
    • The exact (discrete) solution satisfies the relaxed constraints, but not the other way around!
    • Given x(i), round to v(i) = sign(x(i))
  • 13. 2.7. Spectral method - computations
    • Constrained optimization problem: Q(x) = (1/4) x^T L x - \lambda (x^T x - N)
    • Additional constraint: x \perp e, where e = \{1, 1, 1, \ldots, 1\}^T
    • Minimization: \min_{x \perp e} (1/4) \, x^T L x / x^T x
    • Courant-Fischer Minimax Theorem: Lx = \lambda x; we look for the second smallest eigenvalue \lambda_2 and its eigenvector x_2
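    In practice the pair (\lambda_2, x_2) would be computed with a Lanczos-type solver (see slide 29); the following small C sketch, an illustration on a hypothetical 6-node graph rather than the talk's implementation, finds the Fiedler vector by power iteration on the shifted matrix B = \sigma I - L while projecting out the constant vector e.

    /* Illustrative sketch only (not the talk's code): Fiedler vector of a small
       graph by power iteration on B = sigma*I - L, deflating the constant vector.
       Real applications would use Lanczos/ARPACK on the sparse Laplacian. */
    #include <stdio.h>
    #include <math.h>

    #define N 6

    int main(void) {
        /* hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3) */
        int A[N][N] = {0};
        int edges[][2] = {{0,1},{0,2},{1,2},{2,3},{3,4},{3,5},{4,5}};
        int m = sizeof(edges) / sizeof(edges[0]);
        for (int e = 0; e < m; e++) {
            A[edges[e][0]][edges[e][1]] = 1;
            A[edges[e][1]][edges[e][0]] = 1;
        }

        /* Laplacian L = D - A, and a shift sigma >= lambda_max(L) (2 * max degree) */
        double L[N][N];
        double sigma = 0.0;
        for (int i = 0; i < N; i++) {
            int deg = 0;
            for (int j = 0; j < N; j++) deg += A[i][j];
            for (int j = 0; j < N; j++) L[i][j] = (i == j) ? deg : -A[i][j];
            if (2.0 * deg > sigma) sigma = 2.0 * deg;
        }

        /* power iteration on B = sigma*I - L, projecting out the constant vector e */
        double x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = (i % 2) ? 1.0 : -0.5;  /* arbitrary start */
        for (int it = 0; it < 1000; it++) {
            double mean = 0.0, norm = 0.0;
            for (int i = 0; i < N; i++) mean += x[i];
            mean /= N;
            for (int i = 0; i < N; i++) x[i] -= mean;        /* deflate e */
            for (int i = 0; i < N; i++) {
                y[i] = sigma * x[i];
                for (int j = 0; j < N; j++) y[i] -= L[i][j] * x[j];
            }
            for (int i = 0; i < N; i++) norm += y[i] * y[i];
            norm = sqrt(norm);
            for (int i = 0; i < N; i++) x[i] = y[i] / norm;
        }
        /* x now approximates v2; its sign pattern gives the spectral bisection */
        for (int i = 0; i < N; i++)
            printf("node %d: v2 = %+.3f -> %s\n", i, x[i], x[i] < 0 ? "V-" : "V+");
        return 0;
    }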
  • 14. 2.8. Family of spectral methods
    • Ratio cut: RCut(V_1, V_2) = cut(V_1, V_2)/|V_1| + cut(V_1, V_2)/|V_2|, leading to (D - A)x = \lambda x
    • Normalized cut: NCut(V_1, V_2) = cut(V_1, V_2)/assoc(V_1, V) + cut(V_1, V_2)/assoc(V_2, V)
      = 2 - (assoc(V_1, V_1)/assoc(V_1, V) + assoc(V_2, V_2)/assoc(V_2, V)), leading to (D - A)x = \lambda D x
  • 15. 2.9. Spectral partitioning algorithm
    Algorithm 1
      Compute the eigenvector v2 corresponding to λ2 of L(G)
      for all nodes n in G do
        if v2(n) < 0 then
          put node n in partition V-
        else
          put node n in partition V+
        end if
      end for
  • 16. 2.10. Spectral ordering algorithm
    Algorithm 2
      Compute the eigenvector v2 corresponding to λ2 of L(G)
      for all nodes n in G do
        sort n according to v2(n)
      end for
    • Permute columns and rows of A according to the “new” ordering
    • Since \sum_{e(i,j)} (v(i) - v(j))^2 is minimized, there are few edges connecting distant v(i) and v(j)
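    A small C sketch of the ordering step (hypothetical v2 values and adjacency matrix, not data from the talk): sort node indices by their Fiedler-vector entries and permute the rows and columns of A accordingly.

    /* Illustrative spectral ordering sketch: order nodes by their Fiedler vector
       values, then apply the permutation to the adjacency matrix rows/columns. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 5

    static const double *v2_ref;  /* comparator needs access to v2 */

    static int by_v2(const void *a, const void *b) {
        double d = v2_ref[*(const int *)a] - v2_ref[*(const int *)b];
        return (d > 0) - (d < 0);
    }

    int main(void) {
        /* hypothetical Fiedler vector and adjacency matrix */
        double v2[N] = {0.31, -0.42, 0.05, -0.37, 0.44};
        int A[N][N] = {{0,0,1,0,1},
                       {0,0,1,1,0},
                       {1,1,0,0,1},
                       {0,1,0,0,0},
                       {1,0,1,0,0}};
        int perm[N];

        for (int i = 0; i < N; i++) perm[i] = i;
        v2_ref = v2;
        qsort(perm, N, sizeof perm[0], by_v2);   /* spectral ordering */

        /* print the permuted matrix A(perm, perm) */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%d ", A[perm[i]][perm[j]]);
            printf("   (node %d, v2 = %+.2f)\n", perm[i], v2[perm[i]]);
        }
        return 0;
    }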
  • 17. 2.11. Spectral Example I (good)
  • 18. 2.12. Linear sweep
    • Linear sweep: evaluate NCut(V_1, V_2) and QCut(V_1, V_2) along the spectral ordering
  • 19. 2.13. Spectral Example II (not so good)
  • 20. 2.14. Flow refinement
    Set up and solve a minimum s-t cut problem:
    • Divide the nodes into 3 sets according to the embedding ordering
    • Set up an s-t max flow problem, with one set of nodes pinned to the source and another to the sink with infinite-capacity links
    • Solve to obtain the s-t min cut (by the max-flow min-cut theorem, find the saturated frontier)
    • Move the partition
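    The s-t min cut itself can come from any max-flow routine. Below is a compact Edmonds-Karp sketch in C on a hypothetical capacity matrix, for illustration only; in the refinement step the pinned node sets would be attached to the source and sink with very large capacities.

    /* Illustrative Edmonds-Karp max-flow sketch (BFS augmenting paths) on a small
       hypothetical capacity matrix; not the talk's implementation. */
    #include <stdio.h>
    #include <string.h>

    #define N 6
    #define INF 1000000000

    int cap[N][N];      /* residual capacities */
    int parent[N];

    /* BFS in the residual graph; returns 1 if t is reachable from s */
    static int bfs(int s, int t) {
        int queue[N], head = 0, tail = 0, seen[N] = {0};
        memset(parent, -1, sizeof parent);
        queue[tail++] = s; seen[s] = 1;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < N; v++)
                if (!seen[v] && cap[u][v] > 0) {
                    seen[v] = 1; parent[v] = u;
                    if (v == t) return 1;
                    queue[tail++] = v;
                }
        }
        return 0;
    }

    int main(void) {
        /* hypothetical network: source = 0, sink = 5 */
        cap[0][1] = 10; cap[0][2] = 10;
        cap[1][2] = 2;  cap[1][3] = 4;  cap[1][4] = 8;
        cap[2][4] = 9;  cap[4][3] = 6;  cap[3][5] = 10; cap[4][5] = 10;

        int s = 0, t = 5, flow = 0;
        while (bfs(s, t)) {
            int bottleneck = INF;
            for (int v = t; v != s; v = parent[v]) {
                int u = parent[v];
                if (cap[u][v] < bottleneck) bottleneck = cap[u][v];
            }
            for (int v = t; v != s; v = parent[v]) {
                int u = parent[v];
                cap[u][v] -= bottleneck;   /* push flow */
                cap[v][u] += bottleneck;   /* residual edge */
            }
            flow += bottleneck;
        }
        /* nodes still reachable from s in the residual graph form the source side
           of the min cut; the saturated frontier edges are the cut */
        printf("max flow = min cut = %d\n", flow);
        return 0;
    }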
  • 21. 2.15. Flow refinement (example)
    cut(A,B): 171 vs 70;  QCut: 0.0108 vs 0.0053;  NCut: 0.0206 vs 0.0088;  part size: 1433 vs 1195
  • 22. 2.16. Flow refinement (example)
    cut(A,B): 11605 vs 36688;  QCut: 0.242 vs 0.160;  NCut: 0.267 vs 0.296;  part size: 266 vs 1103
  • 23. 2.17. Recursive spectral • tree → flat clusters
  • 24. 2.18. Example: Recursive Spectral
  • 25. 2.19. Data: Advertiser - bidded term data (figure: advertiser-term matrices A, with advertisers a_j and terms t_i)
    • Simultaneous clustering of advertisers and bidded terms (co-clustering)
    • Bi-partite graph partitioning problem
  • 26. 2.20. Bi-partite graph case
    • Adjacency matrix of the bipartite graph: \hat{A} = [0, A; A^T, 0]
    • Eigensystem: [D_1, -A; -A^T, D_2] [x; y] = \lambda [D_1, 0; 0, D_2] [x; y]
    • Normalization: A_n = D_1^{-1/2} A D_2^{-1/2}, with A_n v = (1 - \lambda) u and A_n^T u = (1 - \lambda) v
    • SVD decomposition: A_n = U \Sigma V^T, \sigma = 1 - \lambda
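    A small C sketch (hypothetical 3x4 advertiser-term matrix, not data from the talk) forming the normalized matrix A_n = D_1^{-1/2} A D_2^{-1/2}; the leading singular vectors of A_n, computed by an SVD or Lanczos routine (not shown here), give the co-clustering embedding.

    /* Illustrative sketch: form A_n = D1^{-1/2} A D2^{-1/2} for a tiny hypothetical
       advertiser-term matrix. */
    #include <stdio.h>
    #include <math.h>

    #define NA 3   /* advertisers */
    #define NT 4   /* terms */

    int main(void) {
        double A[NA][NT] = {{1, 1, 0, 0},
                            {0, 1, 1, 0},
                            {0, 0, 1, 1}};
        double d1[NA] = {0}, d2[NT] = {0}, An[NA][NT];

        /* D1 = row sums (advertiser degrees), D2 = column sums (term degrees) */
        for (int i = 0; i < NA; i++)
            for (int j = 0; j < NT; j++) { d1[i] += A[i][j]; d2[j] += A[i][j]; }

        for (int i = 0; i < NA; i++)
            for (int j = 0; j < NT; j++)
                An[i][j] = A[i][j] / (sqrt(d1[i]) * sqrt(d2[j]));

        for (int i = 0; i < NA; i++) {
            for (int j = 0; j < NT; j++) printf("%6.3f ", An[i][j]);
            printf("\n");
        }
        return 0;
    }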
  • 27. 2.21. Advertiser - bidded search term matrix
  • 28. 2.22. Advertiser - bidded search term matrix
  • 29. 2.23. Computational consideration • Large and very sparse matrices • Only top few eigenvectors needed • Precision requirements low • Iterative Krylov subspace methods, Lanczos and Arnoldi algorithms • Only matrix-vector multiply
  • 30. 3. Web graph link analysis
  • 31. 3.1. PageRank model
    • Random walk on the graph
    • Markov process: memoryless, homogeneous
    • Stationary distribution: existence, uniqueness, convergence
    • Perron-Frobenius theorem requires the chain to be irreducible (every state is reachable from every other state) and aperiodic (no cycles)
  • 32. 3.2. PageRank model
    • Construct the probability matrix P = D^{-1} A, where D is the diagonal matrix of row sums (out-degrees) of A
    • Construct the transition matrix for the Markov process (row-stochastic), correcting dangling nodes: P' = P + d v^T
    • Correct reducibility (make the chain irreducible): P'' = c P' + (1 - c) e v^T
    • The Markov chain stationary distribution exists and is unique (Perron-Frobenius): P''^T p = \lambda p
  • 33. 3.3. Linear system formulation
    • PageRank equation: (cP + c\,d v^T + (1 - c)\,e v^T)^T x = \lambda x
    • Normalization: e^T x = x^T e = \|x\|_1, \lambda_1 = 1
    • Identity: d^T x = \|x\|_1 - \|P^T x\|_1
    • Linear system: (I - cP^T) x = v (\|x\|_1 - c \|P^T x\|_1)
  • 34. 3.4. Linear System vs Eigensystem
    Eigensystem:                                        Linear system:
      P''^T p = \lambda p                               (I - cP^T) x = k(x) v
      P'' = cP + c\,d v^T + (1 - c)\,e v^T              k(x) = \|x\|_1 - c \|P^T x\|_1
      \lambda = 1                                       p = x / \|x\|_1
    • Iteration matrices P'' and I - cP^T have different rates of convergence
    • The vector v appears either on the right-hand side or in the matrix
    • More methods are available for the linear system
    • The solution is linear with respect to v
  • 35. 3.5. Flowchart of computational methods
  • 36. 3.6. Stationary iterations
    • Power iterations: p^{(k+1)} = c P^T p^{(k)} + (1 - c) v
    • Jacobi iterations: p^{(k+1)} = c P^T p^{(k)} + k v
    • Iteration error: e^{(k)} = \|x^{(k)} - x^{(k-1)}\|_1; residual: r^{(k)} = \|b - A x^{(k)}\|_1
    • Convergence in k steps: k \sim \log(e) / \log(c)
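    A compact C sketch of the power iteration on a tiny hypothetical graph (illustration only, not the talk's distributed code). It goes slightly beyond the formula above by redistributing dangling-node mass to v at each step, so only the sparse P^T product is needed.

    /* Illustrative PageRank power iteration on a tiny hypothetical graph (CSR of
       the out-links). Dangling-node mass is redistributed according to v,
       teleportation with factor 1-c. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    int main(void) {
        /* out-links: 0->1, 0->2, 1->2, 2->0; node 3 is dangling */
        int row_ptr[N + 1] = {0, 2, 3, 4, 4};
        int col[]          = {1, 2, 2, 0};
        double c = 0.85;
        double p[N], pnew[N], v[N];

        for (int i = 0; i < N; i++) { p[i] = 1.0 / N; v[i] = 1.0 / N; }

        for (int it = 0; it < 100; it++) {
            double dangling = 0.0, delta = 0.0;
            for (int i = 0; i < N; i++) pnew[i] = 0.0;

            /* pnew = c * P^T p : node i spreads p[i] uniformly over its out-links */
            for (int i = 0; i < N; i++) {
                int deg = row_ptr[i + 1] - row_ptr[i];
                if (deg == 0) { dangling += p[i]; continue; }
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    pnew[col[k]] += c * p[i] / deg;
            }
            /* dangling mass and teleportation both follow v (||p||_1 stays 1) */
            for (int i = 0; i < N; i++)
                pnew[i] += (c * dangling + (1.0 - c)) * v[i];

            for (int i = 0; i < N; i++) { delta += fabs(pnew[i] - p[i]); p[i] = pnew[i]; }
            if (delta < 1e-12) break;
        }
        for (int i = 0; i < N; i++) printf("p[%d] = %.6f\n", i, p[i]);
        return 0;
    }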
  • 37. 3.7. Stationary methods convergence
    (Figure: error metrics for Jacobi convergence: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, and \|x^{(k)} - x^*\| / \|x^{(k)}\|, decreasing from 10^0 to about 10^{-8} over roughly 80 iterations.)
  • 38. 3.8. Krylov subspace methods
    • Linear system Ax = b, with A = I - cP^T, b = kv
    • Residual r = b - Ax
    • Krylov subspace K_m = span\{r, Ar, A^2 r, A^3 r, \ldots, A^m r\}
    • x_m is built from x_0 + K_m: x_m = x_0 + q_{m-1}(A) r_0
    • Only matrix-vector products are required
    • Explicit minimization in the subspace; extra information is carried to the next step
  • 39. 3.9. Krylov subspace methods
    • Generalized Minimum Residual (GMRES): pick x_n \in K_n such that \|b - A x_n\| is minimal, i.e. r_n \perp A K_n
    • Biconjugate Gradient (BiCG): pick x_n \in K_n such that r_n \perp span\{w, A^T w, \ldots, (A^T)^{n-1} w\}
    • Biconjugate Gradient Stabilized (BiCGSTAB)
    • Quasi-Minimal Residual (QMR)
    • Conjugate Gradient Squared (CGS)
    • Chebyshev iterations
    Preconditioners
    • Convergence depends on cond(A) = \lambda_{max} / \lambda_{min}
    • Preconditioner M: solve M^{-1} A x = M^{-1} b
    • Iterating with M^{-1} A gives a better condition number
    • Diagonal preconditioner: M = D
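    For concreteness, a minimal unpreconditioned BiCGSTAB sketch in C on a small hypothetical dense system, illustrating the "only matrix-vector products" structure; the talk's experiments used PETSc's solvers rather than this code.

    /* Minimal unpreconditioned BiCGSTAB sketch on a small dense system;
       an illustration only. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    static void matvec(const double A[N][N], const double x[N], double y[N]) {
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
        }
    }

    static double dot(const double a[N], const double b[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        /* hypothetical diagonally dominant system (stands in for I - cP^T) */
        double A[N][N] = {{ 4, -1,  0, -1},
                          {-1,  4, -1,  0},
                          { 0, -1,  4, -1},
                          {-1,  0, -1,  4}};
        double b[N] = {1, 2, 0, 1}, x[N] = {0, 0, 0, 0};

        double r[N], rhat[N], p[N] = {0}, v[N] = {0}, s[N], t[N];
        matvec(A, x, r);
        for (int i = 0; i < N; i++) { r[i] = b[i] - r[i]; rhat[i] = r[i]; }
        double rho = 1.0, alpha = 1.0, omega = 1.0;

        for (int it = 0; it < 100; it++) {
            double rho_new = dot(rhat, r);
            double beta = (rho_new / rho) * (alpha / omega);
            for (int i = 0; i < N; i++) p[i] = r[i] + beta * (p[i] - omega * v[i]);
            matvec(A, p, v);
            alpha = rho_new / dot(rhat, v);
            for (int i = 0; i < N; i++) s[i] = r[i] - alpha * v[i];
            matvec(A, s, t);
            omega = dot(t, s) / dot(t, t);
            for (int i = 0; i < N; i++) {
                x[i] += alpha * p[i] + omega * s[i];
                r[i] = s[i] - omega * t[i];
            }
            rho = rho_new;
            if (sqrt(dot(r, r)) < 1e-12) break;    /* converged */
        }
        for (int i = 0; i < N; i++) printf("x[%d] = %.6f\n", i, x[i]);
        return 0;
    }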
  • 40. 3.10. Krylov subspace methods: convergence
    (Figure: error metrics for BiCG convergence: \|x^{(k)} - x^*\|, \|A x^{(k)} - b\|, and \|x^{(k)} - x^*\| / \|x^{(k)}\|, dropping to about 10^{-8} within roughly 45 iterations.)
  • 41. 3.11. Computational Requirements
    Method      IP     SAXPY   MV   Storage
    PAGERANK    -      1       1    M + 3v
    JACOBI      -      1       1    M + 3v
    GMRES       i+1    i+1     1    M + (i+5)v
    BiCG        2      5       2    M + 10v
    BiCGSTAB    4      6       2    M + 10v
    • IP - inner vector products
    • SAXPY - scalar times vector plus vector
    • MV - matrix vector products
    • M - matrix, v - vector
  • 42. 3.12. Graph statistics
    Name       Nodes    Links    Storage size
    bs-cc      20k      130k     1.6 MB
    edu        2M       14M      176 MB
    yahoo-r2   14M      266M     3.25 GB
    uk         18.5M    300M     3.67 GB
    yahoo-r3   60M      850M     10.4 GB
    db         70M      1B       12.3 GB
    av         1.4B     6.6B     80 GB
  • 43. 3.13. Graph statistics
    (Figure: log-log in/out-degree distributions. Out-degree power-law exponents: bs b = 1.526, y2 b = 1.454, db b = 2.010. In-degree exponents: bs b = 1.747, y2 b = 1.848, db b = 1.870.)
  • 44. 3.14. Convergence I
    (Figure: uk graph convergence, error vs. iteration and error vs. time, for std, jacobi, gmres, bicg, and bcgs; error decreases from 10^0 to 10^{-8} within about 80 iterations / 40 seconds.)
  • 45. 3.15. Convergence II
    (Figure: db graph convergence, error vs. iteration and error vs. time, for std, jacobi, gmres, bicg, and bcgs; error decreases from 10^0 to 10^{-8} within about 70 iterations / 700 seconds.)
  • 46. 3.16. Convergence on AV graph
    (Figure: av graph, error/residual vs. time for std and bcgs, reaching about 10^{-8} within roughly 500 seconds.)
  • 47. 3.17. PageRank Timing results (iterations; time per iteration / total time)
    Graph            Procs   PR                Jacobi            GMRES              BiCG               BCGS
    edu              20      84, 0.09s/7.56s   84, 0.07s/5.88s   21†, 0.6s/12.6s    44*, 0.4s/17.6s    21*, 0.4s/8.4s
    yahoo-r2         20      71, 1.8s/127s     65, 1.9s/123s     12, 16s/192s       35, 8.6s/301s      10, 9.9s/99s
    uk               60      73, 0.09s/6.57s   71, 0.1s/7.1s     22*, 0.8s/17.6s    25*, 0.80s/20s     11*, 1.0s/11s
    yahoo-r3         60      76, 1.6s/122s     75, 1.5s/112s
    db               60      62, 9.0s/558s     58, 8.7s/505s     29, 15s/435s       45, 15s/675s       15*, 15s/225s
    av               226     72, 6.5s/468s                                                             26, 16.5s/429s
    av (host order)  140     72, 4.6s/331s                                                             26, 15.0s/390s
  • 48. 3.18. Dependence on teleportation
    (Figure: convergence and conditioning for db, std vs. gmres, for c = 0.85, 0.90, 0.95, 0.99; error/residual vs. iteration, up to 200 iterations.)
  • 49. 4. Parallel system
  • 50. 4.1. Matrix-Vector multiply • Iterative process Ax→x • Every process “owns” several rows of the matrix • Every process “owns” corresponding part of the vector • Communications required for multiplication
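    A simplified MPI sketch of the row-distributed matrix-vector product (illustration only, with hypothetical data): each rank owns a block of CSR rows and its slice of x, and here the full vector is assembled on every rank with MPI_Allgatherv; a production code such as PETSc would exchange only the required off-processor entries.

    /* Illustrative row-distributed sparse matvec with MPI (simplified sketch). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* hypothetical global size; each rank owns a contiguous block of rows */
        int n_global = 8;
        int n_local = n_global / nprocs + (rank < n_global % nprocs ? 1 : 0);
        int row_start = 0;
        for (int r = 0; r < rank; r++)
            row_start += n_global / nprocs + (r < n_global % nprocs ? 1 : 0);

        /* toy local CSR block: local row i links to global columns
           (row_start + i) and ((row_start + i + 1) mod n_global), weight 0.5 */
        int *row_ptr = malloc((n_local + 1) * sizeof(int));
        int *col     = malloc(2 * n_local * sizeof(int));
        double *val  = malloc(2 * n_local * sizeof(double));
        for (int i = 0; i < n_local; i++) {
            row_ptr[i]   = 2 * i;
            col[2*i]     = row_start + i;
            col[2*i + 1] = (row_start + i + 1) % n_global;
            val[2*i] = val[2*i + 1] = 0.5;
        }
        row_ptr[n_local] = 2 * n_local;

        /* local slice of x and buffers for the assembled global vector */
        double *x_local = malloc(n_local * sizeof(double));
        double *x_full  = malloc(n_global * sizeof(double));
        double *y_local = malloc(n_local * sizeof(double));
        for (int i = 0; i < n_local; i++) x_local[i] = 1.0 / n_global;

        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        MPI_Allgather(&n_local, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);
        displs[0] = 0;
        for (int r = 1; r < nprocs; r++) displs[r] = displs[r-1] + counts[r-1];

        /* communication: assemble the full vector on every rank */
        MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        /* local multiply: y_local = A_local * x_full */
        for (int i = 0; i < n_local; i++) {
            y_local[i] = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                y_local[i] += val[k] * x_full[col[k]];
        }

        if (n_local > 0)
            printf("rank %d owns rows %d..%d, y[0] = %.4f\n",
                   rank, row_start, row_start + n_local - 1, y_local[0]);

        MPI_Finalize();
        return 0;
    }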
  • 51. 4.2. Distributed matrices
    • Computing:
      – Load balancing: equal number of non-zeros per processor
      – Minimize communication: smallest number of off-processor elements
    • Storage:
      – Number of non-zeros per processor
      – Number of rows per processor
  • 52. 4.3. Practical data distribution
    • Balanced graph partitioning
      – Exact: NP-hard
      – Approximate: multi-resolution, spectral, geometric
    • Practical solution
      – Sort the graph in lexicographic order
      – Fill processors consecutively by rows, adding rows until w_{rows} n_p + w_{nnz} nnz_p > (w_{rows} n + w_{nnz} nnz) / p, with weights w_{rows} : w_{nnz} = 1/1, 2/1, 4/1
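    A small C sketch of the greedy row-filling rule above (hypothetical per-row non-zero counts, not data from the talk): rows are assigned to the current processor until its weighted load exceeds the average target, then the next processor is filled.

    /* Illustrative sketch of the greedy row-filling rule. */
    #include <stdio.h>

    #define N 10
    #define P 3

    int main(void) {
        int nnz_per_row[N] = {5, 300, 2, 8, 120, 4, 7, 90, 3, 6};  /* skewed degrees */
        double w_rows = 2.0, w_nnz = 1.0;                          /* e.g. 2/1 weighting */
        int owner[N];

        long nnz_total = 0;
        for (int i = 0; i < N; i++) nnz_total += nnz_per_row[i];
        double target = (w_rows * N + w_nnz * nnz_total) / P;      /* average load */

        int proc = 0;
        double load = 0.0;
        for (int i = 0; i < N; i++) {
            owner[i] = proc;
            load += w_rows + w_nnz * nnz_per_row[i];
            if (load > target && proc < P - 1) {  /* this processor is full */
                proc++;
                load = 0.0;
            }
        }
        for (int i = 0; i < N; i++)
            printf("row %d (nnz %3d) -> processor %d\n", i, nnz_per_row[i], owner[i]);
        return 0;
    }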
  • 53. 4.4. Data distribution schemes
    (Figure: y2 graph, std parallelization and distribution: solve time (s) vs. number of processors (5 to 30) for the "smart" weighted distribution and the plain "nrows" distribution.)
  • 54. 4.5. Implementation details
  • 55. 4.6. Implementation: MPI
    • Message Passing Interface (MPI) standard
    • Library specification for message-passing
    • Message passing = data transfer + synchronization
    • MPI_SEND, MPI_RECV
    • MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter
    • Implementations: LAM, mpich, Intel, etc.

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
    }
  • 56. 4.7. Implementation: PETSc
    • Portable Extensible Toolkit for Scientific Computing
    • Implements basic linear algebra operations on distributed matrices
    • Advanced linear and nonlinear solvers

    PetscInitialize(&argc, &args, (char *)0, help);
    MatCreate(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, &A);
    MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    VecAssemblyBegin(b);
    VecAssemblyEnd(b);
    MatMult(A, b, x);
  • 57. 4.8. Network topology
    (Figure: network topology effects: error/residual vs. time (up to 2500 s) for std-140-full, bcgs-140-full, std-140-star, and bcgs-140-star.)
  • 58. 4.9. Host Ordering on AV graph
    (Figure: host order improvement: error/residual vs. time (up to 900 s) for std-140, bcgs-140, std-140-host, and bcgs-140-host.)
  • 59. 4.10. Parallel performance
    (Figure: scaling for computing with the full web graph: performance increase (percent decrease in time), 0% to 250%, for std and bcgs; x-axis from 90% to 170%.)
  • 60. 5. Conclusions
    • Eigenvalues everywhere! Linear algebra methods provide provably good solutions to many problems, and the methods are very general.
    • Power-law graphs with high variance in node degrees present challenges to high performance parallel computing.
    • Skewed distributions, chains, a central core, and singletons make clustering of power-law data a difficult problem.
    • Embedding in 1D is probably not sufficient for this type of data; higher dimensions are needed.
  • 61. 5.1. References
    • Collaborators:
      – Kevin Lang, Pavel Berkhin
      – David Gleich and Matt Rasmussen
    • Publications:
      – “Fast Parallel PageRank: A Linear System Approach”, 2004
      – “Spectral Clustering of Large Advertiser Datasets”, 2003
      – “Clustering of bipartite advertiser-keyword graph”, 2002
    • References:
      – Spectral graph partitioning: M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chang (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon (2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)
      – PageRank computing: S. Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)