Anti-differentiating Approximation Algorithms: PageRank and MinCut

We study how Google's PageRank method relates to min-cut and a particular type of electrical flow in a network, and explain in detail how the "push method" accelerates the computation of PageRank. This has implications for semi-supervised learning and machine learning, as well as social network analysis.

  1. Anti-differentiating approximation algorithms & new relationships between PageRank, spectral, and localized flow. David F. Gleich, Purdue University. Joint work with Michael Mahoney. Supported by NSF CAREER 1149756-CCF and the Simons Institute. Presented at ICERM.
  2. Anti-differentiating approximation algorithms & new relationships between PageRank, spectral, and localized flow. Contributions: (1st) a new derivation of the PageRank vector for an undirected graph based on Laplacians, cuts, or flows; (2nd) a new understanding of the "push" methods to compute personalized PageRank; and an empirical improvement to methods for semi-supervised learning.
  3. The PageRank problem. The PageRank random surfer: (1) with probability $\beta$, follow a random-walk step; (2) with probability $(1-\beta)$, jump randomly according to the distribution $v$. Goal: find the stationary distribution $x$. Algorithm: solve the linear system
     $(I - \beta A D^{-1})\, x = (1-\beta)\, v$, equivalently $x = \beta A D^{-1} x + (1-\beta) v$,
     where $A$ is the symmetric adjacency matrix, $D$ is the diagonal degree matrix, $v$ is the jump vector, and $x$ is the solution.
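The slide gives only the linear-system form; here is a minimal sketch (not from the talk) of solving that system directly with NumPy. The toy graph, the value of β, and the choice of seed node for v are illustrative assumptions.

```python
# Sketch: solve (I - beta * A * D^{-1}) x = (1 - beta) * v for the PageRank vector x.
import numpy as np

A = np.array([[0., 1, 1, 0],
              [1., 0, 1, 0],
              [1., 1, 0, 1],
              [0., 0, 1, 0]])            # symmetric adjacency matrix (toy example)
beta = 0.85
d = A.sum(axis=1)                        # degrees
v = np.zeros(4); v[0] = 1.0              # jump vector localized on node 0

M = np.eye(4) - beta * A @ np.diag(1.0 / d)
x = np.linalg.solve(M, (1 - beta) * v)
print(x, x.sum())                        # x is a probability distribution (sums to 1)
```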
  4. The PageRank problem & the Laplacian. Three equivalent formulations:
     1. $(I - \beta A D^{-1})\, x = (1-\beta)\, v$;
     2. $(I - \beta \mathcal{A})\, y = (1-\beta)\, D^{-1/2} v$, where $\mathcal{A} = D^{-1/2} A D^{-1/2}$ and $x = D^{1/2} y$; and
     3. $[\alpha D + L]\, z = \alpha v$, where $\beta = 1/(1+\alpha)$, $x = D z$, and $L = D - A$ is the combinatorial Laplacian.
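A quick numerical sanity check of the three formulations (a sketch on the same kind of toy graph as above, with assumed values of β and v; not from the talk):

```python
# Sketch: verify that formulations 1-3 on the slide give the same PageRank vector x.
import numpy as np

A = np.array([[0., 1, 1, 0], [1., 0, 1, 0], [1., 1, 0, 1], [0., 0, 1, 0]])
beta = 0.85
alpha = (1 - beta) / beta               # from beta = 1/(1 + alpha)
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A
v = np.zeros(4); v[0] = 1.0

x1 = np.linalg.solve(np.eye(4) - beta * A @ np.diag(1 / d), (1 - beta) * v)

An = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
y = np.linalg.solve(np.eye(4) - beta * An, (1 - beta) * np.diag(d ** -0.5) @ v)
x2 = np.diag(d ** 0.5) @ y

z = np.linalg.solve(alpha * D + L, alpha * v)
x3 = D @ z

print(np.allclose(x1, x2), np.allclose(x1, x3))   # expected: True True
```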
  5. The Push Algorithm for PageRank. Proposed (in closest form) in Andersen, Chung & Lang (also by McSherry, and Jeh & Widom) for personalized PageRank. Strongly related to Gauss-Seidel (see my talk at Simons for this). Derived to show improved runtime for balanced solvers. The push method, with parameters $\tau, \rho$:
     1. $x^{(1)} = 0$, $r^{(1)} = (1-\beta)\, e_i$, $k = 1$
     2. while any $r_j > \tau d_j$ ($d_j$ is the degree of node $j$)
     3.   $x^{(k+1)} = x^{(k)} + (r_j - \tau d_j \rho)\, e_j$
     4.   $r^{(k+1)}_i = \begin{cases} \tau d_j \rho & i = j \\ r^{(k)}_i + \beta (r_j - \tau d_j \rho)/d_j & i \sim j \\ r^{(k)}_i & \text{otherwise} \end{cases}$
     5.   $k \leftarrow k + 1$
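Below is a minimal Python sketch of steps 1–5 as stated on the slide; it is not the speaker's implementation, and the adjacency-list input format and the queue used to schedule which node to push next are assumptions.

```python
# Sketch of the push method for personalized PageRank, following steps 1-5 above.
def push_pagerank(adj, seed, beta=0.85, tau=1e-4, rho=1.0):
    """adj: dict mapping node -> list of neighbors (undirected graph); seed: start node."""
    degree = {u: len(adj[u]) for u in adj}
    x = {}                               # sparse solution vector
    r = {seed: 1.0 - beta}               # sparse residual, r^(1) = (1 - beta) * e_seed
    queue = [seed]                       # nodes that may violate r_j <= tau * d_j
    while queue:
        j = queue.pop()
        rj, dj = r.get(j, 0.0), degree[j]
        if rj <= tau * dj:               # already satisfied; nothing to push
            continue
        amount = rj - tau * dj * rho     # mass moved into x at node j
        x[j] = x.get(j, 0.0) + amount
        r[j] = tau * dj * rho            # residual at j after the push
        for i in adj[j]:                 # spread beta * amount evenly over j's neighbors
            r[i] = r.get(i, 0.0) + beta * amount / dj
            if r[i] > tau * degree[i]:
                queue.append(i)
    return x                             # absent (zero) on most nodes: a sparse approximation
```

For example, `push_pagerank({0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}, seed=0)` returns a dictionary that is nonzero only on nodes the diffusion actually reached, which is the sparsity highlighted on the following slides.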
  6. … demo of push …
  7. Why do we care about push? (1) It is used for empirical studies of "communities". (2) It is used for "fast PageRank" approximation. It produces sparse approximations to PageRank! Example: Newman's netscience graph, 379 vertices, 1828 nonzeros. [Figure: the solution is "zero" on most of the nodes; $v$ has a single one here.]
  8. Our question: why does the "push method" have such incredible empirical utility?
  9. The O(correct) answer: (1) PageRank is related to the Laplacian; (2) the Laplacian is related to cuts; (3) Andersen, Chung & Lang provide the "right" bounds and "localization". This talk: the Θ(correct) answer? A deeper insight into the relationship.
  10. Intellectually indebted to … Chin, Mądry, Miller & Peng [2013]; Orecchia & Zhu [2014].
  11. The s-t min-cut problem:
     minimize $\|Bx\|_{C,1} = \sum_{ij \in E} C_{i,j} |x_i - x_j|$
     subject to $x_s = 1$, $x_t = 0$, $x \geq 0$,
     where $B$ is the unweighted incidence matrix and $C$ is the diagonal capacity matrix.
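Because the objective is a weighted 1-norm of differences, this is a linear program; the following sketch (not from the talk) solves it with SciPy by adding one auxiliary variable per edge. The edge-list input format and the tiny example graph are assumptions for illustration.

```python
# Sketch: the s-t min-cut LP, minimize sum_e C_e |x_i - x_j| s.t. x_s = 1, x_t = 0, x >= 0.
# Variables: [x_1..x_n, y_1..y_m] with y_e >= |x_i - x_j| enforced by two inequalities per edge.
import numpy as np
from scipy.optimize import linprog

def st_mincut_lp(edges, capacities, n, s, t):
    m = len(edges)
    c = np.concatenate([np.zeros(n), np.asarray(capacities, float)])  # minimize sum_e C_e y_e
    A_ub = np.zeros((2 * m, n + m))
    for e, (i, j) in enumerate(edges):
        A_ub[2 * e, [i, j, n + e]] = [1, -1, -1]       #  x_i - x_j - y_e <= 0
        A_ub[2 * e + 1, [i, j, n + e]] = [-1, 1, -1]   # -x_i + x_j - y_e <= 0
    bounds = [(0, None)] * (n + m)
    bounds[s], bounds[t] = (1, 1), (0, 0)              # x_s = 1, x_t = 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * m), bounds=bounds)
    return res.x[:n], res.fun                          # relaxed cut indicator and cut value

# Tiny example: path s - a - t with capacities 1 and 2; the min-cut value is 1.
x, cut = st_mincut_lp(edges=[(0, 1), (1, 2)], capacities=[1.0, 2.0], n=3, s=0, t=2)
print(cut)
```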
  12. The localized cut graph. Related to a construction used in "FlowImprove", Andersen & Lang (2007), and Orecchia & Zhu (2014). Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree:
     $A_S = \begin{bmatrix} 0 & \alpha d_S^T & 0 \\ \alpha d_S & A & \alpha d_{\bar{S}} \\ 0 & \alpha d_{\bar{S}}^T & 0 \end{bmatrix}$
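A minimal sketch (not the speaker's code) of building $A_S$ as a dense matrix; the vertex ordering [s, original vertices, t] and the dense representation are assumptions.

```python
# Sketch: build the localized cut graph A_S for a seed set S, as on the slide.
import numpy as np

def localized_cut_graph(A, S, alpha):
    """A: symmetric adjacency matrix (numpy array); S: list of seed vertex indices."""
    n = A.shape[0]
    d = A.sum(axis=1)                        # degree vector
    in_S = np.zeros(n, dtype=bool); in_S[S] = True
    dS = np.where(in_S, d, 0.0)              # degrees restricted to S (edges to the source s)
    dSbar = np.where(~in_S, d, 0.0)          # degrees restricted to S-bar (edges to the sink t)
    AS = np.zeros((n + 2, n + 2))            # ordering: [s, original vertices, t]
    AS[1:n+1, 1:n+1] = A
    AS[0, 1:n+1] = AS[1:n+1, 0] = alpha * dS
    AS[n+1, 1:n+1] = AS[1:n+1, n+1] = alpha * dSbar
    return AS
```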
  13. The localized cut graph. Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree. Its incidence matrix is
     $B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$
     Solve the s-t min-cut:
     minimize $\|B_S x\|_{C(\alpha),1}$ subject to $x_s = 1$, $x_t = 0$, $x \geq 0$.
  14. The localized cut graph. Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.
     $B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$
     Solve the "electrical flow" s-t min-cut:
     minimize $\|B_S x\|_{C(\alpha),2}$ subject to $x_s = 1$, $x_t = 0$.
  15. s-t min-cut → PageRank. The PageRank vector $z$ that solves $(\alpha D + L)\, z = \alpha v$ with $v = d_S / \mathrm{vol}(S)$ is a renormalized solution of the electrical cut computation:
     minimize $\|B_S x\|_{C(\alpha),2}$ subject to $x_s = 1$, $x_t = 0$.
     Specifically, if $x$ is the solution, then $x = \begin{bmatrix} 1 \\ \mathrm{vol}(S)\, z \\ 0 \end{bmatrix}$.
     Proof: square and expand the objective into a Laplacian, then apply the constraints.
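A small numerical check of this statement on an assumed toy graph (a sketch, not from the talk). Squaring the 2-norm objective turns the electrical-flow problem into a Laplacian solve whose interior block is $(1+\alpha)D - A$ with right-hand side $\alpha d_S$:

```python
# Sketch: check that x (electrical flow on the localized cut graph) equals vol(S) * z (PageRank).
import numpy as np

A = np.array([[0., 1, 1, 0], [1., 0, 1, 0], [1., 1, 0, 1], [0., 0, 1, 0]])
S, alpha = [0, 1], 0.5
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A
dS = np.zeros(len(d)); dS[S] = d[S]
vol_S = dS.sum()

z = np.linalg.solve(alpha * D + L, alpha * dS / vol_S)   # PageRank with v = d_S / vol(S)
x = np.linalg.solve((1 + alpha) * D - A, alpha * dS)     # interior of the electrical-flow solve
print(np.allclose(x, vol_S * z))                         # expected: True
```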
  16. PageRank → s-t min-cut. That equivalence works if $v$ is degree-weighted. What if $v$ is the uniform vector? Use the cut graph defined by a general seed vector $s$:
     $A(s) = \begin{bmatrix} 0 & \alpha s^T & 0 \\ \alpha s & A & \alpha (d - s) \\ 0 & \alpha (d - s)^T & 0 \end{bmatrix}$.
  17. And beyond … Easy to cook up interesting diffusion-like problems and adapt them to this framework. In particular, Zhou et al. (2004) gave a semi-supervised learning diffusion we study soon:
     $\begin{bmatrix} 0 & e_S^T & 0 \\ e_S & \theta A & e_{\bar{S}} \\ 0 & e_{\bar{S}}^T & 0 \end{bmatrix}$, which corresponds to solving $(I + \theta L)\, x = e_S$.
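A minimal sketch of that diffusion (assumed toy setup, not the talk's code); the electrical-flow view of this cut graph reduces to a single linear solve:

```python
# Sketch: the Zhou et al.-style diffusion from the slide, (I + theta * L) x = e_S.
import numpy as np

def zhou_diffusion(A, S, theta):
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian of the (weighted) graph
    e_S = np.zeros(n); e_S[S] = 1.0         # indicator vector of the seed set S
    return np.linalg.solve(np.eye(n) + theta * L, e_S)
```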
  18. Back to the push method. Let $x$ be the output from the push method with $0 < \beta < 1$, $v = d_S/\mathrm{vol}(S)$, $\rho = 1$, and $\tau > 0$. Set $\alpha = (1-\beta)/\beta$ and $\kappa = \tau\,\mathrm{vol}(S)/\beta$, and let $z_G$ solve
     minimize $\tfrac{1}{2}\, \|B_S z\|^2_{C(\alpha),2} + \kappa \|D z\|_1$ subject to $z_s = 1$, $z_t = 0$, $z \geq 0$,
     where $z = \begin{bmatrix} 1 \\ z_G \\ 0 \end{bmatrix}$. Then $x = D z_G / \mathrm{vol}(S)$.
     Proof: write out the KKT conditions and show that the push method solves them (the slackness conditions were the "tricky" part). The $\ell_1$ term is a regularization that induces sparsity; the normalization by $\mathrm{vol}(S)$ is needed to match the push output.
  19. … demo of equivalence …
  20. This is a case of Algorithmic Anti-differentiation!
  21. The ideal world. Given problem P → derive solution characterization C → show algorithm A finds a solution where C holds → profit? Example: given "min-cut", derive "max-flow is equivalent to min-cut", show push-relabel solves max-flow, profit!
  22. (The ideal world)′. Given problem P → derive approximate solution characterization C′ → show algorithm A′ quickly finds a solution where C′ holds → profit? Example: given "sparsest-cut", derive the Rayleigh-quotient approximation, show the power method finds a good Rayleigh quotient, profit?
  23. The real world? Given task P → hack around until you find something useful → write a paper presenting "novel heuristic" H for P → profit! Example: given "find-communities", hack around, ??? (hidden) ???, write a paper presenting "three matvecs finds real-world communities", profit!
  24. The real world: Algorithmic Anti-differentiation. Given heuristic H, is there a problem P′ such that H is an algorithm for P′ (e.g. Mahoney & Orecchia)? To understand why H works: derive a characterization of heuristic H, guess and check until you find something H solves, and show that heuristic H solves P′. Example: given "find-communities", hack around, write a paper presenting "three matvecs finds real-world communities", profit!
  25. The real world: Algorithmic Anti-differentiation. Given heuristic H, is there a problem P′ such that H is an algorithm for P′? If your algorithm is related to optimization, this asks: given a procedure X, what objective does it optimize? In the unconstrained case, this is just "anti-differentiation!"
  26. Algorithmic anti-differentiation in the literature. Dhillon et al. (2007): spectral clustering, trace minimization & kernel k-means. Saunders (1995): LSQR & Craig iterative methods.
  27. Why does it matter? These details matter in many empirical studies, and can dramatically impact performance (speed or quality).
  28. Semi-supervised learning on graphs. Build a weighted graph on the data points $d_i$ using the Gaussian kernel
     $A_{i,j} = \exp\left( -\frac{\|d_i - d_j\|_2^2}{2 \sigma^2} \right)$,
     shown for $\sigma = 2.5$ and $\sigma = 1.25$. Zhou et al., NIPS (2003).
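A minimal sketch of the kernel construction (the data layout, rows of a NumPy array holding the flattened images, is an assumption):

```python
# Sketch: build the Gaussian-kernel weighted graph A_ij = exp(-||d_i - d_j||^2 / (2 sigma^2)).
import numpy as np

def gaussian_kernel_graph(X, sigma=2.5):
    """X: (n_points, n_features) array; returns the dense n x n weight matrix."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)    # drop self-loops
    return A
```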
  29. Semi-supervised learning on graphs ($\sigma = 2.5$, $\sigma = 1.25$). Experiment: predict the unlabeled images from the labeled ones.
  30. Semi-supervised learning on graphs. Kernels compared:
     $K_1 = (I - A)^{-1}$, $K_2 = (D - A)^{-1}$, $K_3 = (\mathrm{Diag}(Ae) - A)^{-1}$ (our new "kernel").
     Predictions: $Y = K_i L$, where $L$ contains indicators on the revealed labels, and $y = \arg\max_j Y$.
     Experiment: vary the number of labeled images and track performance.
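A minimal sketch of the prediction rule $Y = K_i L$, $y = \arg\max_j Y$ (not the authors' code); how each kernel $K_i$ is formed and regularized, and the one-hot label encoding, are assumptions here.

```python
# Sketch: semi-supervised prediction Y = K L, y = argmax_j Y, with one-hot revealed labels.
import numpy as np

def ssl_predict(K, labels, revealed, n_classes):
    """K: n x n kernel matrix; labels: array of class ids; revealed: indices with known labels."""
    n = K.shape[0]
    L = np.zeros((n, n_classes))
    for i in revealed:
        L[i, labels[i]] = 1.0        # indicator on a revealed label
    Y = K @ L                        # diffuse the revealed labels through the kernel
    return Y.argmax(axis=1)          # predicted class for every node
```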
  31. Semi-supervised learning on graphs. Experiment: vary the number of labeled images and track performance ($Y = K_i L$, $y = \arg\max_j Y$). [Figure: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and RK3 (regularized $K_3$) at $\sigma = 1.25$.] Zhou et al., NIPS (2004).
  32. Semi-supervised learning on graphs. The same experiment at $\sigma = 2.5$. [Figure annotations: regularized $K_3$ (our new value) and a random-guessing baseline.]
  33. Semi-supervised learning on graphs. [Figure: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and RK3 at $\sigma = 2.5$, with the random-guessing baseline marked; regularized $K_3$ is our new value.]
  34. What's happening? [Figure: ROC curves (true positive vs. false positive rate) for the "2 vs. 1,2,3,4" task at $\sigma = 2.50$ and $\sigma = 1.25$, comparing $K_1$, $K_2$, $K_3$, and RK3.] Much better performance!
  35. The results of our regularized estimate. [Figure.]
  36. Why does it matter? Theory has the answer! We "sweep" over cuts from approximate eigenvectors; it's the order, not the values, that matters. (A sweep-cut sketch follows below.)
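A minimal sketch of a standard sweep cut (an assumed reading of the "sweep" step; not the speaker's code): sort vertices by the diffusion vector and keep the best-conductance prefix, so only the ordering of the values matters.

```python
# Sketch: sweep cut over a vector x on a graph with adjacency matrix A (no self-loops assumed).
import numpy as np

def sweep_cut(A, x):
    d = A.sum(axis=1)
    vol_G = d.sum()
    order = np.argsort(-x)                    # vertices in decreasing order of x
    in_set = np.zeros(len(x), dtype=bool)
    cut = vol = 0.0
    best_phi, best_k = np.inf, 0
    for k, u in enumerate(order, start=1):
        edges_in = A[u, in_set].sum()         # weight from u into the current prefix set
        in_set[u] = True
        vol += d[u]
        cut += d[u] - 2.0 * edges_in          # u's remaining edges are cut; these become internal
        denom = min(vol, vol_G - vol)
        phi = cut / denom if denom > 0 else np.inf
        if phi < best_phi:
            best_phi, best_k = phi, k
    return order[:best_k], best_phi           # the best prefix set and its conductance
```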
  37. Improved performance. $Y = K_i L$, $y = \arg\min_j \mathrm{SortedRank}(Y)$, with regularized $K_3$; we have spent no time tuning the regularization parameter. [Figure: error rate vs. number of labels for $K_1 = (I - A)^{-1}$, $K_2 = (D - A)^{-1}$, $K_3 = (\mathrm{Diag}(Ae) - A)^{-1}$, and RK3 (our new value) at $\sigma = 2.5$.]
  38. Recap & conclusions. New relationships between localized cuts & PageRank. A new understanding of the PPR push procedure. Improvements to semi-supervised learning on graphs. [Figure: sparsity of the push solutions, with 16, 15, 284, and 24 nonzeros.] Open issues: a better treatment of directed graphs? An algorithm for $\rho < 1$? ($\rho$ is set to ½ in most "uses"; this needs new analysis.)
