Fast Katz and Commuters
Upcoming SlideShare
Loading in...5
×
 

Fast Katz and Commuters

on

  • 935 views

 

Statistics

Views

Total Views
935
Views on SlideShare
935
Embed Views
0

Actions

Likes
0
Downloads
7
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Fast Katz and Commuters Fast Katz and Commuters Presentation Transcript

    • FAST KATZ AND COMMUTERS Quadrature Rules and Sparse Linear Solvers for Link Prediction Heuristics David F. Gleich Sandia National Labs Berkeley LAPACK Seminar February 3, 2011 With Pooya Esfandiar, Francesco Bonchi, Chen Grief, Laks V. S. Lakshmanan, and Byung-Won OnSandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.David F. Gleich (Sandia) Berkeley SCMC Seminar 1 / 43
    • MAIN RESULTSA – adjacency matrix For KatzL – Laplacian matrix Compute one       fast Compute top       fastKatz score :                         Commute time:                For CommuteCompute one       fastDavid F. Gleich (Sandia) Berkeley SCMC Seminar 2 of 43
    • OUTLINEWhy study these measures?Katz Rank and Commute TimeHow else do people compute them?Quadrature rules for pairwise scoresSparse linear systems solves for top-kAs many results as we have time for…David F. Gleich (Sandia) Berkeley SCMC Seminar 3 of 43
    • WHY? LINK PREDICTION Neighborhood based Path based Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficientDavid F. Gleich (Sandia) Berkeley SCMC Seminar 4 of 43
    • NOTEAll graphs are undirectedAll graphs are connectedDavid F. Gleich (Sandia) Berkeley SCMC Seminar 5 of 43
    • LEO KATZDavid F. Gleich (Sandia) Berkeley SCMC Seminar 6 of 43
    • NOT QUITE, WIKIPEDIA    : adjacency,                     : random walk PageRank               Katz              These are equivalent if     has constantdegreeDavid F. Gleich (Sandia) Berkeley SCMC Seminar 7 of 43
    • WHAT KATZ ACTUALLY SAID “we assume that each link independently has the same probability of being effective” … “we conceive a constant     , depending on the group and the context of the particular investigation, which has the force of a probability of effectiveness of a single link. A k-step chain then, has probability       of being effective.” “We wish to find the column sums of the matrix” Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43David F. Gleich (Sandia) Berkeley SCMC Seminar 8 of 43
    • A MODERN TAKEThe Katz score (node-based) is           The Katz score (edge-based) is           David F. Gleich (Sandia) Berkeley SCMC Seminar 9 of 43
    • RETURNING TO THE MATRIX                                         Carl Neumann                            David F. Gleich (Sandia) Berkeley SCMC Seminar 10 of 43
    • Carl Neumann I’ve heard the Neumann series called the “von Neumann” series more than I’d like! In fact, the von Neumann kernel of a graph should be named the “Neumann” kernel! Wikipedia pageDavid F. Gleich (Sandia) Berkeley SCMC Seminar 11 / 43
    • PROPERTIES OF KATZ’S MATRIX    is symmetric    exists when                                      is spd when                      Note that         1/max-degree sufficesDavid F. Gleich (Sandia) Berkeley SCMC Seminar 12 of 43
    • COMMUTE TIME Picture taken from Google images, seems to be Bay Bridge Traffic by Jim M. Goldstein.David F. Gleich (Sandia) Berkeley SCMC Seminar 13 / 43
    • COMMUTE TIMEConsider a uniform random walk on agraph                     Also called the hitting time from node i to j, or the first transition time                                                     David F. Gleich (Sandia) Berkeley SCMC Seminar 14 of 43
    • SKIPPING DETAILS                    : graph Laplacian                                             is the only null-vectorDavid F. Gleich (Sandia) Berkeley SCMC Seminar 15 of 43
    • WHAT DO OTHER PEOPLE DO?1) Just work with the linear algebra formulations2) For Katz, Truncate the Neumann series as a few (3-5) terms3) Use low-rank approximations from EVD(A) or EVD(L)4) For commute, use Johnson-Lindenstrauss inspired random sampling5) Approximately decompose into smaller problems Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009, Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007David F. Gleich (Sandia) Berkeley SCMC Seminar 16 of 43
    • THE PROBLEM All of these techniques are preprocessing based because most people’s goal is to compute all the scores. We want to avoid preprocessing the graph. There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverseDavid F. Gleich (Sandia) Berkeley SCMC Seminar 17 of 43
    • WHY NO PREPROCESSING? The graph is constantly changing as I rate new movies.David F. Gleich (Sandia) Berkeley SCMC Seminar 18 of 43
    • WHY NO PREPROCESSING? Top-k predicted “links” are movies to watch! Pairwise scores give user similarityDavid F. Gleich (Sandia) Berkeley SCMC Seminar 19 of 43
    • PAIRWISE ALGORITHMS Katz            Commute                                         Golub and Meurant to the rescue!David F. Gleich (Sandia) Berkeley SCMC Seminar 20 of 43
    • MMQ - THE BIG IDEAQuadratic form                 Think                       Weighted sum           A is s.p.d. use EVD   Stieltjes integral           “A tautology”   Quadrature approximation             Matrix equation         LanczosDavid F. Gleich (Sandia) Berkeley SCMC Seminar 21 of 43
    • LANCZOS        , k-steps of the Lanczos methodproduce                                    and                              =              David F. Gleich (Sandia) Berkeley SCMC Seminar 22 of 43
    • PRACTICAL LANCZOSOnly need to store the last 2 vectors in      Updating requires O(matvec) work      is not orthogonalDavid F. Gleich (Sandia) Berkeley SCMC Seminar 23 of 43
    • MMQ PROCEDUREGoal                        Given                        1. Run k-steps of Lanczos on     starting with    2. Compute       ,     with an additional eigenvalue at     , set            Correspond to a Gauss-Radau rule, with u as a prescribed node3. Compute     ,     with an additional eigenvalue at   , set           Correspond to a Gauss-Radau rule, with l as a prescribed node4. Output               as lower and upper bounds on    David F. Gleich (Sandia) Berkeley SCMC Seminar 24 of 43
    • PRACTICAL MMQIncrease k to become more accurateBad eigenvalue bounds yield worse results      and     are easy to compute    not required, we can iterativelyupdate it’s LU factorizationDavid F. Gleich (Sandia) Berkeley SCMC Seminar 25 of 43
    • PRACTICAL MMQDavid F. Gleich (Sandia) Berkeley SCMC Seminar 26 of 43
    • ONE LAST STEP FOR KATZ Katz                                                                            David F. Gleich (Sandia) Berkeley SCMC Seminar 27 of 43
    • KATZ SCORES ARE LOCALIZED Up to 50 neighbors is 99.65% of the total massDavid F. Gleich (Sandia) Berkeley SCMC Seminar 28 of 43
    • TOP-K ALGORITHM FOR KATZApproximate                   where     is sparseKeep     sparse tooIdeally, don’t “touch” all of    David F. Gleich (Sandia) Berkeley SCMC Seminar 29 of 43
    • INSPIRATION - PAGERANK Approximate                    where     is sparse Keep     sparse too? YES! Ideally, don’t “touch” all of     ? YES!McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too. David F. Gleich (Sandia) Berkeley SCMC Seminar 30 of 43
    • THE ALGORITHM - MCSHERRYFor              Start with the Richardson iteration                            Rewrite                    Richardson converges if                        David F. Gleich (Sandia) Berkeley SCMC Seminar 31 of 43
    • THE ALGORITHMNote     is sparse.If                 , then         is sparse.Idea only add one component of         to        David F. Gleich (Sandia) Berkeley SCMC Seminar 32 of 43
    • THE ALGORITHMFor              Init:                                                                How to pick   ?David F. Gleich (Sandia) Berkeley SCMC Seminar 33 of 43
    • THE ALGORITHM FOR KATZFor                            Init:                                                                       Pick   as max     Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for moreDavid F. Gleich (Sandia) Berkeley SCMC Seminar 34 of 43
    • CONVERGENCE?The “standard” proof works for        1/max-degree.For                       , then for symmetric     ,this algorithm is the Gauss-Southwellprocedure and it still converges.David F. Gleich (Sandia) Berkeley SCMC Seminar 35 of 43
    • RESULTS – DATA, PARAMETERS All unweighted, connected graphs Easy                                     Hard                            David F. Gleich (Sandia) Berkeley SCMC Seminar 36 of 43
    • KATZ BOUND CONVERGENCEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 37 of 43
    • COMMUTE BOUND CONVERG.David F. Gleich (Sandia) Berkeley SCMC Seminar 38 of 43
    • KATZ SET CONVERGENCE For arXiv graph.David F. Gleich (Sandia) Berkeley SCMC Seminar 39 of 43
    • TIMINGDavid F. Gleich (Sandia) Berkeley SCMC Seminar 40 of 43
    • CONCLUSIONSThese algorithms are faster than manyalternatives.For pairwise commute, stopping criteriaare simpler with boundsFor top-k problems, we often need lessthan 1 matvec for good enough resultsDavid F. Gleich (Sandia) Berkeley SCMC Seminar 41 of 43
    • WARTS AND TODOSStopping criteria on our top-k algorithmcan be a bit hairy… we should refine it.The top-k approach doesn’t work right forcommute time… we have an alternative.                        requires the entire diagonal of      Take advantage of new research to “seed”commute time better! von Luxburg et al. NIPS2010David F. Gleich (Sandia) Berkeley SCMC Seminar 42 of 43
    • Paper at WAW2010 Slides should be online soon Code is online already stanford.edu/~dgleich/ publications/2010/codes/fast-katz By AngryDogDesign on DeviantArtDavid F. Gleich (Sandia) Berkeley SCMC Seminar 43
    • F-MEASUREDavid F. Gleich (Sandia) Berkeley SCMC Seminar 44 of 43
    • MAIN RESULTS – SLIDE ONEA – adjacency matrixL – Laplacian matrixKatz score :                         Commute time:                 David F. Gleich (Sandia) Berkeley SCMC Seminar 45 of 43
    • MAIN RESULTS – SLIDE TWOFor Katz Compute one       fast Compute top       fastFor Commute Compute one       fastFor almost commute Compute top $F_{i,:}$ fastDavid F. Gleich (Sandia) Berkeley SCMC Seminar 46 of 43
    • MAIN RESULTS – SLIDE THREEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 47 of 43
    • KATZ BOUND CONVERGENCEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 48 of 43
    • KATZ ERROR CONVERGENCEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 49 of 43
    • COMMUTE BOUND CONVERG.David F. Gleich (Sandia) Berkeley SCMC Seminar 50 of 43
    • COMMUTE ERROR CONVERG.David F. Gleich (Sandia) Berkeley SCMC Seminar 51 of 43
    • KATZ SET CONVERGENCEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 52 of 43
    • KATZ ORDER CONVERGENCEDavid F. Gleich (Sandia) Berkeley SCMC Seminar 53 of 43