Fast Katz and Commuters - Quadrature Rules and Sparse Linear Solvers
for Link Prediction Heuristics
Motivated by social network data mining problems such as link
prediction and collaborative filtering, significant research effort
has been devoted to computing topological measures including the Katz
score and the commute time. Existing approaches approximate all
pairwise relationships simultaneously. We are interested in
computing
the score for a single pair of nodes;
the top-k nodes with the best scores from a given source node.
For the pairwise problem, we introduce an iterative algorithm that
computes upper and lower bounds for the measures we seek. This
algorithm exploits a relationship between the Lanczos process and a
quadrature rule.
For the top-k problem, we propose an algorithm that only accesses a
small portion of the graph, similar to algorithms used in personalized
PageRank computing. To test scalability and accuracy, we experiment
with three real-world networks and find that our algorithms run in
milliseconds to seconds without any preprocessing.
1. Tweet along @dgleich
FAST KATZ AND COMMUTERS
Quadrature Rules and Sparse Linear Solvers
for Link Prediction Heuristics
David F. Gleich
Sandia National Labs
la/opt seminar
October 14th 2010
With Pooya Esfandiar, Francesco Bonchi, Chen Grief,
LaksV. S. Lakshmanan, and Byung-Won On
David F. Gleich (Sandia) ICME la/opt seminar 1 / 50
2. Tweet along @dgleich
MAIN RESULTS – SLIDE ONE
A – adjacency matrix
L – Laplacian matrix
Katz score :
Commute time:
David F. Gleich (Sandia) ICME la/opt seminar 2 of 50
3. Tweet along @dgleich
MAIN RESULTS – SLIDE TWO
For Katz Compute one fast
Compute top fast
For Commute
Compute one fast
For almost commute
Compute top fast
David F. Gleich (Sandia) ICME la/opt seminar 3 of 50
5. Tweet along @dgleich
OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…
David F. Gleich (Sandia) ICME la/opt seminar 5 of 50
6. Tweet along @dgleich
WHY? LINK PREDICTION
David F. Gleich (Sandia) ICME la/opt seminar
Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
Neighborhood based
Path based
6 of 50
7. Tweet along @dgleich
NOTE
All graphs are undirected
All graphs are connected
David F. Gleich (Sandia) ICME la/opt seminar 7 of 50
9. Tweet along @dgleich
NOT QUITE,WIKIPEDIA
: adjacency, : random walk
PageRank
Katz
These are equivalent if has constant
degree
David F. Gleich (Sandia) ICME la/opt seminar 9 of 50
10. Tweet along @dgleich
WHAT KATZ ACTUALLY SAID
Leo Katz 1953,A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant , depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link.A k-step chain
then, has probability of being effective.”
“We wish to find the column sums of the matrix”
David F. Gleich (Sandia) ICME la/opt seminar 10 of 50
11. Tweet along @dgleich
A MODERN TAKE
The Katz score (node-based) is
The Katz score (edge-based) is
David F. Gleich (Sandia) ICME la/opt seminar 11 of 50
13. Tweet along @dgleich
Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
David F. Gleich (Sandia) ICME la/opt seminar
Wikipedia page
13 / 50
14. Tweet along @dgleich
PROPERTIES OF KATZ’S MATRIX
is symmetric
exists when
is sym. pos. def. when
Note that 1/max-degree suffices
David F. Gleich (Sandia) ICME la/opt seminar 14 of 50
15. Tweet along @dgleich
COMMUTE TIME
Consider a uniform random walk on a
graph
David F. Gleich (Sandia) ICME la/opt seminar
Also called the hitting
time from node i to j, or
the first transition time
15 of 50
16. Tweet along @dgleich
SKIPPING DETAILS
: graph Laplacian
is the only null-vector
David F. Gleich (Sandia) ICME la/opt seminar 16 of 50
17. Tweet along @dgleich
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, Truncate the Neumann series as a few (3-5)
terms (I’m searching for this ref.)
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired
random sampling
5) Approximately decompose into smaller problems
David F. Gleich (Sandia) ICME la/opt seminar
Liben-Nowll and Kleinberg CIKM2003,Acar et al. ICDM2009,Spielman and Srivastava STOC2008,Sarkar and Moore UAI2007
17 of 50
18. Tweet along @dgleich
THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
David F. Gleich (Sandia) ICME la/opt seminar
There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
18 of 50
19. Tweet along @dgleich
WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
David F. Gleich (Sandia) ICME la/opt seminar 19 of 50
20. Tweet along @dgleich
WHY NO PREPROCESSING?
David F. Gleich (Sandia) ICME la/opt seminar
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
20 of 50
21. Tweet along @dgleich
PAIRWISE ALGORITHMS
Katz
Commute
David F. Gleich (Sandia) ICME la/opt seminar
Golub and Meurant
to the rescue!
21 of 50
22. Tweet along @dgleich
MMQ - THE BIG IDEA
Quadratic form
Weighted sum
Stieltjes integral
Quadrature approximation
Matrix equation
David F. Gleich (Sandia) ICME la/opt seminar
Think
A is s.p.d. use EVD
“A tautology”
Lanczos
22 of 50
23. Tweet along @dgleich
LANCZOS
, $k$-steps of the Lanczos method
produce
and
David F. Gleich (Sandia) ICME la/opt seminar
=
23 of 50
24. Tweet along @dgleich
PRACTICAL LANCZOS
Only need to store the last 2 vectors in
Updating requires O(matvec) work
is not orthogonal
David F. Gleich (Sandia) ICME la/opt seminar 24 of 50
25. Tweet along @dgleich
MMQ PROCEDURE
Goal
Given
1. Run k-steps of Lanczos on starting with
2. Compute , with an additional eigenvalue at ,
set
3. Compute , with an additional eigenvalue at , set
4. Output as lower and upper bounds on b
David F. Gleich (Sandia) ICME la/opt seminar
Correspond to a Gauss-Radau rule, with
u as a prescribed node
Correspond to a Gauss-Radau rule, with
l as a prescribed node
25 of 50
26. Tweet along @dgleich
PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
and are easy to compute
not required, we can iteratively
update it’s LU factorization
David F. Gleich (Sandia) ICME la/opt seminar 26 of 50
28. Tweet along @dgleich
ONE LAST STEP FOR KATZ
Katz
David F. Gleich (Sandia) ICME la/opt seminar 28 of 50
29. Tweet along @dgleich
TOP-K ALGORITHM FOR KATZ
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
David F. Gleich (Sandia) ICME la/opt seminar 29 of 50
30. Tweet along @dgleich
INSPIRATION - PAGERANK
Approximate
where is sparse
Keep sparse too? YES!
Ideally, don’t “touch” all of ? YES!
David F. Gleich (Sandia) ICME la/opt seminar
McSherryWWW2005, Berkhin 2007,Anderson et al. FOCS2008 –Thanks to Reid Anderson for telling me McSherry did this too.
30 of 50
31. Tweet along @dgleich
THE ALGORITHM - MCSHERRY
For
Start with the Richardson iteration
Rewrite
Richardson converges if
David F. Gleich (Sandia) ICME la/opt seminar 31 of 50
32. Tweet along @dgleich
THE ALGORITHM
Note is sparse.
If , then is sparse.
Idea
only add one component of to
David F. Gleich (Sandia) ICME la/opt seminar 32 of 50
33. Tweet along @dgleich
THE ALGORITHM
For
Init:
How to pick ?
David F. Gleich (Sandia) ICME la/opt seminar 33 of 50
34. Tweet along @dgleich
THE ALGORITHM FOR KATZ
For
Init:
Pick as max
David F. Gleich (Sandia) ICME la/opt seminar
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
34 of 50
35. Tweet along @dgleich
CONVERGENCE?
If you pick as the maximum element, we
can show this is convergent if Richardson
converges. This proof requires to be
symmetric positive definite.
David F. Gleich (Sandia) ICME la/opt seminar 35 of 50
36. Tweet along @dgleich
RESULTS - DATA
All unweighted, connected graphs
David F. Gleich (Sandia) ICME la/opt seminar 36 of 50
38. Tweet along @dgleich
PAIRWISE RESULTS
Katz upper and lower bounds
Katz error convergence
Commute-time upper and lower bounds
Commute-time error convergence
For the arXiv graph here
David F. Gleich (Sandia) ICME la/opt seminar 38 of 50
43. Tweet along @dgleich
TOP-K RESULTS
Katz set convergence
Katz order convergence
For arXiv graph
David F. Gleich (Sandia) ICME la/opt seminar 43 of 50
46. Tweet along @dgleich
CONCLUSIONS
These algorithms are faster than many
alternatives.
For pairwise commute, stopping criteria
are simpler
For top-k, we often need less than 1
matvec for good enough results
David F. Gleich (Sandia) ICME la/opt seminar 46 of 50
47. Tweet along @dgleich
WARTS
Stopping criteria on our top-k algorithm
can be a bit hairy
The top-k approach doesn’t work right for
commute time
David F. Gleich (Sandia) ICME la/opt seminar 47 of 50
48. Tweet along @dgleich
TODO
Try on netflix data!
Explore our “almost commute measure
more”
David F. Gleich (Sandia) ICME la/opt seminar 48 of 50
50. Tweet along @dgleich
By AngryDogDesign on DeviantArt
Preprint available by request
Slides should be online soon
Code is online already
stanford.edu/~dgleich/
publications/2010/codes/fast-katz
David F. Gleich (Sandia) ICME la/opt seminar 50