Upcoming SlideShare
Loading in …5
×

# Numerical Linear Algebra for Data and Link Analysis

1,162 views
982 views

Published on

Talk at SIAM PP 2006

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
Your message goes here
• Be the first to comment

• Be the first to like this

No Downloads
Views
Total views
1,162
On SlideShare
0
From Embeds
0
Number of Embeds
52
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Numerical Linear Algebra for Data and Link Analysis

1. 1. Numerical Linear Algebra for Data and Link Analysis Leonid Zhukov, Yahoo! Inc David Gleich, Stanford
2. 2. Outline•  Intro•  Scale free graphs – Flickr example•  Page rank computing
3. 3. Scientific computing•  Engineering Computing •  Information Retrieval –  continuum problems, PDE –  Discrete data is given, no governed, control over control over resolution discretization –  No associated physical space –  2D or 3D (coordinates) –  Uniform distribution of node –  Not planar, triangulation degrees –  Power-low degree distribution –  “Small world” effect
4. 4. Large graphs•  Explicit: graphs & networks –  Web graph –  Internet –  Yahoo! Photo Sharing (Flickr) –  Yahoo! 360 (Social network)•  Implicit: transaction history, email, messenger –  Yahoo! Search marketing (Overture) –  Yahoo! Mail –  Yahoo Messenger•  Constructed: affinity between data points –  Yahoo! Music (Launch) –  Yahoo! Movies –  Yahoo! ….
5. 5. Flickr – social network
6. 6. Flickr: adjacency matrix 580,000 users, 3,500,000 links
7. 7. Flickr: reverse Cuthill-McKee 580,000 users, 3,500,000 links
8. 8. Flickr – a graph
9. 9. Flickr: some stats•  # nodes = 584,207•  # edges (nnz) = 3,555,115•  max in-degree = 3531•  max out-degree = 8976•  < in-degree > = < out-degree > = 6•  diameter = 18•  # strongly connected components = 152,324•  largest strongly connected components = 274,649 374 186 155 11•  # connected component = 43,189•  largest connected components = 404,893 378 112 108 103•  highest core number = 249 (size 668)
10. 10. Flickr: scale free graphin-degrees out-degrees
11. 11. Data and link analysis•  Web Search –  ranking of pages/hosts/domains –  graph algorithms –  web spam•  Social networks –  finding communities –  trust propagation•  Data mining –  clustering –  classification
12. 12. Data and links analysis•  Graph partitioning: –  “balanced”, for data distribution –  “natural boundaries”, for clustering / communities•  Numerical computing on graphs: –  Node ranking methods (PageRank) –  Dimensionality reduction (SVD, HITS) –  Low dimensional embeddings –  Graph interpolation / regularization
13. 13. What do we compute?•  Probability matrix:•  Graph Laplacian: •  SVD/LSI: Computing eigenvalues / eigenvectors, linear systems
14. 14. Yahoo! Search Marketing (Overture) View Bids Tool
15. 15. Overture data search \$ bid advertiser accessory desk .83 Office Max alfred hitchcock .01 Videoflicks.com educational software .13 Buy.com educational software .4 eBAY International AG educational software .17 OfficeMax epson .28 Buy.com epson .4 eBAY International AG fax .13 Buy.com fax .4 eBAY International AG fax .38 OfficeMax game .02 Net Business game .25 Buy.com george harrison .15 eBAY International AG george harrison .05 MusicStack george harrison .01 Videoflicks.com jackson michael .05 MusicStack jackson michael .01 Videoflicks.com
16. 16. Overture bid graph Terms accessory deskeducational software Office Max epson Buy.com fax game eBAY International AG george harrison Net Business jackson michael MusicStack alfred hitchcock Videoflicks.com Advertisers
17. 17. Overture bid matrix 1 2 3 4 5 6 accessory desk game 1. Net Business fax 2. Office Maxeducational software 3. Buy.com epson 4. eBAY george harrison 5. MusicStack 6. Videoflicks.com jackson michael alfred hitchcock
18. 18. 170,077 terms, 25,971 advertisers, 1,307,316 bids
19. 19. Overture bid matrix: 18-core2000 advertisers , 3000 search terms, 92,347 bids
20. 20. Spectral graph partitioning •  assign each node indicator pi = §1 1 2 3 4 •  partition p = {-1,-1,-1,-1, +1, +1, +1} 5 6 7 8 +1 -1 Relaxation procedure: =>Quadratic optimization: Min cut solution : Rounding off:
21. 21. Spectral: ordering Eigenvector Eigenvector – sortedAdjacency matrix Adjacency matrix – re-ordered perm = [1 2 6 7 3 8 4 5]
22. 22. Spectral : additions•  Recursively•  Min cut, ratio cut, normalized cut, quotient cut, …•  Sweep for optimal value•  Flow based refinement•  Optimize partition among several eigenvectors•  Bi-partite graphs•  + other bells & whistles
23. 23. Search Market Place
24. 24. Sample clustersA.S.C. Incorporated Atlantic Telecom Berent Associates Center for AudioLink Shyness and Social Therapy Cost Plus Electronics Dr. Puff Headset Express New York Psychotherapy Collective Hello Direct Ruby Shoes PDA Mountain Self-employed psychologist PK TECH INC‘ www.kabbalah.com‘ ………………….. …………… accessory phone health mental cordless headset help selfcordless headset phone improvement self free hands parenting‘ headset headset microphone headset phone headset plantronics headset wireless isdn plantronics‘ …………….
25. 25. Flickr: “10-core” spectral ordering
26. 26. Websearch Engines•  Crawler + indexer •  Front end # Qs PR 1 0.7 0.4 2 0.8 0.1 Link Analysis 3 0.1 0.5 Text Analysis PageRank Inverted Index
27. 27. PageRank model•  Random walk on graph•  Markov process: memory-less, homogeneous•  Stationary distribution: existence, uniqueness, convergencePerron-Frobenius theorem, irreducible, every state is reachable from every other, and aperiodic – no cycles
28. 28. PageRank•  Construct probability matrix:•  Construct transition row-stochastic matrix for Markov process •  Correct reducibility: •  Find stationary distribution ( exist & unique):c – teleportation coefficient, v – personalization vector, d – dangling nodes (sinks) indicator
29. 29. PageRank linear system•  Eigensystem:•  Linear system:
30. 30. Computational diagram PageRank ProblemEigensystem Linear system Power Jacobi GMRES BiCG BiCGSTAB Iterations Iterations Preconditioners Block Additive Jacobi Jacobi Schwarz PageRank Vector
31. 31. Data: WebGraphs Name # Nodes/size # Links / nnz Memory bs 0.3M 1.6M 20.4 MB edu 2M 14M 176MByahoo-r2 14M 266M 3.25GB uk 18.5M 300M 3.67GByahoo-r3 60M 850M 10.4GB db 70M 1B 12.3GB av 1.4B 6.6B 80GB Graph in memory, compressed storage, int, int, double
32. 32. Power law graphs ( y = x - b) ?bs: 0.3M nodes y2: 266 M nodes db: 1B nodes
33. 33. Computational Methods Error Metrics for Jacobi Convergence Error Metrics for BiCG Convergence 0 2 10 10 ||x(k) - x*|| ||x(k) - x*|| -1 (k) 10 ||A x - b|| ||A x(k) - b|| (k) * (k) ||x - x ||/||x || 10 0 ||x(k) - x*||/||x(k)|| -2 10 -3 10 -2Metric Value Metric Value 10 -4 10 -4 -5 10 10 -6 10 -6 10 -7 10 -8 -8 10 10 0 10 20 30 40 50 60 70 80 0 5 10 15 20 25 30 35 40 45 Iteration Iteration
34. 34. Convergence results 0 uk iteration convergence 0 db iteration convergence 10 10 std std 10 -1 jacobi 10 -1 jacobi gmres gmres -2 bicg -2 bicg 10 bcgs 10 bcgs -3 -3 10 10Error Error -4 -4 10 10 -5 -5 10 10 -6 -6 10 10 -7 -7 10 10 -8 10 -8 0 10 20 30 40 50 60 70 80 10 0 10 20 30 40 50 60 70 Iteration Iteration uk: 18.5 M pages, 300 M links db: 70 M pages, 1 B links.
35. 35. Convergence results 0 uk time convergence 0 db time convergence 10 10 std std 10 -1 jacobi -1 jacobi 10 gmres gmres -2 bicg -2 bicg 10 bcgs 10 bcgs -3 -3 10 10Error Error -4 -4 10 10 -5 -5 10 10 -6 10 -6 10 -7 10 10 -7 -8 10 10 -8 0 5 10 15 20 25 30 35 40 0 100 200 300 400 500 600 700 Time (sec) Time (sec) uk: 18.5 M pages, 300 M links db: 70 M pages, 1 B links.
36. 36. Summary Name Size Power Jacobi GMRES BiCG BCGS edu 2M 84 84 21* 44* 21*20 procs 14M 0.09/7.5s 0.07/6.5s 0.6/13.2s 0.4/17.7s 0.4/8.7syahoo-r2 14M 71 65 12* 35* 17*20 procs 266M 1.8/129s 1.9/126s 16/194s 8.6/300s 9.9/168s uk 18.5M 73 71 22* 25* 11*60 procs 300M 0.09/7s 0.1/10s 0.8/17.6s 0.8/19.4s 1.0/10.8syahoo-r3 60M 76 7560 procs 850M 1.6/119s 1.5/112s db 70M 62 58 29 45 15*60 procs 1B 9.0/557s 8.7/506s 15/432s 15/676s 15/220s av 1.4B 72 26140 procs 6.6B 4.6/333s 15/391s
37. 37. Reduced teleportation 2 Convergence and Conditioning for db 10 std gmres 0 c = 0.85 10 c = 0.90 c = 0.95 c = 0.99 Error/Residual -2 10 -4 10 -6 10 -8 10 0 20 40 60 80 100 120 140 160 180 200 Iterationdb: 70 M pages, 1 B links.
38. 38. YRL research cluster Custom Parallel PageRank PETSc Off the shelf MPI Gigabit Switch RLX Blades Dual 2.8 GHz Xeon 4 GB RAM Gigabit Ethernet 120 TotalRLX RLX RLX
39. 39. Typical graphout-degrees.in-degrees. bs: 70 M pages, 1 B links.
40. 40. Parallel Matrix-Vector Multiply =
41. 41. Graph distribution methods•  Balance rows•  Balance nnz•  Random destribution•  Balance rows and nnz (heuristics)•  2D distributions•  Graph partitioning•  Hypergraph partitioning
42. 42. Some parallel experiments
43. 43. Domain lexicographic sortinghttp://host.domain.tld/path ! http://tld.domain.host/path bs-cc: 17 k pages, 133 k links
44. 44. Domain ordering 0 Host Order Improvement 10 std-140 10 -1 bcgs-140 std-140-host -2 bcgs-140-host 10 -3 10 error/residual -4 10 -5 10 -6 10 -7 10 -8 10 0 100 200 300 400 500 600 700 800 900 Time (sec)av: 1.4 B pages, 6.6 B links.
45. 45. Full web graph parallelization Scaling for computing with full-web Performance Increase (Percent decrease in time) 250% std bcgs 200% 150% 100% 50% 0% 90% 100% 110% 120% 130% 140% 150% 160% 170% 140 186 226av: 1.4 B pages, 6.6 B links.
46. 46. FEM mesh for CFD computations
47. 47. Power-law graph embedding in 3D
48. 48. final thoughts•  Power-law graphs are different•  Not all power-law graphs are the same•  But all have central core, chains, singletons.•  Difficult to distribute / parallelize•  Its is experimental data•  Lots of cool applications•  Let’s talk …..
49. 49. References•  Spectral graph partitioning:- M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chang (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon ( 2001), A. Ng (2001), H. Zha (2001), C. Ding (2001)•  PageRank computing:- S.Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)