# Ph.D. Defense: Models and Algorithms for PageRank sensitivity

Published in: Technology, Education

### Ph.D. Defense: Models and Algorithms for PageRank sensitivity

1. 1. Models and Algorithms for PageRank Sensitivity. David F. Gleich, Stanford University. Ph.D. Oral Defense, Institute for Computational and Mathematical Engineering, May 26, 2009. Gleich (Stanford) Ph.D. Defense 1 / 41
2. 2. Outline: PageRank intro; Sensitivity; Random sensitivity; Inner-Outer; Summary
4. 4. Section: PageRank intro. (Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary.)
5. 5. A cartoon websearch primer
        1. Crawl webpages
        2. Analyze webpage text (information retrieval)
        3. Analyze webpage links
        4. Fit measures to human evaluations
        5. Produce rankings
        6. Continually update
6. 6. [Figure: link diagram with the labels "1", "2", "to", and "3".]
7. 7. PageRank by Google. The places we find the surfer most often are important pages. [Figure: example graph on nodes 1 to 6.]

        The Model:
        1. follow edges uniformly with probability $\alpha$, and
        2. randomly jump with probability $1 - \alpha$; we'll assume everywhere is equally likely.
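8. 8. Some PageRank details. For the six-node example graph, the transition matrix is

        $$P = \begin{pmatrix} 1/6 & 1/2 & 0 & 0 & 0 & 0 \\ 1/6 & 0 & 0 & 1/3 & 0 & 0 \\ 1/6 & 1/2 & 0 & 1/3 & 0 & 0 \\ 1/6 & 0 & 1/2 & 0 & 0 & 0 \\ 1/6 & 0 & 1/2 & 1/3 & 0 & 1 \\ 1/6 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}, \qquad P_{ij} \ge 0, \quad e^T P = e^T.$$

        The "jump" vector is $v = [1/n \; \dots \; 1/n]^T$ with $e^T v = 1$.

        - Markov chain: $\left(\alpha P + (1 - \alpha) v e^T\right) x = x$, with a unique $x$ such that $x_j \ge 0$, $e^T x = 1$.
        - Linear system: $(I - \alpha P) x = (1 - \alpha) v$.
        - Small detail: dangling nodes are patched back to $v$.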
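The Markov chain and linear system views above can be checked numerically on the six-node example. The sketch below is a minimal illustration, assuming the out-link structure read off the columns of $P$ on this slide and a uniform jump vector; the names `OUT_LINKS`, `matvec`, `pagerank`, and `linear_system_residual` are hypothetical helpers, not code from the talk.

```python
# Out-links of the six-node example, zero-indexed (slide node 1 is index 0).
# Assumption: read off the columns of P; node 1 has no out-links (dangling).
OUT_LINKS = [[], [0, 2], [3, 4], [1, 2, 4], [5], [4]]
N = len(OUT_LINKS)


def matvec(x):
    """Return P x for the column-stochastic matrix P of the example graph."""
    y = [0.0] * N
    for j, targets in enumerate(OUT_LINKS):
        if targets:
            share = x[j] / len(targets)
            for i in targets:
                y[i] += share
        else:
            # Dangling node: its mass is patched back to the uniform vector v.
            for i in range(N):
                y[i] += x[j] / N
    return y


def pagerank(alpha, tol=1e-12, max_iter=100000):
    """Power method for x = alpha P x + (1 - alpha) v with uniform v."""
    v = [1.0 / N] * N
    x = v[:]
    for _ in range(max_iter):
        y = matvec(x)
        x_new = [alpha * yi + (1.0 - alpha) * vi for yi, vi in zip(y, v)]
        change = sum(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if change < tol:
            break
    return x


def linear_system_residual(x, alpha):
    """1-norm of (I - alpha P) x - (1 - alpha) v for uniform v."""
    y = matvec(x)
    return sum(abs(xi - alpha * yi - (1.0 - alpha) / N) for xi, yi in zip(x, y))


x = pagerank(0.85)
print([round(xi, 4) for xi in x])
print("residual of the linear system:", linear_system_residual(x, 0.85))
```

The same vector satisfies both formulations: the power method iterates the Markov chain update, and the residual function checks the linear system directly.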
9. 9. Other uses for PageRank. What else people use PageRank to do: GeneRank, ProteinRank, IsoRank, clustering (graph partitioning), sports ranking, teaching. [Figure: GeneRank example labeled with gene identifiers; horizontal axis from 10 to 70.] Use $(I - \alpha G D^{-1}) x = w$ to find "nearby" important genes. Morrison et al., GeneRank, 2005.
10. 10. My other projects
        - Prior: PageRank
            - Parallel Krylov Methods. Gleich, Zhukov, and Berkhin, Yahoo! Research Labs Technical Report, YRL-2004-038; Gleich and Zhukov, SuperComputing poster, 2005. Does existing software work for computing PageRank on a cluster?
            - Approximate Personal PageRank. Gleich and Polito, Internet Math. 3(3):257-294, 2007. Can you build a web search engine on your PC?
        - Ongoing
            - Network Alignment (with Mohsen Bayati, Margot Gerritsen, Amin Saberi, and Ying Wang).
            - Parameterized Matrix Problems (with Paul Constantine): $A(s) x(s) = b(s)$. Come back here for his defense on Monday, June 1st at 1:30pm!
        - My Software Packages: MatlabBGL, vismatrix, libbvg, gaimc, parameterized matrix package (with Paul).
        - Publications: Random α PageRank, Inner-Outer PageRank.
11. 11. Section: Sensitivity. (Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary.)
12. 12. Which sensitivity?
        - Sensitivity to the links: examined and understood
        - Sensitivity to the jump: examined, understood, and useful
        - Sensitivity to α: less well understood
13. 13. PageRank on Wikipedia

        | α = 0.50 | α = 0.85 | α = 0.99 |
        |---|---|---|
        | United States | United States | C:Contents |
        | C:Living people | C:Main topic classif. | C:Main topic classif. |
        | France | C:Contents | C:Fundamental |
        | Germany | C:Living people | United States |
        | England | C:Ctgs. by country | C:Wikipedia admin. |
        | United Kingdom | United Kingdom | P:List of portals |
        | Canada | C:Fundamental | P:Contents/Portals |
        | Japan | C:Ctgs. by topic | C:Portals |
        | Poland | C:Wikipedia admin. | C:Society |
        | Australia | France | C:Ctgs. by topic |

        Note: top 10 articles on Wikipedia with highest PageRank.
14. 14. The PageRank function. Look at the PageRank vector as a function of α, $(I - \alpha P) x(\alpha) = (1 - \alpha) v$, and examine its derivative.
        - My contributions (Gleich, Glynn, Golub, Greif, Dagstuhl proceedings, 2007): compute the derivative with just simple PageRank solves; empirically evaluated the derivative as a rank change predictor.
        - Others (Golub and Greif, 2004; Boldi et al., 2005; Berkhin, 2005; Langville and Meyer, 2006): PageRank becomes more sensitive as α → 1; the PageRank vector at α = 1 is well defined.
        - α matters!
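One way to see why the derivative needs only PageRank-style solves is to differentiate the linear system directly. The derivation below is a sketch from the definition $(I - \alpha P)x(\alpha) = (1-\alpha)v$, not necessarily the formulation used in the cited paper; the auxiliary vector $z$ is introduced here purely for illustration.

$$
\begin{aligned}
(I - \alpha P)\,x(\alpha) &= (1-\alpha)\,v \\
-P\,x(\alpha) + (I - \alpha P)\,x'(\alpha) &= -v \\
(I - \alpha P)\,x'(\alpha) &= P\,x(\alpha) - v = \tfrac{1}{\alpha}\bigl(x(\alpha) - v\bigr), \qquad 0 < \alpha < 1 .
\end{aligned}
$$

If $z$ denotes the PageRank vector computed with jump vector $x(\alpha)$, that is $(I - \alpha P) z = (1-\alpha)\,x(\alpha)$, then $(I - \alpha P)(z - x(\alpha)) = (1-\alpha)(x(\alpha) - v)$, and therefore

$$
x'(\alpha) = \frac{z - x(\alpha)}{\alpha\,(1-\alpha)} .
$$

Under this reading, two PageRank solves (one for $x(\alpha)$, one for $z$) give the derivative.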
15. 15. Section: Random sensitivity. (Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary.)
16. 16. What is alpha?

        | Author | α |
        |---|---|
        | Brin and Page (1998) | 0.85 |
        | Najork et al. (2007) | 0.85 |
        | Litvak et al. (2006) | 0.5 |
        | Experiment (slide 20) | 0.375 |
        | Algorithms (...) | ≥ 0.85 |

        For you, α is clear. Google wants PageRank for everyone.
17. 17. Multiple surfers. Each person picks α from a distribution $A$. This leads to two candidate summaries, $x(E[A])$ and $E[x(A)]$, and in general $x(E[A]) \neq E[x(A)]$.
18. 18. Random alpha PageRank (RAPr). Model PageRank as the random variables $x(A)$ and look at $E[x(A)]$ and $\mathrm{Std}[x(A)]$. Gleich and Constantine, Workshop on Algorithms on the Web Graph, 2007.
19. 19. What is A? [Figure: densities on the interval from 0 to 1 for Beta(0, 0, 0.6, 0.9), Beta(2, 16, 0, 1), Beta(1, 1, 0.1, 0.9), and Beta(−0.5, −0.5, 0.2, 0.7), written in the form Beta(a, b, l, r).]
20. 20. Alpha is: [Figure: histogram of α on the interval from 0 to 1 with a fitted density, Beta(1.5, 0.5); mean 0.375, mode 0.25; vertical axis: density.] Data provided by Abraham Flaxman and Asela Gunawardana at Microsoft.
21. 21. Example. [Figure: the six-node example graph with plots for $x_1$ through $x_6$; horizontal axis from 0 to 0.5.]
22. 22. What changes? $x(A)$ with $A \sim \mathrm{Beta}(a, b, l, r)$ and $0 \le l < r \le 1$.
        1. $E[x_i(A)] \ge 0$ and $\|E[x(A)]\|_1 = 1$; thus $E[x(A)]$ is a probability distribution.
        2. $E[x(A)] = \sum_{\ell=0}^{\infty} E\left[A^{\ell} - A^{\ell+1}\right] P^{\ell} v$; thus we can interpret $E[x(A)]$ in length-$\ell$ paths.
        3. For a page $i$ with no in-links, $x_i(A) = (1 - A) v_i$; thus $E[x_i(A)] = x_i(E[A])$ and $\mathrm{Std}[x_i(A)] = v_i \, \mathrm{Std}[A]$.

        But is this one useful?
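The path-length series in item 2 can be evaluated term by term once the moments $E[A^{\ell}]$ are known. The sketch below is an illustration under stated assumptions: it uses the six-node example graph with a uniform $v$, and it takes $A$ to be a standard two-parameter Beta(p, q) on [0, 1], whose moments satisfy $E[A^{\ell+1}] = E[A^{\ell}] (p + \ell)/(p + q + \ell)$. The choice p = 1.5, q = 2.5 is an assumption picked because it has mean 0.375 and mode 0.25, the values quoted for the fit on slide 20; the slides' own Beta(a, b, l, r) parameterization is not spelled out. All function names are hypothetical.

```python
# Out-links of the six-node example, zero-indexed; node 1 (index 0) is dangling.
OUT_LINKS = [[], [0, 2], [3, 4], [1, 2, 4], [5], [4]]
N = len(OUT_LINKS)


def matvec(x):
    """Return P x; the dangling node is patched back to the uniform vector."""
    y = [0.0] * N
    for j, targets in enumerate(OUT_LINKS):
        if targets:
            share = x[j] / len(targets)
            for i in targets:
                y[i] += share
        else:
            for i in range(N):
                y[i] += x[j] / N
    return y


def beta_moments(p, q, count):
    """Return [E[A^0], ..., E[A^(count-1)]] for a standard Beta(p, q) on [0, 1]."""
    moments = [1.0]
    for k in range(count - 1):
        moments.append(moments[-1] * (p + k) / (p + q + k))
    return moments


def expected_pagerank(p, q, terms):
    """Truncated series: sum over l = 0..terms of E[A^l - A^(l+1)] P^l v."""
    moments = beta_moments(p, q, terms + 2)
    term = [1.0 / N] * N          # P^0 v with uniform v
    expectation = [0.0] * N
    for ell in range(terms + 1):
        weight = moments[ell] - moments[ell + 1]
        expectation = [e + weight * t for e, t in zip(expectation, term)]
        term = matvec(term)       # advance to P^(l+1) v
    return expectation


ex = expected_pagerank(1.5, 2.5, 2000)
print([round(e, 4) for e in ex])
print("mass captured by the truncated series:", sum(ex))
```

The weights telescope, so the truncated sum carries total mass $1 - E[A^{N+1}]$; how quickly that approaches 1 depends on how much of the distribution sits near α = 1.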
23. 23. RAPr on Wikipedia

        | E[x(A)] | Std[x(A)] |
        |---|---|
        | United States | United States |
        | C:Living people | C:Living people |
        | France | C:Main topic classif. |
        | United Kingdom | C:Contents |
        | Germany | C:Ctgs. by country |
        | England | United Kingdom |
        | Canada | France |
        | Japan | C:Fundamental |
        | Poland | England |
        | Australia | C:Ctgs. by topic |
24. 24. Std vs. PageRank. Does it tell us more than just PageRank? Data: uk2006, 77M nodes and 2B edges.

        Intersection similarity: $\mathrm{isim}(k) = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{2i} \left| \mathrm{Diff}[Y(1:i), Z(1:i)] \right|$

        [Figure: intersection similarity (k) against k from $10^0$ to $10^6$, on a vertical scale from 0 (identical) to 1 (disjoint). Curves: Std[x(A1)] vs. x(0.85), Std[x(A2)] vs. x(0.5), Std[x(A3)] vs. x(0.85).]

        Kendall's τ: τ(x(E1), S1) = +0.3; τ(x(E2), S2) = −0.5; τ(x(0.85), S3) = −0.2.

        Distributions: A1 ∼ Beta(2, 16, [0, 1]), A2 ∼ Beta(1, 1, [0, 1]), A3 ∼ Beta(0.5, 1.5, [0, 1]).
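The intersection similarity compares two rankings prefix by prefix. The sketch below is a small illustration, assuming that Diff denotes the symmetric set difference of the two top-$i$ prefixes and that each ranking lists distinct items; the function name `isim` follows the slide, while the example rankings are made up.

```python
def isim(ranking_y, ranking_z, k):
    """Average over i = 1..k of |symmetric difference of top-i sets| / (2 i).

    Returns 0.0 for identical rankings and 1.0 for disjoint ones.
    """
    total = 0.0
    for i in range(1, k + 1):
        # Items that appear in exactly one of the two top-i prefixes.
        difference = set(ranking_y[:i]) ^ set(ranking_z[:i])
        total += len(difference) / (2 * i)
    return total / k


# Hypothetical rankings, for illustration only.
print(isim(["a", "b", "c"], ["a", "b", "c"], 3))  # identical rankings
print(isim(["a", "b", "c"], ["d", "e", "f"], 3))  # disjoint rankings
print(isim(["a", "b"], ["b", "a"], 2))            # same items, swapped order
```

Because every prefix length contributes equally, disagreements near the top of the rankings are counted in more prefixes than disagreements further down.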
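25. 25. Computation
        1. Monte Carlo: $E[x(A)] \approx \frac{1}{N} \sum_{i=1}^{N} x(\alpha_i)$, with $\alpha_i \sim A$.
        2. Path damping: $E[x(A)] \approx \sum_{\ell=0}^{N} E\left[A^{\ell} - A^{\ell+1}\right] P^{\ell} v$.
        3. Quadrature: $E[x(A)] = \int_{l}^{r} x(\alpha) \, d\rho(\alpha) \approx \sum_{i=1}^{N} x(\zeta_i) \, \omega_i$.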
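The Monte Carlo option is the simplest of the three to sketch: draw values of α, solve PageRank for each, and average. The code below is a minimal illustration on the six-node example graph, assuming a uniform $v$ and a standard two-parameter Beta(p, q) on [0, 1] sampled with Python's `random.betavariate`; the parameters (1.5, 2.5), the sample count, and all function names are assumptions made for the example rather than settings from the talk.

```python
import math
import random

# Out-links of the six-node example, zero-indexed; node 1 (index 0) is dangling.
OUT_LINKS = [[], [0, 2], [3, 4], [1, 2, 4], [5], [4]]
N = len(OUT_LINKS)


def matvec(x):
    """Return P x; the dangling node is patched back to the uniform vector."""
    y = [0.0] * N
    for j, targets in enumerate(OUT_LINKS):
        if targets:
            share = x[j] / len(targets)
            for i in targets:
                y[i] += share
        else:
            for i in range(N):
                y[i] += x[j] / N
    return y


def pagerank(alpha, tol=1e-10, max_iter=100000):
    """Power method for x = alpha P x + (1 - alpha) v with uniform v."""
    x = [1.0 / N] * N
    for _ in range(max_iter):
        y = matvec(x)
        x_new = [alpha * yi + (1.0 - alpha) / N for yi in y]
        change = sum(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if change < tol:
            break
    return x


def rapr_monte_carlo(p, q, samples, seed=0):
    """Estimate E[x(A)] and Std[x(A)] by sampling alpha from Beta(p, q)."""
    rng = random.Random(seed)
    total = [0.0] * N
    total_sq = [0.0] * N
    for _ in range(samples):
        x = pagerank(rng.betavariate(p, q))
        for i, xi in enumerate(x):
            total[i] += xi
            total_sq[i] += xi * xi
    mean = [t / samples for t in total]
    # Population standard deviation; clamp tiny negative rounding errors.
    std = [math.sqrt(max(s / samples - m * m, 0.0)) for s, m in zip(total_sq, mean)]
    return mean, std


mean, std = rapr_monte_carlo(1.5, 2.5, samples=500)
print("E[x(A)]   ~", [round(m, 4) for m in mean])
print("Std[x(A)] ~", [round(s, 4) for s in std])
```

Each sample costs one full PageRank solve, which is why the alternatives that reuse matrix-vector products or place samples at quadrature nodes are attractive on large graphs.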
26. 26. Time. Data: cnr2000, 325k nodes and 3M edges. [Figure: log-log plot with curves for Monte Carlo, Path Damping, and Quadrature; horizontal axis: time (sec) from $10^{-2}$ to $10^{4}$; vertical axis from $10^{-15}$ to $10^{0}$.]
27. 27. Convergence theory

        | Method | Conv. | Work required | What is N? |
        |---|---|---|---|
        | Monte Carlo | $1/\sqrt{N}$ | N PageRank systems | number of samples from A |
        | Path Damping (without Std[x(A)]) | $r^{N+2} / N^{1+a}$ | N + 1 matrix-vector products | terms of Neumann series |
        | Gaussian Quadrature | $r^{2N}$ | N PageRank systems | number of quadrature points |

        $a$ and $r$ are parameters from Beta(a, b, l, r).
28. 28. Webspam application. Hosts of uk-2006 are labeled as spam, not-spam, other.

        | | P | R | f | FP | FN |
        |---|---|---|---|---|---|
        | Baseline | 0.694 | 0.558 | 0.618 | 0.034 | 0.442 |
        | Beta(0.5,1.5) | 0.695 | 0.561 | 0.621 | 0.034 | 0.439 |
        | Beta(1,1) | 0.698 | 0.562 | 0.622 | 0.033 | 0.438 |
        | Beta(2,16) | 0.699 | 0.562 | 0.623 | 0.033 | 0.438 |

        Note: bagged (10) J48 decision tree classifier in Weka, mean of 50 repetitions from 10-fold cross-validation of 4948 non-spam and 674 spam hosts (5622 total). Becchetti et al., Link analysis for Web spam detection, 2008.
29. 29. Section: Inner-Outer. (Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary.)
30. 30. Motivation. Why another PageRank algorithm? For the RAPr codes, we need:
        1. reliable code
        2. fast code over a range of α's → use Matlab's "\"
        3. code for big problems → use a Gauss-Seidel or custom Richardson method
        4. code with only matvec products → use the inner-outer iteration
        5. code with only 2 vectors of memory → use the power method

        The options run from fancy (top) to simple (bottom).
31. 31. Inner-Outer. Note: PageRank is easier when α is smaller. Thus: solve PageRank with itself using β < α!
        - Outer: $(I - \beta P) x^{(k+1)} = (\alpha - \beta) P x^{(k)} + (1 - \alpha) v \equiv f^{(k)}$
        - Inner: $y^{(j+1)} = \beta P y^{(j)} + (\alpha - \beta) P x^{(k)} + (1 - \alpha) v$

        A new parameter? What is β? 0.5. How many inner iterations? Until a residual of $10^{-2}$. Gray, Greif, Lau, 2007.
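A quick consistency check, worked from the outer equation itself: at a fixed point $x^{(k+1)} = x^{(k)} = x$, the outer iteration reduces to the original PageRank system, so any limit of the outer iteration is the PageRank vector for α.

$$
\begin{aligned}
(I - \beta P)\,x &= (\alpha - \beta)\,P x + (1 - \alpha)\,v \\
x - \beta P x - \alpha P x + \beta P x &= (1 - \alpha)\,v \\
(I - \alpha P)\,x &= (1 - \alpha)\,v .
\end{aligned}
$$

The inner iteration is then a fixed-point iteration for the outer system with right-hand side $f^{(k)}$; since $\|\beta P\|_1 = \beta$ for a column-stochastic $P$, its error contracts by a factor of at most β per step, which is the sense in which the smaller parameter makes the problem easier.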
32. 32. Inner-Outer algorithm

        Input: $P$, $v$, $\alpha$, $\tau$, ($\beta = 0.5$, $\eta = 10^{-2}$). Output: $x$.

            x ← v
            y ← Px
            while ‖αy + (1 − α)v − x‖₁ ≥ τ
                f ← (α − β)y + (1 − α)v
                repeat
                    x ← f + βy
                    y ← Px
                until ‖f + βy − x‖₁ < η
            end while
            x ← αy + (1 − α)v

        - If $0 \le \beta \le \alpha$, the iteration converges with any $\eta$.
        - Uses only three vectors of memory.
        - With $\beta = 0.5$, $\eta = 10^{-2}$, it is often faster than the power method (or just a titch slower).

        Note: the inner loop checks its condition after doing one iteration.
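The algorithm on this slide translates almost line for line into code. The sketch below is a minimal pure-Python rendering on the six-node example graph, assuming a uniform $v$ and the default parameters β = 0.5 and η = 10⁻² from the slide; the graph encoding, the function names, and the multiplication counter are additions for illustration and are not the author's implementation.

```python
# Out-links of the six-node example, zero-indexed; node 1 (index 0) is dangling.
OUT_LINKS = [[], [0, 2], [3, 4], [1, 2, 4], [5], [4]]
N = len(OUT_LINKS)


def matvec(x):
    """Return P x; the dangling node is patched back to the uniform vector."""
    y = [0.0] * N
    for j, targets in enumerate(OUT_LINKS):
        if targets:
            share = x[j] / len(targets)
            for i in targets:
                y[i] += share
        else:
            for i in range(N):
                y[i] += x[j] / N
    return y


def norm1_diff(a, b):
    """1-norm of a - b."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def inner_outer(alpha, tau=1e-8, beta=0.5, eta=1e-2):
    """Inner-outer iteration; returns (x, number of products with P)."""
    v = [1.0 / N] * N
    x = v[:]
    y = matvec(x)
    multiplications = 1
    # Outer loop: stop when the PageRank residual alpha*y + (1-alpha)*v - x is small.
    while norm1_diff([alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)], x) >= tau:
        f = [(alpha - beta) * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
        # Inner loop: iterate x = f + beta*P*x; the test runs after one step.
        while True:
            x = [fi + beta * yi for fi, yi in zip(f, y)]
            y = matvec(x)
            multiplications += 1
            if norm1_diff([fi + beta * yi for fi, yi in zip(f, y)], x) < eta:
                break
    x = [alpha * yi + (1 - alpha) * vi for yi, vi in zip(y, v)]
    return x, multiplications


x, count = inner_outer(0.85)
print([round(xi, 4) for xi in x])
print("products with P:", count)
```

Only `x`, `y`, and `f` are kept between steps, matching the three-vector memory claim; the temporary lists built inside the norm calls are an artifact of this plain-Python sketch.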
33. 33. Performance. [Figure: two panels, wb-edu with α = 0.85 and wb-edu with α = 0.99, each plotting residual against multiplication for the power method ("power") and the inner-outer iteration ("inout"), with insets of the early iterations.] Parameters: $\tau = 10^{-7}$, $\beta = 0.5$, $\eta = 10^{-2}$; wb-edu graph (9.8M nodes, 57.M edges).
34. 34. Extensions
        1. A large scale shared-memory parallel version on compressed web graphs
        2. A Gauss-Seidel variant
        3. A BiCG-STAB preconditioner
        4. A conjecture about the performance of the iteration
        5. Showed the algorithm converges for "any" β, η

        Gleich, Gray, Greif, Lau, submitted.
35. 35. Convergence result. Sketch of convergence result:
        1. Error after $j$ steps of the inner iteration: $f^{(j)} = \left( \alpha \beta^{j-1} P^{j} + \frac{\alpha - \beta}{\beta} \sum_{\ell=1}^{j-1} \beta^{\ell} P^{\ell} \right) f^{(0)}$.
        2. Upper bound the error by $\|f^{(j)}\| \le \frac{(\alpha - \beta) + (1 - \alpha) \beta^{j}}{1 - \beta} \|f^{(0)}\|$.
        3. Notice $\|f^{(j)}\| \le \alpha \|f^{(0)}\|$ for $j \ge 1$.
        4. Hence, convergence as long as $\beta \le \alpha$.
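The step from the error expression to the bound can be filled in with the triangle inequality and a geometric sum. The derivation below is a sketch that assumes the 1-norm with $\|P\|_1 = 1$ (column-stochastic $P$), $0 < \beta \le \alpha < 1$, and reads step 1 as $f^{(j)} = \bigl(\alpha\beta^{j-1}P^{j} + \tfrac{\alpha-\beta}{\beta}\sum_{\ell=1}^{j-1}\beta^{\ell}P^{\ell}\bigr) f^{(0)}$.

$$
\begin{aligned}
\|f^{(j)}\| &\le \Bigl(\alpha\beta^{j-1} + \frac{\alpha-\beta}{\beta}\sum_{\ell=1}^{j-1}\beta^{\ell}\Bigr)\|f^{(0)}\| \\
&= \Bigl(\alpha\beta^{j-1} + \frac{(\alpha-\beta)(1-\beta^{j-1})}{1-\beta}\Bigr)\|f^{(0)}\| \\
&= \frac{\alpha\beta^{j-1} - \alpha\beta^{j} + \alpha - \beta - \alpha\beta^{j-1} + \beta^{j}}{1-\beta}\,\|f^{(0)}\| \\
&= \frac{(\alpha-\beta) + (1-\alpha)\beta^{j}}{1-\beta}\,\|f^{(0)}\| .
\end{aligned}
$$

At $j = 1$ the factor equals $\frac{\alpha - \alpha\beta}{1-\beta} = \alpha$, and $(1-\alpha)\beta^{j}$ does not increase with $j$, so the factor is at most $\alpha < 1$ for every $j \ge 1$. The condition $\beta \le \alpha$ is what keeps the coefficient $\frac{\alpha-\beta}{\beta}$ nonnegative in the first line.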
36. 36. Section: Summary. (Outline: PageRank intro, Sensitivity, Random sensitivity, Inner-Outer, Summary.)
37. 37. Conclusions
        - α matters
        - sensitivity is useful
        - everything is just PageRank
38. 38. Contributions
        1. Derivative (Gleich, Glynn, Golub, Greif, 2007): new technique to compute the derivative using just PageRank.
        2. RAPr (Constantine and Gleich, 2007; Constantine, Gleich, and Iaccarino, submitted): new PageRank model and sensitivity measure; range of algorithms and algorithmic analysis; empirically helpful for spam identification; robust software.
        3. Inner-Outer (Gleich, Gray, Greif, Lau, submitted): improved convergence analysis; Gauss-Seidel and preconditioning variants; shared-memory parallel implementation; robust software.
39. 39. Thanks! Michael Saunders (My Advisor), Hector Garcia-Molina, Chen Greif, Art Owen, Amin Saberi
40. 40. Thanks Gene!
41. 41. THANK YOU: Margot Gerritsen, Debbie Heimowitz, Peter Glynn, Jason Azicri, Walter Murray, Steven Fan, Reid Andersen, Paul Constantine, Pavel Berkhin, Michael Atkinson, Kevin Lang, Jeremy Kozdon, Amy Langville, Esteban Arcaute, Matthew Rasmussen, Sebastiano Vigna, Adam Guetz, Will Fong, Leonid Zhukov, Andrew Bradley, Indira Choudhury, Seth Tornborg, Nick Henderson, Chris Maes, Brian Tempero, Nicole Taheri, Prisilla Williams, Ying Wang, Deb Michael, Nick West, Mayita Romero, Kaustuvs Rum, Les Fletcher, Saeco Coffee Machine, Hugh Fletcher, Napa Valley, Lindsey Fletcher, Matlab, Jane Fletcher, superlu