Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)

- 1. How Does Google? David F. Gleich, Computer Science, Purdue University. A journey into the wondrous mathematics behind your favorite websites.
- 2. Mathematics underlies an enormous number of the websites we use every day!
- 3. 1. Google’s PageRank 2. Multi-armed bandits and internet experiments
- 4. (no extractable text)
- 5. Larry Page, Sergey Brin • Created a web-search algorithm called “backrub” • Spun off a company “Googol” based on the paper • The importance of a page is determined by the importance of pages that link to it. Reference: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. “The PageRank Citation Ranking: Bringing Order to the Web.” TR, Stanford InfoLab, 1999.
- 6. A websearch primer: 1. Crawl webpages 2. Analyze webpage text (information retrieval) 3. Analyze webpage links 4. Fit over 200 measures to human evaluations 5. Produce rankings 6. Continuously update
- 7. Pages, nodes, incoming links, outgoing links, and “importance”. [Figure: “important” pages a, b, c that link to me; “important” pages that link to Purdue]
- 8. (no extractable text)
- 9. Tim Davis and Yifan Hu, Sparse Matrix Gallery
- 10. The web? 1000 vertices on 8.5-by-11 paper. 1,000,000,000,000 vertices (one trillion): paper the size of Manhattan island (23 sq miles). Source: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
- 11. We need something better!
- 12. A wee web-graph: link counting is too easy to game! [Figure: six-node graph, nodes 1 to 6, with link weights 1/3, 1/3, 1/3, 1/2, 1/2]
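- 13. A wee web-graph: link counting is too easy to game! [Figure: six-node graph, nodes 1 to 6, with link weights 1/3, 1/3, 1/3, 1/2, 1/2] The importance of a page is determined by the importance of pages that link to it: $x_1 = 0$, $x_2 = \tfrac{1}{3}x_1$, $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2$, $x_4 = \tfrac{1}{3}x_1 + x_3 + x_5$, $x_5 = x_4$, $x_6 = \tfrac{1}{2}x_2$.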
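- 14. The importance of a page is determined by the importance of pages that link to it: $x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j$. Here $x_i$ is the “importance” of page i, $x_j$ is the “importance” of page j, $B_i$ is the set of “back-links from page i” (why it was called Backrub!), and $d_j$ is the number of links page j uses (the out-degree in graph theory). Example: $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2$. [Figure: nodes 1 and 2 linking to node 3 with weights 1/3 and 1/2]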
- 15. We can rewrite this equation in a more mathematically convenient way:
  $$\begin{aligned}
  x_1 &= 0x_1 + 0x_2 + 0x_3 + 0x_4 + 0x_5 + 0x_6 \\
  x_2 &= \tfrac{1}{3}x_1 + 0x_2 + 0x_3 + 0x_4 + 0x_5 + 0x_6 \\
  x_3 &= \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2 + 0x_3 + 0x_4 + 0x_5 + 0x_6 \\
  x_4 &= \tfrac{1}{3}x_1 + 0x_2 + 1x_3 + 0x_4 + 1x_5 + 0x_6 \\
  x_5 &= 0x_1 + 0x_2 + 0x_3 + 1x_4 + 0x_5 + 0x_6 \\
  x_6 &= 0x_1 + \tfrac{1}{2}x_2 + 0x_3 + 0x_4 + 0x_5 + 0x_6
  \end{aligned}$$
- 16. And even more conveniently!
  $$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}$$
  or $x = Px$. Element k in column m = “probability” of going from node m to node k.
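
The matrix on slide 16 can be assembled directly from each page's out-links. The following is a minimal sketch, not the author's code: the `OUT_LINKS` table is read off the slide's equations, and the names `OUT_LINKS` and `link_matrix` are hypothetical.

```python
from fractions import Fraction

# Out-links of the six-node "wee web-graph", inferred from the slide's equations
# (e.g. page 1 links to pages 2, 3 and 4, so each gets weight 1/3).
OUT_LINKS = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}


def link_matrix(out_links):
    """Return P as a list of rows, where P[k][m] = 1/d_m if page m+1 links to page k+1."""
    n = len(out_links)
    P = [[Fraction(0)] * n for _ in range(n)]
    for m, targets in out_links.items():
        for k in targets:
            # Column m spreads page m's importance evenly over its d_m out-links.
            P[k - 1][m - 1] = Fraction(1, len(targets))
    return P


P = link_matrix(OUT_LINKS)
for row in P:
    print("  ".join(f"{str(v):>3}" for v in row))
```

Exact fractions are used only so the printed rows can be compared with the slide by eye; floats would work equally well.
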
- 17. The matrix P for websites shows a lot of structure. Every dot is a non-zero element indicating a link. Matrices are sparse and generally have block structure; the block structure can be exploited to speed up the ranking algorithm.
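- 18. But this idea doesn’t work for the wee web-graph. [Figure: six-node graph, nodes 1 to 6, with link weights 1/3, 1/3, 1/3, 1/2, 1/2] Nodes 1, 4 and 5 determine everything! The system $x_1 = 0$, $x_2 = \tfrac{1}{3}x_1$, $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2$, $x_4 = \tfrac{1}{3}x_1 + x_3 + x_5$, $x_5 = x_4$, $x_6 = \tfrac{1}{2}x_2$ collapses to $x_1 = 0$, $x_2 = \tfrac{1}{3}x_1 = 0$, $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2 = 0$, $x_4 = \tfrac{1}{3}x_1 + x_3 + x_5 = x_5$, $x_5 = x_4$, $x_6 = \tfrac{1}{2}x_2 = 0$.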
- 19. But this idea doesn’t work for the wee web-graph. [Figure: six-node graph, nodes 1 to 6, with link weights 1/3, 1/3, 1/3, 1/2, 1/2] Node 1: “lonely”. Nodes 4 and 5: “mutual admiration societies”. Node 6: “anti-social”. These nodes need to be “fixed” to get a reliable and useful ranking!
- 20. The gang of four to the rescue: Andrei Markov, Oskar Perron, Georg Frobenius, Richard von Mises
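- 21. Let’s fix it up and force node 6 to choose, or link to everyone:
  $$P = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 0 \end{bmatrix} \quad\longrightarrow\quad P = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 1 & 0 & 1 & 1/6 \\ 0 & 0 & 0 & 1 & 0 & 1/6 \\ 0 & 1/2 & 0 & 0 & 0 & 1/6 \end{bmatrix}$$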
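
The fix on slide 21 amounts to replacing every all-zero column of P with a uniform column. A small sketch of that step follows, under the assumption that "link to everyone" means weight 1/n for all n pages including the page itself, as in the slide's second matrix; the helper name `fix_dangling` is hypothetical.

```python
def fix_dangling(P):
    """Return a copy of P in which every all-zero column is replaced by 1/n entries."""
    n = len(P)
    fixed = [row[:] for row in P]
    for m in range(n):
        # A zero column means page m+1 has no out-links (the "anti-social" node).
        if all(fixed[k][m] == 0 for k in range(n)):
            for k in range(n):
                fixed[k][m] = 1.0 / n
    return fixed


# The original link matrix from the slide (column m holds the out-links of page m+1).
P = [
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
]

P_fixed = fix_dangling(P)
column_sums = [sum(P_fixed[k][m] for k in range(6)) for m in range(6)]
print(column_sums)  # every column now sums to 1 (up to rounding)
```

After the fix each column sums to one, so each column can be read as a probability distribution over the next page visited.
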
- 22. Taxation is the way to representation! [Figure: pages a, b, c linking to a page] If [page] is a good page, then it’ll still be a good page if we “tax” the importance from a, b, and c. We can redistribute the taxed amounts to all, including lonely nodes!
- 23. The importance of a page is determined by the importance of pages that link to it* (* after tax and any benefits): $x_i = \sum_{j \in B_i} \alpha \frac{x_j}{d_j} + (1 - \alpha) b_i$. Here $x_j / d_j$ is the total importance that page j contributes to page i, $b_i$ is the benefits to page i, and $\alpha$ sets the taxation rate of all pages.
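- 24. In matrix form:
  $$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} = \alpha \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 1 & 0 & 1 & 1/6 \\ 0 & 0 & 0 & 1 & 0 & 1/6 \\ 0 & 1/2 & 0 & 0 & 0 & 1/6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} + (1 - \alpha) \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}$$
  that is, $x = \alpha P x + (1 - \alpha) b$. Perron and Frobenius showed the new equation always has a unique solution.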
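- 25. [Figure: six-node graph, nodes 1 to 6, with link weights 1/3, 1/3, 1/3, 1/2, 1/2] What von Mises and Richardson showed is that guess, check, and correct works! $x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b$, with
  $$x^{(\text{start})} = \begin{bmatrix} 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \end{bmatrix}, \quad x^{(1)} = \begin{bmatrix} 0.05 \\ 0.10 \\ 0.17 \\ 0.38 \\ 0.19 \\ 0.12 \end{bmatrix}, \quad x^{(2)} = \begin{bmatrix} 0.04 \\ 0.06 \\ 0.10 \\ 0.36 \\ 0.36 \\ 0.08 \end{bmatrix}, \quad \dots, \quad x^{(\infty)} = \begin{bmatrix} 0.03 \\ 0.04 \\ 0.06 \\ 0.43 \\ 0.39 \\ 0.05 \end{bmatrix}$$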
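
The "guess, check, and correct" iteration can be sketched in a few lines. The slide does not state the taxation parameter or the benefit vector; the sketch below assumes `ALPHA = 0.85` and a uniform `b` with entries 1/6, a choice that happens to reproduce the slide's rounded iterates. The names `step` and `pagerank` are hypothetical.

```python
ALPHA = 0.85  # assumed value; not stated on the slide

# The fixed matrix from slide 21 (node 6 links to everyone with weight 1/6).
P = [
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
]


def step(x, b, alpha=ALPHA):
    """One update x_new = alpha * P x + (1 - alpha) * b."""
    n = len(x)
    return [
        alpha * sum(P[i][j] * x[j] for j in range(n)) + (1 - alpha) * b[i]
        for i in range(n)
    ]


def pagerank(b, alpha=ALPHA, tol=1e-12, max_iter=1000):
    """Repeat the update until successive iterates differ by less than tol (1-norm)."""
    x = b[:]
    for _ in range(max_iter):
        x_new = step(x, b, alpha)
        if sum(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


b = [1 / 6] * 6          # assumed uniform "benefits"; also used as the starting guess
x1 = step(b, b)
x2 = step(x1, b)
x_final = pagerank(b)

for name, vec in [("x(1)", x1), ("x(2)", x2), ("converged", x_final)]:
    print(name, [round(v, 2) for v in vec])
```

Under these assumptions the first two iterates round to the slide's $x^{(1)}$ and $x^{(2)}$, and the converged vector rounds to the last vector shown on the slide.
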
- 26. (no extractable text)
- 27. There’s still a lot of work left to do to make a search engine: Make it fast! Watch out for spam. Watch out for manipulation. Personalize. Experiment!
- 28. 1. Google’s PageRank 2. Multi-armed bandits and internet experiments
- 29. Not this! [Image: http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/]
- 30. This! [Image: http://upload.wikimedia.org/wikipedia/en/8/82/Las_Vegas_slot_machines.jpg] Pays out $0.92/dollar. Pays out $0.98/dollar. Pays out $0.95/dollar. Pays out $0.99/dollar.
- 31. What in the heck does a multi-armed bandit have to do with Google?
- 32. What in the heck does a multi-armed bandit have to do with Google? Pays out $0.92/view. Pays out $0.66/view. Pays out $0.91/view to show ads. Pays out -$0.02/view (hide ads).
- 33. How to optimize your website without exploiting the bandits: Try condition A 100 times, find 45 “wins”. Try condition B 100 times, find 85 “wins”. Try condition C 100 times, find 10 “wins”. … Choose the best!
- 34. This field has some of the best terminology: Explore! Exploit! Regret
- 35. This field has some of the best terminology. Explore: visiting Las Vegas! Exploit: your new winning strategy! Regret: that you didn’t quit after winning the first round.
- 36. This field has some of the best terminology. Explore: testing slot machines/experiments for their reward. Exploit: playing the best reward you’ve found so far. Regret: how much you lost due to exploration.
- 37. How to optimize your website without exploiting the bandits: Try condition A 100 times, find 45 “wins”. Try condition B 100 times, find 85 “wins”. Try condition C 100 times, find 10 “wins”. … Choose the best! This is pure exploration: we only exploit our findings at the end!
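
The pure-exploration procedure on this slide can be simulated as follows. This is a sketch under stated assumptions: the true win rates in `TRUE_RATES` are hypothetical (set to the slide's observed win fractions), the seed is arbitrary, and `explore_then_choose` is an illustrative name.

```python
import random


def explore_then_choose(true_rates, pulls_per_arm, rng):
    """Try every condition the same number of times, then pick the one with most wins."""
    wins = {}
    for arm, p in true_rates.items():
        # Each trial is a "win" with probability p.
        wins[arm] = sum(rng.random() < p for _ in range(pulls_per_arm))
    best = max(wins, key=wins.get)
    return wins, best


# Hypothetical true win rates, chosen to match the slide's 45/85/10 wins per 100 trials.
TRUE_RATES = {"A": 0.45, "B": 0.85, "C": 0.10}

rng = random.Random(0)  # fixed seed so the run is repeatable
wins, best = explore_then_choose(TRUE_RATES, 100, rng)
print(wins, "-> choose", best)
```

All 300 trials are spent evenly, including 100 on the clearly worst condition, which is the cost that the regret discussion later quantifies.
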
- 38. How to optimize your website exploiting the bandits. Pure exploration: Try condition A 5 times, find 4 wins. Try condition B 5 times, find 4 wins. Try condition C 5 times, find 2 wins. Exploit our knowledge: Try condition A 7 times, find 3 wins. Try condition B 7 times, find 5 wins. Try condition C 1 time, find 0 wins. Estimated return by condition: A = 0.58, B = 0.75, C = 0.33.
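
The estimated returns on this slide are the pooled win fractions over both rounds, for example (4 + 3) / (5 + 7) ≈ 0.58 for condition A. A minimal sketch of that bookkeeping follows; the data layout and the name `estimated_returns` are hypothetical.

```python
# Each round maps a condition to (trials, wins), copied from the slide.
rounds = [
    {"A": (5, 4), "B": (5, 4), "C": (5, 2)},  # first round: equal exploration
    {"A": (7, 3), "B": (7, 5), "C": (1, 0)},  # second round: favour the leaders
]


def estimated_returns(rounds):
    """Pool trials and wins per condition and return wins / trials."""
    totals = {}
    for r in rounds:
        for arm, (trials, wins) in r.items():
            t, w = totals.get(arm, (0, 0))
            totals[arm] = (t + trials, w + wins)
    return {arm: w / t for arm, (t, w) in totals.items()}


est = estimated_returns(rounds)
best = max(est, key=est.get)
print({arm: round(v, 2) for arm, v in est.items()}, "-> best so far:", best)
```

How the second round's 7/7/1 split was chosen is not stated on the slide, so the sketch only reproduces the estimates, not an allocation rule.
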
- 39. The goal of these problems is to construct optimal strategies to minimize regret. Regret: how much you left “on the table” by exploring. A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays T → ∞. $\text{regret} = E[\text{play best always}] - E[\text{plays made based on data}]$. Regret (100 each) $= 255/300 - 140/300 = 0.38$. Regret (30 mixed) $= 25.5/30 - (0.45 \times 12 + 0.85 \times 12 + 0.1 \times 6)/30 = 0.31$.
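
The two regret figures on this slide can be checked numerically. The sketch below assumes, as the slide's arithmetic does, that the win rates 0.45, 0.85 and 0.10 are the true rates, and it reports regret per play; `regret_per_play` is a hypothetical helper.

```python
def regret_per_play(rates, plays):
    """Expected reward of always playing the best arm minus that of the actual plays, per play."""
    total = sum(plays.values())
    best_rate = max(rates.values())
    expected = sum(rates[arm] * count for arm, count in plays.items())
    return (best_rate * total - expected) / total


# Win rates treated as the true rates (an assumption taken from the slide's numbers).
RATES = {"A": 0.45, "B": 0.85, "C": 0.10}

r_each = regret_per_play(RATES, {"A": 100, "B": 100, "C": 100})  # pure exploration
r_mixed = regret_per_play(RATES, {"A": 12, "B": 12, "C": 6})     # explore, then exploit

print(round(r_each, 2), round(r_mixed, 2))
```

The mixed schedule gives condition C only 6 of its 30 plays, which is why its per-play regret is lower than that of the 100-each schedule.
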
- 40. “[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.” Peter Whittle (Whittle, 1979), discussion of “Bandit processes and dynamical allocation indices”. Their importance to website optimization, advertising, and recommendation has rejuvenated research on these problems with fascinating new questions.
- 41. Math is everywhere and especially your favorite websites! Matrices and probability are key ingredients.
- 42. PageRank on Wikipedia. Top 10 articles on Wikipedia with highest PageRank. α = 0.50: United States, C:Living people, France, Germany, England, United Kingdom, Canada, Japan, Poland, Australia. α = 0.85: United States, C:Main topic classif., C:Contents, C:Living people, C:Ctgs. by country, United Kingdom, C:Fundamental, C:Ctgs. by topic, C:Wikipedia admin., France. α = 0.99: C:Contents, C:Main topic classif., C:Fundamental, United States, C:Wikipedia admin., P:List of portals, P:Contents/Portals, C:Portals, C:Society, C:Ctgs. by topic.
