How does Google Google: A journey into the wondrous mathematics behind your favorite websites

A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/)

Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)


  1. 1. How Does Google Google? David F. Gleich, Computer Science, Purdue University. A journey into the wondrous mathematics behind your favorite websites.
  2. 2. Mathematics underlies an enormous number of the websites we use every day!
  3. 3. Outline: (1) Google's PageRank; (2) multi-armed bandits and internet experiments.
  4. 4. 4
  5. 5. Larry Page and Sergey Brin: created a web-search algorithm called “backrub”; spun off a company, “Googol”, based on the paper. The importance of a page is determined by the importance of pages that link to it. Reference: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” TR, Stanford InfoLab, 1999.
  6. 6. A web-search primer: (1) crawl webpages; (2) analyze webpage text (information retrieval); (3) analyze webpage links; (4) fit over 200 measures to human evaluations; (5) produce rankings; (6) continuously update.
  7. 7. Pages, nodes, incoming links, outgoing links, and “importance”. [Figure: “important” pages a, b, c that link to me; “important” pages that link to Purdue.]
  8. 8. 8
  9. 9. Tim Davis and Yifan Hu, Sparse Matrix Gallery
  10. 10. The web. 1000 vertices on 8.5-by-11 paper; 1,000,000,000,000 vertices (one trillion): paper the size of Manhattan island (23 sq miles)? Source: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
  11. 11. We need something better!
  12. 12. A wee web-graph: link counting is too easy to game! [Figure: six nodes, 1 to 6; three edges labeled 1/3 and two edges labeled 1/2.]
  13. 13. A wee web-graph: link counting is too easy to game! The importance of a page is determined by the importance of pages that link to it: $x_1 = 0$, $x_2 = \tfrac{1}{3}x_1$, $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2$, $x_4 = \tfrac{1}{3}x_1 + x_3 + x_5$, $x_5 = x_4$, $x_6 = \tfrac{1}{2}x_2$.
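  14. 14. The importance of a page is determined by the importance of pages that link to it: $x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j$. Here $x_i$ is the “importance” of page i, $x_j$ is the “importance” of page j, $B_i$ is the set of “back-links from page i”, that is, the pages j that link to page i (why it was called Backrub!), and $d_j$ is the number of links page j uses (its out-degree in graph theory). Example: $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2$.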
  15. 15. We can rewrite this equation in a more mathematically convenient way:
      $$\begin{aligned}
      x_1 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
      x_2 &= \tfrac{1}{3}x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
      x_3 &= \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
      x_4 &= \tfrac{1}{3}x_1 + 0\,x_2 + 1\,x_3 + 0\,x_4 + 1\,x_5 + 0\,x_6 \\
      x_5 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 1\,x_4 + 0\,x_5 + 0\,x_6 \\
      x_6 &= 0\,x_1 + \tfrac{1}{2}x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6
      \end{aligned}$$
  16. 16. or
      $$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}$$
      And even more conveniently: $x = Px$. Element k in column m = “probability” of going from node m to node k.
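The step from link lists to the matrix P on slides 13 to 16 can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the talk: the `out_links` dictionary is an assumed encoding of the wee web-graph read off the slide-13 equations, and the helper names `build_link_matrix` and `apply_matrix` are hypothetical.

```python
from fractions import Fraction

# Out-links of the six-node "wee web-graph", read off the equations on slide 13
# (an illustrative encoding, not taken from the talk itself).
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}

def build_link_matrix(out_links):
    """Return P as a list of rows, with P[k-1][m-1] = 1/d_m when page m links to page k."""
    n = len(out_links)
    P = [[Fraction(0)] * n for _ in range(n)]
    for m, targets in out_links.items():
        for k in targets:
            P[k - 1][m - 1] = Fraction(1, len(targets))
    return P

def apply_matrix(P, x):
    """Evaluate the right-hand side of x = P x for a given vector x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

P = build_link_matrix(out_links)
for row in P:
    print(" ".join(str(entry) for entry in row))
```

Exact fractions are used only so the printed rows can be compared directly with the 1/3 and 1/2 entries on the slide. Column 6 comes out all zeros because page 6 has no out-links.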
  17. 17. The matrix P for websites shows a lot of structure. Every dot is a non-zero element indicating a link. The matrices are sparse and generally have block structure; the block structure can be exploited to speed up the ranking algorithm.
  18. 18. But this idea doesn’t work for the wee web-graph. Nodes 1, 4 and 5 determine everything! Since $x_1 = 0$, the equations reduce to $x_2 = \tfrac{1}{3}x_1 = 0$, $x_3 = \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2 = 0$, $x_4 = \tfrac{1}{3}x_1 + x_3 + x_5 = x_5$, $x_5 = x_4$, $x_6 = \tfrac{1}{2}x_2 = 0$.
  19. 19. But this idea doesn’t work for the wee web-graph. Node 1: “lonely”. Nodes 4 and 5: “mutual admiration societies”. Node 6: “anti-social”. These nodes need to be “fixed” to get a reliable and useful ranking!
  20. 20. The gang of four to the rescue: Andrei Markov, Oskar Perron, Georg Frobenius, Richard von Mises.
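  21. 21. Let’s fix it up and force node 6 to choose, or link to everyone:
      $$P = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 0 \end{bmatrix} \quad\longrightarrow\quad P = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 1 & 0 & 1 & 1/6 \\ 0 & 0 & 0 & 1 & 0 & 1/6 \\ 0 & 1/2 & 0 & 0 & 0 & 1/6 \end{bmatrix}$$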
  22. 22. Taxation is the way to representation! [Figure: pages a, b, c linking to a page.] If that page is a good page, then it’ll still be a good page if we “tax” the importance from a, b, and c. We can redistribute the taxed amounts to all, including lonely nodes!
  23. 23. The importance of a page is determined by the importance of pages that link to it* (*after tax and any benefits): $x_i = \sum_{j \in B_i} \alpha \frac{x_j}{d_j} + (1 - \alpha) b_i$. Annotations: the sum is the total importance that the pages j contribute to page i; $b_i$ is the benefits to page i; $\alpha$ sets the taxation rate of all.
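  24. 24. $$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} = \alpha \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 0 & 0 & 0 & 1/6 \\ 1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\ 1/3 & 0 & 1 & 0 & 1 & 1/6 \\ 0 & 0 & 0 & 1 & 0 & 1/6 \\ 0 & 1/2 & 0 & 0 & 0 & 1/6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} + (1 - \alpha) \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}$$
      Perron and Frobenius showed the new equation always has a unique solution: $x = \alpha P x + (1 - \alpha) b$.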
  25. 25. What von Mises and Richardson showed is that guess, check, and correct works! $x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b$. On the wee web-graph: $x^{(\text{start})} = (0.17, 0.17, 0.17, 0.17, 0.17, 0.17)^T$, $x^{(1)} = (0.05, 0.10, 0.17, 0.38, 0.19, 0.12)^T$, $x^{(2)} = (0.04, 0.06, 0.10, 0.36, 0.36, 0.08)^T$, and, after further iterations, $x \approx (0.03, 0.04, 0.06, 0.43, 0.39, 0.05)^T$.
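The guess, check, and correct iteration can be tried directly on the six-node example. The sketch below is an illustration rather than the talk's code: the function name `pagerank` is hypothetical, and the values `alpha = 0.85` and a uniform benefit vector `b` are assumptions (the slides do not state them), chosen because they reproduce the rounded vectors shown on the slide.

```python
def pagerank(P, alpha=0.85, b=None, iterations=100):
    """Iterate x_new = alpha * P x_old + (1 - alpha) * b from a uniform start.

    Returns the list of all iterates, starting with the initial guess.
    """
    n = len(P)
    b = b if b is not None else [1.0 / n] * n   # assumed: uniform "benefits"
    x = [1.0 / n] * n
    history = [x]
    for _ in range(iterations):
        x = [alpha * sum(p * xj for p, xj in zip(row, x)) + (1 - alpha) * bi
             for row, bi in zip(P, b)]
        history.append(x)
    return history

# The "fixed" matrix from slide 21: node 6 now links to everyone.
s = 1 / 6
P = [
    [0,   0,   0, 0, 0, s],
    [1/3, 0,   0, 0, 0, s],
    [1/3, 1/2, 0, 0, 0, s],
    [1/3, 0,   1, 0, 1, s],
    [0,   0,   0, 1, 0, s],
    [0,   1/2, 0, 0, 0, s],
]

history = pagerank(P)
for label, x in [("x(1)", history[1]), ("x(2)", history[2]), ("x(100)", history[-1])]:
    print(label, [round(v, 2) for v in x])
```

Every column of the fixed matrix sums to 1, so each iterate keeps a total importance of 1, and the taxed share `1 - alpha` is what guarantees that nodes 4 and 5 cannot absorb everything.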
  26. 26. 26
  27. 27. There’s still a lot of work left to do to make a search engine: make it fast! Watch out for spam. Watch out for manipulation. Personalize. Experiment!
  28. 28. Outline: (1) Google's PageRank; (2) multi-armed bandits and internet experiments.
  29. 29. Not this! Image: http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/
  30. 30. This! Four slot machines: pays out $0.92/dollar; pays out $0.98/dollar; pays out $0.95/dollar; pays out $0.99/dollar. Image: http://upload.wikimedia.org/wikipedia/en/8/82/Las_Vegas_slot_machines.jpg
  31. 31. What in the heck does a multi-armed bandit have to do with Google?
  32. 32. What in the heck does a multi-armed bandit have to do with Google? Pays out $0.92/view; pays out $0.66/view; pays out $0.91/view (to show ads); pays out -$0.02/view (hide ads).
  33. 33. How to optimize your website without exploiting the bandits: try condition A 100 times, find 45 “wins”; try condition B 100 times, find 85 “wins”; try condition C 100 times, find 10 “wins”; … Choose the best!
  34. 34. This field has some of the best terminology: Explore, Exploit, Regret.
  35. 35. This field has some of the best terminology. Explore: visiting Las Vegas. Exploit: your new winning strategy. Regret: that you didn’t quit after winning the first round.
  36. 36. This field has some of the best terminology. Explore: testing slot machines/experiments for their reward. Exploit: playing the best reward you’ve found so far. Regret: how much you lost due to exploration.
  37. 37. How to optimize your website without exploiting the bandits: try condition A 100 times, find 45 “wins”; try condition B 100 times, find 85 “wins”; try condition C 100 times, find 10 “wins”; … Choose the best! Pure exploration! We only exploit our findings at the end!
  38. 38. How to optimize your website exploiting the bandits. Pure exploration: try condition A 5 times, find 4 wins; try condition B 5 times, find 4 wins; try condition C 5 times, find 2 wins. Exploit our knowledge: try condition A 7 times, find 3 wins; try condition B 7 times, find 5 wins; try condition C 1 time, find 0 wins. Estimated return by condition: A 0.58, B 0.75, C 0.33.
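One simple way to mix exploring and exploiting play by play is the epsilon-greedy rule. The slides do not name a specific strategy, so this is a swapped-in illustration, not the method from the talk; the function name `epsilon_greedy` and the parameters `win_rates`, `epsilon`, and `seed` are hypothetical, and the win rates reuse the 45/85/10 percent figures from the 100-each experiment as if they were the true rates.

```python
import random

def epsilon_greedy(win_rates, n_plays, epsilon=0.1, seed=0):
    """Simulate n_plays of an epsilon-greedy player; return (plays, wins) per condition."""
    rng = random.Random(seed)
    k = len(win_rates)
    plays = [0] * k
    wins = [0] * k
    for t in range(n_plays):
        if t < k:
            arm = t                          # try every condition once first
        elif rng.random() < epsilon:
            arm = rng.randrange(k)           # explore: pick a condition at random
        else:
            # exploit: pick the condition with the best estimated return so far
            arm = max(range(k), key=lambda i: wins[i] / plays[i])
        plays[arm] += 1
        if rng.random() < win_rates[arm]:    # simulated "win" with the assumed rate
            wins[arm] += 1
    return plays, wins

plays, wins = epsilon_greedy([0.45, 0.85, 0.10], 300)
for name, p, w in zip("ABC", plays, wins):
    print(name, "plays:", p, "wins:", w, "est. return:", round(w / p, 2))
```

Setting `epsilon=0.0` makes the player purely greedy after the first round, while `epsilon=1.0` is pure exploration; values in between trade the two off.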
  39. 39. The goal of these problems is to construct optimal strategies to minimize regret. Regret: how much you left “on the table” by exploring. A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays T → ∞. Regret = E[play best always] − E[plays made based on data]. Regret for 100-each: 255/300 − 140/300 = 0.38. Regret for 30-mixed: 25.5/30 − (0.45 × 12 + 0.85 × 12 + 0.1 × 6)/30 = 0.31.
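The two regret figures on this slide can be checked with a short calculation. The sketch below assumes, as the slide's arithmetic appears to, that the observed win rates 0.45, 0.85, and 0.10 are treated as the true rates and that regret is reported per play; the function name `per_play_regret` is hypothetical.

```python
def per_play_regret(win_rates, plays):
    """Expected reward lost per play compared with always playing the best condition."""
    total = sum(plays)
    best = max(win_rates) * total                          # E[play best always]
    earned = sum(r * n for r, n in zip(win_rates, plays))  # E[plays actually made]
    return (best - earned) / total

# Win rates for conditions A, B, C from the 100-each experiment (assumed true rates).
rates = [0.45, 0.85, 0.10]

print(round(per_play_regret(rates, [100, 100, 100]), 2))  # 100-each: prints 0.38
print(round(per_play_regret(rates, [12, 12, 6]), 2))      # 30-mixed: prints 0.31
```

The mixed schedule plays A and B 12 times each and C 6 times (5 + 7, 5 + 7, and 5 + 1 from the previous slide), which is why its per-play regret is lower than that of the pure-exploration schedule.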
  40. 40. “[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.” Peter Whittle (Whittle, 1979), discussion of “Bandit processes and dynamic allocation indices”. Their importance to website optimization, advertising, and recommendation has rejuvenated research on these problems with fascinating new questions.
  41. 41. Math is everywhere, and especially in your favorite websites! Matrices and probability are key ingredients.
  42. 42. PageRank on Wikipedia. Note: top 10 articles on Wikipedia with highest PageRank (C: category, P: portal).
      α = 0.50: United States; C:Living people; France; Germany; England; United Kingdom; Canada; Japan; Poland; Australia.
      α = 0.85: United States; C:Main topic classif.; C:Contents; C:Living people; C:Ctgs. by country; United Kingdom; C:Fundamental; C:Ctgs. by topic; C:Wikipedia admin.; France.
      α = 0.99: C:Contents; C:Main topic classif.; C:Fundamental; United States; C:Wikipedia admin.; P:List of portals; P:Contents/Portals; C:Portals; C:Society; C:Ctgs. by topic.
      (Embedded slide footer: David F. Gleich (Sandia), Sensitivity, Purdue.)
