PageRank and The Google Matrix
Lecture on Google Matrix and Page Rank.

Published in: Technology
  • Never taught this course in MT. Taught for MASCO last Jan.
  • Hypertext Transfer Protocol…
  • Rows n and columns m. Inner dimensions must match.
  • In this example, initialize the π(0) vector to [1/6, 1/6, 1/6, …] and multiply it by H to get Iteration 1 in Table 4.1 of the book. This gives the same results as the PageRank formula.
  • A11 could be the probability that we stay where we are. A12 is the probability that we go to s2.
  • The i refers to rows only: if a row is all zeros, then a_i = 1. S has the same dimensions as H. a is 6 × 1 and e^T is 1 × 6, which gives a 6 × 6 matrix that is added to H. e^T is all ones.
  • The order of a matrix is m × n.
  • Multiply this by π(0), which is a 1 × 6 vector [1/6, 1/6, …]. Repeating the multiplication, you end up with a 1 × 6 PageRank vector. Interpretation: if one value is 0.37, then 37% of the time is spent on that page.

    1. Deriving “The Google Matrix”: $G = \alpha S + (1-\alpha)\frac{1}{n} e e^T$ (Lecture 4)
    2. B.S. Physics 1993, University of Washington; M.S. EE 1998, Washington State (four patents); 10+ years in search marketing; founder of SEMJ.org (research journal); blogger for SemanticWeb.com; president of Future Farm Inc.
    3. Build a focused crawler in Java, Python, or Perl. Point it at the MSU home page. Gather all the URLs and store them for later use. Check http://www.montana.edu/robots.txt. Store all the HTML and label it with a DocID. Read Google’s paper. Next time: PageRank and the Google Matrix. Contest: who can store the most unique URLs? Due Feb 7th (next week). Send code and URL list.
    4. #!/usr/bin/python
       ### Basic web crawler in Python (Python 2): grab a URL given on the command line
       ## Uses the urllib2 library for URLs and BeautifulSoup for parsing
       from BeautifulSoup import BeautifulSoup
       import sys  # read the URL from the command line
       import urllib2
       #### change the user-agent name
       from urllib import FancyURLopener

       class MyOpener(FancyURLopener):
           version = 'BadBot/1.0'

       print MyOpener.version  # print the user-agent name
       # Note: urllib2.urlopen() does not go through MyOpener, so this request
       # is sent with urllib2's default User-Agent, not 'BadBot/1.0'.
       httpResponse = urllib2.urlopen(sys.argv[1])
    5. # store the HTML page in an object called htmlPage
       htmlPage = httpResponse.read()
       print htmlPage
       htmlDom = BeautifulSoup(htmlPage)
       # dump page title
       print htmlDom.title.string
       # dump all links in page
       allLinks = htmlDom.findAll('a', {'href': True})
       for link in allLinks:
           print link['href']
       # print name of bot
       print MyOpener.version
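The crawler above targets Python 2 (`urllib2`, the `print` statement). As a rough sketch only, the link-extraction step could look like this in Python 3 using just the standard library. The class name `LinkCollector`, the helper `unique_links`, and the sample HTML string are illustrative assumptions and do not come from the lecture; the sample string stands in for a live fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def unique_links(html, base_url):
    """Return the distinct absolute URLs found in html, in first-seen order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


# Hypothetical page content, used here instead of downloading a page.
sample = '<a href="/a.html">A</a> <a href="b.html">B</a> <a href="/a.html">A again</a>'
links = unique_links(sample, "http://www.montana.edu/index.html")
print(links)
```

Deduplicating with `dict.fromkeys` keeps the first-seen order, which is convenient when the goal is a list of unique URLs.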
    6. Open source Java-based crawler:
       https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=AE9A595F01CAAB59BBCDC50C8A3ED2A9
       http://www.robotstxt.org/robotstxt.html
       http://www.commoncrawl.org/
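Since the assignment points at a robots.txt file, a crawler would normally check it before fetching a page. The following is a minimal sketch using Python 3's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the helper name `allowed` are made up for illustration; a real crawler would load the site's actual robots.txt instead of a hard-coded list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; not taken from any real site.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(url, agent="BadBot/1.0"):
    """Return True if the given user agent may fetch url under the parsed rules."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```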
    7. [Figure: example web graph with six pages, nodes labeled 1 through 6]
    8. $r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$
       • $r(P_i)$ is the PageRank of page $P_i$
       • $|P_j|$ is the number of outlinks from page $P_j$
       • $B_{P_i}$ is the set of pages pointing into $P_i$
    9. The $r(P_j)$ values of the inlinking pages are unknown, so a starting value is needed. The values could be initialized to $1/n$, where $n$ is the number of pages: $r_0(P_i) = 1/n$ for all pages $P_i$. The process is repeated until a stable value is obtained (this will not happen in all cases). Will this converge?
    10. $r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}$
        • $r_{k+1}(P_i)$ is the PageRank of page $P_i$ at iteration $k+1$
        • $r_0(P_i) = 1/n$, where $n$ is the number of nodes
        • $r(P_i)$ is the PageRank of page $P_i$
        • $|P_j|$ is the number of outlinks from page $P_j$
        • $B_{P_i}$ is the set of pages pointing into $P_i$
    11. Iteration 0      Iteration 1       Iteration 2        Rank at k = 2
        r0(P1) = 1/6     r1(P1) = 1/18     r2(P1) = 1/36      5
        r0(P2) = 1/6     r1(P2) = 5/36     r2(P2) = 1/18      4
        r0(P3) = 1/6     r1(P3) = 1/12     r2(P3) = 1/36      5
        r0(P4) = 1/6     r1(P4) = 1/4      r2(P4) = 17/72     1
        r0(P5) = 1/6     r1(P5) = 5/36     r2(P5) = 11/72     3
        r0(P6) = 1/6     r1(P6) = 1/6      r2(P6) = 14/72     2
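The iteration table can be reproduced with a few lines of Python. The link structure of the six-page graph is not spelled out in the slide text, so the `outlinks` dictionary below is an assumption, chosen because it yields exactly the values in the table. The function name `iterate` is likewise only illustrative. Exact fractions are used so the results can be compared with the table directly.

```python
from fractions import Fraction

# Assumed link structure (page -> pages it links to); page 2 has no outlinks.
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}


def iterate(rank):
    """One sweep of r_{k+1}(P_i) = sum over inlinking P_j of r_k(P_j) / |P_j|."""
    new_rank = {page: Fraction(0) for page in outlinks}
    for source, targets in outlinks.items():
        for target in targets:
            # Each page passes an equal share of its rank along every outlink.
            new_rank[target] += rank[source] / len(targets)
    return new_rank


n = len(outlinks)
r0 = {page: Fraction(1, n) for page in outlinks}
r1 = iterate(r0)
r2 = iterate(r1)
for page in sorted(outlinks):
    print(page, r0[page], r1[page], r2[page])
```

Under this assumed graph the ranks no longer sum to 1 after the first sweep, because page 2 has no outlinks and its share is lost at every step. This is the dangling-node issue the later slides address.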
    12. $[n \times m] \cdot [m \times r] = [n \times r]$
    13. • The non-zero elements of row $i$ correspond to the pages that page $i$ links out to.
        • The non-zero elements of column $i$ correspond to the pages that link into page $i$.
    14. $\pi^{(k+1)T} = \pi^{(k)T} H$, where $\pi^T$ is a $1 \times n$ row vector.
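As a sketch of the matrix form, the snippet below multiplies a row vector by a hyperlink matrix $H$. The entries of `H` are an assumption: they encode a six-page link structure that is not stated in the slide text but is consistent with the iteration values quoted elsewhere in the lecture. The helper name `vec_mat` is illustrative.

```python
from fractions import Fraction as F

# Assumed hyperlink matrix: H[i][j] = 1/|P_i| if page i+1 links to page j+1, else 0.
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],  # page 2 has no outlinks
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]


def vec_mat(pi, M):
    """Row vector times matrix: entry j is the sum over i of pi[i] * M[i][j]."""
    return [sum(pi[i] * M[i][j] for i in range(len(pi))) for j in range(len(M[0]))]


pi0 = [F(1, 6)] * 6
pi1 = vec_mat(pi0, H)
pi2 = vec_mat(pi1, H)
print(pi1)
print(pi2)
```

A $1 \times 6$ vector times a $6 \times 6$ matrix gives a $1 \times 6$ vector, so the product can be fed straight back into the next iteration.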
    15. • Rank sinks and convergence
        • Resembles work done on Markov chains
          • $H$ = transition probability matrix
          • Converges to a unique positive vector if the matrix is:
            • Stochastic: each row sums to 1.
            • Irreducible: non-zero probability of transitioning (even if in more than one step) from any state to any other state.
            • Aperiodic: no requirement on the number of steps needed to return to a state $i$; returns can be irregular.
            • Primitive: irreducible and aperiodic.
    16. Next state depends only on the current state (no memory).
    17. • “Random surfer” model
          • Following hyperlinks
          • Time spent on a page is proportional to its importance.
        • Fixes the “dangling node” problem, where the surfer gets stuck on a node with no outlinks (PDF files, images, etc.).
        • Need to allow the surfer to “teleport”, that is, make random jumps.
    18. $S = H + a\left(\frac{1}{n} e^T\right)$, where $a_i = 1$ if page $i$ is dangling and $0$ otherwise, $e^T$ ($1 \times 6$) is a row vector of all ones, and $n$ is the number of nodes.
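The dangling-node fix can be sketched in a few lines. The matrix `H` below is an assumed six-page example (its link structure is not given in the slide text), in which page 2 is the only dangling node. The code builds the indicator vector $a$ and then $S = H + a(\frac{1}{n} e^T)$ entry by entry.

```python
from fractions import Fraction as F

# Assumed hyperlink matrix for a six-page web; row 2 is all zeros (dangling page).
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]
n = len(H)

# a_i = 1 if row i of H is all zeros (page i is dangling), otherwise 0.
a = [1 if sum(row) == 0 else 0 for row in H]

# S = H + a * (1/n) e^T: every dangling row is replaced by the uniform row 1/n.
S = [[H[i][j] + a[i] * F(1, n) for j in range(n)] for i in range(n)]

row_sums = [sum(row) for row in S]
print(a)
print(row_sums)
```

After the fix every row of `S` sums to 1, so `S` is stochastic, while the non-dangling rows are unchanged from `H`.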
    19. Serendipity? Page and Brin introduced an “adjustment”: the random surfer can “teleport” by entering a new destination into the browser.
    20. • Teleportation matrix: $E = \frac{1}{n} e e^T$
        • $\alpha$ controls the proportion of time a “random surfer” follows hyperlinks as opposed to teleporting. If $\alpha = 0.5$, half the time is spent on each.
        • At $\alpha = 0.5$, about 34 iterations are required to converge to a tolerance of $10^{-10}$.
        • Originally set at 0.85. As $\alpha \to 1$, computation time grows. Sensitivity issue.
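The iteration count quoted for $\alpha = 0.5$ can be checked with a back-of-the-envelope estimate. The sketch below assumes the error shrinks by roughly a factor of $\alpha$ per iteration, so the smallest $k$ with $\alpha^k \le \text{tol}$ is $k = \lceil \log(\text{tol}) / \log(\alpha) \rceil$. This is a rule of thumb, not an exact bound, and the function name `iterations_needed` is illustrative.

```python
import math


def iterations_needed(alpha, tol=1e-10):
    """Smallest k with alpha**k <= tol, assuming the error contracts by alpha per step."""
    return math.ceil(math.log(tol) / math.log(alpha))


for alpha in (0.5, 0.85):
    print(alpha, iterations_needed(alpha))
```

Under this assumption $\alpha = 0.5$ gives 34 iterations, matching the slide, while $\alpha = 0.85$ gives 142, which illustrates why computation time grows as $\alpha$ approaches 1.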
    21. $G = \alpha S + (1-\alpha)\frac{1}{n} e e^T$
    22. $\pi^{(k+1)T} = \pi^{(k)T} G$
        • 2002: the world’s largest matrix computation. Order in 2002: ~$8.1 \times 10^9$.
    23. $G = \alpha H + (\alpha a + (1-\alpha) e)\frac{1}{n} e^T$
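The last form of $G$ means the iteration never has to build the dense matrix. Because $\pi^T e = 1$, one step is $\pi^{(k+1)T} = \alpha \pi^{(k)T} H + (\alpha \pi^{(k)T} a + (1-\alpha)) \frac{1}{n} e^T$. The following is a minimal sketch of that power iteration, not a production implementation. The matrix `H` is an assumed six-page example (the link structure is not given in the slide text), and the function name `pagerank`, the tolerance, and the iteration cap are illustrative choices.

```python
# Assumed hyperlink matrix for a six-page web; page 2 (row index 1) is dangling.
H = [
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1 / 3, 1 / 3, 0.0, 0.0, 1 / 3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]


def pagerank(H, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on G using only H, the dangling vector a, and a scalar jump term."""
    n = len(H)
    dangling = [i for i, row in enumerate(H) if sum(row) == 0]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # (alpha * pi^T a + (1 - alpha)) / n is added to every entry.
        jump = (alpha * sum(pi[i] for i in dangling) + (1 - alpha)) / n
        new_pi = [
            alpha * sum(pi[i] * H[i][j] for i in range(n)) + jump for j in range(n)
        ]
        change = sum(abs(x - y) for x, y in zip(new_pi, pi))
        pi = new_pi
        if change < tol:
            break
    return pi


pi = pagerank(H)
ranking = sorted(range(1, len(pi) + 1), key=lambda page: -pi[page - 1])
print([round(value, 4) for value in pi])
print(ranking)
```

Each entry of the resulting vector can be read as the fraction of time the random surfer spends on that page. For a real web graph `H` would be stored sparsely, which is the point of writing $G$ in terms of $H$.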
