Lecture on Google Matrix and Page Rank.


- 1. Lecture 4: Deriving “The Google Matrix”: G = αS + (1 − α)(1/n)ee^T
- 2. B.S. Physics, 1993, University of Washington. M.S. EE, 1998, Washington State (four patents). 10+ years in search marketing. Founder of SEMJ.org (research journal). Blogger for SemanticWeb.com. President of Future Farm Inc.
- 3. Homework: Build a focused crawler in Java, Python, or Perl. Point it at the MSU home page, gather all the URLs, and store them for later use. Respect http://www.montana.edu/robots.txt. Store all the HTML and label it with a DocID. Read Google’s paper. Next time: PageRank and the Google Matrix. Contest: who can store the most unique URLs? Due Feb 7th (next week). Send code and URL list.
- 4. #!/usr/bin/python
     ### Basic web crawler in Python: grab a URL from the command line
     ## Uses the urllib2 library to fetch URLs and BeautifulSoup to parse HTML
     from BeautifulSoup import BeautifulSoup
     import sys       # allow the user to pass a URL as a command-line argument
     import urllib2
     #### change the user-agent name
     from urllib import FancyURLopener
     class MyOpener(FancyURLopener):
         version = "BadBot/1.0"
     print MyOpener.version   # print the user-agent name
     httpResponse = urllib2.urlopen(sys.argv[1])
- 5. # store the HTML page in an object called htmlPage
     htmlPage = httpResponse.read()
     print htmlPage
     htmlDom = BeautifulSoup(htmlPage)
     # dump the page title
     print htmlDom.title.string
     # dump all links on the page
     allLinks = htmlDom.findAll("a", {"href": True})
     for link in allLinks:
         print link["href"]
- 6. Heritrix, an open-source Java-based crawler:
     https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=AE9A595F01CAAB59BBCDC50C8A3ED2A9
     http://www.robotstxt.org/robotstxt.html
     http://www.commoncrawl.org/
- 7. [Diagram: example directed web graph on six nodes, 1–6]
- 8. r(Pi) = Σ_{Pj ∈ B_Pi} r(Pj)/|Pj|
     • r(Pi) is the PageRank of page Pi
     • |Pj| is the number of outlinks from page Pj
     • B_Pi is the set of pages pointing into Pi
- 9. The r(Pj) values of the inlinking pages are unknown, so we need a starting value. We could initialize the values to 1/n (n = number of pages): r0(Pi) = 1/n for all pages Pi. The process is then repeated until a stable value is obtained (this will not happen in all cases). Will it converge?
- 10. r_{k+1}(Pi) = Σ_{Pj ∈ B_Pi} r_k(Pj)/|Pj|
      • r_{k+1}(Pi) is the PageRank of Pi at iteration k + 1
      • r_0(Pi) = 1/n, where n is the number of nodes
      • |Pj| is the number of outlinks from page Pj
      • B_Pi is the set of pages pointing into Pi
- 11. Page | Iteration 0 | Iteration 1 | Iteration 2 | Rank at k = 2
      P1   | 1/6         | 1/18        | 1/36        | 5
      P2   | 1/6         | 5/36        | 1/18        | 4
      P3   | 1/6         | 1/12        | 1/36        | 5
      P4   | 1/6         | 1/4         | 17/72       | 1
      P5   | 1/6         | 5/36        | 11/72       | 3
      P6   | 1/6         | 1/6         | 14/72       | 2
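The iteration above can be reproduced exactly with exact fractions. The edge list below is an assumption inferred from the table values (each r1 entry factors uniquely into inlink contributions); page 2 is dangling, which is why the totals shrink at each step.

```python
from fractions import Fraction

# Hypothetical six-node graph inferred from the iteration table
# (page i -> its outlinked pages; page 2 has no outlinks).
out = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
n = len(out)
r = {p: Fraction(1, n) for p in out}   # r0(Pi) = 1/n

for k in range(2):                     # two iterations gives r2
    nxt = {p: Fraction(0) for p in out}
    for p, links in out.items():
        for q in links:                # each page splits its rank
            nxt[q] += r[p] / len(links)
    r = nxt

print(r)  # r2 = {1: 1/36, 2: 1/18, 3: 1/36, 4: 17/72, 5: 11/72, 6: 7/36}
```

Note that sum(r.values()) is less than 1 after each step: the dangling page 2 leaks rank mass, which is exactly the problem the S matrix fixes later.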
- 12. [n×m] * [m×r] = [n×r]
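A quick sketch of the shape rule, with arbitrary illustrative dimensions:

```python
import numpy as np

# An (n x m) matrix times an (m x r) matrix yields an (n x r) result.
A = np.ones((2, 3))   # n = 2, m = 3
B = np.ones((3, 4))   # m = 3, r = 4
C = A @ B
print(C.shape)  # (2, 4)
```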
- 13. • The non-zero elements of row i are the outlinking pages of page i
      • The non-zero elements of column i are the inlinking pages of page i
- 14. π^(k+1)T = π^(k)T H, where π^T is a 1×n row vector
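In matrix form the same iteration is a row vector times H. The matrix below is an assumption: it is the row-normalized hyperlink matrix of the six-node example graph inferred from the iteration table (row index 1, i.e., page 2, is all zeros because that page is dangling).

```python
import numpy as np

# Row-normalized hyperlink matrix H for the assumed six-node graph
# (0-based indices: page i+1 -> pages j+1).
edges = {0: [1, 2], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}
n = 6
H = np.zeros((n, n))
for i, outs in edges.items():
    H[i, outs] = 1.0 / len(outs)

pi = np.full(n, 1.0 / n)   # pi^(0)T = (1/n, ..., 1/n)
for _ in range(2):
    pi = pi @ H            # pi^(k+1)T = pi^(k)T H
print(pi)  # matches r2: [1/36, 1/18, 1/36, 17/72, 11/72, 14/72]
```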
- 15. • Rank sinks & convergence
      • Resembles work done on Markov chains
      • H = transition probability matrix
      • Converges to a unique positive vector if:
        • Stochastic: each row sums to 1
        • Irreducible: non-zero probability of transitioning (possibly in more than one step) to any other state
        • Aperiodic: no requirement on how many steps it takes to return to a state i; returns can be irregular
        • Primitive: irreducible and aperiodic
- 16. The next state depends only on the current state (no memory)
- 17. • The “Random Surfer” model: following hyperlinks
      • Time spent on a page is proportional to its importance
      • Must fix the “dangling node” problem: the surfer gets stuck on a node with no outlinks (PDF files, images, etc.)
      • Need to allow the surfer to “teleport”, i.e., make random jumps
- 18. S = H + a(1/n)e^T, where a_i = 1 if page i is dangling and 0 otherwise; e^T (1×6) is all 1’s; n = number of nodes
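A minimal sketch of the dangling-node fix, again assuming the six-node example H from earlier: the indicator vector a flags the zero row, and adding a(1/n)e^T replaces it with a uniform row so every row of S sums to 1.

```python
import numpy as np

# Assumed six-node example hyperlink matrix H (page 2 is dangling).
n = 6
H = np.zeros((n, n))
for i, outs in {0: [1, 2], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}.items():
    H[i, outs] = 1.0 / len(outs)

a = (H.sum(axis=1) == 0).astype(float)     # a_i = 1 iff row i is dangling
S = H + np.outer(a, np.full(n, 1.0 / n))   # S = H + a (1/n) e^T
print(S.sum(axis=1))  # every row now sums to 1: S is stochastic
```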
- 19. Serendipity? Page and Brin introduced an “adjustment”: the random surfer can “teleport”, entering a new destination into the browser.
- 20. • Teleportation matrix: E = (1/n)ee^T
      • α controls the proportion of time a “random surfer” follows hyperlinks as opposed to teleporting. If α = 0.5, half the time is spent on each.
      • At α = 0.5, about 34 iterations are required to converge to a tolerance of 10^-10.
      • Originally set at 0.85. As α → 1, computation time grows; there is also a sensitivity issue.
- 21. G = αS + (1 − α)(1/n)ee^T
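Putting the pieces together, a sketch that builds G for the assumed six-node example and runs the power iteration with α = 0.85 until the change falls below 10^-10. Because G is stochastic, irreducible, and aperiodic, the iteration converges to a unique positive vector.

```python
import numpy as np

n, alpha = 6, 0.85
# Assumed six-node example hyperlink matrix H (page 2 is dangling).
H = np.zeros((n, n))
for i, outs in {0: [1, 2], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}.items():
    H[i, outs] = 1.0 / len(outs)
a = (H.sum(axis=1) == 0).astype(float)
S = H + np.outer(a, np.full(n, 1.0 / n))        # fix dangling nodes
G = alpha * S + (1 - alpha) * np.full((n, n), 1.0 / n)  # add teleportation

pi = np.full(n, 1.0 / n)
for _ in range(200):
    nxt = pi @ G                                # pi^(k+1)T = pi^(k)T G
    if np.abs(nxt - pi).sum() < 1e-10:
        break
    pi = nxt
print(pi)  # stationary PageRank vector; components are positive and sum to 1
```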
- 22. π^(k+1)T = π^(k)T G
      In 2002 this was the world’s largest matrix computation: order ~8.1 × 10^9!
- 23. G = αH + (αa + (1 − α)e)(1/n)e^T
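This form follows by substituting S = H + a(1/n)e^T into G = αS + (1 − α)(1/n)ee^T and collecting the rank-one terms. A quick numerical check of the equivalence, using the same assumed example H:

```python
import numpy as np

n, alpha = 6, 0.85
# Assumed six-node example hyperlink matrix H (page 2 is dangling).
H = np.zeros((n, n))
for i, outs in {0: [1, 2], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}.items():
    H[i, outs] = 1.0 / len(outs)
a = (H.sum(axis=1) == 0).astype(float)
e = np.ones(n)

S = H + np.outer(a, e / n)
G1 = alpha * S + (1 - alpha) * np.outer(e, e) / n        # via S
G2 = alpha * H + np.outer(alpha * a + (1 - alpha) * e, e / n)  # expanded form
print(np.allclose(G1, G2))  # True: the two expressions are identical
```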
