PageRank & Searching


Published in: Technology

Rahul Bindra, Aditya Nigudkar
B.E., Computer Science, 2nd Year
Madhav Institute of Technology and Science, Gwalior

ABSTRACT

Information on the web is increasing exponentially, and so is the number of people seeking it. To waste as little time as possible on searching, we use search engines for almost every bit of information on the web. There have been many search engines in the past, such as AltaVista, Inktomi and Overture, but they failed to rank their results well. In the late 1990s Google, led by Sergey Brin and Larry Page, revolutionized the search engine industry with its PageRank system for ranking search results, which helped Google grow from a struggling 2-member search amateur into a multi-billion dollar search enterprise. This paper aims to give an insight into the pros and cons of the PageRank system and how it changed the meaning of searching for billions of people across the globe.

1. Introduction

PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank) and Sergey Brin as part of a research project about a new kind of search engine. The project started in 1995 and led to a functional prototype, named Google, in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors which determine the ranking of Google search results, PageRank continues to provide the basis for all of Google’s web search tools.

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. PageRank actually works on an intuitive system, which works
as a model of a web surfer’s behaviour. It works on the probability that a web page will be randomly accessed by a web surfer. It also takes into consideration the pages that link to the page. Using Yahoo as an example, the justification of this is that if a page like Yahoo were to link directly to another page, it is very likely that the page is of high quality (Brin, S., Page, L., 2000).

2. PageRank

2.1 What is PageRank?

PageRank is a link analysis algorithm which assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).

Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page’s importance from the votes cast for it. The importance of each vote is taken into account when a page’s PageRank is calculated. PageRank is Google’s method of measuring a page’s "importance." When all other factors such as the Title tag and keywords are taken into account, Google uses PageRank to adjust results so that sites that are deemed more "important" will move up in the results page of a user’s search accordingly. A basic overview of how Google ranks pages in their search engine results pages (SERPs) follows:

1) Find all pages matching the keywords of the search.
2) Rank accordingly using "on the page factors" such as keywords.
3) Calculate in the inbound anchor text.
4) Adjust the results by PageRank scores.

2.2 How is PageRank Calculated?

Quoting from the original Google paper, PageRank is defined like this:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.

But that’s not too helpful, so let’s break it down into sections.

PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web all the way up to “PR(Tn)” for the last page.
C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for page n, and so on for all pages.

PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page “n”, the share of the vote page A will get is “PR(Tn)/C(Tn)”.

d(... - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”).

(1 - d) - The (1 - d) bit at the beginning is a bit of probability math magic so the “sum of all web pages’ PageRanks will be one”: it adds in the bit lost by the d(.... It also means that if a page has no links to it (no backlinks), it will still get a small PR of 0.15 (i.e. 1 - 0.85). (Aside: the Google paper says “the sum of all pages” but they mean “the normalised sum”, otherwise known as “the average” to you and me.)

We can think of it in a simpler way:

a page’s PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

"share" = the linking page’s PageRank divided by the number of outbound links on that page.

2.3 Need Of Iterations

The equation above clearly shows how a page’s PageRank is arrived at. But what isn’t immediately obvious is that it can’t work if the calculation is done just once. Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:

Step 1: Calculate page A’s PageRank from the value of its inbound links.
Page A now has a new PageRank value. The calculation used the value of the inbound link from page B.
But page B has an inbound link (from page A) and its new PageRank value hasn’t been worked out yet, so page A’s new PageRank value is based on inaccurate data and can’t be accurate.

Step 2: Calculate page B’s PageRank from the value of its inbound links.
Page B now has a new PageRank value, but it can’t be accurate because the calculation used the new PageRank value of the inbound link from page A, which is inaccurate.

It’s a Catch 22 situation. We can’t work out A’s PageRank until we know B’s PageRank, and we can’t work out B’s PageRank until we know A’s PageRank.

The problem is overcome by repeating the calculations many times. Each time produces slightly more accurate values. In fact, total accuracy can never be achieved because the calculations are always based on inaccurate values. 40 to 50 iterations are sufficient to reach a point where any further iterations wouldn’t produce enough of a change to the values to matter. This is precisely what Google does at each update, and it’s the reason why the updates take so long.

2.4 Examples
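
As a first worked case, the two-page situation from Section 2.3 can be solved by hand. The derivation below is a sketch that assumes d = 0.85 and that both pages are updated from the previous round's values; the error symbol e_k (the larger of the two pages' distances from the final value after k rounds) is introduced here only for illustration.

```latex
\begin{aligned}
PR(A) &= (1-d) + d\,PR(B), \qquad PR(B) = (1-d) + d\,PR(A) \\
PR(A) &= (1-d) + d\bigl[(1-d) + d\,PR(A)\bigr] = (1-d)(1+d) + d^{2}\,PR(A) \\
\Rightarrow\quad PR(A) &= \frac{(1-d)(1+d)}{1-d^{2}} = 1, \qquad PR(B) = 1 \\[4pt]
PR_{k+1}(A) - 1 &= d\,\bigl(PR_{k}(B) - 1\bigr), \qquad PR_{k+1}(B) - 1 = d\,\bigl(PR_{k}(A) - 1\bigr) \\
\Rightarrow\quad e_{k} &= d^{\,k}\, e_{0}, \qquad 0.85^{40} \approx 1.5\times10^{-3}, \qquad 0.85^{50} \approx 3.0\times10^{-4}
\end{aligned}
```

Under these assumptions each round shrinks the remaining error by the factor d, which is consistent with the remark that 40 to 50 iterations leave changes too small to matter.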
1. Let’s consider a 3 page site (pages A, B and C) with no links coming in from the outside. We will allocate each page an initial PageRank of 1, although it makes no difference whether we start each page with 1, 0 or 99. Apart from a few millionths of a PageRank point, after many iterations the end result is always the same. Starting with 1 requires fewer iterations for the PageRanks to converge to a suitable result than when starting with 0 or any other number.

[Figure: pages A, B and C, each with an initial PR of 1]

The site’s maximum PageRank is the amount of PageRank in the site. In this case, we have 3 pages so the site’s maximum is 3. At the moment, none of the pages link to any other pages and none link to them. If you make the calculation once for each page, you’ll find that each of them ends up with a PageRank of 0.15. No matter how many iterations you run, each page’s PageRank remains at 0.15. The total PageRank in the site = 0.45, whereas it could be 3. The site is seriously wasting most of its potential PageRank.

2. Now, let us add some links to the above example. Start again with PR 1 all round. After 1 iteration:

Page A = 1.425
Page B = 1
Page C = 0.575

By comparison to the 1 iteration figures in the previous example, page A has lost some PageRank, page B has gained some and page C stayed the same. Page C now shares its "vote" between A and B. Previously A received all of it. That’s why page A has lost out and why page B has gained. After 100 iterations:

Page A = 1.298245
Page B = 0.9999999
Page C = 0.7017543
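
The iterative calculation behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation. The link diagram for example 2 is not available here, so the structure in `LINKS` (A links to B and C, B links to A, C links to A and B) is an assumption chosen because it is consistent with the numbers quoted above; the function name `pagerank` and its parameters are likewise illustrative.

```python
# Assumed link structure for example 2 (source page -> pages it links to).
# The original diagram is missing; this is one structure consistent with the text.
LINKS = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
}


def pagerank(links, d=0.85, iterations=100, start=1.0):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) to every page."""
    ranks = {page: start for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Add up the "share" contributed by every page that links here.
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        # All pages are updated together from the previous round's values.
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    print("after 1 iteration:   ", pagerank(LINKS, iterations=1))
    print("after 100 iterations:", pagerank(LINKS, iterations=100))
    # Example 1: three pages with no links at all.
    print("no links:            ", pagerank({"A": [], "B": [], "C": []}))
```

With the assumed links, one round gives 1.425, 1 and 0.575 for A, B and C, and 100 rounds agree with the values quoted above to roughly six decimal places. The unlinked site of example 1 stays at 0.15 per page however many rounds are run.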

[Figure: pages A, B and C, each with an initial PR of 1]

3. Another example involving multiple links is as follows:

The calculated PR values are depicted in the following diagram:

As you’d expect, the home page has the most PR; after all, it has the most incoming links! But what’s happened to the average? It’s only 0.378. Take a look at the “external site” pages: what’s happening
to their PageRank? They’re not passing it on, they’re not voting for anyone, they’re wasting their PR. A possible remedy to this problem is suggested in the following figure:

Thus it can be inferred from the above example that the way a site is interlinked with other sites plays a very important role in determining its “Average PR”, which helps it to improve its position in Google’s search results.

2.5 PageRank Feedback

Another important aspect of PageRank is "PageRank Feedback." Pages linking to each other can create a feedback effect that can increase the PageRank of those pages.

Let’s say that page A currently links to nowhere. If we add a link from page A to page B, then page A is saying that page B is important. This means the measure of page B’s votes is also increased. Page B is now saying that the pages it links to are more important than they otherwise would be. So the measure of those pages’ votes will be increased, and so on (with the pages they link to through the link structure). The effect is diluted as it moves down through the links. If we could point our web browser at page B and, by clicking on the links, get to page A, then so could the Google algorithm (at least one of the pages linking to page A has become more important). If a page linking to page A is more important, then so is its vote, and subsequently page A becomes more important! So by linking to page B, page A has made itself more important, thus creating PageRank Feedback.

An interesting thing about PR Feedback is that you can use it to your advantage via the internal navigational structure of your site. It is very important to keep as much PageRank within your site as possible. Therefore, only link out to other sites from low PageRank pages of your site.

3. Optimization Of PageRank

There are three fundamental areas to look at when trying to optimize the PageRank for your site:

1. The links pointing to your site: which ones you pursue, and how much effort you put into getting them.
2. Who you choose to link out to from your site, and from which page of your site you place their link. (Maximizing PageRank Feedback and minimizing PageRank leakage.)
3. The internal navigational structure and linkage of your pages, in order to best distribute PageRank within your site.

3.1 Links To Your Site

When looking for links to your site, from a purely PageRank point of view, one might think you should simply look for pages that have the highest Toolbar PageRank. However, this way of thinking is incorrect. As more and more people try to get links from only high PageRank sites, it becomes less and less of a winning proposition.

The actual PageRank from an individual page is shared out amongst the links on that page. So, a link from a page with a PageRank of 4 might be better than a link from a page with a PageRank of 6 if there are fewer total links on the PR 4 page.

3.2 Links Out Of Your Site

When considering links out of your site there is one golden rule: generally, you will want to keep PageRank within your own site. This does not mean that you will lose PageRank from your site by linking out, but that the total PageRank within it may be lower than it could have been had you not linked out.

The best outbound link PageRank scenario occurs when the outbound link comes from a page that has both:

a) A low PageRank
b) A lot of links to pages on your site.

How can this best be achieved? One way would be by writing reviews of the sites we link out to on a separate page of our site, and by providing a link to those reviews along with each hyperlink to the external site. We will have to make sure that the review page also links back to a page in our own site that is high up its structure. (It’s best if this is our home page, but any important page will do.)
By doing this, we’ve significantly reduced the amount of PageRank we’ve let out of our site. We’ve targeted the distribution of PageRank to the home page to ensure that less is passed back through our links page (which would be a wasted opportunity), and more is put elsewhere in our site. This concept can be illustrated diagrammatically as follows:

If we perform the same calculations but include the review pages, here’s what we get:

Without the Review Pages:
The HomePage PR is: 0.9536152797
B, C, D PR is: 0.4201909959
Total is: 2.2141882674

With the Review Pages:
The HomePage PR is: 2.439718935
B, C, D PR is: 0.8412536982
Total is: 4.9634800296

Thus it can be observed from the above results that adding review pages to a site’s link architecture helps to improve its PR by keeping the PR within the architecture.

3.3 Internal Structures And Linkages

To get a high PageRank, it is not enough to have tens of thousands of pages. Those pages must also be in Google’s index. To achieve this they must contain enough content for Google’s algorithms to consider them worthy of being added to the index. As you develop content for your site, you are also creating more PageRank for your site. It’s hard work, and you’re creating it slowly, but if you’re also creating pages that people will want to link to, then you’re doing yourself two favors at once: PageRank from both directions. Or, to put it in very basic terms: the best “internal” thing that can be done to build PageRank is to write lots of good content. Ensure that pages aren’t overly short or overly long, and break the content into several pages where necessary. There are three different ways in which pages can be interlinked within a site:

Hierarchical: Total PageRank in this site is 4
Looping:
Total PageRank in this site is 4
Extensive Interlinking: Total PageRank in this site is 4

The maximum benefit from PageRank is derived when this methodology is applied towards pages that want to rank high for highly competitive keyword phrases, or towards pages that must compete on a large number of keyword phrases. The Extensive Interlinking strategy retains the most PageRank within the site. Following this is the Hierarchical strategy and lastly the Looping strategy.

4. Significance Of PageRank

Of course, important pages mean nothing to you if they don’t match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page’s content (and the content of the pages linking to it) to determine if it’s a good match for your query. In this way, PageRank helps Google to bring the best ranked search results corresponding to your query.

A version of PageRank has recently been proposed as a replacement for the traditional ISI impact factor. Instead of merely counting citations of a journal, the "quality" of a citation is determined in a PageRank fashion.

A Web crawler may use PageRank as one of a number of importance metrics it uses to determine which URL to visit next during a crawl of the web. One of the early working papers which was used in the creation of Google is “Efficient crawling through URL ordering”, which discusses the use of a number of different importance metrics to determine how deeply, and how much of a site Google will
crawl. PageRank is presented as one of a number of these importance metrics, though there are others listed, such as the number of inbound and outbound links for a URL, and the distance from the root directory on a site to the URL.

5. Conclusion

To sum up, PageRank is a technique that has revolutionized the concept of searching for every web surfer across the globe, putting the whole world’s knowledge just a click away for anyone. It has assisted in the unification of cultures and the sharing of ideas by providing an easier, faster and more relevant way to answer a user’s query. This technique, along with Google’s huge database of information, has emerged as a one-stop point for every user around the world with any type of query.

6. References

1. David A. Vise and Mark Malseed, “The Google Story”, Delacorte Press.
2. Nancy Blachman, “The Google Guide”.
3. 12-2005-1
4. Sergey Brin and Lawrence Page, “The Anatomy of a search engine”.
7. /Understanding Googles Page Rank System.htm
8. Chris Ridings and Mike Shishigin, “PageRank Uncovered”.
9. Pagerank Explained_ Googles PageRank and how to make the most of it.htm
10. Google Support Center, “Google Technology”.
11. Wikipedia, “How PageRank Works”.
12. Ian Rogers, “The Google Pagerank Algorithm and How It Works”.