Cloud Computing Project


This is my project report on implementing the PageRank algorithm using Map-Reduce in pseudo-distributed and distributed modes.

Published in: Technology

Cloud Computing Project

  1. 1. Page Rank Implementation<br />CLOUD COMPUTING PROJECT<br />-Team 3<br />By:<br />- Devendra Singh Parmar<br />
  2. 2. Project Abstract<br />Instructor: Prof. Reddy Raja<br />Mentor: Ms M.Padmini<br />To implement the PageRank algorithm using Map-Reduce for Wikipedia and verify it on smaller data sets<br />
  3. 3. Agenda<br /><ul><li>Motivation
  4. 4. Introduction to Algorithm
  5. 5. PageRank Equation Analysis
  6. 6. Brief Description of Project
  7. 7. Module1
  8. 8. Module2
  9. 9. Module3
  10. 10. Applications </li></li></ul><li>Motivation<br />-&gt; Need for PageRank:<br /> Search engines store billions of web pages, which together contain trillions of URL links. So there is a need for an algorithm that surfaces the pages most relevant to a query.<br />-&gt; Need for a Distributed Environment<br />(Map-Reduce and Distributed Storage)<br /><ul><li> Trillions of links imply huge data storage requirements.</li></ul> (If each URL requires 0.5 KB, then we need over 400 TB just to store the URLs!) <br /><ul><li> A large data set implies large computations.</li></ul>Thus, we handle the above issues in our project by using a distributed cluster.<br />
  18. 18. Introduction<br />PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of &quot;measuring&quot; its relative importance within the set.<br />The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E).<br />
  19. 19. Algorithm<br />Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page&apos;s importance from the votes cast for it, and the importance of each vote is taken into account when a page&apos;s PageRank is calculated.<br />
  27. 27. The PageRank Equation<br />Simple Iterative Algorithm<br />For the kth iteration, the PageRank of the ith page is given by:<br />Here,<br />
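A minimal reconstruction of the simple iterative formula, assuming it is the undamped form of the vote-share sum used on the later slides. The symbols B(i) (the set of pages linking to page i) and C(j) (the number of outbound links on page j) are notation assumed here, chosen to match the C(A) used in the Brin and Page formula:

```latex
PR_k(i) = \sum_{j \in B(i)} \frac{PR_{k-1}(j)}{C(j)}
```

Under this reading, each page j splits its rank from iteration k-1 equally among its C(j) outbound links, and page i collects the shares sent to it.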
  28. 28. The PageRank Equation(Issues and Enhancement)<br />Problems:<br /><ul><li> Rank Sinks or Dangling Pages
  29. 29. Cycles</li></ul>Solution:<br />
  30. 30. PageRank Equation (Enhancement)<br />Solution for cycles, and for when a random surfer gets bored:<br />Here ‘d’ is known as the damping factor. It represents the probability, at any step, that the person will continue surfing. The value of ‘d’ is typically kept at 0.85.<br />
  31. 31. PageRank Equation (finally)<br />
  32. 32. In other words<br />In a simpler way: <br />a page&apos;s PageRank = 0.15/N + 0.85 * (a &quot;share&quot; of the PageRank of every page that links to it) <br />&quot;share&quot; = the linking page&apos;s PageRank divided by the number of outbound links on that page, <br />and N = the number of documents in the collection.<br />The equation shows clearly how a page&apos;s PageRank is arrived at. But what isn&apos;t immediately obvious is that it can&apos;t work if the calculation is done just once. <br />
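The sentence form of the equation can be sketched as a single update for one page. This is an illustrative sketch only: the function name `pagerank_of` and the 3-page graph are made up for the example, and d = 0.85 is the typical value mentioned on the slides.

```python
def pagerank_of(page, ranks, out_links, d=0.85):
    """Apply PR = (1 - d)/N + d * (sum of shares from linking pages) once."""
    n = len(ranks)
    share_sum = sum(
        ranks[src] / len(targets)          # the linking page's "share"
        for src, targets in out_links.items()
        if page in targets
    )
    return (1 - d) / n + d * share_sum

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / 3 for page in out_links}   # initial guess of 1/N each

new_c = pagerank_of("C", ranks, out_links)    # 0.05 + 0.85 * (1/6 + 1/3)
```

A single pass gives 0.475 for C, but that value was computed from the initial guesses for A and B, which is why the calculation has to be repeated.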
  33. 33. PageRank Equation, as per the published paper: “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page <br />We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. <br />The PageRank of a page A is given as follows: <br />PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) <br />-&gt;Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.<br />
  34. 34. Issues in the Original Formula<br />The formula given in Page and Brin&apos;s paper does not support the statement that &quot;the sum of all PageRanks is one&quot;.<br />Hence, to support the statement, the formula is modified as:<br /> PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))<br />where N = the number of documents in the collection<br />
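A small numerical check can make the difference between the two formulas concrete. The sketch below is an illustration under stated assumptions: the function name `iterate` and the 3-page graph are hypothetical, and every page is assumed to have at least one outbound link.

```python
def iterate(out_links, d=0.85, normalized=True, rounds=100):
    """Repeat the PageRank update with either the (1-d)/N or the (1-d) term."""
    n = len(out_links)
    ranks = {p: (1 / n if normalized else 1.0) for p in out_links}
    base = (1 - d) / n if normalized else (1 - d)
    for _ in range(rounds):
        new = {p: base for p in out_links}
        for src, targets in out_links.items():
            for t in targets:
                new[t] += d * ranks[src] / len(targets)
        ranks = new
    return ranks

# Hypothetical graph with no dangling pages: A -> B, A -> C, B -> C, C -> A
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

total_modified = sum(iterate(out_links, normalized=True).values())
total_original = sum(iterate(out_links, normalized=False).values())
```

On this graph the modified formula keeps the total at 1, while the original formula settles at a total of N (here 3), so its values are not a probability distribution without rescaling.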
  42. 42. Brief Description of Project<br />Input: <br />A data set containing multiple records, where each record contains the URL of a page (FromUrl) followed by the URL of a page to which it points (ToUrl).<br />Wiki_Votes.txt<br />ToUrl<br />FromUrl<br />
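Reading such records could look like the sketch below. It assumes one whitespace-separated `FromUrl ToUrl` pair per line and treats lines starting with `#` as comments; both are assumptions about the file layout, and `parse_edges` is a hypothetical helper name.

```python
def parse_edges(text):
    """Parse 'FromUrl ToUrl' records, one per line, into (from, to) pairs."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comment lines
            continue
        from_url, to_url = line.split()[:2]
        edges.append((from_url, to_url))
    return edges

# Made-up sample in the assumed layout
sample = """\
# FromUrl ToUrl
index.html 1.html
index.html 2.html
1.html 2.html
"""
edges = parse_edges(sample)
```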
  43. 43. Brief Description of Project (Contd.)<br />Output:<br />The output file consists of records containing the URL of the page (FromUrl), the PageRank value of the page (PRValue), and the list of URLs to which the page points (ToUrlList).<br />FinalOutput.txt<br />ToUrlList<br />fromUrl<br />PRValue<br />
  44. 44. Brief Description of Project: Modules<br />Module1: Converter<br />Module2: PageRank Calculator<br />Module3: Output Analyzer<br />[Figure: diagram with the labels Web Graph, Converter, PageRank Calculator (iterate until convergence), Output Analyzer, Create Index, Search Engine]<br />
  52. 52. Module1: Converter Input-Output<br />Converter (initializing with PR = 1/N)<br />FromUrl PRValue List:<br />
  53. 53. Module1: Converter Issues<br />Self Loops:<br /> - handled by checking the FromUrl against the ToUrl before sending it to the reduce function<br /> Dangling Pages:<br /> - handled by initializing their PRValue with 1/N and leaving the list of ToUrls blank.<br />
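The Converter's behavior can be sketched in memory as follows. This is a single-process stand-in for the map and reduce phases, not the project's Hadoop code; `convert` and the tiny edge list are hypothetical.

```python
def convert(edges):
    """Build {url: (pr_value, to_url_list)} with every page starting at 1/N."""
    out_links = {}
    for from_url, to_url in edges:
        out_links.setdefault(from_url, [])
        out_links.setdefault(to_url, [])    # dangling pages keep an empty list
        if from_url != to_url:               # drop self loops
            out_links[from_url].append(to_url)
    n = len(out_links)
    return {url: (1 / n, targets) for url, targets in out_links.items()}

# Made-up edges: A has a self loop, C is a dangling page
edges = [("A", "B"), ("A", "A"), ("B", "C")]
graph = convert(edges)
```

Here `graph["A"]` keeps only the link to B, and `graph["C"]` gets the initial value 1/3 with an empty list.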
  61. 61. Module2: PageRank Calculator Input-Output<br />PageRank Calculator (User can give Precision)<br />
  62. 62. Module2: PageRank Calculator<br />Map:<br />Input:<br />index.html PRValue OutList:<br /> &lt; 1.html 2.html... &gt;<br />Output:<br /> 1. Output for each outlink:<br />key: “1.html”<br />value: PRValue / ListLength<br /> (Vote Share)<br /> 2. The page itself:<br /> key: index.html<br />value: &lt;OutList&gt;<br />Reduce:<br />Input: <br />Key: “1.html”<br />Value: 0.5 23<br />Value: 0.24 2<br />…<br />Value: UrlList &lt;OutLink&gt;<br />Output:<br />Key: “1.html”<br />Value: “&lt;new pagerank&gt; <br /> &lt;OutList&gt; 1.html 2.html...”<br />Start with the initial PageRank and outlinks of a document.<br />
  63. 63. Module2: PageRank Calculator<br />For each outlink, output the share of PageRank it receives from the linking page, along with the page&apos;s list of outlinks.<br />
  64. 64. Module2: PageRank Calculator<br />Now the reducer has the URL of a document, the PageRank shares from all the inlinks to that document, and its list of outlinks.<br />
  65. 65. Module2: PageRank Calculator<br />Compute the new PageRank and output in the same format as the input.<br />
  66. 66. Module2: PageRank Calculator<br />Now iterate until convergence (determined by the precision value).<br />
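The map and reduce steps can be sketched as plain Python functions, with a dictionary standing in for the shuffle phase. This is a single-process illustration rather than the project's Hadoop code: the `"share"`/`"links"` tags used to tell the two value types apart are an assumption, as are the function names and the sample graph. The update uses the (1-d)/N form of the equation.

```python
from collections import defaultdict

def map_page(url, pr_value, out_list):
    """Emit a vote share for each outlink, plus the page's own outlink list."""
    for target in out_list:
        yield target, ("share", pr_value / len(out_list))
    yield url, ("links", out_list)

def reduce_page(url, values, n, d=0.85):
    """Sum the incoming shares and re-attach the outlink list."""
    total, out_list = 0.0, []
    for kind, payload in values:
        if kind == "share":
            total += payload
        else:
            out_list = payload
    return url, (1 - d) / n + d * total, out_list

def run_iteration(graph, d=0.85):
    """One map/shuffle/reduce pass over {url: (pr_value, out_list)}."""
    grouped = defaultdict(list)               # stands in for the shuffle phase
    for url, (pr_value, out_list) in graph.items():
        for key, value in map_page(url, pr_value, out_list):
            grouped[key].append(value)
    n = len(graph)
    result = {}
    for url, values in grouped.items():
        _, new_rank, out_list = reduce_page(url, values, n, d)
        result[url] = (new_rank, out_list)
    return result

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
after_one = run_iteration(graph)
```

Because the output has the same shape as the input, `run_iteration` can be applied to its own result for the next pass.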
  67. 67. Module2: PageRank Calculator Issues: Catch-22 Situation<br />Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:<br />Step 1: Calculate page A&apos;s PageRank from the value of its inbound links<br />Step 2: Calculate page B&apos;s PageRank from the value of its inbound links<br /> We can&apos;t work out A&apos;s PageRank until we know B&apos;s PageRank, and we can&apos;t work out B&apos;s PageRank until we know A&apos;s PageRank. Thus a single calculation of the PageRanks of A and B will be inaccurate.<br />
  68. 68. Module2: PageRank Calculator Issues: Catch-22 Situation (Solution)<br />This problem is overcome by repeating the calculations many times. Each pass produces slightly more accurate values. In fact, total accuracy can never be achieved because each calculation is based on the inexact values from the previous pass.<br />The number of iterations should be sufficient to reach a point where further iterations wouldn&apos;t change the values enough to matter.<br />=&gt; Use a “delta function” that keeps track of the changes in the PageRank of all the pages; if the change for every page is less than the value specified by the user, the iterations can be stopped.<br />
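The stopping rule can be sketched as a loop that compares consecutive passes. This is an in-memory sketch under stated assumptions: the "delta function" is taken to be the largest per-page change, `precision` plays the role of the user-specified value, and the names `step`, `pagerank`, and `max_iters` are hypothetical.

```python
def step(ranks, out_links, d=0.85):
    """One PageRank pass using the (1-d)/N form of the equation."""
    n = len(ranks)
    new = {p: (1 - d) / n for p in ranks}
    for src, targets in out_links.items():
        for t in targets:
            new[t] += d * ranks[src] / len(targets)
    return new

def pagerank(out_links, precision=1e-6, d=0.85, max_iters=1000):
    """Iterate until every page changes by less than `precision`."""
    ranks = {p: 1 / len(out_links) for p in out_links}
    for iteration in range(1, max_iters + 1):
        new = step(ranks, out_links, d)
        delta = max(abs(new[p] - ranks[p]) for p in ranks)   # largest change
        ranks = new
        if delta < precision:
            break
    return ranks, iteration

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
ranks, iterations = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

A smaller `precision` costs more iterations; `max_iters` is only a safety cap for the sketch.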
  76. 76. Module 3: Output Analyzer Input-Output<br />Input<br />Analyzer (if the user wants the Top 3)<br />Output<br />
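Selecting the top entries can be sketched with the standard library. The PageRank values below are made up for illustration, and `top_k` is a hypothetical helper, not the project's Output Analyzer.

```python
import heapq

def top_k(ranked, k=3):
    """Return the k (url, pr_value) pairs with the highest PageRank."""
    return heapq.nlargest(k, ranked.items(), key=lambda item: item[1])

# Made-up PageRank values for four pages
ranked = {"A": 0.35, "B": 0.20, "C": 0.37, "D": 0.08}
top3 = top_k(ranked, k=3)
```

For these values the result lists C first, then A, then B.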
  85. 85. Applications and Extensions<br />A simple model of a Search Engine. (Implemented)<br /> The application utilizes: <br />The PageRank calculated by the PageRank Calculator<br />The output generated by a map-reduce module that finds the number of times a pattern (as per the user’s query) matches in each of the files present in the data set.<br />And outputs:<br /> The list of pages relevant to the query, in order of their importance.<br />(DEMO)<br />
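Combining the two inputs could look like the sketch below. It assumes a page counts as relevant when its match count is above zero and that importance means PageRank alone; both are assumptions about the demo, and `search` plus the sample values are hypothetical.

```python
def search(match_counts, ranks):
    """Pages with at least one match, ordered by PageRank (highest first)."""
    hits = [url for url, count in match_counts.items() if count > 0]
    return sorted(hits, key=lambda url: ranks.get(url, 0.0), reverse=True)

# Made-up pattern-match counts and PageRank values
match_counts = {"A": 2, "B": 0, "C": 5, "D": 1}
ranks = {"A": 0.35, "B": 0.20, "C": 0.37, "D": 0.08}
results = search(match_counts, ranks)
```

Page B is dropped because the pattern never matches in it, and the remaining pages come back as C, A, D.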
  86. 86. Applications and Extensions<br />Other Applications:<br /><ul><li>PageRank-based mechanism to rank knowledge items used in E-Learning.
  87. 87. GeneRank (based on PageRank) ranks the genes analyzed in the microarray to see the relationship between the cell’s function and gene expression.
  88. 88. Can be used to sort the items present in the side menu in various blogs and sites depending on their importance.</li></li></ul><li>References<br />http://infolab.stanford.edu/pub/papers/google.pdf<br /> ( research paper by Brin and Page)<br />http://www.ams.org/featurecolumn/archive/pagerank.html<br />http://en.wikipedia.org/wiki/PageRank<br />http://www.webworkshop.net/pagerank.html#how_is_pagerank_calculated<br />
  89. 89. Questions<br />
  90. 90. Thank You<br />
