1.
PageRank and Hyperlink-Induced Topic Search in Web Structure Mining
Presented by Priyabrata Satapathy
2.
Plan of My Work (I)
Learn the basics of:
- Web structure
- Hubs
- Authorities
- Link analysis
- PageRank
- HITS
Anand Bihari
3.
Plan of My Work (II)
- Literature survey on PageRank and HITS in Web structure mining.
- Defining the problem (PageRank and HITS).
- Proposing and designing a new algorithm for computing the PageRank of a web page.
- Simulation and performance analysis of the proposed algorithm.
4.
Outline
- Introduction
- Basic Concepts of Web Structure
- Hub and Authority
- PageRank
- HITS
- Conclusion
- Future Work
- References
5.
Introduction
The World Wide Web is a global information system distributed across numerous Web sites around the world. Web servers can potentially host millions of pages, which makes the number of web pages extremely difficult to track. The Web resembles a network of thousands of interconnected, intertwined cells organized in a complex structure.
Each Web site contains a number of Web pages. A page has three parts:
- the body of the page,
- the hypertext markup language of the page, and
- the hyperlinks between Web pages.
6.
Web Mining
Web mining can generally be divided into three categories:
- Web content mining
- Web structure mining
- Web usage mining
7.
Web Structure Mining
Web structure mining is mainly hyperlink analysis: by analyzing the links between pages, it studies how pages reference one another in order to find useful patterns and improve search quality. Structure mining models a site as a link graph leading from one page to another.
8.
Simple Web Link Graph
[Figure: link graph connecting Page A, Page B, Page C and Page D]
9.
Hub
A hub is a page with many out-links.
Authority
An authority is a page with many in-links.
10.
Hubs and Authorities on the Internet
[Figure: hubs and authorities]
Authorities and hubs have a mutually reinforcing relationship:
- A good hub increases the authority weight of the pages it points to.
- A good authority increases the hub weight of the pages that point to it.
11.
Link Analysis
There are two famous link analysis methods:
1. PageRank algorithm
2. HITS algorithm
12.
PageRank
- The heart of Google's searching software is PageRank, a system for ranking web pages developed by Larry Page and Sergey Brin at Stanford University in 1996.
- It is based on the idea of a "random surfer".
- PageRank is a static ranking of Web pages.
- PageRank is based on the measure of prestige in social networks: the PageRank value of each page can be regarded as its prestige.
13.
PageRank
From the perspective of prestige, the PageRank algorithm is derived from the following observations:
- A hyperlink from one page to another is an implicit conveyance of authority to the target page. Thus, the more in-links a page i receives, the more prestige page i has.
- Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i.
In other words, a page is important if it is pointed to by other important pages.
14.
PageRank
- In-links of page i: the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.
- Out-links of page i: the hyperlinks that point from page i to other pages. Usually, links to pages of the same site are not considered.
[Figure: pages A and B, Website 1 and Website 2]
15.
PageRank Algorithm
The PageRank of a web page is calculated from the sum of the PageRanks of all pages linking to it (its incoming links), each divided by the number of outgoing links on that linking page:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where:
- PR(A) is the PageRank of page A,
- PR(Ti) is the PageRank of a page Ti which links to page A,
- C(Ti) is the number of outbound links on page Ti,
- d is a damping factor which can be set between 0 and 1. It models the probability that the random surfer keeps clicking on links and is usually set to 0.85,
- n is the number of in-links of page A.
The PageRank algorithm does not rank a whole website; the value is determined for each page individually. Furthermore, the PageRank of page A is defined recursively by the PageRanks of the pages which link to page A.
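The definition above can be sketched for a single page as follows. This is a minimal illustration, not the author's implementation: the function name `pagerank_of`, the dictionaries and the toy pages T1 and T2 are assumptions made up for the example.

```python
def pagerank_of(page, in_links, out_degree, pr, d=0.85):
    """Return (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to `page`."""
    incoming = sum(pr[t] / out_degree[t] for t in in_links[page])
    return (1 - d) + d * incoming


# Hypothetical toy data: T1 (two out-links) and T2 (one out-link) both link to A.
in_links = {"A": ["T1", "T2"]}
out_degree = {"T1": 2, "T2": 1}
current_pr = {"T1": 1.0, "T2": 1.0}

pr_a = pagerank_of("A", in_links, out_degree, current_pr)
print(f"PR(A) = {pr_a:.3f}")
```

Because PR(T1) and PR(T2) themselves depend on other pages, one such evaluation is only a single step; the values have to be solved for jointly or approximated iteratively.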
16.
The Characteristics of PageRank
[Figure: link graph of pages A, B, C and D]
Consider a small web consisting of four pages A, B, C and D, in which page A links to pages B, C and D, page B links to page C, page C links to page A, and page D links to page C. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5.
PR(A) = 0.5 + 0.5 ( PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
We get the following PageRank values for the single pages:
PR(A) = 12/10 = 1.2
PR(B) = 7/10 = 0.7
PR(C) = 14/10 = 1.4
PR(D) = 7/10 = 0.7
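The four equations above form a small linear system, so they can also be solved exactly. The sketch below is one way to check the values, assuming Python's standard `fractions` module; the helper `solve` is a generic Gauss-Jordan routine written for this illustration and is not part of the original slides.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


d = Fraction(1, 2)
# Unknowns ordered A, B, C, D; each row reads PR(page) - d * (incoming shares) = 1 - d.
matrix = [
    [1,      0,  -d,  0],   # PR(A) - d * PR(C)
    [-d / 3, 1,   0,  0],   # PR(B) - d * PR(A)/3
    [-d / 3, -d,  1, -d],   # PR(C) - d * (PR(A)/3 + PR(B) + PR(D))
    [-d / 3, 0,   0,  1],   # PR(D) - d * PR(A)/3
]
solution = solve(matrix, [1 - d] * 4)
for name, value in zip("ABCD", solution):
    print(f"PR({name}) = {value} = {float(value):.4f}")
```

Exact fractions avoid rounding, which makes it easy to compare the result with the fractions quoted on the slide.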
17.
The Iterative Computation of PageRank
The Google search engine uses an approximate, iterative computation of PageRank values. Each page is assigned an initial starting value, and the PageRank equations are then applied repeatedly. The iteration ends when the PageRank values no longer change, or change only slightly, from one iteration to the next.
18.
The Iterative Computation of PageRank: Algorithm
The general PageRank equation is
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Iteration algorithm:
Set PR ← [R1, R2, ..., Rn], where Ri is some initial rank of page i and n is the number of pages in the graph
d ← 0.5
i ← 1
Do
    PR_i(A) ← (1 - d) + d (PR_{i-1}(T1)/C(T1) + ... + PR_{i-1}(Tn)/C(Tn))
    k ← |PR_i(A) - PR_{i-1}(A)|
    i ← i + 1
While k > e, where e is a small number indicating the convergence threshold
Return PR
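The iteration algorithm above can be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions: the function name `iterate_pagerank`, the `max_iter` safety limit, and the choice to measure the change `k` as the largest change over all pages are additions for the example, not details given on the slide. The graph is the four-page example with d = 0.5.

```python
def iterate_pagerank(out_links, d=0.5, eps=1e-6, max_iter=1000):
    """Repeat the PageRank update until no page changes by more than `eps`."""
    pages = list(out_links)
    # Invert the out-link lists so each page knows which pages link to it.
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    pr = {p: 1.0 for p in pages}  # initial rank of every page
    for _ in range(max_iter):
        # Every new value is computed from the previous iteration's values.
        new_pr = {
            p: (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in in_links[p])
            for p in pages
        }
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < eps:  # assumed stopping rule: largest change below threshold
            break
    return pr


graph = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = iterate_pagerank(graph)
for page, value in ranks.items():
    print(f"PR({page}) = {value:.4f}")
```

Whether each update uses only values from the previous iteration (as here) or already-updated values from the current one changes the intermediate numbers but not the values the iteration converges to.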
19.
The Iterative Computation of PageRank (example)
Let the initial PageRank value of each page be 1.
Iteration  PR(A)   PR(B)   PR(C)   PR(D)
0          1       1       1       1
1          1       0.6667  1.3332  0.6667
2          1.1666  0.6944  1.3888  0.6944
3          1.1944  0.6990  1.3980  0.6990
4          1.1990  0.6998  1.3996  0.6998
5          1.1998  0.6999  1.3998  0.6999
6          1.1999  0.6999  1.3998  0.6999
7          1.1999  0.6999  1.3998  0.6999
The sum of all pages' PageRanks converges to the total number of web pages, so the average PageRank of a web page is 1.
20.
Effects of Inbound Links (I)
Each additional inbound link for a web page always increases that page's PageRank. One might assume that an additional inbound link from page X increases the PageRank of page A by d × PR(X) / C(X).
[Figure: the four-page graph with an additional page X linking to page A]
PR(A) = 0.5 + 0.5 ( PR(X) + PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
21.
Effects of Inbound Links (II)
Let PR(X) = 10. We get the following PageRank values for the single pages:
PR(A) = 36/5 = 7.2
PR(B) = 17/10 = 1.7
PR(C) = 17/5 = 3.4
PR(D) = 17/10 = 1.7
The initial effect of the additional inbound link on page A is d × PR(X) / C(X) = 0.5 × 10 / 1 = 5. Compared with PR(A) = 1.2 before the link was added, page A gains 6: part of the added PageRank is passed on through the links A → B, C, D and flows back to A via page C. Hence page A benefits even more than this initial effect from its additional inbound link.
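The effect described above can be checked numerically. The sketch below iterates the four equations with PR(X) held fixed and compares PR(A) with and without the link from X. The function name `solve_with_x`, the fixed number of sweeps, and the trick of modelling "no link from X" as PR(X) = 0 are assumptions made for this illustration.

```python
def solve_with_x(pr_x, d=0.5, sweeps=200):
    """Iterate the four-page equations with PR(X) held fixed (X has one out-link, to A)."""
    a = b = c = dd = 1.0
    for _ in range(sweeps):
        # The right-hand side is evaluated with the previous values of a, b, c, dd.
        a, b, c, dd = (
            (1 - d) + d * (pr_x + c),
            (1 - d) + d * (a / 3),
            (1 - d) + d * (a / 3 + b + dd),
            (1 - d) + d * (a / 3),
        )
    return a, b, c, dd


without_x = solve_with_x(0.0)   # PR(X) = 0 contributes nothing, i.e. no link from X
with_x = solve_with_x(10.0)     # PR(X) = 10 as on the slide

initial_effect = 0.5 * 10.0 / 1          # d * PR(X) / C(X)
benefit = with_x[0] - without_x[0]       # actual gain of page A
print(f"initial effect = {initial_effect}, actual gain of PR(A) = {benefit:.4f}")
```

Comparing `benefit` with `initial_effect` shows how much of the PageRank contributed by X is fed back to page A through the site's own links.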
22.
Effect of Outbound Links (I)
Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links also have some impact. In this graph, page B has an additional outbound link, to page D. The PageRank equations then become:
[Figure: the four-page graph with an additional link from page B to page D]
PR(A) = 0.5 + 0.5 ( PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 )
23.
Effect of Outbound Links (II)
We get the following PageRank values for the single pages:
PR(A) = 31/27 ≈ 1.1481
PR(B) = 56/81 ≈ 0.6914
PR(C) = 35/27 ≈ 1.2963
PR(D) = 70/81 ≈ 0.8642
The total PageRank of all pages is still 4. Hence, adding a link has no effect on the total PageRank of the web. The PageRank of page D increases (from 0.7), while the PageRanks of pages A and C decrease (from 1.2 and 1.4); page B also loses slightly.
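The claims above (unchanged total, D gaining, A and C losing) can be checked with a short sketch. It assumes the four-page graph with d = 0.5 and simply applies the PageRank update a fixed number of times; the function name `pagerank` and the number of sweeps are choices made for this illustration.

```python
def pagerank(out_links, d=0.5, sweeps=200):
    """Apply the PageRank update `sweeps` times, starting from rank 1 for every page."""
    pages = list(out_links)
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(sweeps):
        # The comprehension reads the old `pr` and only then rebinds the name.
        pr = {
            p: (1 - d) + d * sum(
                pr[q] / len(out_links[q]) for q in pages if p in out_links[q]
            )
            for p in pages
        }
    return pr


before = pagerank({"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]})
after = pagerank({"A": ["B", "C", "D"], "B": ["C", "D"], "C": ["A"], "D": ["C"]})

total_before = sum(before.values())
total_after = sum(after.values())
print(f"total before = {total_before:.4f}, total after = {total_after:.4f}")
for page in before:
    print(f"PR({page}): {before[page]:.4f} -> {after[page]:.4f}")
```

Only the out-link list of page B differs between the two graphs, so any change in the printed values is due to that single extra link.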
24.
The Effect of the Number of Pages
Since the sum of all PageRanks equals the total number of web pages, an additional page increases the summed PageRank of all pages on the web by one.
25.
How to Increase the PageRank of a Website
- Add new pages to your website (as many as you can).
- Swap links with websites which have a high PageRank value.
- Raise the number of inbound links (advertise your website on other sites).
- etc.
26.
HITS
- HITS stands for Hyperlink-Induced Topic Search.
- It was developed by Jon Kleinberg.
- HITS is search-query dependent. When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.
- It uses hubs and authorities to define a recursive relationship between web pages.
27.
HITS Algorithm (I)
HITS depends on the query words. First, HITS invokes a traditional search engine to get a set of pages related to the query, and then expands that set with the pages that point to them or are pointed to by them. After that, HITS tries to find the top hubs and authorities by iterative calculation. All of the processing is done online. R is the root set returned by the query, and S is the base set that covers all linked pages.
28.
HITS Algorithm (II)
Let the authority score of page p be a(p) and its hub score be h(p). The mutual reinforcement relationship of the two scores is represented as follows:
$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
Here q → p means that there is a hyperlink from page q pointing to page p. The calculation is iterated until the results converge; the final output of the HITS algorithm is a set of pages with large hub weights and a set of pages with large authority weights.
29.
HITS Algorithm (III)
Let A be the adjacency matrix of the link graph over the base set S, and denote the authority weight vector by a and the hub weight vector by h, where
$$a = (a_1, a_2, \dots, a_n)^T, \qquad h = (h_1, h_2, \dots, h_n)^T$$
Then
$$a = A^T h, \qquad h = A a$$
The computation of authority scores and hub scores is basically the same as the computation of the PageRank scores using the iteration method. If we use a_k and h_k to denote the authority and hub scores at the k-th iteration, the iterative processes for generating the final solutions are
$$a_k = A^T A \, a_{k-1}, \qquad h_k = A A^T h_{k-1}$$
starting with
$$a_0 = h_0 = (1, 1, \dots, 1)^T$$
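The iteration above can be sketched in plain Python as follows. Two things here are assumptions not stated on the slide: the scores are rescaled after every step so that each vector sums to 1 (without some normalization the raw values keep growing), and the three-page graph at the end is a hypothetical example. The function name `hits` is likewise chosen for the illustration.

```python
def hits(adj, iterations=50):
    """Run HITS on an adjacency matrix where adj[q][p] == 1 means q links to p."""
    n = len(adj)
    a = [1.0] * n  # authority scores, a_0 = (1, ..., 1)
    h = [1.0] * n  # hub scores, h_0 = (1, ..., 1)
    for _ in range(iterations):
        # a = A^T h: a page's authority is the sum of hub scores of pages linking to it.
        a = [sum(adj[q][p] * h[q] for q in range(n)) for p in range(n)]
        # h = A a: a page's hub score is the sum of authority scores of pages it links to.
        h = [sum(adj[p][q] * a[q] for q in range(n)) for p in range(n)]
        # Assumed normalization: rescale each vector to sum to 1.
        total_a, total_h = sum(a), sum(h)
        a = [value / total_a for value in a]
        h = [value / total_h for value in h]
    return a, h


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
example = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority, hub = hits(example)
print("authority:", [round(v, 4) for v in authority])
print("hub:      ", [round(v, 4) for v in hub])
```

Computing h from the freshly updated a in each round is what makes the combined step equal to multiplying by A^T A (for a) and A A^T (for h).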
30.
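HITS Example
[Figure: the four-page graph of pages A, B, C and D]
With rows and columns ordered A, B, C, D, the adjacency matrix of the graph and its transpose are
$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
Assume the initial hub and authority weights are
$$h = (1, 1, 1, 1)^T, \qquad a = (1, 1, 1, 1)^T$$
Using these initial vectors, we compute the authority and hub weight vectors:
$$a = A^T h = (1, 1, 3, 1)^T, \qquad h = A a = (3, 1, 1, 1)^T$$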
31.
HITS Example (cont.)
Hub weight: Page A = 3, Page B = 1, Page C = 1, Page D = 1.
Authority weight: Page A = 1, Page B = 1, Page C = 3, Page D = 1.
Hence, after this first step from all-ones initial weights, the hub weight of a page is the number of pages it links to, and the authority weight of a page is the number of pages that link to it.
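The first step of this example is easy to reproduce. The sketch below, in plain Python with two small helper functions written for the illustration (`mat_vec` and `transpose` are not from the slides), multiplies the adjacency matrix and its transpose by the all-ones vector.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


# Rows and columns ordered A, B, C, D; entry [i][j] == 1 means page i links to page j.
A = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
ones = [1, 1, 1, 1]

authority = mat_vec(transpose(A), ones)  # column sums of A: in-link counts
hub = mat_vec(A, ones)                   # row sums of A: out-link counts
print("authority:", authority)
print("hub:      ", hub)
```

Multiplying by the all-ones vector just sums rows or columns, which is why the first step yields link counts; later iterations weight each link by the current score of the page at its other end.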
32.
Conclusion
- Studied the basic concepts of hyperlink analysis.
- Studied the PageRank technique.
- Studied the HITS technique.
33.
Future Work
- Study hyperlink analysis techniques.
- Literature survey on hyperlink analysis and other related topics.
- Defining the problem in PageRank and HITS.
- Proposing a new algorithm or improving the PageRank and HITS algorithms.
- Simulation and performance analysis of the proposed model.
34.
Future Literature Survey
Title | Journal/Conference | Publication Year
Mining web informative structures and contents based on entropy analysis | IEEE Transactions on Knowledge and Data Engineering | 2004
Wisdom: web intra page informative structure mining based on document object model | IEEE Transactions on Knowledge and Data Engineering | 2005
Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining | 2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation | 2010
Design and implementation of a web structure mining algorithm using breadth first search strategy for academic search application | International Conference on Internet Technology and Secured Transactions | 2011
35.
References
- Bing Liu, "Web Data Mining", Springer International Edition.
- IEEE conference paper, "Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining".
- Websites: Google, Wikipedia, http://pr.efactory.de/, www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html