1. HITS + PageRank. Jens Noschinski, Thomas Honné, Kersten Schuster, Andreas Schäfer. WS 2010/2011. The slides are licensed under the Creative Commons Attribution-ShareAlike 3.0 License. Web Technologies – Prof. Dr. Ulrik Schroeder
3. Motivation. Problem: searching for information on the web. A query returns more than 1 million results, but only the first 10-20 results are relevant. How do search engines decide which sites are important? What else needs to be considered?
4. Motivation. Fast and efficient: many requests at the same time; a very large set of web pages (more than 1,000,000,000,000 in July 2008). Up-to-date results: recent changes are reflected. Availability: of the search engine itself; of indexed pages that can be searched (cache). Resistance against manipulation: search result manipulation; spam.
6. HITS. HITS = Hyperlink-Induced Topic Search. Introduced in 1997 by Jon Kleinberg. For broad-topic information discovery: pick out a few relevant sources. Identify authoritative web pages: those most central regarding a certain topic. Question: When can a page be considered authoritative?
7. Hubs and Authorities. Two distinct types of pages. Authorities: highly referenced pages, considered authoritative. Hubs: pages that point to many authorities; the points from which authority is conferred. Mutually reinforcing relationship: a good hub points to many good authorities; a good authority is pointed to by many good hubs. [Figure: a hub pointing to authorities]
8. Root Set and Base Set. First step of HITS processing. Assemble the root set S of pages: execute a user-supplied query; use a full-text search engine. Expand to the base set T: add pages that point to any page in S; add pages that are pointed to by any page in S. Restrictions: the set of pages pointing to an authority can be enormous, so consider a fixed-size random subset; page links can be internal links for site navigation, so exclude links between pages on the same host.
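The expansion and the two restrictions above can be sketched in Python. This is a minimal sketch, not the deck's implementation: the link maps `out_links` and `in_links`, the cap `max_in`, and both function names are assumptions made for illustration.

```python
import random
from urllib.parse import urlsplit


def build_base_set(root_set, out_links, in_links, max_in=50, seed=0):
    """Expand a root set S into a base set T.

    out_links / in_links map a page URL to the pages it links to /
    is linked from. Both maps and max_in are assumptions of this sketch.
    """
    rng = random.Random(seed)
    base = set(root_set)
    for page in root_set:
        # Add every page that `page` points to.
        base.update(out_links.get(page, ()))
        # Add pages pointing to `page`, capped at a fixed-size random subset.
        sources = sorted(in_links.get(page, ()))
        if len(sources) > max_in:
            sources = rng.sample(sources, max_in)
        base.update(sources)
    return base


def cross_host_edges(base, out_links):
    """Keep only links inside the base set that connect different hosts."""
    def host(url):
        return urlsplit(url).netloc

    return {(p, q)
            for p in base
            for q in out_links.get(p, ())
            if q in base and host(p) != host(q)}
```

The returned edge set is what the weight computation on the following slides would operate on.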
10. Hub Weight and Authority Weight. Weights associated with each page p: hub weight h(p) and authority weight a(p), both initialized to 1. Calculation: a(p) is the sum of the hub weights of all pages pointing to p; h(p) is the sum of the authority weights of all pages pointed to by p. “p -> q” means that page p has a hyperlink to page q.
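Written as formulas, the two update rules described above read as follows, where q ranges over the pages of the base set:

```latex
a(p) = \sum_{q \,:\, q \to p} h(q),
\qquad
h(p) = \sum_{q \,:\, p \to q} a(q)
```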
11. Further Processing. Repeat the whole update operation k times: ongoing updates give no exact final result for the weights, but the weights converge to certain values over time; k = 20 has been shown to deliver good convergence. Normalize the weights: this prevents the values from getting too large; normalize after each iteration.
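A minimal Python sketch of the repeated update with normalization might look like this. The slide does not say which norm is used; scaling so that the squared weights sum to 1 is an assumption here, as are the function names and the edge-set representation.

```python
import math


def normalize(weights):
    """Scale weights so their squares sum to 1 (assumed choice of norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(edges, k=20):
    """Run k rounds of the HITS update on a set of (p, q) links p -> q."""
    pages = {p for edge in edges for p in edge}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # a(p): sum of hub weights of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for p, q in edges:
            new_auth[q] += hub[p]
        # h(p): sum of the updated authority weights of pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize after each iteration to keep the values bounded.
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

For example, `hits({("h1", "a1"), ("h1", "a2"), ("h2", "a1")})` gives `a1` the highest authority weight and `h1` the highest hub weight.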
12. Output. Only a few pages from the base set are relevant: output the n pages with the highest authority weights; output the n pages with the highest hub weights; n = 10 is reasonable. These are the final search results.
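Selecting the output could be sketched as below, assuming the weights are held in a dictionary from page to weight (the helper name `top_n` is hypothetical):

```python
def top_n(weights, n=10):
    """Return the n pages with the highest weight, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]
```

Calling it once with the authority weights and once with the hub weights yields the two result lists.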
13. Drawbacks. No anti-spam capability: link farms can boost hub scores. Topic drift: not all linked pages are thematically related. Minor link changes can cause large result changes. Query-dependent: the algorithm is executed for every single search query. Queries are time-consuming: computation of root and base set; calculation of hub and authority weights.
15. Background on PageRank. Published in 1998; developed and patented at Stanford University by, among others, the Google founders Larry Page and Sergey Brin; exclusively licensed to Google. Differences from other search technologies: pages are not only ranked by content; new ranking criteria based on the link structure; harder to manipulate.
16. Main idea. Each website has a numeric value called PageRank or prestige. PageRank computation is based on in- and outlinks. [Figure: example link graph with pages A, B, C, D]
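17. PageRank Algorithm. A surfer follows an outlink of page x with probability p_x. Therefore the PageRank of a page is the sum, over all pages linking to it, of their PageRank weighted by this probability. Resulting equation system: [Figure: equation system for the example graph with pages A, B, C, D]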
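A common way to write this rule is shown below. It assumes that the surfer picks each outlink of a page with equal probability, so p_y is one over the number of outlinks of y; the notation PR(x) is chosen here for illustration.

```latex
PR(x) = \sum_{y \,:\, y \to x} p_y \, PR(y),
\qquad
p_y = \frac{1}{\text{number of outlinks of } y}
```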
18. PageRank Algorithm. Other scores can be reached by multiplying all values by the same factor. [Figure: example graph with the values A=4, B=2, C=5, D=8]
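The reason is that the rank equations are linear. Assuming the undamped rule PR(x) = Σ p_y · PR(y) over all pages y linking to x, multiplying a solution by any factor c > 0 gives another solution:

```latex
c \cdot PR(x) = c \sum_{y \,:\, y \to x} p_y \, PR(y)
             = \sum_{y \,:\, y \to x} p_y \, \bigl(c \cdot PR(y)\bigr)
```

With c = 2, for instance, the values A=4, B=2, C=5, D=8 become A=8, B=4, C=10, D=16.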
19. Problems of the algorithm. Rank sink: after some iterations, A and B will have a PageRank of 0. Solution: Random Surfer. [Figure: example graph with pages A, B, C, D]
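The effect can be reproduced with a small Python sketch. The graph below is hypothetical (the slide's own graph is not reproduced here): C and D link only to each other, so any rank that reaches them never flows back to A and B.

```python
def iterate_simple_pagerank(out_links, steps):
    """Apply the undamped PageRank update `steps` times."""
    rank = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(steps):
        new_rank = {p: 0.0 for p in out_links}
        for p, targets in out_links.items():
            # Each outlink of p receives an equal share of p's rank.
            for q in targets:
                new_rank[q] += rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph with a rank sink formed by C and D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
rank = iterate_simple_pagerank(graph, steps=100)
```

In this graph the rank of A and B halves every two iterations, so after 100 steps it is numerically indistinguishable from 0, while C and D together hold all of the rank.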
20. Random Surfer. Idea: simulate real surfing behavior. A real surfer may “teleport” to another website (back button, bookmark, ...). The “damping factor” d is the probability of following a regular outlink.
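One common formulation of this idea is sketched below in Python. Several points are assumptions of the sketch rather than statements from the slides: the teleport target is chosen uniformly among all pages, every page has at least one outlink, and d = 0.85 is used only as a typical illustrative value.

```python
def pagerank(out_links, d=0.85, steps=50):
    """Damped PageRank by repeated updating; d is the damping factor."""
    n = len(out_links)
    rank = {p: 1.0 / n for p in out_links}
    for _ in range(steps):
        # With probability 1 - d the surfer teleports to a random page.
        new_rank = {p: (1.0 - d) / n for p in out_links}
        for p, targets in out_links.items():
            # With probability d the surfer follows one of p's outlinks.
            for q in targets:
                new_rank[q] += d * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph in which C and D would otherwise absorb all rank.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
damped = pagerank(graph)
```

Because of the teleport term, every page keeps a rank of at least (1 - d) / n, so A and B no longer drop to 0.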
25. Properties. Strengths: pre-computable; fast; spam-resistant; minor changes have minor effects. Weaknesses: pages are only authoritative in general and not on the query topic; link farms; Google bombs.
26. Summary. HITS: the algorithm is executed after a query is made; pages get a hub value and an authority value; it calculates whether a page provides good information and/or whether it links to pages that do so; no spam-fighting ability. PageRank: each page gets one PageRank that declares its value; query-independent; spam-resistant.
27. Sources. Papers about PageRank: Larry Page et al.: “The PageRank Citation Ranking: Bringing Order to the Web”; Ulrik Brandes, Gabi Dorfmüller: 10. Algorithmus der Woche der RWTH Aachen, 09/05/2006; Peter J. Zehetner: “Der PageRank-Algorithmus”, 05/2007; Taher H. Haveliwala: “Efficient Computation of PageRank”, 10/1999. Papers about HITS: Jon Kleinberg: “Authoritative sources in a hyperlinked environment”; Jon Kleinberg et al.: “Inferring Web communities from link topology”. Book: Bing Liu: “Web Data Mining”, 2008.