HITS + PageRank
Jens Noschinski, Thomas Honné, Kersten Schuster, Andreas Schäfer
WS 2010/2011
Web Technologies – Prof. Dr. Ulrik Schroeder
The slides are licensed under the Creative Commons Attribution-ShareAlike 3.0 License

Overview
- Motivation
- HITS: background, algorithm, drawbacks
- PageRank: background, algorithm, problems
- Summary: HITS and PageRank differences
- Sources

Motivation
Problem: searching for information on the web
- a query returns more than 1 million results, but only the first 10-20 results are relevant
- How do search engines decide which sites are important?
- What else needs to be considered?

Motivation: what else needs to be considered
- Fast and efficient
  - many requests at the same time
  - very large set of web pages (more than 1,000,000,000,000 in July 2008)
- Freshness of results
  - recent changes must be reflected
- Availability
  - of the search engine itself
  - of indexed pages that can be searched (cache)
- Resistance against manipulation
  - search result manipulation
  - spam

HITS

HITS
- HITS = Hyperlink-Induced Topic Search
- Introduced in 1997 by Jon Kleinberg
- For broad-topic information discovery
  - pick out a few relevant sources
- Identify authoritative web pages
  - most central regarding a certain topic
Question: When can a page be considered authoritative?

Hubs and Authorities
Two distinct types of pages
- Authorities
  - highly referenced pages
  - considered authoritative
- Hubs
  - pages that point to many authorities
  - points from which authority is conferred
- Mutually reinforcing relationship
  - a good hub points to many good authorities
  - a good authority is pointed to by many good hubs
[Figure: a hub page linking to an authority page]

Root Set and Base Set
First step of HITS processing
- Assemble root set S of pages
  - execute a user-supplied query
  - use a full-text search engine
- Expand to base set T
  - add pages that point to any page in S
  - add pages that are pointed to by any page in S
- Restrictions
  - the set of pages pointing to an authority can be enormous: consider only a fixed-size random subset
  - page links can be internal links for site navigation: exclude links between pages on the same host

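The expansion step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `out_links` and `in_links` dictionaries stand in for a hypothetical link index, the cap of 50 in-linking pages per root page is an assumed value, and "same host" is approximated by comparing the network location of the two URLs.

```python
import random
from urllib.parse import urlparse


def same_host(url_a, url_b):
    """True if both URLs share a host (treated here as a navigation link)."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc


def expand_to_base_set(root_set, out_links, in_links, max_in_links=50, seed=0):
    """Grow root set S into base set T and collect the links to keep.

    out_links / in_links: dict mapping URL -> list of URLs (hypothetical
    link index). max_in_links is an assumed cap on in-linking pages added
    per root page.
    """
    rng = random.Random(seed)
    base = set(root_set)
    for page in sorted(root_set):
        # Add every page the root page points to.
        base.update(out_links.get(page, []))
        # Add pages pointing to the root page, capped to a random subset.
        sources = list(in_links.get(page, []))
        if len(sources) > max_in_links:
            sources = rng.sample(sources, max_in_links)
        base.update(sources)
    # Keep only links between base-set pages on different hosts.
    edges = {
        (p, q)
        for p in base
        for q in out_links.get(p, [])
        if q in base and not same_host(p, q)
    }
    return base, edges
```

The returned edge set is what the weight computation would run on; same-host links are dropped from the graph, while the pages themselves stay in T.
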
Root Set (S) and Base Set (T)
[Figure: the root set S shown inside the larger base set T]

Hub Weight and Authority Weight
- Weights associated with each page p
  - hub weight h(p)
  - authority weight a(p)
  - both initialized to 1
- Calculation
  - a(p) is the sum of the hub weights of the pages pointing to p: $a(p) = \sum_{q \to p} h(q)$
  - h(p) is the sum of the authority weights of the pages pointed to by p: $h(p) = \sum_{p \to q} a(q)$
  - "p -> q" means that page p has a hyperlink to page q

Further Processing
- Repeat the whole update operation k times
  - the updates are ongoing, so there is no exact final result for the weights
  - the weights converge to certain values over time
  - k = 20 has been shown to give good convergence
- Normalize the weights
  - prevents the values from getting too large
  - normalize after each iteration

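The update-and-normalize loop described above could look roughly like this. It is a sketch under stated assumptions: links are given as a set of `(p, q)` pairs meaning "p links to q", the hub update uses the freshly computed authority weights, and normalization scales each weight vector to a sum of squares of 1 (the slides do not say which norm is used).

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1 (assumed norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(edges, k=20):
    """Run k rounds of hub/authority updates on a set of (p, q) links."""
    pages = {p for edge in edges for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # a(p): sum of hub weights of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for p, q in edges:
            new_auth[q] += hub[p]
        # h(p): sum of authority weights of pages that p points to.
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize after each iteration to keep the values bounded.
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

For example, `hits({("h1", "a"), ("h2", "a"), ("h1", "b")})` gives page `a` a higher authority weight than `b`, and page `h1` a higher hub weight than `h2`.
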
Output
- Only a few pages from the base set are relevant
  - output the n pages with the highest authority weights
  - output the n pages with the highest hub weights
  - n = 10 is reasonable
- These pages are the final search results

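Selecting the output is a plain top-n query over the two weight tables. A minimal sketch, assuming the weights are held in a dictionary from page to weight (the helper name `top_n` is made up for this illustration):

```python
def top_n(weights, n=10):
    """Return the n pages with the highest weight, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]
```

It would be called once with the authority weights and once with the hub weights.
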
Drawbacks
- No anti-spam capability
  - link farms can boost the hub score
- Topic drift
  - not all linked pages are thematically related
- Minor link changes can cause large result changes
- Query-dependent
  - the algorithm is executed for every single search query
  - answering a query is time-consuming
    - computation of root and base set
    - calculation of hub and authority weights

PageRank

Background on PageRank
- Published in 1998
  - developed and patented at Stanford University
  - among others by the Google founders Larry Page and Sergey Brin
  - exclusively licensed to Google
- Differences from other search technologies
  - pages are not ranked by content alone
  - new ranking criteria based on the link structure
  - harder to manipulate

Main idea
- Each website has a numeric value called PageRank or prestige
- PageRank computation is based on in- and outlinks
[Figure: example link graph with the four pages A, B, C and D]

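PageRank Algorithm
- A surfer follows an outlink of page x with probability p_x
- Therefore the PageRank of a page is: [formula not preserved in this transcript]
- Resulting equation system: [not preserved; one equation for each of the pages A, B, C and D]
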
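The slide's own formula is not preserved here. For orientation, the simplified (undamped) PageRank equation from the literature has the following form; it assumes that the surfer picks each outlink of x with equal probability, so p_x = 1 / outdeg(x), which may differ in notation from what the slide showed.

```latex
% Simplified PageRank of page y: each page x linking to y passes on
% its own rank, split evenly over its outlinks (assumed p_x).
PR(y) = \sum_{x \to y} p_x \, PR(x),
\qquad p_x = \frac{1}{\operatorname{outdeg}(x)}
```

Writing this equation once per page yields a linear system with one unknown per page.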
PageRank Algorithm
- One solution for the example graph: A = 4, B = 2, C = 5, D = 8
- Other scores can be reached by multiplying all values by the same factor

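The scaling property can be checked numerically. The sketch below uses a made-up three-page graph (not the A to D example from the slide, whose links are not preserved) and assumes the simplified equation in which each page splits its rank evenly over its outlinks.

```python
def satisfies_pagerank_equations(out_links, rank, tol=1e-9):
    """Check PR(y) == sum over x -> y of PR(x) / outdeg(x) for every page."""
    for page in out_links:
        incoming = sum(
            rank[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        if abs(rank[page] - incoming) > tol:
            return False
    return True


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
solution = {"A": 2.0, "B": 1.0, "C": 2.0}
# Multiplying every value by the same factor gives another solution.
scaled = {page: 3 * value for page, value in solution.items()}
```

Both `solution` and `scaled` satisfy the equations, which is why the scores are usually fixed by an extra condition such as "all values sum to 1".
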
Problems of the algorithm
- Rank sink
  - after some iterations, A and B will have a PageRank of 0
  - solution: RandomSurfer
[Figure: link graph with the pages A, B, C and D in which rank drains away from A and B]

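The effect can be reproduced with a few undamped iterations. The graph below is a hypothetical one chosen to show the problem (the slide's exact links are not preserved): C and D link only to each other, so rank flows into that loop and never comes back out.

```python
def iterate_without_damping(out_links, steps):
    """Repeatedly pass each page's rank along its outlinks, with no teleport."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(steps):
        new_rank = {page: 0.0 for page in out_links}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: A -> B -> C, and C <-> D form the sink.
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
ranks = iterate_without_damping(graph, steps=5)
```

In this graph A loses all of its rank after the first step and B after the second, while C and D end up holding everything.
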
RandomSurfer
- Idea: simulate real surfing behavior
  - a real surfer may "teleport" to another website (back button, bookmark, ...)
  - the "damping factor" d is the probability of following a regular outlink

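With the damping factor, the PageRank equation is commonly written as below. This is the normalized variant in which all ranks sum to 1; N is the total number of pages, and the surfer is assumed to teleport to a uniformly random page with probability 1 - d. The slide's own notation is not preserved and may differ. A value of d = 0.85 is often used in the literature.

```latex
% Damped PageRank: with probability d follow an outlink,
% with probability 1 - d jump to one of the N pages at random.
PR(y) = \frac{1 - d}{N} + d \sum_{x \to y} \frac{PR(x)}{\operatorname{outdeg}(x)}
```

Because every page now receives at least (1 - d) / N, no page can drop to a PageRank of 0.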
Iterative algorithm
PageRank-Iterate(G): Repeat ... Until ... Return ...
[pseudocode body not preserved in this transcript]
[Figures: the PageRank values of the pages A, B, C and D at step 0, at step 1 and at the final step]

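Since the pseudocode body is lost, here is a minimal sketch of what a repeat-until PageRank iteration typically looks like. It is not a reconstruction of the slide: the stopping threshold `eps`, the iteration cap, the default d = 0.85 and the even redistribution of rank from pages without outlinks are all assumptions, and every link target is expected to appear as a key of `out_links`.

```python
def pagerank_iterate(out_links, d=0.85, eps=1e-10, max_iter=1000):
    """Iterate the damped PageRank update until the ranks stop changing."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # step 0: uniform start
    for _ in range(max_iter):
        # Teleport share that every page receives.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for src, targets in out_links.items():
            if targets:
                share = d * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Page without outlinks: spread its rank evenly (assumption).
                for dst in pages:
                    new_rank[dst] += d * rank[src] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < eps:  # "until": total change below the threshold
            break
    return rank


# Hypothetical graph with a C <-> D loop that would be a rank sink without damping.
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
ranks = pagerank_iterate(graph)
```

On this graph the ranks still concentrate on C and D, but A and B keep a small positive value instead of falling to 0.
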
Properties
- Strengths
  - pre-computable
  - fast
  - spam-resistant
  - minor changes have minor effects
- Weaknesses
  - pages are only authoritative in general and not on the query topic
  - link farms
  - Google bombs

Summary
- HITS
  - the algorithm is executed after a query is made
  - pages get a hub value and an authority value
  - calculates whether a page provides good information and/or whether it links to pages that do so
  - no spam-fighting ability
- PageRank
  - each page gets one PageRank that declares its value
  - query-independent
  - spam-resistant

Sources
- Papers about PageRank
  - Larry Page et al.: "The PageRank Citation Ranking: Bringing Order to the Web"
  - Ulrik Brandes, Gabi Dorfmüller: 10. Algorithmus der Woche der RWTH Aachen, 09/05/2006
  - Peter J. Zehetner: "Der PageRank-Algorithmus", 05/2007
  - Taher H. Haveliwala: "Efficient Computation of PageRank", 10/1999
- Papers about HITS
  - Jon Kleinberg: "Authoritative sources in a hyperlinked environment"
  - Jon Kleinberg et al.: "Inferring Web communities from link topology"
- Book
  - Bing Liu: "Web Data Mining", 2008
