SlideShare a Scribd company logo
The Anatomy of a Large-Scale Hypertextual
Web Search Engine
Introduction
• Designed to crawl and index the Web efficiently
• Produce much more satisfying search results
• Very little academic research
Introduction
• Web is growing rapidly
• Also, new users inexperienced in the art of web research
• Human maintained lists cover popular topics are efficient
but expensive
Introduction
• Automated search engines that rely on keyword matching
• Return too many low quality matches
• Advertisers attempt to mislead automated search engines
Challenges
• Fast crawling technology is needed
• Storage space must be used efficiently
• The indexing system must process hundreds of gigabytes of data
efficiently
• Queries must be handled quickly
Design Goals
• Improved Search Quality
- We need tools that have very high precision
• Academic Search Engine Research
- Build systems that reasonable numbers of people can
actually use
System Features: PageRank
• Use of Maps
• PageRank to prioritize web search result
• The probability that the random surfer visits a page is its
PageRank.
• And, the d damping factor is the probability at each page
the "random surfer" will get bored & request another
random page.
System Features: PageRank
• We assume page A has pages T1...Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1. We usually set d to 0.85.
• Also C(A) is defined as the number of links going out of page A.
The PageRank of a page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• Note that the PageRanks form a probability distribution over web
pages, so the sum of all web pages’ PageRanks will be one.
Google Architecture Overview
Major Data Structures
• BigFiles
- Virtual files spanning multiple file systems
- Addressable by 64 bit integers
- Multiple file systems is handled automatically
- Also support rudimentary compression options
Major Data Structures
• Repository
- Contains the full HTML of every web page.
- Each page is compressed using zlib (zlib vs bzip)
- The documents are stored one after the other
- Prefixed by docID, length, and URL
Major Data Structures
• Repository
Major Data Structures
• Document Index
- Document index: information about each document.
- Fixed width ISAM (Index sequential access mode) index,
ordered by docID.
- Each entry includes the current document status, a pointer
into the repository, a document checksum, and various
statistics.
- If the document has been crawled, it also contains a
pointer into a variable width file called docinfo which
contains its URL and title.
Major Data Structures
• Document Index
- Otherwise the pointer points into the URLlist which
contains just the URL.
- This design decision was driven by the desire to have a
reasonably compact data structure, and the ability to fetch a
record in one disk seek during a search
Major Data Structures
• Lexicon
- It is implemented in two parts -- a list of the words
(concatenated together but separated by nulls) and a hash
table of pointers.
Major Data Structures
• Hit Lists
- A list of occurrences of a particular word in a particular
document including position, font, and capitalization
information.
- Account for most of the space used in both the forward
and the inverted indices.
Major Data Structures
• Hit Lists
- Fancy hits include hits occurring in a URL, title, anchor
text, or meta tag.
- Plain hits include everything else. A plain hit consists of a
capitalization bit, font size, and 12 bits of word position in a
document
Major Data Structures
• Forward Index
- Each barrel holds a range of wordID’s.
- If a document contains words that fall into a particular
barrel, the docID is recorded into the barrel, followed by a
list of wordID’s with hitlists which correspond to those
words.
- Requires slightly more storage because of duplicated
docIDs.
Major Data Structures
• Inverted Index
- Consists of the same barrels as the forward index, except
that they have been processed by the sorter.
- For every valid wordID, the lexicon contains a pointer into
the barrel that wordID falls into.
- It points to a doclist of docID’s together with their
corresponding hit lists.
- This doclist represents all the occurrences of that word in
all documents.
Crawling the Web
• Indexing the Web
• Searching

More Related Content

What's hot

Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finale
Ajit More
 
London HUG
London HUGLondon HUG
London HUG
Boudicca
 

What's hot (20)

Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finale
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
London HUG
London HUGLondon HUG
London HUG
 
Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
 
Indexing
IndexingIndexing
Indexing
 
Reference linking and Cited-by
Reference linking and Cited-byReference linking and Cited-by
Reference linking and Cited-by
 
Working with Crossref and registering content
Working with Crossref and registering contentWorking with Crossref and registering content
Working with Crossref and registering content
 
Dm1.1
Dm1.1Dm1.1
Dm1.1
 
File organization in database
File organization in databaseFile organization in database
File organization in database
 
Overview of Storage and Indexing ...
Overview of Storage and Indexing                                             ...Overview of Storage and Indexing                                             ...
Overview of Storage and Indexing ...
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
 
Introduction to Redis Data Structures: Sorted Sets
Introduction to Redis Data Structures: Sorted SetsIntroduction to Redis Data Structures: Sorted Sets
Introduction to Redis Data Structures: Sorted Sets
 
Data indexing presentation
Data indexing presentationData indexing presentation
Data indexing presentation
 
CEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREFCEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREF
 
Freire model api
Freire model apiFreire model api
Freire model api
 
Redis - Your Magical superfast database
Redis - Your Magical superfast databaseRedis - Your Magical superfast database
Redis - Your Magical superfast database
 
PoolParty SKOS and Linked Data
PoolParty SKOS and Linked DataPoolParty SKOS and Linked Data
PoolParty SKOS and Linked Data
 
Which no sql database
Which no sql databaseWhich no sql database
Which no sql database
 
Crossref Metadata and Metadata Services
Crossref Metadata and Metadata ServicesCrossref Metadata and Metadata Services
Crossref Metadata and Metadata Services
 
Managing plagiarism: Similarity Check
Managing plagiarism: Similarity CheckManaging plagiarism: Similarity Check
Managing plagiarism: Similarity Check
 

Similar to The Anatomy of a Large-Scale Hypertextual Web Search Engine

Google Paper
Google Paper Google Paper
Google Paper
girish1m
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
smile790243
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
Stefanos Anastasiadis
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
Petter Skodvin-Hvammen
 

Similar to The Anatomy of a Large-Scale Hypertextual Web Search Engine (20)

DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 
Google Paper
Google Paper Google Paper
Google Paper
 
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr..."PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Info 2402 irt-chapter_3
Info 2402 irt-chapter_3Info 2402 irt-chapter_3
Info 2402 irt-chapter_3
 
Apache drill
Apache drillApache drill
Apache drill
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Databases and types of databases
Databases and types of databasesDatabases and types of databases
Databases and types of databases
 
The Enterprise Search Market in a Nutshell
The Enterprise Search Market in a NutshellThe Enterprise Search Market in a Nutshell
The Enterprise Search Market in a Nutshell
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
Indexing and hashing
Indexing and hashingIndexing and hashing
Indexing and hashing
 
Mba admission in india
Mba admission in indiaMba admission in india
Mba admission in india
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 

Recently uploaded

Recently uploaded (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Motion for AI: Creating Empathy in Technology
Motion for AI: Creating Empathy in TechnologyMotion for AI: Creating Empathy in Technology
Motion for AI: Creating Empathy in Technology
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Server-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at PricelineServer-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at Priceline
 
The architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfThe architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Intelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfIntelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdf
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 

The Anatomy of a Large-Scale Hypertextual Web Search Engine

  • 1. The Anatomy of a Large-Scale Hypertextual Web Search Engine
  • 2. Introduction • Designed to crawl and index the Web efficiently • Produce much more satisfying search results • Very little academic research
  • 3. Introduction • Web is growing rapidly • Also, new users inexperienced in the art of web research • Human maintained lists cover popular topics are efficient but expensive
  • 4. Introduction • Automated search engines that rely on keyword matching • Return too many low quality matches • Advertisers attempt to mislead automated search engines
  • 5. Challenges • Fast crawling technology is needed • Storage space must be used efficiently • The indexing system must process hundreds of gigabytes of data efficiently • Queries must be handled quickly
  • 6. Design Goals • Improved Search Quality - We need tools that have very high precision • Academic Search Engine Research - Build systems that reasonable numbers of people can actually use
  • 7. System Features: PageRank • Use of Maps • PageRank to prioritize web search result • The probability that the random surfer visits a page is its PageRank. • And, the d damping factor is the probability at each page the "random surfer" will get bored & request another random page.
  • 8. System Features: PageRank • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. • Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.
  • 10. Major Data Structures • BigFiles - Virtual files spanning multiple file systems - Addressable by 64 bit integers - Multiple file systems is handled automatically - Also support rudimentary compression options
  • 11. Major Data Structures • Repository - Contains the full HTML of every web page. - Each page is compressed using zlib (zlib vs bzip) - The documents are stored one after the other - Prefixed by docID, length, and URL
  • 13. Major Data Structures • Document Index - Document index: information about each document. - Fixed width ISAM (Index sequential access mode) index, ordered by docID. - Each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. - If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title.
  • 14. Major Data Structures • Document Index - Otherwise the pointer points into the URLlist which contains just the URL. - This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search
  • 15. Major Data Structures • Lexicon - It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers.
  • 16. Major Data Structures • Hit Lists - A list of occurrences of a particular word in a particular document including position, font, and capitalization information. - Account for most of the space used in both the forward and the inverted indices.
  • 17. Major Data Structures • Hit Lists - Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. - Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document
  • 18. Major Data Structures • Forward Index - Each barrel holds a range of wordID’s. - If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID’s with hitlists which correspond to those words. - Requires slightly more storage because of duplicated docIDs.
  • 19. Major Data Structures • Inverted Index - Consists of the same barrels as the forward index, except that they have been processed by the sorter. - For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. - It points to a doclist of docID’s together with their corresponding hit lists. - This doclist represents all the occurrences of that word in all documents.
  • 20. Crawling the Web • Indexing the Web • Searching