The Anatomy of GOOGLE Search Engine
By
Manish Chopra
Corporate Trainer
IIHT Ltd. Bangalore
August, 2008
The Anatomy of GOOGLE Search Engine – IIHT Ltd.
Google runs on a distributed network of thousands of low-cost computers and can therefore carry
out fast parallel processing. Parallel processing is a method of computation in which many
calculations can be performed simultaneously, significantly speeding up data processing. Google
has three distinct parts:
1. Googlebot, a web crawler that finds and fetches web pages.
2. The indexer, which sorts every word on every page and stores the resulting index of words
in a huge database.
3. The query processor, which compares your search query to the index and recommends
the documents that it considers most relevant.
Let’s take a closer look at each part.
1. Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and
hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying
across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It
functions much like your web browser, by sending a request to a web server for a web page,
downloading the entire page, and then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than
you can with your web browser. In fact, Googlebot can request thousands of different pages
simultaneously. To avoid overwhelming web servers, or crowding out requests from human
users, Googlebot deliberately makes requests of each individual web server more slowly than it’s
capable of doing.
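The per-server throttling described above can be sketched as a small scheduler that remembers, for each host, the earliest time the next request may go out. This is only an illustration: the class name `PolitenessScheduler` and the two-second delay are assumptions made for the example, not details of Googlebot.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds: float = 2.0) -> None:
        # The delay is an illustrative value, not a figure published by Google.
        self.min_delay = min_delay_seconds
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that a request for `url` was sent at time `now`."""
        host = urlsplit(url).netloc.lower()
        self._next_allowed[host] = now + self.min_delay
```

Requests to different hosts do not delay one another, so a crawler built this way can still fetch many pages at once while going easy on each individual server.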
Googlebot finds pages in two ways: through an Add URL form, www.google.com/addurl.html,
and by following the links it finds while crawling the web.
Unfortunately, spammers figured out how to create automated bots that bombarded the Add URL
form with millions of URLs pointing to commercial propaganda. Google rejects URLs submitted
through the form if it suspects they are trying to deceive users with tactics such as including
hidden text or links on a page, stuffing a page with irrelevant words, using sneaky redirects,
creating doorways, domains, or sub-domains with substantially similar content, sending
automated queries to Google, or linking to bad neighbors. The Add URL form now also includes
a test: it displays some squiggly letters designed to fool automated “letter-guessers” and asks
you to enter the letters you see, something like an eye-chart test to stop spambots.
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a
queue for subsequent crawling. Googlebot tends to encounter little spam because most web
authors link only to what they believe are high-quality pages. By harvesting links from every
page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the
web. This technique, known as deep crawling, also allows Googlebot to probe deep within
individual sites. Because of their massive scale, deep crawls can reach almost every page on the
web. Because the web is vast, a deep crawl takes time, so some pages may be crawled only once
a month.
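The link-harvesting loop described above can be sketched in a few lines. This is a simplified illustration, not Googlebot's code: the names `LinkExtractor` and `crawl` are made up for the example, and `fetch` is a hypothetical function supplied by the caller that returns a page's HTML, or None if the page cannot be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch, max_pages: int = 100) -> list[str]:
    """Crawl breadth-first from `seed`; return URLs in the order fetched."""
    queue = deque([seed])   # URLs waiting to be visited
    seen = {seed}           # every URL ever queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch free of network code, so it can be tried against a small dictionary of pages before being pointed at real servers.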
Although its function is simple, Googlebot must be programmed to handle several challenges.
First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of
“visit soon” URLs must be constantly examined and compared with URLs already in Google’s
index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same
page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s
a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index
changed pages to deliver up-to-date results.
To keep the index current, Google continuously recrawls popular, frequently changing web pages
at a rate roughly proportional to how often the pages change. Such crawls are known as fresh
crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much
more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The
combination of the two types of crawls allows Google both to make efficient use of its resources
and to keep its index reasonably current.
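One simple way to revisit pages at a rate that tracks how often they change is to shorten the revisit interval whenever a page has changed and lengthen it whenever it has not. The sketch below shows that idea only; the halving and doubling rule, the function name `next_interval`, and the one-hour and 720-hour (about one month) bounds are assumptions for illustration, not Google's actual scheduling policy.

```python
def next_interval(current_hours: float, changed: bool,
                  min_hours: float = 1.0, max_hours: float = 720.0) -> float:
    """Return the next revisit interval in hours.

    Halve the interval when the page changed since the last visit,
    double it when it did not, and clamp to [min_hours, max_hours].
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))
```

Under this rule a page that changes on every visit drifts toward the minimum interval, while a static page drifts toward the maximum.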
2. Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s
index database. This index is sorted alphabetically by search term, with each index entry storing
a list of documents in which the term appears and the location within the text where it occurs.
This data structure allows rapid access to documents that contain user query terms.
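The structure described here is commonly called an inverted index. A minimal sketch, assuming whitespace-separated words and hypothetical document IDs such as "d1", could look like this:

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each term to {doc_id: [word positions where the term occurs]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def lookup(index: dict[str, dict[str, list[int]]], term: str) -> list[str]:
    """Return the IDs of documents containing `term`, sorted by ID."""
    return sorted(index.get(term.lower(), {}))
```

Finding the documents for a query term is then a single dictionary lookup instead of a scan over every page.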
To improve search performance, Google ignores (doesn’t index) common words called stop
words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters).
Stop words are so common that they do little to narrow a search, and therefore they can safely be
discarded. To further improve performance, the indexer also ignores some punctuation and
multiple spaces, and converts all letters to lowercase.
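The normalization steps above can be sketched as a small function. The stop-word set below contains only the examples named in the text (the full list is not given), and dropping every single-character token is a simplifying assumption, since the text says only that certain single digits and letters are ignored.

```python
import re

# Illustrative stop words taken from the examples in the text.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and extra spaces, and drop stop words."""
    # Keep runs of letters and digits; everything else acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
```

For example, `normalize("How is the  Web-Crawler of Google, INDEXED?")` keeps only the words web, crawler, google, and indexed.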
3. Google’s Query Processor
The query processor has several parts, including the user interface (search box), the “engine” that
evaluates queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed
more important and is more likely to be listed above a page with a lower PageRank.
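Google's production formulas are not public, but the basic PageRank idea as publicly described treats a link from one page to another as a vote and computes scores iteratively. The sketch below follows that simplified model on a toy link graph; the damping value 0.85 is a commonly cited default, and the function name, iteration count, and handling of pages without outgoing links are assumptions made for the example.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Compute simplified PageRank scores for a small link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In a graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page c collects links from three pages and ends up with the highest score, while d, which nothing links to, ends up with the lowest.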
Google considers over a hundred factors in computing a PageRank and determining which
documents are most relevant to a query, including the popularity of the page, the position and
size of the search terms within the page, and the proximity of the search terms to one another on
the page. A patent application discusses other factors that Google considers when ranking a page.
Google also applies machine-learning techniques to improve its performance automatically by
learning relationships and associations within the stored data. For example, the spelling-
correcting system uses such techniques to figure out likely alternative spellings. Google closely
guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and
performance, and to outwit the latest devious techniques used by spammers.
Indexing the full text of the web allows Google to go beyond simply matching single search
terms. Google gives more priority to pages that have search terms near each other and in the
same order as the query. Google can also match multi-word phrases and sentences. Since Google
indexes HTML code in addition to the text on the page, users can restrict searches on the basis of
where query words appear, e.g., in the title, in the URL, in the body, and in links to the page.
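Phrase matching and proximity checks both rely on knowing where each word occurs in a page. The sketch below shows one way to do this with word positions; the function names are hypothetical, and real ranking would combine such signals with many others.

```python
from typing import Optional


def positions(text: str) -> dict[str, list[int]]:
    """Map each lowercase word to the list of positions where it occurs."""
    table: dict[str, list[int]] = {}
    for i, word in enumerate(text.lower().split()):
        table.setdefault(word, []).append(i)
    return table


def contains_phrase(text: str, phrase: str) -> bool:
    """True if the words of `phrase` occur consecutively and in order."""
    words = phrase.lower().split()
    if not words:
        return False
    table = positions(text)
    return any(
        all(start + offset in table.get(word, [])
            for offset, word in enumerate(words))
        for start in table.get(words[0], [])
    )


def closest_distance(text: str, first: str, second: str) -> Optional[int]:
    """Smallest gap in words between two terms, or None if one is absent."""
    table = positions(text)
    a = table.get(first.lower(), [])
    b = table.get(second.lower(), [])
    if not a or not b:
        return None
    return min(abs(i - j) for i in a for j in b)
```

A page where `closest_distance` is small for the query terms, or where `contains_phrase` is true for the whole query, would be a natural candidate for a higher position.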
Let’s see how Google processes a query.
When Google responds with search results in a web browser, the results page is filled with
information and links, most of which relate to the user’s query. Here is a snapshot of the results
page displayed for a query on the Google search engine:
The page contains the following components:
• Google Logo: Click on the Google logo to go to Google’s home page.
• Statistics Bar: Describes your search and includes the number of results on the current
results page, an estimate of the total number of results, and the time your search
took. For the sake of efficiency, Google estimates the number of results.
Below are descriptions of some search-result components. These components appear in fonts of
different colors on the result page to make it easier to distinguish them from one another.
o Page Title: (blue) The web page’s title, if the page has one, or its URL if the page has no
title or if Google has not indexed all of the page’s content. Click on the page title to
display the corresponding page.
o Snippets: (black) Each search result usually includes one or more short excerpts of the
text that matches your query, with your search terms in boldface type. These snippets,
which appear in a black font, may provide you with:
• The information you are seeking
• What you might find on the linked page
• Ideas of terms to use in your subsequent searches
When Google hasn’t crawled a page, it doesn’t include a snippet. A page might
not be crawled because its publisher requested no crawling, or because the page
was written in such a way that it was too difficult to crawl.
o URL of Result: (green) Web address of the search result.
o Size: (green) The size of the text portion of the web page. It is omitted for sites not yet
indexed. In the screen shot, “38k” means that the text portion of the web page is 38
kilobytes. 1k of text is about 170 words. A page containing 38k characters thus is about
6460 words long. For the sake of efficiency, Google searches only the first 101 kilobytes
(approximately 17,000 words) of a web page and the first 120 kilobytes of a PDF file.
Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a
web page and about the first 27 pages of a PDF file. If a page is larger, Google will list the
page as being 101 kilobytes, or 120 kilobytes for a PDF file. This means that Google’s results
won’t reference any part of a web page beyond its first 101 kilobytes or any part of a PDF
file beyond the first 120 kilobytes.
o Date: (green) The date tells you the freshness of Google’s copy of the page. Dates are
included for pages that have recently had a fresh crawl.
o Indented Result: When Google finds multiple results from the same website, it lists the
most relevant result first with the second most relevant page from that same site indented
below it. Limiting the number of results from a given site to two ensures that pages from
one site will not dominate your search results and that Google provides pages from a
variety of sites.
o More Results: When there are more than two results from the same site, access the
remaining results from the “More results from…” link. When Google returns more than
one page of results, you can view subsequent pages by clicking either a page number or
one of the “o”s in the whimsical “Gooooogle” that appears below the last search result on
the page.
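The indented-result and “More results from…” behavior described above amounts to grouping a ranked result list by site and showing at most two entries per site. A minimal sketch, assuming the input URLs are already sorted from most to least relevant and using the hypothetical function name `group_by_site`:

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls: list[str], per_site: int = 2):
    """Split ranked results into displayed entries and per-site leftovers.

    Returns (shown, more): `shown` is a list of (url, indented) pairs, and
    `more` maps a host to the results held back behind a "More results" link.
    """
    by_host: dict[str, list[str]] = {}
    for url in ranked_urls:
        host = urlsplit(url).netloc.lower()
        by_host.setdefault(host, []).append(url)

    shown: list[tuple[str, bool]] = []
    more: dict[str, list[str]] = {}
    # Dictionaries keep insertion order, so sites appear in the order of
    # their best-ranked result.
    for host, urls in by_host.items():
        for i, url in enumerate(urls[:per_site]):
            shown.append((url, i > 0))  # later results from a site are indented
        if len(urls) > per_site:
            more[host] = urls[per_site:]
    return shown, more
```

With `per_site` left at two, a site's second-best page is listed indented directly under its best page, and any further pages from that site are collected in `more`.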

More Related Content

What's hot

OSINT - Yandex Search
OSINT - Yandex SearchOSINT - Yandex Search
OSINT - Yandex SearchRaghav Bisht
 
Google ppt by amit
Google ppt by amitGoogle ppt by amit
Google ppt by amitDAVV
 
Internet search techniques by tariq ghayyur1
Internet search techniques by tariq ghayyur1Internet search techniques by tariq ghayyur1
Internet search techniques by tariq ghayyur1Tariq Ghayyur
 
Effective web search techniques
Effective web search techniquesEffective web search techniques
Effective web search techniquesaliciafe0215
 
Enhance Your Google Search
Enhance Your Google SearchEnhance Your Google Search
Enhance Your Google SearchValentini Mellas
 
Understanding Seo At A Glance
Understanding Seo At A GlanceUnderstanding Seo At A Glance
Understanding Seo At A Glancepoojagupta267
 
Sourcing / Recruiting /Searching on Google and Live
Sourcing / Recruiting /Searching on Google and LiveSourcing / Recruiting /Searching on Google and Live
Sourcing / Recruiting /Searching on Google and LiveRithesh Nair
 
Google searching techniques
Google searching techniquesGoogle searching techniques
Google searching techniquesabbas mohd
 
Google search techniques
Google search techniquesGoogle search techniques
Google search techniquesNirav Ranpara
 
Boolean- Search Basics
Boolean- Search BasicsBoolean- Search Basics
Boolean- Search BasicsRithesh Nair
 
Week13 key concepts_googlesearchtechniques
Week13 key concepts_googlesearchtechniquesWeek13 key concepts_googlesearchtechniques
Week13 key concepts_googlesearchtechniquescarolyn oldham
 
Week 9 10 ppt-google_search
Week 9 10 ppt-google_searchWeek 9 10 ppt-google_search
Week 9 10 ppt-google_searchcarolyn oldham
 
Internet Search Presentation
Internet Search PresentationInternet Search Presentation
Internet Search PresentationSteve Guinan
 
Week12keyconceptsgooglesearchtechniques
Week12keyconceptsgooglesearchtechniquesWeek12keyconceptsgooglesearchtechniques
Week12keyconceptsgooglesearchtechniquescarolyn oldham
 
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021Michael Costanzo
 
How google search engine work
How google search engine workHow google search engine work
How google search engine workLạc Lạc
 

What's hot (20)

OSINT - Yandex Search
OSINT - Yandex SearchOSINT - Yandex Search
OSINT - Yandex Search
 
Google ppt by amit
Google ppt by amitGoogle ppt by amit
Google ppt by amit
 
Internet search techniques by tariq ghayyur1
Internet search techniques by tariq ghayyur1Internet search techniques by tariq ghayyur1
Internet search techniques by tariq ghayyur1
 
Internet search techniques for K12
Internet search techniques for K12Internet search techniques for K12
Internet search techniques for K12
 
Google search tips
Google search tipsGoogle search tips
Google search tips
 
Google Searchology
Google SearchologyGoogle Searchology
Google Searchology
 
Effective web search techniques
Effective web search techniquesEffective web search techniques
Effective web search techniques
 
Searching google
Searching googleSearching google
Searching google
 
Enhance Your Google Search
Enhance Your Google SearchEnhance Your Google Search
Enhance Your Google Search
 
Understanding Seo At A Glance
Understanding Seo At A GlanceUnderstanding Seo At A Glance
Understanding Seo At A Glance
 
Sourcing / Recruiting /Searching on Google and Live
Sourcing / Recruiting /Searching on Google and LiveSourcing / Recruiting /Searching on Google and Live
Sourcing / Recruiting /Searching on Google and Live
 
Google searching techniques
Google searching techniquesGoogle searching techniques
Google searching techniques
 
Google search techniques
Google search techniquesGoogle search techniques
Google search techniques
 
Boolean- Search Basics
Boolean- Search BasicsBoolean- Search Basics
Boolean- Search Basics
 
Week13 key concepts_googlesearchtechniques
Week13 key concepts_googlesearchtechniquesWeek13 key concepts_googlesearchtechniques
Week13 key concepts_googlesearchtechniques
 
Week 9 10 ppt-google_search
Week 9 10 ppt-google_searchWeek 9 10 ppt-google_search
Week 9 10 ppt-google_search
 
Internet Search Presentation
Internet Search PresentationInternet Search Presentation
Internet Search Presentation
 
Week12keyconceptsgooglesearchtechniques
Week12keyconceptsgooglesearchtechniquesWeek12keyconceptsgooglesearchtechniques
Week12keyconceptsgooglesearchtechniques
 
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021
Content Automation at Scale | Michael Costanzo @ #brightonSEO 2021
 
How google search engine work
How google search engine workHow google search engine work
How google search engine work
 

Viewers also liked

Steps to create an RPM package in Linux
Steps to create an RPM package in LinuxSteps to create an RPM package in Linux
Steps to create an RPM package in LinuxManish Chopra
 
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikoje
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikojeTadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikoje
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikojeAidis Stukas
 
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?Aidis Stukas
 
Lietuvos doktorantūra iššūkiai ir galimybės
Lietuvos doktorantūra iššūkiai ir galimybėsLietuvos doktorantūra iššūkiai ir galimybės
Lietuvos doktorantūra iššūkiai ir galimybėsAidis Stukas
 
Congenital vertical talus BY DR.NAVEEN RATHOR
Congenital vertical talus BY DR.NAVEEN RATHORCongenital vertical talus BY DR.NAVEEN RATHOR
Congenital vertical talus BY DR.NAVEEN RATHORDR.Naveen Rathor
 
GSMA 2017 | 200 Plus Slides
GSMA 2017 |  200 Plus SlidesGSMA 2017 |  200 Plus Slides
GSMA 2017 | 200 Plus SlidesZohar Urian
 
7 Trends & Insights MWC 2017
7 Trends & Insights MWC 20177 Trends & Insights MWC 2017
7 Trends & Insights MWC 2017DMI
 
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...Omar El-Azrak
 
Women day splash answers
Women day splash answersWomen day splash answers
Women day splash answersprateek vishnoi
 
Féminisme et égalité des sexes
Féminisme et égalité des sexesFéminisme et égalité des sexes
Féminisme et égalité des sexesIpsos France
 

Viewers also liked (10)

Steps to create an RPM package in Linux
Steps to create an RPM package in LinuxSteps to create an RPM package in Linux
Steps to create an RPM package in Linux
 
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikoje
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikojeTadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikoje
Tadas Juknevičius (MOSTA) Doktorantūros horizontai Lietuvos mokslo politikoje
 
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?
dr. Brigita Serafinavičiūtė (VU) Atviras mokslas: nauja mada ar būtinybė?
 
Lietuvos doktorantūra iššūkiai ir galimybės
Lietuvos doktorantūra iššūkiai ir galimybėsLietuvos doktorantūra iššūkiai ir galimybės
Lietuvos doktorantūra iššūkiai ir galimybės
 
Congenital vertical talus BY DR.NAVEEN RATHOR
Congenital vertical talus BY DR.NAVEEN RATHORCongenital vertical talus BY DR.NAVEEN RATHOR
Congenital vertical talus BY DR.NAVEEN RATHOR
 
GSMA 2017 | 200 Plus Slides
GSMA 2017 |  200 Plus SlidesGSMA 2017 |  200 Plus Slides
GSMA 2017 | 200 Plus Slides
 
7 Trends & Insights MWC 2017
7 Trends & Insights MWC 20177 Trends & Insights MWC 2017
7 Trends & Insights MWC 2017
 
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...
MAROC EN AFRIQUE ; LES MODES D’ENTRÉE DES ENTREPRISES MAROCAINES SUR LE MARCH...
 
Women day splash answers
Women day splash answersWomen day splash answers
Women day splash answers
 
Féminisme et égalité des sexes
Féminisme et égalité des sexesFéminisme et égalité des sexes
Féminisme et égalité des sexes
 

Similar to The Anatomy of GOOGLE Search Engine

Search Engine Optimization (Seo)
Search Engine Optimization (Seo)Search Engine Optimization (Seo)
Search Engine Optimization (Seo)ssunnysengar
 
Search engine optimization (seo)
Search engine optimization (seo)Search engine optimization (seo)
Search engine optimization (seo)jhon smith
 
Google Search Engine
Google Search Engine Google Search Engine
Google Search Engine Aniket_1415
 
How Google WOrks?
How Google WOrks?How Google WOrks?
How Google WOrks?07Deeps
 
Dipika arora ppts
Dipika arora pptsDipika arora ppts
Dipika arora pptspreetianeja
 
Inside google search - how it works??
Inside google search - how it works??Inside google search - how it works??
Inside google search - how it works??Dhruv Patel
 
The Beginner's Guide to Googlebot Optimization
The Beginner's Guide to Googlebot OptimizationThe Beginner's Guide to Googlebot Optimization
The Beginner's Guide to Googlebot OptimizationCMI_Compas
 
Google Looks Into the Index Now Protocol for Crawling and Indexing
Google Looks Into the Index Now Protocol for Crawling and IndexingGoogle Looks Into the Index Now Protocol for Crawling and Indexing
Google Looks Into the Index Now Protocol for Crawling and IndexingPaulDonahue16
 
Google indexing
Google indexingGoogle indexing
Google indexingtahoor71
 
seo important Terms A-Z (1).pdf safalta.com
seo important Terms A-Z (1).pdf safalta.comseo important Terms A-Z (1).pdf safalta.com
seo important Terms A-Z (1).pdf safalta.comashgamer800
 
Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)SyedFaraz41
 
How google works and functions: A complete Approach
How google works and functions: A complete ApproachHow google works and functions: A complete Approach
How google works and functions: A complete ApproachPrakhar Gethe
 
Introduction to SEO Basics
Introduction to SEO BasicsIntroduction to SEO Basics
Introduction to SEO BasicsJenifer Renjini
 
Search Engine Optimization
Search Engine OptimizationSearch Engine Optimization
Search Engine Optimizationshrishail uttagi
 
How to perform a technical SEO audit and ramp up your content strategy in 10 ...
How to perform a technical SEO audit and ramp up your content strategy in 10 ...How to perform a technical SEO audit and ramp up your content strategy in 10 ...
How to perform a technical SEO audit and ramp up your content strategy in 10 ...Waqar Ahmad
 

Similar to The Anatomy of GOOGLE Search Engine (20)

Search Engine Optimization (Seo)
Search Engine Optimization (Seo)Search Engine Optimization (Seo)
Search Engine Optimization (Seo)
 
Search engine optimization (seo)
Search engine optimization (seo)Search engine optimization (seo)
Search engine optimization (seo)
 
Google Search Engine
Google Search Engine Google Search Engine
Google Search Engine
 
How Google WOrks?
How Google WOrks?How Google WOrks?
How Google WOrks?
 
Dipika arora ppts
Dipika arora pptsDipika arora ppts
Dipika arora ppts
 
How Google Search Works
How Google Search WorksHow Google Search Works
How Google Search Works
 
How Google Works
How Google WorksHow Google Works
How Google Works
 
Webmaster tools (ICMK485)
Webmaster tools (ICMK485)Webmaster tools (ICMK485)
Webmaster tools (ICMK485)
 
Inside google search - how it works??
Inside google search - how it works??Inside google search - how it works??
Inside google search - how it works??
 
The Beginner's Guide to Googlebot Optimization
The Beginner's Guide to Googlebot OptimizationThe Beginner's Guide to Googlebot Optimization
The Beginner's Guide to Googlebot Optimization
 
Google Looks Into the Index Now Protocol for Crawling and Indexing
Google Looks Into the Index Now Protocol for Crawling and IndexingGoogle Looks Into the Index Now Protocol for Crawling and Indexing
Google Looks Into the Index Now Protocol for Crawling and Indexing
 
Google indexing
Google indexingGoogle indexing
Google indexing
 
seo important Terms A-Z (1).pdf safalta.com
seo important Terms A-Z (1).pdf safalta.comseo important Terms A-Z (1).pdf safalta.com
seo important Terms A-Z (1).pdf safalta.com
 
Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)
 
How google works and functions: A complete Approach
How google works and functions: A complete ApproachHow google works and functions: A complete Approach
How google works and functions: A complete Approach
 
Introduction to SEO Basics
Introduction to SEO BasicsIntroduction to SEO Basics
Introduction to SEO Basics
 
Search Engine Optimization
Search Engine OptimizationSearch Engine Optimization
Search Engine Optimization
 
Basics of SEO
Basics of SEO Basics of SEO
Basics of SEO
 
Seo Manual
Seo ManualSeo Manual
Seo Manual
 
How to perform a technical SEO audit and ramp up your content strategy in 10 ...
How to perform a technical SEO audit and ramp up your content strategy in 10 ...How to perform a technical SEO audit and ramp up your content strategy in 10 ...
How to perform a technical SEO audit and ramp up your content strategy in 10 ...
 

More from Manish Chopra

AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdf
AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdfAWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdf
AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdfManish Chopra
 
Getting Started with ChatGPT.pdf
Getting Started with ChatGPT.pdfGetting Started with ChatGPT.pdf
Getting Started with ChatGPT.pdfManish Chopra
 
Grafana and AWS - Implementation and Usage
Grafana and AWS - Implementation and UsageGrafana and AWS - Implementation and Usage
Grafana and AWS - Implementation and UsageManish Chopra
 
Containers Auto Scaling on AWS.pdf
Containers Auto Scaling on AWS.pdfContainers Auto Scaling on AWS.pdf
Containers Auto Scaling on AWS.pdfManish Chopra
 
OpenKM Solution Document
OpenKM Solution DocumentOpenKM Solution Document
OpenKM Solution DocumentManish Chopra
 
Alfresco Content Services - Solution Document
Alfresco Content Services - Solution DocumentAlfresco Content Services - Solution Document
Alfresco Content Services - Solution DocumentManish Chopra
 
Jenkins Study Guide ToC
Jenkins Study Guide ToCJenkins Study Guide ToC
Jenkins Study Guide ToCManish Chopra
 
Ansible Study Guide ToC
Ansible Study Guide ToCAnsible Study Guide ToC
Ansible Study Guide ToCManish Chopra
 
Microservices with Dockers and Kubernetes
Microservices with Dockers and KubernetesMicroservices with Dockers and Kubernetes
Microservices with Dockers and KubernetesManish Chopra
 
Unix and Linux Operating Systems
Unix and Linux Operating SystemsUnix and Linux Operating Systems
Unix and Linux Operating SystemsManish Chopra
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive AnalyticsManish Chopra
 
Preparing a Dataset for Processing
Preparing a Dataset for ProcessingPreparing a Dataset for Processing
Preparing a Dataset for ProcessingManish Chopra
 
Organizations with largest hadoop clusters
Organizations with largest hadoop clustersOrganizations with largest hadoop clusters
Organizations with largest hadoop clustersManish Chopra
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File SystemsManish Chopra
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Manish Chopra
 
Oracle solaris 11 installation
Oracle solaris 11 installationOracle solaris 11 installation
Oracle solaris 11 installationManish Chopra
 
Big Data Analytics Course Guide TOC
Big Data Analytics Course Guide TOCBig Data Analytics Course Guide TOC
Big Data Analytics Course Guide TOCManish Chopra
 
Emergence and Importance of Cloud Computing for the Enterprise
Emergence and Importance of Cloud Computing for the EnterpriseEmergence and Importance of Cloud Computing for the Enterprise
Emergence and Importance of Cloud Computing for the EnterpriseManish Chopra
 
Setting up a HADOOP 2.2 cluster on CentOS 6
Setting up a HADOOP 2.2 cluster on CentOS 6Setting up a HADOOP 2.2 cluster on CentOS 6
Setting up a HADOOP 2.2 cluster on CentOS 6Manish Chopra
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigManish Chopra
 

More from Manish Chopra (20)

AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdf
AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdfAWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdf
AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdf
 
Getting Started with ChatGPT.pdf
Getting Started with ChatGPT.pdfGetting Started with ChatGPT.pdf
Getting Started with ChatGPT.pdf
 
Grafana and AWS - Implementation and Usage
Grafana and AWS - Implementation and UsageGrafana and AWS - Implementation and Usage
Grafana and AWS - Implementation and Usage
 
Containers Auto Scaling on AWS.pdf
Containers Auto Scaling on AWS.pdfContainers Auto Scaling on AWS.pdf
Containers Auto Scaling on AWS.pdf
 
OpenKM Solution Document
OpenKM Solution DocumentOpenKM Solution Document
OpenKM Solution Document
 
Alfresco Content Services - Solution Document
Alfresco Content Services - Solution DocumentAlfresco Content Services - Solution Document
Alfresco Content Services - Solution Document
 
Jenkins Study Guide ToC
Jenkins Study Guide ToCJenkins Study Guide ToC
Jenkins Study Guide ToC
 
Ansible Study Guide ToC
Ansible Study Guide ToCAnsible Study Guide ToC
Ansible Study Guide ToC
 
Microservices with Dockers and Kubernetes
Microservices with Dockers and KubernetesMicroservices with Dockers and Kubernetes
Microservices with Dockers and Kubernetes
 
Unix and Linux Operating Systems
Unix and Linux Operating SystemsUnix and Linux Operating Systems
Unix and Linux Operating Systems
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
Preparing a Dataset for Processing
Preparing a Dataset for ProcessingPreparing a Dataset for Processing
Preparing a Dataset for Processing
 
Organizations with largest hadoop clusters
Organizations with largest hadoop clustersOrganizations with largest hadoop clusters
Organizations with largest hadoop clusters
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3
 
Oracle solaris 11 installation
Oracle solaris 11 installationOracle solaris 11 installation
Oracle solaris 11 installation
 
Big Data Analytics Course Guide TOC
Big Data Analytics Course Guide TOCBig Data Analytics Course Guide TOC
Big Data Analytics Course Guide TOC
 
Emergence and Importance of Cloud Computing for the Enterprise
Emergence and Importance of Cloud Computing for the EnterpriseEmergence and Importance of Cloud Computing for the Enterprise
Emergence and Importance of Cloud Computing for the Enterprise
 
Setting up a HADOOP 2.2 cluster on CentOS 6
Setting up a HADOOP 2.2 cluster on CentOS 6Setting up a HADOOP 2.2 cluster on CentOS 6
Setting up a HADOOP 2.2 cluster on CentOS 6
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-Koenig
 

The Anatomy of GOOGLE Search Engine

  • 1. The Anatomy of GOOGLE Search Engine By Manish Chopra Corporate Trainer IIHT Ltd. Bangalore August, 2008
  • 2. Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations are performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

    1. Googlebot, a web crawler that finds and fetches web pages.
    2. The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
    3. The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

    Let’s take a closer look at each part.

    1. Googlebot, Google’s Web Crawler

    Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser: it sends a request to a web server for a web page, downloads the entire page, and then hands it off to Google’s indexer.

    Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.

    Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.
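The per-server throttling described on this slide can be pictured with a small sketch. It is only an illustration under stated assumptions: the class name `PoliteScheduler`, the two-second delay, and the example URLs are hypothetical and are not taken from Googlebot. The idea is that the crawler remembers, for each host, the earliest time at which another request is allowed, so that many hosts can be fetched in parallel while each single host is contacted slowly.

```python
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds: float = 2.0):
        # Illustrative value only; the real crawl delay is not given in the text.
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest permitted fetch time

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that a page on this URL's host was requested at time `now`."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay


scheduler = PoliteScheduler(min_delay_seconds=2.0)
scheduler.record_fetch("http://example.com/a", now=100.0)

# Same host, half a second later: the crawler should hold back.
same_host_wait = scheduler.wait_time("http://example.com/b", now=100.5)
# A different host is not affected by the delay on example.com.
other_host_wait = scheduler.wait_time("http://example.org/", now=100.5)

print(same_host_wait, other_host_wait)
```

Passing the clock value in as `now` rather than reading it inside the class keeps the sketch deterministic and easy to test.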
  • 3. Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers” and asks you to enter the letters you see, something like an eye-chart test to stop spambots.

    When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page on the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.

    Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.

    To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.

    2. Google’s Indexer

    Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.

    To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase, to improve Google’s performance.
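The index structure described above (each term mapped to the documents that contain it and the positions where it occurs) is commonly called an inverted index. The following is a minimal sketch of one, assuming a tiny in-memory collection. The function names, the sample pages, and the exact tokenizing rule are illustrative assumptions, the stop-word set contains only the examples named in the text, and the handling of single digits and single letters is left out.

```python
import re
from collections import defaultdict

# Only the example stop words listed in the text; a real list would be longer.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation and spaces."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each non-stop word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Positions are counted before stop words are dropped, so distances
        # between the remaining words still reflect the original text.
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[word][doc_id].append(position)
    return index


documents = {
    "page1": "The crawler fetches the page.",
    "page2": "How the indexer sorts every page",
}
index = build_index(documents)

# Print the index sorted alphabetically by term, as the text describes.
for term in sorted(index):
    print(term, dict(index[term]))
```

Looking up a query term is then a single dictionary access, which is what makes retrieval fast compared with scanning every stored page.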
  • 4. 3. Google’s Query Processor

    The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

    PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.

    Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page.

    Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

    Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page.

    Let’s see how Google processes a query.
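As a rough picture of the proximity and word-order preference described on this slide, the sketch below ranks documents from a small positional index. It is not Google’s formula, which the text says is closely guarded. The index contents, the function name `search`, and the scoring rule (smallest distance between the query terms first, with matches in query order winning ties) are assumptions made purely for illustration.

```python
from itertools import product

# Hypothetical positional index: term -> {doc_id: [word positions]}.
INDEX = {
    "web":     {"a": [0, 7], "b": [3], "c": [5]},
    "crawler": {"a": [1],    "b": [9], "c": [4]},
}


def search(query, index):
    """Return ids of documents containing every query term, best match first."""
    terms = query.lower().split()
    if not terms or any(term not in index for term in terms):
        return []

    # Keep only the documents in which all query terms appear.
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])

    scored = []
    for doc in docs:
        best = None
        # Try every combination of one position per term and keep the best one.
        for combo in product(*(index[term][doc] for term in terms)):
            span = max(combo) - min(combo)  # how close together the terms are
            in_order = all(x < y for x, y in zip(combo, combo[1:]))
            key = (span, not in_order)      # smaller is better
            if best is None or key < best:
                best = key
        scored.append((best, doc))
    return [doc for _, doc in sorted(scored)]


ranking = search("web crawler", INDEX)
print(ranking)  # ['a', 'c', 'b']
```

Document `a` comes first because its terms are adjacent and in query order, `c` is equally close but reversed, and `b` has the terms six words apart. Trying every combination of positions is fine for a toy index but would be too slow for long posting lists.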
  • 5. When Google responds with search results in a web browser, the results page is filled with information and links, most of which relate to the user’s query. Here is a snapshot of the results page that is displayed for a query on the Google search engine:

    The page contains the following components:

    • Google Logo: Click on the Google logo to go to Google’s home page.
    • Statistics Bar: Describes your search, including the number of results on the current results page, an estimate of the total number of results, and the time your search took. For the sake of efficiency, Google estimates the number of results.
  • 6. Below are descriptions of some search-result components. These components appear in fonts of different colors on the results page to make it easier to distinguish them from one another.

    o Page Title: (blue) The web page’s title, if the page has one, or its URL if the page has no title or if Google has not indexed all of the page’s content. Click on the page title to display the corresponding page.

    o Snippets: (black) Each search result usually includes one or more short excerpts of the text that matches your query, with your search terms in boldface type. These snippets, which appear in a black font, may provide you with
      • The information you are seeking
      • What you might find on the linked page
      • Ideas of terms to use in your subsequent searches
      When Google hasn’t crawled a page, it doesn’t include a snippet. A page might not be crawled because its publisher requested no crawling, or because the page was written in such a way that it was too difficult to crawl.

    o URL of Result: (green) Web address of the search result.

    o Size: (green) The size of the text portion of the web page. It is omitted for sites not yet indexed. In the screen shot, “38k” means that the text portion of the web page is 38 kilobytes. 1k of text is about 170 words, so a page containing 38k of text is about 6,460 words long. For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page and the first 120 kilobytes of a pdf file. Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a web page and the first 26 pages of a pdf file. If a page is larger, Google will list the page as being 101 kilobytes, or 120 kilobytes for a pdf file. This means that Google’s results won’t reference any part of a web page beyond its first 101 kilobytes or any part of a pdf file beyond the first 120 kilobytes.

    o Date: (green) The date tells you the freshness of Google’s copy of the page. Dates are included for pages that have recently had a fresh crawl.

    o Indented Result: When Google finds multiple results from the same website, it lists the most relevant result first, with the second most relevant page from that same site indented below it. Limiting the number of results from a given site to two ensures that pages from one site will not dominate your search results and that Google provides pages from a variety of sites.

    o More Results: When there are more than two results from the same site, access the remaining results from the “More results from…” link.

    When Google returns more than one page of results, you can view subsequent pages by clicking either a page number or one of the “o”s in the whimsical “Gooooogle” that appears below the last search result on the page.
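The two-results-per-site rule described under Indented Result and More Results can be sketched as follows. This is a hypothetical illustration, not Google’s code: the function name `crowd_results`, the dictionary keys, and the example URLs are made up, and it assumes the input list is already ordered from most to least relevant and that each site is placed according to its best result.

```python
from urllib.parse import urlparse


def crowd_results(ranked_urls, max_per_site=2):
    """Keep at most `max_per_site` results per site, indenting the extras."""
    # Group URLs by site; dict insertion order follows each site's best result.
    by_site = {}
    for url in ranked_urls:
        by_site.setdefault(urlparse(url).netloc, []).append(url)

    page = []
    for site, urls in by_site.items():
        for i, url in enumerate(urls[:max_per_site]):
            # Every result after the first from the same site is shown indented.
            page.append({"url": url, "indented": i > 0, "more_results": False})
        if len(urls) > max_per_site:
            # Hidden results remain: attach a "More results from..." link
            # to the last result shown for this site.
            page[-1]["more_results"] = True
    return page


ranked = [
    "http://a.example/1",
    "http://b.example/x",
    "http://a.example/2",
    "http://a.example/3",
]
page = crowd_results(ranked)
for entry in page:
    prefix = "    " if entry["indented"] else ""
    suffix = "  [More results from this site]" if entry["more_results"] else ""
    print(prefix + entry["url"] + suffix)
```

Here the third result from `a.example` is not listed, and its presence is signalled only by the link attached to the indented second result.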