How Google search engine algorithm works
Prepared by:- Viral Shah (120570107014)
Guided by :- Prof. Sahista Machhar, MEFGI
It is a program that
searches for and
identifies items in a
database that correspond
to keywords or characters
specified by the user, used
especially for finding
particular sites on the
World Wide Web.
There are 759 Million websites on the Web &
60 Trillion webpages across these websites.
AND IT’S CONSTANTLY GROWING !!!!!
GOOGLE navigates the WEB by CRAWLING
To find information on the
hundreds of millions of Web
pages that exist, a search
engine employs special
software robots, called
SPIDERS, to build lists of the
words found on Web sites.
When a spider is building its
lists, the process is called
Web crawling.
The usual starting points are lists of heavily
used servers and very popular pages. The
spider will begin with a popular site, indexing
the words on its pages and following every
link found within the site. In this way, the
spidering system quickly begins to travel,
spreading out across the most widely used
portions of the Web.
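The spidering described above is essentially a breadth-first traversal of the link graph. A minimal sketch in Python, assuming a toy `fetch_links` function standing in for real HTTP fetching (the function names and the toy web are hypothetical, for illustration only):

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: start from popular seed pages and
    follow every link found, skipping pages already visited."""
    queue = deque(seed_urls)
    visited = set()
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)          # a real spider would index this page here
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return order

# Toy web: each page maps to the links it contains.
toy_web = {
    "popular.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["popular.example"],
    "c.example": [],
}
print(crawl(["popular.example"], lambda u: toy_web.get(u, [])))
```

Starting from the popular seed page, the crawl spreads outward level by level, just as the slide describes.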
When the Google spider looked at an HTML page, it took note of
the words within the page and where they were found. Words occurring
in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent
user search. The Google spider was built to index every significant
word on a page, leaving out the articles "a", "an" and "the". Other
spiders take different approaches.
For example, some spiders will keep track of the words in the title,
sub-headings and links, along with the 100 most frequently used
words on the page and each word in the first 20 lines of text. Lycos is
said to use this approach to spidering the Web.
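The indexing behaviour described above — skipping the articles and giving words in prominent positions extra weight — can be sketched as a simple inverted index. The title weight of 3 is an assumed value for illustration, not Google's:

```python
STOP_WORDS = {"a", "an", "the"}   # the articles the Google spider left out

def index_page(url, title, body, index):
    """Add one page to an inverted index, skipping the articles and
    giving title words extra weight (an assumed weighting of 3 vs. 1)."""
    for weight, text in ((3, title), (1, body)):
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            index.setdefault(word, {})
            index[word][url] = index[word].get(url, 0) + weight

index = {}
index_page("x.example", "The Search Engine",
           "a search engine crawls the web", index)
# "search" appears in the title (weight 3) and body (weight 1) → score 4
```

A query for "search" would then look up `index["search"]` and find every page containing the word, with higher scores for pages where it appears in the title.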
GOOGLE built their initial system to use multiple spiders, usually three
at one time. Each spider could keep about 300 connections to Web
pages open at a time.
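The multiple-spider setup with a cap on open connections can be sketched with thread pools and a semaphore. The pool sizes and the placeholder `fetch` are assumptions for illustration; a real spider would open actual HTTP connections:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 300   # per-spider connection cap described above

connection_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

def fetch(url):
    """Placeholder fetch: a real spider would open an HTTP connection here."""
    with connection_slots:          # at most 300 pages open at once
        return f"contents of {url}"

# Three "spiders", each a thread pool, as in Google's initial setup.
spiders = [ThreadPoolExecutor(max_workers=10) for _ in range(3)]
urls = [f"page-{n}.example" for n in range(30)]
futures = [spiders[n % 3].submit(fetch, url) for n, url in enumerate(urls)]
pages = [f.result() for f in futures]
```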
Google’s spider is named Googlebot.
Googlebot is the search bot software used
by Google, which collects documents from
the web to build a searchable index for
the Google Search engine.
By following the web-pages, the INDEX is
prepared. The index includes text from
millions of books from several libraries.
That means GOOGLE follows links from page
to page. It also sorts pages by their content
and other factors.
All these activities Google carries out are tracked
in the INDEX. Google continuously updates the
index, and it is stored across large servers.
Currently, Google’s Index size is over 100
million gigabytes.
Site owners choose whether their sites are
included in Google's index.
To prevent most search engine web
crawlers from indexing a page on your site, place
the following meta tag into the <head> section of
your page:
<meta name="robots" content="noindex">
To prevent only Google web crawlers from
indexing a page:
<meta name="googlebot" content="noindex">
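A crawler that honours these tags has to parse them out of the page's head before indexing. A minimal sketch using Python's standard-library HTML parser (the class name is hypothetical):

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Detects a robots/googlebot noindex meta tag -- the signal a
    well-behaved crawler checks before indexing a page."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "meta"
                and (a.get("name") or "").lower() in ("robots", "googlebot")
                and "noindex" in (a.get("content") or "").lower()):
            self.noindex = True

page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = NoindexChecker()
checker.feed(page)
print(checker.noindex)  # True: a crawler honouring the tag would skip this page
```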
1) AUTOCOMPLETE
Predicts what you might be searching for.
This includes understanding terms with more
than one meaning.
2) SYNONYMS
Recognizes words with similar meanings.
3) QUERY UNDERSTANDING
Gets to the deeper meaning of the words you type.
4) GOOGLE INSTANT
Displays immediate results as you type.
5) SPELLING
Identifies and corrects possible spelling
errors and provides alternatives.
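Autocomplete and spelling correction as described above can be sketched with a prefix match and standard-library fuzzy matching; the tiny vocabulary is assumed for illustration:

```python
import difflib

VOCAB = ["google", "googlebot", "algorithm", "search", "spider"]

def autocomplete(prefix, vocab=VOCAB):
    """Google Instant-style completion: suggest terms starting with the prefix."""
    return [w for w in vocab if w.startswith(prefix.lower())]

def correct_spelling(word, vocab=VOCAB):
    """Suggest the closest known term for a possible misspelling."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1)
    return matches[0] if matches else word

print(autocomplete("goo"))          # ['google', 'googlebot']
print(correct_spelling("serach"))   # 'search'
```

Real query understanding goes far beyond this, but the same two primitives — completing partial input and mapping near-misses to known terms — sit at its base.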
Based on all the above factors, Google picks
some web-pages from the index.
Then, Google ranks the results on various
factors:
1) Site & Page Quality:-
It is checked by how the content is written.
2) Freshness:-
How fresh the content is & at how regular
an interval it is updated !!
3) SafeSearch:-
Google tries to find out how safe the page is
and that it doesn't contain spam.
Along with these, there are 200+ factors used
by Google to rank any particular web-page.
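A toy version of such a ranking function, combining a few of the signals above with assumed weights (Google's real 200+ factors and their weights are not public):

```python
def rank_score(page):
    """Toy ranking: combine quality, freshness and safety signals
    with assumed weights -- purely illustrative, not Google's formula."""
    freshness = 1.0 / (1.0 + page["days_since_update"])
    quality = page["quality"]            # assumed 0..1 editorial-quality score
    safety = 0.0 if page["has_spam"] else 1.0
    return 0.4 * quality + 0.3 * freshness + 0.3 * safety

pages = [
    {"url": "fresh.example",  "days_since_update": 1,   "quality": 0.9, "has_spam": False},
    {"url": "stale.example",  "days_since_update": 400, "quality": 0.9, "has_spam": False},
    {"url": "spammy.example", "days_since_update": 1,   "quality": 0.9, "has_spam": True},
]
ranked = sorted(pages, key=rank_score, reverse=True)
print([p["url"] for p in ranked])  # fresh first, spammy last
```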
After all these operations, you get the
desired result, and all of this happens in
about one-eighth of a second.
Google fights spam every second to give
true & relevant results.
The majority of spam removal is
automatic. Google examines other
questionable documents by hand. If Google
finds spam, it takes manual action.
1) PURE SPAM
Site appears to use aggressive spam
techniques such as automatically generated
gibberish, cloaking, scraping content from
other websites, and/or repeated or egregious
violations of Google's Webmaster Guidelines.
2) HIDDEN TEXT AND/OR KEYWORD STUFFING
Some of the pages may contain hidden
text and/or keyword stuffing.
3) USER-GENERATED SPAM
Site appears to contain spammy user-generated
content. The problematic content
may appear on forum pages, guestbook pages,
or user profiles.
4) PARKED DOMAINS
Parked domains are placeholder sites with little
unique content, so Google doesn't typically
include them in search results.
5) THIN CONTENT WITH LITTLE OR
NO ADDED VALUE
Site appears to consist of low-quality or shallow pages
which do not provide users with much added value
(such as thin affiliate pages, doorway pages, cookie-cutter
sites, automatically generated content, or copied content).
6) UNNATURAL LINKS TO A SITE
Google has detected a pattern of unnatural, artificial,
deceptive or manipulative links pointing to the site.
These may be the result of buying links that pass
PageRank or participating in link schemes.
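A crude version of the keyword-stuffing check from category 2 can be sketched as a keyword-density test; the threshold is an assumed value and far simpler than Google's real detection:

```python
from collections import Counter

def keyword_density(text):
    """Fraction of the page taken up by its single most repeated word --
    an assumed, simplified signal for the keyword stuffing described above."""
    words = text.lower().split()
    if not words:
        return 0.0
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words)

STUFFING_THRESHOLD = 0.3   # assumed cutoff for this sketch

normal = "our shop sells handmade leather shoes and boots in many sizes"
stuffed = "cheap shoes cheap shoes cheap shoes buy cheap shoes cheap shoes"
print(keyword_density(normal) > STUFFING_THRESHOLD)    # False
print(keyword_density(stuffed) > STUFFING_THRESHOLD)   # True
```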
Besides all these, there are thousands of other
factors Google uses to detect spam and
decide the page-rank of a web-page
accordingly, which is constantly updated;
finally, Google only keeps trusted documents
in its index.
And the point of interest is that to make this
presentation on Google, I used GOOGLE itself !!
Behind your simple page of results is a
complex system, carefully crafted and
tested, to support more than one-hundred
billion searches each month !!!!