CRAWLER, INDEX, RANKING
A Web crawler, sometimes called a spider or spiderbot and often
shortened to crawler, is an Internet bot that systematically browses
the World Wide Web and that is typically operated by search engines for
the purpose of Web indexing (web spidering).
Web search engines and some other websites use Web crawling or spidering
software to update their web content or indices of other sites' web content.
Web crawlers copy pages for processing by a search engine, which
indexes the downloaded pages so that users can search more efficiently.
Web crawlers access sites via the internet and gather information
about each page, including titles, images, keywords, and links
within the page. This data is used by search engines to build an
index of web pages, allowing the engine to return faster and
more accurate search results for users. Web crawlers may also be
used to scrape or pull content from websites, monitor changes
on web pages, test websites, and mine them for data.
How do web crawlers work?
Web crawlers start by crawling a set of known pages and following hyperlinks to new
pages. Before crawling a site, web crawlers review the site’s robots.txt file, which outlines
the rules the website owner has established for bots about which pages can be crawled
and which links can be followed. Because crawlers can’t index every page on the internet, they
follow certain rules to prioritize some pages over others. Crawlers may be instructed to
give more weight to pages that have more external links to other pages, to sites with a
higher number of page views, and to sites that have greater brand authority. Search engines
assume that pages with lots of visitors and links are more likely to offer authoritative
information and high-quality content that users are looking for. Crawlers also use
algorithms to rate the value of content or the quality of links on the page.
As web crawlers explore websites, they copy each site’s meta tags, which provide metadata
information about the site and the keywords on it. This data helps search engines determine
how a page will show up in search results.
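To make the crawl loop above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under simplifying assumptions, not any real engine’s crawler: the seed URL, user-agent name, page limit, and politeness delay are all invented.

```python
# Minimal breadth-first crawler sketch (illustrative only, not a production crawler).
import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, agent="ExampleBot"):
    """Check the site's robots.txt before crawling, as described above."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # if robots.txt is unreachable, be permissive in this sketch
    return rp.can_fetch(agent, url)

def crawl(seed_urls, max_pages=20, delay=1.0):
    frontier = deque(seed_urls)   # known pages still to visit
    seen = set(seed_urls)
    pages = {}                    # url -> raw HTML, handed off to the indexer later
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)         # simple politeness delay between requests
    return pages

# pages = crawl(["https://example.com/"])
```

Real crawlers layer prioritization on top of this frontier (favoring pages with more links, traffic, or authority, as noted above), but the fetch, parse, and enqueue loop is the same.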
What are types of web crawlers?
There are four basic types of web crawlers.
● Focused web crawlers search, index, and download web content concerning
specific topics. Rather than exploring every hyperlink on a page as a standard web
crawler would, a focused web crawler only follows links perceived to be relevant
(a small link-filter sketch follows this list).
● Incremental crawlers revisit websites to refresh an index and update URLs.
● Parallel crawlers run multiple crawling processes at the same time to maximize
the download rate.
● Distributed crawlers use multiple crawlers to simultaneously index different
sites.
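As a small illustration of how a focused crawler (see the first item above) differs from a standard one, the sketch below filters links before they are enqueued. The topic keywords are invented; real focused crawlers typically use trained relevance classifiers rather than a keyword list.

```python
# Focused-crawler link filter sketch: only follow links that look relevant to a topic.
# The topic keywords are illustrative placeholders.
TOPIC_KEYWORDS = {"climate", "emissions", "carbon"}

def is_relevant(url, anchor_text):
    """Crude relevance test on the link itself, before the page is ever fetched."""
    text = (url + " " + anchor_text).lower()
    return any(keyword in text for keyword in TOPIC_KEYWORDS)

links = [
    ("https://example.com/carbon-report", "Carbon emissions report"),
    ("https://example.com/sports", "Football scores"),
]
to_crawl = [url for url, anchor in links if is_relevant(url, anchor)]
# Only the first link survives the filter; a standard crawler would follow both.
```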
Examples of web crawlers
● Googlebot, the crawler for Google’s search engine
● Bingbot, Microsoft’s search engine crawler
● Amazonbot, the Amazon web crawler
● DuckDuckBot, the crawler for the search engine DuckDuckGo
● YandexBot, the crawler for the Yandex search engine
● Baiduspider, the web crawler for the Chinese search engine Baidu
● Slurp, the web crawler for Yahoo
● Coupon apps, like Honey
Indexing:
The process of organizing or categorizing web pages into a database
or index based on keywords, content relevance, and popularity.
Search engines index various aspects of web pages, including text content,
meta tags, images, links and more. They also consider factors like page
load speed and mobile-friendliness.
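A common data structure behind this organization is an inverted index, which maps each keyword to the pages that contain it. Below is a toy sketch with invented documents; production indexes also store positions, weights, and the other signals mentioned above.

```python
# Toy inverted index: keyword -> set of page URLs containing it.
from collections import defaultdict

pages = {  # invented example documents
    "https://example.com/a": "fast web crawler indexes pages",
    "https://example.com/b": "search engines rank indexed pages",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Lookup at query time is then a set intersection over the query's words.
query = ["indexes", "pages"]
results = set.intersection(*(index.get(w, set()) for w in query))
print(results)  # {'https://example.com/a'}
```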
The time it takes for a web page to be indexed can vary. High-quality,
frequently updated websites may be indexed quickly, while less popular
or static sites may take longer. It can range from a few hours to several
weeks.
You can use a robots.txt file to instruct web crawlers not to crawl specific
pages or directories on your website. You can also use meta tags like
"noindex" to prevent individual pages from appearing in search results.
In short, if you want users to find your website on Google or Bing, it needs to be
indexed: information about the page should be added to the search engine
database.
Indexing and crawling are two separate processes. Crawling refers to
discovering content, while indexing refers to storing that content. If your
page has been crawled, that does not necessarily mean it has been indexed.
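One way a page ends up crawled but not indexed is the "noindex" meta tag mentioned above. A hedged sketch of how an indexer might honor it (the HTML snippet is a made-up example, and real engines also check the X-Robots-Tag HTTP header):

```python
# Check for <meta name="robots" content="noindex"> before indexing a fetched page.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Sets self.noindex if the page carries a robots meta tag containing 'noindex'."""
    def __init__(self):
        super().__init__()
        self.noindex = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            if "noindex" in attrs.get("content", "").lower():
                self.noindex = True

html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.noindex)  # True -> the page is crawled but left out of the index
```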
Benefits of web indexing:
● Better visibility: Fast indexing and ranking can increase website visibility and generate
more online traffic.
● Improved SEO: Proper indexing can help improve website SEO and search engine
rankings.
● Increased user engagement: Good navigation and clear access to content can increase
user engagement and satisfaction on a site.
HOW INDEXING IS DONE:
During indexing, Google determines if the page showing in search is a copy or
the original (the canonical). It begins this evaluation by organizing similar pages
into groups. It then assigns canonical status to the most representative one.
The rest are considered alternative versions and used in other situations,
including mobile search results or specific queries. Google also notes details about
the canonical page, such as language, location, and user-friendliness. This
information helps Google decide which pages to show in search
results. Google only adds pages to the index if they contain quality content.
Pages engaging in shady activity like keyword stuffing or link building with low-
quality or spammy domains will be flagged or ignored.
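The grouping step can be pictured as clustering pages whose normalized content is (nearly) identical and keeping one representative per group. The hash-based sketch below is a deliberate simplification; Google's actual duplicate-detection signals are not public, and the URLs and text are invented.

```python
# Simplified canonicalization: group pages by a hash of their normalized text
# and pick one representative (the "canonical") per group. Illustrative only.
import hashlib
from collections import defaultdict

pages = {  # invented example: two near-duplicates and one distinct page
    "https://example.com/shoes":   "Red running shoes, size 42.",
    "https://m.example.com/shoes": "red running shoes,  size 42. ",
    "https://example.com/contact": "Contact us via email.",
}

def fingerprint(text):
    normalized = " ".join(text.lower().split())   # collapse case and whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()

groups = defaultdict(list)
for url, text in pages.items():
    groups[fingerprint(text)].append(url)

# Pick the shortest URL in each group as the canonical; the rest are alternates.
canonicals = {min(urls, key=len): urls for urls in groups.values()}
print(canonicals)
```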
RANKING:
Search engine rank is the position at which a website or web page appears in
the organic search results for a query.
● Search engine ranks are determined mathematically by weighted
algorithms (a toy scoring sketch follows this list).
● Ranks are dynamic and are likely to change depending upon the specific
search query and the query source.
● It’s important to monitor rankings and adapt SEO strategies accordingly to
maintain visibility in search results.
● Factors like content depth, uniqueness, readability, and backlinks
from authoritative sites can influence search engine ranks for new
content.
● Technical SEO factors such as page load speed, mobile-friendliness,
and navigation can also influence search engine ranks.
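As a toy illustration of the weighted scoring mentioned in the first bullet above, the sketch below combines a few made-up factors with made-up weights; it does not reflect any real engine's algorithm.

```python
# Toy weighted ranking score: combine per-page factors with fixed weights.
# Factor names and weights are invented; real ranking algorithms are far more complex.
WEIGHTS = {
    "content_relevance": 0.45,
    "backlink_authority": 0.30,
    "page_speed": 0.15,
    "mobile_friendly": 0.10,
}

def rank_score(factors):
    """Each factor is assumed to be pre-normalized to the range 0..1."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

pages = {
    "https://example.com/guide": {"content_relevance": 0.9, "backlink_authority": 0.7,
                                  "page_speed": 0.8, "mobile_friendly": 1.0},
    "https://example.com/stub":  {"content_relevance": 0.4, "backlink_authority": 0.2,
                                  "page_speed": 0.9, "mobile_friendly": 1.0},
}

ranked = sorted(pages, key=lambda url: rank_score(pages[url]), reverse=True)
print(ranked)  # the guide page outranks the stub
```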
Searching
Now we know about the three-step process search engines
use to return relevant results. Crawling, indexing, and ranking
allow search engines to find and organize information. But
how does that help them answer your search query?
Let’s walk through how search engines answer queries step by
step, from the moment you type a term into the search bar.
Step 1: Search Engines Parse Intent
To return relevant results, search engines have to “understand” the search
intent behind a term. They use sophisticated language models to do that,
breaking down your query into chunks of keywords and parsing their
meaning.
For illustration, Google’s handling of synonyms allows the search engine to
recognize when groups of words mean the same thing. So when you type in
“dark colored dresses,” search engines will return results for black dresses as
well as other dark tones: the engine understands that “dark” frequently
describes black and similar shades.
Search engines also use keywords to understand broad “categories” of search
intent. In the “dark colored dress” illustration, the term “buy” would
signal to search engines that they should pull up product pages to match a
shopping searcher’s intent.
Search engines also use “freshness” algorithms to understand search
intent. These algorithms identify trending keywords and return newer
pages.
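A toy sketch of the synonym-expansion idea described above: the dictionary is a hand-written stand-in, since real synonym handling is learned from data rather than listed by hand.

```python
# Toy query expansion: map query terms to known synonyms before matching pages.
# The synonym table is a hand-written stand-in for a learned model.
SYNONYMS = {
    "dark": {"black"},
    "dress": {"dresses", "gown"},
    "buy": {"purchase", "shop"},   # "buy" also signals shopping intent
}

def expand(query):
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.update(SYNONYMS.get(word, set()))
    return terms

print(expand("dark colored dress"))
# {'dark', 'black', 'colored', 'dress', 'dresses', 'gown'}
```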
Step 2: Search Engines Match Pages to Query Intent
Once the search engine understands what kind of result you want to see, it needs to find
matching pages. A series of factors help the search engine decide which pages are best,
including:
❖ Title and content relevance
❖ Type of content
❖ Content quality
❖ Site quality and freshness
❖ Page popularity
❖ Language of the query
Step 3: Search Engines Apply “Localized” Factors
A number of individual factors come into play when search engines decide
which results you see.
Location: Some searches, like “cafés near me,” are obviously location-
dependent. But Google will also weigh local factors even in non-
location-specific searches.
Search settings: Search settings are also an important indicator of which results you’re
likely to find useful, such as whether you have set a preferred language or opted into
SafeSearch (a tool that helps filter out explicit results).
Search history: A user’s search history also influences the results they see.
Reverse Query Processing (RQP)
Reverse Query Processing refers to the process of determining the
original query or set of queries that produced a specific result set in a
database or information retrieval system. Unlike traditional query
processing, which starts with a query to retrieve data, reverse query
processing seeks to identify how a particular output was generated,
often by analyzing stored query logs, result patterns, and underlying
data structures. The goal is to trace back from results to queries, providing
insight into how specific data was obtained.
More formally, given a query Q and a table R, the goal is to find a database
D (a set of tables) such that Q(D) = R. We call this problem reverse query
processing, or RQP for short.
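A minimal illustration of Q(D) = R using SQLite: for a simple selection query Q and a desired result R, one possible database instance D is constructed and then verified. The schema and values are invented, and real RQP systems handle far richer SQL.

```python
# Toy reverse query processing: given Q and a desired result R, build a database D
# such that Q(D) = R, then verify it. Schema and values are invented for illustration.
import sqlite3

Q = "SELECT name FROM users WHERE age > 30 ORDER BY name"
R = [("alice",), ("bob",)]          # desired result set

# One (of many) database instances satisfying Q(D) = R:
# give every name in R an age above 30, plus a row the WHERE clause filters out.
D_rows = [("alice", 35), ("bob", 42), ("carol", 25)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", D_rows)

assert conn.execute(Q).fetchall() == R   # Q(D) = R holds for this instance
print("generated instance satisfies Q(D) = R")
```

Note that many other instances would also satisfy Q(D) = R, which is exactly the ambiguity listed under Challenges below.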
Reverse query processing has several applications.
First, it can be used to test databases.
Second, RQP can be helpful for debugging a database application because it enables
programmers to infer which states a program with embedded SQL can
reach.
Applications
Search Engines: Understanding user queries based on clicked results
to enhance search algorithms.
Data Warehousing: Retrieving the original ETL processes or queries
that led to specific data views.
Social Media Analytics: Analyzing user engagement data to reverse-
engineer popular content or queries.
Challenges
● Ambiguity: Multiple queries can produce the same result set,
complicating the reverse processing.
● Scalability: Managing large volumes of query-result pairs
efficiently.
● Dynamic Data: Changes in the underlying data can affect the
accuracy of reverse query results.
Types of Search Engine
CRAWLER based search engines
DIRECTORIES based search engines
HYBRID search engines
META search engines
SPECIALIZED search engines
Crawler Based Search Engines
Crawler-based search engines are the ones most of us are familiar with:
Google, Bing, etc.
They are called crawler-based because their software crawls the
web like a spider, automatically updating its search index and adding new
pages as it goes.
Crawler-based search engines are good when you have a specific
search topic in mind and can be very efficient at finding relevant
information in that situation. However, when the search topic is
general, crawler-based search engines may return hundreds of
thousands of irrelevant responses to simple search requests.
The three major components of a crawler-based search engine are:
1. The Crawler (Spider)
The crawler/spider visits a web page, reads it, and then follows links to other pages within
the site. The spider will return to the site on a regular basis, such as every month or every
fifteen days, to look for changes.
2. The Index
Everything the spider finds goes into the second part of the search engine, the index. The
index will contain a copy of every web page that the spider finds. If a web page changes,
then the index is updated with new information.
3. The Search Engine Software
This is the software program that accepts the user-entered query, interprets it, sifts
through the millions of pages recorded in the index to find matches, ranks them in
order of what it believes is most relevant, and presents them to the user in a
customizable manner.
How Does a Crawler-Based Search Engine Work?
Crawler-based search engines are constantly searching the Internet
for new web pages and updating their database of information with
these new or altered pages.
Examples of crawler-based search engines include Google and Bing.
Directory-Based Search Engines / Human-Powered Directories
A ‘directory’ uses human editors who decide what category the site
belongs to; they place websites within specific categories in the
directories database. The human editors comprehensively check the
website and rank it, based on the information they find, using a pre-
defined set of rules.
There are two major directories at the time of writing:
Yahoo Directory (www.yahoo.com)
Open Directory (www.dmoz.org)
Note: Since late 2002 Yahoo has provided search results using crawler-
based technology as well as its own directory.
DMOZ Directory
Yahoo! Directory
Hybrid Search Engines
Hybrid search engines use a combination of both crawler-based results
and directory-based results. More and more search engines these days
are moving to a hybrid-based model.
Examples of hybrid search engines are:
Meta Search Engines
Meta search engines take the results from
all the other search engines and
combine them into one large listing.
Examples of Meta search engines include:
Dogpile, Metacrawler, Mamma etc.
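A toy sketch of the merging step: ranked lists from several engines are deduplicated by URL and ordered by a simple aggregate of their positions (summed reciprocal rank here). The engine names and URLs are invented, and real meta engines use more elaborate rank-fusion schemes.

```python
# Toy meta-search merge: combine ranked result lists from several engines into one.
from collections import defaultdict

results_by_engine = {  # invented engines and results
    "engineA": ["https://x.com", "https://y.com", "https://z.com"],
    "engineB": ["https://x.com", "https://w.com", "https://y.com"],
}

scores = defaultdict(float)
for ranking in results_by_engine.values():
    for position, url in enumerate(ranking, start=1):
        scores[url] += 1.0 / position        # higher positions contribute more

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)
# ['https://x.com', 'https://y.com', 'https://w.com', 'https://z.com']
```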
Meta-Search Engine Architecture
Specialized Search Engines
Specialized search engines have been developed to cater to the demands of niche
(specific) areas. There are hundreds of specialized search engines, including:
Images (PicSearch.com)
Shopping (shopping.yahoo.com)
Flights / Travel (SkyScanner.net)
Blogs (BlogPulse.com)
People (Pipl.com)
Forums (BoardReader.com)
Music (SongBoxx.com)
Audio & Video (PodScope.com, Blinkx.com)
Resources (FileDigg.com [.ppt and .pdf])
Private Search (DuckDuckGo.com)