CRAWLER, INDEX, RANKING
A Web crawler, sometimes called a spider or spiderbot and often
shortened to crawler, is an Internet bot that systematically browses
the World Wide Web and that is typically operated by search engines for
the purpose of Web indexing (web spidering).
Web search engines and some other websites use Web crawling or spidering
software to update their web content or indices of other sites' web content.
Web crawlers copy pages for processing by a search engine, which
indexes the downloaded pages so that users can search more efficiently.
Web crawlers access sites via the internet and gather information
about each page, including titles, images, keywords, and links
within the page. This data is used by search engines to build an
index of web pages, allowing the engine to return faster and
more accurate search results for users. Web crawlers may also be
used to scrape or pull content from websites, monitor changes
on web pages, test websites, and mine them for data.
How do web crawlers work?
Web crawlers start by crawling a set of known pages and following hyperlinks to new
pages. Before crawling a site, web crawlers review the site’s robots.txt file, which outlines
the rules the website owner has established for bots about which pages can be crawled
and which links can be followed. Because crawlers can’t index every page on the internet, they
follow certain rules to prioritize some pages over others. Crawlers may be instructed to
give more weight to pages that have more external links to other pages, to sites with a
higher number of page views, and to sites that have greater brand authority. Search engines
assume that pages with lots of visitors and links are more likely to offer authoritative
information and high-quality content that users are looking for. Crawlers also use
algorithms to rate the value of content or the quality of links on the page.
As web crawlers explore websites, they copy each site’s meta tags, which provide metadata
information about the site and the keywords on it. This data helps search engines determine
how a page will show up in search results.
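To make the crawl loop above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under simplifying assumptions, not any real engine’s crawler: the seed URL, user-agent name, page limit, and politeness delay are all invented.

```python
# Minimal breadth-first crawler sketch (illustrative only, not a production crawler).
import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, agent="ExampleBot"):
    """Check the site's robots.txt before crawling, as described above."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # if robots.txt is unreachable, be permissive in this sketch
    return rp.can_fetch(agent, url)

def crawl(seed_urls, max_pages=20, delay=1.0):
    frontier = deque(seed_urls)   # known pages still to visit
    seen = set(seed_urls)
    pages = {}                    # url -> raw HTML, handed off to the indexer later
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)         # simple politeness delay between requests
    return pages

# pages = crawl(["https://example.com/"])
```

Real crawlers layer prioritization on top of this frontier (favoring pages with more links, traffic, or authority, as noted above), but the fetch, parse, and enqueue loop is the same.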
What are types of web crawlers?
There are four basic types of web crawlers.
● Focused web crawlers search, index, and download web content concerning
specific topics. Rather than exploring every hyperlink on a page as a standard web
crawler would, a focused web crawler only follows links perceived to be relevant
(a small link-filter sketch follows this list).
● Incremental crawlers revisit websites to refresh an index and update URLs.
● Parallel crawlers run multiple crawling processes at the same time to maximize
the download rate.
● Distributed crawlers use multiple crawlers to simultaneously index different
sites.
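As a small illustration of how a focused crawler (see the first item above) differs from a standard one, the sketch below filters links before they are enqueued. The topic keywords are invented; real focused crawlers typically use trained relevance classifiers rather than a keyword list.

```python
# Focused-crawler link filter sketch: only follow links that look relevant to a topic.
# The topic keywords are illustrative placeholders.
TOPIC_KEYWORDS = {"climate", "emissions", "carbon"}

def is_relevant(url, anchor_text):
    """Crude relevance test on the link itself, before the page is ever fetched."""
    text = (url + " " + anchor_text).lower()
    return any(keyword in text for keyword in TOPIC_KEYWORDS)

links = [
    ("https://example.com/carbon-report", "Carbon emissions report"),
    ("https://example.com/sports", "Football scores"),
]
to_crawl = [url for url, anchor in links if is_relevant(url, anchor)]
# Only the first link survives the filter; a standard crawler would follow both.
```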
Examples of web crawlers
● Googlebot, the crawler for Google’s search engine
● Bingbot, Microsoft’s search engine crawler
● Amazonbot, the Amazon web crawler
● DuckDuckBot, the crawler for the search engine DuckDuckGo
● YandexBot, the crawler for the Yandex search engine
● Baiduspider, the web crawler for the Chinese search engine Baidu
● Slurp, the web crawler for Yahoo
● Coupon apps, like Honey
Indexing:
The process of organizing or categorizing web pages into a database
or index based on keywords, content relevance, and popularity.
Search engines index various aspects of web pages, including text content,
meta tags, images, links and more. They also consider factors like page
load speed and mobile-friendliness.
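A common data structure behind this organization is an inverted index, which maps each keyword to the pages that contain it. Below is a toy sketch with invented documents; production indexes also store positions, weights, and the other signals mentioned above.

```python
# Toy inverted index: keyword -> set of page URLs containing it.
from collections import defaultdict

pages = {  # invented example documents
    "https://example.com/a": "fast web crawler indexes pages",
    "https://example.com/b": "search engines rank indexed pages",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Lookup at query time is then a set intersection over the query's words.
query = ["indexes", "pages"]
results = set.intersection(*(index.get(w, set()) for w in query))
print(results)  # {'https://example.com/a'}
```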
The time it takes for a web page to be indexed can vary. High-quality,
frequently updated websites may be indexed quickly, while less popular
or static sites may take longer. It can range from a few hours to several
weeks.
You can use a robots.txt file to instruct web crawlers not to crawl specific
pages or directories on your website. You can also use meta tags like
"noindex" to prevent individual pages from appearing in search results.
In short, if you want users to find your website on Google or Bing, it needs to be
indexed: information about the page should be added to the search engine
database.
Indexing and crawling are two separate processes. Crawling refers to
discovering content, while indexing refers to storing that content. If your
page has been crawled, that does not necessarily mean it has been indexed.
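One way a page ends up crawled but not indexed is the "noindex" meta tag mentioned above. A hedged sketch of how an indexer might honor it (the HTML snippet is a made-up example, and real engines also check the X-Robots-Tag HTTP header):

```python
# Check for <meta name="robots" content="noindex"> before indexing a fetched page.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Sets self.noindex if the page carries a robots meta tag containing 'noindex'."""
    def __init__(self):
        super().__init__()
        self.noindex = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            if "noindex" in attrs.get("content", "").lower():
                self.noindex = True

html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.noindex)  # True -> the page is crawled but left out of the index
```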
Benefits of web indexing:
● Better visibility: Fast indexing and ranking can increase website visibility and generate
more online traffic.
● Improved SEO: Proper indexing can help improve website SEO and search engine
rankings.
● Increased user engagement: Good navigation and clear access to content can increase
user engagement and satisfaction on a site.
HOW INDEXING IS DONE:
During indexing, Google determines if the page showing in search is a copy or
the original (the canonical). It begins this evaluation by organizing similar pages
into groups. It then assigns canonical status to the most representative one.
The rest are considered alternative versions and used in other situations,
including mobile search results or specific queries. Google also notes details about
the canonical page, such as language, location, and user-friendliness. This
information helps Google decide which pages to show in search
results. Google only adds pages to the index if they contain quality content.
Pages engaging in shady activity like keyword stuffing or link building with low-
quality or spammy domains will be flagged or ignored.
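The grouping step can be pictured as clustering pages whose normalized content is (nearly) identical and keeping one representative per group. The hash-based sketch below is a deliberate simplification; Google's actual duplicate-detection signals are not public, and the URLs and text are invented.

```python
# Simplified canonicalization: group pages by a hash of their normalized text
# and pick one representative (the "canonical") per group. Illustrative only.
import hashlib
from collections import defaultdict

pages = {  # invented example: two near-duplicates and one distinct page
    "https://example.com/shoes":   "Red running shoes, size 42.",
    "https://m.example.com/shoes": "red running shoes,  size 42. ",
    "https://example.com/contact": "Contact us via email.",
}

def fingerprint(text):
    normalized = " ".join(text.lower().split())   # collapse case and whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()

groups = defaultdict(list)
for url, text in pages.items():
    groups[fingerprint(text)].append(url)

# Pick the shortest URL in each group as the canonical; the rest are alternates.
canonicals = {min(urls, key=len): urls for urls in groups.values()}
print(canonicals)
```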
RANKING:
Search engine rank is the position at which a website or web page appears in
the organic search results for a query.
● Search engine ranks are determined mathematically by weighted
algorithms (a toy scoring sketch follows this list).
● Ranks are dynamic and are likely to change depending upon the specific
search query and the query source.
● It’s important to monitor rankings and adapt SEO strategies accordingly to
maintain visibility in search results.
● Factors like content depth, uniqueness, readability, and backlinks
from authoritative sites can influence search engine ranks for new
content.
● Technical SEO factors such as page load speed, mobile-friendliness,
and navigation can also influence search engine ranks.
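As a toy illustration of the weighted scoring mentioned in the first bullet above, the sketch below combines a few made-up factors with made-up weights; it does not reflect any real engine's algorithm.

```python
# Toy weighted ranking score: combine per-page factors with fixed weights.
# Factor names and weights are invented; real ranking algorithms are far more complex.
WEIGHTS = {
    "content_relevance": 0.45,
    "backlink_authority": 0.30,
    "page_speed": 0.15,
    "mobile_friendly": 0.10,
}

def rank_score(factors):
    """Each factor is assumed to be pre-normalized to the range 0..1."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

pages = {
    "https://example.com/guide": {"content_relevance": 0.9, "backlink_authority": 0.7,
                                  "page_speed": 0.8, "mobile_friendly": 1.0},
    "https://example.com/stub":  {"content_relevance": 0.4, "backlink_authority": 0.2,
                                  "page_speed": 0.9, "mobile_friendly": 1.0},
}

ranked = sorted(pages, key=lambda url: rank_score(pages[url]), reverse=True)
print(ranked)  # the guide page outranks the stub
```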
Searching
Now we know about the three-step process search engines
use to return relevant results. Crawling, indexing, and ranking
allow search engines to find and organize information. But
how does that help them answer your search query?
Let’s walk through how search engines answer queries step by
step, from the moment you type a term into the search bar.
Step 1: Search Engines Parse Intent
To return relevant results, search engines have to “understand” the search
intent behind a term. They use sophisticated language models to do that,
breaking down your query into chunks of keywords and parsing their
meaning.
For illustration, Google’s handling of synonyms allows the search engine to
recognize when groups of words mean the same thing. So when you type in
“dark colored dresses,” search engines will return results for black dresses as
well as other dark tones: the engine understands that “dark” frequently
describes black and similar shades.
Search engines also use keywords to understand broad “categories” of search
intent. In the “dark colored dress” illustration, the term “buy” would
signal to search engines that they should pull up product pages to match a
shopping searcher’s intent.
Search engines also use “freshness” algorithms to understand search
intent. These algorithms identify trending keywords and return newer
pages.
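A toy sketch of the synonym-expansion idea described above: the dictionary is a hand-written stand-in, since real synonym handling is learned from data rather than listed by hand.

```python
# Toy query expansion: map query terms to known synonyms before matching pages.
# The synonym table is a hand-written stand-in for a learned model.
SYNONYMS = {
    "dark": {"black"},
    "dress": {"dresses", "gown"},
    "buy": {"purchase", "shop"},   # "buy" also signals shopping intent
}

def expand(query):
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.update(SYNONYMS.get(word, set()))
    return terms

print(expand("dark colored dress"))
# {'dark', 'black', 'colored', 'dress', 'dresses', 'gown'}
```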
Step 2: Search Engines Match Pages to Query Intent
Once the search engine understands what kind of result you want to see, it needs to find
matching pages. A series of factors help the search engine decide which pages are best,
including:
❖ Title and content relevance
❖ Type of content
❖ Content quality
❖ Site quality and freshness
❖ Page popularity
❖ Language of the query
Step 3: Search Engines Apply “Localized” Factors
A number of individual factors come into play when search engines decide
which results you see.
Location: Some searches, like “cafés near me,” are obviously location-
dependent. But Google will also weigh local factors even in non-
location-specific searches.
Search settings: Search settings are also an important indicator of which results you’re
likely to find useful, such as whether you have set a preferred language or opted into
SafeSearch (a tool that helps filter out explicit results).
Search history: A user’s search history also influences the results they see.
Reverse Query Processing (RQP)
Reverse Query Processing refers to the process of determining the
original query or set of queries that produced a specific result set in a
database or information retrieval system. Unlike traditional query
processing, which starts with a query to retrieve data, reverse query
processing seeks to identify how a particular output was generated,
often by analyzing stored query logs, result patterns, and underlying
data structures. The goal is to trace back from results to queries, providing
insight into how specific data was obtained.
More formally, given a query Q and a table R, the goal is to find a database
D (a set of tables) such that Q(D) = R. We call this problem reverse query
processing, or RQP for short.
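A minimal illustration of Q(D) = R using SQLite: for a simple selection query Q and a desired result R, one possible database instance D is constructed and then verified. The schema and values are invented, and real RQP systems handle far richer SQL.

```python
# Toy reverse query processing: given Q and a desired result R, build a database D
# such that Q(D) = R, then verify it. Schema and values are invented for illustration.
import sqlite3

Q = "SELECT name FROM users WHERE age > 30 ORDER BY name"
R = [("alice",), ("bob",)]          # desired result set

# One (of many) database instances satisfying Q(D) = R:
# give every name in R an age above 30, plus a row the WHERE clause filters out.
D_rows = [("alice", 35), ("bob", 42), ("carol", 25)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", D_rows)

assert conn.execute(Q).fetchall() == R   # Q(D) = R holds for this instance
print("generated instance satisfies Q(D) = R")
```

Note that many other instances would also satisfy Q(D) = R, which is exactly the ambiguity listed under Challenges below.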
Reverse query processing has several applications.
First, it can be used to test databases.
Second, RQP can be helpful for debugging a database application because it enables
programmers to infer which states a program with embedded SQL can
reach.
Applications
Search Engines: Understanding user queries based on clicked results
to enhance search algorithms.
Data Warehousing: Retrieving the original ETL processes or queries
that led to specific data views.
Social Media Analytics: Analyzing user engagement data to reverse-
engineer popular content or queries.
Challenges
● Ambiguity: Multiple queries can produce the same result set,
complicating the reverse processing.
● Scalability: Managing large volumes of query-result pairs
efficiently.
● Dynamic Data: Changes in the underlying data can affect the
accuracy of reverse query results.
Types of Search Engine
CRAWLER based search engines
DIRECTORIES based search engines
HYBRID search engines
META search engines
SPECIALIZED search engines
Crawler Based Search Engines
Crawler-based search engines are the ones most of us are familiar with:
Google, Bing, etc.
They are called crawler-based because their software crawls the
web like a spider, automatically updating its search index and adding new
pages as it goes.
Crawler-based search engines are good when you have a specific
search topic in mind and can be very efficient at finding relevant
information in that situation. However, when the search topic is
general, crawler-based search engines may return hundreds of
thousands of irrelevant responses to simple search requests.
The three major components of a crawler-based search engine are:
1. The Crawler (Spider)
The crawler/spider visits a web page, reads it, and then follows links to other pages within
the site. The spider will return to the site on a regular basis, such as every month or every
fifteen days, to look for changes.
2. The Index
Everything the spider finds goes into the second part of the search engine, the index. The
index will contain a copy of every web page that the spider finds. If a web page changes,
then the index is updated with new information.
3. The Search Engine Software
This is the software program that accepts the user-entered query, interprets it, sifts
through the millions of pages recorded in the index to find matches, ranks them in
order of what it believes is most relevant, and presents them to the user in a
customizable manner.
How Does a Crawler-Based Search Engine Work?
Crawler-based search engines are constantly searching the Internet
for new web pages and updating their database of information with
these new or altered pages.
Examples of crawler-based search engines include Google and Bing.
Directory-Based Search Engines / Human-Powered Directories
A ‘directory’ uses human editors who decide what category the site
belongs to; they place websites within specific categories in the
directories database. The human editors comprehensively check the
website and rank it, based on the information they find, using a pre-
defined set of rules.
There are two major directories at the time of writing:
Yahoo Directory (www.yahoo.com)
Open Directory (www.dmoz.org)
Note: Since late 2002 Yahoo has provided search results using crawler-
based technology as well as its own directory.
DMOZ Directory
Yahoo! Directory
Hybrid Search Engines
Hybrid search engines use a combination of both crawler-based results
and directory-based results. More and more search engines these days
are moving to a hybrid-based model.
Examples of hybrid search engines are:
Meta Search Engines
Meta search engines take the results from
all the other search engines and
combine them into one large listing.
Examples of Meta search engines include:
Dogpile, Metacrawler, Mamma etc.
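A toy sketch of the merging step: ranked lists from several engines are deduplicated by URL and ordered by a simple aggregate of their positions (summed reciprocal rank here). The engine names and URLs are invented, and real meta engines use more elaborate rank-fusion schemes.

```python
# Toy meta-search merge: combine ranked result lists from several engines into one.
from collections import defaultdict

results_by_engine = {  # invented engines and results
    "engineA": ["https://x.com", "https://y.com", "https://z.com"],
    "engineB": ["https://x.com", "https://w.com", "https://y.com"],
}

scores = defaultdict(float)
for ranking in results_by_engine.values():
    for position, url in enumerate(ranking, start=1):
        scores[url] += 1.0 / position        # higher positions contribute more

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)
# ['https://x.com', 'https://y.com', 'https://w.com', 'https://z.com']
```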
Meta-Search Engine Architecture
Specialized Search Engines
Specialized search engines have been developed to cater to the demands of niche
(specific) areas. There are hundreds of specialized search engines, including:
Images (PicSearch.com)
Shopping (shopping.yahoo.com)
Flights / Travel (SkyScanner.net)
Blogs (BlogPulse.com)
People (Pipl.com)
Forums (BoardReader.com)
Music (SongBoxx.com)
Audio & Video (PodScope.com, Blinkx.com)
Resources (FileDigg.com [.ppt and .pdf])
Private Search (DuckDuckGo.com)