This document provides a review of design issues for search engines and web crawlers. It discusses how search engines use web crawlers to collect web documents for storage and indexing from the growing World Wide Web. The three main parts of a search engine are the crawler, indexer, and query engine. Crawler-based search engines create listings automatically using algorithms while human-powered directories rely on human organization. Designing efficient search engines and crawlers faces challenges from the diversity of web documents and changing user behaviors. Web crawlers prioritize URLs to download pages and extract new links to update search engine databases.
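The crawl loop summarized above (prioritize URLs, download pages, extract new links, repeat) can be sketched as follows. This is a minimal illustration, not the paper's own design: the fetch, extract_links, and priority callables are hypothetical stand-ins for a real HTTP client, HTML parser, and ranking policy.

```python
import heapq


def crawl(seeds, fetch, extract_links, priority, max_pages=100):
    """Minimal priority-driven crawl loop (sketch).

    The caller supplies fetch(url) -> page text,
    extract_links(url, text) -> iterable of absolute URLs, and
    priority(url) -> number (lower = crawled sooner). A production
    crawler would add politeness delays, robots.txt checks, revisit
    scheduling, and distributed storage.
    """
    frontier = [(priority(u), u) for u in seeds]   # the URL frontier
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}                # avoid re-queueing URLs
    pages = {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)           # most promising URL first
        text = fetch(url)                          # download the page
        pages[url] = text                          # hand off for indexing
        for link in extract_links(url, text):      # discover new links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(link), link))
    return pages
```

Run against a small in-memory link graph, the loop visits every reachable page exactly once, always expanding the highest-priority frontier entry next.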
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 15, Issue 6 (Nov. - Dec. 2013), PP 34-37
www.iosrjournals.org
Design Issues for Search Engines and Web Crawlers: A Review
Deepak Kumar, Aditya Kumar
WCTM, Gurgaon, Haryana, India
Abstract: The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these documents from the web for storage and indexing. The rapid growth of the World Wide Web has posed unprecedented challenges for the designers of search engines and web crawlers, which help users retrieve web pages in a reasonable amount of time. In this paper, we present a review of the need for and working of a search engine, and the role of a web crawler.
Keywords: Internet, WWW, search engine, types, design issues, web crawlers.
I. Introduction
Today the Internet [1,5] has become a universal source of information. Within a span of a few years, it has changed the way we do business and communicate, and has given us a worldwide platform. The World Wide Web (WWW, or web) [1,2,4,12,13] is a hyperlinked repository of hypertext documents residing on websites distributed across geographically distant locations.
The number of web documents has grown tremendously, and it is estimated that there are currently millions of web servers around the world hosting trillions of web pages. With such a huge web repository, finding the right information at the right time is a very challenging task. Hence, there is a need for more effective ways of retrieving information from the web.
Search engines [3,7,8,9,11,12,13,14] operate as a link between web users and web documents. Without search engines, this vast source of information in web pages would remain veiled from us. A search engine collects information from web pages on the Internet, indexes it, and stores the result in a huge searchable database from which it can be retrieved quickly.
II. Search Engines
A general web search engine (Fig. 1) has three parts [3,10,11,12,13,14]: the crawler, the indexer and the query engine. The web crawler (also called a robot, spider, worm, walker or wanderer) is the module that fetches pages from the web. Crawlers are small programs that peruse the web on the search engine's behalf and follow links to reach different pages. Starting with a set of seed URLs, crawlers extract the URLs appearing in the retrieved pages and store the pages in a repository database.
The indexer extracts all the words from each page and records the URL where each word occurred. The result is stored in a large table of URLs pointing to the pages in the repository where a given word occurs.
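As a sketch of the table the indexer produces, the following minimal Python example builds a full-text inverted index mapping each word to the set of URLs where it occurs; the URLs and page texts are illustrative, not taken from the paper:

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: word -> set of URLs where the word occurs.

    `pages` maps a URL to the page's extracted text, mirroring the
    repository filled by the crawler.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# A toy two-page repository (illustrative URLs and text).
repository = {
    "http://a.example/p1": "web crawlers collect web documents",
    "http://b.example/p2": "search engines index web documents",
}
index = build_index(repository)
```

A real indexer would also strip markup and punctuation and record word positions for phrase queries; this sketch keeps only the word-to-URL table described above.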
The indexing methods used in web database creation are full-text indexing, keyword indexing and human indexing. Full-text indexing is where every word on the page is put into a database for searching. It helps the user find every occurrence of a specific name or term. A general topic search, however, will not be very useful in such a database, and one has to dig through a lot of false drops.
In keyword indexing, only important words or phrases are put into the database. In human indexing, a person examines the page and determines a few key phrases that describe it. This allows the user to find a good starting set of works on a topic, assuming that the topic was picked by the human as something that describes the page.
The query engine is responsible for receiving and filling search requests from users. It relies on the indexes and on the repository. Because of the web's size, and the fact that users typically enter only one or two keywords, result sets are usually very large.
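A minimal sketch of this lookup step, assuming the word-to-URL table described above and treating a multi-keyword query as a conjunction (all URLs and names here are illustrative):

```python
def query(index, keywords):
    """Answer a conjunctive keyword query by intersecting posting lists."""
    postings = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

# A tiny inverted index (word -> URLs), as the indexer would produce it.
index = {
    "web":      {"http://a.example/p1", "http://b.example/p2"},
    "crawlers": {"http://a.example/p1"},
    "engines":  {"http://b.example/p2"},
}
result = query(index, ["web", "crawlers"])  # pages containing both words
```

A production query engine would additionally rank the matching URLs rather than return an unordered set, which is what makes large result sets usable.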
Fig. 1 : Architecture of a typical web search engine
2.1 Types of Search Engines
Search engines are good at finding unique keywords, phrases, quotes, and information contained in the full text of web pages. They allow the user to enter keywords, which are then looked up in the engine's index and database. The various types [9] of search engines available are crawler-based search engines, human-powered directories, meta search engines and hybrid search engines.
i. Crawler-based search engines:
Crawler-based search engines create their listings (which make up the search engine's index or catalog) automatically with the help of web crawlers, and use computer algorithms to rank the retrieved pages. Such search engines are huge and often retrieve a lot of information. For complex searches, they allow the user to search within the results of a previous search and thereby refine the results. Search engines of this type contain the full text of the web pages they link to, so one can find pages by matching words in the pages one wants.
ii. Human-powered directories:
Human-powered directories are built by human selection, i.e. they depend on humans to create the repository. They are organized into subject categories, and the classification of pages is done by subject. Such directories do not contain the full text of the web pages they link to, and are smaller than most search engines.
iii. Hybrid search engines:
Hybrid search engines differ from traditional text-oriented and directory-based search engines, and typically favor one type of listing over the other. Many search engines today combine a crawler-based search engine with a directory service.
iv. Meta search engines:
Meta search engines aggregate and screen the results of multiple primary search engines. Unlike ordinary search engines, meta crawlers do not crawl the web themselves to build listings; instead, they send the user's search to several search engines at once and blend the results together onto one page.
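This fan-out-and-blend behaviour can be sketched as follows; the two "engines" are stand-in functions returning fixed ranked lists, where a real meta search engine would call each primary engine's HTTP interface:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest

def meta_search(terms, engines):
    """Send one query to several engines at once and blend the results.

    `engines` maps an engine name to a function returning a ranked list
    of URLs; the functions used below are illustrative stand-ins.
    """
    with ThreadPoolExecutor() as pool:
        ranked_lists = list(pool.map(lambda search: search(terms),
                                     engines.values()))
    blended, seen = [], set()
    # Interleave the ranked lists round-robin, dropping duplicate URLs.
    for round_ in zip_longest(*ranked_lists):
        for url in round_:
            if url is not None and url not in seen:
                seen.add(url)
                blended.append(url)
    return blended

# Two stand-in "primary search engines" with overlapping result lists.
engine_a = lambda terms: ["http://x.example", "http://y.example"]
engine_b = lambda terms: ["http://y.example", "http://z.example"]
results = meta_search("web crawlers", {"A": engine_a, "B": engine_b})
```

Round-robin interleaving is only one possible blending policy; real meta search engines often re-rank using each engine's scores instead.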
Based on the application for which search engines are used, they can be categorized as follows:
i. Primary search engines scan entire sections of the www and produce their results from databases of web
page content, automatically created by computers.
ii. Business and services search engines are essentially national yellow-page directories.
iii. Employment and Job search engines either provide potential employers access to resumes of people
interested in working for them or provide prospective employees with information on job availability.
iv. Finance-oriented search engines facilitate searches for specific information about companies (officers,
annual reports etc.).
v. Image search engines help us search the www for images of all kinds.
vi. News search engines search newspaper and news web site archives for the selected information.
vii. People search engines search for names, addresses, telephone numbers and e-mail addresses.
viii. Subject guides are like indexes in the back of a book. They involve human intervention in selecting and
organizing resources, so they cover fewer resources and topics but provide more focus and guidance.
ix. Specialized search engines search specialized databases, allow users to enter search terms in a particularly
easy way, look for low prices on items they are interested in purchasing, and even give users access to real,
live human beings to answer questions.
2.2 Design issues
Designing a search engine [7,8,9,10,11] is a tricky task and there are differences in the ways they work.
In common they all perform three basic tasks; search the Internet based on important words, keep an index of the
words they find and where they find them, and allow users to look for words (or combinations of words) found
in that index.
There are different ways to improve performance of search engines but three main characteristics are
improving algorithms to search the web, using filters towards the user’s results; and improving the user interface
for query input. Factors that determine the quality of a search engine are freshness of contents, index quality,
search features, retrieval system and user behaviour. Few more issues are :-
i. Primary goals:
The primary goal of a search engine is to provide high-quality search results over a rapidly growing World Wide Web. The most important measures for a search engine are its search performance, the quality of its results, and its ability to crawl and index the web efficiently.
ii. Diversity of documents:
On the web, documents are written in several different languages. Documents need to be indexed in a way that allows searching for documents written in diverse languages with a single query. Search engines should also be able to index documents written in multiple formats, as each file format poses its own difficulties for the search engine.
iii. Behaviour of web users:
Web users are very heterogeneous, and search engines are used by professionals as well as by laymen. A search engine needs to be smart enough to cater to both types of users. Moreover, users tend to look only at the part of the result set visible without scrolling, and results that are not on the first page are nearly invisible to the general user.
iv. Freshness of database :
A search engine finds it difficult to keep its database up to date with the entire web because of the web's enormous size and the different update cycles of individual websites.
Other characteristics [10] that a large search engine is expected to have are scalability, high
performance, politeness, continuity, extensibility and portability.
III. Web Crawlers
A web crawler (Fig. 2) is a program that fetches information from the WWW in an automated manner. The objective of a web crawler is to keep the freshness [6] of the pages in its collection as high as possible. To download documents, a crawler starts by placing an initial set of seed URLs in a queue, where all URLs to be retrieved are kept and prioritized. From this queue the crawler extracts a URL, downloads the page, extracts URLs from the downloaded page, and places the new URLs in the queue. This process is repeated, and the collected pages are used by a search engine; when a user retrieves a page, the browser parses the document and makes it available to the user.
The general algorithm of a web crawler is given below:
Begin
Read a URL from the set of seed URLs;
Determine the IP address for the host name;
Download the robots.txt file, which carries downloading permissions and also specifies the files to be excluded
by the crawler;
Determine the protocol of the underlying host, like http, ftp, gopher etc.;
Based on the protocol of the host, download the document;
Identify the document format, like doc, html, or pdf etc.;
Check whether the document has already been downloaded or not;
If the document is a fresh one then
Read it and extract the links or references to the other sites from that document;
else
Continue;
Convert the URL links into their absolute URL equivalents;
Add the URLs to the set of seed URLs;
End.
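The core loop of this algorithm can be sketched in Python as follows. This is a minimal sketch: robots.txt checking, DNS resolution and format detection from the algorithm above are omitted for brevity, and the fetch function is injected so the sketch runs without network access (the site and its URLs are illustrative):

```python
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: pop a URL, download it, extract and enqueue links.

    `fetch` takes a URL and returns its HTML, standing in for the
    protocol-specific download step of the algorithm above.
    """
    frontier = deque(seed_urls)   # queue of URLs still to be retrieved
    repository = {}               # downloaded pages, keyed by URL
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:     # already downloaded: skip it
            continue
        html = fetch(url)
        repository[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            # Convert each link into its absolute URL equivalent.
            absolute = urllib.parse.urljoin(url, link)
            if absolute not in repository:
                frontier.append(absolute)
    return repository

# A tiny in-memory "web" of three pages (illustrative URLs).
site = {
    "http://s.example/": '<a href="/p1">one</a> <a href="/p2">two</a>',
    "http://s.example/p1": '<a href="/p2">two</a>',
    "http://s.example/p2": "no links here",
}
pages = crawl(["http://s.example/"], lambda url: site.get(url, ""))
```

For a real crawl, one would pass a fetcher built on urllib.request and add the robots.txt, politeness and duplicate-content checks from the algorithm above.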
A web crawler deals with two main issues: a good crawling strategy for deciding which pages to download next, and a highly optimized system architecture that can download a large number of pages per second. At the same time, it has to be robust against crashes, manageable, and considerate of resources and web servers.
Fig. 3.1 : General architecture of a web crawler
Several available web crawling techniques are the parallel crawler, distributed crawler, focused crawler, hidden-web crawler and incremental crawler, which differ in their mechanism, implementation and objective.
IV. Conclusion and Future Research
In this paper we conclude that designing a search engine to crawl the complete web is a nontrivial task. Such a system has to be capable of performing the crawling process efficiently and reliably. Besides discussing some design issues of search engines, we also showed how a web crawler supports the search engine in updating its database, improving the quality of the content retrieved by the user. From the several available web crawling techniques, an appropriate one may be chosen to crawl the web while designing a search engine.
References
[1]. A. K. Sharma, J. P. Gupta, D. P. Agarwal, "Augment Hypertext Documents suitable for parallel crawlers", accepted for presentation and inclusion in the proceedings of WITSA-2003, a National workshop on Information Technology Services and Applications, Feb 2003, New Delhi.
[2]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston, "What's New on the Web? The Evolution of the Web from a Search Engine Perspective", In Proceedings of the World-Wide Web Conference (WWW), May 2004.
[3]. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), August 2001.
[4]. Pierre Baldi, "Modeling the Internet and the Web: Probabilistic Methods and Algorithms", 2003.
[5]. Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, Stephen Wolff, "A Brief History of the Internet", www.isoc.org/internet/history.
[6]. Brian E. Brewington, George Cybenko, "How dynamic is the web?", In Proceedings of the Ninth International World-Wide Web Conference, Amsterdam, Netherlands, May 2000.
[7]. Dirk Lewandowski, "Web searching, search engines and Information Retrieval", Information Services & Use, 25 (2005) 137-147, IOS Press, 2005.
[8]. Curt Franklin, "How Internet Search Engines Work", 2002, www.howstuffworks.com.
[9]. B. Grossan, "Search Engines: What they are, how they work, and practical suggestions for getting the most out of them", Feb 1997, webreference.com.
[10]. A. Heydon, M. Najork, "Mercator: A scalable, extensible Web crawler", World Wide Web, vol. 2, no. 4, pp. 219-229, 1999.
[11]. Marc Najork, Allan Heydon, "High-Performance Web Crawling", September 2001.
[12]. Mike Burner, "Crawling towards Eternity: Building an archive of the World Wide Web", Web Techniques Magazine, 2(5), May 1997.
[13]. Niraj Singhal, Ashutosh Dixit, "Retrieving Information from the Web and Search Engine Application", in proceedings of National Conference on "Emerging Trends in Software and Network Techniques - ETSNT'09", Amity University, Noida, India, Apr 2009.
[14]. Sergey Brin, Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Proceedings of the Seventh International World Wide Web Conference, pages 107-117, April 1998.