This document proposes a novel architecture for a parallel domain focused crawler that uses mobile crawlers. The architecture aims to reduce bandwidth usage by only downloading pages that have changed since the last crawl. The key components are a frequency change estimator, crawler manager, domain name director, and crawler hands. The frequency change estimator identifies modified pages. Crawler hands fetch pages from remote sites and only download modified pages in compressed form. This is intended to more efficiently update search engine repositories while reducing network resource usage.
Rajender Nath, Naresh Kumar / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 5, September-October 2012, pp. 266-269

A Novel Architecture for Parallel Domain Focused Crawler

Rajender Nath*, Naresh Kumar**
* Professor, DCSA, Kurukshetra University, Kurukshetra, Haryana, India
** Assistant Professor, MSIT, Janak Puri, New Delhi, India
Abstract
The WWW is a collection of hyperlinked documents available in HTML format [10]. This collection is very large and therefore difficult to refresh quickly, because 40% of web pages change daily. Owing to the dynamic nature of the web, parallel crawlers are used to cover more and more pages: they run in parallel and cover a wider area of the web. However, parallel crawlers consume more resources, especially network bandwidth, to keep the repository up to date. This paper therefore proposes an architecture that uses mobile crawlers to fetch web pages and download only those pages that have changed since the last crawl. These crawlers run within a specific domain and skip irrelevant domains.

Index Terms— WWW, Search Engine, URLs, Crawling process, parallel crawler, Internet traffic.
Introduction
The WWW [7] is a client-server architecture that allows users to search by entering keywords on a search engine interface; in response, the search engine returns web pages to the user. A very large number of pages are available on the web, so a search engine requires a mechanism, called a crawler, to collect web pages from the web. A web crawler (also called a walker, spider, or bot) is a software program that runs repetitively to download web pages. The program takes a URL from the seed queue [1] and fetches the corresponding web page. Millions of web pages are downloaded per day by a crawler to fulfill user requirements. In spite of this, it is not possible to visit the entire web and download all requested pages: according to [13], any search engine can cover only 16% of the entire web, and traversing the web quickly and entirely is an expensive, unrealistic goal because of the required hardware and network resources [2, 3]. Currently, web crawlers have indexed billions of web pages, and 40% of Internet traffic and bandwidth utilization is due to web crawlers downloading pages for different search engines [12].

This paper proposes an approach in which mobile agents are used as parallel, domain-specific web crawlers. These mobile crawlers go to the remote site, take only those pages found to be modified since the last crawl, and send only the compressed modified web pages to the search engine for indexing.
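To make the basic crawling loop concrete, the following minimal sketch (our illustration in Python, not the authors' implementation) takes URLs from a seed queue, fetches each page, and feeds newly extracted links back into the queue. The regex-based `extract_links` is a deliberate simplification; a real crawler would use a proper HTML parser.

```python
# A minimal sketch of the basic crawling loop described above
# (illustrative only, not the authors' implementation).
import re
import urllib.request
from collections import deque

def extract_links(html: str) -> list[str]:
    # Simplified link extraction; a real crawler would parse the HTML.
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(seed_urls: list[str], max_pages: int = 100) -> dict[str, str]:
    queue = deque(seed_urls)          # seed queue of URLs [1]
    visited: set[str] = set()
    repository: dict[str, str] = {}   # URL -> page content
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                  # skip unreachable pages
        repository[url] = html
        queue.extend(extract_links(html))
    return repository
```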
I. RELATED WORK
A domain-specific parallel crawler is a program used for searching information within a specific domain (.org, .com, .edu, etc.) only [4]. The main aim of a domain-specific crawler is to collect all web pages, but to select and retrieve pages only from the specific domain [1, 2, 5].

In [9], the last modified date, the number of keywords, and the number of URLs were used to check for changes in a page, in order to determine page-change behavior, load on the network, and bandwidth preservation. From the experimental work it was found that 63% of the total pages had changed and needed downloading and indexing, and the experiment also showed a reduction in downloading of up to 37%. It was further shown that the proposed scheme reduced the load on the network to one fourth through the use of compression, thereby preserving network bandwidth.

The focused crawler introduced by Chakrabarti et al. [6] maintained web pages on a specific topic hierarchy. The goal was to selectively look for pages relevant to a pre-defined set of topics. The proposed system was built from three separate mechanisms: a crawler, a classifier, and a distiller. The classifier rates page relevance according to the taxonomy, and the distiller identifies hypertextual pages that are access points to many relevant pages within a few links. The authors designed a focused web crawler that crawls web pages and stores them locally based on relevancy and priority criteria.

The distributed crawler and parallel crawling proposed in [1] increase coverage and preserve bandwidth, but they only distribute and localize the load rather than actually reducing it.

After studying various papers [9] on mobile agents, we identified the following problems:
1. The relevancy of a web page can be determined only after downloading it at the search engine end. Once a page has been downloaded, there is little benefit in filtering it, because all the network resources have already been spent downloading it.
2. The load on the network and the bandwidth consumed while downloading are considerably large.

To overcome the problems listed above, this paper proposes an architecture for mobile web crawlers that uses a frequency change estimator, together with new components that help discard pages that have not changed, at the remote site.
[Figure 1: Proposed Parallel Domain Focused Crawler. Client side / search engine: URL Queue, Crawler Manager (CM), URL Dispatcher, Domain Name Director (DND), DNS Queue, Comparator (CoM), Database (DB), and Frequency Change Estimator (FCE); crawled pages are returned to the SE. Remote site: Crawler Hands (CH) fetching pages from the WWW.]
3. Proposed Work
The proposed mobile crawler is shown in Figure 1. The mobile crawlers are sent to the remote site, where they check each page for modification. If the page is modified, it is downloaded; otherwise its URL is rejected. The major components of the proposed architecture are listed below:
1. Analyzer Module (AM).
2. Domain Name Director (DND).
3. Crawler Hand (CH).
4. Crawler Manager (CM).
5. Database (DB).
6. Frequency Change Estimator (FCE).
7. Comparator (CoM).

3.1 Crawler Manager
The main tasks of the CM are the generation of mobile crawlers and the assignment of URLs to the mobile crawlers for crawling web pages, based on information taken from the FCE. After receiving the downloaded web pages in compressed form, the CM also decompresses them.

3.2 URL Dispatcher
The URL Dispatcher [11] takes URLs from the Comparator module and sends them to the URL queue for further processing.
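As a rough illustration of the CM's two duties, the sketch below assumes round-robin URL assignment and zlib compression; both are our assumptions, since the paper specifies neither.

```python
# Hypothetical sketch of the Crawler Manager (CM): URL assignment to
# mobile crawlers and decompression of the pages they send back.
import zlib

class CrawlerManager:
    def __init__(self, num_crawlers: int):
        # One URL batch per mobile crawler (crawler hand).
        self.assignments: list[list[str]] = [[] for _ in range(num_crawlers)]

    def assign(self, urls: list[str]) -> None:
        # Round-robin assignment of URLs to the mobile crawlers
        # (the paper does not specify an assignment policy).
        for i, url in enumerate(urls):
            self.assignments[i % len(self.assignments)].append(url)

    def receive(self, compressed_page: bytes) -> str:
        # Pages arrive compressed from the remote site; the CM
        # decompresses them before they are indexed.
        return zlib.decompress(compressed_page).decode("utf-8", errors="replace")
```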
3.3 Frequency Change Estimator (FCE)
This module identifies whether two versions of the web page corresponding to the same URL are the same. It is part of the client site and remains there. The FCE calculates the probability of a page change by visiting the web page periodically [8]. It maintains the page-change frequency of every page at the SE and also estimates in how many days a page will be modified, based on the periodic visits of the page by the web crawler. Whenever the expected date and month of change for a web page is reached, the FCE sends the URL of that page to the URL dispatcher so that the page can be downloaded to refresh the repository.
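The paper does not give the FCE's estimator. As a loose illustration in the spirit of [8], the sketch below estimates a page's change interval from the fraction of periodic visits on which a change was observed, and predicts the next expected change date; the class and method names are hypothetical.

```python
# Illustrative sketch of a Frequency Change Estimator (FCE).
# The paper gives no formula; this uses a simple estimate in the
# spirit of [8]: the fraction of visits on which a change was seen.
from datetime import date, timedelta

class FrequencyChangeEstimator:
    def __init__(self):
        self.visits: dict[str, int] = {}    # URL -> total periodic visits
        self.changes: dict[str, int] = {}   # URL -> visits where page changed

    def record_visit(self, url: str, changed: bool) -> None:
        self.visits[url] = self.visits.get(url, 0) + 1
        if changed:
            self.changes[url] = self.changes.get(url, 0) + 1

    def expected_change_interval_days(self, url: str, visit_gap_days: int = 1) -> float:
        # If the page changed on half of the daily visits, expect a
        # change roughly every 2 days.
        n, x = self.visits.get(url, 0), self.changes.get(url, 0)
        if x == 0:
            return float("inf")
        return visit_gap_days * n / x

    def next_expected_change(self, url: str, last_modified: date) -> date:
        days = self.expected_change_interval_days(url)
        if days == float("inf"):
            return date.max                 # never observed changing
        return last_modified + timedelta(days=round(days))
```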
URL & ASCII
Value
Fetch URL
WWW
Crawler Worker Crawling Process
Visited Fetched Web Page
ASCII
value
Manager
ASCII value of new & old Web Page
Y
ASCII Download & send The
Reject the Web page to SE after
Web Page same
Compression
N
Figure 2: Crawler Hand
3.4 Database (DB)
The database plays a major role in any system. It is maintained at the search engine end and contains information about all the pages crawled by the mobile crawlers. This module has six fields:
1) Name of URL: the URLs that have been visited by the crawler.
2) Parent URL: the parent web page of the current URL.
3) ASCII count: the total sum of the ASCII values of each character of the web page.
4) Last modified date: the date and month on which the page was last modified.
5) Frequency of change: the expected date on which the next page change will take place.
6) File path: the complete path of the web page at the search engine end.
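The record layout can be written down directly. Below is a minimal sketch of one repository record with the six fields above; the type names are our rendering, not the paper's.

```python
# Sketch of one repository record with the six fields listed above.
from dataclasses import dataclass
from datetime import date

@dataclass
class CrawledPageRecord:
    url: str                    # 1) Name of URL visited by the crawler
    parent_url: str             # 2) Parent web page of the current URL
    ascii_count: int            # 3) Sum of ASCII values of every character
    last_modified: date         # 4) Date on which the page last changed
    expected_next_change: date  # 5) Expected date of the next change
    file_path: str              # 6) Path of the page at the search engine end
```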
3.5 Comparator Module (CoM)
The CoM checks whether the downloading period of a web page has been reached. If it has, the CoM sends the URL of the page to the URL dispatcher for further processing; otherwise it rejects the URL. The last modified date and the frequency of change are taken from the FCE module, as computed when the URL was previously crawled. Note also that the ASCII value of the web page is sent along with the mobile crawler, to help the mobile crawler decide whether the page has actually changed or not.
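A sketch of the CoM's due-date check, reusing the hypothetical `CrawledPageRecord` from above; the queue interface is our assumption.

```python
# Illustrative sketch of the Comparator module (CoM): forward a URL to
# the URL dispatcher only when its expected change date has been reached.
from datetime import date

def comparator(record: "CrawledPageRecord",
               dispatch_queue: list,
               today: date | None = None) -> None:
    today = today or date.today()
    if today >= record.expected_next_change:
        # Downloading period reached: dispatch the URL together with its
        # stored ASCII count, so the mobile crawler can verify the change.
        dispatch_queue.append((record.url, record.ascii_count))
    # Otherwise the URL is rejected for this cycle.
```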
3.6 Domain Name Director (DND)
The main job of the DND is to take URLs from the URL queue and assign them to the appropriate domain (.org, .com, .edu, etc.). Each URL is placed in the DNS queue, from where a crawler takes it and fetches the web page. The number of pages downloaded, and hence the crawling time, depends on the URL itself, and the distribution of load across the Crawler Hands is likely to differ according to the frequency of demand for each domain.
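A sketch of how the DND might route URLs into per-domain queues by top-level domain; the routing rule is our assumption.

```python
# Illustrative sketch of the Domain Name Director (DND): route each URL
# to a per-domain DNS queue according to its top-level domain.
from collections import defaultdict
from urllib.parse import urlparse

def route_by_domain(urls: list[str]) -> dict[str, list[str]]:
    dns_queues: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1] if "." in host else "unknown"
        dns_queues["." + tld].append(url)   # e.g. ".org", ".com", ".edu"
    return dns_queues
```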
3.7 Crawler Hand (CH)
The Crawler Hand is the major module of the entire proposed architecture (shown in Figure 2). It retrieves a URL from the DNS queue in FCFS fashion and sends a request to the web server. The web document for the corresponding URL is fetched and sent to the Manager. The Manager calculates the ASCII value of that web page and compares the new and old ASCII values of the corresponding page: if they are the same, the web page is rejected; otherwise the page is compressed and sent to the SE.
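The change test is stated explicitly: compare the ASCII sum of the fetched page with the stored value, and forward only changed pages in compressed form. A minimal sketch, with zlib standing in for the paper's unspecified compression scheme:

```python
# Sketch of the Crawler Hand's change test: compare the ASCII sum of the
# freshly fetched page with the stored value; if it differs, compress the
# page for transmission to the search engine (zlib stands in for the
# paper's unspecified compression scheme).
import zlib

def ascii_count(page: str) -> int:
    # Total sum of the character codes of the page (field 3 of the DB).
    return sum(ord(ch) for ch in page)

def process_fetched_page(page: str, stored_ascii_count: int) -> bytes | None:
    if ascii_count(page) == stored_ascii_count:
        return None                                 # unchanged: reject
    return zlib.compress(page.encode("utf-8"))      # changed: send compressed
```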
4. Conclusion
The size of the web is growing exponentially, and 40% of web pages change daily [9]. Crawlers are used to collect web data for search engines. This paper has enumerated the major components of the crawler and parallelized the crawling system, which is vital for downloading documents in a small amount of time.

The proposed architecture of a domain-specific parallel crawler exhibits full distribution, scalability, load balancing, and reliability, because any web page is downloaded only when it has changed. Furthermore, each web page is downloaded only in compressed form, which reduces its size.
5. Future Work
To enhance and validate this architecture, several interesting issues, such as page-change frequency and link extraction from the downloaded web pages, still need to be addressed. These will be presented together with implementation results in the near future. The results will be compared with other techniques used by mobile web crawlers in terms of bandwidth preservation, number of web pages downloaded, and total downloading time saved by the proposed work.
References
[1] Junghoo Cho and Hector Garcia-Molina, "Parallel Crawlers," in Proceedings of WWW 2002, May 7-11, 2002, Honolulu, Hawaii, USA. ACM 1-58113-449-5/02/0005.
[2] Bidoki, Yazdani, et al., "FICA: A Fast Intelligent Crawling Algorithm," Web Intelligence, IEEE/ACM/WIC International Conference on Intelligent Agent Technology, pp. 635-641, 2007.
[3] Cui Xiaoqing and Yan Chun, "An Evolutionary Relevance Calculation Measure in Topic Crawler," CCCM 2009, ISECS International Colloquium on Computing, Communication, Control, and Management, pp. 267-270, August 2009.
[4] Nidhi Tyagi and Deepti Gupta, "A Novel Architecture for Domain Specific Parallel Crawler," Indian Journal of Computer Science and Engineering, Vol. 1, No. 1, pp. 44-53, ISSN: 0976-5166.
[5] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, "Efficient Crawling through URL Ordering," 7th International WWW Conference, April 14-18, Brisbane, 1998.
[6] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," Computer Networks, 31(11-16):1623-1640, 1999.
[7] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," in Proceedings of the International Conference WWW7, 1998.
[8] Junghoo Cho, "Estimating Frequency of Change," ACM Journal Name, Vol. V, No. N, Month 20YY, pp. 1-32.
[9] Rajender Nath and Satinder Bal, "A Novel Mobile Crawler System Based on Filtering off Non-Modified Pages for Reducing Load on the Network," The International Arab Journal of Information Technology, Vol. 8, No. 3, July 2011.
[10] Naresh Chauhan and A. K. Sharma, "Design of an Agent Based Context Driven Focused Crawler," BVICAM's International Journal of Information Technology, 2008.
[11] Rajender Nath and Naresh Kumar, "A Novel Architecture for Parallel Crawler Based on Focused Crawling," 6th International Conference on Intelligent Systems, Sustainable, New and Renewable Energy Technology & Nanotechnology, March 16-18, 2012.
[12] Yuan X. and Harms J., "An Efficient Scheme to Remove Crawler Traffic from the Internet," in Proceedings of the 11th International Conference on Computer Communications and Networks, pp. 90-95, 2002.
[13] Lawrence S. and Giles C., "Accessibility of Information on the Web," Nature, vol. 400, no. 6740, pp. 107-109, 1999.