International Journal on Web Service Computing (IJWSC), Vol.3, No.3, September 2012
DOI: 10.5121/ijwsc.2012.3308
AN EXTENDED MODEL FOR EFFECTIVE
MIGRATING PARALLEL WEB CRAWLING WITH
DOMAIN SPECIFIC AND INCREMENTAL CRAWLING
Md. Faizan Farooqui1, Dr. Md. Rizwan Beg2 and Dr. Md. Qasim Rafiq3
1Research Scholar, Department of Computer Application,
Integral University, Lucknow 226026, India
faizan_farooqui2000@yahoo.com
2Professor & Head, Department of CS&E, Integral University,
Lucknow 226026, India
rizwanbeg@gmail.com
3Chairman, Department of Computer Engineering,
Aligarh Muslim University, Aligarh, India
mqrafiq@hotmail.com
ABSTRACT
The Internet is vast and has grown enormously, and search engines are the primary tools for Web site
navigation and search. Search engines maintain indices of web documents and provide search facilities by
continuously downloading Web pages for processing. This process of downloading web pages is known as
web crawling. In this paper we propose an architecture for an Effective Migrating Parallel Web Crawling
approach with a domain-specific and incremental crawling strategy that makes the web crawling system more
effective and efficient. The major advantage of the migrating parallel web crawler is that the analysis portion
of the crawling process is done locally, at the residence of the data, rather than inside the Web search engine
repository. This significantly reduces network load and traffic, which in turn improves the performance,
effectiveness and efficiency of the crawling process. Another advantage of the migrating parallel crawler is
that, as the size of the Web grows, it becomes necessary to parallelize the crawling process in order to finish
downloading web pages in a comparatively shorter time. Domain-specific crawling will yield high-quality
pages. The crawling process will migrate to a host or server of a specific domain and start downloading
pages within that domain. Incremental crawling will keep the pages in the local database fresh, thus
increasing the quality of downloaded pages.
Keywords
Web crawling, parallel migrating web crawler, search engine
1. INTRODUCTION
The Internet is a global system of interconnected computer networks. The searching and indexing
tasks for the web are currently handled by specialized web applications called search engines.
A modern search engine can be divided into three parts: the publicly available search engine, the
database, and the web crawling system. A web crawler is an automated program that browses the
Web in a methodical manner. The process of traversing the web is called Web crawling. A web
crawler starts with a queue of known URLs to visit. As it visits these pages, it scans them for
links to other web pages. The Web Crawler consists of a Crawler Manager, a robots.txt
downloader, a Spider and a Link Extractor. The Crawl Manager takes a set of URLs from the Link
Extractor and sends the next URLs to the DNS resolver to obtain their IP addresses. The robots.txt
file is the means by which web authors express which pages they want the crawler to avoid. The
Link Extractor extracts URLs from the links in the downloaded pages and sends them to the
crawler manager for subsequent downloading.
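As a rough illustration of the crawl cycle just described, the following minimal Python sketch shows a crawl-manager-style loop: take a URL from a queue of known URLs, check robots.txt, download the page, extract links and feed them back into the queue. All function and variable names (for example `frontier` and `extract_links`) are illustrative, not taken from the paper.

```python
# Minimal single-threaded crawl loop illustrating the components named above:
# a URL queue (frontier), a robots.txt check, a downloader and a link extractor.
# Names are illustrative; a production crawler would use a real HTML parser,
# politeness delays and multiple download threads.
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def allowed_by_robots(url, agent="ExampleCrawler"):
    rp = robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True                              # robots.txt unreachable: assume allowed
    return rp.can_fetch(agent, url)

def extract_links(base_url, html):
    # Very naive href extraction, kept short for illustration.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    frontier, seen, pages = deque(seed_urls), set(seed_urls), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        pages[url] = html                        # store the downloaded page
        for link in extract_links(url, html):    # feed new links back into the queue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```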
2. LITERATURE SURVEY
In [13], the authors demonstrated an efficient alternative to the “download-first, process-later”
strategy of existing search engines by using mobile crawlers. In [14], the authors implemented
UbiCrawler, a scalable, distributed and fault-tolerant web crawler. In [15], the authors presented
the architecture of PARCAHYD, an ongoing project aimed at the design of a parallel crawler based
on augmented hypertext documents. In [16], the authors studied how an effective parallel crawler
can be designed; as the size of the Web grows, it becomes imperative to parallelize the crawling
process. In [17], the authors propose Mercator, a scalable, extensible crawler. [18] presented
Google, a prototype of a large-scale search engine which makes heavy use of the structure present
in hypertext. [19] aims at designing and implementing a parallel migrating crawler in which the
work of a crawler is divided amongst a number of independent crawling processes.
3. MOTIVATION: PROBLEMS IN GENERIC WEB CRAWLERS
According to [1], the Web crawlers of big commercial search engines crawl up to 10 million pages
per day. Assuming an average page size of 6 KB [2], the crawling activities of a single commercial
search engine add a daily load of about 60 GB to the Web.
Scaling Issues: One of the first Web search engines, the World Wide Web Worm [3], was introduced
in 1994 and used an index of 110,000 Web pages. Big commercial search engines in 1998 claimed to
index up to 110 million pages [1]. The Web is expected to grow further at an exponential speed,
doubling its size (in terms of number of pages) in less than a year [4].
Efficiency Issues: Current Web crawlers download many irrelevant pages because traditional
crawling techniques cannot analyze the page content prior to page download [13].
Index Quality Issues: Current commercial search engines maintain Web indices of up to several
hundred million pages [1].
4. MIGRATING PARALLEL WEB CRAWLER
The Migrating Parallel Crawler system consists of a Central Crawler, Crawl Frontiers, a Local
Database for each Crawl Frontier and a Centralized Database. It is the responsibility of the central
crawler to receive the URL input from the applications and forward the URLs to the available
migrating crawling processes. Crawling processes migrate to different machines to increase system
performance. The local database of each crawl frontier is a buffer that collects data locally. This
data is transferred to the central crawler after compression and filtering, which reduces the network
bandwidth overhead. The central crawler has a centralized database which contains the documents
collected independently by the crawl frontiers. The main advantages of the migrating parallel
crawler approach are localized data access, remote page selection, remote page filtering, remote
page compression, scalability, network-load dispersion and network-load reduction.
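As a rough sketch of the remote page filtering and compression step described above, the following Python fragment filters locally collected pages and compresses them before they are shipped to the central crawler. The keyword-based relevance test and all names (`filter_pages`, `transfer_to_central`, `send`) are assumptions made for illustration; the paper specifies the behaviour, not this code.

```python
# Illustrative remote page filtering and compression before transfer to the
# central crawler. The keyword relevance test and all names are assumptions.
import zlib

def filter_pages(pages, keyword):
    """Keep only pages whose content mentions the target domain keyword."""
    return {url: html for url, html in pages.items() if keyword.lower() in html.lower()}

def compress_batch(pages):
    """Compress each page so the transfer to the central crawler uses less bandwidth."""
    return {url: zlib.compress(html.encode("utf-8")) for url, html in pages.items()}

def transfer_to_central(pages, keyword, send):
    relevant = filter_pages(pages, keyword)      # remote page filtering
    batch = compress_batch(relevant)             # remote page compression
    send(batch)                                  # 'send' stands in for the network call
    return len(batch), sum(len(b) for b in batch.values())
```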
5. ISSUES IN MIGRATING PARALLEL WEB CRAWLER
The following issues are important in the study of a migrating parallel crawler:
Communication bandwidth: To prevent overlap and to improve the quality of the downloaded
content, crawling processes need to communicate and coordinate with each other, and this
coordination consumes valuable communication bandwidth.
Quality: The prime objective of a migrating parallel crawler is to download the “important” pages
first so as to improve the “quality” of the downloaded collection.
Overlap: When multiple processes run in parallel to download pages, different processes may
download the same page multiple times. One process may download a page that another process
has already downloaded because it is not aware that the other process has done so.
6. COORDINATION IN MIGRATING PARALLEL WEB CRAWLER
Coordination is performed to prevent overlap; crawling processes also need to coordinate with
each other so that the correct, high-quality pages are downloaded. This coordination can be
achieved in the following ways:
Independent: In this method, crawling processes download pages independently of each other
without any coordination. The downloaded pages may overlap in this case, but we assume that the
overlap is not significant.
Static assignment: In this method, the Web is partitioned and each partition is assigned to a
crawling process before actual crawling begins. This type of arrangement is called static
assignment.
Dynamic assignment: In this coordination scheme, a central coordinator divides the Web into
small partitions and each partition is dynamically assigned to a crawling process for further
downloading.
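A common way to realize such an assignment is to hash each URL's host name so that all pages of a site map to the same crawling process. The sketch below is a minimal illustration under that assumption; the paper does not prescribe a particular partitioning function, and the names used here are our own.

```python
# Illustrative static assignment: hash the host name so every URL from the same
# site always lands on the same crawling process. The hash-based scheme is an
# assumption for illustration, not a method prescribed by the paper.
import hashlib
from urllib.parse import urlparse

def assign_partition(url, num_crawlers):
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers        # partition id in [0, num_crawlers)

for u in ["http://example.com/a", "http://example.com/b", "http://example.org/x"]:
    print(u, "-> crawler", assign_partition(u, num_crawlers=4))
```

With dynamic assignment, the same kind of function could instead live inside the central coordinator, which hands out partitions to crawling processes as they become free.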
7. CRAWLING MODES FOR STATIC ASSIGNMENT IN MIGRATING PARALLEL
WEB CRAWLER
In this scheme of static assignment, each crawling process is responsible for a certain partition of
the Web and has to download pages within that partition. Some pages in a partition may have links
to pages in another partition; such links are referred to as inter-partition links.
Firewall mode: Each crawling process strictly downloads pages within its partition and does not
follow inter-partition links. In firewall mode all inter-partition links are thrown away or ignored.
Cross-over mode: In this mode each crawling process downloads pages within its partition, but it
follows inter-partition links once the links within its own partition are exhausted.
Exchange mode: In this mode crawling processes periodically exchange inter-partition links
instead of following them directly [16].
Table 1. Comparison of three Crawling Modes [16]

Mode        Coverage  Overlap  Quality  Communication
Firewall    Bad       Good     Bad      Good
Cross-Over  Good      Bad      Bad      Good
Exchange    Good      Good     Good     Bad
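To make the difference between the three modes concrete, the hypothetical helper below shows how a crawling process might treat one extracted link in each mode. The names `my_partition`, `partition_of` and `exchange_buffer` are illustrative, not from the paper; the logic simply restates the mode definitions above.

```python
# Illustrative handling of a single extracted link under the three
# static-assignment modes described above. All names are hypothetical.
def handle_link(link, my_partition, frontier, exchange_buffer,
                mode, partition_of, frontier_exhausted=False):
    if partition_of(link) == my_partition:
        frontier.append(link)                    # intra-partition links are always followed
    elif mode == "firewall":
        pass                                     # inter-partition links are discarded
    elif mode == "cross-over":
        if frontier_exhausted:
            frontier.append(link)                # followed only when own partition is exhausted
    elif mode == "exchange":
        exchange_buffer.append(link)             # periodically sent to the owning process
```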
8. EVALUATION MODELS IN MIGRATING PARALLEL WEB CRAWLER
This section deals with metrics that we define to quantify the advantages and disadvantages of a
migrating parallel crawler.
Coverage: The coverage is defined as I/U, where U is the total number of pages the migrating
parallel crawler has to download and I is the number of unique pages it actually downloads.
Overlap: The overlap is defined as (N−I)/I, where N is the total number of pages downloaded by
the crawler and I is the number of unique pages downloaded by the migrating parallel crawler.
Quality: The quality of the downloaded pages is defined as |AN∩PN|/|PN|, where PN is the set of
the N most important pages that an ideal crawler would download and AN is the set of N pages that
the actual migrating parallel crawler downloads, which may differ from PN.
Communication overhead: The communication overhead is defined as E/I, where I is the total
number of pages downloaded and E is the number of inter-partition URLs exchanged between
crawling processes.
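The four definitions above translate directly into code. The following sketch computes them from hypothetical post-crawl counts; the numbers in the example are made up and the function names are our own.

```python
# Computing the four evaluation metrics exactly as defined above.
def coverage(unique_pages, universe_size):
    return unique_pages / universe_size                        # I / U

def overlap(total_downloads, unique_pages):
    return (total_downloads - unique_pages) / unique_pages     # (N - I) / I

def quality(actual_top_n, ideal_top_n):
    return len(set(actual_top_n) & set(ideal_top_n)) / len(ideal_top_n)   # |AN ∩ PN| / |PN|

def communication_overhead(exchanged_urls, total_downloads):
    return exchanged_urls / total_downloads                    # E / I

# Hypothetical run: 9,000 unique pages out of a 10,000-page universe,
# 9,900 total downloads and 1,800 inter-partition URLs exchanged.
print(coverage(9000, 10000), overlap(9900, 9000), communication_overhead(1800, 9900))
```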
Table 2 summarizes different crawling techniques with their performance attributes. The crawling
processes of a migrating parallel crawler move to the resource that needs to be crawled in order to
take advantage of localized data access. After accessing a resource, the migrating parallel crawler
moves on to the next machine or server and sends the result of crawling to the central database in
compressed form. From the table it is evident that migrating parallel crawlers are more efficient in
terms of network load reduction, I/O performance, communication bandwidth, network load
dispersion, etc.
Table 2. Performance Attributes for Different Web Crawlers

Columns: (1) High-performance distributed web crawler, (2) Incremental crawler, (3) Parallel
crawler, (4) Agent-based crawler, (5) Migrating parallel crawler, (6) Domain-specific crawler,
(7) Mobile crawler, (8) Breadth-first crawling.

Attribute                  (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)
Robustness                 ++   --   --   --   ++   --   ++   --
Flexibility                ++   --   --   ++   ++   --   --   --
Manageability              ++   --   --   ++   ++   --   --   --
I/O performance            ++   --   --   ++   ++   --   --   --
Network resources          ++   --   --   --   ++   --   ++   --
OS limits                  ++   --   --   --   ++   --   ++   --
Higher performance         ++   --   --   --   ++   --   ++   --
Extensibility              --   ++   --   --   --   --   ++   --
Incremental crawling       --   ++   --   --   --   --   --   --
Reasonable cost            ++   --   --   --   --   --   --   ++
Overlapping                --   --   ++   ++   ++   ++   ++   --
Communication bandwidth    --   --   ++   --   ++   ++   ++   --
Network load dispersion    --   --   ++   --   ++   ++   ++   --
Network load reduction     --   --   ++   ++   ++   ++   ++   --
Freshness                  --   ++   --   --   --   --   --   --
Page rank                  --   --   --   --   --   --   --   ++
Scalability                --   --   --   --   --   ++   --   --
Load sharing               --   --   --   --   --   ++   ++   --
High quality               --   --   --   --   --   --   --   ++
Agent oriented             --   --   --   ++   ++   --   --   --
9. PROPOSED ARCHITECTURE OF MIGRATING PARALLEL WEB CRAWLER
A migrating parallel crawler consists of multiple crawling processes which migrate to the source of
data, for example a Web server or host, before the crawling process actually starts on that server or
host. Migrating parallel crawlers move to the resource and take advantage of localized data access.
After accessing a resource, the migrating parallel crawler migrates to the next host or server or back
to its home system. Each migrating crawling process performs the tasks of a single crawler: it
downloads pages from the Web server or host, stores the pages in its local database, extracts URLs
from the content of the downloaded pages and follows the extracted links. When the crawling
processes run on the same local area network and communicate through a high-speed
interconnection network, the system is known as an intra-site migrating parallel web crawler. When
the crawling processes run at geographically distant locations connected by the Internet or a wide
area network, it is known as a distributed migrating parallel web crawler. The architecture consists
of a central coordinator system and crawling processes.
9.1 Central Coordinator System
The central coordinator system consists of the Central Crawl Manager, Crawling Application, Web
Cache, Central Database or Repository, URL Dispatcher, URL Distributor and Domain Selector.
The central crawl manager is the core of the system; all other components are started by it and
register with it to offer or request services. The Central Crawl Manager's objective is to download
pages in the order specified by the crawling application. The central crawler has a list of URLs to
crawl. After obtaining the URLs, the central crawl manager queries the DNS resolvers for the IP
addresses of the hosts or servers. The central manager then takes into consideration the robots.txt
files in the root directory of the web servers. The central crawler maintains a set of available
crawling processes which have registered themselves with the central crawl manager and which are
logically migrated to different domain-specific locations. The central crawl manager assigns
different URLs to the crawling processes in a domain-specific manner; in turn the crawling
processes begin to crawl the received URLs in breadth-first fashion, and pages are downloaded by
the multi-threaded downloader. It is the responsibility of the crawling application
to check for duplicate pages. The crawling application considered in this paper uses a breadth-first
crawl strategy; breadth-first search yields high-quality pages, thus improving the quality of the
downloaded collection. The crawling process therefore crawls in a breadth-first manner. The DNS
resolver uses the GNU adns asynchronous DNS client library to access a DNS server, usually
collocated on the same machine. A web cache stores web resources in anticipation of future
requests, on the principle that a popular resource is likely to be requested again. The advantages of
the web cache are reduced network bandwidth, lower user-perceived delay and less load on the
origin servers; it therefore provides quick response times. When the requested information resides
in the web cache, the request is satisfied by the cache and the content no longer has to travel across
the Internet. The central database or repository stores the list of received URLs and the downloaded
web pages, which form the collection of documents downloaded by the crawling processes. Web
pages can be stored in compressed form. Each document is given a unique number called the
document identifier, denoted doc_id, and is stored along with its doc_id, document length and
URL. This helps with data consistency and makes data storage easier and more efficient. Data is
stored in the central database in a hierarchical tree manner.
Each parent node of the tree represents a domain and its child nodes represent sub-domains. Each
node consists of a domain name and its corresponding URLs. The database is continuously updated
for future use by adding new URLs. The central database communicates with the database of the
search engine. The URL dispatcher reads the URL database, retrieves URLs from the index table
maintained in the central repository and forwards them for crawling. The URL distributor classifies
the URLs on the basis of their domains and then distributes each URL to the crawling process
responsible for that domain; it also balances the load on the crawling processes. The domain
selector identifies the domains of the links extracted by the link extractor. The links belonging to a
specific domain are stored in the corresponding domain table. The domain selector also forwards
the domain-specific URLs to the URL dispatcher for further crawling.
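To make the central repository and the URL distributor described above more concrete, the sketch below stores documents in a domain → sub-domain tree keyed by doc_id and groups outgoing URLs by domain. The class and function names, and the simple two-label domain extraction, are our own assumptions for illustration; the paper describes the behaviour, not this code.

```python
# Illustrative sketch of the central repository (domain -> sub-domain tree keyed
# by doc_id) and of domain-based URL distribution. All names are hypothetical,
# and the two-label domain extraction is a simplification.
from collections import defaultdict
from itertools import count
from urllib.parse import urlparse

def domain_of(url):
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:])        # e.g. "news.example.com" -> "example.com"

class CentralRepository:
    def __init__(self):
        self._next_id = count(1)
        # domain -> sub-domain -> {doc_id: record}
        self.tree = defaultdict(lambda: defaultdict(dict))

    def store(self, url, compressed_page):
        doc_id = next(self._next_id)
        self.tree[domain_of(url)][urlparse(url).netloc][doc_id] = {
            "url": url,
            "length": len(compressed_page),
            "page": compressed_page,
        }
        return doc_id

def distribute_by_domain(urls):
    """Group URLs by domain so each domain-specific crawling process gets its own batch."""
    batches = defaultdict(list)
    for url in urls:
        batches[domain_of(url)].append(url)
    return batches
```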
Figure: Proposed Architecture of Migrating Parallel Crawler
9.2 Crawling Process
The crawling process consists of a Scheduler, New Ordered Queues, a Site Ordering Module, URL
Queues/KnownURLs, a Multi-threaded Downloader, a URL Collector, a Link Extractor and a Link
Analyzer. The scheduler provides the set of URLs to be downloaded, based on the modified page
rank; these URLs are stored in the new ordered queues. The new ordered queues store the set of
URLs ordered by modified page rank, and the site ordering module provides the modified page
rank of each page. KnownURLs are the set of already known URLs; they can be treated as seed
URLs. The multi-threaded downloader takes a URL from the URL collector and downloads the
corresponding web page, storing it in the local repository or database. The downloader component
fetches files from the web by opening up connections to different servers. The URL collector holds
the URLs obtained from the downloaded pages. The link extractor extracts the links from the
downloaded pages, and the link analyzer checks the extracted URLs: if a URL duplicates one
already seen, it is discarded and not forwarded for further processing. When the crawling process
runs it requires some memory space to save the downloaded web pages. Each crawling process has
its own local storage in which it saves the downloaded web pages.
This local storage is the storage area of the machine on which the crawling process is running. The
Ranking Module constantly scans the known URLs and the local database to make the refinement
decision: it refines the local database by discarding the less important pages to make space for new
pages. Locally collected URLs are the set of URLs in the local collection. The Update Module
keeps the local database fresh. In order to do so, the crawler has to select the page whose
re-download will increase the freshness most significantly; this decision is known as the update
decision.
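The two decisions just described can be sketched as follows. The eviction rule and the freshness estimate (staleness weighted by an assumed per-page change frequency) are our own illustrative choices; the paper names the decisions but does not fix how rank or change frequency are computed.

```python
# Illustrative refinement decision (ranking module) and update decision
# (update module). Rank values and change frequencies are assumed inputs.
import time

def refine(local_db, new_page, rank, capacity):
    """local_db maps url -> record with a 'rank'; evict the least important page if full."""
    if len(local_db) >= capacity:
        victim = min(local_db, key=lambda u: local_db[u]["rank"])
        if local_db[victim]["rank"] >= rank:
            return False                          # new page is less important: skip it
        del local_db[victim]                      # discard the less important page
    local_db[new_page["url"]] = {"rank": rank, **new_page}
    return True

def select_for_update(local_db, change_freq):
    """Pick the URL expected to raise freshness most: stale pages that change often."""
    now = time.time()
    def expected_gain(url):
        age = now - local_db[url].get("fetched_at", 0.0)
        return age * change_freq.get(url, 0.0)
    return max(local_db, key=expected_gain) if local_db else None
```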
10. EVALUATION METHOD
The performance of the proposed migrating parallel web crawler will be measured using the top
1000 URLs returned by the modified page update algorithm. The number of overlapping pages
within a specific domain should be minimal and the downloaded pages should be relevant. Fresh
pages will be obtained by using the incremental module. Less traffic on the network will be
ensured, and consequently less bandwidth will be consumed.
11. RESULT
The crawled pages will be domain specific. Fresh pages will be downloaded because the change
frequency of pages is taken into consideration. High-quality pages will be downloaded because the
crawling processes operate in a breadth-first manner, and breadth-first crawling improves the
quality of downloaded pages.
12. CONCLUSIONS
In this paper we proposed an Extended Model for Effective Migrating Parallel Web Crawling with
Domain Specific and Incremental Crawling. Domain-specific crawling will yield high-quality
pages. The crawling process will migrate to a host or server of a specific domain and start
downloading pages within that domain. Incremental crawling will keep the pages in the local
database fresh, thus increasing the quality of downloaded pages.
Research directions for the migrating parallel crawler include:
‱ Security could be introduced in migrating parallel crawlers
‱ Migrating parallel crawler could be made polite
‱ Location awareness could be introduced in migrating parallel crawlers
Future work will address the problem of quickly searching for and downloading data. The collected
data will be analyzed with the help of tables and graphs.
ACKNOWLEDGMENTS
First and foremost, our sincere thanks go to Prof. Syed Wasim Akhtar, Honorable Vice Chancellor,
Integral University, Lucknow. Prof. Akhtar has given us unconditional support in many aspects,
which enabled us to work in the exciting and challenging field of Software Engineering. We would
also like to give our special thanks to Prof. T. Usmani, Pro Vice Chancellor, Integral University.
His encouragement, help and care were remarkable during the past few years. We are also grateful
to Prof. S. M. Iqbal, Chief Academic Consultant, Integral University, who provided us with
valuable thoughts for our research work. Our gratitude also goes to Dr. Irfan Ali Khan, Registrar,
Integral University, for his constructive comments and shared experience.
REFERENCES
[1] D. Sullivan, “Search Engine Watch,” Mecklermedia, 1998.
[2] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Stanford
University, Stanford, CA, Technical Report, 1997.
[3] O. A. McBryan, “GENVL and WWW: Tools for Taming the Web,” in Proceedings of the First
International Conference on the World Wide Web, Geneva, Switzerland, 1994.
[4] B. Kahle, “Archiving the Internet,” Scientific American, 1996.
[5] J. Gosling and H. McGilton, “The Java Language Environment,” Sun Microsystems, Mountain View,
CA, White Paper, April 1996.
[6] J. E. White, Mobile Agents, MIT Press, Cambridge, MA, 1996.
[7] C. G. Harrison, D. M. Chess, and A. Kershenbaum, “Mobile Agents: Are they a good idea?,” IBM
Research Division, T.J. Watson Research Center, White Plains, NY, Research Report, September 1996.
[8] H. S. Nwana, “Software Agents: An Overview,” Knowledge Engineering Review, Cambridge
University Press, 11:3, 1996.
[9] M. Wooldridge, “Intelligent Agents: Theory and Practice,” Knowledge Engineering Review,
Cambridge University Press, 10:2, 1995.
[10] P. Maes, “Modeling Adaptive Autonomous Agents,” MIT Media Laboratory, Cambridge, MA,
Research Report, May 1994.
[11] P. Maes, “Intelligent Software,” Scientific American, 273:3, 1995.
[12] T. Finin, Y. Labrou, and J. Mayfield, “KQML as an agent communication language,” University of
Maryland Baltimore County, Baltimore, MD, September 1994.
[13] J. Hammer and J. Fiedler, “Using Mobile Crawlers to Search the Web Efficiently,” 2000.
[14] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “UbiCrawler: A Scalable Fully Distributed Web
Crawler,” 2002.
[15] A. K. Sharma, J. P. Gupta, and D. P. Aggarwal, “PARCAHYD: An Architecture of a Parallel Crawler
based on Augmented Hypertext Documents,” 2010.
[16] J. Cho and H. Garcia-Molina, “Parallel crawlers,” in Proceedings of the Eleventh International World
Wide Web Conference, 2002, pp. 124-135.
[17] A. Heydon and M. Najork, “Mercator: A scalable, extensible web crawler,” World Wide Web, vol. 2,
no. 4, pp. 219-229, 1999.
[18] A. Singh and K. K. Singh, “Faster and Efficient Web Crawling with Parallel Migrating Web
Crawler,” 2010.
[19] M. Wu and J. Lai, “The Research and Implementation of parallel web crawler in cluster,” in
International Conference on Computational and Information Sciences, 2010.
More Related Content

What's hot

A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyIOSR Journals
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...CloudTechnologies
 
“Web crawler”
“Web crawler”“Web crawler”
“Web crawler”ranjit banshpal
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerAkshay Pratap Singh
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsisMayur Garg
 
Farthest first clustering in links reorganization
Farthest first clustering in links reorganizationFarthest first clustering in links reorganization
Farthest first clustering in links reorganizationIJwest
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsisMayur Garg
 
Working of a Web Crawler
Working of a Web CrawlerWorking of a Web Crawler
Working of a Web CrawlerSanchit Saini
 
Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Sunny Gupta
 
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseUsing Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseIJwest
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web DataIRJET Journal
 
RabbitMQ Implementation as Message Broker in Distributed Application with RES...
RabbitMQ Implementation as Message Broker in Distributed Application with RES...RabbitMQ Implementation as Message Broker in Distributed Application with RES...
RabbitMQ Implementation as Message Broker in Distributed Application with RES...IJCSIS Research Publications
 

What's hot (17)

A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET Technology
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
 
“Web crawler”
“Web crawler”“Web crawler”
“Web crawler”
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web Crawler
 
Web crawling
Web crawlingWeb crawling
Web crawling
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
Hypermedia APIs
Hypermedia APIsHypermedia APIs
Hypermedia APIs
 
Farthest first clustering in links reorganization
Farthest first clustering in links reorganizationFarthest first clustering in links reorganization
Farthest first clustering in links reorganization
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Working of a Web Crawler
Working of a Web CrawlerWorking of a Web Crawler
Working of a Web Crawler
 
Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014
 
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseUsing Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web Data
 
RabbitMQ Implementation as Message Broker in Distributed Application with RES...
RabbitMQ Implementation as Message Broker in Distributed Application with RES...RabbitMQ Implementation as Message Broker in Distributed Application with RES...
RabbitMQ Implementation as Message Broker in Distributed Application with RES...
 

Similar to AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING

DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...ijmech
 
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...ijmech
 
E3602042044
E3602042044E3602042044
E3602042044ijceronline
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Techniqueijsrd.com
 
Smart Crawler Automation with RMI
Smart Crawler Automation with RMISmart Crawler Automation with RMI
Smart Crawler Automation with RMIIRJET Journal
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...ijwscjournal
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Denis Shestakov
 
Brief Introduction on Working of Web Crawler
Brief Introduction on Working of Web CrawlerBrief Introduction on Working of Web Crawler
Brief Introduction on Working of Web Crawlerrahulmonikasharma
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused webcsandit
 
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...ijwscjournal
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_pptManant Sweet
 
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document RetrievalAn Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document Retrievaliosrjce
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesijdkp
 
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET Journal
 
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningA Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningIJMTST Journal
 
The Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerThe Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerIRJESJOURNAL
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainseSAT Publishing House
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainseSAT Journals
 

Similar to AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING (20)

DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
 
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
 
E3602042044
E3602042044E3602042044
E3602042044
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Technique
 
Smart Crawler Automation with RMI
Smart Crawler Automation with RMISmart Crawler Automation with RMI
Smart Crawler Automation with RMI
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
 
Brief Introduction on Working of Web Crawler
Brief Introduction on Working of Web CrawlerBrief Introduction on Working of Web Crawler
Brief Introduction on Working of Web Crawler
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused web
 
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_ppt
 
G017254554
G017254554G017254554
G017254554
 
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document RetrievalAn Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPages
 
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
 
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningA Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
 
The Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerThe Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web Crawler
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domains
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domains
 

Recently uploaded

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)Dr. Mazin Mohamed alkathiri
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,Virag Sontakke
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 

Recently uploaded (20)

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,
à€­à€Ÿà€°à€€-à€°à„‹à€ź à€”à„à€Żà€Ÿà€Șà€Ÿà€°.pptx, Indo-Roman Trade,
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING

  • 1. International Journal on Web Service Computing (IJWSC), Vol.3, No.3, September 2012 DOI : 10.5121/ijwsc.2012.3308 85 AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING Md. Faizan Farooqui1 , Dr. Md. Rizwan Beg2 and Dr. Md. Qasim Rafiq3 1Research Scholar, Department of Computer Application, Integral University, Lucknow 226026, India faizan_farooqui2000@yahoo.com 2Professor & Head, Department of CS&E, Integral University, Lucknow 226026, India rizwanbeg@gmail.com 3Chairman, Department of Computer Engineering, Aligarh Muslim University, Aligarh, India mqrafiq@hotmail.com ABSTRACT The size of the internet is large and it had grown enormously search engines are the tools for Web site navigation and search. Search engines maintain indices for web documents and provide search facilities by continuously downloading Web pages for processing. This process of downloading web pages is known as web crawling. In this paper we propose the architecture for Effective Migrating Parallel Web Crawling approach with domain specific and incremental crawling strategy that makes web crawling system more ef- fective and efficient. The major advantages of migrating parallel web crawler are that the analysis portion of the crawling process is done locally at the residence of data rather than inside the Web search engine repository. This significantly reduces network load and traffic which in turn improves the performance, ef- fectiveness and efficiency of the crawling process. The another advantage of migrating parallel crawler is that as the size of the Web grows, it becomes necessary to parallelize a crawling process, in order to finish downloading web pages in a comparatively shorter time. Domain specific crawling will yield high quality pages. The crawling process will migrate to host or server with specific domain and start downloading pages within specific domain. Incremental crawling will keep the pages in local database fresh thus in- creasing the quality of downloaded pages. Keywords Web crawling, parallel migrating web crawler, search engine, 1. INTRODUCTION The Internet is a global system of interconnected computer networks. The searching and indexing tasks for the web are currently handled from specialized web applications called search engines. The modern search engines can be divided into three parts they are the publically available search engine, the data- base and the web crawling system. A web crawler is an automated program that browses the Web in a methodological manner. The process of traversing the web is called Web crawling. Web crawler starts with a queue of known URLs to visit. As it visits these pages, it scans them for links to other web pages. The Web Crawler consists of Crawler Manager, Ro- bot.txt downloader, Spider and Link Extractor. A Crawl Manager takes a set of URLs from Link
  • 2. International Journal on Web Service Computing (IJWSC), Vol.3, No.2, September 2012 86 extractor and sends the next URLs to the DNS resolver to obtain its IP address. Robot.txt file are the means by which web authors express their wish as to which pages they want the crawler to avoid. Link extractor extract URLs from the links in the downloaded pages and sends the URLs to the crawler manager for downloading afterwards. 2. LITERATURE SURVEY In [13], the author demonstrated an efficient approach to the “download-first process-later” strat- egy of existing search engines by using mobile crawlers. In [14] author has implemented UbiCrawler, a scalable distributed and fault-tolerant web crawler. In [15] author presented the ar- chitecture of PARCAHYD which is an ongoing project aimed at designing of a Parallel Crawler based on Augmented Hypertext Documents. In [16] the author studied how an effective parallel crawler is designed. As the size of the Web grows, it becomes imperative to parallelize a crawling process. In [17] the author proposes Mercator, which is a scalable, extensible crawler. [18] Pre- sented Google, a prototype of a large scale search engine which makes heavy use of the structure present in hypertext. [19] Aims at designing and implementing a parallel migrating crawler in which the work of a crawler is divided amongst a number of independent. 3. MOTIVATION: PROBLEMS IN GENERIC WEB CRAWLERS According to [1], Web crawlers of big commercial search engines crawl up to 10 million pages per day. Assuming an average page size of 6K [2], the crawling activities of a single commercial search engine adds a daily load of 60GB to the Web. Scaling Issues: One of the first Web search engines, the World Wide Web Worm [3], was intro- duced in 1994 and used an index of 110,000 Web pages. Big commercial search engines in 1998 claimed to index up to 110 million pages [1]. The Web is expected to grow further at an exponen- tial speed, doubling its size (in terms of number of pages) in less than a year [4]. Efficiency Issues: Current Web crawlers download all these irrelevant pages because traditional crawling techniques cannot analyze the page content prior to page download [13]. Index Quality Issues: Current commercial search engines maintain Web indices of up to several hundred million pages [1]. 4. MIGRATING PARALLEL WEB CRAWLER The Migrating Parallel Crawler system consists of Central Crawler, Crawl Frontiers, and Local Database of each Crawl Frontier and Centralized Database. It is responsibility of central crawler to receiving the URL input from the applications and forwards the URLs to the available migrat- ing crawling process. Crawling process migrated to different machines to increase the system per- formance. Local database of each crawl frontier are buffers that locally collect the data. This data is transferred to the central crawler after compression and filtering which reduces the network bandwidth overhead. The central crawler has a centralized database which will contain the docu- ments collected by the crawl frontiers independently. The main advantages of the migrating paral- lel crawler approach are Localized Data Access, Remote Page Selection, Remote Page Filtering, Remote Page Compression, Scalability, Network-load dispersion, Network-load reduction.
  • 3. International Journal on Web Service Computing (IJWSC), Vol.3, No.2, September 2012 87 5. ISSUES IN MIGRATING PARALLEL WEB CRAWLER The following issues are important in the study of a migrating parallel crawler interesting: Communication bandwidth: Communication is really important so as to prevent overlap, or to improve the quality of the downloaded content, crawling processes need to communicate and co- ordinate with each other to improve quality thus consuming valuable communication bandwidth. Quality: Prime objective of migrating parallel crawler is that to download the “important” pages first to improve the “quality” of the downloaded pages. Overlap: When multiple processes run in parallel to download pages, it is possible that different processes download the same page multiple times. One process may download the same page that other process may have already downloaded, one process may not be aware that another process has downloaded the same page. 6. COORDINATION IN MIGRATING PARALLEL WEB CRAWLER The coordination is done in order to prevent overlap and also crawling processes need to coordi- nate with each other so that the correct and quality pages are downloaded. This coordination can be achieved in following ways: Independent: In this method Crawling processes may download pages independent of each other without any coordination. The downloaded pages may overlap in this case, but we assume that overlap is not significant enough. Static assignment: In this method the Web is partitioned into sub partitions and each sub partition is assigned to each Crawling processes before the actual crawling process may begin. This type of arrangement is called static assignment. Dynamic assignment: In this coordination scheme there is a central coordinator that divides the Web into small partitions and each partition is dynamically assigned to a Crawling processes for further downloading. 7. CRAWLING MODES FOR STATIC ASSIGNMENT IN MIGRATING PARALLEL WEB CRAWLER In this scheme of static assignment, each Crawling processes are responsible for a certain parti- tion of the Web and have to download pages within the partition. There may be possibility that some pages in the partition may have links to pages in another partition. Such types of links are referred to as an inter-partition link. Firewall mode: Each Crawling processes strictly download the pages within its partition and in- ter-partition links are not followed. In firewall mode all inter-partition links are either thrown away or ignored. Cross-over mode: In this mode each Crawling processes download pages within its partition. It follows inter-partition links when links within its partitions are exhausted.
  • 4. International Journal on Web Service Computing (IJWSC), Vol.3, No.2, September 2012 88 Exchange mode: In this mode Crawling processes periodically exchange inter-partition links. Crawling processes are not allowed to follow inter-partition links [16]. Table 1. Comparison of three Crawling Modes[16] Mode Coverage overlap Quality Communication Firewall Bad Good Bad Good Cross-Over Good Bad Bad Good Exchange Good Good Good Bad 8. EVALUATION MODELS IN MIGRATING PARALLEL WEB CRAWLER This section deals with metrics that we define to quantify the advantages or disadvantages of mi- grating parallel crawler. Coverage: the coverage is defined as I/U, where U is the number of pages downloaded by migrat- ing parallel crawler and I is number of unique pages downloaded by the migrating parallel crawler. Overlap: The overlap is defined as (N−I) /I. Where, N is number of pages downloaded by the crawler, and I is the number of unique downloaded pages by migrating parallel crawler. Quality: The quality of downloaded pages is defined as |AN∩PN|/|PN|. The migrating parallel crawler downloads the important N pages, and PN is that set of N pages. AN is set of N pages that an actual migrating parallel crawler would download, which is different from PN. Communication overhead: The communication overhead is defined as E/I. Where I in total pages downloaded and E represents exchanged inter-partition URLs. Table 2 summarizes different crawling techniques with their performance attributes. Crawling process of migrating parallel crawlers move to the resource which needs to be crawled to take the advantage of localized data access. After accessing a resource, migrating parallel crawler move on to the next machine or server and send result of crawling to central database in compressed form. From the table it is evident that Migrating parallel crawlers are more efficient in terms of network load reduction, I/O performance, reducing communication bandwidth, network load dis- persion etc. Table 2. Performance Attributes for Different Web Crawlers High per- formance distributed web crawler Incremental crawler Parallel crawler Agent based Migrating parallel Crawler Domain Specific Crawler Mobile Crawler Breadth First Crawling Robustness ++ -- -- -- ++ -- ++ -- Flexibility ++ -- -- ++ ++ -- -- -- Manageability ++ -- -- ++ ++ -- -- -- I/O Perform- ance ++ -- -- ++ ++ -- -- -- Network re- sources ++ -- -- -- ++ -- ++ -- OS limits ++ -- -- -- ++ -- ++ --
9. PROPOSED ARCHITECTURE OF MIGRATING PARALLEL WEB CRAWLER

A migrating parallel crawler consists of several crawling processes which migrate to the source of data, for example a Web server or host, before the crawling process actually starts on that server. Migrating parallel crawlers move to the resource and take advantage of localized data access. After accessing a resource, a migrating crawling process migrates to the next host or server, or back to its home system. Each migrating crawling process performs the tasks of a single crawler: it downloads pages from the Web server or host, stores the pages in the local database, extracts URLs from the content of the downloaded pages, and follows the extracted links. When the crawling processes run on the same local area network and communicate through a high-speed interconnection network, the system is known as an intra-site migrating parallel web crawler. When the crawling processes run at geographically distant locations connected by the Internet or a wide area network, it is known as a distributed migrating parallel web crawler. The architecture consists of a central coordinator system and the crawling processes.

9.1 Central Coordinator System

The central coordinator system consists of the Central Crawl Manager, Crawling Application, Web Cache, Central Database or Repository, URL Dispatcher, URL Distributor, and Domain Selector. The central crawl manager is the core of the system: all other components are started by it and register with it to offer or request services. Its objective is to download pages in the order specified by the crawling application. The central crawl manager holds a list of URLs to crawl; after obtaining the URLs, it queries the DNS resolvers for the IP addresses of the hosts or servers and then takes into consideration the robots.txt files in the root directory of each web server. The central crawl manager maintains a set of available crawling processes which have registered themselves with it and which are logically migrated to different domain-specific locations. It assigns URLs to the crawling processes in a domain-specific manner; in turn, the crawling processes crawl the received URLs in breadth-first fashion, and pages are downloaded by the multithreaded downloader.
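As an illustration only (the paper does not specify data structures or interfaces, so every name below is an assumption), a central crawl manager along these lines might register crawling processes and hand out URLs grouped by domain as follows.

```python
from collections import defaultdict
from urllib.parse import urlparse

class CentralCrawlManager:
    """Hypothetical central coordinator: registers crawling processes and
    assigns pending URLs to them in a domain-specific manner."""

    def __init__(self):
        self.processes = {}                  # domain -> registered crawling process
        self.pending = defaultdict(list)     # domain -> URLs waiting to be assigned

    def register(self, domain, process):
        """A crawling process announces that it handles the given domain."""
        self.processes[domain] = process

    def enqueue(self, url):
        """The URL dispatcher hands over a URL; it is grouped by its host/domain."""
        domain = urlparse(url).netloc.lower()
        self.pending[domain].append(url)

    def dispatch(self):
        """Send each domain's pending URLs to the crawling process registered for it."""
        for domain, urls in list(self.pending.items()):
            process = self.processes.get(domain)
            if process is not None and urls:
                process.crawl_batch(urls)    # assumed method on the crawling process
                self.pending[domain] = []
```

Here a crawling process is any object exposing a crawl_batch(urls) method; in the proposed architecture that call would correspond to the process, already migrated to the host serving the domain, crawling its assigned batch in breadth-first order.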
It is the responsibility of the crawling application to check for duplicate pages. The crawling application considered in this paper uses a breadth-first crawl strategy; breadth-first search yields high-quality pages, improving the quality of the downloaded collection, and the crawling processes therefore crawl in breadth-first manner. The DNS resolver uses the GNU adns asynchronous DNS client library to access a DNS server, usually collocated on the same machine.

A web cache stores web resources in anticipation of future requests. It works on the principle that the most popular resources are likely to be requested again in the future. The advantages of a web cache are reduced network bandwidth, lower user-perceived delay, and less load on the origin servers; it also provides quick response times. When the requested information resides in the web cache, the request is satisfied by the cache and the content no longer has to travel across the Internet.

The central database or repository stores the list of URLs received and the downloaded web pages, which form the collection of documents downloaded by the crawling processes. Web pages can be stored in compressed form. Each document is given a unique number called the document identifier, denoted doc_id, and is stored along with its doc_id, document length, and URL. This helps with data consistency and makes data storage easy and efficient. Data is stored in the central database in a hierarchical tree: each parent node of the tree is a domain and its child nodes are sub-domains, and each node holds a domain name and its corresponding URLs (a minimal sketch of such a tree is given at the end of this subsection). The database is continuously updated for future use by adding new URLs, and it communicates with the database of the search engine.

The URL dispatcher reads the URL database, retrieves URLs from the index table maintained in the central repository, and forwards them for crawling. The URL distributor classifies the URLs on the basis of domains and distributes each URL to the concerned domain-specific crawling process for crawling; it also balances the load across the crawling processes. The domain selector identifies the domains of the links extracted by the link extractor; links belonging to a specific domain are stored in the corresponding domain table. The domain selector also forwards the domain-specific URLs to the URL dispatcher for further crawling.
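The hierarchical domain/sub-domain tree described above is not given a concrete representation in the paper; the following minimal sketch (names and structure are our assumptions) stores URLs under nested domain nodes.

```python
class DomainNode:
    """One node of the hypothetical domain tree: a domain or sub-domain name,
    the URLs recorded for it, and its child sub-domains."""

    def __init__(self, name):
        self.name = name
        self.urls = []        # URLs belonging directly to this domain/sub-domain
        self.children = {}    # sub-domain name -> DomainNode

    def add_url(self, domain_path, url):
        """Insert a URL under a path such as ["example.com", "news"]."""
        if not domain_path:
            self.urls.append(url)
            return
        head, rest = domain_path[0], domain_path[1:]
        child = self.children.setdefault(head, DomainNode(head))
        child.add_url(rest, url)

# Example: file a URL of the sub-domain news.example.com under its parent domain.
root = DomainNode("root")
root.add_url(["example.com", "news"], "http://news.example.com/story1.html")
```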
Figure: Proposed Architecture of Migrating Parallel Crawler

9.2 Crawling Process

The crawling process consists of the Scheduler, New Ordered Queues, Site Ordering Module, URL Queues/KnownURLs, Multithreaded Downloader, URL Collector, Link Extractor, and Link Analyzer. The scheduler provides the set of URLs to be downloaded based on the modified page rank; these URLs are stored in the new ordered queues. The new ordered queues hold the set of URLs ordered by modified page rank, and the site ordering module provides the modified page rank of each page. KnownURLs are the set of already known URLs, which can be treated as seed URLs. The multithreaded downloader takes a URL from the URL collector and downloads the corresponding web page, storing it in the local repository or database; the downloader component fetches files from the web by opening multiple connections to different servers. The URL collector keeps the URLs from the downloaded pages. The link extractor extracts the links from the downloaded pages, and the link analyzer checks the extracted URLs: if there is similarity with an already known URL, the URL is discarded and not forwarded for further processing. When the crawling process runs, it requires memory space to save the downloaded web pages, so each crawling process has its own local storage in which it saves the downloaded pages. This is the storage area of the machine on which the crawling process is running.

The Ranking Module constantly scans the known URLs and the local database to make the refinement decision: it refines the local database by discarding the less important pages to make space for new pages. LocallyCollectedURLs are the set of URLs in the local collection. The Update Module keeps the local database fresh; to do so, the crawler has to select the page whose re-crawl will increase the freshness most significantly, a decision known as the update decision.
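As an illustration of the update decision only (the paper does not specify the estimator; the per-page change-frequency model used here is our assumption), a crawling process could pick the page whose estimated loss of freshness is largest:

```python
import time

def staleness_score(page):
    """Estimate how likely a page is to have changed since it was last crawled,
    using an assumed per-page change frequency (changes per day)."""
    days_since_crawl = (time.time() - page["last_crawled"]) / 86400.0
    return page["change_frequency"] * days_since_crawl

def select_page_to_refresh(local_db):
    """Update decision: re-crawl the page expected to improve freshness the most."""
    return max(local_db, key=staleness_score)

# Example usage with two hypothetical records from the local database.
local_db = [
    {"url": "http://example.com/a", "last_crawled": time.time() - 2 * 86400, "change_frequency": 0.5},
    {"url": "http://example.com/b", "last_crawled": time.time() - 1 * 86400, "change_frequency": 3.0},
]
print(select_page_to_refresh(local_db)["url"])   # picks the page most likely to be stale
```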
10. EVALUATION METHOD

The performance of the proposed migrating parallel web crawler will be measured using the top 1000 URLs returned by the modified page update algorithm. A minimal number of overlapping pages within a specific domain and a higher proportion of relevant pages will be considered, and the freshness of pages will be assessed using the incremental module. Less traffic on the network will be ensured, and less bandwidth will be consumed as less traffic travels over the network.

11. RESULT

The crawled pages will be domain specific. Fresh pages will be downloaded, as change frequency is taken into consideration. High-quality pages will be downloaded because the crawling processes operate in breadth-first manner, and breadth-first crawling improves the quality of the downloaded pages.

12. CONCLUSIONS

In this paper we proposed an Extended Model for Effective Migrating Parallel Web Crawling with Domain Specific and Incremental Crawling. Domain-specific crawling will yield high-quality pages: the crawling process migrates to a host or server of a specific domain and downloads pages within that domain. Incremental crawling will keep the pages in the local database fresh, thus increasing the quality of the downloaded pages. Research directions for migrating parallel crawlers include:
• Security could be introduced in migrating parallel crawlers
• Migrating parallel crawlers could be made polite
• Location awareness could be introduced in migrating parallel crawlers
Future work will address the problem of quickly searching and downloading the data; the data will be collected and analyzed with the help of tables and graphs.

ACKNOWLEDGMENTS

First and foremost, our sincere thanks go to Prof. Syed Wasim Akhtar, Honorable Vice Chancellor, Integral University, Lucknow. Prof. Akhtar has given us unconditional support in many aspects, which enabled us to work in the exciting and challenging field of Software Engineering. We would also like to give our special thanks to Prof. T. Usmani, Pro Vice Chancellor, Integral University; his encouragement, help, and care have been remarkable during the past few years. We are also grateful to Prof. S. M. Iqbal, Chief Academic Consultant, Integral University, who provided us with valuable thoughts for our research work. Our gratitude also goes to Dr. Irfan Ali Khan, Registrar, Integral University, for his constructive comments and shared experience.
REFERENCES

[1] D. Sullivan, "Search Engine Watch," Mecklermedia, 1998.
[2] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Stanford University, Stanford, CA, Technical Report, 1997.
[3] O. A. McBryan, "GENVL and WWW: Tools for Taming the Web," in Proceedings of the First International Conference on the World Wide Web, Geneva, Switzerland, 1994.
[4] B. Kahle, "Archiving the Internet," Scientific American, 1996.
[5] J. Gosling and H. McGilton, "The Java Language Environment," Sun Microsystems, Mountain View, CA, White Paper, April 1996.
[6] J. E. White, Mobile Agents, MIT Press, Cambridge, MA, 1996.
[7] C. G. Harrison, D. M. Chess, and A. Kershenbaum, "Mobile Agents: Are they a good idea?," IBM Research Division, T. J. Watson Research Center, White Plains, NY, Research Report, September 1996.
[8] H. S. Nwana, "Software Agents: An Overview," Knowledge Engineering Review, Cambridge University Press, 11:3, 1996.
[9] M. Wooldridge, "Intelligent Agents: Theory and Practice," Knowledge Engineering Review, Cambridge University Press, 10:2, 1995.
[10] P. Maes, "Modeling Adaptive Autonomous Agents," MIT Media Laboratory, Cambridge, MA, Research Report, May 1994.
[11] P. Maes, "Intelligent Software," Scientific American, 273:3, 1995.
[12] T. Finin, Y. Labrou, and J. Mayfield, "KQML as an agent communication language," University of Maryland Baltimore County, Baltimore, MD, September 1994.
[13] J. Hammer and J. Fiedler, "Using Mobile Crawlers to Search the Web Efficiently," 2000.
[14] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler," 2002.
[15] A. K. Sharma, J. P. Gupta, and D. P. Aggarwal, "PARCAHYDE: An Architecture of a Parallel Crawler based on Augmented Hypertext Documents," 2010.
[16] J. Cho and H. Garcia-Molina, "Parallel crawlers," in Proceedings of the Eleventh International World Wide Web Conference, 2002, pp. 124-135.
[17] A. Heydon and M. Najork, "Mercator: A scalable, extensible web crawler," World Wide Web, vol. 2, no. 4, pp. 219-229, 1999.
[18] A. Singh and K. K. Singh, "Faster and Efficient Web Crawling with Parallel Migrating Web Crawler," 2010.
[19] M. Wu and J. Lai, "The Research and Implementation of parallel web crawler in cluster," in International Conference on Computational and Information Sciences, 2010.