An Efficient Cloud based Approach for Service Crawling

Short Paper
ACEEE Int. J. on Information Technology, Vol. 3, No. 1, March 2013

Chandan Banerjee 1,2, Anirban Kundu 2,3, Sumon Sadhukhan 1, Rana Dattagupta 4
1 Netaji Subhash Engineering College, Kolkata 700152, India
{chandanbanerjee1, sumon.sadhukhan8}@gmail.com
2 Innovation Research Lab (IRL), Howrah, West Bengal 711103, India
anik76in@gmail.com
3 Kuang-Chi Institute of Advanced Technology, Shenzhen 518057, P.R. China
anirban.kundu@kuang-chi.org
4 Jadavpur University, Kolkata 700032, India
rdattagupta@cse.jdvu.ac.in

Abstract: In this paper, we design a crawler that searches for services provided by different clouds connected in a network. The proposed method provides details of the freshness and age of cloud clusters. The crawler checks each router available in a network providing services. On the basis of the search criteria, our design generates output that guides users to access the requested cloud services in an efficient manner. We plan to store the result in an m-way tree and to use a traversal technique to extract specific data from the crawling result. We have compared the result with other typical search techniques.

Index Terms: cloud crawler, service crawling, cloud search, freshness, age

I. INTRODUCTION
In modern life, the usage of the cloud is growing rapidly. A cloud user typically relies on specific services. Web search engines [1] crawl the web and update information world-wide. Nowadays, Internet users are switching from single services to cloud services, which requires higher availability of cloud services. Web crawlers [2] fetch web pages and cache them in their database. Every crawler stores the crawled result in its database, and the result is searched when it is needed. Search engines [3] are often compared with one another in terms of time complexity and space complexity. The freshness and age of the crawled result are also important. A cloud crawler [4] works with the Internet Protocol (IP) addresses of a cache stored in a tree structure. Hosts are visited using specific threads for specific networks.

Frequently, one needs to maintain local copies of remote data sources for better performance or availability. For example, a Web search engine copies a significant subset of the Web and maintains copies or indexes of the pages to help users access relevant information. In this situation, a part of the local copy may become out of date because changes at the sources are not immediately propagated to the local copy. Therefore, it becomes important to design a good refresh policy that maximizes the "freshness" of the local copy. As cloud services grow larger, it becomes more important to refresh the data effectively.

One critical challenge in the surfacing approach is how a crawler can automatically generate promising queries so that it can carry out efficient surfacing. The challenge has been studied by several researchers, such as [5], [6], [7], [8], [9]. In these methods, candidate query keywords are generated from the obtained records.

Section II shows our proposed framework and the corresponding approach. Experimental analyses are presented in Section III. Section IV concludes the paper.

II. FRAMEWORK
We consider several nodes connected to each other in a network. Clusters are formed from several nodes providing distinct services. The head node is also connected to the network. A cluster may contain private networks recursively. The crawler reaches the end points, takes information from them, and sends it to the head node. Node A stores the whole result. Boxes indicate networks. A network may have a sub-network.

In the second section, we use an m-way tree traversal technique so that we can reach the destination with minimum path length. In the last section, we show how the technique is efficient in comparison with other searching algorithms. To assess the efficiency of the algorithm, we need to understand the freshness and age of a crawler. Every crawler has to update its database quickly and produce efficient results. The terms freshness and age refer to the database.

A. Freshness and Age
A cloud service database is called 'fresher' when it holds more up-to-date information than those of other crawlers. For instance, if a crawler crawls more nodes than other crawlers, then it is fresher. If a crawler shows a result obtained 5 min ago, then 5 min is its age.
1. Freshness
Let $S = \{n_1, n_2, n_3, \ldots, n_N\}$ be the set of nodes in the network, where $n_1, n_2, \ldots$ are nodes and $N$ is the number of elements. $D_1, D_2, \ldots, D_N$ are the services stored on the respective nodes. The total freshness of the crawler is
$$\mathrm{Freshness}(t) = \frac{1}{N} \sum_{i=1}^{N} F(n_i, t),$$
where
$$F(n_i, t) = \begin{cases} 0 & \text{if not updated} \\ 1 & \text{if updated at time } t \end{cases}$$
2. Age
Let $\{T_1, T_2, \ldots, T_n\}$ be the set of times at which the information about

© 2013 ACEEE 61
DOI: 01.IJIT.3.1.1114

the specific node is taken into account. The current time is $T$. Then, the age of the node is $T - T_n$.
At time $t$, if the age of an element is $A_i$, then
$$A_i = \begin{cases} 0 & \text{if it is updated at } t \\ T_i - T_{i-1} & \text{if it is not updated at } t \end{cases}$$
The total age of the crawler is
$$A(S, t) = \frac{1}{N} \sum_{i=1}^{N} A_i$$
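
The two measures above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the age of a stale node is assumed to be the current time minus the time of its last update (the "T - T_n" reading given above), so a node refreshed at the current time contributes an age of zero.

```python
def freshness(updated_flags):
    """Fraction of nodes whose stored information is up to date.

    updated_flags[i] plays the role of F(n_i, t): True (1) if node i was
    updated at time t, False (0) otherwise.
    """
    if not updated_flags:
        raise ValueError("at least one node is required")
    return sum(1 for flag in updated_flags if flag) / len(updated_flags)


def average_age(last_update_times, now):
    """Mean age over all nodes.

    Assumption: the age of node i is now - last_update_times[i], which is
    zero for a node that was refreshed at `now`.
    """
    if not last_update_times:
        raise ValueError("at least one node is required")
    ages = [now - t for t in last_update_times]
    return sum(ages) / len(ages)


# Four nodes observed at t = 10 (minutes); nodes 1 and 4 were just refreshed.
flags = [True, False, False, True]
times = [10, 5, 0, 10]
print(freshness(flags))          # 0.5
print(average_age(times, 10))    # 3.75
```

In this example, half of the nodes are fresh, and the two stale nodes (5 and 10 minutes old) give a mean age of 3.75 minutes over the four nodes.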
A cloud crawler is used to fetch services in order to create a framework for a cloud service crawler engine using proper indexing methodologies. A crawler for a specific service is a program that extracts outward Web links (URLs) and adds them to a list after processing. Thus, a cloud service crawler is a program which fetches as many relevant services as possible for specific users. It uses the Web link structure, in which the order of the list is important because only high-quality Web pages are considered relevant. Fig. 1 shows the proposed service based cloud crawler. Here, an element insertion means that the element is inserted at the pointer location within the m-way tree. A special traversal technique is utilized for visiting all the nodes within each network or sub-network. Each node is selected twice; the second time, it is actually popped from the stack. An advantage of our algorithm is that data need not be stored in the client node. The result is sent directly to the crawler server after scanning a single node.

Fig. 1. Flowchart of Service based Cloud Crawler

B. Sample Procedure of a Sample Network
Fig. 2 shows an arbitrary cloud cluster. There are four network clusters in total within the cloud. Circular boxes indicate the clusters and rectangular boxes indicate the resources of each cluster network. Table I shows the result, which is based on our proposed approach as shown in our previous work [1].

Fig. 2. Arbitrary Cloud Cluster Scenario

At crawling run time, a hash table is built that maps each node to the number (IP address) of a resource in the cloud network, as shown in Table II. Our proposed search approach is presented in subsection E. The sample network is crawled using the proposed method, as shown in Table I.

TABLE I. PROPOSED APPROACH BASED ON FIG. 2
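
The stack-based traversal described earlier in this section, in which each node is selected once when it is pushed and again when it is popped, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `crawl` function, the `children` dictionary, and the `send` callback (standing in for "send the result to the crawler server") are hypothetical names, and the example tree is invented rather than taken from Fig. 2.

```python
def crawl(children, root, send):
    """Visit every node reachable from `root` using an explicit stack.

    children: dict mapping a node to the list of nodes in its sub-network
              (assumed representation).
    send:     callback invoked once per scanned node, so nothing has to be
              kept on the scanned node itself.
    Returns the order in which nodes were scanned.
    """
    stack = [root]   # first selection: the node is pushed
    seen = {root}
    order = []
    while stack:
        node = stack.pop()   # second selection: the node is popped and scanned
        send(node)           # the result goes straight to the server
        order.append(node)
        # Push children in reverse so the leftmost child is scanned first.
        for child in reversed(children.get(node, [])):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return order


# Hypothetical network: node 1 is the head node with two sub-networks.
example = {1: [2, 9], 2: [3, 4], 9: [10, 11], 11: [12, 13]}
received = []
print(crawl(example, 1, received.append))
# [1, 2, 3, 4, 9, 10, 11, 12, 13]
```

Passing the server connection as a callback keeps the traversal independent of how results are transported.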

C. Hash Table
The hash table is generated from the mapping between each node and the number (IP address) of the corresponding resource in the cloud network. Table II is created using real-time crawling.
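
A minimal Python sketch of such a table is given below. It assumes, purely for illustration, that each crawled resource is reported as a (network name, IP address) pair and that node numbers are assigned in order of first appearance; the function name and the sample addresses are hypothetical and are not the contents of Table II.

```python
def build_hash_table(crawled):
    """Assign a unique node number to every (network, ip) pair.

    crawled: iterable of (network, ip) tuples in crawl order (assumed format).
    Returns a dict mapping (network, ip) -> node number, starting from 1.
    """
    table = {}
    for network, ip in crawled:
        key = (network, ip)
        if key not in table:            # a resource seen again keeps its number
            table[key] = len(table) + 1
    return table


# The same private IP address appears in two different networks.
crawled = [
    ("net-A", "192.168.0.1"),
    ("net-B", "192.168.0.1"),
    ("net-A", "192.168.0.2"),
    ("net-A", "192.168.0.1"),   # revisited resource
]
table = build_hash_table(crawled)
print(table[("net-A", "192.168.0.1")])  # 1
print(table[("net-B", "192.168.0.1")])  # 2
```

Keying on the pair rather than on the IP address alone is what lets two machines with the same address in different networks receive different numbers.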
TABLE II. HASH TABLE BASED ON TABLE I

D. Indexing Result
The crawler finishes searching the cloud and then stores the result in an m-way tree using Table II, as shown in Fig. 3.

Fig. 3. M-Way tree

E. Search Approach
The algorithm described in Fig. 4 is used to reach any node using the crawling result. Suppose that Node 13 is to be visited at a particular time instance. Table III shows the different steps to search for Node 13.

TABLE III. PROPOSED SEARCH APPROACH

Fig. 4. Flow chart to reach any node using Fig. 2

The shortest path to reach Node 13 is {1 9 11 13}.
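
One way to recover such a root-to-node path from a stored tree is to follow parent links from the target back to the root. The Python sketch below illustrates that idea only; it is not the flow chart of Fig. 4. The `path_to` function and the `parent` dictionary are hypothetical, and the tree shape is an assumption chosen so that the path to Node 13 matches the one stated above.

```python
def path_to(parent, target):
    """Return the list of nodes from the root down to `target`.

    parent: dict mapping each node number to its parent (None for the root).
    Only the ancestors of `target` are touched; siblings are never visited.
    """
    if target not in parent:
        raise KeyError(f"node {target} is not in the crawl result")
    path = [target]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path


# Assumed tree: 1 is the root; 9 and 11 lead down to node 13.
parent = {1: None, 2: 1, 9: 1, 3: 2, 4: 2, 10: 9, 11: 9, 12: 11, 13: 11}
print(path_to(parent, 13))   # [1, 9, 11, 13]
```

The number of nodes touched equals the depth of the target, independent of how many siblings each level contains.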

III. EXPERIMENTAL ANALYSIS
The time complexities [10], [11] of DFS and BFS are O(|V| + |E|), where V is the set of vertices and E is the set of edges of the graph.
A. Best Case Scenario
1) Breadth First Search (BFS)
Total number of nodes visited = MN, where M = average number of machines present in every network and N = level of the tree.
2) Depth First Search (DFS)
Total number of nodes visited = N, where N = level of the tree.
3) Based on our Proposed Algorithm
Total number of nodes visited = N, where N = level of the tree.
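
As an illustrative check (not the experiment reported in this paper), the following Python sketch counts how many nodes BFS and DFS touch before reaching a target in a complete M-ary tree, and compares this with the length of the root-to-target path. The heap-style node numbering, the function names, and the convention that the root is at level 0 are all assumptions made for the sketch, so the counts may differ by one from the level-based figures above.

```python
from collections import deque


def children(k, m, total):
    """Children of node k in a complete m-ary tree numbered level by level."""
    return [c for c in range(m * k + 1, m * k + m + 1) if c < total]


def count_visits(m, levels, target, breadth_first):
    """Number of nodes taken from the frontier until `target` is reached."""
    total = sum(m ** d for d in range(levels + 1))   # levels 0..levels
    frontier = deque([0])
    visited = 0
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        visited += 1
        if node == target:
            return visited
        kids = children(node, m, total)
        # For DFS, push in reverse so the leftmost child is expanded first.
        frontier.extend(kids if breadth_first else reversed(kids))
    raise ValueError("target not in tree")


def path_length(m, target):
    """Number of nodes on the path from the root to `target`."""
    length = 1
    while target:
        target = (target - 1) // m
        length += 1
    return length


# Binary tree with levels 0..2 (7 nodes); node 3 is the leftmost leaf.
print(count_visits(2, 2, 3, breadth_first=True))    # 4
print(count_visits(2, 2, 3, breadth_first=False))   # 3
print(path_length(2, 3))                            # 3
```

For the leftmost leaf, DFS touches exactly the nodes on the path, while BFS must first exhaust the upper levels; for the rightmost leaf (node 6 in this numbering), both searches touch all 7 nodes while the path length stays at 3.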
The best-case analysis is shown in Fig. 5. Our algorithm has been compared with the typical DFS and BFS methods. From this comparative study, we conclude that the number of visited nodes increases with the level of the m-way tree. With our proposed searching method, we can find the shortest path to reach every node.

Fig. 5. Best Case Complexity Comparison

B. Worst Case Scenario
1) Breadth First Search (BFS)
Total number of nodes visited = M^(N+1)
2) Depth First Search (DFS)
Total number of nodes visited = M^(N+1)
3) Based on our Proposed Algorithm
Total number of nodes visited = N
In the worst-case analysis, our proposed algorithm achieves the minimum time complexity for reaching any destination node. Fig. 6 shows the worst-case complexity comparison.

Fig. 6. Worst Case Complexity Comparison

Four clusters have been used for experimental purposes, using tree traversal as shown in Fig. 7 with a cloud crawler based on the IP addresses available in the cache. Threads have been utilized to visit distinct hosts concurrently. There is no need to store data in the client node, as the result is sent directly to the crawler server after scanning each node. The cloud crawler works with the IP addresses of a cache following an m-way tree structure.

Fig. 7. Crawling Results

CONCLUSIONS
In our methodology, a hash table is generated in which each resource is assigned a particular number. The hash table is helpful for identifying each node. It is also useful for finding the shortest path to any node (resource) within the table. The freshness and age of a result can be calculated with the help of the hash table by comparing the past and present results of the particular nodes. Machines in different networks may have the same IP address; they can still be distinguished by the hash table because it allocates a unique number to each machine. The proposed method visits fewer nodes than DFS or BFS.

REFERENCES
[1] Brin, S., Page, L., “The anatomy of a large-scale hypertextual Web search engine,” Computer Network ISDN Syst. 30, 1998, pp. 107-117
[2] Lu, J., Wang, Y., Liang, J., Chen, J., Liu, J., “An Approach to Deep Web Crawling by Sampling,” Web Intelligence 2008, pp. 718-724
[3] Yang, Kai-Hsiang, Pan, Chi-Chien, Lee, Tzao-Lin, “Approximate search engine optimization for directory service,” Parallel and Distributed Processing Symposium, 2003, Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ., Taipei, Taiwan
[4] Banerjee, C., Kundu, A., Sadhukhan, S., Bose, S., Dattagupta, R., “Service Crawling in Cloud Computing,” 2nd International Conference on Advances in Information Technology and Mobile Communication, CCIS 296, pp. 243-246, Springer-Verlag Berlin Heidelberg Publication
[5] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., Halevy, A., “Google’s Deep-Web Crawl,” in Proceedings of VLDB2008, Auckland, New Zealand, pp. 1241-1252 (2008)
[6] Ntoulas, A., Zerfos, P., Cho, J., “Downloading Textual Hidden Web Content through Keyword Queries,” in Proceedings of JCDL2005, Denver, USA, pp. 100-109 (2005)
[7] Barbosa, L., Freire, J., “Siphoning Hidden-Web Data through Keyword-Based Interfaces,” in Proceedings of SBBD2004, Brasilia, Brazil, pp. 309-321 (2004)
[8] Liu, J., Wu, ZH., Jiang, L., Zheng, QH., Liu, X., “Crawling Deep Web Content Through Query Forms,” in Proceedings of WEBIST2009, Lisbon, Portugal, pp. 634-642 (2009)
[9] Lu, J., Wang, Y., Liang, J., Chen, J., Liu, J., “An Approach to Deep Web Crawling by Sampling,” in Proceedings of IEEE/WIC/ACM Web Intelligence, Sydney, Australia, pp. 718-724 (2008)
[10] Ajtai, M., “On the complexity of the pigeonhole principle,” Proc. of the 29th FOCS, pp. 346-355, 1988
[11] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson, Introduction to Algorithms, The MIT Press, 3rd edition, 2009
