An Efficient Cloud based Approach for Service Crawling




Short Paper, ACEEE Int. J. on Information Technology, Vol. 3, No. 1, March 2013

Chandan Banerjee 1,2, Anirban Kundu 2,3, Sumon Sadhukhan 1, Rana Dattagupta 4
1 Netaji Subhash Engineering College, Kolkata 700152, India; {chandanbanerjee1, sumon.sadhukhan8}@gmail.com
2 Innovation Research Lab (IRL), Howrah, West Bengal 711103, India; anik76in@gmail.com
3 Kuang-Chi Institute of Advanced Technology, Shenzhen 518057, P.R. China; anirban.kundu@kuang-chi.org
4 Jadavpur University, Kolkata 700032, India; rdattagupta@cse.jdvu.ac.in

Abstract— In this paper, we have designed a crawler that searches services provided by different clouds connected in a network. The proposed method provides details of the freshness and age of cloud clusters. The crawler checks each router available in a network providing services. On the basis of search criteria, our design generates output guiding users to access the requested cloud services in an efficient manner. We have planned to store the result in an m-way tree and to use a traversal technique for the extraction of specific data from the crawling result. We have compared the result with other typical search techniques.

Index Terms—cloud crawler, service crawling, cloud search, Freshness, Age

I. INTRODUCTION

In modern life, the usage of the cloud is growing rapidly, and a cloud user typically relies on specific services. Web search engines [1] crawl the Web and update information world-wide. Nowadays, Internet users are switching from single services to cloud services, which requires greater availability of cloud services. Web crawlers [2] store data after fetching Web pages and cache them in their databases. Every crawler stores the crawled result in its database, and the result is searched when it is needed. Search engines [3] are often compared with one another in terms of time complexity and space complexity. The freshness and age of the crawled result are also considerably important. A cloud crawler [4] works with the Internet Protocol (IP) addresses of a cache stored in a tree structure. Hosts are visited using specific threads for specific networks.

Frequently, one needs to maintain local copies of remote data sources for better performance or availability. For example, a Web search engine copies a significant subset of the Web and maintains copies or indexes of the pages to help users access relevant information. In this situation, a part of the local copy may get out-of-date because changes at the sources are not immediately propagated to the local copy. Therefore, it becomes important to design a good refresh policy that maximizes the "freshness" of the local copy. As cloud services grow larger, it becomes more important to refresh the data effectively. One critical challenge in the surfacing approach is how a crawler can automatically generate promising queries so that it can carry out efficient surfacing. This challenge has been studied in several works such as [5], [6], [7], [8], [9]. In these methods, candidate query keywords are generated from the obtained records.

Section II shows our proposed framework and the corresponding approach. Experimental analyses are presented in Section III. Section IV concludes the paper.

II. FRAMEWORK

We consider several nodes which are connected to each other in a network fashion. Clusters are formed with several nodes providing distinct services. The head node is also connected to the network, and a cluster may have private networks recursively. The crawler reaches the end points, takes information from them, and sends it to the head node. Node A stores the whole result. Boxes indicate networks; a network may have a sub-network. In the second section, we use an M-way tree traversal technique so that we can reach the destination with minimum path length. In the last section, we show how the technique is efficient in comparison with other searching algorithms. To realize the efficiency of the algorithm, we need to understand the freshness and age of a crawler. Every crawler has to update its database quickly and produce efficient results; the terms freshness and age concern this database.

A. Freshness and Age

A cloud service database is called 'fresher' when it has more up-to-date information than those of other crawlers. For instance, if a crawler crawls more nodes than other crawlers, then it is fresher. If a crawler shows a result from 5 minutes ago, then that interval is its age.

1. Freshness

Let S = {n_1, n_2, n_3, ..., n_N} be the set of nodes in the network, where n_1, n_2, ... are nodes and N is the number of elements. D_1, D_2, ..., D_N are the services stored on the particular nodes. The total freshness of the crawler is

Freshness(t) = (1/N) Σ_{i=1}^{N} F(n_i, t),

where F(n_i, t) = 1 if node n_i is updated at time t, and F(n_i, t) = 0 otherwise.

2. Age

Let {T_1, T_2, ..., T_n} be the set of times at which the information about a specific node is taken into account, and let T be the current time. The age of the node is then T − T_n. At time t, if the age of an element is A_i, then

A_i = 0, if the element is updated at t;
A_i = T_i − T_{i−1}, if it is not updated at t.

The total age of the database is

A(S, t) = (1/N) Σ_{i=1}^{N} A_i.
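The two metrics above are easy to compute once the crawler records, for every node, the times at which its entry was refreshed. The following minimal sketch is our own illustration, not code from the paper; the data model (per-node lists of update times) and the names freshness and age are assumptions:

```python
# Sketch of the freshness and age metrics defined above (illustrative).
# Assumed data model: update_times maps each node to the sorted list of
# times at which the crawler refreshed that node's record.

def freshness(update_times, t):
    """Freshness(t) = (1/N) * sum_i F(n_i, t), where F(n_i, t) = 1
    if node n_i is updated at time t and 0 otherwise."""
    n = len(update_times)
    return sum(1 for ts in update_times.values() if ts and ts[-1] == t) / n

def age(update_times, t):
    """A(S, t) = (1/N) * sum_i A_i, with A_i = 0 for a node updated
    at time t and A_i = T_i - T_{i-1} otherwise."""
    total = 0.0
    for ts in update_times.values():
        if ts and ts[-1] == t:
            total += 0.0                  # updated now: A_i = 0
        elif len(ts) >= 2:
            total += ts[-1] - ts[-2]      # A_i = T_i - T_{i-1}
    return total / len(update_times)

# Example: three nodes, current time t = 10
updates = {"n1": [2, 10], "n2": [3, 7], "n3": [1, 9]}
print(freshness(updates, t=10))   # 0.33... (only n1 is updated at t)
print(age(updates, t=10))         # (0 + (7-3) + (9-1)) / 3 = 4.0
```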
A cloud crawler is used to fetch the services for creating a framework of a cloud service crawler engine using proper indexing methodologies. A crawler for a specific service is a program for extracting outward Web links (URLs) and adding them to a list after processing. Thus, a cloud service crawler is a program which fetches as many relevant services as possible for the specific users. It uses the Web link structure, in which the order of the list is important, because only high-quality Web pages are considered relevant. Fig. 1 shows the proposed service-based cloud crawler. Here, an element insertion means that the element is inserted at the pointer location within the m-way tree. A special traversal technique is utilized for visiting all the nodes within each network or sub-network. Each node is selected twice; the second time, it is actually popped from a stack. An advantage of our algorithm is that data need not be stored in the client node: the result is sent directly to the crawler server after each node is scanned. A sketch of this traversal is given at the end of this subsection.

Fig. 1. Flowchart of Service based Cloud Crawler

B. Sample Procedure on a Sample Network

Fig. 2 shows an arbitrary cloud cluster. There are in total four network clusters within the cloud. Circular boxes indicate the clusters, and rectangular boxes indicate the resources of each cluster network. Table I shows the result, which is based on our proposed approach as described in our previous work [1].

Fig. 2. Arbitrary Cloud Cluster Scenario

At crawling run time, a hash table is built from the mapping between the node and the number (IP address) of resources in the cloud network, as shown in Table II. Our proposed search approach is presented in subsection E. The sample network is crawled using the proposed method, and the outcome is shown in Table I.

TABLE I. PROPOSED APPROACH BASED ON FIG. 2
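Since Table I does not survive in this text version, the sketch below is our own illustration of the traversal just described, on a toy cluster in the spirit of Fig. 2: an explicit stack visits every node of a network and its sub-networks, each node is selected twice (the second time it is popped), and each scanned node is reported straight to the crawler server instead of being stored at the client. The names `network` and `send_to_server` are our own:

```python
# Illustrative sketch of the stack-based crawl described above.

def crawl(network, root, send_to_server):
    stack = [(root, False)]                 # (node, already expanded?)
    while stack:
        node, expanded = stack.pop()
        if expanded:
            continue                        # second selection: popped
        send_to_server(node)                # nothing kept on the client
        stack.append((node, True))          # re-push: each node seen twice
        for child in reversed(network.get(node, [])):
            stack.append((child, False))

# Toy cluster: node 1 heads two sub-networks
sample = {1: [2, 3], 2: [4, 5], 3: [6]}
crawl(sample, 1, send_to_server=print)      # prints 1 2 4 5 3 6
```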
C. Hash Table

The hash table is generated from the mapping between the node and the number (IP address) of resources in a cloud network. Table II is created using real-time crawling.

TABLE II. HASH TABLE BASED ON TABLE I

D. Indexing Result

The crawler finishes searching the cloud and then stores the result in an M-way tree, built from Table II as shown in Fig. 3. An illustrative sketch of both structures follows.

Fig. 3. M-Way tree
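Tables I-III are images in the original and do not survive here, so the following is our own sketch of subsections C and D: every crawled resource is assigned a unique number in a hash table next to its IP address (so machines with identical IPs in different private networks stay distinguishable), and the assigned numbers are then indexed in an M-way tree. All names and the sample addresses are invented:

```python
# Illustrative sketch of the hash table (C) and the M-way index (D).
from itertools import count

def build_hash_table(resource_ips):
    """Assign a unique consecutive number to every crawled resource;
    duplicate IPs in different private networks get distinct numbers."""
    numbers = count(1)
    return {next(numbers): ip for ip in resource_ips}

class MWayNode:
    """One node of the M-way index tree; children are sub-networks."""
    def __init__(self, key, children=()):
        self.key = key
        self.children = list(children)

table = build_hash_table(["10.0.0.1", "10.0.0.2", "10.0.0.1"])
print(table)          # {1: '10.0.0.1', 2: '10.0.0.2', 3: '10.0.0.1'}

# Index the assigned numbers in a small M-way tree rooted at node 1
root = MWayNode(1, [MWayNode(2), MWayNode(3)])
```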
E. Search Approach

The algorithm described in Fig. 4 is used to reach any node using the crawling result. Consider that Node 13 is to be visited at a particular time instance; Table III shows the different steps taken to search for Node 13. The shortest path to reach Node 13 is {1, 9, 11, 13}.

Fig. 4. Flowchart to reach any node using Fig. 2

TABLE III. PROPOSED SEARCH APPROACH

III. EXPERIMENTAL ANALYSIS

The time complexities [10], [11] of DFS and BFS are O(|V| + |E|), where V is the set of vertices of the graph and E is the set of edges.

A. Best Case Scenario

1) Breadth First Search (BFS): total number of nodes visited = MN, where M is the average number of machines present in every network and N is the level of the tree.
2) Depth First Search (DFS): total number of nodes visited = N.
3) Our proposed algorithm: total number of nodes visited = N.

The best case analysis is shown in Fig. 5, where our algorithm is compared with the typical DFS and BFS methods. The comparative study shows that the number of visited nodes grows with the level of the m-way tree. With our proposed searching method, we can find the shortest path to reach every node.

Fig. 5. Best Case Complexity Comparison

B. Worst Case Scenario

1) Breadth First Search (BFS): total number of nodes visited = M^(N+1).
2) Depth First Search (DFS): total number of nodes visited = M^(N+1).
3) Our proposed algorithm: total number of nodes visited = N.

Thus, in the worst case analysis, the minimum time complexity to reach any destination node is achieved by our proposed algorithm. Fig. 6 shows the worst case complexity comparison; a concrete sketch of the search is given below.

Fig. 6. Worst Case Complexity Comparison
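To make the visit count concrete, the sketch below is our own reconstruction, not the algorithm of Fig. 4: it assumes the crawl stores a parent link for every node of the indexed M-way tree, so that reaching a target costs one visit per level (N visits) rather than the up-to-M^(N+1) visits of BFS or DFS. The tree shape is invented, except that it reproduces the path {1, 9, 11, 13} quoted above:

```python
# Illustrative sketch: shortest path over the indexed M-way tree.
# `parents` (child -> parent) is an assumed by-product of the crawl.

def shortest_path(parents, target, root=1):
    """Walk parent links from the target up to the root and reverse;
    the number of visited nodes equals the path length (one per level)."""
    path = [target]
    while path[-1] != root:
        path.append(parents[path[-1]])
    return path[::-1]

# Hypothetical tree consistent with the example above
parents = {2: 1, 9: 1, 10: 9, 11: 9, 12: 11, 13: 11}
print(shortest_path(parents, 13))    # [1, 9, 11, 13]
```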
Four clusters have been used for experimental purposes, using the tree traversal shown in Fig. 7 and the cloud crawler operating on the IP addresses available in the cache. Threads have been utilized to visit distinct hosts in a concurrent manner. There is no need to store data in the client node, as the result is sent directly to the crawler server while each node is scanned. The cloud crawler works with the IP addresses of a cache following an m-way tree structure.

Fig. 7. Crawling Results

CONCLUSIONS

In our methodology, a hash table is generated in which each resource is assigned a particular number. The hash table is helpful for the identification of each node. It is also useful for finding the shortest path to reach any node (resource) within the table. The freshness and age of a result can be calculated with the help of the hash table by comparing the past and present results of the particular nodes. In different networks, different machines may have the same IP address; each machine can nevertheless be identified through the hash table, because it allocates a unique number to every machine. A minimal number of nodes is visited in the proposed method compared to DFS or BFS.

REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998.
[2] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep Web crawling by sampling," in Proc. Web Intelligence 2008, pp. 718-724.
[3] K.-H. Yang, C.-C. Pan, and T.-L. Lee, "Approximate search engine optimization for directory service," in Proc. Parallel and Distributed Processing Symposium, 2003.
[4] C. Banerjee, A. Kundu, S. Sadhukhan, S. Bose, and R. Dattagupta, "Service crawling in cloud computing," in Proc. 2nd Int. Conf. on Advances in Information Technology and Mobile Communication, CCIS 296, Springer-Verlag Berlin Heidelberg, pp. 243-246.
[5] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy, "Google's Deep-Web crawl," in Proc. VLDB 2008, Auckland, New Zealand, pp. 1241-1252, 2008.
[6] A. Ntoulas, P. Zerfos, and J. Cho, "Downloading textual hidden Web content through keyword queries," in Proc. JCDL 2005, Denver, USA, pp. 100-109, 2005.
[7] L. Barbosa and J. Freire, "Siphoning hidden-Web data through keyword-based interfaces," in Proc. SBBD 2004, Brasilia, Brazil, pp. 309-321, 2004.
[8] J. Liu, Z. H. Wu, L. Jiang, Q. H. Zheng, and X. Liu, "Crawling deep Web content through query forms," in Proc. WEBIST 2009, Lisbon, Portugal, pp. 634-642, 2009.
[9] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep Web crawling by sampling," in Proc. IEEE/WIC/ACM Web Intelligence, Sydney, Australia, pp. 718-724, 2008.
[10] M. Ajtai, "On the complexity of the pigeonhole principle," in Proc. 29th FOCS, pp. 346-355, 1988.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed., The MIT Press, 2009.
