IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 13, Issue 1 (Jul. - Aug. 2013), PP 102-106
www.iosrjournals.org
Topic-specific Web Crawler using Probability Method
S. Subatra Devi (1), Dr. P. Sheik Abdul Khader (2)
(1) Research scholar, PSVP Engineering College, Chennai, Tamil Nadu, India
(2) Professor & HOD, BSA Crescent Engineering College, Chennai, Tamil Nadu, India
Abstract : The web has become an integral part of our lives, and search engines play an important role in
helping users find online content on a specific topic. The web is a huge and highly dynamic environment that is
growing exponentially in content and changing rapidly in structure. No search engine can cover the whole web,
so each must focus on the most valuable pages when crawling. Many methods based on link and text content
analysis have been developed for retrieving pages. A topic-specific web crawler collects web pages relevant to
the topics that interest the user. In this paper, we present an algorithm that combines link analysis, text
content similarity using Levenshtein distance, and a probability method to fetch a larger number of relevant
pages for the topic specified by the user. Evaluation indicates that the proposed web crawler collects the
pages that best match user interests during the early period of crawling.
Keywords - Levenshtein Distance, Hyperlink, Probability Method, Search engine, Web Crawler.
I. Introduction
The web is a very large environment from which users retrieve information by supplying keywords, queries
or topics. Its growth and fluctuation impose fundamental limits of scale on today's generic
search engines [5]. The number of web sites keeps increasing, and hence the number of links and
documents increases day by day [1]. Crawlers are programs that follow hyperlinks to
retrieve the required pages. Starting from a hyperlink, the crawler retrieves the linked document, and further
pages are retrieved in the same way. This process is repeated until a predefined number of web pages has been
downloaded or the storage of the local repository is exhausted [3].
In this paper, an efficient web crawling algorithm is presented that combines text content analysis, link
analysis and a probabilistic method. Initially, the topic is given as input to several search engines; here,
Google, Yahoo and MSN are used. For the given topic, the URLs returned in common by all three search engines,
or by any two of them, are retrieved and given as input to the crawler as seed URLs. Starting from a seed URL,
the crawler retrieves web pages by following hyperlinks. Retrieval takes into account
the keyword on the link, the text content of the document and the probability method. The
probability method is used to find the number of similar as well as dissimilar keywords occurring in a web page.
Determining the probability of dissimilar keywords allows irrelevant pages to be filtered out at the
beginning of the crawl, which focuses the search on the relevant pages. Link content and text content are
the methods commonly used by most algorithms. Using these as the
base and adding the probability method yields more relevant pages.
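The seed selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name select_seed_urls, the min_engines parameter and the example URLs are assumptions, and the search engine result lists are taken as already fetched.

```python
from collections import Counter


def select_seed_urls(result_lists, min_engines=2):
    """Return URLs found in at least `min_engines` result lists.

    `result_lists` holds one list of URLs per search engine. The output
    keeps the order in which the URLs are first encountered.
    """
    counts = Counter()
    for results in result_lists:
        # Count each engine at most once per URL.
        counts.update(set(results))

    seeds, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                if counts[url] >= min_engines:
                    seeds.append(url)
    return seeds


# Hypothetical result lists standing in for the three search engines.
engine_a = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
engine_b = ["http://example.com/b", "http://example.com/d"]
engine_c = ["http://example.com/c", "http://example.com/b", "http://example.com/e"]
seed_urls = select_seed_urls([engine_a, engine_b, engine_c])
```

With min_engines=2 a URL qualifies when it is common to any two engines, which also covers URLs common to all three.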
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 proposes
the algorithm for the web crawling process. Section 4 presents the experimental results and performance
evaluation of the proposed algorithm. Finally, Section 5 concludes the paper.
II. Related Work
There are several algorithms based on content and link strategies. An algorithm based on hyperlink and
content relevance and on HITS is presented as heuristic search [14]. Another method ranks pages
based on content and links [13]. In [9], relevancy is a binary value of 0 or 1, which has several drawbacks that
are overcome in Shark-search [8].
A division score and a link score are applied to every link in [6]. Knowledge bases
built from starting URLs, topic keywords and URL predictions are discussed in [7]. The importance of a page
is computed from the composition of its PR score in [2]. Focused crawling analyzes its crawl boundary to find
the links that are most likely to be relevant and avoids irrelevant regions [10]. The algorithm in [22], based on
a hyperlink and content relevance strategy, performs topic-specific crawling.
The topic keyword is used as the basis in several algorithms [2], [7], [10] and [12], in which the crawler
crawls the web to fetch the relevant pages. A crawler using breadth-first search order [19] discovers
high-quality pages during the early stages of the crawl. The algorithm in [20] combines text search with
semantic search. The PageRank method is applied in [17] to engineer the Google search engine.
Arbitrary predicates are applied in [18] to perform intelligent crawling. In [15], the relevance of
unvisited web pages is predicted accurately from known URLs. In the present method, the algorithm crawls the
web by considering the keyword in the hyperlink, the content of the document and the probability method. A
latent semantic indexing classifier that combines link and text analysis [21] is used to index domain-specific
web documents.
III. Proposed Method
The seed URL is the base path for retrieving the more relevant pages. It is obtained by giving
the topic keyword [2], [7] to the three search engines. The seed URL is given as the root path to the
crawler, and the outgoing links of this root path are then fetched. For each of these links, the three methods
specified below are applied.
1.1 Calculating the Relevancy Score
For the outgoing links described above, the relevancy score is calculated iteratively for
every link. The calculation consists of the following methods.
1.1.1 Link Weight
The link weight is calculated by checking whether the topic keywords are present in the anchor text of the
outgoing links. If they are, the division method is used to determine the link score: if all the topic
keywords are present in the outgoing link, its link score is 1; otherwise, the link score is the
fraction of the topic keywords present in the link. Finally, the link weight is the ratio of the
number of links containing the anchor text to the total number of links of the parent node. This link
weight is given by

$$ W_t = \frac{UL}{LT} $$

where UL represents the total number of links that contain the anchor text and LT represents the total
number of links present in the parent node.
For every parent URL, the link weight is determined and crawling proceeds according to
the ranking of the parent URLs.
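A minimal sketch of the link score and link weight follows. The function names are hypothetical, and the sketch assumes that a link "contains the anchor text" when at least one topic keyword occurs as a whole word in its anchor text; the paper does not state the matching rule in more detail.

```python
def link_score(anchor_text, topic_keywords):
    """Fraction of the topic keywords present in one link's anchor text."""
    if not topic_keywords:
        return 0.0
    words = set(anchor_text.lower().split())
    present = sum(1 for kw in topic_keywords if kw.lower() in words)
    return present / len(topic_keywords)


def link_weight(anchor_texts, topic_keywords):
    """W_t = UL / LT for the outgoing links of one parent page.

    UL: links whose anchor text contains at least one topic keyword
        (an assumption made for this sketch).
    LT: total number of outgoing links of the parent page.
    """
    if not anchor_texts:
        return 0.0
    ul = sum(1 for text in anchor_texts if link_score(text, topic_keywords) > 0)
    return ul / len(anchor_texts)


# Hypothetical anchor texts of one parent page.
anchors = ["Java Beans tutorial", "Java news", "Contact us", "About"]
weight = link_weight(anchors, ["java", "beans"])
```

Here two of the four links mention a topic keyword, so the link weight of the parent page is 0.5.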
1.1.2 Text content similarity using Levenshtein Distance
Here, the Levenshtein distance is used to compute the text content similarity of two pages, i.e., the
seed URL page and the child page. Before computing the similarity, tokens are extracted from the pages.
These tokens are the topic keywords of the page. To filter these topic keywords, two
methods are used:
- Stop word removal
- Stemming algorithm
Stop word removal removes common words such as a, an, the, etc. from the document.
The stemming algorithm applied here is the Porter stemming algorithm, which replaces each word with its
stem. For example, 'engineering', 'engineered', etc. are stemmed to the word 'engineer'.
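The token extraction step can be sketched as below. The stop word list is a small illustrative sample and the function name extract_tokens is hypothetical. The Porter stemmer is not reproduced here; in a full pipeline an existing Porter stemmer implementation would be applied to each returned token.

```python
import re

# Small illustrative stop word list; a real crawler would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on"}
)


def extract_tokens(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in stop_words]


tokens = extract_tokens("The engineering of a Web crawler")
# tokens == ['engineering', 'web', 'crawler']
```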
The words extracted by these methods are known as tokens. These tokens are given as input to the
Levenshtein distance computation, 'EditDist'. The distance measures how similar two strings are by counting
the following operations needed to transform one string s1 into another string s2:
- insertion
- deletion
- substitution
Finally, the text content similarity between the seed page and the child page is calculated from the
word distances and the lengths of the words:

$$ D_{lev}(s_1, s_2) = \frac{\sum \mathrm{EditDist}(s_1, s_2)}{\mathrm{Length}(s_1) + \mathrm{Length}(s_2)} $$

where EditDist is the number of insertion, deletion and substitution operations
needed to transform the string s1 into the string s2.
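The edit distance and its length normalization can be sketched as follows, using the standard dynamic-programming formulation of Levenshtein distance. The function names are hypothetical, and the sketch covers a single pair of strings only, since the text does not specify how tokens of the two pages are paired when the distances are summed.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to transform s1 into s2."""
    # prev[j] holds the distance between the first i-1 chars of s1
    # and the first j chars of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_distance(s1, s2):
    """EditDist(s1, s2) / (Length(s1) + Length(s2)) for one pair of strings."""
    total_length = len(s1) + len(s2)
    if total_length == 0:
        return 0.0
    return edit_distance(s1, s2) / total_length


d = edit_distance("kitten", "sitting")  # 3
```

Because the edit distance never exceeds the length of the longer string, the normalized value always lies between 0 and 1, with 0 for identical strings.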
1.1.3 Probability Method
The probability-based distance method is used to calculate the similarity between two pages, the
seed URL page and the child page. Tokens are extracted from these pages and treated as keywords; the
frequency of each keyword is found and the keywords are sorted by frequency. The top n keywords are selected
from the parent and the child pages, and the similarity and dissimilarity between the two pages are calculated.
Pm = [P(wi wj) + 1] - (1 - µ)[1 - P(wi wj)]
where the two probability terms are P(wi wj) = T / N and P(wi wj) = K / N. T is the number of keywords
that are similar across both pages, K is the number of keywords that are dissimilar across both pages, and
N is the total number of keywords.
Based on the sorted order of the similar keywords, the documents are given priorities in decreasing
order. The probability of similar keywords helps in retrieving the relevant pages effectively. Similarly, the
probability of dissimilar keywords supports filtering out irrelevant pages at the
beginning of the crawl. In every iteration, the irrelevant pages are filtered out, which allows the
crawler to retrieve a larger number of relevant pages.
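The two probabilities T / N and K / N can be sketched as below. Several points are assumptions made for illustration: keywords count as similar when the same token appears among the top n keywords of both pages, as dissimilar when it appears for only one page, and N is taken as the number of distinct keywords across both top-n lists. The function names are hypothetical, and the sketch stops before the combined value Pm because µ is not defined in the text.

```python
from collections import Counter


def top_keywords(tokens, n):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def keyword_probabilities(parent_tokens, child_tokens, n=10):
    """Return (T / N, K / N) for the top-n keywords of two pages."""
    parent_top = set(top_keywords(parent_tokens, n))
    child_top = set(top_keywords(child_tokens, n))
    all_keywords = parent_top | child_top
    if not all_keywords:
        return 0.0, 0.0
    t = len(parent_top & child_top)  # keywords shared by both pages
    k = len(parent_top ^ child_top)  # keywords found on only one page
    total = len(all_keywords)
    return t / total, k / total


# Hypothetical token lists for a parent page and a child page.
parent = ["java", "java", "java", "beans", "beans", "api", "api", "class"]
child = ["beans", "beans", "beans", "java", "java", "coffee", "coffee", "tea"]
p_similar, p_dissimilar = keyword_probabilities(parent, child, n=3)
```

Under these assumptions the two values always sum to 1, so a page with a high share of dissimilar keywords can be filtered early.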
1.2 Determining the Relevancy Score
Here, the relevancy score is determined from the above three methods. The values computed by the
three methods are combined as a weighted sum:

$$ R_s = \alpha W_t + \beta D_{lev} + \gamma P_m $$

where W_t represents the link weight, D_lev the Levenshtein distance and P_m the
probability method.
The relevancy score R_s is calculated for every link and compared with a threshold
value. The relevancy score is determined for different threshold values, and the relevant pages are retrieved.
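The weighted combination and the threshold comparison can be sketched as follows. The function names and the default coefficient values are placeholders chosen for illustration; the paper reports testing coefficients between 0 and 1 but does not state the values used.

```python
def relevancy_score(w_t, d_lev, p_m, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """R_s = alpha * W_t + beta * D_lev + gamma * P_m.

    The default coefficients are illustrative, not taken from the paper.
    """
    return alpha * w_t + beta * d_lev + gamma * p_m


def is_relevant(score, threshold):
    """A page is irrelevant when its score is less than the threshold."""
    return score >= threshold


score = relevancy_score(0.5, 0.2, 0.8, alpha=0.5, beta=0.25, gamma=0.25)
keep = is_relevant(score, threshold=0.4)
```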
IV. Results And Discussion
In this section, the experimental results show that the proposed web crawling algorithm
using link analysis, text analysis and the probability method is more effective. The proposed algorithm was
implemented in Java (JDK 1.6), and experiments were performed for different categories.
The experiments were performed with the different topic keywords shown in Fig. 1
below. The figure also shows that a larger number of relevant pages is retrieved using the above-mentioned
method.
Fig. 1. Result for different topics
The analysis was made for different threshold values and different keywords. When the relevancy score
of a page is less than the threshold value, the page is considered irrelevant and is removed
from the URL queue. Otherwise, the page is relevant and is placed on the queue according to its priority. The
relevant pages retrieved for particular threshold values were taken into consideration, and from these
sets the average number of relevant pages was computed for each topic.
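The threshold filtering and priority ordering of the URL queue can be sketched with a binary heap. The class name UrlQueue and its methods are hypothetical, and the sketch simplifies the description slightly: pages scoring below the threshold are never enqueued rather than being removed afterwards.

```python
import heapq


class UrlQueue:
    """Priority queue of URLs ordered by decreasing relevancy score."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order

    def offer(self, url, score):
        """Enqueue the URL unless its score is below the threshold."""
        if score < self.threshold:
            return False
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


queue = UrlQueue(threshold=0.4)
queue.offer("http://example.com/a", 0.5)
queue.offer("http://example.com/b", 0.3)  # below the threshold, dropped
queue.offer("http://example.com/c", 0.9)
```

Popping from this queue returns the page with score 0.9 first, then the page with score 0.5.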
Fig. 2 below shows the relevant pages retrieved for the keyword 'Java Beans' at
different threshold values. This process is applied to the different topics to determine the efficiency of the
algorithm. The coefficients α, β and γ were also tested with different values between 0 and 1 to
vary the weight of the individual methods.
Similarly, experiments were run for different values of the coefficients, and the
relevancy score was determined for each URL. The results show that more relevant pages were retrieved using
this method for different topics.
Fig. 2. Result for the topic ‘Java Beans’ on different threshold values
V. Conclusion
In this paper, a crawling method using link analysis, text content and a probability method was presented;
it is more efficient and retrieves more relevant pages. The probability of finding similar and dissimilar
keywords allows the crawler to retrieve pages effectively. For experimentation, several topic keywords were
given, and it was verified that more relevant pages were retrieved. The experiment was run with different
threshold values to test the efficiency. The coefficient of each method in the relevancy score computation was
also tested with different values. Testing across these different cases showed that the proposed algorithm is
more effective at retrieving the relevant pages efficiently.
References
[1] T.H. Haveliwala, Topic-Sensitive PageRank, Proceedings of the 11th World Wide Web Conference, pp. 517-526.
[2] Blaž Novak, Survey of focused web crawling algorithms, in Proceedings of SIKDD, pp. 55-58, 2004.
[3] Shalin Shah, Spe 2006. Implementing an Effective Web Crawler. Pant, G., Srinivasan, P., Menczer, F., Crawling the Web. Web
Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, Springer-Verlag,
pp. 153-178, November 2004.
[4] Debashis Hati and Amritesh Kumar, An Approach for Identifying URLs Based on Division Score and Link Score in Focused
Crawler, International Journal of Computer Applications, Vol. 2, no. 3, May 2010.
[5] A. Rungsawang, N. Angkawattanawit, Learnable topic-specific web crawler, Journal of Network and Computer Applications,
Issue no: 28, page no: 97-114, 2005.
[6] Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim and Sigalit Ur, The shark-search algorithm.
An application: tailored Web site mapping, in Proceedings of the Seventh International World Wide Web Conference on Computer
Networks and ISDN Systems, Vol. 30, no. 1-7, pp. 317-326, April 1998.
[7] P. De Bra, G-J Houben, Y. Kornatzky, and R. Post, Information Retrieval in Distributed Hypertexts, in the Proceedings of
RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[8] S. Chakrabarti, M. van den Berg, and B. Dom, Focused Crawling: A New Approach for Topic-Specific Resource Discovery, In
Proc. 8th WWW, 1999.
[9] A. Rungsawang, N. Angkawattanawit, Learnable Crawling: An Efficient Approach to Topic-specific Web Resource Discovery,
2005.
[10] K.Bharat and M.Henzinger, Improved Algorithms for Topic Distillation in a Hyperlinked Environment, In proc. Of the ACM
SIGIR ‘98 conference on Research and Development in Information Retrieval.
[11] J.Jayanthi, Dr. K.S. Jayakumar, An Integrated Page Ranking Algorithm for Personalized Web Search, International Journal of
Computer Applications, 2011.
[12] Lili Yan, Wencai Du, Yingbin wei and Henian chen, A novel heuristic search algorithm based on hyperlink and relevance strategy
for Web Search, 2012, Advances in Intelligent and Soft Computing.
[13] Zhumin Chen; Jun Ma; Jingsheng Lei; Bo Yuan; Li Lian, Aug 24-27, 2007. An Improved Shark-Search Algorithm Based on
Multi-information. Fourth International Conference on Fuzzy Systems and Knowledge Discovery, pp: 659 – 658.
[14] Sotiris Batsakis, Euripides G.M. Petrakis, Evangelos Milios, Improving the Performance of Focused Web Crawlers, Data &
Knowledge Engineering, Vol: 68, No: 10, pp: 1001-1013, October 2009.
[15] Brin, S., & Page, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the seventh international
conference on World Wide Web (WWW), pp: 107–117, 1998.
[16] Aggarwal C., Al-Garawi F., Yu P., Intelligent crawling on the World Wide Web with arbitrary predicates, In: Proceedings of the 10th
International World Wide Web Conference, Hong Kong, 2001, pp. 96-105.
[17] M. Najork and J.L. Wiener, "Breadth-first search crawling yields high-quality pages", In Proceedings of the Tenth
Conference on World Wide Web, Hong Kong, Elsevier Science, May 2001, pp. 114–118.
[18] Lixin Han and Guihai Chen, The HWS hybrid web search , Information and Software Technology, Elsevier Science, 2005.
[19] Rodriguez-Mier. P, Mucientes. M, Lama.M. Automatic Web service Composition with a Heuristic-based Search Algorithm, IEEE
International Conference on Web Services, 2011.
[20] G. Almpanidis, C. Kotropoulos, I. Pitas, September 2007. Combining text and link analysis for focused crawling—An application
for vertical search engines. Information Systems, Vol 32, No: 6, pp: 886-908.
[21] Lili Yan, Wencai Du, Yingbin Wei and Henian Chen, "A novel heuristic search algorithm based on hyperlink and relevance strategy
for Web Search", 2012, Advances in Intelligent and Soft Computing.

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Topic-specific Web Crawler using Probability Method

IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 13, Issue 1 (Jul. - Aug. 2013), PP 102-106
www.iosrjournals.org

S. Subatra Devi (1), Dr. P. Sheik Abdul Khader (2)
(1) Research scholar, PSVP Engineering College, Chennai, Tamil Nadu, India
(2) Professor & HOD, BSA Crescent Engineering College, Chennai, Tamil Nadu, India

Abstract: The web has become an integral part of our lives, and search engines play an important role in helping users search online content by specific topic. The web is a huge and highly dynamic environment which is growing exponentially in content and developing fast in structure. No search engine can cover the whole web, so it has to focus on the most valuable pages for crawling. Many methods based on link and text content analysis have been developed for retrieving pages. A topic-specific web crawler collects from the web the pages relevant to the topics the user is interested in. In this paper, we present an algorithm that combines link analysis, text content analysis using Levenshtein distance, and a probability method to fetch a larger number of relevant pages for the topic specified by the user. Evaluation illustrates that the proposed web crawler collects the best web pages for the user's interests during the early period of crawling.

Keywords - Levenshtein Distance, Hyperlink, Probability Method, Search engine, Web Crawler.

I. Introduction

The web is a very large environment from which users fetch the required information by providing keywords, queries or topics. Such growth and fluctuation generate essential limits of scale for today's generic search engines [5]. The number of web sites keeps increasing, and hence the number of links and documents keeps increasing day by day [1]. Crawlers are programs that traverse hyperlinks to retrieve the required pages.
Based on the hyperlink, the crawler retrieves the document, and in this way the required number of web pages is retrieved. This process is repeated until the predefined number of web pages has been downloaded or the storage of the local repository is exhausted [3].

In this paper, an efficient web crawling algorithm is presented that combines text content, link analysis and a probabilistic method. Initially, the topic is given as input to several search engines; here, Google, Yahoo and MSN are used. For the given topic, the URLs common to all three search engines, or to any two of them, are retrieved and given as input to the crawler. Each such URL is treated as a seed URL. Starting from the seed URL, the crawler retrieves web pages by following hyperlinks. Retrieval considers the keyword on the link, the text content of the document and the probability method.

The probability method is used to find the number of similar as well as dissimilar keywords occurring in a web page. Determining the probability of dissimilar keywords allows irrelevant pages to be filtered out at the beginning of the crawl. This lets the crawler concentrate on the relevant pages and focuses the search. Link content and text content are the common methods used by most algorithms. Using these as the base and combining them with the probability method yields more relevant pages.

The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 proposes the algorithm for the web crawling process. Section 4 shows the experimental results and the performance evaluation of the proposed algorithm. Finally, the conclusion is given in Section 5.

II. Related Work

There are several algorithms based on content and link strategy. An algorithm based on hyperlink and content relevance and on HITS is presented as Heuristic search [14]. Another method specifies ranking based on content and link [13].
In [9], relevancy is binary (0 or 1), which has several drawbacks that are overcome by Shark-search [8]. A division score and a link score are applied to each and every link in [6]. Knowledge bases built from starting URLs, topic keywords and URL predictions are discussed in [7]. The importance of a page is computed from the composition of the PR score in [2]. Focused crawling analyzes its crawl boundary to find the links that are most likely to be relevant and avoids irrelevant regions [10]. The algorithm in [22], built on a hyperlink and content relevance strategy, is based on topic-specific crawling. The topic keyword is used as a base in several algorithms [2], [7], [10] and [12], by which the crawler crawls through the web to fetch the relevant pages. A crawler using breadth-first search order [19] discovers high quality pages during the early stages of the crawl. The algorithm in [20] combines text search with semantic search. The PageRank method is applied in [17] to engineer the Google search engine. Different arbitrary predicates are applied in [18] to perform intelligent crawling. The relevance of unvisited web pages is accurately predicted from known URLs in [15]. In the method proposed here, the algorithm crawls the web by considering the keyword in the hyperlink, the content of the document and the probability method. A latent semantic indexing classifier that combines link and text [21] is used to index domain-specific web documents.

III. Proposed Method

The seed URL is the base path for retrieving the more relevant pages. This URL is fetched by giving the topic keyword [2], [7] to the three different search engines. The seed URL is given as the root path to the crawler. The outgoing links of this root path are then fetched. For each of these links, three methods are applied as specified below.

1.1 Calculating the Relevancy Score

From the outgoing links specified above, the relevancy score is calculated iteratively for each and every link. The calculation consists of the following methods.

1.1.1 Link Weight

The link weight is calculated by checking whether the topic keyword is present in the outgoing links. If this anchor text is present, the division method is used to determine the link weight. Here, if all the topic keywords are present in the outgoing link, its link score is 1. Otherwise, the link score is the percentage of the topic keywords present in the link. Finally, the link weight is determined by the ratio of the number of links containing the anchor text to the total number of links of the parent node. This link weight is given by

$$W_t = \frac{UL}{LT}$$

where UL represents the total number of links that contain the anchor text and LT represents the total number of links present in the parent node. The link weight is determined for each and every parent URL, and crawling is performed based on the ranking of the parent URLs.

1.1.2 Text content similarity using Levenshtein Distance

Here, the Levenshtein distance is used to compute the text content similarity of the two pages, i.e., the seed URL page and the child page.
Before computing the similarity, the tokens are extracted from the pages. These tokens are the topic keywords of that page. To filter these topic keywords, two methods are used:
- Stop word removal
- Stemming algorithm

Stop word removal removes common words such as a, an, the, etc. from the document. The stemming algorithm applied here is the Porter stemming algorithm, which replaces each word with its stem. For example, 'engineering', 'engineered', etc. are stemmed to the word 'engineer'. The words extracted by these methods are known as tokens. These tokens are given as input to the Levenshtein distance computation 'EditDist'. Here, the distance is computed by determining the similarity between the two strings, using the following operations to transform one string $s_1$ into another string $s_2$:
- insertion
- deletion
- substitution

Finally, the text content similarity between the seed page and the child page is calculated from the word distance and the length of the words:

$$D_{lev}(s_1, s_2) = \frac{\sum \mathrm{EditDist}(s_1, s_2)}{\mathrm{Length}(s_1) + \mathrm{Length}(s_2)}$$

where EditDist counts the number of insertion, deletion and substitution operations needed to transform the string $s_1$ into the string $s_2$.

1.1.3 Probability Method

The probability-based distance method is used to calculate the similarity between the two pages, the seed URL page and the child page. The tokens extracted from these pages are treated as keywords; the frequency of each keyword is found, and the keywords are sorted by frequency. The top n keywords are selected from the parent and the child pages. The similarity and the dissimilarity between these two pages are then calculated.
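The two page-comparison steps described above (Sections 1.1.2 and 1.1.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the distance is shown for a single pair of strings (the summation over tokens is left out), and N is assumed to be the size of the combined top-n keyword set, since the paper does not spell out how N is counted.

```python
from collections import Counter


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(s1: str, s2: str) -> float:
    """EditDist(s1, s2) / (Length(s1) + Length(s2)) for one string pair."""
    total = len(s1) + len(s2)
    return edit_distance(s1, s2) / total if total else 0.0


def keyword_overlap(parent_tokens, child_tokens, n):
    """Return (T, K, N) for the top-n keywords of the two pages.

    T = keywords in both top-n sets, K = keywords in only one of them,
    N = size of the combined set (an assumption, see the lead-in).
    """
    top_parent = {w for w, _ in Counter(parent_tokens).most_common(n)}
    top_child = {w for w, _ in Counter(child_tokens).most_common(n)}
    similar = len(top_parent & top_child)
    dissimilar = len(top_parent ^ top_child)
    total = len(top_parent | top_child)
    return similar, dissimilar, total
```

With T, K and N in hand, the ratios T / N and K / N give the similar-keyword and dissimilar-keyword probabilities used by the probability method.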
$$P_m = [P(w_i w_j) + 1] - (1 - \mu)[1 - P(w_i w_j)]$$

where $P(w_i w_j) = T / N$ for the similar keywords and $P(w_i w_j) = K / N$ for the dissimilar keywords; T refers to the number of similar keywords of both the pages, N is the total number of keywords and K refers to the number of dissimilar keywords of both the pages. Based on the sorted order of the similar keywords, the documents are given priorities in decreasing order. The probability of the similar keywords helps in retrieving the relevant pages effectively. Similarly, the probability of the dissimilar keywords helps in filtering out the irrelevant pages at the beginning of crawling. In each and every iteration the irrelevant pages are filtered out, which lets the crawler retrieve a larger number of relevant pages.

1.2 Determining the Relevancy Score

Here, the relevancy score is determined by using the above three methods. The computed values of the above three methods are combined as

$$R_s = \alpha W_t + \beta D_{lev} + \gamma P_m$$

where $W_t$ represents the link weight, $D_{lev}$ the Levenshtein distance and $P_m$ the probability method. This relevancy score $R_s$ is calculated for each and every link and compared with the threshold value. The relevancy score is determined for different threshold values, and the relevant pages are retrieved.

IV. Results And Discussion

In this section, the experimental results show that the proposed web crawling algorithm using link analysis, text analysis and the probability method is more effective. The proposed algorithm has been implemented in Java (JDK 1.6), and the experimentation was performed for different categories. The experimentation was performed with different topic keywords, which are specified in Fig. 1. The figure also shows that a larger number of relevant pages is retrieved by using the above-mentioned method.

Fig. 1. Result for different topics

The analysis was made for different threshold values and different keywords. When the relevancy score is less than the threshold value, the page is considered irrelevant and is removed from the URL queue. Otherwise, the page is relevant and is placed on the queue according to its priority. The relevant pages retrieved for particular threshold values were taken into consideration, and from these sets the average number of relevant pages was computed for each topic. Fig. 2 shows the relevant pages retrieved for the keyword 'Java Beans' at different threshold values. This process is applied to the different topics to determine the efficiency of the algorithm.

The coefficients α, β and γ were also tested with different values ranging between 0 and 1 to vary the weight of the individual methods. Similarly, the experimentation was carried out for different values of the coefficients, and the relevancy score was determined for each URL. The results showed that more relevant pages were retrieved by using this method for different topics.
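The scoring and queueing steps described in Sections 1.1.1, 1.2 and IV can be sketched in Python as follows. This is a minimal sketch, not the authors' Java implementation: the function names, the case-insensitive substring test for "contains the anchor text" and the heap-based URL queue are all assumptions made for illustration.

```python
import heapq


def link_weight(anchor_texts, topic_keywords):
    """W_t = UL / LT: share of a parent's outgoing links whose anchor
    text contains a topic keyword (substring match is an assumption)."""
    if not anchor_texts:
        return 0.0
    keywords = [k.lower() for k in topic_keywords]
    matching = sum(
        1 for text in anchor_texts
        if any(k in text.lower() for k in keywords)
    )
    return matching / len(anchor_texts)


def relevancy_score(w_t, d_lev, p_m, alpha, beta, gamma):
    """R_s = alpha * W_t + beta * D_lev + gamma * P_m."""
    return alpha * w_t + beta * d_lev + gamma * p_m


def enqueue_relevant(scored_links, threshold):
    """Drop links scoring below the threshold and queue the rest so
    that the highest-scoring URL is popped first."""
    queue = []
    for url, score in scored_links:
        if score >= threshold:
            # heapq is a min-heap, so store the negated score
            heapq.heappush(queue, (-score, url))
    return queue


def pop_best(queue):
    """Remove and return the (url, score) pair with the highest score."""
    negated_score, url = heapq.heappop(queue)
    return url, -negated_score
```

Varying `threshold` and the three coefficients in such a sketch corresponds to the parameter sweeps reported in this section.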
Fig. 2. Result for the topic 'Java Beans' on different threshold values

V. Conclusion

In this paper, a crawling method combining link analysis, text content and the probability method was presented; it is more efficient and produces more relevant pages. The probability of finding similar and dissimilar keywords lets the crawler retrieve pages effectively. For experimentation, several topic keywords were given, and it was verified that more relevant pages were retrieved. The experiment was run with different threshold values to test the efficiency. The coefficient of each method in the relevancy score computation was also tested with different values. Testing across these different cases showed that the proposed algorithm was more effective in retrieving the relevant pages efficiently.

References

[1] T.H. Haveliwala, Topic-Sensitive PageRank, Proceedings of the 11th World Wide Web conference, pp. 517-526.
[2] Blaž Novak, Survey of focused web crawling algorithms, in Proceedings of SIKDD, pp. 55-58, 2004.
[3] Shalin Shah, Sep 2006. Implementing an Effective Web Crawler. Pant, G., Srinivasan, P., Menczer, F., Crawling the Web. Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, Springer-Verlag, pp. 153-178, November 2004.
[4] Debashis Hati and Amritesh Kumar, An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler, International Journal of Computer Applications, Vol. 2, no. 3, May 2010.
[5] A. Rungsawang, N. Angkawattanawit, Learnable topic-specific web crawler, Journal of Network and Computer Applications, Issue no. 28, pp. 97-114, 2005.
[6] Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim and Sigalit Ur, The shark-search algorithm. An application: tailored Web site mapping, in Proceedings of the Seventh International World Wide Web Conference on Computer Networks and ISDN Systems, Vol. 30, no. 1-7, pp. 317-326, April 1998.
[7] P. De Bra, G-J Houben, Y. Kornatzky, and R. Post, Information Retrieval in Distributed Hypertexts, in the Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[8] S. Chakrabarti, M. van den Berg, and B. Dom, Focused Crawling: A New Approach for Topic-Specific Resource Discovery, in Proc. 8th WWW, 1999.
[9] A. Rungsawang, N. Angkawattanawit, Learnable Crawling: An Efficient Approach to Topic-specific Web Resource Discovery, 2005.
[10] K. Bharat and M. Henzinger, Improved Algorithms for Topic Distillation in a Hyperlinked Environment, in Proc. of the ACM SIGIR '98 Conference on Research and Development in Information Retrieval.
[11] J. Jayanthi, Dr. K.S. Jayakumar, An Integrated Page Ranking Algorithm for Personalized Web Search, International Journal of Computer Applications, 2011.
[12] Lili Yan, Wencai Du, Yingbin Wei and Henian Chen, A novel heuristic search algorithm based on hyperlink and relevance strategy for Web Search, 2012, Advances in Intelligent and Soft Computing.
[13] Zhumin Chen, Jun Ma, Jingsheng Lei, Bo Yuan, Li Lian, Aug 24-27, 2007. An Improved Shark-Search Algorithm Based on Multi-information. Fourth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 659 - 658.
[14] Sotiris Batsakis, Euripides G.M. Petrakis, Evangelos Milios, Improving the Performance of Focused Web Crawlers, Data & Knowledge Engineering, Vol. 68, No. 10, pp. 1001-1013, October 2009.
[15] Brin, S., & Page, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web (WWW), pp. 107-117, 1998.
[16] Aggarwal C., Al-Garawi F., Yu P. Intelligent crawling on the world wide web with arbitrary predicates, in Proceedings of the 10th International World Wide Web Conference, Hong Kong, 2001, pp. 96-105.
[17] M. Najork and J.L. Wiener, "Breadth-first search crawling yields high-quality pages", in Proceedings of the Tenth Conference on World Wide Web, Hong Kong, Elsevier Science, May 2001, pp. 114-118.
[18] Lixin Han and Guihai Chen, The HWS hybrid web search, Information and Software Technology, Elsevier Science, 2005.
[19] Rodriguez-Mier P., Mucientes M., Lama M. Automatic Web service Composition with a Heuristic-based Search Algorithm, IEEE International Conference on Web Services, 2011.
[20] G. Almpanidis, C. Kotropoulos, I. Pitas, September 2007. Combining text and link analysis for focused crawling: An application for vertical search engines. Information Systems, Vol. 32, No. 6, pp. 886-908.
[21] Lili Yan, Wencai Du, Yingbin Wei and Henian Chen, "A novel heuristic search algorithm based on hyperlink and relevance strategy for Web Search", 2012, Advances in Intelligent and Soft Computing.