
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 4, April (2014), pp. 157-164 © IAEME

AN ENHANCEMENT OVER MULTI-LEVEL LINK STRUCTURE ANALYSIS TO OVERCOME FALSE POSITIVE

Pratikkumar B Chauhan (1), Kamlesh M Patel (2)
1, 2 (Computer Engineering, R.K. University, Kasturbadham, Near Tramba, Rajkot, Gujarat, India)

ABSTRACT

Search engines have become the primary source of knowledge and information on the Web. When a user executes a query, the search engine returns a ranked list of URLs based on keywords; some of these URLs have a high page rank that may have been boosted by spam links. It therefore becomes necessary to identify such spam links in search engine results so that users are not misled by them, but identifying them is a difficult task. Spammers create spam pages on the Web to earn profit or for marketing purposes, and also to obtain a higher page-ranking score in search engine results. This paper presents a detailed study of MLSA, which can identify spam links based on a threshold value; its ability to analyze links across multiple levels of depth distinguishes it from other algorithms. The paper also introduces a new mechanism that overcomes an issue observed in the experimental results, namely false positives, and reports the links that are falsely detected as spam. Finally, it discusses the new approach along with an open issue that may help research scholars dig deeper into improving the efficiency of linking-based algorithms.

Keywords: Web Spam, Link Spam, Link Farm, Spam Rank, Spamming, Web Mining, Web Structure Mining, Link Analysis.

1. INTRODUCTION

Today, as we know, the number of web users seeking information on the Internet is increasing rapidly.
Most people rely on search engines to get information from the Web, but for various reasons we sometimes do not get the information we are actually searching for. One of the reasons behind this is spam links. Many spam links appear in search engine results, and it is very difficult to identify them. Because of the bulky size of the Web, the problem of web spam is well known and not easy to solve, which makes many algorithms infeasible in practice [2][3]. Web spam is thus an attempt to increase the rank of inappropriate web pages. Among the many spamming techniques for doing so, link spam aims to increase rank by creating artificial popularity for a page, i.e., by increasing its in-links. Every user needs to know whether specific web pages are spam target pages, and our method can be applied to such individual pages: only the page farms of those pages need to be extracted and analyzed, and searching the neighbors of those pages is highly feasible [4]. Spamming degrades the quality of search and erodes users' trust in search engine results. Spam sites serve different kinds of content: malware, adult content, and phishing attacks [3][4]. Spamming is any action whose purpose is to boost a web page's position in search engine results without providing additional value. Website owners and business people always want to grow their business, so they often promote their web pages and boost rankings by attracting links from other websites; the only difference between a normal page and a targeted page is whether the links are justifiable [3][8]. All misleading actions that try to increase the ranking of a page in search engines are generally referred to as web spam or spamdexing (spamming + indexing) [2]. For example, one study ranked 100 million web pages using the PageRank algorithm.
It found that 11 of the top 20 results were pornographic websites that achieved high rankings through content and web-link manipulation [3][4][5]. Spamming is a technique by which the attacker earns maximum profit. A densely connected set of pages created explicitly for the purpose of misleading a link-based ranking algorithm is known as a link farm [2]. Pages in link farms are called boosting pages; they are created for the sole purpose of boosting the rank score of certain pages, called target pages [1][3]. A spam page is a page that is used for spamming or that receives a substantial amount of its score from other spam pages. Another definition of spam is "any attempt to mislead a search engine's relevancy algorithm" [3][6].

1.1 WEB SPAM

Web spamming appeared early, with the advent of search engines, and is not easily solved. The neighborhood of a spam page looks different from that of an honest one: the neighborhood of a link-spam page consists of a large number of artificially generated links, and these links likely come from similar objects [9]. Web spam refers to attempts to increase the ranking of a web page by manipulating the content of the page and the link structure around it. There are numerous approaches to detecting web spam, based on web page content, link structure, or a combination of these; this paper focuses on the linking structure of a website [10].

2. LINKING-BASED ALGORITHMS

This paper focuses on a linking-based algorithm, Multi-level Link Structure Analysis (MLSA), a purely link-based spam-link detection algorithm that is a modification of the seed and parental penalty algorithm. As we constantly use the Web to obtain information through websites, web spammers who create spam links take advantage of the vulnerabilities of linking-based algorithms. Search engines also use linking-based techniques to rank websites in their results.
Spammers create many artificial references or links in order to acquire a higher-than-deserved ranking in search engine results, which generates higher traffic to their websites [3][7][11]. Chakrabarti introduced a fine-grained approach that integrates document structure, using the Document Object Model (DOM), into earlier hyperlink-based topic distillation.
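To illustrate why artificial in-links pay off for spammers, the following toy sketch (not from this paper) runs a standard power-iteration PageRank over a hypothetical six-page graph and shows a target page's score rising once a small link farm points at it:

```python
# Toy PageRank via power iteration; the graphs below are invented examples.
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to its list of outgoing links."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # each page starts with the teleport mass, then receives shares
        # of rank from every page that links to it
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * pr[p] / len(outs)
        pr = nxt
    return pr

# "target" has no in-links in the honest graph; in the farmed graph three
# boosting pages f1-f3 all point at it, inflating its score.
honest = {"a": ["b"], "b": ["a"], "target": ["a"]}
farmed = {"a": ["b"], "b": ["a"], "target": ["a"],
          "f1": ["target"], "f2": ["target"], "f3": ["target"]}
```

Comparing `pagerank(honest)["target"]` with `pagerank(farmed)["target"]` shows the boost the link farm buys, which is exactly the effect link-spam detection tries to undo.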
2.1 LIMITATIONS OF THE SEED AND PARENTAL PENALTY ALGORITHM

The seed and parental penalty algorithm works as follows. When a query is executed, the search engine returns a set of result URLs. Using the same search engine, we then find a second set containing the incoming links of each of these URLs. Given the two sets, we compute their intersection, i.e., the pairs of incoming and outgoing links that point to each other. If the size of this intersection is greater than a predefined threshold, the page is marked as bad; otherwise it is marked as good. The seed and parental penalty algorithms are used to detect spam pages that directly swap links with their neighboring sites. Fig. 1 illustrates link exchanges among three domains A, B, and C, where a domain is a website with a unique domain name. The three domains have web pages A1, A2, B1, B2, C1, and C2. Domain A links to domain B through page A2 to page B2, but the reciprocal link points from page B1 to page A1. A similar link exchange holds between domains C and A, where the reciprocal link points from A2 to C2 [7].

Figure 1: Link exchange in the seed and parental penalty algorithm [7]

If no links point to each other, i.e., no intersection exists, a spam link may bypass the structure the algorithm uses to calculate page importance and lead it in the wrong direction. In our example, A1 bypasses the structure of the algorithm. Thus, a false negative can exist in the seed and parental penalty algorithms, where a spam page may be able to avoid detection [7]. MLSA was introduced to overcome this false negative.

2.2 THE MLSA ALGORITHM

The MLSA algorithm is used to detect spam pages.
In most link exchanges and link-farm spam pages, each participant has at least one outgoing link from one of the web pages within the same domain to its neighboring domain.
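For comparison with MLSA's multi-level parsing, the single-level reciprocal-link test at the heart of the seed and parental penalty algorithm (Section 2.1) can be sketched as follows; the link sets are assumed to be supplied by a search engine, which the paper does not specify:

```python
# Hypothetical sketch of the seed and parental penalty test; the outlink and
# inlink lists stand in for real search-engine responses.
def is_bad_page(outlinks, inlinks, threshold):
    """Mark a page bad when its reciprocal (mutually pointing) links
    outnumber the threshold."""
    reciprocal = set(outlinks) & set(inlinks)   # links pointing at each other
    return len(reciprocal) > threshold

# Three of four neighbours link back; with threshold 2 the page is marked bad.
outlinks = ["b.example", "c.example", "d.example", "e.example"]
inlinks = ["b.example", "c.example", "d.example", "x.example"]
flagged = is_bad_page(outlinks, inlinks, 2)   # True
```

Because the test looks only one level deep, a page like A1 in Fig. 1, whose exchange is routed through other pages of the domain, yields an empty intersection and escapes detection, which is the false negative MLSA addresses.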
Figure 2: Link exchange in MLSA [7]

Fig. 2 illustrates the link-parsing sequence of MLSA. A1, A2, A3, and A4 denote four web pages within a domain, and A1 represents the candidate web page being analyzed. MLSA first parses the outgoing links in page A1 that point externally to other domains. Outgoing links from pages A2 and A3 are collected at level 1. The algorithm then continues to the next level via the outgoing links of the internal pages that page A1 links to, and this process repeats up to a predefined number of levels. From all the outgoing links, the set of incoming links is also collected, and using both sets the algorithm calculates the number of intersections between outgoing and incoming links. If the number of intersections is greater than the predetermined threshold, the link is treated as a spam link; otherwise it is treated as a genuine link.

Let p denote the URL of a candidate web page and d[p] the domain name of p. IN(d[p]) denotes the set of incoming links to the domain or root of p, and OUT(ntmp) represents the outgoing links from a temporary node ntmp. Two initial values are required: the parsing depth Dmlsa, which is the number of levels the algorithm parses, and the threshold Tmlsa.

MLSA algorithm:
(1) For each URL i in IN(d[p]), if d[i] != d[p] and d[i] is not in InDomainList(p), then add d[i] to InDomainList(p).
(2) Set p as ntmp and set the current level L to 0.
(3) If L <= Dmlsa, then for each URL k in OUT(ntmp), execute steps 3.1 and 3.2:
    (3.1) If d[k] != d[p] and d[k] is not in OutDomainList(p), then add d[k] to OutDomainList(p).
    (3.2) Else, if d[k] == d[p], then increment L, set k as ntmp, and repeat step 3.
(4) Calculate the intersection of InDomainList(p) and OutDomainList(p).
If the size of the intersection is greater than the threshold value Tmlsa, the page is marked as a bad page.
(5) Repeat all the above steps for every search-result URL p.

Running this algorithm, it has been observed that false positives exist: sites that are genuine are also detected as spam. For instance, dmoz.org and navjivannaturecure.com, both genuine sites, are detected as spam pages. The analysis of some links with level 2 and threshold value 5 is shown in Table 1, which confirms that false positives exist in the system.
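Steps (1)-(5) above can be sketched as follows. The helpers incoming_links, outgoing_links, and domain_of are hypothetical stand-ins for the search-engine and parser calls that produce IN(d[p]), OUT(ntmp), and d[.], which the paper does not detail:

```python
# Minimal sketch of MLSA; helper callables are illustrative assumptions.
def mlsa_is_spam(p, incoming_links, outgoing_links, domain_of, D_mlsa, T_mlsa):
    d_p = domain_of(p)
    # Step (1): external domains linking in to d[p]
    in_domains = {domain_of(i) for i in incoming_links(d_p)
                  if domain_of(i) != d_p}
    # Steps (2)-(3): collect external out-domains, following internal
    # links up to D_mlsa levels deep
    out_domains = set()
    def parse(n_tmp, level):
        if level > D_mlsa:
            return
        for k in outgoing_links(n_tmp):
            d_k = domain_of(k)
            if d_k != d_p:
                out_domains.add(d_k)      # step (3.1): external link
            else:
                parse(k, level + 1)       # step (3.2): go one level deeper
    parse(p, 0)
    # Step (4): threshold on the intersection size
    return len(in_domains & out_domains) > T_mlsa

# Toy usage: a.com exchanges links with b.com and c.com (invented data).
domain_of = lambda u: u.split("/")[0]
toy_out = {"a.com/p": ["b.com/x", "a.com/q"], "a.com/q": ["c.com/y"]}
outgoing = lambda u: toy_out.get(u, [])
incoming = lambda dom: ["b.com/z", "c.com/w"]   # pages linking in to a.com
verdict = mlsa_is_spam("a.com/p", incoming, outgoing, domain_of, 2, 1)  # True
```

Note that a link reached through the internal page a.com/q is still counted, which is exactly the multi-level behavior that distinguishes MLSA from the single-level seed and parental penalty check.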
Table 1: Experimental result analysis of the MLSA algorithm

  URL                                           Intersection   Spam?
  http://internethomeloans.com.au/                    14        Yes
  http://www.quantumfinancesolutions.com.au/          59        Yes
  http://www.lifeinsurance.net.au/                   103        Yes
  http://a1carlease.com.au/                           95        Yes
  http://www.domz.org/docs/en/about.html               6        Yes
  http://pratikchauhan.co.in/single.html              12        Yes
  http://www.navjivannaturecure.com/index.php          9        Yes

3. PROPOSED WORK

As noted above, false positives exist in the existing system, and this paper focuses on that issue. The proposed method uses the concept of duplication: a link that has been detected once is stored in an array, and if the same link is detected again it is not scanned a second time. The method also compares each extracted link with the master URL; when they match, the crosslink counter may be incremented, depending on a domain comparison. For this, the main domain of both the master URL and the currently extracted URL is calculated. For example, the URL http://pratikchauhan.co.in/single.html has the main domain http://www.pratikchauhan.co.in. If both main domains are the same, the crosslink counter is not incremented; if the link comes from another domain, the counter is incremented.

Let M denote the master_URL, S[] the array of scanned_URLs, cross_link the crosslink counter, TH the threshold value, Li the limit, and L the level, which is set to 1. K represents a temporary URL.

3.1 IMPROVED MLSA

Input: M, TH, Li, L
Output: whether the URL is spam or not.

Fetch_recursive(M, Li, L)
{
    if (L > Li)
        return;
    else
        extract all outlinks of M using a DOM parser and save them in HT;
    find the main domain d1 of M;
    initialize S[];
    foreach K in HT {
        if (K ∈ S[])
            continue;
        if (K == M) {
            find the main domain d2 of K;
            if (d1 == d2) {
                // crosslink occurs at the same domain; cross_link is not incremented
            } else {
                // crosslink occurs from another domain
                cross_link++;
                continue;
            }
        }
        S[] = K;
        Fetch_recursive(K, Li, L++);
    }
}
if (cross_link >= TH)
    M is detected as spam
else
    M is not spam.

Using this algorithm, the false positive issue present in the existing algorithm is solved. Table 2 shows the result and analysis of the same links for level 2 and threshold value 5.

Table 2: Experimental result analysis of the Improved MLSA algorithm

  URL                                           Intersection   Spam?
  http://internethomeloans.com.au/                     6        Yes
  http://www.quantumfinancesolutions.com.au/           3        No
  http://www.lifeinsurance.net.au/                    21        Yes
  http://a1carlease.com.au/                            5        Yes
  http://www.domz.org/docs/en/about.html               0        No
  http://pratikchauhan.co.in/single.html               0        No
  http://www.navjivannaturecure.com/index.php          0        No
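One defensible reading of the Fetch_recursive pseudocode above can be sketched as follows. Here extract_outlinks stands in for the DOM parser, main_domain is a simplified illustration of the main-domain calculation, and a crosslink is counted when a page in another domain links back to the master URL's main domain; already-scanned links are skipped, which is what removes the repeated counting behind MLSA's false positives:

```python
# Illustrative sketch, not the authors' implementation; extract_outlinks and
# main_domain are assumed helpers.
from urllib.parse import urlparse

def main_domain(url):
    """Reduce a URL to its main domain, e.g. .../single.html -> pratikchauhan.co.in"""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def count_crosslinks(M, extract_outlinks, Li):
    """Count links back to M's main domain from pages in other domains,
    scanning each link at most once, up to Li levels."""
    d1 = main_domain(M)
    scanned = set()
    cross_link = 0
    frontier = [(M, 1)]                       # (page, level), level starts at 1
    while frontier:
        page, L = frontier.pop()
        if L > Li:
            continue
        for K in extract_outlinks(page):
            if K in scanned:
                continue                      # duplicate: do not scan again
            if main_domain(K) == d1 and main_domain(page) != d1:
                cross_link += 1               # crosslink from another domain
                continue                      # same-domain backlinks not counted
            scanned.add(K)
            frontier.append((K, L + 1))
    return cross_link

def improved_mlsa_is_spam(M, extract_outlinks, TH, Li):
    return count_crosslinks(M, extract_outlinks, Li) >= TH

# Toy graph (invented): the master page links to spam.com, which links back
# to the master's main domain -> exactly one crosslink from another domain.
toy = {
    "http://pratikchauhan.co.in/single.html":
        ["http://spam.com/a", "http://pratikchauhan.co.in/about.html"],
    "http://spam.com/a": ["http://www.pratikchauhan.co.in/"],
}
outlinks = lambda u: toy.get(u, [])
n = count_crosslinks("http://pratikchauhan.co.in/single.html", outlinks, 2)  # 1
```

With threshold 5, this toy master URL is not flagged, mirroring the Table 2 outcome for http://pratikchauhan.co.in/single.html; the internal link to about.html is not counted because its main domain matches d1.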
As the graph created from the two tables above (level 2, threshold value 5) shows, both algorithms compute the number of intersections, or crosslinks, that exist for any particular URL. In the MLSA algorithm false positives exist, meaning sites that are genuine are detected as spam links; this paper focuses on that issue, and with the Improved MLSA algorithm the issue is solved. A link whose number of intersections is greater than the threshold is detected as a spam link, and the others are not. Thus, by reducing the number of intersections, the newly proposed work resolves 90-100% of the false positives.

4. CONCLUSION

After extensive study and comparison, we conclude that linking-based algorithms can be used to detect link spam. This paper focused on Multi-level Link Structure Analysis (MLSA), which relies purely on the linking structure of web pages and is an enhancement of the seed and parental penalty algorithm. In the proposed algorithm, the back links from a page to its home node or master URL are calculated and termed crosslinks. The false positive issue present in the MLSA algorithm is solved using the same threshold comparison together with the crosslink counter, overcoming false positives with an accuracy of 90-100%. Websites that are genuine but detected as spam by the MLSA algorithm are detected as genuine by the Improved MLSA algorithm. Each node in a website has at least one outgoing link to another domain or to the same domain. However, further improvement of this algorithm is necessary, because it takes too much time to extract links up to the predefined level. The efficiency of the proposed algorithm may therefore be improved by integrating it with content-spam and cloaking algorithms.
REFERENCES

[1] Athasit Surarerks, Arnon Rungsawang, Chakrit Likitkhajorn, "An Approach of Two-Way Spam Detection Based on Boosting Pages Analysis", IEEE, 978-1-4673-2025-2/12, 2012.
[2] Carlos Castillo, Debora Donato, Luca Becchetti, "Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection", WEBKDD'06, August 20, 2006, Philadelphia, Pennsylvania, USA.
[3] Chauhan Pratikkumar Bharatbhai, Kamlesh M Patel, "Analysis of Spam Link Detection Algorithm based on Hyperlinks", IFRSA International Journal of Data Warehousing & Mining, Vol. 4, Issue 1, Feb. 2014, pp. 67-72.
[4] Jiawei Han, Nikita Spirin, "Survey on Web Spam Detection: Principles and Algorithms", SIGKDD Explorations, Volume 13, Issue 2, pp. 50-64.
[5] K.K. Arthi, Dr. V. Thiagarasu, "A Study on Web Spam Classification and Algorithms", International Journal of Computer Trends and Technology, Volume 4, Issue 9, Sep. 2013, ISSN: 2231-2803, pp. 3126-3131.
[6] Mr. R. BalaKumar, Mr. P. Rajendran, Mrs. R. Mynavathi, "Survey on Spam Detection Techniques in Data Mining", International Journal of Advanced Research in Data Mining and Cloud Computing, Vol. 1, Issue 1, July 2013, ISSN 2321-8754, pp. 8-17.
[7] Tan Su Tung, Nor Adnan Yahaya, S.M.F.D Syed Mustapha, "Multi-level Link Structure Analysis Technique for Detecting Link Farm Spam Pages", IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, 0-7695-2749-3/06, 2006.
[8] Zhou, B. and Pei, J., "Link Spam Target Detection Using Page Farms", ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 3, Article 13, 1556-4681/2009/07, July 2009, USA.
[9] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, Mate Uher, "SpamRank – Fully Automatic Link Spam Detection", Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H-1111 Budapest, Hungary; Eotvos University, Budapest.
[10] Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Kamal Jain, Vahab Mirrokni, Shanghua Teng, "Robust PageRank and Locally Computable Spam Detection Features", AIRWeb '08, April 22, 2008, Beijing, China.
[11] Shekoofeh Ghiam and Alireza Nemaney Pour, "A Survey On Web Spam Detection Methods: Taxonomy", International Journal of Network Security & Its Applications (IJNSA), Vol. 4, No. 5, September 2012, Iran.
[12] Jyoti Pruthi and Dr. Ela Kumar, "Data Set Selection in Anti-Spamming Algorithm - Large or Small", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 206-212, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[13] Goverdhan Reddy Jidiga and Dr. P. Sammulal, "Machine Learning Approach to Anomaly Detection in Cyber Security with a Case Study of Spamming Attack", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 113-122, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[14] Prajakta Ozarkar and Dr. Manasi Patwardhan, "Efficient Spam Classification by Appropriate Feature Selection", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 123-139, ISSN Print: 0976-6367, ISSN Online: 0976-6375.