SlideShare a Scribd company logo
1 of 5
Download to read offline
International Journal of Engineering Inventions
e-ISSN: 2278-7461, p-ISSN: 2319-6491
Volume 4, Issue 1 (July 2014) PP: 55-59
www.ijeijournal.com Page | 55
Page Rank Link Farm Detection
Akshay Saxena1
, Rohit Nigam2
1, 2
Department Of Information and Communication Technology (ICT) Manipal University, Manipal - 576014,
Karnataka, India
Abstract: The PageRank algorithm is an important algorithm which is implemented to determine the quality of
a page on the web. With search engines attaining a high position in guiding the traffic on the internet,
PageRank is an important factor to determine its flow. Since link analysis is used in search engine’s ranking
systems, link based spam structure known as link farms are created by spammers to generate a high PageRank
for their and in turn a target page. In this paper, we suggest a method through which these structures can be
detected and thus the overall ranking results can be improved.
I. Introduction
1.1 Page Rank
We will be working primarily on PageRank[1]
algorithm developed by Larry Page of Google. This
algorithm determines the importance of a website by counting the number and quality of links to a page,
assuming that an important website will have more links pointing to it by other websites.
This technique is used in Google's search process to rank websites of a search result.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm
that was used by the company, and it is the best-known. The counting and indexing of Web pages is done
through various programs like webcrawlers[1]
and other bot programs. By assigning a numerical weighting to
each link, it repeatedly applies the PageRank algorithm to get a more accurate result and determine the relative
rank of a Web page within a set pages.
The set of Web pages being considered is thought of as a directed graph with each node representing a
page and each edge as a link between pages. We determine the number of links pointing to a page by traversing
the graph and use those values in the PageRank formula.
Pagerank is quite prone to manipulation. Our goal was to find an effective way to ignore links from
pages that are trying to falsify a Pagerank. These spamming pages usually occur as a group called link farms
described below.
1.2 Complete Graph
In the mathematical field of graph theory, a complete graph[2]
is a simple undirected graph in which
every pair of distinct vertices is connected by a unique edge. A complete digraph is a directed graph in which
every pair of distinct vertices is connected by a pair of unique edges (one in each direction).
1.3 Clique
In the mathematical area of graph theory, a clique[2]
an undirected graph is a subset of its vertices such
that every two vertices in the subset are connected by an edge. Cliques are one of the basic concepts of graph
theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been
studied in computer science: the task of finding whether there is a clique of a given size in a graph (the clique
problem) is NP-complete, but despite this hardness result many algorithms for finding cliques have been
studied.
FIG 1.1
Page Rank Link Farm Detection
www.ijeijournal.com Page | 56
1.4 Link Farm
On the World Wide Web, a link farm[3]
is any group of web sites that all hyperlink to every other site in
the group. In graph theoretic terms, a link farm is a clique. Although some link farms can be created by hand,
most are created through automated programs and services. A link farm is a form of spamming the index of a
search engine (sometimes called spamdexing or spamexing). Other link exchange systems are designed to allow
individual websites to selectively exchange links with other relevant websites and are not considered a form of
spamdexing.
FIG 1.2 Normal Graph of Web Pages FIG 1.3 Link Farm of 1, 2, 3 & 4
II. Existing System
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a
given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not
counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is
defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are inbound links). The parameter d is a
damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the
next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is
given as follows:
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRank values form a probability distribution over web pages, so the sum of all web
pages' PageRank values will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can
be computed in a few hours on a medium size workstation
III. Previous Work
Link spamming is one of the web spamming techniques that try to mislead link-based ranking
algorithms such as PageRank and HITS. Since these algorithms consider a link to pages as an endorsement for
that page, spammers create numerous false links and construct an artificially interlinked link structure, so called
a spam farm, to centralize link-based importance to their own spam pages.
To understand the web spamming, Gyöngyi et al[4]
described various web spamming techniques in.
Optimal link structures to boost PageRank scores are also studied to grasp the behavior of web spammers.
Fetterly[5]
found out that outliers in statistical distributions are very likely to be spam by analyzing statistical
properties of linkage, URL, host resolutions and contents of pages. To demote link spam, Gyöngyi et
alintroduced TrustRank[4]
that is a biased PageRank where rank scores start to be propagated from a seed set of
good pages through outgoing links. By this, we can expect spam pages to get low rank. Optimizing the link
structure is another approach to demote link spam.
Carvalho[6]
proposed the idea of noisy links, a link structure that has a negative impact on link-based
ranking algorithms. Qi et al. also estimated the quality of links by similarity of two pages. To detect link spam,
Bencz´ur introduced SpamRank[7]
. SpamRank checks PageRank score distributions of all in-neighbors of a
target page. If this distribution is abnormal, SpamRank regards a target page as a spam and penalizes it.
Becchetti[8]
employed link-based features for the link spam detection. They built a link spam classifier
with several features of the link structure like degrees, link-based ranking scores, and characteristics of out-
Page Rank Link Farm Detection
www.ijeijournal.com Page | 57
neighbors. Saito[9]
employed a graph algorithm to detect link spam. They decomposed the Web graph into
strongly connected components and discovered that large components are spam with high probability. Link
farms in the core were extracted by maximal clique enumeration.
IV. Proposed System
The PageRank algorithm, as described in the previous section, works upon a given set of pages found
via search using Web spiders or other programs. Now to rank these pages, we consider the links of each page
and apply the PageRank algorithm to it. However, the page rank algorithm can be easily fooled by link farms.
Link Farms work as follows - We know that the PageRank is higher for pages with more links pointing to it i.e.
more the in-line, higher the rank. A link farm forms a complete graph. This means, we have a large set of pages
pointing to each other, thus increasing the number of in-links and out-links even they are not relevant, hence
they falsify the Page Rank values obtained. On applying the algorithm to this graph, an interesting fact came to
notice that the Page Rank values of all the pages in a link farm was the same and remained so even after
multiple iterations of the algorithm. One could easily use this fact to identify link farms and remove such sets of
pages from our whole set. But the problem occurs when a page which is not a part of the link farm and is
actually relevant, it happens to have the same or nearly same PageRank as that of the link farm pages. So if we
consider pages with same page ranks for elimination, some relevant pages might also get discarded
unintentionally.
To solve this problem, we introduce a new factor we call Gap Rank. GapRank is based on PageRank
and in a way inverse of it. While PageRank is based on inbound links of a page, GapRank is calculated using
outbound links of pages.
Gaprank Formula
GR(A) = (1-d)/N + d (GR(T1)/L(T1) + ... + GR(Tn)/L(Tn))
where A is the page under consideration, T1,T2….Tn the set of pages that link from A i.e. outbound
pages,L(Tn) is the number of inbound links on page Tn, and N is the total number of pages.
The purpose of this formula is to rank the set of pages based on the number of outbound links. We then use the
GapRank to identify the Link Farm. Since all the pages in a link farm have same number of inbound links and
outbound links, the GapRank of these pages will be equal, just like their PageRanks. This can now be used to
differentiate a link farm from a page having almost same PageRank as those of link farm pages, but it's
GapRank will be different.
This analysis can further be used to reduce the ranks of these spam pages or eliminate them altogether
so that we get proper ranks for pages that suffered because of the link farm.
V. Methodology
Consider a given dataset consisting of 11 pages P1, P2,..,P11. The links between the pages can be seen in the
FIG 5.1, depicting a graph of nodes for each page and edges for links between them.
[FIG 5.1 Graph of initial dataset]
Page Rank Link Farm Detection
www.ijeijournal.com Page | 58
5.1 Calculating Pagerank
Initial PageRank to be given to all pages will be 1/n, where n is the size of dataset, i.e., 1/11. Set the
initial damping factor. In our example we take damping factor as 0.85.
Calculate the PageRank values of all the pages according to the formula :
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR represents the PageRank of the represented page, C represents the total number of out bound links of
the page and d represents the damping factor.
For the given dataset, we obtain the following PageRank values after 30 iterations (Table 5.1.1):
Page Name PageRank
P1 0.0557838
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
P7 0.0194318
P8 0.0359489
P9 0.0136364
P10 0.0136364
P11 0.049858
[Table 5.1.1 PageRank values of all pages in the dataset.]
We consider the PageRank values obtained after 30 iterations of PageRank calculation for all pages as
final because variations in the values after this will be negligible. Negligible changes reflect final values allotted
to each page.
5.2 Searching For Duplicate Pagerank Values
Scan the result for all the duplicate PageRank values using the best scanning technique available. The
pages with duplicate values are shown in Table 5.2.1. These pages may or may not belong to the spam group
(link farm) as some pages may have the same value as a spam page just by mere coincidence. To eliminate these
pages from the suspected pages, we perform the next step.
Page Name PageRank
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
[Table 5.2.1 Pages having duplicate values]
5.3 Calculating Gaprank Values
Gap Rank values for the selected pages are calculated in the same way as PageRank values, by using
the formula shown in fig 4.1. The Gap Rank values obtained are shown in the Table 5.3.1
Page Name GapRank
P2 0.106365
P3 0.106365
P4 0.106365
P5 0.106365
P6 0.106365
[Table 5.2.1 GapRank values of pages having duplicate PageRank values ]
The pages having the same Gap Rank and PageRank values are identified as spam pages which
constitute the link farm.
Page Rank Link Farm Detection
www.ijeijournal.com Page | 59
5.4 Remove Link Farms From The Dataset
The pages constituting the link farm are removed from the data set and a new dataset is obtained
without the spam pages. The PageRank values of the pages in the new dataset are calculated from the beginning
with new initial PageRank values for all pages determined by 1/n’, where n’ is the new dataset size. The new
PageRank values are shown in the table 5.4.1 given below.
Page Name PageRank
P1 0.122724
P7 0.04275
P8 0.0790875
P9 0.03
P10 0.03
[Table5.4.1 New PageRank Values for non-spam pages]
REFERENCES
[1]. Sergey Brin and Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
[2]. Richard D. Alba: A graph-theoretic definition of a Sociometric Clique
[3]. Baoning Wu and Brian D. Davison:Identifying Link Farm Spam Pages
[4]. Z. Gy¨ongyi and H. Garcia-Molina. Web spamtaxonomy. In Proceedings of the 1st internationalworkshop on Adversarial
information retrieval on theWeb, 2005.
[5]. D. Fetterly, M. Manasse and M. Najork. Spam, damnspam, and statistics: using statistical analysis tolocate spam Web pages. In
Proceedings of the 7th
International Workshop on the Web and Databases,2004.
[6]. A. Carvalho, P. Chirita, E. Moura and P. Calado. Sitelevel noise removal for search engines. In Proceedingsof the 15th international
conference on World WideWeb. 2006
[7]. A. A. Bencz´ur, K Csalog´any, T Sarl´os and M. Uher.SpamRank-fully automatic link spam detection. InProceedings of the 1st
international workshop onAdversarial information retrieval on the Web, 2005
[8]. L. Becchetti, C. Castillo, D. Donato, S. Leonardi andR. Baeza-Yates. Link-based characterization anddetection of Web spam. In
Proceedings of the 2nd
international workshop on Adversarial informationretrieval on the Web, 2006.
[9]. H. Saito, M. Toyoda, M. Kitsuregawa and K. Aihara: A large-scale study of link spam detection by graphalgorithms In Proceedings
of the 3rd internationalworkshop on Adversarial information retrieval on theWeb, 2007.

More Related Content

What's hot (17)

Ranking Web Pages
Ranking Web PagesRanking Web Pages
Ranking Web Pages
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerank
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)
 
PageRank and Markov Chain
PageRank and Markov ChainPageRank and Markov Chain
PageRank and Markov Chain
 
J046045558
J046045558J046045558
J046045558
 
CSE509 Lecture 3
CSE509 Lecture 3CSE509 Lecture 3
CSE509 Lecture 3
 
Efficient focused web crawling approach
Efficient focused web crawling approachEfficient focused web crawling approach
Efficient focused web crawling approach
 
JPJ1423 Keyword Query Routing
JPJ1423   Keyword Query RoutingJPJ1423   Keyword Query Routing
JPJ1423 Keyword Query Routing
 
Sree saranya
Sree saranyaSree saranya
Sree saranya
 
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGESWEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
 
keyword query routing
keyword query routingkeyword query routing
keyword query routing
 
Extracting Resources that Help Tell Events' Stories
Extracting Resources that Help Tell Events' StoriesExtracting Resources that Help Tell Events' Stories
Extracting Resources that Help Tell Events' Stories
 
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routingIEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
 
Keyword query routing
Keyword query routingKeyword query routing
Keyword query routing
 
Topic-specific Web Crawler using Probability Method
Topic-specific Web Crawler using Probability MethodTopic-specific Web Crawler using Probability Method
Topic-specific Web Crawler using Probability Method
 
Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461
 
Transitivity of Trust
Transitivity of TrustTransitivity of Trust
Transitivity of Trust
 

Viewers also liked

Virus de la inmunodeficiencia humana
Virus de la inmunodeficiencia humanaVirus de la inmunodeficiencia humana
Virus de la inmunodeficiencia humanaguest4510b18
 
southern 2008 3rd
southern 2008 3rdsouthern 2008 3rd
southern 2008 3rdfinance17
 
Trabajo practico n5 nogara ferroniiiii (1)
Trabajo practico n5 nogara ferroniiiii (1)Trabajo practico n5 nogara ferroniiiii (1)
Trabajo practico n5 nogara ferroniiiii (1)Leo Ferroni
 
Presentacion mis cosas china
Presentacion mis cosas chinaPresentacion mis cosas china
Presentacion mis cosas chinaandy_saf
 
материнская плата
материнская платаматеринская плата
материнская платаSvetlana Belova
 
otyskanie chasti-ot_celogo_i_celogo_po_ego_chasti
otyskanie chasti-ot_celogo_i_celogo_po_ego_chastiotyskanie chasti-ot_celogo_i_celogo_po_ego_chasti
otyskanie chasti-ot_celogo_i_celogo_po_ego_chastiArtem Puzyrevich
 
Dec 6 2014 Announcements
Dec 6 2014 AnnouncementsDec 6 2014 Announcements
Dec 6 2014 Announcementstriumphantlife
 
Gerin daalgabar 2
Gerin daalgabar 2Gerin daalgabar 2
Gerin daalgabar 2enhmaa95
 
Martin luther and reformation
Martin luther and reformationMartin luther and reformation
Martin luther and reformationmarianasaenz25
 
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2Cassarah Peony
 
AssembléIas Setoriais De Cultura De Roraima
AssembléIas Setoriais De Cultura De RoraimaAssembléIas Setoriais De Cultura De Roraima
AssembléIas Setoriais De Cultura De RoraimaEdgar Borges
 
Presentación semana 4
Presentación semana 4Presentación semana 4
Presentación semana 4David Uoc
 
Kodak Brownie Cresta 3
Kodak Brownie Cresta 3Kodak Brownie Cresta 3
Kodak Brownie Cresta 3sallysk_84
 
computer sciences an 2000
computer sciences an 2000computer sciences an 2000
computer sciences an 2000finance17
 
Black Tie Inside
Black Tie InsideBlack Tie Inside
Black Tie Insidevarietyiowa
 
Nacen del vientre de su madre
Nacen del vientre de su madreNacen del vientre de su madre
Nacen del vientre de su madrecolenoblejas
 
Flat Stanley in Carlsbad Flower Fields
Flat Stanley in Carlsbad Flower FieldsFlat Stanley in Carlsbad Flower Fields
Flat Stanley in Carlsbad Flower Fieldsatumtu
 

Viewers also liked (20)

Virus de la inmunodeficiencia humana
Virus de la inmunodeficiencia humanaVirus de la inmunodeficiencia humana
Virus de la inmunodeficiencia humana
 
southern 2008 3rd
southern 2008 3rdsouthern 2008 3rd
southern 2008 3rd
 
как малката кухня да изглежда по просторна
как малката кухня да изглежда по просторнакак малката кухня да изглежда по просторна
как малката кухня да изглежда по просторна
 
Trabajo practico n5 nogara ferroniiiii (1)
Trabajo practico n5 nogara ferroniiiii (1)Trabajo practico n5 nogara ferroniiiii (1)
Trabajo practico n5 nogara ferroniiiii (1)
 
Presentacion mis cosas china
Presentacion mis cosas chinaPresentacion mis cosas china
Presentacion mis cosas china
 
материнская плата
материнская платаматеринская плата
материнская плата
 
otyskanie chasti-ot_celogo_i_celogo_po_ego_chasti
otyskanie chasti-ot_celogo_i_celogo_po_ego_chastiotyskanie chasti-ot_celogo_i_celogo_po_ego_chasti
otyskanie chasti-ot_celogo_i_celogo_po_ego_chasti
 
Dec 6 2014 Announcements
Dec 6 2014 AnnouncementsDec 6 2014 Announcements
Dec 6 2014 Announcements
 
Gerin daalgabar 2
Gerin daalgabar 2Gerin daalgabar 2
Gerin daalgabar 2
 
Martin luther and reformation
Martin luther and reformationMartin luther and reformation
Martin luther and reformation
 
Rail Мasters 2014
Rail Мasters 2014Rail Мasters 2014
Rail Мasters 2014
 
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2
Rainbow Girls Supply’s and Treasures + Sunday Extravaganza! 2.2
 
AssembléIas Setoriais De Cultura De Roraima
AssembléIas Setoriais De Cultura De RoraimaAssembléIas Setoriais De Cultura De Roraima
AssembléIas Setoriais De Cultura De Roraima
 
Presentación semana 4
Presentación semana 4Presentación semana 4
Presentación semana 4
 
Kodak Brownie Cresta 3
Kodak Brownie Cresta 3Kodak Brownie Cresta 3
Kodak Brownie Cresta 3
 
computer sciences an 2000
computer sciences an 2000computer sciences an 2000
computer sciences an 2000
 
Black Tie Inside
Black Tie InsideBlack Tie Inside
Black Tie Inside
 
DKF Erfa møde
DKF Erfa mødeDKF Erfa møde
DKF Erfa møde
 
Nacen del vientre de su madre
Nacen del vientre de su madreNacen del vientre de su madre
Nacen del vientre de su madre
 
Flat Stanley in Carlsbad Flower Fields
Flat Stanley in Carlsbad Flower FieldsFlat Stanley in Carlsbad Flower Fields
Flat Stanley in Carlsbad Flower Fields
 

Similar to Detecting Page Rank Link Farms

Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spamJames Arnold
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)James Arnold
 
Enhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLEnhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLIOSR Journals
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances goodJames Arnold
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESSubhajit Sahu
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET Journal
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESSubhajit Sahu
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystificationRaja R
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibEl Habib NFAOUI
 
Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithmalexandrelevada
 
PageRank & Searching
PageRank & SearchingPageRank & Searching
PageRank & Searchingrahulbindra
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data miningMai Mustafa
 
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHLINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHDivyansh Verma
 

Similar to Detecting Page Rank Link Farms (20)

TrustRank.PDF
TrustRank.PDFTrustRank.PDF
TrustRank.PDF
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spam
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)
 
PageRank Algorithm
PageRank AlgorithmPageRank Algorithm
PageRank Algorithm
 
Enhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLEnhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOL
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances good
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A Comparison
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTES
 
Page Rank
Page RankPage Rank
Page Rank
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystification
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_Habib
 
Page Rank
Page RankPage Rank
Page Rank
 
Page Rank
Page RankPage Rank
Page Rank
 
Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithm
 
PageRank & Searching
PageRank & SearchingPageRank & Searching
PageRank & Searching
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data mining
 
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHLINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
 
Page Rank
Page RankPage Rank
Page Rank
 
Page Rank
Page RankPage Rank
Page Rank
 

More from International Journal of Engineering Inventions www.ijeijournal.com

More from International Journal of Engineering Inventions www.ijeijournal.com (20)

H04124548
H04124548H04124548
H04124548
 
G04123844
G04123844G04123844
G04123844
 
F04123137
F04123137F04123137
F04123137
 
E04122330
E04122330E04122330
E04122330
 
C04121115
C04121115C04121115
C04121115
 
B04120610
B04120610B04120610
B04120610
 
A04120105
A04120105A04120105
A04120105
 
F04113640
F04113640F04113640
F04113640
 
E04112135
E04112135E04112135
E04112135
 
D04111520
D04111520D04111520
D04111520
 
C04111114
C04111114C04111114
C04111114
 
B04110710
B04110710B04110710
B04110710
 
A04110106
A04110106A04110106
A04110106
 
I04105358
I04105358I04105358
I04105358
 
H04104952
H04104952H04104952
H04104952
 
G04103948
G04103948G04103948
G04103948
 
F04103138
F04103138F04103138
F04103138
 
E04102330
E04102330E04102330
E04102330
 
D04101822
D04101822D04101822
D04101822
 
C04101217
C04101217C04101217
C04101217
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Detecting Page Rank Link Farms

  • 1. International Journal of Engineering Inventions e-ISSN: 2278-7461, p-ISSN: 2319-6491 Volume 4, Issue 1 (July 2014) PP: 55-59 www.ijeijournal.com Page | 55 Page Rank Link Farm Detection Akshay Saxena1 , Rohit Nigam2 1, 2 Department Of Information and Communication Technology (ICT) Manipal University, Manipal - 576014, Karnataka, India Abstract: The PageRank algorithm is an important algorithm which is implemented to determine the quality of a page on the web. With search engines attaining a high position in guiding the traffic on the internet, PageRank is an important factor to determine its flow. Since link analysis is used in search engine’s ranking systems, link based spam structure known as link farms are created by spammers to generate a high PageRank for their and in turn a target page. In this paper, we suggest a method through which these structures can be detected and thus the overall ranking results can be improved. I. Introduction 1.1 Page Rank We will be working primarily on PageRank[1] algorithm developed by Larry Page of Google. This algorithm determines the importance of a website by counting the number and quality of links to a page, assuming that an important website will have more links pointing to it by other websites. This technique is used in Google's search process to rank websites of a search result. It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that was used by the company, and it is the best-known. The counting and indexing of Web pages is done through various programs like webcrawlers[1] and other bot programs. By assigning a numerical weighting to each link, it repeatedly applies the PageRank algorithm to get a more accurate result and determine the relative rank of a Web page within a set pages. The set of Web pages being considered is thought of as a directed graph with each node representing a page and each edge as a link between pages. We determine the number of links pointing to a page by traversing the graph and use those values in the PageRank formula. Pagerank is quite prone to manipulation. Our goal was to find an effective way to ignore links from pages that are trying to falsify a Pagerank. These spamming pages usually occur as a group called link farms described below. 1.2 Complete Graph In the mathematical field of graph theory, a complete graph[2] is a simple undirected graph in which every pair of distinct vertices is connected by a unique edge. A complete digraph is a directed graph in which every pair of distinct vertices is connected by a pair of unique edges (one in each direction). 1.3 Clique In the mathematical area of graph theory, a clique[2] an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge. Cliques are one of the basic concepts of graph theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been studied in computer science: the task of finding whether there is a clique of a given size in a graph (the clique problem) is NP-complete, but despite this hardness result many algorithms for finding cliques have been studied. FIG 1.1
  • 2. Page Rank Link Farm Detection www.ijeijournal.com Page | 56 1.4 Link Farm On the World Wide Web, a link farm[3] is any group of web sites that all hyperlink to every other site in the group. In graph theoretic terms, a link farm is a clique. Although some link farms can be created by hand, most are created through automated programs and services. A link farm is a form of spamming the index of a search engine (sometimes called spamdexing or spamexing). Other link exchange systems are designed to allow individual websites to selectively exchange links with other relevant websites and are not considered a form of spamdexing. FIG 1.2 Normal Graph of Web Pages FIG 1.3 Link Farm of 1, 2, 3 & 4 II. Existing System Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows: We assume page A has pages T1...Tn which point to it (i.e., are inbound links). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Note that the PageRank values form a probability distribution over web pages, so the sum of all web pages' PageRank values will be one. PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation III. Previous Work Link spamming is one of the web spamming techniques that try to mislead link-based ranking algorithms such as PageRank and HITS. Since these algorithms consider a link to pages as an endorsement for that page, spammers create numerous false links and construct an artificially interlinked link structure, so called a spam farm, to centralize link-based importance to their own spam pages. To understand the web spamming, Gyöngyi et al[4] described various web spamming techniques in. Optimal link structures to boost PageRank scores are also studied to grasp the behavior of web spammers. Fetterly[5] found out that outliers in statistical distributions are very likely to be spam by analyzing statistical properties of linkage, URL, host resolutions and contents of pages. To demote link spam, Gyöngyi et alintroduced TrustRank[4] that is a biased PageRank where rank scores start to be propagated from a seed set of good pages through outgoing links. By this, we can expect spam pages to get low rank. Optimizing the link structure is another approach to demote link spam. Carvalho[6] proposed the idea of noisy links, a link structure that has a negative impact on link-based ranking algorithms. Qi et al. also estimated the quality of links by similarity of two pages. To detect link spam, Bencz´ur introduced SpamRank[7] . SpamRank checks PageRank score distributions of all in-neighbors of a target page. If this distribution is abnormal, SpamRank regards a target page as a spam and penalizes it. Becchetti[8] employed link-based features for the link spam detection. They built a link spam classifier with several features of the link structure like degrees, link-based ranking scores, and characteristics of out-
  • 3. Page Rank Link Farm Detection www.ijeijournal.com Page | 57 neighbors. Saito[9] employed a graph algorithm to detect link spam. They decomposed the Web graph into strongly connected components and discovered that large components are spam with high probability. Link farms in the core were extracted by maximal clique enumeration. IV. Proposed System The PageRank algorithm, as described in the previous section, works upon a given set of pages found via search using Web spiders or other programs. Now to rank these pages, we consider the links of each page and apply the PageRank algorithm to it. However, the page rank algorithm can be easily fooled by link farms. Link Farms work as follows - We know that the PageRank is higher for pages with more links pointing to it i.e. more the in-line, higher the rank. A link farm forms a complete graph. This means, we have a large set of pages pointing to each other, thus increasing the number of in-links and out-links even they are not relevant, hence they falsify the Page Rank values obtained. On applying the algorithm to this graph, an interesting fact came to notice that the Page Rank values of all the pages in a link farm was the same and remained so even after multiple iterations of the algorithm. One could easily use this fact to identify link farms and remove such sets of pages from our whole set. But the problem occurs when a page which is not a part of the link farm and is actually relevant, it happens to have the same or nearly same PageRank as that of the link farm pages. So if we consider pages with same page ranks for elimination, some relevant pages might also get discarded unintentionally. To solve this problem, we introduce a new factor we call Gap Rank. GapRank is based on PageRank and in a way inverse of it. While PageRank is based on inbound links of a page, GapRank is calculated using outbound links of pages. Gaprank Formula GR(A) = (1-d)/N + d (GR(T1)/L(T1) + ... + GR(Tn)/L(Tn)) where A is the page under consideration, T1,T2….Tn the set of pages that link from A i.e. outbound pages,L(Tn) is the number of inbound links on page Tn, and N is the total number of pages. The purpose of this formula is to rank the set of pages based on the number of outbound links. We then use the GapRank to identify the Link Farm. Since all the pages in a link farm have same number of inbound links and outbound links, the GapRank of these pages will be equal, just like their PageRanks. This can now be used to differentiate a link farm from a page having almost same PageRank as those of link farm pages, but it's GapRank will be different. This analysis can further be used to reduce the ranks of these spam pages or eliminate them altogether so that we get proper ranks for pages that suffered because of the link farm. V. Methodology Consider a given dataset consisting of 11 pages P1, P2,..,P11. The links between the pages can be seen in the FIG 5.1, depicting a graph of nodes for each page and edges for links between them. [FIG 5.1 Graph of initial dataset]
  • 4. Page Rank Link Farm Detection www.ijeijournal.com Page | 58 5.1 Calculating Pagerank Initial PageRank to be given to all pages will be 1/n, where n is the size of dataset, i.e., 1/11. Set the initial damping factor. In our example we take damping factor as 0.85. Calculate the PageRank values of all the pages according to the formula : PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) where PR represents the PageRank of the represented page, C represents the total number of out bound links of the page and d represents the damping factor. For the given dataset, we obtain the following PageRank values after 30 iterations (Table 5.1.1): Page Name PageRank P1 0.0557838 P2 0.0426136 P3 0.0426136 P4 0.0426136 P5 0.0426136 P6 0.0426136 P7 0.0194318 P8 0.0359489 P9 0.0136364 P10 0.0136364 P11 0.049858 [Table 5.1.1 PageRank values of all pages in the dataset.] We consider the PageRank values obtained after 30 iterations of PageRank calculation for all pages as final because variations in the values after this will be negligible. Negligible changes reflect final values allotted to each page. 5.2 Searching For Duplicate Pagerank Values Scan the result for all the duplicate PageRank values using the best scanning technique available. The pages with duplicate values are shown in Table 5.2.1. These pages may or may not belong to the spam group (link farm) as some pages may have the same value as a spam page just by mere coincidence. To eliminate these pages from the suspected pages, we perform the next step. Page Name PageRank P2 0.0426136 P3 0.0426136 P4 0.0426136 P5 0.0426136 P6 0.0426136 [Table 5.2.1 Pages having duplicate values] 5.3 Calculating Gaprank Values Gap Rank values for the selected pages are calculated in the same way as PageRank values, by using the formula shown in fig 4.1. The Gap Rank values obtained are shown in the Table 5.3.1 Page Name GapRank P2 0.106365 P3 0.106365 P4 0.106365 P5 0.106365 P6 0.106365 [Table 5.2.1 GapRank values of pages having duplicate PageRank values ] The pages having the same Gap Rank and PageRank values are identified as spam pages which constitute the link farm.
  • 5. Page Rank Link Farm Detection www.ijeijournal.com Page | 59 5.4 Remove Link Farms From The Dataset The pages constituting the link farm are removed from the data set and a new dataset is obtained without the spam pages. The PageRank values of the pages in the new dataset are calculated from the beginning with new initial PageRank values for all pages determined by 1/n’, where n’ is the new dataset size. The new PageRank values are shown in the table 5.4.1 given below. Page Name PageRank P1 0.122724 P7 0.04275 P8 0.0790875 P9 0.03 P10 0.03 [Table5.4.1 New PageRank Values for non-spam pages] REFERENCES [1]. Sergey Brin and Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine [2]. Richard D. Alba: A graph-theoretic definition of a Sociometric Clique [3]. Baoning Wu and Brian D. Davison:Identifying Link Farm Spam Pages [4]. Z. Gy¨ongyi and H. Garcia-Molina. Web spamtaxonomy. In Proceedings of the 1st internationalworkshop on Adversarial information retrieval on theWeb, 2005. [5]. D. Fetterly, M. Manasse and M. Najork. Spam, damnspam, and statistics: using statistical analysis tolocate spam Web pages. In Proceedings of the 7th International Workshop on the Web and Databases,2004. [6]. A. Carvalho, P. Chirita, E. Moura and P. Calado. Sitelevel noise removal for search engines. In Proceedingsof the 15th international conference on World WideWeb. 2006 [7]. A. A. Bencz´ur, K Csalog´any, T Sarl´os and M. Uher.SpamRank-fully automatic link spam detection. InProceedings of the 1st international workshop onAdversarial information retrieval on the Web, 2005 [8]. L. Becchetti, C. Castillo, D. Donato, S. Leonardi andR. Baeza-Yates. Link-based characterization anddetection of Web spam. In Proceedings of the 2nd international workshop on Adversarial informationretrieval on the Web, 2006. [9]. H. Saito, M. Toyoda, M. Kitsuregawa and K. Aihara: A large-scale study of link spam detection by graphalgorithms In Proceedings of the 3rd internationalworkshop on Adversarial information retrieval on theWeb, 2007.