This document discusses techniques for detecting link farms, which are groups of web pages that link to each other to artificially boost their PageRank scores. It provides background on PageRank and how link farms can manipulate it. The proposed method calculates both PageRank and a new "GapRank" score for pages, and identifies pages as part of a link farm if they have identical PageRank and GapRank values. The method is demonstrated on a sample dataset, where pages with duplicate PageRank scores are found and shown to also have identical GapRank, identifying them as a link farm that is then removed from the dataset. This improves the PageRank algorithm's ability to rank pages accurately.
International Journal of Engineering Inventions
e-ISSN: 2278-7461, p-ISSN: 2319-6491
Volume 4, Issue 1 (July 2014) PP: 55-59
www.ijeijournal.com Page | 55
Page Rank Link Farm Detection
Akshay Saxena, Rohit Nigam
Department of Information and Communication Technology (ICT), Manipal University, Manipal - 576014, Karnataka, India
Abstract: The PageRank algorithm is an important algorithm used to determine the quality of a page on the web. With search engines attaining a high position in guiding traffic on the internet, PageRank is an important factor in determining its flow. Since link analysis is used in search engines' ranking systems, spammers create link-based spam structures known as link farms to generate a high PageRank for their pages and, in turn, for a target page. In this paper, we suggest a method through which these structures can be detected and the overall ranking results thereby improved.
I. Introduction
1.1 Page Rank
We will be working primarily on the PageRank[1] algorithm developed by Larry Page of Google. This algorithm determines the importance of a website by counting the number and quality of links to a page, assuming that an important website will have more links pointing to it from other websites. This technique is used in Google's search process to rank the websites in a search result. It is not the only algorithm used by Google to order search engine results, but it was the first algorithm used by the company, and it is the best known. The counting and indexing of web pages is done through various programs such as web crawlers[1] and other bots. By assigning a numerical weight to each link, the PageRank algorithm is applied repeatedly to obtain a more accurate result and determine the relative rank of a web page within a set of pages.

The set of web pages being considered is thought of as a directed graph, with each node representing a page and each edge a link between pages. We determine the number of links pointing to a page by traversing the graph and use those values in the PageRank formula.

PageRank is quite prone to manipulation. Our goal is to find an effective way to ignore links from pages that are trying to falsify a PageRank. These spamming pages usually occur as a group called a link farm, described below.
1.2 Complete Graph
In the mathematical field of graph theory, a complete graph[2] is a simple undirected graph in which every pair of distinct vertices is connected by a unique edge. A complete digraph is a directed graph in which every pair of distinct vertices is connected by a pair of unique edges (one in each direction).
1.3 Clique
In the mathematical area of graph theory, a clique[2] in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge. Cliques are one of the basic concepts of graph theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been studied in computer science: the task of deciding whether there is a clique of a given size in a graph (the clique problem) is NP-complete, but despite this hardness result many algorithms for finding cliques have been studied.
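For illustration, checking whether a given vertex subset is a clique is straightforward; the small graph below is ours, not part of the paper. (Testing a known subset is cheap; the NP-complete part is finding a clique of a given size.)

```python
from itertools import combinations

def is_clique(edges, vertices):
    """True if every pair of distinct vertices is joined by an edge
    in the undirected graph given as a set of frozenset edges."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

# Hypothetical undirected graph on five vertices.
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]}

print(is_clique(edges, [1, 2, 3]))  # True: edges 1-2, 1-3, 2-3 all present
print(is_clique(edges, [2, 3, 4]))  # False: no edge between 2 and 4
```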
FIG 1.1
1.4 Link Farm
On the World Wide Web, a link farm[3] is any group of web sites that all hyperlink to every other site in the group. In graph-theoretic terms, a link farm is a clique. Although some link farms can be created by hand, most are created through automated programs and services. A link farm is a form of spamming the index of a search engine (sometimes called spamdexing). Other link exchange systems are designed to allow individual websites to selectively exchange links with other relevant websites and are not considered a form of spamdexing.
FIG 1.2 Normal Graph of Web Pages FIG 1.3 Link Farm of 1, 2, 3 & 4
II. Existing System
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows.

We assume page A has pages T1...Tn which point to it (i.e., its inbound links). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. C(A) is defined as the number of links going out of page A, and N is the total number of pages. The PageRank of page A is then given by:

PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRank values form a probability distribution over web pages, so the sum of all web pages' PageRank values will be one.

PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. A PageRank for 26 million web pages can be computed in a few hours on a medium-sized workstation.
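The iterative scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and the four-page graph is invented for the example.

```python
def pagerank(graph, d=0.85, iterations=30):
    """Iterative PageRank: PR(A) = (1-d)/N + d * sum(PR(T)/C(T)) over
    pages T linking to A, where C(T) is T's out-link count."""
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}  # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {page: (1 - d) / n for page in graph}
        for page, out_links in graph.items():
            for target in out_links:
                new_pr[target] += d * pr[page] / len(out_links)
        pr = new_pr
    return pr

# Hypothetical four-page web; every page has at least one out-link.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))      # "A", fed by full shares from C and D
print(round(sum(ranks.values()), 6))  # 1.0 -- the scores form a distribution
```

Because no page here is a dead end, the total score stays at one across iterations, matching the probability-distribution property noted above.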
III. Previous Work
Link spamming is one of the web spamming techniques that try to mislead link-based ranking algorithms such as PageRank and HITS. Since these algorithms consider a link to a page as an endorsement of that page, spammers create numerous false links and construct an artificially interlinked structure, a so-called spam farm, to centralize link-based importance on their own spam pages.

To understand web spamming, Gyöngyi et al.[4] described various web spamming techniques. Optimal link structures to boost PageRank scores have also been studied to grasp the behavior of web spammers. Fetterly[5] found that outliers in statistical distributions are very likely to be spam, by analyzing statistical properties of linkage, URLs, host resolutions and page contents. To demote link spam, Gyöngyi et al. introduced TrustRank[4], a biased PageRank in which rank scores start to be propagated from a seed set of good pages through outgoing links. By this, we can expect spam pages to get low ranks. Optimizing the link structure is another approach to demoting link spam.

Carvalho[6] proposed the idea of noisy links, a link structure that has a negative impact on link-based ranking algorithms. Qi et al. also estimated the quality of links by the similarity of two pages. To detect link spam, Benczúr introduced SpamRank[7]. SpamRank checks the PageRank score distributions of all in-neighbors of a target page; if this distribution is abnormal, SpamRank regards the target page as spam and penalizes it. Becchetti[8] employed link-based features for link spam detection. They built a link spam classifier with several features of the link structure, such as degrees, link-based ranking scores, and characteristics of out-neighbors. Saito[9] employed a graph algorithm to detect link spam. They decomposed the Web graph into strongly connected components and discovered that large components are spam with high probability. Link farms in the core were extracted by maximal clique enumeration.
IV. Proposed System
The PageRank algorithm, as described in the previous section, works upon a given set of pages found via search using web spiders or other programs. To rank these pages, we consider the links of each page and apply the PageRank algorithm. However, the PageRank algorithm can be easily fooled by link farms. Link farms work as follows: we know that PageRank is higher for pages with more links pointing to them, i.e., the more in-links, the higher the rank. A link farm forms a complete graph. This means we have a large set of pages pointing to each other, increasing the number of in-links and out-links even though they are not relevant, and hence falsifying the PageRank values obtained. On applying the algorithm to such a graph, an interesting fact comes to notice: the PageRank values of all the pages in a link farm are the same, and remain so even after multiple iterations of the algorithm. One could easily use this fact to identify link farms and remove such sets of pages from the whole set. But a problem occurs when a page which is not part of the link farm, and is actually relevant, happens to have the same or nearly the same PageRank as the link farm pages. If we simply considered every page with a duplicated PageRank for elimination, some relevant pages might be discarded unintentionally.

To solve this problem, we introduce a new factor we call GapRank. GapRank is based on PageRank and is, in a way, its inverse: while PageRank is based on the inbound links of a page, GapRank is calculated using the outbound links of pages.
Gaprank Formula
GR(A) = (1-d)/N + d (GR(T1)/L(T1) + ... + GR(Tn)/L(Tn))
where A is the page under consideration, T1,T2….Tn the set of pages that link from A i.e. outbound
pages,L(Tn) is the number of inbound links on page Tn, and N is the total number of pages.
The purpose of this formula is to rank the set of pages based on their outbound links. We then use GapRank to identify the link farm. Since all the pages in a link farm have the same number of inbound links and outbound links, their GapRanks will be equal, just like their PageRanks. This can now be used to distinguish a link farm page from a page that merely has almost the same PageRank as the farm pages but whose GapRank is different.
This analysis can further be used to reduce the ranks of these spam pages, or to eliminate them altogether, so that pages whose ranks suffered because of the link farm receive proper ranks.
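As a sketch, the GapRank iteration defined by the formula above can be written as follows (the graph is a dict mapping each page to its outbound links; the page names and function name are illustrative):

```python
def gaprank(links, d=0.85, iterations=30):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    # L(T): number of inbound links of each page T.
    inbound = {p: sum(p in links[q] for q in pages) for p in pages}
    gr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # GR(A) = (1-d)/N + d * sum over A's outbound pages T of GR(T)/L(T)
        gr = {p: (1 - d) / n + d * sum(gr[t] / inbound[t] for t in links[p])
              for p in pages}
    return gr

# A three-page complete-graph farm: identical GapRanks by symmetry.
farm = {"X": ["Y", "Z"], "Y": ["X", "Z"], "Z": ["X", "Y"]}
ranks = gaprank(farm)
```

The structure mirrors the PageRank update, but the sum runs over the pages a page links *to*, divided by their inbound-link counts, exactly inverting the roles of in-links and out-links.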
V. Methodology
Consider a given dataset consisting of 11 pages P1, P2, ..., P11. The links between the pages are shown in FIG 5.1, which depicts a graph with a node for each page and an edge for each link between them.
[FIG 5.1 Graph of initial dataset]
4. Page Rank Link Farm Detection
www.ijeijournal.com Page | 58
5.1 Calculating PageRank
The initial PageRank given to all pages is 1/n, where n is the size of the dataset, i.e., 1/11. Set the initial damping factor; in our example we take the damping factor as 0.85.
Calculate the PageRank values of all the pages according to the formula:
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR(A) is the PageRank of page A, T1, ..., Tn are the pages that link to A, C(Tn) is the number of outbound links of page Tn, d is the damping factor, and N is the total number of pages.
For the given dataset, we obtain the following PageRank values after 30 iterations (Table 5.1.1):
Page Name PageRank
P1 0.0557838
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
P7 0.0194318
P8 0.0359489
P9 0.0136364
P10 0.0136364
P11 0.049858
[Table 5.1.1 PageRank values of all pages in the dataset.]
We take the PageRank values obtained after 30 iterations as final, because variations in the values beyond this point are negligible.
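A fixed count of 30 iterations is one way to decide the values are final; another common choice is to iterate until the largest per-page change drops below a small tolerance. A minimal sketch of the latter (the function name and tolerance are ours, not from the paper):

```python
def pagerank_converged(links, d=0.85, tol=1e-9, max_iter=100):
    """Iterate the PageRank update until the largest change is below tol."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {}
        for p in pages:
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * inbound
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr

# Demo on a hypothetical four-page complete-graph farm.
farm = {p: [q for q in "ABCD" if q != p] for p in "ABCD"}
final = pagerank_converged(farm)
```

Either stopping rule yields the same values up to the tolerance; the convergence test simply makes "negligible variation" explicit.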
5.2 Searching For Duplicate PageRank Values
Scan the results for duplicate PageRank values (e.g., by sorting the values and comparing adjacent entries). The pages with duplicate values are shown in Table 5.2.1. These pages may or may not belong to the spam group (link farm), since a page may share a value with a spam page by mere coincidence. To eliminate such pages from the set of suspects, we perform the next step.
Page Name PageRank
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
[Table 5.2.1 Pages having duplicate values]
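The scan can be sketched by grouping pages on their PageRank values from Table 5.1.1 and keeping the groups with more than one member (in practice one might round to a fixed precision first; here the duplicates are exact):

```python
from collections import defaultdict

# PageRank values from Table 5.1.1.
pagerank_values = {
    "P1": 0.0557838, "P2": 0.0426136, "P3": 0.0426136, "P4": 0.0426136,
    "P5": 0.0426136, "P6": 0.0426136, "P7": 0.0194318, "P8": 0.0359489,
    "P9": 0.0136364, "P10": 0.0136364, "P11": 0.049858,
}

groups = defaultdict(list)
for page, pr in pagerank_values.items():
    groups[pr].append(page)

# Keep only values shared by more than one page.
duplicates = {v: ps for v, ps in groups.items() if len(ps) > 1}
```

Note that in this dataset P9 and P10 also share a value (0.0136364) and would surface in the scan; the suspected group examined in Table 5.2.1 is P2–P6.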
5.3 Calculating GapRank Values
GapRank values for the selected pages are calculated in the same way as the PageRank values, using the GapRank formula given above. The GapRank values obtained are shown in Table 5.3.1.
Page Name GapRank
P2 0.106365
P3 0.106365
P4 0.106365
P5 0.106365
P6 0.106365
[Table 5.3.1 GapRank values of pages having duplicate PageRank values]
The pages having the same GapRank and PageRank values are identified as spam pages, which constitute the link farm.
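The decision step can be sketched as follows: among the suspects, flag only those pages whose PageRank *and* GapRank both match another suspect's values. The PageRank and GapRank numbers below come from Tables 5.2.1 and 5.3.1; "P12" is a hypothetical innocent page given a coincidentally equal PageRank but a different (made-up) GapRank:

```python
def find_link_farm(pr, gr, suspects):
    """Return the suspects sharing both PageRank and GapRank with another page."""
    return [p for p in suspects
            if any(q != p and pr[q] == pr[p] and gr[q] == gr[p]
                   for q in suspects)]

pr = {p: 0.0426136 for p in ["P2", "P3", "P4", "P5", "P6", "P12"]}
gr = {"P2": 0.106365, "P3": 0.106365, "P4": 0.106365,
      "P5": 0.106365, "P6": 0.106365, "P12": 0.091407}

farm = find_link_farm(pr, gr, list(pr))
# farm -> ["P2", "P3", "P4", "P5", "P6"]; P12 is cleared by its distinct GapRank.
```

P12 illustrates the point of Section IV: a coincidental PageRank match alone is not enough to be flagged.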
5.4 Removing Link Farms From The Dataset
The pages constituting the link farm are removed from the dataset, yielding a new dataset without the spam pages. The PageRank values of the pages in the new dataset are recalculated from the beginning, with new initial PageRank values of 1/n' for all pages, where n' is the new dataset size. The new PageRank values are shown in Table 5.4.1 below.
Page Name PageRank
P1 0.122724
P7 0.04275
P8 0.0790875
P9 0.03
P10 0.03
[Table5.4.1 New PageRank Values for non-spam pages]
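The removal step can be sketched as pruning the farm pages and every link pointing to them; PageRank is then recomputed on the reduced graph exactly as in Section 5.1, with initial value 1/n'. The mini-graph below is hypothetical, not the full FIG 5.1 dataset:

```python
def remove_pages(links, spam):
    """Drop spam pages from the graph and prune any links pointing to them."""
    return {p: [t for t in outs if t not in spam]
            for p, outs in links.items() if p not in spam}

# Hypothetical mini-graph; P2 and P3 play the role of the detected farm.
graph = {"P1": ["P2", "P7"], "P2": ["P3"], "P3": ["P2"], "P7": ["P1"]}
clean = remove_pages(graph, {"P2", "P3"})
# clean -> {"P1": ["P7"], "P7": ["P1"]}; n' = 2, so the new initial PageRank is 1/2.
```

Rerunning the Section 5.1 iteration on the pruned graph then yields ranks unaffected by the farm, as in Table 5.4.1.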