1. The document discusses techniques for detecting link-based spam using rank propagation and probabilistic counting.
2. It aims to characterize spam pages and analyze the distribution of statistics about pages to identify outliers associated with spam.
3. The key techniques involve counting the number of "supporters" or pages linking to a target page at different distances from that page.
(July 22, 2010) Presentation from the third day of the Social Media Boot Camp at US Pacific Command, Camp HM Smith, Oahu [cc] http://www.socialmediabootcamp.com
Social Media and New Media Workshop (FSI) PY363 - Day 3Eric Schwartzman
The first day of the Social Media and New Media Workshop, which I developed for the US Dept. of State's Foreign Service Institute, and taught on July 15, 2009.
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLayYard
Scott is one of the UK’s most experienced technical SEOs and adds a strong enterprise level technological focus to Yard’s work. Having worked on technical audits, internationalisation projects and migration support for hundreds of websites, including global household brand names, throughout his 12 year technical SEO career, Scott has industry insights, competitive intelligence and technical solutions readily to hand for Yard clients and he will be sharing some of this knowledge in his presentation on link health audits and organic traffic recovery.
(July 22, 2010) Presentation from the third day of the Social Media Boot Camp at US Pacific Command, Camp HM Smith, Oahu [cc] http://www.socialmediabootcamp.com
Social Media and New Media Workshop (FSI) PY363 - Day 3Eric Schwartzman
The first day of the Social Media and New Media Workshop, which I developed for the US Dept. of State's Foreign Service Institute, and taught on July 15, 2009.
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLayYard
Scott is one of the UK’s most experienced technical SEOs and adds a strong enterprise level technological focus to Yard’s work. Having worked on technical audits, internationalisation projects and migration support for hundreds of websites, including global household brand names, throughout his 12 year technical SEO career, Scott has industry insights, competitive intelligence and technical solutions readily to hand for Yard clients and he will be sharing some of this knowledge in his presentation on link health audits and organic traffic recovery.
Risk Beyond Acquiring: Merchant Risk Across FinTechGeo Coelho
In our Vendor Spotlight at MAC 2016, Jose Caldera, IdentityMind VP of Marketing & Product, presented a collection of use cases and examples of how IdentityMind clients are applying our platform for merchant risk, fraud prevention, anti-money laundering, and terrorist financing prevention.
The session provided a great introduction to the benefits of our platform, and we've provided a synopsis of it here for those who missed the session, or were curious for more information.
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...Pedram Hayati
In recent years, online spam has become a major problem for
the sustainability of the Internet. Excessive amounts of spam
are not only reducing the quality of information available on
the Internet but also creating concern amongst search engines
and web users. This paper aims to analyse existing works in
two different categories of spam domains - email spam and
image spam to gain a deeper understanding of this problem.
Future research directions are also presented in these spam
domains.
More info: http://debii.curtin.edu.au/~pedram/research/publications/76-evaluation-of-spam-detection-and-prevention-frameworks-for-email-and-image-spam-a-state-of-art.html
From the eCommerce Summit in Atlanta June 3-4, 2009 where Mountain Media explains the topic of PC Compliance for online merchants. Visit http://www.ecmta.org to find out more.
Onboarding has never been more important than it is today. Why? It’s an opportunity for meaningful face-to-face engagement in a digital world where customer relationships are more and more remote. 72% of people will use online banking by 2017 and 43% will use mobile banking (Forrester Research, Trends 2014: North America Digital Banking).
Onboarding is the perfect opportunity to establish the emotional connection that is critical for increasing future sales. Emotional engagement represents a 23% premium for share of wallet compared to average (Gallup Consulting).
For onboarding to set up future sales it needs to establish you as the resource that can provide the help customers need as they face the many financial decisions in front of them. It needs to set you up as the one who can give them the confidence to make decisions that could help them save money and avoid mistakes.
Finacle Digital Commerce solution leverages the power of digital money to unlock new revenue streams, extend distribution reach, and foster customer loyalty for financial institutions, telecom service providers, and retailers.
The complexity of the merchant onboarding and KYC compliance processes for card acquirers is quite staggering. Read why Card Acquirers Need Rapid Integration and Operationalized Risk Models to Meet Merchant Demand for Fast Onboarding
Hardware-software complex to manage accounts, e-wallets, payment and loyalty cards in single application.
A mobile platform turning a smartphone into a powerful payment and loyalty tool.
- All-in-one,
- Simple authentication & authorization,
- P2P transfers,
- Pay by QR code,
- Pay by NFC,
- Pay by cards linked to an account,
- Mobile acquiring,
- Invoices,
- Loans,
- E-policies,
- Consolidation of loyalty programs,
- Discounts and promotions,
- Ticketing.
www.mwallet.pro
www.m-processing.com
Risk Beyond Acquiring: Merchant Risk Across FinTechGeo Coelho
In our Vendor Spotlight at MAC 2016, Jose Caldera, IdentityMind VP of Marketing & Product, presented a collection of use cases and examples of how IdentityMind clients are applying our platform for merchant risk, fraud prevention, anti-money laundering, and terrorist financing prevention.
The session provided a great introduction to the benefits of our platform, and we've provided a synopsis of it here for those who missed the session, or were curious for more information.
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...Pedram Hayati
In recent years, online spam has become a major problem for
the sustainability of the Internet. Excessive amounts of spam
are not only reducing the quality of information available on
the Internet but also creating concern amongst search engines
and web users. This paper aims to analyse existing works in
two different categories of spam domains - email spam and
image spam to gain a deeper understanding of this problem.
Future research directions are also presented in these spam
domains.
More info: http://debii.curtin.edu.au/~pedram/research/publications/76-evaluation-of-spam-detection-and-prevention-frameworks-for-email-and-image-spam-a-state-of-art.html
From the eCommerce Summit in Atlanta June 3-4, 2009 where Mountain Media explains the topic of PC Compliance for online merchants. Visit http://www.ecmta.org to find out more.
Onboarding has never been more important than it is today. Why? It’s an opportunity for meaningful face-to-face engagement in a digital world where customer relationships are more and more remote. 72% of people will use online banking by 2017 and 43% will use mobile banking (Forrester Research, Trends 2014: North America Digital Banking).
Onboarding is the perfect opportunity to establish the emotional connection that is critical for increasing future sales. Emotional engagement represents a 23% premium for share of wallet compared to average (Gallup Consulting).
For onboarding to set up future sales it needs to establish you as the resource that can provide the help customers need as they face the many financial decisions in front of them. It needs to set you up as the one who can give them the confidence to make decisions that could help them save money and avoid mistakes.
Finacle Digital Commerce solution leverages the power of digital money to unlock new revenue streams, extend distribution reach, and foster customer loyalty for financial institutions, telecom service providers, and retailers.
The complexity of the merchant onboarding and KYC compliance processes for card acquirers is quite staggering. Read why Card Acquirers Need Rapid Integration and Operationalized Risk Models to Meet Merchant Demand for Fast Onboarding
Hardware-software complex to manage accounts, e-wallets, payment and loyalty cards in single application.
A mobile platform turning a smartphone into a powerful payment and loyalty tool.
- All-in-one,
- Simple authentication & authorization,
- P2P transfers,
- Pay by QR code,
- Pay by NFC,
- Pay by cards linked to an account,
- Mobile acquiring,
- Invoices,
- Loans,
- E-policies,
- Consolidation of loyalty programs,
- Discounts and promotions,
- Ticketing.
www.mwallet.pro
www.m-processing.com
Keynote at the Dutch-Belgian Information Retrieval Workshop, November 2016, Delft, Netherlands.
Based on KDD 2016 tutorial with Sara Hajian and Francesco Bonchi.
KDD 2016 tutorial on Algorithmic Bias, Parts I and II.
Video:
Part I: https://www.youtube.com/watch?v=mJcWrfoGup8
Part II: https://www.youtube.com/watch?v=nKemhMbaYcU
Part III: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
KDD 2016 tutorial on Algorithmic Bias, Parts III and IV.
Video: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
Various examples of observational studies, mostly fo the analysis of social media.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Basic concepts about natural experiments, based mostly on Dunning's book.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Predictions of links in graphs based on content and information propagations.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Free Complete Python - A step towards Data Science
Using Rank Propagation for Spam Detection (WebKDD 2006)
1. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
Using rank propagation and Probabilistic
L. Becchetti,
C. Castillo,
D. Donato,
counting for Link-Based Spam Detection
S. Leonardi and
R. Baeza-Yates
Motivation
Luca Becchetti1 , Carlos Castillo1 ,Debora Donato1 ,
Spam pages
characterization
Stefano Leonardi1 and Ricardo Baeza-Yates2
Truncated
PageRank
1. Universit` di Roma “La Sapienza” – Rome, Italy
a
Counting
2. Yahoo! Research – Barcelona, Spain and Santiago, Chile
supporters
Experiments
August 20th, 2006
Conclusions
2. Using rank
propagation and
Probabilistic
counting for
Motivation
Link-Based 1
Spam Detection
L. Becchetti,
C. Castillo,
Spam pages characterization
2
D. Donato,
S. Leonardi and
R. Baeza-Yates
Truncated PageRank
3
Motivation
Spam pages
characterization
Counting supporters
4
Truncated
PageRank
Counting
supporters
Experiments
5
Experiments
Conclusions
Conclusions
6
3. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
4. What is on the Web?
Using rank
Information
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
5. What is on the Web?
Using rank
Information + Porn
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
6. What is on the Web?
Using rank
Information + Porn + On-line casinos + Free movies +
propagation and
Probabilistic
Cheap software + Buy a MBA diploma + Prescription -free
counting for
Link-Based
drugs + V!-4-gra + Get rich now now now!!!
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
Graphic: www.milliondollarhomepage.com
7. Web spam (keywords + links)
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
8. Web spam (mostly keywords)
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
9. Search engine?
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
10. Fake search engine
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
11. Problem: “normal” pages that are spam
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
12. Problem: “normal” pages that are spam
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
Some content is introduced
13. Problem: borderline pages
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
14. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
15. Definitions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
16. Definitions
Using rank
propagation and
Probabilistic
counting for
Any deliberate action that is meant to trigger
Link-Based
Spam Detection
an unjustifiably favorable relevance or importance
L. Becchetti,
for some Web page, considering the page’s true value
C. Castillo,
D. Donato,
[Gy¨ngyi and Garcia-Molina, 2005]
o
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
17. Definitions
Using rank
propagation and
Probabilistic
counting for
Any deliberate action that is meant to trigger
Link-Based
Spam Detection
an unjustifiably favorable relevance or importance
L. Becchetti,
for some Web page, considering the page’s true value
C. Castillo,
D. Donato,
[Gy¨ngyi and Garcia-Molina, 2005]
o
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
any attempt to deceive a search engine’s relevancy algorithm
Truncated
PageRank
or simply
Counting
supporters
Experiments
anything that would not be done
Conclusions
if search engines did not exist.
[Perkins, 2001]
18. Link farms
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
19. Link farms
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Single-level farms can be detected by searching groups of
Conclusions
nodes sharing their out-links [Gibson et al., 2005]
20. Link−based spam
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
[Fetterly et al., 2004] hypothesized that studying the
L. Becchetti,
C. Castillo,
distribution of statistics about pages could be a good way of
D. Donato,
detecting spam pages:
S. Leonardi and
R. Baeza-Yates
Motivation
“in a number of these distributions, outlier values are
Spam pages
associated with web spam”
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
21. Link−based spam
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
[Fetterly et al., 2004] hypothesized that studying the
L. Becchetti,
C. Castillo,
distribution of statistics about pages could be a good way of
D. Donato,
detecting spam pages:
S. Leonardi and
R. Baeza-Yates
Motivation
“in a number of these distributions, outlier values are
Spam pages
associated with web spam”
characterization
Truncated
PageRank
Research goal
Counting
supporters
Statistical analysis of link-based spam
Experiments
Conclusions
22. Idea: count “supporters” at different distances
Using rank
propagation and
Probabilistic
counting for
Link-Based
Number of different nodes at a given distance:
Spam Detection
L. Becchetti,
C. Castillo,
.UK 18 mill. nodes .EU.INT 860,000 nodes
D. Donato,
S. Leonardi and
0.3 0.3
R. Baeza-Yates
Motivation
0.2 0.2
Frequency
Frequency
Spam pages
characterization
0.1 0.1
Truncated
PageRank
0.0 0.0
5 10 15 20 25 30 5 10 15 20 25 30
Counting
supporters Distance Distance
Experiments
Average distance Average distance
Conclusions
14.9 clicks 10.0 clicks
23. High and low-ranked pages are different
Using rank
4
propagation and
x 10
Probabilistic
Top 0%−10%
counting for
12
Link-Based
Top 40%−50%
Spam Detection
Top 60%−70%
L. Becchetti,
10
Number of Nodes
C. Castillo,
D. Donato,
S. Leonardi and
8
R. Baeza-Yates
Motivation
6
Spam pages
characterization
4
Truncated
PageRank
2
Counting
supporters
Experiments
0
1 5 10 15 20
Conclusions
Distance
24. High and low-ranked pages are different
Using rank
4
propagation and
x 10
Probabilistic
Top 0%−10%
counting for
12
Link-Based
Top 40%−50%
Spam Detection
Top 60%−70%
L. Becchetti,
10
Number of Nodes
C. Castillo,
D. Donato,
S. Leonardi and
8
R. Baeza-Yates
Motivation
6
Spam pages
characterization
4
Truncated
PageRank
2
Counting
supporters
Experiments
0
1 5 10 15 20
Conclusions
Distance
Areas below the curves are equal if we are in the same
strongly-connected component
25. Metrics
Using rank
Graph algorithms
propagation and
Probabilistic
counting for
All shortest paths, centrality, betweenness, clustering coefficient...
Link-Based
Spam Detection
Streamed algorithms
L. Becchetti,
Breadth-first and depth-first search
C. Castillo,
D. Donato,
Count of neighbors
S. Leonardi and
R. Baeza-Yates
Symmetric algorithms
Motivation
(Strongly) connected components
Spam pages
Approximate count of neighbors
characterization
Truncated
PageRank, Truncated PageRank, Linear Rank
PageRank
HITS, Salsa, TrustRank
Counting
supporters
Experiments
Conclusions
26. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
27. General functional ranking
Using rank
propagation and
Probabilistic
Let P the row-normalized version of the citation matrix of a
counting for
Link-Based
graph G = (V , E )
Spam Detection
A functional ranking [Baeza-Yates et al., 2006] is a
L. Becchetti,
C. Castillo,
link-based ranking algorithm to compute a scoring vector W
D. Donato,
S. Leonardi and
of the form:
R. Baeza-Yates
∞
damping(t) t
W= P.
Motivation
N
t=0
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
28. General functional ranking
Using rank
propagation and
Probabilistic
Let P the row-normalized version of the citation matrix of a
counting for
Link-Based
graph G = (V , E )
Spam Detection
A functional ranking [Baeza-Yates et al., 2006] is a
L. Becchetti,
C. Castillo,
link-based ranking algorithm to compute a scoring vector W
D. Donato,
S. Leonardi and
of the form:
R. Baeza-Yates
∞
damping(t) t
W= P.
Motivation
N
t=0
Spam pages
characterization
Truncated
PageRank
There are many choices for damping(t), including simply a
Counting
linear function that is as good as PageRank in practice
supporters
Experiments
Conclusions
29. General functional ranking
Using rank
propagation and
Probabilistic
Let P the row-normalized version of the citation matrix of a
counting for
Link-Based
graph G = (V , E )
Spam Detection
A functional ranking [Baeza-Yates et al., 2006] is a
L. Becchetti,
C. Castillo,
link-based ranking algorithm to compute a scoring vector W
D. Donato,
S. Leonardi and
of the form:
R. Baeza-Yates
∞
damping(t) t
W= P.
Motivation
N
t=0
Spam pages
characterization
Truncated
PageRank
There are many choices for damping(t), including simply a
Counting
linear function that is as good as PageRank in practice
supporters
Experiments
Conclusions
damping(t) = (1 − α)αt
30. Truncated PageRank
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
Reduce the direct contribution of the first levels of links:
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
t≤T
0
damping(t) =
Counting
C αt
supporters
t>T
Experiments
Conclusions
31. Truncated PageRank
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
Reduce the direct contribution of the first levels of links:
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
t≤T
0
damping(t) =
Counting
C αt
supporters
t>T
Experiments
V No extra reading of the graph after PageRank
Conclusions
32. General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for
Using rank
propagation and truncation
Probabilistic
1: for i : 1 . . . N do {Initialization}
counting for
2: R[i] ← (1 − α)/((αT +1 )N)
Link-Based
Spam Detection
3: if T≥ 0 then
4: Score[i] ← 0
L. Becchetti,
5: else {Calculate normal PageRank}
C. Castillo,
D. Donato,
6: Score[i] ← R[i]
S. Leonardi and
7: end if
R. Baeza-Yates
8: end for
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
33. General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for
Using rank
propagation and truncation
Probabilistic
1: for i : 1 . . . N do {Initialization}
counting for
2: R[i] ← (1 − α)/((αT +1 )N)
Link-Based
Spam Detection
3: if T≥ 0 then
4: Score[i] ← 0
L. Becchetti,
5: else {Calculate normal PageRank}
C. Castillo,
D. Donato,
6: Score[i] ← R[i]
S. Leonardi and
7: end if
R. Baeza-Yates
8: end for
9: distance = 1
Motivation
10: while not converged do
Spam pages
11: Aux ← 0
characterization
12: for src : 1 . . . N do {Follow links in the graph}
Truncated
13: for all link from src to dest do
PageRank
14: Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
Counting
15: end for
supporters
16: end for
Experiments
17: for i : 1 . . . N do {Apply damping factor α}
Conclusions 18: R[i] ← Aux[i] ×α
19: if distance > T then {Add to ranking value}
20: Score[i] ← Score[i] + R[i]
21: end if
22: end for
23: distance = distance +1
24: end while
34. Truncated PageRank vs PageRank
Using rank
propagation and
Probabilistic
counting for −3 −3
10 10
Link-Based
Spam Detection
Truncated PageRank T=1
Truncated PageRank T=4
−4 −4
10 10
L. Becchetti,
−5 −5
C. Castillo, 10 10
D. Donato,
−6 −6
S. Leonardi and 10 10
R. Baeza-Yates
−7 −7
10 10
Motivation −8 −8
10 10
Spam pages −9 −9
10 −8 10 −8
characterization −6 −4 −6 −4
10 10 10 10 10 10
Normal PageRank Normal PageRank
Truncated
PageRank
Comparing PageRank and Truncated PageRank with T = 1
Counting
supporters
and T = 4.
Experiments
The correlation is high and decreases as more levels are
Conclusions
truncated.
35. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
36. Probabilistic counting
Using rank
propagation and
Probabilistic
counting for
1
1
Link-Based
0
0
Spam Detection
0
0
L. Becchetti, 0
0
0 1
1 1
C. Castillo, 1
1
0 0
1 1
D. Donato, 0
0
0
0 0 0
S. Leonardi and Propagation of 0
0 1
1
R. Baeza-Yates bits using the 1
0 1
1
“OR” operation 1
0 1
0
Motivation
1
Target
0 Count bits set
Spam pages
0
page
0
characterization to estimate
0
0 supporters
Truncated 0
0
1
1
PageRank 1
1
0
0 1
1
0
0
Counting
0
0
supporters
1
1
Experiments 0
0
Conclusions
37. Probabilistic counting
Using rank
propagation and
Probabilistic
counting for
1
1
Link-Based
0
0
Spam Detection
0
0
L. Becchetti, 0
0
0 1
1 1
C. Castillo, 1
1
0 0
1 1
D. Donato, 0
0
0
0 0 0
S. Leonardi and Propagation of 0
0 1
1
R. Baeza-Yates bits using the 1
0 1
1
“OR” operation 1
0 1
0
Motivation
1
Target
0 Count bits set
Spam pages
0
page
0
characterization to estimate
0
0 supporters
Truncated 0
0
1
1
PageRank 1
1
0
0 1
1
0
0
Counting
0
0
supporters
1
1
Experiments 0
0
Conclusions
Improvement of ANF algorithm [Palmer et al., 2002] based on
probabilistic counting [Flajolet and Martin, 1985]
38. General algorithm
Using rank
Require: N: number of nodes, d: distance, k: bits
propagation and
Probabilistic
1: for node : 1 . . . N, bit: 1 . . . k do
counting for
Link-Based
INIT(node,bit)
2:
Spam Detection
3: end for
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
39. General algorithm
Using rank
Require: N: number of nodes, d: distance, k: bits
propagation and
Probabilistic
1: for node : 1 . . . N, bit: 1 . . . k do
counting for
Link-Based
INIT(node,bit)
2:
Spam Detection
3: end for
L. Becchetti,
C. Castillo,
4: for distance : 1 . . . d do {Iteration step}
D. Donato,
S. Leonardi and
Aux ← 0k
5:
R. Baeza-Yates
for src : 1 . . . N do {Follow links in the graph}
6:
Motivation
for all links from src to dest do
7:
Spam pages
Aux[dest] ← Aux[dest] OR V[src,·]
characterization 8:
Truncated
end for
9:
PageRank
end for
10:
Counting
supporters
V ← Aux
11:
Experiments
12: end for
Conclusions
40. General algorithm
Using rank
Require: N: number of nodes, d: distance, k: bits
propagation and
Probabilistic
1: for node : 1 . . . N, bit: 1 . . . k do
counting for
Link-Based
INIT(node,bit)
2:
Spam Detection
3: end for
L. Becchetti,
C. Castillo,
4: for distance : 1 . . . d do {Iteration step}
D. Donato,
S. Leonardi and
Aux ← 0k
5:
R. Baeza-Yates
for src : 1 . . . N do {Follow links in the graph}
6:
Motivation
for all links from src to dest do
7:
Spam pages
Aux[dest] ← Aux[dest] OR V[src,·]
characterization 8:
Truncated
end for
9:
PageRank
end for
10:
Counting
supporters
V ← Aux
11:
Experiments
12: end for
Conclusions
13: for node: 1 . . . N do {Estimate supporters}
Supporters[node] ← ESTIMATE( V[node,·] )
14:
15: end for
16: return Supporters
41. Our estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
Initialize all bits to one with probability
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
42. Our estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
Initialize all bits to one with probability
Spam Detection
by the independence of the i − th component Xi ’s we have,
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
43. Our estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
Initialize all bits to one with probability
Spam Detection
by the independence of the i − th component Xi ’s we have,
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation
Spam pages ones(node)
1−
Estimator: neighbors(node) = log(1−
characterization ) k
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
44. Our estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
Initialize all bits to one with probability
Spam Detection
by the independence of the i − th component Xi ’s we have,
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation
Spam pages
Estimator: neighbors(node) = log(1− ) 1 − ones(node)
characterization
k
Truncated
Problem: neighbors(node) can vary by orders of magnitudes
PageRank
as node varies.
Counting
supporters
Experiments
Conclusions
45. Our estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
Initialize all bits to one with probability
Spam Detection
by the independence of the i − th component Xi ’s we have,
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation
Spam pages
Estimator: neighbors(node) = log(1− ) 1 − ones(node)
characterization
k
Truncated
Problem: neighbors(node) can vary by orders of magnitudes
PageRank
as node varies.
Counting
supporters
This means that for some values of , the computed value of
Experiments
ones(node) might be k (or 0, depending on neighbors(node))
Conclusions
with relatively high probability.
46. Adaptive estimator
Using rank
propagation and
Probabilistic
counting for
Link-Based
1
Spam Detection
if we knew neighbors(node) and chose = we
neighbors(node)
L. Becchetti,
would get:
C. Castillo,
D. Donato,
S. Leonardi and
1
R. Baeza-Yates
1−
ones(node) k 0.63k,
e
Motivation
Spam pages
characterization
Adaptive estimation
Truncated
PageRank
Repeat the above process for = 1/2, 1/4, 1/8, . . . , and look
Counting
for the transitions from more than (1 − 1/e)k ones to less
supporters
than (1 − 1/e)k ones.
Experiments
Conclusions
47. Convergence
Using rank
100%
propagation and
Probabilistic
90%
counting for
Link-Based
Spam Detection
80%
Fraction of nodes
L. Becchetti,
70%
with estimates
C. Castillo,
D. Donato,
60%
S. Leonardi and
R. Baeza-Yates
50% d=1
d=2
Motivation
40%
d=3
Spam pages
30% d=4
characterization
d=5
Truncated
20%
d=6
PageRank
d=7
10%
Counting
d=8
supporters
0%
5 10 15 20
Experiments
Iteration
Conclusions
48. Convergence
Using rank
100%
propagation and
Probabilistic
90%
counting for
Link-Based
Spam Detection
80%
Fraction of nodes
L. Becchetti,
70%
with estimates
C. Castillo,
D. Donato,
60%
S. Leonardi and
R. Baeza-Yates
50% d=1
d=2
Motivation
40%
d=3
Spam pages
30% d=4
characterization
d=5
Truncated
20%
d=6
PageRank
d=7
10%
Counting
d=8
supporters
0%
5 10 15 20
Experiments
Iteration
Conclusions
15 iterations for estimating the neighbors at distance 4 or less
49. Convergence
Using rank
100%
propagation and
Probabilistic
90%
counting for
Link-Based
Spam Detection
80%
Fraction of nodes
L. Becchetti,
70%
with estimates
C. Castillo,
D. Donato,
60%
S. Leonardi and
R. Baeza-Yates
50% d=1
d=2
Motivation
40%
d=3
Spam pages
30% d=4
characterization
d=5
Truncated
20%
d=6
PageRank
d=7
10%
Counting
d=8
supporters
0%
5 10 15 20
Experiments
Iteration
Conclusions
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8.
51. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
52. Test collection
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
U.K. collection
L. Becchetti,
C. Castillo,
18.5 million pages downloaded from the .UK domain
D. Donato,
S. Leonardi and
R. Baeza-Yates
5,344 hosts manually classified (6% of the hosts)
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
53. Test collection
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
U.K. collection
L. Becchetti,
C. Castillo,
18.5 million pages downloaded from the .UK domain
D. Donato,
S. Leonardi and
R. Baeza-Yates
5,344 hosts manually classified (6% of the hosts)
Motivation
Spam pages
characterization
Truncated
Classified entire hosts:
PageRank
Counting
V A few hosts are mixed: spam and non-spam pages
supporters
X More coverage: sample covers 32% of the pages
Experiments
Conclusions
54. Automatic classifier
We extracted (for the home page and the page with
Using rank
propagation and
maximum PageRank) PageRank, Truncated PageRank at
Probabilistic
counting for
2 . . . 4, Supporters at 2 . . . 4
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
55. Automatic classifier
We extracted (for the home page and the page with
Using rank
propagation and
maximum PageRank) PageRank, Truncated PageRank at
Probabilistic
counting for
2 . . . 4, Supporters at 2 . . . 4
Link-Based
Spam Detection
We measured:
L. Becchetti,
C. Castillo,
D. Donato,
# of spam hosts classified as spam
S. Leonardi and
Precision =
R. Baeza-Yates
# of hosts classified as spam
Motivation
# of spam hosts classified as spam
Recall = .
Spam pages
# of spam hosts
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
56. Automatic classifier
We extracted (for the home page and the page with
Using rank
propagation and
maximum PageRank) PageRank, Truncated PageRank at
Probabilistic
counting for
2 . . . 4, Supporters at 2 . . . 4
Link-Based
Spam Detection
We measured:
L. Becchetti,
C. Castillo,
D. Donato,
# of spam hosts classified as spam
S. Leonardi and
Precision =
R. Baeza-Yates
# of hosts classified as spam
Motivation
# of spam hosts classified as spam
Recall = .
Spam pages
# of spam hosts
characterization
Truncated
PageRank
and the two types of errors in spam classification
Counting
supporters
Experiments
# of normal hosts classified as spam
Conclusions
False positive rate =
# of normal hosts
# of spam hosts classified as normal
False negative rate = .
# of spam hosts
57. Single-technique classifier
Using rank
propagation and
Probabilistic
counting for
Link-Based
Classifier based on TrustRank: uses as features the PageRank,
Spam Detection
the estimated non-spam mass, and the
L. Becchetti,
C. Castillo,
estimated non-spam mass divided by PageRank.
D. Donato,
S. Leonardi and
Classifier based on Truncated PageRank: uses as features the
R. Baeza-Yates
PageRank, the Truncated PageRank with
Motivation
truncation distance t = 2, 3, 4 (with t = 1 it
Spam pages
characterization
would be just based on in-degree), and the
Truncated
Truncated PageRank divided by PageRank.
PageRank
Classifier based on Estimation of Supporters: uses as features
Counting
supporters
the PageRank, the estimation of supporters at a
Experiments
given distance d = 2, 3, 4, and the estimation of
Conclusions
supporters divided by PageRank.
58. Comparison of single-technique classifier (M=5)
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
Classifiers Spam class False False
L. Becchetti,
C. Castillo,
(pruning with M = 5) Prec. Recall Pos. Neg.
D. Donato,
S. Leonardi and
TrustRank 0.82 0.50 2.1% 50%
R. Baeza-Yates
Trunc. PageRank t = 2 0.85 0.50 1.6% 50%
Motivation
Trunc. PageRank t = 3 0.84 0.47 1.6% 53%
Spam pages
characterization
Trunc. PageRank t = 4 0.79 0.45 2.2% 55%
Truncated
Est. Supporters d = 2 0.78 0.60 3.2% 40%
PageRank
Est. Supporters d = 3 0.83 0.64 2.4% 36%
Counting
supporters
Est. Supporters d = 4 0.86 0.64 2.0% 36%
Experiments
Conclusions
59. Comparison of single-technique classifier (M=30)
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
Classifiers Spam class False False
L. Becchetti,
C. Castillo,
(pruning with M = 30) Prec. Recall Pos. Neg.
D. Donato,
S. Leonardi and
TrustRank 0.80 0.49 2.3% 51%
R. Baeza-Yates
Trunc. PageRank t = 2 0.82 0.43 1.8% 57%
Motivation
Trunc. PageRank t = 3 0.81 0.42 1.8% 58%
Spam pages
characterization
Trunc. PageRank t = 4 0.77 0.43 2.4% 57%
Truncated
Est. Supporters d = 2 0.76 0.52 3.1% 48%
PageRank
Est. Supporters d = 3 0.83 0.57 2.1% 43%
Counting
supporters
Est. Supporters d = 4 0.80 0.57 2.6% 43%
Experiments
Conclusions
60. Combined classifier
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Spam class False False
S. Leonardi and
R. Baeza-Yates
Pruning Rules Precision Recall Pos. Neg.
Motivation
M=5 49 0.87 0.80 2.0% 20%
Spam pages
M=30 31 0.88 0.76 1.8% 24%
characterization
No pruning 189 0.85 0.79 2.6% 21%
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
61. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Motivation
1
S. Leonardi and
Spam pages characterization
2
R. Baeza-Yates
Truncated PageRank
3
Motivation
Counting supporters
4
Spam pages
Experiments
5
characterization
Conclusions
6
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
62. Summary of classifiers
Using rank
(a) Precision and recall of spam detection
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
(b) Error rates of the spam classifiers
Counting
supporters
Experiments
Conclusions
63. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
64. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
65. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
66. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
Motivation
V
Spam pages
Measure both home page and max. PageRank page
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
67. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
Motivation
V
Spam pages
Measure both home page and max. PageRank page
characterization
V
Truncated
Host-based counts are important
PageRank
Counting
supporters
Experiments
Conclusions
68. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
Motivation
V
Spam pages
Measure both home page and max. PageRank page
characterization
V
Truncated
Host-based counts are important
PageRank
Counting
supporters
Experiments
Conclusions
69. Conclusions
Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
V Link-based statistics to detect 80% of spam
D. Donato,
S. Leonardi and
X
R. Baeza-Yates
No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
Motivation
V
Spam pages
Measure both home page and max. PageRank page
characterization
V
Truncated
Host-based counts are important
PageRank
Counting
Next step: combine link analysis and content analysis
supporters
Experiments
Conclusions
70. Using rank
propagation and
Probabilistic
counting for
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Thank you!
Motivation
Spam pages
characterization
Truncated
PageRank
Counting
supporters
Experiments
Conclusions
71. Using rank
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
propagation and
Probabilistic
Generalizing PageRank: Damping functions for link-based
counting for
Link-Based
ranking algorithms.
Spam Detection
In Proceedings of SIGIR, Seattle, Washington, USA. ACM
L. Becchetti,
C. Castillo,
Press.
D. Donato,
S. Leonardi and
R. Baeza-Yates
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and
Baeza-Yates, R. (2006).
Motivation
Using rank propagation and probabilistic counting for
Spam pages
characterization
link-based spam detection.
Truncated
In Proceedings of the Workshop on Web Mining and Web
PageRank
Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Counting
supporters
Experiments
Bencz´r, A. A., Csalog´ny, K., Sarl´s, T., and Uher, M.
u a o
Conclusions (2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on
Adversarial Information Retrieval on the Web, Chiba, Japan.
72. Using rank
propagation and
Probabilistic
counting for
Fetterly, D., Manasse, M., and Najork, M. (2004).
Link-Based
Spam Detection
Spam, damn spam, and statistics: Using statistical analysis to
L. Becchetti,
locate spam web pages.
C. Castillo,
D. Donato,
In Proceedings of the seventh workshop on the Web and
S. Leonardi and
R. Baeza-Yates
databases (WebDB), pages 1–6, Paris, France.
Motivation
Flajolet, P. and Martin, N. G. (1985).
Spam pages
characterization
Probabilistic counting algorithms for data base applications.
Truncated
Journal of Computer and System Sciences, 31(2):182–209.
PageRank
Counting
Gibson, D., Kumar, R., and Tomkins, A. (2005).
supporters
Experiments
Discovering large dense subgraphs in massive graphs.
Conclusions
In VLDB ’05: Proceedings of the 31st international conference
on Very large data bases, pages 721–732. VLDB Endowment.
73. Using rank
propagation and
Probabilistic
Gy¨ngyi, Z. and Garcia-Molina, H. (2005).
o
counting for
Link-Based
Web spam taxonomy.
Spam Detection
L. Becchetti,
In First International Workshop on Adversarial Information
C. Castillo,
Retrieval on the Web.
D. Donato,
S. Leonardi and
R. Baeza-Yates
Gy¨ngyi, Z., Molina, H. G., and Pedersen, J. (2004).
o
Motivation
Combating web spam with trustrank.
Spam pages
In Proceedings of the Thirtieth International Conference on
characterization
Very Large Data Bases (VLDB), pages 576–587, Toronto,
Truncated
PageRank
Canada. Morgan Kaufmann.
Counting
supporters
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Experiments
Random graphs with arbitrary degree distributions and their
Conclusions
applications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
74. Using rank
propagation and
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Probabilistic
counting for
Link-Based
Detecting spam web pages through content analysis.
Spam Detection
In Proceedings of the World Wide Web conference, pages
L. Becchetti,
C. Castillo,
83–92, Edinburgh, Scotland.
D. Donato,
S. Leonardi and
R. Baeza-Yates
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive
Motivation
graphs.
Spam pages
characterization
In Proceedings of the eighth ACM SIGKDD international
Truncated
conference on Knowledge discovery and data mining, pages
PageRank
81–90, New York, NY, USA. ACM Press.
Counting
supporters
Perkins, A. (2001).
Experiments
Conclusions
The classification of search engine spam.
Available online at
http://www.silverdisc.co.uk/articles/spam-classification/.