Winning the Big Data Spam Challenge (Hadoop Summit 2010)

This document discusses using Hadoop to process large amounts of spam data. It describes different types of spam, including email spam, social media spam, and web spam. It then explains why Hadoop is well-suited for spam processing due to its ability to parallelize tasks and handle large datasets. Sample system architectures and heuristics for spam detection are presented, such as analyzing IP addresses, link patterns, and content. Metrics like Jaccard similarity and arrival times can also help evaluate spam. Overall, the document advocates using Hadoop to gain insights from massive spam datasets through simple solutions that can effectively capture the majority of spam.

Winning the Big Data Spam Challenge

Erich Nachbar Stefan Groschupf Florian Leibert

Spam Types - Email Spam

• What do spammers do?
• Many domains
• Cycle through IPs (TOR, bulk blocks)
• Bulk account creation (increase IP reputation)
• Break captchas (Mechanical Turk)
• Common names
(e.g. http://www.census.gov/genealogy/names/dist.male.first)

Spam Types - Social Media

• Spam Carriers
• Blog Postings
• Comments
• Friend Requests
• ...
• Spam Generation through
• Actual User Accounts (Hacked / User Virus)
• Bot Accounts

Spam Types - Social Media

• Differences
• Detection is the same
• Account treatment is different (cancel vs. clean)
• 99% of all Spam contains URLs:
• Ignore text-only messages.
• Look at the URL not the text.
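
The "look at the URL, not the text" rule can be sketched as a small pre-filter. This is a minimal illustration, not the authors' implementation: the regular expression and the function names `extract_hosts` and `is_candidate` are assumptions made for this example.

```python
import re
from urllib.parse import urlsplit

# Assumption: a deliberately loose pattern, anything from http(s):// up to
# whitespace or a quote character counts as a URL.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(message):
    """Return the lower-cased host name of every URL found in the message."""
    hosts = []
    for url in URL_RE.findall(message):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal): skip it.
            continue
        if host:
            hosts.append(host)
    return hosts


def is_candidate(message):
    """Text-only messages are ignored; only messages with a URL are scored."""
    return bool(extract_hosts(message))
```

Messages for which `is_candidate` returns False would simply be dropped before any heavier processing; the extracted hosts become the keys for the later group-by steps.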

Spam Types - Web Spam

• Goal: influence search engine results
• Link farms
• Keyword, Meta tag stuffing
• Hidden or invisible unrelated text
• Scraper sites
• Spam blogs

Why process Spam in Hadoop?

• Easy to parallelize
• Bucketization
• User
• Date
• Source
• etc.
• Count models (probabilities) are very "hadoopy"
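
Bucketization plus counting maps directly onto a map step and a reduce step. The sketch below simulates both in plain Python rather than on a cluster; the record fields (`user`, `date`, `source`, `is_spam`) and all function names are hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Emit one ((dimension, value), (1, spam_flag)) pair per bucket."""
    spam_flag = 1 if record["is_spam"] else 0
    for dim in ("user", "date", "source"):
        yield (dim, record[dim]), (1, spam_flag)


def reduce_bucket(values):
    """Sum the counts of one bucket and derive its spam probability."""
    total = spam = 0
    for seen, flagged in values:
        total += seen
        spam += flagged
    return total, spam, spam / total


def run(records):
    """Local stand-in for the shuffle phase: group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return {key: reduce_bucket(values) for key, values in grouped.items()}
```

Each bucket is reduced independently of the others, which is what makes this kind of count model easy to parallelize.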

Why process Spam in Hadoop?

• Large data sets
• More samples ~ better results
• Algorithms require preprocessing
• Existing code
• e.g. url-parsers, bayes implementations

Heuristics for Spam Detection

• Easy to compute, Group-By & Count
• Captcha solving rates
• Source IP/Email
• Historic vs. current volume
• Reputation
• Link
• Frequency, position, ratio
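
As an illustration of the "historic vs. current volume" heuristic, the sketch below compares the current message count per source with a historic average. The smoothing constant, the threshold of 5 and the function names are assumptions chosen for the example, not values from the talk.

```python
from collections import Counter


def volume_ratios(historic_avg, current_sources, smoothing=1.0):
    """Ratio of current volume to historic average volume, per source.

    historic_avg: dict mapping source -> average messages per window.
    current_sources: iterable with one source entry per current message.
    smoothing: added to the baseline so unseen sources do not divide by zero.
    """
    current = Counter(current_sources)
    return {
        src: count / (historic_avg.get(src, 0.0) + smoothing)
        for src, count in current.items()
    }


def flag_spikes(historic_avg, current_sources, min_ratio=5.0, smoothing=1.0):
    """Keep only the sources whose volume grew by at least min_ratio."""
    ratios = volume_ratios(historic_avg, current_sources, smoothing)
    return {src: ratio for src, ratio in ratios.items() if ratio >= min_ratio}
```

The threshold is exactly the kind of value that would need tuning against real traffic.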

Heuristics for Spam Detection

• Content
• Self similarity
• Hash of media content
• Keywords
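
One way to read "hash of media content" is to group uploads by a digest and look for identical files posted from many accounts. The sketch below assumes SHA-256 and a threshold of three users; both are illustrative choices, and an exact hash only catches byte-identical copies.

```python
import hashlib
from collections import defaultdict


def group_by_digest(uploads):
    """Map each content digest to the set of users who uploaded it.

    uploads: iterable of (user_id, media_bytes) pairs (hypothetical schema).
    """
    users_by_digest = defaultdict(set)
    for user_id, blob in uploads:
        digest = hashlib.sha256(blob).hexdigest()
        users_by_digest[digest].add(user_id)
    return users_by_digest


def widely_shared(uploads, min_users=3):
    """Digests uploaded by at least min_users distinct accounts."""
    return {
        digest: users
        for digest, users in group_by_digest(uploads).items()
        if len(users) >= min_users
    }
```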

Looking at arrival times

• Inter-arrival times
• Fast Fourier Transform
• Timestamps -> frequency space
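
Moving timestamps into frequency space can be sketched as follows: bin the arrival times into a count series, transform it, and read off the strongest periodic component. For self-containment this uses a naive O(n^2) discrete Fourier transform from the standard library instead of a real FFT (on large inputs something like `numpy.fft.rfft` would take its place). The bin width, bin count and function names are assumptions.

```python
import cmath


def bin_counts(timestamps, bin_width, n_bins):
    """Count arrivals per fixed-width time bin, starting at the first arrival."""
    start = min(timestamps)
    counts = [0] * n_bins
    for t in timestamps:
        idx = int((t - start) // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def dft_magnitudes(signal):
    """Magnitudes of the DFT bins 0..n/2 of the mean-centered signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_period(timestamps, bin_width=1.0, n_bins=64):
    """Period (in timestamp units) of the strongest non-constant component."""
    counts = bin_counts(timestamps, bin_width, n_bins)
    magnitudes = dft_magnitudes(counts)
    k = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return n_bins * bin_width / k
```

A sender that posts in regular bursts shows a sharp peak at the burst frequency, while human activity tends to spread its energy across many bins. Very spiky signals also produce harmonics, so the strongest bin is a hint rather than a verdict.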

Evaluating content

• Jaccard similarity
• Bucketize by user / source email
• s1 = S(x1,x3,x5), s2 = S(x2,x4,x6)
• Easy with Hadoop
• map: emit user_id (K), text (V)
• reduce:

• Even simpler:
• # links / user
• # complaints, spam tags / user
• etc.
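
The slide leaves the reduce step blank, so the sketch below is only one plausible reading: per user, messages at odd positions form set s1, messages at even positions form set s2, and the Jaccard similarity of their tokens measures how much the user repeats themselves. Whitespace tokenization, the message fields and all function names are assumptions.

```python
from collections import defaultdict


def jaccard(a, b):
    """|a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_message(message):
    """map: emit user_id (K), text (V)."""
    yield message["user_id"], message["text"]


def reduce_user(texts):
    """reduce (assumed): alternate messages into s1 / s2 and compare tokens."""
    s1, s2 = set(), set()
    for i, text in enumerate(texts):
        target = s1 if i % 2 == 0 else s2
        target.update(text.lower().split())
    return jaccard(s1, s2)


def self_similarity(messages):
    """Local stand-in for the shuffle: group texts by user, then reduce."""
    grouped = defaultdict(list)
    for message in messages:
        for user_id, text in map_message(message):
            grouped[user_id].append(text)
    return {user_id: reduce_user(texts) for user_id, texts in grouped.items()}
```

A score near 1.0 means the two halves of a user's output share almost all their vocabulary, which is typical of templated spam.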

Solutions - Pay Level Domain

• A pay-level domain (PLD) requires payment at a TLD registrar
• Simple heuristic
• Much simpler than Trust-Rank, Page-Rank, etc.

Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov
IRLbot: Scaling to 6 Billion Pages and Beyond
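
A rough sketch of grouping links by pay-level domain follows. A real implementation would consult the full public suffix list; the three-entry `MULTI_LABEL_SUFFIXES` set here is a hypothetical stand-in for illustration only, as are the function names.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: tiny illustrative sample, NOT a complete suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def pay_level_domain(url):
    """Return the registrable (pay-level) domain of a URL, or None."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None
    if not host:
        return None
    host = host.rstrip(".")
    if host.replace(".", "").isdigit():
        # Dotted IPv4 address: there is no domain to reduce, keep it whole.
        return host
    labels = host.split(".")
    if len(labels) < 2:
        return None
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def links_per_pld(urls):
    """Count links per pay-level domain, ignoring unparseable URLs."""
    plds = (pay_level_domain(url) for url in urls)
    return Counter(pld for pld in plds if pld)
```

Counting per pay-level domain rather than per host means that a spammer cannot dilute the counts by generating unlimited free subdomains under one purchased domain.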

Take Aways

• Spam Reports are important
• Rolling real-time Ham & Spam Samples
• Have Knobs to turn (e.g. over JMX)
• Simple solutions can get you pretty far: the easy 80%
• Spammers adapt very fast, so stay agile
• Try to break your own system

Thank you!

erich@quantifind.com, @enachb

sg@datameer.com, @datameer

flo@leibert.de, @floleibert


[Slide 8: Sample System Architecture (diagram only)]

[Slide 14: Demo]
