SlideShare a Scribd company logo
1 of 16
Winning the Big Data Spam Challenge


 Erich Nachbar   Stefan Groschupf   Florian Leibert
Spam Types - Email Spam

•   What do spammers do? 
    • Many domains
    • Cycle through IPs (TOR, bulk blocks)
    • Bulk account creation (increase IP reputation)
       • Break captchas (Mechanical Turk)
       • Common names
         (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
Spam Types - Social Media

•   Spam Carriers
    • Blog Postings
    • Comments
    • Friend Requests
    • ...
•   Spam Generation through
    • Actual User Accounts (Hacked / User Virus)
    • Bot Accounts
Spam Types - Social Media

•   Differences
     • Detection is the same
     • Account treatment is different (cancel vs. clean)
•   99% of all Spam contains URLs:
     • Ignore text-only messages.
     • Look at the URL not the text.
Spam Types - Web Spam

•   Goal: influence search engine results
    • Link farms
    • Keyword, Meta tag stuffing
    • Hidden or invisible unrelated text
    • Scraper sites
    • Spam blogs
Why process Spam in Hadoop?

•   Easy to parallelize  
    • Bucketization
       • User
       • Date
       • Source
       • etc.
    • Count models (probabilities) are very "hadoopy"
Why process Spam in Hadoop?

•   Large data sets
     • More samples ~ better results
•   Algorithms require preprocessing
•   Existing code
     • e.g. url-parsers, bayes implementations
    •
Sample System Architecture
Heuristics for Spam Detection

•Easy to compute, Group-By & Count
•Captcha solving rates
•Source IP/Email
  • Historic vs. current volume
  • Reputation
• Link
   • Rrequency, position, ratio 
Heuristics for Spam Detection

•   Content
     • Self similarity
     • Hash of media content
     • Keywords
Looking at arrival times


•   Inter-arrival times
•   Fast Fourier Transform 
     • Timestamps
       -> frequency space
Evaluating content
•   Jaccard similarity

    •   Bucketize by user / source email
         • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6)
    •   Easy with Hadoop
         • map: emit user_id (K), text (V)
         • reduce: 

•   Even simpler: 
    • # links / user
    • # complaints, spam tags / user
    • etc.
Solutions - Pay Level Domain

• Requires payment at Top Level Domain
• Simple heuristic
• Much simpler than Trust-Rank, Page-Rank, etc.




  Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov
  IRLbot: Scaling to 6 Billion Pages and Beyond
Demo
Take Aways

• Spam Reports are important
• Rolling real-time Ham & Spam Samples
• Have Knobs to turn (e.g. over JMX)
• Simple solutions can get you pretty far, the easy 80% 
• Spammers adapt very fast, stay agile
• Try to break your own system
Thank you!

erich@quantifind.com, @enachb

sg@datameer.com, @datameer

  flo@leibert.de, @floleibert

More Related Content

What's hot

The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked DataJoshua Shinavier
 
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDBMongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDBDaniel Coupal
 
DomainTools Fingerprinting Threat Actors with Web Assets
DomainTools Fingerprinting Threat Actors with Web AssetsDomainTools Fingerprinting Threat Actors with Web Assets
DomainTools Fingerprinting Threat Actors with Web AssetsDomainTools
 
A recommendation engine for your php application
A recommendation engine for your php applicationA recommendation engine for your php application
A recommendation engine for your php applicationMichele Orselli
 
Austin Day of Rest - Introduction
Austin Day of Rest - IntroductionAustin Day of Rest - Introduction
Austin Day of Rest - IntroductionHandsOnWP.com
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
Linked Open Data for Libraries
Linked Open Data for LibrariesLinked Open Data for Libraries
Linked Open Data for LibrariesLukas Koster
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingCynthiaCruz55
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave Cramer
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave CramerEPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave Cramer
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave CramerBookNet Canada
 
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...Grokking VN
 
A Framework for Dynamic Data Source Identification and Orchestration on the Web
A Framework for Dynamic Data Source Identification and Orchestration on the WebA Framework for Dynamic Data Source Identification and Orchestration on the Web
A Framework for Dynamic Data Source Identification and Orchestration on the Webmashups
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with PythonMaris Lemba
 
Hacking iOS Applications with Proxies
Hacking iOS Applications with ProxiesHacking iOS Applications with Proxies
Hacking iOS Applications with ProxiesKarl Fosaaen
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?Abhishek Mitra
 
Linked Data: A short(-ish) introduction
Linked Data: A short(-ish) introductionLinked Data: A short(-ish) introduction
Linked Data: A short(-ish) introductionPete Johnston
 

What's hot (20)

The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked Data
 
Vít Listík - Email.cz workshop
Vít Listík - Email.cz workshopVít Listík - Email.cz workshop
Vít Listík - Email.cz workshop
 
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDBMongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
 
DomainTools Fingerprinting Threat Actors with Web Assets
DomainTools Fingerprinting Threat Actors with Web AssetsDomainTools Fingerprinting Threat Actors with Web Assets
DomainTools Fingerprinting Threat Actors with Web Assets
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
A recommendation engine for your php application
A recommendation engine for your php applicationA recommendation engine for your php application
A recommendation engine for your php application
 
Austin Day of Rest - Introduction
Austin Day of Rest - IntroductionAustin Day of Rest - Introduction
Austin Day of Rest - Introduction
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
Linked Open Data for Libraries
Linked Open Data for LibrariesLinked Open Data for Libraries
Linked Open Data for Libraries
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen Scraping
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave Cramer
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave CramerEPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave Cramer
EPUB vs. WEB: A Cautionary Tale - ebookcraft 2016 - Tzviya Siegman & Dave Cramer
 
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
 
A Framework for Dynamic Data Source Identification and Orchestration on the Web
A Framework for Dynamic Data Source Identification and Orchestration on the WebA Framework for Dynamic Data Source Identification and Orchestration on the Web
A Framework for Dynamic Data Source Identification and Orchestration on the Web
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
Hacking iOS Applications with Proxies
Hacking iOS Applications with ProxiesHacking iOS Applications with Proxies
Hacking iOS Applications with Proxies
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Linked Data: A short(-ish) introduction
Linked Data: A short(-ish) introductionLinked Data: A short(-ish) introduction
Linked Data: A short(-ish) introduction
 

Similar to Winning the Big Data SPAM Challenge__HadoopSummit2010

HadoopSummit_2010_big dataspamchallange_hadoopsummit2010
HadoopSummit_2010_big dataspamchallange_hadoopsummit2010HadoopSummit_2010_big dataspamchallange_hadoopsummit2010
HadoopSummit_2010_big dataspamchallange_hadoopsummit2010Yahoo Developer Network
 
Email Address Harvesting
Email Address HarvestingEmail Address Harvesting
Email Address HarvestingMichael Lamont
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBMihnea Giurgea
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Internet and Social Media for Beginners
Internet and Social Media for BeginnersInternet and Social Media for Beginners
Internet and Social Media for Beginnersbecarreno
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrBrooke Ganz
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015Nathan Halko
 
OSINT for Attack and Defense
OSINT for Attack and DefenseOSINT for Attack and Defense
OSINT for Attack and DefenseAndrew McNicol
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producingkurtgessler
 
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Joshua Kamdjou
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersRob Fuller
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.orgrvguha
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsHisham Arafat
 
Gates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringGates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringChris Gates
 

Similar to Winning the Big Data SPAM Challenge__HadoopSummit2010 (20)

HadoopSummit_2010_big dataspamchallange_hadoopsummit2010
HadoopSummit_2010_big dataspamchallange_hadoopsummit2010HadoopSummit_2010_big dataspamchallange_hadoopsummit2010
HadoopSummit_2010_big dataspamchallange_hadoopsummit2010
 
Geek basics
Geek basicsGeek basics
Geek basics
 
Email Address Harvesting
Email Address HarvestingEmail Address Harvesting
Email Address Harvesting
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDB
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Internet and Social Media for Beginners
Internet and Social Media for BeginnersInternet and Social Media for Beginners
Internet and Social Media for Beginners
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache Solr
 
Haifa
HaifaHaifa
Haifa
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015
 
OSINT for Attack and Defense
OSINT for Attack and DefenseOSINT for Attack and Defense
OSINT for Attack and Defense
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
 
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
 
Fighting Spam at Flickr
Fighting Spam at FlickrFighting Spam at Flickr
Fighting Spam at Flickr
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for Pentesters
 
Ir1
Ir1Ir1
Ir1
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.org
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
Gates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringGates Toorcon X New School Information Gathering
Gates Toorcon X New School Information Gathering
 

More from Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

Winning the Big Data SPAM Challenge__HadoopSummit2010

  • 1. Winning the Big Data Spam Challenge Erich Nachbar Stefan Groschupf Florian Leibert
  • 2. Spam Types - Email Spam • What do spammers do?  • Many domains • Cycle through IPs (TOR, bulk blocks) • Bulk account creation (increase IP reputation) • Break captchas (Mechanical Turk) • Common names (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
  • 3. Spam Types - Social Media • Spam Carriers • Blog Postings • Comments • Friend Requests • ... • Spam Generation through • Actual User Accounts (Hacked / User Virus) • Bot Accounts
  • 4. Spam Types - Social Media • Differences • Detection is the same • Account treatment is different (cancel vs. clean) • 99% of all Spam contains URLs: • Ignore text-only messages. • Look at the URL not the text.
  • 5. Spam Types - Web Spam • Goal: influence search engine results • Link farms • Keyword, Meta tag stuffing • Hidden or invisible unrelated text • Scraper sites • Spam blogs
  • 6. Why process Spam in Hadoop? • Easy to parallelize   • Bucketization • User • Date • Source • etc. • Count models (probabilities) are very "hadoopy"
  • 7. Why process Spam in Hadoop? • Large data sets • More samples ~ better results • Algorithms require preprocessing • Existing code • e.g. url-parsers, bayes implementations •
  • 9. Heuristics for Spam Detection •Easy to compute, Group-By & Count •Captcha solving rates •Source IP/Email • Historic vs. current volume • Reputation • Link • Rrequency, position, ratio 
  • 10. Heuristics for Spam Detection • Content • Self similarity • Hash of media content • Keywords
  • 11. Looking at arrival times • Inter-arrival times • Fast Fourier Transform  • Timestamps -> frequency space
  • 12. Evaluating content • Jaccard similarity • Bucketize by user / source email • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6) • Easy with Hadoop • map: emit user_id (K), text (V) • reduce:  • Even simpler:  • # links / user • # complaints, spam tags / user • etc.
  • 13. Solutions - Pay Level Domain • Requires payment at Top Level Domain • Simple heuristic • Much simpler than Trust-Rank, Page-Rank, etc. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov IRLbot: Scaling to 6 Billion Pages and Beyond
  • 14. Demo
  • 15. Take Aways • Spam Reports are important • Rolling real-time Ham & Spam Samples • Have Knobs to turn (e.g. over JMX) • Simple solutions can get you pretty far, the easy 80%  • Spammers adapt very fast, stay agile • Try to break your own system
  • 16. Thank you! erich@quantifind.com, @enachb sg@datameer.com, @datameer flo@leibert.de, @floleibert