Winning the Big Data Spam Challenge


 Erich Nachbar   Stefan Groschupf   Florian Leibert
Spam Types - Email Spam

•   What do spammers do? 
    • Many domains
    • Cycle through IPs (TOR, bulk blocks)
    • Bul...
Spam Types - Social Media

•   Spam Carriers
    • Blog Postings
    • Comments
    • Friend Requests
    • ...
•   Spam G...
Spam Types - Social Media

•   Differences
     • Detection is the same
     • Account treatment is different (cancel vs. ...
Spam Types - Web Spam

•   Goal: influence search engine results
    • Link farms
    • Keyword, Meta tag stuffing
    • H...
Why process Spam in Hadoop?

•   Easy to parallelize  
    • Bucketization
       • User
       • Date
       • Source
   ...
Why process Spam in Hadoop?

•   Large data sets
     • More samples ~ better results
•   Algorithms require preprocessing...
Sample System Architecture
Heuristics for Spam Detection

•Easy to compute, Group-By & Count
•Captcha solving rates
•Source IP/Email
  • Historic vs....
Heuristics for Spam Detection

•   Content
     • Self similarity
     • Hash of media content
     • Keywords
Looking at arrival times


•   Inter-arrival times
•   Fast Fourier Transform 
     • Timestamps
       -> frequency space
Evaluating content
•   Jaccard similarity

    •   Bucketize by user / source email
         • s1 = S(x1,x3,x5), s2 = S(x2...
Solutions - Pay Level Domain

• Requires payment at Top Level Domain
• Simple heuristic
• Much simpler than Trust-Rank, Pa...
Demo
Take Aways

• Spam Reports are important
• Rolling real-time Ham & Spam Samples
• Have Knobs to turn (e.g. over JMX)
• Sim...
Thank you!

erich@quantifind.com, @enachb

sg@datameer.com, @datameer

  flo@leibert.de, @floleibert
Upcoming SlideShare
Loading in...5
×

Winning the Big Data SPAM Challenge__HadoopSummit2010

1,518

Published on

Hadoop Summit 2010- Application Track
Winning the Big Data SPAM Challenge
Stefan Groschupf, Datameer ; Erich Nachbar, Florian Leibert

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,518
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Winning the Big Data SPAM Challenge__HadoopSummit2010

  1. 1. Winning the Big Data Spam Challenge Erich Nachbar Stefan Groschupf Florian Leibert
  2. 2. Spam Types - Email Spam • What do spammers do?  • Many domains • Cycle through IPs (TOR, bulk blocks) • Bulk account creation (increase IP reputation) • Break captchas (Mechanical Turk) • Common names (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
  3. 3. Spam Types - Social Media • Spam Carriers • Blog Postings • Comments • Friend Requests • ... • Spam Generation through • Actual User Accounts (Hacked / User Virus) • Bot Accounts
  4. 4. Spam Types - Social Media • Differences • Detection is the same • Account treatment is different (cancel vs. clean) • 99% of all Spam contains URLs: • Ignore text-only messages. • Look at the URL not the text.
  5. 5. Spam Types - Web Spam • Goal: influence search engine results • Link farms • Keyword, Meta tag stuffing • Hidden or invisible unrelated text • Scraper sites • Spam blogs
  6. 6. Why process Spam in Hadoop? • Easy to parallelize   • Bucketization • User • Date • Source • etc. • Count models (probabilities) are very "hadoopy"
  7. 7. Why process Spam in Hadoop? • Large data sets • More samples ~ better results • Algorithms require preprocessing • Existing code • e.g. url-parsers, bayes implementations •
  8. 8. Sample System Architecture
  9. 9. Heuristics for Spam Detection •Easy to compute, Group-By & Count •Captcha solving rates •Source IP/Email • Historic vs. current volume • Reputation • Link • Rrequency, position, ratio 
  10. 10. Heuristics for Spam Detection • Content • Self similarity • Hash of media content • Keywords
  11. 11. Looking at arrival times • Inter-arrival times • Fast Fourier Transform  • Timestamps -> frequency space
  12. 12. Evaluating content • Jaccard similarity • Bucketize by user / source email • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6) • Easy with Hadoop • map: emit user_id (K), text (V) • reduce:  • Even simpler:  • # links / user • # complaints, spam tags / user • etc.
  13. 13. Solutions - Pay Level Domain • Requires payment at Top Level Domain • Simple heuristic • Much simpler than Trust-Rank, Page-Rank, etc. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov IRLbot: Scaling to 6 Billion Pages and Beyond
  14. 14. Demo
  15. 15. Take Aways • Spam Reports are important • Rolling real-time Ham & Spam Samples • Have Knobs to turn (e.g. over JMX) • Simple solutions can get you pretty far, the easy 80%  • Spammers adapt very fast, stay agile • Try to break your own system
  16. 16. Thank you! erich@quantifind.com, @enachb sg@datameer.com, @datameer flo@leibert.de, @floleibert

×