HadoopSummit_2010_big dataspamchallange_hadoopsummit2010

766 views

Published on

Published in: Technology, News & Politics
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
766
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

HadoopSummit_2010_big dataspamchallange_hadoopsummit2010

  1. 1. Winning the Big Data Spam Challenge Erich Nachbar Stefan Groschupf Florian Leibert
  2. 2. Spam Types - Email Spam • What do spammers do?  • Many domains • Cycle through IPs (TOR, bulk blocks) • Bulk account creation (increase IP reputation) • Break captchas (Mechanical Turk) • Common names (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
  3. 3. Spam Types - Social Media • Spam Carriers • Blog Postings • Comments • Friend Requests • ... • Spam Generation through • Actual User Accounts (Hacked / User Virus) • Bot Accounts
  4. 4. Spam Types - Social Media • Differences • Detection is the same • Account treatment is different (cancel vs. clean) • 99% of all Spam contains URLs: • Ignore text-only messages. • Look at the URL not the text.
  5. 5. Spam Types - Web Spam • Goal: influence search engine results • Link farms • Keyword, Meta tag stuffing • Hidden or invisible unrelated text • Scraper sites • Spam blogs
  6. 6. Why process Spam in Hadoop? • Easy to parallelize   • Bucketization • User • Date • Source • etc. • Count models (probabilities) are very "hadoopy"
  7. 7. Why process Spam in Hadoop? • Large data sets • More samples ~ better results • Algorithms require preprocessing • Existing code • e.g. url-parsers, bayes implementations •
  8. 8. Sample System Architecture
  9. 9. Heuristics for Spam Detection • Easy to compute, Group-By & Count • Captcha solving rates • Source IP/Email • Historic vs. current volume • Reputation • Link • Rrequency, position, ratio 
  10. 10. Heuristics for Spam Detection • Content • Self similarity • Hash of media content • Keywords
  11. 11. Looking at arrival times • Inter-arrival times • Fast Fourier Transform  • Timestamps -> frequency space
  12. 12. Evaluating content • Jaccard similarity • Bucketize by user / source email • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6) • Easy with Hadoop • map: emit user_id (K), text (V) • reduce:  • Even simpler:  • # links / user • # complaints, spam tags / user • etc.
  13. 13. Solutions - Pay Level Domain • Requires payment at Top Level Domain • Simple heuristic • Much simpler than Trust-Rank, Page-Rank, etc. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov IRLbot: Scaling to 6 Billion Pages and Beyond
  14. 14. Demo
  15. 15. Take Aways • Spam Reports are important • Rolling real-time Ham & Spam Samples • Have Knobs to turn (e.g. over JMX) • Simple solutions can get you pretty far, the easy 80%  • Spammers adapt very fast, stay agile • Try to break your own system
  16. 16. Thank you! erich@quantifind.com, @enachb sg@datameer.com, @datameer flo@leibert.de, @floleibert

×