Yahoo! Mail antispam - Bay area Hadoop user group

Published in: Technology

  1. 1. Yokai Versus the Elephant: Hadoop and the Fight Against Shape-Shifting Spam<br />Vishwanath Ramarao & Mark Risher<br />Yahoo! Mail<br />
  2. 2. © SHMorgan - www.obakemono.com<br />
  3. 3. AGENDA<br />Shape-shifting spam<br />Antispam Origins<br />Hadoop Algorithms<br />Applications to Security<br />Resources for Implementers<br />
  4. 4.
  5. 5. 5<br />
  6. 6. 6<br />http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li<!--cf997b28e-->gh<!--PdNKLr--><br />tt<!---kxnd2itipuvd.yahoo.com-->o<!--ju1j8V--><br />p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o<!--_qsgsnnjuf1m@vkvriskrgavzxjovbqg.net-->dy<!--in7oouvxfrg7ax-->.com]*!}v}]along especially consecutive important dmvfu<br /><!--gmail.com--><br />
  7. 7. 7<br />
  8. 8. 8<br />1,300,925,111,156,286,160,896<br />(http://bit.ly/cpOyLi)<br />
  9. 9.
  10. 10. 10<br />
  11. 11. Typical attack/response profile<br />11<br />Rule change<br />(1/23@01:15)<br />
  12. 12. MORE YOKAI - TARGETED ATTACKS<br /><style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobuttontelefoons Jermaine iesaporitoroshan 3026 janatatrennungpalillos toughest ncapitolecalzado 20200 Omnimedia collective saudadedizaines 205px hardener elongating InvasionofyourprivacyPersonnalftsbedingungenMontanerprozacSerpellfcardbvh capacitate 12502 courtship kiranjiutroligt transducer tyee Delhaize clueless toffee nnioZoapochino sterns 622 Verordnung carbons waterresistant assessing footerTextperrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justiaguardo jibes Chubb inflammatory iteration granfaldasseoir considerations 692px treasured Allotransplantationtwoyearsappx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsidaliefdeVoeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrinasareiniqueslugoquotedblbayr 3500 CI addressee optativelygazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rnsfuseactionrepristaires restraint manchettestrendlineseffectuedespatchMinskyestadual doses danbrown Muenster jind7n7 smashes gourmandesashantisentants rows kyk coated Incontournablescoincidenjspa stalker CDS contienen expletives s8 eof replenishing puyalluppratosondravalidarorientale sonnets steamer Niwangoacrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecniciconciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella PlissierHellmich Randall CaradonnaspringaregistradahauptEntran 3060 Rochin capacitor sotol 3413 smirk interditeServicePoint capabilities bouncefeeLinkov 3Dg auntie OSP CaeciliaPlatzierung wrangler pisosbanlieueDaniellaenderleisraelprofessionnellessusto 39800 
Espanaplena radian antic!...........................200KB……….<br /> </style><br /><center><a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><imgsrc="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br><a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><imgsrc="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br><a href="http://ivywhere.info/gp.html"><imgsrc="http://ivywhere.info/images/please2.jpg" border="0"></a><br><br />12<br />[400kb…]<br /><center><a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><imgsrc="http://corfair.info/images/ivblg1.jpg" border="0"></a><br><a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><imgsrc="http://corfair.info/images/ivblg2.jpg" border="0"></a><br><a href="http://corfair.info/gp.html"><imgsrc="http://corfair.info/images/please2.jpg" border="0"></a><br> <br />
  13. 13.
  14. 14. 14<br />
  15. 15. Why is the ANTISPAM PROBLEM hard?<br />Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes<br />User feedback is often late, noisy and not always actionable<br />Large, diverse stream of legitimate traffic that looks like spam<br />Slow adoption of authentication technologies like DKIM and SPF<br />Spammers are clever; they target and specialize attacks<br />Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign<br />A significant percentage of spam comes from large ESPs like Hotmail, Google and Yahoo<br />
  16. 16. Generation 1: Manual management layer<br />Heuristics, blocks, blacklists<br />Provide attack mitigation and operational flexibility; highly explainable<br />Not durable; expensive to keep pace with fast-morphing spam<br />Ad hoc queries<br />Proprietary implementations, not very scalable, steep learning curve<br />Reactive and usually late<br />
  17. 17. Generation 2: Machine Management Layer<br />Online reputation models<br />Simple, mostly scoring/counter/ratio based models<br />Highly scalable due to the absence of any state/memory<br />Generalize too broadly, lack expressive power<br />Batch trained reputation models<br />Typically digested memory based hashing or machine learning models<br />Difficult to implement and, because they need labeled examples, scale only moderately well<br />Slow to update and learn, lack explainability, limited operational control<br />
  18. 18.
  19. 19. Distributed computing paradigm<br />Map:Reduce + distributed storage:<br />Simplicity of online, stateless models<br />Expressiveness of offline analysis<br />Ease of management<br />
  20. 20. The map:reduce paradigm<br />Input data format is application-specific, specified by the user<br />Output is a set of <key,value> pairs<br />User expresses algorithm using two functions<br />Map is applied on the input data and produces a list of intermediate <key,value> pairs<br />Reduce is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs<br />Finally, output pairs are sorted by their key value<br />
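
The map, group-by-key and reduce steps described above can be sketched on a single machine. This is a minimal illustration, not Hadoop itself: the class name `MiniMapReduce` and its methods are hypothetical, word counting is chosen only as an example task, and a `TreeMap` stands in for the framework's shuffle and key-sorted output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the map:reduce flow (not the Hadoop API).
public class MiniMapReduce {

    // Map: turn one input record into a list of intermediate <key,value> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: merge all intermediate values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group pairs by key, reduce each group.
    // TreeMap keeps keys sorted, mirroring the sorted final output.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop")));
    }
}
```

On a real grid the map calls run on many machines and the grouping step moves data between them; only the two user-supplied functions look like the ones here.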
  22. 22. the map:reduce paradigm <br />21<br />Mapper<br /><k1,v1><br />Mapper<br /><k1,{v1,v3}><br /><k2,v2><br />Reducer<br /><k2,v2><br /><k1,W1><br />Mapper<br /><k1,v3><br />
  23. 23. A SIMPLE MAP:REDUCE EXAMPLE<br />$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01 <br />Hello World Bye World <br />$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02 <br />Hello Hadoop Goodbye Hadoop<br />// Split up input files (MAP), iterate over chunks, reassemble results (REDUCE) <br />$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output<br />$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 <br />Bye 1 <br />Goodbye 1 <br />Hadoop 2 <br />Hello 2 <br />World 2 <br />
  24. 24. a simple map:reduce example (bit.ly/bdyi0l)<br />// word (a Text) and one (an IntWritable holding 1) are fields of the enclosing mapper class<br />public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {<br />// Split the input line into whitespace-separated tokens<br />String line = value.toString();<br />StringTokenizer tokenizer = new StringTokenizer(line);<br />while (tokenizer.hasMoreTokens()) {<br />// Emit <word, 1> for every token<br />word.set(tokenizer.nextToken());<br />output.collect(word, one);<br />}<br />}<br />
  25. 25. a simple map:reduce example (bit.ly/bdyi0l)<br />public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {<br />public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {<br />// Sum the counts emitted for this key<br />int sum = 0;<br />while (values.hasNext()) {<br />sum += values.next().get();<br />}<br />output.collect(key, new IntWritable(sum));<br />}<br />}<br />
  26. 26. Applications <br />& <br />Outcomes<br />25<br />
  27. 27. Let's REVIEW OUR DESIGN GOALS AGAIN<br />Classifiers are notorious for lack of explainability<br />Engineers and analysts need to know what the classifier is missing<br />Engineers and analysts need to know about emerging threats<br />Analysts need “canned” reports along interesting dimensions<br />Machines need smart feature engineering<br />Develop a scalable system to provide deep insight into spammer campaigns<br />Double up as a platform for standard reporting<br />Also double up as a platform for ad hoc analysis and data probing<br />Signal amplification and smart feature extraction platform<br />
  28. 28. Our ANTISPAM ANALYTIC PLATFORM<br />Hadoop: Implements map reduce, written in Java but supports many other languages including Perl and C++ using the streaming interface<br />Feature engineering with small simple Perl programs for data extraction and transformation<br />SQL-like “Pig” programming language for data analysis and management<br />Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms<br />Other proprietary algorithms and frameworks for specialized tasks<br />
  29. 29. Various ASPECTS of A GRID DRIVEN SOLUTION<br />Standard reporting<br />Ad hoc querying<br />Campaign discovery from spam feedback using frequent item set mining<br />“Gaming” detection in notspam feedback using connected components<br />28<br />
  30. 30. Top SPAMMY DOMAINS REPORT FOR 01/15/2010<br />key:noreply.amateurmatch.com|value:1164<br />key:goodmere.info|value:896<br />key:marketing.meredith.com|value:1078<br />key:verizon.net|value:822<br />key:reply.mb00.net|value:980<br />key:insideapple.apple.com|value:1094<br />key:facebookappmail.com|value:882<br />key:mydailymoment.com|value:849<br />key:thetwilightsaga.com|value:4671<br />key:adknowledgemailer6.com|value:859<br />key:freedollarspro.info|value:1164<br />key:smartreachmedia.com|value:1074<br />key:yahoo.es|value:877<br />key:ecomasher.com|value:1197<br />key:leasetrade-statusupdates.com|value:951<br />
  31. 31. AD HOC queries for ANTISPAM research<br />Identify domains that had few spam votes in the previous time window but have a high number of spam votes today<br />All IPs in the last hour that sent a particular URL pattern…or that sent any unknown URL >500 times<br />Which domains/IPs suddenly increased their sending volume after a positive reputation change<br />Which FROM addresses exhibit low message size entropy<br />All messages that had nothing but a URL and the domain of the URL had low page rank<br />30<br />
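
One of the queries above asks which FROM addresses exhibit low message size entropy. A minimal sketch of that measure follows, assuming Shannon entropy over the exact byte sizes seen per sender. The class `SizeEntropy`, the `Message` type, and the threshold and minimum-count parameters are all hypothetical and are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: flag senders whose message sizes are suspiciously uniform.
public class SizeEntropy {

    // Illustrative record of one delivered message.
    static final class Message {
        final String from;
        final int sizeBytes;

        Message(String from, int sizeBytes) {
            this.from = from;
            this.sizeBytes = sizeBytes;
        }
    }

    // Shannon entropy, in bits, of the empirical distribution of sizes.
    static double entropy(List<Integer> sizes) {
        if (sizes.isEmpty()) {
            return 0.0;
        }
        Map<Integer, Integer> counts = new HashMap<>();
        for (int size : sizes) {
            counts.merge(size, 1, Integer::sum);
        }
        double bits = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / sizes.size();
            bits -= p * (Math.log(p) / Math.log(2));
        }
        return bits;
    }

    // Senders with at least minMessages messages and entropy at most maxBits.
    static List<String> lowEntropySenders(List<Message> messages, double maxBits, int minMessages) {
        Map<String, List<Integer>> sizesBySender = new TreeMap<>();
        for (Message m : messages) {
            sizesBySender.computeIfAbsent(m.from, k -> new ArrayList<>()).add(m.sizeBytes);
        }
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : sizesBySender.entrySet()) {
            if (e.getValue().size() >= minMessages && entropy(e.getValue()) <= maxBits) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Message> sample = Arrays.asList(
                new Message("bulk@example.test", 1000), new Message("bulk@example.test", 1000),
                new Message("bulk@example.test", 1000), new Message("person@example.test", 420),
                new Message("person@example.test", 1875), new Message("person@example.test", 96));
        System.out.println(lowEntropySenders(sample, 0.5, 3));
    }
}
```

In practice sizes would probably be bucketed before counting, since templated spam often varies by a few bytes; that refinement is left out here.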
  32. 32. AD HOC QUERIES - Anatomy of a PIG QUERY<br />--- This includes some basic string functions, including splitting a string on the '@' character<br />register /homes/jpujara/pig_scripts/string.jar;<br />define splitEmail string.Tokenize('2','@');<br />--- Load up some data - incoming messages at a date and time, and our trusted user database<br />MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);<br />USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);<br />--- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners<br />EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);<br />YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;<br />--- Combine the message and sender domains with the trusted user data and select only trusted messages<br />YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;<br />TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;<br />--- Group by domain, and generate a count, order by descending count<br />DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;<br />DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;<br />DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;<br />--- Output the results<br />STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';<br />
  33. 33. CAMPAIGN Discovery in SPAM Feedback<br />Frequent Itemset Mining<br />Classical method<br />Research interesting relationships between variables in a large database<br />Primarily applied for market basket analysis<br />Many good implementations<br />APRIORI<br />Easy to implement<br />Parallelizes moderately well but bottlenecks for extremely large data sets<br />Not very efficient in the number of scans<br />ECLAT<br />Parallelizes easily<br />Amenable to a good grid implementation<br />Fewer scans of the dataset<br />Parallel FP GROWTH<br />Designed explicitly for systems like Hadoop<br />Implemented in Mahout 0.2<br />
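
To make the APRIORI entry concrete, here is a minimal single-machine sketch of the level-wise idea: count candidate itemsets in one pass over the data per level, keep those meeting a minimum support, and build larger candidates only from frequent smaller ones. It illustrates why the method needs repeated scans. It is not the Mahout Parallel FP-Growth implementation named above, and the class `AprioriSketch` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, unoptimized Apriori sketch for small in-memory data.
public class AprioriSketch {

    // Returns each itemset found in at least minSupport transactions, with its count.
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions, int minSupport) {
        Map<Set<String>, Integer> result = new HashMap<>();

        // Level 1 candidates: every individual item.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Collections.singleton(item)));
            }
        }

        while (!candidates.isEmpty()) {
            // One pass over the transactions per level to count support.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                for (Set<String> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }
            Map<Set<String>, Integer> frequent = new HashMap<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.put(e.getKey(), e.getValue());
                }
            }
            result.putAll(frequent);

            // Join frequent k-itemsets into (k+1)-candidates, then prune.
            candidates = new HashSet<>();
            for (Set<String> a : frequent.keySet()) {
                for (Set<String> b : frequent.keySet()) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1 && allSubsetsFrequent(union, frequent.keySet())) {
                        candidates.add(union);
                    }
                }
            }
        }
        return result;
    }

    // A candidate can only be frequent if every subset one item smaller is frequent.
    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy spam reports: each transaction is the set of features of one report.
        List<Set<String>> reports = Arrays.asList(
                new HashSet<>(Arrays.asList("FROMUSER:sales", "IP:10.0.0.1", "DOM:a.test")),
                new HashSet<>(Arrays.asList("FROMUSER:sales", "IP:10.0.0.1", "DOM:b.test")),
                new HashSet<>(Arrays.asList("FROMUSER:health", "IP:10.0.0.2", "DOM:c.test")));
        System.out.println(frequentItemsets(reports, 2));
    }
}
```

In the toy data the feature names and addresses are invented; the frequent set pairing a sender user name with a sending IP is the kind of co-occurrence that marks one campaign behind several domains.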
  34. 34. Frequent item set – example dataset<br />33<br />
  35. 35. Frequent ITEMSET MINING<br />Slide courtesy: dortmund.de<br />
  36. 36. Frequent itemset MINING on ONE DAY’s SPAM REPORTS<br />9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)<br />9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)<br />9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)<br />9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)<br />9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)<br />9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,) <br />9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)<br />9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)<br />9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)<br />9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)<br />9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)<br />9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)<br />
  37. 37. Gaming DETECTION in NOTSPAM FEEDBACK<br />Spammers instrument accounts to vote “not spam” on emails that they send<br />Delays classification of spamming IP addresses<br />Throws off the classifiers if the feedback is not filtered well<br />Model the problem as a bipartite graph<br />Well known model for matching algorithms<br />Broadly applied in various fields like coding theory<br />A graph whose vertices form two disjoint sets U and V<br />Every edge connects a vertex in U to a vertex in V<br />
  39. 39. Connected COMPONENTS - EXPLAINED<br />Y1 = Yahoo user 1, Y2 = Yahoo user 2<br />IP1 = IP address of the host Y1 “voted” notspam from<br />[Figure labels: y1, IP1, IP2, SQUARING, weight = 2]<br />
  40. 40. Connected COMPONENTS for “GAMING” DETECTION<br />38<br />Set of IPs/YIDs used <br />exclusively for <br />voting notspam<br />Set of (likely new) <br />spamming IPs which <br />are “worth” voting for<br />y1<br />IP3<br />IP1<br />y2<br />IP4<br />IP2<br />y3<br />Set of <br />“voted on” IPs<br />Set of <br />“voted from” IPs<br />Set of Yahoo IDs<br />voting notspam<br />
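
The grouping idea behind these slides can be sketched as connected components over the bipartite graph of Yahoo IDs and IP addresses, where each notspam vote contributes one edge. The slides do not say how the components are computed on the grid, so the single-machine union-find below is a stand-in, and the class `VoteComponents` and the `user:`/`ip:` node prefixes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: connected components of a user/IP vote graph via union-find.
public class VoteComponents {

    private final Map<String, String> parent = new HashMap<>();

    // Find the representative of x's component, compressing the path on the way.
    String find(String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        while (!parent.get(x).equals(root)) {
            String next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // Merge the components containing a and b.
    void union(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    // Each edge is {yahooId, ip}: one notspam vote linking an account and an address.
    static Collection<Set<String>> components(List<String[]> edges) {
        VoteComponents uf = new VoteComponents();
        for (String[] edge : edges) {
            // Prefixes keep the two vertex sets of the bipartite graph distinct.
            uf.union("user:" + edge[0], "ip:" + edge[1]);
        }
        Map<String, Set<String>> byRoot = new HashMap<>();
        for (String node : new ArrayList<>(uf.parent.keySet())) {
            byRoot.computeIfAbsent(uf.find(node), r -> new TreeSet<>()).add(node);
        }
        return byRoot.values();
    }

    public static void main(String[] args) {
        List<String[]> votes = Arrays.asList(
                new String[] {"y1", "IP1"},
                new String[] {"y2", "IP1"},
                new String[] {"y2", "IP2"},
                new String[] {"y3", "IP9"});
        for (Set<String> component : components(votes)) {
            System.out.println(component);
        }
    }
}
```

A large component made up of accounts and addresses that do little besides vote notspam is the kind of cluster the slides describe; deciding which components are suspicious needs further signals that this sketch does not model.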
  41. 41. Connected Components - RESULTS<br />- Connected components for IPs notspam was voted from<br />
  42. 42. Connected components - results<br />- Connected components for IPs notspam was voted on<br />
  43. 43. CONCLUSIONS<br />We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms<br />Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback<br />Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback<br />Grid system based analysis platforms may be broadly applicable across the security domain<br />
  44. 44. Apply Slide<br />Download Hadoop distribution<br />http://hadoop.apache.org<br />Try out Pig on standalone, single Linux box<br />Identify source data to aggregate<br />Start simple: IP patterns across web access logs<br />Begin with offline aggregation; yesterday’s attacks still interesting<br />Read Connected Components and Frequent Itemset Mining papers<br />Stop looking for a single, invariant “tell” – far too costly<br />Start thinking about co-occurrence of innocuous features <br />42<br />
  45. 45. Resources for implementers<br />Hadoop setup, documentation and resources<br />http://hadoop.apache.org/<br />Pig documentation and resources<br />http://hadoop.apache.org/pig/<br />Mahout documentation and resources<br />http://lucene.apache.org/mahout/<br />Frequent itemset mining implementation repository<br />http://fimi.cs.helsinki.fi/src/<br />Connected components description<br />[link not yet live]<br />Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007<br />43<br />
  46. 46.
  47. 47. Connected COMPONENTS<br />[Figure: node labels Reg IP, Cookie, Username and Birthday, repeated seven times]<br />
