Yokai Versus the Elephant
Hadoop and the Fight Against Shape-Shifting Spam
Vishwanath Ramarao & Mark Risher
Yahoo! Mail
© SHMorgan - www.obakemono.com
AGENDA
  • Shape-shifting spam
  • Antispam Origins
  • Hadoop Algorithms
  • Applications to Security
  • Resources for Implementers
http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li<!--cf997b28e-->gh<!--PdNKLr-->tt<!---kxnd2itipuvd.yahoo.com-->o<!--ju1j8V-->p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o<!--_qsgsnnjuf1m@vkvriskrgavzxjovbqg.net-->dy<!--in7oouvxfrg7ax-->.com]*!}v}]along especially consecutive important dmvfu<!--gmail.com-->
1,300,925,111,156,286,160,896 (http://bit.ly/cpOyLi)
Typical attack/response profile
[Chart: attack volume over time, annotated "Rule change (1/23 @ 01:15)"]
MORE YOKAI - TARGETED ATTACKS
<style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobuttontelefoons Jermaine iesaporitoroshan 3026 janatatrennungpalillos toughest ncapitolecalzado 20200 Omnimedia collective saudadedizaines 205px hardener elongating InvasionofyourprivacyPersonnalftsbedingungenMontanerprozacSerpellfcardbvh capacitate 12502 courtship kiranjiutroligt transducer tyee Delhaize clueless toffee nnioZoapochino sterns 622 Verordnung carbons waterresistant assessing footerTextperrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justiaguardo jibes Chubb inflammatory iteration granfaldasseoir considerations 692px treasured Allotransplantationtwoyearsappx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsidaliefdeVoeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrinasareiniqueslugoquotedblbayr 3500 CI addressee optativelygazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rnsfuseactionrepristaires restraint manchettestrendlineseffectuedespatchMinskyestadual doses danbrown Muenster jind7n7 smashes gourmandesashantisentants rows kyk coated Incontournablescoincidenjspa stalker CDS contienen expletives s8 eof replenishing puyalluppratosondravalidarorientale sonnets steamer Niwangoacrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecniciconciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella PlissierHellmich Randall CaradonnaspringaregistradahauptEntran 3060 Rochin capacitor sotol 3413 smirk interditeServicePoint capabilities bouncefeeLinkov 3Dg auntie OSP CaeciliaPlatzierung wrangler pisosbanlieueDaniellaenderleisraelprofessionnellessusto 39800 Espanaplena radian antic!...........................200KB………. </style>
<center>
<a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><img src="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br>
<a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><img src="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br>
<a href="http://ivywhere.info/gp.html"><img src="http://ivywhere.info/images/please2.jpg" border="0"></a><br>
[400kb…]
<center>
<a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><img src="http://corfair.info/images/ivblg1.jpg" border="0"></a><br>
<a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><img src="http://corfair.info/images/ivblg2.jpg" border="0"></a><br>
<a href="http://corfair.info/gp.html"><img src="http://corfair.info/images/please2.jpg" border="0"></a><br>
Why is the antispam problem hard?
  • Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes
  • User feedback is often late, noisy and not always actionable
  • Large, diverse stream of legitimate traffic that looks like spam
  • Slow adoption of authentication technologies like DKIM and SPF
  • Spammers are clever; they target and specialize attacks
  • Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign
  • A significant percentage of spam comes from large ESPs like Hotmail, Google and Yahoo
Generation 1: Manual management layer
  • Heuristics, blocks, blacklists
      • Provide attack mitigation and operational flexibility; highly explainable
      • Not durable; expensive to keep pace with fast-morphing spam
  • Ad hoc queries
      • Proprietary implementations, not very scalable, steep learning curve
      • Reactive and usually late
Generation 2: Machine management layer
  • Online reputation models
      • Simple, mostly scoring/counter/ratio-based models
      • Highly scalable due to the absence of any state/memory
      • Generalize too broadly, lack expressive power
  • Batch-trained reputation models
      • Typically digested memory-based hashing or machine learning models
      • Difficult to implement and, because they need labeled examples, scale only moderately well
      • Slow to update and learn, lack explainability, limited operational control
Distributed computing paradigm
Map:Reduce + distributed storage:
  • Simplicity of online, stateless models
  • Expressiveness of offline analysis
  • Ease of management

The map:reduce paradigm
  • Input data format is application-specific, specified by the user
  • Output is a set of <key,value> pairs
  • User expresses the algorithm using two functions
      • Map is applied on the input data and produces a list of intermediate <key,value> pairs
      • Reduce is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs
  • Finally, output pairs are sorted by their key value
The map:reduce paradigm
[Diagram: Mappers emit <k1,v1>, <k2,v2> and <k1,v3>; the pairs are grouped by key into <k1,{v1,v3}> and <k2,v2> for the Reducer, which emits output pairs such as <k1,W1>.]
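The map, group-by-key and reduce steps described above can be sketched in a few lines of single-process Python. This is only an illustration of the paradigm, not Hadoop code: the function names `map_words`, `reduce_counts` and `run_map_reduce` are made up for this sketch, and word counting is used as the example job.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map: emit a (word, 1) pair for every token in one input line."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all values that share a key into a single sum."""
    yield word, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Run mapper and reducer over (key, value) records in one process."""
    grouped = defaultdict(list)
    # Map phase: apply the mapper to every input record.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: collect intermediate values under their key.
            grouped[out_key].append(out_value)
    # Reduce phase: one reducer call per key, visited in sorted key order.
    output = []
    for out_key in sorted(grouped):
        output.extend(reducer(out_key, grouped[out_key]))
    return output


# Two short sample lines, keyed by an arbitrary record id.
records = [(0, "Hello World Bye World"), (1, "Hello Hadoop Goodbye Hadoop")]
result = run_map_reduce(records, map_words, reduce_counts)
for word, count in result:
    print(word, count)
```

On a real grid the map calls run on many machines and the grouping step moves data between them; the sketch keeps only the data flow.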
A simple map:reduce example
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop
    # Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)
    $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
    $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2
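A simple map:reduce example (bit.ly/bdyi0l)
    // Excerpt from the Mapper class of the WordCount example. The fields
    // `word` (a Text) and `one` (an IntWritable holding 1) are declared on the class.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Split the input line into whitespace-delimited tokens.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit <word, 1> for every token.
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }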
A simple map:reduce example (bit.ly/bdyi0l)
    // Needs java.io.IOException, java.util.Iterator, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Sum all the counts emitted for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Emit <word, total>.
            output.collect(key, new IntWritable(sum));
        }
    }
Applications & Outcomes
Let's review our design goals again
  • Classifiers are notorious for lack of explainability
  • Engineers and analysts need to know what the classifier is missing
  • Engineers and analysts need to know about emerging threats
  • Analysts need “canned” reports along interesting dimensions
  • Machines need smart feature engineering
  • Develop a scalable system to provide deep insight into spammer campaigns
  • Double up as a platform for standard reporting
  • Also double up as a platform for ad hoc analysis and data probing
  • Signal amplification and smart feature extraction platform
Our antispam analytic platform
  • Hadoop: implements map:reduce; written in Java but supports many other languages, including Perl and C++, using the streaming interface
  • Feature engineering with small, simple Perl programs for data extraction and transformation
  • SQL-like “Pig” programming language for data analysis and management
  • Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms
  • Other proprietary algorithms and frameworks for specialized tasks
Various aspects of a grid-driven solution
  • Standard reporting
  • Ad hoc querying
  • Campaign discovery from spam feedback using frequent itemset mining
  • “Gaming” detection in notspam feedback using connected components
Top spammy domains report for 01/15/2010
    key:noreply.amateurmatch.com|value:1164
    key:goodmere.info|value:896
    key:marketing.meredith.com|value:1078
    key:verizon.net|value:822
    key:reply.mb00.net|value:980
    key:insideapple.apple.com|value:1094
    key:facebookappmail.com|value:882
    key:mydailymoment.com|value:849
    key:thetwilightsaga.com|value:4671
    key:adknowledgemailer6.com|value:859
    key:freedollarspro.info|value:1164
    key:smartreachmedia.com|value:1074
    key:yahoo.es|value:877
    key:ecomasher.com|value:1197
    key:leasetrade-statusupdates.com|value:951
    key:noreply.amateurmatch.com|value:1164
Ad hoc queries for antispam research
  • Identify domains that had few spam votes in the previous time window but have a high number of spam votes today
  • All IPs in the last hour that sent a particular URL pattern…
      • …or that sent any unknown URL >500 times
  • Which domains/IPs suddenly increased their sending volume after a positive reputation change
  • Which FROM addresses exhibit low message size entropy
  • All messages that contained nothing but a URL whose domain had low page rank
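As an illustration of the "low message size entropy" query, here is a minimal Python sketch that computes the Shannon entropy of the message sizes seen per FROM address and flags senders whose sizes barely vary. The function names, the thresholds (`max_bits`, `min_messages`) and the sample addresses are assumptions made for this example; a production query would likely bucket sizes rather than compare exact byte counts.

```python
import math
from collections import Counter, defaultdict


def size_entropy(sizes):
    """Shannon entropy, in bits, of the empirical message-size distribution."""
    counts = Counter(sizes)
    total = len(sizes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_entropy_senders(messages, max_bits=1.0, min_messages=4):
    """Return FROM addresses whose message sizes show little variation.

    messages: iterable of (from_address, size_in_bytes) pairs.
    """
    sizes_by_sender = defaultdict(list)
    for sender, size in messages:
        sizes_by_sender[sender].append(size)
    return sorted(
        sender
        for sender, sizes in sizes_by_sender.items()
        if len(sizes) >= min_messages and size_entropy(sizes) <= max_bits
    )


# Hypothetical sample: one templated sender, one ordinary correspondent.
messages = [("bulk@example.com", 2048)] * 6 + [
    ("friend@example.org", n) for n in (812, 1530, 2977, 640, 4410)
]
flagged = low_entropy_senders(messages)
print(flagged)
```

A sender that always sends the same size has entropy 0 bits, while n equally likely sizes give log2(n) bits, so the threshold separates templated mail from ordinary correspondence.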
Ad hoc queries - anatomy of a Pig query
    -- This includes some basic string functions, including splitting a string on the '@' character
    register /homes/jpujara/pig_scripts/string.jar;
    define splitEmail string.Tokenize('2','@');

    -- Load up some data - incoming messages at a date and time, and our trusted user database
    MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);
    USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);

    -- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners
    EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);
    YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;

    -- Combine the message and sender domains with the trusted user data and select only trusted messages
    YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;
    TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;

    -- Group by domain, and generate a count, order by descending count
    DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;
    DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;
    DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;

    -- Output the results
    STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';
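For readers who do not know Pig, the same pipeline (split addresses, join against trusted users, filter, group by sender domain, count, order) can be sketched over in-memory Python data. This is a rough single-machine analogue, not the authors' code: the function name, the tuple layout and the sample addresses are assumptions, and the dictionary lookup stands in for Pig's JOIN assuming one row per user.

```python
from collections import Counter


def top_trusted_sender_domains(messages, trusted_users):
    """Count sender domains of mail delivered to trusted users.

    messages: iterable of (message_id, to_address, from_address).
    trusted_users: dict mapping user id to a trust flag (> 0 means trusted).
    """
    counts = Counter()
    for _mid, to_addr, from_addr in messages:
        user, _, udomain = to_addr.partition("@")
        _sender, _, sdomain = from_addr.partition("@")
        # Yahoo recipients are keyed by local part, others by full address.
        yuser = user if "yahoo" in udomain else to_addr
        # JOIN + FILTER: keep only messages whose recipient is trusted.
        if trusted_users.get(yuser, 0) > 0:
            # GROUP BY sender domain + COUNT.
            counts[sdomain] += 1
    # ORDER BY count DESC.
    return counts.most_common()


# Hypothetical sample data.
messages = [
    ("m1", "alice@yahoo.com", "news@shop.example"),
    ("m2", "bob@yahoo.com", "news@shop.example"),
    ("m3", "alice@yahoo.com", "x@other.example"),
    ("m4", "carol@yahoo.com", "x@other.example"),
]
trusted = {"alice": 1, "bob": 1, "carol": 0}
ranking = top_trusted_sender_domains(messages, trusted)
print(ranking)
```

Pig compiles each of these steps into map:reduce jobs, which is what lets the same logic run over a full day of mail logs.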
Campaign discovery in spam feedback
  • Frequent itemset mining
      • Classical method
      • Discovers interesting relationships between variables in a large database
      • Primarily applied for market basket analysis
      • Many good implementations
  • Apriori
      • Easy to implement
      • Parallelizes moderately well but bottlenecks for extremely large data sets
      • Not very efficient in the number of scans over the data
  • Eclat
      • Parallelizes easily
      • Amenable to a good grid implementation
      • Fewer scans of the dataset
  • Parallel FP-Growth
      • Designed explicitly for systems like Hadoop
      • Implemented in Mahout 0.2
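To make the Apriori trade-off concrete, here is a compact single-machine sketch in Python. It is a textbook-style illustration rather than the Mahout implementation or the authors' code; the feature strings and IP addresses in the sample are invented. Note the full pass over the transactions at every level, which is the scan cost mentioned above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join frequent (k-1)-itemsets into k-item candidates...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...and prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # One full scan of the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical spam reports, each reduced to a set of feature:value items.
reports = [
    {"FROMUSER:sales", "IP:192.0.2.1", "URL:a.example"},
    {"FROMUSER:sales", "IP:192.0.2.1", "URL:b.example"},
    {"FROMUSER:sales", "IP:192.0.2.1", "URL:c.example"},
    {"FROMUSER:health", "IP:198.51.100.7", "URL:d.example"},
]
itemsets = apriori(reports, min_support=3)
for items, support in sorted(itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(support, sorted(items))
```

In this sample the rotating URLs never reach the support threshold, but the sender name and IP that stay constant across the campaign do, both alone and as a pair.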
Frequent itemset – example dataset
[Figure: example dataset]
Frequent itemset mining
[Figure. Slide courtesy: dortmund.de]
Frequent itemset mining on one day’s spam reports
    9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
    9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
    9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
    9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
    9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
    9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
    9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
    9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
    9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
    9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
    9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
    9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)
Highlighted on the slide:
    2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
    2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
Gaming detection in notspam feedback
  • Spammers instrument accounts to vote “not spam” on emails that they send
      • Delays classification of spamming IP addresses
      • Throws off the classifiers if the feedback is not filtered well
  • Model the problem as a bipartite graph
      • Well-known model for matching algorithms
      • Broadly applied in various fields like coding theory
      • A graph whose vertices form two disjoint sets U and V; every edge connects a vertex in U to a vertex in V
Connected components - explained
  • Y1 = Yahoo user 1, Y2 = Yahoo user 2
  • IP1 = IP address of the host Y1 “voted” notspam from
[Diagram: Yahoo users linked to the IPs (IP1, IP2) they voted from; SQUARING the graph yields an edge with weight = 2.]
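One plausible reading of the "squaring" step is the standard projection of a bipartite graph: two users become directly connected when they share an IP, and the edge weight counts how many IPs they share (this is what the square of the adjacency matrix gives for a pair of users). The Python sketch below follows that reading. It is an assumption about what the slide depicts, and the function name and sample votes are made up.

```python
from collections import defaultdict
from itertools import combinations


def square_bipartite(votes):
    """Project (yahoo_id, ip) votes onto user pairs.

    Returns {(id_a, id_b): number of IPs both users voted from}.
    """
    voters_by_ip = defaultdict(set)
    for yahoo_id, ip in votes:
        voters_by_ip[ip].add(yahoo_id)
    weights = defaultdict(int)
    for voters in voters_by_ip.values():
        # Every pair of users seen on the same IP gains one unit of weight.
        for a, b in combinations(sorted(voters), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical votes: y1 and y2 both voted from IP1 and from IP2.
votes = [("y1", "IP1"), ("y2", "IP1"), ("y1", "IP2"), ("y2", "IP2"), ("y3", "IP3")]
weights = square_bipartite(votes)
print(weights)
```

With these sample votes the only resulting edge is y1 to y2 with weight 2, matching the "weight = 2" label on the slide.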
Connected components for “gaming” detection
  • Set of IPs/YIDs used exclusively for voting notspam
  • Set of (likely new) spamming IPs which are “worth” voting for
[Diagram: three labeled sets: the set of “voted from” IPs, the set of Yahoo IDs voting notspam (y1, y2, y3), and the set of “voted on” IPs; IP nodes IP1 to IP4.]
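Connected components over the vote graph can be sketched with a small union-find structure, shown here in single-machine Python. This is an illustration only, not the grid implementation used in the talk; the `from:` and `on:` prefixes that keep "voted from" and "voted on" IPs apart, and the sample edges, are assumptions for the example.

```python
def connected_components(edges):
    """Group nodes into connected components with union-find.

    edges: iterable of (node_a, node_b) pairs.
    Returns a list of sorted node lists, largest component first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


# Hypothetical notspam votes: (Yahoo ID, IP voted from) and (Yahoo ID, IP voted on).
edges = [
    ("y1", "from:IP1"), ("y2", "from:IP1"), ("y2", "on:IP3"),
    ("y3", "from:IP2"), ("y3", "on:IP4"),
]
components = connected_components(edges)
for component in components:
    print(component)
```

Each component ties together the accounts, the hosts they vote from and the sending IPs they vote for, so one confirmed bad member casts suspicion on the rest of its component.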
Connected components - results
[Figure: connected components for IPs notspam was voted from]
Connected components - results
[Figure: connected components for IPs notspam was voted on]
Conclusions
  • We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms
  • Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback
  • Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback
  • Grid-system-based analysis platforms may be broadly applicable across the security domain
Apply Slide
  • Download Hadoop distribution: http://hadoop.apache.org
  • Try out Pig on standalone, single Linux box
  • Identify source data to aggregate
  • Start simple: IP patterns across web access logs
  • Begin with offline aggregation; yesterday’s attacks are still interesting
  • Read Connected Components and Frequent Itemset Mining papers
  • Stop looking for a single, invariant “tell” – far too costly
  • Start thinking about co-occurrence of innocuous features
Resources for implementers
  • Hadoop setup, documentation and resources: http://hadoop.apache.org/
  • Pig documentation and resources: http://hadoop.apache.org/pig/
  • Mahout documentation and resources: http://lucene.apache.org/mahout/
  • Frequent itemset mining implementation repository: http://fimi.cs.helsinki.fi/src/
  • Connected components description: [link not yet live]
  • Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007.

Yahoo! Mail antispam - Bay area Hadoop user group


Editor's Notes

  • #3 Who knows what Yokai are? <audience poll> Shape-shifters from Japanese mythology. Many other examples, e.g. Proteus, who would tell you the future, but first you had to capture him. Just like the gods, spammers change shape to avoid capture: vary over IP, vary over content, vary over template features (e.g. document structure, subjects, size entropy).
  • #5 In abuse, these are “shape shifters.” They vary many aspects of the message to avoid detection: IP, subject, content. For example, these four messages are obviously built from a single template, but it changes its shape to avoid capture. How to catch? In the past: heuristics & regex; dictionary (URLdb); invariant metadata. Challenges: slow to write; difficult to write; easy to evade.
  • #8 Here is a third type of shape-shifting spam. For all of these, attackers have a distinct advantage, because they can change most aspects and still get through.
  • #9 1.3 sextillion (1.3e21) variations, almost all of which can be recognized by a human being in milliseconds. Spammers learned they can change any variable to hide from bulk filters. http://cockeyed.com/lessons/viagra/viagra.html
  • #10 These bastards… the most despised doctors on the InternetAlmost all pages resolve through numerous HTML/Javascript redirectors to this page
  • #11 Daniel Geer said there are targets of CHANCE and targets of CHOICE. Small businesses are in the former camp, catching the miscellaneous attacks out there. Increasingly, larger companies are TARGETS OF CHOICE, meaning the bad guys a) specifically tailor their attacks based on known vulnerabilities, and b) use feedback loops to improve their effectiveness.
  • #12 This is what a targeted attack profile looks like: after you patch, they almost stop trying.
  • #13 One example of such a clearly targeted attack: 400KB of gibberish embedded in a style sheet, which completely throws off our parsers. Maybe ASCII art spam, or something else that couldn’t be caught by simple pattern matching. This is what our filters see: a stream of ASCII that deliberately uses multiple layers, e.g. here, a TinyURL redirector, further obfuscated with non-printing HTML, spaces, and CSS chaff. To fight this in the olden days, we used a hand-written regex to identify a pattern OR a heuristic on some invariant part of the message. But what is invariant? Dozens of TinyURL clones; dozens of HTML and CSS tricks; 2^32 IP addresses; infinite FROM addresses; infinite SUBJECT lines…
  • #14 Sent by botnets. This is Reactor Mailer; it controlled Srizbi from the McColo datacenters until Nov 2008. This is the template for Stormbot; notice it has control variables for all the settings. While most of these came in through SMTP port 25, now they are increasingly hitting HTTP and port 80.
  • #15 Historically, POINT SOLUTIONS address each problem individually: regex, heuristic. Wouldn’t this be better if this guy could use more than one finger at a time? Something is *almost over the limit* along one dimension and *almost over the limit* along another. Example: a message from an IP that sends 80% good mail, with a TinyURL that we don’t recognize, that was addressed to 40 people. *PRIOR PROBABILITY* *COMPOSITE SCORE*
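The composite-score idea in note #15 can be illustrated with a minimal sketch. It assumes a naive-Bayes-style combination in log-odds space, which is one common way to merge weak signals and is not necessarily what the speakers used. The threshold, the prior, and the per-signal probabilities are all hypothetical numbers chosen for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def composite_score(prior, feature_probs):
    """Combine per-feature spam probabilities, treating features as independent.

    Each entry of feature_probs is P(spam | that feature alone), estimated
    under the same prior. Evidence is added in log-odds space.
    """
    prior_logit = logit(prior)
    total = prior_logit + sum(logit(p) - prior_logit for p in feature_probs)
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical numbers: a blocking threshold and two signals just under it.
THRESHOLD = 0.9
signals = [0.85, 0.85]
score = composite_score(prior=0.5, feature_probs=signals)
blocked_individually = any(p >= THRESHOLD for p in signals)
blocked_by_composite = score >= THRESHOLD
```

Neither signal crosses the threshold alone, but the combined score does, which is the "more than one finger at a time" point: a point solution sees two near misses, a composite score sees one hit.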
  • #16 Scale forces simplistic architectures; feedback-based architectures always lag behind the spam campaign. Feedback also has many segments: personal preference spam (“I didn’t like this week’s Amazon gold box deals but I liked last week’s messages from Amazon”); annoyance emails from legitimate bulk mailers (“This coupon is coming far too often these days”); listserver spam (“This finance group sometimes sends me stock spam”); newsletter messages that are no longer interesting to the user (“Gosh I am so not into that band any more”). Traffic to small enterprise domains can be restricted with firewall rules etc., but large free mail provider traffic is full of corner cases. Compounding the problem is the fact that adoption of DKIM and SPF has been slow, especially internationally and in emerging economies. But make no mistake, some of these spammers are very clever. It’s more fruitful to target Yahoo or Google than to build a generic spam engine.
  • #17 Let’s look at what is in place right now in terms of architecture. Most large-scale systems have some components from gen1 technologies. Provide attack mitigation and operational flexibility; highly explainable. Not durable; expensive to keep pace with fast-morphing spam. Proprietary implementations; not very scalable; steep learning curve. Reactive and usually late.
  • #18 Two ways this has been solved in the past: Machine management… Both systems, because of scale, were limited to looking at small pieces of data: an IP, a URL, etc.
  • #19 In this talk we’ll introduce Hadoop, an open-source grid computing environment with applications to fighting abuse. We’ll talk about how Hadoop can be applied to polymorphic spam and abuse. About three years ago, Doug Cutting released version 0.15 of Hadoop, an open-source platform inspired by Google’s proprietary MapReduce. “Supercomputer”: petabytes of storage and terabytes of RAM allow “needle in the haystack” searches even at Y!Mail scale: hundreds of features; hundreds of billions of records; trends buried in global data.
  • #20 Hadoop is the most prevalent. “Ngrid” and “Sun’s GridEngine” are other alternatives.
  • #22 Input data format is application-specific and is specified by the user. Output is a set of <key,value> pairs. The user expresses the algorithm using two functions. Map is applied on the input data and produces a list of intermediate <key,value> pairs. Reduce is applied to all intermediate pairs with the same key; it typically performs some kind of merging operation and produces zero or more output pairs. Finally, output pairs are sorted by their key value.
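The map, group-by-key, reduce flow described in note #22 can be imitated in a single process. The sketch below is not Hadoop code: `run_mapreduce`, `map_ip`, `reduce_count`, and the sample records are hypothetical names and data, used only to show the shape of the two user-supplied functions.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process imitation of the map -> group-by-key -> reduce flow."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more intermediate (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    # Reduce: called once per key, in sorted key order, with all values for that key.
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Hypothetical log records: (sender_ip, subject)
def map_ip(record):
    ip, _subject = record
    yield ip, 1

def reduce_count(key, values):
    yield key, sum(values)

records = [("10.0.0.1", "hi"), ("10.0.0.2", "offer"), ("10.0.0.1", "offer")]
result = run_mapreduce(records, map_ip, reduce_count)
```

On a real grid the grouping step is the distributed shuffle and each reduce call may run on a different machine; only the two small functions change from one job to the next.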
  • #23 Toy example. Provides some insight into what a MapReduce program looks like; it looks very much like a Unix command line.
  • #24 Java code to highlight the mapper. The mapper simply adds each word to a set and emits a count of 1 each time the word is seen.
  • #25 The reducer simply sums the values for each word; draw attention to line 32. While this is a toy example, it should give a fair idea of how to structure a problem so that it is solvable by MapReduce. The key takeaway is that writing even native MapReduce programs can be quite simple, and executing them even simpler.
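The slides show the word count in Java. As a rough companion, here is the same toy written as line-oriented Python functions, in the tab-separated style used by Hadoop Streaming. This is a sketch, not the code from the slides; the function names are hypothetical, and `sorted()` stands in for the shuffle/sort step that the framework performs between map and reduce.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' once for every word occurrence."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

def word_count(lines):
    # sorted() stands in for the shuffle/sort step between map and reduce.
    return dict(reducer(sorted(mapper(lines))))

counts = word_count(["spam spam ham", "ham eggs spam"])
```

The pipeline reads like a Unix command line: input, then mapper, then sort, then reducer.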
  • #26 Take the audience progressively through more and more sophisticated applications, starting from basic reporting and ending in outbound spammer analysis based on SWARM features
  • #27 Knowing the accuracy of your SVM/Bayes classifier puts you in no better position to ask and answer what type of spam is leaking, and we know spammers are constantly probing. 80% of the spam/content classification problem is in smart feature engineering.
  • #28 Let’s look at what our (Yahoo’s) platform looks like. Perl programs for feature engineering make it very easy and flexible. Hadoop with its Pig support is already well suited as a platform for ad hoc data analysis. For deep data mining, there is open-source Mahout.
  • #29 We will look at Hadoop in four different settings.
  • #30 * In antispam, these basic reports combined with human review form a barrier against highly directed attacks that exploit system weaknesses. * Note how easy it is to slice and dice your data and write fairly sophisticated reports using Pig/streaming. It is critical in antispam systems that the reporting platform be flexible and provide a lot of expressive power; Hadoop and Pig achieve that.
  • #31 Previously, such queries were run against small samples; now we can run them against the full data set and get highly accurate results in a very short amount of time. Alternate architectures such as OLAP are too expensive at this scale.
  • #32 * Pig is a data flow specification language. It is like SQL, but it is better suited for data flow control. * In antispam, these basic reports combined with human review form a barrier against highly directed attacks that exploit system weaknesses. * Note how easy it is to slice and dice your data and write fairly sophisticated reports using Pig. It is critical in antispam systems that the reporting platform be flexible and provide a lot of expressive power; Hadoop and Pig achieve that.
  • #33 -- People who bought eggs also bought bread
  • #36 * We ran frequent itemset mining on one day’s spam votes; the results are striking. * Notice in the above example how the same campaign [the same FROMUSER] is being managed with different templates for subjects and URLs and is also originating from different IPs. * Other records in the background are also results of the frequent itemset mining algorithm and map very closely to spam campaigns.
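To make the "eggs and bread" idea concrete for spam votes, here is a minimal sketch that treats each vote as a set of feature items and counts small itemsets by brute force. It is not the algorithm from the talk or from the FIMI repository (Apriori or FP-growth would prune the search far better); the feature names, addresses, and `min_support` value are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset of up to max_size items; keep those seen often enough."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # canonical order so equal itemsets match
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical spam votes: one sender, rotating subjects, URLs, and IPs.
votes = [
    {"from:dealz", "subj:template_a", "url:short.example", "ip:192.0.2.1"},
    {"from:dealz", "subj:template_b", "url:short.example", "ip:192.0.2.7"},
    {"from:dealz", "subj:template_a", "url:tiny.example", "ip:192.0.2.9"},
    {"from:alice", "subj:lunch", "ip:198.51.100.4"},
]
campaigns = frequent_itemsets(votes, min_support=2)
```

No single IP or subject repeats often enough to stand out, yet the sender paired with a subject template or a URL host does. In MapReduce terms, the map step would emit each candidate itemset with a count of 1 and the reduce step would sum the counts and apply the support threshold.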
  • #38 Develop a bipartite graph of users and the IPs they vote from. Squaring the graph gives rise to connected components. The weight of a connected component is measured by the number of vertices that share the component.
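A small single-machine sketch of the idea in note #38: build edges between Yahoo IDs and the IPs they vote from, then group them into connected components and rank the components by vertex count. It uses union-find rather than the graph-squaring formulation mentioned in the note, so treat it as an equivalent-in-result illustration and not as the authors' grid implementation. The node names and edges are hypothetical.

```python
from collections import defaultdict

def connected_components(edges):
    """Group the vertices of an undirected graph into components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    # Heaviest component first; weight = number of vertices in the component.
    return sorted(components.values(), key=len, reverse=True)

# Hypothetical (Yahoo ID, IP voted from) edges.
edges = [
    ("yid:y1", "ip:IP1"),
    ("yid:y2", "ip:IP1"),
    ("yid:y2", "ip:IP2"),
    ("yid:y3", "ip:IP3"),
]
components = connected_components(edges)
weights = [len(c) for c in components]
```

Here y1 and y2 never share every IP, but they land in the same component through IP1, which is how a weak link between accounts turns into a visible cluster.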
  • #39 Gaming IPs are IPs that the spammers try to whitelist in advance. We detected them by extending the connected component view to the IPs that notspam is voted on.
  • #40 The results are quite spectacular! There is a massive amount of “gaming” going on with “notspam feedback”, and only a handful of IPs are doing it. There are a large number of smaller components not shown in the results above.
  • #41 The results are less strong; notice the two smaller, weaker clusters in rows 3 and 4. The big takeaway is that such unsupervised matching algorithms are extremely powerful amplifiers of signals and can be used to rapidly separate noise from signal. Imagine this being applied to traffic with more items such as IPs, message subjects, message sizes, fuzzy signatures, etc.
  • #44 We encourage and invite others to try Hadoop in antispam and anti-abuse architectures and share their experiences with us.
  • #46 Three users known bad: same IP leads to new cookie; same cookie leads to new birthday; etc. *AMPLIFICATION OF SMALL SIGNAL*