Yahoo! Mail antispam - Bay area Hadoop user group

Yokai Versus the Elephant
Hadoop and the Fight Against Shape-Shifting Spam
Vishwanath Ramarao & Mark Risher
Yahoo! Mail

Agenda
- Shape-shifting spam
- Antispam origins
- Hadoop algorithms
- Applications to security
- Resources for implementers

http://f915fde2cf53df18.lighttopbody.com]*!}v}]along especially consecutive important dmvfu

1,300,925,111,156,286,160,896 (http://bit.ly/cpOyLi)

Typical attack/response profile
[Chart annotation: Rule change (1/23 @ 01:15)]

MORE YOKAI - TARGETED ATTACKS<style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobuttontelefoons Jermaine iesaporitoroshan 3026 janatatrennungpalillos toughest ncapitolecalzado 20200 Omnimedia collective saudadedizaines 205px hardener elongating InvasionofyourprivacyPersonnalftsbedingungenMontanerprozacSerpellfcardbvh capacitate 12502 courtship kiranjiutroligt transducer tyee Delhaize clueless toffee nnioZoapochino sterns 622 Verordnung carbons waterresistant assessing footerTextperrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justiaguardo jibes Chubb inflammatory iteration granfaldasseoir considerations 692px treasured Allotransplantationtwoyearsappx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsidaliefdeVoeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrinasareiniqueslugoquotedblbayr 3500 CI addressee optativelygazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rnsfuseactionrepristaires restraint manchettestrendlineseffectuedespatchMinskyestadual doses danbrown Muenster jind7n7 smashes gourmandesashantisentants rows kyk coated Incontournablescoincidenjspa stalker CDS contienen expletives s8 eof replenishing puyalluppratosondravalidarorientale sonnets steamer Niwangoacrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecniciconciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella PlissierHellmich Randall CaradonnaspringaregistradahauptEntran 3060 Rochin capacitor sotol 3413 smirk interditeServicePoint capabilities bouncefeeLinkov 3Dg auntie OSP CaeciliaPlatzierung wrangler pisosbanlieueDaniellaenderleisraelprofessionnellessusto 39800 Espanaplena radian 
antic!...........................200KB………. </style>
<center>
<a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><img src="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br>
<a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><img src="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br>
<a href="http://ivywhere.info/gp.html"><img src="http://ivywhere.info/images/please2.jpg" border="0"></a><br>

[400kb…]
<center>
<a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><img src="http://corfair.info/images/ivblg1.jpg" border="0"></a><br>
<a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><img src="http://corfair.info/images/ivblg2.jpg" border="0"></a><br>
<a href="http://corfair.info/gp.html"><img src="http://corfair.info/images/please2.jpg" border="0"></a><br>

Why is the antispam problem hard?
- Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes
- User feedback is often late, noisy, and not always actionable
- Large, diverse stream of legitimate traffic that looks like spam
- Slow adoption of authentication technologies like DKIM and SPF
- Spammers are clever; they target and specialize attacks
- Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign
- A significant percentage of spam comes from large ESPs like Hotmail, Google, and Yahoo

Generation 1: Manual management layer
- Heuristics, blocks, blacklists
  - Provide attack mitigation and operational flexibility; highly explainable
  - Not durable; expensive to keep pace with fast-morphing spam
- Ad hoc queries
  - Proprietary implementations, not very scalable, steep learning curve
  - Reactive and usually late

Generation 2: Machine management layer
- Online reputation models
  - Simple, mostly scoring/counter/ratio-based models
  - Highly scalable due to the absence of any state/memory
  - Generalize too broadly, lack expressive power
- Batch-trained reputation models
  - Typically digested memory-based hashing or machine learning models
  - Difficult to implement and, because they need labeled examples, scale only moderately well
  - Slow to update and learn, lack explainability, limited operational control
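
To make the "scoring/counter/ratio" style of online reputation model concrete, here is a minimal sketch. The class name `RatioReputation`, the smoothing priors, and the sample IP address are assumptions made for illustration; the slides do not describe the production model at this level of detail.

```python
from collections import defaultdict


class RatioReputation:
    """Toy counter/ratio reputation: spam votes over total messages per sender IP."""

    def __init__(self, prior_spam=1.0, prior_total=10.0):
        # Hypothetical smoothing priors: an unseen sender starts at a 10% spam ratio.
        self.prior_spam = prior_spam
        self.prior_total = prior_total
        self.spam = defaultdict(int)
        self.total = defaultdict(int)

    def observe(self, ip, is_spam):
        """Update the two counters for one delivered message."""
        self.total[ip] += 1
        if is_spam:
            self.spam[ip] += 1

    def score(self, ip):
        """Smoothed spam ratio in (0, 1]; higher means worse reputation."""
        return (self.spam[ip] + self.prior_spam) / (self.total[ip] + self.prior_total)


rep = RatioReputation()
for _ in range(10):
    rep.observe("192.0.2.1", is_spam=True)
print(rep.score("192.0.2.1"))
```

A model like this is cheap to update per message, but the single ratio says nothing about which campaign a sender belongs to, which is the "lack expressive power" limitation listed above.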

Distributed computing paradigm
Map:Reduce + distributed storage:
- Simplicity of online, stateless models
- Expressiveness of offline analysis
- Ease of management

The map:reduce paradigm
- Input data format is application-specific, specified by the user
- Output is a set of <key,value> pairs
- User expresses algorithm using two functions
  - Map is applied on the input data and produces a list of intermediate <key,value> pairs
  - Reduce is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs
- Finally, output pairs are sorted by their key value
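
The map, group-by-key, and reduce steps described above can be sketched in a few lines of single-process Python. This is only an in-memory illustration of the paradigm, not Hadoop code; the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the two input lines are a toy word-count dataset.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate <word, 1> pair for every token in the line."""
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: merge all values that share a key into one output pair."""
    return [(word, sum(counts))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):  # output pairs are sorted by key
        output.extend(reduce_fn(k, groups[k]))
    return output


records = [(0, "Hello World Bye World"), (1, "Hello Hadoop Goodbye Hadoop")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step is the distributed shuffle, and many mappers and reducers run in parallel on different machines; the user still only writes the two functions.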

The map:reduce paradigm
[Diagram: mappers emit <k1,v1>, <k2,v2>, and <k1,v3>; the pairs are grouped by key into <k1,{v1,v3}> and <k2,v2>; a reducer produces <k1,W1>]

A simple map:reduce example
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop
    # Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)
    $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
    $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2

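A simple map:reduce example (bit.ly/bdyi0l)
    // Requires java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { // nested in the WordCount driver class
        private final static IntWritable one = new IntWritable(1); // constant count emitted per token
        private Text word = new Text(); // reusable output key
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one); // emit <word, 1>
            }
        }
    }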

A simple map:reduce example (bit.ly/bdyi0l)
    // Requires java.util.Iterator in addition to the imports used by the mapper
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get(); // add up the counts emitted for this word
            }
            output.collect(key, new IntWritable(sum)); // emit <word, total>
        }
    }

Let's review our design goals again
- Classifiers are notorious for lack of explainability
- Engineers and analysts need to know what the classifier is missing
- Engineers and analysts need to know about emerging threats
- Analysts need “canned” reports along interesting dimensions
- Machines need smart feature engineering
- Develop a scalable system to provide deep insight into spammer campaigns
- Double up as a platform for standard reporting
- Also double up as a platform for ad hoc analysis and data probing
- Signal amplification and smart feature extraction platform

Our antispam analytic platform
- Hadoop: implements map reduce; written in Java but supports many other languages, including Perl and C++, using the streaming interface
- Feature engineering with small, simple Perl programs for data extraction and transformation
- SQL-like “Pig” programming language for data analysis and management
- Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms
- Other proprietary algorithms and frameworks for specialized tasks

Various aspects of a grid-driven solution
- Standard reporting
- Ad hoc querying
- Campaign discovery from spam feedback using frequent item set mining
- “Gaming” detection in notspam feedback using connected components

Ad hoc queries for antispam research
- Identify domains that had few spam votes in the previous time window but have a high number of spam votes today
- All IPs in the last hour that sent a particular URL pattern
  - …or that sent any unknown URL >500 times
- Which domains/IPs suddenly increased their sending volume after a positive reputation change
- Which FROM addresses exhibit low message size entropy
- All messages that had nothing but a URL and the domain of the URL had low page rank
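
As an illustration of the first query in the list, the logic can be sketched in plain Python. On the grid this would more likely be a Pig GROUP and JOIN over the vote logs; the function name `emerging_domains`, the thresholds `low` and `high`, and the sample domains are all assumptions made for this sketch.

```python
from collections import Counter


def emerging_domains(prev_votes, today_votes, low=5, high=100):
    """Domains with at most `low` spam votes in the previous window
    and at least `high` spam votes today.

    Each argument is an iterable with one sender domain per spam vote.
    """
    prev = Counter(prev_votes)
    today = Counter(today_votes)
    return sorted(d for d, n in today.items() if n >= high and prev[d] <= low)


# Toy vote streams (made up for illustration).
prev_votes = ["a.example"] * 2 + ["b.example"] * 200
today_votes = ["a.example"] * 150 + ["b.example"] * 210 + ["c.example"] * 3
print(emerging_domains(prev_votes, today_votes))
```

Here `b.example` is excluded because it was already heavily reported in the previous window, and `c.example` because its volume today is too low to matter.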

Ad hoc queries - anatomy of a Pig query
    --- This includes some basic string functions, including splitting a string on the '@' character
    register /homes/jpujara/pig_scripts/string.jar;
    define splitEmail string.Tokenize('2','@');

    --- Load up some data - incoming messages at a date and time, and our trusted user database
    MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);
    USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);

    --- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners
    EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);
    YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;

    --- Combine the message and sender domains with the trusted user data and select only trusted messages
    YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;
    TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;

    --- Group by domain, and generate a count, order by descending count
    DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;
    DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;
    DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;

    --- Output the results
    STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';

Campaign discovery in spam feedback
- Frequent Itemset Mining
  - Classical method
  - Discovers interesting relationships between variables in a large database
  - Primarily applied for market basket analysis
  - Many good implementations
- Apriori
  - Easy to implement
  - Parallelizes moderately well but bottlenecks for extremely large data sets
  - Not very efficient in the number of dataset scans it needs
- Eclat
  - Parallelizes easily
  - Amenable to a good grid implementation
  - Fewer scans of the dataset
- Parallel FP-Growth
  - Designed explicitly for systems like Hadoop
  - Implemented in Mahout 0.2
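
A minimal single-machine sketch of Apriori shows why it is easy to implement and why it needs one scan of the dataset per itemset size. This is a textbook-style version, not the grid implementation used in the talk; the function name `apriori` and the toy reports are assumptions, with feature strings styled after spam-report fields rather than taken from real data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: candidate itemsets are the individual items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One scan of the dataset per level: count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Toy spam reports: each report is a set of extracted features (made up).
reports = [
    {"FROMUSER:sales", "ip_D:203.0.113.5", "FROMDOM:one.example"},
    {"FROMUSER:sales", "ip_D:203.0.113.5", "FROMDOM:two.example"},
    {"FROMUSER:sales", "ip_D:203.0.113.9", "FROMDOM:three.example"},
    {"FROMUSER:health", "ip_D:198.51.100.4", "FROMDOM:four.example"},
]
frequent = apriori(reports, min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(support, sorted(itemset))
```

The pair of `FROMUSER:sales` and `ip_D:203.0.113.5` surfaces even though every report uses a different sender domain, which is the kind of co-occurrence that identifies a campaign.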

Frequent item set – example dataset

Frequent itemset mining
Slide courtesy: dortmund.de

Frequent itemset mining on one day’s spam reports
9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)

2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)

Gaming detection in notspam feedback
- Spammers instrument accounts to vote “not spam” on emails that they send
  - Delays classification of spamming IP addresses
  - Throws off the classifiers if the feedback is not filtered well
- Model the problem as a bipartite graph
  - Well-known model for matching algorithms
  - Broadly applied in various fields like coding theory
  - A graph whose vertices form two disjoint sets U and V
  - Every edge connects a vertex in U to a vertex in V

Connected components - explained
- Y1 = Yahoo user 1, Y2 = Yahoo user 2
- IP1 = IP address of the host Y1 “voted” notspam from
[Diagram: graph over nodes y1, IP1, and IP2, with a “squaring” step that yields an edge of weight = 2]

Connected components for “gaming” detection
- Set of IPs/YIDs used exclusively for voting notspam
- Set of (likely new) spamming IPs which are “worth” voting for
[Diagram: graph over nodes y1, y2, y3, IP1, IP2, IP3, IP4, arranged in three groups: the set of “voted from” IPs, the set of Yahoo IDs voting notspam, and the set of “voted on” IPs]
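
A single-machine sketch of connected components over the vote graph, using union-find, may help here. It is an assumption-laden illustration: a Hadoop version would more likely use iterative label propagation across map:reduce rounds, and the function name `connected_components` and the toy votes are made up for this example.

```python
def connected_components(edges):
    """Group the nodes of an undirected graph into components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    # Largest components first; members sorted for stable output.
    return sorted((sorted(c) for c in components.values()), key=len, reverse=True)


# Toy notspam votes as (Yahoo ID, IP the vote came from) pairs.
votes = [("y1", "IP1"), ("y2", "IP1"), ("y2", "IP2"), ("y3", "IP2"), ("y9", "IP7")]
components = connected_components(votes)
print(components)
```

Accounts y1, y2, and y3 end up in one component because they share voting IPs, even though y1 and y3 never used the same address. A large component of accounts and IPs that only ever vote notspam is the weak signal being amplified.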

Connected components - results
- Connected components for IPs notspam was voted from

Connected components - results
- Connected components for IPs notspam was voted on

Conclusions
- We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms
- Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback
- Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback
- Grid-system-based analysis platforms may be broadly applicable across the security domain

Apply slide
- Download Hadoop distribution: http://hadoop.apache.org
- Try out Pig on standalone, single Linux box
- Identify source data to aggregate
- Start simple: IP patterns across web access logs
- Begin with offline aggregation; yesterday’s attacks still interesting
- Read Connected Components and Frequent Itemset Mining papers
- Stop looking for a single, invariant “tell” – far too costly
- Start thinking about co-occurrence of innocuous features

Resources for implementers
- Hadoop setup, documentation and resources: http://hadoop.apache.org/
- Pig documentation and resources: http://hadoop.apache.org/pig/
- Mahout documentation and resources: http://lucene.apache.org/mahout/
- Frequent itemset mining implementation repository: http://fimi.cs.helsinki.fi/src/
- Connected components description: [link not yet live]
- Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007
