Group Details:-
Dhara Shah z3299353
Imad Hashmi z3193866
Zuo Cui z3261136

Our Paper:-
Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov, Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.

Flow of the Literature Review is as follows:-
Introduction
Background and Previous Work
Focus on Technology used in Paper
Future and Related Work

Introduction

Email has become a widespread means of communication around the world, and millions of email messages are transferred every minute, so it is understandable that illegitimate use of the email service has also long been in practice. One of the many abuses of this service is spamming, which advertisers around the world use to send advertisements for their products to legitimate email users. The following discussion covers the methods used by anti-spam systems to detect spam emails and botnets.

Background and Previous Work

There has been a great deal of research on the identification and filtering of email spam. Based on the part of the email used for spam detection, this work can be broadly classified into two main categories: non-content-based and content-based.

Non-content-based filters are also known as address-based filters. They examine information in the email header, such as the IP address or email address. Blacklists and whitelists are the common techniques in this category. A blacklist records the IP addresses or email addresses that send spam. Conversely, a whitelist contains all acceptable email addresses. Both can be deployed on client computers or email servers. Cook et al. (2006) experimented with a domain-specific blacklist that worked on the mail server to reduce the amount of spam entering the network. However, a blacklist can easily cause false positives. If a sender sends spam, its IP address or email address is recorded in the blacklist. Consequently, other legitimate mail from that address is also marked as spam.
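The blacklist and whitelist behaviour described above, including the false positive problem, can be illustrated with a minimal sketch. The class name, method names and addresses below are hypothetical and are not taken from Cook et al. (2006) or any real filter.

```python
from dataclasses import dataclass, field


@dataclass
class AddressFilter:
    """Toy address-based filter: the whitelist is checked first, then the blacklist."""
    blacklist: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)

    def classify(self, sender_ip: str, sender_addr: str) -> str:
        if sender_addr in self.whitelist:
            return "accept"
        if sender_ip in self.blacklist or sender_addr in self.blacklist:
            return "spam"
        return "unknown"

    def report_spam(self, sender_ip: str, sender_addr: str) -> None:
        # One spam report is enough to blacklist both identifiers.
        self.blacklist.update({sender_ip, sender_addr})


flt = AddressFilter(whitelist={"boss@example.org"})
flt.report_spam("203.0.113.7", "alice@example.com")

# A later, legitimate message from the same sender is still rejected:
# this is the false positive discussed in the text.
later_legitimate = flt.classify("203.0.113.7", "alice@example.com")
whitelisted = flt.classify("198.51.100.1", "boss@example.org")
never_seen = flt.classify("198.51.100.2", "carol@example.net")
```

The filter never looks at the message body, so it cannot tell a sender's spam from the same sender's legitimate mail.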
Content-based detection filters spam by analyzing the message content of received email, which overcomes the drawbacks of non-content-based filters. These filters scan the content for sensitive keywords to identify spam. This type of filter includes heuristic filters and Bayesian filters.

Heuristic detection, also known as rule-based analysis, uses regular expression rules to detect phrases or characters that are common in spam mails. Rules can be defined over email header information, keywords or URLs in the content. William Cohen (1996) successfully used learned rules to classify emails into different folders, but there is little related research on rule-based spam detection.

However, the precision of spam detection relies on rules that are set by mail system managers, so defining them takes a significantly long time. After that, the rules must be refined frequently. If the pre-set rules on the mail system are not updated, the filters will not work efficiently on new spam with new features. Besides, the rules are rigid and can easily cause false positives.

In addition, because content-based spam detection can be treated as a text classification problem, several machine learning approaches have been applied to it. Bayesian classification is one of those proposed. In 1998, Bayesian classification techniques were applied to the problem of spam filtering (Sahami et al., 1998). The filter records the occurrence of certain words or phrases in the message content and then evaluates the probability that a message is spam by analyzing these statistics. As a result, the Bayesian filters eliminated more than 95% of spam in the experiments and identified 80% of incoming junk mail in a real scenario. Bayesian filters can evidently achieve a high rate of correct detection on plain-text content.

Bayesian filtering is now widely used together with other methods in many spam detection technologies to improve accuracy.
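The word-statistics approach described above can be sketched as a small naive Bayes classifier. This is a minimal illustration with Laplace smoothing and a four-message training set invented for the example. It is not the feature set or implementation of Sahami et al. (1998).

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Returns per-class word counts and message counts."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Estimate P(spam | text) from word occurrences, assuming words are independent."""
    vocab = set(counts[True]) | set(counts[False])
    spam_words = sum(counts[True].values())
    ham_words = sum(counts[False].values())
    # Start from the prior odds, then add the evidence of each known word.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # unseen words carry no evidence in this sketch
        p_word_spam = (counts[True][word] + 1) / (spam_words + len(vocab))
        p_word_ham = (counts[False][word] + 1) / (ham_words + len(vocab))
        log_odds += math.log(p_word_spam / p_word_ham)
    return 1 / (1 + math.exp(-log_odds))


# Invented training data, for illustration only.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
counts, totals = train(training)
p_spammy = spam_probability("buy cheap pills", counts, totals)
p_normal = spam_probability("monday meeting", counts, totals)
```

The output depends entirely on the word counts learned from the training messages, so the quality of the training data determines the accuracy.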
However, there are some issues with Bayesian filters. First, as with other machine learning approaches, the accuracy of Bayesian filters depends on the quality of the training data and the training process. Second, although Bayesian filters can provide high precision for plain-text content, they have difficulty detecting the growing volume of spam that contains images. Further research was therefore carried out at Okayama University to detect image spam (Uemura et al. 2008). It designed a method that allows an existing Bayesian filter to learn image information, such as the file size or name, and then evaluate the probability based on the learning results. Experiments with this method showed that the false negative rate dropped while the false positive rate stayed almost the same. This means the method can only play a supporting role in identifying spam with Bayesian filters, because images provide less information for distinguishing spam from legitimate mail.
Content-based detection has many advantages, but it costs time and a great deal of processing resources because it goes through the complete email. There is a need for an anti-spam system that combines the advantages of content-based and non-content-based spam detection.

AutoRE, software designed by the Microsoft Research group and the subject of our anchor paper, tries to combine both types of detection system, i.e. content-based and non-content-based. We now discuss in detail how AutoRE combines the two.

Focus on Technology used in Paper

Unlike previous solutions for detecting botnets (such as Spamhaus blacklists), AutoRE is designed to create and train itself dynamically, in real time. To do this it has 3 major steps when a set of emails is supplied to it. They are as follows:-
1. URL Pre-processing
2. Group Selector
3. Regular Expression Generation

It is important to understand that the aim is not simply to label emails as spam or not spam. By a loose definition, any email that is sent regularly and in bulk is spam, but not all such email is malicious: a normal user might send an email to his complete contact list that is relevant and not spam. Our focus is on spam emails generated by botnets, which carry no relevant content and are sent only to accomplish some malicious mission. As botnets are autonomous, programmed systems, there is a pattern in their sending behaviour. The steps mentioned above are followed to catch that pattern. During URL Pre-processing the following parameters are considered:-
1. URL String
2. Source server IP address
3. Email sending time

All forwarded messages are discarded, as a legitimate forwarding server can be mistaken for a botnet member. URL strings that look suspiciously random, or that span multiple domains, are extracted. URL strings like a.com or b.com are unlikely to come from botnets, as they are registered domain names that are not economically feasible for spammers.
URL strings are then broken down and grouped by domain name. As spam emails are observed to advertise a particular product or a particular advertising campaign, domain-specific signatures are created first. From these domain-specific signatures, domain-agnostic regular expressions are created. These give better results: they reduce the false positive rate and identify botnets even when they change their domains. Before the generalised regular expression is created, a domain-specific signature must satisfy three properties: it must be distributed, bursty and specific. Only then can it be classified as a spam signature.
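The grouping step and two of the three properties can be sketched as follows. This is a simplified illustration, not AutoRE's implementation. The record layout, the function names and the `MIN_ASES` threshold are assumptions made for the example, and the "specific" property is omitted because checking it would require a corpus of legitimate URLs.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse

MIN_ASES = 3                     # hypothetical threshold for "distributed"
MAX_BURST = timedelta(days=5)    # window used here for "bursty"


def group_by_domain(records):
    """records: iterable of (url, source_as, sent_at). Groups records by URL host name."""
    groups = defaultdict(list)
    for url, source_as, sent_at in records:
        groups[urlparse(url).hostname].append((url, source_as, sent_at))
    return groups


def is_candidate(group):
    """True if the group is sent from many ASes (distributed) within a short window (bursty)."""
    ases = {source_as for _, source_as, _ in group}
    times = [sent_at for _, _, sent_at in group]
    distributed = len(ases) >= MIN_ASES
    bursty = max(times) - min(times) <= MAX_BURST
    return distributed and bursty


# Invented sample data: AS numbers and host names are placeholders.
t0 = datetime(2007, 7, 1)
records = [
    ("http://deals.example.com/p/a1b2", 64500, t0),
    ("http://deals.example.com/p/c3d4", 64501, t0 + timedelta(hours=6)),
    ("http://deals.example.com/p/e5f6", 64502, t0 + timedelta(days=1)),
    ("http://news.example.org/story", 64500, t0),
    ("http://news.example.org/story", 64500, t0 + timedelta(days=30)),
]
groups = group_by_domain(records)
candidates = sorted(domain for domain, group in groups.items() if is_candidate(group))
```

In this sample only the `deals.example.com` group qualifies: its URLs arrive from three ASes within one day, while the other group comes from a single AS spread over a month.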
While grouping, it is very important to understand how to group the domains, as with n emails there are possibly n domain names. For the distributed property, temporal correlation is taken into account, and the bursty property is evaluated over a span of 5 days, as it was observed that most ASes are active for a minimum of 5 days.

Once the domains have been classified into groups, the next step is generating regular expressions. Generating a regular expression rather than a token conjunction helps reduce the false positive rate, because the keywords used in a token conjunction are words that may or may not be part of an email. After creating domain-specific groups, a signature is created for each group. Classification is then no longer based on the group and becomes domain-agnostic. This ensures that if the botnets change domain in the future they will still be detected, since URLs on the new domain will match the same regular expression and group signature that classify them as spam. This works because the final signature is not domain-specific. This is a unique feature of AutoRE, which helps it find the maximum number of spam emails with a minimum false positive rate. After categorizing the emails and assigning them their respective regular expressions, it is also very important to verify that the emails classified as spam really are spam. There are 2 steps. First, the suspected IP addresses are queried against a blacklist, and those found in the list are directly classified as spam. For those that are not, behavioural tests are run to determine whether they are spam. These behavioural tests are done on each campaign, and the points to take care of are as follows:-
1). Similarity of Email Properties
2). Similarity of Sending Time
3). Similarity of Email Sending Behaviour
As the targeted emails are generated and sent by automated systems, the above properties play a big role.
As botnets are automated systems, they are bound to show a pattern: however random the sending algorithm is designed to be, the frequency of occurrence will produce one.

It does not end here, as by means of this software we can study the characteristics of botnets and predict the traffic and spam emails that are going to be generated. This study of botnets has revealed many facts that are pointers for future research in anti-spam systems. In the next section we present the results of the study on botnets and their use in technologies that emerged after AutoRE.
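The idea behind the regular expression generation step described in this section, turning the URLs of one domain-specific group into a domain-agnostic pattern, can be sketched as below. This is a deliberately simplified stand-in and not the algorithm used in the AutoRE paper. It assumes every URL in the group has the same number of path segments, keeps segments that never change as literals, and replaces varying segments with a character class and length range. All function names and URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def generalise_segment(values):
    """Return a regex fragment that matches every value seen at one path position."""
    if len(set(values)) == 1:
        return re.escape(values[0])          # constant segment: keep it literal
    lengths = [len(v) for v in values]
    lo, hi = min(lengths), max(lengths)
    if all(re.fullmatch(r"[0-9]+", v) for v in values):
        char_class = "[0-9]"
    elif all(re.fullmatch(r"[A-Za-z0-9]+", v) for v in values):
        char_class = "[A-Za-z0-9]"
    else:
        char_class = "[^/]"
    quantifier = f"{{{lo}}}" if lo == hi else f"{{{lo},{hi}}}"
    return char_class + quantifier


def domain_agnostic_regex(urls):
    """Build one pattern from a group of URLs, ignoring the domain they were seen on."""
    paths = [urlparse(u).path.strip("/").split("/") for u in urls]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch needs URLs with the same number of path segments")
    fragments = [generalise_segment(column) for column in zip(*paths)]
    # The host part is matched by [^/]+, so the signature survives a domain change.
    return re.compile(r"^https?://[^/]+/" + "/".join(fragments) + "$")


group = [
    "http://deals.example.com/p/a1b2/index.html",
    "http://deals.example.com/p/c3d4x/index.html",
]
signature = domain_agnostic_regex(group)
same_campaign_new_domain = bool(signature.match("http://other.example.net/p/zz99/index.html"))
unrelated_page = bool(signature.match("http://deals.example.com/about/index.html"))
```

Unlike a token conjunction, the pattern fixes the order and length of the URL parts, which is what lets it match the same campaign on a new domain while rejecting an unrelated page on the original one.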
Future and Related Work

Characteristics of Botnets and their use in present anti spam systems:-

1). Spam Sending Patterns over the network

The above characteristic is used in A Dynamic Reputation Service for Spotting Spammers. SpamSpotter is real-time (like AutoRE) reputation software for filtering spam messages. It classifies email senders in real time based on their global sending behaviour. This approach is called behavioural detection. SpamSpotter then applies a third-party behavioural machine learning algorithm to this data to generate sender reputations. A preferred algorithm in SpamSpotter is SNARE. It is a network-level behavioural algorithm that identifies spam senders based on their email sending behaviour instead of their addresses and the contents that they send. In some cases, the SNARE mechanism is so efficient that it can identify a spammer before it has sent a large number of email messages.

AutoRE studied similar behaviour, though SpamSpotter goes a level further by implementing the SNARE algorithm to calculate the reputation of a sender.

2). Distribution of IP Address

One of the characteristics of botnets observed during experimentation with AutoRE was the distribution of IP addresses. This is a very important characteristic to study, as it can help us understand and stop the wide spread of botnets. This property has been extended by Studying Spamming Botnets Using Botlab. Botnets are the most widely used spamming technique these days. It is estimated that 85% of the billions of spam messages are generated by botnets. This paper presents a botnet monitoring platform called Botlab, which monitors all incoming spam traffic at a certain location. It scans the spam messages and obtains bot binaries through spam links. A human operator then runs specific tools on these binaries to obtain information about the bots sending this spam.
It then executes multiple captive, sandboxed nodes from various botnets, allowing it to observe the precise outgoing spam feeds from these nodes. It scours the spam feeds for URLs, gathers information on scams, and identifies exploit links. Finally, it correlates the incoming and outgoing spam feeds to identify the most active botnets and the set of compromised hosts comprising each botnet. Another extension, which studies the characteristics of botnets and uses them for detection, is BotGraph: Large Scale Spamming Botnet Detection. BotGraph detects the abnormal sharing of IP addresses among account holders in an email system. Applied to two months of Hotmail logs totalling 450GB of data, BotGraph successfully identified over 26 million bot-accounts with a low false positive rate of 0.44%. BotGraph has also been implemented as a distributed, cluster-based algorithm using the MapReduce technique. BotGraph can detect botnet sign-ups as well as already created botnet email accounts.
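The idea of detecting abnormal IP address sharing among accounts can be illustrated with a small graph sketch. It links two accounts when they log in from at least `min_shared` common IP addresses and reports connected groups of at least `min_size` accounts. Both thresholds, the function name and the sample data are assumptions made for this example. BotGraph's actual graph construction and its MapReduce implementation are considerably more involved.

```python
from collections import defaultdict
from itertools import combinations


def suspicious_groups(logins, min_shared=2, min_size=3):
    """logins: iterable of (account, ip). Returns sorted groups of accounts that share IPs."""
    ips_by_account = defaultdict(set)
    for account, ip in logins:
        ips_by_account[account].add(ip)

    # Link two accounts when they share enough login IP addresses.
    # The pairwise comparison is quadratic, which is fine only for a toy data set.
    neighbours = defaultdict(set)
    for a, b in combinations(sorted(ips_by_account), 2):
        if len(ips_by_account[a] & ips_by_account[b]) >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Collect connected components with a depth-first search.
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups


# Invented sample: three accounts rotate over the same two IPs, one normal user does not.
logins = [
    ("bot1", "10.0.0.1"), ("bot1", "10.0.0.2"),
    ("bot2", "10.0.0.1"), ("bot2", "10.0.0.2"),
    ("bot3", "10.0.0.1"), ("bot3", "10.0.0.2"),
    ("alice", "10.0.0.1"), ("alice", "192.0.2.9"),
]
flagged = suspicious_groups(logins)
```

Here `alice` shares only one address with the other accounts, so she is not linked to the group. Requiring several shared addresses keeps users behind one common proxy or NAT from being flagged together.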
Another interesting finding from the AutoRE research, in the category of scanning network traffic, was the increase in the use of static IP addresses from Nov 2006 to July 2007. This finding helped improve blacklists by populating them with static IP addresses. The research also suggested that botnets are evolving and creating more sophisticated and polymorphic URLs to bypass anti-spam systems.

One major disadvantage of AutoRE is that it has not yet been implemented in practice in real time. Its methods are still under investigation, and its real-time deployment is still awaited.

References

1). Ramachandran, A, Hao, S, Khandelwal, H, Feamster, N & Vempala, S, ‘A Dynamic Reputation Service for Spotting Spammers’, School of Computer Science, Georgia Tech. http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf
2). ‘BotGraph: Large Scale Spamming Botnet Detection’. http://research.microsoft.com/pubs/79413/botgraph.pdf
3). Sahami, M, Dumais, S, Heckerman, D & Horvitz, E 1998, ‘A Bayesian approach to filtering junk E-mail’, in Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin.
4). Cohen, W 1996, ‘Learning Rules that Classify E-Mail’, Advances in Inductive Logic Programming, pp. 124-143.
5). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006 Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202.
6). ‘SNARE: Spatio-temporal Network-level Automatic Reputation Engine’. http://hdl.handle.net/1853/25135
7). ‘Studying Spamming Botnets Using Botlab’. http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf
8). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a Bayesian-filter-based Image Spam Filtering Method’, 2008 International Conference on Information Security and Assurance, IEEE.