Group Details:
Dhara Shah z3299353
Imad Hashmi z3193866
Zuo Cui z3261136

Our Paper:
Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov, Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.

Is this paper technically sound?

The paper is based on experiments conducted on three months of data collected from Hotmail's servers. To reproduce similar results we needed the algorithm or rules that the AutoRE software uses to generate regular expressions, and data on which the experiments could be conducted.

To get the details of the software we tried contacting the authors, but unfortunately we did not receive any reply (proof attached in the appendix). We suspect this is because the work is Microsoft group research and the details of a commercial product are confidential. We therefore looked at open source spam detection software to understand how AutoRE might work. We could not compare the techniques used by the open source spam detection software with those of AutoRE, as we did not have all the details of AutoRE.

There are a number of spam detection tools available, both commercial and open source, but none of them is based on signatures. The idea in this paper is genuine and novel because other content-based filters do not generate signatures and rely on a complete scan of the email. The following are some of the rules used by such tools to identify a spam URL. We discuss URLs only, because AutoRE works with URLs only:

- Uses a numeric IP address in URL
- Uses %-escapes inside a URL's hostname
- Completely unnecessary %-escapes inside a URL
- Dotted-decimal IP address followed by CGI
- Uses non-standard port number for HTTP
- Has Yahoo Redirect URI
- Contains a URL-encoded hostname (HTTP77)
- URI contains ".com" in middle
- URI contains ".com" in middle and end
- URI contains ".net" or ".org", then ".com"
- URI hostname has long hexadecimal sequence
- URI hostname has long non-vowel sequence
- CGI in .info TLD other than third-level "www"
- CGI in .biz TLD other than third-level "www"
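A few of the URL rules listed above can be sketched in code as follows. This is only an illustration: the rule names, thresholds and regular expressions are our own assumptions and are not taken from AutoRE or from any particular spam filter.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, simplified patterns; the length thresholds are assumptions.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
LONG_HEX = re.compile(r"[0-9a-f]{8,}", re.IGNORECASE)
LONG_NON_VOWEL = re.compile(r"[b-df-hj-np-tv-z]{7,}", re.IGNORECASE)

def url_rule_hits(url):
    """Return the names of the heuristic rules that the URL triggers."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    hits = []
    if IPV4.match(host):
        hits.append("numeric_ip_host")          # numeric IP address in URL
    if "%" in parts.netloc:
        hits.append("percent_escape_in_host")   # %-escapes inside the hostname
    if parts.port is not None and parts.port not in (80, 443):
        hits.append("non_standard_port")        # non-standard port for HTTP
    if ".com." in host:
        hits.append("dot_com_in_middle")        # ".com" in middle of hostname
    if LONG_HEX.search(host):
        hits.append("long_hex_in_host")         # long hexadecimal sequence
    if LONG_NON_VOWEL.search(host):
        hits.append("long_non_vowel_in_host")   # long non-vowel sequence
    return hits
```

A URL that triggers several rules is more suspicious than one that triggers a single rule, so a real filter would normally combine the hits into a score rather than act on any one of them.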

There is also a long list of email header criteria that can be applied to identify spam, but that is beyond the scope here.

Next, we tried to collect data from the University's mail server to verify the characteristics of spam emails mentioned in the paper (proof attached in the appendix). Because of the University's security concerns, we could not get the data. We therefore redirected our Yahoo, Gmail and Hotmail accounts to a CSE account and accessed the CSE account via the "pine" utility. Pine is a text-based email reader which lets us see detailed email headers. We tried to distinguish the email headers of spam emails from those of legitimate emails. However, CSE does not apply any anti-spam technology of its own; it relies on the University's server for this. We verified this by observing that all the emails coming to CSE are forwarded by the University's server. We also learned that even if a user marks an email as spam, the system does not categorise it as spam until it satisfies the basic property of burstiness.
We classified a few legitimate email addresses as spam, but the email server never classified them as spam because they were never sending in bulk.

Result from Pine is as follows:

INFPACM003.services.comms.unsw.edu.au ([22.214.171.124]) (IP doesnt match sender domain) (for <firstname.lastname@example.org>) By note With Smtp ; Fri, 18 Jun 2010 20:23:12 +1000
Received: from mta156.mail.in.yahoo.com ([126.96.36.199]) by INFPACM003.services.comms.unsw.edu.au with SMTP; 18 Jun 2010 20:02:46 +1000
Received: from 188.8.131.52 (HELO web32405.mail.mud.yahoo.com) (184.108.40.206) by mta156.mail.in.yahoo.com with SMTP; Fri, 18 Jun 2010 15:53:07 +0530
Received: (qmail 20395 invoked by uid 60001); 18 Jun 2010 10:23:04 -0000
Received: from [220.127.116.11] by web32405.mail.mud.yahoo.com via HTTP; Fri, 18 Jun 2010 03:23:03 PDT

Received: From INFPACM001.services.comms.unsw.edu.au ([18.104.22.168]) (for <email@example.com>) By note With Smtp ; Fri, 18 Jun 2010 20:04:32 +1000
Received: from mta177.mail.in.yahoo.com ([22.214.171.124]) by INFPACM001.services.comms.unsw.edu.au with SMTP; 18 Jun 2010 19:52:33 +1000
Received: from 126.96.36.199 (EHLO bay0-omc1-s5.bay0.hotmail.com) (188.8.131.52) by mta177.mail.in.yahoo.com with SMTP; Fri, 18 Jun 2010 15:34:22 +0530
Received: from BL2PRD0102HT003.prod.exchangelabs.com ([184.108.40.206]) by bay0-omc1-s5.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.4675); Fri, 18 Jun 2010 03:04:00 -0700
Received: from BL2PRD0102MB009.prod.exchangelabs.com ([169.254.34.168]) by BL2PRD0102HT003.prod.exchangelabs.com ([169.254.220.82]) with mapi; Fri, 18 Jun 2010 10:03:59 +0000
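
The manual inspection of Received headers described above could be partly automated. The following is a minimal sketch using Python's standard email package; the function name relay_hosts and the example hostnames in any test message are hypothetical, and the regular expression is a simplification that only looks for the first "from <host>" in each header.

```python
import re
from email import message_from_string

# Simplified pattern: the first token after "from" in a Received header.
RECEIVED_FROM = re.compile(r"from\s+(\S+)", re.IGNORECASE)

def relay_hosts(raw_message):
    """Return the 'from' host of each Received header, newest hop first."""
    msg = message_from_string(raw_message)
    hosts = []
    for value in msg.get_all("Received", []):
        match = RECEIVED_FROM.search(value)
        if match:  # headers such as "(qmail ... invoked by uid ...)" have no host
            hosts.append(match.group(1))
    return hosts
```

Listing the hops in this way makes it easier to see which server forwarded a message and where the chain started, which is the comparison we attempted by hand in pine.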

Are the ideas and results presented in this paper novel?

In our opinion, the idea of the AutoRE framework is significantly novel. Although some previous works used regular expressions for spam detection based on the URLs in the email content, AutoRE is quite different from them, for the reasons below.

First, AutoRE is able to generate regular expressions automatically from the discovered URLs. Currently, hand-written regular expressions are required in most detection frameworks. With the rapid growth in the volume of spam, it becomes increasingly hard, even impossible, to generate regular expressions manually. Borrowing from methods used in worm detection systems (the work of Singh et al.), AutoRE generates spam signatures automatically. This technique therefore reduces the human workload and improves the accuracy of the regular expressions.

Second, AutoRE is able to anticipate future botnet spam in a domain-agnostic way. Most previous research and current detection frameworks aim at specific individual botnets. For botnets that have similar behaviours, these frameworks cannot detect them automatically; they can only act on the domains of botnets that have already been captured. For possible future domains, this previous research is of no help. AutoRE, however, is able to analyse and group the domains that have similar behaviour and then merge domain-specific regular expressions into domain-agnostic regular expressions. AutoRE can therefore detect domains that exhibit the same behaviour, both now and in the future.

From these points of view, AutoRE can be considered an innovative framework in the field of spam detection.

Are there any weaknesses of this paper that you have not mentioned in your answers to the above questions?

One of the weaknesses is that AutoRE does not deal with proxy URLs. These proxy URLs usually have no relevance to their redirect destination, so it is hard to group them by using AutoRE.
They could be traced to their redirect destination, and the destination address could then be checked by AutoRE to decide whether it is spam, but following the link is exactly what the spammer wants. The paper offers no remedy for this situation. Another weakness is that AutoRE cannot detect image spam, which is on the increase. The authors could borrow ideas from other image spam detection frameworks (such as the work of Uemura and Tabata), using information about the image, such as URL, file name or size, to improve the framework.
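
To make the idea of domain-specific and domain-agnostic signatures discussed in the novelty section more concrete, here is a minimal sketch. It is not the algorithm used by AutoRE, whose details we could not obtain; it is a simplified stand-in that assumes all URLs in a group have the same number of path segments, and all function names are our own.

```python
import re
from urllib.parse import urlsplit

def generalize_token(tokens):
    """Collapse the tokens seen at one path position into a regex fragment."""
    if len(set(tokens)) == 1:
        return re.escape(tokens[0])        # constant part: keep it literally
    lengths = [len(t) for t in tokens]
    if all(t.isdigit() for t in tokens):
        cls = "[0-9]"
    elif all(t.isalnum() for t in tokens):
        cls = "[A-Za-z0-9]"
    else:
        cls = "[^/]"
    # Varying part: character class plus the observed length range.
    return "%s{%d,%d}" % (cls, min(lengths), max(lengths))

def path_signature(urls):
    """Build one path regex from URLs with the same number of path segments."""
    paths = [urlsplit(u).path.strip("/").split("/") for u in urls]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("URLs must have the same number of path segments")
    return "/".join(generalize_token(list(col)) for col in zip(*paths))

def domain_agnostic(urls):
    """Replace the concrete host with a wildcard, keeping the path pattern."""
    return re.compile(r"^https?://[^/]+/" + path_signature(urls) + "$")
```

The resulting pattern no longer mentions the original domain, so it would also match a URL with the same path structure hosted on a domain that has not been seen before. This is the property that lets a domain-agnostic signature cover future domains.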

Do you think the results of this paper are of practical significance?

Even though AutoRE was tested only on a random sample of Hotmail email, the results are compelling. As the authors mention, the regular expression signatures can detect 10 times more spam than previous signatures based on complete URLs, and they significantly reduce the false positive rate when detecting botnet spam and botnet hosts. AutoRE is able to capture an additional 16-18% of the spam that bypassed well-known spam filters (e.g. Spamhaus). At present, both the transient nature of the attacks and the fact that each bot sends only a few spam emails make it difficult for previous spam filtering frameworks to detect and blacklist individual bots. AutoRE is therefore a practical aid to existing spam filtering frameworks. Most importantly, AutoRE is also capable of "predicting" future botnets regardless of domain name, and it is also quite useful for characterising current botnets.

However, no single framework can be permanently suitable for all kinds of spam. If AutoRE is widely used in real time, spam senders will try to find weaknesses in the framework, and then find ways to exploit those weaknesses and hide spam from AutoRE. AutoRE therefore needs to be updated frequently to remain effective.

What is your assessment of the readability, organization and overall presentation of the paper?

The idea of the paper is well described overall. Readers get a fair idea of what the authors want them to understand as they proceed through the topics. There are, however, a few improvements that we consider important. The abstract gives the impression that the AutoRE software processes the complete email contents, including the body, for signature generation, which is not the case.
As the algorithm works only on the URLs inside the email contents, the abstract should mention that this is not a content-based filtering system. Another point we noted is that the focus of the paper seems divided between two different topics: AutoRE and botnet characteristics. Although the paper addresses both topics, they sometimes seem unrelated, because AutoRE generates signatures only from an already received collection of emails. The way these spam emails are sent, and how different botnet characteristics affect that, may be better described in a separate paper with more detail, which could then be referred to here as required by AutoRE. There is a lot of detail associated with topics like dynamic and static IP addresses, the email sending behaviour of botnets and traffic correlations. A lot of data and statistics can be collected along these lines for analysis. The paper itself suggests that this is an interesting future direction, because due importance cannot be given to all areas in a single paper.

If you were a reviewer whose recommendation is being sought by the editor of the journal or the conference proceedings on whether or not to publish this paper, what would be your recommendation?

This is a very important topic and a well-known subject. The authors do not need to explain much about its importance, as there is already a lot of investment in the field of spam detection. The authors also have a complete working implementation of the algorithm which has been tested on real-world data. Given the successful results claimed by the authors, the idea seems to carry a lot of weight, although the software has not been put into practice, for unknown reasons.

The paper is definitely worth publishing in a related conference. The false positive rate of AutoRE signatures is significantly lower than that of existing mechanisms, although AutoRE does not cover the complete email contents.

How can the work presented in this paper be improved?

The paper tries to solve the very important problem of spam email using a mix of content-based and non-content-based filtering. With a significantly low false positive rate and detection of a high number of spam campaigns, the results are quite impressive. However, we suggest that the work can be improved in a number of ways.

Improvement of Signature
AutoRE generates a signature of the spam campaign, which it applies to emails arriving later in order to find similarities. This signature creation can be improved in a number of ways. Currently it involves only the URLs inside the email message. This signature generation mechanism is incomplete, since a lot of spam emails do not contain URLs.

Handling of Proxy URLs
The system at the moment does not work with proxy URLs. This means that many different URLs redirecting to a single resource will not be picked up by the signature. This can be solved by building a blacklist database of all domains providing redirection services to spammers.
A domain found in multiple subsequent emails is a good candidate for the blacklist database. It will not be possible for spammers to quickly register new domains for redirection services.
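
The blacklist idea above can be sketched as a simple count of how many distinct emails each host appears in. This is a minimal illustration under our own assumptions: the function name blacklist_candidates and the threshold min_emails are hypothetical, and a real system would also need a whitelist for popular legitimate hosts.

```python
from collections import Counter
from urllib.parse import urlsplit

def blacklist_candidates(emails, min_emails=3):
    """Return hosts that appear in at least `min_emails` distinct emails.

    `emails` is a list of lists of URLs, one inner list per email.
    """
    counts = Counter()
    for urls in emails:
        # Use a set so a host repeated within one email is counted once.
        hosts = {urlsplit(u).hostname for u in urls}
        hosts.discard(None)  # URLs without a parsable host are ignored
        counts.update(hosts)
    return sorted(h for h, n in counts.items() if n >= min_emails)
```

Counting per email rather than per URL keeps a single message with many copies of the same link from pushing a host over the threshold on its own.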

Keeping signature up-to-date
AutoRE works on historic data. Since it generates spam signatures and identifies spam emails based on historic data, keeping those signatures correct and up to date is a big challenge in itself. If a signature expires, the low false positive rate may change significantly and the system may lose its strength. The paper does not explain anything about this. A mechanism to update the signatures would greatly improve the software's performance.

Detecting Image spam
A lot of spam emails today are sent in the form of images. The purpose of using images is to hide the email contents from content-based filters. This important feature should be dealt with by content-based filtering systems like AutoRE. One way of doing this is to generate a signature of the image as well. Some basic characteristics, like image size, type and dimensions, can be recorded inside the email signature to identify similar images in other emails. Advanced image signature algorithms like colour histograms might not be possible to apply at such a mass scale, but calculating an image hash might turn out to be useful.

Dependence on Botnet burstiness
AutoRE relies heavily on the burstiness property of spamming botnets, with the assumption that the botnets will be rented for a short time only. This can ultimately result in the generation of a totally incorrect spam signature if botnets start throttling their sending speed. However, this topic remains wide open, because waiting for the right spam email to be used as signature data is not an option.

Reference:
M. Uemura and T. Tabata. Design and Evaluation of a Bayesian-filter-based Image Spam Filtering Method. In 2008 International Conference on Information Security and Assurance, IEEE, 2008.
S. Singh, C. Estan, G. Varghese, and S. Savage. Automated worm fingerprinting. In OSDI, 2004.
Apache SpamAssassin

Appendix:
Following are the proof of our efforts:
1. Letter from the IT Department of UNSW
Email to Microsoft Team:

Inquiry Regarding your paper on "Spamming Botnets : Signatures and Characteristics"
Dharaben Shah
You forwarded this message on 4/13/2010 12:36 AM.
Sent: Tuesday, April 13, 2010 12:34 AM
To: firstname.lastname@example.org

Respected Sir/Madam,

I am a student at The University of New South Wales, Sydney, Australia. Your paper on "Spamming Botnets: Signatures and Characteristics" is my anchor paper for a research study in my course "Advance Computer Networks".

First of all, I would like to appreciate the manner in which the paper is written, It was very interesting and inspiring to go through the paper.

Secondly I needed a favour from you to help me in research study on your paper. I would be highly obliged if you could help in my research study. I understand your limitations and would highly appreciate any help you could provide me. I am hoping for some kind of pointers to move ahead on my research work all, I am expected is to do is try and conduct some experiments on anchor paper to understand the topic well and if possible come up with any difficulties not mentioned in paper.

I would be waiting for your reply eagerly.

Thanks and Regards,
Dhara Shah
Master of Engineering Science specialization Information Technology
The University of New South Wales Student.

Our Diary

- Release date: 11th March, 2010.
- Read the abstracts of 8 topics each and nominated 2 topics per group member by 17th March, 2010.
- Got the final selected topic by 19th March, 2010.
- Until 28th March we went through the anchor paper thoroughly and wrote a one-page write-up summarising our understanding of the paper.
- On 28th March we decided the approach ahead. We listed the references mentioned in the anchor paper and each of us was assigned 8 of them. Our objective was to find where the references were used in the anchor paper and to write a small summary explaining their use in the anchor paper. The deadline for this work was 4th April. Every Monday we discussed our progress, as it was our lab time.
- Next, we emailed the researchers of our anchor paper and tried to get the code of the software mentioned in the paper. Our efforts were futile, as the software was not available commercially and, being Microsoft research, its details were not revealed to us. Hence we decided to move ahead and gather more literature to find a way to experiment with the anchor paper.
- Until 11th April we had been working on the anchor paper only, as it took us time to understand it and to find a way of experimenting. From 11th April, for 2 weeks (until 25th April), the following tasks were assigned to the group members: Dhara: working on the anchor paper and finding a way to conduct experiments on it. Imad: future work and related work. Zuo: past and related work.
- Outcome: Possible areas of exploration are creating a botnet and sending emails to test various mail service providers and see how they detect spam email, and proving the difference between regular expression and token conjunction signatures. A 2-page write-up on the key findings of the paper and on future and background work.
- From 25th April to 12th May we worked on the presentation, as our presentation was on 13th May.
- From 13th May to 20th May we tried getting data from the University mail server and tried setting up a mail server to get data to verify the findings.
Because we failed to set up the mail server, from 21st May to 27th May we tried getting University data and setting up a botnet. From 27th May to 25th June we tried collecting data through pine and applying for University data.