Group Details:- 
Dhara Shah z3299353 
Imad Hashmi z3193866 
Zuo Cui z3261136 
Our Paper:- Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov, Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.
Flow of the Literature Review is as follows:- 
Introduction 
Background and Previous Work 
Focus on Technology used in Paper 
Future and Related Work 
Introduction 
Email has become a widespread means of communication around the world, with millions of messages transferred every minute, so it is no surprise that the service has long been abused. One of the most common abuses is spamming, in which advertisers send unsolicited advertisements for their products to legitimate email users. The following discussion covers the methods used by anti-spam systems to detect spam emails and botnets.
Background and Previous Work 
A great deal of research has been done on the identification and filtering of email spam.
Based on the part of the email used for spam detection, this work can be broadly classified into two main categories: non-content-based and content-based.
Non-content-based filters are also known as address-based filters. They examine information in the email header, such as the IP address or the email address. Blacklists and whitelists are the common techniques in this category. A blacklist records the IP addresses or email addresses that send spam; conversely, a whitelist contains all acceptable email addresses. Both can be deployed on client computers or on email servers. Cook et al. (2006) experimented with a domain-specific blacklist that ran on the mail server to reduce the amount of spam entering the network. However, a blacklist can easily cause false positives. Once an IP address or email address has sent spam, it is recorded in the blacklist, and consequently all other legitimate mail from that address is also marked as spam.
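The address-based approach can be illustrated with a minimal sketch. The class name `AddressFilter`, its method and the example addresses are our own illustrative assumptions and do not come from any of the cited systems.

```python
from dataclasses import dataclass, field


@dataclass
class AddressFilter:
    """Hypothetical address-based filter; names are illustrative only."""
    blacklist: set = field(default_factory=set)   # IPs or addresses seen sending spam
    whitelist: set = field(default_factory=set)   # addresses that are always accepted

    def classify(self, sender_ip: str, sender_addr: str) -> str:
        # The whitelist is checked first so that trusted senders are never blocked.
        if sender_addr in self.whitelist:
            return "accept"
        # Only header information is used; the message body is never inspected.
        if sender_ip in self.blacklist or sender_addr in self.blacklist:
            return "reject"
        return "unknown"


f = AddressFilter(blacklist={"203.0.113.7"}, whitelist={"friend@example.org"})
# A legitimate user behind the blacklisted IP is rejected as well:
# this is the false positive problem described above.
print(f.classify("203.0.113.7", "alice@example.com"))
```

The filter is cheap because it never reads the message body, but it cannot tell a legitimate sender apart from a spammer that shares the same address.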
Content-based detection filters spam by analysing the message content of received email, which overcomes the drawbacks of non-content-based filters. These filters scan the content for sensitive keywords to identify spam. This category includes heuristic filters and Bayesian filters.
Heuristic detection, also known as rule-based analysis, uses regular expression rules to detect phrases or characters that are common in spam mail. Rules can be defined on email header information, keywords or URLs in the content. William Cohen (1996) successfully used learned rules to classify emails into different folders, but there is little related research on rule-based spam detection.
However, detection precision depends on the rules set by the mail system managers. Defining the rules takes a significant amount of time, and afterwards they must be refined frequently. If the pre-set rules are not updated, the filter will not work efficiently on new spam with new features. In addition, the rules are rigid and prone to false positives.
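A rule-based filter of this kind can be sketched as follows. The rules, weights and threshold are invented for illustration and would have to be written and maintained by the mail system manager, which is exactly the maintenance burden noted above.

```python
import re

# Hypothetical hand-written rules: (pattern, weight). None come from a real filter.
RULES = [
    (re.compile(r"\bfree\s+money\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    (re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"), 1.5),  # URL with a raw IP host
]
THRESHOLD = 3.0  # assumed cut-off; a real deployment would tune this


def rule_score(text: str) -> float:
    """Sum the weights of all rules whose pattern occurs in the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def is_spam(text: str) -> bool:
    return rule_score(text) >= THRESHOLD


print(is_spam("Get FREE money now at http://192.0.2.1/win"))
print(is_spam("Lunch tomorrow?"))
```

Because the rules are fixed strings and patterns, a spammer who changes the wording slightly escapes them until the rule set is updated.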
Because content-based spam detection can be treated as a text classification problem, several machine learning approaches have been applied to it, and Bayesian classification is one of them. In 1998, Bayesian classification techniques were applied to spam filtering (Sahami et al., 1998). The classifier records the occurrence of certain words or phrases in the message content, and the filter then estimates from these statistics the probability that a message is spam. The Bayesian filter eliminated more than 95% of spam in the experiments and identified 80% of incoming junk mail in a real scenario. Bayesian filters can therefore achieve high accuracy in detecting plain-text content.
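The idea of estimating a spam probability from word statistics can be sketched with a small naive Bayes classifier. This is a generic textbook formulation with Laplace smoothing, not the exact model of Sahami et al.; the class name and the training messages are our own assumptions.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Minimal word-based naive Bayes sketch (assumes both classes have training data)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Class prior plus smoothed per-word likelihoods, in log space.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalise the two log scores into a probability for the spam class.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


nb = NaiveBayesSpamFilter()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.spam_probability("cheap pills") > 0.5)
```

The sketch also makes the dependence on training data visible: the classifier knows nothing beyond the words it has been trained on.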
Bayesian filtering is now widely combined with other methods in spam detection technologies to improve accuracy. However, it has some problems. First, as with other machine learning approaches, the accuracy of a Bayesian filter depends on the quality of the training data and the training process. Second, although Bayesian filters give high precision on plain-text content, they have difficulty detecting the growing volume of spam that contains images. Further research at Okayama University therefore addressed image spam (Uemura & Tabata 2008). The authors designed a method that lets an existing Bayesian filter learn image information, such as the file size or name, and then evaluate the spam probability from the learned results. Their experiments showed that the false negative rate dropped while the false positive rate stayed almost the same. This means the method can only play a supporting role in Bayesian spam identification, because images provide little information for distinguishing spam from legitimate mail.
Content-based detection systems have many advantages, but they consume considerable time and processing resources because they go through the complete email. There is a need for an anti-spam system that combines the advantages of content-based and non-content-based spam detection.
AutoRE, a system designed by the Microsoft Research group and described in our anchor paper, tries to combine both types of detection, i.e. content-based and non-content-based. We now discuss in detail how AutoRE combines the two.
Focus on Technology used in Paper 
Unlike previous solutions for detecting botnets (such as Spamhaus and other blacklists), AutoRE creates and trains itself dynamically, in real time. To do this, it processes a supplied set of emails in three major steps:-
1. URL Pre-processing 
2. Group Selector 
3. Regular Expression Generation 
It is important to understand that the goal is not to classify emails as spam or not spam. By definition, any email that is sent regularly and in bulk is spam, but such emails are not necessarily malicious: even a normal user might send a relevant message to an entire contact list. Our focus is on spam generated by botnets; these emails carry no relevant content and are sent only to accomplish some malicious purpose. Because botnets are automated, programmed systems, their sending behaviour follows a pattern, and the steps above are designed to catch that pattern. During URL pre-processing the following parameters are considered:-
1. URL String 
2. Source server IP address 
3. Email sending time 
All forwarded messages are discarded, because a legitimate forwarding server could be mistaken for a botnet member. URL strings that look suspiciously random or span multiple domains are extracted. URL strings such as a.com or b.com are unlikely to come from botnets, because registering such domain names is not economically feasible for spammers. The URL strings are then broken down and grouped by domain name. Since spam emails are observed to advertise a particular product or campaign, domain-specific signatures are created for each group. From these domain-specific signatures, domain-agnostic regular expressions are then derived, which reduces the false positive rate and identifies botnets even when they change domains. Before the generalised regular expression is created, a domain-specific signature must be shown to be distributed, bursty and specific; only then can it be classified as a spam signature.
When grouping, it is very important to decide how to group the domains, since n emails may contain up to n different domain names. For the distributed property, temporal correlation is considered, and the bursty property is evaluated over a span of 5 days, since it is observed that most ASes are active for at least 5 days.
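The grouping step and the distributed and bursty checks can be sketched as below. This is a simplified illustration under our own assumptions, not the AutoRE implementation: the record type `UrlRecord`, the pre-resolved AS number and the threshold `MIN_ASES` are hypothetical, and only the 5-day window is taken from the discussion above. The "specific" property is not checked here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.parse import urlparse


@dataclass(frozen=True)
class UrlRecord:
    url: str            # URL string extracted from the email
    source_as: int      # AS number of the sending server (lookup assumed done elsewhere)
    sent_at: datetime   # email sending time


MIN_ASES = 3                   # illustrative threshold, not a value from the paper
MAX_SPAN = timedelta(days=5)   # the 5-day window mentioned above


def group_by_domain(records):
    """Group URL records by the host name of their URL."""
    groups = defaultdict(list)
    for rec in records:
        groups[urlparse(rec.url).hostname].append(rec)
    return groups


def is_distributed(group) -> bool:
    # Senders spread over many ASes suggest a botnet rather than one server.
    return len({rec.source_as for rec in group}) >= MIN_ASES


def is_bursty(group) -> bool:
    # All emails of the group were sent within a short time window.
    times = [rec.sent_at for rec in group]
    return max(times) - min(times) <= MAX_SPAN


def candidate_groups(records):
    """Keep only the domain groups that are both distributed and bursty."""
    return {domain: group
            for domain, group in group_by_domain(records).items()
            if is_distributed(group) and is_bursty(group)}
```

A group that passes both checks is only a candidate; in the description above it must also be specific before a signature is accepted.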
Once the domains have been classified into groups, the next step is to generate the regular expressions. Generating a regular expression rather than a token conjunction helps reduce the false positive rate, because the keywords in a token conjunction are only individual words that may or may not be present in an email. After the domain-specific groups have been created, a signature is derived for each group, and classification then no longer depends on the group's domain: it is domain-agnostic. This ensures that botnets which change their domain in future will still be detected, because URLs on the new domain will match the same regular expression and group signature. This is a distinctive feature of AutoRE that helps it find the maximum number of spam emails with a minimum false positive rate. After the emails have been categorised and assigned their regular expressions, it is very important to verify that the emails classified as spam really are spam. This takes two steps. First, the suspected IP addresses are queried against a blacklist; those found in the list are directly classified as spam. For those that are not, behavioural tests are run to decide whether they are spam. The behavioural test is performed on each campaign and considers the following points:-
1). Similarity of Email Properties 
2). Similarity of Sending Time 
3). Similarity of Email Sending Behaviour 
Because the targeted emails are generated and sent by automated systems, the properties above play a major role. Botnets are bound to show a pattern: however random the sending algorithm is designed to be, the sheer frequency of sending produces one.
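The difference between a token conjunction and a domain-agnostic regular expression, discussed earlier in this section, can be made concrete with a small sketch. The function `generalise` is our own simplification and not AutoRE's generation algorithm: it only handles URL paths that share a fixed prefix followed by a purely alphabetic random part. All URLs are invented examples.

```python
import os.path
import re


def token_conjunction_match(url: str, tokens) -> bool:
    """A token conjunction fires when every token occurs anywhere in the URL."""
    return all(token in url for token in tokens)


def generalise(paths):
    """Build one domain-agnostic regex from sample URL paths (illustrative only)."""
    prefix = os.path.commonprefix(paths)
    # Cut the shared prefix back to the last non-letter so the random part stays whole.
    while prefix and prefix[-1].isalpha():
        prefix = prefix[:-1]
    tails = [p[len(prefix):] for p in paths]
    if not all(tail.isalpha() for tail in tails):
        raise ValueError("sketch only handles purely alphabetic variable parts")
    shortest = min(len(tail) for tail in tails)
    longest = max(len(tail) for tail in tails)
    # Any host is allowed, so the signature survives a change of domain.
    return re.compile(r"^https?://[^/]+" + re.escape(prefix)
                      + "[a-zA-Z]{%d,%d}$" % (shortest, longest))


signature = generalise(["/n/?167&abcdefgh", "/n/?167&qwertyuiop"])
legit = "http://forum.example/n/thread167/index.html"
spam_new_domain = "http://other.example/n/?167&zxcvbnmas"

print(token_conjunction_match(legit, ["/n/", "167"]))   # conjunction also hits the legitimate URL
print(bool(signature.match(legit)))                     # the regex does not
print(bool(signature.match(spam_new_domain)))           # but it still matches on a new domain
```

The regular expression encodes the position and length of the variable part, which is the information a bag of tokens throws away.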
The usefulness of the software does not end here: it also lets us study the characteristics of botnets and predict the traffic and spam emails they will generate. This study of botnets has revealed many facts that point to future research in anti-spam systems. The next section presents the results of the botnet study and their use in technologies that emerged after AutoRE.
Future and Related Work 
Characteristics of Botnets and their use in present anti spam systems:- 
1). Spam Sending Patterns over the network 
This characteristic is used in A Dynamic Reputation Service for Spotting Spammers [1]. SpamSpotter is real-time (like AutoRE) reputation software for filtering spam messages. It classifies email senders in real time based on their global sending behaviour, an approach called behavioural detection. SpamSpotter then applies a third-party behavioural machine learning algorithm to this data to generate sender reputations. A preferred algorithm in SpamSpotter is SNARE [5], a network-level behavioural algorithm that identifies spam senders by their email sending behaviour rather than by their addresses or the content they send. In some cases SNARE is so efficient that it can identify a spammer before it has sent a large number of email messages.
AutoRE studied similar behaviour, but SpamSpotter goes a step further by using the SNARE algorithm to calculate the reputation of a sender.
2). Distribution of IP Address 
One of the characteristics of botnets observed during the AutoRE experiments was the distribution of IP addresses. This is a very important characteristic to study, as it can help us understand and stop the wide spread of botnets. This property has been extended by Studying Spamming Botnets Using Botlab [6]. Botnets are the most widely used spamming technique today; it is estimated that 85% of the billions of spam messages are generated by botnets. The paper presents a botnet monitoring platform called Botlab, which monitors all incoming spam traffic at a certain location. It scans the spam messages and obtains bot binaries through spam links. A human operator then runs specific tools on these binaries to obtain information about the bots sending the spam. Botlab then executes multiple captive, sandboxed nodes from various botnets, allowing it to observe the precise outgoing spam feeds from these nodes. It scours the spam feeds for URLs, gathers information on scams, and identifies exploit links. Finally, it correlates the incoming and outgoing spam feeds to identify the most active botnets and the set of compromised hosts comprising each botnet.
Another extension that studies the characteristics of botnets and uses them for detection is BotGraph: Large Scale Spamming Botnet Detection [2]. BotGraph detects abnormal sharing of IP addresses among account holders in an email system. Applied to two months of Hotmail logs totalling 450GB of data, BotGraph successfully identified over 26 million bot accounts with a low false positive rate of 0.44%. BotGraph has also been implemented as a distributed, cluster-based algorithm using the MapReduce technique. It can detect both botnet sign-ups and botnet email accounts that have already been created.
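The idea of detecting abnormal IP sharing among accounts can be sketched as a small graph computation. This is a single-machine simplification under our own assumptions, not the BotGraph implementation: the function name, the `min_shared` threshold and the login data are hypothetical, and the pairwise comparison would not scale to the data volumes reported above.

```python
from collections import defaultdict
from itertools import combinations


def shared_ip_components(logins, min_shared=2):
    """Group accounts that share at least `min_shared` login IPs (illustrative threshold)."""
    ips_by_user = defaultdict(set)
    for user, ip in logins:
        ips_by_user[user].add(ip)

    # Build an undirected graph: an edge joins two accounts with enough common IPs.
    adjacency = defaultdict(set)
    for u, v in combinations(sorted(ips_by_user), 2):
        if len(ips_by_user[u] & ips_by_user[v]) >= min_shared:
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Connected components of the graph are the suspected bot-account groups.
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


logins = [("a", "ip1"), ("a", "ip2"), ("b", "ip1"), ("b", "ip2"),
          ("c", "ip1"), ("c", "ip2"), ("d", "ip9"), ("e", "ip1")]
print(shared_ip_components(logins))
```

Accounts `d` and `e` stay out of the group because they share fewer than two IP addresses with any other account, which is how normal users behind one common address avoid being flagged.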
Another interesting finding of the AutoRE research, from scanning the network traffic, was an increase in the use of static IP addresses from Nov 2006 to July 2007. This finding helps improve blacklists, which can be populated with the static IP addresses. The research also suggested that botnets are evolving and creating more sophisticated, polymorphic URLs to bypass anti-spam systems.
One major disadvantage of AutoRE is that it has not yet been implemented in practice as a real-time system. Its methods are still under investigation, and a real-time deployment is still awaited.
References 
1). Ramachandran, A, Hao, S, Khandelwal, H, Feamster, N & Vempala, S, ‘A Dynamic Reputation Service for Spotting Spammers’, School of Computer Science, Georgia Tech, http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf
2). ‘BotGraph: Large Scale Spamming Botnet Detection’, http://research.microsoft.com/pubs/79413/botgraph.pdf
Sahami, M, Dumais, S, Heckerman, D & Horvitz, E 1998, ‘A Bayesian Approach to Filtering Junk E-mail’, Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin.
3). Cohen, W 1996, ‘Learning Rules that Classify E-Mail’, Advances in Inductive Logic Programming, pp. 124-143.
4). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006 Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202.
5). ‘SNARE: Spatio-temporal Network-level Automatic Reputation Engine’, http://hdl.handle.net/1853/25135
6). ‘Studying Spamming Botnets Using Botlab’, http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf
7). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a Bayesian-filter-based Image Spam Filtering Method’, 2008 International Conference on Information Security and Assurance, IEEE.
