SlideShare a Scribd company logo
1 of 6
Download to read offline
International Journal of Computer Engineering
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976Technology (IJCET),2, Number 1, Jan- April (2011), © IAEME
and – 6375(Online) Volume ISSN 0976 – 6367(Print)                             IJCET
ISSN 0976 – 6375(Online) Volume 2
Number 1, Jan - April (2011), pp. 47-52                                    ©IAEME
© IAEME, http://www.iaeme.com/ijcet.html



INTEGRATION OF FEATURE SETS WITH MACHINE LEARNING
          TECHNIQUES FOR SPAM FILTERING
C.R. Cyril Anthoni                                Dr. A. Christy
Lecturer                                          Professor & Principal
Management Development Institute                  St.Mary’s School of Management Studies
Singapore                                         Rajiv Gandhi Road, Chennai – 119



ABSTRACT

         In this era of Information Technology, Internet is emerging at a rapid pace and it
is playing a major role in the field of communication, business, economy, entertainment
and social groups all over the globe. E-mail is the most preferred tool for communication
as it is easy to use, easy to prioritize, easier for reference, secure and reliable. As emails
are so important for all the computer savvy people, email privacy has become an equally
important concern. This paper proposes a novel approach for spam filtering using
selective feature sets combined with machine learning techniques and the results shows
that approach has produced less number of false positives.

Keywords: False positive, False negative, HMM, TF and IDF

INTRODUCTION

        E-mail operates on the Simple Mail Transfer Protocol (SMTP). The protocol
SMTP was written in 1982 without taking the problem of spam into consideration. The
only way the end users used to differentiate the spam is from the sender’s address.
        The task of spam filtering is to rule out unsolicited e-mails automatically from a
user’s mail stream. Mails contain headers, which contain the source of the message and
the examination of the email header can reveal the sender machine’s IP address. Previous
studies have used the spam filtering techniques like Blacklist, Real-Time Blackhole List,
Whitelist, Greylist, Content-Based Filters, Word-Based Filters, Heuristic Filters,
Bayesian Filters, Challenge/Response System, Collaborative Filters and DNS Lookup
Systems and Feature sets in filtering the e-mails containing spam.. Naïve Bayes, a
commonly used classification spam filtering algorithm is found to be sensitive to feature
selection methods on small feature sets (Le Zhang, Jingbo Zhu and Tianshun Yao,2004).




                                                 47
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME



SECURITY ISSUES

The Unsolicited Commercial E-mail (UCE) messages are often used to force the users to
reveal their personal information. Spam mails are commonly used to ask for information
that can be used by the attackers. The safeguarding of email messages from unauthorized
access and inspection is known as electronic privacy. Much of the Internet is susceptible
to attacks. Emails travel across the unprotected paths. E-mail spams started growing
from 1990s. Spammers collect e-mail addresses from chatrooms, websites, customer lists,
newsgroups, and viruses which harvest user’s address books, and are sold to other
spammers. According to a survey conducted by Message Anti-Abuse Working Group,
the amount of spam email was between 88 to 92%. According to a Commtouch report in
the first quarter of 2010, there are about 183 billion spam messages sent every day. The
most popular spam topic is "pharmacy ads" which make up 81% of e-mail spam
messages, Replica about 5.40%, Enhancers 2.30%, Phishing 2.30%, Degrees 1.30% and
Weight loss 0.40%, etc. When grouped by continents, spam comes mostly from Asia
(37.8%, down from 39.8%) , North America (23.6%, up from 21.8%) , Europe (23.4%,
down from 23.9%) and South America (12.9%, down from 13.2%).

A study of International Data Corporation (IDC) ranked spams in the second position of
the ISPs’ problems. Similarly technology has offered solutions to the email privacy
issues. The security measures include cryptography, digital signatures and the use of
secure protocols, which can ensure email privacy to a certain extent. PGP (Pretty Good
Privacy) is one of the encryption standards that encrypts and decrypts the email message.
This standard of encryption is a part of most of today's operating systems. Email
encryption is another method that achieves email privacy. It is achieved by means of
public key cryptography. The popularly used protocols for email encryption are S/MIME,
TLS and OpenPGP.

Spam filtering can be recast as Text Categorization task where the classes to be predicted
are spam and legitimate. A variety of supervised machine-learning algorithms have been
successfully applied to mail filtering task. Traditionally, filters of this type have so far
been based mostly on manually constructed keyword patterns. An alternative method is
to use a naïve Bayesian classifier trained automatically to detect spam messages. The
studies in Text Categorization have revealed that building classifiers automatically by
applying some machine-learning algorithms (Support vector machine, Ada Boost and
Maximum Entropy Model) to a set of pre-classified documents can produce good
performance across diverse data sets.

SYSTEM ARCHITECTURE

To protect email privacy, it is obvious that the messages have to be encrypted. However,
in order for the filtering process to be effective, the architecture is designed to reduce the
false positives. We have used the standard metrics to measure the spam filtering
accuracy. A ham email that is classified as spam by the filtering scheme is termed as a
false positive. The false positive percentage is defined as the ratio of the number of false


                                                 48
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME

positive emails to the total number of actual ham emails in the dataset used during the
testing phase. A supervised learning algorithm is used, which retrieves the selected
feature sets and finds is threshold value. The general architecture is depicted in shown in
Figure 1.



       Text
                            Feature set               Data          Machine             Rules
                            extraction                Base          learning

                                     Figure 1 General architecture
The feature preserving technique is based upon the concept of Shingles, which has been
used in a wide variety of web and Internet data management problems, such as
redundancy elimination in web caches and search engines, and template and fragment
detection in web pages. Shingles are essentially a set of numbers that act as a fingerprint
of a document. Shingles have the unique property that if two documents vary by a small
amount their shingle sets also differs by a small amount.
        Machine Learning learns through examples, domain knowledge and feedback. As
Information Extraction systems are domain specific, machine learning plays a vital role
in classification and prediction. The algorithm presented in Table 1 illustrates the
architecture. A set of frequently identified spam words are selected as the training sets
and its accuracy in identifying as spam is evaluated by various metrics. If the values are
very close to 0, the feature sets resemble close to the training data sets.

       Input : D(F0,F1,….Fn-1) // a training data set with N features
       S0      // a subset from which to start the search
       δ       // a stopping criterion
           Output : Sbest     // an optimal subset
           begin
      initialize : Sbest = S0 ;
      γbest = eval (S0, D,A) ;
      // evaluate S0 by a set of meaures
           do begin
      S= generate(D) ;       //generate a subset for evaluation
      γ = eval (S, D,A); // evaluate the current subset S by A
      if (γ is better than γbest)
              γbest = γ;
              Sbest = S;
       end until (γ is reached);
       return Sbest;
       end;
                                              Table 1 Algorithm


                                                        49
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME

IDENTIFIED METRICS

The efficiency of the feature sets collected are evaluated by various metrics. The Baye’s
rule is adopted for Hidden Markov Model which relates the conditional probability of an
event

A given B to the conditional probability of B given A:
       PROB (A|B) = PROB (B|A)*PROB (A)/PROB (B)                                             (1)

       Each keyword can be represented as a term vector of the form
ā = (a1,a2,….an), each term ai has a weight wi associated with it and wi denotes the
normalized     frequency    of     word     in    the     vector    space,    where




                                                                                            (2)
where tfi is the term frequency of ai

idfi is inverse document frequency denoted as



                                                                                              (3)

where D is the total number of mails and DF is the number of mails in which a term has
appeared in a collection.

EXPERIMENTS

A list of keywords under the category Filter #1, Filter #2, Filter #3 and Filter #4 from the
website ‘http://www.vaughns-1-pagers.com/internet/spam-word-list.htm’ are collected as
feature sets and trained. The proposed method is compared with two popular spam
filtering approaches, namely Bayesian filtering and simple hash-based collaborative
filtering. In Bayesian filtering, the message is classified as a spam if its spamminess is
greater than or equal to a preset threshold (µ), and vice-versa. On the other hand, using
hash based technique, its decision is based on the number of times the email
corresponding to a particular hash value have been reported as spam. If this spam count
of the hash value corresponding to in-coming email exceeds a threshold, the email is
classified as spam, and otherwise it is classified as ham

The datasets used in the experiments are derived from two publicly available email
corpus, namely TREC email corpus and the SpamAssassin email corpus. One third of
email set including ham and spam are used for training, and the remainder is used for
testing.




                                                 50
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME




                                                                        Figure 2
                                                                               Figure 3
                                                 The good words are selected randomly
from a good word database created from the labeled ham data. We vary the amount of
appended words in the range of 0% to 50% of the original emails’ word count and we call
it degree of attack. The experimental setup consists of 12 agents with each agent
employing 50% of the messages in its email set during the training phase. The initial
experiments showa that the supervised learning algorithm combined with HMM model is
effective in identifying the feature sets, which reduces the amount of false positives.

CONCLUSION

Bayesian schemes are threshold-based approaches, so finding the appropriate thresh-old
to achieve both low false positive and false negative rates is the key to the success of
these approaches. The results from previous experiment are obtained when 50% of the
emails in its email set are used during the training phase. We vary the threshold
parameters of the two schemes and collect the false positive and false negative
percentages. In Figure 2 and Figure 3 we plot the results of the experiment with false
positive percentages on the X-axis and the false negatives on the Y-axis. The results
show that neither of the approaches outperforms the other at all false positive percentage
values. Users have a much lower tolerance of false positives than false negatives, and
anything more than 1% percent false positives is usually considered unacceptable. . Our
initial experiments show that HMM approach is very effective in filtering spam, has high
resilience towards various attacks, and it provides strong privacy protection to the
participating entities.




                                                 51
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN
0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME



REFERENCES

    1.      Ahmed Khorsi (2007), ‘An Overview of Content-Based Spam Filtering
            Techniques’, Informatica 31, pp. 269-277

    2.      Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D.
            Spyropoulos. An evaluation of naïve bayesian anti-spam _ltering. In G.
            Potamias, V. Moustakis, and M.n van Someren, editors, Proceedings of the
            Workshop on Machine Learning in the New Information Age, 11th European
            Conference on Machine Learning (ECML 2000), pages 9{17, Barcelona,
            Spain, 2000a. URL http://arXiv.org/abs/cs.CL/0006013.

    3.      S. Atskins. Size and cost of the problem. In Proceedings of the Fifty-sixth
            Internet Engineering Task Force(IETF) Meeting, (San Francisco, CA), March
            16-21 2003. SpamCon Foundation.

    4.      Christina V, Karpagavalli S and Suganya G (2010), ‘A Study on Email Spam
            Filtering Techniques’, International Journal of Computer Applications (0975 –
            8887), Vol. 12- No. 1, pp. 7-9

    5.      Le Zhang, Jingbo Zhu, Tianshun Yao (2004), ‘An Evaluation of Statistical
            Spam Filtering Techniques’, Asian transactions on Asian Language
            Information processing, Vol. 3, No. 4, pp. 243-269

    6.      Zhenyu Zhong , Lakshmish Ramaswamy and Kang Li , ‘ALPACAS: A
            Large-scale Privacy-Aware Collaborative Anti-spam System’




                                                 52

More Related Content

What's hot

A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...
A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...
A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...ijseajournal
 
Efficient radio resource allocation scheme for 5G networks with device-to-devi...
Efficient radio resource allocation scheme for 5G networks with device-to-devi...Efficient radio resource allocation scheme for 5G networks with device-to-devi...
Efficient radio resource allocation scheme for 5G networks with device-to-devi...IJECEIAES
 
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...ijitcs
 
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENT
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENTUSING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENT
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENTIJNSA Journal
 
Multidimensional scaling algorithm and its current applications in wireless l...
Multidimensional scaling algorithm and its current applications in wireless l...Multidimensional scaling algorithm and its current applications in wireless l...
Multidimensional scaling algorithm and its current applications in wireless l...IDES Editor
 
Implementation of digital image watermarking techniques using dwt and dwt svd...
Implementation of digital image watermarking techniques using dwt and dwt svd...Implementation of digital image watermarking techniques using dwt and dwt svd...
Implementation of digital image watermarking techniques using dwt and dwt svd...eSAT Journals
 
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORK
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORKTRUST MANAGEMENT FOR DELAY TOLERANT NETWORK
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORKIAEME Publication
 
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...IJNSA Journal
 
High performance intrusion detection using modified k mean & naïve bayes
High performance intrusion detection using modified k mean & naïve bayesHigh performance intrusion detection using modified k mean & naïve bayes
High performance intrusion detection using modified k mean & naïve bayeseSAT Journals
 
IRJET- Chaos based Secured Communication in Energy Efficient Wireless Sensor...
IRJET-	 Chaos based Secured Communication in Energy Efficient Wireless Sensor...IRJET-	 Chaos based Secured Communication in Energy Efficient Wireless Sensor...
IRJET- Chaos based Secured Communication in Energy Efficient Wireless Sensor...IRJET Journal
 
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...IRJET Journal
 
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform cscpconf
 
Overlapped clustering approach for maximizing the service reliability of
Overlapped clustering approach for maximizing the service reliability ofOverlapped clustering approach for maximizing the service reliability of
Overlapped clustering approach for maximizing the service reliability ofIAEME Publication
 

What's hot (20)

A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...
A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...
A FUZZY MATHEMATICAL MODEL FOR PEFORMANCE TESTING IN CLOUD COMPUTING USING US...
 
Efficient radio resource allocation scheme for 5G networks with device-to-devi...
Efficient radio resource allocation scheme for 5G networks with device-to-devi...Efficient radio resource allocation scheme for 5G networks with device-to-devi...
Efficient radio resource allocation scheme for 5G networks with device-to-devi...
 
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...
An Efficient Algorithm to Calculate The Connectivity of Hyper-Rings Distribut...
 
Ee4301798802
Ee4301798802Ee4301798802
Ee4301798802
 
Ie3514301434
Ie3514301434Ie3514301434
Ie3514301434
 
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENT
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENTUSING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENT
USING A DEEP UNDERSTANDING OF NETWORK ACTIVITIES FOR SECURITY EVENT MANAGEMENT
 
170 328-1-pb
170 328-1-pb170 328-1-pb
170 328-1-pb
 
135 139
135 139135 139
135 139
 
Multidimensional scaling algorithm and its current applications in wireless l...
Multidimensional scaling algorithm and its current applications in wireless l...Multidimensional scaling algorithm and its current applications in wireless l...
Multidimensional scaling algorithm and its current applications in wireless l...
 
Implementation of digital image watermarking techniques using dwt and dwt svd...
Implementation of digital image watermarking techniques using dwt and dwt svd...Implementation of digital image watermarking techniques using dwt and dwt svd...
Implementation of digital image watermarking techniques using dwt and dwt svd...
 
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORK
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORKTRUST MANAGEMENT FOR DELAY TOLERANT NETWORK
TRUST MANAGEMENT FOR DELAY TOLERANT NETWORK
 
Ax34298305
Ax34298305Ax34298305
Ax34298305
 
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...
AUTHENTICATION USING TRUST TO DETECT MISBEHAVING NODES IN MOBILE AD HOC NETWO...
 
High performance intrusion detection using modified k mean & naïve bayes
High performance intrusion detection using modified k mean & naïve bayesHigh performance intrusion detection using modified k mean & naïve bayes
High performance intrusion detection using modified k mean & naïve bayes
 
IRJET- Chaos based Secured Communication in Energy Efficient Wireless Sensor...
IRJET-	 Chaos based Secured Communication in Energy Efficient Wireless Sensor...IRJET-	 Chaos based Secured Communication in Energy Efficient Wireless Sensor...
IRJET- Chaos based Secured Communication in Energy Efficient Wireless Sensor...
 
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...
Hybrid Approach for Improving Data Security and Size Reduction in Image Stega...
 
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform
SVD Based Robust Digital Watermarking For Still Images Using Wavelet Transform
 
Overlapped clustering approach for maximizing the service reliability of
Overlapped clustering approach for maximizing the service reliability ofOverlapped clustering approach for maximizing the service reliability of
Overlapped clustering approach for maximizing the service reliability of
 
E42022125
E42022125E42022125
E42022125
 
D24019026
D24019026D24019026
D24019026
 

Similar to Integration of feature sets with machine learning techniques

A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clusteringidescitation
 
Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...ijsrd.com
 
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mail
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mailText Based Fuzzy Clustering Algorithm to Filter Spam E-mail
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mailijsrd.com
 
Study of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsStudy of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsIRJET Journal
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesIRJET Journal
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
The Detection of Suspicious Email Based on Decision Tree ...
The Detection of Suspicious Email Based on Decision Tree                     ...The Detection of Suspicious Email Based on Decision Tree                     ...
The Detection of Suspicious Email Based on Decision Tree ...IRJET Journal
 
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMEMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMIRJET Journal
 
A Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMA Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMIRJET Journal
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmjournalBEEI
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmjournalBEEI
 
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...IJNSA Journal
 
Prepare black list using bayesian approach to improve performance of spam fil...
Prepare black list using bayesian approach to improve performance of spam fil...Prepare black list using bayesian approach to improve performance of spam fil...
Prepare black list using bayesian approach to improve performance of spam fil...IAEME Publication
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectioneSAT Publishing House
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectioneSAT Journals
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectioneSAT Publishing House
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam FilteringIOSR Journals
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...IRJET Journal
 

Similar to Integration of feature sets with machine learning techniques (20)

A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clustering
 
Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...
 
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mail
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mailText Based Fuzzy Clustering Algorithm to Filter Spam E-mail
Text Based Fuzzy Clustering Algorithm to Filter Spam E-mail
 
Study of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam EmailsStudy of Various Techniques to Filter Spam Emails
Study of Various Techniques to Filter Spam Emails
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering Techniques
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
The Detection of Suspicious Email Based on Decision Tree ...
The Detection of Suspicious Email Based on Decision Tree                     ...The Detection of Suspicious Email Based on Decision Tree                     ...
The Detection of Suspicious Email Based on Decision Tree ...
 
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMEMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
 
A Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMA Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVM
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithm
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithm
 
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
 
Prepare black list using bayesian approach to improve performance of spam fil...
Prepare black list using bayesian approach to improve performance of spam fil...Prepare black list using bayesian approach to improve performance of spam fil...
Prepare black list using bayesian approach to improve performance of spam fil...
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detection
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detection
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detection
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
 

More from iaemedu

Tech transfer making it as a risk free approach in pharmaceutical and biotech in
Tech transfer making it as a risk free approach in pharmaceutical and biotech inTech transfer making it as a risk free approach in pharmaceutical and biotech in
Tech transfer making it as a risk free approach in pharmaceutical and biotech iniaemedu
 
Website based patent information searching mechanism
Website based patent information searching mechanismWebsite based patent information searching mechanism
Website based patent information searching mechanismiaemedu
 
Prediction of customer behavior using cma
Prediction of customer behavior using cmaPrediction of customer behavior using cma
Prediction of customer behavior using cmaiaemedu
 
Performance analysis of manet routing protocol in presence
Performance analysis of manet routing protocol in presencePerformance analysis of manet routing protocol in presence
Performance analysis of manet routing protocol in presenceiaemedu
 
Performance measurement of different requirements engineering
Performance measurement of different requirements engineeringPerformance measurement of different requirements engineering
Performance measurement of different requirements engineeringiaemedu
 
Mobile safety systems for automobiles
Mobile safety systems for automobilesMobile safety systems for automobiles
Mobile safety systems for automobilesiaemedu
 
Efficient text compression using special character replacement
Efficient text compression using special character replacementEfficient text compression using special character replacement
Efficient text compression using special character replacementiaemedu
 
Agile programming a new approach
Agile programming a new approachAgile programming a new approach
Agile programming a new approachiaemedu
 
Adaptive load balancing techniques in global scale grid environment
Adaptive load balancing techniques in global scale grid environmentAdaptive load balancing techniques in global scale grid environment
Adaptive load balancing techniques in global scale grid environmentiaemedu
 
A survey on the performance of job scheduling in workflow application
A survey on the performance of job scheduling in workflow applicationA survey on the performance of job scheduling in workflow application
A survey on the performance of job scheduling in workflow applicationiaemedu
 
A survey of mitigating routing misbehavior in mobile ad hoc networks
A survey of mitigating routing misbehavior in mobile ad hoc networksA survey of mitigating routing misbehavior in mobile ad hoc networks
A survey of mitigating routing misbehavior in mobile ad hoc networksiaemedu
 
A novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyA novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyiaemedu
 
A self recovery approach using halftone images for medical imagery
A self recovery approach using halftone images for medical imageryA self recovery approach using halftone images for medical imagery
A self recovery approach using halftone images for medical imageryiaemedu
 
A comprehensive study of non blocking joining technique
A comprehensive study of non blocking joining techniqueA comprehensive study of non blocking joining technique
A comprehensive study of non blocking joining techniqueiaemedu
 
A comparative study on multicast routing using dijkstra’s
A comparative study on multicast routing using dijkstra’sA comparative study on multicast routing using dijkstra’s
A comparative study on multicast routing using dijkstra’siaemedu
 
The detection of routing misbehavior in mobile ad hoc networks
The detection of routing misbehavior in mobile ad hoc networksThe detection of routing misbehavior in mobile ad hoc networks
The detection of routing misbehavior in mobile ad hoc networksiaemedu
 
Visual cryptography scheme for color images
Visual cryptography scheme for color imagesVisual cryptography scheme for color images
Visual cryptography scheme for color imagesiaemedu
 
Software process methodologies and a comparative study of various models
Software process methodologies and a comparative study of various modelsSoftware process methodologies and a comparative study of various models
Software process methodologies and a comparative study of various modelsiaemedu
 
Software metric analysis methods for product development
Software metric analysis methods for product developmentSoftware metric analysis methods for product development
Software metric analysis methods for product developmentiaemedu
 
Software process and product quality assurance in it organizations
Software process and product quality assurance in it organizationsSoftware process and product quality assurance in it organizations
Software process and product quality assurance in it organizationsiaemedu
 

More from iaemedu (20)

Tech transfer making it as a risk free approach in pharmaceutical and biotech in
Tech transfer making it as a risk free approach in pharmaceutical and biotech inTech transfer making it as a risk free approach in pharmaceutical and biotech in
Tech transfer making it as a risk free approach in pharmaceutical and biotech in
 
Website based patent information searching mechanism
Website based patent information searching mechanismWebsite based patent information searching mechanism
Website based patent information searching mechanism
 
Prediction of customer behavior using cma
Prediction of customer behavior using cmaPrediction of customer behavior using cma
Prediction of customer behavior using cma
 
Performance analysis of manet routing protocol in presence
Performance analysis of manet routing protocol in presencePerformance analysis of manet routing protocol in presence
Performance analysis of manet routing protocol in presence
 
Performance measurement of different requirements engineering
Performance measurement of different requirements engineeringPerformance measurement of different requirements engineering
Performance measurement of different requirements engineering
 
Mobile safety systems for automobiles
Mobile safety systems for automobilesMobile safety systems for automobiles
Mobile safety systems for automobiles
 
Efficient text compression using special character replacement
Efficient text compression using special character replacementEfficient text compression using special character replacement
Efficient text compression using special character replacement
 
Agile programming a new approach
Agile programming a new approachAgile programming a new approach
Agile programming a new approach
 
Adaptive load balancing techniques in global scale grid environment
Adaptive load balancing techniques in global scale grid environmentAdaptive load balancing techniques in global scale grid environment
Adaptive load balancing techniques in global scale grid environment
 
A survey on the performance of job scheduling in workflow application
A survey on the performance of job scheduling in workflow applicationA survey on the performance of job scheduling in workflow application
A survey on the performance of job scheduling in workflow application
 
A survey of mitigating routing misbehavior in mobile ad hoc networks
A survey of mitigating routing misbehavior in mobile ad hoc networksA survey of mitigating routing misbehavior in mobile ad hoc networks
A survey of mitigating routing misbehavior in mobile ad hoc networks
 
A novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyA novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classify
 
A self recovery approach using halftone images for medical imagery
A self recovery approach using halftone images for medical imageryA self recovery approach using halftone images for medical imagery
A self recovery approach using halftone images for medical imagery
 
A comprehensive study of non blocking joining technique
A comprehensive study of non blocking joining techniqueA comprehensive study of non blocking joining technique
A comprehensive study of non blocking joining technique
 
A comparative study on multicast routing using dijkstra’s
A comparative study on multicast routing using dijkstra’sA comparative study on multicast routing using dijkstra’s
A comparative study on multicast routing using dijkstra’s
 
The detection of routing misbehavior in mobile ad hoc networks
The detection of routing misbehavior in mobile ad hoc networksThe detection of routing misbehavior in mobile ad hoc networks
The detection of routing misbehavior in mobile ad hoc networks
 
Visual cryptography scheme for color images
Visual cryptography scheme for color imagesVisual cryptography scheme for color images
Visual cryptography scheme for color images
 
Software process methodologies and a comparative study of various models
Software process methodologies and a comparative study of various modelsSoftware process methodologies and a comparative study of various models
Software process methodologies and a comparative study of various models
 
Software metric analysis methods for product development
Software metric analysis methods for product developmentSoftware metric analysis methods for product development
Software metric analysis methods for product development
 
Software process and product quality assurance in it organizations
Software process and product quality assurance in it organizationsSoftware process and product quality assurance in it organizations
Software process and product quality assurance in it organizations
 

Integration of feature sets with machine learning techniques

  • 1. International Journal of Computer Engineering International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976Technology (IJCET),2, Number 1, Jan- April (2011), © IAEME and – 6375(Online) Volume ISSN 0976 – 6367(Print) IJCET ISSN 0976 – 6375(Online) Volume 2 Number 1, Jan - April (2011), pp. 47-52 ©IAEME © IAEME, http://www.iaeme.com/ijcet.html INTEGRATION OF FEATURE SETS WITH MACHINE LEARNING TECHNIQUES FOR SPAM FILTERING C.R. Cyril Anthoni Dr. A. Christy Lecturer Professor & Principal Management Development Institute St.Mary’s School of Management Studies Singapore Rajiv Gandhi Road, Chennai – 119 ABSTRACT In this era of Information Technology, Internet is emerging at a rapid pace and it is playing a major role in the field of communication, business, economy, entertainment and social groups all over the globe. E-mail is the most preferred tool for communication as it is easy to use, easy to prioritize, easier for reference, secure and reliable. As emails are so important for all the computer savvy people, email privacy has become an equally important concern. This paper proposes a novel approach for spam filtering using selective feature sets combined with machine learning techniques and the results shows that approach has produced less number of false positives. Keywords: False positive, False negative, HMM, TF and IDF INTRODUCTION E-mail operates on the Simple Mail Transfer Protocol (SMTP). The protocol SMTP was written in 1982 without taking the problem of spam into consideration. The only way the end users used to differentiate the spam is from the sender’s address. The task of spam filtering is to rule out unsolicited e-mails automatically from a user’s mail stream. Mails contain headers, which contain the source of the message and the examination of the email header can reveal the sender machine’s IP address. Previous studies have used the spam filtering techniques like Blacklist, Real-Time Blackhole List, Whitelist, Greylist, Content-Based Filters, Word-Based Filters, Heuristic Filters, Bayesian Filters, Challenge/Response System, Collaborative Filters and DNS Lookup Systems and Feature sets in filtering the e-mails containing spam.. Naïve Bayes, a commonly used classification spam filtering algorithm is found to be sensitive to feature selection methods on small feature sets (Le Zhang, Jingbo Zhu and Tianshun Yao,2004). 47
  • 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME SECURITY ISSUES The Unsolicited Commercial E-mail (UCE) messages are often used to force the users to reveal their personal information. Spam mails are commonly used to ask for information that can be used by the attackers. The safeguarding of email messages from unauthorized access and inspection is known as electronic privacy. Much of the Internet is susceptible to attacks. Emails travel across the unprotected paths. E-mail spams started growing from 1990s. Spammers collect e-mail addresses from chatrooms, websites, customer lists, newsgroups, and viruses which harvest user’s address books, and are sold to other spammers. According to a survey conducted by Message Anti-Abuse Working Group, the amount of spam email was between 88 to 92%. According to a Commtouch report in the first quarter of 2010, there are about 183 billion spam messages sent every day. The most popular spam topic is "pharmacy ads" which make up 81% of e-mail spam messages, Replica about 5.40%, Enhancers 2.30%, Phishing 2.30%, Degrees 1.30% and Weight loss 0.40%, etc. When grouped by continents, spam comes mostly from Asia (37.8%, down from 39.8%) , North America (23.6%, up from 21.8%) , Europe (23.4%, down from 23.9%) and South America (12.9%, down from 13.2%). A study of International Data Corporation (IDC) ranked spams in the second position of the ISPs’ problems. Similarly technology has offered solutions to the email privacy issues. The security measures include cryptography, digital signatures and the use of secure protocols, which can ensure email privacy to a certain extent. PGP (Pretty Good Privacy) is one of the encryption standards that encrypts and decrypts the email message. This standard of encryption is a part of most of today's operating systems. Email encryption is another method that achieves email privacy. It is achieved by means of public key cryptography. The popularly used protocols for email encryption are S/MIME, TLS and OpenPGP. Spam filtering can be recast as Text Categorization task where the classes to be predicted are spam and legitimate. A variety of supervised machine-learning algorithms have been successfully applied to mail filtering task. Traditionally, filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative method is to use a naïve Bayesian classifier trained automatically to detect spam messages. The studies in Text Categorization have revealed that building classifiers automatically by applying some machine-learning algorithms (Support vector machine, Ada Boost and Maximum Entropy Model) to a set of pre-classified documents can produce good performance across diverse data sets. SYSTEM ARCHITECTURE To protect email privacy, it is obvious that the messages have to be encrypted. However, in order for the filtering process to be effective, the architecture is designed to reduce the false positives. We have used the standard metrics to measure the spam filtering accuracy. A ham email that is classified as spam by the filtering scheme is termed as a false positive. The false positive percentage is defined as the ratio of the number of false 48
  • 3. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME positive emails to the total number of actual ham emails in the dataset used during the testing phase. A supervised learning algorithm is used, which retrieves the selected feature sets and finds is threshold value. The general architecture is depicted in shown in Figure 1. Text Feature set Data Machine Rules extraction Base learning Figure 1 General architecture The feature preserving technique is based upon the concept of Shingles, which has been used in a wide variety of web and Internet data management problems, such as redundancy elimination in web caches and search engines, and template and fragment detection in web pages. Shingles are essentially a set of numbers that act as a fingerprint of a document. Shingles have the unique property that if two documents vary by a small amount their shingle sets also differs by a small amount. Machine Learning learns through examples, domain knowledge and feedback. As Information Extraction systems are domain specific, machine learning plays a vital role in classification and prediction. The algorithm presented in Table 1 illustrates the architecture. A set of frequently identified spam words are selected as the training sets and its accuracy in identifying as spam is evaluated by various metrics. If the values are very close to 0, the feature sets resemble close to the training data sets. Input : D(F0,F1,….Fn-1) // a training data set with N features S0 // a subset from which to start the search δ // a stopping criterion Output : Sbest // an optimal subset begin initialize : Sbest = S0 ; γbest = eval (S0, D,A) ; // evaluate S0 by a set of meaures do begin S= generate(D) ; //generate a subset for evaluation γ = eval (S, D,A); // evaluate the current subset S by A if (γ is better than γbest) γbest = γ; Sbest = S; end until (γ is reached); return Sbest; end; Table 1 Algorithm 49
  • 4. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME IDENTIFIED METRICS The efficiency of the feature sets collected are evaluated by various metrics. The Baye’s rule is adopted for Hidden Markov Model which relates the conditional probability of an event A given B to the conditional probability of B given A: PROB (A|B) = PROB (B|A)*PROB (A)/PROB (B) (1) Each keyword can be represented as a term vector of the form ā = (a1,a2,….an), each term ai has a weight wi associated with it and wi denotes the normalized frequency of word in the vector space, where (2) where tfi is the term frequency of ai idfi is inverse document frequency denoted as (3) where D is the total number of mails and DF is the number of mails in which a term has appeared in a collection. EXPERIMENTS A list of keywords under the category Filter #1, Filter #2, Filter #3 and Filter #4 from the website ‘http://www.vaughns-1-pagers.com/internet/spam-word-list.htm’ are collected as feature sets and trained. The proposed method is compared with two popular spam filtering approaches, namely Bayesian filtering and simple hash-based collaborative filtering. In Bayesian filtering, the message is classified as a spam if its spamminess is greater than or equal to a preset threshold (µ), and vice-versa. On the other hand, using hash based technique, its decision is based on the number of times the email corresponding to a particular hash value have been reported as spam. If this spam count of the hash value corresponding to in-coming email exceeds a threshold, the email is classified as spam, and otherwise it is classified as ham The datasets used in the experiments are derived from two publicly available email corpus, namely TREC email corpus and the SpamAssassin email corpus. One third of email set including ham and spam are used for training, and the remainder is used for testing. 50
  • 5. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME Figure 2 Figure 3 The good words are selected randomly from a good word database created from the labeled ham data. We vary the amount of appended words in the range of 0% to 50% of the original emails’ word count and we call it degree of attack. The experimental setup consists of 12 agents with each agent employing 50% of the messages in its email set during the training phase. The initial experiments showa that the supervised learning algorithm combined with HMM model is effective in identifying the feature sets, which reduces the amount of false positives. CONCLUSION Bayesian schemes are threshold-based approaches, so finding the appropriate thresh-old to achieve both low false positive and false negative rates is the key to the success of these approaches. The results from previous experiment are obtained when 50% of the emails in its email set are used during the training phase. We vary the threshold parameters of the two schemes and collect the false positive and false negative percentages. In Figure 2 and Figure 3 we plot the results of the experiment with false positive percentages on the X-axis and the false negatives on the Y-axis. The results show that neither of the approaches outperforms the other at all false positive percentage values. Users have a much lower tolerance of false positives than false negatives, and anything more than 1% percent false positives is usually considered unacceptable. . Our initial experiments show that HMM approach is very effective in filtering spam, has high resilience towards various attacks, and it provides strong privacy protection to the participating entities. 51
  • 6. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 2, Number 1, Jan- April (2011), © IAEME REFERENCES 1. Ahmed Khorsi (2007), ‘An Overview of Content-Based Spam Filtering Techniques’, Informatica 31, pp. 269-277 2. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos. An evaluation of naïve bayesian anti-spam _ltering. In G. Potamias, V. Moustakis, and M.n van Someren, editors, Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), pages 9{17, Barcelona, Spain, 2000a. URL http://arXiv.org/abs/cs.CL/0006013. 3. S. Atskins. Size and cost of the problem. In Proceedings of the Fifty-sixth Internet Engineering Task Force(IETF) Meeting, (San Francisco, CA), March 16-21 2003. SpamCon Foundation. 4. Christina V, Karpagavalli S and Suganya G (2010), ‘A Study on Email Spam Filtering Techniques’, International Journal of Computer Applications (0975 – 8887), Vol. 12- No. 1, pp. 7-9 5. Le Zhang, Jingbo Zhu, Tianshun Yao (2004), ‘An Evaluation of Statistical Spam Filtering Techniques’, Asian transactions on Asian Language Information processing, Vol. 3, No. 4, pp. 243-269 6. Zhenyu Zhong , Lakshmish Ramaswamy and Kang Li , ‘ALPACAS: A Large-scale Privacy-Aware Collaborative Anti-spam System’ 52