SlideShare a Scribd company logo
1 of 5
Download to read offline
ISSN: 2278 – 1323
                                     International Journal of Advanced Research in Computer Engineering & Technology
                                                                                         Volume 1, Issue 4, June 2012


                          Accurate Spam Mail Detection
                            using Bayesian Algorithm
                                           V.Kanaka Durga, M.R. Raja Ramesh


    Abstract— With the increasing popularity of a E-mail          your address is posted on a public forum or visible on a
users, E-mail spam problem growing proportionally. Spam           website, spammers will find it and start to bombard you.
filtering with near duplicate matching scheme is widely           Most spam messages on the internet today are
discussed in recent years. It is based on a known spam
database formed by user feedback which cannot fully catch         advertisements.
the evolving nature of spam and also it requires much             B. Spam Problems
storage. In view of above drawbacks, we proposed an
effective spam detection scheme based on Bayesian                 Spammers often says that spam is not a problem.” Just hit
approach. First we use Bayesian filter based on single            delete if you don‟t want to see it”. And many spam
keyword sets and then we improve the scheme by using              messages carry the tagline. ”If you don‟t want to receive
multiple keyword sets and satisfactory results obtained.          further mailing reply, we will remove you.” But spam is
                                                                  huge problem. It consumes network bandwidth, storage
  Index Terms— Spam detection, Near Duplicate Matching,           space and slowdown E-mail servers.
SAG.
                                                                  C. Flavors of spam:
                                                                  Unsolicited Commercial E-mail (UCE)-An E-mail
                     I.   INTRODUCTION                            messages that you receive without asking for it
With the increasing popularity of electronic mail (or e-          advertising a product or service. This is also called junk
mail), several people and companies found it an easy way          E-mail. Unsolicited Bulk E-mail(UBE)-UBE refers to E-
to distribute a massive amount of unsolicited messages to         mail messages that are sent in bulk to thousands of
a tremendous number of users at a very low cost. These            recipients. BE may be commercial in nature, in which
unwanted bulk messages or junk emails are called spam             case it is also UCE. But it may be sent for other purposes
messages. The majority of spam messages that has been             as well, such as political lobbying or harassment. Make
reported recently are unsolicited commercials promoting           Money Fast(MMF)-MMF messages in the form of chain
services and products including sexual enhancers, cheap           letters or multi-level marketing schemes, are messages
drugs and herbal supplements, health insurance, travel            that suggest you can get rich by sending money to the top
tickets, hotel reservations, and software products. They          name on list, removing that name, adding your name to
can also include offensive content such pornographic              the bottom of the list ,and forwarding the messages to the
images and can be used as well for spreading rumors and           other people.
other fraudulent advertisements such as make money fast.             E-mail communication is prevalent and indispensable
E-mail spam has become an epidemic problem that can               nowadays. However, the threat of unsolicited junk e-
negatively affect the usability of electronic mail as a           mails, also known as spams, becomes more and more
communication means. Besides wasting users‟ time and              serious. According to a survey by the website Top Ten
effort to scan and delete the massive amount of junk e-           REVIEWS, 40 percent of e-mails were considered as
mails received; it consumes network bandwidth and                 spams in 2006. The statistics collected by MessageLabs
storage space, slows down e-mail servers, and provides a          show that recently the spam rate is over 70 percent and
medium to distribute harmful and/or offensive content.            persistently remains high. The primary challenge of spam
Spam emails are getting better in its ability to break anti-      detection problem lies in the fact that spammers will
spam filters and it would take a great deal to get it fully       always find new ways to attack spam filters owing to the
eradicated. Spammers are also becoming more                       economic benefits of sending spams. Note that existing
innovative, so that the anti-spam research is having a            filters generally perform well when dealing with clumsy
great relevance these days.                                       spams, which have duplicate content with suspicious
                                                                  keywords or are sent from an identical notorious server.
A. How do spammers get your address?                              Therefore, the next stage of spam detection research
If you or anyone else ever gives your E-mail address on           should focus on coping with cunning spams which evolve
website, then your E-mail address is out there for the            naturally and continuously.
world to use because the companies often sell your                In view of above facts, the notion of collaborative spam
address. For instance If you go to a website and send a           filtering with near-duplicate similarity matching scheme
funny cartoon, or an electronic greeting to a friend you          has recently received much attention. The primary idea of
have given out their E-mail address and the rest is history.      the near-duplicate matching scheme for spam detection is
Nothing is free, and these websites make their money by           to maintain a known spam database, formed by user
selling advertising and selling your E-mail address. If           feedback, to block subsequent spams with similar



                                             All Rights Reserved © 2012 IJARCET                                         402
ISSN: 2278 – 1323
                                      International Journal of Advanced Research in Computer Engineering & Technology
                                                                                                  Volume 1, Issue 4, June 2012
content. Collaborative filtering indicates that user              mechanism is applied, the most critical factor is how to
knowledge of what spam may subsequently appear is                 represent each e-mail for near-duplicate matching. The e-
collected to detect following spams. Overall, there are           mail abstraction not only should capture the near-
three key points of this type of spam detection approach          duplicate phenomenon of spams, but should avoid
we have to be concerned about. First, an effective                accidental deletion of hams.
representation of e-mail (i.e., e-mail abstraction) is
essential. Since a large set of reported spams have to be         Preprocessing Steps:
stored in the known spam database, the storage size of e-
                                                                    A. E-mail Abstraction Scheme
mail abstraction should be small. Moreover, the e-mail
abstraction     should      capture   the    near-duplicate          Here we introduced a novel e-mail abstraction scheme,
phenomenon of spams, and should avoid accidental                  SAG procedure is presented to depict the generation
deletion of non-spam e-mails (also known as hams).                process of an e-mail abstraction in II-A-1, and the
Second, every incoming e-mail has to be matched with              robustness issue is discussed in Section II-A-2.
the large database, meaning that the near-duplicate                  1) Structure Abstraction Generation: We propose the
matching process should be substantially efficient.               SAG procedure to generate the e-mail abstraction using
Finally, the latest spams to be included instantly and            HTML content in e-mail. The algorithmic form of SAG is
successively into the database so as to effectively block         depicted in Fig. 1. Using three major phases Procedure
subsequent near-duplicate spams.                                  SAG is composed, Tag Extraction Phase, Tag Reordering
One subtle difference is that a false positive would be a         Phase, and <anchor> Appending Phase. In Tag Extraction
more serious error than a false negative as a false positive      Phase, the name of each HTML tag is extracted, and tag
would mean that an important e-mail was identified as             attributes and attribute values are eliminated.. One
spam and rejected. According to a leading body in IT,             objective of this preprocessing step is to remove tags that
inaccurate anti-spam solutions may be responsible for             are common but not discriminative between e-mails. The
wasting more than five million working hours a year on            other objective is to prevent malicious tag insertion
checking that legitimate messages were not mistakenly             attack, and thus the robust-ness of the proposed
quarantined. We explore a new approach based on                   abstraction scheme can be further enhanced. The
Bayesian approach that can automatically classify e-mail          following sequence of operations is performed in the
messages as spam or legitimate.            We study its           preprocessing step.
performance for various conjunction and disjunction
                                                                     1. Front and rear tags are excluded.
operators for several datasets. The results are promising
                                                                     2. Nonempty tags 2 that have no corresponding start
as compared with a previous classifiers such as genetic                 tags or end tags are deleted. Besides, mismatched
algorithm and XCS (extended Classifier System)                          nonempty tags are also deleted.
classifier system. Classification accuracy above 97% and             3. All empty tags 2 are regarded as the same and are
low false positive rates are achieved in many test cases.               replaced by the newly created <empty/> tag.
                                                                        Moreover, successive <empty/> tags are pruned and
                      II. RELATED WORKS                                 only one <empty/> tag is retained.
   Nowadays E-mail spam problem is increasing                        4. The pairs of nonempty tags enclosing nothing are
drastically. So to relieve this problem various techniques              removed.
have been explored. Previous works on spam detection                 Procedure SAG
can be classified into three categories: a) content-based            Input: the email with text/html content-type, the tag
methods, b) non content-based methods, and c) others .                       length threshold (Lth_short) of the short email
Initially, researchers analyze e-mail content text and               Output: the email abstraction (EA) of the input email
model this problem as a binary text classification task.             1 // Tag Extraction Phase
Representatives of this category are Naive Bayes [1], [2]            2 Transform each tag to <tag.name>;
                                                                     3 Transform each paragraph of text to <mytext/>;
and Support Vector Machines (SVMs) [5], [10], [9], [14]
                                                                     4 AnchorSet = the union of all <anchor>
methods. In general, Naive Bayes methods train a
                                                                     5 EA = concatenation of <tag.name>;
probability model using classified e-mails, and each word
                                                                     6 Preprocess the tag sequence of EA;
in e-mails will be given a probability of being a                    7 // Tag Reordering Phase:
suspicious spam keyword. The other group attempts to                 8 for(for each tag of EA) // pn: position number
exploit non content information such as e-mail header, e-            9      tag.new_pn = ASSIGN_PN (EA.tag_length,
mail social network ,and e -mail traffic [13],[9] to filter       tag.pn))
spams . Collecting notorious and innocent sender                     10 Put the tag to the position tag.new_pn;
addresses (or IP addresses) from e-mail header to create             11 EA = the concatenation of <tag.name> with
black list and white list is a commonly applied method            new_pn;
initially. On the other hand, collaborative spam filtering           12 //<anchor> Appending Phase
with near-duplicate similarity matching scheme has been              13 if (EA.tag_length<Lth_short)
studied widely in recent years. Regarding collaborative              14 Append AnchorSet in font EA;
mechanism, P2P-based architecture [12], centralized                  15 return EA;
server-based system [3], [16], [15], and others [6], [8] are         End
generally employed. Note that no matter which                                    Fig. 1 Algorithm form of procedure SAG.




                                             All Rights Reserved © 2012 IJARCET                                             403
ISSN: 2278 – 1323
                                     International Journal of Advanced Research in Computer Engineering & Technology
                                                                                         Volume 1, Issue 4, June 2012
   2) Robustness Issue: The near-duplicate spam
detection‟s main difficulty is to withstand malicious
attack by spammers. E-mail abstractions are generated
based mainly on hash-based content text in prior
approaches. For example, the authors in [3], [16], [15]
extract words or terms to generate the e-mail abstraction.
Besides, substrings extracted by various techniques are
widely employed in [7], [12], [11], [6], [8], [4]. This type
of e-mail representation has following disadvantages.
Firstly, the insertion of a randomized and normal
paragraph can easily defeat this type of spam filters.
Moreover, since the structures and features of different
languages are diverse, word and substring extraction may
not be applicable to e-mails in all languages. Concretely
speaking, for instance, trigrams of substrings used in [7],
[8], [6] are not suitable for non alphabetic languages,
such as Chinese. In this paper, we devise a novel e-mail
abstraction scheme that considers e-mail layout structure
to represent e-mails. To assess the robustness of the
proposed scheme, we model possible spammer attacks                            Fig. 2 Examples of possible Spammer Attacks
and organize these attacks as following three categories.
Examples and the outputs of preprocessing of procedure               Note that due to space limitation, we are not able to
SAG are shown in Fig. 1.                                          discuss all possible situations. Nevertheless, representing
                                                                  e-mails with layout structure is more robust to most
  A) Random Paragraph Insertion                                   existing attacks than text-based approaches. Even though
    This type of spammer attack is commonly used now-             new attack has been designed, we can react against it by
a-days. As shown in Fig. 2, normal contents without any           adjusting the preprocessing step of procedure SAG. On
advertisement keywords are inserted to confuse text-              the other hand, our approach extracts only HTML tag
based spam filtering techniques. It is noted that our             sequences and transforms each paragraph with no tag
scheme transforms each paragraph into a newly created             embedded to<mytext/>, meaning that the proposed
tag <mytext/>, and consecutive empty tags will then be            abstraction scheme can be applied to e-mails in all
transformed to <empty/>. As such, the representation of           languages without modify in any components.
each random inserted paragraph is identical, and thus our
scheme is resistant to this type of attack.
  B) Random HTML Tag Insertion                                                      III. BAYESIAN ALGORITHM
     If spammers know that the proposed scheme is based             One of the most effective and intelligent solutions to
on HTML tag sequences, random HTML tags will be                   combat spam email nowadays is Bayesian filtering.
inserted rather than random paragraphs. On the one hand,            How does Bayesian filtering work?
arbitrary tag insertion will cause syntax errors due to tag
                                                                     Naive Bayes classifier is used to identify spam e-mail.
mismatching. This may lead to abnormal display of spam
                                                                  It is based on the principle that most events are dependent
content that spammers do not wish this to happen. On the
                                                                  and that the probability of an event occurring in the future
other hand, procedure SAG also adopts some heuristics
                                                                  can be inferred from the previous occurrences of that
(as depicted in Section III-C to deal with the random
                                                                  event. The same technique can be used to classify spam.
insertion of empty tags and the tag mismatching of
                                                                  If some piece of text occurs often in spam but not in
nonempty tags. Fig. 2 shows two example outputs and the
                                                                  legitimate mail, then it would be reasonable to assume
details of each step can be found in Fig. 2. With the
                                                                  that this email is probably spam.
proposed method, most random inserted tags will be
                                                                     Bayesian filters must be „trained‟ to work effectively.
removed, and thus the effectiveness of the attack of
                                                                  Particular words have certain probabilities (also known as
random tag insertion is limited.
                                                                  likelihood functions) of occurring in spam email but not
                                                                  in legitimate email. For instance, most email users will
                                                                  frequently encounter the word Viagra in spam email, but
                                                                  will seldom see it in other email. Before mail can be
  C) Sophisticated HTML Tag Insertion:                            filtered using this method, the user needs to generate a
   Suppose that spammers are more sophisticated, they             database with words and tokens (such as the $ sign, IP
may insert legal HTML tag patterns. As shown in Fig. 1,           addresses and domains, and so on), collected from a
if tag patterns that do conform to syntax rules are               sample of spam mail and valid mail (referred to as
inserted, they will not be eliminated.                            „ham‟). For all words in each training email, the filter will
                                                                  adjust the probabilities that each word will appear in
                                                                  spam or legitimate email in its database.



                                             All Rights Reserved © 2012 IJARCET                                             404
ISSN: 2278 – 1323
                                      International Journal of Advanced Research in Computer Engineering & Technology
                                                                                            Volume 1, Issue 4, June 2012
  A. Mathematical Foundation                                      conforms to the 50% hypothesis about repartition
   Bayes theorem is used several times in the context of          between spam and ham, i.e. that the datasets of spam and
spam:                                                             ham are of same size.
   1) First time, to compute the probability that the                Of course, determining whether a message is spam or
message is spam, knowing that a given word appears in             ham based only on the presence of the word "replica" is
this message.                                                     error-prone, which is why Bayesian spam software tries
   2) Second time, to compute the probability that the            to consider several words and combine their spamicities
message is spam, taking into consideration all of its             to determine a message's overall probability of being
words (or a relevant subset of them);                             spam.
   3) Sometimes third time, to deal with rare words.
  B. Computing the probability that a message containing                            IV. SPAMMER ATTACK
  a given word is spam                                              To further verify the robustness of Cosdes, in this
   Let's suppose the suspected message contains the word         section, we simulate the spammer attack of random
"replica". Most people who are used to receiving e-mail          HTML tag insertion. We consider the situation that a
know that this message is likely to be spam, more                spammer sends n identical e-mails at a time, where n is
precisely a proposal to sell counterfeit copies of well-         varied from 1,000 to 100,000. It is assumed that a
known brands of watches. The spam detection software,            sequence of random HTML tags is inserted into the
however, does not "know" such facts, all it can do is            beginning of each e-mail. The number of tags in a
compute probabilities.                                           sequence is a random number between 1 and 50.5
   The formula used by the software to determine that is            Regarding the type of tag, we randomly choose them
derived from Bayes theorem                                       from HTML tag list. around 90 percent of e-mails are
   where:                                                        matched with other e-mails, meaning that only 10 percent
    Pr(S|W) is the probability that a message is a              of spams have completely distinct HTML tag sequences
          spam, knowing that the word "replica" is in it;        as random HTML tag insertion is applied. This is because
    Pr(S) is the overall probability that any given             the sequence preprocessing step of procedure SAG will
          message is spam;                                       delete nonempty tags that have no corresponding start
    Pr(S|W) is the probability that the word                    tags or end tags. Random HTML tag insertion cannot
       "replica" appears in spam messages;                       generate legal tag sequences and thus most tags will be
    Pr(H) is the overall probability that any given             eliminated. Concerning the efficiency analysis, it can be
          message is not spam (is "ham");                        observed in that the sequence preprocessing step incurs
    Pr(W|H) is the probability that the word                    very little overhead. This also indicates that if new
       "replica" appears in ham messages.                        spammer attack occurs, Cosdes still has capacity to react
  C. The Spamicity of a word                                     against it by applying a more sophisticated counter
   Recent statistics show that the current probability of        measure.
any message being spam is 80%, at the very least:                  A. Experimental results:
   Pr(S)=0.8;Pr(H)=0.2
   However, most bayesian spam detection software
makes the assumption that there is no a priori reason for
any incoming message to be spam rather than ham, and
considers both cases to have equal probabilities of 50%:
   Pr(S)=0.8;Pr(H)=0.2
   The filters that use this hypothesis are said to be "not
biased", meaning that they have no prejudice regarding
the incoming email. This assumption permits simplifying
the general formula to:

      Pr(S|W)                                                                    Fig. 3 Seeking username and password
   This quantity is called "spamicity" (or "spaminess") of
the word "replica", and can be computed. The
number Pr(W|S) used in this formula is approximated to
the frequency of messages containing "replica" in the
messages identified as spam during the learning phase.
Similarly, Pr(W|H) is approximated to the frequency of
messages containing "replica" in the messages identified
as ham during the learning phase. For these
approximations to make sense, the set of learned
messages needs to be big and representative enough.[9] It
is also advisable that the learned set of messages



                                            All Rights Reserved © 2012 IJARCET                                          405
ISSN: 2278 – 1323
                                           International Journal of Advanced Research in Computer Engineering & Technology
                                                                                               Volume 1, Issue 4, June 2012
                                                                             [9] A. Kolcz and J. Alspector, “SVM-Based Filtering of Email
                                                                                 Spam with Content-Specific Misclassification Costs,” Proc.
                                                                                 ICDM Work-shop Text Mining, 2001.
                                                                             [10] H. Drucker, D. Wu, and V.N. Vapnik, “Support Vector
                                                                                 Machines for Spam Categorization,” Proc. IEEE Trans. Neural
                                                                                 Networks, pp. 1048-1054, 1999.
                                                                             [11] A . Gray and M.Haahr , “ Personalised ,Collaborative Spam
                                                                                 Filtering,” Proc. First Conf. Email and Anti-Spam (CEAS),
                                                                                 2004.
                                                                            [12] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Improving
                                                                                 Digest-Based Collaborative Spam Detection,” Proc. MIT Spam
                                                                                 Conf., 2008.
                                                                            [13] R. Clayton, “Email Traffic: A Quantitative Snapshot,” Proc. of
                                                                                 the Fourth Conf. Email and Anti-Spam (CEAS), 2007.
                                                                            [14]D. Sculley and G.M. Wachman, “Relaxed Online SVMs for
                                                                                 Spam Filtering,” Proc. 30th Ann. Int‟l ACM SIGIR Conf.
                                                                                 Research and Development in Information Retrieval (SIGIR),
                Fig. 4 Specifying spam detection period
                                                                                 pp. 415-422, 2007.
                                                                            [15] I. Rigoutsos and T. Huynh, “Chung-Kwei: A Pattern-Discovery-
                                                                                 Based System for the Automatic Identification of Unsolicited
                                                                                 E-Mail Messages (SPAM),” Proc. First Conf. Email and Anti-
                                                                                 Spam (CEAS), 2004.
                                                                             [16] M.S. Pera and Y.-K. Ng, “Using Word Similarity to Eradicate
                                                                                 Junk Emails,” Proc. 16th ACM Int‟l Conf. Information and
                                                                                 Knowledge Management (CIKM), pp. 943-946, 2007.




                                                                         ABOUT THE AUTHORS
                        Fig. 5 Showing result.
                                                                            Mrs. V. KANAKA DURGA is pursuing M.Tech. at Sri Vasavi
                                                                         Engineering college, Tadepalligudem. Her area of interest includes
                         V. CONCLUSION                                   Computer Networks and Network Security.
                                                                            Mr. M.R. RAJA RAMESH, M.Tech., is working as an Associate
   In this paper we proposed more sophisticated and                      Professor in the Dept. of Computer Science & Engineering in Sri Vasavi
robust e-mail abstraction scheme based on Bayesian with                  Engineering College, Tadepalligudem. Presently he is pursuing Ph.D. at
                                                                         Andhra University.
new scheme.we can efficiently capture the near duplicate
of the spams.Our scheme achieved efficient similarity
matching and educed data storage.So we can conclude
that our scheme is apt for spam detection.


                           REFERENCES
   [1] J. Hovold, “Naive Bayes Spam Filtering Using Word-Position-
        Based Attributes,” Proc. Second Conf. Email and Anti-Spam
        (CEAS),2005.
  [2] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam
        Filtering with Naive Bayes—Which Naive Bayes?” Proc. Third
        Conf. Email and Anti-Spam (CEAS), 2006
  [3]A. Kolcz, A. Chowdhury, and J. Alspector, “The Impact of
        Feature Selection on Signature-Driven Spam Detection,” Proc.
        First Conf. Email and Anti-Spam (CEAS), 2004.
  [4]S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Resolving FP-TP
        Conflict in Digest-Based Selection Algorithm,” Proc. Fifth
        Conf. Email and Anti-Spam (CEAS), 2008.
  [5] E. Blanzieri and A. Bryl, “Evaluation of the Highest
        Probability SVM Nearest Neighbor Classifier with Variable
        Relative Error Cost,” Proc. Fourth Conf. Email and Anti-Spam
        (CEAS), 2007.
   [6] J.S.Kong,P.O.Boykin,B.A.Rezaei,N.Sarshar,and V.P.Roychowd
        hury, “ Scalable and Rel iable Col laborative Spam Filters:
        Harnessing the Global Social Email Networks,” Proc. Second
        Conf. Email and Anti-Spam (CEAS), 2005.
  [7] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P.
        Samarati,“An Open Digest-Based Technique for Spam
        Detection,” Proc. Int‟lWorkshop Security in Parallel and
        Distributed Systems, pp. 559-564,2004.
  [8] S.Sarafijanovic and J.-Y.L. Boudec, “Artificial Immune System
        for Collaborative Spam Filtering,” Proc. Second Work shop
        Nature Inspired Cooperative Strategies for Optimization
        (NICSO), 2007.




                                                    All Rights Reserved © 2012 IJARCET                                                    406

More Related Content

What's hot

Spamming Ict
Spamming   IctSpamming   Ict
Spamming Ict
siewying
 

What's hot (15)

Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
E mail image spam filtering techniques
E mail image spam filtering techniquesE mail image spam filtering techniques
E mail image spam filtering techniques
 
Spam and Anti Spam Techniques
Spam and Anti Spam TechniquesSpam and Anti Spam Techniques
Spam and Anti Spam Techniques
 
Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispam
 
Spamming Ict
Spamming   IctSpamming   Ict
Spamming Ict
 
Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes Algorithm
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filtering
 
Spam, security
Spam, securitySpam, security
Spam, security
 
Spam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont'sSpam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont's
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection system
 
Collateral Damage: Consequences of Spam and Virus Filtering for the E-Mail S...
Collateral Damage:
Consequences of Spam and Virus Filtering for the E-Mail S...Collateral Damage:
Consequences of Spam and Virus Filtering for the E-Mail S...
Collateral Damage: Consequences of Spam and Virus Filtering for the E-Mail S...
 
Survey on spam filtering
Survey on spam filteringSurvey on spam filtering
Survey on spam filtering
 
Spam Email identification
Spam Email identificationSpam Email identification
Spam Email identification
 
B0940509
B0940509B0940509
B0940509
 

Similar to 402 406

NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
Dhara Shah
 
Maximise Email Deliverability
Maximise Email DeliverabilityMaximise Email Deliverability
Maximise Email Deliverability
GetResponse
 

Similar to 402 406 (20)

Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Email campaigns are the lifeblood of most industries.docx
Email campaigns are the lifeblood of most industries.docxEmail campaigns are the lifeblood of most industries.docx
Email campaigns are the lifeblood of most industries.docx
 
IRJET- Image Spam Detection: Problem and Existing Solution
IRJET-  	  Image Spam Detection: Problem and Existing SolutionIRJET-  	  Image Spam Detection: Problem and Existing Solution
IRJET- Image Spam Detection: Problem and Existing Solution
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection system
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Email marketing the easy way
Email marketing the easy wayEmail marketing the easy way
Email marketing the easy way
 
E spam
E spamE spam
E spam
 
E spam
E spamE spam
E spam
 
E spam
E spamE spam
E spam
 
Email Marketing The Easy Way
Email Marketing The Easy WayEmail Marketing The Easy Way
Email Marketing The Easy Way
 
Email deliverability
Email deliverabilityEmail deliverability
Email deliverability
 
Fighting spam
Fighting spamFighting spam
Fighting spam
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
How to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkHow to Keep Spam Off Your Network
How to Keep Spam Off Your Network
 
Maximise Email Deliverability
Maximise Email DeliverabilityMaximise Email Deliverability
Maximise Email Deliverability
 
Did you get my email?
Did you get my email?Did you get my email?
Did you get my email?
 
Analysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysisAnalysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysis
 

More from Editor IJARCET

Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207
Editor IJARCET
 
Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199
Editor IJARCET
 
Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204
Editor IJARCET
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194
Editor IJARCET
 
Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189
Editor IJARCET
 
Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185
Editor IJARCET
 
Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176
Editor IJARCET
 
Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172
Editor IJARCET
 
Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164
Editor IJARCET
 
Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158
Editor IJARCET
 
Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154
Editor IJARCET
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
Editor IJARCET
 
Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124
Editor IJARCET
 
Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142
Editor IJARCET
 
Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138
Editor IJARCET
 
Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129
Editor IJARCET
 
Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118
Editor IJARCET
 
Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113
Editor IJARCET
 
Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107
Editor IJARCET
 

More from Editor IJARCET (20)

Electrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturizationElectrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturization
 
Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207
 
Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199
 
Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194
 
Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189
 
Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185
 
Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176
 
Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172
 
Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164
 
Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158
 
Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124
 
Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142
 
Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138
 
Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129
 
Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118
 
Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113
 
Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

402 406

  • 1. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 Accurate Spam Mail Detection using Bayesian Algorithm V.Kanaka Durga, M.R. Raja Ramesh Abstract— With the increasing popularity of a E-mail your address is posted on a public forum or visible on a users, E-mail spam problem growing proportionally. Spam website, spammers will find it and start to bombard you. filtering with near duplicate matching scheme is widely Most spam messages on the internet today are discussed in recent years. It is based on a known spam database formed by user feedback which cannot fully catch advertisements. the evolving nature of spam and also it requires much B. Spam Problems storage. In view of above drawbacks, we proposed an effective spam detection scheme based on Bayesian Spammers often says that spam is not a problem.” Just hit approach. First we use Bayesian filter based on single delete if you don‟t want to see it”. And many spam keyword sets and then we improve the scheme by using messages carry the tagline. ”If you don‟t want to receive multiple keyword sets and satisfactory results obtained. further mailing reply, we will remove you.” But spam is huge problem. It consumes network bandwidth, storage Index Terms— Spam detection, Near Duplicate Matching, space and slowdown E-mail servers. SAG. C. Flavors of spam: Unsolicited Commercial E-mail (UCE)-An E-mail I. INTRODUCTION messages that you receive without asking for it With the increasing popularity of electronic mail (or e- advertising a product or service. This is also called junk mail), several people and companies found it an easy way E-mail. Unsolicited Bulk E-mail(UBE)-UBE refers to E- to distribute a massive amount of unsolicited messages to mail messages that are sent in bulk to thousands of a tremendous number of users at a very low cost. These recipients. BE may be commercial in nature, in which unwanted bulk messages or junk emails are called spam case it is also UCE. But it may be sent for other purposes messages. The majority of spam messages that has been as well, such as political lobbying or harassment. Make reported recently are unsolicited commercials promoting Money Fast(MMF)-MMF messages in the form of chain services and products including sexual enhancers, cheap letters or multi-level marketing schemes, are messages drugs and herbal supplements, health insurance, travel that suggest you can get rich by sending money to the top tickets, hotel reservations, and software products. They name on list, removing that name, adding your name to can also include offensive content such pornographic the bottom of the list ,and forwarding the messages to the images and can be used as well for spreading rumors and other people. other fraudulent advertisements such as make money fast. E-mail communication is prevalent and indispensable E-mail spam has become an epidemic problem that can nowadays. However, the threat of unsolicited junk e- negatively affect the usability of electronic mail as a mails, also known as spams, becomes more and more communication means. Besides wasting users‟ time and serious. According to a survey by the website Top Ten effort to scan and delete the massive amount of junk e- REVIEWS, 40 percent of e-mails were considered as mails received; it consumes network bandwidth and spams in 2006. The statistics collected by MessageLabs storage space, slows down e-mail servers, and provides a show that recently the spam rate is over 70 percent and medium to distribute harmful and/or offensive content. persistently remains high. The primary challenge of spam Spam emails are getting better in its ability to break anti- detection problem lies in the fact that spammers will spam filters and it would take a great deal to get it fully always find new ways to attack spam filters owing to the eradicated. Spammers are also becoming more economic benefits of sending spams. Note that existing innovative, so that the anti-spam research is having a filters generally perform well when dealing with clumsy great relevance these days. spams, which have duplicate content with suspicious keywords or are sent from an identical notorious server. A. How do spammers get your address? Therefore, the next stage of spam detection research If you or anyone else ever gives your E-mail address on should focus on coping with cunning spams which evolve website, then your E-mail address is out there for the naturally and continuously. world to use because the companies often sell your In view of above facts, the notion of collaborative spam address. For instance If you go to a website and send a filtering with near-duplicate similarity matching scheme funny cartoon, or an electronic greeting to a friend you has recently received much attention. The primary idea of have given out their E-mail address and the rest is history. the near-duplicate matching scheme for spam detection is Nothing is free, and these websites make their money by to maintain a known spam database, formed by user selling advertising and selling your E-mail address. If feedback, to block subsequent spams with similar All Rights Reserved © 2012 IJARCET 402
  • 2. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 content. Collaborative filtering indicates that user mechanism is applied, the most critical factor is how to knowledge of what spam may subsequently appear is represent each e-mail for near-duplicate matching. The e- collected to detect following spams. Overall, there are mail abstraction not only should capture the near- three key points of this type of spam detection approach duplicate phenomenon of spams, but should avoid we have to be concerned about. First, an effective accidental deletion of hams. representation of e-mail (i.e., e-mail abstraction) is essential. Since a large set of reported spams have to be Preprocessing Steps: stored in the known spam database, the storage size of e- A. E-mail Abstraction Scheme mail abstraction should be small. Moreover, the e-mail abstraction should capture the near-duplicate Here we introduced a novel e-mail abstraction scheme, phenomenon of spams, and should avoid accidental SAG procedure is presented to depict the generation deletion of non-spam e-mails (also known as hams). process of an e-mail abstraction in II-A-1, and the Second, every incoming e-mail has to be matched with robustness issue is discussed in Section II-A-2. the large database, meaning that the near-duplicate 1) Structure Abstraction Generation: We propose the matching process should be substantially efficient. SAG procedure to generate the e-mail abstraction using Finally, the latest spams to be included instantly and HTML content in e-mail. The algorithmic form of SAG is successively into the database so as to effectively block depicted in Fig. 1. Using three major phases Procedure subsequent near-duplicate spams. SAG is composed, Tag Extraction Phase, Tag Reordering One subtle difference is that a false positive would be a Phase, and <anchor> Appending Phase. In Tag Extraction more serious error than a false negative as a false positive Phase, the name of each HTML tag is extracted, and tag would mean that an important e-mail was identified as attributes and attribute values are eliminated.. One spam and rejected. According to a leading body in IT, objective of this preprocessing step is to remove tags that inaccurate anti-spam solutions may be responsible for are common but not discriminative between e-mails. The wasting more than five million working hours a year on other objective is to prevent malicious tag insertion checking that legitimate messages were not mistakenly attack, and thus the robust-ness of the proposed quarantined. We explore a new approach based on abstraction scheme can be further enhanced. The Bayesian approach that can automatically classify e-mail following sequence of operations is performed in the messages as spam or legitimate. We study its preprocessing step. performance for various conjunction and disjunction 1. Front and rear tags are excluded. operators for several datasets. The results are promising 2. Nonempty tags 2 that have no corresponding start as compared with a previous classifiers such as genetic tags or end tags are deleted. Besides, mismatched algorithm and XCS (extended Classifier System) nonempty tags are also deleted. classifier system. Classification accuracy above 97% and 3. All empty tags 2 are regarded as the same and are low false positive rates are achieved in many test cases. replaced by the newly created <empty/> tag. Moreover, successive <empty/> tags are pruned and II. RELATED WORKS only one <empty/> tag is retained. Nowadays E-mail spam problem is increasing 4. The pairs of nonempty tags enclosing nothing are drastically. So to relieve this problem various techniques removed. have been explored. Previous works on spam detection Procedure SAG can be classified into three categories: a) content-based Input: the email with text/html content-type, the tag methods, b) non content-based methods, and c) others . length threshold (Lth_short) of the short email Initially, researchers analyze e-mail content text and Output: the email abstraction (EA) of the input email model this problem as a binary text classification task. 1 // Tag Extraction Phase Representatives of this category are Naive Bayes [1], [2] 2 Transform each tag to <tag.name>; 3 Transform each paragraph of text to <mytext/>; and Support Vector Machines (SVMs) [5], [10], [9], [14] 4 AnchorSet = the union of all <anchor> methods. In general, Naive Bayes methods train a 5 EA = concatenation of <tag.name>; probability model using classified e-mails, and each word 6 Preprocess the tag sequence of EA; in e-mails will be given a probability of being a 7 // Tag Reordering Phase: suspicious spam keyword. The other group attempts to 8 for(for each tag of EA) // pn: position number exploit non content information such as e-mail header, e- 9 tag.new_pn = ASSIGN_PN (EA.tag_length, mail social network ,and e -mail traffic [13],[9] to filter tag.pn)) spams . Collecting notorious and innocent sender 10 Put the tag to the position tag.new_pn; addresses (or IP addresses) from e-mail header to create 11 EA = the concatenation of <tag.name> with black list and white list is a commonly applied method new_pn; initially. On the other hand, collaborative spam filtering 12 //<anchor> Appending Phase with near-duplicate similarity matching scheme has been 13 if (EA.tag_length<Lth_short) studied widely in recent years. Regarding collaborative 14 Append AnchorSet in font EA; mechanism, P2P-based architecture [12], centralized 15 return EA; server-based system [3], [16], [15], and others [6], [8] are End generally employed. Note that no matter which Fig. 1 Algorithm form of procedure SAG. All Rights Reserved © 2012 IJARCET 403
  • 3. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 2) Robustness Issue: The near-duplicate spam detection‟s main difficulty is to withstand malicious attack by spammers. E-mail abstractions are generated based mainly on hash-based content text in prior approaches. For example, the authors in [3], [16], [15] extract words or terms to generate the e-mail abstraction. Besides, substrings extracted by various techniques are widely employed in [7], [12], [11], [6], [8], [4]. This type of e-mail representation has following disadvantages. Firstly, the insertion of a randomized and normal paragraph can easily defeat this type of spam filters. Moreover, since the structures and features of different languages are diverse, word and substring extraction may not be applicable to e-mails in all languages. Concretely speaking, for instance, trigrams of substrings used in [7], [8], [6] are not suitable for non alphabetic languages, such as Chinese. In this paper, we devise a novel e-mail abstraction scheme that considers e-mail layout structure to represent e-mails. To assess the robustness of the proposed scheme, we model possible spammer attacks Fig. 2 Examples of possible Spammer Attacks and organize these attacks as following three categories. Examples and the outputs of preprocessing of procedure Note that due to space limitation, we are not able to SAG are shown in Fig. 1. discuss all possible situations. Nevertheless, representing e-mails with layout structure is more robust to most A) Random Paragraph Insertion existing attacks than text-based approaches. Even though This type of spammer attack is commonly used now- new attack has been designed, we can react against it by a-days. As shown in Fig. 2, normal contents without any adjusting the preprocessing step of procedure SAG. On advertisement keywords are inserted to confuse text- the other hand, our approach extracts only HTML tag based spam filtering techniques. It is noted that our sequences and transforms each paragraph with no tag scheme transforms each paragraph into a newly created embedded to<mytext/>, meaning that the proposed tag <mytext/>, and consecutive empty tags will then be abstraction scheme can be applied to e-mails in all transformed to <empty/>. As such, the representation of languages without modify in any components. each random inserted paragraph is identical, and thus our scheme is resistant to this type of attack. B) Random HTML Tag Insertion III. BAYESIAN ALGORITHM If spammers know that the proposed scheme is based One of the most effective and intelligent solutions to on HTML tag sequences, random HTML tags will be combat spam email nowadays is Bayesian filtering. inserted rather than random paragraphs. On the one hand, How does Bayesian filtering work? arbitrary tag insertion will cause syntax errors due to tag Naive Bayes classifier is used to identify spam e-mail. mismatching. This may lead to abnormal display of spam It is based on the principle that most events are dependent content that spammers do not wish this to happen. On the and that the probability of an event occurring in the future other hand, procedure SAG also adopts some heuristics can be inferred from the previous occurrences of that (as depicted in Section III-C to deal with the random event. The same technique can be used to classify spam. insertion of empty tags and the tag mismatching of If some piece of text occurs often in spam but not in nonempty tags. Fig. 2 shows two example outputs and the legitimate mail, then it would be reasonable to assume details of each step can be found in Fig. 2. With the that this email is probably spam. proposed method, most random inserted tags will be Bayesian filters must be „trained‟ to work effectively. removed, and thus the effectiveness of the attack of Particular words have certain probabilities (also known as random tag insertion is limited. likelihood functions) of occurring in spam email but not in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email, but will seldom see it in other email. Before mail can be C) Sophisticated HTML Tag Insertion: filtered using this method, the user needs to generate a Suppose that spammers are more sophisticated, they database with words and tokens (such as the $ sign, IP may insert legal HTML tag patterns. As shown in Fig. 1, addresses and domains, and so on), collected from a if tag patterns that do conform to syntax rules are sample of spam mail and valid mail (referred to as inserted, they will not be eliminated. „ham‟). For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. All Rights Reserved © 2012 IJARCET 404
  • 4. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 A. Mathematical Foundation conforms to the 50% hypothesis about repartition Bayes theorem is used several times in the context of between spam and ham, i.e. that the datasets of spam and spam: ham are of same size. 1) First time, to compute the probability that the Of course, determining whether a message is spam or message is spam, knowing that a given word appears in ham based only on the presence of the word "replica" is this message. error-prone, which is why Bayesian spam software tries 2) Second time, to compute the probability that the to consider several words and combine their spamicities message is spam, taking into consideration all of its to determine a message's overall probability of being words (or a relevant subset of them); spam. 3) Sometimes third time, to deal with rare words. B. Computing the probability that a message containing IV. SPAMMER ATTACK a given word is spam To further verify the robustness of Cosdes, in this Let's suppose the suspected message contains the word section, we simulate the spammer attack of random "replica". Most people who are used to receiving e-mail HTML tag insertion. We consider the situation that a know that this message is likely to be spam, more spammer sends n identical e-mails at a time, where n is precisely a proposal to sell counterfeit copies of well- varied from 1,000 to 100,000. It is assumed that a known brands of watches. The spam detection software, sequence of random HTML tags is inserted into the however, does not "know" such facts, all it can do is beginning of each e-mail. The number of tags in a compute probabilities. sequence is a random number between 1 and 50.5 The formula used by the software to determine that is Regarding the type of tag, we randomly choose them derived from Bayes theorem from HTML tag list. around 90 percent of e-mails are where: matched with other e-mails, meaning that only 10 percent  Pr(S|W) is the probability that a message is a of spams have completely distinct HTML tag sequences spam, knowing that the word "replica" is in it; as random HTML tag insertion is applied. This is because  Pr(S) is the overall probability that any given the sequence preprocessing step of procedure SAG will message is spam; delete nonempty tags that have no corresponding start  Pr(S|W) is the probability that the word tags or end tags. Random HTML tag insertion cannot "replica" appears in spam messages; generate legal tag sequences and thus most tags will be  Pr(H) is the overall probability that any given eliminated. Concerning the efficiency analysis, it can be message is not spam (is "ham"); observed in that the sequence preprocessing step incurs  Pr(W|H) is the probability that the word very little overhead. This also indicates that if new "replica" appears in ham messages. spammer attack occurs, Cosdes still has capacity to react C. The Spamicity of a word against it by applying a more sophisticated counter Recent statistics show that the current probability of measure. any message being spam is 80%, at the very least: A. Experimental results: Pr(S)=0.8;Pr(H)=0.2 However, most bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%: Pr(S)=0.8;Pr(H)=0.2 The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to: Pr(S|W) Fig. 3 Seeking username and password This quantity is called "spamicity" (or "spaminess") of the word "replica", and can be computed. The number Pr(W|S) used in this formula is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase. Similarly, Pr(W|H) is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough.[9] It is also advisable that the learned set of messages All Rights Reserved © 2012 IJARCET 405
  • 5. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 [9] A. Kolcz and J. Alspector, “SVM-Based Filtering of Email Spam with Content-Specific Misclassification Costs,” Proc. ICDM Work-shop Text Mining, 2001. [10] H. Drucker, D. Wu, and V.N. Vapnik, “Support Vector Machines for Spam Categorization,” Proc. IEEE Trans. Neural Networks, pp. 1048-1054, 1999. [11] A . Gray and M.Haahr , “ Personalised ,Collaborative Spam Filtering,” Proc. First Conf. Email and Anti-Spam (CEAS), 2004. [12] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Improving Digest-Based Collaborative Spam Detection,” Proc. MIT Spam Conf., 2008. [13] R. Clayton, “Email Traffic: A Quantitative Snapshot,” Proc. of the Fourth Conf. Email and Anti-Spam (CEAS), 2007. [14]D. Sculley and G.M. Wachman, “Relaxed Online SVMs for Spam Filtering,” Proc. 30th Ann. Int‟l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), Fig. 4 Specifying spam detection period pp. 415-422, 2007. [15] I. Rigoutsos and T. Huynh, “Chung-Kwei: A Pattern-Discovery- Based System for the Automatic Identification of Unsolicited E-Mail Messages (SPAM),” Proc. First Conf. Email and Anti- Spam (CEAS), 2004. [16] M.S. Pera and Y.-K. Ng, “Using Word Similarity to Eradicate Junk Emails,” Proc. 16th ACM Int‟l Conf. Information and Knowledge Management (CIKM), pp. 943-946, 2007. ABOUT THE AUTHORS Fig. 5 Showing result. Mrs. V. KANAKA DURGA is pursuing M.Tech. at Sri Vasavi Engineering college, Tadepalligudem. Her area of interest includes V. CONCLUSION Computer Networks and Network Security. Mr. M.R. RAJA RAMESH, M.Tech., is working as an Associate In this paper we proposed more sophisticated and Professor in the Dept. of Computer Science & Engineering in Sri Vasavi robust e-mail abstraction scheme based on Bayesian with Engineering College, Tadepalligudem. Presently he is pursuing Ph.D. at Andhra University. new scheme.we can efficiently capture the near duplicate of the spams.Our scheme achieved efficient similarity matching and educed data storage.So we can conclude that our scheme is apt for spam detection. REFERENCES [1] J. Hovold, “Naive Bayes Spam Filtering Using Word-Position- Based Attributes,” Proc. Second Conf. Email and Anti-Spam (CEAS),2005. [2] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam Filtering with Naive Bayes—Which Naive Bayes?” Proc. Third Conf. Email and Anti-Spam (CEAS), 2006 [3]A. Kolcz, A. Chowdhury, and J. Alspector, “The Impact of Feature Selection on Signature-Driven Spam Detection,” Proc. First Conf. Email and Anti-Spam (CEAS), 2004. [4]S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Resolving FP-TP Conflict in Digest-Based Selection Algorithm,” Proc. Fifth Conf. Email and Anti-Spam (CEAS), 2008. [5] E. Blanzieri and A. Bryl, “Evaluation of the Highest Probability SVM Nearest Neighbor Classifier with Variable Relative Error Cost,” Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007. [6] J.S.Kong,P.O.Boykin,B.A.Rezaei,N.Sarshar,and V.P.Roychowd hury, “ Scalable and Rel iable Col laborative Spam Filters: Harnessing the Global Social Email Networks,” Proc. Second Conf. Email and Anti-Spam (CEAS), 2005. [7] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati,“An Open Digest-Based Technique for Spam Detection,” Proc. Int‟lWorkshop Security in Parallel and Distributed Systems, pp. 559-564,2004. [8] S.Sarafijanovic and J.-Y.L. Boudec, “Artificial Immune System for Collaborative Spam Filtering,” Proc. Second Work shop Nature Inspired Cooperative Strategies for Optimization (NICSO), 2007. All Rights Reserved © 2012 IJARCET 406