 INTRODUCTION
 NATURAL LANGUAGE PROCESSING
 SPAMMING
 RELATED WORK
 N-GRAM MODELLING
 WORD STEMMING
 BAYESIAN CLASSIFICATION
 PROPOSED MODEL
 COMPONENTS
 BLOCK DIAGRAM
 WORKING
 CONCLUSION
 QUESTION & ANSWER
 THANK YOU
Natural Language Processing (NLP)
is a field of computer science and
linguistics concerned with the
interactions between computers and
human (natural) languages.
Spamming is the use of electronic
messaging systems like e-mails and
other digital delivery systems and
broadcast media to send unwanted
bulk messages indiscriminately.
N-GRAM MODELLING
WORD STEMMING
BAYESIAN CLASSIFICATION
 An N-gram is an N-character slice of a longer string.
 N-grams of several different lengths simultaneously are
used in the processing while detecting spam emails.
 In N-gram blank space is replaces by underscore
character(“_”).
Here the example ,
WORD = STAY
 Bi-grams: _S, ST, TA, SY, Y_
 Tri-grams: _ _S, _ST, STA, TAY, AY_, Y_
 Quad-grams: _ _ _ S, _ _ ST, _STA, STAY,
TAY_, AY_ _, Y_ _ _
The following steps are used in a word stemming algorithm:
 Remove all non-alpha characters. (Allow some characters
like '/' ' ' '|' etc. which can be used together to look like some
characters, such as / for 'V')
 Remove all vowels from the word except the initial one.
 Replace all digits and symbols by similar looking digits or
characters.
 Replace consecutive repeated characters by a
single character and vice versa.
 Use phonetic algorithms like sound ex on the
resultant string.
 Give it a numeric value depending on the
operations performed over it.
 Update the database for that particular keyword.
 Bayesian Spam Filtering is a statistical method of
classifying an email into spam and legitimate classes.
 Naive Bayes classifiers work by correlating the use of
tokens with spam and non-spam emails and then using
Bayesian inference to calculate a probability that an email is
or is not spam.
Here,
A= no. Of Ham meassage
b= no. Of Spam messgae
 Email Input
 URL Source Check
 URL Blacklist Database
 Threshold Counter
 Keyword Extractor
 Classifier
 NLP Engine
WORKING
 An unclassified email is taken as input for processing.
 The URL Source checking block checks the URL of the incoming
email and searches URL Blacklist Database to find a match. If the
search is successful then, it categorizes the email as a spam or
considers it to be ham and processes further.
 At the same time there is a Threshold Counter which keeps track
of the number of emails coming from this source over a period of
time t.
 The Email is then passed to the Keyword Extractor which
will tokenize the message into keywords.
 These keywords are passed to the Classifier which will
classify the mail into a specific category E.
 The email is then passed to the NLP Engine along with its
category.
Spam mails are a serious concern to and a major annoyance
for many Internet users. The mode proposed as a solution in
this paper is highly beneficial because it introduces a
threshold counter which helps overcome congestion on the
web server and also maintain the spam filter efficiency but
at the same time, it also requires overhead storage space for
the databases.
Spam Detection Using Natural Language processing
Spam Detection Using Natural Language processing

Spam Detection Using Natural Language processing

  • 2.
     INTRODUCTION  NATURALLANGUAGE PROCESSING  SPAMMING  RELATED WORK  N-GRAM MODELLING  WORD STEMMING  BAYESIAN CLASSIFICATION
  • 3.
     PROPOSED MODEL COMPONENTS  BLOCK DIAGRAM  WORKING  CONCLUSION  QUESTION & ANSWER  THANK YOU
  • 4.
    Natural Language Processing(NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.
  • 5.
    Spamming is theuse of electronic messaging systems like e-mails and other digital delivery systems and broadcast media to send unwanted bulk messages indiscriminately.
  • 6.
  • 7.
     An N-gramis an N-character slice of a longer string.  N-grams of several different lengths simultaneously are used in the processing while detecting spam emails.  In N-gram blank space is replaces by underscore character(“_”).
  • 8.
    Here the example, WORD = STAY  Bi-grams: _S, ST, TA, SY, Y_  Tri-grams: _ _S, _ST, STA, TAY, AY_, Y_  Quad-grams: _ _ _ S, _ _ ST, _STA, STAY, TAY_, AY_ _, Y_ _ _
  • 9.
    The following stepsare used in a word stemming algorithm:  Remove all non-alpha characters. (Allow some characters like '/' ' ' '|' etc. which can be used together to look like some characters, such as / for 'V')  Remove all vowels from the word except the initial one.  Replace all digits and symbols by similar looking digits or characters.
  • 10.
     Replace consecutiverepeated characters by a single character and vice versa.  Use phonetic algorithms like sound ex on the resultant string.  Give it a numeric value depending on the operations performed over it.  Update the database for that particular keyword.
  • 11.
     Bayesian SpamFiltering is a statistical method of classifying an email into spam and legitimate classes.  Naive Bayes classifiers work by correlating the use of tokens with spam and non-spam emails and then using Bayesian inference to calculate a probability that an email is or is not spam.
  • 12.
    Here, A= no. OfHam meassage b= no. Of Spam messgae
  • 13.
     Email Input URL Source Check  URL Blacklist Database  Threshold Counter  Keyword Extractor  Classifier  NLP Engine
  • 15.
    WORKING  An unclassifiedemail is taken as input for processing.  The URL Source checking block checks the URL of the incoming email and searches URL Blacklist Database to find a match. If the search is successful then, it categorizes the email as a spam or considers it to be ham and processes further.  At the same time there is a Threshold Counter which keeps track of the number of emails coming from this source over a period of time t.
  • 16.
     The Emailis then passed to the Keyword Extractor which will tokenize the message into keywords.  These keywords are passed to the Classifier which will classify the mail into a specific category E.  The email is then passed to the NLP Engine along with its category.
  • 17.
    Spam mails area serious concern to and a major annoyance for many Internet users. The mode proposed as a solution in this paper is highly beneficial because it introduces a threshold counter which helps overcome congestion on the web server and also maintain the spam filter efficiency but at the same time, it also requires overhead storage space for the databases.