ENABLING SPAM FILTERING FOR
MOBILE ORIGINAL EQUIPMENT
MANUFACTURERS
By
Group 2
Avinash Kumar(15BM6JP08)
Ayan Sengupta(15BM6JP09)
Bharathi R(15BM6JP10)
Bodhisattwa Prasad Majumder(15BM6JP11)
Chandra Bhanu Jha(15BM6JP12)
Dattatreya Biswas(15BM6JP13)
Deepu Unnikrishnan(15BM6JP14)
Data Source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
I. INTRODUCTION
Spam is defined as an irrelevant or unsolicited message sent over a communication channel,
typically to a large number of users, for purposes such as advertising, phishing, or spreading
malware. With the boom in the number of mobile users, SMS has grown into a
multi-billion-dollar commercial industry. According to Wikipedia [1], SMS is the most widely used
data application, with an estimated 3.5 billion active users, or about 80% of all mobile phone
subscribers, at the end of 2010.
A spam filter is a program that prevents spam from reaching a user's inbox. Like other
types of filtering programs [2], a spam filter looks for certain criteria on which it bases its
judgments. For example, the simplest and earliest versions (such as the one available with
Microsoft's Hotmail) can be set to watch for particular words in the subject line of messages and
to exclude matching messages from the user's inbox. This method is not especially effective: it too
often blocks perfectly legitimate messages (these are called false positives) while letting actual
spam through. In general, spam filters are estimated to reduce costs by roughly 30%.
II. BUSINESS SCOPE
According to a study [3], the volume of SMS spam in the US rose 45% in 2011 to 4.5 billion
messages and, in 2012, more than 69% of mobile users claimed to have received text spam.
A paper [4] published in the Journal of Economic Perspectives, titled “The Economics of Spam”,
estimated that Americans bear costs of almost $20 billion annually due to spam, while
spammers and spam-advertised merchants collect gross worldwide revenues on the order of $200
million per year. It concluded that the 'externality ratio' of external costs to internal benefits
for spam is around 100:1. In other words, spammers impose a large cost on society while reaping
fairly little in return.
Research [5] by a Stanford University scholar states that, owing to its growing popularity among
young demographics and the fall in text messaging charges over the years (in China it now costs
less than $0.001 to send a text message), SMS spam is growing, and in 2012 in parts of
Asia up to 30% of text messages were spam. SMS spam, being more personal and more irritating
than email spam, imposes costs on the receiver as well. If SMS spam remains unaddressed, a
mobile operator with 10 million subscribers can incur up to $6 billion in losses per year.
In drawing a boundary for filtering SMS spam at the level of the individual user, the business
considerations include the cost of misclassifying a legitimate SMS as spam and the
inconvenience caused by letting through a certain proportion of spam when genuineness cannot be
ascertained. The attempt here is to provide a worthwhile solution in light of these concerns.
The dataset is taken from the UCI Machine Learning Repository and contains 5574 text
messages [8].
III. DATA PREPROCESSING
The dataset for the experiment consists of one large text file in which each line corresponds to a
text message (SMS). Preprocessing of the data, tokenization of each message, and feature
extraction and engineering are therefore required.
For the initial analysis of the data, each message in the dataset is split into tokens of
alphanumeric characters, with whitespace as the delimiter. Stop-words [6] were
removed from all the text messages, as they appear frequently in both response classes and
have little discriminative power. The effect of abbreviations in the messages is ignored,
and no word stemming algorithm is used.
Additionally, further features are generated from the
number of special characters (!, (, ), ., :, $, etc.), the number of uppercase letters, the number
of spelling mistakes, and the overall number of characters in the message. The intuition behind
counting special characters is that spam usually tends to contain more
of them, such as $, @ and #. The number of uppercase letters helps in detecting
spam because spam often uses uppercase letters for emphasis. The intuition behind including the
length of the message as a feature is that the cost of sending a text message is the same as long
as it stays below 160 characters, so marketers would prefer to use most of the space available
to them without exceeding the limit. An interesting observation from the data was that
misspelled words are prevalent in ham and usually absent from spam.
Unigram frequency analysis was carried out, after removing the stop-words, to find the most
frequent words used in spam. The most frequently occurring words were identified from the
spam messages using a term-document matrix. These words are likely to have more
discriminative power in determining spam; however, not all of them are useful in the
classification. Tokens (words) that fall into the top 5 percentile by frequency among all words
appearing in spam were considered as separate features. Here
is the list of words used as features: {150p, call, cash, chat, claim, com, contact, customer, free,
get, guaranteed, just, mobile, msg, new, nokia, now, per, phone, please, prize, reply, send,
service, stop, text, tone, txt, urgent, week, will, win, won, www}. Indicator variables were used to
denote the presence of each word: 1 if the word is present, 0 otherwise. Considering
all the features, the training data finally contains 40 predictor variables.
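The feature construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name extract_features, the shortened SPAM_WORDS list, and the tiny STOP_WORDS set are assumptions made for the example, and the spelling-mistake count is omitted because it would need a dictionary that the text does not specify.

```python
import re

# Assumption: a short subset of the 34 indicator words listed in the text.
SPAM_WORDS = ["call", "free", "txt", "win", "urgent", "prize"]

# Assumption: a tiny illustrative stop-word set, not the list used in the paper.
STOP_WORDS = {"a", "an", "the", "to", "is", "and", "of", "you", "your"}


def extract_features(message: str) -> dict:
    """Turn one SMS into engineered counts plus 0/1 word indicators."""
    # Whitespace tokenization, used here only for the word count.
    tokens = message.split()
    # Lowercased alphanumeric tokens with stop-words removed.
    words = {
        w for w in re.findall(r"[a-z0-9]+", message.lower())
        if w not in STOP_WORDS
    }
    features = {
        "char_count": len(message),
        "word_count": len(tokens),
        "upper_count": sum(1 for c in message if c.isupper()),
        # Anything that is neither alphanumeric nor whitespace counts as special.
        "special_count": sum(
            1 for c in message if not c.isalnum() and not c.isspace()
        ),
    }
    # One indicator variable per candidate spam word.
    for word in SPAM_WORDS:
        features[word] = 1 if word in words else 0
    return features


example = extract_features("URGENT! You have won a FREE prize. Call now!!")
```

Each message becomes one row of numeric predictors; stacking these rows gives a design matrix of the kind the models in the next section are fitted on.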
IV. METHODOLOGY AND RESULTS
The logical approach to the problem is to identify the features that are distinctive of spam;
ham messages are defined as those that are not spam. Thus, the response has two classes:
spam (1) and ham (0).
In the first phase of analysis, a generalized linear model with a binomial logit link (that is,
logistic regression) was fitted on the 40 explanatory variables without any interaction terms. This
model achieved an accuracy of 95.55% on the test set; the results are presented as a confusion
matrix in Table 1. As a measure of goodness of fit, the Nagelkerke R-square was calculated, giving a
value of 0.735. From the model, the significant explanatory variables were identified as
word_count, character_count and special_count among the engineered features, and win, won, urgent,
txt, text, tone, mobile, new, contact and call among the words. These findings are
intuitive, as the words reflect the specific interests of the marketers who send
spam messages: they are meant to prompt the recipient to act on the message.
Furthermore, it is costly when a ham is misclassified as spam, and thus the
objective is to minimize the type-I error. The threshold at which the predicted probabilities
are clamped to 0 (ham) or 1 (spam) was obtained by an iterative search for the value that
achieves the minimum type-I error. Raising the threshold in this way lowers the model accuracy,
because the type-II error increases. As a suitable trade-off, a type-I error of 0.02% is allowed.
For the first model, the threshold chosen is 0.85. The ROC curve (Figure 2) also indicates an
operating point that best balances the true and false positive rates.
Still, the threshold obtained by the iterative minimization was retained, in keeping with the
objective of minimizing the misclassification of hams.
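The threshold search can be illustrated with a small sketch. The text does not give the search procedure in detail, so the grid scan below is an assumption, as are the function names and the toy probabilities; it simply returns the lowest threshold whose type-I error (ham flagged as spam) stays within a chosen budget, which keeps the type-II error as small as that budget allows.

```python
def type1_error_rate(probs, labels, threshold):
    """Fraction of ham messages (label 0) that would be flagged as spam."""
    ham_probs = [p for p, y in zip(probs, labels) if y == 0]
    if not ham_probs:
        return 0.0
    flagged = sum(1 for p in ham_probs if p >= threshold)
    return flagged / len(ham_probs)


def accuracy(probs, labels, threshold):
    """Share of messages whose thresholded prediction matches the label."""
    correct = sum(
        1 for p, y in zip(probs, labels) if (p >= threshold) == (y == 1)
    )
    return correct / len(labels)


def pick_threshold(probs, labels, max_type1=0.0, step=0.05):
    """Scan thresholds upward; return the first one within the type-I budget."""
    n_steps = int(round(1.0 / step))
    for i in range(1, n_steps + 1):
        t = round(i * step, 10)  # round away floating-point drift
        if type1_error_rate(probs, labels, t) <= max_type1:
            return t
    return 1.0


# Toy predicted spam probabilities (hypothetical, not from the paper).
probs = [0.05, 0.20, 0.62, 0.83, 0.90, 0.95, 0.40]
labels = [0, 0, 0, 0, 1, 1, 1]
chosen = pick_threshold(probs, labels, max_type1=0.0)
```

On this toy data the scan settles at 0.85 because one ham scores 0.83; the spam message scored 0.40 is then missed, which shows how a lower type-I error is bought with a higher type-II error.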
The investigation naturally extends to incorporating interaction
terms in the model. The interaction terms word_count * special_count, special_count *
upper_count and upper_count * word_count were incorporated sequentially, each time keeping all
previous predictors intact. The Nagelkerke R-square value
increased steadily with the addition of interaction terms. The best Nagelkerke R-square was
achieved when all three interaction terms were included along with the other 40 variables
(Table 5). Wald’s test (Table 7) was performed for all the predictor variables of the model
that gives the best Nagelkerke R-square value (Table 6). The accuracy did not improve much,
staying essentially the same even for the best model so far (Tables 1, 2, 3 and 4).
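For reference, the Nagelkerke R-square rescales the Cox-Snell R-square so that its maximum is 1, and an interaction term is simply the product of two predictors. The sketch below shows both, computed from log-likelihoods; the function names and the toy probabilities are illustrative assumptions and do not reproduce the paper's values.

```python
import math


def with_interaction(row: dict, a: str, b: str) -> dict:
    """Return a copy of the feature row with the product a*b added as 'a:b'."""
    new_row = dict(row)
    new_row[f"{a}:{b}"] = row[a] * row[b]
    return new_row


def log_likelihood(probs, labels):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(probs, labels)
    )


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-square from null and fitted log-likelihoods, n observations.

    Cox-Snell:  1 - exp(2 * (ll_null - ll_model) / n)
    Its upper bound: 1 - exp(2 * ll_null / n)
    Nagelkerke is the ratio of the two.
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Toy example: intercept-only model versus a hypothetical fitted model.
labels = [0, 0, 1, 1]
ll_null = log_likelihood([0.5] * 4, labels)
ll_fit = log_likelihood([0.2, 0.3, 0.7, 0.9], labels)
r2 = nagelkerke_r2(ll_null, ll_fit, len(labels))
```

Because the statistic depends only on the likelihood, it can keep rising as terms are added even when thresholded accuracy barely moves, which is the pattern reported in Table 5.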
A boosting method was also explored. Boosting fits a new function in each iteration, focusing on
the portions of the dataset about which the models from previous iterations are not
confident. Gradient boosting with a generalized linear model as the
base learner reaches an accuracy of 98.02% (Table 6), which is considerably higher than that of the
single logistic model. The ensemble also performs better at minimizing the type-I error;
the threshold chosen was 0.85. It improves the type-II error and, in turn, the accuracy.
The result was compared with the report [7], which deals with the same dataset and applies SVM,
Multinomial Naive Bayes, KNN and AdaBoost with decision trees. Our model outperforms their models
in terms of type-I error, which comes out as 0.40%, at the cost of a decrease in accuracy of only
0.06%.
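The text does not describe the boosting implementation, so the following is only one possible reading: componentwise gradient boosting for the logistic loss, where each round fits a one-variable least-squares learner to the current residuals and adds a shrunken copy of the best one. All names, the number of rounds, the learning rate, and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(coefs, row):
    """coefs[0] is the intercept; the rest line up with the feature row."""
    return sigmoid(coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)))


def fit_boosted_linear(X, y, n_rounds=200, learning_rate=0.1):
    """Componentwise gradient boosting with linear base learners (log-loss)."""
    # Prepend a constant column so the intercept is just another base learner.
    rows = [[1.0] + [float(v) for v in row] for row in X]
    d = len(rows[0])
    coefs = [0.0] * d
    for _ in range(n_rounds):
        # Negative gradient of the log-loss with respect to the score: y - p.
        resid = [yi - predict_proba(coefs, r[1:]) for yi, r in zip(y, rows)]
        best = None  # (sum of squared errors, column index, slope)
        for j in range(d):
            sxx = sum(r[j] ** 2 for r in rows)
            if sxx == 0.0:
                continue
            slope = sum(r[j] * e for r, e in zip(rows, resid)) / sxx
            sse = sum((e - slope * r[j]) ** 2 for r, e in zip(rows, resid))
            if best is None or sse < best[0]:
                best = (sse, j, slope)
        if best is None:
            break
        # Add a shrunken copy of the best single-column fit to the model.
        coefs[best[1]] += learning_rate * best[2]
    return coefs


def log_loss(coefs, X, y):
    """Mean negative log-likelihood of the labels under the boosted model."""
    total = 0.0
    for row, yi in zip(X, y):
        p = predict_proba(coefs, row)
        total -= math.log(p) if yi == 1 else math.log(1.0 - p)
    return total / len(y)


# Toy data (hypothetical): the first column separates the classes.
X = [[0, 1], [1, 0], [2, 1], [8, 0], [9, 1], [10, 0]]
y = [0, 0, 0, 1, 1, 1]
coefs = fit_boosted_linear(X, y)
```

The resulting model is still linear in the predictors; what boosting adds in this form is gradual, feature-by-feature fitting with shrinkage, so predictors that never help are left at zero.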
V. CONCLUSION AND FUTURE SCOPE
The model presents an efficient spam detection algorithm which is on par with the state of the art
and has a significantly low type-II error. Thus, it is highly efficient in detecting spam while
not blocking hams. The drawback of this analysis is that it does not consider the combined
occurrence of words, which occurs naturally in language. Bigram and trigram frequency
analysis can be carried out to further improve the accuracy. A similar analysis can be carried out
for emails, with broadly similar features. Just as Xiaomi extended its capability for detecting
spam and built a recommendation engine that allows users to identify spam messages, this
algorithm can be used by in-house IT services in academic institutions to enable
better spam detection while keeping the type-II error very low.
REFERENCES:
[1] https://en.wikipedia.org/wiki/Short_Message_Service
[2] http://whatis.techtarget.com/definition/filter
[3] http://www-users.cs.umn.edu/~zhzhang/Papers/raid2013_jiang_spam.pdf
[4] https://www.aeaweb.org/articles?id=10.1257/jep.26.3.87
[5] http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf
[6] https://en.wikipedia.org/wiki/Stop_words
[7] http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf
[8] https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Fig. 1. Nagelkerke R-square value for different models (for values see Table 5)
Fig. 2. ROC for (a) model without interactions, (b) model + word_count*special_count, (c)
previous + special_count*upper_count, (d) previous + upper_count*word_count
Fig. 3. Feature importance (Gini index) for GradientBoostingModel
Table 1 (without interaction terms)
Table 2 (with word_count*special_count interaction term)
Table 3 (with word_count*special_count + special_count*upper_count)
Table 4 (with word_count*special_count + special_count*upper_count +
upper_count*word_count)
Model                                                                                      Accuracy (%)   Nagelkerke R-square
Model with no interaction                                                                  95.55          0.735
Model with word_count*special_count                                                        95.83          0.759
Model with word_count*special_count + special_count*upper_count                            95.62          0.765
Model with word_count*special_count + special_count*upper_count + upper_count*word_count   95.48          0.772
Table 5. Accuracy and Nagelkerke R-square for all logit models
Table 6 (Summary of Last Logit Model with all interaction terms)
Variable name Chi-square p-value
Intercept 410.76 0
special_count 6.350828 0.011733
upper_count 17.89395 2.34E-05
word_count 38.58931 5.23E-10
char_count 1.433727 0.231157
mistake_count 2.743402 0.097657
150p 0.000928 0.975694
call 48.27376 3.71E-12
cash 3.719562 0.053778
chat 14.91105 0.000113
claim 0.000386 0.984329
com 0.728009 0.393529
contact 26.16906 3.13E-07
customer 3.158536 0.075531
free 0.525269 0.468603
get 5.688389 0.017078
guaranteed 9.89E-05 0.992066
just 4.791905 0.028594
mobile 35.2288 2.93E-09
msg 6.476558 0.010931
new 37.70887 8.21E-10
nokia 1.607212 0.204884
now 6.688895 0.009702
per 1.503184 0.220182
phone 3.997506 0.045568
please 1.643732 0.199814
prize 0.000355 0.984958
reply 1.889935 0.169209
send 2.952542 0.085743
service 0.000379 0.984458
stop 0.95865 0.327527
text 57.51872 3.35E-14
tone 40.92106 1.59E-10
txt 29.92575 4.49E-08
urgent 34.70201 3.84E-09
week 2.307764 0.128729
will 6.152124 0.013125
win 6.407034 0.011367
won 0.022776 0.880041
www 22.17337 2.49E-06
word_count:upper_count 44.70183 2.29E-11
special_count:upper_count 33.34398 7.72E-09
special_count:word_count 24.39449 7.85E-07
Table 7. Wald’s Test for all the variables for all interaction model
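The p-values in Table 7 are consistent with referring each Wald chi-square statistic to a chi-square distribution with one degree of freedom. A minimal sketch of that calculation follows; the function names are illustrative, and the one-degree-of-freedom assumption is inferred from the tabulated values rather than stated in the text.

```python
import math


def wald_chi_square(estimate: float, std_error: float) -> float:
    """Wald statistic for a single coefficient: (estimate / standard error)^2."""
    return (estimate / std_error) ** 2


def wald_p_value(chi_square: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


# Chi-square value reported for special_count in Table 7.
p_special = wald_p_value(6.350828)
```

Applied to the special_count row (chi-square 6.350828), this gives a p-value of about 0.0117, in line with the tabulated 0.011733.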

 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Enabling Spam filtering

  • 1. ENABLING SPAM FILTERING FOR MOBILE ORIGINAL EQUIPMENT MANUFACTURERS
By Group 2
Avinash Kumar (15BM6JP08)
Ayan Sengupta (15BM6JP09)
Bharathi R (15BM6JP10)
Bodhisattwa Prasad Majumder (15BM6JP11)
Chandra Bhanu Jha (15BM6JP12)
Dattatreya Biswas (15BM6JP13)
Deepu Unnikrishnan (15BM6JP14)
Data Source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  • 2. I. INTRODUCTION
Spam is defined as an irrelevant or unsolicited message sent over communication channels, typically to a large number of users, for purposes such as advertising, phishing, or spreading malware. With the boom in the number of mobile users, SMS has grown into a multi-billion-dollar commercial industry. According to Wikipedia [1], SMS is the most widely used data application, with an estimated 3.5 billion active users, or about 80% of all mobile phone subscribers, at the end of 2010.

A spam filter is a program that prevents spam from reaching a user's inbox. Like other types of filtering [2] programs, a spam filter looks for certain criteria on which it bases its judgments. For example, the simplest and earliest versions (such as the one available with Microsoft's Hotmail) can be set to watch for particular words in the subject line of messages and to exclude these from the user's inbox. This method is not especially effective, too often omitting perfectly legitimate messages (these are called false positives) and letting actual spam through. In general, spam filters are estimated to reduce costs by roughly 30%.

II. BUSINESS SCOPE
According to a study [3], the volume of SMS spam in the US rose 45% in 2011 to 4.5 billion messages and, in 2012, more than 69% of mobile users claimed to have received text spam. A paper [4] published in the Journal of Economic Perspectives, titled "The Economics of Spam", estimated that Americans experience costs of almost $20 billion annually due to spam, while spammers and spam-advertised merchants collect gross worldwide revenues on the order of $200 million per year, and concluded that the "externality ratio" of external costs to internal benefits for spam is around 100:1. In other words, spammers impose a large cost on society while reaping fairly little in return.
Research [5] by a Stanford University scholar states that, owing to its growing popularity among young demographics and the decrease in text messaging charges over the years (in China it now costs less than $0.001 to send a text message), SMS spam is growing, and in 2012 up to 30% of text messages in parts of Asia were spam. Because SMS spam is more personal and more irritating than email spam, it also imposes costs on the receiver. If SMS spam remains unaddressed, a mobile operator with 10 million subscribers can incur up to $6 billion in losses per year.

When drawing a boundary for filtering SMS spam at the level of individual users, the business considerations include the cost of misclassifying a legitimate SMS as spam and the inconvenience caused by letting through a certain proportion of spam when a message's genuineness cannot be ascertained. The attempt here is to provide a worthwhile solution in light of these concerns. The dataset is taken from the UCI Machine Learning Repository and contains 5574 text messages [8].
  • 3. III. DATA PREPROCESSING
The experimental dataset consists of one large text file in which each line corresponds to one text message (SMS). Preprocessing of the data, feature extraction and engineering, and tokenization of each message are therefore required. For the initial analysis, each message in the dataset is split into tokens of alphanumeric characters, using whitespace as the delimiter. Stop words [6] were removed from all the text messages because they appear frequently in both response classes and have little discriminative power. The effect of abbreviations in the messages is ignored, and no word-stemming algorithm is used.

Additional features are generated from the number of special characters (!, (, ), ., :, $, etc.), the number of uppercase letters, the number of spelling mistakes, and the overall number of characters in the message. The intuition behind counting special characters is that spam usually tends to contain more of them, such as $, @, and #. The number of uppercase letters helps in detecting spam because spam commonly uses uppercase letters for emphasis. The intuition behind using message length as a feature is that the cost of sending a text message is the same as long as it stays within 160 characters, so marketers prefer to use most of the space available to them without exceeding the limit. An interesting observation from the data is that misspelled words are prevalent in ham and usually absent from spam.

Unigram frequency analysis was carried out to find the words used most frequently in spam after removing the stop words. The most frequently occurring words were identified from the spam messages using a term-document matrix. These words are expected to have more discriminative power in determining spam. However, not all of these words are useful in the classification.
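The tokenization and count-based features described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set is a tiny hypothetical stand-in (the report does not list the one it used), "special characters" are taken to be Python's `string.punctuation`, and the word count is taken before stop-word removal. The spelling-mistake count is left out because it would need a dictionary that the report does not specify. The feature names follow those shown in Table 7.

```python
import string

# Hypothetical, abbreviated stop-word set; the report does not give its list.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "of", "and"}


def tokenize(message):
    """Split on whitespace, keep lowercase alphanumeric tokens, drop stop words."""
    tokens = []
    for raw in message.split():
        token = "".join(ch for ch in raw if ch.isalnum()).lower()
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


def count_features(message):
    """Count-based features; 'special' means string.punctuation (an assumption)."""
    return {
        "special_count": sum(ch in string.punctuation for ch in message),
        "upper_count": sum(ch.isupper() for ch in message),
        "word_count": len(message.split()),
        "char_count": len(message),
    }


if __name__ == "__main__":
    sms = "WIN a FREE prize!! Call now: 0800 $$$"
    print(tokenize(sms))
    print(count_features(sms))
```

A real pipeline would swap in a full stop-word list and decide explicitly whether counts are taken on the raw or the cleaned message.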
Tokens (words) in the top 5 percentile by frequency among all words appearing in spam were taken as separate features. The words used as features are: {150p, call, cash, chat, claim, com, contact, customer, free, get, guaranteed, just, mobile, msg, new, nokia, now, per, phone, please, prize, reply, send, service, stop, text, tone, txt, urgent, week, will, win, won, www}. Indicator variables were used to denote the presence of each word: 1 if the word is present, 0 otherwise. Considering all the features, the training data finally contains 40 predictor variables.

IV. METHODOLOGY AND RESULTS
The logical approach to the problem is to identify the features that are distinctive of spam; ham messages are defined as those that are not spam. Thus, the response has two classes, spam (1) and ham (0). In the first phase of the analysis, a generalized linear model with a binomial logit link (logistic regression) was fitted on the 40 explanatory variables without any interaction terms. This model achieved an accuracy of 95.55% on the test set, presented as a confusion matrix in Table 1. As a measure of goodness of fit, the Nagelkerke R-square was calculated, giving a value of 0.735. From the model, the significant explanatory variables were identified as word_count, character_count, and special_count from the engineered features, and the words win, won, urgent,
  • 4. txt, text, tone, mobile, new, contact and call. These findings are intuitive: the words reflect the specific interests of the marketers who tend to send spam messages, and they are chosen to prompt the recipient to act on the message.

Misclassifying a ham as spam is costly, so the objective is to minimize the type-I error. The threshold at which the predicted probabilities are mapped to 0 (ham) or 1 (spam) was obtained by an iterative search for the value that achieves the minimum type-I error. Lowering the type-I error in this way reduces the model accuracy, because the type-II error increases. As a suitable trade-off, a type-I error of 0.02% is allowed. For the first model, the chosen threshold is 0.85. The ROC curve (Figure 2) also indicates an optimal operating point. Still, the threshold obtained by the iterative minimization was retained, in keeping with the objective of minimizing the misclassification of hams.

The investigation naturally extends to incorporating interaction terms in the model. The interaction terms word_count * special_count, special_count * upper_count, and upper_count * word_count were added sequentially, each time keeping all previous predictors intact. The Nagelkerke R-square increased steadily with the addition of interaction terms. The best Nagelkerke R-square was achieved when all the reported interaction terms were included along with the other 40 variables (Table 5). Wald's test (Table 7) was performed on all the predictor variables of the model with the best Nagelkerke R-square (Table 6). The accuracy did not improve much and stays about the same even with the best model so far (Tables 1, 2, 3, 4).
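The threshold search described above can be illustrated with a small Python sketch. It is not the authors' code: the function names, the grid step, and the rule "pick the lowest threshold whose type-I error stays within the allowed cap" are assumptions made for illustration. Here a type-I error means a ham message (label 0) predicted as spam, and `probs` stands for the predicted spam probabilities from any fitted classifier.

```python
def type_one_error(probs, labels, threshold):
    """Share of ham messages (label 0) predicted as spam at this threshold."""
    ham_probs = [p for p, y in zip(probs, labels) if y == 0]
    flagged = sum(p >= threshold for p in ham_probs)
    return flagged / len(ham_probs)


def accuracy(probs, labels, threshold):
    """Share of messages whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == (y == 1) for p, y in zip(probs, labels))
    return correct / len(labels)


def pick_threshold(probs, labels, max_type_one, steps=100):
    """Lowest grid threshold whose type-I error is within the cap, else None.

    Type-I error can only fall as the threshold rises, so the first grid
    point that satisfies the cap keeps as much spam detection as possible.
    """
    for i in range(steps + 1):
        t = i / steps
        if type_one_error(probs, labels, t) <= max_type_one:
            return t
    return None


if __name__ == "__main__":
    # Toy scores and labels, made up purely for illustration.
    probs = [0.05, 0.20, 0.60, 0.90, 0.95, 0.40]
    labels = [0, 0, 0, 1, 1, 1]
    t = pick_threshold(probs, labels, max_type_one=0.0)
    print(t, accuracy(probs, labels, t))
```

On the toy data, forcing the type-I error to zero pushes the threshold above the highest-scoring ham, and one spam message is then missed, which is the accuracy trade-off the text describes.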
A boosting method was also explored. Boosting is an iterative process that fits a new function in each iteration, focusing on the segments of the dataset about which the models from previous iterations are not confident. Gradient boosting with a generalized linear regression as the base model reaches an accuracy of 98.02% (Table 6), which is considerably higher than that of a single logistic model. The ensemble also performs better at minimizing the type-I error; the chosen threshold was 0.85. It improves the type-II error and, in turn, the accuracy. The result was compared with the report [7], which deals with the same dataset and applies SVM, Multinomial Naive Bayes, KNN, and AdaBoost with decision trees. It beats their models in terms of type-I error, which comes out as 0.40%, at the cost of a decrease in accuracy of only 0.06%.

V. CONCLUSION AND FUTURE SCOPE
The model presents an efficient spam detection algorithm that is on par with the state of the art and has a significantly low type-II error. Thus, it is highly efficient at detecting spam while not blocking hams. A drawback of this analysis is that it does not consider the combined occurrence of words, which occurs naturally in language. Bigram and trigram frequency analysis could be carried out to further improve the accuracy. A similar analysis could be carried out for emails, with almost the same features. Just as Xiaomi extended its capability of detecting spam and built a recommendation engine that allows users to identify spam messages, this algorithm can be used by in-house IT services in academic institutions to enable better spam detection while keeping the type-II error very low.
  • 5. REFERENCES:
[1] https://en.wikipedia.org/wiki/Short_Message_Service
[2] http://whatis.techtarget.com/definition/filter
[3] http://www-users.cs.umn.edu/~zhzhang/Papers/raid2013_jiang_spam.pdf
[4] https://www.aeaweb.org/articles?id=10.1257/jep.26.3.87
[5] http://cs229.stanford.edu/proj2013/ShiraniMehrSMSSpamDetectionUsingMachineLearningApproach.pdf
[6] https://en.wikipedia.org/wiki/Stop_words
[7] http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf
[8] https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  • 6. Fig. 1. Nagelkerke R square value for different models (for values see Table 5)
(a) (b) (c) (d)
  • 7. Fig. 2. ROC for (a) model without interactions, (b) model + word_count*special_count, (c) previous + special_count*upper_count, (d) previous + upper_count*word_count
Fig. 3. Feature importance (Gini index) for the gradient boosting model
  • 8. Table 1 (without interaction terms)
Table 2 (with word_count*special_count interaction terms)
Table 3 (with word_count*special_count + special_count*upper_count)
Table 4 (with word_count*special_count + special_count*upper_count + upper_count*word_count)

Model                                                                                       Accuracy (%)   Nagelkerke R square
Model with no interaction                                                                   95.55          0.735
Model with word_count*special_count interaction terms                                       95.83          0.759
Model with word_count*special_count + special_count*upper_count                             95.62          0.765
Model with word_count*special_count + special_count*upper_count + upper_count*word_count    95.48          0.772

Table 5. Accuracy and Nagelkerke R square for all logit models
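For readers unfamiliar with the goodness-of-fit measure reported in Table 5, the sketch below computes the Nagelkerke R square from two log-likelihoods using its standard definition: the Cox and Snell R square divided by its maximum attainable value. The function and argument names are illustrative, and the numbers in the usage lines are invented; the report does not publish its log-likelihoods.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R square from log-likelihoods of the null and fitted models.

    ll_null  : log-likelihood of the intercept-only model
    ll_model : log-likelihood of the fitted model
    n        : number of observations
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


if __name__ == "__main__":
    # Invented log-likelihoods for a 100-message sample, for illustration only.
    print(nagelkerke_r2(ll_null=-69.3, ll_model=-30.0, n=100))
```

The rescaling is what lets the measure reach 1 for a perfectly fitting model, which the plain Cox and Snell version cannot do for a binary response.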
  • 9. Table 6 (Summary of Last Logit Model with all interaction terms)
  • 10.
Variable name                Chi-square   P-value
Intercept                    410.76       0
special_count                6.350828     0.011733
upper_count                  17.89395     2.34E-05
word_count                   38.58931     5.23E-10
char_count                   1.433727     0.231157
mistake_count                2.743402     0.097657
150p                         0.000928     0.975694
call                         48.27376     3.71E-12
cash                         3.719562     0.053778
chat                         14.91105     0.000113
claim                        0.000386     0.984329
com                          0.728009     0.393529
contact                      26.16906     3.13E-07
customer                     3.158536     0.075531
free                         0.525269     0.468603
get                          5.688389     0.017078
guaranteed                   9.89E-05     0.992066
just                         4.791905     0.028594
mobile                       35.2288      2.93E-09
msg                          6.476558     0.010931
new                          37.70887     8.21E-10
nokia                        1.607212     0.204884
now                          6.688895     0.009702
per                          1.503184     0.220182
phone                        3.997506     0.045568
please                       1.643732     0.199814
prize                        0.000355     0.984958
reply                        1.889935     0.169209
send                         2.952542     0.085743
service                      0.000379     0.984458
stop                         0.95865      0.327527
text                         57.51872     3.35E-14
tone                         40.92106     1.59E-10
txt                          29.92575     4.49E-08
urgent                       34.70201     3.84E-09
week                         2.307764     0.128729
will                         6.152124     0.013125
win                          6.407034     0.011367
won                          0.022776     0.880041
www                          22.17337     2.49E-06
word_count:upper_count       44.70183     2.29E-11
  • 11.
special_count:upper_count    33.34398     7.72E-09
special_count:word_count     24.39449     7.85E-07

Table 7. Wald's test for all the variables of the model with all interaction terms
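The p-values in Table 7 can be related to the chi-square column with a short calculation. The sketch below assumes each Wald statistic is referred to a chi-square distribution with one degree of freedom, which the report does not state explicitly; under that assumption the upper-tail probability has a closed form in terms of the complementary error function. The function name is illustrative.

```python
import math


def wald_p_value(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    # Chi-square values copied from Table 7.
    for name, stat in [("special_count", 6.350828),
                       ("upper_count", 17.89395),
                       ("char_count", 1.433727)]:
        print(name, wald_p_value(stat))
```

For the rows used here the computed values agree closely with the tabulated p-values, which supports the one-degree-of-freedom reading of the table.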