Prepare black list using bayesian approach to improve performance of spam filter 2
Mar 02, 2013
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 1, January-February (2013), pp. 318-324, © IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com

PREPARE BLACK LIST USING BAYESIAN APPROACH TO IMPROVE PERFORMANCE OF SPAM FILTER

Nitin Rola1, Prof. Rashmi Gupta2
1 Computer Science & Engineering, TIT, Bhopal
2 Computer Science & Engineering, TIT, Bhopal

ABSTRACT

Email is a secure, cheap, easy and reliable communication medium, but it has one big disadvantage: spam (junk) email. One solution is an automatic filtering system that eliminates unwanted (spam) mail. The Bayesian approach is efficient and powerful for this task. It appears to be a simple text classification technique, but it remains an active research topic because the cost of misclassifying a legitimate mail as spam is very high. Here we consider an origin-based approach together with a Bayesian approach for filtering spam mail. The major issue with the Bayesian approach is the performance of the filter when the word library becomes very large. To improve performance, we first classify an e-mail on the basis of its origin (black list) and then classify it by the Bayesian approach, which makes the filter more accurate and faster.

Keywords: Automated Accurate and Faster Spam Filter, Train Origin Database by Bayesian Approach, Self Learning.

I. INTRODUCTION

This is an era of rapid information exchange, and email is one of the advanced, secure, cheap, reliable and fast technologies for it. The number of email users is increasing day by day, and so is the volume of unwanted mail (spam).
Email is also a popular medium of communication for e-commerce, which has opened the door for direct marketers to bombard users' mailboxes with unwanted mail. Because the same copy of a mail is stored in many users' mailboxes on the same server, spam wastes both server resources and bandwidth. Spam mail is also called unsolicited bulk mail or junk mail; in short, spam is unwanted internet email. Spam is an ever-increasing problem. The number of spam mails is increasing daily; studies show that over 90% of all current email is spam. Added to this, spammers are becoming more sophisticated and constantly manage to outsmart 'static' methods of fighting spam. The techniques currently used by most anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little. To do this, spammers simply examine the latest anti-spam techniques and find ways to dodge them. To combat spam effectively, a new adaptive technique is needed. This method must keep up with spammers' tactics as they change over time. It must also be able to adapt to the particular organization that it is protecting from spam. The answer lies in Bayesian mathematics.

The following figure shows a maximum of 34.7 spam mails sent per second and a total of 12666548 spam mails sent in the last month.

Fig 1: SpamCop Statistics

For filtering, we combine two approaches, origin and Bayesian, for speed and accuracy. The origin technique offers high speed but poor accuracy, while the Bayesian technique offers high accuracy but low speed. We therefore take advantage of both techniques to develop a highly accurate and faster spam filter.

II. ORIGIN-BASED FILTER

Origin-based filters are methods that use network information to detect whether a message is spam or not. The IP address and the email address are the most important pieces of network information used. There are several major types of origin-based filters, such as blacklists, whitelists, and challenge/response systems. Here we use the blacklist technique and maintain the black list by self-learning: we train the black list database from spam mail that has been classified by the Bayesian filter.

III. BAYESIAN APPROACH

Naive Bayesian is a fundamental statistical approach based on probability, initially proposed by Sahami et al. (1998). The Bayesian algorithm predicts the classification of a new e-mail by identifying it as spam or legitimate.
This is achieved by looking at the features using a 'training set' which has already been pre-classified correctly, and then checking whether a particular word appears in the e-mail. A high probability indicates that the new e-mail is spam.
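As a rough illustration of this idea, the following Python sketch estimates how strongly a single word indicates spam from a tiny pre-classified training set. The training mails, the function name word_spam_probability, and the 0.5 fallback for unseen words are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical, hand-made training set: (text, label) pairs, already classified.
TRAINING_SET = [
    ("win free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting schedule for project", "legitimate"),
    ("project report attached", "legitimate"),
]


def word_spam_probability(word, training_set):
    """Estimate P(spam | word) as the share of mails containing `word` that are spam."""
    containing = Counter()
    for text, label in training_set:
        if word in text.lower().split():
            containing[label] += 1
    total = containing["spam"] + containing["legitimate"]
    if total == 0:
        return 0.5  # unseen word: no evidence either way (an assumption)
    return containing["spam"] / total
```

With this toy data, a word that appears only in spam mails scores 1.0 and a word that appears only in legitimate mails scores 0.0; a real filter would need far more training mail for the estimates to be meaningful.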
A Bayesian classifier is simply a Bayesian network applied to a classification task. It contains a node C representing the class variable (junk or legitimate) and a node Xi for each feature (each word). Given a specific instance x (an assignment of values x1, x2, x3, ..., xn to the feature variables), the Bayesian network allows us to compute the probability P(C = ck | X = x) for each possible class ck. This is done via Bayes' theorem:

$$P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$

In the context of classification, specifically junk email filtering, mail messages must be represented as feature vectors so that such Bayesian classification methods are directly applicable.

IV. ACTUAL IMPLEMENTATION

We divided this implementation into the following parts.
A. Training
B. Classification

A. Training

In the training part we train the following three databases of the spam filter.
• Origin email id with counter (blacklist).
• Spam words with counter.
• Legitimate words with counter.

For our system we used some mails from the following e-mail IDs to train the database.
• email@example.com
• firstname.lastname@example.org
• email@example.com

In this algorithm we have neglected some commonly occurring words. The list of these words is as follows: hi, hello, dear, regards, thank, thanks, of, into, they, she, it, been, he, in, the, how, where, an, out, you, i, am, there, not, can, could, would, will, if, has, have, why, who, had, with, your, or, any, my, we, so, date, to, from, mon, monday, tue, tuesday, wed, wednesday, thu, thursday, fri, friday, sat, saturday, sun, sunday, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec, let, make, put, seem, take, about, among, at, between, now, out, still, almost, even, much, quite, very, please.

A.1 Training (Algorithm)
1. After classification, retrieve the sender email id of all spam mail.
2. If the sender email id of a spam mail is already in the origin (blacklist) database, increase its count; otherwise insert the email id into the origin (blacklist) database.
3. Retrieve the sender email id of all legitimate mail.
4. If the sender email id of a legitimate mail is in the origin (blacklist) database, set its count to zero.
5. Extract features (words) from all spam mail.
6. Update the spam database: if a word is already present, increase its count by one; otherwise insert it as a new word with count one.
7. Update the legitimate database with the words of all legitimate mail: if a word is already present, increase its count by one; otherwise insert it as a new word with count one.
8. Database improvement is complete.
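The training steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the three databases are modelled as in-memory Counter objects, STOP_WORDS is a shortened stand-in for the paper's list of neglected words, a newly blacklisted sender is assumed to start with count one, and the names extract_words and train are hypothetical.

```python
from collections import Counter

# Shortened stand-in for the paper's list of neglected common words.
STOP_WORDS = {"hi", "hello", "dear", "regards", "the", "of", "to", "from", "please"}


def extract_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def train(mails, blacklist, spam_words, legit_words):
    """Update the three databases from already classified (sender, text, label) triples."""
    spam = [m for m in mails if m[2] == "spam"]
    legit = [m for m in mails if m[2] != "spam"]

    for sender, text, _ in spam:
        # Steps 1-2: a known sender's count goes up; a new sender starts at 1 (assumption).
        blacklist[sender] += 1
        # Steps 5-6: count every remaining word of the spam mail.
        spam_words.update(extract_words(text))

    for sender, text, _ in legit:
        # Steps 3-4: a blacklisted sender of legitimate mail has its count reset.
        if sender in blacklist:
            blacklist[sender] = 0
        # Step 7: count every remaining word of the legitimate mail.
        legit_words.update(extract_words(text))
```

Calling train repeatedly with newly classified mail gives the self-learning behaviour described in the text; a persistent store would replace the Counter objects in a real system.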
A.2 Training (Flow Chart)

Retrieve sender email id of all spam mail
  If sender email id is available in origin database
    Yes: Increase counter of this email id in origin database
    No: Insert as a new entry in origin database
Retrieve sender email id of all legitimate mail
  If sender email id of legitimate mail is available in origin database
    Yes: Set counter value as zero
    No: Insert as a new entry in origin
Retrieve word of all legitimate mail
  If word is available in legitimate database
    Yes: Increase counter value by 1
    No: Insert as a new word
Retrieve word of all spam mail
  If word is available in spam database
    Yes: Increase counter value by 1
    No: Insert as a new word
Training process complete
A.3 Classification Process (Algorithm)
1. Download new mail.
2. Retrieve the origin (sender email id).
3. If there is no sender id, classify the mail as spam.
4. If the sender email id is in the origin database, check its count: if the count is greater than 20, classify the mail as spam; otherwise send the mail to the second level (Bayesian) for classification.
5. The second level (Bayesian) receives the mail that was not classified by the first level (origin).
6. Extract features (words) from the mail and store them in a temporary database together with their frequency of occurrence in that mail.
7. If there is no text in the mail, classify it as spam.
8. If there is an attachment, flag the mail for manual checking, because the filter cannot read attachments.
9. For each word, calculate the probability for spam and for legitimate using the Bayesian formula above.
10. Store the spam and legitimate probabilities of each word in the temporary database.
11. Calculate the sum of the probabilities of all words of the mail, separately for spam and for legitimate.
12. If the sum of probabilities for spam is greater than that for legitimate, classify the mail as spam; otherwise classify it as legitimate.
13. If the sums for spam and legitimate are equal, classify the mail as legitimate.
14. Classification process is complete.

A.4 Classification Process (Flow Chart)

New mail
Retrieve sender ID
If sender ID is available in origin database and count > 20
  Yes: Classify as spam
  No: Extract features (words)
      Calculate probabilities (spam and legitimate)
      If Spam_Prob > Legit_Prob
        Yes: Classify as spam
        No: Classify as legitimate
Update database for self-learning
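The two-level classification above can be sketched in Python as follows. This is a sketch under assumptions rather than the paper's code: the databases are plain dictionaries, the function name classify is hypothetical, attachment handling (step 8) is left out, and each occurrence of a word is counted separately. The per-word probability is taken as n_spam / (n_spam + n_legit), which is what Bayes' theorem gives when the class priors are estimated from the total word counts of the two databases; the paper does not state how its priors are chosen.

```python
from collections import Counter

BLACKLIST_THRESHOLD = 20  # count limit from step 4 of the algorithm


def classify(sender, words, blacklist, spam_words, legit_words):
    """Return "spam" or "legitimate" for one mail given its sender and word list."""
    # Level 1 (origin): missing or heavily blacklisted senders are spam.
    if not sender:
        return "spam"
    if blacklist.get(sender, 0) > BLACKLIST_THRESHOLD:
        return "spam"

    # Level 2 (Bayesian): a mail without text is treated as spam (step 7).
    if not words:
        return "spam"

    spam_sum = legit_sum = 0.0
    for word, freq in Counter(words).items():
        n_spam = spam_words.get(word, 0)
        n_legit = legit_words.get(word, 0)
        if n_spam + n_legit == 0:
            continue  # word unseen in training: contributes nothing (assumption)
        spam_sum += freq * n_spam / (n_spam + n_legit)
        legit_sum += freq * n_legit / (n_spam + n_legit)

    # Steps 12-13: spam only if strictly larger; a tie counts as legitimate.
    return "spam" if spam_sum > legit_sum else "legitimate"
```

Because the origin check returns before any word is examined, mail from a heavily blacklisted sender never reaches the more expensive word-level computation, which is the speed-up the approach relies on.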
V. RESULTS

TABLE 1 (Total Mail = 28)

Level      Classified Spam   Classified Legitimate   Actual Spam   Actual Legitimate
Origin     5                 23                      23            5
Bayesian   17                6                       18            5

TABLE 2 (Total Mail = 17)

Level      Classified Spam   Classified Legitimate   Actual Spam   Actual Legitimate
Origin     6                 11                      13            4
Bayesian   9                 4                       9             4

In Table 1, 5 of the 28 mails are classified as spam at the origin level, so the second level only has to check the content of the 23 mails that were not classified as spam at the origin level. In Table 2, 6 of the 17 mails are classified at the origin level, so the second level only has to check the content of the remaining 11 mails. The origin level alone is not accurate: if spam arrives from a different email id, it is classified as legitimate. We therefore use the Bayesian approach in the second level to improve accuracy, giving it as input all mails that were classified as legitimate by the origin level. If we did not use the origin level, the Bayesian level would have to check the contents of all mails, which would degrade the performance of the filter.

VI. CONCLUSION

At a time when junk email is a growing problem, we have built a system that classifies junk mail automatically, using the concept of origin together with Bayes' theorem for the classification task. The efficiency of this kind of system can be enhanced by considering not only the words of a mail as features but also other domain-specific features that provide strong evidence of junk. Manually written rules can also be added alongside the system to improve its performance. We have not considered the header of the mail here, so in future work the header can be used to improve the accuracy of the system.

REFERENCES

Journal Papers:
Thamarai Subramaniam, Hamid A. Jalab and Alaa Y. Taqa, Overview of textual anti-spam filtering techniques, International Journal of the Physical Sciences, Vol. 5(12), pp. 1869-1882, 4 October 2010.
Alia Taha Sabri, Adel Hamdan Mohammad, Bassam Al-Shargabi and Maher Abu Hamdeh, Developing New Continuous Learning Approach for Spam Detection using Artificial Neural Network (CLA_ANN), European Journal of Scientific Research, ISSN 1450-216X, Vol. 42 No. 3 (2010), pp. 525-535, © EuroJournals Publishing, Inc. 2010.
Ahmed Khorsi, An Overview of Content-Based Spam Filtering Techniques, Informatica 31 (2007) 269-277.
Giorgio Fumera, Ignazio Pillai and Fabio Roli, Spam Filtering Based On The Analysis Of Text Information Embedded Into Images, Journal of Machine Learning Research 7 (2006) 2699-2720.
Ms. Jyoti Pruthi and Dr. Ela Kumar, "Data Set Selection In Anti-Spamming Algorithm - Large Or Small", International Journal of Computer Engineering and Technology (IJCET), Volume 3, Issue 2, 2012, pp. 206-212. Published by IAEME.
C.R. Cyril Anthoni and Dr. A. Christy, "Integration Of Feature Sets With Machine Learning Techniques For Spam Filtering", International Journal of Computer Engineering and Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47-52. Published by IAEME.

Theses:
Jon Kagstrom, Improving Naive Bayesian Spam Filtering, Mid Sweden University, Department for Information Technology and Media, Spring 2005.
Thomas Richard Lynam, Spam Filter Improvement Through Measurement, Waterloo, Ontario, Canada, 2009.
Csaba Gulyas, Creation of a Bayesian network-based meta spam filter, using the analysis of different spam filters, Budapest, 16th May 2006.

Proceedings Papers:
Vikas P. Deshpande, Robert F. Erbacher, and Chris Harris, An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques, Proceedings of the 2007 IEEE Workshop on Information Assurance, United States Military Academy, West Point, NY, 20-22 June 2007.
Yanhui Guo, Yaolong Zhang, Jianyi Liu and Cong Wang, Research on the Comprehensive Anti-Spam Filter, © 2006 IEEE.
Xi-Lin Zhao, Jian-Zhong Zhou, Bo Fu and Hui Lui, Research of Probability Petri Nets Model For Fault Diagnosis Based on Bayesian theorem, Proceedings of the 7th World Congress on Intelligent Control and Automation, June 25-27, 2008, Chongqing, China.
Biju Issac, Wendy Japutra Jap and Jofry Hadi Sutanto, Improved Bayesian Anti-Spam Filter Implementation and Analysis on Independent Spam Corpuses, 2009 International Conference on Computer Engineering and Technology.
Chengcheng Li and Jianyi Liu, Combining Behavior And Bayesian Chinese Spam Filter, Proceedings of IC-NIDC2009.
Yishan Gong and Qiang Chen, Research of Spam Filtering Based on Bayesian Algorithm, 2010 International Conference on Computer Application and System Modeling (ICCASM 2010).