International XII. Turkish Symposium on Artificial Intelligence and Neural Networks - TAINN 2003


                             ADAPTIVE TURKISH ANTI-SPAM FILTERING

Levent Özgür, Tunga Güngör, Fikret Gürgen

e-mail: ozgurlev@boun.edu.tr, gungort@boun.edu.tr, gurgen@boun.edu.tr

Boğaziçi University, Computer Engineering Dept.
Bebek 34342, İstanbul, Turkey

        Keywords: Artificial Neural Network, Bayes Filter, morphology, anti-spam filter, mutual information



ABSTRACT

We propose an anti-spam filtering algorithm for the Turkish language based on Artificial Neural Networks (ANN) and a Bayes Filter. The algorithm has two parts: the first part deals with the morphology of Turkish words; the second part classifies the e-mails using the word roots extracted by the morphology part [1,2]. A total of 584 mails (337 spam and 247 normal) are used in the experiments. A success rate of up to 95% for spam mail detection and 90% for normal mail detection is achieved.

I. INTRODUCTION

Spam mails, by definition, are electronic messages posted blindly to thousands of recipients, usually for advertisement. As time passes, a growing percentage of mail is treated as spam, and this increases the seriousness of the problem.

The aim of this paper is to present an anti-spam filtering algorithm for the Turkish language based on ANN and a Bayes Filter.

II. DATA SET AND MORPHOLOGY MODULE

A Turkish morphological analysis program was prepared based on earlier research on Turkish morphology [5]. We have observed that this module can parse the roots of the words in any text file with a success rate over 90%. Most of the words that cannot be parsed are proper nouns and misspelled words. Regarding time complexity, the actual CPU time was found to be approximately 6 seconds for an e-mail of average length.

Spam mails were collected at two different mail addresses [6]. The morphology module uses these mails to create two main files (spam and normal) for the learning module, which classifies the e-mails.
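How the morphology module could feed the learning module is sketched below. This is a minimal illustration only: the root-extraction routine parseTurkishRoot and the corpus file names spam.txt and normal.txt are hypothetical stand-ins, not the interface or files of the actual implementation.

// Hypothetical sketch: extract the root of every word in a mail and append
// the roots to a per-class corpus file that the learning module reads.
#include <fstream>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

// Stand-in for the Turkish morphological analyzer [5]: returns the root of a
// word, or std::nullopt when no valid Turkish parse is found (e.g. proper
// nouns and misspelled words).
std::optional<std::string> parseTurkishRoot(const std::string& word) {
    return word;  // placeholder only: treat the surface form as its own root
}

// Appends the roots of all words in mailBody to the corpus file of the given
// class; the learning module later builds its feature vectors from these files.
void addMailToCorpus(const std::string& mailBody, bool isSpam) {
    std::ofstream corpus(isSpam ? "spam.txt" : "normal.txt", std::ios::app);
    std::istringstream words(mailBody);
    std::string word;
    while (words >> word) {
        if (auto root = parseTurkishRoot(word)) {
            corpus << *root << '\n';
        }
    }
}

int main() {
    addMailToCorpus("kazandiniz hemen tiklayin", /*isSpam=*/true);
    addMailToCorpus("toplanti yarin saat ikide", /*isSpam=*/false);
    std::cout << "corpus files updated\n";
}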

III. ARTIFICIAL NEURAL NETWORK

The ANN algorithm is applied to the data obtained for the learning module. In this study, as stated previously, we used 337 spam and 247 normal mails. For the ANN algorithm, we first applied a single-layer network (SLP).

In the program, two main algorithms are implemented to specify the weights of the nodes in the network. In the first one, each message is represented by a vector $X = (x_1, x_2, \ldots, x_n)$ [3]. We use binary attributes $X_1, X_2, \ldots, X_n$: in our experiments each word is represented by an $x_i$, and the question is simply whether the word occurs in the text or not.

The second algorithm takes the length of the text into consideration, so the number of occurrences of the word is also important. The weight of a node is, in fact, the frequency of that word in the text, calculated by the formula

$$X_i = \frac{\text{number of occurrences of that word in the text}}{\text{number of all words in the text}}$$
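The two weighting schemes can be written out as the following sketch; the feature words and the tokenized mail used here are illustrative assumptions, not data from the experiments.

// Sketch of the two node-weighting schemes: binary attributes (does the
// feature word occur at all?) and frequency attributes (occurrences divided
// by the number of words in the text).
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Binary model: x_i = 1 if the i-th feature word occurs in the text, else 0.
std::vector<double> binaryVector(const std::vector<std::string>& features,
                                 const std::vector<std::string>& words) {
    std::vector<double> x(features.size(), 0.0);
    for (std::size_t i = 0; i < features.size(); ++i)
        for (const auto& w : words)
            if (w == features[i]) { x[i] = 1.0; break; }
    return x;
}

// Probabilistic model: x_i = (occurrences of the i-th feature word) /
//                            (number of all words in the text).
std::vector<double> frequencyVector(const std::vector<std::string>& features,
                                    const std::vector<std::string>& words) {
    std::vector<double> x(features.size(), 0.0);
    for (std::size_t i = 0; i < features.size(); ++i) {
        int count = 0;
        for (const auto& w : words)
            if (w == features[i]) ++count;
        x[i] = words.empty() ? 0.0
                             : static_cast<double>(count) / words.size();
    }
    return x;
}

int main() {
    std::vector<std::string> features = {"kazan", "bedava", "toplanti"};  // example roots
    std::vector<std::string> words;
    std::istringstream mail("bedava hediye kazan kazan simdi");
    for (std::string w; mail >> w; ) words.push_back(w);

    for (double v : binaryVector(features, words))    std::cout << v << ' ';
    std::cout << '\n';
    for (double v : frequencyVector(features, words)) std::cout << v << ' ';
    std::cout << '\n';
}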
Given the categories ($c$: spam or normal) and the candidate words for the feature vector, the mutual information (MI) between them is calculated as [3]:

$$MI(X, C) = \sum_{x \in \{0,1\},\; c \in \{spam,\, normal\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)}$$

The attributes with the highest MI are selected to form the feature vector. These highest-valued words are most probably the words that occur frequently in spam mails and much less frequently in normal mails, so the selected words are said to be the best classifiers for spam and normal mails.
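A minimal sketch of this selection step is given below: the joint probabilities are estimated from document counts ($X$ = word present or absent, $C$ = spam or normal), every candidate word is scored with the formula above, and the candidates are ranked by their scores. The toy corpora and candidate words are illustrative assumptions.

// Sketch of MI-based feature selection from per-class document counts.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <set>
#include <string>
#include <vector>

using Doc = std::set<std::string>;  // the distinct word roots of one mail

double mutualInformation(const std::string& word,
                         const std::vector<Doc>& spam,
                         const std::vector<Doc>& normal) {
    const double total = spam.size() + normal.size();
    auto contains = [&](const Doc& d) { return d.count(word) > 0; };
    // joint counts n[x][c]: x = word absent(0)/present(1), c = normal(0)/spam(1)
    double n[2][2] = {{0, 0}, {0, 0}};
    for (const Doc& d : normal) n[contains(d)][0] += 1;
    for (const Doc& d : spam)   n[contains(d)][1] += 1;

    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int c = 0; c < 2; ++c) {
            double pxc = n[x][c] / total;              // P(X = x, C = c)
            double px  = (n[x][0] + n[x][1]) / total;  // P(X = x)
            double pc  = (n[0][c] + n[1][c]) / total;  // P(C = c)
            if (pxc > 0.0)                             // 0 * log 0 is taken as 0
                mi += pxc * std::log(pxc / (px * pc));
        }
    return mi;
}

int main() {
    std::vector<Doc> spam   = {{"kazan", "bedava"}, {"kazan", "tikla"}};
    std::vector<Doc> normal = {{"toplanti", "rapor"}, {"rapor", "proje"}};

    std::vector<std::string> candidates = {"kazan", "rapor", "bedava", "proje"};
    std::sort(candidates.begin(), candidates.end(),
              [&](const std::string& a, const std::string& b) {
                  return mutualInformation(a, spam, normal) >
                         mutualInformation(b, spam, normal);
              });
    for (const auto& w : candidates) std::cout << w << '\n';  // best candidates first
}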


IV. BAYES CLASSIFIER

The Bayes algorithm is one of the machine learning methods frequently used in text categorization. It uses a discriminant function to compute the conditional probabilities $P(C_i \mid X)$ [11]. Here, given the inputs, $P(C_i \mid X)$ denotes the probability that example $X$ belongs to class $C_i$:

$$P(C_i \mid X) = \frac{P(C_i) \cdot P(X \mid C_i)}{P(X)}$$

$P(C_i)$ is the probability of observing class $i$. $P(X \mid C_i)$ denotes the probability of observing the example, given class $C_i$. $P(X)$ is the probability of the input, which is independent of the classes.

For the Bayes algorithm, three main approaches were implemented to score the classes on each input, namely the Binary Model, the Probabilistic Model and the Advance Probabilistic Model, which are studied in detail in the original paper.
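The sketch below shows one way such a scorer can be organized, in the spirit of the Binary Model: $P(X \mid C_i)$ is approximated by a product of per-word probabilities estimated from training counts, and $P(X)$ is dropped because it is the same for both classes. The Laplace smoothing, the log-space comparison and the toy statistics are assumptions of this sketch, not details taken from the paper.

// Minimal naive Bayes sketch for binary feature vectors: score each class by
// log P(C_i) + sum_j log P(x_j | C_i) and pick the class with the higher score.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct ClassStats {
    double prior;                   // P(C_i): fraction of training mails in this class
    std::vector<int> docsWithWord;  // per feature word: training mails containing it
    int docCount;                   // total training mails of this class
};

double logScore(const ClassStats& c, const std::vector<int>& x) {
    double score = std::log(c.prior);
    for (std::size_t j = 0; j < x.size(); ++j) {
        // Laplace-smoothed estimate of P(word j present | class)
        double p = (c.docsWithWord[j] + 1.0) / (c.docCount + 2.0);
        score += std::log(x[j] ? p : 1.0 - p);
    }
    return score;
}

int main() {
    // Toy statistics for two feature words; class sizes follow the data set
    // used in the experiments (337 spam, 247 normal mails).
    ClassStats spam   {337.0 / 584.0, {300,  10}, 337};
    ClassStats normal {247.0 / 584.0, { 15, 200}, 247};

    std::vector<int> mail = {1, 0};  // first feature word present, second absent
    bool isSpam = logScore(spam, mail) > logScore(normal, mail);
    std::cout << (isSpam ? "spam" : "normal") << '\n';
}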
V. RESULTS

Six different runs were performed for each algorithm and for each number of inputs (10, 40, 70 for the ANN algorithms and 1, 11, 21, 31, 41 for the Bayes Filter). Each run took a different part of the data set for training and testing.

All of these algorithms were implemented in Visual C++ 6.0 under Windows XP, on an Intel Pentium PC running at 1.6 GHz with 256 MB RAM.

In Experiment 1, success rates are observed to be over 90% on normal data, but only around 75% on spam test data. Different algorithms are more successful than others depending on the number of inputs and on the data set. Generally, we can conclude that the probabilistic model achieves slightly better results than the binary model. Also, the multi-layer (MLP) algorithms are more successful than the single-layer (SLP) algorithms, but their runs take considerably longer because of their large number of hidden layers. With fewer hidden layers (20, 30, 50), the algorithms are not more successful than the SLP algorithm. Based on these results, SLP with the probabilistic model is selected as the ideal model.

In Experiment 2, although SLP with the probabilistic model seemed to be the best of the implemented models, the success rates of the filter were not as high as expected, especially in the spam test runs. To increase these rates, we developed some additional models to work together with SLP. One of these models also uses the non-Turkish words in the mails for the spam filter, in addition to the Turkish words: we implemented it by using the word itself whenever an accepted Turkish root of the word cannot be found by the program. With this improvement, the success rates of the filter reached 90% in total, up from about 80% previously.
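The fallback just described can be sketched as follows; parseTurkishRoot is the same hypothetical analyzer interface assumed in the earlier sketch, and its placeholder body is only there to keep the example self-contained.

#include <iostream>
#include <optional>
#include <string>

// Placeholder stand-in for the morphological analyzer: returns the root of a
// word, or std::nullopt when no accepted Turkish parse exists.
std::optional<std::string> parseTurkishRoot(const std::string& word) {
    (void)word;
    return std::nullopt;  // placeholder only: pretend no Turkish parse was found
}

// Fallback described above: when no accepted Turkish root is found, the word
// itself is used as the feature token, so non-Turkish words still contribute
// to the filter.
std::string featureToken(const std::string& word) {
    if (auto root = parseTurkishRoot(word))
        return *root;  // accepted Turkish root
    return word;       // fallback to the surface form of the word
}

int main() {
    std::cout << featureToken("free") << '\n';  // non-Turkish word kept as-is
}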
As seen in Figure 2, we also implemented two other models by improving the selection of the candidates for the feature vector. Previously, the highest-valued words were most probably the words occurring frequently in spam mails and much less frequently in normal mails. We then changed the model so that words occurring frequently in normal mails and much less frequently in spam mails are also selected [10]. With spam mails this approach showed better results than the previous ones, but, surprisingly, a decline in success was observed with normal mails (Figure 2). With some improvements in this model (the probabilities of words occurring in the mails were weighted empirically by 3/2), an increase in success was observed, and this model showed the best performance among all of the models.

In Experiment 3, the Bayes Filter seems more successful than the ANN, especially with the binary approach. The other two approaches (the probabilistic and advance probabilistic models) achieve nearly the same success rates, which are a bit lower than those of the binary approach on normal mails. With spam mails, all these Bayes algorithms achieve similarly good results. The SLP model without hidden layers and without all words included is the fastest model, taking 1.5 and 4 seconds respectively. However, as described above, the success rates of SLP are far below those of SLP with all words. Note that these experiments were run with over 500 e-mails; for a user-specific application, the number of mails may be smaller. Also note that, even for the SLP algorithm with all words as candidates, the time is below 1 minute, which is reasonable even for a user application.

VI. DISCUSSION AND FUTURE WORK

The discussion and future work arising from these results are the following:

(a) For English e-mails, 10-15 was found to be the best feature vector size [3]. For Turkish e-mails, 50-60 inputs seem to be the optimal feature vector size in this study. When this number is decreased or increased, the success rates fall.
(b) The Bayes Filter seems to be more successful than the ANN, especially with the binary approach.
(c) The time complexities of the current experiments are not very high. The morphology module takes longer than the learning module.
(d) As seen in the results, the words in the normal mails are as important as the words in the spam mails in classifying the e-mails.

In summary, the morphological processing and the learning methods seem to be successful in classifying e-mails for agglutinative languages like Turkish. We are continuing the experiments on different databases and believe that our system will give even more successful results.

REFERENCES

1. C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
2. T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.
3. I. Androutsopoulos and J. Koutsias, An Evaluation of Naive Bayesian Networks, Proceedings of the Machine Learning in the New Information Age, 2000.
4. C. Apte, F. Damerau and S.M. Weiss, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, 12(3):233-251, 1994.
5. T. Güngör, Computer Processing of Turkish: Morphological and Lexical Investigation, PhD Thesis, Boğaziçi University, 1995.
6. Mail addresses: turkcespam@yahoo.com and junk@cmpe.boun.edu.tr.
7. D. Lewis, Feature Selection and Feature Extraction for Text Categorization, Proceedings of the DARPA Workshop on Speech and Natural Language, pp. 212-217, Harriman, New York, 1992.
8. I. Dagan, Y. Karov and D. Roth, Mistake-Driven Learning in Text Categorization, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 55-63, Providence, Rhode Island, 1997.
9. W. Cohen, Learning Rules That Classify E-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
10. http://www.paulgraham.com/spam.html
11. J. Gama, A Linear-Bayes Classifier, IBERAMIA-SBIA 2000, LNAI 1952, pp. 269-279, 2000.