2. Abstract
The naïve Bayes algorithm is a machine learning algorithm for classification problems. It is
primarily used for text classification, which involves high-dimensional data sets. Examples
include spam filtering, sentiment analysis, and classifying news articles.
Assigning a document to a particular category remains challenging because of the
large number of features in such data sets.
Naïve Bayes is very popular in commercial and open-source anti-spam e-mail filters. It
has been studied extensively since the 1950s.
Naïve Bayes has the potential to serve well as a document classification model.
We discuss the mathematical basis of the naïve Bayes classifier for spam
filtering.
3. Introduction
Several machine learning algorithms have been
employed in anti-spam e-mail filtering, including
algorithms that are considered top performers in
text classification, such as boosting, support
vector machines, decision trees, neural networks,
and logistic regression. Nevertheless, Naive Bayes
(NB) classifiers currently appear to be
particularly popular in commercial and open-source
spam filters. This is probably due to their
simplicity, which makes them easy to implement,
their linear computational complexity, and their
accuracy, which in spam filtering is comparable to
that of more elaborate learning algorithms.
4. The naïve Bayes algorithm is called 'naïve' because it makes the assumption that
the occurrence of a given feature is independent of the occurrence of other
features, given the class.
Naive Bayes classifiers are a popular statistical technique for e-mail filtering.
They typically use bag-of-words features to identify spam e-mail, an approach
commonly used in text classification.
Naive Bayes classifiers work by correlating the use of tokens (typically words,
or sometimes other things), with spam and non-spam e-mails and then using
Bayes' theorem to calculate a probability that an email is or is not spam.
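The token-counting idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production filter: the function names (`tokenize`, `train`, `spam_probability`), the regular-expression tokenizer, the add-one smoothing, and the four toy messages are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count tokens per class from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary


def spam_probability(text, model):
    """Return P(spam | tokens), treating tokens as independent given the class.

    Assumes the training data contained at least one message of each class.
    """
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Log prior: fraction of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore tokens never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Turn the two log scores into a normalised probability.
    top = max(log_scores.values())
    spam = math.exp(log_scores["spam"] - top)
    ham = math.exp(log_scores["ham"] - top)
    return spam / (spam + ham)


# Made-up toy messages, for illustration only.
TRAINING = [
    ("cheap replica watches buy now", "spam"),
    ("buy cheap pills now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
MODEL = train(TRAINING)
```

Calling `spam_probability("buy replica watches", MODEL)` gives a value above 0.5 on this toy data, while `spam_probability("team meeting monday", MODEL)` gives a value below 0.5. Working with sums of logarithms rather than products of probabilities is a common design choice, because multiplying many small numbers can underflow.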
Naive Bayes spam filtering is a baseline technique for dealing with spam that
can tailor itself to the e-mail needs of individual users and give low false
positive spam detection rates that are generally acceptable to users. It is one
of the oldest ways of doing spam filtering, with roots in the 1990s.
5. Literature survey
Bayesian algorithms were used to sort and filter email by 1996. Although naive
Bayesian filters did not become popular until later, multiple programs were
released in 1998 to address the growing problem of unwanted email. The first
scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998. That
work was soon deployed in commercial spam filters. In 2002, Paul Graham greatly
decreased the false positive rate, so that a Bayesian filter could be used on
its own as a single spam filter.
Variants of the basic technique have been implemented in a number of research
works and commercial software products. Many modern mail clients implement
Bayesian spam filtering. Users can also install separate email filtering programs.
Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter,
and ASSP, make use of Bayesian spam filtering techniques, and the functionality is
sometimes embedded within mail server software itself. CRM114, often cited as a
Bayesian filter, is not intended to use a Bayes filter in production, but includes the
"unigram" feature for reference.
6. Bayes theorem
Bayes: the name refers to the statistician and
philosopher Thomas Bayes and to the theorem named
after him, Bayes' theorem, which is the basis of
the naïve Bayes algorithm.
Bayes' theorem:
Bayes' theorem states that the probability of
event B given A equals the probability of event
A given B multiplied by the probability of B,
divided by the probability of A.
Let us look at what Bayes' theorem says.
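Written as a formula, for two events A and B with P(A) > 0, Bayes' theorem reads as follows (the notation P(B | A) for "probability of B given A" is an assumption of this sketch):

```latex
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}, \qquad P(A) > 0
```

In the spam setting, B can be read as "the message is spam" and A as "the message contains a given word".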
7. Mathematical foundation
Bayesian email filters utilize Bayes' theorem. Bayes' theorem is used several times in the context of
spam:
a first time, to compute the probability that the message is spam, knowing that a given word
appears in this message;
a second time, to compute the probability that the message is spam, taking into consideration all
of its words (or a relevant subset of them);
sometimes a third time, to deal with rare words.
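The second and third uses listed above can be sketched as follows. This is one common way to do it, not necessarily the method of any particular filter: the combining rule assumes that words are independent and that spam and non-spam are equally likely beforehand, and the function names and the default values `prior=0.5` and `strength=3.0` are illustrative assumptions.

```python
import math


def combine_word_probabilities(word_probs):
    """Combine per-word spam probabilities p_i into one message-level value.

    Computes p = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    Assumes every p_i lies strictly between 0 and 1.
    """
    # Work in log space so that many small factors do not underflow.
    log_spam = sum(math.log(p) for p in word_probs)
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    diff = log_ham - log_spam
    # Numerically stable evaluation of 1 / (1 + exp(diff)).
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def smooth_rare_word(p_word, occurrences, prior=0.5, strength=3.0):
    """Pull the estimate for a rarely seen word toward a prior value.

    With few occurrences the result stays near `prior`; with many
    occurrences it approaches the observed `p_word`.
    """
    return (strength * prior + occurrences * p_word) / (strength + occurrences)
```

For example, two words with per-word spam probabilities 0.9 and 0.8 combine to 0.72 / (0.72 + 0.02), roughly 0.973, while a word seen only three times with an observed probability of 1.0 is smoothed to 0.75 under the default settings.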
8. Computing the probability that a message containing a given word is spam
Let's suppose the suspected message contains the word "replica". Most people who are used to receiving e-mail
know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known
brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute
probabilities.
The formula the software uses to compute that probability is derived from Bayes' theorem.
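A minimal sketch of such a computation is given below. It applies Bayes' theorem in the form Pr(S|W) = Pr(W|S) Pr(S) / (Pr(W|S) Pr(S) + Pr(W|H) Pr(H)), where S stands for spam, H for legitimate mail ("ham") and W for the presence of the word. The function name, the parameter names, and the default prior of 0.5 are assumptions made for this illustration.

```python
def word_spam_probability(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Return Pr(S|W): the probability that a message containing the word is spam.

    p_word_given_spam: Pr(W|S), fraction of spam messages containing the word.
    p_word_given_ham:  Pr(W|H), fraction of legitimate messages containing it.
    p_spam:            Pr(S), prior probability of spam (assumed 0.5 by default).
    """
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    # Total probability of seeing the word in any message.
    evidence = numerator + p_word_given_ham * p_ham
    if evidence == 0.0:
        raise ValueError("the word must occur in at least one class")
    return numerator / evidence
```

With hypothetical figures, if "replica" appeared in 20% of spam messages and 0.5% of legitimate ones, `word_spam_probability(0.2, 0.005)` would return about 0.976.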
9. Conclusion
Text classification with the naïve Bayes algorithm performs comparably to
other classification methods.
One of the main advantages of Bayesian spam filtering is that it can be trained on
a per-user basis.
The spam that a user receives is often related to the online user's activities.
The legitimate e-mails a user receives will tend to be different.