SPAM Filters

Implementation of a SPAM filter using a Naïve Bayesian classifier.

Transcript of "SPAM Filters"

  1. SPAM FILTERS
     ANATOMY OF MULTIMEDIA - TEXT CLASSIFICATION SYSTEMS
     ASWIN SIVA N (J09023)
  2. Document Classification
     • Classifying a document based on its TEXT contents is a very practical application of Machine Intelligence.
     • Examples: automatic SPAM detection, email labelling, search ranking, etc.
  3. Early Classifiers
     • Rule-based systems: presence of capitalised text, presence of pharmaceutical words, etc.
     • Disadvantages: SPAMMERS can work around the rules, and a non-SPAM message can end up being classified as SPAM!
  4. Features and Training
     • Features are words. Pairs of words or sentences can also be used as features in more complex systems.
     • For training, a pre-labelled set of documents and their categories is fed into the trainer.
     • The system stores each word with its occurrence count in each category.
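The training step above is simple enough to sketch in a few lines. The following is a minimal Python illustration, not the code behind the deck: it assumes features are single lower-cased words and that a document is a plain string, and the names tokenize, train, word_counts and category_counts are made up for the example.

```python
from collections import defaultdict

def tokenize(document):
    # Features are words: split on whitespace and lower-case them.
    return document.lower().split()

def train(labelled_documents):
    # labelled_documents: iterable of (text, category) pairs.
    # word_counts[category][word] -> occurrence count of the word in that category
    # category_counts[category]   -> number of training documents in that category
    word_counts = defaultdict(lambda: defaultdict(int))
    category_counts = defaultdict(int)
    for text, category in labelled_documents:
        category_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
    return word_counts, category_counts
```

The nested dictionary keeps exactly the structure the slide describes: each word with its occurrence count in each category.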
  5. 5. Prediction Done through probabilities. Once the training is completed, various probability measures canbe calculated : Ex : P (word | category) , P (category) etc…We will look at a Commonly used prediction method. A Naïve Bayesian Classifier
  6. 6. A Naïve Bayesian Classifier Assumption : The occurrence of words in documents is completelyindependent of each other. Not always valid !! Ex : It is more likely to find {‘online’, ‘money’, ‘casino’} rather than{‘money’, ‘attached’} But accuracy is considerable. P (Document | category) = Π (P (words | category)) P(category | Document) = P (Document | category) * P (category)
  7. 7. A Naïve Bayesian Classifier The category having highest probability for that document isreported.Some Heuristics : Assign initial probabilities to avoid under-fitting due to lack oftraining documents. Set threshold values for reporting and report unknown in case ofsensitive applications.
  8. Training Sample
  9. Prediction
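The "Training Sample" and "Prediction" slides were images in the original deck, so their contents are not reproduced here. As a stand-in, here is a tiny made-up run of the sketches above, showing training followed by prediction:

```python
# A tiny, hypothetical labelled training sample (not taken from the slides).
training_sample = [
    ("win money at the online casino now", "spam"),
    ("cheap pharmaceutical offers online", "spam"),
    ("meeting notes attached for tomorrow", "ham"),
    ("please review the attached report", "ham"),
]

word_counts, category_counts = train(training_sample)

# Prediction on unseen messages.
print(classify("online casino money", word_counts, category_counts))
# -> 'spam' on this toy data
print(classify("please review the attached meeting notes", word_counts, category_counts))
# -> 'ham' on this toy data
```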
  10. Alternate Methods, Advantages and Disadvantages
      • Neural Networks and SVMs can also be used for document classification.
      • Advantage: this method is computationally less expensive compared to Neural Networks.
      • Disadvantage: such a simple classifier cannot capture semantics. Ex: {'online', 'casino', 'bet'} may be spam while {'horse', 'bet'} need not be; it is difficult to differentiate the semantics with a naïve Bayesian classifier.
  11. Thank you!
