Spam Filtering


Published on

Spam Filtering Techniques

Published in: Technology
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Spam Filtering

  1. 1. Spam Filtering Techniques UMAR M. A. ALHARAKY, B.SC. RAKAN RAZOUK, PH.D.
  2. 2. Introduction What is spam?  Spam also called as Unsolicited Commercial Email (UCE)  Involves sending messages by email to numerous recipients at the same time (Mass Emailing).  Grew exponentially since 1990 but has leveled off recently and is no longer growing exponentially  80% of all spam is sent by less than 200 spammers
  3. 3. Introduction Purpose of Spam  Advertisements  Pyramid schemes (Multi-Level Marketing)  Giveaways  Chain letters  Political email  Stock market advice
  4. 4. Introduction Spam as a problem  Consumes computing resources and time  Reduces the effectiveness of legitimate advertising  Cost Shifting  Fraud  Identity Theft  Consumer Perception  Global Implications John Borgan [ReplyNet] – “Junk email is not just annoying anymore. It’s eating into productivity. It’s eating into time”
  5. 5. Some Statistics Cost of Spam 2009:  $130 billion worldwide  $42 billion in US alone  30% increase from 2007 estimates  100% increase in 2007 from 2005 Main components of cost:  Productivity loss from inspecting and deleting spam missed by spam control products (False Negatives)  Productivity loss from searching for legitimate mails deleted in error by spam control products (False Positives)  Operations and helpdesk costs (Filters and Firewalls – installment and maintenance)
  6. 6. Some Statistics
  7. 7. Some Statistics Spam Averages about 88% of all emails sent  Barracuda Networks:
  8. 8. Introduction Federal Regulations (CAN-SPAM act of 2003)  Controlling the Assault of Non-Solicited Pornography And Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003  Commercial email must adhere to three types of compliance  Unsubscribe Compliance  Content Compliance  Sender Behavior Compliance
  9. 9. Introduction Email Address Harvesting - Process of obtaining email addresses through various methods  Purchasing /Trading lists with other spammers  Bots  Directory harvest attack  Free Product or Services requiring valid email address  News bulletins /Forums
  10. 10. Spam Life Cycle
  11. 11. Types of Spam Filters Header Filters  Look at email headers to judge if forged or not  Contain more information in addition to recipient , sender and subject fields Language Filters  filters based on email body language  Can be used to filter out spam written in foreign languages
  12. 12. Types of Spam Filters Content Filters  Scan the text content of emails  Use fuzzy logic Permission Filters  Based on Challenge /Response system White list/blacklist Filters  Will only accept emails from list of “good email addresses”  Will block emails from “bad email addresses”
  13. 13. Types of Spam Filters Community Filters  Work on the principal of "communal knowledge" of spam  These types of filters communicate with a central server. Bayesian Filters  Statistical email filtering  Uses Naïve Bayes classifier
  14. 14. Spam Filters Properties Filter must prevent spam from entering inboxes Able to detect the spam without blocking the ham  Maximize efficiency of the filter Do not require any modification to existing e-mail protocols Easily incremental  Spam evolve continuously  Need to adapt to each user
  15. 15. Data Mining and Spam FilteringWhat is Data Mining? Predictive analysis of data for the discovery of trends and patterns Reveals critical insights into unstructured data.Why Data Mining? The amount of data is increasing every year. To convert the acquired data into information and knowledge and use it in decision making.How it is done? Utilizes computing power to apply sophisticated algorithms for knowledge discovery of data.The Target… Spam Filtering using Data Mining
  16. 16. Data Mining and Spam Filtering Spam Filtering can be seen as a specific text categorization (Classification) History  Jason Rennie’s iFile (1996); first know program to use Bayesian Classification for spam filtering
  17. 17. Bayesian Classification Particular words have particular probabilities of occurring in spam email and in legitimate email  For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email The filter doesnt know these probabilities in advance, and must first be trained so it can build them up To train the filter, the user must manually indicate whether a new email is spam or not
  18. 18. Bayesian Classification For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database  For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category
  19. 19. Bayesian Classification Each word in the email contributes to the emails spam probability, or only the most interesting words This contribution is computed using Bayes theorem Then, the emails spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.
  20. 20. Bayesian Classification The initial training can usually be refined when wrong judgments from the software are identified (false positives or false negatives)  That allows the software to dynamically adapt to the ever evolving nature of spam Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre- defined rules about the contents, looking at the messages envelope, etc.)  Resulting in even higher filtering accuracy
  21. 21. Computing the Probability1. Computing the probability that a message containing a given word is spam: Lets suppose the suspected message contains the word "replica“  Most people who are used to receiving e-mail know that this message is likely to be spam The formula used by the software for computing
  22. 22. Computing the Probabilitywhere: is the probability that a message is a spam, knowing that the word "replica" is in it; is the overall probability that any given message is spam; is the probability that the word "replica" appears in spam messages; is the overall probability that any given message is not spam (is "ham"); is the probability that the word "replica" appears in ham messages.
  23. 23. Computing the Probability2. Spam or ham: Recent statistics show that current probability of any message to be spam is 80%, at the very least: However, most spam detection software are “not biased”, meaning that they have no prejudice regarding the incoming email  It is advisable that the datasets of spam and ham are of same size
  24. 24. Computing the Probability2. Spam or ham: This quantity is called "spamicity" (or "spaminess") of the word "replica“  Pr(W|S) is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase  Pr(W|H) is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase
  25. 25. Computing the Probability3. Combining individual probabilities: The Bayesian spam filtering software makes the "naïve" assumption that the words present in the message are independent events With that assumption,where: p is the probability that the suspect message is spam pi is the probability p(S|Wi)
  26. 26. Pros & ConsAdvantages It can be trained on a per-user basis  The spam that a user receives is often related to the online users activitiesDisadvantages Bayesian spam filtering is susceptible to Bayesian poisoning  Insertion of random innocuous words that are not normally associated with spam  Replace text with pictures  Google, by its Gmail email system, performing an OCR to every mid to large size image, analyzing the text inside
  27. 27. Pros & Cons
  28. 28. Conclusions Spam emails not only consume computing resources, but can also be frustrating Numerous detection techniques exist, but none is a “good for all scenarios” technique Data Mining approaches for content based spam filtering seem promising
  29. 29. Thank You / Questions