Spam Filtering Techniques


   UMAR M. A. ALHARAKY, B.SC.

    RAKAN RAZOUK, PH.D.
Introduction

 What is spam?


    Spam also called as Unsolicited Commercial Email (UCE)

    Involves sending messages by email to numerous recipients at
     the same time (Mass Emailing).

    Grew exponentially since 1990 but has leveled off recently and
     is no longer growing exponentially

    80% of all spam is sent by less than 200 spammers
Introduction

 Purpose of Spam

    Advertisements

    Pyramid schemes (Multi-Level Marketing)

    Giveaways

    Chain letters

    Political email

    Stock market advice
Introduction

 Spam as a problem
      Consumes   computing resources and time
      Reduces the effectiveness of legitimate advertising
      Cost Shifting
      Fraud
      Identity Theft
      Consumer Perception
      Global Implications


 John Borgan [ReplyNet] – “Junk email is not just annoying anymore.
  It’s eating into productivity. It’s eating into time”
Some Statistics

       Cost of Spam 2009:
       $130 billion worldwide
       $42 billion in US alone
       30% increase from 2007 estimates
       100% increase in 2007 from 2005

 Main components of cost:
   Productivity loss from inspecting and deleting spam missed by spam
    control products (False Negatives)
   Productivity loss from searching for legitimate mails deleted in error
    by spam control products (False Positives)
       Operations and helpdesk costs (Filters and Firewalls – installment
        and maintenance)
Some Statistics
Some Statistics

   Spam Averages about 88% of all emails sent
       Barracuda Networks:
Introduction

 Federal Regulations (CAN-SPAM act of 2003)

    Controlling the Assault of Non-Solicited Pornography And
     Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003



    Commercial email must adhere to three types of compliance
        Unsubscribe Compliance

      Content Compliance

      Sender   Behavior Compliance
Introduction

 Email Address Harvesting - Process of obtaining
 email addresses through various methods

    Purchasing /Trading lists with other spammers

    Bots

    Directory harvest attack

    Free Product or Services requiring valid email address

    News bulletins /Forums
Spam Life Cycle
Types of Spam Filters


   Header Filters
       Look at email headers to judge if forged or not
       Contain more information in addition to recipient , sender and subject
        fields




   Language Filters
       filters based on email body language
       Can be used to filter out spam written in foreign languages
Types of Spam Filters


   Content Filters
     Scan the text content of emails
     Use fuzzy logic


   Permission Filters
       Based on Challenge /Response system

   White list/blacklist Filters
     Will only accept emails from list of “good email addresses”
     Will block emails from “bad email addresses”
Types of Spam Filters


   Community Filters
     Work on  the principal of "communal knowledge" of spam
     These types of filters communicate with a central server.




   Bayesian Filters
     Statistical email filtering
     Uses Naïve Bayes classifier
Spam Filters Properties

   Filter must prevent spam from entering inboxes

   Able to detect the spam without blocking the ham
     Maximize    efficiency of the filter

   Do not require any modification to existing e-mail protocols

   Easily incremental
       Spam evolve continuously

       Need to adapt to each user
Data Mining and Spam Filtering

What is Data Mining?
 Predictive analysis of data for the discovery of trends and patterns
 Reveals critical insights into unstructured data.


Why Data Mining?
 The amount of data is increasing every year.
 To convert the acquired data into information and knowledge and use it in
  decision making.

How it is done?
 Utilizes computing power to apply sophisticated algorithms for knowledge
  discovery of data.

The Target…
 Spam Filtering using Data Mining
Data Mining and Spam Filtering


 Spam Filtering can be seen as a specific text
 categorization (Classification)



 History
       Jason Rennie’s iFile (1996); first know program to use Bayesian
        Classification for spam filtering
Bayesian Classification

 Particular words have particular probabilities of
 occurring in spam email and in legitimate email
  For instance, most email users will frequently encounter the word
   "Viagra" in spam email, but will seldom see it in other email

 The filter doesn't know these probabilities in
 advance, and must first be trained so it can build
 them up

 To train the filter, the user must manually indicate
 whether a new email is spam or not
Bayesian Classification

 For all words in each training email, the filter will
 adjust the probabilities that each word will appear in
 spam or legitimate email in its database
  For instance, Bayesian spam filters will typically have learned a very
   high spam probability for the words "Viagra" and "refinance", but a
   very low spam probability for words seen only in legitimate email,
   such as the names of friends and family members

 After training, the word probabilities are used to
 compute the probability that an email with a
 particular set of words in it belongs to either
 category
Bayesian Classification

 Each word in the email contributes to the email's
 spam probability, or only the most interesting words

 This contribution is computed using Bayes'
 theorem

 Then, the email's spam probability is computed
 over all words in the email, and if the total exceeds
 a certain threshold (say 95%), the filter will mark
 the email as a spam.
Bayesian Classification

 The initial training can usually be refined when
  wrong judgments from the software are identified
  (false positives or false negatives)
  That allows the software to dynamically adapt to the ever evolving
   nature of spam

 Some spam filters combine the results of both
  Bayesian spam filtering and other heuristics (pre-
  defined rules about the contents, looking at the message's
  envelope, etc.)
  Resulting   in even higher filtering accuracy
Computing the Probability


1. Computing the probability that a message
  containing a given word is spam:

 Let's suppose the suspected message contains the
  word "replica“
     Most people who are used to receiving e-mail know that this
      message is likely to be spam

 The formula used by the software for computing
Computing the Probability




where:
           is the probability that a message is a spam, knowing that the
    word "replica" is in it;
       is the overall probability that any given message is spam;
           is the probability that the word "replica" appears in spam
    messages;
        is the overall probability that any given message is not spam (is
    "ham");
            is the probability that the word "replica" appears in ham
    messages.
Computing the Probability


2. Spam or ham:
 Recent statistics show that current probability of
 any message to be spam is 80%, at the very least:

 However, most spam detection software are “not
 biased”, meaning that they have no prejudice
 regarding the incoming email


     It is advisable that the datasets of spam and ham are of same size
Computing the Probability


2. Spam or ham:



 This quantity is called "spamicity" (or "spaminess")
 of the word "replica“
     Pr(W|S) is approximated to the frequency of messages containing
      "replica" in the messages identified as spam during the learning
      phase
     Pr(W|H) is approximated to the frequency of messages containing
      "replica" in the messages identified as ham during the learning
      phase
Computing the Probability


3. Combining individual probabilities:
 The Bayesian spam filtering software makes the "naïve"
  assumption that the words present in the message
  are independent events

 With that assumption,



where:
 p is the probability that the suspect message is spam
 pi is the probability p(S|Wi)
Pros & Cons

Advantages
 It can be trained on a per-user basis
    The spam that a user receives is often related to the
     online user's activities
Disadvantages
 Bayesian spam filtering is susceptible
  to Bayesian poisoning
    Insertion of random innocuous words that are not
     normally associated with spam
    Replace text with pictures
    Google, by its Gmail email system, performing an
     OCR to every mid to large size image, analyzing the
     text inside
Pros & Cons
Conclusions


 Spam emails not only consume computing resources,
 but can also be frustrating

 Numerous detection techniques exist, but none is a
 “good for all scenarios” technique

 Data Mining approaches for content based spam
 filtering seem promising
Thank You / Questions

Spam Filtering

  • 1.
    Spam Filtering Techniques UMAR M. A. ALHARAKY, B.SC. RAKAN RAZOUK, PH.D.
  • 2.
    Introduction  What isspam?  Spam also called as Unsolicited Commercial Email (UCE)  Involves sending messages by email to numerous recipients at the same time (Mass Emailing).  Grew exponentially since 1990 but has leveled off recently and is no longer growing exponentially  80% of all spam is sent by less than 200 spammers
  • 3.
    Introduction  Purpose ofSpam  Advertisements  Pyramid schemes (Multi-Level Marketing)  Giveaways  Chain letters  Political email  Stock market advice
  • 4.
    Introduction  Spam asa problem  Consumes computing resources and time  Reduces the effectiveness of legitimate advertising  Cost Shifting  Fraud  Identity Theft  Consumer Perception  Global Implications  John Borgan [ReplyNet] – “Junk email is not just annoying anymore. It’s eating into productivity. It’s eating into time”
  • 5.
    Some Statistics  Cost of Spam 2009:  $130 billion worldwide  $42 billion in US alone  30% increase from 2007 estimates  100% increase in 2007 from 2005  Main components of cost:  Productivity loss from inspecting and deleting spam missed by spam control products (False Negatives)  Productivity loss from searching for legitimate mails deleted in error by spam control products (False Positives)  Operations and helpdesk costs (Filters and Firewalls – installment and maintenance)
  • 6.
  • 7.
    Some Statistics  Spam Averages about 88% of all emails sent  Barracuda Networks:
  • 8.
    Introduction  Federal Regulations(CAN-SPAM act of 2003)  Controlling the Assault of Non-Solicited Pornography And Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003  Commercial email must adhere to three types of compliance  Unsubscribe Compliance  Content Compliance  Sender Behavior Compliance
  • 9.
    Introduction  Email AddressHarvesting - Process of obtaining email addresses through various methods  Purchasing /Trading lists with other spammers  Bots  Directory harvest attack  Free Product or Services requiring valid email address  News bulletins /Forums
  • 10.
  • 11.
    Types of SpamFilters  Header Filters  Look at email headers to judge if forged or not  Contain more information in addition to recipient , sender and subject fields  Language Filters  filters based on email body language  Can be used to filter out spam written in foreign languages
  • 12.
    Types of SpamFilters  Content Filters  Scan the text content of emails  Use fuzzy logic  Permission Filters  Based on Challenge /Response system  White list/blacklist Filters  Will only accept emails from list of “good email addresses”  Will block emails from “bad email addresses”
  • 13.
    Types of SpamFilters  Community Filters  Work on the principal of "communal knowledge" of spam  These types of filters communicate with a central server.  Bayesian Filters  Statistical email filtering  Uses Naïve Bayes classifier
  • 14.
    Spam Filters Properties  Filter must prevent spam from entering inboxes  Able to detect the spam without blocking the ham  Maximize efficiency of the filter  Do not require any modification to existing e-mail protocols  Easily incremental  Spam evolve continuously  Need to adapt to each user
  • 15.
    Data Mining andSpam Filtering What is Data Mining?  Predictive analysis of data for the discovery of trends and patterns  Reveals critical insights into unstructured data. Why Data Mining?  The amount of data is increasing every year.  To convert the acquired data into information and knowledge and use it in decision making. How it is done?  Utilizes computing power to apply sophisticated algorithms for knowledge discovery of data. The Target…  Spam Filtering using Data Mining
  • 16.
    Data Mining andSpam Filtering  Spam Filtering can be seen as a specific text categorization (Classification)  History  Jason Rennie’s iFile (1996); first know program to use Bayesian Classification for spam filtering
  • 17.
    Bayesian Classification  Particularwords have particular probabilities of occurring in spam email and in legitimate email  For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email  The filter doesn't know these probabilities in advance, and must first be trained so it can build them up  To train the filter, the user must manually indicate whether a new email is spam or not
  • 18.
    Bayesian Classification  Forall words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database  For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members  After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category
  • 19.
    Bayesian Classification  Eachword in the email contributes to the email's spam probability, or only the most interesting words  This contribution is computed using Bayes' theorem  Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.
  • 20.
    Bayesian Classification  Theinitial training can usually be refined when wrong judgments from the software are identified (false positives or false negatives)  That allows the software to dynamically adapt to the ever evolving nature of spam  Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre- defined rules about the contents, looking at the message's envelope, etc.)  Resulting in even higher filtering accuracy
  • 21.
    Computing the Probability 1.Computing the probability that a message containing a given word is spam:  Let's suppose the suspected message contains the word "replica“  Most people who are used to receiving e-mail know that this message is likely to be spam  The formula used by the software for computing
  • 22.
    Computing the Probability where:  is the probability that a message is a spam, knowing that the word "replica" is in it;  is the overall probability that any given message is spam;  is the probability that the word "replica" appears in spam messages;  is the overall probability that any given message is not spam (is "ham");  is the probability that the word "replica" appears in ham messages.
  • 23.
    Computing the Probability 2.Spam or ham:  Recent statistics show that current probability of any message to be spam is 80%, at the very least:  However, most spam detection software are “not biased”, meaning that they have no prejudice regarding the incoming email  It is advisable that the datasets of spam and ham are of same size
  • 24.
    Computing the Probability 2.Spam or ham:  This quantity is called "spamicity" (or "spaminess") of the word "replica“  Pr(W|S) is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase  Pr(W|H) is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase
  • 25.
    Computing the Probability 3.Combining individual probabilities:  The Bayesian spam filtering software makes the "naïve" assumption that the words present in the message are independent events  With that assumption, where:  p is the probability that the suspect message is spam  pi is the probability p(S|Wi)
  • 26.
    Pros & Cons Advantages It can be trained on a per-user basis  The spam that a user receives is often related to the online user's activities Disadvantages  Bayesian spam filtering is susceptible to Bayesian poisoning  Insertion of random innocuous words that are not normally associated with spam  Replace text with pictures  Google, by its Gmail email system, performing an OCR to every mid to large size image, analyzing the text inside
  • 27.
  • 28.
    Conclusions  Spam emailsnot only consume computing resources, but can also be frustrating  Numerous detection techniques exist, but none is a “good for all scenarios” technique  Data Mining approaches for content based spam filtering seem promising
  • 29.
    Thank You /Questions