Spam Filtering Techniques UMAR M. A. ALHARAKY, B.SC. RAKAN RAZOUK, PH.D.
Introduction What is spam? Spam also called as Unsolicited Commercial Email (UCE) Involves sending messages by email to numerous recipients at the same time (Mass Emailing). Grew exponentially since 1990 but has leveled off recently and is no longer growing exponentially 80% of all spam is sent by less than 200 spammers
Introduction Spam as a problem Consumes computing resources and time Reduces the effectiveness of legitimate advertising Cost Shifting Fraud Identity Theft Consumer Perception Global Implications John Borgan [ReplyNet] – “Junk email is not just annoying anymore. It’s eating into productivity. It’s eating into time”
Some Statistics Cost of Spam 2009: $130 billion worldwide $42 billion in US alone 30% increase from 2007 estimates 100% increase in 2007 from 2005 Main components of cost: Productivity loss from inspecting and deleting spam missed by spam control products (False Negatives) Productivity loss from searching for legitimate mails deleted in error by spam control products (False Positives) Operations and helpdesk costs (Filters and Firewalls – installment and maintenance)
Some Statistics Spam Averages about 88% of all emails sent Barracuda Networks:
Introduction Federal Regulations (CAN-SPAM act of 2003) Controlling the Assault of Non-Solicited Pornography And Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003 Commercial email must adhere to three types of compliance Unsubscribe Compliance Content Compliance Sender Behavior Compliance
Introduction Email Address Harvesting - Process of obtaining email addresses through various methods Purchasing /Trading lists with other spammers Bots Directory harvest attack Free Product or Services requiring valid email address News bulletins /Forums
Types of Spam Filters Header Filters Look at email headers to judge if forged or not Contain more information in addition to recipient , sender and subject fields Language Filters filters based on email body language Can be used to filter out spam written in foreign languages
Types of Spam Filters Content Filters Scan the text content of emails Use fuzzy logic Permission Filters Based on Challenge /Response system White list/blacklist Filters Will only accept emails from list of “good email addresses” Will block emails from “bad email addresses”
Types of Spam Filters Community Filters Work on the principal of "communal knowledge" of spam These types of filters communicate with a central server. Bayesian Filters Statistical email filtering Uses Naïve Bayes classifier
Spam Filters Properties Filter must prevent spam from entering inboxes Able to detect the spam without blocking the ham Maximize efficiency of the filter Do not require any modification to existing e-mail protocols Easily incremental Spam evolve continuously Need to adapt to each user
Data Mining and Spam FilteringWhat is Data Mining? Predictive analysis of data for the discovery of trends and patterns Reveals critical insights into unstructured data.Why Data Mining? The amount of data is increasing every year. To convert the acquired data into information and knowledge and use it in decision making.How it is done? Utilizes computing power to apply sophisticated algorithms for knowledge discovery of data.The Target… Spam Filtering using Data Mining
Data Mining and Spam Filtering Spam Filtering can be seen as a specific text categorization (Classification) History Jason Rennie’s iFile (1996); first know program to use Bayesian Classification for spam filtering
Bayesian Classification Particular words have particular probabilities of occurring in spam email and in legitimate email For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email The filter doesnt know these probabilities in advance, and must first be trained so it can build them up To train the filter, the user must manually indicate whether a new email is spam or not
Bayesian Classification For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category
Bayesian Classification Each word in the email contributes to the emails spam probability, or only the most interesting words This contribution is computed using Bayes theorem Then, the emails spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.
Bayesian Classification The initial training can usually be refined when wrong judgments from the software are identified (false positives or false negatives) That allows the software to dynamically adapt to the ever evolving nature of spam Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre- defined rules about the contents, looking at the messages envelope, etc.) Resulting in even higher filtering accuracy
Computing the Probability1. Computing the probability that a message containing a given word is spam: Lets suppose the suspected message contains the word "replica“ Most people who are used to receiving e-mail know that this message is likely to be spam The formula used by the software for computing
Computing the Probabilitywhere: is the probability that a message is a spam, knowing that the word "replica" is in it; is the overall probability that any given message is spam; is the probability that the word "replica" appears in spam messages; is the overall probability that any given message is not spam (is "ham"); is the probability that the word "replica" appears in ham messages.
Computing the Probability2. Spam or ham: Recent statistics show that current probability of any message to be spam is 80%, at the very least: However, most spam detection software are “not biased”, meaning that they have no prejudice regarding the incoming email It is advisable that the datasets of spam and ham are of same size
Computing the Probability2. Spam or ham: This quantity is called "spamicity" (or "spaminess") of the word "replica“ Pr(W|S) is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase Pr(W|H) is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase
Computing the Probability3. Combining individual probabilities: The Bayesian spam filtering software makes the "naïve" assumption that the words present in the message are independent events With that assumption,where: p is the probability that the suspect message is spam pi is the probability p(S|Wi)
Pros & ConsAdvantages It can be trained on a per-user basis The spam that a user receives is often related to the online users activitiesDisadvantages Bayesian spam filtering is susceptible to Bayesian poisoning Insertion of random innocuous words that are not normally associated with spam Replace text with pictures Google, by its Gmail email system, performing an OCR to every mid to large size image, analyzing the text inside
Conclusions Spam emails not only consume computing resources, but can also be frustrating Numerous detection techniques exist, but none is a “good for all scenarios” technique Data Mining approaches for content based spam filtering seem promising