2. Problem Definition
The term spam generally refers to unsolicited electronic
communications (typically email) or, in some cases,
unsolicited commercial bulk communications. Some
refer to this kind of email simply as junk email.
Beyond the annoyance and the time wasted sifting
through unwanted messages, spam can cause
significant harm by infecting users’ computers with
malicious software capable of damaging systems and
stealing personal information. It can also consume
network resources.
3. Introduction
Spam message classification is a step towards
building a tool for identifying scam messages
and detecting scams early.
A spam filter is a piece of software that processes incoming
emails to prevent spam from reaching a user's inbox.
4. Review of Literature
Book: Spamming the Spammers
Author: Peter Dabbene
Description: Dieter P. Bieny resumes his campaign against e-mail
spammers, seeking justice and entertainment value at
every turn. Can he still convince the scammers to invest
their time and effort in an ultimately fruitless endeavor, or
have they caught on to his game?
Book: Spam: A Shadow History of the Internet
Author: Finn Brunton
Description: The vast majority of all email sent every day is spam, a
variety of idiosyncratically spelled requests to provide
account information, invitations to spend money on
dubious products, and pleas to send cash overseas.
Book: Spam Nation: The Inside Story of Organized Cybercrime
Author: Brian Krebs
Description: There is a Threat Lurking Online with the Power to Destroy
Your Finances, Steal Your Personal Data, and Endanger
Your Life.
5. Proposed Solution
What Should You Expect from Your Spam Filter?
Threat detection
Modern filters often include some form of integrated threat
detection.
Such a filter uses AI and machine learning to analyze
trillions of data points in order to understand
how attackers shift their approach and
what should raise a red flag.
It scans message content and attributes,
domains and addresses associated with malicious intent,
and other anomalies to decide what
to filter and what to allow.
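The scanning of content, domains, and addresses described above can be illustrated with a very small rule-based sketch. It is not the AI-driven detection the slide refers to; it only shows the kind of signals such a filter might look at. The blocklist, the phrase list, and the function names are hypothetical and chosen purely for illustration.

```python
# Hypothetical blocklist and phrase list, for illustration only.
BLOCKED_DOMAINS = {"malicious.example"}
SUSPICIOUS_PHRASES = ("verify your account", "wire transfer")


def sender_domain(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def red_flags(sender, body):
    """Collect simple warning signs found in a message.

    Checks the sender's domain against the blocklist and looks for
    suspicious phrases in the body (case-insensitive).
    """
    flags = []
    if sender_domain(sender) in BLOCKED_DOMAINS:
        flags.append("blocked_domain")
    text = body.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in text:
            flags.append("phrase:" + phrase)
    return flags
```

A message with no flags would be allowed through, while flagged messages could be filtered or passed to a learned classifier for a final decision.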
7. Step 1: E-mail Data Collection
The dataset contained in a corpus plays a crucial role in assessing the
performance of any spam filter. Many open-source datasets are freely
available in the public domain. Two of them are
widely used because they contain a large number of emails.
8. Step 2: Pre-processing of E-mail content
At this step, we mainly perform tokenization of emails. Tokenization is
a process in which we break the content of an email into words,
transforming long messages into a sequence of representative symbols
termed tokens. These tokens are extracted from the email body,
header, subject, and image.
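As a minimal sketch of this step, the following code tokenizes the subject and body of a plain-text email using Python's standard `email` module. The function names and the regular expression are assumptions made for illustration; extracting tokens from images or from other header fields is not covered here.

```python
import re
from email import message_from_string


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def email_tokens(raw_email):
    """Return the tokens from the subject and body of a raw email string.

    Only single-part (non-multipart) messages are handled in this sketch.
    """
    msg = message_from_string(raw_email)
    subject = msg.get("Subject", "")
    body = "" if msg.is_multipart() else msg.get_payload()
    return tokenize(subject) + tokenize(body)
```

For example, an email with the subject "Win CASH now!" and the body "Click here to claim your prize." would yield the tokens win, cash, now, click, here, to, claim, your, prize.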
9. Step 3: Feature Extraction and Selection
After pre-processing, we are left with a large number of words. We
can maintain a database in which each column holds the frequency
of a different word. These attributes can
be categorized in different ways, for example:
Important attributes: frequency of repeated words, number of
semantic discrepancies, an adult-content bag of words, etc.
Additional attributes: sender account features such as sender
country, IP address, email, age of sender, number of replies,
number of recipients, and website address.
Less important attributes: geographical distance between
sender and receiver, sender’s date of birth, account lifespan, sex
of sender, and age of the recipient.
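The word-frequency database described above can be sketched as a simple matrix with one row per email and one column per word. This is only an illustration under assumed names (`build_vocabulary`, `frequency_matrix`); it covers word counts only, not the sender or account attributes listed above.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Return the sorted list of distinct words across all emails."""
    return sorted({tok for tokens in token_lists for tok in tokens})


def frequency_matrix(token_lists, vocab):
    """Build one row per email; each column is the count of one vocabulary word."""
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return rows
```

With the two tokenized emails `["free", "cash", "free"]` and `["meeting", "at", "noon"]`, the vocabulary has five columns (at, cash, free, meeting, noon) and the first row records that "free" occurs twice.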
10. Step 4: Implementation
Like the Nearest Neighbour algorithm, the K-Nearest
Neighbour (K-NN) algorithm classifies a new instance by comparing
it with stored, labelled instances. Instead of relying on just the
single nearest instance, it looks at the K instances closest to the
new incoming instance and assigns the class that occurs most
frequently among them. The value of K is a hyperparameter
that needs tuning. A common way to tune it is
trial and error: we try several values of K and then
check the model's performance for each.
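A minimal sketch of K-NN classification and of the trial-and-error search for K is given below. It assumes emails are already represented as numeric feature vectors (for instance word counts) and uses Euclidean distance, which is one common choice rather than something the slides specify. The function names and the idea of a separate validation set are likewise assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training instances."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_X, train_y, val_X, val_y, k):
    """Fraction of validation instances classified correctly for a given k."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y)
    )
    return correct / len(val_y)


def best_k(train_X, train_y, val_X, val_y, candidates):
    """Try each candidate k and return the one with the highest accuracy."""
    return max(
        candidates,
        key=lambda k: accuracy_for_k(train_X, train_y, val_X, val_y, k),
    )
```

With two classes, choosing odd values of K avoids tied votes. When several candidate values reach the same accuracy, `best_k` returns the first of them.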
11. Step 5: Performance Analysis
Now that our algorithm is ready, we must check the performance
of the model. Even a single missed important message may
cause a user to reconsider the value of spam filtering, so
the algorithm should be as close to 100%
accurate as possible. However, some researchers feel that
accuracy alone is not a sufficient evaluation parameter for spam
classification.
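To illustrate why accuracy alone may not be enough, the sketch below also computes precision, recall, and the false positive rate from a confusion matrix. The slides do not name these metrics, so their selection here is an assumption. The false positive rate is included because it measures how often legitimate mail is wrongly filtered, which is the kind of missed important message mentioned above.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as spam."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # legitimate message wrongly filtered
        else:
            if truth == positive:
                fn += 1  # spam that reached the inbox
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

A filter evaluated on a dataset that is mostly legitimate mail can score high accuracy while still letting most spam through, which is why the per-class figures are worth reporting alongside it.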