2. What is spam?
• Spam is the use of electronic messaging systems to send unsolicited bulk messages, especially advertising, indiscriminately.
11/6/2012
3. Types of Spam
• Email spam (the best known, and today's topic)
• Comment spam (probably why we have CAPTCHAs)
• Instant messaging spam (e.g. unknown contacts on Yahoo Messenger sending weird URLs)
• Junk fax (your machine prints hundreds of spam messages and you can't delete them; thankfully now a horror of the past)
• Unsolicited text messages (offers that make me think I am the luckiest girl alive)
• Social networking spam (sent by a friend who clicked on a similar message sent by their friend)
4. Geographical Origins of Spam
The origin or source of spam refers to the geographical location of the computer from which the spam is sent; it is not the country where the spammer resides, nor the country that hosts the spamvertised site.
Interesting Fact:
As much as 80% of spam received by Internet users in North America and Europe can be traced to fewer than 200 spammers.
6. Other Fast Facts
• Spam accounts for 14.5 billion messages globally per day. In
other words, spam makes up 45% of all emails.
• A 2004 survey estimated that lost productivity costs Internet
users in the United States $21.58 billion annually.
• Many people switched from Yahoo Mail to Gmail because of its better spam filter.
• Spam fills up your mailbox and makes users ask for more free space, another hook Gmail used to lure users.
7. Current Work: Bayesian Model
• Based on the document filtering concept
Pr(S|W) is the probability that a message is spam, knowing that the word "replica" is in it;
Pr(S) is the overall probability that any given message is spam;
Pr(W|S) is the probability that the word "replica" appears in spam messages;
Pr(H) is the overall probability that any given message is not spam (is "ham");
Pr(W|H) is the probability that the word "replica" appears in ham messages.
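The five quantities defined above are the ones that appear in Bayes' theorem, so the relation between them is presumably Pr(S|W) = Pr(W|S)·Pr(S) / (Pr(W|S)·Pr(S) + Pr(W|H)·Pr(H)). A minimal Python sketch of that computation follows; the function name, parameter names, and the example numbers are illustrative assumptions, not taken from the slides.

```python
def spam_given_word(p_word_in_spam, p_word_in_ham, p_spam=0.5):
    """Pr(S|W) via Bayes' theorem.

    p_word_in_spam -- Pr(W|S), how often the word appears in spam
    p_word_in_ham  -- Pr(W|H), how often the word appears in ham
    p_spam         -- Pr(S), prior probability that a message is spam
                      (0.5 is an assumed default, not a measured value)
    """
    p_ham = 1.0 - p_spam                      # Pr(H) = 1 - Pr(S)
    numerator = p_word_in_spam * p_spam       # Pr(W|S) * Pr(S)
    denominator = numerator + p_word_in_ham * p_ham
    if denominator == 0:
        raise ValueError("the word was never observed in spam or ham")
    return numerator / denominator


# Made-up frequencies: "replica" appears in 25% of spam and 1% of ham.
print(spam_given_word(0.25, 0.01))
```

A word that is equally common in spam and ham gives back the prior unchanged, which is a quick sanity check on the formula.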
Combining words:
p is the probability that the suspect message is spam;
p1 is the probability that it is spam, knowing that it contains a first word (for example "replica");
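The slide lists only p and p1. Assuming p2 ... pN are defined the same way for the other words, and assuming the words are independent with equal spam and ham priors, the usual way to combine them is p = p1·p2···pN / (p1·p2···pN + (1 - p1)·(1 - p2)···(1 - pN)). A sketch under those assumptions (the function name `combine` is hypothetical):

```python
from math import prod


def combine(word_probs):
    """Combine per-word spam probabilities p1..pN into one score p.

    Assumes the words are independent and that spam and ham are
    equally likely a priori (the naive Bayes simplification).
    """
    spam_side = prod(word_probs)                    # p1 * p2 * ... * pN
    ham_side = prod(1.0 - p for p in word_probs)    # (1-p1) * ... * (1-pN)
    if spam_side + ham_side == 0:
        raise ValueError("contradictory evidence: some p is 0 and some is 1")
    return spam_side / (spam_side + ham_side)


# Two suspicious words reinforce each other.
print(combine([0.9, 0.8]))
```

Neutral words (p = 0.5) leave the score unchanged, so only words that lean one way or the other move the result.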
Problem:
Bayesian poisoning (spammers pad messages with innocent-looking words to pull the spam score down)
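To see why poisoning works against a word-based Bayesian score, here is a small sketch with made-up per-word probabilities. It assumes the naive Bayes combination of independent word probabilities; the helper name and all numbers are illustrative only.

```python
from math import prod


def combined_spam_score(word_probs):
    # Naive Bayes combination of independent per-word spam probabilities.
    spam_side = prod(word_probs)
    ham_side = prod(1.0 - p for p in word_probs)
    return spam_side / (spam_side + ham_side)


# Per-word spam probabilities for the "real" spam content (made up).
spammy_words = [0.95, 0.90]
# Innocent-looking padding words that mostly occur in ham (made up).
padding_words = [0.05, 0.10, 0.10]

before = combined_spam_score(spammy_words)
after = combined_spam_score(spammy_words + padding_words)

print(f"score without padding: {before:.3f}")
print(f"score with padding:    {after:.3f}")
```

With these numbers the padded message scores far lower than the unpadded one, so a filter with a fixed threshold would let it through.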
8. Other Models (Machine Learning Based)
• Neural Networks
• Graphical Models
• Logistic Regression
• Support Vector Machines (SVMs)
• All make fewer assumptions than the Bayesian model
• They can capture relationships between words, implicitly or explicitly, at the expense of more complexity
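As one illustration of the models listed above, here is a minimal logistic regression spam classifier over word-presence features, trained with plain stochastic gradient descent. It is a toy sketch: the four training messages, the hyperparameters, and all function names are assumptions made for this example, and a real system would use far more data and an established library.

```python
import math


def tokenize(text):
    # Lowercase whitespace tokenizer; good enough for a toy example.
    return text.lower().split()


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(messages, labels, epochs=200, learning_rate=0.5):
    """Fit one weight per vocabulary word plus a bias (label 1 = spam)."""
    vocab = sorted({word for msg in messages for word in tokenize(msg)})
    index = {word: i for i, word in enumerate(vocab)}
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in zip(messages, labels):
            features = {index[word] for word in tokenize(text)}
            predicted = sigmoid(bias + sum(weights[i] for i in features))
            error = predicted - label      # gradient of the log loss w.r.t. z
            bias -= learning_rate * error
            for i in features:
                weights[i] -= learning_rate * error
    return index, weights, bias


def predict(model, text):
    """Return the estimated probability that `text` is spam."""
    index, weights, bias = model
    known = {index[word] for word in tokenize(text) if word in index}
    return sigmoid(bias + sum(weights[i] for i in known))


messages = [
    "cheap replica watches",
    "replica handbags cheap offer",
    "meeting agenda for monday",
    "lunch on monday",
]
labels = [1, 1, 0, 0]
model = train(messages, labels)

print(predict(model, "cheap replica offer"))
print(predict(model, "monday meeting"))
```

Unlike the per-word Bayesian score, the weights here are fitted jointly, so a word's influence depends on which other words it tends to appear with in the training data.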
9. MSR: Challenge-Response System
• Idea of Cynthia Dwork (now at Microsoft Research, Silicon Valley) and Moni Naor (at the Weizmann Institute of Science in Israel)
• First determine whether a message is ham or spam, then take action
• Aim: recover even the false positives, i.e. ham wrongly classified as spam
• Idea: increase the recall of ham messages
• So you send the sender a small puzzle as a challenge; a genuine sender will answer it
• Spammers do not have the time to answer
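The slides do not say what kind of puzzle is sent. One possibility is a computational puzzle that is cheap to check but costly to solve in bulk. The sketch below assumes a hash-based puzzle (find a nonce so that a SHA-256 digest starts with a given number of zero bits); the function names and the difficulty value are illustrative assumptions.

```python
import hashlib
import os


def make_challenge():
    # Random string the receiver sends back to the (suspected) sender.
    return os.urandom(16).hex()


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def puzzle_digest(challenge, nonce):
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge, difficulty=12):
    """Sender side: brute-force a nonce (about 2**difficulty hashes on average)."""
    nonce = 0
    while leading_zero_bits(puzzle_digest(challenge, nonce)) < difficulty:
        nonce += 1
    return nonce


def verify(challenge, nonce, difficulty=12):
    """Receiver side: a single hash is enough to check the answer."""
    return leading_zero_bits(puzzle_digest(challenge, nonce)) >= difficulty


challenge = make_challenge()
answer = solve(challenge)
print(verify(challenge, answer))
```

The asymmetry is the point: one genuine sender pays a small one-off cost, while a spammer sending millions of messages would have to pay it millions of times.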
10. My Idea: Collaborative Intelligence
• Classify each message as spam or ham using the previous techniques
• Try to warn the user of probable spam among mails classified as ham, based on the responses of other readers
• Say a mail is sent to 50 people and is classified as ham
• Check the rate at which other recipients mark it as spam
• When a new user opens it, tell them it is in the inbox but is probably spam, with some confidence
• The user is warned in advance of possible spam in their inbox
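The idea above can be sketched as a simple rule on the fraction of recipients who marked a delivered mail as spam. The thresholds, the minimum number of reports, and the function name are hypothetical choices for illustration; the slides do not specify them.

```python
def spam_warning(recipients, spam_reports, min_reports=3, rate_threshold=0.2):
    """Decide whether to warn a new reader about a mail classified as ham.

    recipients   -- number of people the mail was delivered to
    spam_reports -- how many of them marked it as spam so far
    Returns (warn, rate), where rate doubles as a rough confidence value.
    """
    if recipients <= 0:
        raise ValueError("recipients must be positive")
    if not 0 <= spam_reports <= recipients:
        raise ValueError("spam_reports must be between 0 and recipients")
    rate = spam_reports / recipients
    # Require a few reports so one unhappy reader cannot trigger a warning.
    warn = spam_reports >= min_reports and rate >= rate_threshold
    return warn, rate


# A mail sent to 50 people, 15 of whom marked it as spam.
warn, rate = spam_warning(50, 15)
if warn:
    print(f"In your inbox, but probably spam ({rate:.0%} of recipients reported it)")
```

The mail stays in the inbox either way; the rule only decides whether to show the warning next to it.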
11. References
• Commtouch: Internet Threats Trend Report, October 2012 (http://www.commtouch.com/download/2389)
• Symantec: Internet Security Report (http://www.symantec.com/content/en/us/enterprise/other_resources/b-istr_main_report_2011_21239364.en-us.pdf)
• Cisco: Security Report (http://www.cisco.com/en/US/prod/collateral/vpndevc/security_annual_report_2011.pdf)
• Wikipedia: http://en.wikipedia.org/wiki/Email_spam
• http://www.destinationcrm.com/Articles/Editorial/Magazine-Features/Avoid-the-Spam-Folder-84272.aspx
• http://www.techsupportalert.com/content/how-why-switch-yahoo-mail-gmail.htm
• http://www.spamhaus.org/statistics/countries/
• MSR: http://research.microsoft.com/en-us/um/people/joshuago/significance-spam_edited2-times.pdf