How Many Folders Do You Really Need? Classifying Email into a Handful of Categories
Date: 2015/07/08
Author: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE
5. Introduction
Recently, automatic classification that offers the same categories
to all users has started to appear in some Web mail clients
6. Introduction
Today's commercial Web mail traffic is dominated by
machine-generated messages
sent by sources such as social networks, e-commerce sites, etc.
7. Introduction
Goal
Automatically distinguishing between personal and
machine-generated email
Classifying messages into latent categories, without
requiring users to have defined any folder
11. DISCOVERING LATENT CATEGORIES
Retrieved the most “popular” folders created by users
System folders (e.g., “trash”, “spam”) were ignored
Applied LDA to these document folders in order to discover a
set of latent topics
The latent topics then map into “latent categories”
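The topic-discovery step can be illustrated with a small collapsed Gibbs sampler for LDA. This is only a sketch: the slides do not say which LDA implementation or hyperparameters were used, and the toy folder contents, `alpha`, `beta`, the iteration count and the name `lda_gibbs` are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents (toy scale)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []
    for d, doc in enumerate(docs):                        # random initial assignment
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]                     # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k                     # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return top_words, topic_total

# Hypothetical folders, each reduced to a bag of words (assumed representation).
FOLDERS = [
    ["flight", "hotel", "booking", "flight"],
    ["hotel", "itinerary", "booking"],
    ["invoice", "payment", "statement"],
    ["payment", "bank", "invoice", "statement"],
]
topics, totals = lda_gibbs(FOLDERS, num_topics=2)
```

Each entry of `topics` lists the most frequent words of one latent topic; at the scale described in the slides a dedicated LDA library would be the more realistic choice.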
12. DISCOVERING LATENT CATEGORIES
The topics were obtained for K = 6, as this value offered a good
balance between total and individual coverage
The email traffic coverage at K = 6 was 70%
[Figure: the discovered categories, split into machine generated and human generated]
14. Extracting Features
Content features
extract words from the subject line and message body
the subject character length, body character length
the number of urls occurring in the body
Address features
features extracted from the sender email address
the subdomains (e.g., .edu, .gov) and subnames (e.g., billing, noreply)
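The content and address features listed above could be extracted roughly as follows. This is a minimal sketch: the feature names, the tokenization and the URL pattern are assumptions, not details given in the slides.

```python
import re

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z0-9']+")

def extract_features(sender, subject, body):
    """Build a sparse feature dict from one message (illustrative feature names)."""
    local, _, domain = sender.lower().partition("@")
    features = {
        "subject_len": len(subject),               # subject character length
        "body_len": len(body),                     # body character length
        "num_urls": len(URL_RE.findall(body)),     # number of URLs in the body
    }
    # Content features: word counts from the subject line and message body.
    for word in WORD_RE.findall((subject + " " + body).lower()):
        key = "word=" + word
        features[key] = features.get(key, 0) + 1
    # Address features: subdomains (edu, gov, ...) and subnames (billing, noreply, ...).
    for part in domain.split("."):
        if part:
            features["subdomain=" + part] = 1
    for part in re.split(r"[._+\-]", local):
        if part:
            features["subname=" + part] = 1
    return features

example = extract_features(
    "billing@mail.example.edu",
    "Your invoice",
    "Pay at https://example.com/pay now",
)
```

A sparse dictionary keeps the representation compact when most words and address parts are absent from any given message.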
15. Extracting Features
Behavioral features
weekly and monthly volumes of sent messages
volumes of messages sent as a reply
volumes of messages sent as a forward (with “FW:” in the subject line)
volume of the messages received by the sender
volume of the messages received as a reply
volume of the messages received as a forward
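As a rough illustration of the sent-side volume features, the sketch below counts total, reply and forward messages per sender. The slides only mention “FW:” for forwards; treating “Re:” as the reply marker and also accepting “Fwd:” are assumptions, and the weekly and monthly windows are omitted for brevity.

```python
from collections import defaultdict

def sent_volumes(messages):
    """messages: iterable of (sender, subject) pairs. Returns per-sender counts."""
    volumes = defaultdict(lambda: {"sent": 0, "reply": 0, "forward": 0})
    for sender, subject in messages:
        counts = volumes[sender]
        counts["sent"] += 1
        prefix = subject.strip().lower()
        if prefix.startswith("re:"):               # assumed reply marker
            counts["reply"] += 1
        elif prefix.startswith(("fw:", "fwd:")):   # "FW:" is from the slides, "Fwd:" is assumed
            counts["forward"] += 1
    return dict(volumes)

log = [
    ("jane.doe@example.org", "Re: lunch"),
    ("jane.doe@example.org", "FW: photos"),
    ("jane.doe@example.org", "Hello"),
    ("noreply@shop.example.com", "Weekly deals"),
]
volumes = sent_volumes(log)
```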
16. Extracting Features
Temporal behavior features
Record whether a sender sends more than X messages in an hour
X takes the values 10, 60, 80, 100, 120
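The temporal behavior features can be sketched as one boolean per threshold X. Bucketing by clock hour is an assumption here; the slides do not say whether a fixed or a sliding one-hour window was used.

```python
from collections import Counter, defaultdict
from datetime import datetime

THRESHOLDS = (10, 60, 80, 100, 120)   # values of X listed in the slides

def burst_features(events):
    """events: iterable of (sender, datetime). Returns {sender: {X: bool}}."""
    per_hour = defaultdict(Counter)
    for sender, ts in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)   # clock-hour bucket
        per_hour[sender][hour] += 1
    return {
        sender: {x: max(hours.values()) > x for x in THRESHOLDS}
        for sender, hours in per_hour.items()
    }

# Hypothetical traffic: one sender with 61 messages in a single hour, one with 2.
events = [("deals@shop.example.com", datetime(2014, 1, 1, 9, i % 60)) for i in range(61)]
events += [("jane.doe@example.org", datetime(2014, 1, 1, h)) for h in (9, 15)]
bursts = burst_features(events)
```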
22. Heuristic labeling
Used this type of labeling mostly for differentiating between
human and machine senders
Identify corporate machine senders
sender names such as “mailer-daemon” or “no-reply”
repeated occurrences of words such as “unsubscribe” in message
headers
SMTP domain information
Identify human senders
addresses of the form <firstname>.<lastname>@
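These heuristics might be approximated with a few string and regex checks. This is a simplified sketch: a single “unsubscribe” occurrence is treated as enough, the SMTP domain check is left out, and the function name and return values are hypothetical.

```python
import re

MACHINE_MARKERS = ("mailer-daemon", "no-reply", "noreply")
HUMAN_ADDRESS = re.compile(r"^[a-z]+\.[a-z]+@")   # <firstname>.<lastname>@

def heuristic_label(sender, headers):
    """Return "machine", "human", or None when no heuristic applies."""
    sender = sender.strip().lower()
    local = sender.split("@", 1)[0]
    if any(marker in local for marker in MACHINE_MARKERS):
        return "machine"
    header_text = " ".join(f"{k}: {v}" for k, v in headers.items()).lower()
    if "unsubscribe" in header_text:
        return "machine"
    if HUMAN_ADDRESS.match(sender):
        return "human"
    return None   # left unlabeled
```

Senders for which no rule fires stay unlabeled rather than being forced into either class.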
27. CLASSIFICATION MECHANISM
Online lightweight classification
consists of hard-coded rules designed to classify messages quickly
finding the top 100 senders that cover a significant percentage of the
total traffic and are category consistent
categorizing all reply/forward messages as human
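The two hard-coded rules could look roughly like the sketch below. The contents of `TOP_SENDERS`, the subject prefixes used to detect replies and forwards, and the order in which the rules are applied are assumptions.

```python
# Hypothetical table of high-volume, category-consistent senders.
TOP_SENDERS = {
    "boots@email.boots.com": "shopping",
    "accorhotels.reservation@accor.com": "travel",
}
REPLY_FORWARD_PREFIXES = ("re:", "fw:", "fwd:")   # assumed markers

def lightweight_classify(sender, subject):
    """Return a category if a hard-coded rule fires, otherwise None."""
    if subject.strip().lower().startswith(REPLY_FORWARD_PREFIXES):
        return "human"                             # reply/forward messages are human
    return TOP_SENDERS.get(sender.strip().lower())
```

Returning `None` signals that the message should fall through to the next, more expensive stage.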
28. CLASSIFICATION MECHANISM
Online sender-based classification
looking for the sender in a lookup table containing senders with known
categories
Example: the sender boots@email.boots.com is looked up in the table and the message is categorized as shopping

sender                              category
1800usbanks@online.usbank.com       shopping
accorhotels.reservation@accor.com   travel
boots@email.boots.com               shopping
target.payroll@target.com           finance
29. CLASSIFICATION MECHANISM
Offline creation of classified senders table
use the training set to train a logistic regression model
train a separate model for each category in a one-vs-all manner
the classification process is performed periodically to account for
new senders
[Figure: a new email is scored by logistic regression against the categories human, shopping, finance, travel, social, career, and the result (e.g. finance) is stored as a sender/category entry]
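One-vs-all logistic regression can be sketched in plain Python as below. The optimizer (stochastic gradient descent), learning rate, epoch count and the toy sender features are assumptions; the slides only state that one logistic regression model is trained per category.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))                   # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def score(model, feats):
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in feats.items()))

def train_binary(rows, labels, epochs=200, lr=0.5):
    """Train one logistic regression model with SGD on sparse feature dicts."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, y in zip(rows, labels):
            err = score((weights, bias), feats) - y    # gradient of the log loss
            bias -= lr * err
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights, bias

def train_one_vs_all(rows, row_categories, categories):
    """One binary model per category: that category vs. all the others."""
    return {
        c: train_binary(rows, [1 if y == c else 0 for y in row_categories])
        for c in categories
    }

def predict(models, feats):
    return max(models, key=lambda c: score(models[c], feats))

# Hypothetical labeled senders described by address features.
rows = [
    {"subname=billing": 1, "subdomain=bank": 1},
    {"subname=deals": 1, "subdomain=shop": 1},
    {"subname=reservation": 1},
]
row_categories = ["finance", "shopping", "travel"]
models = train_one_vs_all(rows, row_categories, ["finance", "shopping", "travel"])
```

At prediction time the category whose model gives the highest probability is chosen, which is one common way to combine one-vs-all scores.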
30. CLASSIFICATION MECHANISM
Online Heavy-weight classification
email messages whose sender does not appear in the classified sender
table are sent to a heavy-weight, message-based classifier
uses all relevant features pertaining to the message body, subject line and
sender name
employs a logistic regression classifier
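Taken together, the three online stages form a cascade that falls through to the heavy-weight classifier only when the cheaper stages give no answer. The sketch below assumes that ordering; the function and parameter names are hypothetical, and the two classifiers are passed in as plain callables.

```python
def classify_message(message, lightweight_rules, sender_table, heavy_classifier):
    """Return (category, stage) for a message dict with "sender" and "subject" keys."""
    category = lightweight_rules(message)            # 1. hard-coded rules
    if category is not None:
        return category, "lightweight"
    category = sender_table.get(message["sender"].lower())
    if category is not None:                         # 2. classified senders table
        return category, "sender"
    return heavy_classifier(message), "heavyweight"  # 3. message-level classifier

# Sender table entries as shown in the lookup example; the rule and the
# heavy-weight classifier below are stand-ins for illustration only.
SENDER_TABLE = {
    "boots@email.boots.com": "shopping",
    "target.payroll@target.com": "finance",
}

def rules(message):
    return "human" if message["subject"].lower().startswith("re:") else None

def heavy(message):
    return "social"

result = classify_message(
    {"sender": "boots@email.boots.com", "subject": "New offers"},
    rules, SENDER_TABLE, heavy,
)
```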
31. CLASSIFICATION MECHANISM
Offline training the message-level classifier
a logistic regression model is trained for each category in a one-vs-all
manner
the training process is quite similar to that of sender classification
the training set differs, as it contains messages rather than senders
33. Experiment
Experimental evaluation was performed on more than 500
billion messages received during a period of six months by
users of the Yahoo mail service
38. CONCLUSION
Presented a Web-scale email categorization approach
offline learning
online classification
Discovered latent categories
Categories cover more than 70% of both email traffic and
email search queries