How Many Folders Do You Really Need?
Classifying Email into a Handful of Categories
Date: 2015/07/08
Authors: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE
1
Outline
Introduction
Method
Experiment
Conclusion
2
Outline
Introduction
Method
Experiment
Conclusion
3
Introduction
 Email classification is still a mostly manual task
4
Introduction
 Automatic classification that offers the same fixed categories to
all users has recently started to appear in some Web mail clients
5
Introduction
 Today's commercial Web mail traffic is dominated by
machine-generated messages
 e.g., social networks, e-commerce sites
6
Introduction
 Goal
 Automatically distinguishing between personal and
machine-generated email
 Classifying messages into latent categories, without
requiring users to have defined any folders
7
Outline
Introduction
Method
Experiment
Conclusion
8
Overview
9
[Pipeline: Email raw data → LDA clustering → Latent categories → Feature extraction → Aggregation → Training data generation → Test data]
Overview
10
DISCOVERING LATENT CATEGORIES
 Retrieving the most “popular” folders created by users
 ignored system folders (e.g., “trash”, “spam”)
 Applied LDA to these folders, treated as documents, in order to
discover a set of latent topics
 these latent topics map into “latent categories”
11
[Diagram: LDA over folder documents → latent categories]
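The discovery step above can be sketched in code. This is a minimal illustration, not the paper's implementation: the folder "documents" and word lists are invented for the example, and scikit-learn's LatentDirichletAllocation stands in for whatever LDA implementation the authors used.

```python
# Sketch of latent-category discovery: treat each popular user-created
# folder as a "document" and run LDA to find K latent topics.
# All folder contents below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

folder_documents = [
    "order shipped invoice purchase receipt",
    "flight hotel booking itinerary reservation",
    "statement credit card bank balance",
    "friend party photos weekend dinner",
    "job interview resume position offer",
    "sale discount deal coupon offer",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(folder_documents)

K = 6  # number of latent categories, as chosen in the paper
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per folder

# Each folder maps to its dominant latent topic (= latent category).
dominant = doc_topics.argmax(axis=1)
```

Each row of `doc_topics` is a distribution over the K topics, so a folder's category is simply its highest-weight topic.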
DISCOVERING LATENT CATEGORIES
 K = 6 topics were chosen, as this value exposed a good
balance between total and individual coverage
 The email traffic coverage at K = 6 was 70%
12
[Table: the six latent topics, grouped into human-generated vs. machine-generated]
Overview
13
Extracting Features
 Content features
 extract words from the subject line and message body
 the subject character length, body character length
 the number of urls occurring in the body
 Address features
 features extracted from the sender email address
 the subdomains (e.g., .edu, .gov) and sub-names (e.g., billing, noreply)
14
Extracting Features
 Behavioral features
 weekly and monthly volumes of sent messages
 volumes of messages sent as a reply
 volumes of messages sent as forward (with FW: in the subject line)
 volume of the messages received by the sender
 volume of the messages received as a reply
 volume of the messages received as a forward
15
Extracting Features
 Temporal behavior features
 Record whether a sender sends more than X messages in an hour
 X takes as values: 10, 60, 80, 100, 120
16
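The temporal-behavior features above can be sketched as follows; the timestamps are illustrative, and only the thresholds {10, 60, 80, 100, 120} come from the slide.

```python
# Sketch of the temporal-behavior features: for each sender, record
# whether it sends more than X messages within one hour, for each
# threshold X in {10, 60, 80, 100, 120}.
from collections import Counter

THRESHOLDS = (10, 60, 80, 100, 120)

def temporal_flags(send_timestamps):
    """send_timestamps: unix times (seconds) of one sender's messages."""
    per_hour = Counter(ts // 3600 for ts in send_timestamps)  # bucket by hour
    peak = max(per_hour.values(), default=0)  # busiest hour
    return {x: peak > x for x in THRESHOLDS}

# A sender bursting 75 messages in one hour trips the 10 and 60 flags.
flags = temporal_flags([1000 + i for i in range(75)])
```

A human sender rarely trips even the lowest flag, so these booleans are cheap evidence for the machine-generated class.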
Overview
17
Aggregation
18
[Figure: aggregating message-level features into sender-level features; example sender in the Financial category]
Overview
19
TRAINING DATA
 Three types of labeling techniques are considered
 manual
 heuristic-based
 automatic
 6 latent categories
 human
 career
 shopping
 travel
 finance
 social
20
Manual labeling
 Human editors assign labels to specific examples
21
Heuristic labeling
 Used this type of labeling mostly for differentiating between
human and machine senders
 Identify corporate machine senders
 such as “mailer-daemon” or “no-reply”
 repeating occurrences of words such as “unsubscribe” in message
headers
 SMTP domain information
 Identify human senders
 <first name>.<lastname>@
22
Automatic labeling
 Folder-based majority voting
23
Example: sender boots@email.boots.com, with folder message counts
purchase: 55, ebay: 4, credit cards: 1, hotel: 6
Folder → category: purchase, ebay → Shopping; credit cards → Finance; hotel → Travel
Votes: Shopping 55 + 4 = 59, Finance 1, Travel 6
Category: Shopping
(59 > 50 message threshold, number of folders > 1 threshold)
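Folder-based majority voting can be sketched as below. The folder-to-category map and the thresholds mirror the slide's example; the exact map and threshold values in production are assumptions for illustration.

```python
# Sketch of folder-based majority voting: each of a sender's folders
# maps to a latent category, counts are summed per category, and the
# top category wins if it clears the message and folder thresholds.
from collections import Counter

# Illustrative folder -> category map, taken from the slide's example.
FOLDER_TO_CATEGORY = {
    "purchase": "shopping", "ebay": "shopping",
    "credit cards": "finance", "hotel": "travel",
}

def majority_vote(folder_counts, min_messages=50, min_folders=1):
    votes = Counter()
    for folder, count in folder_counts.items():
        category = FOLDER_TO_CATEGORY.get(folder)
        if category:
            votes[category] += count
    if not votes:
        return None
    category, score = votes.most_common(1)[0]
    if score > min_messages and len(folder_counts) > min_folders:
        return category
    return None  # evidence too weak to label the sender

# The slide's example: shopping gets 55 + 4 = 59 > 50 votes.
label = majority_vote({"purchase": 55, "ebay": 4, "credit cards": 1, "hotel": 6})
```

Senders that fail either threshold are simply left unlabeled rather than mislabeled.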
Automatic labeling
 Folder-based LDA voting
24
Example: sender boots@email.boots.com, with per-folder LDA topic distributions
purchase: Shopping 70%, Finance 20%
ebay: Shopping 60%, Social 10%
credit cards: Finance 90%, Shopping 10%
hotel: Travel 74%, Finance 15%
Summed votes: Shopping 0.7 + 0.6 + 0.1 = 1.4; Finance 0.2 + 0.9 + 0.15 = 1.25; Travel 0.74; Social 0.1
Category: Shopping
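The LDA-voting variant replaces the hard folder-to-category map with soft topic weights; a minimal sketch, using the slide's example distributions:

```python
# Sketch of folder-based LDA voting: each folder carries the topic
# distribution LDA assigned to it, and the sender's category is the
# topic with the largest summed weight across its folders.
from collections import defaultdict

# Topic distributions from the slide's example.
folder_topics = {
    "purchase":     {"shopping": 0.70, "finance": 0.20},
    "ebay":         {"shopping": 0.60, "social": 0.10},
    "credit cards": {"finance": 0.90, "shopping": 0.10},
    "hotel":        {"travel": 0.74, "finance": 0.15},
}

def lda_vote(folder_topics):
    scores = defaultdict(float)
    for topics in folder_topics.values():
        for category, weight in topics.items():
            scores[category] += weight
    return max(scores, key=scores.get)

# shopping: 0.7 + 0.6 + 0.1 = 1.4 beats finance's 1.25.
label = lda_vote(folder_topics)
```

Because every folder contributes partial weight to several categories, an ambiguous folder like "credit cards" no longer casts an all-or-nothing vote.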
Overview
25
CLASSIFICATION MECHANISM
26
CLASSIFICATION MECHANISM
 Online lightweight classification
 consisting of hard-coded rules designed to classify messages quickly
 finding the top 100 senders that cover a significant percentage of the
total traffic and are category consistent
 categorizing all reply/forward messages as human
27
CLASSIFICATION MECHANISM
 Online sender-based classification
 looking for the sender in a lookup table containing senders with known
categories
28
Example: an incoming message from boots@email.boots.com is looked up in the table
sender                               category
1800usbanks@online.usbank.com        shopping
accorhotels.reservation@accor.com    travel
boots@email.boots.com                shopping
target.payroll@target.com            finance
→ classified as shopping
CLASSIFICATION MECHANISM
 Offline creation of classified senders table
 use the training set to train a logistic regression model
 train a separate model in a one-vs-all manner
 the classification process is run periodically to account for
new senders
29
[Figure: a new sender's email is scored by one-vs-all logistic regression models for human, shopping, finance, travel, social and career; the winning category (e.g. finance) is added to the sender–category table]
CLASSIFICATION MECHANISM
 Online heavyweight classification
 email messages whose sender did not appear in the classified senders
table are sent to a heavyweight message-based classifier
 use all relevant features, pertaining to the message body, subject line and
sender name
 employed a logistic regression classifier
30
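The two online stages above form a fast/slow path. The sketch below shows that control flow only; the table entries are the slide's examples, and `heavyweight_classify` is a hypothetical stand-in for the trained message-level model.

```python
# Sketch of the online flow: known senders are resolved through the
# classified-senders lookup table; unknown senders fall through to a
# heavier message-based classifier.
SENDER_TABLE = {
    "boots@email.boots.com": "shopping",
    "target.payroll@target.com": "finance",
}

def heavyweight_classify(subject, body):
    # Placeholder: a trained one-vs-all logistic regression model would
    # score all six categories from content/address features here.
    return "human"

def classify_message(sender, subject, body):
    # Fast path: sender already classified offline.
    category = SENDER_TABLE.get(sender)
    if category is not None:
        return category
    # Slow path: message-level classification.
    return heavyweight_classify(subject, body)

fast = classify_message("boots@email.boots.com", "Your order", "...")
slow = classify_message("unknown@example.com", "Hi", "...")
```

The table lookup handles the bulk of traffic at trivial cost, so the expensive model only runs on the long tail of unseen senders.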
CLASSIFICATION MECHANISM
 Offline training the message-level classifier
 a logistic regression model is trained for each category in a one-vs-all
manner
 the training process is quite similar to the sender classification,
except that the training set contains messages rather than senders
31
Outline
Introduction
Method
Experiment
Conclusion
32
Experiment
 Experimental evaluation was performed on more than 500
billion messages received during a period of six months by
users of the Yahoo Mail service
33
Experiment
34
Experiment
35
[Figure: AUC (one-vs-rest classification) performance on different feature subsets, including content features (email body, subject, etc.)]
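One-vs-rest AUC, as used in this evaluation, can be computed without any library: it is the probability that a random positive example outranks a random negative one under the model's score. The labels and scores below are synthetic.

```python
# Sketch of one-vs-rest AUC: treat one category as positive, the rest
# as negative, and count correctly ranked positive/negative pairs
# (ties count half).
def auc_one_vs_rest(labels, scores, positive):
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic scores from a hypothetical "shopping" one-vs-all model.
labels = ["shopping", "finance", "shopping", "travel", "finance", "shopping"]
shopping_scores = [0.9, 0.2, 0.8, 0.3, 0.4, 0.7]

auc = auc_one_vs_rest(labels, shopping_scores, "shopping")
# Every positive outscores every negative here, so AUC is 1.0.
```

Repeating this per category, one number per feature subset, yields the comparison shown in the figure.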
Experiment
36
Outline
Introduction
Method
Experiment
Conclusion
37
CONCLUSION
 Presented a Web-scale categorization approach combining
 offline learning
 online classification
 Discovered latent categories
 Categories cover more than 70% of both email traffic and
email search queries
38
Thanks for listening
39
