How Many Folders Do You Really Need?
Classifying Email into a Handful of Categories
Date: 2015/07/08
Authors: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE
1
Outline
Introduction
Method
Experiment
Conclusion
2
Outline
Introduction
Method
Experiment
Conclusion
3
Introduction
 Email classification is still a mostly manual task
4
Introduction
 Automatic classification that offers the same fixed categories to
all users has recently started to appear in some Web mail clients
5
Introduction
 Today's commercial Web mail traffic is dominated by
machine-generated messages
 e.g., social networks, e-commerce sites
6
Introduction
 Goal
 Automatically distinguishing between personal and
machine-generated email
 Classifying messages into latent categories, without
requiring users to have defined any folders
7
Outline
Introduction
Method
Experiment
Conclusion
8
Overview
9
[Pipeline: Email raw data → LDA clustering → Latent categories → Feature extraction → Aggregation → Training data generation → Test data]
Overview
10
DISCOVERING LATENT CATEGORIES
 Retrieving the most “popular” folders created by users
 ignored system folders (e.g., “trash”, “spam”)
 Applied LDA to these folders, treated as documents, in order to
discover a set of latent topics
 these latent topics map into “latent categories”
11
[Diagram: LDA over folder documents → latent categories]
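The discovery step above can be sketched in code. This is a minimal illustration, not the paper's implementation: the folder "documents" and word lists are invented for the example, and scikit-learn's LatentDirichletAllocation stands in for whatever LDA implementation the authors used.

```python
# Sketch of latent-category discovery: treat each popular user-created
# folder as a "document" and run LDA to find K latent topics.
# All folder contents below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

folder_documents = [
    "order shipped invoice purchase receipt",
    "flight hotel booking itinerary reservation",
    "statement credit card bank balance",
    "friend party photos weekend dinner",
    "job interview resume position offer",
    "sale discount deal coupon offer",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(folder_documents)

K = 6  # number of latent categories, as chosen in the paper
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per folder

# Each folder maps to its dominant latent topic (= latent category).
dominant = doc_topics.argmax(axis=1)
```

Each row of `doc_topics` is a distribution over the K topics, so a folder's category is simply its highest-weight topic.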
DISCOVERING LATENT CATEGORIES
 K = 6 topics were chosen, as this value exposed a good
balance between total and individual coverage
 The email traffic coverage at K = 6 was 70%
12
[Table: the six latent topics, grouped into human-generated vs. machine-generated]
Overview
13
Extracting Features
 Content features
 extract words from the subject line and message body
 the subject character length, body character length
 the number of urls occurring in the body
 Address features
 features extracted from the sender email address
 the subdomains (e.g., .edu, .gov) and sub-names (e.g., billing, noreply)
14
Extracting Features
 Behavioral features
 weekly and monthly volumes of sent messages
 volumes of messages sent as a reply
 volumes of messages sent as forward (with FW: in the subject line)
 volume of the messages received by the sender
 volume of the messages received as a reply
 volume of the messages received as a forward
15
Extracting Features
 Temporal behavior features
 Record whether a sender sends more than X messages in an hour
 X takes as values: 10, 60, 80, 100, 120
16
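The temporal-behavior features above can be sketched as follows; the timestamps are illustrative, and only the thresholds {10, 60, 80, 100, 120} come from the slide.

```python
# Sketch of the temporal-behavior features: for each sender, record
# whether it sends more than X messages within one hour, for each
# threshold X in {10, 60, 80, 100, 120}.
from collections import Counter

THRESHOLDS = (10, 60, 80, 100, 120)

def temporal_flags(send_timestamps):
    """send_timestamps: unix times (seconds) of one sender's messages."""
    per_hour = Counter(ts // 3600 for ts in send_timestamps)  # bucket by hour
    peak = max(per_hour.values(), default=0)  # busiest hour
    return {x: peak > x for x in THRESHOLDS}

# A sender bursting 75 messages in one hour trips the 10 and 60 flags.
flags = temporal_flags([1000 + i for i in range(75)])
```

A human sender rarely trips even the lowest flag, so these booleans are cheap evidence for the machine-generated class.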
Overview
17
Aggregation
18
[Figure: aggregating message-level features into sender-level features; example sender in the Financial category]
Overview
19
TRAINING DATA
 Three types of labeling techniques are considered
 manual
 heuristic-based
 automatic
 6 latent categories
 human
 career
 shopping
 travel
 finance
 social
20
Manual labeling
 Human editors assign labels to specific examples
21
Heuristic labeling
 Used this type of labeling mostly for differentiating between
human and machine senders
 Identify corporate machine senders
 such as “mailer-daemon” or “no-reply”
 repeating occurrences of words such as “unsubscribe” in message
headers
 SMTP domain information
 Identify human senders
 <first name>.<lastname>@
22
Automatic labeling
 Folder-based majority voting
23
Example: sender boots@email.boots.com, with folder message counts
purchase: 55, ebay: 4, credit cards: 1, hotel: 6
Folder → category: purchase, ebay → Shopping; credit cards → Finance; hotel → Travel
Votes: Shopping 55 + 4 = 59, Finance 1, Travel 6
Category: Shopping
(59 > 50 message threshold, number of folders > 1 threshold)
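Folder-based majority voting can be sketched as below. The folder-to-category map and the thresholds mirror the slide's example; the exact map and threshold values in production are assumptions for illustration.

```python
# Sketch of folder-based majority voting: each of a sender's folders
# maps to a latent category, counts are summed per category, and the
# top category wins if it clears the message and folder thresholds.
from collections import Counter

# Illustrative folder -> category map, taken from the slide's example.
FOLDER_TO_CATEGORY = {
    "purchase": "shopping", "ebay": "shopping",
    "credit cards": "finance", "hotel": "travel",
}

def majority_vote(folder_counts, min_messages=50, min_folders=1):
    votes = Counter()
    for folder, count in folder_counts.items():
        category = FOLDER_TO_CATEGORY.get(folder)
        if category:
            votes[category] += count
    if not votes:
        return None
    category, score = votes.most_common(1)[0]
    if score > min_messages and len(folder_counts) > min_folders:
        return category
    return None  # evidence too weak to label the sender

# The slide's example: shopping gets 55 + 4 = 59 > 50 votes.
label = majority_vote({"purchase": 55, "ebay": 4, "credit cards": 1, "hotel": 6})
```

Senders that fail either threshold are simply left unlabeled rather than mislabeled.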
Automatic labeling
 Folder-based LDA voting
24
Example: sender boots@email.boots.com, with per-folder LDA topic distributions
purchase: Shopping 70%, Finance 20%
ebay: Shopping 60%, Social 10%
credit cards: Finance 90%, Shopping 10%
hotel: Travel 74%, Finance 15%
Summed votes: Shopping 0.7 + 0.6 + 0.1 = 1.4; Finance 0.2 + 0.9 + 0.15 = 1.25; Travel 0.74; Social 0.1
Category: Shopping
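The LDA-voting variant replaces the hard folder-to-category map with soft topic weights; a minimal sketch, using the slide's example distributions:

```python
# Sketch of folder-based LDA voting: each folder carries the topic
# distribution LDA assigned to it, and the sender's category is the
# topic with the largest summed weight across its folders.
from collections import defaultdict

# Topic distributions from the slide's example.
folder_topics = {
    "purchase":     {"shopping": 0.70, "finance": 0.20},
    "ebay":         {"shopping": 0.60, "social": 0.10},
    "credit cards": {"finance": 0.90, "shopping": 0.10},
    "hotel":        {"travel": 0.74, "finance": 0.15},
}

def lda_vote(folder_topics):
    scores = defaultdict(float)
    for topics in folder_topics.values():
        for category, weight in topics.items():
            scores[category] += weight
    return max(scores, key=scores.get)

# shopping: 0.7 + 0.6 + 0.1 = 1.4 beats finance's 1.25.
label = lda_vote(folder_topics)
```

Because every folder contributes partial weight to several categories, an ambiguous folder like "credit cards" no longer casts an all-or-nothing vote.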
Overview
25
CLASSIFICATION MECHANISM
26
CLASSIFICATION MECHANISM
 Online lightweight classification
 consisting of hard-coded rules designed to classify messages quickly
 finding the top 100 senders that cover a significant percentage of the
total traffic and are category consistent
 categorizing all reply/forward messages as human
27
CLASSIFICATION MECHANISM
 Online sender-based classification
 looking for the sender in a lookup table containing senders with known
categories
28
Example: an incoming message from boots@email.boots.com is looked up in the table
sender                               category
1800usbanks@online.usbank.com        shopping
accorhotels.reservation@accor.com    travel
boots@email.boots.com                shopping
target.payroll@target.com            finance
→ classified as shopping
CLASSIFICATION MECHANISM
 Offline creation of classified senders table
 use the training set to train a logistic regression model
 train a separate model in a one-vs-all manner
 the classification process is run periodically to account for
new senders
29
[Figure: a new sender's email is scored by one-vs-all logistic regression models for human, shopping, finance, travel, social and career; the winning category (e.g. finance) is added to the sender–category table]
CLASSIFICATION MECHANISM
 Online heavyweight classification
 email messages whose sender did not appear in the classified senders
table are sent to a heavyweight message-based classifier
 use all relevant features, pertaining to the message body, subject line and
sender name
 employed a logistic regression classifier
30
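The two online stages above form a fast/slow path. The sketch below shows that control flow only; the table entries are the slide's examples, and `heavyweight_classify` is a hypothetical stand-in for the trained message-level model.

```python
# Sketch of the online flow: known senders are resolved through the
# classified-senders lookup table; unknown senders fall through to a
# heavier message-based classifier.
SENDER_TABLE = {
    "boots@email.boots.com": "shopping",
    "target.payroll@target.com": "finance",
}

def heavyweight_classify(subject, body):
    # Placeholder: a trained one-vs-all logistic regression model would
    # score all six categories from content/address features here.
    return "human"

def classify_message(sender, subject, body):
    # Fast path: sender already classified offline.
    category = SENDER_TABLE.get(sender)
    if category is not None:
        return category
    # Slow path: message-level classification.
    return heavyweight_classify(subject, body)

fast = classify_message("boots@email.boots.com", "Your order", "...")
slow = classify_message("unknown@example.com", "Hi", "...")
```

The table lookup handles the bulk of traffic at trivial cost, so the expensive model only runs on the long tail of unseen senders.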
CLASSIFICATION MECHANISM
 Offline training the message-level classifier
 a logistic regression model is trained for each category in a one-vs-all
manner
 the training process is quite similar to the sender classification,
except that the training set contains messages rather than senders
31
Outline
Introduction
Method
Experiment
Conclusion
32
Experiment
 Experimental evaluation was performed on more than 500
billion messages received during a period of six months by
users of the Yahoo Mail service
33
Experiment
34
Experiment
35
[Figure: AUC (one-vs-rest classification) performance on different feature subsets, including content features (email body, subject, etc.)]
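One-vs-rest AUC, as used in this evaluation, can be computed without any library: it is the probability that a random positive example outranks a random negative one under the model's score. The labels and scores below are synthetic.

```python
# Sketch of one-vs-rest AUC: treat one category as positive, the rest
# as negative, and count correctly ranked positive/negative pairs
# (ties count half).
def auc_one_vs_rest(labels, scores, positive):
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic scores from a hypothetical "shopping" one-vs-all model.
labels = ["shopping", "finance", "shopping", "travel", "finance", "shopping"]
shopping_scores = [0.9, 0.2, 0.8, 0.3, 0.4, 0.7]

auc = auc_one_vs_rest(labels, shopping_scores, "shopping")
# Every positive outscores every negative here, so AUC is 1.0.
```

Repeating this per category, one number per feature subset, yields the comparison shown in the figure.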
Experiment
36
Outline
Introduction
Method
Experiment
Conclusion
37
CONCLUSION
 Presented a Web-scale categorization approach combining
 offline learning
 online classification
 Discovered latent categories
 Categories cover more than 70% of both email traffic and
email search queries
38
Thanks for listening
39
