1. Prof. Neeraj Bhargava
Vishal Dutt
Department of Computer Science, School of
Engineering & System Sciences
MDS University, Ajmer
3. The Data
Goal: build a model to predict whether an incoming
email is spam.
Analogous to insurance fraud detection
About 21,000 data points, each representing an email
message sent to an HP scientist.
Binary target variable:
1 = the message was spam (8%)
0 = the message was not spam (92%)
Predictive variables created based on frequencies of
various words & characters.
4. The Predictive Variables
57 variables created
Frequency of “George” (the scientist’s first name)
Frequency of “!”, “$”, etc.
Frequency of long strings of capital letters
Frequency of “receive”, “free”, “credit”….
Etc.
Variable creation required insight that (as yet) can't
be automated.
Analogous to the insurance variables an insightful actuary or
underwriter can create.
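The frequency features described above can be sketched roughly as below. The helper names and the exact normalization (percent of words and percent of characters, plus a longest-capital-run statistic) are illustrative assumptions in the style of the well-known spambase features, not the authors' actual code.

```python
import re

def word_freq(text, word):
    """Percent of words in the message equal to `word` (case-insensitive)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

def char_freq(text, ch):
    """Percent of characters in the message equal to `ch`."""
    return 100.0 * text.count(ch) / len(text) if text else 0.0

def longest_capital_run(text):
    """Length of the longest unbroken run of capital letters."""
    runs = re.findall(r"[A-Z]+", text)
    return max((len(r) for r in runs), default=0)

# toy message illustrating several of the 57 features at once
msg = "FREE CREDIT!!! Call now to receive your free $$$ prize"
features = {
    "freq_free": word_freq(msg, "free"),
    "freq_EXCL": char_freq(msg, "!"),
    "freq_DOLLARSIGN": char_freq(msg, "$"),
    "max_CAPS": longest_capital_run(msg),
}
```

A real pipeline would compute one such value per keyword and character for every message, yielding the 57-column design matrix.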
6. Methodology
Divide the data 60%-40% into train and test sets.
Use multiple techniques to fit models on train data.
Apply the models to the test data.
Compare their power using gains charts.
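The four steps above can be sketched as follows. The split helper and the gains-curve convention (cumulative share of spam captured as cases are ranked by descending predicted score) are one common way to do it, not a reconstruction of the authors' code; the model-fitting step is omitted.

```python
import random

def train_test_split(data, train_frac=0.6, seed=42):
    """Shuffle and split the data 60%-40% into train and test sets."""
    rng = random.Random(seed)
    rows = data[:]
    rng.shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def gains_curve(scores, labels):
    """Cumulative fraction of positives captured, ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    captured, curve = 0, []
    for i in order:
        captured += labels[i]
        curve.append(captured / total_pos)
    return curve

# toy example: a perfect model ranks both spam messages (label 1) at the top
scores = [0.9, 0.8, 0.1, 0.2, 0.05]
labels = [1, 1, 0, 0, 0]
print(gains_curve(scores, labels))  # [0.5, 1.0, 1.0, 1.0, 1.0]
```

Plotting one such curve per fitted model on the same test data gives the gains chart used to compare their power.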
7. Un-pruned Tree
Just let CART keep splitting
as long as it can.
Too big and messy.
More importantly: this tree over-fits the data.
Use cross-validation (on the train data) to prune back.
Select the optimal sub-tree.
8. Pruning Back
Plot cross-validated error rate vs. size of tree.
Note: error can actually increase if the tree is too big (over-fit).
Looks like the ≈ optimal tree has 52 nodes.
So prune the tree back to 52 nodes.
[Plot: cross-validated relative error (y-axis, 0.2–1.0) vs. size of tree (x-axis, 1 to 83 nodes), with the complexity parameter cp (Inf down to 2.4e-05) along the top axis; the curve bottoms out near 52 nodes.]
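Choosing the pruning point amounts to reading the tree size with the lowest cross-validated error off that plot. A minimal sketch, with made-up error values shaped like the plotted curve standing in for the real table (a common refinement, the one-standard-error rule, instead picks the smallest tree within one SE of the minimum):

```python
# (tree size, cross-validated relative error): illustrative values only,
# shaped like the plotted curve -- error falls, bottoms out, then creeps up
cv_table = [(1, 1.00), (5, 0.60), (10, 0.40), (25, 0.25),
            (52, 0.18), (66, 0.19), (83, 0.21)]

# pick the size with minimum cross-validated error
best_size, best_err = min(cv_table, key=lambda t: t[1])
print(best_size)  # 52
```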
9. Pruned Tree #1
The pruned tree is still pretty
big.
Can we get away with pruning
the tree back even further?
Let’s be radical and prune way
back to a tree we actually
wouldn’t mind looking at.
[Plot repeated: cross-validated relative error vs. size of tree, with cp on the top axis.]
10. Pruned Tree #2
[Pruned tree #2 diagram. Splits: freq_DOLLARSIGN < 0.0555; freq_remove < 0.065 and < 0.025; freq_EXCL < 0.5235 and < 0.3765; tot.CAPS < 83.5; freq_free < 0.77; freq_george >= 0.14; freq_hp >= 0.16; avg.CAPS < 2.92. Each leaf is labeled with its predicted class (0 or 1) and its non-spam/spam counts, e.g. the large class-0 root-side leaf holds 1.061e+04/170.]
Suggests a rule:
Many “$” signs, caps, and “!”, and few instances of the
company name (“HP”) ⇒ spam!
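That heuristic can be written down directly. The thresholds below are read from the pruned tree's top splits, but combining them into a single AND is a deliberate simplification for illustration, not a substitute for the fitted tree:

```python
def looks_like_spam(freq_dollar, freq_excl, tot_caps, freq_hp):
    """Crude rule distilled from the pruned tree: many '$', '!', and
    capital letters, and few mentions of the company name ('hp')."""
    return (freq_dollar >= 0.0555      # many "$" signs
            and freq_excl >= 0.5235    # many "!"
            and tot_caps >= 83.5       # lots of capital letters
            and freq_hp < 0.16)        # rarely mentions "hp"

print(looks_like_spam(0.2, 1.0, 120, 0.0))  # True
print(looks_like_spam(0.0, 0.1, 10, 0.5))   # False
```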