1. Classification by CUT:
Clearance Under Threshold
Ryan McBride (rom2@sfu.ca),
Ke Wang (wangk@cs.sfu.ca),
and Wenyuan Li (wenyuanli630@gail.com)
June 17, 2015
2. Summary
Domain knowledge helps identify “bad”
cases.
Usual Domain Knowledge: Each
outcome’s cost or relative benefit
(cost-sensitive classification).
But costs are too hard to specify in
practice.
Our Idea: Model with a regulatory
threshold, a maximum acceptable
frequency in future cases.
3. Problem: Given a collection of
sampled electrical transformers, predict
the ones containing carcinogenic
polychlorinated biphenyls (PCBs), which
are known to be harmful to humans and
the environment.
5. Conventional Solution
User sets cost matrix (note: negative=bad)

                        Object Class j
                        Positive   Negative
Predicted   Positive    C1         C2
Class i     Negative    C3         C4
Issue: What is the cost of not
removing a public health hazard?
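To make the conventional approach concrete, here is a minimal sketch of how a cost matrix is typically used: predict the class with the best expected payoff. The payoff values and the function name `best_prediction` are hypothetical and chosen only for illustration. The slides give no numbers for C1 to C4, and choosing them is the difficulty raised in the issue above.

```python
# Illustrative sketch: the payoff values are hypothetical, not from the slides.
def best_prediction(p_positive, payoff):
    """Return the prediction with the highest expected payoff.

    payoff[i][j] is the payoff of predicting class i when the true class
    is j. Negative values are bad, following the slide's convention.
    """
    probs = {"positive": p_positive, "negative": 1.0 - p_positive}
    expected = {
        predicted: sum(probs[true] * value for true, value in row.items())
        for predicted, row in payoff.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical matrix: missing a hazard (C3) is 50 times worse than a
# needless intervention (C2).
payoff = {
    "positive": {"positive": 0.0, "negative": -1.0},   # C1, C2
    "negative": {"positive": -50.0, "negative": 0.0},  # C3, C4
}

print(best_prediction(0.05, payoff)[0])  # positive
print(best_prediction(0.01, payoff)[0])  # negative
```

The decision flips between a 1% and a 5% chance of contamination only because of the assumed 50:1 ratio, which is why an unspecifiable cost is a practical obstacle.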
6. Our Solution: Thresholds
Insight: Problems without costs focus
on acceptable rates of negatives:
1. Regulations: At most “1 hazard out of
100”.
2. Power Industries: Too frequent outages
in equipment ⇒ Strengthen equipment.
Idea: Model to find “under threshold”
groups.
7. CUT Classification: Given t,
partition attribute space:
[Figure: scatter plot of + and - objects over attributes x and y, partitioned into groups Gi.]
Gi Over Threshold ⇒ Mitigate Risk.
Gi Under Threshold ⇒ Delay Action.
8. Defining Cleared Groups
When is a group “under
threshold”?
One sample that isn’t contaminated?
One hundred samples with no PCBs?
Million samples with no PCBs?
Only “clear” if enough
observations...
Use statistics to estimate
potential frequencies
9. Statistical Clearance
Use confidence interval with some
confidence (e.g. 99%):
Frequency in future cases is no more
than upper bound: ub(Gi)
Example: There is a 99% chance that
no more than 5% of Dynamo
Incorporated transformers are
contaminated.
Unknown class object o cleared if in Gi
where ub(Gi) ≤ t.
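As a rough illustration of the clearance test, the sketch below computes an upper confidence bound on a group's contamination rate and compares it with t. The slides do not say which interval is used, so the one-sided Wilson score bound here is an assumption, as are the names `wilson_upper_bound` and `is_cleared`. It will not necessarily reproduce the ub values quoted on other slides.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound(positives, n, confidence=0.99):
    """One-sided Wilson score upper bound on a binomial proportion.

    Assumption: the slides do not name the interval; Wilson is one
    common choice that behaves sensibly for small counts.
    """
    if n <= 0:
        return 1.0  # no observations: nothing can be ruled out
    z = NormalDist().inv_cdf(confidence)
    p = positives / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + margin) / (1 + z * z / n))


def is_cleared(positives, n, t, confidence=0.99):
    """A group is cleared when its upper bound does not exceed t."""
    return wilson_upper_bound(positives, n, confidence) <= t


# One clean sample is not enough to clear a group at t = 5%,
# but a million clean samples are.
print(is_cleared(0, 1, 0.05))
print(is_cleared(0, 1_000_000, 0.05))
```

The bound shrinks as clean observations accumulate, which matches the point that a group is only "clear" once there are enough observations.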
10. Partitioning Objective
Goal: Prove many future cases
are cleared.
CUT+ Algorithm: Repeated
search for large cleared groupings.
Example with t = 5% on next slide.
11. List valid partitions and choose one:
Partition A: Region for t=5%
Lowlands: 2 PCB of 300, ub(Lowlands): 1.6%, 300 CLEARED
Midlands: 103 PCB of 150, ub(Midlands): 76.3%, NON-CLEARED
Highlands: 45 PCB of 550, ub(Highlands): 10.3%, NON-CLEARED
Partition B: Manufacturer for t=5%
Made-Up Electric: 130 PCB of 400, ub(Made-Up): 36.4%, NON-CLEARED
Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
Partition A clears 300 samples.
Partition B clears 600 samples.
Partition B preferred because it clears
more objects.
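The comparison on this slide can be sketched in a few lines. The group sizes and ub values are copied from the slide; the data layout and the helper name `cleared_count` are illustrative assumptions.

```python
def cleared_count(groups, t):
    """Total number of objects in groups whose upper bound is at most t."""
    return sum(size for size, ub in groups.values() if ub <= t)


# (group size, ub) pairs as quoted on the slide.
partitions = {
    "A: Region": {
        "Lowlands": (300, 0.016),
        "Midlands": (150, 0.763),
        "Highlands": (550, 0.103),
    },
    "B: Manufacturer": {
        "Made-Up Electric": (400, 0.364),
        "Dynamo Inc": (600, 0.048),
    },
}

t = 0.05
counts = {name: cleared_count(groups, t) for name, groups in partitions.items()}
best = max(counts, key=counts.get)

print(counts)  # {'A: Region': 300, 'B: Manufacturer': 600}
print(best)    # B: Manufacturer
```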
12. Current Tree Partition:
All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric: 130 PCB of 400, ub(Made-Up): 36.4%, NON-CLEARED
Improvement 1: Repeat partition search in
non-cleared groups.
13. Final Tree
All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric
    In Lowlands: 2 PCB of 150, ub(G): 4.2%, 150 CLEARED
    In Midlands: 98 PCB of 100, ub(G): 100%, NON-CLEARED
    In Highlands: 30 PCB of 150, ub(G): 25.8%, NON-CLEARED
Figure label: In Surrey
Improvement 2: Merge all non-cleared
regions then search again.
14. CUT+ Algorithm
Given a set of training objects, G, and a
clearance threshold, t
REPEAT UNTIL no cleared group is
found:
CUT Tree(G, t)
Remove the objects assigned to a cleared
group from G
Three heuristics for building trees:
1. Immediate Clearance
2. Risk Reduction
3. Pure Potential
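The outer loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `one_level_tree` is a hypothetical stand-in for CUT Tree that tries only single-attribute splits, `upper_bound` assumes a one-sided Wilson score interval, and the three heuristics named on the slide are not modelled. The toy data is invented for the example.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def upper_bound(positives, n, confidence=0.99):
    """One-sided Wilson score upper bound (an assumed choice of interval)."""
    z = NormalDist().inv_cdf(confidence)
    p = positives / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + margin) / (1 + z * z / n))


def one_level_tree(objects, t):
    """Hypothetical stand-in for CUT Tree: try each attribute as a single
    split and return the cleared groups of the split clearing the most."""
    best = []
    for attribute in (key for key in objects[0] if key != "pcb"):
        groups = defaultdict(list)
        for obj in objects:
            groups[obj[attribute]].append(obj)
        cleared = [g for g in groups.values()
                   if upper_bound(sum(o["pcb"] for o in g), len(g)) <= t]
        if sum(map(len, cleared)) > sum(map(len, best)):
            best = cleared
    return best


def cut_plus(objects, t, build_tree=one_level_tree):
    """Repeat the tree search, removing cleared objects after each round,
    until a round finds no cleared group."""
    remaining = list(objects)
    cleared_groups = []
    while remaining:
        found = [g for g in build_tree(remaining, t) if g]
        if not found:
            break
        cleared_groups.extend(found)
        cleared_ids = {id(o) for g in found for o in g}
        remaining = [o for o in remaining if id(o) not in cleared_ids]
    return cleared_groups, remaining


def make(maker, region, n, n_pcb):
    """Build n toy objects, the first n_pcb of which contain PCBs."""
    return [{"maker": maker, "region": region, "pcb": int(i < n_pcb)}
            for i in range(n)]


# Invented toy data: only Made-Up transformers in the Highlands are risky.
data = (make("Dynamo", "Lowlands", 300, 0) + make("Dynamo", "Highlands", 300, 0)
        + make("Made-Up", "Lowlands", 300, 0) + make("Made-Up", "Highlands", 100, 90))

groups, rest = cut_plus(data, t=0.05)
print([len(g) for g in groups], len(rest))  # [600, 300] 100
```

The second round clears a group that the first tree could not isolate, which is the point of searching again over the remaining non-cleared objects.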
15. Experiments (1)
Use cross-validation and compare:
3 CUT+ algorithms.
Competitors from other classification
areas.
Problem Set: PCB
identification problems.
17. PCB Experiment (1)
t ranges from 0% to p̂, where p̂ is the
observed rate of PCB cases.
[Figure: Results for PCB50. Two panels plot FPR(t) (0% to 5%) and TPR (0% to 100%) against the clearance threshold t (0% to 1.0p̂) for Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, and Baseline3: MetaCost.]
CUT+ clears more non-PCB transformers.
Paper results show that there are not too
many “over threshold” errors.
18. PCB Experiment (2)
t ranges from 0% to p̂, where p̂ is the
observed rate of PCB cases.
[Figure: Results for PCB50. Two panels plot FPR(t) (0% to 5%) and TPR (0% to 100%) against the clearance threshold t (0% to 1.0p̂) for Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, and Baseline3: MetaCost.]
Competitors have few cleared groups since:
Too few observations to clear group.
Or frequency too high to clear group.
19. More Experiments on UCI Sets:
Pure Potential best algorithm in 22 out
of 25 tests.
Code available at
http://www.cs.sfu.ca/~wangk/software/CUT_classification
20. Acknowledgments
Funding: BC Hydro R&D program and
Canada’s NSERC.
Transformer Image Source:
Wikipedia user Benutzer:Stahlkocher;
License: GFDL.