Suppose that we are interested in classifying n points in a z-dimensional space into two groups, with response 1 and response 0 as the target variable. In some real customer-classification data, it is difficult to discriminate the favorable customers (response 1) from the others because many response 1 and response 0 points are located close together. In such a case, finding the regions where the favorable customers are denser is an alternative. Such regions are called bumps, and finding them is called bump hunting. By pre-specifying a pureness rate p, a maximum capture rate c can be obtained; the pureness rate is the ratio of the number of response 1 points to the total number of points in the target region, and the capture rate is the ratio of the number of response 1 points in the target region to the number of response 1 points in the total region. A trade-off curve between p and c can then be constructed, so bump hunting amounts to constructing the trade-off curve. To make future actions easier, we adopt simple boundary shapes for the bumps, namely unions of z-dimensional boxes parallel to the explanation variable axes; this means that we adopt the binary decision tree. Since the conventional binary decision tree does not provide the maximum capture rates because it is a local optimizer, some probabilistic method is required. Here, we use a genetic algorithm (GA) specialized to the tree structure; we call this the tree GA. Unlike the ordinary GA, the tree GA tends to provide many local maxima of the capture rates. Using this property, we can estimate the upper bound curve for the trade-off curve by extreme-value statistics. However, these curves could be optimistic if they are constructed using the training data alone, so we should be careful in assessing their accuracy.
By applying the test data, the accuracy of the trade-off curve itself can easily be assessed. However, the property of the local maxima would not be preserved. In this paper, we have developed a new tree GA to preserve the property of the local maxima of the capture rates by assessing the test data results in each evolution procedure. Then, the accuracy of the trade-off curve and its upper bound curve are assessed.
Accuracy assessment for the trade off curve and its upper curve in the bump hunting using the new tree genetic algorithm
1. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 0
Accuracy Assessment for the Trade-off Curve
and Its Upper Curve in the Bump Hunting
Using the New Tree Genetic Algorithm
H. Hirose
Department of Systems Design and Informatics
Faculty of Computer Science and Systems Engineering
Kyushu Institute of Technology
Fukuoka, 820-8502 Japan
2. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 1
background and objectives
[Figure: data table with a 0/1 response column (response 1, response 0) and feature variables 1, 2, 3, ..., m-1, m]
Let's consider the two-class classification problem.
Giving 0/1 responses to the classes, we are interested in finding the response 1 points.
Each point has a large number of feature variables (covariates), say 50 or 100.
We want to know how to find as many of the response 1 points as we want.
3. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 2
many classification problems
When information on customer preference is abundant, it is rather easy to classify the favorable customers: boundaries that separate the feature points clearly are easy to find.
classification methods: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM
4. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 3
a real messy customer database
[Figure: some of the 0/1 responses in real data projected onto a 2-dimensional feature variable space (explanation variable A against explanation variable B); red: response 1, blue: response 0]
It seems difficult to discriminate these two responses.
5. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 4
finding denser regions instead of classification
When information on customer preference is abundant, it is rather easy to classify the favorable customers: boundaries that separate the feature points clearly are easy to find.
-> classification: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM
In our case, there are fewer chances to collect customer preferences, so the amount of information is not so large. It is not so easy to classify the favorable customers, and it is difficult to draw boundaries that discriminate response 1 points from response 0 points.
-> finding denser regions of response 1 instead: bump hunting
6. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 5
use of the decision tree in finding denser regions (bumps)
We use the decision tree because rectangular box regions parallel to the axes correspond directly to the if-then rules described by a tree, and such rules are easy to apply to future actions.
For simple data handling, we do not consider data transformations.
7. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 6
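use of the Gini's index in splitting
Impurity of node t with C classes:
\[ i(t) = \sum_{j=1}^{C} p(j \mid t)\{1 - p(j \mid t)\} = 1 - \sum_{j=1}^{C} \{p(j \mid t)\}^2 \]
Impurity after splitting t into the left node t_L and the right node t_R, which receive the proportions p_L and p_R of the points:
\[ \bar{i}(t) = p_L\, i(t_L) + p_R\, i(t_R) \]
Improvement:
\[ \Delta i(t) = i(t) - \bar{i}(t) \]
[Figure: a node holding x_1 + x_2 and y_1 + y_2 points of the two responses, split along the x axis into a left node (x_1, y_1) and a right node (x_2, y_2)]
For this two-class split, up to a positive factor that depends only on the total number of points, the improvement is
\[ \Delta i(t) \propto \frac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)} \]
To split the samples into two subsets, decision trees use some criterion.
For example, CART adopts the Gini's index, and C4.5, 4.6 adopt the entropy.
These two are essentially the same.
Here, we use the Gini's index.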
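As a rough check of the impurity and improvement defined on this slide, the following minimal Python sketch computes both from class counts and compares the improvement with the closed-form expression. The function names and the example counts are illustrative assumptions, not taken from the slides; the check divides the closed form by the total number of points n, which is the constant factor left open above.

```python
def gini(counts):
    """Gini impurity 1 - sum_j p_j^2 for a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_improvement(x1, y1, x2, y2):
    """Impurity decrease when a node is split into left (x1, y1) and right (x2, y2)."""
    n_left, n_right = x1 + y1, x2 + y2
    n = n_left + n_right
    parent = gini([x1 + x2, y1 + y2])
    after = (n_left / n) * gini([x1, y1]) + (n_right / n) * gini([x2, y2])
    return parent - after


def closed_form(x1, y1, x2, y2):
    """2 (x2 y1 - x1 y2)^2 / ((x1 + y1)(x2 + y2)(x1 + x2 + y1 + y2))."""
    n = x1 + y1 + x2 + y2
    return 2 * (x2 * y1 - x1 * y2) ** 2 / ((x1 + y1) * (x2 + y2) * n)


# Hypothetical counts, chosen only for illustration.
x1, y1, x2, y2 = 2, 1, 1, 3
n = x1 + y1 + x2 + y2
delta = gini_improvement(x1, y1, x2, y2)
scaled = closed_form(x1, y1, x2, y2) / n  # closed form divided by n
```

Because n is the same for every candidate split of a given node, maximizing the closed form and maximizing the improvement select the same splitting point.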
8. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 7
decision tree finds the boundaries of the bump
[Figure: one-dimensional example along the x axis with response 1 and response 0 densities; the boundaries of the bump found by the Gini's index separate the purer regions, with (x_1, y_1) points on one side of a split and (x_2, y_2) points on the other]
This is a one-dimensional example in which the Gini's index finds the boundaries of the bump region successfully. This tendency is also preserved in higher-dimensional cases.
The optimal splitting point obtained by the Gini's index is not intended to search for the boundary of the bump region; it is primarily intended to search for purer regions. However, it can also be a good point for the boundary of the bump region.
Gini's index improvement for the split, up to a positive factor that depends only on the total number of points:
\[ \frac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)} \]
9. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 8
Exceptional cases: cases in which no bump can be found by using the Gini's improvement
10. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 9
trade-off between pureness rate & capture rate
define
\[ \text{pureness-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 and 0 in target regions})} \]
\[ \text{capture-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 in total regions})} \]
example: pureness-rate = 7/10 = 0.7, mean pureness-rate = 12/(12 + 15) = 0.44, capture-rate = 7/12 = 0.58
objectives
1. under the condition that the pureness-rate of response 1 is pre-specified, find the bump where the capture-rate of response 1 becomes maximum.
2. obtain the trade-off curve between the pureness-rate and the maximum capture-rate
[Figure: a trade-off curve between the pureness-rate of response 1 (horizontal axis, 0 to 1) and the capture-rate (vertical axis, 0 to 1); the larger the pureness-rate, the smaller the capture-rate]
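The two rates defined on this slide can be reproduced with a few lines of Python. This is only a sketch; the function names are illustrative, and the counts are the ones implied by the slide's example (7 response 1 points among 10 points in the target region, 12 response 1 and 15 response 0 points overall).

```python
def pureness_rate(n1_in_region, n0_in_region):
    """Share of response 1 points among all points inside the region."""
    return n1_in_region / (n1_in_region + n0_in_region)


def capture_rate(n1_in_region, n1_total):
    """Share of all response 1 points that fall inside the region."""
    return n1_in_region / n1_total


# Counts implied by the slide's example.
p_region = pureness_rate(7, 3)    # 7 of the 10 points in the region are response 1
p_mean = pureness_rate(12, 15)    # pureness over the whole data set
c_region = capture_rate(7, 12)    # 7 of the 12 response 1 points are captured
```

A region is only worth reporting when its pureness rate exceeds the mean pureness rate of the whole data set, which is the case in this example.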
11. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 10
pureness-rate, capture-rate and TP, FP, TN, FN
[Figure: scattered 0/1 response points with a bump region; points inside the region are TP or FP, points outside are FN or TN]
confusion matrix
              predicted P   predicted N
actual P      TP            FN
actual N      FP            TN
\[ \text{pureness rate} = \frac{\#TP}{\#TP + \#FP} \quad (\text{bump region}) \]
which is the precision, and
\[ \text{capture rate} = \frac{\#TP}{\#TP + \#FN} \quad (\text{total}) \]
which is the recall.
Recall/Precision curve, Receiver Operating Characteristics
[Figure: capture rate (0% to 100%) against pureness rate; each classifier gives one curve, and the skyline over all classifiers is drawn, with c_m and p_0 marked]
The recall/precision curve corresponds to one classifier of a tree, but the trade-off curve tries to find the supremum point over all the classifiers under the pre-specified pureness-rate.
12. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 11
generate the tree by the probabilistic method
[Figure: binary tree with nodes t_1; t_2, t_3; t_4, t_5, t_6, t_7]
The conventional decision tree finds the optimal feature variable and the optimal splitting point from the top node downward by using the Gini's index or the entropy.
But it will not capture the largest number of response 1 points.
explore the optimal tree by generating the trees by a greedy method
t_i: the explanation variable is selected at random, and the optimal splitting point is found by using the Gini's index
=> probabilistic method
=> genetic algorithm
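The node rule on this slide (random explanation variable, Gini-optimal splitting point) might look as follows in Python. This is a minimal sketch for 0/1 labels; the function names, the midpoint convention for thresholds, and the tiny data set are assumptions made for illustration.

```python
import random


def gini(labels):
    """Two-class Gini impurity 2 p (1 - p) for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) minimizing the Gini index after the split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (k / n) * gini(left) + ((n - k) / n) * gini(right)
        if score < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2.0, score)
    return best


def random_split(rows, labels, rng):
    """Pick an explanation variable at random, then split it at its Gini-optimal point."""
    j = rng.randrange(len(rows[0]))
    threshold, score = best_threshold([r[j] for r in rows], labels)
    return j, threshold, score


# Hypothetical data: two feature variables, four points.
rows = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = [0, 0, 1, 1]
feature, threshold, score = random_split(rows, labels, random.Random(0))
```

Repeating `random_split` at every node with different random draws yields different trees, which is what makes a population for the genetic algorithm possible.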
13. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 12
cross-over in the tree
The genetic algorithm applied to the tree structure is different from the conventional one, where the structure is a one-dimensional line like genes. The splitting point at the root node of a tree has a definitely important meaning.
Crossover is supposed to preserve the good inheritance, so the branches with the root node are used as they are.
[Figure: parent A (branches A, a), parent B (branches B, b), and child Ab]
This is an example of a child Ab, consisting of the left-hand-side branch from parent A and the upper-side tree from parent B.
14. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 13
cross-over in the tree
[Figure: parent A,a and parent B,b produce child Ab, child bA, child Ba, and child aB]
In this manner, we can have 4 children from parents A and B.
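One possible reading of this crossover can be sketched in Python as below. The `Node` type is hypothetical, and so is the rule for which root split a child inherits: here the root split comes from the parent that supplies the child's left branch, which is an assumption for illustration only, since the slides do not spell it out.

```python
from collections import namedtuple

# Hypothetical node type: `split` is (feature index, threshold);
# `left` and `right` are subtrees or leaf labels.
Node = namedtuple("Node", ["split", "left", "right"])


def crossover(parent1, parent2):
    """Return the four children Ab, bA, Ba, aB of two parent trees."""
    A, a = parent1.left, parent1.right  # branches of parent A,a
    B, b = parent2.left, parent2.right  # branches of parent B,b
    # Assumption: the root split follows the parent that gives the left branch.
    return {
        "Ab": Node(parent1.split, A, b),
        "bA": Node(parent2.split, b, A),
        "Ba": Node(parent2.split, B, a),
        "aB": Node(parent1.split, a, B),
    }


# Leaves are plain strings here so the recombination is easy to read.
parent1 = Node((0, 0.5), "A", "a")
parent2 = Node((1, 2.0), "B", "b")
children = crossover(parent1, parent2)
```

Every child combines one branch from each parent, and each branch is reused unchanged, in line with the slide's remark that the branches are used as they are.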
15. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 14
genetic algorithm for the tree evolution
evolution algorithm to the tree structure
1. generate 30 initial trees at random and sort them from larger capture-rates
2. keep the top 10 trees
3. cross over pairs of trees (tree A, tree B) from the top 10; combining the branches of two different parents produces 4 children per pair:
   next generation #1,2,3,4: branches from top 10 and top 1
   next generation #5,6,7,8: branches from top 9 and top 2
   ...
   next generation #17,18,19,20: branches from top 6 and top 5
4. sort the 20 trees of the next generation and select the top 10 again
The evolution procedure is continued to 20 generations.
The best tree from one set of initial seeds (top 1, maximum capture-rate) gives local maximum case 1.
[Figure: capture rate (80 to 140) against evolution generation (0 to 25)]
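The evolution loop on this slide can be sketched generically in Python. The sketch below assumes that each generation consists only of the 20 children (parents are not carried over), which is one reading of the diagram; `evolve`, `toy_crossover`, and the toy individuals (pairs of integers scored by their sum) are illustrative stand-ins for trees and capture-rates.

```python
import random


def evolve(initial, fitness, crossover, generations=20, keep=10):
    """Evolve a population; `crossover(a, b)` must return a list of children."""
    population = list(initial)
    for _ in range(generations):
        top = sorted(population, key=fitness, reverse=True)[:keep]
        population = []
        for i in range(keep // 2):
            # pair rank 1 with rank 10, rank 2 with rank 9, ..., rank 5 with rank 6
            population.extend(crossover(top[i], top[keep - 1 - i]))
    return max(population, key=fitness)


def toy_crossover(a, b):
    """Four children that recombine the components of two parents."""
    return [(a[0], b[1]), (b[1], a[0]), (b[0], a[1]), (a[1], b[0])]


# Toy individuals: pairs of integers, with the sum playing the role of the capture-rate.
rng = random.Random(0)
initial = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(30)]
best = evolve(initial, fitness=sum, crossover=toy_crossover)
```

Because crossover only recombines existing material, a run can settle on a local maximum that depends on the initial seeds, which is why the procedure is repeated from many initial sets.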
16. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 15
genetic algorithm for the tree evolution (2)
[Figure: the same procedure (30 random initial trees, selection of the top trees, evolution over 20 generations, top 1 with maximum capture-rate) repeated for local maximum case 1, case 2, case 3, ..., case 20]
20 cases with different initial seeds are dealt with similarly.
Why?
17. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 16
simulated densities of the feature variables
simulated data mimicked to a real customer data base for simplicity
response 0 / response 1: 800 samples / 200 samples
[Figure: marginal densities in one feature variable; histograms of var01 to var08 and var11 to var18, each on the range -0.2 to 1]
[Figure: marginal densities in two feature variables for the simulated data, with the bump region marked]
18. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 17
local convergence in the GA and estimated return period
From many initial sets of seeds in the genetic algorithm for the decision tree, different capture-rates are obtained.
[Figure: number of captured points for response 1 (80 to 140) against the iteration number of the evolution procedure (0 to 25), several panels for different initial seeds; simulated data, p0 = 0.45; each final point is a local maximum]
[Figure: histogram for 20 observed local maxima (frequency against number of captured points for response 1, 112.5 to 137.5) with the fitted density function]
The fitted distribution function is
\[ F(x) = \exp\left[-\exp\left(-\frac{x - \gamma}{\eta}\right)\right] \]
The return period and its CI are obtained.
[Figure: bootstrap result for the return period, 500 cases (frequency histograms)]
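To make the extreme-value step concrete, here is a minimal Python sketch that fits the distribution F(x) above to a set of local maxima and reads off a high quantile as an upper bound. The moment-based estimator is an assumption (the slides do not state how gamma and eta are estimated), the sample values are invented for illustration, and the bootstrap confidence interval is omitted.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_cdf(x, gamma, eta):
    """F(x) = exp(-exp(-(x - gamma) / eta))."""
    return math.exp(-math.exp(-(x - gamma) / eta))


def fit_gumbel_moments(sample):
    """Method-of-moments estimates of the location gamma and the scale eta."""
    eta = statistics.stdev(sample) * math.sqrt(6.0) / math.pi
    gamma = statistics.mean(sample) - EULER_GAMMA * eta
    return gamma, eta


def gumbel_quantile(q, gamma, eta):
    """Value x with F(x) = q."""
    return gamma - eta * math.log(-math.log(q))


# Hypothetical local maxima (numbers of captured response 1 points), invented for illustration.
maxima = [118, 121, 125, 119, 130, 123, 127, 122, 120, 133]
gamma, eta = fit_gumbel_moments(maxima)
upper = gumbel_quantile(0.998, gamma, eta)
```

The 0.998 level is borrowed from the 99.8% figure quoted later in the slides; any other level could be passed to `gumbel_quantile` in the same way.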
19. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 18
trade-off curve and its upper bound
[Figure: capture rate (0 to 1) against pureness of response 1 (0 to 1) with a specified p0; the usable rules give the trade-off curve, and the upper bound capture-rates estimated by using the extreme-value statistics are drawn with it]
Many local maxima are obtained by the GA; the return period and its CI are obtained by the extreme-value statistics.
That's it?
No. These curves could be optimistic, because we are using only the training data.
20. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 19
10-fold cross-validation (10-fold CV)
To assess the accuracy of the trade-off curve:
original data: folds 1, 2, ..., 10
case 1: training data (90%, folds 1, 2, ..., 9) -> tree GA -> induced rule; test data (10%, fold 10) -> eval.
...
case 10: training data (90%, folds 2, ..., 10) -> tree GA -> induced rule; test data (10%, fold 1) -> eval.
mean of the eval. results -> accuracy
This requires much computing time!
21. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 20
bootstrapped hold-out
Instead of using the cross-validation, the bootstrapped hold-out is used here.
original data: 1, 2, ..., n
case 1: training data (n/2 points: 1_1*, 2_1*, ...) -> tree GA -> induced rule-1; test data (n/2 points: 1_1**, 2_1**, ...) -> eval.
case 2: training data (1_2*, 2_2*, ...) -> induced rule-2; test data (1_2**, 2_2**, ...) -> eval.
...
case b: training data (1_b*, 2_b*, ...) -> induced rule-b; test data (1_b**, 2_b**, ...) -> eval.
mean of the eval. results -> accuracy
The computing cost is reduced.
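The bootstrapped hold-out can be sketched in Python as follows. This sketch assumes that each repetition draws a random half of the data without replacement as training data and uses the remaining half as test data; the slides do not say whether the draw is with replacement. The rule inducer and the evaluation function are hypothetical stand-ins for the tree GA and the capture/pureness evaluation.

```python
import random


def bootstrapped_holdout(data, b, induce_rule, evaluate, seed=0):
    """Average the test evaluation over b random half/half splits of the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(b):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        rule = induce_rule(train)            # rule induced from the training half only
        scores.append(evaluate(rule, test))  # evaluated on the held-out half
    return sum(scores) / len(scores)


def induce_threshold(train):
    """Toy rule: smallest x among the response 1 points of the training half."""
    ones = [x for x, lab in train if lab == 1]
    return min(ones) if ones else float("inf")


def pureness_above(threshold, test):
    """Toy evaluation: pureness rate of the region x >= threshold in the test half."""
    region = [lab for x, lab in test if x >= threshold]
    return sum(region) / len(region) if region else 0.0


# Hypothetical data: x = 0..99, response 1 when x >= 60.
data = [(i, 1 if i >= 60 else 0) for i in range(100)]
accuracy = bootstrapped_holdout(data, b=10, induce_rule=induce_threshold,
                                evaluate=pureness_above)
```

Compared with 10-fold cross-validation, the number of repetitions b is a free choice here, so the cost of rerunning the tree GA can be controlled directly.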
22. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 21
tree genetic algorithm (BHO)
old: we have been using the training data only in the GA tree procedure
[Figure: 30 random initial trees -> select top 10 by applying the training data -> evolution over 20 generations -> top 1 with maximum capture-rate]
new: we divide the data into 3 parts: training data, evaluation data, and test data
[Figure: 30 random initial trees -> select top 10 by applying the evaluation data -> evolution over 20 generations -> top 1 with maximum capture-rate -> accuracy assessment with the test data]
At each evolution generation stage, we produce the trees using the training data and select the best trees using the evaluation data. Then, we can expect that the final-stage results are local maxima for the evaluation data, and we may apply the extreme-value statistics to these final results.
Then, we apply the final rule to the test data.
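The data flow of the new tree GA (produce with the training data, select with the evaluation data, assess with the test data) can be illustrated with a toy Python sketch. Everything below is an assumption made for illustration: a "tree" is reduced to a one-dimensional box (lo, hi), `grow` stands in for fitting splitting points on the training data, and the data are simulated with a dense response 1 region between 0.4 and 0.6.

```python
import random


def grow(box, train):
    """Snap both box edges to the nearest training x value (stand-in for fitting on training data)."""
    xs = [x for x, _ in train]
    lo = min(xs, key=lambda v: abs(v - box[0]))
    hi = min(xs, key=lambda v: abs(v - box[1]))
    return (lo, hi)


def capture(box, data, p0=0.45):
    """Number of response 1 points in the box, or 0 if its pureness rate is below p0."""
    inside = [lab for x, lab in data if box[0] <= x <= box[1]]
    ones = sum(inside)
    return ones if inside and ones / len(inside) >= p0 else 0


def crossover(a, b):
    """Four children that recombine the edges of two parent boxes."""
    return [(a[0], b[1]), (b[0], a[1]),
            (min(a[0], b[0]), max(a[1], b[1])),
            (max(a[0], b[0]), min(a[1], b[1]))]


def new_tree_ga(initial, train, evaluation, test, generations=20, keep=10):
    population = [grow(t, train) for t in initial]
    for _ in range(generations):
        # selection uses the evaluation data, not the training data
        top = sorted(population, key=lambda t: capture(t, evaluation), reverse=True)[:keep]
        children = []
        for i in range(keep // 2):
            children.extend(crossover(top[i], top[keep - 1 - i]))
        population = [grow(c, train) for c in children]
    best = max(population, key=lambda t: capture(t, evaluation))
    return best, capture(best, evaluation), capture(best, test)


rng = random.Random(1)


def sample(n):
    """Simulated (x, response) points with a denser response 1 region in [0.4, 0.6]."""
    pts = []
    for _ in range(n):
        x = rng.random()
        p1 = 0.8 if 0.4 <= x <= 0.6 else 0.1
        pts.append((x, 1 if rng.random() < p1 else 0))
    return pts


train, evaluation, test = sample(100), sample(100), sample(100)
initial = [tuple(sorted((rng.random(), rng.random()))) for _ in range(30)]
best, eval_capture, test_capture = new_tree_ga(initial, train, evaluation, test)
```

The test data never influence which box is selected, so `test_capture` can serve as an accuracy assessment of the final rule.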
23. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 22
evolution in the tree GA and the return period (BHO, real data)
[Figure: evolution of the capture-rate for case 1, ..., case 20, with the top trees selected by using the evaluation data]
The capture-rate converges to a final value within 10 generations, both in the training data and in the evaluation data.
[Figure: extreme-value density using the parameters estimated from the 20 final best capture rates]
24. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 23
similarity between the evaluation and the test (BHO, real data)
200 initial cases, pre-specified pureness rate = 45%
[Figure: histograms of the observed capture rates (0.04 to 0.16) for the evaluation data result and the test data result, each with a Gumbel distribution fit]
Gumbel distribution for maxima:
\[ f(x) = \frac{1}{\eta} \exp\left(\frac{\gamma - x}{\eta}\right) \exp\left[-\exp\left(\frac{\gamma - x}{\eta}\right)\right] \]
[Figure: observed relation between the evaluation data result (eval., 0.04 to 0.16) and the test data result (test, 0.04 to 0.16)]
We may estimate the upper bound of the trade-off curve by using the test data results.
25. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 24
[Figure: capture rate (0 to 1) against pureness of response 1 (0 to 1) with a specified p0, showing four curves: rules using only the training data; maximum capture-rates estimated by using extreme-value statistics with the training data; accurate trade-off curve using the test data; maximum capture-rates estimated by using extreme-value statistics with the test data]
26. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 25
actual trade-off curve and its upper bound (real data)
[Figure: capture rate (0 to 0.4) against pureness of response 1 (0.2 to 0.8) at pureness rates 40%, 45%, 50%, 60%, and 70%; 10 cases of the best 1s from 20 local maxima by the new tree-GA with the test data and their mean, together with the 99.8% return period for the 10 cases and its mean]
The upper bound for the trade-off curve can be estimated with extreme-value statistics by using the new tree-GA with the test data.
27. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 26
conclusions
1. In finding the denser region for response 1 points having a large number of feature variables, we have proposed to use the bump hunting method.
2. To evaluate the bump hunting method, we have shown that the trade-off curve is useful.
3. To construct the trade-off curve, we have used the tree genetic algorithm and the extreme-value statistics.
4. We have shown that the trade-off curve using the training data could be optimistic.
5. For the use of the test data with less computing cost, we have proposed the bootstrapped hold-out method instead of cross-validation.
6. To estimate the accurate upper bound trade-off curve, we have developed the new tree-GA by using three sets of sampled data: training, evaluation, and test data.
7. The evaluation data results follow the extreme-value statistics, and using the similarity between the evaluation data results and the test data results, we can estimate the accurate trade-off curve.
28. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 27
end
thank you
Accuracy assessment for the trade-off curve
and its upper bound curve in the bump hunting
using the new tree genetic algorithm
29. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 28
literature related to the bump hunting
T. Yukizane, S. Ohi, E. Miyano, H. Hirose, The bump hunting method using the
genetic algorithm with the extreme-value statistics, IEICE Transactions D: on
Information and Systems, Vol.E89-D, No.8, pp.2332-2339 (2006.8)
H. Hirose : Optimal boundary finding method for the bumpy regions, IFORS2005,
July 11-15, (2005).
H. Hirose, T. Yukizane, E. Miyano: Boundary detection for bumps using the Gini's
index in messy classification problems, CITSA 2006, pp.293-298 (2006)
H. Hirose, T. Yukizane, and T. Deguchi, The bump hunting method and its accuracy using the
genetic algorithm with application to real customer data, CIT2007, pp.128-132, October
16-19, (2007)
H. Hirose, The bump hunting using the decision tree combined with the genetic algorithm: extreme-value
statistics aspect, ICMLDA2007, pp.713-717, October 24-26, (2007)
Hirose, H.: A method to discriminate the minor groups from the major groups. Hawaii Int. Conf.
Statistics, Mathematics, and Related Fields, (2005).
Hirose, H., Ohi, S. and Yukizane, T.: Assessment of the prediction accuracy in the bump hunting
procedure. Hawaii International Conference on Statistics, Mathematics, and Related Fields,
(2007).
Hirose, Yukizane, T.: The accuracy of the trade-off curve in the bump hunting. Hawaii International
Conference on Statistics, Mathematics, and Related Fields, (2008).
30. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 29
literature related to the bump hunting
Friedman, J. H. and Fisher, N. I., “Bump hunting in high dimensional data”, Statistics and Computing, 9, 123-143 (1999)
Gray, J.B. and Fan, G., “Target: Tree analysis with randomly generated and evolved trees”, Technical report, The University of Alabama, (2003)
Kohavi, R. (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, IJCAI (International Joint Conference on Artificial Intelligence).
Hastie, T., Tibshirani, R. and Friedman, J.H.: Elements of Statistical Learning. Springer (2001).