Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA by H. Hirose et. al. 0
Accuracy Assessment for the Trade-off Curve and Its Upper Curve in the Bump Hunting Using the New Tree Genetic Algorithm

H. Hirose
Department of Systems Design and Informatics
Faculty of Computer Science and Systems Engineering
Kyushu Institute of Technology
Fukuoka, 820-8502 Japan

background and objectives

[Figure: data points with response 1 and response 0, each described by feature variables 1, 2, 3, ..., m-1, m]

Let's consider the two-class classification problem. Giving a 0/1 response to each class, we are interested in finding the response 1 points.

Each point has a large number of feature variables (covariates), say 50 or 100.

We want to know how to capture as many of the response 1 points as we want.

many classification problems

In cases where the information on customers' preferences is abundant, it is rather easy to classify the favorable customers: it is easy to find boundaries that separate the feature points clearly.

classification methods:
linear discriminant analysis
nearest neighbors
logistic regression
neural networks
SVM

a real messy customer database

[Figure: some of the 0/1 responses in real data projected onto a 2-dimensional feature variable space (explanation variable A versus explanation variable B); red: response 1, blue: response 0]

It seems difficult to discriminate these two responses.

finding denser regions instead of classification

In cases where the information on customers' preferences is abundant, it is rather easy to classify the favorable customers: it is easy to find boundaries that separate the feature points clearly. This is the setting for classification (linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM).

In our case, there are fewer chances to collect customers' preferences, and the amount of information is not so large. It seems not so easy to classify the favorable customers: it is difficult to draw boundaries that discriminate response 1 points from response 0 points. We find denser regions of response 1 instead (bump hunting).

use of the decision tree

We use the decision tree in finding denser regions (bumps), because rectangular box regions parallel to the axes correspond directly to the if-then rules described by a tree, and such rules are easy to apply to future actions.

For simple data handling, we do not consider data transformation.

use of the Gini's index in splitting

To split the samples into two subsets, decision trees use some criterion. For example, CART adopts the Gini's index, and C4.5, 4.6 adopt the entropy. These two are essentially the same. Here, we use the Gini's index.

Impurity of a node t with C classes:

$$ i(t) = \sum_{j=1}^{C} p(j \mid t)\,\{1 - p(j \mid t)\} = 1 - \sum_{j=1}^{C} \{p(j \mid t)\}^{2} $$

Improvement by splitting t into a left node t_L and a right node t_R, which receive the proportions p_L and p_R of the samples:

$$ \Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\} $$

[Figure: a node with counts (x_1 + x_2, y_1 + y_2) split at a point on the x axis into a left node with counts (x_1, y_1) and a right node with counts (x_2, y_2)]

For two classes, with x_k and y_k the numbers of points of the two responses in child node k:

$$ \Delta i(t) = \frac{2\,(x_2 y_1 - x_1 y_2)^{2}}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^{2}} $$
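The Gini improvement for a two-class split can be checked numerically. The following is a minimal Python sketch, assuming a node described only by the four counts x1, y1, x2, y2 of the figure; the function names are illustrative and are not taken from the slides.

```python
def gini(counts):
    """Gini impurity 1 - sum_j p_j^2 for a list of class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_improvement(x1, y1, x2, y2):
    """Improvement i(t) - [pL * i(tL) + pR * i(tR)] computed from the definition."""
    n_left, n_right = x1 + y1, x2 + y2
    n = n_left + n_right
    parent = gini([x1 + x2, y1 + y2])
    after = (n_left / n) * gini([x1, y1]) + (n_right / n) * gini([x2, y2])
    return parent - after


def gini_improvement_closed_form(x1, y1, x2, y2):
    """Two-class closed form: 2 (x2*y1 - x1*y2)^2 / (nL * nR * n^2)."""
    n_left, n_right = x1 + y1, x2 + y2
    n = n_left + n_right
    return 2.0 * (x2 * y1 - x1 * y2) ** 2 / (n_left * n_right * n ** 2)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to compare the two computations.
    print(gini_improvement(7, 3, 5, 12))
    print(gini_improvement_closed_form(7, 3, 5, 12))
```

Both functions should agree up to rounding; a perfectly separating split of one point per class gives an improvement of 0.5.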

decision tree finds the boundaries of the bump

[Figure: one-dimensional example along the x axis with response 1 and response 0 points; the boundaries of the bump found by the Gini's index split a node with counts (x_1 + x_2, y_1 + y_2) into a left node with counts (x_1, y_1) and a right node with counts (x_2, y_2), each purer than the parent]

This is a one-dimensional example in which the Gini's index finds the boundaries of the bump region successfully. This tendency is also preserved in higher-dimensional cases.

The optimal splitting point by the Gini's index is not intended to search for the boundary of the bump region; it is primarily intended to search for purer regions. However, the splitting point by the Gini's index can also be a good point for the boundary of the bump region.

Gini's index (improvement by the split):

$$ \Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\} = \frac{2\,(x_2 y_1 - x_1 y_2)^{2}}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^{2}} $$

exceptional cases

Cases in which we cannot find any bumps using the Gini's improvement

trade-off between pureness rate & capture rate

define

$$ \text{pureness-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 and 0 in target regions})} $$

$$ \text{capture-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 in total regions})} $$

Example: pureness-rate = 7/10 = 0.7, mean pureness-rate = 12/(12 + 15) = 0.44, capture-rate = 7/12 = 0.58.

objectives

1. Under the condition that the pureness-rate of response 1 is pre-specified, find the bump where the capture-rate of response 1 becomes maximum.

2. Obtain the trade-off curve between the pureness-rate and the maximum capture-rate.

[Figure: a trade-off curve between the pureness-rate of response 1 (0 to 1) and the capture-rate (0 to 1): the larger the pureness-rate, the smaller the capture-rate]
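The two rates can be computed directly from the 0/1 responses and a flag saying whether each point lies in the target region. This is a minimal Python sketch under that assumption; the function name and the toy data (arranged to reproduce the counts 7, 10, 12 and 15 of the example) are illustrative.

```python
def bump_rates(labels, in_target):
    """Return (pureness-rate, capture-rate) for 0/1 labels and target-region flags."""
    n1_in = sum(1 for y, t in zip(labels, in_target) if t and y == 1)
    n_in = sum(1 for t in in_target if t)
    n1_total = sum(1 for y in labels if y == 1)
    pureness = n1_in / n_in if n_in else 0.0
    capture = n1_in / n1_total if n1_total else 0.0
    return pureness, capture


if __name__ == "__main__":
    # 10 points in the target region (7 of response 1), 17 points outside (5 of response 1).
    labels = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 12
    in_target = [True] * 10 + [False] * 17
    pureness, capture = bump_rates(labels, in_target)
    mean_pureness = sum(labels) / len(labels)
    print(round(pureness, 2), round(capture, 2), round(mean_pureness, 2))
```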

pureness-rate, capture-rate and TP, FP, TN, FN

[Figure: scattered 0/1 response points with a rectangular bump region; TP and FP lie inside the bump region, FN and TN outside it]

confusion matrix

            predicted P   predicted N
actual P    TP            FN
actual N    FP            TN

$$ \text{pureness rate} = \frac{\#TP}{\#TP + \#FP} \quad \text{(within the bump region; precision)} $$

$$ \text{capture rate} = \frac{\#TP}{\#TP + \#FN} \quad \text{(over the total region; recall)} $$

[Figure: Recall/Precision curve (cf. Receiver Operating Characteristics): capture rate versus pureness rate, both from 0% to 100%, with one curve for each classifier and the skyline over all of them; p0 and cm are marked]

The recall/precision curve corresponds to one classifier (one tree), but the trade-off curve tries to find the supremum point over all the classifiers under the pre-specified pureness-rate.
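The skyline idea can be sketched as follows: for each pre-specified pureness-rate p0, take the largest capture-rate among all classifiers whose pureness-rate reaches p0. This Python sketch assumes each classifier has already been summarized as a (pureness, capture) pair and reads "under the pre-specified pureness-rate" as "pureness at least p0"; both the names and the sample values are hypothetical.

```python
def best_capture(classifiers, p0):
    """Largest capture-rate among (pureness, capture) pairs with pureness >= p0."""
    feasible = [capture for pureness, capture in classifiers if pureness >= p0]
    return max(feasible) if feasible else None


def trade_off_curve(classifiers, p0_grid):
    """Skyline: the best capture-rate for each pre-specified pureness-rate."""
    return [(p0, best_capture(classifiers, p0)) for p0 in p0_grid]


if __name__ == "__main__":
    # Hypothetical (pureness, capture) pairs for four trees.
    trees = [(0.9, 0.1), (0.7, 0.4), (0.5, 0.6), (0.6, 0.3)]
    for p0, cap in trade_off_curve(trees, [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]):
        print(p0, cap)
```

By construction the resulting curve is non-increasing in p0, which matches the trade-off described above.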

generate the tree by the probabilistic method

[Figure: binary tree with nodes t_1; t_2, t_3; t_4, t_5, t_6, t_7]

The conventional decision tree finds the optimal feature variable and the optimal splitting point from the top node downward by using the Gini's index or the entropy. But it will not capture the largest number of response 1 points.

explore the optimal tree by generating the trees by a greedy method

t_i: the explanation variable is selected at random; the optimal splitting point is found by using the Gini's index
=> probabilistic method
=> genetic algorithm
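One node of such a randomly generated tree might be built as in the Python sketch below: the feature variable is drawn at random, and the splitting point on that variable is the one that minimizes the weighted Gini impurity of the two parts. The data layout (rows as tuples, 0/1 labels), the midpoint thresholds and the function names are assumptions made for illustration.

```python
import random


def gini(labels):
    """Two-class Gini impurity 2 p (1 - p) for a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_point(values, labels):
    """Threshold on one variable minimizing the weighted Gini impurity of both sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_thr, best_score = None, float("inf")
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (k / n) * gini(left) + ((n - k) / n) * gini(right)
        if score < best_score:
            best_score = score
            best_thr = (pairs[k - 1][0] + pairs[k][0]) / 2.0
    return best_thr


def random_split(rows, labels, rng):
    """Pick a feature variable at random, then its Gini-optimal splitting point."""
    j = rng.randrange(len(rows[0]))
    return j, best_split_point([row[j] for row in rows], labels)


if __name__ == "__main__":
    rows = [(0.1, 5.0), (0.2, 1.0), (0.3, 4.0), (0.8, 2.0), (0.9, 3.0)]
    labels = [0, 0, 0, 1, 1]
    print(random_split(rows, labels, random.Random(0)))
```

Minimizing the weighted impurity after the split is equivalent to maximizing the Gini improvement, since the parent impurity is fixed.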

cross-over in the tree

The genetic algorithm applied to the tree structure is different from the conventional one, where the structure is a one-dimensional line like genes. The splitting point at the root node in a tree has a definitely important meaning.

[Figure: parent A with branches A and a, parent B with branches B and b, and child Ab]

This is an example of a child Ab, consisting of the left-hand-side branch from parent A and the upper side tree from parent B.

The branches with the root node are used as they are. Crossover is supposed to preserve the good inheritance.

cross-over in the tree

[Figure: parent A,a and parent B,b with their children Ab, bA, Ba and aB]

In this manner, we can have 4 children from parents A and B.

genetic algorithm for the tree evolution

evolution algorithm to the tree structure:

30 initial trees are generated at random and sorted from larger capture-rates; the top 10 are selected.

cross over: the branches of two trees from different parents (tree A, tree B) are combined, producing 4 children
    branches from top 1 and branches from top 10: next generation #1,2,3,4
    branches from top 2 and branches from top 9: next generation #5,6,7,8
    …
    branches from top 5 and branches from top 6: next generation #17,18,19,20

The evolution procedure is continued to 20 generations.

The best tree from one set of initial seeds (top 1, cap. max) gives local maximum case 1.

[Figure: capture rate versus evolution generation]
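The pairing of the top 10 trees can be written down compactly. The Python sketch below assumes that the pairs not shown on the slide follow the same pattern as the ones that are shown (rank i is paired with rank 11 - i), which is an inference; the function names are illustrative.

```python
def pairing_schedule(n_top=10):
    """Pairs of ranks (i, n_top + 1 - i): (1, 10), (2, 9), ..., (5, 6)."""
    return [(i, n_top + 1 - i) for i in range(1, n_top // 2 + 1)]


def next_generation_slots(n_top=10, children_per_pair=4):
    """Map each parent pair to the indices of the children it produces."""
    slots = {}
    for k, pair in enumerate(pairing_schedule(n_top)):
        first = k * children_per_pair + 1
        slots[pair] = list(range(first, first + children_per_pair))
    return slots


if __name__ == "__main__":
    for pair, children in next_generation_slots().items():
        print(pair, children)
```

With 5 pairs and 4 children per pair, each generation holds 20 trees, in line with the numbering #1 to #20 on the slide.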

genetic algorithm for the tree evolution (2)

[Figure: the same evolution (30 random initial trees, top selection, 20 generations, top 1 with cap. max) repeated to give local maximum case 1, case 2, case 3, ..., case 20]

Why?

20 cases with different initial seeds are dealt with similarly.

simulated densities of the feature variables

simulated data mimicked to a real customer data base for simplicity

[Figure: marginal densities in one feature variable; histogram panels var01 to var08 and var11 to var18; legend: response 0, response 1; 800 samples, 200 samples]

[Figure: marginal densities in two feature variables, with the bump region marked]

local convergence in the GA and estimated return period

simulated data, p0 = 0.45

[Figure: number of captured points for response 1 versus iteration number of the evolution procedure, for several initial sets of seeds; each final point is a local maximum]

From many initial sets of seeds in the genetic algorithm for the decision tree, different capture-rates are obtained.

[Figure: histogram for 20 observed local maxima (frequency versus number of captured points for response 1) with the fitted density function]

$$ F(x) = \exp\left[-\exp\left(-\frac{x - \gamma}{\eta}\right)\right] $$

[Figure: bootstrap result (500 cases): histograms (frequency) of the return period]

The return period and its CI are obtained.
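To make the extreme-value step concrete, the Python sketch below fits the Gumbel distribution F(x) = exp[-exp(-(x - γ)/η)] to a set of local maxima and evaluates a high quantile of it. The slides do not say how the parameters were estimated; the method of moments used here is an assumption chosen for brevity, and the sample values are made up.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(xs):
    """Method-of-moments estimates (gamma, eta) of the Gumbel distribution for maxima."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    eta = math.sqrt(6.0 * var) / math.pi      # variance = pi^2 eta^2 / 6
    gamma = mean - EULER_GAMMA * eta          # mean = gamma + 0.5772 eta
    return gamma, eta


def gumbel_cdf(x, gamma, eta):
    """F(x) = exp[-exp(-(x - gamma) / eta)]."""
    return math.exp(-math.exp(-(x - gamma) / eta))


def gumbel_quantile(p, gamma, eta):
    """Value x with F(x) = p, for 0 < p < 1."""
    return gamma - eta * math.log(-math.log(p))


if __name__ == "__main__":
    # Hypothetical numbers of captured points from 20 GA runs.
    maxima = [118, 121, 119, 125, 123, 130, 117, 122, 126, 120,
              124, 128, 119, 121, 127, 133, 122, 118, 125, 123]
    gamma, eta = fit_gumbel_moments(maxima)
    print(gamma, eta, gumbel_quantile(0.998, gamma, eta))
```

A confidence interval could then be obtained by refitting on resampled maxima, which is one way to read the bootstrap result on the slide.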

trade-off curve and its upper bound

[Figure: capture rate versus pureness of response 1 (both from 0 to 1); at a specified p0, many local maxima are obtained by the GA (usable rules), and the upper bound capture-rates are estimated by using the extreme-value statistics (return period and its CI)]

That's it?

No. These curves could be optimistic, because we are using only the training data.

10 fold cross-validation

To assess the accuracy of the trade-off curve, 10-fold CV:

[Figure: the original data are divided into 10 parts (1, 2, ..., 10); in each of case 1 to case 10, 90% is used as the training data to induce the rule by the tree GA (30 random initial trees, top selection, 20 generations, top 1 with cap. max) and the remaining 10% as the test data to evaluate the induced rule; the mean of the evaluations gives the accuracy]

Computing time!

bootstrapped hold-out

Instead of using the cross-validation, the bootstrapped hold-out is used here. The computing cost is reduced.

[Figure: from the original data (1, 2, ..., n), training data (1_1*, 2_1*, ..., size n/2) and test data (1_1**, 2_1**, ..., size n/2) are drawn; rule-1 is induced from the training data by the tree GA and evaluated on the test data; the same is done for rule-2 (1_2*, 2_2*, ...; 1_2**, 2_2**, ...) up to rule-b (1_b*, 2_b*, ...; 1_b**, 2_b**, ...); the mean of the evaluations gives the accuracy]
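A possible reading of the bootstrapped hold-out is sketched below in Python: the data are repeatedly divided at random into a training half and a test half, a rule is induced on the first and evaluated on the second, and the evaluations are averaged. Whether the halves are drawn with or without replacement is not stated on the slide; this sketch draws disjoint halves, and `induce_rule` and `evaluate` are hypothetical callables standing in for the tree GA and the capture-rate evaluation.

```python
import random


def bootstrapped_hold_out(data, induce_rule, evaluate, b=10, seed=0):
    """Mean evaluation over b random half/half splits of the data."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(b):
        idx = list(range(n))
        rng.shuffle(idx)
        train = [data[i] for i in idx[: n // 2]]   # training half
        test = [data[i] for i in idx[n // 2:]]     # held-out half
        rule = induce_rule(train)
        scores.append(evaluate(rule, test))
    return sum(scores) / b


if __name__ == "__main__":
    # Toy stand-ins: the "rule" is the training mean, the score its distance to the test mean.
    data = list(range(100))
    score = bootstrapped_hold_out(
        data,
        induce_rule=lambda train: sum(train) / len(train),
        evaluate=lambda rule, test: abs(rule - sum(test) / len(test)),
        b=5,
    )
    print(score)
```

The number of repetitions b can be chosen independently of the split sizes, whereas 10-fold cross-validation always requires 10 inductions.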

tree genetic algorithm (BHO)

old: we have been using the training data only in the GA tree procedure. At each generation, the top 10 trees are selected by applying the training data.

new: we divide the data into 3 parts: training data, evaluation data and test data. At each generation, the top 10 trees are selected by applying the evaluation data.

At each evolution generation stage, we produce the trees using the training data, and select the best trees using the evaluation data. Then, we can expect that the final stage results could be the local maxima for the evaluation data, and we may apply the extreme-value statistics to these final results.

Then, we apply the final rule to the test data for the accuracy assessment.
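The role of the three data sets can be sketched as a generic selection loop in Python. This is an illustration under stated assumptions, not the procedure of the slides: `breed` (producing new trees from the selected ones with the training data) and `capture` (the capture-rate of a tree on a data set) are hypothetical callables, and keeping the selected parents next to their children is an added assumption.

```python
def evolve_with_evaluation(initial, breed, capture, train, evaluation, test,
                           n_top=10, n_generations=20):
    """Select by evaluation data at every generation; score the final tree on test data."""
    population = list(initial)
    for _ in range(n_generations):
        ranked = sorted(population, key=lambda tree: capture(tree, evaluation),
                        reverse=True)
        top = ranked[:n_top]
        # New trees are produced with the training data only.
        population = top + breed(top, train)
    best = max(population, key=lambda tree: capture(tree, evaluation))
    return best, capture(best, evaluation), capture(best, test)


if __name__ == "__main__":
    # Toy stand-ins: a "tree" is an integer, its score the closeness to a target number.
    result = evolve_with_evaluation(
        initial=range(0, 30, 3),
        breed=lambda top, train: [t + 1 for t in top] + [t - 1 for t in top],
        capture=lambda tree, data: -abs(tree - data),
        train=None, evaluation=50, test=48,
    )
    print(result)
```

The test data never influence the selection, so the last returned value is an assessment that is not biased by the search.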

evolution in the tree GA and the return period

BHO, real data

[Figure: capture-rate versus evolution generation using the evaluation data, for case 1 to case 20 of the tree GA]

The capture-rate converges to a final value within 10 generations, both in the training data and in the evaluation data.

[Figure: extreme-value density using the estimated parameters, obtained from the 20 final best capture rates]

similarity between the evaluation and the test

BHO, real data, 200 initial cases, pre-specified pureness rate = 45%

Gumbel distribution for maxima:

$$ f(x) = \frac{1}{\eta}\exp\left(\frac{\gamma - x}{\eta}\right)\exp\left[-\exp\left(\frac{\gamma - x}{\eta}\right)\right] $$

[Figure: observed histograms of the evaluation data result and of the test data result, each with a Gumbel distribution fit]

[Figure: relation between the test results and the eval. results]

We may estimate the upper bound of the trade-off curve by using the test data results.

[Figure: capture rate versus pureness of response 1 (both from 0 to 1) at a specified p0, showing: rules using only the training data; maximum capture-rates estimated by using extreme-value statistics with the training data; accurate trade-off curve using the test data; maximum capture-rates estimated by using extreme-value statistics with the test data]

actual trade-off curve and its upper bound

real data

[Figure: capture rate versus pureness of response 1 for pre-specified pureness rates of 40%, 45%, 50%, 60% and 70%; for each, 10 cases of the best 1s from 20 local maxima by the new tree-GA with the test data and their mean, and 10 cases of the 99.8% return period and their mean]

The upper bound for the trade-off curve can be estimated by extreme-value statistics, using the new tree-GA with the test data.

conclusions

1. To find the denser regions for response 1 points having a large number of feature variables, we have proposed to use the bump hunting method.

2. To evaluate the bump hunting method, we have shown that the trade-off curve is useful.

3. To construct the trade-off curve, we have used the tree genetic algorithm and the extreme-value statistics.

4. We have shown that the trade-off curve using the training data could be optimistic.

5. For the use of the test data with less computing cost, we have proposed the bootstrapped hold-out method instead of cross-validation.

6. To estimate the accurate upper bound trade-off curve, we have developed the new tree-GA by using three sets of sampled data: training, evaluation and test data.

7. The evaluation data results follow the extreme-value statistics, and using the similarity between the evaluation data results and the test data results, we can estimate the accurate trade-off curve.

end

thank you

Accuracy assessment for the trade-off curve and its upper bound curve in the bump hunting using the new tree genetic algorithm

literature related to the bump hunting

T. Yukizane, S. Ohi, E. Miyano, H. Hirose, The bump hunting method using the genetic algorithm with the extreme-value statistics, IEICE Transactions D: on Information and Systems, Vol.E89-D, No.8, pp.2332-2339 (2006.8)

H. Hirose, Optimal boundary finding method for the bumpy regions, IFORS2005, July 11-15, (2005)

H. Hirose, T. Yukizane, E. Miyano, Boundary detection for bumps using the Gini's index in messy classification problems, CITSA 2006, pp.293-298 (2006)

H. Hirose, T. Yukizane, and T. Deguchi, The bump hunting method and its accuracy using the genetic algorithm with application to real customer data, CIT2007, pp.128-132, October 16-19, (2007)

H. Hirose, The bump hunting using the decision tree combined with the genetic algorithm: extreme-value statistics aspect, ICMLDA2007, pp.713-717, October 24-26, (2007)

H. Hirose, A method to discriminate the minor groups from the major groups, Hawaii International Conference on Statistics, Mathematics, and Related Fields, (2005)

H. Hirose, S. Ohi and T. Yukizane, Assessment of the prediction accuracy in the bump hunting procedure, Hawaii International Conference on Statistics, Mathematics, and Related Fields, (2007)

H. Hirose, T. Yukizane, The accuracy of the trade-off curve in the bump hunting, Hawaii International Conference on Statistics, Mathematics, and Related Fields, (2008)

J. H. Friedman and N. I. Fisher, Bump hunting in high dimensional data, Statistics and Computing, 9, 123-143 (1999)

J. B. Gray and G. Fan, Target: Tree analysis with randomly generated and evolved trees, Technical report, The University of Alabama, (2003)

R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, IJCAI (International Joint Conference on Artificial Intelligence), (1995)

T. Hastie, R. Tibshirani and J. H. Friedman, Elements of Statistical Learning, Springer (2001)

 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
 
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors Data
 
Grateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfGrateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdf
 

Accuracy assessment for the trade off curve and its upper curve in the bump hunting using the new tree genetic algorithm

  • 1. Accuracy assessment for the trade-off curve in the bump hunting by the new tree GA, by H. Hirose et al. Accuracy Assessment for the Trade-off Curve and Its Upper Curve in the Bump Hunting Using the New Tree Genetic Algorithm. H. Hirose, Department of Systems Design and Informatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Fukuoka, 820-8502 Japan.
  • 2. background and objectives: Let's consider the two-class classification problem. Giving 0/1 responses to each class, we are interested in finding the response 1 points. Each point has a large number of feature variables (covariates), say 50 or 100. We want to know how to search for as many response 1 points as we want. [Figure: response 1 and response 0 points, each described by feature variables 1, 2, 3, ..., m-1, m.]
  • 3. many classification problems: in cases where the information on the customers' preference is abundant, it is rather easy to classify the favorable customers, and easy to find the boundaries that classify the feature points clearly. Classification methods: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM.
  • 4. real data: some of the 0/1 responses in a real, messy customer database, projected onto a 2-dimensional feature variable space (red: response 1; blue: response 0). It seems difficult to discriminate these two responses. [Figure: scatter plot of response 1 and response 0 points; axes: explanation variable A, explanation variable B.]
  • 5. finding denser regions instead of classification: in cases where the information on the customers' preference is abundant, it is rather easy to classify the favorable customers and easy to find the boundaries that classify the feature points clearly (classification: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM). In our case there are fewer chances to collect the customers' preference, so the amount of information is not so large; it is not so easy to classify the favorable customers, and it is difficult to draw the boundaries that discriminate response 1 from response 0 points. Instead, we find denser regions of response 1: bump hunting.
  • 6. use of the decision tree in finding denser regions (bumps): we use the decision tree because, if we consider rectangular box regions parallel to the axes, each region corresponds directly to an if-then rule described by a tree, and such rules are easy to apply to future actions. For simple data handling, we do not consider data transformation.
  • 7. use of the Gini's index in splitting: to split the samples into two subsets, decision trees use some criterion. For example, CART adopts the Gini's index, and C4.5, 4.6 adopt the entropy. These two are essentially the same. Here, we use the Gini's index.
      Impurity of node t: $i(t) = \sum_{j=1}^{C} p(j|t)\{1 - p(j|t)\} = 1 - \sum_{j=1}^{C} \{p(j|t)\}^2$.
      Impurity after splitting t into a left node $t_L$ and a right node $t_R$, with sample proportions $p_L$ and $p_R$: $p_L\, i(t_L) + p_R\, i(t_R)$.
      Improvement: $\Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\}$.
      For two classes, where the parent node holds $x_1 + x_2$ points of one response and $y_1 + y_2$ of the other, the left node holds $(x_1, y_1)$ and the right node holds $(x_2, y_2)$: $\Delta i(t) = \dfrac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^2}$.
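The Gini impurity and the improvement of a split can be checked numerically. The following is a minimal sketch, not the author's code; the function names are illustrative, and the two-class closed form is the one that follows from the impurity definitions on this slide (total count squared in the denominator).

```python
def gini(counts):
    """Gini impurity 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_improvement(x1, y1, x2, y2):
    """Impurity decrease when a node splits into left (x1, y1) and right (x2, y2)."""
    n_left, n_right = x1 + y1, x2 + y2
    n = n_left + n_right
    parent = gini([x1 + x2, y1 + y2])
    children = (n_left / n) * gini([x1, y1]) + (n_right / n) * gini([x2, y2])
    return parent - children


def two_class_closed_form(x1, y1, x2, y2):
    """Closed form of the same improvement for two classes."""
    n_left, n_right = x1 + y1, x2 + y2
    n = n_left + n_right
    return 2 * (x2 * y1 - x1 * y2) ** 2 / (n_left * n_right * n ** 2)


if __name__ == "__main__":
    # Both routes give the same value for an arbitrary split.
    print(gini_improvement(2, 1, 1, 3), two_class_closed_form(2, 1, 1, 3))
```

A split that leaves both children with the same class proportions as the parent gives zero improvement, which is why the numerator vanishes when $x_2 y_1 = x_1 y_2$.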
  • 8. decision tree finds the boundaries of the bump: this is a one-dimensional example in which the Gini's index finds the boundaries of the bump region successfully. This tendency is also preserved in higher-dimensional cases. The optimal splitting point by the Gini's index is not intended to search for the boundary of the bump region; it is primarily intended to search for purer regions. However, the splitting point by the Gini's index can also be a good point for the boundary of the bump region. For a split of a node with counts $(x_1 + x_2,\ y_1 + y_2)$ into $(x_1, y_1)$ and $(x_2, y_2)$, the improvement is $\Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\} = \dfrac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^2}$. [Figure: response 1 and response 0 densities along x, with the boundaries of the bump found by the Gini's index and the purer regions on either side.]
  • 9. exceptional cases: cases in which we cannot find any bumps using the Gini's improvement.
  • 10. trade-off between pureness rate & capture rate.
      Define: pureness-rate = #(response 1 in target regions) / #(response 1 & 0 in target regions); capture-rate = #(response 1 in target regions) / #(response 1 in total regions).
      Example: pureness-rate = 7/10 = 0.7; mean pureness-rate = 12/(12+15) = 0.44; capture-rate = 7/12 = 0.58.
      Objectives: 1. under the condition that the pureness-rate of response 1 is pre-specified, find the bump where the capture-rate of response 1 becomes maximum; 2. obtain the trade-off curve between the pureness-rate and the maximum capture-rate.
      The larger the pureness-rate, the smaller the capture-rate. [Figure: a trade-off curve between the pureness-rate of response 1 (0 to 1) and the capture-rate (0 to 1).]
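The two rates can be computed for any axis-parallel box. The sketch below is illustrative only: the box representation (a dict from feature index to a closed interval) and the function names are assumptions, and the toy data are constructed to reproduce the counts of the slide's example (7 response 1 points among 10 points in the region, 12 response 1 points and 15 response 0 points overall).

```python
def in_box(point, box):
    """True if every constrained feature of `point` lies inside `box`.

    `box` maps a feature index to a closed interval (lo, hi); this
    representation is an assumption made for the sketch.
    """
    return all(lo <= point[j] <= hi for j, (lo, hi) in box.items())


def rates(points, responses, box):
    """Return (pureness-rate, capture-rate) of response 1 for the box."""
    ones_in = sum(1 for p, r in zip(points, responses) if r == 1 and in_box(p, box))
    all_in = sum(1 for p in points if in_box(p, box))
    ones_total = sum(1 for r in responses if r == 1)
    pureness = ones_in / all_in if all_in else 0.0
    capture = ones_in / ones_total if ones_total else 0.0
    return pureness, capture


if __name__ == "__main__":
    # Toy one-dimensional data: 10 points inside the box, 17 outside.
    points = [(0.5,)] * 10 + [(2.0,)] * 17
    responses = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 12
    pureness, capture = rates(points, responses, {0: (0.0, 1.0)})
    print(round(pureness, 2), round(capture, 2))
```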
  • 11. pureness-rate, capture-rate and TP, FP, TN, FN.
      Confusion matrix (actual versus predicted): actual P is split into TP (predicted P) and FN (predicted N); actual N is split into FP (predicted P) and TN (predicted N). Points predicted P are those inside the bump region.
      pureness rate $= \dfrac{\#TP}{\#TP + \#FP}$ (precision, within the bump region); capture rate $= \dfrac{\#TP}{\#TP + \#FN}$ (recall, over the total region).
      The recall/precision curve corresponds to one classifier of a tree, but the trade-off curve tries to find the supremum point over all the classifiers under the pre-specified pureness-rate.
      [Figure: recall/precision curve and receiver operating characteristic; capture rate (0% to 100%) versus pureness rate, showing the curve of each classifier, their skyline, and the point $c_m$ at the pre-specified pureness-rate $p_0$.]
  • 12. generate the tree by the probabilistic method: the conventional decision tree finds the optimal feature variable and the optimal splitting point from the top node downward by using the Gini's index or the entropy. But it will not capture the largest number of response 1 points. Instead of generating a single tree by the greedy method, we explore the optimal tree by generating many trees: at each node $t_i$ ($t_1, t_2, \ldots, t_7$ in the figure) the explanation variable is selected at random, and the optimal splitting point is found by using the Gini's index. This gives the probabilistic method, which leads to the genetic algorithm.
  • 13. cross-over in the tree: the genetic algorithm applied to the tree structure is different from the conventional one, where the structure is a one-dimensional line like genes. The splitting point at the root node in a tree has a definitely important meaning, and crossover is supposed to preserve the good inheritance, so the branches with the root node are used as they are. [Figure: parent A (branches A, a) and parent B (branches B, b) produce child Ab, an example of a child consisting of the left-hand-side branch from parent A and the upper side tree from parent B.]
  • 14. cross-over in the tree: in this manner, we can have 4 children from parents A and B. [Figure: parent A,a and parent B,b produce children Ab, bA, Ba and aB.]
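The four-children crossover can be sketched on a simple binary tree. The slides do not fully specify which root each child inherits, so the code below takes one possible reading as a loud assumption: a capital letter denotes a parent's left branch, a lowercase letter its right branch, and the first-named part brings its parent's root split with it. The `Node` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A split node: go left if feature `var` <= `threshold`, else right."""
    var: int
    threshold: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def crossover(a, b):
    """Return the four children Ab, aB, Ba, bA of parents `a` and `b`.

    Assumed reading of the slides: the root split is kept with one branch
    of the same parent, and the other branch comes from the other parent.
    """
    return {
        "Ab": Node(a.var, a.threshold, a.left, b.right),  # root and left from A
        "aB": Node(a.var, a.threshold, b.left, a.right),  # root and right from A
        "Ba": Node(b.var, b.threshold, b.left, a.right),  # root and left from B
        "bA": Node(b.var, b.threshold, a.left, b.right),  # root and right from B
    }


if __name__ == "__main__":
    parent_a = Node(0, 1.0, Node(1, 2.0), Node(2, 3.0))
    parent_b = Node(3, 4.0, Node(4, 5.0), Node(5, 6.0))
    for name, child in crossover(parent_a, parent_b).items():
        print(name, child.var, child.left.var, child.right.var)
```

Because `Node` is immutable, the children can share subtrees with their parents without copying.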
  • 15. genetic algorithm for the tree evolution: the evolution algorithm applied to the tree structure. Start from 30 random initial trees, sorted from the larger capture-rates, and keep the top 10. Cross over two trees from different parents, each pair producing 4 children: next generation #1,2,3,4 use branches from top 1 and top 10; #5,6,7,8 use branches from top 2 and top 9; ...; #17,18,19,20 use branches from top 5 and top 6. The top 10 of these 20 trees are selected again, and the evolution procedure is continued to 20 generations. The best tree (top 1, capture-rate maximum) from one set of initial seeds is a local maximum (case 1). [Figure: capture rate versus evolution generation.]
  • 16. genetic algorithm for the tree evolution (2): 20 cases with different initial seeds are dealt with similarly, giving local maximum case 1, case 2, case 3, ..., case 20. Why? [Figure: for each case, random 30 trees, top 20, evolution, top 1 (capture-rate maximum).]
  • 17. simulated data: for simplicity, simulated data mimicking a real customer database are used; response 0: 800 samples, response 1: 200 samples. [Figures: simulated marginal densities in one feature variable (var01 to var08 and var11 to var18); marginal densities in two feature variables, with the bump regions numbered 1 to 8.]
  • 18. local convergence in the GA and estimated return period (simulated data, $p_0 = 0.45$): from many initial sets of seeds in the genetic algorithm for the decision tree, different capture-rates are obtained; each final point is a local maximum. A density function is fitted to the frequency histogram of the 20 observed local maxima, using the distribution function $F(x) = \exp\left[-\exp\left(-\dfrac{x - \gamma}{\eta}\right)\right]$. The return period and its CI are obtained from the bootstrap result (500 cases). [Figures: number of captured points for response 1 versus iteration number of the evolution procedure, for several initial sets of seeds; frequency histogram of the 20 observed local maxima with the fitted density function; bootstrap histogram of the return period.]
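A Gumbel fit to a set of local maxima, and the quantile used as an upper bound, can be sketched as follows. The slides do not state how $\gamma$ and $\eta$ are estimated; this sketch swaps in the method of moments as a simple stand-in, and the sample of 20 maxima is made up for illustration only.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(maxima):
    """Method-of-moments estimates (gamma, eta) of the Gumbel distribution for maxima.

    Uses mean = gamma + EULER_GAMMA * eta and variance = (pi * eta)^2 / 6.
    This estimator is an assumption of the sketch, not taken from the slides.
    """
    mean = statistics.fmean(maxima)
    sd = statistics.stdev(maxima)
    eta = sd * math.sqrt(6) / math.pi
    gamma = mean - EULER_GAMMA * eta
    return gamma, eta


def gumbel_cdf(x, gamma, eta):
    """F(x) = exp[-exp(-(x - gamma) / eta)]."""
    return math.exp(-math.exp(-(x - gamma) / eta))


def gumbel_quantile(p, gamma, eta):
    """Inverse of the distribution function: the level exceeded with probability 1 - p."""
    return gamma - eta * math.log(-math.log(p))


if __name__ == "__main__":
    # Hypothetical numbers of captured response 1 points from 20 GA runs.
    maxima = [118, 121, 125, 119, 130, 123, 127, 122, 120, 126,
              124, 129, 117, 128, 131, 122, 125, 121, 133, 124]
    gamma, eta = fit_gumbel_moments(maxima)
    print(gamma, eta, gumbel_quantile(0.998, gamma, eta))
```

The 0.998 quantile corresponds to a level expected to be exceeded once in 500 runs, since 1/(1 - 0.998) = 500.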
  • 19. trade-off curve and its upper bound: many local maxima are obtained by the GA, and the return period and its CI are obtained by extreme-value statistics. That's it? No. These curves could be optimistic, because we are using only the training data. [Figure: capture rate (0 to 1) versus pureness of response 1 (0 to 1) with the specified $p_0$, showing the usable rules and the upper-bound capture-rates estimated by using the extreme-value statistics.]
  • 20. 10-fold CV: to assess the accuracy of the trade-off curve, 10-fold cross-validation can be used. The original data (parts 1, 2, ..., 10) are divided so that, in each fold, 90% are training data and the remaining 10% are test data; the rule is induced from the training data by the tree GA (cases 1 to 10) and evaluated on the test data, and the mean of the evaluations gives the accuracy. This takes much computing time! [Figure: for each fold, random 30 trees, top 20, evolution, top 1 (capture-rate maximum), rules, evaluation.]
  • 21. bootstrapped hold-out: instead of using cross-validation, the bootstrapped hold-out is used here, so that the computing cost is reduced. From the original data (1, 2, ..., n), training data ($1_1^*, 2_1^*, \ldots$; size n/2) and test data ($1_1^{**}, 2_1^{**}, \ldots$; size n/2) are drawn; rule-1 is induced from the training data by the tree GA and evaluated on the test data. This is repeated for rule-2 ($1_2^*, 2_2^*, \ldots$ and $1_2^{**}, 2_2^{**}, \ldots$) up to rule-b ($1_b^*, 2_b^*, \ldots$ and $1_b^{**}, 2_b^{**}, \ldots$), and the mean of the evaluations gives the accuracy. [Figure: for each case, random 30 trees, top 20, evolution, top 1 (capture-rate maximum), rules, evaluation.]
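The repeated half-and-half evaluation can be sketched as below. The slides do not say exactly how each training half is drawn; this sketch assumes that every repetition takes a random half of the indices as training data and the remaining half as test data. The callables `induce_rule` and `evaluate` are hypothetical placeholders for the tree GA and the capture-rate evaluation.

```python
import random


def bootstrapped_holdout_splits(n, b, seed=0):
    """Yield b (train, test) index splits, each a random half/half partition of range(n).

    Drawing the halves without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    half = n // 2
    for _ in range(b):
        idx = list(range(n))
        rng.shuffle(idx)
        yield sorted(idx[:half]), sorted(idx[half:])


def mean_holdout_score(n, b, induce_rule, evaluate, seed=0):
    """Induce a rule on each training half, evaluate it on the test half, and average."""
    scores = []
    for train, test in bootstrapped_holdout_splits(n, b, seed):
        rule = induce_rule(train)
        scores.append(evaluate(rule, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for train, test in bootstrapped_holdout_splits(10, 3, seed=1):
        print(train, test)
```

Compared with 10-fold cross-validation, the number of repetitions b is a free choice, which is where the reduction in computing cost can come from.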
  • 22. tree genetic algorithm (BHO): in the old procedure, we have been using only the training data in the GA tree procedure: from 30 random trees, the top 10 are selected by applying the training data, and the evolution continues to the top 1 (capture-rate maximum). In the new procedure, we divide the data into 3 parts: training data, evaluation data and test data. At each evolution generation stage, we produce the trees using the training data and select the best trees (top 10) using the evaluation data. Then, we can expect that the final stage results are the local maxima for the evaluation data, and we may apply the extreme-value statistics to these final results. Finally, we apply the final rule to the test data for the accuracy assessment.
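The generation loop with selection on the evaluation data can be sketched generically. This is a sketch under stated assumptions, not the author's implementation: `crossover` and `score_on_eval` are hypothetical callables (a crossover returning the children of two trees, and the capture-rate of a tree on the evaluation data), the pairing of rank k with rank 11 - k follows slide 15, and keeping track of the best tree seen so far is an added safeguard.

```python
def evolve(initial_trees, crossover, score_on_eval, generations=20, keep=10):
    """Evolve a population of trees, selecting parents by their evaluation-data score.

    Each generation keeps the `keep` best trees and crosses rank k with
    rank keep + 1 - k (top 1 with top 10, top 2 with top 9, ...).
    """
    population = list(initial_trees)
    best = max(population, key=score_on_eval)
    for _ in range(generations):
        ranked = sorted(population, key=score_on_eval, reverse=True)[:keep]
        population = []
        for k in range(keep // 2):
            population.extend(crossover(ranked[k], ranked[keep - 1 - k]))
        candidate = max(population, key=score_on_eval)
        if score_on_eval(candidate) > score_on_eval(best):
            best = candidate
    return best


if __name__ == "__main__":
    # Toy stand-in: "trees" are numbers and the score is the number itself.
    toy_crossover = lambda a, b: [a + 1, b + 1, a, b]
    print(evolve(range(30), toy_crossover, lambda t: t, generations=3))
```

With 30 initial trees, 10 kept parents and 4 children per pair, each generation holds 20 trees, matching the counts on the slides. The returned tree would then be applied once to the test data.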
  • 23. evolution in the tree GA and the return period (BHO, real data): using the evaluation data, the capture-rate converges to a final value within 10 generations, both in the training data and in the evaluation data. Using the 20 final best capture rates (cases 1 to 20), the extreme-value density is obtained with the estimated parameters. [Figures: capture-rate versus evolution generation; extreme-value density using the estimated parameters; tree GA diagram for each case: random 30 trees, top trees selected by the evaluation data, evolution, top 1 (capture-rate maximum).]
  • 24. similarity between the evaluation and the test (BHO, real data, 200 initial cases, pre-specified pureness rate = 45%): Gumbel distribution for maxima, $f(x) = \dfrac{1}{\eta} \exp\left(\dfrac{\gamma - x}{\eta}\right) \exp\left(-\exp\left(\dfrac{\gamma - x}{\eta}\right)\right)$. We may estimate the upper bound of the trade-off curve by using the test data results. [Figures: Gumbel distribution fits to the observed evaluation data result and to the observed test data result; observed relation between the evaluation (eval.) and test capture rates.]
  • 25. accurate trade-off curve using the test data. [Figure: capture rate (0 to 1) versus pureness of response 1 (0 to 1) with the specified $p_0$; curves shown: rules using only the training data; maximum capture-rates estimated by using extreme-value statistics with the training data; rules using the training data; maximum capture-rates estimated by using extreme-value statistics with the test data.]
  • 26. actual trade-off curve and its upper bound (real data): the upper bound for the trade-off curve can be estimated with extreme-value statistics by using the new tree-GA with the test data. [Figure: capture rate versus pureness of response 1 (40%, 45%, 50%, 60%, 70%); for each pureness, 10 cases of the best 1s from 20 local maxima by the new tree-GA with the test data, their mean, and the 99.8% return period.]
  • 27. conclusions
      1. In finding the denser region for response 1 points having a large number of feature variables, we have proposed to use the bump hunting method.
      2. To evaluate the bump hunting method, we have shown that the trade-off curve is useful.
      3. To construct the trade-off curve, we have used the tree genetic algorithm and the extreme-value statistics.
      4. We have shown that the trade-off curve using the training data could be optimistic.
      5. For the use of the test data with less computing cost, we have proposed the bootstrapped hold-out method instead of cross-validation.
      6. To estimate the accurate upper bound of the trade-off curve, we have developed the new tree-GA by using three sets of sampled data: training, evaluation and test data.
      7. The evaluation data results follow the extreme-value statistics, and using the similarity between the evaluation data results and the test data results, we can estimate the accurate trade-off curve.
  • 28. end: thank you. Accuracy assessment for the trade-off curve and its upper bound curve in the bump hunting using the new tree genetic algorithm.
  • 29. literature related to the bump hunting
      T. Yukizane, S. Ohi, E. Miyano, H. Hirose: The bump hunting method using the genetic algorithm with the extreme-value statistics, IEICE Transactions on Information and Systems, Vol.E89-D, No.8, pp.2332-2339 (2006.8)
      H. Hirose: Optimal boundary finding method for the bumpy regions, IFORS2005, July 11-15 (2005)
      H. Hirose, T. Yukizane, E. Miyano: Boundary detection for bumps using the Gini's index in messy classification problems, CITSA 2006, pp.293-298 (2006)
      H. Hirose, T. Yukizane, T. Deguchi: The bump hunting method and its accuracy using the genetic algorithm with application to real customer data, CIT2007, pp.128-132, October 16-19 (2007)
      H. Hirose: The bump hunting using the decision tree combined with the genetic algorithm: extreme-value statistics aspect, ICMLDA2007, pp.713-717, October 24-26 (2007)
      H. Hirose: A method to discriminate the minor groups from the major groups, Hawaii International Conference on Statistics, Mathematics, and Related Fields (2005)
      H. Hirose, S. Ohi, T. Yukizane: Assessment of the prediction accuracy in the bump hunting procedure, Hawaii International Conference on Statistics, Mathematics, and Related Fields (2007)
      H. Hirose, T. Yukizane: The accuracy of the trade-off curve in the bump hunting, Hawaii International Conference on Statistics, Mathematics, and Related Fields (2008)
  • 30. literature related to the bump hunting
      J. H. Friedman, N. I. Fisher: Bump hunting in high dimensional data, Statistics and Computing, 9, pp.123-143 (1999)
      J. B. Gray, G. Fan: Target: Tree analysis with randomly generated and evolved trees, Technical report, The University of Alabama (2003)
      R. Kohavi: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, IJCAI (International Joint Conference on Artificial Intelligence) (1995)
      T. Hastie, R. Tibshirani, J. H. Friedman: Elements of Statistical Learning, Springer (2001)