Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines accuracy and generalization to rank models
* Strategically search over the model training parameters, e.g., with a gradient descent approach
* Describe an arbitrarily complex predictive system using sensitivity analysis
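The combined accuracy-and-generalization ranking above can be sketched in a few lines; the metric, the penalty weight, and the "model notebook" entries here are all hypothetical:

```python
# Rank candidate models by a combined metric: penalize the gap between
# train and validation accuracy (a proxy for poor generalization).
def combined_score(train_acc, valid_acc, penalty=1.0):
    """Higher is better: validation accuracy minus an overfit penalty."""
    return valid_acc - penalty * max(0.0, train_acc - valid_acc)

# A hypothetical "model notebook": one entry per training run.
notebook = [
    {"params": {"depth": 3},  "train_acc": 0.84, "valid_acc": 0.83},
    {"params": {"depth": 10}, "train_acc": 0.99, "valid_acc": 0.80},
    {"params": {"depth": 6},  "train_acc": 0.88, "valid_acc": 0.86},
]

ranked = sorted(notebook,
                key=lambda m: combined_score(m["train_acc"], m["valid_acc"]),
                reverse=True)
print(ranked[0]["params"])  # the depth-6 model wins on the combined metric
```

The depth-10 model has the best training accuracy but the worst combined score, which is exactly the overfitting case the notebook is meant to catch.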
Tales from an IP worker in consulting and software (Greg Makowski)
Discussion around intellectual property and leveraging consulting projects to build vertical application software. In my use case, data mining, artificial intelligence and intelligence augmentation are part of the value add. Also discussed: software frameworks, open source software, and clauses on prior inventions in hiring contracts.
Data Workflows for Machine Learning - Seattle DAML (Paco Nathan)
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
If you are curious what ML is all about, this is a gentle introduction to Machine Learning and Deep Learning. It covers questions such as why ML, Data Analytics, and Deep Learning matter, gives an intuitive understanding of how they work, and looks at some models in detail. Finally, I share some useful resources to get started.
Explainable AI - making ML and DL models more interpretable (Aditya Bhattacharya)
Abstract –
Although industries have started to adopt AI and Machine Learning in almost every sector to solve complex business problems, are these models always trustworthy? Machine Learning models are not oracles; they are scientific methods and mathematical models that best describe the data. But science is all about explaining complex natural phenomena in the simplest way possible! So, can we make ML and DL models more interpretable, so that any business user can understand these models and trust their results?
To find out the answer, please join me in this session, in which I will talk about the concepts of Explainable AI and discuss the necessity and principles that help us demystify black-box AI models. I will cover popular approaches such as Feature Importance, Key Influencers, and Decomposition Trees used to make classical Machine Learning models interpretable. We will discuss various techniques used for Deep Learning model interpretation, such as Saliency Maps, Grad-CAMs, and Visual Attention Maps, and finally go through frameworks like LIME, SHAP, ELI5, SKATER, and TCAV, which help us make Machine Learning and Deep Learning models more interpretable, trustworthy, and useful!
Data Science, Machine Learning and Neural Networks (BICA Labs)
A lecture briefly overviewing the state of the art in Data Science, Machine Learning and Neural Networks. It covers the main Artificial Intelligence technologies, Data Science algorithms, neural network architectures, and the cloud computing facilities enabling the whole stack.
Innovations in technology have revolutionized financial services to such an extent that large financial institutions like Goldman Sachs are claiming to be technology companies! It is no secret that technological innovations like Data Science and AI are fundamentally changing how financial products are created, tested and delivered. While it is exciting to learn about the technologies themselves, there is very little guidance available on how companies and financial professionals should retool and gear themselves for the upcoming revolution.
In this master class, we will discuss key innovations in Data Science and AI and connect applications of these novel fields in forecasting and optimization. Through case studies and examples, we will demonstrate why now is the time you should invest to learn about the topics that will reshape the financial services industry of the future!
Topic
- Frontier topics in Optimization
Data Science in the Real World: Making a Difference (Srinath Perera)
We use the terms “Big Data” and “Data Science” for the use of data processing to make sense of the world around us. Spanning many fields, Big Data brings together technologies like Distributed Systems, Machine Learning, Statistics, and the Internet of Things. It is a multi-billion-dollar industry, with use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart Health, and Smart Agriculture.
These use cases rely on basic analytics, advanced statistical methods, and predictive technologies like Machine Learning. However, it is not just about crunching the data. Some use cases, like urban planning, can be slow, and there is enough time to process the data. With use cases like traffic, patient monitoring, and surveillance, however, the value of the results degrades much faster with time, and results are needed within milliseconds to seconds. Collecting data from many sources, cleaning it up, processing it using computation clusters, and doing all of this fast is a major challenge.
This talk will discuss the motivation behind big data and data science and how they can make a difference. It will then discuss the challenges, systems, and methodologies for implementing and sustaining a data science pipeline.
Machine Learning and Real-World Applications (MachinePulse)
This presentation was created by Ajay, Machine Learning Scientist at MachinePulse, to present at a Meetup on Jan. 30, 2015. These slides provide an overview of widely used machine learning algorithms. The slides conclude with examples of real world applications.
Ajay Ramaseshan is a Machine Learning Scientist at MachinePulse. He holds a Bachelor's degree in Computer Science from NITK Surathkal and a Master's in Machine Learning and Data Mining from Aalto University School of Science, Finland. He has extensive experience in the machine learning domain and has dealt with various real-world problems.
Credit scoring has been used to categorize customers based on various characteristics to evaluate their credit worthiness. Increasingly, machine learning techniques are being deployed for customer segmentation, classification and scoring.
In this talk, we will discuss various machine learning techniques that can be used for credit risk applications. Through a case study built in R, we will illustrate the nuances of working with practical datasets that include categorical and numerical data, different techniques for evaluating and exploring customer profiles, visualizing high-dimensional datasets, and machine learning techniques for customer segmentation.
This talk should be of interest to practicing quants and data scientists who are interested in applying machine learning techniques for credit risk and scoring applications.
Valencian Summer School 2015
Day 2
Lecture 11
The Future of Machine Learning
José David Martín-Guerrero (IDAL, UV)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Scaling AI in production using PyTorch (geetachauhan)
Slides from my talk at MLOps World '21.
Deploying AI models in production and scaling ML services is still a big challenge. In this talk we will cover how to deploy your AI models, best practices for deployment scenarios, and techniques for performance optimization and scaling of ML services. Join us to learn how you can jumpstart the journey of taking your PyTorch models from research to production.
Production model lifecycle management 2016 09 (Greg Makowski)
This talk covers the various stages of building data mining models, putting them into production, and eventually replacing them. A common theme throughout is three attributes of predictive models: accuracy, generalization and description. I assert you can have it all, and having all three is important for managing the lifecycle. A subtle point is that this is a step toward developing embedded, automated data mining systems which can determine on their own when they need to be updated.
SFbayACM ACM Data Science Camp 2015 10 24 (Greg Makowski)
This is the slide deck for the 7th annual ACM Data Science Camp. It is an unconference, with content generated by the audience. For the primary event site, see http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
This presentation is a summary of section 2 (of 6) of the book "The 360º Leader" by best-selling author John C Maxwell. Challenges and solutions include:
* Tension (the pressure of being caught in the middle),
* Frustration (following an ineffective leader),
* Multi-Hat (one person – demands and expectations from all quarters),
* Ego (being hidden in the middle),
* Fulfillment (stuck in the middle, when you would rather be in front),
* Vision (how to champion it when you did not create it),
* Influence (influencing others whom you do not manage).
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-... (Greg Makowski)
This talk covers 4 configurations of deep learning to solve different types of application needs. Also, strategies for speed up and real-time scoring are discussed.
LeanUX (lean user experience) experimentation has mostly focused on "A/B" testing. This presentation reviews how full and half factorial design of experiments might be used in Lean User Experience design.
This presentation covers material from John Maxwell's book, "The 360 Degree Leader." Specifically, the first of six sections is presented, including "The 7 Myths of Leading from the Middle of an Organization" and "5 Levels of Leadership Development."
Using Deep Learning to do Real-Time Scoring in Practical Applications (Greg Makowski)
http://www.meetup.com/SF-Bay-ACM/events/227480571/
(see also YouTube for a recording of the presentation)
The talk will cover a brief review of neural network basics and the following types of neural network deep learning:
* autocorrelational - unsupervised learning for extracting features. He will describe how additional layers build complexity in the feature extraction.
* convolutional - how to detect shift-invariant patterns in various data sources. Horizontal shift invariance applies to signals like speech recognition or IoT data; horizontal and vertical shift invariance applies to images or videos, for faces or self-driving cars
* discuss details of applying deep net systems for continuous or real time scoring
* reinforcement learning or Q Learning - such as learning how to play Atari video games
* continuous space word models - such as word2vec, skipgram training, NLP understanding and translation
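The shift-invariance idea behind the convolutional bullet above can be seen in a toy 1-D example; the kernel and signals below are invented for illustration:

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (really cross-correlation, as in CNNs)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

pattern = [1.0, -1.0]          # a simple edge-detecting kernel
signal  = [0, 0, 5, 5, 0, 0]   # a pulse
shifted = [0, 5, 5, 0, 0, 0]   # the same pulse, shifted left by one

print(conv1d(signal, pattern))
print(conv1d(shifted, pattern))  # same responses, just shifted: shift invariance
```

Shifting the input shifts the filter's responses by the same amount, which is why one learned kernel can detect a pattern wherever it occurs in a signal or image.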
Application of Design of Experiments (DOE) using Dr. Taguchi - Orthogonal Array... (Karthikeyan Kannappan)
The Taguchi method involves reducing the variation in a process through robust design of experiments. The experimental design proposed by Taguchi uses orthogonal arrays to organize the parameters affecting the process and the levels at which they should be varied. Instead of testing all possible combinations, as in a factorial design, the Taguchi method tests pairs of combinations. The Taguchi arrays can be derived or looked up: small arrays can be drawn out manually, large arrays can be derived from deterministic algorithms, and generally arrays can be found online. The arrays are selected by the number of parameters (variables) and the number of levels (states).
In this paper, the specific steps involved in the application of the Taguchi method are described with examples.
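As a rough illustration of why orthogonal arrays need so few runs, here is the classic L4 array for three two-level factors, with a check of the pairwise-balance property (a sketch of the idea, not the full Taguchi procedure):

```python
from itertools import combinations

# The L4 orthogonal array: 4 runs cover 3 two-level factors such that
# every PAIR of factors sees each of the 4 level combinations exactly once,
# versus 2**3 = 8 runs for a full factorial design.
L4 = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

def is_orthogonal(array):
    """Check that every pair of two-level columns contains each level pair once."""
    n_cols = len(array[0])
    for i, j in combinations(range(n_cols), 2):
        pairs = {(row[i], row[j]) for row in array}
        if len(pairs) != 4:  # must see (0,0), (0,1), (1,0), (1,1)
            return False
    return True

print(is_orthogonal(L4))  # True
```

This pairwise balance is what lets main effects be estimated from far fewer runs than the full factorial.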
Three case studies deploying cluster analysis (Greg Makowski)
Three case studies are discussed, that include cluster analysis as a component.
1) Customer description for a credit card attrition model, to describe how to talk to customers.
2) Hotel price optimization. Use clusters to find subsets of similar behavior, and optimize prices within each cluster. Use a neural net as the objective function.
3) Retail supply chain, planning replenishment using 52 week demand curves using thousands of seasonal "profiles" or clusters.
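The clustering step shared by all three case studies can be sketched with plain k-means on made-up 2-d "behavior" points (the data and starting centroids are hypothetical):

```python
def kmeans(points, centroids, iters=10):
    """Plain k-means on 2-d points, starting from the given centroids."""
    k = len(centroids)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                      + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids

low  = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9)]   # e.g. low weekly demand
high = [(8.0, 8.2), (7.9, 8.1), (8.2, 7.8)]   # e.g. high weekly demand
centroids = kmeans(low + high, centroids=[low[0], high[0]])
```

Once subsets of similar behavior are found, a per-cluster model (a price optimizer, a seasonal profile) can be fit within each one, as the case studies describe.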
Document management for engineering design
Register of design objects
Experience automating companies that manage the design and construction of complex facilities or the development of territories
Powering Realtime Decision Engines in Finance and Healthcare using Open Sour... (Greg Makowski)
http://www.kdd.org/kdd2015/industry-gov-talks.html
Financial services and healthcare companies could be the biggest beneficiaries of big data. Their realtime decision engines can be vastly improved by leveraging the latest advances in big data analytics. However, these companies are challenged in leveraging open source software (OSS). This presentation covers how, in collaboration with financial services and healthcare institutions, we built an OSS project to deliver a realtime decisioning engine for their respective applications. I will address two key issues. First, I will describe the strategy behind our hiring process to attract millennial big data developers and the results of this endeavor. Second, I will recount the collaboration effort that we had with our large clients and the various milestones we achieved during that process. I will explain the goals regarding big data analysis that our large clients presented to us and how we accomplished those goals. In particular, I will discuss how we leveraged open source to deliver a realtime decisioning software product called Kamanja to these institutions. An advantage of developing applications in Kamanja is that it is already integrated with Hadoop, Kafka for realtime data streaming, and HBase and Cassandra for NoSQL data storage. I will talk about how these companies benefited from Kamanja and some of the challenges we had in the design of this software. I will provide quantifiable improvements in key metrics driven by Kamanja and interesting, unsolved problems/challenges that need to be addressed for faster and wider adoption of OSS by these companies.
Kamanja: Driving Business Value through Real-Time Decisioning Solutions (Greg Makowski)
This is a first presentation of Kamanja, a new open-source real-time software product, which integrates with other big-data systems. See also links: http://www.meetup.com/SF-Bay-ACM/events/223615901/ and http://Kamanja.org to download, for docs or community support. For the YouTube video, see https://www.youtube.com/watch?v=g9d87rvcSNk (you may want to start at minute 33).
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Predicting Moscow Real Estate Prices with Azure Machine Learning (Leo Salemann)
With only three months' instruction, a five-person team uses Azure Machine Learning Studio to predict Moscow real estate prices based on property descriptors, macroeconomic indicators, and geospatial data.
Generalized linear models (GLMs) and gradient boosting machines (GBMs) are two of the most widely used supervised learning approaches in all of commercial data science. GLMs have been the go-to predictive and inferential modeling tool for decades, but important mathematical and computational advances have been made in training GLMs in recent years. This talk will contrast H2O’s implementation of penalized GLM techniques with ordinary least squares and give specific hints for building regularized and accurate GLMs for both predictive and inferential purposes. As more organizations begin experimenting with and embracing algorithms from the machine learning tradition, GBMs have come to prominence due to their predictive accuracy, the ability to train on real-world data, and resistance to overfitting training data. This talk will give some background on the GBM approach, some insight into the H2O implementation, and some tips for tuning and interpreting GBMs in H2O.
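The effect of the penalty term in a regularized GLM can be seen in the simplest possible case: one feature, no intercept. This is a toy sketch of the regularization idea, not H2O's implementation; the data are invented:

```python
def ridge_1d(xs, ys, lam):
    """Penalized least squares for one feature with no intercept:
    w = sum(x*y) / (sum(x*x) + lambda).  lambda = 0 recovers OLS."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
print(ridge_1d(xs, ys, 0.0))   # OLS slope, about 2.04
print(ridge_1d(xs, ys, 10.0))  # the penalty shrinks the slope toward zero
```

The shrinkage toward zero is what trades a little bias for lower variance, which is the core reason penalized GLMs resist overfitting compared with ordinary least squares.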
Patrick's Bio:
Patrick Hall is a senior data scientist and product engineer at H2O.ai. Patrick works with H2O.ai customers to derive substantive business value from machine learning technologies. His product work at H2O.ai focuses on two important aspects of applied machine learning, model interpretability and model deployment. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.
Prior to joining H2O.ai, Patrick held global customer-facing roles and R&D roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera certified data scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.
A Framework for Scene Recognition Using Convolutional Neural Network as Featu... (Tahmid Abtahi)
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. The availability of large data sets like ImageNet and VGG has provided scope for applying machine learning classifiers to train models. However, high data dimensionality is an issue while training classifiers such as Support Vector Machines (SVM) and perceptrons. To reduce data dimensionality and take advantage of parallel and distributed processing, we propose a framework with a Convolutional Neural Network (CNN) as the feature extractor and an SVM and a perceptron as classifiers. MPI (Message Passing Interface) was used for programming clusters of CPUs. The SVM showed a 1.05x improvement over the perceptron in terms of run time, and the CNN reduced data dimensionality by 10x.
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics or programming. The mechanism works by allowing end users to simply provide data, and the system automatically does the rest by determining the approach to perform a particular ML task. At first this may sound discouraging to those aiming for the "sexiest job of the 21st century" - the data scientists. However, Auto ML should be considered a democratization of ML, rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Understanding Hallucinations in LLMs - 2023 09 29.pptx (Greg Makowski)
Hallucinations are a current fundamental problem for LLMs.
For one example, in June of this year in New York, attorneys did "research" on past cases with ChatGPT and turned it in to the judge as a brief. The opposing counsel reported to the judge that they could not find the cases. When the judge confronted the attorneys who had used GPT, they stood behind their brief. The judge fined the firm $5,000.
Could this happen to you? YES. What can be done to avoid this in the future? I will answer.
In this talk, I will explain some fundamental areas of LLMs to show how and why hallucinations occur. To understand that, an introduction to how words, concepts and dialogs are represented will help.
Words were first represented as points in an embedding space with Word2Vec in 2013. It could compress 10,000 words into a vector of 300 elements, with a word represented as a point in the 300-dimensional embedding space. Not just words can be represented: longer text, such as books, can also be compressed into a type of embedding. In that situation, areas of the embedding space relate to different genres, such as non-fiction, science fiction, children's fiction and so on. A new data point between training data points, when converted to text, would be a hallucination. In the area of "legal cases" in the embedding space, if there is not an exact match, the text generation will try to generate what is plausible.
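Closeness in an embedding space is usually measured with cosine similarity; the 3-element vectors below are invented stand-ins for real 300-dimensional embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d "embeddings"; real models use 300 or more dimensions.
king   = [0.9, 0.8, 0.1]
queen  = [0.85, 0.82, 0.15]
banana = [0.1, 0.05, 0.9]

print(cosine(king, queen) > cosine(king, banana))  # related words sit closer
```

A point that falls between genuine training points scores plausibly by this measure even though no training text ever produced it, which is the geometric picture behind a hallucination.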
During an LLM conversation, the output of the previous text provides context for the next text, in the style of a recurrent neural network. The starting position of a conversation matters. Understanding that areas of the embedding space represent genres like "non-fiction" or other language aspects, and that the starting position of a discussion time series matters, helps explain why prompt engineering helps. The conversation is represented in the activations over the network's 7B or 500B weights, a much larger space. During a conversation, learning is not occurring, but the neural network activations are changing. The neural network is not a database. Even if you reach the exact set of weight activations from a training record, due to lossy compression, the exact text may not be regenerated.
ChatGPT does not use word embeddings. For implementation-efficiency reasons, it is practical to break down what is embedded to about 50,000 items in a lookup table. Also, if we wanted to support proper nouns, like names, and dozens of languages, the number of words would be in the millions. ChatGPT and other LLMs use "tokens" for embedding. Examples of Byte Pair Encoding (BPE) and its process are given. The ChatGPT embedding is a vector of 1,536 numbers for each token.
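A stripped-down sketch of the BPE merge loop (the training text is invented; real tokenizers learn merges from a large corpus and keep a vocabulary of roughly 50,000 tokens):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from individual characters
for _ in range(2):                 # two merge rounds: 'l'+'o', then 'lo'+'w'
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After two merges the frequent substring "low" has become a single token, which is how BPE keeps the vocabulary small while still covering rare words and proper nouns as sequences of sub-word pieces.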
A solution for today is Retrieval Augmented Generation (RAG). As a brief introduction: you ask a question in English or another natural language, and it is matched against a large library or database of paragraphs from internal documents or websites.
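A toy sketch of the retrieval step in RAG, using bag-of-words counts as stand-in embeddings (a production system would use a neural encoder, such as the 1,536-dimensional token embeddings mentioned above; the library paragraphs here are invented):

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v))

library = [
    "PageRank assigns importance scores to web pages",
    "Byte pair encoding merges frequent character pairs into tokens",
    "Gradient boosting combines many shallow trees",
]

question = "how does byte pair encoding build tokens"
best = max(library, key=lambda p: cosine(embed(question), embed(p)))

# The retrieved paragraph is prepended to the prompt, grounding the answer.
prompt = f"Answer using this context:\n{best}\n\nQuestion: {question}"
```

Because the generator is handed real retrieved text rather than relying on whatever lies near the question in its own embedding space, RAG sharply reduces the "plausible but invented" failure mode described earlier.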
Gives background on Data Science and Artificial Intelligence, to better understand the current state of the art (SOTA) for Large Language Models (LLMs) and Generative AI, and then starts a discussion of where things are heading in the future.
A Successful Hiring Process for Data Scientists (Greg Makowski)
Discusses one successful hiring process for data scientists. The current "best" algorithms are constantly changing, and it is not uncommon to need to learn about a new vertical market for a DS application. From my DS hiring experience over 2010-2022, I have focused on hiring people who are good at learning and adapting.
KDD 2019: Standardizing Data Science to Help Hiring (Greg Makowski)
Initiative for Analytics and Data Science Standards (IADSS) workshop presentation at the ACM KDD conference (Association for Computing Machinery, Knowledge Discovery in Databases).
Predictive Model and Record Description with Segmented Sensitivity Analysis (... (Greg Makowski)
Describing a predictive data mining model can provide a competitive advantage when solving business problems with a model. The SSA approach can also provide reasons for the forecast for each record. This can help drive investigations into fields and interactions during a data mining project, as well as identify "data drift" between the original training data and the current scoring data. I am working on an open source version of SSA, first in R.
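A simplified one-at-a-time sensitivity sketch (not the segmented SSA itself; the scoring model and record below are hypothetical):

```python
def sensitivity(model, record, delta=0.01):
    """Nudge each field by a small relative amount and report how much
    the model's score moves per unit of input change."""
    base = model(record)
    scores = {}
    for field, value in record.items():
        bumped = dict(record)
        bumped[field] = value * (1 + delta)
        scores[field] = (model(bumped) - base) / (value * delta)
    return scores

# A hypothetical linear scoring model that leans heavily on `income`.
model = lambda r: 0.8 * r["income"] + 0.1 * r["age"]
record = {"income": 50.0, "age": 40.0}
s = sensitivity(model, record)  # income dominates, as expected
```

Ranking fields by these scores, per record, is what yields "reasons for the forecast"; comparing the score distributions between training and scoring data is one way to surface data drift.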
How to Create 80% of a Big Data Pilot Project (Greg Makowski)
When evaluating Open Source Software, or other software of a certain size or complexity, organizations frequently want to conduct a Pilot project, or Proof of Concept (POC). This talk describes a process to reduce the length of the Pilot, by leveraging configurations from performance testing to POC starting configurations.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of many small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
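For reference, the baseline all of these optimizations start from is plain power-iteration PageRank. The following is a minimal sketch with one simple loop-based handling of dead ends (their rank is redistributed uniformly), not the STICD or Levelwise implementation:

```python
def pagerank(links, d=0.85, iters=50):
    """Basic power-iteration PageRank. Dangling (dead-end) vertices
    redistribute their rank uniformly over all vertices each iteration."""
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Mass held by dead ends, to be spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            for v in links.get(u, []):
                new[v] += d * rank[u] / len(links[u])
        rank = new
    return rank

# A -> B -> C, where C is a dead end.
r = pagerank({"A": ["B"], "B": ["C"], "C": []})
```

Even on this tiny chain, rank accumulates toward the sink: C outranks B, which outranks A, and the ranks still sum to 1 because the dead-end mass is recycled.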
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Heuristic design of experiments w meta gradient search
1. Heuristic Design of Experiments
with Meta-Gradient Search
of Model Training Parameters
SF Bay ACM, Data Mining SIG, Feb 28, 2011
http://www.sfbayacm.org/?p=2464
Greg_Makowski@yahoo.com
www.LinkedIn.com/in/GregMakowski
3. Key Questions Discussed
• You (a data miner) have many algorithms or libraries you can use, with many choices…
– How to stay organized among all the choices?
• Algorithm parameters
• Adjustments in Cost vs. Profit (Type I vs. II error bias)
• Metric selection (Lift if acting on top % vs. RMSE or ROC)
• Ensemble Modeling, boosting, bagging, stacking
• Data versions, preprocessing, trying new fields
– How to plan, and learn as you go?
– How simple should you stay, to keep descriptiveness vs. Occam's Razor?
4. Outline
Model Training Parameters in SAS Enterprise Miner
Tracking Conservative Results in a “Model Notebook”
How to Measure Progress
Meta-Gradient Search of Model Training Parameters
How to Plan and dynamically adapt
How to Describe Any Complex System – Sensitivity
5. Enterprise Miner
Sample Data Flow for a Project
(Diagram of a sample project data flow; boxes are expanded in later slides. Stratified sampling splits the data into Learning, Tuning, and Validation sets.)
6. Type I vs. II Error Weights
Profit-Loss Ratios
Set these in the Data Source, NOT in the model engines. Other software may use a weight field instead; you need to stay organized regardless.
7. Regression
• It is always good to find the best linear solution early on
– Like testing a null hypothesis: is the problem linear or non-linear?
• Can feed the "score" or "residual error" as a source field into non-linear models
8. Neural Net Architecture and Parameters
(Scatter-plot diagram over field 1 and field 2, with "$" and "c" class markers: a neural net solution is "non-linear", carving out several regions which are not adjacent. MLP and RBF architectures are contrasted.)
9. A Comparison of a Neural Net and Regression
A logistic regression formula:
Y = f( a0 + a1*X1 + a2*X2 + a3*X3 )
a* are coefficients
Backpropagation, cast in a similar form:
H1 = f(w0 + w1*I1 + w2*I2 + w3*I3)
H2 = f(w4 + w5*I1 + w6*I2 + w7*I3)
:
Hn = f(w8 + w9*I1 + w10*I2 + w11*I3)
O1 = f(w12 + w13*H1 + .... + w15*Hn)
On = ....
w* are weights, AKA coefficients
I1..In are input nodes or input variables.
H1..Hn are hidden nodes, which extract features of the data.
O1..On are the outputs, which group disjoint categories.
f() is the SIGMOID function, a non-linear "S" curve
(Diagram: the regression drawn as a network with inputs X1..X3, coefficients a1..a3, and output Y; the neural net drawn with a bias node, inputs I1..I3, hidden nodes H1..Hn, weights w1..w3 on the arcs, and a "direct connect" arc from inputs to output. Speaker note: it is very noisy in the brain, due to chemical depletion of neurotransmitters.)
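The parallel between the two formulas can be made concrete: each hidden node is itself a logistic regression over the inputs, and the output is a logistic regression over the hidden activations. A minimal sketch, where all weight values are arbitrary illustration numbers:

```python
import math

def sigmoid(z):
    """f(): the non-linear 'S' curve."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, a):
    # Y = f(a0 + a1*X1 + a2*X2 + ... )  -- a[0] is the bias term
    return sigmoid(a[0] + sum(ai * xi for ai, xi in zip(a[1:], x)))

def mlp_forward(x, hidden_w, output_w):
    """Backprop net cast in the same form: every hidden node Hk is a
    logistic regression over the inputs; the output O1 is a logistic
    regression over the hidden-node activations."""
    h = [logistic_regression(x, w) for w in hidden_w]
    return logistic_regression(h, output_w)

x = [0.5, -1.0, 2.0]
y_lin = logistic_regression(x, [0.1, 0.4, -0.3, 0.2])
y_mlp = mlp_forward(x,
                    [[0.0, 1.0, 0.0, 0.0],    # H1 weights
                     [0.0, 0.0, 1.0, 0.0]],   # H2 weights
                    [0.0, 0.5, -0.5])         # output weights
```

With zero hidden nodes and a direct connection, the net collapses back to exactly the logistic regression formula, which is why finding the linear solution first is a useful baseline.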
10. Neural Net
• Network Architecture can be linear (MLP) or circular (many RBF)
• Network Direct Connection allows inputs to connect to the output (to find the simple, linear solution first)
• Network Hidden Units can go up to 64 (much better than 8)
• Profit/Loss uses the settings in the Data Source
11. Tree, Depth = 2
What does a Decision Tree look like?
(Scatter-plot diagram over Age and Income, with "$" and "c" class markers, partitioned by Split 1, Split 2, and Split 3 into Leaf 1 through Leaf 4; shown both as a partitioned plane and as a tree.)
If (Age < Split1) then
:…If (Income > Split2) then Leaf1 with dollar_avg1
:…If (Income < Split2) then Leaf2 with dollar_avg2
If (Age > Split1) then
:…If (Income > Split3) then Leaf3 with dollar_avg3
:…If (Income < Split3) then Leaf4 with dollar_avg4
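The if-then rules above translate directly into code. The split points and leaf dollar averages below are hypothetical illustration values, not numbers from the slide:

```python
def tree_forecast(age, income):
    """The depth-2 tree as nested if/else rules.
    Splits and leaf dollar averages are made-up illustration values."""
    SPLIT1, SPLIT2, SPLIT3 = 40, 30_000, 50_000
    leaf_avg = {"Leaf1": 120.0, "Leaf2": 45.0, "Leaf3": 210.0, "Leaf4": 80.0}
    if age < SPLIT1:
        leaf = "Leaf1" if income > SPLIT2 else "Leaf2"
    else:
        leaf = "Leaf3" if income > SPLIT3 else "Leaf4"
    return leaf, leaf_avg[leaf]

print(tree_forecast(30, 50_000))
```

Every record lands in exactly one leaf, and the leaf's average becomes its forecast — which is also why a tree's description is easy to hand to a business audience.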
12. Decision Tree
• Primary Parameters to vary
– Criterion
• Probchisq (Default)
• Entropy
• Gini
– Assessment (Decision vs. Lift)
– Tree size (depth, leaf size, Xvalid)
13. Gradient Boosting (Tree Based)
Based on "Greedy Function Approximation: A Gradient Boosting Machine" by Jerome Friedman
Each new CART tree:
• is fit on a 60% random sample
• is a small, general tree
• forecasts the error remaining from the summed forecast of all previous trees
• may be one of 50 to 2,000 trees in a sequence
• Evaluate how far "back" in the sequence to prune
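A hedged sketch of the idea, substituting one-split "stumps" for the small CART trees: each new learner is fit on a 60% random sample, to the residual of the summed forecast of all previous learners (squared-error loss). The data, learning rate, and stump learner are all invented for illustration:

```python
import random

def fit_stump(xs, ys):
    """Tiny regression 'tree': one split on a numeric feature,
    predicting the mean target on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def predict(x, base, stumps, lr=0.1):
    """Summed forecast: base value plus all shrunken stump corrections."""
    return base + lr * sum(s(x) for s in stumps)

def gradient_boost(xs, ys, n_trees=50, sample=0.6, lr=0.1, seed=0):
    """Each new stump is fit on a 60% random sample, to the residual
    of the summed forecast of all previous stumps."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), int(sample * len(xs)))
        resid = [ys[i] - predict(xs[i], base, stumps, lr) for i in idx]
        stumps.append(fit_stump([xs[i] for i in idx], resid))
    return base, stumps

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]
base, stumps = gradient_boost(xs, ys)
```

Choosing how far "back" in the sequence to prune corresponds to evaluating `predict` with only the first k stumps on a hold-out set.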
14. DM Algorithms Available in Packages
# Modules per Forecasting Family in DM Software (rows are software packages; the package names appeared in the slide graphic)
Regression | Lasso Reg | Decision Tree | Neural Net | Support Vector Mach | Other | TOT
2 | 1 | 0 | 0 | 0 | 1 | 4
0 | 0 | 1 | 0 | 0 | 0 | 1
3 | 0 | 3 | 3 | 0 | 3 | 12
1 | 0 | 1 | 0 | 1 | 1 | 4
0 | 0 | 4 | 0 | 0 | 0 | 4
3 | 2 | 5 | 3 | 2 | 3 | 18
0 | 0 | 0 | 0 | 0 | 5 | 5
15. Feel Overwhelmed by Lots of Complex Algorithm Parameters? GOOD!
• A deep understanding of the algorithms, math and assumptions helps significantly with heuristics
– i.e. typically, regression has a problem with correlated inputs, because the solution calculation uses matrix inversion (if you are worried about weight sign inversion)
– SVMs or Bayesian Nets do not have this problem, because they are solved differently.
• Without a problem from correlated inputs, input selection becomes more random, but you still get a decent solution
• How can you manage the details?
– I am glad you asked… moving on to the next section
16. Outline
Model Training Parameters in SAS Enterprise Miner
Tracking Conservative Results in a “Model Notebook”
How to Measure Progress
Meta-Gradient Search of Model Training Parameters
How to Plan and dynamically adapt
How to Describe Any Complex System – Sensitivity
17. Model Exploration Process
• Scientific Method of Hypothesis Testing
– If you change ONE thing, then any change in the results is because of that one change
– Design of Experiments (DOE), test plan
– Best to compare model settings on the same data version
• New data versions add new preprocessed fields, or new months (records)
– Key design objective: all experiments are reproducible
• SAME random split between Learning – Test – Validation, with a consistent random seed
– L-T-V split before loading data into a tool, so the partitioning is the same for all tools/libraries/algorithms
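One way to guarantee the same L-T-V partition across every tool is to derive the split from a hash of a stable record key, rather than from row order or a per-tool random generator. A sketch, where the 60/20/20 fractions and the use of MD5 are illustrative assumptions:

```python
import hashlib

def ltv_split(record_id, frac_learn=0.6, frac_tune=0.2):
    """Deterministic Learning/Tuning/Validation assignment from a hash of
    the record key, so every tool, library, and algorithm sees the SAME
    partition regardless of row order (a stand-in for a shared seed)."""
    h = int(hashlib.md5(str(record_id).encode()).hexdigest(), 16)
    u = (h % 10_000) / 10_000.0   # pseudo-uniform in [0, 1)
    if u < frac_learn:
        return "learn"
    elif u < frac_learn + frac_tune:
        return "tune"
    return "validate"

parts = [ltv_split(i) for i in range(10_000)]
```

Because the assignment depends only on the key, re-running any experiment months later, in any tool, reproduces the exact same three data sets.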
18. Model Notebook
Input Parameters → Outcomes (Lift in Top 10%)
Data Ver | Algor | Mod Num | Param 1 (vars offerd) | Param 2 (var selct / Hidn Nodes) | Param 3 (Direct Conn) | Arch | Vars Seltd | Trn Time | Train | Val | Gap = Abs(Trn-Val) | Consrv Result
1 | Regrsn | 1 | 27 | stepw | - | - | 9 | 12 | 5.77 | 5.94 | 0.17 | 5.60
1 | Neural | 1 | 27 | 3 | n | MLP | all | 77 | 6.65 | 10.89 | 4.24 | 2.41 (Bad)
1 | Neural | 2 | 27 | 10 | n | MLP | all | 40 | 6.88 | 6.73 | 0.15 | 6.58 (Good)
1 | Neural | 3 | 27 | 10 | Y | MLP | all | 36 | 6.40 | 6.93 | 0.53 | 5.87
1 | Neural | 4 | 27 | 10 | n | RBF | all | 34 | 5.67 | 5.54 | 0.13 | 5.41
1 | Neural | 5 | 27 | 10 | Y | RBF | all | 35 | 5.95 | 7.92 | 1.97 | 3.98
19. Model Notebook Outcome Details
• My Heuristic Design Objectives: (yours may be different)
– Accuracy in deployment
– Reliability and consistent behavior, a general solution
• Use one or more hold-out data sets to check consistency
• Penalize more as the forecast becomes less consistent
– No penalty for model complexity (if it validates consistently)
• Let me drive a car to work, instead of limiting me to a bike
– Message for the check writer
– Don't consider only Occam's Razor: value consistently good results
– Develop a "smooth, continuous metric" to sort and find models that will perform "best" in future deployment
20. Model Notebook Outcome Details
• Training = results on the training set
• Validation = results on the validation hold-out
• Gap = abs( Training – Validation )
– A bigger gap (volatility) is a bigger concern for deployment, a symptom
– Minimize Senior VP heart attacks! (one penalty for volatility)
– Set expectations & meet expectations
– Regularization helps significantly
• Conservative Result = worst( Training, Validation ) + Gap_penalty
– Corr / Lift / Profit, higher is better: Cons Result = min(Trn, Val) - Gap
– MAD / RMSE / Risk, lower is better: Cons Result = max(Trn, Val) + Gap
• Business Value or Pain ranking = function of (conservative result)
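The slide's metric can be written as a small function; the two example rows below reuse actual Lift-in-Top-10% numbers from the model notebook (Neural models 1 and 2):

```python
def conservative_result(train, val, higher_is_better=True):
    """Slide-20 metric: the worst of train/validation, penalized again
    by the train-validation gap (volatility)."""
    gap = abs(train - val)
    if higher_is_better:                # Corr / Lift / Profit
        return min(train, val) - gap
    return max(train, val) + gap       # MAD / RMSE / Risk

# Notebook rows: a volatile model loses to a consistent one,
# even though its best single number (10.89) looks higher.
volatile   = conservative_result(6.65, 10.89)   # gap 4.24 -> 2.41
consistent = conservative_result(6.88, 6.73)    # gap 0.15 -> 6.58
```

Sorting the notebook on this one smooth column is what lets the search treat "accuracy + generalization" as a single number to climb.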
21. Model Notebook
(Repeats the Input Parameters → Outcomes table from slide 18.)
22. Model Notebook Process: Tracking Detail, Training the Data Miner
Model Notebook, Project = Transit, Last Update 5/6/2010
Each row below lists: Data Ver | Author | Algor | Mod Num | chng from prior | vars offered | algorithm-specific parameters | Var Sel | Trn Time, followed by Train | Val | Gap = Abs(Trn-Val) | Consrv Result for each lift band (Lift in Top 5% / 10% / 20% Over File Avg).
1 GM B logistic 1 0 27 stepws 10 12.04 8.12 3.92 4.20 7.59 4.85 2.74 2.11
1 GM B logistic 2 1 19 stepws 10 12.04 8.12 3.92 4.20 7.59 4.85 2.74 2.11
1 GM B logistic 3 1 6, no dbc stepws 4 7.51 1.98 5.53 -3.55 4.90 3.96 0.94 3.02 (investigate inconsistency)
1 GM B logistic 4 1 13, only dbc stepws 7 9.58 7.33 2.25 5.08 6.59 5.25 1.34 3.91
Regression (block parameters: regr type | var selectn | 2-factor interact | polynom)
1 GM regr 1 0 27 logistic stepws n 9 12 5.77 5.94 0.17 5.60 3.35 4.46 1.11 2.24 2.25 3.02 0.77 1.48
1 GM regr 2 1 27 logistic stepws Yes 9 16 5.76 5.94 0.18 5.58 3.35 4.46 1.11 2.24 2.25 3.02 0.77 1.48
1 GM regr 3 1 27 logistic stepws n 2 10 57 5.86 6.93 1.07 4.79 3.48 5.03 1.55 1.93 2.32 2.61 0.29 2.03
1 GM regr 4 1 27 logistic stepws Yes 2 11 58 5.86 6.93 1.07 4.79 3.48 5.04 1.56 1.92 2.32 2.92 0.60 1.72
4 GM regr 5 4 3 logistic stepwise Yes 2 8 63 12.88 13.40 0.52 12.36 6.65 6.89 0.24 6.41 3.53 3.64 0.11 3.43
4 GM regr 6 5 28 logistic stepwise Yes 2 (didn't finish, out of memory)
4 GM regr 7 5 3 logistic stepwise n 2 63 12.88 13.40 0.52 12.36 6.65 6.89 0.24 6.41 3.53 3.64 0.11 3.43
4 GM regr 8 5 3 logistic stepwise n 1 12.88 13.40 0.52 12.36 6.65 6.89 0.24 6.41 3.53 3.64 0.11 3.43
4 GM regr 9 5 3 logistic stepwise Yes 1 12.88 13.40 0.52 12.36 6.65 6.89 0.24 6.41 3.53 3.64 0.11 3.43
4 GM regr 10 8 28 logistic stepwise n 1 12.88 13.40 0.52 12.36 6.65 6.89 0.24 6.41 3.53 3.64 0.11 3.43
4 GM regr 11 5 3 logistic stepwise Yes 3 6 78 15.98 16.06 0.08 15.89 8.61 8.03 0.58 7.45 4.81 4.39 0.41 3.98
4 GM regr 12 5 3 logistic stepwise Yes 4 2 78 15.98 16.06 0.08 15.89 8.61 8.03 0.58 7.45 4.81 4.39 0.41 3.98
4n GM regr 13 11 3 logistic stepwise Yes 3 6 78 18.39 18.79 0.39 18.00 9.58 9.55 0.03 9.52 4.96 4.92 0.03 4.89 (add Feb & Mar to recent*)
4n GM regr 14 11 3 6 78 12.49 12.12 0.36 11.76 7.63 7.42 0.20 7.22 4.29 4.47 0.18 4.12 (recent_serrtrn_dbc changed to recent_serrtrn_flag; does DBC on ser patt help? YES. Yippeee!)
1 GM DM Regr 1 0 27 logistic stepws 13 15 12.00 3.17 8.83 -5.66 7.21 4.16 3.05 1.11 4.28 3.07 1.21 1.86
4 GM DM Regr 2 0 28 (max v 3000, min rsq 0.005, use aov16 var YES) 6 72 16.27 15.76 0.52 15.24 8.67 8.03 0.64 7.39 4.58 4.24 0.34 3.90
1 GM PLS 1 0
1 GM PLS 2 1 27 default default default default 4 18 11.26 3.08 8.18 -5.10 7.12 4.85 2.27 2.58 4.28 3.12 1.16 1.96
1 GM PLS 3 1 Test Set Cros Val (didn't finish, don't use Xvalidation)
4 GM PLS 4 0 28 PLS NIPALS 200 28 122 16.63 15.76 0.87 14.89 8.93 8.03 0.90 7.13 4.76 4.32 0.45 3.87
AutoNeural (block parameters: hidden | Direct Conn? | arch)
1 GM AutoNrl 1 0 27 2 n MLP all 35 4.19 3.76 0.43 3.33 2.47 2.57 0.10 2.37 1.77 1.88 0.11 1.66
1 GM AutoNrl 2 1 27 6 n MLP all 189 4.37 2.77 1.60 1.17 2.82 1.78 1.04 0.74 1.98 1.93 0.05 1.88
1 GM AutoNrl 3 1 27 8 n MLP (AutoNeural trn action = search) all 532 0.83 0.56 0.27 0.29 0.83 0.56 0.27 0.29 0.83 0.56 0.27 0.29
1 GM AutoNrl 4 1 27 8 n MLP (activ = logistic) all 356 5.12 2.97 2.15 0.82 3.02 3.37 0.35 2.67 1.90 2.57 0.67 1.23
1 GM AutoNrl 5 1 27 6 n MLP (arch = block) all 130 0.89 0.97 0.08 0.81
1 GM AutoNrl 6 1 27 6 n MLP (arch = funnel) all 595 1.36 1.08 0.28 0.80
4 GM AutoNrl 7 1 28 6 n MLP all 1201 16.27 15.76 0.51 15.24 8.65 7.88 0.77 7.11 4.46 4.24 0.22 4.03
Neural (block parameters: hidden | Direct Conn? | arch | Decay | Decision Weight)
1 GM Neural 1 0 27 3 n MLP all 77 6.65 10.89 4.24 2.41 3.90 6.53 2.63 1.27 2.52 3.96 1.44 1.08
1 GM Neural 2 1 27 10 n MLP all 40 6.88 6.73 0.15 6.58 3.97 4.55 0.58 3.39 2.56 3.02 0.46 2.10
1 GM Neural 3 1 27 10 Y MLP all 36 6.40 6.93 0.53 5.87 3.49 5.45 1.96 1.53 2.32 3.22 0.90 1.42
1 GM Neural 4 1 27 10 n RBF (orbfeq) all 34 5.67 5.54 0.13 5.41 3.25 4.85 1.60 1.65 2.20 3.22 1.02 1.18
1 GM Neural 5 1 27 10 Y RBF all 35 5.95 7.92 1.97 3.98 3.48 4.85 1.37 2.11 2.31 3.17 0.86 1.45
js1 JS Neural 6 0 17 5 n MLP Softmax 10,-5,-1,0 all 6.03 6.53 0.50 5.53 3.40 4.55 1.15 2.25 2.67 3.36 0.69 1.98
js1 JS Neural 7 6 15 5 Y MLP Softmax 10,-5,-1,0 all 6.14 5.74 0.40 5.34 3.59 2.97 0.62 2.35 2.77 2.37 0.40 1.97
js1 JS Neural 8 6 15 3 Y MLP Softmax 0.5 10,-5,-1,0 all 6.27 7.13 0.86 5.41 3.54 3.56 0.02 3.52 2.74 2.57 0.17 2.40
js1 JS Neural 9 6 15 3 n MLP Softmax 0.5 10,-5,-1,0 all 6.27 6.33 0.06 6.21 3.57 4.65 1.08 2.49 2.76 2.82 0.06 2.70
2 GM Neural 10 2 35 12 Y MLP 20,0,-1,0 all
3 GM Neural 11 2 45 20 n MLP 20,0,-1,0 all 18 6.26 7.76 1.50 4.76 3.54 4.22 0.68 2.86 2.18 2.46 0.28 1.91
3 GM Neural 12 11 45 20 n MLP 0.8 20,0,-1,0 all 16 6.26 7.76 1.50 4.76 3.54 4.22 0.68 2.86 2.18 2.46 0.28 1.91
3 GM Neural 13 11 45 20 n MLP 0.6 20,0,-1,0 all 16 6.26 7.76 1.50 4.76 3.54 4.22 0.68 2.86 2.18 2.46 0.28 1.91
4 GM Neural 14 11 3 20 n MLP 0.01 20,0,-1,0 all 204 16.39 15.15 1.24 13.91 8.67 8.03 0.64 7.39 4.82 4.39 0.43 3.97
4 GM Neural 15 11 28 20 n MLP 0.01 20,0,-1,0 all 713 16.39 15.76 0.63 15.12 8.54 7.88 0.66 7.22 4.40 4.25 0.15 4.11
4 GM Neural 16 15 31 40 n MLP 0.01 20,0,-1,0 all 782 18.02 18.18 0.16 17.86 9.21 9.55 0.34 8.87 4.60 4.77 0.17 4.44
4 GM Neural 17 15 (same, max iter 20 --> 50) all 1754 18.02 18.18 0.16 17.86 9.21 9.55 0.34 8.87 4.66 4.77 0.11 4.55
4 GM Neural 18 16 29 (no twoYr) (same, max iter 20 --> 50) 40 0 0 all 18.386 18.98 18.18 0.80 17.38 9.25 9.59 0.34 8.90 4.67 4.86 0.20 4.47
4n GM DMNeural 19 0 13 3 n all 19 10.60 2.57 8.03 -5.46 6.93 4.36 2.57 1.79 4.14 2.57 1.57 1.00
More Heuristic Strategy:
1) Try a few models of many algorithm types (seed the search)
2) Opportunistically spend more effort on what is working (invest in top stocks)
3) Still try a few trials on medium successes (diversify, limited by the project time-box)
4) Try ensemble methods, combining model forecasts & top source vars with the model (the "Data Mining Battle Field")
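The "spend more effort on what is working" step can be sketched as a greedy neighborhood search over the discrete parameter grid, a meta-gradient in the sense that each accepted move follows the direction of improving conservative result. The grid, the starting point, and the scoring function below are all invented for illustration:

```python
def meta_gradient_search(score, grid, start, rounds=20):
    """Greedy 'meta-gradient' over discrete training parameters: from the
    current best setting, step one parameter at a time to a neighboring
    grid value and keep any move that improves the score."""
    best = dict(start)
    best_score = score(best)
    for _ in range(rounds):
        improved = False
        for param, values in grid.items():
            i = values.index(best[param])
            for j in (i - 1, i + 1):          # try both neighbors
                if 0 <= j < len(values):
                    trial = dict(best, **{param: values[j]})
                    s = score(trial)
                    if s > best_score:
                        best, best_score, improved = trial, s, True
        if not improved:                       # local optimum reached
            break
    return best, best_score

# Hypothetical scorer with its peak at hidden=10, depth=6
# (in practice, score() would train a model and return the
# conservative result from the model notebook).
grid = {"hidden": [3, 6, 10, 20, 40], "depth": [2, 4, 6, 8]}
score = lambda p: -abs(p["hidden"] - 10) - abs(p["depth"] - 6)
best, s = meta_gradient_search(score, grid, {"hidden": 3, "depth": 2})
```

Each `score()` call corresponds to one row of the model notebook, so the notebook doubles as the search's evaluation log.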
23. Model Notebook Process: Tracking Detail, Training the Data Miner
Decision Tree (block parameters: criterion | max depth | leaf size | asses = 5% Lift | Decision Weight). Each row: M cnt | Data Ver | Author | Algor | Mod Num | chng from prior | vars offered | parameters | Var Sel | Trn Time, then Train | Val | Gap | Consrv Result for each lift band.
47 1 GM Dec Tree 1 0 27 default 6 5 20,0,-5,0 7 13 13.71 9.59 4.12 5.47 7.67 5.35 2.32 3.03 4.33 3.80 0.53 3.27
48 1 GM Dec Tree 2 1 27 probchisq 6 5 20,0,-5,0 7 16 13.71 9.59 4.12 5.47 7.67 5.35 2.32 3.03 4.33 3.80 0.53 3.27
49 1 GM Dec Tree 3 1 27 entropy 6 5 20,0,-5,0 6 16 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
50 1 GM Dec Tree 4 1 27 gini 6 5 20,0,-5,0 10 22 13.76 11.28 2.48 8.80 7.70 6.10 1.60 4.50 4.32 3.71 0.61 3.10
51 1 GM Dec Tree 5 3 27 entropy 12 5 20,0,-5,0 6 13 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
52 1 GM Dec Tree 6 3 27 entropy 6 10 20,0,-5,0 6 13 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
53 1 GM Dec Tree 7 3 27 entropy 6 100 20,0,-5,0 6 17 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
54 1 GM Dec Tree 8 3 27 entropy 6 100 xval = Y 20,0,-5,0 8 32 14.51 12.82 1.69 11.13 8.95 7.42 1.53 5.89 4.72 4.13 0.59 3.54
55 1 GM Dec Tree 9 3 27 entropy 6 5 xval = Y 20,0,-5,0 8 32 14.51 12.82 1.69 11.13 8.95 7.42 1.53 5.89 4.72 4.13 0.59 3.54
56 1 GM Dec Tree 10 3 27 entropy 6 5 (obs import = Y) 20,0,-5,0 6 17 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91 (Decision Tree, Data Version 1)
57 1 GM Dec Tree 11 3 27 entropy 6 5 (asses = 5% Lift) 20,0,-5,0 6 12 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
58 1 GM Dec Tree 12 3 27 entropy 10 2 20,0,-5,0 6 12 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
46 2 GM Dec Tree 13 3 33 entropy 6 5 a=5% lift 20,0,-5,0 7 16 15.92 14.96 0.96 14.00 8.29 7.84 0.45 7.39 4.40 4.17 0.23 3.94
47 2 GM Dec Tree 14 13 33 entropy 6 5 a=5% lift 10,-2.5,-1,0 13 15 16.32 15.05 1.27 13.78 9.07 8.00 1.07 6.93 4.63 4.08 0.55 3.53
48 2 GM Dec Tree 15 13 33 entropy 6 5 a=5% lift 1,-1,1,-1 8 15 15.30 14.34 0.96 13.38 7.98 7.53 0.45 7.08 4.25 4.05 0.20 3.85
49 2 GM Dec Tree 16 13 33 entropy 6 5 a=5% lift 10,-1,1,-1 12 16 16.32 15.05 1.27 13.78 8.96 8.14 0.82 7.32 4.62 4.23 0.39 3.84
50 2 GM Dec Tree 17 13 33 entropy 6 5 a=5% lift 20,-5,0,0 12 15 16.32 15.60 0.72 14.88 8.79 8.26 0.53 7.73 4.47 4.21 0.26 3.95
51 2 GM Dec Tree 18 13 33 entropy 6 5 a=5% lift 20,-1,0,0 12 15 16.32 15.60 0.72 14.88 8.79 8.26 0.53 7.73 4.47 4.21 0.26 3.95
52 2 GM Dec Tree 19 13 33 entropy 6 5 a=5% lift xval = no 20,0,-1,0 6 15 15.87 15.52 0.35 15.17 8.26 8.12 0.14 7.98 4.40 4.32 0.08 4.24
53 2 GM Dec Tree 20 13 33 entropy 6 5 a=5% lift 20,-5,-1,1 12 16 16.32 15.05 1.27 13.78 8.96 8.14 0.82 7.32 4.62 4.23 0.39 3.84
54 2 GM Dec Tree 21 13 33 entropy 6 5 a=5% lift xval = no 20,0,0,1 9 16 16.17 15.57 0.60 14.97 8.74 8.25 0.49 7.76 4.44 4.21 0.23 3.98
55 2 GM Dec Tree 22 19 33 gini 6 5 a=5% lift 20,0,-1,0 8 16 15.17 13.17 2.00 11.17 8.02 7.32 0.70 6.62 4.40 4.26 0.14 4.12
56 2 GM Dec Tree 23 19 33 probchisq 6 5 a=5% lift 20,0,-1,0 8 16 15.17 13.17 2.00 11.17 8.02 7.32 0.70 6.62 4.40 4.26 0.14 4.12
57 2 GM Dec Tree 24 19 33 entropy 20 5 a=5% lift 20,0,-1,0 19 26 18.94 15.42 3.52 11.90 9.67 7.78 1.89 5.89 4.90 4.06 0.84 3.22 (Data Version 2)
58 2 GM Dec Tree 25 19 33 entropy 20 20 a=5% lift 20,0,-1,0 19 26 18.94 13.80 5.14 8.66 9.67 7.78 1.89 5.89 4.90 4.06 0.84 3.22
59 2 GM Dec Tree 26 19 33 entropy 20 40 a=5% lift 20,0,-1,0 7 27 16.06 15.29 0.77 14.52 8.36 8.00 0.36 7.64 4.41 4.23 0.18 4.05
60 2 GM Dec Tree 27 19 33 entropy 20 60 a=5% lift 20,0,-1,0 7 27 16.06 15.29 0.77 14.52 8.36 8.00 0.36 7.64 4.41 4.23 0.18 4.05
61 2 GM Dec Tree 28 19 33 entropy 7 5 a=5% lift 20,0,-1,0 10 33 16.73 14.57 2.16 12.41 8.90 7.75 1.15 6.60 4.60 4.06 0.54 3.52
62 2 GM Dec Tree 29 19 33 entropy 7 10 a=5% lift 20,0,-1,0 10 33 16.73 14.57 2.16 12.41 8.90 7.75 1.15 6.60 4.60 4.06 0.54 3.52
63 2 GM Dec Tree 30 19 33 entropy 7 20 a=5% lift 20,0,-1,0 7 37 16.04 14.66 1.38 13.28 8.35 7.69 0.66 7.03 4.41 4.07 0.34 3.73
64 2 GM Dec Tree 31 19 35 entropy 7 40 a=5% lift (itmledratio, itm_to_led) 20,0,-1,0 7 36 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
65 2 GM Dec Tree 32 19 35 entropy 7 60 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
66 2 GM Dec Tree 33 19 35 entropy 7 80 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
67 2 GM Dec Tree 34 19 35 entropy 7 100 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
68 2 GM Dec Tree 35 19 35 entropy 7 150 a=5% lift 20,0,-1,0 5 37 14.53 13.08 1.45 11.63 7.75 7.19 0.56 6.63 4.36 4.29 0.07 4.22
64 2 GM Dec Tree 36 19 35 entropy 6 5 a=5% lift 20,0,-1,0 7 29 15.91 14.95 0.96 13.99 8.29 7.83 0.46 7.37 4.40 4.17 0.23 3.94
(ex = 20k, node smp = 30k)
65 2 GM Dec Tree 37 19 14 (raw only) entropy 6 5 a=5% lift 0 20,0,-1,0 7 16 13.92 11.81 2.11 9.69 7.46 6.54 0.93 5.61 4.24 3.91 0.33 3.57
(5.28 / 2.15 / 0.41 improvement gain in Conservative Lift from the new variables, vs. DecTree-d2-m19)
66 3 GM Dec Tree 38 19 45 entropy 8 5 a=5% lift xval = no 20,0,-5,1 3 39 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
67 3 GM Dec Tree 39 38 45 gini 8 5 a=5% lift xval = no 20,0,-5,1 3 71 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
68 3 GM Dec Tree 40 38 45 propchi 8 5 a=5% lift xval = no 20,0,-5,1 3 42 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
69 3 GM Dec Tree 41 38 45 entropy 20 5 a=5% lift subtr= 20,0,-5,1 33 91 20.00 14.81 5.19 9.61 10.00 7.54 2.46 5.08 5.00 3.90 1.10 2.80
70 3 GM Dec Tree 42 38 45 entropy 20 100 a=5% lift sub=lrg 20,0,-5,1 25 70 19.09 16.25 2.84 13.42 10.00 8.17 1.83 6.35 5.00 4.19 0.81 3.38
71 3 GM Dec Tree 43 38 45 entropy 20 200 a=5% lift sub=lrg 20,0,-5,1 23 64 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
72 3 GM Dec Tree 44 38 45 entropy 20 400 a=5% lift sub=lrg 20,0,-5,1 21 59 15.87 17.08 1.21 14.67 9.02 8.96 0.06 8.89 4.97 4.69 0.28 4.41
73 3 GM Dec Tree 45 38 45 entropy 20 800 a=5% lift sub=lrg 20,0,-5,1 16 52 14.35 16.16 1.81 12.53 8.46 8.96 0.50 7.96 4.78 4.79 0.01 4.78 (Data Version 3)
74 3 GM Dec Tree 46 38 45 entropy 20 1600 a=5% lift sub=lrg 20,0,-5,1 16 47 14.25 16.02 1.78 12.47 8.26 8.59 0.34 7.92 4.58 4.42 0.17 4.25
75 3 GM Dec Tree 47 38 45 entropy 20 3200 a=5% lift sub=lrg 20,0,-5,1 10 39 12.45 14.35 1.91 10.54 7.49 8.31 0.82 6.67 4.36 4.48 0.12 4.24
76 3 GM Dec Tree 48 43 45 entropy 20 150 a=5% lift sub=lrg 20,0,-5,1 23 68 18.57 16.25 2.32 13.93 10.00 8.14 1.86 6.27 5.00 4.17 0.83 3.34
77 3 GM Dec Tree 49 43 45 entropy 20 300 a=5% lift sub=lrg 20,0,-5,1 23 62 16.45 17.86 1.41 15.03 9.31 8.96 0.35 8.61 5.00 4.60 0.40 4.20
78 3 GM Dec Tree 50 43 45 entropy 20 250 a=5% lift sub=lrg 20,0,-5,1 24 65 16.64 17.71 1.07 15.57 9.56 8.96 0.60 8.36 5.00 4.61 0.39 4.21
79 3 GM Dec Tree 51 43 45 entropy 20 350 a=5% lift sub=lrg 20,0,-5,1 24 67 16.07 17.50 1.43 14.64 9.19 8.96 0.23 8.73 5.00 4.59 0.41 4.18
80 3 GM Dec Tree 52 43 45 entropy 20 225 a=5% lift sub=lrg 20,0,-5,1 23 63 17.85 16.67 1.18 15.49 9.83 8.96 0.87 8.09 5.00 4.53 0.48 4.05
81 3 GM Dec Tree 53 43 45 entropy 20 175 a=5% lift sub=lrg 20,0,-5,1 26 68 18.15 16.25 1.90 14.35 9.97 8.13 1.84 6.28 5.00 4.16 0.84 3.32
82 3 GM Dec Tree 54 43 45 entropy 20 200 a=5% lift sub=lrg 20,0,-5.0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
83 3 GM Dec Tree 55 43 45 entropy 20 200 a=5% lift sub=lrg 20,0,-1,0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
84 3 GM Dec Tree 56 43 45 entropy 20 200 a=5% lift sub=lrg 20,-5,0,0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
85 4 GM Dec Tree 57 43 146 entropy 20 200 a=5% lift sub=lrg 20,0,-5,1 9 149 20.00 14.09 5.91 8.19 10.00 7.20 2.80 4.40 5.00 3.76 1.24 2.51
86 4 GM Dec Tree 58 57 107 (tree settings the same, dropped INT* categorical vars, not DBC) 18 115 20.00 16.09 3.91 12.18 10.00 8.15 1.85 6.29 5.00 4.18 0.82 3.35
87 4 GM Dec Tree 59 57 107 entropy 20 500 a=5% lift sub=lrg 20,0,-5,1 13 110 19.46 14.79 4.68 10.11 10.00 7.64 2.36 5.29 5.00 3.95 1.05 2.91
88 4 GM Dec Tree 60 57 107 entropy 20 1000 a=5% lift sub=lrg 20,0,-5,1 10 89 18.94 14.47 4.47 10.00 10.00 7.44 2.56 4.88 5.00 3.86 1.14 2.73
89 4 GM Dec Tree 61 57 107 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 7 81 14.41 13.91 0.50 13.41 9.54 8.02 1.51 6.51 6.61 4.25 2.36 1.90
90 4 GM Dec Tree 62 57 107 entropy 20 3000 a=5% lift sub=lrg 20,0,-5,1 5 71 9.89 7.91 1.98 5.94 8.74 6.39 2.35 4.04 5.00 3.70 1.30 2.40
91 4 GM Dec Tree 63 57 107 entropy 20 1500 a=5% lift sub=lrg 20,0,-5,1 9 60 16.17 14.66 1.50 13.16 9.89 8.18 1.71 6.47 5.00 3.38 1.62 1.76 (Data Version 4)
92 4 GM Dec Tree 64 57 107 entropy 20 1750 a=5% lift sub=lrg 20,0,-5,1 7 60 15.23 14.32 0.92 13.40 9.68 8.07 1.61 6.46 5.00 4.26 0.75 3.51
93 4 GM Dec Tree 65 57 107 entropy 20 2250 a=5% lift sub=lrg 20,0,-5,1 5 60 15.43 11.00 4.43 6.56 9.55 6.30 3.25 3.05 5.00 3.70 1.30 2.40
94 4 GM Dec Tree 66 61 58 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 8 105 14.07 13.92 0.15 13.77 8.45 7.88 0.57 7.30 4.74 4.02 0.73 3.29
95 4 GM Dec Tree 67 61 80 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 8 97 14.25 13.94 0.30 13.64 9.25 7.88 1.37 6.51 5.00 4.25 0.75 3.49
96 4 GM Dec Tree 68 61 103 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 7 103 14.41 13.72 0.69 13.03 9.54 8.02 1.52 6.50 5.00 4.25 0.75 3.50
(Interactions are getting selected; they improve Trn results but decrease Val results. Perhaps I should regen the INT*dbc with a larger number of min records.)
97 4n GM Dec Tree 69 61 3 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,0 7 14.61 15.54 0.93 13.68 8.83 8.99 0.16 8.67 4.88 4.73 0.15 4.58
98 4n GM Dec Tree 70 0 20 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,0 10 11.50 11.12 0.38 10.74 7.08 7.29 0.21 6.87 4.24 3.94 0.30 3.64 (use RAW vars ONLY, to test the value of my preprocessing)
Rule Induction (block parameters: binary model | cleanup model | max num rips)
94 1 GM Rule Ind 1 0 tree neural 16 32 10.77 9.92 0.85 9.07 6.28 5.60 0.68 4.92 3.35 3.09 0.26 2.83
95 1 GM Rule Ind 2 1 regr neural 16 36 5.95 7.52 1.57 4.38 3.55 4.85 1.30 2.25 2.35 3.17 0.82 1.53
96 1 GM Rule Ind 3 1 neural tree 16 121 5.95 7.92 1.97 3.98 3.52 5.64 2.12 1.40 2.34 3.31 0.97 1.37
97 1 GM Rule Ind 4 3 neural tree 4 121 5.95 7.92 1.97 3.98 3.52 5.64 2.12 1.40 2.34 3.31 0.97 1.37
98 1 GM Rule Ind 5 3 neural tree 32 121 5.95 7.92 1.97 3.98 3.53 5.64 2.11 1.42 2.34 3.32 0.98 1.36
99 1 GM Rule Ind 6 1 tree neural 32 32 7.25 5.26 1.99 3.27 6.45 5.17 1.28 3.89 3.43 3.09 0.34 2.75
“Agile Software Design”
• Get something simple, fully working and tested early on (Data Version 1)
• Data Version 2…4: working, incremental improvements
– Incremental complexity
– Different preprocessing
– Add more fields, records
– Add & test more complexity
24. Model Notebook Process
Tracking Detail: Training the Data Miner
Columns: M cnt | Data Ver | Author | Algor | Mod Num | chng from prior | vars offered | criterion | max depth | leaf size | assess = 5% Lift | Decision Weight | Var Sel | Trn Time | then three metric groups, each reported as: Train | Val | Gap | Consrv Result
47 1 GM Dec Tree 1 0 27 default 6 5 20,0,-5,0 7 13 13.71 9.59 4.12 5.47 7.67 5.35 2.32 3.03 4.33 3.80 0.53 3.27
48 1 GM Dec Tree 2 1 27 probchisq 6 5 20,0,-5,0 7 16 13.71 9.59 4.12 5.47 7.67 5.35 2.32 3.03 4.33 3.80 0.53 3.27
49 1 GM Dec Tree 3 1 27 entropy 6 5 20,0,-5,0 6 16 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
50 1 GM Dec Tree 4 1 27 gini 6 5 20,0,-5,0 10 22 13.76 11.28 2.48 8.80 7.70 6.10 1.60 4.50 4.32 3.71 0.61 3.10
51 1 GM Dec Tree 5 3 27 entropy 12 5 20,0,-5,0 6 13 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
52 1 GM Dec Tree 6 3 27 entropy 6 10 20,0,-5,0 6 13 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
53 1 GM Dec Tree 7 3 27 entropy 6 100 20,0,-5,0 6 17 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
54 1 GM Dec Tree 8 3 27 entropy 6 100 xval = Y 20,0,-5,0 8 32 14.51 12.82 1.69 11.13 8.95 7.42 1.53 5.89 4.72 4.13 0.59 3.54
55 1 GM Dec Tree 9 3 27 entropy 6 5 xval = Y 20,0,-5,0 8 32 14.51 12.82 1.69 11.13 8.95 7.42 1.53 5.89 4.72 4.13 0.59 3.54
56 1 GM Dec Tree 10 3 27 entropy 6 5 obs import = Y 20,0,-5,0 6 17 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
57 1 GM Dec Tree 11 3 27 entropy 6 5 assess = 5% Lift 20,0,-5,0 6 12 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
58 1 GM Dec Tree 12 3 27 entropy 10 2 20,0,-5,0 6 12 13.94 12.62 1.32 11.30 7.49 7.09 0.40 6.69 4.27 4.09 0.18 3.91
46 2 GM Dec Tree 13 3 33 entropy 6 5 a=5% lift 20,0,-5,0 7 16 15.92 14.96 0.96 14.00 8.29 7.84 0.45 7.39 4.40 4.17 0.23 3.94
47 2 GM Dec Tree 14 13 33 entropy 6 5 a=5% lift 10,-2.5,-1,0 13 15 16.32 15.05 1.27 13.78 9.07 8.00 1.07 6.93 4.63 4.08 0.55 3.53
48 2 GM Dec Tree 15 13 33 entropy 6 5 a=5% lift 1,-1,1,-1 8 15 15.30 14.34 0.96 13.38 7.98 7.53 0.45 7.08 4.25 4.05 0.20 3.85
49 2 GM Dec Tree 16 13 33 entropy 6 5 a=5% lift 10,-1,1,-1 12 16 16.32 15.05 1.27 13.78 8.96 8.14 0.82 7.32 4.62 4.23 0.39 3.84
50 2 GM Dec Tree 17 13 33 entropy 6 5 a=5% lift 20,-5,0,0 12 15 16.32 15.60 0.72 14.88 8.79 8.26 0.53 7.73 4.47 4.21 0.26 3.95
51 2 GM Dec Tree 18 13 33 entropy 6 5 a=5% lift 20,-1,0,0 12 15 16.32 15.60 0.72 14.88 8.79 8.26 0.53 7.73 4.47 4.21 0.26 3.95
52 2 GM Dec Tree 19 13 33 entropy 6 5 a=5% lift xval = no 20,0,-1,0 6 15 15.87 15.52 0.35 15.17 8.26 8.12 0.14 7.98 4.40 4.32 0.08 4.24
53 2 GM Dec Tree 20 13 33 entropy 6 5 a=5% lift 20,-5,-1,1 12 16 16.32 15.05 1.27 13.78 8.96 8.14 0.82 7.32 4.62 4.23 0.39 3.84
54 2 GM Dec Tree 21 13 33 entropy 6 5 a=5% lift xval = no 20,0,0,1 9 16 16.17 15.57 0.60 14.97 8.74 8.25 0.49 7.76 4.44 4.21 0.23 3.98
55 2 GM Dec Tree 22 19 33 gini 6 5 a=5% lift 20,0,-1,0 8 16 15.17 13.17 2.00 11.17 8.02 7.32 0.70 6.62 4.40 4.26 0.14 4.12
56 2 GM Dec Tree 23 19 33 probchisq 6 5 a=5% lift 20,0,-1,0 8 16 15.17 13.17 2.00 11.17 8.02 7.32 0.70 6.62 4.40 4.26 0.14 4.12
57 2 GM Dec Tree 24 19 33 entropy 20 5 a=5% lift 20,0,-1,0 19 26 18.94 15.42 3.52 11.90 9.67 7.78 1.89 5.89 4.90 4.06 0.84 3.22
58 2 GM Dec Tree 25 19 33 entropy 20 20 a=5% lift 20,0,-1,0 19 26 18.94 13.80 5.14 8.66 9.67 7.78 1.89 5.89 4.90 4.06 0.84 3.22
59 2 GM Dec Tree 26 19 33 entropy 20 40 a=5% lift 20,0,-1,0 7 27 16.06 15.29 0.77 14.52 8.36 8.00 0.36 7.64 4.41 4.23 0.18 4.05
60 2 GM Dec Tree 27 19 33 entropy 20 60 a=5% lift 20,0,-1,0 7 27 16.06 15.29 0.77 14.52 8.36 8.00 0.36 7.64 4.41 4.23 0.18 4.05
61 2 GM Dec Tree 28 19 33 entropy 7 5 a=5% lift 20,0,-1,0 10 33 16.73 14.57 2.16 12.41 8.90 7.75 1.15 6.60 4.60 4.06 0.54 3.52
62 2 GM Dec Tree 29 19 33 entropy 7 10 a=5% lift 20,0,-1,0 10 33 16.73 14.57 2.16 12.41 8.90 7.75 1.15 6.60 4.60 4.06 0.54 3.52
63 2 GM Dec Tree 30 19 33 entropy 7 20 a=5% lift 20,0,-1,0 7 37 16.04 14.66 1.38 13.28 8.35 7.69 0.66 7.03 4.41 4.07 0.34 3.73
64 2 GM Dec Tree 31 19 35 entropy 7 40 a=5% lift itmledratio itm_to_led 20,0,-1,0 7 36 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
65 2 GM Dec Tree 32 19 35 entropy 7 60 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
66 2 GM Dec Tree 33 19 35 entropy 7 80 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
67 2 GM Dec Tree 34 19 35 entropy 7 100 a=5% lift 20,0,-1,0 6 35 15.90 15.36 0.54 14.82 8.28 8.03 0.25 7.78 4.40 4.27 0.13 4.14
68 2 GM Dec Tree 35 19 35 entropy 7 150 a=5% lift 20,0,-1,0 5 37 14.53 13.08 1.45 11.63 7.75 7.19 0.56 6.63 4.36 4.29 0.07 4.22
64 2 GM Dec Tree 36 19 35 entropy 6 5 a=5% lift 20,0,-1,0 7 29 15.91 14.95 0.96 13.99 8.29 7.83 0.46 7.37 4.40 4.17 0.23 3.94
(ex = 20k, node samp = 30k)
65 2 GM Dec Tree 37 19 14, raw only entropy 6 5 a=5% lift 0 20,0,-1,0 7 16 13.92 11.81 2.11 9.69 7.46 6.54 0.93 5.61 4.24 3.91 0.33 3.57
Improvement gain in Conservative Lift from new variables (vs. DecTree-d2-m19): 5.28, 2.15, 0.41
66 3 GM Dec Tree 38 19 45 entropy 8 5 a=5% lift xval = no 20,0,-5,1 3 39 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
67 3 GM Dec Tree 39 38 45 gini 8 5 a=5% lift xval = no 20,0,-5,1 3 71 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
68 3 GM Dec Tree 40 38 45 propchi 8 5 a=5% lift xval = no 20,0,-5,1 3 42 13.41 15.52 2.11 11.30 7.50 8.47 0.97 6.54 4.01 4.44 0.43 3.58
69 3 GM Dec Tree 41 38 45 entropy 20 5 a=5% lift subtr= 20,0,-5,1 33 91 20.00 14.81 5.19 9.61 10.00 7.54 2.46 5.08 5.00 3.90 1.10 2.80
70 3 GM Dec Tree 42 38 45 entropy 20 100 a=5% lift sub=lrg 20,0,-5,1 25 70 19.09 16.25 2.84 13.42 10.00 8.17 1.83 6.35 5.00 4.19 0.81 3.38
71 3 GM Dec Tree 43 38 45 entropy 20 200 a=5% lift sub=lrg 20,0,-5,1 23 64 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
72 3 GM Dec Tree 44 38 45 entropy 20 400 a=5% lift sub=lrg 20,0,-5,1 21 59 15.87 17.08 1.21 14.67 9.02 8.96 0.06 8.89 4.97 4.69 0.28 4.41
73 3 GM Dec Tree 45 38 45 entropy 20 800 a=5% lift sub=lrg 20,0,-5,1 16 52 14.35 16.16 1.81 12.53 8.46 8.96 0.50 7.96 4.78 4.79 0.01 4.78
74 3 GM Dec Tree 46 38 45 entropy 20 1600 a=5% lift sub=lrg 20,0,-5,1 16 47 14.25 16.02 1.78 12.47 8.26 8.59 0.34 7.92 4.58 4.42 0.17 4.25
75 3 GM Dec Tree 47 38 45 entropy 20 3200 a=5% lift sub=lrg 20,0,-5,1 10 39 12.45 14.35 1.91 10.54 7.49 8.31 0.82 6.67 4.36 4.48 0.12 4.24
76 3 GM Dec Tree 48 43 45 entropy 20 150 a=5% lift sub=lrg 20,0,-5,1 23 68 18.57 16.25 2.32 13.93 10.00 8.14 1.86 6.27 5.00 4.17 0.83 3.34
77 3 GM Dec Tree 49 43 45 entropy 20 300 a=5% lift sub=lrg 20,0,-5,1 23 62 16.45 17.86 1.41 15.03 9.31 8.96 0.35 8.61 5.00 4.60 0.40 4.20
78 3 GM Dec Tree 50 43 45 entropy 20 250 a=5% lift sub=lrg 20,0,-5,1 24 65 16.64 17.71 1.07 15.57 9.56 8.96 0.60 8.36 5.00 4.61 0.39 4.21
79 3 GM Dec Tree 51 43 45 entropy 20 350 a=5% lift sub=lrg 20,0,-5,1 24 67 16.07 17.50 1.43 14.64 9.19 8.96 0.23 8.73 5.00 4.59 0.41 4.18
80 3 GM Dec Tree 52 43 45 entropy 20 225 a=5% lift sub=lrg 20,0,-5,1 23 63 17.85 16.67 1.18 15.49 9.83 8.96 0.87 8.09 5.00 4.53 0.48 4.05
81 3 GM Dec Tree 53 43 45 entropy 20 175 a=5% lift sub=lrg 20,0,-5,1 26 68 18.15 16.25 1.90 14.35 9.97 8.13 1.84 6.28 5.00 4.16 0.84 3.32
82 3 GM Dec Tree 54 43 45 entropy 20 200 a=5% lift sub=lrg 20,0,-5.0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
83 3 GM Dec Tree 55 43 45 entropy 20 200 a=5% lift sub=lrg 20,0,-1,0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
84 3 GM Dec Tree 56 43 45 entropy 20 200 a=5% lift sub=lrg 20,-5,0,0 23 65 17.67 16.67 1.01 15.66 9.81 8.54 1.27 7.28 5.00 4.34 0.66 3.67
85 4 GM Dec Tree 57 43 146 entropy 20 200 a=5% lift sub=lrg 20,0,-5,1 9 149 20.00 14.09 5.91 8.19 10.00 7.20 2.80 4.40 5.00 3.76 1.24 2.51
86 4 GM Dec Tree 58 57 107 (tree settings the same, dropped INT* categorical vars, not DBC) 18 115 20.00 16.09 3.91 12.18 10.00 8.15 1.85 6.29 5.00 4.18 0.82 3.35
87 4 GM Dec Tree 59 57 107 entropy 20 500 a=5% lift sub=lrg 20,0,-5,1 13 110 19.46 14.79 4.68 10.11 10.00 7.64 2.36 5.29 5.00 3.95 1.05 2.91
88 4 GM Dec Tree 60 57 107 entropy 20 1000 a=5% lift sub=lrg 20,0,-5,1 10 89 18.94 14.47 4.47 10.00 10.00 7.44 2.56 4.88 5.00 3.86 1.14 2.73
89 4 GM Dec Tree 61 57 107 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 7 81 14.41 13.91 0.50 13.41 9.54 8.02 1.51 6.51 6.61 4.25 2.36 1.90
90 4 GM Dec Tree 62 57 107 entropy 20 3000 a=5% lift sub=lrg 20,0,-5,1 5 71 9.89 7.91 1.98 5.94 8.74 6.39 2.35 4.04 5.00 3.70 1.30 2.40
91 4 GM Dec Tree 63 57 107 entropy 20 1500 a=5% lift sub=lrg 20,0,-5,1 9 60 16.17 14.66 1.50 13.16 9.89 8.18 1.71 6.47 5.00 3.38 1.62 1.76
92 4 GM Dec Tree 64 57 107 entropy 20 1750 a=5% lift sub=lrg 20,0,-5,1 7 60 15.23 14.32 0.92 13.40 9.68 8.07 1.61 6.46 5.00 4.26 0.75 3.51
93 4 GM Dec Tree 65 57 107 entropy 20 2250 a=5% lift sub=lrg 20,0,-5,1 5 60 15.43 11.00 4.43 6.56 9.55 6.30 3.25 3.05 5.00 3.70 1.30 2.40
94 4 GM Dec Tree 66 61 58 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 8 105 14.07 13.92 0.15 13.77 8.45 7.88 0.57 7.30 4.74 4.02 0.73 3.29
95 4 GM Dec Tree 67 61 80 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 8 97 14.25 13.94 0.30 13.64 9.25 7.88 1.37 6.51 5.00 4.25 0.75 3.49
96 4 GM Dec Tree 68 61 103 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,1 7 103 14.41 13.72 0.69 13.03 9.54 8.02 1.52 6.50 5.00 4.25 0.75 3.50
Interactions are getting selected; they improve Trn results but decrease Val results. Perhaps I should regenerate the INT* dbc with a larger minimum number of records.
97 4n GM Dec Tree 69 61 3 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,0 7 14.61 15.54 0.93 13.68 8.83 8.99 0.16 8.67 4.88 4.73 0.15 4.58
98 4n GM Dec Tree 70 0 20 entropy 20 2000 a=5% lift sub=lrg 20,0,-5,0 10 11.50 11.12 0.38 10.74 7.08 7.29 0.21 6.87 4.24 3.94 0.30 3.64
Use RAW vars ONLY, to test the value of my preprocessing
Columns: M cnt | Data Ver | Author | Algor | Mod Num | chng from prior | binary model | cleanup model | max num rips | Var Sel | Trn Time | then three metric groups, each reported as: Train | Val | Gap | Consrv Result
94 1 GM Rule Ind 1 0 tree neural 16 32 10.77 9.92 0.85 9.07 6.28 5.60 0.68 4.92 3.35 3.09 0.26 2.83
95 1 GM Rule Ind 2 1 regr neural 16 36 5.95 7.52 1.57 4.38 3.55 4.85 1.30 2.25 2.35 3.17 0.82 1.53
96 1 GM Rule Ind 3 1 neural tree 16 121 5.95 7.92 1.97 3.98 3.52 5.64 2.12 1.40 2.34 3.31 0.97 1.37
97 1 GM Rule Ind 4 3 neural tree 4 121 5.95 7.92 1.97 3.98 3.52 5.64 2.12 1.40 2.34 3.31 0.97 1.37
98 1 GM Rule Ind 5 3 neural tree 32 121 5.95 7.92 1.97 3.98 3.53 5.64 2.11 1.42 2.34 3.32 0.98 1.36
99 1 GM Rule Ind 6 1 tree neural 32 32 7.25 5.26 1.99 3.27 6.45 5.17 1.28 3.89 3.43 3.09 0.34 2.75
• Can treat the model notebook table as meta-data (i.e. 144 records, or models)
• Train models on the meta-data
– Source vars = model parameters
– Target 1 = conservative result, or Target 2 = training time
• Perform sensitivity analysis to answer questions:
– Q) Which model training parameters lead to the best results?
– Q) …to the most training time?
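The meta-data idea above can be sketched in a few lines: treat each notebook row as a record, and rank training parameters by how strongly they track the Conservative Result. This is an illustrative stand-in for training a full meta-model; the rows and the correlation-based ranking are invented, with values loosely echoing the decision-tree table.

```python
# Hypothetical subset of notebook rows: training parameters -> Conservative Result.
notebook = [
    {"max_depth": 6,  "leaf_size": 5,    "consrv": 11.30},
    {"max_depth": 6,  "leaf_size": 100,  "consrv": 11.30},
    {"max_depth": 20, "leaf_size": 5,    "consrv": 9.61},
    {"max_depth": 20, "leaf_size": 200,  "consrv": 15.66},
    {"max_depth": 20, "leaf_size": 400,  "consrv": 14.67},
    {"max_depth": 20, "leaf_size": 3200, "consrv": 10.54},
]

def correlation(xs, ys):
    # Pearson correlation between a parameter column and the target column.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

target = [row["consrv"] for row in notebook]
ranking = sorted(
    ((abs(correlation([r[p] for r in notebook], target)), p)
     for p in ("max_depth", "leaf_size")),
    reverse=True,
)
for strength, param in ranking:
    print(f"{param}: |corr with Conservative Result| = {strength:.2f}")
```

A real meta-model (e.g. a tree on all training parameters) would also catch parameter interactions that a per-parameter correlation misses.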
25. Outline
Model Training Parameters in SAS Enterprise Miner
Tracking Conservative Results in a “Model Notebook”
How to Measure Progress
Meta-Gradient Search of Model Training Parameters
How to Plan and dynamically adapt
How to Describe Any Complex System – Sensitivity
26. Design Of Experiments (DOE)
Parameter Search
• Ideally, vary one parameter at a time and quantify the results
– A bigger challenge in BIG DATA, given the compute per model
• Exhaustive Grid Search, O(3^P)
– for Param A = Low, Med, High (test 3 settings)
– for Param B = Low, Med, High
– for Param C = Low, Med, High
– Easy to implement, but not the most efficient
– Can use a Fractional Factorial design (i.e. 10%)
– Scales less effectively for many parameters
• Stochastic Search (Genetic Algorithms), O(100^2)
– Directed Random Search is more efficient than Grid Search, but…
– Can be overkill in complexity: (100 models / generation) * (100s of generations)
• Taguchi Analysis (works with this DOE approach)
– Efficient multivariate orthogonal search
– Used to test landing pages with Offermatica (acquired by Omniture in 2007 for DOE)
– http://en.wikipedia.org/wiki/Taguchi_methods
– Does not use domain knowledge of parameter interactions – OPPORTUNITY
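The exhaustive grid and its fractional shortcut can be sketched as follows. `train_and_score` is a hypothetical stand-in for a real train/validate run, and the grid values loosely echo the decision-tree settings used earlier in the notebook.

```python
import itertools
import random

# Exhaustive grid: P = 3 parameters x 3 settings each = 3^P = 27 runs.
grid = {
    "criterion": ["entropy", "gini", "probchisq"],
    "max_depth": [6, 7, 20],
    "leaf_size": [5, 200, 2000],
}

def train_and_score(params):
    # Hypothetical stand-in for a full training run; a real version
    # would fit a model and return its Conservative Result.
    bonus = {"entropy": 1.0, "gini": 0.5, "probchisq": 0.4}[params["criterion"]]
    return bonus + 200.0 / (abs(params["leaf_size"] - 200) + 10) - 0.05 * params["max_depth"]

names = list(grid)
full = [dict(zip(names, combo)) for combo in itertools.product(*grid.values())]

# Fractional Factorial flavour: score only ~10% of the grid.
random.seed(0)
subset = random.sample(full, max(1, len(full) // 10))
best = max(subset, key=train_and_score)
print(f"{len(subset)} of {len(full)} grid points scored; best: {best}")
```

The full product illustrates why O(3^P) explodes: 10 parameters at 3 settings each would already be 59,049 training runs.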
27. Taguchi Design
• Not a full grid search
• Can we improve with experience and a heuristic process?
http://www.itl.nist.gov/div898/handbook/pri/section5/pri56.htm
http://www.jmp.com/support/downloads/pdf/jmp_design_of_experiments.pdf
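A minimal example of the orthogonal-array idea behind Taguchi designs, using the standard L4 array for three two-level factors; the balance check below is what lets 4 runs stand in for a full 2^3 = 8-run grid.

```python
from itertools import combinations, product

# L4 orthogonal array: 4 runs cover 3 two-level factors such that every
# PAIR of factors sees each of the 4 level combinations exactly once.
L4 = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

for c1, c2 in combinations(range(3), 2):
    pairs = {(run[c1], run[c2]) for run in L4}
    assert pairs == set(product((0, 1), repeat=2))  # balanced in every pair

print(f"{len(L4)} runs instead of {2 ** 3} for a full grid")
```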
28. Model Parameters
The algorithm searches the Model Parameters; the data miner runs a meta-search, a Design of Experiments (DOE) over your choices of Model Training Parameters.
Algorithm | Model Parameters | Model Training Parameters
Regression | weights | variable select (forward, step)
Neural net | weights | step size; learning rate
Decision Tree | (spend < $1000) | max depth; (Gini, Entropy)
29. Model Parameters vs. Model Training Parameters
The algorithm searches the Model Parameters; the data miner runs a meta-search, a Design of Experiments (DOE) over your choices of Model Training Parameters.
Algorithm | Model Parameters | Model Training Parameters
Regression | weights | variable select (forward, step)
Neural net | weights | step size; learning rate
Decision Tree | (spend < $1000) | max depth; (Gini, Entropy)
30. Heuristic Planning Your
Design of Experiments (DOE)
• Assumptions about Data Mining Project
– May be on BIG DATA, with practical constraints
– May be training 4 to 400 models (not 4000+ like GA)
– Want diversity, to investigate different algorithms
– Want to generalize process to future deployments
• Heuristic Strategies
– Use knowledge of interacting parameters (parallel tests)
• (Cost+profit weights) and (boosting weights) fight each other
– Delay searching compute-intensive parameters
• First stabilize most other “computationally reasonable” params
• e.g. large decision tree depth, neural nets w/ lots of connections
– Opportunistically spend time by algorithm success
31. Gradient Descent Numerical Methods
Searching to Find Minima
[Figure: a 2-D error surface over Weight Parameter 1 × Weight Param 2, shaded from High Error (hill tops, forest) down to Low Error (fields, beach, water, deep water), with several local minima marked “Min”]
32. Gradient Descent Numerical Methods
Searching to Find Minima
“Ski Down” from
the mountains to
Lake Tahoe
Moving = adjust param
X = starting position
M = a local minimum
[Figure: the same error surface over Weight Parameter 1 × Weight Param 2, now with X = the starting position high in the “forest” and M = the local minima reached by skiing downhill]
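The “ski down” metaphor in code: a toy bowl-shaped error surface (an assumption – the real surface comes from model training) descended by plain gradient steps until a minimum M is reached.

```python
# Toy error surface over two weight parameters; its single minimum sits
# at (3, -1). A real surface would come from the model being trained.
def error(w1, w2):
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2

def gradient(w1, w2):
    # Analytic partial derivatives of error() above.
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)

w1, w2 = 10.0, 10.0   # X = the starting position, high on the "mountain"
lr = 0.1              # step size: how far to "ski" per move
for _ in range(200):
    g1, g2 = gradient(w1, w2)
    w1, w2 = w1 - lr * g1, w2 - lr * g2   # move against the local slope

print(f"reached local minimum M near ({w1:.2f}, {w2:.2f})")
```

On a surface with several minima (as in the figure), where you land depends on the starting position X; that is why the next slides search over starting points too.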
33. Conservative Result with Respect to
Model Training Parameters
“Ski Down” from
the mountains to
Lake Tahoe
Moving = adjust param
X = starting position
M = a local minimum
[Figure: the same error-surface metaphor, with the axes now Model Parameter 1 × Model Param 2]
34. Heuristic Planning Your
Design of Experiments (DOE)
• Start with a reasonable default setting of parameters
– the “center of the daisy”: where the gradient is checked
• Vary one parameter at a time from the center
– “each petal of the daisy”: one gradient search trial
• Move to the next “reasonable multivariate start”
– the “stem of the daisy”: a steepest-descent step
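The daisy heuristic can be sketched as a coordinate search: evaluate the center, try one-parameter-at-a-time petals, then move the center along the best stem. Here `conservative_result` is a made-up stand-in for a full train-and-validate run, and the step sizes are illustrative.

```python
def conservative_result(params):
    # Hypothetical stand-in for train + validate; peaks near
    # max_depth = 7 and leaf_size = 60.
    return 15.0 - abs(params["max_depth"] - 7) - abs(params["leaf_size"] - 60) / 20.0

center = {"max_depth": 6, "leaf_size": 5}          # reasonable default
steps = {"max_depth": [-1, 1], "leaf_size": [-20, 20]}

for _ in range(10):                                # a few "stem" moves
    petals = []
    for param, deltas in steps.items():            # one petal per delta
        for d in deltas:
            trial = dict(center)
            trial[param] = max(1, trial[param] + d)
            petals.append(trial)
    best = max(petals, key=conservative_result)
    if conservative_result(best) <= conservative_result(center):
        break                                      # no petal improves: stop
    center = best                                  # move the daisy's center

print("settled on:", center)
```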
37. Heuristic “Meta-Gradient Search” of
Model Training Parameters
[Figure: daisy-pattern search trials over Parameter 1 × Parameter 2 converging on a minimum M, vs. a Taguchi DOE design]
Art vs. Science? No – a practical complement using existing numerical methods.
38. Heuristic “Meta-Gradient Search” of
Model Training Parameters
Mod Num | chng from prior | vars offered | criterion | max depth | leaf size
1 0 27 default 6 5
2 1 27 probchisq 6 5
3 1 27 entropy 6 5
4 1 27 gini 6 5
5 3 27 entropy 12 5
6 3 27 entropy 6 10
7 3 27 entropy 6 100
8 3 27 entropy 6 100
9 3 27 entropy 6 5
10 3 27 entropy 6 5
11 3 27 entropy 6 5
12 3 27 entropy 10 2
“Can you give a more tangible example? This sounds a bit vague.”
“Chng from prior” tracks the change from the “center of a daisy” (Model 1 or 3).
39. Heuristic “Meta-Gradient Search” of
Model Training Parameters
• After stabilizing most of the “fast” and “medium”
compute time parameters, search the “long compute
time settings”
• With the final parameter settings, if 2x or 10x more data
is available, perform a “final bake in,” long training run
• Then try Ensemble Methods
– Stacking, boosting, bagging – combining many of the best models
– Gradient Boosting over residual error
– Select models whose residual errors correlate the least
– Use a 2nd stage model to combine 1st stage models and top preprocessed fields (for context switching)
– Last year’s KDD Cup winners and the Netflix winners used Ensemble methods
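The residual-correlation selection rule can be sketched as follows. The three models’ predictions are invented toy values (`regr` is built to err much like the other two combined), and the 2nd-stage combiner is a simple average rather than a trained stacking model.

```python
from itertools import combinations

# Toy validation data and predictions from three 1st-stage models.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = {
    "tree":   [1.2, 1.8, 3.2, 3.8, 5.0],
    "neural": [1.1, 2.1, 2.9, 3.9, 5.0],
    "regr":   [1.3, 1.9, 3.1, 3.7, 5.0],  # errs like tree + neural combined
}

def residuals(p):
    return [yp - yt for yp, yt in zip(p, y_true)]

def corr(a, b):
    # Pearson correlation of two residual series.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Pick the pair whose residual errors correlate the LEAST...
pair = min(combinations(preds, 2),
           key=lambda ab: abs(corr(residuals(preds[ab[0]]),
                                   residuals(preds[ab[1]]))))
# ...and blend them with a trivial 2nd-stage average.
blend = [(preds[pair[0]][i] + preds[pair[1]][i]) / 2 for i in range(len(y_true))]
print("least-correlated pair:", pair)
```

Low residual correlation is the point: when two models err in different places, averaging cancels errors instead of reinforcing them.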
40. Outline
Model Training Parameters in SAS Enterprise Miner
Tracking Conservative Results in a “Model Notebook”
How to Measure Progress
Meta-Gradient Search of Model Training Parameters
How to Plan and dynamically adapt
How to Describe Any Complex System
Sensitivity Analysis
41. The Need to Describe the Forecast Algorithm
• Many Data Mining solutions need description
– So the check writer (SVP, owner, business unit, …) can do a business reality check before deployment
– For “what if” analysis, to fine-tune the larger system
• Feed Operations Research or Revenue Management systems
• Need a modeling “descriptive simulation” (political donations)
– When evaluating credit, the law requires offering 4 “reason codes” for each person scored – when they are declined
• Should the Data Miner cut algorithm choices?
– NO! “I understand how a bike works, but I drive a car to work”
– How much detailed understanding is needed?
– Provide enough info to “drive the car” vs. “build the car”
• The check writer does not need to understand a B-tree to buy SQL
42. Sensitivity Analysis
(OAT) One At a Time*
[Figure: (S) source fields feeding an Arbitrarily Complex Data Mining System, which produces the target field and a delta in the forecast]
• Present record N, S times, each time with one input 5% bigger (fixed input delta)
• Record the delta change in output, S times per record
• Aggregate: average(abs(delta)) – the target change per input field delta
• For source fields with binned ranges, sensitivity tells you the importance of the range, i.e. “low”, …, “high”
• Can put sensitivity values in Pivot Tables or Cluster them
• Record-level “reason codes” can be extracted from the most important bins that apply to the given record
*Some variants catch interactions
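The OAT procedure above, as a sketch: `black_box` is a hypothetical stand-in for the arbitrarily complex data mining system, and the records and field names are invented.

```python
def black_box(rec):
    # Hypothetical stand-in for an arbitrarily complex data mining system.
    return 3.0 * rec["income"] - 0.5 * rec["age"] + 0.01 * rec["zip_density"]

records = [
    {"income": 50.0, "age": 40.0, "zip_density": 1000.0},
    {"income": 80.0, "age": 25.0, "zip_density": 200.0},
]

fields = ["income", "age", "zip_density"]
sensitivity = {f: 0.0 for f in fields}
for rec in records:
    base = black_box(rec)
    for f in fields:                 # present the record S times...
        bumped = dict(rec)
        bumped[f] *= 1.05            # ...each time one input 5% bigger
        sensitivity[f] += abs(black_box(bumped) - base)
for f in fields:
    sensitivity[f] /= len(records)   # aggregate: average(abs(delta))

ranked = sorted(fields, key=sensitivity.get, reverse=True)
print("fields ranked by sensitivity:", ranked)
```

Because only the inputs and outputs are touched, the same probe works whether the system inside is a regression, a tree ensemble, or a neural net.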
43. Descriptions of Predictive Models
Reason Codes – Ranked by Sensitivity Analysis
• Reason codes are specific to the model and record
• Ranked predictive fields:
Ranked predictive field | Mr. Smith | Mr. Jones
max_late_payment_120d | 0 | 1
max_late_payment_90d | 1 | 0
bankrupt_in_last_5_yrs | 1 | 1
max_late_payment_60d | 0 | 0
• Mr. Smith’s reason codes include: max_late_payment_90d (1), bankrupt_in_last_5_yrs (1)
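Reason-code extraction, using the Mr. Smith / Mr. Jones table above: walk the sensitivity-ranked fields and report the ones that actually apply to the declined applicant. A minimal sketch, assuming binary 0/1 field values as on the slide.

```python
ranked_fields = [               # ranked by sensitivity analysis, most important first
    "max_late_payment_120d",
    "max_late_payment_90d",
    "bankrupt_in_last_5_yrs",
    "max_late_payment_60d",
]
applicants = {
    "Mr. Smith": {"max_late_payment_120d": 0, "max_late_payment_90d": 1,
                  "bankrupt_in_last_5_yrs": 1, "max_late_payment_60d": 0},
    "Mr. Jones": {"max_late_payment_120d": 1, "max_late_payment_90d": 0,
                  "bankrupt_in_last_5_yrs": 1, "max_late_payment_60d": 0},
}

def reason_codes(record, max_codes=4):
    # Keep rank order; report only fields that apply (value == 1),
    # up to the 4 codes credit law requires.
    return [f for f in ranked_fields if record[f] == 1][:max_codes]

for name, rec in applicants.items():
    print(name, "->", reason_codes(rec))
```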
44. Summary
• Conservative Result (How to Measure)
– Continuous metric to select accurate and general models
• Heuristic Meta-Gradient Search (How to Plan)
– An automated or human process to plan a Design of
Experiments (DOE)
– Searches the training parameters that a data miner adjusts
in data mining software (“meta-parameter search”)
– Heuristic DOE improvements
• Most systems can be “reasonably described”
– Focus on repeatable business benefit (accuracy) over
description or blind Occam’s Razor on a tech metric
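The Conservative Result metric appears to be computable from the notebook columns as the weaker of Train and Val minus the Train/Val gap; this formula is inferred from the tabulated rows, not stated on the slides, so treat it as a reconstruction.

```python
def conservative_result(train, val):
    # Take the weaker of Train and Val, then subtract the gap again
    # as an overfitting penalty (inferred from the notebook rows).
    gap = abs(train - val)
    return min(train, val) - gap

# Two rows from the model notebook: (Train, Val, Gap, Consrv Result).
assert round(conservative_result(13.71, 9.59), 2) == 5.47    # model 47, ver 1
assert round(conservative_result(13.41, 15.52), 2) == 11.30  # model 66, ver 3
print("a model must be accurate AND generalize (small gap) to score well")
```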
SF Bay ACM, Data Mining SIG, Feb 28, 2011
http://www.sfbayacm.org/?p=2464
Greg_Makowski@yahoo.com
www.LinkedIn.com/in/GregMakowski
Take Away: The process of going from design objectives to heuristic design