
Heuristic Design of Experiments with Meta-Gradient Search

Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?

* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* Strategically search over the model training parameters using a gradient descent approach
* Describe an arbitrarily complex predictive system using sensitivity analysis



  1. 1. 1 Heuristic Design of Experiments with Meta-Gradient Search of Model Training Parameters SF Bay ACM, Data Mining SIG, Feb 28, 2011 http://www.sfbayacm.org/?p=2464 Greg_Makowski@yahoo.com www.LinkedIn.com/in/GregMakowski
  2. 2. Choice is good… But can be overwhelming 2
  3. 3. Key Questions Discussed • You (a data miner) have many algorithms or libraries you can use, with many choices… – How to stay organized among all the choices? • Algorithm parameters • Adjustments in Cost vs. Profit (Type I vs. II error bias) • Metric selection (Lift if acting on top % vs. RMSE or ROC) • Ensemble Modeling, boosting, bagging, stacking • Data versions, preprocessing, trying new fields – How to plan, and learn as you go? – How simple should you stay? – balancing descriptiveness vs. Occam’s Razor 3
  4. 4. Outline Model Training Parameters in SAS Enterprise Miner Tracking Conservative Results in a “Model Notebook” How to Measure Progress Meta-Gradient Search of Model Training Parameters How to Plan and dynamically adapt How to Describe Any Complex System – Sensitivity 4
  5. 5. Enterprise Miner Sample Data Flow for a Project : 5 (Boxes are expanded in later slides) Learning Tuning Validation Stratified Sampling
  6. 6. Type I vs. II Error Weights, Profit-Loss Ratios 6 • In the Data Source, NOT the Model Engines • In other software, you may use a weight field • Need to stay organized regardless
  7. 7. Regression • It is always good to find the best linear solution early on – Like testing a null hypothesis: (linear vs. non-linear) problem • Can feed “score” or “residual error” as a source field into non-linear models 7
  8. 8. Neural Net Architecture and Parameters 8 (Figure: a scatter of two classes, “$” and “c”, plotted over field 1 and field 2. A neural net solution is “non-linear” and can carve out several regions which are not adjacent; labels mark MLP and RBF style decision regions.)
  9. 9. A Comparison of a Neural Net and Regression (with direct connect) 9
A logistic regression formula: Y = f( a0 + a1*X1 + a2*X2 + a3*X3 ), where a* are coefficients.
Backpropagation, cast in a similar form:
H1 = f( w0 + w1*I1 + w2*I2 + w3*I3 )
H2 = f( w4 + w5*I1 + w6*I2 + w7*I3 )
:
Hn = f( w8 + w9*I1 + w10*I2 + w11*I3 )
O1 = f( w12 + w13*H1 + .... + w15*Hn )
On = ....
w* are weights, AKA coefficients. I1..In are input nodes or input variables. H1..Hn are hidden nodes, which extract features of the data. O1..On are the outputs, which group disjoint categories. f() is the SIGMOID function, a non-linear “S” curve.
(Figure: network diagram with input, hidden, output and bias nodes. Speaker note: it is very noisy in the brain – chemical depletion of neurotransmitters.)
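To make the parallel concrete, here is a minimal sketch (mine, not from the deck) of the forward pass above in Python/NumPy, including an optional direct connection from inputs to output; the function and weight names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """The non-linear 'S' curve f() used by both logistic regression and the MLP."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out, w_direct=None):
    """One forward pass of a single-hidden-layer MLP.

    x        : (n_inputs,) input record I1..In
    W_hidden : (n_hidden, n_inputs) weights feeding the hidden nodes H1..Hn
    b_hidden : (n_hidden,) hidden-node bias terms
    w_out    : (n_hidden,) weights combining hidden nodes into the output O1
    b_out    : scalar output bias
    w_direct : optional (n_inputs,) "direct connect" weights, letting the net
               recover a plain logistic regression as a special case
    """
    h = sigmoid(W_hidden @ x + b_hidden)   # hidden features H1..Hn
    z = w_out @ h + b_out                  # combine the extracted features
    if w_direct is not None:
        z += w_direct @ x                  # linear shortcut from inputs to output
    return sigmoid(z)                      # output O1

# Logistic regression is the degenerate case with zero hidden nodes:
# y = sigmoid(a0 + a1*x1 + a2*x2 + a3*x3)
```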
  10. 10. Neural Net • Network → Architecture can be linear (MLP) or circular (many RBF) • Network → Direct Connection allows inputs to connect to the output (to find the simple, linear solution first) • Network → Hidden Units can go up to 64 (much better than 8) • Profit/Loss uses the settings in the Data Source 10
  11. 11. What does a Decision Tree Look Like? (Tree Depth = 2) 11
(Figure: a scatter of “$” and “c” records over Age and Income, partitioned by Split 1, Split 2 and Split 3 into Leaf 1..Leaf 4, shown next to the equivalent tree diagram.)
If (Age < Split1) then
  If (Income > Split2) then Leaf1 with dollar_avg1
  If (Income < Split2) then Leaf2 with dollar_avg2
If (Age > Split1) then
  If (Income > Split3) then Leaf3 with dollar_avg3
  If (Income < Split3) then Leaf4 with dollar_avg4
  12. 12. Decision Tree • Primary Parameters to vary – Criterion • Probchisq (Default) • Entropy • Gini – Assessment (Decision vs. Lift) – Tree size (depth, leaf size, Xvalid) 12
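As a rough open-source analogue (not part of the deck), the same knobs – splitting criterion, depth and leaf size – map onto scikit-learn's decision tree; the data set and values below are illustrative assumptions (scikit-learn has no ProbChisq criterion).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=27, random_state=42)

# Vary one training parameter at a time, as in the model notebook:
for criterion in ("gini", "entropy"):        # SAS EM also offers ProbChisq (default)
    for max_depth in (6, 12, 20):            # tree size: depth
        tree = DecisionTreeClassifier(
            criterion=criterion,
            max_depth=max_depth,
            min_samples_leaf=100,            # tree size: leaf size
            random_state=42,                 # keep experiments reproducible
        )
        tree.fit(X, y)
        print(criterion, max_depth, round(tree.score(X, y), 3))
```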
  13. 13. Gradient Boosting (Tree Based) Based on “Greedy Function Approximation: A Gradient Boosting Machine” by Jerome Friedman. Each new CART tree: • Is trained on a 60% random sample • Is a small, general tree • Forecasts the residual error of the summed forecast from all previous trees • The sequence may have 50 to 2,000 trees • Evaluate how far “back” in the sequence to prune 13
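A minimal sketch of these settings using scikit-learn's gradient boosting (an assumption – the talk uses SAS Enterprise Miner): subsample=0.6 gives the 60% random sample, shallow trees keep each stage small and general, and staged_predict lets you evaluate how far back in the sequence to prune.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=27, noise=10.0, random_state=42)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

gbm = GradientBoostingRegressor(
    n_estimators=2000,      # 50 to 2,000 trees in the sequence
    max_depth=3,            # each tree is small and general
    subsample=0.6,          # each tree is fit on a 60% random sample
    learning_rate=0.05,
    random_state=42,
).fit(X_trn, y_trn)

# Each new tree forecasts the residual of the summed forecast so far;
# walk the sequence to decide how far "back" to prune.
val_rmse = [np.sqrt(mean_squared_error(y_val, pred))
            for pred in gbm.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_rmse)) + 1
print("prune the sequence to", best_n_trees, "trees")
```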
  14. 14. DM Algorithms Available in Packages 14
Number of modules per forecasting family in DM software (each row is one software package; package names appear only in the original slide graphic):
Regression  Lasso Reg  Decision Tree  Neural Net  Support Vector Mach  Other  TOTAL
2           1          0              0           0                    1       4
0           0          1              0           0                    0       1
3           0          3              3           0                    3      12
1           0          1              0           1                    1       4
0           0          4              0           0                    0       4
3           2          5              3           2                    3      18
0           0          0              0           0                    5       5
  15. 15. Feel Overwhelmed by Lots of Complex Algorithm Parameters? GOOD! • A deep understanding of algorithms, math and assumptions helps significantly → Heuristics – e.g. regression typically has a problem with correlated inputs because the solution calculation uses matrix inversion (a concern if you worry about weight sign inversion) – SVMs or Bayesian Nets do not have this problem, because they are solved differently • Since they don’t have a problem with correlated inputs, input selection becomes more random – but you still get a decent solution • How can you manage the details? – I am glad you asked…. Moving on to the next section 15
  16. 16. Outline Model Training Parameters in SAS Enterprise Miner Tracking Conservative Results in a “Model Notebook” How to Measure Progress Meta-Gradient Search of Model Training Parameters How to Plan and dynamically adapt How to Describe Any Complex System – Sensitivity 16
  17. 17. Model Exploration Process • Scientific Method of Hypothesis → Test – If you change ONE thing, then any change in the results is because of that one change – Design of Experiments (DOE), test plan – Best to compare model settings on the same data version • New data versions add new preprocessed fields, or new months (records) – Key design objective: all experiments are reproducible • SAME random split between Learning – Test – Validation, with a consistent random seed – LTV split before loading data into a tool, so the same partitioning is used for all tools/libraries/algorithms
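A minimal sketch (my illustration, not from the deck) of a reproducible Learning–Tuning–Validation split done before the data is loaded into any tool, so every algorithm and library sees the same partitions; the file names and split fractions are assumptions.

```python
import numpy as np
import pandas as pd

def ltv_split(df, seed=20110228, frac_learn=0.6, frac_tune=0.2):
    """Assign every record to Learning / Tuning / Validation with a fixed seed,
    so the same partitioning can be reused across tools, libraries and algorithms."""
    rng = np.random.RandomState(seed)        # consistent random seed
    order = rng.permutation(len(df))
    n_learn = int(frac_learn * len(df))
    n_tune = int(frac_tune * len(df))
    partition = np.empty(len(df), dtype=object)
    partition[order[:n_learn]] = "learn"
    partition[order[n_learn:n_learn + n_tune]] = "tune"
    partition[order[n_learn + n_tune:]] = "validate"
    return df.assign(ltv_partition=partition)

# df = pd.read_csv("transit_scored_records.csv")     # hypothetical input file
# ltv_split(df).to_csv("transit_with_ltv.csv", index=False)
```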
  18. 18. Model Notebook 18
Input parameters and outcomes (Lift in Top 10%). Gap = Abs(Trn - Val); Consrv Result = conservative result.

Data Ver  Algor   Mod Num  vars offerd  var selct  Vars Seltd  Trn Time  Train  Val    Gap   Consrv Result
1         Regrsn  1        27           stepw      9           12        5.77   5.94   0.17  5.60

Data Ver  Algor   Mod Num  vars offerd  Hidn Nodes  Direct Conn  Arch  Vars Seltd  Trn Time  Train  Val    Gap   Consrv Result
1         Neural  1        27           3           n            MLP   all         77        6.65   10.89  4.24  2.41
1         Neural  2        27           10          n            MLP   all         40        6.88   6.73   0.15  6.58
1         Neural  3        27           10          Y            MLP   all         36        6.40   6.93   0.53  5.87
1         Neural  4        27           10          n            RBF   all         34        5.67   5.54   0.13  5.41
1         Neural  5        27           10          Y            RBF   all         35        5.95   7.92   1.97  3.98

(“Bad vs. Good” annotation: contrast Neural 1’s large gap with Neural 2’s small gap and higher conservative result.)
  19. 19. Model Notebook Outcome Details • My Heuristic Design Objectives: (yours may be different) – Accuracy in deployment – Reliability and consistent behavior, a general solution • Use one or more hold-out data sets to check consistency • Penalize more as the forecast becomes less consistent – No penalty for model complexity (if it validates consistently) • Let me drive a car to work, instead of limiting me to a bike – Message for the check writer – Don’t consider only Occam’s Razor: value consistent good results – Develop a “smooth, continuous metric” to sort and find models that perform “best” in future deployment 19
  20. 20. Model Notebook Outcome Details • Training = results on the training set • Validation = results on the validation hold-out • Gap = abs( Training – Validation ) – A bigger gap (volatility) is a bigger concern for deployment, a symptom. Minimize Senior VP heart attacks! (one penalty for volatility). Set expectations & meet expectations; regularization helps significantly • Conservative Result = worst( Training, Validation ) + Gap penalty – Corr / Lift / Profit → higher is better: Cons Result = min(Trn, Val) - Gap – MAD / RMSE / Risk → lower is better: Cons Result = max(Trn, Val) + Gap • Business Value or Pain ranking = function of( conservative result ) 20
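A small helper (my sketch of the metric defined above, not code from the talk) that computes the conservative result for either orientation of the metric:

```python
def conservative_result(train, val, higher_is_better=True):
    """Combine accuracy and generalization into one number for ranking models.

    Gap = abs(train - val) is the volatility penalty.
    For Corr / Lift / Profit (higher is better): min(train, val) - gap.
    For MAD / RMSE / Risk   (lower is better) : max(train, val) + gap.
    """
    gap = abs(train - val)
    if higher_is_better:
        return min(train, val) - gap
    return max(train, val) + gap

# Example from the notebook: Neural 1 vs. Neural 2 (lift in top 10%)
print(conservative_result(6.65, 10.89))   # -> 2.41, a volatile model
print(conservative_result(6.88, 6.73))    # -> 6.58, a consistent model
```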
  21. 21. Model Notebook 21 (The same notebook excerpt as slide 18: one regression and five neural net models on data version 1, with vars offered, hidden nodes, direct connection, architecture, variables selected, training time, and Train / Val / Gap / Conservative Result for lift in the top 10%.)
  22. 22. Model Notebook Process Tracking Detail → Training the Data Miner 22
(Figure: a full model-notebook spreadsheet for the Transit project, last updated 5/6/2010. Each row is one model run – logistic regression, regression, DM Regression, PLS, AutoNeural, Neural and DMNeural – recording Data Version, Author, Algorithm, Model Number, “change from prior” model, the input parameters varied (vars offered, variable selection, hidden nodes, direct connection, architecture, decay, decision weights), variables selected, training time, and Train / Val / Gap / Conservative Result for lift in the top 5%, 10% and 20%.)
More Heuristic Strategy:
1) Try a few models of many algorithm types (seed the search)
2) Opportunistically spend more effort on what is working (invest in top stocks)
3) Still try a few trials on medium success (diversify, limited by the project time-box)
4) Try ensemble methods, combining model forecasts & top source vars with the “Data Mining Battle Field” model
  23. 23. Model Notebook Process Tracking Detail → Training the Data Miner 23
(Figure: the decision tree and rule induction pages of the same model notebook – roughly 70 decision tree runs across data versions 1–4, varying criterion (probchisq, entropy, gini), max depth, leaf size, assessment, cross-validation, subtree and decision weights, plus six Rule Induction runs – each with Train / Val / Gap / Conservative Result for lift in the top 5%, 10% and 20%. Inline notes track findings such as “interactions are getting selected, improve Trn results but decrease Val results” and “use RAW vars ONLY, to test value of my preprocessing.”)
“Agile Software Design”:
Get something simple, fully working and tested early on (Data Version 1).
Data Versions 2…4: working, incremental improvements – incremental complexity, different preprocessing, add more fields and records, add & test more complexity.
  24. 24. Model Notebook Process Tracking Detail → Training the Data Miner 24
(Figure: the same decision tree and rule induction notebook pages as slide 23, repeated as the backdrop for the meta-data point below.)
Can treat the model notebook table as meta-data (i.e. 144 records, one per model).
Train models on the meta-data: source vars = model training parameters; Target 1 = conservative result, or Target 2 = training time.
Perform sensitivity analysis to answer questions:
Q) Searching which model training parameters leads to the best results?
Q) …to the most training time?
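A minimal sketch (my illustration, under assumed column names) of treating the notebook as meta-data: fit a model that predicts the conservative result from the training parameters, then read off which parameters matter most.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical export of the model notebook; file and column names are assumptions.
notebook = pd.read_csv("model_notebook.csv")
params = ["algor", "vars_offered", "criterion", "max_depth", "leaf_size",
          "hidden_nodes", "direct_conn", "arch"]

X = pd.get_dummies(notebook[params].astype(str))   # one-hot encode parameter settings
y = notebook["conservative_result"]                # or "train_time_sec" for Target 2

meta_model = RandomForestRegressor(n_estimators=500, random_state=42).fit(X, y)

# Which training parameters drive the best (or slowest) models?
importance = pd.Series(meta_model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```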
  25. 25. Outline Model Training Parameters in SAS Enterprise Miner Tracking Conservative Results in a “Model Notebook” How to Measure Progress Meta-Gradient Search of Model Training Parameters How to Plan and dynamically adapt How to Describe Any Complex System – Sensitivity 25
  26. 26. Design Of Experiments (DOE) Parameter Search • Ideally, vary one parameter at a time, and quantify the results – A bigger challenge on BIG DATA, given the compute per model • Exhaustive Grid Search, O(3^P) – for Param A = Low, Med, High (test 3 settings) – for Param B = Low, Med, High – for Param C = Low, Med, High – easy to implement, not the most efficient – Can use a Fractional Factorial design (i.e. 10%) • Scales less effectively for many parameters • Stochastic Search (Genetic Algorithms), O(100^2) – Directed random search is more efficient than grid search, but… – Can be overkill in complexity: (100 models / generation) * (100’s of generations) • Taguchi Analysis (works with this DOE approach) – Efficient multivariate orthogonal search – test landing pages w/ Offermatica (acquired by Omniture in 2007 for DOE) – http://en.wikipedia.org/wiki/Taguchi_methods – Does not use domain knowledge of parameter interactions - OPPORTUNITY
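For concreteness, a sketch (mine, not from the slides) of an exhaustive O(3^P) grid over three parameters, with an optional fractional-factorial thinning to roughly 10% of the cells:

```python
import itertools
import random

grid = {
    "criterion": ["gini", "entropy", "probchisq"],   # Param A: 3 settings
    "max_depth": [6, 12, 20],                        # Param B: 3 settings
    "leaf_size": [5, 100, 800],                      # Param C: 3 settings
}

full_factorial = [dict(zip(grid, combo))
                  for combo in itertools.product(*grid.values())]   # 3**3 = 27 runs

random.seed(42)
fraction = 0.10                                       # fractional factorial, i.e. ~10%
subset = random.sample(full_factorial, max(1, int(fraction * len(full_factorial))))

for settings in subset:
    print("train a model with", settings)             # plug in your modeling tool here
```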
  27. 27. Taguchi Design • Not a full grid search • Can we improve with experience and a heuristic process? 27 http://www.itl.nist.gov/div898/handbook/pri/section5/pri56.htm http://www.jmp.com/support/downloads/pdf/jmp_design_of_experiments.pdf
  28. 28. Model Parameters 28
The algorithm searches the model parameters; the data miner’s meta-search – a Design of Experiments (DOE) over your choices – searches the model training parameters.
Algorithm      | Model Parameters (algorithm searches) | Model Training Parameters (meta-search by a data miner)
Regression     | weights                               | variable selection (forward, stepwise)
Neural net     | weights                               | step size; learning rate
Decision Tree  | (spend < $1000)                       | max depth; (Gini, Entropy)
  29. 29. Model Parameters vs. Model Training Parameters 29 (Repeats the table from slide 28, contrasting the model parameters the algorithm searches – weights, split rules such as “spend < $1000” – with the model training parameters the data miner’s DOE meta-search covers – variable selection, step size / learning rate, max depth, splitting criterion.)
  30. 30. Heuristic Planning Your Design of Experiments (DOE) • Assumptions about the Data Mining Project – May be on BIG DATA, with practical constraints – May be training 4 to 400 models (not 4,000+ like a GA) – Want diversity, to investigate different algorithms – Want to generalize the process to future deployments • Heuristic Strategies – Use knowledge of interacting parameters (parallel tests) • (Cost + profit weights) and (boosting weights) fight each other – Delay searching compute-intensive parameters (e.g. large decision tree depth, neural nets w/ lots of connections) • First stabilize most other “computationally reasonable” params – Opportunistically spend time by algorithm success 30
  31. 31. Gradient Descent – Numerical Methods Searching to Find Minima 31 (Figure: an error surface over Weight Parameter 1 and Weight Parameter 2, colored from high error at the hill tops – forest, fields – down through beach and water to deep water at low error, with several local minima marked “Min”.)
  32. 32. Gradient Descent – Numerical Methods Searching to Find Minima 32 “Ski Down” from the mountains to Lake Tahoe: moving = adjusting a parameter, X = the starting position, M = a local minimum. (Figure: the same error surface over Weight Parameter 1 and Weight Parameter 2, with a path from X down to a local minimum M.)
  33. 33. Conservative Result with Respect to Model Training Parameters 33 The same “ski down” picture, but now the axes are Model Training Parameter 1 and Model Training Parameter 2: moving = adjusting a training parameter, X = the starting (default) settings, M = a local minimum.
  34. 34. Heuristic Planning Your Design of Experiments (DOE) • Start with a reasonable default setting of parameters – the “center of the daisy” → the gradient check • Vary one parameter at a time from the center – “each petal of the daisy” → a gradient search trial • Move to the next “reasonable multivariate start” – the “stem of the daisy” → steepest descent 34
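A sketch of that daisy pattern as a loop (my paraphrase of the heuristic, not an official algorithm): evaluate the center, try one-at-a-time “petals”, then step the center along the best petal and repeat. Here train_and_score is a stand-in for training a model and returning its conservative result.

```python
def meta_gradient_search(train_and_score, grid, center, n_steps=5):
    """Daisy-pattern search over model training parameters.

    train_and_score(settings) -> conservative result (higher is better)
    grid   : dict, param -> ordered list of candidate settings
    center : dict, param -> starting ("default") setting, the center of the daisy
    """
    best_score = train_and_score(center)            # the gradient check
    for _ in range(n_steps):
        petals = []
        for param, values in grid.items():          # one petal per one-parameter change
            for value in values:
                if value == center[param]:
                    continue
                trial = dict(center, **{param: value})
                petals.append((train_and_score(trial), trial))
        top_score, top_trial = max(petals, key=lambda p: p[0])
        if top_score <= best_score:                 # no petal improves: stop searching
            break
        best_score, center = top_score, top_trial   # the stem: step to the best petal
    return center, best_score
```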
  35. 35. Heuristic “Meta-Gradient Search” of Model Training Parameters 35 (Figure: the error surface over Parameter 1 and Parameter 2, from high error to low error, with the daisy pattern of one-at-a-time trials stepping toward the minimum M.)
  36. 36. Heuristic “Meta-Gradient Search” of Model Training Parameters 36 (Figure: the same error surface over Parameter 1 and Parameter 2, with the search continued from the next daisy center toward the minimum M.)
  37. 37. Heuristic “Meta-Gradient Search” of Model Training Parameters 37 (Figure: the meta-gradient search path over Parameter 1 and Parameter 2 toward M, shown vs. a Taguchi DOE layout.) Art vs. Science? No, a practical complement using existing numerical methods.
  38. 38. Heuristic “Meta-Gradient Search” of Model Training Parameters 38
Can you give a more tangible example? This sounds a bit vague. “Change from prior” tracks the change from the “center of a daisy” (Model 1 or 3):
Mod Num  chng from prior  vars offered  criterion  max depth  leaf size
1        0                27            default    6          5
2        1                27            probchisq  6          5
3        1                27            entropy    6          5
4        1                27            gini       6          5
5        3                27            entropy    12         5
6        3                27            entropy    6          10
7        3                27            entropy    6          100
8        3                27            entropy    6          100
9        3                27            entropy    6          5
10       3                27            entropy    6          5
11       3                27            entropy    6          5
12       3                27            entropy    10         2
  39. 39. Heuristic “Meta-Gradient Search” of Model Training Parameters • After stabilizing most of the “fast” and “medium” compute-time parameters, search the “long compute time” settings • With the final parameter settings, if 2x or 10x more data is available, perform a “final bake in” long training run • Then try Ensemble Methods – Stacking, boosting, bagging: combining many of the best models – Gradient Boosting over residual error – Select models whose residual errors correlate the least – Use a 2nd-stage model to combine 1st-stage models and top preprocessed fields (for context switching) – Last year’s KDD Cup winners and the Netflix Prize winners used ensemble methods
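A minimal stacking sketch (an assumption – scikit-learn rather than the SAS tooling in the talk): a 2nd-stage model combines the 1st-stage model forecasts, and passthrough=True also feeds it the (top preprocessed) source fields for context switching.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=3000, n_features=27, noise=15.0, random_state=42)

first_stage = [
    ("gbm", GradientBoostingRegressor(subsample=0.6, random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=300, random_state=42)),
    ("neural", MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=42)),
]

stack = StackingRegressor(
    estimators=first_stage,
    final_estimator=Ridge(),   # 2nd-stage model combining the 1st-stage forecasts
    passthrough=True,          # also pass the (top preprocessed) source fields through
    cv=5,
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```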
  40. 40. Outline Model Training Parameters in SAS Enterprise Miner Tracking Conservative Results in a “Model Notebook” How to Measure Progress Meta-Gradient Search of Model Training Parameters How to Plan and dynamically adapt How to Describe Any Complex System Sensitivity Analysis 40
  41. 41. The Need to Describe the Forecast Algorithm • Many data mining solutions need description – For the check writer (SVP, owner, business unit, …): a business reality check before deployment – “What if” analysis, to fine-tune a larger system • Feed Operations Research or Revenue Management systems – Need a modeling “descriptive simulation” (e.g. political donations) – When evaluating credit, you are required by law to offer 4 “reason codes” for each person scored – when they are declined • Should the data miner cut algorithm choices? – NO! “I understand how a bike works, but I drive a car to work” – how much detailed understanding is needed? – Provide enough info to “drive the car” vs. “build the car” • The check writer does not need to understand a B-tree to buy SQL 41
  42. 42. Sensitivity Analysis (OAT) – One At a Time* 42 (*some variants catch interactions)
(Figure: source fields feeding an arbitrarily complex data mining system, with S source fields and one target field forecast.)
Procedure: present record N, S times, each time with one input 5% bigger (a fixed input delta). Record the delta change in the output, S times per record. Aggregate: average(abs(delta)) – the target change per input field delta.
For source fields with binned ranges, sensitivity tells you the importance of the range, i.e. “low”, …, “high”. Sensitivity values can be put into pivot tables or clusters. Record-level “reason codes” can be extracted from the most important bins that apply to the given record.
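A sketch of that OAT procedure (my illustration; model.predict is a stand-in for any arbitrarily complex scoring function, and X is assumed to be a numeric array):

```python
import numpy as np

def oat_sensitivity(predict, X, bump=0.05):
    """One-At-a-Time sensitivity for an arbitrarily complex model.

    For each record, present it S times (S = number of source fields),
    each time with one input 5% bigger, and record the delta in the forecast.
    Returns average(abs(delta)) per input field, plus the per-record deltas
    (useful for record-level reason codes).
    """
    base = predict(X)                              # (N,) baseline forecasts
    n_records, n_fields = X.shape
    deltas = np.zeros((n_records, n_fields))
    for j in range(n_fields):                      # one field at a time
        X_bumped = X.copy()
        X_bumped[:, j] *= (1.0 + bump)             # fixed 5% input delta
        deltas[:, j] = predict(X_bumped) - base    # delta change in the output
    return np.abs(deltas).mean(axis=0), deltas     # aggregate, and record level

# sensitivity, record_deltas = oat_sensitivity(model.predict, X_validation)
```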
  43. 43. Descriptions of Predictive Models – Reason Codes Ranked by Sensitivity Analysis 43
• Reason codes are specific to the model and record
• Ranked predictive fields:
  Field                    Mr. Smith  Mr. Jones
  max_late_payment_120d    0          1
  max_late_payment_90d     1          0
  bankrupt_in_last_5_yrs   1          1
  max_late_payment_60d     0          0
• Mr. Smith’s reason codes include: max_late_payment_90d = 1, bankrupt_in_last_5_yrs = 1
  44. 44. Summary • Conservative Result (How to Measure) – Continuous metric to select accurate and general models • Heuristic Meta-Gradient Search (How to Plan) – An automated or human process to plan a Design of Experiments (DOE) – Searches the training parameters that a data miner adjusts in data mining software (“meta-parameter search”) – Heuristic DOE improvements • Most systems can be “reasonably described” – Focus on repeatable business benefit (accuracy) over description or blind Occam’s Razor on a tech metric 44 SF Bay ACM, Data Mining SIG, Feb 28, 2011 http://www.sfbayacm.org/?p=2464 Greg_Makowski@yahoo.com www.LinkedIn.com/in/GregMakowski Take Away: The process of going from design objectives to heuristic design
