# Blanka Láng, László Kovács and László Mohácsi: Linear regression model selection using a hybrid genetic-improved harmony search parallelized algorithm


Presentation on the OGIK 2016 Conference, Nov 11-12 2016. Corvinus University of Budapest, Institute of Information Technology.



1. Linear Regression Model Selection using a hybrid genetic-improved harmony search parallelized algorithm. Blanka Láng, László Kovács, László Mohácsi. Corvinus University of Budapest, Institute of Information Technology.
2. Contents: Linear Regression Model Selection Problem; Datasets Used; Performance of Selection Algorithms on Our Data; The Need for a New Solution; The Performance of Our Hybrid Algorithm.
3. Linear Regression. We have: $Y$, the dependent variable, and $X = X_1, X_2, \dots, X_m$, the vectors of independent variables. Goal: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon$. OLS model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_m X_m = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{\beta}_j X_j$. Parsimony: choose $X' \subseteq X$ so as to minimize the residuals while using as few independents as possible, which maximizes the model's ability to generalize. Partial effects of independents: only significant variables belong in the model, and these hypotheses can be statistically tested. Objective functions: AIC, SBC and HQC are minimized; the adjusted $R^2$ is maximized.
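As a minimal sketch of how a single candidate subset is scored under these objectives (the authors' implementation is in C#; this Python version uses a common per-observation form of AIC and SBC, which may differ from the exact variant behind the slides):

```python
import numpy as np

def evaluate_subset(X, y, subset):
    """Fit OLS on a candidate column subset and score it.

    AIC/SBC here use the per-observation form log(SSE/n) + penalty/n;
    both are minimized, while adjusted R^2 is maximized.
    """
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, subset]])   # intercept + chosen columns
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    k = Xs.shape[1]                                    # parameters incl. intercept
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    aic = np.log(sse / n) + 2 * k / n                  # minimize
    sbc = np.log(sse / n) + k * np.log(n) / n          # minimize
    return aic, sbc, adj_r2
```

A model-selection metaheuristic then just searches over the $2^m$ possible subsets for the one with the best score.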
4. Dataset #1: Body Fat Measurements, a real dataset from 1996. $n = 252$. $Y$: percent of body fat to muscle tissue. $m = 16$ independents (age, abdomen circumference, weight, height, etc.). Multicollinearity: redundancy between independents. E.g.: which of two correlated independents matters most when predicting $Y$? How can we interpret the partial effects of these independents? Measure: regress the independents on each other, yielding a VIF indicator for each independent; if VIF > 2, multicollinearity is present.
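The VIF measure described above can be sketched directly from its definition (illustrative Python, not the SPSS computation used in the study):

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the remaining columns.

    VIF_j = 1 / (1 - R_j^2); the slides flag multicollinearity when
    any VIF exceeds 2 (a stricter cut-off than the common 5 or 10).
    """
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out
```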
5. Dataset #2: DATA26, a dataset simulated from a Gumbel copula. $n = 1000$, $m = 25$ (plus $Y$). Generating a correlation matrix (CM) with high correlations in absolute value: the vineBeta method (Lewandowski et al., 2009). Simulating multicollinearity: all 26 generated variables follow $N(\mu, \sigma)$ distributions, where $\mu$ and $\sigma$ are randomly generated for each variable.
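A simulation of this kind can be sketched as follows: given any valid correlation matrix (the slides build theirs with the vineBeta method of Lewandowski et al., 2009, which is not reproduced here), draw correlated standard normals via a Cholesky factor and rescale each column by its own random mean and standard deviation. The function name and the ranges for $\mu$ and $\sigma$ are illustrative assumptions:

```python
import numpy as np

def simulate_dataset(C, n=1000, seed=0):
    """Draw n rows with correlation matrix C, then give each column its
    own random mean and standard deviation, mimicking the DATA26 setup."""
    rng = np.random.default_rng(seed)
    m = C.shape[0]
    L = np.linalg.cholesky(C)               # C = L @ L.T
    Z = rng.standard_normal((n, m)) @ L.T   # correlated standard normals
    mu = rng.uniform(-5, 5, size=m)         # per-variable mean (assumed range)
    sigma = rng.uniform(0.5, 3, size=m)     # per-variable st. dev. (assumed range)
    return Z * sigma + mu
```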
6. Performance of Selection Algorithms – FAT

| Algorithm | AIC | SBC | Adj. R² | Runtime (sec) | St Dev (sec) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -2,013 (Variables: 1) | -1,987 (Variables: 1) | 0,9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 4,558 | 0,878 |
| Best Subsets (Minerva: GARS) | -2,013 (Variables: 1) | -1,987 (Variables: 1) | 0,9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 5,921 | 1,658 |
| improved GARS | -2,013 (Variables: 1) | -1,987 (Variables: 1) | 0,9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 11,268 | 2,941 |
| IHSRS | -2,013 (Variables: 1) | -1,987 (Variables: 1) | 0,9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0,968 | 0,188 |
| Forward+Backward | 0,058 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0,239 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0,9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0,976 | 0,050 |
| Variable Importance in Projection (Partial Least Squares) | -0,247 (Variables: 1, 2, 5, 6, 8, 9) | -0,092 (Variables: 1, 2, 5, 6, 8, 9) | 0,9618 (Variables: 1, 2, 5, 6, 8, 9) | 1,807 | 0,896 |
| Elastic Net | -2,013 (Variables: 1) | -1,987 (Variables: 1) | 0,9410 (Variables: 1) | 50,858 | 9,019 |
| Stepwise VIF Selection | -0,189 (Variables: 1, 2, 15) | -0,008 (Variables: 1, 2, 15) | 0,954 (Variables: 1, 2, 15) | 0,832 | 0,034 |
| Nested Estimate Procedure | -1,402 (Variables: 1, 8) | -1,351 (Variables: 1, 8) | 0,9538 (Variables: 1, 8) | 0,352 | 0,047 |
7. Performance of Selection Algorithms – DATA26

| Algorithm | AIC | SBC | Adj. R² | Runtime (sec) | St Dev (sec) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -8,840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8,756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0,9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 32,352745 | 7,04028 |
| Best Subsets (Minerva: GARS) | -8,841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | -8,826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0,9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 52,714638 | 12,62692 |
| improved GARS | -8,731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8,826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0,99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 1281,45823 | 380,10328 |
| IHSRS | -8,731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8,826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0,99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 402,1666233 | 79,070735 |
| Forward+Backward | -8,840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8,756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0,9999944 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 1,0744 | 0,0937 |
| Variable Importance in Projection (Partial Least Squares) | -5,196 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | -5,132 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 0,99979 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 15,095273 | 7,19626 |
| Elastic Net | -4,363 (Full model; not significant: X5, X13) | -4,240 (Full model; not significant: X5, X13) | 0,993 (Full model; not significant: X5, X13) | 478,683794 | 99,82244 |
| Stepwise VIF Selection | 0,434 (Variables: X6, X10, X16, X17, X19, X24) | 0,464 (Variables: X6, X10, X16, X17, X19, X24) | 0,940 (Variables: X6, X10, X16, X17, X19, X24) | 0,93415 | 0,02986 |
| Nested Estimate Procedure | 0,760 (Variables: X10, X15, X23, X24) | 0,780 (Variables: X10, X15, X23, X24) | 0,917 (Variables: X10, X15, X23, X24) | 0,39289 | 0,0533 |
8. Problem with the results. Collinearity statistics (Tolerance, VIF) for the optimal solutions of IHSRS for $\bar{R}^2$:

FAT:

| Model | Tolerance | VIF |
|---|---|---|
| X1 | 0,069 | 14,490 |
| X3 | 0,017 | 59,097 |
| X5 | 0,089 | 11,271 |
| X6 | 0,030 | 33,682 |
| X8 | 0,105 | 9,540 |
| X12 | 0,239 | 4,182 |
| X15 | 0,399 | 2,509 |

DATA26:

| Model | Tolerance | VIF |
|---|---|---|
| X1 | 0,065 | 15,347 |
| X4 | 0,001 | 1644,939 |
| X5 | 0,003 | 388,860 |
| X6 | 0,002 | 538,248 |
| X8 | 0,005 | 197,505 |
| X10 | 0,050 | 20,165 |
| X12 | 0,001 | 1366,452 |
| X13 | 0,030 | 33,293 |
| X15 | 0,001 | 1133,939 |
| X16 | 0,048 | 20,828 |
| X17 | 0,041 | 24,297 |
| X18 | 0,016 | 64,340 |
| X21 | 0,003 | 393,569 |
| X23 | 0,002 | 554,800 |
| X24 | 0,004 | 262,232 |
| X25 | 0,001 | 825,023 |
9. Modify the IHSRS: include an "all VIFs < 2" condition in the optimization task. Optimal solutions of IHSRS with VIF conditions:

FAT ($\bar{R}^2 = 0{,}9854$):

| Model | Tolerance | VIF |
|---|---|---|
| X1 | 0,508 | 1,970 |
| X2 | 0,879 | 1,138 |
| X8 | 0,558 | 1,791 |

DATA26 ($\bar{R}^2 = 0{,}991$):

| Model | Tolerance | VIF |
|---|---|---|
| X2 | 0,503 | 1,986 |
| X6 | 0,548 | 1,825 |
| X10 | 0,500 | 1,999 |
| X14 | 0,526 | 1,902 |
| X23 | 0,565 | 1,770 |

Other models with all VIF values smaller than 2: Backward-VIF: $\bar{R}^2$ = 0,9540 (FAT); 0,940 (DATA26). Nested Estimates: $\bar{R}^2$ = 0,9538 (FAT); 0,917 (DATA26).
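One way the VIF condition can enter the objective is by rejecting infeasible candidates outright. This sketch scores a subset by adjusted $R^2$ but returns $-\infty$ when any selected column has VIF ≥ 2; the slide does not say whether the authors used rejection or some other penalty, so this is an illustrative assumption:

```python
import numpy as np

def constrained_score(X, y, subset):
    """Adjusted R^2 of an OLS fit on `subset`, or -inf if any selected
    column has VIF >= 2 among the selected columns (infeasible)."""
    n = len(y)
    Xs = X[:, subset]
    if len(subset) > 1:
        # VIF check: regress each selected column on the other selected columns
        for j in range(Xs.shape[1]):
            others = np.column_stack([np.ones(n), np.delete(Xs, j, axis=1)])
            b, *_ = np.linalg.lstsq(others, Xs[:, j], rcond=None)
            r = Xs[:, j] - others @ b
            r2j = 1 - (r @ r) / ((Xs[:, j] - Xs[:, j].mean()) ** 2).sum()
            if 1.0 / (1.0 - r2j) >= 2:
                return float("-inf")   # reject multicollinear subsets
    A = np.column_stack([np.ones(n), Xs])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    r2 = 1 - (r @ r) / ((y - y.mean()) ** 2).sum()
    k = A.shape[1]
    return 1 - (1 - r2) * (n - 1) / (n - k)
```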
10. A Great Setback for the modified IHSRS. [Four bar charts compare IHSRS without VIF vs. IHSRS with VIF on FAT and on DATA26: average solution time and standard deviation of solution times, measured both in number of steps and in seconds.] Average runtime is almost an hour!
11. We cannot parallelize the IHSRS. An individual (melody) is a bit vector, e.g. ● = 0 0 1 0 1 1 1; the population (harmony memory) is a set of such individuals: ● ● ● ●.
- Steps 1 & 2: generate a random harmony memory and evaluate the regressions for each individual.
- With probability HMCR, build the new individual ● from the harmony memory; with probability 1-HMCR, generate a RANDOM individual.
- With probability PAR, mutate ● with mutation probability bw; otherwise leave ● unmodified.
- Increase PAR and decrease bw.
- Is the new ● better than the worst individual? If YES, replace the worst individual.
- If the termination criterion is met, STOP; otherwise repeat.
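The loop above can be sketched as follows for binary variable selection. This is an illustrative Python rendering, not the authors' C# IHSRS: parameter names, the PAR schedule, and default values are assumptions, and `fitness` stands in for the regression-based objective:

```python
import random

def ihs_select(m, fitness, hm_size=10, hmcr=0.9, par0=0.3, par1=0.5,
               max_iter=2000, seed=42):
    """Improved-harmony-search sketch over 0/1 tuples of length m.

    `fitness` scores a candidate (higher is better, e.g. adjusted R^2).
    Each iteration builds ONE new harmony and may replace the worst
    member of memory -- this serial replace step is what blocks
    parallelization.
    """
    rng = random.Random(seed)
    hm = [tuple(rng.randint(0, 1) for _ in range(m)) for _ in range(hm_size)]
    scores = [fitness(h) for h in hm]
    for it in range(max_iter):
        par = par0 + (par1 - par0) * it / max_iter   # PAR grows over time
        new = []
        for j in range(m):
            if rng.random() < hmcr:                  # take bit from memory...
                bit = rng.choice(hm)[j]
                if rng.random() < par:               # ...maybe pitch-adjust (flip)
                    bit = 1 - bit
            else:                                    # or draw it at random
                bit = rng.randint(0, 1)
            new.append(bit)
        new = tuple(new)
        s = fitness(new)
        worst = min(range(hm_size), key=lambda i: scores[i])
        if s > scores[worst]:                        # serial replacement step
            hm[worst], scores[worst] = new, s
    best = max(range(hm_size), key=lambda i: scores[i])
    return hm[best], scores[best]
```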
12. Our GA-HS hybrid solution. An individual is a bit vector ● = 0 0 1 0 1 1 1; the population is ● ● ● ●.
- Steps 1 & 2: generate a random population and evaluate the regressions for each individual.
- Select the better-than-average individuals ● ● and start a new population with empty slots: ● ● x x.
- Fill each empty slot x (can be parallelized!): with probability HMCR, take an individual ● from the selected parents and mutate it with mutation probability bw; with probability 1-HMCR, generate a RANDOM individual.
- Increase HMCR and decrease bw.
- Once every x is filled, evaluate the regressions for the new individuals in the population.
- If the termination criterion is met, STOP; otherwise repeat.
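One generation of this hybrid can be sketched as below. Because each empty slot is filled and scored independently, with no single shared "worst" individual to replace, the fill-and-evaluate step parallelizes; a thread pool stands in here for the authors' C# parallelization, and all names and defaults are illustrative:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def hybrid_generation(pop, fitness, hmcr=0.9, bw=0.1, rng=None):
    """One generation of the GA-HS hybrid: keep better-than-average
    individuals, fill the freed slots in parallel, evaluate in parallel."""
    rng = rng or random.Random()
    m = len(pop[0])
    scores = [fitness(p) for p in pop]
    avg = sum(scores) / len(scores)
    keep = [p for p, s in zip(pop, scores) if s >= avg]   # selection step

    def make_child(_):
        child = []
        for j in range(m):
            if rng.random() < hmcr:
                bit = rng.choice(keep)[j]     # inherit bit from a kept parent
                if rng.random() < bw:         # mutation
                    bit = 1 - bit
            else:
                bit = rng.randint(0, 1)       # fresh random bit
            child.append(bit)
        return tuple(child)

    n_new = len(pop) - len(keep)
    with ThreadPoolExecutor() as ex:          # parallel fill + evaluate
        children = list(ex.map(make_child, range(n_new)))
        child_scores = list(ex.map(fitness, children))
    return keep + children, [fitness(p) for p in keep] + child_scores
```

Since the best individual always scores at least the average, it is always kept, so the best fitness never decreases across generations.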
13. Differences from GA: 1. more than one kind of mutation; 2. no crossover. In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
14. The Performance. [Four bar charts: on FAT and on DATA26, IHSRS + VIF vs. GAIHSRS + VIF compared on average solution time and standard deviation of solution times (in number of steps), and Standard vs. Parallel execution compared on average solution time and standard deviation of solution times (in seconds).] Average runtime and St. Dev. are decreased by 2/3. Thank you for your attention!
15. Environment. The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs. Most selection algorithms were run in IBM SPSS Statistics 22. Elastic Net: CATREG SPSS macro by Leiden University. NumPy and SciPy Python libraries for Partial Least Squares. The metaheuristics (GARS, improved GARS, IHSRS, GAIHSRS) are implemented in C#. OS and hardware configuration: Windows 8.1 Ultimate 64-bit; CPU: Intel Core i7-2700K, 3.5 GHz; RAM: 16 GB DDR3 SDRAM.