Blanka Láng, László Kovács and László Mohácsi: Linear regression model selection using a hybrid genetic - improved harmony search parallelized algorithm


Published on: presentation at the OGIK 2016 Conference, Nov 11-12 2016, Corvinus University of Budapest, Institute of Information Technology.


  1. Linear Regression Model Selection using a hybrid genetic - improved harmony search parallelized algorithm
     Blanka Láng, László Kovács, László Mohácsi
     Corvinus University of Budapest, Institute of Information Technology
  2. Contents
     - Linear Regression Model Selection Problem
     - Datasets Used
     - Performance of Selection Algorithms on Our Data
     - The Need for a New Solution
     - The Performance of Our Hybrid Algorithm
  3. Linear Regression
     We have:
     - $Y$: dependent variable
     - $X = X_1, X_2, \dots, X_m$: vectors of independent variables
     Goal: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon$
     OLS model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_m X_m = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{\beta}_j X_j$
     Parsimony: choose $X' \subseteq X$ that minimizes the residuals while using as few independents as possible, maximizing the model's ability to generalize.
     Partial effects of the independents: only significant variables stay in the model; these hypotheses can be statistically tested.
     Objective functions: AIC, SBC, HQC (minimize); adjusted $R^2$ (maximize).
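As a worked illustration of these objective functions, here is a minimal Python sketch that OLS-fits one candidate subset and scores it. It assumes the common per-observation forms $AIC = \ln(SSE/n) + 2k/n$ and $SBC = \ln(SSE/n) + k\ln(n)/n$; the slides do not state which AIC/SBC variants the authors used.

```python
import numpy as np

def score_subset(X, y, subset):
    """OLS-fit y on the chosen columns of X; return AIC, SBC, adjusted R^2."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, subset]])  # add the intercept
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)     # OLS estimates
    resid = y - Xs @ beta
    sse = resid @ resid
    k = Xs.shape[1]                                   # parameters incl. intercept
    sst = ((y - y.mean()) ** 2).sum()
    r2_adj = 1 - (sse / (n - k)) / (sst / (n - 1))    # adjusted R^2
    aic = np.log(sse / n) + 2 * k / n
    sbc = np.log(sse / n) + k * np.log(n) / n
    return aic, sbc, r2_adj
```

For example, `score_subset(X, y, [0, 2, 4])` scores the model built from the first, third and fifth independents; a selection algorithm searches over such subsets.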
  4. Dataset #1
     Body Fat Measurements – real dataset from 1996
     - $n = 252$
     - $Y$: percent of body fat to muscle tissue
     - $m = 16$ (age, abdomen circumference, weight, height, etc.)
     Multicollinearity: redundancy between independents. E.g.: which of two strongly correlated independents matters most when predicting $Y$? How can we interpret their partial effects?
     Measure: regress the independents on each other → a VIF indicator for each independent; if VIF > 2 → multicollinearity.
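A minimal sketch of the VIF measure described above, assuming the usual definition $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing independent $j$ on the remaining independents (the tolerance reported in the later tables is its reciprocal):

```python
import numpy as np

def vifs(X):
    """Return the VIF of every column of X (observations x variables)."""
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        # regress column j on all the other columns (plus an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        sst = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - (resid @ resid) / sst
        out[j] = 1.0 / (1.0 - r2)
    return out
```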
  5. Dataset #2
     DATA26 – simulated dataset from a Gumbel copula
     - $n = 1000$
     - $m = 25$ (plus $Y$)
     Generating a correlation matrix (CM) with high correlations in absolute value: the vine-beta method (Lewandowski et al., 2009).
     Simulating multicollinearity: all 26 generated variables follow $N(\mu, \sigma)$ distributions, where $\mu$ and $\sigma$ are randomly generated for each variable.
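A hedged Python sketch of the vine-beta correlation-matrix generator (Lewandowski et al., 2009) referenced above; the parameter names and the $\mu$, $\sigma$ ranges are our illustrative assumptions. A small `betaparam` concentrates the partial correlations near ±1, i.e. high correlations in absolute value:

```python
import numpy as np

def vine_beta(d, betaparam, rng=None):
    """Random d x d correlation matrix via a C-vine: partial correlations are
    drawn from Beta(betaparam, betaparam) rescaled to (-1, 1)."""
    rng = rng or np.random.default_rng()
    P = np.zeros((d, d))                     # partial correlations
    S = np.eye(d)
    for k in range(d - 1):
        for i in range(k + 1, d):
            P[k, i] = rng.beta(betaparam, betaparam) * 2 - 1
            p = P[k, i]
            for l in range(k - 1, -1, -1):   # convert partial to raw correlation
                p = p * np.sqrt((1 - P[l, i] ** 2) * (1 - P[l, k] ** 2)) + P[l, i] * P[l, k]
            S[k, i] = S[i, k] = p
    return S

# Draw n correlated standard-normal rows via Cholesky, then rescale each
# column with its own randomly generated mu and sigma, as on the slide
# (the uniform ranges below are illustrative, not the authors' choices):
rng = np.random.default_rng()
S = vine_beta(26, betaparam=0.5, rng=rng)    # small betaparam -> |r| near 1
Z = rng.standard_normal((1000, 26)) @ np.linalg.cholesky(S).T
data = Z * rng.uniform(0.5, 5.0, 26) + rng.uniform(-10.0, 10.0, 26)
```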
  6. Performance of Selection Algorithms – FAT
     (AIC, SBC, adjusted R² with the selected variables; runtime and its standard deviation in seconds)
     - Best Subsets (SPSS Leaps and Bound): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 4.558, SD 0.878
     - Best Subsets (Minerva: GARS): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 5.921, SD 1.658
     - improved GARS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 11.268, SD 2.941
     - IHSRS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.968, SD 0.188
     - Forward+Backward: AIC 0.058, SBC 0.239, adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.976, SD 0.050
     - Variable Importance in Projection (Partial Least Squares): AIC -0.247, SBC -0.092, adj. R² 0.9618 (Variables: 1, 2, 5, 6, 8, 9); runtime 1.807, SD 0.896
     - Elastic Net: AIC -2.013, SBC -1.987, adj. R² 0.9410 (Variables: 1); runtime 50.858, SD 9.019
     - Stepwise VIF Selection: AIC -0.189, SBC -0.008, adj. R² 0.954 (Variables: 1, 2, 15); runtime 0.832, SD 0.034
     - Nested Estimate Procedure: AIC -1.402, SBC -1.351, adj. R² 0.9538 (Variables: 1, 8); runtime 0.352, SD 0.047
  7. Performance of Selection Algorithms – DATA26
     (AIC, SBC, R² with the selected variables; runtime and its standard deviation in seconds)
     - Best Subsets (SPSS Leaps and Bound): AIC -8.840, SBC -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 32.352745, SD 7.04028
     - Best Subsets (Minerva: GARS): AIC -8.841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); SBC -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 52.714638, SD 12.62692
     - improved GARS: AIC -8.731, SBC -8.826, R² 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); runtime 1281.45823, SD 380.10328
     - IHSRS: AIC -8.731, SBC -8.826, R² 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); runtime 402.1666233, SD 79.070735
     - Forward+Backward: AIC -8.840, SBC -8.756, R² 0.9999944 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); runtime 1.0744, SD 0.0937
     - Variable Importance in Projection (Partial Least Squares): AIC -5.196, SBC -5.132, R² 0.99979 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12); runtime 15.095273, SD 7.19626
     - Elastic Net: AIC -4.363, SBC -4.240, R² 0.993 (full model; not significant: X5, X13); runtime 478.683794, SD 99.82244
     - Stepwise VIF Selection: AIC 0.434, SBC 0.464, R² 0.940 (Variables: X6, X10, X16, X17, X19, X24); runtime 0.93415, SD 0.02986
     - Nested Estimate Procedure: AIC 0.760, SBC 0.780, R² 0.917 (Variables: X10, X15, X23, X24); runtime 0.39289, SD 0.0533
  8. Problem with the results
     Optimal solutions of IHSRS for adjusted R² – collinearity statistics:
     FAT:
       Variable  Tolerance  VIF
       X1        0.069      14.490
       X3        0.017      59.097
       X5        0.089      11.271
       X6        0.030      33.682
       X8        0.105      9.540
       X12       0.239      4.182
       X15       0.399      2.509
     DATA26:
       Variable  Tolerance  VIF
       X1        0.065      15.347
       X4        0.001      1644.939
       X5        0.003      388.860
       X6        0.002      538.248
       X8        0.005      197.505
       X10       0.050      20.165
       X12       0.001      1366.452
       X13       0.030      33.293
       X15       0.001      1133.939
       X16       0.048      20.828
       X17       0.041      24.297
       X18       0.016      64.340
       X21       0.003      393.569
       X23       0.002      554.800
       X24       0.004      262.232
       X25       0.001      825.023
  9. Modify the IHSRS
     Include an "all VIFs < 2" condition in the optimization task.
     Optimal solutions of IHSRS with the VIF condition:
     FAT (adjusted R² = 0.9854):
       Variable  Tolerance  VIF
       X1        0.508      1.970
       X2        0.879      1.138
       X8        0.558      1.791
     DATA26 (adjusted R² = 0.991):
       Variable  Tolerance  VIF
       X2        0.503      1.986
       X6        0.548      1.825
       X10       0.500      1.999
       X14       0.526      1.902
       X23       0.565      1.770
     Other models with all VIF values smaller than 2:
     - Backward – VIF: adjusted R² = 0.9540 (FAT); 0.940 (DATA26)
     - Nested Estimates: adjusted R² = 0.9538 (FAT); 0.917 (DATA26)
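The slides say only that an "all VIFs < 2" condition was added to the optimization task, not how it was enforced. A minimal sketch, here implemented as a hard rejection in the fitness (a penalty term would work as well), reusing `vifs` and `score_subset` from the sketches above:

```python
import numpy as np

def fitness(X, y, subset):
    """Adjusted R^2 of the subset, or -inf if any VIF violates the condition."""
    if len(subset) == 0 or (vifs(X[:, subset]) >= 2).any():
        return -np.inf                      # infeasible: all VIFs must be < 2
    return score_subset(X, y, subset)[2]    # maximize adjusted R^2
```

Rejecting infeasible subsets shrinks the feasible region drastically, which is consistent with the runtime blow-up reported on the next slide.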
  10. A Great Setback for the modified IHSRS
      [Four bar charts: average solution time and standard deviation of solution times, in number of steps and in seconds, for FAT and DATA26, comparing IHSRS without the VIF condition against IHSRS with it.]
      Average runtime is almost an hour!
  11. We cannot parallelize the IHSRS
      Individual/melody: ● = 0 0 1 0 1 1 1 (a bit vector marking which independents enter the model); population/harmony memory: ● ● ● ●
      Steps 1 & 2: generate a random harmony memory and evaluate the regressions for each individual. Then, in a loop:
      - With probability HMCR, take the new individual ● from the harmony memory; with probability 1-HMCR, generate a random individual.
      - If taken from memory: with probability PAR, mutate ● with mutation probability bw; with probability 1-PAR, leave ● unmodified. Increase PAR and decrease bw.
      - If the new ● is better than the worst individual, replace the worst individual.
      - Stop when the termination criterion is met.
      Because each improvisation reads the harmony memory that the previous iteration may have just updated, the iterations cannot run concurrently.
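A condensed Python sketch of one improvisation of this loop (the authors' implementation is in C#; here `memory` holds (bit-vector, fitness) pairs and the PAR/bw update schedule is omitted). The final compare-and-replace is what serializes the algorithm:

```python
import random

def improvise(memory, fitness, hmcr, par, bw, m):
    """One IHSRS step: build a new harmony bit by bit, then try to replace
    the worst member of the harmony memory with it."""
    new = []
    for j in range(m):
        if random.random() < hmcr:                 # take bit j from memory...
            bit = random.choice(memory)[0][j]
            if random.random() < par and random.random() < bw:
                bit = 1 - bit                      # ...and possibly mutate it
        else:
            bit = random.randint(0, 1)             # ...or draw it at random
        new.append(bit)
    f = fitness(new)
    worst = min(range(len(memory)), key=lambda i: memory[i][1])
    if f > memory[worst][1]:
        memory[worst] = (new, f)                   # the sequential bottleneck
```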
  12. Our GA-HS hybrid solution
      Individual: ● = 0 0 1 0 1 1 1; population: ● ● ● ●
      Steps 1 & 2: generate a random population and evaluate the regressions for each individual. Then, in a loop:
      - Select the better-than-average individuals and start a new population from them, leaving the remaining slots (x) empty.
      - Fill each empty slot: with probability HMCR, take an individual ● from the kept ones and mutate it with mutation probability bw; with probability 1-HMCR, generate a random individual. Increase HMCR and decrease bw.
      - Once every x is filled, evaluate the regressions for the new individuals in the population – this step can be parallelized!
      - Stop when the termination criterion is met.
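A sketch of the parallelizable step, again in Python for brevity (the authors use C#): the empty slots are filled and their regressions evaluated independently of one another, so the evaluations can be farmed out to worker processes. `make_offspring` stands in for the HMCR/bw filling rule above:

```python
from concurrent.futures import ProcessPoolExecutor

def next_generation(population, fitness, make_offspring, size):
    """One hybrid step: keep the better-than-average individuals, fill the
    remaining slots, and evaluate the newcomers' regressions in parallel."""
    avg = sum(f for _, f in population) / len(population)
    kept = [(bits, f) for bits, f in population if f >= avg]
    offspring = [make_offspring(kept) for _ in range(size - len(kept))]
    with ProcessPoolExecutor() as pool:            # the parallelizable step
        scores = list(pool.map(fitness, offspring))
    return kept + list(zip(offspring, scores))
# (On Windows, call this from under an `if __name__ == "__main__":` guard.)
```

Unlike the IHSRS loop, nothing here writes to shared state between evaluations, which is exactly what makes the batch evaluation safe to parallelize.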
  13. Differences from GA
      1. More than one kind of mutation
      2. No crossover
      In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
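The slides do not name the mutation kinds; purely as a hypothetical illustration of "more than one kind of mutation" on a binary model vector, three plausible operators:

```python
import random

def flip_mutation(ind):
    """Toggle the membership of one randomly chosen variable."""
    out = ind[:]
    out[random.randrange(len(out))] ^= 1
    return out

def add_mutation(ind):
    """Force one currently excluded variable into the model, if any."""
    out = ind[:]
    zeros = [j for j, b in enumerate(out) if b == 0]
    if zeros:
        out[random.choice(zeros)] = 1
    return out

def drop_mutation(ind):
    """Force one currently included variable out of the model, if any."""
    out = ind[:]
    ones = [j for j, b in enumerate(out) if b == 1]
    if ones:
        out[random.choice(ones)] = 0
    return out
```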
  14. The Performance
      [Four bar charts for FAT and DATA26: average solution time and standard deviation of solution times – in number of steps, IHSRS + VIF vs. GAIHSRS + VIF; in seconds, Standard vs. Parallel.]
      Average runtime and standard deviation are decreased by 2/3.
      Thank you for your attention!
  15. Environment
      The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs.
      - Most selection algorithms were run in IBM SPSS Statistics 22
      - Elastic Net: the Catreg SPSS macro by the University of Leiden
      - NumPy and SciPy Python libraries for Partial Least Squares
      - Metaheuristics (GARS, improved GARS, IHSRS, GAIHSRS) are implemented in C#
      OS and hardware configuration:
      - OS: Windows 8.1 Ultimate, 64-bit
      - CPU: Intel Core i7-2700K, 3.5 GHz
      - RAM: 16 GB DDR3 SDRAM
