Blanka Láng, László Kovács and László Mohácsi: Linear regression model selection using a hybrid genetic - improved harmony search parallelized algorithm


Published on: presentation at the OGIK 2016 Conference, Nov 11-12 2016, Corvinus University of Budapest, Institute of Information Technology.


  1. Linear Regression Model Selection using a hybrid genetic - improved harmony search parallelized algorithm
     Blanka Láng, László Kovács, László Mohácsi
     Corvinus University of Budapest, Institute of Information Technology
  2. Contents
     - Linear Regression Model Selection Problem
     - Datasets Used
     - Performance of Selection Algorithms on Our Data
     - The Need for a New Solution
     - The Performance of Our Hybrid Algorithm
  3. Linear Regression
     We have:
     - $Y$: dependent variable
     - $X = X_1, X_2, \dots, X_m$: vectors of independent variables
     Goal: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon$
     OLS model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_m X_m = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{\beta}_j X_j$
     Parsimony: choose $X' \subseteq X$ that minimizes the residuals while using as few independents as possible, maximizing the model's ability to generalize.
     Partial effects of the independents: only significant variables stay in the model; these hypotheses can be statistically tested.
     Objective functions: AIC, SBC, HQC (minimize); adjusted $R^2$ (maximize).
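As a worked illustration of these objective functions, here is a minimal Python sketch that OLS-fits one candidate subset and scores it. It assumes the common per-observation forms $AIC = \ln(SSE/n) + 2k/n$ and $SBC = \ln(SSE/n) + k\ln(n)/n$; the slides do not state which AIC/SBC variants the authors used.

```python
import numpy as np

def score_subset(X, y, subset):
    """OLS-fit y on the chosen columns of X; return AIC, SBC, adjusted R^2."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, subset]])  # add the intercept
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)     # OLS estimates
    resid = y - Xs @ beta
    sse = resid @ resid
    k = Xs.shape[1]                                   # parameters incl. intercept
    sst = ((y - y.mean()) ** 2).sum()
    r2_adj = 1 - (sse / (n - k)) / (sst / (n - 1))    # adjusted R^2
    aic = np.log(sse / n) + 2 * k / n
    sbc = np.log(sse / n) + k * np.log(n) / n
    return aic, sbc, r2_adj
```

For example, `score_subset(X, y, [0, 2, 4])` scores the model built from the first, third and fifth independents; a selection algorithm searches over such subsets.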
  4. Dataset #1
     Body Fat Measurements – real dataset from 1996
     - $n = 252$
     - $Y$: percent of body fat to muscle tissue
     - $m = 16$ (age, abdomen circumference, weight, height, etc.)
     Multicollinearity: redundancy between independents. E.g.: which of two strongly correlated independents matters most when predicting $Y$? How can we interpret their partial effects?
     Measure: regress the independents on each other → a VIF indicator for each independent; if VIF > 2 → multicollinearity.
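A minimal sketch of the VIF measure described above, assuming the usual definition $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing independent $j$ on the remaining independents (the tolerance reported in the later tables is its reciprocal):

```python
import numpy as np

def vifs(X):
    """Return the VIF of every column of X (observations x variables)."""
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        # regress column j on all the other columns (plus an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        sst = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - (resid @ resid) / sst
        out[j] = 1.0 / (1.0 - r2)
    return out
```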
  5. Dataset #2
     DATA26 – simulated dataset from a Gumbel copula
     - $n = 1000$
     - $m = 25$ (plus $Y$)
     Generating a correlation matrix (CM) with high correlations in absolute value: the vine-beta method (Lewandowski et al., 2009).
     Simulating multicollinearity: all 26 generated variables follow $N(\mu, \sigma)$ distributions, where $\mu$ and $\sigma$ are randomly generated for each variable.
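A hedged Python sketch of the vine-beta correlation-matrix generator (Lewandowski et al., 2009) referenced above; the parameter names and the $\mu$, $\sigma$ ranges are our illustrative assumptions. A small `betaparam` concentrates the partial correlations near ±1, i.e. high correlations in absolute value:

```python
import numpy as np

def vine_beta(d, betaparam, rng=None):
    """Random d x d correlation matrix via a C-vine: partial correlations are
    drawn from Beta(betaparam, betaparam) rescaled to (-1, 1)."""
    rng = rng or np.random.default_rng()
    P = np.zeros((d, d))                     # partial correlations
    S = np.eye(d)
    for k in range(d - 1):
        for i in range(k + 1, d):
            P[k, i] = rng.beta(betaparam, betaparam) * 2 - 1
            p = P[k, i]
            for l in range(k - 1, -1, -1):   # convert partial to raw correlation
                p = p * np.sqrt((1 - P[l, i] ** 2) * (1 - P[l, k] ** 2)) + P[l, i] * P[l, k]
            S[k, i] = S[i, k] = p
    return S

# Draw n correlated standard-normal rows via Cholesky, then rescale each
# column with its own randomly generated mu and sigma, as on the slide
# (the uniform ranges below are illustrative, not the authors' choices):
rng = np.random.default_rng()
S = vine_beta(26, betaparam=0.5, rng=rng)    # small betaparam -> |r| near 1
Z = rng.standard_normal((1000, 26)) @ np.linalg.cholesky(S).T
data = Z * rng.uniform(0.5, 5.0, 26) + rng.uniform(-10.0, 10.0, 26)
```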
  6. Performance of Selection Algorithms – FAT
     (AIC, SBC, adjusted R² with the selected variables; runtime and its standard deviation in seconds)
     - Best Subsets (SPSS Leaps and Bound): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 4.558, SD 0.878
     - Best Subsets (Minerva: GARS): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 5.921, SD 1.658
     - improved GARS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 11.268, SD 2.941
     - IHSRS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.968, SD 0.188
     - Forward+Backward: AIC 0.058, SBC 0.239, adj. R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.976, SD 0.050
     - Variable Importance in Projection (Partial Least Squares): AIC -0.247, SBC -0.092, adj. R² 0.9618 (Variables: 1, 2, 5, 6, 8, 9); runtime 1.807, SD 0.896
     - Elastic Net: AIC -2.013, SBC -1.987, adj. R² 0.9410 (Variables: 1); runtime 50.858, SD 9.019
     - Stepwise VIF Selection: AIC -0.189, SBC -0.008, adj. R² 0.954 (Variables: 1, 2, 15); runtime 0.832, SD 0.034
     - Nested Estimate Procedure: AIC -1.402, SBC -1.351, adj. R² 0.9538 (Variables: 1, 8); runtime 0.352, SD 0.047
  7. Performance of Selection Algorithms – DATA26
     (AIC, SBC, R² with the selected variables; runtime and its standard deviation in seconds)
     - Best Subsets (SPSS Leaps and Bound): AIC -8.840, SBC -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 32.352745, SD 7.04028
     - Best Subsets (Minerva: GARS): AIC -8.841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); SBC -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 52.714638, SD 12.62692
     - improved GARS: AIC -8.731, SBC -8.826, R² 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); runtime 1281.45823, SD 380.10328
     - IHSRS: AIC -8.731, SBC -8.826, R² 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); runtime 402.1666233, SD 79.070735
     - Forward+Backward: AIC -8.840, SBC -8.756, R² 0.9999944 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); runtime 1.0744, SD 0.0937
     - Variable Importance in Projection (Partial Least Squares): AIC -5.196, SBC -5.132, R² 0.99979 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12); runtime 15.095273, SD 7.19626
     - Elastic Net: AIC -4.363, SBC -4.240, R² 0.993 (full model; not significant: X5, X13); runtime 478.683794, SD 99.82244
     - Stepwise VIF Selection: AIC 0.434, SBC 0.464, R² 0.940 (Variables: X6, X10, X16, X17, X19, X24); runtime 0.93415, SD 0.02986
     - Nested Estimate Procedure: AIC 0.760, SBC 0.780, R² 0.917 (Variables: X10, X15, X23, X24); runtime 0.39289, SD 0.0533
  8. Problem with the results
     Optimal solutions of IHSRS for adjusted R² – collinearity statistics:
     FAT:
       Variable  Tolerance  VIF
       X1        0.069      14.490
       X3        0.017      59.097
       X5        0.089      11.271
       X6        0.030      33.682
       X8        0.105      9.540
       X12       0.239      4.182
       X15       0.399      2.509
     DATA26:
       Variable  Tolerance  VIF
       X1        0.065      15.347
       X4        0.001      1644.939
       X5        0.003      388.860
       X6        0.002      538.248
       X8        0.005      197.505
       X10       0.050      20.165
       X12       0.001      1366.452
       X13       0.030      33.293
       X15       0.001      1133.939
       X16       0.048      20.828
       X17       0.041      24.297
       X18       0.016      64.340
       X21       0.003      393.569
       X23       0.002      554.800
       X24       0.004      262.232
       X25       0.001      825.023
  9. Modify the IHSRS
     Include an "all VIFs < 2" condition in the optimization task.
     Optimal solutions of IHSRS with the VIF condition:
     FAT (adjusted R² = 0.9854):
       Variable  Tolerance  VIF
       X1        0.508      1.970
       X2        0.879      1.138
       X8        0.558      1.791
     DATA26 (adjusted R² = 0.991):
       Variable  Tolerance  VIF
       X2        0.503      1.986
       X6        0.548      1.825
       X10       0.500      1.999
       X14       0.526      1.902
       X23       0.565      1.770
     Other models with all VIF values smaller than 2:
     - Backward – VIF: adjusted R² = 0.9540 (FAT); 0.940 (DATA26)
     - Nested Estimates: adjusted R² = 0.9538 (FAT); 0.917 (DATA26)
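The slides say only that an "all VIFs < 2" condition was added to the optimization task, not how it was enforced. A minimal sketch, here implemented as a hard rejection in the fitness (a penalty term would work as well), reusing `vifs` and `score_subset` from the sketches above:

```python
import numpy as np

def fitness(X, y, subset):
    """Adjusted R^2 of the subset, or -inf if any VIF violates the condition."""
    if len(subset) == 0 or (vifs(X[:, subset]) >= 2).any():
        return -np.inf                      # infeasible: all VIFs must be < 2
    return score_subset(X, y, subset)[2]    # maximize adjusted R^2
```

Rejecting infeasible subsets shrinks the feasible region drastically, which is consistent with the runtime blow-up reported on the next slide.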
  10. A Great Setback for the modified IHSRS
      [Four bar charts: average solution time and standard deviation of solution times, in number of steps and in seconds, for FAT and DATA26, comparing IHSRS without the VIF condition against IHSRS with it.]
      Average runtime is almost an hour!
  11. We cannot parallelize the IHSRS
      Individual/melody: ● = 0 0 1 0 1 1 1 (a bit vector marking which independents enter the model); population/harmony memory: ● ● ● ●
      Steps 1 & 2: generate a random harmony memory and evaluate the regressions for each individual. Then, in a loop:
      - With probability HMCR, take the new individual ● from the harmony memory; with probability 1-HMCR, generate a random individual.
      - If taken from memory: with probability PAR, mutate ● with mutation probability bw; with probability 1-PAR, leave ● unmodified. Increase PAR and decrease bw.
      - If the new ● is better than the worst individual, replace the worst individual.
      - Stop when the termination criterion is met.
      Because each improvisation reads the harmony memory that the previous iteration may have just updated, the iterations cannot run concurrently.
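A condensed Python sketch of one improvisation of this loop (the authors' implementation is in C#; here `memory` holds (bit-vector, fitness) pairs and the PAR/bw update schedule is omitted). The final compare-and-replace is what serializes the algorithm:

```python
import random

def improvise(memory, fitness, hmcr, par, bw, m):
    """One IHSRS step: build a new harmony bit by bit, then try to replace
    the worst member of the harmony memory with it."""
    new = []
    for j in range(m):
        if random.random() < hmcr:                 # take bit j from memory...
            bit = random.choice(memory)[0][j]
            if random.random() < par and random.random() < bw:
                bit = 1 - bit                      # ...and possibly mutate it
        else:
            bit = random.randint(0, 1)             # ...or draw it at random
        new.append(bit)
    f = fitness(new)
    worst = min(range(len(memory)), key=lambda i: memory[i][1])
    if f > memory[worst][1]:
        memory[worst] = (new, f)                   # the sequential bottleneck
```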
  12. Our GA-HS hybrid solution
      Individual: ● = 0 0 1 0 1 1 1; population: ● ● ● ●
      Steps 1 & 2: generate a random population and evaluate the regressions for each individual. Then, in a loop:
      - Select the better-than-average individuals and start a new population from them, leaving the remaining slots (x) empty.
      - Fill each empty slot: with probability HMCR, take an individual ● from the kept ones and mutate it with mutation probability bw; with probability 1-HMCR, generate a random individual. Increase HMCR and decrease bw.
      - Once every x is filled, evaluate the regressions for the new individuals in the population – this step can be parallelized!
      - Stop when the termination criterion is met.
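A sketch of the parallelizable step, again in Python for brevity (the authors use C#): the empty slots are filled and their regressions evaluated independently of one another, so the evaluations can be farmed out to worker processes. `make_offspring` stands in for the HMCR/bw filling rule above:

```python
from concurrent.futures import ProcessPoolExecutor

def next_generation(population, fitness, make_offspring, size):
    """One hybrid step: keep the better-than-average individuals, fill the
    remaining slots, and evaluate the newcomers' regressions in parallel."""
    avg = sum(f for _, f in population) / len(population)
    kept = [(bits, f) for bits, f in population if f >= avg]
    offspring = [make_offspring(kept) for _ in range(size - len(kept))]
    with ProcessPoolExecutor() as pool:            # the parallelizable step
        scores = list(pool.map(fitness, offspring))
    return kept + list(zip(offspring, scores))
# (On Windows, call this from under an `if __name__ == "__main__":` guard.)
```

Unlike the IHSRS loop, nothing here writes to shared state between evaluations, which is exactly what makes the batch evaluation safe to parallelize.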
  13. Differences from GA
      1. More than one kind of mutation
      2. No crossover
      In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
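The slides do not name the mutation kinds; purely as a hypothetical illustration of "more than one kind of mutation" on a binary model vector, three plausible operators:

```python
import random

def flip_mutation(ind):
    """Toggle the membership of one randomly chosen variable."""
    out = ind[:]
    out[random.randrange(len(out))] ^= 1
    return out

def add_mutation(ind):
    """Force one currently excluded variable into the model, if any."""
    out = ind[:]
    zeros = [j for j, b in enumerate(out) if b == 0]
    if zeros:
        out[random.choice(zeros)] = 1
    return out

def drop_mutation(ind):
    """Force one currently included variable out of the model, if any."""
    out = ind[:]
    ones = [j for j, b in enumerate(out) if b == 1]
    if ones:
        out[random.choice(ones)] = 0
    return out
```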
  14. The Performance
      [Four bar charts for FAT and DATA26: average solution time and standard deviation of solution times – in number of steps, IHSRS + VIF vs. GAIHSRS + VIF; in seconds, Standard vs. Parallel.]
      Average runtime and standard deviation are decreased by 2/3.
      Thank you for your attention!
  15. Environment
      The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs.
      - Most selection algorithms were run in IBM SPSS Statistics 22
      - Elastic Net: the Catreg SPSS macro by the University of Leiden
      - NumPy and SciPy Python libraries for Partial Least Squares
      - Metaheuristics (GARS, improved GARS, IHSRS, GAIHSRS) are implemented in C#
      OS and hardware configuration:
      - OS: Windows 8.1 Ultimate, 64-bit
      - CPU: Intel Core i7-2700K, 3.5 GHz
      - RAM: 16 GB DDR3 SDRAM
