Using several Modeling techniques, including multiple regression, decision trees, neural networks and partial least squares, we identified and measured several factors associated with the level of alcoholism in Russian society.
The predictive modeling approach on continuous statistics
1. Predictive Modeling Approach for Socio-determined
Indicators based on Continuous Federal Statistics
Sergey Soshnikov
MD, Ph.D., Head of department
Federal Public Health Research Institute
Keywords: Data Quality, Decision Tree, Ensemble, Gradient Boosting, LASSO, Neural Network, Partial Least Squares
2. • Conducted from June 2011 to February 2012.
• Support: Fulbright Program Scholarship for researchers, grant number
68435006 at Central Michigan University
Research team:
Sergey Soshnikov
Dept. of Medical and Social Problems,
CPHRI, 11, Str. Dobrolyubova, Moscow, 127254, RU
ssoshnikov@fulbrightmail.org
Carl Lee
Department of Mathematics,
Central Michigan University, USA
Vasiliy Vlassov
Department of Public Health and Preventive Medicine,
I.M. Sechenov First Moscow State Medical University, RU
Maria Gaidar
Public Administration (MC/MPA),
Harvard University, USA
Sergey Vladimirov
Independent Laboratory SQLab, RU
3. PAPERS IN PEER-REVIEWED JOURNALS
1. Lee, C., Soshnikov, S., Vladimirov, S. (2013). Are Socio-Economic, Health Infrastructure, and Demographic Factors
Associated with Infant Mortality in Russia? International Journal of Software Innovation (IJSI), Vol. 1(4),
accessed April 23, 2014, doi:10.4018/ijsi.2013100105.
2. Soshnikov, S., Lee, C., Vlassov, V., Vladimirov, S. (2012) Factors Associated with Abortions in Russia: a
Predictive Modeling Study. European Journal of Public Health, Vol. 22(2), 95-95.
PUBLISHED REFEREED ABSTRACTS
1. Soshnikov, S., Lee, C., Vladimirov, S. (2013). A Modeling approach to identify factors associated with Infant
mortality in Russia. 12th IEEE/ACIS International Conference on Computer and Information Science (ICIS
2013), p. 185-190. June 16 -20, 2013, Toki Messe, Niigata Japan.
2. Soshnikov, S., Lee, C., Vlassov, V., Gaidar, M. and Vladimirov, S. (2013). A comparison of some predictive
modeling techniques for modeling abortion rates in Russia. Proceedings (Refereed), 14th IEEE/ACIS
International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing (SNPD 2013), p. 115-120. July 1st – July 3rd, 2013, Honolulu, Hawaii, USA.
REFEREED PRESENTATIONS
1. The Predictive Modeling Approach on Continuous Statistics of Alcoholism Incidence in Russia. International
Symposium on Health Information Management Research (ISHIMR 2015), York, UK.
2. Analysis of the official medical and social statistics on the example of the infant mortality rate. X International
scientific conference "The use of multivariate statistical analysis in economics and quality assessment." (2014),
Higher School of Economics in Moscow, Russia.
3. A Modeling approach to identify factors associated with Infant mortality in Russia. 12th IEEE/ACIS International
Conference on Computer and Information Science (ICIS 2013), p. 185-190. June 16 -20, 2013, Toki Messe,
Niigata Japan.
4. The secondary research database
Over 130 variables from 6 yearbooks and 2 official databases, collected in MySQL.
Sources shown: The Central Statistical Database (CSDB); Demographic Yearbook; National Accounts; Main Economic and Social Indicators.
File formats shown: HTML, DOC, XLS, CSV.
5. Geography of research: subnational level
The Russian Federation comprises 85 federal regions.
There are 6 types of federal regions:
• 22 republics
• 9 territories (krai)
• 46 regions (oblast)
• 3 federal cities
• 1 autonomous region (oblast)
• 4 autonomous areas (okrug)
6. List of statistical sources used for developing the database
(yearbooks and state database):
a. Regions of Russia. Social-Economics Indicators. Data collected by the state statistics bodies from enterprises, organizations and the population through censuses, sample surveys and other forms of statistical observation, together with data from ministries and departments of the Russian Federation and information received from organizations.
b. Regions of Russia. The Main Characteristics of the Russian Federation. Presents detailed figures on the social and economic development of every region of Russia.
c. Demographic Yearbook of Russia. The yearbook contains statistical data on the administrative-territorial division, changes in the size and the age and sex composition of the population and its distribution within the Russian Federation, on births and deaths, marriages, divorces and migration, summary demographic indicators of population reproduction, and standardized mortality rates by cause of death.
d. Social status and standards of living of the Russian population. The most complete edition reflecting the social processes and living conditions of the population of Russia.
e. Healthcare in Russia. The handbook provides data describing the population's health status over time.
f. National Accounts of Russia. Contains data on the volume, structure and dynamics of the Gross Domestic Product, the consolidated accounts, the integrated table of national accounts (by sector), as well as regional indicators of national accounts.
g. The Central Statistical Database (CSDB). CSDB contains hundreds of variables carrying official statistical data at the regional level of the Russian Federation.
8. Type: Environment-related & Alcohol-related
Names of variables and descriptions:
Federal Districts on 2009 (GEO);
Average Temperature in January (JANTEMP);
Average Temperature in July (JULTEMP);
Emissions Air Pollutants tons /10 (EMAIRPOL);
Captured Air Pollutants (filtered) /10,000 (CAPPOL);
Absolute Alcohol liters /1 person (consumption in one year) (ALC_1);
Vodka liters /1 person (ALC_2);
Offenses with Illegal Alcohol /10000 (ALC_OFF);
Dahlgren and Whitehead’s diagram
9. Type: Social & Health Service Infrastructure
Names of variables and descriptions:
Square Footage /1 Resident (SQF1R);
Educational Institutions /10,000 (EDUINST);
Hospital Beds /10,000 (HOSBED);
Clinics Power (visits a day) /10,000 (POWCLIN);
# of Physicians /10,000 (PHYS);
# of Medical Staff /10,000 (MEDSTAF);
10. Type: Economics & Money Income
Names of variables and descriptions:
Gross Regional Product Millions Rubles /10,000 (GRPMRUB);
Money Income /1 Person (RUBPOP);
% Monetary Income to 1999 (CASHIN);
Consumer Spending /1 Person (CSPEND);
Average Salary /Employer (AVSALEM);
Ratio of Working Population vs. Non-Working Population (WVSNWP);
% of Economically Active Population (ECAPOP);
Annual Employment in Economy % to 1999 (AEEPOP);
Number of Economically Active People (NEAP);
Non-Food Price Index (NFPIND);
% Social Income /Total Income (SOCINC);
% Salary Income /Total Income (SALINC);
% Business Income /Total Income (BUSINC);
% Property Income /Total Income (PROPINC);
% Other Income /Total Income (OTHINC);
% People with Low Income (PEOPLOIN);
Unemployment Rate (Survey data) /10 (UNEMPR);
Unemployment Rate (Officially registered) /10
11. Type: Demographic factors & Crimes
Names of variables and descriptions:
Divorces /1000 (DIVOR);
Abortions /10,000 (Abortions);
Population (POP);
# of Women per Man (WPERM);
% of Urban Population (URBPOP);
Rural Population (RURPOP);
% of Population < 16 Years Old (CHLDRN);
% of Population 16-60 Years Old (WORKPOP);
% of Population Older than 60 Years Old (PENSPOP);
Growth Rural Population /1000 (GRUPOP);
Growth Urban Population /1000 (GURBPOP);
Murders & Attempted /10,000 (AMURDER);
Success Murders /100,000 (SMURDER);
Juveniles Crimes /10,000 (JUVCRIME);
Crimes in Alcohol Intoxication /10,000 (ALCRIME)
12. Type: Diseases & Death rates
Names of variables and descriptions:
Injury & Poisoning /1000 (INPOI);
Deaths by Alcohol Poisoning /100,000 (ALCPOIS);
Alcoholism Incidence /10,000 (ALCOINC);
Narcotics Incidence /10,000 (NARCINS);
Tuberculosis Prevalence /10,000 (TUPREV);
Deaths rural /1000 (DRUR);
Deaths Urban (DEATHUR);
Deaths Suicides /100,000 (DSUICID);
Deaths All/1000 (DEATHS);
Other Covariates
13. Preliminary Analysis of Variable Characteristics
Figure 1: Distribution, box plot and Q-Q plot of the alcoholism incidence per 10,000
on the original scale (left side) and after square root transformation (right side)
14. The summary statistics of alcoholism incidence per 10,000 population in Russia by year.
(SQRT = SQRT transformed)
Year       Sample size  Mean               Median             Std Dev
                        Original  SQRT     Original  SQRT     Original  SQRT
2000       75           13.83     3.618    13.56     3.682    6.54      0.867
2001       75           14.91     3.771    13.78     3.712    6.21      0.835
2002       0
2003       75           17.88     4.104    16.39     4.049    8.89      1.022
2004       75           17.36     4.041    16.01     4.001    9.21      1.023
2005       75           16.67     3.949    15.61     3.951    9.26      1.044
2006       76           15.59     3.834    14.42     3.797    8.00      0.950
2007       76           14.19     3.645    12.86     3.586    7.54      0.954
2008       77           14.24     3.669    13.04     3.612    6.94      0.884
2009       78           13.08     3.529    12.63     3.554    5.65      0.797
All Years  682          15.29     3.794    14.09     3.753    7.80      0.947
15. Some summary statistics of alcoholism incidence per 10,000 in original and square root transformed (SQRT) scales by Federation District (from West to East).
District ID  District Name    Sample size  Mean              Median            Std Dev
                                           Original  SQRT    Original  SQRT    Original  SQRT
8            North Caucasian  63           6.66      2.379   6.74      2.598   3.98      1.010
2            Southern         45           11.64     3.393   11.08     3.329   2.51      0.363
1            Central          153          16.00     3.957   16.01     3.671   4.56      0.589
3            Northwestern     90           15.86     3.912   14.72     3.837   5.81      0.748
7            Volga            130          14.95     3.839   14.25     3.776   3.65      0.468
6            Urals            36           14.95     3.856   15.06     3.881   2.24      0.289
5            Siberian         100          14.45     3.726   13.54     3.681   5.73      0.756
4            Far Eastern      65           25.89     4.878   19.25     4.388   15.83     1.458
16.
17. 61 inputs selected based
on the domain knowledge
and the context related to
alcoholism incidence
Alcoholism
Incidence
The input variables for model building were
finalized using the following steps:
• Explore the properties of each input by
investigating the distribution, the possible
outliers.
• Determine appropriate transformation for
each input variable, which depends on whether
the input is a class variable (categorical) or
interval scale.
• Determine strategies for handling missing
data.
• Conduct preliminary analysis to investigate
the relationship between each input and the
alcoholism incidence.
• Conduct a preliminary input variable
selection.
19. Determine appropriate transformation for each input
variable, which depends on whether the input is a class variable
(categorical) or interval scale.
Determine strategies for handling missing data.
The “nearest neighbour” technique.
20.
HOSBED   ALCPOIS  DRUR     ALCRIME  DEATH    Abortions  DSUICID  JUVCRIME  SALINC
.4827    .4566    .4395    .4269    .4235    .4171      .4168    .4150     .4056
MEDSTAF  SMURDER  AMURDER  DEATHUR  EDUINST  EMARPOL    DIVOR    SQF1R     INPOI
.3959    .3872    .3741    .3549    .2926    .2904      .2623    .2602     .2288
ALC_ILL  POWCLIN  ALC_2    PRURP    CAPPOL   VPER63     WPERM    ECAPOP    ALC_1
.2268    .2265    .1735    .1716    .1676    .1667      .1626    .1538     .1498
SOCINC   PENSPOP  NFPIND   CHLDRN   YEAR     BUSINC     UNEMPR   URBPOP    NEAP
.1364    .0973    -.1092   -.1094   -.1142   -.1277     -.1323   -.1847    -.2048
POP      RURPOP   CASHIN   JANTEMP  GURBPOP  NATINC     JULTEMP  OTHINC    GRP
-.2216   -.2385   -.2463   -.2668   -.2870   -.3538     -.3554   -.4034    -.4128
Table: Spearman Correlation Coefficients between Alcoholism Incidence /10,000 and 45
Inputs with Significant Correlations, Based on the Original Inputs
29 positive and 16 negative significant correlations.
The strongest correlations with alcoholism incidence are moderate in size. For example, Number of
Hospital Beds (HOSBED) has a correlation of .4827, Alcohol Poisoning (ALCPOIS) .4566 and Rural Death Rates
(DRUR) .4395, while Gross Regional Product (GRP) has a negative correlation of -.4128.
21. Conduct a preliminary input variable selection.
First, we perform a simple correlation analysis
using the selection criterion at p-value = .00005.
This screens out only those inputs that have little or
no relationship with the target.
The second step is regression Modeling using a
forward selection procedure with a cut-off p-value
of .0005.
With this strategy, we can further reduce the
number of inputs without losing potentially
important inputs for model building.
22. Modeling Methodology
(1) multiple regression,
(2) decision tree models,
(3) neural network models.
The partial least squares technique is performed to
compare with the final best model selected from the
three Modeling techniques.
The data are partitioned into Training (70%) and
Validation (30%). The Training data are used to
develop models and the Validation data are used for
selecting the ‘best’ model for each Modeling
technique.
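A 70/30 partition of this kind can be sketched as follows. This is only an illustration: the function name `partition`, the fixed seed and the use of 682 placeholder observations (the all-years sample size from the summary table) are assumptions, not the procedure actually used in the study.

```python
import random

def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for 682 region-year observations
observations = list(range(682))
train, valid = partition(observations)
print(len(train), len(valid))
```

Fixing the seed keeps the split reproducible, so every Modeling technique is trained and validated on the same observations.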
23. Indicator variables are created for each level of a class input, as in the
following example. Suppose a class input X has four levels {a,b,c,d}. Three indicator
variables, Ia, Ib and Ic, are created to replace the variable X:
Ia = 1 if X = ‘a’, otherwise Ia = 0. Ib and Ic are defined accordingly.
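The coding described on this slide can be sketched in a few lines. The function name `indicator_variables` and the key names `I_a`, `I_b`, `I_c` are illustrative assumptions; the last level, ‘d’, serves as the reference level and gets no indicator of its own.

```python
def indicator_variables(value, levels=("a", "b", "c", "d")):
    """Return 0/1 indicators for every level except the last (the reference level)."""
    return {f"I_{level}": int(value == level) for level in levels[:-1]}

for x in ("a", "b", "c", "d"):
    print(x, indicator_variables(x))
```

An observation with X = ‘d’ is represented by all three indicators being 0, which is why three variables are enough for four levels.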
27. Results and Discussions
Table: The Model Comparison Based on Validation Data
Model Type                               Average Squared Error
Regression (Transformed Inputs)          .381
Regression (Original)                    .392
Decision Tree (Original Inputs)          .383
Neural Network (Transformed Inputs)      .438
Partial Least Squares (Original Inputs)  .350
28. Results and Discussions
Table: Model Fit Statistics for the Regression Models
(AIC: Akaike information criterion; BIC: Bayesian Information Criterion; SBC: Schwarz Bayes criterion)
Regression Model    R2     Adjusted R2  AIC      BIC      SBC      Cp
Original Inputs     .7112  .7029        -682.7   -684.34  -624.71  96.16
Transformed Inputs  .7156  .7061        -682.89  -687.39  -616.65  148.60
29. Table: Inputs chosen for each of regression models and the decision tree model
Regression (Original): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED, MURD, SQF1R
Regression (Transformed): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED, JANTEMP, PHYS, AEEPOP, DEASUI
30. Table. The Parameter Estimates and the Corresponding t-statistics for the Regression Model
Factor Type / Input                               Parameter Estimate  t-statistic  p-value
Environment & Alcohol Related factors
  Federal District FED Vs. Not FED (GEO)          0.1759              3.53         .0005
  Vodka liters /1person (ALC_2)                   -0.0332             -5.00        <.0001
  Deaths of Alcohol Poisoning /100,000 (ALCPD)    0.2689              4.00         <.0001
Social Factors & Health Service, Infrastructure
  Square footage /1 Resident (SQF1R)              0.0588              4.34         <.0001
  Middle medical staff /10,000 (MEDSTAF)          0.0057              2.65         0.0084
  Hospital beds /10,000 (HOSBED)                  0.0097              4.88         <.0001
Economics & Money income
  Business Income (BUSINC)                        0.0227              3.58         0.0004
  % of Economical Active Pop (ECACPOP)            0.0325              4.13         <.0001
Demographic & Crimes
  % of Pop older than 60 (PENSPOP)                -.0818              4.13         <.0001
  Divorces /1000 (DIVOR)                          0.1817              7.59         <.0001
  Murders Attempt/10,000 (ATTMUR)                 0.1632              4.63         <.0001
  Murders /100,000 (MURD)                         -0.2253             -2.95        0.0034
Deaths
  Deaths rural /1000 (DTHRUR)                     0.0786              9.35         <.0001
31. Table. The Relative Importance of Inputs Selected for the Regression Model
Interpreted by using Decision Tree
Inputs                                        Relative Importance
Deaths Alcohol Poisoning /100,000 (ALCPOIS)   1.000
Hospital beds /10,000 (HOSBED)                .7419
Deaths rural /1000 (DRUR)                     .5251
Murders /100,000 (SMURDER)                    .3277
Murders Attempted /10,000 (AMURDER)           .2424
% of Economical Active Pop (ECAPOP)           .2033
V85_Business Income (BUSINC)                  .1793
32. Summary and Conclusion
• The main conclusion of the research is that official medical and social statistics are quite adequate for Modeling.
• Using several Modeling techniques, including multiple regression, decision trees, neural networks and partial least
squares, we identified and measured several factors associated with the level of alcoholism in Russian
society.
• The level of alcoholism is directly associated with the geographical location of the region, the level of medical
service development, the money income of people and the square footage of apartments per resident.
• The higher the sales of legal vodka, the lower the level of alcoholism (?). One possible interpretation is
that the data we have on ‘Vodka Consumption’ include only legal consumption and not the
moonshine consumption that dominates in the villages.
• The parameter estimate for the population older than 60 is negative, possibly because a younger society contains more
heavy drinkers and mostly abstainers live to old age.
• As the proportion of the economically active population increases, we see growing rates of alcoholism.
• Deaths from alcohol poisoning, divorces, murders and attempted murders are strongly associated with
alcoholism.
Editor's Notes
Hello everyone!
In collaboration with Professor Carl Lee (Math. Dept., CMU) we developed three mathematical models of diseases using Russian data.
They were subsequently published.
I started to collect data in June while I was in Moscow, then continued in Mt. Pleasant, and we finished in February-March 2012.
We collected healthcare data from open statistical sources and compiled them into a comprehensive research database. The database contained over 100 medical and social variables from 12 sources, collected in a MySQL database for further analysis.
The data were obtained from the publications of continuous statistics of the Russian State Statistics Committee (ROSSTAT). Official national statistics are published in
compilations in electronic and paper form and on the ROSSTAT web site.
Data are presented as absolute numbers and as standardized ratios.
We developed a secondary research database, “Russian SEIPH” (Social Economics Interference and Public Health), in which over 130 variables from 6 yearbooks and two ROSSTAT
databases were combined. A 10-year period (2000 to 2009) was selected for modeling. The unit of study is a region of Russia. Because the data were collected from different sources,
we created a MySQL database with a data entry form based on Hypertext Preprocessor (PHP) and the possibility to export data in SAS file format. We used the relational
model of database, in which every value of a variable is held in a new row of fixed columns.
Super-regions: North Caucasian, Southern, Central, Northwestern, Volga, Urals, Siberian, Far Eastern
We built 3 models. For each model, we excluded from the 130 variables those not related to the purpose of the study.
The input variables selected for these studies are those that passed a preliminary statistical assessment and those suggested by the theoretical framework.
For the alcoholism incidence model, 61 variables were chosen at the first stage as input variables (factors) for Modeling.
The next 5 slides summarize the input variables in this study. These inputs are classified into five factor types:
Using expert judgment, we deliberately chose an excessive number of variables so that their pairwise relationships with the dependent variable could be validated further.
There are a total of 61 inputs selected from the domain knowledge and the context related to alcoholism incidence. These inputs cover the five categories of factor types described previously. The inputs vary greatly in their characteristics, including the measuring unit, the magnitude of the values and the distributions. Before investigating the relationship between these inputs and the target, it is critical to explore the inputs and apply proper variable transformations, which support further variable reduction and, eventually, building models to identify the impact factors.
The normal Q-Q plot and box plot of alcoholism incidence for the entire data set are shown on the left side of Figure 1. The figure indicates that the distribution of alcoholism incidence is far from normal. Among the transformations considered for maximizing normality, the square root transformation gives the distribution closest to normal, which is shown on the right side of Figure 1. Even so, the square root transformed alcoholism incidence does not follow a normal distribution well: the Shapiro-Wilk test statistic is 0.9924 (p-value < .05). However, the distribution of the transformed alcoholism incidence is approximately symmetric, with a mean of 3.79 and a median of 3.75. The extremely low alcoholism incidence occurred in the Ingushetia region and the extremely high alcoholism incidence occurred in the Magadan and Sakhalin regions.
Table 2. The summary statistics of alcoholism incidence per 10,000 population in Russia by year.
Table 2 summarizes the average, median and standard deviation of the alcoholism incidence (in original scale) by year.
The data for 2002 were not available. As these summaries show, the averages are noticeably larger than the medians.
This is a clear indication that the distribution of alcoholism incidence is skewed to the right.
The corresponding box plots are given in the figures (left side: original scale; right side: SQRT transformed scale).
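The gap between mean and median can be checked directly from the ‘All Years’ row of the summary table. The sketch below expresses that gap in standard-deviation units; the function name `mean_median_gap` is an assumption, and this rough indicator is offered only as an illustration, not as the skewness measure used in the study.

```python
def mean_median_gap(mean, median, std_dev):
    """(mean - median) / standard deviation; positive values suggest right skew."""
    return (mean - median) / std_dev

# 'All Years' row of the summary table: mean, median, standard deviation
original_scale = mean_median_gap(15.29, 14.09, 7.80)
sqrt_scale = mean_median_gap(3.794, 3.753, 0.947)
print(round(original_scale, 3), round(sqrt_scale, 3))
```

The gap is several times smaller on the square root scale, which is consistent with the transformed distribution being closer to symmetric.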
Some summary statistics of alcoholism incidence per 10,000 in original and square root transformed scales by Federation District (from West to East).
The summary statistics and the corresponding box plots for each of the eight federal districts (arranged from West to East of Russia) are presented in Table 3 and Figure 3. The highest alcoholism incidence is in the Far Eastern District, with an average incidence of 25.89, and the lowest is in the North Caucasian District, with an average of 6.66. The low incidence in the North Caucasian District may be connected with religion. The standard deviations also differ among districts, ranging from 2.24 (Urals District) to 15.82 (Far Eastern District). The square root transformed alcoholism incidence appears to fit a normal distribution better than the original scale, as shown in Figure 3.
These steps are performed iteratively until we finalize the set of inputs that are ready for model building. In the following, we address the approach we took to handle each of these steps.
We also explore the properties of each input by investigating its distribution and possible outliers. Many interval-scale input variables do not follow a normal distribution. The following shows the distributions of some input variables. Most of the input variables are skewed to the right and require some variable transformation.
The examples shown above are interval scale variables. For this type of variable, we determine the transformation that brings each variable closest to a normal distribution (maximizing normality). For class variables, we apply the rare group collapsing method, grouping categories with less than .1% of cases into a new group.
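Rare group collapsing for a class variable can be sketched as below. The function name `collapse_rare_levels`, the pooled label `_OTHER_` and the example data are hypothetical; only the .1% threshold comes from the text above.

```python
from collections import Counter

def collapse_rare_levels(values, min_share=0.001, other_label="_OTHER_"):
    """Replace levels whose share of cases is below min_share with one pooled label."""
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if n / total < min_share}
    return [other_label if v in rare else v for v in values]

# Hypothetical class input: 2,000 cases, level "z" appears once (0.05% of cases)
values = ["a"] * 1200 + ["b"] * 799 + ["z"]
collapsed = collapse_rare_levels(values)
print(Counter(collapsed))
```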
A great deal of time and effort has been spent to ensure the quality and reliability of the data. The data set does not have many missing values. As a result, we did not need to drop any of the input units from our model. We employ the nearest neighbor technique to impute a missing value, using the average of the two nearest years of data from the same region.
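The imputation rule described here can be sketched as follows. The function name `impute_from_nearest_years`, the tie-breaking rule (earlier year first) and the example values are assumptions made for illustration; the study's actual implementation is not shown in the slides.

```python
def impute_from_nearest_years(series, year):
    """Average the two observed years closest to `year` for one region.

    `series` maps year -> value, with None marking a missing value.
    """
    observed = [(abs(y - year), y, v)
                for y, v in series.items() if v is not None and y != year]
    observed.sort()  # closest years first; ties are broken by the earlier year
    nearest = observed[:2]
    if len(nearest) < 2:
        raise ValueError("need at least two observed years")
    return sum(v for _, _, v in nearest) / 2

# Hypothetical values for one region; 2002 is missing
region = {2000: 13.8, 2001: 14.9, 2002: None, 2003: 17.9, 2004: 17.4}
print(impute_from_nearest_years(region, 2002))
```

For a gap in the middle of the series the two nearest years are the neighbours on either side; at the start or end of the series both come from the same side.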
After completing the data transformation and missing data imputation, we conduct a simple preliminary correlation analysis and examine scatterplots to investigate (1) whether there is a high correlation between a given variable and the target and (2) whether there is any non-linear relationship between a given input and the target. Table 4 gives the Spearman correlation coefficients between the alcoholism incidence and the 45 inputs whose correlation coefficients are significant at the .01% level. We do not report the correlation coefficients between alcoholism incidence and the transformed inputs because the final best model selected is based on the original scale of the input variables.
The correlated pairs can represent one-way or two-way relationships. From them we can form hypotheses about which factors are associated with increasing alcoholism incidence in Russia.
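Spearman's coefficient is the Pearson correlation computed on ranks, which is why it captures monotone but non-linear relationships. A minimal pure-Python sketch is given below; the example values are hypothetical and are not taken from the study data.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical hospital bed rates and alcoholism incidence for six regions
hosbed = [90, 105, 110, 120, 135, 150]
alcoinc = [8.0, 12.5, 11.0, 15.0, 14.0, 26.0]
print(round(spearman(hosbed, alcoinc), 4))
```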
Prior to model building, we conduct a preliminary variable selection using the following strategy.
Three types of predictive Modeling techniques are applied and compared to select the ‘best’ model using several model selection criteria.
This figure illustrates the Modeling methodology used in this study.
The target variable, Alcoholism Incidence per 10,000, is transformed using square root transformation so that the distribution of the transformed target is approximately normal. For input variables, two different processes are performed. One is to use the original scale of the inputs without transformation and the other is to apply maximum normality transformation to the inputs. As Figure 4 shows, for each process, we build a series of models including regression, decision tree, neural network, and partial least squares techniques. The best model chosen from each technique in each process is compared using the average squared error to select the final best model.
For the decision tree model, because the target is on an interval scale, the splitting stage applies the criterion of maximizing the F-statistic from an analysis of variance, with the Kass adjustment to prevent an unexpectedly high type I error rate during the splitting process on the Training data set. The pruning stage uses the assessment criterion of minimizing the average error on the validation data set. Missing data are treated as a separate category and are used in the model construction. The importance of input variables is determined using the logworth measure (-log10(p-value)).
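The logworth measure is a simple transformation of the p-value. The short sketch below evaluates it for a few p-values, two of which are the screening cut-offs quoted on the slides; the function name `logworth` is an illustrative choice.

```python
import math

def logworth(p_value):
    """Logworth = -log10(p-value); larger values mean stronger evidence for a split."""
    return -math.log10(p_value)

for p in (0.05, 0.0005, 0.00005):
    print(p, round(logworth(p), 2))
```

Each tenfold decrease in the p-value raises the logworth by exactly 1, which makes very small p-values easier to compare.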
The inputs passing the first screen are subject to a forward selection regression procedure using a p-value of .0005. The inputs passing the forward selection are then used as the inputs for regression Modeling. In the model building stage, a stepwise selection procedure that minimizes the average squared error on the training data is applied to build the models.
At each step of the stepwise regression procedure, a best regression model is obtained. The procedure stops when no input can be added or dropped at p-value = .05.
Each of the best regression models from the steps of the stepwise regression is then validated by applying the model to the validation data and computing the average squared error.
The final best model is the one with the smallest average squared error on the validation data. This is done to prevent ‘overfitting’ of a model selected only by the stepwise procedure.
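Choosing among the stepwise candidates by validation error can be sketched as below. The function names and the prediction values are hypothetical; the sketch assumes that each step's model has already been fitted on the training data and applied to the validation data.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def best_step(validation_target, predictions_by_step):
    """Return the step whose predictions give the smallest validation error."""
    ase = {step: average_squared_error(validation_target, preds)
           for step, preds in predictions_by_step.items()}
    return min(ase, key=ase.get), ase

# Hypothetical square-root-scale targets and per-step validation predictions
y_valid = [3.6, 4.1, 3.9, 4.8]
steps = {
    1: [3.9, 3.9, 3.9, 3.9],
    2: [3.7, 4.0, 3.8, 4.5],
    3: [3.5, 4.4, 3.6, 5.2],
}
step, ase = best_step(y_valid, steps)
print(step, {k: round(v, 4) for k, v in ase.items()})
```

In this made-up example the third step fits more terms but predicts the validation data worse than the second, which is the overfitting pattern the validation check is meant to catch.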
It has been shown that a neural network is a universal approximator (Ripley [12]). A neural network model can become quite complex as the number of hidden layers and the number of hidden units increase. With a large number of inputs, the complexity is compounded and the number of weights to be estimated grows rapidly. It is therefore essential to conduct an appropriate input variable selection before neural network Modeling. The preliminary screening employed for regression Modeling often still leaves too many inputs for neural network Modeling because, unlike regression Modeling, the neural network technique cannot select input variables while the network is being built. Therefore, prior to building the neural network model, a decision tree technique is applied to conduct the variable selection.
The selection of the final best model is performed by comparing the average squared error based on the validation data set. Table 5 summarizes the average squared error for each Modeling technique.
The three models (regression with original inputs, regression with transformed inputs and the decision tree) are very comparable.
Since regression models are easier to interpret, we choose regression models as our final choice.
In the table of model fit statistics, we compare the two regression models using various model-fitting criteria. Even though the regression model based on transformed inputs has a lower average squared error (shown previously), the selection criteria in the table suggest that the regression model based on the original scale of inputs is a better choice for the following reasons: (1) with two fewer inputs, the R2 is only .0044 lower, and (2) while the AIC and BIC are comparable, the model using original inputs performs better based on SBC and Cp.
Therefore, we choose the regression model based on the original inputs as our final ‘best’ model (Table ).
Table 7 gives a further comparison between the two regression models and shows that eleven common inputs are selected in both models.
The parameter estimates along with the t-statistics and p-values for the final ‘best’ regression model are summarized in Table 8.
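Because the target is the square root of alcoholism incidence, a parameter estimate describes a change on the square root scale, and its effect on the original scale depends on the starting level. The sketch below works through one case; the function name and the choice of the all-years mean of the transformed target (3.794) as the baseline are assumptions made for illustration.

```python
def change_in_incidence(baseline_sqrt, coefficient, delta=1.0):
    """Change in incidence per 10,000 when an input rises by `delta`, other inputs fixed.

    The model predicts the square root of incidence, so the prediction is
    squared to return to the original scale.
    """
    return (baseline_sqrt + coefficient * delta) ** 2 - baseline_sqrt ** 2

baseline = 3.794  # all-years mean of the square root transformed incidence
divor = 0.1817    # parameter estimate for Divorces /1000 (DIVOR)
print(round(change_in_incidence(baseline, divor), 2))
```

Starting from a higher baseline, the same coefficient corresponds to a larger change in incidence per 10,000, so effects on the original scale should always be quoted together with the baseline used.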
The parameter estimate for the input ‘Vodka Consumption’ is negative, which suggests that the ‘pure’ association between alcoholism incidence and vodka consumption is negative. This seems counterintuitive. A quick check shows that Alcoholism Incidence and Vodka Consumption have a weak positive correlation of .1735. There could be many different interpretations of this association. One possible interpretation is that the data we have on ‘Vodka Consumption’ include only legal consumption. It is known that a great deal of ‘unreported’, ‘illegal’ moonshine is available across the nation, especially in rural areas.
The parameter estimate for the input ‘% of Population older than 60 years old’ is negative, possibly because a younger society contains more heavy drinkers.
A similar explanation may apply to the variable ‘Murders’. A possible interpretation is that intoxicated people, acting on emotion, more often attempt murder unsuccessfully, whereas sober people who prepare in advance and act in cold blood are more often successful.
In the following, we apply the decision tree technique to interpret the regression model obtained above. The following table shows the relative importance of the input variables resulting from the decision tree. The decision tree that interprets the regression model is given.
Based on the decision tree, the lowest alcoholism incidence occurs when the number of deaths from alcohol poisoning per 100,000 is less than 0.01283 (n = 35); the predicted alcoholism incidence is 5.224 per 10,000. The highest alcoholism incidence occurs when the number of hospital beds is greater than 146.2 per 10,000 and the number of deaths due to alcohol poisoning is greater than 0.01283 per 100,000 (n = 27); the predicted alcoholism incidence is 28.63.
All significant factors associated with alcoholism incidence are classified into two groups: factors associated with the incidence of alcoholism directly and those associated indirectly. The level of alcoholism incidence is also associated with other diseases and social problems.