The predictive modeling approach on continuous statistics

Predictive Modeling Approach for Socio-determined
Indicators based on Continuous Federal Statistics
Sergey Soshnikov
MD, Ph.D., Head of department
Federal Public Health Research Institute
Keywords: Data Quality, Decision Tree, Ensemble, Gradient Boosting, LASSO, Neural Network, Partial Least Squares

• Conducted from June 2011 to February 2012.
• Support: Fulbright Program Scholarship for researchers, grant number
68435006, at Central Michigan University.
Research team:
Sergey Soshnikov
Dept. of Medical and Social Problems,
CPHRI, 11, Str. Dobrolyubova, Moscow, 127254, RU
ssoshnikov@fulbrightmail.org
Carl Lee
Department of Mathematics,
Central Michigan University, USA
Vasiliy Vlassov
Department of Public Health and Preventive Medicine,
I.M. Sechenov First Moscow State Medical University, RU
Maria Gaidar
Public Administration (MC/MPA),
Harvard University, USA
Sergey Vladimirov
Independent Laboratory SQLab, RU

PAPERS IN PEER REVIEW JOURNALS
1. Lee, C., Soshnikov, S., Vladimirov, S. "Are Socio-Economic, Health Infrastructure, and Demographic Factors
Associated with Infant Mortality in Russia?," International Journal of Software Innovation (IJSI) 1 (2013): 4,
accessed April 23, 2014, doi:10.4018/ijsi.2013100105.
2. Soshnikov, S., Lee, C., Vlassov, V., Vladimirov, S. (2012) Factors Associated with Abortions in Russia: a
Predictive Modeling Study. European Journal of Public Health, Vol. 22(2), 95-95.
PUBLISHED REFEREED ABSTRACTS
1. Soshnikov, S., Lee, C., Vladimirov, S. (2013). A Modeling approach to identify factors associated with Infant
mortality in Russia. 12th IEEE/ACIS International Conference on Computer and Information Science (ICIS
2013), p. 185-190. June 16 -20, 2013, Toki Messe, Niigata Japan.
2. Soshnikov, S., Lee, C., Vlassov, V., Gaidar, M. and Vladimirov, S. (2013). A comparison of some predictive
modeling techniques for modeling abortion rates in Russia. Proceedings (Refereed), 14th IEEE/ACIS
International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing (SNPD 2013), p. 115-120. July 1st – July 3rd, 2013, Honolulu, Hawaii, USA.
REFEREED PRESENTATIONS
1. The Predictive Modeling Approach on Continuous Statistics of Alcoholism Incidence in Russia. International
Symposium on Health Information Management Research (ISHIMR 2015), York, UK.
2. Analysis of the official medical and social statistics on the example of the infant mortality rate. X International
scientific conference "The use of multivariate statistical analysis in economics and quality assessment." (2014),
Higher School of Economics in Moscow, Russia.
3. A Modeling approach to identify factors associated with Infant mortality in Russia. 12th IEEE/ACIS International
Conference on Computer and Information Science (ICIS 2013), p. 185-190. June 16 -20, 2013, Toki Messe,
Niigata Japan.

The secondary research database
MySQL database with over 130 variables collected from 6 yearbooks and 2 official databases.
Sources shown: The Central Statistical Database (CSDB); Demographic Yearbook; National Accounts; Main Economic and Social Indicators.
Source file formats: HTML, DOC, XLS, CSV.

Geography of research
Subnational level
The Russian Federation comprises 85 federal regions.
There are 6 types of federal regions:
• 22 republics
• 9 territories (krai)
• 46 regions (oblast)
• 3 federal cities
• 1 autonomous region (oblast)
• 4 autonomous areas (okrug)

List of statistical sources used for developing the database
(yearbooks and state database):
a. Regions of Russia. Social-Economic Indicators. Data collected by the state statistics bodies from enterprises,
organizations and the population through censuses, sample surveys and other forms of statistical
observation, data from ministries and departments of the Russian Federation, and information
received from organizations.
b. Regions of Russia. The Main Characteristics of the Russian Federation. Presents detailed figures on the
social and economic development of every region of Russia.
c. Demographic Yearbook of Russia. Contains statistical data on the administrative-territorial division;
changes in the size and the age and sex composition of the population and its distribution within
the Russian Federation; births, deaths, marriages, divorces and migration; summary demographic
indicators of population reproduction; and standardized mortality rates by cause of death.
d. Social Status and Standards of Living of the Russian Population. The most complete edition
reflecting the social processes and living conditions of the population of Russia.
e. Healthcare in Russia. The handbook provides data describing the population's health status over
time.
f. National Accounts of Russia. Contains data on the volume, structure and dynamics of the Gross
Domestic Product, the consolidated accounts, the integrated table of national accounts (by sector), and
regional indicators of national accounts.
g. The Central Statistical Database (CSDB). Contains hundreds of variables carrying official
statistical data at the regional level of the Russian Federation.

Research DB: 130 variables
• Alcoholism Model: 61 variables
• Abortions Model: 47 variables
• Infant Mortality Model: 54 variables

Names of variables and descriptions, by type
Environment-related & Alcohol-related
Federal Districts as of 2009 (GEO);
Average Temperature in January (JANTEMP);
Average Temperature in July (JULTEMP);
Emissions Air Pollutants tons /10 (EMAIRPOL);
Captured Air Pollutants (filtered) /10,000 (CAPPOL);
Absolute Alcohol liters /1 person (consumption in one year) (ALC_1);
Vodka liters /1 person (ALC_2);
Offenses with Illegal Alcohol /10,000 (ALC_OFF);
[Figure: Dahlgren and Whitehead’s diagram]

Social & Health Service Infrastructure
Square Footage /1 Resident (SQF1R);
Educational Institutions /10,000 (EDUINST);
Hospital Beds /10,000 (HOSBED);
Clinics Power (visits a day) /10,000 (POWCLIN);
# of Physicians /10,000 (PHYS);
# of Medical Staff /10,000 (MEDSTAF);

Economics & Money Income
Gross Regional Product Millions Rubles /10,000 (GRPMRUB);
Money Income /1 Person (RUBPOP);
% Monetary Income to 1999 (CASHIN);
Consumer Spending /1 Person (CSPEND);
Average Salary /Employer (AVSALEM);
Ratio of Working Population vs. Non-Working Population (WVSNWP);
% of Economically Active Population (ECAPOP);
Annual Employment in Economy % to 1999 (AEEPOP);
Number of Economically Active People (NEAP);
Non-Food Price Index (NFPIND);
% Social Income /Total Income (SOCINC);
% Salary Income /Total Income (SALINC);
% Business Income /Total Income (BUSINC);
% Property Income /Total Income (PROPINC);
% Other Income /Total Income (OTHINC);
% People with Low Income (PEOPLOIN);
Unemployment Rate (Survey data) /10 (UNEMPR);
Unemployment Rate (Officially registered) /10

Demographic factors & Crimes
Divorces /1000 (DIVOR);
Abortions /10,000 (Abortions);
Population (POP);
# of Women per Man (WPERM);
% of Urban Population (URBPOP);
Rural Population (RURPOP);
% of Population < 16 Years Old (CHLDRN);
% of Population 16-60 Years Old (WORKPOP);
% of Population Older than 60 Years (PENSPOP);
Growth Rural Population /1000 (GRUPOP);
Growth Urban Population /1000 (GURBPOP);
Murders & Attempted /10,000 (AMURDER);
Completed Murders /100,000 (SMURDER);
Juvenile Crimes /10,000 (JUVCRIME);
Crimes in Alcohol Intoxication /10,000 (ALCRIME)

Diseases & Death rates
Injury & Poisoning /1000 (INPOI);
Deaths by Alcohol Poisoning /100,000 (ALCPOIS);
Alcoholism Incidence /10,000 (ALCOINC);
Narcotics Incidence /10,000 (NARCINS);
Tuberculosis Prevalence /10,000 (TUPREV);
Deaths rural /1000 (DRUR);
Deaths Urban (DEATHUR);
Deaths Suicides /100,000 (DSUICID);
Deaths All/1000 (DEATHS);
Other Covariates

Preliminary Analysis of Variable Characteristics
Figure 1: Distribution, box plot and Q-Q plot of the alcoholism incidence per 10,000 on the
original scale (left side) and the square-root-transformed scale (right side)

                        Mean               Median             Std Dev
Year       Sample size  Original  SQRT     Original  SQRT     Original  SQRT
2000       75           13.83     3.618    13.56     3.682    6.54      0.867
2001       75           14.91     3.771    13.78     3.712    6.21      0.835
2002       0
2003       75           17.88     4.104    16.39     4.049    8.89      1.022
2004       75           17.36     4.041    16.01     4.001    9.21      1.023
2005       75           16.67     3.949    15.61     3.951    9.26      1.044
2006       76           15.59     3.834    14.42     3.797    8.00      0.950
2007       76           14.19     3.645    12.86     3.586    7.54      0.954
2008       77           14.24     3.669    13.04     3.612    6.94      0.884
2009       78           13.08     3.529    12.63     3.554    5.65      0.797
All Years  682          15.29     3.794    14.09     3.753    7.80      0.947
Table: Summary statistics of alcoholism incidence per 10,000 population in Russia by year, on the
original and square-root (SQRT) transformed scales.
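The two scales in the table can be compared with a few lines of code. This is a minimal sketch using only the Python standard library; the function names and the example rates are hypothetical and are not taken from the study data.

```python
import math
import statistics

def summarize(values):
    """Return mean, median and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),
    }

def summarize_both_scales(rates):
    """Summarize non-negative rates on the original and square-root scales."""
    transformed = [math.sqrt(r) for r in rates]
    return {"original": summarize(rates), "sqrt": summarize(transformed)}

# Hypothetical incidence rates per 10,000 for a handful of regions (not real data).
example_rates = [4.0, 9.0, 16.0, 25.0, 36.0]
summary = summarize_both_scales(example_rates)
```

The square root pulls in the right tail, so the mean and median sit closer together on the transformed scale than on the original one.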

                                  Mean               Median             Std Dev
ID  District         Sample size  Original  SQRT     Original  SQRT     Original  SQRT
8   North Caucasian  63           6.66      2.379    6.74      2.598    3.98      1.010
2   Southern         45           11.64     3.393    11.08     3.329    2.51      0.363
1   Central          153          16.00     3.957    16.01     3.671    4.56      0.589
3   Northwestern     90           15.86     3.912    14.72     3.837    5.81      0.748
7   Volga            130          14.95     3.839    14.25     3.776    3.65      0.468
6   Urals            36           14.95     3.856    15.06     3.881    2.24      0.289
5   Siberian         100          14.45     3.726    13.54     3.681    5.73      0.756
4   Far Eastern      65           25.89     4.878    19.25     4.388    15.83     1.458
Table: Summary statistics of alcoholism incidence per 10,000 on the original and square-root (SQRT)
transformed scales by Federal District (from West to East).

61 inputs were selected based on domain knowledge and the context related to alcoholism incidence.
Target: Alcoholism Incidence
The input variables for model building were finalized using the following strategies:
• Explore the properties of each input by investigating its distribution and possible outliers.
• Determine the appropriate transformation for each input variable, depending on whether
the input is a class (categorical) variable or an interval-scale variable.
• Determine strategies for handling missing data.
• Conduct a preliminary analysis to investigate the relationship between each input and the
alcoholism incidence.
• Conduct a preliminary input variable selection.

Characteristics of Inputs and their relationship with the Alcoholism incidence

Determine the appropriate transformation for each input variable, depending on whether the input
is a class (categorical) variable or an interval-scale variable.
Determine strategies for handling missing data: the “nearest neighbour” technique.
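The slides name the “nearest neighbour” technique without giving details. The sketch below shows one simple variant, and everything specific in it is an assumption: a single donor row, Euclidean distance over the observed columns, and no rescaling of inputs. The function name and the example records are hypothetical.

```python
import math

def nearest_neighbour_impute(rows):
    """Fill None entries in each row with the value from the closest complete row.

    Distance is Euclidean over the columns the incomplete row actually has
    (an assumption; the slides do not state the distance used).
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to borrow values from")
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]
        donor = min(
            complete,
            key=lambda c: math.sqrt(sum((row[i] - c[i]) ** 2 for i in known)),
        )
        filled.append([donor[i] if v is None else v for i, v in enumerate(row)])
    return filled

# Hypothetical region records: [HOSBED, DIVOR, ALC_2]; None marks a missing value.
records = [
    [100.0, 4.0, 10.0],
    [120.0, 5.0, 14.0],
    [118.0, 5.2, None],
]
imputed = nearest_neighbour_impute(records)
```

In practice the inputs would usually be standardized first, so that a variable measured in large units does not dominate the distance.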

HOSBED    ALCPOIS   DRUR      ALCRIME   DEATH     Abortions DSUICID   JUVCRIME  SALINC
.4827     .4566     .4395     .4269     .4235     .4171     .4168     .4150     .4056

MEDSTAF   SMURDER   AMURDER   DEATHUR   EDUINST   EMARPOL   DIVOR     SQF1R     INPOI
.3959     .3872     .3741     .3549     .2926     .2904     .2623     .2602     .2288

ALC_ILL   POWCLIN   ALC_2     PRURP     CAPPOL    VPER63    WPERM     ECAPOP    ALC_1
.2268     .2265     .1735     .1716     .1676     .1667     .1626     .1538     .1498

SOCINC    PENSPOP   NFPIND    CHLDRN    YEAR      BUSINC    UNEMPR    URBPOP    NEAP
.1364     .0973     -.1092    -.1094    -.1142    -.1277    -.1323    -.1847    -.2048

POP       RURPOP    CASHIN    JANTEMP   GURBPOP   NATINC    JULTEMP   OTHINC    GRP
-.2216    -.2385    -.2463    -.2668    -.2870    -.3538    -.3554    -.4034    -.4128
Table: Spearman Correlation Coefficients between Alcoholism Incidence /10,000 and the 45
Inputs with Significant Correlations, Based on the Original Inputs
29 positive and 16 negative significant correlations.
The strongest positive correlations with alcoholism incidence are for Number of Hospital Beds (HOSBED)
at .4827, Alcohol Poisoning (ALCPOIS) at .4566 and Rural Death Rates (DRUR) at .4395; the strongest
negative correlation is for Gross Regional Product (GRP) at -.4128.
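For readers unfamiliar with the coefficient used here, Spearman's correlation is the Pearson correlation computed on ranks. A minimal standard-library sketch follows; the function names and the five example values are hypothetical and are not the study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for five regions (not taken from the study data).
hosbed = [80.0, 95.0, 110.0, 120.0, 150.0]
alcoinc = [9.0, 12.0, 11.0, 30.0, 18.0]
rho = spearman(hosbed, alcoinc)
```

Because only ranks enter the calculation, the coefficient is unchanged by monotone transformations such as the square root applied to the target.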

Conduct a preliminary input variable selection.
First, we perform a simple correlation analysis using the selection criterion of p-value = .00005.
This screens out only inputs that have little or no relationship with the target.
The second step is regression modeling using a forward selection procedure with a cut-off p-value
of .0005.
With this strategy, we can further reduce the number of inputs without losing potentially
important inputs for model building.
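The first screening step can be sketched as follows. This is an illustration only, under stated assumptions: it uses the Pearson correlation and a large-sample normal approximation for the p-value, whereas the slides do not say which test the software applied. The function names and the synthetic data are hypothetical; the forward selection step is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def approx_p_value(r, n):
    """Two-sided p-value for H0: no correlation.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) with a normal approximation,
    which is reasonable only for large n (an assumption of this sketch).
    """
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

def screen_inputs(inputs, target, alpha=0.00005):
    """Keep the names of inputs whose correlation with the target has p < alpha."""
    kept = []
    for name, values in inputs.items():
        r = pearson(values, target)
        if approx_p_value(r, len(target)) < alpha:
            kept.append(name)
    return kept

# Synthetic illustration: one input tracks the target, the other just alternates.
n = 200
target = [float(i) for i in range(n)]
inputs = {
    "related": [2.0 * i + (i % 3) for i in range(n)],
    "unrelated": [1.0 if i % 2 == 0 else -1.0 for i in range(n)],
}
selected = screen_inputs(inputs, target)
```

With a threshold as strict as .00005, only inputs with a clear relationship to the target pass, which matches the stated aim of screening out inputs with little or no relationship.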

Modeling Methodology
(1) multiple regression,
(2) decision tree models,
(3) neural network models.
The partial least squares technique is also applied, for comparison with the best model selected
from the three modeling techniques.
The data are partitioned into Training (70%) and Validation (30%) sets. The Training data are used to
develop the models, and the Validation data are used to select the ‘best’ model for each modeling
technique.
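A 70/30 partition of this kind can be sketched as below. The random seed, the simple (unstratified) shuffle and the function name are assumptions; the slides do not describe how the split was drawn. The count of 682 is the "All Years" sample size from the summary table, used here only as a placeholder.

```python
import random

def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder record IDs standing in for region-year observations.
records = list(range(682))
training, validation = partition(records)
```

Fixing the seed keeps the split reproducible, so that every modeling technique is compared on the same Validation records.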

Indicator variables are created for each class input, as in the following example. Suppose a
class input X has four levels {a, b, c, d}. Three indicator variables, Ia, Ib and Ic, are
created to replace the variable X: Ia = 1 if X = ‘a’, otherwise Ia = 0. Ib and Ic are defined
accordingly.
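The coding described above can be written out as a short sketch. The function name is hypothetical, and treating the last level (d) as the reference level that receives all zeros is an assumption consistent with creating three indicators for four levels.

```python
def indicator_columns(values, levels):
    """Code a class input as 0/1 indicators, one per level except the last.

    The last level in `levels` acts as the reference and gets all zeros.
    """
    coded_levels = levels[:-1]
    return [
        {f"I{lvl}": 1 if v == lvl else 0 for lvl in coded_levels}
        for v in values
    ]

x = ["a", "c", "d", "b"]
coded = indicator_columns(x, levels=["a", "b", "c", "d"])
```

Dropping one level avoids perfect collinearity with the intercept in a regression model.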

[Figure: Decision tree models]
[Figure: Regression and neural network modeling techniques]

Results and Discussions
Model Type                                Average Squared Error
Regression (Transformed Inputs)           .381
Regression (Original)                     .392
Decision Tree (Original Inputs)           .383
Neural Network (Transformed Inputs)       .438
Partial Least Squares (Original Inputs)   .350
Table: The Model Comparison Based on Validation Data
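The comparison criterion in the table, average squared error on the Validation data, can be sketched as follows. The function names, model names and numbers are hypothetical and only illustrate the calculation.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def pick_best(models, actual):
    """Return the name of the model with the smallest validation error."""
    return min(models, key=lambda name: average_squared_error(actual, models[name]))

# Hypothetical validation targets and predictions from two candidate models.
actual = [3.0, 4.0, 5.0]
candidates = {
    "model_a": [3.5, 4.0, 4.5],
    "model_b": [2.0, 4.0, 6.0],
}
best = pick_best(candidates, actual)
```

Errors are only comparable across models when they are computed on the same records and on the same scale of the target.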

Regression Model     R2      Adjusted R2  AIC       BIC       SBC       Cp
Original Inputs      .7112   .7029        -682.7    -684.34   -624.71   96.16
Transformed Inputs   .7156   .7061        -682.89   -687.39   -616.65   148.60
AIC: Akaike information criterion; BIC: Bayesian information criterion; SBC: Schwarz Bayes criterion.
Table: Model Fit Statistics for the Regression Models
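As a reminder of how the adjusted R2 column relates to R2, the usual textbook formula for a model with an intercept is sketched below. The function name and the example numbers are hypothetical; the slides do not report the number of observations used in fitting, so the table values are not reproduced here.

```python
def adjusted_r2(r2, n_obs, n_inputs):
    """Adjusted R-squared for a model with an intercept and n_inputs predictors."""
    if n_obs - n_inputs - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_inputs - 1)

# Hypothetical example: R2 = 0.75 from 101 observations and 4 inputs.
example = adjusted_r2(0.75, n_obs=101, n_inputs=4)
```

The adjustment penalizes additional inputs, which is why the gap between R2 and adjusted R2 is slightly larger for the model with more inputs.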

Regression (Original): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC,
MEDSTAF, HOSBED, MURD, SQF1R
Regression (Transformed): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC,
MEDSTAF, HOSBED, JANTEMP, PHYS, AEEPOP, DEASUI
Table: Inputs chosen for each of the regression models and the decision tree model

Factor Type / Input                               Parameter Estimate  t-statistic  p-value
Environment & Alcohol Related factors
  Federal District FED Vs. Not FED (GEO)          0.1759              3.53         .0005
  Vodka liters /1 person (ALC_2)                  -0.0332             -5.00        <.0001
  Deaths of Alcohol Poisoning /100,000 (ALCPD)    0.2689              4.00         <.0001
Social Factors & Health Service, Infrastructure
  Square footage /1 Resident (SQF1R)              0.0588              4.34         <.0001
  Middle medical staff /10,000 (MEDSTAF)          0.0057              2.65         0.0084
  Hospital beds /10,000 (HOSBED)                  0.0097              4.88         <.0001
Economics & Money income
  Business Income (BUSINC)                        0.0227              3.58         0.0004
  % of Economical Active Pop (ECACPOP)            0.0325              4.13         <.0001
Demographic & Crimes
  % of Pop older than 60 (PENSPOP)                -.0818              4.13         <.0001
  Divorces /1000 (DIVOR)                          0.1817              7.59         <.0001
  Murders Attempt /10,000 (ATTMUR)                0.1632              4.63         <.0001
  Murders /100,000 (MURD)                         -0.2253             -2.95        0.0034
Deaths
  Deaths rural /1000 (DTHRUR)                     0.0786              9.35         <.0001
Table. The Parameter Estimates and the Corresponding t-statistics for the Regression Model

Inputs                                          Relative Importance
Deaths Alcohol Poisoning /100,000 (ALCPOIS)     1.000
Hospital beds /10,000 (HOSBED)                  .7419
Deaths rural /1000 (DRUR)                       .5251
Murders /100,000 (SMURDER)                      .3277
Murders Attempted /10,000 (AMURDER)             .2424
% of Economical Active Pop (ECAPOP)             .2033
V85_Business Income (BUSINC)                    .1793
Table. The Relative Importance of Inputs Selected for the Regression Model,
Interpreted by Using a Decision Tree

Summary and Conclusion
• The main conclusion of the research is that official medical and social statistics are quite adequate for modeling.
• Using several modeling techniques (multiple regression, decision trees, neural networks and partial least
squares), we found and measured several causal factors that influence the level of alcoholism in Russian
society.
• The level of alcoholism is directly influenced by the geographical location of the region, the level of medical
service development, the money income of people, and the square footage of apartments per resident.
• The higher the sales of legal vodka, the lower the level of alcoholism (?). One possible interpretation is
that our data on ‘Vodka Consumption’ include only legal consumption and not moonshine consumption,
which dominates in the villages.
• The parameter estimate for the population older than 60 is negative because a younger society contains more
heavy drinkers. Only abstainers live to old age.
• As the proportion of the economically active population increases, we see growing rates of alcoholism.
• Deaths from alcohol poisoning, divorces, murders and attempted murders depend strongly on
alcoholism.
