Predictive Modeling Approach for Socio-determined
Indicators based on Continuous Federal Statistics
Sergey Soshnikov
MD, Ph.D., Head of department
Federal Public Health Research Institute
Keywords: Data Quality, Decision Tree, Ensemble, Gradient Boosting, LASSO, Neural Network, Partial Least Squares
• Conducted from June 2011 to February 2012.
• Support: Fulbright Program Scholarship for researchers, grant number 68435006, at Central Michigan University.
Research team:
Sergey Soshnikov
Dept. of Medical and Social Problems,
CPHRI, 11, Str. Dobrolyubova, Moscow, 127254, RU
ssoshnikov@fulbrightmail.org
Carl Lee
Department of Mathematics,
Central Michigan University, USA
Vasiliy Vlassov
Department of Public Health and Preventive Medicine,
I.M. Sechenov First Moscow State Medical University, RU
Maria Gaidar
Public Administration (MC/MPA),
Harvard University, USA
Sergey Vladimirov
Independent Laboratory SQLab, RU
PAPERS IN PEER REVIEW JOURNALS
1. Lee, C., Soshnikov, S., Vladimirov, S. "Are Socio-Economic, Health Infrastructure, and Demographic Factors
Associated with Infant Mortality in Russia?" International Journal of Software Innovation (IJSI) 1 (2013): 4,
accessed April 23, 2014, doi:10.4018/ijsi.2013100105.
2. Soshnikov, S., Lee, C., Vlassov, V., Vladimirov, S. (2012) Factors Associated with Abortions in Russia: a
Predictive Modeling Study. European Journal of Public Health, Vol. 22(2), 95-95.
PUBLISHED REFEREED ABSTRACTS
1. Soshnikov, S., Lee, C., Vladimirov, S. (2013). A Modeling approach to identify factors associated with Infant
mortality in Russia. 12th IEEE/ACIS International Conference on Computer and Information Science (ICIS
2013), p. 185-190. June 16 -20, 2013, Toki Messe, Niigata Japan.
2. Soshnikov, S., Lee, C., Vlassov, V., Gaidar, M. and Vladimirov, S. (2013). A comparison of some predictive
modeling techniques for modeling abortion rates in Russia. Proceedings (Refereed), 14th IEEE/ACIS
International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing (SNPD 2013), p. 115-120. July 1st – July 3rd, 2013, Honolulu, Hawaii, USA.
REFEREED PRESENTATIONS
1. The Predictive Modeling Approach on Continuous Statistics of Alcoholism Incidence in Russia. International
Symposium on Health Information Management Research (ISHIMR 2015), York, UK.
2. Analysis of the official medical and social statistics on the example of the infant mortality rate. X International
scientific conference "The use of multivariate statistical analysis in economics and quality assessment." (2014),
Higher School of Economics in Moscow, Russia.
3. A Modeling approach to identify factors associated with Infant mortality in Russia. 12th IEEE/ACIS International
Conference on Computer and Information Science (ICIS 2013), p. 185-190. June 16 -20, 2013, Toki Messe,
Niigata Japan.
The secondary research database (MySQL)
Over 130 variables collected from 6 yearbooks and 2 official databases.
Sources include: the Central Statistical Database (CSDB), the Demographic Yearbook, the National Accounts, and the Main
Economic and Social Indicators.
Source file formats: HTML, DOC, XLS, CSV.
Geography of research: subnational level
The Russian Federation comprises 85 federal regions of 6 types:
• 22 republics
• 9 territories (krai)
• 46 regions (oblast)
• 3 federal cities
• 1 autonomous region (oblast)
• 4 autonomous areas (okrug)
Statistical sources used to develop the database (yearbooks and state databases):
a. Regions of Russia. Social and Economic Indicators. Data collected by the state statistics bodies from
enterprises, organizations and the population through censuses, sample surveys and other forms of statistical
observation, together with data from ministries and departments of the Russian Federation and information
received from organizations.
b. Regions of Russia. The Main Characteristics of the Russian Federation. Presents detailed figures on the
social and economic development of every region of Russia.
c. Demographic Yearbook of Russia. Contains statistical data on the administrative-territorial division; changes
in the size, age and sex composition of the population and its distribution within the Russian Federation;
births, deaths, marriages, divorces and migration; summary demographic indicators of population
reproduction; and standardized mortality rates by cause of death.
d. Social Status and Standards of Living of the Russian Population. The most complete edition
reflecting the social processes and living conditions of the population of Russia.
e. Healthcare in Russia. The handbook provides data describing the population's health status over
time.
f. National Accounts of Russia. Contains data on the volume, structure and dynamics of the Gross
Domestic Product, the consolidated accounts, the integrated table of national accounts (by sector), and
regional indicators of the national accounts.
g. The Central Statistical Database (CSDB). Contains hundreds of variables carrying official
statistical data at the regional level of the Russian Federation.
Research DB: 130 variables
• Alcoholism Model: 61 variables
• Abortions Model: 47 variables
• Infant Mortality Model: 54 variables
Type: 1. Environment-related & Alcohol-related
Names of variables and descriptions:
• Federal Districts as of 2009 (GEO)
• Average Temperature in January (JANTEMP)
• Average Temperature in July (JULTEMP)
• Emissions Air Pollutants tons /10 (EMAIRPOL)
• Captured Air Pollutants (filtered) /10,000 (CAPPOL)
• Absolute Alcohol liters /1 person (consumption in one year) (ALC_1)
• Vodka liters /1 person (ALC_2)
• Offenses with Illegal Alcohol /10,000 (ALC_OFF)
[Figure: Dahlgren and Whitehead's diagram]
Type: Social & Health Service Infrastructure
Names of variables and descriptions:
• Square Footage /1 Resident (SQF1R)
• Educational Institutions /10,000 (EDUINST)
• Hospital Beds /10,000 (HOSBED)
• Clinics Power (visits a day) /10,000 (POWCLIN)
• # of Physicians /10,000 (PHYS)
• # of Medical Staff /10,000 (MEDSTAF)
Type: Economics & Money Income
Names of variables and descriptions:
• Gross Regional Product Millions Rubles /10,000 (GRPMRUB)
• Money Income /1 Person (RUBPOP)
• % Monetary Income to 1999 (CASHIN)
• Consumer Spending /1 Person (CSPEND)
• Average Salary /Employer (AVSALEM)
• Ratio of Working Population vs. Non-Working Population (WVSNWP)
• % of Economical Active Population (ECAPOP)
• Annual Employ in Economy % to 1999 (AEEPOP)
• Number of Economical Active People (NEAP)
• Non-Food Price Index (NFPIND)
• % Social Income /Total Income (SOCINC)
• % Salary Income /Total Income (SALINC)
• % Business Income /Total Income (BUSINC)
• % Property Income /Total Income (PROPINC)
• % Other Income /Total Income (OTHINC)
• % People with Low Income (PEOPLOIN)
• Unemployment Rate (Survey data) /10 (UNEMPR)
• Unemployment Rate (Officially registered) /10
Type: Demographic factors & Crimes
Names of variables and descriptions:
• Divorces /1000 (DIVOR)
• Abortions /10,000 (Abortions)
• Population (POP)
• # of Women per Man (WPERM)
• % of Urban Population (URBPOP)
• Rural Population (RURPOP)
• % of Population < 16 Years Old (CHLDRN)
• % of Population 16-60 Years Old (WORKPOP)
• % of Population Older than 60 Years Old (PENSPOP)
• Growth Rural Population /1000 (GRUPOP)
• Growth Urban Population /1000 (GURBPOP)
• Murders & Attempted /10,000 (AMURDER)
• Success Murders /100,000 (SMURDER)
• Juveniles Crimes /10,000 (JUVCRIME)
• Crimes in Alcohol Intoxication /10,000 (ALCRIME)
Type: Diseases & Death rates
Names of variables and descriptions:
• Injury & Poisoning /1000 (INPOI)
• Deaths by Alcohol Poisoning /100,000 (ALCPOIS)
• Alcoholism Incidence /10,000 (ALCOINC)
• Narcotics Incidence /10,000 (NARCINS)
• Tuberculosis Prevalence /10,000 (TUPREV)
• Deaths Rural /1000 (DRUR)
• Deaths Urban (DEATHUR)
• Deaths Suicides /100,000 (DSUICID)
• Deaths All /1000 (DEATHS)
Other Covariates
Preliminary Analysis of Variable Characteristics
Figure 1: Distribution, box plot and Q-Q plot for the original scale (left side) and the
square-root transformed scale (right side) of the alcoholism incidence per 10,000

Table: Summary statistics of alcoholism incidence per 10,000 population in Russia by year,
on the original and square-root (SQRT) transformed scales.

            Sample   Mean                Median              Std Dev
Year        size     Original  SQRT      Original  SQRT      Original  SQRT
2000        75       13.83     3.618     13.56     3.682     6.54      0.867
2001        75       14.91     3.771     13.78     3.712     6.21      0.835
2002        0
2003        75       17.88     4.104     16.39     4.049     8.89      1.022
2004        75       17.36     4.041     16.01     4.001     9.21      1.023
2005        75       16.67     3.949     15.61     3.951     9.26      1.044
2006        76       15.59     3.834     14.42     3.797     8.00      0.950
2007        76       14.19     3.645     12.86     3.586     7.54      0.954
2008        77       14.24     3.669     13.04     3.612     6.94      0.884
2009        78       13.08     3.529     12.63     3.554     5.65      0.797
All Years   682      15.29     3.794     14.09     3.753     7.80      0.947
Table: Summary statistics of alcoholism incidence per 10,000 on the original and
square-root (SQRT) transformed scales by Federal District (from West to East).

District                    Sample   Mean                Median              Std Dev
ID    District name         size     Original  SQRT      Original  SQRT      Original  SQRT
8     North Caucasian       63       6.66      2.379     6.74      2.598     3.98      1.010
2     Southern              45       11.64     3.393     11.08     3.329     2.51      0.363
1     Central               153      16.00     3.957     16.01     3.671     4.56      0.589
3     Northwestern          90       15.86     3.912     14.72     3.837     5.81      0.748
7     Volga                 130      14.95     3.839     14.25     3.776     3.65      0.468
6     Urals                 36       14.95     3.856     15.06     3.881     2.24      0.289
5     Siberian              100      14.45     3.726     13.54     3.681     5.73      0.756
4     Far Eastern           65       25.89     4.878     19.25     4.388     15.83     1.458
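The comparison of the original and square-root scales in the tables above can be illustrated with a minimal sketch. The incidence values below are invented for illustration and are not study data; the helper name `summarize` is hypothetical.

```python
import math
import statistics


def summarize(values):
    """Return (mean, median, sample standard deviation) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )


# Hypothetical incidence rates per 10,000 for five regions (not study data).
incidence = [4.0, 9.0, 16.0, 25.0, 64.0]

# The square root compresses the long right tail of a skewed rate.
sqrt_incidence = [math.sqrt(v) for v in incidence]

original_stats = summarize(incidence)
sqrt_stats = summarize(sqrt_incidence)

print("original:", original_stats)
print("sqrt    :", sqrt_stats)
```

On the transformed scale the mean and median sit closer together, which is the kind of symmetry a Q-Q plot would be used to check.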
61 inputs were selected based on domain knowledge and the context related to
alcoholism incidence.
Target: alcoholism incidence.
The following steps were conducted to finalize the input variables for model building:
• Explore the properties of each input by investigating its distribution and possible
outliers.
• Determine an appropriate transformation for each input variable, depending on
whether the input is a class (categorical) variable or on an interval scale.
• Determine strategies for handling missing data.
• Conduct a preliminary analysis to investigate the relationship between each input and
the alcoholism incidence.
• Conduct a preliminary input variable selection.
Characteristics of Inputs and their relationship with the Alcoholism incidence
Determine an appropriate transformation for each input variable, depending on whether
the input is a class (categorical) variable or on an interval scale.
Determine strategies for handling missing data.
The “nearest neighbour” technique.
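A minimal sketch of nearest-neighbour imputation follows. The slides do not state the distance metric or the number of neighbours, so a single nearest neighbour and a root-mean-square distance over jointly observed columns are assumptions here. The function name `impute_nearest_neighbour` and the sample records are hypothetical.

```python
import math


def impute_nearest_neighbour(rows):
    """Fill None entries with the value from the closest row that has that column.

    Distance is the root-mean-square difference over the columns that both
    rows have observed (an assumption; the slides do not state the metric).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [r for k, r in enumerate(rows) if k != i and r[j] is not None]
            if donors:
                nearest = min(donors, key=lambda r: distance(row, r))
                filled[i][j] = nearest[j]
    return filled


# Hypothetical region records (hospital beds, physicians, divorces); None = missing.
regions = [
    [110.0, 45.0, 4.1],
    [95.0, 40.0, None],
    [60.0, 30.0, 2.0],
]
completed = impute_nearest_neighbour(regions)
print(completed[1])
```

In practice the columns would usually be standardized first so that inputs on larger scales do not dominate the distance.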
HOSBED    ALCPOIS   DRUR      ALCRIME   DEATH     Abortions DSUICID   JUVCRIME  SALINC
.4827     .4566     .4395     .4269     .4235     .4171     .4168     .4150     .4056
MEDSTAF   SMURDER   AMURDER   DEATHUR   EDUINST   EMARPOL   DIVOR     SQF1R     INPOI
.3959     .3872     .3741     .3549     .2926     .2904     .2623     .2602     .2288
ALC_ILL   POWCLIN   ALC_2     PRURP     CAPPOL    VPER63    WPERM     ECAPOP    ALC_1
.2268     .2265     .1735     .1716     .1676     .1667     .1626     .1538     .1498
SOCINC    PENSPOP   NFPIND    CHLDRN    YEAR      BUSINC    UNEMPR    URBPOP    NEAP
.1364     .0973     -.1092    -.1094    -.1142    -.1277    -.1323    -.1847    -.2048
POP       RURPOP    CASHIN    JANTEMP   GURBPOP   NATINC    JULTEMP   OTHINC    GRP
-.2216    -.2385    -.2463    -.2668    -.2870    -.3538    -.3554    -.4034    -.4128
Table: Spearman correlation coefficients between alcoholism incidence per 10,000 and the 45
inputs with significant correlations, based on the original inputs.
29 positive and 16 negative significant correlations.
The strongest positive correlations with alcoholism incidence are for Number of Hospital Beds (HOSBED) at .4827,
Alcohol Poisoning (ALCPOIS) at .4566 and Rural Death Rates (DRUR) at .4395. The strongest negative correlation is
for Gross Regional Product (GRP) at -.4128.
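For reference, a Spearman coefficient is the Pearson correlation of the ranks of the two variables. The sketch below computes it with the standard library only; the sample values are invented for illustration and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five regions (not study data).
hospital_beds = [60, 80, 95, 110, 130]
incidence = [8, 12, 11, 17, 25]
rho = spearman(hospital_beds, incidence)
print(round(rho, 4))
```

Because only ranks enter the calculation, the coefficient is unchanged by monotone transformations such as the square root applied to the target.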
Conduct a preliminary input variable selection.
First, we perform a simple correlation analysis
using the selection criterion at p-value = .00005.
This screens out only those inputs that have little or
no relationship with the target.
The second step is regression modeling using a
forward selection procedure with a cut-off p-value
of .0005.
With this strategy, we can further reduce the
number of inputs without losing potentially
important inputs for model building.
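The first screening step can be sketched as below. The slides do not say which test produced the p-values, so the large-sample Fisher z-transform of the Pearson correlation is an assumption here, and the forward selection step is not shown. The input names and data are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_p_value(r, n):
    """Two-sided p-value for zero correlation via Fisher's z (an approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite for |r| = 1
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def screen_inputs(inputs, target, alpha=0.00005):
    """Keep the names of inputs whose correlation p-value is below alpha."""
    kept = []
    for name, values in inputs.items():
        p = correlation_p_value(pearson(values, target), len(target))
        if p < alpha:
            kept.append(name)
    return kept


# Hypothetical data: one input tracks the target closely, the other does not.
target = list(range(1, 13))
inputs = {
    "strong_input": [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11],
    "weak_input": [1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 7],
}
selected = screen_inputs(inputs, target)
print(selected)
```

The default `alpha` mirrors the screening criterion quoted in the text. A forward selection step would then add the surviving inputs to a regression one at a time.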
Modeling Methodology
Three modeling techniques are applied:
(1) multiple regression,
(2) decision tree models,
(3) neural network models.
The partial least squares technique is then applied for
comparison with the final best model selected from the
three modeling techniques.
The data are partitioned into Training (70%) and
Validation (30%) sets. The Training data are used to
develop models and the Validation data are used for
selecting the 'best' model for each modeling
technique.
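The 70/30 partition can be sketched as follows, assuming a simple random split; the slides do not say whether the split was stratified, for example by year or district. The function name `partition` and the fixed seed are illustrative choices.

```python
import random


def partition(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 682 stand-in records, matching the overall sample size in the summary table.
records = list(range(682))
training, validation = partition(records)
print(len(training), len(validation))
```

Models would be fitted on `training` only, with `validation` held back for comparing the candidates.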
Indicator variables are created for the levels of a
class input, as in the following example. Suppose a
class input X has four levels {a, b, c, d}. Three indicator
variables, Ia, Ib and Ic, are created to replace the variable
X: Ia = 1 if X = 'a', otherwise Ia = 0. Ib and Ic are defined
accordingly.
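The encoding in this example can be sketched in a few lines. The function name `indicator_variables` is hypothetical; the last level (here 'd') acts as the reference level, for which all three indicators are 0.

```python
def indicator_variables(values, levels):
    """Encode a class input as len(levels) - 1 indicator columns.

    The last level is the reference level: all of its indicators are 0.
    """
    kept = levels[:-1]
    return [{f"I{level}": int(v == level) for level in kept} for v in values]


x = ["a", "c", "d", "b"]
encoded = indicator_variables(x, ["a", "b", "c", "d"])
for value, row in zip(x, encoded):
    print(value, row)
```

Dropping one level avoids perfect collinearity with the intercept in a regression model.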
Modeling Methodology
Decision tree models
Modeling Methodology
Regression and neural network Modeling techniques
Results and Discussions
Model Type                               Average Squared Error
Regression (Transformed Inputs)          .381
Regression (Original)                    .392
Decision Tree (Original Inputs)          .383
Neural Network (Transformed Inputs)      .438
Partial Least Square (Original Inputs)   .350
Table: The Model Comparison Based on Validation Data
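Average squared error is the mean of the squared differences between observed and predicted target values. The sketch below computes it for two hypothetical models on invented validation values and picks the one with the smaller error; none of the numbers are study results.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical validation targets and predictions from two candidate models.
validation_actual = [3.5, 4.0, 3.0, 4.5]
candidate_predictions = {
    "model_a": [3.0, 4.0, 3.5, 4.0],
    "model_b": [3.5, 3.0, 3.0, 5.5],
}

errors = {
    name: average_squared_error(validation_actual, preds)
    for name, preds in candidate_predictions.items()
}
best_model = min(errors, key=errors.get)
print(errors, best_model)
```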
Regression Model     R2      Adjusted R2   AIC       BIC       SBC       Cp
Original Inputs      .7112   .7029         -682.7    -684.34   -624.71   96.16
Transformed Inputs   .7156   .7061         -682.89   -687.39   -616.65   148.60
AIC: Akaike information criterion; BIC: Bayesian information criterion; SBC: Schwarz Bayes criterion.
Table: Model Fit Statistics for the Regression Models
Regression (Original): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED, MURD,
SQF1R
Regression (Transformed): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED,
JANTEMP, PHYS, AEEPOP, DEASUI
Table: Inputs chosen for each of regression models and the decision tree model
Factor Type / Input                               Parameter Estimate   t-statistic   p-value
Environment & Alcohol Related factors
  Federal District FED vs. Not FED (GEO)          0.1759               3.53          .0005
  Vodka liters /1 person (ALC_2)                  -0.0332              -5.00         <.0001
  Deaths of Alcohol Poisoning /100,000 (ALCPD)    0.2689               4.00          <.0001
Social Factors & Health Service, Infrastructure
  Square footage /1 Resident (SQF1R)              0.0588               4.34          <.0001
  Middle medical staff /10,000 (MEDSTAF)          0.0057               2.65          0.0084
  Hospital beds /10,000 (HOSBED)                  0.0097               4.88          <.0001
Economics & Money income
  Business Income (BUSINC)                        0.0227               3.58          0.0004
  % of Economical Active Pop (ECACPOP)            0.0325               4.13          <.0001
Demographic & Crimes
  % of Pop older than 60 (PENSPOP)                -.0818               4.13          <.0001
  Divorces /1000 (DIVOR)                          0.1817               7.59          <.0001
  Murders Attempt /10,000 (ATTMUR)                0.1632               4.63          <.0001
  Murders /100,000 (MURD)                         -0.2253              -2.95         0.0034
Deaths
  Deaths rural /1000 (DTHRUR)                     0.0786               9.35          <.0001
Table. The Parameter Estimates and the Corresponding t-statistics for the Regression Model
Inputs                                          Relative Importance
Deaths Alcohol Poisoning /100,000 (ALCPOIS)     1.000
Hospital beds /10,000 (HOSBED)                  .7419
Deaths rural /1000 (DRUR)                       .5251
Murders /100,000 (SMURDER)                      .3277
Murders Attempted /10,000 (AMURDER)             .2424
% of Economical Active Pop (ECAPOP)             .2033
V85_Business Income (BUSINC)                    .1793
Table. The Relative Importance of Inputs Selected for the Regression Model
Interpreted by using Decision Tree
Summary and Conclusion
• The main conclusion of the research is that official medical and social statistics are quite adequate for modeling.
• Using several modeling techniques (multiple regression, decision trees, neural networks and partial least
squares), we found and measured several factors that influence the level of alcoholism in Russian
society.
• The alcoholism level is directly influenced by the geographical location of the region, the level of medical
service development, the money income of people and the square footage of apartments per resident.
• The higher the sales of legal vodka, the lower the level of alcoholism (?). One possible interpretation is
that our 'Vodka Consumption' data include only legal consumption and not
moonshine consumption, which dominates in the villages.
• The parameter estimate for the population older than 60 is negative because a younger society contains more
heavy drinkers. Only abstainers live to old age.
• As the proportion of the economically active population increases, we see growing rates of alcoholism.
• Deaths from alcohol poisoning, divorces, murders and attempted murders strongly depend on
alcoholism.
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 

Recently uploaded (20)

18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 

The predictive modeling approach on continuous statistics

  • 1. Predictive Modeling Approach for Socio-determined Indicators based on Continuous Federal Statistics Sergey Soshnikov MD, Ph.D., Head of department Federal Public Health Research Institute 1 Keywords: Data Quality, Decision Tree, Ensemble, Gradient Boosting, LASSO, Neural Network, Partial Least Squares
  • 2. Conducted June 2011 – February 2012. Support: Fulbright Program Scholarship for researchers, grant number 68435006, at Central Michigan University. Research team: Sergey Soshnikov, Dept. of Medical and Social Problems, CPHRI, 11, Str. Dobrolyubova, Moscow, 127254, RU (ssoshnikov@fulbrightmail.org); Carl Lee, Department of Mathematics, Central Michigan University, USA; Vasiliy Vlassov, Department of Public Health and Preventive Medicine, I.M. Sechenov First Moscow State Medical University, RU; Maria Gaidar, Public Administration (MC/MPA), Harvard University, USA; Sergey Vladimirov, Independent Laboratory SQLab, RU.
  • 3. 3 PAPERS IN PEER REVIEW JOURNALS 1. Carl, Lee., Soshnikov, S.,Vladimirov, S. "Are Socio-Economic, Health Infrastructure, and Demographic Factors Associated with Infant Mortality in Russia?," International Journal of Software Innovation (IJSI) 1 (2013): 4, accessed (April 23, 2014), doi:10.4018/ijsi.2013100105. 2. Soshnikov, S., Lee, C., Vlassov, V., Vladimirov, S. (2012) Factors Associated with Abortions in Russia: a Predictive Modeling Study. European Journal of Public Health, Vol. 22(2), 95-95. PUBLISHED REFEREED ABSTRACTS 1. Soshnikov, S., Lee, C., Vladimirov, S. (2013). A Modeling approach to identify factors associated with Infant mortality in Russia. 12th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2013), p. 185-190. June 16 -20, 2013, Toki Messe, Niigata Japan. 2. Soshnikov, S., Lee, C., Vlassov, V., Gaidar, M. and Vladimirov, S. (2013). A comparison of some predictive modeling techniques for modeling abortion rates in Russia. Proceedings (Refereed), 14th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2013), p. 115-120. July 1st – July 3rd, 2013, Honolulu, Hawaii, USA. REFEREED PRESENTATIONS 1. The Predictive Modeling Approach on Continuous Statistics of Alcoholism Incidence in Russia. International Symposium on Health Information Management Research (ISHIMR 2015), York, UK. 2. Analysis of the official medical and social statistics on the example of the infant mortality rate. X International scientific conference "The use of multivariate statistical analysis in economics and quality assessment." (2014), Higher School of Economics in Moscow, Russia. 3. A Modeling approach to identify factors associated with Infant mortality in Russia. 12th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2013), p. 185-190. June 16 -20, 2013, Toki Messe, Niigata Japan.
  • 4. The secondary research database (MySQL): over 130 variables collected from 6 yearbooks and 2 official databases. Sources shown: The Central Statistical Database (CSDB), Demographic Yearbook, National Accounts, Main Economic and Social Indicators. Source formats: HTML, DOC, XLS, CSV.
  • 5. Geography of research: subnational level. The Russian Federation comprises 85 federal regions of 6 types: 22 republics; 9 territories (krai); 46 regions (oblast); 3 federal cities; 1 autonomous region (oblast); 4 autonomous areas (okrug).
  • 6. Statistical sources used to develop the database (yearbooks and state database): a. Regions of Russia. Social-Economics Indicators. Data collected by the state statistics bodies from enterprises, organizations and the population through censuses, sample surveys and other forms of statistical observation, together with data from ministries and departments of the Russian Federation and information received from organizations. b. Regions of Russia. The Main Characteristics of the Russian Federation. Presents detailed figures on the social and economic development of every region of Russia. c. Demographic Yearbook of Russia. Contains statistical data on the administrative-territorial division; the size, age and sex composition of the population and its distribution within the Russian Federation; births, deaths, marriages, divorces and migration; summary demographic indicators of population reproduction; and standardized mortality rates by cause of death. d. Social status and standards of living of the Russian population. The most complete edition reflecting the social processes and living conditions of the population of Russia. e. Healthcare in Russia. The handbook provides data describing the population's health status over time. f. National Accounts of Russia. Contains data on the volume, structure and dynamics of the Gross Domestic Product, the consolidated accounts, the integrated table of national accounts (by sector), and regional indicators of national accounts. g. The Central Statistical Database (CSDB). Contains hundreds of variables carrying official statistical data at the regional level of the Russian Federation.
  • 7. Research DB (130 variables): Alcoholism model, 61 variables; Abortions model, 47 variables; Infant mortality model, 54 variables.
  • 8. Type: Environment-related & Alcohol-related. Variables: Federal Districts as of 2009 (GEO); Average Temperature in January (JANTEMP); Average Temperature in July (JULTEMP); Emissions Air Pollutants tons /10 (EMAIRPOL); Captured Air Pollutants (filtered) /10,000 (CAPPOL); Absolute Alcohol liters /1 person (consumption in one year) (ALC_1); Vodka liters /1 person (ALC_2); Offenses with Illegal Alcohol /10000 (ALC_OFF). [Figure: Dahlgren and Whitehead’s diagram]
  • 9. Type: Social & Health Service Infrastructure. Variables: Square Footage /1 Resident (SQF1R); Educational Institutions /10000 (EDUINST); Hospital Beds /10,000 (HOSBED); Clinics Power (visits a day) /10,000 (POWCLIN); # of Physicians /10,000 (PHYS); # of Medical Staff /10,000 (MEDSTAF).
  • 10. Type: Economics & Money Income. Variables: Gross Regional Product Millions Rubles /10000 (GRPMRUB); Money Income /1 Person (RUBPOP); % Monetary Income to 1999 (CASHIN); Consumer Spending /1 Person (CSPEND); Average Salary /Employer (AVSALEM); Ratio of Working Population vs. Non-Working Population (WVSNWP); % of Economical Active Population (ECAPOP); Annual Employ in Economy % to 1999 (AEEPOP); Number of Economical Active People (NEAP); Non-Food Price Index (NFPIND); % Social Income /Total Income (SOCINC); % Salary Income /Total Income (SALINC); % Business Income /Total Income (BUSINC); % Property Income /Total Income (PROPINC); % Other Income /Total Income (OTHINC); % People with Low Income (PEOPLOIN); Unemployment Rate (Survey data) /10 (UNEMPR); Unemployment Rate (Officially registered) /10
  • 11. Type: Demographic factors & Crimes. Variables: Divorces /1000 (DIVOR); Abortions /10,000 (Abortions); Population (POP); # of Women per Man (WPERM); % of Urban Population (URBPOP); Rural Population (RURPOP); % of Population < 16 Years Old (CHLDRN); % of Population 16-60 Years Old (WORKPOP); % of Population Older than 60 Years Old (PENSPOP); Growth Rural Population /1000 (GRUPOP); Growth Urban Population /1000 (GURBPOP); Murders & Attempted /10,000 (AMURDER); Success Murders /100,000 (SMURDER); Juveniles Crimes /10,000 (JUVCRIME); Crimes in Alcohol Intoxication /10000 (ALCRIME).
  • 12. Type: Diseases & Death rates. Variables: Injury & Poisoning /1000 (INPOI); Deaths by Alcohol Poisoning /100,000 (ALCPOIS); Alcoholism Incidence /10,000 (ALCOINC); Narcotics Incidence /10,000 (NARCINS); Tuberculosis Prevalence /10,000 (TUPREV); Deaths Rural /1000 (DRUR); Deaths Urban (DEATHUR); Deaths Suicides /100,000 (DSUICID); Deaths All /1000 (DEATHS); Other Covariates.
  • 13. Preliminary Analysis of Variable Characteristics. Figure 1: Distribution, box plot and Q-Q plot for the original scale (left side) and the square root transformed scale (right side) of the alcoholism incidence per 10,000.
  • 14. The summary statistics of alcoholism incidence per 10,000 population in Russia by year.
      Year       Sample size   Mean               Median             Std Dev
                               Original   SQRT    Original   SQRT    Original   SQRT
      2000       75            13.83      3.618   13.56      3.682   6.54       0.867
      2001       75            14.91      3.771   13.78      3.712   6.21       0.835
      2002       0
      2003       75            17.88      4.104   16.39      4.049   8.89       1.022
      2004       75            17.36      4.041   16.01      4.001   9.21       1.023
      2005       75            16.67      3.949   15.61      3.951   9.26       1.044
      2006       76            15.59      3.834   14.42      3.797   8.00       0.950
      2007       76            14.19      3.645   12.86      3.586   7.54       0.954
      2008       77            14.24      3.669   13.04      3.612   6.94       0.884
      2009       78            13.08      3.529   12.63      3.554   5.65       0.797
      All Years  682           15.29      3.794   14.09      3.753   7.80       0.947
  • 15. Some summary statistics of alcoholism incidence per 10,000 in the original and square root transformed scales by Federation District (from West to East).
      ID  District         Sample size   Mean               Median             Std Dev
                                         Original   SQRT    Original   SQRT    Original   SQRT
      8   North Caucasian  63            6.66       2.379   6.74       2.598   3.98       1.010
      2   Southern         45            11.64      3.393   11.08      3.329   2.51       0.363
      1   Central          153           16.00      3.957   16.01      3.671   4.56       0.589
      3   Northwestern     90            15.86      3.912   14.72      3.837   5.81       0.748
      7   Volga            130           14.95      3.839   14.25      3.776   3.65       0.468
      6   Urals            36            14.95      3.856   15.06      3.881   2.24       0.289
      5   Siberian         100           14.45      3.726   13.54      3.681   5.73       0.756
      4   Far Eastern      65            25.89      4.878   19.25      4.388   15.83      1.458
  • 16.
  • 17. Alcoholism incidence: 61 inputs selected based on domain knowledge and the context related to alcoholism incidence. Steps conducted to finalize the input variables for model building: • Explore the properties of each input by investigating its distribution and possible outliers. • Determine the appropriate transformation for each input variable, which depends on whether the input is a class (categorical) variable or on an interval scale. • Determine strategies for handling missing data. • Conduct a preliminary analysis to investigate the relationship between each input and the alcoholism incidence. • Conduct a preliminary input variable selection.
  • 18. Characteristics of Inputs and their relationship with the Alcoholism incidence
  • 19. Determine the appropriate transformation for each input variable, which depends on whether the input is a class (categorical) variable or on an interval scale. Determine strategies for handling missing data: the “nearest neighbour” technique.
  • 20. Table: Spearman correlation coefficients between Alcoholism Incidence /10,000 and 45 inputs with significant correlations, based on the original inputs.
      HOSBED .4827; ALCPOIS .4566; DRUR .4395; ALCRIME .4269; DEATH .4235; Abortions .4171; DSUICID .4168; JUVCRIME .4150; SALINC .4056
      MEDSTAF .3959; SMURDER .3872; AMURDER .3741; DEATHUR .3549; EDUINST .2926; EMARPOL .2904; DIVOR .2623; SQF1R .2602; INPOI .2288
      ALC_ILL .2268; POWCLIN .2265; ALC_2 .1735; PRURP .1716; CAPPOL .1676; VPER63 .1667; WPERM .1626; ECAPOP .1538; ALC_1 .1498
      SOCINC .1364; PENSPOP .0973; NFPIND -.1092; CHLDRN -.1094; YEAR -.1142; BUSINC -.1277; UNEMPR -.1323; URBPOP -.1847; NEAP -.2048
      POP -.2216; RURPOP -.2385; CASHIN -.2463; JANTEMP -.2668; GURBPOP -.2870; NATINC -.3538; JULTEMP -.3554; OTHINC -.4034; GRP -.4128
      There are 29 positive and 16 negative significant correlations. Several inputs have a moderately strong positive correlation with alcoholism incidence: for example, Number of Hospital Beds (HOSBED) at .4827, Alcohol Poisoning (ALCPOIS) at .4566, and Rural Death Rates (DRUR) at .4395. Gross Regional Product (GRP) has the strongest negative correlation, -.4128.
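
For readers unfamiliar with the statistic in this table, Spearman's coefficient is the Pearson correlation computed on ranks. The sketch below is a minimal pure-Python illustration, not the software used in the study; the function names and the five toy observations are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values, not taken from the research database.
hosbed = [90.0, 110.0, 95.0, 130.0, 120.0]
alcoinc = [10.0, 15.0, 12.0, 30.0, 14.0]
rho = spearman(hosbed, alcoinc)
```

Because it works on ranks, the coefficient is unaffected by monotone transformations such as the square root applied to the target.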
  • 21. Conduct a preliminary input variable selection. First, we perform a simple correlation analysis using the selection criterion of p-value = .00005. This screens out only inputs that have little or no relationship with the target. The second step is regression modeling using a forward selection procedure with a cut-off p-value of .0005. With this strategy, we can further reduce the number of inputs without losing potentially important inputs for model building.
  • 22. Modeling Methodology: (1) multiple regression, (2) decision tree models, (3) neural network models. The partial least squares technique is performed for comparison with the final best model selected from the three modeling techniques. The data are partitioned into Training (70%) and Validation (30%). The Training data are used to develop models and the Validation data are used to select the ‘best’ model for each modeling technique.
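
The 70/30 partition and the average squared error used later for model comparison can be sketched as follows. This is an illustrative outline only: the slides do not say how the split was drawn, so the simple random shuffle, the seed, and the function names are assumptions.

```python
import random


def partition(rows, train_share=0.7, seed=0):
    """Randomly split rows into training and validation parts (assumed simple random split)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def average_squared_error(actual, predicted):
    """Mean of squared differences between observed and predicted target values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# 682 placeholder row ids, matching the total sample size in the yearly summary table.
rows = list(range(682))
train, valid = partition(rows)
```

Each candidate model would be fitted on `train` and scored with `average_squared_error` on `valid`; the lowest validation error identifies the preferred model within a technique.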
  • 23. Indicator variables are created for each level of a class input, as described in the following example. Suppose a class input X has four levels {a, b, c, d}. Three indicator variables, Ia, Ib and Ic, are created to replace the variable X: Ia = 1 if X = ‘a’, otherwise Ia = 0. Ib and Ic are defined accordingly.
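
The coding on this slide can be written out as a short sketch. The helper name and the `I_a` style keys are hypothetical; the last level listed ('d' here) acts as the reference level and gets no indicator of its own.

```python
def indicator_columns(values, levels):
    """Code a class input with k levels as k-1 indicators; the last level is the reference."""
    coded_levels = levels[:-1]
    return [{f"I_{lv}": int(v == lv) for lv in coded_levels} for v in values]


# A class input X with the four levels from the slide.
x = ["a", "c", "d", "b"]
coded = indicator_columns(x, ["a", "b", "c", "d"])
```

An observation at level 'd' is represented by all three indicators being 0, which is why only three variables are needed for four levels.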
  • 25. Modeling Methodology: regression and neural network modeling techniques
  • 26. Modeling Methodology: regression and neural network modeling techniques
  • 27. Results and Discussions. Table: The Model Comparison Based on Validation Data
      Model Type                               Average Squared Error
      Regression (Transformed Inputs)          .381
      Regression (Original)                    .392
      Decision Tree (Original Inputs)          .383
      Neural Network (Transformed Inputs)      .438
      Partial Least Square (Original Inputs)   .350
  • 28. Results and Discussions. Table: Model Fit Statistics for the Regression Models (AIC: Akaike information criterion; BIC: Bayesian Information Criterion; SBC: Schwarz Bayes criterion)
      Regression Model     R2      Adjusted R2   AIC       BIC       SBC       Cp
      Original Inputs      .7112   .7029         -682.7    -684.34   -624.71   96.16
      Transformed Inputs   .7156   .7061         -682.89   -687.39   -616.65   148.60
  • 29. Table: Inputs chosen for each of the regression models and the decision tree model
      Regression (Original): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED, MURD, SQF1R
      Regression (Transformed): ALC_2, WVSNWP, ALCPD, ATTMUR, ECACPOP, GEO, DIVOR, DTHRUR, BUSINC, MEDSTAF, HOSBED, JANTEMP, PHYS, AEEPOP, DEASUI
  • 30. Table: The Parameter Estimates and the Corresponding t-statistics for the Regression Model
      Input                                           Parameter Estimate   t-statistic   p-value
      Environment & Alcohol Related factors
        Federal District FED Vs. Not FED (GEO)        0.1759               3.53          .0005
        Vodka liters /1person (ALC_2)                 -0.0332              -5.00         <.0001
        Deaths of Alcohol Poisoning /100,000 (ALCPD)  0.2689               4.00          <.0001
      Social Factors & Health Service, Infrastructure
        Square footage /1 Resident (SQF1R)            0.0588               4.34          <.0001
        Middle medical staff /10,000 (MEDSTAF)        0.0057               2.65          0.0084
        Hospital beds /10,000 (HOSBED)                0.0097               4.88          <.0001
      Economics & Money income
        Business Income (BUSINC)                      0.0227               3.58          0.0004
        % of Economical Active Pop (ECACPOP)          0.0325               4.13          <.0001
      Demographic & Crimes
        % of Pop older than 60 (PENSPOP)              -.0818               4.13          <.0001
        Divorces /1000 (DIVOR)                        0.1817               7.59          <.0001
        Murders Attempt /10,000 (ATTMUR)              0.1632               4.63          <.0001
        Murders /100,000 (MURD)                       -0.2253              -2.95         0.0034
      Deaths
        Deaths rural /1000 (DTHRUR)                   0.0786               9.35          <.0001
  • 31. Table: The Relative Importance of Inputs Selected for the Regression Model, Interpreted by Using the Decision Tree
      Input                                            Relative Importance
      Deaths Alcohol Poisoning /100,000 (ALCPOIS)      1.000
      Hospital beds /10,000 (HOSBED)                   .7419
      Deaths rural /1000 (DRUR)                        .5251
      Murders /100,000 (SMURDER)                       .3277
      Murders Attempted /10,000 (AMURDER)              .2424
      % of Economical Active Pop (ECAPOP)              .2033
      V85_Business Income (BUSINC)                     .1793
  • 32. Summary and Conclusion • The main conclusion of the research is that official medical and social statistics are quite adequate for modeling. • Using several modeling techniques (multiple regression, decision trees, neural networks and partial least squares), we found and measured several causal factors that influence the level of alcoholism in Russian society. • The level of alcoholism is directly influenced by the geographical location of the region, the level of medical service development, people's money income, and the square footage of apartments per resident. • The higher the sales of legal vodka, the lower the level of alcoholism (?). One possible interpretation is that our ‘Vodka Consumption’ data include only legal consumption and not moonshine consumption, which dominates in the villages. • The parameter estimate for the population older than 60 is negative because a younger society contains more heavy drinkers. Only abstainers live to old age. • As the proportion of the economically active population increases, we see growing rates of alcoholism. • Deaths from alcohol poisoning, divorces, murders and attempted murders strongly depend on alcoholism.

Editor's Notes

  1. Hello everyone!
  2. In collaboration with Professor Carl Lee (Math. Dept., CMU), we developed three mathematical models of diseases from Russian data. They were subsequently published. I started collecting data in June while I was in Moscow, then continued in Mt. Pleasant, and we finished in February-March 2012.
  3. In collaboration with Professor Carl Lee (Math. Dept., CMU), we developed three mathematical models of diseases from Russian data. They were subsequently published. We collected healthcare data from open statistical sources and compiled them into a comprehensive research database. The database contained over 100 medical and social variables from 12 sources, collected in a MySQL database for further analysis.
  4. The data were obtained from the continuous statistics publications of the Russian State Statistics Committee (ROSSTAT). Official national statistics are published in compilations in electronic and paper form and on the ROSSTAT web site. Data are presented as absolute numbers and as standardized ratios. We developed a secondary research database, “Russian SEIPH” (Social Economics Interference and Public Health), in which over 130 variables from 6 yearbooks and two ROSSTAT databases were combined. A 10-year period (2000 to 2009) was selected for modeling. The unit of study is a region of Russia. Because the data were collected from different sources, we created a MySQL database with a data entry form based on PHP (Hypertext Preprocessor) and the ability to export data in SAS file format. We used the relational database model, in which every value of a variable is held in a new row of fixed columns.
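
The row-per-value layout described in this note can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module rather than the MySQL/PHP setup the note describes, and the table name, column names, region label and values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the research database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation (
        region   TEXT    NOT NULL,
        year     INTEGER NOT NULL,
        variable TEXT    NOT NULL,
        value    REAL,
        PRIMARY KEY (region, year, variable)
    )
    """
)

# Each value of each variable is stored as its own row with the same fixed columns.
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        ("Region A", 2000, "HOSBED", 110.0),
        ("Region A", 2000, "ALCOINC", 13.5),
        ("Region A", 2001, "HOSBED", 108.0),
    ],
)

# Pivot to one row per region-year, the shape a modeling tool usually expects.
wide_rows = conn.execute(
    """
    SELECT region, year,
           MAX(CASE WHEN variable = 'HOSBED'  THEN value END) AS hosbed,
           MAX(CASE WHEN variable = 'ALCOINC' THEN value END) AS alcoinc
    FROM observation
    GROUP BY region, year
    ORDER BY region, year
    """
).fetchall()
```

A layout like this lets new variables from additional yearbooks be added without changing the table schema; a value that was never entered simply appears as a missing cell after the pivot.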
  5. Super-regions: North Caucasian, Southern, Central, Northwestern, Volga, Urals, Siberian, Far Eastern
  6. We built 3 models. For each model, we excluded from the 130 variables those not related to the purpose of the study. The input variables selected are those that passed a preliminary statistical assessment and those suggested by the theoretical framework. For the alcoholism incidence model, 61 variables were chosen at the first stage as input variables (factors) for modeling. The next 5 slides summarize the input variables in this study. These inputs are classified into five different factor types:
  7. Based on expert judgment, we deliberately chose an excessive number of variables and then checked each for a pairwise relationship with the dependent variable.
  8. A total of 61 inputs were selected based on domain knowledge and the context related to alcoholism incidence. These inputs cover five categories of factor types, as described previously. The inputs vary greatly in their characteristics, including the measuring unit, the magnitude of the values, and the distributions. Before investigating the relationship between these inputs and the target, it is critical to explore the inputs and apply proper variable transformations for further variable reduction and, eventually, for building models to identify the impact factors.
  9. The normal Q-Q plot and box plot of alcoholism incidence for the entire data set are shown on the left side of Figure 1. This figure indicates that the distribution of alcoholism incidence is far from normal. Among the maximum-normality transformations, the square root transformation gives the distribution closest to normal, which is shown on the right side of Figure 1. Even so, the square root transformed alcoholism incidence does not follow the normal distribution well: the Shapiro-Wilk test statistic is 0.9924 (p-value < .05). However, the distribution of the transformed alcoholism incidence is approximately symmetric, with a mean of 3.79 and a median of 3.75. The extremely low alcoholism incidences occurred in the Ingushetia region and the extremely high alcoholism incidences occurred in the Magadan and Sakhalin regions.
  10. Table 2. The summary statistics of alcoholism incidence per 10,000 population in Russia by year. Table 2 summarizes the mean, median and standard deviation of the alcoholism incidence (in original scale) by year. The data for 2002 were not available. As these summaries show, the means are noticeably larger than the medians, a clear indication that the distribution of alcoholism incidence is skewed to the right. The corresponding box plots are given in the figures (left side: original scale; right side: SQRT transformed scale).
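
The mean-versus-median argument in this note can be made concrete with a small sketch. It uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, as a convenient summary; that choice, the function name and the seven sample values are assumptions for illustration and are not taken from the study.

```python
from statistics import mean, median, stdev


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sample stdev."""
    return 3 * (mean(values) - median(values)) / stdev(values)


# Hypothetical right-skewed sample (one very large value), not real incidence data.
incidence = [4.0, 9.0, 9.0, 16.0, 16.0, 25.0, 100.0]
sqrt_incidence = [v ** 0.5 for v in incidence]

skew_original = pearson_median_skewness(incidence)
skew_sqrt = pearson_median_skewness(sqrt_incidence)
```

On this sample the coefficient is positive on both scales but smaller after the square root, which mirrors the pattern the note describes: the transformation pulls in the long right tail without removing the skew entirely.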
  11. Some summary statistics of alcoholism incidence per 10,000 in the original and square root transformed scales by Federation District (from West to East).
  12. The summary statistics and the corresponding box plots for each of the eight federal districts (arranged from West to East of Russia) are presented in Table 3 and Figure 3. The highest alcoholism incidence is in the Far Eastern District, with an average incidence of 25.89, and the lowest is in the North Caucasian District, with an average of 6.66. The low incidence in the North Caucasian District may be connected with religion. The standard deviations also differ among districts, ranging from 2.24 (Urals District) to 15.83 (Far Eastern District). The square root transformed alcoholism incidence appears to fit the normal distribution better than the original scale, as shown in Figure 3.
  13. These steps are performed iteratively until we finalize the set of inputs that are ready for model building. In the following, we address the approach we took for each of these steps.
  14. We also explore the properties of each input by investigating its distribution and possible outliers. Many interval-scale input variables do not follow a normal distribution. The following shows the distributions of some input variables. Most of the input variables are skewed to the right and require some variable transformation.
  15. The examples shown above are interval-scale variables. For this type of variable, we determine the transformation that brings each variable closest to a normal distribution (maximizing normality). For class variables, we apply the rare-group collapsing method, grouping categories with less than .1% of cases into a new group. A great deal of time and effort was spent to ensure the quality and reliability of the data. The data set has few missing values; as a result, we did not need to drop any of the input units from our model. We employ the nearest neighbor technique to impute missing values, using the average of the two nearest years of data from the same region.
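
The imputation rule in this note (average of the two nearest years from the same region) can be sketched as below. The function name, the tie-breaking toward the earlier year, and the example values are assumptions; the note does not say how ties or regions with fewer than two observed years were handled.

```python
def impute_nearest_years(series):
    """Fill missing yearly values for one region with the mean of the two nearest observed years.

    series maps year -> value, with None marking a missing value.
    Only originally observed values are used as donors (assumption).
    """
    observed = {y: v for y, v in series.items() if v is not None}
    filled = dict(series)
    for year, value in series.items():
        if value is None:
            # Nearest observed years first; ties go to the earlier year (assumption).
            nearest = sorted(observed, key=lambda y: (abs(y - year), y))[:2]
            if nearest:
                filled[year] = sum(observed[y] for y in nearest) / len(nearest)
    return filled


# Hypothetical values for a single region and a single input variable.
region_series = {2001: 14.0, 2002: None, 2003: 18.0, 2004: 17.0}
filled = impute_nearest_years(region_series)
```

Restricting donors to the same region keeps regional level differences intact, which matters here because the districts differ widely in their typical incidence.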
  16. After completing the data transformation and missing data imputation, we conduct a simple preliminary correlation analysis and examine scatterplots to investigate (1) whether there is a high correlation between a given variable and the target and (2) whether there is any non-linear relationship between a given input and the target. Table 4 gives the Spearman’s correlation coefficients between alcoholism incidence and the 45 inputs whose correlation coefficients are significant at the .01% level. We do not report the correlation coefficients between alcoholism incidence and the transformed inputs because the final best model selected is based on the original scale of the input variables. The correlated pairs can represent one-way or two-way relationships. From them, we can hypothesize which factors affect the increase of alcoholism incidence in Russia.
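For readers unfamiliar with Spearman’s coefficient, the sketch below computes it as the Pearson correlation of the ranks, with tied values given their average rank. It is a plain illustration written for this summary (function names are assumptions), not the software used in the study, and it does not compute significance levels.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotone but non-linear relationship still gives a rank correlation of 1.
rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

Because it works on ranks, the coefficient picks up monotone non-linear relationships that a linear correlation would understate, which is why it suits skewed inputs such as those described above.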
  17. Prior to model building, we conduct a preliminary variable selection using the following strategy.
  18. Three types of predictive modeling techniques are applied and compared to select the ‘best’ model using several model selection criteria.
  19. This figure illustrates the modeling methodology used in this study. The target variable, Alcoholism Incidence per 10,000, is transformed using a square-root transformation so that the distribution of the transformed target is approximately normal. For the input variables, two different processes are performed. One uses the original scale of the inputs without transformation, and the other applies the maximum-normality transformation to the inputs. As Figure 4 shows, for each process we build a series of models using regression, decision tree, neural network, and partial least squares techniques. The best models chosen from each technique in each process are compared using the average squared error to select the final best model.
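The comparison step at the end of this workflow can be sketched as below. The function names, the candidate labels, and the numbers are illustrative assumptions; the sketch only shows the idea of scoring every candidate on the square-root scale of the target with held-out data and keeping the one with the smallest average squared error.

```python
import math


def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def select_best_model(validation_target, candidate_predictions):
    """Pick the candidate with the smallest validation average squared error.

    validation_target: incidence values on the original scale.
    candidate_predictions: dict of model name -> predictions on the
    square-root scale (hypothetical layout).
    """
    target_sqrt = [math.sqrt(v) for v in validation_target]
    scores = {
        name: average_squared_error(target_sqrt, preds)
        for name, preds in candidate_predictions.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Made-up illustrative values, not the study results.
best, scores = select_best_model(
    [4.0, 9.0, 16.0],
    {"regression": [2.0, 3.0, 4.5], "decision tree": [3.0, 3.0, 3.0]},
)
```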
  20. For the decision tree model, because the target is on an interval scale, at the splitting stage we apply the criterion of maximizing the F-statistic from an analysis of variance, with the Kass adjustment to prevent an unexpectedly high type I error during the splitting process, based on the training data set. The pruning stage is performed using the assessment criterion of minimizing the average squared error based on the validation data set. Missing data are treated as a separate category and are used in model construction. The importance of the input variables is determined using the logworth measure (-log10(p-value)).
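A simplified sketch of the splitting criterion follows. It searches the binary splits of one interval input for the largest one-way ANOVA F-statistic and shows how a logworth could be computed from a p-value. Two assumptions are made here and are not taken from the study: the function names are hypothetical, and the Kass adjustment is approximated by a plain Bonferroni-style multiplier equal to the number of candidate splits. Computing the p-value itself requires the F distribution and is left out.

```python
import math


def f_statistic(left, right):
    """One-way ANOVA F-statistic for a two-group split of an interval target."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    mean_l, mean_r = sum(left) / n_l, sum(right) / n_r
    grand = (sum(left) + sum(right)) / n
    ss_between = n_l * (mean_l - grand) ** 2 + n_r * (mean_r - grand) ** 2
    ss_within = (sum((v - mean_l) ** 2 for v in left)
                 + sum((v - mean_r) ** 2 for v in right))
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return ss_between / (ss_within / (n - 2))  # between-group df is 1


def best_split(x, y):
    """Return (threshold, F) for the binary split of x that maximizes F."""
    distinct = sorted(set(x))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for v, t in zip(x, y) if v < threshold]
        right = [t for v, t in zip(x, y) if v >= threshold]
        f = f_statistic(left, right)
        if best is None or f > best[1]:
            best = (threshold, f)
    return best


def adjusted_logworth(p_value, n_candidate_splits):
    """-log10 of the p-value after a Bonferroni-style multiplier (assumption)."""
    adjusted = min(1.0, p_value * n_candidate_splits)
    return math.inf if adjusted == 0 else -math.log10(adjusted)


# Made-up illustrative values, not the study data.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
threshold, f_value = best_split(x, y)
```

The multiplier lowers the logworth of inputs that offer many candidate splits, so they are not favored simply for having more chances to produce a small p-value.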
  21. The inputs passing the first screen are subject to a forward selection regression procedure using a p-value of .0005. The inputs passing the forward selection are then used as the inputs for regression modeling. In the model building stage, a stepwise selection procedure that minimizes the average squared error on the training data is applied to build the models. At each step of the stepwise regression procedure, a best regression model is obtained. The procedure stops when no input can be added or dropped at p-value = .05. Each of the best regression models from the steps of the stepwise regression is then validated by applying it to the validation data and computing the average squared error. The final best model is the one with the smallest average squared error on the validation data. This is done to prevent ‘overfitting’ of a model selected using the stepwise procedure alone.
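The idea of validating every model along the selection path can be sketched as below. This is a deliberately simplified stand-in, not the procedure used in the study: it adds inputs greedily by training average squared error instead of using p-value entry and removal thresholds, it never drops an input, and it assumes the inputs are not collinear. All names and data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y, cols):
    """Least-squares coefficients (intercept first) for the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def ase(rows, y, cols, beta):
    """Average squared error of the fitted model on the given data."""
    preds = [beta[0] + sum(b * row[c] for b, c in zip(beta[1:], cols)) for row in rows]
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


def forward_path_with_validation(train_x, train_y, valid_x, valid_y):
    """Build a forward path on training data; return the step with the lowest validation ASE."""
    remaining = list(range(len(train_x[0])))
    chosen, path = [], []
    while remaining:
        def train_ase(c):
            cols = chosen + [c]
            return ase(train_x, train_y, cols, fit_ols(train_x, train_y, cols))
        best_col = min(remaining, key=train_ase)
        chosen = chosen + [best_col]
        remaining.remove(best_col)
        beta = fit_ols(train_x, train_y, chosen)
        path.append((list(chosen), ase(valid_x, valid_y, chosen, beta)))
    return min(path, key=lambda step: step[1])


# Made-up data: column 0 carries the signal; column 1 only fits training noise.
train_x = [(0, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
train_y = [0.3, 1.7, 4.3, 5.7, 8.3]
valid_x = [(1, 1), (2, 0), (3, 1), (5, 0)]
valid_y = [2.0, 4.0, 6.0, 10.0]
best_cols, best_valid_ase = forward_path_with_validation(train_x, train_y, valid_x, valid_y)
```

In this toy example the two-input model fits the training data perfectly but does worse on the validation data, so the one-input model is kept. That is the overfitting safeguard the slide describes.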
  22. It has been shown that a neural network is a universal approximator for any function (Ripley [12]). A neural network model can become quite complex as the number of hidden layers and the number of hidden units increase. With a large number of inputs, the complexity is compounded and the number of weights to be estimated grows rapidly. Therefore, it is essential to conduct an appropriate input variable selection before neural network modeling. The preliminary screening employed for regression modeling often still leaves too many inputs for neural network modeling, since the neural network technique, unlike regression modeling, cannot select input variables while the network is being built. Therefore, prior to building the neural network model, a decision tree technique is applied to conduct the variable selection.
  23. The final best model is selected by comparing the average squared error on the validation data set. Table 5 summarizes the average squared error for each modeling technique. Three models, namely regression with the original inputs, regression with the transformed inputs, and the decision tree, are very comparable. Since regression models are easier to interpret, we choose the regression models as our final choice.
  24. In the table, we compare the two regression models using various model-fitting criteria. Even though the regression model based on the transformed inputs has a lower average squared error (shown previously), the selection criteria in the table suggest that the regression model based on the original scale of the inputs is a better choice for the following reasons: (1) with two fewer inputs, its R2 is only .0044 lower, and (2) while the AIC and SBC are comparable, the model using the original inputs performs better based on SBC and Cp. Therefore, we choose the regression model based on the original inputs as our final ‘best’ model.
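For reference, the criteria named above can be computed from a model's error sum of squares as sketched below. The formulas are the common least-squares forms (AIC = n ln(SSE/n) + 2p, SBC = n ln(SSE/n) + p ln(n), Cp = SSE/s² + 2p - n); that the study's software used exactly these forms is an assumption, and the function name and numbers are illustrative only.

```python
import math


def fit_criteria(sse, sst, n, p, full_model_mse):
    """Common least-squares model-fitting criteria.

    sse: error sum of squares of the candidate model
    sst: total (corrected) sum of squares of the target
    n: number of observations
    p: number of estimated parameters, including the intercept
    full_model_mse: error variance estimate from the largest candidate model
    """
    return {
        "R2": 1 - sse / sst,
        "AIC": n * math.log(sse / n) + 2 * p,
        "SBC": n * math.log(sse / n) + p * math.log(n),
        "Cp": sse / full_model_mse + 2 * p - n,
    }


# Made-up illustrative values, not the study results.
criteria = fit_criteria(sse=100.0, sst=400.0, n=100, p=5, full_model_mse=1.0)
```

SBC penalizes each extra parameter by ln(n) rather than 2, so with many observations it favors the model with fewer inputs more strongly than AIC does, which is consistent with the preference for the smaller model described above.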
  25. Table 7 gives a further comparison between the two regression models and shows that eleven common inputs are selected in both models.
  26. The parameter estimates, along with the t-statistics and p-values, for the final ‘best’ regression model are summarized in Table 8. The parameter estimate for the input ‘Vodka Consumption’ is negative, which suggests that the ‘pure’ association between alcoholism incidence and vodka consumption is negative. This seems counterintuitive. A quick check shows that Alcoholism Incidence and Vodka Consumption have a weak positive correlation of .1735. There could be many different interpretations of this association. One possible interpretation is that the data we have on ‘Vodka Consumption’ include only legal consumption. It is known that a great deal of ‘unreported’, ‘illegal’ moonshine is available across the nation, especially in rural areas. The parameter estimate for the input ‘% of Population older than 60 years old’ is negative because a younger society contains more heavy drinkers. A similar explanation applies to the variable ‘Murders’: drunk people, acting in a drunken emotional state, attempt to kill someone unsuccessfully, whereas sober people who prepare in advance, with a “cold nose”, are more successful in murdering.
  27. In the following, we apply the decision tree technique to interpret the regression model obtained above. The following table shows the relative importance of the input variables resulting from the decision tree. The decision tree that interprets the regression model is given. Based on the decision tree, the lowest alcoholism incidence occurs when the number of deaths from alcohol poisoning per 100,000 is less than 0.01283 (n = 35), and the predicted alcoholism incidence is 5.224 per 10,000. The highest alcoholism incidence occurs when the number of hospital beds is greater than 146.2 per 10,000 and the number of deaths due to alcohol poisoning is greater than 0.01283 per 100,000 (n = 27). The predicted alcoholism incidence is 28.63. All significant factors associated with alcoholism incidence can be classified into two groups: factors associated with the incidence of alcoholism directly and factors associated with it indirectly. The indicator of the level of alcoholism incidence is associated with other diseases and social problems.
  28. In collaboration with Professor Carl Lee (Math. Dept., CMU), we developed three mathematical models of diseases using Russian data. They were subsequently published. We collected healthcare data from open statistical sources and compiled them into a comprehensive research database. The database contained over 100 medical and social variables from 12 sources, collected in a MySQL database for further analysis.