Sigma Xi presentation

A Novel Analysis and Predictive Risk Model of Chronic
Diseases: A Markov Chain Monte Carlo Simulated Study on
the Rising Incidence of Type II Diabetes Mellitus in the
Youth
By Meghna Narayan

Objective and Motivation
Objective: To study the longevity, mortality, morbidity and demographic risks of Type II
Diabetes Mellitus (T2DM) among 15-20 year olds using a compartmental, statistical,
Markov Chain Monte Carlo simulation-based model, in order to understand the impact on the
dependency ratio of the United States.
Motivation: Type II Diabetes is usually associated with older age; however, there is a
growing awareness of its increase among youth in the United States. This study builds a
multi-state, Monte Carlo simulated compartmental model to understand the dynamics of chronic
diseases such as Type II diabetes mellitus and its increasing prevalence in the 15-20 year old
age group in the United States by scoring the risk factors associated with the disease. The
study aims to provide a reasonable argument for reformulating the dependency ratio (the ratio
of dependent people to the population of working age) to adjust for morbidity-related labor
impairment. The approach modeled here is a cost-effective way to define the role of multiple
risk factors as well as the temporal progression of this disease.

Methods and Expected Findings
Methods: Aggregate data from several studies and databases such as Policy Map, CDC’s
SEARCH, NHANES, ADA and NIH were used to create a composite view of the prevalence
and growth of T2DM in young adults in the U.S. Data pertaining to morbidity were extracted
and used in a Markov Chain Monte Carlo simulation to fit a Bayesian model that estimates the
covariates contributing to the increasing prevalence of T2DM in recent years, within
the context of a three-state compartmental model.
Expected findings: Studies conducted across the U.S. and globally by different
organizations document the increasing prevalence of T2DM among youth. An extensive
literature review suggests that the rate of increase is most prominent in the 15-20 age group
among minority race/ethnicity sub-cohorts (i.e. African American, Hispanic, Native American).
The model will help identify the most prominent risk factor contributing to this increasing
prevalence.

Anticipated Conclusion
Anticipated conclusion: The study will correlate T2DM with
its underlying risk factors and project the impact of
morbidity on the dependency ratio in the coming
decades.

Hypothesis and Research Problem
Hypothesis:
A dependency ratio measures the number of people who are either too young or too old to work, compared to
the number of people of working age. In economics, the dependency ratio is an age-population ratio
of those typically not in the labor force (the dependent part) to those typically in the labor force (the
productive part). It is used to measure the pressure on the productive population. [18]
Unfortunately, current calculations of the dependency ratio do not take into account the morbidity
of the people eligible to work. Moreover, despite an aging population, those under the age of 16
will continue to constitute the largest "dependent" group in future years. Persons between the ages of 15
and 60 are termed “prime-age” adults because they are in their most productive working years. During these
years, most can contribute to labor and income in rural households. When these “prime-age” adults become
chronically ill, they change from productive members to members requiring care and medicines, no longer
working in the fields or elsewhere. The effects of lost labor and skills through illness or death can be
even more profound.
Research Problem: This project focuses on finding a model to predict the trends of type 2 diabetes
in youth, and on finding a method to reformulate the dependency ratio.
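
To make the idea of a morbidity adjustment concrete, the following is a minimal sketch and not the reformulation developed in the study. It computes the conventional dependency ratio and one hypothetical adjusted version in which a fraction of working-age adults who are chronically ill is moved from the productive group to the dependent group. The function names, the `impaired_fraction` parameter and the population counts are all illustrative assumptions.

```python
def dependency_ratio(young, old, working_age):
    """Conventional dependency ratio: dependents per person of working age."""
    return (young + old) / working_age


def morbidity_adjusted_ratio(young, old, working_age, impaired_fraction):
    """Hypothetical adjustment: chronically ill working-age adults are moved
    from the productive group (denominator) to the dependent group (numerator)."""
    if not 0.0 <= impaired_fraction < 1.0:
        raise ValueError("impaired_fraction must be in [0, 1)")
    impaired = working_age * impaired_fraction
    return (young + old + impaired) / (working_age - impaired)


# Illustrative population counts (made up, not taken from the study data).
young, old, working_age = 60.0, 45.0, 200.0
baseline = dependency_ratio(young, old, working_age)
adjusted = morbidity_adjusted_ratio(young, old, working_age, impaired_fraction=0.05)
print(f"baseline: {baseline:.3f}, morbidity-adjusted: {adjusted:.3f}")
```

Under this assumed form, any non-zero impaired fraction raises the ratio, because the same people are added to the numerator and removed from the denominator.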

Background Information
Studying the risk factors behind a high T2DM prevalence rate is important because
the cost burden, labor impairments, and overall effects of increased morbidity and mortality will have
significant socioeconomic consequences for the U.S. population. It is therefore crucial to make accurate
assessments and predictions about the distribution of T2DM in order to facilitate effective interventions.
In addition, this paper focuses on identifying, studying and validating the various factors leading to
T2DM and its long-term effect on the U.S. economy.
Diseases can be classified as communicable or non-communicable. Epidemiologists and statisticians
work towards identifying diseases, their risk factors and distributions that shape subsequent interventions
to inhibit the spread of both acute and chronic illnesses.
The lack of accurate data on the basis of which decisions and plans can be made calls for mathematical
models and simulations that predict the pattern and frequency of an increasing trend of a non-
communicable disease. Various disease trend models have been developed to study epidemic outbreaks in
a population.
Mathematical models based on compartmental models of epidemiology help in predicting the transition
from one state to another in homogeneous populations by means of stochastic or deterministic
calculations. To model the spread of a disease in a population, it is important to
consider the non-homogeneity and spatial distribution of the population of a region, and to identify the
risk groups for the disease among the various demographics and the social behavior of the participating
demographic groups (for T2DM: individuals who are overweight, have low levels of physical activity or poor
eating habits, or have a family history of T2DM; people belonging to certain race/ethnic groups; and people
living in poor economic conditions).

Literature Review
The findings of an extensive literature review on the incidence and prevalence of Type II diabetes among
15-20 year olds can be used in traditional models such as logistic regression with predictive covariates and
relative risks; however, Markov Chain Monte Carlo (MCMC) simulations can allow for greater
accuracy in quantifying the future trends of Type II diabetes in this age group.
The TODAY trial on diabetes in youth identifies obesity as one of the primary causes of T2DM. The
National Institutes of Health showed that obesity among 15-20 year olds increased from 1971 to 2006,
and that obese youth in this age group have a higher chance of acquiring T2DM quickly. Childhood
obesity has more than doubled in younger children and tripled in adolescents in the past 30 years.
The National Diabetes Statistics in 2011 stated that from 2002 to 2005, 3,600 youth were newly
diagnosed with T2DM annually. The estimated costs for medical care were about $116 billion. T2DM
has been increasing gradually since 1971 and has become a major public health problem.
The National Diabetes Statistics in 2011 found that about 215,000 people younger than 20 years had
some form of diabetes in the United States in 2010. [15] The study reported that children from certain ethnic
backgrounds had a higher prevalence. On average, the prevalence rates were 87 percent higher
for Mexican Americans, 94 percent higher for Puerto Ricans, 18 percent higher for Asian Americans, 66
percent higher for Hispanics/Latinos, and 77 percent higher for blacks.

Significance
Significance:
The goal of this study was to simulate a predictive model for understanding
the disease dynamics by risk factor and exposure. The Monte Carlo method
provides two major advantages: it is preferred when the raw data are not available,
and it accounts for indirect effects due to mediators among the covariates.

Methods
A predictive model of risk progression for T2DM was developed using logistic regression and
Bayesian inference by the Markov chain Monte Carlo method. One-year cycles were used for disease
progression in this model. The primary end points for progression were transitions from “no diabetes” to
“pre-diabetes” and to “diabetes”. The three-state model partitions the sample dataset into “no diabetes”,
“pre-diabetes” and diagnosed “diabetes” compartments. Independent covariates included:
Ethnicity
Gender
Obesity
Age
The model can be extended further to include:
Sedentary behavior
Family history of T2DM
Consuming high density, low-nutrient food and drinks
Income level
BPA levels in urine
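
The three-state structure with one-year cycles described above can be sketched as a simple Monte Carlo simulation. This is an illustrative sketch only: the transition probabilities below are hypothetical placeholders, not values estimated in this study, and the sketch ignores the covariates. It also assumes that “diabetes” is absorbing and that “pre-diabetes” can revert to “no diabetes”, neither of which is stated in the study.

```python
import random

STATES = ("no diabetes", "pre-diabetes", "diabetes")

# Hypothetical annual transition probabilities (each row sums to 1).
# "diabetes" is treated as an absorbing state in this sketch.
TRANSITIONS = {
    "no diabetes": (0.95, 0.04, 0.01),
    "pre-diabetes": (0.10, 0.80, 0.10),
    "diabetes": (0.00, 0.00, 1.00),
}


def simulate_person(years, rng):
    """Follow one person through `years` one-year cycles, starting healthy."""
    state = "no diabetes"
    for _ in range(years):
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
    return state


def simulate_cohort(n_people, years, seed=1):
    """Return the share of the cohort in each state after `years` cycles."""
    rng = random.Random(seed)
    counts = dict.fromkeys(STATES, 0)
    for _ in range(n_people):
        counts[simulate_person(years, rng)] += 1
    return {state: count / n_people for state, count in counts.items()}


# A cohort followed from age 15 to age 20 (five one-year cycles).
shares = simulate_cohort(n_people=5000, years=5)
print(shares)
```

In a fuller model, each row of the transition table would depend on the covariates listed above (for example through a regression on ethnicity, gender, obesity and age) rather than being fixed.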

Cox Logistic Regression for Survival Analysis
A simple model for survival data from a chronic disease such as T2DM
is the two-state model with one transient state ‘0: alive’
and one absorbing state ‘1: dead’. In general, an absorbing state is a state
from which no further transitions can occur, while a transient state is a state
that is not absorbing. In its simplest form, the observation for a given
individual consists of a random variable, say T, representing the time
from a given origin (time 0) to the occurrence of the event ‘death’. The
distribution of T may be characterized by the cumulative distribution function
F(t) = Prob(T <= t) or, equivalently, by the survival function S(t) =
1 - F(t) = Prob(T > t). S(t) and F(t) correspond to the probabilities of being
in state 0 or state 1, respectively, at time t. If every individual is
assumed to be in state 0 at time 0, then F(t) is also the transition probability
from state 0 to state 1 for the time interval from 0 to t.
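
The relationship S(t) = 1 - F(t) can be illustrated by estimating S(t) from follow-up data. The sketch below uses the Kaplan-Meier estimator, a standard nonparametric estimator that handles censored observations; it is shown for illustration and is not the Cox model fitted in this study. The follow-up times and event indicators are made up.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times  : follow-up time for each individual
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all individuals who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored individuals leave the risk set.
        at_risk -= leaving
    return curve


# Made-up follow-up times in years; 0 marks a censored observation.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}, F(t)={1 - s:.3f}")
```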

Bayesian Inference
Bayes' theorem provides a mathematical method for calculating, given
occurrences in prior trials, the probability of a target occurrence in future trials. According to
Bayesian logic, the only way to quantify a situation with an uncertain outcome is through
determining its probability. Let y be a set of covariate observations (scalar or vector) at
discrete time points, and let θ represent a vector of free parameters. The goal is to find the set
of parameters that best fits the data and to evaluate how good the model is. Bayesian
statistical conclusions about a parameter θ, or unobserved data ŷ, are made in terms of
probability statements. These probability statements are conditional on the observed value
of y and are written simply as p(θ|y) or p(ŷ|y). By Bayes' theorem, the posterior p(θ|y) is
proportional to p(y|θ) p(θ), the likelihood times the prior. A natural way to reach the goal is
Bayesian inference and model comparison, which can be computed using Markov
Chain Monte Carlo (MCMC). MCMC can also be used simply to obtain the
parameters that give the best fit according to some criterion.
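
As a small worked illustration of a posterior p(θ|y), the sketch below performs a conjugate Beta-binomial update. It is an assumption-laden toy example, not the regression model used in this study: it treats the 117 events among 326 observations reported in the Results as binomial data, takes θ to be a single event probability, and uses a flat Beta(1, 1) prior. The function names are illustrative.

```python
def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return a, b


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# 117 events among 326 observations, flat Beta(1, 1) prior (assumed).
a, b = beta_posterior(successes=117, trials=326)
posterior_mean = beta_mean(a, b)
print(f"posterior: Beta({a:.0f}, {b:.0f}), mean = {posterior_mean:.4f}")
```

In this special case the posterior is available in closed form. MCMC becomes necessary when, as in the regression model with several coefficients, no such closed form exists.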

Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) simulation is a general method based on drawing
values of θ from approximate distributions and then correcting those draws to better
approximate the target posterior distribution, p(θ|y). Each draw depends on the last value
drawn; hence the draws form a Markov chain. The key to the success of the method is that
the approximate distributions are improved at each step in the simulation, in the sense of
converging to the target distribution. In MCMC, several independent sequences of simulation
draws are created; each sequence, θ_t, t = 1, 2, 3, ..., is produced by starting at some point θ_0
and then, for each t, drawing θ_t from a transition distribution, T_t(θ_t | θ_{t-1}), that depends on
the previous draw, θ_{t-1}. It is often convenient to allow the transition distribution to depend on
the iteration number t; hence the notation T_t. The transition probability distributions must be
constructed so that the Markov chain converges to a unique stationary distribution that is the
posterior distribution, p(θ|y). MCMC is used when it is not feasible to sample θ directly from
p(θ|y). The simulation is run long enough that the distribution of the current draws is close
enough to this stationary distribution. The model in this study uses the adaptive rejection
Metropolis sampling (ARMS) algorithm to draw the Gibbs samples.
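
The idea of drawing each θ_t from a transition distribution that depends only on θ_{t-1} can be sketched with a random-walk Metropolis sampler. This is a deliberately simpler algorithm than the ARMS-within-Gibbs sampler used in this study, and the target is a toy one-parameter posterior (an event probability θ under a flat prior with 117 events among 326 observations), not the four-coefficient regression posterior. The burn-in of 2000 and the sample size of 20000 mirror the settings reported in the Results; the step size and starting value are arbitrary assumptions.

```python
import math
import random


def log_posterior(theta, events=117, total=326):
    """Log posterior (up to a constant) for an event probability theta
    under a flat prior and a binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return events * math.log(theta) + (total - events) * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples, burn_in, step, seed=1):
    """Random-walk Metropolis: propose theta' ~ Normal(theta, step) and
    accept with probability min(1, target(theta') / target(theta))."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    draws = []
    for i in range(burn_in + n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            draws.append(current)  # keep only post-burn-in draws
    return draws


draws = metropolis(log_posterior, start=0.5, n_samples=20000, burn_in=2000, step=0.05)
posterior_mean = sum(draws) / len(draws)
print(f"kept {len(draws)} draws, posterior mean ~ {posterior_mean:.3f}")
```

For this toy target the exact posterior is Beta(118, 210), so the sample mean can be checked against 118/328. For the regression model no such check is available, which is why convergence diagnostics matter.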

Results
Table 1:
Table 2:

Model Information
Data Set              WORK.DIABETES
Dependent Variable    Time          Survival Time
Censoring Variable    T2DMStatus    0=No Diabetes 1=Pre-diabetes 2=Diabetes
Censoring Value(s)    0 1
Model                 Cox
Ties Handling         DISCRETE
Sampling Algorithm    ARMS
Burn-In Size          2000
MC Sample Size        20000
Thinning              1

Table 3: Table 5:

Summary of the Number of Event and Censored Values
Total    Event    Censored    Percent Censored
326      117      209         64.11

Table 4:
Table 6:
Table 7: Bayesian Analysis from the program
Table 8:

Number of Observations Read    326
Number of Observations Used    326

Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    95% Confidence Limits
BMI          1      0.0515     0.0183             0.0157    0.0873
Sex          1     -0.4075     0.2353            -0.8687    0.0536
Race         1      0.3690     0.0741             0.2237    0.5143
Age          1      0.0922     0.0575            -0.0206    0.2049

Uniform Prior for Regression Coefficients
Parameter    Prior
BMI          Constant
Sex          Constant
Race         Constant
Age          Constant

Initial Values of the Chain
Chain    Seed    BMI       Sex        Race      Age
1        1       0.0515    -0.4075    0.3690    0.0922

Posterior Summaries
Parameter    N        Mean       Standard Deviation    25%        50%        75%
BMI          20000     0.0524    0.0184                 0.0400     0.0522     0.0644
Sex          20000    -0.4122    0.2360                -0.5700    -0.4115    -0.2511
Race         20000     0.3685    0.0744                 0.3196     0.3694     0.4191
Age          20000     0.0942    0.0575                 0.0550     0.0939     0.1326

Posterior Intervals
Parameter    Alpha    Equal-Tail Interval    HPD Interval
BMI          0.050     0.0168    0.0893       0.0167    0.0891
Sex          0.050    -0.8832    0.0430      -0.8732    0.0514
Race         0.050     0.2197    0.5118       0.2260    0.5177
Age          0.050    -0.0163    0.2078      -0.0170    0.2068

Posterior Correlation Matrix
Parameter    BMI       Sex       Race      Age
BMI          1.0000    0.2060    0.1

Results
Regression Models Selected by Score Criterion
Number of Variables    Score Chi-Square    Variables Included in Model
1                      25.5541             Race
1                      13.9674             Age
1                       7.2832             BMI
2                      39.4908             Sex Race
2                      37.9123             BMI Race
2                      32.2592             Race Age
3                      44.6657             BMI Sex Race
3                      43.6418             BMI Race Age
3                      41.3759             Sex Race Age
4                      47.1469             BMI Sex Race Age

Discussion
The Cox logistic regression yielded maximum likelihood estimates
for the four independent variables. The results for Age and Sex
were disregarded because their 95% confidence limits included 0. Of
the remaining variables, Race was the strongest predictor (MLE 0.3690,
SE 0.0741, 95% CL 0.2237-0.5143), followed by BMI (MLE 0.0515, SE
0.0183, 95% CL 0.0157-0.0873). Correlations between the independent
variables are shown in the Bayesian estimation figures. The two covariates
that show the highest correlation are Age and Sex (0.3652), followed by
BMI and Sex (0.2171) and Race and BMI (0.1738). Comparing these inter-variable
correlations with each variable's association with the outcome would be
valuable in determining the presence of confounding. The posterior
summary generated by the Bayesian MCMC also shows that Race was the
dominant covariate in determining the outcome variable. This may be due to
cultural factors (socioeconomic status, diet, etc.) or genetic predispositions, and
would be valuable to explore further in subsequent analysis.
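
The screening rule used above (disregard a covariate when its 95% confidence limits include 0) and the size of each effect can be checked with a few lines of arithmetic on the reported maximum likelihood estimates. The sketch below exponentiates each coefficient, which under the usual Cox-type interpretation gives a multiplicative ratio per one-unit increase in the covariate. That interpretation, and the coding of each covariate, are assumptions here, since the deck does not state them.

```python
import math

# (estimate, lower 95% limit, upper 95% limit) from the maximum likelihood table.
estimates = {
    "BMI": (0.0515, 0.0157, 0.0873),
    "Sex": (-0.4075, -0.8687, 0.0536),
    "Race": (0.3690, 0.2237, 0.5143),
    "Age": (0.0922, -0.0206, 0.2049),
}

summary = {}
for name, (beta, low, high) in estimates.items():
    summary[name] = {
        "ratio": math.exp(beta),  # multiplicative effect per one-unit increase
        "interval_excludes_zero": low > 0 or high < 0,
    }

for name, row in summary.items():
    print(f"{name:5s} exp(beta) = {row['ratio']:.3f}  "
          f"interval excludes 0: {row['interval_excludes_zero']}")
```

Because each ratio is per one unit of its own covariate, the coefficients of covariates measured on different scales (for example BMI units versus race categories) are not directly comparable in magnitude.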

Conclusions
The alarming incidence of Type II Diabetes (T2DM) in both children and
young adults requires immediate intervention to minimize morbidity
among those who have been diagnosed and to prevent future cases from
occurring in this demographic. If the current trend continues, there will be a
significantly greater prevalence of cardiovascular disease, peripheral
neuropathy, infection, and ultimately disability, which will impose a tremendous
burden on the U.S. healthcare system. The dependency ratio helps to
describe the proportion of the population that is economically dependent.
Although we do not have longitudinal data on the morbidity and mortality of
T2DM among children and young adults, data from studies conducted on the
adult population suggest that the dependency ratio will increase as a result
of this trend.

Conclusions(cont'd)
In order to address this issue, it is important to provide robust predictive models to the
important stakeholders from whom we can acquire resources, and to ensure the most effective
allocation of those resources. The results of our analysis identified Race/Ethnicity as the most
significant factor affecting the outcome of T2DM. Race may represent genetic
predisposition, environmental factors, or both. If we assume that our results are externally valid,
they suggest that the most effective interventions would be targeted towards high-risk groups
based on race/ethnicity. Further studies should be carried out to help identify contributors to
this risk category. Applying this analytic method to valid raw data will yield valuable evidence that can
better define the primary determinants of disease specific to this population. Additionally, the
results of this method decrease the margin of error and measures of variation that characterize
traditional predictive modeling. This research has shown that, using Bayesian inference with a
Markov Chain Monte Carlo simulation, chronic disease progression and its
associated risk factors can be studied as follows:
1. Estimate missing data: The Bayesian models were able to estimate prevalence rates for
diseases and risk factors with limited data input, allowing estimates to be made even for
lesser-studied diseases and risk factors.
2. Incorporate any available prior information: Where no prior information was available,
non-informative or flat priors could be used.
3. Include additional predictors: The models are open, so new predictors are easy to add.

References
[1] Harris MI: Prevalence of noninsulin-dependent diabetes and impaired glucose tolerance. Chapter VI in: Diabetes in America, Harris MI, Hamman RF, eds. NIH publ. no. 85-1468, 1985
[2] Dept of Health and Human Services, C. f. (2011). http://www.cdc.gov/diabetes/pubs/pdf/search.pdf
[3] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2762509/
[4] National Health Interview Survey (NHIS, available at http://www.cdc.gov/nchs/nhis.htm) of the National Center for Health Statistics (NCHS)
[5] Bloom, D. C.-L.-G. (2011). The Global Economic Burden of Noncommunicable Diseases. Geneva: World Economic Forum.
[6] Search for Diabetes in Youth http://www.cdc.gov/diabetes/pubs/pdf/search.pdf
[7] Presented at the 72nd American Diabetes Scientific Sessions, June 9, 2012, ADA: Both Type 1 and T2DM Rates Increase Significantly among American Youth
[8] Constantino, Maria I. "Long-Term Complications and Mortality in Young-Onset Diabetes: Type 2 Diabetes Is More Hazardous and Lethal than Type 1 Diabetes." Diabetes Care. Diabetes Centre, 11 July 2013. Web. <http://www.ncbi.nlm.nih.gov/pubmed/?term=Long-Term+Complications+and+Mortality+in+Young-Onset+Diabetes>
[9] SEARCH for Diabetes in Youth http://www.cdc.gov/diabetes/pubs/pdf/search.pdf
[10] Obesity and T2DM in children and youth (Kaufman, 2006)
[11] T2DM in youth: rates, antecedents, treatment, problems and prevention, Editorial Pediatric Diabetes 2007, Sept 9 (4-6)
[12] TODAY trial TREATMENT PROTOCOL https://portal.bsc.gwu.edu/documents/11448/d69340ae-3443-4bd6-81f4-1d8d5f2ae369
[13] Ogden CL, Carroll MD, Kit BK, Flegal KM. Prevalence of obesity and trends in body mass index among US children and adolescents, 1999-2010. Journal of the American Medical Association 2012;307(5):483-490
[14] National Center for Health Statistics. Health, United States, 2011: With Special Features on Socioeconomic Status and Health. Hyattsville, MD; U.S. Department of Health and Human Services; 2012.
[15] National Institutes of Health, National Heart, Lung, and Blood Institute. Disease and Conditions Index: What Are Overweight and Obesity? Bethesda, MD: National Institutes of Health; 2010.
[16] Krebs NF, Himes JH, Jacobson D, Nicklas TA, Guilday P, Styne D. Assessment of child and adolescent overweight and obesity. Pediatrics 2007;120:S193–S228
[17] Daniels SR, Arnett DK, Eckel RH, et al. Overweight in children and adolescents: pathophysiology, consequences, prevention, and treatment. Circulation 2005;111;1999–2002.
[18] William H. Crown, Some Thoughts on Reformulating the Dependency Ratio, Gerontological Society of America, 1995

Acknowledgments
I would like to acknowledge Dr. Vladimir
Shapovalov for his work in guiding me as his
research student and Lata Ganesh from the World
Bank Organization in supervising me with my
research and providing me with the access to
texts, information and consultation to finalize this
work, as well as David Mordecai, Samantha
Kappagoda and Daniel Stein from New York
University for assisting me with the mathematical
concepts.
