Ameliorating Statistical Methodologies as Genomic Data Burgeon:
Refined Proportional Odds Model with Application to New Dravet Dataset
Ivan Rodriguez and Joseph C. Watkins, Ph.D.
The University of Arizona, UROC-PREP/STAR Program
This investigation was facilitated tremendously by the
generous guidance and support from Miao Zhang along
with Joseph Watkins, Jin Zhou, and the following members
of the Hammer Lab: Michael Hammer, Laurel Johnstone,
and Brian Hallmark. In addition, Andrew Huerta and Reneé
Reynolds were instrumental in providing meaningful
assistance and valuable criticism throughout the entirety of
the investigation.
Introduction
Motivation: approximately 150,000 newborns are
diagnosed with a genetic disease each year
(Nussbaum, McInnes, & Willard, 2001).
Purpose: better link genomic data to diagnosis by
refining McCullagh’s (1980) celebrated proportional
odds model (POM), then apply the refined model to
a new and exclusive Dravet syndrome patient dataset.
Methods
Ordinal categorical data analysis: technique
employed when the response has a natural
ordering of its categories (e.g., disease
severity). A central complication in this setting
is that assigning numeric values to the ordinal
categories is arbitrary and ignores the unequal
spacing between them. A naïve solution is to
dichotomize the ordinal trait; this, however,
introduces another layer of arbitrariness,
discards data, and consequently decreases
statistical power.
Proportional odds model (POM): gold standard for
modeling ordinal variables (Bender & Grouven, 1998).
The POM extends binary logistic regression (Cox,
1958) and has applications in survey research,
food testing, industrial quality assurance, radiology,
and clinical research (McCullagh, 1999).
The POM assumption of parallel logit surfaces is
often violated in practice. This investigation
proposes adjustments to the formulation of the
latent variable and to the null hypothesis.
References
Bender, R., & Grouven, U. (1998). Using binary logistic regression models for ordinal data with non-proportional odds. Journal of Clinical Epidemiology, 51(10), 809–816. doi:10.1016/S0895-4356(98)00066-3
Cox, D. R. (1958). The regression analysis of binary sequences (with discussion). Journal of the Royal Statistical Society, Series B, 20, 215–242.
Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J., Nickerson, D. A., NHLBI GO Exome Sequencing Project, … Lin, X. (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. The American Journal of Human Genetics, 91(2), 224–237. doi:10.1016/j.ajhg.2012.06.007
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42(2), 109–142.
McCullagh, P. (1999). The proportional odds model. In P. Armitage (Ed.), Encyclopedia of Biostatistics (Vol. 5, pp. 3560–3563). Hoboken, NJ: John Wiley & Sons.
Morgenthaler, S., & Thilly, W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research, 615(1–2), 28–56. doi:10.1016/j.mrfmmm.2006.09.003
Nussbaum, R. L., McInnes, R. R., & Willard, H. F. (2001). Thompson & Thompson genetics in medicine (6th ed.). Philadelphia, PA: W. B. Saunders. doi:10.1016/S0015-0282(02)03084-4
Price, A. L., Kryukov, G. V., de Bakker, P. I., Purcell, S. M., Staples, J., Wei, L. J., & Sunyaev, S. R. (2010). Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics, 86(6), 832–838. doi:10.1016/j.ajhg.2010.04.005
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), 82–93. doi:10.1016/j.ajhg.2011.05.029
Zawistowski, M., Gopalakrishnan, S., Ding, J., Li, Y., Grimm, S., & Zöllner, S. (2010). Extending rare-variant testing strategies: Analysis of noncoding sequence and imputed genotypes. The American Journal of Human Genetics, 87(5), 604–617. doi:10.1016/j.ajhg.2010.10.012
Methods Continued
Proportional odds model (POM) formulation:
Latent variable: an unobserved continuous variable underlying the ordinal response; its relationship to the predictors must be inferred.
Hypothesis testing: the formal statistical procedure for weighing evidence against the null hypothesis.
Score function: the gradient of the log-likelihood with respect to the model parameters, from which the test statistic is built.
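For reference, the standard proportional odds formulation of McCullagh (1980), before any refinement, can be written as follows. The symbols (cutpoints \alpha_j, effect vector \boldsymbol{\beta}, latent variable Y^{*}, and null-fitted cumulative probabilities \hat{\gamma}_j) are conventional choices assumed here, and the last line is the score for \boldsymbol{\beta} evaluated under the standard null of no effect; the refined latent variable and null hypothesis proposed in this investigation may differ.

```latex
\begin{align*}
\operatorname{logit} P(Y \le j \mid \mathbf{x}) &= \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta},
  \qquad j = 1, \dots, J - 1, \\
Y^{*} &= \mathbf{x}^{\top}\boldsymbol{\beta} + \varepsilon,
  \qquad \varepsilon \sim \text{Logistic}(0, 1), \\
Y = j &\iff \alpha_{j-1} < Y^{*} \le \alpha_j,
  \qquad \alpha_0 = -\infty,\ \alpha_J = +\infty, \\
H_0 &: \boldsymbol{\beta} = \mathbf{0}
  \quad \text{versus} \quad H_1 : \boldsymbol{\beta} \ne \mathbf{0}, \\
U(\boldsymbol{\beta}) \big|_{\boldsymbol{\beta} = \mathbf{0}}
  &= \sum_{i=1}^{n} \mathbf{x}_i \left( \hat{\gamma}_{y_i} + \hat{\gamma}_{y_i - 1} - 1 \right),
  \qquad \hat{\gamma}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ y_i \le j \}.
\end{align*}
```

The second and third lines imply the first, because P(Y \le j \mid \mathbf{x}) = P(\varepsilon \le \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta}) and the logistic distribution function is the inverse logit.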
Simulation algorithm:
1. Generate genotype data.
2. Obtain error terms.
3. Set latent variables.
4. Produce ordinal categorical responses.
5. Estimate intercept parameters.
6. Plug parameters into score function.
7. Receive evidence regarding null hypothesis.
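The seven steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the poster's implementation: it uses the standard (unrefined) POM, independent SNPs with a common minor-allele frequency, a burden score that collapses the SNPs into one predictor, and a normal approximation with a permutation-style variance. The function names `simulate_ordinal` and `score_test` and all parameter values are hypothetical.

```python
import math
import numpy as np


def simulate_ordinal(n, maf, beta, cutpoints, rng):
    """Steps 1-4: genotypes, errors, latent variable, ordinal response."""
    # Step 1: minor-allele counts (0, 1, 2); independent SNPs are an assumption.
    genotypes = rng.binomial(2, maf, size=(n, len(beta)))
    # Step 2: standard logistic errors, which give the logit link of the POM.
    errors = rng.logistic(size=n)
    # Step 3: latent severity = genetic effect + error.
    latent = genotypes @ beta + errors
    # Step 4: category j is assigned when cutpoints[j-1] < latent <= cutpoints[j].
    response = np.searchsorted(cutpoints, latent)
    return genotypes, response


def score_test(genotypes, response, n_categories):
    """Steps 5-7: null fit, score statistic, and p-value for H0: beta = 0."""
    n = len(response)
    # Step 5: under H0 the fitted cumulative probabilities are the empirical
    # cumulative proportions (the intercept estimates are their logits).
    counts = np.bincount(response, minlength=n_categories)
    cum = np.concatenate(([0.0], np.cumsum(counts) / n))
    # Per-observation score residual: gamma_y + gamma_{y-1} - 1.
    resid = cum[response + 1] + cum[response] - 1.0
    # Step 6: plug into the score for a burden predictor (assumed collapse).
    burden = genotypes.sum(axis=1).astype(float)
    centered = burden - burden.mean()
    score = centered @ resid
    # Permutation variance of the score (the residuals sum to zero).
    variance = (centered @ centered) * (resid @ resid) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    # Step 7: two-sided p-value from a normal approximation.
    z = score / math.sqrt(variance)
    return z, math.erfc(abs(z) / math.sqrt(2.0))


rng = np.random.default_rng(2024)
cutpoints = np.array([-1.0, 1.0])  # three ordered categories
null_data = simulate_ordinal(500, 0.05, np.zeros(12), cutpoints, rng)
alt_data = simulate_ordinal(500, 0.05, np.full(12, 1.0), cutpoints, rng)
z_null, p_null = score_test(*null_data, n_categories=3)
z_alt, p_alt = score_test(*alt_data, n_categories=3)
print(f"null: z={z_null:.2f}, p={p_null:.3f}")
print(f"alternative: z={z_alt:.2f}, p={p_alt:.3g}")
```

Because the test is a score test, only the null model is fitted, so no iterative estimation of the genetic effect is needed.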
Application: Dravet syndrome patient severity
dataset with 12 stress-related single nucleotide
polymorphisms as the main predictor variables. The
dataset comprises 22 relatively isolated Japanese
patients, each with an ordinal response (mild or
severe) plus sex, status, IQ, and allele count data.
Results
Model performance: type I error and power
comparable to the sequence kernel association
test (Wu et al., 2011) and the optimized sequence
kernel association test (Lee et al., 2012). In terms
of power, the proposed model outperforms the
variable threshold test (Price et al., 2010), the cohort
allelic sums test (Morgenthaler & Thilly, 2007), and
the cumulative minor-allele test (Zawistowski et al.,
2010).
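Type I error and power of this kind are typically estimated by Monte Carlo: simulate many datasets, test each one, and record how often the null hypothesis is rejected. The sketch below shows that procedure for a burden-style score test under the standard POM. It is an assumption-laden illustration, not the comparison reported here; the functions `one_p_value` and `rejection_rate`, the effect size, and the sample size are all hypothetical.

```python
import math
import numpy as np


def one_p_value(effect, n, rng, n_snps=12, maf=0.05):
    """Simulate one ordinal dataset and return a score-test p-value."""
    # Burden score: total minor-allele count over independent SNPs (assumed).
    burden = rng.binomial(2, maf, size=(n, n_snps)).sum(axis=1).astype(float)
    latent = effect * burden + rng.logistic(size=n)
    y = np.searchsorted(np.array([-1.0, 1.0]), latent)  # three categories
    # Null-fitted cumulative proportions and the resulting score residuals.
    cum = np.concatenate(([0.0], np.cumsum(np.bincount(y, minlength=3)) / n))
    resid = cum[y + 1] + cum[y] - 1.0
    centered = burden - burden.mean()
    variance = (centered @ centered) * (resid @ resid) / (n - 1)
    if variance == 0.0:
        return 1.0
    z = (centered @ resid) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(effect, n, n_reps, alpha, rng):
    """Fraction of simulated datasets whose p-value falls below alpha."""
    p_values = np.array([one_p_value(effect, n, rng) for _ in range(n_reps)])
    return float(np.mean(p_values < alpha))


rng = np.random.default_rng(7)
# effect = 0 simulates under the null, so the rejection rate is the type I error.
type_one_error = rejection_rate(effect=0.0, n=100, n_reps=500, alpha=0.05, rng=rng)
# A nonzero effect simulates under an alternative, so the rate is the power.
power = rejection_rate(effect=0.8, n=100, n_reps=500, alpha=0.05, rng=rng)
print(f"empirical type I error: {type_one_error:.3f}")
print(f"empirical power: {power:.3f}")
```

A well-calibrated test should give an empirical type I error close to the nominal level alpha; comparing methods then amounts to comparing their power at matched type I error.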
Dravet dataset: rare phenotypes are prevalent among
young patients with a severe diagnosis, several
genes protect against or exacerbate Dravet on a
case-by-case basis, and the link between stress and
Dravet is contingent on sample heterogeneity.