Survival Analysis
Dimension Reduction Techniques
Claressa Ullmayer and Iván Rodríguez
The University of Alaska, Fairbanks
The University of Arizona
30 July 2015
Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
Background
Given a dataset, we want to estimate the true survival function.
Complications:
Data Dimensionality
Data Censoring
Unknown True Survival Curve
We want to minimize bias and mean-squared error (MSE)
Applications
Our running example: microarray gene expression datasets
with n patients and p genes such that n ≪ p
However, there exist many other implementations:
Engineering
Business
Public Health
Security
Biostatistics
The Survival Function
A survival function, S(t), describes the probability of an object
experiencing an explicit event after a particular time:
S(t) := P(T > t) = ∫_t^∞ f(τ) dτ = 1 − F(t),
where t is the specific time, T is a random variable, f(τ) is the
PDF of T, and F(t) is the CDF of T
In our running example:
event of interest = death
S(t) = probability that a cancer patient survives—death
not observed—after a particular time
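As an illustrative aside (not part of the slides), the identity S(t) = 1 − F(t) can be checked numerically for an exponentially distributed survival time:

```python
import math

def exp_survival(t, lam):
    """S(t) = P(T > t) for T ~ Exponential(rate lam)."""
    return math.exp(-lam * t)

def exp_cdf(t, lam):
    """F(t) = P(T <= t) for T ~ Exponential(rate lam)."""
    return 1.0 - math.exp(-lam * t)

lam = 0.5
for t in [0.0, 1.0, 2.0]:
    # The survival function is one minus the CDF at every t
    assert abs(exp_survival(t, lam) - (1.0 - exp_cdf(t, lam))) < 1e-12
```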
An Example of Survival Curves
Below are four survival arms demonstrating the efficacy of
different drug choices for a particular cancer
The Accelerated Failure Time (AFT) Model
Two classic models are used to estimate the survival function:
Cox Proportional Hazards (CPH)
Accelerated Failure Time (AFT)
Chief differences:
Ease of interpretation—survivorship vis-à-vis hazard
AFT directly models survival times
AFT assumes covariates accelerate or decelerate the
‘disease’ life course by a constant factor
CPH posits no assumption about the baseline hazard function
The Accelerated Failure Time (AFT) Model Cont.
Underlying formula:
ln(Tᵢ) = μ + zᵢβ + eᵢ
i = 1, . . . , n total observations
Ti is the ith observation’s survival time
parameter µ is the theoretical mean
vector zi denotes the data covariates
vector β indicates the covariate or ‘regression’ coefficients
ei designates the random error for the ith observation
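A minimal simulation of this log-linear relationship (a Python sketch with illustrative sizes, not the study's R code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # illustrative sizes, not the study's 100 x 1000
mu = 1.0                           # theoretical mean
beta = rng.normal(size=p)          # 'regression' coefficients
z = rng.normal(size=(n, p))        # covariates
e = rng.normal(scale=0.1, size=n)  # random errors

log_T = mu + z @ beta + e          # ln(T_i) = mu + z_i beta + e_i
T = np.exp(log_T)                  # survival times are strictly positive
assert np.all(T > 0)
```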
Dimension Reduction Techniques
Three dimension reduction techniques are compared given
predictors in X and responses in Y:
Principal Component Analysis (PCA)
Partial Least Squares (PLS)
Johnson-Lindenstrauss inspired Random Matrices (RM)
Principal Component Analysis (PCA)
PCA obtains orthogonal variance-maximized components in X
PCA is used when
X is highly collinear
covariates outnumber observations
Model: T = XW
Xn×p now related to Wp×p ‘loadings’ and Tn×p ‘scores’
Columns of W are eigenvectors of XᵀX
Desired ‘principal’ components are retained
These have maximal variability in their respective directions
Note: response variable Y disregarded
Thus, known as an ‘unsupervised’ method
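The construction above can be sketched directly from its definition (illustrative data; in practice a library routine would be used):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
Xc = X - X.mean(axis=0)              # center the columns

# Columns of W are eigenvectors of Xc^T Xc, ordered by decreasing eigenvalue
eigvals, W = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

T_scores = Xc @ W                    # model: T = XW
# Each score column's variance is its eigenvalue divided by (n - 1)
var_scores = T_scores.var(axis=0, ddof=1)
assert np.allclose(var_scores, eigvals / (Xc.shape[0] - 1))
```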
Partial Least Squares (PLS)
PLS analyzes linear combinations of X and Y
PLS is used when
X is highly collinear
covariates vastly outnumber observations
Y is multidimensional
Model: X = TPᵀ + E and Y = UQᵀ + F
X now related to ‘scores’ T, ‘loadings’ P, and error E
Y now related to ‘scores’ U, ‘loadings’ Q, and error F
PLS is iterative
covariance maximized between T and U
resulting ‘latent vectors’ retained—subtracted from X and Y
process repeated until X is a null matrix
Note: PLS performs singular value decompositions of XᵀY
Hence, known as a ‘supervised’ method
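One iteration of this supervised scheme can be sketched as follows: the leading singular vectors of XᵀY give the first pair of latent directions, after which X is deflated (illustrative data, not the study's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 60, 10, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
Xc, Yc = X - X.mean(0), Y - Y.mean(0)

# Leading singular vectors of Xc^T Yc maximize cov(Xc w, Yc c)
U, s, Vt = np.linalg.svd(Xc.T @ Yc)
w, c = U[:, 0], Vt[0, :]

t = Xc @ w                            # X scores
u = Yc @ c                            # Y scores

# Deflation: subtract the rank-one part of X explained by t
p_load = Xc.T @ t / (t @ t)
X_deflated = Xc - np.outer(t, p_load)
# The deflated X carries no component along t
assert np.allclose(X_deflated.T @ t, 0.0, atol=1e-8)
```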
Johnson-Lindenstrauss Lemma
Random Matrices inspired by the Johnson-Lindenstrauss
Lemma are the third dimension reduction technique
The Johnson-Lindenstrauss Lemma
For any ε ∈ (0, 1) and any positive integer n, let k be a positive integer with
k ≥ 4 ln(n) / (ε²/2 − ε³/3).
Then, for any set S of n points in Rᵈ, there exists a mapping
f : Rᵈ → Rᵏ such that, for all points u, v ∈ S,
(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
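The lower bound on k in the lemma is easy to compute; the helper below (a hypothetical name, not from the slides) evaluates it for given n and ε:

```python
import math

def jl_min_dim(n, eps):
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    assert 0.0 < eps < 1.0
    return math.ceil(4.0 * math.log(n) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))

# A larger allowed distortion eps permits a smaller target dimension k
assert jl_min_dim(100, 0.9) < jl_min_dim(100, 0.5)
```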
Generating Random Matrices
Three Random Matrices were generated according to the
papers of Achlioptas and Dasgupta-Gupta
Properties of the Achlioptas matrices:
Rᵢⱼ = (1/√k) × { +1 with probability 1/2, −1 with probability 1/2 }
Rᵢⱼ = √(3/k) × { +1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6 }
Properties of the Dasgupta-Gupta matrix:
entries from the N(0, 1) distribution with normalized rows
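The three constructions can be sketched in a few lines (a Python sketch; the study used R, and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 1000, 37   # project from d dimensions down to k

# Achlioptas matrix 1: +/-1 entries with probability 1/2 each, scaled by 1/sqrt(k)
R1 = rng.choice([1.0, -1.0], size=(d, k)) / np.sqrt(k)

# Achlioptas matrix 2: sparse entries {+1, 0, -1} with probabilities
# {1/6, 2/3, 1/6}, scaled by sqrt(3/k)
R2 = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6]) * np.sqrt(3.0 / k)

# Dasgupta-Gupta matrix: N(0, 1) entries with rows normalized to unit length
G = rng.normal(size=(d, k))
R3 = G / np.linalg.norm(G, axis=1, keepdims=True)

assert R1.shape == R2.shape == R3.shape == (d, k)
```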
Johnson-Lindenstrauss Success Simulations
Accuracy of the Johnson-Lindenstrauss Lemma was tested
with the three matrices across varying values of ε and k
The Johnson-Lindenstrauss Lemma passes 100% of the time
under the constraints for k and ε
To reduce X to 100 × 37, ε ≈ 0.65 is needed to satisfy the
Johnson-Lindenstrauss Lemma
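An empirical version of this check can be sketched as follows (a Gaussian projection stands in for the three matrices; seed and bounds are illustrative): project 100 points from R¹⁰⁰⁰ to R³⁷ and inspect the squared-distance distortions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 100, 1000, 37
S = rng.normal(size=(n, d))                 # n points in R^d
R = rng.normal(size=(d, k)) / np.sqrt(k)    # Gaussian random projection
fS = S @ R

def sq_pdist(A):
    """Squared pairwise distances between the rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)

iu = np.triu_indices(n, k=1)
ratios = sq_pdist(fS)[iu] / sq_pdist(S)[iu]
# With k = 37 the distortion is large but stays within loose bounds here
assert 0.2 < ratios.min() and ratios.max() < 3.0
```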
Simulating Data
Data was simulated in order to test which method best
minimized bias and MSE
data matrix of dimension 100 × 1000 (observations × covariates)
β1×1000 random regression coefficients from
U(−1.0 × 10⁻⁷, 1.0 × 10⁻⁷)
z1×1000 covariates for each of the 100 observations
z ∼ N(0, 1)
z was exponentiated to make all the values log-normally
distributed
Tᵢ, the survival times, are exponentially distributed with
λᵢ = e^(−zᵢβ)
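A Python sketch of this simulation setup (the study used R; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 1000

# Fixed regression coefficients from a narrow uniform range
beta = rng.uniform(-1e-7, 1e-7, size=p)

# Covariates: standard normals, exponentiated to be log-normally distributed
z = np.exp(rng.normal(size=(n, p)))

# Survival times: exponential with rate lambda_i = exp(-z_i beta)
lam = np.exp(-z @ beta)
T = rng.exponential(scale=1.0 / lam)
assert T.shape == (n,) and np.all(T > 0)
```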
Applying PCA
Now that we had our data simulated, our next goal was to apply
our dimension reduction techniques.
First, we implemented PCA and obtained 99 components,
whose variances are the eigenvalues of the
variance-covariance matrix
The components are linear combinations of the original
covariates (genes)
Below are the first ten components:
> z_star_PCA$eig
         eigenvalue  % of variance  cumulative % of variance
comp 1       16.825          1.683                     1.683
comp 2       16.494          1.649                     3.332
comp 3       16.343          1.634                     4.966
comp 4       16.152          1.615                     6.581
comp 5       15.615          1.561                     8.143
comp 6       15.434          1.543                     9.686
comp 7       15.154          1.515                    11.202
comp 8       15.019          1.502                    12.704
comp 9       14.958          1.496                    14.199
comp 10      14.872          1.487                    15.686
We decided to retain components accounting for 50% of the
overall variance
Hence, we chose to reduce the data from 99 to 37 components
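The component-selection rule can be sketched as a small helper (hypothetical name; the toy eigenvalue spectrum is not the study's):

```python
import numpy as np

def n_components_for_variance(eigvals, target=0.5):
    """Smallest number of leading eigenvalues whose cumulative share
    of the total variance reaches `target`."""
    frac = np.cumsum(np.sort(eigvals)[::-1]) / np.sum(eigvals)
    return int(np.searchsorted(frac, target) + 1)

# Toy spectrum: the two largest of four eigenvalues carry 70% of the variance
assert n_components_for_variance(np.array([4.0, 3.0, 2.0, 1.0]), 0.5) == 2
```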
Applying PLS and AFT
Next, we implemented PLS using the same number of
components that we chose for PCA
From both PCA and PLS, we obtained the weights on all
the genes for each component (open PDF)
We then multiplied our original 100 × 1000 matrix by the
resulting 1000 × 37 matrix of weights to get a 100 × 37
reduced matrix for both PLS and PCA
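The projection step is a single matrix product (random stand-ins below for the data and weight matrices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 100, 1000, 37
Z = rng.normal(size=(n, p))   # original data matrix
W = rng.normal(size=(p, k))   # stand-in for the PCA/PLS weight matrix

Z_reduced = Z @ W             # (100 x 1000) @ (1000 x 37) -> (100 x 37)
assert Z_reduced.shape == (n, k)
```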
Estimating the Survival Function
We took these new matrices and fed them into the AFT
model to get our estimated regression coefficients
We can then estimate the survival function, defined as
Ŝ₀(t) = e^(−t e^(−z̄* β̂*))
z̄* is the column-centered original matrix of observations
and covariates
β̂* is a matrix produced by multiplying our original simulated
regression coefficients by our matrix of obtained weights
−z̄* β̂* becomes a scalar
We know our real survival function is S₀(t) = e^(−λ̄t)
We repeated this procedure for 5000 iterations
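The two survival-curve formulas can be evaluated side by side (the scalar below is a hypothetical stand-in for z̄*β̂*, not the study's estimate):

```python
import numpy as np

t = np.linspace(0.0, 5.0, 50)
zb = 0.1                            # hypothetical scalar value of zbar* betahat*
lam_bar = np.exp(-zb)               # matching rate for the 'true' curve

S_hat = np.exp(-t * np.exp(-zb))    # estimated: S0_hat(t) = e^{-t e^{-zbar* betahat*}}
S_true = np.exp(-lam_bar * t)       # true:      S0(t) = e^{-lambda_bar t}

# Both are proper survival curves: they start at 1 and are non-increasing
assert S_hat[0] == 1.0 and np.all(np.diff(S_hat) <= 0)
```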
Results
Since we have the estimated and the real survival function,
we can estimate the bias and MSE for PCA, PLS, and the
three random matrices
To compare the performance of the dimension reduction
techniques, we first partitioned the y-axis of the survival
curve into equally spaced sections, uᵢ for i = 1, . . . , 20
Then, we found the corresponding ti on the x-axis of the
survival curve
For each of the 20 tᵢ, we summed the bias and MSE at
each point to obtain the distribution of the errors after 5000
iterations
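The error summary at the grid points can be sketched as follows (synthetic noisy estimates stand in for the 5000 fitted curves):

```python
import numpy as np

rng = np.random.default_rng(8)
lam = 1.0
u = np.linspace(0.05, 0.95, 20)   # equally spaced survival levels u_i
t_grid = -np.log(u) / lam         # t_i solving S0(t_i) = u_i

n_iter = 200                      # 5000 in the study
S_true = np.exp(-lam * t_grid)
# Noisy estimated curves: perturb the rate and re-evaluate at each t_i
est = np.exp(-(lam + rng.normal(scale=0.1, size=(n_iter, 1))) * t_grid)

bias = (est - S_true).mean(axis=0)
mse = ((est - S_true) ** 2).mean(axis=0)
# MSE decomposes as variance plus squared bias, so MSE >= bias^2
assert np.all(mse >= bias ** 2 - 1e-12)
```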
Bias Plot, PCA and PLS
Mean-Squared Error Plot, PCA and PLS
Bias Plot, Random Matrices
Mean-Squared Error Plot, Random Matrices
Bias Plot, All Methods
Mean-Squared Error Plot, All Methods
Discussion
Censoring occurs when the event of interest for a given subject
is not observed for some extraneous reason
Naturally, censoring is a problem in real-life investigations
and studies. Unfortunately, we did not have the time to
incorporate the effect of censoring into our data simulations.
Furthermore, a complication arose in the generation of the
fixed β coefficients; essentially, the R software necessitated
generating grossly smaller βs due to the double exponent in
the survival curve estimate Ŝ₀(t) = e^(−t e^(−z̄* β̂*)).
An initial goal was to apply our findings to real microarray
gene datasets—due to time constraints, this objective was
not fulfilled
References
Cox, D.R. Regression models and life tables (with discussion).
Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.
Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz
maps into a Hilbert space. Contemp Math 26: 189-206, 1984.
Pearson, K. On lines and planes of closest fit to systems of
points in space. Philosophical Magazine 2: 559-572, 1901.
Wold, H. Estimation of principal components and related
models by iterative least squares. In P.R. Krishnaiah (ed.): 391-420,
1966.
References Cont.
Achlioptas, D. Database-friendly random projections:
Johnson-Lindenstrauss with binary coins. Journal of Computer
and System Sciences 66(4): 671-687, 2003.
Dasgupta, S. and A. Gupta. An elementary proof of a theorem
of Johnson and Lindenstrauss. Random Structures and
Algorithms 22(1): 60-65, 2003.
Nguyen, D.V. Partial least squares dimension reduction for
microarray gene expression data with a censored response.
Math Biosci 193: 119-137, 2005.
References Cont.
Nguyen, D.V., and D.M. Rocke. On partial least squares
dimension reduction for microarray-based classification: A
simulation study. Comput Stat Data Analysis 46: 407-425,
2004.
Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of
Microarray Gene Expression Data: The Accelerated Failure
Time Model. Journal of Bioinformatics and Computational
Biology 7(6): 939-954, 2009.
Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of
Microarray Data in the Presence of a Censored Survival
Response: A Simulation Study. Statistical Applications in
Genetics and Molecular Biology 8(1): 2009.
Thank You
This research was supported by the National Security Agency
through REU Grant H98230-15-1-0048 to The University of
Nevada at Reno, Javier Rojo PI.
We would like to sincerely thank the NSA for funding our
research this summer
Thank you all for taking the time to be here and listen to
our presentation

More Related Content

What's hot

Eigen values and eigen vectors engineering
Eigen values and eigen vectors engineeringEigen values and eigen vectors engineering
Eigen values and eigen vectors engineering
shubham211
 
Power method
Power methodPower method
Power method
nashaat algrara
 
Eigen values and eigen vectors
Eigen values and eigen vectorsEigen values and eigen vectors
Eigen values and eigen vectors
Riddhi Patel
 
Eigen value and eigen vector
Eigen value and eigen vectorEigen value and eigen vector
Eigen value and eigen vector
Rutvij Patel
 
Maths-->>Eigenvalues and eigenvectors
Maths-->>Eigenvalues and eigenvectorsMaths-->>Eigenvalues and eigenvectors
Maths-->>Eigenvalues and eigenvectors
Jaydev Kishnani
 
Eigenvalues and eigenvectors
Eigenvalues and eigenvectorsEigenvalues and eigenvectors
Eigenvalues and eigenvectors
iraq
 
Eigen values and eigenvectors
Eigen values and eigenvectorsEigen values and eigenvectors
Eigen values and eigenvectorsAmit Singh
 
Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2
SamsonAjibola
 
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)Prasanth George
 
MASSS_Presentation_20160209
MASSS_Presentation_20160209MASSS_Presentation_20160209
MASSS_Presentation_20160209Yimin Wu
 
Eigen value and vectors
Eigen value and vectorsEigen value and vectors
Eigen value and vectors
Praveen Prashant
 
Ridge regression
Ridge regressionRidge regression
Ridge regression
Ananda Swarup
 
Eigenvectors & Eigenvalues: The Road to Diagonalisation
Eigenvectors & Eigenvalues: The Road to DiagonalisationEigenvectors & Eigenvalues: The Road to Diagonalisation
Eigenvectors & Eigenvalues: The Road to DiagonalisationChristopher Gratton
 
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
Ceni Babaoglu, PhD
 
Eigenvalue problems .ppt
Eigenvalue problems .pptEigenvalue problems .ppt
Eigenvalue problems .ppt
Self-employed
 
Eigen value , eigen vectors, caley hamilton theorem
Eigen value , eigen vectors, caley hamilton theoremEigen value , eigen vectors, caley hamilton theorem
Eigen value , eigen vectors, caley hamilton theorem
gidc engineering college
 
Jacobi iterative method
Jacobi iterative methodJacobi iterative method
Jacobi iterative method
Luckshay Batra
 

What's hot (20)

Eigen values and eigen vectors engineering
Eigen values and eigen vectors engineeringEigen values and eigen vectors engineering
Eigen values and eigen vectors engineering
 
Power method
Power methodPower method
Power method
 
Eigen values and eigen vectors
Eigen values and eigen vectorsEigen values and eigen vectors
Eigen values and eigen vectors
 
Eigen value and eigen vector
Eigen value and eigen vectorEigen value and eigen vector
Eigen value and eigen vector
 
Maths-->>Eigenvalues and eigenvectors
Maths-->>Eigenvalues and eigenvectorsMaths-->>Eigenvalues and eigenvectors
Maths-->>Eigenvalues and eigenvectors
 
Eigenvalues and eigenvectors
Eigenvalues and eigenvectorsEigenvalues and eigenvectors
Eigenvalues and eigenvectors
 
Eigen values and eigenvectors
Eigen values and eigenvectorsEigen values and eigenvectors
Eigen values and eigenvectors
 
Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2
 
Rankmatrix
RankmatrixRankmatrix
Rankmatrix
 
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)
Eigenvalues and Eigenvectors (Tacoma Narrows Bridge video included)
 
MASSS_Presentation_20160209
MASSS_Presentation_20160209MASSS_Presentation_20160209
MASSS_Presentation_20160209
 
Eigen value and vectors
Eigen value and vectorsEigen value and vectors
Eigen value and vectors
 
Ridge regression
Ridge regressionRidge regression
Ridge regression
 
Eigenvectors & Eigenvalues: The Road to Diagonalisation
Eigenvectors & Eigenvalues: The Road to DiagonalisationEigenvectors & Eigenvalues: The Road to Diagonalisation
Eigenvectors & Eigenvalues: The Road to Diagonalisation
 
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
3. Linear Algebra for Machine Learning: Factorization and Linear Transformations
 
Maths
MathsMaths
Maths
 
Eigenvalue problems .ppt
Eigenvalue problems .pptEigenvalue problems .ppt
Eigenvalue problems .ppt
 
JISA_Paper
JISA_PaperJISA_Paper
JISA_Paper
 
Eigen value , eigen vectors, caley hamilton theorem
Eigen value , eigen vectors, caley hamilton theoremEigen value , eigen vectors, caley hamilton theorem
Eigen value , eigen vectors, caley hamilton theorem
 
Jacobi iterative method
Jacobi iterative methodJacobi iterative method
Jacobi iterative method
 

Similar to Ullmayer_Rodriguez_Presentation

9_Poisson_printable.pdf
9_Poisson_printable.pdf9_Poisson_printable.pdf
9_Poisson_printable.pdf
Elio Laureano
 
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
inventionjournals
 
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
inventionjournals
 
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
The Statistical and Applied Mathematical Sciences Institute
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
Arthur Charpentier
 
Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural Networks
Federico Cerutti
 
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...Gang Cui
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
The Statistical and Applied Mathematical Sciences Institute
 
Formulation of model likelihood functions
Formulation of model likelihood functionsFormulation of model likelihood functions
Formulation of model likelihood functions
Andreas Scheidegger
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
Arthur Charpentier
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
Chandan Reddy
 
A lambda calculus for density matrices with classical and probabilistic controls
A lambda calculus for density matrices with classical and probabilistic controlsA lambda calculus for density matrices with classical and probabilistic controls
A lambda calculus for density matrices with classical and probabilistic controls
Alejandro Díaz-Caro
 
Lesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing dataLesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing data
mjlobetos
 
Module 2_ Regression Models..pptx
Module 2_ Regression Models..pptxModule 2_ Regression Models..pptx
Module 2_ Regression Models..pptx
nikshaikh786
 
ch3.ppt
ch3.pptch3.ppt
Regression
RegressionRegression
20070823
2007082320070823
20070823neostar
 

Similar to Ullmayer_Rodriguez_Presentation (20)

9_Poisson_printable.pdf
9_Poisson_printable.pdf9_Poisson_printable.pdf
9_Poisson_printable.pdf
 
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
 
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...Estimation of Parameters and Missing Responses In Second Order Response Surfa...
Estimation of Parameters and Missing Responses In Second Order Response Surfa...
 
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
PMED Transition Workshop - Non-parametric Techniques for Estimating Tumor Het...
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural Networks
 
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...
Simulation Study for Extended AUC In Disease Risk Prediction in survival anal...
 
overviewPCA
overviewPCAoverviewPCA
overviewPCA
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Formulation of model likelihood functions
Formulation of model likelihood functionsFormulation of model likelihood functions
Formulation of model likelihood functions
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Paper 7 (s.k. ashour)
Paper 7 (s.k. ashour)Paper 7 (s.k. ashour)
Paper 7 (s.k. ashour)
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 
A lambda calculus for density matrices with classical and probabilistic controls
A lambda calculus for density matrices with classical and probabilistic controlsA lambda calculus for density matrices with classical and probabilistic controls
A lambda calculus for density matrices with classical and probabilistic controls
 
MT2
MT2MT2
MT2
 
Lesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing dataLesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing data
 
Module 2_ Regression Models..pptx
Module 2_ Regression Models..pptxModule 2_ Regression Models..pptx
Module 2_ Regression Models..pptx
 
ch3.ppt
ch3.pptch3.ppt
ch3.ppt
 
Regression
RegressionRegression
Regression
 
20070823
2007082320070823
20070823
 

More from ​Iván Rodríguez

Rodriguez_THINK_TANK_Testimonial
Rodriguez_THINK_TANK_TestimonialRodriguez_THINK_TANK_Testimonial
Rodriguez_THINK_TANK_Testimonial​Iván Rodríguez
 
Rodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_PresentationRodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_Presentation​Iván Rodríguez
 
Rodriguez_THINK_TANK_Difficult_Problem_12
Rodriguez_THINK_TANK_Difficult_Problem_12Rodriguez_THINK_TANK_Difficult_Problem_12
Rodriguez_THINK_TANK_Difficult_Problem_12​Iván Rodríguez
 
Rodriguez_THINK_TANK_Difficult_Problem_9
Rodriguez_THINK_TANK_Difficult_Problem_9Rodriguez_THINK_TANK_Difficult_Problem_9
Rodriguez_THINK_TANK_Difficult_Problem_9​Iván Rodríguez
 
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_PhilosophyRodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy​Iván Rodríguez
 
Rodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_BeamerRodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_Beamer​Iván Rodríguez
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNASRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS​Iván Rodríguez
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report​Iván Rodríguez
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMMRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM​Iván Rodríguez
 

More from ​Iván Rodríguez (12)

Rodriguez_NRMC_Presentation
Rodriguez_NRMC_PresentationRodriguez_NRMC_Presentation
Rodriguez_NRMC_Presentation
 
Rodriguez_THINK_TANK_Testimonial
Rodriguez_THINK_TANK_TestimonialRodriguez_THINK_TANK_Testimonial
Rodriguez_THINK_TANK_Testimonial
 
Rodriguez_UROC_Final_Poster
Rodriguez_UROC_Final_PosterRodriguez_UROC_Final_Poster
Rodriguez_UROC_Final_Poster
 
Rodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_PresentationRodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_Presentation
 
Rodriguez_THINK_TANK_Difficult_Problem_12
Rodriguez_THINK_TANK_Difficult_Problem_12Rodriguez_THINK_TANK_Difficult_Problem_12
Rodriguez_THINK_TANK_Difficult_Problem_12
 
Rodriguez_THINK_TANK_Difficult_Problem_9
Rodriguez_THINK_TANK_Difficult_Problem_9Rodriguez_THINK_TANK_Difficult_Problem_9
Rodriguez_THINK_TANK_Difficult_Problem_9
 
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_PhilosophyRodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
 
Rodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_BeamerRodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_Beamer
 
Rodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_BeamerRodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_Beamer
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNASRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMMRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_JMM
 

Ullmayer_Rodriguez_Presentation

  • 1. Survival Analysis Dimension Reduction Techniques Claressa Ullmayer and Iván Rodríguez The University of Alaska, Fairbanks The University of Arizona 30 July 2015 Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 2. Background Given a dataset, we want to estimate the true survival function. Complications: Data Dimensionality Data Censoring Unknown True Survival Curve We want to minimize bias and mean-squared error (MSE) Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 3. Applications Our running example: microarray gene expression datasets with n patients and p genes such that n p However, there exist many other implementations: Engineering Business Public Health Security Biostatistics Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 4. The Survival Function A survival function, S(t) describes the probability of an object experiencing an explicit event after a particular time: S(t) := P(T > t) = ∞ t f(τ) dτ = 1 − F(t), where t is the specific time, T is a random variable, f(τ) is the PDF of T, and F(t) is the CDF of T In our running example: event of interest = death S(t) = probability that a cancer patient survives—death not observed—after a particular time Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 5. An Example of Survival Curves Below are four survival arms demonstrating efficacy of different drug choices for a particular cancer Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 6. The Accelerated Failure Time (AFT) Model Two classic models are used to estimate the survival function: Cox Proportional Hazards (CPH) Accelerated Failure Time (AFT) Chief differences: Ease of interpretation—survivorship vis-à-vis hazard AFT directly models survival times AFT assumes covariates affect a constant acceleration/deceleration of ‘disease’ life course CPH posits no assumption about baseline hazard function Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 7. The Accelerated Failure Time (AFT) Model Cont. Underlying formula: ln(Ti) = µ + zi β + ei i = 1, . . . , n total observations Ti is the ith observation’s survival time parameter µ is the theoretical mean vector zi denotes the data covariates vector β indicates the covariate or ‘regression’ coefficients ei designates the random error for the ith observation Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
  • 8. Dimension Reduction Techniques Three dimension reduction techniques are compared given predictors in X and responses in Y: Principal Component Analysis (PCA) Partial Least Squares (PLS) Johnson-Lindenstrauss inspired Random Matrices (RM) Claressa Ullmayer and Iván Rodríguez Survival Analysis Dimension Reduction Techniques
Principal Component Analysis (PCA)
PCA obtains orthogonal, variance-maximized components in X
PCA is used when
X is highly collinear
covariates outnumber observations
Model: T = XW
X (n × p) now related to W (p × p) 'loadings' and T (n × p) 'scores'
Columns of W are eigenvectors of X^T X
Desired 'principal' components are retained
These have maximal variability in their respective directions
Note: the response variable Y is disregarded
Thus, PCA is known as an 'unsupervised' method
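As a minimal numerical sketch of the model T = XW above (NumPy here, though the project itself used R; the sizes and seed are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000                    # n observations, p covariates (n << p)
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)             # center columns first

# Columns of W are eigenvectors of Xc^T Xc; the SVD of Xc yields them
# directly (right singular vectors) and is numerically stable.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt.T                            # loadings
T = Xc @ W                          # scores

# Scores are orthogonal; their squared norms are the eigenvalues of Xc^T Xc.
print(np.allclose(T.T @ T, np.diag(s**2), atol=1e-6))   # prints True
```

With n = 100 < p = 1000 the centered matrix has at most 99 nonzero eigenvalues, which is why the talk later obtains 99 components.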
Partial Least Squares (PLS)
PLS analyzes linear combinations of X and Y
PLS is used when
X is highly collinear
covariates vastly outnumber observations
Y is multidimensional
Model: X = TP^T + E and Y = UQ^T + F
X now related to 'scores' T, 'loadings' P, and error E
Y now related to 'scores' U, 'loadings' Q, and error F
PLS is iterative:
covariance is maximized between T and U
the resulting 'latent vectors' are retained, then subtracted from X and Y
the process is repeated until X is a null matrix
Note: PLS performs singular value decompositions of X^T Y
Hence, PLS is known as a 'supervised' method
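The iterative scheme above can be sketched as follows (a simplified NIPALS-style loop for a one-dimensional Y, where the SVD of X^T Y collapses to a single weight vector; NumPy illustration, not the project's R code):

```python
import numpy as np

def pls_scores(X, Y, k):
    """Extract k PLS latent vectors: take the weight direction from X^T Y,
    keep the latent score, deflate X and Y, and repeat."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean()
    scores = []
    for _ in range(k):
        w = X.T @ Y                      # direction of maximal covariance
        w /= np.linalg.norm(w)
        t = X @ w                        # latent vector (X-score)
        p_load = X.T @ t / (t @ t)       # X-loadings
        X = X - np.outer(t, p_load)      # deflate X
        Y = Y - t * (t @ Y) / (t @ t)    # deflate Y
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 1000))
Y = X[:, 0] + 0.5 * rng.standard_normal(100)
T = pls_scores(X, Y, 5)
print(T.shape)                           # prints (100, 5)
```

Unlike PCA, the response Y steers every weight vector, which is what makes the method 'supervised'.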
Johnson-Lindenstrauss Lemma
Random Matrices inspired by the Johnson-Lindenstrauss Lemma are the third dimension reduction technique
The Johnson-Lindenstrauss Lemma:
For any ε ∈ (0, 1) and any n ∈ Z, let k ∈ Z be positive with
k ≥ 4 ln(n) / (ε²/2 − ε³/3).
Then, for any set S of n points in R^d, there exists a mapping f : R^d → R^k such that, for all points u, v ∈ S,
(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².
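A quick calculation with the lemma's bound (plain Python; note that the constant 4 makes the guaranteed k quite conservative, so smaller target dimensions often work well in practice):

```python
import math

def jl_min_dim(n, eps):
    """Smallest k allowed by the lemma: k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

for eps in (0.1, 0.3, 0.65):
    print(eps, jl_min_dim(100, eps))     # smaller eps demands far larger k
```

Note that the bound depends on the number of points n but not on the original dimension d, which is what makes the projection attractive when p is huge.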
Generating Random Matrices
Three random matrices were generated according to the papers of Achlioptas and Dasgupta-Gupta
Properties of the Achlioptas matrices:
R_ij = (1/√k) × { +1 with probability 1/2, −1 with probability 1/2 }
R_ij = √(3/k) × { +1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6 }
Properties of the Dasgupta-Gupta matrix:
entries from the N(0, 1) distribution, with normalized rows
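The three constructions might be generated along these lines (NumPy sketch; the original work used R, and the function names here are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def achlioptas_signs(rows, cols):
    """Entries +-1/sqrt(k), each sign with probability 1/2 (k = cols)."""
    return rng.choice([-1.0, 1.0], size=(rows, cols)) / np.sqrt(cols)

def achlioptas_sparse(rows, cols):
    """Entries sqrt(3/k) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}."""
    vals = rng.choice([-1.0, 0.0, 1.0], size=(rows, cols), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / cols) * vals

def dasgupta_gupta(rows, cols):
    """N(0, 1) entries with each row scaled to unit length."""
    R = rng.standard_normal((rows, cols))
    return R / np.linalg.norm(R, axis=1, keepdims=True)

X = rng.standard_normal((100, 1000))
R = achlioptas_signs(1000, 37)
print((X @ R).shape)                     # prints (100, 37): the random projection
```

The sparse variant zeroes two thirds of the entries, so the projection can skip most multiplications, which was Achlioptas's 'database-friendly' point.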
Johnson-Lindenstrauss Success Simulations
Accuracy of the Johnson-Lindenstrauss Lemma was tested with the three matrices across varying values of ε and k
The Johnson-Lindenstrauss Lemma passes 100% of the time under the constraints for k and ε
To reduce X to 100 × 37, ε ≈ 0.65 is needed to satisfy the Johnson-Lindenstrauss Lemma
Show Simulations
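A toy version of such a success check (the sizes and seed here are arbitrary stand-ins, not the ones used in the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 20, 1000, 200, 0.65

S = rng.standard_normal((n, d))                       # n points in R^d
R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
fS = S @ R                                            # projected points

within = 0
pairs = 0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((S[i] - S[j]) ** 2)
        proj = np.sum((fS[i] - fS[j]) ** 2)
        pairs += 1
        within += (1 - eps) * orig <= proj <= (1 + eps) * orig
print(within / pairs)                    # fraction of pairwise distances preserved
```

With k this large relative to ε the distortion concentrates tightly, so all pairwise distances fall inside the bound.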
Simulating Data
Data were simulated in order to test which method best minimizes bias and MSE
dimension of 100 × 1000: observations and covariates
β (1 × 1000): random regression coefficients from U(−1.0 × 10⁻⁷, 1.0 × 10⁻⁷)
z (1 × 1000): covariates for each of 100 observations, drawn from N(0, 1)
z was exponentiated to make all the values log-normally distributed
T_i, the survival times, are exponentially distributed with λ_i = e^{−z_i β}
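The simulation recipe above, sketched in NumPy (the study itself used R; the seed is ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 1000

# Fixed regression coefficients on a deliberately tiny scale
# (large betas overflow the exponentials in the survival estimate).
beta = rng.uniform(-1e-7, 1e-7, size=p)

# Covariates: draw N(0, 1), then exponentiate -> log-normal values.
Z = np.exp(rng.standard_normal((n, p)))

# Exponential survival times with rate lambda_i = exp(-z_i . beta).
lam = np.exp(-Z @ beta)
T = rng.exponential(scale=1.0 / lam)     # NumPy's scale parameter is 1 / rate
print(T.shape)                           # prints (100,)
```

Because the betas are so small, every rate λ_i sits very close to 1, which keeps the exponentials numerically tame.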
Applying PCA
Now that we had our data simulated, our next goal was to apply our dimension reduction techniques.
First, we implemented PCA and obtained 99 components, representing the eigenvalues of the variance-covariance matrix
The components are linear combinations of the original covariates (genes)
Below are the first ten components:

> z_star_PCA$eig
          eigenvalue   % of variance   cumulative %
comp 1    16.825       1.683            1.683
comp 2    16.494       1.649            3.332
comp 3    16.343       1.634            4.966
comp 4    16.152       1.615            6.581
comp 5    15.615       1.561            8.143
comp 6    15.434       1.543            9.686
comp 7    15.154       1.515           11.202
comp 8    15.019       1.502           12.704
comp 9    14.958       1.496           14.199
comp 10   14.872       1.487           15.686

We decided to incorporate 50% of the overall variance in picking our components
Hence, we chose to reduce the data from 99 components to 37
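The 50%-of-variance cutoff can be sketched like this (a NumPy stand-in for the R output above; the exact component count depends on the random draw, so it will not necessarily be 37 here):

```python
import numpy as np

rng = np.random.default_rng(5)
Z = np.exp(rng.standard_normal((100, 1000)))   # simulated covariate matrix
Zc = Z - Z.mean(axis=0)

# Eigenvalues of the variance-covariance matrix, via the SVD of the
# centered data (at most 99 are nonzero when n = 100).
s = np.linalg.svd(Zc, compute_uv=False)
eigvals = s ** 2 / (Zc.shape[0] - 1)
cum_frac = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of components whose cumulative variance reaches 50%.
n_comp = int(np.searchsorted(cum_frac, 0.50)) + 1
print(n_comp)
```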
Applying PLS and AFT
Next, we implemented PLS using the same number of components that we chose for PCA
From both PCA and PLS, we obtained the weights on all the genes for each component (open PDF)
We then multiply our original 100 × 1000 matrix by the resulting 1000 × 37 matrix of weights to get a 100 × 37 reduced matrix for both PLS and PCA
Estimating the Survival Function
We took these new matrices and fed them into the AFT model to get our estimated regression coefficients
We can then estimate the survival function, defined as
Ŝ₀(t) = exp(−t exp(−z̄* β̂*))
z̄* is the column-centered original matrix of observations and covariates
β̂* is a matrix produced by multiplying our original simulated regression coefficients by our matrix of obtained weights
−z̄* β̂* becomes a scalar
We know our real survival function is S₀(t) = e^{−λ̄ t}
We repeated this procedure for 5000 iterations
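A sketch of the two curves being compared (the reduced matrix, fitted coefficients, and rate below are hypothetical stand-ins, not values from the study):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical stand-ins: a 100 x 37 reduced covariate matrix and
# AFT coefficients fitted on it.
z_star = rng.standard_normal((100, 37))
beta_star = rng.uniform(-0.1, 0.1, size=37)

z_bar = z_star.mean(axis=0)            # column means of the reduced matrix
rate = np.exp(-z_bar @ beta_star)      # the scalar exp(-z_bar . beta_star)

def S0_hat(t):
    """Estimated baseline survival: exp(-t * exp(-z_bar . beta_star))."""
    return np.exp(-t * rate)

def S0_true(t, lam_bar):
    """True baseline survival for an exponential model: exp(-lam_bar * t)."""
    return np.exp(-lam_bar * t)

print(S0_hat(0.0))                     # prints 1.0: survival starts at 1
```

Both curves are exponentials, so comparing them pointwise over a grid of times (as in the next slide) fully characterizes the estimation error.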
Results
Since we have both the estimated and the real survival function, we can estimate the bias and MSE for PCA, PLS, and the three random matrices
To compare the performance of the dimension reduction techniques, we first partitioned the y-axis of the survival curve into equally spaced sections, u_i for i = 1, . . . , 20
Then, we found the corresponding t_i on the x-axis of the survival curve
For each of the 20 t_i, we summed the bias and MSE at each point to get the distribution of the errors after 5000 iterations
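The grid-and-error bookkeeping can be sketched as follows (the estimator values below are toy noise around the truth; only the partitioning logic mirrors the slide):

```python
import numpy as np

lam_bar = 0.5                           # assumed true exponential rate
u = np.linspace(0.05, 0.95, 20)         # 20 equally spaced survival levels
t = -np.log(u) / lam_bar                # times solving S0(t_i) = u_i

true_curve = np.exp(-lam_bar * t)       # recovers u by construction

# Bias and MSE across 5000 iterations at each grid time.
rng = np.random.default_rng(7)
est_curves = true_curve + 0.01 * rng.standard_normal((5000, 20))
err = est_curves - true_curve
bias = err.mean(axis=0)
mse = (err ** 2).mean(axis=0)
print(bias.shape, mse.shape)            # prints (20,) (20,)
```

Inverting the survival curve (partitioning the y-axis rather than the x-axis) spreads the grid evenly over probability levels, so early and late survival are weighted equally.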
Bias Plot, PCA and PLS
Mean-Squared Error Plot, PCA and PLS
Bias Plot, Random Matrices
Mean-Squared Error Plot, Random Matrices
Bias Plot, All Methods
Mean-Squared Error Plot, All Methods
Discussion
Censoring occurs when the event of interest for a given subject is not observed for some extraneous reason
Naturally, censoring is a problem in real-life investigations and studies.
Unfortunately, we did not have the time to incorporate the effect of censoring in our data simulations.
Furthermore, a complication arose in the generation of the fixed β coefficients; essentially, the R software necessitated generating grossly smaller βs due to the exponent in the survival-curve estimate Ŝ₀(t) = exp(−t exp(−z̄* β̂*)).
An initial goal was to apply our findings to real microarray gene datasets; due to time constraints, this objective was not fulfilled
References
Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.
Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26: 189-206, 1984.
Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572, 1901.
Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (ed.), Multivariate Analysis: 391-420, 1966.
References Cont.
Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.
Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.
Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Mathematical Biosciences 193: 119-137, 2005.
References Cont.
Nguyen, D.V. and D.M. Rocke. On partial least squares dimension reduction for microarray-based classification: A simulation study. Computational Statistics & Data Analysis 46: 407-425, 2004.
Nguyen, T.S. and J. Rojo. Dimension reduction of microarray gene expression data: The accelerated failure time model. Journal of Bioinformatics and Computational Biology 7(6): 939-954, 2009.
Nguyen, T.S. and J. Rojo. Dimension reduction of microarray data in the presence of a censored survival response: A simulation study. Statistical Applications in Genetics and Molecular Biology 8(1), 2009.
Thank You
This research was supported by the National Security Agency through REU Grant H98230-15-1-0048 to The University of Nevada at Reno, Javier Rojo, PI.
We would like to greatly thank the NSA for funding our research this summer.
Thank you all for taking the time to be here and listen to our presentation.