This document compares several dimension reduction techniques for survival analysis when there are many covariates: principal component analysis (PCA), partial least squares (PLS), and three variants of random matrices (RM) based on Johnson-Lindenstrauss embeddings. It simulates 5,000 datasets using the accelerated failure time model and determines the total bias error and mean-squared error between the true and estimated survivor curves for each method. The results indicate that PCA outperforms PLS, the RMs are comparable, and the RMs outdo both PCA and PLS.
This book provides a comprehensive overview of modern statistical methods aimed at overcoming issues that arise when standard statistical assumptions like normality and equal variance are violated. It introduces robust techniques for estimating location, testing hypotheses, computing confidence intervals, comparing groups, detecting outliers, and linear regression. The book is intended to bridge the gap between current robust method developments and practical application, offering an intuitive understanding of why and how standard techniques can mislead and the advantages of modern robust alternatives. It assumes a basic understanding of statistical concepts and methods.
Common statistical tests and applications in epidemiological literature - Kadium
This document provides an overview of common statistical tests and applications in epidemiological literature. It describes the different types of data, including nominal, ordinal and continuous data. It also discusses describing data through distributions and other characteristics. Hypothesis testing and the concepts of null and alternative hypotheses are explained. Types of errors in statistical testing like Type I and Type II errors are defined. Specific statistical tests like the student's t-test and chi-square analysis are outlined along with examples of their applications. Practice questions related to hypothesis testing and p-values are also included.
This document discusses metrics for assessing the predictability and efficiency of covariate-adaptive randomization designs in clinical trials. It proposes measuring predictability using a modified Blackwell-Hodges potential selection bias metric that calculates how well an observer could guess the next treatment assignment. It also considers entropy and periodicity measures. Balance/efficiency is proposed to be measured using Atkinson's method of quantifying the loss of statistical power as an equivalent reduction in sample size due to treatment imbalances within subgroups. The document then outlines a simulation study to compare various randomization methods using these proposed metrics.
Modelling differential clustering and treatment effect heterogeneity in paral... - Karla Hemming
Cluster randomized trials are frequently used in health service evaluation. It is common practice to use an analysis model with a random effect to combine between-cluster information about treatment effects. It is increasingly acknowledged that intervention effects might vary across clusters, or that the variation between clusters might differ across the randomized arms. It has been proposed, for parallel cluster trials as well as stepped-wedge and other crossover designs, that this heterogeneity can be allowed for by incorporating additional random effect(s) into the model. Here we show that the choice of model parameterization needs careful consideration, as some parameterizations for additional heterogeneity induce unnecessary assumptions. We suggest more appropriate parameterizations, discuss their relative advantages, and demonstrate the implications of these model choices using practical examples of a parallel cluster trial and a simulated stepped-wedge trial.
This document provides an introduction and guidelines for linear and multiple regression analyses. It discusses key aspects of each analysis including examining outputs such as model summaries, ANOVA tables, and coefficients. For multiple regression, it recommends a hierarchical approach, entering demographic variables in the first block, extraversion in the second, and narcissism in the third to test if narcissism predicts social media use over and above other factors. The output would show if narcissism explains a significant unique amount of variance in the outcome.
Statistics For Data Analytics - Multiple & logistic regression Shrikant Samarth
Task: To build multiple regression and logistic regression models on appropriate data.
Approach: A general topic was selected first, after which the data was downloaded from the source keeping the restrictions in mind and then cleaned in R. The multiple regression and logistic regression models were then built using IBM SPSS and the outputs were interpreted. The dependent variable was life expectancy and the independent variables were "Age-standardized Mortality - Communicable" and "Age-standardized Mortality - Cardiovascular Disease and Diabetes".
Findings: Multiple regression - the analysis first checked that normality, linearity, multicollinearity, independence of errors, and homoscedasticity were not violated. The model significantly predicted life expectancy at age 60, F(2, 102) = 39.474, R² = .436, p < 0.0005.
Logistic regression: the model explains 58.9% (Cox & Snell R-square) to 80.1% (Nagelkerke R-square) of the variance and correctly classifies 92.4% of countries. Both predictors made a statistically significant contribution to the model. The model also indicates that increases in the "Mortality - Cardiovascular/Diabetes" and "Mortality caused by communicable diseases" variables are associated with a decrease in life expectancy in a country.
Tools: IBM SPSS
The effectiveness of various analytical formulas for estimating R² shrinkage in multiple regression analysis was investigated. Two categories of formulas were identified: estimators of the squared population multiple correlation coefficient (ρ²) and estimators of the squared population cross-validity coefficient (ρc²). The authors compared the effectiveness of the analytical formulas for determining R² shrinkage against the squared population multiple correlation coefficient and the number of predictors; after examining all combinations among the variables, the maximum correlation was selected to compute both categories of formulas. The results indicated that, among the 6 analytical formulas designed to estimate the population ρ², the Olkin and Pratt formula-1 performed best for six variables, followed by the Burket formula and Lord formula-2; these were found to be the most stable and satisfactory among the 9 analytical formulas.
The document outlines the steps to perform the Wilcoxon Signed Rank Test to compare two related samples:
1) Obtain the differences between paired values in two samples and rank the absolute differences.
2) Assign ranks to positive and negative differences and calculate the sum of ranks.
3) Compare the smaller sum of ranks (T) to critical values to determine if the null hypothesis that the samples are identical can be rejected.
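These steps can be reproduced directly with base R's wilcox.test; the paired measurements below are hypothetical and serve only to illustrate the call.

```r
# Hypothetical paired measurements (e.g., before and after a treatment)
before <- c(12.3, 14.1, 10.8, 15.2, 11.8, 13.5, 12.7, 14.8)
after  <- c(11.1, 13.3, 11.0, 13.9, 10.7, 12.9, 13.1, 13.2)

# Wilcoxon signed-rank test: ranks the absolute paired differences and compares
# the signed rank sums against the null hypothesis of identical samples
wilcox.test(before, after, paired = TRUE, alternative = "two.sided")
```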
This document summarizes analysis of variance (ANOVA) methods, including:
1) The basic steps and logic of ANOVA, and how it is used to test for differences between two or more groups.
2) Applying a one-way ANOVA to data from a completely randomized design with at least three groups to test if their means are significantly different.
3) Performing multiple comparisons, like the LSD t-test and SNK q-test, to examine differences between specific group means.
4) Using a two-way ANOVA for a randomized complete-block design to reduce variation between experimental units and test if treatment means differ.
This document describes an analysis of count data from a study on the detection of anthelmintic resistance in gastrointestinal nematodes of small ruminants. The data consists of egg counts from 30 goats and 30 sheep that were grouped into Albendazole, Ivermectin, and control groups. The data was analyzed using Poisson and negative binomial regression models in R software. The Poisson model did not fit the data well due to overdispersion. However, the negative binomial regression model provided a better fit for the overdispersed data. Key findings from the negative binomial regression analysis are summarized.
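A hedged sketch of that modelling step in R is shown below; the simulated counts and group labels are stand-ins for the study's egg-count data.

```r
library(MASS)

# Simulated overdispersed egg counts by treatment group and species (not the study's data)
set.seed(1)
eggs <- data.frame(
  group   = factor(rep(c("Albendazole", "Ivermectin", "Control"), each = 20)),
  species = factor(rep(c("goat", "sheep"), times = 30))
)
mu <- exp(4 + ifelse(eggs$group == "Control", 1, 0))
eggs$count <- rnbinom(nrow(eggs), mu = mu, size = 1.2)

# Poisson fit: residual deviance far above its degrees of freedom signals overdispersion
pois_fit <- glm(count ~ group + species, family = poisson, data = eggs)
c(deviance = deviance(pois_fit), df = df.residual(pois_fit))

# Negative binomial fit absorbs the extra-Poisson variation via a dispersion parameter
nb_fit <- glm.nb(count ~ group + species, data = eggs)
summary(nb_fit)
```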
This document provides an overview of key concepts in statistics, including:
1. Statistics involves the systematic presentation of numerical data to minimize erroneous conclusions when information is incomplete. Induction and deduction are two main methods of assessment, and samples are used to make reasonable conclusions about whole populations.
2. For a sample to be representative, it should be randomly selected, large in size, and stratified if necessary to account for subgroups. Random allocation in experiments helps ensure intervention and control groups are similar.
3. Common statistical terms are defined, including mean, median, mode, range, and standard deviation. Normal distribution and confidence intervals are also explained.
1. Post hoc tests are used in ANOVA to determine which specific group means differ significantly when an omnibus F-test is significant and there are three or more groups.
2. Three common post hoc tests are described - the LSD test, Tukey's HSD test, and Scheffe's test. They differ in how conservative they are and how much they control for Type 1 error from multiple comparisons.
3. Tukey's HSD test is generally recommended when all pairwise comparisons between groups are of interest, as it maintains the familywise error rate while having more statistical power than Scheffe's very conservative test.
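For example, Tukey's HSD is available directly in base R after an omnibus ANOVA; the three-group toy data below are made up for illustration.

```r
# Hypothetical one-way layout with three groups
set.seed(7)
dat <- data.frame(
  y     = c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 6.5)),
  group = factor(rep(c("A", "B", "C"), each = 10))
)

fit <- aov(y ~ group, data = dat)   # omnibus one-way ANOVA
summary(fit)
TukeyHSD(fit)                       # all pairwise comparisons with familywise error control
```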
The document discusses various non-parametric statistical tests that can be used to analyze data when the assumptions of parametric tests are not met. It provides examples of how each test can be used, including chi-square test, binomial test, runs test, Mann-Whitney U test, Kruskal-Wallis test, median test, Wilcoxon test, McNemar test, Friedman test, and Cochran's Q test. For each test, it describes a scenario and states which test should be used to analyze the corresponding data.
The document provides information on the basic principles of experimental design, including replication, randomization, and local control. It then discusses the completely randomized design (CRD) in detail. The CRD allocates treatments randomly across experimental units. It has advantages like maximum use of units and simple analysis, but disadvantages like more experimental error. The document also introduces the randomized block design (RBD) which controls for variation among blocks. The RBD stratifies the experimental area into blocks and allocates treatments randomly within each block.
Austin Statistics is an open access, peer reviewed, scholarly journal dedicated to publishing articles in all areas of statistics.
The aim of the journal is to provide a forum for scientists, academicians and researchers to find the most recent advances in the field of statistics.
Austin Statistics accepts original research articles, review articles, case reports and rapid communications on all aspects of statistics.
This document summarizes key aspects of analysis of variance (ANOVA), including the basic logic and steps of hypothesis testing, different types of ANOVA for different experimental designs, and methods for multiple comparisons. It discusses one-way ANOVA for completely randomized designs and randomized complete-block designs, assumptions of ANOVA, and post-hoc tests like least significant difference and Student-Newman-Keuls tests for comparing group means. Examples are provided to illustrate random assignment of subjects to groups and testing for differences in group means.
This document discusses chi square distribution and its use in analyzing frequency data. Chi square tests can be used to test goodness of fit, independence, and homogeneity. It provides examples of chi square tests for goodness of fit to determine if sample data fits a theoretical distribution, and tests of independence to determine if two classification criteria are independent. The document also outlines the steps for conducting chi square tests, including calculating test statistics, determining degrees of freedom, and comparing results to critical values to reject or fail to reject the null hypothesis.
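Both uses map onto chisq.test in base R; the frequencies below are invented purely to show the two calls.

```r
# Goodness of fit: do observed counts match hypothesized proportions?
observed <- c(44, 56, 50)
chisq.test(observed, p = c(1/3, 1/3, 1/3))

# Test of independence: are two classification criteria associated?
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))
chisq.test(tab)
```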
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... - cambridgeWD
Clinical trials and health outcomes research differ in important ways that impact statistical modeling approaches. Clinical trials typically use homogeneous samples and focus on a single endpoint, while health outcomes data is heterogeneous with multiple endpoints. Predictive modeling techniques used in health outcomes research, like those in SAS Enterprise Miner, are better suited than traditional methods as they can handle complex real-world data without strong assumptions and more accurately predict rare events. Validation of models on separate test data is also important for generalizing results.
1. Statistical tests are used in fisheries science to test hypotheses and make quantitative decisions about fisheries processes. Common statistical tests include correlation tests, comparison of means tests, regression analyses, and hypothesis tests.
2. The appropriate statistical test to use depends on the research design, data distribution, and variable type. Parametric tests are used for normally distributed data, while non-parametric tests are used when assumptions are not met.
3. Accuracy of statistical tests relies on quality survey data. Both fishery-dependent and fishery-independent data are important, though confounding factors must be considered with dependent data. Proper study design and use of statistics allows prediction of fish production.
Chi square test- a test of association, Pearson's chi square test of independence, Goodness of fit test, chi square test of homogeneity, advantages and disadvantages of chi square test.
Since they were discovered for their antimicrobial activity, parabens have been widely used in many cosmetics, pharmaceuticals, personal care products and food, among other consumer products. After human consumption, these compounds reach wastewater treatment plants, where they are not efficiently removed, and end up in the environment; they also enter it through the direct discharge of detergents, soaps or other products that may contain these compounds in their formulation. This concern has significantly boosted the number of publications in the recent literature, although the number of papers on aqueous samples is still much higher than on solid matrices, probably due to the complexity of the latter. In this work, the accuracy of a newly developed analytical method for the determination of emerging pollutants (methylparaben (MeP)) in sediment samples has been assessed using recovery assays and evaluated by means of the confidence ellipse.
When designing a clinical study, a fundamental aspect is the sample size. In this article, we describe the rationale for sample size calculations, when they should be performed, and the components necessary to calculate them. For simple studies, standard formulae can be used; however, for more advanced studies, it is generally necessary to use specialized statistical software programs and consult a biostatistician. Sample size calculations for non-randomized studies are also discussed and two clinical examples are used for illustration.
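For the simple two-group cases, the standard formulae are implemented in base R's power functions; the effect sizes below are illustrative assumptions, not values from the article.

```r
# Sample size per group to detect a mean difference of 5 units (SD = 10)
# with 80% power at a two-sided 5% significance level
power.t.test(delta = 5, sd = 10, power = 0.80, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")

# Sample size per group for comparing two proportions (e.g., 60% vs 45%)
power.prop.test(p1 = 0.60, p2 = 0.45, power = 0.80, sig.level = 0.05)
```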
August 1, 2010. Design of Non-Randomized Medical Device Trials Based on Sub-Classification Using Propensity Score Quintiles, Topic Contributed Session on Medical Devices, (Greg Maislin and Donald B Rubin). Joint Statistical Meetings 2010, Vancouver Canada.
The document discusses nonparametric tests that can be used when the data distribution is unknown or non-normal. It provides examples of the Wilcoxon signed-rank test to compare two related samples, the Wilcoxon rank-sum test to compare two independent samples, the Kruskal-Wallis H test to compare more than two independent samples, and the Friedman test to compare blocks of data. Multiple comparison tests are also discussed to determine the specific groups that differ when overall differences are found.
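As a small illustration, the two-sample and k-sample rank tests mentioned above are one-liners in base R; the data are simulated placeholders.

```r
set.seed(3)
g1 <- rexp(12, rate = 1.0)
g2 <- rexp(12, rate = 0.7)
g3 <- rexp(12, rate = 0.5)

wilcox.test(g1, g2)             # Wilcoxon rank-sum test, two independent samples
kruskal.test(list(g1, g2, g3))  # Kruskal-Wallis H test, more than two independent samples
```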
P-Value: a true test of significance in agricultural research - Jiban Shrestha
This document discusses the use of p-values and significance levels in statistical analysis. It explains that p-values represent the probability of obtaining results at least as extreme as the observed results of a study, given that the null hypothesis is true. A lower p-value indicates stronger evidence against the null hypothesis. By convention, p-values of 0.05 or lower are considered statistically significant. The document cautions that statistical significance does not necessarily imply practical or clinical significance. It also discusses the concept of least significant difference tests and notes some limitations of relying solely on p-values to guide decisions.
Every year world and corporate leaders pretend to care about the state of the planet -- yet nothing changes. What if there was a single event that would change the world for the better? On the agenda: honesty in the age of cowardice, what to do with surplus population, subservience to a higher calling and when He rises.
Presentation given on January 21, 2016 at the World Economic Forum in Davos by Eminence Waite, campaign manager for Cthulhu for America.
The document compares the performance of principal component analysis (PCA), partial least squares (PLS), and random matrices (RM) for dimensionality reduction in survival analysis. It uses simulated datasets in R to analyze the bias and mean-squared error between true and estimated survivor curves for each method. PCA aims to maximize covariance and correlation of predictor variables, PLS additionally maximizes covariance between predictors and responses, and three flavors of RM are inspired by Johnson-Lindenstrauss embeddings. The results show that PCA outperforms PLS, RM performance is comparable to PCA and PLS, and in some cases RM outperform the other methods for reducing dimensionality while minimizing information loss.
This document provides information about Adam Lovinus and his skills and experience as a copywriter. It summarizes that he has experience writing about technology, business, arts, and parenting. As a trained journalist, he takes an approach that blends straight reporting with content marketing. Examples of his work include managing the blog HardBoiled for NeweggBusiness, where he publishes about 3,000 words of content per week and drives traffic through social media.
Eminence Waite argues that fear is the most effective form of social control but it is no longer enough on its own given increasing instability. The current electoral choices in America are dysfunctional and eroding trust in government. Waite proposes that the only solution is a massive depopulation on a global scale, suggesting sacrificing the "surplus population" to the entities worshipped by a secretive doomsday cult that Waite claims has been operational for 6,000 years and is now well-positioned to take advantage of the impending apocalyptic events foretold in their prophecies. Waite invites the Bilderberg group to officially join this cult and profit from the inevitable next evolution of society.
Ivan Rodriguez reflects on how his philosophy of tutoring mathematics has changed since beginning work at the THINK TANK tutoring center. Initially, he saw his role as replacing professors, but now understands his role is to be a resource, filling gaps and encouraging learning strategies. This shift occurred due to training sessions, assigned readings, and tutoring experiences. Training emphasized each student's unique abilities and the importance of problem-solving creativity. Readings showed the value of understanding concepts' applications and teaching methods rather than just providing answers. Experiences, like failing to help one student but then improving, reinforced the importance of preparation and collaboration.
This document describes a study that aimed to improve existing statistical methods for analyzing ordinal categorical data from genomic studies. The researchers sought to better match genomic data to diagnoses by refining the proportional odds model and applying it to a new dataset on Dravet syndrome. Specifically, they modified the proportional odds model by refining the latent variable and fine-tuning the null hypothesis to address limitations of the original model in practice, such as violations of the proportional odds assumption.
Selim Hesham El Zien is seeking a full-time job in Egypt. He has over 10 years of experience in retail management and customer service roles. His experience includes positions as an assistant store manager, IT section manager, and retail sales roles. He has strong computer skills and is fluent in English and Arabic.
This document describes a statistical analysis of genetic disease data. The objectives are to improve an existing technique called the proportional odds model to better match genetic data to diagnoses, and to apply the improved model to a new dataset. The key methods discussed are refining the proportional odds model by modifying the latent variable and null hypothesis, using a score function to evaluate model performance, and conducting simulations to assess type I error rates and power.
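A basic proportional odds fit of the kind being refined can be sketched with MASS::polr; the bundled housing data stand in for the genomic dataset, so this only illustrates the model form, not the study's refinements.

```r
library(MASS)

# Proportional odds (ordinal logistic) model for an ordered outcome
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
summary(fit)

# Likelihood ratio test against the intercept-only null model
null <- polr(Sat ~ 1, weights = Freq, data = housing)
anova(null, fit)
```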
The document describes techniques for reducing the dimensionality of large datasets, including principal component analysis (PCA), partial least squares (PLS), and random matrices (RMs). It presents these methods in the context of survival analysis using an accelerated failure time model. Simulation results show that PCA outperforms PLS, RMs perform comparably to PCA, and RMs can outperform PCA and PLS in certain situations. The document uses a running example of a gene expression microarray with 100 individuals and 1,000 genes to explore cancer outcomes.
This document provides a curriculum vitae for Notion Gombe, who has a Masters of Public Health degree from the University of Zimbabwe. It details his education background and qualifications, as well as his extensive work experience in epidemiology and public health in Zimbabwe over the past 15+ years. This includes roles coordinating MPH research projects and field placements, as well as positions as an environmental health officer and consultant for various public health programs and surveys. It also lists his publications, training, areas of expertise, and participation in scientific conferences both within and outside of Africa.
Janhavi Mishra is a Test Automation Engineer with over 2 years of experience working with Amdocs. She has expertise in test automation, including tools like Selenium Webdriver, UFT, Unix, Linux, and SQL. Some of her key skills and responsibilities include automation testing, regression testing, integration testing, unit testing, and shell scripting. She has received several appreciation emails for finding critical bugs and delivering projects on time. Janhavi holds an M.C.A. in Computer Science and seeks to further contribute her technical skills and experience.
This document presents a comparison of dimension reduction techniques for survival analysis, including principal component analysis (PCA), partial least squares (PLS), and random matrix approaches. Simulation data with 100 observations and 1000 covariates was generated to test the ability of each method to minimize bias and mean squared error in estimating survival functions. PCA and PLS were able to capture 50% of the variance by reducing the dimensions to 37. The estimated survival functions were compared to the true function over 5000 iterations. PLS had the lowest bias and mean squared error, followed by PCA, with the random matrix approaches performing worse.
Survival analysis is an important method for analyzing time-to-event data in biomedical and reliability applications. It is often done with semiparametric methods, e.g., the Cox proportional hazards model. In this presentation I discuss an alternative parametric approach to survival analysis that can overcome some of the limitations of the Cox model and provide additional flexibility to the modeler. This approach may also be justified from a Bayesian perspective, and the connection is shown as well. Simulations and case studies that illustrate the flexibility of the GAM approach for survival analysis and its equivalent performance to existing methods for survival data are discussed in the text.
The material presented herein are based on two publications:
1) Argyropoulos C, Unruh ML. Analysis of time to event outcomes in randomized controlled trials by generalized additive models. PLoS One. 2015 Apr 23;10(4):e0123784. doi: 10.1371/journal.pone.0123784. PMID: 25906075; PMCID: PMC4408032.
2)Bologa CG, Pankratz VS, Unruh ML, Roumelioti ME, Shah V, Shaffi SK, Arzhan S, Cook J, Argyropoulos C. High performance implementation of the hierarchical likelihood for generalized linear mixed models: an application to estimate the potassium reference range in massive electronic health records datasets. BMC Med Res Methodol. 2021 Jul 24;21(1):151. doi: 10.1186/s12874-021-01318-6. PMID: 34303362; PMCID: PMC8310602.
Statistical Methods to Handle Missing Data - Tianfan Song
1) The document compares listwise deletion and multiple imputation approaches for handling missing data. Listwise deletion is the traditional default approach but can result in large data loss, while multiple imputation is a modern approach that addresses this issue.
2) An example using blood pressure data demonstrates the three types of missing data - MCAR, MAR, MNAR - and compares results from listwise deletion and multiple imputation. Multiple imputation should be used when the type of missingness is unknown.
3) The document reviews traditional approaches like listwise deletion and single imputation and modern approaches like multiple imputation and maximum likelihood for handling missing data. Multiple imputation is presented as an improvement over traditional single imputation methods.
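As a sketch of the modern approach, multiple imputation followed by pooling under Rubin's rules can be run with the mice package; the built-in airquality data (which has genuinely missing values) stands in for the blood-pressure example.

```r
library(mice)

# Create five imputed datasets
imp <- mice(airquality, m = 5, seed = 1, printFlag = FALSE)

# Fit the analysis model on each imputed dataset and pool the results
fits   <- with(imp, lm(Ozone ~ Wind + Temp))
pooled <- pool(fits)
summary(pooled)
```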
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides - NUI Galway
Adam Ralph from the Irish Centre for High End Computing presented this Introduction to Basic R during the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on the 14th November 2013
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING - IJDKP
This article advances our understanding of regression-based data mining by comparing the utility of Least Absolute Value (LAV) and Least Squares (LS) regression methods. Using demographic variables from U.S. state-wide data, we fit variable regression models to dependent variables of varying distributions using both LS and LAV. Forecasts generated from the resulting equations are used to compare the performance of the regression methods under different dependent-variable distribution conditions. Initial findings indicate that LAV procedures forecast better in data mining applications when the dependent variable is non-normal. Our results differ from those found in prior research using simulated data.
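In R, LAV regression corresponds to median (quantile) regression, available through the quantreg package, so the LS-versus-LAV comparison can be sketched as follows on simulated heavy-tailed data (not the article's state-wide data).

```r
library(quantreg)

# Simulated data with heavy-tailed (non-normal) errors
set.seed(11)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rt(100, df = 2)

ls_fit  <- lm(y ~ x)              # least squares
lav_fit <- rq(y ~ x, tau = 0.5)   # least absolute value (median) regression

# Compare out-of-sample forecasts by mean absolute error
x_new <- runif(50, 0, 10)
y_new <- 2 + 0.5 * x_new + rt(50, df = 2)
mae <- function(pred) mean(abs(y_new - pred))
c(LS  = mae(predict(ls_fit,  newdata = data.frame(x = x_new))),
  LAV = mae(predict(lav_fit, newdata = data.frame(x = x_new))))
```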
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... - cambridgeWD
This document discusses the differences between clinical trials and health outcomes research. Clinical trials use homogeneous samples, surrogate endpoints, and focus on a single outcome. They are also typically underpowered for rare events. Health outcomes research uses heterogeneous data from the general population to examine multiple real endpoints simultaneously. It has larger samples and data that allow analysis of rare occurrences. Predictive modeling is better suited than traditional statistical methods for analyzing heterogeneous health outcomes data due to relaxed assumptions like normality.
This document compares different dimensionality reduction techniques for survival analysis, including principal component analysis (PCA), partial least squares (PLS), and random matrices (RM). It simulates datasets using R and applies the techniques to analyze survival curves. The results found that PCA outperformed PLS, and that all three variants of RM were comparable and superior to PCA and PLS. The document suggests this unexpected outcome may relate to limitations of R or not incorporating censored data, and recommends further exploring the techniques on real datasets.
Use Proportional Hazards Regression Method To Analyze The Survival of Patient... - Waqas Tariq
The Kaplan-Meier method is used to analyze data based on survival time. This paper uses the Kaplan-Meier procedure and Cox regression with the following objectives: finding the percentage of survival at any time of interest, comparing the survival times of two studied groups, and examining the effect of continuous covariates on the relationship between an event and possible explanatory variables. The variables (age, gender, weight, drinking, smoking, district, employer, blood group) are used to study the survival of patients with stomach cancer. The data in this study were taken from Hiwa Hospital in the Sulaymaniyah governorate during a period of 48 months, from 1/1/2010 to 31/12/2013. After applying the Cox model, we estimated the parameters of the model using the partial likelihood method and then tested the variables using the Wald test; the results show that the variables age and weight influence survival time.
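The corresponding workflow in R's survival package looks roughly like the block below; the bundled lung dataset stands in for the hospital data, so the variables are illustrative.

```r
library(survival)

# Kaplan-Meier curves by group and a log-rank comparison
km <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km, times = c(180, 365))                 # survival estimates at times of interest
survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model; coefficients are estimated by partial likelihood
# and individual covariates are tested with Wald statistics
cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
summary(cox)
```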
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA... - IJDKP
This document summarizes an algorithm called Principal Component Outlier Detection (PrCmpOut) for identifying outliers in high-dimensional molecular descriptor datasets. PrCmpOut uses principal component analysis to transform the data into a lower-dimensional space, where it can more efficiently detect outliers using robust estimators of location and covariance. The properties of PrCmpOut are analyzed and compared to other robust outlier detection methods through simulation studies using a dataset of oxazoline and oxazole molecular descriptors. Numerical results show PrCmpOut performs well at outlier detection in high-dimensional data.
Logistic Loglogistic With Long Term Survivors For Split Population Model - Waqas Tariq
Split population models are also known as mixture models. The data used in this paper are the Stanford Heart Transplant data: survival times of potential heart transplant recipients from their date of acceptance into the Stanford Heart Transplant program [3]. This set consists of the survival times, in days, uncensored and censored, for the 103 patients; 3 covariates are considered (age of patient in years, surgery, and transplant), and failure for these individuals is death. Covariate methods have been examined quite extensively in the context of parametric survival models, for which the distribution of the survival times depends on the vector of covariates associated with each individual. See [6] for approaches which accommodate censoring and covariates in the ordinary exponential model for survival. Currently, such mixture models with immunes and covariates are in use in many areas such as medicine and criminology; see for example [4][5][7]. In our formulation, the covariates are incorporated into a split loglogistic model by allowing the proportion of ultimate failures and the rate of failure to depend on the covariates and the unknown parameter vectors via a logistic model. Within this setup, we provide simple sufficient conditions for the existence, consistency, and asymptotic normality of a maximum likelihood estimator for the parameters involved. As an application of this theory, the likelihood ratio test for a difference in immune proportions is shown to have an asymptotic chi-square distribution. These results allow immediate practical applications on the covariates and also provide some insight into the assumptions on the covariates and the censoring mechanism that are likely to be needed in practice. Our models and analysis are described in section 5.
Large datasets are not available for some diseases, such as brain tumor. This presentation and the part-2 presentation show how to find an actionable solution from a difficult cancer dataset.
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues - nQuery
This document discusses several case studies of dealing with complex study design issues in clinical trials, including non-proportional hazards, cluster randomization, and three-armed trials. The agenda outlines topics on non-proportional hazards modeling and sample size considerations, cluster randomized and stepped-wedge designs, and methods for analyzing data from three-armed trials that include experimental, reference, and placebo groups. Worked examples are provided to illustrate sample size calculations and statistical approaches for each of these complex trial design scenarios.
This document discusses different methods for analyzing survival data in clinical trials, including Kaplan-Meier survival analysis and restricted mean survival time (RMST) analysis. It reviews literature on survival analysis concepts and applications. The document also notes limitations of Kaplan-Meier analysis when data does not satisfy proportional hazards assumptions or when patients are lost to follow up. RMST is presented as an alternative to estimate mean survival times without these limitations. The document then applies different survival analysis methods to a dataset to compare results.
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb... - sipij
This document summarizes a method for detecting maxillofacial pathology in 3D CT medical images using an extended a contrario approach combined with fuzzy logic. The method models samples using the Fisher distribution and applies a Fisher test to detect significant changes between a normal sample and one containing a patient. P-values from three measures are combined using fuzzy logic to provide a decision on pathology with a degree of uncertainty. The method was able to detect pathological areas in a test patient but also regions requiring further investigation, showing performance and leaving room for physician exploration.
Projecting ‘time to event’ outcomes in technology assessment: an alternative ... - cheweb1
This document discusses alternative methods for projecting survival outcomes in technology assessments beyond what is observed in clinical trials.
The standard method of fitting parametric survival functions to trial data and extrapolating is problematic as it assumes a single mechanism and does not account for trial design or changes in risk over time. LRiG proposes examining trial data to understand risk trajectories and formulating hypotheses based on clinical context rather than selecting a model solely on fit. A case study demonstrates modeling progression-free survival, post-progression survival, and overall survival as separate phases using exponential convolution functions. LRiG advocates understanding empirical data and developing more informative multi-phase models rather than relying on standard projections.
- The document describes two Markov models (Muenz-Rubinstein and Azzalini) for analyzing health condition data from the Health and Retirement Survey (HRS)
- It fits both models to HRS data to predict health conditions (dependent variable) based on age, gender, and BMI (independent variables)
- Results show Azzalini's model was more efficient at predicting health conditions compared to Muenz-Rubinstein based on the estimated efficiencies of each model
This document introduces robust estimation techniques in R. It discusses robust methods for estimating location and scale parameters that are resistant to outliers. These include trimmed means, medians, M-estimators, and high breakdown point estimators. It also covers robust regression methods like M-estimators, weighted likelihood, and bounded influence estimators. Examples are provided using real data to illustrate robust versus non-robust estimates and how influential observations can be identified.
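A few of the estimators mentioned, sketched on a deliberately contaminated toy sample (not the document's real-data examples).

```r
library(MASS)
set.seed(5)

# Toy sample with two gross outliers
x <- c(rnorm(48, mean = 10, sd = 2), 60, 75)
mean(x)               # ordinary mean, pulled toward the outliers
mean(x, trim = 0.2)   # 20% trimmed mean
median(x)             # high breakdown point (0.5)
huber(x)$mu           # Huber M-estimator of location

# Robust regression via an M-estimator, compared with ordinary least squares
d <- data.frame(x = 1:30, y = 3 + 0.8 * (1:30) + rnorm(30))
d$y[c(5, 25)] <- d$y[c(5, 25)] + 20           # inject influential outliers
coef(lm(y ~ x, data = d))
coef(rlm(y ~ x, data = d))                    # MASS::rlm, Huber M-estimation by default
```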
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS - BRNSS Publication Hub
This document summarizes research investigating the suitability of cointegration tests on time series data of different orders. The researchers used simulated time series data from normal and gamma distributions at sample sizes of 30, 60, and 90. Three cointegration tests (Engle-Granger, Johansen, and Phillips-Ouliaris) were applied to the data. The tests were assessed based on type 1 error rates and power to determine which test was most robust for different distributions and sample sizes. The results indicated the Phillips-Ouliaris test was generally the most effective at determining cointegration across different sample sizes and distributions.
This document provides a tutorial on Bayesian model averaging (BMA). BMA accounts for model uncertainty by averaging over multiple models, weighted by their posterior probabilities. Standard statistical practice selects a single best model, ignoring model uncertainty. BMA offers improved predictive performance over any single model. However, implementing BMA presents challenges including an enormous number of terms to average over and difficult integrals. The document discusses methods for managing the summation and computing the necessary integrals, including Occam's window, Markov chain Monte Carlo model composition, and stochastic search variable selection. Examples are also provided to demonstrate the application and benefits of BMA.
Survival Analysis Dimension Reduction Techniques
A Comparison of Select Methods
Claressa L. Ullmayer and Iván Rodríguez
Abstract
Although formal studies across many fields may yield copious data, those data can often be collinear (redundant) in terms of explaining particular outcomes. Dataset dimensionality reduction therefore becomes imperative for facilitating the explanation of phenomena given abundant covariates (independent variables). Principal Component Analysis (PCA) and Partial Least Squares (PLS) are established methods for obtaining components (linear combinations of the original variables, defined by the eigenvectors of the given data's variance-covariance matrix) such that the covariance and correlation are maximized between linear combinations of predictor and response variables. PCA employs orthogonal transformations of the covariates to reduce dataset dimensionality by producing new, uncorrelated variables. PLS instead projects both predictor and response variables into a new space to model their covariance structure. In addition to these standard procedures, three variants of Johnson-Lindenstrauss low-distortion Euclidean-space embeddings (random matrices, RM) were also investigated. Each technique's performance was explored by simulating 5,000 datasets using the R statistical software. The semi-parametric Accelerated Failure Time (AFT) model was used to obtain predicted survivor curves. Total bias error (BE) and mean-squared error (MSE) between the true and estimated survivor curves were then determined to find the error distributions of all methods. The results herein indicate that PCA outperforms PLS, the RMs are comparable to one another, and the RMs outdo both PCA and PLS.
Keywords: survival analysis; dimension reduction; big data; principal com-
ponent analysis (PCA); partial least squares (PLS); Johnson-Lindenstrauss
(JL); random matrices; accelerated failure time (AFT); bias; mean-squared
error.
1 Introduction
Throughout various studies, researchers are able to associate covariates to a set of
observations. From here, analysts would naturally seek to explain the relationship
between the two with regard to a given set of phenomena. Methods such as the
Cox Proportional Hazards (CPH) and the Accelerated Failure Time (AFT) models
have been proposed with this intent in mind (Cox, 1972). However, to successfully
utilize both approaches, it is necessary to have more observations than covariates.
Depending on the context, this property may not initially be satisfied, thus ren-
dering both methods inept. One example of this complication arises in common-
place microarray gene expression data. In this situation, there are often fewer
observations (patients) than covariates attributed to them (genes). As a result, it
becomes imperative to reduce the dimensionality of the dataset and then apply a
suitable regression technique thereafter to understand the underlying relationships
between the predictor and response variables. Reducing the original dataset's
dimensionality naturally implies a loss of information; thus, a
favorable dimension reduction technique will minimize the loss of relevant
information.
With this in mind, numerous dimension-reduction techniques have been developed to meet
this end. In this investigation, the methods of Principal Component Analysis
(PCA), Partial Least Squares (PLS), and three variants of Johnson-Lindenstrauss
inspired Random Matrices (RM) will be compared (Johnson, Lindenstrauss, 1984).
The first approach, PCA, originated and was described by Pearson (1901). PLS
was first rigorously introduced and explained by Wold (1966). Then, the three
variants of RMs were constructed according to specifications of Achlioptas (2003)
and Dasgupta-Gupta (2003). This research was motivated in part by the results
attributed to Nguyen and Rocke (2004) and Nguyen (2005) regarding the perfor-
mance of PCA vis-à-vis PLS. Furthermore, the works of Nguyen and Rojo (2009)
with respect to the performance of PLS variants and Nguyen and Rojo (2009) in
regard to a multitude of reduction and regression approaches were utilized in this
inquiry.
Typically, the Cox PH model has been the standard model in this applica-
tion. In this paper, however, the AFT model was employed. Random datasets
were first generated using the statistical software suite R. For a given amount of
these datasets, there was a constant and true survivor function attributed to them.
From here, the three dimension reduction techniques were employed on the sim-
ulated datasets. Then, the AFT model was used primarily to generate a predicted
survivor function. Bias and mean-squared error between the real and estimated
curves were then calculated for a partition of fixed time values.
2 Survival Analysis
Before any serious discussion of the current work can begin, a familiarity with the
area known as survival analysis must first be cultivated. In a sentence, survival
analysis employs various methods to analyze data where the response variable is
a time until an unambiguous event of interest occurs (Despa). This event must be
rigorously defined—some examples include birth, death, marriage, divorce, job
termination, promotion, arrests, revolutions, heart attack, stroke, metastasis, and
winning the lottery, to name a few (Ross).
Depending on the research domain, this wide field has many monikers. It is
referred to as failure time analysis, hazard analysis, transition analysis, duration
analysis, reliability theory/analysis in engineering, duration analysis/modeling in
economics, and event history analysis in sociology (Allison). At the time of this
investigation, ‘survival analysis’ serves as the umbrella term for all the aforemen-
tioned epithets.
Survival analysis is borne out of the desire to overcome some limitations pre-
sented in standard linear regression approaches (Despa). One of the two imme-
diate complications that survival analysis can successfully address is data where
responses are all positive values—exempli gratia, survival times that range from
t ∈ (0, ∞) (Despa). Secondly, survival analysis can grapple with censored data.
After the event of interest within a particular investigation has been rigorously
declared, an observation is branded as ‘censored’ if the special event was not ob-
served. This can occur due to a plethora of reasons. A common one involves a
patient in a clinical trial dropping out of the study. In this case, it is unknown
how much longer it may have taken for that individual to experience the partic-
ular event of interest. Another example of censoring in the real world involves
observations that do not experience the special event upon the end of a formal
investigation. That is, an individual managed to not express the event of interest
for the whole duration of a study, so they are necessarily labeled as censored.
With this ubiquitous term broadly explained, it is also necessary to understand
that many forms of censoring exist. Typically, most data are ‘right-censored’. This
term signifies observations that have the potential to experience the declared event
of interest after—or to the right in a time-line—of the time they became censored.
For instance, take an individual with a stage of cancer and declare the event of
interest to be death. Then, if this person becomes censored, the event of interest is
naturally bound to occur after the time they became censored. In a similar manner,
‘left-censored’ data occurs when the event of interest occurred before the specific
time a formal investigation began (Lunn). Understandably, this phenomenon is
less commonplace in reality. An example of left-censored data involves providing
a questionnaire to mothers inquiring whether or not they are actively breastfeed-
ing (Vermeylen). Left-censoring would occur if a mother entered the study and
had hitherto stopped breastfeeding. Finally, a third type is known as ‘interval cen-
soring’. This might be observed in a case where clinical follow-ups are necessary.
For a datum to be interval-censored, the event of interest would have to be ob-
served within an interval between two successive follow-ups (Sun).
Survival analysis is a prominent regression approach because it can success-
fully incorporate both censored and uncensored data when modeling the relation-
ship between predictors and responses (Despa). Typically, the response variables
will have at least both a survival time and censoring status associated with them.
From here, methods exist to estimate both survival and hazard functions that fa-
cilitate the interpretation of the distribution of survival times (Despa).
Survivor curves give the probability that the event of interest has not yet been
experienced by a particular time. Rigorously,
S(t) = P(T > t) = ∫_t^∞ f(τ) dτ = 1 − F(t),
where S(t) denotes the survivor function, t is a fixed time, T is a random variable,
f(τ) is the probability density function of T, and F(t) is the cumulative distribu-
tion function of T.
The hazard, on the other hand, is defined as the rate at which events happen
(Duerden). Thus, the probability of an event happening within a small time interval
is approximately this hazard rate multiplied by the length of the interval (Duerden).
Additionally, the hazard function describes the rate at which an observation
experiences the event of interest at a particular time, given that the observation
has already survived—that is, has not experienced the event of
interest—up to the specified time (Duerden). In precise terms, the hazard function is
defined as
h(t) = f(t) / S(t),

where f(t) denotes the probability density function and S(t) represents the
survival function given a random variable T. From this expression, it is imme-
diately possible to understand the intricate relationship between distribution, sur-
vival, and hazard functions. As a result, many other expressions exist aside from
this rather simplistic form.
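As a concrete example (the same form used for the simulated data later in this paper), if T follows an exponential distribution with rate λ, then f(t) = λe^(−λt), S(t) = e^(−λt), and h(t) = f(t)/S(t) = λ, so the hazard is constant in time.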
A natural thought that may arise within survival analysis is whether results
involving survivor curves or hazard functions are desired. In many contexts,
researchers prefer survivor curves in order to interpret the results of their
gathered data. Arguably, since these curves output a probability in response to an input
time, it becomes easier to comprehend trends and relationships than by doing
so via the hazard. Furthermore, hazard functions and hazard rates are based on ratios
of probability density functions and survival curves; this makes hazard results
more difficult to digest and understand.
Aside from these considerations, there is also another factor involved in sur-
vival analysis to cognize: the selection of methods that can be utilized to relate
predictor variables and the resulting survival times. The three main forms to
achieve this end include parametric, semiparametric, and nonparametric models
(Despa). These differ in the assumptions being made on the given data.
Parametric approaches make the prime assumption that the distribution of the
survival times follows a known probability distribution (Despa). For example,
these can include the exponential and compound exponential, Weibull, Gompertz-
Makeham, Rayleigh, gamma and generalized gamma, log-normal, log-logistic,
generalized F, and the Coale-McNeil models (Rodriguez, 2010). For these and
other applicable methods, model parameters are estimated via an adaptation of
maximum likelihood (Despa). In parametric techniques, fixed relationships are
imposed between f(t), F(t), S(t), and h(t) (Cook).
In contrast, a nonparametric model does not assert as many relatively bold
assumptions. For instance, linearity and a smooth regression function is not nec-
essary in a nonparametric context (Fox). Although this provides a researcher with
much more flexibility, interpretation can oftentimes become more difficult.
A semiparametric model makes weaker assumptions on the error attributed to the
regression model: the errors are taken to be uncorrelated and identically
distributed, but their distribution need not be fully specified. In addition, a model
of this form does not presume that the baseline hazard function has a particular
‘shape’ attributed to it. More generally, when a model combines both parametric
and nonparametric assumptions, it is appropriately described as being
semiparametric in nature.
These three types of regression models are rigorously represented below. Let
n denote the number of observations, Y represent the response variable, X signify
the matrix of predictors, and let β be the regression coefficients with errors εi.
Additionally, let m(·) = E(yi | xi) for i = 1, . . . , n.
A parametric model can be expressed as
yi = xiᵀβ + εi, i = 1, . . . , n.
In this case, the resulting curve is smooth and known. Furthermore, it is described
by a finite set of parameters which will need to be estimated. Ultimately, interpre-
tation is simple through this approach.
Then, for a nonparametric method,
yi = m(xi) + εi, i = 1, . . . , n.
Here, function m(·) is also smooth and flexible, yet it is now unknown. Further-
more, the interpretation of such a curve becomes ambiguous.
Lastly, in the case where a model is classified as semiparametric, we observe
that
yi = xiᵀβ + mz(zi) + εi, i = 1, . . . , n.
As previously mentioned, some parameters are necessarily estimated while some
will be determined through the given data.
3 Methods
The main methods employed in this investigation were centered on different ways
of performing dimension reduction. These methods were: Principal Component
Analysis (PCA), Partial Least Squares (PLS), and a set of three distinct Random
Matrices (RM). For each method, the AFT model was employed primarily to gen-
erate survivor curve estimates. These methods will be discussed in greater detail
here.
3.1 Dimension Reduction
The central goal of the three aforementioned dimension reduction techniques is to
reduce a dataset with n observations and p covariates to a new dataset of dimen-
sions n × k such that k ≪ p. Additionally, a competent method will achieve this
end while retaining an acceptable amount of relevant data and omitting relatively
collinear variables.
Both PCA and PLS reduce dimensionality through orthogonal transformations
of covariates; then, a subset of these is retained such that these new covariates pre-
dict the response with a satisfactory caliber of precision. Meanwhile, RM differs
from these two procedures by generating a matrix with certain qualities that also
reduces dimensionality.
To facilitate the explanation of these reduction techniques, pertinent notation
will first be introduced.
3.1.1 Notation
Let X be the n × p column-centered matrix such that n and p denote given obser-
vations and covariates, respectively. Also, let n ≪ p. Furthermore, let Y be the
n × q matrix of observed responses.
In the microarray gene dataset example, n would represent the number of pa-
tients while p would denote the amount of observed genes attributed to them.
Thus, X would be a matrix that contains particular patients on the rows and their
respective genes on the columns. Additionally, Y would serve as an n × 1 vector
of survival times.
3.1.2 Principal Component Analysis
PCA reduces dataset dimensionality through orthogonal components obtained by
maximizing the variance between linear combinations of the original predictors
contained in X. More precisely, k weight vectors or ‘loadings’ w are constructed
such that the rows of X map to principal component scores t; for the n-th observation
and k-th loading vector, tnk = xn · wk.
Ultimately, X can be completely decomposed into its components as follows:
T = XW.
Here, X has original dimensions n×p, W has dimensions p×p, and T, therefore,
has dimensions n × p as expected. Additionally, the columns of W contain the
eigenvectors of XᵀX.
From here, a desired amount of the resulting orthogonal components is cho-
sen. These are then referred to as ‘principal components’ since they are chosen in
order to maximize the variability along each direction of the new and reduced set
of axes. What this transformation accomplishes, in other words, is that it projects
the original data cloud into a new coordinate system via rotations of the initial
coordinate system such that variability of the initial data is maximized along each
direction. Additionally, PCs are ranked according to how much variance they
account for in their respective directions. That is, the PCs with the largest eigen-
values are ranked the highest and represent a sizable portion of the data since
variability is greatest along its eigenvector’s direction.
It is imperative to note that the chosen PCs obtained from PCA rely on op-
erations performed on X, the given dataset matrix. Thus, the response variable
Y is not taken into account during this particular dimension reduction algorithm.
Consequently, these PCs may not be laudable predictors of the response variable
in a given context. Due to this property of PCA, it is often referred to as an ‘un-
supervised’ technique.
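As a minimal illustration of this unsupervised reduction (a sketch only; the simulation study later in this paper uses the PCA function from FactoMineR rather than the base-R routine shown here, and the toy matrix below is purely illustrative), the first k components can be obtained and used to project the data as follows:

# Toy PCA sketch: project an n x p matrix onto its first k principal components.
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)        # illustrative data, n = 100, p = 50
k <- 5                                                      # number of components retained
pca <- prcomp(X, center = TRUE, scale. = FALSE)             # loadings are eigenvectors of the covariance matrix
W <- pca$rotation[, 1:k]                                    # p x k weight (loading) matrix
T_scores <- scale(X, center = TRUE, scale = FALSE) %*% W    # n x k matrix of component scores
dim(T_scores)                                               # 100 x 5: reduced-dimension dataset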
3.1.3 Partial Least Squares
Whereas PCA reduces dimensionality through X, the method of PLS does so
through a consideration of both independent and dependent variables X and Y.
Thus, this approach is often referred to as being ‘supervised’.
This regression model is especially useful when there is either high collinear-
ity among predictors or when the number of predictor variables is much greater
than the amount of observations. In these situations, ordinary least-squares re-
gression would either perform poorly or fail entirely; it would also fail if Y was
not one-dimensional—id est, if there were more than one observed response.
PLS extracts factors from both X and Y so that the covariance between these
factors is maximized. In particular, PLS is largely based on the singular value de-
composition of XᵀY. Recall that PLS does not require Y to be one-dimensional;
an advantage of the PLS procedure is that Y can contain as many observed re-
sponses as are deemed necessary and practical by researchers.
The method of PLS decomposes both X and Y so that
X = TPᵀ + E and Y = UQᵀ + F.
Here, T is a matrix of ‘X-scores’, P is a matrix of ‘X-loadings’, and E is a matrix
of error for X. Similarly, U, Q, and F represent ‘Y-scores’, ‘Y-loadings’, and Y
error, respectively. Both X- and Y-scores are defined as being linear combinations
of the predictor and response variables, respectively. Then, X- and Y-loadings are
linear coefficients that form a bridge from X to T and from Y to U. A common
assumption about E and F is that they are random variables with independent
and identical distributions. This decomposition of X and Y is done in hopes of
maximizing the covariance between T and U.
The PLS algorithm is an iterative procedure. First, two sets of weights must
be constructed as linear combinations of the columns of both X and Y. These
will be denoted by w and c, respectively. The goal here is to have their covariance
be maximal. Recall that matrices T and U denote, accordingly, X- and Y-scores.
Then, the next step in the PLS approach is to obtain a first pair of vectors t = Xw
and u = Yc such that wᵀw = 1, tᵀt = 1, and tᵀu is maximized. After these
first so-called ‘latent vectors’ have been obtained, they are subtracted from both
X and Y. This procedure is then repeated, thereby eventually reducing X to a
zero matrix.
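A brief sketch of this supervised reduction is given below; it mirrors the plsreg1 call used in the simulations (Section 4.1), and the toy predictor matrix and response are purely illustrative:

# Toy PLS sketch: extract k components that maximize covariance with the response.
library(plsdepot)
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)   # illustrative predictors, n = 100, p = 50
y <- X %*% rnorm(50) + rnorm(100)                     # illustrative univariate response
k <- 5
fit <- plsreg1(scale(X), y, comps = k, crosval = FALSE)   # PLS1 regression with k components
T_scores <- scale(X) %*% fit$x.loads                      # n x k matrix of X-scores (reduced dataset)
dim(T_scores)                                             # 100 x 5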
3.1.4 Random Matrices
Whereas the previously discussed methods of PCA and PLS reduce dimension-
ality through a careful analysis of X and Y, the third technique of constructing
random matrices, as the name implies, is considerably cavalier by comparison. In
essence, a random matrix with a particular set of qualities is fabricated. Then,
this matrix is applied to a given dataset—matrix X in this particular investigation.
According to the lemma attributed to Johnson and Lindenstrauss, if two
observations in X are considered as multidimensional points with an initial
squared distance between them, then once one of these random matrices is
applied to X, their initial distance is not distorted by too much.
approaches utilized in PCA and PLS, random matrices can reduce dimensionality
without losing much information in the process. First, the Johnson-Lindenstrauss
(JL) Lemma will be presented as well as a description of the three particular ran-
dom matrices that were constructed in this research. The constraint on k was
utilized according to Dasgupta-Gupta.
The Johnson-Lindenstrauss Lemma. For any ε ∈ (0, 1) and any positive integer n, let k be a positive integer such that

k ≥ 4 ln(n) / (ε²/2 − ε³/3).

Then, for any set S of n points in Rᵈ, there exists a mapping f : Rᵈ → Rᵏ such that, for all points u, v ∈ S,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
In terms of this investigation, n also represents the number of observations
while ε denotes the error tolerance. Finally, k can be thought of as the resulting
dimension in this given context after applying a random matrix to the dataset ma-
trix X.
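For reference, the lower bound on k implied by the lemma can be computed directly from n and ε; the small helper below is only an illustration of this constraint:

# Minimum reduced dimension k allowed by the Johnson-Lindenstrauss Lemma
# for n points and error tolerance epsilon in (0, 1).
jl_min_k <- function(n, epsilon) {
  ceiling(4 * log(n) / (epsilon^2 / 2 - epsilon^3 / 3))
}
jl_min_k(n = 100, epsilon = 0.65)   # bound for 100 observations at a loose tolerance
jl_min_k(n = 100, epsilon = 0.10)   # a tight tolerance demands a much larger k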
An immediate complication of these so-called ‘JL-embeddings’ is that we may
sometimes observe that k ≥ d as a result of strictly following the hypotheses of
the lemma. Id est, by employing the results of this theorem, a researcher would
be taking data from a smaller dimension and transforming it so that the data exists
in a higher dimension. Ultimately, the JL Lemma may not reduce dimensionality
at all, thus rendering it impractical for the desired purposes of this text. Thus, it
became imperative in this research to observe the effects of ignoring the restraints
on k of the JL Lemma and deducing whether or not desirable results are obtained
nonetheless. Having understood the motivation behind random matrices and these
precise limitations, now an explanation of the three random matrices themselves
is in order.
The first two random matrices were fabricated according to the previous re-
sults of Achlioptas while the third was constructed by following the specifications
of Dasgupta-Gupta. Let Γ1, Γ2, and Γ3 accordingly denote these random ma-
trices. To keep consistent with the previous notation, recall that X is an n × p
predictor matrix of observations on the rows and covariates on the columns. It
follows that Γ1, Γ2, and Γ3 are p × k matrices. Once X is multiplied by one of them,
the resulting matrix Ω will have dimensions n × k, where the goal is to have n > k.
Entries of Γ1 were produced from the following distribution:

(1/√k) × { −1 with probability 1/2, +1 with probability 1/2 }.

For Γ2, its entries were obtained from

√(3/k) × { −1 with probability 1/6, 0 with probability 4/6, +1 with probability 1/6 }.
Finally, Γ3 is a Gaussian random matrix generated from N(0, 1). The resulting
rows of Γ3 are then normalized.
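To make the construction concrete, a vectorized sketch of the first (Rademacher-type) matrix and the resulting projection is shown below; the dimensions mirror the simulation setting of Section 4.1, and the data matrix X here is only a stand-in:

# Toy random-matrix projection: reduce an n x p dataset to n x k with Gamma_1.
set.seed(1)
n <- 100; p <- 1000; k <- 37
X <- matrix(rexp(n * p), nrow = n, ncol = p)   # stand-in for the dataset matrix
Gamma1 <- matrix(sample(c(-1, 1), p * k, replace = TRUE), nrow = p, ncol = k) / sqrt(k)
# entries are (1/sqrt(k)) * (+1 or -1), each with probability 1/2
Omega <- X %*% Gamma1                          # n x k reduced dataset
dim(Omega)                                     # 100 x 37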
3.2 The Accelerated Failure Time Model
The previously described techniques were employed in order to reduce dimensionality.
After successfully achieving this end, it was necessary to generate
a survival curve based on the modified data and compare it with the true survival
curve. In this investigation, the AFT model was the vehicle to generate estimates
of the survivor curves.
The AFT model is seldom utilized compared to the celebrated Cox Propor-
tional Hazards (PH) model for various reasons. One reason to adopt the AFT
approach in this investigation is due to the simplified interpretation it provides re-
searchers of the data. This approach presents an interpretation of the relationship
between observation covariates and given responses in terms of survivor curves.
The Cox PH model, on the other hand, does so through hazard functions and haz-
ard ratios that, while equally profound, are not as visually simple to comprehend
as the AFT model's survivorship presentation. In simple terms, the hazard is the
instantaneous event rate at a particular time. It is arguably more straightforward
to understand results in terms of the probability that an individual ‘survives’,
that is, does not experience the event of interest, beyond a particular time. Thus,
this first reason to employ the AFT model in this text is a matter of user preference
and ease of interpretation of results. Another, more technical, reason to employ the
AFT model is that it directly models the given survival times, a luxury that the
Cox PH model does not afford.
In this investigation, AFT was implemented according to the following underlying model:

ln(Ti) = µ + ziβ + ei.

Here, i indexes a particular observation from a set of n observations, and Ti denotes
the survival time for the i-th observation. Meanwhile, µ designates the given
theoretical mean, zi is the vector of covariates for the i-th observation, and β is the
vector of covariate/regression coefficients. Finally, ei is the given error for the i-th
observation.
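A toy illustration of this formulation is given below; note that this is not the exact data-generating scheme used in Section 4.1 (which instead draws exponential survival times with rate λi = e^(−xiβ)), only a direct simulation of the log-linear model above with illustrative dimensions:

# Toy AFT simulation: ln(T_i) = mu + z_i beta + e_i.
set.seed(1)
n <- 100; p <- 10
Z <- matrix(rnorm(n * p), nrow = n, ncol = p)   # covariate vectors z_i
beta <- rnorm(p, sd = 0.1)                      # regression coefficients
mu <- 0                                         # theoretical mean
e <- rnorm(n)                                   # error terms
logT <- mu + Z %*% beta + e                     # log survival times
T_surv <- exp(logT)                             # survival times T_i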
4 Method Assessments
This research utilized a programming environment to simulate datasets that would
undergo reduction procedures via PCA, PLS, and the RM variants. Additionally,
feeding these data into the AFT model to obtain and compare the pairs of
survival curves was likewise accomplished through statistical software. This sec-
tion will address specifically how the research was performed.
4.1 Simulated Datasets
In order to compare the dimension reduction techniques, R statistical software is
implemented to simulate data. The β regression coefficients, observations, covariates,
and survival times are simulated using the previously discussed AFT formula, where
the theoretical mean µ is set to 0 for simplicity. The dimensionality of the data matrix
X is 100 observations by 1000 covariates. A vector of 1000 β regression coefficients
relating to the 1000 covariates is obtained by generating random values from
U(−1 × 10⁻⁷, 1 × 10⁻⁷). A vector µj of random values is generated from a N(0, 1)
distribution for j = 1, . . . , p, where p represents the number of covariates. β and µ
remain fixed for all simulations. Next, the 100 × 1000 matrix X of the 1000 covariates
and 100 observations is generated with xij = e^(zij), where zij ∼ N(µj, 1) for
j = 1, . . . , p and i = 1, . . . , n, and n is the number of observations; the data are
therefore log-normally distributed. The survival times Ti are constructed from an
exponential distribution with rate λi = e^(−xiβ) for i = 1, . . . , n.

Now that all the data are generated, the data matrix is column-centered about its
column means. PCA is applied to the centered matrix using the function PCA from the
package FactoMineR (Husson et al., 2015) to obtain 99 principal components. After
this procedure is completed, the principal components are narrowed down to 37, which
account for 50% of the total variance of the model. PCA outputs a weight matrix of
dimension 1000 × 37, which represents the weights given to each covariate by the 37
principal components. The data matrix X is multiplied by this weight matrix to obtain
a reduced-dimension matrix of 100 × 37. A Surv object is created, which takes the
survival times, the censoring type, and an indicator vector for censoring status, and
outputs a response matrix. The Ti vector and the 37 principal components are fed into
the AFT model in R using the package aftgee (Chiou et al., 2015) to obtain the 37
estimated β coefficients for the components; the weight matrix is then multiplied by
these estimates to recover the 1000 β estimates for the original covariates.

In order to acquire an estimated lambda value for the estimated survival function, the
mean of the exponentiated (negative) product of the centered data matrix and the β
estimates is taken. The estimated survival function is then Ŝ0(t) = e^(−λ̂t), where λ̂
is the estimated mean lambda value. This procedure was repeated for PLS using the
same number of components as PCA, except using the function plsreg1 from the
package plsdepot (Sanchez, 2015) instead.
The matrices Γ1, Γ2, and Γ3 from Achlioptas and Dasgupta-Gupta are generated with
random entries that satisfy each author's probability specifications. An algorithm in R
is created to validate the dimension reduction ability of the Johnson-Lindenstrauss
Lemma for Γ1, Γ2, and Γ3. The algorithm takes two randomly picked observations u, v
from X and maps f : Rᵖ → Rᵏ, where k is the new reduced dimension. The
Johnson-Lindenstrauss Lemma is then tested using varying values of ε and k over
multiple simulations. It is shown that, as long as k and ε follow the constraints given
by Dasgupta and Gupta (2003), the Johnson-Lindenstrauss Lemma is satisfied 100% of
the time. The value of ε is varied until a 1000 × 37 projection matrix satisfying the
Johnson-Lindenstrauss Lemma is obtained. Unfortunately, a fairly high value of ε,
approximately 0.65, is required to satisfy the lemma. Therefore, either a high ε value is
used or the lemma is not followed.
In order to compare random matrices to PCA and PLS, X is multiplied by
Γ1, Γ2, and Γ3, each of dimension 1000 × 37, to obtain resulting reduced matrices of
dimension 100 × 37. Then, the reduced matrices are fed into the AFT model and all
the same steps as PCA and PLS are performed. Therefore, five different estimated
survival curves are produced, one each for PCA and PLS and three for the three
random matrices.
The true survival curve is S0(t) = e^(−λ̄t), where λ̄ is the mean of the λi values,
which are created by exponentiating the negative product of the centered data matrix
and the true β coefficients. The y-axis of the survival curve is partitioned into 20
equally spaced values from 0.025 to 0.975, and the corresponding ti values are found
along the x-axis. The bias and mean-squared error (MSE) are calculated at each of
these ti values to obtain the error distribution for each method. The bias is found by
calculating the pointwise difference between the true and estimated survival curves,
and the MSE is calculated from the squared difference. The bias and MSE at each ti
are summed over the 5000 simulations and the error distributions are compared for all
methods.
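A condensed sketch of this bookkeeping for a single simulated dataset is shown below; lambda_bar and lambda_hat are assumed to hold the true and estimated rate constants (the full, per-method version appears in the Appendix):

# Pointwise bias and squared error over the 20-point grid of survival probabilities.
lambda_bar <- 0.5; lambda_hat <- 0.6                 # illustrative values only
u <- seq(0.025, 0.975, by = 0.05)                    # 20 survival probabilities on the y-axis
t <- -log(u) / lambda_bar                            # corresponding time points on the x-axis
bias <- exp(-lambda_hat * t) - exp(-lambda_bar * t)  # estimated minus true survivor curve
mse  <- bias^2                                       # pointwise squared error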
5 Results
In the following sections, the error distribution plots for the dimension reduction
techniques are compared after 5000 simulations. PLS and PCA are compared to
each other, the random matrices are compared, and then all dimension reduction
techniques are compared. The goal is to minimize bias and MSE; therefore, the
dimension reduction technique closest to zero is the more efficient method. In
the bias plots, zero is at the top of the plot, and for MSE, the black horizontal
line at the bottom denotes zero. Notice that the plots differ least at the extremes of
the survival curve's domain, while the most variability is observed in the middle of the
interval.
5.1 Principal Component Analysis versus Partial Least Squares
From the plots above, it is shown that PCA outperforms PLS by a maximum
magnitude of approximately 0.07 for the bias and 0.03 for MSE.
5.2 Random Matrices
In the plots above, RM1 denotes Γ1, RM2 denotes Γ2, and RM3 denotes Γ3.
The results show that there is no significant difference in performance between
the three random matrices in terms of Bias and MSE.
5.3 All Methods
From both the Bias and MSE plots, it is evident that all three random matrices
outperform both PCA and PLS. The random matrices outperform PCA by a magnitude of
approximately 0.03 and PLS by approximately 0.10 in terms of bias, and by
approximately 0.015 and 0.045, respectively, in terms of MSE.
6 Discussion
We originally wanted to generate our β coefficients from a U(−0.2, 0.2) distribution,
but when we computed xiβ to obtain our λi values, we obtained very large values.
Recall our formula λi = e^(−xiβ): when xiβ is very large, the λi values become very
small, and within the numerical precision of R the survival function is estimated as 1,
creating a horizontal survival curve. Therefore, we had to reduce the β coefficients to
U(−1 × 10⁻⁷, 1 × 10⁻⁷) to obtain survival curves with realistic properties.

Before conducting our research, we investigated previous work in the field, such as the
two papers of Nguyen and Rojo (2009). According to their findings, PLS outperformed
PCA, which is the result we expected to observe as well; instead, we found that PCA
greatly outperformed PLS. We are not certain why our results differ from these works,
but we suspect that it is due to not incorporating censored data. In both papers of
Nguyen and Rojo, the methods were compared using censored data, which we did not
have time to incorporate into our research. Therefore, we suspect that PLS might
outperform PCA when censored data are used, whereas PCA outperforms PLS with
uncensored data.
In real-life studies, censored data can of course be a serious problem that needs to be
taken into account. We wanted to incorporate censored data in our investigation but
were unable to do so due to time constraints; this is something that we would like to
add in future investigations. We also wanted to apply our findings to real microarray
gene datasets, where there are a small number of patients with a specific type of
cancer and a large number of genes. We wanted to work with these datasets and apply
our dimension reduction techniques to obtain estimated survival curves, where the
event of interest is death and the survival curve models each patient's probability of
surviving beyond a given time Ti. Unfortunately, we were not able to work with these
real datasets, which is also something we would like to investigate at a future time.
7 Conclusion
The results of performing PLS, PCA, and the three Johnson-Lindenstrauss in-
spired matrices from Achlioptas and Dasgupta-Gupta on log-normally distributed,
uncensored data for estimating the survival curve under the AFT model show that
PCA outperforms PLS in terms of both bias and MSE. The three random matrices
do not show a significant difference between each other in terms of either bias or
MSE. Overall, the random matrices outperform both PCA and PLS for both bias
and MSE.
8 Acknowledgments
This research was supported by the National Security Agency through REU Grant
H98230 15-1-0048 to The University of Nevada at Reno, Javier Rojo PI. We
would like to greatly thank and acknowledge our advisor Dr. Javier Rojo, Nathan
Wiseman, and Kyle Bradford from the University of Nevada Reno for their sup-
port and generous contributions to our research.
9 References
Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.

Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26: 189-206, 1984.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572, 1901.

Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (ed.), Multivariate Analysis: 391-420, 1966.

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.

Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.

Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Mathematical Biosciences 193: 119-137, 2005.

Nguyen, D.V. and D.M. Rocke. On partial least squares dimension reduction for microarray-based classification: A simulation study. Computational Statistics & Data Analysis 46: 407-425, 2004.

Despa, S. What is survival analysis? StatNews 78: 1-2.

Ross, E. "Survival Analysis." 2012. PDF.

Allison, P.D. "Survival Analysis." 2013. PDF.

Lunn, M. "Definitions and Censoring." 2012. PDF.

Vermeylen, F. Censored data. StatNews 67: 1, 2005.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray gene expression data: The accelerated failure time model. Journal of Bioinformatics and Computational Biology 7(6): 939-954, 2009.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray data in the presence of a censored survival response: A simulation study. Statistical Applications in Genetics and Molecular Biology 8(1), 2009.

Sun, J. "Interval Censoring." 2011. PDF.

Duerden, M. "What Are Hazard Ratios?" 2012. PDF.

Rodriguez, G. "Parametric Survival Models." Princeton, 2010. PDF.

Cook, A. "Survival and Hazard Functions." 2008. PDF.

Fox, J. "Introduction to Nonparametric Methods." 2005. PDF.

Husson, F. et al. "Package 'FactoMineR'." 2015. PDF.

Sanchez, G. "Package 'plsdepot'." 2015. PDF.

Chiou, S. et al. "Package 'aftgee'." 2015. PDF.

Therneau, T. et al. "Package 'survival'." 2015. PDF.
10 Appendix
Herein, the R code utilized in this investigation is presented. Packages survival
(Therneau et al., 2015), FactoMineR (Husson et al., 2015), plsdepot (Sanchez, 2015),
and aftgee (Chiou et al., 2015) will need to be installed and loaded into R software
to successfully run the provided code.
10.1 Error Plots
Below is the code used to produce the six error plots for the five reduction
methods.
library(survival)
# We created a Surv object using function ’Surv’ from this
# package.
library(FactoMineR)
# We used the function ’PCA’ from this package.
library(plsdepot)
# We used ’plsreg1’ from this package.
library(aftgee)
# With this package, we were able to apply the AFT model to our
# simulated data using the function ’aftgee’.
sim <- function(s) # This function will produce ’s’
# simulations and output error plots.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
sum_PCA_BE_t <- matrix(0, 1, 20)
sum_PCA_MSE_t <- matrix(0, 1, 20)
sum_PLS_BE_t <- matrix(0, 1, 20)
sum_PLS_MSE_t <- matrix(0, 1, 20)
sum_RM1_BE_t <- matrix(0, 1, 20)
sum_RM1_MSE_t <- matrix(0, 1, 20)
sum_RM2_BE_t <- matrix(0, 1, 20)
sum_RM2_MSE_t <- matrix(0, 1, 20)
sum_RM3_BE_t <- matrix(0, 1, 20)
sum_RM3_MSE_t <- matrix(0, 1, 20)
# These will store the calculated bias and mean-squared
# error across 20 selected points after we have run ’s’
# simulations.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
# No problems at the start of this iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations
# on the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100) # Generating lambda values.
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
}
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100) # Survival times being generated.
{
T[i] <- rexp(1, rate=lambda[i])
}
RM1 <- matrix(0, 1000, 37)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(37)
RM2 <- matrix(0, 1000, 37)
# Random matrix two with
# ’-sqrt(3)’s, ’0’s, and ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)),
1, replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(37)
RM3 <- matrix(0, 1000, 37)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:37)
{
RM3[m,n] <- rnorm(1, 0, 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:37)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered ’z’ matrix for PCA.
z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37)
z_star_PLS <- plsreg1(scale(z, center = TRUE,
scale = TRUE), T, comps = 37,
crosval = FALSE)
# Applying PCA and PLS to the data.
z_double_star_PCA <- z_star %*% z_star_PCA$var$coord
z_double_star_PLS <- z_star %*% z_star_PLS$x.loads
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
delta <- matrix(0, nrow = 100, ncol = 1)
# An indicator matrix. Here, delta is a 100 by 1 matrix
# of zeros. The zeros are interpreted as meaning that the
# event of interest has definitively occurred. In other
# words, there is currently no censoring with ’delta’
# set up in this manner.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified
# as being right-censored.
data_AFT_fit_PCA <- aftgee(data_Surv ~ -1 +
z_double_star_PCA,
corstr = "independence", B = 0)
data_AFT_fit_PLS <- aftgee(data_Surv ~ -1 +
z_double_star_PLS,
corstr = "independence", B = 0)
data_AFT_fit_RM1 <- aftgee(data_Surv ~ -1 +
z_double_star_RM1,
corstr = "independence", B = 0)
data_AFT_fit_RM2 <- aftgee(data_Surv ~ -1 +
z_double_star_RM2,
corstr = "independence", B = 0)
data_AFT_fit_RM3 <- aftgee(data_Surv ~ -1 +
z_double_star_RM3,
corstr = "independence", B = 0)
beta_hat_star_PCA <- data_AFT_fit_PCA$coefficients
beta_hat_star_PLS <- data_AFT_fit_PLS$coefficients
beta_hat_star_RM1 <- data_AFT_fit_RM1$coefficients
beta_hat_star_RM2 <- data_AFT_fit_RM2$coefficients
beta_hat_star_RM3 <- data_AFT_fit_RM3$coefficients
# The full beta/regression coefficients.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ will go here.
for (i in 1:1000) # Averaging ’z’s columns.
{
z_bar_star[1, i] <- mean(z[, i])
}
beta_hat_z_PCA <- z_star_PCA$var$coord %*%
beta_hat_star_PCA
beta_hat_z_PLS <- z_star_PLS$x.loads %*%
beta_hat_star_PLS
beta_hat_z_RM1 <- RM1 %*%
beta_hat_star_RM1
beta_hat_z_RM2 <- RM2 %*%
beta_hat_star_RM2
beta_hat_z_RM3 <- RM3 %*%
beta_hat_star_RM3
# The final beta estimates for each technique.
lambda_hat_PCA <- mean(exp(-z %*% beta_hat_z_PCA))
lambda_hat_PLS <- mean(exp(-z %*% beta_hat_z_PLS))
lambda_hat_RM1 <- mean(exp(-z %*% beta_hat_z_RM1))
lambda_hat_RM2 <- mean(exp(-z %*% beta_hat_z_RM2))
lambda_hat_RM3 <- mean(exp(-z %*% beta_hat_z_RM3))
# Generating the lambda constant from each technique
# employed.
lambda_bar = mean(lambda) # Taking the average of all
# ’lambda’ values and storing it in ’lambda_bar’.
S <- function(t) # The true survivor function.
{
exp(-t * lambda_bar)
}
S_hat_naught_PCA <- function(t)
# The predicted survivor function through PCA.
{
exp(-t * lambda_hat_PCA)
}
S_hat_naught_PLS <- function(t)
# The predicted survivor function through PLS.
{
exp(-t * lambda_hat_PLS)
}
S_hat_naught_RM1 <- function(t)
# The predicted survivor function through RM1.
{
exp(-t * lambda_hat_RM1)
}
S_hat_naught_RM2 <- function(t)
# The predicted survivor function through RM2.
{
exp(-t * lambda_hat_RM2)
}
S_hat_naught_RM3 <- function(t)
# The predicted survivor function through RM3.
{
exp(-t * lambda_hat_RM3)
}
u <- c(seq(0.025, 0.975, 0.05))
# Desired outputs ’u’ that range from 0.025 to 0.975
# and are spaced out by 0.05, resulting in 20 points.
t <- (-1/lambda_bar) * log(u) # Input times ’t’ from the
# respective ’u’s. There are 20 generated times ’t’ in
# this vector.
for (i in 1:20)
# Storing bias across the 20 point pairs in PCA.
{
sum_PCA_BE_t[i] <- sum_PCA_BE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PCA.
{
sum_PCA_MSE_t[i] <- sum_PCA_MSE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in PLS.
{
sum_PLS_BE_t[i] <- sum_PLS_BE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PLS.
{
sum_PLS_MSE_t[i] <- sum_PLS_MSE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM1.
{
sum_RM1_BE_t[i] <- sum_RM1_BE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM1.
{
sum_RM1_MSE_t[i] <- sum_RM1_MSE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM2
{
sum_RM2_BE_t[i] <- sum_RM2_BE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM2.
{
sum_RM2_MSE_t[i] <- sum_RM2_MSE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM3.
{
sum_RM3_BE_t[i] <- sum_RM3_BE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM3.
{
sum_RM3_MSE_t[i] <- sum_RM3_MSE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i])) ^ 2
}
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
ymin_PCA_BE <- min(sum_PCA_BE_t)
ymin_PLS_BE <- min(sum_PLS_BE_t)
ymin_RM1_BE <- min(sum_RM1_BE_t)
ymin_RM2_BE <- min(sum_RM2_BE_t)
ymin_RM3_BE <- min(sum_RM3_BE_t)
# Finding the minimum bias per each technique after
# ’s’ simulations.
ymax_PCA_BE <- max(sum_PCA_BE_t)
ymax_PLS_BE <- max(sum_PLS_BE_t)
ymax_RM1_BE <- max(sum_RM1_BE_t)
ymax_RM2_BE <- max(sum_RM2_BE_t)
ymax_RM3_BE <- max(sum_RM3_BE_t)
# Finding the maximum bias per each technique after
# ’s’ simulations.
ymin_BE <- min(ymin_PCA_BE, ymin_PLS_BE, ymin_RM1_BE,
ymin_RM2_BE, ymin_RM3_BE) / s
ymax_BE <- max(ymax_PCA_BE, ymax_PLS_BE, ymax_RM1_BE,
ymax_RM2_BE, ymax_RM3_BE) / s
# Finding the minimum and maximum bias across all five
# techniques after ’s’ simulations. These will serve as
# the lower and upper range of the y-axis in the final plot.
ymin_PCA_PLS_BE <- min(ymin_PCA_BE, ymin_PLS_BE) / s
ymax_PCA_PLS_BE <- max(ymax_PCA_BE, ymax_PLS_BE) / s
# Calculating the averaged minimum and maximum bias for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_BE <-
min(ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s
ymax_RM_BE <-
max(ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s
# Calculating the averaged minimum and maximum bias for the
# three RMs after ’s’ simulations for plotting purposes.
ymin_PCA_MSE <- min(sum_PCA_MSE_t)
ymin_PLS_MSE <- min(sum_PLS_MSE_t)
ymin_RM1_MSE <- min(sum_RM1_MSE_t)
ymin_RM2_MSE <- min(sum_RM2_MSE_t)
ymin_RM3_MSE <- min(sum_RM3_MSE_t)
# Finding the minimum mean-squared error per each technique
# after ’s’ simulations.
ymax_PCA_MSE <- max(sum_PCA_MSE_t)
ymax_PLS_MSE <- max(sum_PLS_MSE_t)
ymax_RM1_MSE <- max(sum_RM1_MSE_t)
ymax_RM2_MSE <- max(sum_RM2_MSE_t)
ymax_RM3_MSE <- max(sum_RM3_MSE_t)
# Finding the maximum mean-squared error per each technique
# after ’s’ simulations.
ymin_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE, ymin_RM1_MSE,
ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE, ymax_RM1_MSE,
ymax_RM2_MSE, ymax_RM3_MSE) / s
# Finding the minimum and maximum mean-squared error across
# all techniques. These will serve as the lower and upper
# range of the y-axis in the final plot.
ymin_PCA_PLS_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE) / s
ymax_PCA_PLS_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE) / s
# Calculating the averaged minimum and maximum MSE for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_MSE <-
min(ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_RM_MSE <-
max(ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s
# Calculating the averaged minimum and maximum MSE for the
# three RMs after ’s’ simulations for plotting purposes.
# Start of bias plot for PCA and PLS.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: PCA and PLS \n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_PCA_PLS_BE,
ymax_PCA_PLS_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of bias plot for PCA and PLS.
# Start of the mean-squared error plot for PCA and PLS.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: PCA and PLS \n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_PCA_PLS_MSE,
ymax_PCA_PLS_MSE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of mean-squared error plot for PCA and PLS.
# Start of the bias plot for the random matrices.
plot(t, (sum_RM1_BE_t) / s, pch = 15,
main = paste("Bias: Random Matrices \n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_RM_BE,
ymax_RM_BE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of bias plot for the random matrices.
# Start of the mean-squared error plot for the
# random matrices.
plot(t, (sum_RM1_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: Random Matrices \n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_RM_MSE,
ymax_RM_MSE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of mean-squared error plot for the random matrices.
# Start of bias plot for all methods.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: All Techniques \n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_BE, ymax_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_BE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of bias plot for all methods.
# Start of mean-squared error plot for all methods.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: All Techniques \n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_MSE, ymax_MSE),
xlim = c(0, max(t)), col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_MSE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of mean-squared error plot for all methods.
t2 <- Sys.time() # End time.
total_time <- t2 - t1 # Difference between start and end
# times.
print(total_time) # Printing total time to run simulations
# and obtain the plots.
}
10.2 Johnson-Lindenstrauss Testing
Below is the code used for testing the Johnson-Lindenstrauss Lemma by varying
k and ε.
good_points_RM1 <- 0
good_points_RM2 <- 0
good_points_RM3 <- 0
# Good points counter for each random matrix.
# Points are considered ’good’ if they satisfy
# the Johnson-Lindenstrauss Lemma.
sim <- function(s, k, epsilon)
# This function takes in ’s’ simulations, a reduced dimension ’k’, and
# a desired ’epsilon’. It returns the number of times the
# Johnson-Lindenstrauss Lemma was satisfied based on the
# three random matrices.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations on
# the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
u_v_rows <- sample(1:100, 2, replace = FALSE)
obs_u_old <- z[u_v_rows[1], ]
obs_v_old <- z[u_v_rows[2], ]
# We’ve selected two different rows from the dataset
# matrix ’z’ and stored them as new variables. Here,
# observations ’u’ and ’v’ can be thought of as
# 1,000-dimensional points.
dist_old <- sum((obs_u_old - obs_v_old) ^ 2)
# Here, the distance has been calculated between
# observations ’u’ and ’v’.
RM1 <- matrix(0, 1000, k)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(k)
RM2 <- matrix(0, 1000, k)
# Random matrix two with ’-sqrt(3)’s, ’0’s, and
# ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1,
replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(k)
RM3 <- matrix(0, 1000, k)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:k)
{
RM3[m,n] <- rnorm(1, mean = 0, sd = 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:k)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
z_star <- scale(z, center=TRUE, scale=FALSE)
# Column-centered ’z’ matrix.
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
obs_u_new_RM1 <- z_double_star_RM1[u_v_rows[1], ]
obs_v_new_RM1 <- z_double_star_RM1[u_v_rows[2], ]
obs_u_new_RM2 <- z_double_star_RM2[u_v_rows[1], ]
obs_v_new_RM2 <- z_double_star_RM2[u_v_rows[2], ]
obs_u_new_RM3 <- z_double_star_RM3[u_v_rows[1], ]
obs_v_new_RM3 <- z_double_star_RM3[u_v_rows[2], ]
# After reducing dimensions, points ’u’ and ’v’ now have
# new coordinates. Since there were three random
# matrices, there are three new ’u’ and ’v’ points.
dist_new_RM1 <- sum((obs_u_new_RM1 - obs_v_new_RM1) ^ 2)
dist_new_RM2 <- sum((obs_u_new_RM2 - obs_v_new_RM2) ^ 2)
dist_new_RM3 <- sum((obs_u_new_RM3 - obs_v_new_RM3) ^ 2)
# Calculating the new distance between the transformed
# points ’u’ and ’v’ for each generated random matrix.
if((1 - epsilon) * (dist_old) <= dist_new_RM1
&& dist_new_RM1 <= (1 + epsilon) * (dist_old))
{
good_points_RM1 <- good_points_RM1 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM2
&& dist_new_RM2 <= (1 + epsilon) * (dist_old))
{
good_points_RM2 <- good_points_RM2 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM3
&& dist_new_RM3 <= (1 + epsilon) * (dist_old))
{
good_points_RM3 <- good_points_RM3 + 1
}
# The preceding three ’if’ statements check to see if
# the Johnson-Lindenstrauss Lemma was satisfied in this
# iteration for each different random matrix.
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
print(paste("For an epsilon of", epsilon, ", k is", k,
"."))
print(paste("Number of times JL was satisfied, RM1:",
good_points_RM1, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM2:",
good_points_RM2, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM3:",
good_points_RM3, "out of", s, "simulations."))
t2 <- Sys.time() # End time.
total_time <- t2 - t1
# Difference between start and end times.
print(total_time) # Printing total time to run simulations
# and obtain the plots.
}
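As a side note, the three random matrices above can also be generated without
nested loops. The sketch below is not part of the original simulation: the
names 'p', 'k_illus', 'RM1_alt', 'RM2_alt', and 'RM3_alt' and the value of
'k_illus' are illustrative only, but the entries follow the same three
distributions used above.
p <- 1000 # Number of covariates, as in the simulation above.
k_illus <- 50 # An arbitrary reduced dimension, for illustration only.
RM1_alt <- matrix(sample(c(-1, 1), p * k_illus, replace = TRUE),
p, k_illus) / sqrt(k_illus)
RM2_alt <- matrix(sample(c(-sqrt(3), 0, sqrt(3)), p * k_illus,
replace = TRUE, prob = c(1/6, 4/6, 1/6)),
p, k_illus) / sqrt(k_illus)
RM3_alt <- matrix(rnorm(p * k_illus), p, k_illus)
RM3_alt <- RM3_alt / sqrt(rowSums(RM3_alt ^ 2))
# Normalizing each row of 'RM3_alt' to unit length.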
10.3 Survival Curves
Below is the code used to generate the true survival curve and the estimated
survival curve under PCA.
library(survival)
library(FactoMineR)
sim <- function(s) # Making a function that takes in a
# simulation count ’s’.
{
options(digits = 22) # Preserving more significant digits, in
# the hope of reducing algorithm failures.
results <- matrix(0, s, 2) # A matrix with BE on column 1
# and MSE on column 2.
BE_T <- 0 # Initial total BE count.
MSE_T <- 0 # Initial total MSE count.
sum_BE_t <- matrix(0, 1, 20) # Matrix of BE at time ’t’.
sum_MSE_t <- matrix(0, 1, 20) # Matrix of MSE at time ’t’.
num <- 1 # Iteration counter.
sum_BE_t1 <- 0 # Bias error at time ’t1’.
sum_MSE_t1 <- 0 # Mean-squared error at time ’t1’.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000) # A location for the dataset
# information.
while(num <= s) # Running the entire code for a specified
# amount of iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1) # A matrix
# of random data containing observations on the rows
# and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100)
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
# Generating lambda values.
}
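# (Equivalently, the loop above can be collapsed into the single
# vectorized statement 'lambda <- exp(-z %*% as.matrix(beta))';
# this is noted only as an aside and is not executed here.)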
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100)
{
T[i] <- rexp(1,rate=lambda[i])
}
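# ('rexp()' accepts a vector of rates, so the loop above could
# also be written as 'T <- matrix(rexp(100, rate = lambda))';
# again, only an aside.)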
z_star <- scale(z, center=TRUE,scale=FALSE)
z_star_PCA <- PCA(z_star, graph=FALSE, ncp=37)
z_double_star <- z_star %*% z_star_PCA$var$coord
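# 'z_star' is the column-centered data (100 by 1,000), and
# 'z_star_PCA$var$coord' holds the coordinates of the 1,000
# variables on the first 37 components (1,000 by 37), so
# 'z_double_star' is the 100 by 37 reduced dataset.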
delta <- matrix(1, nrow = 100, ncol = 1) # An indicator
# matrix. Here, delta is a 100 by 1 matrix of ones. In
# 'Surv()', an event indicator of 1 means the event of
# interest was definitively observed, so with 'delta' set up
# in this manner there is no censoring in these data.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified as
# being right-censored.
data_AFT_fit <- NULL
data_AFT_fit <- tryCatch(survreg(data_Surv ~ -1 +
z_double_star,
dist = "lognormal",
survreg.control(maxiter=100000000)),
warning=function(c) {problem<<-TRUE})
if(!problem) # If survreg() produced no warning, the rest of
# this iteration is carried out.
{
beta_hat_star <- as.matrix(data_AFT_fit$coeff)
# Estimated coefficients in the reduced (37-dimensional) space.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ go here.
for (i in 1:1000)
{
z_bar_star[1, i] <- mean(z[, i])
# Taking the average of each column of ’z’.
}
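# (Equivalently, 'z_bar_star <- matrix(colMeans(z), nrow = 1)'
# computes the same row vector of column means; noted only as
# an aside.)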
beta_hat_z <- matrix(0, 1000, 1)
# A location for the beta estimates mapped back onto the
# original 1,000 covariates.
beta_hat_z <- z_star_PCA$var$coord %*% beta_hat_star
# Beta estimates on the original covariate scale.
lambda_hat <- exp(-z_bar_star %*% beta_hat_z)
# Estimated rate constant for the predicted survivor function.
lambda_bar <- mean(lambda)
# Taking the average of all 'lambda' values and storing
# it in 'lambda_bar'.
S_hat_naught <- function(t)
# The predicted survivor function.
{
exp(-t * lambda_hat)
}
S <- function(t)
# The true survivor function.
{
exp(-t * lambda_bar)
}
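# Both functions have the exponential survivor form
# S(t) = exp(-lambda * t); 'S_hat_naught' plugs in the
# estimated rate 'lambda_hat', while 'S' uses the average
# true rate 'lambda_bar'.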
data_AFT_pred <- predict(data_AFT_fit, type = "terms",
se.fit = TRUE)
# Here, the predicted term contributions and their standard
# errors are extracted from the 'survreg' object
# 'data_AFT_fit'; with 'se.fit = TRUE' the result is a list.
surv_curv <- curve(S_hat_naught, from = 0, to = 7,
n = 1000, type="l",
xlab = "", ylab = "", xaxt = ’n’,
yaxt = ’n’, col = "99")
# Plotting the predicted survivor function.
par(new = TRUE)
curve(S, from = 0, to = 7, n = 1000, type = "l",
main = paste("Survivor Curves \n Simulation", num),
xlab = expression(italic(t)),
ylab = expression(S(italic(t))), col = "black")
u <- c(seq(0.025, 0.975, 0.05))
# Survival probabilities 'u' ranging from 0.025 to 0.975 in
# steps of 0.05, giving 20 points.
t <- (-1/lambda_bar) * log(u)
# Input times ’t’, generated from ’u’. There
# are 20 generated times ’t’ in this vector.
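# These times are obtained by inverting the true survivor
# curve: S(t) = u implies t = -log(u) / lambda_bar, so the 20
# times correspond to true survival probabilities 0.025
# through 0.975.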
print(paste("Simulation ", num, sep = ""))
num <- num + 1
}
else
{
# survreg() produced a warning in this iteration, so the
# iteration is discarded; 'num' is not incremented and the
# while loop simply draws a fresh dataset and tries again.
}
}
}
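The simulation is then run by calling the function with the desired number of
simulated datasets. As a minimal illustration (the count below is an example,
not the number of simulations used in the study):
sim(5) # Plots the true and estimated survivor curves and prints
# a progress message for each of five simulated datasets.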