Survival Analysis Dimension Reduction Techniques
A Comparison of Select Methods
Claressa L. Ullmayer and Iván Rodríguez
Abstract
Although formal studies across many fields may yield copious data, it can
often be collinear (redundant) in terms of explaining particular outcomes.
Thus, dataset dimensionality reduction becomes imperative for facilitating
the explanation of phenomena given abundant covariates (independent vari-
ables). Principal Component Analysis (PCA) and Partial Least Squares
(PLS) are established methods used to obtain components—linear combinations of the original variables derived from the given data’s variance-covariance matrix—such that variance, or the covariance between linear combinations of predictor and response variables, is maximized. PCA employs orthogonal transformations on covariates
to reduce dataset dimensionality by producing new uncorrelated variables.
PLS, rather, projects both predictor and response variables into a new space
to model their covariance structure. In addition to these standard procedures,
three variants of Johnson-Lindenstrauss low-distortion Euclidean-space em-
beddings (random matrices, RM) were also investigated. Each technique’s
performance was explored by simulating 5,000 datasets using R statistical
software. The semi-parametric Accelerated Failure Time (AFT) model was
utilized to obtain predicted survivor curves. Then, total bias error (BE) and
mean-squared error (MSE) between the true and estimated survivor curves were determined to find the error distributions of all methods. The results herein indicate that PCA outperforms PLS, that the three RMs perform comparably to one another, and that the RMs outdo both PCA and PLS.
Keywords: survival analysis; dimension reduction; big data; principal com-
ponent analysis (PCA); partial least squares (PLS); Johnson-Lindenstrauss
(JL); random matrices; accelerated failure time (AFT); bias; mean-squared
error.
Contents
1 Introduction
2 Survival Analysis
3 Methods
3.1 Dimension Reduction
3.1.1 Notation
3.1.2 Principal Component Analysis
3.1.3 Partial Least Squares
3.1.4 Random Matrices
3.2 The Accelerated Failure Time Model
4 Method Assessments
4.1 Simulated Datasets
5 Results
5.1 Principal Component Analysis versus Partial Least Squares
5.2 Random Matrices
5.3 All Methods
6 Discussion
7 Conclusion
8 Acknowledgments
9 References
10 Appendix
10.1 Error Plots
10.2 Johnson-Lindenstrauss Testing
10.3 Survival Curves
1 Introduction
Throughout various studies, researchers are able to associate covariates to a set of
observations. From here, analysts would naturally seek to explain the relationship
between the two with regard to a given set of phenomena. Methods such as the
Cox Proportional Hazards (CPH) and the Accelerated Failure Time (AFT) models
have been proposed with this intent in mind (Cox, 1972). However, to successfully
utilize both approaches, it is necessary to have more observations than covariates.
Depending on the context, this property may not initially be satisfied, thus rendering both methods inept. One example of this complication arises in commonplace microarray gene expression data. In this situation, there are often fewer observations—patients—than covariates attributed to them—genes. As a result, it becomes imperative to reduce the dimensionality of the dataset and then apply a suitable regression technique thereafter to understand the underlying relationships between the predictor and response variables. Reducing the original dataset’s dimensionality naturally implies a loss of information; thus, a favorable dimension reduction technique will minimize the loss of relevant information.
With this in mind, numerous dimension-reduction techniques have been developed to meet this end. In this investigation, the methods of Principal Component Analysis (PCA), Partial Least Squares (PLS), and three variants of Johnson-Lindenstrauss-inspired Random Matrices (RM) will be compared (Johnson and Lindenstrauss, 1984). The first approach, PCA, originated with and was described by Pearson (1901). PLS was first rigorously introduced and explained by Wold (1966). The three variants of RMs were constructed according to the specifications of Achlioptas (2003) and Dasgupta and Gupta (2003). This research was motivated in part by the results of Nguyen and Rocke (2004) and Nguyen (2005) regarding the performance of PCA vis-à-vis PLS. Furthermore, the two 2009 papers of Nguyen and Rojo, one on the performance of PLS variants and one comparing a multitude of reduction and regression approaches, were also utilized in this inquiry.
Typically, the Cox PH model has been the standard in this application. In this paper, however, the AFT model was employed. Random datasets were first generated using the statistical software suite R, each with a known, true survivor function attributed to it. From here, the dimension reduction techniques were applied to the simulated datasets. Then, the AFT model was used to generate a predicted survivor function. Bias and mean-squared error between the true and estimated curves were then calculated over a partition of fixed time values.
2 Survival Analysis
Before any serious discussion of the current work can begin, a familiarity with the
area known as survival analysis must first be cultivated. In a sentence, survival
analysis employs various methods to analyze data where the response variable is
a time until an unambiguous event of interest occurs (Despa). This event must be
rigorously defined—some examples include birth, death, marriage, divorce, job
termination, promotion, arrests, revolutions, heart attack, stroke, metastasis, and
winning the lottery, to name a few (Ross).
Depending on the research domain, this wide field has many monikers. It is
referred to as failure time analysis, hazard analysis, transition analysis, duration
analysis, reliability theory/analysis in engineering, duration analysis/modeling in
economics, and event history analysis in sociology (Allison). At the time of this
investigation, ‘survival analysis’ serves as the umbrella term for all the aforemen-
tioned epithets.
Survival analysis is borne out of the desire to overcome some limitations pre-
sented in standard linear regression approaches (Despa). One of the two imme-
diate complications that survival analysis can successfully address is data where
responses are all positive values—exempli gratia, survival times that range over t ∈ (0, ∞) (Despa). Secondly, survival analysis can grapple with censored data.
After the event of interest within a particular investigation has been rigorously
declared, an observation is branded as ‘censored’ if the special event was not ob-
served. This can occur due to a plethora of reasons. A common one involves a
patient in a clinical trial dropping out of the study. In this case, it is unknown
how much longer it may have taken for that individual to experience the partic-
ular event of interest. Another example of censoring in the real world involves
observations that do not experience the special event upon the end of a formal
investigation. That is, an individual managed to not express the event of interest
for the whole duration of a study, so they are necessarily labeled as censored.
With this ubiquitous term broadly explained, it is also necessary to understand
that many forms of censoring exist. Typically, most data are ‘right-censored’. This
term signifies observations that have the potential to experience the declared event
of interest after—or to the right in a time-line—of the time they became censored.
For instance, take an individual with a stage of cancer and declare the event of
interest to be death. Then, if this person becomes censored, the event of interest is
naturally bound to occur after the time they became censored. In a similar manner,
‘left-censored’ data occurs when the event of interest occurred before the specific
time a formal investigation began (Lunn). Understandably, this phenomenon is
less commonplace in reality. An example of left-censored data involves providing
a questionnaire to mothers inquiring whether or not they are actively breastfeed-
ing (Vermeylen). Left-censoring would occur if a mother entered the study and
had already stopped breastfeeding. Finally, a third type is known as ‘interval cen-
soring’. This might be observed in a case where clinical follow-ups are necessary.
For a datum to be interval-censored, the event of interest would have to be ob-
served within an interval between two successive follow-ups (Sun).
Survival analysis is a prominent regression approach because it can success-
fully incorporate both censored and uncensored data when modeling the relation-
ship between predictors and responses (Despa). Typically, the response variables
will have at least both a survival time and censoring status associated with them.
From here, methods exist to estimate both survival and hazard functions that fa-
cilitate the interpretation of the distribution of survival times (Despa).
Survivor curves determine the probability that the event of interest is not ex-
perienced after a particular time. Rigorously,
S(t) = P(T > t) = ∫_t^∞ f(τ) dτ = 1 − F(t),
where S(t) denotes the survivor function, t is a fixed time, T is a random variable,
f(τ) is the probability density function of T, and F(t) is the cumulative distribu-
tion function of T.
The hazard, on the other hand, is defined as a rate in which events happen
(Duerden). Thus, one can calculate the probability of an event happening within a
small time interval as this hazard rate multiplied by the length of time (Duerden).
Additionally, the hazard function describes the probability that an observation ex-
periences the event of interest at a particular time (Duerden). This implies that
the observation has already survived—that is, has not experienced the event of
interest—at the specified time (Duerden). In precise terms, the hazard function is
defined as
h(t) = f(t) / S(t),

where f(t) denotes the probability density function and S(t) represents the
survival function given a random variable T. From this expression, it is imme-
diately possible to understand the intricate relationship between distribution, sur-
vival, and hazard functions. As a result, many other expressions exist aside from
this rather simplistic form.
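To make these relationships concrete, the short R sketch below (purely illustrative and not part of the simulation study; the rate value is an arbitrary choice) evaluates the density, survivor, and hazard functions of an exponentially distributed survival time, for which the hazard is constant.

lambda <- 0.5                                # assumed rate, chosen only for illustration
f <- function(t) lambda * exp(-lambda * t)   # density f(t)
S <- function(t) exp(-lambda * t)            # survivor function S(t) = 1 - F(t)
h <- function(t) f(t) / S(t)                 # hazard h(t) = f(t) / S(t)
t <- c(0.5, 1, 2, 4)
cbind(t, S = S(t), h = h(t))                 # the hazard column is constant at 0.5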
A natural thought that may arise within survival analysis is whether results
involving survivor curves or hazard functions are desired. In many contexts, stan-
dard researchers prefer survivor curves in order to interpret results of their gath-
ered data. Arguably, since these curves output a probability in response to an input
of time, it becomes easier to comprehend trends and relationships than by doing
so via hazard. Furthermore, hazard functions and hazard rates are based on ratios
of probability density functions and survival curves; this makes hazard results
more difficult to digest and understand.
Aside from these considerations, there is also another factor involved in sur-
vival analysis to cognize: the selection of methods that can be utilized to relate
predictor variables and the resulting survival times. The three main forms to
achieve this end include parametric, semiparametric, and nonparametric models
(Despa). These differ in the assumptions being made on the given data.
Parametric approaches make the prime assumption that the distribution of the
survival times follows a known probability distribution (Despa). For example,
these can include the exponential and compound exponential, Weibull, Gompertz-
Makeham, Rayleigh, gamma and generalized gamma, log-normal, log-logistic,
generalized F, and the Coale-McNeil models (Rodriguez, 2010). For these and
other applicable methods, model parameters are estimated according to an alter-
ation to their maximum likelihood (Despa). In parametric techniques, relation-
ships are forced between f(t), F(t), S(t), and h(t) (Cook).
In contrast, a nonparametric model does not assert as many relatively bold
assumptions. For instance, linearity and a smooth regression function is not nec-
essary in a nonparametric context (Fox). Although this provides a researcher with
much more flexibility, interpretation can oftentimes become more difficult.
A semiparametric model posits that the errors attributed to the regression model are uncorrelated and identically distributed, without requiring them to follow a particular well-defined probability distribution. In addition, a model of this form does not presume that the baseline hazard function has a particular ‘shape’ attributed to it. More generally, when a combination of both parametric and nonparametric assumptions is made, the regression model is appropriately described as being semiparametric in nature.
These three types of regression models are rigorously represented below. Let
n denote the number of observations, Y represent the response variable, X signify the matrix of predictors, and let β be the regression coefficients with errors ε_i. Additionally, let m(·) = E(y_i | x_i) for i = 1, . . . , n.
A parametric model can be expressed as
y_i = x_i^T β + ε_i,   i = 1, . . . , n.
In this case, the resulting curve is smooth and known. Furthermore, it is described
by a finite set of parameters which will need to be estimated. Ultimately, interpre-
tation is simple through this approach.
Then, for a nonparametric method,
y_i = m(x_i) + ε_i,   i = 1, . . . , n.
Here, function m(·) is also smooth and flexible, yet it is now unknown. Further-
more, the interpretation of such a curve becomes ambiguous.
Lastly, in the case where a model is classified as semiparametric, we observe
that
y_i = x_i^T β + m_z(z_i) + ε_i,   i = 1, . . . , n.
As previously mentioned, some components are estimated parametrically while others are determined nonparametrically from the given data.
3 Methods
The main methods employed in this investigation were centered on different ways
of performing dimension reduction. These methods were: Principal Component Analysis (PCA), Partial Least Squares (PLS), and a set of three distinct Random Matrices (RM). For each method, the AFT model was then employed to generate survivor curve estimates. These methods will be discussed in greater detail here.
3.1 Dimension Reduction
The central goal of the three aforementioned dimension reduction techniques is to
reduce a dataset with n observations and p covariates to a new dataset of dimen-
sions n × k such that k ≪ p. Additionally, a competent method will achieve this
end while retaining an acceptable amount of relevant data and omitting relatively
collinear variables.
Both PCA and PLS reduce dimensionality through orthogonal transformations
of covariates; then, a subset of these is retained such that these new covariates pre-
dict the response with a satisfactory caliber of precision. Meanwhile, RM differs
from these two procedures by generating a matrix with certain qualities that also
reduces dimensionality.
To facilitate the explanation of these reduction techniques, pertinent notation
will first be introduced.
3.1.1 Notation
Let X be the n × p column-centered matrix such that n and p denote given obser-
vations and covariates, respectively. Also, let n ≪ p. Furthermore, let Y be the n × q matrix of observed responses.

In the microarray gene dataset example, n would represent the number of patients while p would denote the number of observed genes attributed to them.
Thus, X would be a matrix that contains particular patients on the rows and their
respective genes on the columns. Additionally, Y would serve as an n × 1 vector
of survival times.
3.1.2 Principal Component Analysis
PCA reduces dataset dimensionality through orthogonal components obtained by
maximizing the variance between linear combinations of the original predictors
contained in X. More precisely, k weight vectors or ‘loadings’ w are constructed
such that rows of X map to principal component scores t. The score of observation n on component k is

t_{nk} = x_n · w_k.
Ultimately, X can be completely decomposed into its components as follows:
T = XW.
Here, X has original dimensions n×p, W has dimensions p×p, and T, therefore,
has dimensions n × p as expected. Additionally, the columns of W contain the
eigenvectors of X^T X.
From here, a desired amount of the resulting orthogonal components is cho-
sen. These are then referred to as ‘principal components’ since they are chosen in
order to maximize the variability along each direction of the new and reduced set
of axes. What this transformation accomplishes, in other words, is that it projects
the original data cloud into a new coordinate system via rotations of the initial
coordinate system such that variability of the initial data is maximized along each
direction. Additionally, PCs are ranked according to how much variance they
account for in their respective directions. That is, the PCs with the largest eigen-
values are ranked the highest and represent a sizable portion of the data’s variability, since variability is greatest along their eigenvectors’ directions.
It is imperative to note that the chosen PCs obtained from PCA rely on op-
erations performed on X, the given dataset matrix. Thus, the response variable
Y is not taken into account during this particular dimension reduction algorithm.
Consequently, these PCs may not be laudable predictors of the response variable
in a given context. Due to this property of PCA, it is often referred to as an ‘un-
supervised’ technique.
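The base-R sketch below mirrors this construction on a small simulated matrix; the dimensions are illustrative and far smaller than those used later in this study. The loadings W are the eigenvectors of X^T X, and the scores are the projection T = XW.

set.seed(1)
n <- 20; p <- 50; k <- 5                      # illustrative dimensions only
X <- matrix(rnorm(n * p), n, p)
X <- scale(X, center = TRUE, scale = FALSE)   # column-center X
eig <- eigen(t(X) %*% X)                      # eigenvectors of X^T X are the loadings
W <- eig$vectors[, 1:k]                       # keep the k leading principal components
T_scores <- X %*% W                           # principal component scores (n x k)
eig$values[1:k] / sum(eig$values)             # proportion of variance per retained PC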
3.1.3 Partial Least Squares
Whereas PCA reduces dimensionality through X, the method of PLS does so
through a consideration of both independent and dependent variables X and Y.
Thus, this approach is often referred to as being ‘supervised’.
This regression model is especially useful when there is either high collinear-
ity among predictors or when the number of predictor variables is much greater
than the number of observations. In these situations, ordinary least-squares regression would either perform poorly or fail entirely; it would also fail if Y were not one-dimensional—id est, if there were more than one observed response.
PLS extracts factors from both X and Y so that the covariance between these
factors is maximized. In particular, PLS is largely based on the singular value de-
composition of X^T Y. Recall that PLS does not require Y to be one-dimensional;
an advantage of the PLS procedure is that Y can contain as many observed re-
sponses as are deemed necessary and practical by researchers.
The method of PLS decomposes both X and Y so that
X = TP^T + E   and   Y = UQ^T + F.
Here, T is a matrix of ‘X-scores’, P is a matrix of ‘X-loadings’, and E is a matrix
of error for X. Similarly, U, Q, and F represent ‘Y-scores’, ‘Y-loadings’, and Y
error, respectively. Both X- and Y-scores are defined as being linear combinations
of the predictor and response variables, respectively. Then, X- and Y-loadings are
linear coefficients that form a bridge from X to T and from Y to U. A common
assumption about E and F is that they are random variables with independent
and identical distributions. This decomposition of X and Y is done in hopes of
maximizing the covariance between T and U.
The PLS algorithm is an iterative procedure. First, two sets of weights must
be constructed as linear combinations of the columns of both X and Y. These
will be denoted by w and c, respectively. The goal here is to have their covariance
be maximal. Recall that matrices T and U denote, accordingly, X- and Y-scores.
Then, the next step in the PLS approach is to obtain a first pair of vectors t = Xw and u = Yc such that w^T w = 1, t^T t = 1, and t^T u is maximized. After this first pair of so-called ‘latent vectors’ has been obtained, their rank-one contribution is subtracted (deflated) from both X and Y. This procedure is then repeated, thereby eventually reducing X to a zero matrix.
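A compact illustration of this first step is given below, using only base R on a small simulated X and Y (dimensions chosen for illustration). The leading left and right singular vectors of X^T Y supply the first pair of weight vectors, and the resulting scores t = Xw and u = Yc attain the maximal covariance among unit-norm weights; a full PLS fit, such as the plsreg1 function used later, would repeat this step after deflating X and Y.

set.seed(2)
n <- 20; p <- 50; q <- 2                      # illustrative dimensions only
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
Y <- scale(matrix(rnorm(n * q), n, q), center = TRUE, scale = FALSE)
sv <- svd(t(X) %*% Y)                         # PLS builds on the SVD of X^T Y
w <- sv$u[, 1]                                # first X-weight vector (unit norm)
c1 <- sv$v[, 1]                               # first Y-weight vector (unit norm)
t1 <- X %*% w                                 # first X-score, t = Xw
u1 <- Y %*% c1                                # first Y-score, u = Yc
cov(t1, u1)                                   # the covariance that PLS maximizes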
3.1.4 Random Matrices
Whereas the previously discussed methods of PCA and PLS reduce dimension-
ality through a careful analysis of X and Y, the third technique of constructing
random matrices, as the name implies, is considerably cavalier by comparison. In
essence, a random matrix with a particular set of qualities is fabricated. Then, this matrix is multiplied by a given dataset—matrix X in this particular investigation. According to the lemma attributed to Johnson and Lindenstrauss, if two observations in X are considered as multidimensional points with an initial squared distance between them, then once these particular random matrices are applied to X, that initial distance is not distorted by too much. Similar to the approaches utilized in PCA and PLS, random matrices can reduce dimensionality without losing much information in the process. First, the Johnson-Lindenstrauss
(JL) Lemma will be presented as well as a description of the three particular ran-
dom matrices that were constructed in this research. The constraint on k was
utilized according to Dasgupta-Gupta.
The Johnson-Lindenstrauss Lemma. For any ε ∈ (0, 1) and any positive integer n, let k be a positive integer with

k ≥ 4 ln(n) / (ε^2/2 − ε^3/3).

Then, for any set S of n points in R^d, there exists a mapping f : R^d → R^k such that, for all points u, v ∈ S,

(1 − ε)‖u − v‖^2 ≤ ‖f(u) − f(v)‖^2 ≤ (1 + ε)‖u − v‖^2.
In terms of this investigation, n also represents the number of observations
while ε denotes the error tolerance. Finally, k can be thought of as the resulting
dimension in this given context after applying a random matrix to the dataset ma-
trix X.
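As a quick numerical illustration of how demanding this bound is, the small helper below (a hypothetical convenience function written for this discussion, not part of the simulation code) evaluates the smallest k allowed by the lemma for the n = 100 observations used later in this study; as the next paragraph discusses, the bound can easily exceed the dimensions one hopes to reduce to.

jl_min_k <- function(n, eps) {
  # Smallest k satisfying k >= 4 ln(n) / (eps^2/2 - eps^3/3).
  ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))
}
jl_min_k(100, 0.65)   # 154: even a loose tolerance demands k larger than n = 100
jl_min_k(100, 0.10)   # 3948: a tight tolerance demands k far larger than p = 1000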
An immediate complication of these so-called ‘JL-embeddings’ is that we may
sometimes observe that k ≥ d as a result of strictly following the hypotheses of
the lemma. Id est, by employing the results of this theorem, a researcher would
be taking data from a smaller dimension and transforming it so that the data exists
in a higher dimension. Ultimately, the JL Lemma may not reduce dimensionality
at all, thus rendering it impractical for the desired purposes of this text. Thus, it
became imperative in this research to observe the effects of ignoring the JL Lemma’s constraints on k and to deduce whether or not desirable results are obtained nonetheless. Having understood the motivation behind random matrices and these precise limitations, an explanation of the three random matrices themselves
is in order.
The first two random matrices were fabricated according to the previous re-
sults of Achlioptas while the third was constructed by following the specifications
of Dasgupta-Gupta. Let Γ1, Γ2, and Γ3 accordingly denote these random ma-
trices. To keep consistent with the previous notation, recall that X is an n × p
predictor matrix of observations on the rows and covariates on the columns. It
follows that Γ1, Γ2, and Γ3 are p × k matrices. Once X is multiplied by one of them, the resulting matrix Ω will have dimensions n × k, where the goal is to have n > k.
Entries of Γ1 were produced from the following distribution:

(1/√k) ×  { −1 with probability 1/2
            +1 with probability 1/2 }

For Γ2, its entries were obtained from

√(3/k) ×  { −1 with probability 1/6
             0 with probability 4/6
            +1 with probability 1/6 }
Finally, Γ3 is a Gaussian random matrix with entries generated from N(0, 1). The rows of Γ3 are then normalized to unit length.
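For reference, a compact, vectorized construction of Γ1, Γ2, and Γ3 is sketched below; it is equivalent in distribution to the loop-based code in the Appendix, with p and k set to the 1000 × 37 configuration used later.

set.seed(4)
p <- 1000; k <- 37
gamma1 <- matrix(sample(c(-1, 1), p * k, replace = TRUE), p, k) / sqrt(k)
gamma2 <- matrix(sample(c(-sqrt(3), 0, sqrt(3)), p * k, replace = TRUE,
                        prob = c(1/6, 4/6, 1/6)), p, k) / sqrt(k)
gamma3 <- matrix(rnorm(p * k), p, k)
gamma3 <- gamma3 / sqrt(rowSums(gamma3 ^ 2))   # normalize each row of Gamma_3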
3.2 The Accelerated Failure Time Model
The previously described techniques were employed in order to reduce dimensionality. After successfully doing so, it was necessary to generate a survival curve based on the reduced data and compare it with the true survival curve. In this investigation, the AFT model was the vehicle used to generate estimates
of the survivor curves.
The AFT model is seldom utilized compared to the celebrated Cox Propor-
tional Hazards (PH) model for various reasons. One reason to adopt the AFT
approach in this investigation is due to the simplified interpretation it provides re-
searchers of the data. This approach presents an interpretation of the relationship
between observation covariates and given responses in terms of survivor curves.
The Cox PH model, on the other hand, does so through hazard functions and haz-
ard ratios that, while equally profound, are not as visually simple to comprehend
as the AFT model’s survivorship presentation. In simple terms, the hazard is the
instantaneous rate at which the event occurs near a particular time. It is arguably
more straightforward to understand results in terms of the probability that an in-
dividual ‘survives’ or does not experience an event of interest after a particular
time. Thus, this first reason to employ the AFT model in this text is a matter
of user preference and ease of interpretation of results. Another, more technical, reason to employ the AFT model is that it directly models the given survival times, a luxury that the Cox PH model does not afford.
In this investigation, AFT was implemented according to the following under-
lying model:
ln(T_i) = µ + z_i β + e_i.

Here, i represents a particular observation from a set of n observations. Furthermore, T_i denotes the survival time for the i-th observation. Meanwhile, µ designates the given theoretical mean, z_i is the vector of covariates for the i-th observation, and β is the vector of covariate/regression coefficients. Finally, e_i is the given error for the i-th observation.
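A minimal, self-contained sketch of fitting this model to an already reduced design matrix is shown below. It mirrors the aftgee call used in the Appendix (and assumes the survival and aftgee packages are installed), but the data, dimensions, and coefficients here are illustrative assumptions rather than the study’s simulated setting.

library(survival)   # provides the Surv() response object
library(aftgee)     # provides the aftgee() AFT fitting routine

set.seed(3)
n <- 100; k <- 5                                    # illustrative dimensions only
Z <- matrix(rnorm(n * k), n, k)                     # a reduced (n x k) covariate matrix
beta <- runif(k, -0.5, 0.5)                         # hypothetical true coefficients
Ti <- rexp(n, rate = exp(-as.vector(Z %*% beta)))   # AFT-consistent survival times
delta <- rep(1, n)                                  # 1 = event observed (no censoring)

fit <- aftgee(Surv(Ti, delta) ~ -1 + Z, corstr = "independence", B = 0)
fit$coefficients                                    # estimated regression coefficients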
4 Method Assessments
This research utilized a programming environment to simulate datasets that would
undergo the reduction procedures of PCA, PLS, and the RM variants. Additionally, ‘feeding’ these data into the AFT model to obtain and compare the pairs of
survival curves was likewise accomplished through statistical software. This sec-
tion will address specifically how the research was performed.
4.1 Simulated Datasets
In order to compare the dimension reduction techniques, R statistical software is used to simulate data. The β regression coefficients, observations, covariates, and survival times are simulated using the previously discussed AFT formula, where the theoretical mean µ is set to 0 for simplicity. The dimensionality of the data matrix X is 100 observations by 1000 covariates. A vector of 1000 β regression coefficients relating to the 1000 covariates is obtained by generating random values from U(−1 × 10^−7, 1 × 10^−7). A vector of random mean values µ_j is generated from a N(0, 1) distribution for j = 1, . . . , p, where p represents the number of covariates. β and µ remain fixed for all simulations. Next, the matrix X_{100×1000} of the 100 observations and 1000 covariates is generated with entries x_ij = e^{z_ij}, where z_ij ∼ N(µ_j, 1) for j = 1, . . . , p and i = 1, . . . , n, with n the number of observations; the data are therefore log-normally distributed. The survival times T_i are then drawn from an exponential distribution with rate λ_i = e^{−x_i β} for i = 1, . . . , n.
Now that all the data are generated, z_{n×p} is converted to z*_{n×p} by centering each column about its mean. PCA is applied to z*_{n×p} using the function PCA from the package FactoMineR (Husson et al., 2015) to obtain 99 principal components. After this procedure is completed, the principal components are narrowed down to 37, which represents 50% of the total variance of the model. PCA outputs a weight matrix of dimension 1000 × 37, which represents the weights given to each covariate by the 37 principal components. The data matrix X is multiplied by this weight matrix to obtain a reduced matrix of dimension 100 × 37. A Surv object is created, which takes as input the survival times, the censoring type, and an indicator vector equal to 1 if the event was observed and 0 if the observation is censored, and outputs a response matrix. The T_i vector and the 37 principal components are fed into the AFT model in R using the package aftgee (Chiou et al., 2015) to obtain 37 estimated β coefficients. The weight matrix is then multiplied by these estimates to recover the 1000 β estimates for the original covariates.

In order to acquire an estimated lambda value for the estimated survival function, the mean of the exponentiated negative product of the data matrix and the recovered β estimates is taken, mirroring λ_i = e^{−x_i β}. The estimated survival function is then Ŝ_0(t) = e^{−λ̂ t}, where λ̂ is the estimated mean lambda value. This procedure is repeated for PLS using the same number of components as PCA, except using the function plsreg1 from the package plsdepot (Sanchez, 2015) instead.
The matrices Γ1, Γ2, and Γ3 from Achlioptas and Dasgupta-Gupta are generated containing random entries that satisfy each author’s probability specifications. An algorithm in R is created to validate the dimension-reduction ability of the Johnson-Lindenstrauss Lemma for Γ1, Γ2, and Γ3. The algorithm takes two randomly picked vectors u, v from X and maps f : R^p → R^k, where k is the new reduced dimension. The Johnson-Lindenstrauss Lemma is then tested using varying values of ε and k for multiple simulations. It is shown that as long as k and ε follow the constraints given by Dasgupta and Gupta (2003), the Johnson-Lindenstrauss Lemma is satisfied 100% of the time. The value of ε is varied until a 1000 × 37 projection matrix satisfying the Johnson-Lindenstrauss Lemma is obtained. Unfortunately, a fairly high ε value of approximately 0.65 is required to satisfy the lemma. Therefore, either a high ε value is used or the lemma is not followed.
In order to compare the random matrices to PCA and PLS, X is multiplied by Γ1, Γ2, and Γ3, each of dimension 1000 × 37, to obtain reduced matrices of dimension 100 × 37. Then, the reduced matrices are fed into the AFT model and all the same steps as for PCA and PLS are performed. Therefore, five different estimated survival curves are produced: one each for PCA and PLS and three for the three random matrices.
The true survival curve is S_0(t) = e^{−λ̄ t}, where λ̄ is the mean of the λ_i values obtained by exponentiating the negative product of the data matrix and the true β coefficients. The y-axis of the survival curve is partitioned into 20 equally spaced values from 0.025 to 0.975, and the corresponding t_i values are found along the x-axis. The bias and mean-squared error (MSE) are calculated at each of these t_i values to obtain the error distribution for each method. The bias is found by calculating the pointwise difference between the true and estimated survival curves, and the MSE is calculated from the squared difference. The bias and MSE at each t_i are summed over 5,000 simulations and the error distributions are compared for all methods.
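The condensed sketch below reproduces this error calculation for a single method in a single simulation, with made-up rates standing in for the true and estimated mean lambda values; in the full study these pointwise quantities are accumulated over the 5,000 simulations.

lambda_bar <- 0.8; lambda_hat <- 0.9          # illustrative true and estimated rates
S_true <- function(t) exp(-lambda_bar * t)    # true survivor curve
S_hat <- function(t) exp(-lambda_hat * t)     # estimated survivor curve

u <- seq(0.025, 0.975, by = 0.05)             # 20 survival probabilities on the y-axis
t <- -log(u) / lambda_bar                     # times where S_true(t) equals u
bias <- S_hat(t) - S_true(t)                  # pointwise bias at each t
mse <- (S_hat(t) - S_true(t)) ^ 2             # pointwise squared error at each t
round(cbind(t, bias, mse), 4)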
5 Results
In the following sections, the error distribution plots for the dimension reduction
techniques are compared after 5000 simulations. PLS and PCA are compared to
each other, the random matrices are compared, and then all dimension reduction
techniques are compared. The goal is to minimize bias and MSE; therefore, the dimension reduction technique closest to zero is the more efficient method. In the bias plots, zero is at the top of the plot, and for MSE, the black horizontal line at the bottom denotes zero. Notice that the plots differ least at the extremes of the survival curve’s domain, while the most variability is observed in the middle of the interval.
5.1 Principal Component Analysis versus Partial Least Squares
The bias and MSE plots (produced by the code in Appendix 10.1) show that PCA outperforms PLS by a maximum magnitude of approximately 0.07 in bias and 0.03 in MSE.
5.2 Random Matrices
In these plots, RM1 denotes Γ1, RM2 denotes Γ2, and RM3 denotes Γ3. The results show that there is no significant difference in performance among the three random matrices in terms of bias and MSE.
5.3 All Methods
From both the bias and MSE plots, it is evident that all three random matrices outperform both PCA and PLS. The random matrices outperform PCA by approximately 0.03 and PLS by approximately 0.10 in bias, and by approximately 0.015 and 0.045, respectively, in MSE.
6 Discussion
We originally wanted to generate the β coefficients from U(−0.2, 0.2), but when we computed x_i β to get the λ_i values, we obtained very large values. Recall the formula λ_i = e^{−x_i β}. When x_i β is very large, the λ_i values become very small, and within R’s numerical precision the estimated survival function equals 1, creating a horizontal survival curve. Therefore, we had to shrink the β coefficients to U(−1 × 10^−7, 1 × 10^−7) to obtain survival curves with realistic properties.
Before conducting our research, we investigated previous work in the field, such as the two 2009 papers of Nguyen and Rojo. According to their findings, PLS outperformed PCA, which is the result we expected to observe as well; instead, we found that PCA greatly outperformed PLS. We are not certain why our results differ from these works, but we suspect that it is due to not incorporating censored data. In both papers, Nguyen and Rojo compared methods using censored data, which we did not have time to incorporate into our research. Therefore, we suspect that PLS might outperform PCA when censored data are used, while PCA outperforms PLS with uncensored data.
Obviously, in real-life studies, censored data can be a serious problem that needs to be taken into account. We wanted to incorporate censored data in our investigation but were unable to due to time constraints; this is something we would like to add in future investigations. We also wanted to apply our findings to real microarray gene datasets, where there are a small number of patients, with a specific type of cancer, and a large number of genes. We wanted to work with these datasets and apply our dimension reduction techniques to obtain estimated survival curves where the event of interest was death and the survival curve modeled each patient’s probability of surviving beyond a given time T_i. Unfortunately, we were not able to work with these real datasets, which is also something we would like to investigate at a future time.
7 Conclusion
The results of applying PCA, PLS, and the three Johnson-Lindenstrauss-inspired random matrices from Achlioptas and Dasgupta-Gupta to log-normally distributed, uncensored data for estimating the survival curve under the AFT model show that PCA outperforms PLS in terms of both bias and MSE. The three random matrices do not show a significant difference among themselves in terms of either bias or MSE. Overall, the random matrices outperform both PCA and PLS for both bias and MSE.
8 Acknowledgments
This research was supported by the National Security Agency through REU Grant
H98230 15-1-0048 to The University of Nevada at Reno, Javier Rojo PI. We
would like to sincerely thank and acknowledge our advisor, Dr. Javier Rojo, as well as Nathan Wiseman and Kyle Bradford from the University of Nevada, Reno, for their support and generous contributions to our research.
9 References
Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.
Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert
space. Contemp Math 26: 189-206, 1984.
Pearson, K. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine 2: 559-572, 1901.
Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (ed.), Multivariate Analysis: 391-420, 1966.
Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with
binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.
Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and
Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.
Nguyen, D.V. Partial least squares dimension reduction for microarray gene ex-
pression data with a censored response. Math Biosci 193: 119-137, 2005.
Nguyen, D.V., and D.M. Rocke. On partial least squares dimension reduction for
microarray-based classification: A simulation study. Comput Stat Data Analysis
46: 407-425, 2004.
Despa, Simona. What is Survival Analysis? StatNews 78: 1-2.
Ross, Eric. "Survival Analysis." 2012. PDF
Allison, Paul D. "Survival Analysis." 2013. PDF
Lunn, Mary. "Definitions and Censoring." 2012. PDF.
Vermeylen, Francoise. Censored Data. StatNews 67: 1, 2005.
Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of Microarray Gene Ex-
pression Data: The Accelerated Failure Time Model. Journal of Bioinformatics
and Computational Biology 7(6): 939-954, 2009.
Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of Microarray Data in
the Presence of a Censored Survival Response: A Simulation Study. Statistical
Applications in Genetics and Molecular Biology 8(1): 2009.
Sun, Jianguo. "Interval Censoring." 2011. PDF.
Duerden, Martin. "What Are Hazard Ratios?" 2012. PDF.
Rodriguez, German. "Parametric Survival Models. Princeton." 2010. PDF.
Cook, Alex. "Survival and hazard functions." 2008. PDF.
Fox, John. "Introduction to Nonparametric Methods." 2005. PDF.
Husson et al. "Package ‘FactoMineR’." 2015. PDF.
Sanchez, Gaston. "Package ‘plsdepot’." 2015. PDF.
Chiou et al. "Package ‘aftgee’." 2015. PDF.
Therneau et al. "Package ‘survival’." 2015. PDF.
10 Appendix
Herein, the R code utilized in this investigation is presented. The packages survival (Therneau et al., 2015), FactoMineR (Husson et al., 2015), plsdepot (Sanchez, 2015), and aftgee (Chiou et al., 2015) will need to be installed and loaded into R to successfully run the provided code.
10.1 Error Plots
Below is the code used to produce the six error plots for the five various reductions
methods.
library(survival)
# We created a Surv object using function ’Surv’ from this
# package.
library(FactoMineR)
# We used the function ’PCA’ from this package.
library(plsdepot)
# We used ’plsreg1’ from this package.
library(aftgee)
# With this package, we were able to apply the AFT model to our
# simulated data using the function ’aftgee’.
sim <- function(s) # This function will produce ’s’
# simulations and output error plots.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
sum_PCA_BE_t <- matrix(0, 1, 20)
sum_PCA_MSE_t <- matrix(0, 1, 20)
sum_PLS_BE_t <- matrix(0, 1, 20)
sum_PLS_MSE_t <- matrix(0, 1, 20)
sum_RM1_BE_t <- matrix(0, 1, 20)
sum_RM1_MSE_t <- matrix(0, 1, 20)
sum_RM2_BE_t <- matrix(0, 1, 20)
sum_RM2_MSE_t <- matrix(0, 1, 20)
sum_RM3_BE_t <- matrix(0, 1, 20)
sum_RM3_MSE_t <- matrix(0, 1, 20)
# These will store the calculated bias and mean-squared
# error across 20 selected points after we have run ’s’
# simulations.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
# Begin one simulation iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations
# on the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100) # Generating lambda values.
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
}
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100) # Survival times being generated.
{
T[i] <- rexp(1, rate=lambda[i])
}
RM1 <- matrix(0, 1000, 37)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(37)
RM2 <- matrix(0, 1000, 37)
# Random matrix two with
# ’-sqrt(3)’s, ’0’s, and ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)),
1, replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(37)
RM3 <- matrix(0, 1000, 37)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:37)
{
RM3[m,n] <- rnorm(1, 0, 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:37)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered ’z’ matrix for PCA.
z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37)
z_star_PLS <- plsreg1(scale(z, center = TRUE,
scale = TRUE), T, comps = 37,
crosval = FALSE)
# Applying PCA and PLS to the data.
z_double_star_PCA <- z_star %*% z_star_PCA$var$coord
z_double_star_PLS <- z_star %*% z_star_PLS$x.loads
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
delta <- matrix(1, nrow = 100, ncol = 1)
# An event-status indicator. Following the ’survival’ package
# convention, a value of 1 means the event of interest was
# observed and 0 means the observation is censored. Setting
# every entry to 1 encodes the uncensored data used in this
# study.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified
# as being right-censored.
data_AFT_fit_PCA <- aftgee(data_Surv ~ -1 +
z_double_star_PCA,
corstr = "independence", B = 0)
data_AFT_fit_PLS <- aftgee(data_Surv ~ -1 +
z_double_star_PLS,
corstr = "independence", B = 0)
data_AFT_fit_RM1 <- aftgee(data_Surv ~ -1 +
z_double_star_RM1,
corstr = "independence", B = 0)
data_AFT_fit_RM2 <- aftgee(data_Surv ~ -1 +
z_double_star_RM2,
corstr = "independence", B = 0)
data_AFT_fit_RM3 <- aftgee(data_Surv ~ -1 +
z_double_star_RM3,
corstr = "independence", B = 0)
beta_hat_star_PCA <- data_AFT_fit_PCA$coefficients
beta_hat_star_PLS <- data_AFT_fit_PLS$coefficients
beta_hat_star_RM1 <- data_AFT_fit_RM1$coefficients
beta_hat_star_RM2 <- data_AFT_fit_RM2$coefficients
beta_hat_star_RM3 <- data_AFT_fit_RM3$coefficients
# The full beta/regression coefficients.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ will go here.
for (i in 1:1000) # Averaging ’z’s columns.
{
z_bar_star[1, i] <- mean(z[, i])
}
beta_hat_z_PCA <- z_star_PCA$var$coord %*%
beta_hat_star_PCA
beta_hat_z_PLS <- z_star_PLS$x.loads %*%
beta_hat_star_PLS
beta_hat_z_RM1 <- RM1 %*%
beta_hat_star_RM1
beta_hat_z_RM2 <- RM2 %*%
beta_hat_star_RM2
beta_hat_z_RM3 <- RM3 %*%
beta_hat_star_RM3
# The final beta estimates for each technique.
lambda_hat_PCA <- mean(exp(-z %*% beta_hat_z_PCA))
lambda_hat_PLS <- mean(exp(-z %*% beta_hat_z_PLS))
lambda_hat_RM1 <- mean(exp(-z %*% beta_hat_z_RM1))
lambda_hat_RM2 <- mean(exp(-z %*% beta_hat_z_RM2))
lambda_hat_RM3 <- mean(exp(-z %*% beta_hat_z_RM3))
# Generating the lambda constant from each technique
# employed.
lambda_bar = mean(lambda) # Taking the average of all
# ’lambda’ values and storing it in ’lambda_bar’.
S <- function(t) # The true survivor function.
{
exp(-t * lambda_bar)
}
S_hat_naught_PCA <- function(t)
# The predicted survivor function through PCA.
{
exp(-t * lambda_hat_PCA)
}
S_hat_naught_PLS <- function(t)
# The predicted survivor function through PLS.
{
exp(-t * lambda_hat_PLS)
}
S_hat_naught_RM1 <- function(t)
# The predicted survivor function through RM1.
{
exp(-t * lambda_hat_RM1)
}
S_hat_naught_RM2 <- function(t)
# The predicted survivor function through RM2.
{
exp(-t * lambda_hat_RM2)
}
S_hat_naught_RM3 <- function(t)
# The predicted survivor function through RM3.
{
exp(-t * lambda_hat_RM3)
}
u <- c(seq(0.025, 0.975, 0.05))
# Desired outputs ’u’ that range from 0.025 to 0.975
# and are spaced out by 0.05, resulting in 20 points.
t <- (-1/lambda_bar) * log(u) # Input times ’t’ from the
# respective ’u’s. There are 20 generated times ’t’ in
# this vector.
for (i in 1:20)
# Storing bias across the 20 point pairs in PCA.
{
sum_PCA_BE_t[i] <- sum_PCA_BE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PCA.
{
sum_PCA_MSE_t[i] <- sum_PCA_MSE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in PLS.
{
sum_PLS_BE_t[i] <- sum_PLS_BE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PLS.
{
sum_PLS_MSE_t[i] <- sum_PLS_MSE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM1.
{
sum_RM1_BE_t[i] <- sum_RM1_BE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM1.
{
sum_RM1_MSE_t[i] <- sum_RM1_MSE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM2
{
sum_RM2_BE_t[i] <- sum_RM2_BE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM2.
{
sum_RM2_MSE_t[i] <- sum_RM2_MSE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM3.
{
sum_RM3_BE_t[i] <- sum_RM3_BE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM3.
{
sum_RM3_MSE_t[i] <- sum_RM3_MSE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i])) ^ 2
}
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
ymin_PCA_BE <- min(sum_PCA_BE_t)
ymin_PLS_BE <- min(sum_PLS_BE_t)
ymin_RM1_BE <- min(sum_RM1_BE_t)
ymin_RM2_BE <- min(sum_RM2_BE_t)
ymin_RM3_BE <- min(sum_RM3_BE_t)
# Finding the minimum bias per each technique after
# ’s’ simulations.
ymax_PCA_BE <- max(sum_PCA_BE_t)
ymax_PLS_BE <- max(sum_PLS_BE_t)
ymax_RM1_BE <- max(sum_RM1_BE_t)
ymax_RM2_BE <- max(sum_RM2_BE_t)
ymax_RM3_BE <- max(sum_RM3_BE_t)
# Finding the maximum bias per each technique after
# ’s’ simulations.
ymin_BE <- min(ymin_PCA_BE, ymin_PLS_BE, ymin_RM1_BE,
ymin_RM2_BE, ymin_RM3_BE) / s
ymax_BE <- max(ymax_PCA_BE, ymax_PLS_BE, ymax_RM1_BE,
ymax_RM2_BE, ymax_RM3_BE) / s
# Finding the minimum and maximum bias across all five
# techniques after ’s’ simulations. These will serve as
# the lower and upper range of the y-axis in the final plot.
ymin_PCA_PLS_BE <- min(ymin_PCA_BE, ymin_PLS_BE) / s
ymax_PCA_PLS_BE <- max(ymax_PCA_BE, ymax_PLS_BE) / s
# Calculating the averaged minimum and maximum bias for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_BE <-
min(ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s
ymax_RM_BE <-
max(ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s
# Calculating the averaged minimum and maximum bias for the
# three RMs after ’s’ simulations for plotting purposes.
ymin_PCA_MSE <- min(sum_PCA_MSE_t)
ymin_PLS_MSE <- min(sum_PLS_MSE_t)
ymin_RM1_MSE <- min(sum_RM1_MSE_t)
ymin_RM2_MSE <- min(sum_RM2_MSE_t)
ymin_RM3_MSE <- min(sum_RM3_MSE_t)
# Finding the minimum mean-squared error per each technique
# after ’s’ simulations.
ymax_PCA_MSE <- max(sum_PCA_MSE_t)
ymax_PLS_MSE <- max(sum_PLS_MSE_t)
ymax_RM1_MSE <- max(sum_RM1_MSE_t)
ymax_RM2_MSE <- max(sum_RM2_MSE_t)
ymax_RM3_MSE <- max(sum_RM3_MSE_t)
# Finding the maximum mean-squared error per each technique
# after ’s’ simulations.
ymin_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE, ymin_RM1_MSE,
ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE, ymax_RM1_MSE,
ymax_RM2_MSE, ymax_RM3_MSE) / s
# Finding the minimum and maximum mean-squared error across
# all techniques. These will serve as the lower and upper
# range of the y-axis in the final plot.
ymin_PCA_PLS_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE) / s
ymax_PCA_PLS_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE) / s
# Calculating the averaged minimum and maximum MSE for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_MSE <-
min(ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_RM_MSE <-
max(ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s
# Calculating the averaged minimum and maximum MSE for the
# three RMs after ’s’ simulations for plotting purposes.
# Start of bias plot for PCA and PLS.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: PCA and PLS n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_PCA_PLS_BE,
ymax_PCA_PLS_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of bias plot for PCA and PLS.
# Start of the mean-squared error plot for PCA and PLS.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: PCA and PLS n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_PCA_PLS_MSE,
ymax_PCA_PLS_MSE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of mean-squared error plot for PCA and PLS.
# Start of the bias plot for the random matrices.
plot(t, (sum_RM1_BE_t) / s, pch = 15,
main = paste("Bias: Random Matrices n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_RM_BE,
ymax_RM_BE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of bias plot for the random matrices.
# Start of the mean-squared error plot for the
# random matrices.
plot(t, (sum_RM1_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: Random Matrices n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_RM_MSE,
ymax_RM_MSE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of mean-squared error plot for the random matrices.
# Start of bias plot for all methods.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: All Techniques n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_BE, ymax_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_BE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of bias plot for all methods.
# Start of mean-squared error plot for all methods.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: All Techniques n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_MSE, ymax_MSE),
xlim = c(0, max(t)), col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_MSE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of mean-squared error plot for all methods.
t2 <- Sys.time() # End time.
total_time <- t2 - t1 # Difference between start and end
# times.
print(total_time) # Printing total time to run simulations
# and obtain the plots.
}
10.2 Johnson-Lindenstrauss Testing
Below is the code used for testing the Johnson-Lindenstrauss Lemma by varying
k and ε.
good_points_RM1 <- 0
good_points_RM2 <- 0
good_points_RM3 <- 0
# Good points counter for each random matrix.
# Points are considered ’good’ if they satisfy
# the Johnson-Lindenstrauss Lemma.
sim <- function(s, k, epsilon)
# This function takes in ’s’ simulations, a reduced dimension
# ’k’, and a desired ’epsilon’. It returns the number of times the
# Johnson-Lindenstrauss Lemma was satisfied based on the
# three random matrices.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations on
# the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
u_v_rows <- sample(1:100, 2, replace = FALSE)
obs_u_old <- z[u_v_rows[1], ]
obs_v_old <- z[u_v_rows[2], ]
# We’ve selected two different rows from the dataset
# matrix ’z’ and stored them as new variables. Here,
# observations ’u’ and ’v’ can be thought of as
# 1,000-dimensional points.
dist_old <- sum((obs_u_old - obs_v_old) ^ 2)
# Here, the distance has been calculated between
# observations ’u’ and ’v’.
RM1 <- matrix(0, 1000, k)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(k)
RM2 <- matrix(0, 1000, k)
# Random matrix two with ’-sqrt(3)’s, ’0’s, and
# ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1,
replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(k)
RM3 <- matrix(0, 1000, k)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:k)
{
RM3[m,n] <- rnorm(1, mean = 0, sd = 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:k)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
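# Each row of 'RM3' now has unit Euclidean norm, per the
# Dasgupta-Gupta construction described in Section 3.1.4.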
z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered 'z' matrix (not used in the distance check
# below; the projections are applied to 'z' directly).
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
obs_u_new_RM1 <- z_double_star_RM1[u_v_rows[1], ]
obs_v_new_RM1 <- z_double_star_RM1[u_v_rows[2], ]
obs_u_new_RM2 <- z_double_star_RM2[u_v_rows[1], ]
obs_v_new_RM2 <- z_double_star_RM2[u_v_rows[2], ]
obs_u_new_RM3 <- z_double_star_RM3[u_v_rows[1], ]
obs_v_new_RM3 <- z_double_star_RM3[u_v_rows[2], ]
# After reducing dimensions, points ’u’ and ’v’ now have
# new coordinates. Since there were three random
# matrices, there are three new ’u’ and ’v’ points.
dist_new_RM1 <- sum((obs_u_new_RM1 - obs_v_new_RM1) ^ 2)
dist_new_RM2 <- sum((obs_u_new_RM2 - obs_v_new_RM2) ^ 2)
dist_new_RM3 <- sum((obs_u_new_RM3 - obs_v_new_RM3) ^ 2)
# The squared distances between the projected points 'u' and 'v'
# under each of the three random matrices.
if((1 - epsilon) * (dist_old) <= dist_new_RM1
&& dist_new_RM1 <= (1 + epsilon) * (dist_old))
{
good_points_RM1 <- good_points_RM1 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM2
&& dist_new_RM2 <= (1 + epsilon) * (dist_old))
{
good_points_RM2 <- good_points_RM2 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM3
&& dist_new_RM3 <= (1 + epsilon) * (dist_old))
{
good_points_RM3 <- good_points_RM3 + 1
}
# The preceding three ’if’ statements check to see if
# the Johnson-Lindenstrauss Lemma was satisfied in this
# iteration for each different random matrix.
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
print(paste("For an epsilon of", epsilon, ", k is", k,
"."))
print(paste("Number of times JL was satisfied, RM1:",
good_points_RM1, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM2:",
good_points_RM2, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM3:",
good_points_RM3, "out of", s, "simulations."))
t2 <- Sys.time() # End time.
total_time <- t2 - t1
# Difference between start and end times.
print(total_time) # Printing the total time taken to run the
# simulations.
}
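The snippet below is an illustrative usage sketch rather than part of the
original appendix; the helper jl_k_min is hypothetical and simply evaluates the
Dasgupta-Gupta lower bound on k quoted in Section 3.1.4.
# Hypothetical helper: Dasgupta-Gupta lower bound
# k >= 4 ln(n) / (epsilon^2 / 2 - epsilon^3 / 3).
jl_k_min <- function(n, epsilon)
{
ceiling(4 * log(n) / (epsilon ^ 2 / 2 - epsilon ^ 3 / 3))
}
jl_k_min(n = 100, epsilon = 0.65) # Bound on k for 100 observations.
sim(s = 100, k = 37, epsilon = 0.65) # Empirical check at the study's k = 37.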
10.3 Survival Curves
Below is the code used for generating the real survival curve and the estimated
survival curve under PCA.
library(survival)
library(FactoMineR)
sim <- function(s) # Making a function that takes in a
# simulation count ’s’.
{
options(digits = 22) # Preserving more significant digits to
# reduce numerical and convergence failures.
results <- matrix(0, s, 2) # A matrix with BE on column 1
# and MSE on column 2.
BE_T <- 0 # Initial total BE count.
MSE_T <- 0 # Initial total MSE count.
sum_BE_t <- matrix(0, 1, 20) # Matrix of BE at time ’t’.
sum_MSE_t <- matrix(0, 1, 20) # Matrix of MSE at time ’t’.
num <- 1 # Iteration counter.
sum_BE_t1 <- 0 # Bias error at time ’t1’.
sum_MSE_t1 <- 0 # Mean-squared error at time ’t1’.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000) # A location for the dataset
# information.
while(num <= s) # Running the entire procedure for the specified
# number of iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1) # A matrix
# of random data containing observations on the rows
# and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100)
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
# Generating lambda values.
}
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100)
{
T[i] <- rexp(1, rate = lambda[i])
}
z_star <- scale(z, center = TRUE, scale = FALSE)
z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37)
z_double_star <- z_star %*% z_star_PCA$var$coord
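# Column-centering 'z', extracting 37 principal components
# (roughly 50% of the total variance, per Section 4.1), and
# projecting the data onto them yields the reduced 100 by 37
# matrix 'z_double_star'.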
delta <- matrix(1, nrow = 100, ncol = 1) # An event indicator.
# In a 'Surv' object, 1 marks an observation whose event of
# interest was observed and 0 marks a censored observation, so a
# vector of ones corresponds to the fully uncensored setting used
# in this investigation.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified as
# being right-censored.
data_AFT_fit <- NULL
data_AFT_fit <- tryCatch(survreg(data_Surv ~ -1 +
z_double_star,
dist = "lognormal",
control = survreg.control(maxiter = 100000000)),
warning = function(c) {problem <<- TRUE})
if(!problem) # If no warning was raised during fitting, the rest
# of this iteration runs.
{
beta_hat_star <- as.matrix(data_AFT_fit$coeff)
# These are beta estimates.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ go here.
for (i in 1:1000)
{
z_bar_star[1, i] <- mean(z[, i])
# Taking the average of each column of ’z’.
}
beta_hat_z <- z_star_PCA$var$coord %*% beta_hat_star
# Beta estimates mapped back to the original 1,000 covariates
# (the PCA loadings matrix times the fitted coefficients).
lambda_hat <- exp(-z_bar_star %*% beta_hat_z)
# Estimated rate parameter for the predicted survivor function.
lambda_bar <- mean(lambda)
# The average of all true 'lambda' values, used to construct the
# true survivor curve.
S_hat_naught <- function(t)
# The predicted survivor function.
{
exp(-t * lambda_hat)
}
S <- function(t)
# The true survivor function.
{
exp(-t * lambda_bar)
}
data_AFT_pred <- predict(data_AFT_fit, type = "terms",
se.fit = TRUE)
# Predicted linear-predictor terms and their standard errors from
# the fitted 'survreg' object, returned as a list (not used
# further in this script).
surv_curv <- curve(S_hat_naught, from = 0, to = 7,
n = 1000, type = "l",
xlab = "", ylab = "", xaxt = "n",
yaxt = "n", col = "99")
# Plotting the predicted survivor function.
par(new = TRUE)
curve(S, from = 0, to = 7, n = 1000, type = "l",
main = paste("Survivor Curves\nSimulation", num),
xlab = expression(italic(t)),
ylab = expression(S(italic(t))), col = "black")
u <- c(seq(0.025,0.975,0.05))
# Outputs ’u’ that range from 0.025 to 0.975
# spaced out by 0.05, resulting in 20 points.
t <- (-1/lambda_bar) * log(u)
# Input times ’t’, generated from ’u’. There
# are 20 generated times ’t’ in this vector.
print(paste("Simulation ", num, sep = ""))
num <- num + 1
}
else
{
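# A warning was raised while fitting, so this dataset is
# discarded; 'num' is not advanced, and a fresh dataset is drawn
# on the next pass through the loop.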
}
}
}
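The call below is illustrative and not part of the original code; it draws the
paired true and PCA-estimated survivor curves for a handful of simulated
datasets.
# Hypothetical usage; each accepted iteration overlays the
# predicted and true survivor curves for one simulated dataset.
sim(5)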
42

More Related Content

What's hot

ESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSIONESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSION
International Journal of Technical Research & Application
 
Wilcoxon Signed Rank Test
Wilcoxon Signed Rank Test Wilcoxon Signed Rank Test
Wilcoxon Signed Rank Test
Sharlaine Ruth
 
Chapter 5 Anova2009
Chapter 5 Anova2009Chapter 5 Anova2009
Chapter 5 Anova2009
Sumit Prajapati
 
Count data analysis
Count data analysisCount data analysis
Count data analysis
Walkite Furgasa
 
Fact, Figures and Statistics
Fact, Figures and StatisticsFact, Figures and Statistics
Fact, Figures and Statistics
meducationdotnet
 
Posthoc
PosthocPosthoc
Posthoc
Shakeel Ahmad
 
Non Parametric Tests
Non Parametric TestsNon Parametric Tests
Non Parametric Tests
Neeraj Kaushik
 
Design of experiments(
Design of experiments(Design of experiments(
Design of experiments(
Nugurusaichandan
 
Austin Statistics
Austin StatisticsAustin Statistics
Austin Statistics
Austin Publishing Group
 
Chapter 5 Anova2009
Chapter 5 Anova2009Chapter 5 Anova2009
Chapter 5 Anova2009
ghalan
 
Chi square(hospital admin) A
Chi square(hospital admin) AChi square(hospital admin) A
Chi square(hospital admin) A
Mmedsc Hahm
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
cambridgeWD
 
Application of-different-statistical-tests-in-fisheries-science
Application of-different-statistical-tests-in-fisheries-scienceApplication of-different-statistical-tests-in-fisheries-science
Application of-different-statistical-tests-in-fisheries-science
As Siyam
 
Chi square test
Chi square testChi square test
Chi square test
YASMEEN CHAUDHARI
 
International Journal of Pharmaceutica Analytica Acta
International Journal of Pharmaceutica Analytica ActaInternational Journal of Pharmaceutica Analytica Acta
International Journal of Pharmaceutica Analytica Acta
SciRes Literature LLC. | Open Access Journals
 
Sample size and power calculations
Sample size and power calculationsSample size and power calculations
Sample size and power calculations
Ramachandra Barik
 
Propensity Scores in Medical Device Trials
Propensity Scores in Medical Device TrialsPropensity Scores in Medical Device Trials
Propensity Scores in Medical Device Trials
Biomedical Statistical Consulting
 
Chapter 6 Ranksumtest
Chapter 6 RanksumtestChapter 6 Ranksumtest
Chapter 6 Ranksumtest
Sumit Prajapati
 
P-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural researchP-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural research
Jiban Shrestha
 

What's hot (19)

ESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSIONESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSION
 
Wilcoxon Signed Rank Test
Wilcoxon Signed Rank Test Wilcoxon Signed Rank Test
Wilcoxon Signed Rank Test
 
Chapter 5 Anova2009
Chapter 5 Anova2009Chapter 5 Anova2009
Chapter 5 Anova2009
 
Count data analysis
Count data analysisCount data analysis
Count data analysis
 
Fact, Figures and Statistics
Fact, Figures and StatisticsFact, Figures and Statistics
Fact, Figures and Statistics
 
Posthoc
PosthocPosthoc
Posthoc
 
Non Parametric Tests
Non Parametric TestsNon Parametric Tests
Non Parametric Tests
 
Design of experiments(
Design of experiments(Design of experiments(
Design of experiments(
 
Austin Statistics
Austin StatisticsAustin Statistics
Austin Statistics
 
Chapter 5 Anova2009
Chapter 5 Anova2009Chapter 5 Anova2009
Chapter 5 Anova2009
 
Chi square(hospital admin) A
Chi square(hospital admin) AChi square(hospital admin) A
Chi square(hospital admin) A
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Application of-different-statistical-tests-in-fisheries-science
Application of-different-statistical-tests-in-fisheries-scienceApplication of-different-statistical-tests-in-fisheries-science
Application of-different-statistical-tests-in-fisheries-science
 
Chi square test
Chi square testChi square test
Chi square test
 
International Journal of Pharmaceutica Analytica Acta
International Journal of Pharmaceutica Analytica ActaInternational Journal of Pharmaceutica Analytica Acta
International Journal of Pharmaceutica Analytica Acta
 
Sample size and power calculations
Sample size and power calculationsSample size and power calculations
Sample size and power calculations
 
Propensity Scores in Medical Device Trials
Propensity Scores in Medical Device TrialsPropensity Scores in Medical Device Trials
Propensity Scores in Medical Device Trials
 
Chapter 6 Ranksumtest
Chapter 6 RanksumtestChapter 6 Ranksumtest
Chapter 6 Ranksumtest
 
P-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural researchP-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural research
 

Viewers also liked

Let's Pretend to Do Something Real
Let's Pretend to Do Something RealLet's Pretend to Do Something Real
Let's Pretend to Do Something Real
Eminence Waite
 
Rodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_BeamerRodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_Beamer
​Iván Rodríguez
 
Adam Lovinus
Adam LovinusAdam Lovinus
Adam Lovinus
Adam Lovinus
 
Losing Control and Falling Into the Abyss
Losing Control and Falling Into the AbyssLosing Control and Falling Into the Abyss
Losing Control and Falling Into the Abyss
Eminence Waite
 
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_PhilosophyRodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
​Iván Rodríguez
 
Rodriguez_NRMC_Presentation
Rodriguez_NRMC_PresentationRodriguez_NRMC_Presentation
Rodriguez_NRMC_Presentation
​Iván Rodríguez
 
Selim-Hesham-El-Zien-C.V.
Selim-Hesham-El-Zien-C.V.Selim-Hesham-El-Zien-C.V.
Selim-Hesham-El-Zien-C.V.
Selim Hesham
 
Rodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_PresentationRodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_Presentation
​Iván Rodríguez
 
Rodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_BeamerRodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_Beamer
​Iván Rodríguez
 
Notion cv
Notion cvNotion cv
Janhavi_Mishra_Testing
Janhavi_Mishra_TestingJanhavi_Mishra_Testing
Janhavi_Mishra_Testing
Janhavi Mishra
 
Ullmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_PresentationUllmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_Presentation
​Iván Rodríguez
 

Viewers also liked (12)

Let's Pretend to Do Something Real
Let's Pretend to Do Something RealLet's Pretend to Do Something Real
Let's Pretend to Do Something Real
 
Rodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_BeamerRodriguez_Survival_Abstract_Beamer
Rodriguez_Survival_Abstract_Beamer
 
Adam Lovinus
Adam LovinusAdam Lovinus
Adam Lovinus
 
Losing Control and Falling Into the Abyss
Losing Control and Falling Into the AbyssLosing Control and Falling Into the Abyss
Losing Control and Falling Into the Abyss
 
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_PhilosophyRodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
Rodriguez_THINK_TANK_Mathematics_Tutoring_Philosophy
 
Rodriguez_NRMC_Presentation
Rodriguez_NRMC_PresentationRodriguez_NRMC_Presentation
Rodriguez_NRMC_Presentation
 
Selim-Hesham-El-Zien-C.V.
Selim-Hesham-El-Zien-C.V.Selim-Hesham-El-Zien-C.V.
Selim-Hesham-El-Zien-C.V.
 
Rodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_PresentationRodriguez_UROC_Final_Presentation
Rodriguez_UROC_Final_Presentation
 
Rodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_BeamerRodriguez_DRT_Abstract_Beamer
Rodriguez_DRT_Abstract_Beamer
 
Notion cv
Notion cvNotion cv
Notion cv
 
Janhavi_Mishra_Testing
Janhavi_Mishra_TestingJanhavi_Mishra_Testing
Janhavi_Mishra_Testing
 
Ullmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_PresentationUllmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_Presentation
 

Similar to Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report

Survival Analysis With Generalized Additive Models
Survival Analysis With Generalized Additive ModelsSurvival Analysis With Generalized Additive Models
Survival Analysis With Generalized Additive Models
Christos Argyropoulos
 
Statistical Methods to Handle Missing Data
Statistical Methods to Handle Missing DataStatistical Methods to Handle Missing Data
Statistical Methods to Handle Missing Data
Tianfan Song
 
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
NUI Galway
 
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MININGUNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
IJDKP
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
cambridgeWD
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNASRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
​Iván Rodríguez
 
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
Waqas Tariq
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IJDKP
 
Logistic Loglogistic With Long Term Survivors For Split Population Model
Logistic Loglogistic With Long Term Survivors For Split Population ModelLogistic Loglogistic With Long Term Survivors For Split Population Model
Logistic Loglogistic With Long Term Survivors For Split Population Model
Waqas Tariq
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
Double Check ĆŐNSULTING
 
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design IssuesExtending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
nQuery
 
Restricted Mean Survival Analysis
Restricted Mean Survival AnalysisRestricted Mean Survival Analysis
Restricted Mean Survival Analysis
ayatan2
 
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
sipij
 
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
Servio Fernando Lima Reina
 
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
cheweb1
 
final paper
final paperfinal paper
final paper
Asek Md. Suzauddin
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial hetero
Edda Kang
 
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE  OF DIFFERENT ORDERSSUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE  OF DIFFERENT ORDERS
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS
BRNSS Publication Hub
 
A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...
Richard Gill
 
Statsci
StatsciStatsci

Similar to Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report (20)

Survival Analysis With Generalized Additive Models
Survival Analysis With Generalized Additive ModelsSurvival Analysis With Generalized Additive Models
Survival Analysis With Generalized Additive Models
 
Statistical Methods to Handle Missing Data
Statistical Methods to Handle Missing DataStatistical Methods to Handle Missing Data
Statistical Methods to Handle Missing Data
 
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides
 
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MININGUNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNASRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Poster_Presentation_SACNAS
 
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
Use Proportional Hazards Regression Method To Analyze The Survival of Patient...
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
 
Logistic Loglogistic With Long Term Survivors For Split Population Model
Logistic Loglogistic With Long Term Survivors For Split Population ModelLogistic Loglogistic With Long Term Survivors For Split Population Model
Logistic Loglogistic With Long Term Survivors For Split Population Model
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design IssuesExtending A Trial’s Design Case Studies Of Dealing With Study Design Issues
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues
 
Restricted Mean Survival Analysis
Restricted Mean Survival AnalysisRestricted Mean Survival Analysis
Restricted Mean Survival Analysis
 
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb...
 
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
EUSFLAT 2019: explainable neuro fuzzy recurrent neural network to predict col...
 
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
 
final paper
final paperfinal paper
final paper
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial hetero
 
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE  OF DIFFERENT ORDERSSUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE  OF DIFFERENT ORDERS
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS
 
A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...A walk in the black forest - during which I explain the fundamental problem o...
A walk in the black forest - during which I explain the fundamental problem o...
 
Statsci
StatsciStatsci
Statsci
 

Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report

  • 1. Survival Analysis Dimension Reduction Techniques A Comparison of Select Methods Claressa L. Ullmayer and Iván Rodríguez Abstract Although formal studies across many fields may yield copious data, it can often be collinear (redundant) in terms of explaining particular outcomes. Thus, dataset dimensionality reduction becomes imperative for facilitating the explanation of phenomena given abundant covariates (independent vari- ables). Principal Component Analysis (PCA) and Partial Least Squares (PLS) are established methods used to obtain components—eigenvalues of the given data’s variance-covariance matrix—such that the covariance and correlation is maximized between linear combinations of predictor and re- sponse variables. PCA employs orthogonal transformations on covariates to reduce dataset dimensionality by producing new uncorrelated variables. PLS, rather, projects both predictor and response variables into a new space to model their covariance structure. In addition to these standard procedures, three variants of Johnson-Lindenstrauss low-distortion Euclidean-space em- beddings (random matrices, RM) were also investigated. Each technique’s performance was explored by simulating 5,000 datasets using R statistical software. The semi-parametric Accelerated Failure Time (AFT) model was utilized to obtain predicted survivor curves. Then, total bias error (BE) and mean-squared error (MSE) between true and estimated survivor curves was determined to find the error distributions of all methods. The results herein indicate that PCA outperforms PLS, the RMs are comparable, and the RMs outdo both PCA and PLS. Keywords: survival analysis; dimension reduction; big data; principal com- ponent analysis (PCA); partial least squares (PLS); Johnson-Lindenstrauss (JL); random matrices; accelerated failure time (AFT); bias; mean-squared error.
  • 2. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques Contents 1 Introduction 1 2 Survival Analysis 2 3 Methods 5 3.1 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1.2 Principle Component Analysis . . . . . . . . . . . . . . . 6 3.1.3 Partial Least Squares . . . . . . . . . . . . . . . . . . . . 6 3.1.4 Random Matrices . . . . . . . . . . . . . . . . . . . . . . 7 3.2 The Accelerated Failure Time Model . . . . . . . . . . . . . . . . 9 4 Method Assessments 9 4.1 Simulated Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 10 5 Results 11 5.1 Principle Component Analysis versus Partial Least Squares . . . . 12 5.2 Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.3 All Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6 Discussion 15 7 Conclusion 15 8 Acknowledgments 16 9 References 16 10 Appendix 18 10.1 Error Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 10.2 Johnson-Lindenstrauss Testing . . . . . . . . . . . . . . . . . . . 33 10.3 Survival Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 i
  • 3. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques 1 Introduction Throughout various studies, researchers are able to associate covariates to a set of observations. From here, analysts would naturally seek to explain the relationship between the two with regard to a given set of phenomena. Methods such as the Cox Proportional Hazards (CPH) and the Accelerated Failure Time (AFT) models have been proposed with this intent in mind (Cox, 1972). However, to successfully utilize both approaches, it is necessary to have more observations than covariates. Depending on the context, this property may not initially be satisfied, thus ren- dering both methods inept. One example of this complication arises in common- place microarray gene expression data. In this situation, there can often be less observations—patients—than covariates attributed to them—genes. As a result, it becomes imperative to reduce the dimensionality of the dataset and then apply a suitable regression technique thereafter to understand the underlying relationships between the predictor and response variables. As a natural consequence, reduc- ing the original dataset’s dimensionality insinuates a loss of information; thus, a favorable dimension reduction technique will minimize loss of relevant informa- tion. Given this to consider, dimension-reduction techniques have abounded to meet this end. In this investigation, the methods of Principal Component Analysis (PCA), Partial Least Squares (PLS), and three variants of Johnson-Lindenstrauss inspired Random Matrices (RM) will be compared (Johnson, Lindenstrauss, 1984). The first approach, PCA, originated and was described by Pearson (1901). PLS was first rigorously introduced and explained by Wold (1966). Then, the three variants of RMs were constructed according to specifications of Achlioptas (2002) and Dasgupta-Gupta (2002). This research was motivated in part by the results attributed to Nguyen and Rocke (2004) and Nguyen (2005) regarding the perfor- mance of PCA vis-à-vis PLS. Furthermore, the works of Nguyen and Rojo (2009) with respect to the performance of PLS variants and Nguyen and Rojo (2009) in regard to a multitude of reduction and regression approaches were utilized in this inquiry. Typically, the Cox PH model has been the standard model in this applica- tion. In this paper, however, the AFT model was employed. Random datasets were first generated using the statistical software suite R. For a given amount of these datasets, there was a constant and true survivor function attributed to them. From here, the three dimension reduction techniques were employed on the sim- ulated datasets. Then, the AFT model was used primarily to generate a predicted survivor function. Bias and mean-squared error between the real and estimated curves were then calculated for a partition of fixed time values. 1
  • 4. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques 2 Survival Analysis Before any serious discussion of the current work can begin, a familiarity with the area known as survival analysis must first be cultivated. In a sentence, survival analysis employs various methods to analyze data where the response variable is a time until an unambiguous event of interest occurs (Despa). This event must be rigorously defined—some examples include birth, death, marriage, divorce, job termination, promotion, arrests, revolutions, heart attack, stroke, metastasis, and winning the lottery, to name a few (Ross). Depending on the research domain, this wide field has many monikers. It is referred to as failure time analysis, hazard analysis, transition analysis, duration analysis, reliability theory/analysis in engineering, duration analysis/modeling in economics, and event history analysis in sociology (Allison). At the time of this investigation, ‘survival analysis’ serves as the umbrella term for all the aforemen- tioned epithets. Survival analysis is borne out of the desire to overcome some limitations pre- sented in standard linear regression approaches (Despa). One of the two imme- diate complications that survival analysis can successfully address is data where responses are all positive values—exempl¯ı gr¯ati¯a, survival times that range from t ∈ (0, ∞) (Despa). Secondly, survival analysis can grapple with censored data. After the event of interest within a particular investigation has been rigorously declared, an observation is branded as ‘censored’ if the special event was not ob- served. This can occur due to a plethora of reasons. A common one involves a patient in a clinical trial dropping out of the study. In this case, it is unknown how much longer it may have taken for that individual to experience the partic- ular event of interest. Another example of censoring in the real world involves observations that do not experience the special event upon the end of a formal investigation. That is, an individual managed to not express the event of interest for the whole duration of a study, so they are necessarily labeled as censored. With this ubiquitous term broadly explained, it is also necessary to understand that many forms of censoring exist. Typically, most data are ‘right-censored’. This term signifies observations that have the potential to experience the declared event of interest after—or to the right in a time-line—of the time they became censored. For instance, take an individual with a stage of cancer and declare the event of interest to be death. Then, if this person becomes censored, the event of interest is naturally bound to occur after the time they became censored. In a similar manner, ‘left-censored’ data occurs when the event of interest occurred before the specific time a formal investigation began (Lunn). Understandably, this phenomenon is less commonplace in reality. An example of left-censored data involves providing a questionnaire to mothers inquiring whether or not they are actively breastfeed- ing (Vermeylen). Left-censoring would occur if a mother entered the study and 2
  • 5. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques had hitherto stopped breastfeeding. Finally, a third type is known as ‘interval cen- soring’. This might be observed in a case where clinical follow-ups are necessary. For a datum to be interval-censored, the event of interest would have to be ob- served within an interval between two successive follow-ups (Sun). Survival analysis is a prominent regression approach because it can success- fully incorporate both censored and uncensored data when modeling the relation- ship between predictors and responses (Despa). Typically, the response variables will have at least both a survival time and censoring status associated with them. From here, methods exist to estimate both survival and hazard functions that fa- cilitate the interpretation of the distribution of survival times (Despa). Survivor curves determine the probability that the event of interest is not ex- perienced after a particular time. Rigorously, S(t) = P(T > t) = ∞ t f(τ) dτ = 1 − F(t), where S(t) denotes the survivor function, t is a fixed time, T is a random variable, f(τ) is the probability density function of T, and F(t) is the cumulative distribu- tion function of T. The hazard, on the other hand, is defined as a rate in which events happen (Duerden). Thus, one can calculate the probability of an event happening within a small time interval as this hazard rate multiplied by the length of time (Duerden). Additionally, the hazard function describes the probability that an observation ex- periences the event of interest at a particular time (Duerden). This implies that the observation has already survived—that is, has not experienced the event of interest—at the specified time (Duerden). In precise terms, the hazard function is defined as h(t) = f(t) S(t) , where f(t) denotes the probability distribution function and S(t) represents the survival function given a random variable T. From this expression, it is imme- diately possible to understand the intricate relationship between distribution, sur- vival, and hazard functions. As a result, many other expressions exist aside from this rather simplistic form. A natural thought that may arise within survival analysis is whether results involving survivor curves or hazard functions are desired. In many contexts, stan- dard researchers prefer survivor curves in order to interpret results of their gath- ered data. Arguably, since these curves output a probability in response to an input of time, it becomes easier to comprehend trends and relationships than by doing so via hazard. Furthermore, hazard functions and hazard rates are based on ratios of probability distribution functions and survival curves; this makes hazard results 3
  • 6. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques more difficult to digest and understand. Aside from these considerations, there is also another factor involved in sur- vival analysis to cognize: the selection of methods that can be utilized to relate predictor variables and the resulting survival times. The three main forms to achieve this end include parametric, semiparametric, and nonparametric models (Despa). These differ in the assumptions being made on the given data. Parametric approaches make the prime assumption that the distribution of the survival times follows a known probability distribution (Despa). For example, these can include the exponential and compound exponential, Weibull, Gompertz- Makeham, Rayleigh, gamma and generalized gamma, log-normal, log-logistic, generalized F, and the Coale-McNeil models (Rodriguez, 2010). For these and other applicable methods, model parameters are estimated according to an alter- ation to their maximum likelihood (Despa). In parametric techniques, relation- ships are forced between f(t), F(t), S(t), and h(t) (Cook). In contrast, a nonparametric model does not assert as many relatively bold assumptions. For instance, linearity and a smooth regression function is not nec- essary in a nonparametric context (Fox). Although this provides a researcher with much more flexibility, interpretation can oftentimes become more difficult. A semiparametric model posits that the error attributed to a nonlinear regres- sion model follows a well-defined probability distribution, but the error is uncor- related and identically distributed. In addition, a model of this form does not presume that the baseline hazard function has a particular ‘shape’ attributed to it. Additionally, when a combination of both parametric and nonparametric as- sumptions are available, the regression model is appropriately described as being semiparametric in nature. These three types of regression models are rigorously represented below. Let n denote the number of observations, Y represent the response variable, X sig- nify the matrix of predictors, and let β be regression coefficients with errors . Additionally, let m(·) = E(yi | xi) such that i = 1, . . . , n A parametric model can be expressed as yi = xi T β + i, i = 1, . . . , n. In this case, the resulting curve is smooth and known. Furthermore, it is described by a finite set of parameters which will need to be estimated. Ultimately, interpre- tation is simple through this approach. Then, for a nonparametric method, yi = m(xi) + i, i = 1, . . . , n. Here, function m(·) is also smooth and flexible, yet it is now unknown. Further- more, the interpretation of such a curve becomes ambiguous. 4
  • 7. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques Lastly, in the case where a model is classified as semiparametric, we observe that yi = xi T β + mz(zi) + i, i = 1, . . . , n. As previously mentioned, some parameters are necessarily estimated while some will be determined through the given data. 3 Methods The main methods employed in this investigation were centered on different ways of performing dimension reduction. These methods were: Principle Component Analysis (PCA), Partial Least Squares (PLS), and a set of three distinct Random Matrices (RM). For each method, the AFT model was employed primarily to gen- erate survivor curve estimates. These methods will be discussed in greater detail here. 3.1 Dimension Reduction The central goal of the three aforementioned dimension reduction techniques is to reduce a dataset with n observations and p covariates to a new dataset of dimen- sions n × k such that k p. Additionally, a competent method will achieve this end while retaining an acceptable amount of relevant data and omitting relatively collinear variables. Both PCA and PLS reduce dimensionality through orthogonal transformations of covariates; then, a subset of these is retained such that these new covariates pre- dict the response with a satisfactory caliber of precision. Meanwhile, RM differs from these two procedures by generating a matrix with certain qualities that also reduces dimensionality. To facilitate the explanation of these reduction techniques, pertinent notation will first be introduced. 3.1.1 Notation Let X be the n × p column-centered matrix such that n and p denote given obser- vations and covariates, respectively. Also, let n p. Furthermore, let Y be the n × q matrix of observed covariates. In the microarray gene dataset example, n would represent the number of pa- tients while p would denote the amount of observed genes attributed to them. Thus, X would be a matrix that contains particular patients on the rows and their respective genes on the columns. Additionally, Y would serve as an n × 1 vector of survival times. 5
  • 8. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques 3.1.2 Principle Component Analysis PCA reduces dataset dimensionality through orthogonal components obtained by maximizing the variance between linear combinations of the original predictors contained in X. More precisely, k weight vectors or ‘loadings’ w are constructed such that rows of X map to principal component scores t. For n observations, tn = xnwk. Ultimately, X can be completely decomposed into its components as follows: T = XW. Here, X has original dimensions n×p, W has dimensions p×p, and T, therefore, has dimensions n × p as expected. Additionally, the columns of W contain the eigenvectors of XT X. From here, a desired amount of the resulting orthogonal components is cho- sen. These are then referred to as ‘principal components’ since they are chosen in order to maximize the variability along each direction of the new and reduced set of axes. What this transformation accomplishes, in other words, is that it projects the original data cloud into a new coordinate system via rotations of the initial coordinate system such that variability of the initial data is maximized along each direction. Additionally, PCs are ranked according to how much variance they account for in their respective directions. That is, the PCs with the largest eigen- values are ranked the highest and represent a sizable portion of the data since variability is greatest along its eigenvector’s direction. It is imperative to note that the chosen PCs obtained from PCA rely on op- erations performed on X, the given dataset matrix. Thus, the response variable Y is not taken into account during this particular dimension reduction algorithm. Consequently, these PCs may not be laudable predictors of the response variable in a given context. Due to this property of PCA, it is often referred to as an ‘un- supervised’ technique. 3.1.3 Partial Least Squares Whereas PCA reduces dimensionality through X, the method of PLS does so through a consideration of both independent and dependent variables X and Y. Thus, this approach is often referred to as being ‘supervised’. This regression model is especially useful when there is either high collinear- ity among predictors or when the number of predictor variables is much greater than the amount of observations. In these situations, ordinary least-squares re- gression would either perform poorly or fail entirely; it would also fail if Y was not one-dimensional—id est, if there were more than one observed response. 6
  • 9. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques PLS extracts factors from both X and Y so that the covariance between these factors is maximized. In particular, PLS is largely based on the singular value de- composition of XT Y. Recall that PLS does not require Y to be one-dimensional; an advantage of the PLS procedure is that Y can contain as many observed re- sponses as are deemed necessary and practical by researchers. The method of PLS decomposes both X and Y so that X = TPT + E and Y = UQT + F. Here, T is a matrix of ‘X-scores’, P is a matrix of ‘X-loadings’, and E is a matrix of error for X. Similarly, U, Q, and F represent ‘Y-scores’, ‘Y-loadings’, and Y error, respectively. Both X- and Y-scores are defined as being linear combinations of the predictor and response variables, respectively. Then, X- and Y-loadings are linear coefficients that form a bridge from X to T and from Y to U. A common assumption about E and F is that they are random variables with independent and identical distributions. This decomposition of X and Y is done in hopes of maximizing the covariance between T and U. The PLS algorithm is an iterative procedure. First, two sets of weights must be constructed as linear combinations of the columns of both X and Y. These will be denoted by w and c, respectively. The goal here is to have their covariance be maximal. Recall that matrices T and U denote, accordingly, X- and Y-scores. Then, the next step in the PLS approach is to obtain a first pair of vectors t = Xw and u = Yc such that wT w = 1, tT t = 1, and tT u be maximized. After these first so-called ‘latent vectors’ have been obtained, they are subtracted from both X and Y. This procedure is then repeated, thereby eventually reducing X to a zero matrix. 3.1.4 Random Matrices Whereas the previously discussed methods of PCA and PLS reduce dimension- ality through a careful analysis of X and Y, the third technique of constructing random matrices, as the name implies, is considerably cavalier by comparison. In essence, a random matrix with a particular set of qualities is fabricated. Then, this matrix is multiplied to a given dataset—matrix X in this particular investi- gation. According to the lemma attributed to Johnson and Lindenstrauss, if two observations in X are considered as multidimensional points and have an initial distance-squared between them, then once these particular random matrices are multiplied to X, their intial distance is not distorted by too much. Similar to the approaches utilized in PCA and PLS, random matrices can reduce dimensionality without losing much information in the process. First, the Johnson-Lindenstrauss (JL) Lemma will be presented as well as a description of the three particular ran- 7
  • 10. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques dom matrices that were constructed in this research. The constraint on k was utilized according to Dasgupta-Gupta. The Johnson-Lindenstrauss Lemma. For any ∈ (0, 1) and any n ∈ Z, let k ∈ Z be positive and let k ≥ 4 ln(n) 2/2 − 3/3 . Then, for any set S of n points in Rd , there exists a mapping f : Rd → Rk such that, for all points u, v ∈ S, (1 − ) u − v 2 ≤ f(u) − f(v) 2 ≤ (1 + ) u − v 2 . In terms of this investigation, n also represents the number of observations while denotes the error tolerance. Finally, k can be thought of as the resulting dimension in this given context after applying a random matrix to the dataset ma- trix X. An immediate complication of these so-called ‘JL-embeddings’ is that we may sometimes observe that k ≥ d as a result of strictly following the hypotheses of the lemma. Id est, by employing the results of this theorem, a researcher would be taking data from a smaller dimension and transforming it so that the data exists in a higher dimension. Ultimately, the JL Lemma may not reduce dimensionality at all, thus rendering it impractical for the desired purposes of this text. Thus, it became imperative in this research to observe the effects of ignoring the restraints on k of the JL Lemma and deducing whether or not desirable results are obtained nonetheless. Having understood the motivation behind random matrices and these precise limitations, now an explanation of the three random matrices themselves is in order. The first two random matrices were fabricated according to the previous re- sults of Achlioptas while the third was constructed by following the specifications of Dasgupta-Gupta. Let Γ1, Γ2, and Γ3 accordingly denote these random ma- trices. To keep consistent with the previous notation, recall that X is an n × p predictor matrix of observations on the rows and covariates on the columns. It follows that Γ1, Γ2, and Γ3 are p × k matrices. Once multiplied to X, the result- ing matrix Ω will have dimensions n × k, where the goal is to have n > k. Entries of Γ1 were produced from the following distribution: 1 √ k × −1 with probability 1/2 +1 with probabilty 1/2 For Γ2, its entries were obtained from 3 k ×    −1 with probability 1/6 0 with probability 4/6 +1 with probabilty 1/6 8
  • 11. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques Finally, Γ3 is a Gaussian random matrix generated from N(0, 1). The resulting rows of Γ3 are then normalized. 3.2 The Accelerated Failure Time Model The previously described techniques were sourced in order to reduce dimension- ality. After successfully achieving this consequence, it was necessary to generate a survival curve based on the modified data and compare it with the true survival curve. In this investigation, the AFT model was the vehicle to generate estimates of the survivor curves. The AFT model is seldom utilized compared to the celebrated Cox Propor- tional Hazards (PH) model for various reasons. One reason to adopt the AFT approach in this investigation is due to the simplified interpretation it provides re- searchers of the data. This approach presents an interpretation of the relationship between observation covariates and given responses in terms of survivor curves. The Cox PH model, on the other hand, does so through hazard functions and haz- ard ratios that, while equally profound, are not as visually simple to comprehend as the AFT model’s survivorship presentation. In simple terms, the hazard is the instantaneous event probability within a range of a particular time. It is arguably more straightforward to understand results in terms of the probability that an in- dividual ‘survives’ or does not experience an event of interest after a particular time. Thus, this first reason to employ the AFT model in this text is a matter of user preference and ease of interpretation of results. Another technical reason to employ the AFT model is due to the fact that it directly models given survival times. This is one luxury that the Cox PH model cannot allow a fervent researcher. In this investigation, AFT was implemented according to the following under- lying model: ln(Ti) = µ + ziβ + ei. Here, i represents a particular observation from a set of n observations. Further- more, Ti denotes the survival time for the ith observation. Meanwhile, µ desig- nates the given theoretical mean, zi is the vector of covariates for the ith obser- vation, and β is the vector of covariate/regression coefficients. Finally, ei is the given error for the ith observation. 4 Method Assessments This research utilized a programming environment to simulate datasets that would undergo reduction procedure from PCA, PLS, and the variants of RMs. Addition- ally, ‘feeding’ this data into the AFT model to obtain and compare the pairs of 9
  • 12. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques survival curves was likewise accomplished through statistical software. This sec- tion will address specifically how the research was performed. 4.1 Simulated Datasets In order to compare dimension reduction techniques, R statistical software is im- plemented to simulate data. β regression coefficients, observations, covariates, and survival times are simulated using the previously discussed AFT formula, where the theoretical mean, µ, is set to 0 for simplicity. The dimensionality of the data matrix, X, is 100 observations by 1000 covariates. A vector of 1000 β regres- sion coefficients relating to the 1000 covariates is obtained by generating random values from U(−1 × 10−7 , 1 × 10−7 ). A vector, µj, of random values is generated from a N(0, 1) distribution for j = 1, . . . , p, where p represents the number of covariates. β and µ remain fixed for all simulations. Next, the matrix, X100×1000, of the 1000 covariates and 100 observations is generated where xij = ezij where zij ∼ N(µj, 1) for j = 1, . . . , p and i = 1, . . . , n where n is the number of obser- vations, therefore the data is log-normally distributed. The survival times, Ti, are constructed from an exponential distribution with λi = e−xiβ for i = 1, · · · , n. Now that all the data is generated, zn×p is converted to z∗ n×p by centering each column about its mean. PCA is applied to z∗ n×p using the function PCA from the package FactoMineR (Husson et al., 2015) to obtain 99 principle components. After this procedure is completed, the principle components are narrowed down to 37, which represents 50% of total variance of the model. PCA outputs a weight matrix of dimension 1000 × 37, which represents the weights given to each co- variate by the 37 principle components. The data matrix, X, is multiplied by this weight matrix to obtain a reduced dimension matrix of 100 × 37. A surv object is created, which inputs survivals times, censoring type, and an indicator vector denoting 1 if the observation is censored or 0 if it is not, and outputs a response matrix. The Ti vector and the 37 principle components are fed into the AFT model in R using the package aftgee (Chiou et al., 2015) to obtain the estimated 1000 β coefficients. The weight matrix was multiplied by these β estimates to obtain the 37 β estimates for the 37 principle components. In order to acquire estimated lambda values for the estimated survival func- tion, the mean of the exponentiated product of the centered data matrix and the β estimates is taken. The estimated survival function is now found by ˆS0 = e−ˆ¯λt where ˆ¯λ is the estimated mean lambda value. This procedure was repeated for PLS using the same number of principle components as PCA except using the function plsreg1 from the package plsdepot (Sanchez, 2015) instead. The matrices Γ1, Γ2, and Γ3 from Achlioptas and Dasgupta-Gupta are gen- erated containing random entries that satisfy each author’s probability specifica- 10
  • 13. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques tions. An algorithm in R is created to validate the Johnson-Lindenstrauss Lemma dimension reduction ability for Γ1, Γ2, and Γ3. The algorithm takes two randomly picked vectors u, v from X and maps f : Rp → Rk where k is the new reduced dimension. The Johnson-Lindenstrauss Lemma is then tested using varying val- ues of and k for multiple simulations. It is shown that as long as k and follow the constraints given by Dasgupta-Gupta(CITA), then the Johnson-Lindenstrauss Lemma is satisfied 100% of the time. The value of is varied until the desired dimension of 1000 × 37 projection matrix is obtained satisfying the Johnson- Lindenstrauss Lemma. Unfortunately, a fairly high value of approximately 0.65 is required to satisfy the lemma. Therefore, either a high value is used or the lemma is not followed. In order to compare random matrices to PCA and PLS, X is multiplied by Γ1, Γ2, and Γ3 with dimensions 1000 × 37 to obtain a resulting k dimensional matrix of 100×37. Then, the reduced matrices are fed into the AFT model and all the same steps as PCA and PLS are performed. Therefore, five different estimated survival curves are produced, one each for PCA and PLS and three for the three random matrices. The true survival curve is S0 = e−¯λt where ¯λ is the mean of the λi values, which are created by exponentiating the product of the centered data matrix and the true β coefficients. The y-axis of the survival curve is partitioned into 20 equally spaced sections from 0.025, . . . , 0.975 and then the corresponding ti val- ues are found along the x-axis. The bias and mean squared error (MSE) are cal- culated at each of these ti values to obtain the error distribution for all methods. The bias is found by calculating the pointwise difference between the real and estimated survival curves and the MSE is calculated by finding the squared differ- ence. The bias and MSE at each ti are summed for 5000 simulations and the error distributions are compared for all methods. 5 Results In the following sections, the error distribution plots for the dimension reduction techniques are compared after 5000 simulations. PLS and PCA are compared to each other, the random matrices are compared, and then all dimension reduction techniques are compared. The goal is to minimize Bias and MSE, therefore, the dimension reduction technique closest to zero is the more efficient method. In the Bias plots, zero is at the top of the plots and for MSE, the black horizontal line at the bottom denotes zero. Notice the plots differ least at the extremes of the survival curves domain while the most variability is observed in middle of the interval. 11
5 Results

In the following sections, the error-distribution plots for the dimension reduction techniques are compared after 5,000 simulations. PCA and PLS are compared with each other, the three random matrices are compared with one another, and finally all five techniques are compared together. Since the goal is to minimize bias and MSE, the technique whose error curve lies closest to zero is the more efficient method. In the bias plots, zero is at the top of the plot; in the MSE plots, the black horizontal line at the bottom marks zero. Note that the methods differ least at the extremes of the survival curve's domain, while the greatest variability is observed in the middle of the interval.

5.1 Principal Component Analysis versus Partial Least Squares

The bias and MSE plots for PCA and PLS show that PCA outperforms PLS by a maximum magnitude of approximately 0.07 for bias and 0.03 for MSE.
5.2 Random Matrices

In the bias and MSE plots for the random matrices, RM1 denotes Γ1, RM2 denotes Γ2, and RM3 denotes Γ3. The results show no appreciable difference in performance among the three random matrices in terms of either bias or MSE.
5.3 All Methods

From both the bias and MSE plots, it is evident that all three random matrices outperform both PCA and PLS. The random matrices outperform PCA by a magnitude of approximately 0.03 and PLS by approximately 0.10 for bias, and by approximately 0.015 and 0.045, respectively, for MSE.
6 Discussion

We originally intended to generate the β coefficients from U(−0.2, 0.2), but the resulting products x_iβ were very large. Recall that λ_i = e^{−x_iβ}: when x_iβ is very large, λ_i becomes extremely small, and at double precision R evaluates the estimated survival function as exactly 1, producing a flat, horizontal survival curve. We therefore reduced the β coefficients to U(−1 × 10^−7, 1 × 10^−7) to obtain survival curves with realistic properties.

Before conducting our research, we reviewed previous work in the field, in particular the two 2009 papers of Nguyen and Rojo. According to their findings, PLS outperformed PCA, which is the result we expected to reproduce; instead, we observed that PCA clearly outperformed PLS. We are not certain why our results differ, but we suspect it is because we did not incorporate censoring. In both papers, Nguyen and Rojo compared the methods on censored data, which we did not have time to incorporate into our study. We therefore suspect that PLS may outperform PCA when censored data are used, while PCA outperforms PLS on uncensored data.

In real studies, censoring is a serious issue that must be taken into account. We wanted to incorporate censored data into our investigation but were unable to do so because of time constraints; this is something we would like to add in future work. We also wanted to apply our findings to real microarray gene-expression datasets, in which a small number of patients with a specific type of cancer is measured on a very large number of genes. Our goal was to apply the dimension reduction techniques to such datasets and obtain estimated survival curves in which the event of interest is death and the curve models each patient's probability of surviving beyond a given time T_i. Unfortunately, we were not able to analyze these real datasets, and this too remains a direction for future investigation.
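As a concrete illustration of the precision issue described at the beginning of this section, the short R snippet below uses purely illustrative magnitudes (values of x_iβ of a broadly similar order can arise under U(−0.2, 0.2) coefficients with 1,000 log-normal covariates):

# When x_i beta is a large positive number, lambda_i = exp(-x_i beta) falls far below
# double-precision resolution relative to 1, so exp(-lambda_i * t) is computed as
# exactly 1 and the estimated survival curve is flat.
lambda_small <- exp(-500)    # roughly 7e-218
exp(-lambda_small * 10)      # returns 1 for any reasonable t
# The mirror-image problem occurs for large negative x_i beta:
lambda_large <- exp(500)     # roughly 1.4e217
exp(-lambda_large * 10)      # returns 0: the curve drops to 0 immediately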
7 Conclusion

Applying PCA, PLS, and the three Johnson-Lindenstrauss-inspired random matrices of Achlioptas (2003) and Dasgupta and Gupta (2003) to log-normally distributed, uncensored data, and estimating the survival curve under the AFT model, we find that PCA outperforms PLS in terms of both bias and MSE. The three random matrices show no appreciable differences among themselves in either bias or MSE. Overall, the random matrices outperform both PCA and PLS with respect to both bias and MSE.

8 Acknowledgments

This research was supported by the National Security Agency through REU Grant H98230-15-1-0048 to the University of Nevada, Reno (Javier Rojo, PI). We thank our advisor, Dr. Javier Rojo, as well as Nathan Wiseman and Kyle Bradford of the University of Nevada, Reno, for their support and generous contributions to this research.

9 References

Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.

Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26: 189-206, 1984.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572, 1901.

Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (Ed.), Multivariate Analysis: 391-420, 1966.

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.

Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.

Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Mathematical Biosciences 193: 119-137, 2005.

Nguyen, D.V. and D.M. Rocke. On partial least squares dimension reduction for microarray-based classification: A simulation study. Computational Statistics and Data Analysis 46: 407-425, 2004.

Despa, S. What is survival analysis? StatNews 78: 1-2.

Ross, E. "Survival Analysis." 2012. PDF.
Allison, P.D. "Survival Analysis." 2013. PDF.

Lunn, M. "Definitions and Censoring." 2012. PDF.

Vermeylen, F. Censored data. StatNews 67: 1, 2005.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray gene expression data: The accelerated failure time model. Journal of Bioinformatics and Computational Biology 7(6): 939-954, 2009.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray data in the presence of a censored survival response: A simulation study. Statistical Applications in Genetics and Molecular Biology 8(1), 2009.

Sun, J. "Interval Censoring." 2011. PDF.

Duerden, M. "What Are Hazard Ratios?" 2012. PDF.

Rodriguez, G. "Parametric Survival Models." Princeton University, 2010. PDF.

Cook, A. "Survival and Hazard Functions." 2008. PDF.

Fox, J. "Introduction to Nonparametric Methods." 2005. PDF.

Husson, F. et al. "Package 'FactoMineR'." 2015. PDF.

Sanchez, G. "Package 'plsdepot'." 2015. PDF.

Chiou, S. et al. "Package 'aftgee'." 2015. PDF.

Therneau, T. et al. "Package 'survival'." 2015. PDF.
10 Appendix

Herein, the R code utilized in this investigation is presented. The packages survival (Therneau et al., 2015), FactoMineR (Husson et al., 2015), plsdepot (Sanchez, 2015), and aftgee (Chiou et al., 2015) must be installed and loaded into R for the provided code to run.

10.1 Error Plots

Below is the code used to produce the six error plots for the five dimension reduction methods.

library(survival)    # We create a Surv object using the function 'Surv' from this package.
library(FactoMineR)  # We use the function 'PCA' from this package.
library(plsdepot)    # We use 'plsreg1' from this package.
library(aftgee)      # With this package we apply the AFT model to the
                     # simulated data using the function 'aftgee'.

sim <- function(s)   # This function runs 's' simulations and outputs the error plots.
{
  t1 <- Sys.time()   # Initial time.
  num <- 1           # Initial counter.

  sum_PCA_BE_t  <- matrix(0, 1, 20)
  sum_PCA_MSE_t <- matrix(0, 1, 20)
  sum_PLS_BE_t  <- matrix(0, 1, 20)
  sum_PLS_MSE_t <- matrix(0, 1, 20)
  sum_RM1_BE_t  <- matrix(0, 1, 20)
  sum_RM1_MSE_t <- matrix(0, 1, 20)
  sum_RM2_BE_t  <- matrix(0, 1, 20)
  sum_RM2_MSE_t <- matrix(0, 1, 20)
  sum_RM3_BE_t  <- matrix(0, 1, 20)
  sum_RM3_MSE_t <- matrix(0, 1, 20)
  # These store the accumulated bias and mean-squared error at the
  # 20 selected points after the 's' simulations have been run.

  beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
  # Fixed coefficients.
  mu <- c(rnorm(1000, mean = 0, sd = 1))  # Mean values.
  X <- matrix(0, 100, 1000)               # A location for the dataset information.

  while(num <= s)  # Running the entire code for 's' iterations.
  {
    # No problems at the start of this iteration.
    for(i in 1:100)
    {
      for(j in 1:1000)
      {
        X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
        # A matrix of random data containing observations
        # on the rows and covariates on the columns.
      }
    }

    z <- exp(X)
    # All entries of matrix 'X' have been exponentiated and stored in 'z',
    # which has dimensions 100 by 1,000.

    lambda <- matrix(0, 100, 1)  # Rate values.
    for(i in 1:100)              # Generating lambda values.
    {
      lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
    }
  • 22. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques T <- matrix(0, nrow = 100, ncol = 1) # Location for survival times. for(i in 1:100) # Surivival times being generated. { T[i] <- rexp(1, rate=lambda[i]) } RM1 <- matrix(0, 1000, 37) # Random matrix one with ’-1’s and ’+1’s. for (m in 1:1000) { for (n in 1:37) { RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE, prob = c(1/2, 1/2)) } } RM1 <- RM1 / sqrt(37) RM2 <- matrix(0, 1000, 37) # Random matrix two with # ’-sqrt(3)’s, ’0’s, and ’+sqrt(3)’s. for (m in 1:1000) { for (n in 1:37) { RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1, replace = TRUE, prob = c(1/6, 4/6, 1/6)) } } RM2 <- RM2 / sqrt(37) RM3 <- matrix(0, 1000, 37) # Random matrix three generated under a Gaussian 20
  • 23. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques # distribution. for (m in 1:1000) { for (n in 1:37) { RM3[m,n] <- rnorm(1, 0, 1) } } RM3_norm <- matrix(0, 1000, 1) for (p in 1:1000) { RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2)) } for (m in 1:1000) { for(n in 1:37) { RM3[m,n] <- RM3[m,n] / RM3_norm[m, ] } } z_star <- scale(z, center = TRUE, scale = FALSE) # Column-centered ’z’ matrix for PCA. z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37) z_star_PLS <- plsreg1(scale(z, center = TRUE, scale = TRUE), T, comps = 37, crosval = FALSE) # Applying PCA and PLS to the data. z_double_star_PCA <- z_star %*% z_star_PCA$var$coord z_double_star_PLS <- z_star %*% z_star_PLS$x.loads z_double_star_RM1 <- z %*% RM1 z_double_star_RM2 <- z %*% RM2 z_double_star_RM3 <- z %*% RM3 # Reducing dimensionality. 21
  • 24. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques delta <- matrix(0, nrow = 100, ncol = 1) # An indicator matrix. Here, delta is a 100 by 1 matrix # of zeros. The zeros are interpreted as meaning that the # event of interest has definitively occured. In other # words, there is currently no censoring with ’delta’ # set up in this manner. data_Surv <- Surv(time = T, event = delta, type = c("right")) # A Surv object that takes the survival times from ’T’, # censoring information from ’delta’, and is specified # as being right-censored. data_AFT_fit_PCA <- aftgee(data_Surv ~ -1 + z_double_star_PCA, corstr = "independence", B = 0) data_AFT_fit_PLS <- aftgee(data_Surv ~ -1 + z_double_star_PLS, corstr = "independence", B = 0) data_AFT_fit_RM1 <- aftgee(data_Surv ~ -1 + z_double_star_RM1, corstr = "independence", B = 0) data_AFT_fit_RM2 <- aftgee(data_Surv ~ -1 + z_double_star_RM2, corstr = "independence", B = 0) data_AFT_fit_RM3 <- aftgee(data_Surv ~ -1 + z_double_star_RM3, corstr = "independence", B = 0) beta_hat_star_PCA <- data_AFT_fit_PCA$coefficients beta_hat_star_PLS <- data_AFT_fit_PLS$coefficients beta_hat_star_RM1 <- data_AFT_fit_RM1$coefficients beta_hat_star_RM2 <- data_AFT_fit_RM2$coefficients beta_hat_star_RM3 <- data_AFT_fit_RM3$coefficients # The full beta/regression coefficients. z_bar_star <- matrix(0, 1, 1000) 22
  • 25. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques # Averaged columns of ’z’ will go here. for (i in 1:1000) # Averaging ’z’s columns. { z_bar_star[1, i] <- mean(z[, i]) } beta_hat_z_PCA <- z_star_PCA$var$coord %*% beta_hat_star_PCA beta_hat_z_PLS <- z_star_PLS$x.loads %*% beta_hat_star_PLS beta_hat_z_RM1 <- RM1 %*% beta_hat_star_RM1 beta_hat_z_RM2 <- RM2 %*% beta_hat_star_RM2 beta_hat_z_RM3 <- RM3 %*% beta_hat_star_RM3 # The final beta estimates for each technique. lambda_hat_PCA <- mean(exp(-z %*% beta_hat_z_PCA)) lambda_hat_PLS <- mean(exp(-z %*% beta_hat_z_PLS)) lambda_hat_RM1 <- mean(exp(-z %*% beta_hat_z_RM1)) lambda_hat_RM2 <- mean(exp(-z %*% beta_hat_z_RM2)) lambda_hat_RM3 <- mean(exp(-z %*% beta_hat_z_RM3)) # Generating the lambda constant from each technique # employed. lambda_bar = mean(lambda) # Taking the average of all # ’lambda’ values and storing it in ’lambda_bar’. S <- function(t) # The true survivor function. { exp(-t * lambda_bar) } S_hat_naught_PCA <- function(t) # The predicted survivor function through PCA. 23
  • 26. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques { exp(-t * lambda_hat_PCA) } S_hat_naught_PLS <- function(t) # The predicted survivor function through PLS. { exp(-t * lambda_hat_PLS) } S_hat_naught_RM1 <- function(t) # The predicted survivor function through RM1. { exp(-t * lambda_hat_RM1) } S_hat_naught_RM2 <- function(t) # The predicted survivor function through RM2. { exp(-t * lambda_hat_RM2) } S_hat_naught_RM3 <- function(t) # The predicted survivor function through RM3. { exp(-t * lambda_hat_RM3) } u <- c(seq(0.025, 0.975, 0.05)) # Desired outputs ’u’ that range from 0.025 to 0.975 # and are spaced out by 0.05, resulting in 20 points. t <- (-1/lambda_bar) * log(u) # Input times ’t’ from the # respective ’u’s. There are 20 generated times ’t’ in # this vector. for (i in 1:20) # Storing bias across the 20 point pairs in PCA. { sum_PCA_BE_t[i] <- sum_PCA_BE_t[i] + (S_hat_naught_PCA(t[i]) - S(t[i])) 24
  • 27. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques } for (i in 1:20) # Storing mean-squared error across the 20 point pairs # in PCA. { sum_PCA_MSE_t[i] <- sum_PCA_MSE_t[i] + (S_hat_naught_PCA(t[i]) - S(t[i])) ^ 2 } for (i in 1:20) # Storing bias across the 20 point pairs in PLS. { sum_PLS_BE_t[i] <- sum_PLS_BE_t[i] + (S_hat_naught_PLS(t[i]) - S(t[i])) } for (i in 1:20) # Storing mean-squared error across the 20 point pairs # in PLS. { sum_PLS_MSE_t[i] <- sum_PLS_MSE_t[i] + (S_hat_naught_PLS(t[i]) - S(t[i])) ^ 2 } for (i in 1:20) # Storing bias across the 20 point pairs in RM1. { sum_RM1_BE_t[i] <- sum_RM1_BE_t[i] + (S_hat_naught_RM1(t[i]) - S(t[i])) } for (i in 1:20) # Storing mean-squared error across the 20 point pairs # in RM1. { sum_RM1_MSE_t[i] <- sum_RM1_MSE_t[i] + (S_hat_naught_RM1(t[i]) - S(t[i])) ^ 2 } for (i in 1:20) 25
  • 28. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques # Storing bias across the 20 point pairs in RM2 { sum_RM2_BE_t[i] <- sum_RM2_BE_t[i] + (S_hat_naught_RM2(t[i]) - S(t[i])) } for (i in 1:20) # Storing mean-squared error across the 20 point pairs # in RM2. { sum_RM2_MSE_t[i] <- sum_RM2_MSE_t[i] + (S_hat_naught_RM2(t[i]) - S(t[i])) ^ 2 } for (i in 1:20) # Storing bias across the 20 point pairs in RM3. { sum_RM3_BE_t[i] <- sum_RM3_BE_t[i] + (S_hat_naught_RM3(t[i]) - S(t[i])) } for (i in 1:20) # Storing mean-squared error across the 20 point pairs # in RM3. { sum_RM3_MSE_t[i] <- sum_RM3_MSE_t[i] + (S_hat_naught_RM3(t[i]) - S(t[i])) ^ 2 } print(paste("Simulation", num, "Complete.")) num <- num + 1 } ymin_PCA_BE <- min(sum_PCA_BE_t) ymin_PLS_BE <- min(sum_PLS_BE_t) ymin_RM1_BE <- min(sum_RM1_BE_t) ymin_RM2_BE <- min(sum_RM2_BE_t) ymin_RM3_BE <- min(sum_RM3_BE_t) # Finding the minimum bias per each technique after 26
  • 29. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques # ’s’ simulations. ymax_PCA_BE <- max(sum_PCA_BE_t) ymax_PLS_BE <- max(sum_PLS_BE_t) ymax_RM1_BE <- max(sum_RM1_BE_t) ymax_RM2_BE <- max(sum_RM2_BE_t) ymax_RM3_BE <- max(sum_RM3_BE_t) # Finding the maximum bias per each technique after # ’s’ simulations. ymin_BE <- min(ymin_PCA_BE, ymin_PLS_BE, ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s ymax_BE <- max(ymax_PCA_BE, ymax_PLS_BE, ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s # Finding the minimum and maximum bias across all five # techniques after ’s’ simulations. These will serve as # the lower and upper range of the y-axis in the final plot. ymin_PCA_PLS_BE <- min(ymin_PCA_BE, ymin_PLS_BE) / s ymax_PCA_PLS_BE <- max(ymax_PCA_BE, ymax_PLS_BE) / s # Calculating the averaged minimum and maximum bias for PCA # and PLS after ’s’ simulations for plotting purposes. ymin_RM_BE <- min(ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s ymax_RM_BE <- max(ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s # Calculating the averaged minimum and maximum bias for the # three RMs after ’s’ simulations for plotting purposes. ymin_PCA_MSE <- min(sum_PCA_MSE_t) ymin_PLS_MSE <- min(sum_PLS_MSE_t) ymin_RM1_MSE <- min(sum_RM1_MSE_t) ymin_RM2_MSE <- min(sum_RM2_MSE_t) ymin_RM3_MSE <- min(sum_RM3_MSE_t) # Finding the minimum mean-squared error per each technique # after ’s’ simulations. ymax_PCA_MSE <- max(sum_PCA_MSE_t) ymax_PLS_MSE <- max(sum_PLS_MSE_t) ymax_RM1_MSE <- max(sum_RM1_MSE_t) 27
  • 30. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques ymax_RM2_MSE <- max(sum_RM2_MSE_t) ymax_RM3_MSE <- max(sum_RM3_MSE_t) # Finding the maximum mean-squared error per each technique # after ’s’ simulations. ymin_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE, ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s ymax_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE, ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s # Finding the minimum and maximum mean-squared error across # all techniques. These will serve as the lower and upper # range of the y-axis in the final plot. ymin_PCA_PLS_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE) / s ymax_PCA_PLS_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE) / s # Calculating the averaged minimum and maximum MSE for PCA # and PLS after ’s’ simulations for plotting purposes. ymin_RM_MSE <- min(ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s ymax_RM_MSE <- max(ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s # Calculating the averaged minimum and maximum MSE for the # three RMs after ’s’ simulations for plotting purposes. # Start of bias plot for PCA and PLS. plot(t, (sum_PCA_BE_t) / s, pch = 15, main = paste("Bias: PCA and PLS n", s, "Total Simulations"), xlab = "Time", ylab = "Average Bias", ylim = c(ymin_PCA_PLS_BE, ymax_PCA_PLS_BE), xlim = c(0, max(t)), col = "black") points(t, (sum_PLS_BE_t) / s, pch = 15, col = "grey") par(new = TRUE) abline(0, 0, h = 0) 28
  • 31. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques par(new = TRUE) legend("topright", c("PCA", "PLS"), pch = c(15, 15), col = c("black", "grey")) # End of bias plot for PCA and PLS. # Start of the mean-squared error plot for PCA and PLS. plot(t, (sum_PCA_MSE_t) / s, pch = 15, main = paste("Mean-Squared Error: PCA and PLS n", s, "Total Simulations"), xlab = "Time", ylab = "Average MSE", ylim = c(ymin_PCA_PLS_MSE, ymax_PCA_PLS_MSE), xlim = c(0, max(t)), col = "black") points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "grey") par(new = TRUE) abline(0, 0, h = 0) par(new = TRUE) legend("topright", c("PCA", "PLS"), pch = c(15, 15), col = c("black", "grey")) # End of mean-squared error plot for PCA and PLS. # Start of the bias plot for the random matrices. plot(t, (sum_RM1_BE_t) / s, pch = 15, main = paste("Bias: Random Matrices n", s, "Total Simulations"), xlab = "Time", ylab = "Average Bias", ylim = c(ymin_RM_BE, ymax_RM_BE), xlim = c(0, max(t)), col = "darkblue") points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red") points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold") 29
  • 32. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques par(new = TRUE) abline(0, 0, h = 0) par(new = TRUE) legend("topright", c("RM1", "RM2", "RM3"), pch = c(15, 15, 15), col = c("darkblue", "red", "gold")) # End of bias plot for the random matrices. # Start of the mean-squared error plot for the # random matrices. plot(t, (sum_RM1_MSE_t) / s, pch = 15, main = paste("Mean-Squared Error: Random Matrices n", s, "Total Simulations"), xlab = "Time", ylab = "Average MSE", ylim = c(ymin_RM_MSE, ymax_RM_MSE), xlim = c(0, max(t)), col = "darkblue") points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red") points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold") par(new = TRUE) abline(0, 0, h = 0) par(new = TRUE) legend("topright", c("RM1", "RM2", "RM3"), pch = c(15, 15, 15), col = c("darkblue", "red", "gold")) # End of mean-squared error plot for the random matrices. # Start of bias plot for all methods. plot(t, (sum_PCA_BE_t) / s, pch = 15, main = paste("Bias: All Techniques n", 30
  • 33. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques s, "Total Simulations"), xlab = "Time", ylab = "Average Bias", ylim = c(ymin_BE, ymax_BE), xlim = c(0, max(t)), col = "black") points(t, (sum_PLS_BE_t) / s, pch = 15, col = "gray") points(t, (sum_RM1_BE_t) / s, pch = 15, col = "darkblue") points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red") points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold") par(new = TRUE) abline(0, 0, h = 0) par(new = TRUE) legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"), pch = c(15, 15, 15, 15, 15), col = c("black", "gray", "darkblue", "red", "gold")) # End of bias plot for all methods. # Start of mean-squared error plot for all methods. plot(t, (sum_PCA_MSE_t) / s, pch = 15, main = paste("Mean-Squared Error: All Techniques n", s, "Total Simulations"), xlab = "Time", ylab = "Average MSE", ylim = c(ymin_MSE, ymax_MSE), xlim = c(0, max(t)), col = "black") points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "gray") points(t, (sum_RM1_MSE_t) / s, pch = 15, col = "darkblue") points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red") points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold") par(new = TRUE) 31
  • 34. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques abline(0, 0, h = 0) par(new = TRUE) legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"), pch = c(15, 15, 15, 15, 15), col = c("black", "gray", "darkblue", "red", "gold")) # End of mean-squared error plot for all methods. t2 <- Sys.time() # End time. total_time <- t2 - t1 # Difference between start and end # times. print(total_time) # Printing total time to run simulations # and obtain the plots. } 32
10.2 Johnson-Lindenstrauss Testing

Below is the code used for testing the Johnson-Lindenstrauss Lemma while varying k and ε (passed to the function as 'epsilon').

good_points_RM1 <- 0
good_points_RM2 <- 0
good_points_RM3 <- 0
# Good-points counter for each random matrix. A pair of points is
# considered 'good' if it satisfies the Johnson-Lindenstrauss Lemma.

sim <- function(s, k, epsilon)
# This function takes in 's' simulations, a reduced dimension 'k', and a
# desired 'epsilon'. It reports the number of times the
# Johnson-Lindenstrauss Lemma was satisfied for each of the three
# random matrices.
{
  t1 <- Sys.time()  # Initial time.
  num <- 1          # Initial counter.

  mu <- c(rnorm(1000, mean = 0, sd = 1))  # Mean values.
  X <- matrix(0, 100, 1000)               # A location for the dataset information.

  while(num <= s)  # Running the entire code for 's' iterations.
  {
    problem <- FALSE  # No problems at the start of this iteration.

    for(i in 1:100)
    {
      for(j in 1:1000)
      {
        X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
        # A matrix of random data containing observations on
        # the rows and covariates on the columns.
  • 36. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques } } z <- exp(X) # All entries of matrix ’X’ have been # exponentiated and stored in ’z’, which has dimensions # 100 by 1,000. u_v_rows <- sample(1:100, 2, replace = FALSE) obs_u_old <- z[u_v_rows[1], ] obs_v_old <- z[u_v_rows[2], ] # We’ve selected two different rows from the dataset # matrix ’z’ and stored them as new variables. Here, # observations ’u’ and ’v’ can be thought of as # 1,000-dimensional points. dist_old <- sum((obs_u_old - obs_v_old) ^ 2) # Here, the distance has been calculated between # observations ’u’ and ’v’. RM1 <- matrix(0, 1000, k) # Random matrix one with ’-1’s and ’+1’s. for (m in 1:1000) { for (n in 1:k) { RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE, prob = c(1/2, 1/2)) } } RM1 <- RM1 / sqrt(k) RM2 <- matrix(0, 1000, k) # Random matrix two with ’-sqrt(3)’s, ’0’s, and # ’+sqrt(3)’s. for (m in 1:1000) { for (n in 1:k) { 34
  • 37. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1, replace = TRUE, prob = c(1/6, 4/6, 1/6)) } } RM2 <- RM2 / sqrt(k) RM3 <- matrix(0, 1000, k) # Random matrix three generated under a Gaussian # distribution. for (m in 1:1000) { for (n in 1:k) { RM3[m,n] <- rnorm(1, mean = 0, sd = 1) } } RM3_norm <- matrix(0, 1000, 1) for (p in 1:1000) { RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2)) } for (m in 1:1000) { for(n in 1:k) { RM3[m,n] <- RM3[m,n] / RM3_norm[m, ] } } z_star <- scale(z, center=TRUE, scale=FALSE) # Column-centered ’z’ matrix. z_double_star_RM1 <- z %*% RM1 z_double_star_RM2 <- z %*% RM2 z_double_star_RM3 <- z %*% RM3 35
  • 38. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques # Reducing dimensionality. obs_u_new_RM1 <- z_double_star_RM1[u_v_rows[1], ] obs_v_new_RM1 <- z_double_star_RM1[u_v_rows[2], ] obs_u_new_RM2 <- z_double_star_RM2[u_v_rows[1], ] obs_v_new_RM2 <- z_double_star_RM2[u_v_rows[2], ] obs_u_new_RM3 <- z_double_star_RM3[u_v_rows[1], ] obs_v_new_RM3 <- z_double_star_RM3[u_v_rows[2], ] # After reducing dimensions, points ’u’ and ’v’ now have # new coordinates. Since there were three random # matrices, there are three new ’u’ and ’v’ points. dist_new_RM1 <- sum((obs_u_new_RM1 - obs_v_new_RM1) ^ 2) dist_new_RM2 <- sum((obs_u_new_RM2 - obs_v_new_RM2) ^ 2) dist_new_RM3 <- sum((obs_u_new_RM3 - obs_v_new_RM3) ^ 2) # Calculating the new distance between the transformed # points ’u’ and ’v’ for each generated random matrix. if((1 - epsilon) * (dist_old) <= dist_new_RM1 && dist_new_RM1 <= (1 + epsilon) * (dist_old)) { good_points_RM1 <- good_points_RM1 + 1 } if((1 - epsilon) * (dist_old) <= dist_new_RM2 && dist_new_RM2 <= (1 + epsilon) * (dist_old)) { good_points_RM2 <- good_points_RM2 + 1 } if((1 - epsilon) * (dist_old) <= dist_new_RM3 && dist_new_RM3 <= (1 + epsilon) * (dist_old)) { good_points_RM3 <- good_points_RM3 + 1 } # The preceding three ’if’ statements check to see if # the Johnson-Lindenstrauss Lemma was satisfied in this # iteration for each different random matrix. print(paste("Simulation", num, "Complete.")) 36
  • 39. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques num <- num + 1 } print(paste("For an epsilon of", epsilon, ", k is", k, ".")) print(paste("Number of times JL was satisfied, RM1:", good_points_RM1, "out of", s, "simulations.")) print(paste("Number of times JL was satisfied, RM2:", good_points_RM2, "out of", s, "simulations.")) print(paste("Number of times JL was satisfied, RM3:", good_points_RM3, "out of", s, "simulations.")) t2 <- Sys.time() # End time. total_time <- t2 - t1 # Difference between start and end times. print(total_time) # Printing total time to run simulations # and obtain the plots. } 37
  • 40. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques 10.3 Survival Curves Below is the code used for generating the real survival curve and the estimated survival curve under PCA. library(survival) library(FactoMineR) sim <- function(s) # Making a function that takes in a # simulation count ’s’. { options(digits = 22) # Preserving more digits in hopes of # less algorithm failure. results <- matrix(0, s, 2) # A matrix with BE on column 1 # and MSE on column 2. BE_T <- 0 # Initial total BE count. MSE_T <- 0 # Initial total MSE count. sum_BE_t <- matrix(0, 1, 20) # Matrix of BE at time ’t’. sum_MSE_t <- matrix(0, 1, 20) # Matrix of MSE at time ’t’. num <- 1 # Iteration counter. sum_BE_t1 <- 0 # Bias error at time ’t1’. sum_MSE_t1 <- 0 # Mean-squared error at time ’t1’. beta <- c(runif(1000, min = -0.0000001, max = 0.0000001)) # Fixed coefficients. mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values. X <- matrix(0, 100, 1000) # A location for the dataset # information. while(num <= s) # Running the entire code for a specified # amount of iterations. { problem <- FALSE # No problems at the start of this # iteration. for(i in 1:100) { for(j in 1:1000) 38
  • 41. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques { X[i, j] <- rnorm(1, mean = mu[j], sd = 1) # A matrix # of random data containing observations on the rows # and covariates on the columns. } } z <- exp(X) # All entries of matrix ’X’ have been # exponentiated and stored in ’z’, which has dimensions # 100 by 1,000. lambda <- matrix(0, 100, 1) # Rate values. for(i in 1:100) { lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta)) # Generating lambda values. } T <- matrix(0, nrow = 100, ncol = 1) # Location for survival times. for(i in 1:100) { T[i] <- rexp(1,rate=lambda[i]) } z_star <- scale(z, center=TRUE,scale=FALSE) z_star_PCA <- PCA(z_star, graph=FALSE, ncp=37) z_double_star <- z_star %*% z_star_PCA$var$coord delta <- matrix(0, nrow = 100, ncol = 1) # An indicator # matrix. Here, delta is a 100 by 1 matrix of zeros. # The zeros are interpreted as meaning that the event of # interest has definitively occured. In other words, # there is currently no censoring with ’delta’ set up in # this manner. data_Surv <- Surv(time = T, event = delta, 39
  • 42. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques type = c("right")) # A Surv object that takes the survival times from ’T’, # censoring information from ’delta’, and is specified as # being right-censored. data_AFT_fit <- NULL data_AFT_fit <- tryCatch(survreg(data_Surv ~ -1 + z_double_star, dist = "lognormal", survreg.control(maxiter=100000000)), warning=function(c) {problem<<-TRUE}) if(!problem) # If there’s no problem, then our previous # code will run. { beta_hat_star <- as.matrix(data_AFT_fit$coeff) # These are beta estimates. z_bar_star <- matrix(0, 1, 1000) # Averaged columns of ’z’ go here. for (i in 1:1000) { z_bar_star[1, i] <- mean(z[, i]) # Taking the average of each column of ’z’. } beta_hat_z <- matrix(0, 1, 1000) # A location for our beta estimates. beta_hat_z <- z_star_PCA$var$coord %*% beta_hat_star # Beta estimates. lambda_hat <- exp(-z_bar_star %*% beta_hat_z) # Survival function constant. lambda_bar = mean(lambda) # Taking the average of all ’lambda’ values and storing # it in ’lambda_bar’. 40
  • 43. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques S_hat_naught <- function(t) # The predicted survivor function. { exp(-t * lambda_hat) } S <- function(t) # The true survivor function. { exp(-t * lambda_bar) } data_AFT_pred <- predict(data_AFT_fit, type = "terms", se.fit = TRUE) # Here, we get the predicted values from the ’survreg’ # object ’data_AFT_fit’. To wit, we get here the beta # values and the standard errors in a ’list’ format. surv_curv <- curve(S_hat_naught, from = 0, to = 7, n = 1000, type="l", xlab = "", ylab = "", xaxt = ’n’, yaxt = ’n’, col = "99") # Plotting the predicted survivor function. par(new = TRUE) curve(S, from = 0, to = 7, n = 1000, type = "l", main = paste("Survivor Curves n Simulation", num), xlab = expression(italic(t)), ylab = expression(S(italic(t))), col = "black") u <- c(seq(0.025,0.975,0.05)) # Outputs ’u’ that range from 0.025 to 0.975 # spaced out by 0.05, resulting in 20 points. t <- (-1/lambda_bar) * log(u) # Input times ’t’, generated from ’u’. There # are 20 generated times ’t’ in this vector. print(paste("Simulation ", num, sep = "")) 41
  • 44. Ullmayer and Rodríguez Survival Analysis Dimension Reduction Techniques num <- num + 1 } else { } } } 42