This document compares several dimension reduction techniques for survival analysis when there are many covariates: principal component analysis (PCA), partial least squares (PLS), and three variants of random matrices (RM) based on Johnson-Lindenstrauss embeddings. It simulates 5,000 datasets using the accelerated failure time model and determines the total bias error and mean-squared error between the true and estimated survivor curves for each method. The results indicate that PCA outperforms PLS, the RMs are comparable, and the RMs outdo both PCA and PLS.
This book provides a comprehensive overview of modern statistical methods aimed at overcoming issues that arise when standard statistical assumptions like normality and equal variance are violated. It introduces robust techniques for estimating location, testing hypotheses, computing confidence intervals, comparing groups, detecting outliers, and linear regression. The book is intended to bridge the gap between current robust method developments and practical application, offering an intuitive understanding of why and how standard techniques can mislead and the advantages of modern robust alternatives. It assumes a basic understanding of statistical concepts and methods.
Common statistical tests and applications in epidemiological literature - Kadium
This document provides an overview of common statistical tests and applications in epidemiological literature. It describes the different types of data, including nominal, ordinal and continuous data. It also discusses describing data through distributions and other characteristics. Hypothesis testing and the concepts of null and alternative hypotheses are explained. Types of errors in statistical testing like Type I and Type II errors are defined. Specific statistical tests like the student's t-test and chi-square analysis are outlined along with examples of their applications. Practice questions related to hypothesis testing and p-values are also included.
This document discusses metrics for assessing the predictability and efficiency of covariate-adaptive randomization designs in clinical trials. It proposes measuring predictability using a modified Blackwell-Hodges potential selection bias metric that calculates how well an observer could guess the next treatment assignment. It also considers entropy and periodicity measures. Balance/efficiency is proposed to be measured using Atkinson's method of quantifying the loss of statistical power as an equivalent reduction in sample size due to treatment imbalances within subgroups. The document then outlines a simulation study to compare various randomization methods using these proposed metrics.
Modelling differential clustering and treatment effect heterogeneity in paral... - Karla Hemming
Cluster randomized trials are frequently used in health service evaluation. It is common practice to use an analysis model with a random effect to combine between-cluster information about treatment effects. It is increasingly acknowledged that intervention effects might vary across clusters, or that the variation between clusters might differ across the randomized arms. It has been proposed, for parallel cluster trials as well as stepped-wedge and other crossover designs, that this heterogeneity can be allowed for by incorporating additional random effect(s) into the model. Here we show that the choice of model parameterization needs careful consideration, as some parameterizations for additional heterogeneity induce unnecessary assumptions. We suggest more appropriate parameterizations, discuss their relative advantages, and demonstrate the implications of these model choices using practical examples of a parallel cluster trial and a simulated stepped-wedge trial.
This document provides an introduction and guidelines for linear and multiple regression analyses. It discusses key aspects of each analysis including examining outputs such as model summaries, ANOVA tables, and coefficients. For multiple regression, it recommends a hierarchical approach, entering demographic variables in the first block, extraversion in the second, and narcissism in the third to test if narcissism predicts social media use over and above other factors. The output would show if narcissism explains a significant unique amount of variance in the outcome.
Statistics For Data Analytics - Multiple & logistic regression Shrikant Samarth
Task: To build multiple regression and logistic regression models on appropriate data.
Approach: A general topic was selected first, after which the data was downloaded from the source keeping the restrictions in mind and then cleaned in R. The multiple regression and logistic regression models were then built using IBM SPSS and the outputs were interpreted. The dependent variable was life expectancy and the independent variables were "Age-standardized Mortality - Communicable" and "Age-standardized Mortality - Cardiovascular Disease and Diabetes".
Findings: Multiple regression - the analysis first checked that normality, linearity, multicollinearity, independence of errors, and homoscedasticity were not violated. The model significantly predicted life expectancy at age 60, F(2, 102) = 39.474, R² = .436, p < 0.0005.
Logistic regression: the model explains 58.9% (Cox & Snell R-square) to 80.1% (Nagelkerke R-square) of the variance and correctly classifies 92.4% of countries. Both predictors made a statistically significant contribution to the model. The model also indicates that increases in the "Mortality - Cardiovascular/Diabetes" and "Mortality caused by communicable diseases" variables are associated with a decrease in life expectancy in a country.
Tools: IBM SPSS
The effectiveness of various analytical formulas for estimating R² shrinkage in multiple regression analysis was investigated. Two categories of formulas were identified: estimators of the squared population multiple correlation coefficient (ρ²) and estimators of the squared population cross-validity coefficient (ρc²). The authors compared the effectiveness of the analytical formulas for determining R² shrinkage against the squared population multiple correlation coefficient and the number of predictors; after examining all combinations among the variables, the maximum correlation was selected to compute both categories of formulas. The results indicated that, among the 6 analytical formulas designed to estimate the population ρ², the Olkin and Pratt formula-1 performed best for six variables, followed by the Burket formula and Lord formula-2; these were found to be the most stable and satisfactory among the 9 analytical formulas.
The document outlines the steps to perform the Wilcoxon Signed Rank Test to compare two related samples:
1) Obtain the differences between paired values in two samples and rank the absolute differences.
2) Assign ranks to positive and negative differences and calculate the sum of ranks.
3) Compare the smaller sum of ranks (T) to critical values to determine if the null hypothesis that the samples are identical can be rejected.
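These steps can be reproduced directly with base R's wilcox.test; the paired measurements below are hypothetical and serve only to illustrate the call.

```r
# Hypothetical paired measurements (e.g., before and after a treatment)
before <- c(12.3, 14.1, 10.8, 15.2, 11.8, 13.5, 12.7, 14.8)
after  <- c(11.1, 13.3, 11.0, 13.9, 10.7, 12.9, 13.1, 13.2)

# Wilcoxon signed-rank test: ranks the absolute paired differences and compares
# the signed rank sums against the null hypothesis of identical samples
wilcox.test(before, after, paired = TRUE, alternative = "two.sided")
```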
This document summarizes analysis of variance (ANOVA) methods, including:
1) The basic steps and logic of ANOVA, and how it is used to test for differences between two or more groups.
2) Applying a one-way ANOVA to data from a completely randomized design with at least three groups to test if their means are significantly different.
3) Performing multiple comparisons, like the LSD t-test and SNK q-test, to examine differences between specific group means.
4) Using a two-way ANOVA for a randomized complete-block design to reduce variation between experimental units and test if treatment means differ.
This document describes an analysis of count data from a study on the detection of anthelmintic resistance in gastrointestinal nematodes of small ruminants. The data consists of egg counts from 30 goats and 30 sheep that were grouped into Albendazole, Ivermectin, and control groups. The data was analyzed using Poisson and negative binomial regression models in R software. The Poisson model did not fit the data well due to overdispersion. However, the negative binomial regression model provided a better fit for the overdispersed data. Key findings from the negative binomial regression analysis are summarized.
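A hedged sketch of that modelling step in R is shown below; the simulated counts and group labels are stand-ins for the study's egg-count data.

```r
library(MASS)

# Simulated overdispersed egg counts by treatment group and species (not the study's data)
set.seed(1)
eggs <- data.frame(
  group   = factor(rep(c("Albendazole", "Ivermectin", "Control"), each = 20)),
  species = factor(rep(c("goat", "sheep"), times = 30))
)
mu <- exp(4 + ifelse(eggs$group == "Control", 1, 0))
eggs$count <- rnbinom(nrow(eggs), mu = mu, size = 1.2)

# Poisson fit: residual deviance far above its degrees of freedom signals overdispersion
pois_fit <- glm(count ~ group + species, family = poisson, data = eggs)
c(deviance = deviance(pois_fit), df = df.residual(pois_fit))

# Negative binomial fit absorbs the extra-Poisson variation via a dispersion parameter
nb_fit <- glm.nb(count ~ group + species, data = eggs)
summary(nb_fit)
```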
This document provides an overview of key concepts in statistics, including:
1. Statistics involves the systematic presentation of numerical data to minimize erroneous conclusions when information is incomplete. Induction and deduction are two main methods of assessment, and samples are used to make reasonable conclusions about whole populations.
2. For a sample to be representative, it should be randomly selected, large in size, and stratified if necessary to account for subgroups. Random allocation in experiments helps ensure intervention and control groups are similar.
3. Common statistical terms are defined, including mean, median, mode, range, and standard deviation. Normal distribution and confidence intervals are also explained.
1. Post hoc tests are used in ANOVA to determine which specific group means differ significantly when an omnibus F-test is significant and there are three or more groups.
2. Three common post hoc tests are described - the LSD test, Tukey's HSD test, and Scheffe's test. They differ in how conservative they are and how much they control for Type 1 error from multiple comparisons.
3. Tukey's HSD test is generally recommended when all pairwise comparisons between groups are of interest, as it maintains the familywise error rate while having more statistical power than Scheffe's very conservative test.
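For example, Tukey's HSD is available directly in base R after an omnibus ANOVA; the three-group toy data below are made up for illustration.

```r
# Hypothetical one-way layout with three groups
set.seed(7)
dat <- data.frame(
  y     = c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 6.5)),
  group = factor(rep(c("A", "B", "C"), each = 10))
)

fit <- aov(y ~ group, data = dat)   # omnibus one-way ANOVA
summary(fit)
TukeyHSD(fit)                       # all pairwise comparisons with familywise error control
```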
The document discusses various non-parametric statistical tests that can be used to analyze data when the assumptions of parametric tests are not met. It provides examples of how each test can be used, including chi-square test, binomial test, runs test, Mann-Whitney U test, Kruskal-Wallis test, median test, Wilcoxon test, McNemar test, Friedman test, and Cochran's Q test. For each test, it describes a scenario and states which test should be used to analyze the corresponding data.
The document provides information on the basic principles of experimental design, including replication, randomization, and local control. It then discusses the completely randomized design (CRD) in detail. The CRD allocates treatments randomly across experimental units. It has advantages like maximum use of units and simple analysis, but disadvantages like more experimental error. The document also introduces the randomized block design (RBD) which controls for variation among blocks. The RBD stratifies the experimental area into blocks and allocates treatments randomly within each block.
Austin Statistics is an open access, peer reviewed, scholarly journal dedicated to publishing articles in all areas of statistics.
The aim of the journal is to provide a forum for scientists, academicians and researchers to find the most recent advances in the field of statistics.
Austin Statistics accepts original research articles, review articles, case reports and rapid communications on all aspects of statistics.
This document summarizes key aspects of analysis of variance (ANOVA), including the basic logic and steps of hypothesis testing, different types of ANOVA for different experimental designs, and methods for multiple comparisons. It discusses one-way ANOVA for completely randomized designs and randomized complete-block designs, assumptions of ANOVA, and post-hoc tests like least significant difference and Student-Newman-Keuls tests for comparing group means. Examples are provided to illustrate random assignment of subjects to groups and testing for differences in group means.
This document discusses chi square distribution and its use in analyzing frequency data. Chi square tests can be used to test goodness of fit, independence, and homogeneity. It provides examples of chi square tests for goodness of fit to determine if sample data fits a theoretical distribution, and tests of independence to determine if two classification criteria are independent. The document also outlines the steps for conducting chi square tests, including calculating test statistics, determining degrees of freedom, and comparing results to critical values to reject or fail to reject the null hypothesis.
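Both uses map onto chisq.test in base R; the frequencies below are invented purely to show the two calls.

```r
# Goodness of fit: do observed counts match hypothesized proportions?
observed <- c(44, 56, 50)
chisq.test(observed, p = c(1/3, 1/3, 1/3))

# Test of independence: are two classification criteria associated?
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))
chisq.test(tab)
```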
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... - cambridgeWD
Clinical trials and health outcomes research differ in important ways that impact statistical modeling approaches. Clinical trials typically use homogeneous samples and focus on a single endpoint, while health outcomes data is heterogeneous with multiple endpoints. Predictive modeling techniques used in health outcomes research, like those in SAS Enterprise Miner, are better suited than traditional methods as they can handle complex real-world data without strong assumptions and more accurately predict rare events. Validation of models on separate test data is also important for generalizing results.
1. Statistical tests are used in fisheries science to test hypotheses and make quantitative decisions about fisheries processes. Common statistical tests include correlation tests, comparison of means tests, regression analyses, and hypothesis tests.
2. The appropriate statistical test to use depends on the research design, data distribution, and variable type. Parametric tests are used for normally distributed data, while non-parametric tests are used when assumptions are not met.
3. Accuracy of statistical tests relies on quality survey data. Both fishery-dependent and fishery-independent data are important, though confounding factors must be considered with dependent data. Proper study design and use of statistics allows prediction of fish production.
Chi square test- a test of association, Pearson's chi square test of independence, Goodness of fit test, chi square test of homogeneity, advantages and disadvantages of chi square test.
Since they were discovered for their antimicrobial activity, parabens have been widely used in many cosmetics, pharmaceuticals, personal care products and food, among other consumer products. After human consumption, these compounds reach wastewater treatment plants, where they are not efficiently removed, and end up in the environment; they also enter it through the direct discharge of detergents, soaps or other products that may contain these compounds in their formulation. This concern has significantly boosted the number of publications in the recent literature, although the number of papers on aqueous samples is still much higher than on solid matrices, probably due to the complexity of the latter. In this work, the accuracy of a newly developed analytical method for the determination of emerging pollutants (methylparaben (MeP)) in sediment samples has been assessed using recovery assays and evaluated by means of the confidence ellipse.
When designing a clinical study, a fundamental aspect is the sample size. In this article, we describe the rationale for sample size calculations, when they should be performed, and the components necessary to calculate them. For simple studies, standard formulae can be used; however, for more advanced studies, it is generally necessary to use specialized statistical software programs and consult a biostatistician. Sample size calculations for non-randomized studies are also discussed and two clinical examples are used for illustration.
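For the simple two-group cases, the standard formulae are implemented in base R's power functions; the effect sizes below are illustrative assumptions, not values from the article.

```r
# Sample size per group to detect a mean difference of 5 units (SD = 10)
# with 80% power at a two-sided 5% significance level
power.t.test(delta = 5, sd = 10, power = 0.80, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")

# Sample size per group for comparing two proportions (e.g., 60% vs 45%)
power.prop.test(p1 = 0.60, p2 = 0.45, power = 0.80, sig.level = 0.05)
```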
August 1, 2010. Design of Non-Randomized Medical Device Trials Based on Sub-Classification Using Propensity Score Quintiles, Topic Contributed Session on Medical Devices, (Greg Maislin and Donald B Rubin). Joint Statistical Meetings 2010, Vancouver Canada.
The document discusses nonparametric tests that can be used when the data distribution is unknown or non-normal. It provides examples of the Wilcoxon signed-rank test to compare two related samples, the Wilcoxon rank-sum test to compare two independent samples, the Kruskal-Wallis H test to compare more than two independent samples, and the Friedman test to compare blocks of data. Multiple comparison tests are also discussed to determine the specific groups that differ when overall differences are found.
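As a small illustration, the two-sample and k-sample rank tests mentioned above are one-liners in base R; the data are simulated placeholders.

```r
set.seed(3)
g1 <- rexp(12, rate = 1.0)
g2 <- rexp(12, rate = 0.7)
g3 <- rexp(12, rate = 0.5)

wilcox.test(g1, g2)             # Wilcoxon rank-sum test, two independent samples
kruskal.test(list(g1, g2, g3))  # Kruskal-Wallis H test, more than two independent samples
```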
P-Value: a true test of significance in agricultural research - Jiban Shrestha
This document discusses the use of p-values and significance levels in statistical analysis. It explains that p-values represent the probability of obtaining results at least as extreme as the observed results of a study, given that the null hypothesis is true. A lower p-value indicates stronger evidence against the null hypothesis. By convention, p-values of 0.05 or lower are considered statistically significant. The document cautions that statistical significance does not necessarily imply practical or clinical significance. It also discusses the concept of least significant difference tests and notes some limitations of relying solely on p-values to guide decisions.
Every year world and corporate leaders pretend to care about the state of the planet -- yet nothing changes. What if there was a single event that would change the world for the better? On the agenda: honesty in the age of cowardice, what to do with surplus population, subservience to a higher calling and when He rises.
Presentation given on January 21, 2016 at the World Economic Forum in Davos by Eminence Waite, campaign manager for Cthulhu for America.
The document compares the performance of principal component analysis (PCA), partial least squares (PLS), and random matrices (RM) for dimensionality reduction in survival analysis. It uses simulated datasets in R to analyze the bias and mean-squared error between true and estimated survivor curves for each method. PCA aims to maximize covariance and correlation of predictor variables, PLS additionally maximizes covariance between predictors and responses, and three flavors of RM are inspired by Johnson-Lindenstrauss embeddings. The results show that PCA outperforms PLS, RM performance is comparable to PCA and PLS, and in some cases RM outperform the other methods for reducing dimensionality while minimizing information loss.
This document provides information about Adam Lovinus and his skills and experience as a copywriter. It summarizes that he has experience writing about technology, business, arts, and parenting. As a trained journalist, he takes an approach that blends straight reporting with content marketing. Examples of his work include managing the blog HardBoiled for NeweggBusiness, where he publishes about 3,000 words of content per week and drives traffic through social media.
Eminence Waite argues that fear is the most effective form of social control but it is no longer enough on its own given increasing instability. The current electoral choices in America are dysfunctional and eroding trust in government. Waite proposes that the only solution is a massive depopulation on a global scale, suggesting sacrificing the "surplus population" to the entities worshipped by a secretive doomsday cult that Waite claims has been operational for 6,000 years and is now well-positioned to take advantage of the impending apocalyptic events foretold in their prophecies. Waite invites the Bilderberg group to officially join this cult and profit from the inevitable next evolution of society.
Ivan Rodriguez reflects on how his philosophy of tutoring mathematics has changed since beginning work at the THINK TANK tutoring center. Initially, he saw his role as replacing professors, but now understands his role is to be a resource, filling gaps and encouraging learning strategies. This shift occurred due to training sessions, assigned readings, and tutoring experiences. Training emphasized each student's unique abilities and the importance of problem-solving creativity. Readings showed the value of understanding concepts' applications and teaching methods rather than just providing answers. Experiences, like failing to help one student but then improving, reinforced the importance of preparation and collaboration.
This document describes a study that aimed to improve existing statistical methods for analyzing ordinal categorical data from genomic studies. The researchers sought to better match genomic data to diagnoses by refining the proportional odds model and applying it to a new dataset on Dravet syndrome. Specifically, they modified the proportional odds model by refining the latent variable and fine-tuning the null hypothesis to address limitations of the original model in practice, such as violations of the proportional odds assumption.
Selim Hesham El Zien is seeking a full-time job in Egypt. He has over 10 years of experience in retail management and customer service roles. His experience includes positions as an assistant store manager, IT section manager, and retail sales roles. He has strong computer skills and is fluent in English and Arabic.
This document describes a statistical analysis of genetic disease data. The objectives are to improve an existing technique called the proportional odds model to better match genetic data to diagnoses, and to apply the improved model to a new dataset. The key methods discussed are refining the proportional odds model by modifying the latent variable and null hypothesis, using a score function to evaluate model performance, and conducting simulations to assess type I error rates and power.
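A basic proportional odds fit of the kind being refined can be sketched with MASS::polr; the bundled housing data stand in for the genomic dataset, so this only illustrates the model form, not the study's refinements.

```r
library(MASS)

# Proportional odds (ordinal logistic) model for an ordered outcome
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
summary(fit)

# Likelihood ratio test against the intercept-only null model
null <- polr(Sat ~ 1, weights = Freq, data = housing)
anova(null, fit)
```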
The document describes techniques for reducing the dimensionality of large datasets, including principal component analysis (PCA), partial least squares (PLS), and random matrices (RMs). It presents these methods in the context of survival analysis using an accelerated failure time model. Simulation results show that PCA outperforms PLS, RMs perform comparably to PCA, and RMs can outperform PCA and PLS in certain situations. The document uses a running example of a gene expression microarray with 100 individuals and 1,000 genes to explore cancer outcomes.
This document provides a curriculum vitae for Notion Gombe, who has a Masters of Public Health degree from the University of Zimbabwe. It details his education background and qualifications, as well as his extensive work experience in epidemiology and public health in Zimbabwe over the past 15+ years. This includes roles coordinating MPH research projects and field placements, as well as positions as an environmental health officer and consultant for various public health programs and surveys. It also lists his publications, training, areas of expertise, and participation in scientific conferences both within and outside of Africa.
Janhavi Mishra is a Test Automation Engineer with over 2 years of experience working with Amdocs. She has expertise in test automation, including tools like Selenium Webdriver, UFT, Unix, Linux, and SQL. Some of her key skills and responsibilities include automation testing, regression testing, integration testing, unit testing, and shell scripting. She has received several appreciation emails for finding critical bugs and delivering projects on time. Janhavi holds an M.C.A. in Computer Science and seeks to further contribute her technical skills and experience.
This document presents a comparison of dimension reduction techniques for survival analysis, including principal component analysis (PCA), partial least squares (PLS), and random matrix approaches. Simulation data with 100 observations and 1000 covariates was generated to test the ability of each method to minimize bias and mean squared error in estimating survival functions. PCA and PLS were able to capture 50% of the variance by reducing the dimensions to 37. The estimated survival functions were compared to the true function over 5000 iterations. PLS had the lowest bias and mean squared error, followed by PCA, with the random matrix approaches performing worse.
Survival analysis is an important method for analyzing time-to-event data in biomedical and reliability applications. It is often done with semiparametric methods, e.g., the Cox proportional hazards model. In this presentation I discuss an alternative parametric approach to survival analysis that can overcome some of the limitations of the Cox model and provide additional flexibility to the modeler. This approach may also be justified from a Bayesian perspective, and the connection is shown as well. Simulations and case studies that illustrate the flexibility of the GAM approach for survival analysis and its equivalent performance to existing methods for survival data are discussed in the text.
The material presented herein are based on two publications:
1) Argyropoulos C, Unruh ML. Analysis of time to event outcomes in randomized controlled trials by generalized additive models. PLoS One. 2015 Apr 23;10(4):e0123784. doi: 10.1371/journal.pone.0123784. PMID: 25906075; PMCID: PMC4408032.
2)Bologa CG, Pankratz VS, Unruh ML, Roumelioti ME, Shah V, Shaffi SK, Arzhan S, Cook J, Argyropoulos C. High performance implementation of the hierarchical likelihood for generalized linear mixed models: an application to estimate the potassium reference range in massive electronic health records datasets. BMC Med Res Methodol. 2021 Jul 24;21(1):151. doi: 10.1186/s12874-021-01318-6. PMID: 34303362; PMCID: PMC8310602.
Statistical Methods to Handle Missing Data - Tianfan Song
1) The document compares listwise deletion and multiple imputation approaches for handling missing data. Listwise deletion is the traditional default approach but can result in large data loss, while multiple imputation is a modern approach that addresses this issue.
2) An example using blood pressure data demonstrates the three types of missing data - MCAR, MAR, MNAR - and compares results from listwise deletion and multiple imputation. Multiple imputation should be used when the type of missingness is unknown.
3) The document reviews traditional approaches like listwise deletion and single imputation and modern approaches like multiple imputation and maximum likelihood for handling missing data. Multiple imputation is presented as an improvement over traditional single imputation methods.
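As a sketch of the modern approach, multiple imputation followed by pooling under Rubin's rules can be run with the mice package; the built-in airquality data (which has genuinely missing values) stands in for the blood-pressure example.

```r
library(mice)

# Create five imputed datasets
imp <- mice(airquality, m = 5, seed = 1, printFlag = FALSE)

# Fit the analysis model on each imputed dataset and pool the results
fits   <- with(imp, lm(Ozone ~ Wind + Temp))
pooled <- pool(fits)
summary(pooled)
```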
2013.11.14 Big Data Workshop Adam Ralph - 1st set of slides - NUI Galway
Adam Ralph from the Irish Centre for High End Computing presented this Introduction to Basic R during the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on the 14th November 2013
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING - IJDKP
This article advances our understanding of regression-based data mining by comparing the utility of Least Absolute Value (LAV) and Least Squares (LS) regression methods. Using demographic variables from U.S. state-wide data, we fit variable regression models to dependent variables of varying distributions using both LS and LAV. Forecasts generated from the resulting equations are used to compare the performance of the regression methods under different dependent-variable distribution conditions. Initial findings indicate that LAV procedures forecast better in data mining applications when the dependent variable is non-normal. Our results differ from those found in prior research using simulated data.
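In R, LAV regression corresponds to median (quantile) regression, available through the quantreg package, so the LS-versus-LAV comparison can be sketched as follows on simulated heavy-tailed data (not the article's state-wide data).

```r
library(quantreg)

# Simulated data with heavy-tailed (non-normal) errors
set.seed(11)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rt(100, df = 2)

ls_fit  <- lm(y ~ x)              # least squares
lav_fit <- rq(y ~ x, tau = 0.5)   # least absolute value (median) regression

# Compare out-of-sample forecasts by mean absolute error
x_new <- runif(50, 0, 10)
y_new <- 2 + 0.5 * x_new + rt(50, df = 2)
mae <- function(pred) mean(abs(y_new - pred))
c(LS  = mae(predict(ls_fit,  newdata = data.frame(x = x_new))),
  LAV = mae(predict(lav_fit, newdata = data.frame(x = x_new))))
```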
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri... - cambridgeWD
This document discusses the differences between clinical trials and health outcomes research. Clinical trials use homogeneous samples, surrogate endpoints, and focus on a single outcome. They are also typically underpowered for rare events. Health outcomes research uses heterogeneous data from the general population to examine multiple real endpoints simultaneously. It has larger samples and data that allow analysis of rare occurrences. Predictive modeling is better suited than traditional statistical methods for analyzing heterogeneous health outcomes data due to relaxed assumptions like normality.
This document compares different dimensionality reduction techniques for survival analysis, including principal component analysis (PCA), partial least squares (PLS), and random matrices (RM). It simulates datasets using R and applies the techniques to analyze survival curves. The results found that PCA outperformed PLS, and that all three variants of RM were comparable and superior to PCA and PLS. The document suggests this unexpected outcome may relate to limitations of R or not incorporating censored data, and recommends further exploring the techniques on real datasets.
Use Proportional Hazards Regression Method To Analyze The Survival of Patient... - Waqas Tariq
The Kaplan-Meier method is used to analyze data based on survival time. This paper uses the Kaplan-Meier procedure and Cox regression with the following objectives: finding the percentage of survival at any time of interest, comparing the survival times of two studied groups, and examining the effect of continuous covariates on the relationship between an event and possible explanatory variables. The variables (age, gender, weight, drinking, smoking, district, employer, blood group) are used to study the survival of patients with stomach cancer. The data in this study were taken from Hiwa Hospital in the Sulaymaniyah governorate during a period of 48 months, from 1/1/2010 to 31/12/2013. After applying the Cox model, we estimated the parameters of the model using the partial likelihood method and then tested the variables using the Wald test; the results show that the variables age and weight influence survival time.
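The corresponding workflow in R's survival package looks roughly like the block below; the bundled lung dataset stands in for the hospital data, so the variables are illustrative.

```r
library(survival)

# Kaplan-Meier curves by group and a log-rank comparison
km <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km, times = c(180, 365))                 # survival estimates at times of interest
survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model; coefficients are estimated by partial likelihood
# and individual covariates are tested with Wald statistics
cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
summary(cox)
```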
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA... - IJDKP
This document summarizes an algorithm called Principal Component Outlier Detection (PrCmpOut) for identifying outliers in high-dimensional molecular descriptor datasets. PrCmpOut uses principal component analysis to transform the data into a lower-dimensional space, where it can more efficiently detect outliers using robust estimators of location and covariance. The properties of PrCmpOut are analyzed and compared to other robust outlier detection methods through simulation studies using a dataset of oxazoline and oxazole molecular descriptors. Numerical results show PrCmpOut performs well at outlier detection in high-dimensional data.
Logistic Loglogistic With Long Term Survivors For Split Population Model - Waqas Tariq
Split population models are also known as mixture models. The data used in this paper are the Stanford Heart Transplant data: survival times of potential heart transplant recipients from their date of acceptance into the Stanford Heart Transplant program [3]. This set consists of the survival times, in days, uncensored and censored, for the 103 patients; 3 covariates are considered (age of patient in years, surgery, and transplant), and failure for these individuals is death. Covariate methods have been examined quite extensively in the context of parametric survival models, for which the distribution of the survival times depends on the vector of covariates associated with each individual. See [6] for approaches which accommodate censoring and covariates in the ordinary exponential model for survival. Currently, such mixture models with immunes and covariates are in use in many areas such as medicine and criminology; see for example [4][5][7]. In our formulation, the covariates are incorporated into a split loglogistic model by allowing the proportion of ultimate failures and the rate of failure to depend on the covariates and the unknown parameter vectors via a logistic model. Within this setup, we provide simple sufficient conditions for the existence, consistency, and asymptotic normality of a maximum likelihood estimator for the parameters involved. As an application of this theory, the likelihood ratio test for a difference in immune proportions is shown to have an asymptotic chi-square distribution. These results allow immediate practical applications on the covariates and also provide some insight into the assumptions on the covariates and the censoring mechanism that are likely to be needed in practice. Our models and analysis are described in section 5.
Large datasets are not available for some diseases, such as brain tumor. This presentation and the part-2 presentation show how to find an actionable solution from a difficult cancer dataset.
Extending A Trial’s Design Case Studies Of Dealing With Study Design Issues - nQuery
This document discusses several case studies of dealing with complex study design issues in clinical trials, including non-proportional hazards, cluster randomization, and three-armed trials. The agenda outlines topics on non-proportional hazards modeling and sample size considerations, cluster randomized and stepped-wedge designs, and methods for analyzing data from three-armed trials that include experimental, reference, and placebo groups. Worked examples are provided to illustrate sample size calculations and statistical approaches for each of these complex trial design scenarios.
This document discusses different methods for analyzing survival data in clinical trials, including Kaplan-Meier survival analysis and restricted mean survival time (RMST) analysis. It reviews literature on survival analysis concepts and applications. The document also notes limitations of Kaplan-Meier analysis when data does not satisfy proportional hazards assumptions or when patients are lost to follow up. RMST is presented as an alternative to estimate mean survival times without these limitations. The document then applies different survival analysis methods to a dataset to compare results.
Maxillofacial Pathology Detection Using an Extended a Contrario Approach Comb... - sipij
This document summarizes a method for detecting maxillofacial pathology in 3D CT medical images using an extended a contrario approach combined with fuzzy logic. The method models samples using the Fisher distribution and applies a Fisher test to detect significant changes between a normal sample and one containing a patient. P-values from three measures are combined using fuzzy logic to provide a decision on pathology with a degree of uncertainty. The method was able to detect pathological areas in a test patient but also regions requiring further investigation, showing performance and leaving room for physician exploration.
Projecting ‘time to event’ outcomes in technology assessment: an alternative ... - cheweb1
This document discusses alternative methods for projecting survival outcomes in technology assessments beyond what is observed in clinical trials.
The standard method of fitting parametric survival functions to trial data and extrapolating is problematic as it assumes a single mechanism and does not account for trial design or changes in risk over time. LRiG proposes examining trial data to understand risk trajectories and formulating hypotheses based on clinical context rather than selecting a model solely on fit. A case study demonstrates modeling progression-free survival, post-progression survival, and overall survival as separate phases using exponential convolution functions. LRiG advocates understanding empirical data and developing more informative multi-phase models rather than relying on standard projections.
- The document describes two Markov models (Muenz-Rubinstein and Azzalini) for analyzing health condition data from the Health and Retirement Survey (HRS)
- It fits both models to HRS data to predict health conditions (dependent variable) based on age, gender, and BMI (independent variables)
- Results show Azzalini's model was more efficient at predicting health conditions compared to Muenz-Rubinstein based on the estimated efficiencies of each model
This document introduces robust estimation techniques in R. It discusses robust methods for estimating location and scale parameters that are resistant to outliers. These include trimmed means, medians, M-estimators, and high breakdown point estimators. It also covers robust regression methods like M-estimators, weighted likelihood, and bounded influence estimators. Examples are provided using real data to illustrate robust versus non-robust estimates and how influential observations can be identified.
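A few of the estimators mentioned, sketched on a deliberately contaminated toy sample (not the document's real-data examples).

```r
library(MASS)
set.seed(5)

# Toy sample with two gross outliers
x <- c(rnorm(48, mean = 10, sd = 2), 60, 75)
mean(x)               # ordinary mean, pulled toward the outliers
mean(x, trim = 0.2)   # 20% trimmed mean
median(x)             # high breakdown point (0.5)
huber(x)$mu           # Huber M-estimator of location

# Robust regression via an M-estimator, compared with ordinary least squares
d <- data.frame(x = 1:30, y = 3 + 0.8 * (1:30) + rnorm(30))
d$y[c(5, 25)] <- d$y[c(5, 25)] + 20           # inject influential outliers
coef(lm(y ~ x, data = d))
coef(rlm(y ~ x, data = d))                    # MASS::rlm, Huber M-estimation by default
```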
SUITABILITY OF COINTEGRATION TESTS ON DATA STRUCTURE OF DIFFERENT ORDERS - BRNSS Publication Hub
This document summarizes research investigating the suitability of cointegration tests on time series data of different orders. The researchers used simulated time series data from normal and gamma distributions at sample sizes of 30, 60, and 90. Three cointegration tests (Engle-Granger, Johansen, and Phillips-Ouliaris) were applied to the data. The tests were assessed based on type 1 error rates and power to determine which test was most robust for different distributions and sample sizes. The results indicated the Phillips-Ouliaris test was generally the most effective at determining cointegration across different sample sizes and distributions.
This document provides a tutorial on Bayesian model averaging (BMA). BMA accounts for model uncertainty by averaging over multiple models, weighted by their posterior probabilities. Standard statistical practice selects a single best model, ignoring model uncertainty. BMA offers improved predictive performance over any single model. However, implementing BMA presents challenges including an enormous number of terms to average over and difficult integrals. The document discusses methods for managing the summation and computing the necessary integrals, including Occam's window, Markov chain Monte Carlo model composition, and stochastic search variable selection. Examples are also provided to demonstrate the application and benefits of BMA.
Survival Analysis Dimension Reduction Techniques
A Comparison of Select Methods
Claressa L. Ullmayer and Iván Rodríguez
Abstract
Although formal studies across many fields may yield copious data, those data can often be collinear (redundant) in terms of explaining particular outcomes. Dataset dimensionality reduction therefore becomes imperative for facilitating the explanation of phenomena given abundant covariates (independent variables). Principal Component Analysis (PCA) and Partial Least Squares (PLS) are established methods for obtaining components (linear combinations of the original variables, defined by the eigenvectors of the given data's variance-covariance matrix) such that the covariance and correlation are maximized between linear combinations of predictor and response variables. PCA employs orthogonal transformations of the covariates to reduce dataset dimensionality by producing new, uncorrelated variables. PLS instead projects both predictor and response variables into a new space to model their covariance structure. In addition to these standard procedures, three variants of Johnson-Lindenstrauss low-distortion Euclidean-space embeddings (random matrices, RM) were also investigated. Each technique's performance was explored by simulating 5,000 datasets using the R statistical software. The semi-parametric Accelerated Failure Time (AFT) model was used to obtain predicted survivor curves. Total bias error (BE) and mean-squared error (MSE) between the true and estimated survivor curves were then determined to find the error distributions of all methods. The results herein indicate that PCA outperforms PLS, the RMs are comparable to one another, and the RMs outdo both PCA and PLS.
Keywords: survival analysis; dimension reduction; big data; principal com-
ponent analysis (PCA); partial least squares (PLS); Johnson-Lindenstrauss
(JL); random matrices; accelerated failure time (AFT); bias; mean-squared
error.
1 Introduction
Throughout various studies, researchers are able to associate covariates to a set of
observations. From here, analysts would naturally seek to explain the relationship
between the two with regard to a given set of phenomena. Methods such as the
Cox Proportional Hazards (CPH) and the Accelerated Failure Time (AFT) models
have been proposed with this intent in mind (Cox, 1972). However, to successfully
utilize both approaches, it is necessary to have more observations than covariates.
Depending on the context, this property may not initially be satisfied, thus ren-
dering both methods inept. One example of this complication arises in common-
place microarray gene expression data. In this situation, there are often fewer
observations (patients) than covariates attributed to them (genes). As a result, it
becomes imperative to reduce the dimensionality of the dataset and then apply a
suitable regression technique thereafter to understand the underlying relationships
between the predictor and response variables. Reducing the original dataset's
dimensionality naturally implies a loss of information; thus, a
favorable dimension reduction technique will minimize the loss of relevant
information.
With this in mind, numerous dimension-reduction techniques have been developed to meet
this end. In this investigation, the methods of Principal Component Analysis
(PCA), Partial Least Squares (PLS), and three variants of Johnson-Lindenstrauss
inspired Random Matrices (RM) will be compared (Johnson, Lindenstrauss, 1984).
The first approach, PCA, originated and was described by Pearson (1901). PLS
was first rigorously introduced and explained by Wold (1966). Then, the three
variants of RMs were constructed according to specifications of Achlioptas (2003)
and Dasgupta-Gupta (2003). This research was motivated in part by the results
attributed to Nguyen and Rocke (2004) and Nguyen (2005) regarding the perfor-
mance of PCA vis-à-vis PLS. Furthermore, the works of Nguyen and Rojo (2009)
with respect to the performance of PLS variants and Nguyen and Rojo (2009) in
regard to a multitude of reduction and regression approaches were utilized in this
inquiry.
Typically, the Cox PH model has been the standard model in this applica-
tion. In this paper, however, the AFT model was employed. Random datasets
were first generated using the statistical software suite R. For a given amount of
these datasets, there was a constant and true survivor function attributed to them.
From here, the three dimension reduction techniques were employed on the sim-
ulated datasets. Then, the AFT model was used primarily to generate a predicted
survivor function. Bias and mean-squared error between the real and estimated
curves were then calculated for a partition of fixed time values.
2 Survival Analysis
Before any serious discussion of the current work can begin, a familiarity with the
area known as survival analysis must first be cultivated. In a sentence, survival
analysis employs various methods to analyze data where the response variable is
a time until an unambiguous event of interest occurs (Despa). This event must be
rigorously defined—some examples include birth, death, marriage, divorce, job
termination, promotion, arrests, revolutions, heart attack, stroke, metastasis, and
winning the lottery, to name a few (Ross).
Depending on the research domain, this wide field has many monikers. It is
referred to as failure time analysis, hazard analysis, transition analysis, duration
analysis, reliability theory/analysis in engineering, duration analysis/modeling in
economics, and event history analysis in sociology (Allison). At the time of this
investigation, ‘survival analysis’ serves as the umbrella term for all the aforemen-
tioned epithets.
Survival analysis is borne out of the desire to overcome some limitations pre-
sented in standard linear regression approaches (Despa). One of the two imme-
diate complications that survival analysis can successfully address is data where
responses are all positive values—exempli gratia, survival times that range from
t ∈ (0, ∞) (Despa). Secondly, survival analysis can grapple with censored data.
After the event of interest within a particular investigation has been rigorously
declared, an observation is branded as ‘censored’ if the special event was not ob-
served. This can occur due to a plethora of reasons. A common one involves a
patient in a clinical trial dropping out of the study. In this case, it is unknown
how much longer it may have taken for that individual to experience the partic-
ular event of interest. Another example of censoring in the real world involves
observations that do not experience the special event upon the end of a formal
investigation. That is, an individual managed to not express the event of interest
for the whole duration of a study, so they are necessarily labeled as censored.
With this ubiquitous term broadly explained, it is also necessary to understand
that many forms of censoring exist. Typically, most data are ‘right-censored’. This
term signifies observations that have the potential to experience the declared event
of interest after—or to the right in a time-line—of the time they became censored.
For instance, take an individual with a stage of cancer and declare the event of
interest to be death. Then, if this person becomes censored, the event of interest is
naturally bound to occur after the time they became censored. In a similar manner,
‘left-censored’ data occurs when the event of interest occurred before the specific
time a formal investigation began (Lunn). Understandably, this phenomenon is
less commonplace in reality. An example of left-censored data involves providing
a questionnaire to mothers inquiring whether or not they are actively breastfeed-
ing (Vermeylen). Left-censoring would occur if a mother entered the study and
had hitherto stopped breastfeeding. Finally, a third type is known as ‘interval cen-
soring’. This might be observed in a case where clinical follow-ups are necessary.
For a datum to be interval-censored, the event of interest would have to be ob-
served within an interval between two successive follow-ups (Sun).
Survival analysis is a prominent regression approach because it can success-
fully incorporate both censored and uncensored data when modeling the relation-
ship between predictors and responses (Despa). Typically, the response variables
will have at least both a survival time and censoring status associated with them.
From here, methods exist to estimate both survival and hazard functions that fa-
cilitate the interpretation of the distribution of survival times (Despa).
Survivor curves give the probability that the event of interest has not yet been
experienced by a particular time. Rigorously,
S(t) = P(T > t) = ∫_t^∞ f(τ) dτ = 1 − F(t),
where S(t) denotes the survivor function, t is a fixed time, T is a random variable,
f(τ) is the probability density function of T, and F(t) is the cumulative distribu-
tion function of T.
The hazard, on the other hand, is defined as the rate at which events happen
(Duerden). Thus, the probability of an event happening within a small time interval
is approximately this hazard rate multiplied by the length of the interval (Duerden).
Additionally, the hazard function describes the rate at which an observation
experiences the event of interest at a particular time, given that the observation
has already survived—that is, has not experienced the event of
interest—up to the specified time (Duerden). In precise terms, the hazard function is
defined as
h(t) = f(t) / S(t),

where f(t) denotes the probability density function and S(t) represents the
survival function given a random variable T. From this expression, it is imme-
diately possible to understand the intricate relationship between distribution, sur-
vival, and hazard functions. As a result, many other expressions exist aside from
this rather simplistic form.
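As a concrete example (the same form used for the simulated data later in this paper), if T follows an exponential distribution with rate λ, then f(t) = λe^(−λt), S(t) = e^(−λt), and h(t) = f(t)/S(t) = λ, so the hazard is constant in time.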
A natural thought that may arise within survival analysis is whether results
involving survivor curves or hazard functions are desired. In many contexts,
researchers prefer survivor curves in order to interpret the results of their
gathered data. Arguably, since these curves output a probability in response to an input
time, it becomes easier to comprehend trends and relationships than by doing
so via the hazard. Furthermore, hazard functions and hazard rates are based on ratios
of probability density functions and survival curves; this makes hazard results
more difficult to digest and understand.
Aside from these considerations, there is also another factor involved in sur-
vival analysis to cognize: the selection of methods that can be utilized to relate
predictor variables and the resulting survival times. The three main forms to
achieve this end include parametric, semiparametric, and nonparametric models
(Despa). These differ in the assumptions being made on the given data.
Parametric approaches make the prime assumption that the distribution of the
survival times follows a known probability distribution (Despa). For example,
these can include the exponential and compound exponential, Weibull, Gompertz-
Makeham, Rayleigh, gamma and generalized gamma, log-normal, log-logistic,
generalized F, and the Coale-McNeil models (Rodriguez, 2010). For these and
other applicable methods, model parameters are estimated via an adaptation of
maximum likelihood (Despa). In parametric techniques, fixed relationships are
imposed between f(t), F(t), S(t), and h(t) (Cook).
In contrast, a nonparametric model does not assert as many relatively bold
assumptions. For instance, linearity and a smooth regression function is not nec-
essary in a nonparametric context (Fox). Although this provides a researcher with
much more flexibility, interpretation can oftentimes become more difficult.
A semiparametric model makes weaker assumptions on the error attributed to the
regression model: the errors are taken to be uncorrelated and identically
distributed, but their distribution need not be fully specified. In addition, a model
of this form does not presume that the baseline hazard function has a particular
‘shape’ attributed to it. More generally, when a model combines both parametric
and nonparametric assumptions, it is appropriately described as being
semiparametric in nature.
These three types of regression models are rigorously represented below. Let
n denote the number of observations, Y represent the response variable, X signify
the matrix of predictors, and let β be the regression coefficients with errors εi.
Additionally, let m(·) = E(yi | xi) for i = 1, . . . , n.
A parametric model can be expressed as
yi = xiᵀβ + εi, i = 1, . . . , n.
In this case, the resulting curve is smooth and known. Furthermore, it is described
by a finite set of parameters which will need to be estimated. Ultimately, interpre-
tation is simple through this approach.
Then, for a nonparametric method,
yi = m(xi) + εi, i = 1, . . . , n.
Here, function m(·) is also smooth and flexible, yet it is now unknown. Further-
more, the interpretation of such a curve becomes ambiguous.
Lastly, in the case where a model is classified as semiparametric, we observe
that
yi = xiᵀβ + mz(zi) + εi, i = 1, . . . , n.
As previously mentioned, some parameters are necessarily estimated while some
will be determined through the given data.
3 Methods
The main methods employed in this investigation were centered on different ways
of performing dimension reduction. These methods were: Principal Component
Analysis (PCA), Partial Least Squares (PLS), and a set of three distinct Random
Matrices (RM). For each method, the AFT model was employed primarily to gen-
erate survivor curve estimates. These methods will be discussed in greater detail
here.
3.1 Dimension Reduction
The central goal of the three aforementioned dimension reduction techniques is to
reduce a dataset with n observations and p covariates to a new dataset of dimen-
sions n × k such that k ≪ p. Additionally, a competent method will achieve this
end while retaining an acceptable amount of relevant data and omitting relatively
collinear variables.
Both PCA and PLS reduce dimensionality through orthogonal transformations
of covariates; then, a subset of these is retained such that these new covariates pre-
dict the response with a satisfactory caliber of precision. Meanwhile, RM differs
from these two procedures by generating a matrix with certain qualities that also
reduces dimensionality.
To facilitate the explanation of these reduction techniques, pertinent notation
will first be introduced.
3.1.1 Notation
Let X be the n × p column-centered matrix such that n and p denote given obser-
vations and covariates, respectively. Also, let n ≪ p. Furthermore, let Y be the
n × q matrix of observed responses.
In the microarray gene dataset example, n would represent the number of pa-
tients while p would denote the amount of observed genes attributed to them.
Thus, X would be a matrix that contains particular patients on the rows and their
respective genes on the columns. Additionally, Y would serve as an n × 1 vector
of survival times.
3.1.2 Principal Component Analysis
PCA reduces dataset dimensionality through orthogonal components obtained by
maximizing the variance between linear combinations of the original predictors
contained in X. More precisely, k weight vectors or ‘loadings’ w are constructed
such that the rows of X map to principal component scores t; for the n-th observation
and k-th loading vector, tnk = xn · wk.
Ultimately, X can be completely decomposed into its components as follows:
T = XW.
Here, X has original dimensions n×p, W has dimensions p×p, and T, therefore,
has dimensions n × p as expected. Additionally, the columns of W contain the
eigenvectors of XᵀX.
From here, a desired amount of the resulting orthogonal components is cho-
sen. These are then referred to as ‘principal components’ since they are chosen in
order to maximize the variability along each direction of the new and reduced set
of axes. What this transformation accomplishes, in other words, is that it projects
the original data cloud into a new coordinate system via rotations of the initial
coordinate system such that variability of the initial data is maximized along each
direction. Additionally, PCs are ranked according to how much variance they
account for in their respective directions. That is, the PCs with the largest eigen-
values are ranked the highest and represent a sizable portion of the data since
variability is greatest along its eigenvector’s direction.
It is imperative to note that the chosen PCs obtained from PCA rely on op-
erations performed on X, the given dataset matrix. Thus, the response variable
Y is not taken into account during this particular dimension reduction algorithm.
Consequently, these PCs may not be laudable predictors of the response variable
in a given context. Due to this property of PCA, it is often referred to as an ‘un-
supervised’ technique.
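As a minimal illustration of this unsupervised reduction (a sketch only; the simulation study later in this paper uses the PCA function from FactoMineR rather than the base-R routine shown here, and the toy matrix below is purely illustrative), the first k components can be obtained and used to project the data as follows:

# Toy PCA sketch: project an n x p matrix onto its first k principal components.
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)        # illustrative data, n = 100, p = 50
k <- 5                                                      # number of components retained
pca <- prcomp(X, center = TRUE, scale. = FALSE)             # loadings are eigenvectors of the covariance matrix
W <- pca$rotation[, 1:k]                                    # p x k weight (loading) matrix
T_scores <- scale(X, center = TRUE, scale = FALSE) %*% W    # n x k matrix of component scores
dim(T_scores)                                               # 100 x 5: reduced-dimension dataset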
3.1.3 Partial Least Squares
Whereas PCA reduces dimensionality through X, the method of PLS does so
through a consideration of both independent and dependent variables X and Y.
Thus, this approach is often referred to as being ‘supervised’.
This regression model is especially useful when there is either high collinear-
ity among predictors or when the number of predictor variables is much greater
than the amount of observations. In these situations, ordinary least-squares re-
gression would either perform poorly or fail entirely; it would also fail if Y was
not one-dimensional—id est, if there were more than one observed response.
PLS extracts factors from both X and Y so that the covariance between these
factors is maximized. In particular, PLS is largely based on the singular value de-
composition of XᵀY. Recall that PLS does not require Y to be one-dimensional;
an advantage of the PLS procedure is that Y can contain as many observed re-
sponses as are deemed necessary and practical by researchers.
The method of PLS decomposes both X and Y so that
X = TPᵀ + E and Y = UQᵀ + F.
Here, T is a matrix of ‘X-scores’, P is a matrix of ‘X-loadings’, and E is a matrix
of error for X. Similarly, U, Q, and F represent ‘Y-scores’, ‘Y-loadings’, and Y
error, respectively. Both X- and Y-scores are defined as being linear combinations
of the predictor and response variables, respectively. Then, X- and Y-loadings are
linear coefficients that form a bridge from X to T and from Y to U. A common
assumption about E and F is that they are random variables with independent
and identical distributions. This decomposition of X and Y is done in hopes of
maximizing the covariance between T and U.
The PLS algorithm is an iterative procedure. First, two sets of weights must
be constructed as linear combinations of the columns of both X and Y. These
will be denoted by w and c, respectively. The goal here is to have their covariance
be maximal. Recall that matrices T and U denote, accordingly, X- and Y-scores.
Then, the next step in the PLS approach is to obtain a first pair of vectors t = Xw
and u = Yc such that wᵀw = 1, tᵀt = 1, and tᵀu is maximized. After these
first so-called ‘latent vectors’ have been obtained, they are subtracted from both
X and Y. This procedure is then repeated, thereby eventually reducing X to a
zero matrix.
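A brief sketch of this supervised reduction is given below; it mirrors the plsreg1 call used in the simulations (Section 4.1), and the toy predictor matrix and response are purely illustrative:

# Toy PLS sketch: extract k components that maximize covariance with the response.
library(plsdepot)
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)   # illustrative predictors, n = 100, p = 50
y <- X %*% rnorm(50) + rnorm(100)                     # illustrative univariate response
k <- 5
fit <- plsreg1(scale(X), y, comps = k, crosval = FALSE)   # PLS1 regression with k components
T_scores <- scale(X) %*% fit$x.loads                      # n x k matrix of X-scores (reduced dataset)
dim(T_scores)                                             # 100 x 5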
3.1.4 Random Matrices
Whereas the previously discussed methods of PCA and PLS reduce dimension-
ality through a careful analysis of X and Y, the third technique of constructing
random matrices, as the name implies, is considerably cavalier by comparison. In
essence, a random matrix with a particular set of qualities is fabricated. Then,
this matrix is applied to a given dataset—matrix X in this particular investigation.
According to the lemma attributed to Johnson and Lindenstrauss, if two
observations in X are considered as multidimensional points with an initial
squared distance between them, then once one of these random matrices is
applied to X, their initial distance is not distorted by too much.
approaches utilized in PCA and PLS, random matrices can reduce dimensionality
without losing much information in the process. First, the Johnson-Lindenstrauss
(JL) Lemma will be presented as well as a description of the three particular ran-
dom matrices that were constructed in this research. The constraint on k was
utilized according to Dasgupta-Gupta.
The Johnson-Lindenstrauss Lemma. For any ε ∈ (0, 1) and any positive integer n, let k be a positive integer such that

k ≥ 4 ln(n) / (ε²/2 − ε³/3).

Then, for any set S of n points in Rᵈ, there exists a mapping f : Rᵈ → Rᵏ such that, for all points u, v ∈ S,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
In terms of this investigation, n also represents the number of observations
while ε denotes the error tolerance. Finally, k can be thought of as the resulting
dimension in this given context after applying a random matrix to the dataset ma-
trix X.
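For reference, the lower bound on k implied by the lemma can be computed directly from n and ε; the small helper below is only an illustration of this constraint:

# Minimum reduced dimension k allowed by the Johnson-Lindenstrauss Lemma
# for n points and error tolerance epsilon in (0, 1).
jl_min_k <- function(n, epsilon) {
  ceiling(4 * log(n) / (epsilon^2 / 2 - epsilon^3 / 3))
}
jl_min_k(n = 100, epsilon = 0.65)   # bound for 100 observations at a loose tolerance
jl_min_k(n = 100, epsilon = 0.10)   # a tight tolerance demands a much larger k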
An immediate complication of these so-called ‘JL-embeddings’ is that we may
sometimes observe that k ≥ d as a result of strictly following the hypotheses of
the lemma. Id est, by employing the results of this theorem, a researcher would
be taking data from a smaller dimension and transforming it so that the data exists
in a higher dimension. Ultimately, the JL Lemma may not reduce dimensionality
at all, thus rendering it impractical for the desired purposes of this text. Thus, it
became imperative in this research to observe the effects of ignoring the restraints
on k of the JL Lemma and deducing whether or not desirable results are obtained
nonetheless. Having understood the motivation behind random matrices and these
precise limitations, now an explanation of the three random matrices themselves
is in order.
The first two random matrices were fabricated according to the previous re-
sults of Achlioptas while the third was constructed by following the specifications
of Dasgupta-Gupta. Let Γ1, Γ2, and Γ3 accordingly denote these random ma-
trices. To keep consistent with the previous notation, recall that X is an n × p
predictor matrix of observations on the rows and covariates on the columns. It
follows that Γ1, Γ2, and Γ3 are p × k matrices. Once X is multiplied by one of them,
the resulting matrix Ω will have dimensions n × k, where the goal is to have n > k.
Entries of Γ1 were produced from the following distribution:

(1/√k) × { −1 with probability 1/2, +1 with probability 1/2 }.

For Γ2, its entries were obtained from

√(3/k) × { −1 with probability 1/6, 0 with probability 4/6, +1 with probability 1/6 }.
Finally, Γ3 is a Gaussian random matrix generated from N(0, 1). The resulting
rows of Γ3 are then normalized.
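To make the construction concrete, a vectorized sketch of the first (Rademacher-type) matrix and the resulting projection is shown below; the dimensions mirror the simulation setting of Section 4.1, and the data matrix X here is only a stand-in:

# Toy random-matrix projection: reduce an n x p dataset to n x k with Gamma_1.
set.seed(1)
n <- 100; p <- 1000; k <- 37
X <- matrix(rexp(n * p), nrow = n, ncol = p)   # stand-in for the dataset matrix
Gamma1 <- matrix(sample(c(-1, 1), p * k, replace = TRUE), nrow = p, ncol = k) / sqrt(k)
# entries are (1/sqrt(k)) * (+1 or -1), each with probability 1/2
Omega <- X %*% Gamma1                          # n x k reduced dataset
dim(Omega)                                     # 100 x 37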
3.2 The Accelerated Failure Time Model
The previously described techniques were employed in order to reduce dimensionality.
After successfully achieving this end, it was necessary to generate
a survival curve based on the modified data and compare it with the true survival
curve. In this investigation, the AFT model was the vehicle to generate estimates
of the survivor curves.
The AFT model is seldom utilized compared to the celebrated Cox Propor-
tional Hazards (PH) model for various reasons. One reason to adopt the AFT
approach in this investigation is due to the simplified interpretation it provides re-
searchers of the data. This approach presents an interpretation of the relationship
between observation covariates and given responses in terms of survivor curves.
The Cox PH model, on the other hand, does so through hazard functions and haz-
ard ratios that, while equally profound, are not as visually simple to comprehend
as the AFT model's survivorship presentation. In simple terms, the hazard is the
instantaneous event rate at a particular time. It is arguably more straightforward
to understand results in terms of the probability that an individual ‘survives’,
that is, does not experience the event of interest, beyond a particular time. Thus,
this first reason to employ the AFT model in this text is a matter of user preference
and ease of interpretation of results. Another, more technical, reason to employ the
AFT model is that it directly models the given survival times, a luxury that the
Cox PH model does not afford.
In this investigation, AFT was implemented according to the following underlying model:

ln(Ti) = µ + ziβ + ei.

Here, i indexes a particular observation from a set of n observations, and Ti denotes
the survival time for the i-th observation. Meanwhile, µ designates the given
theoretical mean, zi is the vector of covariates for the i-th observation, and β is the
vector of covariate/regression coefficients. Finally, ei is the given error for the i-th
observation.
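A toy illustration of this formulation is given below; note that this is not the exact data-generating scheme used in Section 4.1 (which instead draws exponential survival times with rate λi = e^(−xiβ)), only a direct simulation of the log-linear model above with illustrative dimensions:

# Toy AFT simulation: ln(T_i) = mu + z_i beta + e_i.
set.seed(1)
n <- 100; p <- 10
Z <- matrix(rnorm(n * p), nrow = n, ncol = p)   # covariate vectors z_i
beta <- rnorm(p, sd = 0.1)                      # regression coefficients
mu <- 0                                         # theoretical mean
e <- rnorm(n)                                   # error terms
logT <- mu + Z %*% beta + e                     # log survival times
T_surv <- exp(logT)                             # survival times T_i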
4 Method Assessments
This research utilized a programming environment to simulate datasets that would
undergo reduction procedures via PCA, PLS, and the RM variants. Additionally,
feeding these data into the AFT model to obtain and compare the pairs of
survival curves was likewise accomplished through statistical software. This sec-
tion will address specifically how the research was performed.
4.1 Simulated Datasets
In order to compare the dimension reduction techniques, R statistical software is
implemented to simulate data. The β regression coefficients, observations, covariates,
and survival times are simulated using the previously discussed AFT formula, where
the theoretical mean µ is set to 0 for simplicity. The dimensionality of the data matrix
X is 100 observations by 1000 covariates. A vector of 1000 β regression coefficients
relating to the 1000 covariates is obtained by generating random values from
U(−1 × 10⁻⁷, 1 × 10⁻⁷). A vector µj of random values is generated from a N(0, 1)
distribution for j = 1, . . . , p, where p represents the number of covariates. β and µ
remain fixed for all simulations. Next, the 100 × 1000 matrix X of the 1000 covariates
and 100 observations is generated with xij = e^(zij), where zij ∼ N(µj, 1) for
j = 1, . . . , p and i = 1, . . . , n, and n is the number of observations; the data are
therefore log-normally distributed. The survival times Ti are constructed from an
exponential distribution with rate λi = e^(−xiβ) for i = 1, . . . , n.

Now that all the data are generated, the data matrix is column-centered about its
column means. PCA is applied to the centered matrix using the function PCA from the
package FactoMineR (Husson et al., 2015) to obtain 99 principal components. After
this procedure is completed, the principal components are narrowed down to 37, which
account for 50% of the total variance of the model. PCA outputs a weight matrix of
dimension 1000 × 37, which represents the weights given to each covariate by the 37
principal components. The data matrix X is multiplied by this weight matrix to obtain
a reduced-dimension matrix of 100 × 37. A Surv object is created, which takes the
survival times, the censoring type, and an indicator vector for censoring status, and
outputs a response matrix. The Ti vector and the 37 principal components are fed into
the AFT model in R using the package aftgee (Chiou et al., 2015) to obtain the 37
estimated β coefficients for the components; the weight matrix is then multiplied by
these estimates to recover the 1000 β estimates for the original covariates.

In order to acquire an estimated lambda value for the estimated survival function, the
mean of the exponentiated (negative) product of the centered data matrix and the β
estimates is taken. The estimated survival function is then Ŝ0(t) = e^(−λ̂t), where λ̂
is the estimated mean lambda value. This procedure was repeated for PLS using the
same number of components as PCA, except using the function plsreg1 from the
package plsdepot (Sanchez, 2015) instead.
The matrices Γ1, Γ2, and Γ3 from Achlioptas and Dasgupta-Gupta are generated with
random entries that satisfy each author's probability specifications. An algorithm in R
is created to validate the dimension reduction ability of the Johnson-Lindenstrauss
Lemma for Γ1, Γ2, and Γ3. The algorithm takes two randomly picked observations u, v
from X and maps f : Rᵖ → Rᵏ, where k is the new reduced dimension. The
Johnson-Lindenstrauss Lemma is then tested using varying values of ε and k over
multiple simulations. It is shown that, as long as k and ε follow the constraints given
by Dasgupta and Gupta (2003), the Johnson-Lindenstrauss Lemma is satisfied 100% of
the time. The value of ε is varied until a 1000 × 37 projection matrix satisfying the
Johnson-Lindenstrauss Lemma is obtained. Unfortunately, a fairly high value of ε,
approximately 0.65, is required to satisfy the lemma. Therefore, either a high ε value is
used or the lemma is not followed.
In order to compare random matrices to PCA and PLS, X is multiplied by
Γ1, Γ2, and Γ3, each of dimension 1000 × 37, to obtain resulting reduced matrices of
dimension 100 × 37. Then, the reduced matrices are fed into the AFT model and all
the same steps as PCA and PLS are performed. Therefore, five different estimated
survival curves are produced, one each for PCA and PLS and three for the three
random matrices.
The true survival curve is S0(t) = e^(−λ̄t), where λ̄ is the mean of the λi values,
which are created by exponentiating the negative product of the centered data matrix
and the true β coefficients. The y-axis of the survival curve is partitioned into 20
equally spaced values from 0.025 to 0.975, and the corresponding ti values are found
along the x-axis. The bias and mean-squared error (MSE) are calculated at each of
these ti values to obtain the error distribution for each method. The bias is found by
calculating the pointwise difference between the true and estimated survival curves,
and the MSE is calculated from the squared difference. The bias and MSE at each ti
are summed over the 5000 simulations and the error distributions are compared for all
methods.
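A condensed sketch of this bookkeeping for a single simulated dataset is shown below; lambda_bar and lambda_hat are assumed to hold the true and estimated rate constants (the full, per-method version appears in the Appendix):

# Pointwise bias and squared error over the 20-point grid of survival probabilities.
lambda_bar <- 0.5; lambda_hat <- 0.6                 # illustrative values only
u <- seq(0.025, 0.975, by = 0.05)                    # 20 survival probabilities on the y-axis
t <- -log(u) / lambda_bar                            # corresponding time points on the x-axis
bias <- exp(-lambda_hat * t) - exp(-lambda_bar * t)  # estimated minus true survivor curve
mse  <- bias^2                                       # pointwise squared error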
5 Results
In the following sections, the error distribution plots for the dimension reduction
techniques are compared after 5000 simulations. PLS and PCA are compared to
each other, the random matrices are compared, and then all dimension reduction
techniques are compared. The goal is to minimize bias and MSE; therefore, the
dimension reduction technique closest to zero is the more efficient method. In
the bias plots, zero is at the top of the plot, and for MSE, the black horizontal
line at the bottom denotes zero. Notice that the plots differ least at the extremes of
the survival curve's domain, while the most variability is observed in the middle of the
interval.
5.1 Principal Component Analysis versus Partial Least Squares
From the plots above, it is shown that PCA outperforms PLS by a maximum
magnitude of approximately 0.07 for the bias and 0.03 for MSE.
5.2 Random Matrices
In the plots above, RM1 denotes Γ1, RM2 denotes Γ2, and RM3 denotes Γ3.
The results show that there is no significant difference in performance between
the three random matrices in terms of Bias and MSE.
5.3 All Methods
From both the Bias and MSE plots, it is evident that all three random matrices
outperform both PCA and PLS. The random matrices outperform PCA by a magnitude of
approximately 0.03 and PLS by approximately 0.10 in terms of bias, and by
approximately 0.015 and 0.045, respectively, in terms of MSE.
6 Discussion
We originally wanted to generate our β coefficients from a U(−0.2, 0.2) distribution,
but when we computed xiβ to obtain our λi values, we obtained very large values.
Recall our formula λi = e^(−xiβ): when xiβ is very large, the λi values become very
small, and within the numerical precision of R the survival function is estimated as 1,
creating a horizontal survival curve. Therefore, we had to reduce the β coefficients to
U(−1 × 10⁻⁷, 1 × 10⁻⁷) to obtain survival curves with realistic properties.

Before conducting our research, we investigated previous work in the field, such as the
two papers of Nguyen and Rojo (2009). According to their findings, PLS outperformed
PCA, which is the result we expected to observe as well; instead, we found that PCA
greatly outperformed PLS. We are not certain why our results differ from these works,
but we suspect that it is due to not incorporating censored data. In both papers of
Nguyen and Rojo, the methods were compared using censored data, which we did not
have time to incorporate into our research. Therefore, we suspect that PLS might
outperform PCA when censored data are used, whereas PCA outperforms PLS with
uncensored data.
In real-life studies, censored data can of course be a serious problem that needs to be
taken into account. We wanted to incorporate censored data in our investigation but
were unable to do so due to time constraints; this is something that we would like to
add in future investigations. We also wanted to apply our findings to real microarray
gene datasets, where there are a small number of patients with a specific type of
cancer and a large number of genes. We wanted to work with these datasets and apply
our dimension reduction techniques to obtain estimated survival curves, where the
event of interest is death and the survival curve models each patient's probability of
surviving beyond a given time Ti. Unfortunately, we were not able to work with these
real datasets, which is also something we would like to investigate at a future time.
7 Conclusion
The results of performing PLS, PCA, and the three Johnson-Lindenstrauss in-
spired matrices from Achlioptas and Dasgupta-Gupta on log-normally distributed,
uncensored data for estimating the survival curve under the AFT model show that
PCA outperforms PLS in terms of both bias and MSE. The three random matrices
do not show a significant difference between each other in terms of either bias or
MSE. Overall, the random matrices outperform both PCA and PLS for both bias
and MSE.
8 Acknowledgments
This research was supported by the National Security Agency through REU Grant
H98230 15-1-0048 to The University of Nevada at Reno, Javier Rojo PI. We
would like to greatly thank and acknowledge our advisor Dr. Javier Rojo, Nathan
Wiseman, and Kyle Bradford from the University of Nevada Reno for their sup-
port and generous contributions to our research.
9 References
Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.

Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26: 189-206, 1984.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572, 1901.

Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (ed.), Multivariate Analysis: 391-420, 1966.

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.

Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.

Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Mathematical Biosciences 193: 119-137, 2005.

Nguyen, D.V. and D.M. Rocke. On partial least squares dimension reduction for microarray-based classification: A simulation study. Computational Statistics & Data Analysis 46: 407-425, 2004.

Despa, S. What is survival analysis? StatNews 78: 1-2.

Ross, E. "Survival Analysis." 2012. PDF.

Allison, P.D. "Survival Analysis." 2013. PDF.

Lunn, M. "Definitions and Censoring." 2012. PDF.

Vermeylen, F. Censored data. StatNews 67: 1, 2005.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray gene expression data: The accelerated failure time model. Journal of Bioinformatics and Computational Biology 7(6): 939-954, 2009.

Nguyen, T.S. and J. Rojo. Dimension reduction of microarray data in the presence of a censored survival response: A simulation study. Statistical Applications in Genetics and Molecular Biology 8(1), 2009.

Sun, J. "Interval Censoring." 2011. PDF.

Duerden, M. "What Are Hazard Ratios?" 2012. PDF.

Rodriguez, G. "Parametric Survival Models." Princeton, 2010. PDF.

Cook, A. "Survival and Hazard Functions." 2008. PDF.

Fox, J. "Introduction to Nonparametric Methods." 2005. PDF.

Husson, F. et al. "Package 'FactoMineR'." 2015. PDF.

Sanchez, G. "Package 'plsdepot'." 2015. PDF.

Chiou, S. et al. "Package 'aftgee'." 2015. PDF.

Therneau, T. et al. "Package 'survival'." 2015. PDF.
10 Appendix
Herein, the R code utilized in this investigation is presented. Packages survival
(Therneau et al., 2015), FactoMineR (Husson et al., 2015), plsdepot (Sanchez, 2015),
and aftgee (Chiou et al., 2015) will need to be installed and loaded into R software
to successfully run the provided code.
10.1 Error Plots
Below is the code used to produce the six error plots for the five reduction
methods.
library(survival)
# We created a Surv object using function ’Surv’ from this
# package.
library(FactoMineR)
# We used the function ’PCA’ from this package.
library(plsdepot)
# We used ’plsreg1’ from this package.
library(aftgee)
# With this package, we were able to apply the AFT model to our
# simulated data using the function ’aftgee’.
sim <- function(s) # This function will produce ’s’
# simulations and output error plots.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
sum_PCA_BE_t <- matrix(0, 1, 20)
sum_PCA_MSE_t <- matrix(0, 1, 20)
sum_PLS_BE_t <- matrix(0, 1, 20)
sum_PLS_MSE_t <- matrix(0, 1, 20)
sum_RM1_BE_t <- matrix(0, 1, 20)
sum_RM1_MSE_t <- matrix(0, 1, 20)
sum_RM2_BE_t <- matrix(0, 1, 20)
sum_RM2_MSE_t <- matrix(0, 1, 20)
sum_RM3_BE_t <- matrix(0, 1, 20)
sum_RM3_MSE_t <- matrix(0, 1, 20)
# These will store the calculated bias and mean-squared
# error across 20 selected points after we have run ’s’
# simulations.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
# No problems at the start of this iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations
# on the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100) # Generating lambda values.
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
}
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100) # Survival times being generated.
{
T[i] <- rexp(1, rate=lambda[i])
}
RM1 <- matrix(0, 1000, 37)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(37)
RM2 <- matrix(0, 1000, 37)
# Random matrix two with
# ’-sqrt(3)’s, ’0’s, and ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:37)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)),
1, replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(37)
RM3 <- matrix(0, 1000, 37)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:37)
{
RM3[m,n] <- rnorm(1, 0, 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:37)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered ’z’ matrix for PCA.
z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37)
z_star_PLS <- plsreg1(scale(z, center = TRUE,
scale = TRUE), T, comps = 37,
crosval = FALSE)
# Applying PCA and PLS to the data.
z_double_star_PCA <- z_star %*% z_star_PCA$var$coord
z_double_star_PLS <- z_star %*% z_star_PLS$x.loads
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
delta <- matrix(0, nrow = 100, ncol = 1)
# An indicator matrix. Here, delta is a 100 by 1 matrix
# of zeros. The zeros are interpreted as meaning that the
# event of interest has definitively occurred. In other
# words, there is currently no censoring with ’delta’
# set up in this manner.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified
# as being right-censored.
data_AFT_fit_PCA <- aftgee(data_Surv ~ -1 +
z_double_star_PCA,
corstr = "independence", B = 0)
data_AFT_fit_PLS <- aftgee(data_Surv ~ -1 +
z_double_star_PLS,
corstr = "independence", B = 0)
data_AFT_fit_RM1 <- aftgee(data_Surv ~ -1 +
z_double_star_RM1,
corstr = "independence", B = 0)
data_AFT_fit_RM2 <- aftgee(data_Surv ~ -1 +
z_double_star_RM2,
corstr = "independence", B = 0)
data_AFT_fit_RM3 <- aftgee(data_Surv ~ -1 +
z_double_star_RM3,
corstr = "independence", B = 0)
beta_hat_star_PCA <- data_AFT_fit_PCA$coefficients
beta_hat_star_PLS <- data_AFT_fit_PLS$coefficients
beta_hat_star_RM1 <- data_AFT_fit_RM1$coefficients
beta_hat_star_RM2 <- data_AFT_fit_RM2$coefficients
beta_hat_star_RM3 <- data_AFT_fit_RM3$coefficients
# The full beta/regression coefficients.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ will go here.
for (i in 1:1000) # Averaging ’z’s columns.
{
z_bar_star[1, i] <- mean(z[, i])
}
beta_hat_z_PCA <- z_star_PCA$var$coord %*%
beta_hat_star_PCA
beta_hat_z_PLS <- z_star_PLS$x.loads %*%
beta_hat_star_PLS
beta_hat_z_RM1 <- RM1 %*%
beta_hat_star_RM1
beta_hat_z_RM2 <- RM2 %*%
beta_hat_star_RM2
beta_hat_z_RM3 <- RM3 %*%
beta_hat_star_RM3
# The final beta estimates for each technique.
lambda_hat_PCA <- mean(exp(-z %*% beta_hat_z_PCA))
lambda_hat_PLS <- mean(exp(-z %*% beta_hat_z_PLS))
lambda_hat_RM1 <- mean(exp(-z %*% beta_hat_z_RM1))
lambda_hat_RM2 <- mean(exp(-z %*% beta_hat_z_RM2))
lambda_hat_RM3 <- mean(exp(-z %*% beta_hat_z_RM3))
# Generating the lambda constant from each technique
# employed.
lambda_bar = mean(lambda) # Taking the average of all
# ’lambda’ values and storing it in ’lambda_bar’.
S <- function(t) # The true survivor function.
{
exp(-t * lambda_bar)
}
S_hat_naught_PCA <- function(t)
# The predicted survivor function through PCA.
{
exp(-t * lambda_hat_PCA)
}
S_hat_naught_PLS <- function(t)
# The predicted survivor function through PLS.
{
exp(-t * lambda_hat_PLS)
}
S_hat_naught_RM1 <- function(t)
# The predicted survivor function through RM1.
{
exp(-t * lambda_hat_RM1)
}
S_hat_naught_RM2 <- function(t)
# The predicted survivor function through RM2.
{
exp(-t * lambda_hat_RM2)
}
S_hat_naught_RM3 <- function(t)
# The predicted survivor function through RM3.
{
exp(-t * lambda_hat_RM3)
}
u <- c(seq(0.025, 0.975, 0.05))
# Desired outputs ’u’ that range from 0.025 to 0.975
# and are spaced out by 0.05, resulting in 20 points.
t <- (-1/lambda_bar) * log(u) # Input times ’t’ from the
# respective ’u’s. There are 20 generated times ’t’ in
# this vector.
for (i in 1:20)
# Storing bias across the 20 point pairs in PCA.
{
sum_PCA_BE_t[i] <- sum_PCA_BE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PCA.
{
sum_PCA_MSE_t[i] <- sum_PCA_MSE_t[i] +
(S_hat_naught_PCA(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in PLS.
{
sum_PLS_BE_t[i] <- sum_PLS_BE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in PLS.
{
sum_PLS_MSE_t[i] <- sum_PLS_MSE_t[i] +
(S_hat_naught_PLS(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM1.
{
sum_RM1_BE_t[i] <- sum_RM1_BE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM1.
{
sum_RM1_MSE_t[i] <- sum_RM1_MSE_t[i] +
(S_hat_naught_RM1(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM2
{
sum_RM2_BE_t[i] <- sum_RM2_BE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM2.
{
sum_RM2_MSE_t[i] <- sum_RM2_MSE_t[i] +
(S_hat_naught_RM2(t[i]) - S(t[i])) ^ 2
}
for (i in 1:20)
# Storing bias across the 20 point pairs in RM3.
{
sum_RM3_BE_t[i] <- sum_RM3_BE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i]))
}
for (i in 1:20)
# Storing mean-squared error across the 20 point pairs
# in RM3.
{
sum_RM3_MSE_t[i] <- sum_RM3_MSE_t[i] +
(S_hat_naught_RM3(t[i]) - S(t[i])) ^ 2
}
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
ymin_PCA_BE <- min(sum_PCA_BE_t)
ymin_PLS_BE <- min(sum_PLS_BE_t)
ymin_RM1_BE <- min(sum_RM1_BE_t)
ymin_RM2_BE <- min(sum_RM2_BE_t)
ymin_RM3_BE <- min(sum_RM3_BE_t)
# Finding the minimum bias per each technique after
# ’s’ simulations.
ymax_PCA_BE <- max(sum_PCA_BE_t)
ymax_PLS_BE <- max(sum_PLS_BE_t)
ymax_RM1_BE <- max(sum_RM1_BE_t)
ymax_RM2_BE <- max(sum_RM2_BE_t)
ymax_RM3_BE <- max(sum_RM3_BE_t)
# Finding the maximum bias per each technique after
# ’s’ simulations.
ymin_BE <- min(ymin_PCA_BE, ymin_PLS_BE, ymin_RM1_BE,
ymin_RM2_BE, ymin_RM3_BE) / s
ymax_BE <- max(ymax_PCA_BE, ymax_PLS_BE, ymax_RM1_BE,
ymax_RM2_BE, ymax_RM3_BE) / s
# Finding the minimum and maximum bias across all five
# techniques after ’s’ simulations. These will serve as
# the lower and upper range of the y-axis in the final plot.
ymin_PCA_PLS_BE <- min(ymin_PCA_BE, ymin_PLS_BE) / s
ymax_PCA_PLS_BE <- max(ymax_PCA_BE, ymax_PLS_BE) / s
# Calculating the averaged minimum and maximum bias for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_BE <-
min(ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s
ymax_RM_BE <-
max(ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s
# Calculating the averaged minimum and maximum bias for the
# three RMs after ’s’ simulations for plotting purposes.
ymin_PCA_MSE <- min(sum_PCA_MSE_t)
ymin_PLS_MSE <- min(sum_PLS_MSE_t)
ymin_RM1_MSE <- min(sum_RM1_MSE_t)
ymin_RM2_MSE <- min(sum_RM2_MSE_t)
ymin_RM3_MSE <- min(sum_RM3_MSE_t)
# Finding the minimum mean-squared error per each technique
# after ’s’ simulations.
ymax_PCA_MSE <- max(sum_PCA_MSE_t)
ymax_PLS_MSE <- max(sum_PLS_MSE_t)
ymax_RM1_MSE <- max(sum_RM1_MSE_t)
ymax_RM2_MSE <- max(sum_RM2_MSE_t)
ymax_RM3_MSE <- max(sum_RM3_MSE_t)
# Finding the maximum mean-squared error per each technique
# after ’s’ simulations.
ymin_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE, ymin_RM1_MSE,
ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE, ymax_RM1_MSE,
ymax_RM2_MSE, ymax_RM3_MSE) / s
# Finding the minimum and maximum mean-squared error across
# all techniques. These will serve as the lower and upper
# range of the y-axis in the final plot.
ymin_PCA_PLS_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE) / s
ymax_PCA_PLS_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE) / s
# Calculating the averaged minimum and maximum MSE for PCA
# and PLS after ’s’ simulations for plotting purposes.
ymin_RM_MSE <-
min(ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s
ymax_RM_MSE <-
max(ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s
# Calculating the averaged minimum and maximum MSE for the
# three RMs after ’s’ simulations for plotting purposes.
# Start of bias plot for PCA and PLS.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: PCA and PLS \n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_PCA_PLS_BE,
ymax_PCA_PLS_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of bias plot for PCA and PLS.
# Start of the mean-squared error plot for PCA and PLS.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: PCA and PLS \n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_PCA_PLS_MSE,
ymax_PCA_PLS_MSE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "grey")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS"), pch = c(15, 15),
col = c("black", "grey"))
# End of mean-squared error plot for PCA and PLS.
# Start of the bias plot for the random matrices.
plot(t, (sum_RM1_BE_t) / s, pch = 15,
main = paste("Bias: Random Matrices \n", s,
"Total Simulations"),
xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_RM_BE,
ymax_RM_BE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of bias plot for the random matrices.
# Start of the mean-squared error plot for the
# random matrices.
plot(t, (sum_RM1_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: Random Matrices \n",
s, "Total Simulations"),
xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_RM_MSE,
ymax_RM_MSE),
xlim = c(0, max(t)),
col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("RM1", "RM2", "RM3"),
pch = c(15, 15, 15),
col = c("darkblue", "red", "gold"))
# End of mean-squared error plot for the random matrices.
# Start of bias plot for all methods.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
main = paste("Bias: All Techniques \n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average Bias", ylim = c(ymin_BE, ymax_BE),
xlim = c(0, max(t)),
col = "black")
points(t, (sum_PLS_BE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_BE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of bias plot for all methods.
# Start of mean-squared error plot for all methods.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
main = paste("Mean-Squared Error: All Techniques \n",
s, "Total Simulations"), xlab = "Time",
ylab = "Average MSE", ylim = c(ymin_MSE, ymax_MSE),
xlim = c(0, max(t)), col = "black")
points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "gray")
points(t, (sum_RM1_MSE_t) / s, pch = 15, col = "darkblue")
points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")
points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")
par(new = TRUE)
abline(0, 0, h = 0)
par(new = TRUE)
legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),
pch = c(15, 15, 15, 15, 15),
col = c("black", "gray", "darkblue", "red", "gold"))
# End of mean-squared error plot for all methods.
t2 <- Sys.time() # End time.
total_time <- t2 - t1 # Difference between start and end
# times.
print(total_time) # Printing total time to run simulations
# and obtain the plots.
}
10.2 Johnson-Lindenstrauss Testing
Below is the code used for testing the Johnson-Lindenstrauss Lemma by varying
k and ε.
good_points_RM1 <- 0
good_points_RM2 <- 0
good_points_RM3 <- 0
# Good points counter for each random matrix.
# Points are considered ’good’ if they satisfy
# the Johnson-Lindenstrauss Lemma.
sim <- function(s, k, epsilon)
# This function takes in ’s’ simulations, a reduced dimension ’k’, and
# a desired ’epsilon’. It returns the number of times the
# Johnson-Lindenstrauss Lemma was satisfied based on the
# three random matrices.
{
t1 <- Sys.time() # Initial time.
num <- 1 # Initial counter.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000)
# A location for the dataset information.
while(num <= s)
# Running the entire code for a ’s’ iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations on
# the rows and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
u_v_rows <- sample(1:100, 2, replace = FALSE)
obs_u_old <- z[u_v_rows[1], ]
obs_v_old <- z[u_v_rows[2], ]
# We’ve selected two different rows from the dataset
# matrix ’z’ and stored them as new variables. Here,
# observations ’u’ and ’v’ can be thought of as
# 1,000-dimensional points.
dist_old <- sum((obs_u_old - obs_v_old) ^ 2)
# Here, the distance has been calculated between
# observations ’u’ and ’v’.
RM1 <- matrix(0, 1000, k)
# Random matrix one with ’-1’s and ’+1’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
prob = c(1/2, 1/2))
}
}
RM1 <- RM1 / sqrt(k)
RM2 <- matrix(0, 1000, k)
# Random matrix two with ’-sqrt(3)’s, ’0’s, and
# ’+sqrt(3)’s.
for (m in 1:1000)
{
for (n in 1:k)
{
RM2[m,n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1,
replace = TRUE,
prob = c(1/6, 4/6, 1/6))
}
}
RM2 <- RM2 / sqrt(k)
RM3 <- matrix(0, 1000, k)
# Random matrix three generated under a Gaussian
# distribution.
for (m in 1:1000)
{
for (n in 1:k)
{
RM3[m,n] <- rnorm(1, mean = 0, sd = 1)
}
}
RM3_norm <- matrix(0, 1000, 1)
for (p in 1:1000)
{
RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))
}
for (m in 1:1000)
{
for(n in 1:k)
{
RM3[m,n] <- RM3[m,n] / RM3_norm[m, ]
}
}
z_star <- scale(z, center=TRUE, scale=FALSE)
# Column-centered ’z’ matrix.
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.
obs_u_new_RM1 <- z_double_star_RM1[u_v_rows[1], ]
obs_v_new_RM1 <- z_double_star_RM1[u_v_rows[2], ]
obs_u_new_RM2 <- z_double_star_RM2[u_v_rows[1], ]
obs_v_new_RM2 <- z_double_star_RM2[u_v_rows[2], ]
obs_u_new_RM3 <- z_double_star_RM3[u_v_rows[1], ]
obs_v_new_RM3 <- z_double_star_RM3[u_v_rows[2], ]
# After reducing dimensions, points ’u’ and ’v’ now have
# new coordinates. Since there were three random
# matrices, there are three new ’u’ and ’v’ points.
dist_new_RM1 <- sum((obs_u_new_RM1 - obs_v_new_RM1) ^ 2)
dist_new_RM2 <- sum((obs_u_new_RM2 - obs_v_new_RM2) ^ 2)
dist_new_RM3 <- sum((obs_u_new_RM3 - obs_v_new_RM3) ^ 2)
# Calculating the new distance between the transformed
# points ’u’ and ’v’ for each generated random matrix.
if((1 - epsilon) * (dist_old) <= dist_new_RM1
&& dist_new_RM1 <= (1 + epsilon) * (dist_old))
{
good_points_RM1 <- good_points_RM1 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM2
&& dist_new_RM2 <= (1 + epsilon) * (dist_old))
{
good_points_RM2 <- good_points_RM2 + 1
}
if((1 - epsilon) * (dist_old) <= dist_new_RM3
&& dist_new_RM3 <= (1 + epsilon) * (dist_old))
{
good_points_RM3 <- good_points_RM3 + 1
}
# The preceding three ’if’ statements check to see if
# the Johnson-Lindenstrauss Lemma was satisfied in this
# iteration for each different random matrix.
print(paste("Simulation", num, "Complete."))
num <- num + 1
}
print(paste("For an epsilon of", epsilon, ", k is", k,
"."))
print(paste("Number of times JL was satisfied, RM1:",
good_points_RM1, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM2:",
good_points_RM2, "out of", s, "simulations."))
print(paste("Number of times JL was satisfied, RM3:",
good_points_RM3, "out of", s, "simulations."))
t2 <- Sys.time() # End time.
total_time <- t2 - t1
# Difference between start and end times.
print(total_time) # Printing total time to run simulations
# and obtain the plots.
}
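As a side note, the three random matrices above can also be generated without
nested loops. The sketch below is not part of the original simulation: the
names 'p', 'k_illus', 'RM1_alt', 'RM2_alt', and 'RM3_alt' and the value of
'k_illus' are illustrative only, but the entries follow the same three
distributions used above.
p <- 1000 # Number of covariates, as in the simulation above.
k_illus <- 50 # An arbitrary reduced dimension, for illustration only.
RM1_alt <- matrix(sample(c(-1, 1), p * k_illus, replace = TRUE),
p, k_illus) / sqrt(k_illus)
RM2_alt <- matrix(sample(c(-sqrt(3), 0, sqrt(3)), p * k_illus,
replace = TRUE, prob = c(1/6, 4/6, 1/6)),
p, k_illus) / sqrt(k_illus)
RM3_alt <- matrix(rnorm(p * k_illus), p, k_illus)
RM3_alt <- RM3_alt / sqrt(rowSums(RM3_alt ^ 2))
# Normalizing each row of 'RM3_alt' to unit length.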
10.3 Survival Curves
Below is the code used to generate the true survival curve and the estimated
survival curve under PCA.
library(survival)
library(FactoMineR)
sim <- function(s) # Making a function that takes in a
# simulation count ’s’.
{
options(digits = 22) # Preserving more significant digits, in
# the hope of reducing algorithm failures.
results <- matrix(0, s, 2) # A matrix with BE on column 1
# and MSE on column 2.
BE_T <- 0 # Initial total BE count.
MSE_T <- 0 # Initial total MSE count.
sum_BE_t <- matrix(0, 1, 20) # Matrix of BE at time ’t’.
sum_MSE_t <- matrix(0, 1, 20) # Matrix of MSE at time ’t’.
num <- 1 # Iteration counter.
sum_BE_t1 <- 0 # Bias error at time ’t1’.
sum_MSE_t1 <- 0 # Mean-squared error at time ’t1’.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000) # A location for the dataset
# information.
while(num <= s) # Running the entire code for a specified
# amount of iterations.
{
problem <- FALSE # No problems at the start of this
# iteration.
for(i in 1:100)
{
for(j in 1:1000)
{
X[i, j] <- rnorm(1, mean = mu[j], sd = 1) # A matrix
# of random data containing observations on the rows
# and covariates on the columns.
}
}
z <- exp(X) # All entries of matrix ’X’ have been
# exponentiated and stored in ’z’, which has dimensions
# 100 by 1,000.
lambda <- matrix(0, 100, 1) # Rate values.
for(i in 1:100)
{
lambda[i] <- exp(t(-z[i,]) %*% as.matrix(beta))
# Generating lambda values.
}
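# (Equivalently, the loop above can be collapsed into the single
# vectorized statement 'lambda <- exp(-z %*% as.matrix(beta))';
# this is noted only as an aside and is not executed here.)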
T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.
for(i in 1:100)
{
T[i] <- rexp(1,rate=lambda[i])
}
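# ('rexp()' accepts a vector of rates, so the loop above could
# also be written as 'T <- matrix(rexp(100, rate = lambda))';
# again, only an aside.)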
z_star <- scale(z, center=TRUE,scale=FALSE)
z_star_PCA <- PCA(z_star, graph=FALSE, ncp=37)
z_double_star <- z_star %*% z_star_PCA$var$coord
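# 'z_star' is the column-centered data (100 by 1,000), and
# 'z_star_PCA$var$coord' holds the coordinates of the 1,000
# variables on the first 37 components (1,000 by 37), so
# 'z_double_star' is the 100 by 37 reduced dataset.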
delta <- matrix(1, nrow = 100, ncol = 1) # An indicator
# matrix. Here, delta is a 100 by 1 matrix of ones. In
# 'Surv()', an event indicator of 1 means the event of
# interest was definitively observed, so with 'delta' set up
# in this manner there is no censoring in these data.
data_Surv <- Surv(time = T, event = delta,
type = c("right"))
# A Surv object that takes the survival times from ’T’,
# censoring information from ’delta’, and is specified as
# being right-censored.
data_AFT_fit <- NULL
data_AFT_fit <- tryCatch(survreg(data_Surv ~ -1 +
z_double_star,
dist = "lognormal",
survreg.control(maxiter=100000000)),
warning=function(c) {problem<<-TRUE})
if(!problem) # If survreg() produced no warning, the rest of
# this iteration is carried out.
{
beta_hat_star <- as.matrix(data_AFT_fit$coeff)
# Estimated coefficients in the reduced (37-dimensional) space.
z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of ’z’ go here.
for (i in 1:1000)
{
z_bar_star[1, i] <- mean(z[, i])
# Taking the average of each column of ’z’.
}
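# (Equivalently, 'z_bar_star <- matrix(colMeans(z), nrow = 1)'
# computes the same row vector of column means; noted only as
# an aside.)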
beta_hat_z <- matrix(0, 1000, 1)
# A location for the beta estimates mapped back onto the
# original 1,000 covariates.
beta_hat_z <- z_star_PCA$var$coord %*% beta_hat_star
# Beta estimates on the original covariate scale.
lambda_hat <- exp(-z_bar_star %*% beta_hat_z)
# Estimated rate constant for the predicted survivor function.
lambda_bar <- mean(lambda)
# Taking the average of all 'lambda' values and storing
# it in 'lambda_bar'.
S_hat_naught <- function(t)
# The predicted survivor function.
{
exp(-t * lambda_hat)
}
S <- function(t)
# The true survivor function.
{
exp(-t * lambda_bar)
}
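# Both functions have the exponential survivor form
# S(t) = exp(-lambda * t); 'S_hat_naught' plugs in the
# estimated rate 'lambda_hat', while 'S' uses the average
# true rate 'lambda_bar'.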
data_AFT_pred <- predict(data_AFT_fit, type = "terms",
se.fit = TRUE)
# Here, the predicted term contributions and their standard
# errors are extracted from the 'survreg' object
# 'data_AFT_fit'; with 'se.fit = TRUE' the result is a list.
surv_curv <- curve(S_hat_naught, from = 0, to = 7,
n = 1000, type="l",
xlab = "", ylab = "", xaxt = ’n’,
yaxt = ’n’, col = "99")
# Plotting the predicted survivor function.
par(new = TRUE)
curve(S, from = 0, to = 7, n = 1000, type = "l",
main = paste("Survivor Curves \n Simulation", num),
xlab = expression(italic(t)),
ylab = expression(S(italic(t))), col = "black")
u <- c(seq(0.025, 0.975, 0.05))
# Survival probabilities 'u' ranging from 0.025 to 0.975 in
# steps of 0.05, giving 20 points.
t <- (-1/lambda_bar) * log(u)
# Input times ’t’, generated from ’u’. There
# are 20 generated times ’t’ in this vector.
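# These times are obtained by inverting the true survivor
# curve: S(t) = u implies t = -log(u) / lambda_bar, so the 20
# times correspond to true survival probabilities 0.025
# through 0.975.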
print(paste("Simulation ", num, sep = ""))
num <- num + 1
}
else
{
# survreg() produced a warning in this iteration, so the
# iteration is discarded; 'num' is not incremented and the
# while loop simply draws a fresh dataset and tries again.
}
}
}
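The simulation is then run by calling the function with the desired number of
simulated datasets. As a minimal illustration (the count below is an example,
not the number of simulations used in the study):
sim(5) # Plots the true and estimated survivor curves and prints
# a progress message for each of five simulated datasets.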